From academia to corporate work, this recipe is for you
After years of doing scientific research, I ended up creating a recipe to start any research project. This recipe, which I am about to share with you, gives me a structured approach every day, reducing the risk of missing something and maximizing efficiency.
You will see that, as for the best chefs, mastering the basics is enough to come up with something great. This is also my approach here. You will see no fancy commands, long lines of code, or complicated packages to install. This methodology allowed me to publish my first three scientific papers in some of the most prestigious journals in their respective fields: PNAS, Environmental Research Letters, and Management Science (forthcoming).
Exploring data can be painful and boring, but it is also the starting point for answering any question empirically (in academia or in the “real” world). Hence, it’s absolutely key to do this part carefully.
Let me use a very concrete example from my daily life as a researcher to illustrate my method and see how we can shed light quickly on a question, while hopefully enjoying the process.
Case study: A colleague of mine approached me during the current heat wave (June 2022), and she was wondering if such an event would help to enforce environmental policies or not. Hence, I’ve decided to use this question to illustrate my standard approach to studying the relationship between two variables.
At each step I will explain what I observe and how it impacts my choices for the analysis. This is the approach I will jointly teach (with Boris Thurm and Edoardo Chiarotti) for the Master in Sustainable Management and Technology during the spring semester (by Enterprise 4 Society at EPFL/UNIL/IMD). I will regularly post similar analyses with different types of data (you can suggest what to explore next in the comments).
Disclaimer: This is the first step when I want a first glimpse at the data in one hour or so. Obviously, every part could be extended. Hence, I put some notes for a few things that I would explore in a second step.
The five steps for a good recipe:
- Selecting the ingredients for the recipe (how I select the variables)
- Picking the right quantity of each ingredient (how I select my sample)
- Tasting and preparing the ingredients (univariate analysis)
- Cooking the ingredients together (bivariate analysis)
- Tasting the new recipe (conclusion).
1. Variables selection
In this section, I think conceptually about the relationship between the elements and identify the key ingredients I will need.
Central aspect: Climate change is a global, long-lasting, relatively slow-paced phenomenon. Hence, reverse causality is unlikely and the effect of heat waves on climate-change laws is arguably causal (e.g. the behavior of France in 2020 is not expected to affect the average temperature the same year or the year after).
This first part helps me define the variables I will need to start my analysis:
- outcome: Environmental policies,
- explanatory variable: Average yearly temperature,
- additional explanatory variable: Heat waves might be aggravated if rainfall is low, hence I will also add average yearly rainfall.
This is the minimal ingredient list I would select: an outcome, an explanatory variable, and a third variable to explore the potential heterogeneity of the effect (here rainfall).
2. Sample selection
In this section, I will look at data availability to define a clear sample. I will base my initial analysis on the Quality of Government Environmental Indicators Dataset¹, which aggregates numerous important datasets with hundreds of environmental variables.
Here is a list of the variables I selected going through the codebook (link):
– cname: Country name
– year: Year
– oecd_eps: Environmental Policy Stringency Index² (from Botta and Kozluk (2014))
– cckp_temp: Annual average temperature in Celsius³ (from the Climate Change Knowledge Portal)
– cckp_rain: Annual average rainfall in mm³ (from the Climate Change Knowledge Portal)
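As a minimal sketch of this step, the snippet below keeps only the selected columns and counts the non-missing observations per variable. The data frame is a made-up stand-in for the real QoG file (its values and the `unused_var` column are assumptions for illustration), so the counts differ from those of the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the QoG file; every value below is made up for illustration.
raw = pd.DataFrame({
    "cname": ["France", "France", "Chad", "Chad"],
    "year": [2000, 2001, 2000, 2001],
    "oecd_eps": [1.5, 1.7, np.nan, np.nan],
    "cckp_temp": [11.2, 11.4, 27.0, 27.3],
    "cckp_rain": [850.0, 900.0, 320.0, np.nan],
    "unused_var": [1, 2, 3, 4],  # hypothetical extra column to be dropped
})

# Keep only the variables picked from the codebook.
selected = ["cname", "year", "oecd_eps", "cckp_temp", "cckp_rain"]
df = raw[selected]

# Number of non-missing observations per variable.
availability = df.notna().sum()
print(availability)

# The variable with the fewest observations limits the usable sample.
print("Bottleneck:", availability.idxmin())
```

On the real file, the same two lines (`notna().sum()` and `idxmin()`) would show which variable constrains the sample.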
This first table reveals that the variable oecd_eps is available only for 799 observations.
Based on the previous tables, I will restrict the sample to the countries with data for the outcome (oecd_eps) and to years from 1993 to 2012 to have a relatively constant sample size. The clear “bottleneck” in the table above is the outcome variable oecd_eps, because the index covers mainly OECD countries.
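The restriction described here could be sketched as follows. The toy data are invented, and filtering at the observation level (rather than first listing the countries that have any outcome data) is a simplifying assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Toy panel with made-up values: one country with outcome data, one without.
df = pd.DataFrame({
    "cname": ["France", "France", "France", "Chad", "Chad", "Chad"],
    "year": [1992, 1993, 2012, 1992, 1993, 2012],
    "oecd_eps": [np.nan, 1.0, 3.0, np.nan, np.nan, np.nan],
    "cckp_temp": [11.0, 11.2, 12.0, 27.0, 27.1, 27.6],
})

has_outcome = df["oecd_eps"].notna()         # outcome is observed
in_window = df["year"].between(1993, 2012)   # both bounds are inclusive
sample = df[has_outcome & in_window].reset_index(drop=True)

# The 'count' row of describe() shows the non-missing observations per variable.
print(sample.describe())
```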
From the ‘count’ line, we can see that the sample has no missing values.
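To see how many years each country contributes, a quick check along these lines can help. The country labels and years are placeholders, not the actual sample.

```python
import pandas as pd

# Placeholder panel: countries A and B have three years, C has only two.
sample = pd.DataFrame({
    "cname": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002, 2001, 2002],
})

# Number of distinct years observed for each country.
years_per_country = sample.groupby("cname")["year"].nunique()

# A balanced panel has every country observed in every year of the sample.
n_years = sample["year"].nunique()
incomplete = years_per_country[years_per_country < n_years]
is_balanced = incomplete.empty

print(years_per_country)
print("Countries with fewer years:", incomplete.index.tolist())
```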
The panel is almost perfectly balanced (same number of observations for each unit/country); only Brazil and Slovenia have fewer years.
The map above shows that the countries in the sample (mostly OECD members) cover every continent except Antarctica. Africa is represented only by South Africa, while South America has only Brazil. Note that the small yellow part above Brazil is an overseas department of France: French Guiana. It’s important to keep in mind that Europe represents a large part of the dataset, which may end up driving the results.
3. Univariate analysis
In this section, I will use descriptive statistics to:
- Prepare the data: By studying the distribution of the variables, I will see if I should transform the data (e.g. log-transform, define a categorical variable, deal with outliers, etc.)
- Choose the right statistical tools: After this step, I will know the nature of each variable (continuous, categorical, binary, etc) which allows me to choose the right statistical tools (correlation, bar/line graphs, scatter plot, etc).
- Get an idea of the underlying variation: I also find it important to look at how the variable varies over time (line graph) and space (map). It helps me to get a better understanding of the data and potentially spot some anomalies or interesting shocks to exploit for a natural experiment.
3.1 Outcome variable: Environmental Policy Stringency Index
- The Stringency index is a continuous variable taking values from 0.3 to 4.1 within the sample selected.
- The mean is 1.6 with a median slightly lower (1.5).
- Histogram and table: The data are slightly right-skewed (skewness 0.48), with a concentration of very low values. Skewness is a measure of asymmetry: negative values indicate a longer left tail, zero means the distribution is perfectly symmetric, and positive values indicate a longer right tail.
- Map: It seems that the stringency of the environmental policies is highly correlated with GDP: Europe, North America, or Australia have high values while African, Asian or South American countries have lower values. Hence, it would be important to take this into account for the multivariate analysis (e.g. include GDP as a control variable in a regression model).
- Line graph: Overall there is a positive trend. There is also an interesting drop in 2007. I would try to explore this later to find out if it is driven by a subset of countries.
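To make the sign convention of the skewness concrete, here is a small sketch using the moment-based definition (third central moment divided by the 1.5 power of the second). The three toy samples are assumptions chosen only to show the sign. Note that pandas' `Series.skew()` applies a small-sample correction, so its values differ slightly from this formula.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (no small-sample correction)."""
    x = np.asarray(values, dtype=float)
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)  # second central moment (variance)
    m3 = np.mean(dev ** 3)  # third central moment
    return m3 / m2 ** 1.5

right_tail = [1, 1, 1, 2, 5]          # many low values, a few high ones
left_tail = [-v for v in right_tail]  # mirror image
symmetric = [1, 2, 3, 4, 5]

print("right tail:", skewness(right_tail))  # positive
print("left tail: ", skewness(left_tail))   # negative
print("symmetric: ", skewness(symmetric))   # zero
```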
4. Bivariate analysis
Now that I have picked the quantities and prepared the ingredients, I will put them together. My outcome and explanatory variables are continuous variables. Hence, I will start with a simple scatter plot, as it remains very close to the data, allows me to see each data point, and is adequate given the sample size (with millions of data points I would favor a hexbin plot, for example).
To help you follow the steps, I put my observations after each graph.
Observations: It seems that the dots are spread in small groups vertically. I guess that the average annual temperature has higher between-country variation than within-country variation. Hence, it makes more sense for the analysis to look at how within-country variations in temperature are associated with the outcome. To confirm this idea I will simply plot the same graph while coloring the dots by country.
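A numerical complement to the colored scatter plot is to compare the spread of country means (between-country variation) with the typical spread inside a country (within-country variation). The sketch below uses two invented countries, and using standard deviations as the measure of spread is my assumption here.

```python
import pandas as pd

# Two invented countries: very different levels, small year-to-year changes.
df = pd.DataFrame({
    "cname": ["Cold"] * 3 + ["Warm"] * 3,
    "year": [2000, 2001, 2002] * 2,
    "cckp_temp": [5.0, 5.5, 6.0, 20.0, 20.5, 21.0],
})

by_country = df.groupby("cname")["cckp_temp"]

# Between-country variation: spread of the country averages.
between_sd = by_country.mean().std()

# Within-country variation: average of the country-specific spreads.
within_sd = by_country.std().mean()

print(f"between-country sd: {between_sd:.2f}")
print(f"within-country sd:  {within_sd:.2f}")
```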
Observation: Indeed the temperature is relatively stable over time, while there is significant variation for the stringency index within-country.
Hence, let us compute the difference from the mean for the temperature variable for each country rather than using the level (within-country variation).
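One way to compute this deviation from the country mean is a group-wise transform, sketched below on invented data. The column name `temp_dev` is a name I introduce for illustration.

```python
import pandas as pd

# Invented panel with two countries observed over three years.
df = pd.DataFrame({
    "cname": ["Cold", "Cold", "Cold", "Warm", "Warm", "Warm"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "cckp_temp": [5.0, 5.5, 6.0, 20.0, 20.5, 21.0],
})

# transform("mean") returns each country's mean aligned with the original rows.
country_mean = df.groupby("cname")["cckp_temp"].transform("mean")

# Within-country variation: temperature minus the country's own average.
df["temp_dev"] = df["cckp_temp"] - country_mean

print(df)
```

After this step both countries are centered on zero, so only the year-to-year variation inside each country remains.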
Observation: The relationship is clearer and as expected it is positive.
Next? Heterogeneity: Now let’s check whether there is some heterogeneity with respect to rainfall.
Observation: Visually the relationship is unclear. However, the countries with the highest rainfall seem to have low temperature variation and a low environmental policy stringency index (potentially Indonesia).
Let’s do a sample split to explore the relationship between temperature and environmental policy stringency index for countries with rainfall above versus below median (heterogeneity exercise).
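A sample split of this kind can be sketched as below. The numbers are invented so that the two groups behave differently, and splitting at the observation level around the overall median of `cckp_rain` is an assumption of this sketch (`temp_dev` stands for the within-country temperature deviation).

```python
import pandas as pd

# Invented data: the first four rows are dry observations, the last four wet.
df = pd.DataFrame({
    "temp_dev": [-1.0, 0.0, 1.0, 2.0, -1.0, 1.0, 1.0, -1.0],
    "oecd_eps": [1.0, 2.0, 3.0, 4.0, 2.0, 2.0, 3.0, 3.0],
    "cckp_rain": [300.0, 350.0, 400.0, 450.0, 900.0, 950.0, 1000.0, 1100.0],
})

# Label each observation as above or below the median rainfall.
median_rain = df["cckp_rain"].median()
df["rain_group"] = (df["cckp_rain"] > median_rain).map(
    {True: "above_median", False: "below_median"}
)

# Correlation between temperature deviation and stringency within each group.
corr_by_group = {
    group: part["temp_dev"].corr(part["oecd_eps"])
    for group, part in df.groupby("rain_group")
}
print(corr_by_group)
```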
Observation: The correlation between environmental policy and temperature (deviation from the mean) is positive. The correlation is almost zero for the sample with rain above median (0.036) while it is relatively large for the sample below median (0.19). From the last graph, we can see that the slope of the linear fit is steeper for the sample with low rain but the intercept is lower. It would be heroic to conclude anything from those simple preliminary graphs. However, they suggest some potentially interesting heterogeneity. This is as far as I will go with the descriptive statistics (already more than a bivariate analysis, by the way).
5. Conclusion
What have we learned from this exploration? First, it allowed me to show you my recipe. Second, we learned a few new things about the relationship between temperature and environmental policies:
- There is a positive association between (within-country) temperature variation and environmental policy stringency index (for OECD countries from 1993 to 2012).
- The relationship appears stronger for observations experiencing low rain during the same year.
- From this simple heterogeneity exercise, it remains unclear if this association is stronger for countries where there is little rain on average (e.g.: Australia, Spain, or South Africa) or for any country during drier years (e.g.: France in 2003).
Next: I would now refine this exercise and in particular look into three things: (1) what happened in 2007 with the drop in the stringency index (one country or all countries, and why); (2) which regions are driving the aggregated drop in temperature in 1998 and 2010 (it could lead to an interesting event study); (3) whether the effect takes place solely in the same year or with lags, using an event study. Then, depending on the findings, I would fit a multivariate model to quantify the effect and control for confounders.
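As a first building block for looking at lagged effects, one could create a one-year lag of the temperature deviation within each country. This is only a sketch on invented data. The column names `temp_dev` and `temp_dev_lag1` are hypothetical, and it assumes one row per country and year with no gaps between years.

```python
import pandas as pd

# Invented panel, deliberately unsorted to show why sorting matters.
df = pd.DataFrame({
    "cname": ["A", "A", "A", "B", "B", "B"],
    "year": [2002, 2000, 2001, 2000, 2001, 2002],
    "temp_dev": [0.3, -0.2, 0.1, 0.0, 0.4, -0.4],
})

# Sort by country and year so that shifting moves values forward in time.
df = df.sort_values(["cname", "year"]).reset_index(drop=True)

# Previous year's deviation, computed separately for each country.
# The first year of each country has no predecessor and becomes NaN.
df["temp_dev_lag1"] = df.groupby("cname")["temp_dev"].shift(1)

print(df)
```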
¹ Povitkina, Marina, Natalia Alvarado Pachon, and Cem Mert Dalli. “The Quality of Government Environmental Indicators Dataset, version Sep21.” University of Gothenburg: The Quality of Government Institute, https://www.gu.se/en/quality-government (2021).
² Botta, Enrico, and Tomasz Koźluk. “Measuring environmental policy stringency in OECD countries: A composite index approach.” (2014).
³ The World Bank Group. 2021. Climate Change Knowledge Portal. url: https://climateknowledgeportal.worldbank.org
⁴ Harris, Ian, et al. “Version 4 of the CRU TS monthly high-resolution gridded multivariate climate dataset.” Scientific data 7.1 (2020): 1–18.