import pandas as pd
import numpy as np
import bambi as bmb
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt
from matplotlib import transforms
from itertools import product
import pyfixest as pf
import statsmodels.formula.api as smf
from patsy import dmatrices, dmatrix
import seaborn as sns
import networkx as nx
from pyfixest.did.estimation import did2s, lpdid
from matplotlib.lines import Line2D
from matplotlib.patches import Patch
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
pd.options.mode.chained_assignment = None
np.random.seed(100)
Architectures and Free Parameters
“There are no rules of architecture for a castle in the clouds.” - G.K. Chesterton
Free choice is not whimsy or wanton action. Choice is free when taken within the confines of tolerable bounds. In statistical modelling we seek to discover what is plausibly realised within such bounds. We learn from the past to discern the shape of the processes which determine the future. Different models yield different insights into those structures which govern our trajectories. Which parameters are “free”, and which are decisively characterised? Which realities are contestable and which non-negotiable? In this post we will dive into the way in which different inferential methodologies yield different insights into how group-level confounding constrains the realisations of our parameter space and shapes the architecture of causal inference.
Difference in Differences estimation is a core topic in contemporary social science and causal inference. It is a compelling methodology for licensing inferential statements about policy effects, but it is not a panacea. The method makes substantive assumptions about the treatment regime and the trends in the process before the treatment is introduced, but it also requires that the data be rich enough to mandate the intended interpretation of a heavily parameterised estimation model. Two-way fixed effects models attempt to tightly estimate the treatment effect by controlling for all sources of variation in the treatment due to group level factors in DiD designs. These controls help isolate the variation due to the actual treatment by pinning down the effects of interactions between treatment-units and time. We will see how these models and their intended interpretation come apart quickly in the face of staggered treatment roll-outs. We will also show how to fix these issues by enriching our regression specification.
Structure of the Presentation
In this post we will structure the discussion as follows:
- Estimation Strategies and Group Effects
- We go through a pared-down parameter recovery example of group confounding and a range of estimation strategies aiming to account for group effect bias.
- Nested Groups, Hierarchies and Fixed Effects
- We work through a real example of three-level group nesting (pupils in classes, in schools), where we need to make decisions about the estimation strategy.
- Two Way Fixed Effects and Temporal Confounding
- We branch off into examining an analogous case of confounding addressed in DiD designs. We show how TWFE works and how to specify these regression designs using fixed effects and Mundlak devices.
- Issues with TWFE and Richly Parameterised Linear Models
- We examine the problems arising for the TWFE estimator in the case of staggered roll-outs. We also see how to address the issue with Mundlak devices, and reflect on what we’ve seen.
Estimation Strategies and Group Effects
First consider an estimation example due to Richard McElreath’s lecture series, where we examine the various parameter recovery options available in the case of group level confounding. We want to move through these parameterisations to highlight the usefulness of Mundlak regressions. The source of these models is Mundlak’s 1978 paper “On the pooling of time series and cross section data”, in which he shows a method for adjusting for group-level effects which proves to be a viable alternative to a richly specified fixed effects model.
First define a data generating process determined by group level effects:
def inv_logit(p):
return np.exp(p) / (1 + np.exp(p))
N_groups = 30
N_id = 3000
a0 = -2
bXY = 1
bZY = -0.5
g = np.random.choice(range(N_groups), size=N_id, replace=True)
Ug = np.random.normal(1.5, size=N_groups)
X = np.random.normal(Ug[g], size=N_id)
Z = np.random.normal(size=N_groups)
s = a0 + bXY*X + Ug[g] + bZY*Z[g]
p = inv_logit(s)
Y = np.random.binomial(n=1, p=p)

sim_df = pd.DataFrame({'Y': Y, 'p': p, 's': s, 'g': g, 'X': X, 'Z': Z[g]})
sim_df.head()
Y | p | s | g | X | Z | |
---|---|---|---|---|---|---|
0 | 1 | 0.943243 | 2.810542 | 8 | 1.384072 | -1.818302 |
1 | 0 | 0.593872 | 0.379996 | 24 | 1.047128 | -0.242817 |
2 | 1 | 0.874945 | 1.945406 | 3 | 2.340418 | -0.650122 |
3 | 0 | 0.317089 | -0.767181 | 7 | 0.550504 | 0.795680 |
4 | 1 | 0.699030 | 0.842684 | 23 | 1.183011 | -0.111692 |
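Written out, the simulated process looks roughly as follows (a sketch matching the code above, with \(a_0 = -2\), \(b_{XY} = 1\), \(b_{ZY} = -0.5\)):

$$
\begin{aligned}
U_g &\sim \text{Normal}(1.5, 1), \qquad Z_g \sim \text{Normal}(0, 1) \\
X_i &\sim \text{Normal}(U_{g[i]}, 1) \\
s_i &= a_0 + b_{XY}X_i + U_{g[i]} + b_{ZY}Z_{g[i]} \\
Y_i &\sim \text{Bernoulli}\big(\text{logit}^{-1}(s_i)\big)
\end{aligned}
$$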
This data generating process is characterised by a Bernoulli outcome with group level confounds on the \(X\) variable. If we model this relationship the confounding effects will bias naive parameter estimates on the covariates \(X\), \(Z\). We can imagine this is a classroom setting with group-confounding due to teacher effects. In a DAG we see a picture something like this:
Code
graph = nx.DiGraph()
graph.add_edges_from([("classroom \n G", "scores \n y"), ("classroom \n G", "student prep \n X"), ("student prep \n X", "scores \n y"), ("class temp \n Z", "scores \n y")])
fig, ax = plt.subplots(figsize=(10, 6))
nx.draw_networkx(graph, arrows=True, ax=ax,
                 node_size=6000, font_color="whitesmoke",
                 pos={'classroom \n G': [0, 0], 'scores \n y': [0.5, 1], 'student prep \n X': [0, 2],
                      'class temp \n Z': [1, 1]})
ax.set_title("Group Confounding in a DAG", fontsize=12);
We can plot the data here to visualise the shares of the binary outcomes.
g = sns.pairplot(sim_df[['Y', 'X', 'Z']], diag_kind="kde", corner=True)
g;
We will see different estimation results as we explore different ways of parameterising the relationships, and the trade-offs for model specifications will become clearer.
Naive Model
We will use Bambi
to specify the majority of these generalised linear models throughout this post. The formula syntax for Bambi
enables us to specify generalised linear models with bernoulli outcomes.
= bmb.Model(f"Y['>.5'] ~ X + Z ", sim_df,
naive_model ="bernoulli")
family= naive_model.fit( inference_method="nuts_numpyro",
naive_idata ={"log_likelihood": True},) idata_kwargs
=['Intercept', 'X', 'Z']) az.summary(naive_idata, var_names
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | -0.991 | 0.076 | -1.129 | -0.841 | 0.001 | 0.001 | 4859.0 | 3143.0 | 1.0 |
X | 1.308 | 0.053 | 1.210 | 1.410 | 0.001 | 0.001 | 3449.0 | 2770.0 | 1.0 |
Z | -0.614 | 0.048 | -0.702 | -0.522 | 0.001 | 0.001 | 3822.0 | 3033.0 | 1.0 |
Here we see that all three parameter estimates are biased away from their true values. Let’s try a simple fixed effects approach that adds indicator variables for all but one of the group levels.
Fixed Effects Model
The additional syntax for creating the group level indicator variables is specified here. This will create a host of indicator variables for membership in each level of the group variable.
= bmb.Model(f"Y['>.5'] ~ C(g) + X + Z", sim_df,
fixed_effects_model ="bernoulli")
family= fixed_effects_model.fit( inference_method="nuts_numpyro",
fixed_effects_idata ={"log_likelihood": True},) idata_kwargs
=['Intercept', 'X', 'Z']) az.summary(fixed_effects_idata, var_names
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | -0.386 | 1.505 | -3.181 | 2.308 | 0.086 | 0.061 | 307.0 | 552.0 | 1.02 |
X | 0.977 | 0.059 | 0.872 | 1.097 | 0.001 | 0.001 | 2635.0 | 2189.0 | 1.00 |
Z | -0.008 | 1.408 | -2.684 | 2.501 | 0.079 | 0.056 | 314.0 | 565.0 | 1.01 |
Now we see that the coefficient on the \(X\) variable seems correct, but the coefficient on \(Z\) is wildly wrong. Indeed the uncertainty interval on the \(Z\) coefficient is huge. The fixed effects model was unable to learn anything about the correct parameter, whereas the naive model seems to learn the correct \(Z\) parameter but overestimates the \(X\) coefficient. These are the kinds of trade-offs we need to be wary of as we account for the complexities of extensive group interactions in our model’s functional form.
Code
fig, axs = plt.subplots(2, 1, figsize=(10, 9))
axs = axs.flatten()

az.plot_posterior(naive_idata, var_names=['X'], ax=axs[0],
                  point_estimate=None, hdi_prob='hide', color='red', label='Naive Model')
axs[0].axvline(1, color='k')

az.plot_posterior(fixed_effects_idata, var_names=['X'], ax=axs[0], point_estimate=None, hdi_prob='hide', label='Fixed Effect Models')
axs[0].set_title("Naive/Fixed Model X Coefficient")

az.plot_posterior(naive_idata, var_names=['Z'], ax=axs[1], point_estimate=None, hdi_prob='hide', color='red', ref_val_color='black')

az.plot_posterior(fixed_effects_idata, var_names=['Z'], ax=axs[1], point_estimate=None, hdi_prob='hide')

axs[1].set_title("Naive/Fixed Effect Model Z Coefficient")
axs[1].axvline(-0.5, color='k');
We now want to try another way of handling the group confounding: a hierarchical approach, adding group level effects to the population intercept term.
Multilevel Model
The syntax for the random effects model is easily expressed in bambi
too. These group level effects modify the intercept additively and are collectively drawn from a shared distribution implicit in the model.
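Schematically, the hierarchical specification below can be sketched as

$$
\text{logit}(p_i) = \alpha + \alpha_{g[i]} + \beta_{X}X_i + \beta_{Z}Z_{g[i]}, \qquad \alpha_g \sim \text{Normal}(0, \sigma_g)
$$

where the group offsets \(\alpha_g\) share the common scale parameter \(\sigma_g\).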
= bmb.Model(f"Y['>.5'] ~ (1 | g) + X + Z", sim_df,
multilevel_model ="bernoulli")
family= multilevel_model.fit( inference_method="nuts_numpyro",
multilevel_model_idata ={"log_likelihood": True},) idata_kwargs
This method posits a view that there is a shared underlying process generating each instance of group-behaviour, i.e. the realised values of the outcome within each group. In the random effects multilevel model we replace the individual fixed effects indicator columns with additional parameters modifying the intercept term. This is the “hierarchy” or multi-level aspect of the specification.
az.summary(multilevel_model_idata, var_names=['X', 'Z', 'Intercept'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
X | 1.023 | 0.059 | 0.914 | 1.134 | 0.001 | 0.001 | 4641.0 | 3115.0 | 1.0 |
Z | -0.606 | 0.169 | -0.918 | -0.277 | 0.006 | 0.004 | 868.0 | 1672.0 | 1.0 |
Intercept | -0.573 | 0.201 | -0.940 | -0.180 | 0.007 | 0.005 | 781.0 | 1053.0 | 1.0 |
Next we’ll apply the Mundlak device method which adds the group mean back to each observation as a single additional covariate.
Mundlak Model
For this technique we augment the model by supplying an additional column. This column records the group-level means of the “confounded variable”, i.e. the variable which is realised under group level influences. Adding the group mean(s) to the model in this sense somewhat absolves us of the requirement to include an extra multiplicity of parameters. It’s more akin to feature creation than model specification, but it serves the same purpose: it accounts for group level variation and provides a mechanism for the model to learn the appropriate weights to accord each group in the final calculation.
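In symbols, the Mundlak specification we fit below adds the group mean \(\bar{X}_g\) of the confounded covariate as a covariate in its own right, roughly:

$$
\text{logit}(p_i) = \alpha + \alpha_{g[i]} + \beta_{X}X_i + \beta_{Z}Z_{g[i]} + \gamma\bar{X}_{g[i]}
$$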
sim_df['group_mean'] = sim_df.groupby('g')['X'].transform(np.mean)

sim_df['group_mean_Z'] = sim_df.groupby('g')['Z'].transform(np.mean)

mundlak_model = bmb.Model(f"Y['>.5'] ~ (1 | g) + X + Z + group_mean", sim_df,
                          family="bernoulli")
mundlak_idata = mundlak_model.fit(inference_method="nuts_numpyro",
                                  idata_kwargs={"log_likelihood": True},)
Mundlak’s insight is that this mechanism is akin to adding a whole slew of fixed-effect indicator variables for each group. It is a corrective mechanism for group confounding in our focal variable. We’ll see how it allows for more flexible model specifications and has a role in making adjustments to aid identification in causal inference.
az.summary(mundlak_idata, var_names=['Intercept', 'X', 'Z'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | -1.906 | 0.141 | -2.174 | -1.649 | 0.002 | 0.002 | 4017.0 | 2950.0 | 1.0 |
X | 0.952 | 0.059 | 0.842 | 1.062 | 0.001 | 0.001 | 4442.0 | 3188.0 | 1.0 |
Z | -0.443 | 0.064 | -0.569 | -0.330 | 0.001 | 0.001 | 3206.0 | 2967.0 | 1.0 |
We can now plot all the parameter recovery models together and we’ll see that there are some trade-offs between the fixed effects and random effects varieties of the modelling.
Plotting the Comparisons
The parameter recovery exercise shows striking differences across each of the models
Code
fig, axs = plt.subplots(3, 1, figsize=(10, 11))
axs = axs.flatten()

az.plot_posterior(naive_idata, var_names=['X'], ax=axs[0],
                  point_estimate=None, color='red', label='Naive', hdi_prob='hide')
axs[0].axvline(1, color='k', linestyle='--', label='True value')

az.plot_posterior(fixed_effects_idata, var_names=['X'], ax=axs[0], point_estimate=None, hdi_prob='hide', label='Fixed')

az.plot_posterior(multilevel_model_idata, var_names=['X'], ax=axs[0], point_estimate=None, hdi_prob='hide', color='green', label='Hierarchical')

az.plot_posterior(mundlak_idata, var_names=['X'], ax=axs[0], point_estimate=None, hdi_prob='hide', color='purple', label='Mundlak')

axs[0].set_title("X Coefficient")

az.plot_posterior(naive_idata, var_names=['Z'], ax=axs[1], point_estimate=None, hdi_prob='hide', color='red', ref_val_color='black')

az.plot_posterior(fixed_effects_idata, var_names=['Z'], ax=axs[1], point_estimate=None, hdi_prob='hide')

az.plot_posterior(multilevel_model_idata, var_names=['Z'], ax=axs[1], point_estimate=None, hdi_prob='hide', color='green')

az.plot_posterior(mundlak_idata, var_names=['Z'], ax=axs[1], point_estimate=None, hdi_prob='hide', color='purple')

axs[1].set_title("Z Coefficient")
axs[1].axvline(-0.5, color='k', linestyle='--');

az.plot_posterior(naive_idata, var_names=['Intercept'], ax=axs[2], point_estimate=None, color='red', ref_val_color='black', hdi_prob='hide')

az.plot_posterior(fixed_effects_idata, var_names=['Intercept'], ax=axs[2], point_estimate=None, hdi_prob='hide')

az.plot_posterior(multilevel_model_idata, var_names=['Intercept'], ax=axs[2], point_estimate=None, hdi_prob='hide', color='green')

az.plot_posterior(mundlak_idata, var_names=['Intercept'], ax=axs[2], point_estimate=None, hdi_prob='hide', color='purple')

axs[2].axvline(-2, color='k', linestyle='--');
Importantly, we see that while the fixed effects model is focused on recovering the treatment effect on the \(X\) covariate, it does so somewhat at the expense of accuracy on the other systematic components of the model. This focus renders the model less predictively accurate. If we compare the models on the cross-validation score, we see how the hierarchical Mundlak model is to be preferred.
compare_df = az.compare({'naive': naive_idata, 'fixed': fixed_effects_idata,
                         'hierarchical': multilevel_model_idata,
                         'mundlak': mundlak_idata})
compare_df
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/arviz/stats/stats.py:805: UserWarning: Estimated shape parameter of Pareto distribution is greater than 0.7 for one or more samples. You should consider using a more robust model, this is because importance sampling is less likely to work well if the marginal posterior and LOO posterior are very different. This is more likely to happen with a non-robust model and highly influential observations.
warnings.warn(
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
mundlak | 0 | -1209.532733 | 12.255178 | 0.000000 | 0.951092 | 30.213512 | 0.000000 | False | log |
hierarchical | 1 | -1218.446608 | 27.713536 | 8.913876 | 0.025741 | 30.404197 | 4.443081 | False | log |
fixed | 2 | -1221.467513 | 33.823574 | 11.934780 | 0.000000 | 31.743164 | 5.184940 | True | log |
naive | 3 | -1295.325039 | 2.808180 | 85.792306 | 0.023167 | 29.474596 | 12.972671 | False | log |
az.plot_compare(compare_df);
This is not the only way to assess viability of the model’s functional form but it’s not a bad way.
Full Luxury Bayesian Mundlak Machine
As good Bayesians we might worry about the false precision of adding simple point estimates for the group mean covariates in the Mundlak model. We can remedy this by explicitly incorporating these values as an extra parameter and adding uncertainty to the draws on these parameters.
id_indx, unique_ids = pd.factorize(sim_df["g"])

coords = {'ids': list(range(N_groups))}
with pm.Model(coords=coords) as model:
    x_data = pm.Data('X_data', sim_df['X'])
    z_data = pm.Data('Z_data', sim_df['Z'])
    y_data = pm.Data('Y_data', sim_df['Y'])

    alpha0 = pm.Normal('Intercept', 0, 1)
    alpha_j = pm.Normal('alpha_j', 0, 1, dims='ids')
    beta_xy = pm.Normal('X', 0, 1)
    beta_zy = pm.Normal('Z', 0, 1)

    group_means = pm.Normal('group_means', sim_df.groupby('g')['X'].mean().values, .1, dims='ids')
    group_sigma = pm.HalfNormal('group_sigma', 0.1)

    group_mu = pm.Normal('group_mu', (group_means[id_indx]), group_sigma, observed=sim_df['X'].values)

    mu = pm.Deterministic('mu', (alpha0 + alpha_j[id_indx]) + beta_xy*x_data + beta_zy*z_data + group_means[id_indx])
    p = pm.Deterministic("p", pm.math.invlogit(mu))
    # likelihood
    pm.Binomial("y", n=1, p=p, observed=y_data)

    idata = pm.sample(idata_kwargs={"log_likelihood": True})
Instead of merely adding a new feature to a regression we’ve added a second likelihood term for the predictor \(X\) and seek to model this outcome as a function of the group means. In other words, we incorporate the idea that \(X\) is confounded by some aspect of the group membership, and seek to estimate the nature of that confounding as it influences both the realisations of \(X\) and \(Y\). In doing so we calibrate the group_means parameter, which is also included in our likelihood term for the \(Y\) outcome.
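Schematically, the joint model has two likelihood terms:

$$
\begin{aligned}
X_i &\sim \text{Normal}(\mu_{g[i]}, \sigma_{group}) \\
Y_i &\sim \text{Bernoulli}\Big(\text{logit}^{-1}\big(\alpha_0 + \alpha_{g[i]} + \beta_{X}X_i + \beta_{Z}Z_i + \mu_{g[i]}\big)\Big)
\end{aligned}
$$

where the latent group means \(\mu_g\) enter both the model for \(X\) and the model for \(Y\).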
pm.model_to_graphviz(model)
This model bakes more uncertainty into the process assuming a kind of measurement-error model, but again, the focus here is that we need to estimate a corrective adjustment factor for confounding effects that impact the estimation of impact due to our focal variable \(X\). It’s perhaps easier to see in this parameter recovery exercise than it is in real cases, but wherever we suspect plausible group level confounding you should consider Mundlak adjustments alongside your go-to fixed effects or random effects controls.
az.summary(idata, var_names=['Intercept', 'X', 'Z'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | -1.876 | 0.204 | -2.248 | -1.481 | 0.007 | 0.005 | 866.0 | 1519.0 | 1.0 |
X | 0.990 | 0.060 | 0.871 | 1.099 | 0.001 | 0.001 | 6210.0 | 3050.0 | 1.0 |
Z | -0.523 | 0.175 | -0.840 | -0.185 | 0.005 | 0.004 | 1091.0 | 1951.0 | 1.0 |
You can see it also recovers the correct parameter specification well.
Recap
We’ve just seen a number of different modelling specifications that attempt to deal with the threat of group level confounding. Some varieties of model result in entirely saturated regression models with more parameters than observations. These models can quickly become unwieldy. We’ll now look at a case-study where group level confounding remains highly plausible and the choice of estimation routine is not trivial. This will highlight aspects of how what we seek to learn determines our model choice.
Nested Groups, Hierarchies and Fixed Effects
We’ve seen how various attempts to account for the group effects can more or less recover the parameters of a complex data generating process with group confounding. Now we want to look at a case where we can have interacting group effects at multiple levels.
Pupils within Class Rooms within Schools
A natural three level group hierarchy occurs in the context of educational organisations and business org-charts. We can use this fact to interrogate briefly how inferential statements about treatment effects vary as a function of what and how we control for group level variation. We draw the following data set from Linear Mixed Models: A Practical Guide Using Statistical Software.
df = pd.read_csv('classroom.csv')
df['class_mean'] = df.groupby(['classid'])['mathprep'].transform(np.mean)
df['school_mean'] = df.groupby(['schoolid'])['mathprep'].transform(np.mean)
df.head()
sex | minority | mathkind | mathgain | ses | yearstea | mathknow | housepov | mathprep | classid | schoolid | childid | class_mean | school_mean | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | 448 | 32 | 0.46 | 1.0 | NaN | 0.082 | 2.00 | 160 | 1 | 1 | 2.00 | 2.909091 |
1 | 0 | 1 | 460 | 109 | -0.27 | 1.0 | NaN | 0.082 | 2.00 | 160 | 1 | 2 | 2.00 | 2.909091 |
2 | 1 | 1 | 511 | 56 | -0.03 | 1.0 | NaN | 0.082 | 2.00 | 160 | 1 | 3 | 2.00 | 2.909091 |
3 | 0 | 1 | 449 | 83 | -0.38 | 2.0 | -0.11 | 0.082 | 3.25 | 217 | 1 | 4 | 3.25 | 2.909091 |
4 | 0 | 1 | 425 | 53 | -0.03 | 2.0 | -0.11 | 0.082 | 3.25 | 217 | 1 | 5 | 3.25 | 2.909091 |
The data has three distinct levels: (1) the child or pupil and their demographic attributes and outcome variable mathgain
, (2) the classroom and the teacher level attributes such as their experience yearstea
and a record of their mathematics courses taken mathprep
, (3) school and neighbourhood level with features describing poverty measures in the vicinity housepov
.
We’ll plot the child’s outcome mathgain
against the mathprep
and distinguish the patterns by school.
Code
def rand_jitter(arr):
    stdev = .01 * (max(arr) - min(arr))
    return arr + np.random.randn(len(arr)) * stdev

legend_elements = [Line2D([0], [0], marker='x', label='Minority', markerfacecolor='k'),
                   Line2D([0], [0], marker='x', label='Minority + Poverty', markersize=14, markerfacecolor='k'),
                   ]

schools = df['schoolid'].unique()
schools_10 = [schools[i:i+10] for i in range(0, len(schools), 10)]
fig, axs = plt.subplots(3, 4, figsize=(15, 13),
                        sharey=True, sharex=True)
axs = axs.flatten()
mkr_dict = {1: 'x', 0: '+'}
for s, ax in zip(schools_10, axs):
    temp = df[df['schoolid'].isin(s)]
    temp['m'] = temp['minority'].map(mkr_dict)
    for m in temp['m'].unique():
        temp1 = temp[temp['m'] == m]
        ax.scatter(rand_jitter(temp1['mathprep']),
                   temp1['mathgain'],
                   c=temp1['schoolid'], cmap='tab10',
                   s=temp1['housepov']*100,
                   marker=m)
        ax.set_title(f"Schools \n {s}");
        ax.set_xlabel("MathPrep")
        ax.set_ylabel("MathGain")

axs[0].legend(handles=legend_elements);
We’ve plotted here the individual student outcomes. We’ve sized the dots by the poverty in their neighbourhoods and differentiated the markers by whether the student was in a minority group. There are, in short, reasons here to worry about group-level confounding. There is a small number of observed students per school, so the individual school level distributions show some extreme outliers, but the overall distribution nicely converges to an approximately normal symmetric shape.
fig, axs = plt.subplots(1, 2, figsize=(10, 6))
axs = axs.flatten()
for school in schools:
    temp = df[df['schoolid'] == school]
    axs[0].hist(temp['mathgain'], color='grey', alpha=0.3, density=True, histtype='step', cumulative=True)

axs[0].hist(df['mathgain'], bins=30, ec='black', density=True, cumulative=True, histtype='step')
axs[1].hist(df['mathgain'], bins=30, ec='black', density=True, cumulative=False)
axs[0].set_title("Cumulative Distribution Function by School")
axs[1].set_title("Overall Distribution");
With these kinds of structures we need to be careful in how we evaluate any treatment effects when there are reasons to believe in group-level effects that impact the outcome variable. Imagine the true treatment effect is a sprinter running too fast to cleanly measure - each group interaction effect is added to his load as a weight. Enough weights absorb enough of the variation in the treatment that he is dragged to a crawl. Whatever movement he can make while dragging this burden is the effect we attribute to the treatment. Or put another way - the effects of various group interactions will modify the treatment effectiveness in some way and unless we account for this impact in our model, then the inferences regarding the treatment will be clouded.
Interaction Effects and Nuisance Parameters
Not all possible interactions will be present in the data. But if we specify the model to account for various interactions we may explode the number of parameters beyond the number of data points. This can cause issues in estimation routines and requires consideration about what the group parameters are aimed at capturing.
Why add covariates for group-membership when no observation reflects outcomes in that group? We cannot learn anything about these cases. At least if we add the group effects as a hierarchical random effect we induce shrinkage on the parameter estimates towards the mean of the hierarchical parameter in the model. This means that when predicting on “new” data with examples of these missing cases we can predict a sensible default. The distinction rests in the role we have in mind for these tools. If we seriously commit to the idea that group variation reflects a real “common” process in the larger population, and we want to learn about that over-arching process, then we deploy a random effects model. But if we only see these as tools for accounting for variation in the sample, allowing us to pin down an alternative focal estimate, then the group indicator covariates are just “nuisance” parameters and missing cases are irrelevant.
To see the extent of redundancy we can examine the dimensions of the covariate matrices that result from including more or fewer interaction terms.
= dmatrices("mathgain ~ mathprep + C(schoolid)+ C(classid)", df, return_type="dataframe")
y, X print(X.shape)
= dmatrices("mathgain ~ mathprep + C(schoolid)/C(classid)", df, return_type="dataframe")
y, X1 print(X1.shape)
= dmatrices("mathgain ~ mathprep + C(schoolid):C(childid)", df, return_type="dataframe")
y, X2 print(X2.shape)
(1190, 419)
(1190, 33385)
(1190, 127331)
We see here how different ways of accounting for group level variation and interaction effects lead to vastly inflated feature matrices. However, not all interaction terms matter; or put another way, not all the possible interactions feature in the data. So we have likely inflated the data matrix beyond necessity.
Here we define a helper function to parse a complex interaction formula, remove the columns entirely composed of zeros and return a new formula and dataframe which has a suitable range of features to capture the variation structures in the data.
def make_interactions_df(formula, df):
    y, X = dmatrices(formula, df, return_type="dataframe")
    n = X.shape[1]
    X = X[X.columns[~(np.abs(X) < 1e-12).all()]]
    n1 = X.shape[1]
    target_name = y.columns[0]
    d = pd.concat([y, X], axis=1)
    d.drop(['Intercept'], axis=1, inplace=True)
    d.columns = [c.replace('[', '').replace(']', '').replace('C(', '').replace(')', '').replace('.', '_').replace(':', '_') for c in d.columns]
    cols = ' + '.join([col for col in d.columns if col != target_name])
    formula = f"{target_name} ~ {cols}"
    print(f"""Size of original interaction features: {n} \nSize of reduced feature set: {n1}""")
    return formula, d

formula, interaction_df = make_interactions_df("mathgain ~ mathprep + C(schoolid):C(childid)", df)

interaction_df.head()
Size of original interaction features: 127331
Size of reduced feature set: 2370
mathgain | childidT_2 | childidT_3 | childidT_4 | childidT_5 | childidT_6 | childidT_7 | childidT_8 | childidT_9 | childidT_10 | ... | schoolidT_107_childid1182 | schoolidT_107_childid1183 | schoolidT_107_childid1184 | schoolidT_107_childid1185 | schoolidT_107_childid1186 | schoolidT_107_childid1187 | schoolidT_107_childid1188 | schoolidT_107_childid1189 | schoolidT_107_childid1190 | mathprep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.00 |
1 | 109.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.00 |
2 | 56.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.00 |
3 | 83.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.25 |
4 | 53.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.25 |
5 rows × 2370 columns
We have reduced the number of interactions by an order of magnitude! We can now fit a regression model to the revised feature matrix.
Comparing Interaction Models
Consider the variation in the coefficient values estimated for mathprep
as we add more and more interaction effects. The addition of interaction effects generates a large number of completely 0 interaction terms which we remove here.
= ["""mathgain ~ mathprep + C(schoolid)""",
formulas """ mathgain ~ mathprep + school_mean*class_mean""" ,
""" mathgain ~ mathprep + mathkind + sex + school_mean*class_mean""" ,
"""mathgain ~ mathprep + C(schoolid) + C(classid)""",
"""mathgain ~ mathprep + C(schoolid)*C(classid)""",
"""mathgain ~ mathprep + C(classid):C(childid)""",
]
= []
estimates_df for f in formulas:
= make_interactions_df(f, df)
formula, interaction_df = smf.ols(formula, interaction_df).fit()
result = [[result.params['mathprep']], list(result.conf_int().loc['mathprep', :]), [formula]]
estimates = [e for est in estimates for e in est]
estimates
estimates_df.append(estimates)
= pd.DataFrame(estimates_df, columns=['mathprep_estimate', 'lower bound', 'upper bound', 'formula'])
estimates_df
estimates_df
Size of original interaction features: 108
Size of reduced feature set: 108
Size of original interaction features: 5
Size of reduced feature set: 5
Size of original interaction features: 7
Size of reduced feature set: 7
Size of original interaction features: 419
Size of reduced feature set: 419
Size of original interaction features: 33385
Size of reduced feature set: 728
Size of original interaction features: 371281
Size of reduced feature set: 2376
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/statsmodels/regression/linear_model.py:1717: RuntimeWarning: divide by zero encountered in scalar divide
return np.dot(wresid, wresid) / self.df_resid
mathprep_estimate | lower bound | upper bound | formula | |
---|---|---|---|---|
0 | 1.060768 | -1.435118 | 3.556655 | mathgain ~ schoolidT_2 + schoolidT_3 + schooli... |
1 | 1.789413 | -1.587051 | 5.165876 | mathgain ~ mathprep + school_mean + class_mean... |
2 | 0.904018 | -2.056505 | 3.864541 | mathgain ~ mathprep + mathkind + sex + school_... |
3 | 2.948546 | 0.601150 | 5.295942 | mathgain ~ schoolidT_2 + schoolidT_3 + schooli... |
4 | 3.545931 | 1.359793 | 5.732068 | mathgain ~ schoolidT_2 + schoolidT_3 + schooli... |
5 | 2.303187 | NaN | NaN | mathgain ~ childidT_2 + childidT_3 + childidT_... |
The point here (perhaps obvious) is that the estimate of treatment effects due to some policy or programme can be differently understood when the regression model is able to account for increasing aspects of individual variation. Choices about the right way to “saturate” your regression specification are at the heart of causal inference. The right control structures determine the freedom available to vary your focal treatment effect parameter.
We will consider a number of specifications below that incorporate these group effects in a hierarchical model which nests the effect of class-membership within school membership. This choice allows us to control for group specific interactions without worrying about over-indexing on the observed interaction effects in the sample data, which would require that we handle more fixed effects parameters than we have data points.
In what follows we’ll specify a nested approach to the parameter specification using a random effects model. The idea here is that classes are already implicitly nested in schools, so we don’t need to add parameters for classes at multiple schools. Additionally we’re positing that there is independent interest in the school/class effects, i.e. the degree to which variation in a school/class nest can account for variation in the outcome.
Minimal Model
= bmb.Model(f"mathgain ~ mathprep + (1 | schoolid / classid)", df)
model = model.fit( inference_method="nuts_numpyro",
idata ={"log_likelihood": True}) idata_kwargs
The model specification here is deliberately minimalist: we want to observe how much of the variation in the outcome can be accounted for solely by adding extensive controls for interactions of group level effects and the treatment, while ignoring all else.
model.graph()
We can see the derived sigma parameters here which can be understood as partialling out the variance of the outcome into components due to those group level effects and the unexplained residuals.
az.summary(idata, var_names=['Intercept', '1|schoolid_sigma', '1|schoolid:classid_sigma', 'mathgain_sigma', 'mathprep'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | 52.561 | 3.542 | 46.010 | 59.516 | 0.066 | 0.047 | 2839.0 | 2731.0 | 1.00 |
1|schoolid_sigma | 8.319 | 2.265 | 4.099 | 12.590 | 0.124 | 0.087 | 399.0 | 285.0 | 1.01 |
1|schoolid:classid_sigma | 9.877 | 2.459 | 5.283 | 14.369 | 0.161 | 0.114 | 256.0 | 313.0 | 1.02 |
mathgain_sigma | 32.136 | 0.786 | 30.652 | 33.620 | 0.020 | 0.014 | 1537.0 | 2143.0 | 1.00 |
mathprep | 1.889 | 1.242 | -0.563 | 4.092 | 0.023 | 0.017 | 2933.0 | 2596.0 | 1.00 |
Note here the relative proportion of the school specific variance 1|schoolid_sigma to the overall variance of the residuals mathgain_sigma.
Calculating the IntraClass Correlation Coefficient
These models facilitate the calculation of ICC statistics, which are a measure of “explained variance”. The thought is to gauge the proportion of variance ascribed to one set of random effects over and above the total estimated variance in the baseline model, including the residuals mathgain_sigma.
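As a formula, the school-level ICC computed below is the share of total variance attributable to the school intercepts:

$$
\text{ICC}_{school} = \frac{\sigma^2_{school}}{\sigma^2_{school} + \sigma^2_{school:class} + \sigma^2_{residual}}
$$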
a = idata['posterior']['1|schoolid_sigma']**2

b = (idata['posterior']['1|schoolid:classid_sigma']**2 + idata['posterior']['1|schoolid_sigma']**2)

c = (idata['posterior']['1|schoolid:classid_sigma']**2 + idata['posterior']['1|schoolid_sigma']**2 + idata['posterior']['mathgain_sigma']**2)

(a / c).mean().item()
0.061101339125402415
((a + b) / c).mean().item()
0.20748893817579722
We can see here that the interaction terms do seem to account for a goodly portion of the variance in the outcome and we ought to consider retaining their inclusion in our modelling work. The structure of the problem drives us towards their inclusion. Class/school effects are going to absorb a sufficient portion of the variation. So they merit study in their own right, lest the individual class/school dynamics obscure the effectiveness of the mathprep
treatment. Similarly, it’s likely valuable to consider the efficacy of the average class/school in a wider policy conversation.
Augmenting the Models
Next we augment our model with more pupil level control variables aiming to pin down some of the aspects of the variation in the outcome.
Adding Pupil Fixed Effects
Here we add these fixed effects population parameters. But note they are not merely devices for controlling variance in the outcome; their interpretation is likely of independent interest.
= bmb.Model(f"mathgain ~ sex + minority + ses + mathprep + (1 | schoolid / classid)", df)
model_fixed = model_fixed.fit( inference_method="nuts_numpyro",
idata_fixed ={"log_likelihood": True}) idata_kwargs
=['Intercept', 'sex', 'minority', 'ses', 'mathprep',
az.summary(idata_fixed, var_names'1|schoolid_sigma', '1|schoolid:classid_sigma', 'mathgain_sigma'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | 53.364 | 4.052 | 45.853 | 61.036 | 0.075 | 0.053 | 2911.0 | 2702.0 | 1.00 |
sex | -2.320 | 1.980 | -6.060 | 1.392 | 0.029 | 0.024 | 4785.0 | 2927.0 | 1.00 |
minority | 0.285 | 2.666 | -4.744 | 5.240 | 0.046 | 0.038 | 3342.0 | 2993.0 | 1.00 |
ses | 0.916 | 1.442 | -1.713 | 3.641 | 0.024 | 0.020 | 3726.0 | 2813.0 | 1.00 |
mathprep | 1.933 | 1.207 | -0.444 | 4.032 | 0.024 | 0.017 | 2571.0 | 2478.0 | 1.00 |
1|schoolid_sigma | 8.575 | 2.000 | 4.787 | 12.441 | 0.082 | 0.058 | 632.0 | 623.0 | 1.01 |
1|schoolid:classid_sigma | 9.857 | 2.345 | 5.219 | 14.041 | 0.120 | 0.085 | 402.0 | 537.0 | 1.01 |
mathgain_sigma | 32.155 | 0.768 | 30.760 | 33.633 | 0.020 | 0.014 | 1425.0 | 2531.0 | 1.00 |
Now we add a further class level control.
Adding More Class and Pupil Level Fixed Effects
= bmb.Model(f"mathgain ~ mathkind + sex + minority + ses + yearstea + mathknow + mathprep + (1 | schoolid / classid)", df.dropna())
model_fixed_1 = model_fixed_1.fit( inference_method="nuts_numpyro",
idata_fixed_1 ={"log_likelihood": True}) idata_kwargs
=['Intercept',
az.summary(idata_fixed_1, var_names'mathkind', 'sex', 'minority', 'ses', 'yearstea', 'mathknow', 'mathprep','1|schoolid_sigma', '1|schoolid:classid_sigma', 'mathgain_sigma'])
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | 281.822 | 11.832 | 259.430 | 303.680 | 0.204 | 0.144 | 3361.0 | 2999.0 | 1.00 |
mathkind | -0.475 | 0.023 | -0.520 | -0.434 | 0.000 | 0.000 | 3596.0 | 2843.0 | 1.00 |
sex | -1.413 | 1.677 | -4.369 | 1.928 | 0.026 | 0.022 | 4234.0 | 2940.0 | 1.00 |
minority | -7.866 | 2.428 | -12.438 | -3.209 | 0.043 | 0.030 | 3230.0 | 3086.0 | 1.00 |
ses | 5.414 | 1.256 | 3.165 | 7.864 | 0.020 | 0.014 | 4041.0 | 2963.0 | 1.00 |
yearstea | 0.042 | 0.118 | -0.178 | 0.253 | 0.002 | 0.002 | 2778.0 | 2847.0 | 1.00 |
mathknow | 1.861 | 1.211 | -0.236 | 4.339 | 0.028 | 0.020 | 1915.0 | 2552.0 | 1.00 |
mathprep | 1.119 | 1.142 | -1.145 | 3.189 | 0.022 | 0.016 | 2700.0 | 2959.0 | 1.00 |
1|schoolid_sigma | 8.603 | 1.701 | 5.402 | 11.758 | 0.070 | 0.050 | 606.0 | 871.0 | 1.01 |
1|schoolid:classid_sigma | 9.348 | 1.745 | 5.827 | 12.490 | 0.092 | 0.068 | 353.0 | 899.0 | 1.02 |
mathgain_sigma | 26.766 | 0.675 | 25.578 | 28.073 | 0.015 | 0.011 | 1995.0 | 2470.0 | 1.00 |
fig, ax = plt.subplots(figsize=(10, 7))
az.plot_forest([idata, idata_fixed, idata_fixed_1], combined=True, var_names=['mathprep', '1|schoolid_sigma', '1|schoolid:classid_sigma', 'mathgain_sigma'], ax=ax)
ax.axvline(1)
<matplotlib.lines.Line2D at 0x29f11b890>
We now make use of bambi's model interpretation module to plot the marginal effect on the outcome due to changes in the treatment intensity.
Code
fig, axs = plt.subplots(1, 3, figsize=(20, 6),
                        dpi=120, sharey=True, sharex=True)
axs = axs.flatten()
axs[0].axhline(50, color='red', label='Reference Line',
               linestyle='--')
axs[1].axhline(50, color='red', label='Reference Line',
               linestyle='--')
axs[2].axhline(50, color='red', label='Reference Line',
               linestyle='--')
bmb.interpret.plot_predictions(model, idata, "mathprep", ax=axs[0]);
bmb.interpret.plot_predictions(model_fixed, idata_fixed, "mathprep", ax=axs[1]);
bmb.interpret.plot_predictions(model_fixed_1, idata_fixed_1, "mathprep", ax=axs[2]);
axs[0].set_title("Variation in Implied Outcome by Treatment \n mathgain ~ mathprep + (1 | schoolid / classid) ")
axs[1].set_title("Variation in Implied Outcome by Treatment \n mathgain ~ sex + minority + ses + mathprep + (1 | schoolid / classid) ")
axs[2].set_title("Variation in Implied Outcome by Treatment \n mathgain ~ mathkind + sex + minority + ses + yearstea + \n mathknow + mathprep + (1 | schoolid / classid")
axs[0].set_xlabel('')
axs[1].set_xlabel('')
axs[0].legend();
As we can see here, across all the different model specifications we see quite modest effects of treatment with very wide uncertainty. You might therefore be sceptical that teacher training has any real discernible impact on child outcomes. Maybe you believe other interventions are more important to fund? These kinds of questions determine policy, and misguided policy interventions on childhood education can have radical consequences. It is therefore vital that we have robust and justifiable approaches and tooling for the analysis of these policy questions in the face of group level confounding.
This last point is crucial: when model complexity balloons due to extensive interaction effects, we need efficient tools to interrogate the outcome level differences implied by the model choices. Your intuition and understanding of a process is keyed to the observable effects of that process. Your understanding of the internal mechanics of a model is likely less than your concrete expectations of observable behaviour. As such, credibility requires that we be able to assess models on the variation produced in the outcome variable rather than merely on the parameter values.
Two Way Fixed Effects and Temporal Confounding
Difference in Differences designs are the overworked donkeys of social science. Many, many studies stand or fall by the assumptions baked into DiD designs. There are at least two aspects to these assumptions: (i) the substantive commitments about the data generating process (e.g. parallel trends) and (ii) the appropriateness of the functional form used to model (i). We will look first at a case where all the assumptions can be met, and then examine how things break down under challenging staggered treatment regimes.
Event Studies and Change in Time
Event studies or dynamic treatment effects are an approach to measuring the gradual treatment effect as it evolves in time. We take this panel data set from the pyfixest
package to demonstrate.
= "https://raw.githubusercontent.com/py-econometrics/pyfixest/master/pyfixest/did/data/df_het.csv"
url = pd.read_csv(url)
df_het df_het.head()
unit | state | group | unit_fe | g | year | year_fe | treat | rel_year | rel_year_binned | error | te | te_dynamic | dep_var | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 33 | Group 2 | 7.043016 | 2010 | 1990 | 0.066159 | False | -20.0 | -6 | -0.086466 | 0 | 0.0 | 7.022709 |
1 | 1 | 33 | Group 2 | 7.043016 | 2010 | 1991 | -0.030980 | False | -19.0 | -6 | 0.766593 | 0 | 0.0 | 7.778628 |
2 | 1 | 33 | Group 2 | 7.043016 | 2010 | 1992 | -0.119607 | False | -18.0 | -6 | 1.512968 | 0 | 0.0 | 8.436377 |
3 | 1 | 33 | Group 2 | 7.043016 | 2010 | 1993 | 0.126321 | False | -17.0 | -6 | 0.021870 | 0 | 0.0 | 7.191207 |
4 | 1 | 33 | Group 2 | 7.043016 | 2010 | 1994 | -0.106921 | False | -16.0 | -6 | -0.017603 | 0 | 0.0 | 6.918492 |
Panel data of this kind of structure is difficult to intuit unless visualised. The key structures on which this DiD estimator works are the averages across state and time. We depict these here. Note how we define a causal contrast as the difference between the treated and control group averages over time. In particular our estimator has to account for the fact that we have a “dynamic” control group. As each cohort is treated our treatment group expands and the causal contrast of interest is differently defined. To gain a view of the time-evolution of treatment we want to marginalise over the specific cohorts and extract a view of how the effects evolve after the introduction of the treatment in expectation.
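As a reminder, in the canonical two-period, two-group case the DiD estimand is the double difference

$$
\hat{\tau}_{DiD} = \big(E[Y \mid \text{treated}, \text{post}] - E[Y \mid \text{treated}, \text{pre}]\big) - \big(E[Y \mid \text{control}, \text{post}] - E[Y \mid \text{control}, \text{pre}]\big)
$$

and the staggered design below generalises this by averaging such contrasts over cohorts and periods.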
Code
fig, axs = plt.subplots(2, 1, figsize=(10, 12))
axs = axs.flatten()
for u in df_het['unit'].unique():
    temp = df_het[df_het['unit'] == u]
    axs[0].plot(temp['year'], temp['dep_var'], color='grey', alpha=0.01)
    axs[1].plot(temp['year'], temp['dep_var'], color='grey', alpha=0.01)
df_het.groupby(['year', 'state'])[['dep_var']].mean().reset_index().pivot(index='year', columns='state', values='dep_var').plot(ax=axs[0], legend=False, color='blue',
                                                                                                                                alpha=0.4)

df_het.groupby(['year', 'g'])[['dep_var']].mean().reset_index().pivot(index='year', columns='g', values='dep_var').plot(ax=axs[1], legend=False)

axs[0].set_title("Difference in Differences \n State Mean Change an Individual Trajectories")
axs[1].set_title("Difference in Differences \n Mean Change an Individual Trajectories");
Note how the blue line represents a cohort that never undergoes treatment and is maintained as a coherent control group throughout the sequence, even though we have two other cohorts. This ensures we have a constant contrast case allowing us to identify the causal estimand of interest. It’s also worth calling out that we have heterogeneous patterns in the treatment effects across the cohorts, i.e. the intensity of the treatment effect differs across the cohorts. We’ll apply the vanilla TWFE estimator and then dig into how it works.
TWFE in pyfixest
A natural question to ask is how the treatment effect evolves over time. How does policy shift behaviours? Is it initially impactful, converging to a quick plateau, or a slowly building pattern of consistent growth?
fit_twfe_event = pf.feols(
    "dep_var ~ i(rel_year, ref=-1.0) | state + year ",
    df_het,
    vcov={"CRV1": "state"},
)

fit_twfe_event.tidy()
Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% | |
---|---|---|---|---|---|---|
Coefficient | ||||||
C(rel_year, contr.treatment(base=-1.0))[T.-20.0] | -0.099445 | 0.081699 | -1.217202 | 0.230842 | -0.264697 | 0.065808 |
C(rel_year, contr.treatment(base=-1.0))[T.-19.0] | 0.000624 | 0.083213 | 0.007493 | 0.994060 | -0.167690 | 0.168937 |
C(rel_year, contr.treatment(base=-1.0))[T.-18.0] | 0.004125 | 0.089719 | 0.045974 | 0.963565 | -0.177349 | 0.185599 |
C(rel_year, contr.treatment(base=-1.0))[T.-17.0] | 0.021899 | 0.085686 | 0.255573 | 0.799624 | -0.151418 | 0.195216 |
C(rel_year, contr.treatment(base=-1.0))[T.-16.0] | -0.036933 | 0.096921 | -0.381067 | 0.705221 | -0.232975 | 0.159108 |
C(rel_year, contr.treatment(base=-1.0))[T.-15.0] | 0.069578 | 0.081289 | 0.855936 | 0.397262 | -0.094844 | 0.234000 |
C(rel_year, contr.treatment(base=-1.0))[T.-14.0] | 0.037734 | 0.086618 | 0.435641 | 0.665499 | -0.137467 | 0.212936 |
C(rel_year, contr.treatment(base=-1.0))[T.-13.0] | 0.061779 | 0.083362 | 0.741098 | 0.463073 | -0.106836 | 0.230394 |
C(rel_year, contr.treatment(base=-1.0))[T.-12.0] | 0.089913 | 0.084900 | 1.059044 | 0.296095 | -0.081814 | 0.261639 |
C(rel_year, contr.treatment(base=-1.0))[T.-11.0] | 0.000982 | 0.079104 | 0.012417 | 0.990156 | -0.159021 | 0.160986 |
C(rel_year, contr.treatment(base=-1.0))[T.-10.0] | -0.113033 | 0.064922 | -1.741047 | 0.089560 | -0.244351 | 0.018285 |
C(rel_year, contr.treatment(base=-1.0))[T.-9.0] | -0.069225 | 0.057046 | -1.213481 | 0.232244 | -0.184612 | 0.046162 |
C(rel_year, contr.treatment(base=-1.0))[T.-8.0] | -0.061290 | 0.060362 | -1.015370 | 0.316188 | -0.183383 | 0.060804 |
C(rel_year, contr.treatment(base=-1.0))[T.-7.0] | -0.002022 | 0.064602 | -0.031306 | 0.975185 | -0.132693 | 0.128648 |
C(rel_year, contr.treatment(base=-1.0))[T.-6.0] | -0.055810 | 0.064674 | -0.862938 | 0.393447 | -0.186626 | 0.075006 |
C(rel_year, contr.treatment(base=-1.0))[T.-5.0] | -0.065009 | 0.064810 | -1.003066 | 0.322012 | -0.196099 | 0.066082 |
C(rel_year, contr.treatment(base=-1.0))[T.-4.0] | -0.009850 | 0.053098 | -0.185505 | 0.853794 | -0.117251 | 0.097551 |
C(rel_year, contr.treatment(base=-1.0))[T.-3.0] | 0.046338 | 0.062950 | 0.736104 | 0.466072 | -0.080991 | 0.173667 |
C(rel_year, contr.treatment(base=-1.0))[T.-2.0] | 0.015947 | 0.065442 | 0.243687 | 0.808751 | -0.116421 | 0.148316 |
C(rel_year, contr.treatment(base=-1.0))[T.0.0] | 1.406404 | 0.055641 | 25.276422 | 0.000000 | 1.293860 | 1.518949 |
C(rel_year, contr.treatment(base=-1.0))[T.1.0] | 1.660552 | 0.065033 | 25.533859 | 0.000000 | 1.529010 | 1.792094 |
C(rel_year, contr.treatment(base=-1.0))[T.2.0] | 1.728790 | 0.056793 | 30.440100 | 0.000000 | 1.613915 | 1.843665 |
C(rel_year, contr.treatment(base=-1.0))[T.3.0] | 1.854839 | 0.058315 | 31.807288 | 0.000000 | 1.736886 | 1.972792 |
C(rel_year, contr.treatment(base=-1.0))[T.4.0] | 1.958676 | 0.071144 | 27.531232 | 0.000000 | 1.814774 | 2.102578 |
C(rel_year, contr.treatment(base=-1.0))[T.5.0] | 2.082161 | 0.063855 | 32.607651 | 0.000000 | 1.953002 | 2.211320 |
C(rel_year, contr.treatment(base=-1.0))[T.6.0] | 2.191062 | 0.068510 | 31.981462 | 0.000000 | 2.052487 | 2.329637 |
C(rel_year, contr.treatment(base=-1.0))[T.7.0] | 2.279073 | 0.075014 | 30.381921 | 0.000000 | 2.127342 | 2.430803 |
C(rel_year, contr.treatment(base=-1.0))[T.8.0] | 2.364593 | 0.058598 | 40.352477 | 0.000000 | 2.246067 | 2.483120 |
C(rel_year, contr.treatment(base=-1.0))[T.9.0] | 2.372163 | 0.056560 | 41.940696 | 0.000000 | 2.257759 | 2.486566 |
C(rel_year, contr.treatment(base=-1.0))[T.10.0] | 2.649271 | 0.056177 | 47.158942 | 0.000000 | 2.535641 | 2.762901 |
C(rel_year, contr.treatment(base=-1.0))[T.11.0] | 2.753591 | 0.075526 | 36.458951 | 0.000000 | 2.600826 | 2.906356 |
C(rel_year, contr.treatment(base=-1.0))[T.12.0] | 2.813935 | 0.079320 | 35.475900 | 0.000000 | 2.653496 | 2.974375 |
C(rel_year, contr.treatment(base=-1.0))[T.13.0] | 2.756070 | 0.078368 | 35.168251 | 0.000000 | 2.597555 | 2.914584 |
C(rel_year, contr.treatment(base=-1.0))[T.14.0] | 2.863427 | 0.098389 | 29.103072 | 0.000000 | 2.664416 | 3.062438 |
C(rel_year, contr.treatment(base=-1.0))[T.15.0] | 2.986652 | 0.093309 | 32.008066 | 0.000000 | 2.797916 | 3.175388 |
C(rel_year, contr.treatment(base=-1.0))[T.16.0] | 2.963032 | 0.085427 | 34.684870 | 0.000000 | 2.790239 | 3.135825 |
C(rel_year, contr.treatment(base=-1.0))[T.17.0] | 2.972596 | 0.092390 | 32.174553 | 0.000000 | 2.785721 | 3.159472 |
C(rel_year, contr.treatment(base=-1.0))[T.18.0] | 2.935051 | 0.094126 | 31.182203 | 0.000000 | 2.744664 | 3.125439 |
C(rel_year, contr.treatment(base=-1.0))[T.19.0] | 2.918707 | 0.084193 | 34.667036 | 0.000000 | 2.748411 | 3.089002 |
C(rel_year, contr.treatment(base=-1.0))[T.20.0] | 2.979701 | 0.087777 | 33.946394 | 0.000000 | 2.802156 | 3.157246 |
C(rel_year, contr.treatment(base=-1.0))[T.inf] | 0.128053 | 0.078513 | 1.630976 | 0.110946 | -0.030755 | 0.286861 |
The model specification defines indicator variables for the years relative to treatment, omitting the final pre-treatment year (rel_year = -1) as the reference category; it then incorporates the fixed effects for state and year. The coefficient values ascribed to the relative year indicators are used to plot the event-study trajectories. This is a two-way fixed effects estimation routine where the fixed effects for the state and year indicators absorb the variance due to those groupings. It is often estimated using a de-meaning technique which we will demonstrate below.
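In equation form, the event-study specification above is roughly

$$
y_{it} = \alpha_{s(i)} + \lambda_t + \sum_{k \neq -1} \beta_k \, \mathbb{1}[\text{rel\_year}_{it} = k] + \epsilon_{it}
$$

where \(\alpha_{s(i)}\) are the state fixed effects, \(\lambda_t\) the year fixed effects, and the \(\beta_k\) trace out the dynamic effect relative to the omitted year \(k = -1\).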
figsize = [1200, 600]

fit_twfe_event.iplot(
    coord_flip=False,
    title="TWFE-Estimator",
    figsize=figsize,
    xintercept=18.5,
    yintercept=0,
).show()
We can also marginalise over the details of the temporal trajectories by defining a similar estimation routine on the individuals and their treatment indicator.
fit_twfe = pf.feols(
    "dep_var ~ i(treat) | unit + year",
    df_het,
    vcov={"CRV1": "state"},
)

fit_twfe.tidy()
Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% | |
---|---|---|---|---|---|---|
Coefficient | ||||||
C(treat)[T.True] | 1.98254 | 0.019331 | 102.55618 | 0.0 | 1.943439 | 2.021642 |
De-meaning and TWFEs
We’ve seen above that the fixed-effect estimators in these DiD designs involve a lot of indicator variables. These are largely not the focus of the question at hand but are used exclusively to absorb the noise that takes away from our understanding of the treatment effect. We can achieve similar results with fewer parameters if we “de-mean” the focal variables by the group averages of the control factors state and year or unit. This operation makes for more efficient TWFE estimation routines and is provably a variety of Mundlak regression, as shown by Wooldridge (2021).
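For a balanced panel, sequentially subtracting the group means (as the helper below does) reproduces the familiar two-way within transformation

$$
\ddot{X}_{it} = X_{it} - \bar{X}_{i} - \bar{X}_{t} + \bar{X}
$$

so the de-meaned columns carry the same information as the corresponding fixed effects.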
We implement the de-meaning technique and show “equivalence by python” as follows:
def demean(df, col_to_demean, group):
    return df.assign(**{col_to_demean: (df[col_to_demean]
                                        - df.groupby(group)[col_to_demean].transform("mean")
                                        )})

def apply_demeaning(df_het, by=['state', 'year'], event=True):
    if event:
        d = pd.get_dummies(df_het['rel_year']).drop(-1, axis=1)
        d.columns = ['rel_year_' + str(c).replace('-', 'minus_') for c in d.columns]
    else:
        d = df_het[['treat']]
    d[by[0]] = df_het[by[0]]
    d[by[1]] = df_het[by[1]]
    for col in d.columns:
        if col in by:
            pass
        else:
            for c in by:
                d = demean(d, col, c)
    d = d.drop(by, axis=1)
    d['dep_var'] = df_het['dep_var']
    return d

## Demean the relative years features for event studies
d_event = apply_demeaning(df_het, by=['state', 'year'], event=True)

## Demean the treatment indicator for ATT estimation
d = apply_demeaning(df_het, by=['unit', 'year'], event=False)

d_event.head()
rel_year_minus_20.0 | rel_year_minus_19.0 | rel_year_minus_18.0 | rel_year_minus_17.0 | rel_year_minus_16.0 | rel_year_minus_15.0 | rel_year_minus_14.0 | rel_year_minus_13.0 | rel_year_minus_12.0 | rel_year_minus_11.0 | ... | rel_year_13.0 | rel_year_14.0 | rel_year_15.0 | rel_year_16.0 | rel_year_17.0 | rel_year_18.0 | rel_year_19.0 | rel_year_20.0 | rel_year_inf | dep_var | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.651694 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | ... | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | -0.28125 | 7.022709 |
1 | -0.002973 | 0.651694 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | ... | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | -0.28125 | 7.778628 |
2 | -0.002973 | -0.002973 | 0.651694 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | ... | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | -0.28125 | 8.436377 |
3 | -0.002973 | -0.002973 | -0.002973 | 0.651694 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | ... | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | -0.28125 | 7.191207 |
4 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | 0.651694 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | -0.002973 | ... | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | 0.000734 | -0.28125 | 6.918492 |
5 rows × 42 columns
We now have a data set with 42 columns focused on the treatment structures, but which implicitly controls for the variation due to state and time. We’ll see below that this representation of the data will correctly estimate the treatment effects.
Event Study and De-Meaning
Now we’ll use the de-meaned data structure above to estimate an event study using Bambi
.
x_cols = ' + '.join([c for c in d_event.columns if c != 'dep_var'])
model_twfe_event = bmb.Model(f"dep_var ~ + {x_cols}", d_event)
idata_twfe_event = model_twfe_event.fit(inference_method="nuts_numpyro",
                                        idata_kwargs={"log_likelihood": True},)

model_twfe_event
Formula: dep_var ~ + rel_year_minus_20.0 + rel_year_minus_19.0 + rel_year_minus_18.0 + rel_year_minus_17.0 + rel_year_minus_16.0 + rel_year_minus_15.0 + rel_year_minus_14.0 + rel_year_minus_13.0 + rel_year_minus_12.0 + rel_year_minus_11.0 + rel_year_minus_10.0 + rel_year_minus_9.0 + rel_year_minus_8.0 + rel_year_minus_7.0 + rel_year_minus_6.0 + rel_year_minus_5.0 + rel_year_minus_4.0 + rel_year_minus_3.0 + rel_year_minus_2.0 + rel_year_0.0 + rel_year_1.0 + rel_year_2.0 + rel_year_3.0 + rel_year_4.0 + rel_year_5.0 + rel_year_6.0 + rel_year_7.0 + rel_year_8.0 + rel_year_9.0 + rel_year_10.0 + rel_year_11.0 + rel_year_12.0 + rel_year_13.0 + rel_year_14.0 + rel_year_15.0 + rel_year_16.0 + rel_year_17.0 + rel_year_18.0 + rel_year_19.0 + rel_year_20.0 + rel_year_inf
Family: gaussian
Link: mu = identity
Observations: 46500
Priors:
target = mu
Common-level effects
Intercept ~ Normal(mu: 4.7062, sigma: 7.1549)
rel_year_minus_20.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_19.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_18.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_17.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_16.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_15.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_14.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_13.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_12.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_11.0 ~ Normal(mu: 0.0, sigma: 83.8218)
rel_year_minus_10.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_9.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_8.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_7.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_6.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_5.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_4.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_3.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_minus_2.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_0.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_1.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_2.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_3.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_4.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_5.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_6.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_7.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_8.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_9.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_10.0 ~ Normal(mu: 0.0, sigma: 60.2319)
rel_year_11.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_12.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_13.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_14.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_15.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_16.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_17.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_18.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_19.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_20.0 ~ Normal(mu: 0.0, sigma: 86.6805)
rel_year_inf ~ Normal(mu: 0.0, sigma: 15.2312)
Auxiliary parameters
sigma ~ HalfStudentT(nu: 4.0, sigma: 2.862)
------
* To see a plot of the priors call the .plot_priors() method.
* To see a summary or plot of the posterior pass the object returned by .fit() to az.summary() or az.plot_trace()
We can then plot the event-study and observe a similar pattern to the one observed with pyfixest.
def plot_event_study(idata, ax, color='blue', model='demeaned'):
    summary_df = az.summary(idata)
    cols = [i for i in summary_df.index if 'rel' in i]
    summary_df = summary_df[summary_df.index.isin(cols)]
    x = range(len(summary_df))
    ax.scatter(x, summary_df['mean'], label=model, color=color)
    ax.plot([x, x], [summary_df['hdi_3%'], summary_df['hdi_97%']], color=color)
    ax.set_title("Event Study")
    return ax
fig, ax = plt.subplots(figsize=(10, 5))
plot_event_study(idata_twfe_event, ax)
ax.legend();
Similarly, we can de-mean the simple treatment indicator using the group means and marginalise over time periods to find a single treatment effect estimate.
= bmb.Model(f"dep_var ~ treat", d)
model_twfe_trt_demean = model_twfe_trt_demean.fit( inference_method="nuts_numpyro",
idata_twfe_trt_demean ={"log_likelihood": True},) idata_kwargs
az.summary(idata_twfe_trt_demean)
mean | sd | hdi_3% | hdi_97% | mcse_mean | mcse_sd | ess_bulk | ess_tail | r_hat | |
---|---|---|---|---|---|---|---|---|---|
Intercept | 4.706 | 0.013 | 4.681 | 4.731 | 0.000 | 0.000 | 3435.0 | 2692.0 | 1.0 |
treat | 1.982 | 0.049 | 1.893 | 2.077 | 0.001 | 0.001 | 3203.0 | 2855.0 | 1.0 |
dep_var_sigma | 2.811 | 0.009 | 2.793 | 2.828 | 0.000 | 0.000 | 5532.0 | 2855.0 | 1.0 |
This again accords with the values reported by pyfixest, and it is equivalent to using a Mundlak device, as we can see below.
TWFE by Mundlak Device
Wooldridge recounts that the TWFE is equivalent to a Mundlak regression where:
[The Mundlak devices] \(\bar{X_{i}}\) and \(\bar{X_{t}}\) effectively act as sufficient statistics in accounting for any unit-specific heterogeneity and time-specific heterogeneity that is correlated with \(X_{it}\). Rather than having to include (N − 1) + (T − 1) control variables, it suffices to include 2K control variables, \(\bar{X_{i}}\), \(\bar{X_{t}}\) - Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators
We will see an example of this equivalence here.
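To make the equivalence concrete, here is a schematic of the two specifications for a single treatment indicator \(D_{it}\). This is our notation, sketching the balanced-panel case Wooldridge discusses, not a quotation of his formulas:

\[
\begin{aligned}
\text{TWFE:} \quad & y_{it} = \alpha_i + \lambda_t + \tau D_{it} + \varepsilon_{it} \\
\text{Mundlak:} \quad & y_{it} = \alpha + \tau D_{it} + \gamma \bar{D}_{i} + \delta \bar{D}_{t} + \varepsilon_{it}
\end{aligned}
\]

Both recover the same \(\tau\): the unit and time means \(\bar{D}_{i}\) and \(\bar{D}_{t}\) absorb exactly the variation that the unit and time dummies would. The code below adds unit_mean and time_mean as regressors to check this on our data.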
df_het['unit_mean'] = df_het.groupby('unit')['treat'].transform(np.mean)
df_het['time_mean'] = df_het.groupby('year')['treat'].transform(np.mean)
= bmb.Model(f"dep_var ~ treat", df_het)
model_twfe_trt = model_twfe_trt.fit( inference_method="nuts_numpyro",
idata_twfe_trt ={"log_likelihood": True})
idata_kwargs
= bmb.Model(f"dep_var ~ treat + unit_mean + time_mean", df_het)
model_twfe_trt_mundlak = model_twfe_trt_mundlak.fit( inference_method="nuts_numpyro",
idata_twfe_trt_mundlak ={"log_likelihood": True}) idata_kwargs
az.plot_forest([idata_twfe_trt_demean, idata_twfe_trt_mundlak, idata_twfe_trt],
               combined=True, var_names=['treat'],
               model_names=['De-meaned', 'Mundlak', 'Simple']);
The de-meaned TWFE estimator and the Mundlak specification result in identical estimates, and both differ from the naive estimate that fails to control for group-level confounds. The Mundlak specification is far easier to implement and offers alternative ways to parameterise a model capable of adjusting for the group-level confounding.
Functional Form and Richly Parameterised Regressions
The vanilla TWFE estimator can successfully recover the treatment effects in DiD designs and facilitate event studies of the same. However, the details of the estimation matter because this functional form is not always robust. Here we'll look at other options that recover substantially the same inferences and may prove more robust. The key to each is to articulate enough structural features to allow the model to adjust effects for the suspected group-level confounds. Most crucially, the model needs to express (or account for) the variation due to the relevant data generating process.
df_het['state_mean'] = df_het.groupby('state')['treat'].transform(np.mean)
df_het['time_mean'] = df_het.groupby('year')['treat'].transform(np.mean)
df_het['cohort_mean'] = df_het.groupby('group')['treat'].transform(np.mean)
= bmb.Model(f"dep_var ~ 1 + C(year) + state_mean + C(rel_year, Treatment(reference=-1)) ", df_het)
model_twfe_event_1 = model_twfe_event_1.fit( inference_method="nuts_numpyro",
idata_twfe_event_1 ={"log_likelihood": True})
idata_kwargs
= """ dep_var ~ 1 + time_mean:state_mean + C(rel_year, Treatment(reference=-1))"""
formula = smf.ols(formula, data=df_het).fit()
twfe_model_ols
twfe_model_ols.summary()= pd.DataFrame(twfe_model_ols.params, columns=['estimate']).iloc[1:-1]
param_est 'index_number'] = list(range(len(param_est)))
param_est[= (param_est.reset_index()
temp
)= temp[(~temp['index'].str.contains(':')) & (temp['index'].str.contains('rel'))]
param_est =True)
param_est.reset_index(inplace
= bmb.Model(f"dep_var ~ (1 | year) + state_mean + C(rel_year, Treatment(reference=-1)) ", df_het)
model_twfe_event_2 = model_twfe_event_2.fit( inference_method="nuts_numpyro",
idata_twfe_event_2 ={"log_likelihood": True}) idata_kwargs
Having estimated the various alternative model specifications, we compare each against our baseline de-meaned event study.
Code
fig, axs = plt.subplots(2, 2, figsize=(20, 10))
axs = axs.flatten()
plot_event_study(idata_twfe_event, axs[0], model='Manual DeMeaned')
plot_event_study(idata_twfe_event, axs[1], model='Manual DeMeaned')
plot_event_study(idata_twfe_event, axs[2], model='Manual DeMeaned')
plot_event_study(idata_twfe_event, axs[3], model='Manual DeMeaned')
plot_event_study(idata_twfe_event_1, axs[0], color='green', model='Fixed Effects Saturated Bayes')
plot_event_study(idata_twfe_event_2, axs[1], color='purple', model='Hierarchical Effects Saturated Bayes')
axs[2].scatter(param_est['index'], param_est['estimate'], color='red', label='Mundlak Interaction Features OLS')
tidy = fit_twfe_event.tidy()
xs = range(len(tidy))
tidy.reset_index(inplace=True)
axs[3].scatter(xs, tidy['Estimate'], color='orange', label='pyfixest TWFE')
axs[3].plot([xs, xs], [tidy['2.5%'], tidy['97.5%']], color='orange')
axs[2].set_xticks([])
axs[0].set_title("dep_var ~ 1 + C(year) + state_mean + C(rel_year, Treatment(reference=-1))")
axs[1].set_title("dep_var ~ (1 | year) + state_mean + C(rel_year, Treatment(reference=-1))")
axs[2].set_title("dep_var ~ 1 + time_mean:state_mean + C(rel_year, Treatment(reference=-1))")
axs[3].set_title("dep_var ~ i(rel_year, ref=-1.0) | state + year")
axs[0].legend()
axs[1].legend()
axs[2].legend()
axs[3].legend();
This suggests that there are a variety of functional forms, even among plain regression specifications, that control for different types of group-level confounding. In this example data, most of the specifications that control for time and state effects converge on similar answers. We will now switch to an example where the vanilla TWFE breaks down.
Issues with TWFE and Richly Parameterised Linear Models
We draw on the example from Pedro Sant’Anna here, where it is demonstrated that the vanilla TWFE estimator breaks down under various conditions. These conditions are often related to staggered roll-out of a treatment. A staggered roll-out induces dynamic changes in the composition of the treatment group over time. Appropriate inference needs to carefully control for the interaction effects due to the staggered treatment. In particular, we need our model of the situation to reflect this aspect of the data generating process.
Let’s generate some data.
true_mu = 1

def make_data(nobs=1000, nstates=40):
    ids = list(range(nobs))
    states = np.random.choice(range(nstates), size=nobs, replace=True)
    unit = pd.DataFrame({'unit': ids,
                         'state': states,
                         'unit_fe': np.random.normal(states/5, 1, size=nobs),
                         'mu': true_mu})
    year = pd.DataFrame({'year': pd.date_range('1980-01-01', '2010-01-01', freq='y'),
                         'year_fe': np.random.normal(0, 1, 30)})
    year['year'] = year['year'].dt.year

    treat_taus = pd.DataFrame({'state': np.random.choice(range(nstates), size=nstates, replace=False),
                               'cohort_year': np.sort([1986, 1992, 1998, 2004]*10)})

    cross_join = pd.DataFrame([row for row in product(range(nobs), year['year'].unique())],
                              columns=['unit', 'year'])
    cross_join = cross_join.merge(unit, how='left', left_on='unit', right_on='unit')
    cross_join = cross_join.merge(year, how='left', left_on='year', right_on='year')
    cross_join = cross_join.merge(treat_taus, how='left', left_on='state', right_on='state')
    cross_join = cross_join.assign(
        error=np.random.normal(0, 1, len(cross_join)),
        treat=lambda x: np.where(x['year'] >= x['cohort_year'], 1, 0)
    )
    cross_join = cross_join.assign(tau=np.where(cross_join['treat'] == 1, cross_join['mu'], 0),
                                   ).assign(year_fe=lambda x: x['year_fe'] + 0.1*(x['year'] - x['cohort_year']))
    cross_join['tau_cum'] = cross_join.groupby('unit')['tau'].transform(np.cumsum)
    cross_join = cross_join.assign(dep_var=lambda x: 2010 - x['cohort_year'] +
                                   x['unit_fe'] + x['year_fe'] + x['tau_cum'] + x['error'])
    cross_join['rel_year'] = cross_join['year'] - cross_join['cohort_year']

    return cross_join
sim_df = make_data(500, 40)
sim_df.head()
unit | year | state | unit_fe | mu | year_fe | cohort_year | error | treat | tau | tau_cum | dep_var | rel_year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1980 | 12 | 2.770007 | 1 | -1.443538 | 1998 | -0.883541 | 0 | 0 | 0 | 12.442927 | -18 |
1 | 0 | 1981 | 12 | 2.770007 | 1 | -3.140326 | 1998 | 1.044140 | 0 | 0 | 0 | 12.673821 | -17 |
2 | 0 | 1982 | 12 | 2.770007 | 1 | -2.430673 | 1998 | 0.276021 | 0 | 0 | 0 | 12.615355 | -16 |
3 | 0 | 1983 | 12 | 2.770007 | 1 | -4.253994 | 1998 | 0.181049 | 0 | 0 | 0 | 10.697061 | -15 |
4 | 0 | 1984 | 12 | 2.770007 | 1 | -3.347429 | 1998 | -2.104055 | 0 | 0 | 0 | 9.318523 | -14 |
We can now plot the staggered nature of the imagined treatment regime.
Code
fig, ax = plt.subplots(figsize=(10, 6))

for unit in sim_df['unit'].unique()[0:100]:
    temp = sim_df[sim_df['unit'] == unit]
    ax.plot(temp['year'], temp['dep_var'], alpha=0.1, color='grey')

sim_df.groupby(['cohort_year', 'year'])[['dep_var']].mean().reset_index().pivot(
    index='year', columns='cohort_year', values='dep_var').plot(ax=ax)
ax.axvline(1986)
ax.axvline(1992, color='orange')
ax.axvline(1998, color='green')
ax.axvline(2004, color='red')
ax.set_title("Simulated Cohorts Homogenous Treatment Effects \n All Eventually Treated", fontsize=20)
ax.legend()
This data will present problems for the vanilla TWFE estimator. The issue stems from the fact that each cohort eventually receives the treatment, so there are periods in the data when no group remains in the “control” group. We can see how this plays out with the de-meaning TWFE strategy.
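Before running the estimator, one way to see the compositional problem directly is to tabulate the share of treated units by year and cohort. This is a quick sketch against the sim_df simulated above, not part of the original estimation pipeline.

# Share of treated observations for each cohort (columns) in each year (rows).
# Rows where every entry equals 1.0 are periods with no untreated comparison units left.
treat_shares = (sim_df.groupby(['year', 'cohort_year'])['treat']
                .mean()
                .unstack('cohort_year'))
print(treat_shares.tail(8))

From 2004 onwards every cohort is treated, so the later periods offer no clean comparison group.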
fit_twfe = pf.feols(
    "dep_var ~ i(rel_year, ref=-1.0) | state + year",
    sim_df,
    vcov={"CRV1": "state"},
)

figsize = [1200, 400]

fit_twfe.iplot(
    coord_flip=False,
    title="TWFE-Estimator",
    figsize=figsize,
    xintercept=18.5,
    yintercept=0,
).show()
/Users/nathanielforde/mambaforge/envs/pymc_causal/lib/python3.11/site-packages/pyfixest/estimation/feols_.py:1987: UserWarning:
The following variables are collinear: ['C(rel_year, contr.treatment(base=-1.0))[T.18]', 'C(rel_year, contr.treatment(base=-1.0))[T.19]', 'C(rel_year, contr.treatment(base=-1.0))[T.20]', 'C(rel_year, contr.treatment(base=-1.0))[T.21]', 'C(rel_year, contr.treatment(base=-1.0))[T.22]', 'C(rel_year, contr.treatment(base=-1.0))[T.23]'].
The variables are dropped from the model.
warnings.warn(
This is not remotely close to the expected pattern. For contrast, consider an alternative estimator.
fit_lpdid = lpdid(
    data=sim_df,
    yname="dep_var",
    gname="cohort_year",
    tname="year",
    idname="unit",
    vcov={"CRV1": "state"},
    pre_window=-17,
    post_window=17,
    att=False,
)

fit_lpdid.iplot(
    coord_flip=False,
    title="Local-Projections-Estimator",
    figsize=figsize,
    yintercept=0,
    xintercept=18.5,
).show()
The initial TWFE estimate is utterly skewed and entirely incorrect. Something dreadful has gone wrong under the hood. For contrast, we’ve included the Local Projections estimator from pyfixest to show that we can recover the actual treatment effect in this event study with alternative strategies. However, there is more machinery involved in the local-projections estimator.

Instead we want to show how to use Mundlak devices to recover more reasonable estimates. No fancy estimators, just more regressions.
Wooldridge argues for this approach when he states:
[T]here is nothing inherently wrong with TWFE, which is an estimation method. The problem with how TWFE is implemented in DiD settings is that it is applied to a restrictive model… - Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators
We want to explore different ways of isolating the treatment effects under a variety of model specifications that control for time-constant treatment intensities, covariates, and interactions between them. That is, we want to better express the data generating process in our model of the situation.
Fitting a Variety of Models
Consider the following model specifications.
sim_df['unit_mean'] = sim_df.groupby('unit')['treat'].transform(np.mean)
sim_df['state_mean'] = sim_df.groupby('state')['treat'].transform(np.mean)
sim_df['cohort_mean'] = sim_df.groupby('cohort_year')['treat'].transform(np.mean)
sim_df['time_mean'] = sim_df.groupby('year')['treat'].transform(np.mean)

model_twfe = bmb.Model(
    """dep_var ~ 1 + time_mean:state_mean + C(rel_year, Treatment(reference=-1))""", sim_df)
idata_twfe = model_twfe.fit(
    inference_method="nuts_numpyro",
    idata_kwargs={"log_likelihood": True})

model_twfe1 = bmb.Model(
    """dep_var ~ 1 + time_mean*state_mean + C(cohort_year) + C(rel_year, Treatment(reference=-1))""", sim_df)
idata_twfe1 = model_twfe1.fit(
    inference_method="nuts_numpyro",
    idata_kwargs={"log_likelihood": True})

model_twfe2 = bmb.Model(
    """dep_var ~ 1 + cohort_mean:state_mean + C(rel_year, Treatment(reference=-1))""", sim_df)
idata_twfe2 = model_twfe2.fit(
    inference_method="nuts_numpyro",
    idata_kwargs={"log_likelihood": True})

model_twfe3 = bmb.Model(
    """dep_var ~ (1 | year) + state_mean + C(rel_year, Treatment(reference=-1))""", sim_df)
idata_twfe3 = model_twfe3.fit(
    inference_method="nuts_numpyro",
    idata_kwargs={"log_likelihood": True})
The latter models will recover the appropriate treatment effects, with slight variations due to the adapted functional form, while the initial naive specification fails to capture the pattern correctly.
Code
fig, axs = plt.subplots(4, 1, figsize=(10, 15), sharey=True)
axs = axs.flatten()
plot_event_study(idata_twfe, ax=axs[0], model='Additive Mundlak')
plot_event_study(idata_twfe1, ax=axs[1], color='red', model='Mundlak State & Time Interactions')
plot_event_study(idata_twfe2, ax=axs[2], color='green', model='Mundlak Cohort & State Interactions')
plot_event_study(idata_twfe3, ax=axs[3], color='purple', model='Hierarchical Time + Mundlak State')
axs[0].set_title(str(model_twfe.formula))
axs[1].set_title(str(model_twfe1.formula))
axs[2].set_title(str(model_twfe2.formula))
axs[3].set_title(str(model_twfe3.formula))
xs = np.linspace(0, 45, 45)
xs_centered = xs - 22
true_effect = np.where(xs_centered <= 0, 0, (xs_centered + 1))
axs[0].plot(xs, true_effect, color='k', alpha=0.6, label='True effect')
axs[1].plot(xs, true_effect, color='k', alpha=0.6, label='True effect')
axs[2].plot(xs, true_effect, color='k', alpha=0.6, label='True effect')
axs[3].plot(xs, true_effect, color='k', alpha=0.6, label='True effect')
axs[0].legend()
axs[1].legend()
axs[2].legend()
axs[3].legend();
Note how the naive Mundlak approach also exhibits odd behaviour, as we saw in the TWFE estimation routine above. Adding additional interactions and controlling for the staggered launch dates helps isolate the real pattern in the data. We’ve seen approximate success in a number of these richly parameterised versions of simple event study regressions. But the idiosyncrasies of any one sample will contort and distort the estimates away from the true values. We might hope for more stability in expectation over repeated draws from this kind of data-generating process. Consider the following bootstrapping estimation routine.
Evaluating Robustness of Functional Form
Code
fig, axs = plt.subplots(2, 1, figsize=(10, 15))
axs = axs.flatten()
event_coefs = []
event_coefs1 = []

def fit_ols(formula, df_het):
    twfe_model_ols = smf.ols(formula, data=df_het).fit()
    twfe_model_ols.summary()
    param_est = pd.DataFrame(twfe_model_ols.params, columns=['estimate']).iloc[1:-1]
    param_est['index_number'] = list(range(len(param_est)))
    temp = param_est.reset_index()
    param_est = temp[(~temp['index'].str.contains(':')) & (temp['index'].str.contains('rel'))]
    param_est.reset_index(inplace=True)
    try:
        param_est['rel_year'] = (param_est['index'].str.split('[', expand=True)[1]
                                 .str.replace('T.', '').str.replace(']', ''))
    except Exception as e:
        param_est['rel_year'] = param_est['level_0'] - 22
    param_est['rel_year'] = param_est['rel_year'].astype(int)
    param_est.set_index('rel_year', inplace=True)
    return param_est

for i in range(100):
    df_het = make_data(500, 40)
    df_het['state_mean'] = df_het.groupby('state')['treat'].transform(np.mean)
    df_het['time_mean'] = df_het.groupby('year')['treat'].transform(np.mean)
    df_het['cohort_mean'] = df_het.groupby('cohort_year')['treat'].transform(np.mean)
    df_het['unit_mean'] = df_het.groupby('unit')['treat'].transform(np.mean)

    formula = """ dep_var ~ 1 + time_mean + state_mean + cohort_mean + C(rel_year, Treatment(reference=-1)) """
    formula1 = """ dep_var ~ 1 + C(cohort_year) + time_mean:unit_mean + C(rel_year, Treatment(reference=-1))"""

    param_est = fit_ols(formula, df_het)
    axs[0].plot(param_est['estimate'], color='blue', alpha=0.3)
    event_coefs.append(param_est['estimate'])
    param_est1 = fit_ols(formula1, df_het)
    axs[1].plot(param_est1['estimate'], color='blue', alpha=0.3)
    event_coefs1.append(param_est1['estimate'])

bootstrap_df = pd.DataFrame(event_coefs)
mean_df = bootstrap_df.mean(axis=0)
mean_df.index = param_est.index
mean_df.plot(ax=axs[0], color='red')
axs[0].set_title(f"""Bootstrapped Estimates of Demean Event Study DGP \n {formula}""")

bootstrap_df1 = pd.DataFrame(event_coefs1)
mean_df = bootstrap_df1.mean(axis=0)
mean_df.index = param_est1.index
mean_df.plot(ax=axs[1], color='red')
axs[1].set_title(f"""Bootstrapped Estimates of Event Study DGP \n {formula1}""");
Here we see the importance of robustness checks on estimation techniques. In the top plot we see something like the correct pattern emerging in expectation, but absurd estimates occurring in individual draws, whereas the bottom plot appears much more consistent across multiple draws from the data generating process. This points to something about the justification required for a credible result. Asymptotic accuracy is generally not sufficient for inducing credibility in any particular analysis of finite data.
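To put a rough number on that instability, we can compare the spread of the coefficient paths across the repeated draws. This is a small sketch reusing the event_coefs and event_coefs1 lists collected in the loop above, not part of the original routine.

# Cross-draw standard deviation of each event-study coefficient;
# a larger typical spread indicates a less stable functional form.
spread_naive = pd.DataFrame(event_coefs).std(axis=0)
spread_interacted = pd.DataFrame(event_coefs1).std(axis=0)
print(spread_naive.median(), spread_interacted.median())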
We can also compare these models on standard adequacy measures.
az.compare({'fe_mundlak_naive': idata_twfe,
            'mundlak_state_time_interactions_cohort': idata_twfe1,
            'mundlak_cohort_state_interactions': idata_twfe2,
            'mundlak_state_hierarchical_year': idata_twfe3})
rank | elpd_loo | p_loo | elpd_diff | weight | se | dse | warning | scale | |
---|---|---|---|---|---|---|---|---|---|
mundlak_state_hierarchical_year | 0 | -35912.100213 | 71.825690 | 0.000000 | 0.801341 | 72.022818 | 0.000000 | False | log |
mundlak_state_time_interactions_cohort | 1 | -36427.310049 | 53.130023 | 515.209836 | 0.198691 | 74.337703 | 41.178348 | False | log |
mundlak_cohort_state_interactions | 2 | -38254.123775 | 47.009205 | 2342.023562 | 0.000023 | 78.060258 | 49.592305 | False | log |
fe_mundlak_naive | 3 | -46324.432551 | 38.486619 | 10412.332337 | 0.000000 | 86.756424 | 97.412077 | False | log |
Again, we can plot the predictive performance of these models. Here we compare two via posterior predictive checks. Notably, the model with the better predictive performance also correctly estimates the treatment effects in the plots above.
model_twfe3.predict(idata_twfe3, kind='pps')
model_twfe.predict(idata_twfe, kind='pps')
fig, axs = plt.subplots(2, 1, figsize=(10, 12))
axs = axs.flatten()
az.plot_ppc(idata_twfe3, ax=axs[0])
axs[0].set_title("PPC checks for Mundlak State Hierarchical Year")
axs[1].set_title("PPC checks for FE Mundlak Naive")
az.plot_ppc(idata_twfe, ax=axs[1]);
It’s certainly possible that parameter-recovery accuracy and predictive performance come apart. But in the absence of knowledge of the true data generating process, the route to credible models is via their calibration against real data.
Conclusion
“A schema may be transported almost anywhere. The choice of territory for invasion is arbitrary; but the operation within that territory is almost never completely so.” - Nelson Goodman in Languages of Art
We’ve now seen various cases of group-level confounding and strategies to address these issues. We’ve seen how they give rise to non-trivial questions about the effects of policy, and what can be learned in the face of group-level confounds. The case of TWFE’s estimation is especially instructive - here we have an estimation technique applied well in limited circumstances, which ran into trouble when applied in contexts with a diverging data generating process. We all lose our way some of the time, but this misstep led us off the map - we’d followed the trail well past the intended territory.
Statistical models do not mirror the world; they project patterns in the data. Each rupture between model and world is a gift that highlights deficits in our understanding - an impropriety of projection. It is therefore a delightful reversal to see, in Wooldridge’s proof, that the issue is not with the tooling but with the thoughtless application of the tool. The unreasonable effectiveness of regression modelling is surely driven by projections of linearity onto the world, but it remains compelling as an approximation when used well. There are architectural constraints in causal inference - hard, unyielding facts that shape the contours of our design and structure the steps we can take. The more flexibly we can encode those structures in our models, the more compelling our models will be. It is this recognition of the dual fit requirements between world and model - the focus on the conditions for identifiability - that lends causal inference any kind of credibility. The expressive capacity of generalised linear modelling enables us to articulate too many candidate pictures of the world. It is the business of science to cull those pictures, to find only the plausible representations of reality.
Citation
@online{forde2024,
author = {Nathaniel Forde},
title = {Freedom, {Hierarchies} and {Confounded} {Estimates}},
date = {2024-07-01},
langid = {en}
}