Session 3: Common Extensions for Empirical Work
Introduction to Difference-in-Differences
Including Covariates in the Parallel Trends Assumption
Common Extensions for Empirical Work
Dealing with More Complicated Treatment Regimes
Alternative Identification Strategies
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)
Repeated Cross Sections
Unbalanced Panel Data
Alternative Comparison Groups
Anticipation
Alternative Base Periods
Dealing with Violations of Parallel Trends
Sampling Weights
Inference / Clustered Standard Errors
DiD often works just fine with repeated cross sections data. \[ \begin{aligned} ATT &= \E[Y_{t=2}-Y_{t=1} | G=1] - \E[Y_{t=2}-Y_{t=1} | G=0] \\ &= \Big( \E[Y_{t=2} | G=1] - \E[Y_{t=1} | G=1] \Big) - \Big(\E[Y_{t=2} | G=0] - \E[Y_{t=1} | G=0]\Big) \end{aligned} \]
which you can compute with repeated cross sections data
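The decomposition above can be computed directly from four group-by-period means. Here is a minimal base-R sketch with simulated repeated cross sections (the data-generating process and all numbers are hypothetical, chosen so the true ATT is 1):

```r
# simulated repeated cross sections: a *new* sample of units in each period
set.seed(1)
n <- 5000
make_period <- function(t) {
  G <- rbinom(n, 1, 0.5)                 # group membership, redrawn each period
  y <- 2 * G + 0.5 * t + rnorm(n)        # untreated outcomes satisfy parallel trends
  y <- y + 1 * (G == 1 & t == 2)         # true ATT of 1 in period t = 2
  data.frame(y = y, G = G, t = t)
}
rcs <- rbind(make_period(1), make_period(2))
m <- function(g, t) mean(rcs$y[rcs$G == g & rcs$t == t])
# ATT = (E[Y2|G=1] - E[Y1|G=1]) - (E[Y2|G=0] - E[Y1|G=0])
att_hat <- (m(1, 2) - m(1, 1)) - (m(0, 2) - m(0, 1))
att_hat  # close to the true value of 1
```

Note that no unit appears in both periods, so no within-unit differencing is possible; only the group-by-period means are needed.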
Additional Discussion:
The issues that arise with unbalanced panel data are very similar to those for other panel data methods
If not observing a unit in a particular time period (e.g., due to attrition) is “at random”, then everything should be fine
If the missing data are not at random (e.g., attrition is related to the treatment), then this can be a major issue
Additional Discussion:
set.seed(123)
# randomly drop 100 observations
this_data <- data2[sample(1:nrow(data2), nrow(data2)-100),]
attgt_up <- did::att_gt(yname="lemp",
idname="id",
gname="G",
tname="year",
data=this_data,
control_group="nevertreated",
base_period="universal",
panel=TRUE,
allow_unbalanced_panel=TRUE)
ggdid(aggte(attgt_up, type="dynamic"))
So far, we have used the never-treated group as the comparison group. But there are other possibilities:
The not-yet-treated group (used in the code below)
The pte package, which allows for selecting the comparison group in fully customizable ways
# have to do a little hack to get this to work
# drop never-treated group
this_data <- subset(data2, G != 0)
# note: this causes us to lose the 2006 group
# as it no longer has a valid comparison group
# and we lose some periods for the 2004 group
# because it only has a valid comparison group up to 2005
attgt_nye <- did::att_gt(yname="lemp",
idname="id",
gname="G",
tname="year",
data=this_data,
control_group="notyettreated",
base_period="universal")
ggdid(aggte(attgt_nye, type="dynamic"))
Anticipation occurs when treatments start affecting outcomes in periods before the treatment actually occurs. Examples:
Policies that are announced before they are implemented
Ashenfelter’s dip—for applications like job training, often people participate because they experience a shock to their labor market prospects
Dealing with Anticipation:
Usually easy to deal with, as long as there is only “limited anticipation”
In particular, could assume that \(Y_{it} = Y_{it}(0)\) for \(t < G_i - a\), then the new base period is \(g-(a+1)\) for group \(g\), and \[ATT(g,t) = \E[Y_t - Y_{g-(a+1)} | G=g] - \E[Y_t - Y_{g-(a+1)} | G=0]\]
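The effect of shifting the base period from \(g-1\) to \(g-(a+1)\) can be seen in a base-R sketch with simulated data (hypothetical numbers: one treated group with \(g=3\), one anticipation period with effect 0.5, and a full effect of 1):

```r
# one treated group (g = 3) with anticipation in period 2 (a = 1)
set.seed(2)
n <- 4000
G <- rbinom(n, 1, 0.5)                   # 1 = treated in period 3, 0 = never treated
eta <- rnorm(n)                          # unit fixed effects
y <- sapply(1:3, function(t) {
  y0 <- eta + 0.5 * t + rnorm(n)         # parallel trends in untreated outcomes
  y0 + 0.5 * (G == 1 & t == 2) +         # anticipation effect in t = 2
       1.0 * (G == 1 & t == 3)           # full effect of 1 in t = 3
})
# base period g - 1 = 2 is contaminated by anticipation:
att_naive <- mean(y[G == 1, 3] - y[G == 1, 2]) - mean(y[G == 0, 3] - y[G == 0, 2])
# base period g - (a + 1) = 1 recovers the full effect:
att_adj <- mean(y[G == 1, 3] - y[G == 1, 1]) - mean(y[G == 0, 3] - y[G == 0, 1])
round(c(naive = att_naive, adjusted = att_adj), 2)  # roughly 0.5 and 1
```

In the did package, this adjustment corresponds to setting the anticipation argument in att_gt.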
In all the results above, we have used a universal base period.
This means that all differences (including pre-treatment periods) are relative to period \((g-1)\)
(In my view) this is a legacy of implementing DID in a regression framework, where it is not clear if it is possible to make a different choice
The main alternative is a varying base period.
No difference in post-treatment periods
In pre-treatment periods, we compare period \(t\) to \(t-1\) for all \(t < g\)
In either case, the information from pre-treatment periods is the same
Statistical tests of parallel trends in pre-treatment periods will give the same results
However, they can look quite different in an event study plot
Choosing between different base periods comes down to the type of information that you would like the event study to highlight
The case for a universal base period: all pre-treatment estimates are relative to the same period, so sustained trend differences are easy to see, and pre- and post-treatment estimates are directly comparable
The case for a varying base period: pre-treatment estimates show period-by-period changes, which makes it easier to see exactly when trends start to diverge (e.g., anticipation effects)
Call:
did::att_gt(yname = "lemp", tname = "year", idname = "id", gname = "G",
data = data2, control_group = "nevertreated", base_period = "universal")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Group-Time Average Treatment Effects:
Group Time ATT(g,t) Std. Error [95% Simult. Conf. Band]
2004 2003 0.0000 NA NA NA
2004 2004 -0.0327 0.0210 -0.0858 0.0205
2004 2005 -0.0683 0.0218 -0.1232 -0.0133 *
2004 2006 -0.1234 0.0219 -0.1786 -0.0681 *
2004 2007 -0.1311 0.0230 -0.1891 -0.0731 *
2006 2003 -0.0341 0.0114 -0.0630 -0.0052 *
2006 2004 -0.0167 0.0082 -0.0375 0.0041
2006 2005 0.0000 NA NA NA
2006 2006 -0.0194 0.0093 -0.0430 0.0042
2006 2007 -0.0661 0.0093 -0.0895 -0.0427 *
---
Signif. codes: `*' confidence band does not cover 0
P-value for pre-test of parallel trends assumption: 0.01325
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
Call:
did::att_gt(yname = "lemp", tname = "year", idname = "id", gname = "G",
data = data2, control_group = "nevertreated", base_period = "varying")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Group-Time Average Treatment Effects:
Group Time ATT(g,t) Std. Error [95% Simult. Conf. Band]
2004 2004 -0.0327 0.0223 -0.0882 0.0229
2004 2005 -0.0683 0.0222 -0.1235 -0.0130 *
2004 2006 -0.1234 0.0216 -0.1771 -0.0696 *
2004 2007 -0.1311 0.0244 -0.1917 -0.0705 *
2006 2004 0.0174 0.0096 -0.0066 0.0414
2006 2005 0.0167 0.0085 -0.0046 0.0380
2006 2006 -0.0194 0.0090 -0.0418 0.0030
2006 2007 -0.0661 0.0093 -0.0893 -0.0429 *
---
Signif. codes: `*' confidence band does not cover 0
P-value for pre-test of parallel trends assumption: 0.01325
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
To me, it looks like the universal base period would be a better choice in our application
It does not look like there are anticipation effects.
It does look like there are trend differences between the treated and untreated units.
Parallel trends assumptions don’t automatically hold in applications with repeated observations over time
DID + pre-tests are a very powerful/useful approach to “validating” the parallel trends assumption. But they are less than a full test of parallel trends.
Just because parallel trends holds in pre-treatment periods doesn’t mean it holds in post-treatment periods
Pre-tests can suffer from low power (Roth (2022))
References: Manski and Pepper (2018), Rambachan and Roth (2023), Ban and Kedagni (2022)
Two versions of sensitivity analysis in RR:
Relative magnitudes: violations of parallel trends in post-treatment periods are assumed to be no larger than \(\bar{M}\) times the largest violation in pre-treatment periods
Smoothness: violations of parallel trends are assumed to evolve smoothly, with the change in the trend difference across consecutive periods bounded by \(M\)
Next slides: show these results in minimum wage application for the “on impact” effect of the treatment
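To give a flavor of the relative-magnitudes idea, here is a deliberately stylized base-R sketch with hypothetical numbers. This is not the actual Rambachan and Roth procedure (which is implemented with proper inference in the HonestDiD R package); it only traces out how bounds on the on-impact effect widen as the allowed violation grows:

```r
# stylized relative magnitudes: bound the post-treatment violation of parallel
# trends by Mbar times the largest pre-treatment violation
att_hat  <- -0.03                 # hypothetical on-impact event-study estimate
pre_viol <- c(-0.017, 0.012)      # hypothetical pre-treatment trend differences
Mbar     <- seq(0, 2, by = 0.5)
lower <- att_hat - Mbar * max(abs(pre_viol))
upper <- att_hat + Mbar * max(abs(pre_viol))
cbind(Mbar, lower, upper)         # bounds widen as Mbar grows; at Mbar = 0
                                  # they collapse to the point estimate
```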
Sampling weights are common in DID applications
Can come from survey design (e.g., oversampling certain subpopulations)
Can also arise from working with aggregated data (e.g., counties) where we might want to count larger counties more than smaller counties
The most well-known paper (well…to me) on sampling weights is Solon, Haider, and Wooldridge (2015)
Tentative Heuristic Advice:
If you have unit-level data (e.g., job displacement example) and there are sampling weights, use them
If you have aggregate data (e.g., county-level data), but you are interested in effects at level of underlying unit (e.g., person-level), then you should consider using sampling weights
# create weights based on population
data2$pop <- exp(data2$lpop)
data2$avg_pop <- BMisc::get_Yibar(data2, "id", "pop")
attgt_sw <- did::att_gt(yname="lemp",
idname="id",
gname="G",
tname="year",
data=data2,
control_group="nevertreated",
base_period="universal",
weightsname="avg_pop")
ggdid(aggte(attgt_sw, type="dynamic"))
There are a number of complications that can arise for inference in DID settings:
Serial correlation
Fixed population inference
Small number of treated units
Clustered treatment assignment
Multiple hypothesis testing
Different issues related to inference arise given different types of data, all of which are common in DID applications.
Here are a few leading examples:
Unit-level treatments, units sampled from a large population.
Aggregate treatments, aggregate data
Example: State-level policy, studied with state-level data
Also: we try to answer the question of what happens if you observe the entire population
Aggregate treatments, underlying unit-level data
Probably the most common inference issue in DID settings is serial correlation.
Sample consists of:
\[\{Y_{i1}, Y_{i2}, \ldots, Y_{iT}, D_{i1}, D_{i2}, \ldots D_{iT} \}_{i=1}^n\]
which are iid across units, drawn from a “super-population”, and where the number of units in each group is “large”. This is a common sampling scheme in panel data applications (e.g., job displacement).
This sampling scheme allows for outcomes to be arbitrarily correlated across time periods.
\[Y_{it}(0) = \theta_t + \eta_i + e_{it}\]
Ignoring serial correlation can lead to incorrect standard errors and confidence intervals (Bertrand, Duflo, and Mullainathan (2004)).
Instead of modeling the serial correlation, it is most common to cluster at the unit level (i.e., allow for arbitrary serial correlation within units).
Most (all?) software implementations can accommodate serial correlation (often by default).
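The consequence of ignoring serial correlation can be seen in a small base-R simulation (all numbers hypothetical): with unit effects in the errors and a unit-level treatment, pooled “iid” standard errors are too small, while collapsing to unit means, which is equivalent here to clustering at the unit level since treatment is constant within units, gives the right answer:

```r
set.seed(3)
n <- 200; tt <- 5                           # 200 units, 5 time periods
eta <- rnorm(n)                             # unit effects -> serial correlation
D   <- rep(rbinom(n, 1, 0.5), each = tt)    # unit-level treatment, constant over time
y   <- rep(eta, each = tt) + rnorm(n * tt)  # no treatment effect at all
id  <- rep(1:n, each = tt)
# naive "iid" standard error for the difference in means, pooled over n*tt obs:
se_iid <- sqrt(var(y[D == 1]) / sum(D == 1) + var(y[D == 0]) / sum(D == 0))
# cluster-robust version: collapse to unit means first
ybar <- tapply(y, id, mean)
Dbar <- tapply(D, id, mean)
se_cl <- sqrt(var(ybar[Dbar == 1]) / sum(Dbar == 1) +
              var(ybar[Dbar == 0]) / sum(Dbar == 0))
c(iid = se_iid, clustered = se_cl)  # se_iid understates the uncertainty
```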
The previous discussion has emphasized traditional sampling uncertainty arising from drawing a sample from an underlying super-population.
In many DID applications, we observe the entire population of interest (e.g., all 50 states).
One possibility is to condition on (i.e., treat as fixed/non-random) \((G_i, \theta_t, \eta_i)\) while treating as random \(e_{it}\).
Intuition: repeated sampling thought experiment where we redraw \(e_{it} \sim (\mu_i, \sigma^2_i)\) for each unit \(i\) and time period \(t\) while unit fixed effects, time fixed effects, and treatment status are held fixed.
Adjusted target parameter: mean of the finite-sample \(ATT\) across repeated samples \[ATT(g,t) := \frac{1}{n_g} \sum_{i \,:\, G_i=g} \E[Y_{it} - Y_{it}(0)]\]
Adjusted parallel trends: mean of finite-sample parallel trends across repeated samples \[\frac{1}{n_g} \sum_{i \,:\, G_i=g} \E[\Delta Y_{it}(0)] - \frac{1}{n_\infty} \sum_{i \,:\, G_i=\infty} \E[\Delta Y_{it}(0)]\]
\(\implies\) somewhat different interpretation, but the same \(\widehat{ATT}\) can be used, and constructing unconditional standard errors (i.e., the same as before) can be conservative in this case. See Borusyak, Jaravel, and Spiess (2024), for example.
What if the number of treated units is small (e.g., 1 treated unit)?
Even with a single treated unit, we can often come up with an unbiased (though not consistent) estimator of the \(ATT\).
For inference, there are a number of ideas, often involving some kind of permutation test.
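One such idea, sketched in base R with simulated data (hypothetical setup: two periods, one treated unit, effect of 3): treat each untreated unit in turn as a placebo “treated” unit and compare the actual estimate with the placebo distribution:

```r
set.seed(5)
n <- 50
delta <- rnorm(n)                 # each unit's change in outcome, Y_i2 - Y_i1
delta[1] <- delta[1] + 3          # unit 1 is the (only) treated unit, effect = 3
att_hat <- delta[1] - mean(delta[-1])   # unbiased (though not consistent) estimate
# placebo estimates: pretend each untreated unit were the treated one
placebo <- sapply(2:n, function(i) delta[i] - mean(delta[-c(1, i)]))
pval <- mean(abs(placebo) >= abs(att_hat))  # permutation-style p-value
round(c(att = att_hat, p = pval), 3)
```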
Many DID applications have clustered treatment assignment—all units within the same cluster (e.g., a state) have the same treatment status/timing.
The most common approach in this setting is to cluster at the level of the treatment.
To rationalize conducting inference in this way typically requires a large number of both treated and untreated clusters.
Can we conduct inference in cases where there are only two clusters (but we observe underlying data)?
Yes, but it may require additional assumptions.
Let’s start with the setting where there are exactly two clusters. Then (for simplicity just focusing on untreated potential outcomes), \[\begin{align*} Y_{ij,t}(0) &= \theta_t + \eta_i + e_{ij,t} \\ &= \theta_t + \eta_i + \underbrace{\nu_{j,t} + \epsilon_{ij,t}}_{=e_{ij,t}} \end{align*}\] where \(\nu_{j,t}\) is a cluster-specific time-varying error term and \(\epsilon_{ij,t}\) are idiosyncratic, time-varying unobservables (possibly serially correlated but independent across units).
\(\nu_{j,t}\) is often viewed as a common cluster-specific shock that affects all units in the cluster (e.g., some other state-level policy).
For inference with two clusters, we need \(\nu_{j,t}=0\).
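To see why, here is a base-R sketch (hypothetical data-generating process, no treatment effect): with cluster-by-period shocks \(\nu_{j,t}\) present, the two-cluster DiD estimator's sampling variability barely shrinks as the number of units per cluster grows, because the shocks do not average out:

```r
set.seed(6)
one_did <- function(n, sd_nu) {
  nu <- matrix(rnorm(4, sd = sd_nu), nrow = 2)   # nu[j, t]: cluster-by-period shocks
  d <- expand.grid(i = 1:n, j = 1:2, t = 1:2)
  d$y <- nu[cbind(d$j, d$t)] + rnorm(nrow(d))    # cluster 1 "treated" in t = 2, no effect
  with(d, (mean(y[j == 1 & t == 2]) - mean(y[j == 1 & t == 1])) -
          (mean(y[j == 2 & t == 2]) - mean(y[j == 2 & t == 1])))
}
# sd of the DiD estimator across repeated samples, small vs. large clusters
sd_small <- sd(replicate(500, one_did(n = 25,   sd_nu = 0.5)))
sd_big   <- sd(replicate(500, one_did(n = 2500, sd_nu = 0.5)))
round(c(n25 = sd_small, n2500 = sd_big), 2)  # barely shrinks: nu dominates
```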
Suppose that we are studying a policy that is implemented in a few states.
Option 1: Include a large number of states, cluster at state-level
For clustering: We may only need the weaker condition that \(\E[\nu_{j,t} | G_i] = 0\).
For identification: less robust to small violations of parallel trends.
Option 2: Only include a few similar states, cluster at unit-level
For clustering: We need the stronger condition that \(\nu_{j,t}=0\).
For identification: possibly more robust to small violations of parallel trends.
In my view, there is a fairly strong case for (in many applications) using tighter comparisons with a smaller number of clusters while arguing that \(\nu_{j,t}=0\) by a combination of
researcher legwork
pre-testing
If you report multiple hypothesis tests and/or confidence intervals—e.g., an event study—it’s a good idea to make an adjustment for multiple hypothesis testing.
sup-t confidence band—construct confidence intervals of the form \[\Big[\widehat{ATT}^{es}(e) \pm \hat{c}_{1-\alpha/2} \textrm{s.e.}\big(\widehat{ATT}^{es}(e)\big) \Big]\] but instead of choosing the critical value as a quantile of the normal distribution, choose a (slightly) larger critical value that accounts for the fact that you are testing multiple hypotheses.
Typically, an appropriate choice for the critical value to ensure the correct (uniform) coverage requires resampling methods.
The reason for this is that the appropriate critical value depends on the joint distribution of \(\sqrt{n}(\widehat{ATT}^{es}(e) - ATT^{es}(e))\) across all \(e\), and these are generally not independent of each other.
That said, this is not too difficult in practice, and is a default option in the did package.
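A base-R sketch of the idea (hypothetical covariance structure; the did package obtains the joint distribution via the multiplier bootstrap): simulate from the joint distribution of the event-study estimates and take the 0.95 quantile of the maximum absolute t-statistic:

```r
set.seed(4)
K <- 6                                    # number of event times
Sigma <- 0.5 ^ abs(outer(1:K, 1:K, "-"))  # assumed correlation across event times
R <- chol(Sigma)
draws <- matrix(rnorm(100000 * K), ncol = K) %*% R  # joint normal draws
crit_supt <- quantile(apply(abs(draws), 1, max), 0.95)  # sup-t critical value
c(pointwise = qnorm(0.975), supt = unname(crit_supt))   # sup-t value is larger
```

Using crit_supt in place of 1.96 in each interval gives a band that covers the whole event-study path with 95% probability, rather than each point separately.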
# note: these are the varying base period results
# that were "just barely" not statistically significant
# in pre-treatment periods
attgt_poi <- did::att_gt(yname="lemp",
idname="id",
gname="G",
tname="year",
data=data2,
control_group="nevertreated",
base_period="varying",
cband=FALSE)
ggdid(aggte(attgt_poi,type="dynamic",cband=FALSE))