Recent Methodological Advances and their Relevance to Empirical Work
University of Georgia
August 4, 2025
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} } \newcommand{\T}{T} \newcommand{\ATT}{\text{ATT}}\)
Panel Data Causal Inference: Challenges and Opportunities
DiD with Two Periods
Staggered Treatment Adoption
Application + Code
Additional Materials: https://bcallaway11.github.io/camp-resources/
References:
Advanced Materials: https://github.com/bcallaway11/bank-of-portugal
Relaxing the parallel trends assumption by including covariates
Common issues in empirical work
Dealing with more complicated treatment regimes
Alternative identification strategies (e.g., conditioning on lagged outcome, change-in-changes, others)
Running Example: Causal effect of a state-level minimum wage increase on employment
Research Design: The setting that the researcher will use to estimate causal effects.
Exploit a data structure where the researcher observes:
This research design is a key distinguishing feature of modern approaches to panel data causal inference relative to traditional panel data models
Identification Strategy: A target parameter and set of assumptions that allow the researcher to recover the target parameter
IV and RD are closely connected to natural experiments where the assignment of treatment, though not controlled by the researcher, is (usually locally) randomly assigned.
This implies that
Panel data causal inference methods are often used in settings where there is no explicit natural experiment:
This implies that
Modern approaches also typically allow for treatment effect heterogeneity
This is going to be a major issue in the discussion below
We’ll consider implications for “traditional” regression approaches and how new approaches are designed to handle this
Forward-Engineering: Identification first, then estimation
Reverse-Engineering: Prioritize estimation (for economists, often some form of regression)
Data:
Potential Outcomes: \(Y_{it}(1)\) and \(Y_{it}(0)\)
Observed Outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)
\[\begin{align*} Y_{it=2} = G_i Y_{it=2}(1) +(1-G_i)Y_{it=2}(0) \quad \textrm{and} \quad Y_{it=1} = Y_{it=1}(0) \end{align*}\]
Average Treatment Effect on the Treated: \[\ATT = \E[Y_{t=2}(1) - Y_{t=2}(0) | G=1]\]
Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group
\[\begin{align*} \ATT = \color{#4B8B3B}{\underbrace{\E[Y_{t=2}(1) | G=1]}_{\textrm{Easy}}} - \color{#BA0C2F}{\underbrace{\E[Y_{t=2}(0) | G=1]}_{\textrm{Hard}}} \end{align*}\]
With panel data, we can re-write this as
\[\begin{align*} \ATT = \color{#4B8B3B}{\E[Y_{t=2}(1) - Y_{t=1}(0) | G=1]} - \color{#BA0C2F}{\E[Y_{t=2}(0) - Y_{t=1}(0) | G=1]} \end{align*}\]
The first term is how outcomes changed over time for the treated group
The second term is how outcomes would have changed over time if the treated group had not been treated
Parallel Trends Assumption
\[\color{#BA0C2F}{\E[\Delta Y(0) | G=1]} = \color{#336699}{\E[\Delta Y(0) | G=0]}\]
Explanation: Mean path of untreated potential outcomes is the same for the treated group as for the untreated group
Identification: Under PTA, we can identify \(\ATT\): \[ \begin{aligned} \ATT &= \color{#4B8B3B}{\E[\Delta Y | G=1]} - \color{#BA0C2F}{\E[\Delta Y(0) | G=1]}\\ &= \color{#4B8B3B}{\E[\Delta Y | G=1]} - \color{#336699}{\E[\Delta Y | G=0]} \end{aligned} \]
\(\implies \ATT\) is identified and can be recovered by the difference in outcomes over time for the treated group (difference 1) relative to the difference in outcomes over time for the untreated group (difference 2)
The most straightforward approach to estimation is the plugin estimator:
\[\widehat{\ATT} = \frac{1}{n_1} \sum_{i=1}^n G_i \Delta Y_i - \frac{1}{n_0} \sum_{i=1}^n (1-G_i) \Delta Y_i\]
An alternative approach is to use a TWFE regression: \[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it}\]
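As a minimal sketch of the two estimation approaches just described, the base R snippet below computes the plug-in estimator and the TWFE coefficient on a small made-up panel. The data values and the object names (`att_plugin`, `alpha_twfe`) are illustrative assumptions, not part of the application. With two periods and a balanced panel, the two numbers should coincide.

```r
# Hypothetical two-period panel: 3 treated units (G = 1), 3 untreated units (G = 0)
G  <- c(1, 1, 1, 0, 0, 0)
y1 <- c(5.0, 6.0, 7.0, 4.0, 5.0, 6.0)   # outcomes in period 1
y2 <- c(5.5, 6.8, 7.6, 4.9, 6.1, 7.0)   # outcomes in period 2

# Plug-in DiD: mean change for the treated minus mean change for the untreated
dy <- y2 - y1
att_plugin <- mean(dy[G == 1]) - mean(dy[G == 0])

# TWFE regression on the stacked (long) panel with unit and time fixed effects
long <- data.frame(
  id = rep(seq_along(G), times = 2),
  t  = rep(c(1, 2), each = length(G)),
  y  = c(y1, y2)
)
long$D <- as.numeric(G[long$id] == 1 & long$t == 2)  # treated only in period 2
twfe <- lm(y ~ factor(id) + factor(t) + D, data = long)
alpha_twfe <- unname(coef(twfe)["D"])

print(c(plugin = att_plugin, twfe = alpha_twfe))
```

The equality is specific to this simple setting; it is not a statement about TWFE regressions with more periods or staggered adoption.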
It’s easy to make the TWFE regression more complicated:
Unfortunately, the robustness of TWFE regressions to treatment effect heterogeneity or these more complicated (and empirically relevant) settings does not seem to hold
\(\T\) time periods
Staggered treatment adoption: Units can become treated at different points in time, but once a unit becomes treated, it remains treated.
Examples:
Notation:
Notation (cont’d):
Group-time average treatment effects: \[\begin{align*} \ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]
Explanation: \(\ATT\) for group \(g\) in time period \(t\)
Event Study: \[\begin{align*} \ATT^{es}(e) = \E[ Y_{g+e}(G) - Y_{g+e}(0) | G \in \mathcal{G}_e] \end{align*}\]
where \(\mathcal{G}_e\) is the set of groups observed to have experienced the treatment for \(e\) periods at some point.
Explanation: \(\ATT\) when units have been treated for \(e\) periods
Overall \(\mathbf{\ATT}\):
Towards this end: the average treatment effect for unit \(i\) (across its post-treatment time periods) is given by: \[\bar{\tau}_i(G_i) = \frac{1}{\T - G_i + 1} \sum_{t=G_i}^{\T} \Big( Y_{it}(G_i) - Y_{it}(0) \Big)\]
Then,
\[\begin{align*} \ATT^o = \E[\bar{\tau}(G) | U=0] \end{align*}\]
Explanation: \(\ATT\) across all units that ever participate in the treatment
To understand the discussion later, it is also helpful to think of \(\ATT(g,t)\) as a building block for the other parameters discussed above. For example:
Overall ATT: \[\begin{align*} \ATT^o = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^o(g,t) \ATT(g,t) \qquad \qquad \textrm{where} \quad w^o(g,t) = \frac{\P(G=g|U=0)}{\T-g+1} \end{align*}\]
Event Study: Likewise, \(\ATT^{es}(e)\) is a weighted average of \(\ATT(g,g+e)\)
\(\implies\) If we can identify \(\mathbf{\ATT(g,t)}\), then we can proceed to recover \(\mathbf{\ATT^{es}(e)}\) and \(\mathbf{\ATT^o}\).
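To make the building-block idea concrete, here is a small base R sketch that aggregates hypothetical \(\ATT(g,t)\) values into \(\ATT^o\) using the weights \(w^o(g,t)\) defined above. The group shares `p_g`, the number of periods, and the effect sizes are all assumed values chosen for illustration.

```r
# Hypothetical setup: periods t = 1,...,5; treated groups g = 2 and g = 4
n_periods <- 5
groups <- c(2, 4)
p_g <- c(0.3, 0.7)   # assumed P(G = g | U = 0); must sum to 1

# Hypothetical ATT(g,t) values, post-treatment periods only
attgt <- expand.grid(t = 1:n_periods, g = groups)
attgt <- attgt[attgt$t >= attgt$g, ]
attgt$att <- 0.1 * (attgt$t - attgt$g + 1)   # assumed: effect grows with exposure

# w^o(g,t) = P(G = g | U = 0) / (T - g + 1)
attgt$w <- p_g[match(attgt$g, groups)] / (n_periods - attgt$g + 1)

att_overall <- sum(attgt$w * attgt$att)
print(attgt)
print(att_overall)
```

The weights sum to one, and each group's total weight equals its share among ever-treated units, so a group does not gain weight simply because it is observed for more post-treatment periods.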
Multiple Period Version of Parallel Trends Assumption
For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2,\ldots,\T\), \[\begin{align*} \E[\Delta Y_{t}(0) | G=g] = \E[\Delta Y_{t}(0) | U=1] \end{align*}\]
Using very similar arguments as before, one can show that \[\begin{align*} \ATT(g,t) = \E[Y_{t} - Y_{g-1} | G=g] - \E[Y_{t} - Y_{g-1} | U=1] \end{align*}\]
where the main difference is that we use \((g-1)\) as the base period (this is the period right before group \(g\) becomes treated).
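For reference, the argument can be sketched as follows for \(t \geq g\). It assumes no anticipation, so that \(Y_{g-1} = Y_{g-1}(0)\) for group \(g\), and it applies the multiple-period parallel trends assumption period by period: \[\begin{aligned} \ATT(g,t) &= \E[Y_t(g) - Y_t(0) | G=g] \\ &= \E[Y_t(g) - Y_{g-1}(0) | G=g] - \E[Y_t(0) - Y_{g-1}(0) | G=g] \\ &= \E[Y_t - Y_{g-1} | G=g] - \sum_{s=g}^{t} \E[\Delta Y_s(0) | G=g] \\ &= \E[Y_t - Y_{g-1} | G=g] - \sum_{s=g}^{t} \E[\Delta Y_s(0) | U=1] \\ &= \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1] \end{aligned}\] The last line uses the fact that observed outcomes for the never-treated group are untreated potential outcomes in every period.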
The previous discussion emphasizes a general purpose identification strategy with staggered treatment adoption:
Step 1: Target disaggregated treatment effect parameters (i.e., group-time average treatment effects)
Step 2: (If desired) combine disaggregated treatment effects into lower dimensional summary treatment effect parameter
Notice that:
With staggered treatments, traditionally DiD identification strategies have been implemented with two-way fixed effects (TWFE) regressions: \[\begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*}\]
One main contribution of recent work on DiD has been to diagnose and understand the limitations of TWFE regressions for implementing DiD
Goodman-Bacon (2021) intuition: \(\alpha\) “comes from” comparisons between the path of outcomes for units whose treatment status changes relative to the path of outcomes for units whose treatment status stays the same over time.
de Chaisemartin and D’Haultfoeuille (2020) intuition: You can write \(\alpha\) as a weighted average of \(\ATT(g,t)\)
First, a decomposition: \[\begin{align*} \alpha &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \\ & + \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{g-1} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \end{align*}\]
Second, under parallel trends: \[\begin{align*} \alpha = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^{TWFE}(g,t) \ATT(g,t) \end{align*}\]
Intuition: Directly implement the identification result discussed above
\[\begin{align*} \ATT(g,t) = \E[Y_{t} - Y_{g-1} | G=g] - \E[Y_{t} - Y_{g-1} | U=1] \end{align*}\]
Estimation:
\[\begin{align*}\widehat{\ATT}^{CS}(g,t) = \frac{1}{n_g}\sum_{i=1}^n \indicator{G_i = g}(Y_{it} - Y_{ig-1}) - \frac{1}{n_U}\sum_{i=1}^n \indicator{U_i = 1} (Y_{it} - Y_{ig-1}) \end{align*}\]
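The estimator above can be sketched in a few lines of base R. This is only an illustration of the formula, not the `did` package implementation: the panel is made up and noiseless, the effect sizes are assumed, never-treated units are coded as `G = 0`, and `att_gt_plugin` is a hypothetical helper name.

```r
# Hypothetical noiseless panel: 4 periods, groups G = 2, G = 3, never treated (G = 0)
panel <- expand.grid(t = 1:4, id = 1:9)
panel$G <- c(2, 2, 2, 3, 3, 3, 0, 0, 0)[panel$id]
eta   <- panel$id / 10                      # unit fixed effect
theta <- c(0, 0.5, 0.7, 1.2)[panel$t]       # common time effect
treated <- panel$G > 0 & panel$t >= panel$G
tau <- ifelse(treated, panel$t - panel$G + 1, 0)   # assumed effects: 1, 2, 3, ...
panel$Y <- eta + theta + tau

att_gt_plugin <- function(data, g, t) {
  base <- g - 1                                  # period right before treatment
  change <- function(keep) {
    mean(data$Y[keep & data$t == t]) - mean(data$Y[keep & data$t == base])
  }
  change(data$G == g) - change(data$G == 0)      # group g path minus never-treated path
}

print(att_gt_plugin(panel, g = 2, t = 4))
print(att_gt_plugin(panel, g = 3, t = 3))
```

Because parallel trends holds exactly in this constructed data, the function returns the assumed effects; with real data it returns estimates with sampling error.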
2nd step: Recall: group-time average treatment effects are building blocks for more aggregated parameters such as \(\ATT^{es}(e)\) and \(\ATT^o\) \(\implies\) just plug in
Regression based: Sun and Abraham (2021), Wooldridge (2021)
Imputation: Gardner et al. (2023), Borusyak, Jaravel, and Spiess (2024)
“Stacked” regression: Dube et al. (2023)
All of these approaches are conceptually very similar
Why can you get different numbers?
Important differences:
Use county-level data from 2003-2007 during a period where the federal minimum wage was flat
Exploit minimum wage changes across states
Interested in the effect of the minimum wage on teen employment
We’ll also make a number of simplifications:
Goals:
Full code is available on GitHub.
R packages used in empirical example:
# drops NE region and a couple of small groups
mw_data_ch2 <- subset(mw_data_ch2, (G %in% c(2004,2006,2007,0)) & (region != "1"))
head(mw_data_ch2[,c("id","year","G","lemp","lpop","lavg_pay","region")])
id year G lemp lpop lavg_pay region
554 8003 2001 2007 5.556828 9.614137 10.05750 4
555 8003 2002 2007 5.356586 9.623972 10.09712 4
556 8003 2003 2007 5.389072 9.620859 10.10761 4
557 8003 2004 2007 5.356586 9.626548 10.14034 4
558 8003 2005 2007 5.303305 9.637958 10.17550 4
559 8003 2006 2007 5.342334 9.633056 10.21859 4
attgt <- did::att_gt(yname="lemp",
idname="id",
gname="G",
tname="year",
data=data2,
control_group="nevertreated",
base_period="universal")
tidy(attgt)[,1:5] # print results, drop some extra columns
term group time estimate std.error
1 ATT(2004,2003) 2004 2003 0.00000000 NA
2 ATT(2004,2004) 2004 2004 -0.03266653 0.021210914
3 ATT(2004,2005) 2004 2005 -0.06827991 0.021592785
4 ATT(2004,2006) 2004 2006 -0.12335404 0.021745364
5 ATT(2004,2007) 2004 2007 -0.13109136 0.023757903
6 ATT(2006,2003) 2006 2003 -0.03408910 0.011674878
7 ATT(2006,2004) 2006 2004 -0.01669977 0.007910050
8 ATT(2006,2005) 2006 2005 0.00000000 NA
9 ATT(2006,2006) 2006 2006 -0.01939335 0.009693080
10 ATT(2006,2007) 2006 2007 -0.06607568 0.009354202
Call:
did::aggte(MP = attgt, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0571 0.0087 -0.0741 -0.0401 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0888 0.0202 -0.1306 -0.0470 *
2006 -0.0427 0.0075 -0.0583 -0.0272 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
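As a quick check on how the group effects relate to the group-time estimates, the snippet below averages the post-treatment \(\ATT(g,t)\) estimates printed earlier within each group. The numbers are copied from the output above; the object names are illustrative.

```r
# ATT(g,t) estimates copied from the att_gt output above (post-treatment periods only)
att_2004 <- c(-0.03266653, -0.06827991, -0.12335404, -0.13109136)  # t = 2004,...,2007
att_2006 <- c(-0.01939335, -0.06607568)                           # t = 2006, 2007

# Group effect: simple average of a group's post-treatment ATT(g,t)
group_effects <- c("2004" = mean(att_2004), "2006" = mean(att_2006))
print(round(group_effects, 4))
```

These averages reproduce the reported group effects of -0.0888 and -0.0427. The overall ATT additionally weights the groups by their relative sizes, which are not shown in the printed output, so it is not recomputed here.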
To summarize: \(ATT^o = -0.057\) while \(\alpha^{TWFE} = -0.038\). This difference can be fully accounted for
In my experience: this is fairly representative of how much new DiD approaches matter relative to TWFE regressions.
There has been a lot of concern about negative weights (both in econometrics and empirical work).
With data3 (the data that includes \(G_i=2007\)), you will get a negative weight on \(ATT(g=2004,t=2007)\). But it turns out not to matter much, and TWFE works better in this case than in the case that I showed you.
Not a Panacea:
Possible Disadvantages:
Advantages:
View #1: Parallel trends as a purely reduced form assumption
But this is certainly not the only possibility:
In my view, these seem like fair points
View #2: Models that lead to parallel trends assumption. We’ll focus on untreated potential outcomes: \[Y_{it}(0) = \theta_t + \eta_i + e_{it}\] Parallel trends is equivalent to this model along with the condition that \(\E[e_t | G] = 0\).
Many economic models have this sort of flavor, that the important thing driving differences in outcomes is some latent characteristic (differences in lagged outcomes may proxy this, but not the “deep” explanation)
Pros
Cons: However, additive separability of \(\theta_t\) and \(\eta_i\) is crucial for identification
Consider a simplified setting where \(\T=2\), but we allow for there to be units that are already treated in the first period.
\(\implies\) 3 groups: \(G_i=1\), \(G_i=2\), \(G_i=\infty\)
Because there are only two periods, the TWFE regression is equivalent to the regression \[\begin{align*} \Delta Y_i = \Delta \theta_{t=2} + \alpha \Delta D_{it=2} + \Delta e_{it=2} \end{align*}\]
Moreover, \(\Delta D_{it=2}\) only takes two values:
Thus, this is a fully saturated regression, and we have that \[\begin{align*} \alpha = \E[\Delta Y | \Delta D_{t=2} = 1] - \E[\Delta Y | \Delta D_{t=2}=0] \end{align*}\]
Starting from the expression on the previous slide, \[\begin{align*} \alpha = \E[\Delta Y | \Delta D_{t=2} = 1] - \E[\Delta Y | \Delta D_{t=2}=0] \end{align*}\] and considering the term on the far right, we have that \[\begin{align*} \E[\Delta Y | \Delta D_{t=2}=0] = \E[\Delta Y | G=1] \underbrace{\frac{p_1}{p_1 + p_\infty}}_{=: w_1} + \E[\Delta Y | G=\infty] \underbrace{\frac{p_\infty}{p_1 + p_\infty}}_{=: w_\infty} \end{align*}\]
where \(w_1\) and \(w_\infty\) are the relative sizes of group 1 and the never treated group, and notice that \(w_1 + w_\infty = 1\). Plugging this back in \(\implies\) \[\begin{align*} \alpha = \Big( \E[\Delta Y | G=2] - \E[\Delta Y | G=1]\Big) w_1 + \Big( \E[\Delta Y | G=2] - \E[\Delta Y|G=\infty]\Big) w_\infty \end{align*}\]
This is exactly the Goodman-Bacon result! \(\alpha\) is a weighted average of all possible 2x2 comparisons
Let’s keep going: \[\begin{align*} \alpha = \underbrace{\Big( \E[\Delta Y | G=2] - \E[\Delta Y | G=1]\Big)}_{\textrm{What is this?}} w_1 + \underbrace{\Big( \E[\Delta Y | G=2] - \E[\Delta Y|G=\infty]\Big)}_{ATT(2,2)} w_\infty \end{align*}\] Working on the first term, we have that \[ \begin{aligned} & \E[\Delta Y_{2} | G=2] - \E[\Delta Y_{2} | G=1] \hspace{300pt}\\ &\hspace{10pt} = \E[Y_{2}(2) - Y_{1}(\infty) | G=2] - \E[Y_{2}(1) - Y_{1}(1) | G=1] \\ &\hspace{10pt} = \E[Y_{2}(2) - Y_{2}(\infty) | G=2] + \underline{\E[Y_{2}(\infty) - Y_{1}(\infty) | G=2]}\\ &\hspace{20pt} - \Big( \E[Y_{2}(1) - Y_{2}(\infty) | G=1] - \E[Y_{1}(1) - Y_{1}(\infty) | G=1] + \underline{\E[Y_{2}(\infty) - Y_{1}(\infty) | G=1]} \Big)\\ &\hspace{10pt} = \underbrace{ATT(2,2)}_{\textrm{causal effect}} - \underbrace{\Big(ATT(1,2) - ATT(1,1)\Big)}_{\textrm{treatment effect dynamics}} \end{aligned} \]
Plugging the previous expression back in, we have that \[\begin{align*} \alpha = ATT(2,2) + ATT(1,1) w_1 + ATT(1,2)(-w_1) \end{align*}\]
This is exactly the result in de Chaisemartin and D’Haultfoeuille (2020)! \(\alpha\) is equal to a weighted average of \(ATT(g,t)\)’s, but it is possible that some of the weights can be negative.
Also, as they point out, a sufficient condition for the weights to be non-negative is: no treatment effect dynamics \(\implies ATT(1,1) = ATT(1,2)\) \(\overset{\textrm{here}}{\implies} \alpha = ATT(2,2)\).
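The decomposition can be checked numerically. The base R sketch below builds a noiseless two-period data set with the three groups from this example and compares the regression coefficient with the formula for \(\alpha\). Group sizes, effect sizes, and the common trend are assumed values chosen so that the negative weight matters.

```r
# Hypothetical group sizes and treatment effects (noiseless, parallel trends holds exactly)
n <- c(g1 = 30, g2 = 40, never = 30)
att_11 <- 1; att_12 <- 3; att_22 <- 1        # all effects are positive
delta_theta <- 0.5                           # common trend in untreated outcomes

# Observed change in outcomes between periods 1 and 2, by group
dY <- c(rep(delta_theta + att_12 - att_11, n[["g1"]]),   # already treated: effect dynamics
        rep(delta_theta + att_22,          n[["g2"]]),   # newly treated in period 2
        rep(delta_theta,                   n[["never"]]))
dD <- c(rep(0, n[["g1"]]), rep(1, n[["g2"]]), rep(0, n[["never"]]))

# First-difference regression, equivalent to TWFE with two periods
alpha_hat <- unname(coef(lm(dY ~ dD))["dD"])

# alpha = ATT(2,2) + ATT(1,1) w_1 + ATT(1,2)(-w_1)
w1 <- n[["g1"]] / (n[["g1"]] + n[["never"]])
alpha_formula <- att_22 + att_11 * w1 - att_12 * w1

print(c(regression = alpha_hat, formula = alpha_formula))
```

In this constructed example every \(ATT(g,t)\) is positive, yet \(\alpha\) is zero, because the already-treated group's effect dynamics enter with a negative weight.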
We’ll discuss:
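Callaway and Sant’Anna (2021): R: did, Stata: csdid, Python: csdid
Sun and Abraham (2021): R: fixest, Stata: eventstudyinteract
Wooldridge (2021): R: etwfe, Stata: JWDiD
Gardner et al. (2023) and Borusyak, Jaravel, and Spiess (2024): R: did2s, Stata: did2s and did_imputation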
Not including:
Intuition: The paper points out limitations of event-study versions of the TWFE regressions discussed above, \[\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{e=-(\T-1)}^{-2} \beta_e D_{it}^e + \sum_{e=0}^{\T} \beta_e D_{it}^e + e_{it} \end{align*}\] which suffer from similar issues. In particular, the event study regression is “underspecified” \(\implies\) heterogeneous effects can “confound” the treatment effect estimates
Solution: Run fully interacted regression: \[\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{e \neq -1} \delta^{SA}_{ge} \indicator{G_i=g} \indicator{g+e=t} + e_{it} \end{align*}\]
2nd step: Aggregate \(\delta^{SA}_{ge}\)’s across groups (usually into an event study).
Intuition: Are issues in DiD literature due to limitations of TWFE regressions per se or due to misspecification of TWFE regression?
Solution: Proposes running “more interacted” TWFE regression: \[\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{s=g}^{\T} \alpha_{gs}^W \indicator{G_i=g, t=s} + e_{it} \end{align*}\] This is quite similar to Sun and Abraham (2021), except that it does not include interactions in pre-treatment periods. [The differences about \((g,t)\) relative to \((g,e)\) are trivial.]
Intuition: Parallel trends is closely connected to a TWFE model for untreated potential outcomes \[Y_{it}(0) = \theta_t + \eta_i + e_{it}\]
Estimation:
Can compute other treatment effect parameters too (e.g., event study or overall average treatment effect)
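A minimal base R sketch of the imputation idea follows: fit the TWFE model for \(Y_{it}(0)\) on untreated observations, impute \(Y_{it}(0)\) for treated observations, and average the differences. It is written in the spirit of the imputation approaches and is not the `did2s` or `did_imputation` implementation. The panel is made up and noiseless, never-treated units are coded as `G = 0`, and all object names are illustrative.

```r
# Hypothetical noiseless panel: 4 periods, groups G = 2, G = 3, never treated (G = 0)
panel <- expand.grid(t = 1:4, id = 1:9)
panel$G <- c(2, 2, 2, 3, 3, 3, 0, 0, 0)[panel$id]
panel$treated <- panel$G > 0 & panel$t >= panel$G
panel$Y <- panel$id / 10 + c(0, 0.5, 0.7, 1.2)[panel$t] +
  ifelse(panel$treated, panel$t - panel$G + 1, 0)       # assumed effects: 1, 2, 3
panel$id_f <- factor(panel$id)
panel$t_f  <- factor(panel$t)

# Step 1: fit the TWFE model for Y(0) using untreated observations only
fit0 <- lm(Y ~ id_f + t_f, data = panel[!panel$treated, ])

# Step 2: impute Y(0) for treated observations and form unit-time effects
post <- panel[panel$treated, ]
post$tau_hat <- post$Y - predict(fit0, newdata = post)

# Step 3: average the imputed effects within (g, t) cells
att_gt_imputed <- aggregate(tau_hat ~ G + t, data = post, FUN = mean)
print(att_gt_imputed)
```

Averaging `tau_hat` over other sets of observations (by event time, or overall) gives the other treatment effect parameters in the same way.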
In my view, all of the approaches discussed above are fundamentally similar to each other.
In practice, it is sometimes possible to get different results though this is often driven by
In post-treatment periods, these give numerically identical results: \(\widehat{\ATT}^{CS}(g,t) = \hat{\delta}^{SA}_{g,t-g}\)
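The numerical equivalence can be illustrated in base R without either package. The sketch below assumes a balanced panel, a never-treated comparison group (coded `G = 0`), no covariates, and \(g-1\) as the base period for every group. The simulated data and all object names are illustrative.

```r
set.seed(123)
# Hypothetical balanced panel: 4 periods, groups G = 2, G = 3, never treated (G = 0)
panel <- expand.grid(t = 1:4, id = 1:12)
panel$G <- rep(c(2, 3, 0), each = 4)[panel$id]
panel$Y <- rnorm(nrow(panel)) + ifelse(panel$G > 0 & panel$t >= panel$G, 1, 0)

# Fully interacted regression: one dummy per (g, t) cell, omitting t = g - 1 for each group
cells <- subset(expand.grid(g = c(2, 3), t = 1:4), t != g - 1)
dummy_names <- paste0("d_g", cells$g, "_t", cells$t)
for (k in seq_len(nrow(cells))) {
  panel[[dummy_names[k]]] <- as.numeric(panel$G == cells$g[k] & panel$t == cells$t[k])
}
fit <- lm(reformulate(c("factor(id)", "factor(t)", dummy_names), response = "Y"),
          data = panel)

# Plug-in comparison of mean changes relative to base period g - 1
att_cs <- function(g, t) {
  change <- function(keep) {
    mean(panel$Y[keep & panel$t == t]) - mean(panel$Y[keep & panel$t == g - 1])
  }
  change(panel$G == g) - change(panel$G == 0)
}

comparison <- cbind(regression = coef(fit)[dummy_names],
                    plug_in    = mapply(att_cs, cells$g, cells$t))
print(comparison)
```

Because the same base period is used throughout, the two columns agree in pre-treatment cells as well; differences in packaged software typically come from default choices such as the base period.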
In pre-treatment periods, code will give different pre-treatment estimates, but this is due to different default choices
These are clearly closely related, with the difference amounting to whether or not one includes indicators for pre-treatment periods.
It is fair to see this as a way to trade-off robustness and efficiency
Wooldridge and Gardner/BJS give numerically the same estimates: \(\hat{\alpha}^W_{gt} = \widehat{\ATT}^{G/BJS}(g,t)\)
Intuition: Including full set of interactions is equivalent to estimating separate models by groups
The above discussion emphasizes the conceptual similarities between different proposed alternatives to TWFE regressions in the literature.
The other major source of differences in estimates across procedures is different default options in software implementations. Examples:
See Baker, Larcker, and Wang (2022) and Callaway (2023) for substantially more details.
Consider the following alternative aggregated treatment effect parameter
\[\begin{align*} \widehat{ATT}^{\text{simple}} = \frac{1}{N_{\text{post}}} \sum_{(i,t), t \geq G_i} \Big(Y_{it} - \hat{Y}_{it}(0)\Big) \end{align*}\] i.e., we just average all possible estimated treatment effects that we have available in post-treatment periods.
Relative to \(ATT^o\), early treated units get more weight (because we have more \(Y_{it}-\hat{Y}_{it}(0)\) for them).
By construction, weights are all positive. However, they are different from \(ATT^o\) weights
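The difference in weighting can be seen in a toy example. The sketch below uses made-up effects for two equally sized groups (an assumption), one treated early and one treated late, and compares the simple average with the group-then-overall average used for \(ATT^o\).

```r
# Hypothetical effects: two equally sized groups, periods t = 1,...,5
effects <- data.frame(
  g   = c(2, 2, 2, 2, 4, 4),
  t   = c(2, 3, 4, 5, 4, 5),
  tau = c(1, 2, 3, 4, 1, 2)    # assumed: effect grows with exposure
)

# "Simple" aggregation: average over every post-treatment (g, t) observation
att_simple <- mean(effects$tau)

# Overall ATT: average within group first, then across (equally sized) groups
att_overall <- mean(tapply(effects$tau, effects$g, mean))

print(c(simple = att_simple, overall = att_overall))
```

The early-treated group contributes four of the six terms in the simple average, so its larger long-run effects pull the simple average above the overall ATT in this example.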
Call:
did::aggte(MP = attgt, type = "simple")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
ATT Std. Error [ 95% Conf. Int.]
-0.0646 0.0106 -0.0854 -0.0439 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
Besides the violations of parallel trends in pre-treatment periods, these weights are further away from \(ATT^o\) than the TWFE regression weights are!
Implications:
Comments
The differences between the CS estimates and the TWFE estimates are fairly large here: the CS estimate is about 50% larger than the TWFE estimate, though results are qualitatively similar.