University of Georgia
University of Alabama
September 24, 2025
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} } \newcommand{\T}{T} \newcommand{\ATT}{\text{ATT}}\) Setting of the paper: Panel data causal inference with staggered treatment adoption
Main idea: Exploit having access to “extra” periods and comparison groups to substantially weaken auxiliary assumptions (like parallel trends) that are common in this setting
Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on \(\underbrace{\textrm{earnings}}_{\textrm{outcome}}\)
1. Motivation
2. Identification
3. Application
Research Design: The setting that the researcher will use to estimate causal effects.
Staggered adoption research design:
This research design is a key distinguishing feature of modern approaches to panel data causal inference relative to traditional panel data models
Identification Strategy: A target parameter and set of assumptions that allow the researcher to recover the target parameter ➡
IV and RD are closely connected to natural experiments where the assignment of treatment, though not controlled by the researcher, is (usually locally) randomly assigned.
This implies that
Panel data causal inference methods are often used in settings where there is no explicit natural experiment:
This implies that
1. Availability
2. Allow for within-unit comparisons
3. Allow for selection on unobservables
4. Pre-testing [Event Study Plot]
Treatment timing can vary across units, but once a unit becomes treated, it remains treated in subsequent periods
Many of the insights of recent work on DiD have been in the context of staggered treatment adoption
In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters
Observed data: \(\{Y_{i1}, Y_{i2}, \ldots, Y_{i\T}, D_{i1}, D_{i2}, \ldots, D_{i\T}\}_{i=1}^n\)
Setup is exactly the same as DiD with staggered treatment adoption!
Following Callaway and Sant’Anna (2021), we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]
\(ATT(g,t)\) is the average treatment effect for group \(g\) in time period \(t\)
Group-time average treatment effects are the natural building block for other common target parameters in DiD applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)
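As a rough sketch of how group-time effects can be combined (the numbers and the simple group-size weighting below are hypothetical, in the spirit of Callaway and Sant'Anna (2021) rather than their exact implementation):

```python
# Sketch: aggregate hypothetical ATT(g,t) values into an event study.
# att[(g, t)] and group_size[g] are made-up numbers for illustration.
att = {(3, 3): 1.0, (3, 4): 1.4, (4, 4): 0.8}
group_size = {3: 100, 4: 300}

def event_study(e):
    """Average ATT(g, g + e) across groups, weighted by group size."""
    cells = [(g, att[(g, t)], group_size[g])
             for (g, t) in att if t - g == e]
    total = sum(w for (_, _, w) in cells)
    return sum(a * w for (_, a, w) in cells) / total

print(event_study(0))  # instantaneous effect (e = 0), pooled across groups
print(event_study(1))  # effect one period after treatment
```

Here `event_study(0)` pools \(ATT(3,3)\) and \(ATT(4,4)\), while `event_study(1)` only uses group 3, since group 4 has no second post-treatment period in this example.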
Latent Unconfoundedness
\[ Y_{it}(0) \independent G_i | \xi_i \]
See Gobillon and Magnac (2016); Gardner (2020); Arkhangelsky and Imbens (2022); Callaway and Karami (2023), among others
Intuition:
Latent unconfoundedness implies that we can write
\[ Y_{it}(0) = h_t(\xi_i) + e_{it} \quad \textrm{where} \quad \E[e_{it} | \xi_i, G_i] = 0 \]
This is a hard model to make progress with because \(h_t(\cdot)\) is completely unrestricted, but often we think of approximations
\[h_t(\xi_i) = \theta_t + \xi_i \qquad \implies \text{DiD}\]
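To see why this specialization delivers DiD, note that time-differencing eliminates the unobserved heterogeneity (using \(\E[e_{it} | \xi_i, G_i] = 0\) from above):

```latex
% With h_t(\xi_i) = \theta_t + \xi_i:
\E[\Delta Y_{t}(0) \mid G = g]
  = \underbrace{(\theta_t - \theta_{t-1})}_{=\,\Delta\theta_t}
  + \underbrace{\E[\xi - \xi \mid G = g]}_{=\,0}
  + \underbrace{\E[\Delta e_{t} \mid G = g]}_{=\,0}
  = \Delta\theta_t
```

which does not depend on \(g\), i.e., parallel trends.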
We will go for more general approximations to \(h_t(\xi_i)\), for \(\xi_i = (\eta_i', \lambda_i')'\):
\[h_t(\xi_i) = \theta_t + \eta_i + \lambda_i'F_t + r_{it}\]
where \(r_{it}\) is approximation error.
Terminology:
In our context, though, it makes sense to interpret these as
[Unit-specific linear trends] [IFE and violations of parallel trends]
Some comments:
🧩 In a setting with fixed-\(T\) and a fixed number of comparison groups, there are going to be limits on how complex we can make the approximation
🧩 The dimension of \(\xi_i\) could be fairly large
💡 The dimension of \(\eta_i\) can be high without much cost
💡 We do not necessarily need to “control for” every component of \(\lambda_i\) that affects the outcome, just the ones that are imbalanced across groups
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
IFE Model:
Target:
\(Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it}\)
\(ATT(g=3,t=3) = \E[\Delta Y_3 | G=3] - \underbrace{\color{#BA0C2F}{\E[\Delta Y_3(0) | G=3]}}_{\textrm{have to figure out}}\)
Using a quasi-differencing argument, one can show that
\[ \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + v_{i3} \]
where \(\theta_3^*\) and \(F_3^*\) are functions of the parameters \(\theta_t\) and \(F_t\), and \(v_{i3}\) is a function of \(e_{it}\).
Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,
\[\begin{align*} \color{#BA0C2F}{\E[\Delta Y_3(0) | G=3]} = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_2(0) | G = 3]}_{\textrm{identified}} + \underbrace{\E[v_3|G=3]}_{=0} \end{align*}\]
\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).
Also, note that \(\color{#BA0C2F}{\E[\Delta Y_3(0) | G = 3]}\) is a linear combination of \(1\) and \(\E[\Delta Y_2 | G=3]\).
Recall:
\[ \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \]
Some issues:
📝 Expression involves untreated potential outcomes through period 3 \(\implies\) Only groups 4 and \(\infty\) are useful for recovering \(\theta_3^*\) and \(F_3^*\)
🤔 \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) by construction \(\implies\) We need some exogenous variation to recover the parameters
There are a number of different ideas here:
In particular, notice that, because we have two distinct untreated groups in period 3 (group 4 and group \(\infty\)), we have two moment conditions:
\[\begin{align*} \E[\Delta Y_3(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=4] \\ \E[\Delta Y_3(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=\infty] \\ \end{align*}\]
We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_3|G=\infty] - \E[\Delta Y_3|G=4]}{\E[\Delta Y_2|G=\infty] - \E[\Delta Y_2|G=4]} \\[10pt] \theta_3^* &= \E[\Delta Y_3 | G=4] - F_3^* \E[\Delta Y_2 | G=4] \end{align*}\]
\(\implies\) we can recover \(ATT(3,3)\).
This strategy amounts to using “group” as an instrument for \(\Delta Y_{i2}(0)\).
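A quick numerical sanity check of this argument at the population level, using made-up parameter values for the one-IFE model (the group means \(\bar\lambda_g\), the \(\theta_t\), \(F_t\), and the treatment effect \(\tau\) below are all hypothetical):

```python
# Check that the closed-form (theta_3*, F_3*) recovers ATT(3,3)
# in a one-IFE model. All parameter values are made up.
lam_bar = {3: 1.0, 4: 0.4, "inf": -0.2}   # E[lambda | G = g]
theta = {1: 0.0, 2: 0.5, 3: 1.3}          # time effects theta_t
F = {1: 0.2, 2: 1.0, 3: 1.7}              # factor F_t
tau = 2.0                                  # true ATT(3,3)

# Population mean trends: E[dY_t(0) | G=g] = d(theta_t) + lam_bar[g] * d(F_t)
def dY0(g, t):
    return (theta[t] - theta[t - 1]) + lam_bar[g] * (F[t] - F[t - 1])

# Observed trends (group 3 becomes treated in period 3)
dY2 = {g: dY0(g, 2) for g in lam_bar}
dY3 = {g: dY0(g, 3) + (tau if g == 3 else 0.0) for g in lam_bar}

# Solve the two moment conditions from comparison groups 4 and "inf"
F3_star = (dY3["inf"] - dY3[4]) / (dY2["inf"] - dY2[4])
theta3_star = dY3[4] - F3_star * dY2[4]

att_33 = dY3[3] - (theta3_star + F3_star * dY2[3])
print(att_33)  # recovers the true effect tau
```

With these values, \(F_3^* = \Delta F_3 / \Delta F_2 = 0.875\), and the last line reproduces \(\tau\).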
Condition 1: Relevance \(\quad \E[\Delta Y_2(0) | G=4] \neq \E[\Delta Y_2(0) | G=\infty]\)
For relevance to hold, the following two "more primitive" conditions both need to hold: (i) \(\E[\lambda | G=4] \neq \E[\lambda | G=\infty]\) and (ii) \(\Delta F_2 \neq 0\), since \(\E[\Delta Y_2(0) | G=4] - \E[\Delta Y_2(0) | G=\infty] = \big(\E[\lambda|G=4] - \E[\lambda|G=\infty]\big)\Delta F_2\).
Otherwise, \(G = 4\) and \(G = \infty\) have the same trend between the first two periods.
Condition 2: Exogeneity
Group is uncorrelated with \(r_{it} + e_{it}\)
Can’t directly test exogeneity, but a lot of the DiD infrastructure carries over
There are additional complications for making this work in realistic applications:
1. How do we know how many IFEs there are?
2. Does having more groups/periods help?
3. Are there testable implications in general settings?
where \(\widetilde{\Delta}_{g,t} = \E[\Delta Y_t(0) | G=g]\); collect these trends into a matrix with rows indexed by group (\(3, 4, \infty\)) and columns by period, where the entry \(\Huge \star\) \(= \widetilde{\Delta}_{3,3}\) is the unobserved counterfactual trend
if we know \(\Huge \star\) then we can recover \(ATT(3,3)\)
if this matrix has reduced rank, then we can fill in \(\Huge \star\) and recover \(ATT(3,3)\)
Check the rank of the observed sub-matrix (the rows for groups 4 and \(\infty\)) and assume that the full matrix has the same rank ➡️
Case 1: If this rank = 2 \(\implies\) 1 IFE
Case 2: If this rank = 1 \(\implies\) 0 IFEs (i.e., parallel trends)
If the rank of the sub-matrix and of the full matrix are both 1, then the comparison groups' trends pin down the row for group 3, and we can fill in \(\Huge \star\) from the observed entries
You can see these rank conditions in pre-treatment periods in group-specific event studies
The same idea extends to later groups: if the matrix of group-time trends has reduced rank, then we can fill in \(\Huge \star\) \(= \widetilde{\Delta}_{5,5}\) and recover \(ATT(5,5)\)
Check the rank of the observed sub-matrix and assume that the full matrix has the same rank ➡️
Case 1: If this rank = 4 \(\implies\) 3 IFEs
Case 2: If this rank = 3 \(\implies\) 2 IFEs
Case 3: If this rank = 2 \(\implies\) 1 IFE
Case 4: If this rank = 1 \(\implies\) 0 IFEs (i.e., parallel trends)
Testable implications arise when the rank is 4, 3, or 2: the assumed rank restricts the observed entries of the matrix over and above what is needed to fill in \(\Huge \star\)
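The rank-to-IFE mapping can be seen numerically: under an \(R\)-IFE model, \(\widetilde{\Delta}_{g,t} = \Delta\theta_t + \E[\lambda|G=g]'\Delta F_t\), so the matrix of trends has rank at most \(R + 1\). A small sketch with made-up random values:

```python
import numpy as np

# Matrix of trends implied by an R-IFE model:
#   Delta = 1 * dtheta' + Lam_bar @ dF'
# (all numbers below are randomly generated for illustration)
R = 2
n_groups, n_periods = 6, 5
rng = np.random.default_rng(1)

dtheta = rng.standard_normal(n_periods)        # common time trends
Lam_bar = rng.standard_normal((n_groups, R))   # E[lambda | G=g], one row per group
dF = rng.standard_normal((n_periods, R))       # factor changes Delta F_t

Delta = np.ones((n_groups, 1)) @ dtheta[None, :] + Lam_bar @ dF.T

# Generically, rank = R + 1: one dimension for the common trend, R for the factors
print(np.linalg.matrix_rank(Delta))
```

This is why observing rank \(r\) in the comparison-group rows is consistent with \(r - 1\) interactive fixed effects.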
Intuition:
Examples:
If the number of interactive fixed effects is known
If the number of interactive fixed effects is unknown
The next set of results include one interactive fixed effect
Additional Comments:
Comments welcome: brantly.callaway@uga.edu
Code: staggered_ife2 function in the ife package in R, available at github.com/bcallaway11/ife
[Back]
IFE model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]
Special Case: \(F_t = t\) \(\implies\) unit-specific linear trends
[Back]
Interactive fixed effects models allow for violations of parallel trends:
\[ \begin{aligned} Y_{it}(0) &= \theta_t + \eta_i + \lambda_i F_t + e_{it} \\ \end{aligned} \]
\[ \implies \E[\Delta Y_{t}(0) | G = g] = \Delta \theta_t + \E[\lambda|G=g]\Delta F_t \]
which can vary across groups.
Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to see larger increases in outcomes over time than lower-ability groups
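Concretely, the gap in untreated trends between two groups is proportional to the factor change (symbols as in the model above):

```latex
\E[\Delta Y_t(0) \mid G = \text{high}] - \E[\Delta Y_t(0) \mid G = \text{low}]
  = \big(\E[\lambda \mid G=\text{high}] - \E[\lambda \mid G=\text{low}]\big)\,\Delta F_t
```

which is positive whenever the high-ability group has larger mean \(\lambda\) and \(F_t\) is increasing.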
[Back]
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
\[Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it}\]
In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\]
The second equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
From last slide, combining terms we have that
\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
[Back]
Insight: Similar to previous case except there are fewer “available” comparison groups for later time periods. [Back]
Insight: Fewer available pre-treatment periods limit the number of IFEs we can accommodate, though there are testable implications here (due to the large number of available comparison groups). [Back]
Insight: Only one comparison group available in period 9, so we can only accommodate 0 IFEs (i.e., parallel trends), though there are testable implications (due to the large number of pre-treatment periods). [Back]
Interactive fixed effects for untreated potential outcomes:
\[ Y_{it}(0) = \theta_t + \eta_i + \lambda_i' F_t + e_{it} \] where \(\lambda_i\) and \(F_t\) are \(R\) dimensional vectors.
Assume: Unconfoundedness conditional on unobserved heterogeneity (i.e., this implies “groups” can be used as instruments):
\[ \E[Y_{t}(0) |\eta, \lambda, G] = \E[Y_{t}(0) |\eta, \lambda] \quad \text{a.s.} \]
An implication of the interactive fixed effects model together with this unconfoundedness condition is that
\[ \E[e_t |\eta, \lambda, G] = 0 \]
which we use below as a source of moment conditions to identify parameters from the interactive fixed effects model.
Similar to earlier case:
\[ \begin{aligned} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \underbrace{\E[Y_t(0) - Y_{g-1}(0) | G=g]}_{\textrm{need to figure out}} \\ \end{aligned} \]
Using similar differencing arguments as before, one can show:
\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]
where
so that
For \(g' \in \mathcal{G}^{comp}(g,t)\), we use moment conditions of the form
\[ 0 = \E\Big[\indicator{G=g'} v(g,t)\Big]\]
Stacking the above moment conditions, we have that
\[ \mathbf{0}_{|\mathcal{G}^{comp}(g,t)|} = \E\left[ \ell^{comp}(g,t) \left\{ \Big( Y_{t} - Y_{g-1}\Big) - \Big(\theta^*(g,t) + \widetilde{\Delta Y}^{pre(g)\prime} F^*(g,t) \Big) \right\} \right] \]
where \(\ell^{comp}(g,t)\) is a vector of indicators for groups that have not yet been treated by period \(t\).
Since we are using groups as IVs, identification hinges on relevance:
\[ \textrm{Rank}\Big(\mathbf{\Gamma}(g,t)\Big) = R + 1 \]
where
\[ \mathbf{\Gamma}(g,t) := \E\left[ \ell^{comp}(g,t) \begin{pmatrix} 1 \\ \widetilde{\Delta Y}^{pre(g)} \end{pmatrix}' \right] \]
As before, you can relate the relevance condition to conditions on \(\lambda_i\) and \(F_t\).
Theorem: Identification
For some group \(g \in \mathcal{G}^\dagger\) and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\), where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\), under the maintained assumptions,
\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \Big( \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \mathbf{\Gamma}(g,t) \Big)^{-1} \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]
In addition, \(ATT(g,t)\) is identified, and it is given by:
\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \Big( \theta^*(g,t) + F^*(g,t)'\E[\widetilde{\Delta Y}^{pre(g)} | G=g] \Big) \]
Estimation proceeds in two steps and is constructive given identification results. The first step is to estimate \(\theta^*(g,t)\) and \(F^*(g,t)\):
Given a positive definite matrix \(\widehat{\mathbf{W}}(g,t)\), the estimator of \(\delta^*(g,t) := (\theta^*(g,t), F^*(g,t)')'\) is:
\[ \widehat{\delta}^*(g,t) = \left( \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t)\widehat{\mathbf{\Gamma}}(g,t) \right)^{-1} \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t) \E_n\big[\ell^{comp}_i(g,t)(Y_{it} - Y_{ig-1})\big] \]
Second step, plug into sample analog of expression for \(ATT(g,t)\):
\[ \widehat{ATT}(g,t) = \hat{p}_g^{-1} \left\{ \E_n\Big[\indicator{G_i=g}(Y_{it} - Y_{ig-1})\Big] - \E_n\Big[A_i(g)\Big]^\prime \widehat{\delta}^*(g,t) \right\} \]
where
\[ A_i(g) := \indicator{G_i=g}\begin{pmatrix}1 \\ \widetilde{\Delta Y}_i^{pre(g)} \end{pmatrix} \]
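A minimal simulation sketch of this two-step estimator for \(ATT(3,3)\) in the \(\T = 4\), one-IFE case (the DGP, all parameter values, and the identity weight matrix are assumptions for illustration; this is not the ife package implementation):

```python
import numpy as np

# Two-step GMM sketch for ATT(3,3) with one IFE, groups {3, 4, inf},
# identity weight matrix, and a made-up DGP.
rng = np.random.default_rng(0)
n = 60_000
theta = {1: 0.0, 2: 0.5, 3: 1.3}
F = {1: 0.2, 2: 1.0, 3: 1.7}
lam_mean = {3.0: 1.0, 4.0: 0.4, np.inf: -0.2}
tau = 2.0  # true ATT(3,3)

G = rng.choice([3.0, 4.0, np.inf], size=n)
lam = np.array([lam_mean[g] for g in G]) + 0.5 * rng.standard_normal(n)
eta = rng.standard_normal(n)
Y = {t: theta[t] + eta + lam * F[t] + 0.1 * rng.standard_normal(n)
     + tau * ((G == 3.0) & (t >= 3))
     for t in (1, 2, 3)}

dY2, dY3 = Y[2] - Y[1], Y[3] - Y[2]

# Step 1: delta* = (theta_3*, F_3*)' from the comparison-group moments
ell = np.column_stack([G == 4.0, G == np.inf]).astype(float)  # l^comp(3,3)
X = np.column_stack([np.ones(n), dY2])                        # (1, dY^pre)'
Gamma_hat = ell.T @ X / n
delta_hat = np.linalg.solve(Gamma_hat, ell.T @ dY3 / n)       # just-identified

# Step 2: plug into the sample analog of ATT(3,3)
treated = (G == 3.0)
A_bar = (treated[:, None] * X).mean(axis=0)                   # E_n[A_i(3)]
att_hat = (np.mean(treated * dY3) - A_bar @ delta_hat) / treated.mean()
print(att_hat)  # close to tau in large samples
```

With only two comparison groups and \(R = 1\), the system is just-identified, so the choice of \(\widehat{\mathbf{W}}\) is irrelevant and the first step reduces to solving the \(2 \times 2\) system directly.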
If you want an event study or an overall average treatment effect, you can combine estimates across groups and time periods, following the same logic as in Callaway and Sant’Anna (2021).
Theorem: Asymptotic Normality
Under the maintained assumptions, for some group \(g \in \mathcal{G}^\dagger\) and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\), where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\),
\(\widehat{ATT}(g,t)\) is consistent and asymptotically normal, in particular, for each \((g,t)\):
\[\begin{aligned} \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) &= \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_{igt} + o_p(1) \\ & \xrightarrow{d} \mathcal{N}(0,\sigma_{gt}^2) \end{aligned}\]where \(\sigma_{gt}^2 = \E[\psi_{gt}^2]\).
\[ \begin{aligned} \textrm{Define: } \qquad \mathbf{\Lambda}^{comp}(g,t) := \E\Big[ \ell^{comp}(g,t) \begin{pmatrix} 1 & \lambda' \end{pmatrix} \Big] \quad \textrm{and} \quad \mathbf{\Delta F}^{pre(g)} := \begin{bmatrix} \Delta F_2' \\ \vdots \\ \Delta F_{g-1}' \end{bmatrix} \end{aligned} \]
where \(\mathbf{\Lambda}^{comp}(g,t)\) is a \(|\mathcal{G}^{comp}(g,t)| \times (R+1)\) matrix, and \(\mathbf{\Delta F}^{pre(g)}\) is a \((g-2) \times R\) matrix.
Proposition: Relevance
The rank condition for identification is equivalent to the following: \[ \textrm{Rank}\Big(\mathbf{\Lambda}^{comp}(g,t)\Big) = R + 1 \quad \textrm{and} \quad \textrm{Rank}\Big(\mathbf{\Delta F}^{pre(g)}\Big) = R \]
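For instance, with \(R = 1\), \(g = 3\), \(t = 3\), and comparison groups \(\{4, \infty\}\), these two rank conditions reduce to \(\E[\lambda|G=4] \neq \E[\lambda|G=\infty]\) and \(\Delta F_2 \neq 0\). A small numerical check (all values hypothetical):

```python
import numpy as np

# Relevance check for R = 1, g = 3, t = 3 with comparison groups {4, inf}.
# lam_bar values and dF2 are made up for illustration.
R = 1
lam_bar = {4: 0.4, "inf": -0.2}   # E[lambda | G=g] for the comparison groups
dF2 = 0.8                          # Delta F_2

Lambda_comp = np.array([[1.0, lam_bar[4]],
                        [1.0, lam_bar["inf"]]])   # |G^comp| x (R + 1)
dF_pre = np.array([[dF2]])                        # (g - 2) x R

ok = bool(np.linalg.matrix_rank(Lambda_comp) == R + 1
          and np.linalg.matrix_rank(dF_pre) == R)
print(ok)  # relevance holds for these values
```

If instead the two comparison groups had equal mean \(\lambda\) (or \(\Delta F_2 = 0\)), `Lambda_comp` (or `dF_pre`) would drop rank and the check would fail.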
[Return]