University of Georgia
University of Alabama
September 24, 2025
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} } \newcommand{\T}{T} \newcommand{\ATT}{\text{ATT}}\) Setting of the paper: Panel data causal inference with staggered treatment adoption
Main idea: Exploit having access to “extra” periods and comparison groups to substantially weaken auxiliary assumptions (like parallel trends) that are common in this setting
Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on \(\underbrace{\textrm{earnings}}_{\textrm{outcome}}\)
1. Motivation
2. Identification
3. Application
Research Design: The setting that the researcher will use to estimate causal effects.
Staggered adoption research design:
This research design is a key distinguishing feature of modern approaches to panel data causal inference relative to traditional panel data models
Identification Strategy: A target parameter and set of assumptions that allow the researcher to recover the target parameter ➡
IV and RD are closely connected to natural experiments where the assignment of treatment, though not controlled by the researcher, is (usually locally) randomly assigned.
This implies that
Panel data causal inference methods are often used in settings where there is no explicit natural experiment:
This implies that
1. Availability
2. Allow for within-unit comparisons
3. Allow for selection on unobservables
4. Pre-testing [Event Study Plot]
Treatment timing can vary across units, but once a unit becomes treated, it remains treated in subsequent periods
Many of the insights of recent work on DiD have been in the context of staggered treatment adoption
In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters
Observed data: \(\{Y_{i1}, Y_{i2}, \ldots, Y_{i\T}, D_{i1}, D_{i2}, \ldots, D_{i\T}\}_{i=1}^n\)
Setup is exactly the same as DiD with staggered treatment adoption!
Following Callaway and Sant’Anna (2021), we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]
\(ATT(g,t)\) is the average treatment effect for group \(g\) in time period \(t\)
Group-time average treatment effects are the natural building block for other common target parameters in DiD applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)
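As a rough sketch of how group-time effects can be combined (the numbers and the simple group-size weighting below are hypothetical, in the spirit of Callaway and Sant'Anna (2021) rather than their exact implementation):

```python
# Sketch: aggregate hypothetical ATT(g,t) values into an event study.
# att[(g, t)] and group_size[g] are made-up numbers for illustration.
att = {(3, 3): 1.0, (3, 4): 1.4, (4, 4): 0.8}
group_size = {3: 100, 4: 300}

def event_study(e):
    """Average ATT(g, g + e) across groups, weighted by group size."""
    cells = [(g, att[(g, t)], group_size[g])
             for (g, t) in att if t - g == e]
    total = sum(w for (_, _, w) in cells)
    return sum(a * w for (_, a, w) in cells) / total

print(event_study(0))  # instantaneous effect (e = 0), pooled across groups
print(event_study(1))  # effect one period after treatment
```

Here `event_study(0)` pools \(ATT(3,3)\) and \(ATT(4,4)\), while `event_study(1)` only uses group 3, since group 4 has no second post-treatment period in this example.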
Latent Unconfoundedness
\[ Y_{it}(0) \independent G_i | \xi_i \]
See Gobillon and Magnac (2016); Gardner (2020); Arkhangelsky and Imbens (2022); Callaway and Karami (2023), among others
Intuition:
Latent unconfoundedness implies that we can write
\[ Y_{it}(0) = h_t(\xi_i) + e_{it} \quad \textrm{where} \quad \E[e_{it} | \xi_i, G_i] = 0 \]
This is a hard model to make progress with because \(h_t(\cdot)\) is completely unrestricted, but often we think of approximations
\[h_t(\xi_i) = \theta_t + \xi_i \qquad \implies \text{DiD}\]
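To see why this specialization delivers DiD, note that time-differencing eliminates the unobserved heterogeneity (using \(\E[e_{it} | \xi_i, G_i] = 0\) from above):

```latex
% With h_t(\xi_i) = \theta_t + \xi_i:
\E[\Delta Y_{t}(0) \mid G = g]
  = \underbrace{(\theta_t - \theta_{t-1})}_{=\,\Delta\theta_t}
  + \underbrace{\E[\xi - \xi \mid G = g]}_{=\,0}
  + \underbrace{\E[\Delta e_{t} \mid G = g]}_{=\,0}
  = \Delta\theta_t
```

which does not depend on \(g\), i.e., parallel trends.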
We will go for more general approximations to \(h_t(\xi_i)\), for \(\xi_i = (\eta_i', \lambda_i')'\):
\[h_t(\xi_i) = \theta_t + \eta_i + \lambda_i'F_t + r_{it}\]
where \(r_{it}\) is approximation error.
Terminology:
In our context, though, it makes sense to interpret these as
[Unit-specific linear trends] [IFE and violations of parallel trends]
Some comments:
🧩 In a setting with fixed-\(T\) and a fixed number of comparison groups, there are going to be limits on how complex we can make the approximation
🧩 The dimension of \(\xi_i\) could be fairly large
💡 The dimension of \(\eta_i\) can be high without much cost
💡 We do not necessarily need to “control for” every component of \(\lambda_i\) that affects the outcome, just the ones that are imbalanced across groups
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
IFE Model:
Target:
\(Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it}\)
\(ATT(g=3,t=3) = \E[\Delta Y_3 | G=3] - \underbrace{\color{#BA0C2F}{\E[\Delta Y_3(0) | G=3]}}_{\textrm{have to figure out}}\)
Using a quasi-differencing argument, one can show that
\[ \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + v_{i3} \]
where \(\theta_3^*\) and \(F_3^*\) are functions of the parameters \(\theta_t\) and \(F_t\), and \(v_{i3}\) is a function of \(e_{it}\).
Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,
\[\begin{align*} \color{#BA0C2F}{\E[\Delta Y_3(0) | G=3]} = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_2(0) | G = 3]}_{\textrm{identified}} + \underbrace{\E[v_3|G=3]}_{=0} \end{align*}\]
\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).
Also, note that \(\color{#BA0C2F}{\E[\Delta Y_3(0) | G = 3]}\) is a linear combination of \(1\) and \(\E[\Delta Y_2 | G=3]\).
Recall:
\[ \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \]
Some issues:
📝 Expression involves untreated potential outcomes through period 3 \(\implies\) Only groups 4 and \(\infty\) are useful for recovering \(\theta_3^*\) and \(F_3^*\)
🤔 \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) by construction \(\implies\) We need some exogenous variation to recover the parameters
There are a number of different ideas here:
In particular, notice that, because we have two distinct untreated groups in period 3 (group 4 and group \(\infty\)), we have two moment conditions:
\[\begin{align*} \E[\Delta Y_3(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=4] \\ \E[\Delta Y_3(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=\infty] \\ \end{align*}\]
We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_3|G=\infty] - \E[\Delta Y_3|G=4]}{\E[\Delta Y_2|G=\infty] - \E[\Delta Y_2|G=4]} \\[10pt] \theta_3^* &= \E[\Delta Y_3 | G=4] - F_3^* \E[\Delta Y_2 | G=4] \end{align*}\]
\(\implies\) we can recover \(ATT(3,3)\).
This strategy amounts to using “group” as an instrument for \(\Delta Y_{i2}(0)\).
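A quick numerical sanity check of this argument at the population level, using made-up parameter values for the one-IFE model (the group means \(\bar\lambda_g\), the \(\theta_t\), \(F_t\), and the treatment effect \(\tau\) below are all hypothetical):

```python
# Check that the closed-form (theta_3*, F_3*) recovers ATT(3,3)
# in a one-IFE model. All parameter values are made up.
lam_bar = {3: 1.0, 4: 0.4, "inf": -0.2}   # E[lambda | G = g]
theta = {1: 0.0, 2: 0.5, 3: 1.3}          # time effects theta_t
F = {1: 0.2, 2: 1.0, 3: 1.7}              # factor F_t
tau = 2.0                                  # true ATT(3,3)

# Population mean trends: E[dY_t(0) | G=g] = d(theta_t) + lam_bar[g] * d(F_t)
def dY0(g, t):
    return (theta[t] - theta[t - 1]) + lam_bar[g] * (F[t] - F[t - 1])

# Observed trends (group 3 becomes treated in period 3)
dY2 = {g: dY0(g, 2) for g in lam_bar}
dY3 = {g: dY0(g, 3) + (tau if g == 3 else 0.0) for g in lam_bar}

# Solve the two moment conditions from comparison groups 4 and "inf"
F3_star = (dY3["inf"] - dY3[4]) / (dY2["inf"] - dY2[4])
theta3_star = dY3[4] - F3_star * dY2[4]

att_33 = dY3[3] - (theta3_star + F3_star * dY2[3])
print(att_33)  # recovers the true effect tau
```

With these values, \(F_3^* = \Delta F_3 / \Delta F_2 = 0.875\), and the last line reproduces \(\tau\).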
Condition 1: Relevance \(\quad \E[\Delta Y_2(0) | G=4] \neq \E[\Delta Y_2(0) | G=\infty]\)
For relevance to hold, the following two "more primitive" conditions both need to hold: (i) \(\E[\lambda | G=4] \neq \E[\lambda | G=\infty]\) and (ii) \(\Delta F_2 \neq 0\), since \(\E[\Delta Y_2(0) | G=4] - \E[\Delta Y_2(0) | G=\infty] = \big(\E[\lambda|G=4] - \E[\lambda|G=\infty]\big)\Delta F_2\).
Otherwise, \(G = 4\) and \(G = \infty\) have the same trend between the first two periods.
Condition 2: Exogeneity
Group is uncorrelated with \(r_{it} + e_{it}\)
Can’t directly test exogeneity, but a lot of the DiD infrastructure carries over
There are additional complications for making this work in realistic applications:
1. How do we know how many IFEs there are?
2. Does having more groups/periods help?
3. Are there testable implications in general settings?
where \(\widetilde{\Delta}_{g,t} = \E[\Delta Y_t(0) | G=g]\); collect these trends into a matrix with rows indexed by group (\(3, 4, \infty\)) and columns by period, where the entry \(\Huge \star\) \(= \widetilde{\Delta}_{3,3}\) is the unobserved counterfactual trend
if we know \(\Huge \star\) then we can recover \(ATT(3,3)\)
if this matrix has reduced rank, then we can fill in \(\Huge \star\) and recover \(ATT(3,3)\)
Check the rank of the observed sub-matrix (the rows for groups 4 and \(\infty\)) and assume that the full matrix has the same rank ➡️
Case 1: If this rank = 2 \(\implies\) 1 IFE
Case 2: If this rank = 1 \(\implies\) 0 IFEs (i.e., parallel trends)
If the rank of the sub-matrix and of the full matrix are both 1, then the comparison groups' trends pin down the row for group 3, and we can fill in \(\Huge \star\) from the observed entries
You can see these rank conditions in pre-treatment periods in group-specific event studies
The same idea extends to later groups: if the matrix of group-time trends has reduced rank, then we can fill in \(\Huge \star\) \(= \widetilde{\Delta}_{5,5}\) and recover \(ATT(5,5)\)
Check the rank of the observed sub-matrix and assume that the full matrix has the same rank ➡️
Case 1: If this rank = 4 \(\implies\) 3 IFEs
Case 2: If this rank = 3 \(\implies\) 2 IFEs
Case 3: If this rank = 2 \(\implies\) 1 IFE
Case 4: If this rank = 1 \(\implies\) 0 IFEs (i.e., parallel trends)
Testable implications arise when the rank is 4, 3, or 2: the assumed rank restricts the observed entries of the matrix over and above what is needed to fill in \(\Huge \star\)
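The rank-to-IFE mapping can be seen numerically: under an \(R\)-IFE model, \(\widetilde{\Delta}_{g,t} = \Delta\theta_t + \E[\lambda|G=g]'\Delta F_t\), so the matrix of trends has rank at most \(R + 1\). A small sketch with made-up random values:

```python
import numpy as np

# Matrix of trends implied by an R-IFE model:
#   Delta = 1 * dtheta' + Lam_bar @ dF'
# (all numbers below are randomly generated for illustration)
R = 2
n_groups, n_periods = 6, 5
rng = np.random.default_rng(1)

dtheta = rng.standard_normal(n_periods)        # common time trends
Lam_bar = rng.standard_normal((n_groups, R))   # E[lambda | G=g], one row per group
dF = rng.standard_normal((n_periods, R))       # factor changes Delta F_t

Delta = np.ones((n_groups, 1)) @ dtheta[None, :] + Lam_bar @ dF.T

# Generically, rank = R + 1: one dimension for the common trend, R for the factors
print(np.linalg.matrix_rank(Delta))
```

This is why observing rank \(r\) in the comparison-group rows is consistent with \(r - 1\) interactive fixed effects.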
Intuition:
Examples:
If the number of interactive fixed effects is known
If the number of interactive fixed effects is unknown
The next set of results include one interactive fixed effect
Additional Comments:
Comments welcome: brantly.callaway@uga.edu
Code: staggered_ife2 function in the ife package in R, available at github.com/bcallaway11/ife
[Back]
IFE model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]
Special Case: \(F_t = t\) \(\implies\) unit-specific linear trends
[Back]
Interactive fixed effects models allow for violations of parallel trends:
\[ \begin{aligned} Y_{it}(0) &= \theta_t + \eta_i + \lambda_i F_t + e_{it} \\ \end{aligned} \]
\[ \implies \E[\Delta Y_{t}(0) | G = g] = \Delta \theta_t + \E[\lambda|G=g]\Delta F_t \]
which can vary across groups.
Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to see larger increases in outcomes over time than lower-ability groups
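Concretely, the gap in untreated trends between two groups is proportional to the factor change (symbols as in the model above):

```latex
\E[\Delta Y_t(0) \mid G = \text{high}] - \E[\Delta Y_t(0) \mid G = \text{low}]
  = \big(\E[\lambda \mid G=\text{high}] - \E[\lambda \mid G=\text{low}]\big)\,\Delta F_t
```

which is positive whenever the high-ability group has larger mean \(\lambda\) and \(F_t\) is increasing.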
[Back]
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
\[Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it}\]
In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\]
The second equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
From last slide, combining terms we have that
\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
[Back]
Insight: Similar to previous case except there are fewer “available” comparison groups for later time periods. [Back]
Insight: Fewer available pre-treatment periods limit the number of IFEs we can accommodate, though there are testable implications here (due to the large number of available comparison groups). [Back]
Insight: Only one comparison group available in period 9, so we can only accommodate 0 IFEs (i.e., parallel trends), though there are testable implications (due to the large number of pre-treatment periods). [Back]
Interactive fixed effects for untreated potential outcomes:
\[ Y_{it}(0) = \theta_t + \eta_i + \lambda_i' F_t + e_{it} \] where \(\lambda_i\) and \(F_t\) are \(R\) dimensional vectors.
Assume: Unconfoundedness conditional on unobserved heterogeneity (i.e., this implies “groups” can be used as instruments):
\[ \E[Y_{t}(0) |\eta, \lambda, G] = \E[Y_{t}(0) |\eta, \lambda] \quad \text{a.s.} \]
An implication of the interactive fixed effects model together with this unconfoundedness condition is that
\[ \E[e_t |\eta, \lambda, G] = 0 \]
which we use below as a source of moment conditions to identify parameters from the interactive fixed effects model.
Similar to earlier case:
\[ \begin{aligned} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \underbrace{\E[Y_t(0) - Y_{g-1}(0) | G=g]}_{\textrm{need to figure out}} \\ \end{aligned} \]
Using similar differencing arguments as before, one can show:
\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]
where
so that
For \(g' \in \mathcal{G}^{comp}(g,t)\), we use moment conditions of the form
\[ 0 = \E\Big[\indicator{G=g'} v(g,t)\Big]\]
Stacking the above moment conditions, we have that
\[ \mathbf{0}_{|\mathcal{G}^{comp}(g,t)|} = \E\left[ \ell^{comp}(g,t) \left\{ \Big( Y_{t} - Y_{g-1}\Big) - \Big(\theta^*(g,t) + \widetilde{\Delta Y}^{pre(g)\prime} F^*(g,t) \Big) \right\} \right] \]
where \(\ell^{comp}(g,t)\) is a vector of indicators for groups that have not yet been treated by period \(t\).
Since we are using groups as IVs, identification hinges on relevance:
\[ \textrm{Rank}\Big(\mathbf{\Gamma}(g,t)\Big) = R + 1 \]
where
\[ \mathbf{\Gamma}(g,t) := \E\left[ \ell^{comp}(g,t) \begin{pmatrix} 1 \\ \widetilde{\Delta Y}^{pre(g)} \end{pmatrix}' \right] \]
As before, you can relate the relevance condition to conditions on \(\lambda_i\) and \(F_t\).
Theorem: Identification
For some group \(g \in \mathcal{G}^\dagger\) and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\), where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\), under the maintained assumptions,
\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \Big( \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \mathbf{\Gamma}(g,t) \Big)^{-1} \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]
In addition, \(ATT(g,t)\) is identified, and it is given by:
\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \Big( \theta^*(g,t) + F^*(g,t)'\E[\widetilde{\Delta Y}^{pre(g)} | G=g] \Big) \]
Estimation proceeds in two steps and is constructive given identification results. The first step is to estimate \(\theta^*(g,t)\) and \(F^*(g,t)\):
Given a positive definite matrix \(\widehat{\mathbf{W}}(g,t)\), the estimator of \(\delta^*(g,t) := (\theta^*(g,t), F^*(g,t)')'\) is:
\[ \widehat{\delta}^*(g,t) = \left( \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t)\widehat{\mathbf{\Gamma}}(g,t) \right)^{-1} \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t) \E_n\big[\ell^{comp}_i(g,t)(Y_{it} - Y_{ig-1})\big] \]
Second step, plug into sample analog of expression for \(ATT(g,t)\):
\[ \widehat{ATT}(g,t) = \hat{p}_g^{-1} \left\{ \E_n\Big[\indicator{G_i=g}(Y_{it} - Y_{ig-1})\Big] - \E_n\Big[A_i(g)\Big]^\prime \widehat{\delta}^*(g,t) \right\} \]
where
\[ A_i(g) := \indicator{G_i=g}\begin{pmatrix}1 \\ \widetilde{\Delta Y}_i^{pre(g)} \end{pmatrix} \]
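A minimal simulation sketch of this two-step estimator for \(ATT(3,3)\) in the \(\T = 4\), one-IFE case (the DGP, all parameter values, and the identity weight matrix are assumptions for illustration; this is not the ife package implementation):

```python
import numpy as np

# Two-step GMM sketch for ATT(3,3) with one IFE, groups {3, 4, inf},
# identity weight matrix, and a made-up DGP.
rng = np.random.default_rng(0)
n = 60_000
theta = {1: 0.0, 2: 0.5, 3: 1.3}
F = {1: 0.2, 2: 1.0, 3: 1.7}
lam_mean = {3.0: 1.0, 4.0: 0.4, np.inf: -0.2}
tau = 2.0  # true ATT(3,3)

G = rng.choice([3.0, 4.0, np.inf], size=n)
lam = np.array([lam_mean[g] for g in G]) + 0.5 * rng.standard_normal(n)
eta = rng.standard_normal(n)
Y = {t: theta[t] + eta + lam * F[t] + 0.1 * rng.standard_normal(n)
     + tau * ((G == 3.0) & (t >= 3))
     for t in (1, 2, 3)}

dY2, dY3 = Y[2] - Y[1], Y[3] - Y[2]

# Step 1: delta* = (theta_3*, F_3*)' from the comparison-group moments
ell = np.column_stack([G == 4.0, G == np.inf]).astype(float)  # l^comp(3,3)
X = np.column_stack([np.ones(n), dY2])                        # (1, dY^pre)'
Gamma_hat = ell.T @ X / n
delta_hat = np.linalg.solve(Gamma_hat, ell.T @ dY3 / n)       # just-identified

# Step 2: plug into the sample analog of ATT(3,3)
treated = (G == 3.0)
A_bar = (treated[:, None] * X).mean(axis=0)                   # E_n[A_i(3)]
att_hat = (np.mean(treated * dY3) - A_bar @ delta_hat) / treated.mean()
print(att_hat)  # close to tau in large samples
```

With only two comparison groups and \(R = 1\), the system is just-identified, so the choice of \(\widehat{\mathbf{W}}\) is irrelevant and the first step reduces to solving the \(2 \times 2\) system directly.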
If you want an event study or an overall average treatment effect, you can combine estimates across groups and time periods, following the same logic as in Callaway and Sant’Anna (2021).
Theorem: Asymptotic Normality
Under the maintained assumptions, for some group \(g \in \mathcal{G}^\dagger\) and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\), where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\),
\(\widehat{ATT}(g,t)\) is consistent and asymptotically normal, in particular, for each \((g,t)\):
\[\begin{aligned} \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) &= \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_{igt} + o_p(1) \\ & \xrightarrow{d} \mathcal{N}(0,\sigma_{gt}^2) \end{aligned}\]where \(\sigma_{gt}^2 = \E[\psi_{gt}^2]\).
\[ \begin{aligned} \textrm{Define: } \qquad \mathbf{\Lambda}^{comp}(g,t) := \E\Big[ \ell^{comp}(g,t) \begin{pmatrix} 1 & \lambda' \end{pmatrix} \Big] \quad \textrm{and} \quad \mathbf{\Delta F}^{pre(g)} := \begin{bmatrix} \Delta F_2' \\ \vdots \\ \Delta F_{g-1}' \end{bmatrix} \end{aligned} \]
where \(\mathbf{\Lambda}^{comp}(g,t)\) is a \(|\mathcal{G}^{comp}(g,t)| \times (R+1)\) matrix, and \(\mathbf{\Delta F}^{pre(g)}\) is a \((g-2) \times R\) matrix.
Proposition: Relevance
The rank condition for identification is equivalent to the following: \[ \textrm{Rank}\Big(\mathbf{\Lambda}^{comp}(g,t)\Big) = R + 1 \quad \textrm{and} \quad \textrm{Rank}\Big(\mathbf{\Delta F}^{pre(g)}\Big) = R \]
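For instance, with \(R = 1\), \(g = 3\), \(t = 3\), and comparison groups \(\{4, \infty\}\), these two rank conditions reduce to \(\E[\lambda|G=4] \neq \E[\lambda|G=\infty]\) and \(\Delta F_2 \neq 0\). A small numerical check (all values hypothetical):

```python
import numpy as np

# Relevance check for R = 1, g = 3, t = 3 with comparison groups {4, inf}.
# lam_bar values and dF2 are made up for illustration.
R = 1
lam_bar = {4: 0.4, "inf": -0.2}   # E[lambda | G=g] for the comparison groups
dF2 = 0.8                          # Delta F_2

Lambda_comp = np.array([[1.0, lam_bar[4]],
                        [1.0, lam_bar["inf"]]])   # |G^comp| x (R + 1)
dF_pre = np.array([[dF2]])                        # (g - 2) x R

ok = bool(np.linalg.matrix_rank(Lambda_comp) == R + 1
          and np.linalg.matrix_rank(dF_pre) == R)
print(ok)  # relevance holds for these values
```

If instead the two comparison groups had equal mean \(\lambda\) (or \(\Delta F_2 = 0\)), `Lambda_comp` (or `dF_pre`) would drop rank and the check would fail.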
[Return]