Treatment Effects in Staggered Adoption Designs with Non-Parallel Trends

Brantly Callaway

University of Georgia

Emmanuel Tsyawo

FGSES, Université Mohammed VI Polytechnique

November 19, 2023


\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

Setting of the paper: A researcher is interested in learning about the causal effect of a binary treatment and has access to a few periods of panel data

  • In economics, by far the most common approach in this setting is to use difference-in-differences (DID).

In the current paper, we will think about:

  • Cases where the parallel trends assumption could be violated

  • Applications where there is staggered treatment adoption

  • How to exploit staggered treatment adoption to allow for violations of parallel trends while still recovering the same target causal effect parameters


  • Parallel trends assumption: \(\E[\Delta Y_t(0) | D=1] = \E[\Delta Y_t(0) | D=0]\)

  • DID is different from other quasi-experimental (e.g., random assignment, IV, RD) approaches to policy evaluation in that it inherently relies on functional form assumptions \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + e_{it} \end{align*}\] where the distribution of \(\eta_i\) can differ arbitrarily across groups, but \(\E[e_{it}|\eta_i, D] = \E[e_{it} | \eta_i] = 0\)
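To see how this model delivers parallel trends, take first differences: \(\eta_i\) drops out and, by iterated expectations, \(\E[\Delta e_{it} | D_i = d] = 0\), so \[\begin{align*} \E[\Delta Y_{it}(0) | D_i = d] = \Delta \theta_t + \E[\Delta e_{it} | D_i = d] = \Delta \theta_t \quad \textrm{for } d \in \{0,1\} \end{align*}\] which is the same for both groups. The step relies entirely on \(\eta_i\) entering additively, separately from the time effect.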


  • That the model for \(Y_{it}(0)\) depends on time and unit-specific unobserved heterogeneity is in line with a long history of economic models, but the additive separability between time effects and unobserved heterogeneity is often harder to justify

    • And, in (probably most) applications, we simply do not know if it is reasonable or not
  • Therefore, most DID applications in economics include an event study plot that checks parallel trends in pre-treatment periods. This is implicitly a test of the additive separability in the previous model for untreated potential outcomes.

Deryugina (2017), Y: gov’t transfers, D: hurricane

Callaway and Sant’Anna (2021), Y: employment, D: min. wage

Some ideas

  • Allow for certain violations of parallel trends (\(\implies\) bounds on causal effect parameters) often connected to the magnitude of the violations of parallel trends in pre-treatment periods (Manski and Pepper 2018; Rambachan and Roth 2023; Ban and Kedagni 2022)

  • Consider alternative model for untreated potential outcomes

    • That’s what we will do in this paper!

    • Using (arguably) the most naturally connected approach to DID, interactive fixed effects (IFE)

    • (I think) IFE is closely connected to the ways that “bounding approaches” allow for violations of parallel trends…

  • Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on a \(\underbrace{\textrm{person's earnings}}_{\textrm{outcome}}\)

IFE Model for Untreated Potential Outcomes

An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]

  • \(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)

  • \(F_t\) is often referred to as a “factor”

  • \(e_{it}\) is idiosyncratic in the sense that \(\E[e_{it} | \eta_i, \lambda_i, D_i] = 0\)

In our context, though, it makes sense to interpret these as

  • \(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved “ability”)

  • \(F_t\) the time-varying “return” to unobserved heterogeneity (e.g., return to “ability”)

Interactive Fixed Effects

Interactive fixed effects models allow for violations of parallel trends:

\[\begin{align*} \E[\Delta Y_{it}(0) | D_i = d] = \Delta \theta_t + \E[\lambda_i|D_i=d]\Delta F_t \end{align*}\] which can vary across groups.

Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to see their outcomes increase more over time than groups with lower mean “ability”
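A minimal numeric sketch of the example above. The function name and all numbers are hypothetical and chosen only for illustration; the function simply evaluates the displayed expression for the group-specific trend.

```python
def expected_untreated_trend(delta_theta, mean_loading, delta_factor):
    """E[dY_t(0) | D = d] = d_theta_t + E[lambda | D = d] * d_F_t under the IFE model."""
    return delta_theta + mean_loading * delta_factor


# Hypothetical values: common time trend of 1.0, factor ("return to ability") rises by 0.5.
delta_theta, delta_factor = 1.0, 0.5

trend_high = expected_untreated_trend(delta_theta, mean_loading=2.0, delta_factor=delta_factor)
trend_low = expected_untreated_trend(delta_theta, mean_loading=1.0, delta_factor=delta_factor)

# The gap is (difference in mean loadings) * (change in the factor).
parallel_trends_gap = trend_high - trend_low
print(trend_high, trend_low, parallel_trends_gap)
```

With no change in the factor (`delta_factor = 0`) the two trends coincide, which is the sense in which the IFE model nests the parallel trends case.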

Special Cases:

  • Observed \(\lambda_i \implies\) regression adjustment

  • \(F_t = t \implies\) unit-specific linear trend

But allowing \(F_t\) to vary arbitrarily is harder…

Staggered Treatment Adoption

Many of the insights of recent work on DID have been in the context of staggered treatment adoption

  • \(\implies\) there is variation in treatment timing across units

  • de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021), among others

  • These papers all treat staggered treatment adoption as a nuisance, and

    • Show limitations of two-way fixed effects regressions for implementing DID identification strategies
    • Provide alternative estimation strategies
  • In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters

Notation / Data / Setup

Observed data: \(\{Y_{i1}, Y_{i2}, \ldots, Y_{i\mathcal{T}}, D_{i1}, D_{i2}, \ldots, D_{i\mathcal{T}}\}_{i=1}^n\)

  • \(\mathcal{T}\) time periods

  • No one treated in the first time period (i.e., \(D_{i1} = 0\))

  • Staggered treatment adoption: for \(t=2,\ldots,\mathcal{T}\), \(D_{it-1} = 1 \implies D_{it}=1\).

  • A unit’s group \(G_i\) is the time period when it becomes treated. By convention, set \(G_i = \infty\) for units that do not participate in the treatment in any period.

  • Potential outcomes: \(Y_{it}(g)\); \(Y_{it}(\infty)\) is the untreated potential outcome (also written \(Y_{it}(0)\) on other slides)

  • Observed outcomes: \(Y_{it} = Y_{it}(G_i)\)

  • No anticipation: For \(t < G_i\), \(Y_{it} = Y_{it}(\infty)\)
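The setup above can be sketched in a few lines of code. This is an illustration under the stated conventions (periods indexed from 1, \(G_i = \infty\) for never-treated units); the function names are hypothetical and are not taken from any package.

```python
import math


def adoption_group(path):
    """Return G_i for a 0/1 treatment path (D_i1, ..., D_iT)."""
    if path[0] != 0:
        raise ValueError("no unit may be treated in the first period")
    for prev, curr in zip(path, path[1:]):
        if prev == 1 and curr == 0:
            raise ValueError("treatment must be absorbing (staggered adoption)")
    # Periods are 1-indexed, so the first treated period is list index + 1.
    return path.index(1) + 1 if 1 in path else math.inf


def observes_untreated_outcome(group, t):
    """Under no anticipation, Y_it equals the untreated potential outcome when t < G_i."""
    return t < group


print(adoption_group([0, 0, 1, 1]), adoption_group([0, 0, 0, 0]))
```

For example, a unit with path `[0, 0, 0, 1]` belongs to group 4, so its observed outcomes in periods 1 through 3 are untreated potential outcomes.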

Target Parameters

Following CS-2021, we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_{it}(g) - Y_{it}(\infty) | G=g] \end{align*}\]

  • Group-time average treatment effects are the natural building block for other common target parameters in DID applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)

  • Back to DID for a moment: Under parallel trends, \[\begin{align*} ATT(g,t) = \E[Y_{it} - Y_{ig-1} | G=g] - \E[Y_{it} - Y_{ig-1} | G=g'] \end{align*}\] for any \(g' > t\) (i.e., any group that is not-yet-treated by period \(t\))

    • \(\implies\) \(ATT(g,t)\) is often over-identified (OI)
    • Marcus and Sant’Anna (2021) OI \(\rightarrow\) more efficiently estimate \(ATT(g,t)\)
    • This paper: OI \(\rightarrow\) relax parallel trends
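To make the over-identification concrete, here is a small sketch that computes the DID expression for \(ATT(g,t)\) from group means, once per admissible comparison group. The function name and the group means are hypothetical; the means are constructed so that untreated outcomes rise by 1 per period in every group (parallel trends holds).

```python
INF = float("inf")


def did_att(mean_y, g, t, comparison):
    """DID expression for ATT(g, t): base period g - 1, one not-yet-treated comparison group."""
    if not (t >= g and comparison > t):
        raise ValueError("need t >= g and a comparison group not yet treated by period t")
    change_treated = mean_y[g][t] - mean_y[g][g - 1]
    change_comparison = mean_y[comparison][t] - mean_y[comparison][g - 1]
    return change_treated - change_comparison


# Hypothetical group means by period, keyed by group then period.
mean_y = {
    3: {1: 1.0, 2: 2.0, 3: 8.0, 4: 9.0},
    4: {1: 0.5, 2: 1.5, 3: 2.5, 4: 6.0},
    INF: {1: 0.0, 2: 1.0, 3: 2.0, 4: 3.0},
}

# Two valid comparison groups for ATT(3,3): group 4 and the never-treated group.
estimates = {c: did_att(mean_y, g=3, t=3, comparison=c) for c in (4, INF)}
print(estimates)
```

Under parallel trends the two comparisons agree. When they disagree, that extra information is what can be used to relax parallel trends.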

Recovering \(ATT(g,t)\) with fixed-\(\mathcal{T}\) approaches

Very Simple Case:

  • \(\mathcal{T}=4\)

  • 3 groups: 3, 4, \(\infty\)

  • We will target \(ATT(3,3) = \E[\Delta Y_{i3} | G_i=3] - \underbrace{\E[\Delta Y_{i3}(0) | G_i=3]}_{\textrm{have to figure out}}\)

In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \end{align*}\]

The last equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\]

Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)

Fixed-\(\mathcal{T}\) approaches

From last slide, combining terms we have that

\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]

Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,

\[\begin{align*} \E[\Delta Y_{i3}(0) | G_i=3] = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_{i2}(0) | G_i = 3]}_{\textrm{identified}} + \underbrace{\E[v_{i3}|G_i=3]}_{=0} \end{align*}\]

\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).

How can we recover \(\theta_3^*\) and \(F_3^*\)?

Notice: this involves untreated potential outcomes through period 3, and we have groups 4 and \(\infty\) for which we observe these untreated potential outcomes. This suggests using those groups.

  • However, this is not so simple because \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) by construction (both contain \(\Delta e_{i2}\))

  • We need some exogenous variation (IV) to recover the parameters \(\rightarrow\)
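A rough calculation shows how the correlation arises. Assume, for illustration only (the slides do not impose this), that \(e_{it}\) is serially uncorrelated with common variance \(\sigma^2\). Since \(e_{it}\) is mean independent of \(\lambda_i\), the only terms that survive are \[\begin{align*} \cov(\Delta Y_{i2}(0), v_{i3}) &= \cov(\Delta e_{i2}, \Delta e_{i3}) - F_3^* \var(\Delta e_{i2}) \\ &= -\sigma^2 - 2 F_3^* \sigma^2 = -\sigma^2 (1 + 2 F_3^*) \end{align*}\] which is nonzero unless \(F_3^* = -1/2\). So, in general, a regression of \(\Delta Y_{i3}\) on \(\Delta Y_{i2}\) among untreated units does not recover \(F_3^*\).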

Fixed-\(\mathcal{T}\) approaches

There are a number of different ideas here:

Our Approach

In particular, notice that we have two distinct untreated groups in period 3, group 4 and group \(\infty\), and \(\E[v_{i3} | G_i = g] = 0\) for each of them. This gives two moment conditions:

\[\begin{align*} \E[\Delta Y_{i3}(0) | G_i=4] &= \theta_3^* + F_3^* \E[\Delta Y_{i2}(0) | G_i=4] \\ \E[\Delta Y_{i3}(0) | G_i=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_{i2}(0) | G_i=\infty] \end{align*}\]

Both groups are untreated through period 3, so their untreated potential outcomes are observed, and we can solve for \(F_3^*\) and \(\theta_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_{i3}|G_i=\infty] - \E[\Delta Y_{i3}|G_i=4]}{\E[\Delta Y_{i2}|G_i=\infty] - \E[\Delta Y_{i2}|G_i=4]} \\ \theta_3^* &= \E[\Delta Y_{i3}|G_i=4] - F_3^* \E[\Delta Y_{i2}|G_i=4] \end{align*}\]

\(\implies\) we can recover \(ATT(3,3)\).
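The identification argument can be checked numerically at the population level. The sketch below is an illustration, not the implementation in the ife package: the function name and all parameter values are hypothetical, and group means are generated directly from the IFE model with a true \(ATT(3,3)\) of 5. A sample analog would replace the group means with sample averages.

```python
INF = float("inf")


def staggered_ife_att33(d_y2, d_y3):
    """Recover F_3*, theta_3*, and ATT(3,3) from group-mean first differences.

    d_y2[g] and d_y3[g] hold E[dY_i2 | G = g] and E[dY_i3 | G = g] for g in {3, 4, INF}.
    """
    relevance_gap = d_y2[INF] - d_y2[4]
    if relevance_gap == 0:
        raise ZeroDivisionError("groups 4 and INF share the same pre-period trend")
    f_star = (d_y3[INF] - d_y3[4]) / relevance_gap
    theta_star = d_y3[4] - f_star * d_y2[4]
    imputed_untreated_change = theta_star + f_star * d_y2[3]
    return f_star, theta_star, d_y3[3] - imputed_untreated_change


# Hypothetical population values for the IFE model of untreated potential outcomes.
theta = {1: 0.0, 2: 1.0, 3: 2.0}
factor = {1: 0.0, 2: 1.0, 3: 3.0}
mean_eta = {3: 1.0, 4: 0.5, INF: 0.0}
mean_lambda = {3: 2.0, 4: 1.0, INF: 0.0}
true_att33 = 5.0


def mean_y(g, t):
    """Group mean of the observed outcome; only group 3 is treated in period 3."""
    untreated = theta[t] + mean_eta[g] + mean_lambda[g] * factor[t]
    return untreated + (true_att33 if (g == 3 and t == 3) else 0.0)


groups = (3, 4, INF)
d_y2 = {g: mean_y(g, 2) - mean_y(g, 1) for g in groups}
d_y3 = {g: mean_y(g, 3) - mean_y(g, 2) for g in groups}

f_star, theta_star, att33 = staggered_ife_att33(d_y2, d_y3)

# For comparison: DID with each comparison group, which ignores the factor structure.
did_vs_group4 = d_y3[3] - d_y3[4]
did_vs_never = d_y3[3] - d_y3[INF]
print(f_star, theta_star, att33, did_vs_group4, did_vs_never)
```

In this constructed example the two DID comparisons disagree with each other and with the truth, while the two moment conditions recover \(F_3^* = \Delta F_3 / \Delta F_2\) and the true effect.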

Additional Details about Identification

An important issue for IFE approaches is determining how many IFE terms there are (e.g., 0, 1, 2, …)

  • Interestingly, there is a tight link in our approach between this “model selection” and identification

    • In cases where parallel trends holds in pre-treatment periods \(\implies\) 0 IFEs, failure of relevance condition if we try to include 1 IFE

    • In cases where parallel trends is violated in pre-treatment periods (essentially) \(\implies\) (at least) 1 IFE, relevance condition holds for recovering \(ATT(g,t)\).

DID infrastructure related to pre-testing also carries over to our approach



  • Can scale this argument up for more periods, groups, and IFEs (see paper)

  • Relative to other approaches to dealing with IFEs:

    • We do not need a large number of periods or extra auxiliary assumptions

    • Only need there to be staggered treatment adoption

    • The main drawback is that we can’t recover as many \(ATT(g,t)\)’s; e.g., in this example, we can’t recover \(ATT(3,4)\) or \(ATT(4,4)\), which might be recoverable in other settings

  • Generality: we have talked about IFE models, but

    • There are other types of models that need extra moment conditions (e.g., dynamic panel data model for \(Y_{it}(0)\)), could use the same sort of idea there


Comments very welcome:

Code: staggered_ife2 function in ife package in R, available at



Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association 105 (490): 493–505.
Arkhangelsky, Dmitry, Susan Athey, David A Hirshberg, Guido W Imbens, and Stefan Wager. 2021. “Synthetic Difference-in-Differences.” American Economic Review 111 (12): 4088–118.
Bai, Jushan. 2009. “Panel Data Models with Interactive Fixed Effects.” Econometrica 77 (4): 1229–79.
Ban, Kyunghoon, and Desire Kedagni. 2022. “Generalized Difference-in-Differences Models: Robust Bounds.”
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics 119 (1): 249–75.
Brown, Nicholas, and Kyle Butts. 2023. “Dynamic Treatment Effect Estimation with Interactive Fixed Effects and Short Panels.”
Brown, Nicholas, Kyle Butts, and Joakim Westerlund. 2023. “Simple Difference-in-Differences Estimation in Fixed-t Panels.”
Callaway, Brantly, and Sonia Karami. 2023. “Treatment Effects in Interactive Fixed Effects Models with a Small Number of Time Periods.” Journal of Econometrics 233 (1): 184–208.
Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.
de Chaisemartin, Clement, and Xavier D’Haultfœuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.
Deryugina, Tatyana. 2017. “The Fiscal Cost of Hurricanes: Disaster Aid Versus Social Insurance.” American Economic Journal: Economic Policy 9 (3): 168–98.
Gobillon, Laurent, and Thierry Magnac. 2016. “Regional Policy Evaluation: Interactive Fixed Effects and Synthetic Controls.” Review of Economics and Statistics 98 (3): 535–51.
Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.
Imbens, Guido, Nathan Kallus, and Xiaojie Mao. 2021. “Controlling for Unmeasured Confounding in Panel Data Using Minimal Bridge Functions: From Two-Way Fixed Effects to Factor Models.”
Manski, Charles F, and John V Pepper. 2018. “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions.” Review of Economics and Statistics 100 (2): 232–44.
Marcus, Michelle, and Pedro HC Sant’Anna. 2021. “The Role of Parallel Trends in Event Study Settings: An Application to Environmental Economics.” Journal of the Association of Environmental and Resource Economists 8 (2): 235–75.
Rambachan, Ashesh, and Jonathan Roth. 2023. “A More Credible Approach to Parallel Trends.” Review of Economic Studies, rdad018.
Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99.
Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25 (1): 57–76.

Interactive Fixed Effects Examples

Interactive fixed effects models for untreated potential outcomes generalize some other important cases:

Example 1: Suppose we observe \(\lambda_i\), then this amounts to the regression adjustment version of DID with a time-invariant covariate

Example 2: Suppose you know that \(F_t = t\), then this leads to a unit-specific linear trend model: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i t + e_{it} \end{align*}\]
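A standard observation (not spelled out on the slide) shows why this case is easy to handle. Taking first differences gives \(\Delta Y_{it}(0) = \Delta \theta_t + \lambda_i + \Delta e_{it}\), and differencing once more removes \(\lambda_i\): \[\begin{align*} \Delta^2 Y_{it}(0) := \Delta Y_{it}(0) - \Delta Y_{it-1}(0) = \Delta^2 \theta_t + \Delta^2 e_{it} \end{align*}\] so a version of parallel trends holds for second differences, at the cost of requiring two pre-treatment periods instead of one.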

Interactive Fixed Effects Examples (cont’d)

To allow for \(F_t\) to change arbitrarily over time is harder…

Example 3: Interactive fixed effects models also provide a connection to “large-T” approaches such as synthetic control and synthetic DID (Abadie, Diamond, and Hainmueller 2010; Arkhangelsky et al. 2021)

  • e.g., one of the motivations of the SCM in ADH-2010 is that (given large-T) constructing a synthetic control can balance the factor loadings in an interactive fixed effects model for untreated potential outcomes


How can you recover \(ATT(g,t)\) in an IFE setting?

There are a lot of ideas. Probably the most prominent idea is to directly estimate the model for untreated potential outcomes and impute untreated potential outcomes

  • See Xu (2017) and Gobillon and Magnac (2016) for substantial detail on this front

  • For example, Xu (2017) uses the principal components approach of Bai (2009) to estimate the model.

    • This is a bit different in spirit from what we have been doing before as this argument requires the number of time periods to be “large”

Additional Details about Identification

For the IFE model not to “reduce” to two-way fixed effects, we need to rule out both:

  • \(\E[\lambda_i | G_i=3] = \E[\lambda_i | G_i=4] = \E[\lambda_i | G_i = \infty]\)

  • \(F_1 = F_2 = F_3 = F_4\)

But we need to strengthen these for our approach to work

What could go right/wrong?

Relevance: We additionally need that both

  • \(\E[\lambda_i | G_i=4] \neq \E[\lambda_i | G_i = \infty]\)

  • \(F_2 \neq F_1\)

Otherwise, \(G_i = 4\) and \(G_i = \infty\) have the same trend between the first two periods.

Case 1: \(F_2 \neq F_1\) but \(\E[\lambda_i | G_i=3] \neq \E[\lambda_i | G_i=4] = \E[\lambda_i | G_i = \infty]\)

  • \(\implies\) our approach won’t work, but you would be able to see that \(G_i=3\) is trending differently from \(G_i=4\) and \(G_i=\infty\)

  • Effectively, \(G_i=4\) and \(G_i=\infty\) are the “same” comparison group, so we cannot deal with the IFE, but we can reject parallel trends in the same way as typical DID approaches

Case 2: \(\E[\lambda_i | G_i=4] \neq \E[\lambda_i | G_i = \infty]\) but \(F_3 \neq F_2 = F_1\)

  • \(\implies\) our approach won’t work, but (I think) no approach would work here

  • In this case, all groups trend the same between periods 1 and 2, so it looks like parallel trends holds. Here it does hold in pre-treatment periods, but it is violated in post-treatment periods

  • This is closely related to the saying: “parallel trends is fundamentally untestable”
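Both failure cases can be read off the denominator of \(F_3^*\). Under the IFE model, \(\E[\Delta Y_{i2}|G_i=\infty] - \E[\Delta Y_{i2}|G_i=4] = (\E[\lambda_i|G_i=\infty] - \E[\lambda_i|G_i=4])(F_2 - F_1)\), which is zero in Case 1 and in Case 2. A minimal sketch, with a hypothetical function name and hypothetical values:

```python
def pre_period_trend_gap(mean_lambda_4, mean_lambda_inf, f1, f2):
    """E[dY_i2 | G=inf] - E[dY_i2 | G=4] = (E[lambda|G=inf] - E[lambda|G=4]) * (F_2 - F_1)."""
    return (mean_lambda_inf - mean_lambda_4) * (f2 - f1)


# Relevance holds: mean loadings differ and the factor moves between periods 1 and 2.
relevant = pre_period_trend_gap(1.0, 0.0, f1=0.0, f2=1.0)

# Case 1: groups 4 and inf have the same mean loading, so they are the "same" comparison group.
case_1 = pre_period_trend_gap(1.0, 1.0, f1=0.0, f2=1.0)

# Case 2: F_2 = F_1, so every group trends the same way before treatment.
case_2 = pre_period_trend_gap(1.0, 0.0, f1=1.0, f2=1.0)

print(relevant, case_1, case_2)
```

A zero gap means the two comparison groups have the same pre-treatment trend, so the ratio defining \(F_3^*\) is not defined.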

What could go right/wrong?


  • Can’t directly test, but a lot of the DID infrastructure carries over here.

  • For DID, can “pre-test” parallel trends if there is more than 1 pre-treatment period

  • For our approach, we need 2 pre-treatment periods to identify \(ATT(g,t)\), but if there are more pre-treatment periods then we can pre-test

  • If there are more factors (say \(R\)), we need \(R+1\) pre-treatment periods to recover \(ATT(g,t)\), but if there are more pre-treatment periods then we can pre-test.