Treatment Effects in Staggered Adoption Designs with Non-Parallel Trends

Brantly Callaway

University of Georgia

Emmanuel Tsyawo

FGSES, Université Mohammed VI Polytechnique

October 3, 2024

Introduction

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\T}{\mathrm{T}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

Setting of the paper: a researcher is interested in the causal effect of a binary treatment and has access to a few periods of panel data

In economics, by far the most common approach in this setting is to use difference-in-differences (DiD).

In the current paper, we will think about:

Cases where the parallel trends assumption could be violated
Applications where there is staggered treatment adoption
How to exploit staggered treatment adoption to allow for violations of parallel trends while still recovering the same target causal effect parameters

Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on \(\underbrace{\textrm{earnings}}_{\textrm{outcome}}\)

Where does parallel trends come from?

Parallel trends assumption: \(\E[\Delta Y_t(0) | D=1] = \E[\Delta Y_t(0) | D=0]\)

Parallel trends is equivalent to this model of untreated potential outcomes:

\[Y_{it}(0) = \theta_t + \xi_i + e_{it}\]

where \(\xi_i\) is unobserved heterogeneity and \(\E[e_t | D] = 0\)
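As a short worked step (one direction of the equivalence only, using nothing beyond the model above): taking first differences removes \(\xi_i\), so the mean untreated trend is the same in both groups.

\[\begin{align*} \Delta Y_{it}(0) &= \Delta \theta_t + \Delta e_{it} \\ \implies \E[\Delta Y_t(0) | D=d] &= \Delta \theta_t + \underbrace{\E[\Delta e_t | D=d]}_{=0} = \Delta \theta_t \quad \textrm{for } d \in \{0,1\} \end{align*}\]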

You can view this as (essentially) embodying two assumptions:

Unconfoundedness (conditional on unobserved heterogeneity): \(Y_{it}(0) \independent D_i | \xi_i\)
- (Gobillon and Magnac 2016; Gardner 2020)

Linearity / Additive Separability
- This type of auxiliary functional form assumption is not necessary for other causal inference approaches (e.g., random assignment, IV, RD)

What if we remove additive separability?

In this case, we have that

\[Y_{it}(0) = h_t(\xi_i) + e_{it}\]

but this is too generic to be useful…

Alternatively, can we assess the plausibility of additive separability?

Theoretically, in (probably most) applications, we simply do not know if additive separability is reasonable or not

Therefore, most DiD applications in economics include an event study plot that checks parallel trends in pre-treatment periods.
This is implicitly a test of the additive separability in the previous model for untreated potential outcomes.

Deryugina (2017)

Y: gov’t transfers, D: hurricane

Callaway and Sant’Anna (2021)

Y: employment, D: min. wage

Some ideas

Allow for certain violations of parallel trends
- \(\implies\) bounds on causal effect parameters
- Magnitude of allowed violations can be grounded in the magnitude of violations of parallel trends in pre-treatment periods (Manski and Pepper 2018; Rambachan and Roth 2023; Ban and Kedagni 2022)

Consider alternative model for untreated potential outcomes ⬅️ our paper
- Using (arguably) the most natural extension to DiD: interactive fixed effects (IFE)
- (I think) IFE is closely connected to the ways that “bounding approaches” allow for violations of parallel trends…

Outline of Talk

Introduction to Interactive Fixed Effects Models
Identification - Baseline Case
More Periods and More Groups
Application

Introduction to Interactive Fixed Effects Models

IFE Model for Untreated Potential Outcomes

An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]

\(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)
\(F_t\) is often referred to as a “factor”
\(e_{it}\) is idiosyncratic in the sense that it is not systematically different across groups

In our context, though, it makes sense to interpret these as

\(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved “ability”)
\(F_t\) the time-varying “return” to unobserved heterogeneity (e.g., return to “ability”)

Special Case: \(F_t = t\) \(\implies\) unit-specific linear trends
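To spell out the special case (a worked step under the scalar-\(\lambda_i\) notation used above): with \(F_t = t\) we have \(\Delta F_t = 1\), so each unit has its own linear trend with slope \(\lambda_i\).

\[\begin{align*} Y_{it}(0) &= \theta_t + \eta_i + \lambda_i t + e_{it} \\ \implies \Delta Y_{it}(0) &= \Delta \theta_t + \lambda_i + \Delta e_{it} \end{align*}\]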

Where Does IFE Come From?

View 1: Model for untreated potential outcomes

The IFE model correctly specifies \(\E[Y_t(0)|\xi,D]\), i.e.,

\[Y_{it}(0) = \underbrace{\theta_t + \eta_i + \lambda_i F_t}_{h_t(\xi_i)} + e_{it}\]

where \(\xi_i = (\eta_i, \lambda_i)\)

(Might) be able to relax this to hold approximately…“approximate factor model” where the factor model is viewed as a low-dimensional approximation to a high-dimensional/nonparametric model
- As long as we capture the “dominant” components

View 2: Difference-in-Differences

Including “covariates” in the parallel trends assumption is very common
- Idea: parallel trends holds among units with the same characteristics

One very common way to operationalize this is regression adjustment

\[Y_{it}(0) = \theta_t + \eta_i + X_i \beta_t + e_{it}\]

What characteristics need to be included?…depends on the application

But: what if we don’t observe \(X_i\)? This becomes exactly the IFE model that we have been talking about!

IFE & Violations of Parallel Trends

Interactive fixed effects models allow for violations of parallel trends:

\[ \begin{aligned} Y_{it}(0) &= \theta_t + \eta_i + \lambda_i F_t + e_{it} \\ \end{aligned} \]

\[ \implies \E[\Delta Y_{t}(0) | D = d] = \Delta \theta_t + \E[\lambda|D=d]\Delta F_t \]

which can vary across groups.

Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to increase outcomes more over time than groups with lower mean “ability”
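A minimal numerical sketch of the expression above, assuming a single factor. The function name and all numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def expected_untreated_trend(delta_theta, delta_f, mean_lambda):
    """E[dY_t(0) | D=d] = d_theta_t + E[lambda | D=d] * d_F_t (one-factor IFE model)."""
    return delta_theta + mean_lambda * delta_f


# Hypothetical values, for illustration only.
delta_theta = 0.5   # change in the common time effect
delta_f = 0.2       # change in the factor (the "return to ability")
mean_lambda = {"treated": 1.5, "untreated": 1.0}  # group means of the loading

trends = {d: expected_untreated_trend(delta_theta, delta_f, m)
          for d, m in mean_lambda.items()}

# The gap between the two untreated trends is the violation of parallel trends.
gap = trends["treated"] - trends["untreated"]
print(trends, gap)
```

The gap equals \((\E[\lambda|D=1] - \E[\lambda|D=0])\Delta F_t\), so it disappears when either the mean loadings coincide or the factor does not change.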

Identification

Staggered Treatment Adoption

Many of the insights of recent work on DiD have been in the context of staggered treatment adoption

\(\implies\) there is variation in treatment timing across units
de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021), Marcus and Sant’Anna (2021), among others
These papers all treat staggered treatment adoption as a nuisance, and
- Show limitations of two-way fixed effects regressions for implementing DiD identification strategies
- Provide alternative estimation strategies

In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters

Notation / Data / Setup

Observed data: \(\{Y_{i1}, Y_{i2}, \ldots, Y_{i\T}, D_{i1}, D_{i2}, \ldots, D_{i\T}\}_{i=1}^n\)

\(\T\) time periods
No one treated in the first time period (i.e., \(D_{i1} = 0\))
Staggered treatment adoption: for \(t=2,\ldots,\T\), \(D_{it-1} = 1 \implies D_{it}=1\).
A unit’s group \(G_i\) is the time period when it becomes treated.
- By convention, set \(G_i = \infty\) for never-treated units
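Potential outcomes: \(Y_{it}(g)\) is the outcome unit \(i\) would experience in period \(t\) if it became treated in period \(g\); \(Y_{it}(0)\) is the untreated potential outcome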
Observed outcomes: \(Y_{it} = Y_{it}(G_i)\)
No anticipation: For \(t < G_i\), \(Y_{it} = Y_{it}(0)\)

Setup is exactly the same as DiD with staggered treatment adoption!
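The setup above can be sketched in a few lines. This is an illustrative helper under the stated conventions (periods indexed from 1, `math.inf` standing in for \(G_i = \infty\)); the function names are hypothetical and are not part of any package.

```python
import math


def group_from_path(d_path):
    """Return G_i, the first period (1-indexed) with D_it = 1, or math.inf if never treated.

    Raises ValueError when the path violates the setup: treated in period 1,
    or treatment switching off after it has switched on.
    """
    if d_path[0] != 0:
        raise ValueError("no unit may be treated in the first period")
    for prev, curr in zip(d_path, d_path[1:]):
        if prev == 1 and curr == 0:
            raise ValueError("staggered adoption requires D_{it-1} = 1 => D_{it} = 1")
    for t, d in enumerate(d_path, start=1):
        if d == 1:
            return t
    return math.inf


def observed_is_untreated(g, t):
    """No anticipation: the observed Y_it equals Y_it(0) whenever t < G_i."""
    return t < g


print(group_from_path([0, 0, 1, 1]), group_from_path([0, 0, 0, 0]))
```

With \(\T = 4\), the treatment paths `[0, 0, 1, 1]`, `[0, 0, 0, 1]`, and `[0, 0, 0, 0]` map to groups 3, 4, and \(\infty\), which is the particular case used in the identification slides.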

Target Parameters

Following CS-2021, we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]

\(ATT(g,t)\) is the average treatment effect for group \(g\) in time period \(t\)

Group-time average treatment effects are the natural building block for other common target parameters in DiD applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)
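As a rough sketch of how group-time effects can be combined into an event study: average \(ATT(g,t)\) over groups at each event time \(e = t - g\), weighting by group size. This is only in the spirit of the aggregations in Callaway and Sant’Anna (2021); the function name, the weighting over whichever \((g,t)\) pairs happen to be supplied, and the numbers are assumptions made for illustration.

```python
def event_study(att, group_sizes):
    """Average ATT(g, t) by event time e = t - g, weighting groups by size.

    att: dict mapping (g, t) -> ATT(g, t) for the pairs that are available.
    group_sizes: dict mapping g -> number of units in group g.
    """
    by_event_time = {}
    for (g, t), value in att.items():
        by_event_time.setdefault(t - g, []).append((group_sizes[g], value))
    return {
        e: sum(n * v for n, v in pairs) / sum(n for n, _ in pairs)
        for e, pairs in sorted(by_event_time.items())
    }


# Hypothetical group-time effects and group sizes, for illustration only.
att = {(3, 3): 1.0, (3, 4): 2.0, (4, 4): 3.0}
sizes = {3: 100, 4: 300}
es = event_study(att, sizes)
print(es)
```

Only the pairs passed in are averaged, so if some \(ATT(g,t)\) are not identified, the composition of groups can differ across event times.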

Identifying \(ATT(g,t)\) with fixed-\(\T\)

Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)

Target: \(ATT(3,3) = \E[\Delta Y_3 | G=3] - \underbrace{\color{red}{\E[\Delta Y_3(0) | G=3]}}_{\textrm{have to figure out}}\)

Using quasi-differencing argument, can show that

\[ \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + v_{i3} \]

where \(\theta_3^*\) and \(F_3^*\) are functions of the original parameters \(\theta_t\) and \(F_t\), and \(v_{i3}\) is a function of \(e_{it}\).

(See “Quasi-Differencing Explanation” in the Appendix.)

Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,

\[\begin{align*} \color{red}{\E[\Delta Y_3(0) | G=3]} = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_2(0) | G = 3]}_{\textrm{identified}} + \underbrace{\E[v_3|G=3]}_{=0} \end{align*}\]

\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).

How can we recover \(\theta_3^*\) and \(F_3^*\)?

Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)

\[\begin{align*} \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]

Some issues:

Expression involves untreated potential outcomes through period 3
- \(\implies\) Only groups 4 and \(\infty\) are useful
\(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) by construction
- \(\implies\) We need some exogenous variation (IV) to recover the parameters

Existing Ideas in the Literature

There are a number of different ideas here:

Make additional assumptions ruling out serial correlation in \(e_{it}\) \(\implies\) can use lags of outcomes as instruments (Imbens, Kallus, and Mao 2021):
- But this is seen as a strong assumption in many applications (Bertrand, Duflo, and Mullainathan 2004)

Alternatively can introduce covariates and make auxiliary assumptions about them (Callaway and Karami 2023; Brown and Butts 2023; Brown, Butts, and Westerlund 2023)

However, it turns out that, with staggered treatment adoption, you can recover \(ATT(3,3)\) essentially for free

Our Approach

In particular, notice that we have two distinct untreated groups in period 3, group 4 and group \(\infty\), which gives us two moment conditions:

\[\begin{align*} \E[\Delta Y_3(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=4] \\ \E[\Delta Y_3(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=\infty] \\ \end{align*}\]

We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_3|G=\infty] - \E[\Delta Y_3|G=4]}{\E[\Delta Y_2|G=\infty] - \E[\Delta Y_2|G=4]} \\ \theta_3^* &= \E[\Delta Y_3 | G=4] - F_3^* \E[\Delta Y_2 | G=4] \end{align*}\]

\(\implies\) we can recover \(ATT(3,3)\).

This strategy amounts to using “group” as an instrument for \(\Delta Y_{i2}(0)\).
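The two moment conditions can be solved by hand, and the arithmetic is short enough to sketch. The function below works with population group means of first differences; its name, the dictionary layout, and the numbers are hypothetical. The example means are generated from a one-factor model with \(\Delta\theta_2 = 1\), \(\Delta\theta_3 = 2\), \(\Delta F_2 = 0.5\), \(\Delta F_3 = 1\), mean loadings of 3, 2, and 1 for groups 3, 4, and \(\infty\), and a true \(ATT(3,3)\) of 1.5.

```python
def identify_att33(dy2, dy3):
    """Recover (F_3^*, theta_3^*, ATT(3,3)) from group means of first differences.

    dy2, dy3: dicts mapping a group label (3, 4, "inf") to E[dY_2 | G] and E[dY_3 | G].
    """
    denom = dy2["inf"] - dy2[4]
    if denom == 0:
        raise ValueError("relevance fails: groups 4 and inf share the same period-2 trend")
    f3_star = (dy3["inf"] - dy3[4]) / denom
    theta3_star = dy3[4] - f3_star * dy2[4]
    # Imputed untreated trend for group 3, using its observed period-2 trend.
    untreated_trend_g3 = theta3_star + f3_star * dy2[3]
    return f3_star, theta3_star, dy3[3] - untreated_trend_g3


# Hypothetical group means implied by the model described in the lead-in.
dy2 = {3: 2.5, 4: 2.0, "inf": 1.5}
dy3 = {3: 6.5, 4: 4.0, "inf": 3.0}   # group 3's value includes the treatment effect

f3_star, theta3_star, att33 = identify_att33(dy2, dy3)
did_with_never_treated = dy3[3] - dy3["inf"]   # DiD comparison, for contrast
print(f3_star, theta3_star, att33, did_with_never_treated)
```

In this constructed example the approach returns the true effect of 1.5, while the DiD comparison against the never-treated group returns 3.5 because the groups have different mean loadings.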

Additional Details about Identification

Condition 1: Relevance \(\quad \E[\Delta Y_2(0) | G=4] \neq \E[\Delta Y_2(0) | G=\infty]\)

For relevance to hold, the following two “more primitive” conditions both need to hold

\(\E[\lambda | G=4] \neq \E[\lambda | G = \infty]\)
\(F_2 \neq F_1\)

Otherwise, \(G = 4\) and \(G = \infty\) have the same trend between the first two periods.

\(\implies\) \(F_3^*\) is not identified
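To see why both primitive conditions are needed, a short worked step from the one-factor model (using \(\E[\Delta e_2 | G] = 0\)): the difference in period-2 trends factors into a loading gap times a factor change, so it is zero if either term is zero.

\[\begin{align*} \E[\Delta Y_2(0) | G=4] - \E[\Delta Y_2(0) | G=\infty] = \Big( \E[\lambda | G=4] - \E[\lambda | G=\infty] \Big) (F_2 - F_1) \end{align*}\]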

Condition 2: Exogeneity

i.e., that “group” is uncorrelated with \(e_{it}\)
- This would be violated if, for example, the model for untreated potential outcomes should include two factors instead of one.

Can’t directly test exogeneity, but a lot of the DiD infrastructure carries over

For DiD, we can “pre-test” parallel trends if we have more than 1 pre-treatment period
For our approach, we need 2 pre-treatment periods to identify \(ATT(g,t)\), but if we have more pre-treatment periods, then we can pre-test
- e.g., if we have 3 pre-treatment periods, then non-zero pseudo-ATT’s in pre-treatment periods suggest that the exogeneity condition is violated.

How Many Interactive Fixed Effects?

The discussion so far has been about the case of 1 IFE. However, an important issue for IFE approaches is determining how many IFE terms there are (e.g., 0, 1, 2, …)

Interestingly, there is a tight link in our approach between this “model selection” and identification

Example: Suppose that we know that the true number of interactive fixed effects is either 0 or 1. How can we decide?

Notice that parallel trends holds if either of the following two conditions holds:

\(\E[\lambda | G=3] = \E[\lambda | G=4] = \E[\lambda | G = \infty]\) \(\implies\) IFEs “absorbed” into time fixed effects
\(F_1 = F_2 = F_3\) \(\implies\) IFEs “absorbed” into unit fixed effects

Idea: Check the relevance condition (i.e., check if \(\E[\Delta Y_2 | G=4] \neq \E[\Delta Y_2 | G=\infty]\))

If relevance holds, then we can conclude that there is at least 1 IFE

If relevance doesn’t hold, then “act like” 0 IFEs
- This could be correct: 👍
- But it is not a guarantee that there are 0 IFEs.

Let us walk through both cases where relevance fails, but there really is 1 IFE.

Case 1: Parallel trends holds between \(G=4\) and \(G=\infty\) (across all periods), but does not hold with \(G=3\). [Figure]

In math: \(\color{green}{F_2 \neq F_1}\) but \(\color{green}{\E[\lambda | G=3] \neq} \color{red}{ \E[\lambda | G=4] = \E[\lambda | G = \infty]}\)

Intuition: \(G=4\) and \(G=\infty\) are effectively the “same comparison group”
Implications:
- Neither our approach nor DiD will recover \(ATT(3,3)\)
- However, both will “fail safely” because you see that \(G=3\) is trending differently from \(G=4\) and \(G=\infty\)

Case 2: Parallel trends holds between periods 1 and 2 (for all groups), but does not hold from period 2 to 3

In math: \(\color{green}{F_3 \neq} \color{red}{F_2 = F_1}\)

Case 2a: \(\E[\lambda | G=4] \neq \E[\lambda | G = \infty]\) [Figure]
- Our approach will not recover \(ATT(3,3)\)
- Our approach will “fail safely” because the observed violation of parallel trends in period 3 for \(G=4\) and \(G=\infty\) provides evidence against parallel trends
- DiD (at least the version that pools \(G=4\) and \(G=\infty\) into a single comparison group) may “fail horribly” because there is no pre-treatment evidence that parallel trends is violated

Case 2b: \(\E[\lambda|G=3] \neq \E[\lambda | G=4] = \E[\lambda | G = \infty]\) [Figure]
- Our approach will “fail horribly” in this case (so will DiD)
- In fact, no approach would work here
- This is closely related to the idea: “parallel trends is fundamentally untestable”
- In this case, we essentially have the strongest possible evidence that parallel trends holds, and yet, it could still be violated

Extensions

General Case with More Periods and Groups

Can scale the identification arguments up for more periods, groups, and IFEs

Estimation

Identification is constructive and suggests a two-step estimation procedure where we estimate the parameters of the IFE model in the first step (e.g., \(\theta_3^*\) and \(F_3^*\)) and then plug these into a second step estimator for \(ATT(g,t)\).
With more periods, groups, and/or interactive fixed effects, parameters of IFE model can be over-identified \(\implies\) GMM, but otherwise similar

(See “Estimation” and “Asymptotic Theory” in the Appendix.)

Discussion

Relative to other approaches to dealing with IFEs:
- We do not need a large number of periods or extra auxiliary assumptions
- Only need there to be staggered treatment adoption

The main drawback is that we can’t recover as many \(ATT(g,t)\)’s; e.g., in this example, we can’t recover \(ATT(3,4)\) or \(ATT(4,4)\), which might be recoverable in other settings

Generality: we have talked about IFE models, but
- There are other types of models that need extra moment conditions (e.g., dynamic panel data model for \(Y_{it}(0)\)), could use the same sort of idea there

Application

Setup

Use county-level data from 1998-2007 during a period where the federal minimum wage was flat

Exploit minimum wage changes across states
- Any state that increases its minimum wage above the federal minimum wage is considered treated
- Allow for one year of “anticipation” (this only affects estimates in post-treatment periods)

Interested in the effect of the minimum wage on teen employment

We’ll also make a number of simplifications:
- not worry much about issues like clustered standard errors
- not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states

DiD Estimates

IFE Clarifications

The next set of results includes one interactive fixed effect
Additional Comments:
- Because of anticipation, we can only estimate effects up to 2005 (after that \(G=2007\) is no longer a valid comparison group)
- We also lose some estimates in early periods because those are needed in the quasi-differencing steps
- No estimates for \(G=2007\) at all because not enough valid comparison groups

IFE Results

Conclusion

Comments very welcome: brantly.callaway@uga.edu

Code: staggered_ife2 function in ife package in R, available at github.com/bcallaway11/ife

Appendix

References

Arkhangelsky, Dmitry, Susan Athey, David A Hirshberg, Guido W Imbens, and Stefan Wager. 2021. “Synthetic Difference-in-Differences.” American Economic Review 111 (12): 4088–118.

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2021. “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association, 1–15.

Ban, Kyunghoon, and Desire Kedagni. 2022. “Generalized Difference-in-Differences Models: Robust Bounds.”

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics 119 (1): 249–75.

Brown, Nicholas, and Kyle Butts. 2023. “Dynamic Treatment Effect Estimation with Interactive Fixed Effects and Short Panels.”

Brown, Nicholas, Kyle Butts, and Joakim Westerlund. 2023. “Simple Difference-in-Differences Estimation in Fixed-t Panels.”

Callaway, Brantly, and Sonia Karami. 2023. “Treatment Effects in Interactive Fixed Effects Models with a Small Number of Time Periods.” Journal of Econometrics 233 (1): 184–208.

Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.

de Chaisemartin, Clement, and Xavier D’Haultfœuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.

Deryugina, Tatyana. 2017. “The Fiscal Cost of Hurricanes: Disaster Aid Versus Social Insurance.” American Economic Journal: Economic Policy 9 (3): 168–98.

Gardner, John. 2020. “Identification and Estimation of Average Causal Effects When Treatment Status Is Ignorable Within Unobserved Strata.” Econometric Reviews 39 (10): 1014–41.

Gobillon, Laurent, and Thierry Magnac. 2016. “Regional Policy Evaluation: Interactive Fixed Effects and Synthetic Controls.” Review of Economics and Statistics 98 (3): 535–51.

Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.

Imbens, Guido, Nathan Kallus, and Xiaojie Mao. 2021. “Controlling for Unmeasured Confounding in Panel Data Using Minimal Bridge Functions: From Two-Way Fixed Effects to Factor Models.”

Manski, Charles F, and John V Pepper. 2018. “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions.” Review of Economics and Statistics 100 (2): 232–44.

Marcus, Michelle, and Pedro HC Sant’Anna. 2021. “The Role of Parallel Trends in Event Study Settings: An Application to Environmental Economics.” Journal of the Association of Environmental and Resource Economists 8 (2): 235–75.

Rambachan, Ashesh, and Jonathan Roth. 2023. “A More Credible Approach to Parallel Trends.” Review of Economic Studies 90 (5): 2555–91.

Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99.

Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25 (1): 57–76.

Quasi-Differencing Explanation

Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)

\[Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it}\]

In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\]

The last equation implies (provided \(\Delta F_2 \neq 0\)) that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation and combining terms, we have that

\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
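The algebra above can be checked numerically for a single unit. This is only a sanity check of the identity, with hypothetical parameter values and a hypothetical function name; it is not an estimator.

```python
def quasi_difference(d_theta2, d_theta3, d_f2, d_f3, lam, d_e2, d_e3):
    """Return both sides of the quasi-differenced equation for one unit."""
    dy2 = d_theta2 + lam * d_f2 + d_e2            # dY_i2(0) from the IFE model
    dy3 = d_theta3 + lam * d_f3 + d_e3            # dY_i3(0) from the IFE model
    f3_star = d_f3 / d_f2                         # F_3^*
    theta3_star = d_theta3 - f3_star * d_theta2   # theta_3^*
    v3 = d_e3 - f3_star * d_e2                    # v_i3
    return dy3, theta3_star + f3_star * dy2 + v3


# Hypothetical values, for illustration only.
lhs, rhs = quasi_difference(d_theta2=1.0, d_theta3=2.0, d_f2=0.5, d_f3=1.0,
                            lam=3.0, d_e2=0.25, d_e3=-0.5)
print(lhs, rhs)
```

The loading `lam` enters the right-hand side only through `dy2`, which is the point of quasi-differencing: the unobserved \(\lambda_i\) is replaced by an observable pre-treatment change, at the cost of an error term \(v_{i3}\) that is correlated with it.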

Case 1 Example

Case 2a Example

Case 2b Example

General Case - Setup

Interactive fixed effects for untreated potential outcomes:

\[ Y_{it}(0) = \theta_t + \eta_i + \lambda_i' F_t + e_{it} \] where \(\lambda_i\) and \(F_t\) are \(R\) dimensional vectors.

Assume: Unconfoundedness conditional on unobserved heterogeneity (i.e., this implies “groups” can be used as instruments):

\[ \E[Y_{t}(0) |\eta, \lambda, G] = \E[Y_{t}(0) |\eta, \lambda] \quad \text{a.s.} \]

An implication of both conditions above is that

\[ \E[e_t |\eta, \lambda, G] = 0 \]

which we use below as a source of moment conditions to identify parameters from the interactive fixed effects model.

General Case - Identification

Similar to earlier case:

\[ \begin{aligned} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \underbrace{\E[Y_t(0) - Y_{g-1}(0) | G=g]}_{\textrm{need to figure out}} \\ \end{aligned} \]

Using similar differencing arguments as before, one can show:

\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]

where

\(\widetilde{\Delta Y}_i^{pre(g)}(0)\) is an \(R\)-dimensional vector of transformations of pre-treatment outcomes,
\(\theta^*(g,t)\) and \(F^*(g,t)\) are transformations of time fixed effects and factors,
\(v_i(g,t)\) involves transformations of \(e_{it}\).

so that

\(R+1\) parameters to identify
\(\widetilde{\Delta Y}_i^{pre(g)}(0)\) is endogenous by construction
Can use “groups” as instruments
Identification is local to groups/periods that meet the following criteria:
- Group must have at least \(R+1\) pre-treatment periods (so quasi-differencing is feasible)
- Time period must be early enough that there are enough (at least \(R+1\)) untreated comparison groups.

For \(g' \in \mathcal{G}^{comp}(g,t)\), we use moment conditions of the form

\[ 0 = \E\Big[\indicator{G=g'} v(g,t)\Big]\]

Stacking the above moment conditions, we have that

\[ \mathbf{0}_{|\mathcal{G}^{comp}(g,t)|} = \E\left[ \ell^{comp}(g,t) \left\{ \Big( Y_{t} - Y_{g-1}\Big) - \Big(\theta^*(g,t) + \widetilde{\Delta Y}^{pre(g)\prime} F^*(g,t) \Big) \right\} \right] \]

where \(\ell^{comp}(g,t)\) is a vector of indicators for groups that have not yet been treated by period \(t\).

Since we are using groups as IVs, identification hinges on relevance:

\[ \textrm{Rank}\Big(\mathbf{\Gamma}(g,t)\Big) = R + 1 \]

where

\[ \mathbf{\Gamma}(g,t) := \E\left[ \ell^{comp}(g,t) \begin{pmatrix} 1 \\ \widetilde{\Delta Y}^{pre(g)} \end{pmatrix}' \right] \]

As before, you can relate the relevance condition to conditions on \(\lambda_i\) and \(F_t\).

Need “enough variation” in \(\E[\lambda|G=g']\) among groups in \(\mathcal{G}^{comp}(g,t)\).
Need “enough variation” in \(F_t\) across pre-treatment time periods.

(See “General Case - Relevance Condition” below.)

Theorem: Identification

For some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\) and under given assumptions,

\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \Big( \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \mathbf{\Gamma}(g,t) \Big)^{-1} \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]

In addition, \(ATT(g,t)\) is identified, and it is given by:

\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \Big( \theta^*(g,t) + F^*(g,t)'\E[\widetilde{\Delta Y}^{pre(g)} | G=g] \Big) \]

Estimation

Estimation proceeds in two steps and is constructive given identification results. The first step is to estimate \(\theta^*(g,t)\) and \(F^*(g,t)\):

Given a positive definite weighting matrix \(\widehat{\mathbf{W}}(g,t)\), the estimator of \(\delta^*(g,t) := \big(\theta^*(g,t), F^*(g,t)'\big)'\) is:

\[ \widehat{\delta}^*(g,t) = \left( \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t)\widehat{\mathbf{\Gamma}}(g,t) \right)^{-1} \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t) \E_n\big[\ell^{comp}_i(g,t)(Y_{it} - Y_{ig-1})\big] \]

Second step, plug into sample analog of expression for \(ATT(g,t)\):

\[ \widehat{ATT}(g,t) = \hat{p}_g^{-1} \left\{ \E_n\Big[\indicator{G_i=g}(Y_{it} - Y_{ig-1})\Big] - \E_n\Big[A_i(g)\Big]^\prime \widehat{\delta}^*(g,t) \right\} \]

where

\[ A_i(g) := \indicator{G_i=g}\begin{pmatrix}1 \\ \widetilde{\Delta Y}_i^{pre(g)} \end{pmatrix} \]
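A minimal sketch of the two steps for the baseline case only: one IFE, \((g,t) = (3,3)\), and comparison groups 4 and \(\infty\), so the first step is just-identified and the weighting matrix drops out. It assumes that in this case \(\widetilde{\Delta Y}_i^{pre(g)}\) is simply \(Y_{i2} - Y_{i1}\). The function name and the toy data are hypothetical, and this is not the `ife` package's implementation.

```python
import math


def two_step_att33(data):
    """data: list of (g, y1, y2, y3) with g in {3, 4, math.inf}.

    Returns ((theta_hat, f_hat), att_hat) for (g, t) = (3, 3) with one IFE.
    """
    n = len(data)

    def mean_of(f):
        return sum(f(row) for row in data) / n

    def ind(row, g):
        return 1.0 if row[0] == g else 0.0

    comp = [4, math.inf]  # groups still untreated in period 3
    # First step: Gamma_hat = E_n[l_i (1, dY_i2)] and b_hat = E_n[l_i (Y_i3 - Y_i2)].
    gamma = [[mean_of(lambda r, g=g: ind(r, g)),
              mean_of(lambda r, g=g: ind(r, g) * (r[2] - r[1]))] for g in comp]
    b = [mean_of(lambda r, g=g: ind(r, g) * (r[3] - r[2])) for g in comp]
    det = gamma[0][0] * gamma[1][1] - gamma[0][1] * gamma[1][0]
    if abs(det) < 1e-12:
        raise ValueError("Gamma_hat is (near) singular: relevance appears to fail")
    theta_hat = (gamma[1][1] * b[0] - gamma[0][1] * b[1]) / det
    f_hat = (gamma[0][0] * b[1] - gamma[1][0] * b[0]) / det
    # Second step: plug delta_hat into the sample analog of ATT(3,3).
    p3 = mean_of(lambda r: ind(r, 3))
    lhs = mean_of(lambda r: ind(r, 3) * (r[3] - r[2]))
    a_bar = [p3, mean_of(lambda r: ind(r, 3) * (r[2] - r[1]))]
    att_hat = (lhs - (a_bar[0] * theta_hat + a_bar[1] * f_hat)) / p3
    return (theta_hat, f_hat), att_hat


# Toy data (hypothetical): noise-free IFE outcomes with eta_i = lambda_i.
theta = [0.0, 1.0, 3.0]   # time effects for t = 1, 2, 3
factor = [0.0, 0.5, 1.5]  # F_t for t = 1, 2, 3
tau = 1.5                 # true ATT(3,3)
loadings = {3: [2.5, 3.5], 4: [1.5, 2.5], math.inf: [0.5, 1.5]}
data = []
for g, lams in loadings.items():
    for lam in lams:
        y = [theta[t] + lam + lam * factor[t] for t in range(3)]
        if g == 3:
            y[2] += tau   # group 3 is treated in period 3
        data.append((g, y[0], y[1], y[2]))

delta_hat, att_hat = two_step_att33(data)
print(delta_hat, att_hat)
```

Because the toy data have no idiosyncratic noise, the estimates match the true values up to floating-point error; with noisy data the same formulas apply, but sampling error depends on how strongly the comparison groups differ in their pre-treatment trends.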

If you want an event study or overall average treatment effect, can combine estimates across groups and time periods, following the same logic as in CS-2021.

Asymptotic Theory

Theorem: Asymptotic Normality

Suppose assumptions hold, then for some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\),

\(\widehat{ATT}(g,t)\) is asymptotically linear, and it satisfies the relation: \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_{igt} + o_p(1) \]
\(\widehat{ATT}(g,t) \rightarrow_p ATT(g,t)\) as \(n \rightarrow \infty\) for each pair \((g,t)\).
In addition, \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) \xrightarrow{d} \mathcal{N}(0,\sigma_{gt}^2) \] where \(\sigma_{gt}^2 = \E[\psi_{igt}^2]\).
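Given estimates of the influence function values \(\psi_{igt}\), the result above suggests a plug-in standard error of \(\sqrt{\E_n[\psi_{igt}^2]/n}\). The sketch below assumes those values are already available as a list; the form of \(\psi_{igt}\) is not given in these slides, and the function names and numbers are hypothetical. The interval is pointwise and ignores clustering.

```python
import math


def standard_error(psi):
    """Plug-in standard error: sqrt(E_n[psi^2] / n)."""
    n = len(psi)
    return math.sqrt(sum(p * p for p in psi) / n) / math.sqrt(n)


def confidence_interval(att_hat, psi, z=1.96):
    """Pointwise normal-approximation interval for a single ATT(g, t)."""
    se = standard_error(psi)
    return att_hat - z * se, att_hat + z * se


# Hypothetical influence function values, for illustration only.
psi = [1.0, -1.0, 1.0, -1.0]
se = standard_error(psi)
lower, upper = confidence_interval(2.0, psi)
print(se, lower, upper)
```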

General Case - Relevance Condition

\[ \begin{aligned} \textrm{Define: } \qquad \mathbf{\Lambda}^{comp}(g,t) := \E\Big[ \ell^{comp}(g,t) \begin{pmatrix} 1 & \lambda' \end{pmatrix} \Big] \quad \textrm{and} \quad \mathbf{\Delta F}^{pre(g)} := \begin{bmatrix} \Delta F_2' \\ \vdots \\ \Delta F_{g-1}' \end{bmatrix} \end{aligned} \]

where \(\mathbf{\Lambda}^{comp}(g,t)\) is a \(|\mathcal{G}^{comp}(g,t)| \times (R+1)\) matrix, and \(\mathbf{\Delta F}^{pre(g)}\) is a \((g-2) \times R\) matrix.

Proposition: Relevance

The rank condition for identification is equivalent to the following: \[ \textrm{Rank}\Big(\mathbf{\Lambda}^{comp}(g,t)\Big) = R + 1 \quad \textrm{and} \quad \textrm{Rank}\Big(\mathbf{\Delta F}^{pre(g)}\Big) = R \]
