Treatment Effects in Staggered Adoption Designs with Non-Parallel Trends

Brantly Callaway

University of Georgia

Emmanuel Tsyawo

FGSES, Université Mohammed VI Polytechnique

September 20, 2024

Introduction

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\T}{\mathrm{T}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

Setting of the paper: A researcher is interested in learning about the causal effect of a binary treatment and has access to a few periods of panel data

In economics, by far the most common approach in this setting is to use difference-in-differences (DiD).

In the current paper, we will think about:

Cases where the parallel trends assumption could be violated
Applications where there is staggered treatment adoption
How to exploit staggered treatment adoption to allow for violations of parallel trends while still recovering the same target causal effect parameters

Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on \(\underbrace{\textrm{earnings}}_{\textrm{outcome}}\)

Introduction

Parallel trends assumption: \(\E[\Delta Y_t(0) | D=1] = \E[\Delta Y_t(0) | D=0]\)

Where does parallel trends come from?

Parallel trends is closely connected to the following model of untreated potential outcomes:

\[Y_{it}(0) = \theta_t + \xi_i + e_{it}\]

where \(\xi_i\) is unobserved heterogeneity and \(e_{it}\) is an idiosyncratic error term.

You can view this as embodying two assumptions:

Unconfoundedness (conditional on unobserved heterogeneity): \(Y_{it}(0) \independent D_i | \xi_i\)
Linearity / Additive Separability
- This type of auxiliary functional form assumption is not necessary for other causal inference approaches (e.g., random assignment, IV, RD)

Introduction

What if we back off of the additive separability assumption?

In this case, we have that

\[Y_{it}(0) = h_t(\xi_i) + e_{it}\]

where \(h_t(\xi) = \E[Y_t(0)|\xi,D]\), and \(e_{it}\) is idiosyncratic, but this is too generic to be useful…

Alternatively, can we assess the plausibility of additive separability?

Theoretically, in (probably most) applications, we simply do not know if additive separability is reasonable or not

Therefore, most DiD applications in economics include an event study plot that checks parallel trends in pre-treatment periods.
This is implicitly a test of the additive separability in the previous model for untreated potential outcomes.

Deryugina (2017), Y: gov’t transfers, D: hurricane

CS-2021, Y: employment, D: min. wage

Some ideas

Allow for certain violations of parallel trends (\(\implies\) bounds on causal effect parameters) often connected to the magnitude of the violations of parallel trends in pre-treatment periods (Manski and Pepper 2018; Rambachan and Roth 2023; Ban and Kedagni 2022)

Consider alternative model for untreated potential outcomes
- That’s what we will do in this paper!
- Using (arguably) the most naturally connected approach to DiD: interactive fixed effects (IFE)
- (I think) IFE is closely connected to the ways that “bounding approaches” allow for violations of parallel trends…

Outline of Talk

Introduction to Interactive Fixed Effects Models
Identification - Baseline Case
More Periods and More Groups
Application

Introduction to Interactive Fixed Effects Models

IFE Model for Untreated Potential Outcomes

An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]

\(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)
\(F_t\) is often referred to as a “factor”
\(e_{it}\) is idiosyncratic in the sense that \(\E[e_t | \eta, \lambda, D] = 0\)

In our context, though, it makes sense to interpret these as

\(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved “ability”)
\(F_t\) the time-varying “return” to unobserved heterogeneity (e.g., return to “ability”)

Special Case: \(F_t = t\) \(\implies\) unit-specific linear trends
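
As a quick check of this special case (a sketch that only uses the IFE model above), setting \(F_t = t\) gives \(\Delta F_t = 1\) in every period, so first differences of untreated potential outcomes are

\[\begin{align*} \Delta Y_{it}(0) = \Delta \theta_t + \lambda_i + \Delta e_{it} \end{align*}\]

That is, each unit has its own constant drift \(\lambda_i\) on top of the common trend \(\Delta \theta_t\).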

Where Did This Come From?

View 1: Model for untreated potential outcomes

The IFE model correctly specifies \(\E[Y_t(0)|\xi,D]\), i.e.,

\[Y_{it}(0) = \underbrace{\theta_t + \eta_i + \lambda_i F_t}_{h_t(\xi_i)} + e_{it}\]

where \(\xi_i = (\eta_i, \lambda_i)\)

(Might) be able to relax this to hold approximately…“approximate factor model” where the factor model is viewed as a low-dimensional approximation to a high-dimensional/nonparametric model
- As long as we capture the “dominant” components

Where Did This Come From?

View 2: Difference-in-Differences

Alternatively, we can come at this from the perspective of difference-in-differences

Including “covariates” in the parallel trends assumption is very common
- The idea is that parallel trends holds among units with the same characteristics

One very common way to operationalize is regression adjustment

\[Y_{it}(0) = \theta_t + \eta_i + X_i \beta_t + e_{it}\]

What characteristics need to be included?…depends on the application

But: what if we don’t observe \(X_i\)? This becomes exactly the IFE model that we have been talking about!

IFE & Violations of Parallel Trends

Interactive fixed effects models allow for violations of parallel trends:

\[\begin{align*} \E[\Delta Y_{t}(0) | D = d] = \Delta \theta_t + \E[\lambda|D=d]\Delta F_t \end{align*}\] which can vary across groups.

Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to experience larger increases in outcomes over time than groups with lower mean “ability”
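
A small numerical illustration of the expression above. All numbers (time effects, factor path, and group means of \(\lambda\)) are made up for illustration and are not from the paper.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
theta = np.array([0.0, 1.0, 2.0])   # time effects theta_1, theta_2, theta_3
F = np.array([1.0, 1.5, 2.5])       # factor F_1, F_2, F_3 (increasing over time)
mean_lambda = {"high ability": 2.0, "low ability": 0.5}  # E[lambda | D = d]

d_theta = np.diff(theta)            # Delta theta_t for t = 2, 3
d_F = np.diff(F)                    # Delta F_t for t = 2, 3

# E[Delta Y_t(0) | D = d] = Delta theta_t + E[lambda | D = d] * Delta F_t
trends = {d: d_theta + lam * d_F for d, lam in mean_lambda.items()}

# Gap in untreated trends between the two groups: a violation of parallel trends
gap = trends["high ability"] - trends["low ability"]
print(trends, gap)
```

With these numbers the gap grows from period 2 to period 3 because \(\Delta F_t\) grows, even though neither group is treated.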

Identification

Staggered Treatment Adoption

Many of the insights of recent work on DiD have been in the context of staggered treatment adoption

\(\implies\) there is variation in treatment timing across units
de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021), Marcus and Sant’Anna (2021), among others
These papers all treat staggered treatment adoption as a nuisance, and
- Show limitations of two-way fixed effects regressions for implementing DiD identification strategies
- Provide alternative estimation strategies

In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters

Notation / Data / Setup

Observed data: \(\{Y_{i1}, Y_{i2}, \ldots Y_{i\T}, D_{i1}, D_{i2}, \ldots, D_{i\T}\}_{i=1}^n\)

\(\T\) time periods
No one treated in the first time period (i.e., \(D_{i1} = 0\))
Staggered treatment adoption: for \(t=2,\ldots,\T\), \(D_{it-1} = 1 \implies D_{it}=1\).
A unit’s group \(G_i\) is the time period when it becomes treated. By convention, set \(G_i = \infty\) for units that do not participate in the treatment in any period.
Potential outcomes: \(Y_{it}(g)\), \(Y_{it}(0)\) is untreated potential outcome
Observed outcomes: \(Y_{it} = Y_{it}(G_i)\)
No anticipation: For \(t < G_i\), \(Y_{it} = Y_{it}(0)\)

This setup is exactly the same as the literature on DiD with staggered treatment adoption
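
The notation above can be made concrete with a short sketch that turns treatment paths \(D_{it}\) into groups \(G_i\). The helper name `treatment_groups` and the toy data are hypothetical, for illustration only.

```python
import numpy as np

def treatment_groups(D):
    """Return G_i: the first treated period (1-indexed), or inf if never treated."""
    D = np.asarray(D)
    # No one is treated in the first period
    assert np.all(D[:, 0] == 0), "no unit may be treated in period 1"
    # Staggered adoption: D_{it-1} = 1 implies D_{it} = 1
    assert np.all(np.diff(D, axis=1) >= 0), "treatment must be absorbing"
    ever_treated = D.any(axis=1)
    first_treated = D.argmax(axis=1) + 1   # column of the first 1, shifted to 1-indexed
    return np.where(ever_treated, first_treated.astype(float), np.inf)

# Three units observed over four periods
D = np.array([
    [0, 0, 1, 1],   # first treated in period 3
    [0, 0, 0, 1],   # first treated in period 4
    [0, 0, 0, 0],   # never treated
])
G = treatment_groups(D)
print(G)
```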

Target Parameters

Following CS-2021, we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]

\(ATT(g,t)\) is the average treatment effect for group \(g\) in time period \(t\)

Group-time average treatment effects are the natural building block for other common target parameters in DiD applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)

Identifying \(ATT(g,t)\) with fixed-\(\T\)

Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)

Target: \(ATT(3,3) = \E[\Delta Y_3 | G=3] - \underbrace{\E[\Delta Y_3(0) | G=3]}_{\textrm{have to figure out}}\)

In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\]

The last equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)

From last slide, combining terms we have that

\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]

Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,

\[\begin{align*} \E[\Delta Y_3(0) | G=3] = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_2(0) | G = 3]}_{\textrm{identified}} + \underbrace{\E[v_3|G=3]}_{=0} \end{align*}\]

\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).

How can we recover \(\theta_3^*\) and \(F_3^*\)?

Expression involves untreated potential outcomes through period 3, and we have groups 4 and \(\infty\) for which we observe these untreated potential outcomes. This suggests using those groups.
However, this is not so simple because, by construction, \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) (note: \(v_{i3}\) contains \(\Delta e_{i2} \implies\) they will be correlated by construction)
We need some exogenous variation (IV) to recover the parameters \(\rightarrow\)

Existing Ideas in the Literature

There are a number of different ideas here:

Make additional assumptions ruling out serial correlation in \(e_{it}\) \(\implies\) can use lags of outcomes as instruments (Imbens, Kallus, and Mao 2021):
- But this is seen as a strong assumption in many applications (Bertrand, Duflo, and Mullainathan 2004)

Alternatively can introduce covariates and make auxiliary assumptions about them (Callaway and Karami 2023; Brown and Butts 2023; Brown, Butts, and Westerlund 2023)

However, it turns out that, with staggered treatment adoption, you can recover \(ATT(3,3)\) essentially for free

Our Approach

In particular, notice that we have two distinct untreated groups in period 3 (group 4 and group \(\infty\)), which gives two moment conditions:

\[\begin{align*} \E[\Delta Y_3(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=4] \\ \E[\Delta Y_3(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=\infty] \\ \end{align*}\]

We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_3|G=\infty] - \E[\Delta Y_3|G=4]}{\E[\Delta Y_2|G=\infty] - \E[\Delta Y_2|G=4]} \\ \theta_3^* &= \E[\Delta Y_3 | G=4] - F_3^* \E[\Delta Y_2 | G=4] \end{align*}\]

\(\implies\) we can recover \(ATT(3,3)\).

This strategy amounts to using “group” as an instrument for \(\Delta Y_{i2}(0)\).
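
The argument above can be sketched on simulated data. This is a minimal illustration, not the authors' implementation: the data generating process (group means of \(\lambda\), factor path, AR(1) errors, and a constant treatment effect of 1) is entirely made up, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300_000

# Groups: first treated in period 3, in period 4, or never (inf)
G = rng.choice([3.0, 4.0, np.inf], size=n)

# One-factor IFE model for untreated potential outcomes (made-up parameters)
lam = np.select([G == 3, G == 4], [1.0, 0.5], default=0.0) + 0.5 * rng.standard_normal(n)
eta = rng.standard_normal(n)
theta = np.array([0.0, 1.0, 2.0, 3.0])
F = np.array([1.0, 2.0, 4.0, 5.0])

# Serially correlated errors, so lagged outcomes would not be valid instruments
e = np.empty((n, 4))
e[:, 0] = 0.5 * rng.standard_normal(n)
for t in range(1, 4):
    e[:, t] = 0.5 * e[:, t - 1] + 0.5 * rng.standard_normal(n)

Y0 = theta + eta[:, None] + lam[:, None] * F + e
true_att = 1.0
periods = np.arange(1, 5)
Y = Y0 + true_att * (periods[None, :] >= G[:, None])   # observed outcomes

dY2 = Y[:, 1] - Y[:, 0]
dY3 = Y[:, 2] - Y[:, 1]

def gmean(x, g):
    """Sample mean of x within group g."""
    return x[G == g].mean()

# Solve the two moment conditions from the not-yet-treated groups 4 and inf
F3_star = (gmean(dY3, np.inf) - gmean(dY3, 4)) / (gmean(dY2, np.inf) - gmean(dY2, 4))
theta3_star = gmean(dY3, 4) - F3_star * gmean(dY2, 4)

# Plug in to impute the untreated trend of group 3
att_33 = gmean(dY3, 3) - (theta3_star + F3_star * gmean(dY2, 3))

# Standard DiD with the never-treated group, for comparison
did_33 = gmean(dY3, 3) - gmean(dY3, np.inf)
print(F3_star, theta3_star, att_33, did_33)
```

In this design \(F_3^* = \Delta F_3 / \Delta F_2 = 2\), and the DiD estimate is contaminated by the term \((\E[\lambda|G=3] - \E[\lambda|G=\infty])\Delta F_3 = 2\), while the quasi-differenced estimate should be close to the true effect.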

Additional Details about Identification

Condition 1: Relevance \(\quad \E[\Delta Y_2(0) | G=4] \neq \E[\Delta Y_2(0) | G=\infty]\)

For relevance to hold, the following two “more primitive” conditions both need to hold

\(\E[\lambda | G=4] \neq \E[\lambda | G = \infty]\)
\(F_2 \neq F_1\)

Otherwise, \(G = 4\) and \(G = \infty\) have the same trend between the first two periods.

\(\implies\) \(F_3^*\) is not identified
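
To see where the two primitive conditions come from (a short derivation, assuming the one-factor IFE model and \(\E[\Delta e_2 | G] = 0\)), take group means of \(\Delta Y_{i2}(0) = \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2}\) and difference across the two comparison groups:

\[\begin{align*} \E[\Delta Y_2(0) | G=4] - \E[\Delta Y_2(0) | G=\infty] = \Big(\E[\lambda | G=4] - \E[\lambda | G=\infty]\Big)(F_2 - F_1) \end{align*}\]

The right-hand side is non-zero only if both terms in the product are non-zero.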

Additional Details about Identification

Condition 2: Exogeneity

i.e., that “group” is uncorrelated with \(e_{it}\)
- This would be violated if, for example, the model for untreated potential outcomes should include two factors instead of one.

Can’t directly test exogeneity, but a lot of the DiD infrastructure carries over here.

For DiD, can “pre-test” parallel trends if there is more than 1 pre-treatment period
For our approach, we need 2 pre-treatment periods to identify \(ATT(g,t)\), but if there are more pre-treatment periods then we can pre-test
- e.g., if we have 3 pre-treatment periods, then non-zero pseudo-ATT’s in pre-treatment periods suggest that the exogeneity condition is violated.
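
A sketch of the pre-test idea on simulated data, assuming groups 4, 5, and \(\infty\) so that periods 1 to 3 are pre-treatment for everyone. The data generating process and the function names `simulate_pre_periods` and `pseudo_att` are hypothetical; the pseudo-ATT treats period 3 as if group 4 were already treated.

```python
import numpy as np

def simulate_pre_periods(n, two_factors, seed=0):
    """Untreated outcomes in periods 1-3 for groups 4, 5, and never-treated (inf)."""
    rng = np.random.default_rng(seed)
    G = rng.choice([4.0, 5.0, np.inf], size=n)
    lam1 = np.select([G == 4, G == 5], [1.0, 0.5], default=0.0) + 0.5 * rng.standard_normal(n)
    lam2 = np.select([G == 4, G == 5], [1.0, 0.0], default=1.0) + 0.5 * rng.standard_normal(n)
    theta = np.array([0.0, 1.0, 2.0])
    F1 = np.array([1.0, 2.0, 4.0])
    # The second factor is switched off when the one-factor model is correct
    F2 = np.array([0.0, 1.0, 0.0]) if two_factors else np.zeros(3)
    eta = rng.standard_normal(n)
    e = 0.5 * rng.standard_normal((n, 3))
    Y = theta + eta[:, None] + lam1[:, None] * F1 + lam2[:, None] * F2 + e
    return Y, G

def pseudo_att(Y, G, g=4.0, comps=(5.0, np.inf)):
    """One-factor estimate for group g in pre-treatment period 3."""
    dY2 = Y[:, 1] - Y[:, 0]
    dY3 = Y[:, 2] - Y[:, 1]
    m = lambda x, grp: x[G == grp].mean()
    a, b = comps
    F_star = (m(dY3, a) - m(dY3, b)) / (m(dY2, a) - m(dY2, b))
    theta_star = m(dY3, a) - F_star * m(dY2, a)
    return m(dY3, g) - (theta_star + F_star * m(dY2, g))

pseudo_one = pseudo_att(*simulate_pre_periods(300_000, two_factors=False))
pseudo_two = pseudo_att(*simulate_pre_periods(300_000, two_factors=True))
print(pseudo_one, pseudo_two)
```

When the one-factor model is correct the pseudo-ATT should be close to zero; with an omitted second factor it is far from zero in this design, which is the pattern the pre-test looks for.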

How Many Interactive Fixed Effects?

The discussion so far has been about the case of 1 IFE. However, an important issue for IFE approaches is determining how many IFE terms there are (e.g., 0, 1, 2, …)

Interestingly, there is a tight link in our approach between this “model selection” and identification

Example: Suppose that we know that the true number of interactive fixed effects is either 0 or 1. How can we decide?

Notice that parallel trends holds if either of the following two conditions hold:

\(\E[\lambda | G=3] = \E[\lambda | G=4] = \E[\lambda | G = \infty]\) \(\implies\) IFEs “absorbed” into time fixed effects
\(F_1 = F_2 = F_3\) \(\implies\) IFEs “absorbed” into unit fixed effects

How Many Interactive Fixed Effects?

Idea: Check the relevance condition (i.e., check if \(\E[\Delta Y_2 | G=4] \neq \E[\Delta Y_2 | G=\infty]\))

If relevance holds, then we can conclude that there is at least 1 IFE

If relevance doesn’t hold, then “act like” 0 IFEs
- This could be correct, 👍
- But it is not a guarantee that there are 0 IFEs.

How Many Interactive Fixed Effects?

Let us walk through the two cases in which relevance fails even though there really is 1 IFE.

Case 1: \(\color{green}{F_2 \neq F_1}\) but \(\color{green}{\E[\lambda | G=3] \neq} \color{red}{ \E[\lambda | G=4] = \E[\lambda | G = \infty]}\)

Intuition: \(G=4\) and \(G=\infty\) are the “same comparison group”, so we cannot deal with the IFE
\(\implies\) our approach won’t work, but you would be able to see that \(G=3\) is trending differently from \(G=4\) and \(G=\infty\)
- i.e., we can reject parallel trends in the same way as typical DiD approaches

Case 2: \(\color{green}{\E[\lambda | G=4] \neq \E[\lambda | G = \infty]}\) but \(\color{green}{F_3 \neq} \color{red}{F_2 = F_1}\)

Intuition: The effect of \(\lambda_i\) doesn’t change between periods 1 and 2 \(\implies\) all groups trend the same between periods 1 and 2, so it looks like parallel trends holds. Here it does hold in pre-treatment periods, but it is violated in post-treatment periods
\(\implies\) our approach won’t work, but (I think) no approach would work here
- This is closely related to the idea: “parallel trends is fundamentally untestable”

Extensions

General Case with More Periods and Groups

Can scale this argument up for more periods, groups, and IFEs

Estimation

Identification is constructive and suggests a two-step estimation procedure where we estimate the parameters of the IFE model in the first step (e.g., \(\theta_3^*\) and \(F_3^*\)) and then plug these into a second step estimator for \(ATT(g,t)\).
With more periods and/or interactive fixed effects, parameters of IFE model can be over-identified \(\implies\) GMM, but otherwise similar


Discussion

Relative to other approaches to dealing with IFEs:
- We do not need a large number of periods or extra auxiliary assumptions
- Only need there to be staggered treatment adoption

The main drawback is that we can’t recover as many \(ATT(g,t)\)’s; e.g., in this example, we can’t recover \(ATT(3,4)\) or \(ATT(4,4)\), which might be recoverable in other settings

Generality: we have talked about IFE models, but
- There are other types of models that need extra moment conditions (e.g., dynamic panel data model for \(Y_{it}(0)\)), could use the same sort of idea there

Application

Setup

Use county-level data from 1998-2007 during a period where the federal minimum wage was flat

Exploit minimum wage changes across states
- Any state that increases their minimum wage above the federal minimum wage will be considered as treated
- Allow for one year of “anticipation” (this only affects estimates in post-treatment periods)

Interested in the effect of the minimum wage on teen employment

We’ll also make a number of simplifications:
- not worry much about issues like clustered standard errors
- not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states

DiD Estimates

IFE Clarifications

The next set of results includes one interactive fixed effect
Additional Comments:
- Because of anticipation, we can only estimate effects up to 2005 (after that \(G=2007\) is no longer a valid comparison group)
- We also lose some estimates in early periods because those are needed in the quasi-differencing steps
- No estimates for \(G=2007\) at all because not enough valid comparison groups

IFE Results

Conclusion

Comments very welcome: brantly.callaway@uga.edu

Code: staggered_ife2 function in ife package in R, available at github.com/bcallaway11/ife

Appendix

References

Arkhangelsky, Dmitry, Susan Athey, David A Hirshberg, Guido W Imbens, and Stefan Wager. 2021. “Synthetic Difference-in-Differences.” American Economic Review 111 (12): 4088–118.

Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi. 2021. “Matrix Completion Methods for Causal Panel Data Models.” Journal of the American Statistical Association, 1–15.

Ban, Kyunghoon, and Desire Kedagni. 2022. “Generalized Difference-in-Differences Models: Robust Bounds.”

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. “How Much Should We Trust Differences-in-Differences Estimates?” The Quarterly Journal of Economics 119 (1): 249–75.

Brown, Nicholas, and Kyle Butts. 2023. “Dynamic Treatment Effect Estimation with Interactive Fixed Effects and Short Panels.”

Brown, Nicholas, Kyle Butts, and Joakim Westerlund. 2023. “Simple Difference-in-Differences Estimation in Fixed-t Panels.”

Callaway, Brantly, and Sonia Karami. 2023. “Treatment Effects in Interactive Fixed Effects Models with a Small Number of Time Periods.” Journal of Econometrics 233 (1): 184–208.

Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.

de Chaisemartin, Clement, and Xavier D’Haultfœuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.

Deryugina, Tatyana. 2017. “The Fiscal Cost of Hurricanes: Disaster Aid Versus Social Insurance.” American Economic Journal: Economic Policy 9 (3): 168–98.

Gobillon, Laurent, and Thierry Magnac. 2016. “Regional Policy Evaluation: Interactive Fixed Effects and Synthetic Controls.” Review of Economics and Statistics 98 (3): 535–51.

Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.

Imbens, Guido, Nathan Kallus, and Xiaojie Mao. 2021. “Controlling for Unmeasured Confounding in Panel Data Using Minimal Bridge Functions: From Two-Way Fixed Effects to Factor Models.”

Manski, Charles F, and John V Pepper. 2018. “How Do Right-to-Carry Laws Affect Crime Rates? Coping with Ambiguity Using Bounded-Variation Assumptions.” Review of Economics and Statistics 100 (2): 232–44.

Marcus, Michelle, and Pedro HC Sant’Anna. 2021. “The Role of Parallel Trends in Event Study Settings: An Application to Environmental Economics.” Journal of the Association of Environmental and Resource Economists 8 (2): 235–75.

Rambachan, Ashesh, and Jonathan Roth. 2023. “A More Credible Approach to Parallel Trends.” Review of Economic Studies 90 (5): 2555–91.

Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99.

Xu, Yiqing. 2017. “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis 25 (1): 57–76.

General Case - Setup

Interactive fixed effects for untreated potential outcomes:

\[ Y_{it}(0) = \theta_t + \eta_i + \lambda_i' F_t + e_{it} \] where \(\lambda_i\) and \(F_t\) are \(R\) dimensional vectors.

Assume: Unconfoundedness conditional on unobserved heterogeneity (this is what allows “groups” to be used as instruments):

\[ \E[Y_{t}(0) |\eta, \lambda, G] = \E[Y_{t}(0) |\eta, \lambda] \quad \text{a.s.} \]

An implication of both conditions above is that

\[ \E[e_t |\eta, \lambda, G] = 0 \]

which we use below as a source of moment conditions to identify parameters from the interactive fixed effects model.

General Case - Identification

Similar to earlier case:

\[ \begin{aligned} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \underbrace{\E[Y_t(0) - Y_{g-1}(0) | G=g]}_{\textrm{need to figure out}} \\ \end{aligned} \]

General Case - Identification

Using similar differencing arguments as before, one can show:

\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]

where

\(\widetilde{\Delta Y}_i^{pre(g)}(0)\) is an \(R\)-dimensional vector of transformations of pre-treatment outcomes,
\(\theta^*(g,t)\) and \(F^*(g,t)\) are transformations of time fixed effects and factors,
\(v_i(g,t)\) involves transformations of \(e_{it}\).

Implications of this representation:

\(R+1\) parameters to identify
\(\widetilde{\Delta Y}_i^{pre(g)}(0)\) is endogenous by construction
Can use “groups” as instruments
Identification is local to groups/periods that meet the following criteria:
- Group must have at least \(R+1\) pre-treatment periods (so quasi-differencing is feasible)
- Time period must be early enough that at least \(R+1\) not-yet-treated comparison groups remain.

General Case - Identification

For \(g' \in \mathcal{G}^{comp}(g,t)\), we use moment conditions of the form

\[ 0 = \E\Big[\indicator{G=g'} v(g,t)\Big]\]

Stacking the above moment conditions, we have that

\[ \mathbf{0}_{|\mathcal{G}^{comp}(g,t)|} = \E\left[ \ell^{comp}(g,t) \left\{ \Big( Y_{t} - Y_{g-1}\Big) - \Big(\theta^*(g,t) + \widetilde{\Delta Y}^{pre(g)\prime} F^*(g,t) \Big) \right\} \right] \]

where \(\ell^{comp}(g,t)\) is a vector of indicators for groups that have not yet been treated by period \(t\).

General Case - Identification

Since we are using groups as instruments, identification hinges on the relevance condition:

\[ \textrm{Rank}\Big(\mathbf{\Gamma}(g,t)\Big) = R + 1 \]

where

\[ \mathbf{\Gamma}(g,t) := \E\left[ \ell^{comp}(g,t) \begin{pmatrix} 1 \\ \widetilde{\Delta Y}^{pre(g)} \end{pmatrix}' \right] \]

Like the earlier case, you can relate the relevance condition to conditions on \(\lambda_i\) and \(F_t\).

There needs to be “enough variation” in \(\E[\lambda|G=g']\) among groups in \(\mathcal{G}^{comp}(g,t)\).
There needs to be “enough variation” in \(F_t\) across pre-treatment time periods.


General Case - Identification

Theorem: Identification

For some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\) and under given assumptions,

\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \Big( \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \mathbf{\Gamma}(g,t) \Big)^{-1} \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]

In addition, \(ATT(g,t)\) is identified, and it is given by:

\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \Big( \theta^*(g,t) + F^*(g,t)'\E[\widetilde{\Delta Y}^{pre(g)} | G=g] \Big) \]

Estimation

Estimation proceeds in two steps and is constructive given the identification results. The first step is to estimate \(\delta^*(g,t) := \big(\theta^*(g,t), F^*(g,t)'\big)'\):

Given a positive definite weighting matrix \(\widehat{\mathbf{W}}(g,t)\) and the sample analog \(\widehat{\mathbf{\Gamma}}(g,t)\) of \(\mathbf{\Gamma}(g,t)\), the estimator of \(\delta^*(g,t)\) is:

\[ \widehat{\delta}^*(g,t) = \left( \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t)\widehat{\mathbf{\Gamma}}(g,t) \right)^{-1} \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t) \E_n\big[\ell^{comp}_i(g,t)(Y_{it} - Y_{ig-1})\big] \]

Estimation

Second step, plug into sample analog of expression for \(ATT(g,t)\):

\[ \widehat{ATT}(g,t) = \hat{p}_g^{-1} \left\{ \E_n\Big[\indicator{G_i=g}(Y_{it} - Y_{ig-1})\Big] - \E_n\Big[A_i(g)\Big]^\prime \widehat{\delta}^*(g,t) \right\} \]

where

\[ A_i(g) := \indicator{G_i=g}\begin{pmatrix}1 \\ \widetilde{\Delta Y}_i^{pre(g)} \end{pmatrix} \]

If you want an event study or overall average treatment effect, can combine estimates across groups and time periods, following the same logic as in CS-2021.
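
The two steps above can be sketched in matrix form for the baseline case (\(R=1\), \(g=3\), \(t=3\), comparison groups 4 and \(\infty\)). This is an illustration under assumptions, not the `ife` package code: the simulated data are made up, the weighting matrix is the identity, and \(\widetilde{\Delta Y}_i^{pre(g)}\) is taken to be \(Y_{i,g-1} - Y_{i,g-2}\), which matches the baseline derivation but is an assumption for the general transformation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6000
G = rng.choice([3.0, 4.0, np.inf], size=n)
lam = np.select([G == 3, G == 4], [1.0, 0.5], default=0.0) + 0.5 * rng.standard_normal(n)
theta = np.array([0.0, 1.0, 2.0])
F = np.array([1.0, 2.0, 4.0])
Y = theta + lam[:, None] * F + 0.5 * rng.standard_normal((n, 3))
Y[:, 2] += 1.0 * (G == 3)            # made-up treatment effect for group 3 in period 3

g, comps = 3.0, [4.0, np.inf]
dY_post = Y[:, 2] - Y[:, 1]          # Y_t - Y_{g-1} with t = 3, g = 3
dY_pre = Y[:, 1] - Y[:, 0]           # assumed form of tilde-Delta-Y^{pre(g)} when R = 1

# Step 1: GMM for delta* = (theta*, F*)' with groups as instruments
ell = np.column_stack([G == c for c in comps]).astype(float)   # n x |G^comp|
X = np.column_stack([np.ones(n), dY_pre])                      # n x (R + 1)
Gamma_hat = ell.T @ X / n
m_hat = ell.T @ dY_post / n
W = np.eye(len(comps))                                         # identity weights (assumption)
delta_hat = np.linalg.solve(Gamma_hat.T @ W @ Gamma_hat, Gamma_hat.T @ W @ m_hat)

# Step 2: plug into the sample analog of ATT(g,t)
is_g = (G == g).astype(float)
p_g = is_g.mean()
A_bar = (is_g[:, None] * X).mean(axis=0)                       # E_n[A_i(g)]
att_hat = (np.mean(is_g * dY_post) - A_bar @ delta_hat) / p_g

# Cross-check against the closed-form solution of the two moment conditions
m = lambda x, grp: x[G == grp].mean()
F_star = (m(dY_post, np.inf) - m(dY_post, 4)) / (m(dY_pre, np.inf) - m(dY_pre, 4))
att_closed_form = m(dY_post, 3) - (m(dY_post, 4) + F_star * (m(dY_pre, 3) - m(dY_pre, 4)))
print(att_hat, att_closed_form)
```

In this just-identified case the weighting matrix does not matter, and the matrix estimator coincides with the closed-form solution of the two moment conditions.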

Asymptotic Theory

Theorem: Asymptotic Normality

Suppose assumptions hold, then for some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\),

\(\widehat{ATT}(g,t)\) is asymptotically linear, and it satisfies the relation: \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_{igt} + o_p(1) \]
\(\widehat{ATT}(g,t) \rightarrow_p ATT(g,t)\) as \(n \rightarrow \infty\) for each pair \((g,t)\).
In addition, \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) \xrightarrow{d} \mathcal{N}(0,\sigma_{gt}^2) \] where \(\sigma_{gt}^2 = \E[\psi_{igt}^2]\).


General Case - Relevance Condition

\[ \begin{aligned} \textrm{Define: } \qquad \mathbf{\Lambda}^{comp}(g,t) := \E\Big[ \ell^{comp}(g,t) \begin{pmatrix} 1 & \lambda' \end{pmatrix} \Big] \quad \textrm{and} \quad \mathbf{\Delta F}^{pre(g)} := \begin{bmatrix} \Delta F_2' \\ \vdots \\ \Delta F_{g-1}' \end{bmatrix} \end{aligned} \]

where \(\mathbf{\Lambda}^{comp}(g,t)\) is a \(|\mathcal{G}^{comp}(g,t)| \times (R+1)\) matrix, and \(\mathbf{\Delta F}^{pre(g)}\) is a \((g-2) \times R\) matrix.

Proposition: Relevance

The rank condition for identification is equivalent to the following: \[ \textrm{Rank}\Big(\mathbf{\Lambda}^{comp}(g,t)\Big) = R + 1 \quad \textrm{and} \quad \textrm{Rank}\Big(\mathbf{\Delta F}^{pre(g)}\Big) = R \]
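
A small sketch of how the two rank conditions in the proposition could be checked numerically for given population quantities. The function name `relevance_holds` and all inputs are hypothetical; it uses the fact that row \(g'\) of \(\mathbf{\Lambda}^{comp}(g,t)\) equals \(P(G=g') \cdot (1, \E[\lambda|G=g']')\).

```python
import numpy as np

def relevance_holds(p, mean_lambda, dF_pre, R):
    """Check Rank(Lambda^comp) = R + 1 and Rank(Delta F^pre) = R.

    p           : P(G = g') for each comparison group
    mean_lambda : E[lambda | G = g'] for each comparison group (R values per group)
    dF_pre      : Delta F_2, ..., Delta F_{g-1} (R values per period)
    """
    p = np.asarray(p, dtype=float)
    mean_lambda = np.asarray(mean_lambda, dtype=float).reshape(len(p), R)
    Lambda_comp = p[:, None] * np.column_stack([np.ones(len(p)), mean_lambda])
    dF_pre = np.asarray(dF_pre, dtype=float).reshape(-1, R)
    return bool(np.linalg.matrix_rank(Lambda_comp) == R + 1
                and np.linalg.matrix_rank(dF_pre) == R)

# Baseline case: comparison groups 4 and inf, one factor, g = 3
ok = relevance_holds(p=[1/3, 1/3], mean_lambda=[0.5, 0.0], dF_pre=[1.0], R=1)
# Comparison groups share the same mean lambda: rank of Lambda^comp drops
same_lambda = relevance_holds(p=[1/3, 1/3], mean_lambda=[0.5, 0.5], dF_pre=[1.0], R=1)
# Factor does not move in the pre-treatment periods: rank of Delta F^pre drops
flat_factor = relevance_holds(p=[1/3, 1/3], mean_lambda=[0.5, 0.0], dF_pre=[0.0], R=1)
print(ok, same_lambda, flat_factor)
```

The two failing inputs mirror the two ways relevance fails in the one-factor discussion: comparison groups with the same mean \(\lambda\), or a factor that does not change across pre-treatment periods.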
