University of Georgia
FGSES, Université Mohammed VI Polytechnique
September 20, 2024
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\T}{\mathrm{T}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)
Setting of the paper: A researcher is interested in learning about the causal effect of a binary treatment and has access to a few periods of panel data
In the current paper, we will think about:
Cases where the parallel trends assumption could be violated
Applications where there is staggered treatment adoption
How to exploit staggered treatment adoption to allow for violations of parallel trends while still recovering the same target causal effect parameters
Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on \(\underbrace{\textrm{earnings}}_{\textrm{outcome}}\)
Parallel trends assumption: \(\E[\Delta Y_t(0) | D=1] = \E[\Delta Y_t(0) | D=0]\)
Where does parallel trends come from?
Parallel trends is closely connected to the following model of untreated potential outcomes:
\[Y_{it}(0) = \theta_t + \xi_i + e_{it}\]
where \(\xi_i\) is unobserved heterogeneity and \(e_{it}\) is an idiosyncratic error term.
You can view this as embodying two assumptions:
Unconfoundedness (conditional on unobserved heterogeneity): \(Y_{it}(0) \independent D_i | \xi_i\)
Linearity / Additive Separability
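To see why this model delivers parallel trends, a short derivation may help. It reads “idiosyncratic” as \(\E[e_t | \xi, D] = 0\), which is an interpretation of the slide rather than something it states explicitly:

\[\begin{align*} \Delta Y_{it}(0) &= \Delta \theta_t + \Delta e_{it} \\ \E[\Delta Y_t(0) | D=d] &= \Delta \theta_t + \E[\Delta e_t | D=d] = \Delta \theta_t \end{align*}\]

The unit effect \(\xi_i\) differences out, and the right-hand side does not depend on \(d\), which is exactly the parallel trends assumption.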
What if we back off of the additive separability assumption?
In this case, we have that
\[Y_{it}(0) = h_t(\xi_i) + e_{it}\]
where \(h_t(\xi) = \E[Y_t(0)|\xi,D]\), and \(e_{it}\) is idiosyncratic, but this is too generic to be useful…
Alternatively, can we assess the plausibility of additive separability?
Theoretically, in (probably most) applications, we simply do not know if additive separability is reasonable or not
Therefore, most DiD applications in economics include an event study plot that checks parallel trends in pre-treatment periods.
This is implicitly a test of the additive separability in the previous model for untreated potential outcomes.
Consider an alternative model for untreated potential outcomes
That’s what we will do in this paper!
Using (arguably) the most naturally connected approach to DiD: interactive fixed effects (IFE)
(I think) IFE is closely connected to the ways that “bounding approaches” allow for violations of parallel trends…
Introduction to Interactive Fixed Effects Models
Identification - Baseline Case
More Periods and More Groups
Application
An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]
\(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)
\(F_t\) is often referred to as a “factor”
\(e_{it}\) is idiosyncratic in the sense that \(\E[e_t | \eta, \lambda, D] = 0\)
In our context, though, it makes sense to interpret these as
\(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved “ability”)
\(F_t\) the time-varying “return” to unobserved heterogeneity (e.g., return to “ability”)
Special Case: \(F_t = t\) \(\implies\) unit-specific linear trends
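As a quick check of the special case, substituting \(F_t = t\) into the IFE model gives \(\Delta F_t = 1\) in every period:

\[\begin{align*} Y_{it}(0) &= \theta_t + \eta_i + \lambda_i t + e_{it} \\ \Delta Y_{it}(0) &= \Delta \theta_t + \lambda_i + \Delta e_{it} \end{align*}\]

so each unit follows its own linear trend with slope \(\lambda_i\) on top of the common time effects.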
View 1: Model for untreated potential outcomes
The IFE model correctly specifies \(\E[Y_t(0)|\xi,D]\), i.e.,
\[Y_{it}(0) = \underbrace{\theta_t + \eta_i + \lambda_i F_t}_{h_t(\xi_i)} + e_{it}\]
where \(\xi_i = (\eta_i, \lambda_i)\)
View 2: Difference-in-Differences
Alternatively, we can come at this from the perspective of difference-in-differences
Including “covariates” in the parallel trends assumption is very common
One very common way to operationalize is regression adjustment
\[Y_{it}(0) = \theta_t + \eta_i + X_i \beta_t + e_{it}\]
Interactive fixed effects models allow for violations of parallel trends:
\[\begin{align*} \E[\Delta Y_{t}(0) | D = d] = \Delta \theta_t + \E[\lambda|D=d]\Delta F_t \end{align*}\] which can vary across groups.
Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) outcomes in groups with higher mean “ability” will tend to increase more over time than outcomes in groups with lower mean “ability”
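A small numeric sketch of this point, using the expression \(\E[\Delta Y_t(0) | D=d] = \Delta \theta_t + \E[\lambda|D=d]\Delta F_t\). The function name and all numbers are hypothetical and chosen only for illustration:

```python
def expected_untreated_trend(delta_theta, mean_lambda, delta_F):
    """E[Delta Y_t(0) | D=d] = Delta theta_t + E[lambda | D=d] * Delta F_t."""
    return delta_theta + mean_lambda * delta_F


# Hypothetical values, for illustration only.
delta_theta = 1.0   # common change in the time effect
delta_F = 0.5       # the "return to ability" rises between periods
mean_lambda = {"treated": 2.0, "untreated": 1.0}  # mean "ability" by group

trend = {d: expected_untreated_trend(delta_theta, lam, delta_F)
         for d, lam in mean_lambda.items()}

# A nonzero gap means parallel trends fails even without any treatment.
gap = trend["treated"] - trend["untreated"]
print(trend, gap)
```

Setting `delta_F = 0.0` or giving both groups the same mean `lambda` makes the gap vanish, which matches the two ways the interactive term can be absorbed into ordinary fixed effects.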
Many of the insights of recent work on DiD have been in the context of staggered treatment adoption
\(\implies\) there is variation in treatment timing across units
de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021), Marcus and Sant’Anna (2021), among others
These papers all treat staggered treatment adoption as a nuisance
In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters
Observed data: \(\{Y_{i1}, Y_{i2}, \ldots Y_{i\T}, D_{i1}, D_{i2}, \ldots, D_{i\T}\}_{i=1}^n\)
\(\T\) time periods
No one treated in the first time period (i.e., \(D_{i1} = 0\))
Staggered treatment adoption: for \(t=2,\ldots,\T\), \(D_{it-1} = 1 \implies D_{it}=1\).
A unit’s group \(G_i\) is the time period when it becomes treated. By convention, set \(G_i = \infty\) for units that do not participate in the treatment in any period.
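Potential outcomes: \(Y_{it}(g)\) is the outcome unit \(i\) would experience in period \(t\) if it became treated in period \(g\); \(Y_{it}(0)\) is the untreated potential outcome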
Observed outcomes: \(Y_{it} = Y_{it}(G_i)\)
No anticipation: For \(t < G_i\), \(Y_{it} = Y_{it}(0)\)
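The setup can be sketched in code. The helper names below are hypothetical; the sketch only encodes the conditions listed above (no one treated in period 1, treatment never switches off, \(G_i = \infty\) for never-treated units):

```python
import math


def group_from_path(d_path):
    """Return G_i, the first treated period (1-indexed), or math.inf.

    d_path is the treatment path (D_i1, ..., D_iT) as a list of 0/1 values.
    Raises ValueError if the path violates the setup.
    """
    if d_path[0] != 0:
        raise ValueError("no unit may be treated in the first period")
    for prev, cur in zip(d_path, d_path[1:]):
        if prev == 1 and cur == 0:
            raise ValueError("staggered adoption: treatment cannot turn off")
    for t, d in enumerate(d_path, start=1):
        if d == 1:
            return t
    return math.inf


def observed_is_untreated(g, t):
    """Under no anticipation, Y_it = Y_it(0) whenever t < G_i."""
    return t < g


print(group_from_path([0, 0, 1, 1]))   # unit first treated in period 3
print(group_from_path([0, 0, 0, 0]))   # never-treated unit
```

Coding the never-treated group as `math.inf` keeps comparisons such as `t < g` valid for every unit without special cases.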
This setup is exactly the same as the literature on DiD with staggered treatment adoption
Following CS-2021, we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]
\(ATT(g,t)\) is the average treatment effect for group \(g\) in time period \(t\)
Group-time average treatment effects are the natural building block for other common target parameters in DiD applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
Target: \(ATT(3,3) = \E[\Delta Y_3 | G=3] - \underbrace{\E[\Delta Y_3(0) | G=3]}_{\textrm{have to figure out}}\)
In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\]
The second equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] (this requires \(\Delta F_2 \neq 0\)). Plugging this back into the first equation and combining terms, we have that
\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,
\[\begin{align*} \E[\Delta Y_3(0) | G=3] = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_2(0) | G = 3]}_{\textrm{identified}} + \underbrace{\E[v_3|G=3]}_{=0} \end{align*}\]
\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).
How can we recover \(\theta_3^*\) and \(F_3^*\)?
Expression involves untreated potential outcomes through period 3, and we have groups 4 and \(\infty\) for which we observe these untreated potential outcomes. This suggests using those groups.
However, this is not so simple because \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) by construction: both contain \(\Delta e_{i2}\)
We need some exogenous variation (IV) to recover the parameters \(\rightarrow\)
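To make the endogeneity explicit, the covariance can be written out. This is a sketch that uses \(\E[e_t | \eta, \lambda, D] = 0\) to drop the covariance between \(\lambda_i\) and the differenced errors:

\[\begin{align*} \cov\big(\Delta Y_{i2}(0), v_{i3}\big) &= \cov\big(\lambda_i \Delta F_2 + \Delta e_{i2}, \ \Delta e_{i3} - F_3^* \Delta e_{i2}\big) \\ &= \cov(\Delta e_{i2}, \Delta e_{i3}) - F_3^* \var(\Delta e_{i2}) \end{align*}\]

This is generally nonzero, so regressing \(\Delta Y_{i3}\) on \(\Delta Y_{i2}\) among not-yet-treated units would not recover \(F_3^*\).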
There are a number of different ideas here:
Make additional assumptions ruling out serial correlation in \(e_{it}\) \(\implies\) can use lags of outcomes as instruments (Imbens, Kallus, and Mao 2021).
In particular, notice that there are two distinct untreated groups in period 3 (group 4 and group \(\infty\)), which gives two moment conditions:
\[\begin{align*} \E[\Delta Y_3(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=4] \\ \E[\Delta Y_3(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=\infty] \\ \end{align*}\]
We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_3|G=\infty] - \E[\Delta Y_3|G=4]}{\E[\Delta Y_2|G=\infty] - \E[\Delta Y_2|G=4]} \\ \theta_3^* &= \E[\Delta Y_3 | G=4] - F_3^* \E[\Delta Y_2 | G=4] \end{align*}\]
\(\implies\) we can recover \(ATT(3,3)\).
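The argument can be checked numerically at the population level. The sketch below builds group-level mean outcome changes from a one-factor IFE model and then applies the two solved moment conditions. All parameter values are hypothetical and chosen only for illustration:

```python
import math

# Hypothetical population values, for illustration only.
theta = {1: 0.0, 2: 1.0, 3: 1.5}                # time effects theta_t
F = {1: 1.0, 2: 1.5, 3: 2.5}                    # factor F_t
mean_lambda = {3: 2.0, 4: 1.0, math.inf: 0.0}   # E[lambda | G=g]
true_att_33 = 0.7                               # ATT(3,3) built into the "data"


def mean_delta_y(g, t):
    """E[Delta Y_t | G=g] implied by the IFE model plus the treatment effect."""
    untreated = (theta[t] - theta[t - 1]) + mean_lambda[g] * (F[t] - F[t - 1])
    effect = true_att_33 if (g == 3 and t == 3) else 0.0
    return untreated + effect


# Groups 4 and infinity are both untreated through period 3.
F_star = ((mean_delta_y(math.inf, 3) - mean_delta_y(4, 3))
          / (mean_delta_y(math.inf, 2) - mean_delta_y(4, 2)))
theta_star = mean_delta_y(4, 3) - F_star * mean_delta_y(4, 2)

# Counterfactual trend for group 3, then ATT(3,3).
counterfactual = theta_star + F_star * mean_delta_y(3, 2)
att_33 = mean_delta_y(3, 3) - counterfactual

# For contrast: DiD against the never-treated group ignores the factor.
did_33 = mean_delta_y(3, 3) - mean_delta_y(math.inf, 3)
print(F_star, theta_star, att_33, did_33)
```

In this design the groups differ in mean \(\lambda\), so the plain DiD comparison is far from the built-in effect, while the two-group strategy recovers it.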
This strategy amounts to using “group” as an instrument for \(\Delta Y_{i2}(0)\).
Condition 1: Relevance \(\quad \E[\Delta Y_2(0) | G=4] \neq \E[\Delta Y_2(0) | G=\infty]\)
For relevance to hold, the following two “more primitive” conditions both need to hold:
\(\E[\lambda | G=4] \neq \E[\lambda | G = \infty]\)
\(F_2 \neq F_1\)
Otherwise, \(G = 4\) and \(G = \infty\) have the same trend between the first two periods.
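These two primitive conditions follow from differencing the IFE model and using \(\E[\Delta e_2 | G] = 0\):

\[\begin{align*} \E[\Delta Y_2(0) | G=4] - \E[\Delta Y_2(0) | G=\infty] = \Big( \E[\lambda | G=4] - \E[\lambda | G=\infty] \Big) (F_2 - F_1) \end{align*}\]

The product is nonzero only if both factors are nonzero.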
Condition 2: Exogeneity \(\quad \E[v_3 | G=4] = \E[v_3 | G=\infty] = 0\)
Can’t directly test exogeneity, but a lot of the DiD infrastructure carries over here.
For DiD, can “pre-test” parallel trends if there is more than 1 pre-treatment period
For our approach, we need 2 pre-treatment periods to identify \(ATT(g,t)\), but if there are more pre-treatment periods then we can pre-test
The discussion so far has been about the case of 1 IFE. However, an important issue for IFE approaches is determining how many IFE terms there are (e.g., 0, 1, 2, …)
Example: Suppose that we know that the true number of interactive fixed effects is either 0 or 1. How can we decide?
Notice that parallel trends holds if either of the following two conditions holds:
\(\E[\lambda | G=3] = \E[\lambda | G=4] = \E[\lambda | G = \infty]\) \(\implies\) IFEs “absorbed” into time fixed effects
\(F_1 = F_2 = F_3\) \(\implies\) IFEs “absorbed” into unit fixed effects
Idea: Check the relevance condition (i.e., check if \(\E[\Delta Y_2 | G=4] \neq \E[\Delta Y_2 | G=\infty]\))
Let us walk through both cases where relevance fails, but there really is 1 IFE.
Case 1: \(\color{green}{F_2 \neq F_1}\) but \(\color{green}{\E[\lambda | G=3] \neq} \color{red}{ \E[\lambda | G=4] = \E[\lambda | G = \infty]}\)
Intuition: \(G=4\) and \(G=\infty\) are the “same comparison group”, so we cannot deal with the IFE
\(\implies\) our approach won’t work, but you would be able to see that \(G=3\) is trending differently from \(G=4\) and \(G=\infty\)
Case 2: \(\color{green}{\E[\lambda | G=4] \neq \E[\lambda | G = \infty]}\) but \(\color{green}{F_3 \neq} \color{red}{F_2 = F_1}\)
Intuition: The effect of \(\lambda_i\) doesn’t change between periods 1 and 2 \(\implies\) all groups trend the same between periods 1 and 2, so it looks like parallel trends holds. Here it does hold in pre-treatment periods, but it is violated in post-treatment periods
\(\implies\) our approach won’t work, but (I think) no approach would work here
General Case with More Periods and Groups
Estimation
Identification is constructive and suggests a two-step estimation procedure where we estimate the parameters of the IFE model in the first step (e.g., \(\theta_3^*\) and \(F_3^*\)) and then plug these into a second step estimator for \(ATT(g,t)\).
With more periods and/or interactive fixed effects, parameters of IFE model can be over-identified \(\implies\) GMM, but otherwise similar
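A minimal sketch of the two-step procedure in the baseline case (one IFE, groups 3, 4, \(\infty\), target \(ATT(3,3)\)), run on simulated unit-level data. This is not the `ife` package implementation. The simulation design, the function names, and the constant treatment effect `tau` are all assumptions made for illustration:

```python
import math
import random


def simulate(n_per_group=500, tau=0.7, noise_sd=0.0, seed=0):
    """Simulate (G_i, [Y_i1, Y_i2, Y_i3]) from a one-factor IFE model.

    All parameter values are hypothetical. tau is a constant treatment
    effect for group 3 in period 3, so ATT(3,3) = tau.
    """
    rng = random.Random(seed)
    theta = [0.0, 1.0, 1.5]                        # theta_1..theta_3
    F = [1.0, 1.5, 2.5]                            # F_1..F_3
    mean_lambda = {3: 2.0, 4: 1.0, math.inf: 0.0}  # E[lambda | G=g]
    data = []
    for g, m in mean_lambda.items():
        for _ in range(n_per_group):
            eta, lam = rng.gauss(0, 1), rng.gauss(m, 1)
            y = [theta[t] + eta + lam * F[t] + rng.gauss(0, noise_sd)
                 for t in range(3)]
            if g == 3:
                y[2] += tau                        # treated in period 3 only
            data.append((g, y))
    return data


def group_mean_diff(data, g, t):
    """Sample analog of E[Y_t - Y_{t-1} | G=g], for t in {2, 3}."""
    diffs = [y[t - 1] - y[t - 2] for grp, y in data if grp == g]
    return sum(diffs) / len(diffs)


def estimate_att_33(data):
    # Step 1: estimate F_3^* and theta_3^* from groups 4 and infinity.
    d3_inf = group_mean_diff(data, math.inf, 3)
    d3_4 = group_mean_diff(data, 4, 3)
    d2_inf = group_mean_diff(data, math.inf, 2)
    d2_4 = group_mean_diff(data, 4, 2)
    F_star = (d3_inf - d3_4) / (d2_inf - d2_4)
    theta_star = d3_4 - F_star * d2_4
    # Step 2: plug into the expression for ATT(3,3).
    counterfactual = theta_star + F_star * group_mean_diff(data, 3, 2)
    return group_mean_diff(data, 3, 3) - counterfactual


print(estimate_att_33(simulate(noise_sd=0.0)))                    # noiseless check
print(estimate_att_33(simulate(n_per_group=5000, noise_sd=1.0)))  # with noise
```

Without idiosyncratic noise the linear relation between \(\Delta Y_{i3}(0)\) and \(\Delta Y_{i2}(0)\) holds exactly for every unit, so the estimator returns `tau` up to floating-point error. With noise, precision depends on how far apart the two comparison groups' period-2 trends are, which is the relevance condition at work.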
Relative to other approaches to dealing with IFEs:
We do not need a large number of periods or extra auxiliary assumptions
Only need there to be staggered treatment adoption
Generality: we have talked about IFE models, but
Exploit minimum wage changes across states
Any state that increases its minimum wage above the federal minimum wage will be considered as treated
Allow for one year of “anticipation” (this only affects estimates in post-treatment periods)
The next set of results includes one interactive fixed effect
Additional Comments:
Comments very welcome: brantly.callaway@uga.edu
Code: staggered_ife2 function in the ife package in R, available at github.com/bcallaway11/ife
Interactive fixed effects for untreated potential outcomes:
\[ Y_{it}(0) = \theta_t + \eta_i + \lambda_i' F_t + e_{it} \] where \(\lambda_i\) and \(F_t\) are \(R\) dimensional vectors.
Assume: Unconfoundedness conditional on unobserved heterogeneity (i.e., this implies “groups” can be used as instruments):
\[ \E[Y_{t}(0) |\eta, \lambda, G] = \E[Y_{t}(0) |\eta, \lambda] \quad \text{a.s.} \]
An implication of both conditions above is that
\[ \E[e_t |\eta, \lambda, G] = 0 \]
which we use below as a source of moment conditions to identify parameters from the interactive fixed effects model.
Similar to earlier case:
\[ \begin{aligned} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \underbrace{\E[Y_t(0) - Y_{g-1}(0) | G=g]}_{\textrm{need to figure out}} \\ \end{aligned} \]
Using similar differencing arguments as before, one can show:
\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]
so that
\(R+1\) parameters to identify
\(\widetilde{\Delta Y}_i^{pre(g)}(0)\) is endogenous by construction
Can use “groups” as instruments
Identification is local to groups/periods that meet the following criteria:
For \(g' \in \mathcal{G}^{comp}(g,t)\), we use moment conditions of the form
\[ 0 = \E\Big[\indicator{G=g'} v(g,t)\Big]\]
Stacking the above moment conditions, we have that
\[ \mathbf{0}_{|\mathcal{G}^{comp}(g,t)|} = \E\left[ \ell^{comp}(g,t) \left\{ \Big( Y_{t} - Y_{g-1}\Big) - \Big(\theta^*(g,t) + \widetilde{\Delta Y}^{pre(g)'} F^*(g,t) \Big) \right\} \right] \]
where \(\ell^{comp}(g,t)\) is a vector of indicators for groups that have not yet been treated by period \(t\).
Since we are using groups as instruments, identification hinges on the relevance condition:
\[ \textrm{Rank}\Big(\mathbf{\Gamma}(g,t)\Big) = R + 1 \]
where
\[ \mathbf{\Gamma}(g,t) := \E\left[ \ell^{comp}(g,t) \begin{pmatrix} 1 \\ \widetilde{\Delta Y}^{pre(g)} \end{pmatrix}' \right] \]
Like the earlier case, you can relate the relevance condition to conditions on \(\lambda_i\) and \(F_t\).
There needs to be “enough variation” in \(\E[\lambda|G=g']\) among groups in \(\mathcal{G}^{comp}(g,t)\).
There needs to be “enough variation” in \(F_t\) across pre-treatment time periods.
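As a sanity check, the rank condition can be written out for the baseline case with \(R=1\), \(g=t=3\), and comparison groups 4 and \(\infty\). The sketch takes \(\widetilde{\Delta Y}^{pre(3)} = \Delta Y_2\) and writes \(p_{g'} := \P(G=g')\); both are assumptions about notation made here for illustration:

\[ \mathbf{\Gamma}(3,3) = \begin{pmatrix} p_4 & p_4 \E[\Delta Y_2 | G=4] \\ p_\infty & p_\infty \E[\Delta Y_2 | G=\infty] \end{pmatrix}, \qquad \det \mathbf{\Gamma}(3,3) = p_4 \, p_\infty \Big( \E[\Delta Y_2 | G=\infty] - \E[\Delta Y_2 | G=4] \Big) \]

With both groups having positive probability, \(\mathbf{\Gamma}(3,3)\) has rank \(2 = R+1\) exactly when the two groups have different trends between the first two periods, which is the relevance condition from the baseline case.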
Theorem: Identification
For some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\) and under given assumptions,
\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \Big( \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \mathbf{\Gamma}(g,t) \Big)^{-1} \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]
In addition, \(ATT(g,t)\) is identified, and it is given by:
\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \Big( \theta^*(g,t) + F^*(g,t)'\E[\widetilde{\Delta Y}^{pre(g)} | G=g] \Big) \]
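One special case may help in reading the theorem. When there are exactly \(R+1\) comparison groups, \(\mathbf{\Gamma}(g,t)\) is square and, under the rank condition, invertible. For any positive definite \(\mathbf{W}(g,t)\), the weighting matrix then drops out:

\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \mathbf{\Gamma}(g,t)^{-1} \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]

With more comparison groups than parameters, the moment conditions over-identify the parameters and the choice of \(\mathbf{W}(g,t)\) enters the expression.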
Estimation proceeds in two steps and is constructive given identification results. The first step is to estimate \(\delta^*(g,t) := \big(\theta^*(g,t), F^*(g,t)'\big)'\):
Given a positive definite matrix \(\widehat{\mathbf{W}}(g,t)\), the estimator of \(\delta^*(g,t)\) is:
\[ \widehat{\delta}^*(g,t) = \left( \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t)\widehat{\mathbf{\Gamma}}(g,t) \right)^{-1} \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t) \E_n\big[\ell^{comp}_i(g,t)(Y_{it} - Y_{ig-1})\big] \]
Second step, plug into sample analog of expression for \(ATT(g,t)\):
\[ \widehat{ATT}(g,t) = \hat{p}_g^{-1} \left\{ \E_n\Big[\indicator{G_i=g}(Y_{it} - Y_{ig-1})\Big] - \E_n\Big[A_i(g)\Big]^\prime \widehat{\delta}^*(g,t) \right\} \]
where
\[ A_i(g) := \indicator{G_i=g}\begin{pmatrix}1 \\ \widetilde{\Delta Y}_i^{pre(g)} \end{pmatrix} \]
If you want an event study or overall average treatment effect, can combine estimates across groups and time periods, following the same logic as in CS-2021.
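A simplified sketch of such an aggregation: an event-study parameter at event time \(e\) averages \(ATT(g, g+e)\) across groups, weighting by relative group size. The function name, its arguments, and the numbers are hypothetical, and the sketch ignores inference:

```python
def event_study(att, group_prob, e, last_period):
    """Aggregate ATT(g, g+e) across groups, weighting by relative group size.

    att: dict mapping (g, t) -> estimate of ATT(g,t)
    group_prob: dict mapping g -> P(G=g) for the treated groups
    Only groups with g + e <= last_period and an available ATT(g, g+e)
    contribute; weights are renormalized over those groups.
    """
    groups = [g for g in group_prob
              if g + e <= last_period and (g, g + e) in att]
    if not groups:
        raise ValueError(f"no group is observed at event time {e}")
    total = sum(group_prob[g] for g in groups)
    return sum(group_prob[g] / total * att[(g, g + e)] for g in groups)


# Hypothetical estimates, for illustration only.
att = {(3, 3): 0.7, (3, 4): 1.0, (4, 4): 0.4, (4, 5): 0.5}
group_prob = {3: 0.2, 4: 0.3}

on_impact = event_study(att, group_prob, e=0, last_period=5)
one_period_later = event_study(att, group_prob, e=1, last_period=5)
print(on_impact, one_period_later)
```

In the IFE setting the set of available \((g,t)\) pairs is smaller than under parallel trends, because \(ATT(g,t)\) is only identified while enough comparison groups remain untreated. The `(g, g + e) in att` check is where that restriction enters.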
Theorem: Asymptotic Normality
Suppose assumptions hold, then for some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\),
\(\widehat{ATT}(g,t)\) is asymptotically linear, and it satisfies the relation: \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_{igt} + o_p(1) \]
\(\widehat{ATT}(g,t) \rightarrow_p ATT(g,t)\) as \(n \rightarrow \infty\) for each pair \((g,t)\).
In addition, \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) \xrightarrow{d} \mathcal{N}(0,\sigma_{gt}^2) \] where \(\sigma_{gt}^2 = \E[\psi_{igt}^2]\).
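The limiting distribution suggests a plug-in standard error, sketched below. The function name is hypothetical, and the influence-function values \(\psi_{igt}\) are taken as given inputs, since their expression is not stated here:

```python
import math


def se_and_ci(att_hat, psi, z=1.96):
    """Standard error and confidence interval from influence-function values.

    psi: list of estimated influence-function values psi_igt, one per unit.
    Uses sigma^2 = E[psi^2], estimated by the sample mean of psi^2, and
    se = sqrt(sigma^2 / n). z=1.96 gives an approximate 95% interval.
    """
    n = len(psi)
    sigma2 = sum(p * p for p in psi) / n
    se = math.sqrt(sigma2 / n)
    return se, (att_hat - z * se, att_hat + z * se)


# Hypothetical influence-function values, for illustration only.
se, ci = se_and_ci(0.7, [1.0, -1.0, 2.0, -2.0])
print(se, ci)
```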
\[ \begin{aligned} \textrm{Define: } \qquad \mathbf{\Lambda}^{comp}(g,t) := \E\Big[ \ell^{comp}(g,t) \begin{pmatrix} 1 & \lambda' \end{pmatrix} \Big] \quad \textrm{and} \quad \mathbf{\Delta F}^{pre(g)} := \begin{bmatrix} \Delta F_2' \\ \vdots \\ \Delta F_{g-1}' \end{bmatrix} \end{aligned} \]
where \(\mathbf{\Lambda}^{comp}(g,t)\) is a \(|\mathcal{G}^{comp}(g,t)| \times (R+1)\) matrix, and \(\mathbf{\Delta F}^{pre(g)}\) is a \((g-2) \times R\) matrix.
Proposition: Relevance
The rank condition for identification is equivalent to the following: \[ \textrm{Rank}\Big(\mathbf{\Lambda}^{comp}(g,t)\Big) = R + 1 \quad \textrm{and} \quad \textrm{Rank}\Big(\mathbf{\Delta F}^{pre(g)}\Big) = R \]