University of Georgia
FGSES, Université Mohammed VI Polytechnique
October 3, 2024
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\T}{\mathrm{T}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)
Setting of the paper: Researcher interested in learning about the causal effect of a binary treatment and has access to a few periods of panel data
In the current paper, we will think about:
Cases where the parallel trends assumption could be violated
Applications where there is staggered treatment adoption
How to exploit staggered treatment adoption to allow for violations of parallel trends while still recovering the same target causal effect parameters
Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on \(\underbrace{\textrm{earnings}}_{\textrm{outcome}}\)
Parallel trends assumption: \(\E[\Delta Y_t(0) | D=1] = \E[\Delta Y_t(0) | D=0]\)
Parallel trends is equivalent to this model of untreated potential outcomes:
\[Y_{it}(0) = \theta_t + \xi_i + e_{it}\]
where \(\xi_i\) is unobserved heterogeneity and \(\E[e_t | D] = 0\)
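To see why this model delivers parallel trends, take first differences (a short sketch, writing \(\Delta \theta_t = \theta_t - \theta_{t-1}\) and assuming \(\E[e_t | D] = 0\) holds in every period):
\[\begin{align*} \Delta Y_{it}(0) &= \Delta \theta_t + \Delta e_{it} \\ \implies \E[\Delta Y_t(0) | D=d] &= \Delta \theta_t + \underbrace{\E[\Delta e_t | D=d]}_{=0} = \Delta \theta_t \quad \textrm{for } d \in \{0,1\} \end{align*}\]
The unit fixed effect \(\xi_i\) differences out, so the trend in untreated potential outcomes is the same for both groups under this model.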
You can view this as (essentially) embodying two assumptions:
In this case, we have that
\[Y_{it}(0) = h_t(\xi_i) + e_{it}\]
but this is too generic to be useful…
Alternatively, can we assess the plausibility of additive separability?
Theoretically, in (probably most) applications, we simply do not know if additive separability is reasonable or not
Therefore, most DiD applications in economics include an event study plot that checks parallel trends in pre-treatment periods.
This is implicitly a test of the additive separability in the previous model for untreated potential outcomes.
Y: gov’t transfers, D: hurricane
Y: employment, D: min. wage
Consider alternative model for untreated potential outcomes ⬅️ our paper
Using (arguably) the most natural extension to DiD: interactive fixed effects (IFE)
(I think) IFE is closely connected to the ways that “bounding approaches” allow for violations of parallel trends…
Introduction to Interactive Fixed Effects Models
Identification - Baseline Case
More Periods and More Groups
Application
An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]
\(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)
\(F_t\) is often referred to as a “factor”
\(e_{it}\) is idiosyncratic in the sense that it is not systematically different across groups
In our context, though, it makes sense to interpret these as
\(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved “ability”)
\(F_t\) the time-varying “return” to unobserved heterogeneity (e.g., return to “ability”)
Special Case: \(F_t = t\) \(\implies\) unit-specific linear trends
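Substituting \(F_t = t\) into the IFE model makes this special case explicit (a worked restatement that uses nothing beyond the model itself):
\[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i t + e_{it} \quad \implies \quad \Delta Y_{it}(0) = \Delta \theta_t + \lambda_i + \Delta e_{it} \end{align*}\]
Each unit follows its own linear trend with slope \(\lambda_i\), on top of the common time effects.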
View 1: Model for untreated potential outcomes
The IFE model correctly specifies \(\E[Y_t(0)|\xi,D]\), i.e.,
\[Y_{it}(0) = \underbrace{\theta_t + \eta_i + \lambda_i F_t}_{h_t(\xi_i)} + e_{it}\]
where \(\xi_i = (\eta_i, \lambda_i)\)
View 2: Difference-in-Differences
Including “covariates” in the parallel trends assumption is very common
One very common way to operationalize this is regression adjustment
\[Y_{it}(0) = \theta_t + \eta_i + X_i \beta_t + e_{it}\]
Interactive fixed effects models allow for violations of parallel trends:
\[ \begin{aligned} Y_{it}(0) &= \theta_t + \eta_i + \lambda_i F_t + e_{it} \\ \end{aligned} \]
\[ \implies \E[\Delta Y_{t}(0) | D = d] = \Delta \theta_t + \E[\lambda|D=d]\Delta F_t \]
which can vary across groups.
Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to increase outcomes more over time than less skilled groups
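The violation of parallel trends can be checked in a small simulation. The sketch below is illustrative only: the function name `mean_trend_by_group`, the parameter values, and the normal distributions are all assumptions made for this example, not part of the paper. It draws untreated potential outcomes from the IFE model for two periods and compares the gap in mean trends with the model-implied gap \((\E[\lambda|D=1] - \E[\lambda|D=0])\Delta F_t\).

```python
import numpy as np

def mean_trend_by_group(lam_mean_treated, lam_mean_untreated, F, theta,
                        n=200_000, seed=0):
    """Simulate Y_it(0) = theta_t + eta_i + lambda_i * F_t + e_it for two
    periods and return the mean change in Y(0) for each group."""
    rng = np.random.default_rng(seed)
    trends = {}
    groups = [("treated", lam_mean_treated), ("untreated", lam_mean_untreated)]
    for name, lam_mean in groups:
        eta = rng.normal(0.0, 1.0, n)        # unit fixed effect
        lam = rng.normal(lam_mean, 1.0, n)   # factor loading ("ability")
        e = rng.normal(0.0, 1.0, (n, 2))     # idiosyncratic shocks
        y0 = theta[None, :] + eta[:, None] + lam[:, None] * F[None, :] + e
        trends[name] = float(np.mean(y0[:, 1] - y0[:, 0]))
    return trends

theta = np.array([0.0, 1.0])   # hypothetical time effects
F = np.array([1.0, 1.5])       # hypothetical factor, increasing over time
trends = mean_trend_by_group(1.0, 0.0, F, theta)

gap = trends["treated"] - trends["untreated"]
# Model-implied gap: (E[lambda|D=1] - E[lambda|D=0]) * (F_2 - F_1)
implied_gap = (1.0 - 0.0) * (F[1] - F[0])
print(trends, gap, implied_gap)
```

With equal mean loadings across groups, or with a constant factor, the implied gap is zero and parallel trends is restored.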
Many of the insights of recent work on DiD have been in the context of staggered treatment adoption
\(\implies\) there is variation in treatment timing across units
de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021), Marcus and Sant’Anna (2021), among others
These papers all treat staggered treatment adoption as a nuisance, and
In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters
Observed data: \(\{Y_{i1}, Y_{i2}, \ldots Y_{i\T}, D_{i1}, D_{i2}, \ldots, D_{i\T}\}_{i=1}^n\)
\(\T\) time periods
No one treated in the first time period (i.e., \(D_{i1} = 0\))
Staggered treatment adoption: for \(t=2,\ldots,\T\), \(D_{it-1} = 1 \implies D_{it}=1\).
A unit’s group \(G_i\) is the time period when it becomes treated.
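Potential outcomes: \(Y_{it}(g)\) is the outcome unit \(i\) would experience in period \(t\) if it became treated in period \(g\); \(Y_{it}(0)\) is the untreated potential outcome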
Observed outcomes: \(Y_{it} = Y_{it}(G_i)\)
No anticipation: For \(t < G_i\), \(Y_{it} = Y_{it}(0)\)
Setup is exactly the same as DiD with staggered treatment adoption!
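As a small illustration of this setup, the group \(G_i\) can be read off the treatment indicators. The helper below is a sketch with a hypothetical name (`treatment_groups`); it assumes the treatment indicators are stored as an \(n \times \T\) array of 0s and 1s, codes never-treated units as infinity, and checks the two restrictions above (no one treated in period 1, and no unit leaving treatment).

```python
import numpy as np

def treatment_groups(D):
    """Return G_i (1-indexed first treated period, np.inf if never treated)
    from an n x T binary treatment matrix D."""
    D = np.asarray(D)
    if np.any(D[:, 0] != 0):
        raise ValueError("some units are treated in the first period")
    if np.any(np.diff(D, axis=1) < 0):
        raise ValueError("treatment is not staggered: a unit leaves treatment")
    ever_treated = D.any(axis=1)
    first = D.argmax(axis=1) + 1.0   # argmax finds the first 1 in each row
    return np.where(ever_treated, first, np.inf)

D = np.array([
    [0, 0, 1, 1],   # treated from period 3 on
    [0, 0, 0, 1],   # treated from period 4 on
    [0, 0, 0, 0],   # never treated
])
G = treatment_groups(D)
print(G)
```

Under no anticipation, the observed outcomes in the zero entries of each row of `D` are untreated potential outcomes.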
Following CS-2021, we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]
\(ATT(g,t)\) is the average treatment effect for group \(g\) in time period \(t\)
Group-time average treatment effects are the natural building block for other common target parameters in DiD applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
Target: \(ATT(3,3) = \E[\Delta Y_3 | G=3] - \underbrace{\color{red}{\E[\Delta Y_3(0) | G=3]}}_{\textrm{have to figure out}}\)
Using quasi-differencing argument, can show that
\[ \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + v_{i3} \]
where \(\theta_3^*\) and \(F_3^*\) are functions of the original parameters \(\theta_t\) and \(F_t\), and \(v_{i3}\) is a function of \(e_{it}\).
Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,
\[\begin{align*} \color{red}{\E[\Delta Y_3(0) | G=3]} = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_2(0) | G = 3]}_{\textrm{identified}} + \underbrace{\E[v_3|G=3]}_{=0} \end{align*}\]
\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).
\[\begin{align*} \Delta Y_{i3}(0) = \theta_3^* + F_3^* \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
Some issues:
Expression involves untreated potential outcomes through period 3
\(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) by construction
There are a number of different ideas here:
Make additional assumptions ruling out serial correlation in \(e_{it}\) \(\implies\) can use lags of outcomes as instruments (Imbens, Kallus, and Mao 2021):
In particular, notice that we have two distinct untreated groups in period 3 (group 4 and group \(\infty\)), which gives two moment conditions:
\[\begin{align*} \E[\Delta Y_3(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=4] \\ \E[\Delta Y_3(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_2(0) | G=\infty] \\ \end{align*}\]
We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* &= \frac{\E[\Delta Y_3|G=\infty] - \E[\Delta Y_3|G=4]}{\E[\Delta Y_2|G=\infty] - \E[\Delta Y_2|G=4]} \\ \theta_3^* &= \E[\Delta Y_3 | G=4] - F_3^* \E[\Delta Y_2 | G=4] \end{align*}\]
\(\implies\) we can recover \(ATT(3,3)\).
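The argument can be traced numerically. The simulation below is a sketch under assumed parameter values (time effects, factors, group-specific mean loadings, and a constant treatment effect of 1 are all hypothetical choices for illustration). It solves the two moment conditions using groups 4 and \(\infty\) and then imputes the untreated trend for group 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                                  # units per group
theta = np.array([0.0, 1.0, 1.5, 2.5])       # time effects, periods 1..4
F = np.array([1.0, 2.0, 2.5, 3.0])           # factor, periods 1..4
lam_means = {3: 2.0, 4: 1.0, np.inf: 0.0}    # E[lambda | G] differs by group
true_att = 1.0                               # constant effect once treated

dY2, dY3 = {}, {}
for g, lam_mean in lam_means.items():
    eta = rng.normal(lam_mean, 1.0, n)
    lam = rng.normal(lam_mean, 1.0, n)
    e = rng.normal(0.0, 1.0, (n, 4))
    y = theta + eta[:, None] + lam[:, None] * F + e   # Y_it(0)
    treated = np.arange(1, 5) >= g                    # periods t >= g
    y = y + true_att * treated                        # observed outcomes
    dY2[g] = np.mean(y[:, 1] - y[:, 0])
    dY3[g] = np.mean(y[:, 2] - y[:, 1])

# Solve the two moment conditions with the not-yet-treated groups 4 and infinity
F3_star = (dY3[np.inf] - dY3[4]) / (dY2[np.inf] - dY2[4])
theta3_star = dY3[4] - F3_star * dY2[4]

# Impute the untreated trend for group 3 and subtract it from the observed trend
att_33 = dY3[3] - (theta3_star + F3_star * dY2[3])
print(F3_star, theta3_star, att_33)
```

In this design the population values are \(F_3^* = \Delta F_3 / \Delta F_2 = 0.5\) and \(\theta_3^* = 0\), so the printed estimates should be close to 0.5, 0, and the true effect of 1.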
This strategy amounts to using “group” as an instrument for \(\Delta Y_{i2}(0)\).
Condition 1: Relevance \(\quad \E[\Delta Y_2(0) | G=4] \neq \E[\Delta Y_2(0) | G=\infty]\)
For relevance to hold, the following two “more primitive” conditions both need to hold
\(\E[\lambda | G=4] \neq \E[\lambda | G = \infty]\)
\(F_2 \neq F_1\)
Otherwise, \(G = 4\) and \(G = \infty\) have the same trend between the first two periods.
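The link between relevance and the two primitive conditions follows directly from the IFE model, since both groups are untreated in periods 1 and 2 (a short derivation using only the model and \(\E[\Delta e_2 | G] = 0\)):
\[\begin{align*} \E[\Delta Y_2(0) | G=4] - \E[\Delta Y_2(0) | G=\infty] = \Big(\E[\lambda | G=4] - \E[\lambda | G=\infty]\Big)(F_2 - F_1) \end{align*}\]
The left-hand side is nonzero only if both factors on the right-hand side are nonzero.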
Condition 2: Exogeneity
Can’t directly test exogeneity, but a lot of the DiD infrastructure carries over
For DiD, can “pre-test” parallel trends if there is more than 1 pre-treatment period
For our approach, we need 2 pre-treatment periods to identify \(ATT(g,t)\), but with more pre-treatment periods we can pre-test
The discussion so far has been about the case of 1 IFE. However, an important issue for IFE approaches is determining how many IFE terms there are (e.g., 0, 1, 2, …)
Example: Suppose that we know that the true number of interactive fixed effects is either 0 or 1. How can we decide?
Notice that parallel trends holds if either of the following two conditions holds:
\(\E[\lambda | G=3] = \E[\lambda | G=4] = \E[\lambda | G = \infty]\) \(\implies\) IFEs “absorbed” into time fixed effects
\(F_1 = F_2 = F_3\) \(\implies\) IFEs “absorbed” into unit fixed effects
Idea: Check the relevance condition (i.e., check if \(\E[\Delta Y_2 | G=4] \neq \E[\Delta Y_2 | G=\infty]\))
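One simple way to operationalize this check is a two-sample comparison of pre-treatment trends. The sketch below uses a Welch t-statistic; this is an illustrative diagnostic chosen for this example (the function name `relevance_t_stat` and the simulated inputs are hypothetical), not necessarily the procedure used in the paper.

```python
import numpy as np

def relevance_t_stat(dY2_a, dY2_b):
    """Welch t-statistic for the difference in mean pre-treatment trends
    (Delta Y_2) between two comparison groups."""
    a = np.asarray(dY2_a, dtype=float)
    b = np.asarray(dY2_b, dtype=float)
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return diff / se

rng = np.random.default_rng(0)
# Hypothetical pre-treatment trends for groups 4 and infinity
dY2_g4 = rng.normal(1.0, 1.0, 5_000)
dY2_ginf = rng.normal(0.0, 1.0, 5_000)
t_relevant = relevance_t_stat(dY2_g4, dY2_ginf)

# Same mean trend in both groups: relevance fails
dY2_same = rng.normal(1.0, 1.0, 5_000)
t_irrelevant = relevance_t_stat(dY2_g4, dY2_same)
print(t_relevant, t_irrelevant)
```

A large statistic points toward relevance holding; a small one is consistent either with 0 IFEs or with one of the failure cases.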
Let us walk through both cases where relevance fails, but there really is 1 IFE.
Case 1: Parallel trends holds between \(G=4\) and \(G=\infty\) (across all periods), but does not hold with \(G=3\). [Figure]
Intuition: \(G=4\) and \(G=\infty\) are effectively the “same comparison group”
Implications:
Case 2: Parallel trends holds between periods 1 and 2 (for all groups), but does not hold from period 2 to 3
In math: \(\color{green}{F_3 \neq} \color{red}{F_2 = F_1}\)
Case 2b: \(\E[\lambda|G=3] \neq \E[\lambda | G=4] = \E[\lambda | G = \infty]\) [Figure]
General Case with More Periods and Groups
Estimation
Identification is constructive and suggests a two-step estimation procedure where we estimate the parameters of the IFE model in the first step (e.g., \(\theta_3^*\) and \(F_3^*\)) and then plug these into a second step estimator for \(ATT(g,t)\).
With more periods, groups, and/or interactive fixed effects, parameters of IFE model can be over-identified \(\implies\) GMM, but otherwise similar
Relative to other approaches to dealing with IFEs:
We do not need a large number of periods or extra auxiliary assumptions
Only need there to be staggered treatment adoption
Generality: we have talked about IFE models, but
Exploit minimum wage changes across states
Any state that increases its minimum wage above the federal minimum wage will be considered as treated
Allow for one year of “anticipation” (this only affects estimates in post-treatment periods)
The next set of results include one interactive fixed effect
Additional Comments:
Comments very welcome: brantly.callaway@uga.edu
Code: staggered_ife2 function in the ife package in R, available at github.com/bcallaway11/ife
Particular Case: \(\T=4\) and 3 groups: 3, 4, \(\infty\)
\[Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it}\]
In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\]
The last equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)
From last slide, combining terms we have that
\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
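The quasi-differencing identity holds unit by unit, which can be verified numerically. The check below is a sketch with arbitrary, hypothetical parameter values; it builds \(\theta_3^*\), \(F_3^*\), and \(v_{i3}\) from their definitions and confirms that the right-hand side reproduces \(\Delta Y_{i3}(0)\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
theta = np.array([0.0, 1.0, 1.5])   # theta_1..theta_3
F = np.array([1.0, 2.0, 2.5])       # F_1..F_3, with F_2 != F_1
eta, lam = rng.normal(size=n), rng.normal(size=n)
e = rng.normal(size=(n, 3))
y0 = theta + eta[:, None] + lam[:, None] * F + e   # Y_it(0)

# First differences: index 0 is the period-2 difference, index 1 is period 3
dtheta, dF, de = np.diff(theta), np.diff(F), np.diff(e, axis=1)
dY2, dY3 = y0[:, 1] - y0[:, 0], y0[:, 2] - y0[:, 1]

F3_star = dF[1] / dF[0]
theta3_star = dtheta[1] - F3_star * dtheta[0]
v3 = de[:, 1] - F3_star * de[:, 0]

rhs = theta3_star + F3_star * dY2 + v3
print(np.allclose(dY3, rhs))
```

Because \(v_{i3}\) contains \(\Delta e_{i2}\), which also enters \(\Delta Y_{i2}(0)\), the regressor is correlated with the error, which is why an instrument is needed.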
Interactive fixed effects for untreated potential outcomes:
\[ Y_{it}(0) = \theta_t + \eta_i + \lambda_i' F_t + e_{it} \] where \(\lambda_i\) and \(F_t\) are \(R\) dimensional vectors.
Assume: Unconfoundedness conditional on unobserved heterogeneity (i.e., this implies “groups” can be used as instruments):
\[ \E[Y_{t}(0) |\eta, \lambda, G] = \E[Y_{t}(0) |\eta, \lambda] \quad \text{a.s.} \]
An implication of both conditions above is that
\[ \E[e_t |\eta, \lambda, G] = 0 \]
which we use below as a source of moment conditions to identify parameters from the interactive fixed effects model.
Similar to earlier case:
\[ \begin{aligned} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \underbrace{\E[Y_t(0) - Y_{g-1}(0) | G=g]}_{\textrm{need to figure out}} \\ \end{aligned} \]
Using similar differencing arguments as before, one can show:
\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]
where
Using similar differencing arguments as before, one can show:
\[Y_{it}(0) - Y_{ig-1}(0) = \theta^*(g,t) + \widetilde{\Delta Y}_i^{pre(g)}(0)'F^*(g,t) + v_i(g,t)\]
so that
\(R+1\) parameters to identify
\(\widetilde{\Delta Y}_i^{pre(g)}(0)\) is endogenous by construction
Can use “groups” as instruments
Identification is local to groups/periods that meet the following criteria:
For \(g' \in \mathcal{G}^{comp}(g,t)\), we use moment conditions of the form
\[ 0 = \E\Big[\indicator{G=g'} v(g,t)\Big]\]
Stacking the above moment conditions, we have that
\[ \mathbf{0}_{|\mathcal{G}^{comp}(g,t)|} = \E\left[ \ell^{comp}(g,t) \left\{ \Big( Y_{t} - Y_{g-1}\Big) - \Big(\theta^*(g,t) + {\widetilde{\Delta Y}}^{{pre(g)}^{'}} F^*(g,t) \Big) \right\} \right] \]
where \(\ell^{comp}(g,t)\) is a vector of indicators for groups that have not yet been treated by period \(t\).
Since we are using groups as IVs, identification hinges on relevance:
\[ \textrm{Rank}\Big(\mathbf{\Gamma}(g,t)\Big) = R + 1 \]
where
\[ \mathbf{\Gamma}(g,t) := \E\left[ \ell^{comp}(g,t) \begin{pmatrix} 1 \\ \widetilde{\Delta Y}^{pre(g)} \end{pmatrix}' \right] \]
As before, you can relate the relevance condition to conditions on \(\lambda_i\) and \(F_t\).
Need “enough variation” in \(\E[\lambda|G=g']\) among groups in \(\mathcal{G}^{comp}(g,t)\).
Need “enough variation” in \(F_t\) across pre-treatment time periods.
Theorem: Identification
For some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\) and under given assumptions,
\[ \begin{pmatrix} \theta^*(g,t) \\ F^*(g,t) \end{pmatrix} = \Big( \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \mathbf{\Gamma}(g,t) \Big)^{-1} \mathbf{\Gamma}(g,t)' \mathbf{W}(g,t) \E[\ell^{comp}(g,t)(Y_{t} - Y_{g-1})] \]
In addition, \(ATT(g,t)\) is identified, and it is given by:
\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \Big( \theta^*(g,t) + F^*(g,t)'\E[\widetilde{\Delta Y}^{pre(g)} | G=g] \Big) \]
Estimation proceeds in two steps and is constructive given identification results. The first step is to estimate \(\theta^*(g,t)\) and \(F^*(g,t)\):
Given a positive definite matrix \(\widehat{\mathbf{W}}(g,t)\), the estimator of \(\delta^*(g,t) := \big(\theta^*(g,t), F^*(g,t)'\big)'\) is:
\[ \widehat{\delta}^*(g,t) = \left( \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t)\widehat{\mathbf{\Gamma}}(g,t) \right)^{-1} \widehat{\mathbf{\Gamma}}(g,t)' \widehat{\mathbf{W}}(g,t) \E_n\big[\ell^{comp}_i(g,t)(Y_{it} - Y_{ig-1})\big] \]
Second step, plug into sample analog of expression for \(ATT(g,t)\):
\[ \widehat{ATT}(g,t) = \hat{p}_g^{-1} \left\{ \E_n\Big[\indicator{G_i=g}(Y_{it} - Y_{ig-1})\Big] - \E_n\Big[A_i(g)\Big]^\prime \widehat{\delta}^*(g,t) \right\} \]
where
\[ A_i(g) := \indicator{G_i=g}\begin{pmatrix}1 \\ \widetilde{\Delta Y}_i^{pre(g)} \end{pmatrix} \]
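The two steps translate fairly directly into matrix code. The function below is a minimal sketch, not the `ife` package's implementation: the name `att_gt`, its arguments, the identity default for the weighting matrix, and the simulated data are all assumptions made for illustration. It takes the pre-treatment differences used as regressors as given.

```python
import numpy as np

def att_gt(y_post_diff, dY_pre, G, g, comp_groups, W=None):
    """Two-step estimate of ATT(g,t).
    y_post_diff: Y_it - Y_i,g-1 (length n)
    dY_pre:      pre-treatment differences used as regressors (n or n x R)
    G:           group of each unit (np.inf for never treated)
    comp_groups: groups not yet treated by period t
    """
    n = len(G)
    X = np.column_stack([np.ones(n), dY_pre])              # (1, dY_pre_i')
    ell = np.column_stack([G == gp for gp in comp_groups]).astype(float)
    Gamma = ell.T @ X / n                                  # E_n[ell_i (1, dY_pre_i')]
    m = ell.T @ y_post_diff / n                            # E_n[ell_i (Y_it - Y_i,g-1)]
    W = np.eye(len(comp_groups)) if W is None else W
    # Step 1: GMM estimate of delta* = (theta*, F*')'
    delta = np.linalg.solve(Gamma.T @ W @ Gamma, Gamma.T @ W @ m)
    # Step 2: plug into the sample analog of the ATT(g,t) expression
    in_g = (G == g).astype(float)
    A_bar = (in_g[:, None] * X).mean(axis=0)               # E_n[A_i(g)]
    att = (np.mean(in_g * y_post_diff) - A_bar @ delta) / in_g.mean()
    return att, delta

# Simulated example: T = 4, groups 3, 4, infinity, one IFE, true effect of 1
rng = np.random.default_rng(0)
n = 100_000
theta = np.array([0.0, 1.0, 1.5, 2.5])
F = np.array([1.0, 2.0, 2.5, 3.0])
G = np.repeat([3.0, 4.0, np.inf], n)
lam_mean = np.where(G == 3, 2.0, np.where(G == 4, 1.0, 0.0))
lam = rng.normal(lam_mean, 1.0)
eta = rng.normal(size=3 * n)
y = theta + eta[:, None] + lam[:, None] * F + rng.normal(size=(3 * n, 4))
y += 1.0 * (np.arange(1, 5) >= G[:, None])

att, delta = att_gt(y[:, 2] - y[:, 1], y[:, 1] - y[:, 0], G,
                    g=3.0, comp_groups=[4.0, np.inf])
print(att, delta)
```

With two comparison groups and one IFE the system is just-identified, so the weighting matrix does not matter and the estimate coincides with solving the two moment conditions directly.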
If you want an event study or overall average treatment effect, can combine estimates across groups and time periods, following the same logic as in CS-2021.
Theorem: Asymptotic Normality
Suppose assumptions hold, then for some group \(g \in \mathcal{G}^\dagger\), and for some time period \(t \in \{g, \ldots, t^{max}(g)\}\) where \(t^{max}(g)\) is the largest value of \(t\) such that \(|\mathcal{G}^{comp}(g,t)| \geq R+1\),
\(\widehat{ATT}(g,t)\) is asymptotically linear, and it satisfies the relation: \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) = \frac{1}{\sqrt{n}}\sum_{i=1}^n \psi_{igt} + o_p(1) \]
\(\widehat{ATT}(g,t) \rightarrow_p ATT(g,t)\) as \(n \rightarrow \infty\) for each pair \((g,t)\).
In addition, \[ \sqrt{n}(\widehat{ATT}(g,t) - ATT(g,t)) \xrightarrow{d} \mathcal{N}(0,\sigma_{gt}^2) \] where \(\sigma_{gt}^2 = \E[\psi_{igt}^2]\).
\[ \begin{aligned} \textrm{Define: } \qquad \mathbf{\Lambda}^{comp}(g,t) := \E\Big[ \ell^{comp}(g,t) \begin{pmatrix} 1 & \lambda' \end{pmatrix} \Big] \quad \textrm{and} \quad \mathbf{\Delta F}^{pre(g)} := \begin{bmatrix} \Delta F_2' \\ \vdots \\ \Delta F_{g-1}' \end{bmatrix} \end{aligned} \]
where \(\mathbf{\Lambda}^{comp}(g,t)\) is a \(|\mathcal{G}^{comp}(g,t)| \times (R+1)\) matrix, and \(\mathbf{\Delta F}^{pre(g)}\) is a \((g-2) \times R\) matrix.
Proposition: Relevance
The rank condition for identification is equivalent to the following: \[ \textrm{Rank}\Big(\mathbf{\Lambda}^{comp}(g,t)\Big) = R + 1 \quad \textrm{and} \quad \textrm{Rank}\Big(\mathbf{\Delta F}^{pre(g)}\Big) = R \]
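The two rank conditions are easy to evaluate for given population quantities. The helper below is an illustrative sketch: the name `relevance_holds` and the example numbers are hypothetical. It builds \(\mathbf{\Lambda}^{comp}(g,t)\) from group shares and group mean loadings, using \(\E[\indicator{G=g'}(1, \lambda')] = \P(G=g')\big(1, \E[\lambda|G=g']'\big)\), and checks both ranks.

```python
import numpy as np

def relevance_holds(group_shares, lam_means, dF_pre, R=1):
    """Check Rank(Lambda_comp) == R + 1 and Rank(Delta F_pre) == R.
    group_shares: P(G = g') for each comparison group
    lam_means:    E[lambda | G = g'] for each comparison group
    dF_pre:       stacked pre-treatment factor differences
    """
    shares = np.asarray(group_shares, dtype=float)
    lam = np.asarray(lam_means, dtype=float).reshape(len(shares), R)
    Lambda = shares[:, None] * np.column_stack([np.ones(len(shares)), lam])
    dF = np.asarray(dF_pre, dtype=float).reshape(-1, R)
    return bool(np.linalg.matrix_rank(Lambda) == R + 1
                and np.linalg.matrix_rank(dF) == R)

# Baseline case: comparison groups 4 and infinity, one pre-period difference
print(relevance_holds([0.3, 0.4], [1.0, 0.0], [1.0]))   # loadings differ, F_2 != F_1
print(relevance_holds([0.3, 0.4], [1.0, 1.0], [1.0]))   # same mean loading
print(relevance_holds([0.3, 0.4], [1.0, 0.0], [0.0]))   # F_2 == F_1
```

In the baseline case this reproduces the two primitive conditions from earlier: the mean loadings must differ across comparison groups, and the factor must change between the first two periods.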