November 19, 2023
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)
Setting of the paper: A researcher is interested in learning about the causal effect of a binary treatment and has access to a few periods of panel data
In the current paper, we will think about:
Cases where the parallel trends assumption could be violated
Applications where there is staggered treatment adoption
How to exploit staggered treatment adoption to allow for violations of parallel trends while still recovering the same target causal effect parameters
Parallel trends assumption: \(\E[\Delta Y_t(0) | D=1] = \E[\Delta Y_t(0) | D=0]\)
DID is different from other quasi-experimental (e.g., random assignment, IV, RD) approaches to policy evaluation in that it inherently relies on functional form assumptions \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + e_{it} \end{align*}\] where the distribution of \(\eta_i\) can differ arbitrarily across groups, but \(\E[e_{it}|\eta_i, D] = \E[e_{it} | \eta_i] = 0\)
That the model for \(Y_{it}(0)\) depends on time and unit-specific unobserved heterogeneity is in line with a long history of economic models, but the additive separability between time effects and unobserved heterogeneity is often harder to justify
Therefore, most DID applications in economics include an event study plot that checks parallel trends in pre-treatment periods. This is implicitly a test of the additive separability in the previous model for untreated potential outcomes.
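To see why the additively separable model delivers parallel trends, it may help to take first differences directly. This is a short sketch that uses only the model and the mean-independence condition stated above, together with the law of iterated expectations: \[\begin{align*} \E[\Delta Y_{it}(0) | D_i = d] = \Delta \theta_t + \E[\Delta e_{it} | D_i = d] = \Delta \theta_t \end{align*}\] The unit effect \(\eta_i\) differences out and the remaining term does not depend on \(d\), so both groups share the trend \(\Delta \theta_t\).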
Allow for certain violations of parallel trends (\(\implies\) bounds on causal effect parameters) often connected to the magnitude of the violations of parallel trends in pre-treatment periods (Manski and Pepper 2018; Rambachan and Roth 2023; Ban and Kedagni 2022)
Consider alternative model for untreated potential outcomes
That’s what we will do in this paper!
Using (arguably) the most naturally connected approach to DID, interactive fixed effects (IFE)
(I think) IFE is closely connected to the ways that “bounding approaches” allow for violations of parallel trends…
Running Example: Causal effect of \(\underbrace{\textrm{job displacement}}_{\textrm{treatment}}\) on a \(\underbrace{\textrm{person's earnings}}_{\textrm{outcome}}\)
An intermediate case is an interactive fixed effects model for untreated potential outcomes: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + e_{it} \end{align*}\]
\(\lambda_i\) is often referred to as “factor loading” (notation above implies that this is a scalar, but you can allow for higher dimension)
\(F_t\) is often referred to as a “factor”
\(e_{it}\) is idiosyncratic in the sense that \(\E[e_{it} | \eta_i, \lambda_i, D_i] = 0\)
In our context, though, it makes sense to interpret these as
\(\lambda_i\) unobserved heterogeneity (e.g., individual’s unobserved “ability”)
\(F_t\) the time-varying “return” to unobserved heterogeneity (e.g., return to “ability”)
Interactive fixed effects models allow for violations of parallel trends:
\[\begin{align*} \E[\Delta Y_{it}(0) | D_i = d] = \Delta \theta_t + \E[\lambda_i|D_i=d]\Delta F_t \end{align*}\] which can vary across groups.
Example: If \(\lambda_i\) is “ability” and \(F_t\) is increasing over time, then (even in the absence of the treatment) groups with higher mean “ability” will tend to increase outcomes more over time than less skilled groups
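The "ability" example can be checked numerically. The sketch below is illustrative only: the function name, the parameter values, and the normal distributions are assumptions made for this example and do not come from the paper. It simulates untreated potential outcomes from the IFE model for two groups that differ in mean \(\lambda_i\) and compares their average trends.

```python
import numpy as np

def mean_untreated_trends(mean_lambda, theta, F, n=200_000, seed=0):
    """Average first differences of Y_it(0) = theta_t + eta_i + lambda_i * F_t + e_it
    for one group whose factor loadings have mean `mean_lambda` (hypothetical DGP)."""
    rng = np.random.default_rng(seed)
    lam = rng.normal(mean_lambda, 0.5, size=n)   # unobserved "ability"
    eta = rng.normal(size=n)                     # unit fixed effect (differences out)
    e = rng.normal(size=(n, len(F)))             # idiosyncratic shocks
    Y0 = theta + eta[:, None] + lam[:, None] * F + e
    return np.diff(Y0, axis=1).mean(axis=0)

theta = np.array([0.0, 0.2, 0.4, 0.6])   # time effects
F = np.array([1.0, 1.5, 2.5, 4.0])       # increasing "return to ability"

high = mean_untreated_trends(1.0, theta, F, seed=1)  # higher mean ability
low = mean_untreated_trends(0.0, theta, F, seed=2)   # lower mean ability

# The gap in trends should be close to (1.0 - 0.0) * diff(F), up to sampling noise.
gap = high - low
print(gap, np.diff(F))
```

Because \(F_t\) is increasing, the gap is positive in every period, which is exactly the violation of parallel trends described above.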
Special Cases:
Observed \(\lambda_i \implies\) regression adjustment
\(F_t = t \implies\) unit-specific linear trend
But allowing \(F_t\) to vary arbitrarily is harder…
Many of the insights of recent work on DID have been in the context of staggered treatment adoption
\(\implies\) there is variation in treatment timing across units
de Chaisemartin and D’Haultfœuille (2020), Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021), among others
These papers all treat staggered treatment adoption as a nuisance
In the current paper, we will exploit staggered treatment adoption in order to identify causal effect parameters
Observed data: \(\{Y_{i1}, Y_{i2}, \ldots Y_{i\mathcal{T}}, D_{i1}, D_{i2}, \ldots, D_{i\mathcal{T}}\}_{i=1}^n\)
\(\mathcal{T}\) time periods
No one treated in the first time period (i.e., \(D_{i1} = 0\))
Staggered treatment adoption: for \(t=2,\ldots,\mathcal{T}\), \(D_{it-1} = 1 \implies D_{it}=1\).
A unit’s group \(G_i\) is the time period when it becomes treated. By convention, set \(G_i = \infty\) for units that do not participate in the treatment in any period.
Potential outcomes: \(Y_{it}(g)\); \(Y_{it}(\infty)\) is the untreated potential outcome (written \(Y_{it}(0)\) elsewhere in these slides)
Observed outcomes: \(Y_{it} = Y_{it}(G_i)\)
No anticipation: For \(t < G_i\), \(Y_{it} = Y_{it}(\infty)\)
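As a concrete illustration of this notation, the group \(G_i\) can be read off the treatment indicators. The helper below is a hypothetical sketch (the function name and the array layout, one row per unit and one column per period, are assumptions), and it also checks the two setup conditions above.

```python
import numpy as np

def treatment_groups(D):
    """Return G_i, the first treated period (1-indexed), or inf if never treated.

    D is an (n, T) array of 0/1 treatment indicators D_it.
    """
    D = np.asarray(D, dtype=int)
    if np.any(D[:, 0] != 0):
        raise ValueError("no unit may be treated in the first period")
    if np.any(np.diff(D, axis=1) < 0):
        raise ValueError("staggered adoption violated: a unit leaves treatment")
    ever_treated = D.any(axis=1)
    first_period = D.argmax(axis=1) + 1          # first column equal to 1
    return np.where(ever_treated, first_period.astype(float), np.inf)

D = [[0, 0, 1, 1],   # treated from period 3 on
     [0, 0, 0, 1],   # treated from period 4 on
     [0, 0, 0, 0]]   # never treated
print(treatment_groups(D))
```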
Following Callaway and Sant’Anna (2021), we target group-time average treatment effects: \[\begin{align*} ATT(g,t) = \E[Y_{it}(g) - Y_{it}(\infty) | G=g] \end{align*}\]
Group-time average treatment effects are the natural building block for other common target parameters in DID applications such as event studies or an overall \(ATT\) (see Callaway and Sant’Anna (2021) for more details)
Back to DID for a moment: Under parallel trends, \[\begin{align*} ATT(g,t) = \E[Y_{it} - Y_{ig-1} | G=g] - \E[Y_{it} - Y_{ig-1} | G=g'] \end{align*}\] for any \(g' > t\) (i.e., any group that is not-yet-treated by period \(t\))
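The DID expression above translates directly into sample means. The function below is a minimal sketch, not the estimator from any particular package; its name and arguments are assumptions for illustration, with outcomes stored as an (n, T) array whose columns are periods 1 through \(\mathcal{T}\).

```python
import numpy as np

def did_att(Y, G, g, t, g_comp):
    """Sample analogue of ATT(g,t) under parallel trends.

    Compares the change in outcomes from period g-1 to period t for group g
    with the same change for a comparison group g_comp that is not yet
    treated in period t (g_comp > t, possibly np.inf).
    """
    if not g_comp > t:
        raise ValueError("comparison group must not yet be treated in period t")
    Y = np.asarray(Y, dtype=float)
    G = np.asarray(G, dtype=float)
    change = Y[:, t - 1] - Y[:, g - 2]   # Y_it - Y_i,g-1 with 1-indexed periods
    return change[G == g].mean() - change[G == g_comp].mean()

# Tiny worked example: one unit in group 3 and one never-treated unit.
Y = [[1.0, 2.0, 6.0, 7.0],
     [0.0, 1.0, 2.0, 3.0]]
G = [3, np.inf]
print(did_att(Y, G, g=3, t=3, g_comp=np.inf))  # (6 - 2) - (2 - 1)
```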
Very Simple Case:
\(\mathcal{T}=4\)
3 groups: 3, 4, \(\infty\)
We will target \(ATT(3,3) = \E[\Delta Y_{i3} | G_i=3] - \underbrace{\E[\Delta Y_{i3}(0) | G_i=3]}_{\textrm{have to figure out}}\) In this case, given the IFE model for untreated potential outcomes, we have: \[\begin{align*} \Delta Y_{i3}(0) &= \Delta \theta_3 + \lambda_i \Delta F_3 + \Delta e_{i3} \\ \Delta Y_{i2}(0) &= \Delta \theta_2 + \lambda_i \Delta F_2 + \Delta e_{i2} \\ \end{align*}\] The last equation implies that \[\begin{align*} \lambda_i = \Delta F_2^{-1}\Big( \Delta Y_{i2}(0) - \Delta \theta_2 - \Delta e_{i2} \Big) \end{align*}\] Plugging this back into the first equation (and combining terms), we have \(\rightarrow\)
From last slide, combining terms we have that
\[\begin{align*} \Delta Y_{i3}(0) = \underbrace{\Big(\Delta \theta_3 - \frac{\Delta F_3}{\Delta F_2} \Delta \theta_2 \Big)}_{=: \theta_3^*} + \underbrace{\frac{\Delta F_3}{\Delta F_2}}_{=: F_3^*} \Delta Y_{i2}(0) + \underbrace{\Delta e_{i3} - \frac{\Delta F_3}{\Delta F_2} \Delta e_{i2}}_{=: v_{i3}} \end{align*}\]
Now (momentarily) suppose that we (somehow) know \(\theta_3^*\) and \(F_3^*\). Then,
\[\begin{align*} \E[\Delta Y_{i3}(0) | G_i=3] = \theta_3^* + F_3^* \underbrace{\E[\Delta Y_{i2}(0) | G_i = 3]}_{\textrm{identified}} + \underbrace{\E[v_{i3}|G_i=3]}_{=0} \end{align*}\]
\(\implies\) this term is identified; hence, we can recover \(ATT(3,3)\).
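Putting the pieces together, and still treating \(\theta_3^*\) and \(F_3^*\) as known, the target parameter can be written in terms of observed outcomes only. The step below uses the fact that group 3 is untreated in periods 1 and 2, so \(\Delta Y_{i2}(0) = \Delta Y_{i2}\) for that group: \[\begin{align*} ATT(3,3) = \E[\Delta Y_{i3} | G_i=3] - \Big( \theta_3^* + F_3^* \, \E[\Delta Y_{i2} | G_i=3] \Big) \end{align*}\]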
How can we recover \(\theta_3^*\) and \(F_3^*\)?
Notice: this involves untreated potential outcomes through period 3, and we have groups 4 and \(\infty\) for which we observe these untreated potential outcomes. This suggests using those groups.
However, this is not so simple because, by construction, \(\Delta Y_{i2}(0)\) is correlated with \(v_{i3}\) (note: \(v_{i3}\) contains \(\Delta e_{i2} \implies\) they will be correlated by construction)
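The correlation can be made explicit. The sketch below uses the fact that \(e_{it}\) is uncorrelated with \(\lambda_i\) (implied by the mean-independence condition in the IFE model). The last line additionally assumes, purely for illustration, that \(e_{it}\) is serially uncorrelated with common variance \(\sigma^2\); this symbol and assumption are not part of the original slides. \[\begin{align*} \cov(\Delta Y_{i2}(0), v_{i3}) &= \cov\Big(\lambda_i \Delta F_2 + \Delta e_{i2}, \ \Delta e_{i3} - F_3^* \Delta e_{i2}\Big) \\ &= \cov(\Delta e_{i2}, \Delta e_{i3}) - F_3^* \var(\Delta e_{i2}) \\ &= -\sigma^2 (1 + 2 F_3^*) \end{align*}\] Under the illustrative assumption, this is nonzero except in the knife-edge case \(F_3^* = -1/2\), so regressing \(\Delta Y_{i3}\) on \(\Delta Y_{i2}\) among untreated units would generally not recover \(F_3^*\).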
We need some exogenous variation (IV) to recover the parameters \(\rightarrow\)
There are a number of different ideas here:
Make additional assumptions ruling out serial correlation in \(e_{it}\) \(\implies\) can use lags of outcomes as instruments (Imbens, Kallus, and Mao 2021)
Alternatively can introduce covariates and make auxiliary assumptions about them (Callaway and Karami 2023; Brown and Butts 2023; Brown, Butts, and Westerlund 2023)
However, it turns out that, with staggered treatment adoption, you can recover \(ATT(3,3)\) essentially for free
In particular, notice that, given that we have two distinct untreated groups in period 3: group 4 and group \(\infty\), then we have two moment conditions:
\[\begin{align*} \E[\Delta Y_{i3}(0) | G=4] &= \theta_3^* + F_3^* \E[\Delta Y_{i2}(0) | G=4] \\ \E[\Delta Y_{i3}(0) | G=\infty] &= \theta_3^* + F_3^* \E[\Delta Y_{i2}(0) | G=\infty] \\ \end{align*}\] We can solve these for \(\theta_3^*\) and \(F_3^*\): \[\begin{align*} F_3^* = \frac{\E[\Delta Y_{i3}|G_i=\infty] - \E[\Delta Y_{i3}|G_i=4]}{\E[\Delta Y_{i2}|G_i=\infty] - \E[\Delta Y_{i2}|G_i=4]} \end{align*}\]
\(\implies\) we can recover \(ATT(3,3)\).
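The identification argument can be sketched in a small simulation. Everything in the data generating process below (function names, parameter values, distributions, a true \(ATT(3,3)\) of 2) is an assumption chosen for illustration, and the code is not the staggered_ife2 implementation from the ife package. It solves the two moment conditions for \(F_3^*\) and \(\theta_3^*\) using groups 4 and \(\infty\), then imputes the untreated trend for group 3.

```python
import numpy as np

def simulate_panel(n=300_000, att=2.0, seed=0):
    """Hypothetical IFE panel with T = 4 and groups 3, 4, and never treated (inf)."""
    rng = np.random.default_rng(seed)
    G = rng.choice([3.0, 4.0, np.inf], size=n)
    # Mean factor loading differs by group, so parallel trends fails.
    mean_lam = np.where(G == 3, 2.0, np.where(G == 4, 1.0, 0.0))
    lam = mean_lam + rng.normal(size=n)
    eta = lam + rng.normal(size=n)              # fixed effect, correlated with lam
    theta = np.array([0.0, 0.5, 1.0, 1.5])      # time effects
    F = np.array([0.0, 1.0, 3.0, 4.0])          # factor
    e = rng.normal(size=(n, 4))
    Y = theta + eta[:, None] + lam[:, None] * F + e
    Y[G == 3, 2:] += att                        # group 3 treated from period 3
    Y[G == 4, 3:] += att                        # group 4 treated from period 4
    return Y, G

def group_means(x, G):
    return {g: x[G == g].mean() for g in (3.0, 4.0, np.inf)}

def att33_staggered_ife(Y, G):
    m2 = group_means(Y[:, 1] - Y[:, 0], G)      # E[Delta Y_i2 | G]
    m3 = group_means(Y[:, 2] - Y[:, 1], G)      # E[Delta Y_i3 | G]
    # Solve the two moment conditions from the untreated groups 4 and inf.
    F_star = (m3[np.inf] - m3[4.0]) / (m2[np.inf] - m2[4.0])
    theta_star = m3[4.0] - F_star * m2[4.0]
    # Impute the untreated trend for group 3 and subtract it.
    return m3[3.0] - (theta_star + F_star * m2[3.0])

def att33_did(Y, G):
    m3 = group_means(Y[:, 2] - Y[:, 1], G)
    return m3[3.0] - m3[np.inf]                 # DID with the never-treated group

Y, G = simulate_panel()
ife_estimate = att33_staggered_ife(Y, G)
did_estimate = att33_did(Y, G)
print(ife_estimate, did_estimate)
```

In this design the first estimate should be close to the true value of 2 up to sampling noise, while the DID estimate is pushed away from it by the difference in mean loadings times \(\Delta F_3\).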
An important issue for IFE approaches is determining how many IFE terms there are (e.g., 0, 1, 2, …)
Interestingly, there is a tight link in our approach between this “model selection” and identification
In cases where parallel trends holds in pre-treatment periods \(\implies\) 0 IFEs, failure of relevance condition if we try to include 1 IFE
In cases where parallel trends is violated in pre-treatment periods (essentially) \(\implies\) (at least) 1 IFE, relevance condition holds for recovering \(ATT(g,t)\).
DID infrastructure related to pre-testing also carries over to our approach
Can scale this argument up for more periods, groups, and IFEs (see paper)
Relative to other approaches to dealing with IFEs:
We do not need a large number of periods or extra auxiliary assumptions
Only need there to be staggered treatment adoption
The main drawback is that we can’t recover as many \(ATT(g,t)\)’s; e.g., in this example, we can’t recover \(ATT(3,4)\) or \(ATT(4,4)\), which might be recoverable in other settings
Generality: we have talked about IFE models, but
Comments very welcome: brantly.callaway@uga.edu
Code: staggered_ife2 function in the ife package in R, available at github.com/bcallaway11/ife
Interactive fixed effects models for untreated potential outcomes generalize some other important cases:
Example 1: Suppose we observe \(\lambda_i\), then this amounts to the regression adjustment version of DID with a time-invariant covariate
Example 2: Suppose you know that \(F_t = t\), then this leads to a unit-specific linear trend model: \[\begin{align*} Y_{it}(0) = \theta_t + \eta_i + \lambda_i t + e_{it} \end{align*}\]
To allow for \(F_t\) to change arbitrarily over time is harder…
Example 3: Interactive fixed effects models also provide a connection to “large-T” approaches such as synthetic control and synthetic DID (Abadie, Diamond, and Hainmueller 2010; Arkhangelsky et al. 2021)
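For Example 2, a short worked step shows why the linear trend case is comparatively easy. With \(F_t = t\), first differences leave the loading as a unit-specific constant, and differencing once more removes it: \[\begin{align*} \Delta Y_{it}(0) &= \Delta \theta_t + \lambda_i + \Delta e_{it} \\ \Delta Y_{it}(0) - \Delta Y_{it-1}(0) &= (\Delta \theta_t - \Delta \theta_{t-1}) + (\Delta e_{it} - \Delta e_{it-1}) \end{align*}\] When \(F_t\) varies arbitrarily, the term \(\lambda_i \Delta F_t\) is not constant over time, so this second difference no longer eliminates \(\lambda_i\).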
There are a lot of ideas. Probably the most prominent idea is to directly estimate the model for untreated potential outcomes and impute untreated potential outcomes
See (Xu 2017; Gobillon and Magnac 2016) for substantial detail on this front
For example, Xu (2017) uses the principal components approach of Bai (2009) to estimate the model.
For the IFE model not to “reduce” to two-way fixed effects, we need to rule out both:
\(\E[\lambda_i | G_i=3] = \E[\lambda_i | G_i=4] = \E[\lambda_i | G_i = \infty]\)
\(F_1 = F_2 = F_3 = F_4\)
But we need to strengthen these for our approach to work
Relevance: We additionally need that both
\(\E[\lambda_i | G_i=4] \neq \E[\lambda_i | G_i = \infty]\)
\(F_2 \neq F_1\)
Otherwise, \(G_i = 4\) and \(G_i = \infty\) have the same trend between the first two periods.
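The role of both conditions can be seen from the denominator of the expression for \(F_3^*\) in the simple case. Using the IFE model and the fact that groups 4 and \(\infty\) are untreated in periods 1 and 2, a short calculation gives \[\begin{align*} \E[\Delta Y_{i2} | G_i=\infty] - \E[\Delta Y_{i2} | G_i=4] = \Big( \E[\lambda_i | G_i=\infty] - \E[\lambda_i | G_i=4] \Big) (F_2 - F_1) \end{align*}\] This is nonzero exactly when both relevance conditions hold; if either fails, the ratio defining \(F_3^*\) is not well defined.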
Case 1: \(F_2 \neq F_1\) but \(\E[\lambda_i | G_i=3] \neq \E[\lambda_i | G_i=4] = \E[\lambda_i | G_i = \infty]\)
\(\implies\) our approach won’t work, but you would be able to see that \(G_i=3\) is trending differently from \(G_i=4\) and \(G_i=\infty\)
Effectively, \(G_i=4\) and \(G_i=\infty\) are the “same” comparison group, so we cannot deal with the IFE, but we can reject parallel trends in the same way as typical DID approaches
Case 2: \(\E[\lambda_i | G_i=4] \neq \E[\lambda_i | G_i = \infty]\) but \(F_3 \neq F_2 = F_1\)
\(\implies\) our approach won’t work, but (I think) no approach would work here
In this case, all groups trend the same between periods 1 and 2, so it looks like parallel trends holds. Here it does hold in pre-treatment periods, but it is violated in post-treatment periods
This is closely related to the saying: “parallel trends is fundamentally untestable”
Exogeneity:
Can’t directly test, but a lot of the DID infrastructure carries over here.
For DID, can “pre-test” parallel trends if there is more than 1 pre-treatment period
For our approach, we need 2 pre-treatment periods to identify \(ATT(g,t)\), but if there are more pre-treatment periods then we can pre-test
If there are more factors (say \(R\)), we need \(R+1\) pre-treatment periods to recover \(ATT(g,t)\), but if there are more pre-treatment periods then we can pre-test.