In my recent chapter in the Handbook of Labor, Human Resources, and Population Economics, I included proofs of results from Goodman-Bacon (Journal of Econometrics, 2021) and Sun and Abraham (Journal of Econometrics, 2021) basically with the idea of trying to write these results down in similar notation. I didn’t include the result from de Chaisemartin and d’Haultfoeuille (American Economic Review, 2020) just due to space limitations, but we are building on that result in a couple of recent papers, and I write this sort of proof just infrequently enough that I have to figure it out over and over. I’m going to just include the proof for the oft-considered case with staggered treatment adoption, no anticipation, and no units treated in the first period. I’m also using the same notation I always use – if it’s confusing, check out my handbook chapter. And, just to be clear, I’m not inventing anything here, just putting down a proof of a nice result in a familiar notation for me.
The main assumption underlying all of this is the following parallel trends assumption:
For all \(g \in \mathcal{G}\), and \(t=2,\ldots,\mathcal{T}\),
\[\E[\Delta Y_t(0) | G=g] = \E[\Delta Y_t(0)]\]which says that the path of untreated potential outcomes is the same for all groups across all time periods.
The interest here centers on interpreting \(\alpha\) from the following regression
\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it}\]Panel data versions of FWL-type arguments imply that we can remove the time and unit fixed effects by
\[\ddot{Y}_{it} = \alpha \ddot{D}_{it} + \ddot{e}_{it}\]where the notation indicates double-demeaning each of the variables, so, for example,
\[\ddot{D}_{it} = D_{it} - \bar{D}_i - \E[D_t] + \frac{1}{\mathcal{T}} \sum_{s=1}^{\mathcal{T}} \E[D_s]\]Now, population versions of FWL arguments imply that we can write
\[\alpha = \frac{\displaystyle \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it} Y_{it}]}{\displaystyle \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it}^2]}\]Two properties of double-demeaned random variables will be useful below
\[\E[\ddot{D}_{it}] = 0 \qquad \textrm{and} \qquad \sum_{t=1}^{\mathcal{T}} \ddot{D}_{it} = 0\]These are easy results to show (see, for example, my handbook chapter mentioned above for more details). Next, notice that, under staggered treatment adoption, \(\ddot{D}_{it}\) is fully determined by a unit’s group and knowledge of \(t\). In particular, notice that,
\[D_{it} = \indicator{G_i \leq t} \qquad \textrm{and} \qquad \bar{D}_i = \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \indicator{G_i \leq t} = \frac{\mathcal{T} - G_i + 1}{\mathcal{T}}\]Thus, define the function \(v(g,t) = \indicator{g \leq t} - \frac{\mathcal{T} - g + 1}{\mathcal{T}}\); this implies that \(D_{it} - \bar{D}_i = v(G_i,t)\). Next, define the function \(h(g,t) = v(g,t) - \displaystyle \sum_{g' \in \mathcal{G}} v(g',t) p_{g'}\), and notice that \(\E[D_t] - \displaystyle \frac{1}{\mathcal{T}} \sum_{s=1}^{\mathcal{T}} \E[D_s] = \E\big[ D_{it} - \bar{D}_i\big] = \E[v(G,t)]\). This implies that \(\ddot{D}_{it} = h(G_i,t)\), which gives us an easy way to switch between working with \(\ddot{D}_{it}\) and groups.
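To make this concrete, here is a small numerical sketch (simulated data with arbitrary group assignments) verifying that double-demeaning \(D_{it}\) reproduces \(h(G_i,t)\) exactly, along with the two properties of double-demeaned variables mentioned above:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 4                                    # number of time periods (script T)
groups = np.array([2, 3, 4, 5])          # 5 = T + 1 is the never-treated group
n = 1000
G = rng.choice(groups, size=n)           # each unit's treatment-adoption period
t_grid = np.arange(1, T + 1)

# D_it = 1{G_i <= t}, an n x T matrix of treatment indicators
D = (G[:, None] <= t_grid[None, :]).astype(float)

# double-demean: subtract unit means and period means, add back the grand mean
Ddd = D - D.mean(1, keepdims=True) - D.mean(0, keepdims=True) + D.mean()

# h(g,t) built from v(g,t) = 1{g <= t} - (T - g + 1)/T, using sample group shares
p = {g: (G == g).mean() for g in groups}
def v(g, t):
    return float(g <= t) - (T - g + 1) / T
def h(g, t):
    return v(g, t) - sum(v(gp, t) * p[gp] for gp in groups)

H = np.array([[h(g, t) for t in t_grid] for g in G])
assert np.allclose(Ddd, H)               # double-demeaned D is a function of (G_i, t) only
assert np.allclose(Ddd.sum(axis=1), 0)   # rows of double-demeaned D sum to zero
assert np.allclose(Ddd.mean(axis=0), 0)  # and it is mean zero in each period
```

The equality is exact (not just approximate) once the group shares \(p_g\) are replaced by their sample analogues.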
To show the result, most of the work will be for the numerator in the expression for \(\alpha\) above, and, in particular, notice that
\[\begin{aligned} \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it} Y_{it} ] &= \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it} Y_{it} ] - \underbrace{\frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it} Y_{iG_i-1} ]}_{=0} \\ &= \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \E[h(G_i,t) (Y_{it} - Y_{iG_i-1}) ] \\ &= \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \sum_{g \in \mathcal{G}} \E[h(g,t) (Y_{it} - Y_{ig-1}) | G=g] \, p_g \\ &= \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \sum_{g \in \mathcal{G}} h(g,t) \E[(Y_{it} - Y_{ig-1}) | G=g] \, p_g - \underbrace{\frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \sum_{g \in \mathcal{G}} h(g,t)\E[ (Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] \, p_g}_{=0} \\ &= \frac{1}{\mathcal{T}} \sum_{t=1}^{\mathcal{T}} \sum_{g \in \mathcal{G}} h(g,t) \Big( \E[(Y_{it} - Y_{ig-1}) | G=g] - \E[(Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] \Big) \, p_g \end{aligned}\]where the first equality holds by the property that \(\displaystyle \sum_{t=1}^{\mathcal{T}} \ddot{D}_{it} = 0\), the second equality holds by the definition of \(h\) and by combining terms, the third equality holds by the law of iterated expectations, the extra term in the fourth equality is equal to 0 because \(\sum_{g \in \mathcal{G}} h(g,t) \, p_g = 0\) for each \(t\) (which handles the \(Y_{it}\) part) and \(\displaystyle \sum_{t=1}^{\mathcal{T}} h(g,t) = 0\) for each \(g\) (which handles the \(Y_{ig-1}\) part), and the last equality holds by combining terms. Combining this with the denominator in the FWL expression for \(\alpha\), we have that
\[\alpha = \sum_{t=1}^{\mathcal{T}} \sum_{g \in \mathcal{G}} \frac{h(g,t)}{\sum_{s=1}^{\mathcal{T}} \E[h(G,s)^2]} \Big( \E[(Y_{it} - Y_{ig-1}) | G=g] - \E[(Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] \Big) \, p_g\]Note that the previous result is a decomposition in the sense that everything is computable, and \(\alpha\) will be exactly equal to the term on the right hand side (\(\hat{\alpha}\) will be equal to the sample analogue of the term on the RHS).
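Since the decomposition holds exactly in sample analogues, we can check it numerically. Here is a small simulation (the group shares, fixed effects, and heterogeneous treatment effects are all made-up values for illustration) verifying that the decomposition reproduces \(\hat{\alpha}\) from the two-way within regression:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 4
groups = np.array([2, 3, 4, 5])                      # 5 = never treated
n = 800
G = rng.choice(groups, size=n, p=[0.35, 0.25, 0.2, 0.2])
t_grid = np.arange(1, T + 1)
D = (G[:, None] <= t_grid[None, :]).astype(float)

# outcomes with unit and time effects, plus treatment effects that vary by (g, t)
eta = rng.normal(size=n)
theta = np.array([0.0, 0.5, 1.0, 1.5])
Y = eta[:, None] + theta[None, :] + rng.normal(scale=0.5, size=(n, T))
Y += D * 0.3 * (t_grid[None, :] - G[:, None] + 1)    # effect grows with exposure time

# alpha-hat from the two-way within regression (FWL)
Ddd = D - D.mean(1, keepdims=True) - D.mean(0, keepdims=True) + D.mean()
alpha = (Ddd * Y).sum() / (Ddd ** 2).sum()

# the decomposition, term by term, using sample analogues
denom = (Ddd ** 2).sum() / n                         # = sum_s E-hat[h(G,s)^2]
nt = G == 5
decomp = 0.0
for g in (2, 3, 4):                                  # the never-treated term is 0
    pg = (G == g).mean()
    for j, t in enumerate(t_grid):
        hgt = Ddd[G == g, j][0]                      # h(g,t): constant within group g
        diff = (Y[G == g, j] - Y[G == g, g - 2]).mean() \
               - (Y[nt, j] - Y[nt, g - 2]).mean()
        decomp += pg * hgt * diff / denom
assert np.isclose(alpha, decomp)                     # decomposition holds exactly
```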
It’s also interesting to separate the previous expression based on whether a particular period is a post-treatment or a pre-treatment period. In particular, just by splitting the sum above (and noticing that the inside term is equal to 0 for the never-treated group), we have that
\[\begin{aligned} \alpha &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} p_g \frac{h(g,t)}{\sum_{s=1}^{\mathcal{T}} \E[h(G,s)^2]} \Big( \E[(Y_{it} - Y_{ig-1}) | G=g] - \E[(Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] \Big) \\ & + \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{g-1} p_g \frac{h(g,t)}{\sum_{s=1}^{\mathcal{T}} \E[h(G,s)^2]} \Big( \E[(Y_{it} - Y_{ig-1}) | G=g] - \E[(Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] \Big) \end{aligned}\]Next, let’s impose parallel trends. In particular, under parallel trends \(\E[(Y_{it} - Y_{ig-1}) | G=g] - \E[(Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] = ATT(g,t)\) for \(t \geq g\) (i.e., post-treatment periods for group \(g\)), and \(\E[(Y_{it} - Y_{ig-1}) | G=g] - \E[(Y_{it} - Y_{ig-1}) | G=\mathcal{T}+1] = 0\) for \(t < g\) (i.e., pre-treatment periods for group \(g\)). Then,
\[\alpha = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} \underbrace{p_g \frac{h(g,t)}{\sum_{s=1}^{\mathcal{T}} \E[h(G,s)^2]}}_{w(g,t)} ATT(g,t)\]where \(\bar{\mathcal{G}}\) denotes the set of all groups excluding \(G=\mathcal{T}+1\) (the never-treated group). This says that, under parallel trends, \(\alpha\) is equal to a weighted average of group-time average treatment effects. To conclude, let’s show some interesting properties of the weights, \(w(g,t)\). Consider the numerator of the weights,
\[\begin{aligned} \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} h(g,t) p_g &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{\mathcal{T}} h(g,t) \indicator{g \leq t} p_g \\ &= \sum_{t=1}^{\mathcal{T}} \sum_{g \in \mathcal{G}} h(g,t) \indicator{g \leq t} \indicator{g < \mathcal{T}+1}p_g \\ &= \sum_{t=1}^{\mathcal{T}} \E[h(G,t) \indicator{G \leq t}] \\ &= \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it} D_{it}] = \sum_{t=1}^{\mathcal{T}} \E[\ddot{D}_{it}^2] \end{aligned}\]This implies that
\[\sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} w(g,t) = 1\]or, in other words, the weights sum to 1. This is a good property for the weights to have. It is possible to discuss the weights in more detail though. I think it is fair to see the denominator in the weights as a normalizing constant. The \(p_g\) term indicates that, at least for this component of the weights, larger groups will tend to be given more weight. The most interesting term in the weights is \(h(g,t)\), and, for example, it is possible for \(h(g,t)\) to be negative (which would make \(w(g,t)\) negative as well). Recall that
\[h(g,t) = \indicator{g \leq t} - \frac{\mathcal{T} - g + 1}{\mathcal{T}} - \E[D_t] + \frac{1}{\mathcal{T}} \sum_{s=1}^{\mathcal{T}} \E[D_s]\]Also, notice that, for all the group-times that get non-zero weight, \(\indicator{g \leq t} = 1\), and the last term is constant across \(g\) and \(t\). This means that the most interesting terms are the two middle ones. Group-times that get negative weights (or the smallest weights) would be ones where \(\displaystyle \frac{\mathcal{T}-g+1}{\mathcal{T}}\) is large (this would be the case for early-treated groups) and where \(\E[D_t]\) is large (this would be the case in later time periods). This discussion suggests that, in a very simple case where \(\mathcal{T}=3\) and \(\mathcal{G} = \{2,3,4\}\), the \(ATT(g,t)\) at risk of having negative weights is \(ATT(g=2,t=3)\).
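Here is a quick numerical check of this discussion for the \(\mathcal{T}=3\) case. The group shares below are hypothetical (chosen to make the earliest-treated group relatively large), and with them \(ATT(g=2,t=3)\) does receive a negative weight:

```python
T = 3
groups = (2, 3, 4)                            # 4 = never treated
p = {2: 0.5, 3: 0.3, 4: 0.2}                  # hypothetical group shares

def v(g, t):
    return float(g <= t) - (T - g + 1) / T
def h(g, t):
    return v(g, t) - sum(p[gp] * v(gp, t) for gp in groups)

# weights on the post-treatment ATT(g,t) terms, g in {2, 3}
denom = sum(p[g] * h(g, s) ** 2 for g in groups for s in range(1, T + 1))
w = {(g, t): p[g] * h(g, t) / denom
     for g in (2, 3) for t in range(g, T + 1)}

assert abs(sum(w.values()) - 1) < 1e-12       # the weights sum to 1
assert w[(2, 3)] < 0                          # early-treated group, last period: negative weight
```

With these shares, \(w(2,3)\) is about \(-0.08\); with equal group shares it turns out to be exactly 0, so whether the weight actually goes negative depends on the group shares.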
The published version is here, and a draft of the chapter is available on my website.
The chapter follows pretty closely what I teach about DID in my Ph.D. econometrics course at UGA. It’s probably less of a “practitioner’s guide” and more of an introduction to the literature for an econometrics student. And, in particular, the chapter includes:
1) Proofs of main results in the literature (e.g., Goodman-Bacon (2021), Callaway and Sant’Anna (2021), Sun and Abraham (2021)) in a unified notation.
2) A careful comparison of alternative estimation strategies that have recently been proposed in order to circumvent the issues with two-way fixed effects regressions. In the chapter, I emphasize the conceptual similarities between different estimation strategies, but also try to point out differences between them (and distinguish between fundamental differences and differences due to implementation choices made in different papers).
3) A discussion of realistic issues that show up in empirical applications such as including covariates in the parallel trends assumption and dealing with violations of the parallel trends assumption.
4) An extended application about minimum wage policies. My goal for the application was to (i) demonstrate different estimation strategies, and (ii) introduce open source code that is available for implementing new DID estimation strategies. The complete code/data that I used in the application is available here.
Sonia Karami and I just had our paper Treatment Effects in Interactive Fixed Effects Models with a Small Number of Time Periods accepted at the Journal of Econometrics.
One of the things that I have been very interested in over the past couple of years is trying to identify treatment effect parameters when (i) parallel trends assumptions are violated and (ii) the number of time periods is “small”.
Parallel trends assumptions are very closely related to the following model for untreated potential outcomes:
\[ Y_{it}(0) = \theta_t + \eta_i + U_{it} \] where \(\theta_t\) is a time fixed effect, \(\eta_i\) is an individual fixed effect, and \(U_{it}\) are idiosyncratic time varying unobservables.
But the additive separability between the time-period and unit fixed effects is important here.
A lot of my research has involved identifying treatment effect parameters in a difference in differences (DID) framework. For DID, the main identifying assumption is the parallel trends assumption:
Parallel Trends Assumption \[ \mathbb{E}[\Delta Y_t(0) | D=1] = \mathbb{E}[\Delta Y_t(0) | D=0] \]
An extended version of the parallel trends assumption is the following conditional parallel trends assumption
Conditional Parallel Trends Assumption \[ \mathbb{E}[\Delta Y_t(0) | \tilde{Z}, D=1] = \mathbb{E}[\Delta Y_t(0) | \tilde{Z}, D=0] \] which says that parallel trends holds after conditioning on \(\tilde{Z}\). To give an example, the application in our paper is about job displacement and the outcome is an individual’s earnings. It seems likely that the path of untreated potential outcomes (how outcomes would change over time if an individual were not displaced from their job) depends on a person’s education, demographic characteristics, etc. If these are distributed differently across displaced workers and non-displaced workers (which is also likely), then conditioning on these sorts of variables before invoking parallel trends will be important.
This conditional parallel trends assumption is closely related to the following model for untreated potential outcomes \[ Y_{it}(0) = g_t(\tilde{Z}_i) + \eta_i + U_{it} \] where \(g_t\) is a nonparametric, time-varying function of \(\tilde{Z}\), and \(\eta_i\) and \(U_{it}\) are the same as before (the important thing here is the additive separability of \(\eta_i\) which allows for it to be differenced out). It’s common to impose linearity for \(g_t\) to get to \[ Y_{it}(0) = \tilde{Z}_i'\tilde{\delta}_t + \eta_i + U_{it} \] where we take \(\tilde{Z}_i\) to include an intercept so that the time fixed effect is absorbed into \(\tilde{Z}_i'\tilde{\delta}_{t}\) from here on out.
I have been purposely a bit vague about \(\tilde{Z}\) above. Deciding which variables need to be conditioned on is largely a theoretical exercise. The variables that I mentioned above (education and/or demographic characteristics) are commonly observed in many datasets, but one might also think that parallel trends only holds after additionally conditioning on “ability”, which is unlikely to be observed in most data.
Let’s partition \(\tilde{Z} = (Z,\lambda)\) where \(Z\) corresponds to the observed components of \(\tilde{Z}\) and \(\lambda\) corresponds to the unobserved components of \(\tilde{Z}\). Similarly, let’s partition \(\tilde{\delta}_t = (\delta_t, F_t)\) where \(\delta_t\) corresponds to the elements in \(Z\) and \(F_t\) corresponds to the elements in \(\lambda\). Plugging this back into the model for untreated potential outcomes above yields \[
Y_{it}(0) = Z_i'\delta_t + \lambda_i'F_t + \eta_i + U_{it}
\] This is an interactive fixed effects model for untreated potential outcomes!
Even if we like the interactive fixed effects model for untreated potential outcomes, it is still not clear if we can recover any causal effect parameters of interest under palatable identifying assumptions.
For one thing, like DID, we’d like to identify causal effect parameters when the number of time periods is small, and much of the interactive fixed effects literature involves arguments where the number of time periods goes to infinity.
To make things concrete, let’s consider the case with 3 time periods: \(t^*\), \(t^*-1\), and \(t^*-2\). And let’s suppose that no one is treated until the last period. We also define \(D_i\) as a variable that is equal to one for individuals in the treated group (i.e., those that become treated in the last period) and is equal to 0 otherwise. Like most of the literature on treatment effects with panel data, we’ll target identifying the average treatment effect on the treated (ATT) which is given by \[ ATT = \mathbb{E}[Y_{t^*}(1) - Y_{t^*}(0) | D=1] \] which is the difference between treated and untreated potential outcomes on average among individuals in the treated group. Our main identification challenge is therefore to recover \(\mathbb{E}[Y_{t^*}(0)|D=1]\).
Towards this end, we use a “quasi-differencing” approach to difference out \(\eta_i\) and \(\lambda_i\) (see, for example, Holtz-Eakin, Newey, and Rosen (1988) and Ahn, Lee, and Schmidt (2013)); that is,
\[ Y_{it^*-1}(0) - Y_{it^*-2}(0) = \lambda_i \big(F_{t^*-1} - F_{t^*-2}\big) + Z_i' \big(\delta_{t^*-1} - \delta_{t^*-2} \big) + U_{it^*-1} - U_{it^*-2} \]
which implies
\[ \lambda_i = \Big( \big( Y_{it^*-1}(0) - Y_{it^*-2}(0) \big) - Z_i' \big(\delta_{t^*-1} - \delta_{t^*-2}\big) - \big(U_{it^*-1}-U_{it^*-2}\big) \Big) \Big/ (F_{t^*-1} - F_{t^*-2}) \]
Similarly, \[ \begin{aligned} Y_{it}(0) - Y_{it^*-2}(0) &= \lambda_i(F_t - F_{t^*-2}) + Z_i'(\delta_t - \delta_{t^*-2}) + U_{it} - U_{it^*-2} \nonumber \\ &= Z_i'\delta^*_t + F^*_t (Y_{it^*-1} - Y_{it^*-2}) + V_{it} \end{aligned} \] where the second line substitutes the expression for \(\lambda_i\) above, with \(F^*_t := (F_t - F_{t^*-2})/(F_{t^*-1} - F_{t^*-2})\), \(\delta^*_t := (\delta_t - \delta_{t^*-2}) - F^*_t(\delta_{t^*-1} - \delta_{t^*-2})\), and \(V_{it} := (U_{it} - U_{it^*-2}) - F^*_t(U_{it^*-1} - U_{it^*-2})\).
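Here is a small numerical check of this quasi-differenced representation on simulated (noiseless) untreated potential outcomes; the parameter values are arbitrary, and the reduced-form coefficients are computed by substituting out \(\lambda_i\):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# three periods: t*-2, t*-1, t* (columns 0, 1, 2); all parameter values made up
F = np.array([0.5, 1.3, 2.0])             # interactive-effect loadings F_t
delta = np.array([0.2, 0.7, 1.1])         # coefficients on a scalar covariate Z
Z = rng.normal(size=n)
lam = rng.normal(size=n)                  # unobserved lambda_i
eta = rng.normal(size=n)                  # differences out, as in the text
Y0 = eta[:, None] + Z[:, None] * delta[None, :] + lam[:, None] * F[None, :]

# implied reduced-form coefficients from substituting out lambda_i
Fstar = (F[2] - F[0]) / (F[1] - F[0])
dstar = (delta[2] - delta[0]) - Fstar * (delta[1] - delta[0])

lhs = Y0[:, 2] - Y0[:, 0]                 # Y_{t*}(0) - Y_{t*-2}(0)
rhs = Z * dstar + Fstar * (Y0[:, 1] - Y0[:, 0])
assert np.allclose(lhs, rhs)              # exact here because U_it is set to zero
```

With the idiosyncratic errors \(U_{it}\) switched back on, the representation holds with the error term \(V_{it}\) absorbing them, which is why estimation requires instruments rather than plain OLS (the lagged outcome on the right-hand side is correlated with \(V_{it}\)).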
We have a number of extensions to these kinds of results in the paper.
Pretty much the same arguments apply in cases where there is more than one interactive fixed effect. The main additional requirement is that, for each interactive fixed effect, we need at least one covariate whose effect on untreated potential outcomes is time invariant.
We spend a lot of time thinking about practical issues such as weak instruments, not enough covariates with time invariant effects, and tests for covariates actually having time invariant effects. Except in one or two very pernicious cases, we think that our approach should either work or that one would be able to successfully detect that it is not working.
My sense is that the sorts of two-way fixed effects (TWFE) regressions with covariates that are very common in applied work may have a number of limitations. In particular, these issues can be distinct from those emphasized in the literature on TWFE regressions with multiple periods and variation in treatment timing (e.g., Goodman-Bacon (2021), de Chaisemartin and d’Haultfoeuille (2020), and Borusyak, Jaravel, and Spiess (2021)).
In this paper, we have worked out a lot of these issues, particularly in the case with exactly two time periods (which is a case where TWFE regressions work well under unconditional parallel trends). We also provide alternative strategies that (i) are able to get around these issues and (ii) are only slightly more complicated to implement than TWFE regressions.
TWFE Regressions
To fix ideas, let me write down how most researchers implement DID identification strategies when they think that the underlying parallel trends assumption ought to be conditional on some covariates:
\[\begin{aligned} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + v_{it} \end{aligned}\]where \(Y_{it}\) is the outcome of interest (for unit \(i\) in time period \(t\)), \(\theta_t\) is a time fixed effect, \(\eta_i\) is an individual fixed effect, \(D_{it}\) is a treatment dummy variable, \(\alpha\) is what will be reported as the causal effect of the treatment (or, maybe loosely as some kind of average causal effect), \(X_{it}\) are the time-varying covariates, and \(v_{it}\) are idiosyncratic time-varying unobservables.
We show that there are a number of potential limitations with using this two-way fixed effects (TWFE) regression:
1) In cases with multiple periods and variation in treatment timing, this sort of TWFE regression uses already-treated units as part of the comparison group, and therefore suffers from all of the well-known weaknesses in that case. Both Goodman-Bacon (2021) and de Chaisemartin and d’Haultfoeuille (2020) already have results along these lines, so I’m going to only talk about the case with two time periods below (which is a case where, at least under unconditional parallel trends, TWFE regressions work fine).
2) TWFE regressions won’t work well if the time-varying covariates are affected by the treatment. This issue is often referred to as a “bad control” problem. It seems to be standard practice just not to include covariates that are potentially affected by the treatment. I agree that it’s a bad idea to include a time-varying covariate that is itself affected by the treatment, but I am less sure that a good solution is to just not include it.
For example, suppose that a labor economist is studying the effect of a treatment on a person’s earnings and thinks parallel trends holds after conditioning on a person’s occupation, but occupation is potentially affected by the treatment (there is tons of work in labor economics that would be concerned with this issue). Both ignoring occupation and including occupation run into issues. One helpful way to think about this is to try to condition on untreated potential occupation — that is, what occupation would have occurred if a person had not been treated. TWFE regressions don’t naturally accommodate this, but we propose some solutions for this case (I’ll come back to this below).
3) TWFE regressions like this are highly sensitive to the functional form. Since we are considering the case with two periods (and, like the “textbook” version of DID, where no units are treated yet in the first period), we can write
\[\begin{aligned} \Delta Y_{it^*} = (\theta_{t^*} - \theta_{t^*-1}) + \alpha D_i + \Delta X_{it^*}'\beta + \Delta v_{it^*} \end{aligned}\]where \(t^*\) indicates the second time period. You can see from this specification that, due to our linear functional form for the levels, we are effectively only controlling for the change in covariates over time.
We show that TWFE regressions implicitly rely on (i) the conditional parallel trends assumption only depending on the change in the time-varying covariates over time, and (ii) conditional ATTs only depending on the change in the time-varying covariates over time. Thus, TWFE regressions can perform poorly in cases where, for example, the path of untreated potential outcomes also depends on the level of the time-varying covariates.
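The two-period algebra above can be verified numerically: with two periods, the TWFE coefficients coincide exactly with those from regressing \(\Delta Y\) on an intercept, \(D\), and \(\Delta X\), so only the change in covariates enters. A minimal sketch with simulated data (all parameter values made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
D = rng.binomial(1, 0.4, size=n).astype(float)    # treated in period 2
X1 = rng.normal(size=n)                           # covariate level in period 1
X2 = X1 + 0.5 + rng.normal(scale=0.3, size=n)     # covariate level in period 2
eta = rng.normal(size=n)
Y1 = eta + 1.0 * X1 + rng.normal(size=n)
Y2 = eta + 0.8 + 1.0 * X2 + 2.0 * D + rng.normal(size=n)

def ddm(v1, v2):
    """double-demean a two-period panel variable, stacked into one long vector"""
    unit = (v1 + v2) / 2
    grand = (v1.mean() + v2.mean()) / 2
    return np.concatenate([v1 - unit - v1.mean() + grand,
                           v2 - unit - v2.mean() + grand])

# TWFE alpha and beta via the within (double-demeaning) transformation
W = np.column_stack([ddm(np.zeros(n), D), ddm(X1, X2)])
b_twfe, *_ = np.linalg.lstsq(W, ddm(Y1, Y2), rcond=None)

# first-difference regression: Delta Y on an intercept, D, and Delta X
Wfd = np.column_stack([np.ones(n), D, X2 - X1])
b_fd, *_ = np.linalg.lstsq(Wfd, Y2 - Y1, rcond=None)

assert np.allclose(b_twfe, b_fd[1:])              # identical coefficients
```

Note that the level \(X_1\) never appears on its own in the first-difference regression, which is exactly the point.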
Let me give you a concrete example where only controlling for changes in covariates seems undesirable. Suppose you are using county-level data and are studying the effect of a treatment in Georgia using counties from Tennessee as the comparison group, and that you think that parallel trends holds after you condition on county population. I live in Oconee County, Georgia. From 2010 to 2021 (which were the dates I could most easily find for county population), Oconee County grew from about 33,000 to about 42,000. In Tennessee, the county with the most similar population change was Sevier County, which increased from about 90,000 to about 99,000. But Sevier County is more than twice as big as Oconee County. This is probably not what we had in mind when we said we wanted to condition on county population. Maybe this is just bad luck, so let’s check the county with the next most similar population change. It is Shelby County — this is Memphis! — which increased from 928,500 to 938,800. I don’t think that comparing paths of outcomes for Shelby County and Oconee County is what any researcher has in mind for DID conditioning on county population. As a side comment, if you switch to, say, the change in log population over time, you do not do much better either — in that case, the closest match is Montgomery County, TN, which has over 5 times the population of Oconee County.
Perhaps somewhat surprisingly, TWFE regressions also require strong functional form assumptions on the propensity score (see the paper for details).
4) Similarly, we show that TWFE regressions are not robust to parallel trends assumptions and conditional ATTs that depend on time-invariant covariates. However, conditioning on time-invariant covariates in the parallel trends assumption is important in many applications. For example, if you are a labor economist studying the effect of some treatment on people’s earnings, the most important covariates to condition on in the parallel trends assumption are all likely to be time invariant — e.g., demographics, education, etc.
5) \(\alpha\) is hard to interpret in the presence of treatment effect heterogeneity. Even if none of the issues above apply in a particular application, if treatment effects are heterogeneous (particularly if they vary across different values of the covariates), then, under some additional conditions, \(\alpha\) will be equal to a weighted average of conditional ATT parameters, but these weights suffer from the “weight reversal” property pointed out in Słoczyński (2020) in a different context: conditional ATTs for values of the covariates that are uncommon for the treated group relative to the untreated group get lots of weight, and the opposite happens for relatively common values of the covariates.
If a researcher is fortunate enough that none of these issues apply in their application, then a TWFE regression would recover the ATT.
Existing work in econometrics
Most work on DID under conditional parallel trends (e.g., Abadie (2005), Sant’Anna and Zhao (2020), and Chang (2020)) considers the case with time-invariant covariates or uses “pre-treatment” values of time-varying covariates (which effectively just makes time-varying covariates time invariant by using their value in the pre-treatment period). This already solves most of the above issues: these approaches can be adapted to handle the cases with multiple periods and variation in treatment timing in (1), they do not require the same strong functional form assumptions as in (3), they solve (4) above because they include time-invariant covariates, and they recover the overall ATT directly rather than a hard-to-interpret weighted average of conditional ATTs as in (5).
What’s new in our paper
First, in order to address (2), where the time-varying covariates could themselves be affected by the treatment, we provide specific conditions under which it is sufficient to condition on pre-treatment values of the time-varying covariates as is common in the econometrics literature. In particular, the condition that rationalizes conditioning on pre-treatment covariates is
\[\begin{aligned} X_{t^*}(0) \perp D | X_{t^*-1}, Z \end{aligned}\]where \(X_{t^*}(0)\) is the value that \(X\) would take in time period \(t^*\) if the treatment had not occurred and \(Z\) is the vector of time-invariant covariates in the parallel trends assumption. This is an unconfoundedness assumption, but for time-varying covariates rather than the outcome. In words, it says that covariates are evolving similarly among treated and untreated units that have the same pre-treatment characteristics \(X_{t^*-1}\) and time-invariant covariates \(Z\).
This condition may or may not be reasonable in particular applications, but it is the sort of thing that researchers ought to think about. It is also “pre-testable” (i.e., you can look at data in pre-treatment periods and potentially find evidence for or against it).
In cases where this assumption does not hold, the strategy of just conditioning on pre-treatment covariates does not generally work. But we consider a number of other possible assumptions that can lead to alternative identification arguments in the paper. A big part of the paper is about these cases, but it is perhaps best just to consult the paper itself on this front as these arguments are somewhat more complicated.
Another important case is when a researcher is confident that covariates are evolving exogenously from the treatment; a simple version of this is just where \(X_{it^*}(1) = X_{it^*}(0)\) for all units (that is, the value of the covariates is the same under the treatment as without the treatment). Ignoring the issue of time-invariant covariates, the main issues with TWFE in this case are the functional form issues pointed out in (3) above. In this case, we provide a doubly robust expression for the ATT that does not rely on those sorts of functional form assumptions. These expressions involve outcome regressions and propensity scores that depend on both \(X_{t^*}\) and \(X_{t^*-1}\) — these can be challenging to estimate well because \(X_{t^*}\) and \(X_{t^*-1}\) are likely to be highly collinear in many applications. However, the doubly robust expression for the ATT allows us to connect to the literature on DID with machine learning (Chang (2020)) which provides an attractive way to try to estimate these functions.
Finally, in cases where these kinds of doubly robust / machine learning approaches are more complicated than a researcher actually wants to implement, we provide strategies for all of the cases discussed above that can be implemented using just regressions and averaging. Relative to the previous two points, these approaches require additional linearity assumptions (though these are substantially less restrictive than the implicit functional form restrictions of the TWFE regressions discussed earlier), but have the benefit of being easier to implement; these ideas build on the regression adjustment and imputation approaches that have shown up recently in the DID literature (Liu, Wang, and Xu (2021), Gardner (2021), Borusyak, Jaravel, and Spiess (2021)).
Let me just give the example of what we propose to do in cases where the time-varying covariates evolve exogenously. Similar to the “imputation” literature, we can exploit the connection between parallel trends assumptions and a model for untreated potential outcomes:
\[\begin{aligned} Y_{it}(0) = Z_i'\delta_t + \eta_i + X_{it}(0) \beta_t + v_{it} \end{aligned}\]where we take \(Z\) to include an intercept. The \(\beta_t\) is perhaps non-standard (see discussion in next paragraph). Taking the difference over time implies
\[\begin{aligned} \Delta Y_{it^*}(0) = Z_i'\delta^*_{t^*} + \Delta X_{it^*}(0) \beta_{t^*} + X_{it^*-1}(0) \beta^*_{t^*} + \Delta v_{it^*} \end{aligned}\]where we define \(\delta^*_{t^*} := (\delta_{t^*} - \delta_{t^*-1})\) and \(\beta^*_{t^*} := (\beta_{t^*} - \beta_{t^*-1})\). In my view, this is a particularly attractive specification for untreated potential outcomes in terms of time-varying covariates. It includes both the initial level of the covariates (which is similar to including the “pre-treatment” value of the covariate) as well as the change in covariates over time. And, for example, (up to the parametric assumptions) this expression would avoid the issues of comparing counties with similar changes in population over time but very dissimilar overall populations.
Moreover, since we observe untreated potential outcomes and covariates for the untreated group, we can recover all of the parameters from the regression of \(\Delta Y_{t^*}\) on \(Z\), \(\Delta X_{t^*}\), and \(X_{t^*-1}\) using the untreated group. Next, notice that
\[\begin{aligned} ATT &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*}(0) | D=1] \\ &= \E[\Delta Y_{t^*} | D=1] - \Big(\E[Z|D=1]'\delta^*_{t^*} + \E[\Delta X_{t^*}(0) | D=1] \beta_{t^*} + \E[X_{t^*-1}|D=1]\beta^*_{t^*} \Big) \end{aligned}\]where the second equality holds by plugging in the expression for \(\Delta Y_{t^*}(0)\) from the previous display. Everything is identified in the last line except for \(\E[\Delta X_{t^*}(0) | D=1]\). If we believe that covariates evolve exogenously though, it means that this term is equal to \(\E[\Delta X_{t^*} | D=1]\) which is identified. We consider 5 additional scenarios for recovering \(\E[\Delta X_{t^*}(0) | D=1]\) in the paper.
To summarize, this suggests a simple two-step estimation procedure: (i) estimate a regression using untreated observations and recover the estimates of the parameters in the model for untreated potential outcomes, (ii) combine these with estimates of the averages of the change in outcomes over time and averages of covariates for the treated group (as in the previous display) to compute the ATT.
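Here is a minimal sketch of this two-step procedure on simulated data (the data generating process, coefficient values, and effect size are all made up for illustration); covariates evolve exogenously in this simulation, so the treated-group average of \(\Delta X_{t^*}\) can stand in for \(\E[\Delta X_{t^*}(0)|D=1]\):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
D = rng.binomial(1, 0.3, size=n).astype(float)
Z = np.column_stack([np.ones(n), rng.normal(size=n)])  # time-invariant covariates, incl. intercept
X1 = rng.normal(size=n) + 0.5 * D                      # pre-treatment covariate level
X2 = X1 + 0.5 + rng.normal(scale=0.5, size=n)          # evolves exogenously from treatment
eta = rng.normal(size=n)
Y1 = eta + Z @ np.array([0.0, 0.4]) + 0.6 * X1 + rng.normal(size=n)
Y2 = eta + Z @ np.array([0.8, 0.9]) + 0.7 * X2 + 1.5 * D + rng.normal(size=n)
dY = Y2 - Y1

# step (i): regress dY on Z, Delta X, and X_{t*-1} using untreated units only
W = np.column_stack([Z, X2 - X1, X1])
untreated = D == 0
coef, *_ = np.linalg.lstsq(W[untreated], dY[untreated], rcond=None)

# step (ii): impute E[Delta Y(0) | D=1] from treated-group averages; ATT is the gap
att = dY[D == 1].mean() - W[D == 1].mean(axis=0) @ coef
assert abs(att - 1.5) < 0.25                           # near the true effect in this design
```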
Conclusion
In my view, the sorts of TWFE regressions that show up in many applications in economics have a number of limitations – when these TWFE regressions include time-varying covariates, we are arguing that they are likely to have a number of disadvantages even in “textbook” cases with only two time periods. Fortunately, it is quite straightforward to use other approaches (that are not much more complicated) that can essentially avoid all of these issues.
We don’t have code yet, but we are working on it. If you have comments/questions, please feel free to get in touch.
References
Abadie, Alberto. “Semiparametric difference-in-differences estimators.” The Review of Economic Studies 72.1 (2005): 1-19.
Borusyak, Kirill, Xavier Jaravel, and Jann Spiess. “Revisiting event study designs: Robust and efficient estimation.” arXiv preprint arXiv:2108.12419 (2021).
Chang, Neng-Chieh. “Double/debiased machine learning for difference-in-differences models.” The Econometrics Journal 23.2 (2020): 177-191.
de Chaisemartin, Clément, and Xavier d’Haultfoeuille. “Two-way fixed effects estimators with heterogeneous treatment effects.” American Economic Review 110.9 (2020): 2964-2996.
Gardner, John. “Two-stage differences in differences.” (2021).
Goodman-Bacon, Andrew. “Difference-in-differences with variation in treatment timing.” Journal of Econometrics 225.2 (2021): 254-277.
Liu, Licheng, Ye Wang, and Yiqing Xu. “A practical guide to counterfactual estimators for causal inference with time-series cross-sectional data.” arXiv preprint arXiv:2107.00856 (2021).
Sant’Anna, Pedro HC, and Jun Zhao. “Doubly robust difference-in-differences estimators.” Journal of Econometrics 219.1 (2020): 101-122.
Słoczyński, Tymon. “Interpreting OLS estimands when treatment effects are heterogeneous: Smaller groups get larger weights.” The Review of Economics and Statistics (2020): 1-27.
One of the main ways that researchers use our did package is to plot event studies. These are quite useful for (i) thinking about dynamic effects of participating in the treatment and (ii) “pre-testing” the parallel trends assumption.
You can find an extended discussion about event studies, limitations of event study regressions in a number of relevant cases, etc. here.
This post isn’t about criticizing event study regressions; instead, what I want to talk about is the choice of the “base period” in event studies.
Event study regressions typically have a universal base period. This means that all differences are relative to a particular period, and, most commonly, it is set to be the period immediately before the treatment starts.
In the did package, our default is to use a varying base period. In pre-treatment periods, the base period is the immediately preceding period; e.g., if period 4 is pre-treatment, then the base period for period 4 will be period 3.
If there are violations of parallel trends in pre-treatment periods, then the interpretation of reported “effects” in pre-treatment periods in an event study differs depending on whether one uses a varying or universal base period. Here is the difference:
With a varying base period, the reported effects are pseudo-ATTs. They are what we would have estimated the effect of participating in the treatment to be (on impact) if the treatment had occurred in that period (instead of when it actually occurred).
With a universal base period, event study estimates in pre-treatment periods are not themselves treatment effect parameters, but they are useful for showing how outcomes are trending over time.
In the newest version (version 2.1) of did, we have added a new argument, base_period, to att_gt to give users the option to choose either a varying (the default) or universal base period.
A couple of other things that are also worth mentioning:
In post-treatment periods, the base period is the period immediately before treatment in both cases \(\implies\) the only place where this difference matters is in pre-treatment periods.
In pre-treatment periods, the estimates from either choice are linear combinations of the estimates from the other, so they are essentially just alternative ways of reporting the same information. That is, choosing between a varying or universal base period is more a matter of the “style” of presenting results and shouldn’t change conclusions about whether parallel trends is violated in pre-treatment periods, etc.
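To see the linear-combination point concretely, here is a tiny base-R sketch using made-up group-mean outcome paths (my own toy numbers, not did-package output). With pre-treatment periods 1 through 4 and treatment starting in period 5, a universal-base estimate at event time \(e\) is minus the running sum of the varying-base estimates from \(e+1\) through \(-1\):

```r
# Hypothetical group-mean outcome paths in pre-treatment periods 1..4,
# with treatment starting in period 5 (so event times -4,...,-1).
y_treated   <- c(1.0, 1.4, 2.1, 2.5)
y_untreated <- c(0.5, 0.8, 1.2, 1.4)
gap <- y_treated - y_untreated   # treated-minus-untreated gap each period

# varying base: each pre-period estimate compares to the preceding period
theta_var <- diff(gap)           # event times -3, -2, -1

# universal base: each pre-period estimate compares to period 4 (e = -1)
theta_uni <- gap - gap[4]        # event times -4, -3, -2 (and 0 at e = -1)

# the two are linear combinations of each other:
# theta_uni at e equals minus the sum of theta_var from e+1 through -1
stopifnot(all.equal(theta_uni[1], -sum(theta_var[1:3])))
stopifnot(all.equal(theta_uni[2], -sum(theta_var[2:3])))
stopifnot(all.equal(theta_uni[3], -theta_var[3]))
```

Since each set of estimates can be recovered from the other, the choice affects presentation rather than the underlying information about pre-trends.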
My sense is that providing results using a varying base period tends to work better when (i) the researcher is primarily concerned with treatment effect anticipation, and/or (ii) the number of pre-treatment periods is relatively small. A universal base period, on the other hand, tends to work better when (i) the researcher thinks that there are long-term differences in trends across groups, and/or (ii) the number of pre-treatment periods is relatively large.
Finally, although using a universal base period is relatively more common in applications, it seems to me that this is mainly because it is easier to implement when you are running an event study regression. For researchers that directly compute averages of paths of outcomes at different lengths of exposure to the treatment (as we do in the did package), reporting results using either type of base period is easy to do.
Example 1: No violations of parallel trends
Let’s start with the simplest case where there are no violations of parallel trends in pre-treatment periods.
library(did) # need to load version 2.1 of package
Below is some code to generate data where parallel trends holds in all periods, and the average effect of participating in the treatment is equal to 1 (reset.sim and build_sim_dataset are functions in the did package for generating simulated data).
# create data with no pre-trends
time.periods <- 5
sp <- reset.sim(time.periods=time.periods)
sp$te <- 1
data <- build_sim_dataset(sp)
data <- subset(data, G==time.periods | G==0)
# varying base period
res1_varying <- att_gt(yname="Y", xformla=~X, data=data, tname="period",
                       idname="id",
                       control_group="nevertreated",
                       gname="G", est_method="dr")
dynamic1_varying <- aggte(res1_varying, type="dynamic")
p1_varying <- ggdid(dynamic1_varying, ylim=c(-2,2))
# universal base period
res1_universal <- att_gt(yname="Y", xformla=~X, data=data, tname="period",
                         idname="id",
                         control_group="nevertreated",
                         gname="G", est_method="dr",
                         base_period="universal")
dynamic1_universal <- aggte(res1_universal, type="dynamic")
p1_universal <- ggdid(dynamic1_universal, ylim=c(-2,2))
ggpubr::ggarrange(p1_varying, p1_universal, nrow=1)
The plot on the left uses a varying base period while the plot on the right uses a universal base period. The estimated treatment effects when \(e=0\) are numerically identical. The pre-treatment estimates are not numerically identical (they are based on different paths of outcomes in pre-treatment periods for the treated group relative to the untreated group), but (as expected) neither provides any evidence against parallel trends. Finally, notice that using a varying base period provides an estimate when \(e=0\), but does not provide an estimate when \(e=-4\); using a universal base period provides an estimate when \(e=-4\) but not when \(e=-1\).
Example 2: Anticipation Effects
Next, we generate data where there are anticipation effects. What is happening here is that there is a group that becomes treated in the last period and a group that never participates in the treatment (in order to not clutter the post with code, let me just point you to the complete code for this post…it is very similar to the code above). Parallel trends holds in all periods except the period right before treatment, when the treated group experiences a negative “anticipation” effect of participating in the treatment.
As before, the results using a varying base period are in the panel on the left, and the results using a universal base period are on the right. As before, the post-treatment estimated effects are exactly the same. To me, it seems much clearer to interpret the figure on the left (recall that there are anticipation effects that are equal to -1 in the pre-treatment period). For me, the figure on the right is hard to interpret.
Example 3: Longer Run Linear Trends
Finally, let’s consider the case where there are longish-run linear trend differences between the treated group and untreated group (and, thus, parallel trends is violated in pre-treatment periods). That is, we are in the case where, on average, outcomes are increasing by one in the treated group relative to the untreated group across all periods (both pre-treatment and post-treatment).
As in the earlier two cases, the panel on the left contains results using a varying base period, and the panel on the right contains results using a universal base period; likewise, the post-treatment estimates are numerically identical. In this case, to me, it seems easier to notice the linear difference in trends in the right panel. If you are careful, you can still interpret the results using a varying base period. In particular, in every pre-treatment period, we would have over-estimated the effect of participating in the treatment (if the treatment had started in that period); this happens because of the linear violations of parallel trends in all periods.