Introduction

Difference-in-differences is one of the most common approaches for identifying and estimating the causal effect of participating in a treatment on some outcome.

The “canonical” version of DiD involves two periods and two groups. The untreated group never participates in the treatment, and the treated group becomes treated in the second period.

However, much applied work deals with cases where there are more than two time periods and different units can become treated at different points in time. Regardless of the number of time periods, by far the leading approach in applied work is to try to estimate the effect of the treatment using a two-way fixed effects (TWFE) linear regression. This works great in the case with two periods, but there are a number of recent methodological papers that suggest that there may be substantial drawbacks to using TWFE with multiple time periods.

This vignette briefly discusses the emerging literature on DiD with multiple time periods – both issues with standard approaches as well as remedies for these potential problems. The did package implements a number of these remedies. A vignette for how to use the did package is available here. The background article for these vignettes is Callaway and Sant’Anna (2021), “Difference-in-Differences with Multiple Time Periods”.

Background

To start with, we’ll consider some background material in this section. First, we’ll discuss DiD with two time periods and two groups – this is the “canonical” case of DiD. Second, we briefly consider issues with TWFE linear regressions when there are multiple time periods.

DiD with 2 Periods and 2 Groups

The baseline case for DiD is the one with two periods (let’s call these periods $t$ and $t-1$) and two groups (a treated group and an untreated group).

Notation / Setup

  • For $s \in \{t,t-1\}$, $Y_{is}(0)$ is unit $i$’s untreated potential outcome – this is the outcome that unit $i$ would experience in period $s$ if they did not participate in the treatment.

  • For $s \in \{t,t-1\}$, $Y_{is}(1)$ is unit $i$’s treated potential outcome – this is the outcome that unit $i$ would experience in period $s$ if they did participate in the treatment.

  • Set $D=1$ for units in the treated group and $D=0$ for units in the untreated group.

  • In the first period, no one participates in the treatment. In the second period, units in the treated group become treated. This means that observed outcomes are given by $$ Y_{it-1} = Y_{it-1}(0) \quad \textrm{and} \quad Y_{it} = D_i Y_{it}(1) + (1-D_i) Y_{it}(0) $$ In other words, in the first period, we observe untreated potential outcomes for everyone (there is a no-anticipation assumption built in here). In the second period, we observe treated potential outcomes for units that actually participate in the treatment and untreated potential outcomes for units that do not participate in the treatment.

  • The main parameter of interest in most DiD designs is the Average Treatment Effect on the Treated (ATT). It is given by $$ ATT = E[Y_t(1) - Y_t(0) | D=1] $$ This is the difference between treated and untreated potential outcomes, on average, for units in the treated group.

The main assumption in DiD designs is called the parallel trends assumption:

Parallel Trends Assumption

$$ E[Y_t(0) - Y_{t-1}(0) | D=1] = E[Y_t(0) - Y_{t-1}(0) | D=0] $$

In words, this assumption says that the change (or “path”) in outcomes over time that units in the treated group would have experienced if they had not participated in the treatment is the same as the path of outcomes that units in the untreated group actually experienced. The parallel trends assumption allows for the level of untreated potential outcomes to differ across groups and is consistent with, for example, fixed effects models for untreated potential outcomes where the mean of the unobserved fixed effect can be different across groups.

This assumption is potentially useful because the path of untreated potential outcomes for units in the treated group (the term on the left in the above equation) is not known, but the researcher does observe the path of untreated potential outcomes for units in the untreated group (the term on the right in the above equation). In fact, it is straightforward to show that, under the parallel trends assumption, the $ATT$ is identified and given by $$ ATT = E[ Y_t - Y_{t-1}| D=1] - E[ Y_t - Y_{t-1}| D=0] $$
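The identification argument can be written out in a few lines. The sketch below uses only the definitions given earlier (the observed-outcome rule with no anticipation, and the parallel trends assumption):

$$
\begin{aligned}
ATT &= E[Y_t(1) - Y_t(0) | D=1] \\
&= E[Y_t(1) - Y_{t-1}(0) | D=1] - E[Y_t(0) - Y_{t-1}(0) | D=1] \\
&= E[Y_t(1) - Y_{t-1}(0) | D=1] - E[Y_t(0) - Y_{t-1}(0) | D=0] \\
&= E[Y_t - Y_{t-1} | D=1] - E[Y_t - Y_{t-1} | D=0]
\end{aligned}
$$

The second line adds and subtracts $E[Y_{t-1}(0) | D=1]$, the third line applies parallel trends to the second term, and the last line replaces potential outcomes with the observed outcomes they correspond to in each group and period.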

That is, the $ATT$ is the mean change in outcomes over time experienced by units in the treated group, adjusted by the mean change in outcomes over time experienced by units in the untreated group; the latter term, under the parallel trends assumption, is what the path of outcomes for units in the treated group would have been if they had not participated in the treatment.
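As a small illustration, the sample analogue of this expression replaces each expectation with a group mean. The sketch below uses made-up numbers; the data layout and the function name `did_2x2` are hypothetical choices for this example and are not part of the did package.

```python
from statistics import mean

# Hypothetical toy data: (treated indicator D, outcome in t-1, outcome in t).
units = [
    (1, 10.0, 15.0),
    (1, 12.0, 16.0),
    (0, 8.0, 10.0),
    (0, 9.0, 12.0),
]


def did_2x2(rows):
    """Mean outcome change of treated units minus that of untreated units."""
    change_treated = mean(y1 - y0 for d, y0, y1 in rows if d == 1)
    change_untreated = mean(y1 - y0 for d, y0, y1 in rows if d == 0)
    return change_treated - change_untreated


att_hat = did_2x2(units)
print(att_hat)
```

Here the treated units change by 4.5 on average and the untreated units by 2.5, so the estimate is the difference between those two changes.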

Two way fixed effects regressions

Now let’s move to a more general case where there are $\mathcal{T}$ total time periods. Denote particular time periods by $t$ where $t=1,\ldots,\mathcal{T}$.

By far the most common approach to trying to estimate the effect of a binary treatment in this setup is the TWFE linear regression. This is a regression like $$ Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it} $$ where $\theta_t$ is a time fixed effect, $\eta_i$ is a unit fixed effect, $D_{it}$ is a treatment dummy variable, $v_{it}$ are time varying unobservables that are mean independent of everything else, and $\alpha$ is presumably the parameter of interest. $\alpha$ is often interpreted as the average effect of participating in the treatment.

Although this is essentially a standard approach in applied work, there are a number of recent papers that point out potentially severe drawbacks of using the TWFE estimation procedure. These include: Borusyak and Jaravel (2018), Goodman-Bacon (2021), de Chaisemartin and D’Haultfoeuille (2020), and Sun and Abraham (2021).

When will TWFE work?

  1. Effects really aren’t heterogeneous. If the effect of participating in the treatment really is $\alpha$ for all units, TWFE will work great. That being said, in many applications, treatment effects are very likely to be heterogeneous – they may vary across different units or exhibit dynamics or change across different time periods. In particular applications, this is worth thinking about, but, in our view, heterogeneous effects of participating in some treatment are the leading case.

  2. There are only two time periods. This is the canonical case (2 periods, one group becomes treated in the second period, the other is never treated). In this case, under parallel trends and no-anticipation, $\alpha$ is going to be numerically equal to the $ATT$. In other words, in this case, even though it looks like you have restricted the effect of participating in the treatment to be the same across all units, TWFE exhibits robustness to treatment effect heterogeneity. Unfortunately, this robustness to treatment effect heterogeneity does not continue to hold when there are more periods and groups become treated at different points in time.
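To see the two-period equivalence numerically, the sketch below computes the TWFE coefficient by two-way demeaning of the treatment dummy (valid for a balanced panel) and compares it with the difference in mean outcome changes between groups. The data and the helper name `twfe_alpha` are hypothetical; this is an illustration with the standard library, not the estimator used by any particular package.

```python
from statistics import mean


def twfe_alpha(y, d):
    """OLS coefficient on d in a regression with unit and time fixed effects.

    y and d are lists of rows (one row per unit, one column per period).
    Assumes a balanced panel, where two-way demeaning of d partials out
    both sets of fixed effects.
    """
    n, k = len(d), len(d[0])
    unit_mean = [mean(row) for row in d]
    time_mean = [mean(d[i][t] for i in range(n)) for t in range(k)]
    grand = mean(unit_mean)
    num = den = 0.0
    for i in range(n):
        for t in range(k):
            d_tilde = d[i][t] - unit_mean[i] - time_mean[t] + grand
            num += d_tilde * y[i][t]
            den += d_tilde ** 2
    return num / den


# Hypothetical two-period panel: the first two units are treated in period 2.
y = [[10.0, 15.0], [12.0, 16.0], [8.0, 10.0], [9.0, 12.0]]
d = [[0, 1], [0, 1], [0, 0], [0, 0]]

alpha_hat = twfe_alpha(y, d)
# Difference in mean outcome changes: treated units minus untreated units.
did_hat = mean(r[1] - r[0] for r in y[:2]) - mean(r[1] - r[0] for r in y[2:])
print(alpha_hat, did_hat)
```

The two numbers coincide even though the treated units in this toy panel have different individual outcome changes.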

Why is TWFE not robust to treatment effect heterogeneity?

There are entire papers written about this, see, e.g., Borusyak and Jaravel (2018), Goodman-Bacon (2021), de Chaisemartin and D’Haultfoeuille (2020), and Sun and Abraham (2021). But here is the short version: in a TWFE regression, units whose treatment status doesn’t change over time serve as the comparison group for units whose treatment status does change over time. With multiple time periods and variation of treatment timing, some of these comparisons are:

  • newly treated units relative to “never treated” units (good!)

  • newly treated units relative to “not-yet treated” units (good!)

  • newly treated units relative to already treated units (bad!!!)

The first two of these comparisons are good (or at least in the spirit of DiD) in that they take the path of outcomes experienced by units that become treated and adjust it by the path of outcomes experienced by units that are not participating in the treatment. The third comparison is different though: it adjusts the path of outcomes for newly treated units by the path of outcomes for already treated units. But this is not the path of untreated potential outcomes; it includes treatment effect dynamics. Thus, these dynamics appear in $\alpha$, making it very hard to give a clear causal interpretation.

And this issue can have potentially severe consequences. For example, it is possible to come up with examples where the effect of participating in the treatment is positive for all units in all time periods, but the TWFE estimation procedure leads to estimating a negative effect of participating in the treatment. Even in the case where “negative weights” can be ruled out, $\alpha$ recovers a weighted average of $ATT$s, though these weights are hard to interpret.
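One such example can be constructed by hand. The sketch below is a hypothetical, noise-free panel with three periods, one unit first treated in period 2 and one first treated in period 3, and no never-treated units. Untreated potential outcomes are zero everywhere, so each observed outcome equals that unit's treatment effect, and all effects are positive. The helper `twfe_alpha` is an illustrative name and assumes a balanced panel.

```python
from statistics import mean


def twfe_alpha(y, d):
    """TWFE coefficient on d via two-way demeaning (balanced panel assumed)."""
    n, k = len(d), len(d[0])
    unit_mean = [mean(row) for row in d]
    time_mean = [mean(d[i][t] for i in range(n)) for t in range(k)]
    grand = mean(unit_mean)
    num = den = 0.0
    for i in range(n):
        for t in range(k):
            d_tilde = d[i][t] - unit_mean[i] - time_mean[t] + grand
            num += d_tilde * y[i][t]
            den += d_tilde ** 2
    return num / den


# Row 0: treated from period 2, effect 1 on impact and 4 one period later.
# Row 1: treated from period 3, effect 1 on impact.
d = [[0, 1, 1], [0, 0, 1]]
y = [[0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]

alpha_hat = twfe_alpha(y, d)
print(alpha_hat)
```

In this construction the effect of the early group one period after treatment enters with a negative weight, because in period 3 the already treated unit serves as the comparison for the newly treated one. The resulting coefficient is negative even though every individual effect is positive.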

Treatment Effects in Difference in Differences Designs with Multiple Periods

In light of the potential problems with TWFE regressions in DiD designs with multiple periods, are there alternative approaches that can be used in this case?

Yes, and it turns out that it is not all that complicated! It is just a matter of using the “good/desirable” comparisons between groups instead of all possible comparisons.

To fix ideas, let’s provide some extended notation and be clear about the identifying assumptions that we are going to make.

Notation

  • $Y_{it}(0)$ is unit $i$’s untreated potential outcome. This is the outcome that unit $i$ would experience in period $t$ if they do not participate in the treatment.

  • $Y_{it}(g)$ is unit $i$’s potential outcome in time period $t$ if they become treated in period $g$.

  • $G_i$ is the time period when unit $i$ becomes treated (often groups are defined by the time period when a unit becomes treated; hence, the $G$ notation).

  • $C_i$ is an indicator variable for whether unit $i$ is in a never-treated group.

  • $D_{it}$ is an indicator variable for whether unit $i$ has been treated by time $t$.

  • $Y_{it}$ is unit $i$’s observed outcome in time period $t$. For units in the never-treated group, $Y_{it} = Y_{it}(0)$ in all time periods. For units in other groups, we observe $Y_{it} = \mathbf{1}\{ G_i > t\} Y_{it}(0) + \mathbf{1}\{G_i \leq t \} Y_{it}(G_i)$. The notation here is a bit complicated, but in words, we observe untreated potential outcomes for units that have not yet participated in the treatment, and we observe treated potential outcomes for units once they start to participate in the treatment (and these can depend on when they became treated). Implicit in this notation is a no treatment anticipation assumption, which can be relaxed as discussed in Callaway and Sant’Anna (2021), “Difference-in-Differences with Multiple Time Periods”.

  • $X_i$ is a vector of pre-treatment covariates.
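To make the notation concrete, the sketch below derives the never-treated indicator and the treatment indicator from each unit's first treatment period, and encodes the rule linking observed and potential outcomes. The unit labels, the use of `math.inf` to code never-treated units, and the function name `observed` are all assumptions made for this illustration.

```python
import math

# Hypothetical first treatment period G_i per unit; math.inf marks
# never-treated units (a coding choice made for this sketch).
first_treated = {"a": 2, "b": 3, "c": math.inf}
periods = [1, 2, 3]

# C_i: indicator for belonging to the never-treated group.
never_treated = {i: int(g == math.inf) for i, g in first_treated.items()}

# D_it: indicator for having been treated by period t, i.e. 1{G_i <= t}.
treated_by = {
    i: [int(g <= t) for t in periods] for i, g in first_treated.items()
}


def observed(unit, t, y0, y_treated):
    """Y_it = 1{G_i > t} Y_it(0) + 1{G_i <= t} Y_it(G_i)."""
    return y_treated if first_treated[unit] <= t else y0


print(never_treated)
print(treated_by)
```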

Main Assumptions

Staggered Treatment Adoption Assumption

Recall that $D_{it} = 1$ if unit $i$ has been treated by time $t$ and $D_{it}=0$ otherwise. Then, for $t=1,...,\mathcal{T}-1$, $D_{it} = 1 \implies D_{it+1} = 1$.

Staggered treatment adoption implies that once a unit participates in the treatment, they remain treated. In other words, units do not “forget” about their treatment experience. This is a leading case in many applications in economics. For example, it would be the case for policies that roll out to different locations over some period of time. It would also be the case for many unit-level treatments that have a “scarring” effect. For example, in the context of job training, many applications define a unit as treated once it has ever participated in the program.
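In a dataset, this assumption can be checked directly from the treatment indicators. The sketch below is a minimal check, assuming one row of binary treatment indicators per unit ordered by time; the function name `is_staggered` is hypothetical.

```python
def is_staggered(d_rows):
    """True if D_it = 1 implies D_it+1 = 1 for every unit (rows are units)."""
    return all(
        later >= earlier
        for row in d_rows
        for earlier, later in zip(row, row[1:])
    )


print(is_staggered([[0, 1, 1], [0, 0, 1], [0, 0, 0]]))  # treatment never switches off
print(is_staggered([[0, 1, 0]]))  # treatment switches off in the last period
```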

Within the DiD context, we believe it is hard to analyze non-staggered treatment setups without further restricting treatment effect heterogeneity across time, groups, treatment sequences, etc. That is the main reason we focus on this leading case.

Parallel Trends Assumption based on never-treated units

For all $g=2,...,\mathcal{T}$, $t=2,...,\mathcal{T}$ with $t \ge g$, $$ E[ Y_t(0) - Y_{t-1}(0) | G=g] = E[ Y_t(0) - Y_{t-1}(0)| C=1] $$

This is a natural extension of the parallel trends assumption in the two periods and two groups case. It says that, in the absence of treatment, average untreated potential outcomes for the group first treated in time $g$ and for the “never treated” group would have followed parallel paths in all post-treatment periods $t \ge g$.

Note that the aforementioned parallel trends assumption relies on using the “never treated” units as the comparison group for all “eventually treated” groups. This presumes that (i) a (large enough) “never-treated” group is available in the data, and (ii) these units are “similar enough” to the eventually treated units such that they can indeed be used as a valid comparison group. In situations where these conditions are not satisfied, one can use an alternative parallel trends assumption that uses the not-yet treated units as valid comparison groups.

Parallel Trends Assumption based on not-yet treated units

For all $g=2,...,\mathcal{T}$, $s,t=2,...,\mathcal{T}$ with $t \ge g$ and $s \ge t$, $$ E[ Y_t(0) - Y_{t-1}(0) | G=g] = E[ Y_t(0) - Y_{t-1}(0)| D_s=0, G\not=g] $$ In plain English, this assumption states that one can use the units not yet treated by time $s$ ($s \ge t$) as valid comparison groups when computing the average treatment effect for the group first treated in time $g$. In general, this assumption uses more data when constructing comparison groups. However, as noted in Marcus and Sant’Anna (2021), this assumption does restrict some pre-treatment trends across different groups. In other words, there is no free lunch.

Group-Time Average Treatment Effects

The above assumptions are natural extensions of the identifying assumptions in the two periods and two groups case to the multiple periods case.

Likewise, a natural way to generalize the parameter of interest (the ATT) from the two periods and two groups case to the multiple periods case is to define group-time average treatment effects:

$$ ATT(g,t) = E[Y_t(g) - Y_t(0) | G=g] $$

This is the average effect of participating in the treatment for units in group $g$ at time period $t$. Notice that when there are two time periods and two groups (the canonical case), the average treatment effect on the treated is given by $ATT = ATT(g=2,t=2)$.

To give a couple more examples, suppose that a researcher has access to three time periods. Then, $ATT(g=2,t=3)$ is the average effect of participating in the treatment for the group of units that become treated in time period 2, in time period 3. Similarly, $ATT(g=3,t=3)$ is the average effect of participating in the treatment for the group of units that become treated in time period 3, in time period 3.

Identification of Group-Time Average Treatment Effects

Under either version of the parallel trends assumptions mentioned above, it is straightforward to show that group-time average treatment effects are identified. For instance, when one imposes the parallel trends assumption based on “never-treated units”, we have that, for all $t \ge g$, $$ ATT(g,t) = E[ Y_t - Y_{g-1}| G=g] - E[ Y_t - Y_{g-1}| C=1]. $$ Alternatively, when one imposes the parallel trends assumption based on “not-yet-treated units”, we have that, for all $t \ge g$, $$ ATT(g,t) = E[ Y_t - Y_{g-1}| G=g] - E[ Y_t - Y_{g-1}| D_t=0, G\not=g]. $$
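For the never-treated case, the argument mirrors the two-period one, with the one-period parallel trends conditions chained together from period $g$ up to period $t$. A sketch, for $t \ge g$:

$$
\begin{aligned}
ATT(g,t) &= E[Y_t(g) - Y_t(0) | G=g] \\
&= E[Y_t(g) - Y_{g-1}(0) | G=g] - E[Y_t(0) - Y_{g-1}(0) | G=g] \\
&= E[Y_t - Y_{g-1} | G=g] - \sum_{s=g}^{t} E[Y_s(0) - Y_{s-1}(0) | G=g] \\
&= E[Y_t - Y_{g-1} | G=g] - \sum_{s=g}^{t} E[Y_s(0) - Y_{s-1}(0) | C=1] \\
&= E[Y_t - Y_{g-1} | G=g] - E[Y_t - Y_{g-1} | C=1]
\end{aligned}
$$

The third line uses no anticipation (group $g$ is still untreated in period $g-1$) and writes the long difference as a telescoping sum; the fourth applies the parallel trends assumption to each term of the sum; the last collapses the sum again and uses the fact that never-treated units always reveal their untreated potential outcomes.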

These group-time average treatment effects are the building blocks of understanding the effect of participating in a treatment in DiD designs with multiple time periods.
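The sample analogues of these expressions are differences of group means of the long difference $Y_t - Y_{g-1}$. The sketch below computes them on a hypothetical balanced panel with three periods. The data, the coding of never-treated units as `math.inf`, and the function name `att_gt` are assumptions for this illustration; the function is not the implementation used in the did package.

```python
import math
from statistics import mean

# Hypothetical panel: unit -> (first treatment period G, outcomes Y_1..Y_3).
panel = {
    "a": (2, [1.0, 4.0, 6.0]),
    "b": (2, [2.0, 5.0, 7.0]),
    "c": (3, [1.0, 2.0, 5.0]),
    "d": (math.inf, [0.0, 1.0, 2.0]),
    "e": (math.inf, [3.0, 4.0, 5.0]),
}


def att_gt(data, g, t, comparison="never"):
    """Estimate ATT(g,t) for t >= g from the long difference Y_t - Y_{g-1}."""

    def change(y):
        # Periods are 1-indexed, list positions are 0-indexed.
        return y[t - 1] - y[g - 2]

    treated = [change(y) for first, y in data.values() if first == g]
    if comparison == "never":
        controls = [change(y) for first, y in data.values() if first == math.inf]
    else:
        # Not yet treated by period t (D_t = 0); since t >= g this excludes group g.
        controls = [change(y) for first, y in data.values() if first > t]
    return mean(treated) - mean(controls)


for g, t in [(2, 2), (2, 3), (3, 3)]:
    print(g, t, att_gt(panel, g, t), att_gt(panel, g, t, comparison="notyet"))
```

With the not-yet-treated comparison, the unit first treated in period 3 contributes to the comparison group for $ATT(2,2)$ but not for $ATT(2,3)$, since it has already been treated by period 3.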

In many cases, the parallel trends assumption is substantially more plausible if it holds after conditioning on observed pre-treatment covariates. In that case, the parallel trends assumptions are modified as follows.

Conditional Parallel Trends Assumption based on never-treated units

For all $g=2,...,\mathcal{T}$, $t=2,...,\mathcal{T}$ with $t \ge g$, $$ E[ Y_t(0) - Y_{t-1}(0) |X, G=g] = E[ Y_t(0) - Y_{t-1}(0)| X, C=1] $$

Conditional Parallel Trends Assumption based on not-yet treated units

For all $g=2,...,\mathcal{T}$, $s,t=2,...,\mathcal{T}$ with $t \ge g$ and $s \ge t$, $$ E[ Y_t(0) - Y_{t-1}(0) | X, G=g] = E[ Y_t(0) - Y_{t-1}(0)|X, D_s=0, G\not=g] $$

These parallel trends assumptions are the conditional analogues of previous ones. Importantly, they allow for covariate-specific trends in outcomes across groups, which can be particularly important in setups where the distribution of covariates varies across groups.

An example of a case where this assumption is attractive is one where a researcher is interested in estimating the effect of participating in job training on earnings. In that case, if the path of earnings (in the absence of participating in job training) depends on things like education, previous occupation, or years of experience (which it almost certainly does), then it would be important to condition on these types of variables in order to make parallel trends more credible.

In this case, the parameters of interest are still often the $ATT(g,t)$s (or their aggregations). It is still straightforward to identify and estimate the $ATT(g,t)$s in this case. Basically, one needs to estimate the change in outcomes for units in the untreated group conditional on $X$, but average out $X$ over the distribution of covariates for individuals in group $g$ to obtain $ATT(g,t)$ (see Callaway and Sant’Anna (2021) and references therein for many more details). In practice, you can use different approaches to recover these parameters. More precisely, you can estimate the $ATT(g,t)$s using outcome regression, inverse probability weighting, or doubly-robust methods. But the did package automates all of this for the user.
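As a rough illustration of the "condition on $X$, then average over group $g$" idea, the sketch below uses a single discrete covariate, so that the outcome regression for comparison units reduces to a cell mean within each covariate value. This is only one simple instance of the outcome-regression approach under stated assumptions: the data are made up, the function name `att_outcome_regression` is hypothetical, and the did package's own estimators are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (in group g?, covariate value x, outcome change Y_t - Y_{g-1}).
rows = [
    (True, "college", 5.0),
    (True, "college", 7.0),
    (True, "no_college", 2.0),
    (False, "college", 3.0),
    (False, "no_college", 1.0),
    (False, "no_college", 1.0),
    (False, "no_college", 1.0),
]


def att_outcome_regression(data):
    # Step 1: mean outcome change of comparison units within each value of x.
    by_x = defaultdict(list)
    for in_g, x, dy in data:
        if not in_g:
            by_x[x].append(dy)
    predicted = {x: mean(changes) for x, changes in by_x.items()}
    # Step 2: subtract the predicted untreated change from each group-g unit's
    # change and average over group g (requires every x seen in group g to
    # also appear among comparison units).
    return mean(dy - predicted[x] for in_g, x, dy in data if in_g)


att_conditional = att_outcome_regression(rows)
att_unconditional = (
    mean(dy for in_g, _, dy in rows if in_g)
    - mean(dy for in_g, _, dy in rows if not in_g)
)
print(att_conditional, att_unconditional)
```

The two estimates differ here because the covariate distribution is not the same in the two groups and outcome changes depend on the covariate.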

Aggregating Group-Time Average Treatment Effects

Group-time average treatment effects are natural parameters to identify in the context of DiD with multiple periods and multiple groups. But in many applications, there may be a lot of them. There are some benefits and costs here. The main benefit is that it is relatively straightforward to think about heterogeneous effects across groups and time using group-time average treatment effects. On the other hand, it can be hard to summarize them (e.g., they are not just a single number).

In our paper, Callaway and Sant’Anna (2021), “Difference-in-Differences with Multiple Time Periods”, we propose a number of ways to aggregate group-time average treatment effects. Here, we will just consider a few important ones that we think applied researchers are most often interested in. First, consider the average effect of participating in the treatment, separately for each group. This is given by

$$ \theta_S(g) = \frac{1}{\mathcal{T} - g + 1} \sum_{t=2}^{\mathcal{T}} \mathbf{1}\{g \leq t\} ATT(g,t). $$

This parameter may be of interest in its own right, since it allows one to highlight treatment effect heterogeneity with respect to treatment adoption period. Furthermore, it is fairly straightforward to further aggregate $\theta_S(g)$ to get an easy-to-interpret overall effect parameter,

$$ \theta^O_S := \sum_{g=2}^{\mathcal{T}} \theta_S(g) P(G=g). $$

$\theta^O_S$ is the overall effect of participating in the treatment across all groups that have ever participated in the treatment. In our view, this is close to being a multi-period analogue of the $ATT$ in the two period case. Thus, if a researcher is constrained to report a single treatment effect summary parameter, we recommend reporting $\theta^O_S$.
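The two aggregations can be sketched in a few lines once group-time effects are available. The inputs below are hypothetical, and the sketch assumes that the group probabilities are computed among ever-treated units so that the weights sum to one; the names `theta_s` and `theta_s_overall` are illustrative only.

```python
# Hypothetical inputs for T = 3 periods.
T = 3
att = {(2, 2): 2.0, (2, 3): 3.0, (3, 3): 2.0}  # ATT(g,t) values
group_size = {2: 2, 3: 1}  # number of units first treated in each period


def theta_s(g):
    """Average of ATT(g,t) over the post-treatment periods t = g, ..., T."""
    return sum(att[(g, t)] for t in range(g, T + 1)) / (T - g + 1)


def theta_s_overall():
    """Group-share weighted average of theta_s(g) over ever-treated groups.

    Assumption of this sketch: shares are taken among ever-treated units.
    """
    n_treated = sum(group_size.values())
    return sum(theta_s(g) * n / n_treated for g, n in group_size.items())


print(theta_s(2), theta_s(3), theta_s_overall())
```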

In DiD setups with multiple periods, it is natural to ask “How do treatment effects vary with elapsed treatment time?” Here, researchers are interested in understanding treatment effect dynamics. This is at the heart of the event-study type of analysis that is widespread in applied work.

In this case, a natural way to aggregate the group-time average treatment effect to highlight treatment effect dynamics is given by

$$ \theta_D(e) := \sum_{g=2}^{\mathcal{T}} \mathbf{1} \{ g + e \leq \mathcal{T} \} ATT(g,g+e) P(G=g | G+e \leq \mathcal{T}). $$

This is the average effect of participating in the treatment for the group of units that have been exposed to the treatment for exactly $e$ time periods.
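A minimal sketch of this event-study aggregation follows, using hypothetical group-time effects and group sizes. For each length of exposure $e$, it keeps the groups observed $e$ periods after treatment and weights them by their share among those groups; the name `theta_d` is illustrative.

```python
# Hypothetical inputs for T = 3 periods.
T = 3
att = {(2, 2): 2.0, (2, 3): 3.0, (3, 3): 2.0}  # ATT(g,t) values
group_size = {2: 2, 3: 1}  # number of units first treated in each period


def theta_d(e):
    """Average effect after e periods of exposure.

    Only groups with g + e <= T are observed at exposure e; at least one
    such group must exist for the average to be defined.
    """
    groups = [g for g in group_size if g + e <= T]
    n = sum(group_size[g] for g in groups)
    return sum(att[(g, g + e)] * group_size[g] / n for g in groups)


print(theta_d(0), theta_d(1))
```

At $e=1$ only the group first treated in period 2 contributes, which shows that the composition of groups can change across values of $e$.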

All of these aggregations are available in the did package and examples with real data are available in our Getting Started with the did Package vignette. In Callaway and Sant’Anna (2021), we also discuss additional aggregation schemes. We encourage you to take a look!

Conclusion

This vignette has covered basic background issues on DiD with multiple periods. Callaway and Sant’Anna (2021) discusses many extensions and these are all provided in the did package as well. See our User Guides for more details.