Session 1: Introduction to Difference-in-Differences

Brantly Callaway

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} } \newcommand{\T}{T}\)

Introduction to Difference-in-Differences

DID Basics in 2 Period Case

Staggered Treatment Adoption

- Issues with Traditional Regression Approaches
- New Approaches

Application/Code for Minimum Wage Policy

Inference

Relaxing the Parallel Trends Assumption

Dealing with More Complicated Treatment Regimes

Alternative Identification Strategies

Additional Workshop Materials: https://bcallaway11.github.io/lsu-workshop/

- Slides, code, etc. for the workshop

References:

Callaway (2023), *Handbook of Labor, Human Resources and Population Economics*

Baker, Callaway, Cunningham, Goodman-Bacon, Sant’Anna (2024), draft posted very soon

Exploit a data structure where the researcher observes:

Multiple periods of data

Some pre-treatment data for all units

Some units become treated while other units remain untreated

(In my view) this particular data setup is a key distinguishing feature of difference-in-differences approaches relative to traditional panel data models (e.g., fixed effects, dynamic panel, etc.)

- This setup also explains why the methods we consider today are often grouped among natural experiment types of methods such as IV or RD.

Running Example: Causal effects of a state-level minimum wage increase on employment

Widely studied using DID identification strategies (Card and Krueger (1994), many others)

For today: very simplified version with (1) no changes in federal minimum wage and (2) “binarized” state minimum wages (i.e., state minimum wage is either above the federal minimum wage or not)

Panel data gives researchers the opportunity to follow the same person, firm, location, etc. over multiple time periods

Having this sort of data seems fundamentally useful for learning about causal effects of some treatment/policy variable.

To see this, recall that the fundamental problem of causal inference is that we can see either a unit’s treated or untreated potential outcome (but not both)

However, in the panel data “natural experiment” setting above, this is not 100% true.

We can see both a unit’s treated and untreated potential outcomes…just at different points in time

This seems extremely useful for learning about causal effects

Modern approaches also typically allow for treatment effect heterogeneity

- That is, that effects of the treatment can vary across different units in potentially complicated ways

This is going to be a major issue in the discussion below

We’ll consider implications for “traditional” regression approaches and how new approaches are designed to handle this

Data:

2 periods: \(t=1\), \(t=2\)

- No one treated until period \(t=2\)
- Some units remain untreated in period \(t=2\)

\(D_{i,t}\) treatment indicator in period \(t\)

2 groups: \(G_i=1\) or \(G_i=0\) (treated and untreated)

Potential Outcomes: \(Y_{i,t}(1)\) and \(Y_{i,t}(0)\)

Observed Outcomes: \(Y_{i,t=2}\) and \(Y_{i,t=1}\)

\[\begin{align*} Y_{i,t=2} = G_i Y_{i,t=2}(1) +(1-G_i)Y_{i,t=2}(0) \quad \textrm{and} \quad Y_{i,t=1} = Y_{i,t=1}(0) \end{align*}\]

Average Treatment Effect on the Treated: \[ATT = \E[Y_{i,t=2}(1) - Y_{i,t=2}(0) | G_i=1]\]

Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group

Notice that: \[\begin{align*} ATT = \underbrace{\E[Y_{i,t=2}(1) | G_i=1]}_{\textrm{Easy}} - \underbrace{\E[Y_{i,t=2}(0) | G_i=1]}_{\textrm{Hard}} \end{align*}\]

With panel data, we can re-write this as

\[\begin{align*} ATT = \color{green}{\E[Y_{i,t=2}(1) - Y_{i,t=1}(0) | G_i=1]} - \color{red}{\E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | G_i=1]} \end{align*}\]

The first term is how outcomes changed over time for the treated group

Notice that: in our “natural experiment” setting, this is a difference between treated and untreated potential outcomes

We can directly estimate this from the data

The second term is how outcomes *would have changed over time* if the treated group had not been treated

- This is not directly observed in the data \(\implies\) we need to make identifying assumptions

- There are many possibilities here:
- Before-after: \(\color{red}{\E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | G_i=1]} = 0\)

- Lagged outcome unconfoundedness: \(\color{red}{\E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | G_i=1]} = \E\Big[ \E[Y_{i,t=2}(0) | Y_{i,t=1}, G_i=0] - Y_{i,t=1}(0) \Big| G_i=1\Big]\)

- Change-in-changes: \(\color{red}{\E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | G_i=1]} = \E\Big[ Q_{Y_{i,t=2}(0)|G_i=0}\big(F_{Y_{i,t=1}(0)|G_i=0}(Y_{i,t=1}(0))\big) - Y_{i,t=1}(0) \Big| G_i=1\Big]\)

- Difference-in-differences: impose the parallel trends assumption stated next

**Parallel Trends Assumption**

\[\color{red}{\E[\Delta Y_i(0) | G_i=1]} = \E[\Delta Y_i(0) | G_i=0]\]

Explanation: Mean path of untreated potential outcomes is the same for the treated group as for the untreated group

Identification: Under PTA, we can identify \(ATT\): \[ \begin{aligned} ATT &= \E[\Delta Y_i | G_i=1] - \E[\Delta Y_i(0) | G_i=1]\\ &= \E[\Delta Y_i | G_i=1] - \E[\Delta Y_i | G_i=0] \end{aligned} \]

\(\implies ATT\) can be recovered as the change in outcomes over time for the treated group (difference 1) minus the change in outcomes over time for the untreated group (difference 2)

The most straightforward approach to estimation is plugin:

\[\widehat{ATT} = \frac{1}{n_1} \sum_{i=1}^n G_i \Delta Y_i - \frac{1}{n_0} \sum_{i=1}^n (1-G_i) \Delta Y_i\]
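As a rough sketch of the plug-in estimator, the R code below simulates a two-period panel and computes the difference in mean outcome changes across groups. The data-generating process, the sample size, and names such as `dy` and `tau` are illustrative assumptions, not part of the workshop materials.

```r
# Illustrative simulation: all names and numbers here are assumptions
set.seed(123)
n   <- 2000
G   <- rbinom(n, 1, 0.4)                # group indicator (1 = treated in t = 2)
eta <- rnorm(n) + G                     # unit fixed effect, correlated with G
tau <- 1 + 0.5 * rnorm(n)               # heterogeneous unit-level effects
y1  <- eta + rnorm(n)                   # period 1: everyone is untreated
y2  <- eta + 0.3 + G * tau + rnorm(n)   # period 2: common trend of 0.3
dy  <- y2 - y1                          # Delta Y_i

# plug-in estimator: difference in mean changes across groups
att_hat  <- mean(dy[G == 1]) - mean(dy[G == 0])
att_true <- mean(tau[G == 1])           # average effect among treated units
c(att_hat = att_hat, att_true = att_true)
```

Levels of the outcome differ across groups (through `eta`), but only the common trend matters for the estimator, which is the content of parallel trends.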

Alternatively, TWFE regression: \[Y_{i,t} = \theta_t + \eta_i + \alpha D_{i,t} + e_{i,t}\]

- Even though it looks like this model has restricted the effect of participating in the treatment to be constant (and equal to \(\alpha\)) across all individuals, TWFE (in this case) is actually robust to treatment effect heterogeneity.

- To see this, notice that (with two periods) the previous regression is equivalent to \[\begin{align*} \Delta Y_{i,t} = \Delta \theta_t + \alpha \Delta D_{i,t} + \Delta e_{i,t} \end{align*}\] This is fully saturated in \(\Delta D_{i,t}\) (which is binary) \(\implies\) \[\begin{align*} \alpha = \E[\Delta Y_{i,t}|G_i=1] - \E[\Delta Y_{i,t}|G_i=0] = ATT \end{align*}\]
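The equivalence can be checked numerically. The sketch below (simulated data; every variable name and parameter value is an assumption made for illustration) compares the TWFE coefficient, the first-difference regression coefficient, and the plug-in estimate in the two-period case.

```r
# Illustrative simulation: all names and numbers here are assumptions
set.seed(42)
n   <- 500
G   <- rbinom(n, 1, 0.5)
eta <- rnorm(n) + G
tau <- 1 + G * rnorm(n)                 # heterogeneous effects
y1  <- eta + rnorm(n)
y2  <- eta + 0.3 + G * tau + rnorm(n)

# long-format panel: D_{i,t} = 1 only for the treated group in period 2
panel <- data.frame(id = rep(1:n, times = 2),
                    t  = rep(1:2, each = n),
                    y  = c(y1, y2),
                    D  = c(rep(0, n), G))

# TWFE regression with unit and time dummies
twfe       <- lm(y ~ D + factor(id) + factor(t), data = panel)
alpha_twfe <- unname(coef(twfe)["D"])

# first-difference regression of Delta Y on the group indicator
fd       <- lm(I(y2 - y1) ~ G)
alpha_fd <- unname(coef(fd)["G"])

# plug-in difference in mean changes
did <- mean((y2 - y1)[G == 1]) - mean((y2 - y1)[G == 0])
c(twfe = alpha_twfe, first_diff = alpha_fd, plug_in = did)
```

All three numbers agree up to floating-point error, even though the simulated effects vary across units.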

It’s easy to make the TWFE regression more complicated:

Multiple time periods

Variation in treatment timing

More complicated treatments

Introducing additional covariates

Unfortunately, the robustness of TWFE regressions to treatment effect heterogeneity does not seem to carry over to these more complicated (and empirically relevant) settings

Much of the recent (mostly negative) literature on TWFE in the context of DID has considered these types of “realistic” settings

Next, we will consider one of these settings: staggered treatment adoption

\(\T\) time periods

Staggered treatment adoption: Units can become treated at different points in time, but once a unit becomes treated, it remains treated.

Examples:

Government policies that roll out in different locations at different times (minimum wage is close to this over short time horizons)

“Scarring” treatments: e.g., job displacement does not typically happen year after year, but rather labor economists think of being displaced as changing a person’s “state” (the treatment is more like: has a person ever been displaced)

Notation:

In math, staggered treatment adoption means: \(D_{i,t-1}=1 \implies D_{i,t}=1\).

\(G_i\) — a unit’s group — the time period that unit becomes treated.

- Under staggered treatment adoption, \(G_i\) fully summarizes a unit’s treatment regime

Define \(U_i=1\) for never-treated units and \(U_i=0\) otherwise.

Notation (cont’d):

- Potential outcomes: \(Y_{i,t}(g)\) — the outcome that unit \(i\) would experience in time period \(t\) if they became treated in period \(g\).

- Untreated potential outcome: \(Y_{i,t}(0)\) — the outcome unit \(i\) would experience in time period \(t\) if they did not participate in the treatment in any period.

- Observed outcome: \(Y_{i,t}=Y_{i,t}(G_i)\)

- No anticipation condition: \(Y_{i,t} = Y_{i,t}(0)\) for all \(t < G_i\) (pre-treatment periods for unit \(i\))

Group-time average treatment effects \[\begin{align*} ATT(g,t) = \E[Y_{i,t}(g) - Y_{i,t}(0) | G_i=g] \end{align*}\]

Explanation: \(ATT\) for group \(g\) in time period \(t\)

Event Study \[\begin{align*} ATT^{es}(e) = \E[ Y_{i,G_i+e}(G_i) - Y_{i,G_i+e}(0) | G_i \in \mathcal{G}_e] \end{align*}\]

where \(\mathcal{G}_e\) is the set of groups observed to have experienced the treatment for \(e\) periods at some point.

Explanation: \(ATT\) when units have been treated for \(e\) periods

Overall ATT

Towards this end: the average treatment effect for unit \(i\) (across its post-treatment time periods) is given by: \[\bar{\tau}_i(g) = \frac{1}{\T - g + 1} \sum_{t=g}^{\T} \Big( Y_{i,t}(g) - Y_{i,t}(0) \Big)\]

Then,

\[\begin{align*} ATT^o = \E[\bar{\tau}_i(G_i) | U_i=0] \end{align*}\]

Explanation: \(ATT\) across all units that ever participate in the treatment

To understand the discussion later, it is also helpful to think of \(ATT(g,t)\) as a building block for the other parameters discussed above. In particular:

Event Study \[\begin{align*} ATT^{es}(e) = \sum_{g \in \mathcal{G}_e} w^{es}(g,e) ATT(g,g+e) \end{align*}\]

Overall ATT \[\begin{align*} ATT^o = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^o(g,t) ATT(g,t) \end{align*}\]

where

\[\begin{align*} w^{es}(g,e) = \P(G_i=g|G_i\in \mathcal{G}_e) \end{align*}\]

and

\[\begin{align*} w^o(g,t) = \frac{\P(G_i=g|U_i=0)}{\T-g+1} \end{align*}\]

In other words, if we can identify/recover \(ATT(g,t)\), then we can proceed to recover \(ATT^{es}(e)\) and \(ATT^o\).
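To make the aggregation concrete, here is a small sketch that applies the weights \(w^{es}(g,e)\) and \(w^o(g,t)\) to a table of group-time effects. The \(ATT(g,t)\) values and group shares are made-up numbers chosen only for illustration, and the object names (`attgt`, `pg`, `TT`) are hypothetical.

```r
# Hypothetical ATT(g,t) values for T = 4 periods and groups g = 2, 3
TT    <- 4
attgt <- data.frame(g   = c(2, 2, 2, 3, 3),
                    t   = c(2, 3, 4, 3, 4),
                    att = c(1.0, 1.5, 2.0, 0.5, 1.0))
# assumed group shares among ever-treated units, P(G = g | U = 0)
pg <- c("2" = 0.6, "3" = 0.4)

attgt$e  <- attgt$t - attgt$g                      # event time
attgt$pg <- unname(pg[as.character(attgt$g)])

# event study: weight ATT(g, g+e) by P(G = g | G in G_e)
att_es <- sapply(split(attgt, attgt$e),
                 function(d) sum(d$att * d$pg / sum(d$pg)))

# overall ATT: weights P(G = g | U = 0) / (T - g + 1)
w_o   <- attgt$pg / (TT - attgt$g + 1)
att_o <- sum(w_o * attgt$att)

att_es   # event-time effects for e = 0, 1, 2
att_o    # overall ATT
```

With these inputs the event-study effects are 0.8, 1.3, and 2.0 for \(e = 0, 1, 2\), and the overall ATT is 1.2; the overall weights sum to one by construction.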

**Multiple Period Version of Parallel Trends Assumption**

For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2,\ldots,\T\), \[\begin{align*} \E[\Delta Y_{i,t}(0) | G_i=g] = \E[\Delta Y_{i,t}(0) | U_i=1] \end{align*}\]

Using very similar arguments as before, can show that \[\begin{align*} ATT(g,t) = \E[Y_{i,t} - Y_{i,g-1} | G_i=g] - \E[Y_{i,t} - Y_{i,g-1} | U_i=1] \end{align*}\]

where the main difference is that we use \((g-1)\) as the base period (this is the period right before group \(g\) becomes treated).
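One way to sketch the argument, for post-treatment periods \(t \geq g\) and writing \(\Delta Y_{i,s}(0) = Y_{i,s}(0) - Y_{i,s-1}(0)\): add and subtract \(Y_{i,g-1}(0)\), use no anticipation to replace potential outcomes with observed outcomes for group \(g\), write the untreated path as a telescoping sum, and apply parallel trends period by period:

\[\begin{align*} ATT(g,t) &= \E[Y_{i,t}(g) - Y_{i,g-1}(0) | G_i=g] - \E[Y_{i,t}(0) - Y_{i,g-1}(0) | G_i=g] \\ &= \E[Y_{i,t} - Y_{i,g-1} | G_i=g] - \sum_{s=g}^{t} \E[\Delta Y_{i,s}(0) | G_i=g] \\ &= \E[Y_{i,t} - Y_{i,g-1} | G_i=g] - \sum_{s=g}^{t} \E[\Delta Y_{i,s}(0) | U_i=1] \\ &= \E[Y_{i,t} - Y_{i,g-1} | G_i=g] - \E[Y_{i,t} - Y_{i,g-1} | U_i=1] \end{align*}\]

The last line uses the fact that observed outcomes for never-treated units are untreated potential outcomes in every period.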

The previous discussion emphasizes a general purpose identification strategy with staggered treatment adoption:

Step 1: Target disaggregated treatment effect parameters (i.e., group-time average treatment effects)

Step 2: (If desired) combine disaggregated treatment effects into lower dimensional summary treatment effect parameter

Notice that:

This amounts to breaking the problem into a set of two-period DID problems and then combining the results

It is also a general purpose strategy in that the same high-level idea is (1) not DID-specific and (2) can (possibly) be applied to more complicated treatment regimes

With staggered treatments, traditionally DID identification strategies have been implemented with two-way fixed effects (TWFE) regressions: \[\begin{align*} Y_{i,t} = \theta_t + \eta_i + \alpha D_{i,t} + e_{i,t} \end{align*}\]

One main contribution of recent work on DID has been to diagnose and understand the limitations of TWFE regressions for implementing DID

Goodman-Bacon (2021) intuition: \(\alpha\) “comes from” comparisons between the path of outcomes for units whose treatment status changes relative to the path of outcomes for units whose treatment status stays the same over time.

Some comparisons are for groups that become treated to not-yet-treated groups 👍

Other comparisons are for groups that become treated relative to already-treated groups 👎

- This can be especially problematic when there are treatment effect dynamics. Dynamics imply different trends from what would have happened absent the treatment.

de Chaisemartin and D’Haultfœuille (2020) intuition: You can write \(\alpha\) as a weighted average of \(ATT(g,t)\)

First, a decomposition: \[\begin{align*} \alpha &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^{TWFE}(g,t) \Big( \E[(Y_{i,t} - Y_{i,g-1}) | G_i=g] - \E[(Y_{i,t} - Y_{i,g-1}) | U_i=1] \Big) \\ & + \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{g-1} w^{TWFE}(g,t) \Big( \E[(Y_{i,t} - Y_{i,g-1}) | G_i=g] - \E[(Y_{i,t} - Y_{i,g-1}) | U_i=1] \Big) \end{align*}\]

Second, under parallel trends:

\[\begin{align*}
\alpha = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^{TWFE}(g,t) ATT(g,t)
\end{align*}\]

But the weights are (non-transparently) driven by the estimation method

These weights have some good / bad / strange properties such as possibly being negative
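A tiny noise-free example can illustrate the problem. The design below is an assumption made purely for illustration: three periods, an early-treated and a late-treated group, no never-treated units, and an effect that grows with exposure. Parallel trends holds exactly, yet the TWFE coefficient does not resemble any of the true effects.

```r
# Noise-free staggered panel (illustrative design, not the workshop data)
panel    <- expand.grid(id = 1:4, t = 1:3)
panel$g  <- ifelse(panel$id <= 2, 2, 3)           # units 1-2 early, 3-4 late
panel$D  <- as.numeric(panel$t >= panel$g)        # staggered adoption
panel$te <- panel$D * (panel$t - panel$g + 1)     # effect: 1 on impact, then 2
panel$y  <- panel$id + 0.3 * panel$t + panel$te   # Y(0) = eta_i + theta_t

twfe  <- lm(y ~ D + factor(id) + factor(t), data = panel)
alpha <- unname(coef(twfe)["D"])

# true effects: ATT(2,2) = 1, ATT(2,3) = 2, ATT(3,3) = 1
att_o <- 0.5 * mean(c(1, 2)) + 0.5 * 1            # overall ATT = 1.25
c(alpha = alpha, att_o = att_o)
```

Here \(\alpha\) works out to 0.5, below every true \(ATT(g,t)\). The implied weights on \(ATT(2,2)\), \(ATT(2,3)\), and \(ATT(3,3)\) are 1, -0.5, and 0.5: the early group's period-3 observations serve as a comparison for the late group and so enter with a negative weight.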

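We’ll discuss four alternatives: Callaway and Sant’Anna (CS), Sun and Abraham (2021) (SA), Wooldridge, and the imputation approach of Gardner and of Borusyak, Jaravel, and Spiess (Gardner/BJS)

Callaway and Sant’Anna (CS) intuition: Directly implement the identification result discussed above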

- Under parallel trends, recall that

\[\begin{align*} ATT(g,t) = \E[Y_{i,t} - Y_{i,g-1} | G_i=g] - \E[Y_{i,t} - Y_{i,g-1} | U_i=1] \end{align*}\]

Estimation:

\[\begin{align*}\widehat{ATT}^{CS}(g,t) = \frac{1}{n_g}\sum_{i=1}^n \indicator{G_i = g}(Y_{i,t} - Y_{i,g-1}) - \frac{1}{n_U}\sum_{i=1}^n \indicator{U_i = 1} (Y_{i,t} - Y_{i,g-1}) \end{align*}\]

2nd step: Recall that group-time average treatment effects are building blocks for more aggregated parameters such as \(ATT^{es}(e)\) and \(ATT^o\) \(\implies\) just plug in

- \(\implies\) two-step estimation procedure: target local/disaggregated \(ATT(g,t)\) in first step, then (if desired) aggregate them into lower dimensional parameters
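The first step can be sketched in a few lines of base R. This is a hand-rolled illustration on simulated data, not the implementation in any package; the data-generating process and the names `Y`, `att_cs`, and `res` are assumptions. Following the convention in the data used later, `G = 0` marks never-treated units.

```r
# Illustrative simulation: all names and numbers here are assumptions
set.seed(1)
n  <- 300
TT <- 4
G     <- sample(c(0, 2, 3), n, replace = TRUE)    # 0 = never treated
eta   <- rnorm(n) + 0.5 * (G > 0)                 # unit effects differ by group
theta <- c(0, 0.2, 0.4, 0.6)                      # common time effects
# wide outcome matrix Y[i, t]; true effect is 1 + (t - g) once treated
Y <- sapply(1:TT, function(t) {
  eta + theta[t] + (G > 0 & t >= G) * (1 + (t - G)) + rnorm(n, sd = 0.5)
})

# ATT^CS(g,t): difference in mean long differences from base period g - 1
att_cs <- function(g, t) {
  long_diff <- Y[, t] - Y[, g - 1]
  mean(long_diff[G == g]) - mean(long_diff[G == 0])
}

res <- expand.grid(g = c(2, 3), t = 1:TT)
res <- res[res$t >= res$g, ]                      # post-treatment (g, t) pairs
res$att_hat  <- mapply(att_cs, res$g, res$t)
res$att_true <- 1 + (res$t - res$g)
res
```

The resulting table of \(\widehat{ATT}(g,t)\) values is the input to the second step, where the estimates are averaged with the weights defined earlier.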

Sun and Abraham (2021) (SA) intuition: The paper studies event-study versions of the TWFE regressions discussed above:

\[\begin{align*} Y_{i,t} = \theta_t + \eta_i + \sum_{e=-(\T-1)}^{-2} \beta_e D_{i,t}^e + \sum_{e=0}^{\T} \beta_e D_{i,t}^e + e_{i,t} \end{align*}\]

and shows that they have similar issues. In particular, the event study regression is “underspecified” \(\implies\) heterogeneous effects can “confound” the treatment effect estimates

Solution: Run fully interacted regression: \[\begin{align*} Y_{i,t} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{e \neq -1} \delta^{SA}_{ge} \indicator{G_i=g} \indicator{g+e=t} + e_{i,t} \end{align*}\]

2nd step: Aggregate \(\delta^{SA}_{ge}\)’s across groups (usually into an event study).

This sidesteps issues with the event study regression coming from treatment effect heterogeneity

For inference, need to account for two-step estimation procedure

Wooldridge intuition: Are the issues in the DID literature due to limitations of TWFE regressions per se or due to *misspecification* of the TWFE regression?

Solution: Proposes running “more interacted” TWFE regression:

\[\begin{align*} Y_{i,t} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{s=g}^{\T} \alpha_{gs}^W \indicator{G_i=g, t=s} + e_{i,t} \end{align*}\]

This is quite similar to Sun and Abraham (2021) except that it does not include interactions in pre-treatment periods. [The differences about \((g,t)\) relative to \((g,e)\) are trivial.]

Like SA, this provides robustness to treatment effect heterogeneity by including more interactions

Like SA, unless one is mainly interested in \(ATT(g,t)\), a second-step aggregation is needed, which (arguably) gives up the “killer feature” of the TWFE regression in the first place

Gardner/BJS (imputation) intuition: Parallel trends is closely connected to a TWFE model *for untreated potential outcomes* \[Y_{i,t}(0) = \theta_t + \eta_i + e_{i,t}\]

Estimation:

Step 1: Split data into treated and untreated observations

Step 2: Estimate above model for the set of untreated observations

Step 3: “Impute” \(\hat{Y}_{i,t}(0) = \hat{\theta}_t + \hat{\eta}_i\) for the treated observations

\(\displaystyle \widehat{ATT}^{G/BJS}(g,t) = \frac{1}{n_g} \sum_{i=1}^n \indicator{G_i=g}\Big(Y_{i,t} - \hat{Y}_{i,t}(0)\Big) \xrightarrow{p} ATT(g,t)\)

Can compute other treatment effect parameters too (e.g., event study or overall average treatment effect)
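The three steps can be sketched with `lm` and `predict` in base R. This is a minimal illustration on simulated data, not the code of any imputation package; the data-generating process and names such as `fit0`, `te_hat`, and `att_imp` are assumptions.

```r
# Illustrative simulation: all names and numbers here are assumptions
set.seed(2)
n  <- 300
TT <- 4
G   <- sample(c(0, 2, 3), n, replace = TRUE)      # 0 = never treated
eta <- rnorm(n) + 0.5 * (G > 0)
panel <- data.frame(id = rep(1:n, times = TT), t = rep(1:TT, each = n),
                    G = rep(G, times = TT))
panel$D <- as.numeric(panel$G > 0 & panel$t >= panel$G)
panel$y <- eta[panel$id] + 0.2 * panel$t +
  panel$D * (1 + panel$t - panel$G) + rnorm(nrow(panel), sd = 0.5)
panel$id_f <- factor(panel$id)
panel$t_f  <- factor(panel$t)

# Steps 1-2: fit the TWFE model for Y(0) on untreated observations only
untreated <- panel[panel$D == 0, ]
fit0 <- lm(y ~ id_f + t_f, data = untreated)

# Step 3: impute Y(0) for treated observations, then average by (g, t)
treated <- panel[panel$D == 1, ]
treated$te_hat <- treated$y - predict(fit0, newdata = treated)
att_imp <- aggregate(te_hat ~ G + t, data = treated, FUN = mean)
att_imp$att_true <- 1 + att_imp$t - att_imp$G
att_imp
```

Every treated unit has at least one untreated period and the never-treated units cover every period, so all unit and time effects needed for imputation are estimable from the untreated subsample.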

In my view, all of the approaches discussed above are fundamentally similar to each other.

In practice, it is sometimes possible to get different results though this is often driven by

Different estimation strategies trading off efficiency and robustness in different ways

Different choices in terms of default implementation details in computer code

In post-treatment periods, CS and SA give numerically identical results: \(\widehat{ATT}^{CS}(g,t) = \hat{\delta}^{SA}_{g,t-g}\)

- This is because a fully interacted regression (SA) is equivalent to taking differences in averages across groups (CS)
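This numerical equivalence can be checked on simulated data. The sketch below assumes a balanced panel, a never-treated comparison group, and \(g-1\) as the reference period; the interacted regression is built by hand with `lm` rather than with a dedicated package, and all names are illustrative.

```r
# Illustrative simulation: all names and numbers here are assumptions
set.seed(3)
n  <- 90
TT <- 4
G   <- rep(c(0, 2, 3), each = n / 3)              # 0 = never treated
eta <- rnorm(n)
Y <- sapply(1:TT, function(t) {
  eta + 0.2 * t + (G > 0 & t >= G) * (1 + t - G) + rnorm(n)
})
panel <- data.frame(id = rep(1:n, times = TT), t = rep(1:TT, each = n),
                    G = rep(G, times = TT), y = as.vector(Y))

# SA-style regression: one dummy per (group, period) cell, omitting the
# never-treated units and the reference period t = g - 1 of each group
panel$cell <- ifelse(panel$G == 0 | panel$t == panel$G - 1, "ref",
                     paste0("g", panel$G, "_t", panel$t))
panel$cell <- relevel(factor(panel$cell), ref = "ref")
sa <- lm(y ~ factor(id) + factor(t) + cell, data = panel)

# CS-style estimate: difference in mean long differences from g - 1
att_cs <- function(g, t) {
  d <- Y[, t] - Y[, g - 1]
  mean(d[G == g]) - mean(d[G == 0])
}

post <- data.frame(g = c(2, 2, 2, 3, 3), t = c(2, 3, 4, 3, 4))
post$cs <- mapply(att_cs, post$g, post$t)
post$sa <- unname(coef(sa)[paste0("cellg", post$g, "_t", post$t)])
post
```

The `cs` and `sa` columns match up to floating-point error, since the interacted regression is saturated in group-by-period cells.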

In pre-treatment periods, code will give different pre-treatment estimates, but this is due to different default choices

In SA, all results are relative to a fixed base period (typically the period right before treatment)

In CS, by default, in pre-treatment periods, estimates are of placebo policy effects on impact (i.e., the base period is always the most recent pre-treatment period)

SA and Wooldridge are clearly closely related, with the difference amounting to whether or not one includes indicators for pre-treatment periods.

It is fair to see this as a way to trade off robustness and efficiency

If parallel trends holds across all time periods, then Wooldridge will tend to deliver more efficient estimates (as effectively all pre-treatment periods are used as base periods)

If parallel trends is violated in some pre-treatment periods but holds post-treatment, Wooldridge estimates will be inconsistent, but SA estimates will be robust to violations of parallel trends in pre-treatment periods.

Wooldridge and Gardner/BJS give numerically the same estimates: \(\hat{\alpha}^W_{gt} = \widehat{ATT}^{G/BJS}(g,t)\)

Intuition: Including full set of interactions is equivalent to estimating separate models by groups
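A quick numerical check of this claim, again on a simulated balanced panel with a never-treated group. Both estimators are written by hand with `lm`; the data-generating process and all names are assumptions.

```r
# Illustrative simulation: all names and numbers here are assumptions
set.seed(4)
n  <- 90
TT <- 4
G  <- rep(c(0, 2, 3), each = n / 3)               # 0 = never treated
panel <- data.frame(id = rep(1:n, times = TT), t = rep(1:TT, each = n),
                    G = rep(G, times = TT))
panel$D <- as.numeric(panel$G > 0 & panel$t >= panel$G)
panel$y <- rnorm(n)[panel$id] + 0.2 * panel$t +
  panel$D * (1 + panel$t - panel$G) + rnorm(nrow(panel))
panel$id_f <- factor(panel$id)
panel$t_f  <- factor(panel$t)

# Wooldridge-style regression: TWFE plus one dummy per post-treatment cell
panel$cell <- ifelse(panel$D == 1, paste0("g", panel$G, "_t", panel$t), "ref")
panel$cell <- relevel(factor(panel$cell), ref = "ref")
w <- lm(y ~ id_f + t_f + cell, data = panel)

# Imputation: fit the Y(0) model on untreated observations, impute for treated
fit0    <- lm(y ~ id_f + t_f, data = panel[panel$D == 0, ])
treated <- panel[panel$D == 1, ]
treated$te_hat <- treated$y - predict(fit0, newdata = treated)
imp <- aggregate(te_hat ~ G + t, data = treated, FUN = mean)
imp$wooldridge <- unname(coef(w)[paste0("cellg", imp$G, "_t", imp$t)])
imp
```

The `te_hat` and `wooldridge` columns agree up to floating-point error in this balanced design.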

The above discussion emphasizes the conceptual similarities between different proposed alternatives to TWFE regressions in the literature.

The other major source of differences in estimates across procedures is different default options in software implementations. Examples:

- Overall average treatment effects
  - CS emphasizes the “overall” treatment effects discussed above
  - Default implementations of imputation run a regression of \(Y_{i,t}-\hat{Y}_{i,t}(0)\) on \(D_{i,t}\) which delivers the “simple” overall average treatment effect which just averages all available treatment effects

- Use county-level data from 2003-2007 during a period where the federal minimum wage was flat

Exploit minimum wage changes across states

- Any state that increases their minimum wage above the federal minimum wage will be considered as treated

- Interested in the effect of the minimum wage on teen employment

- We’ll also make a number of simplifications:
  - not worry much about issues like clustered standard errors
  - not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states

Goals:

Get some experience with an application and DID-related code

Assess how much the issues that we have been talking about matter in practice

Full code is available on GitHub.

R packages used in empirical example

```
# drops NE region and a couple of small groups
mw_data_ch2 <- subset(mw_data_ch2, (G %in% c(2004,2006,2007,0)) & (region != "1"))
head(mw_data_ch2[,c("id","year","G","lemp","lpop","lavg_pay","region")])
```

```
      id year    G     lemp     lpop lavg_pay region
554 8003 2001 2007 5.556828 9.614137 10.05750      4
555 8003 2002 2007 5.356586 9.623972 10.09712      4
556 8003 2003 2007 5.389072 9.620859 10.10761      4
557 8003 2004 2007 5.356586 9.626548 10.14034      4
558 8003 2005 2007 5.303305 9.637958 10.17550      4
559 8003 2006 2007 5.342334 9.633056 10.21859      4
```
