Difference-in-Differences

Recent Methodological Advances and their Relevance to Empirical Work

Brantly Callaway

University of Georgia

August 4, 2025

Plan for the Talk

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} } \newcommand{\T}{T} \newcommand{\ATT}{\text{ATT}}\)

Panel Data Causal Inference: Challenges and Opportunities
DiD with Two Periods
Staggered Treatment Adoption
- Issues with Traditional Regression Approaches
- New Approaches
Application + Code

Additional Resources

Additional Materials: https://bcallaway11.github.io/camp-resources/

Slides, code, data, etc.

References:

Baker et al. (2025), forthcoming at Journal of Economic Literature
Callaway (2023), Handbook of Labor, Human Resources, and Population Economics

Additional Resources

Advanced Materials: https://github.com/bcallaway11/bank-of-portugal

Relaxing the parallel trends assumption by including covariates
Common issues in empirical work
Dealing with more complicated treatment regimes
Alternative identification strategies (e.g., conditioning on lagged outcome, change-in-changes, others)

Running Example

Running Example: Causal effect of a state-level minimum wage increase on employment

Widely studied using DiD identification strategies (Card and Krueger (1994), many others)
For today: very simplified version with
1. No changes in federal minimum wage
2. “Binarized” state minimum wages (i.e., state minimum wage is either above the federal minimum wage or not)

Part 1

Panel Data Causal Inference: Challenges and Opportunities

Research Design

Research Design: The setting that the researcher will use to estimate causal effects.

Exploit a data structure where the researcher observes:

Multiple periods of data
Some pre-treatment data for all units
Some units become treated while other units remain untreated

This research design is a key distinguishing feature of modern approaches to panel data causal inference relative to traditional panel data models

It allows for explicit comparisons between treated units and a comparison group.
It also explains why the methods we consider today are often grouped among quasi-experimental methods such as IV or RD.

Identification Strategy

Identification Strategy: A target parameter and a set of assumptions that allow the researcher to recover the target parameter

Comparison to IV and RD

IV and RD are closely connected to natural experiments where treatment, though not controlled by the researcher, is (usually locally) randomly assigned.

draft lottery numbers
cutoff scores on standardized tests

This implies that

The research design is simply to exploit the natural experiment
The identification strategy formalizes the natural experiment
- Exclusion restriction, exogeneity, monotonicity are (almost) properties of the natural experiment
- There are not really alternative identification strategies for the same research design

Panel Data Natural Experiments?

Panel data causal inference methods are often used in settings where there is no explicit natural experiment:

e.g., some locations implement a policy while others do not

This implies that

Auxiliary assumptions, such as parallel trends, play an important role in the identification strategy
- These assumptions are not implied by the research design
- One could imagine making alternative auxiliary assumptions instead
- Panel data causal inference methods are often referred to as “model-based” because assumptions like parallel trends effectively involve a model for the outcome.
(Probably) less credible than methods that are based on natural experiments

Why Are Panel Data Approaches Popular?

Availability – Experiments and/or natural experiments are often not available

Allow for within-unit comparisons
- Recall that the fundamental problem of causal inference is that we can see either a unit’s treated or its untreated potential outcome (but not both)
- However, in the panel data research design, this is not 100% true.
  - We can see both a unit’s treated and untreated potential outcomes…just at different points in time
  - Many approaches can be seen as comparing outcomes for the same unit before and after treatment and making some adjustment for “time effects”
- Related to allowing for certain forms of selection on unobservables

Pre-Testing – If you have panel data, you often have extra pre-treatment data that can be used to “validate” the identification strategy in pre-treatment periods
- e.g., event studies, which are extremely common in empirical work

Deryugina (2017)

Treatment Effect Heterogeneity

Modern approaches also typically allow for treatment effect heterogeneity

That is, the effects of the treatment can vary across different units in potentially complicated ways

This is going to be a major issue in the discussion below

We’ll consider implications for “traditional” regression approaches and how new approaches are designed to handle this

Forward-Engineering

Forward-Engineering: Identification first, then estimation

i.e., prioritize clearly defining target parameters and arguments for identifying assumptions
if possible, use estimators that directly implement identification strategy

Reverse-Engineering: Prioritize estimation (for economists, often some form of regression)

Notation for Setting with Two Periods

Data:

2 periods: \(t=1\), \(t=2\)
- No one treated until period \(t=2\)
- Some units remain untreated in period \(t=2\)
\(D_{it}\) treatment indicator in period \(t\)
2 groups: \(G_i=1\) or \(G_i=0\) (treated and untreated)

Potential Outcomes: \(Y_{it}(1)\) and \(Y_{it}(0)\)

Observed Outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)

\[\begin{align*} Y_{it=2} = G_i Y_{it=2}(1) +(1-G_i)Y_{it=2}(0) \quad \textrm{and} \quad Y_{it=1} = Y_{it=1}(0) \end{align*}\]

Target Parameter

Average Treatment Effect on the Treated: \[\ATT = \E[Y_{t=2}(1) - Y_{t=2}(0) | G=1]\]

Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group

How to Use Panel Data to Learn about \(\mathbf{\ATT}\)

\[\begin{align*} \ATT = \color{#4B8B3B}{\underbrace{\E[Y_{t=2}(1) | G=1]}_{\textrm{Easy}}} - \color{#BA0C2F}{\underbrace{\E[Y_{t=2}(0) | G=1]}_{\textrm{Hard}}} \end{align*}\]

With panel data, we can re-write this (adding and subtracting \(\E[Y_{t=1}(0) | G=1]\)) as

\[\begin{align*} \ATT = \color{#4B8B3B}{\E[Y_{t=2}(1) - Y_{t=1}(0) | G=1]} - \color{#BA0C2F}{\E[Y_{t=2}(0) - Y_{t=1}(0) | G=1]} \end{align*}\]

The first term is how outcomes changed over time for the treated group

We can directly estimate this from the data

The second term is how outcomes would have changed over time if the treated group had not been treated

Not directly observed in the data \(\implies\) we need to make identifying assumptions

There are many possibilities here:
1. Before-after: \(\color{#BA0C2F}{\E[Y_{t=2}(0) - Y_{t=1}(0) | G=1]} = 0\)

1. Lagged outcome unconfoundedness: \(\color{#BA0C2F}{\E[Y_{t=2}(0) - Y_{t=1}(0) | G=1]} = \E\Big[ \E[Y_{t=2}(0) - Y_{t=1}(0) | Y_{t=1}, G=0] \Big| G=1\Big]\)

1. Change-in-changes: \(\color{#BA0C2F}{\E[Y_{t=2}(0) - Y_{t=1}(0) | G=1]} = \E\Big[ Q_{Y_{t=2}(0)|G=0}\big(F_{Y_{t=1}(0)|G=0}(Y_{t=1}(0))\big) - Y_{t=1}(0) \Big| G=1\Big]\)

1. Difference-in-differences (parallel trends): the focus of Part 2
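To make the comparison of identifying assumptions concrete, here is a minimal R sketch on simulated data. The data-generating process is hypothetical (unit fixed effects that differ by group, a common time effect of 1, and a true ATT of 2), and the lagged-outcome approach is approximated with a linear regression, which is a simplification. Parallel trends holds by construction in this design, so DiD should land near 2, while before-after also absorbs the time effect and the other strategies need not agree with DiD.

```r
# Simulated two-period panel (hypothetical data-generating process)
set.seed(1)
n    <- 4000
g    <- rbinom(n, 1, 0.5)       # G_i: 1 = treated in period 2
eta  <- rnorm(n, mean = g)      # unit fixed effect, higher on average for treated units
y1   <- eta + rnorm(n)          # Y_{t=1} = Y_{t=1}(0) for everyone
y2_0 <- eta + 1 + rnorm(n)      # Y_{t=2}(0): common time effect of 1
y2   <- y2_0 + 2 * g            # observed Y_{t=2}; true ATT = 2
dy   <- y2 - y1

# "Easy" term: E[Y_2(1) - Y_1(0) | G = 1], directly estimable
easy <- mean(dy[g == 1])

# Candidate values for the "hard" term E[Y_2(0) - Y_1(0) | G = 1]
trend_ba  <- 0                                   # before-after
lo_fit    <- lm(dy ~ y1, subset = g == 0)        # lagged-outcome unconfoundedness (linear fit)
trend_lo  <- mean(predict(lo_fit, newdata = data.frame(y1 = y1[g == 1])))
ranks     <- ecdf(y1[g == 0])(y1[g == 1])        # change-in-changes: ranks in untreated period-1 dist.
trend_cic <- mean(quantile(y2[g == 0], probs = ranks, type = 1, names = FALSE) - y1[g == 1])
trend_did <- mean(dy[g == 0])                    # difference-in-differences

att_hat <- easy - c(before_after = trend_ba, lagged_outcome = trend_lo,
                    change_in_changes = trend_cic, did = trend_did)
print(round(att_hat, 2))
```

Every strategy shares the same "easy" term; they differ only in how the missing counterfactual trend for the treated group is filled in.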

Part 2

DiD with Two Periods

DiD with Two Periods

Parallel Trends Assumption

\[\color{#BA0C2F}{\E[\Delta Y(0) | G=1]} = \color{#336699}{\E[\Delta Y(0) | G=0]}\]

Explanation: Mean path of untreated potential outcomes is the same for the treated group as for the untreated group

Identification: Under PTA, we can identify \(\ATT\): \[ \begin{aligned} \ATT &= \color{#4B8B3B}{\E[\Delta Y | G=1]} - \color{#BA0C2F}{\E[\Delta Y(0) | G=1]}\\ &= \color{#4B8B3B}{\E[\Delta Y | G=1]} - \color{#336699}{\E[\Delta Y | G=0]} \end{aligned} \]

\(\implies \ATT\) is identified: it can be recovered as the difference in outcomes over time for the treated group (difference 1) relative to the difference in outcomes over time for the untreated group (difference 2)

[Why is parallel trends a common assumption in economics?]

Estimation

The most straightforward approach to estimation is the plugin estimator:

\[\widehat{\ATT} = \frac{1}{n_1} \sum_{i=1}^n G_i \Delta Y_i - \frac{1}{n_0} \sum_{i=1}^n (1-G_i) \Delta Y_i\]
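The plugin estimator is just a difference of two sample means of outcome changes. A minimal R sketch follows; the function name `did_plugin` and the toy numbers are made up for illustration.

```r
# Plug-in DiD estimator for a two-period panel.
# dy: change in outcomes Y_{t=2} - Y_{t=1}; g: 1 for treated units, 0 otherwise
did_plugin <- function(dy, g) {
  n1 <- sum(g == 1)
  n0 <- sum(g == 0)
  sum(g * dy) / n1 - sum((1 - g) * dy) / n0
}

# Toy example: treated units' outcomes rise by 3 on average, untreated units' by 1
dy <- c(2, 4, 3, 1, 0, 2)
g  <- c(1, 1, 1, 0, 0, 0)
did_plugin(dy, g)
```

In the toy data the mean change is 3 for treated units and 1 for untreated units, so the estimate is 2.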

Estimation

An alternative approach is to use a TWFE regression: \[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it}\]

Even though it looks like this model has restricted the effect of participating in the treatment to be constant (and equal to \(\alpha\)) across all individuals, TWFE (in this case) is actually robust to treatment effect heterogeneity.

To see this, notice that (with two periods) the previous regression is equivalent to \[\begin{align*} \Delta Y_{it} = \Delta \theta_t + \alpha \Delta D_{it} + \Delta e_{it} \end{align*}\] This is fully saturated in \(\Delta D_{it}\) (which is binary) \(\implies\) \[\begin{align*} \alpha = \E[\Delta Y_{t}|G=1] - \E[\Delta Y_{t}|G=0] = \ATT \end{align*}\]
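The equivalence can be checked numerically. The sketch below simulates a two-period panel with heterogeneous treatment effects (a hypothetical design) and compares the TWFE coefficient, estimated here with `lm` and explicit unit and time dummies rather than a dedicated fixed effects routine, with the first-difference regression and the plain difference in mean changes.

```r
set.seed(2)
n   <- 500
g   <- rbinom(n, 1, 0.4)
eta <- rnorm(n, mean = g)                 # unit fixed effects correlated with treatment
tau <- rnorm(n, mean = 1 + g, sd = 2)     # heterogeneous unit-level effects (hypothetical)
y1  <- eta + rnorm(n)
y2  <- eta + 0.5 + tau * g + rnorm(n)

# Long-format panel with D_{it} = 1 only for treated units in period 2
panel <- data.frame(
  id = factor(rep(seq_len(n), 2)),
  t  = factor(rep(1:2, each = n)),
  y  = c(y1, y2),
  d  = c(rep(0, n), g)
)

alpha_twfe <- coef(lm(y ~ d + id + t, data = panel))[["d"]]   # TWFE via unit and time dummies
alpha_fd   <- coef(lm(I(y2 - y1) ~ g))[["g"]]                  # first-difference regression
did_means  <- mean((y2 - y1)[g == 1]) - mean((y2 - y1)[g == 0])
c(twfe = alpha_twfe, first_diff = alpha_fd, diff_in_means = did_means)
```

All three numbers coincide up to floating point error, even though the unit-level effects vary widely.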

TWFE Regression

It’s easy to make the TWFE regression more complicated:

Multiple time periods
Variation in treatment timing
More complicated treatments
Introducing additional covariates

Unfortunately, the robustness of TWFE regressions to treatment effect heterogeneity does not seem to carry over to these more complicated (and empirically relevant) settings

Much of the recent (mostly negative) literature on TWFE in the context of DiD has considered these types of “realistic” settings
Next, we will consider one of these settings: staggered treatment adoption

Part 3

Staggered Treatment Adoption

Setup with Staggered Treatment Adoption

\(\T\) time periods

Staggered treatment adoption: Units can become treated at different points in time, but once a unit becomes treated, it remains treated.

Examples:

Government policies that roll out in different locations at different times (minimum wage is close to this over short time horizons)
“Scarring” treatments: e.g., job displacement does not typically happen year after year, but rather labor economists think of being displaced as changing a person’s “state” (the treatment is more like: has a person ever been displaced)

Setup with Staggered Treatment Adoption

Notation:

In math, staggered treatment adoption means: \(D_{it-1}=1 \implies D_{it}=1\).
\(G_i\) — a unit’s group — the time period that unit becomes treated.
- Under staggered treatment adoption, \(G_i\) fully summarizes a unit’s treatment regime
Define \(U_i=1\) for never-treated units and \(U_i=0\) otherwise.

Setup with Staggered Treatment Adoption

Notation (cont’d):

Potential outcomes: \(Y_{it}(g)\) — the outcome that unit \(i\) would experience in time period \(t\) if they became treated in period \(g\).
Untreated potential outcome: \(Y_{it}(0)\) — the outcome unit \(i\) would experience in time period \(t\) if they did not participate in the treatment in any period.
Observed outcome: \(Y_{it}=Y_{it}(G_i)\)
No anticipation condition: \(Y_{it} = Y_{it}(0)\) for all \(t < G_i\) (pre-treatment periods for unit \(i\))
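As an illustration of this notation, the following R sketch builds \(G_i\) and \(U_i\) from a treatment indicator in a small made-up panel and checks the staggered adoption condition. Coding never-treated units as `Inf` is a choice made for this sketch; the data used later in the talk codes them as `G = 0`.

```r
# Hypothetical long-format panel: 4 units observed over 4 periods (sorted by t within id)
panel <- data.frame(
  id = rep(1:4, each = 4),
  t  = rep(1:4, times = 4),
  d  = c(0, 1, 1, 1,    # unit 1: first treated in period 2
         0, 0, 0, 1,    # unit 2: first treated in period 4
         0, 0, 0, 0,    # unit 3: never treated
         0, 0, 1, 1)    # unit 4: first treated in period 3
)

# Staggered adoption: D_{it-1} = 1 implies D_{it} = 1, i.e. D never decreases within a unit
is_staggered <- all(tapply(panel$d, panel$id, function(d) all(diff(d) >= 0)))

# G_i: first treated period (Inf for never-treated units in this sketch)
first_treated <- function(i) {
  rows <- panel$id == i & panel$d == 1
  if (any(rows)) as.numeric(min(panel$t[rows])) else Inf
}
ids <- unique(panel$id)
G <- vapply(ids, first_treated, numeric(1))
U <- as.integer(is.infinite(G))   # U_i = 1 for never-treated units

data.frame(id = ids, G = G, U = U)
```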

Target Parameters

Group-time average treatment effects: \[\begin{align*} \ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g] \end{align*}\]

Explanation: \(\ATT\) for group \(g\) in time period \(t\)

Target Parameters

Event Study: \[\begin{align*} \ATT^{es}(e) = \E[ Y_{G+e}(G) - Y_{G+e}(0) | G \in \mathcal{G}_e] \end{align*}\]

where \(\mathcal{G}_e\) is the set of groups observed to have experienced the treatment for \(e\) periods at some point.

Explanation: \(\ATT\) when units have been treated for \(e\) periods

Target Parameters

Overall \(\mathbf{\ATT}\):

Towards this end: the average treatment effect for unit \(i\) (across its post-treatment time periods) is given by: \[\bar{\tau}_i(G_i) = \frac{1}{\T - G_i + 1} \sum_{t=G_i}^{\T} \Big( Y_{it}(G_i) - Y_{it}(0) \Big)\]

Then,

\[\begin{align*} \ATT^o = \E[\bar{\tau}(G) | U=0] \end{align*}\]

Explanation: \(\ATT\) across all units that ever participate in the treatment

Pros & Cons of Aggregating Causal Effect Parameters

Group-Time \(\mathbf{\ATT(g,t)}\)

More fully characterizes treatment effect heterogeneity.
Some theories may have testable implications on \(\ATT(g,t)\) (e.g., on the sign of group-time average treatment effects).

⟶

Event Study \(\mathbf{\ATT^{es}(e)}\)

Easier to estimate precisely
Easier to report - summarizes treatment effects in a two-dimensional plot.

Overall \(\mathbf{\ATT^o}\)

Easiest to estimate precisely
Easiest to report - summarizes causal effects into single number to report in abstract or introduction.

\(\mathbf{\ATT(g,t)}\) as a Building Block

To understand the discussion later, it is also helpful to think of \(\ATT(g,t)\) as a building block for the other parameters discussed above. For example:

Overall ATT: \[\begin{align*} \ATT^o = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^o(g,t) \ATT(g,t) \qquad \qquad \textrm{where} \quad w^o(g,t) = \frac{\P(G=g|U=0)}{\T-g+1} \end{align*}\]
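The weighting formula can be sketched in a few lines of R. The \(\ATT(g,t)\) values and group shares below are hypothetical; the point is only that applying \(w^o(g,t)\) is the same as averaging within each group over its post-treatment periods and then averaging across groups by their relative size.

```r
T_periods <- 4
# Hypothetical ATT(g,t) for post-treatment periods t >= g, by group
att_gt <- list(`2` = c(1.0, 1.5, 2.0),   # group 2: t = 2, 3, 4
               `3` = c(0.5, 1.0),        # group 3: t = 3, 4
               `4` = c(2.0))             # group 4: t = 4
# Hypothetical group shares among ever-treated units, P(G = g | U = 0)
p_g <- c(`2` = 0.5, `3` = 0.3, `4` = 0.2)

# Apply w^o(g,t) = P(G = g | U = 0) / (T - g + 1) to every post-treatment ATT(g,t)
att_o <- 0
for (g_chr in names(att_gt)) {
  g <- as.numeric(g_chr)
  w <- p_g[[g_chr]] / (T_periods - g + 1)
  att_o <- att_o + sum(w * att_gt[[g_chr]])
}

# Equivalent two-step view: average within group, then across groups
att_o_alt <- sum(p_g * vapply(att_gt, mean, numeric(1)))
c(att_o = att_o, att_o_alt = att_o_alt)
```

With these made-up inputs both calculations give 1.375.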

Event Study: Likewise, \(\ATT^{es}(e)\) is a weighted average of \(\ATT(g,g+e)\)

\(\implies\) If we can identify \(\mathbf{\ATT(g,t)}\), then we can proceed to recover \(\mathbf{\ATT^{es}(e)}\) and \(\mathbf{\ATT^o}\).

DiD Identification of \(\ATT(g,t)\)

Multiple Period Version of Parallel Trends Assumption

For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2,\ldots,\T\), \[\begin{align*} \E[\Delta Y_{t}(0) | G=g] = \E[\Delta Y_{t}(0) | U=1] \end{align*}\]

Using very similar arguments as before, can show that \[\begin{align*} \ATT(g,t) = \E[Y_{t} - Y_{g-1} | G=g] - \E[Y_{t} - Y_{g-1} | U=1] \end{align*}\]

where the main difference is that we use \((g-1)\) as the base period (this is the period right before group \(g\) becomes treated).

Summary

The previous discussion emphasizes a general purpose identification strategy with staggered treatment adoption:

Step 1: Target disaggregated treatment effect parameters (i.e., group-time average treatment effects)

Step 2: (If desired) combine disaggregated treatment effects into lower dimensional summary treatment effect parameter

Notice that:

This amounts to breaking the problem into a set of two-period DiD problems and then combining the results
It is also a general purpose strategy in that the same high-level idea is (1) not DiD-specific and (2) can (possibly) be applied to more complicated treatment regimes

What Can Go Wrong with TWFE Regression?

With staggered treatments, traditionally DiD identification strategies have been implemented with two-way fixed effects (TWFE) regressions: \[\begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*}\]

One main contribution of recent work on DiD has been to diagnose and understand the limitations of TWFE regressions for implementing DiD

What Can Go Wrong with TWFE Regression?

Goodman-Bacon (2021) intuition: \(\alpha\) “comes from” comparisons between the path of outcomes for units whose treatment status changes relative to the path of outcomes for units whose treatment status stays the same over time.

👍 Some comparisons are for groups that become treated to not-yet-treated groups
👎 Other comparisons are for groups that become treated relative to already-treated groups
- This can be especially problematic when there are treatment effect dynamics. Dynamics imply different trends from what would have happened absent the treatment.

What Can Go Wrong with TWFE Regression?

de Chaisemartin and D’Haultfoeuille (2020) intuition: You can write \(\alpha\) as a weighted average of \(\ATT(g,t)\)

First, a decomposition: \[\begin{align*} \alpha &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \\ & + \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{g-1} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \end{align*}\]

Second, under parallel trends: \[\begin{align*} \alpha = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\T} w^{TWFE}(g,t) \ATT(g,t) \end{align*}\]

But the weights are (non-transparently) driven by the estimation method
These weights have some good / bad / strange properties such as possibly being negative
[More Details]

Callaway and Sant’Anna (2021)

Intuition: Directly implement the identification result discussed above

Under parallel trends, recall that

\[\begin{align*} \ATT(g,t) = \E[Y_{t} - Y_{g-1} | G=g] - \E[Y_{t} - Y_{g-1} | U=1] \end{align*}\]

Estimation:

\[\begin{align*}\widehat{\ATT}^{CS}(g,t) = \frac{1}{n_g}\sum_{i=1}^n \indicator{G_i = g}(Y_{it} - Y_{ig-1}) - \frac{1}{n_U}\sum_{i=1}^n \indicator{U_i = 1} (Y_{it} - Y_{ig-1}) \end{align*}\]
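This estimator can be written by hand in base R. The sketch below is not the `did` package's implementation (which also handles covariates, inference, and other comparison groups); it is a bare version of the formula on simulated data where the true \(\ATT(g,t) = t - g + 1\) by construction. Never-treated units are coded as `Inf` here, an assumption of this sketch.

```r
set.seed(3)
n <- 3000
T_periods <- 4
G   <- sample(c(2, 3, Inf), n, replace = TRUE)        # Inf = never treated (sketch's coding)
eta <- rnorm(n, mean = ifelse(is.finite(G), 1, 0))    # selection on the unit fixed effect
Y   <- sapply(1:T_periods, function(t) {
  effect <- ifelse(t >= G, t - G + 1, 0)              # true ATT(g,t) = t - g + 1
  eta + 0.5 * t + effect + rnorm(n)
})                                                    # n x T matrix with Y[i, t]

# Mean change from base period g-1 to t for group g, minus the same change for never-treated units
att_gt_hat <- function(g, t) {
  treated <- G == g
  never   <- is.infinite(G)
  mean(Y[treated, t] - Y[treated, g - 1]) - mean(Y[never, t] - Y[never, g - 1])
}

res <- expand.grid(g = c(2, 3), t = 2:T_periods)
res <- res[res$t >= res$g, ]                          # keep post-treatment periods only
res$estimate <- mapply(att_gt_hat, res$g, res$t)
res$truth    <- res$t - res$g + 1
res
```

Each row is one two-period DiD comparison, which is why the estimates can later be averaged into event-study or overall parameters.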

2nd step: Recall: group-time average treatment effects are building blocks for more aggregated parameters such as \(\ATT^{es}(e)\) and \(\ATT^o\) \(\implies\) just plug in

\(\implies\) two-step estimation procedure: target local/disaggregated \(\ATT(g,t)\) in first step, then (if desired) aggregate them into lower dimensional parameters

Other New Approaches

Regression based: Sun and Abraham (2021), Wooldridge (2021)

Include a large number of interaction terms in a TWFE regression
The coefficients on the interaction terms correspond to \(\ATT(g,t)\)

Imputation: Gardner et al. (2023), Borusyak, Jaravel, and Spiess (2024)

Estimate a model for untreated potential outcomes using pre-treatment data
Impute (i.e., predict) untreated potential outcomes in post-treatment periods
Estimate \(\ATT(g,t)\) as the difference between observed and imputed outcomes
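The three imputation steps can be sketched with base R's `lm`. This is an illustration under a simulated design with a constant treatment effect of 2, not the implementation in any of the cited papers or their software; variable names such as `fit0` and `y0_hat` are made up.

```r
set.seed(4)
n <- 600
T_periods <- 4
G <- sample(c(2, 3, Inf), n, replace = TRUE)          # Inf = never treated (sketch's coding)
panel <- expand.grid(id = seq_len(n), t = seq_len(T_periods))
panel$G <- G[panel$id]
panel$d <- as.integer(panel$t >= panel$G)
eta <- rnorm(n, mean = ifelse(is.finite(G), 1, 0))
panel$y <- eta[panel$id] + 0.5 * panel$t + 2 * panel$d + rnorm(nrow(panel))  # true effect = 2
panel$id_f <- factor(panel$id)
panel$t_f  <- factor(panel$t)

# Step 1: fit Y(0) = eta_i + theta_t using untreated observations only
fit0 <- lm(y ~ id_f + t_f, data = panel[panel$d == 0, ])

# Step 2: impute untreated potential outcomes for treated observations
treated_obs <- panel[panel$d == 1, ]
y0_hat <- predict(fit0, newdata = treated_obs)

# Step 3: observed minus imputed outcomes, averaged by (g, t) and overall
gap <- treated_obs$y - y0_hat
att_gt_imputed <- tapply(gap, list(g = treated_obs$G, t = treated_obs$t), mean)
att_imputed <- mean(gap)
round(att_gt_imputed, 2)
```

Every unit has at least one untreated period and every period has untreated units, which is what lets the fitted model predict for all treated observations.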

“Stacked” regression: Dube et al. (2023)

For each group \(g\), construct its “clean comparison group”, form a new dataset
Stack each of these new datasets together and run a TWFE regression

Other New Approaches

All of these approaches are conceptually very similar

e.g., if you include enough interaction terms, you turn a regression into a way to compute differences in means

Why can you get different numbers?

Often, different implementation choices made in software
- e.g., by default assuming parallel trends across more time periods or not, trading off robustness and efficiency.
- sometimes different default target parameters

Other New Approaches

Important differences:

CS does better with respect to including covariates
- Doubly robust estimation / immediate connections with machine learning
CS follows the “forward engineering” approach discussed above
- Suppose you wanted to use a custom comparison group: how to do this is obvious with CS, but not necessarily with other approaches
- Guards against conceptual mistakes such as “throwing in” a continuous treatment w/o an explicit argument for why this works

[Longer Comparison of New Approaches]

Part 4

Empirical Example

Empirical Example: Minimum Wages and Employment

Use county-level data from 2003-2007 during a period where the federal minimum wage was flat
Exploit minimum wage changes across states
- Any state that increases their minimum wage above the federal minimum wage will be considered as treated
Interested in the effect of the minimum wage on teen employment
We’ll also make a number of simplifications:
- not worry much about issues like clustered standard errors
- not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states

Empirical Example: Minimum Wages and Employment

Goals:

Get some experience with an application and DiD-related code
Assess how much the issues that we have been talking about matter in practice

Code

Full code is available on GitHub.

R packages used in empirical example:

library(did)
library(BMisc)
library(twfeweights)
library(fixest)
library(modelsummary)
library(ggplot2)
load(url("https://github.com/bcallaway11/did_chapter/raw/master/mw_data_ch2.RData"))

Setup Data

# drops NE region and a couple of small groups
mw_data_ch2 <- subset(mw_data_ch2, (G %in% c(2004,2006,2007,0)) & (region != "1"))
head(mw_data_ch2[,c("id","year","G","lemp","lpop","lavg_pay","region")])

      id year    G     lemp     lpop lavg_pay region
554 8003 2001 2007 5.556828 9.614137 10.05750      4
555 8003 2002 2007 5.356586 9.623972 10.09712      4
556 8003 2003 2007 5.389072 9.620859 10.10761      4
557 8003 2004 2007 5.356586 9.626548 10.14034      4
558 8003 2005 2007 5.303305 9.637958 10.17550      4
559 8003 2006 2007 5.342334 9.633056 10.21859      4

# drop 2007 as these are right before fed. minimum wage change
data2 <- subset(mw_data_ch2, G!=2007 & year >= 2003)
# keep 2007 => larger sample size
data3 <- subset(mw_data_ch2, year >= 2003)

TWFE Regression

twfe_res2 <- fixest::feols(lemp ~ post | id + year,
                           data=data2,
                           cluster="id")

modelsummary(list(twfe_res2), gof_omit=".*")

	(1)
post	-0.038
	(0.008)

\(\mathbf{\ATT(g,t)}\) (Callaway and Sant’Anna)

attgt <- did::att_gt(yname="lemp",
                     idname="id",
                     gname="G",
                     tname="year",
                     data=data2,
                     control_group="nevertreated",
                     base_period="universal")
tidy(attgt)[,1:5] # print results, drop some extra columns

             term group time    estimate   std.error
1  ATT(2004,2003)  2004 2003  0.00000000          NA
2  ATT(2004,2004)  2004 2004 -0.03266653 0.021210914
3  ATT(2004,2005)  2004 2005 -0.06827991 0.021592785
4  ATT(2004,2006)  2004 2006 -0.12335404 0.021745364
5  ATT(2004,2007)  2004 2007 -0.13109136 0.023757903
6  ATT(2006,2003)  2006 2003 -0.03408910 0.011674878
7  ATT(2006,2004)  2006 2004 -0.01669977 0.007910050
8  ATT(2006,2005)  2006 2005  0.00000000          NA
9  ATT(2006,2006)  2006 2006 -0.01939335 0.009693080
10 ATT(2006,2007)  2006 2007 -0.06607568 0.009354202

Plot \(\mathbf{\ATT(g,t)}\)’s

Event Study

attes <- aggte(attgt, type="dynamic")
ggdid(attes)

Compute \(\mathbf{\ATT^o}\)

attO <- did::aggte(attgt, type="group")
summary(attO)


Call:
did::aggte(MP = attgt, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0571        0.0087    -0.0741     -0.0401 *


Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0888     0.0202       -0.1306     -0.0470 *
  2006  -0.0427     0.0075       -0.0583     -0.0272 *
---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Doubly Robust
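The group effects in this output can be reproduced by hand from the \(\ATT(g,t)\) table printed earlier: each one is the simple average of that group's post-treatment estimates. The sketch below only re-uses numbers already shown; it does not reproduce the overall estimate, which also depends on group sizes that are not printed here.

```r
# Post-treatment ATT(g,t) estimates copied from the att_gt output
att_2004 <- c(`2004` = -0.03266653, `2005` = -0.06827991,
              `2006` = -0.12335404, `2007` = -0.13109136)
att_2006 <- c(`2006` = -0.01939335, `2007` = -0.06607568)

# Group effect = average of ATT(g,t) over the group's post-treatment periods
group_effects <- c(`2004` = mean(att_2004), `2006` = mean(att_2006))
round(group_effects, 4)
```

The rounded values, -0.0888 and -0.0427, match the "Group Effects" rows in the summary.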

Comments

The differences between the CS estimates and the TWFE estimates are fairly large here: the CS estimate is about 50% larger than the TWFE estimate, though results are qualitatively similar.

de Chaisemartin and d’Haultfoeuille weights

\(\mathbf{\ATT^o}\) weights

Weight Comparison

Discussion

To summarize: \(ATT^o = -0.057\) while \(\alpha^{TWFE} = -0.038\). This difference can be fully accounted for by:

Pre-treatment differences in paths of outcomes across groups: explains about 64% of the difference
Differences in weights applied to the same post-treatment \(ATT(g,t)\): explains about 36% of the difference. [If you apply the post-treatment weights and “zero out” pre-treatment differences, the estimate would be \(-0.050\).]
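The split can be checked as back-of-the-envelope arithmetic using the three rounded numbers reported on this slide. Because the inputs are rounded to three decimals, the shares come out near 63% and 37% rather than exactly the 64% and 36% quoted, which presumably rest on unrounded estimates.

```r
att_o    <- -0.057   # overall ATT from the Callaway and Sant'Anna aggregation
twfe     <- -0.038   # TWFE coefficient
zero_pre <- -0.050   # estimate with pre-treatment differences "zeroed out"

gap           <- att_o - twfe
share_weights <- (att_o - zero_pre) / gap   # part due to weights on post-treatment ATT(g,t)
share_pre     <- (zero_pre - twfe) / gap    # part due to pre-treatment differences in paths
round(c(ratio = att_o / twfe, share_pre = share_pre, share_weights = share_weights), 2)
```

The ratio of 1.5 is the "about 50% larger" comparison between the CS and TWFE estimates.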

Discussion

In my experience: this is fairly representative of how much new DiD approaches matter relative to TWFE regressions.

It does not seem like “catastrophic failure” of TWFE, but (in my view) these are meaningful differences
The whole discussion hinges crucially on how much treatment effect heterogeneity there is.
- More TE Het \(\implies\) more sensitivity to weighting schemes
- Just looking at TWFE regression does not give insight into how much TE Het there is

Additional Comments on Weights

There has been a lot of concern about negative weights (both in econometrics and empirical work).

There were no negative weights in the example above, but the weights still weren’t great.
But, in my view, the more important issue is the non-transparent weighting scheme.
- Example 1: If you try using data3 (the data that includes \(G_i=2007\)), you will get a negative weight on \(ATT(g=2004,t=2007)\). But it turns out not to matter much, and TWFE works better in this case than in the case that I showed you.
- Example 2: Alternative treatment effect parameter

Conclusion

Not a Panacea:

Good properties of new approaches are conditional on parallel trends holding

Possible Disadvantages:

Not yet available for all possible types of treatments where DiD identification strategies are used

Generally, it is possible/easy to figure out the underlying comparisons that we want to make, but it can be hard to figure out how to pool the information (e.g., minimum wage).
That said, it is not clear that this is necessarily a disadvantage. What we know about TWFE regressions is that they are generally equal to weighted averages of underlying causal effect parameters (this has been shown in many more complicated settings than the staggered treatment case we talked about today), but the weights are non-transparent and driven by the estimation method.
This means that you and the regression basically agree on the underlying components of a summary treatment effect parameter, and the remaining problem is how to combine that information. The regression will make a choice for you but, based on what we know from existing cases, it usually doesn’t make good choices. I don’t think we should punt on these problems; explicitly proposing ways to combine information should be an important consideration in empirical work in economics.

Advantages:

Off-the-shelf robust to treatment effect heterogeneity
Arguably, simpler and more transparent
Direct implementation of the DiD identification strategy

Not sure if this is a good analogy, but I am currently teaching 1st year econometrics for Ph.D. students, and a classic topic in that class is standard errors under homoskedasticity and heteroskedasticity. I am not against calculating standard errors under homoskedasticity if you say that you are assuming homoskedasticity (and have a good reason to believe it or test for it, etc.), but I am against saying that standard errors are robust to heteroskedasticity and then computing standard errors under homoskedasticity. It’s true that this choice may not “change the results” in a good fraction of applications, but it still isn’t right. If you are willing to say that you assume that treatment effects don’t change across groups and time periods, then TWFE is all good, but, to me, it doesn’t make sense to say that you are only assuming parallel trends (w/o restricting TE het.) and then run a traditional TWFE regression.

Appendix

How do we choose among identifying assumptions?

View #1: Parallel trends as a purely reduced form assumption

For example, if you have extra pre-treatment periods, you can assess validity in pre-treatment periods

But this is certainly not the only possibility:

In some disciplines (e.g., biostats) it is relatively more common to assume unconfoundedness conditional on lagged outcomes (i.e., the LO approach above)
This is also what my undergraduate econometrics students almost always suggest (their judgement is not clouded by having thought about these things too much)
Or, alternatively, why not take two differences instead of one…

In my view, these seem like fair points

How do we choose among identifying assumptions?

View #2: Models that lead to parallel trends assumption. We’ll focus on untreated potential outcomes: \[Y_{it}(0) = \theta_t + \eta_i + e_{it}\] Parallel trends is equivalent to this model along with the condition that \(\E[e_t | G] = 0\).

Many economic models have this sort of flavor, that the important thing driving differences in outcomes is some latent characteristic (differences in lagged outcomes may proxy this, but not the “deep” explanation)

See the discussion in Ashenfelter and Card (1985) and Card (2022)

Pros

No restrictions on treatment effect heterogeneity
Can allow for some self-selection into treatment

How do we choose among identifying assumptions?

See the discussion in Ashenfelter and Card (1985) and Card (2022)

Cons: However, additive separability of \(\theta_t\) and \(\eta_i\) is crucial for identification

This is different from other natural experiment methods such as IV and RD, where, at least from an identification perspective, there is no model dependence
May not be plausible for limited dependent variables
Also related to results in Roth and Sant’Anna (2023) (about parallel trends and functional form) and Ghanem, Sant’Anna, and Wüthrich (2024) (about selection and parallel trends) [Back]

How do these results work?

Consider a simplified setting where \(\T=2\), but we allow for there to be units that are already treated in the first period.

\(\implies\) 3 groups: \(G_i=1\), \(G_i=2\), \(G_i=\infty\)

Because there are only two periods, the TWFE regression is equivalent to the regression \[\begin{align*} \Delta Y_i = \Delta \theta_{t=2} + \alpha \Delta D_{it=2} + \Delta e_{it=2} \end{align*}\]

Moreover, \(\Delta D_{it=2}\) only takes two values:

\(\Delta D_{it=2} = 0\) for \(G_i=1\) and \(G_i=\infty\)
\(\Delta D_{it=2} = 1\) for \(G_i=2\)

Thus, this is a fully saturated regression, and we have that \[\begin{align*} \alpha = \E[\Delta Y | \Delta D_{t=2} = 1] - \E[\Delta Y | \Delta D_{t=2}=0] \end{align*}\]

TWFE Explanation (cont’d)

Starting from the previous slide: \[\begin{align*} \alpha = \E[\Delta Y | \Delta D_{t=2} = 1] - \E[\Delta Y | \Delta D_{t=2}=0] \end{align*}\] and consider the term on the far right, we have that \[\begin{align*} \E[\Delta Y | \Delta D_{t=2}=0] = \E[\Delta Y | G=1] \underbrace{\frac{p_1}{p_1 + p_\infty}}_{=: w_1} + \E[\Delta Y | G=\infty] \underbrace{\frac{p_\infty}{p_1 + p_\infty}}_{=: w_\infty} \end{align*}\]

where \(w_1\) and \(w_\infty\) are the relative sizes of group 1 and the never treated group, and notice that \(w_1 + w_\infty = 1\). Plugging this back in \(\implies\) \[\begin{align*} \alpha = \Big( \E[\Delta Y | G=2] - \E[\Delta Y | G=1]\Big) w_1 + \Big( \E[\Delta Y | G=2] - \E[\Delta Y|G=\infty]\Big) w_\infty \end{align*}\]

This is exactly the Goodman-Bacon result! \(\alpha\) is a weighted average of all possible 2x2 comparisons
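A quick simulated check of this decomposition, as a sketch in base R. The group shares, the group-specific mean changes, and all variable names are made up for illustration; only the algebra comes from the slides.

```r
# Simulated two-period setting with three groups (all numbers are illustrative):
# G = 1 (already treated in period 1), G = 2 (treated in period 2), G = Inf (never treated)
set.seed(123)
n <- 3000
G <- sample(c(1, 2, Inf), n, replace = TRUE, prob = c(0.2, 0.3, 0.5))

# Change in outcomes between periods 1 and 2; the group means are arbitrary
dY <- rnorm(n, mean = ifelse(G == 1, 0.5, ifelse(G == 2, 1.5, 0.2)))

# Change in treatment status: only G = 2 switches into treatment
dD <- as.numeric(G == 2)

# With two periods, the TWFE regression is the regression of dY on dD
alpha <- unname(coef(lm(dY ~ dD))["dD"])

# Weighted average of the two 2x2 comparisons
w1   <- sum(G == 1) / sum(G != 2)
winf <- sum(G == Inf) / sum(G != 2)
comp_vs_1   <- mean(dY[G == 2]) - mean(dY[G == 1])
comp_vs_inf <- mean(dY[G == 2]) - mean(dY[G == Inf])
alpha_decomp <- comp_vs_1 * w1 + comp_vs_inf * winf

c(alpha = alpha, alpha_decomp = alpha_decomp)
```

The two numbers agree up to floating point error, because the decomposition is an algebraic identity in the sample and not only a population result.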

TWFE Explanation (cont’d)

Let’s keep going: \[\begin{align*} \alpha = \underbrace{\Big( \E[\Delta Y | G=2] - \E[\Delta Y | G=1]\Big)}_{\textrm{What is this?}} w_1 + \underbrace{\Big( \E[\Delta Y | G=2] - \E[\Delta Y|G=\infty]\Big)}_{ATT(2,2)} w_\infty \end{align*}\] Working on the first term, we have that \[ \begin{aligned} & \E[\Delta Y_{2} | G=2] - \E[\Delta Y_{2} | G=1] \hspace{300pt}\\ &\hspace{10pt} = \E[Y_{2}(2) - Y_{1}(\infty) | G=2] - \E[Y_{2}(1) - Y_{1}(1) | G=1] \\ &\hspace{10pt} = \E[Y_{2}(2) - Y_{2}(\infty) | G=2] + \underline{\E[Y_{2}(\infty) - Y_{1}(\infty) | G=2]}\\ &\hspace{20pt} - \Big( \E[Y_{2}(1) - Y_{2}(\infty) | G=1] - \E[Y_{1}(1) - Y_{1}(\infty) | G=1] + \underline{\E[Y_{2}(\infty) - Y_{1}(\infty) | G=1]} \Big) \end{aligned} \]

TWFE Explanation (cont’d)

Plug this expression back in \(\rightarrow\)
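Spelling out the plug-in step: under parallel trends, the two underlined terms (the paths of untreated potential outcomes for groups 2 and 1) are equal and cancel. The remaining terms are \(ATT(2,2)\), \(ATT(1,2)\), and \(ATT(1,1)\), so that

```latex
\begin{align*}
\E[\Delta Y_{2} | G=2] - \E[\Delta Y_{2} | G=1] &= ATT(2,2) - \Big( ATT(1,2) - ATT(1,1) \Big) \\
\implies \alpha &= \Big( ATT(2,2) - ATT(1,2) + ATT(1,1) \Big) w_1 + ATT(2,2) w_\infty \\
&= ATT(2,2) + ATT(1,1) w_1 - ATT(1,2) w_1
\end{align*}
```

where the last line uses \(w_1 + w_\infty = 1\).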

TWFE Explanation (cont’d)

Plugging the previous expression back in, we have that \[\begin{align*} \alpha = ATT(2,2) + ATT(1,1) w_1 + ATT(1,2)(-w_1) \end{align*}\]

This is exactly the result in de Chaisemartin and d’Haultfoeuille! \(\alpha\) is equal to a weighted average of \(ATT(g,t)\)’s, but it is possible that some of the weights can be negative.

Also, as they point out, a sufficient condition for the weights to be non-negative is: no treatment effect dynamics \(\implies ATT(1,1) = ATT(1,2)\) \(\overset{\textrm{here}}{\implies} \alpha = ATT(2,2)\).

In more complicated settings, this would guarantee no negative weights, but you would still get a hard-to-understand weighted average of \(ATT(g,t)\)’s.
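To see the negative weight at work, here is a small numerical sketch in base R. The group sizes and the values of \(ATT(1,1)\), \(ATT(1,2)\), and \(ATT(2,2)\) are hypothetical, and the data generating process has no idiosyncratic noise so that parallel trends holds exactly in the sample.

```r
# Illustrative (made-up) group-time effects, with dynamics for group 1
att11 <- 1; att12 <- 3; att22 <- 1

# Hypothetical group sizes: G = 1, G = 2, and never treated (Inf)
G <- rep(c(1, 2, Inf), times = c(300, 300, 400))
n <- length(G)

# Untreated potential outcomes are exactly theta_t + eta_i, so parallel
# trends holds in the sample and not just in expectation
set.seed(42)
eta <- rnorm(n)
theta <- c(0, 0.5)
Y1 <- theta[1] + eta + ifelse(G == 1, att11, 0)
Y2 <- theta[2] + eta + ifelse(G == 1, att12, ifelse(G == 2, att22, 0))

# Two-period TWFE regression, written in first differences
dY <- Y2 - Y1
dD <- as.numeric(G == 2)
alpha <- unname(coef(lm(dY ~ dD))["dD"])

# Formula from the slides: the weight on ATT(1,2) is -w1
w1 <- sum(G == 1) / sum(G != 2)
alpha_formula <- att22 + att11 * w1 - att12 * w1

c(alpha = alpha, alpha_formula = alpha_formula)
```

With these numbers \(w_1 = 3/7\), so \(\alpha = 1 - 2 \cdot 3/7 = 1/7 \approx 0.143\), even though every \(ATT(g,t)\) is at least 1. Setting `att12 <- att11` (no treatment effect dynamics) gives \(\alpha = ATT(2,2)\).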

[Back]

Comparing New Approaches to DiD

We’ll discuss:

Callaway and Sant’Anna (2021), R: did, Stata: csdid, Python: csdid
Sun and Abraham (2021), R: fixest, Stata: eventstudyinteract
Wooldridge (2021), R: etwfe, Stata: jwdid
Gardner et al. (2023) / Borusyak, Jaravel, and Spiess (2024), R: did2s, Stata: did2s and did_imputation

Not including:

“Stacked Regression” (Cengiz et al. (2019), Dube et al. (2023)), Stata: stackedev
de Chaisemartin and D’Haultfoeuille (2020), R: DIDmultiplegt, Stata: did_multiplegt

Sun and Abraham (2021)

Intuition: The paper shows that event-study versions of the TWFE regressions discussed above, \[\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{e=-(\T-1)}^{-2} \beta_e D_{it}^e + \sum_{e=0}^{\T} \beta_e D_{it}^e + e_{it} \end{align*}\] suffer from similar issues. In particular, the event study regression is “underspecified” \(\implies\) heterogeneous effects can “confound” the treatment effect estimates

Solution: Run fully interacted regression: \[\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{e \neq -1} \delta^{SA}_{ge} \indicator{G_i=g} \indicator{g+e=t} + e_{it} \end{align*}\]

2nd step: Aggregate \(\delta^{SA}_{ge}\)’s across groups (usually into an event study).

This sidesteps issues with the event study regression due to treatment effect heterogeneity
For inference, need to account for two-step estimation procedure

Wooldridge (2021)

Intuition: Are issues in DiD literature due to limitations of TWFE regressions per se or due to misspecification of TWFE regression?

Solution: Proposes running “more interacted” TWFE regression: \[\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{s=g}^{\T} \alpha_{gs}^W \indicator{G_i=g, t=s} + e_{it} \end{align*}\] This is quite similar to Sun and Abraham (2021) except that it doesn’t include interactions in pre-treatment periods. [The differences about \((g,t)\) relative to \((g,e)\) are trivial.]

Like SA, this provides robustness to treatment effect heterogeneity by including more interactions
Like SA, unless mainly interested in \(\ATT(g,t)\), have to do a second-step aggregation, which (arguably) gives up the “killer feature” of the TWFE regression to begin with

Gardner et al. (2023) / BJS (2024)

Intuition: Parallel trends is closely connected to a TWFE model for untreated potential outcomes \[Y_{it}(0) = \theta_t + \eta_i + e_{it}\]

Estimation:

Step 1: Split data into treated and untreated observations
Step 2: Estimate above model for the set of untreated observations
Step 3: “Impute” \(\hat{Y}_{it}(0) = \hat{\theta}_t + \hat{\eta}_i\) for the treated observations
\(\displaystyle \widehat{\ATT}^{G/BJS}(g,t) = \frac{1}{n_g} \sum_{i=1}^n \indicator{G_i=g}\Big(Y_{it} - \hat{Y}_{it}(0)\Big) \xrightarrow{p} \ATT(g,t)\)

Can compute other treatment effect parameters too (e.g., event study or overall average treatment effect)
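The three steps above can be sketched in base R as follows. This is a hand-rolled illustration on simulated data, not the did2s or did_imputation implementation; the data generating process, the variable names, and the true effects (1 on impact, growing by 0.5 per period) are all assumptions made for the example.

```r
# Simulated balanced panel (all numbers are illustrative)
set.seed(2024)
n <- 300; periods <- 1:4
G <- rep(c(2, 3, Inf), each = n / 3)         # Inf = never treated
panel <- expand.grid(id = 1:n, t = periods)
panel$G <- G[panel$id]
panel$D <- as.numeric(panel$t >= panel$G)    # 1 for treated observations
eta <- rnorm(n)
true_te <- ifelse(panel$D == 1, 1 + 0.5 * (panel$t - panel$G), 0)
panel$y <- eta[panel$id] + 0.3 * panel$t + true_te + rnorm(nrow(panel), sd = 0.5)
panel$fid <- factor(panel$id)
panel$ft <- factor(panel$t)

# Steps 1 and 2: estimate the TWFE model using untreated observations only
untreated <- panel[panel$D == 0, ]
fit0 <- lm(y ~ fid + ft, data = untreated)

# Step 3: impute untreated potential outcomes for the treated observations
treated <- panel[panel$D == 1, ]
treated$y0_hat <- predict(fit0, newdata = treated)
treated$te_hat <- treated$y - treated$y0_hat

# Average the imputed effects within each (g, t) cell
att_gt <- aggregate(te_hat ~ G + t, data = treated, FUN = mean)
att_gt[order(att_gt$G, att_gt$t), ]
```

Averaging `te_hat` over other sets of treated observations (by event time, or over all treated observations) gives the other aggregated parameters. Note that this sketch does nothing about inference, which has to account for the first-step estimation.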

Similarities and Differences

In my view, all of the approaches discussed above are fundamentally similar to each other.

In practice, it is sometimes possible to get different results though this is often driven by

Different estimation strategies trading off efficiency and robustness in different ways
Different choices in terms of default implementation details in computer code

Comparison 1: CS and SA

In post-treatment periods, these give numerically identical results: \(\widehat{\ATT}^{CS}(g,t) = \hat{\delta}^{SA}_{g,t-g}\)

This is because a fully interacted regression (SA) is equivalent to taking differences in averages across groups (CS)

In pre-treatment periods, code will give different pre-treatment estimates, but this is due to different default choices

In SA, all results are relative to a fixed base period (typically the period right before treatment)
In CS, by default, in pre-treatment periods, estimates are of placebo policy effects on impact (i.e., the base period is always the most recent pre-treatment period)
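The post-treatment equivalence can be checked numerically. The following base R sketch builds the fully interacted regression by hand (rather than calling fixest or did) on a simulated balanced panel with a never-treated comparison group; the data generating process and all names are assumptions for illustration.

```r
# Simulated balanced panel (all numbers are illustrative)
set.seed(7)
n <- 90; periods <- 1:4
G <- rep(c(2, 3, Inf), each = n / 3)         # Inf = never treated
panel <- expand.grid(id = 1:n, t = periods)
panel$G <- G[panel$id]
eta <- rnorm(n)
te <- ifelse(panel$t >= panel$G, 1 + 0.3 * (panel$t - panel$G), 0)
panel$y <- eta[panel$id] + 0.5 * panel$t + te + rnorm(nrow(panel))

# SA-style regression: group-by-period indicators for every treated group,
# omitting each group's reference period g - 1
ever <- is.finite(panel$G)
cell <- ifelse(ever & panel$t != panel$G - 1,
               paste0("g", panel$G, "_t", panel$t), "ref")
panel$cell <- relevel(factor(cell), ref = "ref")
fit_sa <- lm(y ~ factor(id) + factor(t) + cell, data = panel)

# CS-style estimate: difference in mean outcome paths from period g - 1 to t,
# using the never-treated group as the comparison group
did_means <- function(g, t) {
  path <- function(grp) {
    mean(panel$y[panel$G == grp & panel$t == t]) -
      mean(panel$y[panel$G == grp & panel$t == g - 1])
  }
  path(g) - path(Inf)
}

post <- data.frame(g = c(2, 2, 2, 3, 3), t = c(2, 3, 4, 3, 4))
post$sa <- unname(coef(fit_sa)[paste0("cellg", post$g, "_t", post$t)])
post$cs <- mapply(did_means, post$g, post$t)
post
```

The `sa` and `cs` columns agree up to floating point error. The regression also returns a pre-treatment coefficient for group 3 in period 1 (`cellg3_t1`), which is measured relative to the fixed base period \(g-1=2\), in line with the SA convention described above.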

Comparison 2: SA and Wooldridge

These are clearly closely related, with the difference amounting to whether or not one includes indicators for pre-treatment periods.

It is fair to see this as a way to trade off robustness and efficiency

If parallel trends holds across all time periods, then Wooldridge tends to deliver more efficient estimates (as effectively all pre-treatment periods are used as base periods)
If parallel trends is violated in some pre-treatment periods but holds post-treatment, Wooldridge estimates will be inconsistent, but SA estimates will be robust to violations of parallel trends in pre-treatment periods.
See Harmon (2023) for more details

Comparison 3: Wooldridge and Gardner/BJS

Wooldridge and Gardner/BJS give numerically the same estimates: \(\hat{\alpha}^W_{gt} = \widehat{\ATT}^{G/BJS}(g,t)\)

Intuition: Including full set of interactions is equivalent to estimating separate models by groups
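A numerical check of this equivalence, again as a base R sketch on a simulated balanced panel with no covariates. Both estimators are coded by hand (not via etwfe or did2s), and the data generating process and names are assumptions for illustration.

```r
# Simulated balanced panel (all numbers are illustrative)
set.seed(99)
n <- 90; periods <- 1:4
G <- rep(c(2, 3, Inf), each = n / 3)         # Inf = never treated
panel <- expand.grid(id = 1:n, t = periods)
panel$G <- G[panel$id]
panel$D <- as.numeric(panel$t >= panel$G)
eta <- rnorm(n)
te <- ifelse(panel$D == 1, 1 + 0.3 * (panel$t - panel$G), 0)
panel$y <- eta[panel$id] + 0.5 * panel$t + te + rnorm(nrow(panel))
panel$fid <- factor(panel$id)
panel$ft <- factor(panel$t)

# Wooldridge-style regression: one indicator per post-treatment (g, t) cell
cell <- ifelse(panel$D == 1, paste0("g", panel$G, "_t", panel$t), "untreated")
panel$cell <- relevel(factor(cell), ref = "untreated")
fit_w <- lm(y ~ fid + ft + cell, data = panel)
alpha_w <- coef(fit_w)[grep("^cell", names(coef(fit_w)))]

# Imputation: fit TWFE on untreated observations, impute, average by cell
fit0 <- lm(y ~ fid + ft, data = panel[panel$D == 0, ])
treated <- panel[panel$D == 1, ]
treated$te_hat <- treated$y - predict(fit0, newdata = treated)
att_imp <- tapply(treated$te_hat, droplevels(treated$cell), mean)

cbind(wooldridge = alpha_w[paste0("cell", names(att_imp))],
      imputation = att_imp)
```

The two columns coincide up to floating point error in this balanced, covariate-free setting.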

Comments

The above discussion emphasizes the conceptual similarities between different proposed alternatives to TWFE regressions in the literature.

The other major source of differences in estimates across procedures is different default options in software implementations. Examples:

Different base periods
- It’s possible to come up with an imputation estimator that uses the base period right before treatment only \(\implies\) \(\uparrow\) robustness, \(\downarrow\) efficiency
- It’s also possible to do a version of CS with more base periods \(\implies\) \(\uparrow\) efficiency \(\downarrow\) robustness
  - Build-the-trend (i.e., path relative to average pre-treatment outcome) and GMM, Callaway (2023), Marcus and Sant’Anna (2021), Lee and Wooldridge (2023).

Different default target parameters
- CS emphasizes the “overall” treatment effects discussed above
- Default implementations of imputation run a regression of \(Y_{it}-\hat{Y}_{it}(0)\) on \(D_{it}\) which delivers the “simple” overall average treatment effect which just averages all available treatment effects

Different default comparison groups
- CS and SA by default use the never-treated group as the comparison group
- Wooldridge and Gardner/BJS by default use all untreated observations as the comparison group
- But (again) it is straightforward to adapt CS to use the not-yet-treated group as the comparison group, or even a customized comparison group (e.g., not-yet-but-eventually-treated)

Handle covariates in different ways
- By default, imputation (effectively) uses changes in the covariates over time in estimation
- CS includes the level of the time-varying covariate in period \(g-1\).
- In Caetano and Callaway (2024), we consider including both changes and levels of covariates

Comments

See Baker, Larcker, and Wang (2022) and Callaway (2023) for substantially more details.

[Back]

“Simple” Aggregation

Consider the following alternative aggregated treatment effect parameter

\[\begin{align*} \widehat{ATT}^{\text{simple}} = \frac{1}{N_{\text{post}}} \sum_{(i,t), t \geq G_i} \Big(Y_{it} - \hat{Y}_{it}(0)\Big) \end{align*}\] i.e., we just average all possible estimated treatment effects that we have available in post-treatment periods.

Relative to \(ATT^o\), early treated units get more weight (because we have more \(Y_{it}-\hat{Y}_{it}(0)\) for them).

By construction, weights are all positive. However, they are different from \(ATT^o\) weights
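A small numerical sketch of how the two weighting schemes differ. The \(ATT(g,t)\) values and group sizes below are hypothetical, and the code assumes that \(ATT^o\) is defined as in the main part of the talk: average each group's effects over its post-treatment periods, then average across groups in proportion to group size.

```r
# Hypothetical ATT(g, t) values and group sizes (for illustration only)
att <- data.frame(
  g   = c(2, 2, 2, 3, 3, 4),
  t   = c(2, 3, 4, 3, 4, 4),
  att = c(-0.02, -0.05, -0.10, -0.03, -0.06, -0.04),
  n_g = c(100, 100, 100, 100, 100, 100)
)

# "Simple" aggregation: every treated (i, t) observation gets equal weight,
# so each (g, t) cell is weighted in proportion to its group size
w_simple <- att$n_g / sum(att$n_g)
att_simple <- sum(w_simple * att$att)

# Overall ATT^o (assumed definition): equal weight on each of a group's
# post-treatment periods, then groups weighted by their relative size
post_periods <- ave(att$t, att$g, FUN = length)
n_treated <- sum(tapply(att$n_g, att$g, function(x) x[1]))
w_overall <- (att$n_g / n_treated) / post_periods
att_overall <- sum(w_overall * att$att)

# Total weight placed on each group under the two schemes
rbind(simple  = tapply(w_simple, att$g, sum),
      overall = tapply(w_overall, att$g, sum))
c(att_simple = att_simple, att_overall = att_overall)
```

With equally sized groups, the simple aggregation puts weight 1/2, 1/3, and 1/6 on groups 2, 3, and 4 (the earliest group has the most post-treatment periods), while \(ATT^o\) puts weight 1/3 on each.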

Compute \(\mathbf{\ATT^{\text{simple}}}\)

att_simple <- did::aggte(attgt, type="simple")
summary(att_simple)


Call:
did::aggte(MP = attgt, type = "simple")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 

     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0646        0.0106    -0.0854     -0.0439 *


---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Doubly Robust

“Simple” Aggregation

Besides the violations of parallel trends in pre-treatment periods, these weights are further away from \(ATT^o\) than the TWFE regression weights are!

Implications:

Misplaced emphasis on (non-)negative weights
- If you are “content with” non-negative weights, then you can get any summary measure from \(-0.019\) (the smallest \(ATT(g,t)\) in magnitude) to \(-0.13\) (the largest in magnitude). This is a wide range of estimates.
Forward engineering, i.e., clearly stating a target aggregate treatment effect parameter and choosing weights that target that parameter, is more important than checking for negative weights

[Back]

References

Ashenfelter, Orley, and David Card. 1985. “Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs.” The Review of Economics and Statistics 67 (4): 648–60.

Baker, Andrew, Brantly Callaway, Scott Cunningham, Andrew Goodman-Bacon, and Pedro H. C. Sant’Anna. 2025. “Difference-in-Differences Designs: A Practitioner’s Guide.”

Baker, Andrew, David Larcker, and Charles Wang. 2022. “How Much Should We Trust Staggered Difference-in-Differences Estimates?” Journal of Financial Economics 144 (2): 370–95.

Borusyak, Kirill, Xavier Jaravel, and Jann Spiess. 2024. “Revisiting Event-Study Designs: Robust and Efficient Estimation.” Review of Economic Studies, rdae007.

Caetano, Carolina, and Brantly Callaway. 2024. “Difference-in-Differences When Parallel Trends Holds Conditional on Covariates.”

Callaway, Brantly. 2023. “Difference-in-Differences for Policy Evaluation.” In Handbook of Labor, Human Resources and Population Economics, edited by Klaus F. Zimmermann, 1–61. Springer International Publishing.

Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.

Card, David. 2022. “Design-Based Research in Empirical Microeconomics.” American Economic Review 112 (6): 1773–81.

Card, David, and Alan Krueger. 1994. “Minimum Wages and Employment: A Case Study of the Fast-Food Industry in New Jersey and Pennsylvania.” American Economic Review 84 (4): 772.

Cengiz, Doruk, Arindrajit Dube, Attila Lindner, and Ben Zipperer. 2019. “The Effect of Minimum Wages on Low-Wage Jobs.” The Quarterly Journal of Economics 134 (3): 1405–54.

de Chaisemartin, Clement, and Xavier D’Haultfoeuille. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.

Dube, Arindrajit, Daniele Girardi, Òscar Jordà, and Alan M Taylor. 2023. “A Local Projections Approach to Difference-in-Differences Event Studies.”

Gardner, John, Neil Thakral, Linh T Tô, and Luther Yap. 2023. “Two-Stage Differences in Differences.”

Ghanem, Dalia, Pedro H. C. Sant’Anna, and Kaspar Wüthrich. 2024. “Selection and Parallel Trends.”

Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.

Harmon, Nikolaj A. 2023. “Difference-in-Differences and Efficient Estimation of Treatment Effects.”

Lee, Soo Jeong, and Jeffrey M Wooldridge. 2023. “A Simple Transformation Approach to Difference-in-Differences Estimation for Panel Data.”

Marcus, Michelle, and Pedro HC Sant’Anna. 2021. “The Role of Parallel Trends in Event Study Settings: An Application to Environmental Economics.” Journal of the Association of Environmental and Resource Economists 8 (2): 235–75.

Roth, Jonathan, and Pedro HC Sant’Anna. 2023. “When Is Parallel Trends Sensitive to Functional Form?” Econometrica 91 (2): 737–47.

Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99.

Wooldridge, Jeff. 2021. “Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators.”
