Modern Approaches to Difference-in-Differences

.title[
# Modern Approaches to Difference-in-Differences
]
.author[
### Brantly Callaway, University of Georgia
]
.date[
### June 1, 2023 NEXT-D Workshop at Tulane University
]

---

# Introduction

`$$\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\mathrm{var}}
\newcommand{\cov}{\mathrm{cov}}
\newcommand{\Var}{\mathrm{var}}
\newcommand{\Cov}{\mathrm{cov}}
\newcommand{\Corr}{\mathrm{corr}}
\newcommand{\corr}{\mathrm{corr}}
\newcommand{\L}{\mathrm{L}}
\renewcommand{\P}{\mathrm{P}}
\newcommand{\independent}{{\perp\!\!\!\perp}}
\newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }$$`


Difference-in-differences (DID) is an extremely popular identification strategy for trying to recover the causal effect of some treatment on some outcome of interest

There have been a number of important advances in our understanding of DID over the past few years:

* Limitations of two-way fixed effects (TWFE) regressions as a way to implement a DID identification strategy

* Alternative estimation strategies that are robust to treatment effect heterogeneity

* Extensions of these alternative approaches along a number of empirically relevant dimensions

Today: Overview of recent work `$+$` a (fairly) detailed empirical application with code

Reference: Callaway (2023, *Handbook of Labor, Human Resources and Population Economics*)

---

# Outline

1. Introduction to Difference-in-Differences

2. Overview of Issues with TWFE Regressions

3. Alternative Estimation Strategies

4. Empirical Example: Minimum Wages and Employment

---

# Introduction to Difference-in-Differences

---

# The Logic of DID

Exploit a data structure where the researcher observes:

1. Multiple periods of data

2. Some pre-treatment data for all units

3. Some units become treated while other units remain untreated

Running Example: The effect of a state-level minimum wage increase on employment

---

# The Logic of DID

Intuition for DID identification strategy is to compare:

- The change in outcomes over time for units that participate in the treatment to
    
- The change in outcomes over time for units that didn't participate in the treatment

Rough intuition: Compare a treated unit's outcomes to its own past outcomes, while adjusting for "common shocks" using the comparison group.  [See: Heckman, Ichimura, and Todd (1997), Blundell and Costa Dias (2009), Gardner (2021), Ghanem, Sant'Anna, and Wuthrich (2022) for more details about when/why this procedure makes sense.]

DID identification strategies allow for treatment effect heterogeneity

* This is going to be a major issue in the discussion below

---

# Textbook Version of DID

Data:

* 2 periods: `$t^*$`, `$t^*-1$`

* No one treated until period `$t^*$`
    
    * Some units remain untreated in period `$t^*$`

* 2 groups: `$D=1$` or `$D=0$` (treated and untreated)

Potential Outcomes: `$Y_{it}(1)$` and `$Y_{it}(0)$`

Observed Outcomes: `$Y_{it^*}$` and `$Y_{it^*-1}$`

`\begin{align*}
  Y_{it^*} = D_i Y_{it^*}(1) +(1-D_i)Y_{it^*}(0) \quad \textrm{and} \quad Y_{it^*-1} = Y_{it^*-1}(0)
\end{align*}`

---

# Textbook Version of DID (cont'd)
Target Parameter: 
`$$ATT = \E[Y_{t^*}(1) - Y_{t^*}(0) | D=1]$$`

Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group

Parallel Trends Assumption: 
`$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$`
Explanation: Mean path of untreated potential outcomes is the same for the treated group as for the untreated group

Identification: Under PTA, we can identify `$ATT$`:
$$
`\begin{aligned}
ATT &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*}(0) | D=1]
\end{aligned}`
$$

---

count:false
# Textbook Version of DID (cont'd)
Target Parameter: 
`$$ATT = \E[Y_{t^*}(1) - Y_{t^*}(0) | D=1]$$`

Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group

Identification: Under PTA, we can identify `$ATT$`:
$$
`\begin{aligned}
ATT &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*}(0) | D=1]\\
&= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*} | D=0]
\end{aligned}`
$$
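A quick numerical check of this identification result, as a minimal sketch in base R. The data generating process, the sample size, and all names (`n`, `D`, `eta`, `att_true`, `y_pre`, `y_post`) are illustrative assumptions and are not part of the application later in the talk.

```r
set.seed(1)
n <- 5000
D <- rbinom(n, 1, 0.4)              # treated-group indicator
eta <- rnorm(n) + D                 # unit effects: treated units have higher levels
att_true <- 2                       # assumed treatment effect in period t*

# untreated potential outcomes: unit effect + common time shock (0, then 1)
y_pre   <- eta + rnorm(n)           # period t*-1: nobody is treated yet
y0_post <- eta + 1 + rnorm(n)       # period t*: untreated potential outcome
y_post  <- y0_post + att_true * D   # observed outcome in period t*

# DID: change over time for the treated minus change over time for the untreated
dy <- y_post - y_pre
did_hat <- mean(dy[D == 1]) - mean(dy[D == 0])

# comparison of post-period levels, which ignores the difference in unit effects
naive_hat <- mean(y_post[D == 1]) - mean(y_post[D == 0])

c(did = did_hat, naive = naive_hat, truth = att_true)
```

In this design parallel trends holds by construction, so the DID estimate should land near `att_true`, while the comparison of levels also picks up the assumed level difference of 1 between groups.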

---

# Setup w/ Staggered Treatment Adoption

- `$\mathcal{T}$` time periods

- Units can become treated at different points in time

- For simplicity, we'll adopt the staggered treatment adoption framework. That is, once a unit becomes treated, it remains treated.

- `$G_i$` - a unit's group - the time period in which the unit becomes treated. Also, define `$U_i=1$` for never-treated units and `$U_i=0$` otherwise.

- Potential outcomes: `$Y_{it}(g)$` - the outcome that unit `$i$` would experience in time period `$t$` if they became treated in period `$g$`.

- Untreated potential outcome: `$Y_{it}(0)$` - the outcome unit `$i$` would experience in time period `$t$` if they did not participate in the treatment in any period.

---

# Setup (cont'd)

- Observed outcome: `$Y_{it}=Y_{it}(G_i)$`

- No anticipation condition: `$Y_{it} = Y_{it}(0)$` for all `$t < G_i$` (pre-treatment periods for unit `$i$`)

Unit-level treatment effect
`$$\tau_{it}(g) = Y_{it}(g) - Y_{it}(0)$$`

Average treatment effect for unit `$i$` (across time periods):
`$$\bar{\tau}_i(g) = \frac{1}{\mathcal{T} - g + 1} \sum_{t=g}^{\mathcal{T}} \tau_{it}(g)$$`

---

# Target Parameters

* Group-time average treatment effects 
`\begin{align*}
 ATT(g,t) = \E[ \tau_t(G) | G=g]
\end{align*}`
Explanation: `$ATT$` for group `$g$` in time period `$t$`

* Event Study 
`\begin{align*}
 ATT^{ES}(e) = \E[\tau_{G+e}(G) | G \in \mathcal{G}_e]
\end{align*}`
where `$\mathcal{G}_e$` is the set of groups observed to have experienced the treatment for `$e$` periods at some point.

Explanation: `$ATT$` when units have been treated for `$e$` periods

* Overall ATT 
`\begin{align*}
 ATT^O = \E[\bar{\tau}(G) | U=0]
\end{align*}`
Explanation: `$ATT$` across all units that ever participate in the treatment

---

# Target Parameters

To understand the discussion later, it is also helpful to think of `$ATT(g,t)$` as a building block for the other parameters discussed above.

Notice that:

`\begin{align*}
  ATT^{ES}(e) = \sum_{g \in \bar{\mathcal{G}}} w^{ES}(g,e) ATT(g,g+e) \qquad \textrm{ and } \qquad ATT^O = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} w^O(g,t) ATT(g,t)
\end{align*}`
where
`\begin{align*}
  w^{ES}(g,e) = \indicator{g \in \mathcal{G}_e} \P(G=g|G\in \mathcal{G}_e) \qquad \textrm{and} \qquad
w^O(g,t) = \frac{\P(G=g|U=0)}{\mathcal{T}-g+1}
\end{align*}`

In other words, if we can identify/recover `$ATT(g,t)$`, then we can proceed to recover `$ATT^{ES}(e)$` and `$ATT^O$`.
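To make the weighting concrete, here is a small sketch in base R that aggregates a table of `$ATT(g,t)$` into `$ATT^O$` and `$ATT^{ES}(e)$` using the weights defined above. All inputs are hypothetical: the number of periods `T_max`, the group shares `p_g`, and the effects themselves (which are assumed to grow by 1 with each period of exposure).

```r
T_max <- 4                                  # number of time periods (assumed)
p_g <- c("2" = 0.5, "3" = 0.3, "4" = 0.2)   # hypothetical P(G = g | U = 0)

# hypothetical post-treatment ATT(g,t): 1 on impact, +1 per extra period treated
attgt <- expand.grid(g = 2:4, t = 1:T_max)
attgt <- attgt[attgt$t >= attgt$g, ]
attgt$e <- attgt$t - attgt$g                # event time (length of exposure)
attgt$att <- attgt$e + 1

# overall ATT: w^O(g,t) = P(G = g | U = 0) / (T - g + 1)
attgt$w_O <- unname(p_g[as.character(attgt$g)]) / (T_max - attgt$g + 1)
att_O <- sum(attgt$w_O * attgt$att)

# event study: w^ES(g,e) = P(G = g | G in G_e), i.e. shares renormalized
# over the groups that are actually observed at event time e
att_ES <- sapply(split(attgt, attgt$e), function(d) {
  w <- p_g[as.character(d$g)]
  sum(w / sum(w) * d$att)
})

att_O    # 1.65
att_ES   # 1, 2, 3 for e = 0, 1, 2
```

The overall weights sum to one, and each group's weight is split evenly across its post-treatment periods, so `att_O` is the share-weighted average of the group averages (2, 1.5, and 1 here).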

---

# Identification of `$ATT(g,t)$`

## Multiple Period Version of Parallel Trends Assumption

For all groups `$g \in \bar{\mathcal{G}}$` (all groups except the never-treated group) and for all time periods `$t=2,\ldots,\mathcal{T}$`,
`\begin{align*}
  \E[\Delta Y_{t}(0) | G=g] = \E[\Delta Y_{t}(0) | U=1]
\end{align*}`

Using arguments very similar to those in the two-period case, one can show that
`\begin{align*}
  ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1]
\end{align*}`

where the main difference is that we use `$(g-1)$` as the "base period" (this is the period right before group `$g$` becomes treated).
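The argument behind this expression can be spelled out as follows (a sketch for post-treatment periods `$t \geq g$`, using only the assumptions already stated):

`\begin{align*}
  ATT(g,t) &= \E[Y_t(g) - Y_t(0) | G=g] \\
  &= \E[Y_t - Y_{g-1} | G=g] - \E[Y_t(0) - Y_{g-1}(0) | G=g] \\
  &= \E[Y_t - Y_{g-1} | G=g] - \sum_{s=g}^{t} \E[\Delta Y_s(0) | G=g] \\
  &= \E[Y_t - Y_{g-1} | G=g] - \sum_{s=g}^{t} \E[\Delta Y_s(0) | U=1] \\
  &= \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1]
\end{align*}`

The second line adds and subtracts `$\E[Y_{g-1}(0) | G=g]$` and uses no anticipation (so that `$Y_{g-1} = Y_{g-1}(0)$` for group `$g$`); the third line writes the long difference as a sum of one-period changes; the fourth line applies parallel trends period by period; and the last line uses that observed outcomes for the never-treated group are untreated potential outcomes.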

---

# Overview of Issues with TWFE Regressions

---

# What does TWFE estimate in this setup?

For roughly 30 years, the dominant approach to implementing a DID identification strategy has been to run a two-way fixed effects regression:

`$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$`
--

In the "textbook" case above, you can show that `$\alpha = ATT$` `$\implies$` TWFE regression is robust to treatment effect heterogeneity

It's also super-convenient!

However, this robustness to treatment effect heterogeneity does not extend to more complicated settings:

* Staggered treatment adoption (this is the case I'll emphasize)

* More complicated treatments (e.g., continuous treatment) / moving into and out of the treatment

* Including covariates in the parallel trends assumption

---

# Goodman-Bacon (2021)

Goodman-Bacon (2021) intuition: `$\alpha$` "comes from" comparisons between the path of outcomes for units whose treatment status changes relative to the path of outcomes for units whose treatment status stays the same over time.

* Some comparisons are for groups that become treated to not-yet-treated groups (these are very much in the spirit of DID)

* Other comparisons are for groups that become treated relative to already-treated groups (these comparisons are not rationalized by parallel trends assumptions)

This can be especially problematic when there are treatment effect dynamics.  Dynamics imply different trends from what would have happened absent the treatment.

---

# de Chaisemartin and d'Haultfoeuille (2020)

de Chaisemartin and d'Haultfoeuille (2020) intuition: You can write `$\alpha$` as a weighted average of `$ATT(g,t)$`

First, a decomposition:
`\begin{align*}
\alpha &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}}  w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \\
  & + \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{g-1} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big)
\end{align*}`

--
  
Second, under parallel trends:  
`\begin{align*}
\alpha = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} w^{TWFE}(g,t) ATT(g,t)
\end{align*}`

* But the weights are (non-transparently) driven by the estimation method

* These weights have some good / bad / strange properties such as possibly being negative
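A stylized illustration of how far this can go, as a sketch in base R. The design is an assumption chosen for clarity: 3 periods, an early group treated in period 2, a late group treated in period 3, no never-treated group, and no noise. Treatment effects are positive everywhere (1 on impact, 4 after one period of exposure), yet the TWFE coefficient is negative, because the period-3 comparison uses the already-treated early group as the comparison group.

```r
# 4 units, 3 periods: units 1-2 treated from period 2, units 3-4 from period 3
panel <- expand.grid(id = 1:4, t = 1:3)
panel$G <- ifelse(panel$id <= 2, 2, 3)          # first treated period
panel$D <- as.numeric(panel$t >= panel$G)       # treatment indicator
expo <- panel$t - panel$G                       # periods of exposure

# assumed effects: 1 on impact, 4 after one period of exposure
panel$tau <- ifelse(panel$D == 1, ifelse(expo == 0, 1, 4), 0)

# untreated outcomes satisfy parallel trends: unit effect + time effect
panel$Y <- panel$id + 0.5 * panel$t + panel$tau

twfe <- lm(Y ~ D + factor(id) + factor(t), data = panel)
alpha_twfe <- unname(coef(twfe)["D"])
att_treated <- mean(panel$tau[panel$D == 1])    # average effect among treated obs

c(alpha_twfe = alpha_twfe, att_treated = att_treated)   # -0.5 and 2
```

The sign flip here is driven entirely by treatment effect dynamics in the early group; with constant effects the same regression would recover the common effect.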

---

# Event Study Regressions

Event study regressions are popular in empirical work.  
`\begin{align*}
  Y_{it} = \theta_t + \eta_i + \sum_{e=-(\mathcal{T}-1)}^{-2} \beta_e D_{it}^e + \sum_{e=0}^{\mathcal{T}} \beta_e D_{it}^e + v_{it}
\end{align*}`
where `$D_{it}^e = \indicator{G_i + e = t}$` is a binary indicator of having been treated for exactly `$e$` periods in period `$t$`

Typically, researchers interpret:

* Post-treatment event study coefficients as dynamic effects

* Use pre-treatment coefficients as a pre-test

Sun and Abraham (2021) show that the event study regression suffers from issues similar to those of the TWFE regression

* `$\beta_e$` can include effects from incorrect lengths of exposure

* weights on `$ATT(g,t)$` are non-transparent and driven by the estimation method and can be negative

---

# Alternative Estimation Strategies

---

# Alternative Approaches

We'll discuss:

1. Callaway and Sant'Anna (2021), R: `did`, Stata: `csdid`

2. Sun and Abraham (2021), R: `fixest`, Stata: `eventstudyinteract`

3. Wooldridge (2021), R: `etwfe`, Stata: `jwdid`

4. Gardner (2021) / Borusyak, Jaravel, Spiess (2022), R: `did2s`, Stata: `did2s` and `did_imputation`

Not including:

1. "Clean controls" (Cengiz, Dube, Lindner, and Zipperer (2019) and Dube, Girardi, Jorda, and Taylor (2023)), Stata: `stackedev`

2. de Chaisemartin and d'Haultfoeuille (2020), R: `DIDmultiplegt`, Stata: `did_multiplegt`

---

# Callaway and Sant'Anna (2021)

Key idea: Separate identification and estimation:

* Under parallel trends, recall that
`$$ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1]$$`
    
--

Estimation:
`$$\widehat{ATT}^{CS}(g,t) = \frac{1}{n}\sum_{i=1}^n \frac{\indicator{G_i = g}}{\hat{p}_g} (Y_{it} - Y_{ig-1}) - \frac{1}{n}\sum_{i=1}^n \frac{\indicator{U_i = 1}}{\hat{p}_U} (Y_{it} - Y_{ig-1})$$`

2nd step: Recall that group-time average treatment effects are building blocks for more aggregated parameters such as `$ATT^{ES}(e)$` and `$ATT^O$` `$\implies$` just plug in the estimates of `$ATT(g,t)$`

* `$\implies$` two-step estimation procedure: target local/disaggregated `$ATT(g,t)$` in first step, then (if desired) aggregate them into lower dimensional parameters

---

# Sun and Abraham (2021)

Intuition: The event study regression is "underspecified" `$\implies$` heterogeneous effects can "confound" the treatment effect estimates

Solution: Run fully interacted regression:
`\begin{align*}
 Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{e \neq -1} \delta^{SA}_{ge} \indicator{G_i=g} \indicator{g+e=t} + v_{it}
\end{align*}`

2nd step: Aggregate `$\delta^{SA}_{ge}$`'s across groups (usually into an event study).

* This sidesteps issues with the event study regression coming from treatment effect heterogeneity

* For inference, need to account for two-step estimation procedure

---

# Wooldridge (2021)

Main question: Are issues in DID literature due to limitations of TWFE regressions themselves or something else?

Proposes running "more interacted" TWFE regression:

`\begin{align*}
  Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{s=g}^{\mathcal{T}} \alpha_{gs}^W \indicator{G_i=g, t=s} + v_{it}
\end{align*}`

This is quite similar to Sun and Abraham (2021), except that it doesn't include interactions in pre-treatment periods. [The differences about `$(g,t)$` relative to `$(g,e)$` are trivial.]

* Like SA, this provides robustness to treatment effect heterogeneity by including more interactions

* However, unless you are mainly interested in `$ATT(g,t)$`, you have to do a second-step aggregation that (arguably) gives up the "killer feature" of the TWFE regression to begin with

---

# Gardner (2021) / BJS (2022)

Intuition: Parallel trends is closely connected to a TWFE model *for untreated potential outcomes*
`$$Y_{it}(0) = \theta_t + \eta_i + e_{it}$$`
--
Estimation:

* Step 1: Split data into treated and untreated observations

* Step 2: Estimate above model for the set of untreated observations

* Step 3: "Impute" `$\hat{Y}_{it}(0) = \hat{\theta}_t + \hat{\eta}_i$` for the treated observations

* `$\displaystyle \widehat{ATT}^{G/BJS}(g,t) = \frac{1}{n} \sum_{i=1}^n \frac{\indicator{G_i=g}}{\hat{p}_g} \Big(Y_{it} - \hat{Y}_{it}(0)\Big) \xrightarrow{p} ATT(g,t)$`

Can compute other treatment effect parameters too.
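The steps above can be sketched in base R on simulated data. Everything about the simulation is an assumption made for illustration: the group coding (`G = 0` for never-treated), the sample size, and a true effect equal to 1 plus the number of periods of exposure. The sketch uses `lm()` with unit and time dummies rather than any of the packages listed earlier.

```r
set.seed(2)
n <- 600; T_max <- 4
G <- sample(c(0, 2, 3), n, replace = TRUE)      # 0 = never treated (assumed coding)
eta <- rnorm(n) + 0.5 * (G > 0)                 # unit effects, correlated with treatment

panel <- expand.grid(id = 1:n, t = 1:T_max)
panel$G <- G[panel$id]
panel$treated <- panel$G > 0 & panel$t >= panel$G
panel$tau <- ifelse(panel$treated, 1 + (panel$t - panel$G), 0)   # true effects
panel$Y <- eta[panel$id] + 0.3 * panel$t + panel$tau +
  rnorm(nrow(panel), sd = 0.5)

# Steps 1-2: estimate the TWFE model for Y(0) using untreated observations only
fit0 <- lm(Y ~ factor(id) + factor(t), data = panel[!panel$treated, ])

# Step 3: impute Y(0) for the treated observations
tr <- panel[panel$treated, ]
tr$effect_hat <- tr$Y - predict(fit0, newdata = tr)

# Step 4: average the imputed effects within each (g,t) cell
att_imp <- aggregate(cbind(effect_hat, tau) ~ G + t, data = tr, FUN = mean)
att_imp   # effect_hat should be close to tau (the truth) in each cell
```

Every unit has at least one untreated period and the never-treated group is observed in every period, so all unit and time effects needed for the imputation are estimated from untreated observations.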

---

# Similarities and Differences

In my view, all of the approaches discussed above are fundamentally similar to each other.

In practice, it is sometimes possible to get different results though this is often driven by

* Different choices in terms of default implementation details in computer code

* Different estimation strategies trading off efficiency and robustness in different ways

---

# Comparison 1: CS and SA

In post-treatment periods, these give numerically identical results: `$\widehat{ATT}^{CS}(g,t) = \hat{\delta}^{SA}_{g,t-g}$`

* This is because a fully interacted regression (SA) is equivalent to taking differences in averages across groups (CS)

In pre-treatment periods, code will give different pre-treatment estimates, but this is due to different default choices

* In SA, all results are relative to a fixed base period (typically the period right before treatment)

* In CS, by default, in pre-treatment periods, estimates are of placebo policy effects on impact (i.e., the base period is always the most recent pre-treatment period)

Similarly, results will be different if you choose a different comparison group in CS (e.g., not-yet-treated vs. never-treated).

In both cases, these are just different choices though, and, for example, it is feasible (and easy) to set a fixed base period in CS
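The numerical equivalence can be checked directly. Below is a minimal sketch in base R on a simulated balanced panel (the design, group coding `G = 0` for never-treated, and all names are assumptions). CS is computed by hand as a difference of mean outcome paths with a fixed base period `$g-1$` and the never-treated comparison group; SA is the fully interacted regression run with `lm()`. Because the base period is fixed in both, the estimates coincide in pre-treatment periods too.

```r
set.seed(3)
n <- 300; T_max <- 4
G <- sample(c(0, 3, 4), n, replace = TRUE)      # 0 = never treated (assumed coding)
panel <- expand.grid(id = 1:n, t = 1:T_max)
panel$G <- G[panel$id]
panel$Y <- rnorm(n)[panel$id] + 0.3 * panel$t +
  (panel$G > 0 & panel$t >= panel$G) + rnorm(nrow(panel))

# CS by hand: mean path from g-1 to t for group g minus the same path
# for the never-treated group
cs_attgt <- function(g, t) {
  path <- function(ids) {
    mean(panel$Y[panel$id %in% ids & panel$t == t]) -
      mean(panel$Y[panel$id %in% ids & panel$t == g - 1])
  }
  path(which(G == g)) - path(which(G == 0))
}

# SA: one dummy per (group, period) cell, omitting e = -1 for each group
cells <- subset(expand.grid(g = c(3, 4), t = 1:T_max), t != g - 1)
for (k in seq_len(nrow(cells))) {
  panel[[paste0("d_", cells$g[k], "_", cells$t[k])]] <-
    as.numeric(panel$G == cells$g[k] & panel$t == cells$t[k])
}
rhs <- c("factor(id)", "factor(t)", grep("^d_", names(panel), value = TRUE))
sa_fit <- lm(as.formula(paste("Y ~", paste(rhs, collapse = " + "))), data = panel)

cells$sa <- unname(coef(sa_fit)[paste0("d_", cells$g, "_", cells$t)])
cells$cs <- mapply(cs_attgt, cells$g, cells$t)
cells   # the sa and cs columns agree up to floating point error
```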

---

# Comparison 2: SA and Wooldridge

These are clearly closely related, with the difference amounting to whether or not one includes indicators for pre-treatment periods.

It is fair to see this as a way to trade off robustness and efficiency

* If parallel trends holds across all time periods, then Wooldridge will deliver more efficient estimates (as effectively all pre-treatment periods are used as base periods)

* If parallel trends is violated in some pre-treatment periods but holds post-treatment, Wooldridge estimates will be inconsistent, but SA estimates will be robust to violations of parallel trends in pre-treatment periods.

---

# Comparison 3: Wooldridge and Gardner/BJS

Wooldridge and Gardner/BJS give numerically the same estimates: `$\hat{\alpha}^W_{gt} = \widehat{ATT}^{G/BJS}(g,t)$`

Intuition: similar to equivalence between Oaxaca-Blinder decompositions and regression adjustment (i.e., including interactions is equivalent to estimating separate models by group).
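This equivalence can also be checked numerically. The sketch below (base R, simulated balanced panel, with the design and all names assumed for illustration) runs the regression with one dummy per post-treatment `$(g,t)$` cell and compares its coefficients with cell averages of imputed effects.

```r
set.seed(4)
n <- 300; T_max <- 4
G <- sample(c(0, 2, 3), n, replace = TRUE)      # 0 = never treated (assumed coding)
panel <- expand.grid(id = 1:n, t = 1:T_max)
panel$G <- G[panel$id]
panel$treated <- panel$G > 0 & panel$t >= panel$G
panel$Y <- rnorm(n)[panel$id] + 0.3 * panel$t + panel$treated + rnorm(nrow(panel))

# Wooldridge-style regression: a separate dummy for each treated (g,t) cell
panel$cell <- factor(ifelse(panel$treated,
                            paste0("g", panel$G, "_t", panel$t), "untreated"))
panel$cell <- relevel(panel$cell, ref = "untreated")
w_fit <- lm(Y ~ factor(id) + factor(t) + cell, data = panel)

# Imputation: fit on untreated observations, impute Y(0) for treated ones
fit0 <- lm(Y ~ factor(id) + factor(t), data = panel[!panel$treated, ])
tr <- panel[panel$treated, ]
tr$gap <- tr$Y - predict(fit0, newdata = tr)
att_imp <- tapply(tr$gap, droplevels(tr$cell), mean)

# line up the two sets of estimates cell by cell
alpha_w <- coef(w_fit)[paste0("cell", names(att_imp))]
cbind(wooldridge = as.numeric(alpha_w), imputation = as.numeric(att_imp))
```

The two columns agree up to floating point error in this balanced design; the sketch does not speak to unbalanced panels or to specifications with covariates.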

---

# Comments

The above discussion emphasizes the similarities between different proposed alternatives to TWFE regressions in the literature.

The differences also seem to be mainly driven by different implementation choices.  Examples:

* It's possible to come up with an imputation estimator that uses the base period right before treatment only `$\implies$` `$\uparrow$` robustness, `$\downarrow$` efficiency

* It's also possible to do a version of CS with more base periods `$\implies$` `$\uparrow$` efficiency `$\downarrow$` robustness

* Other variations: build-the-trend (i.e., path relative to average pre-treatment outcome) and GMM; see Callaway (2023) and Marcus and Sant'Anna (2021).

---

# Empirical Example: Minimum Wages and Employment

---

# Example: Minimum Wage

- Use county-level data from 2003-2007 during a period where the federal minimum wage was flat

- Exploit minimum wage changes across states

- Any state that increases its minimum wage above the federal minimum wage will be considered as treated
  
--

- Interested in the effect of the minimum wage on teen employment

- We'll also make a number of simplifications:

    * not worry much about issues like clustered standard errors
    
    * not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states

Goal: How much do the issues that we have been talking about matter in practice?

---

# Code

Full code is available on my website: [https://bcallaway11.github.io/files/presentations/NEXT-D](https://bcallaway11.github.io/files/presentations/NEXT-D), or via the link on my homepage [brantlycallaway.com](https://www.brantlycallaway.com)

R packages used in empirical example

```r
library(did)
library(BMisc)
library(twfeweights)
library(fixest)
library(modelsummary)
library(ggplot2)
load(url("https://github.com/bcallaway11/did_chapter/raw/master/mw_data_ch2.RData"))
```

---

# Setup Data

```r
# drops NE region and a couple of small groups
mw_data_ch2 <- subset(mw_data_ch2, (G %in% c(2004,2006,2007,0)) & (region != "1"))
head(mw_data_ch2[,c("id","year","G","lemp","lpop","region")])
```

```
##       id year    G     lemp     lpop region
## 554 8003 2001 2007 5.556828 9.614137      4
## 555 8003 2002 2007 5.356586 9.623972      4
## 556 8003 2003 2007 5.389072 9.620859      4
## 557 8003 2004 2007 5.356586 9.626548      4
## 558 8003 2005 2007 5.303305 9.637958      4
## 559 8003 2006 2007 5.342334 9.633056      4
```

```r
# drop the G = 2007 group, as these units are treated right before the fed. minimum wage change
data2 <- subset(mw_data_ch2, G!=2007 & year >= 2003)
# keep 2007 => larger sample size
data3 <- subset(mw_data_ch2, year >= 2003)
```

---

# TWFE Regression

```r
twfe_res2 <- fixest::feols(lemp ~ post | id + year,
 data=data2,
 cluster="id")

modelsummary(list(twfe_res2), gof_omit=".*")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> </th>
 <th style="text-align:center;"> Model 1 </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> post </td>
 <td style="text-align:center;"> −0.038 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> </td>
 <td style="text-align:center;"> (0.008) </td>
 </tr>
</tbody>
</table>

---

# `$ATT(g,t)$` (Callaway and Sant'Anna)

```r
attgt <- did::att_gt(yname="lemp",
 idname="id",
 gname="G",
 tname="year",
 data=data2,
 control_group="nevertreated",
 base_period="universal")
tidy(attgt)[,1:5] # print results, drop some extra columns
```

```
##              term group time    estimate   std.error
## 1  ATT(2004,2003)  2004 2003  0.00000000          NA
## 2  ATT(2004,2004)  2004 2004 -0.03266653 0.020884500
## 3  ATT(2004,2005)  2004 2005 -0.06827991 0.020712351
## 4  ATT(2004,2006)  2004 2006 -0.12335404 0.020682602
## 5  ATT(2004,2007)  2004 2007 -0.13109136 0.022523279
## 6  ATT(2006,2003)  2006 2003 -0.03408910 0.011617027
## 7  ATT(2006,2004)  2006 2004 -0.01669977 0.007396980
## 8  ATT(2006,2005)  2006 2005  0.00000000          NA
## 9  ATT(2006,2006)  2006 2006 -0.01939335 0.009217105
## 10 ATT(2006,2007)  2006 2007 -0.06607568 0.009311762
```

---

# Plot `$ATT(g,t)$`'s

---

# Compute `$ATT^O$`

```r
attO <- did::aggte(attgt, type="group")
summary(attO)
```

```
## 
## Call:
## did::aggte(MP = attgt, type = "group")
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 
## 
## 
## Overall summary of ATT's based on group/cohort aggregation: 
## ATT Std. Error [ 95% Conf. Int.] 
## -0.0571 0.008 -0.0728 -0.0414 *
## 
## 
## Group Effects:
## Group Estimate Std. Error [95% Simult. Conf. Band] 
## 2004 -0.0888 0.0203 -0.1316 -0.0461 *
## 2006 -0.0427 0.0076 -0.0587 -0.0268 *
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## Control Group: Never Treated, Anticipation Periods: 0
## Estimation Method: Doubly Robust
```
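As a sanity check, the group effects in this output can be reproduced by hand from the `$ATT(g,t)$` estimates printed earlier: each group effect is the simple average of that group's post-treatment `$ATT(g,t)$`. The sketch below copies the printed estimates. The last step, backing out the share of group 2004 among treated units from the rounded overall estimate, is only an approximation under the assumption that the overall estimate is the group-share-weighted average of the two group effects.

```r
# post-treatment ATT(g,t) estimates copied from the att_gt() output
att_2004 <- c(-0.03266653, -0.06827991, -0.12335404, -0.13109136)  # t = 2004,...,2007
att_2006 <- c(-0.01939335, -0.06607568)                            # t = 2006, 2007

# group effect = average of ATT(g,t) over that group's post-treatment periods
theta_2004 <- mean(att_2004)
theta_2006 <- mean(att_2006)
round(c(`2004` = theta_2004, `2006` = theta_2006), 4)   # -0.0888 and -0.0427

# overall ATT = share-weighted average of the group effects, so the reported
# (rounded) overall estimate implies an approximate share for group 2004
att_overall <- -0.0571
share_2004 <- (att_overall - theta_2006) / (theta_2004 - theta_2006)
share_2004   # roughly 0.31; approximate because the inputs are rounded
```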

---

# Comments

The differences between the CS estimates and the TWFE estimates are fairly large here: the CS estimate is about 50% larger than the TWFE estimate, though results are qualitatively similar.

Let's see if we can figure out what's going on...

---

# de Chaisemartin and d'Haultfoeuille weights

---

# `$ATT^O$` weights

---

# Weight Comparison

---

# Discussion

To summarize: `$ATT^O = -0.057$` while `$\alpha^{TWFE} = -0.038$`.  This difference can be fully accounted for by:

* Pre-treatment differences in paths of outcomes across groups: explains about 64% of the difference

* Differences in weights applied to the same post-treatment `$ATT(g,t)$`: explains about 36% of the difference. [If you apply the post-treatment weights and "zero out" pre-treatment differences, the estimate would be `$-0.050$`.]

In my experience: this is fairly representative of how much new DID approaches matter relative to TWFE regressions.  It does not seem like "catastrophic failure" of TWFE, but (in my view) these are meaningful differences (and, e.g., given slightly different `$ATT(g,t)$`'s, the difference in the weighting schemes could change the qualitative results).

* Of course, this whole discussion hinges crucially on how much treatment effect heterogeneity there is.  More TE Het `$\implies$` more sensitivity to weighting schemes [just looking at TWFE regression does not give insight into how much TE Het there is.]

---

# Additional Comments

One more comment: there is a lot of concern about negative weights (both in econometrics and empirical work).

* There were no negative weights in the example above, but the weights still weren't great.

* Having no negative weights does rule out "sign reversal"

* But, in my view, the more important issue is the non-transparent weighting scheme.

* Ex. If you try using `data3` (the data that includes `$G=2007$`), you will get a negative weight on `$ATT(g=2004,t=2007)$`.  But it turns out not to matter much, and TWFE works better in this case than in the case that I showed you.

---

# Bonus Material

[Bonus Material 1: Including Covariates in the Parallel Trends Assumption](#covs)

[Bonus Material 2: Dealing with Violations of Parallel Trends](#violations)

---

# Conclusion

That's all!  Thank you very much for inviting me.

Email: [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu)

---

# Covariates in the Parallel Trends Assumption

## Conditional Parallel Trends Assumption

For all time periods,

`$$\E[\Delta Y_t(0) | X_t, X_{t-1},Z,D=1] = \E[\Delta Y_t(0) | X_t, X_{t-1},Z,D=0]$$`
--

In words: Parallel trends holds conditional on having the same covariates `$X$`.

Minimum wage example: path of teen employment may depend on a state's population / population growth / region of the country

---

# Limitations of TWFE Regressions

In this setting, it is common to run the following TWFE regression:

`$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + v_{it}$$`

However:

* Issues related to multiple periods and variation in treatment timing still arise

* It's hard to allow for the path of untreated potential outcomes to depend on time-invariant covariates

* Mixes identification and estimation...e.g., with 2 periods
`\begin{align*}
\Delta Y_{it} = \Delta \theta_t + \alpha D_{it} + \Delta X_{it}'\beta + \Delta v_{it}
\end{align*}`
    `$\implies$` differencing out unit fixed effects can have implications about what researcher controls for

* This doesn't matter if model is truly linear

* However, if we think of linear model as an approximation, this may have meaningful implications.

* See Caetano and Callaway (2023) for more details

---

# Identification / Estimation

Can show that (under conditional PTA):

`$$ATT = \E[\Delta Y_t | D=1] - \E\Big[ \E[\Delta Y_t | X, D=0] \Big| D=1\Big]$$`

Intuition: (i) Compare path of outcomes for treated group to (conditional on covariates) path of outcomes for untreated group, (ii) adjust for differences in the distribution of covariates between groups.

This expression suggests a "regression adjustment" estimator.

It is easy to extend these arguments to multiple periods and variation in treatment timing

---

# Doubly Robust

Alternatively, you can show

`$$ATT=\E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{(1-p(X))p} \right)(\Delta Y_t - \E[\Delta Y_t | X, D=0]) \right]$$`
--

This requires estimating both `$p(X)$` and `$\E[\Delta Y_t|X,D=0]$`.

Big advantage:

- This expression for `$ATT$` is *doubly robust*. This means that it will deliver consistent estimates of `$ATT$` if either the model for `$p(X)$` or the model for `$\E[\Delta Y_t|X,D=0]$` is correctly specified.

- In my experience, doubly robust estimators perform much better than either the regression or propensity score weighting estimators

- This also provides a connection to estimating `$ATT$` under conditional parallel trends using machine learning for `$p(X)$` and `$\E[\Delta Y_t|X,D=0]$` (see: Chang (2020) and Callaway, Drukker, Liu, and Sant'Anna (2023))
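A minimal two-period sketch of the regression adjustment and doubly robust estimators in base R. The simulated design is an assumption: a single covariate `X` shifts both the probability of treatment and the trend in untreated potential outcomes, so unconditional parallel trends fails while conditional parallel trends holds. Here the outcome model is linear and the propensity score is a logit, and both happen to be correctly specified; the code is a direct sample analog of the expressions above, not the implementation used in the `did` package.

```r
set.seed(5)
n <- 5000
X <- rnorm(n)
D <- rbinom(n, 1, plogis(0.5 * X))           # treatment is more likely for high X
att_true <- 1
dY <- 1 + 2 * X + att_true * D + rnorm(n)    # change in outcomes; trend depends on X

# unconditional DID: biased because X is distributed differently across groups
did_naive <- mean(dY[D == 1]) - mean(dY[D == 0])

# regression adjustment: model E[dY | X, D = 0] on the untreated group,
# then average its predictions over the treated group
m0_fit <- lm(dY ~ X, subset = D == 0)
m0 <- predict(m0_fit, newdata = data.frame(X = X))
att_ra <- mean(dY[D == 1]) - mean(m0[D == 1])

# doubly robust: outcome model residuals combined with propensity score weights
ps <- fitted(glm(D ~ X, family = binomial))  # estimated p(X)
p <- mean(D)                                 # estimated P(D = 1)
att_dr <- mean((D / p - ps * (1 - D) / ((1 - ps) * p)) * (dY - m0))

c(naive = did_naive, ra = att_ra, dr = att_dr, truth = att_true)
```

In this design both `att_ra` and `att_dr` should be close to `att_true`, while the unconditional comparison is pulled away from it by the difference in the distribution of `X` across groups.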

---

# Back to Minimum Wage Example

We'll allow for path of outcomes to depend on region of the country

```r
# run TWFE regression
twfe_x <- fixest::feols(lemp ~ post | id + region^year,
 data=data2)
modelsummary(twfe_x, gof_omit=".*")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
 <tr>
 <th style="text-align:left;"> </th>
 <th style="text-align:center;"> Model 1 </th>
 </tr>
 </thead>
<tbody>
 <tr>
 <td style="text-align:left;"> post </td>
 <td style="text-align:center;"> 0.001 </td>
 </tr>
 <tr>
 <td style="text-align:left;"> </td>
 <td style="text-align:center;"> (0.008) </td>
 </tr>
</tbody>
</table>

Relative to the previous results, this estimate is much smaller and statistically insignificant, and it is similar to the result in Dube et al. (2010).

---

# Use Doubly Robust Approach from CS

```
## 
## Call:
## aggte(MP = cs_x, type = "group")
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 
## 
## 
## Overall summary of ATT's based on group/cohort aggregation: 
## ATT Std. Error [ 95% Conf. Int.] 
## -0.0273 0.0085 -0.0438 -0.0107 *
## 
## 
## Group Effects:
## Group Estimate Std. Error [95% Simult. Conf. Band] 
## 2004 -0.0436 0.0204 -0.0892 0.0019 
## 2006 -0.0199 0.0079 -0.0376 -0.0022 *
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## Control Group: Never Treated, Anticipation Periods: 0
## Estimation Method: Doubly Robust
```

---

# Comments

Even more than in the previous case, the results in this case are notably different depending on the estimation strategy.

[back](#bonus)

---

# What about violations of parallel trends?

Parallel trends assumptions don't automatically hold in applications with repeated observations over time.

The most natural way to motivate parallel trends is with a linear model for untreated potential outcomes:
`\begin{align*}
  Y_{it}(0) = \theta_t + \eta_i + v_{it}
\end{align*}`
where the key feature is the additive separability of `$\eta_i$`

But it's not always clear if additive separability (and hence parallel trends) is reasonable

* The most common "response" is pre-testing...checking if parallel trends holds in pre-treatment periods

DID + pre-tests are a very powerful/useful approach to "validating" the parallel trends assumption

---

# What about our case?

---

# Partial Identification / Sensitivity Analysis

References: Manski and Pepper (2018), Rambachan and Roth (2021)

Two versions of sensitivity analysis in RR:

* Violations of parallel trends evolve smoothly

* Violations of parallel trends are "not too different" in post-treatment periods from the violations in pre-treatment periods
  
  - Will show results for this case
  
  - Allow for violations of parallel trends up to `$\bar{M}$` times as large as were observed in any pre-treatment period.

  - And we'll vary `$\bar{M}$`.

---

# What about violations of parallel trends?

---

[back](#bonus)