Modern Approaches to Difference in Differences

class: center, middle, inverse, title-slide

# Modern Approaches to Difference in Differences
### Brantly Callaway, University of Georgia
### October 22, 2021 <br><br>Session 2: Two Way Fixed Effects

---

# A More Complicated Setup

`$$\newcommand{\E}{\mathbb{E}}$$`
`$$\newcommand{\P}{\mathrm{P}}$$`

---

# A More Complicated Setup

- `$\mathcal{T}$` time periods

- Units can become treated at different points in time

- For simplicity, we'll adopt the <span class="alert-blue">staggered treatment framework</span>.  That is, once a unit becomes treated, it remains treated.

- `$G_i$` - a unit's <span class="alert-blue">group</span> - the time period in which the unit becomes treated.  Set `$G_i = \mathcal{T}+1$` for units that do not participate in the treatment in any period.

- Potential outcomes: `$Y_{it}(g)$` - the outcome that unit `$i$` would experience in time period `$t$` if they became treated in period `$g$`.

- Untreated potential outcome: `$Y_{it}(0)$` - the outcome unit `$i$` would experience in time period `$t$` if they did not participate in the treatment in any period.

- Observed outcome: `$Y_{it}=Y_{it}(G_i)$`

- No anticipation condition: `$Y_{it}(G_i) = Y_{it}(0)$` for all `$t < G_i$` (pre-treatment periods for unit `$i$`)
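
As a minimal sketch of this notation (the toy panel below is made up for illustration, and names such as `panel` and `n_periods` are assumptions, not part of the original slides), a unit's group pins down its treatment status in every period under staggered adoption:

```r
# Hypothetical toy panel: 3 units observed over n_periods = 4 periods
n_periods <- 4
G <- c(2, 3, n_periods + 1)   # unit 3 is never treated, so G = n_periods + 1

panel <- expand.grid(period = 1:n_periods, id = 1:3)
panel$G <- G[panel$id]

# Staggered treatment: once treated, always treated
panel$D <- as.integer(panel$period >= panel$G)

# Rows are units, columns are periods
xtabs(D ~ id + period, data = panel)
```

Each row of the printed table switches from 0 to 1 at most once, which is what "staggered" means here.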

---

# A More Complicated Setup

- A number of extensions (more complicated treatment regimes, anticipation effects, conditioning on covariates) are possible

## Multiple period version of parallel trends

For all groups `$g,k$` and all `$t=2,\ldots,\mathcal{T}$`,

`$$\E[\Delta Y_t(0) | G=g] = \E[\Delta Y_t(0) | G=k]$$`

In words: trends in untreated potential outcomes are the same across all groups
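
To make the assumption concrete, here is a small sketch under an assumed data-generating process, `$Y_{it}(0) = \theta_t + \eta_i$`, with made-up values for `theta` and `eta`. Levels of untreated potential outcomes differ across groups, but the average period-to-period changes do not:

```r
# Hypothetical untreated potential outcomes: Y_it(0) = theta_t + eta_i
set.seed(1)
n <- 300
n_periods <- 5
G <- sample(c(3, 4, n_periods + 1), n, replace = TRUE)
eta <- rnorm(n) + G            # unit effects differ systematically by group
theta <- c(0, 1, 3, 2, 5)      # common time effects

df <- expand.grid(id = 1:n, period = 1:n_periods)
df$G  <- G[df$id]
df$Y0 <- theta[df$period] + eta[df$id]

# First differences of Y(0) within unit (NA in the first period)
df <- df[order(df$id, df$period), ]
df$dY0 <- ave(df$Y0, df$id, FUN = function(y) c(NA, diff(y)))

# Average change in Y(0) by group and period (t = 2, ..., n_periods)
trends <- aggregate(dY0 ~ G + period, data = df, FUN = mean)
xtabs(dY0 ~ G + period, data = trends)
```

Every column of the printed table is constant across groups, which is exactly what the parallel trends condition requires.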

---

# What does TWFE estimate in this setup?

`$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$`
--

<span class="alert-blue">Rough intuition:</span> `$\alpha$` "comes from" comparisons between the path of outcomes for units whose <span class="alert">treatment status changes</span> relative to the path of outcomes for units whose <span class="alert">treatment status stays the same</span> over time.

We'll see that this intuition is pretty much right

But some of these "comparisons" have undesirable properties
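
As a point of reference, here is a sketch of a benign case, using base R's `lm` with unit and time dummies on a made-up panel. The assumption (mine, for illustration) is that the effect of treatment is the same for every unit in every treated period:

```r
# Hypothetical panel: groups first treated in periods 3 and 5, plus never-treated units
n_periods <- 6
G <- c(3, 3, 5, 5, 7, 7)               # 7 = n_periods + 1, i.e. never treated
eta <- c(1, 2, 3, 4, 5, 6)             # unit fixed effects
theta <- c(0, 1, 1, 2, 4, 3)           # time fixed effects

df <- expand.grid(period = 1:n_periods, id = seq_along(G))
df$D <- as.integer(df$period >= G[df$id])

# Assumption: the effect of treatment is 2 for every unit in every treated period
df$Y <- theta[df$period] + eta[df$id] + 2 * df$D

# TWFE via unit and time dummies
twfe <- lm(Y ~ D + factor(id) + factor(period), data = df)
alpha_hat <- unname(coef(twfe)["D"])
alpha_hat
```

With a constant effect and no noise, `alpha_hat` equals 2 up to floating-point error. The problems arise once effects vary across groups or with length of exposure.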

---

# Goodman-Bacon (2021)

<span class="alert-blue">Notation:</span>

For two groups `$g$` and `$k$` with `$k > g$` (i.e., group `$k$` treated after group `$g$`), define:

- `$\bar{Y}_i^{PRE(g)}$` - average outcome for individual `$i$` across periods before either group is treated

- `$\bar{Y}_i^{MID(g,k)}$` - average outcome for individual `$i$` across periods after group `$g$` becomes treated but before group `$k$` becomes treated

- `$\bar{Y}_i^{POST(k)}$` - average outcome for individual `$i$` across periods after both groups are treated

Further define:

- `$\bar{G}_g = \frac{\mathcal{T}-(g-1)}{\mathcal{T}}$` - the fraction of periods that units in group `$g$` are treated (this is bigger for earlier treated groups)

---

# Goodman-Bacon (2021)

<span class="alert-blue">Bacon Decomposition: </span> `$\alpha$` from the TWFE regression can be written as

`$$\sum_{g \in \mathcal{G}} \sum_{\substack{k \in \mathcal{G} \\ k>g}} \left[ w_1(g,k) \delta^{MID,PRE}(g,k) + w_2(g,k) \delta^{POST,MID}(g,k) \right]$$`
where `$w_1(g,k)$` and `$w_2(g,k)$` are positive weights satisfying

`$$\sum_{g \in \mathcal{G}} \sum_{\substack{k \in \mathcal{G} \\ k>g}} \left[ w_1(g,k) + w_2(g,k) \right] = 1$$`

[we'll come back to these momentarily]

---

# Goodman-Bacon (2021)

First main term in Bacon decomposition:

`$$\delta^{MID,PRE}(g,k) = \E\left[ \bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)} | G=g \right] - \E\left[ \bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)} | G=k \right]$$`

- The first term is the "path" of outcomes experienced by group `$g$` (post-treatment periods relative to pre-treatment periods)

- The second term, under the multiple period parallel trends assumption, is the path of outcomes that group `$g$` *would have experienced* if they had not become treated.

<span class="alert">Under parallel trends, these are exactly the sort of comparisons that we would like to show up in `$\alpha$`.</span>

---

# Goodman-Bacon (2021)

Second main component of Bacon decomposition:

`$$\delta^{POST,MID}(g,k) = \E\left[ \bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)} | G=k\right] - \E\left[ \bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)} | G=g \right]$$`

- The first term is the path of outcomes experienced by group `$k$` (post-treatment periods relative to pre-treatment periods)

- The second term is the path of outcomes experienced by group `$g$`.

- These are periods where group `$g$`'s treatment status does not change
    
    - But these are post-treatment time periods for group `$g$`
    
  
--

<span class="alert">However, parallel trends is not about paths of post-treatment outcomes...</span>
    
---

# Goodman-Bacon (2021)

By adding and subtracting terms to `$\delta^{POST,MID}(g,k)$`:

$$
`\begin{aligned}
\delta^{POST,MID}(g,k) &= \E\left[ \bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)} | G=k\right] - \E\left[ \bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)} | G=\mathcal{T}+1 \right] \\
& - \left\{\left(\E\left[ \bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)} | G=g\right] - \E\left[ \bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)} | G=\mathcal{T}+1 \right]\right) \right.\\
& \hspace{10pt} - \left.\left(\E\left[ \bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)} | G=g\right] - \E\left[ \bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)} | G=\mathcal{T}+1 \right]\right)\right\}
\end{aligned}`
$$

- The first term is "good"; under parallel trends, it is related to the effect of participating in the treatment for group `$k$`

- The second term (everything inside `$\{ \cdot \}$`), under parallel trends, is about <span class="alert-blue">treatment effect dynamics</span>

It is undesirable that treatment effect dynamics show up in `$\alpha$`
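
A small numerical sketch may help here. Everything in it is assumed for illustration: six periods, an early group `$g=3$`, a late group `$k=5$`, untreated potential outcomes equal to zero (so parallel trends holds trivially), and an effect that grows by one with each period of exposure.

```r
# Hypothetical example: 6 periods, group g = 3 (early), group k = 5 (late)
periods <- 1:6
g <- 3
k <- 5

# Assumptions: Y(0) = 0 for everyone, and the effect of treatment grows with
# exposure: 1 in the first treated period, then 2, 3, ...
effect <- function(G) pmax(periods - G + 1, 0)
Y_g <- effect(g)   # 0 0 1 2 3 4
Y_k <- effect(k)   # 0 0 0 0 1 2

pre  <- periods < g
mid  <- periods >= g & periods < k
post <- periods >= k

# "Good" comparison: group g switches into treatment, group k is not yet treated
delta_mid_pre <- (mean(Y_g[mid]) - mean(Y_g[pre])) -
  (mean(Y_k[mid]) - mean(Y_k[pre]))

# "Bad" comparison: group k switches, already-treated group g is the comparison
delta_post_mid <- (mean(Y_k[post]) - mean(Y_k[mid])) -
  (mean(Y_g[post]) - mean(Y_g[mid]))

c(delta_mid_pre = delta_mid_pre, delta_post_mid = delta_post_mid)
```

Here `delta_mid_pre` is 1.5 and `delta_post_mid` is -0.5. Every treatment effect is positive, yet the second comparison is negative because group `$g$`'s average outcome rises by 2 between MID and POST while group `$k$`'s rises by only 1.5.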

---

# Goodman-Bacon (2021)

All this suggests the following about `$\alpha$` from the TWFE regression under parallel trends assumptions:

- It is equal to a weighted average of (i) reasonable underlying treatment effect parameters, and (ii) treatment effect dynamics

- The "negative weights" result of de Chaisemartin and D'Haultfoeuille (2020) is due to the treatment effect dynamics term discussed here

- This opens up the possibility of really bad causal effect estimates due to TWFE.  An extreme case would be that the effect of participating in the treatment is positive for all groups and time periods, but that negative weights (treatment effect dynamics) lead to a negative TWFE estimate of the effect of the treatment

--
    
- You can introduce an extra (and testable) assumption ruling out treatment effect dynamics, but it seems more straightforward to just use a different estimation strategy
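
The extreme case can be sketched with a tiny made-up panel. The assumptions are mine: two units, six periods, no never-treated group, untreated potential outcomes equal to zero, and an effect equal to the square of the number of periods of exposure (so it is at least 1 whenever a unit is treated).

```r
# Hypothetical two-unit panel with no never-treated group
n_periods <- 6
G <- c(3, 5)
df <- expand.grid(period = 1:n_periods, id = 1:2)
df$G <- G[df$id]
df$D <- as.integer(df$period >= df$G)

# Assumptions: Y(0) = 0, and the effect is (periods of exposure)^2
df$effect <- ifelse(df$D == 1, (df$period - df$G + 1)^2, 0)
df$Y <- df$effect

# Smallest treatment effect among treated observations (positive)
min(df$effect[df$D == 1])

# TWFE via unit and time dummies
twfe <- lm(Y ~ D + factor(id) + factor(period), data = df)
alpha_hat <- unname(coef(twfe)["D"])
alpha_hat
```

In this sketch every treatment effect is at least 1, but `alpha_hat` comes out at -2.5: the strong dynamics for the early-treated unit dominate the estimate.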

---

# Goodman-Bacon (2021)

Back to the weights:

`$$w_1(g,k) = \frac{(1-\bar{G}_g)(\bar{G}_g - \bar{G}_k)(p_g + p_k)^2 p_{g|\{g,k\}}(1-p_{g|\{g,k\}})}{\textrm{normalizing constant}}$$`

`$$w_2(g,k) = \frac{\bar{G}_k(\bar{G}_g - \bar{G}_k)(p_g + p_k)^2 p_{g|\{g,k\}}(1-p_{g|\{g,k\}})}{\textrm{normalizing constant}}$$`
--

Both of these put more weight on:

1. larger groups, when `$p_g$` and/or `$p_k$` are large

2. similarly sized groups, `$p_{g|\{g,k\}}(1-p_{g|\{g,k\}})$` largest when `$p_{g|\{g,k\}}=0.5$`.

3. "middle" groups (those treated midway between the comparison group's treatment date and the beginning (for `$w_1$`) or end (for `$w_2$`) of the time periods)
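
The weight formulas above can be evaluated directly. The helper below is hypothetical (not from `bacondecomp`), it reads `$p_{g|\{g,k\}}$` as `$p_g/(p_g+p_k)$`, and it assumes groups 5 and 15 are the only two groups in a 20-period panel, so the normalizing constant is just the sum of the two numerators:

```r
# Numerators of the Bacon weights for a pair (g, k) with k > g.
# Hypothetical helper; p_g and p_k are the population shares of each group.
bacon_weight_numerators <- function(g, k, n_periods, p_g, p_k) {
  Gbar_g <- (n_periods - (g - 1)) / n_periods   # share of periods group g is treated
  Gbar_k <- (n_periods - (k - 1)) / n_periods   # share of periods group k is treated
  p_cond <- p_g / (p_g + p_k)                   # assumed meaning of p_{g|{g,k}}
  common <- (Gbar_g - Gbar_k) * (p_g + p_k)^2 * p_cond * (1 - p_cond)
  c(w1 = (1 - Gbar_g) * common, w2 = Gbar_k * common)
}

num <- bacon_weight_numerators(g = 5, k = 15, n_periods = 20,
                               p_g = 0.5, p_k = 0.5)

# Assumption: no other groups, so the weights must sum to one over this pair
w <- num / sum(num)
w
```

Under these assumptions `w1` is 0.4 and `w2` is 0.6, so most of the weight lands on the comparison that uses the already-treated group as the comparison group.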

---

# Callaway and Sant'Anna (2021)
Can we get around these issues with TWFE?

<span class="alert-blue">Group-Time Average Treatment Effects</span>

`$$ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g]$$`

This is analogous to the `$ATT$` in the baseline case with two periods and two groups

<span class="alert">Identification:</span>

$$
`\begin{aligned}
ATT(g,t) &= \E[Y_t(g) | G=g] - \E[Y_t(0) | G=g] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Callaway and Sant'Anna (2021)
Can we get around these issues with TWFE?

<span class="alert-blue">Group-Time Average Treatment Effects</span>

`$$ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g]$$`

This is analogous to the `$ATT$` in the baseline case with two periods and two groups

<span class="alert">Identification:</span>

$$
`\begin{aligned}
ATT(g,t) &= \E[Y_t(g) | G=g] - \E[Y_t(0) | G=g] \hspace{150pt}\\
&= \E[Y_t(g) - Y_{g-1}(0) | G=g] - \E[Y_t(0) - Y_{g-1}(0) | G=g]
\end{aligned}`
$$

---

count:false
# Callaway and Sant'Anna (2021)
Can we get around these issues with TWFE?

<span class="alert-blue">Group-Time Average Treatment Effects</span>

`$$ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g]$$`

This is analogous to the `$ATT$` in the baseline case with two periods and two groups

<span class="alert">Identification:</span>

$$
`\begin{aligned}
ATT(g,t) &= \E[Y_t(g) | G=g] - \E[Y_t(0) | G=g] \hspace{150pt}\\
&= \E[Y_t(g) - Y_{g-1}(0) | G=g] - \E[Y_t(0) - Y_{g-1}(0) | G=g]\\
&= \E[Y_t(g) - Y_{g-1}(0) | G=g] - \E[Y_t(0) - Y_{g-1}(0) | D_t=0]
\end{aligned}`
$$

---

count:false
# Callaway and Sant'Anna (2021)
Can we get around these issues with TWFE?

<span class="alert-blue">Group-Time Average Treatment Effects</span>

`$$ATT(g,t) = \E[Y_t(g) - Y_t(0) | G=g]$$`

This is analogous to the `$ATT$` in the baseline case with two periods and two groups

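<span class="alert">Identification:</span>

$$
`\begin{aligned}
ATT(g,t) &= \E[Y_t(g) | G=g] - \E[Y_t(0) | G=g] \hspace{150pt}\\
&= \E[Y_t(g) - Y_{g-1}(0) | G=g] - \E[Y_t(0) - Y_{g-1}(0) | G=g]\\
&= \E[Y_t(g) - Y_{g-1}(0) | G=g] - \E[Y_t(0) - Y_{g-1}(0) | D_t=0]\\
&= \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | D_t=0]
\end{aligned}`
$$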

---

# Callaway and Sant'Anna (2021)
<span class="alert-blue">Estimation</span>

$$
`\begin{aligned}
\widehat{ATT}(g,t) &= \hat{\E}[Y_t - Y_{g-1} | G=g] - \hat{\E}[Y_t - Y_{g-1} | D_t=0] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Callaway and Sant'Anna (2021)
<span class="alert-blue">Estimation</span>

$$
`\begin{aligned}
\widehat{ATT}(g,t) &= \hat{\E}[Y_t - Y_{g-1} | G=g] - \hat{\E}[Y_t - Y_{g-1} | D_t=0] \hspace{150pt}\\
&= \frac{1}{n} \sum_{i=1}^n \frac{\mathbf{1}\{G_i = g\}}{\hat{\P}(G=g)}(Y_{it} - Y_{i,g-1}) - \frac{1}{n} \sum_{i=1}^n \frac{\mathbf{1}\{G_i > t\}}{\hat{\P}(G > t)}(Y_{it} - Y_{i,g-1})
\end{aligned}`
$$

<span class="alert-blue">This is easy</span> and avoids making any of the "bad comparisons" that were causing problems for TWFE
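
To show how little is involved, here is a by-hand sketch of this estimator in base R. The helper `att_gt_manual` and the toy panel are illustrative assumptions, not the `did` package's implementation; the assumed effect equals the number of periods of exposure.

```r
# Hypothetical balanced panel in long format (id, period, G, Y)
n_periods <- 6
G <- c(3, 3, 5, 5, 7, 7)                  # 7 = n_periods + 1, i.e. never treated
eta <- c(1, 2, 3, 4, 5, 6)                # unit effects
theta <- c(0, 1, 1, 2, 4, 3)              # time effects

df <- expand.grid(period = 1:n_periods, id = seq_along(G))
df$G <- G[df$id]
# Assumption: the effect equals the number of periods of exposure (1, 2, 3, ...)
df$Y <- theta[df$period] + eta[df$id] + pmax(df$period - df$G + 1, 0)

# Illustrative helper: difference in mean outcome paths from period g-1 to t
att_gt_manual <- function(df, g, t) {
  base <- df[df$period == g - 1, c("id", "G", "Y")]
  curr <- df[df$period == t, c("id", "Y")]
  m <- merge(base, curr, by = "id", suffixes = c("_base", "_t"))
  dy <- m$Y_t - m$Y_base
  # path for group g minus path for units not yet treated by period t
  mean(dy[m$G == g]) - mean(dy[m$G > t])
}

att_34 <- att_gt_manual(df, g = 3, t = 4)   # true ATT(3,4) is 2 in this setup
att_36 <- att_gt_manual(df, g = 3, t = 6)   # true ATT(3,6) is 4 in this setup
c(att_34, att_36)
```

Because the comparison group is always made up of units with `$G_i > t$`, already-treated units never serve as controls.
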
---

# Callaway and Sant'Anna (2021)

One thing that is still different between TWFE and `$ATT(g,t)$` is that there are potentially "lots" of `$ATT(g,t)$`.  <span class="alert-blue">Can we recover an "overall" ATT from these?</span>

As a step in this direction, define:

`$$ATT^G(g) := \frac{1}{\mathcal{T}-g+1} \sum_{t=g}^{\mathcal{T}} ATT(g,t)$$`

This is the ATT (across all post-treatment time periods) for units in group `$g$`.

Next, define:

`$$ATT^O := \sum_{g \in \mathcal{G}} ATT^G(g) \P(G=g|G \in \mathcal{G})$$`
--

where `$\mathcal{G}$` is the set of all groups that ever participate in the treatment.

`$ATT^O$` is the average effect of participating in the treatment across all units that are treated in any time period `$\implies$` it's a natural overall treatment effect parameter.
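
The two aggregation steps can be sketched as follows. The inputs are assumptions for illustration only: six periods, treated groups 3 and 5 of equal size, and `$ATT(g,t) = t - g + 1$`.

```r
# Hypothetical inputs: ATT(g,t) for groups 3 and 5 with 6 periods
n_periods <- 6
groups <- c(3, 5)
att <- expand.grid(t = 1:n_periods, g = groups)
att <- att[att$t >= att$g, ]          # keep post-treatment periods only
att$att <- att$t - att$g + 1          # assumed ATT(g,t), purely illustrative

# ATT^G(g): average of ATT(g,t) over t = g, ..., n_periods
att_G <- tapply(att$att, att$g, mean)

# Assumption: each group makes up half of the ever-treated units
p_g <- c("3" = 0.5, "5" = 0.5)

# ATT^O: ATT^G(g) weighted by P(G = g | ever treated)
att_O <- sum(att_G[names(p_g)] * p_g)
att_G
att_O
```

With these inputs `att_G` is 2.5 for group 3 and 1.5 for group 5, and `att_O` is 2. All weights are group shares, so none of them can be negative.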

---

# Coding Examples

Two examples:

- Minimum Wage Policy

    - This is from Callaway and Sant'Anna (2021)
    
    - In some places "modern" DID will make a big difference, but in others not much (I think this is pretty typical)

- Simulated Data

    - Since the differences can be small in real applications, we'll also look at a case where things go quite poorly for TWFE

For today, I'll just show code/results, but you should be able to download the code from my website and run it locally if you would like to

---

# Example: Minimum Wage

- Use the period in the U.S. from 2002-2007, when the federal minimum wage was flat

- Exploit minimum wage changes across states

- Any state that increases its minimum wage above the federal minimum wage will be considered treated
  
--

- Interested in the effect of the minimum wage on teen employment

---

# Example: Minimum Wage

```r
library(did)
# for bacondecomp, dev version is much faster
# devtools::install_github("evanjflack/bacondecomp")
library(bacondecomp)
library(fixest)
library(modelsummary)
library(ggplot2)
load("mw_data2.RData")
```

---

# Example: Minimum Wage

```r
head(mw_data2)
```

```
##     year countyreal     lpop     lemp first.treat treat region post
## 829 2001       8001 5.896761 8.730690        2006     1      4    0
## 820 2002       8001 5.896761 8.541300        2006     1      4    0
## 844 2003       8001 5.896761 8.461469        2006     1      4    0
## 858 2004       8001 5.896761 8.336870        2006     1      4    0
## 833 2005       8001 5.896761 8.340217        2006     1      4    0
## 823 2006       8001 5.896761 8.378161        2006     1      4    1
```

---

# Example: Minimum Wage

```r
# add post-treatment dummy variable
mw_data2$post <- 1*((mw_data2$year >= mw_data2$first.treat) & mw_data2$treat != 0)

twfe_res <- feols(lemp ~ post | countyreal + year,
                  data=mw_data2,
                  cluster="countyreal")
```

---

# Example: Minimum Wage

```r
modelsummary(twfe_res, gof_omit=".*")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Model 1 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> post </td>
   <td style="text-align:center;"> −0.021 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (0.006) </td>
  </tr>
</tbody>
</table>

---

# Example: Minimum Wage

```r
# run bacon decomposition
bacon_res <- bacon(lemp ~ post, 
                   data=mw_data2,
                   id_var="countyreal",
                   time_var="year")

# confirm same estimate
sum(bacon_res$estimate * bacon_res$weight)
```

```
## [1] -0.02129215
```

```r
# bacon decomp results
head(bacon_res)
```

```
##    treated untreated     estimate      weight                     type
## 2     2005      2006  0.031370307 0.035177102 Earlier vs Later Treated
## 4     2003      2006 -0.025199486 0.023661728 Earlier vs Later Treated
## 5     2006      2005 -0.005552584 0.017588551 Later vs Earlier Treated
## 8     2003      2005 -0.030110977 0.006023476 Earlier vs Later Treated
## 9     2006       Inf -0.041531139 0.543036659     Treated vs Untreated
## 10    2005       Inf  0.014012249 0.248829811     Treated vs Untreated
```

---

# Example: Minimum Wage

```r
# plot bacon decomposition
ggplot(data=bacon_res, 
       mapping=aes(x=weight,
                   y=estimate,
                   color=as.factor(type))) + 
  geom_point(size=5) + 
  scale_color_discrete(name="") + 
  theme_bw() + 
  theme(legend.position="bottom")
```

---

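# Example: Minimum Wage

[Figure: Bacon decomposition estimates (y-axis) plotted against their weights (x-axis), colored by comparison type]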

---

# Example: Minimum Wage

```r
# callaway and sant'anna
cs_res <- att_gt(yname="lemp",
                 tname="year",
                 idname="countyreal",
                 gname="first.treat",
                 data=mw_data2)
```

---

# Example: Minimum Wage

```r
ggdid(cs_res)
```

---

# Example: Minimum Wage

```r
aggte(cs_res, type="group")
```

```
## 
## Call:
## aggte(MP = cs_res, type = "group")
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Forthcoming at the Journal of Econometrics <https://arxiv.org/abs/1803.09015>, 2020. 
## 
## 
## Overall summary of ATT’s based on group/cohort aggregation:  
##      ATT    Std. Error     [ 95%  Conf. Int.]  
##  -0.0434        0.0059     -0.055     -0.0319 *
## 
## 
## Group Effects:
##  Group Estimate Std. Error [95% Simult.  Conf. Band]  
##   2003  -0.0542     0.0131       -0.0841     -0.0243 *
##   2005  -0.0138     0.0082       -0.0325      0.0048  
##   2006  -0.0529     0.0076       -0.0703     -0.0355 *
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## Control Group:  Never Treated,  Anticipation Periods:  0
## Estimation Method:  Doubly Robust
```

---

# Example: Minimum Wage

The CS estimate is roughly twice as large in magnitude as the TWFE estimate (-0.043 vs. -0.021), but the qualitative results are similar (negative effects of the minimum wage on teen employment).

---

# Example: Simulated Data

```r
library(tidyr)
library(dplyr)
# simulation parameters
time.periods <- 20
groups <- c(5,15,time.periods+1)
pg <- c(0.5,0.5,0)
n <- 1000

# generate data (code omitted...)
# load file: sim_data.RDS
```

---

# Example: Simulated Data

```r
# plot observed (solid) and untreated potential (dashed) outcomes by group
plotdf <- data %>%
  group_by(G, time.period) %>%
  summarise(Yobs=mean(Y),
            Y0=mean(Y0),
            .groups="drop")
ggplot(data=plotdf,
       mapping=aes(x=time.period, y=Yobs, color=as.factor(G))) + 
  geom_point() +
  geom_line() + 
  geom_point(aes(y=Y0)) + 
  geom_line(aes(y=Y0), linetype="dashed") + 
  scale_x_continuous(breaks=seq(2,time.periods,by=2)) + 
  ylab("Y") + 
  theme_bw()
```

---

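# Example: Simulated Data

[Figure: average observed outcomes (solid) and untreated potential outcomes (dashed) over time, by group]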

---

# Example: Simulated Data

```r
# TWFE
twfe_res <- feols(Y ~ post | id + time.period,
                  data=data,
                  cluster="id")

modelsummary(twfe_res, gof_omit=".*")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Model 1 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> post </td>
   <td style="text-align:center;"> −25.043 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (0.029) </td>
  </tr>
</tbody>
</table>

---

# Example: Simulated Data

```r
cs_res <- att_gt(yname="Y",
                 tname="time.period",
                 idname="id",
                 gname="G",
                 data=data,
                 control_group = "notyettreated")

round(aggte(cs_res, type="group")$overall.att, 3)
```

```
## Warning in compute.aggte(MP = MP, type = type, balance_e = balance_e, min_e
## = min_e, : Simultaneous conf. band is somehow smaller than pointwise one
## using normal approximation. Since this is unusual, we are reporting pointwise
## confidence intervals
```

```
## [1] 50.005
```

---

# Example: Simulated Data

These results are very different (CS is correct, TWFE is wildly incorrect).  The differences are due to:

1. Heavy dynamics for the early-treated group

2. No never-treated group (this tends to make the issues we've been discussing much worse!)

Next up: Pre-testing and Event Studies