Modern Approaches to Difference in Differences

class: center, middle, inverse, title-slide

# Modern Approaches to Difference in Differences
### Brantly Callaway, University of Georgia
### October 22, 2021 <br><br>Session 4: Conditional Parallel Trends and Other Extensions

---

# Conditional Parallel Trends

`$$\newcommand{\E}{\mathbb{E}}$$`
`$$\newcommand{\P}{\mathrm{P}}$$`


For simplicity, let's go back to the case without variation in treatment timing

- Straightforward to extend

## Conditional Parallel Trends Assumption

For all time periods,

`$$\E[\Delta Y_t(0) | X, D=1] = \E[\Delta Y_t(0) | X, D=0]$$`
--

In words: Parallel trends holds conditional on having the same covariates `$X$`.

Labor Economics Example: `$Y$` is earnings, and the path of earnings may depend on covariates such as education, race, gender, and age.
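
---

# Conditional Parallel Trends

A minimal simulated sketch of what the assumption says (this is not the minimum wage data; the data generating process and all variable names are made up for illustration). The path of untreated outcomes depends on a binary covariate `x`, and `x` also shifts the probability of treatment, so paths differ unconditionally but not within cells of `x`.

```r
# Simulated illustration only; names and parameter values are hypothetical
set.seed(1)
n <- 100000
x <- rbinom(n, 1, 0.5)                        # binary covariate
d <- rbinom(n, 1, ifelse(x == 1, 0.7, 0.3))   # treatment is more likely when x = 1
dy0 <- 1 + 2 * x + rnorm(n)                   # untreated outcome path depends on x only

# gap in mean untreated paths between treated and untreated units
trend_gap <- function(keep) {
  mean(dy0[keep & d == 1]) - mean(dy0[keep & d == 0])
}

gap_uncond <- trend_gap(rep(TRUE, n))  # unconditional gap: far from 0
gap_x0 <- trend_gap(x == 0)            # gap among units with x = 0: close to 0
gap_x1 <- trend_gap(x == 1)            # gap among units with x = 1: close to 0
round(c(unconditional = gap_uncond, x0 = gap_x0, x1 = gap_x1), 3)
```

In this design unconditional parallel trends fails, while parallel trends holds among units with the same value of `x`.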

---

# Recovering ATT under conditional PTA
Under the conditional parallel trends assumption,
$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Recovering ATT under conditional PTA
Under the conditional parallel trends assumption,
$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=1] \big| D=1\big]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=0] \big| D=1\big]
\end{aligned}`
$$

---

Everything is identified here

- but estimation may be more challenging
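
To make this concrete, here is a sketch of the same expression written only in terms of observed outcomes. It assumes no anticipation (so that `$Y_{t^*-1} = Y_{t^*-1}(0)$` for the treated group) and overlap in the distribution of `$X$`, and uses the shorthand `$\Delta Y_{t^*} := Y_{t^*} - Y_{t^*-1}$`:

$$
`\begin{aligned}
ATT &= \E[\Delta Y_{t^*} | D=1] - \E\big[ \E[\Delta Y_{t^*} | X, D=0] \big| D=1\big]
\end{aligned}`
$$

The first term is the mean path of outcomes for the treated group. The second term averages the untreated group's conditional mean path over the treated group's distribution of `$X$`.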
---

# Estimation

<span class="alert-blue">Idea 1: Regression</span>

Assume: `$\E[\Delta Y_{t^*}(0) | X, D=d] = X'\beta_{t^*}$`, then

`$$ATT=\E[\Delta Y_{t^*} | D=1] - \E[X'|D=1]\beta_{t^*}$$`
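---

# Estimation

A rough sketch of the regression idea on simulated data (not the minimum wage data; the data generating process, the true `$ATT$` of 1, and all names are assumptions made for illustration). The outcome model is fit on untreated units only, and its predictions are then averaged over the treated units.

```r
# Simulated illustration only; names and parameter values are hypothetical
set.seed(42)
n <- 100000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))     # treatment probability depends on x
att_true <- 1
dy <- 1 + 2 * x + att_true * d + rnorm(n)  # change in outcomes over time
dat <- data.frame(dy = dy, x = x, d = d)

# fit the model for the path of outcomes using untreated units only
fit0 <- lm(dy ~ x, data = dat[dat$d == 0, ])

# average the predicted untreated path over the treated units
treated <- dat[dat$d == 1, ]
att_reg <- mean(treated$dy) - mean(predict(fit0, newdata = treated))

# for comparison: unconditional difference in mean paths
att_naive <- mean(dat$dy[dat$d == 1]) - mean(dat$dy[dat$d == 0])
round(c(reg = att_reg, naive = att_naive), 3)
```

In this design the unconditional comparison is biased because `x` shifts both treatment and the path of outcomes.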
---

# Estimation

<span class="alert-blue">Idea 2: Propensity Score Weighting</span>

Along the lines of Abadie (2005), can show that

$$
  ATT = \E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{p(1-p(X))} \right) \Delta Y_{t^*}\right]
$$
where `$p := \P(D=1)$` and `$p(X) := \P(D=1|X)$` is the propensity score (which can be estimated by logit or probit).

Intuition: under conditional parallel trends, differences in (unconditional) paths of untreated potential outcomes between the treated and untreated groups are due to differences in the distribution of covariates between the two groups

- This expression "weights up" `$\Delta Y_{t^*}$` for units from the untreated group that have covariates that are relatively more common among the treated group

- The choice between the first two ideas likely comes down to whether you feel better about modeling `$\E[\Delta Y_{t^*}(0)|X,D=d]$` or `$p(X)$`.
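
---

# Estimation

A rough sketch of the weighting idea on simulated data (again not the minimum wage data; the data generating process, the true `$ATT$` of 1, and all names are assumptions for illustration). The weights use the unconditional treatment probability `$\P(D=1)$` (called `p` in the code) together with a logit propensity score.

```r
# Simulated illustration only; names and parameter values are hypothetical
set.seed(42)
n <- 100000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))     # true propensity score is logit in x
att_true <- 1
dy <- 1 + 2 * x + att_true * d + rnorm(n)  # change in outcomes over time

# estimate the propensity score by logit
ps_fit <- glm(d ~ x, family = binomial())
px <- fitted(ps_fit)                   # fitted probabilities P(D = 1 | X)
p <- mean(d)                           # unconditional P(D = 1)

# treated units get weight 1/p; untreated units are reweighted by the
# odds of treatment so that their covariates resemble the treated group's
w <- d / p - px * (1 - d) / (p * (1 - px))
att_ipw <- mean(w * dy)
round(att_ipw, 3)
```

Untreated units with a high estimated propensity score receive more weight, which is the "weighting up" described above.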

---

# Estimation

<span class="alert-blue">Idea 3: Doubly Robust</span>

Along the lines of Sant'Anna and Zhao (2020), can show

`$$ATT=\E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{p(1-p(X))} \right)(\Delta Y_{t^*} - \E[\Delta Y_{t^*} | X, D=0]) \right]$$`
--

This requires estimating both `$p(X)$` and `$\E[\Delta Y_{t^*}|X,D=0]$`.

Big advantage:

- This expression for `$ATT$` is *doubly robust*.  This means that it will deliver consistent estimates of `$ATT$` if <span class="alert">either</span> the model for `$p(X)$` or the model for `$\E[\Delta Y_{t^*}|X,D=0]$` is correctly specified.

- In my experience, doubly robust estimators perform much better than either the regression or propensity score weighting estimators

- This also provides a connection to estimating `$ATT$` under conditional parallel trends using machine learning for `$p(X)$` and `$\E[\Delta Y_{t^*}|X,D=0]$` (see: Chang (2020))
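---

# Estimation

A rough sketch of double robustness on simulated data (not the minimum wage data; the data generating process, the true `$ATT$` of 1, the helper `dr_att`, and the deliberately misspecified "constant" models are all assumptions for illustration). Here `p` in the code is the unconditional `$\P(D=1)$`.

```r
# Simulated illustration only; names and parameter values are hypothetical
set.seed(42)
n <- 100000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
dy <- 1 + 2 * x + 1 * d + rnorm(n)     # true ATT is 1
dat <- data.frame(dy = dy, x = x, d = d)

# doubly robust estimand: weights times (outcome path minus outcome model)
dr_att <- function(dy, d, px, m0) {
  p <- mean(d)
  w <- d / p - px * (1 - d) / (p * (1 - px))
  mean(w * (dy - m0))
}

# correctly specified and deliberately misspecified nuisance models
px_good <- fitted(glm(d ~ x, family = binomial(), data = dat))
px_bad <- rep(mean(d), n)              # ignores x
m0_good <- predict(lm(dy ~ x, data = dat[dat$d == 0, ]), newdata = dat)
m0_bad <- rep(mean(dy[d == 0]), n)     # ignores x

att_both <- dr_att(dy, d, px_good, m0_good)
att_ps_only <- dr_att(dy, d, px_good, m0_bad)   # only p(X) is correct
att_or_only <- dr_att(dy, d, px_bad, m0_good)   # only the outcome model is correct
att_neither <- dr_att(dy, d, px_bad, m0_bad)    # both are wrong
round(c(both = att_both, ps_only = att_ps_only,
        or_only = att_or_only, neither = att_neither), 3)
```

When both models ignore `x`, the expression collapses to the unconditional difference in mean paths, which is biased in this design.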
---

# Example: MW with region-specific trends

Rough idea: trends in teen employment may be quite different across different regions of the country

- This is a form of conditional parallel trends, conditioning on region

Idea: Include `region-year` fixed effects

```r
library(did)
library(fixest)
library(modelsummary)
load("mw_data2.RData")

# create post treatment dummy
mw_data2$post <- 1*(mw_data2$year >= mw_data2$first.treat & mw_data2$first.treat > 0)

# run TWFE regression
twfe_x <- feols(lemp ~ post | countyreal + region^year,
                data=mw_data2)
```

---

# Example: Minimum Wage

```r
modelsummary(twfe_x, gof_omit=".*")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;">   </th>
   <th style="text-align:center;"> Model 1 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> post </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
  <tr>
   <td style="text-align:left;">  </td>
   <td style="text-align:center;"> (0.006) </td>
  </tr>
</tbody>
</table>

Relative to the previous results, this estimate is much smaller in magnitude and statistically insignificant

This is a pretty famous result in the MW literature (Dube et al. (2010))

---

# Example: Minimum Wage

Same idea, but estimate via group-time average treatment effects:

```r
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="countyreal",
               gname="first.treat",
               xformla=~region,
               data=mw_data2)
cs_x_res <- aggte(cs_x, type="group")
```

---

# Example: Minimum Wage

```r
summary(cs_x_res)
```

```
## 
## Call:
## aggte(MP = cs_x, type = "group")
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Forthcoming at the Journal of Econometrics <https://arxiv.org/abs/1803.09015>, 2020. 
## 
## 
## Overall summary of ATT’s based on group/cohort aggregation:  
##      ATT    Std. Error     [ 95%  Conf. Int.]  
##  -0.0311        0.0059    -0.0427     -0.0195 *
## 
## 
## Group Effects:
##  Group Estimate Std. Error [95% Simult.  Conf. Band]  
##   2003  -0.0197     0.0153       -0.0573      0.0179  
##   2005   0.0169     0.0094       -0.0061      0.0400  
##   2006  -0.0514     0.0078       -0.0706     -0.0321 *
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## Control Group:  Never Treated,  Anticipation Periods:  0
## Estimation Method:  Doubly Robust
```
---

# Example: Minimum Wage

```r
# regression
cs_x_reg <- att_gt(yname="lemp",
               tname="year",
               idname="countyreal",
               gname="first.treat",
               xformla=~region + lpop,
               est_method = "reg",
               data=mw_data2)
cs_x0_reg <- aggte(cs_x_reg, type="group")
```

---

# Example: Minimum Wage

```r
# propensity score weighting
cs_x_ipw <- att_gt(yname="lemp",
               tname="year",
               idname="countyreal",
               gname="first.treat",
               xformla=~region + lpop,
               est_method = "ipw",
               data=mw_data2)
cs_x0_ipw <- aggte(cs_x_ipw, type="group")
```

---

# Example: Minimum Wage

```r
# doubly robust
cs_x_dr <- att_gt(yname="lemp",
               tname="year",
               idname="countyreal",
               gname="first.treat",
               xformla=~region + lpop,
               est_method = "dr",
               data=mw_data2)
cs_x0_dr <- aggte(cs_x_dr, type="group")
```

---

# Example: Minimum Wage

```r
# show results
round(cbind.data.frame(reg=cs_x0_reg$overall.att, 
                       ipw=cs_x0_ipw$overall.att,
                       dr=cs_x0_dr$overall.att), 6)
```

```
##         reg       ipw        dr
## 1 -0.033159 -0.035392 -0.032318
```

These are all similar, but

- somewhat smaller in magnitude than unconditional case

- much different from TWFE results

---

# What about violations of parallel trends?

DID + pre-tests are a very powerful/useful approach to identifying causal effect parameters when repeated observations are available

But what should you do in cases like our application where:

---

# What about violations of parallel trends?

<span class="alert-blue">One possibility:</span> The model underlying parallel trends is not correct

Examples:

- Interactive fixed effects: `$Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + v_{it}$`

- Special case: `$Y_{it}(0) = \theta_t + \eta_i + \lambda_i t + v_{it}$` (linear trends)
    
    - see: Callaway and Karami (2021) for small-T case
    
--

- More generally: `$Y_{it}(0) = g_t(\eta_i) + v_{it}$` (additive separability not appropriate)

- Pandemics (Callaway and Li (2021))

`$\implies$` sometimes you can make progress, but other times not
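
---

# What about violations of parallel trends?

A short sketch of why parallel trends can fail in the linear trends special case. It assumes `$\E[\Delta v_{it} | D] = 0$`, which is a simplifying assumption added here for illustration:

$$
`\begin{aligned}
\Delta Y_{it}(0) &= \Delta \theta_t + \lambda_i + \Delta v_{it} \\
\E[\Delta Y_t(0) | D=1] - \E[\Delta Y_t(0) | D=0] &= \E[\lambda | D=1] - \E[\lambda | D=0]
\end{aligned}`
$$

So, in this model, parallel trends holds only if the mean of the unit-specific trend `$\lambda_i$` is the same for the treated and untreated groups.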

---

# What about violations of parallel trends?

Another strategy: partial identification / sensitivity analysis

- See: Manski and Pepper (2018), Rambachan and Roth (2021)

RR provide several versions of sensitivity analysis

- We'll focus on what they call `$\Delta^{RM}(\bar{M})$`

- We'll allow for violations of parallel trends up to `$\bar{M}$` times as large as were observed in any pre-treatment period.

- And we'll vary `$\bar{M}$`.
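
As a sketch of the restriction, based on my reading of RR's notation: let `$\delta_t$` denote the difference in trends between the treated and untreated groups in event time `$t$`, with treatment occurring between periods 0 and 1. The set of allowed violations is then, roughly,

$$
`\begin{aligned}
\Delta^{RM}(\bar{M}) = \Big\{ \delta : \ |\delta_{t+1} - \delta_t| \leq \bar{M} \cdot \max_{s < 0} |\delta_{s+1} - \delta_s| \ \text{ for all } t \geq 0 \Big\}
\end{aligned}`
$$

Setting `$\bar{M} = 1$` allows post-treatment violations to be as large as the largest violation between consecutive pre-treatment periods.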

---

# Conclusion

That's all!  Thank you very much for inviting me.

Email: [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu)