Modern Approaches to Difference in Differences

class: center, middle, inverse, title-slide

# Modern Approaches to Difference in Differences
### Brantly Callaway, University of Georgia
### October 22, 2021 Session 1: Introduction to DID

---

# Basic Setup

`$$\newcommand{\E}{\mathbb{E}}$$`
<style type="text/css">

border-top: 80px solid #BA0C2F;

.inverse {
  background-color: #BA0C2F;
}

.alert {
    font-weight:bold; 
    color: red;
}

.alert-blue {
    font-weight: bold;
    color: blue;
}

.remark-slide-content {
    font-size: 23px;
    padding: 1em 4em 1em 4em;
}

.highlight-red {
 background-color:red;
 padding:0.1em 0.2em;
}
</style>

We are interested in understanding the causal effect of some treatment on some outcome of interest.

- This could be some economic policy (e.g., state implements some policy)

- This could be participating in some program (e.g., individual participates in job training program)

- For today, we'll focus on the case where the treatment is binary (i.e., individual either participates or not)

---

# Possible Approaches

There are many possible approaches to thinking about causal effects (experiments, comparing treated and untreated units with the same covariates (e.g., matching/regression), IV, regression discontinuity, etc.)

Today we'll think about:

- (broadly) how to use repeated observations over time to identify causal effect parameters

- (specifically) mostly difference in differences where we'll compare:

    - The change in outcomes over time for units that participate in the treatment to
    
    - The change in outcomes over time for units that didn't participate in the treatment

---

# DID Figure

---

# Before we get going...

Compared to other causal inference approaches, DID has been around for a very long time

- Dates back to at least John Snow and the cholera epidemic in 1850s London

- If you're interested, more details: [https://www.ph.ucla.edu/epi/snow/grand_experiment.html](https://www.ph.ucla.edu/epi/snow/grand_experiment.html)
   
--
 
- Re-introduced/popularized in economics by Ashenfelter (1978), Card (1990), Card and Krueger (1994), among others
    
---

# Before we get going...

DID is also really popular among applied micro papers (see Currie et al. (2020))

---

# Setup

Potential Outcomes Notation:

- Two time periods: `$t^*$` and `$t^*-1$`

- No one is treated until time period `$t^*$`; some units remain untreated in period `$t^*$`
    
--

- `$D_i$` - treatment indicator

- Potential outcomes in each time period

    - `$Y_{it}(1)$` - treated potential outcome
    
    - `$Y_{it}(0)$` - untreated potential outcome
    
--

- Observed outcomes

`$$Y_{it^*} = D_i Y_{it^*}(1) + (1-D_i) Y_{it^*}(0) \quad \textrm{and} \quad Y_{it^*-1} = Y_{it^*-1}(0)$$`
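To make the notation concrete, here is a minimal Python sketch that builds observed outcomes from potential outcomes. The sample size, the share of treated units, and all of the data-generating numbers are illustrative assumptions, not part of the setup above.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def simulate_unit():
    """Draw one unit's potential outcomes and return its observed outcomes."""
    d = 1 if random.random() < 0.4 else 0        # D_i (assumed 40% treated)
    y0_pre = random.gauss(0.0, 1.0)              # Y_{i,t*-1}(0)
    y0_post = y0_pre + random.gauss(1.0, 1.0)    # Y_{i,t*}(0)
    y1_post = y0_post + 2.0                      # Y_{i,t*}(1), assumed effect of 2
    y_pre = y0_pre                               # nobody is treated before t*
    y_post = d * y1_post + (1 - d) * y0_post     # observed outcome in period t*
    return {"d": d, "y0_post": y0_post, "y1_post": y1_post,
            "y_pre": y_pre, "y_post": y_post}

units = [simulate_unit() for _ in range(1000)]
```

In real data only `d`, `y_pre`, and `y_post` would be available; the potential outcomes are kept here only because this is a simulation.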

---

# Individual-level Treatment Effects

In this framework, notice that the effect of participating in the treatment for individual `$i$` in time period `$t^*$` is

`$$TE_i = Y_{it^*}(1) - Y_{it^*}(0)$$`
--

Most work in economics "gives up" on trying to recover individual-level treatment effects

- If you're interested: check Heckman, Smith, and Clements (1997), Fan and Park (2010), or Callaway (2020)

---

# Average Treatment Effects

Instead, most of the literature focuses on the Average Treatment Effect (ATE) or Average Treatment Effect on the Treated (ATT)

`$$ATT=\E[Y_{t^*}(1) - Y_{t^*}(0) | D=1]$$`

Interpretation: The average difference between treated and untreated potential outcomes among those that participated in the treatment.

---

# Identification Challenge

Let's break apart the expression for `$ATT$`:

`$$ATT=\underbrace{\E[Y_{t^*}(1) | D=1]}_{\textrm{identified}} - \underbrace{\E[Y_{t^*}(0)|D=1]}_{\textrm{requires assumption}}$$`

In other words, in order to recover the `$ATT$`, we need to be able to figure out the average untreated potential outcome among those that participated in the treatment

There are a variety of ways to figure this out (e.g., random treatment assignment), but we'll talk about difference in differences
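A small simulation can illustrate why the second term is the hard part. The sketch below assumes (purely for illustration) that treated units have higher untreated potential outcomes on average and that the treatment effect is 2 for everyone; comparing post-period means across groups then does not recover the `$ATT$`.

```python
import random
from statistics import mean

random.seed(1)

rows = []
for _ in range(20000):
    d = 1 if random.random() < 0.5 else 0
    # Assumed selection: treated units have higher untreated outcomes on average
    y0 = random.gauss(1.0 if d == 1 else 0.0, 1.0)   # Y_{t*}(0)
    y1 = y0 + 2.0                                    # Y_{t*}(1), assumed effect of 2
    y = y1 if d == 1 else y0                         # observed outcome
    rows.append((d, y0, y1, y))

# True ATT (computable only because potential outcomes are simulated)
att = mean(y1 - y0 for d, y0, y1, y in rows if d == 1)

# Naive comparison of observed post-period means across groups
naive = (mean(y for d, y0, y1, y in rows if d == 1)
         - mean(y for d, y0, y1, y in rows if d == 0))

# The piece that is never observed in real data: E[Y(0) | D=1]
missing_piece = mean(y0 for d, y0, y1, y in rows if d == 1)
```

Here `naive` mixes the treatment effect with the pre-existing gap between the groups, which is why an identifying assumption is needed.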

---

# Parallel Trends Assumption
## Parallel Trends Assumption

`$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$`
--

Recovering `$ATT$` under parallel trends:

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Parallel Trends Assumption
## Parallel Trends Assumption

`$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$`

Recovering `$ATT$` under parallel trends:

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]
\end{aligned}`
$$

---

count:false
# Parallel Trends Assumption
## Parallel Trends Assumption

`$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$`

Recovering `$ATT$` under parallel trends:

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=0]
\end{aligned}`
$$

---

count:false
# Parallel Trends Assumption
## Parallel Trends Assumption

`$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$`

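Recovering `$ATT$` under parallel trends:

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=0]\\
&= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*} | D=0]
\end{aligned}`
$$

which is where difference in differences gets its name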
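As a small worked example of the double difference, the sketch below uses made-up group means (the numbers are hypothetical and chosen only to keep the arithmetic simple).

```python
# Hypothetical group means (made-up numbers for illustration only)
mean_y = {
    ("treated", "pre"): 10.0,
    ("treated", "post"): 15.0,
    ("untreated", "pre"): 8.0,
    ("untreated", "post"): 11.0,
}

def did(means):
    """Change over time for the treated group minus change for the untreated group."""
    change_treated = means[("treated", "post")] - means[("treated", "pre")]
    change_untreated = means[("untreated", "post")] - means[("untreated", "pre")]
    return change_treated - change_untreated

att_hat = did(mean_y)  # (15 - 10) - (11 - 8) = 2
```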

---

# Where does the PTA come from?

- View \#1: Purely reduced form assumption, assess validity in pre-treatment periods

- View \#2: Models that lead to parallel trends assumption (see: Blundell and Costa Dias (2009) and Gardner (2021))

`$$Y_{it}(0) = \theta_t + \eta_i + v_{it}$$`
--

where

- `$\theta_t$` is a time fixed effect

- `$\eta_i$` is an individual fixed effect (importantly: can follow different distribution among treated and untreated group)

- `$v_{it}$` are idiosyncratic, time-varying unobservables

---

# Models for PTA (cont'd)

In this case, notice that

`$$\Delta Y_{it^*}(0) = (\theta_{t^*} - \theta_{t^*-1}) + (v_{it^*} - v_{it^*-1})$$`
--

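so that, as long as the average change in the time-varying unobservables is the same in both groups (`$\E[\Delta v_{t^*} | D=1] = \E[\Delta v_{t^*} | D=0]$`),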

`$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$`
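
The following simulation is a rough check of this argument under assumed parameter values: the fixed effect `$\eta_i$` is drawn from a different distribution in each group, while `$v_{it}$` has the same distribution in both. Average changes in untreated potential outcomes should then be (approximately) equal across groups.

```python
import random
from statistics import mean

random.seed(2)

theta = {"pre": 0.0, "post": 1.5}   # time fixed effects (assumed values)
changes = {0: [], 1: []}            # Delta Y_{it*}(0), stored by treatment group

for _ in range(20000):
    d = 1 if random.random() < 0.5 else 0
    # eta_i has a different distribution in each group (assumed mean 3 vs 0)
    eta = random.gauss(3.0 if d == 1 else 0.0, 1.0)
    # v_it has the same distribution in both groups and both periods
    y0_pre = theta["pre"] + eta + random.gauss(0.0, 1.0)
    y0_post = theta["post"] + eta + random.gauss(0.0, 1.0)
    changes[d].append(y0_post - y0_pre)

trend_treated = mean(changes[1])     # estimate of E[Delta Y(0) | D=1]
trend_untreated = mean(changes[0])   # estimate of E[Delta Y(0) | D=0]
```

Both trends should be close to `theta["post"] - theta["pre"]`, even though the levels of the two groups differ by about 3.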
---

# Models for PTA (cont'd)

(As far as I know), this is the only model that rationalizes parallel trends/DID

Nice properties:

1. Allows for time trends in (untreated potential) outcomes

2. Allows for unobserved heterogeneity (individual fixed effects) that can have different distributions between the treated group and untreated group

3. No restrictions on how treated potential outcomes are generated at all

4. No restrictions on treatment effect heterogeneity (across individuals or across time/exposure to the treatment)

5. Individuals can "select" into the treatment on the basis of (i) treated potential outcomes and (ii) unobserved heterogeneity `$\eta_i$`...just not time varying unobservables `$v_{it}$`

---

# Models for PTA (cont'd)

(As far as I know), this is the only model that rationalizes parallel trends/DID

Drawbacks:

1. Relies heavily on linearity/additive separability

    - Not implied by most economic models
    
    - What about the case with binary (or otherwise "limited") dependent variables / nonlinear models?
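To see why nonlinearity matters, consider a hypothetical binary-outcome version of the model in which the additive structure holds for a latent index and probabilities come from a logistic function. This logistic setup and its parameter values are assumptions made for illustration, not something from the slides. Trends are parallel in the index but not in the outcome probabilities.

```python
import math

def logistic(x):
    """Logistic CDF, mapping an index to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

theta = {"pre": 0.0, "post": 1.0}          # common time effects on the index (assumed)
eta = {"treated": 1.0, "untreated": -1.0}  # group-level heterogeneity (assumed)

def prob(group, period):
    """P(Y(0) = 1) under the assumed logistic index model."""
    return logistic(theta[period] + eta[group])

# Change over time in the latent index: identical across groups
index_trend = {g: (theta["post"] + eta[g]) - (theta["pre"] + eta[g]) for g in eta}

# Change over time in the probability of Y(0) = 1: differs across groups
prob_trend = {g: prob(g, "post") - prob(g, "pre") for g in eta}
```

With these numbers the untreated group's probability rises by more than the treated group's, so parallel trends in levels would fail even though the index model is additively separable.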

---

# Estimation
Given the above discussion, estimation of the `$ATT$` is very easy.

$$
`\begin{aligned}
\widehat{ATT} &= \hat{\E}[\Delta Y_{t^*} | D=1] - \hat{\E}[\Delta Y_{t^*}|D=0] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Estimation
Given the above discussion, estimation of the `$ATT$` is very easy.

$$
`\begin{aligned}
\widehat{ATT} &= \hat{\E}[\Delta Y_{t^*} | D=1] - \hat{\E}[\Delta Y_{t^*}|D=0] \hspace{150pt}\\
&= \frac{1}{n} \sum_{i=1}^n \frac{D_i}{\hat{p}} \Delta Y_{it^*} - \frac{1}{n} \sum_{i=1}^n \frac{(1-D_i)}{(1-\hat{p})} \Delta Y_{it^*}
\end{aligned}`
$$

Here `$\hat{p} = \frac{1}{n} \sum_{i=1}^n D_i$` is the fraction of treated units in the sample.

Or, even more easily, run the following two-way fixed effects regression (TWFE):

`$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$`

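Here `$D_{it}$` equals 1 if unit `$i$` is treated in period `$t$`, so `$D_{it^*-1} = 0$` and `$D_{it^*} = D_i$`.

Pros: Economists know a lot about this sort of regression and you can just read off standard errors, etc.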
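The three estimators above can be compared on simulated data. The sketch below is one possible implementation using only the Python standard library; the data-generating process (including heterogeneous treatment effects with mean 2) is an assumption for illustration, and the TWFE coefficient is computed by double demeaning, which is valid here because the panel is balanced.

```python
import random
from statistics import mean

random.seed(3)

# Simulated two-period panel (all numbers are illustrative assumptions)
n = 2000
d, y_pre, y_post = [], [], []
for _ in range(n):
    di = 1 if random.random() < 0.3 else 0
    eta = random.gauss(2.0 * di, 1.0)     # fixed effect, distribution differs by group
    effect = random.gauss(2.0, 1.0)       # heterogeneous treatment effect
    d.append(di)
    y_pre.append(eta + random.gauss(0.0, 1.0))
    y_post.append(1.0 + eta + di * effect + random.gauss(0.0, 1.0))

dy = [post - pre for pre, post in zip(y_pre, y_post)]

# (1) Difference in mean changes between groups
att_means = (mean(x for x, di in zip(dy, d) if di == 1)
             - mean(x for x, di in zip(dy, d) if di == 0))

# (2) Weighting form, with p_hat the sample fraction treated
p_hat = sum(d) / n
att_weights = (mean(di / p_hat * x for x, di in zip(dy, d))
               - mean((1 - di) / (1 - p_hat) * x for x, di in zip(dy, d)))

# (3) TWFE coefficient via double demeaning of Y_it and D_it;
#     D_it is 0 for everyone in the pre-period and D_i in the post-period
def twfe_alpha(y_pre, y_post, d):
    ybar_t = (mean(y_pre), mean(y_post))   # period means of Y
    dbar_t = (0.0, sum(d) / len(d))        # period means of D_it
    ybar, dbar = mean(ybar_t), mean(dbar_t)
    num = den = 0.0
    for i in range(len(d)):
        ybar_i = (y_pre[i] + y_post[i]) / 2
        dbar_i = d[i] / 2
        for t, (y, dit) in enumerate(((y_pre[i], 0), (y_post[i], d[i]))):
            y_tilde = y - ybar_i - ybar_t[t] + ybar
            d_tilde = dit - dbar_i - dbar_t[t] + dbar
            num += y_tilde * d_tilde
            den += d_tilde ** 2
    return num / den

att_twfe = twfe_alpha(y_pre, y_post, d)
```

All three numbers should agree up to floating-point error, and each should be close to the assumed average effect of 2.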

---

# TWFE Regression

`$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$`

Some more things to point out:

- Even though it looks like this model has restricted the effect of participating in the treatment to be constant (and equal to `$\alpha$`) across all individuals, TWFE (in this case) is actually robust to treatment effect heterogeneity. In fact, you can (easily) show that, in this case, `$\alpha = ATT$`.

--
    
- It's easy to make the TWFE regression more complicated:

    - Multiple time periods
    
    - Variation in treatment timing
    
    - More complicated treatments
    
    - Introducing additional covariates
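
Returning to the two-period case, one way to sketch why `$\alpha = ATT$` is a standard first-difference argument (not spelled out in these slides). With `$D_{it^*-1} = 0$` and `$D_{it^*} = D_i$`, differencing the TWFE regression over time removes `$\eta_i$`:

$$
`\begin{aligned}
\Delta Y_{it^*} &= (\theta_{t^*} - \theta_{t^*-1}) + \alpha D_i + \Delta v_{it^*}
\end{aligned}`
$$

The slope in a regression on a constant and a binary regressor is a difference in group means, so

$$
`\begin{aligned}
\alpha &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*} | D=0] = ATT
\end{aligned}`
$$

where the last equality uses the parallel trends assumption.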

---

# TWFE Regression

- Unfortunately, the robustness of TWFE regressions to treatment effect heterogeneity does not seem to carry over to these more complicated (and empirically relevant) settings

- Much of the recent (mostly negative) literature on TWFE in the context of DID has considered these types of "realistic" settings

- We will start to cover these sorts of issues in the next section!