Difference in Differences with a Continuous Treatment

class: center, middle, inverse, title-slide

# Difference in Differences with a Continuous Treatment
### Brantly Callaway, University of Georgia; Andrew Goodman-Bacon, Federal Reserve Bank of Minneapolis; Pedro H.C. Sant’Anna, Microsoft & Vanderbilt University
### November 20, 2021, Southern Economics Association Conference

---

# Motivation

`$$\newcommand{\E}{\mathbb{E}}$$`


There has been a lot of recent work/interest in DID!

A number of papers have diagnosed issues with the two-way fixed effects (TWFE) regressions that are very commonly used to implement DID

* de Chaisemartin and D'Haultfoeuille (2020), Borusyak, Jaravel, and Spiess (2021), Goodman-Bacon (2021), Sun and Abraham (2021)

Summary of Issues:

* Already-treated groups sometimes serve as comparison group `$\implies$` treatment effect dynamics can lead to very poor estimates of treatment effects

* Weights on underlying parameters are driven by estimation method

---

# Motivation

There have also been a number of papers fixing these issues

* Callaway and Sant'Anna (2021), Cengiz, Dube, Lindner, and Zipperer (2019), Gardner (2021), Wooldridge (2021)

* `$+$` previous papers

Basic idea:

* Explicitly make "good" comparisons and omit "bad" comparisons

* Choose your own weights `$\implies$` can recover overall `$ATT$`, event studies, or other target parameters of interest

---

# This paper

These papers have (largely) focused on the case with a binary, staggered treatment

* Some exceptions: de Chaisemartin and D'Haultfoeuille (2020, 2021)

But there is considerable demand for understanding DID with more general treatments

---

# Twitter

---

count:false

# This paper

Current paper: Generalize binary treatment case to multi-valued or continuous treatment ("dose")

`$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}$$`

Setup:

* Treatment "continuous enough" that researcher would estimate above model rather than include a sequence of dummy variables

* Researchers often interpret `$\beta^{twfe}$` as an average causal response

* i.e., (an average over) causal effects of a marginal increase in the dose
  
---

# This paper

Similar issues as in binary treatment literature related to regression (TWFE) estimation strategies when the treatment is multi-valued and/or continuous

* Already treated units serve as comparison group `$\implies$` poor estimates of treatment effect parameters in the presence of treatment effect dynamics

* The TWFE estimate is a weighted average of underlying treatment parameters, but the weights are driven by the estimation method

* (this one is new) Heterogeneous causal effects of dose across timing-groups can lead to poor estimates (negative weights)
  
--
  
As in the case with a staggered, binary treatment, we can fix all of these by

* Carefully making desirable comparisons

* Choosing our own weights

---

# Now for the bad news...

However, there are new issues related to interpreting differences between treatment effects at different doses as causal effects

Intuition: "Standard" DID delivers ATT-type parameters.

* These are local to a specific dose
 
 `$\implies$` Comparisons across different doses include both:
 
 * The causal effect of more dose
 
 * "Selection bias" terms
 
* Getting rid of these selection bias terms requires additional assumptions that are likely to be substantially stronger in practice

No easy fixes here!

`$\implies$` (at least in some sense), this is more negative than previous papers

---

# Outline

1. Baseline Case: Two periods, no one treated in first period

2. TWFE in Baseline Case

3. More General Case: Multiple periods, variation in treatment timing

4. TWFE in More General Case

---
class: inverse, middle, center

# Baseline Case: Two periods, no one treated in first period

---

# Notation

Potential outcomes notation

* Two time periods: `$t^*$` and `$t^*-1$`
  
  * No one treated until period `$t^*$`
    
  * Some units remain untreated in period `$t^*$`
  
* Potential outcomes: `$Y_{it^*}(d)$`

* Observed outcomes: `$Y_{it^*}$` and `$Y_{it^*-1}$`

`$$Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)$$`

---

# Parameters of Interest (ATT-type)

* Level Effects (Average Treatment Effect on the Treated)

`$$ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]$$`

* Interpretation: The average effect of dose `$d$` relative to not being treated *local to the group that actually experienced dose `$d$`*
  
  * This is the natural analogue of `$ATT$` in the binary treatment case

* Slope Effect (Average Causal Responses)

`$$ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACRT^O := \E[ACRT(D|D)|D>0]$$`
  
  * Interpretation: `$ACRT(d|d)$` is the causal effect of a marginal increase in dose *local to units that actually experienced dose `$d$`*
  
  * `$ACRT^O$` averages `$ACRT(d|d)$` over the population distribution of the dose

---

# Discrete Dose

* Level Effects (Average Treatment Effect on the Treated)

`$$ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]$$`

* This is exactly the same as for continuous dose

--
* Slope Effect (Average Causal Responses)

* Possible doses: `$\{d_1, \ldots, d_J\}$`

`$$ACRT(d_j|d_j) := ATT(d_j|d_j) - ATT(d_{j-1}|d_j)$$`
--

* Interestingly: In the case with a binary treatment, `$ACRT(1|1) = ATT$`
  
    `$\implies$` In binary treatment case, `$ATT$` is both a slope and level effect
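
To see why, a short check under the assumption that the binary case has doses `$\{0, 1\}$`, so that the dose just below `$d_j = 1$` is `$d_{j-1} = 0$`:

$$
`\begin{aligned}
ACRT(1|1) &= ATT(1|1) - ATT(0|1) \\
&= \E[Y_{t^*}(1) - Y_{t^*}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*}(0) | D=1] \\
&= ATT
\end{aligned}`
$$

The second term is zero by construction, which is why the level and slope effects coincide only in the binary case.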

---

# Identification
<div class="assumption-box"> "Standard" Parallel Trends Assumption

For all `$d$`,

`$$\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]$$`

</div>

Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Identification
<div class="assumption-box"> "Standard" Parallel Trends Assumption

For all `$d$`,

`$$\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]$$`

</div>

Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]
\end{aligned}`
$$

---

count:false
# Identification
<div class="assumption-box"> "Standard" Parallel Trends Assumption

For all `$d$`,

`$$\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]$$`

</div>

Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]
\end{aligned}`
$$

---

count:false
# Identification
<div class="assumption-box"> "Standard" Parallel Trends Assumption

For all `$d$`,

`$$\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]$$`

</div>

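Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\
&= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0]
\end{aligned}`
$$

This is exactly what you would expect
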
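
A minimal simulation sketch of this result: under standard parallel trends, the difference in mean outcome changes between dose group `$d$` and the untreated group recovers `$ATT(d|d)$`. The data-generating process, dose values, sample size, and names such as `att_hat` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Discrete dose: 0 (untreated), 1, or 2.
dose = rng.choice([0, 1, 2], size=n, p=[0.4, 0.3, 0.3])

# Unit effects may differ in level across dose groups ...
eta = 0.5 * dose + rng.normal(size=n)
# ... but untreated potential outcomes share a common trend of 1.0,
# so standard parallel trends holds by construction.
y_pre = eta + rng.normal(size=n)
y0_post = eta + 1.0 + rng.normal(size=n)

# Effect per unit of dose is larger for units that chose the high dose.
gain = np.where(dose == 2, 2.0, 1.0)
y_post = y0_post + gain * dose

dy = y_post - y_pre

def att_hat(d):
    """Mean outcome change in dose group d minus mean change among untreated."""
    return dy[dose == d].mean() - dy[dose == 0].mean()

true_att = {1: 1.0, 2: 4.0}  # ATT(1|1) = 1 * 1, ATT(2|2) = 2 * 2
for d in (1, 2):
    print(f"dose {d}: estimate {att_hat(d):.3f}, truth {true_att[d]:.1f}")
```

The estimates track the dose-specific truths even though the two treated groups respond differently to dose, because each `$ATT(d|d)$` is local to its own group.
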
---

# Are we done?

Unfortunately, no

Most applied work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose

* For example, plot treatment effects as a function of dose

* Does more dose tend to increase/decrease/not affect outcomes?
  
* Average causal response parameters *inherently* involve comparisons across slightly different doses

---

# Interpretation Issues
Consider comparing `$ATT(d|d)$` for two different doses
--

$$
`\begin{aligned}
& ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}
\end{aligned}`
$$

---

count:false
# Interpretation Issues
Consider comparing `$ATT(d|d)$` for two different doses

$$
`\begin{aligned}
& ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\
& \hspace{25pt} = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}}
\end{aligned}`
$$

"Standard" Parallel Trends is not strong enough to rule out the selection bias terms here

* Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends

* This problem spills over into identifying `$ACRT(d|d)$`

Positive side-comment: `$ATT(d_h|d_h) - ATT(d_l|d_l) = \E[\Delta Y_{t^*} | D=d_h] - \E[\Delta Y_{t^*} | D=d_l]$` (which doesn't involve the untreated group)
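
The decomposition comes from adding and subtracting `$\E[Y_{t^*}(d_l) | D=d_h]$` inside `$ATT(d_h|d_h)$`. A sketch of the step, using only the definitions above:

$$
`\begin{aligned}
ATT(d_h|d_h) &= \E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h] + \E[Y_{t^*}(d_l) - Y_{t^*}(0) | D=d_h] \\
&= \E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h] + ATT(d_l|d_h)
\end{aligned}`
$$

Subtracting `$ATT(d_l|d_l)$` from both sides leaves the causal response plus `$ATT(d_l|d_h) - ATT(d_l|d_l)$`. That last difference is zero only if the `$d_h$` and `$d_l$` groups would respond identically to dose `$d_l$`, which standard parallel trends does not restrict.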

---

# Alternative Parameters of Interest (ATE-type)

* Level Effects

`$$ATE(d) := \E[Y_{t^*}(d) - Y_{t^*}(0)]$$`
--

* Slope Effects

$$
`\begin{aligned}
  ACR(d) := \frac{\partial ATE(d)}{\partial d} \ \ \ \ &\textrm{or} \ \ \ \ ACR(d_j) := ATE(d_j) - ATE(d_{j-1}) \\
  & \textrm{or} \ \ \ ACR^O := \E[ACR(D) | D>0]
\end{aligned}`
$$

---

# Comparisons across dose
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

$$
`\begin{aligned}
ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]
\end{aligned}`
$$

---

count:false
# Comparisons across dose
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

$$
`\begin{aligned}
ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\
&= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}}
\end{aligned}`
$$

Unfortunately, the "Standard" Parallel Trends Assumption is not strong enough to identify `$ATE(d)$`.
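
A small numerical sketch of the contrast between ATT-type and ATE-type comparisons. Everything here is a hypothetical illustration: two equally sized dose groups, effects that are linear in dose with group-specific slopes, and the untreated group left out for brevity.

```python
# Hypothetical example: the group that chose d_l has effect slope 1,
# the group that chose d_h has effect slope 2 (illustrative numbers only).
d_l, d_h = 1.0, 2.0
slope = {"low": 1.0, "high": 2.0}  # effect of dose d for group g is slope[g] * d
share = {"low": 0.5, "high": 0.5}  # population shares of the two groups

def att(d, group):
    """ATT(d | group): effect of dose d versus no dose within one dose group."""
    return slope[group] * d

def ate(d):
    """ATE(d): effect of dose d versus no dose averaged over both groups."""
    return sum(share[g] * slope[g] * d for g in slope)

# ATT-type comparison across doses mixes a causal response and selection bias.
att_gap = att(d_h, "high") - att(d_l, "low")                # 4 - 1 = 3
causal_response_high = att(d_h, "high") - att(d_l, "high")  # 4 - 2 = 2
selection_bias = att(d_l, "high") - att(d_l, "low")         # 2 - 1 = 1

# ATE-type comparison across doses is a pure average causal response.
ate_gap = ate(d_h) - ate(d_l)                               # 3 - 1.5 = 1.5
avg_causal_response = sum(share[g] * slope[g] * (d_h - d_l) for g in slope)

print(att_gap, causal_response_high + selection_bias)
print(ate_gap, avg_causal_response)
```

In this example the ATT gap of 3 overstates the high-dose group's causal response of 2 by the selection bias of 1, while the ATE gap equals the population-average response to moving from `d_l` to `d_h`.
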

---

# Introduce Stronger Assumptions

<div class="assumption-box">"Strong" Parallel Trends

For all `$d$`,

`$$\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0)] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]$$`

</div>

Under Strong Parallel Trends, it is straightforward to show that

`$$ATE(d) = \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*}|D=0]$$`

RHS is exactly the same expression as for `$ATT(d|d)$` under "standard" parallel trends, but here

* assumptions are different

* parameter interpretation is different
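
One way to see the identification result for `$ATE(d)$`, sketched under the assumption that "for all `$d$`" in Strong Parallel Trends includes `$d=0$`: add and subtract `$Y_{t^*-1}(0)$`, then apply the assumption once at dose `$d$` and once at dose `$0$`.

$$
`\begin{aligned}
ATE(d) &= \E[Y_{t^*}(d) - Y_{t^*-1}(0)] - \E[Y_{t^*}(0) - Y_{t^*-1}(0)] \\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=0] \\
&= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0]
\end{aligned}`
$$

The last line uses `$Y_{t^*} = Y_{t^*}(D)$` and `$Y_{t^*-1} = Y_{t^*-1}(0)$` to replace potential outcomes with observed ones.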

---

# Comments on Strong Parallel Trends

* This is notably different from "Standard" Parallel Trends

* It involves potential outcomes for all values of the dose (not just untreated potential outcomes)
  
* It is related to (but slightly weaker than) assuming

  * `$ATE(d) = ATT(d|d)$` (this is a form of treatment effect homogeneity)

  * All dose groups would have experienced the same path of outcomes had they been assigned the same dose

* Can show that it is not strictly stronger than Standard Parallel Trends

  * But it is likely to be substantially stronger in practice

---

# Summarizing

* It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose

* However, comparisons of ATT-type parameters across different doses are hard to interpret

  * They include selection bias terms

  * This issue extends to identifying ACRT parameters

* This suggests targeting ATE-type parameters

  * Comparisons across doses do not contain selection bias terms

  * But identifying ATE-type parameters requires stronger assumptions

---
class: inverse, center, middle

# TWFE in Baseline Case

---

# TWFE

The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:

`$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Post_t + v_{it}$$`

In baseline case (two periods, no one treated in first period), this is just

`$$\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i$$`

`$\beta^{twfe}$` often loosely interpreted as Average Causal Response
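
A quick numerical check of this equivalence, as a sketch on simulated data. The data-generating process, sample size, and variable names are illustrative assumptions, and the post-period indicator is taken to equal one in the second period only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Roughly 30% untreated units; the rest draw a continuous dose.
dose = np.where(rng.random(n) < 0.3, 0.0, rng.uniform(0.5, 3.0, n))
eta = rng.normal(size=n)                      # unit fixed effects
y_pre = eta + rng.normal(size=n)              # period t* - 1 (no one treated)
y_post = eta + 1.0 + 0.5 * dose**2 + rng.normal(size=n)  # period t*

# (a) First-difference regression: slope of Delta Y on D.
dy = y_post - y_pre
beta_fd = np.cov(dose, dy, ddof=1)[0, 1] / np.var(dose, ddof=1)

# (b) TWFE regression on the stacked panel via two-way demeaning.
y = np.column_stack([y_pre, y_post])          # n x 2 outcome panel
x = np.column_stack([np.zeros(n), dose])      # D_i * Post_t

def two_way_demean(a):
    """Remove unit means and period means (balanced panel)."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

y_dm, x_dm = two_way_demean(y), two_way_demean(x)
beta_twfe = (x_dm * y_dm).sum() / (x_dm**2).sum()

print(beta_fd, beta_twfe)
```

The two coefficients agree up to floating-point error, so any interpretation issue with the first-difference slope carries over to the TWFE coefficient in the baseline case.
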

---
# Interpreting `$\beta^{twfe}$`

In the paper, we show that

* Under Standard Parallel Trends:

`$$\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl + w_0 \frac{ATT(d_L|d_L)}{d_L}$$`

* `$w_1(l)$` and `$w_0$` are positive weights that integrate to 1

* `$ACRT(l|l)$` is average causal response conditional on `$D=l$`

* `$\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}$` is a local selection bias term

* `$\frac{ATT(d_L|d_L)}{d_L}$` is the causal effect of going from no dose to the smallest possible dose (conditional on `$D=d_L$`)
  
---

# Interpreting `$\beta^{twfe}$`

* Under Strong Parallel Trends:

`$$\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) ACR(l) \, dl + w_0 \frac{ATE(d_L)}{d_L}$$`

* `$w_1(l)$` and `$w_0$` are same weights as before

* `$ACR(l)$` is average causal response to dose `$l$` across entire population

* there is no selection bias term

* `$\frac{ATE(d_L)}{d_L}$` is the causal effect of going from no dose to the smallest possible dose (across entire population)

---
# What does this mean?

* Issue \#1: Selection bias terms that show up under standard parallel trends

`$\implies$` to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
  
--
  
* Issue \#2: Weights

  * They are all positive

  * But this is a very minimal requirement for weights being "reasonable"

  * These weights have the "strange" property that they are maximized at `$d=\E[D]$`.
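
A sketch of where the peak at the mean comes from. The slides do not state the weight formula, so the code assumes the Yitzhaki-style form `$w_1(l) = (\E[D|D \geq l] - \E[D]) \, P(D \geq l) / Var(D)$` for regression-slope weights; treat that form, the simulated dose distribution, and all variable names as assumptions to check against the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# About 20% untreated units (dose 0); treated units draw an exponential dose.
dose = np.where(rng.random(n) < 0.2, 0.0, rng.exponential(1.0, n))
mean_d, var_d = dose.mean(), dose.var()

# Evaluate the assumed weight at every observed positive dose l:
#   w(l) = (1 / n) * sum_{i : D_i >= l} (D_i - mean(D)) / Var(D)
d_sorted = np.sort(dose[dose > 0])
tail_sums = np.cumsum((d_sorted - mean_d)[::-1])[::-1]  # sums over D_i >= l
weights = tail_sums / (n * var_d)

l_star = d_sorted[np.argmax(weights)]
print(f"weights peak at dose {l_star:.3f}; mean dose is {mean_d:.3f}")
```

Raising `l` past a unit with a below-mean dose removes a negative term and so raises the weight, while passing a unit with an above-mean dose lowers it. The weight therefore peaks at the mean dose regardless of where the data are dense.
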
 
---

# Ex. Mixture of Normals Dose

![](data:image/png;base64,#did_continuous_treatment_files/figure-html/unnamed-chunk-5-1.png)

---

# Ex. Exponential Dose
![](data:image/png;base64,#did_continuous_treatment_files/figure-html/unnamed-chunk-6-1.png)

---

# What does this mean?

* Issue \#3: Pre-testing

* That the expressions for `$ATE(d)$` and `$ATT(d|d)$` are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between "standard" and "strong" parallel trends 
  
---

# What should you do?

1. Either (i) report `$ATT(d|d)$` directly and interpret carefully, or (ii) be aware (and think through) that `$\beta^{twfe}$`, comparisons across `$d$`, or average causal response parameters all require imposing stronger assumptions

2. With regard to weights, there are likely better options for estimating causal effect parameters

    * Step 1: Nonparametrically estimate `$ACR(d) = \frac{\partial \E[\Delta Y | D=d]}{\partial d}$`

        * Side-comment: This is not actually too hard to estimate. No curse-of-dimensionality, etc.

    * Step 2: Estimate `$ACR^O = \E[ACR(D)|D>0]$`.

    * These do not get around the issue of requiring a stronger assumption
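
The two steps can be sketched as follows. This is only an illustration: a cubic polynomial fit stands in for a genuinely nonparametric estimator, the simulated data and names such as `acr_o_hat` are assumptions, and reading the derivative as `$ACR(d)$` still relies on strong parallel trends.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)
n = 20_000

# About 25% untreated units; treated units draw a continuous dose.
dose = np.where(rng.random(n) < 0.25, 0.0, rng.uniform(0.2, 2.0, n))
# Simulated outcome changes with E[Delta Y | D = d] = 0.3 + d^2.
dy = 0.3 + dose**2 + rng.normal(size=n)

treated = dose > 0

# Step 1: flexible fit of E[Delta Y | D = d] among treated units,
# then differentiate the fitted curve to get an ACR(d) estimate.
fit = Polynomial.fit(dose[treated], dy[treated], deg=3)
acr_hat = fit.deriv()

# Step 2: average the estimated ACR over the treated units' doses.
acr_o_hat = acr_hat(dose[treated]).mean()

print(f"estimated ACR^O: {acr_o_hat:.3f}")
print(f"value implied by the simulation: {2 * dose[treated].mean():.3f}")
```

Averaging over the observed doses of treated units weights each `$ACR(d)$` by the dose distribution itself rather than by regression-driven weights.
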

---

class: inverse, middle, center

# More General Case: Multiple periods, variation in treatment timing

---

# Summary of TWFE Issues

* Issue \#1: Selection bias terms that show up under standard parallel trends

`$\implies$` to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
  
--
  
* Issue \#2: Weights

* Negative weights possible due to (i) treatment effect dynamics or (ii) heterogeneous causal responses across groups
  
  * Are (undesirably) driven by estimation method

--
  
Weights issues can be solved by carefully making desirable comparisons and choosing appropriate weights

Selection bias terms are a more fundamental challenge

---

# Conclusion

* There are a number of challenges to implementing/interpreting DID with a multi-valued or continuous treatment

* Issues related to TWFE are (mostly) anticipated at this point

* But (in my view) the main new issue here is that justifying interpreting comparisons across different doses as causal effects requires stronger assumptions than most researchers probably think that they are making

* Link to paper: [https://arxiv.org/abs/2107.02637](https://arxiv.org/abs/2107.02637)

* Other Summaries: &nbsp; (i) [Five minute summary](https://bcallaway11.github.io/posts/five-minute-did-continuous-treatment) &nbsp; &nbsp; &nbsp; (ii) [Pedro's Twitter](https://twitter.com/pedrohcgs/status/1415915759960690696)

* Comments welcome: [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu)

* Code: ETA 2-3 months