# Difference in Differences with a Continuous Treatment
### Brantly Callaway, University of Georgia<br>Andrew Goodman-Bacon, Federal Reserve Bank of Minneapolis <br>Pedro H.C. Sant’Anna, Microsoft & Vanderbilt University<br><br><br>
### May 7, 2022<br>SOLE Conference

---

# Introduction

`$$\newcommand{\E}{\mathbb{E}}
\newcommand{\var}{\mathrm{var}}
\newcommand{\cov}{\mathrm{cov}}
\newcommand{\Var}{\mathrm{var}}
\newcommand{\Cov}{\mathrm{cov}}
\newcommand{\Corr}{\mathrm{corr}}
\newcommand{\corr}{\mathrm{corr}}
\newcommand{\L}{\mathrm{L}}
\renewcommand{\P}{\mathrm{P}}
\newcommand{\independent}{{\perp\!\!\!\perp}}$$`


Canonical versions of difference-in-differences are for the case where the treatment is <span class="alert">binary</span>

But many applications in economics involve more complicated treatments that may be <span class="alert-blue">multi-valued</span> or <span class="alert">continuous</span>

**Examples:**

* Minimum wages

* Years of education

* Amount of local spending on public goods

* Amount of pollution

* Number of cigarettes smoked

---

# Introduction

In particular, we'll consider the case where researchers have traditionally run the following two-way fixed effects (TWFE) regression

`$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}$$`

* Treatment is "continuous enough" that researchers would estimate the above model rather than include a sequence of dummy variables

* Researchers often interpret `$\beta^{twfe}$` as some type of  <span class="alert">causal response</span> parameter

---

# Introduction

We'll point out limitations with this sort of TWFE regression in the presence of

1. Treatment effect heterogeneity

2. Multiple periods / variation in treatment timing

3. The "local-ness" of DID identification strategies

We'll also discuss alternative approaches

- Like the recent literature on DID (mainly) with a binary, staggered treatment, one can propose "fixes" that are robust to issues (i) and (ii)

- However, issue (iii) is "deeper" and often requires "structural" types of assumptions (i.e., assumptions that allow for extrapolation)

- TWFE regressions also inherently rely on these types of assumptions in this context, even in favorable cases such as exactly two periods
   
---
    
# Outline

1. Identification in Baseline Case with Two Periods

2. TWFE Regressions with Two Periods

3. Dealing with Selection Bias Terms

4. Extensions to Multiple Periods and Variation in Treatment Timing

---

# Identification in Baseline Case with Two Periods

---

# Notation

Potential outcomes notation

* Two time periods: `$t^*$` and `$t^*-1$`
  
  * No one treated until period `$t^*$`
    
  * Some units remain untreated in period `$t^*$`
  
* Potential outcomes: `$Y_{it^*}(d)$`

* Observed outcomes: `$Y_{it^*}$` and `$Y_{it^*-1}$`

`$$Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)$$`

---

# Parameters of Interest (ATT-type)

* Level Effects (Average Treatment Effect on the Treated)

`$$ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]$$`

* Interpretation: The average effect of dose `$d$` relative to not being treated *local to the group that actually experienced dose `$d$`*
  
  * This is the natural analogue of `$ATT$` in the binary treatment case

* Slope Effect (Average Causal Responses)

`$$ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACRT^O := \E[ACRT(D|D)|D>0]$$`
  
  * Interpretation: `$ACRT(d|d)$` is the causal effect of a marginal increase in dose *local to units that actually experienced dose `$d$`*
  
  * `$ACRT^O$` averages `$ACRT(d|d)$` over the population distribution of the dose

---

# Identification
<div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span>

For all `d`,

<p style="text-align:center">
$\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]$
</p>

</div>

Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}
\end{aligned}`
$$

---

count:false
# Identification
<div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span>

For all `d`,

<p style="text-align:center">
$\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]$
</p>

</div>

Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]
\end{aligned}`
$$

---

count:false
# Identification
<div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span>

For all `d`,

<p style="text-align:center">
$\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]$
</p>

</div>

Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]
\end{aligned}`
$$
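
The last line above is estimable from observed data: `$Y_{t^*}(d) - Y_{t^*-1}(0)$` is the observed change for units with `$D=d$`, and `$\Delta Y_{t^*}(0)$` is the observed change for untreated units. As a rough numerical check, here is a minimal simulated sketch. The data-generating process, the three dose values, and all variable names are hypothetical choices made for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300_000
dose = rng.choice([0.0, 1.0, 2.0], size=n)            # D_i (hypothetical three-valued dose)

# Per-unit effect of one unit of dose; its mean differs by dose group (selection on gains).
slope = np.where(dose == 2.0, 3.0, 1.0) + rng.normal(0.0, 0.5, n)

level = 2.0 * dose + rng.normal(0.0, 1.0, n)           # outcome levels may differ by group
y_pre = level + rng.normal(0.0, 1.0, n)                # Y_{t*-1}(0)
y0_post = level + 0.7 + rng.normal(0.0, 1.0, n)        # Y_{t*}(0): common trend, so parallel trends holds
y_post = y0_post + slope * dose                        # observed Y_{t*} = Y_{t*}(D_i)
dy = y_post - y_pre                                    # observed Delta Y_{t*}

def did(d):
    """E[Delta Y | D = d] - E[Delta Y | D = 0], estimated by sample means."""
    return dy[dose == d].mean() - dy[dose == 0.0].mean()

# True ATT(d|d) in this simulation versus the DID estimate.
att_true = {d: (slope[dose == d] * d).mean() for d in (1.0, 2.0)}
att_did = {d: did(d) for d in (1.0, 2.0)}
print(att_true, att_did)
```

In this design the DID contrast tracks `$ATT(d|d)$` for each dose even though outcome levels and treatment effects differ across dose groups.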

---

count:false
# Identification
<div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span>

For all `d`,

<p style="text-align:center">
$\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]$
</p>

</div>

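Then,

$$
`\begin{aligned}
ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\
&= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0]
\end{aligned}`
$$

<mark>This is exactly what you would expect</mark>
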
---

# Are we done?

<mark>Unfortunately, no</mark>

Most empirical work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose

* For example, plot treatment effects as a function of dose

* Does more dose tend to increase, decrease, or not affect outcomes?
  
* Average causal response parameters *inherently* involve comparisons across slightly different doses

---

# Interpretation Issues
Consider comparing `$ATT(d|d)$` for two different doses
--

$$
`\begin{aligned}
& ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}
\end{aligned}`
$$

---

count:false
# Interpretation Issues
Consider comparing `$ATT(d|d)$` for two different doses

$$
`\begin{aligned}
& ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\
& \hspace{25pt} = \Big(\E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=0]\Big) - \Big(\E[\Delta Y_{t^*}|D=d_l] - \E[\Delta Y_{t^*}|D=0]\Big)\\
& \hspace{25pt} = \E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=d_l]\\
& \hspace{25pt} = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}}
\end{aligned}`
$$

"Standard" Parallel Trends is not strong enough to rule out the selection bias terms here

* Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions

* Intuition: DID identifies `$ATT(d|d)$` parameters that are "local" to dose `$d$`; comparing local parameters is tricky (Fricke (2017))
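
The decomposition above can be illustrated with a small simulation. This is a sketch under assumed values: the dose groups, the per-unit effect `slope`, and the common untreated trend are all hypothetical. Potential outcomes are taken to be `$Y_{t^*}(d) = Y_{t^*}(0) + \text{slope}_i \cdot d$`, and units that chose the high dose gain more per unit of dose.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000
dose = rng.choice([0.0, 1.0, 2.0], size=n)
# Units that selected d = 2 have a larger per-unit effect (selection on gains).
slope = np.where(dose == 2.0, 3.0, 1.0) + rng.normal(0.0, 0.5, n)
dy0 = 0.7 + rng.normal(0.0, 1.0, n)       # untreated change: same mean in every group
dy = dy0 + slope * dose                    # observed change

high, low = dose == 2.0, dose == 1.0

# What the data deliver: E[dY | D = 2] - E[dY | D = 1] = ATT(2|2) - ATT(1|1).
observed_gap = dy[high].mean() - dy[low].mean()

# Causal response for the high-dose group: E[Y(2) - Y(1) | D = 2].
causal_response = (slope[high] * (2.0 - 1.0)).mean()

# Selection bias: ATT(1|2) - ATT(1|1).
selection_bias = (slope[high] * 1.0).mean() - (slope[low] * 1.0).mean()

print(observed_gap, causal_response, selection_bias)
```

Here the observed gap is roughly 5, of which about 3 is the causal response and about 2 is selection bias, even though standard parallel trends holds by construction.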

---

# Interpretation Issues

<span class="alert">Positive side-comment:</span> `$ATT(d_h|d_h) - ATT(d_l|d_l) = \E[\Delta Y_{t^*} | D=d_h] - \E[\Delta Y_{t^*} | D=d_l]$` (which doesn't involve the untreated group)

This problem spills over into identifying `$ACRT(d|d)$`.  In particular, the same sort of arguments imply that

`\begin{align*}
  \frac{\partial \E[\Delta Y_{t^*}|D=d]}{\partial d} = ACRT(d|d) + \underbrace{\frac{\partial ATT(d|l)}{\partial l} \Big|_{l=d}}_{\textrm{Selection Bias}}
\end{align*}`
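
A short sketch of where the extra term comes from, assuming `$ATT(a|b)$` is differentiable in both arguments (an assumption added here for illustration). Under standard parallel trends, `$\E[\Delta Y_{t^*}|D=d] = ATT(d|d) + \E[\Delta Y_{t^*}|D=0]$`, and the last term does not depend on `$d$`. Since `$d$` enters `$ATT(d|d)$` through both the dose and the conditioning group, the chain rule gives

$$
`\begin{aligned}
\frac{\partial \E[\Delta Y_{t^*}|D=d]}{\partial d} &= \frac{\partial}{\partial d} ATT(d|d) \\
&= \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d} + \frac{\partial ATT(d|l)}{\partial l} \Big|_{l=d} \\
&= ACRT(d|d) + \underbrace{\frac{\partial ATT(d|l)}{\partial l} \Big|_{l=d}}_{\textrm{Selection Bias}}
\end{aligned}`
$$

The first term changes the dose while holding the group fixed; the second changes the group while holding the dose fixed.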

---

# Recap

With a multi-valued or continuous treatment, identifying `$ATT(d|d)$` is just like the case with a binary treatment

* Suggests one can estimate `$ATT(d|d)$` and readily interpret it as the average treatment effect of dose `$d$` among those that experienced dose `$d$`

However, standard versions of parallel trends assumptions (alone) do not rationalize making comparisons across different doses

* Plots of `$ATT(d|d)$` as a function of dose have competing explanations as (i) actual causal effects or (ii) selection bias, or some combination of these

* Parallel trends does not justify taking the derivative of `$\E[\Delta Y_{t^*}|D=d] - \E[\Delta Y_{t^*}|D=0]$` (w.r.t. `$d$`) and interpreting it as `$ACRT(d|d)$`

* ...or averaging this into `$ACRT^O$`.

---

# TWFE Regressions with Two Periods

---

# TWFE

The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:

`$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Post_{t^*} + v_{it}$$`

In the baseline case (two periods, no one treated in the first period), this is just

`$$\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i$$`

`$\beta^{twfe}$` often loosely interpreted as some kind of (average?) causal response parameter

We'll consider the case where:

- Standard parallel trends holds

- But allow for treatment effect heterogeneity and selection into a particular amount of the treatment
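
The equivalence between the two-period TWFE regression and the first-difference regression can be checked numerically. The sketch below uses simulated data with hypothetical parameter values, and it computes the TWFE coefficient by two-way demeaning, which is exact for a balanced panel.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000
# Some units stay untreated (D = 0); the rest get a continuous dose.
dose = np.where(rng.random(n) < 0.3, 0.0, rng.uniform(0.5, 3.0, n))
unit_fe = rng.normal(0.0, 1.0, n)
y_pre = unit_fe + rng.normal(0.0, 1.0, n)
y_post = unit_fe + 0.5 + 1.5 * dose + rng.normal(0.0, 1.0, n)

# Balanced panel stored wide: column 0 is t*-1, column 1 is t*.
y = np.column_stack([y_pre, y_post])
x = np.column_stack([np.zeros(n), dose])       # regressor D_i * Post_t

def two_way_demean(a):
    """Remove unit means and period means (within transformation)."""
    return a - a.mean(axis=1, keepdims=True) - a.mean(axis=0, keepdims=True) + a.mean()

y_dd, x_dd = two_way_demean(y), two_way_demean(x)
beta_twfe = (x_dd * y_dd).sum() / (x_dd ** 2).sum()

# First-difference regression of Delta Y_i on D_i with an intercept.
dy = y_post - y_pre
beta_fd = np.cov(dose, dy)[0, 1] / np.var(dose, ddof=1)

print(beta_twfe, beta_fd)
```

The two coefficients agree up to floating-point error, which is why the rest of the section can work with the simpler regression of `$\Delta Y_i$` on `$D_i$`.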

---

# Interpreting `$\beta^{twfe}$`

In the paper, we show that

* Under Standard Parallel Trends:

`$$\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl + w_0 \frac{ATT(d_L|d_L)}{d_L}$$`

* `$w_1(l)$` and `$w_0$` are positive weights that integrate to 1
  
  * `$ACRT(l|l)$` is average causal response conditional on `$D=l$`
  
  * `$\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}$` is a local selection bias term
  
  * `$\frac{ATT(d_L|d_L)}{d_L}$` is the causal effect of going from no dose to the smallest possible dose (conditional on `$D=d_L$`)
  
---

# What does this mean?

* Issue \#1: Selection bias terms that show up under standard parallel trends

`$\implies$` to interpret as a weighted average of any kind of causal responses, need to invoke stronger assumptions
  
--
  
* Issue \#2: Weights

* They are all positive
  
  * But this is a <span class="alert">very minimal</span> requirement for weights being "reasonable"
  
  * These weights have the "strange" property that they are maximized at `$d=\E[D]$`.
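
To get a feel for the shape of the weights, here is a minimal numerical sketch for an exponential dose. The weight formula in the code, `$w_1(l) = (\E[D|D \geq l] - \E[D]) \, \P(D \geq l) / \var(D)$`, is not stated on these slides; it is written from memory of the accompanying paper and should be treated as an assumption to check against the paper. With a continuous dose whose support starts at zero, the `$w_0$` term is taken to be zero here.

```python
import numpy as np

# Deterministic "sample" from an Exponential(1) dose via evenly spaced quantiles.
n = 100_000
u = (np.arange(n) + 0.5) / n
dose = -np.log1p(-u)

grid = np.arange(0.0, 12.0 + 1e-9, 0.05)
mean_d, var_d = dose.mean(), dose.var()

# Assumed formula: w_1(l) = (E[D | D >= l] - E[D]) * P(D >= l) / var(D),
# which is the same as cov(D, 1{D >= l}) / var(D).
w1 = np.array([((dose - mean_d) * (dose >= l)).mean() for l in grid]) / var_d

# Trapezoid rule for the integral of the weights over the grid.
total = np.sum((w1[1:] + w1[:-1]) / 2.0 * np.diff(grid))
peak = grid[np.argmax(w1)]

print(total, peak, mean_d)
```

Under this formula the weights are non-negative, integrate to roughly one, and peak near `$\E[D]$`, regardless of where most of the units' doses actually sit.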
  
---

# Ex. Mixture of Normals Dose

![](did_continuous_treatment_short_files/figure-html/unnamed-chunk-4-1.png)

---

# Ex. Exponential Dose
![](did_continuous_treatment_short_files/figure-html/unnamed-chunk-5-1.png)

---

# What does this mean?

These sorts of decompositions are generally not unique: we also show that you can relate `$\beta^{twfe}$` to underlying `$ATT(d|d)$` terms

* These do not involve selection bias terms

* However, the weights integrate to 0 (rather than 1) and can be negative, suggesting (not surprisingly) that you should not think of `$\beta^{twfe}$` as approximating the `$ATT(d|d)$` function.

All this to say, besides not generally being robust to treatment effect heterogeneity (even in cases with two periods), the TWFE regression inherently suffers from the issues related to `$ATT(d|d)$` being local

Sufficient conditions for `$\beta^{twfe} = ACRT^O$`:

1. `$ACRT(d|d)$` constant across `$d$` (version of treatment effect homogeneity)

2. No selection bias

---

# Dealing with Selection Bias Terms

---

# Dealing with Selection Bias Terms

<div class="assumption-box"><span class="assumption-title">"Strong" Parallel Trends</span>

For all `d`,

<p style="text-align: center">
$\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0)] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]$
</p>

</div>

Under Strong Parallel Trends, it is straightforward to show that

`$$ATE(d) := \E[Y_{t^*}(d) - Y_{t^*}(0)] = \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*}|D=0]$$`

RHS is exactly the same expression as for `$ATT(d|d)$` under "standard" parallel trends, but here

* assumptions are different

* parameter interpretation is different
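
A sketch of the "straightforward" argument, assuming the "for all `$d$`" in Strong Parallel Trends includes `$d=0$`. Observed changes satisfy `$\Delta Y_{t^*} = Y_{t^*}(D) - Y_{t^*-1}(0)$`, so

$$
`\begin{aligned}
\E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*}|D=0] &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=0] \\
&= \E[Y_{t^*}(d) - Y_{t^*-1}(0)] - \E[Y_{t^*}(0) - Y_{t^*-1}(0)] \\
&= \E[Y_{t^*}(d) - Y_{t^*}(0)] = ATE(d)
\end{aligned}`
$$

The second equality applies Strong Parallel Trends twice, once at dose `$d$` and once at dose `$0$`.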

---

# Comparisons across dose
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

$$
`\begin{aligned}
ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]
\end{aligned}`
$$

---

count:false
# Comparisons across dose
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

$$
`\begin{aligned}
ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\
&= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}}
\end{aligned}`
$$

---

# Comments on Strong Parallel Trends

* This is notably different from "Standard" Parallel Trends

* It involves potential outcomes for all values of the dose (not just untreated potential outcomes)

--
  
* It amounts to assuming that there is no selection bias "on average"

* It's slightly weaker than assuming that `$ATT(d|l) = ATT(d|d)$` for all `$d$` and `$l$`, but this is a useful benchmark for thinking about this sort of assumption

* You can also think about this as a treatment effect homogeneity condition (though across "dose groups" rather than amounts of the treatment)

--
  
* This sort of assumption also has the flavor of being "structural" in the sense that it allows extrapolation of treatment effects from observed doses to unobserved doses

* Strong parallel trends implies that one can interpret `$ATE(d)$` globally as being causal

---

# Alternative Ideas

If strong parallel trends is implausible, here are some ideas:

* If one is more narrowly interested in `$ACRT(d|d)$`, could assume that there is no "local" selection bias

* This is likely to still be a strong assumption in many applications

* Even weaker assumptions: one might be willing to assume that the sign of the selection bias is known

* Then, can get an upper or lower bound (depending on the sign of the selection bias) on `$ACRT(d|d)$`.
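
As a sketch of the bounding argument, start from the earlier decomposition under standard parallel trends, `$\frac{\partial \E[\Delta Y_{t^*}|D=d]}{\partial d} = ACRT(d|d) + \frac{\partial ATT(d|l)}{\partial l} \big|_{l=d}$`. If the sign of the selection bias term is assumed known, then

$$
`\begin{aligned}
\frac{\partial ATT(d|l)}{\partial l} \Big|_{l=d} \geq 0 \ &\implies \ ACRT(d|d) \leq \frac{\partial \E[\Delta Y_{t^*}|D=d]}{\partial d} \\
\frac{\partial ATT(d|l)}{\partial l} \Big|_{l=d} \leq 0 \ &\implies \ ACRT(d|d) \geq \frac{\partial \E[\Delta Y_{t^*}|D=d]}{\partial d}
\end{aligned}`
$$

so the estimable derivative is an upper bound in the first case and a lower bound in the second.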

---

# What should you do?

1. Either (i) report `$ATT(d|d)$` directly and interpret carefully, or (ii) be aware (and think through) that `$\beta^{twfe}$`, comparisons across `$d$`, or average causal response parameters all require imposing stronger assumptions

2. With regard to weights, there are likely better options for estimating causal effect parameters

  * Step 1: Nonparametrically estimate `$ACR(d) = \frac{\partial \E[\Delta Y | D=d]}{\partial d}$`
  
    * Side-comment: This is not actually too hard to estimate.  No curse-of-dimensionality, etc.
  
  * Step 2: Estimate `$ACR^O = \E[ACR(D)|D>0]$`.
  
  * <span class="alert">These do not get around the issue of requiring a stronger assumption</span>
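
The two steps above can be sketched as follows. This is only an illustration on simulated data: the cubic polynomial is a simple series-type stand-in for a nonparametric estimator, not the estimator used in the paper, and the data-generating process and variable names are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(3)
n = 5_000
dose = np.where(rng.random(n) < 0.3, 0.0, rng.uniform(0.0, 2.0, n))
# Hypothetical outcome changes: E[Delta Y | D = d] = 0.5 + 2 d - 0.5 d^2.
dy = 0.5 + 2.0 * dose - 0.5 * dose**2 + rng.normal(0.0, 0.5, n)

treated = dose > 0

# Step 1: approximate E[Delta Y | D = d] among treated units with a cubic
# polynomial, then differentiate the fitted curve to get ACR(d).
coefs = P.polyfit(dose[treated], dy[treated], deg=3)   # coefficients, low to high degree
acr = P.polyval(dose[treated], P.polyder(coefs))        # fitted derivative at each treated dose

# Step 2: average the estimated derivative over the treated units' doses.
acr_overall = acr.mean()

print(acr_overall)
```

In this simulation the true derivative is `2 - d`, so the average should be close to `2 - E[D | D > 0]`, which is about 1. Whether that average has a causal interpretation still depends on the stronger assumptions discussed above.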

---

# Extensions to Multiple Periods and Variation in Treatment Timing

---

# Summary of TWFE Issues

* Issue \#1: Selection bias terms that show up under standard parallel trends

`$\implies$` to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
  
--
  
* Issue \#2: Weights

* Negative weights possible due to (i) treatment effect dynamics (de Chaisemartin and d'Haultfoeuille (2020), Goodman-Bacon (2021)) or (ii) heterogeneous causal responses across groups (new)
  
  * Weights are (undesirably) driven by the estimation method

--
  
Weights issues can be solved by carefully choosing desirable comparisons and using appropriate, user-chosen weights (Callaway and Sant'Anna (2021))

Selection bias terms are a more fundamental challenge

---

# Conclusion

* There are a number of challenges to implementing/interpreting DID with a multi-valued or continuous treatment

* Issues related to TWFE are (mostly) anticipated at this point

* But (in my view) the main new issue here is that <span class="alert">justifying interpreting comparisons across different doses as causal effects requires stronger assumptions than most researchers probably think that they are making</span>

* <mark>Link to paper:</mark> [https://arxiv.org/abs/2107.02637](https://arxiv.org/abs/2107.02637)

* <mark>Other Summaries:</mark> &nbsp; (i) [Five minute summary](https://bcallaway11.github.io/posts/five-minute-did-continuous-treatment) &nbsp; &nbsp; &nbsp;  (ii) [Pedro's Twitter](https://twitter.com/pedrohcgs/status/1415915759960690696)

* <mark>Comments welcome:</mark> [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu)

* <mark>Code:</mark> ETA (hopefully) Summer 2022