class: center, middle, inverse, title-slide

# Difference in Differences with a Continuous Treatment

### Brantly Callaway, University of Georgia
Andrew Goodman-Bacon, Federal Reserve Bank of Minneapolis
Pedro H.C. Sant’Anna, Microsoft & Vanderbilt University
### May 7, 2022
SOLE Conference

---

# Introduction

`$$\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}}$$`

<style type="text/css">
border-top: 80px solid #BA0C2F;
.inverse { background-color: #BA0C2F; }
.alert { font-weight:bold; color: #BA0C2F; }
.alert-blue { font-weight: bold; color: blue; }
.remark-slide-content { font-size: 23px; padding: 1em 4em 1em 4em; }
.highlight-red { background-color:red; padding:0.1em 0.2em; }
.assumption-box { background-color: rgba(222,222,222,.5); font-size: x-large; padding: 10px; border: 10px solid lightgray; margin: 10px; }
.assumption-title { font-size: x-large; font-weight: bold; display: block; margin: 10px; text-decoration: underline; color: #BA0C2F; }
</style>

Canonical versions of difference-in-differences are for the case where the treatment is <span class="alert">binary</span>

--

But many applications in economics involve more complicated treatments that may be <span class="alert-blue">multi-valued</span> or <span class="alert">continuous</span>

--

**Examples:**

* Minimum wages

* Years of education

* Amount of local spending on public goods

* Amount of pollution

* Number of cigarettes smoked

---

# Introduction

In particular, we'll consider the case where researchers have traditionally run the following two-way fixed effects (TWFE) regression

`$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}$$`

* Treatment is "continuous enough" that researchers would estimate the above model rather than include a sequence of dummy variables

* Researchers often interpret `\(\beta^{twfe}\)` as some type of <span class="alert">causal response</span> parameter

---

# Introduction

We'll point out limitations with this sort of TWFE
regression in the presence of

--

1. Treatment effect heterogeneity

--

2. Multiple periods / variation in treatment timing

--

3. The "local-ness" of DID identification strategies

--

We'll also discuss alternative approaches

- Like the recent literature on DID (mainly) with a binary, staggered treatment, one can propose "fixes" that are robust to issues 1 and 2

--

- However, issue 3 is "deeper" and often requires "structural" types of assumptions (i.e., assumptions that allow for extrapolation)

- TWFE regressions also inherently rely on these types of assumptions in this context, even in favorable cases such as exactly two periods

---

# Outline

<br><br><br><br>

1. Identification in Baseline Case with Two Periods

2. TWFE Regressions with Two Periods

3. Dealing with Selection Bias Terms

4. Extensions to Multiple Periods and Variation in Treatment Timing

---

class: inverse, middle, center
count: false

# Identification in Baseline Case with Two Periods

---

# Notation

Potential outcomes notation

* Two time periods: `\(t^*\)` and `\(t^*-1\)`

* No one treated until period `\(t^*\)`

* Some units remain untreated in period `\(t^*\)`

* Potential outcomes: `\(Y_{it^*}(d)\)`

* Observed outcomes: `\(Y_{it^*}\)` and `\(Y_{it^*-1}\)`

`$$Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)$$`

---

# Parameters of Interest (ATT-type)

* Level Effects (Average Treatment Effect on the Treated)

`$$ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]$$`

* Interpretation: The average effect of dose `\(d\)` relative to not being treated *local to the group that actually experienced dose `\(d\)`*

* This is the natural analogue of `\(ATT\)` in the binary treatment case

--

* Slope Effect (Average Causal Responses)

`$$ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACRT^O := \E[ACRT(D|D)|D>0]$$`

* Interpretation: `\(ACRT(d|d)\)` is the causal effect of a marginal increase in dose *local to units that actually experienced dose
`\(d\)`*

* `\(ACRT^O\)` averages `\(ACRT(d|d)\)` over the distribution of the dose among treated units (`\(D>0\)`)

---

# Identification

<div class="assumption-box">
<span class="assumption-title">"Standard" Parallel Trends Assumption</span>
For all `d`,
<p style="text-align:center">
\(\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]\)
</p>
</div>

--

Then,

--

$$
`\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt} \end{aligned}`
$$

---

count:false

# Identification

<div class="assumption-box">
<span class="assumption-title">"Standard" Parallel Trends Assumption</span>
For all `d`,
<p style="text-align:center">
\(\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]\)
</p>
</div>

Then,

$$
`\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d] \end{aligned}`
$$

---

count:false

# Identification

<div class="assumption-box">
<span class="assumption-title">"Standard" Parallel Trends Assumption</span>
For all `d`,
<p style="text-align:center">
\(\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]\)
</p>
</div>

Then,

$$
`\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0] \end{aligned}`
$$

---

count:false

# Identification

<div class="assumption-box">
<span class="assumption-title">"Standard" Parallel Trends Assumption</span>
For all `d`,
<p style="text-align:center">
\(\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]\)
</p>
</div>

Then,

$$
`\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\ &= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0] \end{aligned}`
$$
<mark>This is exactly what you would expect</mark>

---

# Are we done?

--

<mark>Unfortunately, no</mark>

--

Most empirical work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose

* For example, plot treatment effects as a function of dose

* Does more dose tend to increase/decrease/not affect outcomes?

* Average causal response parameters *inherently* involve comparisons across slightly different doses

---

# Interpretation Issues

Consider comparing `\(ATT(d|d)\)` for two different doses

--

$$
`\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt} \end{aligned}`
$$

---

count:false

# Interpretation Issues

Consider comparing `\(ATT(d|d)\)` for two different doses

$$
`\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \Big(\E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=0]\Big) - \Big(\E[\Delta Y_{t^*}|D=d_l] - \E[\Delta Y_{t^*}|D=0]\Big) \end{aligned}`
$$

---

count:false

# Interpretation Issues

Consider comparing `\(ATT(d|d)\)` for two different doses

$$
`\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \Big(\E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=0]\Big) - \Big(\E[\Delta Y_{t^*}|D=d_l] - \E[\Delta Y_{t^*}|D=0]\Big)\\ & \hspace{25pt} = \E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=d_l] \end{aligned}`
$$

---

count:false

# Interpretation Issues

Consider comparing `\(ATT(d|d)\)` for two different doses

$$
`\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \Big(\E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=0]\Big) - \Big(\E[\Delta Y_{t^*}|D=d_l] - \E[\Delta Y_{t^*}|D=0]\Big)\\ & \hspace{25pt} = \E[\Delta Y_{t^*}|D=d_h] - \E[\Delta Y_{t^*}|D=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned}`
$$

--

"Standard" Parallel Trends is not strong enough to rule out the selection bias terms
here

* Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions

* Intuition: DID identifies `\(ATT(d|d)\)` parameters that are "local" to dose `\(d\)`; comparing local parameters is tricky (Fricke (2017))

---

# Interpretation Issues

<span class="alert">Positive side-comment:</span> `\(ATT(d_h|d_h) - ATT(d_l|d_l) = \E[\Delta Y_{t^*} | D=d_h] - \E[\Delta Y_{t^*} | D=d_l]\)` (which doesn't involve the untreated group)

--

This problem spills over into identifying `\(ACRT(d|d)\)`. In particular, the same sort of arguments imply that

--

`\begin{align*} \frac{\partial \E[\Delta Y_{t^*}|D=d]}{\partial d} = ACRT(d|d) + \underbrace{\frac{\partial ATT(d|l)}{\partial l} \Big|_{l=d}}_{\textrm{Selection Bias}} \end{align*}`

---

# Recap

With a multi-valued or continuous treatment, identifying `\(ATT(d|d)\)` is just like the case with a binary treatment

* Suggests one can estimate `\(ATT(d|d)\)` and readily interpret it as the average treatment effect of dose `\(d\)` among those that experienced dose `\(d\)`

--

However, standard versions of parallel trends assumptions (alone) do not rationalize making comparisons across different doses

--

* Plots of `\(ATT(d|d)\)` as a function of dose have competing explanations as (i) actual causal effects or (ii) selection bias, or some combination of these

--

* Parallel trends does not justify taking the derivative of `\(\E[\Delta Y_{t^*}|D=d] - \E[\Delta Y_{t^*}|D=0]\)` (w.r.t. `\(d\)`) and interpreting it as `\(ACRT(d|d)\)`

* ...or averaging this into `\(ACRT^O\)`.
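---

# Sketch: Selection Bias in a Simulation

The decomposition of `\(ATT(d_h|d_h) - ATT(d_l|d_l)\)` into a causal response plus selection bias can be checked numerically. The block below is a minimal illustrative simulation, not code from the paper: the doses (0, 1, 2), the group-specific slopes, the sample size, and the name `simulate_dose_comparison` are all assumptions chosen for this example.

```python
import numpy as np


def simulate_dose_comparison(n=200_000, seed=0):
    """Two-period DID with hypothetical doses 0, 1 and 2."""
    rng = np.random.default_rng(seed)
    dose = rng.choice([0.0, 1.0, 2.0], size=n)

    # Assumed effect of dose d for a unit: slope * d. Units that select the
    # high dose have a larger slope, so effects differ across dose groups.
    slope = np.where(dose == 2.0, 2.0, 1.0)

    # The untreated trend has the same mean in every dose group, so
    # "standard" parallel trends holds by construction.
    untreated_trend = 0.5 + rng.normal(0.0, 1.0, size=n)
    delta_y = untreated_trend + slope * dose

    mean_dy = {d: delta_y[dose == d].mean() for d in (0.0, 1.0, 2.0)}
    att_low = mean_dy[1.0] - mean_dy[0.0]    # DID estimate of ATT(1|1)
    att_high = mean_dy[2.0] - mean_dy[0.0]   # DID estimate of ATT(2|2)

    return {
        "att_low": att_low,
        "att_high": att_high,
        "att_difference": att_high - att_low,
        # Known only because we wrote the data-generating process:
        "true_causal_response": 2.0 * (2.0 - 1.0),      # E[Y(2) - Y(1) | D=2]
        "true_selection_bias": 2.0 * 1.0 - 1.0 * 1.0,   # ATT(1|2) - ATT(1|1)
    }


if __name__ == "__main__":
    for name, value in simulate_dose_comparison().items():
        print(f"{name}: {value:.3f}")
```

In this design `\(ATT(1|1) = 1\)`, `\(ATT(2|2) = 4\)`, and `\(ATT(1|2) = 2\)`, so the difference in ATTs (3) equals the causal response for the high-dose group (2) plus selection bias (1), even though each `\(ATT(d|d)\)` is recovered correctly.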
---

class: inverse, middle, center
count: false

# TWFE Regressions with Two Periods

---

# TWFE

The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:

`$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Post_t + v_{it}$$`

In the baseline case (two periods, no one treated in the first period), this is just

`$$\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i$$`

--

`\(\beta^{twfe}\)` is often loosely interpreted as some kind of (average?) causal response parameter

--

We'll consider the case where:

- Standard parallel trends holds

- But allow for treatment effect heterogeneity and selection into a particular amount of the treatment

---

# Interpreting `\(\beta^{twfe}\)`

In the paper, we show that

* Under Standard Parallel Trends:

`$$\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl + w_0 \frac{ATT(d_L|d_L)}{d_L}$$`

* `\(w_1(l)\)` and `\(w_0\)` are positive weights that integrate to 1

* `\(ACRT(l|l)\)` is average causal response conditional on `\(D=l\)`

* `\(\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}\)` is a local selection bias term

* `\(\frac{ATT(d_L|d_L)}{d_L}\)` is the causal effect of going from no dose to the smallest possible dose (conditional on `\(D=d_L\)`)

---

# What does this mean?

* Issue \#1: Selection bias terms that show up under standard parallel trends `\(\implies\)` to interpret as a weighted average of any kind of causal responses, need to invoke stronger assumptions

--

* Issue \#2: Weights

    * They are all positive

    * But this is a <span class="alert">very minimal</span> requirement for weights being "reasonable"

    * These weights have the "strange" property that they are maximized at `\(d=\E[D]\)`.

---

# Ex. Mixture of Normals Dose

![](data:image/png;base64,#did_continuous_treatment_short_files/figure-html/unnamed-chunk-4-1.png)<!-- -->

---

# Ex. Exponential Dose

![](data:image/png;base64,#did_continuous_treatment_short_files/figure-html/unnamed-chunk-5-1.png)<!-- -->

---

# What does this mean?

These sorts of decompositions are generally not unique: we also show that you can relate `\(\beta^{twfe}\)` to underlying `\(ATT(d|d)\)` terms

* These do not involve selection bias terms

* However, the weights integrate to 0 (rather than 1) and can be negative, suggesting (not surprisingly) that you should not think of `\(\beta^{twfe}\)` as approximating the `\(ATT(d|d)\)` function.

--

All this to say, besides not generally being robust to treatment effect heterogeneity (even in cases with two periods), the TWFE regression inherently suffers from the issues related to `\(ATT(d|d)\)` being local

--

Sufficient conditions for `\(\beta^{twfe} = ACRT^O\)`:

1. `\(ACRT(d|d)\)` constant across `\(d\)` (version of treatment effect homogeneity)

2. No selection bias

---

class: inverse, middle, center
count: false

# Dealing with Selection Bias Terms

---

# Dealing with Selection Bias Terms

<div class="assumption-box"><span class="assumption-title">"Strong" Parallel Trends</span>
For all `d`,
<p style="text-align: center">
\(\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0)] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]\)
</p>
</div>

--

Under Strong Parallel Trends, it is straightforward to show that

`$$ATE(d) := \E[Y_{t^*}(d) - Y_{t^*}(0)] = \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*}|D=0]$$`

RHS is exactly the same expression as for `\(ATT(d|d)\)` under "standard" parallel trends, but here

* assumptions are different

* parameter interpretation is different

---

# Comparisons across dose

ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

--

$$
`\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)] \end{aligned}`
$$

---

count:false

# Comparisons across dose

ATE-type parameters do not suffer from the same issues as ATT-type
parameters when making comparisons across dose

$$
`\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\ &= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}} \end{aligned}`
$$

---

# Comments on Strong Parallel Trends

* This is notably different from "Standard" Parallel Trends

* It involves potential outcomes for all values of the dose (not just untreated potential outcomes)

--

* It amounts to assuming that there is no selection bias "on average"

--

* It's slightly weaker than assuming that `\(ATT(d|l) = ATT(d|d)\)` for all `\(d\)` and `\(l\)`, but this is a useful benchmark for thinking about this sort of assumption

* You can also think about this as a treatment effect homogeneity condition (though across "dose groups" rather than amounts of the treatment)

--

* This sort of assumption also has the flavor of being "structural" in the sense that it allows extrapolation of treatment effects from observed doses to unobserved doses

--

* Strong parallel trends implies that one can interpret `\(ATE(d)\)` globally as being causal

---

# Alternative Ideas

If strong parallel trends is implausible, here are some ideas:

--

* If one is more narrowly interested in `\(ACRT(d|d)\)`, could assume that there is no "local" selection bias

* This is likely to still be a strong assumption in many applications

--

* Even weaker assumptions: one might be willing to assume that the sign of the selection bias is known

* Then, can get an upper or lower bound (depending on the sign of the selection bias) on `\(ACRT(d|d)\)`.

---

# What should you do?

1. Either (i) report `\(ATT(d|d)\)` directly and interpret carefully, or (ii) be aware (and think through) that `\(\beta^{twfe}\)`, comparisons across `\(d\)`, or average causal response parameters all require imposing stronger assumptions

--

2. With regard to weights, there are likely better options for estimating causal effect parameters

    * Step 1: Nonparametrically estimate `\(ACR(d) = \frac{\partial \E[\Delta Y | D=d]}{\partial d}\)`

    * Side-comment: This is not actually too hard to estimate. No curse-of-dimensionality, etc.

    * Step 2: Estimate `\(ACR^O = \E[ACR(D)|D>0]\)`.

    * <span class="alert">These do not get around the issue of requiring a stronger assumption</span>

---

class: inverse, middle, center
count: false

# Extensions to Multiple Periods and Variation in Treatment Timing

---

# Summary of TWFE Issues

* Issue \#1: Selection bias terms that show up under standard parallel trends `\(\implies\)` to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions

--

* Issue \#2: Weights

    * Negative weights possible due to (i) treatment effect dynamics (de Chaisemartin and d'Haultfoeuille (2020), Goodman-Bacon (2021)) or (ii) heterogeneous causal responses across groups (new)

    * Are (undesirably) driven by estimation method

--

Weights issues can be solved by carefully making desirable comparisons and using appropriate, user-chosen weights (Callaway and Sant'Anna (2021))

--

Selection bias terms are a more fundamental challenge

---

# Conclusion

* There are a number of challenges to implementing/interpreting DID with a multi-valued or continuous treatment

* Issues related to TWFE are (mostly) anticipated at this point

* But (in my view) the main new issue here is that <span class="alert">justifying interpreting comparisons across different doses as causal effects requires stronger assumptions than most researchers probably think that they are making</span>

* <mark>Link to paper:</mark> [https://arxiv.org/abs/2107.02637](https://arxiv.org/abs/2107.02637)

* <mark>Other Summaries:</mark> (i) [Five minute summary](https://bcallaway11.github.io/posts/five-minute-did-continuous-treatment) (ii) [Pedro's
Twitter](https://twitter.com/pedrohcgs/status/1415915759960690696)

* <mark>Comments welcome:</mark> [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu)

* <mark>Code:</mark> ETA (hopefully) Summer 2022
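
---

# Appendix: Sketch of the Two-Step Alternative

The two-step suggestion from "What should you do?" (estimate `\(ACR(d)\)`, then average it over treated units) can be illustrated in the two-period case. This is a rough sketch and not the authors' forthcoming code: the data-generating process, the cubic polynomial used as a stand-in for a nonparametric estimator, and the name `compare_twfe_and_acr` are all assumptions made for illustration. Strong parallel trends holds by construction, so the derivative has a causal interpretation here.

```python
import numpy as np


def compare_twfe_and_acr(n=100_000, seed=0, degree=3):
    """Compare the two-period TWFE slope with a two-step ACR^O estimate."""
    rng = np.random.default_rng(seed)
    treated = rng.random(n) < 0.7
    # Hypothetical dose distribution: 0 if untreated, Uniform(0.1, 2) otherwise.
    dose = np.where(treated, rng.uniform(0.1, 2.0, size=n), 0.0)

    # Every unit shares the dose-response d**2, so there is no selection
    # bias and strong parallel trends holds by construction.
    delta_y = 0.5 + dose**2 + rng.normal(0.0, 1.0, size=n)

    # TWFE in the two-period case: slope from regressing delta_y on dose.
    beta_twfe = np.polyfit(dose, delta_y, 1)[0]

    # Step 1: approximate E[delta_y | D=d] for d > 0 with a polynomial
    # (an assumed, simple series estimator) and differentiate it.
    coefs = np.polyfit(dose[treated], delta_y[treated], degree)
    acr_at_dose = np.polyval(np.polyder(coefs), dose[treated])

    # Step 2: average ACR(D) over treated units.
    acr_overall = acr_at_dose.mean()

    return {
        "beta_twfe": beta_twfe,
        "acr_overall": acr_overall,
        # E[2D | D>0] with D | D>0 ~ Uniform(0.1, 2.0)
        "true_acr_overall": 2.0 * (0.1 + 2.0) / 2.0,
    }


if __name__ == "__main__":
    for name, value in compare_twfe_and_acr().items():
        print(f"{name}: {value:.3f}")
```

Because `\(ACR(d) = 2d\)` varies with dose in this design, the TWFE slope and the two-step average weight doses differently and need not agree, even with no selection bias.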