class: center, middle, inverse, title-slide # Difference in Differences with a Continuous Treatment ### Brantly Callaway, University of Georgia
Andrew Goodman-Bacon, Federal Reserve Bank of Minneapolis
Pedro H.C. Sant’Anna, Microsoft & Vanderbilt University
### November 20, 2021
Southern Economics Association Conference --- # Motivation `$$\newcommand{\E}{\mathbb{E}}$$` <style type="text/css"> border-top: 80px solid #BA0C2F; .inverse { background-color: #BA0C2F; } .alert { font-weight:bold; color: red; } .alert-blue { font-weight: bold; color: blue; } .remark-slide-content { font-size: 23px; padding: 1em 4em 1em 4em; } .highlight-red { background-color:red; padding:0.1em 0.2em; } .assumption-box { background-color: rgba(222,222,222,.5); font-size: x-large; padding: 10px; border: 10px solid lightgray; margin: 10px; } .assumption-title { font-size: x-large; font-weight: bold; display: block; margin: 10px; text-decoration: underline; } } </style> There has been a lot of recent work/interest in DID! -- A number of papers have <span class="alert">diagnosed</span> issues with the very commonly used two-way fixed effects (TWFE) regressions for implementing DID * de Chaisemartin and D'Haultfoeuille (2020), Borusyak, Jaravel, and Spiess (2021), Goodman-Bacon (2021), Sun and Abraham (2021) -- Summary of Issues: * Already-treated groups sometimes serve as the comparison group `\(\implies\)` treatment effect dynamics can lead to very poor estimates of treatment effects * Weights on underlying parameters are driven by the estimation method --- # Motivation There have also been a number of papers <span class="alert">fixing</span> these issues * Callaway and Sant'Anna (2021), Cengiz, Dube, Lindner, and Zipperer (2019), Gardner (2021), Wooldridge (2021) * `\(+\)` previous papers -- Basic idea: * Explicitly make "good" comparisons and omit "bad" comparisons * Choose your own weights `\(\implies\)` can recover overall `\(ATT\)`, event studies, or other target parameters of interest --- # This paper These papers have (largely) focused on the case with a binary, staggered treatment * Some exceptions: de Chaisemartin and D'Haultfoeuille (2020, 2021) But there is considerable demand for understanding DID with more general treatments --- # Twitter <center><img
src="tweet_better.png" width=90%></center> --- count:false # This paper <mark>Current paper:</mark> Generalize binary treatment case to multi-valued or continuous treatment (<span class="alert">"dose"</span>) -- `$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}$$` Setup: * Treatment "continuous enough" that a researcher would estimate the above model rather than include a sequence of dummy variables * Researchers often interpret `\(\beta^{twfe}\)` as an <span class="alert">average causal response</span> * i.e., (an average over) causal effects of a marginal increase in the dose --- # This paper <span class="alert">Similar issues</span> as in binary treatment literature related to regression (TWFE) estimation strategies when the treatment is multi-valued and/or continuous * Already-treated units serve as the comparison group `\(\implies\)` poor estimates of treatment effect parameters in the presence of treatment effect dynamics * `\(TWFE\)` estimate is a weighted average of underlying treatment parameters, but the weights are driven by the estimation method * (this one is new) Heterogeneous causal effects of dose across timing-groups can lead to poor estimates (negative weights) -- As in the case with a staggered, binary treatment, we can fix all of these by * Carefully making desirable comparisons * Choosing our own weights --- # Now for the bad news... However, there are <span class="alert">new issues</span> related to interpreting differences between treatment effects at different doses as <span class="alert">causal effects</span> Intuition: "Standard" DID delivers ATT-type parameters. * These are <span class="alert">local</span> to a specific dose `\(\implies\)` Comparisons across different doses include both: * The causal effect of more dose * "Selection bias" terms * Getting rid of these selection bias terms requires additional assumptions that are likely to be substantially stronger in practice No easy fixes here!
-- `\(\implies\)` (at least in some sense), this is <mark>more negative than previous papers</mark> --- # Outline <br> <br> <br> 1. Baseline Case: Two periods, no one treated in first period 2. TWFE in Baseline Case 3. More General Case: Multiple periods, variation in treatment timing 4. TWFE in More General Case --- class: inverse, middle, center # Baseline Case <br><br> Two periods, no one treated in first period --- # Notation Potential outcomes notation * Two time periods: `\(t^*\)` and `\(t^*-1\)` * No one treated until period `\(t^*\)` * Some units remain untreated in period `\(t^*\)` * Potential outcomes: `\(Y_{it^*}(d)\)` * Observed outcomes: `\(Y_{it^*}\)` and `\(Y_{it^*-1}\)` `$$Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)$$` --- # Parameters of Interest (ATT-type) * Level Effects (Average Treatment Effect on the Treated) `$$ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]$$` * Interpretation: The average effect of dose `\(d\)` relative to not being treated *local to the group that actually experienced dose `\(d\)`* * This is the natural analogue of `\(ATT\)` in the binary treatment case -- * Slope Effect (Average Causal Responses) `$$ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACRT^O := \E[ACRT(D|D)|D>0]$$` * Interpretation: `\(ACRT(d|d)\)` is the causal effect of a marginal increase in dose *local to units that actually experienced dose `\(d\)`* * `\(ACRT^O\)` averages `\(ACRT(d|d)\)` over the distribution of the dose among treated units (`\(D>0\)`) --- # Discrete Dose * Level Effects (Average Treatment Effect on the Treated) `$$ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]$$` * This is exactly the same as for continuous dose -- * Slope Effect (Average Causal Responses) * Possible doses: `\(\{d_1, \ldots, d_J\}\)` `$$ACRT(d_j|d_j) := ATT(d_j|d_j) - ATT(d_{j-1}|d_j)$$` -- * Interestingly: In the case with a binary treatment, `\(ACRT(1|1) = ATT\)` `\(\implies\)` In binary treatment case, `\(ATT\)` is both
a slope and level effect --- # Identification <div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span> For all `d`, <p style="text-align:center"> `\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]` </p> </div> -- Then, -- $$ `\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt} \end{aligned}` $$ --- count:false # Identification <div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span> For all `d`, <p style="text-align:center"> `\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]` </p> </div> Then, $$ `\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d] \end{aligned}` $$ --- count:false # Identification <div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span> For all `d`, <p style="text-align:center"> `\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]` </p> </div> Then, $$ `\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0] \end{aligned}` $$ --- count:false # Identification <div class="assumption-box"> <span class="assumption-title">"Standard" Parallel Trends Assumption</span> For all `d`, <p style="text-align:center"> `\mathbb{E}[\Delta Y_t(0) | D=d] = \mathbb{E}[\Delta Y_t(0) | D=0]` </p> </div> Then, $$ `\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\ &= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0] \end{aligned}` $$ <mark>This is exactly what you would expect</mark> --- # Are we done? 
<mark>Unfortunately, no</mark> -- Most applied work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose * For example, plot treatment effects as a function of dose * Does more dose tend to increase, decrease, or not affect outcomes? * Average causal response parameters *inherently* involve comparisons across slightly different doses --- # Interpretation Issues Consider comparing `\(ATT(d|d)\)` for two different doses -- $$ `\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt} \end{aligned}` $$ --- count:false # Interpretation Issues Consider comparing `\(ATT(d|d)\)` for two different doses $$ `\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned}` $$ -- "Standard" Parallel Trends is not strong enough to rule out the selection bias terms here * Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends * This problem spills over into identifying `\(ACRT(d|d)\)` -- <span class="alert">Positive side-comment:</span> `\(ATT(d_h|d_h) - ATT(d_l|d_l) = \E[\Delta Y_{t^*} | D=d_h] - \E[\Delta Y_{t^*} | D=d_l]\)` (which doesn't involve the untreated group) --- # Alternative Parameters of Interest (ATE-type) * Level Effects `$$ATE(d) := \E[Y_{t^*}(d) - Y_{t^*}(0)]$$` -- * Slope Effects $$ `\begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \ \ \ \ &\textrm{or} \ \ \ \ ACR(d_j) := ATE(d_j) - ATE(d_{j-1}) \\ & \textrm{or} \ \ \ ACR^O := \E[ACR(D) | D>0] \end{aligned}` $$ --- # Comparisons across dose ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose -- $$ `\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]
\end{aligned}` $$ --- count:false # Comparisons across dose ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose $$ `\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\ &= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}} \end{aligned}` $$ -- <mark>Unfortunately, the "Standard" Parallel Trends Assumption is not strong enough to identify `\(ATE(d)\)`.</mark> --- # Introduce Stronger Assumptions <div class="assumption-box"><span class="assumption-title">"Strong" Parallel Trends</span> For all `d`, <p style="text-align: center"> `\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0)] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]` </p> </div> -- Under Strong Parallel Trends, it is straightforward to show that `$$ATE(d) = \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*}|D=0]$$` RHS is exactly the same expression as for `\(ATT(d|d)\)` under "standard" parallel trends, but here * assumptions are different * parameter interpretation is different --- # Comments on Strong Parallel Trends * This is notably different from "Standard" Parallel Trends * It involves potential outcomes for all values of the dose (not just untreated potential outcomes) * It is related to (but slightly weaker than) assuming * `\(ATE(d) = ATT(d|d)\)` (this is a form of treatment effect homogeneity) * All dose groups would have experienced the same path of outcomes had they been assigned the same dose * Can show that it is not <span class="alert">strictly</span> stronger than Standard Parallel Trends * But it is likely to be substantially stronger in practice --- # Summarizing * It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose * However, comparisons of ATT-type parameters across different doses are hard to interpret * They include selection bias terms * This issue extends to identifying ACRT parameters * This suggests targeting ATE-type
parameters * Comparisons across doses do not contain selection bias terms * But identifying ATE-type parameters requires stronger assumptions --- class: inverse, center, middle # TWFE in Baseline Case --- # TWFE The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression: `$$Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Post_{t^*} + v_{it}$$` In the baseline case (two periods, no one treated in first period), this is just `$$\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i$$` `\(\beta^{twfe}\)` is often loosely interpreted as an Average Causal Response --- # Interpreting `\(\beta^{twfe}\)` In the paper, we show that * Under Standard Parallel Trends: `$$\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl + w_0 \frac{ATT(d_L|d_L)}{d_L}$$` * `\(w_1(l)\)` and `\(w_0\)` are positive weights that integrate to 1 * `\(ACRT(l|l)\)` is the average causal response conditional on `\(D=l\)` * `\(\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}\)` is a local selection bias term * `\(\frac{ATT(d_L|d_L)}{d_L}\)` is the causal effect of going from no dose to the smallest possible dose (conditional on `\(D=d_L\)`) --- # Interpreting `\(\beta^{twfe}\)` * Under Strong Parallel Trends: `$$\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) ACR(l) \, dl + w_0 \frac{ATE(d_L)}{d_L}$$` * `\(w_1(l)\)` and `\(w_0\)` are the same weights as before * `\(ACR(l)\)` is the average causal response to dose `\(l\)` across the entire population * there is no selection bias term * `\(\frac{ATE(d_L)}{d_L}\)` is the causal effect of going from no dose to the smallest possible dose (across the entire population) --- # What does this mean?
* Issue \#1: Selection bias terms that show up under standard parallel trends `\(\implies\)` to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions -- * Issue \#2: Weights * They are all positive * But this is a <span class="alert">very minimal</span> requirement for weights being "reasonable" * These weights have the "strange" property that they are maximized at `\(d=\E[D]\)`. --- # Ex. Mixture of Normals Dose ![](data:image/png;base64,#did_continuous_treatment_files/figure-html/unnamed-chunk-5-1.png)<!-- --> --- # Ex. Exponential Dose ![](data:image/png;base64,#did_continuous_treatment_files/figure-html/unnamed-chunk-6-1.png)<!-- --> --- # What does this mean? * Issue \#3: Pre-testing * That the expressions for `\(ATE(d)\)` and `\(ATT(d|d)\)` are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between "standard" and "strong" parallel trends --- # What should you do? 1. Either (i) report `\(ATT(d|d)\)` directly and interpret carefully, or (ii) be aware (and think through the fact) that interpreting `\(\beta^{twfe}\)`, comparisons across `\(d\)`, and average causal response parameters all require imposing stronger assumptions -- 2. With regard to weights, there are likely better options for estimating causal effect parameters * Step 1: Nonparametrically estimate `\(ACR(d) = \frac{\partial \E[\Delta Y | D=d]}{\partial d}\)` * Side-comment: This is not actually too hard to estimate. No curse-of-dimensionality, etc. * Step 2: Estimate `\(ACR^O = \E[ACR(D)|D>0]\)`.
* <span class="alert">These do not get around the issue of requiring a stronger assumption</span> --- class: inverse, middle, center # More General Case <br> <br> Multiple periods, variation in treatment timing --- # Summary of TWFE Issues * Issue \#1: Selection bias terms that show up under standard parallel trends `\(\implies\)` to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions -- * Issue \#2: Weights * Negative weights are possible due to (i) treatment effect dynamics or (ii) heterogeneous causal responses across groups * Weights are (undesirably) driven by the estimation method -- Weights issues can be solved by carefully making desirable comparisons and choosing appropriate weights -- Selection bias terms are a more fundamental challenge --- # Conclusion * There are a number of challenges to implementing/interpreting DID with a multi-valued or continuous treatment * Issues related to TWFE are (mostly) anticipated at this point * But (in my view) the main new issue here is that <span class="alert">justifying interpreting comparisons across different doses as causal effects requires stronger assumptions than most researchers probably think that they are making</span> * <mark>Link to paper:</mark> [https://arxiv.org/abs/2107.02637](https://arxiv.org/abs/2107.02637) * <mark>Other Summaries:</mark> (i) [Five minute summary](https://bcallaway11.github.io/posts/five-minute-did-continuous-treatment) (ii) [Pedro's Twitter](https://twitter.com/pedrohcgs/status/1415915759960690696) * <mark>Comments welcome:</mark> [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu) * <mark>Code:</mark> ETA 2-3 months
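
---

# Appendix: Baseline Estimands in Code

A minimal sketch of the two baseline-case quantities from the talk, assuming two periods, a discrete dose, and an untreated group: the dose-specific `\(\E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0]\)` and the slope from regressing `\(\Delta Y_i\)` on `\(D_i\)`. This is not the authors' code; the function names `att_by_dose` and `twfe_slope` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np


def att_by_dose(delta_y, dose):
    """Return E[dY | D=d] - E[dY | D=0] for each positive dose d.

    Under "standard" parallel trends this difference identifies ATT(d|d);
    under "strong" parallel trends the same number identifies ATE(d).
    Assumes a discrete dose and at least one untreated unit (dose == 0).
    """
    delta_y = np.asarray(delta_y, dtype=float)
    dose = np.asarray(dose, dtype=float)
    untreated_trend = delta_y[dose == 0].mean()
    return {
        float(d): float(delta_y[dose == d].mean() - untreated_trend)
        for d in np.unique(dose[dose > 0])
    }


def twfe_slope(delta_y, dose):
    """OLS slope of dY on D, the two-period form of beta^twfe."""
    delta_y = np.asarray(delta_y, dtype=float)
    dose = np.asarray(dose, dtype=float)
    d_centered = dose - dose.mean()
    y_centered = delta_y - delta_y.mean()
    return float(d_centered @ y_centered / (d_centered @ d_centered))


# Hypothetical toy data: two units at each of the doses 0, 1, and 2.
dose = np.array([0, 0, 1, 1, 2, 2])
delta_y = np.array([1.0, 1.0, 2.0, 2.0, 5.0, 5.0])

att = att_by_dose(delta_y, dose)   # dose-specific estimates
beta = twfe_slope(delta_y, dose)   # single regression slope
```

With this toy data the dose-specific estimates are 1 (dose 1) and 4 (dose 2), while the single regression slope is 2. The gap between the two dose-specific estimates, 3, equals the difference in mean outcome changes between the two dose groups; reading that gap as a causal response is exactly the step that requires the stronger assumption.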