Difference-in-Differences with a Continuous Treatment

Brantly Callaway

University of Georgia

Andrew Goodman-Bacon

Federal Reserve Bank of Minneapolis

Pedro Sant’Anna

Emory University

June 14, 2024

What’s Been Happening in the DID Literature?

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

There have been a number of recent advances in the difference-in-differences (DID) literature. Two broad contributions:

  • Contribution 1: Diagnose issues with the two-way fixed effects (TWFE) regressions commonly used to implement DID identification strategies \[Y_{i,t} = \theta_t + \eta_i + \beta^{twfe} D_{i,t} + e_{i,t}\]
    • Roughly: TWFE regression can deliver poor estimates of causal effect parameters in the presence of treatment effect heterogeneity
  • Contribution 2: Propose alternative estimation strategies that “work” whenever the identification strategy works (and are robust to treatment effect heterogeneity)

This Paper

These papers have (largely) focused on the case with a binary, staggered treatment

Current paper: Move from a setting with a binary treatment to one with a continuous treatment (“dose”)

Some of the arguments involve extending ideas from the binary, staggered treatment case to a setting with continuous treatment

  • But we will also face new conceptual issues in this case that do not show up in a setting with a binary treatment

Example:

  • Effect of \(\underbrace{\textrm{length of school closures}}_{\textrm{continuous treatment}}\) (during Covid) on \(\underbrace{\textrm{students' test scores}}_{\textrm{outcome}}\)

Today’s Talk




  1. Identification: What’s the same as in the binary treatment case?

  2. Identification: What’s different from the binary treatment case?

  3. Interpreting TWFE Regressions (quickly if time permits)

1. Identification: What’s the same as in the binary treatment case?

Continuous Treatment Notation

Potential outcomes notation

  • Two time periods: \(t=1\) and \(t=2\)

    • No one treated until period \(t=2\)
    • Some units remain untreated in period \(t=2\)
  • Potential outcomes: \(Y_{i,t=2}(d)\)

  • Observed outcomes: \(Y_{i,t=2}\) and \(Y_{i,t=1}\)

    \[Y_{i,t=2}=Y_{i,t=2}(D_i) \quad \textrm{and} \quad Y_{i,t=1}=Y_{i,t=1}(0)\]

Parameters of Interest (ATT-type)

Level Effects (Average Treatment Effect on the Treated)

\[ATT(d|d) := \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d]\]

  • Interpretation: The average effect of dose \(d\) relative to not being treated local to the group that actually experienced dose \(d\)

  • This is the natural analogue of \(ATT\) in the binary treatment case

Parameters of Interest (ATT-type)

Slope Effects (Average Causal Response on the Treated)

\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]

  • Interpretation: \(ACRT(d|d)\) is the causal effect of a marginal increase in dose local to units that actually experienced dose \(d\)

Aggregated Parameters

Notice that \(ATT(d|d)\) and \(ACRT(d|d)\) are functional parameters

  • This is different from \(\beta^{twfe}\) (from the TWFE regression of \(Y_{i,t}\) on \(D_{i,t}\))

We can view \(ATT(d|d)\) and \(ACRT(d|d)\) as the “building blocks” for a more aggregated parameter. Aggregated versions of these (into a single number) are \[\begin{align*} ATT^o := \E[ATT(D|D)|D>0] \qquad \qquad ACRT^o := \E[ACRT(D|D)|D>0] \end{align*}\]

  • \(ATT^o\) averages \(ATT(d|d)\) over the distribution of the dose among treated units (\(D>0\))

  • \(ACRT^o\) averages \(ACRT(d|d)\) over the distribution of the dose among treated units (\(D>0\))

  • \(ACRT^o\) is the natural target parameter for the TWFE regression in this case

Identification

“Standard” Parallel Trends Assumption

For all \(d\),

\[\E[\Delta Y_{i,t=2}(0) | D_i=d] = \E[\Delta Y_{i,t=2}(0) | D_i=0]\]

Then,

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[\Delta Y_{i,t=2}(0) | D_i=0]\\ &= \E[\Delta Y_{i,t=2} | D_i=d] - \E[\Delta Y_{i,t=2} | D_i=0] \end{aligned} \]

This is exactly what you would expect
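As a minimal sketch of this estimand (not the authors' code), the following simulates a hypothetical two-period dataset with discrete doses in which standard parallel trends holds by construction, then estimates \(ATT(d|d)\) as the mean outcome change at dose \(d\) minus the mean change among untreated units. The data-generating process, the variable names, and the `att_hat` helper are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical design: discrete doses, with dose 0 meaning "untreated".
dose = rng.choice([0.0, 1.0, 2.0, 3.0], size=n)

# Period-1 outcomes (everyone untreated); levels are allowed to differ by dose group.
y1 = 2.0 * dose + rng.normal(size=n)

# Untreated trend has the same mean in every dose group (standard parallel trends).
trend0 = 1.0 + rng.normal(size=n)

# Assumed DGP: the per-unit-of-dose effect depends on the group's own dose.
theta = 0.5 + 0.25 * dose
y2 = y1 + trend0 + theta * dose

dy = y2 - y1  # observed change in outcomes


def att_hat(d):
    """DID estimate of ATT(d|d): mean change at dose d minus mean change at dose 0."""
    return dy[dose == d].mean() - dy[dose == 0.0].mean()


estimates = {d: att_hat(d) for d in (1.0, 2.0, 3.0)}
truth = {d: (0.5 + 0.25 * d) * d for d in (1.0, 2.0, 3.0)}
for d in estimates:
    print(f"dose {d}: estimate {estimates[d]:.3f}, truth {truth[d]:.3f}")
```

Each estimate uses only the dose-\(d\) group and the untreated group, so it is exactly the binary-treatment DID comparison applied one dose at a time.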

2. Identification: What’s different from the binary treatment case?

Are we done?

Unfortunately, no

Most empirical work with a continuous treatment wants to think about how causal responses vary across dose

  • Plot treatment effects as a function of dose and ask: does more dose tend to increase/decrease/not affect outcomes?
  • Average causal response parameters inherently involve comparisons across slightly different doses

There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects

  • Unlike the staggered, binary treatment case: No easy fixes here!

Interpretation Issues

Consider comparing \(ATT(d|d)\) for two different doses

\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l)\\ & \hspace{25pt} = \E[Y_{i,t=2}(d_h)-Y_{i,t=2}(d_l) | D_i=d_h] + \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0) | D_i=d_h] - \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0) | D_i=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{i,t=2}(d_h) - Y_{i,t=2}(d_l) | D_i=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]

“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here

  • Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends

  • This problem spills over into identifying \(ACRT(d|d)\)
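
A stylized illustration of the decomposition, under a functional form that is assumed here and is not part of the talk: suppose unit-level effects are linear in dose, \(Y_{i,t=2}(d) - Y_{i,t=2}(0) = \theta_i d\), and write \(\theta(d) := \E[\theta_i | D_i=d]\). Then \(ATT(l|d) = \theta(d)\, l\), and

\[ \begin{aligned} ATT(d_h|d_h) - ATT(d_l|d_l) &= \theta(d_h)\, d_h - \theta(d_l)\, d_l\\ &= \underbrace{\theta(d_h)(d_h - d_l)}_{\textrm{Causal Response}} + \underbrace{\big(\theta(d_h) - \theta(d_l)\big)\, d_l}_{\textrm{Selection Bias}} \end{aligned} \]

For instance, with \(d_l=1\), \(d_h=2\), \(\theta(d_l)=0.75\), and \(\theta(d_h)=1\), the difference in \(ATT\)'s is \(2 - 0.75 = 1.25\), of which \(1\) is the causal response and \(0.25\) is selection bias. The bias term vanishes only if \(\theta(d_h)=\theta(d_l)\), a restriction on treatment effect heterogeneity that standard parallel trends does not impose.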

Interpretation Issues

Intuition:

  • Difference-in-differences identification strategies result in \(ATT(d|d)\) parameters. These are local parameters and difficult to compare to each other

  • This explanation is similar to thinking about LATEs with two different instruments

  • Thus, comparing \(ATT(d|d)\) across different values is tricky and not for free

What can you do?

  • One idea: just recover \(ATT(d|d)\) and interpret it cautiously (interpret it by itself, not relative to different values of \(d\))

  • If you want to compare them to each other, it will come with the cost of additional (structural) assumptions

Introduce Stronger Assumptions

“Strong” Parallel Trends Assumption

For all doses \(d\) and \(l\),

\[\mathbb{E}[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] = \mathbb{E}[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d]\]

  • This is notably different from “Standard” Parallel Trends

  • It involves potential outcomes for all values of the dose (not just untreated potential outcomes)

  • All dose groups would have experienced the same path of outcomes had they been assigned the same dose

Introduce Stronger Assumptions

Strong parallel trends is equivalent to a particular restriction on treatment effect heterogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=l]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=l] = ATT(d|l) \end{aligned} \]

Since this holds for all \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{i,t=2}(d) - Y_{i,t=2}(0)]\). Thus, under strong parallel trends, we have that

\[ATE(d) = \E[\Delta Y_{i,t=2}|D_i=d] - \E[\Delta Y_{i,t=2}|D_i=0]\]

RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” PT, but

  • assumptions are different

  • parameter interpretation is different

Comparisons across dose

ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{i,t=2}(d_h) - Y_{i,t=2}(0)] - \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0)]\\ &= \underbrace{\E[Y_{i,t=2}(d_h) - Y_{i,t=2}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]

Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption

Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting

\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^o := \E[ACR(D) | D>0] \end{aligned} \]

Additional Comments

3. Interpreting TWFE Regressions

TWFE Regressions in this Context

Consider the same TWFE regression (but now \(D_{i,t}\) is continuous): \[\begin{align*} Y_{i,t} = \theta_t + \eta_i + \beta^{twfe} D_{i,t} + e_{i,t} \end{align*}\] We show that \[\begin{align*} \beta^{twfe} = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}\] where \(m_\Delta(l) := \E[\Delta Y_{i,t=2}|D_i=l] - \E[\Delta Y_{i,t=2}|D_i=0]\) and \(w(l)\) are weights

  • Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)

  • Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).

About the weights: they are all positive, but they have some strange properties. For example, they are always maximized at \(l = \E[D]\), even if this is not a common value of the dose.

  • \(\implies\) even under strong parallel trends, \(\beta^{twfe} \neq ACR^o\) in general.
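
To make the weighting result concrete, here is a hedged sketch of its discrete-dose analogue. With dose support \(d_0 < d_1 < \dots < d_K\), the OLS slope of \(\Delta Y\) on \(D\) equals a weighted sum of the step-to-step slopes of the group means. The weight expression used below, \((d_j - d_{j-1})\,\cov(\indicator{D \geq d_j}, D)/\var(D)\), is the standard decomposition of a regression slope and is an assumption here, since the slides do not state the weights explicitly. The simulated data and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical discrete dose with a sizeable untreated group (dose 0).
support = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
dose = rng.choice(support, size=n, p=[0.3, 0.3, 0.2, 0.1, 0.1])
dy = 1.0 + np.sqrt(dose) + rng.normal(size=n)  # assumed outcome changes

# OLS slope from regressing dy on dose (with an intercept).
d_bar = dose.mean()
var_d = np.mean((dose - d_bar) ** 2)
beta_ols = np.mean((dose - d_bar) * dy) / var_d

# Group means of dy by dose, and their slopes between adjacent doses.
m = np.array([dy[dose == d].mean() for d in support])
slopes = np.diff(m) / np.diff(support)

# Weight on the step from d_{j-1} to d_j: (d_j - d_{j-1}) * cov(1{D >= d_j}, D) / var(D).
weights = np.array([
    (d_hi - d_lo) * np.mean((dose >= d_hi) * (dose - d_bar)) / var_d
    for d_lo, d_hi in zip(support[:-1], support[1:])
])
beta_decomp = np.sum(weights * slopes)

# Compare the regression weights with the share of treated units at each dose.
shares = np.array([(dose == d).mean() for d in support[1:]]) / (dose > 0).mean()
print("OLS slope:", beta_ols, "decomposition:", beta_decomp)
print("weights:", weights.round(3), "treated dose shares:", shares.round(3))
```

The two slope numbers agree exactly in sample, while the printed weights differ from the treated dose shares, which is the sense in which the regression does not average slopes the way \(ACR^o\) does.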

TWFE Regressions in this Context

Other issues can arise in more complicated cases

  • For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment

  • In general, things get worse for TWFE regressions with more complications

What should you do?

Level Effects - no issues related to selection bias

  • For \(ATT^o\): Binarize treatment, \(ATT^o = \E[\Delta Y_{i,t=2} | D_i > 0] - \E[\Delta Y_{i,t=2} | D_i=0]\).

  • For \(ATT(d|d)\): Nonparametrically estimate \(\E[\Delta Y_{i,t=2}|D_i=d]-\E[\Delta Y_{i,t=2}|D_i=0]\)

    • This is not actually too hard to estimate. No curse-of-dimensionality, etc.

Slope Effects - must deal with selection bias

  • Nonparametrically estimate derivative of \(\E[\Delta Y_{i,t=2}|D_i=d]\)

  • For \(ACR(d)\): Under strong parallel trends, derivative is equal to \(ACR(d)\)

  • For \(ACR^o\): Average \(ACR(D_i)\) over \(D_i>0\)

Additional Comments:

  1. Changing the estimation strategy helps with the weights, but it does not fix the issues related to standard vs. strong parallel trends

  2. It’s relatively straightforward to extend this strategy to settings with multiple periods and variation in treatment timing by extending existing work about a staggered, binary treatment
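
A rough sketch of these recommendations on simulated data follows. It is not the authors' estimator: a quadratic polynomial fit stands in for the nonparametric step, and the data-generating process (in which every dose group shares the same dose-response, so strong parallel trends holds by construction) and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical data: about 20% untreated, treated doses uniform on (0.1, 2.0).
treated = rng.random(n) < 0.8
dose = np.where(treated, rng.uniform(0.1, 2.0, size=n), 0.0)

# Assumed DGP: common dose-response 2d - 0.5d^2 plus a common trend of 1.
dy = 1.0 + 2.0 * dose - 0.5 * dose**2 + rng.normal(size=n)

# Mean outcome change among untreated units (the comparison path).
dy0_bar = dy[~treated].mean()

# Level effect, aggregated: binarize the treatment.
att_o_hat = dy[treated].mean() - dy0_bar

# Level effects by dose: fit E[dy | D=d] - E[dy | D=0] among treated units.
# The quadratic is a simple stand-in for a nonparametric estimator.
coefs = np.polyfit(dose[treated], dy[treated] - dy0_bar, deg=2)


def att_curve(d):
    """Fitted dose-response (level effect) at dose d."""
    return np.polyval(coefs, d)


def acr_curve(d):
    """Derivative of the fitted dose-response at dose d."""
    return np.polyval(np.polyder(coefs), d)


# Slope effect, aggregated: average the derivative over treated units' doses.
acr_o_hat = acr_curve(dose[treated]).mean()

print("ATT^o estimate:", att_o_hat)
print("ACR^o estimate:", acr_o_hat)
print("level effect at d=1:", att_curve(1.0), "slope at d=1:", acr_curve(1.0))
```

The derivative is a causal response only because strong parallel trends holds in this simulation; under standard parallel trends alone the same calculation would also pick up the local selection bias term.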

Summary

It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose

However, comparisons of ATT-type parameters across different doses are hard to interpret

  • They include selection bias terms
  • This issue extends to identifying ACRT parameters
  • These issues extend to TWFE regressions

This suggests targeting ATE-type parameters

  • Comparisons across doses do not contain selection bias terms
  • But identifying ATE-type parameters requires stronger assumptions

[Empirical example about Medicare policy and capital/labor ratios]

Conclusion

Appendix


4. Empirical Application

Empirical Application

This is a simplified version of Acemoglu and Finkelstein (2008)

1983 Medicare reform that eliminated labor subsidies for hospitals

  • Medicare moved to the Prospective Payment System (PPS), which replaced “full cost reimbursement” with “partial cost reimbursement”. This eliminated reimbursements for labor while maintaining reimbursements for capital expenses

  • Rough idea: This changes relative factor prices which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.

  • In the paper, we provide some theoretical arguments concerning properties of production functions that suggest that strong parallel trends holds.

Data

Annual hospital-reported data from the American Hospital Association, 1980-1986

Outcome is capital/labor ratio

  • proxy using the depreciation share of total operating expenses (avg. 4.5%)

  • our setup: collapse to two periods by taking average in pre-treatment periods and average in post-treatment periods

Dose is “exposure” to the policy

  • the fraction of Medicare patients in the period before the policy was implemented

  • roughly 15% of hospitals are untreated (have essentially no Medicare patients)

    • AF provide results both with and without these hospitals: (good) it is useful to have untreated hospitals, but (bad) they are fairly different from treated hospitals (they include federal, long-term, psychiatric, children’s, and rehabilitation hospitals)

Bin Scatter

ATE Plot

ACR(T) Plot

Results

Density weights vs. TWFE weights

TWFE Weights with and without Untreated Group

Positive Side-Comments: No untreated units

It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.

  • In this case, it is not possible to recover level effects like \(ATT(d|d)\).

  • However, notice that \[\begin{aligned}& \E[\Delta Y_{i,t=2} | D_i=d_h] - \E[\Delta Y_{i,t=2}| D_i=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y_{i,t=2} | D_i=d_h] - \E[\Delta Y_{i,t=2}(0) | D_i=d_h]\Big) - \Big(\E[\Delta Y_{i,t=2} | D_i=d_l]-\E[\Delta Y_{i,t=2}(0) | D_i=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]

  • In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.

  • Still face issues related to selection bias / strong parallel trends though

Positive Side-Comments: Alternative approaches

Strategies like binarizing the treatment can still work (though be careful!)

  • If you classify units as being treated or untreated, you can recover the \(ATT\) of being treated at all.

  • On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”

Negative Side-Comment: Pre-testing

That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same

  • In effect, the only testable implication of strong parallel trends in pre-treatment periods is standard parallel trends.

Issues with TWFE Regressions

TWFE

The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:

\[Y_{i,t} = \theta_t + \eta_i + \beta^{twfe} D_{i,t} + v_{i,t}\]

In the baseline case (two periods, no one treated in the first period), this is just

\[\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i\]

\(\beta^{twfe}\) is often (loosely) interpreted as some kind of average causal response (i.e., slope effect) parameter
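
The equivalence between the TWFE regression and the first-difference regression in the baseline case can be checked numerically. The sketch below uses simulated data with hypothetical names, computes the TWFE coefficient by double-demeaning a balanced two-period panel (removing unit and time means), and compares it with the slope from regressing \(\Delta Y_i\) on \(D_i\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000

# Hypothetical data: about 30% untreated, the rest with a continuous dose.
dose = np.where(rng.random(n) < 0.3, 0.0, rng.uniform(0.0, 1.0, size=n))
alpha = rng.normal(size=n)  # unit fixed effects
y1 = alpha + rng.normal(size=n)
y2 = alpha + 0.5 + 1.5 * dose + rng.normal(size=n)

# Balanced panel as (n, 2) arrays; nobody is treated in period 1.
Y = np.column_stack([y1, y2])
D = np.column_stack([np.zeros(n), dose])


def double_demean(X):
    """Remove unit means and time means (valid for a balanced panel)."""
    return X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + X.mean()


# TWFE coefficient: regress double-demeaned Y on double-demeaned D.
Y_tilde, D_tilde = double_demean(Y), double_demean(D)
beta_twfe = (D_tilde * Y_tilde).sum() / (D_tilde**2).sum()

# First-difference regression of the outcome change on dose, with an intercept.
dy = y2 - y1
beta_fd = np.mean((dose - dose.mean()) * dy) / np.mean((dose - dose.mean()) ** 2)

print("TWFE:", beta_twfe, "first difference:", beta_fd)
```

The two coefficients coincide up to floating-point error, so everything said about the slope of \(\Delta Y_i\) on \(D_i\) applies directly to \(\beta^{twfe}\) in this two-period setting.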

Interpreting \(\beta^{twfe}\)

In the paper, we show that

  • Under Standard Parallel Trends:

    \[\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl\]

    • \(w_1(l)\) are positive weights that integrate to 1

    • \(ACRT(l|l)\) is average causal response conditional on \(D_i=l\)

    • \(\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}\) is a local selection bias term

Interpreting \(\beta^{twfe}\)

In the paper, we show that

  • Under Strong Parallel Trends:

    \[\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) ACR(l) \, dl\]

    • \(w_1(l)\) are same weights as before

    • \(ACR(l)\) is average causal response to dose \(l\) across entire population

    • there is no selection bias term

What does this mean?

  • Issue #1: Selection bias terms that show up under standard parallel trends

    \(\implies\) to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions

  • Issue #2: Weights

    • They are all positive

    • But this is a very minimal requirement for weights being “reasonable”

    • These weights have “strange” properties: (i) they are affected by the size of the untreated group, and (ii) they are maximized at \(D_i=\E[D]\).

    • [[Example 1 - Mixture of Normals Dose]]     [[Example 2: Exponential Dose]]

Ex. Mixture of Normals Dose

Back

Ex. Exponential Dose