Difference-in-Differences with a Continuous Treatment

Brantly Callaway

brantly.callaway@uga.edu

University of Georgia

Andrew Goodman-Bacon

andrew@goodman-bacon.com

Federal Reserve Bank of Minneapolis

Pedro Sant’Anna

pedro.santanna@emory.edu

Emory University

November 14, 2024

What’s Been Happening in the DID Literature?

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)There have been a number of recent advances in the difference-in-differences literature. Two broad contributions:

Contribution 1: Diagnose issues with the two-way fixed effects (TWFE) regressions commonly used to implement DID identification strategies \[Y_{it} = \theta_t + \eta_i + \beta^{twfe} D_{it} + e_{it}\]
- Roughly: TWFE regression can deliver poor estimates of causal effect parameters in the presence of treatment effect heterogeneity
- (de Chaisemartin and D’Haultfœuille 2020; Goodman-Bacon 2021; Sun and Abraham 2021; Borusyak, Jaravel, and Spiess 2024, among others)

Contribution 2: Propose alternative estimation strategies that “work” when the identification strategy works (and are robust to treatment effect heterogeneity)
- (previous papers plus Callaway and Sant’Anna 2021; Gardner et al. 2023; Wooldridge 2021; Dube et al. 2023)

This Paper

These papers have (largely) focused on the case with a binary, staggered treatment

Current paper: Move from a setting with a binary treatment to one with a continuous treatment (“dose”)

Some of the arguments involve extending ideas from the binary, staggered treatment case to a setting with continuous treatment

But we will also face new conceptual issues in this case that do not show up in a setting with a binary treatment

Example:

Effect of \(\underbrace{\textrm{length of school closures}}_{\textrm{continuous treatment}}\) (during Covid) on \(\underbrace{\textrm{students' test scores}}_{\textrm{outcome}}\)
- e.g., (Ager et al. 2024; Gillitzer and Prasad 2023, among others)

Clarifications about Continuous Treatment

For today, mostly emphasize a continuous treatment, but our results also apply to other settings (with trivial modifications):

multi-valued treatments (e.g., effect of state-level minimum wage policies on employment)
binary policy variable with differential “exposure” to it (application in the paper: a binary Medicare policy where different hospitals had more exposure to the policy)

But results do not apply to “fuzzy” DID setups

Fuzzy DID refers to a setting where a researcher is ultimately interested in understanding the effect of a binary treatment but observes aggregate data
- Ex.: Study union wage-premium (at individual level) using state-level data and exploiting variation in the “amount” of unionization across different locations
- See de Chaisemartin and D’Haultfœuille (2018) for more details about this case

Today’s Talk

Identification: What’s the same as in the binary treatment case?
Identification: What’s different from the binary treatment case?
Interpreting TWFE Regressions
Empirical Application

1. Identification: What’s the same as in the binary treatment case?

Continuous Treatment Notation

Potential outcomes notation

Two time periods: \(t=2\) and \(t=1\)
- No one treated until period \(t=2\)
- Some units remain untreated in period \(t=2\)
Potential outcomes: \(Y_{it=2}(d)\)
Observed outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)

\[Y_{it=2}=Y_{it=2}(D_i) \quad \textrm{and} \quad Y_{it=1}=Y_{it=1}(0)\]

Parameters of Interest (ATT-type)

Level Effects (Average Treatment Effect on the Treated)

\[ATT(d|d) := \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d]\]

Interpretation: The average effect of dose \(d\) relative to not being treated local to the group that actually experienced dose \(d\)
This is the natural analogue of \(ATT\) in the binary treatment case

Parameters of Interest (ATT-type)

Slope Effects (Average Causal Response on the Treated)

\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]

Interpretation: \(ACRT(d|d)\) is the causal effect of a marginal increase in dose local to units that actually experienced dose \(d\)

Aggregated Parameters

Notice that \(ATT(d|d)\) and \(ACRT(d|d)\) are functional parameters

This is different from \(\beta^{twfe}\) (from the TWFE regression of \(Y_{it}\) on \(D_{it}\))

We can view \(ATT(d|d)\) and \(ACRT(d|d)\) as the “building blocks” for a more aggregated parameter.

Aggregated versions of these (into a single number) are \[\begin{align*} ATT^o := \E\Big[ATT(D|D)\Big|D>0\Big] \qquad \qquad ACRT^o := \E\Big[ACRT(D|D)\Big|D>0\Big] \end{align*}\]

\(ATT^o\) averages \(ATT(d|d)\) over the distribution of the dose among treated units (\(D>0\))
\(ACRT^o\) averages \(ACRT(d|d)\) over the distribution of the dose among treated units (\(D>0\))
- \(ACRT^o\) is the natural target parameter for the TWFE regression in this case

Identification

“Standard” Parallel Trends Assumption

For all \(d\),

\[\E[\Delta Y(0) | D=d] = \E[\Delta Y(0) | D=0]\]

Then,

\[ \begin{aligned} ATT(d|d) &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=d]\\ &= \E[\Delta Y | D=d] - \E[\Delta Y | D=0] \end{aligned} \]

This is exactly what you would expect
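
A minimal Python sketch of this result on simulated data may help fix ideas. The data-generating process, the dose grid, and the function name `att_by_dose` are assumptions made for illustration only; they do not come from the paper. Untreated trends have the same mean in every dose group, so standard parallel trends holds by construction.

```python
import numpy as np

def att_by_dose(delta_y, dose):
    """Return {d: mean(ΔY | D=d) - mean(ΔY | D=0)} for each positive dose d."""
    delta_y = np.asarray(delta_y, dtype=float)
    dose = np.asarray(dose, dtype=float)
    untreated_trend = delta_y[dose == 0].mean()
    return {
        float(d): float(delta_y[dose == d].mean() - untreated_trend)
        for d in np.unique(dose[dose > 0])
    }

# Simulated two-period data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 20_000
dose = rng.choice([0.0, 1.0, 2.0, 3.0], size=n)
trend = 0.5 + rng.normal(0.0, 1.0, size=n)  # ΔY(0): same mean in every dose group
effect = 2.0 * np.sqrt(dose)                # assumed true ATT(d|d) = 2 * sqrt(d)
delta_y = trend + effect

estimates = att_by_dose(delta_y, dose)
```

Each entry of `estimates` should be close to the assumed effect \(2\sqrt{d}\) for that dose, up to sampling noise.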

2. Identification: What’s different from the binary treatment case?

Are we done?

Unfortunately, no

Most empirical work with a continuous treatment wants to think about how causal responses vary across dose

Plot treatment effects as a function of dose and ask: does more dose tend to increase/decrease/not affect outcomes?

Average causal response parameters inherently involve comparisons across slightly different doses

There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects

Unlike the staggered, binary treatment case: No easy fixes here!

Interpretation Issues

Consider comparing \(ATT(d|d)\) for two different doses

\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \E[Y_{t=2}(d_h)-Y_{t=2}(d_l) | D=d_h] + \E[Y_{t=2}(d_l) - Y_{t=2}(0) | D=d_h] - \E[Y_{t=2}(d_l) - Y_{t=2}(0) | D=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{t=2}(d_h) - Y_{t=2}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]

“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here

Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends
This problem spills over into identifying \(ACRT(d|d)\)
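
The decomposition can be checked with a small worked example in Python. The numbers are hypothetical: each unit's effect is assumed to be linear in dose, with a per-unit-of-dose effect of 3 in the group that chose \(d_h\) and 1 in the group that chose \(d_l\). The helper `att` is introduced only for this illustration.

```python
def att(d, slope):
    """ATT of dose d for a group whose (assumed) effect per unit of dose is `slope`."""
    return slope * d

d_l, d_h = 1.0, 2.0
slope_low, slope_high = 1.0, 3.0  # hypothetical: the high-dose group gains more per unit of dose

# ATT(d_h|d_h) - ATT(d_l|d_l): what comparing the two dose groups delivers
naive_difference = att(d_h, slope_high) - att(d_l, slope_low)

# E[Y(d_h) - Y(d_l) | D=d_h]: the causal response for the high-dose group
causal_response = att(d_h, slope_high) - att(d_l, slope_high)

# ATT(d_l|d_h) - ATT(d_l|d_l): the selection bias term
selection_bias = att(d_l, slope_high) - att(d_l, slope_low)
```

Here the naive difference is 5, of which 3 is the causal response and 2 is selection bias, even though nothing in the example violates standard parallel trends.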

Interpretation Issues

Intuition:

Difference-in-differences identification strategies result in \(ATT(d|d)\) parameters. These are local parameters and difficult to compare to each other
This explanation is similar to thinking about LATEs with two different instruments
Thus, comparing \(ATT(d|d)\) across different values is tricky and not for free

What can you do?

One idea, just recover \(ATT(d|d)\) and interpret it cautiously (interpret it by itself not relative to different values of \(d\))
If you want to compare them to each other, it will come with the cost of additional (structural) assumptions

Introduce Stronger Assumptions

“Strong” Parallel Trends Assumption

For all doses \(d\) and \(l\),

\[\mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D=l] = \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D=d]\]

This is notably different from “Standard” Parallel Trends
It involves potential outcomes for all values of the dose (not just untreated potential outcomes)
All dose groups would have experienced the same path of outcomes had they been assigned the same dose

Introduce Stronger Assumptions

Strong parallel trends is equivalent to a certain restriction on treatment effect heterogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d] \hspace{200pt} \\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=d] \\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=l] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=l] \\ &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=l] = ATT(d|l) \end{aligned} \]

Since this holds for all \(d\) and \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{t=2}(d) - Y_{t=2}(0)]\). Thus, under strong parallel trends, we have that

\[ATE(d) = \E[\Delta Y|D=d] - \E[\Delta Y|D=0]\]

RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” parallel trends, but here

assumptions are different
parameter interpretation is different

Comparisons across dose

ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t=2}(d_h) - Y_{t=2}(0)] - \E[Y_{t=2}(d_l) - Y_{t=2}(0)]\\ &= \underbrace{\E[Y_{t=2}(d_h) - Y_{t=2}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]

Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption

Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting

\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^o := \E\Big[ACR(D) \Big| D>0\Big] \end{aligned} \]

Additional Comments

Can you relax strong parallel trends?
Positive side-comment: No untreated units
Positive side-comment: Binarizing the Treatment
Negative side-comment: Pre-testing

Summarizing

It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose

However, comparisons of ATT-type parameters across different doses are hard to interpret

They include selection bias terms
This issue extends to identifying ACRT parameters
These issues extend to TWFE regressions

This suggests targeting ATE-type parameters

Comparisons across doses do not contain selection bias terms
But identifying ATE-type parameters requires stronger assumptions

3. Interpreting TWFE Regressions

TWFE Regressions in this Context

Consider the same TWFE regression (but now \(D_{it}\) is continuous): \[\begin{align*} Y_{it} = \theta_t + \eta_i + \beta^{twfe} D_{it} + e_{it} \end{align*}\] We show that \[\begin{align*} \beta^{twfe} = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}\] where \(m_\Delta(l) := \E[\Delta Y|D=l] - \E[\Delta Y|D=0]\) and \(w(l)\) are weights

Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)
Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).

About the weights: they are all positive, but they have some strange properties; e.g., they are always maximized at \(l = \E[D]\), even if this is not a common value for the dose

\(\implies\) even under strong parallel trends, \(\beta^{twfe} \neq ACR^o\).
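
To see the weighting at work, here is a hedged Python sketch for a dose with finitely many values. It uses a discrete analogue of the integral above (a Yitzhaki-type decomposition): the weight on the slope between adjacent doses \(d_{j-1} < d_j\) is taken to be \((d_j - d_{j-1})(\E[D|D \geq d_j] - \E[D])\,\P(D \geq d_j)/\var(D)\). This formula, the simulated data, and the function names are assumptions of the sketch rather than something stated on the slides.

```python
import numpy as np

def twfe_weights(dose):
    """Weight on each adjacent-dose slope implied by regressing ΔY on D."""
    dose = np.asarray(dose, dtype=float)
    support = np.unique(dose)            # sorted support points d_0 < d_1 < ...
    mean_d, var_d = dose.mean(), dose.var()
    weights = []
    for j in range(1, len(support)):
        at_or_above = dose >= support[j]
        gap = support[j] - support[j - 1]
        weights.append(gap * (dose[at_or_above].mean() - mean_d) * at_or_above.mean() / var_d)
    return support, np.array(weights)

def incremental_slopes(delta_y, dose):
    """Slope of mean(ΔY | D=d) between adjacent support points of the dose."""
    delta_y = np.asarray(delta_y, dtype=float)
    dose = np.asarray(dose, dtype=float)
    support = np.unique(dose)
    means = np.array([delta_y[dose == d].mean() for d in support])
    return np.diff(means) / np.diff(support)

# Hypothetical data: 20% untreated, concave dose response.
rng = np.random.default_rng(1)
dose = rng.choice([0.0, 1.0, 2.0, 3.0, 4.0], size=5_000, p=[0.2, 0.4, 0.2, 0.1, 0.1])
delta_y = 0.5 + 2.0 * np.sqrt(dose) + rng.normal(size=dose.size)

support, weights = twfe_weights(dose)
slopes = incremental_slopes(delta_y, dose)
beta_from_weights = float(weights @ slopes)
beta_ols = float(np.cov(delta_y, dose, ddof=0)[0, 1] / dose.var())
```

In this sketch the weighted sum of slopes reproduces the regression coefficient, and the weights are positive and sum to one, but they depend only on the distribution of the dose and not on how common each dose is in the way a density-weighted average would.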

TWFE Regressions in this Context

Other issues can arise in more complicated cases

For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment
In general, things get worse for TWFE regressions with more complications

Estimation - What should you do?

Level Effects - no issues related to selection bias

For \(ATT^o\): Binarize treatment, \(ATT^o = \E[\Delta Y | D > 0] - \E[\Delta Y | D=0]\).
For \(ATT(d|d)\): Nonparametrically estimate \(m_\Delta(d) = \E[\Delta Y|D=d]-\E[\Delta Y|D=0]\)
- This is not actually too hard to estimate. No curse-of-dimensionality, etc.

Slope Effects - must deal with selection bias

Nonparametrically estimate derivative of \(m_\Delta(d)\)
For \(ACR(d)\): Under strong parallel trends, derivative is equal to \(ACR(d)\)
For \(ACR^o\): Average \(ACR(D)\) over \(D>0\); i.e., \(ACR^o = \E[ACR(D)|D>0]\)

Additional Comments:

Changing the estimation strategy helps with the weights, but it does not fix the issues related to standard vs. strong parallel trends
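
The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: a cubic polynomial stands in for the nonparametric estimator of \(m_\Delta(d)\) (the paper's actual estimator may differ), the data are simulated, and the function names are hypothetical. Interpreting the derivative as \(ACR(d)\) requires strong parallel trends, as noted above.

```python
import numpy as np
from numpy.polynomial import Polynomial

def att_overall(delta_y, dose):
    """Binarized DID: mean(ΔY | D>0) - mean(ΔY | D=0)."""
    return float(delta_y[dose > 0].mean() - delta_y[dose == 0].mean())

def fit_dose_response(delta_y, dose, degree=3):
    """Fit m_Δ(d) on treated units with a polynomial (stand-in for a nonparametric fit)."""
    treated = dose > 0
    centered = delta_y[treated] - delta_y[dose == 0].mean()
    return Polynomial.fit(dose[treated], centered, deg=degree)

# Hypothetical data: about 20% untreated, continuous dose on [0.5, 4] otherwise.
rng = np.random.default_rng(2)
n = 20_000
untreated = rng.random(n) < 0.2
dose = np.where(untreated, 0.0, rng.uniform(0.5, 4.0, size=n))
delta_y = 0.5 + 2.0 * np.sqrt(dose) + rng.normal(0.0, 1.0, size=n)

att_o = att_overall(delta_y, dose)             # level effect, aggregated
m_hat = fit_dose_response(delta_y, dose)       # estimate of m_Δ(d)
acr_hat = m_hat.deriv()                        # estimated derivative of m_Δ(d)
acr_o = float(acr_hat(dose[dose > 0]).mean())  # average derivative over treated units' doses
```

Averaging the estimated derivative over the observed doses of treated units uses the dose distribution itself as the weighting scheme, rather than the weights implied by the TWFE regression.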

It’s relatively straightforward to extend this strategy to settings with multiple periods and variation in treatment timing by extending existing work about a staggered, binary treatment

4. Empirical Application

Empirical Application

This is a simplified version of Acemoglu and Finkelstein (2008)

1983 Medicare reform that eliminated labor subsidies for hospitals

Medicare moved to the Prospective Payment System (PPS), which replaced “full cost reimbursement” with “partial cost reimbursement”. This eliminated reimbursements for labor while maintaining reimbursements for capital expenses
Rough idea: This changes relative factor prices which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.
In the paper, we provide some theoretical arguments concerning properties of production functions that suggest that strong parallel trends holds.

Data

Annual hospital-reported data from the American Hospital Association, 1980-1986

Outcome is capital/labor ratio

proxy using the depreciation share of total operating expenses (avg. 4.5%)
our setup: collapse to two periods by taking average in pre-treatment periods and average in post-treatment periods

Dose is “exposure” to the policy

the fraction of Medicare patients in the period before the policy was implemented
roughly 15% of hospitals are untreated (have essentially no Medicare patients)
- Acemoglu and Finkelstein (AF) provide results both using and not using these hospitals: (good) it is useful to have untreated hospitals, but (bad) they are fairly different from other hospitals (this group includes federal, long-term, psychiatric, children’s, and rehabilitation hospitals)

Bin Scatter

ATE(T) Plot

\(\widehat{ATT}^o = 0.80~~(\textrm{s.e.}=0.05)\)

ACR(T) Plot

Results

Density weights vs. TWFE weights

TWFE Weights with and without Untreated Group

Conclusion

There are a number of challenges to implementing/interpreting DID with a continuous treatment
The extension to multiple periods and variation in treatment timing is relatively straightforward (at least proceeds along the “expected” lines)
But (in my view) the main new issue here is that justifying interpreting comparisons across different doses as causal effects requires stronger assumptions than most researchers probably think that they are making
Link to paper: https://arxiv.org/abs/2107.02637
Other Summaries: (i) Five minute summary (ii) Pedro’s Twitter
Comments welcome: brantly.callaway@uga.edu
Code: in progress

Appendix

References

Ager, Philipp, Katherine Eriksson, Ezra Karger, Peter Nencka, and Melissa A Thomasson. 2024. “School Closures During the 1918 Flu Pandemic.” Review of Economics and Statistics 106 (1): 266–76.

Borusyak, Kirill, Xavier Jaravel, and Jann Spiess. 2024. “Revisiting Event-Study Designs: Robust and Efficient Estimation.” Review of Economic Studies, rdae007.

Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.

de Chaisemartin, Clement, and Xavier D’Haultfœuille. 2018. “Fuzzy Differences-in-Differences.” The Review of Economic Studies 85 (2): 999–1028.

———. 2020. “Two-Way Fixed Effects Estimators with Heterogeneous Treatment Effects.” American Economic Review 110 (9): 2964–96.

Dube, Arindrajit, Daniele Girardi, Òscar Jordà, and Alan M Taylor. 2023. “A Local Projections Approach to Difference-in-Differences Event Studies.”

Gardner, John, Neil Thakral, Linh T Tô, and Luther Yap. 2023. “Two-Stage Differences in Differences.”

Gillitzer, Christian, and Nalini Prasad. 2023. “The Effect of School Closures on Standardized Test Scores: Evidence from a Zero-COVID Environment.”

Goodman-Bacon, Andrew. 2021. “Difference-in-Differences with Variation in Treatment Timing.” Journal of Econometrics 225 (2): 254–77.

Sun, Liyang, and Sarah Abraham. 2021. “Estimating Dynamic Treatment Effects in Event Studies with Heterogeneous Treatment Effects.” Journal of Econometrics 225 (2): 175–99.

Wooldridge, Jeff. 2021. “Two-Way Fixed Effects, the Two-Way Mundlak Regression, and Difference-in-Differences Estimators.”

Can you relax strong parallel trends?

Some ideas:

It could be reasonable to assume that you know the sign of the selection bias. This can lead to (possibly) informative bounds on differences/derivatives/etc. between \(ATT(d|d)\) parameters

If interest is mainly in \(ACRT\)’s, then can weaken the assumption to something like “local” strong parallel trends

Strong parallel trends may be more plausible after conditioning on some covariates.
- For length of school closure, strong parallel trends probably more plausible conditional on being a rural county in the Southeast or conditional on being a college town in the Midwest.

Positive Side-Comments: No untreated units

It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.

In this case, it is not possible to recover level effects like \(ATT(d|d)\).
However, notice that \[\begin{aligned}& \E[\Delta Y | D=d_h] - \E[\Delta Y | D=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y | D=d_h] - \E[\Delta Y(0) | D=d_h]\Big) - \Big(\E[\Delta Y | D=d_l]-\E[\Delta Y(0) | D=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]
In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.
Still face issues related to selection bias / strong parallel trends though
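
A short Python sketch of this comparison, on simulated data in which every unit receives a positive dose. The data-generating process and the function name `att_difference` are assumptions for illustration. Treatment effects are the same in every dose group here, so the selection bias term is zero by construction, which would not be guaranteed in practice.

```python
import numpy as np

def att_difference(delta_y, dose, d_high, d_low):
    """mean(ΔY | D=d_high) - mean(ΔY | D=d_low); no untreated group is needed."""
    delta_y = np.asarray(delta_y, dtype=float)
    dose = np.asarray(dose, dtype=float)
    return float(delta_y[dose == d_high].mean() - delta_y[dose == d_low].mean())

# Hypothetical data with no untreated units.
rng = np.random.default_rng(3)
n = 30_000
dose = rng.choice([1.0, 2.0, 3.0], size=n)
delta_y = 0.5 + 2.0 * np.sqrt(dose) + rng.normal(0.0, 1.0, size=n)

diff_3_vs_1 = att_difference(delta_y, dose, d_high=3.0, d_low=1.0)
```

The result estimates \(ATT(3|3) - ATT(1|1)\), but neither level effect can be recovered separately without an untreated group.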

Positive Side-Comments: Alternative approaches

Strategies like binarizing the treatment can still work (though be careful!)

If you classify units as being treated or untreated, you can recover the \(ATT^o\) by comparing mean outcome for treated relative to untreated.
On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”

Negative Side-Comment: Pre-testing

That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same

In effect, the only testable implication of strong parallel trends in pre-treatment periods is standard parallel trends.

Issues with TWFE Regressions

TWFE

The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:

\[Y_{it} = \theta_t + \eta_i + \beta^{twfe} D_{it} + v_{it}\] In baseline case (two periods, no one treated in first period), this is just

\[\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i\]

\(\beta^{twfe}\) often (loosely) interpreted as some kind of average causal response (i.e., slope effect) parameter
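
The equivalence between the TWFE regression and the first-difference regression in the baseline case can be verified numerically. The sketch below is a minimal Python check on simulated data; the data and the function names are assumptions for illustration. TWFE is computed by two-way demeaning, which is valid for a balanced panel.

```python
import numpy as np

def twfe_slope(y, d):
    """TWFE coefficient on d for a balanced panel, via two-way demeaning.

    y and d are arrays of shape (n_units, n_periods).
    """
    def demean(x):
        return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()
    y_t, d_t = demean(y), demean(d)
    return float((d_t * y_t).sum() / (d_t ** 2).sum())

def first_difference_slope(y, d):
    """Slope from regressing ΔY on ΔD with an intercept (two periods)."""
    dy = y[:, 1] - y[:, 0]
    dd = d[:, 1] - d[:, 0]
    return float(np.polyfit(dd, dy, deg=1)[0])

# Hypothetical two-period panel: nobody is treated in period 1.
rng = np.random.default_rng(4)
n = 500
dose = np.where(rng.random(n) < 0.3, 0.0, rng.uniform(0.5, 4.0, size=n))
unit_effect = rng.normal(size=n)
y1 = unit_effect + rng.normal(size=n)
y2 = unit_effect + 0.5 + 2.0 * np.sqrt(dose) + rng.normal(size=n)

y = np.column_stack([y1, y2])
d = np.column_stack([np.zeros(n), dose])

beta_twfe = twfe_slope(y, d)
beta_fd = first_difference_slope(y, d)
```

The two coefficients agree up to floating-point error, so results about the first-difference regression of \(\Delta Y_i\) on \(D_i\) carry over to \(\beta^{twfe}\) in this baseline setup.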

Interpreting \(\beta^{twfe}\)

In the paper, we show that

Under Standard Parallel Trends:

\[\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl\]
- \(w_1(l)\) are positive weights that integrate to 1
- \(ACRT(l|l)\) is average causal response conditional on \(D=l\)
- \(\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}\) is a local selection bias term

Interpreting \(\beta^{twfe}\)

In the paper, we show that

Under Strong Parallel Trends:

\[\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) ACR(l) \, dl\]
- \(w_1(l)\) are same weights as before
- \(ACR(l)\) is average causal response to dose \(l\) across entire population
- there is no selection bias term

What does this mean?

Issue #1: Selection bias terms that show up under standard parallel trends

\(\implies\) to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions

Issue #2: Weights
- They are all positive
- But this is a very minimal requirement for weights being “reasonable”
- These weights have “strange” properties: (i) they are affected by the size of the untreated group, and (ii) they are maximized at \(d=\E[D]\).
- Example 1: Mixture of Normals Dose; Example 2: Exponential Dose

Application: Medicare Reform and Capital/Labor Ratios

We consider a simplified version of an application from Acemoglu and Finkelstein (2008) who study the effects of a 1983 Medicare reform that (roughly) eliminated hospital reimbursements for labor

The continuous treatment is differential exposure to the policy based on the fraction of Medicare patients that the hospital had before the policy was implemented
Idea: policy changes relative factor prices which suggests hospitals may adjust by changing their input mix.

Results:

We provide conditions on production functions that rationalize invoking (or not invoking) strong parallel trends
Our estimate of \(ACR(d)\) (or \(ACRT(d|d)\)) is fairly heterogeneous across dose \(\implies\) the results are quite sensitive to the (implicit) TWFE weighting scheme

Ex. Mixture of Normals Dose

Ex. Exponential Dose

More General Case

Multiple periods, variation in treatment timing

Setup

Staggered treatment adoption
- If you are treated today, you will continue to be treated tomorrow
- Note: this is relatively straightforward to relax; it just makes the notation more complex
- Can allow for treatment anticipation too, but ignoring for simplicity now
- Once a unit becomes treated, its dose remains constant (could probably relax this too)

Setup

Additional Notation:
- \(G_i\) — a unit’s “group” (the time period when unit becomes treated)
- Potential outcomes \(Y_{it}(g,d)\) — the outcome unit \(i\) would experience in time period \(t\) if they became treated in period \(g\) with dose \(d\)
- \(Y_{it}(0)\) is the potential outcome corresponding to not being treated in any period

Parameters of Interest

Level Effects:

\[ ATT(g,t,d|g,d) := \E[Y_t(g,d) - Y_t(0) | G=g, D=d] \ \ \ \textrm{and} \ \ \ ATE(g,t,d) := \E[Y_t(g,d) - Y_t(0) ]\]

Slope Effects:

\[ACRT(g,t,d|g,d) := \frac{\partial ATT(g,t,l|g,d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACR(g,t,d) := \frac{\partial ATE(g,t,d)}{\partial d}\]

Parameters of Interest

These essentially inherit all the same issues as in the two period case

Under a multi-period version of “standard” parallel trends, comparisons of \(ATT\) across different values of dose are hard to interpret
- They contain selection bias terms

Under a multi-period version of “strong” parallel trends, comparisons of \(ATE\) across different values of dose are straightforward to interpret
- But this involves a much stronger assumption

Expressions in remainder of talk are under “strong” parallel trends

Under “standard” parallel trends, add selection bias terms everywhere

Parameters of Interest

Often, these are high-dimensional and it may be desirable to “aggregate” them

Average by group (across post-treatment time periods) and then across groups

\(\rightarrow\) \(ACR^{overall}(d)\) (overall average causal response for particular dose)

Average \(ACR^{overall}(d)\) across dose

\(\rightarrow\) \(ACR^o\) (this is just one number) and is likely to be the parameter that one would be targeting in a TWFE regression

Event study: average across groups who have been exposed to treatment for \(e\) periods

\(\rightarrow\) For fixed \(d\)

\(\rightarrow\) Average across different values of \(d\) \(\implies\) typical looking ES plot

TWFE in More General Case

TWFE Regression

Consider the same TWFE regression as before

\[Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}\]

Running Example

How should \(\beta^{twfe}\) be interpreted?

We show in the paper that \(\beta^{twfe}\) is a weighted average of the following terms:

\[\delta^{WITHIN}(g) = \frac{\textrm{cov}(\bar{Y}^{POST(g)} - \bar{Y}^{PRE(g)}, D | G=g)}{\textrm{var}(D | G=g)}\]

Comes from within-group variation in the amount of dose
This term is essentially the same as in the baseline case and corresponds to a reasonable treatment effect parameter under strong parallel trends
Like baseline case, (after some manipulations) this term corresponds to a “derivative”/“ACR”
Does not show up in the binary treatment case because there is no variation in amount of treatment

How should \(\beta^{twfe}\) be interpreted?

\(\beta^{twfe}\) weighted average, term 2 of 4

For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),

\[\delta^{MID,PRE}(g,k) = \frac{\E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=k \right]}{\E[D|G=g]}\]

Comes from comparing path of outcomes for a group that becomes treated (group \(g\)) relative to a not-yet-treated group (group \(k\))
Corresponds to a reasonable treatment effect parameter under strong parallel trends
Denominator (after some derivations) ends up giving this a “derivative”/“ACR” interpretation
Similar terms show up in the case with a binary treatment

\(\beta^{twfe}\) weighted average, term 3 of 4

For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),

\[ \begin{aligned} \delta^{POST,MID}(g,k) &= \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | G=k\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | D=0 \right]}{\E[D|G=k]} \\ &- \left(\frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | D=0 \right]}{\E[D|G=k]} \right.\\ & \hspace{25pt} - \left.\frac{\E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | D=0 \right]}{\E[D|G=k]} \right) \end{aligned} \]

The term in parentheses captures treatment effect dynamics for group \(g\), so that

\[ \begin{aligned} \delta^{POST,MID}(g,k) &= \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | G=k\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | D=0 \right]}{\E[D|G=k]} \\ &- \textrm{Treatment Effect Dynamics for Group g} \end{aligned} \]

Comes from comparing path of outcomes for a group that becomes treated (group \(k\)) to paths of outcomes of an already treated group (group \(g\))
In the presence of treatment effect dynamics (these are not ruled out by any parallel trends assumption), this term is problematic
This is similar-in-spirit to the problematic terms for TWFE with a binary treatment


\(\beta^{twfe}\) weighted average, term 4 of 4

For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),

\[ \begin{aligned} \delta^{POST,PRE}(g,k) = \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=k \right]}{\E[D|G=g] - \E[D|G=k]} \end{aligned} \]

Comes from comparing path of outcomes for groups \(g\) and \(k\) in their common post-treatment periods relative to their common pre-treatment periods
In the presence of heterogeneous causal responses (causal response in same time period differs across groups), this term ends up being (partially) problematic too
Only shows up when \(\E[D|G=g] \neq \E[D|G=k]\)
No analogue in the binary treatment case

Summary of TWFE Issues

Issue #1: Selection bias terms that show up under standard parallel trends

\(\implies\) to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions

Issue #2: Weights
- Negative weights possible due to (i) treatment effect dynamics or (ii) heterogeneous causal responses across groups
- Are (undesirably) driven by estimation method

Weights issues can be solved by carefully choosing desirable comparisons and appropriate user-chosen weights

Selection bias terms are a more fundamental challenge
