\newcommand{\E}{\mathbb{E}}
There has been a lot of recent work/interest in DID!
A number of papers have diagnosed issues with the two-way fixed effects (TWFE) regressions that are very commonly used to implement DID
Summary of Issues:
Already-treated groups sometimes serve as comparison group \implies treatment effect dynamics can lead to very poor estimates of treatment effects
Weights on underlying parameters are driven by estimation method
There have also been a number of papers fixing these issues
Callaway and Sant'Anna (2020), Cengiz, Dube, Lindner, and Zipperer (2019), Gardner (2021)
+ previous papers
Basic idea:
Explicitly make "good" comparisons and omit "bad" comparisons
Choose your own weights \implies can recover overall ATT, event studies, or other target parameters of interest
These papers have (largely) focused on the case with a binary, staggered treatment
But there is considerable demand for understanding DID with more general treatments
Current paper: Generalize binary treatment case to multi-valued or continuous treatment ("dose")
Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}
Setup:
Treatment "continuous enough" that researcher would estimate above model rather than include a sequence of dummy variables
Researchers often interpret \beta^{twfe} as an average causal response
Similar issues as in binary treatment literature related to regression (TWFE) estimation strategies when the treatment is multi-valued and/or continuous
Already treated units serve as comparison group \implies poor estimates of treatment effect parameters in the presence of treatment effect dynamics
TWFE estimate is a weighted average of underlying treatment parameters, but weights driven by estimation method
(this one is new) Heterogeneous causal effects of dose across timing-groups can lead to poor estimates (negative weights)
As in the case with a staggered, binary treatment, we can fix all of these by
Carefully making desirable comparisons
Choosing our own weights
However, there are new issues related to interpreting differences between treatment effects at different doses as causal effects
Intuition: "Standard" DID delivers ATT-type parameters.
These are local to a specific dose
\implies Comparisons across different doses include both:
The causal effect of more dose
"Selection bias" terms
Getting rid of these selection bias terms requires additional assumptions that are likely to be substantially stronger in practice
No easy fixes here!
\implies (at least in some sense), this is more negative than previous papers
Brand new paper
Not 100% complete
No application
No code
Comments/suggestions/etc. more than welcome
Baseline Case: Two periods, no one treated in first period
TWFE in Baseline Case
More General Case: Multiple periods, variation in treatment timing
TWFE in More General Case
Potential outcomes notation
Two time periods: t and t-1
No one treated until period t
Some units remain untreated in period t
Potential outcomes: Y_{it}(d)
Observed outcomes: Y_{it} and Y_{it-1}
Y_{it}=Y_{it}(D_i) \quad \textrm{and} \quad Y_{it-1}=Y_{it-1}(0)
Level Effects (Average Treatment Effect on the Treated)
ATT(d|d) := \E[Y_t(d) - Y_{t}(0) | D=d]
Interpretation: The average effect of dose d relative to not being treated local to the group that actually experienced dose d
This is the natural analogue of ATT in the binary treatment case
Slope Effect (Average Causal Responses)
ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACRT^O := \E[ACRT(D|D)|D>0]
Interpretation: ACRT(d|d) is the causal effect of a marginal increase in dose local to units that actually experienced dose d
ACRT^O averages ACRT(d|d) over the population distribution of the dose among units with D>0
Level Effects (Average Treatment Effect on the Treated)
ATT(d|d) := \E[Y_t(d) - Y_{t}(0) | D=d]
Slope Effect (Average Causal Responses)
ACRT(d_j|d_j) := ATT(d_j|d_j) - ATT(d_{j-1}|d_j)
Interestingly: In the case with a binary treatment, ACRT(1|1) = ATT
\implies In binary treatment case, ATT is both a slope and level effect
"Standard" Parallel Trends Assumption: for all d,
\E[\Delta Y_t(0) | D=d] = \E[\Delta Y_t(0) | D=0]
Then,
\begin{aligned}
ATT(d|d) &= \E[Y_t(d) - Y_t(0) | D=d] \\
&= \E[Y_t(d) - Y_{t-1}(0) | D=d] - \E[Y_t(0) - Y_{t-1}(0) | D=d] \\
&= \E[Y_t(d) - Y_{t-1}(0) | D=d] - \E[\Delta Y_t(0) | D=0] \\
&= \E[\Delta Y_t | D=d] - \E[\Delta Y_t | D=0]
\end{aligned}
This is exactly what you would expect
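As a rough numerical check of this identification result, the sketch below simulates a two-period dataset in which standard parallel trends holds by construction and compares the dose-specific DID estimate with the true ATT(d|d). The data-generating process, the dose values {0, 1, 2}, and the helper name `did_by_dose` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP: three dose groups, unit effects correlated with dose.
D = rng.choice([0.0, 1.0, 2.0], size=n)      # observed dose D_i
y_pre = 0.5 * D + rng.normal(size=n)         # Y_{t-1}(0): levels differ across dose groups
trend = 1.0 + rng.normal(size=n)             # Delta Y_t(0): same mean for every dose group
gain = (1.0 + 0.5 * D) * D                   # Y_t(D) - Y_t(0) at the unit's own dose
y_post = y_pre + trend + gain                # observed Y_t = Y_t(D)
dy = y_post - y_pre                          # observed Delta Y_t

def did_by_dose(dy, D, d):
    """E[Delta Y | D=d] - E[Delta Y | D=0], estimated with sample means."""
    return dy[D == d].mean() - dy[D == 0].mean()

# True ATT(d|d) implied by the assumed gain function.
true_att = {1.0: 1.5, 2.0: 4.0}
estimates = {d: did_by_dose(dy, D, d) for d in (1.0, 2.0)}

for d in (1.0, 2.0):
    print(f"dose {d}: DID estimate {estimates[d]:.3f}, true ATT(d|d) {true_att[d]:.1f}")
```

Levels differ across dose groups in this simulation, but the untreated trend does not, which is all that the assumption requires.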
Unfortunately, no: identifying ATT(d|d) is not the end of the story
Most applied work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose
For example, plot treatment effects as a function of dose
Average causal response parameters inherently involve comparisons across slightly different doses
Consider comparing ATT(d|d) for two different doses
\begin{aligned}
& ATT(d_h|d_h) - ATT(d_l|d_l) \\
& \hspace{25pt} = \underbrace{\E[Y_t(d_h) - Y_t(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}}
\end{aligned}
"Standard" Parallel Trends is not strong enough to rule out the selection bias terms here
Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends
This problem spills over into identifying ACRT(d|d)
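A small worked example may make the decomposition concrete. The function `att(l, h)` below is a purely hypothetical stand-in for ATT(l|h) in which units that selected higher doses also benefit more per unit of dose; the numbers are made up for illustration.

```python
def att(l, h):
    """Hypothetical ATT(l|h): effect of dose l for units that actually chose dose h."""
    return (1.0 + 0.5 * h) * l

d_l, d_h = 1.0, 2.0

# What a comparison of DID estimates across doses recovers.
naive = att(d_h, d_h) - att(d_l, d_l)

# Causal response: E[Y_t(d_h) - Y_t(d_l) | D = d_h].
causal = att(d_h, d_h) - att(d_l, d_h)

# Selection bias: same dose d_l, evaluated for two different dose groups.
selection = att(d_l, d_h) - att(d_l, d_l)

print(naive, causal, selection)
```

Under these assumed numbers the comparison across doses overstates the causal response of moving from d_l to d_h, because the high-dose group would also have gained more from the low dose.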
Level Effects
ATE(d) := \E[Y_t(d) - Y_t(0)]
Slope Effects (Average Causal Responses)
\begin{aligned}
ACR(d) := \frac{\partial ATE(d)}{\partial d} \ \ \ \ &\textrm{or} \ \ \ \ ACR(d_j) := ATE(d_j) - ATE(d_{j-1}) \\
& \textrm{or} \ \ \ ACR^O := \E[ACR(D) | D>0]
\end{aligned}
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose
\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_t(d_h) - Y_t(0)] - \E[Y_t(d_l) - Y_t(0)]\\ &= \underbrace{\E[Y_t(d_h) - Y_t(d_l)]}_{\textrm{Causal Response}} \end{aligned}
Unfortunately, the "Standard" Parallel Trends Assumption is not strong enough to identify ATE(d).
"Strong" Parallel Trends Assumption: for all d,
\E[Y_t(d) - Y_{t-1}(0)] = \E[Y_t(d) - Y_{t-1}(0) | D=d]
Under Strong Parallel Trends, it is straightforward to show that
ATE(d) = \E[\Delta Y_t | D=d] - \E[\Delta Y_t|D=0]
RHS is exactly the same expression as for ATT(d|d) under "standard" parallel trends, but here
assumptions are different
parameter interpretation is different
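For completeness, one way to spell out the identification argument for ATE(d) is sketched below. It assumes that the "for all d" in Strong Parallel Trends includes d=0, which is not stated explicitly here. The third line applies Strong Parallel Trends once at dose d and once at dose 0; the last line uses Y_t = Y_t(D) and Y_{t-1} = Y_{t-1}(0).

```latex
\begin{aligned}
ATE(d) &= \E[Y_t(d) - Y_t(0)] \\
&= \E[Y_t(d) - Y_{t-1}(0)] - \E[Y_t(0) - Y_{t-1}(0)] \\
&= \E[Y_t(d) - Y_{t-1}(0) | D=d] - \E[Y_t(0) - Y_{t-1}(0) | D=0] \\
&= \E[\Delta Y_t | D=d] - \E[\Delta Y_t | D=0]
\end{aligned}
```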
This is notably different from "Standard" Parallel Trends
One can show that it is not strictly stronger than Standard Parallel Trends
It is also slightly weaker than assuming
ATE(d) = ATT(d|d) (this is a form of treatment effect homogeneity)
All dose groups would have experienced the same path of outcomes had they been assigned the same dose
It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose
However, comparisons of ATT-type parameters across different doses are hard to interpret
They include selection bias terms
This issue extends to identifying ACRT parameters
This suggests targeting ATE-type parameters
Comparisons across doses do not contain selection bias terms
But identifying ATE-type parameters requires stronger assumptions
The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:
Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Post_t + v_{it}
In the baseline case (two periods, no one treated in the first period), this is just
\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i
\beta^{twfe} often loosely interpreted as Average Causal Response
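The equivalence between the TWFE regression and the first-difference regression in the baseline case can be checked numerically. The sketch below uses simulated data (the DGP is an arbitrary assumption) and computes the TWFE coefficient by two-way demeaning, which is valid for a balanced panel by the Frisch-Waugh-Lovell theorem, then compares it with the slope from regressing \Delta Y_i on D_i.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical doses: a quarter of units untreated, the rest with a continuous dose.
D = rng.uniform(0.0, 3.0, size=n)
D[: n // 4] = 0.0

y_pre = rng.normal(size=n)
y_post = y_pre + 1.0 + 0.8 * D + rng.normal(size=n)

# Balanced two-period panel stored as (unit, period) arrays.
Y = np.column_stack([y_pre, y_post])
X = np.column_stack([np.zeros(n), D])        # regressor D_i * Post_t

def two_way_demean(A):
    """Remove unit and period means (adds back the grand mean)."""
    return A - A.mean(axis=1, keepdims=True) - A.mean(axis=0, keepdims=True) + A.mean()

Yd, Xd = two_way_demean(Y), two_way_demean(X)
beta_twfe = (Xd * Yd).sum() / (Xd ** 2).sum()

# First-difference regression of Delta Y_i on D_i with an intercept.
dy = y_post - y_pre
beta_fd = np.cov(dy, D)[0, 1] / np.var(D, ddof=1)

print(beta_twfe, beta_fd)
```

The two coefficients agree up to floating-point error, so in this baseline case \beta^{twfe} is simply cov(\Delta Y, D) / var(D).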
In the paper, we show that
Under Standard Parallel Trends:
\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl + w_0 \frac{ATT(d_L|d_L)}{d_L}
w_1(l) and w_0 are positive weights that integrate to 1
ACRT(l|l) is average causal response conditional on D=l
\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} is a local selection bias term
\frac{ATT(d_L|d_L)}{d_L} is the causal effect of going from no dose to the smallest possible dose (conditional on D=d_L)
Under Strong Parallel Trends:
\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) ACR(l) \, dl + w_0 \frac{ATE(d_L)}{d_L}
w_1(l) and w_0 are same weights as before
ACR(l) is average causal response to dose l across entire population
there is no selection bias term
\frac{ATE(d_L)}{d_L} is the causal effect of going from no dose to the smallest possible dose (across entire population)
Issue #1: Selection bias terms that show up under standard parallel trends
\implies to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
Issue #2: Weights
They are all positive
But this is a very minimal requirement for weights being "reasonable"
These weights have the "strange" property that they are maximized at d=\E[D].
Issue #3: Pre-testing
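To illustrate the point under Issue #2 that the weights peak at d=\E[D], the sketch below evaluates a Yitzhaki-type weight function on a grid. The functional form w_1(l) = (\E[D|D \geq l] - \E[D]) P(D \geq l) / var(D) is an assumption here, since the formula for the weights is not given in this talk, and the gamma dose distribution (continuous, with no mass at zero) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
D = rng.gamma(shape=2.0, scale=1.0, size=5_000)   # skewed hypothetical dose distribution

def yitzhaki_weight(l, D):
    """(E[D | D >= l] - E[D]) * P(D >= l) / var(D), written as a single sample mean."""
    return np.mean((D - D.mean()) * (D >= l)) / D.var()

grid = np.arange(0.0, D.max(), 0.01)
weights = np.array([yitzhaki_weight(l, D) for l in grid])
l_star = grid[weights.argmax()]                   # dose at which the weight is largest

print(f"weight-maximizing dose: {l_star:.2f}, mean dose: {D.mean():.2f}")
```

Under this assumed form, the dose that receives the most weight is the mean dose, regardless of where most units actually sit in a skewed dose distribution.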
Either (i) report ATT(d|d) directly and interpret carefully, or (ii) be aware (and think through) that \beta^{twfe}, comparisons across d, or average causal response parameters all require imposing stronger assumptions
With regard to weights, there are likely better options for estimating causal effect parameters
Step 1: Nonparametrically estimate ACR(d) = \frac{\partial \E[\Delta Y | D=d]}{\partial d}
Side-comment: This is not actually too hard to estimate. No curse-of-dimensionality, etc.
Step 2: Estimate ACR^O = \E[ACR(D)|D>0].
These do not get around the issue of requiring a stronger assumption
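The two steps can be sketched as follows. This is only an illustration: a cubic polynomial fit stands in for a genuinely nonparametric estimator, the data are simulated from an assumed DGP in which ACR(d) = 2d, and the interpretation as an average causal response still relies on strong parallel trends.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical data: 30% untreated, treated doses uniform on [0.5, 2.5].
treated = rng.random(n) < 0.7
D = np.where(treated, rng.uniform(0.5, 2.5, size=n), 0.0)
dy = 1.0 + D ** 2 + rng.normal(scale=0.5, size=n)   # observed Delta Y

pos = D > 0

# Step 1: estimate E[Delta Y | D=d] among treated units and differentiate it.
# (Subtracting E[Delta Y | D=0] would only shift the level, not the derivative.)
coefs = np.polyfit(D[pos], dy[pos], deg=3)           # cubic series approximation
acr_hat = np.polyval(np.polyder(coefs), D[pos])      # estimated ACR(d) at each treated dose

# Step 2: average the estimated ACR(d) over the distribution of D given D > 0.
acr_o_hat = acr_hat.mean()

print(f"estimated ACR^O: {acr_o_hat:.3f}, value implied by the DGP: {2 * D[pos].mean():.3f}")
```

Because the averaging in Step 2 uses the empirical distribution of the dose, the weights on ACR(d) are chosen by the researcher rather than by the regression.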
Staggered treatment adoption
If you are treated today, you will continue to be treated tomorrow
Note: this is relatively straightforward to relax; it just makes the notation more complex
Can allow for treatment anticipation too, but ignoring it for simplicity now
Once a unit becomes treated, its dose remains constant (could probably relax this too)
Additional Notation:
G_i -- a unit's "group" (the time period when unit becomes treated)
Potential outcomes Y_{it}(g,d) -- the outcome unit i would experience in time period t if they became treated in period g with dose d
Y_{it}(0) is the potential outcome corresponding to not being treated in any period
Level Effects:
ATT(g,t,d|g,d) := \E[Y_t(g,d) - Y_t(0) | G=g, D=d] \ \ \ \textrm{and} \ \ \ ATE(g,t,d) := \E[Y_t(g,d) - Y_t(0) ]
Slope Effects:
ACRT(g,t,d|g,d) := \frac{\partial ATT(g,t,l|g,d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACR(g,t,d) := \frac{\partial ATE(g,t,d)}{\partial d}
These essentially inherit all the same issues as in the two period case
Under a multi-period version of "standard" parallel trends, comparisons of ATT across different values of dose are hard to interpret
Under a multi-period version of "strong" parallel trends, comparisons of ATE across different values of dose are straightforward to interpret
Expressions in remainder of talk are under "strong" parallel trends
Often, these are high-dimensional and it may be desirable to "aggregate" them
Average by group (across post-treatment time periods) and then across groups
\rightarrow ACR^{overall}(d) (overall average causal response for particular dose)
Average ACR^{overall}(d) across dose
\rightarrow ACR^O (this is just one number) and is likely to be the parameter that one would be targeting in a TWFE regression
Event study: average across groups who have been exposed to treatment for e periods
\rightarrow For fixed d
\rightarrow Average across different values of d \implies typical looking ES plot
Consider the same TWFE regression as before
Y_{it} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{it} + v_{it}
We show in the paper that \beta^{twfe} is a weighted average of the following terms:
\delta^{WITHIN}(g) = \frac{\textrm{cov}(\bar{Y}^{POST(g)} - \bar{Y}^{PRE(g)}, D | G=g)}{\textrm{var}(D|G=g)}
Comes from within-group variation in the amount of dose
This term is essentially the same as in the baseline case and corresponds to a reasonable treatment effect parameter under strong parallel trends
Like baseline case, (after some manipulations) this term corresponds to a "derivative"/"ACR"
Does not show up in the binary treatment case because there is no variation in amount of treatment
For k > g (i.e., group k becomes treated after group g),
\delta^{MID,PRE}(g,k) = \frac{\E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=k \right]}{\E[D|G=g]}
Comes from comparing path of outcomes for a group that becomes treated (group g) relative to a not-yet-treated group (group k)
Corresponds to a reasonable treatment effect parameter under strong parallel trends
Denominator (after some derivations) ends up giving this a "derivative"/"ACR" interpretation
Similar terms show up in the case with a binary treatment
For k > g (i.e., group k becomes treated after group g),
\begin{aligned}
\delta^{POST,MID}(g,k) &= \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | G=k\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | D=0 \right]}{\E[D|G=k]} \\
&- \left(\frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | D=0 \right]}{\E[D|G=k]} \right.\\
& \hspace{25pt} - \left.\frac{\E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | D=0 \right]}{\E[D|G=k]} \right)
\end{aligned}
In shorthand, the term in parentheses captures treatment effect dynamics for group g:
\begin{aligned}
\delta^{POST,MID}(g,k) &= \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | G=k\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | D=0 \right]}{\E[D|G=k]} \\
&- \textrm{Treatment Effect Dynamics for Group } g
\end{aligned}
Comes from comparing the path of outcomes for a group that becomes treated (group k) to the path of outcomes of an already-treated group (group g)
In the presence of treatment effect dynamics (these are not ruled out by any parallel trends assumption), this term is problematic
This is similar-in-spirit to the problematic terms for TWFE with a binary treatment
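A stylized three-period example shows how treatment effect dynamics contaminate a comparison that uses an already-treated group. All numbers are hypothetical; the example compares group k directly with group g (rather than reproducing the exact expression for \delta^{POST,MID}(g,k)), with both doses normalized to 1.

```python
# Untreated potential outcomes follow a common trend: Y_t(0) = t for both groups.
y0 = {1: 1.0, 2: 2.0, 3: 3.0}

# Hypothetical treatment effects by period.
effect_g = {1: 0.0, 2: 1.0, 3: 3.0}   # group g: treated from period 2, effect grows over time
effect_k = {1: 0.0, 2: 0.0, 3: 1.0}   # group k: treated from period 3

y_g = {t: y0[t] + effect_g[t] for t in y0}
y_k = {t: y0[t] + effect_k[t] for t in y0}

# DID for group k between period 2 (MID) and period 3 (POST),
# using the already-treated group g as the comparison group.
did_k_vs_g = (y_k[3] - y_k[2]) - (y_g[3] - y_g[2])

att_k = effect_k[3]                       # true effect for group k in period 3
dynamics_g = effect_g[3] - effect_g[2]    # change in group g's effect from MID to POST

print(did_k_vs_g, att_k, dynamics_g)
```

In this example the comparison returns the true effect for group k minus the treatment effect dynamics for group g, so the sign flips even though every unit-level effect is positive and parallel trends holds exactly.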
For k > g (i.e., group k becomes treated after group g),
\begin{aligned} \delta^{POST,PRE}(g,k) = \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=k \right]}{\E[D|G=g] - \E[D|G=k]} \end{aligned}
Comes from comparing path of outcomes for groups g and k in their common post-treatment periods relative to their common pre-treatment periods
In the presence of heterogeneous causal responses (causal response in same time period differs across groups), this term ends up being (partially) problematic too
Only shows up when \E[D|G=g] \neq \E[D|G=k]
No analogue in the binary treatment case
Issue #1: Selection bias terms that show up under standard parallel trends
\implies to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
Issue #2: Weights
Negative weights possible due to (i) treatment effect dynamics or (ii) heterogeneous causal responses across groups
Are (undesirably) driven by estimation method
Weights issues can be solved by carefully making desirable comparisons and choosing appropriate weights yourself
Selection bias terms are a more fundamental challenge
There are a number of challenges to implementing/interpreting DID with a multi-valued or continuous treatment
Issues related to TWFE are (mostly) anticipated at this point
But (in my view) the main new issue here is that justifying interpreting comparisons across different doses as causal effects requires stronger assumptions than most researchers probably think that they are making
Link to paper: https://arxiv.org/abs/2107.02637
Other Summaries: (i) Five minute summary (ii) Pedro's Twitter
Comments welcome: brantly.callaway@uga.edu
Code: ETA 2-3 months