June 14, 2024
\(\newcommand{\E}{\mathbb{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)There have been a number of recent advances in the differences-in-differences literature. Two broad contributions:
These papers have (largely) focused on the case with a binary, staggered treatment
Current paper: Move from the binary treatment case to a setting with a continuous treatment (“dose”)
Some of the arguments involve extending ideas from the binary, staggered treatment case to a setting with continuous treatment
Example:
Effect of \(\underbrace{\textrm{length of school closures}}_{\textrm{continuous treatment}}\) (during Covid) on \(\underbrace{\textrm{students' test scores}}_{\textrm{outcome}}\)
Identification: What’s the same as in the binary treatment case?
Identification: What’s different from the binary treatment case?
Interpreting TWFE Regressions (quickly if time permits)
Potential outcomes notation
Two time periods: \(t=1\) and \(t=2\)
Potential outcomes: \(Y_{i,t=2}(d)\)
Observed outcomes: \(Y_{i,t=2}\) and \(Y_{i,t=1}\)
\[Y_{i,t=2}=Y_{i,t=2}(D_i) \quad \textrm{and} \quad Y_{i,t=1}=Y_{i,t=1}(0)\]
Level Effects (Average Treatment Effect on the Treated)
\[ATT(d|d) := \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d]\]
Interpretation: The average effect of dose \(d\), relative to not being treated, among the group that actually experienced dose \(d\)
This is the natural analogue of \(ATT\) in the binary treatment case
Slope Effects (Average Causal Response on the Treated)
\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]
Notice that \(ATT(d|d)\) and \(ACRT(d|d)\) are functional parameters
We can view \(ATT(d|d)\) and \(ACRT(d|d)\) as the “building blocks” for more aggregated parameters. Versions aggregated into a single number are \[\begin{aligned} ATT^o := \E[ATT(D|D)|D>0] \qquad \qquad ACRT^o := \E[ACRT(D|D)|D>0] \end{aligned}\]
\(ATT^o\) averages \(ATT(d|d)\) over the distribution of the dose among treated units (\(D>0\))
\(ACRT^o\) averages \(ACRT(d|d)\) over the distribution of the dose among treated units (\(D>0\))
\(ACRT^o\) is the natural target parameter for the TWFE regression in this case
“Standard” Parallel Trends Assumption
For all \(d\),
\[\E[\Delta Y_{i,t=2}(0) | D_i=d] = \E[\Delta Y_{i,t=2}(0) | D_i=0]\]
Then,
\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{150pt}\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[\Delta Y_{i,t=2}(0) | D_i=0]\\ &= \E[\Delta Y_{i,t=2} | D_i=d] - \E[\Delta Y_{i,t=2} | D_i=0] \end{aligned} \]
This is exactly what you would expect
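When the dose takes only a few discrete values, the last line of the derivation can be computed directly from group means. The sketch below is illustrative only: the function name `att_by_dose` and the toy numbers are assumptions, not taken from the paper, and a genuinely continuous dose would need a nonparametric regression in place of the group means.

```python
from collections import defaultdict
from statistics import mean


def att_by_dose(delta_y, dose):
    """Estimate E[dY | D=d] - E[dY | D=0] for each observed positive dose d.

    Under "standard" parallel trends this difference equals ATT(d|d).
    """
    groups = defaultdict(list)
    for dy, d in zip(delta_y, dose):
        groups[d].append(dy)
    if 0 not in groups:
        raise ValueError("an untreated group (dose 0) is required")
    baseline = mean(groups[0])  # mean outcome change among the untreated
    return {d: mean(v) - baseline for d, v in sorted(groups.items()) if d > 0}


# Hypothetical data: outcome changes (t=1 to t=2) and doses for six units.
delta_y = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]
dose = [0, 0, 1, 1, 2, 2]

att = att_by_dose(delta_y, dose)
print(att)  # {1: 3.0, 2: 8.0}
```

Each entry should be read on its own, as the effect of that dose for the group that received it.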
Unfortunately, no
Most empirical work with a continuous treatment wants to think about how causal responses vary across dose
There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects
Consider comparing \(ATT(d|d)\) for two different doses
\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \E[Y_{i,t=2}(d_h)-Y_{i,t=2}(d_l) | D_i=d_h] + \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0) | D_i=d_h] - \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0) | D_i=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{i,t=2}(d_h) - Y_{i,t=2}(d_l) | D_i=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]
“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here
Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends
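To make the decomposition concrete, consider a purely hypothetical numerical illustration (the numbers are invented for exposition and do not come from the paper). Suppose \(ATT(d_h|d_h) = 4\), \(ATT(d_l|d_h) = 2\), and \(ATT(d_l|d_l) = 1\). Then
\[ \begin{aligned} ATT(d_h|d_h) - ATT(d_l|d_l) &= 4 - 1 = 3 \\ &= \underbrace{(4 - 2)}_{\textrm{Causal Response}} + \underbrace{(2 - 1)}_{\textrm{Selection Bias}} \end{aligned} \]
In this example, only 2 of the 3 units of difference reflect the effect of moving the high-dose group from \(d_l\) to \(d_h\). The remaining unit arises because the high-dose group would also have responded more strongly to the low dose, and standard parallel trends alone cannot separate the two pieces.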
This problem spills over into identifying \(ACRT(d|d)\)
Intuition:
Difference-in-differences identification strategies result in \(ATT(d|d)\) parameters. These are local parameters and are difficult to compare to each other
This explanation is similar to thinking about LATEs with two different instruments
Thus, comparing \(ATT(d|d)\) across different values is tricky and not for free
What can you do?
One idea: just recover \(ATT(d|d)\) and interpret it cautiously (on its own, not relative to \(ATT(d|d)\) at other values of \(d\))
If you want to compare them to each other, it will come with the cost of additional (structural) assumptions
“Strong” Parallel Trends Assumption
For all doses \(d\) and \(l\),
\[\mathbb{E}[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] = \mathbb{E}[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d]\]
This is notably different from “Standard” Parallel Trends
It involves potential outcomes for all values of the dose (not just untreated potential outcomes)
All dose groups would have experienced the same path of outcomes had they been assigned the same dose
Strong parallel trends is equivalent to a particular restriction on treatment effect heterogeneity. Notice:
\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{200pt} \\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d] \\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=l] \\ &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=l] = ATT(d|l) \end{aligned} \]
Since this holds for all \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{i,t=2}(d) - Y_{i,t=2}(0)]\). Thus, under strong parallel trends, we have that
\[ATE(d) = \E[\Delta Y_{i,t=2}|D_i=d] - \E[\Delta Y_{i,t=2}|D_i=0]\]
RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” PT, but
assumptions are different
parameter interpretation is different
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose
\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{i,t=2}(d_h) - Y_{i,t=2}(0)] - \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0)]\\ &= \underbrace{\E[Y_{i,t=2}(d_h) - Y_{i,t=2}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]
Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption
Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting
\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^o := \E[ACR(D) | D>0] \end{aligned} \]
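With a discrete dose, a rough analogue of \(ACR(d)\) is the change in \(ATE(d)\) between adjacent doses, scaled by the gap between them. The sketch below is a hedged illustration: `adjacent_dose_slopes` is a hypothetical helper, the inputs are made-up level effects, and the finite-difference scaling is an assumption standing in for the derivative. The output has a causal-response interpretation only if strong parallel trends holds.

```python
def adjacent_dose_slopes(level_effects):
    """Finite-difference slopes between adjacent doses.

    level_effects maps each positive dose d to an estimate of
    E[dY | D=d] - E[dY | D=0]; dose 0 is the reference with effect 0.
    """
    slopes = {}
    prev_dose, prev_effect = 0, 0.0
    for d in sorted(level_effects):
        slopes[d] = (level_effects[d] - prev_effect) / (d - prev_dose)
        prev_dose, prev_effect = d, level_effects[d]
    return slopes


# Hypothetical level effects at doses 1 and 2.
slopes = adjacent_dose_slopes({1: 3.0, 2: 8.0})
print(slopes)  # {1: 3.0, 2: 5.0}
```

Under standard parallel trends alone, the same numbers would mix causal responses with selection bias, so the code is identical but the interpretation is not.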
Consider the same TWFE regression (but now \(D_{i,t}\) is continuous): \[\begin{aligned} Y_{i,t} = \theta_t + \eta_i + \beta^{twfe} D_{i,t} + e_{i,t} \end{aligned}\] We show that \[\begin{aligned} \beta^{twfe} = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{aligned}\] where \(m_\Delta(l) := \E[\Delta Y_{i,t=2}|D_i=l] - \E[\Delta Y_{i,t=2}|D_i=0]\) and \(w(l)\) are weights
Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)
Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).
About the weights: they are all positive, but they have some strange properties (e.g., they are always maximized at \(l = \E[D]\), even if this is not a common value for the dose)
Other issues can arise in more complicated cases
For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment
In general, things get worse for TWFE regressions with more complications
Level Effects - no issues related to selection bias
For \(ATT^o\): Binarize treatment, \(ATT^o = \E[\Delta Y_{i,t=2} | D_i > 0] - \E[\Delta Y_{i,t=2} | D_i=0]\).
For \(ATT(d|d)\): Nonparametrically estimate \(\E[\Delta Y_{i,t=2}|D_i=d]-\E[\Delta Y_{i,t=2}|D_i=0]\)
Slope Effects - must deal with selection bias
Nonparametrically estimate derivative of \(\E[\Delta Y_{i,t=2}|D_i=d]\)
For \(ACR(d)\): Under strong parallel trends, derivative is equal to \(ACR(d)\)
For \(ACR^o\): Average \(ACR(D_i)\) over \(D_i>0\)
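The binarized comparison for \(ATT^o\) listed above amounts to a difference of two sample means. A minimal sketch, assuming made-up data and a hypothetical function name `att_overall`:

```python
from statistics import mean


def att_overall(delta_y, dose):
    """Binarize the dose: E[dY | D>0] - E[dY | D=0]."""
    treated = [dy for dy, d in zip(delta_y, dose) if d > 0]
    untreated = [dy for dy, d in zip(delta_y, dose) if d == 0]
    if not treated or not untreated:
        raise ValueError("both treated and untreated units are required")
    return mean(treated) - mean(untreated)


# Hypothetical data: outcome changes and doses for six units.
delta_y = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]
dose = [0, 0, 1, 1, 2, 2]

att_o = att_overall(delta_y, dose)
print(att_o)  # 5.5
```

In this toy sample the dose-specific comparisons are 3 at dose 1 and 8 at dose 2, each covering half of the treated units, so their average is also 5.5.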
Additional Comments:
It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose
However, comparisons of ATT-type parameters across different doses are hard to interpret
This suggests targeting ATE-type parameters
[Empirical example about Medicare policy and capital/labor ratios]
Link to paper: https://arxiv.org/abs/2107.02637
Other Summaries: (i) Five minute summary (ii) Pedro’s Twitter
Comments welcome: brantly.callaway@uga.edu
Code: in progress
This is a simplified version of Acemoglu and Finkelstein (2008)
1983 Medicare reform that eliminated labor subsidies for hospitals
Medicare moved to the Prospective Payment System (PPS), which replaced “full cost reimbursement” with “partial cost reimbursement”: reimbursements for labor were eliminated, while reimbursements for capital expenses were maintained
Rough idea: This changes relative factor prices which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.
In the paper, we provide some theoretical arguments concerning properties of production functions that suggest that strong parallel trends holds.
Annual hospital-reported data from the American Hospital Association, 1980-1986
Outcome is capital/labor ratio
proxy using the depreciation share of total operating expenses (avg. 4.5%)
our setup: collapse to two periods by taking average in pre-treatment periods and average in post-treatment periods
Dose is “exposure” to the policy
the fraction of Medicare patients in the period before the policy was implemented
roughly 15% of hospitals are untreated (have essentially no Medicare patients)
Some ideas:
Strong parallel trends may be more plausible after conditioning on some covariates.
It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.
In this case, it is not possible to recover level effects like \(ATT(d|d)\).
However, notice that \[\begin{aligned}& \E[\Delta Y_{i,t=2} | D_i=d_h] - \E[\Delta Y_{i,t=2}| D_i=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y_{i,t=2} | D_i=d_h] - \E[\Delta Y_{i,t=2}(0) | D_i=d_h]\Big) - \Big(\E[\Delta Y_{i,t=2} | D_i=d_l]-\E[\Delta Y_{i,t=2}(0) | D_i=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]
In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.
Still face issues related to selection bias / strong parallel trends though
Strategies like binarizing the treatment can still work (though be careful!)
If you classify units as being treated or untreated, you can recover the \(ATT\) of being treated at all.
On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”
That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same
The most common strategy in applied work is to estimate the two-way fixed effects (TWFE) regression:
\[Y_{i,t} = \theta_t + \eta_i + \beta^{twfe} D_{i,t} + v_{i,t}\] In the baseline case (two periods, no one treated in the first period), this is just
\[\Delta Y_i = \beta_0 + \beta^{twfe} \cdot D_i + \Delta v_i\]
\(\beta^{twfe}\) often (loosely) interpreted as some kind of average causal response (i.e., slope effect) parameter
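In the two-period baseline, \(\beta^{twfe}\) is just the OLS slope from regressing \(\Delta Y_i\) on \(D_i\), that is, \(\mathrm{cov}(\Delta Y, D)/\mathrm{var}(D)\). A minimal sketch with made-up data (the function name `twfe_two_period` and the numbers are illustrative assumptions):

```python
from statistics import mean


def twfe_two_period(delta_y, dose):
    """OLS slope of dY on D with an intercept: cov(dY, D) / var(D)."""
    d_bar, y_bar = mean(dose), mean(delta_y)
    cov = sum((d - d_bar) * (y - y_bar) for d, y in zip(dose, delta_y))
    var = sum((d - d_bar) ** 2 for d in dose)
    if var == 0:
        raise ValueError("the dose must vary across units")
    return cov / var


# Hypothetical data: outcome changes and doses for six units.
delta_y = [1.0, 3.0, 4.0, 6.0, 9.0, 11.0]
dose = [0, 0, 1, 1, 2, 2]

beta = twfe_two_period(delta_y, dose)
print(round(beta, 6))  # 4.0
```

Here the mean outcome change rises by 3 between doses 0 and 1 and by 5 between doses 1 and 2, and the regression summarizes both with a single slope of 4.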
In the paper, we show that
Under Standard Parallel Trends:
\[\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) \left[ ACRT(l|l) + \frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l} \right] \, dl\]
\(w_1(l)\) are positive weights that integrate to 1
\(ACRT(l|l)\) is average causal response conditional on \(D_i=l\)
\(\frac{\partial ATT(l|h)}{\partial h} \Big|_{h=l}\) is a local selection bias term
In the paper, we show that
Under Strong Parallel Trends:
\[\beta^{twfe} = \int_{\mathcal{D}_+} w_1(l) ACR(l) \, dl\]
\(w_1(l)\) are same weights as before
\(ACR(l)\) is average causal response to dose \(l\) across entire population
there is no selection bias term
Issue #1: Selection bias terms that show up under standard parallel trends
\(\implies\) to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
Issue #2: Weights
They are all positive
But this is a very minimal requirement for weights being “reasonable”
These weights have “strange” properties: (i) they are affected by the size of the untreated group, and (ii) they are maximized at \(D_i=\E[D]\).
Staggered treatment adoption
If you are treated today, you will continue to be treated tomorrow
This is relatively straightforward to relax; it just makes the notation more complex
Can allow for treatment anticipation too, but we ignore it here for simplicity
Once a unit becomes treated, its dose remains constant (this could probably be relaxed too)
Additional Notation:
\(G_i\) — a unit’s “group” (the time period when unit becomes treated)
Potential outcomes \(Y_{i,t}(g,d)\) — the outcome unit \(i\) would experience in time period \(t\) if they became treated in period \(g\) with dose \(d\)
\(Y_{i,t}(0)\) is the potential outcome corresponding to not being treated in any period
Level Effects:
\[ ATT(g,t,d|g,d) := \E[Y_t(g,d) - Y_t(0) | G=g, D_i=d] \ \ \ \textrm{and} \ \ \ ATE(g,t,d) := \E[Y_t(g,d) - Y_t(0) ]\]
Slope Effects:
\[ACRT(g,t,d|g,d) := \frac{\partial ATT(g,t,l|g,d)}{\partial l} \Big|_{l=d} \ \ \ \textrm{and} \ \ \ ACR(g,t,d) := \frac{\partial ATE(g,t,d)}{\partial d}\]
These essentially inherit all the same issues as in the two period case
Under a multi-period version of “standard” parallel trends, comparisons of \(ATT\) across different values of dose are hard to interpret
Under a multi-period version of “strong” parallel trends, comparisons of \(ATE\) across different values of dose are straightforward to interpret
Expressions in remainder of talk are under “strong” parallel trends
Often, these are high-dimensional and it may be desirable to “aggregate” them
Average by group (across post-treatment time periods) and then across groups
\(\rightarrow\) \(ACR^{overall}(d)\) (overall average causal response for particular dose)
Average \(ACR^{overall}(d)\) across dose
\(\rightarrow\) \(ACR^o\) (this is just one number), which is likely to be the parameter that one would be targeting in a TWFE regression
Event study: average across groups who have been exposed to treatment for \(e\) periods
\(\rightarrow\) For fixed \(d\)
\(\rightarrow\) Average across different values of \(d\) \(\implies\) typical looking ES plot
Consider the same TWFE regression as before
\[Y_{i,t} = \theta_t + \eta_i + \beta^{twfe} \cdot D_i \cdot Treat_{i,t} + v_{i,t}\]
We show in the paper that \(\beta^{twfe}\) is a weighted average of the following terms:
\[\delta^{WITHIN}(g) = \frac{\textrm{cov}(\bar{Y}^{POST(g)} - \bar{Y}^{PRE(g)}, D | G=g)}{\textrm{var}(D|G=g)}\]
Comes from within-group variation in the amount of dose
This term is essentially the same as in the baseline case and corresponds to a reasonable treatment effect parameter under strong parallel trends
Like baseline case, (after some manipulations) this term corresponds to a “derivative”/“ACR”
Does not show up in the binary treatment case because there is no variation in amount of treatment
For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),
\[\delta^{MID,PRE}(g,k) = \frac{\E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=k \right]}{\E[D|G=g]}\]
Comes from comparing path of outcomes for a group that becomes treated (group \(g\)) relative to a not-yet-treated group (group \(k\))
Corresponds to a reasonable treatment effect parameter under strong parallel trends
Denominator (after some derivations) ends up giving this a “derivative”/“ACR” interpretation
Similar terms show up in the case with a binary treatment
For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),
\[ \begin{aligned} \delta^{POST,MID}(g,k) &= \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | G=k\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | D_i=0 \right]}{\E[D|G=k]} \\ &- \left(\frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | D_i=0 \right]}{\E[D|G=k]} \right.\\ & \hspace{25pt} - \left.\frac{\E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{MID(g,k)} - \bar{Y}^{PRE(g)}\big) | D_i=0 \right]}{\E[D|G=k]} \right) \end{aligned} \]
For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),
\[ \begin{aligned} \delta^{POST,MID}(g,k) &= \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | G=k\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{MID(g,k)}\big) | D_i=0 \right]}{\E[D|G=k]} \\ &- \textrm{Treatment Effect Dynamics for Group g} \end{aligned} \]
Comes from comparing path of outcomes for a group that becomes treated (group \(k\)) to paths of outcomes of an already treated group (group \(g\))
In the presence of treatment effect dynamics (these are not ruled out by any parallel trends assumption), this term is problematic
This is similar-in-spirit to the problematic terms for TWFE with a binary treatment
For \(k > g\) (i.e., group \(k\) becomes treated after group \(g\)),
\[ \begin{aligned} \delta^{POST,PRE}(g,k) = \frac{\E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=g\right] - \E\left[\big(\bar{Y}^{POST(k)} - \bar{Y}^{PRE(g)}\big) | G=k \right]}{\E[D|G=g] - \E[D|G=k]} \end{aligned} \]
Comes from comparing path of outcomes for groups \(g\) and \(k\) in their common post-treatment periods relative to their common pre-treatment periods
In the presence of heterogeneous causal responses (causal response in same time period differs across groups), this term ends up being (partially) problematic too
Only shows up when \(\E[D|G=g] \neq \E[D|G=k]\)
No analogue in the binary treatment case
Issue #1: Selection bias terms that show up under standard parallel trends
\(\implies\) to interpret as a weighted average of any kind of causal responses, need to invoke (likely substantially) stronger assumptions
Issue #2: Weights
Negative weights possible due to (i) treatment effect dynamics or (ii) heterogeneous causal responses across groups
The weights are (undesirably) driven by the estimation method
Weights issues can be solved by carefully choosing desirable comparisons and applying appropriate, user-chosen weights
Selection bias terms are a more fundamental challenge