\newcommand{\E}{\mathbb{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }
The discussion (and much of the recent DID literature) has focused on the setting with staggered treatment adoption.
However, this certainly does not cover the full range of possible treatments. In Part 4, we'll primarily consider two leading extensions:
A treatment that is multi-valued or continuous (e.g., minimum wage has this flavor)
A treatment that can turn on and off (e.g., union status)
A couple of things to notice as we go along:
I'm not going to cover much on TWFE regressions here. Even more can go wrong with them in these settings.
Try to pay attention to the pattern. Even though the arguments are getting more complicated, we are still following the same idea: (i) target disaggregated parameters, (ii) combine them into lower-dimensional objects, and (iii) new here, there will be some additional interpretation issues that we will also emphasize.
Potential outcomes notation
Two time periods: t^* and t^*-1
No one treated until period t^*
Some units remain untreated in period t^*
Potential outcomes: Y_{it^*}(d)
Observed outcomes: Y_{it^*} and Y_{it^*-1}
Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)
Level Effects (Average Treatment Effect on the Treated)
ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]
Interpretation: The average effect of dose d relative to not being treated local to the group that actually experienced dose d
This is the natural analogue of ATT in the binary treatment case
Slope Effect (Average Causal Response on the Treated)
ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}
We can view ACRT(d|d) as the "building block" here. An aggregated version of it (into a single number) is \begin{align*} ACRT^O := \E[ACRT(D|D)|D>0] \end{align*}
ACRT^O averages ACRT(d|d) over the population distribution of the dose
Like ATT^O for staggered treatment adoption, ACRT^O is the natural target parameter for the TWFE regression in this case
\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]
Then,
\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\ &= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0] \end{aligned}
This is exactly what you would expect
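The identification result above suggests a simple plug-in estimator for a discrete dose: the mean outcome change in each dose group minus the mean outcome change among untreated units. The following is a minimal sketch under that reading; the function name `did_by_dose` and the toy numbers are hypothetical and only illustrate the arithmetic.

```python
from collections import defaultdict
from statistics import mean


def did_by_dose(delta_y, dose):
    """Return {d: mean(dY | D=d) - mean(dY | D=0)} for each positive dose d.

    Under standard parallel trends, each entry is the sample analogue of ATT(d|d).
    """
    by_dose = defaultdict(list)
    for dy, d in zip(delta_y, dose):
        by_dose[d].append(dy)
    if 0 not in by_dose:
        raise ValueError("need an untreated (D=0) comparison group")
    untreated_trend = mean(by_dose[0])
    return {d: mean(v) - untreated_trend
            for d, v in sorted(by_dose.items()) if d != 0}


# Hypothetical toy data: two units at each of dose 0, 1 and 2.
dose = [0, 0, 1, 1, 2, 2]
delta_y = [1.0, 3.0, 4.0, 6.0, 7.0, 11.0]  # Y_{t*} - Y_{t*-1} for each unit

estimates = did_by_dose(delta_y, dose)
print(estimates)  # {1: 3.0, 2: 7.0}
```

With a continuous dose, the group means would need to be replaced by some nonparametric regression of the outcome change on the dose; that step is not shown here.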
Is identifying ATT(d|d) enough? Unfortunately, no:
Most applied work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose
For example, plot treatment effects as a function of dose
Average causal response parameters inherently involve comparisons across slightly different doses
Consider comparing ATT(d|d) for two different doses
\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned}
"Standard" Parallel Trends is not strong enough to rule out the selection bias terms here
Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends
This problem spills over into identifying ACRT(d|d)
Positive side-comment: ATT(d_h|d_h) - ATT(d_l|d_l) = \E[\Delta Y_{t^*} | D=d_h] - \E[\Delta Y_{t^*} | D=d_l] (which doesn't involve the untreated group)
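For reference, the decomposition can be sketched by adding and subtracting ATT(d_l|d_h) = \E[Y_{t^*}(d_l) - Y_{t^*}(0) | D=d_h], using only the definitions above:
\begin{align*} ATT(d_h|d_h) - ATT(d_l|d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0) | D=d_h] - ATT(d_l|d_l) \\ &= \E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h] + \E[Y_{t^*}(d_l) - Y_{t^*}(0) | D=d_h] - ATT(d_l|d_l) \\ &= \E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h] + ATT(d_l|d_h) - ATT(d_l|d_l) \end{align*}
The term ATT(d_l|d_h) involves Y_{t^*}(d_l) for the group that actually chose d_h, which is never observed. Standard parallel trends only restricts untreated potential outcomes, so it places no restriction on this term.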
Intuition:
Difference-in-differences identification strategies result in ATT(d|d) parameters. These are local parameters and are difficult to compare to each other
This explanation is similar to thinking about LATEs with two different instruments
Thus, comparing ATT(d|d) across different values is tricky and not for free
What can you do?
One idea: just recover ATT(d|d) and interpret it cautiously (on its own, not relative to other values of d)
If you want to compare them to each other, it will come with the cost of additional (structural) assumptions
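To make the selection bias term concrete, here is a small numerical sketch. All numbers are hypothetical and chosen so that standard parallel trends holds (every group shares the same untreated trend) while the high-dose group would also have gained more from the low dose.

```python
# Hypothetical population values: att[group][dose] plays the role of ATT(dose | group).
att = {
    "low": {1: 1.0, 2: 2.0},   # group that chose dose d_l = 1
    "high": {1: 4.0, 2: 6.0},  # group that chose dose d_h = 2
}
dose_of = {"low": 1, "high": 2}
common_trend = 1.0  # E[dY(0) | D=d] is the same for every group: standard PT holds


def observed_trend(group):
    """Mean observed outcome change for a group at the dose it actually chose."""
    return common_trend + att[group][dose_of[group]]


# What the data identify: DiD for each dose group against the untreated group.
did_high = observed_trend("high") - common_trend  # ATT(d_h | d_h)
did_low = observed_trend("low") - common_trend    # ATT(d_l | d_l)
naive_difference = did_high - did_low

# What we would like to learn, and the wedge between the two.
causal_response = att["high"][2] - att["high"][1]  # E[Y(d_h) - Y(d_l) | D=d_h]
selection_bias = att["high"][1] - att["low"][1]    # ATT(d_l | d_h) - ATT(d_l | d_l)

print(naive_difference, causal_response, selection_bias)  # 5.0 2.0 3.0
```

In this toy setup the comparison of DiD estimates across doses (5.0) overstates the causal response (2.0) by exactly the selection bias term (3.0), even though parallel trends in untreated outcomes holds.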
Strong Parallel Trends: For all doses d and l,
\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=l] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]
This is notably different from "Standard" Parallel Trends
It involves potential outcomes for all values of the dose (not just untreated potential outcomes)
All dose groups would have experienced the same path of outcomes had they been assigned the same dose
Strong parallel trends implies a version of treatment effect homogeneity. Notice:
\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d] \\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d] \\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=l] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=l] \\ &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=l] = ATT(d|l) \end{aligned}
Since this holds for all d and l, it also implies that ATT(d|d) = ATE(d) = \E[Y_{t^*}(d) - Y_{t^*}(0)]. Thus, under strong parallel trends, we have that
ATE(d) = \E[\Delta Y_{t^*}|D=d] - \E[\Delta Y_{t^*}|D=0]
RHS is exactly the same expression as for ATT(d|d) under "standard" parallel trends, but here
assumptions are different
parameter interpretation is different
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose
\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\ &= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}} \end{aligned}
Thus, recovering ATE(d) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption
Given that we can compare ATE(d)'s across dose, we can recover slope effects in this setting
\begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^O := \E[ACR(D) | D>0] \end{aligned}
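As a sketch of how the slope effect connects to observed data under strong parallel trends: since ATE(d) = \E[\Delta Y_{t^*}|D=d] - \E[\Delta Y_{t^*}|D=0] and the second term does not depend on d,
\begin{align*} ACR(d) = \frac{\partial ATE(d)}{\partial d} = \frac{\partial \E[\Delta Y_{t^*} | D=d]}{\partial d} \end{align*}
With a discrete dose, the analogous object for two adjacent doses d_{j-1} < d_j (notation introduced here for illustration) is
\begin{align*} \frac{ATE(d_j) - ATE(d_{j-1})}{d_j - d_{j-1}} = \frac{\E[\Delta Y_{t^*} | D=d_j] - \E[\Delta Y_{t^*} | D=d_{j-1}]}{d_j - d_{j-1}} = \frac{\E[Y_{t^*}(d_j) - Y_{t^*}(d_{j-1})]}{d_j - d_{j-1}} \end{align*}
Neither expression involves the untreated group, but both rely on strong parallel trends for their causal interpretation.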
Consider the same TWFE regression (but now D_{it} is continuous): \begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*} You can show that \begin{align*} \alpha = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*} where m_\Delta(l) := \E[\Delta Y_{t^*}|D=l] - \E[\Delta Y_{t^*}|D=0] and w(l) are weights
Under standard parallel trends, m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}
Under strong parallel trends, m'_{\Delta}(l) = ACR(l).
Thus, issues related to selection bias continue to show up here
About the weights: they are all positive, but they have some strange properties. For example, they are always maximized at l = \E[D], even if this is not a common value for the dose.
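In the two-period setup used here (nobody treated in the first period), the TWFE coefficient on the dose coincides with the slope from regressing the outcome change on the dose. The following sketch checks this numerically in plain Python on hypothetical data; the function names and numbers are illustrative assumptions, not part of the original slides.

```python
def twfe_slope_two_periods(y_pre, y_post, dose):
    """TWFE coefficient on D_it in a balanced two-period panel via double demeaning."""
    n = len(dose)
    y = [[a, b] for a, b in zip(y_pre, y_post)]
    d = [[0.0, float(di)] for di in dose]  # nobody is treated in the first period

    def double_demean(x):
        unit_means = [sum(row) / 2 for row in x]
        time_means = [sum(row[t] for row in x) / n for t in range(2)]
        grand_mean = sum(unit_means) / n
        return [[x[i][t] - unit_means[i] - time_means[t] + grand_mean
                 for t in range(2)] for i in range(n)]

    y_dm, d_dm = double_demean(y), double_demean(d)
    num = sum(y_dm[i][t] * d_dm[i][t] for i in range(n) for t in range(2))
    den = sum(d_dm[i][t] ** 2 for i in range(n) for t in range(2))
    return num / den


def first_difference_slope(delta_y, dose):
    """OLS slope from regressing the outcome change on the dose (with intercept)."""
    n = len(dose)
    mean_d = sum(dose) / n
    mean_dy = sum(delta_y) / n
    cov = sum((di - mean_d) * (dyi - mean_dy) for di, dyi in zip(dose, delta_y))
    var = sum((di - mean_d) ** 2 for di in dose)
    return cov / var


# Hypothetical data: one unit per dose, outcome change is nonlinear in the dose.
dose = [0, 1, 2, 3]
y_pre = [5.0, 2.0, 7.0, 1.0]
delta_y = [0.0, 1.0, 4.0, 9.0]
y_post = [a + b for a, b in zip(y_pre, delta_y)]

alpha_twfe = twfe_slope_two_periods(y_pre, y_post, dose)
alpha_fd = first_difference_slope(delta_y, dose)
print(alpha_twfe, alpha_fd)
```

Here the increments in the outcome change between adjacent doses are 1, 3 and 5, and both slopes equal 3, a single number that lies between those local slopes and hides how the response varies with the dose.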
Other issues can arise in more complicated cases
For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment
In general, things get worse for TWFE regressions with more complications
It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose
However, comparisons of ATT-type parameters across different doses are hard to interpret
They include selection bias terms
This issue extends to identifying ACRT parameters
These issues extend to TWFE regressions
This suggests targeting ATE-type parameters
Comparisons across doses do not contain selection bias terms
But identifying ATE-type parameters requires stronger assumptions
"Scarring" vs. Moving in and out of treatment
Example treatments:
Union status (Vella and Verbeek, 1998)
Whether or not location hit by hurricane (Deryugina, 2017)
Whether or not a district shares the same ethnicity as the president of the country (Burgess et al., 2015)
Additional Notation:
We can make a lot of progress by redefining our notion of a "group"
Keep track of entire treatment regime \mathbf{D}_i := (D_{i1}, \ldots, D_{i\mathcal{T}})' and/or treatment history up to period t: \mathbf{D}_{it} := (D_{i1}, \ldots, D_{it})'.
Potential outcomes Y_{it}(\mathbf{d}_t) where \mathbf{d}_t is some treatment history up to period t (this notation imposes "no anticipation": potential outcomes do not depend on future treatments). Observed outcomes: Y_{it} = Y_{it}(\mathbf{D}_{it})
A little more notation...
\mathcal{D}_t \subseteq \{0,1\}^t is the set of all possible treatment histories in period t. As earlier, we will exclude units that are treated in the first period (I'll briefly come back to this later)
\mathbf{0}_t denotes not participating in the treatment in any period up to period t
In this case, we'll define groups by their treatment histories \mathbf{d}_t. Thus, we can consider group-time average treatment effects defined by \begin{align*} ATT(\mathbf{d}_t, t) := \E[Y_{it}(\mathbf{d}_t) - Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{d}_t] \end{align*}
Parallel Trends Assumption: For all t=2,\ldots,\mathcal{T}, and for all \mathbf{d}_t \in \mathcal{D}_t, \begin{align*} \E[\Delta Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{d}_t] = \E[\Delta Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}
Identification: In this setting, under the parallel trends assumption, we have that \begin{align*} ATT(\mathbf{d}_t, t) = \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{d}_t] - \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}
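A sketch of the argument, under one reading of the assumption: parallel trends is taken to hold in every period s \leq t for groups defined by their treatment history through period t (this is an assumption of the sketch, stated slightly more explicitly than above). Because units treated in the first period are excluded, Y_{i1} = Y_{i1}(\mathbf{0}_1) for everyone, so
\begin{align*} ATT(\mathbf{d}_t, t) &= \E[Y_{it}(\mathbf{d}_t) - Y_{i1}(\mathbf{0}_1) | \mathbf{D}_{it} = \mathbf{d}_t] - \E[Y_{it}(\mathbf{0}_t) - Y_{i1}(\mathbf{0}_1) | \mathbf{D}_{it} = \mathbf{d}_t] \\ &= \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{d}_t] - \sum_{s=2}^t \E[\Delta Y_{is}(\mathbf{0}_s) | \mathbf{D}_{it} = \mathbf{d}_t] \\ &= \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{d}_t] - \sum_{s=2}^t \E[\Delta Y_{is}(\mathbf{0}_s) | \mathbf{D}_{it} = \mathbf{0}_t] \\ &= \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{d}_t] - \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}
The second line telescopes the untreated path into period-by-period changes, the third applies parallel trends to each change, and the last uses that units with history \mathbf{0}_t are observed at their untreated potential outcomes in every period up to t.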
This argument is straightforward and analogous to what we have done before. However...
There are a number of additional complications that arise here.
There are far more possible groups here than in the staggered treatment case (you can think of this as leading to a kind of curse of dimensionality)
\implies small groups \implies imprecise estimates and (possibly) invalid inferences
also makes it harder to report the results
The previous point provides an additional reason to try to aggregate the group-time average treatment effects. However, this is also not so straightforward.
This is an area of active research (e.g., de Chaisemartin and d'Haultfoeuille (2023) and Yanagi (2023))
Some ideas below...but the literature has not converged here yet
Probably the simplest approach is to just make groups on the basis of the first period when a unit experiences the treatment
We have (kind of) been doing this in our minimum wage application
Lots of papers (e.g., job displacement, hospitalization) have used this idea
Formally, it amounts to averaging over all subsequent treatment decisions (de Chaisemartin and d'Haultfoeuille (2023))
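As a rough sketch of this grouping rule, the snippet below maps each treatment history to the first period in which the unit is treated, so that many distinct histories collapse into a few groups. The helper name `first_treated_period` and the example histories are hypothetical.

```python
from collections import defaultdict


def first_treated_period(history):
    """Return the first period (1-indexed) with treatment, or None if never treated."""
    for period, treated in enumerate(history, start=1):
        if treated:
            return period
    return None


# Hypothetical treatment histories over four periods (nobody treated in period 1).
histories = [
    (0, 1, 1, 1),  # turns on in period 2 and stays on
    (0, 1, 0, 1),  # turns on in period 2, then off, then on again
    (0, 0, 1, 0),  # turns on in period 3, then off
    (0, 0, 0, 0),  # never treated
]

groups = defaultdict(list)
for h in histories:
    groups[first_treated_period(h)].append(h)

print(dict(groups))
```

The first two histories land in the same group even though their later treatment paths differ, which is the sense in which this approach averages over subsequent treatment decisions.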
But there are other ideas too. Suppose that you were interested in the average treatment effect of experiencing some cumulative number of treatments over time (e.g., how many years someone was in a union).
Define \sigma_t(\mathbf{d}_t) := \displaystyle \sum_{s=1}^t d_s --- \sigma_t(\cdot) is a function that adds up the cumulative number of treatments up to period t for treatment history \mathbf{d}_t.
We will target the average treatment effect of having experienced exactly \sigma treatments by period t.
Towards this end, also define \mathcal{D}_t^\sigma = \{\mathbf{d}_t \in \mathcal{D}_t : \sigma_t(\mathbf{d}_t) = \sigma\} --- this is the set of treatment histories that result in \sigma cumulative treatments in period t. Then, consider
\begin{align*} ATT^{sum}(\sigma, t) = \sum_{\mathbf{d}_t \in \mathcal{D}_t^\sigma} ATT(\mathbf{d}_t, t) \P(\mathbf{D}_{it}=\mathbf{d}_t | \mathbf{D}_{it} \in \mathcal{D}_t^\sigma) \end{align*}
This is the average ATT(\mathbf{d}_t,t) across treatment regimes that lead to exactly \sigma treatments by period t
Similar to previous cases, ATT^{sum}(\sigma,t) is a weighted average of underlying 2x2 DID parameters
Averaging like this reduces the number of groups, and makes the estimation problem discussed above easier (the "effective" number of units is larger)
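The aggregation can be sketched in a few lines, assuming the history-specific ATT(\mathbf{d}_t, t) estimates and group sizes are already in hand. The function name `att_sum` and all numbers below are hypothetical; group-size shares stand in for the conditional probabilities in the formula.

```python
from collections import defaultdict


def att_sum(att_by_history, n_by_history):
    """Average ATT(d_t, t) over histories with the same cumulative number of treatments.

    Weights are group-size shares within each value of sigma, the sample
    analogue of P(D_it = d_t | D_it in D_t^sigma).
    """
    weighted_total = defaultdict(float)
    group_size = defaultdict(float)
    for history, att in att_by_history.items():
        sigma = sum(history)  # cumulative number of treated periods
        if sigma == 0:
            continue  # the never-treated history is the comparison group
        weighted_total[sigma] += att * n_by_history[history]
        group_size[sigma] += n_by_history[history]
    return {s: weighted_total[s] / group_size[s] for s in sorted(weighted_total)}


# Hypothetical period-3 estimates and group sizes, keyed by treatment history.
att_by_history = {(0, 1, 1): 2.0, (0, 1, 0): 1.0, (0, 0, 1): 3.0}
n_by_history = {(0, 1, 1): 10, (0, 1, 0): 30, (0, 0, 1): 10}

result = att_sum(att_by_history, n_by_history)
print(result)  # {1: 1.5, 2: 2.0}
```

Three history-specific parameters collapse into two, one for each value of \sigma, and the \sigma = 1 entry pools 40 units instead of treating groups of 30 and 10 separately.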
Even though ATT^{sum}(\sigma,t) (possibly substantially) reduces the dimensionality of the underlying group-time average treatment effect parameters, we might want to reduce more.
This is tricky though because the composition of the effective groups changes over time (just because two groups have the same number of cumulative treatments in one period doesn't mean that they have the same number in subsequent periods)
An alternative idea is to just report treatment effect parameters in the last period: ATT^{sum}(\sigma,\mathcal{T}) as a function of \sigma.
Unlike the staggered treatment adoption case, where ATT^{ES}(e) and ATT^O seem like good default parameters to report, it is not clear to me what (or if there is) a good default choice here.
Another caution: (I presume) comparing ATT-type parameters across different amounts of the treatment (here, across \sigma) will introduce selection bias terms unless additional assumptions hold
Notice that above, we only invoked parallel trends with respect to untreated potential outcomes.
But it seems within the spirit of DID to assume parallel trends for staying at the same treatment over time
Then we can recover group-time average treatment effects for switchers relative to stayers
See de Chaisemartin et al. (2022) and de Chaisemartin and d'Haultfoeuille (2023) for approaches along these lines
This results in many more disaggregated treatment effect parameters
We've covered a number of different settings, but we certainly haven't covered all of them
Ex. Suppose you have a multi-valued treatment that can change values over time
I'm not sure what exactly to do off the top of my head (and the exact thing to do likely depends on the particular goals of the application), but I think that you can get some ideas from extrapolating our discussion:
Step 1: Target disaggregated parameters
Step 2: If desired, choose an aggregated target parameter suitable to the application, and combine the underlying disaggregated parameters directly to recover it
Idea 1: Partial Identification. In some applications, it may seem reasonable to think that you know the sign of the selection bias. If this "works against" the sign of differences in m_\Delta(d) as d increases, this implies that you could still sign differences in ATT(d|d) as d increases
Idea 2: Strong PT Conditional on Covariates. It might be reasonable to assume strong parallel trends conditional on some other variable.
Example: For the minimum wage, it might be reasonable to assume that strong parallel trends holds across states within the same region of the country (say, West or South)
Evidence in favor of this: the distributions of minimum wage policies are very different across regions
We wouldn't be able to make a "full" comparison across all doses here, but we could learn about the employment effects of a $15.75 minimum wage (Washington) relative to a $13.20 one (Oregon).
Parallel Trends for Stayers: \begin{align*} \E[Y_{it}(d_{t-1},\mathbf{d}_{t-1}) - Y_{it-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{it-1} = \mathbf{d}_{t-1}] = \E[Y_{it}(d_{t-1},\mathbf{d}_{t-1}) - Y_{it-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{it} = (d_{t-1},\mathbf{d}_{t-1})] \end{align*}
In this case, you can recover the ATT for switchers: (here we are supposing that d_{t-1}=0, but can make an analogous argument in the opposite case) \begin{align*} ATT^{switchers}(\mathbf{d}_{t-1},t) &= \E[Y_{it}(1,\mathbf{d}_{t-1}) - Y_{it}(0,\mathbf{d}_{t-1}) | \mathbf{D}_{it} = (1,\mathbf{d}_{t-1})] \\ &\overset{\textrm{PTA}}{=} \E[\Delta Y_{it} | \mathbf{D}_{it}=(1,\mathbf{d}_{t-1})] - \E[\Delta Y_{it} | \mathbf{D}_{it}=(0,\mathbf{d}_{t-1})] \end{align*} That is, you can recover ATT^{switchers} by comparing the paths of outcomes for switchers to the path of outcomes for stayers (exactly what you'd expect!)
Given this sort of assumption, there may be a huge number of ATT^{switchers}(\mathbf{d}_{t-1},t) in realistic applications.
You could use these to further understand treatment effect heterogeneity
You could also propose some way to aggregate them into a lower-dimensional parameter
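The switchers-versus-stayers comparison can be sketched as follows for the case d_{t-1}=0. The record layout `(prev_history, d_t, delta_y)`, the function name `att_switchers`, and the numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def att_switchers(records):
    """Switcher-vs-stayer DiD for each previous history that ends untreated.

    records: iterable of (prev_history, d_t, delta_y), where prev_history is the
    treatment history through t-1, d_t is the treatment in period t, and
    delta_y is Y_t - Y_{t-1}.
    """
    cells = defaultdict(lambda: {0: [], 1: []})
    for prev_history, d_t, delta_y in records:
        if prev_history[-1] != 0:
            continue  # only the case d_{t-1} = 0 is handled in this sketch
        cells[prev_history][d_t].append(delta_y)
    estimates = {}
    for prev_history, cell in cells.items():
        if cell[0] and cell[1]:  # need both stayers and switchers
            estimates[prev_history] = mean(cell[1]) - mean(cell[0])
    return estimates


# Hypothetical period-4 data.
records = [
    ((0, 0, 0), 1, 5.0), ((0, 0, 0), 1, 7.0),  # switchers from never treated
    ((0, 0, 0), 0, 1.0), ((0, 0, 0), 0, 3.0),  # stayers, never treated
    ((0, 1, 0), 1, 4.0),                       # switcher with a past treatment spell
    ((0, 1, 0), 0, 2.0),                       # stayer with the same past spell
    ((0, 1, 1), 1, 9.0),                       # treated in t-1: skipped here
]

switcher_effects = att_switchers(records)
print(switcher_effects)  # {(0, 0, 0): 4.0, (0, 1, 0): 2.0}
```

Each distinct previous history gets its own parameter, which is why the number of ATT^{switchers}(\mathbf{d}_{t-1},t) can become large quickly.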