Session 4: More Complicated Treatment Regimes
\(\newcommand{\E}{\mathbb{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\) The discussion (and much of the recent DID literature) has focused on the setting with staggered treatment adoption.
However, this certainly does not cover the full range of possible treatments. In this session, we’ll primarily consider three leading extensions:
A treatment that is multi-valued or continuous (e.g., length of school closures during Covid on student test scores)
A treatment that can turn on and off (e.g., union status)
Treatment that can change amounts—we’ll try to take our minimum wage example more seriously
A couple of things to notice as we go along:
I’m not going to cover much on TWFE regressions here. In these settings, they have even more ways to go wrong.
Try to pay attention to the pattern. Even though the arguments are getting more complicated, we are still following the idea of (i) target disaggregated parameters, (ii) combine them into lower dimensional objects, and (iii) here there will be some additional interpretation issues that are worth emphasizing.
The arguments here will be for the case with a continuous treatment, but analogous results hold for other settings.
Running Example: Causal effect of the length of school closures on student test scores
Potential outcomes notation
Two time periods: \(t=2\) and \(t=1\)
Potential outcomes: \(Y_{it=2}(d)\)
Observed outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)
\[Y_{it=2}=Y_{it=2}(D_i) \quad \textrm{and} \quad Y_{it=1}=Y_{it=1}(0)\]
Level Effects (Average Treatment Effect on the Treated)
\[ATT(d|d) := \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d]\]
Interpretation: The average effect of dose \(d\) relative to not being treated local to the group that actually experienced dose \(d\)
This is the natural analogue of \(ATT\) in the binary treatment case
Slope Effects (Average Causal Response on the Treated)
\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]
Notice that \(ATT(d|d)\) and \(ACRT(d|d)\) are functional parameters
We can view \(ATT(d|d)\) and \(ACRT(d|d)\) as the “building blocks” for a more aggregated parameter. Aggregated versions of these (into a single number) are \[\begin{align*} ATT^o := \E[ATT(D|D)|D>0] \qquad \qquad ACRT^o := \E[ACRT(D|D)|D>0] \end{align*}\]
\(ATT^o\) averages \(ATT(d|d)\) over the distribution of the dose among treated units (\(D>0\))
\(ACRT^o\) averages \(ACRT(d|d)\) over the distribution of the dose among treated units (\(D>0\))
\(ACRT^o\) is the natural target parameter for the TWFE regression in this case
“Standard” Parallel Trends Assumption
For all \(d\),
\[\E[\Delta Y_{t=2}(0) | D=d] = \E[\Delta Y_{t=2}(0) | D=0]\]
Then,
\[ \begin{aligned} ATT(d|d) &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d]\\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=d]\\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[\Delta Y_{t=2}(0) | D=0]\\ &= \E[\Delta Y_{t=2} | D=d] - \E[\Delta Y_{t=2} | D=0] \end{aligned} \]
This is exactly what you would expect
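A minimal simulation sketch of this identification result, under assumptions that are not from the slides: three hypothetical dose values, a common trend of 0.3, and made-up values for \(ATT(d|d)\). It simply compares the mean change in outcomes for each dose group to the mean change for the untreated group.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical discrete doses; 0 is the untreated comparison group.
dose = rng.choice([0.0, 1.0, 2.0], size=n)
unit_effect = rng.normal(size=n)

# Assumed treatment effects (illustration only): ATT(1|1) = 0.5, ATT(2|2) = 1.5.
effect = np.where(dose == 1.0, 0.5, np.where(dose == 2.0, 1.5, 0.0))

y1 = unit_effect + rng.normal(size=n)                 # period 1: nobody is treated
y2 = unit_effect + 0.3 + effect + rng.normal(size=n)  # period 2: common trend of 0.3
delta_y = y2 - y1


def att_hat(d):
    """Sample analogue of E[dY | D = d] - E[dY | D = 0]."""
    return delta_y[dose == d].mean() - delta_y[dose == 0.0].mean()


print(att_hat(1.0), att_hat(2.0))  # should land close to 0.5 and 1.5
```

Parallel trends holds by construction here because every group shares the same 0.3 trend in untreated potential outcomes.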
Unfortunately, this is not the end of the story
Most empirical work with a continuous treatment wants to think about how causal responses vary across dose
There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects
At a high-level, these issues arise from a tension between empirical researchers wanting to use a quasi-experimental research design (which delivers “local” treatment effect parameters) but (often) wanting to compare these “local” parameters to each other
Unlike the staggered, binary treatment case: No easy fixes here!
Consider comparing \(ATT(d|d)\) for two different doses
\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l)\\ & \hspace{25pt} = \E[Y_{t=2}(d_h)-Y_{t=2}(d_l) | D=d_h] + \E[Y_{t=2}(d_l) - Y_{t=2}(0) | D=d_h] - \E[Y_{t=2}(d_l) - Y_{t=2}(0) | D=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{t=2}(d_h) - Y_{t=2}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]
“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here
Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends
This problem spills over into identifying \(ACRT(d|d)\)
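To see the decomposition with numbers, here is a small sketch using made-up mean potential outcomes (every value is an assumption for illustration). Both groups have the same mean untreated potential outcome in period 2, so, if they also share the same period-1 mean, standard parallel trends holds. Even so, the difference in \(ATT\)’s does not equal the causal response.

```python
# Assumed mean potential outcomes in period 2 (illustration only):
# mean_y[group][dose] stands in for E[Y_{t=2}(dose) | D = group].
mean_y = {
    1: {0: 0.0, 1: 1.0, 2: 1.5},  # units that actually chose dose 1
    2: {0: 0.0, 1: 2.0, 2: 3.0},  # units that actually chose dose 2
}


def att(dose, group):
    """ATT(dose | group) = E[Y(dose) - Y(0) | D = group]."""
    return mean_y[group][dose] - mean_y[group][0]


d_l, d_h = 1, 2
att_difference = att(d_h, d_h) - att(d_l, d_l)          # what DID comparisons deliver
causal_response = mean_y[d_h][d_h] - mean_y[d_h][d_l]   # E[Y(d_h) - Y(d_l) | D = d_h]
selection_bias = att(d_l, d_h) - att(d_l, d_l)          # ATT(d_l|d_h) - ATT(d_l|d_l)

print(att_difference, causal_response, selection_bias)  # 2.0 1.0 1.0
```

In this example the high-dose group would have gained more than the low-dose group even at the low dose, which is exactly the selection bias term.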
Intuition:
Difference-in-differences identification strategies result in \(ATT(d|d)\) parameters. These are local parameters and are difficult to compare to each other
This explanation is similar to thinking about LATEs with two different instruments
Thus, comparing \(ATT(d|d)\) across different values is tricky and not for free
What can you do?
One idea: just recover \(ATT(d|d)\) and interpret it cautiously (interpret it by itself, not relative to different values of \(d\))
If you want to compare them to each other, it will come with the cost of additional (structural) assumptions
“Strong” Parallel Trends Assumption
For all doses \(d\) and \(l\),
\[\mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D=l] = \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D=d]\]
This is notably different from “Standard” Parallel Trends
It involves potential outcomes for all values of the dose (not just untreated potential outcomes)
All dose groups would have experienced the same path of outcomes had they been assigned the same dose
Strong parallel trends is equivalent to a restriction on treatment effect heterogeneity. Notice:
\[ \begin{aligned} ATT(d|d) &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d] \\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=d] \\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=l] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=l] \\ &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=l] = ATT(d|l) \end{aligned} \]
Since this holds for all \(d\) and \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{t=2}(d) - Y_{t=2}(0)]\). Thus, under strong parallel trends, we have that
\[ATE(d) = \E[\Delta Y_{t=2}|D=d] - \E[\Delta Y_{t=2}|D=0]\]
RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” parallel trends, but here
assumptions are different
parameter interpretation is different
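One way to fill in the step from \(ATT(d|d) = ATT(d|l)\) for all \(l\) to \(ATT(d|d) = ATE(d)\) is the law of iterated expectations. A sketch, where \(ATT(d|D)\) denotes \(ATT(d|l)\) evaluated at the random dose \(D\):

\[ \begin{aligned} ATE(d) &= \E[Y_{t=2}(d) - Y_{t=2}(0)] \\ &= \E\big[ \E[Y_{t=2}(d) - Y_{t=2}(0) | D] \big] \\ &= \E[ATT(d|D)] \\ &= ATT(d|d) \end{aligned} \]

The last equality holds because, under strong parallel trends, \(ATT(d|l)\) does not vary with \(l\).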
ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose
\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t=2}(d_h) - Y_{t=2}(0)] - \E[Y_{t=2}(d_l) - Y_{t=2}(0)]\\ &= \underbrace{\E[Y_{t=2}(d_h) - Y_{t=2}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]
Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption
Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting
\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^o := \E[ACR(D) | D>0] \end{aligned} \]
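A rough sketch of how one might estimate these slope effects, assuming strong parallel trends and (an assumption made only for this example) that a quadratic in the dose is a good approximation to \(\E[\Delta Y_{t=2}|D=d]\). The data-generating process, including the dose-response function \(ATE(d) = 2d - 0.5d^2\), is made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical design: about 20% untreated, treated doses uniform on [0.5, 2].
treated = rng.random(n) < 0.8
dose = np.where(treated, rng.uniform(0.5, 2.0, size=n), 0.0)

# Assumed dose-response: ATE(d) = 2d - 0.5 d^2, so ACR(d) = 2 - d.
delta_y = 0.3 + 2.0 * dose - 0.5 * dose**2 + rng.normal(size=n)

# Quadratic fit of E[dY | D = d] among treated units (a modelling assumption).
m_hat = np.poly1d(np.polyfit(dose[treated], delta_y[treated], deg=2))
acr_hat = m_hat.deriv()  # derivative of the fitted curve


def ate_hat(d):
    """Fitted E[dY | D = d] minus the mean change among untreated units."""
    return m_hat(d) - delta_y[~treated].mean()


# ACR^o: average the estimated slope over the observed treated doses.
acr_o_hat = acr_hat(dose[treated]).mean()
print(ate_hat(1.0), acr_o_hat)  # should be near 1.5 and 0.75 in this design
```

In practice, the functional form for the dose-response curve is a choice; a more flexible (e.g., nonparametric) fit may be preferable when the shape is unknown.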
Can you relax strong parallel trends?
Positive side-comment: No untreated units
Consider the same TWFE regression (but now \(D_{it}\) is continuous): \[\begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*}\] You can show that \[\begin{align*} \alpha = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}\] where \(m_\Delta(l) := \E[\Delta Y_{t=2}|D=l] - \E[\Delta Y_{t=2}|D=0]\) and \(w(l)\) are weights
Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)
Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).
Thus, issues related to selection bias continue to show up here
About the weights: they are all positive, but have some strange properties (e.g., they are always maximized at \(l = \E[D]\), even if this is not a common value for the dose)
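As a mechanical check on what \(\alpha\) is in the two-period case (where nobody is treated in the first period), the TWFE coefficient coincides with the slope from regressing \(\Delta Y_{t=2}\) on \(D\). A small sketch with simulated data; the data-generating process is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical doses: roughly 30% untreated, the rest uniform on [0.5, 2].
dose = np.where(rng.random(n) < 0.3, 0.0, rng.uniform(0.5, 2.0, size=n))
eta = rng.normal(size=n)
y1 = eta + rng.normal(size=n)
y2 = eta + 0.3 + 1.0 * dose + rng.normal(size=n)

# TWFE regression on the stacked panel: unit dummies, a period-2 dummy, and D_it.
y = np.concatenate([y1, y2])
d_it = np.concatenate([np.zeros(n), dose])       # D_it = 0 for everyone in period 1
post = np.concatenate([np.zeros(n), np.ones(n)])
unit_dummies = np.vstack([np.eye(n), np.eye(n)])
X = np.column_stack([unit_dummies, post, d_it])
alpha_twfe = np.linalg.lstsq(X, y, rcond=None)[0][-1]

# Slope from regressing the change in outcomes on the dose (with an intercept).
delta_y = y2 - y1
alpha_fd = np.cov(delta_y, dose)[0, 1] / np.var(dose, ddof=1)

print(alpha_twfe, alpha_fd)  # the two agree up to floating point error
```

So \(\alpha\) summarizes how \(\E[\Delta Y_{t=2}|D=l]\) varies with the dose through a single linear slope, which is why it inherits the interpretation issues of \(m'_\Delta(l)\).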
Other issues can arise in more complicated cases
For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment
In general, things get worse for TWFE regressions with more complications
This is a simplified version of Acemoglu and Finkelstein (2008)
1983 Medicare reform that eliminated labor subsidies for hospitals
Medicare moved to the Prospective Payment System (PPS), which replaced “full cost reimbursement” with “partial cost reimbursement”. This eliminated reimbursements for labor while maintaining reimbursements for capital expenses
Rough idea: This changes relative factor prices which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.
In the paper, we provide some theoretical arguments concerning properties of production functions that suggest that strong parallel trends holds.
Hospital-reported data from the American Hospital Association, yearly from 1980-1986
Outcome is capital/labor ratio
proxy using the depreciation share of total operating expenses (avg. 4.5%)
our setup: collapse to two periods by taking average in pre-treatment periods and average in post-treatment periods
Dose is “exposure” to the policy
the number of Medicare patients in the period before the policy was implemented
roughly 15% of hospitals are untreated (have essentially no Medicare patients)
“Scarring” vs. Moving in and out of treatment
Example treatments:
Union status (Vella and Verbeek, 1998)
Whether or not location hit by hurricane (Deryugina, 2017)
Whether or not a district shares the same ethnicity as the president of the country (Burgess, et al., 2015)
Additional Notation:
We can make a lot of progress by redefining our notion of a “group”
Keep track of entire treatment regime \(\mathbf{D}_i := (D_{i1}, \ldots, D_{iT})'\) and/or treatment history up to period \(t\): \(\mathbf{D}_{it} := (D_{i1}, \ldots, D_{it})'\).
Potential outcomes \(Y_{it}(\mathbf{d}_t)\) where \(\mathbf{d}_t\) is some treatment history up to period \(t\) (this notation imposes “no anticipation” — potential outcomes do not depend on future treatments). Observed outcomes: \(Y_{it}(\mathbf{D}_{it})\)
\(\mathbf{0}_t\) denotes not participating in the treatment in any period up to period \(t\)
In this case, we’ll define groups by their treatment histories \(\mathbf{d}_t\). Thus, we can consider group-time average treatment effects defined by \[\begin{align*} ATT(\mathbf{d}_t, t) := \E[Y_{t}(\mathbf{d}_t) - Y_{t}(\mathbf{0}_t) | \mathbf{D}_{t} = \mathbf{d}_t] \end{align*}\]
In-and-Out Parallel Trends Assumption:
For all \(t=2,\ldots,T\), and for all \(\mathbf{d}_t \in \mathcal{D}_t\), \[\begin{align*} \E[\Delta Y_{t}(\mathbf{0}_t) | \mathbf{D}_{t} = \mathbf{d}_t] = \E[\Delta Y_{t}(\mathbf{0}_t) | \mathbf{D}_{t} = \mathbf{0}_t] \end{align*}\]
Identification: In this setting, under the parallel trends assumption, we have that \[\begin{align*} ATT(\mathbf{d}_t, t) = \E[Y_{t} - Y_{1} | \mathbf{D}_{t} = \mathbf{d}_t] - \E[Y_{t} - Y_{1} | \mathbf{D}_{t} = \mathbf{0}_t] \end{align*}\]
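A minimal sketch of this estimand, using a tiny made-up panel (the data and the helper name `att_by_history` are hypothetical). It groups units by their treatment history up to period \(t\) and compares long differences \(Y_t - Y_1\) to those of units that are untreated in every period up to \(t\).

```python
import numpy as np

# Hypothetical balanced panel: rows are units, columns are periods 1..T.
D = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])
Y = np.array([
    [1.0, 1.5, 2.0],
    [2.0, 2.5, 3.0],
    [1.0, 2.5, 2.2],
    [3.0, 4.5, 4.2],
    [2.0, 3.5, 5.0],
    [0.0, 0.5, 2.0],
])


def att_by_history(Y, D, t):
    """Estimate ATT(d_t, t) for each non-zero history up to period t (1-indexed)."""
    histories = [tuple(int(x) for x in row[:t]) for row in D]
    long_diff = Y[:, t - 1] - Y[:, 0]          # Y_t - Y_1
    zero = tuple([0] * t)
    untreated = np.array([h == zero for h in histories])
    baseline = long_diff[untreated].mean()     # path of the untreated-so-far group
    out = {}
    for h in sorted(set(histories) - {zero}):
        members = np.array([g == h for g in histories])
        out[h] = long_diff[members].mean() - baseline
    return out


print(att_by_history(Y, D, 3))
```

Even with three periods there are already three distinct treated histories, and each one is estimated from only one or two units.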
This argument is straightforward and analogous to what we have done before. However…
There are a number of additional complications that arise here.
There are way more possible groups here than in the staggered treatment case (you can think of this as leading to a kind of curse of dimensionality)
\(\implies\) small groups \(\implies\) imprecise estimates and (possibly) invalid inferences
also makes it harder to report the results
The previous point provides an additional reason to try to aggregate the group-time average treatment effects. However, this is also not so straightforward.
Probably the simplest approach is to just make “timing groups” on the basis of the first period when a unit experiences the treatment
We have (kind of) been doing this in our minimum wage application
Lots of papers (e.g., job displacement, hospitalization) have used this idea
Formally, it amounts to averaging over all subsequent treatment decisions (de Chaisemartin and d’Haultfœuille, 2024)
In math: Define \(M_i := \min\{t : D_{it} = 1\}\), then we can consider the (timing-group)-time average treatment effects: \[ATT(m,t) := \E[Y_{t}(\mathbf{D}_{t}) - Y_{t}(\mathbf{0}_t) | M = m]\]
If the treatment were staggered, these would be exactly the group-time average treatment effects discussed earlier
Can show that these are averages of \(ATT(\mathbf{d}_t, t)\) across different treatment histories that have the same \(M_i\).
But there are other ideas too. For example, you could target the average treatment effect across all periods that a unit participated in the treatment
Define \(C_i := \displaystyle \sum_{t=2}^T D_{it}\) — the total number of periods that unit \(i\) was treated
Unit-specific average treatment effect \[\bar{\tau}_i = \frac{1}{C_i} \sum_{t=2}^{T} D_{it} \big(Y_{it}(\mathbf{D}_{it}) - Y_{it}(\mathbf{0}_t) \big)\] This is the average treatment effect for unit \(i\) in all the periods that it was treated
Overall average treatment effect: \[ATT^o := \E[\bar{\tau} | \mathbf{D} \neq \mathbf{0}_T]\]
Can show that this is a different weighted average of \(ATT(\mathbf{d}_t, t)\).
This sort of parameter might be interesting in applications where treatment status changes often and treatment effects are short-lived
Suppose that you were interested in the average treatment effect of experiencing some cumulative number of treatments over time (e.g., how many years someone was in a union).
Consider the average treatment effect parameter \[ATT^{sum}(\sigma) := \E\Big[Y_{T}(\mathbf{D}) - Y_{T}(\mathbf{0}) \big| C=\sigma\Big]\] which is the average treatment effect (in the last period) among those units that experienced \(\sigma\) total treatments across all years
As before, you can show that this is a weighted average of \(ATT(\mathbf{d}_t, t)\).
Can report \(ATT^{sum}(\sigma)\) for different values of \(\sigma\).
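The grouping variables used in these aggregations are easy to construct from a treatment matrix. A small sketch with a made-up matrix; coding never-treated units as `np.inf` is just a labelling choice made here.

```python
import numpy as np

# Hypothetical treatment matrix: rows are units, columns are periods 1..T.
D = np.array([
    [0, 0, 0, 0],  # never treated
    [0, 1, 0, 1],  # moves in and out of treatment
    [0, 0, 1, 1],  # turns on and stays on
])

periods = np.arange(1, D.shape[1] + 1)  # period labels 1..T
ever_treated = D.any(axis=1)

# M_i: first period in which unit i is treated (np.inf if never treated).
M = np.where(ever_treated, periods[D.argmax(axis=1)], np.inf)

# C_i: total number of treated periods, summing over t = 2, ..., T.
C = D[:, 1:].sum(axis=1)

print(M, C)  # M is [inf, 2, 3] and C is [0, 2, 2]
```

Units 2 and 3 have the same \(C_i\) but different treatment histories, which is the sense in which \(ATT^{sum}(\sigma)\) averages over histories.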
Unlike the staggered treatment adoption case, where \(ATT^{es}(e)\) and \(ATT^o\) seem like good default parameters to report, it is not clear to me what (or if there is) a good default choice here.
Another caution is that (I presume) the issues about interpreting \(ATT\)-type parameters across different amounts of the treatment (e.g., across \(\sigma\)) will introduce selection bias terms except under additional assumptions
If we engage seriously with differing minimum wages across states, this is related to (but not exactly the same as) either of the two cases considered previously.
Unique features of minimum wage application:
Multiple values of the treatment
Amount can change over time
But (in our sample) treatment does not ever turn back off
It is straightforward for us to get \(ATT(\mathbf{d}_t, t)\). This amounts to just estimating treatment effects for each treated state in our data in each time period.
The example here is small enough that perhaps we could just show disaggregated results, but this would not be true for most applications.
Goals:
Come up with a version of an event study (that acknowledges different treatment amounts)
Come up with an overall average treatment effect parameter (also acknowledging different treatment amounts)
It is less clear how to aggregate them. I will propose an idea, but you could certainly come up with something else.
For counties that experienced treatment regime \(\mathbf{d}_t\), consider the scaled treatment effect \[\frac{Y_{it}(\mathbf{d}_t) - Y_{it}(\mathbf{0}_t)}{d_t}\] which is the effect of the minimum wage scaled by the minimum wage in the current period
Define \(M_i\) as the first time a state raised its minimum wage
Consider the following parameter \[ATT^{scaled}(m,t) := \E\left[ \frac{Y_{t}(\mathbf{D}_{t}) - Y_{t}(\mathbf{0}_t)}{D_{t}} \Big| M = m \right]\] which is the average per dollar effect of the minimum wage increase on employment in period \(t\) across those that first raised the minimum wage in period \(m\)
Can show that this is an average of \(\frac{ATT(\mathbf{d}_t, t)}{d_t}\) across different treatment histories that have \(M_i=m\).
we can average across \(m,t\) to get an event study or an overall average treatment effect — interpret both as per dollar effect of minimum wage increases on employment
per dollar \(\widehat{ATT}^o = -0.058\), \(\textrm{s.e.}=0.018\).
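A sketch of the per dollar aggregation within one timing group and period, using made-up inputs (the numbers below are hypothetical and are not estimates from the application). Each treatment history contributes \(ATT(\mathbf{d}_t, t)/d_t\), weighted by its share of units in the timing group.

```python
# Hypothetical inputs for treatment histories sharing the same M_i = m in period t:
# "att" stands in for an estimate of ATT(d_t, t), "d_t" for the current treatment
# amount, and "n" for the number of units with that history.
histories = [
    {"att": -0.04, "d_t": 1.0, "n": 30},
    {"att": -0.09, "d_t": 1.5, "n": 10},
]

total = sum(h["n"] for h in histories)

# Weighted average of ATT(d_t, t) / d_t with weights n / total.
att_scaled = sum(h["att"] / h["d_t"] * h["n"] / total for h in histories)

print(att_scaled)  # approximately -0.045
```

The same weighting logic can then be applied across \((m, t)\) cells to build an event study or an overall per dollar effect.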
We’ve covered a number of different settings, but we certainly haven’t covered all of them
Using new, heterogeneity-robust approaches typically requires customized approaches in complicated settings (unlike TWFE regressions)
In my view, this is a feature of new approaches (rather than a weakness). As researchers, I think we should grapple with complexity of the problems that we are studying
What should you do?
My goal in this section is to provide at least a recipe for dealing with complicated treatment regimes
Step 1: Target disaggregated parameters
Step 2: If desired, choose aggregated target parameter suitable to the application, combine underlying disaggregated parameters directly to recover this parameter
Some ideas:
Conditioning on some covariates could make strong parallel trends more plausible.
It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.
In this case, it is not possible to recover level effects like \(ATT(d|d)\).
However, notice that \[\begin{aligned}& \E[\Delta Y | D=d_h] - \E[\Delta Y | D=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y | D=d_h] - \E[\Delta Y(0) | D=d_h]\Big) - \Big(\E[\Delta Y | D=d_l]-\E[\Delta Y(0) | D=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]
In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.
Still face issues related to selection bias / strong parallel trends though
Strategies like binarizing the treatment can still work (though be careful!)
If you classify units as being treated or untreated, you can recover the \(ATT\) of being treated at all.
On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”
That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same
There are other additional assumptions that could be attractive in applications like this
Notice that above, we only invoked parallel trends with respect to untreated potential outcomes.
But it seems within the spirit of DID to assume parallel trends for staying at the same treatment over time
This approach could potentially greatly increase the amount of information that we are able to use and results in many more disaggregated treatment effect parameters
Assumptions that limit the “memory” of potential outcomes could be attractive in some applications
e.g., \(Y_{t}(\mathbf{d}_t) = Y_{t}(\mathbf{d}_{t-5:t})\) — potential outcomes only depend on treatments in the last 5 periods
this allows pooling across treatment histories
could increase the size of the comparison group
Assumptions that limit treatment effect dynamics could be attractive in some applications
For example, if a unit has been treated for 5 years in a row, then their trend in outcomes over time goes back to being the same as the trend in untreated potential outcomes (though the level could still be affected by the treatment)
I think this is what event studies that bin the endpoints have in mind
This allows those units with a “steady” treatment to eventually re-enter the comparison group (and this is often a testable assumption)
Parallel Trends Assumption for Stayers
For any treatment history \(\mathbf{d}_{t-1}\),
\[\begin{align*} \E[Y_{t}(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{t-1} = \mathbf{d}_{t-1}] = \E[Y_{t}(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{t} = (d_{t-1},\mathbf{d}_{t-1})] \end{align*}\]
In this case, you can recover the \(ATT\) for switchers: (here we are supposing that \(d_{t-1}=0\), but can make an analogous argument in the opposite case) \[\begin{align*} ATT^{switchers}(\mathbf{d}_{t-1},t) &= \E[Y_{t}(1,\mathbf{d}_{t-1}) - Y_{t}(0,\mathbf{d}_{t-1}) | \mathbf{D}_{t} = (1,\mathbf{d}_{t-1})] \\ &\overset{\textrm{PTA}}{=} \E[\Delta Y_{t} | \mathbf{D}_{t}=(1,\mathbf{d}_{t-1})] - \E[\Delta Y_{t} | \mathbf{D}_{t}=(0,\mathbf{d}_{t-1})] \end{align*}\] That is, you can recover \(ATT^{switchers}\) by comparing the paths of outcomes for switchers to the path of outcomes for stayers (exactly what you’d expect!)
Given this sort of assumption, there may be a huge number of \(ATT^{switchers}(\mathbf{d}_{t-1},t)\) in realistic applications.
You could use these to further understand treatment effect heterogeneity
You could also propose some way to aggregate them into a lower dimensional object
\(\mu(\mathbf{d}_t) := d_t\) — “how much” treated in this period
\(\varrho(\mathbf{d}_t) := \min\{s : d_s \in \mathbf{d}_t, d_s \neq 0\}\) — first period treated
Building block parameter: Define \(\mathcal{D}_t^{\mu,\varrho} = \{\mathbf{d}_t \in \mathcal{D}_t : \mu(\mathbf{d}_t) = \mu, \varrho(\mathbf{d}_t) = \varrho\}\) — this is the set of states that have a minimum wage equal to \(\mu\) in period \(t\) and first increased their minimum wage in period \(\varrho\). Then, consider
\[ATT^{per}(\mu, \varrho, t) = \sum_{\mathbf{d}_t \in \mathcal{D}_t^{\mu,\varrho}} \frac{ATT(\mathbf{d}_t, t)}{\mu(\mathbf{d}_t)} \P(\mathbf{D}_{t} = \mathbf{d}_t | \mathbf{D}_{t} \in \mathcal{D}_t^{\mu,\varrho})\]
This is the (per-dollar) \(ATT\) of having a minimum wage \(\mu\) in period \(t\) among states that (a) actually had a minimum wage of \(\mu\) in period \(t\) and (b) first increased their minimum wage in period \(\varrho\).
Next, define \(M_t= \{\mu : \mu\}\)
Further consider
\[ATT^{per}(\varrho, t) = \sum_{\mu \in M_t} ATT^{per}(\mu, \varrho, t) \P(\mu(\mathbf{d}_t))\]