Modern Approaches to Difference-in-Differences

Session 4: More Complicated Treatment Regimes

Brantly Callaway

University of Georgia

Introduction

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

The discussion so far (and much of the recent DID literature) has focused on the setting with staggered treatment adoption.

However, this certainly does not cover the full range of possible treatments. In this session, we’ll primarily consider three leading extensions:

  1. A treatment that is multi-valued or continuous (e.g., length of school closures during Covid on student test scores)

  2. A treatment that can turn on and off (e.g., union status)

  3. Treatment that can change amounts—we’ll try to take our minimum wage example more seriously

A couple of things to notice as we go along:

  • I’m not going to cover much on TWFE regressions here. Even more can go wrong with them in these settings.

  • Try to pay attention to the pattern. Even though the arguments are getting more complicated, we are still following the same idea: (i) target disaggregated parameters, (ii) combine them into lower-dimensional objects, and (iii) here, work through some additional interpretation issues that are worth emphasizing

Part 1: DID with a Continuous Treatment

Introduction

The arguments here will be for the case with a continuous treatment, but analogous results hold for other settings:

  • Multi-valued treatment
  • Differential exposure to a binary treatment

Running Example: Causal effect of the length of school closures on student test scores

Continuous Treatment Notation

Potential outcomes notation

  • Two time periods: \(t=2\) and \(t=1\)

    • No one treated until period \(t=2\)
    • Some units remain untreated in period \(t=2\)
  • Potential outcomes: \(Y_{it=2}(d)\)

  • Observed outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)

    \[Y_{it=2}=Y_{it=2}(D_i) \quad \textrm{and} \quad Y_{it=1}=Y_{it=1}(0)\]

Parameters of Interest (ATT-type)

Level Effects (Average Treatment Effect on the Treated)

\[ATT(d|d) := \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d]\]

  • Interpretation: The average effect of dose \(d\) relative to not being treated local to the group that actually experienced dose \(d\)

  • This is the natural analogue of \(ATT\) in the binary treatment case

Parameters of Interest (ATT-type)

Slope Effects (Average Causal Response on the Treated)

\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]

  • Interpretation: \(ACRT(d|d)\) is the causal effect of a marginal increase in dose local to units that actually experienced dose \(d\)

Aggregated Parameters

Notice that \(ATT(d|d)\) and \(ACRT(d|d)\) are functional parameters

  • This is different from \(\alpha\) (from the TWFE regression of \(Y_{it}\) on \(D_{it}\))

We can view \(ATT(d|d)\) and \(ACRT(d|d)\) as the “building blocks” for a more aggregated parameter. Aggregated versions of these (into a single number) are \[\begin{align*} ATT^o := \E[ATT(D|D)|D>0] \qquad \qquad ACRT^o := \E[ACRT(D|D)|D>0] \end{align*}\]

  • \(ATT^o\) averages \(ATT(d|d)\) over the distribution of the dose among treated units (\(D>0\))

  • \(ACRT^o\) averages \(ACRT(d|d)\) over the same distribution

  • \(ACRT^o\) is the natural target parameter for the TWFE regression in this case
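As a rough sketch of the aggregation step only (not of estimation): with a discrete dose, \(ATT^o\) is the average of \(ATT(d|d)\) weighted by how often each dose occurs among treated units. The function name `aggregate_over_dose` and all numbers below are hypothetical and chosen for illustration.

```python
def aggregate_over_dose(param_by_dose, dose):
    """Average a dose-specific parameter over the dose distribution among treated units.

    param_by_dose: dict mapping each positive dose d to ATT(d|d) (or ACRT(d|d))
    dose:          list of observed doses D_i, where 0 means untreated
    """
    treated = [d for d in dose if d > 0]
    if not treated:
        raise ValueError("no treated units (D > 0)")
    # Each treated unit contributes the parameter evaluated at its own dose,
    # which is the sample analogue of E[ATT(D|D) | D > 0].
    return sum(param_by_dose[d] for d in treated) / len(treated)


# Hypothetical inputs: ATT(1|1) = 3 and ATT(2|2) = 5;
# three treated units have dose 1 and one treated unit has dose 2.
att_by_dose = {1: 3.0, 2: 5.0}
dose = [0, 0, 1, 1, 1, 2]
att_o = aggregate_over_dose(att_by_dose, dose)  # (3 + 3 + 3 + 5) / 4
```

The untreated units do not enter the average, which mirrors the conditioning on \(D>0\) in the definitions above.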

Identification

“Standard” Parallel Trends Assumption

For all \(d\),

\[\E[\Delta Y_{t=2}(0) | D=d] = \E[\Delta Y_{t=2}(0) | D=0]\]

Then,

\[ \begin{aligned} ATT(d|d) &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d] \hspace{150pt}\\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=d]\\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[\Delta Y_{t=2}(0) | D=0]\\ &= \E[\Delta Y_{t=2} | D=d] - \E[\Delta Y_{t=2} | D=0] \end{aligned} \]

This is exactly what you would expect
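A minimal sketch of the sample analogue of the last line, assuming a discrete dose so that each dose group can be averaged separately. The function name `att_by_dose` and the toy data are hypothetical; with a truly continuous dose, some form of smoothing or regression would be needed instead of group means.

```python
from collections import defaultdict
from statistics import mean


def att_by_dose(dose, dy):
    """Estimate ATT(d|d) = E[dY | D=d] - E[dY | D=0] for each observed dose d > 0.

    dose: list of doses D_i (0 means untreated)
    dy:   list of outcome changes Y_{i,t=2} - Y_{i,t=1}
    """
    changes = defaultdict(list)
    for d, change in zip(dose, dy):
        changes[d].append(change)
    if 0 not in changes:
        raise ValueError("need an untreated (D=0) comparison group")
    # Under standard parallel trends, the untreated group's average change
    # stands in for the change each dose group would have had without treatment.
    untreated_trend = mean(changes[0])
    return {d: mean(v) - untreated_trend for d, v in sorted(changes.items()) if d > 0}


# Hypothetical toy data: doses 0, 1, and 2 with two units each.
dose = [0, 0, 1, 1, 2, 2]
dy = [1.0, 3.0, 4.0, 6.0, 5.0, 9.0]
att = att_by_dose(dose, dy)
```

In this toy example the untreated trend is 2, so the estimates are 3 for dose 1 and 5 for dose 2.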

Are we done?

Unfortunately, no

Most empirical work with a continuous treatment wants to think about how causal responses vary across dose

  • Plot treatment effects as a function of dose and ask: does more dose tend to increase/decrease/not affect outcomes?
  • Average causal response parameters inherently involve comparisons across slightly different doses

There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects

  • At a high-level, these issues arise from a tension between empirical researchers wanting to use a quasi-experimental research design (which delivers “local” treatment effect parameters) but (often) wanting to compare these “local” parameters to each other

  • Unlike the staggered, binary treatment case: No easy fixes here!

Interpretation Issues

Consider comparing \(ATT(d|d)\) for two different doses

\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \E[Y_{t=2}(d_h)-Y_{t=2}(d_l) | D=d_h] + \E[Y_{t=2}(d_l) - Y_{t=2}(0) | D=d_h] - \E[Y_{t=2}(d_l) - Y_{t=2}(0) | D=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{t=2}(d_h) - Y_{t=2}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]

“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here

  • Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends

  • This problem spills over into identifying \(ACRT(d|d)\)
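To make the decomposition concrete, here is a small numeric sketch in which all potential outcomes are known by construction. It assumes (hypothetically) that the effect of dose \(l\) is linear in \(l\) with a slope that depends on the group's actual dose, so that units choosing a higher dose also gain more per unit of dose. The names `slope` and `att` are illustrative.

```python
# Hypothetical per-unit-of-dose gains for each dose group (keyed by actual dose D).
# The high-dose group benefits more from any given dose: selection on gains.
slope = {1: 1.0, 2: 2.0}


def att(l, d):
    """ATT(l|d): average effect of dose l (relative to 0) among units with actual dose d."""
    return slope[d] * l


d_l, d_h = 1, 2

# What a comparison of the two identified parameters delivers.
naive_difference = att(d_h, d_h) - att(d_l, d_l)

# Causal response: effect of moving from d_l to d_h for the high-dose group.
causal_response = att(d_h, d_h) - att(d_l, d_h)

# Selection bias: the two groups respond differently to the same dose d_l.
selection_bias = att(d_l, d_h) - att(d_l, d_l)
```

Here the naive difference is 3, of which only 2 is a causal response; the remaining 1 is selection bias. If both groups had the same slope, the selection bias term would be zero.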

Interpretation Issues

Intuition:

  • Difference-in-differences identification strategies result in \(ATT(d|d)\) parameters. These are local parameters and difficult to compare to each other

  • This explanation is similar to thinking about LATEs with two different instruments

  • Thus, comparing \(ATT(d|d)\) across different values is tricky and not for free

What can you do?

  • One idea, just recover \(ATT(d|d)\) and interpret it cautiously (interpret it by itself not relative to different values of \(d\))

  • If you want to compare them to each other, it will come with the cost of additional (structural) assumptions

Introduce Stronger Assumptions

“Strong” Parallel Trends Assumption

For all doses \(d\) and \(l\),

\[\mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D=l] = \mathbb{E}[Y_{t=2}(d) - Y_{t=1}(0) | D=d]\]

  • This is notably different from “Standard” Parallel Trends

  • It involves potential outcomes for all values of the dose (not just untreated potential outcomes)

  • All dose groups would have experienced the same path of outcomes had they been assigned the same dose

Introduce Stronger Assumptions

Strong parallel trends is equivalent to a restriction on treatment effect heterogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=d] \\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=d] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=d] \\ &= \E[Y_{t=2}(d) - Y_{t=1}(0) | D=l] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=l] \\ &= \E[Y_{t=2}(d) - Y_{t=2}(0) | D=l] = ATT(d|l) \end{aligned} \]

Since this holds for all \(d\) and \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{t=2}(d) - Y_{t=2}(0)]\). Thus, under strong parallel trends, we have that

\[ATE(d) = \E[\Delta Y_{t=2}|D=d] - \E[\Delta Y_{t=2}|D=0]\]

RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” parallel trends, but here

  • assumptions are different

  • parameter interpretation is different

Comparisons across dose

ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t=2}(d_h) - Y_{t=2}(0)] - \E[Y_{t=2}(d_l) - Y_{t=2}(0)]\\ &= \underbrace{\E[Y_{t=2}(d_h) - Y_{t=2}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]

Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption

Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting

\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^o := \E[ACR(D) | D>0] \end{aligned} \]
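With a discrete dose, one simple stand-in for the derivative in \(ACR(d)\) is the slope of \(ATE(d)\) between adjacent doses. The sketch below assumes strong parallel trends has already been used to obtain \(ATE(d)\); the function name `acr_finite_difference` and the numbers are hypothetical.

```python
def acr_finite_difference(ate_by_dose):
    """Approximate ACR by the slope of ATE(d) between adjacent observed doses.

    ate_by_dose: dict mapping dose d to ATE(d); include 0 with ATE(0) = 0 to get
                 the response to moving from no treatment to the smallest dose.
    Returns a dict keyed by (lower dose, higher dose).
    """
    doses = sorted(ate_by_dose)
    return {
        (lo, hi): (ate_by_dose[hi] - ate_by_dose[lo]) / (hi - lo)
        for lo, hi in zip(doses, doses[1:])
    }


# Hypothetical ATE(d) values on an unevenly spaced dose grid.
ate = {0: 0.0, 1: 3.0, 2: 5.0, 4: 6.0}
acr = acr_finite_difference(ate)
```

In this example the response per unit of dose falls from 3 to 2 to 0.5 as the dose rises. The same arithmetic applied to \(ATT(d|d)\) under standard parallel trends alone would mix these causal responses with selection bias.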

Additional Comments

Can you relax strong parallel trends?

Positive side-comment: No untreated units

Positive side-comment: Binarizing the Treatment

Negative side-comment: Pre-testing

TWFE Regressions in this Context

Consider the same TWFE regression (but now \(D_{it}\) is continuous): \[\begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*}\] You can show that \[\begin{align*} \alpha = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}\] where \(m_\Delta(l) := \E[\Delta Y_{t=2}|D=l] - \E[\Delta Y_{t=2}|D=0]\) and \(w(l)\) are weights

  • Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)

  • Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).

Thus, issues related to selection bias continue to show up here

About the weights: they are all positive, but they have some strange properties; for example, they are always maximized at \(l = \E[D]\), even if this is not a common value for the dose

  • \(\implies\) even under strong parallel trends, \(\alpha \neq ACR^o\).
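With two periods and no one treated in the first period, the TWFE coefficient \(\alpha\) is numerically the OLS slope from regressing \(\Delta Y_{t=2}\) on \(D\) (first differencing removes the unit fixed effects). A minimal sketch with hypothetical toy data; the function name is illustrative.

```python
from statistics import mean


def twfe_alpha_two_period(dose, dy):
    """OLS slope of dY on D (with an intercept), i.e. cov(dY, D) / var(D).

    In a balanced two-period panel with D_{i1} = 0 for all units, this equals the
    coefficient on D_it in the TWFE regression with unit and time fixed effects.
    """
    d_bar, dy_bar = mean(dose), mean(dy)
    cov = sum((d - d_bar) * (y - dy_bar) for d, y in zip(dose, dy))
    var = sum((d - d_bar) ** 2 for d in dose)
    if var == 0:
        raise ValueError("dose has no variation")
    return cov / var


# Hypothetical toy data: doses 0, 1, and 2 with two units each.
dose = [0, 0, 1, 1, 2, 2]
dy = [1.0, 3.0, 4.0, 6.0, 5.0, 9.0]
alpha = twfe_alpha_two_period(dose, dy)
```

In this toy data the group means of \(\Delta Y\) are 2, 5, and 7, so \(m_\Delta\) rises by 3 from dose 0 to 1 and by 2 from dose 1 to 2, and \(\alpha = 2.5\) lies between the two increments. The single number hides which part of the dose range drives it.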

TWFE Regressions in this Context

Other issues can arise in more complicated cases

  • For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment

  • In general, things get worse for TWFE regressions with more complications

Summarizing

  • It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose
  • However, comparisons of ATT-type parameters across different doses are hard to interpret
    • They include selection bias terms
    • This issue extends to identifying ACRT parameters
    • These issues extend to TWFE regressions
  • This suggests targeting ATE-type parameters
    • Comparisons across doses do not contain selection bias terms
    • But identifying ATE-type parameters requires stronger assumptions

Empirical Application

This is a simplified version of Acemoglu and Finkelstein (2008)

1983 Medicare reform that eliminated labor subsidies for hospitals

  • Medicare moved to the Prospective Payment System (PPS), which replaced “full cost reimbursement” with “partial cost reimbursement”. The change eliminated reimbursements for labor (while maintaining reimbursements for capital expenses)

  • Rough idea: This changes relative factor prices, which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.

  • In the paper, we provide some theoretical arguments concerning properties of production functions that suggest that strong parallel trends holds.

Data

Hospital-reported data from the American Hospital Association, yearly from 1980 to 1986

Outcome is capital/labor ratio

  • proxy using the depreciation share of total operating expenses (avg. 4.5%)

  • our setup: collapse to two periods by taking average in pre-treatment periods and average in post-treatment periods

Dose is “exposure” to the policy

  • the number of Medicare patients in the period before the policy was implemented

  • roughly 15% of hospitals are untreated (have essentially no Medicare patients)

    • AF provide results both with and without these hospitals. (Good) it is useful to have untreated hospitals; (bad) they are fairly different from the others (they include federal, long-term, psychiatric, children’s, and rehabilitation hospitals)

Bin Scatter

ATT/ATE Plot

ACR(T) Plot

Results

Density weights vs. TWFE weights

TWFE Weights with and without Untreated Group

Part 2: Units can move in and out of the treatment

Units can move in and out of the treatment

“Scarring” vs. Moving in and out of treatment

Example treatments:

  • Union status (Vella and Verbeek, 1998)

  • Whether or not location hit by hurricane (Deryugina, 2017)

  • Whether or not a district shares the same ethnicity as the president of the country (Burgess, et al., 2015)

Additional Notation:

We can make a lot of progress by redefining our notion of a “group”

  • Keep track of entire treatment regime \(\mathbf{D}_i := (D_{i1}, \ldots, D_{iT})'\) and/or treatment history up to period \(t\): \(\mathbf{D}_{it} := (D_{i1}, \ldots, D_{it})'\).

  • Potential outcomes \(Y_{it}(\mathbf{d}_t)\) where \(\mathbf{d}_t\) is some treatment history up to period \(t\) (this notation imposes “no anticipation” — potential outcomes do not depend on future treatments). Observed outcomes: \(Y_{it}(\mathbf{D}_{it})\)

  • \(\mathbf{0}_t\) denotes not participating in the treatment in any period up to period \(t\)

Units can move in and out of the treatment

In this case, we’ll define groups by their treatment histories \(\mathbf{d}_t\). Thus, we can consider group-time average treatment effects defined by \[\begin{align*} ATT(\mathbf{d}_t, t) := \E[Y_{t}(\mathbf{d}_t) - Y_{t}(\mathbf{0}_t) | \mathbf{D}_{t} = \mathbf{d}_t] \end{align*}\]

Units can move in and out of the treatment

In-and-Out Parallel Trends Assumption:

For all \(t=2,\ldots,T\), and for all \(\mathbf{d}_t \in \mathcal{D}_t\), \[\begin{align*} \E[\Delta Y_{t}(\mathbf{0}_t) | \mathbf{D}_{t} = \mathbf{d}_t] = \E[\Delta Y_{t}(\mathbf{0}_t) | \mathbf{D}_{t} = \mathbf{0}_t] \end{align*}\]

Identification: In this setting, under the parallel trends assumption, we have that \[\begin{align*} ATT(\mathbf{d}_t, t) = \E[Y_{t} - Y_{1} | \mathbf{D}_{t} = \mathbf{d}_t] - \E[Y_{t} - Y_{1} | \mathbf{D}_{t} = \mathbf{0}_t] \end{align*}\]
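The sample analogue of this expression can be sketched as follows, assuming a balanced panel stored as plain tuples. The function name `att_history`, the data layout, and the toy panel are all hypothetical.

```python
from statistics import mean


def att_history(panel, history):
    """Estimate ATT(d_t, t) for the treatment history d_t = `history` (length t).

    panel: dict unit -> (treatments (D_1, ..., D_T), outcomes (Y_1, ..., Y_T))
    Compares the long difference Y_t - Y_1 for units with this history to the
    same long difference for units untreated in every period up to t.
    """
    t = len(history)
    never_treated = (0,) * t

    def long_difference(target):
        diffs = [y[t - 1] - y[0] for d, y in panel.values() if d[:t] == target]
        if not diffs:
            raise ValueError(f"no units with treatment history {target}")
        return mean(diffs)

    return long_difference(history) - long_difference(never_treated)


# Hypothetical toy panel with T = 3 periods.
panel = {
    "a": ((0, 1, 0), (1.0, 4.0, 3.0)),
    "b": ((0, 1, 0), (2.0, 5.0, 6.0)),
    "c": ((0, 0, 0), (1.0, 2.0, 3.0)),
    "d": ((0, 0, 0), (3.0, 4.0, 5.0)),
    "e": ((0, 1, 1), (0.0, 3.0, 7.0)),
}
att_on_in_period_2 = att_history(panel, (0, 1))
att_on_then_off = att_history(panel, (0, 1, 0))
```

Note that unit `e` belongs to the group with history `(0, 1)` in period 2 but to a different group, `(0, 1, 1)`, in period 3: groups are defined by the history up to the period being evaluated.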

This argument is straightforward and analogous to what we have done before. However…

Units can move in and out of the treatment

There are a number of additional complications that arise here.

  1. There are way more possible groups here than in the staggered treatment case (you can think of this as leading to a kind of curse of dimensionality)

    • \(\implies\) small groups \(\implies\) imprecise estimates and (possibly) invalid inferences

    • also makes it harder to report the results

  2. The previous point provides an additional reason to try to aggregate the group-time average treatment effects. However, this is also not so straightforward.

    • This is an area of active research (e.g., de Chaisemartin and d’Haultfœuille (2024) and Yanagi (2022))

    • Some ideas below…but the literature has not converged here yet (nor is it clear if it can converge)
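To get a feel for the first complication, the sketch below counts the possible treatment histories with a binary treatment over \(T\) periods, assuming (as an illustration) that no unit is treated in period 1. It compares the unrestricted case to the staggered case, where treatment never turns off. The function name is hypothetical.

```python
from itertools import product


def count_histories(T):
    """Return (number of in-and-out histories, number of staggered histories).

    Histories are binary tuples (D_1, ..., D_T) with D_1 = 0.
    A history is staggered if treatment never switches from 1 back to 0.
    """
    all_histories = [h for h in product((0, 1), repeat=T) if h[0] == 0]
    staggered = [h for h in all_histories if all(a <= b for a, b in zip(h, h[1:]))]
    return len(all_histories), len(staggered)


counts = {T: count_histories(T) for T in (3, 5, 10)}
```

The unrestricted count doubles with every extra period (4, 16, and 512 here), while the staggered count grows only linearly (3, 5, and 10), which is why cells become small so quickly when units can move in and out of treatment.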

Units can move in and out of the treatment

Probably the simplest approach is to just make “timing groups” on the basis of the first period when a unit experiences the treatment

  • We have (kind of) been doing this in our minimum wage application

  • Lots of papers (e.g., job displacement, hospitalization) have used this idea

  • Formally, it amounts to averaging over all subsequent treatment decisions (de Chaisemartin and d’Haultfœuille (2024))

In math: Define \(M_i := \min\{t : D_{it} = 1\}\), then we can consider the (timing-group)-time average treatment effects: \[ATT(m,t) := \E[Y_{t}(\mathbf{D}_{t}) - Y_{t}(\mathbf{0}_t) | M = m]\]

  • If the treatment were staggered, these would be exactly the group-time average treatment effects discussed earlier

  • Can show that these are averages of \(ATT(\mathbf{d}_t, t)\) across different treatment histories that have the same \(M_i\).
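Constructing the timing groups is mechanical. A small sketch, with periods numbered from 1 and hypothetical unit labels and histories:

```python
def first_treated_period(history):
    """M_i = min{t : D_it = 1}, with periods numbered from 1.

    Returns None for units that are never treated.
    """
    for t, d in enumerate(history, start=1):
        if d == 1:
            return t
    return None


# Hypothetical treatment histories (D_1, ..., D_4) for four units.
histories = {
    "a": (0, 1, 1, 0),
    "b": (0, 1, 0, 0),
    "c": (0, 0, 1, 1),
    "d": (0, 0, 0, 0),
}
timing_group = {unit: first_treated_period(h) for unit, h in histories.items()}
```

Units `a` and `b` land in the same timing group (first treated in period 2) even though their later treatment paths differ; this is the sense in which \(ATT(m,t)\) averages over subsequent treatment decisions.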

Units can move in and out of the treatment

But there are other ideas too. For example, you could target the average treatment effect across all periods that a unit participated in the treatment

  • Define \(C_i := \displaystyle \sum_{t=2}^T D_{it}\) — the total number of periods that unit \(i\) was treated

  • Unit-specific average treatment effect \[\bar{\tau}_i = \frac{1}{C_i} \sum_{t=2}^{T} D_{it} \big(Y_{it}(\mathbf{D}_{it}) - Y_{it}(\mathbf{0}_t) \big)\] This is the average treatment effect for unit \(i\) in all the periods that it was treated

  • Overall average treatment effect: \[ATT^o := \E[\bar{\tau} | \mathbf{D} \neq \mathbf{0}_T]\]

  • Can show that this is a different weighted average of \(ATT(\mathbf{d}_t, t)\).

This sort of parameter might be interesting in applications where treatment status changes often and treatment effects are short-lived

Units can move in and out of the treatment

Suppose that you were interested in the average treatment effect of experiencing some cumulative number of treatments over time (e.g., how many years someone was in a union).

  • Consider the average treatment effect parameter \[ATT^{sum}(\sigma) := \E\Big[Y_{T}(\mathbf{D}) - Y_{T}(\mathbf{0}) \big| C=\sigma\Big]\] which is the average treatment effect (in the last period) among those units that experienced \(\sigma\) total treatments across all years

  • As before, you can show that this is a weighted average of \(ATT(\mathbf{d}_t, t)\).

  • Can report \(ATT^{sum}(\sigma)\) for different values of \(\sigma\).

Units can move in and out of the treatment

Unlike the staggered treatment adoption case, where \(ATT^{es}(e)\) and \(ATT^o\) seem like good default parameters to report, it is not clear to me what (or if there is) a good default choice here.

  • However, if I were writing a paper, I would (i) show disaggregated results, (ii) argue for some particular aggregated parameter and choose weights on the disaggregated parameters that target this parameter

Another caution is that (I presume) comparing \(ATT\)-type parameters across different amounts of the treatment (e.g., across \(\sigma\)) will introduce selection bias terms except under additional assumptions

  • e.g., saying that, on average, participating in a union for 10 years increased earnings by some amount and participating in a union for 5 years increased earnings by another amount is one thing; causally attributing the difference to “longer union participation” (probably) takes more assumptions

[Possible additional assumptions]

Part 3: Treatment that can change amounts (back to the minimum wage example)

Minimum Wage Example

If we engage seriously with differing minimum wages across states, this is related to (but not exactly the same as) either of the two cases considered previously.

Unique features of minimum wage application:

  • Multiple values of the treatment

  • Amount can change over time

  • But (in our sample) treatment does not ever turn back off

Minimum Wages by State

Group-time average treatment effects

It is straightforward for us to get \(ATT(\mathbf{d}_t, t)\). This amounts to just estimating treatment effects for each treated state in our data in each time period.

ATT by State and Time

Aggregating Group-Time Average Treatment Effects

The example here is small enough that perhaps we could just show disaggregated results, but this would not be true for most applications.

Goals:

  • Come up with a version of an event study (that acknowledges different treatment amounts)

  • Come up with an overall average treatment effect parameter (also acknowledging different treatment amounts)

How to Aggregate

It is less clear how to aggregate them. I will propose an idea, but you could certainly come up with something else.

For counties that experienced treatment regime \(\mathbf{d}_t\), consider the scaled treatment effect \[\frac{Y_{it}(\mathbf{d}_t) - Y_{it}(\mathbf{0}_t)}{d_t}\] which is the effect of the minimum wage scaled by the minimum wage in the current period

  • \(d_t = \textrm{state min wage} - \textrm{federal min wage}\)

How to Aggregate

Define \(M_i\) as the first time a state raised its minimum wage

Consider the following parameter \[ATT^{scaled}(m,t) := \E\left[ \frac{Y_{t}(\mathbf{D}_{t}) - Y_{t}(\mathbf{0}_t)}{D_{t}} \Big| M = m \right]\] which is the average per dollar effect of the minimum wage increase on employment in period \(t\) across states that first raised the minimum wage in period \(m\)

  • Can show that this is an average of \(\frac{ATT(\mathbf{d}_t, t)}{d_t}\) across different treatment histories that have \(M_i=m\).

  • we can average across \(m,t\) to get an event study or an overall average treatment effect — interpret both as per dollar effect of minimum wage increases on employment
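A minimal sketch of the scaling-then-averaging step for one \((m,t)\) cell. It assumes the history-specific \(ATT(\mathbf{d}_t, t)\) have already been estimated and uses the number of units with each history as the weight. The function name `att_per_dollar` and the inputs are hypothetical and are not estimates from the application.

```python
def att_per_dollar(cells):
    """Average ATT(d_t, t) / d_t across treatment histories that share M = m.

    cells: list of (att, d_t, n_units), one entry per treatment history, where
      att     is ATT(d_t, t) for that history,
      d_t     is the state minimum wage minus the federal minimum wage in period t,
      n_units is the number of units with that history (the averaging weight).
    """
    if any(d_t <= 0 for _, d_t, _ in cells):
        raise ValueError("scaling requires a positive treatment amount d_t")
    total_units = sum(n for _, _, n in cells)
    return sum(n * att / d_t for att, d_t, n in cells) / total_units


# Hypothetical inputs: two histories with the same first-treated period,
# one currently $1.00 and one currently $2.00 above the federal minimum.
cells = [(-0.06, 1.0, 10), (-0.10, 2.0, 10)]
att_scaled = att_per_dollar(cells)
```

The second history has the larger unscaled effect (-0.10 versus -0.06) but the smaller per dollar effect (-0.05 versus -0.06), which is the kind of difference the scaling is meant to make visible.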

ATT per Dollar by State and Time

Event Study per Dollar

per dollar \(\widehat{ATT}^o = -0.058\), \(\textrm{s.e.}=0.018\).

Summary

We’ve covered a number of different settings, but we certainly haven’t covered all of them

Heterogeneity-robust approaches typically have to be customized in complicated settings (unlike TWFE regressions)

In my view, this is a feature of new approaches (rather than a weakness). As researchers, I think we should grapple with the complexity of the problems that we are studying

  • In all likelihood, if you run a TWFE regression, it is going to give you some kind of weighted average of underlying treatment effect parameters (with hard to understand/interpret weights).

What should you do?

My goal in this section is to provide at least a recipe for dealing with complicated treatment regimes

  • Step 1: Target disaggregated parameters

  • Step 2: If desired, choose aggregated target parameter suitable to the application, combine underlying disaggregated parameters directly to recover this parameter

Appendix

Positive Side-Comments: No untreated units

It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.

  • In this case, it is not possible to recover level effects like \(ATT(d|d)\).

  • However, notice that \[\begin{aligned}& \E[\Delta Y | D=d_h] - \E[\Delta Y | D=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y | D=d_h] - \E[\Delta Y(0) | D=d_h]\Big) - \Big(\E[\Delta Y | D=d_l]-\E[\Delta Y(0) | D=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]

  • In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.

  • Still face issues related to selection bias / strong parallel trends though

Positive Side-Comments: Alternative approaches

Strategies like binarizing the treatment can still work (though be careful!)

  • If you classify units as being treated or untreated, you can recover the \(ATT\) of being treated at all.

  • On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”

Negative Side-Comment: Pre-testing

That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same

  • In effect, the only testable implication of strong parallel trends in pre-treatment periods is standard parallel trends.

Possible Additional Assumptions

There are other additional assumptions that could be attractive in applications like this

  1. Notice that above, we only invoked parallel trends with respect to untreated potential outcomes.

    But it seems within the spirit of DID to assume parallel trends for staying at the same treatment over time

    • Then we can recover group-time average treatment effects for switchers relative to stayers
    • See de Chaisemartin et al. (2023) and de Chaisemartin and d’Haultfœuille (2024) for approaches along these lines

    This approach could potentially greatly increase the amount of information that we are able to use and results in many more disaggregated treatment effect parameters

Possible Additional Assumptions

There are other additional assumptions that could be attractive in applications like this

  2. Assumptions that limit the “memory” of potential outcomes could be attractive in some applications

    • e.g., \(Y_{t}(\mathbf{d}_t) = Y_{t}(\mathbf{d}_{t-5:t})\) — potential outcomes only depend on treatments in the last 5 periods

    • this allows pooling across treatment histories

    • could increase the size of the comparison group
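A small sketch of how a limited-memory assumption pools treatment histories. It truncates each history to its last `k` periods; the function name `recent_history`, the choice `k = 2`, and the histories are hypothetical.

```python
from collections import Counter


def recent_history(history, k):
    """Keep only the treatments in the last k periods of a treatment history."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return tuple(history[-k:])


# Hypothetical treatment histories (D_1, ..., D_4) for four units.
histories = [(1, 0, 0, 0), (0, 0, 0, 0), (0, 1, 0, 0), (1, 1, 0, 1)]

full_counts = Counter(histories)
pooled_counts = Counter(recent_history(h, 2) for h in histories)
```

Using full histories, only one unit is untreated throughout. If potential outcomes depend only on the last two periods, three units share the recent history `(0, 0)` and can serve as comparisons.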

Possible Additional Assumptions

There are other additional assumptions that could be attractive in applications like this

  3. Assumptions that limit treatment effect dynamics could be attractive in some applications

    For example, if a unit has been treated for 5 years in a row, then their trend in outcomes over time goes back to being the same as the trend in untreated potential outcomes (though the level could still be affected by the treatment)

    I think this is what event studies that bin the endpoints have in mind

    This allows those units with a “steady” treatment to eventually re-enter the comparison group (and this is often a testable assumption)

DID using Stayers and Switchers

Parallel Trends Assumption for Stayers

For any treatment history \(\mathbf{d}_{t-1}\),

\[\begin{align*} \E[Y_{t}(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{t-1} = \mathbf{d}_{t-1}] = \E[Y_{t}(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{t} = (d_{t-1},\mathbf{d}_{t-1})] \end{align*}\]

In this case, you can recover the \(ATT\) for switchers: (here we are supposing that \(d_{t-1}=0\), but can make an analogous argument in the opposite case) \[\begin{align*} ATT^{switchers}(\mathbf{d}_{t-1},t) &= \E[Y_{t}(1,\mathbf{d}_{t-1}) - Y_{t}(0,\mathbf{d}_{t-1}) | \mathbf{D}_{t} = (1,\mathbf{d}_{t-1})] \\ &\overset{\textrm{PTA}}{=} \E[\Delta Y_{t} | \mathbf{D}_{t}=(1,\mathbf{d}_{t-1})] - \E[\Delta Y_{t} | \mathbf{D}_{t}=(0,\mathbf{d}_{t-1})] \end{align*}\] That is, you can recover \(ATT^{switchers}\) by comparing the paths of outcomes for switchers to the path of outcomes for stayers (exactly what you’d expect!)

Given this sort of assumption, there may be a huge number of \(ATT^{switchers}(\mathbf{d}_{t-1},t)\) in realistic applications.

  • You could use these to further understand treatment effect heterogeneity

  • You could also propose some way to aggregate them into a lower dimensional parameter
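A minimal sketch of the switchers-versus-stayers comparison for one prior history with \(d_{t-1}=0\). Histories are stored in chronological order here (the current treatment is the last entry, unlike the notation above, which lists it first). The function name `att_switchers` and the data are hypothetical.

```python
from statistics import mean


def att_switchers(units, prior_history):
    """Compare outcome changes of switchers and stayers with the same prior history.

    units:         list of (history up to period t, dY_t) with dY_t = Y_t - Y_{t-1}
    prior_history: treatment history up to period t-1, ending in 0
    Switchers turn treatment on in period t; stayers remain untreated in period t.
    """
    if prior_history[-1] != 0:
        raise ValueError("this sketch covers only the case d_{t-1} = 0")
    switchers = [dy for h, dy in units if h[:-1] == prior_history and h[-1] == 1]
    stayers = [dy for h, dy in units if h[:-1] == prior_history and h[-1] == 0]
    if not switchers or not stayers:
        raise ValueError("need both switchers and stayers for this prior history")
    return mean(switchers) - mean(stayers)


# Hypothetical data for period t = 3.
units = [
    ((0, 0, 1), 2.0),  # switcher
    ((0, 0, 1), 4.0),  # switcher
    ((0, 0, 0), 1.0),  # stayer
    ((0, 0, 0), 1.0),  # stayer
    ((0, 1, 1), 7.0),  # different prior history, not used in this comparison
]
att_sw = att_switchers(units, (0, 0))
```

Only units with the same history up to \(t-1\) enter the comparison, so each prior history generates its own \(ATT^{switchers}\).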

Minimum Wage Mathematical Details

\(\mu(\mathbf{d}_t) := d_t\) — “how much” treated in this period

\(\varrho(\mathbf{d}_t) := \min\{s : d_s \in \mathbf{d}_t, d_s \neq 0\}\) — first period treated

Building block parameter: Define \(\mathcal{D}_t^{\mu,\varrho} = \{\mathbf{d}_t \in \mathcal{D}_t : \mu(\mathbf{d}_t) = \mu, \varrho(\mathbf{d}_t) = \varrho\}\). This is the set of treatment histories with a minimum wage equal to \(\mu\) in period \(t\) and a first minimum wage increase in period \(\varrho\). Then, consider

\[ATT^{per}(\mu, \varrho, t) = \sum_{\mathbf{d}_t \in \mathcal{D}_t^{\mu,\varrho}} \frac{ATT(\mathbf{d}_t, t)}{\mu(\mathbf{d}_t)} \P(\mathbf{D}_{t} = \mathbf{d}_t | \mathbf{D}_{t} \in \mathcal{D}_t^{\mu,\varrho})\]

This is the (per-dollar) \(ATT\) of having a minimum wage \(\mu\) in period \(t\) among states that (a) actually had a minimum wage of \(\mu\) in period \(t\) and (b) first increased their minimum wage in period \(\varrho\).

  • There are still a ton of these…

Aggregate Even More

Next, define \(M_t= \{\mu : \mu\}\)

Further consider

\[ATT^{per}(\varrho, t) = \sum_{\mu \in M_t} ATT^{per}(\mu, \varrho, t) \P(\mu(\mathbf{d}_t))\]

References

de Chaisemartin, Clément, Xavier d’Haultfœuille, Félix Pasquier, and Gonzalo Vazquez-Bare. 2023. “Difference-in-Differences Estimators for Treatments Continuously Distributed at Every Period.”
de Chaisemartin, Clément, and Xavier d’Haultfœuille. 2024. “Difference-in-Differences Estimators of Intertemporal Treatment Effects.” Review of Economics and Statistics, 1–45.
Yanagi, Takahide. 2022. “Doubly Robust Difference-in-Differences with General Treatment Patterns.”