Session 3: More Complicated Treatment Regimes

\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \renewcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

The discussion so far (and much of the recent DID literature) has focused on the setting with staggered treatment adoption.

However, this certainly does not cover the full range of possible treatments. In Part 3, we’ll primarily consider two leading extensions:

A treatment that is multi-valued or continuous (e.g., length of school closures during Covid on student test scores)

A treatment that can turn on and off (e.g., union status)

A couple of things to notice as we go along:

I’m not going to cover much about TWFE regressions here. With these more complicated treatments, they have even more ways to go wrong.

Try to pay attention to the pattern. Even though the arguments are getting more complicated, we are still following the same idea: (i) target disaggregated parameters, (ii) combine them into lower-dimensional objects, and (iii) watch for some additional interpretation issues that are worth emphasizing.

Potential outcomes notation

Two time periods: \(t=2\) and \(t=1\)

- No one treated until period \(t=2\)
- Some units remain untreated in period \(t=2\)

Potential outcomes: \(Y_{i,t=2}(d)\)

Observed outcomes: \(Y_{i,t=2}\) and \(Y_{i,t=1}\)

\[Y_{i,t=2}=Y_{i,t=2}(D_i) \quad \textrm{and} \quad Y_{i,t=1}=Y_{i,t=1}(0)\]

Level Effects (Average Treatment Effect on the Treated)

\[ATT(d|d) := \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d]\]

Interpretation: The average effect of dose \(d\) relative to not being treated

*local to the group that actually experienced dose \(d\)*

This is the natural analogue of \(ATT\) in the binary treatment case

Slope Effects (Average Causal Response on the Treated)

\[ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}\]

- Interpretation: \(ACRT(d|d)\) is the causal effect of a marginal increase in dose
*local to units that actually experienced dose \(d\)*

Notice that \(ATT(d|d)\) and \(ACRT(d|d)\) are functional parameters

- This is different from \(\beta^{twfe}\) (from the TWFE regression of \(Y_{i,t}\) on \(D_{i,t}\))

We can view \(ATT(d|d)\) and \(ACRT(d|d)\) as the “building blocks” for a more aggregated parameter. Aggregated versions of these (into a single number) are \[\begin{align*} ATT^o := \E[ATT(D|D)|D>0] \qquad \qquad ACRT^o := \E[ACRT(D|D)|D>0] \end{align*}\]

\(ATT^o\) averages \(ATT(d|d)\) over the population distribution of the dose

\(ACRT^o\) averages \(ACRT(d|d)\) over the population distribution of the dose

\(ACRT^o\) is the natural target parameter for the TWFE regression in this case

**“Standard” Parallel Trends Assumption**

For all \(d\),

\[\E[\Delta Y_{i,t=2}(0) | D_i=d] = \E[\Delta Y_{i,t=2}(0) | D_i=0]\]

Then,

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{150pt}\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d]\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[\Delta Y_{i,t=2}(0) | D_i=0]\\ &= \E[\Delta Y_{i,t=2} | D_i=d] - \E[\Delta Y_{i,t=2} | D_i=0] \end{aligned} \]
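The identification result above is easy to compute by hand. Here is a minimal Python sketch with made-up data (the doses and outcome changes are hypothetical):

```python
# Minimal sketch (hypothetical data): under "standard" parallel trends,
# ATT(d|d) = E[dY | D=d] - E[dY | D=0], where dY = Y_{t=2} - Y_{t=1}.
from statistics import mean

# (dose, dY) pairs for a handful of units; dose 0 = untreated
data = [(0, 1.0), (0, 3.0), (1, 4.0), (1, 6.0), (2, 9.0), (2, 11.0)]

def att(d, data):
    """E[dY | D=d] - E[dY | D=0]: the DID estimand for dose d."""
    treated = mean(dy for dose, dy in data if dose == d)
    untreated = mean(dy for dose, dy in data if dose == 0)
    return treated - untreated

print(att(1, data))  # (4+6)/2 - (1+3)/2 = 3.0
print(att(2, data))  # (9+11)/2 - (1+3)/2 = 8.0
```

With a continuous dose in practice, the conditional mean \(\E[\Delta Y | D=d]\) would be estimated by binning or nonparametric regression rather than by exact matching on the dose as above.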

This is exactly what you would expect. So does everything work just like in the binary treatment case? Unfortunately, no

Most empirical work with a continuous treatment wants to think about how causal responses vary across dose

- Plot treatment effects as a function of dose and ask: does more dose tend to increase/decrease/not affect outcomes?

- Average causal response parameters *inherently* involve comparisons across slightly different doses

There are new issues related to comparing \(ATT(d|d)\) at different doses and interpreting these differences as causal effects

At a high-level, these issues arise from a tension between empirical researchers wanting to use a quasi-experimental research design (which delivers “local” treatment effect parameters) but (often) wanting to compare these “local” parameters to each other

Unlike the staggered, binary treatment case: No easy fixes here!

Consider comparing \(ATT(d|d)\) for two different doses

\[ \begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l) \hspace{350pt}\\ & \hspace{25pt} = \E[Y_{i,t=2}(d_h)-Y_{i,t=2}(d_l) | D_i=d_h] + \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0) | D_i=d_h] - \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0) | D_i=d_l]\\ & \hspace{25pt} = \underbrace{\E[Y_{i,t=2}(d_h) - Y_{i,t=2}(d_l) | D_i=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned} \]
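A toy numerical version of this decomposition, with all potential-outcome averages made up, shows how the naive comparison mixes the causal response with selection bias:

```python
# Sketch of the decomposition above with hypothetical potential outcomes.
# Group with dose d_h: Y(d_h)-Y(0) = 8 and Y(d_l)-Y(0) = 5 on average
# Group with dose d_l: Y(d_l)-Y(0) = 3 on average
att_hh = 8.0   # ATT(d_h | d_h)
att_lh = 5.0   # ATT(d_l | d_h): effect of the LOW dose for the HIGH-dose group
att_ll = 3.0   # ATT(d_l | d_l)

naive_comparison = att_hh - att_ll        # what comparing ATT(d|d)'s delivers
causal_response = att_hh - att_lh         # E[Y(d_h) - Y(d_l) | D = d_h]
selection_bias = att_lh - att_ll          # ATT(d_l|d_h) - ATT(d_l|d_l)

print(naive_comparison)                   # 5.0
print(causal_response + selection_bias)   # 3.0 + 2.0 = 5.0
```

Here the naive comparison (5.0) overstates the causal response (3.0) because the high-dose group would have responded more strongly even to the low dose.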

“Standard” Parallel Trends is not strong enough to rule out the selection bias terms here

Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends

This problem spills over into identifying \(ACRT(d|d)\)

Intuition:

Difference-in-differences identification strategies deliver \(ATT(d|d)\) parameters. These are local parameters and are difficult to compare to each other

This explanation is similar to thinking about LATEs with two different instruments

Thus, comparing \(ATT(d|d)\) across different values of the dose is tricky and does not come for free

What can you do?

One idea: just recover \(ATT(d|d)\) and interpret it cautiously (interpret it by itself, not relative to its value at different doses \(d\))

If you want to compare them to each other, it will come with the cost of additional (structural) assumptions

**“Strong” Parallel Trends Assumption**

For all doses \(d\) and \(l\),

\[\E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] = \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d]\]

This is notably different from “Standard” Parallel Trends

It involves potential outcomes for all values of the dose (not just untreated potential outcomes)

All dose groups would have experienced the same path of outcomes had they been assigned the same dose

Strong parallel trends implies a version of treatment effect homogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{200pt} \ \end{aligned} \]

Strong parallel trends implies a version of treatment effect homogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{200pt} \\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d] \ \end{aligned} \]

Strong parallel trends implies a version of treatment effect homogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{200pt} \\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d] \\\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=l] \ \end{aligned} \]

Strong parallel trends implies a version of treatment effect homogeneity. Notice:

\[ \begin{aligned} ATT(d|d) &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=d] \hspace{200pt} \\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=d] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=d] \\ &= \E[Y_{i,t=2}(d) - Y_{i,t=1}(0) | D_i=l] - \E[Y_{i,t=2}(0) - Y_{i,t=1}(0) | D_i=l] \\ &= \E[Y_{i,t=2}(d) - Y_{i,t=2}(0) | D_i=l] = ATT(d|l) \end{aligned} \]

Since this holds for all \(d\) and \(l\), it also implies that \(ATT(d|d) = ATE(d) = \E[Y_{i,t=2}(d) - Y_{i,t=2}(0)]\). Thus, under strong parallel trends, we have that

\[ATE(d) = \E[\Delta Y_{i,t=2}|D_i=d] - \E[\Delta Y_{i,t=2}|D_i=0]\]

RHS is exactly the same expression as for \(ATT(d|d)\) under “standard” parallel trends, but here

assumptions are different

parameter interpretation is different

ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

\[ \begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{i,t=2}(d_h) - Y_{i,t=2}(0)] - \E[Y_{i,t=2}(d_l) - Y_{i,t=2}(0)]\\ &= \underbrace{\E[Y_{i,t=2}(d_h) - Y_{i,t=2}(d_l)]}_{\textrm{Causal Response}} \end{aligned} \]

Thus, recovering \(ATE(d)\) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption

Given that we can compare \(ATE(d)\)’s across dose, we can recover slope effects in this setting

\[ \begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^o := \E[ACR(D) | D>0] \end{aligned} \]
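Since \(ATE(d)\) is identified pointwise under strong parallel trends, slope effects can be approximated by numerically differentiating the estimated curve. A minimal sketch, where a made-up quadratic stands in for an estimated dose-response function:

```python
# Sketch: approximate ACR(d) = d/dd ATE(d) by central finite differences.
# The curve below is hypothetical; in practice it would be an estimated
# (e.g., nonparametric) dose-response function.
def ate(d):
    return 2.0 * d + 0.5 * d ** 2   # true slope is 2 + d

def acr(d, h=1e-6):
    """Central-difference approximation to the derivative of ATE at dose d."""
    return (ate(d + h) - ate(d - h)) / (2 * h)

print(round(acr(1.0), 4))  # close to 2 + 1 = 3.0
print(round(acr(2.0), 4))  # close to 2 + 2 = 4.0
```

\(ACR^o\) would then be obtained by averaging \(ACR(d)\) over the observed distribution of doses among treated units.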

Can you relax strong parallel trends? Some ideas are discussed at the end of this section.

A positive side-comment: versions of these arguments go through even without any fully untreated units (also discussed at the end of this section).

Consider the same TWFE regression (but now \(D_{i,t}\) is continuous): \[\begin{align*} Y_{i,t} = \theta_t + \eta_i + \alpha D_{i,t} + e_{i,t} \end{align*}\] You can show that \[\begin{align*} \alpha = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}\] where \(m_\Delta(l) := \E[\Delta Y_{i,t=2}|D_i=l] - \E[\Delta Y_{i,t=2}|D_i=0]\) and \(w(l)\) are weights

Under standard parallel trends, \(m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}\)

Under strong parallel trends, \(m'_{\Delta}(l) = ACR(l)\).

Thus, issues related to selection bias continue to show up here

About the weights: they are all positive, but have some strange properties (e.g., always maximized at \(l = \E[D]\) (even if this is not a common value for the dose))

- \(\implies\) even under strong parallel trends, \(\alpha \neq ACR^o\).
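To see why \(\alpha \neq ACR^o\) even under strong parallel trends, note that with two periods the TWFE coefficient reduces to the first-difference slope \(\cov(\Delta Y, D)/\var(D)\). A tiny hand computation with a made-up dose-response \(m_\Delta(d) = d^2\):

```python
# Sketch: two-period TWFE coefficient vs. ACR^o under strong parallel trends.
# Hypothetical dose-response m(d) = d^2, so ACR(d) = 2d and
# ACR^o = E[2D | D > 0]. One unit per dose, for transparency.
doses = [0.0, 1.0, 2.0]
dY = [d ** 2 for d in doses]      # E[dY | D=d] under strong parallel trends

n = len(doses)
mean_d = sum(doses) / n
mean_dy = sum(dY) / n
cov = sum((d - mean_d) * (y - mean_dy) for d, y in zip(doses, dY)) / n
var = sum((d - mean_d) ** 2 for d in doses) / n
alpha = cov / var                 # TWFE / first-difference coefficient

acr_o = sum(2 * d for d in doses if d > 0) / sum(1 for d in doses if d > 0)

print(alpha)   # 2.0
print(acr_o)   # 3.0 -> the TWFE weights do not deliver ACR^o
```

The gap comes entirely from the weights \(w(l)\): here they put too much mass near \(\E[D]\) relative to the dose distribution among treated units.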

Other issues can arise in more complicated cases

For example, suppose you have a staggered continuous treatment; then you will *additionally* get issues analogous to the ones we discussed earlier for a binary staggered treatment.

In general, things get worse for TWFE regressions with more complications

- It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose

- However, comparisons of ATT-type parameters across different doses are hard to interpret
- They include selection bias terms
- This issue also arises when identifying ACRT parameters
- These issues extend to TWFE regressions

- This suggests targeting ATE-type parameters
- Comparisons across doses do not contain selection bias terms
- But identifying ATE-type parameters requires stronger assumptions

“Scarring” vs. Moving in and out of treatment

Example treatments:

Union status (Vella and Verbeek, 1998)

Whether or not a location is hit by a hurricane (Deryugina, 2017)

Whether or not a district shares the same ethnicity as the president of the country (Burgess, et al., 2015)

Additional Notation:

We can make a lot of progress by redefining our notion of a “group”

Keep track of entire treatment regime \(\mathbf{D}_i := (D_{i,1}, \ldots, D_{i,T})'\) and/or treatment history up to period \(t\): \(\mathbf{D}_{i,t} := (D_{i,1}, \ldots, D_{i,t})'\).

Potential outcomes \(Y_{i,t}(\mathbf{d}_t)\) where \(\mathbf{d}_t\) is some treatment history up to period \(t\) (this notation imposes “no anticipation” — potential outcomes do not depend on future treatments). Observed outcomes: \(Y_{i,t}(\mathbf{D}_{i,t})\)

\(\mathbf{0}_t\) denotes not participating in the treatment in any period up to period \(t\)

In this case, we’ll define groups by their treatment histories \(\mathbf{d}_t\). Thus, we can consider group-time average treatment effects defined by \[\begin{align*} ATT(\mathbf{d}_t, t) := \E[Y_{i,t}(\mathbf{d}_t) - Y_{i,t}(\mathbf{0}_t) | \mathbf{D}_{i,t} = \mathbf{d}_t] \end{align*}\]

**In-and-Out Parallel Trends Assumption:**

For all \(t=2,\ldots,T\), and for all \(\mathbf{d}_t \in \mathcal{D}_t\), \[\begin{align*} \E[\Delta Y_{i,t}(\mathbf{0}_t) | \mathbf{D}_{i,t} = \mathbf{d}_t] = \E[\Delta Y_{i,t}(\mathbf{0}_t) | \mathbf{D}_{i,t} = \mathbf{0}_t] \end{align*}\]

Identification: In this setting, under the parallel trends assumption, we have that \[\begin{align*} ATT(\mathbf{d}_t, t) = \E[Y_{i,t} - Y_{i,1} | \mathbf{D}_{i,t} = \mathbf{d}_t] - \E[Y_{i,t} - Y_{i,1} | \mathbf{D}_{i,t} = \mathbf{0}_t] \end{align*}\]
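A minimal sketch of this identification strategy, grouping a hypothetical three-period panel on treatment histories (exact matching on histories, which is only feasible here because the example is tiny):

```python
# Sketch: estimating ATT(d_t, t) by grouping on treatment histories.
# Hypothetical 3-period panel; each row is (history D_1..D_3, Y1, Y2, Y3).
rows = [
    ((0, 0, 0), 1.0, 2.0, 3.0),   # never treated
    ((0, 0, 0), 2.0, 3.0, 4.0),
    ((0, 1, 0), 1.0, 4.0, 3.5),   # treated only in period 2
    ((0, 1, 1), 1.5, 4.5, 7.0),   # treated in periods 2 and 3
]

def att_history(d_hist, t, rows):
    """E[Y_t - Y_1 | D = d_hist] - E[Y_t - Y_1 | D = 0_t], matching on
    histories up to period t (t is 1-indexed; r[t] is Y_t)."""
    def mean_change(target):
        vals = [r[t] - r[1] for r in rows if r[0][:t] == target]
        return sum(vals) / len(vals)
    return mean_change(d_hist) - mean_change((0,) * t)

print(att_history((0, 1), 2, rows))      # treated-in-period-2 vs. never
print(att_history((0, 1, 1), 3, rows))
```

The curse-of-dimensionality point below is visible even here: with \(T\) periods there are up to \(2^{T-1}\) possible histories, so real applications quickly end up with many tiny groups.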

This argument is straightforward and analogous to what we have done before. However…

There are a number of additional complications that arise here.

There are way more possible groups here than in the staggered treatment case (you can think of this as leading to a kind of curse of dimensionality)

\(\implies\) small groups \(\implies\) imprecise estimates and (possibly) invalid inferences

also makes it harder to report the results

The previous point provides an additional reason to try to aggregate the group-time average treatment effects. However, this is also not so straightforward.

Probably the simplest approach is to just make “timing groups” on the basis of the first period when a unit experiences the treatment

We have (kind of) been doing this in our minimum wage application

Lots of papers (e.g., job displacement, hospitalization) have used this idea

Formally, it amounts to averaging over all subsequent treatment decisions (de Chaisemartin and D’Haultfœuille (2023))

In math: Define \(M_i := \min\{t : D_{i,t} = 1\}\), then we can consider the (timing-group)-time average treatment effects: \[ATT(m,t) := \E[Y_{i,t}(\mathbf{D}_{i,t}) - Y_{i,t}(\mathbf{0}_t) | M_i = m]\]

If the treatment were staggered, these would be exactly the group-time average treatment effects discussed earlier

Can show that these are averages of \(ATT(\mathbf{d}_t, t)\) across different treatment histories that have the same \(M_i\).

But there are other ideas too. For example, you could target the average treatment effect across all periods that a unit participated in the treatment

Define \(C_i := \displaystyle \sum_{t=2}^T D_{i,t}\) — the total number of periods that unit \(i\) was treated

Unit-specific average treatment effect \[\bar{\tau}_i = \frac{1}{C_i} \sum_{t=2}^{T} D_{i,t} \big(Y_{i,t}(\mathbf{D}_{i,t}) - Y_{i,t}(\mathbf{0}_t) \big)\] This is the average treatment effect for unit \(i\) in all the periods that it was treated

Overall average treatment effect: \[ATT^o := \E[\bar{\tau}_i | \mathbf{D}_i \neq \mathbf{0}_T]\]

Can show that this is a different weighted average of \(ATT(\mathbf{d}_t, t)\).

This sort of parameter might be interesting in applications where treatment status changes often and treatment effects are short-lived
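A sketch of this aggregation, assuming per-period treatment effects \(Y_{i,t}(\mathbf{D}_{i,t}) - Y_{i,t}(\mathbf{0}_t)\) for each unit are already in hand (the treatment indicators and effects below are hypothetical):

```python
# Sketch: unit-specific average effect over treated periods, then ATT^o.
units = [
    # (treatment indicators D_{i,2}..D_{i,T}, per-period effect estimates)
    ([1, 0, 1], [2.0, 0.0, 4.0]),
    ([1, 1, 0], [1.0, 3.0, 0.0]),
]

def tau_bar(d, fx):
    """Average effect for one unit over the periods it was treated."""
    c = sum(d)                                   # C_i: # of treated periods
    return sum(dt * f for dt, f in zip(d, fx)) / c

taus = [tau_bar(d, fx) for d, fx in units]
att_o = sum(taus) / len(taus)                    # average over treated units
print(taus)     # [3.0, 2.0]
print(att_o)    # 2.5
```

Note that this weights each treated unit equally regardless of how many periods it was treated; other weighting choices give different aggregated parameters.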

Suppose that you were interested in the average treatment effect of experiencing some cumulative number of treatments over time (e.g., how many years someone was in a union).

Consider the average treatment effect parameter \[ATT^{sum}(\sigma) := \E\Big[Y_{i,T}(\mathbf{D}_i) - Y_{i,T}(\mathbf{0}_T) \big| C_i=\sigma\Big]\] which is the average treatment effect (in the last period) among those units that experienced \(\sigma\) total treatments across all years

As before, you can show that this is a weighted average of \(ATT(\mathbf{d}_t, t)\).

Can report \(ATT^{sum}(\sigma)\) for different values of \(\sigma\).

Unlike the staggered treatment adoption case, where \(ATT^{es}(e)\) and \(ATT^o\) seem like good default parameters to report, it is not clear to me what a good default choice is here (or whether there is one).

- However, if I were writing a paper, I would (i) show disaggregated results, (ii) argue for some particular aggregated parameter and choose weights on the disaggregated parameters that target this parameter

Another caution: (I presume) the issues with interpreting differences in \(ATT\)-type parameters across different amounts of the treatment (e.g., across \(\sigma\)) will introduce selection bias terms except under additional assumptions

- e.g., saying that, on average, participating in a union for 10 years increased earnings by some amount and participating for 5 years increased earnings by another amount is one thing; causally attributing the difference to “longer union participation” (probably) takes more assumptions

If we engage seriously with differing minimum wages across states, this is related to (but not exactly the same as) either of the two cases considered previously.

Unique features of minimum wage application:

Multiple values of the treatment

Amount can change over time

But (in our sample) treatment does not ever turn back off

It is straightforward for us to get \(ATT(\mathbf{d}_t, t)\). This amounts to just estimating treatment effects for each treated state in our data in each time period.

The example here is small enough that perhaps we could just show disaggregated results, but this would not be true for most applications.

Goals:

Come up with a version of an event study (that acknowledges different treatment amounts)

Come up with an overall average treatment effect parameter (also acknowledging different treatment amounts)

It is less clear how to aggregate them. I will propose an idea, but you could certainly come up with something else.

For counties that experienced treatment regime \(\mathbf{d}_t\), consider the scaled treatment effect \[\frac{Y_{i,t}(\mathbf{d}_t) - Y_{i,t}(\mathbf{0}_t)}{d_t}\] which is the effect of the minimum wage scaled by the minimum wage in the current period

- \(d_t = \textrm{state min wage} - \textrm{federal min wage}\)

Define \(M_i\) as the first time a state raised its minimum wage

Consider the following parameter \[ATT^{scaled}(m,t) := \E\left[ \frac{Y_{i,t}(\mathbf{D}_{i,t}) - Y_{i,t}(\mathbf{0}_t)}{D_{i,t}} \Big| M_i = m \right]\] which is the average per dollar effect of the minimum wage increase on employment in period \(t\) across those which first raised the minimum wage in period \(m\)

Can show that this is an average of \(\frac{ATT(\mathbf{d}_t, t)}{d_t}\) across different treatment histories that have \(M_i=m\).
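A sketch of the per-dollar aggregation: scale each group-time effect by the dose in that period, then average within timing groups. All effects and doses below are hypothetical:

```python
# Sketch: per-dollar (timing-group)-time effects ATT^{scaled}(m, t).
# Each cell: (estimated effect, dose d_t in dollars, timing group m, period t)
cells = [
    (-0.6, 1.0, 2, 2),
    (-1.5, 1.5, 2, 3),
    (-0.5, 0.5, 3, 3),
]

def att_scaled(m, t, cells):
    """Average of effect/dose over cells in timing group m at period t."""
    vals = [eff / dose for eff, dose, mm, tt in cells if mm == m and tt == t]
    return sum(vals) / len(vals)

print(att_scaled(2, 2, cells))   # -0.6 / 1.0 = -0.6 per dollar
print(att_scaled(2, 3, cells))   # -1.5 / 1.5 = -1.0 per dollar
```

Averaging these \(ATT^{scaled}(m,t)\) across \(m\) and \(t\) then yields the event study and overall parameter described next.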

We can average across \(m,t\) to get an event study or an overall average treatment effect; interpret both as per-dollar effects of minimum wage increases on employment

Per-dollar \(\widehat{ATT}^o = -0.058\), \(\textrm{s.e.}=0.018\).

We’ve covered a number of different settings, but we certainly haven’t covered all of them

Using new, heterogeneity-robust estimators typically requires a customized approach in complicated settings (unlike TWFE regressions)

In my view, this is a feature of the new approaches (rather than a weakness). As researchers, I think we should grapple with the complexity of the problems that we are studying

- In all likelihood, if you run a TWFE regression, it is going to give you some kind of weighted average of underlying treatment effect parameters (with hard to understand/interpret weights).

What should you do?

My goal in this section is to provide at least a recipe for dealing with complicated treatment regimes

Step 1: Target disaggregated parameters

Step 2: If desired, choose an aggregated target parameter suitable to the application, and combine the underlying disaggregated parameters directly to recover it

Some ideas:

- Partial identification: It could be reasonable to assume that you know the sign of the selection bias. This can lead to (possibly) informative bounds on differences/derivatives/etc. between \(ATT(d|d)\) parameters

Conditioning on some covariates could make strong parallel trends more plausible.

- For length of school closures, strong parallel trends is probably more plausible conditional on being a rural county in the Southeast, or conditional on being a college town in the Midwest.


It’s possible to do some versions of DID with a continuous treatment without having access to a fully untreated group.

In this case, it is not possible to recover level effects like \(ATT(d|d)\).

However, notice that \[\begin{aligned}& \E[\Delta Y_i | D_i=d_h] - \E[\Delta Y_i | D_i=d_l] \\ &\hspace{50pt}= \Big(\E[\Delta Y_i | D_i=d_h] - \E[\Delta Y_i(0) | D_i=d_h]\Big) - \Big(\E[\Delta Y_i | D_i=d_l]-\E[\Delta Y_i(0) | D_i=d_l]\Big) \\ &\hspace{50pt}= ATT(d_h|d_h) - ATT(d_l|d_l)\end{aligned}\]

In words: comparing path of outcomes for those that experienced dose \(d_h\) to path of outcomes among those that experienced dose \(d_l\) (and not relying on having an untreated group) delivers the difference between their \(ATT\)’s.

Still face issues related to selection bias / strong parallel trends though
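A minimal sketch of this comparison with no untreated units in the data (hypothetical doses and outcome changes):

```python
# Sketch: with no dose-0 units, comparing outcome paths across two dose
# groups identifies the DIFFERENCE ATT(d_h|d_h) - ATT(d_l|d_l),
# not either level effect on its own.
from statistics import mean

# (dose, dY) pairs; note there are no untreated (dose-0) units
data = [(1.0, 2.0), (1.0, 4.0), (2.0, 7.0), (2.0, 9.0)]

diff = (mean(dy for d, dy in data if d == 2.0)
        - mean(dy for d, dy in data if d == 1.0))
print(diff)   # 8 - 3 = 5.0
```

As the text notes, interpreting this difference as a causal response still requires strong parallel trends (or otherwise confronting the selection bias terms).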


Strategies like binarizing the treatment can still work (though be careful!)

If you classify units as being treated or untreated, you can recover the \(ATT\) of being treated at all.

On the other hand, if you classify units as being “high” treated, “low” treated, or untreated — our arguments imply that selection bias terms can come up when comparing effects for “high” to “low”


That the expressions for \(ATE(d)\) and \(ATT(d|d)\) are exactly the same also means that we cannot use pre-treatment periods to try to distinguish between “standard” and “strong” parallel trends. In particular, the relevant information that we have for testing each one is the same

- In effect, the only testable implication of strong parallel trends in pre-treatment periods is standard parallel trends.


This is a simplified version of Acemoglu and Finkelstein (2008)

1983 Medicare reform that eliminated labor subsidies for hospitals

Medicare moved to the Prospective Payment System (PPS) which replaced “full cost reimbursement” with “partial cost reimbursement” which eliminated reimbursements for labor (while maintaining reimbursements for capital expenses)

Rough idea: This changes relative factor prices which suggests hospitals may adjust by changing their input mix. Could also have implications for technology adoption, etc.

In the paper, we provide some theoretical arguments, concerning properties of production functions, that suggest strong parallel trends holds.

Hospital-reported data from the American Hospital Association, yearly from 1980-1986

Outcome is capital/labor ratio

proxied using the depreciation share of total operating expenses (avg. 4.5%)

Our setup: collapse to two periods by taking the average over pre-treatment periods and the average over post-treatment periods

Dose is “exposure” to the policy

the number of Medicare patients in the period before the policy was implemented

roughly 15% of hospitals are untreated (have essentially no Medicare patients)

- AF provide results both using and not using these hospitals: (good) it is useful to have untreated hospitals; (bad) they are fairly different (federal, long-term, psychiatric, children’s, and rehabilitation hospitals)