Advanced Panel Data Methods

Brantly Callaway, University of Georgia

August 16, 2023

Advanced Causal Inference Workshop at Northwestern University

Part 4: More Complicated Treatment Regimes

\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }

Introduction

The discussion (and much of the recent DID literature) has focused on the setting with staggered treatment adoption.

However, this certainly does not cover the full range of possible treatments. In Part 4, we'll primarily consider two leading extensions:

  1. A treatment that is multi-valued or continuous (e.g., minimum wage has this flavor)

  2. A treatment that can turn on and off (e.g., union status)

A couple of things to notice as we go along:

  • I'm not going to cover much on TWFE regressions here. They have even more ways to go wrong in these settings.

  • Try to pay attention to the pattern. Even though the arguments are getting more complicated, we are still following the same idea: (i) target disaggregated parameters, (ii) combine them into lower dimensional objects, and (iii) here there will also be some additional interpretation issues to emphasize

Continuous Treatment Notation

Potential outcomes notation

  • Two time periods: t^* and t^*-1

    • No one treated until period t^*

    • Some units remain untreated in period t^*

  • Potential outcomes: Y_{it^*}(d)

  • Observed outcomes: Y_{it^*} and Y_{it^*-1}

    Y_{it^*}=Y_{it^*}(D_i) \quad \textrm{and} \quad Y_{it^*-1}=Y_{it^*-1}(0)

Parameters of Interest (ATT-type)

  • Level Effects (Average Treatment Effect on the Treated)

    ATT(d|d) := \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]

    • Interpretation: The average effect of dose d relative to not being treated local to the group that actually experienced dose d

    • This is the natural analogue of ATT in the binary treatment case

Parameters of Interest (ATT-type)

  • Slope Effect (Average Causal Response on the Treated)

    ACRT(d|d) := \frac{\partial ATT(l|d)}{\partial l} \Big|_{l=d}

    • Interpretation: ACRT(d|d) is the causal effect of a marginal increase in dose local to units that actually experienced dose d

We can view ACRT(d|d) as the "building block" here. An aggregated version of it (into a single number) is \begin{align*} ACRT^O := \E[ACRT(D|D)|D>0] \end{align*}

  • ACRT^O averages ACRT(d|d) over the population distribution of the dose among units with a positive dose

  • Like ATT^O for staggered treatment adoption, ACRT^O is the natural target parameter for the TWFE regression in this case

Identification

"Standard" Parallel Trends Assumption For all d,

\mathbb{E}[\Delta Y_{t^*}(0) | D=d] = \mathbb{E}[\Delta Y_{t^*}(0) | D=0]

Then,

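\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[\Delta Y_{t^*}(0) | D=0]\\ &= \E[\Delta Y_{t^*} | D=d] - \E[\Delta Y_{t^*} | D=0] \end{aligned}

The second line adds and subtracts \E[Y_{t^*-1}(0) | D=d], the third uses parallel trends, and the last replaces potential outcomes with observed outcomes. This is exactly what you would expect: the average change in outcomes for units with dose d minus the average change for untreated units.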
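
A minimal sketch of the sample analogue of this expression, assuming a discrete dose so that group means are well defined (with a truly continuous dose you would need some smoothing or a parametric model instead). The function name `att_by_dose`, the input format, and the toy numbers are illustrative assumptions, not something from the slides.

```python
from collections import defaultdict
from statistics import mean


def att_by_dose(records):
    """Estimate ATT(d|d) = E[dY | D=d] - E[dY | D=0] for each observed dose d > 0.

    `records` is an iterable of (dose, delta_y) pairs, where delta_y is the
    outcome change between periods t*-1 and t* (a hypothetical input format).
    """
    changes = defaultdict(list)
    for dose, delta_y in records:
        changes[dose].append(delta_y)
    if 0 not in changes:
        raise ValueError("need untreated units (dose 0) as the comparison group")
    untreated_trend = mean(changes[0])
    return {d: mean(v) - untreated_trend for d, v in sorted(changes.items()) if d != 0}


# Toy data, made up for illustration.
records = [(0, 0.5), (0, 1.5), (1, 2.0), (1, 3.0), (2, 4.0), (2, 5.0)]
estimates = att_by_dose(records)
print(estimates)  # {1: 1.5, 2: 3.5}
```

Each entry is a separate 2x2 comparison of one dose group against the untreated group.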

Are we done?

Unfortunately, no

Most applied work with a multi-valued or continuous treatment wants to think about how causal responses vary across dose

  • For example, plot treatment effects as a function of dose

    • Does more dose tend to increase, decrease, or not affect outcomes?
  • Average causal response parameters inherently involve comparisons across slightly different doses

Interpretation Issues

Consider comparing ATT(d|d) for two different doses

\begin{aligned} & ATT(d_h|d_h) - ATT(d_l|d_l)\\ & \hspace{25pt} = \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l) | D=d_h]}_{\textrm{Causal Response}} + \underbrace{ATT(d_l|d_h) - ATT(d_l|d_l)}_{\textrm{Selection Bias}} \end{aligned}

"Standard" Parallel Trends is not strong enough to rule out the selection bias terms here

  • Implication: If you want to interpret differences in treatment effects across different doses, then you will need stronger assumptions than standard parallel trends

  • This problem spills over into identifying ACRT(d|d)

Positive side-comment: ATT(d_h|d_h) - ATT(d_l|d_l) = \E[\Delta Y_{t^*} | D=d_h] - \E[\Delta Y_{t^*} | D=d_l] (which doesn't involve the untreated group)
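
To make the decomposition concrete, here is a small arithmetic sketch. All numbers are hypothetical, chosen so that units selecting the high dose also benefit more from any given dose; nothing here is estimated from data.

```python
# Hypothetical values of ATT(l|d): the average effect of dose l (relative to
# no treatment) for the group that actually chose dose d. Keys are (l, d).
D_LOW, D_HIGH = 1, 2
att = {
    (D_LOW, D_LOW): 1.0,    # ATT(d_l|d_l): identified under standard parallel trends
    (D_HIGH, D_HIGH): 4.0,  # ATT(d_h|d_h): identified under standard parallel trends
    (D_LOW, D_HIGH): 2.0,   # ATT(d_l|d_h): a counterfactual, not identified
}

observed_difference = att[(D_HIGH, D_HIGH)] - att[(D_LOW, D_LOW)]
# E[Y(d_h) - Y(d_l) | D = d_h] = ATT(d_h|d_h) - ATT(d_l|d_h)
causal_response = att[(D_HIGH, D_HIGH)] - att[(D_LOW, D_HIGH)]
selection_bias = att[(D_LOW, D_HIGH)] - att[(D_LOW, D_LOW)]

print(observed_difference, causal_response, selection_bias)  # 3.0 2.0 1.0
```

In this made-up example, the comparison across doses (3.0) overstates the causal response for the high-dose group (2.0) because the high-dose group would also have gained more from the low dose.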

Interpretation Issues

Intuition:

  • Difference-in-differences identification strategies result in ATT(d|d) parameters. These are local parameters and are difficult to compare to each other

  • This explanation is similar to thinking about LATEs with two different instruments

  • Thus, comparing ATT(d|d) across different values is tricky and not for free

What can you do?

  • One idea: just recover ATT(d|d) and interpret it cautiously (interpret it by itself, not relative to different values of d)

  • If you want to compare them to each other, it will come with the cost of additional (structural) assumptions

Introduce Stronger Assumptions

"Strong" Parallel Trends For all doses d and l,

\mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=l] = \mathbb{E}[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d]

  • This is notably different from "Standard" Parallel Trends

  • It involves potential outcomes for all values of the dose (not just untreated potential outcomes)

  • All dose groups would have experienced the same path of outcomes had they been assigned the same dose

Introduce Stronger Assumptions

Strong parallel trends implies a version of treatment effect homogeneity. Notice:

\begin{aligned} ATT(d|d) &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=d] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=d]\\ &= \E[Y_{t^*}(d) - Y_{t^*-1}(0) | D=l] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=l]\\ &= \E[Y_{t^*}(d) - Y_{t^*}(0) | D=l] = ATT(d|l) \end{aligned}

Since this holds for all d and l, it also implies that ATT(d|d) = ATE(d) = \E[Y_{t^*}(d) - Y_{t^*}(0)]. Thus, under strong parallel trends, we have that

ATE(d) = \E[\Delta Y_{t^*}|D=d] - \E[\Delta Y_{t^*}|D=0]

The right-hand side is exactly the same expression as for ATT(d|d) under "standard" parallel trends, but here

  • assumptions are different

  • parameter interpretation is different

Comparisons across dose

ATE-type parameters do not suffer from the same issues as ATT-type parameters when making comparisons across dose

\begin{aligned} ATE(d_h) - ATE(d_l) &= \E[Y_{t^*}(d_h) - Y_{t^*}(0)] - \E[Y_{t^*}(d_l) - Y_{t^*}(0)]\\ &= \underbrace{\E[Y_{t^*}(d_h) - Y_{t^*}(d_l)]}_{\textrm{Causal Response}} \end{aligned}

Thus, recovering ATE(d) side-steps the issues about comparing treatment effects across doses, but it comes at the cost of needing a (potentially very strong) extra assumption

Given that we can compare ATE(d)'s across dose, we can recover slope effects in this setting

\begin{aligned} ACR(d) := \frac{\partial ATE(d)}{\partial d} \qquad &\textrm{or} \qquad ACR^O := \E[ACR(D) | D>0] \end{aligned}
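
A rough sketch of how these slope parameters could be computed when the dose takes a few discrete values. Two assumptions are mine rather than the slides': the derivative is replaced by a finite difference between adjacent doses, and ACR^O is computed as a weighted average using the share of treated units at each dose. Strong parallel trends is what lets the group-mean comparisons be read as ATE(d).

```python
from collections import defaultdict
from statistics import mean


def acr_discrete(records):
    """Finite-difference analogue of ACR(d) and ACR^O for a discrete dose.

    `records` holds (dose, delta_y) pairs (hypothetical input format).
    Returns (ate, acr, acr_overall), where acr[d] compares dose d with the
    next-lowest observed dose.
    """
    changes = defaultdict(list)
    for dose, delta_y in records:
        changes[dose].append(delta_y)
    doses = sorted(changes)
    if not doses or doses[0] != 0:
        raise ValueError("need untreated units (dose 0) as the comparison group")
    untreated_trend = mean(changes[0])
    # Under strong parallel trends, this difference identifies ATE(d).
    ate = {d: mean(changes[d]) - untreated_trend for d in doses}
    acr = {cur: (ate[cur] - ate[prev]) / (cur - prev)
           for prev, cur in zip(doses, doses[1:])}
    n_treated = sum(len(changes[d]) for d in acr)
    acr_overall = sum(acr[d] * len(changes[d]) / n_treated for d in acr)
    return ate, acr, acr_overall


# Toy data, made up for illustration.
records = [(0, 0.5), (0, 1.5), (1, 2.0), (1, 3.0), (2, 4.0), (2, 5.0), (2, 6.0)]
ate, acr, acr_overall = acr_discrete(records)
print(ate, acr, acr_overall)
```

Without strong parallel trends, the same arithmetic still runs, but the differences across doses would mix causal responses with selection bias.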

TWFE Regressions in this Context

Consider the same TWFE regression (but now D_{it} is continuous):

\begin{align*} Y_{it} = \theta_t + \eta_i + \alpha D_{it} + e_{it} \end{align*}

You can show that

\begin{align*} \alpha = \int_{\mathcal{D}_+} w(l) m'_\Delta(l) \, dl \end{align*}

where m_\Delta(l) := \E[\Delta Y_{t^*}|D=l] - \E[\Delta Y_{t^*}|D=0] and w(l) are weights

  • Under standard parallel trends, m'_{\Delta}(l) = ACRT(l|l) + \textrm{local selection bias}

  • Under strong parallel trends, m'_{\Delta}(l) = ACR(l).

Thus, issues related to selection bias continue to show up here

About the weights: they are all positive, but have some strange properties (e.g., they are always maximized at l = \E[D], even if this is not a common value for the dose)

  • \implies even under strong parallel trends, \alpha \neq ACR^O.
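
As a sanity check on what \alpha depends on in the two-period setup (no one treated in the first period), here is a sketch showing numerically that the TWFE coefficient equals the slope from regressing \Delta Y on D. The simulated data-generating process and the function names are assumptions made up for illustration; the sketch does not compute the weights w(l).

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    xbar = sum(x) / len(x)
    ybar = sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx


def twfe_slope(y_pre, y_post, d_pre, d_post):
    """TWFE coefficient on D in a balanced two-period panel via two-way demeaning."""
    n = len(y_pre)

    def demean(pre, post):
        unit_means = [(a + b) / 2 for a, b in zip(pre, post)]
        pre_mean, post_mean = sum(pre) / n, sum(post) / n
        grand_mean = (pre_mean + post_mean) / 2
        pre_dm = [a - u - pre_mean + grand_mean for a, u in zip(pre, unit_means)]
        post_dm = [b - u - post_mean + grand_mean for b, u in zip(post, unit_means)]
        return pre_dm + post_dm

    y_dm = demean(y_pre, y_post)
    d_dm = demean(d_pre, d_post)
    return sum(a * b for a, b in zip(d_dm, y_dm)) / sum(a * a for a in d_dm)


# Simulated data (made up): 30% untreated, the rest draw a continuous dose.
random.seed(0)
n = 400
dose = [0.0 if random.random() < 0.3 else random.uniform(0.1, 2.0) for _ in range(n)]
unit_effect = [random.gauss(0, 1) for _ in range(n)]
y_pre = [a + random.gauss(0, 1) for a in unit_effect]
y_post = [a + 0.5 + d ** 2 + random.gauss(0, 1) for a, d in zip(unit_effect, dose)]

alpha_twfe = twfe_slope(y_pre, y_post, [0.0] * n, dose)
alpha_fd = ols_slope(dose, [b - a for a, b in zip(y_pre, y_post)])
print(alpha_twfe, alpha_fd)
```

The two numbers agree up to floating-point error, so in this setup \alpha is a single linear-regression summary of how \Delta Y varies with the dose.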

TWFE Regressions in this Context

Other issues can arise in more complicated cases

  • For example, suppose you have a staggered continuous treatment, then you will additionally get issues that are analogous to the ones we discussed earlier for a binary staggered treatment

  • In general, things get worse for TWFE regressions with more complications

Summarizing

  • It is straightforward/familiar to identify ATT-type parameters with a multi-valued or continuous dose

  • However, comparisons of ATT-type parameters across different doses are hard to interpret

    • They include selection bias terms

    • This issue extends to identifying ACRT parameters

    • These issues extend to TWFE regressions

  • This suggests targeting ATE-type parameters

Example 2: Units can move in and out of the treatment

"Scarring" vs. Moving in and out of treatment

Example treatments:

  • Union status (Vella and Verbeek, 1998)

  • Whether or not location hit by hurricane (Deryugina, 2017)

  • Whether or not a district shares the same ethnicity as the president of the country (Burgess, et al., 2015)

Additional Notation:

We can make a lot of progress by redefining our notion of a "group"

  • Keep track of entire treatment regime \mathbf{D}_i := (D_{i1}, \ldots, D_{i\mathcal{T}})' and/or treatment history up to period t: \mathbf{D}_{it} := (D_{i1}, \ldots, D_{it})'.

  • Potential outcomes Y_{it}(\mathbf{d}_t) where \mathbf{d}_t is some treatment history up to period t (this notation imposes "no anticipation" --- potential outcomes do not depend on future treatments). Observed outcomes: Y_{it} = Y_{it}(\mathbf{D}_{it})

Example 2: Units can move in and out of the treatment

A little more notation...

  • \mathcal{D}_t \subseteq \{0,1\}^t is the set of all possible treatment histories up to period t. As earlier, we will exclude units that are treated in the first period (I'll briefly come back to this later)

  • \mathbf{0}_t denotes not participating in the treatment in any period up to period t

In this case, we'll define groups by their treatment histories \mathbf{d}_t. Thus, we can consider group-time average treatment effects defined by \begin{align*} ATT(\mathbf{d}_t, t) := \E[Y_{it}(\mathbf{d}_t) - Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{d}_t] \end{align*}

Example 2: Units can move in and out of the treatment

Parallel Trends Assumption: For all t=2,\ldots,\mathcal{T}, and for all \mathbf{d}_t \in \mathcal{D}_t, \begin{align*} \E[\Delta Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{d}_t] = \E[\Delta Y_{it}(\mathbf{0}_t) | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}

Identification: In this setting, under the parallel trends assumption, we have that \begin{align*} ATT(\mathbf{d}_t, t) = \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{d}_t] - \E[Y_{it} - Y_{i1} | \mathbf{D}_{it} = \mathbf{0}_t] \end{align*}

This argument is straightforward and analogous to what we have done before. However...
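
A minimal sketch of the sample analogue of ATT(\mathbf{d}_t, t), assuming each unit is stored as a pair of tuples (treatments, outcomes) in chronological order. The function name `att_history`, the data layout, and the toy panel are hypothetical.

```python
from statistics import mean


def att_history(units, history):
    """Estimate ATT(d_t, t) with t = len(history).

    Compares the long difference Y_t - Y_1 for units whose treatment history
    up to t equals `history` with that of units untreated through period t.
    """
    target = tuple(history)
    t = len(target)
    never_treated = (0,) * t

    def mean_long_diff(h):
        diffs = [y[t - 1] - y[0] for d, y in units if tuple(d[:t]) == h]
        if not diffs:
            raise ValueError(f"no units with treatment history {h}")
        return mean(diffs)

    return mean_long_diff(target) - mean_long_diff(never_treated)


# Toy panel with three periods, made up for illustration.
units = [
    ((0, 0, 0), (1.0, 2.0, 3.0)),
    ((0, 0, 0), (2.0, 3.0, 4.0)),
    ((0, 1, 0), (1.0, 4.0, 3.5)),
    ((0, 1, 0), (3.0, 6.0, 5.5)),
    ((0, 1, 1), (0.0, 3.0, 5.0)),
]
print(att_history(units, (0, 1)))     # 2.0
print(att_history(units, (0, 1, 0)))  # 0.5
print(att_history(units, (0, 1, 1)))  # 3.0
```

Note that the group with history (0, 1) in period 2 splits into two groups in period 3, which previews the complications discussed next.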

Example 2: Units can move in and out of the treatment

There are a number of additional complications that arise here.

  1. There are way more possible groups here than in the staggered treatment case (you can think of this as leading to a kind of curse of dimensionality)

    • \implies small groups \implies imprecise estimates and (possibly) invalid inferences

    • also makes it harder to report the results

  2. The previous point provides an additional reason to try to aggregate the group-time average treatment effects. However, this is also not so straightforward.

    • This is an area of active research (e.g., de Chaisemartin and d'Haultfoeuille (2023) and Yanagi (2023))

    • Some ideas below...but the literature has not converged here yet
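
To get a feel for the dimensionality problem, this sketch counts the possible treatment histories up to period t, excluding units treated in the first period, with and without the staggered-adoption restriction. The function names are illustrative.

```python
from itertools import product


def on_off_histories(t):
    """All treatment histories up to period t with D_1 = 0."""
    return [(0,) + rest for rest in product((0, 1), repeat=t - 1)]


def staggered_histories(t):
    """Histories allowed under staggered adoption: once treated, always treated."""
    return [h for h in on_off_histories(t) if all(a <= b for a, b in zip(h, h[1:]))]


for t in (2, 5, 10):
    print(t, len(staggered_histories(t)), len(on_off_histories(t)))
# 2 2 2
# 5 5 16
# 10 10 512
```

The number of groups grows linearly in t under staggered adoption but exponentially when units can move in and out of treatment.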

Example 2: Units can move in and out of the treatment

Probably the simplest approach is to just make groups on the basis of the first period when a unit experiences the treatment

  • We have (kind of) been doing this in our minimum wage application

  • Lots of papers (e.g., job displacement, hospitalization) have used this idea

  • Formally, it amounts to averaging over all subsequent treatment decisions (de Chaisemartin and d'Haultfoeuille (2023))

But there are other ideas too. Suppose that you were interested in the average treatment effect of experiencing some cumulative number of treatments over time (e.g., how many years someone was in a union).

Example 2: Units can move in and out of the treatment

Define \sigma_t(\mathbf{d}_t) := \displaystyle \sum_{s=1}^t d_s --- \sigma_t(\cdot) is a function that adds up the cumulative number of treatments up to period t for treatment history \mathbf{d}_t.

We will target the average treatment effect of having experienced exactly \sigma treatments by period t.

Towards this end, also define \mathcal{D}_t^\sigma = \{\mathbf{d}_t \in \mathcal{D}_t : \sigma_t(\mathbf{d}_t) = \sigma\} --- this is the set of treatment histories that result in \sigma cumulative treatments in period t. Then, consider

\begin{align*} ATT^{sum}(\sigma, t) = \sum_{\mathbf{d}_t \in \mathcal{D}_t^\sigma} ATT(\mathbf{d}_t, t) \P(\mathbf{D}_{it}=\mathbf{d}_t | \mathbf{D}_{it} \in \mathcal{D}_t^\sigma) \end{align*}

This is the average ATT(\mathbf{d}_t,t) across treatment regimes that lead to exactly \sigma treatments by period t

Similar to previous cases, ATT^{sum}(\sigma,t) is a weighted average of underlying 2x2 DID parameters

Averaging like this reduces the number of groups, and makes the estimation problem discussed above easier (the "effective" number of units is larger)
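
A sketch of the aggregation step, assuming the same hypothetical data layout as a list of (treatments, outcomes) tuples per unit. The weights are the sample shares of each treatment history among units with exactly \sigma cumulative treatments by period t; the function name `att_sum` and the toy panel are made up.

```python
from collections import defaultdict
from statistics import mean


def att_sum(units, sigma, t):
    """Estimate ATT^sum(sigma, t) as a weighted average of ATT(d_t, t)."""
    never_treated = (0,) * t
    long_diffs = defaultdict(list)
    for d, y in units:
        long_diffs[tuple(d[:t])].append(y[t - 1] - y[0])
    if never_treated not in long_diffs:
        raise ValueError("need units untreated through period t")
    baseline = mean(long_diffs[never_treated])
    # Treatment histories with exactly sigma treated periods by period t.
    matching = {h: v for h, v in long_diffs.items() if sum(h) == sigma}
    if not matching:
        raise ValueError(f"no units with {sigma} cumulative treatments by period {t}")
    n_sigma = sum(len(v) for v in matching.values())
    return sum((mean(v) - baseline) * len(v) / n_sigma for v in matching.values())


# Toy panel with three periods, made up for illustration.
units = [
    ((0, 0, 0), (1.0, 2.0, 3.0)),
    ((0, 0, 0), (2.0, 3.0, 4.0)),
    ((0, 1, 0), (1.0, 4.0, 3.5)),
    ((0, 1, 0), (3.0, 6.0, 5.5)),
    ((0, 0, 1), (0.0, 1.0, 5.0)),
    ((0, 1, 1), (0.0, 3.0, 6.0)),
]
print(att_sum(units, 1, 3))
print(att_sum(units, 2, 3))  # 4.0
```

With \sigma = 1 and t = 3, the result averages the estimates for histories (0, 1, 0) and (0, 0, 1) with weights 2/3 and 1/3.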

Example 2: Units can move in and out of the treatment

Even though ATT^{sum}(\sigma,t) (possibly substantially) reduces the dimensionality of the underlying group-time average treatment effect parameters, we might want to reduce more.

This is tricky though because the composition of the effective groups changes over time (just because two groups have the same number of cumulative treatments in one period doesn't mean that they have the same number in subsequent periods)

Example 2: Units can move in and out of the treatment

An alternative idea is to just report treatment effect parameters in the last period: ATT^{sum}(\sigma,\mathcal{T}) as a function of \sigma.

  • This would be something that you could report in a two-dimensional plot

Unlike the staggered treatment adoption case, where ATT^{ES}(e) and ATT^O seem like good default parameters to report, it is not clear to me what (or if there is) a good default choice here.

  • However, if I were writing a paper, I would (i) show disaggregated results, (ii) argue for some particular aggregated parameter and choose weights on the disaggregated parameters that target this parameter

Another caution is that (I presume) comparing ATT-type parameters across different amounts of the treatment (here across \sigma) will introduce selection bias terms except under additional assumptions

  • e.g., saying that, on average, participating in a union for 10 years increased earnings by some amount and participating in a union for 5 years increased earnings by another amount is one thing; causally attributing the difference to "longer union participation" (probably) takes more assumptions

Extensions

Notice that above, we only invoked parallel trends with respect to untreated potential outcomes.

But it seems within the spirit of DID to assume parallel trends for staying at the same treatment over time

  • Then we can recover group-time average treatment effects for switchers relative to stayers

  • See de Chaisemartin et al. (2022) and de Chaisemartin and d'Haultfoeuille (2023) for approaches along these lines

This results in many more disaggregated treatment effect parameters

Summary

We've covered a number of different settings, but we certainly haven't covered all of them

  • Ex. Suppose you have a multi-valued treatment that can change values over time

  • I'm not sure what exactly to do off the top of my head (and the exact thing to do likely depends on the particular goals of the application), but I think that you can get some ideas from extrapolating our discussion:

    • Step 1: Target disaggregated parameters

    • Step 2: If desired, choose an aggregated target parameter suitable to the application, and combine the underlying disaggregated parameters directly to recover this parameter

Ideas for Weakening Strong Parallel Trends

Idea 1: Partial Identification. In some applications, it may seem reasonable to think that you know the sign of the selection bias. If this "works against" the sign of differences in m_\Delta(d) as d increases, this implies that you could still sign differences in ATT(d|d) as d increases

Idea 2: Strong PT Conditional on Covariates. It might be reasonable to assume strong parallel trends conditional on some other variable.

  • Example: For the minimum wage, it might be reasonable to assume that strong parallel trends holds across states within the same region of the country (say, West or South)

    • Evidence in favor of this is that the distributions of minimum wage policies differ substantially across regions

    • Wouldn't be able to make "full" comparison across all doses here, but could learn about employment effects of a $15.75 MW (Washington) relative to $13.20 (Oregon).

DID using Stayers and Switchers

Parallel Trends for Stayers: \begin{align*} \E[Y_t(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{it-1} = \mathbf{d}_{t-1}] = \E[Y_t(d_{t-1},\mathbf{d}_{t-1}) - Y_{t-1}(\mathbf{d}_{t-1}) | \mathbf{D}_{it} = (d_{t-1},\mathbf{d}_{t-1})] \end{align*}

In this case, you can recover the ATT for switchers: (here we are supposing that d_{t-1}=0, but can make an analogous argument in the opposite case) \begin{align*} ATT^{switchers}(\mathbf{d}_{t-1},t) &= \E[Y_{it}(1,\mathbf{d}_{t-1}) - Y_{it}(0,\mathbf{d}_{t-1}) | \mathbf{D}_{it} = (1,\mathbf{d}_{t-1})] \\ &\overset{\textrm{PTA}}{=} \E[\Delta Y_{it} | \mathbf{D}_{it}=(1,\mathbf{d}_{t-1})] - \E[\Delta Y_{it} | \mathbf{D}_{it}=(0,\mathbf{d}_{t-1})] \end{align*} That is, you can recover ATT^{switchers} by comparing the paths of outcomes for switchers to the path of outcomes for stayers (exactly what you'd expect!)

Given this sort of assumption, there may be a huge number of ATT^{switchers}(\mathbf{d}_{t-1},t) in realistic applications.

  • You could use these to further understand treatment effect heterogeneity

  • You could also propose some way to aggregate them into a lower dimensional object
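
A minimal sketch of the switchers-versus-stayers comparison for the case d_{t-1}=0, assuming units are stored as (treatments, outcomes) tuples in chronological order (so the slides' (1,\mathbf{d}_{t-1}) corresponds to the prior history followed by a 1). The function name `att_switchers` and the toy panel are hypothetical.

```python
from statistics import mean


def att_switchers(units, prior_history):
    """DID comparing units that switch into treatment in period t with stayers.

    `prior_history` is the shared treatment history through period t-1 and
    must end in 0, matching the case d_{t-1} = 0.
    """
    prior = tuple(prior_history)
    if not prior or prior[-1] != 0:
        raise ValueError("this sketch only covers the case d_{t-1} = 0")
    t = len(prior) + 1

    def mean_change(history):
        changes = [y[t - 1] - y[t - 2] for d, y in units if tuple(d[:t]) == history]
        if not changes:
            raise ValueError(f"no units with treatment history {history}")
        return mean(changes)

    # Switchers move to treatment in period t; stayers remain untreated.
    return mean_change(prior + (1,)) - mean_change(prior + (0,))


# Toy panel with three periods, made up for illustration.
units = [
    ((0, 0, 0), (1.0, 2.0, 3.0)),
    ((0, 0, 0), (2.0, 3.0, 4.0)),
    ((0, 0, 1), (0.0, 1.0, 5.0)),
    ((0, 1, 0), (1.0, 4.0, 3.5)),
    ((0, 1, 1), (0.0, 3.0, 6.0)),
]
print(att_switchers(units, (0, 0)))  # 3.0
print(att_switchers(units, (0,)))    # 2.0
```

Each distinct prior history and period gives its own estimate, which is why the number of such parameters can become large.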
