Difference-in-Differences Workshop

Session 2: Using Covariates in Difference-in-Differences Identification Strategies

Brantly Callaway

University of Georgia

Additional Resources

Additional Workshop Materials: https://bcallaway11.github.io/uga-cbai-workshop/

  • Slides, code, data, etc.

General References:

  • Callaway (2023), Handbook of Labor, Human Resources and Population Economics

  • Baker, Callaway, Cunningham, Goodman-Bacon, Sant’Anna (2024), draft posted very soon

Specific References:

  • Caetano and Callaway (2023) for interpreting TWFE regressions with covariates

  • Callaway and Sant’Anna (2021) and Sant’Anna and Zhao (2020) for alternative estimators

  • Caetano, Callaway, Payne, and Sant’Anna (2022) for “bad controls”

Review of Session 1

  1. Difference-in-Differences with Two Periods and Two Groups

  2. Extensions to Staggered Treatment Adoption

    • Limitations of TWFE Regressions
    • Introduction to newer approaches
  3. Application about Effects of Minimum Wage Policies on Employment

\(\newcommand{\E}{\mathbb{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)

Notation

Data:

  • 2 periods: \(t=1\), \(t=2\)

    • No one treated until period \(t=2\)
    • Some units remain untreated in period \(t=2\)
  • \(D_{it}\) treatment indicator in period \(t\)

  • 2 groups: \(G_i=1\) or \(G_i=0\) (treated and untreated)

Potential Outcomes: \(Y_{it}(1)\) and \(Y_{it}(0)\)

Observed Outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)

\[\begin{align*} Y_{it=2} = G_i Y_{it=2}(1) +(1-G_i)Y_{it=2}(0) \quad \textrm{and} \quad Y_{it=1} = Y_{it=1}(0) \end{align*}\]

Target Parameter

Average Treatment Effect on the Treated: \[ATT = \E[Y_{t=2}(1) - Y_{t=2}(0) | G=1]\]

Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group

Pushing the expectation through the difference, we have that: \[\begin{align*} ATT = \underbrace{\E[Y_{t=2}(1) | G=1]}_{\textrm{Easy}} - \underbrace{\E[Y_{t=2}(0) | G=1]}_{\textrm{Hard}} \end{align*}\]

Part 1: Identification with Two Periods

Additional Notation for Covariates

Start with the case with only two time periods

More notation about covariates:

  • \(X_{it=2}\) and \(X_{it=1}\) — time-varying covariates

  • \(Z_i\) — time-invariant covariates

What We’ll Do in this Part

Two identification results:

  1. Direct identification strategy:

    • Recover \(ATT\) by applying the DiD identification strategy at each value of the covariates and then averaging
  2. Indirect identification strategy:

    • Based on re-weighting the untreated group to make it have the same distribution of covariates as the treated group

These different identification strategies will suggest alternative ways to estimate \(ATT\), which we will return to later…

Identification Strategy 2: Covariate Balancing

Example: Momentarily, suppose that the distribution of \(X\) was the same for both groups, then

\[ \begin{aligned} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \hspace{150pt}\\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X, G=0 ] \Big| G=1\Big]\\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X, G=0 ] \Big| G=0\Big]\\ &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=0] \end{aligned} \]

\(\implies\) (even under conditional parallel trends) we can recover \(ATT\) by just directly comparing paths of outcomes for treated and untreated groups.

Alternative Identification Strategy: Covariate Balancing

More generally: We would not expect the distribution of covariates to be the same across groups.

However, the idea of covariate balancing is to come up with balancing weights \(\nu_0(X)\) such that, after applying them, the distribution of \(X\) in the untreated group is the same as in the treated group. Then we would have that

\[ \begin{aligned} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \hspace{150pt}\\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X, G=0 ] \Big| G=1\Big]\\ &= \E[\Delta Y | G=1] - \E\Big[ \nu_0(X) \E[\Delta Y(0) | X, G=0 ] \Big| G=0\Big]\\ &= \E[\Delta Y | G=1] - \E[\nu_0(X) \Delta Y(0) | G=0] \end{aligned} \]

\(\implies\) We can recover \(ATT\) by re-weighting the untreated group to have the same distribution of covariates as the treated group and then averaging

Discussion

The arguments above suggest that, in order to estimate the \(ATT\), we will either need to

  1. Correctly model \(\E[\Delta Y(0) | X, G=0]\) (i.e., specify a model for \(m_0(X)\))

  2. Balance the distribution of \(X\) to be the same for the untreated group relative to the treated group.

We will pursue these approaches soon, but next we will switch to considering TWFE regressions that include covariates

Part 2: Limitations of TWFE Regressions

Limitations of TWFE Regressions

In this setting, it is common to run the following TWFE regression:

\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\]

However, there are a number of issues:

Issue 1: Issues related to multiple periods and variation in treatment timing still arise

Issue 2: Hard to allow parallel trends to depend on time-invariant covariates

Issue 3: Hard to allow for covariates that could be affected by the treatment

Issues 4 & 5: (harder to see) Can perform poorly when time-varying covariates are included in the parallel trends assumption

Limitations of TWFE Regressions

Focusing on the case with two periods, to estimate the model, we take first-differences to eliminate the unit fixed effects and ultimately estimate the regression

\[\Delta Y_{it} = \Delta \theta_t + \alpha D_{it} + \Delta X_{it}'\beta + \Delta e_{it}\]
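As a quick numerical check of the claim that, with two periods, the TWFE regression and the first-difference regression deliver the same \(\alpha\), here is a minimal base-R sketch. The simulated data, variable names, and parameter values are all illustrative assumptions and are not part of the workshop materials.

```r
# Simulated two-period panel (all numbers are illustrative assumptions)
set.seed(1)
n   <- 200
G   <- rbinom(n, 1, 0.5)                  # treated-group indicator
x1  <- rnorm(n)                           # covariate in period 1
x2  <- x1 + 0.3 * G + rnorm(n, sd = 0.5)  # covariate in period 2
eta <- rnorm(n)                           # unit fixed effect
y1  <- eta + x1 + rnorm(n)
y2  <- eta + 0.5 + x2 - 0.2 * G + rnorm(n)

# Long format: D_it = 0 for everyone in period 1, D_it = G_i in period 2
long <- data.frame(
  id   = rep(seq_len(n), times = 2),
  year = rep(1:2, each = n),
  y    = c(y1, y2),
  x    = c(x1, x2),
  D    = c(rep(0, n), G)
)

# TWFE regression with unit and time dummies
twfe <- lm(y ~ D + x + factor(id) + factor(year), data = long)

# First-difference regression; the intercept plays the role of Delta theta_t
fd <- lm(I(y2 - y1) ~ G + I(x2 - x1))

alpha_twfe <- unname(coef(twfe)["D"])
alpha_fd   <- unname(coef(fd)["G"])
c(twfe = alpha_twfe, first_difference = alpha_fd)
```

The two coefficients agree up to numerical precision, which is why the discussion can focus on the first-difference form.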

Building on work about interpreting cross-sectional regressions \(Y_i = \alpha D_i + X_i'\beta + e_i\), in the presence of treatment effect heterogeneity

  • Angrist (1998), Aronow and Samii (2016), Słoczyński (2022), Chattopadhyay and Zubizarreta (2023), Blandhol et al. (2022), Hahn (2023), among others

we can provide a useful decomposition of \(\alpha\) \(\rightarrow\)

Limitations of TWFE Regressions

Can show that the coefficient \(\alpha\) in the TWFE regression can be decomposed as

\[ \small \alpha = \underbrace{\E\Big[ w_1(\Delta X) ATT(X_{t=2},X_{t=1},Z)\Big| G=1 \Big]}_{\textrm{weighted avg. of $ATT(X)$}} + \underbrace{\E\Big[ w_1(\Delta X) \Big( \E[\Delta Y | X_{t=2}, X_{t=1}, Z, G=0] - \L_0(\Delta Y | \Delta X) \Big) \Big| G=1 \Big]}_{\textrm{misspecification bias}}\]

where \[ w_1(\Delta X) := \frac{\big(1-\L(D|\Delta X)\big) \pi}{\E\big[(D-\L(D|\Delta X))^2\big]}\]

Comments:

  • It is possible for both weights to be negative, given that linear probability models can predict probabilities outside of the \([0,1]\) interval

  • These weights are easy to estimate as they only depend on linear projections
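To illustrate how little is needed to compute these weights, the sketch below estimates \(w_1(\Delta X)\) from a single linear projection in base R. The simulated data and variable names are assumptions made for illustration; in the two-period setup, \(D\) in the first-difference regression coincides with the group indicator \(G\).

```r
# Simulated data (illustrative assumptions only)
set.seed(42)
n  <- 1000
x1 <- rnorm(n)
G  <- rbinom(n, 1, plogis(0.5 * x1))  # treated-group indicator (D in period 2)
dx <- 0.3 * G + rnorm(n)              # change in the covariate, Delta X

# Linear projection of D on Delta X (a linear probability model)
L_hat  <- fitted(lm(G ~ dx))
pi_hat <- mean(G)
denom  <- mean((G - L_hat)^2)

# Implicit regression weights for treated units
w1 <- (1 - L_hat) * pi_hat / denom

mean_w1_treated <- mean(w1[G == 1])  # averages to 1 among treated units
share_negative  <- mean(w1[G == 1] < 0)
c(mean_w1_treated = mean_w1_treated, share_negative = share_negative)
```

The weights average to one among treated units by construction, but individual weights are negative whenever the linear projection exceeds one.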

About the first term:

Ideally, we would like \(w_1(\Delta X)=1\), which would imply that this term is equal to \(ATT\).

Relative to this baseline, these weights have some drawbacks:

  • The weights can be negative

  • The weights suffer from a form of weight reversal (e.g., Słoczyński (2022)):

Limitations of TWFE Regressions

\[ \small \alpha = \underbrace{\E\Big[ w_1(\Delta X) ATT(X_{t=2},X_{t=1},Z)\Big| G=1 \Big]}_{\textrm{weighted avg. of $ATT(X)$}} + \underbrace{\E\Big[ w_1(\Delta X) \Big( \E[\Delta Y | X_{t=2}, X_{t=1}, Z, G=0] - \L_0(\Delta Y | \Delta X) \Big) \Big| G=1 \Big]}_{\textrm{misspecification bias}}\]

The misspecification bias component is equal to 0 if either:

  1. \(\E[\Delta Y|X_{t=2},X_{t=1},Z,G=0] = \L_0(\Delta Y| \Delta X)\) (i.e., the model for untreated potential outcomes is linear in \(\Delta X\))
  2. The implicit regression weights, \(w_1(\Delta X)\), are covariate balancing weights

    • in the sense that they make the distribution of \((X_{t=2}, X_{t=1}, Z)\) the same for the treated and untreated groups

Versions of these conditions seem plausible in the case with cross-sectional data, but do not seem reasonable in the panel data context \(\rightarrow\)

Limitations of TWFE Regressions

Consider the condition

\[ \E[\Delta Y | X_{t=2}, X_{t=1}, Z, G=0] = \L_0(\Delta Y | \Delta X) \]

Notice that this condition involves two things:

  1. A condition about linearity (makes sense…and similar to the cross-sectional case)

  2. Changing the covariates that show up from \((X_{t=2}, X_{t=1}, Z)\) to \(\Delta X\)

Condition 2 amounts to changing the identification strategy to one where parallel trends depends only on \(\Delta X\) rather than on \(X_{t=1}\), \(X_{t=2}\), and \(Z\).

  • In the minimum wage example, we originally wanted to compare counties with the same population and in the same region of the country.

  • Condition 2 would (effectively) change this to comparing counties with similar population changes over time

  • This could end up being much different from what we were originally aiming for

Limitations of TWFE Regressions

Next, a property of implicit regression weights is that they balance the means of regressors included in the model (Chattopadhyay and Zubizarreta (2023))

  • This is a good property in the cross-sectional setting and suggests that misspecification bias is typically likely to be small in that case

  • In our case, this means that the TWFE regression will balance the mean of \(\Delta X\) across groups

More importantly: The TWFE regression does not necessarily balance variables that do not show up in the estimating equation, including:

  • Levels of time-varying covariates: \(X_{t=2}, X_{t=1}\)

  • Time-invariant covariates: \(Z\)

Taken together, the arguments above suggest (to me) that misspecification bias is likely to be a much bigger issue in the TWFE setting than in the cross-sectional setting

Limitations of TWFE Regressions

In Caetano and Callaway (2023), we refer to the misspecification bias term above as hidden linearity bias.

What we mean is that the implications of a linear model may be much more severe in a panel data setting than in the cross-sectional setting:

  • The arguments that would lead us to think that misspecification bias is typically small in cross-sectional settings do not apply for TWFE regressions

  • TWFE effectively changes the identification to one where the only covariates that show up in the parallel trends assumption are \(\Delta X\)

Limitations of TWFE Regressions

What is going wrong with the TWFE regression?

\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\]

The source of the issues with the TWFE regression is that, when we difference out the unit fixed effect, we also transform the covariates.

  • We have focused on the case with two periods and estimation in first differences, but similar issues apply in cases with more periods and with other transformations (e.g., within transformation)

The “inherited” transformation of the covariates makes the results highly dependent on the model being correctly specified

Instead, a better option will be to difference the outcomes (in line with parallel trends) but then to directly include the covariates that we want \((X_{t=2}, X_{t=1}, Z)\).

Limitations of TWFE Regressions

One last question: How much does this matter in practice?

  • Not all that easy to check how far away \(\E[\Delta Y | X_{t=1}, X_{t=2}, Z, G=0]\) is from \(\L_0(\Delta Y|\Delta X)\)
  • Instead, an easier idea is to apply implicit regression weights to \((X_{t=2},X_{t=1},Z)\), and check if they balance these across groups

    • This gives us a way to diagnose the sensitivity of the TWFE regression to hidden linearity bias.
    • This is easy to check in practice: weights just depend on linear projections that are easy to directly estimate
    • If these are close to being balanced, it suggests that misspecification bias is small.
    • If not, then it matters a lot whether or not TWFE regression is correctly specified.

    One of the main things we will do in the application is to see how well implicit TWFE regression weights balance levels of time-varying covariates and omitted time-invariant covariates
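The diagnostic described above can be sketched in base R as follows. This is a simplified illustration, not the twfeweights implementation: the data are simulated, and the untreated-group weight `w0` is derived here from the Frisch-Waugh-Lovell representation of \(\alpha\) rather than taken from the slides, so treat it as an assumption of this sketch.

```r
# Simulated data where treatment depends on covariate levels (assumptions)
set.seed(7)
n  <- 2000
z  <- rbinom(n, 1, 0.5)                           # time-invariant covariate
x1 <- rnorm(n)                                    # period-1 level
G  <- rbinom(n, 1, plogis(x1 + 0.5 * z - 0.25))   # treated-group indicator
x2 <- x1 + 0.2 * G + rnorm(n, sd = 0.5)           # period-2 level
dx <- x2 - x1                                     # what the regression "sees"

# Implicit weights from the linear projection of D on Delta X
L_hat  <- fitted(lm(G ~ dx))
pi_hat <- mean(G)
denom  <- mean((G - L_hat)^2)
w1 <- (1 - L_hat) * pi_hat / denom   # treated units
w0 <- L_hat * (1 - pi_hat) / denom   # untreated units (derived, see lead-in)

# Weighted treated mean minus weighted untreated mean of a variable
weighted_gap <- function(v) mean((w1 * v)[G == 1]) - mean((w0 * v)[G == 0])

balance <- c(dx = weighted_gap(dx), x1 = weighted_gap(x1),
             x2 = weighted_gap(x2), z  = weighted_gap(z))
round(balance, 3)
```

The gap for `dx` is zero by construction, while the gaps for the levels `x1`, `x2` and for `z` need not be; large gaps there signal sensitivity to hidden linearity bias.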

Part 3: Alternative Estimation Strategies

Alternative Estimation Strategies

Given the limitations of TWFE regressions, we will consider alternative estimation strategies:

  1. Regression Adjustment (RA)

  2. Inverse Propensity Score Weighting (IPW)

  3. Augmented-Inverse Propensity Score Weighting (AIPW)

We will motivate these approaches from the two types of identification results that we showed earlier

These will have a number of better properties than TWFE regressions

Regression Adjustment (RA)

Recall our first identification result above:

\[ATT = \E[\Delta Y | G=1] - \E\Big[ \underbrace{\E[\Delta Y(0) | X, G=0]}_{=:m_0(X)} \Big| G=1\Big]\]

The most direct way to proceed is by proposing a model for \(m_0(X)\). For example, \(m_0(X) = X'\beta_0\).

  • Notice that linearity of untreated potential outcomes is exactly the same condition we needed for the TWFE regression to be a weighted average of conditional-on-covariates \(ATT\)’s.
  • However, here \(X\) includes \((X_{t=2}, X_{t=1}, Z)\) rather than only \(\Delta X\).
    • This means that there could still be misspecification bias, but there is no hidden linearity bias
  • Moreover, if linearity holds, we directly target \(ATT\), rather than recovering a hard-to-interpret weighted average of \(ATT(X)\).
  • \(\implies\) there is a strong case to go with RA as a default option over TWFE.

Regression Adjustment (RA)

Recall our first identification result above:

\[ATT = \E[\Delta Y | G=1] - \E\Big[ \underbrace{\E[\Delta Y(0) | X, G=0]}_{=:m_0(X)} \Big| G=1\Big]\]

This expression suggests a regression adjustment estimator, based on:

\[ATT = \E[\Delta Y | G=1] - \E[X'\beta_0|G=1]\]

and we can estimate the \(ATT\) by

  • Step 1: Estimate \(\beta_0\) using untreated group

  • Step 2: Compute predicted change in untreated potential outcomes for treated units: \(\widehat{\Delta Y_i(0)} = X_i'\hat{\beta}_0\)

  • Step 3: Compute \(\widehat{ATT} = \displaystyle \frac{1}{n_1} \sum_{i=1}^n G_i \big(\Delta Y_i - X_i'\hat{\beta}_0\big)\)
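The three steps above can be sketched in base R as follows. The data are simulated with a known effect of 0.5, and the single covariate `x` stands in for \((X_{t=2}, X_{t=1}, Z)\); both are assumptions made for illustration.

```r
# Simulated two-period data in first differences (illustrative assumptions)
set.seed(123)
n  <- 5000
x  <- rnorm(n)
G  <- rbinom(n, 1, plogis(-0.5 + 0.5 * x))  # treated-group indicator
dy <- 1 + x + 0.5 * G + rnorm(n)            # Delta Y; true ATT is 0.5
df <- data.frame(dy = dy, x = x, G = G)

# Step 1: estimate beta_0 using the untreated group only
fit0 <- lm(dy ~ x, data = df, subset = G == 0)

# Step 2: predicted change in untreated potential outcomes for every unit
dy0_hat <- predict(fit0, newdata = df)

# Step 3: average (Delta Y - predicted Delta Y(0)) over the treated group
att_ra <- mean((df$dy - dy0_hat)[df$G == 1])
att_ra
```

Because the outcome model is fit on untreated units only, the treated units never influence \(\hat{\beta}_0\); they enter only through the final average.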

[Side-Discussion: One-shot imputation estimators]

Inverse Propensity Score Weighting (IPW)

Alternatively, recall our identification strategy based on re-weighting: \[ ATT = \E[\Delta Y | G=1] - \E[ \nu_0(X) \Delta Y | G=0] \]

The most common balancing weights are based on the propensity score. You can show that: \[\begin{align*} \nu_0(X) = \frac{p(X)(1-\pi)}{(1-p(X))\pi} \end{align*}\] where \(p(X) = \P(G=1|X)\) and \(\pi=\P(G=1)\).

  • This is the approach suggested in Abadie (2005). In practice, you need to estimate the propensity score. The most common choices are probit or logit.
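To see why propensity score weights of this form balance the covariates, the following short derivation may help. It is a sketch in which \(h\) is an arbitrary integrable function of \(X\), introduced here only for illustration; the second equality uses the law of iterated expectations with \(\E[1-G|X] = 1-p(X)\).

\[ \begin{aligned} \E[\nu_0(X) h(X) | G=0] &= \frac{\E[(1-G)\, \nu_0(X) h(X)]}{1-\pi} \\ &= \frac{\E[(1-p(X))\, \nu_0(X) h(X)]}{1-\pi} \\ &= \frac{\E[p(X) h(X)]}{\pi} \\ &= \frac{\E[G\, h(X)]}{\pi} = \E[h(X) | G=1] \end{aligned} \]

Since this holds for any \(h\), the re-weighted untreated group has the same distribution of \(X\) as the treated group.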

For estimation:

  • Step 1: Estimate the propensity score (typically logit or probit)

  • Step 2: Compute the weights, using the estimated propensity score

  • Step 3: Compute \(\widehat{ATT} = \displaystyle \frac{1}{n_1} \sum_{i=1}^n G_i \Delta Y_i - \frac{1}{n_0} \sum_{i=1}^n \frac{\hat{p}(X_i) (1-\hat{\pi}) (1-G_i)}{\big(1-\hat{p}(X_i)\big) \hat{\pi}} \Delta Y_i\)
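A minimal base-R sketch of these steps is given below, using a logit propensity score and the balancing weights \(\nu_0(X) = p(X)(1-\pi)/\big((1-p(X))\pi\big)\). The simulated data (true effect 0.5) and the single covariate `x` are illustrative assumptions.

```r
# Simulated two-period data in first differences (illustrative assumptions)
set.seed(123)
n  <- 5000
x  <- rnorm(n)
G  <- rbinom(n, 1, plogis(-0.5 + 0.5 * x))  # treated-group indicator
dy <- 1 + x + 0.5 * G + rnorm(n)            # Delta Y; true ATT is 0.5
df <- data.frame(dy = dy, x = x, G = G)

# Step 1: estimate the propensity score with a logit model
ps_fit <- glm(G ~ x, family = binomial(link = "logit"), data = df)
p_hat  <- fitted(ps_fit)
pi_hat <- mean(df$G)

# Step 2: balancing weights nu_0(X) for the untreated group
nu0 <- p_hat * (1 - pi_hat) / ((1 - p_hat) * pi_hat)

# Step 3: treated mean of Delta Y minus re-weighted untreated mean
att_ipw <- mean(df$dy[df$G == 1]) - mean((nu0 * df$dy)[df$G == 0])
att_ipw
```

Unlike regression adjustment, no outcome model is specified here; all of the modelling burden falls on the propensity score.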

Augmented-Inverse Propensity Score Weighting (AIPW)

You can show an additional identification result:

\[ATT = \E\left[ \Delta Y_{t} - \E[\Delta Y_{t} | X, G=0] \big| G=1\right] - \E\left[ \frac{p(X)(1-\pi)}{(1-p(X))\pi} \big(\Delta Y_{t} - \E[\Delta Y_{t} | X, G=0]\big) \Big| G=0\right]\]

This requires estimating both \(p(X)\) and \(\E[\Delta Y|X,G=0]\).

Big advantage: The sample analogue of this expression for \(ATT\) is doubly robust. This means that it will deliver consistent estimates of \(ATT\) if either the model for \(p(X)\) or the model for \(\E[\Delta Y|X,G=0]\) is correctly specified.

  • In my experience, doubly robust estimators perform much better than either the regression or propensity score weighting estimators
  • This also provides a connection to estimating \(ATT\) under conditional parallel trends using machine learning for \(p(X)\) and \(\E[\Delta Y|X,G=0]\) (see: Chang (2020))
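The AIPW expression can be sketched in base R as below. The simulation (true effect 0.5) is an illustrative assumption, and the second estimate deliberately uses a misspecified, intercept-only outcome model alongside a correctly specified logit propensity score to illustrate double robustness.

```r
# Simulated two-period data in first differences (illustrative assumptions)
set.seed(123)
n  <- 5000
x  <- rnorm(n)
G  <- rbinom(n, 1, plogis(-0.5 + 0.5 * x))  # treated-group indicator
dy <- 1 + x + 0.5 * G + rnorm(n)            # Delta Y; true ATT is 0.5
df <- data.frame(dy = dy, x = x, G = G)

# Propensity score and balancing weights
p_hat  <- fitted(glm(G ~ x, family = binomial(link = "logit"), data = df))
pi_hat <- mean(df$G)
nu0    <- p_hat * (1 - pi_hat) / ((1 - p_hat) * pi_hat)

# AIPW estimate for a given outcome-model formula fit on untreated units
aipw <- function(outcome_formula) {
  fit0   <- lm(outcome_formula, data = df, subset = G == 0)
  resid0 <- df$dy - predict(fit0, newdata = df)
  mean(resid0[df$G == 1]) - mean((nu0 * resid0)[df$G == 0])
}

att_aipw     <- aipw(dy ~ x)  # both models correctly specified
att_aipw_mis <- aipw(dy ~ 1)  # outcome model misspecified, p(X) correct
c(both_correct = att_aipw, outcome_misspecified = att_aipw_mis)
```

In this design both estimates should be close to the true effect, since a correct propensity score model is enough for consistency even when the outcome model omits `x`.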

Discussion

Regarding the previous issues with TWFE regressions, RA, IPW, and AIPW satisfy:

Issue 1: Multiple periods

Issue 2: Time-invariant covariates

Issue 3: Covariates affected by the treatment ?

Issue 4: Hidden linearity bias

Issue 5: Weighted average of \(ATT(X)\)

You can also show that they will, by construction, balance the means of \((X_{t=2},X_{t=1},Z)\) across groups.

In my view, these are much better properties than those of the TWFE regression when it comes to including covariates.

Part 4: Multiple Periods and Variation in Treatment Timing

Multiple Time Periods and Variation in Treatment Timing


Conditional Parallel Trends with Multiple Periods

For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2, \ldots, T\),

\[\E[\Delta Y_{t}(0) | \mathbf{X}, Z, G=g] = \E[\Delta Y_{t}(0) | \mathbf{X}, Z, U=1]\]

where \(\mathbf{X}_i := (X_{i1},X_{i2},\ldots,X_{iT})\) and \(U=1\) indicates the never-treated group.

Under this assumption, using similar arguments to the ones above, one can show that

\[ATT(g,t) = \E\left[ \left( \frac{\indicator{G=g}}{\pi_g} - \frac{p_g(\mathbf{X},Z)U}{(1-p_g(\mathbf{X},Z))\pi_g}\right)\Big(Y_{t} - Y_{g-1} - m_{gt}^0(\mathbf{X},Z)\Big) \right]\]

where \(\pi_g := \P(G=g)\), \(p_g(\mathbf{X},Z) := \P(G=g|\mathbf{X},Z,\indicator{G=g}+U=1)\), and \(m_{gt}^0(\mathbf{X},Z) := \E[Y_{t}-Y_{g-1}|\mathbf{X},Z,U=1]\).

Practical Considerations

Because \(\mathbf{X}_i\) contains \(X_{it}\) for all time periods, terms like \(m_{gt}^0(\mathbf{X},Z)\) can be quite high-dimensional (and hard to estimate) in many applications.

In many cases, it may be reasonable to replace \(\mathbf{X}_i\) with a lower-dimensional function of \(\mathbf{X}_i\):

  • \(\bar{X}_i\) — the average of \(X_{it}\) across time periods
  • \(X_{it}, X_{ig-1}\) — the covariates in the current period and base period (this is possible in the pte package currently and may be added to did soon).
  • \(X_{ig-1}\) — the covariates in the base period (this is the default in did)
  • \((X_{it}-X_{ig-1})\) — the change in covariates over time

Otherwise, however, everything is the same as before:

  1. Recover \(ATT(g,t)\)

  2. If desired: aggregate into \(ATT^{es}(e)\) or \(ATT^o\).
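To make the "many \(2 \times 2\) comparisons" idea concrete, the sketch below computes \(ATT(g,t)\) by regression adjustment for each group and post-treatment period, using the never-treated group as the comparison group and \(g-1\) as the base period. It is a simplified base-R illustration, not the did package's implementation: the data are simulated with a true effect of 1, the covariate `x` is time-invariant, and `att_gt_ra` is a hypothetical helper name.

```r
# Simulated staggered-adoption panel (illustrative assumptions)
set.seed(2024)
n       <- 6000
periods <- 1:4
x       <- rnorm(n)                             # covariate
ever    <- rbinom(n, 1, plogis(0.5 * x))        # ever treated?
grp     <- ifelse(ever == 1, sample(c(3, 4), n, replace = TRUE), 0)
eta     <- rnorm(n)                             # unit fixed effect

# n x 4 outcome matrix; trends depend on x, treatment effect is 1
Y <- sapply(periods, function(tt) {
  eta + tt + 0.5 * x * tt + 1 * (grp != 0 & tt >= grp) + rnorm(n)
})

# Hypothetical helper: regression-adjustment ATT(g,t) vs. never-treated
att_gt_ra <- function(g, t) {
  keep <- grp %in% c(g, 0)
  d <- data.frame(dy      = Y[keep, t] - Y[keep, g - 1],
                  x       = x[keep],
                  treated = as.integer(grp[keep] == g))
  fit0 <- lm(dy ~ x, data = d, subset = treated == 0)
  mean((d$dy - predict(fit0, newdata = d))[d$treated == 1])
}

# Evaluate on all post-treatment (g, t) pairs
grid <- expand.grid(g = c(3, 4), t = periods)
grid <- grid[grid$t >= grid$g, ]
grid$att <- mapply(att_gt_ra, grid$g, grid$t)
grid
```

Each row is a separate two-group, two-period problem, so the covariate handling discussed earlier carries over unchanged.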

Part 5: Empirical Example

Empirical Example: Minimum Wages and Employment

  • This is the same application that we considered in Session 1
  • Use county-level data from 2003-2007 during a period where the federal minimum wage was flat
  • Exploit minimum wage changes across states

    • Any state that increases its minimum wage above the federal minimum wage will be considered as treated
  • Interested in the effect of the minimum wage on teen employment
  • We’ll also make a number of simplifications:
    • not worry much about issues like clustered standard errors
    • not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states

Goals:

  • Include covariates in the parallel trends assumption, assess how much this matters

  • Try to assess how well different estimation strategies do in terms of handling covariates

Empirical Example: Minimum Wages and Employment

Let’s start by assuming that parallel trends holds conditional on a county’s population and average income (sometimes we’ll add region too)

  • i.e., we would like to compare treated and untreated counties with similar populations and average incomes

I’ll show results for the following cases:

  1. Results without covariates (as a reminder of results from last time)

  2. Two period TWFE with covariates

  3. All periods TWFE with covariates

  4. Callaway and Sant’Anna (2021) including \(X_{g-1}\) and \(Z\) as covariates

    • RA, IPW, AIPW

In addition to estimates, we’ll also assess how well each of these works in terms of balancing covariates using the twfeweights package.

TWFE Results without Covariates

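library(did)           # att_gt(), aggte(), ggdid()
library(modelsummary)  # modelsummary()

# data2: county-level panel from the workshop materials
twfe_res2 <- fixest::feols(lemp ~ post | id + year,
                           data=data2,
                           cluster="id")

modelsummary(list(twfe_res2), gof_omit=".*")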
              (1)
post        -0.038
           (0.008)

Event Study without Covariates (Callaway and Sant’Anna)

attgt <- did::att_gt(yname="lemp",
                     idname="id",
                     gname="G",
                     tname="year",
                     data=data2,
                     control_group="nevertreated",
                     base_period="universal")
attes <- aggte(attgt, type="dynamic")
ggdid(attes)

Overall ATT without Covariates (Callaway and Sant’Anna)

attO <- did::aggte(attgt, type="group")
summary(attO)

Call:
did::aggte(MP = attgt, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0571        0.0085    -0.0738     -0.0404 *


Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0888     0.0189       -0.1302     -0.0475 *
  2006  -0.0427     0.0083       -0.0610     -0.0245 *
---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Doubly Robust

Two periods TWFE Regression with Covariates

# run TWFE regression
data2_subset <- subset(data2, year %in% c(2003,2004))
data2_subset <- subset(data2_subset, G %in% c(0, 2004))
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + year,
                        data=data2_subset,
                        cluster="id")
modelsummary(twfe_x, gof_omit=".*")
              (1)
post        -0.032
           (0.019)
lpop         0.833
           (0.261)
lavg_pay     0.037
           (0.145)

Diagnose covariate balance

library(twfeweights)
tp_wts <- two_period_reg_weights(
  yname = "lemp",
  tname = "year",
  idname = "id",
  gname = "G",
  xformula = ~lpop + lavg_pay,
  extra_balance_vars_formula = ~region,
  data = data2_subset
)

Diagnose covariate balance

ggtwfeweights(tp_wts,
              absolute_value=FALSE,
              standardize=TRUE,
              plot_relative_to_target=FALSE) +
  xlim(c(-2,2))

TWFE with more periods and covariates

# run TWFE regression
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + year,
                        data=data2,
                        cluster="id")
modelsummary(twfe_x, gof_omit=".*")
              (1)
post        -0.048
           (0.008)
lpop         1.235
           (0.089)
lavg_pay     0.172
           (0.076)

Diagnose covariate balance

twfe_wts <- implicit_twfe_weights(
  yname = "lemp",
  tname = "year",
  idname = "id",
  gname = "G",
  xformula = ~lpop + lavg_pay,
  data = data2,
  base_period = "gmin1"
)
covariate_balance <- twfe_cov_bal(twfe_wts, ~ region + lpop + lavg_pay + -1)

Diagnose covariate balance

ggtwfeweights(covariate_balance,
              absolute_value = FALSE,
              standardize = TRUE,
              plot_relative_to_target = FALSE) +
  xlim(c(-1,1))

CS (2021) AIPW, \(X_{g-1}, Z\)

# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="dr",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)

CS (2021) AIPW, \(X_{g-1}, Z\)


Call:
aggte(MP = cs_x, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0317        0.0075    -0.0463      -0.017 *


Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0509     0.0204       -0.0928     -0.0089 *
  2006  -0.0230     0.0076       -0.0387     -0.0074 *
---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Doubly Robust

CS (2021) AIPW, \(X_{g-1}, Z\)

Check covariate balance

# similar code as before...check course materials

Additional Results and Bonus Material

Additional Results:

  • Add region as a covariate in TWFE [details]

  • One-shot imputation estimators [details]

  • Regression adjustment [details]

  • Inverse propensity score weighting [details]

Bonus Material:

Conclusion

  • Including covariates in the parallel trends assumption can make difference-in-differences identification strategies more plausible

  • If you want to include covariates in the parallel trends assumption, it is better to use approaches that directly include the covariates relative to estimation strategies that transform the covariates

  • In the minimum wage application, we did better in terms of covariate balance with regression adjustment and AIPW; however, there still seemed to be violations of parallel trends in pre-treatment periods

    • I know some ways to get rid of these apparent violations of parallel trends, but I want to use this as an angle to get invited back for Session 3…

Appendix

Side-Discussion: Imputation Estimators

Regression adjustment is closely related to imputation estimators, which we talked about in the first session.

In settings with multiple periods and variation in treatment timing, these are often operationalized in different ways though

  • In Callaway and Sant’Anna (2021), we considered regression adjustment at the group-time level \(\implies\) do \(2 \times 2\) regression adjustment many times and then aggregate

  • However, most imputation estimators are implemented in one-shot, where you would typically estimate the model \[Y_{it}(0) = \theta_t + \eta_i + X_{it}'\beta + e_{it}\] across all time periods

Side-Discussion: Imputation Estimators

One-shot imputation estimators have limitations similar to those of TWFE regressions when it comes to covariates:

\[Y_{it}(0) = \theta_t + \eta_i + X_{it}'\beta + e_{it}\]

Issue 1: Multiple periods

Issue 2: Time-invariant covariates

Issue 3: Covariates affected by the treatment ❌

Issue 4: Hidden linearity bias ❌

  • You can see that it implicitly relies on \(\E[\Delta Y(0) | X_{t=2}, X_{t=1}, Z, G=0] = \E[\Delta Y(0) | \Delta X, G=0]\)

Issue 5: Weighted average of \(ATT(X)\)
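The reliance on \(\Delta X\) noted under Issue 4 can be seen by first-differencing the imputation model in the two-period case. The step below is a sketch that additionally assumes \(\E[\Delta e | X_{t=2}, X_{t=1}, Z, G] = 0\), an assumption stated here for illustration rather than taken from the slides:

\[ \begin{aligned} \Delta Y_{i}(0) &= \Delta \theta + \Delta X_{i}'\beta + \Delta e_{i} \\ \implies \E[\Delta Y(0) | X_{t=2}, X_{t=1}, Z, G=0] &= \Delta \theta + \Delta X'\beta \end{aligned} \]

The right-hand side depends on the covariates only through \(\Delta X\), so the levels \(X_{t=2}, X_{t=1}\) and the time-invariant covariates \(Z\) drop out of the implied model for the path of untreated potential outcomes.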


Understanding Double Robustness

To understand double robustness, we can rewrite the expression for \(ATT\) as \[\begin{align*} ATT = \E\left[ \frac{D}{\pi} \Big(\Delta Y - m_0(X)\Big) \right] - \E\left[ \frac{p(X)(1-D)}{(1-p(X))\pi} \Big(\Delta Y - m_0(X)\Big)\right] \end{align*}\]

The first term is exactly the same as what comes from regression adjustment

  • If we correctly specify a model for \(m_0(X)\), it will be equal to \(ATT\).

  • If \(m_0(X)\) is not correctly specified, then, by itself, this term will be biased for \(ATT\)

The second term can be thought of as a de-biasing term

  • If \(m_0(X)\) is correctly specified, it is equal to 0

  • If \(p(X)\) is correctly specified, it reduces to \(\E[\Delta Y_{t}(0) | G=1] - \E[m_0(X)|G=1]\), which both delivers the counterfactual untreated potential outcomes and removes the (possibly misspecified) \(\E[m_0(X)|G=1]\) component of the first term
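The second case can be worked out explicitly. In the sketch below, \(m_0(X)\) denotes the working model for the outcome regression, which may be misspecified, \(p(X)\) is assumed to be correctly specified, and the last equality in the first display uses conditional parallel trends:

\[ \begin{aligned} \E\left[ \frac{p(X)(1-D)}{(1-p(X))\pi} \Big(\Delta Y - m_0(X)\Big)\right] &= \E\left[ \frac{p(X)}{\pi} \Big(\E[\Delta Y | X, D=0] - m_0(X)\Big)\right] \\ &= \E\Big[ \E[\Delta Y | X, D=0] - m_0(X) \Big| D=1 \Big] \\ &= \E[\Delta Y(0) | D=1] - \E[m_0(X) | D=1] \end{aligned} \]

Subtracting this from the first term, \(\E[\Delta Y | D=1] - \E[m_0(X) | D=1]\), gives

\[ \E[\Delta Y | D=1] - \E[\Delta Y(0) | D=1] = ATT \]

so the \(\E[m_0(X)|D=1]\) pieces cancel regardless of whether the working model is correct.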


Add region as a covariate

We’ll allow for path of outcomes to depend on region of the country

# run TWFE regression
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + region^year,
                        data=data2,
                        cluster="id")
modelsummary(twfe_x, gof_omit=".*")
              (1)
post        -0.022
           (0.008)
lpop         1.057
           (0.137)
lavg_pay     0.074
           (0.079)

Relative to previous results, this is much smaller—this is (broadly) in line with the literature where controlling for region often matters a great deal (e.g., Dube, Lester, and Reich (2010)).

Check covariate balance

# similar code as before...check course materials


One-shot imputation (i.e., regression adjustment with only \(\Delta X\) as covariate)

# it's reg. adj. even though the function says aipw...
ra_wts <- implicit_aipw_weights(
  yname = "lemp",
  tname = "year",
  idname = "id",
  gname = "G",
  xformula = ~ 1,
  d_covs_formula = ~ lpop + lavg_pay,
  pscore_formula = ~1,
  data = data2
)
ra_wts$est
[1] -0.06098144

i.e., we estimate a somewhat larger effect of the minimum wage on teen employment

One-shot imputation (i.e., regression adjustment with only \(\Delta X\) as covariate)

ra_cov_bal <- aipw_cov_bal(ra_wts, ~ region + lpop + lavg_pay + -1)
ggtwfeweights(ra_cov_bal, absolute_value = FALSE,
              standardize = TRUE,
              plot_relative_to_target = FALSE) +
  xlim(c(-1,1))


CS (2021) Regression Adjustment, \(X_{g-1}, Z\)

# callaway and sant'anna including covariates (regression adjustment)
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="reg",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)

CS (2021) Regression Adjustment, \(X_{g-1},Z\)


Call:
aggte(MP = cs_x, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0321        0.0079    -0.0477     -0.0166 *


Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0596     0.0194       -0.1007     -0.0185 *
  2006  -0.0197     0.0080       -0.0367     -0.0027 *
---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Outcome Regression

CS (2021) Regression Adjustment, \(X_{g-1},Z\)

Check covariate balance

# similar code as before...check course materials


CS (2021) IPW, \(X_{g-1}, Z\)

# callaway and sant'anna including covariates (inverse probability weighting)
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="id",
               gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               base_period="universal",
               est_method="ipw",
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)

CS (2021) IPW, \(X_{g-1}, Z\)


Call:
aggte(MP = cs_x, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> 


Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0313        0.0083    -0.0475     -0.0151 *


Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0514     0.0179       -0.0902     -0.0127 *
  2006  -0.0222     0.0072       -0.0377     -0.0067 *
---
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Inverse Probability Weighting
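As a back-of-the-envelope check on the regression adjustment and IPW results above: the overall \(ATT\) under group aggregation is a weighted average of the group-specific effects, so with two groups the weight on the 2004 group can be backed out from the printed numbers. The helper function below is hypothetical, and its inputs are the rounded estimates from the output, so the implied weights are only approximate.

```r
# Hypothetical helper: solve overall = w * g_early + (1 - w) * g_late for w
implied_weight <- function(overall, g_early, g_late) {
  (overall - g_late) / (g_early - g_late)
}

# Rounded estimates copied from the two printed summaries
w_reg <- implied_weight(overall = -0.0321, g_early = -0.0596, g_late = -0.0197)
w_ipw <- implied_weight(overall = -0.0313, g_early = -0.0514, g_late = -0.0222)

round(c(regression_adjustment = w_reg, ipw = w_ipw), 3)
```

Both estimation methods imply roughly the same weight on the 2004 group, which is what we would expect if the weights reflect relative group sizes rather than the estimation method.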


Part 6: Covariates Affected by the Treatment

Covariates Affected by the Treatment

So far, our discussion has been for the case where the time-varying covariates evolve exogenously.

  • Many (probably most) covariates fit into this category: in the minimum wage example, a county’s population probably fits here.

In some applications, we may want to control for covariates that themselves could be affected by the treatment

  • Classical examples in labor economics: A person’s industry, occupation, or union status

  • These are often referred to as “bad controls”

You can see a tension here:

  • We would like to compare units who, absent being treated, would have had the same (say) union status

  • But union status could be affected by the treatment

Covariates Affected by the Treatment

The most common practice is to just completely drop these covariates from the analysis

  • Not clear if this is the right idea though…

We will consider some alternatives

  • Condition on pre-treatment value of bad control
  • Treat bad control as an outcome (i.e., use some identification strategy), then feed this into the main analysis as a covariate

Additional Notation

To wrap our heads around this, let’s go back to the case with two time periods.

Define treated and untreated potential covariates: \(X_{it}(1)\) and \(X_{it}(0)\). Notice that in the “textbook” two-period setting, we observe \[X_{it=2} = G_i X_{it=2}(1) + (1-G_i) X_{it=2}(0) \qquad \textrm{and} \qquad X_{it=1} = X_{it=1}(0)\]

Then, we will consider parallel trends in terms of untreated potential outcomes and untreated potential covariates:


Conditional Parallel Trends using Untreated Potential Covariates

\[\E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=1] = \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0]\]


Identification Issues

Following the same line of argument as before, it follows that

\[ATT = \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \Big| G=1\Big]\]

The second term is the tricky one. Notice that:

  • The inside conditional expectation is identified — we see untreated potential outcomes and covariates for the untreated group

  • However, we cannot average over \(X_{t=2}(0)\) for the treated group, because we don’t observe \(X_{t=2}(0)\) for the treated group

There are several options for what we can do   \(\rightarrow\)

Option 1: Ignore

One idea is to just ignore that the covariates may have been affected by the treatment:


Alternative Conditional Parallel Trends 1

\[\E[\Delta Y(0) | { \color{red} X_{\color{red}{t=2} } }, X_{t=1}(0), Z, G=1] = \E[\Delta Y(0) | { \color{red} X_{\color{red}{t=2}} }, X_{t=1}(0), Z, G=0]\]


The limitations of this approach are well known (they are even discussed in Mostly Harmless Econometrics), and this is not typically the approach taken in empirical work

Job Displacement Example: You would compare paths of outcomes for workers who left a union because they were displaced to paths of outcomes for non-displaced workers who also left a union (e.g., because of a better non-unionized job opportunity)

Option 2: Drop

It is more common in empirical work to drop \(X_{it}(0)\) entirely from the parallel trends assumption


Alternative Conditional Parallel Trends 2

\[\E[\Delta Y(0) | Z, G=1] = \E[\Delta Y(0) | Z, G=0]\]


In my view, this is not attractive either. If we believe this assumption, then we have basically solved the bad control problem by assuming that it does not exist.

Job Displacement Example: We have now just assumed that the path of earnings (absent job displacement) doesn’t depend on union status

Option 3: Tweak

Perhaps a better alternative identifying assumption is the following one


Alternative Conditional Parallel Trends 3

\[\E[\Delta Y(0) | X_{t=1}(0), Z, G=1] = \E[\Delta Y(0) | X_{t=1}(0), Z, G=0]\]


Intuition: Conditional parallel trends holds after conditioning on pre-treatment time-varying covariates that could have been affected by treatment

Job Displacement Example: Path of earnings (absent job displacement) depends on pre-treatment union status, but not untreated potential union status in the second period

What to do: Since \(X_{it=1}(0)\) is observed for all units, we can immediately operationalize this assumption using our arguments from earlier (i.e., the ones without bad controls)

  • This is difficult to operationalize with a TWFE regression

  • In practice, you can just include the bad control among the other covariates in the did package
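For reference, the argument "from earlier" can be sketched as follows under Alternative Conditional Parallel Trends 3 (and an overlap condition, which is assumed here rather than stated on this slide):

\[\begin{align*} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X_{t=1}(0), Z, G=1] \Big| G=1 \Big] \\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X_{t=1}(0), Z, G=0] \Big| G=1 \Big] \\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y | X_{t=1}, Z, G=0] \Big| G=1 \Big] \end{align*}\]

The first line uses \(Y_{t=1} = Y_{t=1}(0)\), the second is the law of iterated expectations, the third applies the parallel trends assumption, and the last uses the fact that untreated potential outcomes and covariates are observed for the untreated group. Every term in the last line involves only observed data.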

Option 4: Extra Assumptions

Another option is to keep the original identifying assumption, but add additional assumptions where we (in some sense) treat \(X_t\) as an outcome and as a covariate.

Recall:

\[ATT = \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \Big| G=1\Big]\]

If we could figure out the distribution of \(X_{t=2}(0)\) for the treated group, we could recover \(ATT\)

Option 4: Dealing with \(X_{t=2}(0)\)

Covariate Unconfoundedness Assumption

\[X_{t=2}(0) \independent D | X_{t=1}(0), Z\]

Intuition: For the treated group, the time-varying covariate would have evolved in the same way over time as it actually did for the untreated group, conditional on \(X_{t=1}\) and \(Z\).

  • Notice that this assumption only concerns untreated potential covariates \(\implies\) it allows for \(X_{t=2}\) to be affected by the treatment

  • Making an assumption like this indicates that \(X_{t=2}(0)\) is playing a dual role: (i) start by treating it as if it’s an outcome, (ii) have it continue to play a role as a covariate

Under this assumption, one can show that we can recover the \(ATT\):

\[ATT = \E[\Delta Y | G=1] - \E\left[ \E[\Delta Y | X_{t=1}, Z, G=0] \Big| G=1 \right]\]
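A sketch of why this holds, reading \(D\) in the covariate unconfoundedness assumption as the group indicator \(G\) (the two coincide in the two-period setting). Start from the problematic term in the original expression for \(ATT\):

\[\begin{align*} \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \Big| G=1 \Big] &= \E\Big[ \E\big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \big| X_{t=1}(0), Z, G=1 \big] \Big| G=1 \Big] \\ &= \E\Big[ \E\big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \big| X_{t=1}(0), Z, G=0 \big] \Big| G=1 \Big] \\ &= \E\Big[ \E[\Delta Y(0) | X_{t=1}(0), Z, G=0] \Big| G=1 \Big] \\ &= \E\Big[ \E[\Delta Y | X_{t=1}, Z, G=0] \Big| G=1 \Big] \end{align*}\]

The first equality is the law of iterated expectations. The second uses covariate unconfoundedness: conditional on \(X_{t=1}(0)\) and \(Z\), the distribution of \(X_{t=2}(0)\) is the same for the treated and untreated groups. The third is the law of iterated expectations within the untreated group, and the last holds because \(\Delta Y = \Delta Y(0)\) and \(X_{t=1} = X_{t=1}(0)\) for untreated units.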

This is the same expression as in Option 3
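The simulation below is a minimal sketch comparing Options 1, 2, and 3/4 in a stylized version of the job displacement example. Everything in it is an assumption made for illustration (the data generating process, the probabilities, the variable names, and the omission of \(Z\)); it is not taken from Caetano et al. (2022). Union status in the second period is affected by treatment, untreated potential union status satisfies covariate unconfoundedness, and the true \(ATT\) is \(-1\).

```r
# Illustrative simulation; all names and numbers are assumptions
set.seed(2024)
n <- 200000

x1   <- rbinom(n, 1, 0.5)                        # union status at t=1
G    <- rbinom(n, 1, 0.3 + 0.3 * x1)             # displacement is more likely for union members
x2_0 <- rbinom(n, 1, ifelse(x1 == 1, 0.9, 0.1))  # untreated potential union status at t=2
x2_1 <- x2_0 * rbinom(n, 1, 0.5)                 # displacement pushes some workers out of the union
x2   <- ifelse(G == 1, x2_1, x2_0)               # observed union status at t=2

dy0 <- 1 + 2 * x2_0 - x1 + rnorm(n)              # untreated path depends on X_{t=2}(0) and X_{t=1}
dy  <- dy0 - 1 * G                               # true ATT = -1

dat       <- data.frame(dy, G, x1, x2)
treated   <- subset(dat, G == 1)
untreated <- subset(dat, G == 0)

# Option 1 (ignore): condition on observed X_{t=2}, which is affected by treatment
fit_ignore <- lm(dy ~ x1 * x2, data = untreated)
att_ignore <- mean(treated$dy) - mean(predict(fit_ignore, newdata = treated))

# Option 2 (drop): no covariates at all
att_drop <- mean(treated$dy) - mean(untreated$dy)

# Options 3/4: condition only on the pre-treatment value X_{t=1}
fit_pre <- lm(dy ~ x1, data = untreated)
att_pre <- mean(treated$dy) - mean(predict(fit_pre, newdata = treated))

round(c(true = -1, ignore = att_ignore, drop = att_drop, pre_treatment = att_pre), 3)
```

In this design only the estimator that conditions on pre-treatment union status should land near the true value; the other two are biased by construction, which mirrors the discussion of Options 1 and 2.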

Option 4: Additional Discussion

In some cases, it may make sense to condition on other additional variables (e.g., the lagged outcome \(Y_{t=1}\)) in the covariate unconfoundedness assumption. In this case, it is still possible to identify \(ATT\), but it is more complicated

It could also be possible to use alternative identifying assumptions besides covariate unconfoundedness. At a high level, we somehow need to recover the distribution of \(X_{t=2}(0)\) for the treated group

  • e.g., Brown, Butts, and Westerlund (2023)

See Caetano et al. (2022) for more details about bad controls.


References

Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review of Economic Studies 72 (1): 1–19.
Angrist, Joshua D. 1998. “Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants.” Econometrica 66 (2): 249–88.
Aronow, Peter M, and Cyrus Samii. 2016. “Does Regression Produce Representative Estimates of Causal Effects?” American Journal of Political Science 60 (1): 250–67.
Blandhol, Christine, John Bonney, Magne Mogstad, and Alexander Torgovitsky. 2022. “When Is TSLS Actually LATE?”
Brown, Nicholas, Kyle Butts, and Joakim Westerlund. 2023. “Simple Difference-in-Differences Estimation in Fixed-t Panels.”
Caetano, Carolina, and Brantly Callaway. 2023. “Difference-in-Differences When Parallel Trends Holds Conditional on Covariates.”
Caetano, Carolina, Brantly Callaway, Robert Payne, and Hugo Sant’Anna. 2022. “Difference in Differences with Time-Varying Covariates.”
Callaway, Brantly. 2023. “Difference-in-Differences for Policy Evaluation.” In Handbook of Labor, Human Resources and Population Economics, edited by Klaus F. Zimmermann, 1–61. Springer International Publishing.
Callaway, Brantly, and Pedro HC Sant’Anna. 2021. “Difference-in-Differences with Multiple Time Periods.” Journal of Econometrics 225 (2): 200–230.
Chang, Neng-Chieh. 2020. “Double/Debiased Machine Learning for Difference-in-Differences Models.” The Econometrics Journal 23 (2): 177–91.
Chattopadhyay, Ambarish, and José R Zubizarreta. 2023. “On the Implied Weights of Linear Regression for Causal Inference.” Biometrika 110 (3): 615–29.
Dube, Arindrajit, T William Lester, and Michael Reich. 2010. “Minimum Wage Effects Across State Borders: Estimates Using Contiguous Counties.” The Review of Economics and Statistics 92 (4): 945–64.
Hahn, Jinyong. 2023. “Properties of Least Squares Estimator in Estimation of Average Treatment Effects.” SERIEs 14 (3): 301–13.
Sant’Anna, Pedro H. C., and Jun Zhao. 2020. “Doubly Robust Difference-in-Differences Estimators.” Journal of Econometrics 219 (1): 101–22.
Słoczyński, Tymon. 2022. “Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller Groups Get Larger Weights.” The Review of Economics and Statistics 104 (3): 501–9.