Modern Approaches to Difference-in-Differences

Session 2: Including Covariates in the Parallel Trends Assumption

Brantly Callaway

University of Georgia


  1. Introduction to Difference-in-Differences

  2. Including Covariates in the Parallel Trends Assumption

  3. Common Extensions for Empirical Work

  4. Dealing with More Complicated Treatment Regimes

  5. Alternative Identification Strategies


Start with the case with only two time periods

Only need a little bit of new notation here:

  • \(X_{i,t=2}\) and \(X_{i,t=1}\) — time-varying covariates

  • \(Z_i\) — time-invariant covariates

Alternative Identification Strategy: Covariate Balancing

Example: Momentarily, suppose that the distribution of \(X\) was the same for both groups, then

\[ \begin{aligned} ATT &= \E[\Delta Y | D = 1] - \E[\Delta Y(0) | D = 1] \hspace{150pt}\\ &= \E[\Delta Y | D = 1] - \E\Big[ \E[\Delta Y(0) | X, D=0 ] \Big| D=1\Big]\\ &= \E[\Delta Y | D = 1] - \E\Big[ \E[\Delta Y(0) | X, D=0 ] \Big| D=0\Big]\\ &= \E[\Delta Y | D = 1] - \E[\Delta Y(0) | D=0] \end{aligned} \]

\(\implies\) (even under conditional parallel trends) we can recover \(ATT\) by just directly comparing paths of outcomes for treated and untreated groups.

Alternative Identification Strategy: Covariate Balancing

More generally: We would not expect the distribution of covariates to be the same across groups.

However, the idea of covariate balancing is to come up with balancing weights \(\nu_0(X)\) such that, after applying the weights, the distribution of \(X\) in the untreated group is the same as it is in the treated group. Then we would have that

\[ \begin{aligned} ATT &= \E[\Delta Y | D=1] - \E[\Delta Y(0) | D=1] \hspace{150pt}\\ &= \E[\Delta Y | D=1] - \E\Big[ \E[\Delta Y(0) | X, D=0 ] \Big| D=1\Big]\\ &= \E[\Delta Y | D=1] - \E\Big[ \nu_0(X) \E[\Delta Y(0) | X, D=0 ] \Big| D=0\Big]\\ &= \E[\Delta Y | D=1] - \E[\nu_0(X) \Delta Y(0) | D=0] \end{aligned} \]

\(\implies\) We can recover \(ATT\) by re-weighting the untreated group to have the same distribution of covariates as the treated group and then just averaging.


The arguments above suggest that, in order to estimate the \(ATT\), we will either need to

  1. Correctly model \(\E[\Delta Y(0) | X, D=0]\) (i.e., specify a model for \(m_0(X)\))

  2. Balance the distribution of \(X\) to be the same for the untreated group relative to the treated group.

Next, we will discuss how well this works for TWFE regressions and then alternative (more direct) estimation strategies.

Limitations of TWFE Regressions

In this setting, it is common to run the following TWFE regression:

\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\]

However, there are a number of issues:

Issue 1: Issues related to multiple periods and variation in treatment timing still arise

Issue 2: Hard to allow parallel trends to depend on time-invariant covariates

Issue 3: Hard to allow for covariates that could be affected by the treatment

Issues 4 & 5: (harder to see) Can perform poorly when time-varying covariates are included in the parallel trends assumption

Side-Discussion: Interpreting regressions under unconfoundedness

Consider the cross-sectional regression \[ Y_i = \alpha D_i + X_i'\beta + e_i \] where we assume unconfoundedness: \(Y(0) \independent D | X\)

View 1: Correctly specified model for \(\E[Y|X,D]\).

  • This implies a model for \(m_0(X)\): \(m_0(X) = X'\beta\)

  • It also restricts treatment effect heterogeneity:

    For any \(X\), \(\E[Y|X,D=1] - \E[Y|X,D=0] = \alpha\) \(\implies\) treatment effects don’t systematically vary with \(X\).

  • see, e.g., Chaisemartin et al. (2024) for related linearity tests in the context of TWFE

View 2: Linear model as approximation to possibly more complicated conditional expectation (e.g., Angrist (1998))

You can show some interesting/useful results in this case   \(\rightarrow\)

Side-Discussion: Interpreting regressions under unconfoundedness

  1. \(\alpha\) can be re-interpreted as a weighting estimator

    \[ \alpha = \E\Big[w_1(X) Y \Big| D=1\Big] - \E\Big[w_0(X) Y \Big| D=0\Big] \]

    where \[ w_1(X) := \frac{\big(1-\L(D|X)\big) \pi}{\E\big[(D-\L(D|X))^2\big]} \ \ \textrm{and} \ \ w_0(X) := \frac{\L(D|X)(1-\pi)}{\E\big[(D-\L(D|X))^2\big]} \]

    and the result here follows (basically) immediately using FWL/partialling out arguments.
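The weighting representation can be checked numerically. The sketch below is only an illustration: the simulated data-generating process and all variable names are assumptions made for this example, not part of the minimum wage application. It compares the OLS coefficient on \(D\) with the weighting formula, using the linear projection \(\L(D|X)\) estimated by a regression of \(D\) on \(X\).

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(1)
n <- 2000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
y <- 1 + 2 * d + x + 0.5 * x^2 + rnorm(n)

# Coefficient on D from the regression of Y on D and X
alpha_ols <- unname(coef(lm(y ~ d + x))["d"])

# Linear projection of D on (1, X), i.e. L(D|X)
ld <- fitted(lm(d ~ x))
pi_hat <- mean(d)
denom <- mean((d - ld)^2)

# Implicit regression weights
w1 <- (1 - ld) * pi_hat / denom
w0 <- ld * (1 - pi_hat) / denom

# Weighted mean of Y for treated minus weighted mean of Y for untreated
alpha_wts <- mean(w1[d == 1] * y[d == 1]) - mean(w0[d == 0] * y[d == 0])

print(c(ols = alpha_ols, weighting = alpha_wts))
```

The two numbers agree up to floating-point error, which is the FWL/partialling-out argument in sample form.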

Side-Discussion: Interpreting regressions under unconfoundedness

  2. \(\alpha\) is equal to a weighted average of \(ATT(X)\) plus misspecification bias terms:

    \[ \alpha = \E\Big[w_1(X) ATT(X) \Big| D=1\Big] + \E\Big[w_1(X)\Big(\E[Y|X,D=0] - \L_0(Y|X)\Big) \Big| D=1\Big] \]

    The misspecification bias component is equal to 0 if either:

    • \(\E[Y|X,D=0] = \L_0(Y|X)\) (model for untreated potential outcomes is linear in \(X\))

    • The implicit regression weights are covariate balancing weights; i.e. for any function of the covariates \(g\)

      \[ \E\Big[ w_1(X) g(X) \Big| D=1\Big] = \E\Big[ w_0(X) g(X) \Big| D=0 \Big]\]

      [more details]

Limitations of TWFE Regressions

Let us now return to the TWFE regression: \[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\] and specialize to the case with two time periods, so that we ultimately run the regression

\[\Delta Y_{it} = \Delta \theta_t + \alpha D_{it} + \Delta X_{it}'\beta + \Delta e_{it}\]

Using the same arguments as above, you can show that: \[ \alpha = \E\Big[ w_1(\Delta X) ATT(X_{t=2},X_{t=1},Z)\Big| D=1 \Big] + \E\Big[ w_1(\Delta X) \Big( \E[\Delta Y | X_{t=2}, X_{t=1}, Z, D=0] - \L_0(\Delta Y | \Delta X) \Big) \Big| D=1 \Big]\]

Similar to above, \(\alpha\) is equal to weighted averages of:

  • conditional-on-covariates \(ATT\)’s

  • misspecification bias

Limitations of TWFE Regressions

The misspecification bias term is equal to 0 if:

  • \(\E[\Delta Y | X_{t=2}, X_{t=1}, Z, D=0] = \L_0(\Delta Y | \Delta X)\). This amounts to:
    • A condition about linearity (makes sense…)
    • Changing the identification strategy to one where parallel trends depends only on \(\Delta X\) rather than on \(X_{t=1}\), \(X_{t=2}\) and \(Z\).
  • The implicit regression weights \(w_1(\Delta X)\) and \(w_0(\Delta X)\) balance the distribution of \((X_{t=2}, X_{t=1}, Z)\) for the treated group relative to the untreated group.

Limitations of TWFE Regressions

In Caetano and Callaway (2023), we refer to the misspecification bias term above as hidden linearity bias.

What we mean is that the implications of a linear model may be much more severe in a panel data setting than in the cross-sectional setting:

  • It effectively changes the identification to one where what matters is changes in covariates over time

  • The regression will balance (in mean) terms that show up in the regression (i.e. \(\Delta X\)), but it won’t balance terms that don’t show up (e.g., \(Z\)) or other terms such as \(X_{t=1}\) or \(X_{t=2}\).

Limitations of TWFE Regressions

How much does this matter in practice?

  • Not all that easy to check how far away \(\E[\Delta Y | X_{t=1}, X_{t=2}, Z, D=0]\) is from \(\L_0(\Delta Y|\Delta X)\)
  • Instead, an easier idea is to apply implicit regression weights to \(Z\), \(X_{t=1}\), etc.:

    \[ \E\left[w_1(\Delta X) \begin{pmatrix} Z \\ X_{t=1} \end{pmatrix} \middle| D=1 \right] \overset{?}{=} \E\left[w_0(\Delta X) \begin{pmatrix} Z \\ X_{t=1} \end{pmatrix} \middle| D=0 \right] \]

    which gives us a way to diagnose the sensitivity of the TWFE regression to hidden linearity bias.

    • This is easy to check in practice: weights just depend on linear projections that are easy to directly estimate
    • If these are close, it suggests that misspecification bias is small.
    • If not, then it matters a lot whether or not the model is correctly specified.
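The diagnostic above can be sketched in base R for the two-period case. Everything here is an assumption made for illustration (the simulated design, the variable names, and the helper `wt_diff`); in the application, the twfeweights package plays this role. Treatment depends on the time-invariant covariate \(Z\), but the implicit weights are built only from \(\Delta X\).

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(3)
n <- 5000
z <- rnorm(n)                        # time-invariant covariate Z
x1 <- z + rnorm(n)                   # X_{t=1}
d <- rbinom(n, 1, plogis(0.8 * z))   # treatment depends on Z
x2 <- x1 + 0.2 + rnorm(n, sd = 0.5)  # X_{t=2}
dx <- x2 - x1                        # Delta X, the only covariate in the regression

# Implicit regression weights depend on the linear projection of D on Delta X
ld <- fitted(lm(d ~ dx))
pi_hat <- mean(d)
denom <- mean((d - ld)^2)
w1 <- (1 - ld) * pi_hat / denom
w0 <- ld * (1 - pi_hat) / denom

# Weighted treated mean minus weighted untreated mean of a covariate
wt_diff <- function(v) {
  mean(w1[d == 1] * v[d == 1]) - mean(w0[d == 0] * v[d == 0])
}

diffs <- c(dx = wt_diff(dx), z = wt_diff(z), x1 = wt_diff(x1))
print(round(diffs, 3))
```

In this design the weighted difference for `dx` is zero by construction, while `z` and `x1` remain unbalanced, which is the situation where hidden linearity bias can matter.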

Limitations of TWFE Regressions

Issue 5: Even if none of the previous 4 issues apply, \(\alpha\) is a weighted average of \(ATT(X)\). However,

  • The weights can be negative

  • The weights suffer from weight reversal (e.g., Słoczyński (2022)):

    • Too much weight on \(ATT(X)\) for values of the covariates that are relatively uncommon for the treated group relative to the untreated group
    • Too little weight on \(ATT(X)\) for values of the covariates that are relatively common for the treated group relative to the untreated group
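Weight reversal is easy to see with a single binary covariate, where the regression is correctly specified for \(\E[Y|X,D=0]\) and only the weighting differs. The sketch below uses an assumed simulated design and illustrative names. It relies on the fact that, with a saturated covariate, the OLS coefficient equals a weighted average of cell-level treated/untreated differences with weights proportional to \(n_x \hat{p}_x (1-\hat{p}_x)\), whereas the \(ATT\) weights cells in proportion to \(n_x \hat{p}_x\).

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(42)
n <- 5000
x <- rbinom(n, 1, 0.5)
p <- ifelse(x == 1, 0.8, 0.2)   # treatment is common when X = 1, uncommon when X = 0
d <- rbinom(n, 1, p)
tau <- ifelse(x == 1, 1, 3)     # conditional effects differ by X
y <- x + tau * d + rnorm(n)

alpha <- unname(coef(lm(y ~ d + x))["d"])

# Cell-level quantities
cells <- split(data.frame(y, d), x)
n_x <- sapply(cells, nrow)
p_x <- sapply(cells, function(cell) mean(cell$d))
tau_x <- sapply(cells, function(cell) {
  mean(cell$y[cell$d == 1]) - mean(cell$y[cell$d == 0])
})

# Implicit regression weights vs. the weights that would deliver the ATT
w_reg <- n_x * p_x * (1 - p_x) / sum(n_x * p_x * (1 - p_x))
w_att <- n_x * p_x / sum(n_x * p_x)

print(rbind(w_reg, w_att))
print(c(alpha = alpha, att = sum(w_att * tau_x)))
```

Here the cell with \(X=0\), which is uncommon among treated units, receives roughly half of the regression weight but only about a fifth of the \(ATT\) weight.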

Alternative Estimation Strategies

Regression Adjustment (RA)

Recall our first identification result above:

\[ATT = \E[\Delta Y | D=1] - \E\Big[ \underbrace{\E[\Delta Y(0) | X, D=0]}_{=:m_0(X)} \Big| D=1\Big]\]

The most direct way to proceed is by proposing a model for \(m_0(X)\). For example, \(m_0(X) = X'\beta_0\).

  • Notice that linearity of untreated potential outcomes is exactly the same condition we needed for the TWFE regression to be a weighted average of conditional-on-covariates \(ATT\)’s.
  • However, here \(X\) includes \((X_{t=2}, X_{t=1}, Z)\) rather than only \(\Delta X\).
    • This means that there could still be misspecification bias, but there is no hidden linearity bias
  • Moreover, if linearity holds, we directly target \(ATT\), rather than recovering a hard-to-interpret weighted average of \(ATT(X)\).
  • \(\implies\) there is a strong case to go with RA as a default option over TWFE.

Regression Adjustment (RA)

Recall our first identification result above:

\[ATT = \E[\Delta Y | D=1] - \E\Big[ \underbrace{\E[\Delta Y(0) | X, D=0]}_{=:m_0(X)} \Big| D=1\Big]\]

This expression suggests a regression adjustment estimator:

\[ATT = \E[\Delta Y | D=1] - \E[X'\beta_0|D=1]\]

and we can estimate the \(ATT\) by

  • Step 1: Estimate \(\beta_0\) using untreated group

  • Step 2: Compute predicted values for treated units: \(X_i'\hat{\beta}_0\)

  • Step 3: Compute \(\widehat{ATT} = \displaystyle \frac{1}{n} \sum_{i=1}^n \frac{D_i}{\hat{\pi}} \Delta Y_i - \frac{1}{n} \sum_{i=1}^n \frac{D_i}{\hat{\pi}} X_i'\hat{\beta}_0\)
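The three steps can be sketched in base R as follows. The simulated design (with a true \(ATT\) of 1) and all variable names are assumptions for illustration only; in practice the did package handles these steps.

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(4)
n <- 4000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
dy0 <- 0.5 + x + rnorm(n)   # change in untreated potential outcomes depends on X
dy <- dy0 + 1 * d           # observed change in outcomes; true ATT = 1

# Step 1: estimate beta_0 using the untreated group only
fit0 <- lm(dy ~ x, subset = d == 0)

# Step 2: predicted values X_i' beta_0_hat (only treated units enter Step 3)
m0_hat <- predict(fit0, newdata = data.frame(x = x))

# Step 3: average Delta Y and the predictions over the treated group
pi_hat <- mean(d)
att_ra <- mean(d / pi_hat * dy) - mean(d / pi_hat * m0_hat)

# Unadjusted comparison of paths, for reference
att_naive <- mean(dy[d == 1]) - mean(dy[d == 0])
print(c(ra = att_ra, naive = att_naive))
```

Because treated units have systematically larger \(X\) in this design, the unadjusted comparison overstates the effect, while the regression adjustment estimate is close to 1.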

Side-Discussion on Imputation Estimators

(One shot) imputation estimators that include covariates typically involve estimating the model \[Y_{it}(0) = \theta_t + \eta_i + X_{it}'\beta + e_{it}\]

Issue 1: Multiple periods

Issue 2: Time-invariant covariates

Issue 3: Covariates affected by the treatment ❌

Issue 4: Hidden linearity bias ❌

  • You can see that it implicitly relies on \(\E[\Delta Y(0) | X_{t=2}, X_{t=1}, Z, D=0] = \E[\Delta Y(0) | \Delta X, D=0]\)

Issue 5: Weighted average of \(ATT(X)\)

Propensity Score Weighting (IPW)

Alternatively, recall our identification strategy based on re-weighting: \[ ATT = \E[\Delta Y | D=1] - \E[ \nu_0(X) \Delta Y | D=0] \]

The most common balancing weights are based on the propensity score. You can show that \[\begin{align*} \nu_0(X) = \frac{p(X)(1-\pi)}{(1-p(X))\pi} \end{align*}\] where \(p(X) = \P(D=1|X)\) and \(\pi=\P(D=1)\).

  • This is the approach suggested in Abadie (2005). In practice, you need to estimate the propensity score. The most common choices are probit or logit.
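To see why these weights balance the covariates, the following short derivation (a sketch that uses only the definitions of \(p(X)\) and \(\pi\) above and the law of iterated expectations) shows that, for any function \(g\) of the covariates, the re-weighted untreated mean equals the treated mean. The notation `\E` is the document's expectation macro.

```latex
\begin{aligned}
\E\big[\nu_0(X)\, g(X) \,\big|\, D=0\big]
  &= \frac{1}{1-\pi}\,\E\big[(1-D)\,\nu_0(X)\, g(X)\big] \\
  &= \frac{1}{1-\pi}\,\E\big[(1-p(X))\,\nu_0(X)\, g(X)\big] \\
  &= \frac{1}{\pi}\,\E\big[p(X)\, g(X)\big] \\
  &= \frac{1}{\pi}\,\E\big[D\, g(X)\big]
   = \E\big[g(X) \,\big|\, D=1\big]
\end{aligned}
```

Taking \(g(X) = \E[\Delta Y(0) | X, D=0]\) gives exactly the re-weighting step in the identification argument.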

For estimation:

  • Step 1: Estimate the propensity score (typically logit or probit)

  • Step 2: Compute the weights, using the estimated propensity score

  • Step 3: Compute \(\widehat{ATT} = \displaystyle \frac{1}{n} \sum_{i=1}^n \frac{D_i}{\hat{\pi}} \Delta Y_i - \frac{1}{n} \sum_{i=1}^n \frac{\hat{p}(X_i) (1-D_i)}{\big(1-\hat{p}(X_i)\big) \hat{\pi}} \Delta Y_i\)
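A minimal base R sketch of these steps, under an assumed simulated design with a true \(ATT\) of 1 (all names are illustrative, and a logit is used for the propensity score):

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(5)
n <- 4000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
dy <- 0.5 + x + 1 * d + rnorm(n)   # true ATT = 1

# Step 1: estimate the propensity score with a logit
ps_fit <- glm(d ~ x, family = binomial())
p_hat <- fitted(ps_fit)

# Step 2: compute the weights for untreated units
pi_hat <- mean(d)
w_untreated <- p_hat * (1 - d) / ((1 - p_hat) * pi_hat)

# Step 3: treated mean of Delta Y minus re-weighted untreated mean of Delta Y
att_ipw <- mean(d / pi_hat * dy) - mean(w_untreated * dy)
print(att_ipw)
```

This version does not normalize the weights to average to one within the untreated group; normalized variants are also common.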

Augmented-Inverse Propensity Score Weighting (AIPW)

You can show an additional identification result:

\[ATT = \E\left[ \Delta Y_{t} - \E[\Delta Y_{t} | X, D=0] \Big| D=1\right] - \E\left[ \frac{p(X)(1-\pi)}{(1-p(X))\pi} \big(\Delta Y_{t} - \E[\Delta Y_{t} | X, D=0]\big) \Big| D = 0\right]\]

This requires estimating both \(p(X)\) and \(\E[\Delta Y|X,D=0]\).

Big advantage: The sample analogue of this expression for \(ATT\) is doubly robust. This means that it will deliver consistent estimates of \(ATT\) if either the model for \(p(X)\) or the model for \(\E[\Delta Y|X,D=0]\) is correctly specified.

  • In my experience, doubly robust estimators perform much better than either the regression or propensity score weighting estimators
  • This also provides a connection to estimating \(ATT\) under conditional parallel trends using machine learning for \(p(X)\) and \(\E[\Delta Y|X,D=0]\) (see: Chang (2020))
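Double robustness can be illustrated with a small simulation. The design below is an assumption made for this example (true \(ATT\) of 1, a deliberately misspecified intercept-only outcome model, and a deliberately misspecified constant propensity score); the helper `aipw` is a hypothetical name for the sample analogue of the expression above.

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(6)
n <- 4000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(0.5 * x))
dy <- 0.5 + x + 1 * d + rnorm(n)   # true ATT = 1
pi_hat <- mean(d)

# Sample analogue of the AIPW expression for given nuisance estimates
aipw <- function(m0_hat, p_hat) {
  w_treat <- d / pi_hat
  w_comp <- p_hat * (1 - d) / ((1 - p_hat) * pi_hat)
  mean(w_treat * (dy - m0_hat)) - mean(w_comp * (dy - m0_hat))
}

# Correctly specified and misspecified nuisance models
m0_good <- predict(lm(dy ~ x, subset = d == 0), newdata = data.frame(x = x))
m0_bad <- rep(mean(dy[d == 0]), n)                 # ignores X
p_good <- fitted(glm(d ~ x, family = binomial()))
p_bad <- rep(pi_hat, n)                            # ignores X

att <- c(both_ok = aipw(m0_good, p_good),
         outcome_bad = aipw(m0_bad, p_good),
         pscore_bad = aipw(m0_good, p_bad),
         both_bad = aipw(m0_bad, p_bad))
print(round(att, 3))
```

The estimate stays close to 1 when either nuisance model is correct and drifts away only when both are misspecified.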


Regarding the previous issues with TWFE regressions, RA and AIPW satisfy:

Issue 1: Multiple periods

Issue 2: Time-invariant covariates

Issue 3: Covariates affected by the treatment ?

Issue 4: Hidden linearity bias

Issue 5: Weighted average of \(ATT(X)\)

You can also show that they will, by construction, balance the means of \((X_{t=2},X_{t=1},Z)\) across groups.

In my view, these are much better properties than those of the TWFE regression when it comes to including covariates.

Multiple Periods and Variation in Treatment Timing

Multiple Time Periods and Variation in Treatment Timing

Conditional Parallel Trends with Multiple Periods

For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2, \ldots, T\),

\[\E[\Delta Y_{t}(0) | \mathbf{X}, Z, G=g] = \E[\Delta Y_{t}(0) | \mathbf{X}, Z, U=1]\]

where \(\mathbf{X}_i := (X_{i1},X_{i2},\ldots,X_{iT})\).

Under this assumption, using similar arguments to the ones above, one can show that

\[ATT(g,t) = \E\left[ \left( \frac{\indicator{G=g}}{\pi_g} - \frac{p_g(\mathbf{X},Z)U}{(1-p_g(\mathbf{X},Z))\pi_g}\right)\Big(Y_{t} - Y_{g-1} - m_{gt}^0(\mathbf{X},Z)\Big) \right]\]

where \(p_g(\mathbf{X},Z) := \P(G=g|\mathbf{X},Z,\indicator{G=g}+U=1)\) and \(m_{gt}^0(\mathbf{X},Z) := \E[Y_{t}-Y_{g-1}|\mathbf{X},Z,U=1]\).
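As a rough sketch of how this expression is used, the base R code below computes a doubly robust \(ATT(g,t)\) by subsetting to group \(g\) and the never-treated group and using period \(g-1\) as the base period. Everything here is an illustrative assumption: the simulated staggered design, the single time-invariant covariate, and the helper name `att_gt_dr`. Working on the subset means the group share is computed within the subset, which is equivalent to the population expression above. In practice, the did package implements this.

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(8)
n <- 6000
n_periods <- 4
x <- rnorm(n)                                   # one covariate, time-invariant for simplicity
treated <- rbinom(n, 1, plogis(0.5 * x))
G <- ifelse(treated == 1, sample(c(3, 4), n, replace = TRUE), 0)  # 0 = never treated
eta <- rnorm(n)

# Outcomes: trends depend on x; effect grows with exposure (1 on impact, 2 one period later)
Y <- sapply(1:n_periods, function(t) {
  effect <- ifelse(G > 0 & t >= G, t - G + 1, 0)
  eta + t * (0.5 + x) + effect + rnorm(n)
})

att_gt_dr <- function(g, t) {
  keep <- G == g | G == 0                       # group g and never-treated units
  dg <- as.numeric(G[keep] == g)
  dy <- Y[keep, t] - Y[keep, g - 1]             # long difference from base period g - 1
  xs <- x[keep]
  p_hat <- fitted(glm(dg ~ xs, family = binomial()))
  m0_hat <- predict(lm(dy ~ xs, subset = dg == 0), newdata = data.frame(xs = xs))
  pi_g <- mean(dg)
  w <- dg / pi_g - p_hat * (1 - dg) / ((1 - p_hat) * pi_g)
  mean(w * (dy - m0_hat))
}

res <- c(att_3_3 = att_gt_dr(3, 3), att_3_4 = att_gt_dr(3, 4), att_4_4 = att_gt_dr(4, 4))
print(round(res, 3))
```

In this design the true values are 1, 2, and 1 respectively.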

Practical Considerations

Because \(\mathbf{X}_i\) contains \(X_{i,t}\) for all time periods, terms like \(m_{gt}^0(\mathbf{X},Z)\) can be quite high-dimensional (and hard to estimate) in many applications.

In many cases, it may be reasonable to replace \(\mathbf{X}_i\) with a lower-dimensional function of \(\mathbf{X}_i\):

  • \(\bar{X}_i\) — the average of \(X_{it}\) across time periods
  • \(X_{it}, X_{ig-1}\) — the covariates in the current period and base period (this is possible in the pte package currently and may be added to did soon).
  • \(X_{ig-1}\) — the covariates in the base period (this is the default in did)
  • \((X_{it}-X_{ig-1})\) — the change in covariates over time

Otherwise, however, everything is the same as before:

  1. Recover \(ATT(g,t)\)

  2. If desired: aggregate into \(ATT^{es}(e)\) or \(ATT^o\).

Empirical Example

Back to Minimum Wage Example

Let’s start by assuming that parallel trends holds conditional on a county’s population and average income (sometimes we’ll add region too)

  • i.e., we would like to compare treated and untreated counties with similar populations and average incomes

I’ll show results for the following cases:

  1. Two period regression

  2. All periods regression

    • with and without region as a covariate
  3. Regression adjustment where the model is \(Y_{it}(0) = \theta_t + \eta_i + X_{it}'\beta + e_{it}\)

  4. Callaway and Sant’Anna (2021) including \(X_{g-1}\) and \(Z\) as covariates

    • RA, IPW, AIPW

In addition to estimates, we’ll also assess how well each of these works in terms of balancing covariates using the twfeweights package.

Two periods TWFE Regression

# run TWFE regression on the two-period subset (2003-2004; untreated and 2004 groups)
library(modelsummary)
data2_subset <- subset(data2, year %in% c(2003,2004))
data2_subset <- subset(data2_subset, G %in% c(0, 2004))
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + year,
                        data = data2_subset)
modelsummary(twfe_x, gof_omit=".*")
post -0.032
lpop 0.833
lavg_pay 0.037

Diagnose covariate balance

tp_wts <- two_period_reg_weights(
  yname = "lemp",
  tname = "year",
  idname = "id",
  gname = "G",
  xformula = ~lpop + lavg_pay,
  extra_balance_vars_formula = ~region,
  data = data2_subset
)

Diagnose covariate balance

[Figure: covariate balance plot for the two-period regression weights]

TWFE with more periods and covariates

# run TWFE regression using all periods
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + year,
                        data = data2)
modelsummary(twfe_x, gof_omit=".*")
post -0.048
lpop 1.235
lavg_pay 0.172

Diagnose covariate balance

twfe_wts <- implicit_twfe_weights(
  yname = "lemp",
  tname = "year",
  idname = "id",
  gname = "G",
  xformula = ~lpop + lavg_pay,
  data = data2,
  base_period = "gmin1"
)
covariate_balance <- twfe_cov_bal(twfe_wts, ~ region + lpop + lavg_pay + -1)

Diagnose covariate balance

ggtwfeweights(covariate_balance,
              absolute_value = FALSE,
              standardize = TRUE,
              plot_relative_to_target = FALSE)

Add region as a covariate

We’ll allow for path of outcomes to depend on region of the country

# run TWFE regression with region-by-year fixed effects
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + region^year,
                        data = data2)
modelsummary(twfe_x, gof_omit=".*")
post -0.022
lpop 1.057
lavg_pay 0.074

Relative to the previous results, this estimate is much smaller in magnitude. This is (broadly) in line with the literature, where controlling for region often matters a great deal (e.g., Dube, Lester, and Reich (2010)).

Check covariate balance

# similar code as before...check course materials

Regression adjustment with only \(\Delta X\)

# it's reg. adj. even though the function says aipw...
ra_wts <- implicit_aipw_weights(
  yname = "lemp",
  tname = "year",
  idname = "id",
  gname = "G",
  xformula = ~ 1,
  d_covs_formula = ~ lpop + lavg_pay,
  pscore_formula = ~1,
  data = data2
)
[1] -0.06098144

i.e., we estimate a somewhat larger effect of the minimum wage on teen employment

Regression adjustment with only \(\Delta X\)

ra_cov_bal <- aipw_cov_bal(ra_wts, ~ region + lpop + lavg_pay + -1)
ggtwfeweights(ra_cov_bal, absolute_value = FALSE,
              standardize = TRUE,
              plot_relative_to_target = FALSE)

CS (2021) Regression Adjustment, \(X_{g-1}, Z\)

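# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp", tname="year", idname="id", gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               est_method="reg",  # outcome regression, matching the printed output
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
cs_x_dyn <- aggte(cs_x, type="dynamic")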

CS (2021) Regression Adjustment, \(X_{g-1},Z\)

aggte(MP = cs_x, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021.

Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0321        0.0084    -0.0486     -0.0157 *

Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0596     0.0197       -0.1021     -0.0170 *
  2006  -0.0197     0.0084       -0.0378     -0.0017 *
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Outcome Regression

CS (2021) Regression Adjustment, \(X_{g-1},Z\)

Check covariate balance

# similar code as before...check course materials

CS (2021) IPW, \(X_{g-1}, Z\)

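# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp", tname="year", idname="id", gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               est_method="ipw",  # inverse probability weighting, matching the printed output
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
cs_x_dyn <- aggte(cs_x, type="dynamic")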

CS (2021) IPW, \(X_{g-1}, Z\)

aggte(MP = cs_x, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021.

Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0313        0.0078    -0.0465      -0.016 *

Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0514     0.0206       -0.0971     -0.0058 *
  2006  -0.0222     0.0074       -0.0387     -0.0057 *
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Inverse Probability Weighting

CS (2021) AIPW, \(X_{g-1}, Z\)

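# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp", tname="year", idname="id", gname="G",
               xformla=~region + lpop + lavg_pay,
               control_group="nevertreated",
               est_method="dr",  # doubly robust, matching the printed output
               data=data2)
cs_x_res <- aggte(cs_x, type="group")
cs_x_dyn <- aggte(cs_x, type="dynamic")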

CS (2021) AIPW, \(X_{g-1}, Z\)

aggte(MP = cs_x, type = "group")

Reference: Callaway, Brantly and Pedro H.C. Sant'Anna.  "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021.

Overall summary of ATT's based on group/cohort aggregation:  
     ATT    Std. Error     [ 95%  Conf. Int.]  
 -0.0317        0.0081    -0.0475     -0.0158 *

Group Effects:
 Group Estimate Std. Error [95% Simult.  Conf. Band]  
  2004  -0.0509     0.0185       -0.0896     -0.0122 *
  2006  -0.0230     0.0075       -0.0387     -0.0074 *
Signif. codes: `*' confidence band does not cover 0

Control Group:  Never Treated,  Anticipation Periods:  0
Estimation Method:  Doubly Robust

CS (2021) AIPW, \(X_{g-1}, Z\)

Check covariate balance

# similar code as before...check course materials


If you want to include covariates in the parallel trends assumption, it is better to use approaches that directly include the covariates rather than estimation strategies that transform the covariates.

Covariates Affected by the Treatment

So far, our discussion has been for the case where the time-varying covariates evolve exogenously.

  • Many (probably most) covariates fit into this category: in the minimum wage example, a county’s population probably fits here.

In some applications, we may want to control for covariates that themselves could be affected by the treatment

  • Classical examples in labor economics: A person’s industry, occupation, or union status

  • These are often referred to as “bad controls”

You can see a tension here:

  • We would like to compare units who, absent being treated, would have had the same (say) union status

  • But union status could be affected by the treatment

Covariates Affected by the Treatment

The most common practice is to just completely drop these covariates from the analysis

  • Not clear if this is the right idea though…

We will consider some alternatives

  • Condition on pre-treatment value of bad control
  • Treat bad control as an outcome (i.e., use some identification strategy), then feed this into the main analysis as a covariate

Additional Notation

To wrap our heads around this, let’s go back to the case with two time periods.

Define treated and untreated potential covariates: \(X_{it}(1)\) and \(X_{it}(0)\). Notice that in the “textbook” two period setting, we observe \[X_{i,t=2} = D_i X_{i,t=2}(1) + (1-D_i) X_{i,t=2}(0) \qquad \textrm{and} \qquad X_{i,t=1} = X_{i,t=1}(0)\]

Then, we will consider parallel trends in terms of untreated potential outcomes and untreated potential covariates:

Conditional Parallel Trends using Untreated Potential Covariates

\[\E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, D=1] = \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, D=0]\]

Identification Issues

Following the same line of argument as before, it follows that

\[ATT = \E[\Delta Y | D=1] - \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, D=0] \Big| D=1\Big]\]

The second term is the tricky one. Notice that:

  • The inside conditional expectation is identified — we see untreated potential outcomes and covariates for the untreated group

  • However, we cannot average over \(X_{t=2}(0)\) for the treated group, because we don’t observe \(X_{t=2}(0)\) for the treated group

There are several options for what we can do   \(\rightarrow\)

Option 1: Ignore

One idea is to just ignore that the covariates may have been affected by the treatment:

Alternative Conditional Parallel Trends 1

\[\E[\Delta Y(0) | { \color{red} X_{\color{red}{i,t=2} } }, X_{t=1}(0), Z, D=1] = \E[\Delta Y(0) | { \color{red} X_{\color{red}{i,t=2}} }, X_{t=1}(0), Z, D=0]\]

The limitations of this approach are well known (even discussed in MHE), and this is not typically the approach taken in empirical work

Job Displacement Example: You would compare paths of outcomes for workers who left union because they were displaced to paths of outcomes for non-displaced workers who also left union (e.g., because of better non-unionized job opportunity)

Option 2: Drop

It is more common in empirical work to drop \(X_{i,t}(0)\) entirely from the parallel trends assumption

Alternative Conditional Parallel Trends 2

\[\E[\Delta Y(0) | Z, D=1] = \E[\Delta Y(0) | Z, D=0]\]

In my view, this is not attractive either though. If we believe this assumption, then we have basically solved the bad control problem by assuming that it does not exist.

Job Displacement Example: We have now just assumed that path of earnings (absent job displacement) doesn’t depend on union status

Option 3: Tweak

Perhaps a better alternative identifying assumption is the following one

Alternative Conditional Parallel Trends 3

\[\E[\Delta Y(0) | X_{t=1}(0), Z, D=1] = \E[\Delta Y(0) | X_{t=1}(0), Z, D=0]\]

Intuition: Conditional parallel trends holds after conditioning on pre-treatment time-varying covariates that could have been affected by treatment

Job Displacement Example: Path of earnings (absent job displacement) depends on pre-treatment union status, but not untreated potential union status in the second period

What to do: Since \(X_{i,t=1}(0)\) is observed for all units, we can immediately operationalize this assumption using our arguments from earlier (i.e., the ones without bad controls)

  • This is difficult to operationalize with a TWFE regression

  • In practice, you can just include the bad control among other covariates in did

Option 4: Extra Assumptions

Another option is to keep the original identifying assumption, but add additional assumptions where we (in some sense) treat \(X_t\) as an outcome and as a covariate.


\[ATT = \E[\Delta Y | D=1] - \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, D=0] \Big| D=1\Big]\]

If we could figure out distribution of \(X_{t=2}(0)\) for the treated group, we could recover \(ATT\)

Option 4: Dealing with \(X_{t=2}(0)\)

Covariate Unconfoundedness Assumption

\[X_{t=2}(0) \independent D | X_{t=1}(0), Z\]

Intuition: For the treated group, the time-varying covariate would have evolved in the same way over time as it actually did for the untreated group, conditional on \(X_{t=1}\) and \(Z\).

  • Notice that this assumption only concerns untreated potential covariates \(\implies\) it allows for \(X_{t=2}\) to be affected by the treatment

  • Making an assumption like this indicates that \(X_{t=2}(0)\) is playing a dual role: (i) start by treating it as if it’s an outcome, (ii) have it continue to play a role as a covariate

Under this assumption, one can show that we can recover the \(ATT\):

\[ATT = \E[\Delta Y | D=1] - \E\left[ \E[\Delta Y | X_{t=1}, Z, D=0] \Big| D=1 \right]\]

This is the same expression as in Option 3
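The following simulation sketches why conditioning on the pre-treatment value works here while conditioning on the observed post-treatment value does not. The design is an assumption made for illustration (binary "union status", covariate unconfoundedness holding by construction, true \(ATT\) of 1, no \(Z\)), and `ra_att` is a hypothetical helper that performs regression adjustment for a given set of covariates.

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(7)
n <- 10000
x1 <- rbinom(n, 1, 0.5)                          # pre-treatment covariate X_{t=1}
d <- rbinom(n, 1, ifelse(x1 == 1, 0.6, 0.4))     # treatment depends on X_{t=1}
x2_0 <- rbinom(n, 1, plogis(-1 + 2 * x1))        # untreated potential covariate X_{t=2}(0)
x2_1 <- x2_0 * rbinom(n, 1, 0.5)                 # treatment pushes some units out of X = 1
x2 <- ifelse(d == 1, x2_1, x2_0)                 # observed X_{t=2}
dy <- 1 + 2 * x2_0 + 0.5 * x1 + 1 * d + rnorm(n) # true ATT = 1

# Regression adjustment: fit on untreated units, predict for treated units
ra_att <- function(formula) {
  dat <- data.frame(dy, x1, x2, d)
  fit0 <- lm(formula, data = dat[dat$d == 0, ])
  mean(dat$dy[dat$d == 1]) - mean(predict(fit0, newdata = dat[dat$d == 1, ]))
}

att <- c(ignore = ra_att(dy ~ x1 + x2),                     # Option 1: use observed X_{t=2}
         drop = mean(dy[d == 1]) - mean(dy[d == 0]),        # Option 2: no covariates
         pre_only = ra_att(dy ~ x1))                        # Options 3/4: condition on X_{t=1}
print(round(att, 3))
```

In this design only the estimator that conditions on \(X_{t=1}\) is close to 1; the other two are biased upward.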

Option 4: Additional Discussion

In some cases, it may make sense to condition on other additional variables (e.g., the lagged outcome \(Y_{t=1}\)) in the covariate unconfoundedness assumption. In this case, it is still possible to identify \(ATT\), but it is more complicated

It could also be possible to use alternative identifying assumptions besides covariate unconfoundedness — at a high-level, we somehow need to recover the distribution of \(X_{t=2}(0)\)

  • e.g., Brown, Butts, and Westerlund (2023)

See Caetano et al. (2022) for more details about bad controls.



Regression weights - more details

By construction, the regression weights balance the mean of \(X\). That is, \[ \E\Big[ w_1(X) X \Big| D=1\Big] = \E\Big[ w_0(X) X \Big| D=0 \Big] \]

But they do not necessarily balance other functions of the covariates such as quadratic terms, interactions, etc. You can check for balance by computing terms like \[ \E\Big[ w_1(X) X^2 \Big| D=1\Big] \overset{?}{=} \E\Big[ w_0(X) X^2 \Big| D=0 \Big] \]

That the regression weights balance the means of the covariates suggests (perhaps) that the misspecification bias is typically small.
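A quick numerical check of these balance properties, under an assumed simulated design with illustrative names (`balance` is a hypothetical helper):

```r
# Illustrative simulation (assumed DGP, not from the slides)
set.seed(2)
n <- 5000
x <- rnorm(n)
d <- rbinom(n, 1, plogis(-1 + x))

# Implicit regression weights from the linear projection of D on (1, X)
ld <- fitted(lm(d ~ x))
pi_hat <- mean(d)
denom <- mean((d - ld)^2)
w1 <- (1 - ld) * pi_hat / denom
w0 <- ld * (1 - pi_hat) / denom

# Weighted mean of g(X) in each group
balance <- function(g) {
  c(treated = mean(w1[d == 1] * g[d == 1]),
    untreated = mean(w0[d == 0] * g[d == 0]))
}

bal_x <- balance(x)      # balanced by construction
bal_x2 <- balance(x^2)   # not guaranteed to be balanced
print(rbind(x = bal_x, x_squared = bal_x2))
```

The first row matches across groups up to floating-point error; the second row shows how far the weights are from balancing the quadratic term in this particular design.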


Understanding Double Robustness

To understand double robustness, we can rewrite the expression for \(ATT\) as \[\begin{align*} ATT = \E\left[ \frac{D}{\pi} \Big(\Delta Y - m_0(X)\Big) \right] - \E\left[ \frac{p(X)(1-D)}{(1-p(X))\pi} \Big(\Delta Y - m_0(X)\Big)\right] \end{align*}\]

The first term is exactly the same as what comes from regression adjustment

  • If we correctly specify a model for \(m_0(X)\), it will be equal to \(ATT\).

  • If \(m_0(X)\) not correctly specified, then, by itself, this term will be biased for \(ATT\)

The second term can be thought of as a de-biasing term

  • If \(m_0(X)\) is correctly specified, it is equal to 0

  • If \(p(X)\) is correctly specified, it reduces to \(\E[\Delta Y(0) | D=1] - \E[m_0(X)|D=1]\), which both delivers the counterfactual untreated potential outcomes and removes the (possibly misspecified) \(\E[m_0(X)|D=1]\) component of the first term
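The last claim can be verified in a few lines. The sketch below assumes that \(p(X)\) is the true propensity score, lets \(m_0(X)\) be any (possibly misspecified) function of \(X\), and uses conditional parallel trends in the final step; `\E` is the document's expectation macro.

```latex
\begin{aligned}
\E\left[ \frac{p(X)(1-D)}{(1-p(X))\pi} \Big(\Delta Y - m_0(X)\Big) \right]
  &= \E\left[ \frac{p(X)\,(1-p(X))}{(1-p(X))\pi} \Big(\E[\Delta Y | X, D=0] - m_0(X)\Big) \right] \\
  &= \frac{1}{\pi}\, \E\Big[ D \,\Big(\E[\Delta Y | X, D=0] - m_0(X)\Big) \Big] \\
  &= \E\Big[ \E[\Delta Y | X, D=0] \,\Big|\, D=1 \Big] - \E[m_0(X) | D=1] \\
  &= \E[\Delta Y(0) | D=1] - \E[m_0(X) | D=1]
\end{aligned}
```

Subtracting this from the first term, \(\E[\Delta Y | D=1] - \E[m_0(X)|D=1]\), leaves \(\E[\Delta Y | D=1] - \E[\Delta Y(0) | D=1] = ATT\) regardless of how \(m_0(X)\) was specified.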



Abadie, Alberto. 2005. “Semiparametric Difference-in-Differences Estimators.” The Review of Economic Studies 72 (1): 1–19.
Angrist, Joshua D. 1998. “Estimating the Labor Market Impact of Voluntary Military Service Using Social Security Data on Military Applicants.” Econometrica 66 (2): 249–88.
Brown, Nicholas, Kyle Butts, and Joakim Westerlund. 2023. “Simple Difference-in-Differences Estimation in Fixed-t Panels.”
Caetano, Carolina, and Brantly Callaway. 2023. “Difference-in-Differences When Parallel Trends Holds Conditional on Covariates.”
Caetano, Carolina, Brantly Callaway, Robert Payne, and Hugo Sant’Anna. 2022. “Difference in Differences with Time-Varying Covariates.”
Chaisemartin, Clément de, Diego Ciccia, Xavier D’Haultfœuille, and Felix Knau. 2024. “Two-Way Fixed Effects and Differences-in-Differences Estimators in Heterogeneous Adoption Designs.”
Chang, Neng-Chieh. 2020. “Double/Debiased Machine Learning for Difference-in-Differences Models.” The Econometrics Journal 23 (2): 177–91.
Dube, Arindrajit, T William Lester, and Michael Reich. 2010. “Minimum Wage Effects Across State Borders: Estimates Using Contiguous Counties.” The Review of Economics and Statistics 92 (4): 945–64.
Słoczyński, Tymon. 2022. “Interpreting OLS Estimands When Treatment Effects Are Heterogeneous: Smaller Groups Get Larger Weights.” The Review of Economics and Statistics 104 (3): 501–9.