Session 2: Using Covariates in Difference-in-Differences Identification Strategies
Additional Workshop Materials: https://bcallaway11.github.io/uga-cbai-workshop/
General References:
Callaway (2023), Handbook of Labor, Human Resources and Population Economics
Baker, Callaway, Cunningham, Goodman-Bacon, Sant’Anna (2024), draft posted very soon
Difference-in-Differences with Two Periods and Two Groups
Extensions to Staggered Treatment Adoption
Application about Effects of Minimum Wage Policies on Employment
\(\newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }\)
Including covariates in the parallel trends assumption can often make DID identification strategies more plausible:
Examples
Minimum wage example: path of teen employment may depend on a county’s population / population growth / region of the country
Job displacement example: path of earnings may depend on years of education / race / occupation
However, there are a number of new issues that can arise in this setting…
Identification with Two Periods
Limitations of TWFE Regressions
Alternative Estimation Strategies
Multiple Periods and Variation in Treatment Timing
Minimum Wage Application
Dealing with “Bad” Controls
Data:
2 periods: \(t=1\), \(t=2\)
\(D_{it}\) treatment indicator in period \(t\)
2 groups: \(G_i=1\) or \(G_i=0\) (treated and untreated)
Potential Outcomes: \(Y_{it}(1)\) and \(Y_{it}(0)\)
Observed Outcomes: \(Y_{it=2}\) and \(Y_{it=1}\)
\[\begin{align*} Y_{it=2} = G_i Y_{it=2}(1) +(1-G_i)Y_{it=2}(0) \quad \textrm{and} \quad Y_{it=1} = Y_{it=1}(0) \end{align*}\]
Average Treatment Effect on the Treated: \[ATT = \E[Y_{t=2}(1) - Y_{t=2}(0) | G=1]\]
Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group
Pushing the expectation through the difference, we have that: \[\begin{align*} ATT = \underbrace{\E[Y_{t=2}(1) | G=1]}_{\textrm{Easy}} - \underbrace{\E[Y_{t=2}(0) | G=1]}_{\textrm{Hard}} \end{align*}\]
(Unconditional) Parallel Trends Assumption
\[\E[\Delta Y(0) | G=1] = \E[\Delta Y(0) | G=0]\]
In words: The path of untreated potential outcomes is the same for the treated group as for the untreated group
From last time: Under (unconditional) PTA,
\[ \begin{aligned} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \\ &= \E[\Delta Y | G=1] - \E[\Delta Y | G=0] \end{aligned} \]
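As a quick numerical check of this result, here is a minimal simulation (a Python sketch with made-up data; the workshop code itself is in R) where unconditional parallel trends holds by construction, so the difference in mean outcome paths recovers the \(ATT\):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: groups differ in levels (via eta) but share a common trend,
# so unconditional parallel trends holds by construction
G = rng.binomial(1, 0.4, n)                 # treated-group indicator
eta = rng.normal(0, 1, n) + 0.5 * G         # unit effects (level differences by group)
y1 = eta + rng.normal(0, 1, n)              # period 1: everyone untreated
att_true = 2.0
y2 = eta + 1.0 + att_true * G + rng.normal(0, 1, n)  # period 2: common trend + ATT

dy = y2 - y1
att_hat = dy[G == 1].mean() - dy[G == 0].mean()      # E[dY|G=1] - E[dY|G=0]
```

The level difference (`0.5 * G`) cancels in first differences, which is exactly why DID only needs parallel *trends*.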
Start with the case with only two time periods
More notation about covariates:
\(X_{it=2}\) and \(X_{it=1}\) — time-varying covariates
\(Z_i\) — time-invariant covariates
Two identification results:
Direct identification strategy:
Indirect identification strategy:
These different identification strategies will suggest alternative ways to estimate \(ATT\), which we will return to later…
Conditional Parallel Trends Assumption
\[\E[\Delta Y(0) | X_{t=2}, X_{t=1},Z,G=1] = \E[\Delta Y(0) | X_{t=2}, X_{t=1},Z,G=0]\]
In words: Parallel trends holds conditional on having the same covariates \((X_{t=2},X_{t=1},Z)\).
Minimum wage example: parallel trends conditional on counties having the same population (like \(X_{t}\)) and being in the same region of the country (like \(Z\))
Under conditional parallel trends, we have that \[ \begin{aligned} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \hspace{150pt}\\ &=\E\Big[ \E[\Delta Y | X,G=1] \Big| G=1\Big] - \E\Big[ \E[\Delta Y(0) | X, G=1] \Big| G=1\Big]\\ &=\E\Big[ \underbrace{\E[\Delta Y | X,G=1] - \E[\Delta Y(0) | X, G=0]}_{=:ATT(X)} \Big| G=1\Big] \end{aligned} \]
Intuition:
Compare path of outcomes for treated group to (conditional on covariates) path of outcomes for untreated group, \(\rightarrow\) \(ATT(X)\)
Average \(ATT(X)\) to get \(ATT\)
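This two-step intuition can be sketched with a small simulation (Python, hypothetical discrete covariate; not the workshop's R code): compute \(ATT(x)\) within covariate cells, then average over the treated group's covariate distribution. The naive unconditional comparison is biased here because trends differ across covariate values.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 300_000
x = rng.integers(0, 3, n)                     # discrete covariate with 3 values
p_vals = np.array([0.2, 0.5, 0.8])            # P(G=1 | X=k)
g = rng.binomial(1, p_vals[x])

att_x_true = np.array([1.0, 2.0, 3.0])        # heterogeneous ATT(X)
trend = np.array([0.0, 0.5, 1.0])             # X-specific trends: conditional PT holds
dy = trend[x] + att_x_true[x] * g + rng.normal(0, 1, n)

# ATT(x): compare outcome paths within covariate cells
att_x_hat = np.array([dy[(x == k) & (g == 1)].mean() - dy[(x == k) & (g == 0)].mean()
                      for k in range(3)])
# aggregate ATT(x) over the treated group's covariate distribution
w_treated = np.array([np.mean(x[g == 1] == k) for k in range(3)])
att_hat = (w_treated * att_x_hat).sum()

# target: E[ATT(X) | G=1]  (X uniform, so treated weights proportional to p_vals)
att_true = (p_vals * att_x_true).sum() / p_vals.sum()

# naive unconditional comparison is biased: trends differ across X cells
naive = dy[g == 1].mean() - dy[g == 0].mean()
```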
From the last slide, we had that
\[ ATT = \E\Big[ \underbrace{\E[\Delta Y | X,G=1] - \E[\Delta Y(0) | X, G=0]}_{=:ATT(X)} \Big| G=1\Big] \]
For estimation, it is useful to simplify the previous expression to be
\[ATT = \E[\Delta Y | G=1] - \E\Big[ \underbrace{\E[\Delta Y(0) | X, G=0]}_{=:m_0(X)} \Big| G=1\Big] \]
This expression highlights that the “challenging” term to estimate here will be \(m_0(X)\)…
The argument above also requires an overlap condition
For all possible values of the covariates, \(p(x) := \P(G=1|X=x) < 1\).
In words: for all treated units, we can find untreated units that have the same characteristics
Example: Momentarily, suppose that the distribution of \(X\) was the same for both groups, then
\[ \begin{aligned} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \hspace{150pt}\\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X, G=0 ] \Big| G=1\Big]\\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X, G=0 ] \Big| G=0\Big]\\ &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=0] \end{aligned} \]
\(\implies\) (even under conditional parallel trends) we can recover \(ATT\) by just directly comparing paths of outcomes for treated and untreated groups.
More generally: We would not expect the distribution of covariates to be the same across groups.
However, the idea of covariate balancing is to come up with balancing weights \(\nu_0(X)\) such that, after applying the weights, the distribution of \(X\) is the same in the untreated group as in the treated group. Then we would have that
\[ \begin{aligned} ATT &= \E[\Delta Y | G=1] - \E[\Delta Y(0) | G=1] \hspace{150pt}\\ &= \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X, G=0 ] \Big| G=1\Big]\\ &= \E[\Delta Y | G=1] - \E\Big[ \nu_0(X) \E[\Delta Y(0) | X, G=0 ] \Big| G=0\Big]\\ &= \E[\Delta Y | G=1] - \E[\nu_0(X) \Delta Y(0) | G=0] \end{aligned} \]
\(\implies\) We can recover \(ATT\) by re-weighting the untreated group to have the same distribution of covariates as the treated group has…and then just average
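A small simulated illustration of this re-weighting idea (a Python sketch with a hypothetical DGP, not the workshop's R code): using the true propensity score, the weights \(\nu_0(X)\) make the untreated group's covariate mean match the treated group's.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(0, 1, n)

# Hypothetical DGP where treatment probability depends on x
p_x = 1 / (1 + np.exp(-(0.5 * x - 0.2)))    # p(X) = P(G=1|X)
G = rng.binomial(1, p_x)
pi = G.mean()                                # P(G=1)

# Balancing weights for the untreated group: nu_0(X) = p(X)(1-pi) / ((1-p(X)) pi)
nu0 = p_x * (1 - pi) / ((1 - p_x) * pi)

mean_x_treated = x[G == 1].mean()
mean_x_untreated = x[G == 0].mean()                              # unbalanced
mean_x_reweighted = np.average(x[G == 0], weights=nu0[G == 0])   # balanced
```

After re-weighting, averaging \(\nu_0(X)\,\Delta Y\) over the untreated group would recover the treated group's counterfactual path.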
The arguments above suggest that, in order to estimate the \(ATT\), we will either need to
Correctly model \(\E[\Delta Y(0) | X, G=0]\) (i.e., specify a model for \(m_0(X)\))
Balance the distribution of \(X\) to be the same for the untreated group relative to the treated group.
We will pursue these approaches soon, but next we will switch to considering TWFE regressions that include covariates
In this setting, it is common to run the following TWFE regression:
\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\]
However, there are a number of issues:
Issue 1: Issues related to multiple periods and variation in treatment timing still arise
Issue 2: Hard to allow parallel trends to depend on time-invariant covariates
Issue 3: Hard to allow for covariates that could be affected by the treatment
Issues 4 & 5: (harder to see) Can perform poorly when time-varying covariates are included in the parallel trends assumption
Focusing on the case with two periods, to estimate the model, we take first-differences to eliminate the unit fixed effects and ultimately estimate the regression
\[\Delta Y_{it} = \Delta \theta_t + \alpha D_{it} + \Delta X_{it}'\beta + \Delta e_{it}\]
Building on work about interpreting cross-sectional regressions \(Y_i = \alpha D_i + X_i'\beta + e_i\), in the presence of treatment effect heterogeneity
we can provide a useful decomposition of \(\alpha\) \(\rightarrow\)
Can show that the coefficient \(\alpha\) in the TWFE regression can be decomposed as
\[ \small \alpha = \underbrace{\E\Big[ w_1(\Delta X) ATT(X_{t=2},X_{t=1},Z)\Big| G=1 \Big]}_{\textrm{weighted avg. of $ATT(X)$}} + \underbrace{\E\Big[ w_1(\Delta X) \Big( \E[\Delta Y | X_{t=2}, X_{t=1}, Z, G=0] - \L_0(\Delta Y | \Delta X) \Big) \Big| G=1 \Big]}_{\textrm{misspecification bias}}\]
where \[ w_1(\Delta X) := \frac{\big(1-\L(D|\Delta X)\big) \pi}{\E\big[(D-\L(D|\Delta X))^2\big]}\]
Comments:
It is possible for the weights to be negative, given that linear probability models can predict probabilities outside of the \([0,1]\) interval
These weights are easy to estimate as they only depend on linear projections
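A Python sketch (simulated data, not part of the workshop materials) of computing these implicit weights from the linear projection of \(D\) on \(\Delta X\): in sample they average to one over the treated group, but nothing prevents individual weights from being negative when the linear projection exceeds one.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
dx = rng.normal(0, 1, n)                      # Delta X (change in covariate)
p = 1 / (1 + np.exp(-1.5 * dx))
d = rng.binomial(1, p)                        # treatment indicator
pi = d.mean()

# Linear projection L(D | Delta X): OLS of D on (1, Delta X)
Z = np.column_stack([np.ones(n), dx])
coef, *_ = np.linalg.lstsq(Z, d.astype(float), rcond=None)
L = Z @ coef

# Implicit TWFE weights: w1(dx) = (1 - L(D|dx)) * pi / E[(D - L(D|dx))^2]
w1 = (1 - L) * pi / np.mean((d - L) ** 2)

avg_w1_treated = w1[d == 1].mean()            # equals 1 in sample (OLS orthogonality)
any_negative = (w1[d == 1] < 0).any()         # linear projection can exceed 1
```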
About the first term:
Ideally, we would like \(w_1(\Delta X)=1\), which would imply that this term is equal to \(ATT\).
Relative to this baseline, these weights have some drawbacks:
The weights can be negative
The weights suffer from a form of weight reversal (e.g., Słoczyński (2022))
The misspecification bias component is equal to 0 if either:
The implicit regression weights, \(w_1(\Delta X)\), are covariate balancing weights
The untreated group's conditional expectation is linear in \(\Delta X\): \(\E[\Delta Y | X_{t=2}, X_{t=1}, Z, G=0] = \L_0(\Delta Y | \Delta X)\)
Versions of these conditions seem plausible in the case with cross sectional data, but do not seem reasonable in the panel data context \(\rightarrow\)
Consider the condition
\[ \E[\Delta Y | X_{t=2}, X_{t=1}, Z, G=0] = \L_0(\Delta Y | \Delta X) \]
Notice that this condition involves two things:
A condition about linearity (makes sense…and similar to the cross-sectional case)
Changing the covariates that show up from \((X_{t=2}, X_{t=1}, Z)\) to \(\Delta X\)
Condition (b) amounts to changing the identification strategy to one where parallel trends depends only on \(\Delta X\), rather than on \(X_{t=1}\), \(X_{t=2}\), and \(Z\).
In the minimum wage example, we originally wanted to compare counties with the same population and in the same region of the country.
Condition (b) would (effectively) change this to comparing counties with similar population changes over time
This could end up being much different from what we were originally aiming for
Next, a property of implicit regression weights is that they balance the means of regressors included in the model (Chattopadhyay and Zubizarreta (2023))
This is a good property in the cross-sectional setting and suggests that misspecification bias is likely to be small in that case
In our case, this means that the TWFE regression will balance the mean of \(\Delta X\) across groups
More importantly: The TWFE regression does not necessarily balance variables that do not show up in the estimating equation, including:
Levels of time-varying covariates: \(X_{t=2}, X_{t=1}\)
Time-invariant covariates: \(Z\)
Taken together, the arguments above suggest (to me) that misspecification bias is likely to be a much bigger issue in the TWFE setting than in the cross-sectional setting
In Caetano and Callaway (2023), we refer to the misspecification bias term above as hidden linearity bias.
What we mean is that the implications of a linear model may be much more severe in a panel data setting than in the cross-sectional setting:
The arguments that would lead us to think that misspecification bias is typically small in cross-sectional settings do not apply for TWFE regressions
TWFE effectively changes the identification to one where the only covariates that show up in the parallel trends assumption are \(\Delta X\)
What is going wrong with the TWFE regression?
\[Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + e_{it}\]
The source of the issues with the TWFE regression is that, when we difference out the unit fixed effect, we also transform the covariates.
The “inherited” transformation of the covariates makes the approach highly dependent on the model being correctly specified for the results to make sense
Instead, a better option will be to difference the outcomes (in line with parallel trends) but then to directly include the covariates that we want \((X_{t=2}, X_{t=1}, Z)\).
One last question: How much does this matter in practice?
An easy way to check is to apply the implicit regression weights to \((X_{t=2},X_{t=1},Z)\) and see whether they balance these covariates across groups
One of the main things we will do in the application is to see how well implicit TWFE regression weights balance levels of time-varying covariates and omitted time-invariant covariates
Given the limitations of TWFE regressions, we will consider alternative estimation strategies:
Regression Adjustment (RA)
Inverse Propensity Score Weighting (IPW)
Augmented-Inverse Propensity Score Weighting (AIPW)
We will motivate these approaches from the two types of identification results that we showed earlier
These will have a number of better properties than TWFE regressions
Recall our first identification result above:
\[ATT = \E[\Delta Y | G=1] - \E\Big[ \underbrace{\E[\Delta Y(0) | X, G=0]}_{=:m_0(X)} \Big| G=1\Big]\]
The most direct way to proceed is by proposing a model for \(m_0(X)\). For example, \(m_0(X) = X'\beta_0\).
This expression suggests a regression adjustment estimator, based on:
\[ATT = \E[\Delta Y | G=1] - \E[X'\beta_0|G=1]\]
and we can estimate the \(ATT\) by
Step 1: Estimate \(\beta_0\) using untreated group
Step 2: Compute predicted change in untreated potential outcomes for treated units: \(\widehat{\Delta Y_i(0)} = X_i'\hat{\beta}_0\)
Step 3: Compute \(\widehat{ATT} = \displaystyle \frac{1}{n_1} \sum_{i=1}^n G_i \big(\Delta Y_i - X_i'\hat{\beta}_0\big)\)
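The three steps can be sketched as follows (a Python simulation with an assumed linear \(m_0\); the application itself uses R):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(0, 1, (n, 2))
p = 1 / (1 + np.exp(-0.8 * x[:, 0]))
g = rng.binomial(1, p)

att_true = 1.5
beta0 = np.array([1.0, -0.5])
# Conditional parallel trends with a linear m0(X): Delta Y(0) = 0.3 + X'beta0 + noise
dy = 0.3 + x @ beta0 + att_true * g + rng.normal(0, 1, n)

# Step 1: estimate beta0 on the untreated group (with an intercept)
Z = np.column_stack([np.ones(n), x])
bhat, *_ = np.linalg.lstsq(Z[g == 0], dy[g == 0], rcond=None)
# Step 2: predicted change in untreated potential outcomes for treated units
m0_hat = Z @ bhat
# Step 3: average the difference over the treated group
att_ra = np.mean(dy[g == 1] - m0_hat[g == 1])
```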
Alternatively, recall our identification strategy based on re-weighting: \[ ATT = \E[\Delta Y | G=1] - \E[ \nu_0(X) \Delta Y | G=0] \]
The most common balancing weights are based on the propensity score; one can show that \[\begin{align*} \nu_0(X) = \frac{p(X)(1-\pi)}{(1-p(X))\pi} \end{align*}\] where \(p(X) = \P(G=1|X)\) and \(\pi=\P(G=1)\).
For estimation:
Step 1: Estimate the propensity score (typically logit or probit)
Step 2: Compute the weights, using the estimated propensity score
Step 3: Compute \(\widehat{ATT} = \displaystyle \frac{1}{n_1} \sum_{i=1}^n G_i \Delta Y_i - \frac{1}{n_0} \sum_{i=1}^n \frac{\hat{p}(X_i) (1-\hat{\pi}) (1-G_i)}{\big(1-\hat{p}(X_i)\big) \hat{\pi}} \Delta Y_i\), where the second term is the sample analogue of \(\E[\nu_0(X) \Delta Y | G=0]\)
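These steps can be sketched as follows (a Python simulation; for brevity the sketch plugs in the true propensity score where Step 1 would fit a logit or probit):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(0, 1, n)

p_x = 1 / (1 + np.exp(-(0.8 * x - 0.3)))     # Step 1 stand-in: the true p(X)
g = rng.binomial(1, p_x)
pi_hat = g.mean()

att_true = 1.5
dy = 0.3 + x + att_true * g + rng.normal(0, 1, n)   # conditional PT holds given x

# Step 2: balancing weights nu0(X) for the untreated group
nu0 = p_x * (1 - pi_hat) / ((1 - p_x) * pi_hat)
# Step 3: treated mean path minus re-weighted untreated mean path
att_ipw = dy[g == 1].mean() - np.mean(nu0[g == 0] * dy[g == 0])
```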
You can show an additional identification result:
\[ATT = \E\left[ \Delta Y_{t} - \E[\Delta Y_{t} | X, G=0] \big| G=1\right] - \E\left[ \frac{p(X)(1-\pi)}{(1-p(X))\pi} \big(\Delta Y_{t} - \E[\Delta Y_{t} | X, G=0]\big) \Big| G=0\right]\]
This requires estimating both \(p(X)\) and \(\E[\Delta Y|X,G=0]\).
Big advantage: The sample analogue of this expression for \(ATT\) is doubly robust. This means that it will deliver consistent estimates of \(ATT\) if either the model for \(p(X)\) or the model for \(\E[\Delta Y|X,G=0]\) is correctly specified.
Regarding the previous issues with TWFE regressions, RA, IPW, and AIPW satisfy:
Issue 1: Multiple periods ✔
Issue 2: Time-invariant covariates ✔
Issue 3: Covariates affected by the treatment ?
Issue 4: Hidden linearity bias ✔
Issue 5: Weighted average of \(ATT(X)\) ✔
You can also show that they will, by construction, balance the means of \((X_{t=2},X_{t=1},Z)\) across groups.
In my view, these are much better properties than the TWFE regression when it comes to including covariates.
With multiple periods and variation in treatment timing, and when parallel trends holds across all groups and time periods, we previously showed that:
\[ ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1] \]
Then, if desired, we can aggregate these into \(ATT^{es}(e)\) or \(ATT^o\).
This strategy amounted to a two-part strategy:
Break the problem into a series of \(2 \times 2\) comparisons
Aggregate \(ATT(g,t)\) into desired target parameter
We will follow a similar strategy here, just accounting for covariates in the parallel trends assumption
Conditional Parallel Trends with Multiple Periods
For all groups \(g \in \bar{\mathcal{G}}\) (all groups except the never-treated group) and for all time periods \(t=2, \ldots, T\),
\[\E[\Delta Y_{t}(0) | \mathbf{X}, Z, G=g] = \E[\Delta Y_{t}(0) | \mathbf{X}, Z, U=1]\]
where \(\mathbf{X}_i := (X_{i1},X_{i2},\ldots,X_{iT})\).
Under this assumption, using similar arguments to the ones above, one can show that
\[ATT(g,t) = \E\left[ \left( \frac{\indicator{G=g}}{\pi_g} - \frac{p_g(\mathbf{X},Z)U}{(1-p_g(\mathbf{X},Z))\pi_g}\right)\Big(Y_{t} - Y_{g-1} - m_{gt}^0(\mathbf{X},Z)\Big) \right]\]
where \(p_g(\mathbf{X},Z) := \P(G=g|\mathbf{X},Z,\indicator{G=g}+U=1)\) and \(m_{gt}^0(\mathbf{X},Z) := \E[Y_{t}-Y_{g-1}|\mathbf{X},Z,U=1]\).
Because \(\mathbf{X}_i\) contains \(X_{it}\) for all time periods, terms like \(m_{gt}^0(\mathbf{X},Z)\) can be quite high-dimensional (and hard to estimate) in many applications.
In many cases, it may be reasonable to replace \(\mathbf{X}_i\) with a lower-dimensional function of the covariates (this is available in the pte package currently and may be added to did soon).
)Otherwise, however, everything is the same as before:
Recover \(ATT(g,t)\)
If desired: aggregate into \(ATT^{es}(e)\) or \(ATT^o\).
Exploit minimum wage changes across states
Goals:
Include covariates in the parallel trends assumption, assess how much this matters
Try to assess how well different estimation strategies do in terms of handling covariates
Let’s start by assuming that parallel trends holds conditional on a county’s population and average income (sometimes we’ll add region too)
I’ll show results for the following cases:
Results without covariates (as a reminder of results from last time)
Two period TWFE with covariates
All periods TWFE with covariates
Callaway and Sant’Anna (2021) including \(X_{g-1}\) and \(Z\) as covariates
In addition to estimates, we’ll also assess how well each of these works in terms of balancing covariates using the twfeweights package.
Call:
did::aggte(MP = attgt, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0571 0.0085 -0.0738 -0.0404 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0888 0.0189 -0.1302 -0.0475 *
2006 -0.0427 0.0083 -0.0610 -0.0245 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
# run TWFE regression
data2_subset <- subset(data2, year %in% c(2003,2004))
data2_subset <- subset(data2_subset, G %in% c(0, 2004))
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + year,
data=data2_subset,
cluster="id")
modelsummary(twfe_x, gof_omit=".*")
|          | (1)     |
|----------|---------|
| post     | -0.032  |
|          | (0.019) |
| lpop     | 0.833   |
|          | (0.261) |
| lavg_pay | 0.037   |
|          | (0.145) |
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
tname="year",
idname="id",
gname="G",
xformla=~region + lpop + lavg_pay,
control_group="nevertreated",
base_period="universal",
est_method="dr",
data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0317 0.0075 -0.0463 -0.017 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0509 0.0204 -0.0928 -0.0089 *
2006 -0.0230 0.0076 -0.0387 -0.0074 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Doubly Robust
Additional Results:
Add region as a covariate in TWFE [details]
One-shot imputation estimators [details]
Regression adjustment [details]
Inverse propensity score weighting [details]
Bonus Material:
Including covariates in the parallel trends assumption can make difference-in-differences identification strategies more plausible
If you want to include covariates in the parallel trends assumption, it is better to use approaches that directly include the covariates relative to estimation strategies that transform the covariates
In the minimum wage application, we did better in terms of covariate balance with regression adjustment and AIPW; however, there still seemed to be violations of parallel trends in pre-treatment periods
Regression adjustment is closely related to imputation estimators, which we talked about in the first session.
In settings with multiple periods and variation in treatment timing, these are often operationalized in different ways though
In Callaway and Sant’Anna (2021), we considered regression adjustment at the group-time level \(\implies\) do \(2 \times 2\) regression adjustment many times and then aggregate
However, most imputation estimators are implemented in one shot, where you would typically estimate the model \[Y_{it}(0) = \theta_t + \eta_i + X_{it}'\beta + e_{it}\] across all time periods
One-shot imputation estimators have similar limitations as TWFE regressions when it comes to covariates:
\[Y_{it}(0) = \theta_t + \eta_i + X_{it}'\beta + e_{it}\]
Issue 1: Multiple periods ✔
Issue 2: Time-invariant covariates ⚠
Issue 3: Covariates affected by the treatment ❌
Issue 4: Hidden linearity bias ❌
Issue 5: Weighted average of \(ATT(X)\) ✔
[back]
To understand double robustness, we can rewrite the expression for \(ATT\) as \[\begin{align*} ATT = \E\left[ \frac{D}{\pi} \Big(\Delta Y - m_0(X)\Big) \right] - \E\left[ \frac{p(X)(1-D)}{(1-p(X))\pi} \Big(\Delta Y - m_0(X)\Big)\right] \end{align*}\]
The first term is exactly the same as what comes from regression adjustment
If we correctly specify a model for \(m_0(X)\), it will be equal to \(ATT\).
If \(m_0(X)\) not correctly specified, then, by itself, this term will be biased for \(ATT\)
The second term can be thought of as a de-biasing term
If \(m_0(X)\) is correctly specified, it is equal to 0
If \(p(X)\) is correctly specified, it reduces to \(\E[\Delta Y_{t}(0) | G=1] - \E[m_0(X)|G=1]\), which both delivers the counterfactual untreated potential outcomes and removes the (possibly misspecified) \(m_0(X)\) term from the first expression
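A Python sketch of this double-robustness property (simulated data, not the workshop's R code): the outcome model is deliberately misspecified as a constant while the propensity score is correct, so AIPW stays on target while plain regression adjustment does not.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
x = rng.normal(0, 1, n)
p_x = 1 / (1 + np.exp(-0.8 * x))              # correct propensity score
d = rng.binomial(1, p_x)
pi = d.mean()

att_true = 1.5
dy = x + att_true * d + rng.normal(0, 1, n)   # Delta Y; conditional PT holds given x

# Deliberately misspecified outcome model: a constant that ignores x
m0_bad = np.full(n, dy[d == 0].mean())

# Regression adjustment with the bad m0 is biased...
att_ra_bad = np.mean(dy[d == 1] - m0_bad[d == 1])

# ...but the AIPW de-biasing term, using the correct p(X), fixes it
term1 = np.mean(d / pi * (dy - m0_bad))
term2 = np.mean(p_x * (1 - d) / ((1 - p_x) * pi) * (dy - m0_bad))
att_aipw = term1 - term2
```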
[Back]
We’ll allow for path of outcomes to depend on region of the country
# run TWFE regression
twfe_x <- fixest::feols(lemp ~ post + lpop + lavg_pay | id + region^year,
data=data2,
cluster="id")
modelsummary(twfe_x, gof_omit=".*")
|          | (1)     |
|----------|---------|
| post     | -0.022  |
|          | (0.008) |
| lpop     | 1.057   |
|          | (0.137) |
| lavg_pay | 0.074   |
|          | (0.079) |
Relative to previous results, this is much smaller—this is (broadly) in line with the literature where controlling for region often matters a great deal (e.g., Dube, Lester, and Reich (2010)).
[back]
# it's reg. adj. even though the function says aipw...
ra_wts <- implicit_aipw_weights(
yname = "lemp",
tname = "year",
idname = "id",
gname = "G",
xformula = ~ 1,
d_covs_formula = ~ lpop + lavg_pay,
pscore_formula = ~1,
data = data2
)
ra_wts$est
[1] -0.06098144
i.e., we estimate a somewhat larger effect of the minimum wage on teen employment
[back]
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
tname="year",
idname="id",
gname="G",
xformla=~region + lpop + lavg_pay,
control_group="nevertreated",
base_period="universal",
est_method="reg",
data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0321 0.0079 -0.0477 -0.0166 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0596 0.0194 -0.1007 -0.0185 *
2006 -0.0197 0.0080 -0.0367 -0.0027 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Outcome Regression
[back]
# callaway and sant'anna including covariates
cs_x <- att_gt(yname="lemp",
tname="year",
idname="id",
gname="G",
xformla=~region + lpop + lavg_pay,
control_group="nevertreated",
base_period="universal",
est_method="ipw",
data=data2)
cs_x_res <- aggte(cs_x, type="group")
summary(cs_x_res)
cs_x_dyn <- aggte(cs_x, type="dynamic")
ggdid(cs_x_dyn)
Call:
aggte(MP = cs_x, type = "group")
Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015>
Overall summary of ATT's based on group/cohort aggregation:
ATT Std. Error [ 95% Conf. Int.]
-0.0313 0.0083 -0.0475 -0.0151 *
Group Effects:
Group Estimate Std. Error [95% Simult. Conf. Band]
2004 -0.0514 0.0179 -0.0902 -0.0127 *
2006 -0.0222 0.0072 -0.0377 -0.0067 *
---
Signif. codes: `*' confidence band does not cover 0
Control Group: Never Treated, Anticipation Periods: 0
Estimation Method: Inverse Probability Weighting
[back]
So far, our discussion has been for the case where the time-varying covariates evolve exogenously.
In some applications, we may want to control for covariates that themselves could be affected by the treatment
Classical examples in labor economics: A person’s industry, occupation, or union status
These are often referred to as “bad controls”
You can see a tension here:
We would like to compare units who, absent being treated, would have had the same (say) union status
But union status could be affected by the treatment
The most common practice is to just completely drop these covariates from the analysis
We will consider some alternatives
To wrap our heads around this, let’s go back to the case with two time periods.
Define treated and untreated potential covariates: \(X_{it}(1)\) and \(X_{it}(0)\). Notice that in the “textbook” two period setting, we observe \[X_{it=2} = D_i X_{it=2}(1) + (1-D_i) X_{it=2}(0) \qquad \textrm{and} \qquad X_{it=1} = X_{it=1}(0)\]
Then, we will consider parallel trends in terms of untreated potential outcomes and untreated potential covariates:
Conditional Parallel Trends using Untreated Potential Covariates
\[\E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=1] = \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0]\]
Following the same line of argument as before, it follows that
\[ATT = \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \Big| G=1\Big]\]
The second term is the tricky one. Notice that:
The inside conditional expectation is identified — we see untreated potential outcomes and covariates for the untreated group
However, we cannot average over \(X_{t=2}(0)\) for the treated group, because we don’t observe \(X_{t=2}(0)\) for the treated group
There are several options for what we can do \(\rightarrow\)
One idea is to just ignore that the covariates may have been affected by the treatment:
Alternative Conditional Parallel Trends 1
\[\E[\Delta Y(0) | { \color{red} X_{\color{red}{t=2} } }, X_{t=1}(0), Z, G=1] = \E[\Delta Y(0) | { \color{red} X_{\color{red}{t=2}} }, X_{t=1}(0), Z, G=0]\]
The limitations of this approach are well known (even discussed in MHE), and this is not typically the approach taken in empirical work
Job Displacement Example: You would compare paths of outcomes for workers who left a union because they were displaced to paths of outcomes for non-displaced workers who also left a union (e.g., because of a better non-unionized job opportunity)
It is more common in empirical work to drop \(X_{it}(0)\) entirely from the parallel trends assumption
Alternative Conditional Parallel Trends 2
\[\E[\Delta Y(0) | Z, G=1] = \E[\Delta Y(0) | Z, G=0]\]
In my view, this is not attractive either though. If we believe this assumption, then we have basically solved the bad control problem by assuming that it does not exist.
Job Displacement Example: We have now just assumed that path of earnings (absent job displacement) doesn’t depend on union status
Perhaps a better alternative identifying assumption is the following one
Alternative Conditional Parallel Trends 3
\[\E[\Delta Y(0) | X_{t=1}(0), Z, G=1] = \E[\Delta Y(0) | X_{t=1}(0), Z, G=0]\]
Intuition: Conditional parallel trends holds after conditioning on pre-treatment time-varying covariates that could have been affected by treatment
Job Displacement Example: Path of earnings (absent job displacement) depends on pre-treatment union status, but not untreated potential union status in the second period
What to do: Since \(X_{it=1}(0)\) is observed for all units, we can immediately operationalize this assumption using our arguments from earlier (i.e., the ones without bad controls)
This is difficult to operationalize with a TWFE regression
In practice, you can just include the bad control among other covariates in did
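A simulated contrast (Python sketch, hypothetical DGP) between this approach—conditioning on the pre-treatment covariate—and the "bad control" practice of conditioning on the realized period-2 covariate that the treatment has shifted:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1 = rng.normal(0, 1, n)                      # pre-treatment covariate X_{t=1}
g = rng.binomial(1, 0.5, n)                   # treatment (random, to isolate the issue)

x2_0 = x1 + rng.normal(0, 1, n)               # untreated potential X_{t=2}(0)
x2 = x2_0 + 1.0 * g                           # realized X_{t=2}: shifted by treatment

att_true = 1.5
dy = x1 + att_true * g + rng.normal(0, 1, n)  # Delta Y(0) depends on x1 only

def ra_att(covariate):
    # regression adjustment conditioning on the given covariate
    Z = np.column_stack([np.ones(n), covariate])
    b, *_ = np.linalg.lstsq(Z[g == 0], dy[g == 0], rcond=None)
    return np.mean(dy[g == 1] - (Z @ b)[g == 1])

att_good = ra_att(x1)   # condition on pre-treatment X_{t=1}: on target
att_bad = ra_att(x2)    # condition on realized X_{t=2}: biased by treatment's shift
```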
Another option is to keep the original identifying assumption, but add additional assumptions where we (in some sense) treat \(X_t\) as an outcome and as a covariate.
Recall:
\[ATT = \E[\Delta Y | G=1] - \E\Big[ \E[\Delta Y(0) | X_{t=2}(0), X_{t=1}(0), Z, G=0] \Big| G=1\Big]\]
If we could figure out distribution of \(X_{t=2}(0)\) for the treated group, we could recover \(ATT\)
Covariate Unconfoundedness Assumption
\[X_{t=2}(0) \independent D | X_{t=1}(0), Z\]
Intuition: For the treated group, the time-varying covariate would have evolved in the same way over time as it actually did for the untreated group, conditional on \(X_{t=1}\) and \(Z\).
Notice that this assumption only concerns untreated potential covariates \(\implies\) it allows for \(X_{t=2}\) to be affected by the treatment
Making an assumption like this indicates that \(X_{t=2}(0)\) is playing a dual role: (i) start by treating it as if it’s an outcome, (ii) have it continue to play a role as a covariate
Under this assumption, one can show that we can recover the \(ATT\):
\[ATT = \E[\Delta Y | G=1] - \E\left[ \E[\Delta Y | X_{t=1}, Z, G=0] \Big| G=1 \right]\]
This is the same expression as in Option 3
In some cases, it may make sense to condition on other additional variables (e.g., the lagged outcome \(Y_{t=1}\)) in the covariate unconfoundedness assumption. In this case, it is still possible to identify \(ATT\), but it is more complicated
It could also be possible to use alternative identifying assumptions besides covariate unconfoundedness — at a high-level, we somehow need to recover the distribution of \(X_{t=2}(0)\)
See Caetano et al. (2022) for more details about bad controls.
[Back]