class: center, middle, inverse, title-slide

# Modern Approaches to Difference in Differences

### Brantly Callaway, University of Georgia

### October 22, 2021
Session 4: Conditional Parallel Trends and Other Extensions

---

# Conditional Parallel Trends

`$$\newcommand{\E}{\mathbb{E}}$$`
`$$\newcommand{\P}{\mathrm{P}}$$`

<style type="text/css">
border-top: 80px solid #BA0C2F;
.inverse { background-color: #BA0C2F; }
.alert { font-weight:bold; color: red; }
.alert-blue { font-weight: bold; color: blue; }
.remark-slide-content { font-size: 23px; padding: 1em 4em 1em 4em; }
.highlight-red { background-color:red; padding:0.1em 0.2em; }
</style>

For simplicity, let's go back to the case without variation in treatment timing

- Straightforward to extend

--

## Conditional Parallel Trends Assumption

For all time periods,

`$$\E[\Delta Y_t(0) | X, D=1] = \E[\Delta Y_t(0) | X, D=0]$$`

--

In words: Parallel trends holds conditional on having the same covariates `\(X\)`.

--

Labor Economics Example: `\(Y\)` - earnings, path of earnings may depend on covariates like education, race, gender, age, etc.

---

# Recovering ATT under conditional PTA

Under the conditional parallel trends assumption,

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}
\end{aligned}`
$$

---
count:false

# Recovering ATT under conditional PTA

Under the conditional parallel trends assumption,

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]
\end{aligned}`
$$

---
count:false

# Recovering ATT under conditional PTA

Under the conditional parallel trends assumption,

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=1] \big| D=1\big]
\end{aligned}`
$$

---
count:false

# Recovering ATT under conditional PTA

Under the conditional parallel trends assumption,

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=1] \big| D=1\big]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=0] \big| D=1\big]
\end{aligned}`
$$

---
count:false

# Recovering ATT under conditional PTA

Under the conditional parallel trends assumption,

$$
`\begin{aligned}
ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=1] \big| D=1\big]\\
&= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E\big[ \E[Y_{t^*}(0) - Y_{t^*-1}(0) | X, D=0] \big| D=1\big]\\
&= \E[\Delta Y_{t^*} | D=1] - \E\big[ \E[\Delta Y_{t^*} | X, D=0] \big| D=1\big]
\end{aligned}`
$$

Everything is identified here

--

- but estimation may be more challenging

---

# Estimation

<span class="alert-blue">Idea 1: Regression</span>

Assume: `\(\E[\Delta Y_{t^*} | X, D=0] = X'\beta_{t^*}\)`, then

`$$ATT=\E[\Delta Y_{t^*} | D=1] - \E[X'|D=1]\beta_{t^*}$$`

---

# Estimation

<span class="alert-blue">Idea 2: Propensity Score Weighting</span>

Along the lines of Abadie (2005), can show that

$$
ATT = \E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{p(1-p(X))} \right) \Delta Y_{t^*}\right]
$$

where `\(p := \P(D=1)\)` and `\(p(X) := \P(D=1|X)\)` is the propensity score (can estimate by logit or probit).
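
A minimal simulated sketch of the first two ideas, assuming a single covariate, a logit propensity score, and a linear model for the untreated outcome change. The data-generating process and all variable names are hypothetical, and the weighting estimator uses normalized weights for the untreated group (a common implementation choice, not necessarily the exact expression above):

```r
# Simulated two-period example; every name and number here is hypothetical
set.seed(1)
n  <- 5000
x  <- rnorm(n)                        # covariate
d  <- rbinom(n, 1, plogis(0.5 * x))   # treatment is more likely when x is large
dy <- 1 + x + 2 * d + rnorm(n)        # change in outcome; true ATT = 2
df <- data.frame(dy = dy, d = d, x = x)

# naive difference in mean changes ignores the covariate imbalance
att_naive <- mean(df$dy[df$d == 1]) - mean(df$dy[df$d == 0])

# Idea 1: model the outcome change among untreated units, predict for treated
out_mod <- lm(dy ~ x, data = df[df$d == 0, ])
att_reg <- mean(df$dy[df$d == 1]) -
  mean(predict(out_mod, newdata = df[df$d == 1, ]))

# Idea 2: logit propensity score, then reweight untreated units
ps <- fitted(glm(d ~ x, family = binomial, data = df))
w0 <- ps * (1 - df$d) / (1 - ps)      # equals zero for treated units
att_ipw <- mean(df$dy[df$d == 1]) - sum(w0 * df$dy) / sum(w0)

round(c(naive = att_naive, reg = att_reg, ipw = att_ipw), 3)
```

Because treatment is more likely at high values of `x` and the outcome change rises with `x`, the naive comparison should overstate the true effect of 2, while both adjusted estimates should land close to it.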
--

Intuition: under conditional parallel trends, (unconditional) paths of untreated potential outcomes differ across groups because the distribution of covariates differs between the treated and untreated groups

--

- This expression "weights up" `\(\Delta Y_{t^*}\)` for units from the untreated group that have covariates that are relatively more common among the treated group

--

- The choice between the first two ideas likely comes down to whether you feel better about modeling `\(\E[\Delta Y_{t^*}|X,D=0]\)` or `\(p(X)\)`.

---

# Estimation

<span class="alert-blue">Idea 3: Doubly Robust</span>

Along the lines of Sant'Anna and Zhao (2020), can show

`$$ATT=\E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{p(1-p(X))} \right)(\Delta Y_{t^*} - \E[\Delta Y_{t^*} | X, D=0]) \right]$$`

where `\(p := \P(D=1)\)`.

--

This requires estimating both `\(p(X)\)` and `\(\E[\Delta Y_{t^*}|X,D=0]\)`.

--

Big advantage:

- This expression for `\(ATT\)` is *doubly robust*. This means that it will deliver consistent estimates of `\(ATT\)` if <span class="alert">either</span> the model for `\(p(X)\)` or the model for `\(\E[\Delta Y_{t^*}|X,D=0]\)` is correctly specified.
--

- In my experience, doubly robust estimators perform much better than either the regression or propensity score weighting estimators

--

- This also provides a connection to estimating `\(ATT\)` under conditional parallel trends using machine learning for `\(p(X)\)` and `\(\E[\Delta Y_{t^*}|X,D=0]\)` (see: Chang (2020))

---

# Example: MW with region-specific trends

Rough idea: trends in teen employment may be quite different across different regions of the country

- This is conditional parallel trends

Idea: Include `region-year` fixed effects

```r
library(did)
library(fixest)
library(modelsummary)
load("mw_data2.RData")

# create post treatment dummy
mw_data2$post <- 1*(mw_data2$year >= mw_data2$first.treat & mw_data2$first.treat > 0)

# run TWFE regression
twfe_x <- feols(lemp ~ post | countyreal + region^year, data=mw_data2)
```

---

# Example: Minimum Wage

```r
modelsummary(twfe_x, gof_omit=".*")
```

<table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;">
 <thead>
  <tr>
   <th style="text-align:left;"> </th>
   <th style="text-align:center;"> Model 1 </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> post </td>
   <td style="text-align:center;"> 0.001 </td>
  </tr>
  <tr>
   <td style="text-align:left;"> </td>
   <td style="text-align:center;"> (0.006) </td>
  </tr>
</tbody>
</table>

Relative to previous results, this is much smaller and statistically insignificant

This is a pretty famous result in the MW literature (Dube et al. (2010))

---

# Example: Minimum Wage

Same idea, but estimate via group-time average treatment effects:

```r
cs_x <- att_gt(yname="lemp",
               tname="year",
               idname="countyreal",
               gname="first.treat",
               xformla=~region,
               data=mw_data2)
cs_x_res <- aggte(cs_x, type="group")
```

---

# Example: Minimum Wage

```r
summary(cs_x_res)
```

```
## 
## Call:
## aggte(MP = cs_x, type = "group")
## 
## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Forthcoming at the Journal of Econometrics <https://arxiv.org/abs/1803.09015>, 2020.
## 
## 
## Overall summary of ATT’s based on group/cohort aggregation:
##      ATT    Std. Error     [ 95%  Conf. Int.]
##  -0.0311        0.0059    -0.0427     -0.0195 *
## 
## 
## Group Effects:
##  Group Estimate Std. Error [95% Simult.  Conf. Band]
##   2003  -0.0197     0.0153       -0.0573      0.0179
##   2005   0.0169     0.0094       -0.0061      0.0400
##   2006  -0.0514     0.0078       -0.0706     -0.0321 *
## ---
## Signif. codes: `*' confidence band does not cover 0
## 
## Control Group: Never Treated, Anticipation Periods: 0
## Estimation Method: Doubly Robust
```

---

# Example: Minimum Wage

```r
# regression
cs_x_reg <- att_gt(yname="lemp",
                   tname="year",
                   idname="countyreal",
                   gname="first.treat",
                   xformla=~region + lpop,
                   est_method = "reg",
                   data=mw_data2)
cs_x0_reg <- aggte(cs_x_reg, type="group")
```

---

# Example: Minimum Wage

```r
# propensity score weighting
cs_x_ipw <- att_gt(yname="lemp",
                   tname="year",
                   idname="countyreal",
                   gname="first.treat",
                   xformla=~region + lpop,
                   est_method = "ipw",
                   data=mw_data2)
cs_x0_ipw <- aggte(cs_x_ipw, type="group")
```

---

# Example: Minimum Wage

```r
# doubly robust
cs_x_dr <- att_gt(yname="lemp",
                  tname="year",
                  idname="countyreal",
                  gname="first.treat",
                  xformla=~region + lpop,
                  est_method = "dr",
                  data=mw_data2)
cs_x0_dr <- aggte(cs_x_dr, type="group")
```

---

# Example: Minimum Wage

```r
# show results
round(cbind.data.frame(reg=cs_x0_reg$overall.att,
                       ipw=cs_x0_ipw$overall.att,
                       dr=cs_x0_dr$overall.att), 6)
```

```
##         reg       ipw        dr
## 1 -0.033159 -0.035392 -0.032318
```

--

These are all similar, but

- somewhat smaller in magnitude than unconditional case

- much different from TWFE results

---

# What about violations of parallel trends?

DID + pre-tests are a very powerful/useful approach to identifying causal effect parameters when repeated observations are available

--

But what should you do in cases like our application where:

---

# What about violations of parallel trends?
<img src="data:image/png;base64,#modern_did_session4_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

---

# What about violations of parallel trends?

<span class="alert-blue">One possibility:</span> The model underlying parallel trends is not correct

--

Examples:

- Interactive fixed effects: `\(Y_{it}(0) = \theta_t + \eta_i + \lambda_i F_t + v_{it}\)`

  - Special case: `\(Y_{it}(0) = \theta_t + \eta_i + \lambda_i t + v_{it}\)` (linear trends)

  - see: Callaway and Karami (2021) for small-T case

--

- More generally: `\(Y_{it}(0) = g_t(\eta_i) + v_{it}\)` (additive separability not appropriate)

--

- Pandemics (Callaway and Li (2021))

--

`\(\implies\)` sometimes you can make progress, but other times not

---

# What about violations of parallel trends?

Another strategy: partial identification / sensitivity analysis

- See: Manski and Pepper (2018), Rambachan and Roth (2021)

--

RR provide several versions of sensitivity analysis

- We'll focus on what they call `\(\Delta^{RM}(\bar{M})\)`

- We'll allow for violations of parallel trends up to `\(\bar{M}\)` times as large as were observed in any pre-treatment period.

- And we'll vary `\(\bar{M}\)`.

---

# What about violations of parallel trends?

<img src="data:image/png;base64,#modern_did_session4_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" />

---

# Conclusion

That's all! Thank you very much for inviting me.

Email: [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu)