class: center, middle, inverse, title-slide .title[ # Modern Approaches to Difference-in-Differences ] .author[ ### Brantly Callaway, University of Georgia ] .date[ ### June 1, 2023
NEXT-D Workshop at Tulane University ] --- # Introduction `$$\newcommand{\E}{\mathbb{E}} \newcommand{\E}{\mathbb{E}} \newcommand{\var}{\mathrm{var}} \newcommand{\cov}{\mathrm{cov}} \newcommand{\Var}{\mathrm{var}} \newcommand{\Cov}{\mathrm{cov}} \newcommand{\Corr}{\mathrm{corr}} \newcommand{\corr}{\mathrm{corr}} \newcommand{\L}{\mathrm{L}} \renewcommand{\P}{\mathrm{P}} \newcommand{\independent}{{\perp\!\!\!\perp}} \newcommand{\indicator}[1]{ \mathbf{1}\{#1\} }$$` <style type="text/css"> border-top: 80px solid #BA0C2F; .inverse { background-color: #BA0C2F; } .alert { font-weight:bold; color: #BA0C2F; } .alert-blue { font-weight: bold; color: #004E60; } .remark-slide-content { font-size: 23px; padding: 1em 4em 1em 4em; } .highlight-red { background-color:red; padding:0.1em 0.2em; } .assumption-box { background-color: rgba(222,222,222,.5); font-size: x-large; padding: 10px; border: 10px solid lightgray; margin: 10px; } .assumption-title { font-size: x-large; font-weight: bold; display: block; margin: 10px; text-decoration: underline; color: #BA0C2F; } </style> Difference-in-differences (DID) is an extremely popular identification strategy for trying to recover the causal effect of some treatment on some outcome of interest -- There have been a number of important advances in our understanding of DID over the past few years: * Limitations of two-way fixed effects (TWFE) regressions as a way to implement a DID identification strategy * Alternative estimation strategies that are robust to treatment effect heterogeneity * Extensions of these alternative approaches along a number of empirically relevant dimensions -- <span class="alert">Today: </span> Overview of recent work `\(+\)` a (fairly) detailed empirical application with code -- <span class="alert-blue">Reference: </span> Callaway (2023, *Handbook of Labor, Human Resources and Population Economics*) --- # Outline <br> <br> 1. Introduction to Difference-in-Differences 2. Overview of Issues with TWFE Regressions 3. 
Alternative Estimation Strategies 4. Empirical Example: Minimum Wages and Employment --- class: inverse, middle, center count: false # Introduction to Difference-in-Differences --- # The Logic of DID Exploit a data structure where the researcher observes: 1. Multiple periods of data 2. Some pre-treatment data for all units 3. Some units become treated while other units remain untreated -- <br> <span class="alert-blue">Running Example</span> The effect of a state-level minimum wage increase on employment --- # The Logic of DID <span class="alert-blue">Intuition for DID identification strategy</span> is to compare: - The change in outcomes over time for units that participate in the treatment to - The change in outcomes over time for units that didn't participate in the treatment -- Rough intuition: Compares a treated unit's outcomes to its past outcomes while making adjustment for "common shocks" using the comparison group. [See: Heckman, Ichimura, and Todd (1997), Blundell and Costa Dias (2009), Gardner (2021), Ghanem, Sant'Anna, and Wuthrich (2022) for more details about when/why this procedure makes sense.] 
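To make the comparison above concrete, here is a minimal sketch in R using made-up group means (the numbers are hypothetical and are not taken from the minimum wage data used later). It computes the change over time for each group and then differences the two changes.

```r
# Hypothetical group means of log teen employment (made-up numbers).
y_treated_pre    <- 5.40
y_treated_post   <- 5.35
y_untreated_pre  <- 5.20
y_untreated_post <- 5.22

# Change over time within each group.
change_treated   <- y_treated_post - y_treated_pre      # treated group vs. its own past
change_untreated <- y_untreated_post - y_untreated_pre  # stands in for "common shocks"

# DID: change for the treated group net of the change for the comparison group.
did_estimate <- change_treated - change_untreated
print(round(did_estimate, 3))
```

With these numbers, the treated group's outcome falls by 0.05 while the comparison group's rises by 0.02, so the DID estimate is -0.07.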
-- <br> DID identification strategies allow for <span class="alert">treatment effect heterogeneity</span> * This is going to be a major issue in the discussion below --- # Textbook Version of DID <span class="alert">Data: </span> * 2 periods: `\(t^*\)`, `\(t^*-1\)` * No one treated until period `\(t^*\)` * Some units remain untreated in period `\(t^*\)` * 2 groups: `\(D=1\)` or `\(D=0\)` (treated and untreated) -- <span class="alert">Potential Outcomes: </span> `\(Y_{it}(1)\)` and `\(Y_{it}(0)\)` -- <span class="alert">Observed Outcomes: </span> `\(Y_{it^*}\)` and `\(Y_{it^*-1}\)` `\begin{align*} Y_{it^*} = D_i Y_{it^*}(1) +(1-D_i)Y_{it^*}(0) \quad \textrm{and} \quad Y_{it^*-1} = Y_{it^*-1}(0) \end{align*}` --- # Textbook Version of DID (cont'd) <span class="alert">Target Parameter: </span> `$$ATT = \E[Y_{t^*}(1) - Y_{t^*}(0) | D=1]$$` Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group -- <span class="alert">Parallel Trends Assumption: </span> `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` Explanation: Mean path of untreated potential outcomes is the same for the treated group as for the untreated group -- <span class="alert">Identification: </span>Under PTA, we can identify `\(ATT\)`: $$ `\begin{aligned} ATT &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*}(0) | D=1] \end{aligned}` $$ --- count:false # Textbook Version of DID (cont'd) <span class="alert">Target Parameter: </span> `$$ATT = \E[Y_{t^*}(1) - Y_{t^*}(0) | D=1]$$` Explanation: Mean difference between treated and untreated potential outcomes in the second period among the treated group <span class="alert">Parallel Trends Assumption: </span> `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` Explanation: Mean path of untreated potential outcomes is the same for the treated group as for the untreated group <span class="alert">Identification: </span>Under PTA, we can identify `\(ATT\)`: $$ `\begin{aligned} 
ATT &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*}(0) | D=1]\\ &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*} | D=0] \end{aligned}` $$ --- # Setup w/ Staggered Treatment Adoption - `\(\mathcal{T}\)` time periods -- - Units can become treated at different points in time -- - For simplicity, we'll adopt the <span class="alert-blue">staggered treatment framework</span>. That is, once a unit becomes treated, it remains treated. - `\(G_i\)` - a unit's <span class="alert-blue">group</span> - the time period in which the unit becomes treated. Also, define `\(U_i=1\)` for never-treated units and `\(U_i=0\)` otherwise. -- - Potential outcomes: `\(Y_{it}(g)\)` - the outcome that unit `\(i\)` would experience in time period `\(t\)` if it became treated in period `\(g\)`. -- - Untreated potential outcome: `\(Y_{it}(0)\)` - the outcome unit `\(i\)` would experience in time period `\(t\)` if it did not participate in the treatment in any period. --- # Setup (cont'd) - Observed outcome: `\(Y_{it}=Y_{it}(G_i)\)` -- - No anticipation condition: `\(Y_{it} = Y_{it}(0)\)` for all `\(t < G_i\)` (pre-treatment periods for unit `\(i\)`) -- Unit-level treatment effect `$$\tau_{it}(g) = Y_{it}(g) - Y_{it}(0)$$` -- Average treatment effect for unit `\(i\)` (across time periods): `$$\bar{\tau}_i(g) = \frac{1}{\mathcal{T} - g + 1} \sum_{t=g}^{\mathcal{T}} \tau_{it}(g)$$` --- # Target Parameters * <span class="alert">Group-time average treatment effects</span> `\begin{align*} ATT(g,t) = \E[ \tau_t(G) | G=g] \end{align*}` Explanation: `\(ATT\)` for group `\(g\)` in time period `\(t\)` -- * <span class="alert">Event Study</span> `\begin{align*} ATT^{ES}(e) = \E[\tau_{G+e}(G) | G \in \mathcal{G}_e] \end{align*}` where `\(\mathcal{G}_e\)` is the set of groups observed to have experienced the treatment for `\(e\)` periods at some point. 
Explanation: `\(ATT\)` when units have been treated for `\(e\)` periods <!--In math: `\(\mathcal{G}_e = \{g : (g+e) \in [2,T] \textrm{ and } g > 0\}\)`--> -- * <span class="alert">Overall ATT</span> `\begin{align*} ATT^O = \E[\bar{\tau}(G) | U=0] \end{align*}` Explanation: `\(ATT\)` across all units that ever participate in the treatment --- # Target Parameters To understand the discussion later, it is also helpful to think of `\(ATT(g,t)\)` as a <span class="alert">building block</span> for the other parameters discussed above. -- Notice that: `\begin{align*} ATT^{ES}(e) = \sum_{g \in \bar{\mathcal{G}}} w^{ES}(g,e) ATT(g,g+e) \qquad \textrm{ and } \qquad ATT^O = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} w^O(g,t) ATT(g,t) \end{align*}` where `\begin{align*} w^{ES}(g,e) = \indicator{g \in \mathcal{G}_e} \P(G=g|G\in \mathcal{G}_e) \qquad \textrm{and} \qquad w^O(g,t) = \frac{\P(G=g|U=0)}{\mathcal{T}-g+1} \end{align*}` -- <br> In other words, if we can identify/recover `\(ATT(g,t)\)`, then we can proceed to recover `\(ATT^{ES}(e)\)` and `\(ATT^O\)`. --- # Identification of `\(ATT(g,t)\)` ## Multiple Period Version of Parallel Trends Assumption For all groups `\(g \in \bar{\mathcal{G}}\)` (all groups except the never-treated group) and for all time periods `\(t=2,\ldots,\mathcal{T}\)`, `\begin{align*} \E[\Delta Y_{t}(0) | G=g] = \E[\Delta Y_{t}(0) | U=1] \end{align*}` <br> -- Using very similar arguments as before, one can show that `\begin{align*} ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1] \end{align*}` -- where the main difference is that we use `\((g-1)\)` as the "base period" (this is the period right before group `\(g\)` becomes treated). --- class: inverse, middle, center count: false # Overview of Issues with TWFE Regressions --- # What does TWFE estimate in this setup? 
For roughly 30 years, the dominant approach to implementing a DID identification strategy has been to run a two-way fixed effects regression: `$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$` -- In the "textbook" case above, you can show that `\(\alpha = ATT\)` `\(\implies\)` TWFE regression is robust to treatment effect heterogeneity -- It's also super-convenient! -- However, this robustness to treatment effect heterogeneity does not extend to more complicated settings: * Staggered treatment adoption (this is the case I'll emphasize) * More complicated treatments (e.g., continuous treatment) / moving into and out of the treatment * Including covariates in the parallel trends assumption --- # Goodman-Bacon (2021) <span class="alert-blue">Goodman-Bacon (2021) intuition:</span> `\(\alpha\)` "comes from" comparisons between the path of outcomes for units whose <span class="alert">treatment status changes</span> relative to the path of outcomes for units whose <span class="alert">treatment status stays the same</span> over time. -- * Some comparisons are for groups that become treated to <span class="alert">not-yet-treated</span> groups (these are very much in the spirit of DID) * Other comparisons are for groups that become treated relative to <span class="alert">already-treated</span> groups (these comparisons are not rationalized by parallel trends assumptions) This can be especially problematic when there are treatment effect dynamics. Dynamics imply different trends from what would have happened absent the treatment. 
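---

# Goodman-Bacon (2021): A Small Numerical Check

To see how comparisons to already-treated groups interact with treatment effect dynamics, here is a minimal sketch on a made-up, noise-free panel. The data, group sizes, and effect sizes are all assumptions chosen for illustration: an early group is treated in period 2 with an effect that grows over time, a late group is treated in period 3, and untreated potential outcomes are zero for everyone, so parallel trends holds trivially.

```r
# Hypothetical balanced panel: 4 units, 3 periods, no never-treated group.
panel <- expand.grid(id = 1:4, t = 1:3)
panel$G <- ifelse(panel$id <= 2, 2, 3)      # early group: G = 2, late group: G = 3
panel$D <- as.numeric(panel$t >= panel$G)   # staggered adoption: treated from G onward
panel$e <- panel$t - panel$G                # length of exposure to the treatment

# Treatment effects: early group has dynamics (1 on impact, 2 one period later);
# late group has an effect of 1 on impact. Y(0) = 0, so Y equals the effect.
panel$tau <- panel$D * ifelse(panel$G == 2, 1 + panel$e, 1)
panel$Y <- panel$tau

# TWFE regression with unit and time fixed effects.
twfe <- lm(Y ~ D + factor(id) + factor(t), data = panel)
alpha_hat <- unname(coef(twfe)["D"])

# Simple average of the true effects across treated observations.
avg_att <- mean(panel$tau[panel$D == 1])

print(c(alpha_twfe = alpha_hat, avg_att = avg_att))
```

In this sketch every `\(ATT(g,t)\)` is at least 1, yet the TWFE coefficient is 0.5. The late group's period-3 change is partly compared with the early group, whose outcomes are still rising because of dynamics, which is exactly the already-treated comparison described above.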
--- # de Chaisemartin and d'Haultfoeuille (2020) <span class="alert-blue">de Chaisemartin and d'Haultfoeuille (2020) intuition:</span> You can write `\(\alpha\)` as a weighted average of `\(ATT(g,t)\)` First, a decomposition: `\begin{align*} \alpha &= \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \\ & + \sum_{g \in \bar{\mathcal{G}}} \sum_{t=1}^{g-1} w^{TWFE}(g,t) \Big( \E[(Y_{t} - Y_{g-1}) | G=g] - \E[(Y_{t} - Y_{g-1}) | U=1] \Big) \end{align*}` -- Second, under parallel trends: `\begin{align*} \alpha = \sum_{g \in \bar{\mathcal{G}}} \sum_{t=g}^{\mathcal{T}} w^{TWFE}(g,t) ATT(g,t) \end{align*}` * But the weights are (non-transparently) driven by the estimation method * These weights have some good / bad / strange properties such as possibly being negative <!-- negative weights can be ruled out if there are no treatment effect dynamics --> <!--depend on relative group sizes--> <!-- for a fixed group more weight on earlier periods --> <!--for a fixed time period more weight on earlier treated groups --> <!--weights would change if you added an extra pre-treatment period--> --- # Event Study Regressions Event study regressions are popular in empirical work. 
`\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{e=-(\mathcal{T}-1)}^{-2} \beta_e D_{it}^e + \sum_{e=0}^{\mathcal{T}} \beta_e D_{it}^e + v_{it} \end{align*}` where `\(D_{it}^e = \indicator{G_i + e = t}\)` is a binary indicator of having been treated for exactly `\(e\)` periods in period `\(t\)` -- Typically, researchers: * Interpret post-treatment event study coefficients as dynamic effects * Use pre-treatment coefficients as a pre-test -- Sun and Abraham (2021) show that the event study regression suffers from issues similar to those of the TWFE regression * `\(\beta_e\)` can include effects from incorrect lengths of exposure * weights on `\(ATT(g,t)\)` are non-transparent and driven by the estimation method and can be negative --- class: inverse, middle, center count: false # Alternative Estimation Strategies --- # Alternative Approaches <span class="alert">We'll discuss:</span> 1. Callaway and Sant'Anna (2021), R: `did`, Stata: `csdid` 2. Sun and Abraham (2021), R: `fixest`, Stata: `eventstudyinteract` 3. Wooldridge (2021), R: `etwfe`, Stata: `jwdid` 4. Gardner (2021) / Borusyak, Jaravel, and Spiess (2022), R: `did2s`, Stata: `did2s` and `did_imputation` -- <span class="alert">Not including:</span> 1. "Clean controls" (Cengiz, Dube, Lindner, and Zipperer (2019) and Dube, Girardi, Jorda, and Taylor (2023)), Stata: `stackedev` <!-- clean control is similar to CS, for a particular group, take its observations + clean controls (not-yet-treated) observations, do this for all groups, and "stack" into a new dataset, then run a TWFE regression on this data. You still have weights driven by the estimation method, but you get rid of negative weights issues and can "undo" the negative weights--> 2. 
de Chaisemartin and d'Haultfoeuille (2020), R: `DIDmultiplegt`, Stata: `did_multiplegt` <!-- DID_M is like ATT^{ES}(e), stata command can allow for different values of e, though I'm not super familiar with it --> --- # Callaway and Sant'Anna (2021) <span class="alert">Key idea:</span> Separate identification and estimation: * Under parallel trends, recall that `$$ATT(g,t) = \E[Y_t - Y_{g-1} | G=g] - \E[Y_t - Y_{g-1} | U=1]$$` -- <span class="alert">Estimation:</span> `$$\widehat{ATT}^{CS}(g,t) = \frac{1}{n}\sum_{i=1}^n \frac{\indicator{G_i = g}}{\hat{p}_g} (Y_{it} - Y_{ig-1}) - \frac{1}{n}\sum_{i=1}^n \frac{\indicator{U_i = 1}}{\hat{p}_U} (Y_{it} - Y_{ig-1})$$` <span class="alert">2nd step:</span> Recall: group-time average treatment effects are building blocks for more aggregated parameters such as `\(ATT^{ES}(e)\)` and `\(ATT^O\)` `\(\implies\)` just plug in * `\(\implies\)` two-step estimation procedure: target local/disaggregated `\(ATT(g,t)\)` in first step, then (if desired) aggregate them into lower dimensional parameters --- # Sun and Abraham (2021) <span class="alert">Intuition: </span> The event study regression is "underspecified" `\(\implies\)` heterogeneous effects can "confound" the treatment effect estimates -- <span class="alert">Solution:</span> Run fully interacted regression: `\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{e \neq -1} \delta^{SA}_{ge} \indicator{G_i=g} \indicator{g+e=t} + v_{it} \end{align*}` -- <span class="alert">2nd step:</span> Aggregate `\(\delta^{SA}_{ge}\)`'s across groups (usually into an event study). * This sidesteps issues with the event study regression coming from treatment effect heterogeneity * For inference, need to account for two-step estimation procedure --- # Wooldridge (2021) <span class="alert">Main question:</span> Are issues in DID literature due to limitations of TWFE regressions themselves or something else? 
-- Proposes running "more interacted" TWFE regression: `\begin{align*} Y_{it} = \theta_t + \eta_i + \sum_{g \in \bar{\mathcal{G}}} \sum_{s=g}^{\mathcal{T}} \alpha_{gs}^W \indicator{G_i=g, t=s} + v_{it} \end{align*}` -- This is quite similar to Sun and Abraham (2021) except that it doesn't include interactions in pre-treatment periods. [The differences about `\((g,t)\)` relative to `\((g,e)\)` are trivial.] * Like SA, this provides robustness to treatment effect heterogeneity by including more interactions * However, unless mainly interested in `\(ATT(g,t)\)`, have to do second step aggregation that (arguably) gives up the "killer feature" of the TWFE regression to begin with --- # Gardner (2021) / BJS (2022) <span class="alert">Intuition: </span>Parallel trends is closely connected to a TWFE model *for untreated potential outcomes* `$$Y_{it}(0) = \theta_t + \eta_i + e_{it}$$` -- <span class="alert">Estimation:</span> * Step 1: Split data into treated and untreated observations * Step 2: Estimate above model for the set of untreated observations * Step 3: "Impute" `\(\hat{Y}_{it}(0) = \hat{\theta}_t + \hat{\eta}_i\)` for the treated observations * `\(\displaystyle \widehat{ATT}^{G/BJS}(g,t) = \frac{1}{n} \sum_{i=1}^n \frac{\indicator{G_i=g}}{\hat{p}_g} \Big(Y_{it} - \hat{Y}_{it}(0)\Big) \xrightarrow{p} ATT(g,t)\)` Can compute other treatment effect parameters too. <!-- like this one: focuses on the problem of figuring out what's going on with untreated potential outcomes, global estimation (for better or worse) of model of untreated potential outcomes --> --- # Similarities and Differences In my view, all of the approaches discussed above are fundamentally similar to each other. 
-- In practice, it is sometimes possible to get different results though this is often driven by * Different choices in terms of default implementation details in computer code * Different estimation strategies trading off efficiency and robustness in different ways --- # Comparison 1: CS and SA In post-treatment periods, these give numerically identical results: `\(\widehat{ATT}^{CS}(g,t) = \hat{\delta}^{SA}_{g,t-g}\)` * This is because a fully interacted regression (SA) is equivalent to taking differences in averages across groups (CS) -- In pre-treatment periods, code will give different pre-treatment estimates, but this is due to different default choices * In SA, all results are relative to a fixed base period (typically the period right before treatment) * In CS, by default, in pre-treatment periods, estimates are of placebo policy effects on impact (i.e., the base period is always the most recent pre-treatment period) -- Similarly, results will be different if you choose a different comparison group in CS (e.g., not-yet-treated vs. never-treated). -- In both cases, these are just different choices though, and, for example, it is feasible (and easy) to set a fixed base period in CS --- # Comp 2: SA and Wooldridge These are clearly closely related, with the difference amounting to whether or not one includes indicators for pre-treatment periods. -- It is fair to see this as a way to <span class="alert">trade off robustness and efficiency</span> * If parallel trends holds across all time periods, then Wooldridge will deliver more efficient estimates (as effectively all pre-treatment periods are used as base periods) * If parallel trends is violated in some pre-treatment periods but holds post-treatment, Wooldridge estimates will be inconsistent, but SA estimates will be robust to violations of parallel trends in pre-treatment periods. 
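---

# Comparison 1: A Numerical Check

The post-treatment equivalence between CS and SA can be checked on simulated data. The sketch below is only an illustration under stated assumptions: a made-up balanced panel with a never-treated group, no covariates, the never-treated group as the comparison group, and `\(g-1\)` as the base period. The CS estimate is computed by hand from differences in group means (not with the `did` package), and the SA-style regression is run with base R's `lm`.

```r
set.seed(1)

# Hypothetical balanced panel: 60 units, 5 periods, groups G = 0 (never treated), 3, 4.
panel <- expand.grid(id = 1:60, t = 1:5)
panel$G <- c(0, 3, 4)[(panel$id - 1) %/% 20 + 1]
panel$D <- as.numeric(panel$G > 0 & panel$t >= panel$G)
eta <- rnorm(60)                                   # unit fixed effects
effect <- panel$D * (1 + 0.5 * (panel$t - panel$G)) * ifelse(panel$G == 4, 2, 1)
panel$Y <- eta[panel$id] + 0.1 * panel$t + effect + rnorm(nrow(panel), sd = 0.5)

# CS-style estimate by hand: difference in mean changes from base period g - 1.
att_cs <- function(g, t) {
  mean_change <- function(rows) {
    mean(panel$Y[rows & panel$t == t]) - mean(panel$Y[rows & panel$t == g - 1])
  }
  mean_change(panel$G == g) - mean_change(panel$G == 0)
}

# SA-style fully interacted regression: one dummy per (group, exposure) cell,
# omitting e = -1 and the never-treated group (both in the "ref" level).
e <- panel$t - panel$G
panel$cell <- ifelse(panel$G > 0 & e != -1, paste0("g", panel$G, "_e", e), "ref")
panel$cell <- relevel(factor(panel$cell), ref = "ref")
sa_fit <- lm(Y ~ factor(id) + factor(t) + cell, data = panel)

# Compare the two sets of estimates in post-treatment periods.
post <- data.frame(g = c(3, 3, 3, 4, 4), t = c(3, 4, 5, 4, 5))
post$cs <- mapply(att_cs, post$g, post$t)
post$sa <- unname(coef(sa_fit)[paste0("cellg", post$g, "_e", post$t - post$g)])
print(post)
print(max(abs(post$cs - post$sa)))
```

The last line prints the largest absolute gap between the two columns, which should be zero up to floating point error in this balanced, no-covariate setup.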
--- # Comp 3: Wooldridge and Gardner/BJS Wooldridge and Gardner/BJS give numerically the same estimates: `\(\hat{\alpha}^W_{gt} = \widehat{ATT}^{G/BJS}(g,t)\)` Intuition: similar to equivalence between Oaxaca-Blinder decompositions and regression adjustment (i.e., including interactions is equivalent to estimating separate models by group). --- # Comments The above discussion emphasizes the similarities between different proposed alternatives to TWFE regressions in the literature. -- The differences also seem to be mainly driven by different implementation choices. Examples: * It's possible to come up with an imputation estimator that uses the base period right before treatment only `\(\implies\)` `\(\uparrow\)` robustness, `\(\downarrow\)` efficiency * It's also possible to do a version of CS with more base periods `\(\implies\)` `\(\uparrow\)` efficiency `\(\downarrow\)` robustness * Build-the-trend (i.e., path relative to average pre-treatment outcome) and GMM, Callaway (2023) and Marcus and Sant'Anna (2021). --- class: inverse, middle, center count: false # Empirical Example: Minimum Wages and Employment --- # Example: Minimum Wage - Use county-level data from 2003-2007 during a period where the federal minimum wage was flat -- - Exploit minimum wage changes across states - Any state that increases their minimum wage above the federal minimum wage will be considered as treated -- - Interested in the effect of the minimum wage on teen employment -- - We'll also make a number of simplifications: * not worry much about issues like clustered standard errors * not worry about variation in the amount of the minimum wage change (or whether it keeps changing) across states -- <span class="alert">Goal: </span> How much do the issues that we have been talking about matter in practice? 
--- # Code Full code is available on my website: [https://bcallaway11.github.io/files/presentations/NEXT-D](https://bcallaway11.github.io/files/presentations/NEXT-D) or link is on my homepage [brantlycallaway.com](https://www.brantlycallaway.com) <span class="alert">R packages used in empirical example</span> ```r library(did) library(BMisc) library(twfeweights) library(fixest) library(modelsummary) library(ggplot2) load(url("https://github.com/bcallaway11/did_chapter/raw/master/mw_data_ch2.RData")) ``` --- # Setup Data ```r # drops NE region and a couple of small groups mw_data_ch2 <- subset(mw_data_ch2, (G %in% c(2004,2006,2007,0)) & (region != "1")) head(mw_data_ch2[,c("id","year","G","lemp","lpop","region")]) ``` ``` ## id year G lemp lpop region ## 554 8003 2001 2007 5.556828 9.614137 4 ## 555 8003 2002 2007 5.356586 9.623972 4 ## 556 8003 2003 2007 5.389072 9.620859 4 ## 557 8003 2004 2007 5.356586 9.626548 4 ## 558 8003 2005 2007 5.303305 9.637958 4 ## 559 8003 2006 2007 5.342334 9.633056 4 ``` ```r # drop 2007 as these are right before fed. 
minimum wage change data2 <- subset(mw_data_ch2, G!=2007 & year >= 2003) # keep 2007 => larger sample size data3 <- subset(mw_data_ch2, year >= 2003) ``` --- # TWFE Regression ```r twfe_res2 <- fixest::feols(lemp ~ post | id + year, data=data2, cluster="id") modelsummary(list(twfe_res2), gof_omit=".*") ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Model 1 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> post </td> <td style="text-align:center;"> −0.038 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.008) </td> </tr> </tbody> </table> --- # `\(ATT(g,t)\)` (Callaway and Sant'Anna) ```r attgt <- did::att_gt(yname="lemp", idname="id", gname="G", tname="year", data=data2, control_group="nevertreated", base_period="universal") tidy(attgt)[,1:5] # print results, drop some extra columns ``` ``` ## term group time estimate std.error ## 1 ATT(2004,2003) 2004 2003 0.00000000 NA ## 2 ATT(2004,2004) 2004 2004 -0.03266653 0.020884500 ## 3 ATT(2004,2005) 2004 2005 -0.06827991 0.020712351 ## 4 ATT(2004,2006) 2004 2006 -0.12335404 0.020682602 ## 5 ATT(2004,2007) 2004 2007 -0.13109136 0.022523279 ## 6 ATT(2006,2003) 2006 2003 -0.03408910 0.011617027 ## 7 ATT(2006,2004) 2006 2004 -0.01669977 0.007396980 ## 8 ATT(2006,2005) 2006 2005 0.00000000 NA ## 9 ATT(2006,2006) 2006 2006 -0.01939335 0.009217105 ## 10 ATT(2006,2007) 2006 2007 -0.06607568 0.009311762 ``` --- # Plot `\(ATT(g,t)\)`'s <img src="modern_did_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" /> --- # Compute `\(ATT^O\)` ```r attO <- did::aggte(attgt, type="group") summary(attO) ``` ``` ## ## Call: ## did::aggte(MP = attgt, type = "group") ## ## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 
200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> ## ## ## Overall summary of ATT's based on group/cohort aggregation: ## ATT Std. Error [ 95% Conf. Int.] ## -0.0571 0.008 -0.0728 -0.0414 * ## ## ## Group Effects: ## Group Estimate Std. Error [95% Simult. Conf. Band] ## 2004 -0.0888 0.0203 -0.1316 -0.0461 * ## 2006 -0.0427 0.0076 -0.0587 -0.0268 * ## --- ## Signif. codes: `*' confidence band does not cover 0 ## ## Control Group: Never Treated, Anticipation Periods: 0 ## Estimation Method: Doubly Robust ``` --- # Comments The differences between the CS estimates and the TWFE estimates are fairly large here: the CS estimate is about 50% larger than the TWFE estimate, though results are qualitatively similar. -- <span class="alert">Let's see if we can figure out what's going on...</span> --- # de Chaisemartin and d'Haultfoeuille weights <img src="modern_did_files/figure-html/unnamed-chunk-9-1.png" style="display: block; margin: auto;" /> --- # `\(ATT^O\)` weights <img src="modern_did_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- # Weight Comparison <img src="modern_did_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> --- # Discussion To summarize: `\(ATT^O = -0.057\)` while `\(\alpha^{TWFE} = -0.038\)`. This difference can be fully accounted for by: * Pre-treatment differences in paths of outcomes across groups: explains about 64% of the difference * Differences in weights applied to the same post-treatment `\(ATT(g,t)\)`: explains about 36% of the difference. [If you apply the post-treatment weights and "zero out" pre-treatment differences, the estimate would be `\(-0.050\)`.] -- In my experience: this is fairly representative of how much new DID approaches matter relative to TWFE regressions. 
It does not seem like "catastrophic failure" of TWFE, but (in my view) these are meaningful differences (and, e.g., given slightly different `\(ATT(g,t)\)`'s, the difference in the weighting schemes could change the qualitative results). * Of course, this whole discussion hinges crucially on how much treatment effect heterogeneity there is. More TE Het `\(\implies\)` more sensitivity to weighting schemes [just looking at TWFE regression does not give insight into how much TE Het there is.] --- # Additional Comments One more comment: there is a lot of concern about negative weights (both in econometrics and empirical work). * There were no negative weights in the example above, but the weights still weren't great. * No negative weights does rule out "sign reversal" * But, in my view, the more important issue is the non-transparent weighting scheme. * Ex. If you try using `data3` (the data that includes `\(G=2007\)`), you will get a negative weight on `\(ATT(g=2004,t=2007)\)`. But it turns out not to matter much, and TWFE works better in this case than in the case that I showed you. --- name: bonus # Bonus Material <br> <br> <br> [Bonus Material 1: Including Covariates in the Parallel Trends Assumption](#covs) <br> [Bonus Material 2: Dealing with Violations of Parallel Trends](#violations) --- # Conclusion That's all! Thank you very much for inviting me. Email: [brantly.callaway@uga.edu](mailto:brantly.callaway@uga.edu) --- name: covs count: false # Covariates in the Parallel Trends Assumption ## Conditional Parallel Trends Assumption For all time periods, `$$\E[\Delta Y_t(0) | X_t, X_{t-1},Z,D=1] = \E[\Delta Y_t(0) | X_t, X_{t-1},Z,D=0]$$` -- In words: Parallel trends holds conditional on having the same covariates `\(X\)`. 
-- <br> Minimum wage example: path of teen employment may depend on a state's population / population growth / region of the country --- count: false # Limitations of TWFE Regressions In this setting, it is common to run the following TWFE regression: `$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + X_{it}'\beta + v_{it}$$` -- However: * Issues related to multiple periods and variation in treatment timing still arise -- * It's hard to allow for the path of untreated potential outcomes to depend on time-invariant covariates -- * Mixes identification and estimation...e.g., with 2 periods `\begin{align*} \Delta Y_{it} = \Delta \theta_t + \alpha D_{it} + \Delta X_{it}'\beta + \Delta v_{it} \end{align*}` `\(\implies\)` differencing out unit fixed effects can have implications about what researcher controls for * This doesn't matter if model is truly linear * However, if we think of linear model as an approximation, this may have meaningful implications. * See Caetano and Callaway (2023) for more details --- count: false # Identification / Estimation Can show that (under conditional PTA): `$$ATT = \E[\Delta Y_t | D=1] - \E\Big[ \E[\Delta Y_t | X, D=0] \Big| D=1\Big]$$` -- Intuition: (i) Compare path of outcomes for treated group to (conditional on covariates) path of outcomes for untreated group, (ii) adjust for differences in the distribution of covariates between groups. -- This expression suggests a "regression adjustment" estimator. -- It is easy to extend these arguments to multiple periods and variation in treatment timing --- count: false # Doubly Robust Alternatively, you can show `$$ATT=\E\left[ \left( \frac{D}{p} - \frac{p(X)(1-D)}{(1-p(X))p} \right)(\Delta Y_t - \E[\Delta Y_t | X, D=0]) \right]$$` -- This requires estimating both `\(p(X)\)` and `\(\E[\Delta Y_{t^*}|X,D=0]\)`. -- Big advantage: - This expression for `\(ATT\)` is *doubly robust*. 
This means that it will deliver consistent estimates of `\(ATT\)` if <span class="alert">either</span> the model for `\(p(X)\)` or the model for `\(\E[\Delta Y_{t^*}|X,D=0]\)` is correctly specified. -- - In my experience, doubly robust estimators perform much better than either the regression or propensity score weighting estimators -- - This also provides a connection to estimating `\(ATT\)` under conditional parallel trends using machine learning for `\(p(X)\)` and `\(\E[\Delta Y_{t^*}|X,D=0]\)` (see: Chang (2020) and Callaway, Drukker, Liu, and Sant'Anna (2023)) --- count: false # Back to Minimum Wage Example We'll allow for path of outcomes to depend on region of the country ```r # run TWFE regression twfe_x <- fixest::feols(lemp ~ post | id + region^year, data=data2) modelsummary(twfe_x, gof_omit=".*") ``` <table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:center;"> Model 1 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> post </td> <td style="text-align:center;"> 0.001 </td> </tr> <tr> <td style="text-align:left;"> </td> <td style="text-align:center;"> (0.008) </td> </tr> </tbody> </table> Relative to previous results, this is much smaller and statistically insignificant and is similar to the result in Dube et al. (2010). --- count: false # Use Doubly Robust Approach from CS ``` ## ## Call: ## aggte(MP = cs_x, type = "group") ## ## Reference: Callaway, Brantly and Pedro H.C. Sant'Anna. "Difference-in-Differences with Multiple Time Periods." Journal of Econometrics, Vol. 225, No. 2, pp. 200-230, 2021. <https://doi.org/10.1016/j.jeconom.2020.12.001>, <https://arxiv.org/abs/1803.09015> ## ## ## Overall summary of ATT's based on group/cohort aggregation: ## ATT Std. Error [ 95% Conf. Int.] ## -0.0273 0.0085 -0.0438 -0.0107 * ## ## ## Group Effects: ## Group Estimate Std. Error [95% Simult. Conf. 
Band] ## 2004 -0.0436 0.0204 -0.0892 0.0019 ## 2006 -0.0199 0.0079 -0.0376 -0.0022 * ## --- ## Signif. codes: `*' confidence band does not cover 0 ## ## Control Group: Never Treated, Anticipation Periods: 0 ## Estimation Method: Doubly Robust ``` --- count: false # Comments Even more than in the previous case, the results in this case are notably different depending on the estimation strategy. <br> [back](#bonus) --- name: violations count: false # What about violations of parallel trends? Parallel trends assumptions don't automatically hold in applications with repeated observations over time. -- The most natural way to motivate parallel trends is with a linear model for untreated potential outcomes: `\begin{align*} Y_{it}(0) = \theta_t + \eta_i + v_{it} \end{align*}` where the key feature is the additive separability of `\(\eta_i\)` -- But it's not always clear if additive separability (and hence parallel trends) is reasonable * The most common "response" is pre-testing...checking if parallel trends holds in pre-treatment periods -- DID + pre-tests are a very powerful/useful approach to "validating" the parallel trends assumption --- count: false # What about our case? <img src="modern_did_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- count: false # Partial Identification / Sensitivity Analysis References: Manski and Pepper (2018), Rambachan and Roth (2021) -- Two versions of sensitivity analysis in RR: * Violations of parallel trends evolve smoothly * Violations of parallel trends are "not too different" in post-treatment periods from the violations in pre-treatment periods - Will show results for this case - Allow for violations of parallel trends up to `\(\bar{M}\)` times as large as were observed in any pre-treatment period. - And we'll vary `\(\bar{M}\)`. --- count: false # What about violations of parallel trends? 
<img src="modern_did_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /> --- count: false [back](#bonus)