class: center, middle, inverse, title-slide # Modern Approaches to Difference in Differences ### Brantly Callaway, University of Georgia ### October 22, 2021
Session 1: Introduction to DID --- # Basic Setup `$$\newcommand{\E}{\mathbb{E}}$$` <style type="text/css"> border-top: 80px solid #BA0C2F; .inverse { background-color: #BA0C2F; } .alert { font-weight:bold; color: red; } .alert-blue { font-weight: bold; color: blue; } .remark-slide-content { font-size: 23px; padding: 1em 4em 1em 4em; } .highlight-red { background-color:red; padding:0.1em 0.2em; } </style> We are interested in understanding the <span class="alert-blue">causal effect</span> of some <span class="alert">treatment</span> on some outcome of interest. -- - This could be some economic policy (e.g., state implements some policy) -- - This could be participating in some program (e.g., individual participates in job training program) -- - For today, we'll focus on the case where the treatment is binary (i.e., individual either participates or not) --- # Possible Approaches There are many possible approaches to thinking about causal effects (experiments, comparing treated and untreated units with the same covariates (e.g., matching/regression), IV, regression discontinuity, etc.) -- Today we'll think about: -- - (broadly) how to use repeated observations over time to identify causal effect parameters -- - (specifically) mostly <span class="alert-blue">difference in differences</span> where we'll compare: - The change in outcomes over time for units that participate in the treatment to - The change in outcomes over time for units that didn't participate in the treatment --- # DID Figure <center><img src="data:image/png;base64,#DID.png"/></center> --- # Before we get going... 
Compared to other causal inference approaches, DID has been around for a very long time -- - Dates back to at least John Snow and the cholera epidemic in 1850s London - If you're interested, more details: [https://www.ph.ucla.edu/epi/snow/grand_experiment.html](https://www.ph.ucla.edu/epi/snow/grand_experiment.html) -- - Re-introduced/popularized in economics by Ashenfelter (1978), Card (1990), Card and Krueger (1994), among others --- # Before we get going... DID is also really popular among applied micro papers (figure from Currie et al. (2020)) <center><img src="data:image/png;base64,#DID_over_time.png" width=650/></center> --- # Setup <span class="alert">Potential Outcomes Notation:</span> -- - Two time periods: `\(t^*\)` and `\(t^*-1\)` - No one is treated until time period `\(t^*\)`, and some units remain untreated in period `\(t^*\)` -- - `\(D_i\)` - treatment indicator (equal to 1 if unit `\(i\)` is treated in period `\(t^*\)`) -- - Potential outcomes in each time period - `\(Y_{it}(1)\)` - treated potential outcome - `\(Y_{it}(0)\)` - untreated potential outcome -- - Observed outcomes `$$Y_{it^*} = D_i Y_{it^*}(1) + (1-D_i) Y_{it^*}(0) \quad \textrm{and} \quad Y_{it^*-1} = Y_{it^*-1}(0)$$` --- # Individual-level Treatment Effects In this framework, notice that the effect of participating in the treatment for individual `\(i\)` in time period `\(t^*\)` is `$$TE_i = Y_{it^*}(1) - Y_{it^*}(0)$$` -- Most work in economics "gives up" on trying to recover individual-level treatment effects -- - If you're interested: check Heckman, Smith, and Clements (1997), Fan and Park (2010), or Callaway (2020) --- # Average Treatment Effects Instead, most of the literature focuses on the <span class="alert-blue">Average Treatment Effect (ATE)</span> or <span class="alert">Average Treatment Effect on the Treated (ATT)</span> `$$ATT=\E[Y_{t^*}(1) - Y_{t^*}(0) | D=1]$$` -- <span class="alert">Interpretation:</span> The average difference between treated and untreated potential outcomes among those that participated in the treatment. 
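---

# Average Treatment Effects: A Simulated Sketch

A minimal sketch of the notation above, assuming Python with `numpy` (not part of the original slides). All numbers here (sample size, share treated, effect sizes) are made up for illustration. In a simulation we get to see both potential outcomes, so `\(TE_i\)`, the `\(ATE\)`, and the `\(ATT\)` can be computed directly; in real data only the observed outcomes are available.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000  # hypothetical sample size

# D_i: treatment indicator (about half of units treated in period t*)
D = rng.binomial(1, 0.5, size=n)

# Untreated potential outcomes in periods t*-1 and t*
Y0_pre = rng.normal(size=n)
Y0_post = Y0_pre + 1.0 + rng.normal(size=n)

# Individual-level treatment effects TE_i; by construction (an assumption of
# this sketch) treated units gain more on average than untreated units would
TE = 2.0 + 1.0 * D + rng.normal(size=n)

# Treated potential outcome in period t*
Y1_post = Y0_post + TE

# Observed outcomes: no one is treated in t*-1; in t* we see Y(1) if D=1, else Y(0)
Y_pre = Y0_pre
Y_post = D * Y1_post + (1 - D) * Y0_post

ATE = TE.mean()          # average over everyone
ATT = TE[D == 1].mean()  # average over units that participated

print(f"ATE = {ATE:.3f}, ATT = {ATT:.3f}")
```

Because treatment effects differ across groups here, the `\(ATE\)` and `\(ATT\)` are different parameters. Neither could be computed this way in practice, since `Y0_post` is never observed for treated units.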
--- # Identification Challenge Let's break apart the expression for `\(ATT\)`: `$$ATT=\underbrace{\E[Y_{t^*}(1) | D=1]}_{\textrm{identified}} - \underbrace{\E[Y_{t^*}(0)|D=1]}_{\textrm{requires assumption}}$$` -- In other words, in order to recover the `\(ATT\)`, we <span class="alert-blue">need to be able to figure out the average untreated potential outcome among those that participated in the treatment</span> -- There are a variety of ways to figure this out (e.g., random treatment assignment), but we'll talk difference in differences --- # Parallel Trends Assumption ## Parallel Trends Assumption `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` -- <br><br> Recovering `\(ATT\)` under parallel trends: $$ `\begin{aligned} ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt} \end{aligned}` $$ --- count:false # Parallel Trends Assumption ## Parallel Trends Assumption `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` <br><br> Recovering `\(ATT\)` under parallel trends: $$ `\begin{aligned} ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\ &= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1] \end{aligned}` $$ --- count:false # Parallel Trends Assumption ## Parallel Trends Assumption `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` <br><br> Recovering `\(ATT\)` under parallel trends: $$ `\begin{aligned} ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\ &= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=1]\\ &= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=0] \end{aligned}` $$ --- count:false # Parallel Trends Assumption ## Parallel Trends Assumption `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` <br><br> Recovering `\(ATT\)` under parallel trends: $$ `\begin{aligned} ATT &= \E[Y_{t^*}(1) | D=1] - \E[Y_{t^*}(0) | D=1] \hspace{150pt}\\ &= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - 
Y_{t^*-1}(0) | D=1]\\ &= \E[Y_{t^*}(1) - Y_{t^*-1}(0) | D=1] - \E[Y_{t^*}(0) - Y_{t^*-1}(0) | D=0]\\ &= \E[\Delta Y_{t^*} | D=1] - \E[\Delta Y_{t^*} | D=0] \end{aligned}` $$ -- which is where difference in differences gets its name --- # Where does the PTA come from? - <span class="alert-blue">View \#1:</span> Purely reduced form assumption, assess validity in pre-treatment periods -- - <span class="alert">View \#2:</span> Models that lead to the parallel trends assumption (see: Blundell and Costa Dias (2009) and Gardner (2021)) -- `$$Y_{it}(0) = \theta_t + \eta_i + v_{it}$$` -- where - `\(\theta_t\)` is a time fixed effect - `\(\eta_i\)` is an individual fixed effect (importantly: can follow different distributions among the treated and untreated groups) - `\(v_{it}\)` are idiosyncratic, time varying unobservables --- # Models for PTA (cont'd) In this case, notice that `$$\Delta Y_{it^*}(0) = (\theta_{t^*} - \theta_{t^*-1}) + (v_{it^*} - v_{it^*-1})$$` -- so that, as long as `\(\E[\Delta v_{t^*} | D=1] = \E[\Delta v_{t^*} | D=0]\)` (no selection on the time varying unobservables), `$$\E[\Delta Y_{t^*}(0) | D=1] = \E[\Delta Y_{t^*}(0) | D=0]$$` --- # Models for PTA (cont'd) (As far as I know), this is the only model that rationalizes parallel trends/DID -- <span class="alert-blue">Nice properties:</span> 1. Allows for time trends in (untreated potential) outcomes -- 2. Allows for unobserved heterogeneity (individual fixed effects) that can have different distributions between the treated group and untreated group -- 3. No restrictions on how treated potential outcomes are generated at all -- 4. No restrictions on treatment effect heterogeneity (across individuals or across time/exposure to the treatment) -- 5. Individuals can "select" into the treatment on the basis of (i) treated potential outcomes and (ii) unobserved heterogeneity `\(\eta_i\)`...just not time varying unobservables `\(v_{it}\)` --- # Models for PTA (cont'd) (As far as I know), this is the only model that rationalizes parallel trends/DID -- <span class="alert">Drawbacks:</span> 1. 
Relies heavily on linearity/additive separability - Not implied by most economic models - What about the case with binary (or otherwise "limited") dependent variables / nonlinear models? --- # Estimation Given the above discussion, estimation of the `\(ATT\)` is very easy. -- $$ `\begin{aligned} \widehat{ATT} &= \hat{\E}[\Delta Y_{t^*} | D=1] - \hat{\E}[\Delta Y_{t^*}|D=0] \hspace{150pt} \end{aligned}` $$ --- count:false # Estimation Given the above discussion, estimation of the `\(ATT\)` is very easy. $$ `\begin{aligned} \widehat{ATT} &= \hat{\E}[\Delta Y_{t^*} | D=1] - \hat{\E}[\Delta Y_{t^*}|D=0] \hspace{150pt}\\ &= \frac{1}{n} \sum_{i=1}^n \frac{D_i}{\hat{p}} \Delta Y_{it^*} - \frac{1}{n} \sum_{i=1}^n \frac{(1-D_i)}{(1-\hat{p})} \Delta Y_{it^*} \end{aligned}` $$ -- Or, even more easily, run the following <span class="alert">two-way fixed effects regression (TWFE):</span> `$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$` where `\(D_{it}\)` indicates whether unit `\(i\)` is treated in period `\(t\)` (here, `\(D_{it^*} = D_i\)` and `\(D_{it^*-1} = 0\)`) -- <span class="alert-blue">Pros:</span> Economists know a lot about this sort of regression and you can just read off standard errors, etc. --- # TWFE Regression `$$Y_{it} = \theta_t + \eta_i + \alpha D_{it} + v_{it}$$` <span class="alert-blue">Some more things to point out:</span> -- - Even though it looks like this model has restricted the effect of participating in the treatment to be constant (and equal to `\(\alpha\)`) across all individuals, TWFE (in this case) is actually <span class="alert">robust</span> to treatment effect heterogeneity. In fact, you can (easily) show that, in this case, `\(\alpha = ATT\)`. 
-- - It's easy to make the TWFE regression more complicated: - Multiple time periods - Variation in treatment timing - More complicated treatments - Introducing additional covariates --- # TWFE Regression - Unfortunately, the robustness of TWFE regressions to treatment effect heterogeneity does not seem to carry over to these more complicated (and empirically relevant) settings -- - Much of the recent (mostly negative) literature on TWFE in the context of DID has considered these types of "realistic" settings -- - We will start to cover these sorts of issues in the next section!
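---

# Estimation: A Simulated Sketch

A minimal sketch of the two-period case, assuming Python with `numpy` (not part of the original slides). Data are simulated from `\(Y_{it}(0) = \theta_t + \eta_i + v_{it}\)` with made-up parameter values, a fixed effect whose distribution differs by group, and heterogeneous treatment effects. It compares the difference in mean changes, the weighted version with `\(\hat{p}\)`, and the TWFE coefficient (computed here by double demeaning, which is valid for a balanced panel).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # hypothetical number of units

# D_i: treatment indicator (about 40% treated; an arbitrary choice)
D = rng.binomial(1, 0.4, size=n)

# Untreated potential outcomes: Y_it(0) = theta_t + eta_i + v_it
theta = np.array([1.0, 2.0])               # time effects for t*-1 and t*
eta = rng.normal(loc=1.5 * D, scale=1.0)   # fixed effect, shifted for treated units
v = rng.normal(size=(n, 2))                # time varying shocks, independent of D
Y0 = theta[None, :] + eta[:, None] + v

# Heterogeneous treatment effects that depend on eta_i
te = 1.0 + 0.5 * eta + rng.normal(scale=0.5, size=n)

# Observed outcomes: treated units receive te only in period t*
Y = Y0.copy()
Y[:, 1] += D * te

att_true = te[D == 1].mean()  # known only because this is a simulation

# (1) Difference in mean changes
dY = Y[:, 1] - Y[:, 0]
did_means = dY[D == 1].mean() - dY[D == 0].mean()

# (2) Weighted version, with p_hat the share of treated units
p_hat = D.mean()
did_weights = np.mean(D / p_hat * dY) - np.mean((1 - D) / (1 - p_hat) * dY)

# (3) TWFE: D_it = 0 in t*-1 and D_it = D_i in t*; remove unit and time means
D_it = np.column_stack([np.zeros(n), D]).astype(float)

def double_demean(X):
    """Subtract unit means and period means, add back the grand mean."""
    return X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + X.mean()

Y_dd, D_dd = double_demean(Y), double_demean(D_it)
alpha_twfe = (D_dd * Y_dd).sum() / (D_dd ** 2).sum()

print(f"true ATT (sample)     = {att_true:.3f}")
print(f"DID, mean changes     = {did_means:.3f}")
print(f"DID, weighted version = {did_weights:.3f}")
print(f"TWFE alpha            = {alpha_twfe:.3f}")
```

The three estimates agree up to floating point error. They differ from the sample `\(ATT\)` only by sampling noise, even though treatment effects are heterogeneous and `\(\eta_i\)` is distributed differently across groups.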