18 Causal Inference with Observational Data
Unfortunately, we rarely have access to experimental data in applications, or the ability to run experiments to evaluate the causal effects of the programs/policies that we’d be interested in studying. For example, it’s hard to imagine convincing a large number of countries to randomly assign their interest rate, minimum wage, or immigration policies, though these are all policies whose causal effects economists would be interested in. It is also fairly uncommon to have access to the sort of natural experiments that we discussed in the previous chapter. Instead, we often have to make do with observational data: data where the treatment is not randomly assigned. In this chapter, we’ll discuss some common approaches to causal inference with observational data.
18.1 Unconfoundedness
SW 6.8, SW Ch. 9
We will start with what is probably the most common approach: unconfoundedness.
Unconfoundedness Assumption: \[\begin{align*} (Y(1),Y(0)) \independent D | X \end{align*}\] You can think of this as saying that, among individuals with the same covariates \(X\), they have the same distributions of potential outcomes regardless of whether or not they participate in the treatment. Note that the distribution of \(X\) is still allowed to be different between the treated and untreated groups. In other words, after you condition on covariates, there is nothing special (in terms of the distributions of potential outcomes) about the group that participates in the treatment relative to the group that doesn’t participate in the treatment.
- This is potentially a strong assumption. In order to believe this assumption, you need to believe that untreated individuals with the same characteristics can deliver, on average, the outcome that individuals in the treated group would have experienced if they had not participated in the treatment. In math, you can write this as \[\begin{align*} \E[Y(0) | X, D=1] = \E[Y(0) | X, D=0] \end{align*}\]
If you are willing to believe this assumption, then you can recover the \(ATT\). In particular, notice that \[\begin{align*} ATT &= \E[Y(1) - Y(0) \mid D=1] \\ &= \E\Big[\E[Y(1) - Y(0) \mid X, D=1] \Bigm| D=1\Big] \\ &= \E\Big[\E[Y(1) \mid X, D=1] - \E[Y(0) \mid X, D=1] \Bigm| D=1\Big] \\ &= \E\Big[\E[Y(1) \mid X, D=1] - \E[Y(0) \mid X, D=0] \Bigm| D=1\Big] \\ &= \E\Big[\E[Y \mid X, D=1] - \E[Y \mid X, D=0] \Bigm| D=1\Big] \end{align*}\] where the first equality is just the law of iterated expectations, the second equality uses the linearity of expectation, the third equality uses unconfoundedness—this is the key step, and the last equality uses the definition of observed outcomes. This final expression only involves observed data, so we can estimate it from data.
The previous expression is quite intuitive. It suggests: (1) finding treated and untreated units that have the same covariates \(X\) and comparing their outcomes, (2) then averaging these differences over the distribution of \(X\) for treated units.
18.1.1 Estimation - Regression Adjustment
Next, let’s discuss how to estimate the \(ATT\) given the previous expression. I wrote down what I think is the most intuitive expression for the \(ATT\) under unconfoundedness above, but we can simplify it in a way that will make estimation easier. In particular, notice that \[\begin{align*} ATT &= \E\Big[\E[Y \mid X, D=1] - \E[Y \mid X, D=0] \Bigm| D=1\Big] \\ &= \E\Big[\E[Y \mid X, D=1] \mid D=1\Big] - \E\Big[\E[Y \mid X, D=0] \mid D=1\Big] \\ &= \E[Y \mid D=1] - \E\Big[\E[Y \mid X, D=0] \mid D=1\Big] \end{align*}\] so that the only complicated part to estimate is \(\E[Y \mid X, D=0]\). We will proceed by using the analogy principle—estimate \(\E[Y \mid X, D=0]\), then average the predicted values.
Step 1: Regression using Untreated Units
- Estimate a regression of \(Y\) on \(X\) using only untreated units (i.e., those with \(D=0\)), and recover predicted values from this regression, which we will denote \(\hat{Y}_i^{(0)}\) for each unit \(i\).
Step 2: Average Predicted Values for Treated Units
Our estimator for \(ATT\) will be
\[\begin{align*} \widehat{ATT} &= \bar{Y}_{D=1} - \frac{1}{n_1} \sum_{i=1}^n D_i \hat{Y}_i^{(0)} \end{align*}\] where \(\bar{Y}_{D=1}\) is the average outcome for treated units, \(n_1\) is the number of treated units, and the sum is over all units but only includes predicted values for treated units (because of the \(D_i\) term).
Because we estimate a first-step regression and then average, the approach discussed above is called regression adjustment.
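To make these two steps concrete, here is a minimal sketch of regression adjustment implemented by hand, assuming a hypothetical data frame df with outcome Y, binary treatment D, and covariates X1, X2, and X3:

# Step 1: regression of Y on covariates using only the untreated units
untreated_reg <- lm(Y ~ X1 + X2 + X3, data = subset(df, D == 0))
# predicted untreated outcomes for the treated units
y0hat_treated <- predict(untreated_reg, newdata = subset(df, D == 1))
# Step 2: average outcome among treated units minus average predicted value
att_hat <- mean(df$Y[df$D == 1]) - mean(y0hat_treated)
att_hat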
My favorite package for implementing regression adjustment is the DRDID package. This package is built for panel data, but we can “trick” it into doing cross-sectional regression adjustment. The code below shows how to do this. Suppose that we have a data frame df with outcome variable Y, treatment variable D, and covariates X1, X2, and X3. The function DRDID::reg_did_panel expects outcomes for two periods, period 1 and period 0; our trick is to set the period-0 outcomes to zero for everyone (i.e., y0 = 0), so that the function computes the regression adjustment estimator described above. Note that you can include any covariates you want in the regression by changing the formula in model.matrix.
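library(DRDID)
y <- df$Y
D <- df$D
X <- model.matrix(~ X1 + X2 + X3, data = df) # create matrix of covariates
ra_att <- DRDID::reg_did_panel(y1 = y, y0 = 0, D = D, covariates = X)
ra_att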
18.1.2 Estimation - Regression under Treatment Effect Homogeneity
Although I prefer the regression adjustment approach above, it is more common to try to estimate the causal effect of \(D\) using a single regression.
We will make two additional assumptions here. First, we will make the treatment effect homogeneity assumption that we have discussed before: \(Y_i(1) - Y_i(0) = \alpha\) for all \(i\). Second, we will assume a linear model for untreated potential outcomes: \[\begin{align*} Y(0) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + U \end{align*}\] where, for simplicity, I have assumed that there are only two covariates, \(X_1\) and \(X_2\). Next, notice that unconfoundedness implies that \(\E[U|X_1,X_2,D] = 0\) (the conditioning on \(D\) is the unconfoundedness part). Now, recalling the definition of the observed outcome, we can write \[\begin{align*} Y_i &= D_i Y_i(1) + (1-D_i) Y_i(0) \\ &= D_i (Y_i(1) - Y_i(0)) + Y_i(0) \\ &= \alpha D_i + \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + U_i \end{align*}\] which suggests running the regression of observed \(Y\) on \(X_1, X_2,\) and \(D\) and interpreting the estimate of \(\alpha\) as the causal effect of participating in the treatment. In practice, running this regression is very similar to what we have done before, so the mechanics are not hard; convincing someone (or even yourself) that unconfoundedness holds will be the bigger issue here.
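As a sketch, with a hypothetical data frame df containing Y, D, X1, and X2, this amounts to a single call to lm:

# single regression; the coefficient on D is the estimate of alpha
homog_reg <- lm(Y ~ D + X1 + X2, data = df)
summary(homog_reg)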
As a final comment, the assumption of treatment effect homogeneity is not quite so innocuous here. It turns out that you can show that, in the presence of treatment effect heterogeneity, \(\alpha\) will be equal to a weighted average of individual treatment effects, but the weights can sometimes be “strange”. There are methods that are robust to treatment effect heterogeneity (they are beyond the scope of the current class, but they are not “way” more difficult than what we are doing here). That said, in my experience, the regression estimators (under treatment effect homogeneity) tend to deliver similar estimates to alternative estimators that are robust to treatment effect heterogeneity at least in the setup considered in this section.
18.2 Panel Data Approaches
SW All of Ch. 10 and 13.4
In this section, we’ll consider the case where a researcher has access to a different type of data called panel data. Panel data is data that follows the same individual (or firm, etc.) over time. In this case, it is often helpful to index variables by time. For example, \(Y_{it}\) is the outcome for individual \(i\) in time period \(t\). \(X_{it}\) is the value of a regressor for individual \(i\) in time period \(t\) and \(D_{it}\) is the value of the treatment for individual \(i\) in time period \(t\). If some variable doesn’t vary over time (e.g., a regressor like race), we won’t use a \(t\) subscript.
18.2.1 Setup
For this section, we will consider a setting where a researcher observes exactly two periods of panel data, \(t=1\) and \(t=2\). We will also suppose that, in the first period, no units are treated (i.e., \(D_{i1} = 0\) for all \(i\)) while in the second period, some units are treated (i.e., \(D_{i2}\) can be either 0 or 1). In this setting, it will be more convenient to work with the “group” variable, \(G_i\), which indicates whether or not individual \(i\) participates in the treatment in the second period. Slightly updating notation, we will be interested in \(ATT = \E[Y_{t=2}(1) - Y_{t=2}(0) | G=1]\) which is the average treatment effect in period 2 for the treated group. The setup discussed here is sometimes referred to as a “pre-post” design because we have a “pre-treatment” period (period 1) and a “post-treatment” period (period 2).
18.2.2 Using Panel Data to Validate Assumptions
The first major use of panel data is to try to check/validate identification assumptions. Recall that one of the main challenges we faced with unconfoundedness was that it is hard to convince yourself or others that it holds. Panel data can help with this, as we can effectively check whether unconfoundedness holds in the pre-treatment period.
The main implication of unconfoundedness that we used above was that we could learn about untreated potential outcomes for the treated group by looking at untreated units with the same covariates. In the pre-treatment period, no one is treated, so we can actually check whether this implication holds in the pre-treatment period. That is, \[\begin{align*} \E[Y_{t=1} | G=1] \stackrel{?}{=} \E\Big[ \E[Y_{t=1} | X, G=0] \Bigm| G=1 \Big] \end{align*}\] If unconfoundedness holds in the first period, then these should be equal, and if they are unequal, that is evidence against unconfoundedness holding. If you squint at this, you can see that \(\E[Y_{t=1} | G=1] - \E\Big[ \E[Y_{t=1} | X, G=0] \Bigm| G=1 \Big]\) is exactly the estimand that we would use for the \(ATT\) if the treatment had been implemented in period 1. Thus, we can basically implement our estimator in the pre-treatment period and check whether we get something close to zero (suggesting that unconfoundedness held in period 1) or not (implying that unconfoundedness is violated in period 1). In the first case, it seems reasonable to hope that unconfoundedness might hold in period 2, while in the second case, we have a strong piece of evidence against unconfoundedness holding in period 2. Because we implement the same estimator in the pre-treatment period as we would in the post-treatment period, this approach is sometimes called a placebo test.
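As a rough sketch, the placebo test just reuses the same estimator with the period-1 outcome in place of the period-2 outcome; here df is a hypothetical data frame with pre-treatment outcome Y1, group indicator G, and covariates X1 and X2:

# placebo: estimate the "ATT" in the pre-treatment period, where no one is
# treated yet; an estimate near zero is consistent with unconfoundedness
covs <- model.matrix(~ X1 + X2, data = df)
placebo_att <- DRDID::reg_did_panel(y1 = df$Y1, y0 = 0, D = df$G, covariates = covs)
placebo_att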
In my view, being able to implement this sort of placebo test is the most important feature of using panel data for causal inference. That said, I should mention that this does not guarantee that we will get reliable estimates in post-treatment periods. We really need unconfoundedness to hold in the post-treatment period, not necessarily in the pre-treatment period. Still, it seems like in most applications, this is about as good evidence as you could have about the plausibility of unconfoundedness.
18.2.3 Using Panel Data to Make Adjustments
Besides validating assumptions, the other main use case for panel data is to adjust for certain things that would typically be unobserved if we did not have panel data. We will consider two versions of this.
18.2.3.1 Lagged Outcome Unconfoundedness
Probably the most straightforward thing that we can do with panel data that is unavailable with cross-sectional data is to assume unconfoundedness holds after conditioning on lagged outcomes. That is, \[\begin{align*} \big(Y_{t=2}(1), Y_{t=2}(0)\big) \independent G | \big(X, Y_{t=1}\big) \end{align*}\] Using exactly the same arguments as for unconfoundedness above, it immediately follows that \[\begin{align*} ATT &= \E\Big[ \E[Y_{t=2} | X, Y_{t=1}, G=1] - \E[Y_{t=2} | X, Y_{t=1}, G=0] \Bigm| G=1 \Big] \end{align*}\] In other words, to recover the \(ATT\), we simply need to find treated and untreated units that have the same characteristics \(X\) and the same pre-treatment outcome \(Y_{t=1}\), compare their outcomes in period 2, and then average all of these differences.
Estimation is exactly the same as for unconfoundedness above; just make sure to include \(Y_{t=1}\) as one of the covariates in the regression adjustment or regression approach.
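For example, continuing with the hypothetical data frame df from above (now also containing the period-2 outcome Y2), the only change is adding Y1 to the covariate matrix:

# lagged outcome unconfoundedness: include the period-1 outcome as a covariate
lo_covs <- model.matrix(~ X1 + X2 + Y1, data = df)
lo_att <- DRDID::reg_did_panel(y1 = df$Y2, y0 = 0, D = df$G, covariates = lo_covs)
lo_att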
18.2.3.2 Difference-in-Differences
While adding lagged outcomes to the unconfoundedness assumption seems very natural, difference-in-differences is an alternative approach to making adjustments that is more popular among economists.
In the previous section, we invoked the assumption of unconfoundedness in a setup where \(X\) was fully observed. But suppose instead that you thought this alternative version of unconfoundedness held: \[\begin{align*} \big(Y(1),Y(0)\big) \independent D | \big(X,W\big) \end{align*}\] where \(X\) is observed, but \(W\) is not.
Let us also maintain the assumption of a linear model for untreated potential outcomes: \[\begin{align*} Y_t(0) = \beta_0 + \beta_1 X + \beta_2 W + U_t \end{align*}\] Unconfoundedness continues to imply that \(\E[U_t|D] = 0\) (i.e., that the error terms are not systematically different between the groups). Previously, this sort of expression led to a regression adjustment estimator of the \(ATT\), but that approach is now infeasible because \(W\) is not observed. However, notice that, with panel data, we can write this model for both time periods: \[\begin{align*} Y_{t=2}(0) = \beta_0 + \beta_1 X + \beta_2 W + U_{t=2} \\ Y_{t=1}(0) = \beta_0 + \beta_1 X + \beta_2 W + U_{t=1} \end{align*}\] Subtracting the second equation from the first gives \[\begin{align*} \Delta Y(0) = \Delta U \end{align*}\] which implies that \(\E[\Delta Y(0) | D] = 0\) (i.e., untreated potential outcomes do not systematically change over time).
Now, let us use this to recover the \(ATT\): \[\begin{align*} ATT &= \E[Y_{t=2}(1) - Y_{t=2}(0) | D=1] \\ &= \E[Y_{t=2} | D=1] - \Big(\E[Y_{t=2}(0) | D=1] - \E[Y_{t=1}(0) | D=1]\Big) - \E[Y_{t=1}(0) | D=1] \\ &= \E[Y_{t=2} | D=1] - \Big(\underbrace{\E[\Delta Y(0) | D=1]}_{=0}\Big) - \E[Y_{t=1} | D=1] \\ &= \E[Y_{t=2} | D=1] - \E[Y_{t=1} | D=1] \end{align*}\] In other words, in the setup above, the \(ATT\) is just equal to the average post-treatment outcome for the treated group relative to the average pre-treatment outcome for the treated group. This is called a before-after identification strategy, and it is actually quite intuitive: we are effectively comparing each treated unit’s outcome after treatment to its outcome before it participated in the treatment.
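In a sample, the before-after estimator is just the change in the treated group’s average outcome over time; a short sketch with hypothetical outcomes Y1 and Y2 and treatment D in a data frame df:

# before-after: average period-2 outcome minus average period-1 outcome,
# among treated units only
ba_att <- mean(df$Y2[df$D == 1]) - mean(df$Y1[df$D == 1])
ba_att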
In practice, it is common to slightly generalize this approach. In fact, I pulled a little bit of a trick on you earlier. Probably a more appropriate linear model for untreated potential outcomes is: \[\begin{align*} Y_t(0) = \beta_{0,t} + \beta_{1,t} X + \beta_{2,t} W + U_t \end{align*}\] where we allow the intercept and the effects of \(X\) and \(W\) on untreated potential outcomes to vary over time. In this case, we can follow exactly the same steps as above to get \[\begin{align*} Y_{t=2}(0) &= \beta_{0,t=2} + \beta_{1,t=2} X + \beta_{2,t=2} W + U_{t=2} \\ Y_{t=1}(0) &= \beta_{0,t=1} + \beta_{1,t=1} X + \beta_{2,t=1} W + U_{t=1} \end{align*}\] Subtracting the second equation from the first gives \[\begin{align*} \Delta Y(0) = \Delta \beta_0 + \Delta \beta_1 X + \Delta \beta_2 W + \Delta U \end{align*}\] This still presents a bit of a problem, as \(W\) is still unobserved. In particular, we need one more assumption: that the effect of \(W\) is constant over time, i.e., \(\beta_{2,t=2} = \beta_{2,t=1}\), so that \(\Delta \beta_2 = 0\). In this case, we have that \[\begin{align*} \Delta Y(0) = \Delta \beta_0 + \Delta \beta_1 X + \Delta U \end{align*}\] Since \(\E[\Delta U | X, D] = 0\) (by the version of unconfoundedness we have considered in this section), it follows that \[\begin{align*} \E[\Delta Y(0) | X, D=1] = \E[\Delta Y(0) | X, D=0] \end{align*}\] This condition is called the parallel trends assumption because it says that, after conditioning on \(X\), the untreated potential outcomes for the treated and untreated groups would have followed the same trend over time. Under this assumption, we can recover the \(ATT\) as follows: \[\begin{align*} ATT &= \E[Y_{t=2}(1) - Y_{t=2}(0) | D=1] \\ &= \E[Y_{t=2}(1) - Y_{t=1}(0) | D=1] - \E[Y_{t=2}(0) - Y_{t=1}(0) | D=1] \\ &= \E\Big[ \E[Y_{t=2} - Y_{t=1} | X, D=1] - \E[Y_{t=2}(0) - Y_{t=1}(0) | X, D=1] \Bigm| D=1 \Big] \\ &= \E\Big[ \E[Y_{t=2} - Y_{t=1} | X, D=1] - \E[Y_{t=2} - Y_{t=1} | X, D=0] \Bigm| D=1 \Big] \end{align*}\] which follows essentially the same argument as for unconfoundedness, applied after taking differences over time. This suggests the following estimation approach: regression adjustment, but where the outcome variable is \(\Delta Y\) rather than \(Y\). This approach is called difference-in-differences because it involves taking differences over time to eliminate unobserved, time-invariant variables, and then taking differences between treated and untreated groups to recover the causal effect. It is very popular in empirical work in economics.
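As a sketch, the regression adjustment version of difference-in-differences can again use reg_did_panel, now passing the actual period-1 outcome as y0 instead of zero, so that the estimator effectively works with \(\Delta Y\) (hypothetical names as above):

# difference-in-differences via regression adjustment: y0 is the real
# pre-treatment outcome, so the outcome is effectively Delta Y = Y2 - Y1
did_covs <- model.matrix(~ X1 + X2, data = df)
did_att <- DRDID::reg_did_panel(y1 = df$Y2, y0 = df$Y1, D = df$D, covariates = did_covs)
did_att
# single-regression version, under treatment effect homogeneity
did_reg <- lm(I(Y2 - Y1) ~ D + X1 + X2, data = df)
summary(did_reg)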
18.3 Lab 7: Minimum Wage and Employment
For this lab, we will use the njmin data from the causaldata package. This is data that comes from Card and Krueger (1994), which is one of the most well-known empirical papers in all of economics. It is about the causal effect of a minimum wage increase in New Jersey on employment in the fast food industry. It is one of the original difference-in-differences papers and is also, as far as I know, one of the first papers to explicitly use the pre-post research design that we discussed above.
To start with, use the following code to load the data, drop rows with missing data, and drop columns that we won’t use in this lab.
load("data/fastfood.RData")
library(dplyr)
# drop rows with missing data (keep the balanced sample)
fastfood <- subset(fastfood, balanced == 1)
# drop unused columns
fastfood <- select(fastfood, id, state, location, chain_name, ownership, fte_pre, hrsopen_pre, wage_st_pre, fte_post, hrsopen_post, wage_st_post)

1. Use modelsummary::datasummary_balance to report summary statistics for all the variables in the data, separately by state (New Jersey vs. Pennsylvania). What do you notice?

2. Let us start by assuming unconfoundedness conditional on the restaurant chain (chain_name). Estimate the \(ATT\) of the minimum wage on full-time equivalent employees in the post-treatment period using (1) regression adjustment and (2) a single regression. Report your estimates and interpret them.

3. Implement the placebo test for unconfoundedness in the pre-treatment period using both regression adjustment and a single regression. What do you find, and what does it suggest about the plausibility of unconfoundedness in this application?

4. Let us change the identification strategy to lagged outcome unconfoundedness (let's continue to also include chain_name as a covariate). Estimate the \(ATT\) of the minimum wage on full-time equivalent employees in the post-treatment period using (1) regression adjustment and (2) a single regression.

5. Finally, estimate the \(ATT\) of the minimum wage on full-time equivalent employees in the post-treatment period using difference-in-differences, continuing to include chain_name as a covariate, using (1) regression adjustment and (2) a single regression. How do your results compare to the previous two estimates?
18.4 Lab 7: Solution
- Summary statistics by state
library(modelsummary)
datasummary_balance(
~state,
data = fastfood,
fmt = 2
)

| | Pennsylvania (N=75) Mean | Std. Dev. | New Jersey (N=309) Mean | Std. Dev. | Diff. in Means | Std. Error |
|---|---:|---:|---:|---:|---:|---:|
| id | 3776.93 | 1805.12 | 2176.01 | 1209.22 | -1600.92 | 219.50 |
| fte_pre | 23.38 | 12.01 | 20.43 | 9.21 | -2.95 | 1.48 |
| hrsopen_pre | 14.51 | 2.96 | 14.40 | 2.82 | -0.12 | 0.38 |
| wage_st_pre | 4.63 | 0.36 | 4.61 | 0.34 | -0.02 | 0.05 |
| fte_post | 21.10 | 8.38 | 20.90 | 9.38 | -0.20 | 1.10 |
| hrsopen_post | 14.64 | 2.88 | 14.39 | 2.76 | -0.25 | 0.37 |
| wage_st_post | 4.62 | 0.36 | 5.08 | 0.10 | 0.46 | 0.04 |

| | | Pennsylvania N | Pct. | New Jersey N | Pct. |
|---|---|---:|---:|---:|---:|
| location | NJ_Central | 0 | 0.0 | 58 | 18.8 |
| | NJ_North | 0 | 0.0 | 162 | 52.4 |
| | NJ_South | 0 | 0.0 | 89 | 28.8 |
| | PA_Easton | 41 | 54.7 | 0 | 0.0 |
| | PA_PhillyNE | 34 | 45.3 | 0 | 0.0 |
| chain_name | Burger King | 33 | 44.0 | 126 | 40.8 |
| | KFC | 12 | 16.0 | 67 | 21.7 |
| | Roy Rogers | 17 | 22.7 | 77 | 24.9 |
| | Wendys | 13 | 17.3 | 39 | 12.6 |
| ownership | Franchise | 49 | 65.3 | 201 | 65.0 |
| | Company | 26 | 34.7 | 108 | 35.0 |
- Regression adjustment ATT estimate
# regression adjustment
library(DRDID)
fte_post <- fastfood$fte_post
covs <- model.matrix(~chain_name, data = fastfood)
D <- as.numeric(fastfood$state == "New Jersey")
unc_att <- DRDID::reg_did_panel(y1 = fte_post, y0 = 0, D = D, covariates = covs)
unc_att

Call:
DRDID::reg_did_panel(y1 = fte_post, y0 = 0, D = D, covariates = covs)
------------------------------------------------------------------
Outcome-Regression DID estimator for the ATT:
ATT Std. Error t value Pr(>|t|) [95% Conf. Interval]
0.5969 0.8762 0.6812 0.4957 -1.1205 2.3143
------------------------------------------------------------------
Estimator based on panel data.
Outcome regression est. method: OLS.
Analytical standard error.
------------------------------------------------------------------
See Sant'Anna and Zhao (2020) for details.
# single regression
unc_reg <- lm(fte_post ~ D + chain_name, data = fastfood)
summary(unc_reg)
Call:
lm(formula = fte_post ~ D + chain_name, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-24.312 -4.563 -0.927 3.548 38.299
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.8764 1.0699 22.316 < 2e-16 ***
D 0.4357 1.0670 0.408 0.683251
chain_nameKFC -10.8346 1.1393 -9.510 < 2e-16 ***
chain_nameRoy Rogers -3.6110 1.0758 -3.357 0.000869 ***
chain_nameWendys -1.3138 1.3212 -0.994 0.320672
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.266 on 379 degrees of freedom
Multiple R-squared: 0.1984, Adjusted R-squared: 0.1899
F-statistic: 23.45 on 4 and 379 DF, p-value: < 2.2e-16
- Placebo test in pre-treatment period
# regression adjustment placebo
fte_pre <- fastfood$fte_pre
unc_placebo <- DRDID::reg_did_panel(y1 = fte_pre, y0 = 0, D = D, covariates = covs)
unc_placebo

Call:
DRDID::reg_did_panel(y1 = fte_pre, y0 = 0, D = D, covariates = covs)
------------------------------------------------------------------
Outcome-Regression DID estimator for the ATT:
ATT Std. Error t value Pr(>|t|) [95% Conf. Interval]
-1.899 1.1865 -1.6004 0.1095 -4.2246 0.4267
------------------------------------------------------------------
Estimator based on panel data.
Outcome regression est. method: OLS.
Analytical standard error.
------------------------------------------------------------------
See Sant'Anna and Zhao (2020) for details.
# single regression placebo
unc_placebo_reg <- lm(fte_pre ~ D + chain_name, data = fastfood)
summary(unc_placebo_reg)
Call:
lm(formula = fte_pre ~ D + chain_name, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-17.302 -5.468 -1.616 3.238 62.913
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.642 1.145 22.394 <2e-16 ***
D -2.340 1.142 -2.049 0.0411 *
chain_nameKFC -11.186 1.219 -9.174 <2e-16 ***
chain_nameRoy Rogers -1.215 1.151 -1.055 0.2920
chain_nameWendys -1.137 1.414 -0.804 0.4217
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.846 on 379 degrees of freedom
Multiple R-squared: 0.2058, Adjusted R-squared: 0.1974
F-statistic: 24.55 on 4 and 379 DF, p-value: < 2.2e-16
- Lagged outcome unconfoundedness ATT estimate
# regression adjustment
lou_covs <- model.matrix(~ chain_name + fte_pre, data = fastfood)
lou_att_post <- DRDID::reg_did_panel(y1 = fte_post, y0 = 0, D = D, covariates = lou_covs)
lou_att_post

Call:
DRDID::reg_did_panel(y1 = fte_post, y0 = 0, D = D, covariates = lou_covs)
------------------------------------------------------------------
Outcome-Regression DID estimator for the ATT:
ATT Std. Error t value Pr(>|t|) [95% Conf. Interval]
0.844 0.8903 0.948 0.3431 -0.9009 2.5889
------------------------------------------------------------------
Estimator based on panel data.
Outcome regression est. method: OLS.
Analytical standard error.
------------------------------------------------------------------
See Sant'Anna and Zhao (2020) for details.
# single regression
lou_reg <- lm(fte_post ~ D + chain_name + fte_pre, data = fastfood)
summary(lou_reg)
Call:
lm(formula = fte_post ~ D + chain_name + fte_pre, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-19.2840 -4.3720 -0.9191 3.1944 31.3946
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.95210 1.45332 8.912 < 2e-16 ***
D 1.43260 0.95619 1.498 0.13491
chain_nameKFC -6.06889 1.12243 -5.407 1.14e-07 ***
chain_nameRoy Rogers -3.09339 0.96013 -3.222 0.00138 **
chain_nameWendys -0.82922 1.17844 -0.704 0.48208
fte_pre 0.42603 0.04277 9.960 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.366 on 378 degrees of freedom
Multiple R-squared: 0.365, Adjusted R-squared: 0.3566
F-statistic: 43.46 on 5 and 378 DF, p-value: < 2.2e-16
- Difference-in-differences ATT estimate
# regression adjustment
# for difference-in-differences, supply the actual pre-treatment outcome
# as y0 (rather than 0, which was the trick for cross-sectional
# regression adjustment)
did_att_post <- DRDID::reg_did_panel(y1 = fte_post, y0 = fte_pre, D = D, covariates = covs)
did_att_post
# single regression
fastfood$delta_fte <- fastfood$fte_post - fastfood$fte_pre
did_reg <- lm(delta_fte ~ D + chain_name, data = fastfood)
summary(did_reg)
Call:
lm(formula = delta_fte ~ D + chain_name, data = fastfood)
Residuals:
Min 1Q Median 3Q Max
-39.734 -3.861 0.564 4.277 33.167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.7659 1.1570 -1.526 0.1278
D 2.7757 1.1539 2.405 0.0166 *
chain_nameKFC 0.3518 1.2321 0.286 0.7754
chain_nameRoy Rogers -2.3960 1.1634 -2.060 0.0401 *
chain_nameWendys -0.1764 1.4288 -0.123 0.9018
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.938 on 379 degrees of freedom
Multiple R-squared: 0.02875, Adjusted R-squared: 0.0185
F-statistic: 2.804 on 4 and 379 DF, p-value: 0.02564
18.5 Coding Questions
For this problem, we will use the data rand_hie. This is data from the RAND health insurance experiment in the 1980s. In the experiment, participants were randomly assigned to Catastrophic insurance (the least amount of coverage), insurance that came with a Deductible, insurance that came with Cost Sharing (i.e., co-insurance, so that an individual pays part of the cost of their medical care), or Free insurance (so that there is no cost of medical care). For this problem, we will be interested in whether or not changing the type of health insurance changed the amount of health care utilization and the health status of individuals.
We will focus on the difference between the least amount of health insurance (“Catastrophic”) and the most amount of health insurance (“Free”). In particular, you can start this problem by running the following code:
rand_hie_subset <- subset(rand_hie, plan_type %in% c("Catastrophic", "Free"))
rand_hie_subset$free <- 1*(rand_hie_subset$plan_type == "Free")

This code creates a new data frame called rand_hie_subset that only contains individuals who were assigned to either the "Catastrophic" or "Free" plan types. It also creates a new variable called free that is equal to one for individuals who were assigned to the "Free" plan type and zero for individuals who were assigned to the "Catastrophic" plan type. Use this data frame for the rest of this problem.

a. Suppose you are interested in estimating the average treatment effect on the treated of the "Free" plan relative to the "Catastrophic" plan on total medical expenditure (total_med_expenditure). Can you estimate this effect by running a regression of total_med_expenditure on free? Explain why or why not.

b. Estimate the average treatment effect on the treated of a "Free" plan (free) relative to a "Catastrophic" plan on total medical expenditure (total_med_expenditure). Report and interpret your results.

c. Estimate the average treatment effect on the treated of a "Free" plan (free) relative to a "Catastrophic" plan on face to face doctor visits (face_to_face_visits). Report and interpret your results.

d. Estimate the average treatment effect on the treated of a "Free" plan (free) relative to a "Catastrophic" plan on overall health index (health_index). Report and interpret your results.

e. Provide an overall interpretation of your results from parts b-d.
For this problem, we will study the causal effect of having more children on women's labor supply using the data Fertility.

a. To start with, run a regression of the number of hours that a woman typically works per week (work) on whether or not she has more than two children (morekids). Report your results. Should you interpret the estimated coefficient on morekids as the causal effect of having more than two children? Explain.

b. One possible instrument in this setup is the sex composition of the first two children (i.e., whether they are both girls, both boys, or a boy and a girl). The thinking here is that, at least in the United States, parents tend to have a preference for having both a girl and a boy and that, therefore, parents whose first two children have the same sex may be more likely to have a third child than they would have been if they had a girl and a boy. Go through the list of four assumptions that we discussed in the context of instrumental variables and provide some discussion about whether or not each of these assumptions is likely to hold in this context.

c. Regardless of your answer to part (b), create a new variable called samesex that is equal to one for families whose first two children have the same sex. Estimate the effect of morekids on work using samesex as an instrument for morekids and report the results. Provide some discussion about your results.
For this question, we will use the AJR data. A deep question in development economics is: Why are some countries much richer than other countries? One explanation for this is that richer countries have different institutions (e.g., property rights, democracy, etc.) that are conducive to growth. It's hard to study these questions, though, because institutions do not arise randomly: there could be reverse causality, so that property rights, democracy, etc. are (perhaps partially) caused by being rich rather than the other way around. Alternatively, other factors (say, a country's geography) could cause both of these. We'll consider one instrumental variables approach to thinking about this question in this problem.

a. Run a regression of the log of per capita GDP (the log of per capita GDP is stored in the variable GDP) on a measure of the protection against expropriation risk (this is a measure of how "good" a country's institutions are, where a larger number indicates "better" institutions; it is in the variable Exprop). How do you interpret these results? Do you think it would be reasonable to interpret the estimated coefficient on Exprop as the causal effect of institutions on GDP?

b. One possible instrument for Exprop is settler mortality (we'll use the log of this, which is available in the variable logMort). Settler mortality is a measure of how dangerous it was for early settlers of a particular location. The idea is that places that had high settler mortality may have set up worse (sometimes called "extractive") institutions than places that had lower settler mortality, but that settler mortality (from a long time ago) does not have any other direct effect on modern GDP. Provide some discussion about whether settler mortality is a valid instrument for institutions.

c. Estimate an IV regression of GDP on Exprop using logMort as an instrument for Exprop. How do you interpret the results? How do these results compare to the ones from part a?
For this question, we'll use the data house to study the causal effect of incumbency on the probability that a member of the House of Representatives gets re-elected.

a. One way to try to estimate the causal effect of incumbency is to just run a regression where the outcome is democratic_vote_share (this is the same outcome we'll use below) and where the model includes a binary variable for whether or not the Democratic candidate is an incumbent. What are some limitations of this strategy?

b. The house data contains data about the margin of victory for Democratic candidates in the current election (positive if they won the election and negative if they lost) and data about the Democratic margin of victory in the past election. Explain how you could use this data in a regression discontinuity design to estimate the causal effect of incumbency.

c. The main assumption used to rationalize a regression discontinuity design is the continuity assumption. Explain what this assumption means in the context of the regression discontinuity design that you proposed in part b. Do you think that this assumption is likely to hold in this context? Why or why not?

d. Use the house data to implement the regression discontinuity design that you proposed in part b. What do you estimate as the causal effect of incumbency?
For this problem, we will use the data banks. We will study the causal effect of monetary policy on bank closures during the Great Depression. We'll consider an interesting natural experiment in Mississippi, where the northern half of the state was in St. Louis's Federal Reserve district (District 8) and the southern half of the state was in Atlanta's Federal Reserve district (District 6). Atlanta had much looser monetary policy (meaning they substantially increased lending) than St. Louis during the early part of the Great Depression, and our interest is in whether looser monetary policy made a difference.

a. Plot the total number of banks separately for District 6 and District 8 across all available time periods in the data.

b. An important event in the South early in the Great Depression was the collapse of Caldwell and Company, the largest banking chain in the South at the time. This happened in November 1930. The Atlanta Fed's lending markedly increased quickly after this event while St. Louis's did not. Calculate a DID estimate of the effect of looser monetary policy on the number of banks that are still in business. How do you interpret these results? Hint: You can calculate this by taking the difference between the number of banks in District 6 relative to the number of banks in District 8 across all time periods, relative to the difference between the number of banks in District 6 and District 8 in the first period (July 1, 1929).
18.6 Extra Questions
What is the difference between treatment effect homogeneity and treatment effect heterogeneity?
Why do most researchers give up on trying to estimate the individual-level effect of participating in a treatment?
What are four conditions for a valid instrument? Explain in words and math.
Explain what unconfoundedness means.
If we make the assumption of unconfoundedness, explain in words what sort of comparisons we would like to make in order to estimate the average treatment effect on the treated.
What is the key condition underlying a difference-in-differences approach to learn about the causal effect of some treatment on some outcome? Explain both in words and in math. In what circumstance would a researcher be likely to prefer this type of assumption over unconfoundedness?
Suppose you are interested in the causal effect of participating in a union on a person’s income. Consider the following approaches.
a. Consider the following assumption: \[\begin{align*} \big(Earnings(1), Earnings(0)\big) \independent Union | Education \end{align*}\] where \(Earnings(1)\) is the potential earnings if the individual participates in a union, \(Earnings(0)\) is the potential earnings if the individual does not participate in a union, \(Union\) is a binary variable equal to one if the individual participates in a union and zero otherwise, and \(Education\) is years of education. Do you think this assumption is likely to hold? Explain.
b. Regardless of your answer to part (a), explain how to estimate the ATT of participating in a union on earnings under the assumption in part (a) using regression adjustment.
c. Regardless of your answer to part (a), explain how to estimate the ATT of participating in a union on earnings under the assumption in part (a) using a single regression. What additional assumption do you need to make in order to use this approach?
d. Now suppose that you have access to two periods of panel data. Suppose that we are looking at a population of adults, so we can assume that each person's education is constant over time. Also, suppose that no one participated in a union in the first period (note: this is probably unrealistic, but we could just drop all observations that participated in a union in the first period). How can you use this data to conduct a placebo test for the unconfoundedness assumption in part (a)? Why is this useful?
e. Consider the same setup as in part (d). Explain the assumption of lagged outcome unconfoundedness in this context. Do you think this assumption is likely to hold? Explain.
f. Regardless of your answer to part (e), explain how to estimate the ATT of participating in a union on earnings under the assumption in part (e) using regression adjustment and then using a single regression.
g. Now, suppose that we are willing to make the parallel trends assumption. Explain what this assumption means in this context. Do you think this assumption is likely to hold? Explain.
h. Regardless of your answer to part (g), explain how to estimate the ATT of participating in a union on earnings under the assumption in part (g) using regression adjustment and then using a single regression.
i. Given the setup with two periods of panel data, is it possible to conduct a placebo test for lagged outcome unconfoundedness or difference-in-differences? Would your answer change if you had access to three periods of panel data? Explain.
Suppose that you are interested in learning about the causal effect of attending college on earnings.
a. You read a newspaper article that says that the average earnings for college graduates is $50,000 per year while the average earnings for non-college graduates is $40,000 per year. The article interprets this $10,000 difference as the return to attending college. What assumption would you need to believe in order for this interpretation to be valid? Do you think this assumption is likely to hold? Explain.
b. Regardless of your actual answer to part (a), suppose that you are not willing to believe the assumption that you mentioned there. You remember that having access to an instrumental variable is useful for learning about causal effects, and your friend suggests generating an instrument \(Z\) by randomly drawing a 1 or a 0 for each individual in your data. Go through the four conditions that we discussed for a valid instrument and explain whether or not each of these conditions is likely to hold for this instrument.
c. You also notice that you have a variable in your data that indicates whether or not an individual grew up near a college (near_college). Another friend suggests using this variable as an instrument for attending college. Go through the four conditions that we discussed for a valid instrument and explain whether or not each of these conditions is likely to hold for this instrument.
Suppose that you are interested in the effect of lower college costs on the probability of graduating from college. You have access to student-level data from Georgia where students are eligible for the Hope Scholarship if they can keep their GPA above 3.0.
a. What strategy can you use to exploit this institutional setting to learn about the causal effect of lower college costs on the probability of going to college?
b. What sort of data would you need in order to implement this strategy?
c. Can you think of any ways that the approach that you suggested could go wrong?
d. Another researcher reads the results from the approach you have implemented and complains that your results apply only to students with grades right around the 3.0 cutoff. Is this a fair criticism?
Suppose you are willing to believe versions of unconfoundedness, a linear model for untreated potential outcomes, and treatment effect homogeneity so that you could write \[\begin{align*} Y_i = \beta_0 + \alpha D_i + \beta_1 X_i + \beta_2 W_i + U_i \end{align*}\] with \(\E[U|D,X,W] = 0\) so that you were willing to interpret \(\alpha\) in this regression as the causal effect of \(D\) on \(Y\). However, suppose that \(W\) is not observed so that you cannot operationalize the above regression.
a. Since you do not observe \(W\), you are considering just running a regression of \(Y\) on \(D\) and \(X\) and interpreting the estimated coefficient on \(D\) as the causal effect of \(D\) on \(Y\). Does this seem like a good idea?
b. In part (a), we can write a version of the model that you are thinking about estimating as \[\begin{align*} Y_i = \delta_0 + \delta_1 D_i + \delta_2 X_i + \epsilon_i \end{align*}\] Suppose that \(\E[\epsilon | D, X] = 0\), and suppose also that \[\begin{align*} W_i = \gamma_0 + \gamma_1 D_i + \gamma_2 X_i + V_i \end{align*}\] with \(\E[V|D,X]=0\). Provide an expression for \(\delta_1\) in terms of \(\alpha\), the \(\gamma\)'s, and the \(\beta\)'s. Explain what this expression means.
Suppose you have access to an experiment where some participants were randomly assigned to participate in a job training program and others were randomly assigned not to participate. However, some individuals that were assigned to participate in the treatment decided not to actually participate. Let’s use the following notation: \(D=1\) for individuals who actually participated and \(D=0\) for individuals who did not participate. \(Z=1\) for individuals who were assigned to the treatment and \(Z=0\) for individuals assigned not to participate (here, \(D\) and \(Z\) are not exactly the same because some individuals who were assigned to the treatment did not actually participate).
You are considering several different approaches to dealing with this issue.
a. Estimating the \(ATT\) by \(\bar{Y}_{D=1} - \bar{Y}_{D=0}\). What condition do you need to believe in order for this approach to deliver a valid estimate of the \(ATT\)? Does this condition seem likely to hold in this context? Explain.
b. Run the regression \(Y_i = \beta_0 + \alpha D_i + U_i\) using \(Z_i\) as an instrument. Go through the conditions for a valid instrument and explain whether or not each of these conditions is likely to hold in this context. Assuming that all four conditions hold, how do you interpret the estimated coefficient on \(D\) in this regression?
Suppose you and a friend have conducted an experiment (things went well so that everyone complied with the treatment that they were assigned to, etc.). You interpret the difference \(\bar{Y}_{D=1} - \bar{Y}_{D=0}\) as an estimate of the \(ATT\), but your friend says that you should interpret it as an estimate of the \(ATE\). In fact, according to your friend, random treatment assignment implies that \[\begin{align*} & \E[Y(1)] = \E[Y(1)|D=1] = \E[Y|D=1] \\ \text{ and } & \E[Y(0)] = \E[Y(0)|D=0] = \E[Y|D=0] \end{align*}\] which implies that \(ATE = \E[Y|D=1] - \E[Y|D=0]\). Who is right?