6.6 Instrumental Variables

SW all of chapter 12

In the previous section, I used the term natural experiment but didn’t really define it. When an actual experiment is not available, a very common strategy used by researchers interested in causal effects is to look for a natural experiment. These are not actual experiments, but cases where “something weird” happens that makes some individuals more likely to participate in the treatment without otherwise affecting their outcomes. This “something weird” is called an instrumental variable.

Let me give you some examples:

  • This is not as popular a topic as it used to be, but many economists used to be interested in the causal effect of military service on earnings. This is challenging because individuals “self-select” into the military: they don’t join at random, and, while there may be many dimensions to the choice of joining the military, one of them is probably what a person expects the effect on their future earnings to be.

    • A famous example of an instrumental variable in this case is an individual’s Vietnam draft lottery number. Here, the idea is that a randomly generated lottery number (by construction) doesn’t have any direct effect on earnings, but it does affect the chances that someone serves in the military. This is therefore a natural experiment, and the lottery number could serve the role of an instrumental variable.
  • For studying the effect of education on earnings, researchers have used the day of birth as an instrument for years of education. The idea is that compulsory schooling laws are set up so that individuals can leave school when they reach a certain age (e.g., 16). But this means that, among students who want to drop out as early as they can, students who have an “early” birthday (usually around October) will have spent less time in school than students who have a “late” birthday (usually around July) at any particular age. This is a kind of natural experiment: among students who drop out at 16, we compare the earnings of those with early birthdays to the earnings of those with late birthdays.

Let’s formalize these arguments. Using the same arguments as before, suppose we have a regression that we’d like to run

\[ Y_i = \beta_0 + \alpha D_i + \underbrace{\beta_1 W_i + U_i}_{V_i} \] and interpret our estimate of \(\alpha\) as an estimate of the causal effect of participating in the treatment. For simplicity, I am not including any \(X\) covariates, and we do not observe \(W\). If \(D\) is correlated with \(W\), then just ignoring \(W\) and running a regression of \(Y\) on \(D\) will result in omitted variable bias, so that regression does not recover an estimate of \(\alpha\). To help with the discussion below, we’ll define \(V_i\) to be the entire unobservable term, \(\beta_1 W_i + U_i\), in the above equation.
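To see the omitted variable bias concretely, here is a minimal simulation sketch. Everything in it is made up for illustration (the sample size, the parameter values, and the way \(D\) depends on \(W\) are my assumptions, not anything estimated from data). It compares the regression of \(Y\) on \(D\) alone to the infeasible regression that also controls for \(W\).

```r
# Hypothetical simulated data: all numbers below are illustrative assumptions
set.seed(1)
n <- 10000
beta0 <- 2
alpha <- 1      # true causal effect of D
beta1 <- 1.5

W <- rnorm(n)                        # unobserved confounder
D <- as.numeric(W + rnorm(n) > 0)    # treatment is more likely when W is large
U <- rnorm(n)
Y <- beta0 + alpha * D + beta1 * W + U

# Feasible regression: W is omitted, so the coefficient on D is biased
ols_short <- unname(coef(lm(Y ~ D))["D"])

# Infeasible regression: controlling for W (not possible when W is unobserved)
ols_long <- unname(coef(lm(Y ~ D + W))["D"])

print(c(true = alpha, omit_W = ols_short, control_W = ols_long))
```

In this setup, the regression that omits \(W\) should land well above the true \(\alpha\), while the regression that controls for \(W\) should be close to it.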

An instrumental variable, which we’ll call \(Z\), needs to satisfy the following two conditions:

  1. \(\mathrm{cov}(Z,V) = 0\) — This condition is called the exclusion restriction, and it means that the instrument is uncorrelated with the error term in the above equation. In practice, we’d mainly need to make sure that it is uncorrelated with whatever we think is in \(W\).

  2. \(\mathrm{cov}(Z,D) \neq 0\) — This condition is called instrument relevance, and it means that the instrument needs to actually affect whether or not an individual participates in the treatment. We’ll see why this condition is important momentarily.
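Of these two conditions, only instrument relevance can be checked directly with data, because it only involves observed variables; the exclusion restriction involves the unobserved \(V\) and has to be argued for. A minimal sketch of a relevance check on simulated data is below. The data generating process is a made-up assumption, and the "first-stage F above 10" benchmark mentioned in the comments is a conventional rule of thumb rather than something derived in this section.

```r
# Hypothetical simulated data: the instrument Z shifts treatment take-up
set.seed(2)
n <- 5000
Z <- rbinom(n, 1, 0.5)                           # instrument (e.g., a lottery)
W <- rnorm(n)                                    # unobserved confounder
D <- as.numeric(0.8 * Z + W + rnorm(n) > 0.4)    # treatment depends on Z and W

# Relevance: regress the treatment on the instrument (the "first stage")
first_stage <- lm(D ~ Z)
fs_coef <- unname(coef(first_stage)["Z"])
fs_F <- unname(summary(first_stage)$fstatistic["value"])

# A common rule of thumb treats a first-stage F statistic above 10 as
# evidence that the instrument is not weak
print(c(first_stage_coef = fs_coef, first_stage_F = fs_F))

# Only possible in a simulation: Z was drawn independently of W,
# so their sample correlation should be near 0
print(cor(Z, W))
```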

Next, notice that

\[ \begin{aligned} \mathrm{cov}(Z,Y) &= \mathrm{cov}(Z,\beta_0 + \alpha D + V) \\ &= \mathrm{cov}(Z,\beta_0) + \alpha \mathrm{cov}(Z,D) + \mathrm{cov}(Z,V) \\ &= \alpha \mathrm{cov}(Z,D) \end{aligned} \] which holds because \(\mathrm{cov}(Z,\beta_0) = 0\) (because \(\beta_0\) is a constant) and \(\mathrm{cov}(Z,V)=0\) by the first condition for \(Z\) to be a valid instrument. This implies that

\[ \alpha = \frac{\mathrm{cov}(Z,Y)}{\mathrm{cov}(Z,D)} \] That is, if we have a valid instrument, the above formula gives us a path to recovering the causal effect of \(D\) on \(Y\). [Now you can also see why we needed the second condition: without it, we would be dividing by 0 here.]
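As a quick check on this formula, here is a sketch that plugs sample covariances into it using simulated data. The data generating process (a binary instrument that is independent of the unobservables, and a true \(\alpha\) of 1) is an assumption I made up for illustration.

```r
# Hypothetical simulated data with a valid instrument
set.seed(3)
n <- 20000
alpha <- 1                                       # true causal effect

Z <- rbinom(n, 1, 0.5)                           # instrument, independent of W and U
W <- rnorm(n)                                    # unobserved confounder
D <- as.numeric(0.8 * Z + W + rnorm(n) > 0.4)    # treatment depends on Z and W
Y <- 2 + alpha * D + 1.5 * W + rnorm(n)

# IV estimate: sample analogue of cov(Z,Y) / cov(Z,D)
iv_hat <- cov(Z, Y) / cov(Z, D)

# OLS slope from regressing Y on D, which ignores W
ols_hat <- cov(D, Y) / var(D)

print(c(true = alpha, iv = iv_hat, ols = ols_hat))
```

In this setup the IV estimate should be much closer to the true \(\alpha\) than the OLS estimate, though it is also noisier, which is typical of IV.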

The intuition for this is the following: changes in the instrument can cause changes in the outcome but only because they can change whether or not an individual participates in the treatment. These changes show up in the numerator. They are scaled by how much changes in the instrument result in changes in the treatment.
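When the instrument is binary, this intuition can be computed directly: the numerator is the difference in average outcomes between \(Z=1\) and \(Z=0\), and the denominator is the difference in treatment rates between the same two groups. The sketch below uses simulated data (the data generating process is an illustrative assumption) and shows that this ratio of differences in means is numerically the same as the covariance formula.

```r
# Hypothetical simulated data with a binary instrument
set.seed(5)
n <- 20000
Z <- rbinom(n, 1, 0.5)
W <- rnorm(n)                                    # unobserved confounder
D <- as.numeric(0.8 * Z + W + rnorm(n) > 0.4)
Y <- 2 + 1 * D + 1.5 * W + rnorm(n)              # true effect of D is 1

# How much the outcome changes when the instrument changes
reduced_form <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# How much treatment participation changes when the instrument changes
first_stage <- mean(D[Z == 1]) - mean(D[Z == 0])

# Scale the change in the outcome by the change in participation
wald_hat <- reduced_form / first_stage

# Same number as the covariance ratio when Z is binary
cov_ratio <- cov(Z, Y) / cov(Z, D)

print(c(reduced_form = reduced_form, first_stage = first_stage,
        ratio = wald_hat, cov_ratio = cov_ratio))
```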

If there are other covariates in the model, the formula for \(\alpha\) will become more complicated. But you can use the ivreg function in the ivreg package to handle these complications for you.
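To give a rough idea of what happens with covariates, here is a sketch of two-stage least squares done by hand on simulated data: first regress \(D\) on the instrument and the covariate, then regress \(Y\) on the fitted values of \(D\) and the covariate. The data generating process and variable names are assumptions for illustration. The second-stage lm gives the right point estimate, but its standard errors are not valid, which is one reason to use ivreg in practice; the last few lines call ivreg only if the package is installed.

```r
# Hypothetical simulated data: the instrument is only valid after controlling for X
set.seed(4)
n <- 50000
X <- rnorm(n)                                    # observed covariate
Z <- rbinom(n, 1, plogis(X))                     # instrument depends on X
W <- rnorm(n)                                    # unobserved confounder
D <- as.numeric(0.8 * Z + 0.5 * X + W + rnorm(n) > 0.4)
Y <- 2 + 1 * D + 0.7 * X + 1.5 * W + rnorm(n)    # true effect of D is 1
sim <- data.frame(Y = Y, D = D, X = X, Z = Z)

# Stage 1: predict the treatment using the instrument and the covariate
stage1 <- lm(D ~ Z + X, data = sim)
sim$D_hat <- fitted(stage1)

# Stage 2: replace D with its fitted values (standard errors here are NOT valid)
stage2 <- lm(Y ~ D_hat + X, data = sim)
tsls_hat <- unname(coef(stage2)["D_hat"])
print(tsls_hat)

# Same point estimate from ivreg, if it is available:
# regressors go before the |, instruments (including X) go after it
if (requireNamespace("ivreg", quietly = TRUE)) {
  iv_fit <- ivreg::ivreg(Y ~ D + X | Z + X, data = sim)
  print(coef(iv_fit)["D"])
}
```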

6.6.1 Example: Return to Education

In this example, we’ll estimate the return to education using whether or not an individual lives close to a college as an instrument for years of education. The idea is that (at least after controlling for some other covariates), the distance that a person lives from a college should not directly affect their earnings but it could affect whether or not they attend college due to it being more or less convenient. I think that the papers that use this sort of idea primarily have in mind that distance-to-college may affect whether or not a student attends a community college rather than a university.


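library(ivreg)

data("SchoolingReturns", package="ivreg")

# OLS: treats education and experience as if they were exogenous
lm_reg <- lm(log(wage) ~ education + poly(experience, 2, raw = TRUE) + ethnicity + smsa + south,
  data = SchoolingReturns)

# IV: the second formula lists the instruments; nearcollege and the age terms
# are excluded from the outcome equation, the remaining covariates appear in both
iv_reg <- ivreg(log(wage) ~ education + poly(experience, 2, raw = TRUE) + ethnicity + smsa + south,
  ~ nearcollege + poly(age, 2, raw = TRUE) + ethnicity + smsa + south,
  data = SchoolingReturns)

# collect both models so they can be reported side by side
reg_list <- list(lm_reg, iv_reg)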

|  | Model 1 (OLS) | Model 2 (IV) |
|---|---|---|
| (Intercept) | 4.734 | 4.066 |
|  | (0.068) | (0.608) |
| education | 0.074 | 0.133 |
|  | (0.004) | (0.051) |
| poly(experience, 2, raw = TRUE)1 | 0.084 | 0.056 |
|  | (0.007) | (0.026) |
| poly(experience, 2, raw = TRUE)2 | −0.002 | −0.001 |
|  | (0.000) | (0.001) |
| ethnicityafam | −0.190 | −0.103 |
|  | (0.018) | (0.077) |
| smsayes | 0.161 | 0.108 |
|  | (0.016) | (0.050) |
| southyes | −0.125 | −0.098 |
|  | (0.015) | (0.029) |
| Num.Obs. | 3010 | 3010 |
| R2 | 0.291 | 0.176 |
| R2 Adj. | 0.289 | 0.175 |
| AIC | 40329.6 | 40778.6 |
| BIC | 40377.7 | 40826.7 |
| Log.Lik. | −1308.702 |  |
| F | 204.932 |  |
| RMSE | 0.37 | 0.40 |

The main parameter of interest here is the coefficient on education. The IV estimate is noticeably larger than the OLS estimate (0.133 relative to 0.074), though it is also much less precisely estimated (a standard error of 0.051 relative to 0.004). [This is actually quite surprising, as you would think that OLS would tend to over-estimate the return to education, for example if unobserved ability raises both education and earnings. This is a very famous example, and there are actually quite a few “explanations” from labor economists about why this sort of result arises.]