5.9 Omitted Variable Bias

SW 6.1

Suppose that we are interested in the following regression model

\[ \mathbb{E}[Y|X_1, X_2, Q] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 Q \] and, in particular, we are interested in the partial effect

\[ \frac{ \partial \, \mathbb{E}[Y|X_1,X_2,Q]}{\partial \, X_1} = \beta_1 \] But we face the issue that we do not observe \(Q\), which implies that we cannot control for it in the regression.

Recall that we can equivalently write

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 Q + U \tag{5.1} \] where \(\mathbb{E}[U|X_1,X_2,Q]=0\).

Now, for simplicity, suppose that

\[ \mathbb{E}[Q | X_1, X_2] = \gamma_0 + \gamma_1 X_1 + \gamma_2 X_2 \]

Now, let’s consider the idea of just running a regression of \(Y\) on \(X_1\) and \(X_2\) (and just not including \(Q\)); in other words, consider the regression \[ \mathbb{E}[Y|X_1,X_2] = \delta_0 + \delta_1 X_1 + \delta_2 X_2 \] We are interested in whether we can recover \(\beta_1\) if we do this. If we consider this “feasible” regression, notice that, if we plug in the expression for \(Y\) from Equation (5.1) and apply the law of iterated expectations (so that \(\mathbb{E}[U|X_1,X_2] = \mathbb{E}\big[\,\mathbb{E}[U|X_1,X_2,Q]\,\big|\,X_1,X_2\big] = 0\)),

\[ \begin{aligned} \mathbb{E}[Y|X_1,X_2] &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 \mathbb{E}[Q|X_1,X_2] \\ &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (\gamma_0 + \gamma_1 X_1 + \gamma_2 X_2) \\ &= \underbrace{(\beta_0 + \beta_3 \gamma_0)}_{\delta_0} + \underbrace{(\beta_1 + \beta_3 \gamma_1)}_{\delta_1} X_1 + \underbrace{(\beta_2 + \beta_3 \gamma_2)}_{\delta_2} X_2 \end{aligned} \]

In other words, if we run the feasible regression of \(Y\) on \(X_1\) and \(X_2\), \(\delta_1\) (the coefficient on \(X_1\)) is not equal to \(\beta_1\); rather, it is equal to \((\beta_1 + \beta_3 \gamma_1)\).

That you are not generally able to recover \(\beta_1\) in this case is called omitted variable bias; the bias itself is the extra term \(\beta_3 \gamma_1 = \delta_1 - \beta_1\).
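To see this in action, here is a minimal simulation sketch in Python; all coefficient values and the normal error distributions below are made up for illustration. The coefficient on \(X_1\) from the feasible regression should line up with \(\beta_1 + \beta_3 \gamma_1\) rather than with \(\beta_1\).

```python
# Minimal simulation sketch of omitted variable bias.
# All coefficient values below are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# hypothetical "true" coefficients from Equation (5.1)
beta0, beta1, beta2, beta3 = 1.0, 2.0, -1.0, 0.5
# hypothetical coefficients of E[Q | X1, X2]
gamma0, gamma1, gamma2 = 0.5, 1.5, -0.5

X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
# Q satisfies E[Q | X1, X2] = gamma0 + gamma1*X1 + gamma2*X2
Q = gamma0 + gamma1 * X1 + gamma2 * X2 + rng.normal(size=n)
# Y comes from Equation (5.1), with E[U | X1, X2, Q] = 0
Y = beta0 + beta1 * X1 + beta2 * X2 + beta3 * Q + rng.normal(size=n)

# "infeasible" regression of Y on (1, X1, X2, Q): recovers beta1
b_full = np.linalg.lstsq(np.column_stack([np.ones(n), X1, X2, Q]),
                         Y, rcond=None)[0]
# "feasible" regression of Y on (1, X1, X2): recovers delta1 instead
b_short = np.linalg.lstsq(np.column_stack([np.ones(n), X1, X2]),
                          Y, rcond=None)[0]

print("true beta1:               ", beta1)                    # 2.0
print("coef on X1, including Q:  ", b_full[1].round(3))       # ~ 2.0
print("coef on X1, omitting Q:   ", b_short[1].round(3))      # ~ 2.75
print("beta1 + beta3*gamma1:     ", beta1 + beta3 * gamma1)   # 2.75
```

Plain numpy least squares keeps the sketch dependency-light; any regression routine (e.g., statsmodels) would produce the same coefficients.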

There are, however, two cases where you will recover \(\delta_1 = \beta_1\); these occur when \(\beta_3 \gamma_1 = 0\):

  • \(\beta_3=0\). This would be the case where \(Q\) has no effect on \(Y\).

  • \(\gamma_1=0\). This would be the case where \(X_1\) and \(Q\) are uncorrelated after controlling for \(X_2\).
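As a sanity check on these two cases, re-running the simulation sketch above with `beta3 = 0.0` (or with `gamma1 = 0.0`) should give a coefficient on \(X_1\) from the feasible regression that is close to the true \(\beta_1\).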

Interestingly, there may be some cases where you can “sign” the bias; i.e., figure out whether \(\beta_3 \gamma_1\) is positive or negative. For example, you might have theoretical reasons to suspect that \(\gamma_1 > 0\) and \(\beta_3 > 0\). In this case,

\[ \delta_1 = \beta_1 + \textrm{something positive} \] which implies that running a regression that ignores \(Q\) would tend to over-estimate \(\beta_1\) (that is, \(\delta_1 > \beta_1\)).
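To put made-up numbers on this (matching the simulation sketch above): with \(\beta_1 = 2\), \(\beta_3 = 0.5\), and \(\gamma_1 = 1.5\),

\[ \delta_1 = \beta_1 + \beta_3 \gamma_1 = 2 + (0.5)(1.5) = 2.75 > \beta_1 \]

so the feasible regression over-states the partial effect of \(X_1\) by \(0.75\).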

Side-Comments:

  • The book talks about omitted variable bias in the context of causality (this is probably the leading case), but we have not talked about causality yet. The same issues arise if we just say that we have some regression of interest but are unable to estimate it because some covariates are unobserved.

  • The relationship to causality (which is not so important for now) is that, under certain conditions, we may have a particular partial effect that we would be willing to interpret as the “causal effect”; but if we are unable to control for some of the variables that would justify this interpretation, then we run into the issues pointed out in the textbook.