# Some additional comments on omitted variable bias

In class (and after), there were a few questions about whether $$\mathbb{E}[X_1 u]$$ was actually equal to 0 and why. Below, I use some simulated data to show that $$X_1$$ is uncorrelated with $$u$$ (the error term from the linear projection of $$Y$$ on $$X_1$$ alone), that it is also uncorrelated with $$e$$ (the error term from the linear projection of $$Y$$ on $$X_1$$ and $$X_2$$), but that $$X_1$$ is correlated with the “structural error” $$v=X_2'\beta_2 + e$$. Together, these results provide another way to see why the “feasible” regression of $$Y$$ on $$X_1$$ does not, in general, recover $$\beta_1$$ but instead recovers what we called $$\gamma_1$$ in class.
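As a reminder of where $$\gamma_1$$ comes from (this is a sketch of the usual omitted-variable-bias algebra, using the linear projection of $$X_2$$ on $$X_1$$, say $$X_2 = \delta_0 + \delta_1 X_1 + w$$ with $$\mathbb{E}[X_1 w]=0$$): substituting this projection into the structural equation gives

$$\gamma_1 = \beta_1 + \delta_1 \beta_2, \qquad \delta_1 = \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}.$$

In the simulation below, $$\mathrm{Var}(X_1)=1$$ and $$\mathrm{Cov}(X_1,X_2)=\rho=0.5$$, so $$\gamma_1 = 1 + 0.5 \times 3 = 2.5$$, which is close to the short-regression estimate of about 2.48 reported below.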

# ------------------------------------------------------
#
# simulations about omitted variable / linear projection
#
# ------------------------------------------------------

library(mvtnorm)               # load package for making draws from multivariate
# normal distribution
set.seed(1234)                 # set seed, so reproducible

# simulation parameters
bet0 <- 0                      # intercept
bet1 <- 1
bet2 <- 3
rho <- 0.5                     # correlation between X1 and X2
n <- 1000                      # number of observations

# create data
# draw X1, X2; they will have mean 0, variance 1, and correlation rho
Sigma <- matrix(c(1, rho, rho, 1), nrow=2)
X <- rmvnorm(n, sigma=Sigma)
X1 <- X[,1]
X2 <- X[,2]

# draw e, it will have mean 0, variance 1, and be independent of X1 and X2
e <- rnorm(n)

# create Y
Y <- bet0 + bet1*X1 + bet2*X2 + e

# run long regression (since we created data, in this case it is "feasible")
long_reg <- lm(Y ~ X1 + X2)
summary(long_reg)              # show results
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.09221 -0.72589  0.03841  0.69621  2.96916
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.02828    0.03202   0.883    0.377
## X1           1.01373    0.03673  27.597   <2e-16 ***
## X2           2.94830    0.03874  76.098   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.012 on 997 degrees of freedom
## Multiple R-squared:  0.9236, Adjusted R-squared:  0.9234
## F-statistic:  6023 on 2 and 997 DF,  p-value: < 2.2e-16
ehat <- resid(long_reg)        # recover residuals

# now, let's run the short regression of Y on just X1
short_reg <- lm(Y ~ X1)
summary(short_reg)             # coeffs are not equal to ones from long reg
##
## Call:
## lm(formula = Y ~ X1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.2841 -1.7645 -0.1045  1.7561  8.8840
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.008952   0.083492  -0.107    0.915
## X1           2.476075   0.081644  30.328   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.64 on 998 degrees of freedom
## Multiple R-squared:  0.4796, Adjusted R-squared:  0.4791
## F-statistic: 919.8 on 1 and 998 DF,  p-value: < 2.2e-16
uhat <- resid(short_reg)       # recover residuals
all(uhat == ehat)              # these are not equal to residuals from long reg
##  FALSE
cor(X1, uhat)                  # X1 is uncorrelated with uhat
##  9.681901e-18
cor(X1, ehat)                  # X1 also uncorrelated with ehat
##  -5.19263e-18
# now let's create the "structural" error/residuals; this is what we called
# "v" in class, where v = bet2*X2 + e
bet2hat <- coef(long_reg)["X2"]      # recover estimate of bet2 (X2 coefficient only)
vhat <- ehat + bet2hat*X2            # recover structural residuals vhat
cor(X1, vhat)                        # X1 is correlated with these residuals
##  0.4932111
# This is just an alternative way of arguing that running the regression
# of Y on X1 alone will not, in general, deliver a reasonable estimate
# of beta1
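As one more check (not part of the original simulation above), the omitted-variable-bias formula can be verified directly: since both regressors are standardized, the short-regression slope should be close to bet1 + rho*bet2 = 2.5. A self-contained sketch, re-simulating data with the same parameters but using only base R:

```r
# self-contained check of the omitted-variable-bias formula; re-simulates
# data with the same parameters as above (no mvtnorm needed here)
set.seed(1234)
n <- 100000                              # large n, so estimates are precise
rho <- 0.5; bet1 <- 1; bet2 <- 3
X1 <- rnorm(n)
X2 <- rho*X1 + sqrt(1 - rho^2)*rnorm(n)  # mean 0, variance 1, cor(X1,X2)=rho
Y <- bet1*X1 + bet2*X2 + rnorm(n)
gamma1 <- unname(coef(lm(Y ~ X1))["X1"]) # short-regression slope
gamma1                                   # approximately bet1 + rho*bet2 = 2.5
```

Here $$\delta_1 = \rho$$ only because both regressors have variance 1; with non-standardized regressors, $$\delta_1$$ would instead be the slope from regressing $$X_2$$ on $$X_1$$.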