Some additional comments on omitted variable bias

In class (and afterwards), there were a few questions about whether \(\mathbb{E}[X_1 u]\) was actually equal to 0 and why. Below, I create some simulated data to show that \(X_1\) is uncorrelated with \(u\), that it is also uncorrelated with \(e\) (the error term from the linear projection of \(Y\) on \(X_1\) and \(X_2\)), but that \(X_1\) is correlated with the “structural error” \(v = X_2'\beta_2 + e\). Together, these results provide another explanation for why the “feasible” regression of \(Y\) on \(X_1\) does not, in general, recover \(\beta_1\), but instead recovers what we called \(\gamma_1\) in class.
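
As a reminder of where \(\gamma_1\) comes from: in the scalar case simulated below, the slope from the linear projection of \(Y\) on \(X_1\) alone is \(\gamma_1 = \beta_1 + \beta_2 \, \mathrm{Cov}(X_1, X_2)/\mathrm{Var}(X_1)\). With the parameters used in the simulation (\(\beta_1 = 1\), \(\beta_2 = 3\), and \(X_1, X_2\) each with variance 1 and correlation 0.5), this works out to \(\gamma_1 = 1 + 3 \times 0.5 = 2.5\), which is what the short regression below estimates, up to sampling error.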

# ------------------------------------------------------
#
# simulations about omitted variable / linear projection
#
# ------------------------------------------------------

library(mvtnorm)               # load package for making draws from multivariate
                               # normal distribution
set.seed(1234)                 # set seed, so reproducible


# simulation parameters
bet0 <- 0                      # intercept
bet1 <- 1                      # coefficient on X1
bet2 <- 3                      # coefficient on X2
rho <- 0.5                     # correlation between X1 and X2
n <- 1000                      # number of observations

# create data
# draw X1 and X2; they will have mean 0, variance 1, and correlation rho
Sigma <- matrix(c(1,rho,rho,1), nrow=2, byrow=FALSE)
X <- rmvnorm(n, sigma=Sigma)
X1 <- X[,1]
X2 <- X[,2]

# draw e, it will have mean 0, variance 1, and be independent of X1 and X2
e <- rnorm(n)

# create Y
Y <- bet0 + bet1*X1 + bet2*X2 + e

# run long regression (since we created the data, in this case it is "feasible")
long_reg <- lm(Y ~ X1 + X2)
summary(long_reg)              # show results
## 
## Call:
## lm(formula = Y ~ X1 + X2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.09221 -0.72589  0.03841  0.69621  2.96916 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.02828    0.03202   0.883    0.377    
## X1           1.01373    0.03673  27.597   <2e-16 ***
## X2           2.94830    0.03874  76.098   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.012 on 997 degrees of freedom
## Multiple R-squared:  0.9236, Adjusted R-squared:  0.9234 
## F-statistic:  6023 on 2 and 997 DF,  p-value: < 2.2e-16
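# note that the estimates are close to the true values (bet0=0, bet1=1, bet2=3)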
ehat <- resid(long_reg)        # recover residuals

# now, let's run the short regression of Y on just X1
short_reg <- lm(Y ~ X1)
summary(short_reg)             # coeffs are not equal to ones from long reg
## 
## Call:
## lm(formula = Y ~ X1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2841 -1.7645 -0.1045  1.7561  8.8840 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.008952   0.083492  -0.107    0.915    
## X1           2.476075   0.081644  30.328   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.64 on 998 degrees of freedom
## Multiple R-squared:  0.4796, Adjusted R-squared:  0.4791 
## F-statistic: 919.8 on 1 and 998 DF,  p-value: < 2.2e-16
uhat <- resid(short_reg)       # recover residuals
all(uhat == ehat)              # these are not equal to residuals from long reg
## [1] FALSE
cor(X1, uhat)                  # X1 is uncorrelated with uhat
## [1] 9.681901e-18
cor(X1, ehat)                  # X1 also uncorrelated with ehat
## [1] -5.19263e-18
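# aside: these two sample correlations are numerically zero by construction;
# the OLS first-order conditions force the residuals to be exactly orthogonal
# (in sample) to the regressors included in each regression, mirroring the
# population condition E[X1*u] = 0 discussed above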
# now let's create the "structural" error/residuals; this is what we called "v"
# in class, where v = bet2*X2 + e
bet2hat <- coef(long_reg)["X2"]      # recover estimate of bet2
vhat <- ehat + bet2hat*X2            # recover structural residuals vhat
cor(X1, vhat)                        # X1 is correlated with these residuals
## [1] 0.4932111
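# for comparison, the correlation implied by the simulation parameters:
# Cov(X1,v) = bet2*rho = 1.5 and Var(v) = bet2^2 + 1 = 10, so the population
# correlation should be 1.5/sqrt(10)
bet2*rho / sqrt(bet2^2 + 1)
## [1] 0.4743416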
# This is just an alternative way of arguing that if you run the regression
# of Y on X1 alone, it is not going to deliver a reasonable estimate of
# beta1; instead, it recovers gamma1
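# as a final check, the omitted variable bias formula from above says that the
# short regression should recover gamma1 = bet1 + bet2*Cov(X1,X2)/Var(X1),
# which here is
bet1 + bet2*rho
## [1] 2.5
# close to the short regression estimate of about 2.48, up to sampling error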