# Some additional comments on omitted variable bias

In class (and after), there were a few questions about whether $$\mathbb{E}[X_1 u]$$ was actually equal to 0 and why. Below, I use some simulated data to show that $$X_1$$ is uncorrelated with $$u$$ (the error term from the linear projection of $$Y$$ on $$X_1$$ alone), that it is also uncorrelated with $$e$$ (the error term from the linear projection of $$Y$$ on $$X_1$$ and $$X_2$$), but that $$X_1$$ is correlated with the “structural error” $$v=X_2'\beta_2 + e$$. Together, these results provide another way to see why the “feasible” regression of $$Y$$ on $$X_1$$ does not, in general, recover $$\beta_1$$ but instead recovers what we called $$\gamma_1$$ in class.
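As a reminder of where $$\gamma_1$$ comes from (this is a sketch of the usual omitted-variable-bias algebra, using the linear projection of $$X_2$$ on $$X_1$$, say $$X_2 = \delta_0 + \delta_1 X_1 + w$$ with $$\mathbb{E}[X_1 w]=0$$): substituting this projection into the structural equation gives

$$\gamma_1 = \beta_1 + \delta_1 \beta_2, \qquad \delta_1 = \frac{\mathrm{Cov}(X_1, X_2)}{\mathrm{Var}(X_1)}.$$

In the simulation below, $$\mathrm{Var}(X_1)=1$$ and $$\mathrm{Cov}(X_1,X_2)=\rho=0.5$$, so $$\gamma_1 = 1 + 0.5 \times 3 = 2.5$$, which is close to the short-regression estimate of about 2.48 reported below.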

# ------------------------------------------------------
#
# simulations about omitted variable / linear projection
#
# ------------------------------------------------------

library(mvtnorm)               # load package for making draws from multivariate
# normal distribution
set.seed(1234)                 # set seed, so reproducible

# simulation parameters
bet0 <- 0                      # intercept
bet1 <- 1
bet2 <- 3
rho <- 0.5                     # correlation between X1 and X2
n <- 1000                      # number of observations

# create data
# draw X1, X2; they will have mean 0, variance 1, and correlation rho
Sigma <- matrix(c(1, rho, rho, 1), nrow=2)
X <- rmvnorm(n, sigma=Sigma)
X1 <- X[,1]
X2 <- X[,2]

# draw e, it will have mean 0, variance 1, and be independent of X1 and X2
e <- rnorm(n)

# create Y
Y <- bet0 + bet1*X1 + bet2*X2 + e

# run long regression (since we created data, in this case it is "feasible")
long_reg <- lm(Y ~ X1 + X2)
summary(long_reg)              # show results
##
## Call:
## lm(formula = Y ~ X1 + X2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -3.09221 -0.72589  0.03841  0.69621  2.96916
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.02828    0.03202   0.883    0.377
## X1           1.01373    0.03673  27.597   <2e-16 ***
## X2           2.94830    0.03874  76.098   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.012 on 997 degrees of freedom
## Multiple R-squared:  0.9236, Adjusted R-squared:  0.9234
## F-statistic:  6023 on 2 and 997 DF,  p-value: < 2.2e-16
ehat <- resid(long_reg)        # recover residuals

# now, let's run the short regression of Y on just X1
short_reg <- lm(Y ~ X1)
summary(short_reg)             # coeffs are not equal to ones from long reg
##
## Call:
## lm(formula = Y ~ X1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -9.2841 -1.7645 -0.1045  1.7561  8.8840
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.008952   0.083492  -0.107    0.915
## X1           2.476075   0.081644  30.328   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.64 on 998 degrees of freedom
## Multiple R-squared:  0.4796, Adjusted R-squared:  0.4791
## F-statistic: 919.8 on 1 and 998 DF,  p-value: < 2.2e-16
uhat <- resid(short_reg)       # recover residuals
all(uhat == ehat)              # these are not equal to residuals from long reg
##  FALSE
cor(X1, uhat)                  # X1 is uncorrelated with uhat
##  9.681901e-18
cor(X1, ehat)                  # X1 also uncorrelated with ehat
##  -5.19263e-18
# now let's create the "structural" error/residuals; this is what we called
# "v" in class, where v = bet2*X2 + e
bet2hat <- coef(long_reg)["X2"]      # recover estimate of bet2 (X2 coefficient only)
vhat <- ehat + bet2hat*X2            # recover structural residuals vhat
cor(X1, vhat)                        # X1 is correlated with these residuals
##  0.4932111
# This is just an alternative way of arguing that running the regression
# of Y on X1 alone will not, in general, deliver a reasonable estimate
# of beta1
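As one more check (not part of the original simulation above), the omitted-variable-bias formula can be verified directly: since both regressors are standardized, the short-regression slope should be close to bet1 + rho*bet2 = 2.5. A self-contained sketch, re-simulating data with the same parameters but using only base R:

```r
# self-contained check of the omitted-variable-bias formula; re-simulates
# data with the same parameters as above (no mvtnorm needed here)
set.seed(1234)
n <- 100000                              # large n, so estimates are precise
rho <- 0.5; bet1 <- 1; bet2 <- 3
X1 <- rnorm(n)
X2 <- rho*X1 + sqrt(1 - rho^2)*rnorm(n)  # mean 0, variance 1, cor(X1,X2)=rho
Y <- bet1*X1 + bet2*X2 + rnorm(n)
gamma1 <- unname(coef(lm(Y ~ X1))["X1"]) # short-regression slope
gamma1                                   # approximately bet1 + rho*bet2 = 2.5
```

Here $$\delta_1 = \rho$$ only because both regressors have variance 1; with non-standardized regressors, $$\delta_1$$ would instead be the slope from regressing $$X_2$$ on $$X_1$$.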