6.6 Extra Questions

  1. What are some drawbacks of using \(R^2\) as a model selection tool?

  2. For AIC and BIC, we choose the model that minimizes these (rather than maximizes them). Why?

  3. Which of AIC and BIC tends to pick “more complicated” models? What is the reason for this?
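
For reference in questions 2 and 3, the standard definitions (with \(k\) the number of estimated parameters, \(n\) the number of observations, and \(\hat{L}\) the maximized value of the likelihood) are \[ \mathrm{AIC} = 2k - 2\ln(\hat{L}), \qquad \mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}). \] Both reward fit through \(\ln(\hat{L})\) and penalize the number of parameters; note that \(\ln(n) > 2\) whenever \(n > e^2 \approx 7.4\), so the two criteria weight model complexity differently in all but very small samples.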

  4. Suppose you are interested in predicting some outcome \(Y\) and have access to covariates \(X_1\), \(X_2\), and \(X_3\). You estimate the following two models: \[\begin{align*} Y &= 30 + 4 X_1 - 2 X_2 - 10 X_3, \qquad R^2 = 0.5,\ \mathrm{AIC} = 3421 \\ Y &= 9 + 2 X_1 - 3 X_2 - 2 X_3 + 2 X_1^2 + X_2^2 - 4 X_3^2 + 2 X_1 X_2, \qquad R^2 = 0.75,\ \mathrm{AIC} = 4018 \end{align*}\]

    1. Which model seems to be predicting better? Explain why.

    2. Using the model that is predicting better, what would be your prediction for \(Y\) when \(X_1=10, X_2=1, X_3=5\)?
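
For question 4, a prediction from an estimated linear model is obtained by plugging the covariate values into the fitted equation. Below is a minimal R sketch on simulated data (the data-generating process, coefficients, and object names are illustrative assumptions, not the models from the question) showing how two specifications can be compared by AIC and how a prediction at specific covariate values is formed.

```r
set.seed(1)
n  <- 200
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
Y  <- 1 + 2 * X1 - X2 + 0.5 * X3 + rnorm(n)
dat <- data.frame(Y, X1, X2, X3)

# two candidate specifications, mirroring the structure of the question
mod_linear <- lm(Y ~ X1 + X2 + X3, data = dat)
mod_flex   <- lm(Y ~ X1 + X2 + X3 + I(X1^2) + I(X2^2) + I(X3^2) + X1:X2, data = dat)

# the specification with the lower AIC is the one expected to predict better
AIC(mod_linear)
AIC(mod_flex)

# prediction at a specific covariate value
predict(mod_linear, newdata = data.frame(X1 = 10, X2 = 1, X3 = 5))
```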

  5. In Lasso and Ridge regressions, it is common to “standardize” the regressors before estimating the model (e.g., the glmnet package in R does this by default). What is the reason for doing this?
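
For question 5, a minimal R sketch on simulated data (the data and variable names are illustrative assumptions): when regressors sit on very different scales, the same penalty level treats them very differently unless they are standardized; in glmnet, standardize = TRUE is the default.

```r
library(glmnet)

set.seed(1)
n  <- 200
x1 <- rnorm(n)               # regressor on a unit scale
x2 <- rnorm(n, sd = 1000)    # regressor on a much larger scale
x  <- cbind(x1, x2)
y  <- x1 + 0.001 * x2 + rnorm(n)

fit_std   <- glmnet(x, y, alpha = 1, standardize = TRUE)   # the default
fit_nostd <- glmnet(x, y, alpha = 1, standardize = FALSE)

# at the same penalty level, compare which coefficients are shrunk toward 0
coef(fit_std,   s = 0.1)
coef(fit_nostd, s = 0.1)
```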

  6. In Lasso and Ridge regressions, the penalty term leads to “shrinking” the estimated parameters in the model towards 0. This tends to introduce bias while reducing variance. Why can introducing bias while reducing variance potentially lead to better predictions? Does this argument always apply, or only in some cases? Explain.
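
A standard decomposition that is useful for question 6: for an estimator \(\hat{\theta}\) of a parameter \(\theta\), the mean squared error splits into a squared-bias term and a variance term, \[ \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \big(\mathbb{E}[\hat{\theta}] - \theta\big)^2 + \mathrm{Var}(\hat{\theta}), \] so accepting some bias can improve overall accuracy when it comes with a sufficiently large reduction in variance.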

  7. In Lasso and Ridge regressions, the penalty term depends on the tuning parameter \(\lambda\).

    1. How is this tuning parameter often chosen in practice? Why does it make sense to choose it in this way?

    2. What would happen to the estimated coefficients when \(\lambda=0\)?

    3. What would happen to the estimated coefficients as \(\lambda \rightarrow \infty\)?
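
For question 7, a minimal R sketch on simulated data (the data and object names are illustrative assumptions): cv.glmnet chooses \(\lambda\) by cross-validation, and refitting at \(\lambda = 0\) and at a very large \(\lambda\) shows the two extremes of the shrinkage behavior.

```r
library(glmnet)

set.seed(1)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 2 * x[, 2] + rnorm(n)

# (a) lambda chosen by k-fold cross-validation (10 folds by default)
cv_fit <- cv.glmnet(x, y, alpha = 1)
cv_fit$lambda.min                    # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")

# (b) lambda = 0: essentially no penalty, so the estimates are close to OLS
coef(glmnet(x, y, alpha = 1, lambda = 0))

# (c) very large lambda: the penalty dominates and the slope coefficients
#     are shrunk all the way to 0 (only the intercept remains)
coef(glmnet(x, y, alpha = 1, lambda = 1e6))
```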