6.3 Machine Learning

SW 14.1, 14.2, 14.6


“Big” Data typically means one of two things:

  1. \(n\) — the number of observations is extremely large

  2. \(k\) — the number of regressors is very large

\(n\) being large is more of a computer science problem. There are certainly economists who think about these issues, but I think it is still relatively uncommon for an economist to have so much data that they have trouble computing their estimators.

We’ll focus on the second issue — where \(k\) is large. A main reason that this can occur in applications is that we may be unsure of the functional form for \(\mathbb{E}[Y|X_1,X_2,X_3]\). Should it include higher order terms like \(X_1^2\) or \(X_1^3\)? Should it include interactions like \(X_1 X_2\)? Once you start proceeding this way, even if you only have 5-10 actual covariates, \(k\) can become quite large.

As in the case of the mean, perhaps we can improve predictions by introducing some bias while simultaneously decreasing the variance of our predictions.

Let’s suppose that we have some function that takes in regressors and makes predictions of the outcome; we’ll call this function \(\hat{f}\) (where the “hat” indicates that we’ll typically estimate this function). An example would just be a regression, where \(\hat{f}(X) = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3\). One way to evaluate how well \(\hat{f}\) makes predictions is by considering its mean squared prediction error:

\[ MSPE := \mathbb{E}\left[ (Y - \hat{f}(X))^2 \right] \] \(MSPE\) quantifies the mean “distance” between our predictions and actual outcomes.
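In a data application, we typically estimate \(MSPE\) by its sample analogue, computed on data that was not used to estimate \(\hat{f}\). As a minimal sketch (assuming a hypothetical test data frame called test whose outcome column is Y, and a vector of predictions called yhat):

# sample analogue of MSPE: average squared prediction error on the test data
# (`test` and `yhat` are hypothetical placeholders here)
mspe <- mean((test$Y - yhat)^2)
mspe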

Recall that, if we could pick any function to minimize \(MSPE\), we would set \(\hat{f}(X) = \mathbb{E}[Y|X]\), but generally we do not know what \(\mathbb{E}[Y|X]\) is. We can “decompose” \(MSPE\) into several underlying components:

\[ \begin{aligned} MSPE &= \mathbb{E}\left[ \left( (Y - \mathbb{E}[Y|X]) - (\hat{f}(X) - \mathbb{E}[\hat{f}(X)|X]) - (\mathbb{E}[\hat{f}(X)|X] - \mathbb{E}[Y|X]) \right)^2 \right] \\ &= \mathbb{E}\left[ (Y-\mathbb{E}[Y|X])^2 \right] + \mathbb{E}\left[ (\hat{f}(X) - \mathbb{E}[\hat{f}(X)|X])^2\right] + \mathbb{E}\left[ (\mathbb{E}[\hat{f}(X)|X] - \mathbb{E}[Y|X])^2 \right] \\ &= \mathrm{var}(U) + \mathbb{E}[\mathrm{var}(\hat{f}(X)|X)] + \mathbb{E}\left[\textrm{Bias}(\hat{f}(X))^2\right] \end{aligned} \] where, to understand this decomposition, the first equality just adds and subtracts \(\mathbb{E}[Y|X]\) and \(\mathbb{E}[\hat{f}(X)|X]\). The second equality squares the long expression from the first equality; all of the cross terms have mean 0, so they drop out. The third equality holds where \(U := Y-\mathbb{E}[Y|X]\), by the definition of (conditional) variance, and we call the last term the bias because it involves the difference between the mean of our prediction function \(\hat{f}(X)\) and \(\mathbb{E}[Y|X]\).

You can think of the term \(\mathrm{var}(U)\) as being irreducible prediction error — even if we knew \(\mathbb{E}[Y|X]\), we wouldn’t get every prediction exactly right. But the other two terms come from having to estimate \(\hat{f}\). The term \(\mathbb{E}[\mathrm{var}(\hat{f}(X)|X)]\) is the average variance of our predictions across different values of \(X\). The term \(\mathbb{E}\left[\textrm{Bias}(\hat{f}(X))^2\right]\) is the squared bias of our predictions averaged across different values of \(X\). Many machine learning estimators have the property of being biased (so that the last term is not equal to 0) but having reduced variance relative to OLS estimators.
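To make this tradeoff concrete, here is a small simulation sketch (my own illustration, not from SW). It compares OLS predictions of \(\mathbb{E}[Y|X=1]\) to deliberately “shrunken” versions of those predictions across many simulated datasets; the shrunken predictions are biased, but they vary less from sample to sample.

# simulation sketch: true model is Y = 1 + 2*X + U, so E[Y|X=1] = 3
set.seed(1234)
n <- 50
n_sims <- 1000
preds_ols <- numeric(n_sims)
preds_shrunk <- numeric(n_sims)
for (s in 1:n_sims) {
  X <- rnorm(n)
  Y <- 1 + 2*X + rnorm(n)
  bhat <- coef(lm(Y ~ X))
  pred <- bhat[1] + bhat[2]*1        # OLS prediction of E[Y|X=1]
  preds_ols[s] <- pred
  preds_shrunk[s] <- 0.9*pred        # shrink towards 0: adds bias, reduces variance
}
c(bias = mean(preds_ols) - 3, variance = var(preds_ols))
c(bias = mean(preds_shrunk) - 3, variance = var(preds_shrunk))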

To introduce this kind of bias-variance tradeoff, we’ll estimate the parameters of the model in a similar way to what we have done before, except that we’ll add a penalty term that gets larger as the parameter estimates move further away from 0. In particular, we consider estimates of the form

\[ (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k) = \underset{b_1,\ldots,b_k}{\textrm{argmin}} \sum_{i=1}^n (Y_i - b_1 X_{1i} - \cdots - b_k X_{ki})^2 + \textrm{penalty} \]

Lasso and ridge regression, which are the two approaches to machine learning that we will talk about below, amount to different choices of the penalty term.

Side-Comment: Generally, you should “standardize” the regressors before implementing Lasso or Ridge regression. The glmnet package does this for you by default.
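If you ever need to standardize regressors yourself (say, for a package that does not do it automatically), the scale function is a quick way to do it. A minimal sketch, where X is a hypothetical numeric matrix of regressors:

X_std <- scale(X)       # demean each column and divide by its standard deviation
colMeans(X_std)         # approximately 0 for every column
apply(X_std, 2, sd)     # 1 for every column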

6.3.1 Lasso

SW 14.3

For Lasso, the penalty term is given by

\[ \lambda \sum_{j=1}^k |b_j| \] The absolute value on each of the \(b_j\) terms means that the penalty function gets larger for larger values of any \(b_j\). Since we are trying to minimize the objective function, this means that there is a “cost” to choosing a larger value of \(b_j\) relative to OLS. Thus, Lasso estimates tend to be “shrunk” towards 0.
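Just to connect this formula to code, here is the full Lasso objective written out by hand. This is purely to illustrate the formula (it is not how glmnet actually computes the estimates, and the names here are my own):

# Lasso objective: sum of squared residuals plus the penalty term
# y is a hypothetical outcome vector, X a matrix of (standardized) regressors,
# b a candidate coefficient vector, and lambda the tuning parameter
lasso_objective <- function(b, y, X, lambda) {
  sum((y - X %*% b)^2) + lambda * sum(abs(b))
}

The Lasso estimates are the value of b that minimizes this function for a given choice of lambda.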

\(\lambda\) is called a tuning parameter — larger values of \(\lambda\) imply that the penalty term is more important. Notice that, if you send \(\lambda \rightarrow \infty\), the penalty term will be so large that you would set all of the parameters equal to 0. On the other hand, if you set \(\lambda=0\), then you will get the OLS estimates (because there is no penalty in this case). In general, it is hard to know what the “right” value of \(\lambda\) is, and it is typically chosen using cross validation (this will be done automatically for you by the glmnet package).
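If you estimate the model with cv.glmnet (as in the computation section below), the cross-validated choices of \(\lambda\) are stored in the fitted object. A quick sketch, based on my understanding of the glmnet interface, where lasso_model is a cv.glmnet fit like the one estimated below:

lasso_model$lambda.min    # value of lambda that minimizes cross-validated error
lasso_model$lambda.1se    # largest lambda within one standard error of that minimum
plot(lasso_model)         # cross-validated error as a function of log(lambda)

By default, coef and predict for a cross-validated fit use the "lambda.1se" choice; you can ask for the error-minimizing choice instead by passing, for example, s="lambda.min" to either function.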

6.3.2 Ridge Regression

SW 14.4

For Ridge, the penalty term is given by

\[ \lambda \sum_{j=1}^k b_j^2 \] Much of the discussion of Lasso applies here as well. The only difference is the form of the penalty: \(b_j^2\) instead of \(|b_j|\). Relative to the Lasso penalty, the Ridge penalty penalizes very large values of \(b_j\) much more heavily.

Comparing Lasso and Ridge

  • Both shrink the estimated parameters towards 0. This tends to introduce bias into our predictions but comes with the benefit of reducing the variance of the predictions.

  • Interestingly, both Lasso and Ridge can be implemented when \(k > n\) (i.e., if you have more regressors than observations). This is in contrast to OLS, where the parameters cannot be estimated in this case.

  • Both are generally not very computationally intensive. For ridge regression, we can in fact derive an explicit expression for the estimated \(\beta\)’s (see the short sketch just after this list). We cannot do this with Lasso; a full discussion of how Lasso actually estimates the parameters is beyond the scope of our course, but suffice it to say that you can generally compute Lasso estimates quickly.

  • Lasso also performs “model selection” — that is, if you use the Lasso, many of the estimated parameters will be set equal to 0. This can sometimes be an advantage. Ridge (like OLS) will generally produce non-zero estimates of all parameters in the model.
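As mentioned in the list above, Ridge has an explicit solution: for the penalized objective exactly as written above, the estimates are \(\hat{\beta} = (X'X + \lambda I_k)^{-1} X'Y\). Here is a minimal sketch of that formula in R, where y and X are hypothetical; note that glmnet scales its objective somewhat differently, so its estimates for a given \(\lambda\) will not match these exactly.

# closed-form Ridge estimates for the objective written above
# y: outcome vector, X: matrix of standardized regressors (no intercept),
# lambda: tuning parameter
ridge_closed_form <- function(y, X, lambda) {
  k <- ncol(X)
  solve(t(X) %*% X + lambda * diag(k), t(X) %*% y)
}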

6.3.3 Computation

Computing Lasso and Ridge regressions is somewhat more complicated than most other things that we have computed this semester. The main R package for Lasso and Ridge regressions is the glmnet package. For some reason, the syntax of the package is somewhat different from, for example, the lm command. In my view, it is often easier to use the glmnetUtils package, which seems to be just a wrapper for glmnet but with functions that are analogous to lm.
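For comparison, this is (roughly) what the raw glmnet syntax looks like: you build the regressor matrix and outcome vector yourself rather than passing a formula and a data frame. This is only a sketch; the column names are hypothetical and match the example below.

library(glmnet)
x_train <- model.matrix(~ X1 + X2 + X3, data=train)[, -1]  # drop the intercept column
y_train <- train$Y
lasso_raw <- cv.glmnet(x_train, y_train)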

I’m just going to sketch here how you would use these functions to estimate a Lasso or Ridge regression. In the lab later in this chapter, we’ll do a more concrete example. Supposing that you have access to a training dataset called train and a testing dataset called test, you can use the glmnetUtils package in the following way:

library(glmnetUtils)
lasso_model <- cv.glmnet(Y ~ X1 + X2 + X3, data=train, use.model.frame=TRUE) # or whatever formula you want to use
coef(lasso_model) # if you are interested in estimated coefficients
predict(lasso_model, newdata=test) # get predictions

ridge_model <- cv.glmnet(Y ~ X1 + X2 + X3, 
                         data=train, 
                         alpha=0,
                         use.model.frame=TRUE)
coef(ridge_model)
predict(ridge_model, newdata=test)

It’s worth making a couple of comments about this:

  • In terms of code, the only difference between Lasso and Ridge is that, for Ridge, we added the extra argument alpha=0. The default value of alpha is 1, and that leads to a Lasso regression (which is why we didn’t need to specify it for our Lasso estimates). If you are interested, you can read more details in the documentation for glmnet using ?glmnet.

  • Generally, if you are going to use Lasso or Ridge, you would have many more regressors to include than just X1 + X2 + X3. One common way that you end up with more regressors is when you start thinking about including higher order terms or interaction terms. One way to do this quickly is to use the poly function inside the formula. For example,

    lasso_model2 <- cv.glmnet(Y ~ poly(X1, X2, X3, degree=2, raw=TRUE), 
                              data=train, 
                              use.model.frame=TRUE)

    would include all interactions between \(X_1\), \(X_2\), and \(X_3\) as well as squared terms for each; if you wanted to include more interactions and cubic terms, you could specify degree=3. Once you have estimated a few different models this way, a natural last step is to compare how well each one predicts on the test data, as sketched below.
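Finally, tying this back to the \(MSPE\) discussion from earlier in this section: once you have a few candidate models, you can compare their sample \(MSPE\)s on the test data and go with whichever one predicts best. A sketch, assuming (as above) that the outcome column in test is called Y:

# out-of-sample comparison: lower MSPE means better predictions
pred_lasso <- predict(lasso_model, newdata=test)
pred_ridge <- predict(ridge_model, newdata=test)
c(lasso = mean((test$Y - pred_lasso)^2),
  ridge = mean((test$Y - pred_ridge)^2))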