5.3 Computation

Even if we know that \(\mathbb{E}[Y|X_1,X_2,X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\), in general, we do not know the values of the population parameters (the \(\beta\)’s). This is analogous to the framework in the previous chapter where we were interested in the population parameter \(\mathbb{E}[Y]\) and estimated it by \(\bar{Y}\).

In this section, we’ll discuss how to estimate \((\beta_0,\beta_1,\beta_2,\beta_3)\) using R. We’ll refer to the estimated values of the parameters as \((\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)\). As in the previous section, it will not be the case that the estimated \(\hat{\beta}\)’s are exactly equal to the population \(\beta\)’s. Later on in this chapter, we will establish properties like consistency (so that, as long as we have a large sample, the estimated \(\hat{\beta}\)’s should be “close” to the population \(\beta\)’s) and asymptotic normality (so that we can conduct inference).

Also later on in this chapter, we’ll talk about how R itself actually makes these computations.

The main function in R for estimating linear regressions is the lm function (lm stands for linear model). The key things to specify for running a regression in R are a formula argument which tells lm which variables are the outcome and which variables are the regressors and a data argument which tells the lm command what data we are using to estimate the regression. Let’s give an example using the mtcars data.

reg <- lm(mpg ~ hp + wt, data=mtcars)

What this line of code does is to run a regression. The formula is mpg ~ hp + wt. In other words mpg (standing for miles per gallon) is the outcome, and we are running a regression on hp (horse power) and wt (weight). The ~ symbol is a “tilde”. In order to add regressors, we separate them with a +. The second argument data=mtcars says to use the mtcars data. All of the variables in the formula need to correspond to column names in the data. We saved the results of the regression in a variable called reg. It’s most common to report the results of the regression using the summary command.

summary(reg)
#> 
#> Call:
#> lm(formula = mpg ~ hp + wt, data = mtcars)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.941 -1.600 -0.182  1.050  5.854 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
#> hp          -0.03177    0.00903  -3.519  0.00145 ** 
#> wt          -3.87783    0.63273  -6.129 1.12e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.593 on 29 degrees of freedom
#> Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
#> F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

The main thing that this reports is the estimated parameters. Our estimate of the “Intercept” (i.e., this is \(\hat{\beta}_0\)) is in the first row of the table; our estimate is 37.227. The estimated coefficient on hp is -0.0318, and the estimated coefficient on wt is -3.878.

You can also see standard errors for each estimated parameter, a t-statistic, and a p-value in the other columns. We will talk about these in more detail in the next section.

For now, we’ll also ignore the information provided at the bottom of the summary.

Now that we have estimated the parameters, we can use these to predict \(mpg\) given a value of \(hp\) and \(wt\). For example, suppose that you wanted to predict the \(mpg\) of a 2500 pound car (note: weight in \(mtcars\) is in 1000s of pounds) and 120 horsepower car, you could compute

\[ 37.227 - 0.0318(120) - 3.878(2.5) = 23.716 \] Alternatively, there is a built-in function in R called predict that can be used to generate predicted values. We just need to specify the values that we would like to get predicted values for by passing in a data frame with the relevant columns though the newdata argument. For example,

pred <- predict(reg, newdata=data.frame(hp=120,wt=2.5))
round(pred,3)
#>     1 
#> 23.72

A popular alternative to R’s lm function is the lm_robust function from the estimatr package. This provides different standard errors from the default standard errors provided by lm that are, at least in most applications in economics, typically a better choice — we’ll have a further discussion on this topic when we talk about inference later on in this chapter.