## 5.3 Computation

Even if we know that \(\mathbb{E}[Y|X_1,X_2,X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\), in general, we do not know the values of the population parameters (the \(\beta\)’s). This is analogous to the framework in the previous chapter where we were interested in the population parameter \(\mathbb{E}[Y]\) and estimated it by \(\bar{Y}\).

In this section, we’ll discuss how to estimate \((\beta_0,\beta_1,\beta_2,\beta_3)\) using `R`. We’ll refer to the estimated values of the parameters as \((\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3)\). As in the previous section, it will not be the case that the estimated \(\hat{\beta}\)’s are exactly equal to the population \(\beta\)’s. Later on in this chapter, we will establish properties like consistency (so that, as long as we have a large sample, the estimated \(\hat{\beta}\)’s should be “close” to the population \(\beta\)’s) and asymptotic normality (so that we can conduct inference).

Also later on in this chapter, we’ll talk about how `R` itself actually makes these computations.

The main function in `R` for estimating linear regressions is the `lm` function (`lm` stands for linear model). The key things to specify for running a regression in `R` are a `formula` argument, which tells `lm` which variable is the outcome and which variables are the regressors, and a `data` argument, which tells `lm` what data we are using to estimate the regression. Let’s give an example using the `mtcars` data.
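The regression itself is run with a single call to `lm` (the exact call is echoed in the `Call:` line of the `summary` output below):

```
# Fit the regression of mpg on hp and wt, saving the result in reg
reg <- lm(mpg ~ hp + wt, data = mtcars)
```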

What this line of code does is run a regression. The formula is `mpg ~ hp + wt`. In other words, `mpg` (standing for miles per gallon) is the outcome, and we are running a regression on `hp` (horsepower) and `wt` (weight). The `~` symbol is a “tilde”. To add regressors, we separate them with a `+`. The second argument, `data=mtcars`, says to use the `mtcars` data. All of the variables in the formula need to correspond to column names in the data. We saved the results of the regression in a variable called `reg`. It’s most common to report the results of the regression using the `summary` command.

```
summary(reg)
#> 
#> Call:
#> lm(formula = mpg ~ hp + wt, data = mtcars)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -3.941 -1.600 -0.182  1.050  5.854 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
#> hp          -0.03177    0.00903  -3.519  0.00145 ** 
#> wt          -3.87783    0.63273  -6.129 1.12e-06 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 2.593 on 29 degrees of freedom
#> Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148
#> F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
```

The main thing that this reports is the estimated parameters. Our estimate of the “Intercept” (i.e., \(\hat{\beta}_0\)) is in the first row of the table; our estimate is `37.227`. The estimated coefficient on `hp` is `-0.0318`, and the estimated coefficient on `wt` is `-3.878`.

You can also see standard errors for each estimated parameter, a t-statistic, and a p-value in the other columns. We will talk about these in more detail in the next section.
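If you want these numbers programmatically rather than reading them off the printed table, the coefficient table can be extracted as a numeric matrix; a small sketch:

```
reg <- lm(mpg ~ hp + wt, data = mtcars)

# coef(summary(reg)) returns the coefficient table as a matrix with
# columns "Estimate", "Std. Error", "t value", and "Pr(>|t|)"
ctab <- coef(summary(reg))
ctab["hp", "Estimate"]    # estimated coefficient on hp
ctab["hp", "Std. Error"]  # its standard error
```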

For now, we’ll also ignore the information provided at the bottom of the summary.

Now that we have estimated the parameters, we can use them to predict \(mpg\) given values of \(hp\) and \(wt\). For example, to predict the \(mpg\) of a 2500-pound car (note: `wt` in `mtcars` is in 1000s of pounds) with 120 horsepower, you could compute

\[
37.227 - 0.0318(120) - 3.878(2.5) = 23.716
\]
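The same computation can be done in `R` using the estimated coefficients stored in the fitted model (`coef` returns the named vector of estimates):

```
reg <- lm(mpg ~ hp + wt, data = mtcars)

b <- coef(reg)  # named vector: "(Intercept)", "hp", "wt"
b["(Intercept)"] + b["hp"] * 120 + b["wt"] * 2.5
# close to the hand computation above; any difference is due to rounding
```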
Alternatively, there is a built-in function in `R` called `predict` that can be used to generate predicted values. We just need to specify the values that we would like predictions for by passing a data frame with the relevant columns through the `newdata` argument. For example:
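A sketch, using the same 120-horsepower, 2500-pound car as above:

```
reg <- lm(mpg ~ hp + wt, data = mtcars)

# Predict mpg for a car with hp = 120 and wt = 2.5 (i.e., 2500 pounds)
predict(reg, newdata = data.frame(hp = 120, wt = 2.5))
# matches the hand computation above, up to rounding
```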

A popular alternative to `R`’s `lm` function is the `lm_robust` function from the `estimatr` package. It provides different standard errors from the default standard errors provided by `lm` that are, at least in most applications in economics, typically a better choice. We’ll have a further discussion of this topic when we talk about inference later in this chapter.
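A minimal sketch of the `lm_robust` version of the same regression (this assumes the `estimatr` package is installed; the code skips the call if it is not):

```
# Robust standard errors via estimatr::lm_robust
# (run install.packages("estimatr") first if the package is not installed)
if (requireNamespace("estimatr", quietly = TRUE)) {
  reg_robust <- estimatr::lm_robust(mpg ~ hp + wt, data = mtcars)
  summary(reg_robust)  # same point estimates as lm, different standard errors
}
```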