## 4.7 Mean Squared Error

More generally, two estimators can be compared by their mean squared error, which is defined as

$\textrm{MSE}(\hat{\theta}) := \mathbb{E}\left[ (\hat{\theta} - \theta)^2\right]$

The mean squared error of $$\hat{\theta}$$ is the average “distance” between $$\hat{\theta}$$ and $$\theta$$ in the thought experiment of having repeated samples of size $$n$$.

Another equivalent expression for the mean squared error is

$\textrm{MSE}(\hat{\theta}) = \textrm{Bias}(\hat{\theta})^2 + \mathrm{var}(\hat{\theta})$

In other words, if we can figure out the bias and the variance of $$\hat{\theta}$$, then we can recover its mean squared error.

Side-Comment: I think it is worth quickly explaining where the second expression for $$\textrm{MSE}(\hat{\theta})$$ comes from. Starting from the definition of $$\textrm{MSE}(\hat{\theta})$$,

\begin{aligned}
\textrm{MSE}(\hat{\theta}) &= \mathbb{E}\left[ (\hat{\theta} - \theta)^2\right] \\
&= \mathbb{E}\left[ \left( (\hat{\theta} - \mathbb{E}[\hat{\theta}]) + (\mathbb{E}[\hat{\theta}] - \theta)\right)^2 \right] \\
&= \mathbb{E}\left[ (\hat{\theta} - \mathbb{E}[\hat{\theta}])^2 \right] + \mathbb{E}\left[ (\mathbb{E}[\hat{\theta}] - \theta)^2\right] + 2 \mathbb{E}\left[ (\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta) \right] \\
&= \mathrm{var}(\hat{\theta}) + \textrm{Bias}(\hat{\theta})^2
\end{aligned}

where the first equality is just the definition of $$\textrm{MSE}(\hat{\theta})$$, the second equality adds and subtracts $$\mathbb{E}[\hat{\theta}]$$, and the third equality expands the square from the previous line and pushes the expectation through the sum. For the last equality: the first term in the previous line is the definition of $$\mathrm{var}(\hat{\theta})$$; for the second term, recall that $$\textrm{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}-\theta]$$, which is non-random, so the outside expectation just goes away; and the last term is equal to 0, which holds by the properties of expectations after noticing that $$(\mathbb{E}[\hat{\theta}] - \theta)$$ is non-random and can therefore come out of the expectation, leaving $$\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] = 0$$.
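The decomposition is also easy to verify by simulation in the "repeated samples" thought experiment. The sketch below uses illustrative choices that are not from the text (a hypothetical estimator $$\hat{\theta} = 0.9\,\bar{Y}$$ with $$Y \sim N(2, 1)$$ and $$n = 10$$) and checks that the simulated MSE matches the simulated squared bias plus the simulated variance:

```python
import numpy as np

# Illustrative setup (my choices, not from the text):
# estimate theta = E[Y] = 2 with theta_hat = 0.9 * Ybar, n = 10,
# over many repeated samples.
rng = np.random.default_rng(0)
theta = 2.0
n, reps = 10, 200_000

# One estimate per simulated sample of size n.
estimates = 0.9 * rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)

mse = np.mean((estimates - theta) ** 2)   # simulated MSE
bias = np.mean(estimates) - theta         # simulated bias
var = np.var(estimates)                   # simulated variance

# The two numbers below agree (the identity holds exactly in the
# simulated moments, up to floating-point rounding).
print(mse, bias ** 2 + var)
```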

Generally, we would like to choose estimators that have low mean squared error (this essentially means that they have low bias and variance). Moreover, mean squared error gives us a way to compare estimators that are potentially biased. [Also, notice that for unbiased estimators, comparing mean squared errors of different estimators just compares their variance (because the bias term is equal to 0), so this is a generalization of relative efficiency from the previous section.]

Example 4.1 Let’s compare three estimators of $$\mathbb{E}[Y]$$ based on their mean squared error. Consider the following three estimators:

\begin{aligned}
\hat{\mu} &:= \frac{1}{n} \sum_{i=1}^n Y_i \\
\hat{\mu}_1 &:= Y_1 \\
\hat{\mu}_\lambda &:= \lambda \bar{Y} \quad \textrm{for some } \lambda > 0
\end{aligned}

$$\hat{\mu}$$ is just the sample average of $$Y$$’s that we have already discussed. $$\hat{\mu}_1$$ is the (somewhat strange) estimator of $$\mathbb{E}[Y]$$ that just uses the first observation in the data (regardless of the sample size). $$\hat{\mu}_\lambda$$ is an estimator of $$\mathbb{E}[Y]$$ that multiplies $$\bar{Y}$$ by some positive constant $$\lambda$$.

To calculate the mean squared error of each of these estimators, let’s calculate their means and their variances.

\begin{aligned}
\mathbb{E}[\hat{\mu}] &= \mathbb{E}[Y] \\
\mathbb{E}[\hat{\mu}_1] &= \mathbb{E}[Y_1] = \mathbb{E}[Y] \\
\mathbb{E}[\hat{\mu}_\lambda] &= \lambda \mathbb{E}[\bar{Y}] = \lambda \mathbb{E}[Y]
\end{aligned}

This means that $$\hat{\mu}$$ and $$\hat{\mu}_1$$ are both unbiased. $$\hat{\mu}_\lambda$$ is biased (unless $$\lambda=1$$, though this is a relatively uninteresting case, as it would mean that $$\hat{\mu}_\lambda$$ is exactly the same as $$\hat{\mu}$$), with $$\textrm{Bias}(\hat{\mu}_\lambda) = (\lambda - 1) \mathbb{E}[Y]$$.

Next, let’s calculate the variance for each estimator

\begin{aligned}
\mathrm{var}(\hat{\mu}) &= \frac{\mathrm{var}(Y)}{n} \\
\mathrm{var}(\hat{\mu}_1) &= \mathrm{var}(Y_1) = \mathrm{var}(Y) \\
\mathrm{var}(\hat{\mu}_\lambda) &= \lambda^2 \mathrm{var}(\bar{Y}) = \lambda^2 \frac{\mathrm{var}(Y)}{n}
\end{aligned}

This means that we can now calculate the mean squared error of each estimator.

\begin{aligned}
\textrm{MSE}(\hat{\mu}) &= \frac{\mathrm{var}(Y)}{n} \\
\textrm{MSE}(\hat{\mu}_1) &= \mathrm{var}(Y) \\
\textrm{MSE}(\hat{\mu}_\lambda) &= (\lambda-1)^2\mathbb{E}[Y]^2 + \lambda^2 \frac{\mathrm{var}(Y)}{n}
\end{aligned}

The first thing to notice is that $$\hat{\mu}$$ dominates $$\hat{\mu}_1$$ (where dominates means that there isn’t any scenario where you could make a reasonable case that $$\hat{\mu}_1$$ is the better estimator): its MSE is strictly lower whenever $$n > 1$$ (and when $$n = 1$$, they are the same estimator). This is probably not surprising, as $$\hat{\mu}_1$$ just throws away a lot of potentially useful information.
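These formulas are also straightforward to check by simulation. The sketch below uses illustrative numbers that are not from the text ($$Y \sim N(2, 1)$$, $$n = 50$$, $$\lambda = 0.9$$) and approximates each estimator’s MSE across many simulated samples, which can then be compared with the formulas above:

```python
import numpy as np

# Illustrative choices (mine, not from the text):
# Y ~ N(mu, sigma2) with mu = 2, sigma2 = 1, sample size n = 50, lambda = 0.9.
rng = np.random.default_rng(0)
mu, sigma2, n, lam, reps = 2.0, 1.0, 50, 0.9, 100_000

Y = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

mse_mean = np.mean((Y.mean(axis=1) - mu) ** 2)       # formula: var(Y)/n = 0.02
mse_first = np.mean((Y[:, 0] - mu) ** 2)             # formula: var(Y) = 1
mse_lam = np.mean((lam * Y.mean(axis=1) - mu) ** 2)  # formula: (lam-1)^2 mu^2 + lam^2 var(Y)/n

print(mse_mean, mse_first, mse_lam)
```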

The more interesting case is $$\hat{\mu}_\lambda$$. The first term in its MSE is the squared bias term; it is greater than the corresponding term for $$\hat{\mu}$$ or $$\hat{\mu}_1$$, because the bias of both of those is equal to 0. However, relative to $$\hat{\mu}$$, the variance of $$\hat{\mu}_\lambda$$ is smaller when $$\lambda$$ is less than 1. In fact, you can show that there are estimators of this form that have smaller mean squared error than $$\hat{\mu}$$, obtained by choosing a $$\lambda$$ that is (usually just slightly) smaller than 1. This sort of estimator is biased, but it compensates for introducing some bias by having a smaller variance. For now, we won’t talk much about this sort of estimator (and will stick to $$\bar{Y}$$), but it has the “flavor” of modern machine learning estimators, which typically introduce some bias while reducing variance. One last comment: if you were to make a “bad” choice of $$\lambda$$, $$\hat{\mu}_\lambda$$ could have a higher mean squared error than even $$\hat{\mu}_1$$, so if you wanted to proceed this way, you’d have to choose $$\lambda$$ with some care.
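To make the last point concrete, here is a quick sketch (the worked minimization is mine, not from the text). Setting the derivative of $$\textrm{MSE}(\hat{\mu}_\lambda) = (\lambda-1)^2\mathbb{E}[Y]^2 + \lambda^2 \mathrm{var}(Y)/n$$ with respect to $$\lambda$$ equal to 0 gives the MSE-minimizing choice $$\lambda^* = \mathbb{E}[Y]^2 / \big(\mathbb{E}[Y]^2 + \mathrm{var}(Y)/n\big)$$, which is slightly below 1. The code below plugs in hypothetical numbers ($$\mathbb{E}[Y]=2$$, $$\mathrm{var}(Y)=1$$, $$n=50$$, my choices) to confirm that this $$\lambda^*$$ delivers a lower MSE than $$\lambda = 1$$ (i.e., than $$\bar{Y}$$):

```python
# Hypothetical population values (my choices, not from the text).
mu, sigma2, n = 2.0, 1.0, 50

def mse(lam):
    # MSE(mu_hat_lambda) = (lam - 1)^2 * E[Y]^2 + lam^2 * var(Y) / n
    return (lam - 1) ** 2 * mu ** 2 + lam ** 2 * sigma2 / n

# MSE-minimizing lambda: E[Y]^2 / (E[Y]^2 + var(Y)/n), slightly below 1.
lam_star = mu ** 2 / (mu ** 2 + sigma2 / n)

print(lam_star)                 # slightly less than 1
print(mse(lam_star), mse(1.0))  # MSE at lambda* is below MSE(1) = var(Y)/n
```

Note how small the gain is here: the shrinkage is tiny because $$\mathrm{var}(Y)/n$$ is small relative to $$\mathbb{E}[Y]^2$$, which matches the text’s comment that the good choices of $$\lambda$$ are usually only slightly below 1.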