Consider the following regression, where airq is an index of air quality (lower is better) for a metropolitan area in California, dens1000 is population density in thousands of people per square mile, coas indicates whether the metro area is on the coast, and medi1000 is median income in the metro area (in thousands of dollars). The numbers in parentheses are the standard errors of the coefficients.
Which regressors are statistically significant in this regression?
Answer: The intercept and coas are statistically significant. medi1000 is marginally statistically significant (from the available information, its t-statistic is exactly equal to 2). None of the other regressors are statistically significant.
What is the predicted value of the air quality index for a metro area that has 1000 people per square mile, is not located on the coast, and has median income equal to $50,000?
Let \(Y\) denote the age of a person in the United States. Suppose that your theory is that \(\mathbb{E}[Y] = 35\). You are able to collect a random sample of 100 observations. Using these data, you calculate \(\bar{Y} = 37\) and \(\widehat{\mathrm{var}}(Y) = 6\).
Calculate a t-statistic for testing the null hypothesis that \(\mathbb{E}[Y]=35\). Do you reject the null hypothesis here? Explain.
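The t-statistic is \[
t = \frac{\sqrt{n}(\bar{Y} - 35)}{\sqrt{\widehat{\mathrm{var}}(Y)}} = \frac{\sqrt{100}(37 - 35)}{\sqrt{6}} \approx 8.16
\] Since \(|t| > 1.96\), you reject the null hypothesis at the 5% significance level. In other words, if the null hypothesis were true, there would be less than a 5% chance of getting a t-statistic this large (in absolute value).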
The p-value is \(p = 2\Phi(-|t|) \approx 0\). This p-value indicates that, if the null hypothesis were true, it is virtually certain that we would not get a t-statistic as large in absolute value as the one we computed. In other words, we have very strong evidence against \(H_0\) here.
Calculate a 95% confidence interval for \(\mathbb{E}[Y]\). How do you interpret it?
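The 95% confidence interval is \[
\left[\bar{Y} - 1.96 \frac{\sqrt{\widehat{\mathrm{var}}(Y)}}{\sqrt{n}},\ \bar{Y} + 1.96 \frac{\sqrt{\widehat{\mathrm{var}}(Y)}}{\sqrt{n}}\right] = [36.52, 37.48]
\] There is a 95% chance that the interval \([36.52,37.48]\) contains the true value of \(\mathbb{E}[Y]\).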
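As a quick arithmetic check, the t-statistic, p-value, and confidence interval for this problem can be sketched in a few lines of Python. The variable names are illustrative, and the p-value uses the standard normal approximation (via `math.erfc`) rather than any particular statistics package.

```python
import math

# Numbers taken from the problem statement
n = 100            # sample size
y_bar = 37.0       # sample mean of Y
var_hat = 6.0      # estimated variance of Y
null_value = 35.0  # hypothesized value of E[Y]

# Standard error of the sample mean
se = math.sqrt(var_hat) / math.sqrt(n)

# t-statistic for H0: E[Y] = 35
t_stat = (y_bar - null_value) / se

# Two-sided p-value 2 * Phi(-|t|), written with erfc for numerical stability
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# 95% confidence interval using the normal critical value 1.96
ci_low = y_bar - 1.96 * se
ci_high = y_bar + 1.96 * se

print(f"t = {t_stat:.2f}, p = {p_value:.2g}")
print(f"95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```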
Consider the following conditional expectation using country-level data, where \(pcGDP\) is a country’s per capita GDP (in thousands of dollars), \(Inflation\) is the country’s current inflation rate, \(Europe\) is a binary variable indicating whether the country is located in Europe, and where \(Democracy\) is a binary variable indicating whether a country has democratic institutions.
the second equality holds because \(\mathrm{Var}(X)^2\) is non-random and can come out of the expectation,
the third equality uses the law of iterated expectations,
the fourth equality holds by the condition of homoskedasticity,
the fifth equality holds because \(\sigma^2\) is non-random and can come out of the expectation,
the sixth equality holds by the definition of variance, and
the last equality holds by canceling \(\mathrm{Var}(X)\) in the numerator with one of the \(\mathrm{Var}(X)\)’s in the denominator.
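For reference, one chain of equalities consistent with the steps described above is sketched below. It is a reconstruction, not a copy of the original display: the first line is an assumed starting expression for \(V\), and the course notes may write it differently. \[
\begin{aligned}
V &= \mathbb{E}\left[\frac{(X - \mathbb{E}[X])^2 U^2}{\mathrm{Var}(X)^2}\right] \\
&= \frac{\mathbb{E}\left[(X - \mathbb{E}[X])^2 U^2\right]}{\mathrm{Var}(X)^2} \\
&= \frac{\mathbb{E}\left[(X - \mathbb{E}[X])^2\, \mathbb{E}[U^2 | X]\right]}{\mathrm{Var}(X)^2} \\
&= \frac{\mathbb{E}\left[(X - \mathbb{E}[X])^2 \sigma^2\right]}{\mathrm{Var}(X)^2} \\
&= \frac{\sigma^2\, \mathbb{E}\left[(X - \mathbb{E}[X])^2\right]}{\mathrm{Var}(X)^2} \\
&= \frac{\sigma^2\, \mathrm{Var}(X)}{\mathrm{Var}(X)^2} \\
&= \frac{\sigma^2}{\mathrm{Var}(X)}
\end{aligned}
\]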
The main complication is estimating \(\sigma^2\). Under homoskedasticity, \(\mathbb{E}[U^2|X] = \sigma^2\) does not depend on \(X\), so by the law of iterated expectations \(\sigma^2 = \mathbb{E}[U^2]\). Thus, we can estimate \(\sigma^2\) by \[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \hat{U}_i^2
\] which is the sample average of the squared residuals. Therefore, the estimator for \(V\) is \[
\hat{V} = \frac{\hat{\sigma}^2}{\widehat{\mathrm{Var}}(X)}
\] where \(\widehat{\mathrm{Var}}(X)\) is the sample variance of \(X\).
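The estimator above can be sketched in plain Python as follows. The function name `homoskedastic_avar` and the toy data are hypothetical, and the sketch assumes the sample variance of \(X\) divides by \(n\) (to match the \(1/n\) in \(\hat{\sigma}^2\)); dividing by \(n-1\) instead would make little difference in large samples.

```python
def homoskedastic_avar(x, y):
    """Return (beta1_hat, V_hat) from regressing y on an intercept and x.

    V_hat = sigma2_hat / var_x, where sigma2_hat is the average squared
    residual. Assumption: variances and covariances divide by n.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sample variance of x and sample covariance of x and y
    var_x = sum((xi - x_bar) ** 2 for xi in x) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n

    # OLS slope and intercept
    beta1_hat = cov_xy / var_x
    beta0_hat = y_bar - beta1_hat * x_bar

    # Residuals and the average of their squares
    residuals = [yi - beta0_hat - beta1_hat * xi for xi, yi in zip(x, y)]
    sigma2_hat = sum(u ** 2 for u in residuals) / n

    return beta1_hat, sigma2_hat / var_x


# Made-up data, purely for illustration
x_toy = [0.0, 1.0, 2.0, 3.0]
y_toy = [0.0, 1.0, 0.0, 1.0]
beta1_toy, v_toy = homoskedastic_avar(x_toy, y_toy)
print(beta1_toy, v_toy)
```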
All of the remaining parts are exactly the same as in the more complicated case considered in the course notes, except that we use the estimator for \(V\) derived in part (b) instead of the more complicated one derived in the course notes. For calculating a t-statistic, we would compute \[
t = \frac{\sqrt{n}(\hat{\beta}_1-0)}{\sqrt{\hat{V}}}
\]
For computing a p-value, we would compute \[
p = 2 \Phi(-|t|)
\] where \(t\) is from part (c).
For computing a 95% confidence interval, we would compute \[
CI = \left[\hat{\beta}_1 - 1.96 \frac{\sqrt{\hat{V}}}{\sqrt{n}},\ \hat{\beta}_1 + 1.96 \frac{\sqrt{\hat{V}}}{\sqrt{n}}\right]
\]
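The three formulas above can be collected into one small helper, sketched here under the assumption that \(\hat{\beta}_1\), \(\hat{V}\), and \(n\) have already been computed. The function name `slope_inference` and the example inputs are hypothetical, and \(\hat{V}\) must be strictly positive for the t-statistic to be defined.

```python
import math


def slope_inference(beta1_hat, v_hat, n, null_value=0.0, crit=1.96):
    """Return (t, p, (ci_low, ci_high)) for the slope coefficient.

    Uses the standard normal approximation: p = 2 * Phi(-|t|).
    """
    # Standard error: sqrt(V_hat) / sqrt(n)
    se = math.sqrt(v_hat) / math.sqrt(n)

    # t-statistic for H0: beta1 = null_value
    t = (beta1_hat - null_value) / se

    # 2 * Phi(-|t|) equals erfc(|t| / sqrt(2))
    p = math.erfc(abs(t) / math.sqrt(2))

    # Confidence interval; crit = 1.96 gives 95% coverage
    ci = (beta1_hat - crit * se, beta1_hat + crit * se)
    return t, p, ci


# Made-up inputs, purely for illustration
t_ex, p_ex, ci_ex = slope_inference(beta1_hat=0.5, v_hat=4.0, n=100)
print(t_ex, p_ex, ci_ex)
```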