Consider the following regression, where `airq` is an indicator of air quality (lower is better) for a particular metropolitan area in California, `dens1000` is the number of 1000s of people per square mile, `coas` indicates whether or not the metro area is on the coast, and `medi1000` is the median income in the metro area (in thousands of dollars).

```r
data("Airq", package="Ecdat")
library(modelsummary)
Airq$coas <- 1*(Airq$coas=="yes")
Airq$dens1000 <- Airq$dens/1000
Airq$medi1000 <- Airq$medi/1000
reg1 <- lm(airq ~ dens1000 + coas + dens1000*coas + medi1000, data=Airq)
modelsummary(reg1, fmt=1, gof_omit=".")
```

|                 | Model 1 |
|-----------------|---------|
| (Intercept)     | 120.6   |
|                 | (9.5)   |
| dens1000        | −0.3    |
|                 | (2.8)   |
| coas            | −31.2   |
|                 | (11.3)  |
| medi1000        | 0.8     |
|                 | (0.4)   |
| dens1000 × coas | −1.2    |
|                 | (3.4)   |

Which regressors are statistically significant in this regression?

The intercept and `coas` are statistically significant. `medi1000` is marginally statistically significant (using the rounded values in the table, its t-statistic is 0.8/0.4 = 2); none of the other regressors are statistically significant.
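The significance check above can be reproduced by dividing each estimate by its standard error. The snippet below is a minimal sketch that retypes the rounded numbers from the table (so the t-statistics are only approximate) and compares them with the usual 5% critical value of 1.96; the vector names `est`, `se`, `t_stat`, and `significant` are illustrative choices, not part of the original code.

```r
# Estimates and standard errors copied by hand from the regression table
# (rounded to one decimal place, so the t-statistics are approximate)
est <- c("(Intercept)" = 120.6, dens1000 = -0.3, coas = -31.2,
         medi1000 = 0.8, "dens1000:coas" = -1.2)
se  <- c("(Intercept)" = 9.5, dens1000 = 2.8, coas = 11.3,
         medi1000 = 0.4, "dens1000:coas" = 3.4)

# t-statistic for the null hypothesis that each coefficient equals 0
t_stat <- est / se
round(t_stat, 2)

# Flag coefficients whose |t| exceeds the approximate 5% critical value
significant <- abs(t_stat) > 1.96
significant
```

With `reg1` in memory, `summary(reg1)` would report the same quantities from the unrounded estimates.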

What is the predicted value for the air quality index for a metro area with 1000 people per square mile, that is not located on the coast, and with median income equal to $50,000?

The predicted value is given by:

\[ 120.6 - 0.3(1) - 31.2(0) + 0.8(50) - 1.2(1)(0) = 160.3 \]
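As a quick arithmetic check, the same prediction can be computed in R from the rounded coefficients in the table. This is only a sketch: the names `b`, `x`, and `pred` are illustrative, and the result can differ slightly from what the fitted model would give with unrounded coefficients.

```r
# Rounded coefficients copied from the regression table
b <- c(intercept = 120.6, dens1000 = -0.3, coas = -31.2,
       medi1000 = 0.8, interaction = -1.2)

# Metro area described in the question
dens1000 <- 1    # 1000 people per square mile
coas     <- 0    # not on the coast
medi1000 <- 50   # median income of $50,000, in thousands

# Regressor vector in the same order as b (1 for the intercept)
x <- c(1, dens1000, coas, medi1000, dens1000 * coas)

pred <- sum(b * x)
pred
```

If `reg1` is available, `predict(reg1, newdata = data.frame(dens1000 = 1, coas = 0, medi1000 = 50))` should give the corresponding prediction from the unrounded estimates.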

Consider the following regression, where `child_fincome` is child's family income, `parent_fincome` is parents' family income, `sex` is a binary variable indicating whether a child is male, `yearborn` is the year that the child was born in, and `education` is the years of education of the child.

```r
load("../Detailed Course Notes/data/intergenerational_mobility.RData")
reg2 <- lm(log(child_fincome) ~ log(parent_fincome) + sex + yearborn + education,
           data=intergenerational_mobility)
summary(reg2)
```

```
## 
## Call:
## lm(formula = log(child_fincome) ~ log(parent_fincome) + sex + 
##     yearborn + education, data = intergenerational_mobility)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.11404 -0.32489  0.04514  0.36940  2.70867 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         21.3037430  1.9719502  10.803  < 2e-16 ***
## log(parent_fincome)  0.5964735  0.0198679  30.022  < 2e-16 ***
## sex                  0.0318506  0.0194484   1.638 0.101572    
## yearborn            -0.0085957  0.0009896  -8.686  < 2e-16 ***
## education            0.0012618  0.0003437   3.672 0.000244 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5834 on 3625 degrees of freedom
## Multiple R-squared:  0.2221, Adjusted R-squared:  0.2212 
## F-statistic: 258.8 on 4 and 3625 DF,  p-value: < 2.2e-16
```

How do you interpret the coefficient on `log(parent_fincome)` in this model?

If parents' income increases by 1%, then, on average, child's income increases by 0.596%, holding sex, year born, and education constant.
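The "1% leads to 0.596%" reading is the standard approximation for a log-log specification and works best for small changes. As a hedged illustration, the sketch below compares that approximation with the exact change implied by the model for a hypothetical 10% increase in parents' income; the 10% figure and the variable names are assumptions chosen for the example.

```r
# Coefficient on log(parent_fincome) from the regression output
b <- 0.5964735

# Hypothetical (illustrative) 10% increase in parents' income
pct_change <- 0.10

# Approximation: elasticity times the percentage change
approx_change <- b * pct_change

# Exact change implied by the log-log model: (1 + 0.10)^b - 1
exact_change <- (1 + pct_change)^b - 1

c(approximate = approx_change, exact = exact_change)
```

The two numbers are close, and the gap shrinks as the percentage change gets smaller.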

Let \(Y\) denote a person’s age in the United States. Suppose that you have the theory that \(\mathbb{E}[Y] = 35\). You are able to collect a random sample of 100 observations. Using this data, you calculate \(\bar{Y} = 37\) and that \(\widehat{\mathrm{var}}(Y) = 6\).

Calculate a t-statistic for testing the null hypothesis that \(\mathbb{E}[Y]=35\). Do you reject the null hypothesis here? Explain.

\[ \begin{aligned} t &= \frac{\sqrt{n}(\bar{Y} - \mu_0)}{\sqrt{\widehat{\mathrm{var}}(Y)}} \\ &= \frac{10(37-35)}{\sqrt{6}} \\ &= \frac{20}{\sqrt{6}} \\ &= 8.16 \end{aligned} \]

Since \(|t| > 1.96\), you reject the null hypothesis here. In other words, if the null hypothesis were true, there is less than a 5% chance that we would get a t-statistic this large (in absolute value).

What is the standard error of \(\bar{Y}\)?

\[ \begin{aligned} \textrm{s.e.}(\bar{Y}) &= \frac{\sqrt{\widehat{\mathrm{var}}(Y)}}{\sqrt{n}} \\ &= \frac{\sqrt{6}}{\sqrt{100}} \\ &= 0.245 \end{aligned} \]
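The standard error and the t-statistic from the earlier part can be checked numerically. This is a minimal sketch using only the summary numbers given in the problem; the variable names are illustrative. Note that the t-statistic is equivalently \((\bar{Y} - \mu_0)/\textrm{s.e.}(\bar{Y})\).

```r
# Summary statistics given in the problem
n       <- 100
ybar    <- 37
mu0     <- 35   # value of E[Y] under the null hypothesis
var_hat <- 6    # estimated variance of Y

# Standard error of the sample mean
se <- sqrt(var_hat) / sqrt(n)

# t-statistic: distance from the null value measured in standard errors
t_stat <- (ybar - mu0) / se

c(se = se, t = t_stat)
```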

Calculate a p-value for the null hypothesis that \(\mathbb{E}[Y]=35\). How do you interpret it?

\[ \begin{aligned} \textrm{p-value} &= 2 \Phi(-|t|) \\ &= 2 \Phi(-8.16) \\ &= 3\times 10^{-16} \approx 0 \end{aligned} \]

This p-value indicates that, if the null hypothesis were true, the probability of getting a t-statistic at least as large in absolute value as the one we got would be essentially zero. In other words, we have very strong evidence against \(H_0\) here.
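For reference, the two-sided p-value can be evaluated with R's standard normal CDF, `pnorm`. This is a small sketch that recomputes the t-statistic from the numbers in the problem rather than relying on the rounded value 8.16.

```r
# t-statistic recomputed from the problem's summary statistics
t_stat <- sqrt(100) * (37 - 35) / sqrt(6)

# Two-sided p-value: 2 * Phi(-|t|), where Phi is the standard normal CDF
p_value <- 2 * pnorm(-abs(t_stat))
p_value
```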

Calculate a 95% confidence interval for \(\mathbb{E}[Y]\). How do you interpret it?

\[ \begin{aligned} CI &= [\bar{Y} - 1.96 \textrm{s.e.}(\bar{Y}), \bar{Y} + 1.96 \textrm{s.e.}(\bar{Y})] \\ &= [37 - 1.96 \cdot 0.245, 37 + 1.96 \cdot 0.245] \\ &= [36.52, 37.48] \end{aligned} \]

If we repeatedly drew new samples and constructed a confidence interval this way each time (the repeated sampling thought experiment), 95% of those intervals would contain the true value of \(\mathbb{E}[Y]\).
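The interval and its repeated-sampling interpretation can both be illustrated in R. The first part below recomputes the interval from the problem's numbers. The second part is a simulation sketch under stated assumptions: it pretends that ages are normally distributed with mean 37 and variance 6, which is purely for illustration and is not something the problem specifies.

```r
# Part 1: confidence interval from the problem's summary statistics
ybar <- 37
se   <- sqrt(6) / sqrt(100)
ci   <- c(lower = ybar - 1.96 * se, upper = ybar + 1.96 * se)
round(ci, 2)

# Part 2: repeated-sampling thought experiment (simulated data)
# Assumed population, for illustration only: Normal(mean = 37, var = 6)
set.seed(123)
true_mean <- 37
true_sd   <- sqrt(6)
n         <- 100
n_sims    <- 2000

covers <- replicate(n_sims, {
  y      <- rnorm(n, mean = true_mean, sd = true_sd)  # draw a new sample
  se_sim <- sd(y) / sqrt(n)                           # its standard error
  lower  <- mean(y) - 1.96 * se_sim
  upper  <- mean(y) + 1.96 * se_sim
  lower <= true_mean && true_mean <= upper            # does the CI cover?
})

# Fraction of simulated intervals that contain the true mean
mean(covers)
```

The coverage fraction should land near 0.95, though the exact value depends on the seed and the number of simulations.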

Consider the following regression using country-level data, where \(GDP\) is a country’s GDP, \(Inflation\) is the country’s current inflation rate, \(Europe\) is a binary variable indicating whether the country is located in Europe, and where \(Democracy\) is a binary variable indicating whether a country has democratic institutions.

\[GDP = \beta_0 + \beta_1 Inflation + \beta_2 Inflation \cdot Europe + \beta_3 Inflation^2 + \beta_4 Democracy + U\]

What is the partial effect of Inflation in this model?

\[PE_{Inflation} = \beta_1 + \beta_2 Europe + 2 \beta_3 Inflation\]

What is the average partial effect of Inflation in this model?

\[APE_{Inflation} = \beta_1 + \beta_2 \mathbb{E}[Europe] + 2 \beta_3 \mathbb{E}[Inflation]\]

Given relevant data, how would you estimate the average partial effect of Inflation?

\[\widehat{APE}_{Inflation} = \hat{\beta}_1 + \hat{\beta}_2 \overline{Europe} + 2 \hat{\beta}_3 \overline{Inflation}\]

where \(\hat{\beta}_1\), \(\hat{\beta}_2\), and \(\hat{\beta}_3\) come from estimating the regression in the problem; \(\overline{Europe}\) is the sample average of \(Europe\) in the data (in other words, it is just equal to the fraction of countries that are located in Europe); and \(\overline{Inflation}\) is the average inflation in the data.
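The estimator above can be sketched in R as follows. No country-level data set is provided in the problem, so this example simulates one: the data frame `countries`, the sample size, and the coefficient values used to generate `GDP` are all hypothetical and only serve to make the code runnable.

```r
# Simulated (hypothetical) country-level data, for illustration only
set.seed(42)
n <- 500
countries <- data.frame(
  Inflation = runif(n, 0, 10),
  Europe    = rbinom(n, 1, 0.3),
  Democracy = rbinom(n, 1, 0.5)
)
# Made-up coefficients: beta1 = 2, beta2 = -1, beta3 = -0.1, beta4 = 3
countries$GDP <- with(countries,
  10 + 2 * Inflation - 1 * Inflation * Europe - 0.1 * Inflation^2 +
    3 * Democracy + rnorm(n))

# Estimate the regression from the problem
reg <- lm(GDP ~ Inflation + Inflation:Europe + I(Inflation^2) + Democracy,
          data = countries)
b <- coef(reg)

# Plug the estimates and sample averages into the APE formula
ape_hat <- b["Inflation"] +
  b["Inflation:Europe"] * mean(countries$Europe) +
  2 * b["I(Inflation^2)"] * mean(countries$Inflation)
unname(ape_hat)
```

Coefficients are pulled out by name rather than by position because `lm` places interaction terms after the main effects, so the order of `coef(reg)` does not match the order in the formula.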