1. Consider the following regression where airq is an indicator of air quality (lower is better) for a particular metropolitan area in California, dens1000 is the number of 1000s of people per square mile, coas indicates whether or not the metro area is on the coast, and medi1000 is the median income in the metro area (in thousands of dollars).

    data("Airq", package="Ecdat")
    library(modelsummary)
    Airq$coas <- 1*(Airq$coas=="yes")
    Airq$dens1000 <- Airq$dens/1000
    Airq$medi1000 <- Airq$medi/1000
    reg1 <- lm(airq ~ dens1000 + coas + dens1000*coas + medi1000, data=Airq)
    modelsummary(reg1, fmt=1, gof_omit=".")

                       Model 1
    (Intercept)        120.6
                       (9.5)
    dens1000           −0.3
                       (2.8)
    coas               −31.2
                       (11.3)
    medi1000           0.8
                       (0.4)
    dens1000 × coas    −1.2
                       (3.4)

    1. Which regressors are statistically significant in this regression?

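       The intercept and `coas` are statistically significant at the 5% level, since their t-statistics (estimate divided by standard error) exceed 1.96 in absolute value. `medi1000` is marginally statistically significant: using the rounded values in the table, its t-statistic is 0.8/0.4 = 2. Neither `dens1000` nor the interaction term is statistically significant.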
    2. What is the predicted value for the air quality index for a metro area with 1000 people per square mile, that is not located on the coast, and with median income equal to $50,000?

      The predicted value is given by:
      
      120.6 - 0.3(1) - 31.2(0) + 0.8(50) - 1.2(1)(0) = 160.3   
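
The arithmetic in both parts can be checked with a short R sketch. It retypes the rounded estimates and standard errors from the table above, so it does not need the Ecdat data; the object names `est`, `se`, and `x_new` are illustrative only.

```r
# Rounded estimates and standard errors copied from the regression table
est <- c(intercept = 120.6, dens1000 = -0.3, coas = -31.2,
         medi1000 = 0.8, dens1000_coas = -1.2)
se  <- c(intercept = 9.5, dens1000 = 2.8, coas = 11.3,
         medi1000 = 0.4, dens1000_coas = 3.4)

# Part 1: t-statistics; |t| > 1.96 indicates significance at the 5% level
t_stat <- est / se
round(t_stat, 2)

# Part 2: regressor values for 1000 people per square mile, not coastal,
# and median income of $50,000 (so medi1000 = 50)
x_new <- c(intercept = 1, dens1000 = 1, coas = 0,
           medi1000 = 50, dens1000_coas = 1 * 0)
pred <- sum(est * x_new)
pred
```

Because the inputs are rounded to one decimal place, the t-statistics are approximate; with the fitted object itself, `predict(reg1, newdata = ...)` would use the unrounded coefficients.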




  1. Consider the following regression, where child_fincome is the child’s family income, parent_fincome is the parents’ family income, sex is a binary variable indicating whether a child is male, yearborn is the year in which the child was born, and education is the child’s years of education.

    load("../Detailed Course Notes/data/intergenerational_mobility.RData")
    
    reg2 <- lm(log(child_fincome) ~ log(parent_fincome) + sex + yearborn + education,
               data=intergenerational_mobility)
    summary(reg2)
    ## 
    ## Call:
    ## lm(formula = log(child_fincome) ~ log(parent_fincome) + sex + 
    ##     yearborn + education, data = intergenerational_mobility)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -3.11404 -0.32489  0.04514  0.36940  2.70867 
    ## 
    ## Coefficients:
    ##                       Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)         21.3037430  1.9719502  10.803  < 2e-16 ***
    ## log(parent_fincome)  0.5964735  0.0198679  30.022  < 2e-16 ***
    ## sex                  0.0318506  0.0194484   1.638 0.101572    
    ## yearborn            -0.0085957  0.0009896  -8.686  < 2e-16 ***
    ## education            0.0012618  0.0003437   3.672 0.000244 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.5834 on 3625 degrees of freedom
    ## Multiple R-squared:  0.2221, Adjusted R-squared:  0.2212 
    ## F-statistic: 258.8 on 4 and 3625 DF,  p-value: < 2.2e-16

    How do you interpret the coefficient on log(parent_fincome) in this model?

     If parents' family income increases by 1%, then, on average, the child's family income increases by about 0.596%, holding sex, year born, and education constant. Because both variables enter in logs, this coefficient is an elasticity.
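
The 1% statement is a small-change approximation. As a rough illustration, the sketch below takes the estimated coefficient from the output above and computes the change implied by the log-log form for a hypothetical 10% increase in parents' family income; the 10% figure and the variable names are assumptions made for the example.

```r
b_parent   <- 0.5964735  # coefficient on log(parent_fincome) from summary(reg2)
pct_parent <- 10         # hypothetical percentage increase in parents' income

# Scaling parent income by (1 + p) adds b * log(1 + p) to predicted log child
# income, i.e. multiplies by (1 + p)^b, holding the other regressors fixed
pct_child <- 100 * ((1 + pct_parent / 100)^b_parent - 1)
pct_child                # roughly 5.85

# The usual approximation: elasticity times the percentage change
approx_child <- b_parent * pct_parent
approx_child
```

The two numbers are close for a 10% change and get closer as the change shrinks, which is why the coefficient is read directly as a percentage effect.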



  1. Let \(Y\) denote a person’s age in the United States. Suppose that you have the theory that \(\mathbb{E}[Y] = 35\). You are able to collect a random sample of 100 observations. Using this data, you calculate \(\bar{Y} = 37\) and \(\widehat{\mathrm{var}}(Y) = 6\).

    1. Calculate a t-statistic for testing the null hypothesis that \(\mathbb{E}[Y]=35\). Do you reject the null hypothesis here? Explain.

      \[ \begin{aligned} t &= \frac{\sqrt{n}(\bar{Y} - \mu_0)}{\sqrt{\widehat{\mathrm{var}}(Y)}} \\ &= \frac{10(37-35)}{\sqrt{6}} \\ &= \frac{20}{\sqrt{6}} \\ &= 8.16 \end{aligned} \]

      Since \(|t| > 1.96\), you reject the null hypothesis here. In other words, if the null hypothesis were true, there is less than a 5% chance that we would get a t-statistic this large (in absolute value).

    2. What is the standard error of \(\bar{Y}\)?

    \[ \begin{aligned} \textrm{s.e.}(\bar{Y}) &= \frac{\sqrt{\widehat{\mathrm{var}}(Y)}}{\sqrt{n}} \\ &= \frac{\sqrt{6}}{\sqrt{100}} \\ &= 0.245 \end{aligned} \]

    3. Calculate a p-value for the null hypothesis that \(\mathbb{E}[Y]=35\). How do you interpret it?

      \[ \begin{aligned} \textrm{p-value} &= 2 \Phi(-|t|) \\ &= 2 \Phi(-8.16) \\ &= 3\times 10^{-16} \approx 0 \end{aligned} \]

      This p-value indicates that, if the null hypothesis were true, the probability of getting a t-statistic at least as large in absolute value as the one we calculated would be essentially zero. In other words, we have very strong evidence against \(H_0\) here.

    4. Calculate a 95% confidence interval for \(\mathbb{E}[Y]\). How do you interpret it?

      \[ \begin{aligned} CI &= [\bar{Y} - 1.96 \textrm{s.e.}(\bar{Y}), \bar{Y} + 1.96 \textrm{s.e.}(\bar{Y})] \\ &= [37 - 1.96 \cdot 0.245, 37 + 1.96 \cdot 0.245] \\ &= [36.52, 37.48] \end{aligned} \]

      If we repeatedly drew new samples and constructed a confidence interval this way each time (the repeated sampling thought experiment), 95% of those intervals would contain the true value of \(\mathbb{E}[Y]\).
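
All four calculations in this problem can be reproduced from the summary statistics alone. The following is a minimal sketch using the numbers given in the problem; the variable names are illustrative, and the p-value uses the normal approximation as in the solution above.

```r
n     <- 100   # sample size
ybar  <- 37    # sample mean
mu0   <- 35    # value of E[Y] under the null hypothesis
var_y <- 6     # estimated variance of Y

se     <- sqrt(var_y) / sqrt(n)        # standard error of the sample mean
t_stat <- (ybar - mu0) / se            # equals sqrt(n) * (ybar - mu0) / sqrt(var_y)
p_val  <- 2 * pnorm(-abs(t_stat))      # two-sided p-value
ci     <- ybar + c(-1, 1) * 1.96 * se  # 95% confidence interval

round(c(se = se, t = t_stat), 3)
p_val
round(ci, 2)
```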



  1. Consider the following regression using country-level data, where \(GDP\) is a country’s GDP, \(Inflation\) is the country’s current inflation rate, \(Europe\) is a binary variable indicating whether the country is located in Europe, and where \(Democracy\) is a binary variable indicating whether a country has democratic institutions.

    \[GDP = \beta_0 + \beta_1 Inflation + \beta_2 Inflation \cdot Europe + \beta_3 Inflation^2 + \beta_4 Democracy + U\]

    1. What is the partial effect of Inflation in this model?

      \[PE_{Inflation} = \beta_1 + \beta_2 Europe + 2 \beta_3 Inflation\]

    2. What is the average partial effect of Inflation in this model?

      \[APE_{Inflation} = \beta_1 + \beta_2 \mathbb{E}[Europe] + 2 \beta_3 \mathbb{E}[Inflation]\]

    3. Given relevant data, how would you estimate the average partial effect of Inflation?

      \[\widehat{APE}_{Inflation} = \hat{\beta}_1 + \hat{\beta}_2 \overline{Europe} + 2 \hat{\beta}_3 \overline{Inflation}\]

      where \(\hat{\beta}_1\), \(\hat{\beta}_2\), and \(\hat{\beta}_3\) come from estimating the regression in the problem; \(\overline{Europe}\) is the sample average of \(Europe\) in the data (in other words, it is just equal to the fraction of countries that are located in Europe); and \(\overline{Inflation}\) is the average inflation in the data.
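
The plug-in estimator above might be computed along the following lines. This is only a sketch: the data are simulated, the data frame `dat` and the coefficient values used to generate `GDP` are hypothetical, and nothing here comes from real country data.

```r
set.seed(1)  # simulated data for illustration only
n <- 200
dat <- data.frame(
  Inflation = runif(n, 0, 10),
  Europe    = rbinom(n, 1, 0.3),
  Democracy = rbinom(n, 1, 0.6)
)
# Hypothetical data-generating process with the same form as the model
dat$GDP <- 50 - 2 * dat$Inflation + 1 * dat$Inflation * dat$Europe -
  0.1 * dat$Inflation^2 + 5 * dat$Democracy + rnorm(n)

# Inflation:Europe adds only the interaction (no separate Europe term),
# and I() keeps ^2 as arithmetic squaring inside the formula
fit <- lm(GDP ~ Inflation + Inflation:Europe + I(Inflation^2) + Democracy,
          data = dat)
b <- coef(fit)

# Plug sample averages into the partial-effect formula
ape_hat <- unname(b["Inflation"] +
  b["Inflation:Europe"] * mean(dat$Europe) +
  2 * b["I(Inflation^2)"] * mean(dat$Inflation))
ape_hat
```

Because the partial effect is linear in \(Europe\) and \(Inflation\), plugging in the sample averages gives the same number as computing the partial effect for each observation and then averaging.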