1. Suppose you work for a social media company. The company is trying to predict the number of clicks that different types of advertisements will get on its website. You run the following regression to try to predict the number of clicks that a particular advertisement will get: \[\begin{align*} Clicks = \beta_0 + \beta_1 FontSize + \beta_2 Picture + U \end{align*}\] where \(Clicks\) is the number of clicks that an ad gets (in thousands), \(FontSize\) is the size of the font of the ad, and \(Picture\) is a binary variable that is equal to one if the ad contains a picture and 0 otherwise.

    1. Suppose you estimate this model and estimate that \(\hat{\beta}_0 = 40\), \(\hat{\beta}_1 = 2\), and \(\hat{\beta}_2 = 80\). What would you predict that the number of clicks would be for an ad with 16 point font size and that contains a picture?
    Answer: \(40 + 2(16) + 80 = 152\), so you would predict 152,000 clicks on the ad.
    1. Your boss is very happy with your work, but suggests making the model more complicated. Your boss suggests you run the following regression

      \[\begin{align*} Clicks = \beta_0 &+ \beta_1 FontSize + \beta_2 Picture + \beta_3 Animated \\ &+ \beta_4 ColorfulFont + \beta_5 FontSize^2 + U \end{align*}\] (here \(Animated\) is a binary variable that is equal to one if the ad contains an animation and is equal to 0 otherwise; and \(ColorfulFont\) is a binary variable that is equal to 1 if the font in the ad is any color besides black and 0 otherwise). You estimate the model and notice that

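      |              | Model from part (a) | Model from part (b) |
      |--------------|---------------------|---------------------|
      | \(R^2\)      | 0.11                | 0.37                |
      | Adj. \(R^2\) | 0.10                | 0.35                |
      | AIC          | 6789                | 4999                |
      | BIC          | 6536                | 4876                |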

      Based on the table, which model do you prefer for predicting ad clicks?

      Answer: The table indicates that the model from part (b) is likely to predict better than the model from part (a). This holds since adjusted \(R^2\) is higher for the model from part (b) and because AIC and BIC are both lower for the model from part (b).
    2. An alternative approach to choosing between these two models is to use J-fold cross-validation. Explain how you could use J-fold cross-validation in this problem.

      Answer: In order to use J-fold cross-validation, randomly split the data into J folds (that is, groups). For each fold, do the following:

      1. Using all observations except the ones in the current fold, estimate each model. This step gives estimated values of the parameters.

      2. Using the estimated models in Step 1, make predictions for the outcome for each model in the current fold. For each model, record the prediction error \(\tilde{U}_i = Y_i - \tilde{Y}_i\) (which is the difference between the actual outcome and the predicted outcome for each observation in the current fold).

      Repeat these two steps for all J folds. This gives you a prediction error for every observation in the data. For each model compute \(CV = \displaystyle \frac{1}{n} \sum_{i=1}^n\tilde{U}_i^2\) which is the average squared prediction error across observations. Choose whichever model delivers a smaller value for \(CV\).
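The cross-validation steps above can be sketched in R as follows. This is only an illustration under stated assumptions: the data frame `ads` is simulated (the coefficients used to generate it are made up), and the helper `cv_criterion` is a hypothetical name, not a function from any package. Both models are evaluated on the same random fold assignment so that their \(CV\) values are comparable.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; not real ad data)
n <- 500
ads <- data.frame(
  FontSize     = sample(8:24, n, replace = TRUE),
  Picture      = rbinom(n, 1, 0.5),
  Animated     = rbinom(n, 1, 0.3),
  ColorfulFont = rbinom(n, 1, 0.4)
)
ads$Clicks <- 40 + 2 * ads$FontSize + 80 * ads$Picture + 30 * ads$Animated +
  10 * ads$ColorfulFont - 0.05 * ads$FontSize^2 + rnorm(n, sd = 20)

# The two candidate models from parts (a) and (b)
form_a <- Clicks ~ FontSize + Picture
form_b <- Clicks ~ FontSize + Picture + Animated + ColorfulFont + I(FontSize^2)

# Hypothetical helper: computes CV for one model given a fold assignment
cv_criterion <- function(formula, data, fold) {
  y <- data[[all.vars(formula)[1]]]   # outcome variable named in the formula
  u_tilde <- numeric(nrow(data))
  for (j in sort(unique(fold))) {
    in_fold <- fold == j
    # Step 1: estimate the model without the current fold
    fit <- lm(formula, data = data[!in_fold, ])
    # Step 2: predict for the current fold and record prediction errors
    y_tilde <- predict(fit, newdata = data[in_fold, ])
    u_tilde[in_fold] <- y[in_fold] - y_tilde
  }
  mean(u_tilde^2)                     # average squared prediction error
}

J <- 10
fold <- sample(rep(1:J, length.out = n))  # random split into J folds

cv_a <- cv_criterion(form_a, ads, fold)
cv_b <- cv_criterion(form_b, ads, fold)
c(model_a = cv_a, model_b = cv_b)     # prefer the model with the smaller CV
```

In this simulated example the outcome really does depend on the extra regressors, so the model from part (b) should deliver the smaller value of \(CV\); with real data either model could win.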
  2. Questions about causal inference.

    1. What does the condition \((Y(1), Y(0)) \perp D\) mean? When would you expect it to hold?

      Answer: This condition says that potential outcomes are independent of treatment status. In practice, it means that individuals that participate in the treatment do not have systematically different treated or untreated potential outcomes relative to those that do not participate in the treatment. It would hold in an experiment; that is, where the treatment is randomly assigned.
    2. What does the condition \((Y(1), Y(0)) \perp D | (X_1, X_2)\) mean? How is this different from the previous condition?

      Answer: This condition says that potential outcomes are independent of treatment status after conditioning on the variables \(X_1\) and \(X_2\). In practice, it means that individuals that participate in the treatment do not have systematically different treated or untreated potential outcomes relative to those that do not participate in the treatment and that have the same value of the covariates \(X_1\) and \(X_2\).

      Relative to the condition in part (a), it means that we are only willing to interpret average differences in outcomes between treated and untreated individuals with the same characteristics (in terms of \(X_1\) and \(X_2\)) as causal effects rather than simply compare differences in average outcomes between the treated and untreated groups and interpret these as causal effects. And, in practice, if we want to use regressions to estimate causal effects, we need to include \(X_1\) and \(X_2\) in the regression if we want to use this condition, while for part (a) we can just run a regression on \(D\) only.
    3. Suppose you are interested in the effect of a state policy that decreases the minimum legal drinking age from 21 to 18 on the number of traffic fatalities in a state. Do you think that the condition in part (a) is likely to hold here? Explain. What variables would you need to include for the condition in part (b) to hold? Explain.

      Answer: The condition in part (a) probably does not hold here, though this depends on how states choose to set their drinking age policies. For example, if states that lower the minimum drinking age tend to be more rural than other states (and, additionally, more rural states tend to have fewer traffic fatalities), then that would be a violation of the condition in part (a).

      There are a number of variables that one might need to include for the condition in part (b) to hold. Some that come to mind are: (i) the population density of a state, (ii) the highway speed limit in the state, and (iii) the age distribution of the population of the state. Other factors could matter but might be hard to measure; for example, some states may just tend to have more aggressive drivers than other states.
  3. Extra Questions 8.8. Suppose you are willing to believe versions of unconfoundedness, a linear model for untreated potential outcomes, and treatment effect homogeneity so that you could write \[\begin{align*} Y_i = \beta_0 + \alpha D_i + \beta_1 X_i + \beta_2 W_i + U_i \qquad (1) \end{align*}\] with \(\mathbb{E}[U|D,X,W] = 0\) so that you were willing to interpret \(\alpha\) in this regression as the causal effect of \(D\) on \(Y\). However, suppose that \(W\) is not observed so that you cannot operationalize the above regression.

    1. Since you do not observe \(W\), you are considering just running a regression of \(Y\) on \(D\) and \(X\) and interpreting the estimated coefficient on \(D\) as the causal effect of \(D\) on \(Y\). Does this seem like a good idea?
    Answer: No, if you just ignore \(W_i\), you will get omitted variable bias. The coefficient on \(D\) in this regression will not (generally) be equal to the causal effect \(\alpha\).
    1. In part (a), we can write a version of the model that you are thinking about estimating as \[\begin{align*} Y_i = \delta_0 + \delta_1 D_i + \delta_2 X_i + \epsilon_i \qquad (2) \end{align*}\] Suppose that \(\mathbb{E}[\epsilon | D, X] = 0\) and suppose also that \[\begin{equation} W_i = \gamma_0 + \gamma_1 D_i + \gamma_2 X_i + V_i \qquad (3) \end{equation}\] with \(\mathbb{E}[V|D,X]=0\). Provide an expression for \(\delta_1\) in terms of \(\alpha\), \(\gamma\)’s and \(\beta\)’s. Explain what this expression means.

    Answer: First, from Equation (2) and the condition that \(\mathbb{E}[\epsilon | D, X] = 0\), we know that

    \[ \mathbb{E}[Y|D,X] = \delta_0 + \delta_1 D + \delta_2 X \] so that \(\delta_1\) is just the coefficient on \(D\) from a regression of \(Y\) on \(D\) and \(X\) (and ignoring \(W\)).

    Now, let’s derive an alternative expression for \(\mathbb{E}[Y|D,X]\) using Equation (1) as the starting point. In particular, by just plugging in \(Y\) from Equation (1) into \(\mathbb{E}[Y|D,X]\), notice that:

    \[ \begin{aligned} \mathbb{E}[Y|D, X] &= \mathbb{E}[\beta_0 + \alpha D + \beta_1 X + \beta_2 W + U | D, X] \\ &= \beta_0 + \alpha D + \beta_1 X + \beta_2 \mathbb{E}[W|D,X] + \mathbb{E}[U|D,X] \\ &= \beta_0 + \alpha D + \beta_1 X + \beta_2 (\gamma_0 + \gamma_1 D + \gamma_2 X) \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0)}_{\delta_0} + \underbrace{(\alpha + \beta_2 \gamma_1)}_{\delta_1} D + \underbrace{(\beta_1 + \beta_2 \gamma_2)}_{\delta_2} X \end{aligned} \]

    where the first equality holds by plugging in \(Y\) from Equation (1), the second equality holds by properties of expectations (and since we are conditioning on \(D\) and \(X\)), the third equality holds from Equation (3) and because \(\mathbb{E}[U|D,X]=0\), and the fourth equality just rearranges terms. This is an alternative expression for a regression of \(Y\) on \(D\) and \(X\) in terms of the \(\beta\)’s and \(\gamma\)’s. And, most importantly, it implies that

    \[ \delta_1 = \alpha + \beta_2 \gamma_1 \]

    In words, this means that the coefficient on \(D\) in a regression of \(Y\) on \(D\) and \(X\) (that ignores \(W\)) will not be equal to \(\alpha\) unless \(\beta_2=0\) (this would occur if the partial effect of \(W\) on \(Y\) is equal to 0) or \(\gamma_1=0\) (this would occur if \(D\) and \(W\) are uncorrelated after controlling for \(X\)).
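The expression \(\delta_1 = \alpha + \beta_2 \gamma_1\) can be checked with a small simulation. The sketch below assumes made-up parameter values and a made-up rule for how \(D\) depends on \(X\); none of these come from the problem itself. It generates data from Equations (1) and (3), then compares the "long" regression (which includes \(W\) and is infeasible when \(W\) is unobserved) with the "short" regression from Equation (2).

```r
set.seed(42)
n <- 100000

# Hypothetical parameter values, chosen only for illustration
alpha  <- 2
beta0  <- 1; beta1  <- 0.5; beta2  <- 3
gamma0 <- 0; gamma1 <- 1.5; gamma2 <- -1

X <- rnorm(n)
D <- rbinom(n, 1, plogis(X))                               # treatment depends on X (assumed)
W <- gamma0 + gamma1 * D + gamma2 * X + rnorm(n)           # Equation (3)
Y <- beta0 + alpha * D + beta1 * X + beta2 * W + rnorm(n)  # Equation (1)

long_reg  <- lm(Y ~ D + X + W)  # infeasible in practice because W is unobserved
short_reg <- lm(Y ~ D + X)      # Equation (2): ignores W
aux_reg   <- lm(W ~ D + X)      # Equation (3)

alpha_hat  <- unname(coef(long_reg)["D"])
beta2_hat  <- unname(coef(long_reg)["W"])
gamma1_hat <- unname(coef(aux_reg)["D"])
delta1_hat <- unname(coef(short_reg)["D"])

alpha_hat                            # close to alpha = 2
delta1_hat                           # close to alpha + beta2 * gamma1 = 6.5
alpha_hat + beta2_hat * gamma1_hat   # matches delta1_hat
```

With these values, the short regression recovers roughly 6.5 rather than the causal effect of 2, and the gap is \(\beta_2 \gamma_1 = 4.5\). Setting `beta2 <- 0` or `gamma1 <- 0` in the sketch should make the two coefficients on \(D\) roughly agree.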



  1. Consider the following regression, where child_fincome is child’s family income, parent_fincome is parents’ family income, sex is a binary variable indicating whether a child is male, yearborn is the year that the child was born in, and education is the years of education of the child.

    load("../Detailed Course Notes/data/intergenerational_mobility.RData")
    
    reg2 <- lm(log(child_fincome) ~ log(parent_fincome) + sex + yearborn + education,
               data=intergenerational_mobility)
    summary(reg2)
    ## 
    ## Call:
    ## lm(formula = log(child_fincome) ~ log(parent_fincome) + sex + 
    ##     yearborn + education, data = intergenerational_mobility)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -3.11404 -0.32489  0.04514  0.36940  2.70867 
    ## 
    ## Coefficients:
    ##                       Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)         21.3037430  1.9719502  10.803  < 2e-16 ***
    ## log(parent_fincome)  0.5964735  0.0198679  30.022  < 2e-16 ***
    ## sex                  0.0318506  0.0194484   1.638 0.101572    
    ## yearborn            -0.0085957  0.0009896  -8.686  < 2e-16 ***
    ## education            0.0012618  0.0003437   3.672 0.000244 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.5834 on 3625 degrees of freedom
    ## Multiple R-squared:  0.2221, Adjusted R-squared:  0.2212 
    ## F-statistic: 258.8 on 4 and 3625 DF,  p-value: < 2.2e-16

    How do you interpret the coefficient on log(parent_fincome) in this model?

    Answer: If parents’ income increases by 1%, then, on average, child’s income increases by 0.596% holding sex, year born, and education constant.



  1. Consider the following regression using country-level data, where \(GDP\) is a country’s GDP, \(Inflation\) is the country’s current inflation rate, \(Europe\) is a binary variable indicating whether the country is located in Europe, and where \(Democracy\) is a binary variable indicating whether a country has democratic institutions.

    \[GDP = \beta_0 + \beta_1 Inflation + \beta_2 Inflation \cdot Europe + \beta_3 Inflation^2 + \beta_4 Democracy + U\]

    1. What is the partial effect of Inflation in this model?

      Answer:

      \[PE_{Inflation} = \beta_1 + \beta_2 Europe + 2 \beta_3 Inflation\]

    2. What is the average partial effect of Inflation in this model?

      Answer:

      \[APE_{Inflation} = \beta_1 + \beta_2 \mathbb{E}[Europe] + 2 \beta_3 \mathbb{E}[Inflation]\]

    3. Given relevant data, how would you estimate the average partial effect of Inflation?

      Answer:

      \[\widehat{APE}_{Inflation} = \hat{\beta}_1 + \hat{\beta}_2 \overline{Europe} + 2 \hat{\beta}_3 \overline{Inflation}\]

      where \(\hat{\beta}_1\), \(\hat{\beta}_2\), and \(\hat{\beta}_3\) come from estimating the regression in the problem; \(\overline{Europe}\) is the sample average of \(Europe\) in the data (in other words, it is just equal to the fraction of countries that are located in Europe); and \(\overline{Inflation}\) is the average inflation in the data.
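The estimator of the average partial effect above could be computed in R along the following lines. This is a sketch on simulated data: the data frame `countries` and the coefficients used to generate it are invented for illustration and do not describe real countries. The coefficients are pulled out by name, since `lm` may order the terms differently from the formula.

```r
set.seed(123)

# Simulated stand-in for country-level data (hypothetical values)
n <- 150
countries <- data.frame(
  Inflation = runif(n, 0, 15),
  Europe    = rbinom(n, 1, 0.25),
  Democracy = rbinom(n, 1, 0.6)
)
countries$GDP <- 50 - 1.0 * countries$Inflation +
  0.5 * countries$Inflation * countries$Europe +
  0.02 * countries$Inflation^2 + 5 * countries$Democracy + rnorm(n, sd = 3)

# The regression from the problem
fit <- lm(GDP ~ Inflation + Inflation:Europe + I(Inflation^2) + Democracy,
          data = countries)
b <- coef(fit)

b1 <- unname(b["Inflation"])         # beta_1 hat
b2 <- unname(b["Inflation:Europe"])  # beta_2 hat
b3 <- unname(b["I(Inflation^2)"])    # beta_3 hat

# Plug sample averages into the expression for the average partial effect
ape_hat <- b1 + b2 * mean(countries$Europe) + 2 * b3 * mean(countries$Inflation)

# Equivalent: average the estimated partial effect over all observations
pe_i <- b1 + b2 * countries$Europe + 2 * b3 * countries$Inflation
c(ape_hat = ape_hat, mean_pe = mean(pe_i))
```

The two numbers agree because the partial effect is linear in \(Europe\) and \(Inflation\), so averaging the partial effects is the same as evaluating the partial effect at the sample averages.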