1. Suppose you work for a social media company. The social media company is trying to predict the number of clicks that different types of advertisements will get on its website. You run the following regression to try to predict the number of clicks that a particular advertisement will get: \[\begin{align*} Clicks = \beta_0 + \beta_1 FontSize + \beta_2 Picture + U \end{align*}\] where \(Clicks\) is the number of clicks that an ad gets (in thousands), \(FontSize\) is the size of the font of the ad, and \(Picture\) is a binary variable that is equal to 1 if the ad contains a picture and 0 otherwise.

    1. Suppose you estimate this model and estimate that \(\hat{\beta}_0 = 40\), \(\hat{\beta}_1 = 2\), and \(\hat{\beta}_2 = 80\). What would you predict that the number of clicks would be for an ad with 16 point font size and that contains a picture?
    Answer: \(40 + 2(16) + 80 = 152\), so you would predict 152,000 clicks on the ad.
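As a quick check, the prediction can be computed directly from the estimated coefficients:

```python
# Plugging the estimates from the text into the fitted regression line
b0, b1, b2 = 40, 2, 80          # estimated coefficients
font_size, picture = 16, 1      # 16-point font, ad contains a picture
clicks = b0 + b1 * font_size + b2 * picture
print(clicks)   # 152, i.e. 152,000 clicks since Clicks is in thousands
```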
    1. Your boss is very happy with your work, but suggests making the model more complicated. Your boss suggests you run the following regression

      \[\begin{align*} Clicks = \beta_0 &+ \beta_1 FontSize + \beta_2 Picture + \beta_3 Animated \\ &+ \beta_4 ColorfulFont + \beta_5 FontSize^2 + U \end{align*}\] (here \(Animated\) is a binary variable that is equal to 1 if the ad contains an animation and 0 otherwise; and \(ColorfulFont\) is a binary variable that is equal to 1 if the font in the ad is any color besides black and 0 otherwise). You estimate the model and notice that

      |               | model from part (a) | model from part (b) |
      |---------------|---------------------|---------------------|
      | \(R^2\)       | 0.11                | 0.37                |
      | Adj. \(R^2\)  | 0.10                | 0.35                |
      | AIC           | 6789                | 4999                |
      | BIC           | 6536                | 4876                |

      Based on the table, which model do you prefer for predicting ad clicks?

      Answer: The table indicates that the model from part (b) is likely to predict better than the model from part (a). This holds since adjusted \(R^2\) is higher for the model from part (b) and because AIC and BIC are both lower for the model from part (b).
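For reference, these fit statistics can be computed from an OLS fit along the following lines. This is a sketch using one common AIC/BIC convention; statistical packages differ in additive constants, and the function name and simulated data below are purely illustrative:

```python
import numpy as np

def fit_metrics(y, X):
    """R^2, adjusted R^2, AIC, and BIC from an OLS fit.

    Uses the common n*log(SSR/n) + penalty*k form for AIC/BIC
    (k = number of estimated coefficients, including the intercept);
    other software may include different additive constants.
    """
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    aic = n * np.log(ssr / n) + 2 * k
    bic = n * np.log(ssr / n) + np.log(n) * k
    return r2, adj_r2, aic, bic

# quick check on simulated data
rng = np.random.default_rng(0)
x = rng.normal(size=100)
X = np.column_stack([np.ones(100), x])
y = 1 + 2 * x + rng.normal(size=100)
r2, adj_r2, aic, bic = fit_metrics(y, X)
```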
    2. An alternative approach to choosing between these two models is to use J-fold cross-validation. Explain how you could use J-fold cross validation in this problem.

      Answer: In order to use J-fold cross validation, randomly split the data into J folds (that is, groups). For each fold, do the following:

      1. Using all observations except the ones in the current fold, estimate each model. This step gives estimated values of the parameters.

      2. Using the estimated models in Step 1, make predictions for the outcome for each model in the current fold. For each model, record the prediction error \(\tilde{U}_i = Y_i - \tilde{Y}_i\) (which is the difference between the actual outcome and the predicted outcome for each observation in the current fold).

      Repeat these two steps for all J folds. This gives you a prediction error for every observation in the data. For each model, compute \(CV = \frac{1}{n} \sum_{i=1}^n\tilde{U}_i^2\), which is the average squared prediction error across observations. Choose whichever model delivers the smaller value of \(CV\).
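The procedure above can be sketched in code. The ad data below are simulated (the actual data set is not given), and only \(FontSize^2\) is added to the richer model since the other regressors are not in the simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated ad data (made-up numbers for illustration)
n = 500
font = rng.uniform(8, 24, n)
picture = rng.integers(0, 2, n).astype(float)
clicks = 40 + 2 * font + 80 * picture + rng.normal(0, 20, n)

def cv_error(X, y, J=5):
    """J-fold cross-validation: average squared prediction error."""
    n_obs = len(y)
    folds = rng.permutation(n_obs) % J   # random, balanced fold labels
    errs = np.empty(n_obs)
    for j in range(J):
        test = folds == j
        # Step 1: estimate the model on all observations outside fold j
        beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
        # Step 2: record prediction errors for the observations in fold j
        errs[test] = y[test] - X[test] @ beta
    return np.mean(errs ** 2)

X_a = np.column_stack([np.ones(n), font, picture])   # model from part (a)
X_b = np.column_stack([X_a, font ** 2])              # richer model
cv_a, cv_b = cv_error(X_a, clicks), cv_error(X_b, clicks)
print(cv_a, cv_b)   # prefer the model with the smaller CV
```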
  2. Questions about causal inference.

    1. What does the condition \((Y(1), Y(0)) \perp D\) mean? When would you expect it to hold?

      Answer: This condition says that potential outcomes are independent of treatment status. In practice, it means that individuals that participate in the treatment do not have systematically different treated or untreated potential outcomes relative to those that do not participate in the treatment. It would be expected to hold in an experiment, that is, when the treatment is randomly assigned.
    2. What does the condition \((Y(1), Y(0)) \perp D | (X_1, X_2)\) mean? How is this different from the previous condition?

      Answer: This condition says that potential outcomes are independent of treatment status after conditioning on the variables \(X_1\) and \(X_2\). In practice, it means that individuals that participate in the treatment do not have systematically different treated or untreated potential outcomes relative to those that do not participate in the treatment and that have the same value of the covariates \(X_1\) and \(X_2\).

      Relative to the condition in part (a), this condition means that we are only willing to interpret average differences in outcomes between treated and untreated individuals with the same characteristics (in terms of \(X_1\) and \(X_2\)) as causal effects, rather than simply comparing differences in average outcomes between the treated and untreated groups and interpreting those as causal effects. In practice, if we want to use regressions to estimate causal effects under this condition, we need to include \(X_1\) and \(X_2\) in the regression, while under the condition in part (a) we can just run a regression on \(D\) only.
    3. Suppose you are interested in the effect of a state policy that decreases the minimum legal drinking age from 21 to 18 on the number of traffic fatalities in a state. Do you think that the condition in part (a) is likely to hold here? Explain. What variables would you need to include for the condition in part (b) to hold? Explain.

      Answer: It is probably not reasonable to assume that the condition in part (a) holds here, though the answer likely depends on how states choose to set their drinking age policies. For example, if states that lower the minimum drinking age tend to be more rural than other states (and, additionally, more rural states tend to have fewer traffic fatalities), then that would be a violation of the condition in part (a).

      There are a number of variables that one might need to include for the condition in part (b) to hold. Some that come to mind are: (i) the population density of a state, (ii) the highway speed limit in the state, and (iii) the age distribution of the population of the state. Other things that might be hard to measure but could matter include, for example, that some states may just tend to have more aggressive drivers than others.
  3. Suppose you are interested in the causal effect of \(D\) on \(Y\). If you could estimate the following model, you would be willing to interpret \(\alpha\) as the causal effect of \(D\) on \(Y\) \[\begin{align*} Y_i = \beta_0 + \alpha D_i + \beta_1 W_i + U_i \end{align*}\] where \(\mathbb{E}[U|D,W]=0\). However, you do not observe \(W_i\).

    1. Since you do not observe \(W_i\), you are considering just running a regression of \(Y_i\) on \(D_i\). Will this strategy work? Explain.
    Answer: This strategy will not generally work. In particular, there will be omitted variable bias. That is, the coefficient on \(D_i\) in a regression of \(Y_i\) on \(D_i\) will not be equal to \(\alpha\), and therefore we should not interpret that coefficient as an estimate of the causal effect of \(D\) on \(Y\).
    1. Now suppose that you actually have access to panel data. Further, suppose that \(W\) does not vary over time, but that \(Y\) and \(D\) do vary over time. Therefore, you are considering the model \[\begin{align*} Y_{it} = \beta_0 + \alpha D_{it} + \beta_1 W_i + U_{it} \end{align*}\] Explain how you can use this setup to estimate the causal effect \(\alpha\) (be specific about exactly what regression you would run here).

    Answer: Notice that subtracting \(Y_{it-1}\) from \(Y_{it}\) implies that

    \[ \begin{aligned} Y_{it} &= \beta_0 + \alpha D_{it} + \beta_1 W_i + U_{it} \\ - Y_{it-1} &= \beta_0 + \alpha D_{it-1} + \beta_1 W_i + U_{it-1} \\[10pt] \implies \Delta Y_{it} &= \alpha \Delta D_{it} + \Delta U_{it} \end{aligned} \]

    The key thing here is that taking this difference over time gets rid of the term involving \(W_i\) which was the reason for the omitted variable bias in part (a). The above equation suggests running a regression of the change in \(Y\) over time on the change in \(D\) over time and interpreting the coefficient on \(\Delta D_{it}\) as the causal effect.
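A small simulation (with made-up parameter values) illustrates both the omitted variable bias from part (a) and how first-differencing removes it:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-period panel; alpha and beta1 are made-up values
n, alpha, beta1 = 2000, 1.5, 3.0
W = rng.normal(size=n)                 # unobserved, time-invariant
D1 = 0.8 * W + rng.normal(size=n)      # D is correlated with W
D2 = 0.8 * W + rng.normal(size=n)
Y1 = 2 + alpha * D1 + beta1 * W + rng.normal(size=n)
Y2 = 2 + alpha * D2 + beta1 * W + rng.normal(size=n)

# Cross-sectional regression of Y on D alone: omitted variable bias
b_ovb = np.polyfit(D1, Y1, 1)[0]

# First-difference regression of Delta Y on Delta D: W drops out
b_fd = np.polyfit(D2 - D1, Y2 - Y1, 1)[0]

print(b_ovb, b_fd)   # b_ovb is biased upward; b_fd is close to alpha = 1.5
```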
    1. Now, suppose that actually the effect of \(W\) varies over time, so that the model from part (b) becomes \[\begin{align*} Y_{it} = \beta_0 + \alpha D_{it} + \beta_{1t}W_i + U_{it} \end{align*}\] (note: what’s different here is that \(\beta_{1t}\) changes across time periods). Will your strategy from part (b) continue to work in this case? Explain.

    Answer: If we follow the same strategy as in part (b) where we subtracted \(Y_{it-1}\) from \(Y_{it}\), notice that

    \[ \begin{aligned} Y_{it} &= \beta_0 + \alpha D_{it} + \beta_{1,t} W_i + U_{it} \\ - Y_{it-1} &= \beta_0 + \alpha D_{it-1} + \beta_{1,t-1} W_i + U_{it-1} \\[10pt] \implies \Delta Y_{it} &= \alpha \Delta D_{it} + \Delta \beta_{1,t} W_i + \Delta U_{it} \end{aligned} \] where \(\Delta \beta_{1,t} = \beta_{1,t} - \beta_{1,t-1}\). Unlike part (b), this expression still involves \(W_i\). This means that we still cannot run this regression (because we do not observe \(W_i\)). This further implies that running a regression of \(\Delta Y_{it}\) on \(\Delta D_{it}\) will contain omitted variable bias.
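A variation of the same kind of simulation (again with made-up values) shows the resulting bias. Note that the bias in the first-difference regression shows up when \(\Delta D_{it}\) is correlated with \(W_i\), which the simulation builds in by also letting \(W_i\)'s influence on \(D_{it}\) change over time:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical panel where the effect of W on Y changes over time
# (and W's influence on D also changes, so Delta D is correlated with W)
n, alpha = 2000, 1.5
W = rng.normal(size=n)
D1 = 0.8 * W + rng.normal(size=n)
D2 = 1.6 * W + rng.normal(size=n)
Y1 = 2 + alpha * D1 + 2.0 * W + rng.normal(size=n)   # beta_{1,t-1} = 2
Y2 = 2 + alpha * D2 + 4.0 * W + rng.normal(size=n)   # beta_{1,t}   = 4

b_fd = np.polyfit(D2 - D1, Y2 - Y1, 1)[0]
print(b_fd)   # no longer close to alpha = 1.5
```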

    As a side comment, you can interpret \(\beta_{1,t}\) varying over time as meaning that the effect of \(W_i\) changes over time. A leading example would be if \(W_i\) were “ability” and you thought that the return to ability might be changing over time.
  4. Extra Questions 6.8. Suppose you are willing to believe versions of unconfoundedness, a linear model for untreated potential outcomes, and treatment effect homogeneity so that you could write \[\begin{align*} Y_i = \beta_0 + \alpha D_i + \beta_1 X_i + \beta_2 W_i + U_i \qquad (1) \end{align*}\] with \(\mathbb{E}[U|D,X,W] = 0\) so that you were willing to interpret \(\alpha\) in this regression as the causal effect of \(D\) on \(Y\). However, suppose that \(W\) is not observed so that you cannot operationalize the above regression.

    1. Since you do not observe \(W\), you are considering just running a regression of \(Y\) on \(D\) and \(X\) and interpreting the estimated coefficient on \(D\) as the causal effect of \(D\) on \(Y\). Does this seem like a good idea?
    Answer: No, if you just ignore \(W_i\), you will get omitted variable bias. The coefficient on \(D\) in this regression will not (generally) be equal to the causal effect \(\alpha\).
    1. In part (a), we can write a version of the model that you are thinking about estimating as \[\begin{align*} Y_i = \delta_0 + \delta_1 D_i + \delta_2 X_i + \epsilon_i \qquad (2) \end{align*}\] Suppose that \(\mathbb{E}[\epsilon | D, X] = 0\) and suppose also that \[\begin{equation} W_i = \gamma_0 + \gamma_1 D_i + \gamma_2 X_i + V_i \qquad (3) \end{equation}\] with \(\mathbb{E}[V|D,X]=0\). Provide an expression for \(\delta_1\) in terms of \(\alpha\), \(\gamma\)’s and \(\beta\)’s. Explain what this expression means.

    Answer: First, from Equation (2) and the condition that \(\mathbb{E}[\epsilon | D, X] = 0\), we know that

    \[ \mathbb{E}[Y|D,X] = \delta_0 + \delta_1 D + \delta_2 X \] so that \(\delta_1\) is just the coefficient on \(D\) from a regression of \(Y\) on \(D\) and \(X\) (and ignoring \(W\)).

    Now, let’s derive an alternative expression for \(\mathbb{E}[Y|D,X]\) using Equation (1) as the starting point. In particular, by just plugging in \(Y\) from Equation (1) into \(\mathbb{E}[Y|D,X]\), notice that:

    \[ \begin{aligned} \mathbb{E}[Y|D, X] &= \mathbb{E}[\beta_0 + \alpha D + \beta_1 X + \beta_2 W + U | D, X] \\ &= \beta_0 + \alpha D + \beta_1 X + \beta_2 \mathbb{E}[W|D,X] + \mathbb{E}[U|D,X] \\ &= \beta_0 + \alpha D + \beta_1 X + \beta_2 (\gamma_0 + \gamma_1 D + \gamma_2 X) \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0)}_{\delta_0} + \underbrace{(\alpha + \beta_2 \gamma_1)}_{\delta_1} D + \underbrace{(\beta_1 + \beta_2 \gamma_2)}_{\delta_2} X \end{aligned} \]

    where the first equality holds by plugging in \(Y\) from Equation (1), the second equality holds by properties of expectations (and since we are conditioning on \(D\) and \(X\)), the third equality holds from Equation (3) and because \(\mathbb{E}[U|D,X]=0\), and the fourth equality just rearranges terms. This is an alternative expression for a regression of \(Y\) on \(D\) and \(X\) in terms of the \(\beta\)’s and \(\gamma\)’s. And, most importantly, it implies that

    \[ \delta_1 = \alpha + \beta_2 \gamma_1 \]

    In words, this means that the coefficient on \(D\) in a regression of \(Y\) on \(D\) and \(X\) (that ignores \(W\)) will not be equal to \(\alpha\) unless \(\beta_2=0\) (this would occur if the partial effect of \(W\) on \(Y\) is equal to 0) or \(\gamma_1=0\) (this would occur if \(D\) and \(W\) are uncorrelated after controlling for \(X\)).
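This expression can be verified numerically with a quick simulation (all parameter values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up parameter values for Equations (1) and (3)
n = 100_000
alpha, beta1, beta2 = 1.0, 0.5, 2.0
gamma0, gamma1, gamma2 = 0.3, 0.7, -0.4

D = rng.normal(size=n)
X = rng.normal(size=n)
W = gamma0 + gamma1 * D + gamma2 * X + rng.normal(size=n)       # Eq. (3)
Y = 1 + alpha * D + beta1 * X + beta2 * W + rng.normal(size=n)  # Eq. (1)

# Regress Y on D and X only, omitting W
Z = np.column_stack([np.ones(n), D, X])
delta = np.linalg.lstsq(Z, Y, rcond=None)[0]

print(delta[1], alpha + beta2 * gamma1)   # both approximately 2.4
```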