Suppose you work for a social media company that is trying to predict the number of clicks that different types of advertisements will get on its website. You run the following regression to try to predict the number of clicks that a particular advertisement will get: \[\begin{align*}
Clicks = \beta_0 + \beta_1 FontSize + \beta_2 Picture + U
\end{align*}\] where \(Clicks\) is the number of clicks that an ad gets (in thousands), \(FontSize\) is the size of the font of the ad, and \(Picture\) is a binary variable that is equal to one if the ad contains a picture and 0 otherwise.
Part (a) Suppose you estimate this model and estimate that \(\hat{\beta}_0 = 40\), \(\hat{\beta}_1 = 2\), and \(\hat{\beta}_2 = 80\). What would you predict that the number of clicks would be for an ad with 16 point font size and that contains a picture?
Answer:\(40 + 2(16) + 80 = 152\), so you would predict 152,000 clicks on the ad.
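To double-check the arithmetic, here is the same calculation in R, plugging the estimates from part (a) directly into the fitted equation:

```r
# Prediction from the fitted model in part (a)
b0 <- 40; b1 <- 2; b2 <- 80
b0 + b1 * 16 + b2 * 1   # 16 point font, contains a picture -> 152 (thousand clicks)
```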
Part (b) Your boss is very happy with your work, but suggests making the model more complicated. Your boss suggests you run the following regression
\[\begin{align*}
Clicks = \beta_0 &+ \beta_1 FontSize + \beta_2 Picture + \beta_3 Animated \\ &+ \beta_4 ColorfulFont + \beta_5 FontSize^2 + U
\end{align*}\] (here \(Animated\) is a binary variable that is equal to 1 if the ad contains an animation and 0 otherwise, and \(ColorfulFont\) is a binary variable that is equal to 1 if the font in the ad is any color besides black and 0 otherwise). You estimate both models and obtain the following results:
|               | model from part (a) | model from part (b) |
|---------------|---------------------|---------------------|
| \(R^2\)       | 0.11                | 0.37                |
| Adj. \(R^2\)  | 0.10                | 0.35                |
| AIC           | 6789                | 4999                |
| BIC           | 6536                | 4876                |
Based on the table, which model do you prefer for predicting ad clicks?
Answer: The table indicates that the model from part (b) is likely to predict better than the model from part (a): adjusted \(R^2\) is higher for the model from part (b), and AIC and BIC are both lower for the model from part (b).
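As an illustration, these comparisons could be computed in R along the following lines; this is only a sketch, assuming a hypothetical data frame `ads` containing the variables from the two models. R's built-in `AIC()` and `BIC()` are computed from the log-likelihood rather than the class formulas, but for linear models fit to the same data they rank models the same way.

```r
# Hypothetical data frame `ads` with columns Clicks, FontSize, Picture, Animated, ColorfulFont
mod_a <- lm(Clicks ~ FontSize + Picture, data = ads)
mod_b <- lm(Clicks ~ FontSize + Picture + Animated + ColorfulFont + I(FontSize^2), data = ads)

summary(mod_a)$adj.r.squared; summary(mod_b)$adj.r.squared  # prefer the larger adjusted R^2
AIC(mod_a); AIC(mod_b)                                      # prefer the smaller AIC
BIC(mod_a); BIC(mod_b)                                      # prefer the smaller BIC
```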
Part (c) An alternative approach to choosing between these two models is to use J-fold cross-validation. Explain how you could use J-fold cross validation in this problem.
Answer: In order to use J-fold cross validation, randomly split the data into J folds (that is, groups). For each fold, do the following:
Using all observations except the ones in the current fold, estimate each model. This step gives estimated values of the parameters.
Using the estimated models in Step 1, make predictions for the outcome for each model in the current fold. For each model, record the prediction error \(\tilde{U}_i = Y_i - \tilde{Y}_i\) (which is the difference between the actual outcome and the predicted outcome for each observation in the current fold).
Repeat these two steps for all J folds. This gives you a prediction error for every observation in the data. For each model, compute \(CV = \displaystyle \frac{1}{n} \sum_{i=1}^n\tilde{U}_i^2\), which is the average squared prediction error across observations. Choose whichever model delivers the smaller value of \(CV\).
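Here is a minimal sketch of this procedure in R for the two models from parts (a) and (b), again assuming a hypothetical data frame `ads` with the relevant variables:

```r
set.seed(1234)
J <- 10
n <- nrow(ads)
fold <- sample(rep(1:J, length.out = n))  # randomly assign each observation to one of J folds

err_a <- err_b <- rep(NA, n)
for (j in 1:J) {
  train <- ads[fold != j, ]
  test  <- ads[fold == j, ]
  # Step 1: estimate each model using all observations outside the current fold
  mod_a <- lm(Clicks ~ FontSize + Picture, data = train)
  mod_b <- lm(Clicks ~ FontSize + Picture + Animated + ColorfulFont + I(FontSize^2), data = train)
  # Step 2: record the prediction errors for the observations in the current fold
  err_a[fold == j] <- test$Clicks - predict(mod_a, newdata = test)
  err_b[fold == j] <- test$Clicks - predict(mod_b, newdata = test)
}

mean(err_a^2)  # CV criterion for the model from part (a)
mean(err_b^2)  # CV criterion for the model from part (b); choose the model with the smaller value
```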
Question 2
Questions about causal inference.
Part (a) What does the condition \((Y(1), Y(0)) \perp \!\!\! \perp D\) mean? When would you expect it to hold?
Answer: This condition says that potential outcomes are independent of treatment status. In practice, it means that individuals that participate in the treatment do not have systematically different treated or untreated potential outcomes relative to those that do not participate in the treatment. It would be expected to hold in an experiment, where the treatment is randomly assigned.
Part (b) What does the condition \((Y(1), Y(0)) \perp \!\!\! \perp D | (X_1, X_2)\) mean? How is this different from the previous condition?
Answer: This condition says that potential outcomes are independent of treatment status after conditioning on the variables \(X_1\) and \(X_2\). In practice, it means that individuals that participate in the treatment do not have systematically different treated or untreated potential outcomes relative to those that do not participate in the treatment and that have the same value of the covariates \(X_1\) and \(X_2\).
Relative to the condition in part (a), it means that we are only willing to interpret differences in average outcomes between treated and untreated individuals with the same characteristics (in terms of \(X_1\) and \(X_2\)) as causal effects, rather than interpreting the raw difference in average outcomes between the treated and untreated groups as a causal effect. In practice, if we want to use a regression to estimate the causal effect under the condition in part (b), we need to include \(X_1\) and \(X_2\) in the regression, whereas under the condition in part (a) we can just run a regression on \(D\) alone; see the simulated example below.
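A small simulated example (purely illustrative, with made-up variable names) makes the practical difference concrete: when take-up of the treatment depends on \(X_1\) and \(X_2\), the regression on \(D\) alone is biased, while the regression that includes \(X_1\) and \(X_2\) recovers the causal effect.

```r
# Simulated illustration: the condition in part (b) holds but the condition in part (a) does not
set.seed(2)
n <- 1000
X1 <- rnorm(n); X2 <- rnorm(n)
D <- as.numeric(runif(n) < plogis(X1 + X2))  # treatment is more likely when X1 and X2 are large
Y <- 1 + 2 * D + X1 + X2 + rnorm(n)          # true causal effect of D is 2
coef(lm(Y ~ D))["D"]             # biased upward: D alone picks up part of the effect of X1 and X2
coef(lm(Y ~ D + X1 + X2))["D"]   # close to 2 once X1 and X2 are included
```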
Part (c) Suppose you are interested in the effect of a state policy that decreases the minimum legal drinking age from 21 to 18 on the number of traffic fatalities in a state. Do you think that the condition in part (a) is likely to hold here? Explain. What variables would you need to include for the condition in part (b) to hold? Explain.
Answer: It is probably not reasonable to assume that the condition in part (a) is likely to hold here though it likely depends on how states choose to set their drinking age policies. For example, if states that lower the minimum drinking age tend to be more rural than other states (and, additionally, more rural states tend to have fewer traffic fatalities), then that would be a violation of the condition in part (a).
There are a number of variables that one might need to include for the condition in part (b) to hold. Some that come to mind are: (i) the population density of a state, (ii) the highway speed limit in the state, (iii) the age distribution of the population of the state; other things that might be hard to measure but could matter are things like some states may just tend to have more aggressive drivers than other states.
Question 3
Extra Question 14.1
Answer: \(R^2\) is a measure of the in-sample fit of a regression. First, if we are interested in choosing a model that will predict well out-of-sample, ranking different models by in-sample fit may not be appropriate. Second, \(R^2\) never decreases as regressors are added, so more complicated models always have (weakly) higher \(R^2\) than the simpler models they nest. This can lead to overfitting and poor out-of-sample predictions; the short simulation below illustrates this.
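A quick simulation (illustrative only) shows the second point: adding a regressor that is pure noise never lowers \(R^2\), while adjusted \(R^2\) penalizes the extra term.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
noise <- rnorm(n)  # unrelated to y by construction
summary(lm(y ~ x))$r.squared
summary(lm(y ~ x + noise))$r.squared      # (weakly) larger, even though noise is useless
summary(lm(y ~ x + noise))$adj.r.squared  # adjusted R^2 can go down
```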
Question 4
Extra Question 14.2
Answer: For AIC and BIC, the “penalty”/“cost” terms tend to increase these quantities while the “benefit” of adding a regressor comes from decreasing \(SSR\) (and, hence, decreasing the value of AIC and/or BIC). This means that models that do well according to these criteria will have low values of AIC/BIC and models that do poorly will have high values of AIC/BIC; therefore, we choose the model that minimizes AIC/BIC.
Question 5
Extra Question 14.3
Answer: Recall that \[
\begin{aligned}
AIC &= 2k + n\log(SSR) \\
BIC &= k\log(n) + n\log(SSR)
\end{aligned}
\] so that the only difference between \(AIC\) and \(BIC\) is in their penalty terms: \(2k\) and \(k\log(n)\), respectively. Then, as long as \(n\geq 8\) (which would presumably always be the case when you are doing model selection), the penalty for adding another regressor is larger for \(BIC\) than \(AIC\). This means that \(BIC\) will tend to choose “less complicated” models and \(AIC\) will tend to choose “more complicated” models.
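A short helper function using these class definitions of AIC and BIC for a fitted `lm` object (note that R's built-in `AIC()`/`BIC()` use the log-likelihood instead):

```r
# AIC and BIC using the definitions above, for a fitted lm object
ic <- function(mod) {
  ssr <- sum(resid(mod)^2)
  k   <- length(coef(mod))
  n   <- length(resid(mod))
  c(AIC = 2 * k + n * log(ssr), BIC = k * log(n) + n * log(ssr))
}
log(8)  # about 2.08 > 2, so the per-regressor BIC penalty exceeds the AIC penalty once n >= 8
```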
Question 6
Extra Question 14.7
Answer:
Part (a) The tuning parameter is often chosen via cross validation. It makes sense to choose it this way because this is effectively choosing a value of \(\lambda\) that is making good pseudo-out-of-sample predictions. As we will see below, if you make bad choices of \(\lambda\), that could result in very poor predictions.
Part (b) When \(\lambda=0\), there would effectively be no penalty term and, therefore, the estimated parameters would coincide with the OLS estimates.
Part (c) When \(\lambda \rightarrow \infty\), the penalty term would overwhelm the term corresponding to minimizing SSR. This would result in setting all the estimated parameters to be equal to 0. This extreme approach is likely to lead to very poor predictions.
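As a sketch of part (a) in practice, cross-validated choice of \(\lambda\) can be done with the `glmnet` package (shown here for the lasso; ridge works the same way with `alpha = 0`). The data frame `ads` from Question 1 is hypothetical and used only for illustration.

```r
library(glmnet)
# Regressor matrix (dropping the intercept column) and outcome
X <- model.matrix(Clicks ~ FontSize + Picture + Animated + ColorfulFont + I(FontSize^2),
                  data = ads)[, -1]
y <- ads$Clicks
cv_fit <- cv.glmnet(X, y, alpha = 1)  # 10-fold cross-validation over a grid of lambda values
cv_fit$lambda.min                     # lambda with the smallest cross-validated prediction error
coef(cv_fit, s = "lambda.min")        # estimated coefficients at that lambda
```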
Question 7
For this problem, we will be interested in the causal effect of having children on women’s labor supply. We will use data from Angrist and Evans (1998) that contains information about the labor supply (particularly, the number of hours worked per week) of married women with at least two children. We will be interested in whether having more than two children decreases labor supply relative to having exactly two children.
Part (a) Consider the following regression of the number of hours that a woman typically works per week (work) on whether or not she has more than two children (morekids), her age and age^2, and race/ethnicity (afam and hispanic). How do you interpret the coefficient on morekids below (in general)? Would it be reasonable to interpret this coefficient as the causal effect of having more than two children? In particular, what conditions would need to be satisfied for this to be an estimate of the causal effect, and discuss whether or not these conditions seem likely to hold in this context.
Call:
lm(formula = work ~ morekids + age + I(age^2) + afam + hispanic,
data = Fertility)
Residuals:
Min 1Q Median 3Q Max
-37.38 -17.86 -10.80 22.92 45.42
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.2590433 2.8942377 -1.817 0.0692 .
morekidsyes -6.2192495 0.0881474 -70.555 < 2e-16 ***
age 0.8725248 0.1983323 4.399 1.09e-05 ***
I(age^2) -0.0006059 0.0033660 -0.180 0.8572
afamyes 11.5853282 0.1920706 60.318 < 2e-16 ***
hispanicyes 1.2590698 0.1629666 7.726 1.11e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.39 on 254648 degrees of freedom
Multiple R-squared: 0.04334, Adjusted R-squared: 0.04333
F-statistic: 2308 on 5 and 254648 DF, p-value: < 2.2e-16
Answer: The estimated coefficient on morekids indicates that women with more than two children work, on average, a little over 6 hours per week less than women with exactly two children after controlling for age and race/ethnicity. It is probably not reasonable to interpret this coefficient as the causal effect of having more than two children because, for example, women with more children may be different in other ways (e.g., they may have different preferences for work or have spouses with higher incomes than women with exactly two kids) that could affect their labor supply. In order to interpret this coefficient as a causal effect, we would need to assume that, after controlling for age and race/ethnicity, women with more than two kids are not systematically different in terms of other determinants of labor supply (such as the ones mentioned above) from women with exactly two kids.
Part (b) One possible instrument in this setup is whether or not the first two children have the same sex or different sexes (i.e., whether they are both girls or both boys versus a boy and a girl). What are the two key conditions that a variable needs to satisfy for it to be a valid instrument? Do you think that, if we define \(Z\) as the variable that is equal to 1 if the first two children have the same sex, that \(Z\) satisfies these conditions?
Answer: The two key conditions that a variable needs to satisfy to be a valid instrument are the exclusion restriction and instrument relevance. The exclusion restriction says that the instrument, \(Z\), is uncorrelated with the error term, \(V\); i.e., that \(\mathrm{cov}(Z,V) = 0\). More specifically, this means that \(Z\) is uncorrelated with other variables that are (i) not observed (or otherwise not included in the model) and (ii) that are correlated with the outcome. In the context of this problem, this would mean that whether or not the first two kids have the same sex should be independent of, e.g., preferences for work or spouse’s income. Since the sex of the first two children is effectively randomly assigned (at least I am pretty sure this is the case), this condition seems likely to hold. The instrument relevance condition says that the instrument, \(Z\), is correlated with the treatment variable, \(D\); i.e., that \(\mathrm{cov}(Z,D) \neq 0\). In this context, this would mean that having two kids of the same sex is correlated with having more than two kids. Unlike the exclusion restriction, this is something we can check in the data, and it does turn out that having two kids of the same sex is predictive of having additional kids relative to having two kids of different sexes; I think this is often interpreted as some parents having a preference for having at least one boy and one girl.
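The relevance condition can be checked directly with a first-stage regression of the treatment on the instrument and the other exogenous regressors. This sketch assumes the same `Fertility` data frame used below, with `samesex` already constructed as an indicator that the first two children have the same sex.

```r
# First stage: does samesex predict having more than two kids?
first_stage <- lm(as.numeric(morekids == "yes") ~ samesex + age + I(age^2) + afam + hispanic,
                  data = Fertility)
summary(first_stage)  # a large t-statistic on samesex indicates a relevant (non-weak) instrument
```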
Part (c) For this part the variable samesex is a binary variable that is equal to 1 if the first two children have the same sex and 0 otherwise. Consider the following instrumental variables regression. How do you interpret the coefficient on morekids below?
Call:
ivreg(formula = work ~ morekids + age + I(age^2) + afam + hispanic |
samesex + age + I(age^2) + afam + hispanic, data = Fertility)
Residuals:
Min 1Q Median 3Q Max
-37.11 -17.74 -10.90 22.75 45.13
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.3476082 2.9088030 -1.838 0.066 .
morekidsyes -5.8386983 1.2478009 -4.679 2.88e-06 ***
age 0.8754998 0.1985781 4.409 1.04e-05 ***
I(age^2) -0.0007558 0.0034017 -0.222 0.824
afamyes 11.5476740 0.2281698 50.610 < 2e-16 ***
hispanicyes 1.1978427 0.2581922 4.639 3.50e-06 ***
Diagnostic tests:
df1 df2 statistic p-value
Weak instruments 1 254648 1277.241 <2e-16 ***
Wu-Hausman 1 254647 0.093 0.76
Sargan 0 NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 21.39 on 254648 degrees of freedom
Multiple R-Squared: 0.04327, Adjusted R-squared: 0.04326
Wald test: 1316 on 5 and 254648 DF, p-value: < 2.2e-16
Answer: If you believe that samesex is a valid instrument, then the results from this regression indicate that having more than two children causes women’s labor supply to decrease by just under 6 hours per week, on average.
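For intuition about where this estimate comes from, `ivreg` is numerically equivalent to two-stage least squares, which can be sketched by hand as follows. The second-stage standard errors below are not valid (`ivreg` corrects them), but the point estimate on the fitted treatment matches the morekidsyes coefficient above up to rounding.

```r
# First stage: predict the treatment using the instrument and the exogenous regressors
stage1 <- lm(as.numeric(morekids == "yes") ~ samesex + age + I(age^2) + afam + hispanic,
             data = Fertility)
Fertility$morekids_hat <- fitted(stage1)
# Second stage: replace the treatment with its first-stage fitted values
stage2 <- lm(work ~ morekids_hat + age + I(age^2) + afam + hispanic, data = Fertility)
coef(stage2)["morekids_hat"]  # approximately -5.84, as in the ivreg output
```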
Question 8
For this question, we will consider whether or not there is an incumbency advantage in elections (an incumbent is a candidate who is currently holding the office that they are running for). The outcome for this problem is the share of the vote received by the Democratic candidate, voteshare, in a large number of elections for the House of Representatives (voteshare takes values from 0 to 100).
Part (a) Consider the following regression of voteshare on whether or not the Democratic candidate in an election is an incumbent, incumbent. How do you interpret the coefficient on incumbent below? Would it be reasonable to interpret this coefficient as the causal effect of being an incumbent?
library(RDHonest)
data("lee08", package = "RDHonest")
# margin is the Democratic margin in the previous election
lee08$incumbent <- 1 * (lee08$margin > 0)
incumbent_reg <- lm(voteshare ~ incumbent, data = lee08)
summary(incumbent_reg)
Call:
lm(formula = voteshare ~ incumbent, data = lee08)
Residuals:
Min 1Q Median 3Q Max
-69.788 -10.061 -0.356 9.632 65.348
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.6522 0.3201 108.25 <2e-16 ***
incumbent 35.1359 0.4195 83.75 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.76 on 6556 degrees of freedom
Multiple R-squared: 0.5169, Adjusted R-squared: 0.5168
F-statistic: 7014 on 1 and 6556 DF, p-value: < 2.2e-16
Answer: The estimated coefficient on incumbent indicates that Democratic incumbents receive, on average, 35 more percentage points of the vote than non-incumbents. This is a huge difference; however, it is highly unlikely to be the causal effect of incumbency. In particular, places that are firmly Democratic are probably much more likely than places that are firmly Republican to have a Democratic incumbent (and these places are also likely to have a high Democratic vote share). Since we are not controlling for any kind of propensity to be Democratic, this would mean that the regression would fail to deliver the causal effect of being an incumbent.
Part (b) Describe how you could use a regression discontinuity approach to estimate the causal effect of being an incumbent on the vote share received by the Democratic candidate.
Answer: A natural idea is to zoom in on elections where the previous election was close. In particular, districts where a Democrat barely won or barely lost should be similar in terms of their observed and unobserved characteristics (in particular, their propensity to vote for a Democratic candidate). Then, differences in the outcomes of these elections between districts with and without a Democratic incumbent should give us a strong indication of the causal effect of being an incumbent on vote share.
Part (c) To implement our regression discontinuity approach, we need to decide which observations to include based on the Democratic candidate’s margin of victory in the previous election (the running variable). The graph below shows the average voteshare (the outcome) as a function of margin. What range of the margin would you include in the analysis, and why?
Call: binsreg
Binscatter Plot
Bin/Degree selection method (binsmethod) = User-specified
Placement (binspos) = Quantile-spaced
Derivative (deriv) = 0
Group (by) = Full Sample
Sample size (n) = 6558
# of distinct values (Ndist) = 5815
# of clusters (Nclust) = NA
dots, degree (p) = 0
dots, smoothness (s) = 0
# of bins (nbins) = 93
Answer: As we discussed in class, a good rule of thumb for deciding which elections to include in the analysis is to start with ones close to the threshold and stop once the relationship between the outcome and the running variable on either side of the threshold is no longer linear. To me, a reasonable choice in this application would be to include elections where the margin is between -40 and 40.
Part (d) Consider the following regression. Explain why this code implements the regression discontinuity approach that we discussed above. How do you interpret the results?
Call:
lm(formula = voteshare ~ incumbent + margin + incumbent:margin,
data = subset(lee08, abs(margin) < 40))
Residuals:
Min 1Q Median 3Q Max
-68.505 -5.571 -0.436 5.100 62.491
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.98644 0.52333 85.961 <2e-16 ***
incumbent 8.86297 0.72719 12.188 <2e-16 ***
margin 0.35410 0.02418 14.642 <2e-16 ***
incumbent:margin 0.04018 0.03319 1.211 0.226
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.22 on 4165 degrees of freedom
Multiple R-squared: 0.5001, Adjusted R-squared: 0.4997
F-statistic: 1389 on 3 and 4165 DF, p-value: < 2.2e-16
Answer: The results indicate that being an incumbent causes the vote share to increase by almost 9 percentage points on average (and for elections that are close to the threshold). This is a much different estimate from what we reported in part (a); however, it still indicates that there is a large incumbency advantage in elections.
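To see where this number comes from, evaluate the fitted regression at the threshold (\(margin = 0\)) on each side: \[\begin{align*}
\widehat{voteshare} \,|\, (margin = 0,\ \text{not incumbent}) &\approx 44.99 \\
\widehat{voteshare} \,|\, (margin = 0,\ \text{incumbent}) &\approx 44.99 + 8.86 = 53.85
\end{align*}\] so the estimated discontinuity at the threshold is the coefficient on incumbent, about 8.9 percentage points (the margin and interaction terms drop out at \(margin = 0\)).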