Suppose you work for a social media company that is trying to predict the number of clicks that different types of advertisements will get on its website. You run the following regression to try to predict the number of clicks that a particular advertisement will get: \[\begin{align*}
Clicks = \beta_0 + \beta_1 FontSize + \beta_2 Picture + U
\end{align*}\] where \(Clicks\) is the number of clicks that an ad gets (in thousands), \(FontSize\) is the size of the font of the ad, and \(Picture\) is a binary variable that is equal to one if the ad contains a picture and 0 otherwise.
Part (a) Suppose you estimate this model and obtain \(\hat{\beta}_0 = 40\), \(\hat{\beta}_1 = 2\), and \(\hat{\beta}_2 = 80\). What number of clicks would you predict for an ad that uses a 16 point font and contains a picture?
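A fitted regression is used for prediction by plugging values of the regressors into the estimated equation. The following is a minimal sketch in R; `predict_clicks` is a hypothetical helper (it is not part of the course materials), and its default arguments are simply the estimates stated in part (a).

```r
# Hypothetical helper: plug regressor values into the fitted equation
#   Clicks-hat = b0 + b1 * FontSize + b2 * Picture
# Clicks are measured in thousands, as in the problem statement.
predict_clicks <- function(font_size, picture, b0 = 40, b1 = 2, b2 = 80) {
  b0 + b1 * font_size + b2 * picture
}

# Ad with a 16 point font that contains a picture (Picture = 1)
predicted <- predict_clicks(font_size = 16, picture = 1)
print(predicted)
```

The same helper can be called with `picture = 0` to see how much of the prediction is due to the picture indicator.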
Part (b) Your boss is very happy with your work, but suggests making the model more complicated. Your boss suggests you run the following regression
\[\begin{align*}
Clicks = \beta_0 &+ \beta_1 FontSize + \beta_2 Picture + \beta_3 Animated \\ &+ \beta_4 ColorfulFont + \beta_5 FontSize^2 + U
\end{align*}\] (here \(Animated\) is a binary variable that is equal to one if the ad contains an animation and is equal to 0 otherwise; and \(ColorfulFont\) is a binary variable that is equal to 1 if the font in the ad is any color besides black and 0 otherwise). You estimate the model and notice that
| | model from part (a) | model from part (b) |
|---|---|---|
| \(R^2\) | 0.11 | 0.37 |
| Adj. \(R^2\) | 0.10 | 0.35 |
| AIC | 6789 | 4999 |
| BIC | 6536 | 4876 |
Based on the table, which model do you prefer for predicting ad clicks?
Part (c) An alternative approach to choosing between these two models is to use J-fold cross-validation. Explain how you could use J-fold cross-validation in this problem.
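To make the mechanics of J-fold cross-validation concrete, here is a rough sketch in base R. Everything about the data is an assumption: the data frame `ads` is simulated, the coefficients used to generate it are arbitrary, and `cv_mse` is a hypothetical helper. Both candidate models are fit to the same outcome, `Clicks`, so that their out-of-sample prediction errors are comparable.

```r
set.seed(123)

# Simulated stand-in data (NOT real ad data); the data-generating
# coefficients below are arbitrary assumptions.
n <- 500
ads <- data.frame(
  FontSize     = sample(8:24, n, replace = TRUE),
  Picture      = rbinom(n, 1, 0.5),
  Animated     = rbinom(n, 1, 0.3),
  ColorfulFont = rbinom(n, 1, 0.4)
)
ads$Clicks <- 40 + 2 * ads$FontSize + 80 * ads$Picture +
  30 * ads$Animated + 10 * ads$ColorfulFont + rnorm(n, sd = 25)

# The two candidate models
form_a <- Clicks ~ FontSize + Picture
form_b <- Clicks ~ FontSize + Picture + Animated + ColorfulFont + I(FontSize^2)

# Randomly assign each observation to one of J folds (shared by both models)
J <- 10
fold <- sample(rep(seq_len(J), length.out = n))

# Hypothetical helper: mean squared out-of-sample prediction error
cv_mse <- function(formula, data, fold) {
  sq_err <- numeric(nrow(data))
  for (j in sort(unique(fold))) {
    train <- data[fold != j, ]            # estimate on all folds except j
    test  <- data[fold == j, ]            # predict for the held-out fold j
    fit   <- lm(formula, data = train)
    sq_err[fold == j] <- (test$Clicks - predict(fit, newdata = test))^2
  }
  mean(sq_err)
}

cv_a <- cv_mse(form_a, ads, fold)
cv_b <- cv_mse(form_b, ads, fold)
print(c(model_a = cv_a, model_b = cv_b))
```

The model with the smaller cross-validated mean squared error would be preferred for prediction. Which model wins in this sketch reflects only how the simulated data were generated.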
Question 2
Questions about causal inference.
Part (a) What does the condition \((Y(1), Y(0)) \perp \!\!\! \perp D\) mean? When would you expect it to hold?
Part (b) What does the condition \((Y(1), Y(0)) \perp \!\!\! \perp D | (X_1, X_2)\) mean? How is this different from the previous condition?
Part (c) Suppose you are interested in the effect of a state policy that decreases the minimum legal drinking age from 21 to 18 on the number of traffic fatalities in a state. Do you think that the condition in part (a) is likely to hold here? Explain. What variables would you need to include for the condition in part (b) to hold? Explain.
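The difference between the conditions in parts (a) and (b) can be illustrated with a small simulation. This is only a sketch under strong assumptions: the data are simulated, there is a single covariate `X` (rather than \(X_1\) and \(X_2\)), the true treatment effect is set to 3, and the outcome is linear in `X`, so a linear regression is enough to condition on it.

```r
set.seed(42)
n <- 10000

# Simulated data (purely illustrative): X affects both treatment take-up
# and the potential outcomes, so D is NOT independent of (Y(1), Y(0)).
X  <- rnorm(n)
D  <- rbinom(n, 1, plogis(1.5 * X))   # treatment is more likely when X is high
Y0 <- 1 + 2 * X + rnorm(n)            # untreated potential outcome
Y1 <- Y0 + 3                          # treated potential outcome; true effect is 3
Y  <- ifelse(D == 1, Y1, Y0)          # observed outcome

# Unconditional comparison of means: biased, because treated units have higher X
naive <- mean(Y[D == 1]) - mean(Y[D == 0])

# Conditioning on X (here through a linear regression): among units with the
# same X, D is as good as randomly assigned, so the effect is recovered
adjusted <- unname(coef(lm(Y ~ D + X))["D"])

print(c(naive = naive, adjusted = adjusted))
```

In this design the unconditional condition fails by construction, while the conditional one holds because `X` is the only common cause of `D` and the potential outcomes.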
Question 3
Extra Question 14.1
Question 4
Extra Question 14.2
Question 5
Extra Question 14.3
Question 6
Extra Question 14.7
Question 7
For this problem, we will be interested in the causal effect of having children on women’s labor supply.
Part (a) Consider the following regression of the number of hours that a woman typically works per week (work) on whether or not she has more than two children (morekids), her age and age^2, and race/ethnicity (afam and hispanic). How do you interpret the coefficient on morekids below (in general)? Would it be reasonable to interpret this coefficient as the causal effect of having more than two children? In particular, what conditions would need to be satisfied for this to be an estimate of the causal effect? Discuss whether or not these conditions seem likely to hold in this context.
Call:
lm(formula = work ~ morekids + age + I(age^2) + afam + hispanic,
    data = Fertility)

Residuals:
   Min     1Q Median     3Q    Max
-37.38 -17.86 -10.80  22.92  45.42

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.2590433  2.8942377  -1.817   0.0692 .
morekidsyes -6.2192495  0.0881474 -70.555  < 2e-16 ***
age          0.8725248  0.1983323   4.399 1.09e-05 ***
I(age^2)    -0.0006059  0.0033660  -0.180   0.8572
afamyes     11.5853282  0.1920706  60.318  < 2e-16 ***
hispanicyes  1.2590698  0.1629666   7.726 1.11e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.39 on 254648 degrees of freedom
Multiple R-squared: 0.04334, Adjusted R-squared: 0.04333
F-statistic: 2308 on 5 and 254648 DF, p-value: < 2.2e-16
Part (b) One possible instrument in this setup is whether or not the first two children have the same sex or different sexes (i.e., whether they are both girls or both boys versus a boy and a girl). What are the two key conditions that a variable needs to satisfy for it to be a valid instrument? If we define \(Z\) as the variable that is equal to 1 if the first two children have the same sex and 0 otherwise, do you think that \(Z\) satisfies these conditions?
Part (c) For this part the variable samesex is a binary variable that is equal to 1 if the first two children have the same sex and 0 otherwise. Consider the following instrumental variables regression. How do you interpret the coefficient on morekids below?
Call:
ivreg(formula = work ~ morekids + age + I(age^2) + afam + hispanic |
    samesex + age + I(age^2) + afam + hispanic, data = Fertility)

Residuals:
   Min     1Q Median     3Q    Max
-37.11 -17.74 -10.90  22.75  45.13

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.3476082  2.9088030  -1.838    0.066 .
morekidsyes -5.8386983  1.2478009  -4.679 2.88e-06 ***
age          0.8754998  0.1985781   4.409 1.04e-05 ***
I(age^2)    -0.0007558  0.0034017  -0.222    0.824
afamyes     11.5476740  0.2281698  50.610  < 2e-16 ***
hispanicyes  1.1978427  0.2581922   4.639 3.50e-06 ***

Diagnostic tests:
                    df1    df2 statistic p-value
Weak instruments      1 254648  1277.241  <2e-16 ***
Wu-Hausman            1 254647     0.093    0.76
Sargan                0     NA        NA      NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.39 on 254648 degrees of freedom
Multiple R-Squared: 0.04327, Adjusted R-squared: 0.04326
Wald test: 1316 on 5 and 254648 DF, p-value: < 2.2e-16
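An instrumental variables regression like the one above is typically computed by two-stage least squares. The sketch below shows the two stages by hand in base R. It uses simulated data, not the Fertility data: the variable names `U`, `Z`, `D`, and `Y` and every number in the data-generating process are assumptions chosen for illustration, with the true effect of `D` set to -5.

```r
set.seed(2024)
n <- 20000

# Simulated data (NOT the Fertility data). U is an unobserved confounder
# that affects both the treatment and the outcome.
U <- rnorm(n)
Z <- rbinom(n, 1, 0.5)                             # instrument, randomly assigned
D <- as.numeric(1.0 * Z - 0.8 * U + rnorm(n) > 0)  # endogenous binary treatment
Y <- 30 - 5 * D + 4 * U + rnorm(n, sd = 5)         # true effect of D is -5

# OLS of Y on D is biased because U is omitted
ols <- unname(coef(lm(Y ~ D))["D"])

# Stage 1: regress the treatment on the instrument and keep the fitted values
D_hat <- fitted(lm(D ~ Z))

# Stage 2: regress the outcome on the first-stage fitted values
tsls <- unname(coef(lm(Y ~ D_hat))["D_hat"])

# With one binary instrument and no covariates, 2SLS equals the Wald ratio:
# (effect of Z on Y) / (effect of Z on D)
wald <- (mean(Y[Z == 1]) - mean(Y[Z == 0])) /
        (mean(D[Z == 1]) - mean(D[Z == 0]))

print(c(ols = ols, tsls = tsls, wald = wald))
```

The point estimate from the manual second stage matches what a dedicated IV routine would report for the same specification, but the standard errors printed by the second-stage `lm()` are not valid, which is one reason to use a function such as `ivreg()` in practice.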
Question 8
For this question, we will consider whether or not there is an incumbency advantage in elections (an incumbent is a candidate who currently holds the office that they are running for). The outcome for this problem is the share of the vote received by the Democratic candidate in a large number of elections for the House of Representatives, voteshare (voteshare takes values from 0 to 100).
Part (a) Consider the following regression of voteshare on whether or not the Democratic candidate in an election is an incumbent, incumbent. How do you interpret the coefficient on incumbent below? Would it be reasonable to interpret this coefficient as the causal effect of being an incumbent?
library(RDHonest)
data("lee08", package="RDHonest")
# margin is the democratic margin in the previous election
lee08$incumbent <- 1*(lee08$margin > 0)
incumbent_reg <- lm(voteshare ~ incumbent, data=lee08)
summary(incumbent_reg)
Call:
lm(formula = voteshare ~ incumbent, data = lee08)

Residuals:
    Min      1Q  Median      3Q     Max
-69.788 -10.061  -0.356   9.632  65.348

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  34.6522     0.3201  108.25   <2e-16 ***
incumbent    35.1359     0.4195   83.75   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.76 on 6556 degrees of freedom
Multiple R-squared: 0.5169, Adjusted R-squared: 0.5168
F-statistic: 7014 on 1 and 6556 DF, p-value: < 2.2e-16
Part (b) Describe how you could use a regression discontinuity approach to estimate the causal effect of being an incumbent on the vote share received by the Democratic candidate.
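One common way to carry out a regression discontinuity analysis is a local linear regression on observations close to the cutoff. The sketch below is a rough illustration on simulated elections, not the lee08 data: the data frame `elections`, the bandwidth `h = 10`, and the jump of 8 points at the cutoff are all assumptions made so that the estimator can be checked against a known answer.

```r
set.seed(7)
n <- 20000

# Simulated elections (NOT the lee08 data); the jump of 8 points at the
# cutoff is an arbitrary assumption used to check the estimator.
margin    <- runif(n, -50, 50)        # running variable: previous margin
incumbent <- as.numeric(margin > 0)   # treatment switches on at the cutoff 0
voteshare <- 45 + 0.3 * margin + 8 * incumbent + rnorm(n, sd = 8)
elections <- data.frame(margin, incumbent, voteshare)

# Keep only close elections; the bandwidth h is a tuning choice (assumed here)
h <- 10
close_elections <- subset(elections, abs(margin) < h)

# Local linear regression with separate slopes on each side of the cutoff;
# the coefficient on `incumbent` estimates the jump in voteshare at margin = 0
rd_fit <- lm(voteshare ~ incumbent * margin, data = close_elections)
rd_est <- unname(coef(rd_fit)["incumbent"])
print(rd_est)
```

A smaller `h` makes the comparison more credible but leaves fewer observations, so the estimate becomes noisier. Rerunning the sketch with different values of `h` shows that tradeoff.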
Part (c) To implement our regression discontinuity approach, we need to decide which observations to include based on the Democratic candidate’s margin of victory in the previous election (the running variable). The graph below shows the average voteshare (the outcome) as a function of margin. What range of the margin would you include in the analysis, and why?
Call: binsreg
Binscatter Plot
Bin/Degree selection method (binsmethod) = User-specified
Placement (binspos) = Quantile-spaced
Derivative (deriv) = 0
Group (by) = Full Sample
Sample size (n) = 6558
# of distinct values (Ndist) = 5815
# of clusters (Nclust) = NA
dots, degree (p) = 0
dots, smoothness (s) = 0
# of bins (nbins) = 93
Part (d) Consider the following regression. Explain why this code implements the regression discontinuity approach that we discussed above. How do you interpret the results?