Recall that \[ \begin{aligned} AIC &= 2k + n\log(SSR) \\ BIC &= k\log(n) + n\log(SSR) \end{aligned} \] so that the only difference between \(AIC\) and \(BIC\) is in their penalty terms: \(2k\) and \(k\log(n)\), respectively. Then, as long as \(n\geq 8\) (which would presumably always be the case when you are doing model selection), the penalty for adding another regressor is larger for \(BIC\) than \(AIC\). This means that \(BIC\) will tend to choose “less complicated” models and \(AIC\) will tend to choose “more complicated” models.
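The threshold \(n \geq 8\) comes from comparing the two penalties directly (taking \(\log\) to be the natural logarithm and \(k > 0\)):

\[ k\log(n) > 2k \iff \log(n) > 2 \iff n > e^2 \approx 7.39 \]

so \(n = 8\) is the smallest sample size at which the \(BIC\) penalty exceeds the \(AIC\) penalty.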

In terms of mean squared prediction error, the accuracy of our predictions depends on both the bias and the variance of the predictions. If we can substantially reduce variance by introducing a small amount of bias, this can result in better predictions.
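One standard way to write this down is the bias-variance decomposition. The notation here is an assumption rather than something taken from the problem: suppose \(Y = f(X) + U\) with \(E[U|X] = 0\) and \(\mathrm{var}(U|X) = \sigma^2\), and let \(\hat{f}(x)\) be a prediction at \(X = x\) that is computed from a sample independent of the new observation. Then

\[ E\big[(Y - \hat{f}(x))^2 \,\big|\, X = x\big] = \underbrace{\big(E[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{var}\big(\hat{f}(x)\big)}_{\text{variance}} + \sigma^2 \]

The last term does not depend on the prediction method, so a method with a little bias can still win if its variance is enough smaller.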

This argument doesn’t always apply. For example, if you make a “bad choice” of the penalty term (the tuning parameter) in a Lasso or Ridge regression, you could introduce a lot of bias that is not offset by the reduction in variance.
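A minimal simulation sketch of this point, using ridge regression computed from its closed form in base R. Everything here is made up for illustration: the data generating process, the grid of penalties, and the helper `ridge_coef` are assumptions, not part of the original problem.

```r
set.seed(1)
n <- 100; p <- 5
beta <- rep(2, p)                       # true coefficients (hypothetical)

# estimation sample and a separate sample for evaluating predictions
X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% beta + rnorm(n))
X_new <- matrix(rnorm(n * p), n, p)
y_new <- drop(X_new %*% beta + rnorm(n))

# ridge estimator (no intercept): (X'X + lambda I)^{-1} X'y
ridge_coef <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# lambda = 0 is OLS; lambda = 1e6 shrinks every coefficient almost to 0
lambdas <- c(0, 1, 10, 1e6)
mspe <- sapply(lambdas, function(l) {
  b <- ridge_coef(X, y, l)
  mean((y_new - X_new %*% b)^2)
})
data.frame(lambda = lambdas, mspe = mspe)
```

With the very large penalty, the predictions are essentially all 0, so the bias is large and the out-of-sample mean squared prediction error is much worse than with no penalty at all.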

**Part (a)**

```
load("../../Detailed Course Notes/data/rand_hie.RData")
reg_a <- lm(total_med_expenditure ~ plan_type, data=rand_hie)
summary(reg_a)
```

```
##
## Call:
## lm(formula = total_med_expenditure ~ plan_type, data = rand_hie)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
##  -532.9  -380.4  -279.9    47.9 17987.6
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)             392.77      37.05  10.601  < 2e-16 ***
## plan_typeDeductible      34.09      50.47   0.675  0.49944
## plan_typeCost Sharing    11.98      48.26   0.248  0.80401
## plan_typeFree           140.12      45.86   3.056  0.00226 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 915.9 on 3347 degrees of freedom
## Multiple R-squared: 0.00428, Adjusted R-squared: 0.003388
## F-statistic: 4.796 on 3 and 3347 DF, p-value: 0.002455
```

Relative to having only “catastrophic” insurance coverage, total medical expenditure is estimated to be substantially higher, on average, for individuals assigned to “free” insurance (i.e., those who paid nothing for medical care). The estimated effects of “deductible” and “cost sharing” insurance on total medical expenditure are positive but not statistically different from 0. The effect of “free” insurance appears to be large: medical expenditures are about 36% higher (140.12 relative to a baseline of 392.77) for individuals assigned to this group relative to the “catastrophic” group. Since individuals were randomly assigned to a type of plan, it seems reasonable to interpret these results as causal effects of plan type on medical spending.

**Part (b)**

```
reg_b <- lm(face_to_face_visits ~ plan_type, data=rand_hie)
summary(reg_b)
```

```
##
## Call:
## lm(formula = face_to_face_visits ~ plan_type, data = rand_hie)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.928 -2.858 -1.528  0.748 91.672
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)             3.1917     0.2248  14.199  < 2e-16 ***
## plan_typeDeductible     0.1062     0.3062   0.347    0.729
## plan_typeCost Sharing   0.2602     0.2928   0.889    0.374
## plan_typeFree           1.7361     0.2782   6.241 4.91e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.556 on 3347 degrees of freedom
## Multiple R-squared: 0.01856, Adjusted R-squared: 0.01768
## F-statistic: 21.09 on 3 and 3347 DF, p-value: 1.557e-13
```

These results are broadly similar to the ones in part (a). Individuals assigned to the “free” insurance plan had, on average, about 1.7 more face-to-face visits with doctors. This is about 54% more than individuals randomly assigned to the “catastrophic” insurance plan (1.74 relative to a baseline of 3.19). As in part (a), it seems reasonable to interpret these as causal effects due to the random assignment.

**Part (c)**

```
reg_c <- lm(health_index ~ plan_type, data=rand_hie)
summary(reg_c)
```

```
##
## Call:
## lm(formula = health_index ~ plan_type, data = rand_hie)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -60.525  -8.838   1.462  10.663  32.263
##
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)
## (Intercept)            68.5247     0.6152 111.394   <2e-16 ***
## plan_typeDeductible    -0.7880     0.8380  -0.940    0.347
## plan_typeCost Sharing   0.5133     0.8013   0.641    0.522
## plan_typeFree          -0.7407     0.7613  -0.973    0.331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.21 on 3347 degrees of freedom
## Multiple R-squared: 0.001326, Adjusted R-squared: 0.0004312
## F-statistic: 1.482 on 3 and 3347 DF, p-value: 0.2175
```

These results are different from the previous ones. Although individuals assigned to the “free” insurance plan appear to be utilizing more medical care, it does not appear to be improving their health (at least according to this measure of an individual’s health). The results here are neither statistically significant nor quantitatively large; for example, we estimate that individuals in the “free” insurance plan have about a 1% lower health index, on average, than those in the “catastrophic” plan.

**Part (d)**

Parts (a)-(c) seem to suggest that “free” insurance increased medical care usage without much of an effect on health (at least in the way that we were able to measure health).
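The percentage comparisons quoted in parts (a)-(c) can be reproduced from the reported coefficients. This is just a small sketch: the numbers are copied by hand from the regression output above, and the vector names are made up for illustration.

```r
# Intercepts (mean outcome in the "catastrophic" group) and the
# coefficients on plan_typeFree, copied from parts (a)-(c)
baseline    <- c(expenditure = 392.77, visits = 3.1917, health_index = 68.5247)
free_effect <- c(expenditure = 140.12, visits = 1.7361, health_index = -0.7407)

# Effect of the "free" plan as a percentage of the "catastrophic" mean
pct_effect <- 100 * free_effect / baseline
round(pct_effect, 1)

# With the fitted models in memory, the same calculation for part (a) is
# 100 * coef(reg_a)["plan_typeFree"] / coef(reg_a)["(Intercept)"]
```

This gives roughly 36% for expenditure, 54% for visits, and -1% for the health index, which is the pattern summarized above.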

**Part (a)**

```
data("Fertility", package="AER")
reg_a <- lm(work ~ morekids + age + I(age^2) + afam + hispanic, data=Fertility)
summary(reg_a)
```

```
##
## Call:
## lm(formula = work ~ morekids + age + I(age^2) + afam + hispanic,
##     data = Fertility)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -37.38 -17.86 -10.80  22.92  45.42
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.2590433  2.8942377  -1.817   0.0692 .
## morekidsyes -6.2192495  0.0881474 -70.555  < 2e-16 ***
## age          0.8725248  0.1983323   4.399 1.09e-05 ***
## I(age^2)    -0.0006059  0.0033660  -0.180   0.8572
## afamyes     11.5853282  0.1920706  60.318  < 2e-16 ***
## hispanicyes  1.2590698  0.1629666   7.726 1.11e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.39 on 254648 degrees of freedom
## Multiple R-squared: 0.04334, Adjusted R-squared: 0.04333
## F-statistic: 2308 on 5 and 254648 DF, p-value: < 2.2e-16
```

We estimate that women who have more than two children work, on average, about 6 fewer weeks per year than women who do not have more than two children, holding age and race/ethnicity constant (in the `Fertility` data, `work` records the number of weeks worked in 1979). I am not an expert here, but I suspect that it is not reasonable to interpret this as the causal effect of having more than two kids on the amount of work. Here are two possible reasons (though there could be others). First, most labor supply models would suggest that women’s labor supply depends on other sources of income for the family. For example, (all else equal) women in families where the husband has high earnings or that received a large inheritance are likely to work less than women without either of these; since this could also be correlated with having more children, that is one possible reason for not interpreting this as a causal effect. Second, our model does not include any measure of educational attainment. As far as I know, more highly educated people tend to work more, on average, than less educated people. If education is also correlated with number of children, then this would be another reason not to interpret the above estimates as causal effects.

**Part (b)**

Yes, this is probably a good candidate for an instrument. Presumably, the sex composition of children is close to completely random, which would indicate that it satisfies the exogeneity condition. If it also satisfies the relevance condition, which we can check (and which, at least to me, seems plausible given the explanation in the problem), then it can work as an instrument.
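Written out, in notation that is an assumption here rather than taken from the problem (\(Z\) for the same-sex indicator, \(D\) for having more than two kids, and \(U\) for the error term in the equation for work), the two conditions are

\[ \text{exogeneity: } \mathrm{cov}(Z, U) = 0 \qquad \text{relevance: } \mathrm{cov}(Z, D) \neq 0 \]

with both understood as holding after accounting for the other regressors. Relevance involves only observed variables, so it can be checked with a first-stage regression of \(D\) on \(Z\) and the controls. Exogeneity involves the unobserved \(U\), so with a single instrument it has to be argued rather than tested.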

**Part (c)**

```
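library(ivreg)
# instrument: indicator that the first two children have the same sex
Fertility$samesex <- 1*(Fertility$gender1 == Fertility$gender2)
# 2SLS: regressors go before `|`; the instrument and exogenous controls go after
ivreg_c <- ivreg(work ~ morekids + age + I(age^2) + afam + hispanic |
                   samesex + age + I(age^2) + afam + hispanic,
                 data=Fertility)
summary(ivreg_c)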
```

```
##
## Call:
## ivreg(formula = work ~ morekids + age + I(age^2) + afam + hispanic |
##     samesex + age + I(age^2) + afam + hispanic, data = Fertility)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -37.11 -17.74 -10.90  22.75  45.13
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.3476082  2.9088030  -1.838    0.066 .
## morekidsyes -5.8386983  1.2478009  -4.679 2.88e-06 ***
## age          0.8754998  0.1985781   4.409 1.04e-05 ***
## I(age^2)    -0.0007558  0.0034017  -0.222    0.824
## afamyes     11.5476740  0.2281698  50.610  < 2e-16 ***
## hispanicyes  1.1978427  0.2581922   4.639 3.50e-06 ***
##
## Diagnostic tests:
##                     df1    df2 statistic p-value
## Weak instruments      1 254648  1277.241  <2e-16 ***
## Wu-Hausman            1 254647     0.093    0.76
## Sargan                0     NA        NA      NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.39 on 254648 degrees of freedom
## Multiple R-Squared: 0.04327, Adjusted R-squared: 0.04326
## Wald test: 1316 on 5 and 254648 DF, p-value: < 2.2e-16
```

In this case, it seems reasonable to interpret our results as causal effects. Interestingly, the estimate of the effect of having more than two kids turns out to be similar to the original one: we estimate that having more than two kids causes women’s weeks worked to decrease by about 6 on average. Also, notice the diagnostics: in particular, the weak instrument diagnostic (a first-stage F-statistic of about 1277) suggests that the `samesex` variable satisfies the relevance condition. The Wu-Hausman test (p-value of 0.76) likewise does not reject that the OLS and IV estimates are the same, which is consistent with the two estimates being so close.
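To see what the IV estimator is doing, here is a minimal sketch of two-stage least squares done by hand on simulated data. None of this uses the `Fertility` data: the data generating process, the true effect of -6, and all variable names are assumptions chosen for illustration, and the regressor `d` is continuous here for simplicity (unlike `morekids`).

```r
set.seed(123)
n <- 10000
z <- rbinom(n, 1, 0.5)             # instrument: randomly assigned, unrelated to u
u <- rnorm(n)                      # unobserved confounder
d <- 0.8 * z + u + rnorm(n)        # endogenous regressor (depends on u)
y <- 1 - 6 * d + 5 * u + rnorm(n)  # true effect of d on y is -6

# OLS is biased because d is correlated with the omitted u
ols <- lm(y ~ d)

# Two-stage least squares by hand
first_stage <- lm(d ~ z)           # relevance check: does z predict d?
d_hat <- fitted(first_stage)
second_stage <- lm(y ~ d_hat)      # point estimate only; these SEs are not valid

first_stage_F <- summary(first_stage)$fstatistic["value"]
c(ols = unname(coef(ols)["d"]),
  tsls = unname(coef(second_stage)["d_hat"]),
  first_stage_F = unname(first_stage_F))
```

The hand-rolled second stage reproduces the 2SLS point estimate, but its standard errors ignore that `d_hat` was estimated; that is one reason to use a function like `ivreg` in practice. A large first-stage F-statistic is the same kind of evidence for relevance that the weak instruments diagnostic reports.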

Unconfoundedness is the condition that

\[ (Y(1),Y(0)) \perp D | X \]

This says that potential outcomes are independent of the treatment after conditioning on covariates. In practice, it means that, if we find individuals with the same covariates \(X\), some of whom participate in the treatment while others do not, we would be willing to interpret differences in their average outcomes as causal effects of the treatment.
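The step from the condition to that interpretation can be written out. Using the standard relationship between observed and potential outcomes, \(Y = D\,Y(1) + (1-D)\,Y(0)\), and additionally assuming that both treated and untreated individuals exist at the given value of \(X\) (an overlap condition that is not stated above),

\[ \begin{aligned} E[Y(1) - Y(0) | X] &= E[Y(1) | X, D=1] - E[Y(0) | X, D=0] \\ &= E[Y | X, D=1] - E[Y | X, D=0] \end{aligned} \]

where the first equality uses unconfoundedness and the second uses the fact that we observe \(Y(1)\) for treated individuals and \(Y(0)\) for untreated individuals. The last line only involves quantities that can be estimated from data.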

**Part (a)**

It is probably not reasonable to interpret \(\hat{\alpha}\) as an estimate of the causal effect of being in a union. There are probably a number of other things that we would need to be able to control for in order to interpret this as a causal effect; some examples are: age (I think union members tend to be older in the U.S. and that tends to be correlated with higher earnings), occupation/industry (union members tend to be concentrated in the manufacturing sector where there traditionally have been fairly large wage premiums), and perhaps other things like motivation and ability (it’s not totally clear if this is true, but it is at least worth entertaining the idea that either of these could be correlated with union membership and they are both very likely correlated with earnings).

**Part (b)**

The individual fixed effect, \(\eta_i\), in this model likely handles a number of issues that were concerning in part (a); for example, age (you can think of replacing age by year of birth to more clearly see that this is effectively time invariant), motivation, and ability can all reasonably be thought to be time invariant (or close to it). Occupation and industry probably also don’t vary too much over time. You could probably still come up with some limitations here (perhaps the return to motivation or ability tends to increase over time, for example), but at a minimum, it is probably more reasonable to think of this as a causal effect of unions on earnings.

One disadvantage of this sort of regression is that there is probably not much variation in union status over time (i.e., most people that are in a union stay in from year to year; and most people not in a union stay out from year to year). This may lead to imprecise estimates of \(\alpha\). If you were to take this application very seriously, you would probably also need to think carefully about *why* the people that switch are actually switching.
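Both points can be illustrated with a small two-period simulation. This is only a sketch under assumptions: the problem’s exact model is not reproduced here, so the code takes it to be \(Y_{it} = \alpha D_{it} + \eta_i + U_{it}\), and the value of `alpha`, the way union status is generated, and all variable names are hypothetical.

```r
set.seed(42)
n <- 5000
alpha <- 0.2                       # true union effect (hypothetical value)
eta <- rnorm(n)                    # time-invariant unobserved heterogeneity

# Union status in each period depends on eta, so it is endogenous in levels
union1 <- 1 * (eta + rnorm(n) > 0.5)
union2 <- 1 * (eta + rnorm(n) > 0.5)
y1 <- alpha * union1 + eta + rnorm(n, sd = 0.5)
y2 <- alpha * union2 + eta + rnorm(n, sd = 0.5)

# Pooled OLS ignores eta and picks up its correlation with union status
pooled <- lm(c(y1, y2) ~ c(union1, union2))
alpha_pooled <- unname(coef(pooled)[2])

# First differences remove eta; only switchers have a nonzero change in status
dy <- y2 - y1
dunion <- union2 - union1
fd <- lm(dy ~ dunion)
alpha_fd <- unname(coef(fd)["dunion"])

share_switchers <- mean(dunion != 0)
c(pooled = alpha_pooled, first_diff = alpha_fd, share_switchers = share_switchers)
```

Differencing recovers a value close to `alpha` while pooled OLS does not. Note that people switch for purely random reasons in this simulation and switch far more often than they would in real data; with fewer switchers the first-difference estimate becomes noisier, and if switching were related to changes in \(U_{it}\), differencing would not be enough.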

**Part (c)**

It is unlikely that this strategy would work. While it is probably reasonable to think that birthday is uncorrelated with the error term (so that the exogeneity condition holds), it seems very unlikely that birthday is correlated with union status, in which case the relevance condition would not hold.