Recall that \[ \begin{aligned} AIC &= 2k + n\log(SSR) \\ BIC &= k\log(n) + n\log(SSR) \end{aligned} \] so that the only difference between \(AIC\) and \(BIC\) is in their penalty terms: \(2k\) and \(k\log(n)\), respectively. Then, as long as \(n\geq 8\) (which would presumably always be the case when you are doing model selection), the penalty for adding another regressor is larger for \(BIC\) than \(AIC\). This means that \(BIC\) will tend to choose “less complicated” models and \(AIC\) will tend to choose “more complicated” models.
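To see where the threshold \(n \geq 8\) comes from, compare the two penalty terms directly. This assumes \(\log\) is the natural logarithm and \(k > 0\):
\[ k\log(n) > 2k \iff \log(n) > 2 \iff n > e^2 \approx 7.39 \]
Since \(n\) is an integer, the \(BIC\) penalty is the larger one exactly when \(n \geq 8\); for instance, \(\log(7) \approx 1.95 < 2\) while \(\log(8) \approx 2.08 > 2\).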
In terms of mean squared prediction error, the accuracy of our predictions depends on both the bias and the variance of the predictions. If we can substantially reduce variance by introducing a small amount of bias, this can result in better predictions.
This argument doesn’t always apply. For example, if you make a “bad choice” of the penalty (tuning) parameter in a Lasso or Ridge regression, you could introduce a lot of bias that is not offset by the reduction in variance.
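One way to make the tradeoff concrete is the standard decomposition of mean squared prediction error. The notation here is an assumption for illustration: \(\hat{Y}\) is the prediction, \(Y\) is a new outcome whose error term is independent of \(\hat{Y}\), and \(\sigma^2\) is the variance of that error:
\[ E\big[(Y - \hat{Y})^2\big] = \text{Bias}(\hat{Y})^2 + \text{Var}(\hat{Y}) + \sigma^2 \]
As a purely hypothetical numerical illustration, an unbiased predictor with variance 4 contributes \(0^2 + 4 = 4\), while a predictor with bias 1 and variance 2 contributes \(1^2 + 2 = 3\), so the biased predictor does better. With bias 2 and variance 2, however, the contribution is \(2^2 + 2 = 6 > 4\), which is the case where too much bias is not offset by the smaller variance.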
Part (a)
load("../../Detailed Course Notes/data/rand_hie.RData")
rand_hie_subset <- subset(rand_hie, plan_type %in% c("Catastrophic", "Free"))
reg_a <- lm(total_med_expenditure ~ plan_type, data=rand_hie_subset)
summary(reg_a)
##
## Call:
## lm(formula = total_med_expenditure ~ plan_type, data = rand_hie_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -532.9 -392.8 -299.4 38.4 17987.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 392.77 40.19 9.773 <2e-16 ***
## plan_typeFree 140.12 49.74 2.817 0.0049 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 993.4 on 1758 degrees of freedom
## Multiple R-squared: 0.004493, Adjusted R-squared: 0.003927
## F-statistic: 7.935 on 1 and 1758 DF, p-value: 0.004903
Relative to having only “catastrophic” insurance coverage, total medical expenditure (which includes both what the person paid themselves and what their insurance paid) is estimated to be substantially higher, on average, for individuals assigned to “free” insurance (i.e., those who paid nothing for medical care); in particular, we estimate that individuals with “free” insurance have total medical expenditures that are about $140 higher, on average, than those with only catastrophic coverage. In my view, this difference is large in magnitude: average expenditure is $393 for those with catastrophic coverage, which implies that those with free coverage have 36% higher total medical expenditures. Since individuals were randomly assigned to a type of plan, it seems reasonable to interpret these results as a causal effect of plan type on total medical spending.
Part (b)
reg_b <- lm(face_to_face_visits ~ plan_type, data=rand_hie_subset)
summary(reg_b)
##
## Call:
## lm(formula = face_to_face_visits ~ plan_type, data = rand_hie_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.928 -3.192 -1.792 0.808 91.672
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1917 0.2562 12.457 < 2e-16 ***
## plan_typeFree 1.7361 0.3171 5.475 5.01e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.333 on 1758 degrees of freedom
## Multiple R-squared: 0.01676, Adjusted R-squared: 0.01621
## F-statistic: 29.98 on 1 and 1758 DF, p-value: 5.007e-08
These results are broadly similar to the ones before. Individuals assigned to the “free” insurance plan had, on average, 1.7 more face-to-face visits with doctors. This is 54% more than the average for individuals randomly assigned to the “catastrophic” insurance plan. As in part (a), it seems reasonable to interpret these as causal effects due to the random assignment.
Part (c)
reg_c <- lm(health_index ~ plan_type, data=rand_hie_subset)
summary(reg_c)
##
## Call:
## lm(formula = health_index ~ plan_type, data = rand_hie_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.525 -9.784 1.516 10.616 32.216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.5247 0.6190 110.698 <2e-16 ***
## plan_typeFree -0.7407 0.7661 -0.967 0.334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.3 on 1758 degrees of freedom
## Multiple R-squared: 0.0005315, Adjusted R-squared: -3.708e-05
## F-statistic: 0.9348 on 1 and 1758 DF, p-value: 0.3338
These results are different from the previous ones. Although individuals assigned to the “free” insurance plan appear to be utilizing more medical care, it does not appear to be improving their health (at least according to this measure of an individual’s health). The estimated effect here is not statistically significant and is quantitatively small; for example, we estimate that individuals in the “free” insurance plan have a health index that is about 1% lower, on average, than those in the “catastrophic” plan.
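The percentage differences quoted in parts (a)-(c) can be checked by dividing each `plan_typeFree` coefficient by the corresponding intercept (the “catastrophic” group mean). The sketch below simply copies the estimates from the regression output above, so it runs without the data file; the vector names `intercepts`, `free_effects`, and `pct_diff` are made up for this illustration.

```r
# Estimates copied by hand from the summary() output in parts (a)-(c)
intercepts <- c(total_med_expenditure = 392.77,
                face_to_face_visits   = 3.1917,
                health_index          = 68.5247)
free_effects <- c(total_med_expenditure = 140.12,
                  face_to_face_visits   = 1.7361,
                  health_index          = -0.7407)

# Effect of "free" coverage as a percentage of the "catastrophic" mean
pct_diff <- 100 * free_effects / intercepts
round(pct_diff, 1)
```

With the data loaded, the same calculation could be done from a fitted model, e.g. `100 * coef(reg_a)[2] / coef(reg_a)[1]` for part (a).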
Part (d)
Parts (a)-(c) seem to suggest that “free” insurance increased medical care usage without much of an effect on health (at least in the way that we were able to measure health).
Unconfoundedness is the condition that
\[ (Y(1),Y(0)) \perp D | X \]
This says that potential outcomes are independent of the treatment after conditioning on covariates. In practice, it means that, if we find individuals with the same \(X\) covariates, some of whom participate in the treatment while others do not, we would be willing to interpret differences in their average outcomes as causal effects of the treatment.
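A small simulation can illustrate what this condition buys us. Everything below is made up for illustration (the data-generating process, the true effect of 1, and all variable names): treatment \(D\) depends only on a binary covariate \(X\), so unconfoundedness holds by construction, but \(X\) also shifts the outcome, so a raw comparison of treated and untreated means is misleading.

```r
set.seed(1)
n <- 10000

# Binary covariate that affects both treatment take-up and outcomes
x <- rbinom(n, 1, 0.5)

# Treatment probability depends only on x, so (Y(1), Y(0)) is
# independent of d conditional on x (hypothetical probabilities)
d <- rbinom(n, 1, ifelse(x == 1, 0.8, 0.2))

# Potential outcomes: the true treatment effect is 1 for everyone
y0 <- 2 * x + rnorm(n)
y1 <- y0 + 1
y  <- ifelse(d == 1, y1, y0)

# Raw difference in means ignores x and is contaminated by it
naive <- mean(y[d == 1]) - mean(y[d == 0])

# Controlling for x compares individuals with the same covariate value
adjusted <- unname(coef(lm(y ~ d + x))["d"])

c(naive = naive, adjusted = adjusted)
```

In this setup the adjusted estimate should land close to the true effect of 1, while the naive difference should be noticeably larger because treated individuals disproportionately have \(X = 1\).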
Part (a)
It is probably not reasonable to interpret \(\hat{\alpha}\) as an estimate of the causal effect of being in a union. There are probably a number of other things that we would need to be able to control for in order to interpret this as a causal effect; some examples are: age (I think union members tend to be older in the U.S. and that tends to be correlated with higher earnings), occupation/industry (union members tend to be concentrated in the manufacturing sector where there traditionally have been fairly large wage premiums), and perhaps other things like motivation and ability (it’s not totally clear if this is true, but it is at least worth entertaining the idea that either of these could be correlated with union membership and they are both very likely correlated with earnings).