Homework 5 Solutions

Ch. 18, Coding Question 1

Part (a)

load("../../Detailed Course Notes/data/rand_hie.RData")
rand_hie_subset <- subset(rand_hie, plan_type %in% c("Catastrophic", "Free"))

reg_a <- lm(total_med_expenditure ~ plan_type, data=rand_hie_subset)
summary(reg_a)

Call:
lm(formula = total_med_expenditure ~ plan_type, data = rand_hie_subset)

Residuals:
    Min      1Q  Median      3Q     Max 
 -532.9  -392.8  -299.4    38.4 17987.6 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     392.77      40.19   9.773   <2e-16 ***
plan_typeFree   140.12      49.74   2.817   0.0049 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 993.4 on 1758 degrees of freedom
Multiple R-squared:  0.004493,  Adjusted R-squared:  0.003927 
F-statistic: 7.935 on 1 and 1758 DF,  p-value: 0.004903

Relative to having only “catastrophic” insurance coverage, total medical expenditure (which includes both what the person paid themselves and what their insurance paid) is estimated to be substantially higher, on average, for individuals assigned to “free” insurance (i.e., who paid nothing for medical care); in particular, we estimate that “free” insurance results in about $140 more total medical expenditure, on average, than catastrophic coverage alone. In my view, this difference is large in magnitude: the average expenditure is $393 for those with catastrophic coverage, which implies that those with free coverage have about 36% higher total medical expenditures. Since individuals were randomly assigned to a plan type, it seems reasonable to interpret these results as a causal effect of plan type on total medical spending.
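As a quick check on these magnitudes, the group means can be computed directly from the subset created above (a minimal sketch); they reproduce the intercept and the intercept plus the plan_typeFree coefficient, and the implied percentage difference.

# average total medical expenditure by plan type; "Catastrophic" should match the
# intercept (~$393) and "Free" should match the intercept plus the coefficient (~$533)
group_means <- tapply(rand_hie_subset$total_med_expenditure,
                      factor(rand_hie_subset$plan_type), mean, na.rm = TRUE)
group_means

# implied percentage difference for "Free" relative to "Catastrophic" (about 36%)
unname((group_means["Free"] - group_means["Catastrophic"]) / group_means["Catastrophic"])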

Part (b)

# regress the number of face-to-face doctor visits on plan type
reg_b <- lm(face_to_face_visits ~ plan_type, data=rand_hie_subset)
summary(reg_b)

Call:
lm(formula = face_to_face_visits ~ plan_type, data = rand_hie_subset)

Residuals:
   Min     1Q Median     3Q    Max 
-4.928 -3.192 -1.792  0.808 91.672 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.1917     0.2562  12.457  < 2e-16 ***
plan_typeFree   1.7361     0.3171   5.475 5.01e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.333 on 1758 degrees of freedom
Multiple R-squared:  0.01676,   Adjusted R-squared:  0.01621 
F-statistic: 29.98 on 1 and 1758 DF,  p-value: 5.007e-08

These results are broadly similar to the ones in part (a). Individuals assigned to the “free” insurance plan had, on average, about 1.7 more face-to-face visits with doctors, which is about 54% more than individuals randomly assigned to the “catastrophic” plan. As in part (a), it seems reasonable to interpret these as causal effects due to the random assignment.
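The 54% figure can be recovered directly from the fitted model (a quick check using reg_b from above):

# percentage increase in face-to-face visits for the "Free" group, computed as the
# plan_typeFree coefficient divided by the intercept (the "Catastrophic" group mean)
unname(coef(reg_b)["plan_typeFree"] / coef(reg_b)["(Intercept)"])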

Part (c)

# regress the general health index on plan type
reg_c <- lm(health_index ~ plan_type, data=rand_hie_subset)
summary(reg_c)

Call:
lm(formula = health_index ~ plan_type, data = rand_hie_subset)

Residuals:
    Min      1Q  Median      3Q     Max 
-60.525  -9.784   1.516  10.616  32.216 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    68.5247     0.6190 110.698   <2e-16 ***
plan_typeFree  -0.7407     0.7661  -0.967    0.334    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.3 on 1758 degrees of freedom
Multiple R-squared:  0.0005315, Adjusted R-squared:  -3.708e-05 
F-statistic: 0.9348 on 1 and 1758 DF,  p-value: 0.3338

These results are different from the previous ones. Although individuals assigned to the “free” insurance plan appear to use more medical care, it does not appear to be improving their health (at least according to this measure of an individual’s health). The estimate here is not statistically significant and is quantitatively small; for example, we estimate that individuals in the “free” insurance plan have about a 1% lower health index, on average, than those in the “catastrophic” plan.
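One way to see how much uncertainty surrounds this small estimate (a quick check using reg_c from above) is a 95% confidence interval for the plan_typeFree coefficient, which includes zero:

# 95% confidence interval for the effect of "Free" coverage on the health index
confint(reg_c, "plan_typeFree", level = 0.95)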

Part (d)

Parts (a)-(c) seem to suggest that “free” insurance increased medical care usage without much of an effect on health (at least in the way that we were able to measure health).

Ch. 18, Extra Question 1

Treatment effect heterogeneity means that the effect of the treatment can be different across different units. Treatment effect homogeneity means that the effect of the treatment is the same across all units. Most applications (at least the ones we have considered) likely exhibit treatment effect heterogeneity.
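To make the distinction concrete, here is a small simulated example (hypothetical data, not from the textbook) in which every unit has its own treatment effect; under homogeneity, the unit-specific effects would instead all be identical.

# simulate heterogeneous treatment effects: each unit i has its own effect
set.seed(123)
n <- 1000
effect_i <- rnorm(n, mean = 2, sd = 1)    # unit-specific effects, averaging about 2
d <- rbinom(n, 1, 0.5)                    # randomly assigned binary treatment
y0 <- rnorm(n)                            # untreated potential outcome
y1 <- y0 + effect_i                       # treated potential outcome
y <- ifelse(d == 1, y1, y0)               # observed outcome

# with random assignment, the regression recovers the *average* treatment effect,
# even though the effect differs across units
lm(y ~ d)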

Ch. 18, Extra Question 3

Unconfoundedness is the condition that

\[ \Big(Y(1),Y(0)\Big) \perp \!\!\! \perp D \, \Big| \, X \]

This says that the potential outcomes are independent of the treatment after conditioning on covariates. In practice, it means that if we find individuals with the same \(X\) covariates, some of whom participate in the treatment while others do not, we would be willing to interpret differences in their average outcomes as causal effects of the treatment.
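As an illustration of how this is used in practice, here is a small simulated example (hypothetical data and variable names) in which treatment depends on an observed covariate x but is as good as randomly assigned once we condition on x:

set.seed(42)
n <- 5000
x <- sample(c("low", "high"), n, replace = TRUE)
d <- rbinom(n, 1, ifelse(x == "high", 0.7, 0.3))   # selection into treatment based on x only
y <- 1 + 2 * d + 3 * (x == "high") + rnorm(n)      # true treatment effect is 2
df <- data.frame(y, d, x)

# naive comparison ignoring x is biased because treated units tend to have high x
mean(df$y[df$d == 1]) - mean(df$y[df$d == 0])

# compare treated vs. untreated within each value of x, then average over x;
# under unconfoundedness this recovers the causal effect (close to 2 here)
gap_by_x <- sapply(split(df, df$x),
                   function(g) mean(g$y[g$d == 1]) - mean(g$y[g$d == 0]))
weighted.mean(gap_by_x, w = table(df$x) / nrow(df))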

Ch. 18, Extra Question 5

The first key condition is the exclusion restriction. It says that the instrument, \(Z\), is uncorrelated with the error term, \(V\); i.e., that \(\mathrm{cov}(Z,V) = 0\). More specifically, this means that \(Z\) is uncorrelated with other variables that are (i) not observed (or otherwise not included in the model) and (ii) correlated with the outcome.

The second key condition is instrument relevance. It says that the instrument, \(Z\), is correlated with the treatment variable, \(D\); i.e., that \(\mathrm{cov}(Z,D) \neq 0\).
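As a simulated illustration (hypothetical data; the ivreg function from the AER package is one common way to compute the IV estimate), relevance can be checked with a first-stage regression of \(D\) on \(Z\), while the exclusion restriction is an assumption that the data cannot verify:

library(AER)   # for ivreg(); install.packages("AER") if it is not already installed

set.seed(1)
n <- 5000
z <- rbinom(n, 1, 0.5)                   # instrument
v <- rnorm(n)                            # unobserved error in the outcome equation
d <- 0.5 * z + 0.5 * v + rnorm(n)        # treatment is endogenous: it depends on v
y <- 1 + 2 * d + v                       # true effect of d on y is 2

summary(lm(y ~ d))        # OLS is biased (upward here) because cov(d, v) != 0
summary(lm(d ~ z))        # first stage: instrument relevance, cov(z, d) != 0
summary(ivreg(y ~ d | z)) # IV estimate should be close to the true effect of 2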

Ch. 18, Extra Question 8

Part (a)

No, if you just ignore \(W_i\), you will get omitted variable bias. The coefficient on \(D\) in this regression will not (generally) be equal to the causal effect \(\alpha\).

Part (b)

First, from the equation \(Y_i = \delta_0 + \delta_1 D_i + \delta_2 X_i + \epsilon_i\) (and because \(\mathbb{E}[\epsilon|D,X]=0\)) we know that

\[ \mathbb{E}[Y|D,X] = \delta_0 + \delta_1 D + \delta_2 X \] so that \(\delta_1\) is just the coefficient on \(D\) from a regression of \(Y\) on \(D\) and \(X\) (and ignoring \(W\)).

Now, let’s derive an alternative expression for \(\mathbb{E}[Y|D,X]\) using the equation \(Y_i = \beta_0 + \alpha D_i + \beta_1 X_i + \beta_2 W_i + U_i\) as the starting point. In particular, by just plugging in \(Y\) from this expression into \(\mathbb{E}[Y|D,X]\), notice that:

\[ \begin{aligned} \mathbb{E}[Y|D, X] &= \mathbb{E}[\beta_0 + \alpha D + \beta_1 X + \beta_2 W + U | D, X] \\ &= \beta_0 + \alpha D + \beta_1 X + \beta_2 \mathbb{E}[W|D,X] + \mathbb{E}[U|D,X] \\ &= \beta_0 + \alpha D + \beta_1 X + \beta_2 (\gamma_0 + \gamma_1 D + \gamma_2 X) \\ &= \underbrace{(\beta_0 + \beta_2 \gamma_0)}_{\delta_0} + \underbrace{(\alpha + \beta_2 \gamma_1)}_{\delta_1} D + \underbrace{(\beta_1 + \beta_2 \gamma_2)}_{\delta_2} X \end{aligned} \]

where the first equality holds by plugging in \(Y\) from Equation (1), the second equality holds by properties of expectations (and since we are conditioning on \(D\) and \(X\)), the third equality holds from Equation (3) and because \(\mathbb{E}[U|D,X]=0\), and the fourth equality just rearranges terms. This is an alternative expression for the regression of \(Y\) on \(D\) and \(X\) in terms of the \(\beta\)’s and \(\gamma\)’s. Most importantly, it implies that

\[ \delta_1 = \alpha + \beta_2 \gamma_1 \]

In words, this means that the coefficient on \(D\) in a regression of \(Y\) on \(D\) and \(X\) (that ignores \(W\)) will not be equal to \(\alpha\) unless \(\beta_2=0\) (this would occur if the partial effect of \(W\) on \(Y\) is equal to 0) or \(\gamma_1=0\) (this would occur if \(D\) and \(W\) are uncorrelated after controlling for \(X\)).
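A quick simulation (with made-up parameter values, purely for illustration) confirms the formula: the coefficient on \(D\) in the short regression is approximately \(\alpha + \beta_2\gamma_1\), while the long regression that includes \(W\) recovers \(\alpha\).

set.seed(1)
n <- 100000
alpha <- 1; beta0 <- 0; beta1 <- 2; beta2 <- 3     # illustrative parameter values
gamma0 <- 0; gamma1 <- 0.5; gamma2 <- 1

x <- rnorm(n)
d <- rbinom(n, 1, 0.5)
w <- gamma0 + gamma1 * d + gamma2 * x + rnorm(n)            # W depends on D and X
y <- beta0 + alpha * d + beta1 * x + beta2 * w + rnorm(n)   # outcome equation

coef(lm(y ~ d + x))["d"]       # short regression: roughly alpha + beta2 * gamma1 = 2.5
alpha + beta2 * gamma1         # the formula derived above
coef(lm(y ~ d + x + w))["d"]   # long regression: recovers alpha = 1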