Chapter 4, Coding Question 1

# load data
library(Ecdat)
data("Airq", package="Ecdat")

# a) estimate mean rainfall
ybar <- mean(Airq$rain)
ybar
## [1] 36.078
# b) standard error
V <- var(Airq$rain)
n <- nrow(Airq)
se <- sqrt(V)/sqrt(n)
se
## [1] 2.462628
# c) t-statistic
h0 <- 25
t <- (ybar-h0)/se
t
## [1] 4.498446

Since \(|t| > 1.96\), we would reject \(H_0\) at the 5% significance level.

# d) p-value
pval <- 2*pnorm(-abs(t))
pval
## [1] 6.845183e-06

There is virtually a 0 percent chance of getting a t-statistic this large in absolute value if the null hypothesis were true.

# e) confidence interval
ciL <- ybar - 1.96*se
ciU <- ybar + 1.96*se
paste0("[",round(ciL,3),", ", round(ciU,3), "]")
## [1] "[31.251, 40.905]"
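
Parts (a) through (e) repeat the same few steps, so they can be bundled into one function. The following is a minimal sketch, not part of the original solution: `mean_inference` and its arguments `h0` and `level` are hypothetical names introduced here, and it relies on the same normal approximation used above.

```r
# Hypothetical helper (not from the original solution): normal-approximation
# inference for a population mean, following parts (a)-(e)
mean_inference <- function(x, h0 = 0, level = 0.95) {
  n <- length(x)
  ybar <- mean(x)                      # (a) sample mean
  se <- sqrt(var(x) / n)               # (b) standard error
  tstat <- (ybar - h0) / se            # (c) t-statistic
  pval <- 2 * pnorm(-abs(tstat))       # (d) two-sided p-value
  crit <- qnorm(1 - (1 - level) / 2)   # about 1.96 when level = 0.95
  ci <- c(ybar - crit * se, ybar + crit * se)  # (e) confidence interval
  list(ybar = ybar, se = se, t = tstat, pval = pval, ci = ci)
}

# Small made-up vector, for illustration only
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
res <- mean_inference(x, h0 = 3)
res$t
res$ci
```

With the data loaded as above, `mean_inference(Airq$rain, h0 = 25)` should reproduce the numbers in parts (a) through (d). The interval may differ very slightly in later decimals because `qnorm(0.975)` is not exactly 1.96.
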
# f) summary statistics
library(modelsummary)
datasummary_balance(~coas, Airq)
        no (N=9)              yes (N=21)
        Mean      Std. Dev.   Mean      Std. Dev.   Diff. in Means   Std. Error
airq    125.3     10.5        95.9      28.7        -29.5            7.2
vala    4118.2    5909.8      4218.6    4136.7      100.4            2166.9
rain    32.3      7.6         37.7      15.2        5.4              4.2
dens    1706.4    3014.6      1738.1    2821.2      31.7             1178.5
medi    6290.3    10065.4     10842.2   13396.8     4551.9           4450.1

Chapter 5, Coding Question 1

# a)
data("Caschool", package="Ecdat")
reg <- lm(testscr ~ str + avginc + elpct, data=Caschool)
summary(reg)
## 
## Call:
## lm(formula = testscr ~ str + avginc + elpct, data = Caschool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.800  -6.862   0.275   6.586  31.199 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 640.31550    5.77489 110.879   <2e-16 ***
## str          -0.06878    0.27691  -0.248    0.804    
## avginc        1.49452    0.07483  19.971   <2e-16 ***
## elpct        -0.48827    0.02928 -16.674   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.35 on 416 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.7051 
## F-statistic: 334.9 on 3 and 416 DF,  p-value: < 2.2e-16

avginc and elpct are statistically different from 0, while str is not statistically different from 0. We can tell by comparing the absolute value of each t-statistic in the column labeled “t value” to 1.96. The coefficients whose t-statistics are larger than 1.96 in magnitude are statistically different from 0 at the 5% significance level.
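
The comparison to 1.96 can also be checked by hand. This is a small sketch that copies the estimates and standard errors from the regression output and recomputes each t-statistic as the estimate divided by its standard error. The vector names `est` and `se_hat` are labels chosen here, not objects from the original code.

```r
# Estimates and standard errors copied from the summary(reg) output above
est    <- c(str = -0.06878, avginc = 1.49452, elpct = -0.48827)
se_hat <- c(str =  0.27691, avginc = 0.07483, elpct =  0.02928)

# t-statistic for H0: coefficient = 0
tvals <- est / se_hat
round(tvals, 3)

# TRUE means statistically different from 0 at the 5% level
abs(tvals) > 1.96
```

Because the inputs are rounded, the recomputed t-statistics can differ from the "t value" column in the last digit or two.
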

# b)
mean(Caschool$testscr)
## [1] 654.1565

The average test score in the data is a little over 654.

# c)
predict(reg, newdata=data.frame(str=20, avginc=30, elpct=10))
##        1 
## 678.8928

The predicted value here, about 678.9, is roughly 25 points higher than the overall sample average from part (b).

# d)
predict(reg, newdata=data.frame(str=15, avginc=30, elpct=10))
##        1 
## 679.2367

The predicted value here is almost the same as (slightly larger than) the one in part (c). The reason is that the estimated coefficient on str from the original regression is very small, so lowering the student-teacher ratio by 5 raises the predicted value by only about 0.34 (that is, 5 × 0.06878).
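
To see where the two predictions come from, they can be rebuilt from the reported coefficients. This is a sketch that assumes the rounded estimates printed in part (a) are accurate enough for the purpose; `b`, `x_c`, and `x_d` are names introduced here for illustration.

```r
# Coefficients copied (rounded) from the regression output in part (a)
b <- c(intercept = 640.31550, str = -0.06878, avginc = 1.49452, elpct = -0.48827)

# Regressor values for parts (c) and (d); the leading 1 multiplies the intercept
x_c <- c(1, 20, 30, 10)
x_d <- c(1, 15, 30, 10)

pred_c <- sum(b * x_c)
pred_d <- sum(b * x_d)
c(pred_c, pred_d)

# The gap between the predictions is the change in str (-5) times its coefficient
pred_d - pred_c
-5 * b[["str"]]
```
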

Question 3

Part (a)

\[\begin{align*} t &= \frac{\sqrt{n}(\bar{Y} - \mu_0)}{\sqrt{\widehat{\mathrm{var}}(Y)}} \\ &= \frac{\sqrt{100}(63 - 50)}{\sqrt{225}} \\ &= \frac{(10)(13)}{15} \\ &= 8.67 \end{align*}\]

We reject \(H_0\) here since \(|t| > 1.96\).

Part (b)

\[\begin{align*} \textrm{s.e.}(\bar{Y}) &= \frac{\sqrt{\widehat{\mathrm{var}}(Y)}}{\sqrt{n}} \\ &= \frac{\sqrt{225}}{\sqrt{100}} \\ &= \frac{15}{10} = 1.5 \end{align*}\]

Part (c)

\[\begin{align*} \textrm{p-value} &= 2 \Phi(-|t|) = 2\Phi(-8.67) \approx 0 \end{align*}\]

The p-value is essentially 0 here. It says that, if \(H_0\) were true, then the probability that we would have calculated a t-statistic at least as extreme as 8.67 is essentially 0. In other words, we have very strong evidence against our theory that \(\mathbb{E}[Y] = 50\).

Part (d)

\[\begin{align*} CI_{95\%} &= [\bar{Y} - 1.96 \textrm{s.e.}(\bar{Y})\, , \bar{Y} + 1.96 \textrm{s.e.}(\bar{Y})] \\ &= [63 - (1.96)(1.5)\, , 63 + (1.96)(1.5)] \\ &= [60.06\, , 65.94] \end{align*}\]

There is a 95% chance that the interval \([60.06\, , 65.94]\) contains \(\mathbb{E}[Y]\).
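
Parts (a) through (d) can be verified numerically. This is a minimal sketch using only the summary numbers that appear in the calculations above (n = 100, sample mean 63, estimated variance 225, hypothesized mean 50); the variable names are chosen here for illustration.

```r
# Summary numbers used in the calculations above
n <- 100
ybar <- 63
v_hat <- 225
mu0 <- 50

se <- sqrt(v_hat) / sqrt(n)        # part (b)
t_stat <- (ybar - mu0) / se        # part (a), same as sqrt(n)*(ybar - mu0)/sqrt(v_hat)
pval <- 2 * pnorm(-abs(t_stat))    # part (c)
ci <- ybar + c(-1, 1) * 1.96 * se  # part (d)

se
t_stat
pval
ci
```
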

Part (e)

For part (a), changing the significance level to 1% does not change the t-statistic, but it does change the critical value. Instead of using the critical value 1.96, we should use the critical value of 2.58 here. In either case, we would continue to reject \(H_0\).

For parts (b) and (c), neither the standard error nor the p-value changes when we change the significance level.

For part (d), we calculate a 99% confidence interval by using 2.58 as the critical value. Thus,

\[\begin{align*} CI_{99\%} &= [\bar{Y} - 2.58 \textrm{s.e.}(\bar{Y})\, , \bar{Y} + 2.58 \textrm{s.e.}(\bar{Y})] \\ &= [63 - (2.58)(1.5)\, , 63 + (2.58)(1.5)] \\ &= [59.13\, , 66.87] \end{align*}\]

There is a 99% chance that the interval \([59.13\, , 66.87]\) contains \(\mathbb{E}[Y]\). Notice that the 99% confidence interval is wider than the 95% confidence interval that we reported earlier.
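
The critical values 1.96 and 2.58 are rounded quantiles of the standard normal distribution. The sketch below shows one way to obtain them with `qnorm` and to rebuild both intervals. `ci_normal` is a hypothetical helper name introduced here, and the inputs are the sample mean and standard error from this question.

```r
# Hypothetical helper: normal-approximation confidence interval
ci_normal <- function(ybar, se, level) {
  crit <- qnorm(1 - (1 - level) / 2)  # two-sided critical value
  c(lower = ybar - crit * se, upper = ybar + crit * se)
}

qnorm(0.975)  # about 1.96, used for the 95% interval
qnorm(0.995)  # about 2.58, used for the 99% interval

ci95 <- ci_normal(63, 1.5, 0.95)
ci99 <- ci_normal(63, 1.5, 0.99)
round(ci95, 2)
round(ci99, 2)
```

Because 2.58 is rounded up from roughly 2.576, the unrounded 99% interval comes out slightly narrower than the one computed by hand, though it is still wider than the 95% interval.
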