5.14 Coding Questions

  1. For this problem, we will use the data Caschool. This data contains information about test scores for schools in California for the 1998-1999 academic year. We will use the variables testscr (average test score in the school), str (student-teacher ratio in the school), avginc (average income in the school district), and elpct (percentage of English learners in the school).

    1. Run a regression of test scores on student-teacher ratio, average income, and English learners percentage. Report your results. Which regressors are statistically significant? How do you know?

    2. What is the average test score across all schools in the data?

    3. What is the predicted average test score for a school with a student-teacher ratio of 20, average income of $30,000, and 10% English learners? How does this compare to the overall average test score from part 2?

    4. What is the predicted average test score for a school with a student-teacher ratio of 15, average income of $30,000, and 10% English learners? How does this compare to your answer from part 3?

  2. For this problem, we will use the data intergenerational_mobility.

    1. Run a regression of child family income (\(child\_fincome\)) on parents’ family income (\(parent\_fincome\)). How should you interpret the estimated coefficient on parents’ family income? What is the p-value for the coefficient on parents’ family income?

    2. Run a regression of \(\log(child\_fincome)\) on \(parent\_fincome\). How should you interpret the estimated coefficient on \(parent\_fincome\)?

    3. Run a regression of \(child\_fincome\) on \(\log(parent\_fincome)\). How should you interpret the estimated coefficient on \(\log(parent\_fincome)\)?

    4. Run a regression of \(\log(child\_fincome)\) on \(\log(parent\_fincome)\). How should you interpret the estimated coefficient on \(\log(parent\_fincome)\)?

  3. For this question, we’ll use the fertilizer_2000 data.

    1. Run a regression of \(\log(avyield)\) on \(\log(avfert)\). How do you interpret the estimated coefficient on \(\log(avfert)\)?

    2. Now suppose that you additionally want to control for precipitation and the region that a country is located in. How would you do this? Estimate the model that you propose here, report the results, and interpret the coefficient on \(\log(avfert)\).

    3. Now suppose that you are interested in whether the effect of fertilizer varies by the region in which a country is located (while still controlling for the same covariates as in part 2). Propose a model that can be used for this purpose. Estimate the model that you proposed, report the results, and discuss whether or not the effect of fertilizer appears to vary by region.

  4. For this question, we will use the data mutual_funds. We’ll be interested in whether mutual funds that have higher expense ratios (these are typically actively managed funds) have higher returns relative to mutual funds that have lower expense ratios (e.g., index funds). For this problem, we will use the variables fund_return_3years, investment_type, risk_rating, size_type, fund_net_annual_expense_ratio, asset_cash, asset_stocks, asset_bonds.

    1. Calculate the median fund_net_annual_expense_ratio.

    2. Use the datasummary_balance function from the modelsummary package to report summary statistics for fund_return_3years, fund_net_annual_expense_ratio, risk_rating, asset_cash, asset_stocks, and asset_bonds, separately for funds whose expense ratio is above the median and funds whose expense ratio is below the median. Do you notice any interesting patterns?

    3. Run a regression of fund_return_3years on fund_net_annual_expense_ratio. How do you interpret the results?

    4. Now, additionally control for investment_type, risk_rating, and size_type. Hint: think carefully about what type of variable each of these is and how it should enter the model. How do these results compare to the ones from part 3?

    5. Now, add the variables asset_cash, asset_stocks, and asset_bonds to the model from part 4. How do you interpret these results? Compare and interpret the differences between parts 3, 4, and 5.

  5. For this question, we’ll use the data Lead_Mortality to study the effect of lead pipes on infant mortality in 1900.

    1. Run a regression of infant mortality (infrate) on whether or not a city had lead pipes (lead) and interpret/discuss the results.

    2. It turns out that the amount of lead in drinking water depends on how acidic the water is: more acidic water leaches more lead from the pipes, so exposure to lead is higher when the water is more acidic. To measure acidity, we’ll use the pH of the water in a particular city (ph); recall that a lower value of pH indicates higher acidity. Run a regression of infant mortality on whether or not a city has lead pipes, the pH of its water, and the interaction between having lead pipes and pH. Report your results. What is the estimated partial effect of having lead pipes from this model?

    3. Given the results in part 2, calculate an estimate of the average partial effect of having lead pipes on infant mortality.

    4. Given the results in part 2, how much does the partial effect of having lead pipes differ for cities that have a pH of 6.5 relative to a pH of 7.5?
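
As a notational reminder for the lead pipes problem, a regression with an interaction between lead and ph can be sketched as follows. The symbols \(\beta_0, \ldots, \beta_3\) and \(U\) are labels introduced here for illustration; they are not names from the data, and the numerical estimates are left for you to compute.

\[
infrate = \beta_0 + \beta_1 \, lead + \beta_2 \, ph + \beta_3 \, (lead \times ph) + U
\]

Because lead is binary, the partial effect of having lead pipes at a given pH is the difference between the model with \(lead = 1\) and the model with \(lead = 0\), holding ph fixed:

\[
(\beta_0 + \beta_1 + \beta_2 \, ph + \beta_3 \, ph) - (\beta_0 + \beta_2 \, ph) = \beta_1 + \beta_3 \, ph
\]

Under this specification the partial effect is a function of pH rather than a single number. Averaging it over the distribution of pH gives \(\beta_1 + \beta_3 \, E[ph]\), which can be estimated by plugging in the estimated coefficients and the sample mean of ph. Likewise, the partial effects at two pH values \(a\) and \(b\) differ by \(\beta_3 (a - b)\).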