Due: Friday, Nov. 19 at the beginning of class.

For this project, we are going to try to predict house prices. The data that we are going to use for the project comes from a fairly well known database of house selling prices from Ames, Iowa.


A description of the data is available in the file data_description.txt which is posted on ELC.

There are two data files:

What to do:

Step 1: Using the house_price_train.csv data, I want you to try at least 5 different models. Using only this data, I want you to rank these models from 1-5 in terms of which ones you think will predict house prices the best along with some explanation of why. Do not use house_price_test.csv at all for this step.

Step 2: Estimate a model (you can choose which covariates to include, but I recommend either the most complicated model that you estimated in Step 1 or a more complicated model) using Lasso and Ridge regression.

Step 3: Then, using the house_price_test.csv data, I would like for you to come up with a prediction for the selling price of each house in that data. For each model from Steps 1 and 2, I’d like for you to compute \[\begin{align*} \frac{1}{n} \sum_{i=1}^n (Y_i - \tilde{Y}_i)^2 \end{align*}\] where \(Y_i\) is the actual sale price and \(\tilde{Y}_i\) is the predicted sale price coming from each model. Then, rank each model according to how well it predicts house selling prices in the test data according to the above criteria.

Step 4: Discuss your results. In particular, discuss how well your rankings from the first step compared to rankings for out of sample predictions.

What to turn in

4-6 pages document that should include:

Some things to note

Grading Criteria

Estimate 5 models and rank them according to model selection criteria 5pts
Estimate (reasonably “complicated”) model using Lasso and Ridge regression 5pts
Rank all 7 models by out sample prediction quality 5pts
Discussion, clarity of arguments, etc. 5pts