Project 1

Due: Monday, Nov. 18 at the beginning of class.

For this project, we are going to try to predict house prices. The data for the project comes from a fairly well-known database of house selling prices from Ames, Iowa.

Data

A description of the data is available in the file data_description.txt which is posted on the course website.

There are two data files:

house_price_train.csv: use this data to train and select the models that you want to estimate. The outcome of interest is SalePrice. You are free to use whichever of the other available variables you think are important.
house_price_test.csv: this data has the same columns as house_price_train.csv, but is out-of-sample (i.e., it is new data). Use this only in Step 3 below.

What to do:

Step 1: Using the house_price_train.csv data, I want you to try at least 5 different models. Using only this data, I want you to rank these models from 1-5 in terms of which ones you think will predict house prices the best along with some explanation of why. Do not use house_price_test.csv at all for this step.
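One way to rank models using only the training data is k-fold cross-validation. It is not the only valid criterion (AIC, BIC, or adjusted R-squared would also work). The sketch below is a minimal illustration in Python with NumPy; the choice of language, the function name kfold_cv_mse, and the synthetic data are all assumptions made for this example and are not part of the assignment. In your project, the synthetic arrays would be replaced by columns from house_price_train.csv.

```python
import numpy as np

def kfold_cv_mse(X, y, k=5, seed=0):
    """Estimate the out-of-sample MSE of an OLS regression by k-fold CV."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    fold_mse = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        # Add an intercept column, then fit OLS on the training folds only.
        X_train = np.column_stack([np.ones(len(train)), X[train]])
        X_val = np.column_stack([np.ones(len(held_out)), X[held_out]])
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        fold_mse.append(np.mean((y[held_out] - X_val @ beta) ** 2))
    return float(np.mean(fold_mse))

# Synthetic stand-in for the training data (hypothetical, NOT the Ames data).
rng = np.random.default_rng(42)
n = 200
area = rng.uniform(50, 250, n)
unrelated = rng.normal(size=n)  # a regressor with no effect on price
price = 50 + 2.0 * area + rng.normal(scale=10, size=n)

# Each candidate model is just a different set of regressors.
candidates = {
    "area only": area.reshape(-1, 1),
    "unrelated only": unrelated.reshape(-1, 1),
    "area + area^2": np.column_stack([area, area**2]),
}
cv_scores = {name: kfold_cv_mse(X, price) for name, X in candidates.items()}

# Lower cross-validated MSE means a better predicted ranking.
ranking = sorted(cv_scores, key=cv_scores.get)
for name in ranking:
    print(f"{name}: CV MSE = {cv_scores[name]:.1f}")
```

Because every fold's validation rows are held out when the model is fit, the score only uses the training file, which is what Step 1 requires.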

Step 2: Estimate a model using both Lasso and Ridge regression. You can choose which regressors to include, but I recommend either the most complicated model that you estimated in Step 1 or an even more complicated one.
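As a rough illustration of Step 2, the sketch below fits Lasso and Ridge with cross-validated penalty parameters using scikit-learn. The use of Python and scikit-learn, the synthetic data, and the grid of Ridge penalties are assumptions for this example only; nothing here is prescribed by the assignment. Regressors are standardized first because both penalties depend on the scale of the coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a "complicated" model (hypothetical, NOT the Ames data):
# 20 regressors, of which only the first 3 actually matter.
rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [5.0, -3.0, 2.0]
y = X @ true_beta + rng.normal(size=n)

# Standardize, then choose the penalty by cross-validation on the training data.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 25)))
lasso.fit(X, y)
ridge.fit(X, y)

lasso_coef = lasso.named_steps["lassocv"].coef_
ridge_coef = ridge.named_steps["ridgecv"].coef_

# Lasso can set coefficients exactly to zero; Ridge only shrinks them.
n_zero_lasso = int(np.sum(lasso_coef == 0))
n_zero_ridge = int(np.sum(ridge_coef == 0))
print("Lasso coefficients set to zero:", n_zero_lasso)
print("Ridge coefficients set to zero:", n_zero_ridge)
```

The fitted pipelines can later be applied to new data with lasso.predict(...) and ridge.predict(...), which reuses the scaling learned from the training data.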

Step 3: Using the house_price_test.csv data, I would like you to come up with a prediction for the selling price of each house in that data. For each model from Steps 1 and 2, I'd like you to compute the mean squared prediction error \[ \frac{1}{n} \sum_{i=1}^n (Y_i - \tilde{Y}_i)^2 \] where \(n\) is the number of houses in the test data, \(Y_i\) is the actual sale price, and \(\tilde{Y}_i\) is the predicted sale price coming from each model. Then, rank the models by how well they predict house selling prices in the test data according to this criterion (smaller is better).
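The criterion above can be computed in a few lines. The sketch below is a minimal example in plain Python; the model names and prices are made up for illustration and are not taken from the Ames data.

```python
def mean_squared_error(actual, predicted):
    """Average squared gap between actual and predicted sale prices."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / len(actual)

# Made-up numbers for illustration only.
y_actual = [200000.0, 150000.0, 310000.0]
predictions = {
    "model_a": [210000.0, 140000.0, 300000.0],
    "model_b": [200000.0, 180000.0, 310000.0],
}

mse_by_model = {
    name: mean_squared_error(y_actual, y_hat) for name, y_hat in predictions.items()
}

# Smaller mean squared error means better out-of-sample predictions.
ranking = sorted(mse_by_model, key=mse_by_model.get)
print(ranking)
```

Here model_a misses every house by 10,000 while model_b misses one house by 30,000, so model_a has the smaller mean squared error: squaring penalizes one large miss more than several small ones.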

Step 4: Discuss your results. In particular, discuss how well your rankings from Step 1 compare to the rankings based on the out-of-sample predictions in Step 3.

What to turn in

A 4-6 page document that includes:

a description of the models that you estimate
a ranking of the models based on step 1
an explanation for this ranking
a ranking based on the out-of-sample predictions from step 3
a discussion of how well your model rankings from step 1 match up with your model rankings from step 3
at least one relevant plot

Some things to note

The data is somewhat “messier” than the data we have typically worked with. Dealing with this sort of data is part of the challenge of the project.

Grading Criteria

Estimate 5 models and rank them according to model selection criteria	5pts
Estimate (reasonably “complicated”) model using Lasso and Ridge regression	5pts
Rank all 7 models by out-of-sample prediction quality	5pts
Discussion, clarity of arguments, etc.	5pts