Due: Tuesday, Nov. 22 at the beginning of class.
For this project, we are going to try to predict house prices. The data that we are going to use for the project comes from a fairly well known database of house selling prices from Ames, Iowa.
A description of the data is available in the file
data_description.txt
which is posted on ELC.
There are two data files:
house_price_train.csv
— use this data to train /
pick the model that you want to estimate. The outcome of interest is
SalePrice
. Otherwise, you are free to use whatever
available variables you would like or think are important.
house_price_test.csv
— this data has the same
columns as in house_price_train.csv
, but is out-of-sample
(i.e., it is new data). Use this only in the second step
below.
Step 1: Using the house_price_train.csv
data, I want you
to try at least 5 different models. Using only this data, I want you to
rank these models from 1-5 in terms of which ones you think will predict
house prices the best along with some explanation of why. Do not
use house_price_test.csv
at all for this step.
Step 2: Estimate a model (you can choose which covariates to include, but I recommend either the most complicated model that you estimated in Step 1 or a more complicated model) using Lasso and Ridge regression.
Step 3: Then, using the house_price_test.csv
data, I
would like for you to come up with a prediction for the selling price of
each house in that data. For each model from Steps 1 and 2, I’d like for
you to compute \[\begin{align*}
\frac{1}{n} \sum_{i=1}^n (Y_i - \tilde{Y}_i)^2
\end{align*}\] where \(Y_i\) is
the actual sale price and \(\tilde{Y}_i\) is the predicted sale price
coming from each model. Then, rank each model according to how well it
predicts house selling prices in the test data according to the above
criteria.
Step 4: Discuss your results. In particular, discuss how well your rankings from the first step compared to rankings for out of sample predictions.
4-6 pages document that should include:
a description of the models that you estimate
a ranking of the models based on step 1
an explanation for this ranking
a ranking based on the out-of-sample predictions from step 2
a discussion of how well your model rankings from step 1 match up to your model rankings from step 2
at least one relevant plot
Estimate 5 models and rank them according to model selection criteria | 5pts |
Estimate (reasonably “complicated”) model using Lasso and Ridge regression | 5pts |
Rank all 7 models by out sample prediction quality | 5pts |
Discussion, clarity of arguments, etc. | 5pts |