**Due:** Monday, Nov. 21 at the beginning of class.
Note: we will not meet in person this day, so please submit the project through ELC by the start of our normal class time.

For this project, we are going to try to predict house prices. The data comes from a fairly well-known database of house selling prices from Ames, Iowa.

A description of the data is available in the file `data_description.txt`, which is posted on ELC.

There are two data files:

- `house_price_train.csv`: use this data to train / pick the model that you want to estimate. The outcome of interest is `SalePrice`. Otherwise, you are free to use whatever available variables you would like or think are important.

- `house_price_test.csv`: this data has the same columns as `house_price_train.csv`, but is out-of-sample (i.e., it is new data). Use this *only* in Step 3 below.

Step 1: Using the `house_price_train.csv` data, I want you to try at least 5 different models. Using only this data, I want you to rank these models from 1-5 in terms of which ones you think will predict house prices the best, along with some explanation of why. **Do not use `house_price_test.csv` at all for this step.**
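One possible way to rank candidate models using only the training data is k-fold cross-validation. The sketch below is an illustration rather than the required method (adjusted R², AIC, or BIC are other common model selection criteria). The helper names `cv_mse` and `rank_models` and the simulated data are assumptions made for the example. With the real data, `y` would be `SalePrice` and each matrix would hold the regressors of one candidate model.

```python
import numpy as np

def cv_mse(X, y, k=5, seed=0):
    """Average k-fold cross-validated MSE of an OLS fit of y on X (intercept added)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])            # add an intercept column
    idx = np.random.default_rng(seed).permutation(n)  # shuffle before splitting
    fold_errors = []
    for hold in np.array_split(idx, k):
        train = np.setdiff1d(idx, hold)               # everything not held out
        beta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        resid = y[hold] - X1[hold] @ beta
        fold_errors.append(np.mean(resid ** 2))
    return float(np.mean(fold_errors))

def rank_models(models, y, k=5):
    """models maps a name to a regressor matrix; returns names sorted best first."""
    scores = {name: cv_mse(X, y, k=k) for name, X in models.items()}
    return sorted(scores, key=scores.get), scores

# Simulated stand-in for the training data (NOT the Ames data).
rng = np.random.default_rng(1)
n = 200
x1, x2, x3 = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)
y = 3.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

models = {
    "x1 only": x1,
    "x1 + x2": np.column_stack([x1, x2]),
    "irrelevant": x3,
}
ranking, scores = rank_models(models, y)
print(ranking)
```

Because every fold is held out from the fit that predicts it, the cross-validated error mimics out-of-sample prediction without touching `house_price_test.csv`.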

Step 2: Estimate a model (you can choose which regressors to include, but I recommend either the most complicated model that you estimated in Step 1 or a more complicated model) using Lasso and Ridge regression.
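As a rough sketch of Step 2, the following fits Lasso and Ridge regressions with scikit-learn and picks each penalty by cross-validation. This assumes scikit-learn is available and that the regressors are already numeric. The function name `fit_lasso_and_ridge`, the penalty grid, and the simulated data are choices made for this example, not requirements of the project. Regressors are standardized first because both penalties depend on the scale of the coefficients.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_lasso_and_ridge(X, y, seed=0):
    """Fit Lasso and Ridge on standardized regressors, tuning the penalty by CV."""
    lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=seed))
    ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-3, 3, 25)))
    lasso.fit(X, y)
    ridge.fit(X, y)
    return lasso, ridge

# Simulated stand-in for a "complicated" model: 10 regressors, only 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso, ridge = fit_lasso_and_ridge(X, y)
lasso_coefs = lasso[-1].coef_   # coefficients on the standardized regressors
ridge_coefs = ridge[-1].coef_
print("Lasso nonzero coefficients:", np.count_nonzero(lasso_coefs))
print("Ridge nonzero coefficients:", np.count_nonzero(ridge_coefs))
```

Lasso can set some coefficients exactly to zero, so it also acts as a variable selection device, while Ridge shrinks coefficients toward zero without eliminating them.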

Step 3: Then, using the `house_price_test.csv` data, I would like for you to come up with a prediction for the selling price of each house in that data. For each model from Steps 1 and 2, I’d like for you to compute the mean squared prediction error
\[
\frac{1}{n} \sum_{i=1}^n (Y_i - \tilde{Y}_i)^2
\]
where \(n\) is the number of houses in the test data, \(Y_i\) is the actual sale price, and \(\tilde{Y}_i\) is the predicted sale price coming from each model. Then, rank each model according to how well it predicts house selling prices in the test data according to the above criterion (smaller is better).
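The criterion above can be computed directly from the actual and predicted prices. The sketch below uses made-up numbers and hypothetical model names. `out_of_sample_mse` and `rank_by_mse` are illustrative helpers, not functions from any library.

```python
def out_of_sample_mse(actual, predicted):
    """Mean squared prediction error: (1/n) * sum of (Y_i - Ytilde_i)^2."""
    actual, predicted = list(actual), list(predicted)
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted prices must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rank_by_mse(actual, predictions):
    """predictions maps a model name to its predicted prices; best model first."""
    scores = {name: out_of_sample_mse(actual, p) for name, p in predictions.items()}
    return sorted(scores, key=scores.get), scores

# Made-up sale prices and predictions, for illustration only.
actual_prices = [200000, 150000, 300000]
predictions = {
    "model A": [210000, 140000, 310000],  # off by 10,000 for every house
    "model B": [250000, 150000, 300000],  # exact twice, off by 50,000 once
}
ranking, scores = rank_by_mse(actual_prices, predictions)
print(ranking)
```

Squaring the errors penalizes one large miss more heavily than several small ones, which is why model B ranks below model A here even though it predicts two of the three houses exactly.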

Step 4: Discuss your results. In particular, discuss how well your rankings from the first step compare to the rankings based on out-of-sample predictions.

Turn in a 4-6 page document that should include:

- a description of the models that you estimate

- a ranking of the models based on Step 1

- an explanation for this ranking

- a ranking based on the out-of-sample predictions from Step 3

- a discussion of how well your model rankings from Step 1 match up to your model rankings from Step 3

- at least one relevant plot

- The data is somewhat “messier” than what we have typically worked with. Dealing with this sort of data is part of the challenge of the project.
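Two common kinds of messiness are missing values and text-valued columns. The sketch below shows one possible way to handle both with pandas, learning every fill value from the training data only so that the test data stays out-of-sample. The helper `prepare_features` is hypothetical, median imputation is just one of several reasonable choices, and the column names other than `SalePrice` in the toy example are placeholders (check `data_description.txt` for the actual variables).

```python
import pandas as pd

def prepare_features(train, test, outcome="SalePrice"):
    """Return numeric regressors for train and test, plus the training outcome."""
    y_train = train[outcome]
    X_train = train.drop(columns=[outcome])
    X_test = test.drop(columns=[outcome], errors="ignore")

    # Fill missing numeric values with medians learned from the training data only.
    numeric_cols = X_train.select_dtypes(include="number").columns
    medians = X_train[numeric_cols].median()
    X_train = X_train.fillna(medians)
    X_test = X_test.fillna(medians)

    # One-hot encode text columns; dummy_na=True makes "missing" its own category.
    X_train = pd.get_dummies(X_train, dummy_na=True, dtype=float)
    X_test = pd.get_dummies(X_test, dummy_na=True, dtype=float)

    # Give the test data exactly the training columns: categories never seen in
    # training are dropped, and categories absent from the test data become 0.
    X_test = X_test.reindex(columns=X_train.columns, fill_value=0.0)
    return X_train, y_train, X_test

# Toy data with placeholder columns, for illustration only.
train = pd.DataFrame({
    "LotArea": [8000, None, 12000],
    "Street": ["Pave", "Grvl", None],
    "SalePrice": [200000, 150000, 300000],
})
test = pd.DataFrame({
    "LotArea": [None, 9000],
    "Street": ["Pave", "Dirt"],
    "SalePrice": [210000, 160000],
})
X_train, y_train, X_test = prepare_features(train, test)
print(X_train.columns.tolist())
```

Aligning the test columns to the training columns matters because a model estimated on the training data expects the same regressors, in the same order, when it predicts.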

| Component | Points |
|---|---|
| Estimate 5 models and rank them according to model selection criteria | 5pts |
| Estimate (reasonably “complicated”) model using Lasso and Ridge regression | 5pts |
| Rank all 7 models by out-of-sample prediction quality | 5pts |
| Discussion, clarity of arguments, etc. | 5pts |