5.1 Nonparametric Regression / Curse of Dimensionality

If you knew nothing about regressions, it would seem natural to try to estimate \(\mathbb{E}[Y|X_1=x_1,X_2=x_2,X_3=x_3]\) by just calculating the average of \(Y\) among observations that have values of the regressors equal to \(x_1\), \(x_2\), and \(x_3\) (if these are discrete) or that are, in some sense, close to \(x_1\), \(x_2\), and \(x_3\) (if these are continuous).

This is actually a pretty attractive idea.

However, you run into the issue that it is practically challenging to do this when the number of regressors starts to get large (i.e., if you have 10 regressors, generally, you would need tons of data to be able to find a suitable number of observations that are “close” to any particular value of the regressors).

Let me give a more concrete example. Suppose that you were trying to estimate mean house price as a function of a house’s characteristics. If the only characteristic of the house that you knew was the number of bedrooms, then it would be pretty easy to just calculate the average house price among houses with 2, 3, 4, etc. bedrooms. Now suppose that you knew both the number of bedrooms and the number of square feet. In this case, if we wanted to estimate mean house prices as a function of these characteristics, we would need to find houses that have the same number of bedrooms and (at least) a similar number of square feet. This starts to “slice” the data that you have more thinly. If you continue with this idea (suppose that you want to estimate mean house price as a function of number of bedrooms, number of bathrooms, number of square feet, what year the house was built in, whether or not it has a basement, what zip code it is located in, etc.) then you will start to stretch your data extremely thin to the point that you may have very few relevant observations (or perhaps no relevant observations) for particular values of the characteristics.

This issue is called the “curse of dimensionality”.

We will focus on linear models for \(\mathbb{E}[Y|X_1,X_2,X_3]\) largely to get around the curse of dimensionality.

This idea of using observations that are very close in terms of characteristics in order to estimate a conditional expectation is called nonparametric econometrics/statistics. You can take entire courses (typically graduate-level) on this topic if you were interested. The reason that it is is called nonparametric is that it doesn’t involve making any functional form assumptions (like linearity) but the cost is that it would typically require many more observations (due to the curse of dimensionality).