7.1 Linear Probability Model

SW 11.1

Let’s continue to consider

\[ \mathbb{E}[Y|X_1,X_2,X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \]

when \(Y\) is binary. Of course, you can still run this regression.

One thing that is helpful to notice before we really get started here is that when \(Y\) is binary (so that either \(Y=0\) or \(Y=1\))

\[ \begin{aligned} \mathbb{E}[Y] &= \sum_{y \in \mathcal{Y}} y \mathrm{P}(Y=y) \\ &= 0 \mathrm{P}(Y=0) + 1 \mathrm{P}(Y=1) \\ &= \mathrm{P}(Y=1) \end{aligned} \] And exactly the same sort of argument implies that, when \(Y\) is binary, \(\mathbb{E}[Y|X] = \mathrm{P}(Y=1|X)\). Thus, if we believe the model in the first part of this section, this result implies that

\[ \mathrm{P}(Y=1|X_1,X_2,X_3) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \] For this reason, the model in this section is called the linear probability model. Moreover, this further implies that we should interpret

\[ \beta_1 = \frac{\partial \, \mathrm{P}(Y=1|X_1,X_2,X_3)}{\partial \, X_1} \] as a partial effect. That is, \(\beta_1\) is how much the probability that \(Y=1\) changes when \(X_1\) increases by one unit, holding \(X_2\) and \(X_3\) constant. This is good (and simple), but there are some drawbacks:

  1. It’s possible to get nonsensical predictions (predicted probabilities that are less than 0 or greater than 1) with a linear probability model.

  2. A related problem is that the linear probability model implies constant partial effects. That is, a one-unit change in a regressor always changes the probability that \(Y=1\) (holding other regressors constant) by the same amount, regardless of the values of the regressors. It may not be obvious that this is a disadvantage, but it is: a probability that is already close to 0 or 1 has little room left to move.
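To see both drawbacks in a concrete fit, here is a minimal sketch that estimates a linear probability model by ordinary least squares with a single regressor. The data are made up purely for illustration (they are not from the text), and the variable names are hypothetical.

```python
# Toy data, made up for illustration: one regressor x and a binary outcome y.
x = [float(v) for v in range(10)]
y = [0.0] * 5 + [1.0] * 5

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form OLS for a single regressor: slope = S_xy / S_xx.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Fitted values are the predicted probabilities P(Y=1 | x) under the LPM.
fitted = [intercept + slope * xi for xi in x]

print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
print(f"min fitted = {min(fitted):.3f}, max fitted = {max(fitted):.3f}")
```

With these toy numbers, the fitted values at the two ends of the range of \(x\) fall below 0 and above 1, and the estimated slope is the same partial effect at every value of \(x\).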

Example 7.1 Let \(Y=1\) if an individual participates in the labor force. Further let \(X_1=1\) if an individual is male and 0 otherwise, \(X_2\) denote an individual’s age, and \(X_3=1\) for college graduates and 0 otherwise.

Additionally, suppose that \(\beta_0=0.4, \beta_1=0.2, \beta_2=0.01, \beta_3=0.1\).

Let’s calculate the probability of being in the labor force for a 40-year-old woman who is not a college graduate. This is given by

\[ \mathrm{P}(Y=1 | X_1=0, X_2=40, X_3=0) = 0.4 + (0.01)(40) = 0.8 \] In other words, we’d predict that, given these characteristics, the probability of being in the labor force is 0.8.

Now, let’s calculate the probability of being in the labor force for a 40-year-old man who is a college graduate. This is given by

\[ \mathrm{P}(Y=1|X_1=1, X_2=40, X_3=1) = 0.4 + 0.2 + (0.01)(40) + 0.1 = 1.1 \] We have calculated that the predicted probability of being in the labor force, given these characteristics, is 1.1, which makes no sense! A predicted probability should never be larger than 1.
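The two calculations in this example can be reproduced with a few lines of code. This is only a sketch: the coefficients are the ones assumed in Example 7.1, while the function name `lpm_probability` and its argument names are hypothetical.

```python
# Coefficients assumed in Example 7.1.
BETA_0, BETA_1, BETA_2, BETA_3 = 0.4, 0.2, 0.01, 0.1


def lpm_probability(male, age, college):
    """Predicted P(Y=1 | X) under the linear probability model of Example 7.1."""
    return BETA_0 + BETA_1 * male + BETA_2 * age + BETA_3 * college


# 40-year-old woman who is not a college graduate.
p_woman = lpm_probability(male=0, age=40, college=0)

# 40-year-old man who is a college graduate.
p_man = lpm_probability(male=1, age=40, college=1)

print(f"{p_woman:.2f}")  # 0.80
print(f"{p_man:.2f}")    # 1.10, which is not a valid probability
```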

The problem of constant partial effects is closely related. Here, labor force participation is increasing in age, but with a binary outcome (by construction) the effect has to die off. For those who are already very likely to participate in the labor force (in this example, older men with a college education), the partial effect of age has to be small because there is little room left for the probability to increase.
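As a rough illustration of the constant partial effect, the sketch below uses the coefficients from Example 7.1 to compute the effect of one additional year of age for a male college graduate at several ages. The function name `lpm_probability` and the particular ages are hypothetical choices made for illustration.

```python
# Coefficients assumed in Example 7.1.
BETA_0, BETA_1, BETA_2, BETA_3 = 0.4, 0.2, 0.01, 0.1


def lpm_probability(male, age, college):
    """Predicted P(Y=1 | X) under the linear probability model of Example 7.1."""
    return BETA_0 + BETA_1 * male + BETA_2 * age + BETA_3 * college


ages = [25, 40, 60]  # illustrative ages, not from the text
effects = []
for age in ages:
    baseline = lpm_probability(male=1, age=age, college=1)
    one_year_older = lpm_probability(male=1, age=age + 1, college=1)
    effects.append(one_year_older - baseline)
    # The effect of one more year is the same at every age, even once
    # the baseline prediction is already above 1.
    print(f"age {age}: baseline = {baseline:.2f}, effect = {effects[-1]:.2f}")
```

The effect of an extra year equals \(\beta_2 = 0.01\) at every age, so the model has no way to let the effect shrink as the predicted probability approaches 1.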

We can circumvent both of the main problems with the linear probability model by considering nonlinear models for binary outcomes. By far the most common are probit and logit. We will discuss these next.