7.2 Probit and Logit

SW 11.2, 11.3

Let’s start this section with probit. A probit model arises from setting

\[ \mathrm{P}(Y=1|X_1,X_2,X_3) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \] where \(\Phi\) is the cdf of a standard normal random variable. The model is nonlinear in the parameters because they enter through \(\Phi\).

Using \(\Phi\) (or any cdf) here has the useful property that, no matter what value the “index” \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\) takes, the cdf evaluated at it is always between 0 and 1. This implies that predicted probabilities can never fall outside of the interval from 0 to 1.
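The following R snippet is a small sketch of this property. The coefficient values and the grid of covariate values are made up purely for illustration; the only built-in function used is `pnorm`, which is R's standard normal cdf.

```r
# Hypothetical coefficient values, chosen only for illustration
beta0 <- -1; beta1 <- 0.5; beta2 <- 2; beta3 <- -0.3

# A grid of covariate values, including some fairly extreme ones
X <- expand.grid(X1 = c(-10, 0, 10), X2 = c(-5, 0, 5), X3 = c(-20, 0, 20))

# The index can be any real number
index <- beta0 + beta1 * X$X1 + beta2 * X$X2 + beta3 * X$X3

# Passing the index through the standard normal cdf gives probit probabilities
prob <- pnorm(index)

range(index)  # well outside the interval from 0 to 1
range(prob)   # never below 0 or above 1
```

If the index itself were used as the predicted probability, as in a linear probability model, many of these predictions would be negative or greater than 1.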

Thus, probit circumvents the problems with the linear probability model. That said, there are some things we have to be careful about. First, as usual, we are interested in partial effects rather than the parameters themselves. But partial effects are more complicated here. Notice that

\[ \begin{aligned} \frac{ \partial \, \mathrm{P}(Y=1|X_1,X_2,X_3)}{\partial \, X_1} &= \frac{\partial \, \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}{\partial \, X_1} \\ &= \phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \beta_1 \end{aligned} \] where \(\phi\) is the pdf of a standard normal random variable. The second equality uses the chain rule: take the derivative of the “outside” (i.e., \(\Phi\)) and then multiply by the derivative of the “inside” with respect to \(X_1\). Notice that this partial effect is more complicated than in the linear models that we have mainly considered. It involves \(\phi\), but more importantly it also depends on the values of all the covariates. In other words, the partial effect of \(X_1\) can vary across different values of \(X_1\), \(X_2\), and \(X_3\).
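To make this concrete, here is a minimal R sketch that evaluates the partial effect of \(X_1\) at a few different covariate values and checks the formula against a numerical derivative. The coefficient values and the helper names (`index_fn`, `probit_prob`, `partial_effect_x1`) are hypothetical and only for illustration; `pnorm` and `dnorm` are R's standard normal cdf and pdf.

```r
# Hypothetical coefficients, for illustration only
beta <- c(b0 = -1, b1 = 0.5, b2 = 2, b3 = -0.3)

# The index beta0 + beta1*X1 + beta2*X2 + beta3*X3
index_fn <- function(x1, x2, x3) {
  beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2 + beta[["b3"]] * x3
}

# Probit probability P(Y = 1 | X1, X2, X3)
probit_prob <- function(x1, x2, x3) pnorm(index_fn(x1, x2, x3))

# Analytical partial effect of X1: phi(index) * beta1
partial_effect_x1 <- function(x1, x2, x3) {
  dnorm(index_fn(x1, x2, x3)) * beta[["b1"]]
}

# Same coefficient beta1, but different partial effects at different covariates
pe_a <- partial_effect_x1(0, 0, 0)
pe_b <- partial_effect_x1(2, 0, 0)
pe_c <- partial_effect_x1(0, 2, 0)
c(pe_a, pe_b, pe_c)

# Check the formula against a central-difference numerical derivative
h <- 1e-6
numerical <- (probit_prob(2 + h, 0, 0) - probit_prob(2 - h, 0, 0)) / (2 * h)
all.equal(numerical, pe_b, tolerance = 1e-6)
```

Because \(\phi\) is largest when the index is near 0 and shrinks toward 0 in the tails, the partial effect is largest for observations whose predicted probability is near 0.5.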

Logit is conceptually similar to probit, but instead of using \(\Phi\), logit uses the logistic function \(\Lambda(z) = \frac{\exp(z)}{1+\exp(z)}\). The logistic function has the same important properties as \(\Phi\): (i) \(\Lambda(z)\) is increasing in \(z\), (ii) \(\Lambda(z) \rightarrow 1\) as \(z \rightarrow \infty\), and (iii) \(\Lambda(z) \rightarrow 0\) as \(z \rightarrow -\infty\). Thus, in a logit model,

\[ \begin{aligned} \mathrm{P}(Y=1 | X_1, X_2, X_3) &= \Lambda(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \\ &= \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}{1+\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)} \end{aligned} \]
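The partial effect of \(X_1\) in the logit model can be worked out with the same chain-rule argument used for probit. As a sketch, write \(z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\) for the index (the shorthand \(z\) is introduced here only for readability) and use the fact that the derivative of the logistic function is \(\Lambda'(z) = \frac{\exp(z)}{(1+\exp(z))^2} = \Lambda(z)(1-\Lambda(z))\). Then

\[ \begin{aligned} \frac{ \partial \, \mathrm{P}(Y=1|X_1,X_2,X_3)}{\partial \, X_1} &= \frac{\partial \, \Lambda(z)}{\partial \, X_1} \\ &= \Lambda(z)\big(1-\Lambda(z)\big) \beta_1 \end{aligned} \]

As with probit, this partial effect depends on the values of all the covariates through the index \(z\), and it has the same sign as \(\beta_1\).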


Because probit and logit models are nonlinear, estimation is more complicated than for the linear regression models that we were studying before. In particular, we cannot write down a closed-form formula like \(\hat{\beta}_1 = \textrm{something}\).

Instead, probit and logit models are typically estimated through an approach called maximum likelihood estimation. Basically, the computer solves an optimization problem, choosing the “most likely” values of the parameters given the data that you have. It turns out that this particular optimization problem is actually quite easy for the computer to solve. Even though estimating the parameters is more complicated than for linear regression, it will still feel like R can estimate a probit or logit model pretty much instantly.
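As a rough illustration, the sketch below simulates data from a probit model and estimates it in two ways: with base R's `glm` function (one common way to fit these models, using a binomial family with a probit or logit link), and by handing the probit log-likelihood directly to the general-purpose optimizer `optim`. The sample size, the "true" parameter values, and the variable names are all assumptions made up for this example.

```r
set.seed(1)

# Simulate data from a probit model with hypothetical "true" parameters
n <- 2000
X1 <- rnorm(n); X2 <- rnorm(n); X3 <- rnorm(n)
beta_true <- c(-0.5, 1, -1, 0.5)
index <- beta_true[1] + beta_true[2] * X1 + beta_true[3] * X2 + beta_true[4] * X3
Y <- as.numeric(runif(n) < pnorm(index))
dat <- data.frame(Y, X1, X2, X3)

# Built-in estimation: glm with a binomial family and a probit or logit link
probit_fit <- glm(Y ~ X1 + X2 + X3, data = dat,
                  family = binomial(link = "probit"))
logit_fit <- glm(Y ~ X1 + X2 + X3, data = dat,
                 family = binomial(link = "logit"))

# The same probit estimates, obtained by maximizing the log-likelihood directly.
# optim minimizes, so we give it the negative log-likelihood.
neg_loglik <- function(b) {
  idx <- b[1] + b[2] * X1 + b[3] * X2 + b[4] * X3
  # log P(Y=1) = log Phi(idx) and log P(Y=0) = log(1 - Phi(idx)) = log Phi(-idx)
  -sum(Y * pnorm(idx, log.p = TRUE) + (1 - Y) * pnorm(-idx, log.p = TRUE))
}
ml_fit <- optim(par = rep(0, 4), fn = neg_loglik, method = "BFGS")

# Compare the two sets of probit estimates with the values used to simulate
cbind(glm = coef(probit_fit), optim = ml_fit$par, truth = beta_true)

# Logit coefficients are on a different scale, so they are not directly
# comparable to the probit coefficients
coef(logit_fit)
```

The two sets of probit estimates should agree up to the optimizer's numerical tolerance, since both are solving the same maximum likelihood problem.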