## 5.4 Binary Outcome Models

In addition to referenced material below, please read all of SW Ch. 11

You may or may not have noticed this, but all of the outcomes that we have considered so far have been continuous. But lots of economic variables are discrete (we’ll mainly focus on binary outcomes). Examples:

- Labor force participation

- Graduating from college

The question is: Do our linear regression tools still apply to this case? In other words, does

\[ \mathbb{E}[Y | X_1, X_2, X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \] still make sense?

- Note: we have already included binary regressors and know how to interpret these, so this section is about binary outcomes rather than binary regressors

### 5.4.1 Linear Probability Model

SW 11.1

Let’s continue to consider

\[ \mathbb{E}[Y|X_1,X_2,X_3] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 \]

when \(Y\) is binary. Of course, you can still run this regression.

One thing that is helpful to notice before we really get started here is that when \(Y\) is binary (so that either \(Y=0\) or \(Y=1\))

\[ \begin{aligned} \mathbb{E}[Y] &= \sum_{y \in \mathcal{Y}} y \mathrm{P}(Y=y) \\ &= 0 \mathrm{P}(Y=0) + 1 \mathrm{P}(Y=1) \\ &= \mathrm{P}(Y=1) \end{aligned} \] And exactly the same sort of argument implies that, when \(Y\) is binary, \(\mathbb{E}[Y|X] = \mathrm{P}(Y=1|X)\). Thus, if we believe the model in the first part of this section, this result implies that

\[
\mathrm{P}(Y=1|X_1,X_2,X_3) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3
\]
For this reason, the model in this section is called the **linear probability model**. Moreover, this further implies that we should interpret

\[ \beta_1 = \frac{\partial \, \mathrm{P}(Y=1|X_1,X_2,X_3)}{\partial \, X_1} \] as a partial effect. That is, \(\beta_1\) is how much the probability that \(Y=1\) changes when \(X_1\) increases by one unit, holding \(X_2\) and \(X_3\) constant. This is good (and simple), but there are some drawbacks:

- It’s possible to get nonsensical predictions (predicted probabilities that are less than 0 or greater than 1) with a linear probability model.

- A related problem is that the linear probability model implies constant partial effects. That is, a change in one regressor always changes the probability of \(Y=1\) (holding other regressors constant) by the same amount. It may not be obvious that this is a disadvantage, but it is.

**Example 5.2** Let \(Y=1\) if an individual participates in the labor force. Further let \(X_1=1\) if an individual is male and 0 otherwise, \(X_2\) denote an individual’s age, and \(X_3=1\) for college graduates and 0 otherwise.

Additionally, suppose that \(\beta_0=0.4, \beta_1=0.2, \beta_2=0.01, \beta_3=0.1\).

Let’s calculate the probability of being in the labor force for a 40 year old woman who is not a college graduate. This is given by

\[ \mathrm{P}(Y=1 | X_1=0, X_2=40, X_3=0) = 0.4 + (0.01)(40) = 0.8 \] In other words, we’d predict that, given these characteristics, the probability of being in the labor force is 0.8.

Now, let’s calculate the probability of being in the labor force for a 40 year old man who is a college graduate. This is given by

\[ \mathrm{P}(Y=1|X_1=1, X_2=40, X_3=1) = 0.4 + 0.2 + (0.01)(40) + 0.1 = 1.1 \] We have calculated that the predicted probability of being in the labor force, given these characteristics, is 1.1 — this makes no sense! Our maximum predicted probability should be 1.

The problem of constant partial effects is closely related. Here, labor force participation is increasing in age, but with a binary outcome the effect has to die off by construction: for those who are already very likely to participate in the labor force (in this example, older men with a college education), the partial effect of age has to be small because they are already very likely to participate in the labor force.
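The arithmetic in this example is easy to check directly. Here is a minimal `R` sketch of the example’s linear probability model, using the coefficient values assumed above:

```
# coefficients from Example 5.2
b0 <- 0.4; b1 <- 0.2; b2 <- 0.01; b3 <- 0.1

# predicted probability from the linear probability model
lpm_prob <- function(male, age, college) {
  b0 + b1 * male + b2 * age + b3 * college
}

lpm_prob(male = 0, age = 40, college = 0)  # 0.8: 40 year old woman, no degree
lpm_prob(male = 1, age = 40, college = 1)  # 1.1: an impossible "probability"
```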

We can circumvent both of the main problems with the linear probability model by considering **nonlinear models** for binary outcomes. By far the most common are **probit** and **logit**. We will discuss these next.

### 5.4.2 Probit and Logit

SW 11.2, 11.3

Let’s start this section with probit. A probit model arises from setting

\[ \mathrm{P}(Y=1|X_1,X_2,X_3) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \] where \(\Phi\) is the cdf of a standard normal random variable. This is a nonlinear model because \(\Phi\) makes the right-hand side nonlinear in the parameters.

Using \(\Phi\) (or any cdf) here has the useful property that, no matter what value the “index” \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3\) takes on, the cdf is always between 0 and 1. This implies that we cannot get predicted probabilities outside of 0 and 1.

Thus, this circumvents the problems with the linear probability model. That said, there are some things we have to be careful about. First, as usual, we are interested in partial effects rather than the parameters themselves. But partial effects are more complicated here. Notice that

\[ \begin{aligned} \frac{ \partial \, P(Y=1|X_1,X_2,X_3)}{\partial \, X_1} &= \frac{\partial \, \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}{\partial \, X_1} \\ &= \phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \beta_1 \end{aligned} \] where \(\phi\) is the pdf of a standard normal random variable. The second equality uses the chain rule — take the derivative of the “outside” (i.e., \(\Phi\)) and then the derivative of the “inside” with respect to \(X_1\). Notice that this partial effect is more complicated than in the case of the linear models that we have mainly considered — it involves \(\phi\), but more importantly it also depends on the values of all the covariates. In other words, the partial effect of \(X_1\) can vary across different values of \(X_1\), \(X_2\), and \(X_3\).
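To see how this partial effect moves around with the covariates, here is a small `R` sketch using `dnorm` (which computes \(\phi\)). The coefficient values here are made up purely for illustration:

```
# hypothetical probit coefficients, just for illustration
b0 <- -1; b1 <- 0.5; b2 <- 0.02; b3 <- 0.3

# partial effect of X1, evaluated at particular covariate values
partial_effect_x1 <- function(x1, x2, x3) {
  index <- b0 + b1 * x1 + b2 * x2 + b3 * x3
  dnorm(index) * b1   # phi(index) * beta_1, from the chain rule
}

# the same one-unit change in X1 has different effects at different covariates
partial_effect_x1(x1 = 0, x2 = 40, x3 = 0)
partial_effect_x1(x1 = 1, x2 = 80, x3 = 1)   # smaller: index is further from 0
```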

**Logit** is conceptually similar to probit, but instead of using \(\Phi\), Logit uses the logistic function \(\Lambda(z) = \frac{\exp(z)}{1+\exp(z)}\). The logistic function has the same important properties as \(\Phi\): (i) \(\Lambda(z)\) is increasing in \(z\), (ii) \(\Lambda(z) \rightarrow 1\) as \(z \rightarrow \infty\), and (iii) \(\Lambda(z) \rightarrow 0\) as \(z \rightarrow -\infty\). Thus, in a logit model,

\[ \begin{aligned} \mathrm{P}(Y=1 | X_1, X_2, X_3) &= \Lambda(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \\ &= \frac{\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)}{1+\exp(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3)} \end{aligned} \]
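As an aside, `R` has the logistic cdf built in as `plogis`, so you rarely need to type out the \(\exp\) formula yourself. A quick sketch:

```
# plogis() is R's built-in logistic cdf, i.e., Lambda(z)
z <- 1.5
all.equal(plogis(z), exp(z) / (1 + exp(z)))  # TRUE

# Lambda is 0.5 at z = 0 and approaches 0 and 1 in the tails
plogis(c(-10, 0, 10))
```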

**Estimation**

Because probit and logit models are nonlinear, estimation is more complicated than for the linear regression models that we were studying before. In particular, we cannot write down a formula like \(\hat{\beta}_1 = \textrm{something}\).

Instead, probit and logit models are typically estimated through an approach called **maximum likelihood estimation**. Basically, the computer will solve an optimization problem trying to choose the “most likely” values of the parameters given the data that you have. It turns out that this particular optimization problem is actually quite easy for the computer to solve — even though estimating the parameters is more complicated than for linear regression, it will still feel like `R` can estimate a probit or logit model pretty much instantly.
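To give a little more detail on what “most likely” means here: for a probit model, maximum likelihood estimation picks the \(\hat{\beta}\)’s to solve

\[ \max_{b_0, b_1, b_2, b_3} \ \sum_{i=1}^n \Big[ Y_i \log \Phi(b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i}) + (1 - Y_i) \log\big(1 - \Phi(b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i})\big) \Big] \]

Each term in the sum is the log of the probability of observing that particular \(Y_i\) given the covariates, so the estimates make the observed data as probable as possible. For logit, just replace \(\Phi\) with \(\Lambda\). This is the optimization problem that `R`’s `glm` function solves behind the scenes.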

### 5.4.3 Average Partial Effects

One of the complications with Probit and Logit is that it is not so simple to interpret the estimated parameters.

Remember we are generally interested in partial effects, not the parameters themselves. It just so happens that in many of the linear models that we have considered so far the \(\beta\)’s correspond to the partial effect — this means that it is sometimes easy to forget that they are not what we are typically most interested in.

This is helpful framing for thinking about how to interpret the results from a Probit or Logit model.

Let’s focus on the Probit model. In that case, \[\begin{align*} \mathrm{P}(Y=1|X_1,X_2,X_3) = \Phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \end{align*}\] where \(\Phi\) is the cdf of a standard normal random variable.

*Continuous Case*: When \(X_1\) is continuous, the partial effect of \(X_1\) is given by
\[\begin{align*}
\frac{\partial \mathrm{P}(Y=1|X_1,X_2,X_3)}{\partial X_1} = \phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \beta_1
\end{align*}\]
where \(\phi\) is the pdf of a standard normal random variable. This is more complicated than the partial effect in the context of a linear model. It depends on \(\phi\) (which looks complicated, but you can just use `R`’s `dnorm` command to handle that part). More importantly, the partial effect depends on the values of \(X_1,X_2,\) and \(X_3\). [As discussed above, this is likely a good thing in the context of a binary outcome model]. Thus, in order to get a partial effect, we need to put in some values for these. If you have particular values of the covariates that you are interested in, you can definitely do that, but my general suggestion is to report the *Average Partial Effect*:
\[\begin{align*}
APE &= \mathbb{E}\left[ \frac{\partial \mathrm{P}(Y=1|X_1,X_2,X_3)}{\partial X_1} \right] \\
&= \mathbb{E}\left[ \phi(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3) \beta_1 \right]
\end{align*}\]
which you can estimate by
\[\begin{align*}
\widehat{APE} &= \frac{1}{n} \sum_{i=1}^n \phi(\hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i}) \hat{\beta}_1
\end{align*}\]
which amounts to just computing the partial effect at each value of the covariates in your data and then averaging these partial effects together. This can be a bit cumbersome to do in practice, and it is often convenient to use the `R` package `mfx` to compute these sorts of average partial effects for you.
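If you want to see what this estimator is doing under the hood, it can be computed in a few lines. The following sketch simulates some data (the variable names and coefficient values are made up, not from any real example), estimates a probit model with `glm`, and then averages \(\phi(\text{index}_i)\hat{\beta}_1\) across the observations:

```
# computing the APE of a continuous regressor "by hand" in a probit model
set.seed(123)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- as.numeric(pnorm(0.5 + x1 - 0.5 * x2) > runif(n))  # simulated binary outcome

probit <- glm(y ~ x1 + x2, family = binomial(link = "probit"))
bhat   <- coef(probit)

# partial effect at each observation's covariates, then average
index  <- bhat[1] + bhat[2] * x1 + bhat[3] * x2
ape_x1 <- mean(dnorm(index) * bhat[2])
ape_x1
```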

*Discrete/Binary Case*: When \(X_1\) is discrete (let’s say binary; the extension to the general discrete case is straightforward), the partial effect of \(X_1\) is
\[\begin{align*}
& \mathrm{P}(Y=1|X_1=1, X_2, X_3) - \mathrm{P}(Y=1|X_1=0, X_2, X_3) \\
&\hspace{100pt} = \Phi(\beta_0 + \beta_1 + \beta_2 X_2 + \beta_3 X_3) - \Phi(\beta_0 + \beta_2 X_2 + \beta_3 X_3)
\end{align*}\]
Notice that \(\beta_1\) does not show up in the last term. As above, the partial effect depends on the values of \(X_2\) and \(X_3\), which suggests reporting an \(APE\) as above (this follows the same steps as in the continuous case, just replacing the partial effect).
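For a binary regressor, the analogous calculation is to predict with the regressor switched to 1 for everyone, predict again with it switched to 0, and average the difference. A sketch on simulated data (again, the variable names and coefficient values are made up for illustration):

```
# APE of a binary regressor in a probit model, computed "by hand"
set.seed(42)
n  <- 1000
df <- data.frame(x1 = rbinom(n, 1, 0.5),   # binary regressor
                 x2 = rnorm(n))
df$y <- as.numeric(pnorm(-0.2 + 0.8 * df$x1 + 0.4 * df$x2) > runif(n))

probit <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = df)

df1 <- transform(df, x1 = 1)   # everyone with x1 switched on
df0 <- transform(df, x1 = 0)   # everyone with x1 switched off
ape <- mean(predict(probit, newdata = df1, type = "response") -
            predict(probit, newdata = df0, type = "response"))
ape
```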

- Extensions to Logit are virtually identical, just replace \(\Phi\) with \(\Lambda\) and \(\phi\) with \(\lambda\).

Side-Comment: The parameters from LPM, Probit, and Logit can be quite different (in fact, they are quite different by construction), but APEs are often very similar.

### 5.4.4 Computation

In this section, I’ll demonstrate how to estimate and interpret binary outcome models using the `titanic_training` data. The outcome for this data is `Survived`, which is a binary variable indicating whether a particular passenger survived the Titanic wreck. We’ll estimate models that also include passenger class (`Pclass`), passenger’s sex, and passenger’s age.

The main `R` function for estimating binary outcome models is the `glm` function (this stands for “generalized linear model”). The syntax is very similar to the syntax for the `lm` command, so much of what we do below will feel familiar. We’ll also use the `probitmfx` and `logitmfx` functions from the `mfx` package to compute partial effects.

**Linear Probability Model**

We’ll start by estimating a linear probability model.

```
# load the data
titanic_train <- read.csv("data/titanic_training.csv")

# linear probability model
lpm <- lm(Survived ~ as.factor(Pclass) + Sex + Age,
          data=titanic_train)
summary(lpm)
#>
#> Call:
#> lm(formula = Survived ~ as.factor(Pclass) + Sex + Age, data = titanic_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1.05638 -0.26294 -0.07656 0.21103 0.98057
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.065516 0.061481 17.331 < 2e-16
#> as.factor(Pclass)2 -0.148575 0.050593 -2.937 0.003472
#> as.factor(Pclass)3 -0.349809 0.046462 -7.529 2.44e-13
#> Sexmale -0.490605 0.037330 -13.143 < 2e-16
#> Age -0.004570 0.001317 -3.471 0.000564
#>
#> (Intercept) ***
#> as.factor(Pclass)2 **
#> as.factor(Pclass)3 ***
#> Sexmale ***
#> Age ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.3915 on 495 degrees of freedom
#> Multiple R-squared: 0.3747, Adjusted R-squared: 0.3696
#> F-statistic: 74.14 on 4 and 495 DF, p-value: < 2.2e-16
```

Let’s quickly interpret a couple of these parameters. Recall that these can all be directly interpreted as partial effects on the probability of surviving the Titanic wreck. For example, these estimates indicate that third class passengers were about 35 percentage points less likely to survive, on average, than first class passengers, controlling for passenger’s sex and age.

The other thing that jumps out is passenger’s sex. These estimates indicate that men were, on average, 49 percentage points less likely to survive the Titanic wreck than women after controlling for passenger class and age.

Before we move on, let’s compute a couple of predicted probabilities.

```
# young, first class, female
pred_df1 <- data.frame(Pclass=1, Sex="female", Age=25, Embarked="S")

# old, third class, male
pred_df2 <- data.frame(Pclass=3, Sex="male", Age=55, Embarked="S")

pred_df <- rbind.data.frame(pred_df1, pred_df2)
round(predict(lpm, newdata=pred_df), 3)
#> 1 2
#> 0.951 -0.026
```

This illustrates that there were very large differences in survival probabilities. It also demonstrates that the linear probability model can deliver nonsensical predictions — we predict the survival probability of a 55 year old, male, third-class passenger to be -2.6%.

**Estimating Probit and Logit Models**

Let’s estimate Probit and Logit models using the same specifications. First, Probit:

```
# probit
probit <- glm(Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked),
              family=binomial(link="probit"),
              data=titanic_train)
summary(probit)
#>
#> Call:
#> glm(formula = Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked),
#> family = binomial(link = "probit"), data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.5145 -0.7433 -0.4397 0.6560 2.3799
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 5.418219 146.954198 0.037 0.97059
#> as.factor(Pclass)2 -0.445235 0.197398 -2.256 0.02410
#> as.factor(Pclass)3 -1.145788 0.188508 -6.078 1.22e-09
#> Sexmale -1.464201 0.136512 -10.726 < 2e-16
#> Age -0.015764 0.005009 -3.147 0.00165
#> as.factor(Embarked)C -3.443295 146.954199 -0.023 0.98131
#> as.factor(Embarked)Q -3.464431 146.954544 -0.024 0.98119
#> as.factor(Embarked)S -3.662911 146.954180 -0.025 0.98011
#>
#> (Intercept)
#> as.factor(Pclass)2 *
#> as.factor(Pclass)3 ***
#> Sexmale ***
#> Age **
#> as.factor(Embarked)C
#> as.factor(Embarked)Q
#> as.factor(Embarked)S
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 678.28 on 499 degrees of freedom
#> Residual deviance: 468.38 on 492 degrees of freedom
#> AIC: 484.38
#>
#> Number of Fisher Scoring iterations: 12
```

Now, Logit:

```
# logit
logit <- glm(Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked),
             family=binomial(link="logit"),
             data=titanic_train)
summary(logit)
#>
#> Call:
#> glm(formula = Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked),
#> family = binomial(link = "logit"), data = titanic_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.4680 -0.7289 -0.4336 0.6471 2.3689
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 14.646649 535.411272 0.027 0.97818
#> as.factor(Pclass)2 -0.793003 0.342058 -2.318 0.02043
#> as.factor(Pclass)3 -2.055828 0.335386 -6.130 8.8e-10
#> Sexmale -2.461563 0.241113 -10.209 < 2e-16
#> Age -0.028436 0.008764 -3.245 0.00118
#> as.factor(Embarked)C -11.262004 535.411276 -0.021 0.98322
#> as.factor(Embarked)Q -11.229232 535.411568 -0.021 0.98327
#> as.factor(Embarked)S -11.593053 535.411259 -0.022 0.98273
#>
#> (Intercept)
#> as.factor(Pclass)2 *
#> as.factor(Pclass)3 ***
#> Sexmale ***
#> Age **
#> as.factor(Embarked)C
#> as.factor(Embarked)Q
#> as.factor(Embarked)S
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 678.28 on 499 degrees of freedom
#> Residual deviance: 467.01 on 492 degrees of freedom
#> AIC: 483.01
#>
#> Number of Fisher Scoring iterations: 12
```

It’s rather hard to interpret the parameters in these models (let’s defer this to the next section). But it is worth mentioning that all of the estimated coefficients have the same sign across the linear probability model, the probit model, and the logit model, and the same set of regressors are statistically significant in each model (with very similar test statistics and p-values across models).

Now, let’s calculate the same predicted probabilities as we did for the linear probability model above:

```
# for probit
predict(probit, newdata=pred_df, type="response")
#> 1 2
#> 0.91327666 0.04256272
# for logit
predict(logit, newdata=pred_df, type="response")
#> 1 2
#> 0.91235117 0.04618584
```

Before we interpret the results, notice that we added the argument `type="response"` (this indicates that we want a predicted probability; without it, `predict` would return the value of the “index” \(\hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3\)).

Here (for both models), we estimate that a 25 year old woman traveling first class has a 91% probability of survival (this is slightly smaller than the prediction from the linear probability model). On the other hand, we estimate that a 55 year old man traveling third class has a 4.3% (from probit) or 4.6% (from logit) probability of survival. While these probabilities are still quite low, unlike the estimates from the linear probability model, they are at least positive.

**Average Partial Effects**

To conclude this section, let’s calculate average partial effects for each model. First, for probit:

```
library(mfx)
probit_ape <- probitmfx(Survived ~ as.factor(Pclass) + Sex + Age,
                        data=titanic_train,
                        atmean=FALSE)
probit_ape
#> Call:
#> probitmfx(formula = Survived ~ as.factor(Pclass) + Sex + Age,
#> data = titanic_train, atmean = FALSE)
#>
#> Marginal Effects:
#> dF/dx Std. Err. z P>|z|
#> as.factor(Pclass)2 -0.1303971 0.0424994 -3.0682 0.0021534
#> as.factor(Pclass)3 -0.3438262 0.0470404 -7.3092 2.688e-13
#> Sexmale -0.4895845 0.0412840 -11.8590 < 2.2e-16
#> Age -0.0042614 0.0012934 -3.2948 0.0009851
#>
#> as.factor(Pclass)2 **
#> as.factor(Pclass)3 ***
#> Sexmale ***
#> Age ***
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> dF/dx is for discrete change for the following variables:
#>
#> [1] "as.factor(Pclass)2" "as.factor(Pclass)3"
#> [3] "Sexmale"
```

Now, for logit:

```
logit_ape <- logitmfx(Survived ~ as.factor(Pclass) + Sex + Age,
                      data=titanic_train,
                      atmean=FALSE)
logit_ape
#> Call:
#> logitmfx(formula = Survived ~ as.factor(Pclass) + Sex + Age,
#> data = titanic_train, atmean = FALSE)
#>
#> Marginal Effects:
#> dF/dx Std. Err. z P>|z|
#> as.factor(Pclass)2 -0.1304383 0.0420444 -3.1024 0.001920
#> as.factor(Pclass)3 -0.3542499 0.0474154 -7.4712 7.947e-14
#> Sexmale -0.4859890 0.0413167 -11.7625 < 2.2e-16
#> Age -0.0044098 0.0014205 -3.1045 0.001906
#>
#> as.factor(Pclass)2 **
#> as.factor(Pclass)3 ***
#> Sexmale ***
#> Age **
#> ---
#> Signif. codes:
#> 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> dF/dx is for discrete change for the following variables:
#>
#> [1] "as.factor(Pclass)2" "as.factor(Pclass)3"
#> [3] "Sexmale"
```

A couple of things to notice:

- The average partial effects are extremely similar across models. For example, across all three models, the average partial effect of being `male` is to reduce the probability of survival by about 49 percentage points, controlling for the other variables in the model. The other average partial effects are quite similar across models as well.

- Both the `probitmfx` and `logitmfx` functions took an argument called `atmean`. We set it equal to `FALSE`. If you set it equal to `TRUE`, you will compute a different kind of partial effect. You can check `?probitmfx` or `?logitmfx` for more details.

Side-Comment: We are not necessarily so interested in prediction here (though you can make predictions using the techniques demonstrated above), but I find this a good spot to teach about binary outcome models before we conclude the course by talking about causality.