7.4 Lab 6: Estimating Binary Outcome Models

In this lab, I’ll demonstrate how to estimate and interpret binary outcome models using the titanic_training data. The outcome in these data is Survived, a binary variable indicating whether a particular passenger survived the Titanic wreck. We’ll estimate models that also include passenger class (Pclass), passenger’s sex, and passenger’s age.

The main R function for estimating binary outcome models is the glm function (this stands for “generalized linear model”). The syntax is very similar to the syntax for the lm command, so much of what we do below will feel familiar. We’ll also use the probitmfx and logitmfx functions from the mfx package to compute partial effects.

Linear Probability Model

We’ll start by estimating a linear probability model.

# load the data
titanic_train <- read.csv("data/titanic_training.csv")

# linear probability model
lpm <- lm(Survived ~ as.factor(Pclass) + 
            Sex + Age, 
          data=titanic_train)
summary(lpm)
#> 
#> Call:
#> lm(formula = Survived ~ as.factor(Pclass) + Sex + Age, data = titanic_train)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.05638 -0.26294 -0.07656  0.21103  0.98057 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)         1.065516   0.061481  17.331  < 2e-16 ***
#> as.factor(Pclass)2 -0.148575   0.050593  -2.937 0.003472 ** 
#> as.factor(Pclass)3 -0.349809   0.046462  -7.529 2.44e-13 ***
#> Sexmale            -0.490605   0.037330 -13.143  < 2e-16 ***
#> Age                -0.004570   0.001317  -3.471 0.000564 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3915 on 495 degrees of freedom
#> Multiple R-squared:  0.3747, Adjusted R-squared:  0.3696 
#> F-statistic: 74.14 on 4 and 495 DF,  p-value: < 2.2e-16

Let’s quickly interpret a couple of these parameters. Recall that they can all be interpreted directly as partial effects on the probability of surviving the Titanic wreck. For example, these estimates indicate that third class passengers were about 35% less likely to survive, on average, than first class passengers, controlling for passenger’s sex and age.

The other thing that jumps out is passenger’s sex. These estimates indicate that men were, on average, 49% less likely to survive the Titanic wreck than women after controlling for passenger class and age.
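Since the model is linear, a dummy coefficient is exactly the gap in predicted survival probabilities between two otherwise-identical passengers. As a quick check using the lpm object fit above (the passenger values here are made up for illustration):

```r
# two hypothetical passengers who differ only in Sex
two <- data.frame(Pclass=1, Sex=c("female", "male"), Age=30)
# the difference in predictions equals coef(lpm)["Sexmale"]
diff(predict(lpm, newdata=two))
```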

Before we move on, let’s compute a couple of predicted probabilities.

# young, first class, female
# (Embarked is included because the probit/logit models below use it)
pred_df1 <- data.frame(Pclass=1, Sex="female", Age=25, Embarked="S")

# old, third class, male
pred_df2 <- data.frame(Pclass=3, Sex="male", Age=55, Embarked="S")
pred_df <- rbind.data.frame(pred_df1, pred_df2)
round(predict(lpm, newdata=pred_df), 3)
#>      1      2 
#>  0.951 -0.026

This illustrates that there were very large differences in survival probabilities across passengers. It also demonstrates that the linear probability model can deliver nonsensical predictions: we predict the survival probability of a 55 year old, male, third class passenger to be about -3%.
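The same pathology arises whenever response probabilities get close to 0 or 1. Here is a self-contained sketch with simulated data showing a linear fit extrapolating outside [0, 1]; the in-sample fitted values of the lab’s lpm can be inspected the same way via fitted(lpm):

```r
set.seed(1)
x <- seq(-3, 3, length.out=200)
# the outcome is essentially 1 when x is large and 0 when x is small
y <- as.numeric(x + rnorm(200, sd=0.5) > 0)
lpm_sim <- lm(y ~ x)
# the linear fit produces "probabilities" below 0 and above 1
range(fitted(lpm_sim))
```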

Estimating Probit and Logit Models

Let’s estimate Probit and Logit models next. We’ll use the same regressors as before, plus the port of embarkation (Embarked). First, Probit:

# probit
probit <- glm(Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked), 
              family=binomial(link="probit"), 
              data=titanic_train)
summary(probit)
#> 
#> Call:
#> glm(formula = Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked), 
#>     family = binomial(link = "probit"), data = titanic_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.5145  -0.7433  -0.4397   0.6560   2.3799  
#> 
#> Coefficients:
#>                        Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)            5.418219 146.954198   0.037  0.97059    
#> as.factor(Pclass)2    -0.445235   0.197398  -2.256  0.02410 *  
#> as.factor(Pclass)3    -1.145788   0.188508  -6.078 1.22e-09 ***
#> Sexmale               -1.464201   0.136512 -10.726  < 2e-16 ***
#> Age                   -0.015764   0.005009  -3.147  0.00165 ** 
#> as.factor(Embarked)C  -3.443295 146.954199  -0.023  0.98131    
#> as.factor(Embarked)Q  -3.464431 146.954544  -0.024  0.98119    
#> as.factor(Embarked)S  -3.662911 146.954180  -0.025  0.98011    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 678.28  on 499  degrees of freedom
#> Residual deviance: 468.38  on 492  degrees of freedom
#> AIC: 484.38
#> 
#> Number of Fisher Scoring iterations: 12

Now, Logit:

# logit
logit <- glm(Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked), 
              family=binomial(link="logit"), 
              data=titanic_train)
summary(logit)
#> 
#> Call:
#> glm(formula = Survived ~ as.factor(Pclass) + Sex + Age + as.factor(Embarked), 
#>     family = binomial(link = "logit"), data = titanic_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.4680  -0.7289  -0.4336   0.6471   2.3689  
#> 
#> Coefficients:
#>                        Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)           14.646649 535.411272   0.027  0.97818    
#> as.factor(Pclass)2    -0.793003   0.342058  -2.318  0.02043 *  
#> as.factor(Pclass)3    -2.055828   0.335386  -6.130  8.8e-10 ***
#> Sexmale               -2.461563   0.241113 -10.209  < 2e-16 ***
#> Age                   -0.028436   0.008764  -3.245  0.00118 ** 
#> as.factor(Embarked)C -11.262004 535.411276  -0.021  0.98322    
#> as.factor(Embarked)Q -11.229232 535.411568  -0.021  0.98327    
#> as.factor(Embarked)S -11.593053 535.411259  -0.022  0.98273    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 678.28  on 499  degrees of freedom
#> Residual deviance: 467.01  on 492  degrees of freedom
#> AIC: 483.01
#> 
#> Number of Fisher Scoring iterations: 12

The parameters in both of these models are rather hard to interpret directly (let’s defer this to the next section). It is worth mentioning, though, that the estimated coefficients have the same signs across the linear probability model, the Probit model, and the Logit model, that the same set of regressors is statistically significant in each, and that the test statistics and p-values are very similar across models.
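One way to see this at a glance is to line up the coefficients the three fitted models share (everything but the Embarked dummies, which the linear probability model omits):

```r
# coefficients common to all three fitted models
common <- c("as.factor(Pclass)2", "as.factor(Pclass)3", "Sexmale", "Age")
cbind(lpm=coef(lpm)[common],
      probit=coef(probit)[common],
      logit=coef(logit)[common])
```

The magnitudes differ because each model is on its own scale; a common rule of thumb is that logit coefficients are roughly 1.6 times the corresponding probit coefficients.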

Now, let’s calculate the same predicted probabilities as we did for the linear probability model above:

# for probit
predict(probit, newdata=pred_df, type="response")
#>          1          2 
#> 0.91327666 0.04256272

# for logit
predict(logit, newdata=pred_df, type="response")
#>          1          2 
#> 0.91235117 0.04618584

Before we interpret the results, notice that we added the argument type="response", which tells predict to return predicted probabilities rather than the linear predictor (the default).

Here (for both models), we estimate that a 25 year old woman traveling first class has a 91% probability of survival (slightly smaller than the prediction from the linear probability model). On the other hand, we estimate that a 55 year old man traveling third class has a 4.3% (from probit) or 4.6% (from logit) probability of survival. While these probabilities are quite low, they are at least positive, unlike the corresponding estimate from the linear probability model.
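Under the hood, type="response" simply applies the inverse link function to the linear predictor: pnorm for probit and plogis for logit. A self-contained sketch with simulated data (the variable names here are illustrative, not from the lab):

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
probit_sim <- glm(y ~ x, family=binomial(link="probit"))
logit_sim  <- glm(y ~ x, family=binomial(link="logit"))
# predicted probabilities computed two equivalent ways
p1 <- predict(probit_sim, type="response")
p2 <- pnorm(predict(probit_sim, type="link"))
all.equal(as.numeric(p1), as.numeric(p2))   # TRUE
q1 <- predict(logit_sim, type="response")
q2 <- plogis(predict(logit_sim, type="link"))
all.equal(as.numeric(q1), as.numeric(q2))   # TRUE
```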

Average Partial Effects

To conclude this section, let’s calculate average partial effects for each model. First, for probit:

library(mfx)
probit_ape <- probitmfx(Survived ~ as.factor(Pclass) + Sex + Age, 
                        data=titanic_train, 
                        atmean=FALSE)
probit_ape
#> Call:
#> probitmfx(formula = Survived ~ as.factor(Pclass) + Sex + Age, 
#>     data = titanic_train, atmean = FALSE)
#> 
#> Marginal Effects:
#>                         dF/dx  Std. Err.        z     P>|z|    
#> as.factor(Pclass)2 -0.1303971  0.0424994  -3.0682 0.0021534 ** 
#> as.factor(Pclass)3 -0.3438262  0.0470404  -7.3092 2.688e-13 ***
#> Sexmale            -0.4895845  0.0412840 -11.8590 < 2.2e-16 ***
#> Age                -0.0042614  0.0012934  -3.2948 0.0009851 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> dF/dx is for discrete change for the following variables:
#> 
#> [1] "as.factor(Pclass)2" "as.factor(Pclass)3" "Sexmale"

Now, for logit:

logit_ape <- logitmfx(Survived ~ as.factor(Pclass) + Sex + Age, 
                        data=titanic_train, 
                        atmean=FALSE)
logit_ape
#> Call:
#> logitmfx(formula = Survived ~ as.factor(Pclass) + Sex + Age, 
#>     data = titanic_train, atmean = FALSE)
#> 
#> Marginal Effects:
#>                         dF/dx  Std. Err.        z     P>|z|    
#> as.factor(Pclass)2 -0.1304383  0.0420444  -3.1024  0.001920 ** 
#> as.factor(Pclass)3 -0.3542499  0.0474154  -7.4712 7.947e-14 ***
#> Sexmale            -0.4859890  0.0413167 -11.7625 < 2.2e-16 ***
#> Age                -0.0044098  0.0014205  -3.1045  0.001906 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> dF/dx is for discrete change for the following variables:
#> 
#> [1] "as.factor(Pclass)2" "as.factor(Pclass)3" "Sexmale"

A couple of things to notice:

  • The average partial effects are extremely similar across models. For example, in all three models, the average partial effect of being male is to reduce the probability of survival by about 49%, controlling for the other variables in the model. The other average partial effects are quite similar across models as well.

  • Both probitmfx and logitmfx take an atmean argument, which we set equal to FALSE to obtain average partial effects. Setting it equal to TRUE instead computes the partial effect at the mean values of the regressors, which is a different kind of partial effect. You can check ?probitmfx or ?logitmfx for more details.
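For intuition about what atmean=FALSE computes: for a continuous regressor in a probit model, the average partial effect is the sample average of the standard normal density evaluated at each observation’s index, multiplied by the coefficient. A self-contained sketch with simulated data (names are illustrative):

```r
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, pnorm(-0.2 + 0.8 * x))
pr <- glm(y ~ x, family=binomial(link="probit"))
# APE of x: average dnorm at each fitted index, times the coefficient
ape_x <- mean(dnorm(predict(pr, type="link"))) * coef(pr)["x"]
unname(ape_x)
```

For dummy regressors, probitmfx instead reports the average discrete change in predicted probabilities, as its printed output notes.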