Chapter 8

Author
Affiliation
Juan H Klopper (MD), Do Hee Lee (MPH)

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

Given enough knowledge, anyone can build a model with any variety of explanatory variables. Throwing variables and interactions at a model, might produce results, but they are deleterious to many of the response why we use models. A lot more thought has to go into creating a model.

The decisions around model building has many components. We discuss some of these in broader terms in this chapter.

Section 4. Reasons for Model Building classification problem
Section 7. Explanatory Variable Types binning
Section 9. Suggestions based on Study Design forward selection, backward elimination
Section 10. Model Comparison Akaike information criterion (AIC)

2 Libraries

Load the required libraries for Chapter 8. If these packages are not already installed, please do so before loading the libraries.

Code
library(rio)
library(dplyr)
#library(DT)

3 Model Building

There are a lot of decisions to make and transparency is key, especially in view of research reproducibility. The following five steps are key. They actually also pertain to the whole process of research, not just the model creation decisions.

  1. Documents all decisions that are made about the research including the decisions made about the model
  2. Explain why the decisions were made
  3. Share with peers how the decisions were made such that they can attempt to reproduce the results
  4. Work collaboratively
  5. Peer-review is your friend (other will cast a new eye on your work and may guide you to better or alternative decisions)

There may only be Five steps, but it does include a lot of work. Remember too that clinical trials and their protocols have to be registered before any actual data collection and analysis can commence.

4 Reasons for Model Building

Models are used in two main ways.

  • Explanation: Describe associations between explanatory and response variables (we must be able to interpret the results of the model)
  • Prediction: Given unseen data, the model provides a probability of success (given a binary response variable)

Machine learning has really taken a front seat when it comes to predictions. The prediction of a categorical response variable is termed a classification problem. Data is typically divided into a training and a test set. The training set is used to train a model and the test set (unseen during the creation of the machine learning model) to test the possible accuracy of the model in the wild. The metrics used during testing may reflect how well the model will do when used in the wild. There are many types of approaches in machine learning including approaches built on decision trees and neural networks. Neural networks minimize a cost function through repeated processes of forward and backpropagation, attempting to minimize a cost function. It is, in essence, an optimization problem.

5 Considerations

We have mentioned that there are many considerations when creating a model. These are typically divided into statistical and scientific considerations. Both are equally valid and important. There are too many to list in both cases. Some are highlighted below.

  • Statistical considerations
    • Statistical significance (bivariate analysis)
    • Model stability (large standard errors mean large variability and reduced trust in the results)
    • Underfitting (model does not explain the data well) and overfitting (model particular to the random data provided and does not generalize well when used in the wild)
  • Scientific considerations
    • Is the model plausible (number of buses that are late vs. development of gallstones)
    • Is the model useful (can it be used to inform a change in care or management)
    • Have all confounders being accounted for (many studies leave out important confounders)

We ultimately want a parsimonious model, that is, one that is simple but explains the data well. Such a model is interpretable and finds use after publication.

Remember too, that there are many types of models for categorical response variables. We have concentrated on the arguably most useful and most used, logistic regression (LR) model. With LR models we estimate odds ratios. There are also model type for risk ratios (log-binomial) and risk difference (linear probability).

It makes sense to consider other model types previously used in the field of study of current research project.

6 General Rules

All modeling starts with a proper research question that ultimately determines the model and selection of explanatory variables. When given data, the purposeful selection of explanatory variables is key. We often consider a primary explanatory variable, X, and a set of potential confounders, Z_{1} , Z_{2} \ldots. We simply do not just add explanatory variables without consideration and purpose. Any variable that is included should be well justified.

As a rule of thumb some suggest the following.

  • At least 10 cases of each of the outcome variables
  • Amount of predictors must equal no more than a third of the number of successes in the data set

A large number of predictors lead to large standard errors which means high variability in the parameter estimates.

Multicollinearity, that is, correlation among the explanatory variables also lead to model instability through raised standard errors and high estimate variability.

In summary, we can state the following general rules as we start to consider modeling.

  1. Structure a proper research question
  2. Collect (select) only the variables pertaining to answering the research question (there is always the temptation to use more if they are there)
  3. Verify integrity of the data
  4. Do descriptive statistics and data visualization to understand the data (summary statistics usually go into the first table or few tables in a paper anyway) (the summary and data visualization can be stratified if required)
  5. Assess for multicollinearity (if there is pick one of the variables only or combine them in some meaningful way into a new constituent variable)
  6. Do bivariate analysis (single explanatory variable and response variable) (with a suggestion to exclude explanatory variables that do not have a p values of less than 0.25 using Wald’s test)

7 Explanatory Variable Types

We have started by looking at a sensible way to start thinking about model building. Next, we consider the individual explanatory variable types: ordinal, multi-level, and continuous, exploring the following ideas.

  • Convert an ordinal variable into a numerical variable
  • Reduce the number of levels in a k level explanatory variable
  • Consider creating intervals for numerical variables and create a categorical variable

7.1 Ordinal Explanatory Variables

For a k level ordinal explanatory variable, we would have k-1 dummy variables. It is important to consider that more slopes mean more variability in the estimates. It may make sense to consider converting an ordinal variable to a numerical variable.

In the illustrative example below, we import data that looked at the return of bottles, Y, with Y=1 if the bottle is returned, given the amount of the deposit returned, X. We assign the imported data set to the variable deposit.

Code
deposit <- import_dataset("deposit_data.csv")

A summary table shows that there seems to be an increased number of returns as the deposit values increases.

Code
xtabs(
  ~ .,
  data = deposit
)
       returned
deposit   0   1
     2  428  72
     5  397 103
     10 370 130
     15 280 220
     20 180 320
     25  60 440
     30  38 462

If we select only the returns, using the filter function as returning a tibble object, we can create a list of objects in order to plot the relationship between increasing deposit values and the number of returns, shown in Figure 1.

Code
returns <- deposit %>% 
  filter(returned == 1) %>% 
  tibble()
Code
tab_returns <- table(returns$deposit)

tab_returns_values <- as.numeric(tab_returns)
tab_returns_labels <- names(tab_returns)

barplot(
  tab_returns_values,
  names.arg = tab_returns_labels,
  main = "Frequency of returns as deposit value increases",
  xlab = "Value of deposit",
  ylab = "Frequency of returns",
  las = 1,
  panel.first = grid()
)

Figure 1: Frequency of Returns

We see the increase in the frequency of returns as the value of the deposit increases.

If the return is seen as a k level explanatory variable, then we get the following model, using a deposit value of 2 as the reference level. We use the glm function and assign the model to the variable model_cat, with the coef function to return the coefficient estimates.

Code
model_cat <- glm(
  formula = returned ~ factor(deposit),
  data = deposit,
  family = binomial(link = "logit")
)

coef(model_cat)
      (Intercept)  factor(deposit)5 factor(deposit)10 factor(deposit)15 
       -1.7824571         0.4332498         0.7364885         1.5412950 
factor(deposit)20 factor(deposit)25 factor(deposit)30 
        2.3578212         3.7748872         4.2804358 

It looks like the slope increases as the deposit value increases. It may not be linear, but we may want to view the deposit as a numerical variable. To decide on this, we calculate the empirical logits, shown in Equation 1, for each level.

\text{Empirical logit}_{k} = \log{\left( \frac{p_{k}}{1 - p_{k}} \right)} \text{ : (the log odds of success at each level)} \tag{1}

To calculate the empirical logits at each level, we create a user-defined function implementing Equation 1 to do the calculation.

Code
emp_log_odds <- function(success, failure){
  return(round(log(success / failure), digits = 3))
}

We use the values from the frequency table that we created before and calculate the empirical log odds, each assigned to an appropriate variable.

Code
emp_log_odds_2 <- emp_log_odds(72, 428)
emp_log_odds_5 <- emp_log_odds(103, 397)
emp_log_odds_10 <- emp_log_odds(130, 370)
emp_log_odds_15 <- emp_log_odds(220, 280)
emp_log_odds_20 <- emp_log_odds(320, 180)
emp_log_odds_25 <- emp_log_odds(440, 60)
emp_log_odds_30 <- emp_log_odds(462, 38)

A scatter plot shows the relationship between the empirical log odds and the deposit value.

Code
plot(
  c(2, 5, 10, 15, 20, 25, 30),
  c(emp_log_odds_2, emp_log_odds_5, emp_log_odds_10, emp_log_odds_15,
        emp_log_odds_20, emp_log_odds_25, emp_log_odds_30),
  xlab = "Deposit",
  ylab = "Empirical log odds",
  col = "orangered",
  cex = 3,
  panel.first = grid()
)

Figure 2: Empirical Logits

It seems as if there is a linear relationship. It may make sense to include the deposit as a numerical variable. It creates a much simpler model, which we create below, assigned to the variable model_num.

Code
model_num <- glm(
  formula = returned ~ deposit,
  data = deposit,
  family = binomial(link = "logit")
)

round(summary(model_num)$coefficients, digits = 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.384      0.088 -26.996        0
deposit        0.157      0.005  30.263        0

This approach of considering an ordinal variable as a numerical variable was obvious in our illustrative example. It is not so in all cases. In cancer staging, with Grades I through IV, or response to medication such as low, low-medium, medium-high, or high, we may think of using values such as 1,2,3 and 4. It may make sense to use other values. In this case, we check for a linear relationship between the log odds of the response and the recoded explanatory variable. If the relationship is approximately linear, then we use the recoded values as a numerical variable.

7.2 k Level Nominal Explanatory Variables

Nominal variables cannot be recoded as numerical values. The only approach that we can take is to combine some of the levels into a single level, if it makes logical sense and can be argued for in the domain of the problem.

7.3 Numerical Variables

When we use numerical variables as explanatory variables in logistic regression (LR) models, we assume a linear relationship between the variable and the log odds of the response variable.

There are several approaches to determine the functional relationship between a numerical explanatory variable and the log odds of the response variable. These include locally weighted scatter plot smoothing, spline-based methods, and more.

We will explore the construction of empirical logit plots by binning the numerical predictor into k groups. This will help us find the functional relationship.

Binning refers to creating intervals of the same width, with each interval becoming a categorical level. Each value in the numerical explanatory variable then belongs to one of the levels. The choice of k, the number of bins, is arbitrary. We want to choose k large enough to discover the relationship, but not too large, as this may lead to highly variable empirical logits.

We generate data for an illustrative example. The data contains a numerical explanatory variable X being age, and a response variable, death vs. not death, with Y=1 indicating death. The data is generated below assigned to the data frame object age_death_data, for which we view the data table using the datatable form the DT library.

Code
set.seed(12345)
age <- round(runif(n = 1500, min = 0, max = 95), 2)
lp <- 0.0011*(age - 48)^2
pr <- exp(lp) / (1 + exp(lp))
death <- rbinom(n = 1500, size = 1, prob = pr)
age_death_data <- data.frame(age, death)

#DT::datatable(age_death_data)

A box plot visualizes the relationship in Figure 3.

Code
boxplot(
  age ~ death,
  data = age_death_data,
  main = "Age of survivors (0) and non-survivors (1)",
  las = 1
)

Figure 3: Age for Survivors and Non-Survivors

We start by choosing a few values for k. In most cases, three well chosen values should suffice. Below, we choose k=4, k=10, and k=25. The cut function below, shows the levels. The lowest value is sometimes impossible. It is just the result of the way the function calculates the cuts for the bins.

Code
levels(cut(age_death_data$age, 4))
[1] "(-0.0449,23.8]" "(23.8,47.5]"    "(47.5,71.2]"    "(71.2,95.1]"   
Code
levels(cut(age_death_data$age, 10))
 [1] "(-0.0449,9.54]" "(9.54,19]"      "(19,28.5]"      "(28.5,38]"     
 [5] "(38,47.5]"      "(47.5,57]"      "(57,66.5]"      "(66.5,76]"     
 [9] "(76,85.5]"      "(85.5,95.1]"   
Code
levels(cut(age_death_data$age, 25))
 [1] "(-0.0449,3.85]" "(3.85,7.64]"    "(7.64,11.4]"    "(11.4,15.2]"   
 [5] "(15.2,19]"      "(19,22.8]"      "(22.8,26.6]"    "(26.6,30.4]"   
 [9] "(30.4,34.2]"    "(34.2,38]"      "(38,41.8]"      "(41.8,45.6]"   
[13] "(45.6,49.4]"    "(49.4,53.2]"    "(53.2,57]"      "(57,60.8]"     
[17] "(60.8,64.6]"    "(64.6,68.4]"    "(68.4,72.2]"    "(72.2,76]"     
[21] "(76,79.8]"      "(79.8,83.6]"    "(83.6,87.4]"    "(87.4,91.2]"   
[25] "(91.2,95.1]"   

In the three cases above, we see 4, 10, and 25 intervals. In each case, we printed the intervals to the screen. The intervals are left-open right-closed.

Four intervals might be too few to find a functional relationship and 25 might be too many and hide an obvious functional relationship.

The emplogitplot1 function in the Stat2Data library uses the empirical logit function in Equation 1 and plots the values, shown in Figure 4. (In each group there will be x successes and n-x _failures with which Equation 1 is calculated. The emplogitplot1 function does a slightly different calculation to Equation 1 to avoid divisions by 0.

Code
Stat2Data::emplogitplot1(
  death ~ age,
  data = age_death_data,
  ngroups = 4
)

Figure 4: Using Four Bins

The blue line shows a linear relationship (a least squares line, which we ignore). Clearly, though, the plot is starting to show a quadratic (squared) relationship. Below in Figure 5, we look at 10 bins.

Code
Stat2Data::emplogitplot1(
  death ~ age,
  data = age_death_data,
  ngroups = 10
)

Figure 5: Using Ten Bins

The quadratic relationship is now more clear to see. The 25 bins do not add more information in Figure 6.

Code
Stat2Data::emplogitplot1(
  death ~ age,
  data = age_death_data,
  ngroups = 25
)

Figure 6: Using Twenty-Five Bins

We are starting to see overfitting when k=25.

Below are two models. The first is the naive model that assumes a linear relationship for the numerical explanatory variable, assigned to the variable naive_model.

Code
naive_model <- glm(
  death ~ age,
  data = age_death_data,
  family = binomial(link = "logit")
)
round(summary(naive_model)$coefficients, digits = 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.722      0.113   6.378     0.00
age            0.000      0.002   0.202     0.84

We see a p value of 0.84 in the case of the log odds ratio for the \texttt{age} variable. Next, we add the quadratic term for the age and assign the new model to the variable better_model. (Take note of the format of the formula keyword value.)

Code
better_model <- glm(
  formula = death ~ age + I(age^2),
  data = age_death_data,
  family = binomial(link = "logit")
)
round(summary(better_model)$coefficients, digits = 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    2.445       0.22  11.102        0
age           -0.100       0.01 -10.012        0
I(age^2)       0.001       0.00  10.392        0

Our model is now as shown in Equation 2 and age is a predictor of death (see below for interpretation).

\log{\left( \frac{\hat{\pi}(\text{age})}{1 - \hat{pi} (\text{age})} \right)} = 2.445 - 0.1 (\text{age}) + 0.001 (\text{age})^{2} \tag{2}

While we had the slopes \hat{\beta}_{1} = -0.1 and \beta_{2} = 0.001, we cannot express the effect on the log odds of death for a one unit increase in age, while keeping the other predictors constant, because the other predictor also contains age. In fact, the slope of age is a function of age, as we can see from the first derivative in Equation 3.

\frac{\mathbb{d}}{\mathbb{d} \text{ age}} \text{logit}({\hat{\pi} (\text{age})}) = 0.002 (\text{age}) - 0.1 \tag{3}

One way to interpret the numerical variable is to plot the logit and the probability, shown in Figure 7 and Figure 8.

Code
ages <- seq(min(age_death_data$age), max(age_death_data$age), by = 0.5)
plot(
  ages,
  2.45 - 0.1 * ages + 0.001 * ages^2,
  type = "l",
  main = "Logit plot",
  xlab = "Age",
  ylab = "Estimated log(odds of death)",
  las = 1,
  col = "orangered",
  lwd = 2,
  panel.first = grid()
)

Figure 7: Logit Plot

Code
ages <- seq(min(age_death_data$age), max(age_death_data$age), by = 0.5)
plot(
  ages,
  (exp(2.45 - 0.1 * ages + 0.001 * ages^2)) / (1 + exp(2.45 - 0.1 * ages + 0.001 * ages^2)),
  type = "l",
  main = "Probability plot",
  xlab = "Age",
  ylab = "Predicted probability of death",
  las = 1,
  panel.first = grid(),
  col = "orangered",
  lwd = 2
)

Figure 8: Logit Plot

We have now considered approaches to models with single explanatory variables. Next, we take a closer look at multivariable models.

8 Slopes of Models with Multiple Explanatory Variables

We might be tempted to directly compare the slopes in a generalized linear model. While this can be done for categorical predictors, it cannot immediately be done for different numerical explanatory variables.

In this section we explore how to standardize the predictors. Our data set is the familiar private insurance data set that we used before. We start by importing the data as a tibble object and assigning it to the variable insurance.

Code
insurance <- import_dataset("mus14_specific.csv")

We use the \texttt{married}, \texttt{age}, and \texttt{hhincome} variables as explanatory variables and the \texttt{private} variable as response variable to create a logistic regression model, assigned to the variable model.

Code
model <- glm(
  private ~ married + age + hhincome,
  data = insurance,
  family = binomial(link = "logit")
)

round(summary(model)$coefficients, digits = 4)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.2857     0.6733 -1.9097   0.0562
married       0.5656     0.0917  6.1693   0.0000
age           0.0024     0.0101  0.2330   0.8157
hhincome      0.0054     0.0008  6.4130   0.0000

There are two numerical predictors in our model, \texttt{hhincome}, which is household income, and \texttt{age}. The one unit increase in age is different from a unit increase in household income (which is measured in thousands of dollars).

While not required, we can standardize numerical explanatory variables to better compare them.

A numerical variable, X, is standardized using Equation 4, where S_{X} is the sample standard deviation of X, each observation has value x_{i}, and \bar{X} is the mean of the variable.

x_{i\text{ (standardized)}} = \frac{x_{i} - \overline{X}}{S_{X}} \tag{4}

This technique is most often used when using numerical variables in neural networks, where it is absolutely required.

The result is that we can now consider a 1 standard deviation increase instead of a 1 unit increase in the numerical explanatory variables.

Below, we standardize the two numerical variables using the mutate function from the dplyr library. The function adds new variables.

Code
insurance <- insurance %>% 
  mutate(
    age_std = ((age - mean(age)) / sd(age)),
    hhincome_std = ((hhincome - mean(hhincome)) / sd(hhincome))
  )

We can now use the variables in a new model, assigned to the variable model_std.

Code
model_std <- glm(
  private ~ married + age_std + hhincome_std,
  data = insurance,
  family = binomial(link = "logit")
)

round(summary(model_std)$coefficients, digits = 4)
             Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -0.8835     0.0800 -11.0422   0.0000
married        0.5656     0.0917   6.1693   0.0000
age_std        0.0087     0.0373   0.2330   0.8157
hhincome_std   0.3471     0.0541   6.4130   0.0000

The result of the Wald test (z value and p value) for the numerical variable does not change due to standardization. For both numerical explanatory variables, though, we can compare the slope estimates, and consider a single standard deviation increase in each.

Instead of standardization, we can also center the numerical variables. Centering a numerical variable uses the equation in Equation 5.

x_{i \text{ centered}} = x_{i} - \overline{X} \tag{5}

Below, we consider a bivariate model where we have the years of education as a (numerical) explanatory variable for purchasing private healthcare insurance, assigned to the variable model_edu.

Code
model_edu <- glm(
  formula = private ~ educyear,
  data = insurance,
  family = binomial(link = "logit")
)

round(summary(model_edu)$coefficients, digits = 4)
            Estimate Std. Error  z value Pr(>|z|)
(Intercept)  -2.4018     0.1605 -14.9666        0
educyear      0.1603     0.0127  12.6670        0

We add a new column for centering the years of education variable, using the mutate function.

Code
insurance <- insurance %>% 
  mutate(
    educyear_cen = educyear - mean(educyear)
  )

The mean of the years of education is calculated below.

Code
round(mean(insurance$educyear), digits = 1)
[1] 11.9

We note that it is 11.9 years.

Now we build a new model, assigned to the variable model_cen.

Code
model_cen <- glm(
  formula = private ~ educyear_cen,
  data = insurance,
  family = binomial(link = "logit")
)

round(summary(model_cen)$coefficients, digits = 4)
             Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -0.4948     0.0377 -13.1287        0
educyear_cen   0.1603     0.0127  12.6670        0

Note that the slope does not change.

The intercept in the original model is the log odds of having private insurance for someone with 0 years of education, which does not make sense. For the centered variable, the intercept is the log odds of private insurance for someone with 11.9 years of education. It is now meaningful, because 11.9 years is typical for the data.

9 Suggestions based on Study Design

Models can be used in the setting of exploratory data analysis. In this case, it might be common that there is little known or published in terms of the research question.

9.1 Models for Exploratory Studies

Guideline are much less clear when conducting exploratory studies. There are processes that can be followed, though.

Forward selection starts with a intercept only model and sequentially adds explanatory variables in order to optimize the fit of the model.

Backward elimination starts with a complex (many explanatory variables) model and sequentially eliminates a variable (largest p values) in order to optimize the fit of the model.

These procedures can be done automatically using a computer language such as R. The final model might be nonsensical, though. As an example, it may include an interaction term, but neither of the main effects that make up the interaction. Statistical significance should not be the decision driver. Important confounders must be included, even though their slopes are not significantly different from 0.

9.2 Model for Confirmatory Studies

It should be obvious that the explanatory variable that is designated as the primary predictor of the outcome should be included in a final model.

Past work in the field of the research question can be a great source of information and a proper literature review is important.

As far as interaction terms are concerned, we only include them if there is a strong sense that the association between the primary predictor and the outcome is modified by the potential modifiers. On the other hand, any known confounders should be included in a model.

9.3 Guidelines for Interaction Terms and k Level Explanatory Variables

Models that have interaction terms, XZ, must have their components, X and Z, (the main effects) in the model.

Consider a simple model with only binary interactions terms, X and Z , shown in Equation 6.

\text{logit} ( \pi ) = \beta_{0} + \beta_{1} x z \tag{6}

Table 1 shows the four possible cases. In all cases, except when both x=1 and z=1, we have the same log odds.

Table 1: Four Cases
x z \text{logit}( \ pi )
0 0 \beta_{0}
1 0 \beta_{0}
0 1 \beta_{0}
1 1 \beta_{0} + \beta_{1}

In the case of a k-level predictor, all k-1 dummy variables must be included.

10 Model Comparison

The comparison of models is an open problem in statistics. Various methods have been described to do just this. One such method is the Akaike information criterion (AIC), which we explore in this section. Others include Mallow’s C_{p}, and Bayesian information criterion (BIC).

10.1 Akaike Information Criterion

The AIC equation uses the log likelihood of the model (the probability of finding the model given the data) and the number of model parameters, as shown in Equation 7.

\text{AIC} = -2 [\log{(L_{\text{model}})} - \text{ number of model parameters}] \tag{7}

Given several models, we select the model with the smallest value for the AIC. Such a model maximizes the log likelihood and minimizes the penalty (the number of parameters). There is no absolute indication of how a model performs given the value of the AIC, but comparison between models can be made.

Despite wanting a model with a smallest AIC value, this should not be the driving force behind final model selection. There is a trade-off between explainability and complexity.

10.2 Viewing the Akaike Information Criterion in R

Below, we build two models for comparison, using the same data as in the section on standardization and centering, and assign the data to the variable dfr. We also use the mutate function to standardize the two numerical variables \texttt{age} and \texttt{hhincome}. The two new variables are named \texttt{agestd} and \texttt{hhincomestd}

Code
dfr <- import_dataset("mus14_specific.csv")

dfr <- dfr %>% 
  mutate(
    agestd = ((age - mean(age)) / sd(age)),
    hhincomestd = ((hhincome - mean(hhincome)) / sd(hhincome))
  )

We generate a model, with three explanatory variables, assigned to the variable model_1.

Code
model_1 <- glm(
  formula = private ~ married + agestd + hhincomestd,
  data = dfr,
  family = binomial(link = "logit")
)
summary(model_1)

Call:
glm(formula = private ~ married + agestd + hhincomestd, family = binomial(link = "logit"), 
    data = dfr)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.883457   0.080007 -11.042  < 2e-16 ***
married      0.565639   0.091686   6.169 6.86e-10 ***
agestd       0.008683   0.037262   0.233    0.816    
hhincomestd  0.347063   0.054119   6.413 1.43e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4279.5  on 3205  degrees of freedom
Residual deviance: 4153.5  on 3202  degrees of freedom
AIC: 4161.5

Number of Fisher Scoring iterations: 4

A summary of the model shows an AIC value of 4161.5. Now for the second model, which omits \texttt{agestd}, assigned to the variable model_2.

Code
model_2 <- glm(
  formula = private ~ married + hhincomestd,
  data = dfr,
  family = binomial(link = "logit")
)
summary(model_2)

Call:
glm(formula = private ~ married + hhincomestd, family = binomial(link = "logit"), 
    data = dfr)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.88555    0.07951 -11.138  < 2e-16 ***
married      0.56853    0.09085   6.258 3.90e-10 ***
hhincomestd  0.34737    0.05410   6.421 1.36e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4279.5  on 3205  degrees of freedom
Residual deviance: 4153.5  on 3203  degrees of freedom
AIC: 4159.5

Number of Fisher Scoring iterations: 4

The AIC is now 4159.5, lower than that for the first model. Using only the AIC, we would use the second model.

Note that the slopes in the second model are almost the same as for the first model. The \texttt{agestd} variable seems not to add much to the explanation of the association with the response variable in the model.

11 Lab Problems

The data set contained in the file multi08.csv has simulated data on three pairs of predictors and outcomes, \texttt{Pred1} and \texttt{Response1}, \texttt{Pred2} and \texttt{Response2}, and \texttt{Pred3} and \texttt{Response3}. All the predictors are numerical variables on the interval 0 to 10. The response variables are all binary variable (1 is the success level, and 0 is the failure level).

The data is imported and assigned to the variable multi01 below.

Code
multi01 <- import_dataset("multi08.csv")

11.1 Question

Suppose that you want to model the relationship between the numerical predictor \texttt{Pred1} and the binary response \texttt{Response1} using a logistic regression model. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt{Pred1} on the horizontal axis where \texttt{Pred1} has been categorized into k=15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt{Pred1}? Write down the form of the logistic model that best models the association between \texttt{Pred1} and the log odds of response.

We use the emplogitplot1 function from the Stat2Data library to create the empirical logit plot, using a value of 15 for the ngroups argument.

Code
Stat2Data::emplogitplot1(
  Response1 ~ Pred1,
  data = multi01,
  ngroups = 15
)

There is an approximately linear relationship between the log odds of response and \texttt{Pred1}. Let \texttt{Pred1} be x. The logistic model that best models the association is then as shown in Equation 8.

\log{\left( \frac{\pi (x)}{1 - \pi (x)} \right)} = \beta_{0} + \beta_{1} x \tag{8}

11.2 Question

Suppose that you want to model the relationship between the numerical predictor \texttt{Pred2} and the binary response \texttt{Response2} using a logistic regression model. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt{Pred2} on the horizontal axis where \texttt{Pred2} has been categorized into 15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt{Pred2}? Write down the form of the logistic model that best models the association between \texttt{Pred2} and the log odds of response.

We use the emplogitplot1 function again.

Code
Stat2Data::emplogitplot1(
  Response2 ~ Pred2,
  data = multi01,
  ngroups = 15
)

There is a nonlinear relationship between the log odds of response and \texttt {Pred2}. It appears that, for values of \texttt {Pred2} below 5, the log odds of response are close to 0, while for values of \texttt {Pred2} at or above 5, the log odds of response are close to 4. This suggests that it may make sense to dichotomize \texttt {Pred2} into a binary variable, \texttt {Pred2bin}, such that \texttt {Pred2bin} = 1 if \texttt {Pred2} \ge 5 or 0 if \texttt {Pred2}< 5. The logistic model that best models the association between \texttt {Pred2} (actually using \texttt {Pred2bin}) and the log odds of response is as follows in Equation 9 for \texttt {Pred2bin} =x.

\log{\left( \frac{\pi (x)}{1 - \pi (x)} \right)} = \beta_{0} + \beta_{1} x \tag{9}

11.3 Question

Suppose that you want to model the relationship between the numerical predictor \texttt {Pred3} and the binary response \texttt {Response3} using logistic regression. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt {Pred3} on the horizontal axis where \texttt {Pred3} has been categorized into 15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt {Pred3} ? Write down the form of the logistic model that best models the association between \texttt {Pred3} and the log odds of response. Suppose that you want to model the relationship between the numerical predictor \texttt{Pred3} and the binary response \texttt{Response3} using a logistic regression model. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt{Pred3} on the horizontal axis where \texttt{Pred3} has been categorized into 15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt{Pred3}? Write down the form of the logistic model that best models the association between \texttt{Pred3} and the log odds of response.

Code
Stat2Data::emplogitplot1(
  Response3 ~ Pred3,
  data = multi01,
  ngroups = 15
)

There is a nonlinear relationship between the log odds of response and \texttt{Pred3}. Since the log odds of response show an approximately curved relationship with \texttt{Pred3} that may be approximated by a quadratic curve, we may want to include the square of \texttt{Pred3} in our logistic regression model. The logistic model that best models the association between \texttt{Pred3} and the log odds of response is shown in Equation 10 where \texttt{Pred3}=x.

\log{\left( \frac{\pi (x)}{1 - \pi (x)} \right)} = \beta_{0} + \beta_{1} x + \beta_{2} x^{2} \tag{10}

11.4 Question

Consider \texttt{Pred1} and \texttt{Response1} again. Fit the model that you wrote down in Question 1 and also fit a model that includes the square of \texttt{Pred3} as a predictor. Based on Akaike Information Criteria (AIC), which model is best?

We construct two models as suggested, assigned to the variables model_01_04_simple and model_01_04_quad respectively. We use the aic attribute to return the AIC for each model

Code
model_01_04_simple <- glm(
  formula = Response1 ~ Pred1,
  data = multi01,
  family = binomial(link = "logit")
)

model_01_04_simple$aic
[1] 551.9545
Code
model_01_04_quad <- glm(
  formula = Response1 ~ Pred1 + I(Pred1^2),
  data = multi01,
  family = binomial(link = "logit")
)

model_01_04_quad$aic
[1] 552.6454

The model with \texttt{Pred1} only, has an AIC of 551.95, while the model that includes the squared term, has an AIC of 552.65. Since the model with \texttt{Pred1}only has the smaller AIC, this is the best model using only AIC to make our decision.

11.5 Question

Consider \texttt{Pred2} and \texttt{Response2} again. Fit the model that you wrote down in Question 2 and also fit a model that includes the binarized version of \texttt{Pred2} as a predictor. Based on AIC, which model is best?

As before, we construct two models, assigned to the variables model_01_05_simple and model_01_05_bin respectively.

Code
model_01_05_simple <- glm(
  formula = Response2 ~ Pred2,
  data = multi01,
  family = binomial(link = "logit")
)

model_01_05_simple$aic
[1] 305.1697

A new variable, Pred2bin, is created using the mutate function and the case_when function.

Code
multi01 <- multi01 %>% 
  mutate(
    Pred2bin = case_when(
      Pred2 < 5 ~ 0,
      Pred2 >= 5 ~ 1
    ),
    Pred2bin = factor(
      Pred2bin,
      levels = c(0, 1)
    )
  )

model_01_05_bin <- glm(
  formula = Response2 ~ Pred2bin,
  data = multi01,
  family = binomial(link = "logit")
)

model_01_05_bin$aic
[1] 270.431

The model with \texttt{Pred2} only, has an AIC of 305.17, while the model that includes the binarized term has an AIC of 270.43. Since the model with \texttt{Pred2bin} has the smaller AIC, this is the best model using only AIC to make our decision.

11.6 Question

Consider \texttt{Pred3} and \texttt{Response3} again. Fit the model that you wrote down in Question 3 and also fit a model that assumes a quadratic relationship between \texttt{Pred3} and the log odds of response. Based on AIC, which model is best?

Once again, we generate the two models assigned to the variable model_01_06_simple and model_01_06_quad respectively.

Code
model_01_06_simple <- glm(
  formula = Response3 ~ Pred3,
  data = multi01,
  family = binomial(link = "logit")
)

model_01_06_simple$aic
[1] 540.8581
Code
model_01_06_quad <- glm(
  formula = Response3 ~ Pred3 + I(Pred3^2),
  data = multi01,
  family = binomial(link = "logit")
)

model_01_06_quad$aic
[1] 534.473

The model that includes the squared \texttt{Pred3} predictor has an AIC of 534.47 while the model that assumes a linear relationship between the log odds of response and \texttt{Pred3} has an AIC of 540.86. Since the model with the squared \texttt{Pred3} predictor has the smaller AIC, this is the best model.

11.7 Question

The lect8_class.csv data file contains a response variable (encoded with 0 and 1), a numerical explanatory variable, \texttt{Numerical}, and an ordinal explanatory variable, \texttt{Ordinal}. Find the functional relationship between the numerical variable and the response using the empirical logit function. Create a model with a linear relationship and a model with a non-linear relationship and compare them.

The data is imported below and assigned to the variable class_data.

Code
class_data <- import_dataset("lect8_class.csv")

To understand the data, we generate a box-and-whisker plot in Figure 9 below.

Code
boxplot(
  Numerical ~ Response,
  data = class_data,
  col="#4197D9"
)

Figure 9: Box-and-Whisker Plot for Question 7

The describe function from the psych library returns summary statistics for the \texttt{Numerical} variable.

Code
psych::describe(class_data$Numerical)
   vars  n    mean      sd median trimmed     mad  min  max range skew kurtosis
X1    1 78 2441.47 1072.19   2197 2367.11 1283.93 1000 4913  3913 0.46    -0.47
      se
X1 121.4

It seems that we can consider seven classes (eight breaks), starting at 900 and incrementing by 700. The emplogitplot1 function returns the empirical logit plot.

Code
s <- 900
k <- 8
r <- 700

Stat2Data::emplogitplot1(
  Response ~ Numerical,
  data = class_data,
  breaks = seq(from = s, to = k * r, by = r),
  showline = FALSE
)

Two models are create below. The model, class_num_model, contains only the \texttt{Numerical} variables as explanatory variable. The model class_num_func_model also contain the cube of the \texttt{Numerical} variable.

Code
class_num_model <- glm(
  formula = Response ~ Numerical,
  data = class_data,
  family = binomial(link = "logit")
)

summary(class_num_model)

Call:
glm(formula = Response ~ Numerical, family = binomial(link = "logit"), 
    data = class_data)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.8618894  0.7631667  -3.750 0.000177 ***
Numerical    0.0011066  0.0002949   3.752 0.000175 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 107.669  on 77  degrees of freedom
Residual deviance:  88.301  on 76  degrees of freedom
AIC: 92.301

Number of Fisher Scoring iterations: 4
Code
class_num_func_model <- glm(
  formula = Response ~ Numerical + I(Numerical^2),
  data = class_data,
  family = binomial(link = "logit")
)

summary(class_num_func_model)

Call:
glm(formula = Response ~ Numerical + I(Numerical^2), family = binomial(link = "logit"), 
    data = class_data)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)  
(Intercept)    -3.466e+00  1.633e+00  -2.123   0.0338 *
Numerical       1.635e-03  1.274e-03   1.283   0.1993  
I(Numerical^2) -1.018e-07  2.350e-07  -0.433   0.6649  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 107.67  on 77  degrees of freedom
Residual deviance:  88.12  on 75  degrees of freedom
AIC: 94.12

Number of Fisher Scoring iterations: 3

The simpler model contains the lower AIC. A logit plot is shown in Figure 10.

Code
num_values <- seq(min(class_data$Numerical), max(class_data$Numerical), by = 0.1)
plot(
  num_values,
  -2.8618894 + 0.0011066 * num_values,
  type = "l",
  main = "Logit plot",
  xlab = "Numerical variable value",
  ylab = "Estimated log(odds of death)",
  las = 1,
  col = "orangered",
  lwd = 2,
  panel.first = grid()
)

Figure 10: Logit Plot for Question 7

By using Equation 11, we generate Figure 11 that shows the estimated probability of death, given a value for the \texttt{Numerical} variable.

\frac{e^z}{1+e^{z}} \tag{11}

Code
num_values <- seq(min(class_data$Numerical), max(class_data$Numerical), by = 0.1)
plot(
  num_values,
  exp(-2.8618894 + 0.0011066 * num_values) / (1 + exp(-2.8618894 + 0.0011066 * num_values)),
  type = "l",
  main = "Probability of death",
  xlab = "Numerical variable values",
  ylab = "Estimated probability of death",
  las = 1,
  col = "orangered",
  lwd = 2,
  panel.first = grid()
)

Figure 11: Logit Plot for Question 7

11.8 Question

Reconsider Question 7. Determine if the ordinal explanatory variable, \texttt{Ordinal}, can be used as a numerical variable.

The table function summarizes the data.

Code
table(class_data$Ordinal, class_data$Response)
   
     0  1
  1 12  7
  2 12  4
  3  9  6
  4  6  9
  5  3 10

We calculate the odds of success for each of the values of the ordinal variable and assign the result using the emp_logit variable, for which we then express the logarithm using the as.numeric function.

Code
emp_logit <- table(class_data$Ordinal, class_data$Response)[, 2] / table(class_data$Ordinal, class_data$Response)[, 1]
log(as.numeric(emp_logit))
[1] -0.5389965 -1.0986123 -0.4054651  0.4054651  1.2039728

In Figure 12, we view the empirical logits.

Code
plot(
  c(1, 2, 3, 4, 5),
  log(as.numeric(emp_logit)),
  xlab = "Ordinal response values",
  ylab = "Empirical log odds",
  col = "orangered",
  cex = 3,
  panel.first = grid()
)

Figure 12: Empirical Log Odds

It does not seem as if the \texttt{Ordinal} variable can be used as a numerical variable in the model.