Code
library(rio)
library(dplyr)
#library(DT)
This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
Given enough knowledge, anyone can build a model with any variety of explanatory variables. Throwing variables and interactions at a model, might produce results, but they are deleterious to many of the response why we use models. A lot more thought has to go into creating a model.
The decisions around model building has many components. We discuss some of these in broader terms in this chapter.
Load the required libraries for Chapter 8. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
#library(DT)
There are a lot of decisions to make and transparency is key, especially in view of research reproducibility. The following five steps are key. They actually also pertain to the whole process of research, not just the model creation decisions.
There may only be Five steps, but it does include a lot of work. Remember too that clinical trials and their protocols have to be registered before any actual data collection and analysis can commence.
Models are used in two main ways.
Machine learning has really taken a front seat when it comes to predictions. The prediction of a categorical response variable is termed a classification problem. Data is typically divided into a training and a test set. The training set is used to train a model and the test set (unseen during the creation of the machine learning model) to test the possible accuracy of the model in the wild. The metrics used during testing may reflect how well the model will do when used in the wild. There are many types of approaches in machine learning including approaches built on decision trees and neural networks. Neural networks minimize a cost function through repeated processes of forward and backpropagation, attempting to minimize a cost function. It is, in essence, an optimization problem.
We have mentioned that there are many considerations when creating a model. These are typically divided into statistical and scientific considerations. Both are equally valid and important. There are too many to list in both cases. Some are highlighted below.
We ultimately want a parsimonious model, that is, one that is simple but explains the data well. Such a model is interpretable and finds use after publication.
Remember too, that there are many types of models for categorical response variables. We have concentrated on the arguably most useful and most used, logistic regression (LR) model. With LR models we estimate odds ratios. There are also model type for risk ratios (log-binomial) and risk difference (linear probability).
It makes sense to consider other model types previously used in the field of study of current research project.
All modeling starts with a proper research question that ultimately determines the model and selection of explanatory variables. When given data, the purposeful selection of explanatory variables is key. We often consider a primary explanatory variable, X, and a set of potential confounders, Z_{1} , Z_{2} \ldots. We simply do not just add explanatory variables without consideration and purpose. Any variable that is included should be well justified.
As a rule of thumb some suggest the following.
A large number of predictors lead to large standard errors which means high variability in the parameter estimates.
Multicollinearity, that is, correlation among the explanatory variables also lead to model instability through raised standard errors and high estimate variability.
In summary, we can state the following general rules as we start to consider modeling.
We have started by looking at a sensible way to start thinking about model building. Next, we consider the individual explanatory variable types: ordinal, multi-level, and continuous, exploring the following ideas.
For a k level ordinal explanatory variable, we would have k-1 dummy variables. It is important to consider that more slopes mean more variability in the estimates. It may make sense to consider converting an ordinal variable to a numerical variable.
In the illustrative example below, we import data that looked at the return of bottles, Y, with Y=1 if the bottle is returned, given the amount of the deposit returned, X. We assign the imported data set to the variable deposit
.
<- import_dataset("deposit_data.csv") deposit
A summary table shows that there seems to be an increased number of returns as the deposit values increases.
xtabs(
~ .,
data = deposit
)
returned
deposit 0 1
2 428 72
5 397 103
10 370 130
15 280 220
20 180 320
25 60 440
30 38 462
If we select only the returns, using the filter
function as returning a tibble object, we can create a list of objects in order to plot the relationship between increasing deposit values and the number of returns, shown in Figure 1.
<- deposit %>%
returns filter(returned == 1) %>%
tibble()
<- table(returns$deposit)
tab_returns
<- as.numeric(tab_returns)
tab_returns_values <- names(tab_returns)
tab_returns_labels
barplot(
tab_returns_values,names.arg = tab_returns_labels,
main = "Frequency of returns as deposit value increases",
xlab = "Value of deposit",
ylab = "Frequency of returns",
las = 1,
panel.first = grid()
)
We see the increase in the frequency of returns as the value of the deposit increases.
If the return is seen as a k level explanatory variable, then we get the following model, using a deposit value of 2 as the reference level. We use the glm
function and assign the model to the variable model_cat
, with the coef
function to return the coefficient estimates.
<- glm(
model_cat formula = returned ~ factor(deposit),
data = deposit,
family = binomial(link = "logit")
)
coef(model_cat)
(Intercept) factor(deposit)5 factor(deposit)10 factor(deposit)15
-1.7824571 0.4332498 0.7364885 1.5412950
factor(deposit)20 factor(deposit)25 factor(deposit)30
2.3578212 3.7748872 4.2804358
It looks like the slope increases as the deposit value increases. It may not be linear, but we may want to view the deposit as a numerical variable. To decide on this, we calculate the empirical logits, shown in Equation 1, for each level.
\text{Empirical logit}_{k} = \log{\left( \frac{p_{k}}{1 - p_{k}} \right)} \text{ : (the log odds of success at each level)} \tag{1}
To calculate the empirical logits at each level, we create a user-defined function implementing Equation 1 to do the calculation.
<- function(success, failure){
emp_log_odds return(round(log(success / failure), digits = 3))
}
We use the values from the frequency table that we created before and calculate the empirical log odds, each assigned to an appropriate variable.
<- emp_log_odds(72, 428)
emp_log_odds_2 <- emp_log_odds(103, 397)
emp_log_odds_5 <- emp_log_odds(130, 370)
emp_log_odds_10 <- emp_log_odds(220, 280)
emp_log_odds_15 <- emp_log_odds(320, 180)
emp_log_odds_20 <- emp_log_odds(440, 60)
emp_log_odds_25 <- emp_log_odds(462, 38) emp_log_odds_30
A scatter plot shows the relationship between the empirical log odds and the deposit value.
plot(
c(2, 5, 10, 15, 20, 25, 30),
c(emp_log_odds_2, emp_log_odds_5, emp_log_odds_10, emp_log_odds_15,
emp_log_odds_20, emp_log_odds_25, emp_log_odds_30),xlab = "Deposit",
ylab = "Empirical log odds",
col = "orangered",
cex = 3,
panel.first = grid()
)
It seems as if there is a linear relationship. It may make sense to include the deposit as a numerical variable. It creates a much simpler model, which we create below, assigned to the variable model_num
.
<- glm(
model_num formula = returned ~ deposit,
data = deposit,
family = binomial(link = "logit")
)
round(summary(model_num)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.384 0.088 -26.996 0
deposit 0.157 0.005 30.263 0
This approach of considering an ordinal variable as a numerical variable was obvious in our illustrative example. It is not so in all cases. In cancer staging, with Grades I through IV, or response to medication such as low, low-medium, medium-high, or high, we may think of using values such as 1,2,3 and 4. It may make sense to use other values. In this case, we check for a linear relationship between the log odds of the response and the recoded explanatory variable. If the relationship is approximately linear, then we use the recoded values as a numerical variable.
Nominal variables cannot be recoded as numerical values. The only approach that we can take is to combine some of the levels into a single level, if it makes logical sense and can be argued for in the domain of the problem.
When we use numerical variables as explanatory variables in logistic regression (LR) models, we assume a linear relationship between the variable and the log odds of the response variable.
There are several approaches to determine the functional relationship between a numerical explanatory variable and the log odds of the response variable. These include locally weighted scatter plot smoothing, spline-based methods, and more.
We will explore the construction of empirical logit plots by binning the numerical predictor into k groups. This will help us find the functional relationship.
Binning refers to creating intervals of the same width, with each interval becoming a categorical level. Each value in the numerical explanatory variable then belongs to one of the levels. The choice of k, the number of bins, is arbitrary. We want to choose k large enough to discover the relationship, but not too large, as this may lead to highly variable empirical logits.
We generate data for an illustrative example. The data contains a numerical explanatory variable X being age, and a response variable, death vs. not death, with Y=1 indicating death. The data is generated below assigned to the data frame object age_death_data
, for which we view the data table using the datatable
form the DT
library.
set.seed(12345)
<- round(runif(n = 1500, min = 0, max = 95), 2)
age <- 0.0011*(age - 48)^2
lp <- exp(lp) / (1 + exp(lp))
pr <- rbinom(n = 1500, size = 1, prob = pr)
death <- data.frame(age, death)
age_death_data
#DT::datatable(age_death_data)
A box plot visualizes the relationship in Figure 3.
boxplot(
~ death,
age data = age_death_data,
main = "Age of survivors (0) and non-survivors (1)",
las = 1
)
We start by choosing a few values for k. In most cases, three well chosen values should suffice. Below, we choose k=4, k=10, and k=25. The cut
function below, shows the levels. The lowest value is sometimes impossible. It is just the result of the way the function calculates the cuts for the bins.
levels(cut(age_death_data$age, 4))
[1] "(-0.0449,23.8]" "(23.8,47.5]" "(47.5,71.2]" "(71.2,95.1]"
levels(cut(age_death_data$age, 10))
[1] "(-0.0449,9.54]" "(9.54,19]" "(19,28.5]" "(28.5,38]"
[5] "(38,47.5]" "(47.5,57]" "(57,66.5]" "(66.5,76]"
[9] "(76,85.5]" "(85.5,95.1]"
levels(cut(age_death_data$age, 25))
[1] "(-0.0449,3.85]" "(3.85,7.64]" "(7.64,11.4]" "(11.4,15.2]"
[5] "(15.2,19]" "(19,22.8]" "(22.8,26.6]" "(26.6,30.4]"
[9] "(30.4,34.2]" "(34.2,38]" "(38,41.8]" "(41.8,45.6]"
[13] "(45.6,49.4]" "(49.4,53.2]" "(53.2,57]" "(57,60.8]"
[17] "(60.8,64.6]" "(64.6,68.4]" "(68.4,72.2]" "(72.2,76]"
[21] "(76,79.8]" "(79.8,83.6]" "(83.6,87.4]" "(87.4,91.2]"
[25] "(91.2,95.1]"
In the three cases above, we see 4, 10, and 25 intervals. In each case, we printed the intervals to the screen. The intervals are left-open right-closed.
Four intervals might be too few to find a functional relationship and 25 might be too many and hide an obvious functional relationship.
The emplogitplot1
function in the Stat2Data
library uses the empirical logit function in Equation 1 and plots the values, shown in Figure 4. (In each group there will be x successes and n-x _failures with which Equation 1 is calculated. The emplogitplot1
function does a slightly different calculation to Equation 1 to avoid divisions by 0.
::emplogitplot1(
Stat2Data~ age,
death data = age_death_data,
ngroups = 4
)
The blue line shows a linear relationship (a least squares line, which we ignore). Clearly, though, the plot is starting to show a quadratic (squared) relationship. Below in Figure 5, we look at 10 bins.
::emplogitplot1(
Stat2Data~ age,
death data = age_death_data,
ngroups = 10
)
The quadratic relationship is now more clear to see. The 25 bins do not add more information in Figure 6.
::emplogitplot1(
Stat2Data~ age,
death data = age_death_data,
ngroups = 25
)
We are starting to see overfitting when k=25.
Below are two models. The first is the naive model that assumes a linear relationship for the numerical explanatory variable, assigned to the variable naive_model
.
<- glm(
naive_model ~ age,
death data = age_death_data,
family = binomial(link = "logit")
)round(summary(naive_model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.722 0.113 6.378 0.00
age 0.000 0.002 0.202 0.84
We see a p value of 0.84 in the case of the log odds ratio for the \texttt{age} variable. Next, we add the quadratic term for the age and assign the new model to the variable better_model
. (Take note of the format of the formula
keyword value.)
<- glm(
better_model formula = death ~ age + I(age^2),
data = age_death_data,
family = binomial(link = "logit")
)round(summary(better_model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.445 0.22 11.102 0
age -0.100 0.01 -10.012 0
I(age^2) 0.001 0.00 10.392 0
Our model is now as shown in Equation 2 and age is a predictor of death (see below for interpretation).
\log{\left( \frac{\hat{\pi}(\text{age})}{1 - \hat{pi} (\text{age})} \right)} = 2.445 - 0.1 (\text{age}) + 0.001 (\text{age})^{2} \tag{2}
While we had the slopes \hat{\beta}_{1} = -0.1 and \beta_{2} = 0.001, we cannot express the effect on the log odds of death for a one unit increase in age, while keeping the other predictors constant, because the other predictor also contains age. In fact, the slope of age is a function of age, as we can see from the first derivative in Equation 3.
\frac{\mathbb{d}}{\mathbb{d} \text{ age}} \text{logit}({\hat{\pi} (\text{age})}) = 0.002 (\text{age}) - 0.1 \tag{3}
One way to interpret the numerical variable is to plot the logit and the probability, shown in Figure 7 and Figure 8.
<- seq(min(age_death_data$age), max(age_death_data$age), by = 0.5)
ages plot(
ages,2.45 - 0.1 * ages + 0.001 * ages^2,
type = "l",
main = "Logit plot",
xlab = "Age",
ylab = "Estimated log(odds of death)",
las = 1,
col = "orangered",
lwd = 2,
panel.first = grid()
)
<- seq(min(age_death_data$age), max(age_death_data$age), by = 0.5)
ages plot(
ages,exp(2.45 - 0.1 * ages + 0.001 * ages^2)) / (1 + exp(2.45 - 0.1 * ages + 0.001 * ages^2)),
(type = "l",
main = "Probability plot",
xlab = "Age",
ylab = "Predicted probability of death",
las = 1,
panel.first = grid(),
col = "orangered",
lwd = 2
)
We have now considered approaches to models with single explanatory variables. Next, we take a closer look at multivariable models.
We might be tempted to directly compare the slopes in a generalized linear model. While this can be done for categorical predictors, it cannot immediately be done for different numerical explanatory variables.
In this section we explore how to standardize the predictors. Our data set is the familiar private insurance data set that we used before. We start by importing the data as a tibble object and assigning it to the variable insurance
.
<- import_dataset("mus14_specific.csv") insurance
We use the \texttt{married}, \texttt{age}, and \texttt{hhincome} variables as explanatory variables and the \texttt{private} variable as response variable to create a logistic regression model, assigned to the variable model
.
<- glm(
model ~ married + age + hhincome,
private data = insurance,
family = binomial(link = "logit")
)
round(summary(model)$coefficients, digits = 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.2857 0.6733 -1.9097 0.0562
married 0.5656 0.0917 6.1693 0.0000
age 0.0024 0.0101 0.2330 0.8157
hhincome 0.0054 0.0008 6.4130 0.0000
There are two numerical predictors in our model, \texttt{hhincome}, which is household income, and \texttt{age}. The one unit increase in age is different from a unit increase in household income (which is measured in thousands of dollars).
While not required, we can standardize numerical explanatory variables to better compare them.
A numerical variable, X, is standardized using Equation 4, where S_{X} is the sample standard deviation of X, each observation has value x_{i}, and \bar{X} is the mean of the variable.
x_{i\text{ (standardized)}} = \frac{x_{i} - \overline{X}}{S_{X}} \tag{4}
This technique is most often used when using numerical variables in neural networks, where it is absolutely required.
The result is that we can now consider a 1 standard deviation increase instead of a 1 unit increase in the numerical explanatory variables.
Below, we standardize the two numerical variables using the mutate
function from the dplyr
library. The function adds new variables.
<- insurance %>%
insurance mutate(
age_std = ((age - mean(age)) / sd(age)),
hhincome_std = ((hhincome - mean(hhincome)) / sd(hhincome))
)
We can now use the variables in a new model, assigned to the variable model_std
.
<- glm(
model_std ~ married + age_std + hhincome_std,
private data = insurance,
family = binomial(link = "logit")
)
round(summary(model_std)$coefficients, digits = 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8835 0.0800 -11.0422 0.0000
married 0.5656 0.0917 6.1693 0.0000
age_std 0.0087 0.0373 0.2330 0.8157
hhincome_std 0.3471 0.0541 6.4130 0.0000
The result of the Wald test (z value and p value) for the numerical variable does not change due to standardization. For both numerical explanatory variables, though, we can compare the slope estimates, and consider a single standard deviation increase in each.
Instead of standardization, we can also center the numerical variables. Centering a numerical variable uses the equation in Equation 5.
x_{i \text{ centered}} = x_{i} - \overline{X} \tag{5}
Below, we consider a bivariate model where we have the years of education as a (numerical) explanatory variable for purchasing private healthcare insurance, assigned to the variable model_edu
.
<- glm(
model_edu formula = private ~ educyear,
data = insurance,
family = binomial(link = "logit")
)
round(summary(model_edu)$coefficients, digits = 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.4018 0.1605 -14.9666 0
educyear 0.1603 0.0127 12.6670 0
We add a new column for centering the years of education variable, using the mutate
function.
<- insurance %>%
insurance mutate(
educyear_cen = educyear - mean(educyear)
)
The mean of the years of education is calculated below.
round(mean(insurance$educyear), digits = 1)
[1] 11.9
We note that it is 11.9 years.
Now we build a new model, assigned to the variable model_cen
.
<- glm(
model_cen formula = private ~ educyear_cen,
data = insurance,
family = binomial(link = "logit")
)
round(summary(model_cen)$coefficients, digits = 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4948 0.0377 -13.1287 0
educyear_cen 0.1603 0.0127 12.6670 0
Note that the slope does not change.
The intercept in the original model is the log odds of having private insurance for someone with 0 years of education, which does not make sense. For the centered variable, the intercept is the log odds of private insurance for someone with 11.9 years of education. It is now meaningful, because 11.9 years is typical for the data.
Models can be used in the setting of exploratory data analysis. In this case, it might be common that there is little known or published in terms of the research question.
Guideline are much less clear when conducting exploratory studies. There are processes that can be followed, though.
Forward selection starts with a intercept only model and sequentially adds explanatory variables in order to optimize the fit of the model.
Backward elimination starts with a complex (many explanatory variables) model and sequentially eliminates a variable (largest p values) in order to optimize the fit of the model.
These procedures can be done automatically using a computer language such as R. The final model might be nonsensical, though. As an example, it may include an interaction term, but neither of the main effects that make up the interaction. Statistical significance should not be the decision driver. Important confounders must be included, even though their slopes are not significantly different from 0.
It should be obvious that the explanatory variable that is designated as the primary predictor of the outcome should be included in a final model.
Past work in the field of the research question can be a great source of information and a proper literature review is important.
As far as interaction terms are concerned, we only include them if there is a strong sense that the association between the primary predictor and the outcome is modified by the potential modifiers. On the other hand, any known confounders should be included in a model.
Models that have interaction terms, XZ, must have their components, X and Z, (the main effects) in the model.
Consider a simple model with only binary interactions terms, X and Z , shown in Equation 6.
\text{logit} ( \pi ) = \beta_{0} + \beta_{1} x z \tag{6}
Table 1 shows the four possible cases. In all cases, except when both x=1 and z=1, we have the same log odds.
x | z | \text{logit}( \ pi ) |
---|---|---|
0 | 0 | \beta_{0} |
1 | 0 | \beta_{0} |
0 | 1 | \beta_{0} |
1 | 1 | \beta_{0} + \beta_{1} |
In the case of a k-level predictor, all k-1 dummy variables must be included.
The comparison of models is an open problem in statistics. Various methods have been described to do just this. One such method is the Akaike information criterion (AIC), which we explore in this section. Others include Mallow’s C_{p}, and Bayesian information criterion (BIC).
The AIC equation uses the log likelihood of the model (the probability of finding the model given the data) and the number of model parameters, as shown in Equation 7.
\text{AIC} = -2 [\log{(L_{\text{model}})} - \text{ number of model parameters}] \tag{7}
Given several models, we select the model with the smallest value for the AIC. Such a model maximizes the log likelihood and minimizes the penalty (the number of parameters). There is no absolute indication of how a model performs given the value of the AIC, but comparison between models can be made.
Despite wanting a model with a smallest AIC value, this should not be the driving force behind final model selection. There is a trade-off between explainability and complexity.
Below, we build two models for comparison, using the same data as in the section on standardization and centering, and assign the data to the variable dfr
. We also use the mutate
function to standardize the two numerical variables \texttt{age} and \texttt{hhincome}. The two new variables are named \texttt{agestd} and \texttt{hhincomestd}
<- import_dataset("mus14_specific.csv")
dfr
<- dfr %>%
dfr mutate(
agestd = ((age - mean(age)) / sd(age)),
hhincomestd = ((hhincome - mean(hhincome)) / sd(hhincome))
)
We generate a model, with three explanatory variables, assigned to the variable model_1
.
<- glm(
model_1 formula = private ~ married + agestd + hhincomestd,
data = dfr,
family = binomial(link = "logit")
)summary(model_1)
Call:
glm(formula = private ~ married + agestd + hhincomestd, family = binomial(link = "logit"),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.883457 0.080007 -11.042 < 2e-16 ***
married 0.565639 0.091686 6.169 6.86e-10 ***
agestd 0.008683 0.037262 0.233 0.816
hhincomestd 0.347063 0.054119 6.413 1.43e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4279.5 on 3205 degrees of freedom
Residual deviance: 4153.5 on 3202 degrees of freedom
AIC: 4161.5
Number of Fisher Scoring iterations: 4
A summary of the model shows an AIC value of 4161.5. Now for the second model, which omits \texttt{agestd}, assigned to the variable model_2
.
<- glm(
model_2 formula = private ~ married + hhincomestd,
data = dfr,
family = binomial(link = "logit")
)summary(model_2)
Call:
glm(formula = private ~ married + hhincomestd, family = binomial(link = "logit"),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.88555 0.07951 -11.138 < 2e-16 ***
married 0.56853 0.09085 6.258 3.90e-10 ***
hhincomestd 0.34737 0.05410 6.421 1.36e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4279.5 on 3205 degrees of freedom
Residual deviance: 4153.5 on 3203 degrees of freedom
AIC: 4159.5
Number of Fisher Scoring iterations: 4
The AIC is now 4159.5, lower than that for the first model. Using only the AIC, we would use the second model.
Note that the slopes in the second model are almost the same as for the first model. The \texttt{agestd} variable seems not to add much to the explanation of the association with the response variable in the model.
The data set contained in the file multi08.csv
has simulated data on three pairs of predictors and outcomes, \texttt{Pred1} and \texttt{Response1}, \texttt{Pred2} and \texttt{Response2}, and \texttt{Pred3} and \texttt{Response3}. All the predictors are numerical variables on the interval 0 to 10. The response variables are all binary variable (1 is the success level, and 0 is the failure level).
The data is imported and assigned to the variable multi01
below.
<- import_dataset("multi08.csv") multi01
Suppose that you want to model the relationship between the numerical predictor \texttt{Pred1} and the binary response \texttt{Response1} using a logistic regression model. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt{Pred1} on the horizontal axis where \texttt{Pred1} has been categorized into k=15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt{Pred1}? Write down the form of the logistic model that best models the association between \texttt{Pred1} and the log odds of response.
Suppose that you want to model the relationship between the numerical predictor \texttt{Pred2} and the binary response \texttt{Response2} using a logistic regression model. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt{Pred2} on the horizontal axis where \texttt{Pred2} has been categorized into 15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt{Pred2}? Write down the form of the logistic model that best models the association between \texttt{Pred2} and the log odds of response.
Suppose that you want to model the relationship between the numerical predictor \texttt {Pred3} and the binary response \texttt {Response3} using logistic regression. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt {Pred3} on the horizontal axis where \texttt {Pred3} has been categorized into 15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt {Pred3} ? Write down the form of the logistic model that best models the association between \texttt {Pred3} and the log odds of response. Suppose that you want to model the relationship between the numerical predictor \texttt{Pred3} and the binary response \texttt{Response3} using a logistic regression model. Construct an empirical logit plot with the log odds of response on the vertical axis and \texttt{Pred3} on the horizontal axis where \texttt{Pred3} has been categorized into 15 intervals with approximately equal numbers of observations in each interval. Does this plot support the contention that there is a linear relationship between the log odds of response and \texttt{Pred3}? Write down the form of the logistic model that best models the association between \texttt{Pred3} and the log odds of response.
Consider \texttt{Pred1} and \texttt{Response1} again. Fit the model that you wrote down in Question 1 and also fit a model that includes the square of \texttt{Pred3} as a predictor. Based on Akaike Information Criteria (AIC), which model is best?
Consider \texttt{Pred2} and \texttt{Response2} again. Fit the model that you wrote down in Question 2 and also fit a model that includes the binarized version of \texttt{Pred2} as a predictor. Based on AIC, which model is best?
Consider \texttt{Pred3} and \texttt{Response3} again. Fit the model that you wrote down in Question 3 and also fit a model that assumes a quadratic relationship between \texttt{Pred3} and the log odds of response. Based on AIC, which model is best?
The lect8_class.csv
data file contains a response variable (encoded with 0 and 1), a numerical explanatory variable, \texttt{Numerical}, and an ordinal explanatory variable, \texttt{Ordinal}. Find the functional relationship between the numerical variable and the response using the empirical logit function. Create a model with a linear relationship and a model with a non-linear relationship and compare them.
The data is imported below and assigned to the variable class_data
.
<- import_dataset("lect8_class.csv") class_data
Reconsider Question 7. Determine if the ordinal explanatory variable, \texttt{Ordinal}, can be used as a numerical variable.