Chapter 6

Juan H Klopper (MD), Do Hee Lee (MPH)

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

In the previous chapter, we considered measures of association between an explanatory variable, X, and a response variable, Y, given an effect modifier or a confounder, Z, using tables of observed data.

Generalized linear models (GLMs) give us a much richer approach, considering multiple explanatory variables.

In this chapter, we explore logistic regression (LR) models with multiple explanatory variables. Aspects of logistic regression models with multiple explanatory variables also pertain to other GLMs, such as linear probability models and log-binomial models, each with its own link function.

The section topics and their associated keywords are as follows.

  • Section 3. Multivariable Logistic Regression (keywords: multivariable logistic regression)
  • Section 4. Inference with Respect to Multivariable Logistic Regression Models (keywords: nested, full model, reduced model, omnibus test)

2 Libraries

Load the required libraries for Chapter 6. If these packages are not already installed, please do so before loading the libraries.

Code
library(rio)
library(dplyr)
library(ggstatsplot)
library(lmtest)
library(epiR)
library(DescTools)
library(MASS)
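
If any of these packages are missing, a one-time install along the following lines should suffice (a minimal sketch; it simply checks which of the listed packages are not yet installed).

Code
# Install any of the required packages that are not yet present
pkgs <- c("rio", "dplyr", "ggstatsplot", "lmtest", "epiR", "DescTools", "MASS")
install.packages(setdiff(pkgs, rownames(installed.packages())))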

3 Multivariable Logistic Regression

LR models with multiple explanatory variables are termed multivariable logistic regression models. We reserve the term multivariate logistic regression models for cases with multiple response variables. More about the use of these terms can be found in this paper.

Multivariable GLMs, in general, offer better predictions and estimations of adjusted associations (as with stratified analysis, but in terms of modeling), and provide a concise approach to assessing effect modification with different types of effect modifiers.

In this chapter, we consider only binary response variable logistic regression (LR) models. The explanatory variables, on the other hand, can be binary, multilevel, or continuous, and we will explore all of these together in a model.

3.1 Components of a Multivariable Logistic Regression Model

The random component and the link function are the same as those introduced in Chapter 4, where we explored simple LR models with a single explanatory variable.

As before, Y, has two levels, a success that we term disease, encoded as 1, and a failure that we term not disease, encoded as 0.

The systematic component is different from our simple LR models. For p different explanatory variables, we have the systematic component in Equation 1, where x_{1}, x_{2}, \ldots , x_{p} are the p individual explanatory variables.

\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p} \tag{1}

Our interest lies with P(Y=1 \, \vert \, x_{1} , x_{2} , \ldots , x_{p}), the probability of success (disease or 1) for the specific values of each of the explanatory variables. We will use vector notation where the components of the vector, \mathbf{x}, are the values of the individual explanatory variables.

The equation for the multivariable LR model is therefore written as shown in Equation 2 for p explanatory variables, where \mathbf{x} is the vector notation for \left( x_{1} , x_{2} , \ldots , x_{p} \right), and X is the design matrix (which contains the value 1 for x_{0}).

\begin{align} \log{(\text{odds})} &= \log{ \left( \frac{P(Y=1 \vert \mathbf{x})}{1 - P(Y=1 \vert \mathbf{x})} \right) } \\ \\ &= \log{ \left( \frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})} \right)} \\ \\ &= \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p} \\ &= X \beta \end{align} \tag{2}

Through simple algebraic manipulation, we have the probability function \pi with respect to \mathbf{x}, shown in Equation 3.

\begin{align} &P(Y=1 \vert \mathbf{x}) = \pi ({\mathbf{x}}) = \frac{e^{\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p}}}{1 + e^{\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p}}} \\ &P \left( Y=1 \, \vert \, \mathbf{x} \right) = \pi \left( \mathbf{x} \right) = \frac{e^{X \beta}}{1+e^{X \beta}} \end{align} \tag{3}
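
Equation 3 is the inverse-logit (logistic) function applied to the linear predictor. As a quick numerical sketch (for an arbitrary value of the linear predictor, not taken from any model in this chapter), the manual calculation agrees with R's built-in plogis function.

Code
# Inverse-logit by hand versus the built-in plogis() for an arbitrary
# linear predictor eta
eta <- 0.5
exp(eta) / (1 + exp(eta))  # 0.6224593
plogis(eta)                # same value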

3.2 Interpretation of the Parameters

As with simple LR models, we interpret the parameters \beta_{i}, where i = \left\{ 0,1,2, \ldots , p \right\}.

The intercept, \beta_{0}, is the log odds of disease for x_{1} = x_{2} = \ldots = x_{p} = 0.

We can also calculate e^{\beta_{0}}, the odds of having disease for x_{1} = x_{2} = \ldots = x_{p} = 0 using Equation 2.

The interpretation of an arbitrary parameter \beta_{j}, where j = \left\{ 1,2, \ldots , p \right\}, for each of the variable types is shown in Table 1.

Table 1: Slopes

x_{j} variable type | \beta_{j} | e^{\beta_{j}}
Binary: \begin{cases} 1 \text{ if exposed} \\ 0 \text{ if not exposed} \end{cases} | log odds ratio (OR) for disease comparing exposed to non-exposed, adjusting for the other predictors | OR for disease comparing exposed to non-exposed, adjusting for the other predictors
Multilevel, where x_{j} is one of the dummy variables | log OR for disease comparing the j^{\text{th}} level to the reference level, adjusting for the other predictors | OR for disease comparing the j^{\text{th}} level to the reference level, adjusting for the other predictors
Numerical | log OR for disease for a one-unit increase in x_{j}, adjusting for the other predictors | OR for disease for a one-unit increase in x_{j}, adjusting for the other predictors

Note the additional phrase in each interpretation that reads adjusting for the other predictors. We will see later that this refers to keeping the other explanatory variables (called predictors in the table) constant or fixed.

The slopes, then, represent log ORs. In Equation 4, we develop an intuition for the log OR in terms of an arbitrary binary explanatory variable x_{j}, where x_{j}=1 for the exposed and x_{j} = 0 for the unexposed, irrespective of the values of the other explanatory variables x_{i}, i \ne j.

\begin{align} &x_{j}=1: \; \log{\left(\frac{\pi (x_{1} , x_{2}, \ldots , x_{j}=1, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2}, \ldots , x_{j}=1, \ldots , x_{p})}\right)} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (1) + \ldots + \beta_{p} x_{p} \\ \\ &x_{j}=0: \; \log{\left(\frac{\pi (x_{1} , x_{2}, \ldots , x_{j}=0, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2}, \ldots , x_{j}=0, \ldots , x_{p})} \right)} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (0) + \ldots + \beta_{p} x_{p} \end{align} \tag{4}

Subtracting the two equations and making use of the laws of logarithms, we have Equation 5, showing that \beta_{j} represents a log OR.

\begin{align} &\log{\left(\frac{\pi (x_{1} , x_{2}, \ldots , x_{j}=1, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2}, \ldots , x_{j}=1, \ldots , x_{p})}\right)} - \log{\left(\frac{\pi (x_{1} , x_{2}, \ldots , x_{j}=0, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2}, \ldots , x_{j}=0, \ldots , x_{p})}\right)} \\ \\ &= \log{\left(\frac{\frac{\pi (x_{1} , x_{2}, \ldots , x_{j}=1, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2}, \ldots , x_{j}=1, \ldots , x_{p})}}{\frac{\pi (x_{1} , x_{2}, \ldots , x_{j}=0, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2}, \ldots , x_{j}=0, \ldots , x_{p})}}\right)} \\ \\ &= \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (1) + \ldots + \beta_{p} x_{p} - \left[ \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (0) + \ldots + \beta_{p} x_{p} \right] \\ \\ &= \beta_{j} \end{align} \tag{5}

The derivation in Equation 5 applies similarly to any of the k-1 levels of a k-level variable (against the level chosen as the reference) and to a difference of c units in a continuous variable, where we would use x_{j} = \nu + c and x_{j} = \nu.
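
In the continuous case, for example, the subtraction in Equation 5 collapses to a single term, so that the OR for a c-unit increase is the antilog of c times the slope:

\log{(\text{OR})} = \beta_{j} (\nu + c) - \beta_{j} \nu = c \beta_{j}, \qquad \text{OR} = e^{c \beta_{j}}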

3.3 Example

The data for our illustrative example is the survey of 3206 individuals on Medicare that we have used before. We import it below from the mus14data_selected.csv comma-separated values file, assigned to the variable dfr.

Code
dfr <- import_dataset("mus14data_selected.csv")

The \texttt{private\_cat} variable indicates if an individual purchased private supplemental health insurance. This is our response variable, Y, with levels 1 indicating the purchase of insurance (or success or disease) and 0 indicating no purchase (or failure or not disease).

The variable \texttt{retire\_cat} is a binary explanatory variable and indicates whether an individual is retired, 1, or not retired, 0.

The variable \texttt{age} is a continuous numerical variable and indicates the age of an individual measured in years.

We transform the \texttt{private\_cat} and the \texttt{hhincome\_cat} variables into factors using the factor function. The levels keyword argument is used to list the levels in order. The first level is used as the reference level, that is, No PI for the response variable and < \$10K for the \texttt{hhincome\_cat} explanatory variable.

Code
dfr$private_cat <- factor(
  dfr$private_cat,
  levels = c("No PI", "PI")
)

dfr$hhincome_cat <- factor(
  dfr$hhincome_cat,
  levels = c("< $10K", "[$10K, $30K)", "[$30K, $50K)", ">= $50K")
)
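
We can verify the factor set-up, since the first level returned by the levels function is the reference level.

Code
# The first level listed is the reference level
levels(dfr$private_cat)   # "No PI" "PI"
levels(dfr$hhincome_cat)  # "< $10K" listed first, so it is the reference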

We want to understand the relationship between whether or not a purchase occurs and retirement status, \texttt{retire\_cat}, annual household income group, \texttt{hhincome\_cat}, and age, \texttt{age}.

With the income category being multilevel, we assign <\$10K as the reference level. Table 2 shows the explanatory variables.

Table 2: Explanatory Variables

Variable | Level | Values
x_{1} | retirement status | \begin{cases} 1 \text{ if retired} \\ 0 \text{ if not retired} \end{cases}
x_{2} | [\$10\text{K}, \$30\text{K}) | \begin{cases} 1 \text{ if income in } [10,30)\text{K} \\ 0 \text{ otherwise} \end{cases}
x_{3} | [\$30\text{K}, \$50\text{K}) | \begin{cases} 1 \text{ if income in } [30,50)\text{K} \\ 0 \text{ otherwise} \end{cases}
x_{4} | \ge \$50\text{K} | \begin{cases} 1 \text{ if income } \ge 50\text{K} \\ 0 \text{ otherwise} \end{cases}
x_{5} | age | age in years

We write the multivariable LR model in Equation 6.

\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5} \tag{6}

It is time to build our model. For this, we use the glm function assigned to the variable full_model. Note the use of the + symbol in the formula, which we use to add the explanatory variables.

Code
full_model <- glm(
  formula = private_cat ~ retire_cat + hhincome_cat + age,
  data = dfr,
  family = binomial(link = "logit")
)

We use the coefficients attribute of the full_model object to return the estimated coefficients, \hat{\beta}_{0}=-1.657,\hat{\beta}_{1}=0.222,\hat{\beta}_{2}=1.930,\hat{\beta}_{3}=2.773,\hat{\beta}_{4}=2.999, and \hat{\beta}_{5}=-0.019. We also use the round function and set the digits argument to 3 to round the results to three decimal places.

Code
round(full_model$coefficients, digits = 3)
             (Intercept)               retire_cat hhincome_cat[$10K, $30K) 
                  -1.657                    0.222                    1.930 
hhincome_cat[$30K, $50K)      hhincome_cat>= $50K                      age 
                   2.773                    2.999                   -0.019 

We write the fitted model in Equation 7.

\begin{align} &\log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = \hat{\beta}_{0} + \hat{\beta_{1}} x_{1} + \hat{\beta}_{2} x_{2} + \hat{\beta}_{3} x_{3} + \hat{\beta}_{4} x_{4} + \hat{\beta}_{5} x_{5} \\ \\ &\log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = -1.657 + 0.222 x_{1} + 1.930 x_{2} + 2.773 x_{3} + 2.999 x_{4} - 0.019 x_{5} \end{align} \tag{7}

Remember that we can use Equation 3 to calculate the probability of private insurance given the five values in the vector \mathbf{x} = (x_{1},x_{2},x_{3},x_{4},x_{5}).

Next, we interpret the coefficients. Note that the values will differ slightly depending on rounding, since R maintains many decimal places during calculation, whereas we round when working by hand.

The intercept, \hat{\beta}_0 = -1.66. This is the log odds for purchasing private insurance, Y=1, when not retired, x_{1}=0, with an annual income of <\$10K (the reference case, x_{2}=x_{3}=x_{4}=0), and with an age of x_{5}=0.

The slope \hat{\beta}_{1}=0.22 is the adjusted log odds ratio of purchasing insurance, Y=1, comparing those who are retired, x_{1}=1, to those who are not, x_{1}=0, adjusting for annual household income and age. Note that the other two explanatory variables are mentioned by name, i.e. annual household income is not broken down into its levels. Adjusting for refers to keeping the values of these variables fixed and changing only x_{1}.

The more meaningful result comes from the antilog, e^{\hat{\beta}_{1}}=e^{0.22} \approx 1.25, which is the corresponding adjusted OR. The odds of purchasing private insurance for those who are retired are 1.25 - 1 = 0.25 = 25\% greater than for those who are not retired, adjusting for household income and age. By adjusting for the other two explanatory variables, we are stating that two people with the same income level and age have an OR = 1.25 with reference to their respective retirement statuses, one being retired, the other not.

The slope, \hat{\beta}_{2}=1.93, is the adjusted log odds ratio for private insurance, Y=1, comparing those with an annual household income of [\$10\text{K},\$30\text{K}) to those with < \$10K, adjusting for retirement status and age. Note that the other dummy variables for household income are held at 0, which is how dummy variables encode one level at a time.

The antilog, e^{\hat{\beta}_{2}} = 6.89, is the corresponding adjusted odds ratio. The odds of private insurance for those with an annual income of [\$10\text{K},\$30\text{K}) are 6.89 - 1 = 5.89 = 589\% greater than the odds for those with an income of < \$10K, adjusting for retirement status and age.

The slope, \hat{\beta}_{3}=2.77, is the adjusted log odds ratio for private insurance, Y=1, comparing those with an annual household income of [\$30\text{K},\$50\text{K}) to those with < \$10K, adjusting for retirement status and age.

The antilog, e^{\hat{\beta}_{3}} = 15.96, is the corresponding adjusted odds ratio. The odds of private insurance for those with an annual income of [\$30\text{K},\$50\text{K}) are 15.96 - 1 = 14.96 = 1496\% greater than the odds for those with an income of < \$10K, adjusting for retirement status and age.

The slope, \hat{\beta}_{4}=3.00, is the adjusted log odds ratio for private insurance, Y=1, comparing those with an annual household income of \ge \$50\text{K} to those with < \$10K, adjusting for retirement status and age.

The antilog, e^{\hat{\beta}_{4}} = 20.09, is the corresponding adjusted odds ratio. The odds of private insurance for those with an annual income of \ge \$50\text{K} are 20.09 - 1 = 19.09 = 1909\% greater than the odds for those with an income of < \$10K, adjusting for retirement status and age.

The slope, \hat{\beta}_{5}=-0.02, is the adjusted log odds ratio for private insurance, Y=1, for a one-year increase in age, adjusting for retirement status and annual household income.

The antilog, e^{\hat{\beta}_{5}}=0.98, is the corresponding adjusted OR. The odds of private insurance decrease by 1 - 0.98 = 0.02 = 2\% for a one-year increase in age, adjusting for retirement status and household income (that is, two people with the same retirement status and income, differing in age by a single year).

The logistic regression model has allowed us to understand the association between the explanatory variables and the response variable. We can also use LR models for prediction: given a person's values for the explanatory variables, we can predict the probability of private insurance, Y=1.

As an example, we consider a 75-year-old person who is retired, with a household income of \ge \$50K. We convert these values to complete our vector, \mathbf{x}.

We have that x_{1}=1 (retired), x_{2} = x_{3} = 0, x_{4}=1 (household income of \ge \$50K), and x_{5}=75. In the code chunk below, we use the rounded coefficient values and Equation 3 to calculate the probability of private insurance for this person.

Code
exp(-1.66+0.22+3-(0.02*75)) / (1 + exp(-1.66+0.22+3-(0.02*75)))
[1] 0.5149955

We see a probability of approximately 51\% as the solution to P(Y=1 \, \vert \, \mathbf{x}) for this specific person.
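
Rather than substituting rounded coefficients by hand, we can let R do the work with the predict function. A minimal sketch follows; the data frame new_person is our own construction for this hypothetical individual, and we assume \texttt{retire\_cat} is stored as the numeric values 0 and 1, as the single coefficient name retire_cat in the model output suggests. The result will differ slightly from the hand calculation because of rounding.

Code
# Predicted probability of private insurance for the same person,
# using the full-precision coefficients stored in the model object
new_person <- data.frame(
  retire_cat = 1,  # retired
  hhincome_cat = factor(">= $50K", levels = levels(dfr$hhincome_cat)),
  age = 75
)
predict(full_model, newdata = new_person, type = "response")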

Now that we can calculate and interpret the estimated coefficients, we can explore inference with respect to these coefficients in terms of confidence intervals and hypothesis testing.

4 Inference with Respect to Multivariable Logistic Regression Models

4.1 Confidence Intervals

For large enough samples, the sampling distribution of each estimated coefficient is approximately normal, written as in Equation 8, where \sigma_{\hat{\beta}_{j}} is the standard error of \hat{\beta}_{j} for j = \left\{ 0, 1, \ldots , p \right\}.

\hat{\beta}_{j} \sim \text{N} (\beta_{j} , \sigma_{\hat{\beta}_{j}}^{2}) \tag{8}

In Equation 8, we have a normal distribution with mean \beta_{j} and variance, \sigma_{\hat{\beta}_{j}}^{2}. As a result, we can construct Wald confidence intervals (CIs) for the coefficients, shown in Equation 9, where \widehat{\text{SE}}(\hat{\beta}_{j}) is the standard error of \hat{\beta}_{j}.

100 (1 - \alpha) \text{\% Wald CI for } \beta_{j} : \; \hat{\beta}_{j} \pm z_{\frac{\alpha}{2}} \widehat{\text{SE}}(\hat{\beta}_{j}) \tag{9}

The confidence coefficient, z_{\frac{\alpha}{2}}, for \alpha=0.05, is calculated in the code chunk below, using the qnorm function.

Code
z_crit <- qnorm(0.975)
z_crit
[1] 1.959964

We note in the table of results for the illustrative example in the previous section that \hat{\beta}_{1} = 0.222 (retirement), with \widehat{\text{SE}}(\hat{\beta}_{1}) = 0.084. We use Equation 9 to calculate the lower and upper bounds, as shown in Equation 10.

\hat{\beta}_{1} \pm z_{\frac{\alpha}{2}} \widehat{\text{SE}}(\hat{\beta}_{1}) = 0.222 \pm 1.96 (0.084) = (0.057, 0.387) \tag{10}

Code
lower <- 0.222 - z_crit * 0.084
upper <- 0.222 + z_crit * 0.084

lower
[1] 0.05736303
Code
upper
[1] 0.386637

Taking the antilogs of the two values, we have the confidence interval of the adjusted OR for x_{1} (retirement).

Code
exp(lower)
[1] 1.05904
Code
exp(upper)
[1] 1.472022

In the illustrative example, we saw that the adjusted OR for private insurance was 1.25, comparing retired people to non-retired people, adjusting for household income and age. We have the confidence interval for this OR as 1.06 - 1.47. The CI does not contain 1, the value under the null hypothesis, and we can conclude that the odds of purchase are indeed greater for retired people compared to non-retired people, adjusting for household income and age.

Indeed, we are 95% confident that the true adjusted OR is between 1.06 and 1.47.
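
Instead of typing the estimate and standard error by hand, we can pull them from the model object; a minimal sketch that reproduces the interval above.

Code
# Wald CI for the adjusted OR of retirement, from the model object
est <- coef(summary(full_model))["retire_cat", "Estimate"]
se <- coef(summary(full_model))["retire_cat", "Std. Error"]
exp(est + c(-1, 1) * qnorm(0.975) * se)  # approximately 1.06 to 1.47

This is the same calculation that the confint.default function, used later in this section, performs for every coefficient.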

The confint function takes a LR model as argument and returns the confidence intervals for each term in the model, with the confidence level specified by the level argument (0.95 by default). Note from the message in the output that these are profile likelihood intervals rather than Wald intervals.

Code
confint(full_model)
Waiting for profiling to be done...
                               2.5 %       97.5 %
(Intercept)              -3.14729115 -0.169970191
retire_cat                0.05682768  0.388073868
hhincome_cat[$10K, $30K)  1.50207594  2.407110503
hhincome_cat[$30K, $50K)  2.33859127  3.255684008
hhincome_cat>= $50K       2.56861885  3.478466289
age                      -0.04119060  0.003115709

The result shows the confidence intervals, but on the log scale. We can pass the result of the confint function as argument to the exp function to return the confidence intervals for the adjusted odds ratios themselves.

Code
exp(confint(full_model))
Waiting for profiling to be done...
                               2.5 %    97.5 %
(Intercept)               0.04296836  0.843690
retire_cat                1.05847340  1.474139
hhincome_cat[$10K, $30K)  4.49100244 11.101836
hhincome_cat[$30K, $50K) 10.36662247 25.937350
hhincome_cat>= $50K      13.04779100 32.409976
age                       0.95964620  1.003121

Next up, we consider hypothesis testing.

4.2 Hypothesis Testing

We start by considering the omnibus test, which is a likelihood ratio test (LRT) for hypothesis testing regarding our model overall.

Hypothesis testing depends on our research question. If we want to consider the association between having private medical insurance and retirement status (for which we use x_{1} when we created the model), adjusting for household income and age, we have the null hypothesis and alternative hypothesis as shown in Equation 11.

\begin{align} &H_{0}: \; \beta_{1} = 0 \\ &H_{1}: \; \beta_{1} \ne 0 \end{align} \tag{11}

To test whether having private medical insurance is associated with annual household income (dummy variables x_{2},x_{3},x_{4}), adjusting for retirement status and age, we have the null and alternative hypothesis in Equation 12.

\begin{align} &H_{0}: \; \beta_{2} = \beta_{3} = \beta_{4} = 0 \\ &H_{1}: \; \text{any one or more of } \beta_{2} , \beta_{3} , \beta_{4} \ne 0 \end{align} \tag{12}

If we want to test whether private medical insurance is jointly associated with retirement status (x_{1}), annual household income, (dummy variables x_{2},x_{3},x_{4}), and age (x_{5}), then we have the null hypothesis and alternative hypothesis in Equation 13.

\begin{align} &H_{0}: \; \beta_{1} = \beta_{2} = \beta_{3} = \beta_{4} = \beta_{5} = 0 \\ &H_{1} : \; \text{any one or more of } \beta_{1} , \beta_{2} , \beta_{3} , \beta_{4} , \beta_{5} \ne 0 \end{align} \tag{13}

In general terms, there are two types of null hypotheses for generalized linear model (GLM) parameters.

  • Single parameter, for example H_{0}: \; \beta_{1} = 0
  • Multiple parameters, for example H_{0}: \; \beta_{1} = \beta_{2} = \beta_{3} = 0

Furthermore, we have three types of tests.

  • Likelihood ratio test (LRT), used mainly for multiple-parameter tests
  • Wald test, used mainly for single-parameter tests
  • Score test (not often used and not discussed in this textbook)

Note that software often drives the choice of test.

For large sample sizes, these tests typically lead to the same results. If the results do not concur, it is recommended to use the results from the likelihood ratio test.

We have encountered the LRT before. Here, we use it in the case of multiple parameters, comparing a full model to a reduced model. Before we understand what this means, we have to explore LRTs a bit more. We remember that LRTs are based on the transformation of Equation 14.

\Lambda = \frac{\text{maximum likelihood when parameters satisfy } H_{0}}{\text{maximum likelihood when parameters are unrestricted}} = \frac{L_{0}}{L_{1}} \tag{14}

There is no, or weak, evidence against H_{0} if \Lambda is approximately equal to 1, that is to say, L_{0} \approx L_{1}. There is strong evidence against H_{0} if \Lambda is much less than 1, that is to say, L_{0} is much less than L_{1}.

As before, rather than using \Lambda, we use the test statistic, G^{2}, shown in Equation 15.

G^{2} = -2 \log{\frac{L_{0}}{L_{1}}} = -2 \left( \log{L_{0}} - \log{L_{1}} \right) \tag{15}

Under the null hypothesis, G^{2} follows a \chi^{2} distribution for given degrees of freedom. In general, the degrees of freedom is the number of parameters under H_{1} minus the number of parameters under H_{0}.

Consider model A (with 3 parameters) and model B (with 2 parameters), shown below in Equation 16.

\begin{align} &\text{Model A: } \log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} \\ &\text{Model B: } \log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = \beta_{0} + \beta_{1} x_{1} \end{align} \tag{16}

We state that model B is nested within model A, or that model B is a special case of model A. Note that model B is equivalent to model A with \beta_{2} = 0. In this case, model A is the full model and model B is the reduced model.

We can now state how LRTs are used: we test whether one or more parameters are equal to 0 by comparing a full model to a nested, or reduced, model. The degrees of freedom is then the number of parameters in the full model minus the number of parameters in the reduced model.

As an example, we consider our full model with private insurance status as the response, and retirement status, annual household income, and age, as explanatory variables. This full model is shown in Equation 17.

\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5} \tag{17}

We use a LRT with \alpha = 0.05 to test the null hypothesis that private insurance is independent of retirement status, annual household income, and age, with the null and alternative hypotheses shown in Equation 13.

Our reduced model consists only of the intercept, shown in Equation 18. The null and alternative hypotheses remain those of Equation 13.

\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} \tag{18}

To perform the test, we need to compute Equation 15. We fit the full model to get \log{L_{1}} and we fit the reduced model to get \log{L_{0}}. The G^{2} test statistic is then referred to the \chi^{2} distribution under the null hypothesis. The degrees of freedom will be the number of parameters in the full model, which is 6, minus the number of parameters in the reduced model, which is 1, hence 6-1=5 degrees of freedom.

We have already created the full model above, assigned to the variable full_model. Below, we create the reduced model. Note the 1 in the formula keyword argument, indicating that we are only fitting the intercept.

Code
reduced_model <- glm(
  formula = private_cat ~ 1,
  data = dfr,
  family = binomial(link = "logit")
)

We use the lrtest function from the lmtest library to perform the LRT. The reduced model is passed as first argument and the full model as second argument.

Code
lmtest::lrtest(
  reduced_model,
  full_model
)
Likelihood ratio test

Model 1: private_cat ~ 1
Model 2: private_cat ~ retire_cat + hhincome_cat + age
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -2139.8                         
2   6 -1934.0  5 411.48  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see \log{L_{0}} = -2139.8, \log{L_{1}} = -1934.0, 5 degrees of freedom, G^{2} = -2(\log{L_{0}} - \log{L_{1}}) = 411.48, and a p value of <0.001. We reject the null hypothesis. The data provide evidence that private health insurance status is associated with at least one of retirement status, annual household income, or age.
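
As a check, we can reproduce G^{2} and its p value directly from the fitted log-likelihoods (a minimal sketch; the values match the lrtest output up to rounding).

Code
# G^2 and its p value from the two log-likelihoods
G2 <- as.numeric(2 * (logLik(full_model) - logLik(reduced_model)))
G2                                      # 411.48
pchisq(G2, df = 5, lower.tail = FALSE)  # essentially 0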

We refer to this LRT as an omnibus test (reduced model with only the intercept, against the full model with all explanatory variables). If we reject the null hypothesis given the data, we can move on to the individual explanatory variables (single parameters).

As an example, we consider the logistic model with private insurance status as the response and retirement status, annual household income, and age as explanatory variables. We use a LRT with \alpha = 0.05 to test the null hypothesis that private insurance is independent of age, adjusting for retirement status and annual household income. The reduced model and the null and alternative hypotheses are shown in Equation 19.

\begin{align} &\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} \\ &H_{0}: \; \beta_{5} = 0 \\ &H_{1}: \; \beta_{5} \ne 0 \end{align} \tag{19}

We follow the same procedure as before. Now the degrees of freedom is 6-5=1, with the 5 coming from Equation 19, where the parameters are \beta_{0},\beta_{1},\beta_{2},\beta_{3}, and \beta_{4}. We create this new reduced model below, where we omit the age variable, and assign the model to the variable reduced_model_age.

Code
reduced_model_age <- glm(
  formula  = private_cat ~ retire_cat + hhincome_cat, # Omit age
  data = dfr,
  family = binomial(link = "logit")
)

The lrtest function returns the results.

Code
lmtest::lrtest(
  reduced_model_age,
  full_model
)
Likelihood ratio test

Model 1: private_cat ~ retire_cat + hhincome_cat
Model 2: private_cat ~ retire_cat + hhincome_cat + age
  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   5 -1935.5                       
2   6 -1934.0  1 2.8332    0.09233 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that \log{L_{0}}=-1935.5 with 5 degrees of freedom, \log{L_{1}} = -1934.0 with 6 degrees of freedom, G^{2} = 2.8332 with 6-5=1 degree of freedom, and a p value of 0.09. We fail to reject the null hypothesis. The data do not provide evidence that private health insurance is associated with age, adjusting for retirement status and household income.

Instead of a LRT for an individual coefficient, we can use the Wald test (the test that R reports in its summary output). Although Wald tests can also be used for multiple parameters, we do not cover that case in this textbook.

For an arbitrary coefficient, \beta_{j} (a log odds ratio), we have the null and alternative hypotheses as shown in Equation 20.

\begin{align} &H_{0}: \; \beta_{j} = 0 \\ &H_{1}: \; \beta_{j} \ne 0 \end{align} \tag{20}

The Wald test statistic, shown in Equation 21, approximately follows the standard normal distribution under the null hypothesis.

Z = \frac{\hat{\beta}_{j}}{\widehat{\text{SE}}(\hat{\beta}_{j})} \tag{21}

As an alternative, we can also consider the test statistic Z^{2}, which approximately follows a \chi^{2} distribution with 1 degree of freedom.

The coefficient (or estimate) table shows the standard errors, \widehat{\text{SE}}(\hat{\beta}_{j}), the Z statistics, and the p values. We use the coefficients attribute of the result of calling the summary function with the full model as argument.

Code
round(summary(full_model)$coefficients, digits = 3)
                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                -1.657      0.759  -2.183    0.029
retire_cat                  0.222      0.084   2.629    0.009
hhincome_cat[$10K, $30K)    1.930      0.230   8.397    0.000
hhincome_cat[$30K, $50K)    2.773      0.233  11.905    0.000
hhincome_cat>= $50K         2.999      0.231  12.979    0.000
age                        -0.019      0.011  -1.680    0.093

For the results above, there are six separate Wald tests with null hypothesis H_{0}: \; \beta_{i} = 0 and alternative hypothesis H_{1}: \; \beta_{i} \ne 0, for i =\left\{ 0,1,2,3,4,5 \right\}.

In the case of age (x_{5}), for instance, we have Z=-1.680, with a p value of 0.093, essentially the same as for the LRT that we used above. The data do not provide evidence that private insurance status and age are associated, adjusting for retirement status and annual household income.
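
The p values in the table can be reproduced from the Z statistics; a quick sketch for age, using either the normal form or the equivalent \chi^{2} form.

Code
# Two-sided Wald p value for age from Z, and the equivalent
# chi-squared form with 1 degree of freedom
z <- summary(full_model)$coefficients["age", "z value"]
2 * pnorm(-abs(z))                       # 0.093
pchisq(z^2, df = 1, lower.tail = FALSE)  # same value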

While retirement status and age are single explanatory variables, x_{1} and x_{5} in this case, we have to consider how to approach a multilevel explanatory variable. We do so by starting with a LRT comparing a reduced model without the k-1 dummy variables x_{2}, x_{3}, and x_{4} against the full model. The reduced model and the null and alternative hypotheses are shown in Equation 22.

\begin{align} &\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{5} x_{5} \\ &H_{0}: \; \beta_{2} = \beta_{3} = \beta_{4} = 0 \\ &H_{1}: \; \text{any one or more of } \beta_{2} , \beta_{3}, \beta_{4} \ne 0 \end{align} \tag{22}

We create the reduced model below and assign it to the variable reduced_model_k_level. While we keep writing age as x_{5} for the sake of clarity, in the reduced model age would technically become x_{2}, since the dummy variables are omitted.

Code
reduced_model_k_level <- glm(
  formula = private_cat ~ retire_cat + age,
  data = dfr,
  family = binomial(link = "logit")
)

The lrtest function returns the results.

Code
lmtest::lrtest(
  reduced_model_k_level,
  full_model
)
Likelihood ratio test

Model 1: private_cat ~ retire_cat + age
Model 2: private_cat ~ retire_cat + hhincome_cat + age
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   3 -2129.3                         
2   6 -1934.0  3 390.54  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The LRT test statistic, G^{2} = 390.54, with a p value of < 0.001. We reject the null hypothesis. The data provide evidence that there is an association between private insurance status and annual household income, adjusting for retirement status and age.

Since we have evidence against the null hypothesis, we can continue with the Wald test results from the coefficient table above, which test each specific level of the k-level variable.

Remember that the values shown for \hat{\beta}_{j} are log values. Below, we use the antilog to express useful results. The coef function returns the coefficients, \hat{\beta}_{j}. The confint.default function returns the confidence intervals for a confidence level set by the level argument. We also use the cbind function to print the columns of our two outputs next to each other.

Code
cbind(
  exp(coef(full_model)),
  exp(confint.default(full_model, level = 0.95))
)
                                          2.5 %     97.5 %
(Intercept)               0.1907165  0.04307452  0.8444152
retire_cat                1.2487590  1.05819268  1.4736438
hhincome_cat[$10K, $30K)  6.8865213  4.38931595 10.8044570
hhincome_cat[$30K, $50K) 16.0011499 10.13713177 25.2573216
hhincome_cat>= $50K      20.0585488 12.75374161 31.5472424
age                       0.9811983  0.95971143  1.0031663

For instance, we see that the odds of private insurance are 20.1 times greater for those with an income of \ge \$50K compared to the reference level (< \$10K). The confidence interval does not contain 1.

5 Lab Problems

5.1 Question

Create your own flowchart (or similar map) of the steps required for statistical inference with multivariable logistic regression models.

Below is an example flow chart.

5.2 Question

You are investigating maternal complications following normal vaginal delivery (NVD) using a multivariable logistic regression model. The variables are described in Table 3.

Table 3: Variables
Variable Name Description Levels or Units
\texttt {Complications} Complications following NVD Yes and No
\texttt {Duration} Duration of labor Hours
\texttt {ANC} Ante-natal clinic visit Yes and No

The data set is available in the comma-separated file Complications_Mat.csv, which is imported below and assigned to the variable comp.

Code
comp <- import_dataset("Complications_Mat.csv")

Create a report of summary and comparative summary statistics and data visualization with respect to the explanatory and response variables. Generate a multivariable logistic regression model and use confidence intervals and hypothesis testing to add to the report.

Generate Summary Statistics and Data Visualization

We start with the categorical variables and create a frequency table of the binary explanatory variable, \texttt{ANC}, using the xtabs function.

Code
xtabs(
  ~ ANC,
  data = comp
)
ANC
 No Yes 
 28  28 

We note that there were 28 cases who attended and 28 cases who did not attend the antenatal clinic.

Next, we investigate the binary response variable, \texttt{Complications}, using the xtabs function.

Code
xtabs(
  ~ Complications,
  data = comp
)
Complications
 No Yes 
 32  24 

A total of 32 women did not suffer any complications and 24 did.

The describe function in the psych library returns summary statistics for the \texttt{Duration} (duration of labor) variable.

Code
psych::describe(comp$Duration)
   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
X1    1 56 13.99 4.37   13.9   13.75 3.78 5.8 25.1  19.3  0.5    -0.04 0.58

Now that we have established univariate summary statistics, we consider comparative summary statistics for the explanatory variables, given the two levels of the response variable. These are the statistics that we will visualize.

We start with a contingency table comparing \texttt{ANC} to the two levels of the \texttt{Complications} variable.

Code
xtabs(
  ~ANC + Complications,
  data = comp
)
     Complications
ANC   No Yes
  No   8  20
  Yes 24   4

The stacked bar chart in Figure 1 visualizes the contingency table.

Code
ggstatsplot::ggbarstats(
  comp,
  y = ANC,
  x = Complications,
  bf.message = FALSE, 
  results.subtitle = FALSE,
  palette = "Blues"
)

Figure 1: ANC Frequency per Complication Level

We note a much higher proportion of complications (71%) in those who did not attend the antenatal clinic compared to those who did (14%).

The describeBy function in the psych library provides comparative summary statistics of the \texttt{Duration} variable.

Code
psych::describeBy(comp$Duration, group = comp$Complications)

 Descriptive statistics by group 
group: No
   vars  n  mean   sd median trimmed  mad min  max range  skew kurtosis   se
X1    1 32 12.21 2.96   12.6   12.32 3.85 5.8 17.4  11.6 -0.27    -1.01 0.52
------------------------------------------------------------ 
group: Yes
   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
X1    1 24 16.36 4.85   16.3   16.39 5.34 6.8 25.1  18.3 0.01     -0.8 0.99

We visualize the comparative summary statistics of the continuous numerical variable using a combination box-and-whisker and violin plot in Figure 2. To do this, we use the ggbetweenstats plot in the ggstatsplot library. We set the bf.message argument to FALSE so that Bayesian statistics results are not printed in the plot.

Code
ggstatsplot::ggbetweenstats(
  data = comp,
  x = Complications,
  y = Duration,
  bf.message = FALSE,
  results.subtitle = FALSE,
  palette = "Blues"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Figure 2: Duration of Labor for Two Outcome Classes

The plot shows that the group with complications had a much longer time in labor. The comparative summary statistics above also show a marked difference in time in labor between the two response groups.

Now that we have a general idea of the relationship between the explanatory variables and the response variable, we can generate a logistic regression model to understand the specific associations.

Generate a Multivariable Logistic Regression Model

Below, we transform the data type of the \texttt{ANC} and the \texttt{Complications} variables into factors using the factor function and set the order of the levels such that the success level for the \texttt{ANC} variable is No and for the \texttt{Complications} variable is Yes. Note that, to do so, we place these levels last in the vector that is the value for the levels argument. This is important to state, since we interpret the results of our logistic regression model in terms of these choices.

Code
comp$ANC <- factor(
  comp$ANC,
  levels = c("Yes", "No")
)

comp$Complications <- factor(
  comp$Complications,
  levels = c("No", "Yes")
)

A good approach is to start with an omnibus test. We generate a full model and a model containing only the intercept. We assign these two models to the variables comp_full_model and comp_intercept_model, respectively.

Code
comp_full_model <- glm(
  Complications ~ Duration + ANC,
  data = comp,
  family = binomial(link = "logit")
)

Code
comp_intercept_model <- glm(
  Complications ~ 1,
  data = comp,
  family = binomial(link = "logit")
)

The lrtest function in the lmtest library returns the results of a likelihood ratio test (LRT), comparing the two models. We pass the intercept model as first argument.

Code
lmtest::lrtest(
  comp_intercept_model,
  comp_full_model
)
Likelihood ratio test

Model 1: Complications ~ 1
Model 2: Complications ~ Duration + ANC
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -38.243                         
2   3 -24.975  2 26.536  1.729e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result of the LRT is a small p value that is less than the 5\% significance level, so we reject the null hypothesis that none of the predictors, \texttt {Duration} or \texttt {ANC}, is associated with maternal complications. Below, we print a summary of the full model.

Code
summary(comp_full_model)

Call:
glm(formula = Complications ~ Duration + ANC, family = binomial(link = "logit"), 
    data = comp)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -4.9668     1.6053  -3.094  0.00198 **
Duration      0.2420     0.1090   2.220  0.02642 * 
ANCNo         2.3516     0.7196   3.268  0.00108 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 76.486  on 55  degrees of freedom
Residual deviance: 49.950  on 53  degrees of freedom
AIC: 55.95

Number of Fisher Scoring iterations: 5

We write the model in Equation 23, where x_{1} is the \texttt{Duration} variable and x_{2} is the \texttt{ANC} variable.

\log{\text{odds of complications}} = -4.9668 + 0.2420 x_{1} + 2.3516 x_{2} \tag{23}

The exp function calculates the antilog values of the coefficients. We use the coefficients attribute of the model as the argument, returning results for the intercept, \texttt {Duration}, and \texttt {ANC}.

Code
exp(comp_full_model$coefficients)
 (Intercept)     Duration        ANCNo 
 0.006965362  1.273810751 10.502365248 

We also print the antilogs of the confidence interval bounds for the coefficients using the exp function and the confint.default function.

Code
exp(confint.default(comp_full_model))
                   2.5 %     97.5 %
(Intercept) 0.0002995524  0.1619625
Duration    1.0287563942  1.5772381
ANCNo       2.5628674946 43.0376038

We can state that attendance at an antenatal clinic and the duration of labor are significantly associated with maternal complications (p value < \, 0.01). The odds of a complication are 950\% higher (95\% CI of 156\%-4204\%, p value of 0.001) for those who did not attend the antenatal clinic compared to those who did, adjusting for the duration of labor. The odds of a complication are 27.4\% higher (95\% CI of 2.9\%-57.7\%, p value of 0.026) for each additional hour of labor, adjusting for antenatal clinic visit.
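
The fitted model can also be used for prediction. As an illustration, consider a hypothetical woman with 15 hours of labor who did not attend the antenatal clinic; the values below are our own choice, not taken from the data set.

Code
# Predicted probability of complications: 15 hours of labor, no ANC visit
new_case <- data.frame(
  Duration = 15,
  ANC = factor("No", levels = c("Yes", "No"))
)
predict(comp_full_model, newdata = new_case, type = "response")  # about 0.73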

5.3 Question

You are part of a research team and tasked with creating a research report for presentation. Your team is investigating mortality in general-admission critical care units at the 5% significance level. The explanatory variables under consideration are age at admission, admission white cell count, admission hemoglobin, diabetes mellitus, and class of admission. The variables are summarized in Table 4.

Table 4: Variables
Variable Name Description Levels or Units of Measurement
\texttt {Mortality} Mortality (response) Died vs. Survived
\texttt {Age} Age at admission Age in years
\texttt {WCC} Admission white cell count \times 10^{9} cells/L
\texttt {HB} Hemoglobin g/dL
\texttt {Diabetes} Diabetes mellitus None, Type I, Type II
\texttt {Class} Class of admission Infectious and Non-infectious

The data set is available in the comma-separated file ICU_Mortality.csv. The file is imported below and assigned to the variable lab.

Code
lab <- import_dataset("ICU_Mortality.csv")

After including summary statistics and data visualization, both univariable and comparative with respect to the outcome variable (with commentary), answer the following research questions (in any order).

  1. Is there an association between class of admission and mortality?
  2. Is the presence and type of diabetes an effect modifier or confounder of the association between class of admission and mortality? (Use stratified analysis)
  3. Use a logistic regression model to understand the association between survival and age at admission, admission white cell count, admission hemoglobin, diabetes, and class of admission. Use both confidence intervals and hypothesis testing.

Generate Summary Statistics and Data Visualization

We start our analysis by looking at the frequency of the two levels of the response variable, \texttt{Mortality}, using the xtabs function.

Code
xtabs(
  ~ Mortality,
  data = lab
)
Mortality
    Died Survived 
     104      207 

We note that 104 people died and 207 survived.

The xtabs function is likewise used below to generate frequency tables of the categorical explanatory variables, \texttt{Class} and \texttt{Diabetes}.

Code
xtabs(
  ~ Class,
  data = lab
)
Class
    Infectious Non-infectious 
           150            161 

There were 150 cases admitted with infectious disease and 161 cases admitted with non-infectious disease.

Code
xtabs(
  ~ Diabetes,
  data = lab
)
Diabetes
   None  Type I Type II 
     98      80     133 

A total of 98 cases did not have diabetes, 80 had type I diabetes, and 133 had type II diabetes.

The describe function from the psych library returns the summary statistics of the continuous numerical variables, \texttt{Age}, \texttt{WCC}, and \texttt{HB}.

Code
psych::describe(lab$Age)
   vars   n  mean    sd median trimmed   mad min max range skew kurtosis   se
X1    1 311 56.46 10.97     56   56.39 10.38  23  85    62 0.08    -0.38 0.62
Code
psych::describe(lab$WCC)
   vars   n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
X1    1 311 10.29 4.02     10   10.13 3.85 -0.8 30.8  31.6 0.77      2.3 0.23
Code
psych::describe(lab$HB)
   vars   n  mean   sd median trimmed  mad min  max range  skew kurtosis   se
X1    1 311 13.53 2.46   13.6   13.62 2.37 4.9 20.2  15.3 -0.41     0.44 0.14

Now that we have our univariate summary statistics, we generate comparative summary statistics of all the explanatory variables, using the two levels of the response variable. In each case, we produce a plot to better understand the results.

Code
xtabs(
  ~Class + Mortality,
  data = lab
)
                Mortality
Class            Died Survived
  Infectious       49      101
  Non-infectious   55      106

We visualize the frequency table in Figure 3 and note that the proportion of mortality is almost the same for both classes of admission (33% among those admitted with infectious disease vs. 34% among those admitted with non-infectious disease).

Code
ggstatsplot::ggbarstats(
  lab,
  y = Class,
  x = Mortality,
  bf.message = FALSE,
  results.subtitle = FALSE,
  palette = "Blues"
)

Figure 3: Class of Admission Frequency per Mortality Level

The frequency table with respect to diabetes type follows next. We note that, for every level of diabetes (None, Type I, and Type II), the frequency of death is lower than the frequency of survival.

Code
xtabs(
  ~Diabetes + Mortality,
  data = lab
)
         Mortality
Diabetes  Died Survived
  None      45       53
  Type I    36       44
  Type II   23      110

The visualization is seen in Figure 4.

Code
ggstatsplot::ggbarstats(
  lab,
  y = Diabetes,
  x = Mortality,
  bf.message = FALSE,
  results.subtitle = FALSE,
  palette = "Blues"
)

Figure 4: Diabetes Type Frequency per Mortality Level

There is a markedly higher proportion of survival in those with Type II diabetes compared to those without diabetes and those with Type I diabetes (83% vs. 54% and 55%). Also, the proportions of survival are similar for those who do not have diabetes and those who have Type I diabetes.

Next, we consider the continuous variables with respect to the response variable. In each case, we generate the comparative summary statistics and data visualization. We start with the \texttt{Age} variable.

Code
psych::describeBy(lab$Age, group = lab$Mortality)

 Descriptive statistics by group 
group: Died
   vars   n  mean   sd median trimmed   mad min max range  skew kurtosis   se
X1    1 104 61.12 9.74     61   61.26 11.86  45  77    32 -0.06     -1.3 0.95
------------------------------------------------------------ 
group: Survived
   vars   n  mean    sd median trimmed mad min max range skew kurtosis   se
X1    1 207 54.11 10.83     53   53.79 8.9  23  85    62 0.25     0.11 0.75

In Figure 5, we visualize the age difference between the groups.

Code
ggstatsplot::ggbetweenstats(
  lab,
  y = Age,
  x = Mortality,
  bf.message = FALSE,
  palette = "Blues"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Figure 5: Age Difference per Mortality Level

There is a higher average age in those who died, compared to those who survived (61 vs. 54 years of age). Also, the Welch’s test demonstrates a statistically significant difference in age by mortality level (test statistic of 5.77, p value of <0.001). Next, we look at white cell count on admission.

Code
psych::describeBy(lab$WCC, group = lab$Mortality)

 Descriptive statistics by group 
group: Died
   vars   n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
X1    1 104 12.49 4.96  12.75   12.44 4.67 -0.8 30.8  31.6  0.3     1.29 0.49
------------------------------------------------------------ 
group: Survived
   vars   n mean  sd median trimmed  mad min  max range skew kurtosis  se
X1    1 207 9.18 2.9    9.1    9.18 2.82 1.8 16.1  14.3    0    -0.26 0.2

We visualize the data in Figure 6.

Code
ggstatsplot::ggbetweenstats(
  lab,
  y = WCC,
  x = Mortality,
  bf.message = FALSE,
  palette = "Blues"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Figure 6: WCC Values per Mortality Level

The mean white cell count is higher for those who died, compared to those who survived (12.49 vs. 9.18 \times 10^{9} cells/L). Those who died had a significantly higher admission white cell count (Welch’s test statistic of 6.28, p value of <0.001). Finally, we consider the hemoglobin levels at admission.

Code
psych::describeBy(lab$HB, group = lab$Mortality)

 Descriptive statistics by group 
group: Died
   vars   n  mean   sd median trimmed  mad min  max range  skew kurtosis   se
X1    1 104 12.62 2.87   12.9   12.67 2.82 4.9 20.2  15.3 -0.16    -0.08 0.28
------------------------------------------------------------ 
group: Survived
   vars   n  mean   sd median trimmed  mad min  max range  skew kurtosis   se
X1    1 207 13.98 2.09   13.9   13.99 1.93   7 18.6  11.6 -0.19     0.15 0.15

We visualize the results in Figure 7.

Code
ggstatsplot::ggbetweenstats(
  lab,
  y = HB,
  x = Mortality,
  bf.message = FALSE,
  palette = "Blues"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf

Figure 7: HB Levels per Mortality Level

There is a difference in means for hemoglobin between those who died and those who survived (12.62 vs. 13.98 g/dL). We note significantly lower hemoglobin levels in those who died (Welch's test statistic of -4.30, p value of <0.001).

Assess for the Association Between Class of Admission and Mortality

Now that we have a thorough appreciation of the data, we use table methods and construct models to understand the association between the explanatory variables and the response variable.

To investigate the association between class of admission and mortality, we make a copy of the original tibble object, lab, and assign the copy to the variable lab_table_levels. We transform the \texttt{Mortality} and \texttt{Class} variables into factors and set the order of the levels such that we can use the epi.2by2 function to determine the required association.

Code
lab_table_levels <- lab

lab_table_levels$Mortality <- factor(
  lab_table_levels$Mortality,
  levels = c("Died", "Survived")
)
lab_table_levels$Class <- factor(
  lab_table_levels$Class,
  levels = c("Infectious", "Non-infectious")
)

The table function is used to generate a contingency table.

Code
lab_full_table <- table(lab_table_levels$Class, lab_table_levels$Mortality)
lab_full_table
                
                 Died Survived
  Infectious       49      101
  Non-infectious   55      106
Code
epiR::epi.2by2(lab_full_table)
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           49          101        150     32.67 (25.24 to 40.79)
Exposed -           55          106        161     34.16 (26.88 to 42.04)
Total              104          207        311     33.44 (28.22 to 38.98)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 0.96 (0.70, 1.31)
Inc odds ratio                                 0.94 (0.58, 1.50)
Attrib risk in the exposed *                   -1.49 (-11.98, 8.99)
Attrib fraction in the exposed (%)            -4.58 (-43.19, 23.63)
Attrib risk in the population *                -0.72 (-9.73, 8.29)
Attrib fraction in the population (%)         -2.16 (-18.46, 11.91)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 0.078 Pr>chi2 = 0.780
Fisher exact test that OR = 1: Pr>chi2 = 0.811
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

In line with the original data visualization in Figure 3, there is no association between the class of admission and mortality with an odds ratio of 0.94 (95\% CI 0.58-1.50), with a p value using Pearson’s \chi^{2} test of 0.78.

Next, we investigate if the type of diabetes is an effect modifier of the association between admission class and mortality. We create a stratified table using the table function and assign the table object to the variable lab_stratified_table.

Code
lab_stratified_table <- table(
  lab_table_levels$Class,
  lab_table_levels$Mortality,
  lab_table_levels$Diabetes
)
lab_stratified_table
, ,  = None

                
                 Died Survived
  Infectious       19        9
  Non-infectious   26       44

, ,  = Type I

                
                 Died Survived
  Infectious       20        7
  Non-infectious   16       37

, ,  = Type II

                
                 Died Survived
  Infectious       10       85
  Non-infectious   13       25

We use the WoolfTest function from the DescTools library to conduct Woolf's test comparing the odds ratios across the levels of the \texttt{Diabetes} variable.

Code
DescTools::WoolfTest(lab_stratified_table)

    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  lab_stratified_table
X-squared = 26.723, df = 2, p-value = 1.574e-06

We note that the type of diabetes is an effect modifier of the association between admission class and mortality, so we have to express each stratum-specific measure of association. We start with those without diabetes. First, though, we set the levels of the \texttt{Diabetes} variable in the lab_table_levels copy of the original lab tibble object.

Code
lab_table_levels$Diabetes <- factor(
  lab_table_levels$Diabetes,
  levels = c("None", "Type I", "Type II")
)

We filter for those without diabetes using the pipe operator, %>%, from the magrittr library and the filter function from the dplyr library, creating a tibble object that only contains those without diabetes.

Code
lab_diabetes_none <- lab_table_levels %>% filter(Diabetes == "None")
epiR::epi.2by2(table(lab_diabetes_none$Class, lab_diabetes_none$Mortality))
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           19            9         28     67.86 (47.65 to 84.12)
Exposed -           26           44         70     37.14 (25.89 to 49.52)
Total               45           53         98     45.92 (35.80 to 56.29)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.83 (1.23, 2.72)
Inc odds ratio                                 3.57 (1.41, 9.05)
Attrib risk in the exposed *                   30.71 (10.04, 51.39)
Attrib fraction in the exposed (%)            45.26 (18.56, 63.21)
Attrib risk in the population *                8.78 (-6.24, 23.79)
Attrib fraction in the population (%)         19.11 (3.34, 32.31)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 7.598 Pr>chi2 = 0.006
Fisher exact test that OR = 1: Pr>chi2 = 0.007
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

In this case, we see a significant association between admission class and mortality, with an odds ratio of 3.57 (95\% CI of 1.41-9.05, p value of 0.006). Among those without diabetes, the odds of mortality are 257\% higher in those admitted with an infectious disease compared to those admitted with a non-infectious disease.
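
As an aside, the percentage change in the odds used in these interpretations follows directly from the odds ratio; a quick sketch of the conversion applied here and in the strata below.

Code
(3.57 - 1) * 100  # OR above 1: the odds are 257% higher
(1 - 0.23) * 100  # OR below 1 (as in the Type II stratum below): the odds are 77% lower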

Next, we consider only those with Type I diabetes. We follow the same method as before.

Code
lab_diabetes_TypeI <- lab_table_levels %>% filter(Diabetes == "Type I")
epiR::epi.2by2(table(lab_diabetes_TypeI$Class, lab_diabetes_TypeI$Mortality))
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           20            7         27     74.07 (53.72 to 88.89)
Exposed -           16           37         53     30.19 (18.34 to 44.34)
Total               36           44         80     45.00 (33.85 to 56.53)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 2.45 (1.54, 3.91)
Inc odds ratio                                 6.61 (2.33, 18.72)
Attrib risk in the exposed *                   43.89 (23.25, 64.52)
Attrib fraction in the exposed (%)            59.25 (35.04, 74.43)
Attrib risk in the population *                14.81 (-1.67, 31.29)
Attrib fraction in the population (%)         32.91 (11.41, 49.20)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 13.919 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

In those with Type I diabetes, we see a significant association between class of admission and mortality, with an odds ratio of 6.61 (95\% CI of 2.33-18.72, p value of < 0.001). The odds of mortality are 561\% higher in those admitted with an infectious disease compared to those admitted with a non-infectious disease.

Lastly, we consider those with Type II diabetes.

Code
lab_diabetes_TypeII <- lab_table_levels %>% filter(Diabetes == "Type II")
epiR::epi.2by2(table(lab_diabetes_TypeII$Class, lab_diabetes_TypeII$Mortality))
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           10           85         95      10.53 (5.16 to 18.51)
Exposed -           13           25         38     34.21 (19.63 to 51.35)
Total               23          110        133     17.29 (11.29 to 24.81)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 0.31 (0.15, 0.64)
Inc odds ratio                                 0.23 (0.09, 0.58)
Attrib risk in the exposed *                   -23.68 (-39.98, -7.39)
Attrib fraction in the exposed (%)            -225.00 (-576.81, -56.06)
Attrib risk in the population *                -16.92 (-33.31, -0.52)
Attrib fraction in the population (%)         -97.83 (-176.78, -41.40)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 10.645 Pr>chi2 = 0.001
Fisher exact test that OR = 1: Pr>chi2 = 0.002
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

In those with Type II diabetes, we also see a significant association between class of admission and mortality, with an odds ratio of 0.23 (95\% CI of 0.09-0.58, p value of 0.001). Here, the odds of mortality are 77\% lower in those admitted with an infectious disease compared to those admitted with a non-infectious disease.

To fit a logistic regression model, we make a new copy of the original lab tibble object containing our data and assign it to the variable lab_glm. We set the levels of the categorical variables explicitly. Remember that for a generalized linear model, we place the reference level first in the vector passed to the levels argument of the factor function.

Code
lab_glm <- lab

lab_glm$Mortality <- factor(
  lab_glm$Mortality,
  levels = c("Survived", "Died")
)
lab_glm$Diabetes <- factor(
  lab_glm$Diabetes,
  levels = c("None", "Type I", "Type II")
)
lab_glm$Class <- factor(
  lab_glm$Class,
  levels = c("Non-infectious", "Infectious")
)
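
We can verify the reference levels, which R lists first.

Code
levels(lab_glm$Mortality)  # "Survived" is listed first and is the reference level
levels(lab_glm$Class)      # "Non-infectious" is the reference level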

Conduct an Omnibus Test

We start with an omnibus test, comparing a full model (assigned to the variable model_full) with a model containing only the intercept (assigned to the variable model_min). As before, we will use the lrtest function from the lmtest library.

Code
model_full <- glm(
  formula = Mortality ~ Age + WCC + HB + Diabetes + Class,
  data = lab_glm,
  family = binomial(link = "logit")
)

We print a summary of the full model.

Code
summary(model_full)

Call:
glm(formula = Mortality ~ Age + WCC + HB + Diabetes + Class, 
    family = binomial(link = "logit"), data = lab_glm)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -5.06859    1.38372  -3.663 0.000249 ***
Age              0.07656    0.01471   5.205 1.95e-07 ***
WCC              0.23330    0.04350   5.363 8.20e-08 ***
HB              -0.16671    0.06320  -2.638 0.008348 ** 
DiabetesType I  -0.01249    0.37333  -0.033 0.973305    
DiabetesType II -1.38006    0.37454  -3.685 0.000229 ***
ClassInfectious  0.50508    0.32514   1.553 0.120323    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 396.37  on 310  degrees of freedom
Residual deviance: 283.04  on 304  degrees of freedom
AIC: 297.04

Number of Fisher Scoring iterations: 5
We now fit the intercept-only model.

Code
model_min <- glm(
  formula = Mortality ~ 1,
  data = lab_glm,
  family = binomial(link = "logit")
)
We compare the two models using a likelihood ratio test (LRT).

Code
lmtest::lrtest(
  model_min,
  model_full
)
Likelihood ratio test

Model 1: Mortality ~ 1
Model 2: Mortality ~ Age + WCC + HB + Diabetes + Class
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -198.19                         
2   7 -141.52  6 113.34  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result of the LRT shows a significant association between the set of explanatory variables and the response variable (\chi^{2} statistic of 113.34, p value of <0.001). Thus, at least one of the predictors is effective in the model for predicting the outcome of \texttt{Mortality}.
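
The same statistic can be recovered from the deviances in the model summary above: the omnibus \chi^{2} is the null deviance minus the residual deviance, on degrees of freedom equal to the number of estimated slope coefficients.

Code
# The omnibus chi-squared equals the drop in deviance from the null model
model_full$null.deviance - model_full$deviance  # 113.34 (up to rounding)
model_full$df.null - model_full$df.residual     # 6 degrees of freedom
pchisq(113.34, df = 6, lower.tail = FALSE)      # p value well below 0.001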

Assess the Statistical Significance of Each Predictor in the Model

Consider the multilevel \texttt{Diabetes} variable. We want to assess whether the \texttt{Diabetes} variable as a whole is statistically significant in the model, not only whether each level of \texttt{Diabetes} differs from its reference level. We therefore construct a reduced model without \texttt{Diabetes} (assigned to the variable model_diabetes) to compare to the full model using a LRT.

Code
model_diabetes <- glm(
  formula = Mortality ~ Age + WCC + HB + Class,
  data = lab_glm,
  family = binomial(link = "logit")
)
We conduct the LRT.

Code
lmtest::lrtest(
  model_diabetes,
  model_full
)
Likelihood ratio test

Model 1: Mortality ~ Age + WCC + HB + Class
Model 2: Mortality ~ Age + WCC + HB + Diabetes + Class
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   5 -150.28                         
2   7 -141.52  2 17.531   0.000156 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of the LRT demonstrate that \texttt{Diabetes} is effective in the model for predicting the outcome of \texttt{Mortality} (\chi^{2} test statistic of 17.531, p value of <0.001).
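
As a check, base R's anova function performs the same comparison and should agree with the lmtest output.

Code
anova(model_diabetes, model_full, test = "LRT")  # equivalent likelihood ratio test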

Next, we generate odds ratios and corresponding 95\% CIs for each of the predictors in the model to evaluate whether each predictor is statistically significantly associated with \texttt{Mortality} and, if so, to report the effect size.

Code
exp(model_full$coefficients)
    (Intercept)             Age             WCC              HB  DiabetesType I 
    0.006291286     1.079571468     1.262755893     0.846445060     0.987584440 
DiabetesType II ClassInfectious 
    0.251563267     1.657121567 
We compute Wald 95\% confidence intervals for the odds ratios by exponentiating the limits on the log odds scale.

Code
exp(confint.default(model_full))
                       2.5 %     97.5 %
(Intercept)     0.0004177506 0.09474618
Age             1.0488883632 1.11115214
WCC             1.1595500638 1.37514756
HB              0.7478248781 0.95807088
DiabetesType I  0.4751064474 2.05285159
DiabetesType II 0.1207367745 0.52414914
ClassInfectious 0.8761796461 3.13411970

We interpret each explanatory variable in turn, using the exponentiated coefficients and their 95\% CIs.

\texttt{Age}: The odds of mortality increase by a multiplicative factor of 1.08 for each additional year of age, accounting for the other predictors. Since the 95\% CI of 1.05-1.11 excludes the null OR of 1, we conclude that \texttt{Age} is statistically significantly associated with \texttt{Mortality}, adjusting for the additional predictors.

\texttt{WCC}: The odds of mortality increase by a multiplicative factor of 1.26 for every unit (\times 10^{9} cells/L) increase in white cell count, accounting for the other predictors. Since the 95\% CI of 1.16-1.38 excludes the null OR of 1, we conclude that \texttt{WCC} is statistically significantly associated with \texttt{Mortality}, adjusting for the additional predictors.

\texttt{HB}: The odds of mortality decrease by a multiplicative factor of 0.85 for every unit (g/dL) increase in hemoglobin, accounting for the other predictors. Since the 95\% CI of 0.75-0.96 excludes the null OR of 1, we conclude that \texttt{HB} is statistically significantly associated with \texttt{Mortality}, adjusting for the additional predictors.

\texttt{Diabetes} (Type I): Since the 95\% CI of 0.48-2.05 includes the null OR, we conclude that Type I diabetes, compared to no diabetes, is not statistically significantly associated with \texttt{Mortality}, adjusting for the additional predictors.

\texttt{Diabetes} (Type II): The odds of mortality decrease by 74.84\% for those with Type II diabetes compared to those without diabetes, accounting for the other predictors. Since the 95\% CI of 0.12-0.52 excludes the null OR, we conclude that Type II diabetes, compared to no diabetes, is statistically significantly associated with \texttt{Mortality}, accounting for the additional predictors.

\texttt{Class}: Since the 95\% CI of 0.88-3.13 includes the null OR, we conclude that admission with an infectious disease, compared to admission with a non-infectious disease, is not statistically significantly associated with \texttt{Mortality}, accounting for the additional predictors.
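
For reporting, the adjusted odds ratios and their Wald limits can be collected into a single table; a minimal sketch using the model object created above.

Code
# Gather adjusted odds ratios and Wald 95% limits into one table for reporting
or_table <- cbind(
  OR = exp(coef(model_full)),
  exp(confint.default(model_full))
)
round(or_table, 2)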