This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In the previous three sections, we used frequency tables to describe data for categorical variables and to investigate the association between two categorical variables.
Another method of approaching analysis is to use logistic regression models. Logistic regression models have several advantages over table methods. They allow us to consider more than one explanatory variable, as opposed to the single exposure variable used in the table methods that we have seen thus far. We can introduce nominal and continuous explanatory variables, and we can make use of hypothesis testing with respect to the measures of association that logistic regression provides. Logistic regression models can also investigate confounding and interaction (effect modification). Lastly, we can use logistic regression to create predictions of the success outcome class, given values for the explanatory variable(s).
Prediction is of particular interest to the modern world of machine learning. In prediction, our interest shifts away from the calculations involved in inference (confidence intervals and hypothesis testing). Instead we want to calculate an estimated numerical outcome or the estimated probability of a categorical outcome, given a variety of explanatory variables. In prediction analysis, we are not interested in the association between the explanatory variables and the outcome or target variable.
In this notebook, we introduce a broad class of regression models named generalized linear models (GLMs). While linear models have numerical response variables, the idea behind GLMs is to broaden the use of regression models to include categorical, count, rate, and other outcome types. We will concentrate on models with a binary nominal outcome variable. These models are referred to as logistic regression models.
We start by exploring the notation and some basic concepts.
Load the required libraries for Chapter 4. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
library(epiR)
library(psych)
library(ggstatsplot)
library(DescTools)
library(car)
The outcome variable in a GLM is often referred to as the response variable and is usually denoted as Y. In mathematical textbooks, this is expressed as a vector (for single variable responses) and written as a boldface \mathbf{y}. Y is our disease variable from the table methods that we have seen. For a binary response variable, we will still have two levels or classes, success or D and failure or \bar{D}.
The explanatory variables are usually denoted by x_{1} , \, x_{2} , \, \ldots \, x_{p}, where the subscript is specific for each of p explanatory variables. In mathematical notation, we would express the values as a matrix X_{np}, where the column vectors in the matrix are the p individual explanatory variables and n is the sample size. Instead of a single binary or multi-level exposure variable, we can now explore a richer set of explanatory or exposure variables.
We also use the notation \left\{ y_{i} \, \vert \, x_{1i} , x_{2i} , \ldots , x_{pi} \right\}, where i = \left\{ 1,2,3, \ldots , n\right\} for the sample size n and where p is the number of explanatory variables. Here y_{i} is the i^{\text{th}} outcome, given the i^{\text{th}} observation's values for each of the p explanatory variables.
In broad terms, and for a categorical response variable, we are then interested in P \left( Y = y_{i} \, \vert \, \mathbf{x}_{i} \right), the probability that the i^{\text{th}} outcome takes on the value y_{i}, given, as mentioned, the values x_{1i} , x_{2i} , \ldots , x_{pi} (one for each of the p explanatory variables). Our interest will lie with assigning one of the response levels or classes as the success level, encoded with a 1 and consider the probability P(Y=1 \, \vert \mathbf{x}).
There are two main parts to a GLM, a random component and a systematic component, with a third part that links the random and systematic components.
The random component identifies the distribution of the response variable Y. In a linear regression model Y follows (or is assumed to follow) a normal distribution with Y \sim N(\mu , \sigma^{2}). Y might be a binary categorical variable with Y \sim \text{Bin}(1, \pi), or Y might be a rate (counts per time interval) with Y \sim \text{Poisson}(\mu).
The systematic component regards the explanatory variables and their coefficients, shown in Equation 1.
\beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{1}
The systematic component is the mathematical form that relates the explanatory variables to the response variable (as in linear regression) or to some function of the response variable. Note that (1) is a linear form (which is not always the case and may have to be adjusted). Here, the coefficients \beta_{0} , \beta_{1} , \ldots , \beta_{p} enter linearly, allowing us to consider unit increases in the explanatory variables and how they alter the calculated (predicted or estimated) response variable (or at least a function of it).
The link function, as the name suggests, links the random and systematic components of a GLM. If we consider the expected value of the response variable E(Y) to be \mu (the mean of the probability distribution of Y), then link function g relates \mu to the linear predictor (the systematic component), as shown in Equation 2.
g(\mu_{Y}) = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{2}
In a linear regression model, Y is a continuous numerical variable assumed to have a normal distribution Y \sim \text{N}(\mu , \sigma^{2}). Linear regression models have a systematic component as in (1). The link function is the identity function g(\mu) = \mu.
In the case of a binary response variable, we consider a logistic regression model. Here the response variable Y \sim \text{Bin}(1, \pi). The systematic component is still as in Equation 1. The link function is the logit function, shown in Equation 3.
g(\mu) = \log{\frac{\mu}{1 - \mu}} \tag{3}
Since we are considering individual (independent) Bernoulli trials, we have that E(Y) = 1 \times \pi, where \pi is the probability of success, and we can write Equation 4 below.
\log{\frac{\pi}{1 - \pi}} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{4}
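As an aside, R provides the logit function and its inverse as qlogis and plogis, which saves us from typing the exponential expressions by hand later on. A quick sketch, where the probability value 0.25 is an arbitrary choice:

p <- 0.25
qlogis(p)  # the logit, log(p / (1 - p))
plogis(qlogis(p))  # the inverse logit recovers p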
For a count response (such as incidence per time interval), the Poisson log-linear GLM is often used. Here, the response variable follows a Poisson distribution, Y \sim \text{Poisson}(\mu), with the systematic component still as in Equation 1. The link function is the log-link function shown in Equation 5.
g(\mu) = \log{\mu} \tag{5}
A log-linear GLM can then be written as in Equation 6.
\log{\mu} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{6}
As a summary, we have Table 1, where the random component is Y \sim \text{Bin}(1 , \pi) (\pi is the population probability of success, i.e. Y is a binary variable). The systematic component is then \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p}.
Link name | Function | Model |
---|---|---|
Identity link | g(\mu) = \mu | Linear probability model |
Log link | g(\mu) = \log{\mu} | Log binomial model |
Logistic (logit) link | g(\mu) = \log{\frac{\pi}{1 - \pi}} | Logistic model |
In a GLM, \pi is a function of the predictors. If we have a single predictor, x, we write \pi(x) = P(Y=1 \; \vert \; x) (the probability of observing a success given a value for x). Note that we are only considering binary (dichotomous) outcome variables.
We have seen that the random component of a simple logistic regression model (the response variable) has a binomial distribution, Y \sim \text{Bin}(1, \pi). Our aim is to estimate the probability of success, \hat{\pi} = \hat{P}(Y=1), for this response variable. Hence, our probability estimate of failure is 1-\hat{\pi} = \hat{P}(Y=0). Remember that the mean (expected value) of Y \sim \text{Bin}(n, \pi) is \mu = n \pi, where \pi is the population proportion of success. In this case n=1 and \mu = \pi.
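To build some intuition, below is a minimal simulation sketch; the seed, the number of trials, and the true \pi = 0.3 are arbitrary choices. The sample mean of independent Bernoulli draws estimates \pi.

set.seed(42)  # arbitrary seed for reproducibility
y <- rbinom(n = 10000, size = 1, prob = 0.3)  # 10000 independent Bernoulli trials
mean(y)  # sample proportion of successes, close to the true pi of 0.3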
For now, our single explanatory variable, x, is likewise binary in nature, which we encode with a 0 for control (or no treatment) and a 1 for exposure (or treatment), such that our systematic component is \beta_{0} + \beta_{1} x, where x = \left\{ 0,1 \right\}.
In many cases the relationship between the explanatory variable(s) and the response variable is not linear. The logit link function introduces the non-linearity which models many processes in nature. The logit link function g(\pi) was shown in Equation 3.
We can find an equation for our probability of success, \pi, using simple algebra and the anti-log function, shown in Equation 7.
\begin{align} &\log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1}x \\ &\frac{\pi}{1 - \pi} = e^{\beta_{0} + \beta_{1}x} \\ &\pi = (1 - \pi) e^{\beta_{0} + \beta_{1}x} \\ &\pi = e^{\beta_{0} + \beta_{1}x} - \pi e^{\beta_{0} + \beta_{1}x} \\ &\pi + \pi e^{\beta_{0} + \beta_{1}x} = e^{\beta_{0} + \beta_{1}x} \\ &\pi(1 + e^{\beta_{0} + \beta_{1}x}) = e^{\beta_{0} + \beta_{1}x} \\ &\pi = \frac{e^{\beta_{0} + \beta_{1}x}}{1 + e^{\beta_{0} + \beta_{1}x}} \end{align} \tag{7}
In Equation 7, we have a function for the probability of success, \pi (x) (given the explanatory variable x). The right-hand side expression of Equation 7 ensures that \pi(x) is bounded between 0 and 1, exactly what we require for a probability.
In Figure 1, we set arbitrary values for \beta_{0} and \beta_{1} to view an example of a logistic model.
inputs <- seq(-3, 3, by = 0.1)  # values for the explanatory variable
beta_0 <- 0.1  # arbitrary intercept
beta_1 <- 1  # arbitrary slope
c <- exp(beta_0 + beta_1 * inputs)

plot(
  inputs,
  c / (1 + c),
  type = "l",
  main = "Logistic model",
  xlab = "Explanatory variable",
  ylab = "Probability of success",
  las = 1,
  panel.first = grid(lty = 1)
)
Given a value for the explanatory variable in Figure 1 (horizontal axis), the model calculates the probability of the success class (vertical axis).
To use and understand a logistic regression model with a binary response variable and a binary explanatory variable, we import a data set saved as a comma-separated values file, ChronicConditionsPrivateInsurance.csv.

dfr <- import_dataset("ChronicConditionsPrivateInsurance.csv")
The data is taken from a 2010 survey by Cameron and Trivedi. There are 3206 subjects in the study. We want to explore the relationship between supplementary insurance and the presence of chronic disease. More specifically, we are interested in the presence of chronic disease as an explanation for subjects purchasing supplementary insurance.
Both variables are binary in nature, with the following levels shown in Table 2.
Type | Name | Levels | Generic levels |
---|---|---|---|
Explanatory | bin | \begin{cases} x=0 \; \text{no chronic disease} \\ x=1 \; \text{chronic disease} \end{cases} | \begin{cases} x=0 \text{ control} \\ x=1 \text{ treatment} \end{cases} |
Response | private | \begin{cases} Y=0 \; \; \text{no supplemental insurance} \\ Y=1 \; \; \text{supplemental insurance} \end{cases} | \begin{cases} Y=0 \text{ failure} \\ Y=1 \text{ success} \end{cases} |
It is always important to remember that we must assign the levels of our variables to the encoded values for our response and our explanatory variables. In the case of binary variables, the task is simple. We will explore multi-level explanatory variables later. In our illustrative example, subjects either have no chronic disease, x=0, or chronic disease is present, x=1. We choose success as having purchased supplemental health insurance, Y=1. We can, therefore, write Equation 8.
\pi(x) = P(Y=1 \, \vert \, x) \tag{8}
Below, we see the table of observed data in Table 3. Since both levels (classes) in the \texttt{private} and \texttt{bin} variables are coded with a 0 and a 1 respectively, we generate two new variables, \texttt{PrivateInsurance} and \texttt{ChronicDisease}, using the factor function. The factor function allows us to create factor (categorical) variables, set the order of the classes with the levels keyword argument, and set labels for the two levels with the labels keyword argument.
dfr$PrivateInsurance <- factor(dfr$private, levels = c(0, 1), labels = c("No", "Yes"))
dfr$ChronicDisease <- factor(dfr$bin, levels = c(0, 1), labels = c("No", "Yes"))
crosstable::crosstable(
  dfr,
  ChronicDisease,
  by = PrivateInsurance,
  total = "both",
  percent_pattern = "{n}"
) %>%
  crosstable::as_flextable(
    keep_id = FALSE
  )
label | variable | PrivateInsurance: No | PrivateInsurance: Yes | Total |
---|---|---|---|---|
ChronicDisease | No | 246 | 140 | 386 (12.04%) |
ChronicDisease | Yes | 1719 | 1101 | 2820 (87.96%) |
Total | | 1965 (61.29%) | 1241 (38.71%) | 3206 (100.00%) |
From the observed values in Table 3, we can calculate risks and ratios, especially using the epi.2by2 function in the epiR library, as we have seen before. As revision, we implement the epi.2by2 function in the code chunk below. Using modelling will ultimately be a more powerful approach, though. Remember that the function requires a certain order for the two levels of each of the exposure and disease variables. We use indexing to enforce the order.
epiR::epi.2by2(table(dfr$ChronicDisease, dfr$PrivateInsurance)[2:1, 2:1])
Outcome + Outcome - Total Inc risk *
Exposed + 1101 1719 2820 39.04 (37.24 to 40.87)
Exposed - 140 246 386 36.27 (31.47 to 41.29)
Total 1241 1965 3206 38.71 (37.02 to 40.42)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 1.08 (0.94, 1.24)
Inc odds ratio 1.13 (0.90, 1.40)
Attrib risk in the exposed * 2.77 (-2.35, 7.90)
Attrib fraction in the exposed (%) 7.10 (-6.86, 19.24)
Attrib risk in the population * 2.44 (-2.64, 7.52)
Attrib fraction in the population (%) 6.30 (-6.09, 17.25)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 1.101 Pr>chi2 = 0.294
Fisher exact test that OR = 1: Pr>chi2 = 0.316
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
The results show that the risk of purchasing private supplementary insurance is 2.77 percentage points greater in those with chronic disease compared to those without chronic disease (95% confidence interval -2.35 to 7.9 percentage points).
The glm function

The glm function allows us to create a generalized linear model (GLM) writing very little code. We use it in the code chunk below and assign the result to the variable model.
model <- glm(
  private ~ bin,
  data = dfr,
  family = binomial(link = "logit")
)
The glm function takes a formula as its first argument. The formula private ~ bin indicates that we want to consider the \texttt{private} outcome variable given the \texttt{bin} variable. Both of these variables are encoded with 0 and 1 for their two respective levels. This is ideal, as the exposure levels, E and \bar{E}, and the two disease levels, D and \bar{D}, are already contained in the numerical values 0 and 1. The second argument, data, is assigned our data object, dfr. The last argument is the family keyword argument. We set this to binomial(link = "logit"). This instructs the glm function that the outcome, Y, follows a binomial distribution and that we specify the logit link function.
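As an aside, the logit is the default link for the binomial family, so an equivalent, shorter call is possible. A small sketch, where model_alt is a hypothetical name used only for this comparison:

model_alt <- glm(private ~ bin, data = dfr, family = binomial)  # logit is the default link
all.equal(coef(model), coef(model_alt))  # both calls produce identical coefficients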
We view the coefficients below using the attribute coefficients.
summary(model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5636891 0.1058674 -5.324485 1.012396e-07
bin 0.1181652 0.1126849 1.048634 2.943467e-01
Given the data, we can now explore the estimated probability of success, \hat{\pi} using the coefficients and the logit function that links \hat{\pi} and the linear component \beta_{0} + \beta_{1}x, shown in Equation 9, which represents our model. Note that the values for the coefficient estimates, \hat{\beta}_{0} and \hat{\beta}_{1} are from the Estimate column in the results above. The former is the intercept and the second is the slope.
\begin{align} &\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\ \\ &\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = -0.5637 + 0.1182 x \end{align} \tag{9}
How do we interpret the model in Equation 9? We start with x=0, an absence of chronic health conditions with respect to the explanatory variable (the exposure variable).
In Equation 10 we write the log of the odds that a subject purchases supplemental insurance, Y=1, given that they do not have chronic disease, x=0, as \beta_{0} + \beta_{1}(0) = \beta_{0} = -0.5637. Then we use the derivation in Equation 7 to calculate the probability estimate, \hat{\pi}, the estimated probability of having purchased supplemental insurance given that no chronic disease is present.
\begin{align} &\log{\frac{\hat{\pi}(0)}{1 - \hat{\pi}(0)}} = -0.5637 \\ &\hat{P}(Y=1 \, \vert \, x = 0) = \hat{\pi}(0) = \frac{e^{-0.5637}}{1 + e^{-0.5637}} \approx 0.3627 \end{align} \tag{10}
So, \beta_{0} is the log odds of success given the control (no chronic health conditions or \bar{D}) and from this we can calculate the probability of success given the control.
In Equation 11, we do the same for x=1 (presence of chronic disease or D).
\begin{align} &\log{\frac{\hat{\pi}(1)}{1 - \hat{\pi}(1)}} = -0.5637 + 0.1182(1) \\ &\hat{P}(Y=1 \, \vert \, x = 1) = \hat{\pi}(1) = \frac{e^{-0.5637+0.1182}}{1 + e^{-0.5637+0.1182}} \approx 0.3904 \end{align} \tag{11}
Again, we have the log odds of success, but now given treatment (the presence of chronic disease), and from this we can calculate the probability of success given treatment. In our case, this is the log odds of having purchased supplemental insurance given the presence of chronic disease.
From these values, we can calculate the estimated risk difference, shown in Equation 12, and the estimated risk ratio, shown in Equation 13.
\widehat{\text{RD}} = \hat{P}(Y=1 \, \vert \, x=1) - \hat{P}(Y=1 \, \vert \, x=0) = \hat{\pi}(1) - \hat{\pi}(0) = 0.3904 - 0.3627 = 0.0277 \tag{12}
The risk of purchasing supplemental insurance is 2.77 percentage points higher in the case of the presence of chronic disease compared to the absence of chronic disease.
In Equation 13 we calculate the relative risk or risk ratio.
\widehat{\text{RR}} = \frac{\hat{P}(Y=1 \, \vert \, x=1)}{\hat{P}(Y=1 \, \vert \, x=0)} = \frac{\hat{\pi}(1)}{\hat{\pi}(0)} = \frac{0.3904}{0.3627} \approx 1.08 \tag{13}
There is a 1.08 - 1 = 0.08 = 8% increase in risk of purchasing supplemental insurance comparing those with chronic disease to those without chronic disease.
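As a quick cross-check of these calculations, below is a minimal sketch that reproduces the measures of association from the fitted model object, using plogis as the inverse logit:

b <- coef(model)  # the intercept and slope estimates

pi_0 <- plogis(b[1])  # estimated P(Y = 1 | x = 0)
pi_1 <- plogis(b[1] + b[2])  # estimated P(Y = 1 | x = 1)

pi_1 - pi_0  # estimated risk difference, approximately 0.0277
pi_1 / pi_0  # estimated risk ratio, approximately 1.08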
How do we interpret \beta_{1}? In Equation 14, where we make use of the properties of logarithms, we see that it is the log of the odds ratio of success given treatment over control.
\begin{align} &\log{\frac{\hat{\pi}(1)}{1 - \hat{\pi}(1)}} - \log{\frac{\hat{\pi}(0)}{1 - \hat{\pi}(0)}} = \left[ \beta_{0} + \beta_1(1) \right] - \beta_{0} \\ &\log{\frac{\frac{\hat{\pi}(1)}{1 - \hat{\pi}(1)}}{\frac{\hat{\pi}(0)}{1 - \hat{\pi}(0)}}} = \log{\text{OR}} = \beta_{1} \end{align} \tag{14}
In Equation 15 we can also see the odds ratio of success given treatment versus control.
\text{OR} = e^{\beta_{1}} \tag{15}
By substituting the values for our illustrative example, we see the odds ratio of purchasing supplemental insurance given the presence of chronic disease versus the absence of chronic disease.
exp(0.1182)
[1] 1.125469
There is a 1.125 - 1 = 0.125 = 12.5% increase in the odds of purchasing supplemental insurance if chronic disease is present compared to when chronic disease is absent.
The results for the estimated risk difference, risk ratio, and odds ratio are the same as when we use the epi.2by2 function.
We have explored our first logistic regression model, with a single, binary explanatory variable. It allows us to compare the results of the measures of association with those that we get from using a table and the epi.2by2 function.
A logistic regression model allows us to expand on table methods by introducing multi-level nominal explanatory variables.
With a single binary explanatory variable, the intercept (coefficient estimate \hat{\beta}_{0}) was the log of the odds of success given the control, and the slope (coefficient estimate \hat{\beta}_{1}) was the log of the odds ratio of success given treatment over control with respect to the levels of the explanatory variable. Exponentiating these coefficient values resulted in the odds and the odds ratio, respectively.
Now, we consider an explanatory variable with three levels. Logistic regression models can be expanded to include explanatory variables with multiple levels. We start our exploration by considering dummy variables.
Consider an explanatory variable with three levels, such as mild, moderate, and severe. To use a multilevel variable in a model, we make use of dummy variables.
We start by changing the levels into their own variables. We illustrate the concept using the three-level example mentioned above.
Subject | Explanatory variable value | Mild | Moderate | Severe |
---|---|---|---|---|
1 | Mild | 1 | 0 | 0 |
2 | Moderate | 0 | 1 | 0 |
3 | Severe | 0 | 0 | 1 |
If a subject had a value of mild captured for the explanatory variable, a 1 is encoded for the new variable mild and a 0 for the rest. A similar concept follows for the other two levels. This makes absolute sense if 1 represents Yes and 0 represents No. An observation with mild disease will have a data entry of Yes or 1 in that column in a spreadsheet. That observation does not have moderate or severe disease and has No or 0 captured for those two variables.
One of the three levels must serve as our base case, against which our model will compare the rest. Each dummy variable is now a binary variable. Just as with binary variables, where 0 was the base case, we do the same here. The choice of base case is dictated by the research question(s). In the table above, we will use mild. We note that one of the three dummy variables is redundant: we can capture the same information with only two dummy variables.
Subject | Explanatory variable value | Moderate | Severe |
---|---|---|---|
1 | Mild | 0 | 0 |
2 | Moderate | 1 | 0 |
3 | Severe | 0 | 1 |
Any subject with the base case will simply have a 0 for moderate and for severe. In this manner, we will always have k-1 dummy variables for an explanatory variable with k levels. Each of the k-1 dummy variables is a binary categorical variable.
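R constructs such dummy variables automatically for factor variables. As a sketch of the encoding, the base R model.matrix function shows the k-1 dummy columns for a small hypothetical severity vector:

severity <- factor(
  c("Mild", "Moderate", "Severe"),
  levels = c("Mild", "Moderate", "Severe")  # Mild listed first, so it is the base case
)

# Only k - 1 = 2 dummy columns appear; the base case is absorbed into the intercept
model.matrix(~ severity)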
We are now familiar with the components of a generalized linear model (GLM). We expand the linear component to include the two dummy variables in our example, shown in Equation 16 using the logit link function, for x_{1} (being moderate in our example) and x_{2} (being severe in our example), where \pi(x_{1},x_{2}) = P(Y=1 \, \vert \, x_{1} , x_{2}).
\log{\frac{\pi(x_{1},x_{2})}{1 - \pi(x_{1},x_{2})}} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} \tag{16}
As before, we can use simple algebra to derive a function for \pi(x_{1}, x_{2}), shown in Equation 17.
\pi(x_{1},x_{2}) = \frac{e^{\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}}}{1 + e^{\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}}} \tag{17}
Note that we can expand Equation 16 and Equation 17 for any number of explanatory variables x_{i}.
As in the previous notebook, we start by considering x_{1} = 0 and x_{2} = 0, with our model shown in Equation 18.
\log{\frac{\pi(0,0)}{1-\pi(0,0)}} = \beta_{0} \tag{18}
The intercept, \beta_{0}, is the log of the odds of Y=1, success, for a subject with the base case, x_{1}=0 and x_{2}=0.
The probability of success given x_{1} = 0 and x_{2} = 0 is shown in Equation 19.
\pi(0,0) = \frac{{e^{\beta_{0}}}}{1 + e^{\beta_{0}}} \tag{19}
Next, we consider a subject with x_{1} = 1 and x_{2}=0 (a moderate case in our three levels example), shown in Equation 20.
\log{\frac{\pi(1,0)}{1-\pi(1,0)}} = \beta_{0} + \beta_{1}x_{1} \tag{20}
We can also express the probability, shown in Equation 21.
\pi(1,0) = \frac{e^{\beta_{0} + \beta_{1}x_{1}}}{1 + e^{\beta_{0} + \beta_{1}x_{1}}} \tag{21}
As in the previous notebook, we can consider the risk difference (RD) and the relative risk (RR), shown in Equation 22 and Equation 23 respectively, where we compare the second level of the explanatory variable to the base case.
\text{RD} = \pi(1,0) - \pi(0,0) \tag{22}
\text{RR} = \frac{\pi(1,0)}{\pi(0,0)} \tag{23}
The odds ratio (OR) is of particular interest. We show this is Equation 24.
\begin{align} &\log{\frac{\pi(1,0)}{1-\pi(1, 0)}} - \log{\frac{\pi(0,0)}{1-\pi(0,0)}} = \beta_{0} + \beta_{1} - \beta_{0} \\ &\log{\frac{\frac{\pi(1,0)}{1-\pi(1,0)}}{\frac{\pi(0,0)}{1-\pi(0,0)}}} = \beta_{1} \\ &\beta_{1} = \log{\text{OR}} \end{align} \tag{24}
We see that the coefficient \beta_{1} is the log odds ratio for success comparing the second level to the base case.
In the same manner we argue for the third level, x_{1}=0, x_{2}=1, shown in Equation 25, and so too in cases that have more than two levels.
\begin{align} &\log{\frac{\pi(0,1)}{1-\pi(0,1)}} = \beta_{0} + \beta_{2}x_{2} \\ &\pi(0,1) = \frac{e^{\beta_{0} + \beta_{2}x_{2}}}{1 + e^{\beta_{0} + \beta_{2}x_{2}}} \\ &\text{RD} = \pi(0,1) - \pi(0,0) \\ &\text{RR} = \frac{\pi(0,1)}{\pi(0,0)} \\ &\beta_{2} = \log{\text{OR}_{\text{ comparing level 3 to base case}}} \\ &\text{OR} = e^{\beta_{2}} \end{align} \tag{25}
Here \beta_{2} is the log of the odds ratio for success given the third level compared to the base case.
As in the previous notebook, we will use the same survey data from 2010 as an illustrative example.
We construct a table of observed values with the private variable as the response variable (Y=0 indicating that a subject did not purchase supplemental health insurance and Y=1 indicating that they did). The explanatory variable, x, is tri; it has three levels indicating the presence and number of chronic conditions per subject. The encoding in the data set is as shown below.
We represent the summary table of observed values in Table 4, where we create a new variable with appropriate labels for the three levels.
dfr$DiseaseLevel <- factor(
  dfr$tri,
  levels = c(0, 1, 2),
  labels = c("No chronic disease", "1-3 chronic conditions", "More than 3 chronic conditions")
)
crosstable::crosstable(
  dfr,
  DiseaseLevel,
  by = PrivateInsurance,
  total = "both",
  percent_pattern = "{n}"
) %>%
  crosstable::as_flextable(
    keep_id = FALSE
  )
label | variable | PrivateInsurance: No | PrivateInsurance: Yes | Total |
---|---|---|---|---|
DiseaseLevel | No chronic disease | 246 | 140 | 386 (12.04%) |
DiseaseLevel | 1-3 chronic conditions | 1386 | 957 | 2343 (73.08%) |
DiseaseLevel | More than 3 chronic conditions | 333 | 144 | 477 (14.88%) |
Total | | 1965 (61.29%) | 1241 (38.71%) | 3206 (100.00%) |
It remains for us to build the logistic regression model. We use the glm function, with \texttt{PrivateInsurance} as the response variable and \texttt{DiseaseLevel} as the explanatory variable. R will generate the 3-1=2 dummy variables.
model <- glm(
  PrivateInsurance ~ DiseaseLevel,
  data = dfr,
  family = binomial(link = "logit")
)
The coefficients attribute returns the values of the coefficient estimates. Note that the order that we used when we created the new factor, \texttt{DiseaseLevel}, is very important. The glm function will treat the level that is mentioned first as the base case. In our case this was 0, or No chronic disease.
summary(model)$coefficients
Estimate Std. Error z value
(Intercept) -0.5636891 0.1058674 -5.324485
DiseaseLevel1-3 chronic conditions 0.1933153 0.1139049 1.697164
DiseaseLevelMore than 3 chronic conditions -0.2746401 0.1454487 -1.888226
Pr(>|z|)
(Intercept) 1.012396e-07
DiseaseLevel1-3 chronic conditions 8.966565e-02
DiseaseLevelMore than 3 chronic conditions 5.899562e-02
The Estimate for the (Intercept) represents \hat{\beta}_{0}. We see the two levels listed below (Intercept). As expected, we see only 1-3 chronic conditions and More than 3 chronic conditions, both with the variable name DiseaseLevel as prefix, for \hat{\beta}_{1} and \hat{\beta}_{2} respectively.
We can write our model, as shown in Equation 26.
\begin{align} &\log{\frac{\hat{\pi}(x_{1}, x_{2})}{1-\hat{\pi}(x_{1}, x_{2})}} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} \\ &\log{\frac{\hat{\pi}(x_{1}, x_{2})}{1-\hat{\pi}(x_{1}, x_{2})}} = -0.5637 + 0.1933 x_{1} - 0.2746 x_{2} \end{align} \tag{26}
From this model in Equation 26, we consider what we can derive from the intercept, \hat{\beta}_{0}, and the two slopes, \hat{\beta}_{1} and \hat{\beta}_{2}, summarized in Table 5.
Coefficient | Value | Meaning |
---|---|---|
\hat{\beta}_{0} | -0.56 | log of the odds of purchase given no conditions |
\hat{\beta}_{1} | 0.19 | log odds ratio of purchase given 1-3 conditions compared to no condition |
\hat{\beta}_{2} | -0.27 | log odds ratio of purchase given more than 3 conditions compared to no condition |
We can take the antilog of each of these values by exponentiating each coefficient estimate, shown in Table 6.
Coefficient | Value | Meaning |
---|---|---|
e^{\hat{\beta}_{0}} | 0.57 | odds of purchase given no conditions |
e^{\hat{\beta}_{1}} | 1.21 | odds ratio of purchase given 1-3 conditions compared to no condition |
e^{\hat{\beta}_{2}} | 0.76 | odds ratio of purchase given more than 3 conditions compared to no condition |
We save the three coefficients and assign them to respective variables before taking the antilog to verify the results above.
beta_0 <- -0.5636891
beta_1 <- 0.1933153
beta_2 <- -0.2746401
exp(beta_0)
[1] 0.5691057
exp(beta_1)
[1] 1.213265
exp(beta_2)
[1] 0.7598455
We can also consider the risk difference and risk ratios when we calculate the estimated probabilities. We start with the estimated risk of purchase given no conditions, where the risk is \hat{\pi}(0,0) = 0.3627, calculated below.
risk_0 <- exp(beta_0) / (1 + exp(beta_0))
risk_0
[1] 0.3626943
The estimated risk of purchase in the case of 1-3 conditions is \hat{\pi}(1,0)=0.4085, calculated below.
risk_1 <- exp(beta_0 + beta_1) / (1 + exp(beta_0 + beta_1))
risk_1
[1] 0.4084507
The estimated risk of purchase in the case of more than 3 conditions is \hat{\pi}(0,1)=0.3019, calculated below.
risk_2 <- exp(beta_0 + beta_2) / (1 + exp(beta_0 + beta_2))
risk_2
[1] 0.3018868
The estimated risk difference between those with 1-3 conditions and those with no conditions is therefore 0.4085-0.3627 = 0.0458 (about 5 percentage points), and the estimated risk difference between those with more than 3 conditions and those with no conditions is 0.3019-0.3627 = -0.0608 (about -6 percentage points).
The relative risks in these two comparisons are 0.4085 \div 0.3627 \approx 1.13 (13% higher in the 1-3 conditions group) and 0.3019 \div 0.3627 \approx 0.83 (17% lower in the more than 3 conditions group), respectively.
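As a cross-check, a short sketch using the predict function on the fitted model object should reproduce these three estimated risks; type = "response" returns probabilities rather than log odds:

newdata <- data.frame(
  DiseaseLevel = factor(
    c("No chronic disease", "1-3 chronic conditions", "More than 3 chronic conditions"),
    levels = levels(dfr$DiseaseLevel)
  )
)

predict(model, newdata = newdata, type = "response")  # approximately 0.3627, 0.4085, 0.3019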
In the previous two sections, we have explored logistic regression models with either a binary or a multilevel explanatory variable for a binary response variable Y. We can also do so for a continuous numerical explanatory variable x.
In Equation 27, we see the now familiar logit link function between the random and the systematic components of our generalized linear model with a single explanatory variable x, where \pi(x) = P(Y=1 \, \vert \, x).
\log{\frac{\pi(x)}{1 - \pi(x)}} = \beta_{0} + \beta_{1} x \tag{27}
Interpreting Equation 27 when x=0 proceeds as in the case of a binary explanatory variable. We have Equation 28.
\log{\frac{\pi(0)}{1 - \pi(0)}} = \beta_{0} \tag{28}
The coefficient \beta_{0} is the intercept and represents the log of the odds of success when x=0. The probability of success when x=0 is shown in Equation 29.
\pi(0) = \frac{e^{\beta_{0}}}{1 + e^{\beta_{0}}} \tag{29}
For continuous numerical variables we do not have levels for comparison. Instead, we consider the difference between values that the continuous variable can take in reference to the units in which it is measured.
We start by considering a 1 unit increase in the continuous explanatory variable. We use an arbitrary value x = \nu and a single unit increase, \nu + 1.
In Equation 30 and Equation 31 we see the log of the odds for success for x=\nu+1 and for x = \nu, respectively.
\log{\frac{\pi(\nu + 1)}{1 - \pi(\nu + 1)}} = \beta_{0} + \beta_{1} (\nu + 1) \tag{30}
\log{\frac{\pi(\nu)}{1 - \pi(\nu)}} = \beta_{0} + \beta_{1} \nu \tag{31}
Subtraction, shown in Equation 32, yields the derivation for the log odds ratio of success for a single unit increase in the continuous numerical explanatory variable.
\begin{align} &\log{\frac{\pi(\nu + 1)}{1 - \pi(\nu + 1)}} - \log{\frac{\pi(\nu )}{1 - \pi(\nu)}} = \beta_{0} + \beta_{1} (\nu + 1) - [\beta_{0} + \beta_{1} \nu] \\ &\log{\frac{\frac{\pi(\nu + 1)}{1 - \pi(\nu + 1)}}{\frac{\pi(\nu)}{1 - \pi(\nu)}}} = \beta_{1} \end{align} \tag{32}
Now, we can consider any constant, c, increase, shown in Equation 33.
\log{\frac{\frac{\pi(\nu + c)}{1 - \pi(\nu + c)}}{\frac{\pi(\nu)}{1 - \pi(\nu)}}} = c \beta_{1} \tag{33}
Note that since we are considering a continuous numerical variable, we have no frequencies and cannot calculate risks and odds. If the study design allows, we can, though, consider the probability of success given a value for the continuous explanatory variable, using Equation 34.
\pi(x) = \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}} \tag{34}
To illustrate the concepts above, we use the same data set as before, pertaining to the purchase of supplementary health insurance.
The \texttt{chronic} variable shows the number of chronic conditions. We will consider this as a continuous numerical explanatory variable.
As we should always do, we explore the data using summary statistics and data visualization.
psych::describe(dfr$chronic)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 3206 2.06 1.42 2 1.98 1.48 0 8 8 0.65 0.24 0.03
ggstatsplot::ggbetweenstats(
  data = dfr,
  x = PrivateInsurance,
  y = chronic,
  bf.message = FALSE,
  palette = "Blues"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
We build a logistic regression model as before, using the glm function.
model <- glm(
  PrivateInsurance ~ chronic,
  data = dfr,
  family = binomial(link = "logit")
)
We view the coefficients using the coefficients attribute.
summary(model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.30972934 0.06386084 -4.850067 1.234198e-06
chronic -0.07323511 0.02590275 -2.827311 4.694078e-03
The intercept is \hat{\beta}_{0} = -0.3097 and the slope is \hat{\beta}_{1} = -0.0732. The model can now be written, as shown in Equation 35, where x is the number of chronic conditions.
\begin{align} &\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\&\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = -0.3097 - 0.0732 x \end{align} \tag{35}
The values are assigned to variables below.

beta_0 <- -0.3097
beta_1 <- -0.0732
The intercept \hat{\beta}_{0} = -0.3097 is the log of the odds of purchasing supplemental health insurance given no chronic conditions, x=0. The antilog is calculated below, and shows that the odds of purchase given no chronic condition is e^{\hat{\beta}_{0}} = 0.73.
exp(beta_0)
[1] 0.733667
The slope \hat{\beta}_{1} = -0.0732 is the log of the odds ratio of purchase for any single unit increase in the number of chronic conditions. The antilog is calculated below, and shows that the odds ratio of purchase for any one unit increase in the number of chronic conditions is e^{\hat{\beta}_{1}} = 0.93.

exp(beta_1)

[1] 0.9294149
The odds ratio value of 0.93 is less than 1 and we have that the odds of purchase is 1-0.93 = 0.07 = 7% lower for every increase of one chronic condition.
For an increase of two chronic conditions, we have that the odds ratio is e^{2 \hat{\beta}_{1}} = 0.864. That is a 1 - 0.864 = 0.136 = 13.6% decrease in the odds.

exp(2 * beta_1)

[1] 0.8638121
The estimated probability of purchase for any number of chronic conditions can be computed using Equation 36.
\hat{\pi}(x) = \frac{e^{-0.3097 - 0.0732 x}}{1 + e^{-0.3097 - 0.0732 x}} \tag{36}
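As a sketch, we can evaluate Equation 36 over the observed range of the \texttt{chronic} variable (0 to 8, per the summary statistics above), using plogis as the inverse logit:

x <- 0:8  # observed range of the number of chronic conditions
round(plogis(beta_0 + beta_1 * x), 4)  # estimated probability of purchase at each count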
We have explored generalized linear models, and in particular, simple logistic regression models with single binary, multilevel, and continuous numerical explanatory variables. We briefly explore the concept of uncertainty in the slope, \beta_{1} now.
Using a generalized linear modeling approach to understanding the association between two categorical variables is more powerful than using analysis from tables of observed data. We can tap into the calculations used for inference with these models.
When using generalized linear models with large sample sizes, the coefficient estimates are approximately normally distributed, with mean equal to the population parameter, \beta_{1}, and variance calculated from the standard error of \hat{\beta}_{1}, as shown in Equation 37, where we consider \hat{\beta}_{1} and its estimated standard error \sigma_{\hat{\beta}_{1}}.
\hat{\beta}_{1} \sim N(\beta_{1} , \sigma_{\hat{\beta}_{1}}^{2}) \tag{37}
We can then construct a confidence interval for a confidence level of 100(1-\alpha )%, shown in Equation 38 using the critical z score for the specific \alpha value.
\text{Wald confidence interval for } \beta_{1}: \; \hat{\beta}_{1} \pm z_{\frac{\alpha}{2}} \widehat{SE}_{\hat{\beta}_{1}} \tag{38}
We use the same data as for the other illustrative examples in this session and use the \texttt{ChronicDisease} variable as explanatory variable. It is a binary variable with 0 or No indicating the absence of chronic disease and 1 or Yes indicating the presence of one or more chronic conditions. The response variable is still \texttt{PrivateInsurance}, indicating the purchase of health insurance, 1 or Yes, or the non-purchase of insurance, 0 or No.
We create the model using the glm function, specifying the link function.
model <- glm(
  PrivateInsurance ~ ChronicDisease,
  data = dfr,
  family = binomial(link = "logit")
)
The coefficients attribute shows the table of coefficients.
summary(model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5636891 0.1058674 -5.324485 1.012396e-07
ChronicDiseaseYes 0.1181652 0.1126849 1.048634 2.943467e-01
We have that \hat{\beta}_{1} = 0.1182. We can now express our uncertainty in this value.
The confint.default function can be used to calculate and show the confidence interval values. The confidence level, 100(1-\alpha)%, is set by the level keyword argument, which we set as 0.95, indicating a 95% confidence level.
ci <- confint.default(
  model,
  level = 0.95
)

ci
2.5 % 97.5 %
(Intercept) -0.7711853 -0.3561929
ChronicDiseaseYes -0.1026932 0.3390237
We have that \hat{\beta}_{1} = 0.1182 (95% confidence interval -0.1027 to 0.3390).
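As a cross-check, below is a minimal sketch that computes the Wald limits of Equation 38 by hand from the coefficient table; the results should match those from confint.default:

coefs <- summary(model)$coefficients
est <- coefs["ChronicDiseaseYes", "Estimate"]
se <- coefs["ChronicDiseaseYes", "Std. Error"]

est + c(-1, 1) * qnorm(0.975) * se  # approximately -0.1027 and 0.3390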
While \hat{\beta}_{1} is the log odds ratio of purchase comparing those with chronic disease and those without, we can take the antilog of these values to express the odds ratio with confidence intervals.
exp(0.1181652)
[1] 1.12543
exp(ci)
2.5 % 97.5 %
(Intercept) 0.4624646 0.7003375
ChronicDiseaseYes 0.9024038 1.4035766
We finally have that the odds ratio of purchase comparing those with chronic disease and those without is 1.13 (95% confidence interval 0.90 to 1.4).
We see a 100(1.13 - 1) = 13% increase in the odds of purchase comparing those with to those without chronic conditions. Our confidence intervals, though, show that the true population odds ratio may lie anywhere from a 100(1 - 0.90) = 10% decrease in the odds to a 100(1.4 - 1) = 40% increase in the odds.

In a later session we will explore hypothesis testing and see that if this interval contains 1.0, we have a p value greater than our chosen \alpha value and we fail to reject the null hypothesis. This should be clear to see from the confidence interval, which spans a range from a decrease to an increase in the odds.
You are a member of a research team modeling the risk of death in a community exposed to an infectious agent using a cross-sectional study design. Research data pertaining to mortality and age group are available in a comma-separated file, infect_mortality.csv. The response variable \texttt{Mortality} has levels Yes (person died) and No (person survived). The explanatory variable, \texttt{AgeGroup}, has two levels; younger persons are encoded as A and older persons as B. Let A=0 and B=1.
Your task is to import the data, verify integrity, summarize, and visualize the data, express the measures of association (inclusive of the risk difference, risk ratio, and the odds ratio). Use confidence intervals (with a 95% confidence level) and hypothesis testing to interrogate the association between mortality and age group. Use both a critical value and a p value approach in your hypothesis testing. Use a 5% level of significance.
Also create a model that estimates the probability of death given age group. Provide full commentary on your model.
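Below is a possible starting sketch, assuming the import_dataset helper used earlier in this chapter and the file and variable encodings described above; adapt it as your analysis requires.

mort <- import_dataset("infect_mortality.csv")

# Encode the levels so that No and A serve as the base cases
mort$Mortality <- factor(mort$Mortality, levels = c("No", "Yes"))
mort$AgeGroup <- factor(mort$AgeGroup, levels = c("A", "B"))

model_lab <- glm(
  Mortality ~ AgeGroup,
  data = mort,
  family = binomial(link = "logit")
)

summary(model_lab)$coefficients
exp(confint.default(model_lab, level = 0.95))  # odds ratios with 95% confidence limits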
We continue with the data set from the research project in the previous lab problem, constructing models predicting mortality. The data are captured in the comma-separated file infect_mortality.csv.
The New York Heart Association (NYHA) classifies cardiac or heart failure into grades I through IV, with grade I and II considered mild failure and grades III and IV considered advanced cardiac failure. The data set contains a variable, \texttt{CardiacFailure}, with three levels, None (healthy cardiac function), Mild (NYHA grades I and II cardiac failure), and Advanced (NYHA grades III and IV cardiac failure).
Your interest is in understanding the association between cardiac failure and mortality by using a model that estimates the probability of mortality given the level of cardiac failure. Create a thorough report pertaining to this research question to present to your colleagues in the research group.
Also use table methods to confirm the odds ratios from the model comparing the different levels of heart failure to those with a healthy heart, given that None (healthy cardiac function) is the base case for this multi-level explanatory variable.
We continue with our simulated research project into mortality. The data set contains a variable named Glucose, which is a measure of fasting blood glucose. The data are captured in the comma-separated file infect_mortality.csv.
Prepare a report of results and interpretation for your colleagues using fasting blood glucose as an explanatory variable for the probability of mortality.
We continue to explore the data set investigating mortality. Calculate the 95% confidence intervals for the coefficients in the model that estimates the probability of mortality given the level of cardiac function.