This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In the previous three sections, we used frequency tables to describe data for categorical variables and to investigate the association between two categorical variables.
Another method of approaching analysis is to use logistic regression models. Logistic regression models have several advantages over table methods. They allow us to consider more than one explanatory variable, as opposed to the single exposure variable used in the table methods that we have seen thus far. We can introduce nominal and continuous explanatory variables, and we can make use of hypothesis testing with respect to the measures of association that logistic regression provides. Logistic regression models can also investigate confounding and interaction (effect modification). Lastly, we can use logistic regression to create predictions of the success outcome class, given values for the explanatory variable(s).
Prediction is of particular interest to the modern world of machine learning. In prediction, our interest shifts away from the calculations involved in inference (confidence intervals and hypothesis testing). Instead we want to calculate an estimated numerical outcome or the estimated probability of a categorical outcome, given a variety of explanatory variables. In prediction analysis, we are not interested in the association between the explanatory variables and the outcome or target variable.
In this notebook, we introduce a broad class of regression models named generalized linear models (GLMs). While linear models have numerical response variables, the idea behind GLMs is to broaden the use of regression models to include categorical, count, rate, and other outcome types. We will concentrate on models with a binary nominal outcome variable. These models are referred to as logistic regression models.
We start by exploring the notation and some basic concepts.
Load the required libraries for Chapter 4. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
library(epiR)
library(psych)
library(ggstatsplot)
library(DescTools)
library(car)
The outcome variable in a GLM is often referred to as the response variable and is usually denoted as Y. In mathematical textbooks, this is expressed as a vector (for single variable responses) and written as a boldface \mathbf{y}. Y is our disease variable from the table methods that we have seen. For a binary response variable, we will still have two levels or classes, success or D and failure or \bar{D}.
The explanatory variables are usually denoted by x_{1} , \, x_{2} , \, \ldots \, x_{p}, where the subscript is specific for each of p explanatory variables. In mathematical notation, we would express the values as a matrix X_{np}, where the column vectors in the matrix are the p individual explanatory variables and n is the sample size. Instead of a single binary or multi-level exposure variable, we can now explore a richer set of explanatory or exposure variables.
We also use the notation \left\{ y_{i} \, \vert \, x_{1i} , x_{2i} , \ldots , x_{pi} \right\}, where i = \left\{ 1,2,3, \ldots , n\right\} for the sample size n and where p is the number of explanatory variables. Here y_{i} is the i^{\text{th}} outcome, given the i^{\text{th}} observation's values for each of the p explanatory variables.
In broad terms, and for a categorical response variable, we are then interested in P \left( Y = y_{i} \, \vert \, \mathbf{x}_{i} \right), the probability that the i^{\text{th}} outcome takes on the value y_{i}, given, as mentioned, the values x_{1i} , x_{2i} , \ldots , x_{pi} (one for each of the p explanatory variables). Our interest will lie with assigning one of the response levels or classes as the success level, encoded with a 1 and consider the probability P(Y=1 \, \vert \mathbf{x}).
There are two main parts to a GLM, a random component and a systematic component, with a third part that links the random and systematic components.
The random component identifies the distribution of the response variable Y. In a linear regression model Y follows (or is assumed to follow) a normal distribution with Y \sim N(\mu , \sigma^{2}). Y might be a binary categorical variable with Y \sim \text{Bin}(1, \pi), or Y might be a rate (counts per time interval) with Y \sim \text{Poisson}(\mu).
The systematic component regards the explanatory variables and their coefficients, shown in Equation 1.
\beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{1}
The systematic component is the mathematical form that relates the explanatory variables to the response variable (as in linear regression) or to some function of the response variable. Note that (1) is a linear form (which is not always the case and may have to be adjusted). Here, the coefficients \beta_{0} , \beta_{1} , \ldots , \beta_{p} enter linearly, allowing us to consider unit increases in the explanatory variables and how they alter the calculated (predicted or estimated) response variable (or at least a function of it).
The link function, as the name suggests, links the random and systematic components of a GLM. If we consider the expected value of the response variable E(Y) to be \mu (the mean of the probability distribution of Y), then link function g relates \mu to the linear predictor (the systematic component), as shown in Equation 2.
g(\mu_{Y}) = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{2}
In a linear regression model, Y is a continuous numerical variable assumed to have a normal distribution Y \sim \text{N}(\mu , \sigma^{2}). Linear regression models have a systematic component as in (1). The link function is the identity function g(\mu) = \mu.
In the case of a binary response variable, we consider a logistic regression model. Here the response variable Y \sim \text{Bin}(1, \pi). The systematic component is still as in Equation 1. The link function is the logit function, shown in Equation 3.
g(\mu) = \log{\frac{\mu}{1 - \mu}} \tag{3}
Since we are considering individual (independent) Bernoulli trials, we have that E(Y) = 1 \times \pi, where \pi is the probability of success, and we can write Equation 4 below.
\log{\frac{\pi}{1 - \pi}} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{4}
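As an aside, R provides the logit function and its inverse as qlogis and plogis, which saves us from typing the exponential expressions by hand later on. A quick sketch, where the probability value 0.25 is an arbitrary choice:

p <- 0.25
qlogis(p)  # the logit, log(p / (1 - p))
plogis(qlogis(p))  # the inverse logit recovers p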
For a count response (such as incidence per time interval), the Poisson log-linear GLM is often used. Here, the response variable follows a Poisson distribution, Y \sim \text{Poisson}(\mu), with the systematic component still as in Equation 1. The link function is the log-link function shown in Equation 5.
g(\mu) = \log{\mu} \tag{5}
A log-linear GLM can then be written as in Equation 6.
\log{\mu} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{6}
As a summary, we have Table 1, where the random component is Y \sim \text{Bin}(1 , \pi) (\pi is the population probability of success, i.e. Y is a binary variable). The systematic component is then \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p}.
Link name | Function | Model |
---|---|---|
Identity link | g(\mu) = \mu | Linear probability model |
Log link | g(\mu) = \log{\mu} | Log binomial model |
Logistic (logit) link | g(\mu) = \log{\frac{\pi}{1 - \pi}} | Logistic model |
In a GLM, \pi is a function of the predictors. If we have a single predictor, x, we write \pi(x) = P(Y=1 \; \vert \; x) (the probability of observing a success given a value for x). Note that we are only considering binary (dichotomous) outcome variables.
We have seen that the random component of a simple logistic regression model (the response variable) has a binomial distribution, Y \sim \text{Bin}(1, \pi). Our aim is to estimate the probability of success, \hat{\pi} = \hat{P}(Y=1), for this response variable. Hence, our probability estimate of failure is 1-\hat{\pi} = \hat{P}(Y=0). Remember that the mean (expected value) of Y \sim \text{Bin}(n, \pi) is \mu = n \pi, where \pi is the population proportion of success. In this case n=1 and \mu = \pi.
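To build some intuition, below is a minimal simulation sketch; the seed, the number of trials, and the true \pi = 0.3 are arbitrary choices. The sample mean of independent Bernoulli draws estimates \pi.

set.seed(42)  # arbitrary seed for reproducibility
y <- rbinom(n = 10000, size = 1, prob = 0.3)  # 10000 independent Bernoulli trials
mean(y)  # sample proportion of successes, close to the true pi of 0.3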
For now, our single explanatory variable, x, is likewise binary in nature, which we encode with a 0 for control (or no treatment) and a 1 for exposure (or treatment), such that our systematic component is \beta_{0} + \beta_{1} x, where x = \left\{ 0,1 \right\}.
In many cases the relationship between the explanatory variable(s) and the response variable is not linear. The logit link function introduces the non-linearity which models many processes in nature. The logit link function g(\pi) was shown in Equation 3.
We can find an equation for our probability of success, \pi, using simple algebra and the anti-log function, shown in Equation 7.
\begin{align} &\log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1}x \\ &\frac{\pi}{1 - \pi} = e^{\beta_{0} + \beta_{1}x} \\ &\pi = (1 - \pi) e^{\beta_{0} + \beta_{1}x} \\ &\pi = e^{\beta_{0} + \beta_{1}x} - \pi e^{\beta_{0} + \beta_{1}x} \\ &\pi + \pi e^{\beta_{0} + \beta_{1}x} = e^{\beta_{0} + \beta_{1}x} \\ &\pi(1 + e^{\beta_{0} + \beta_{1}x}) = e^{\beta_{0} + \beta_{1}x} \\ &\pi = \frac{e^{\beta_{0} + \beta_{1}x}}{1 + e^{\beta_{0} + \beta_{1}x}} \end{align} \tag{7}
In Equation 7, we have a function for the probability of success, \pi (x) (given the explanatory variable x). The right-hand side expression of Equation 7 ensures that \pi(x) is bounded between 0 and 1, exactly what we require for a probability.
In Figure 1, we set arbitrary values for \beta_{0} and \beta_{1} to view an example of a logistic model.
inputs <- seq(-3, 3, by = 0.1)  # values for the explanatory variable
beta_0 <- 0.1  # arbitrary intercept
beta_1 <- 1  # arbitrary slope
c <- exp(beta_0 + beta_1 * inputs)

plot(
  inputs,
  c / (1 + c),
  type = "l",
  main = "Logistic model",
  xlab = "Explanatory variable",
  ylab = "Probability of success",
  las = 1,
  panel.first = grid(lty = 1)
)
Given a value for the explanatory variable in Figure 1 (horizontal axis), the model calculates the probability of the success class (vertical axis).
To use and understand a logistic regression model with a binary response variable and a binary explanatory variable, we import a data set saved as a comma-separated values file, ChronicConditionsPrivateInsurance.csv.

dfr <- import_dataset("ChronicConditionsPrivateInsurance.csv")
The data is taken from a 2010 survey by Cameron and Trivedi. There are 3206 subjects in the study. We want to explore the relationship between supplementary insurance and the presence of chronic disease. More specifically, we are interested in the presence of chronic disease as an explanation for subjects purchasing supplementary insurance.
Both variables are binary in nature, with the following levels shown in Table 2.
Type | Name | Levels | Generic levels |
---|---|---|---|
Explanatory | bin | \begin{cases} x=0 \; \text{no chronic disease} \\ x=1 \; \text{chronic disease} \end{cases} | \begin{cases} x=0 \text{ control} \\ x=1 \text{ treatment} \end{cases} |
Response | private | \begin{cases} Y=0 \; \; \text{no supplemental insurance} \\ Y=1 \; \; \text{supplemental insurance} \end{cases} | \begin{cases} Y=0 \text{ failure} \\ Y=1 \text{ success} \end{cases} |
It is always important to remember that we must assign the levels of our variables to the encoded values for our response and our explanatory variables. In the case of binary variables, the task is simple. We will explore multi-level explanatory variables later. In our illustrative example, subjects either have no chronic disease, x=0, or chronic disease is present, x=1. We choose success as having purchased supplemental health insurance, Y=1. We can, therefore, write Equation 8.
\pi(x) = P(Y=1 \, \vert \, x) \tag{8}
Below, we see the table of observed data in Table 3. Since both levels (classes) in the \texttt{private} and \texttt{bin} variables are coded with a 0 and a 1 respectively, we generate two new variables, \texttt{PrivateInsurance} and \texttt{ChronicDisease}, using the factor function. The factor function allows us to create factor (categorical) variables, set the order of the classes with the levels keyword argument, and set labels for the two levels with the labels keyword argument.
dfr$PrivateInsurance <- factor(dfr$private, levels = c(0, 1), labels = c("No", "Yes"))
dfr$ChronicDisease <- factor(dfr$bin, levels = c(0, 1), labels = c("No", "Yes"))
crosstable::crosstable(
  dfr,
  ChronicDisease,
  by = PrivateInsurance,
  total = "both",
  percent_pattern = "{n}"
) %>%
  crosstable::as_flextable(
    keep_id = FALSE
  )
label | variable | PrivateInsurance: No | PrivateInsurance: Yes | Total |
---|---|---|---|---|
ChronicDisease | No | 246 | 140 | 386 (12.04%) |
ChronicDisease | Yes | 1719 | 1101 | 2820 (87.96%) |
Total | | 1965 (61.29%) | 1241 (38.71%) | 3206 (100.00%) |
From the observed values in Table 3, we can calculate risks and ratios, especially using the epi.2by2 function in the epiR library, as we have seen before. As revision, we implement the epi.2by2 function in the code chunk below. Using modelling will ultimately be a more powerful approach, though. Remember that the function requires a certain order for the two levels of each of the exposure and disease variables. We use indexing to enforce the order.
epiR::epi.2by2(table(dfr$ChronicDisease, dfr$PrivateInsurance)[2:1, 2:1])
Outcome + Outcome - Total Inc risk *
Exposed + 1101 1719 2820 39.04 (37.24 to 40.87)
Exposed - 140 246 386 36.27 (31.47 to 41.29)
Total 1241 1965 3206 38.71 (37.02 to 40.42)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 1.08 (0.94, 1.24)
Inc odds ratio 1.13 (0.90, 1.40)
Attrib risk in the exposed * 2.77 (-2.35, 7.90)
Attrib fraction in the exposed (%) 7.10 (-6.86, 19.24)
Attrib risk in the population * 2.44 (-2.64, 7.52)
Attrib fraction in the population (%) 6.30 (-6.09, 17.25)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 1.101 Pr>chi2 = 0.294
Fisher exact test that OR = 1: Pr>chi2 = 0.316
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
The results show that the risk of purchasing private supplementary insurance is 2.77 percentage points greater in those with chronic disease compared to those without chronic disease (95% confidence interval -2.35 to 7.9 percentage points).
The glm function

The glm function allows us to create a generalized linear model (GLM) writing very little code. We use it in the code chunk below and assign the result to the variable model.
model <- glm(
  private ~ bin,
  data = dfr,
  family = binomial(link = "logit")
)
The glm function takes a formula as its first argument. The formula private ~ bin indicates that we want to consider the \texttt{private} outcome variable given the \texttt{bin} variable. Both of these variables are encoded with 0 and 1 for their two respective levels. This is ideal, as the exposure levels, E and \bar{E}, and the two disease levels, D and \bar{D}, are already contained in the numerical values 0 and 1. The second argument, data, is assigned our data object, dfr. The last argument is the family keyword argument. We set this to binomial(link = "logit"). This instructs the glm function that the outcome, Y, follows a binomial distribution and that we specify the logit link function.
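As an aside, the logit is the default link for the binomial family, so an equivalent, shorter call is possible. A small sketch, where model_alt is a hypothetical name used only for this comparison:

model_alt <- glm(private ~ bin, data = dfr, family = binomial)  # logit is the default link
all.equal(coef(model), coef(model_alt))  # both calls produce identical coefficients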
We view the coefficients below using the attribute coefficients.
summary(model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5636891 0.1058674 -5.324485 1.012396e-07
bin 0.1181652 0.1126849 1.048634 2.943467e-01
Given the data, we can now explore the estimated probability of success, \hat{\pi} using the coefficients and the logit function that links \hat{\pi} and the linear component \beta_{0} + \beta_{1}x, shown in Equation 9, which represents our model. Note that the values for the coefficient estimates, \hat{\beta}_{0} and \hat{\beta}_{1} are from the Estimate column in the results above. The former is the intercept and the second is the slope.
\begin{align} &\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\ \\ &\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = -0.5637 + 0.1182 x \end{align} \tag{9}
How do we interpret the model in Equation 9? We start with x=0, an absence of chronic health conditions with respect to the explanatory variable (the exposure variable).
In Equation 10 we write the log of the odds that a subject purchases supplemental insurance, Y=1, given that they do not have chronic disease, x=0, as \beta_{0} + \beta_{1}(0) = \beta_{0} = -0.5637. Then we use the derivation in Equation 7 to calculate the probability estimate, \hat{\pi}, the estimated probability of having purchased supplemental insurance given that no chronic disease is present.
\begin{align} &\log{\frac{\hat{\pi}(0)}{1 - \hat{\pi}(0)}} = -0.5637 \\ &\hat{P}(Y=1 \, \vert \, x = 0) = \hat{\pi}(0) = \frac{e^{-0.5637}}{1 + e^{-0.5637}} \approx 0.3627 \end{align} \tag{10}
So, \beta_{0} is the log odds of success given the control (no chronic health conditions or \bar{D}) and from this we can calculate the probability of success given the control.
In Equation 11, we do the same for x=1 (presence of chronic disease or D).
\begin{align} &\log{\frac{\hat{\pi}(1)}{1 - \hat{\pi}(1)}} = -0.5637 + 0.1182(1) \\ &\hat{P}(Y=1 \, \vert \, x = 1) = \hat{\pi}(1) = \frac{e^{-0.5637+0.1182}}{1 + e^{-0.5637+0.1182}} \approx 0.3904 \end{align} \tag{11}
Again, we have the log odds of success, but now given treatment (the presence of chronic disease), and from this we can calculate the probability of success given treatment. In our case, this is the log odds of having purchased supplemental insurance given the presence of chronic disease.
From these values, we can calculate the estimated risk difference, shown in Equation 12, and the estimated risk ratio, shown in Equation 13.
\widehat{\text{RD}} = \hat{P}(Y=1 \, \vert \, x=1) - \hat{P}(Y=1 \, \vert \, x=0) = \hat{\pi}(1) - \hat{\pi}(0) = 0.3904 - 0.3627 = 0.0277 \tag{12}
The risk of purchasing supplemental insurance is 2.77 percentage points higher in the case of the presence of chronic disease compared to the absence of chronic disease.
In Equation 13 we calculate the relative risk or risk ratio.
\widehat{\text{RR}} = \frac{\hat{P}(Y=1 \, \vert \, x=1)}{\hat{P}(Y=1 \, \vert \, x=0)} = \frac{\hat{\pi}(1)}{\hat{\pi}(0)} = \frac{0.3904}{0.3627} \approx 1.08 \tag{13}
There is a 1.08 - 1 = 0.08 = 8% increase in risk of purchasing supplemental insurance comparing those with chronic disease to those without chronic disease.
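As a quick cross-check of these calculations, below is a minimal sketch that reproduces the measures of association from the fitted model object, using plogis as the inverse logit:

b <- coef(model)  # the intercept and slope estimates

pi_0 <- plogis(b[1])  # estimated P(Y = 1 | x = 0)
pi_1 <- plogis(b[1] + b[2])  # estimated P(Y = 1 | x = 1)

pi_1 - pi_0  # estimated risk difference, approximately 0.0277
pi_1 / pi_0  # estimated risk ratio, approximately 1.08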
How do we interpret \beta_{1}? In Equation 14, where we make use of the properties of logarithms, we see that it is the log of the odds ratio of success given treatment over control.
\begin{align} &\log{\frac{\hat{\pi}(1)}{1 - \hat{\pi}(1)}} - \log{\frac{\hat{\pi}(0)}{1 - \hat{\pi}(0)}} = \left[ \beta_{0} + \beta_1(1) \right] - \beta_{0} \\ &\log{\frac{\frac{\hat{\pi}(1)}{1 - \hat{\pi}(1)}}{\frac{\hat{\pi}(0)}{1 - \hat{\pi}(0)}}} = \log{\text{OR}} = \beta_{1} \end{align} \tag{14}
In Equation 15 we can also see the odds ratio of success given treatment versus control.
\text{OR} = e^{\beta_{1}} \tag{15}
By substituting the values for our illustrative example, we see the odds ratio of purchasing supplemental insurance given the presence of chronic disease versus the absence of chronic disease.
exp(0.1182)
[1] 1.125469
There is a 1.125 - 1 = 0.125 = 12.5% increase in the odds of purchasing supplemental insurance if chronic disease is present compared to when chronic disease is absent.
The results for the estimated risk difference, risk ratio, and odds ratio are the same as when we use the epi.2by2 function.
We have explored our first logistic regression model, with a single, binary explanatory variable. It allows us to compare the results of the measures of association with those that we get from using a table and the epi.2by2 function.
A logistic regression model allows us to expand on table methods by introducing multi-level nominal explanatory variables.
With a single binary explanatory variable, the intercept (coefficient estimate \hat{\beta}_{0}) was the log of the odds of success given the control, and the slope (coefficient estimate \hat{\beta}_{1}) was the log of the odds ratio of success given treatment over control with respect to the levels of the explanatory variable. Exponentiating these coefficient values resulted in the odds and the odds ratio, respectively.
Now, we consider an explanatory variable with three levels. Logistic regression models can be expanded to include explanatory variables with multiple levels. We start our exploration by considering dummy variables.
Consider an explanatory variable with three levels, such as mild, moderate, and severe. To use a multilevel variable in a model, we make use of dummy variables.
We start by changing the levels into their own variables. We illustrate the concept using the three-level example mentioned above.
Subject | Explanatory variable value | Mild | Moderate | Severe |
---|---|---|---|---|
1 | Mild | 1 | 0 | 0 |
2 | Moderate | 0 | 1 | 0 |
3 | Severe | 0 | 0 | 1 |
If a subject had a value of mild captured for the explanatory variable, a 1 is encoded for the new variable mild and a 0 for the rest. A similar concept follows for the other two levels. This makes absolute sense if 1 represents Yes and 0 represents No. An observation with mild disease will have a data entry of Yes or 1 in that column in a spreadsheet. That observation does not have moderate or severe disease and has No or 0 captured for those two variables.
One of the three levels must serve as our base case, against which our model will compare the rest. Each dummy variable is now a binary variable. Just as with binary variables, where 0 was the base case, we do the same here. The choice of base case is dictated by the research question(s). In the table above, we will use mild. We note that one of the three dummy variables is redundant: we can capture the same information with only two dummy variables.
Subject | Explanatory variable value | Moderate | Severe |
---|---|---|---|
1 | Mild | 0 | 0 |
2 | Moderate | 1 | 0 |
3 | Severe | 0 | 1 |
Any subject with the base case will simply have a 0 for moderate and for severe. In this manner, we will always have k-1 dummy variables for an explanatory variable with k levels. Each of the k-1 dummy variables is a binary categorical variable.
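R constructs such dummy variables automatically for factor variables. As a sketch of the encoding, the base R model.matrix function shows the k-1 dummy columns for a small hypothetical severity vector:

severity <- factor(
  c("Mild", "Moderate", "Severe"),
  levels = c("Mild", "Moderate", "Severe")  # Mild listed first, so it is the base case
)

# Only k - 1 = 2 dummy columns appear; the base case is absorbed into the intercept
model.matrix(~ severity)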
We are now familiar with the components of a generalized linear model (GLM). We expand the linear component to include the two dummy variables in our example, shown in Equation 16 using the logit link function, for x_{1} (being moderate in our example) and x_{2} (being severe in our example), where \pi(x_{1},x_{2}) = P(Y=1 \, \vert \, x_{1} , x_{2}).
\log{\frac{\pi(x_{1},x_{2})}{1 - \pi(x_{1},x_{2})}} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} \tag{16}
As before, we can use simple algebra to derive a function for \pi(x_{1}, x_{2}), shown in Equation 17.
\pi(x_{1},x_{2}) = \frac{e^{\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}}}{1 + e^{\beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2}}} \tag{17}
Note that we can expand Equation 16 and Equation 17 for any number of explanatory variables x_{i}.
As in the previous notebook, we start by considering x_{1} = 0 and x_{2} = 0, with our model shown in Equation 18.
\log{\frac{\pi(0,0)}{1-\pi(0,0)}} = \beta_{0} \tag{18}
The intercept, \beta_{0}, is the log of the odds of Y=1, success, for a subject with the base case, x_{1}=0 and x_{2}=0.
The probability of success given x_{1} = 0 and x_{2} = 0 is shown in Equation 19.
\pi(0,0) = \frac{{e^{\beta_{0}}}}{1 + e^{\beta_{0}}} \tag{19}
Next, we consider a subject with x_{1} = 1 and x_{2}=0 (a moderate case in our three levels example), shown in Equation 20.
\log{\frac{\pi(1,0)}{1-\pi(1,0)}} = \beta_{0} + \beta_{1}x_{1} \tag{20}
We can also express the probability, shown in Equation 21.
\pi(1,0) = \frac{e^{\beta_{0} + \beta_{1}x_{1}}}{1 + e^{\beta_{0} + \beta_{1}x_{1}}} \tag{21}
As in the previous notebook, we can consider the risk difference (RD) and the relative risk (RR), shown in Equation 22 and Equation 23 respectively, where we compare the second level of the explanatory variable to the base case.
\text{RD} = \pi(1,0) - \pi(0,0) \tag{22}
\text{RR} = \frac{\pi(1,0)}{\pi(0,0)} \tag{23}
The odds ratio (OR) is of particular interest. We show this is Equation 24.
\begin{align} &\log{\frac{\pi(1,0)}{1-\pi(1, 0)}} - \log{\frac{\pi(0,0)}{1-\pi(0,0)}} = \beta_{0} + \beta_{1} - \beta_{0} \\ &\log{\frac{\frac{\pi(1,0)}{1-\pi(1,0)}}{\frac{\pi(0,0)}{1-\pi(0,0)}}} = \beta_{1} \\ &\beta_{1} = \log{\text{OR}} \end{align} \tag{24}
We see that the coefficient \beta_{1} is the log odds ratio for success comparing the second level to the base case.
In the same manner we argue for the third level, x_{1}=0, x_{2}=1, shown in Equation 25, and so too in cases that have more than two levels.
\begin{align} &\log{\frac{\pi(0,1)}{1-\pi(0,1)}} = \beta_{0} + \beta_{2}x_{2} \\ &\pi(0,1) = \frac{e^{\beta_{0} + \beta_{2}x_{2}}}{1 + e^{\beta_{0} + \beta_{2}x_{2}}} \\ &\text{RD} = \pi(0,1) - \pi(0,0) \\ &\text{RR} = \frac{\pi(0,1)}{\pi(0,0)} \\ &\beta_{2} = \log{\text{OR}_{\text{ comparing level 3 to base case}}} \\ &\text{OR} = e^{\beta_{2}} \end{align} \tag{25}
Here \beta_{2} is the log of the odds ratio for success given the third level compared to the base case.
As in the previous notebook, we will use the same survey data from 2010 as an illustrative example.
We construct a table of observed values with the private variable as the response variable (Y=0 indicating that a subject did not purchase supplemental health insurance and Y=1 indicating that they did). The explanatory variable, x, is tri; it has three levels indicating the presence and number of chronic conditions per subject. The encoding in the data set is as shown below.
We represent the summary table of observed values in Table 4, where we create a new variable with appropriate labels for the three levels.
dfr$DiseaseLevel <- factor(
  dfr$tri,
  levels = c(0, 1, 2),
  labels = c("No chronic disease", "1-3 chronic conditions", "More than 3 chronic conditions")
)
crosstable::crosstable(
  dfr,
  DiseaseLevel,
  by = PrivateInsurance,
  total = "both",
  percent_pattern = "{n}"
) %>%
  crosstable::as_flextable(
    keep_id = FALSE
  )
label | variable | PrivateInsurance: No | PrivateInsurance: Yes | Total |
---|---|---|---|---|
DiseaseLevel | No chronic disease | 246 | 140 | 386 (12.04%) |
DiseaseLevel | 1-3 chronic conditions | 1386 | 957 | 2343 (73.08%) |
DiseaseLevel | More than 3 chronic conditions | 333 | 144 | 477 (14.88%) |
Total | | 1965 (61.29%) | 1241 (38.71%) | 3206 (100.00%) |
It remains for us to build the logistic regression model. We use the glm function, with \texttt{PrivateInsurance} as the response variable and \texttt{DiseaseLevel} as the explanatory variable. R will generate the 3-1=2 dummy variables.
model <- glm(
  PrivateInsurance ~ DiseaseLevel,
  data = dfr,
  family = binomial(link = "logit")
)
The coefficients attribute returns the values of the coefficient estimates. Note that the order that we used when we created the new factor, \texttt{DiseaseLevel}, is very important. The glm function will treat the level that is mentioned first as the base case. In our case this was 0, or No chronic disease.
summary(model)$coefficients
Estimate Std. Error z value
(Intercept) -0.5636891 0.1058674 -5.324485
DiseaseLevel1-3 chronic conditions 0.1933153 0.1139049 1.697164
DiseaseLevelMore than 3 chronic conditions -0.2746401 0.1454487 -1.888226
Pr(>|z|)
(Intercept) 1.012396e-07
DiseaseLevel1-3 chronic conditions 8.966565e-02
DiseaseLevelMore than 3 chronic conditions 5.899562e-02
The Estimate for the (Intercept) represents \hat{\beta}_{0}. We see the two levels listed below (Intercept). As expected, we see only 1-3 chronic conditions and More than 3 chronic conditions, both with the variable name DiseaseLevel as prefix, for \hat{\beta}_{1} and \hat{\beta}_{2} respectively.
We can write our model, as shown in Equation 26.
\begin{align} &\log{\frac{\hat{\pi}(x_{1}, x_{2})}{1-\hat{\pi}(x_{1}, x_{2})}} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} \\ &\log{\frac{\hat{\pi}(x_{1}, x_{2})}{1-\hat{\pi}(x_{1}, x_{2})}} = -0.5637 + 0.1933 x_{1} - 0.2746 x_{2} \end{align} \tag{26}
From this model in Equation 26, we consider what we can derive from the intercept, \hat{\beta}_{0}, and the two slopes, \hat{\beta}_{1} and \hat{\beta}_{2}, summarized in Table 5.
Coefficient | Value | Meaning |
---|---|---|
\hat{\beta}_{0} | -0.56 | log of the odds of purchase given no conditions |
\hat{\beta}_{1} | 0.19 | log odds ratio of purchase given 1-3 conditions compared to no condition |
\hat{\beta}_{2} | -0.27 | log odds ratio of purchase given more than 3 conditions compared to no condition |
We can take the antilog of each of these values by exponentiating each coefficient estimate, shown in Table 6.
Coefficient | Value | Meaning |
---|---|---|
e^{\hat{\beta}_{0}} | 0.57 | odds of purchase given no conditions |
e^{\hat{\beta}_{1}} | 1.21 | odds ratio of purchase given 1-3 conditions compared to no condition |
e^{\hat{\beta}_{2}} | 0.76 | odds ratio of purchase given more than 3 conditions compared to no condition |
We save the three coefficients and assign them to respective variables before taking the antilog to verify the results above.
beta_0 <- -0.5636891
beta_1 <- 0.1933153
beta_2 <- -0.2746401
exp(beta_0)
[1] 0.5691057
exp(beta_1)
[1] 1.213265
exp(beta_2)
[1] 0.7598455
We can also consider the risk difference and risk ratios when we calculate the estimated probabilities. We start with the estimated risk of purchase given no conditions, where the risk is \hat{\pi}(0,0) = 0.3627, calculated below.
risk_0 <- exp(beta_0) / (1 + exp(beta_0))
risk_0
[1] 0.3626943
The estimated risk of purchase in the case of 1-3 conditions is \hat{\pi}(1,0)=0.4085, calculated below.
risk_1 <- exp(beta_0 + beta_1) / (1 + exp(beta_0 + beta_1))
risk_1
[1] 0.4084507
The estimated risk of purchase in the case of more than 3 conditions is \hat{\pi}(0,1)=0.3019, calculated below.
risk_2 <- exp(beta_0 + beta_2) / (1 + exp(beta_0 + beta_2))
risk_2
[1] 0.3018868
The estimated risk difference between those with 1-3 conditions and those with no conditions is therefore 0.4085-0.3627 = 0.0458 (about 5 percentage points), and the estimated risk difference between those with more than 3 conditions and those with no conditions is 0.3019-0.3627 = -0.0608 (about -6 percentage points).
The relative risks in these two comparisons are 0.4085 \div 0.3627 \approx 1.13 (13% higher in the 1-3 conditions group) and 0.3019 \div 0.3627 \approx 0.83 (17% lower in the more than 3 conditions group), respectively.
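As a cross-check, a short sketch using the predict function on the fitted model object should reproduce these three estimated risks; type = "response" returns probabilities rather than log odds:

newdata <- data.frame(
  DiseaseLevel = factor(
    c("No chronic disease", "1-3 chronic conditions", "More than 3 chronic conditions"),
    levels = levels(dfr$DiseaseLevel)
  )
)

predict(model, newdata = newdata, type = "response")  # approximately 0.3627, 0.4085, 0.3019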
In the previous two sections, we have explored logistic regression models with either a binary or a multilevel explanatory variable for a binary response variable Y. We can also do so for a continuous numerical explanatory variable x.
In Equation 27, we see the now familiar logit link function between the random and the systematic components of our generalized linear model with a single explanatory variable x, where \pi(x) = P(Y=1 \, \vert \, x).
\log{\frac{\pi(x)}{1 - \pi(x)}} = \beta_{0} + \beta_{1} x \tag{27}
Interpreting Equation 27 when x=0 proceeds as in the case of a binary explanatory variable. We have Equation 28.
\log{\frac{\pi(0)}{1 - \pi(0)}} = \beta_{0} \tag{28}
The coefficient \beta_{0} is the intercept and represents the log of the odds of success when x=0. The probability of success when x=0 is shown in Equation 29.
\pi(0) = \frac{e^{\beta_{0}}}{1 + e^{\beta_{0}}} \tag{29}
For continuous numerical variables we do not have levels for comparison. Instead, we consider the difference between values that the continuous variable can take in reference to the units in which it is measured.
We start by considering a 1 unit increase in the continuous explanatory variable. We use an arbitrary value x = \nu and a single unit increase, \nu + 1.
In Equation 30 and Equation 31 we see the log of the odds for success for x=\nu+1 and for x = \nu, respectively.
\log{\frac{\pi(\nu + 1)}{1 - \pi(\nu + 1)}} = \beta_{0} + \beta_{1} (\nu + 1) \tag{30}
\log{\frac{\pi(\nu)}{1 - \pi(\nu)}} = \beta_{0} + \beta_{1} \nu \tag{31}
Subtraction, shown in Equation 32, yields the derivation for the log odds ratio of success for a single unit increase in the continuous numerical explanatory variable.
\begin{align} &\log{\frac{\pi(\nu + 1)}{1 - \pi(\nu + 1)}} - \log{\frac{\pi(\nu )}{1 - \pi(\nu)}} = \beta_{0} + \beta_{1} (\nu + 1) - [\beta_{0} + \beta_{1} \nu] \\ &\log{\frac{\frac{\pi(\nu + 1)}{1 - \pi(\nu + 1)}}{\frac{\pi(\nu)}{1 - \pi(\nu)}}} = \beta_{1} \end{align} \tag{32}
Now, we can consider any constant, c, increase, shown in Equation 33.
\log{\frac{\frac{\pi(\nu + c)}{1 - \pi(\nu + c)}}{\frac{\pi(\nu)}{1 - \pi(\nu)}}} = c \beta_{1} \tag{33}
Note that since we are considering a continuous numerical variable, we have no frequencies and cannot calculate risks and odds. If the study design allows, we can, though, consider the probability of success given a value for the continuous explanatory variable, using Equation 34.
\pi(x) = \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}} \tag{34}
To illustrate the concepts above, we use the same data set as before, pertaining to the purchase of supplementary health insurance.
The \texttt{chronic} variable shows the number of chronic conditions. We will consider this as a continuous numerical explanatory variable.
As we should always do, we explore the data using summary statistics and data visualization.
psych::describe(dfr$chronic)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 3206 2.06 1.42 2 1.98 1.48 0 8 8 0.65 0.24 0.03
ggstatsplot::ggbetweenstats(
  data = dfr,
  x = PrivateInsurance,
  y = chronic,
  bf.message = FALSE,
  palette = "Blues"
)
Warning in min(x): no non-missing arguments to min; returning Inf
Warning in max(x): no non-missing arguments to max; returning -Inf
We build a logistic regression model as before, using the glm function.
model <- glm(
  PrivateInsurance ~ chronic,
  data = dfr,
  family = binomial(link = "logit")
)
We view the coefficients using the coefficients attribute.
summary(model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.30972934 0.06386084 -4.850067 1.234198e-06
chronic -0.07323511 0.02590275 -2.827311 4.694078e-03
The intercept is \hat{\beta}_{0} = -0.3097 and the slope is \hat{\beta}_{1} = -0.0732. The model can now be written, as shown in Equation 35, where x is the number of chronic conditions.
\begin{align} &\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\&\log{\frac{\hat{\pi}(x)}{1 - \hat{\pi}(x)}} = -0.3097 - 0.0732 x \end{align} \tag{35}
The values are assigned to variables below.

beta_0 <- -0.3097
beta_1 <- -0.0732
The intercept \hat{\beta}_{0} = -0.3097 is the log of the odds of purchasing supplemental health insurance given no chronic conditions, x=0. The antilog is calculated below, and shows that the odds of purchase given no chronic condition is e^{\hat{\beta}_{0}} = 0.73.
exp(beta_0)
[1] 0.733667
The slope \hat{\beta}_{1} = -0.0732 is the log of the odds ratio of purchase for any single unit increase in the number of chronic conditions. The antilog is calculated below, and shows that the odds ratio of purchase for any one unit increase in the number of chronic conditions is e^{\hat{\beta}_{1}} = 0.93.

exp(beta_1)

[1] 0.9294149
The odds ratio value of 0.93 is less than 1 and we have that the odds of purchase is 1-0.93 = 0.07 = 7% lower for every increase of one chronic condition.
For an increase of two chronic conditions, we have that the odds ratio is e^{2 \hat{\beta}_{1}} = 0.864. That is a 1 - 0.864 = 0.136 = 13.6% decrease in the odds.

exp(2 * beta_1)

[1] 0.8638121
The estimated probability of purchase for any number of chronic conditions can be computed using Equation 36.
\hat{\pi}(x) = \frac{e^{-0.3097 - 0.0732 x}}{1 + e^{-0.3097 - 0.0732 x}} \tag{36}
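As a sketch, we can evaluate Equation 36 over the observed range of the \texttt{chronic} variable (0 to 8, per the summary statistics above), using plogis as the inverse logit:

x <- 0:8  # observed range of the number of chronic conditions
round(plogis(beta_0 + beta_1 * x), 4)  # estimated probability of purchase at each count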
We have explored generalized linear models, and in particular, simple logistic regression models with single binary, multilevel, and continuous numerical explanatory variables. We briefly explore the concept of uncertainty in the slope, \beta_{1} now.
Using a generalized linear modeling approach to understanding the association between two categorical variables is more powerful than using analysis from tables of observed data. We can tap into the calculations used for inference with these models.
When using generalized linear models with large sample sizes, the coefficient estimates are approximately normally distributed, with mean equal to the population parameter, \beta_{1}, and variance calculated from the standard error of \hat{\beta}_{1}, as shown in Equation 37, where we consider \hat{\beta}_{1} and its estimated standard error \sigma_{\hat{\beta}_{1}}.
\hat{\beta}_{1} \sim N(\beta_{1} , \sigma_{\hat{\beta}_{1}}^{2}) \tag{37}
We can then construct a confidence interval for a confidence level of 100(1-\alpha )%, shown in Equation 38 using the critical z score for the specific \alpha value.
\text{Wald confidence interval for } \beta_{1}: \; \hat{\beta}_{1} \pm z_{\frac{\alpha}{2}} \widehat{SE}_{\hat{\beta}_{1}} \tag{38}
We use the same data as for the other illustrative examples in this session and use the \texttt{ChronicDisease} variable as explanatory variable. It is a binary variable with 0 or No indicating the absence of chronic disease and 1 or Yes indicating the presence of one or more chronic conditions. The response variable is still \texttt{PrivateInsurance}, indicating the purchase of health insurance, 1 or Yes, or the non-purchase of insurance, 0 or No.
We create the model using the glm function, specifying the link function.
model <- glm(
  PrivateInsurance ~ ChronicDisease,
  data = dfr,
  family = binomial(link = "logit")
)
The coefficients attribute shows the table of coefficients.
summary(model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5636891 0.1058674 -5.324485 1.012396e-07
ChronicDiseaseYes 0.1181652 0.1126849 1.048634 2.943467e-01
We have that \hat{\beta}_{1} = 0.1182. We can now express our uncertainty in this value.
The confint.default function can be used to calculate and show the confidence interval values. The confidence level, 100(1-\alpha)%, is set by the level keyword argument, which we set as 0.95, indicating a 95% confidence level.
ci <- confint.default(
  model,
  level = 0.95
)

ci
2.5 % 97.5 %
(Intercept) -0.7711853 -0.3561929
ChronicDiseaseYes -0.1026932 0.3390237
We have that \hat{\beta}_{1} = 0.1182 (95% confidence interval -0.1027 to 0.3390).
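As a cross-check, below is a minimal sketch that computes the Wald limits of Equation 38 by hand from the coefficient table; the results should match those from confint.default:

coefs <- summary(model)$coefficients
est <- coefs["ChronicDiseaseYes", "Estimate"]
se <- coefs["ChronicDiseaseYes", "Std. Error"]

est + c(-1, 1) * qnorm(0.975) * se  # approximately -0.1027 and 0.3390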
While \hat{\beta}_{1} is the log odds ratio of purchase comparing those with chronic disease and those without, we can take the antilog of these values to express the odds ratio with confidence intervals.
exp(0.1181652)
[1] 1.12543
exp(ci)
2.5 % 97.5 %
(Intercept) 0.4624646 0.7003375
ChronicDiseaseYes 0.9024038 1.4035766
We finally have that the odds ratio of purchase comparing those with chronic disease and those without is 1.13 (95% confidence interval 0.90 to 1.4).
We see a 100(1.13 - 1) = 13% increase in the odds of purchase comparing those with to those without chronic conditions. Our confidence intervals, though, show that the true population odds ratio may lie anywhere from a 100(1 - 0.90) = 10% decrease in the odds to a 100(1.4 - 1) = 40% increase in the odds.

In a later session we will explore hypothesis testing and see that if this interval contains 1.0, we have a p value greater than our chosen \alpha value and we fail to reject the null hypothesis. This should be clear to see from the confidence interval, which spans a range from a decrease to an increase in the odds.
You are a member of a research team modeling the risk of death in a community exposed to an infectious agent using a cross-sectional study design. Research data pertaining to mortality and age group are available in a comma-separated file, infect_mortality.csv. The response variable \texttt{Mortality} has levels Yes (person died) and No (person survived). The explanatory variable, \texttt{AgeGroup}, has two levels; younger persons are encoded as A and older persons as B. Let A=0 and B=1.
Your task is to import the data, verify integrity, summarize, and visualize the data, express the measures of association (inclusive of the risk difference, risk ratio, and the odds ratio). Use confidence intervals (with a 95% confidence level) and hypothesis testing to interrogate the association between mortality and age group. Use both a critical value and a p value approach in your hypothesis testing. Use a 5% level of significance.
Also create a model that estimates the probability of death given age group. Provide full commentary on your model.
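Below is a possible starting sketch, assuming the import_dataset helper used earlier in this chapter and the file and variable encodings described above; adapt it as your analysis requires.

mort <- import_dataset("infect_mortality.csv")

# Encode the levels so that No and A serve as the base cases
mort$Mortality <- factor(mort$Mortality, levels = c("No", "Yes"))
mort$AgeGroup <- factor(mort$AgeGroup, levels = c("A", "B"))

model_lab <- glm(
  Mortality ~ AgeGroup,
  data = mort,
  family = binomial(link = "logit")
)

summary(model_lab)$coefficients
exp(confint.default(model_lab, level = 0.95))  # odds ratios with 95% confidence limits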
We continue with the data set from the research project in the previous lab problem, constructing models predicting mortality. The data are captured in the comma-separated file infect_mortality.csv.
The New York Heart Association (NYHA) classifies cardiac or heart failure into grades I through IV, with grade I and II considered mild failure and grades III and IV considered advanced cardiac failure. The data set contains a variable, \texttt{CardiacFailure}, with three levels, None (healthy cardiac function), Mild (NYHA grades I and II cardiac failure), and Advanced (NYHA grades III and IV cardiac failure).
Your interest is in understanding the association between cardiac failure and mortality by using a model that estimates the probability of mortality given the level of cardiac failure. Create a thorough report pertaining to this research question to present to your colleagues in the research group.
Also use table methods to confirm the odds ratios from the model comparing the different levels of heart failure to those with a healthy heart, given that None (healthy cardiac function) is the base case for this multi-level explanatory variable.
We continue with our simulated research project into mortality. The data set contains a variable named Glucose, which is a measure of fasting blood glucose. The data are captured in the comma-separated file infect_mortality.csv.
Prepare a report of results and interpretation for your colleagues using fasting blood glucose as an explanatory variable for the probability of mortality.
We continue to explore the data set investigating mortality. Calculate the 95% confidence intervals for the coefficients in the model that estimates the probability of mortality given the level of cardiac function.