This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In the previous chapter, we considered measures of association between an explanatory variable, X, and a response variable, Y, given an effect modifier or a confounder, Z, using tables of observed data.
Generalized linear models (GLMs) give us a much richer approach, considering multiple explanatory variables.
In this chapter, we explore logistic regression (LR) models with multiple explanatory variables. Aspects of logistic regression models with multiple explanatory variables also pertain to other GLMs, such as linear probability models and log-binomial models, each with its own link function.
Load the required libraries for Chapter 6. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
library(ggstatsplot)
library(lmtest)
library(epiR)
library(DescTools)
library(MASS)
LR models with multiple explanatory variables are termed multivariable logistic regression models. We will reserve the term multivariate logistic regression models for cases with multiple response variables. More about the use of these terms can be found in this paper.
Multivariable GLMs, in general, offer better predictions and estimation of adjusted associations (as with stratified analysis, but in terms of modeling), and provide a concise approach to assessing effect modification with different types of effect modifiers.
In this chapter, we consider only binary response variable logistic regression (LR) models. The explanatory variables, on the other hand, can be binary, multilevel, or continuous, and we will explore all of these together in a model.
The random component and the link function are the same as was introduced in chapter 4, when we explored simple LR models with single explanatory variables.
As before, Y, has two levels, a success that we term disease, encoded as 1, and a failure that we term not disease, encoded as 0.
The systematic component is different from our simple LR models. For p different explanatory variables, we have the systematic component in Equation 1, where x_{1}, x_{2}, \ldots , x_{p} are the p individual explanatory variables.
\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p} \tag{1}
Our interest lies with P(Y=1 \, \vert \, x_{1} , x_{2} , \ldots , x_{p}), the probability of success (disease or 1) for the specific values of each of the explanatory variables. We will use vector notation where the components of the vector, \mathbf{x}, are the values of the individual explanatory variables.
The equation for the multivariable LR model is therefore written as shown in Equation 2 for p explanatory variables, where \mathbf{x} is the vector notation for \left( x_{1} , x_{2} , \ldots , x_{p} \right), and X is the design matrix (which contains the value 1 for x_{0}).
\begin{align} \log{(\text{odds})} &= \log{ \left( \frac{P(Y=1 \vert \mathbf{x})}{1 - P(Y=1 \vert \mathbf{x})} \right) } \\ \\ &= \log{ \left( \frac{\pi(\mathbf{x})}{1 - \pi(\mathbf{x})} \right)} \\ \\ &= \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p} \\ &= X \beta \end{align} \tag{2}
Through simple algebraic manipulation, we have the probability function \pi with respect to \mathbf{x}, shown in Equation 3.
\begin{align} &P(Y=1 \vert \mathbf{x}) = \pi ({\mathbf{x}}) = \frac{e^{\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p}}}{1 + e^{\beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \ldots + \beta_{p} x_{p}}} \\ &P \left( Y=1 \, \vert \, \mathbf{x} \right) = \pi \left( \mathbf{x} \right) = \frac{e^{X \beta}}{1+e^{X \beta}} \end{align} \tag{3}
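As a quick numerical check, Equation 3 is the inverse-logit transformation, which is available in base R as the plogis function. A minimal sketch with made-up coefficient values (for illustration only, not estimates from any model).
# A hypothetical linear predictor X beta
eta <- -1.5 + 0.8 * 1 + 0.02 * 60

# Equation 3 computed directly
exp(eta) / (1 + exp(eta))

# The same value from the built-in inverse-logit function
plogis(eta)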
As with simple LR models, we interpret the parameters \beta_{i}, where i = \left\{ 0,1,2, \ldots , p \right\}.
The intercept, \beta_{0}, is the log odds of disease for x_{1} = x_{2} = \ldots = x_{p} = 0.
We can also calculate e^{\beta_{0}}, the odds of having disease for x_{1} = x_{2} = \ldots = x_{p} = 0 using Equation 2.
An arbitrary parameter \beta_{j}, where j = \left\{ 1,2, \ldots , p \right\} for each of the variable types is shown in Table 1.
x_{j} variable type | \beta_{j} | e^{\beta_{j}} |
---|---|---|
Binary: \begin{cases} 1 \text{ if exposed} \\ 0 \text{ if not exposed} \end{cases} | log odds ratio (OR) for disease comparing exposed to non-exposed, adjusting for the other predictors | OR for disease comparing exposed to non-exposed, adjusting for the other predictors |
Multilevel, where x_{j} is one of the dummy variables | log OR for disease comparing the j^{\text{th}} level to the reference level, adjusting for the other predictors | OR for disease comparing the j^{\text{th}} level to the reference level, adjusting for the other predictors |
Numerical | log OR for disease for a one-unit increase in x_{j}, adjusting for the other predictors | OR for disease for a one-unit increase in x_{j}, adjusting for the other predictors |
Note the additional phrase in the interpretation that reads adjusting for the other predictors. We will see later that this refers to keeping the other explanatory variables (called predictors in the table) constant or fixed.
The slopes, then, represent log ORs. In Equation 4, we develop an intuition for the log OR in terms of an arbitrary binary explanatory variable x_{j}, where x_{j}=1 for the exposed and x_{j} = 0 for the unexposed, irrespective of the values of the other explanatory variables x_{i}, i \ne j.
\begin{align} &x_{j}=1: \; \log{\left(\frac{\pi (x_{1} , x_{2} \ldots , x_{j}=1, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2} \ldots , x_{j}=1, \ldots , x_{p})}\right)} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (1) + \ldots + \beta_{p} x_{p} \\ \\ &x_{j}=0: \; \log{\left(\frac{\pi (x_{1} , x_{2} \ldots , x_{j}=0, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2} \ldots , x_{j}=0, \ldots , x_{p})} \right)} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (0) + \ldots + \beta_{p} x_{p} \end{align} \tag{4}
Subtracting the two equations and making use of the laws of logarithms, we have Equation 5, showing that \beta_{j} represents a log OR.
\begin{align} &\log{\left(\frac{\pi (x_{1} , x_{2} \ldots , x_{j}=1, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2} \ldots , x_{j}=1, \ldots , x_{p})}\right)} - \log{\left(\frac{\pi (x_{1} , x_{2} \ldots , x_{j}=0, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2} \ldots , x_{j}=0, \ldots , x_{p})}\right)} \\ \\ &= \log{\left(\frac{\frac{\pi (x_{1} , x_{2} \ldots , x_{j}=1, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2} \ldots , x_{j}=1, \ldots , x_{p})}}{\frac{\pi (x_{1} , x_{2} \ldots , x_{j}=0, \ldots , x_{p})}{1 - \pi (x_{1} , x_{2} \ldots , x_{j}=0, \ldots , x_{p})}}\right)} \\ \\ &= \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (1) + \ldots + \beta_{p} x_{p} - \left[ \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{j} (0) + \ldots + \beta_{p} x_{p} \right] \\ \\ &= \beta_{j} \end{align} \tag{5}
The derivation in Equation 5 applies similarly to any of the k-1 levels of a k-level variable (against one of the k levels chosen as the reference level) and to a difference of c units in a continuous variable, where we would use x_{j} = \nu + c and x_{j} = \nu.
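For a continuous x_{j}, for instance, the same subtraction over the values x_{j} = \nu + c and x_{j} = \nu shows that a c-unit difference corresponds to a log OR of c \beta_{j}, so that e^{c \beta_{j}} is the OR for a c-unit increase.
\left[ \beta_{0} + \ldots + \beta_{j} (\nu + c) + \ldots + \beta_{p} x_{p} \right] - \left[ \beta_{0} + \ldots + \beta_{j} \nu + \ldots + \beta_{p} x_{p} \right] = c \beta_{j}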
The data for our illustrative example is the survey of 3206 individuals on Medicare that we have used before. We import it below from the mus14data_selected.csv comma-separated values file, assigned to the variable dfr.
dfr <- import_dataset("mus14data_selected.csv")
The \texttt{private\_cat} variable indicates if an individual purchased private supplemental health insurance. This is our response variable, Y, with levels 1 indicating the purchase of insurance (or success or disease) and 0 indicating no purchase (or failure or not disease).
The variable \texttt{retire\_cat} is a binary explanatory variable and indicates whether an individual is retired, 1, or not retired, 0.
The variable \texttt{age} is a continuous numerical variable and indicates the age of an individual measured in years.
We transform the \texttt{private\_cat} and the \texttt{hhincome\_cat} variables into factors using the factor function. The levels keyword argument is used to list the levels in order. The first level is used as the reference level, that is, No PI for the response variable and < \$10K for the \texttt{hhincome\_cat} explanatory variable.
dfr$private_cat <- factor(
  dfr$private_cat,
  levels = c("No PI", "PI")
)

dfr$hhincome_cat <- factor(
  dfr$hhincome_cat,
  levels = c("< $10K", "[$10K, $30K)", "[$30K, $50K)", ">= $50K")
)
We want to understand the relationship between whether or not a purchase occurs and retirement status, \texttt{retire\_cat}, annual household income group, \texttt{hhincome\_cat}, and age, \texttt{age}.
With the income category being multilevel, we assign <\$10K as the reference level. Table 2 shows the explanatory variables.
Variable | Level | Values |
---|---|---|
x_{1} | retirement status | \begin{cases} 1 \text{ if retired} \\ 0 \text{ if not retired} \end{cases} |
x_{2} | [\$10\text{K},\$30\text{K}) | \begin{cases} 1 \text{ if income in } [\$10\text{K},\$30\text{K}) \\ 0 \text{ otherwise} \end{cases} |
x_{3} | [\$30\text{K},\$50\text{K}) | \begin{cases} 1 \text{ if income in } [\$30\text{K},\$50\text{K}) \\ 0 \text{ otherwise} \end{cases} |
x_{4} | \ge \$50\text{K} | \begin{cases} 1 \text{ if income } \ge \$50\text{K} \\ 0 \text{ otherwise} \end{cases} |
x_{5} | age | age in years |
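To see the dummy coding that R will generate for the multilevel \texttt{hhincome\_cat} variable, we can inspect the design matrix, X. A minimal sketch, assuming dfr has been prepared as above.
# View the first rows of the design matrix X: the intercept column,
# retire_cat, the k - 1 = 3 dummy columns for hhincome_cat, and age
head(model.matrix(~ retire_cat + hhincome_cat + age, data = dfr))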
We write the multivariable LR model in Equation 6.
\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5} \tag{6}
It is time to build our model. For this, we use the glm function, assigning the result to the variable full_model. Note the use of the + symbol in the formula, which we use to add the explanatory variables.
full_model <- glm(
  formula = private_cat ~ retire_cat + hhincome_cat + age,
  data = dfr,
  family = binomial(link = "logit")
)
We use the coefficients attribute of the full_model object to return the estimated coefficients, \hat{\beta}_{0}=-1.657, \hat{\beta}_{1}=0.222, \hat{\beta}_{2}=1.930, \hat{\beta}_{3}=2.773, \hat{\beta}_{4}=2.999, and \hat{\beta}_{5}=-0.019. We also use the round function and set the digits argument to 3 to round the results to three decimal places.
round(full_model$coefficients, digits = 3)
(Intercept) retire_cat hhincome_cat[$10K, $30K)
-1.657 0.222 1.930
hhincome_cat[$30K, $50K) hhincome_cat>= $50K age
2.773 2.999 -0.019
We write the fitted model in Equation 7.
\begin{align} &\log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} + \hat{\beta}_{3} x_{3} + \hat{\beta}_{4} x_{4} + \hat{\beta}_{5} x_{5} \\ \\ &\log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = -1.657 + 0.222 x_{1} + 1.930 x_{2} + 2.773 x_{3} + 2.999 x_{4} - 0.019 x_{5} \end{align} \tag{7}
Remember that we can use Equation 3 to calculate the probability of private insurance given values for the five explanatory variables in the vector \mathbf{x} = (x_{1},x_{2},x_{3},x_{4},x_{5}).
Next, we interpret the coefficients. Note that the values will be slightly different, depending on rounding differences. R tends to maintain many decimal places when doing calculations, which we don’t do by hand.
The intercept, \hat{\beta}_0 = -1.66. This is the log odds for purchasing private insurance, Y=1, when not retired, x_{1}=0, with an annual income of <\$10K (the reference case, x_{2}=x_{3}=x_{4}=0), and with an age of x_{5}=0.
The slope \hat{\beta}_{1}=0.22 is the adjusted log odds ratio of purchasing insurance, Y=1, comparing those that are retired, x_{1}=1, to those that are not, x_{1}=0, adjusting for annual household income and age. Note that the other two explanatory variables are mentioned by name, i.e. annual household income is not broken out by its levels. Adjusting for refers to keeping the values for these variables fixed, and only changing x_{1}.
The more meaningful result comes from the antilog, e^{\hat{\beta}_{1}}=e^{0.22} \approx 1.25, which is the corresponding adjusted OR. The odds of purchasing private insurance for those that are retired are 1.25 - 1 = 0.25 = 25% greater than for those that are not retired, adjusting for household income and age. By adjusting for the other two explanatory variables, we are stating that two people with the same income level and age have an OR = 1.25 with reference to their respective retirement status, one being retired, the other not.
The slope, \hat{\beta}_{2}=1.93, is the adjusted log odds ratio for private insurance, Y=1, comparing those with an annual household income of [\$10\text{K},\$30\text{K}) to those with < \$10K, adjusting for retirement status and age. Note that we omit the other levels of household income because of the way that we use dummy variables.
The antilog, e^{\hat{\beta}_{2}} = 6.89, is the corresponding adjusted odds ratio. The odds of private insurance for those with an annual income of [\$10\text{K},\$30\text{K}) are 6.89 - 1 = 5.89 = 589% greater than the odds for those with an income of < \$10K, adjusting for retirement status and age.
The slope, \hat{\beta}_{3}=2.77, is the adjusted log odds ratio for private insurance, Y=1, comparing those with an annual household income of [\$30\text{K},\$50\text{K}) to those with < \$10K, adjusting for retirement status and age.
The antilog, e^{\hat{\beta}_{3}} = 15.96, is the corresponding adjusted odds ratio. The odds of private insurance for those with an annual income of [\$30\text{K},\$50\text{K}) are 15.96 - 1 = 14.96 = 1496% greater than the odds for those with an income of < \$10K, adjusting for retirement status and age.
The slope, \hat{\beta}_{4}=3.00, is the adjusted log odds ratio for private insurance, Y=1, comparing those with an annual household income of \ge \$50\text{K} to those with < \$10K, adjusting for retirement status and age.
The antilog, e^{\hat{\beta}_{4}} = 20.09, is the corresponding adjusted odds ratio. The odds of private insurance for those with an annual income of \ge \$50\text{K} are 20.09 - 1 = 19.09 = 1909% greater than the odds for those with an income of < \$10K, adjusting for retirement status and age.
The slope, \hat{\beta}_{5}=-0.02 is the log odds ratio for private insurance, Y=1, for a one year increase in age, adjusting for retirement status and annual household income.
The antilog, e^{\hat{\beta}_{5}}=0.98, is the corresponding adjusted OR. The odds of private insurance decrease by 1 - 0.98 = 0.02 = 2% for a one year increase in age, adjusting for retirement status and household income (that is, two people with the same retirement status and income, differing in age by a single year).
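Rather than exponentiating each estimate by hand, we can take the antilog of the whole coefficient vector at once. A short sketch using the model fitted above.
# Adjusted ORs for all terms in the model, rounded to two decimal places
round(exp(full_model$coefficients), digits = 2)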
The logistic regression model has allowed us to understand the association between the explanatory variables and the response variable. We can also use LR models for prediction: given a person's values for the explanatory variables, we predict the probability of private insurance, Y=1.
As an example, we consider a 75-year-old person, who is retired, with a household income of \ge \$50\text{K}. We convert these values to complete our vector, \mathbf{x}.
We have that x_{1}=1 (retired), x_{2} = x_{3} = 0, with x_{4}=1 (household income of \ge \$50\text{K}), and x_{5}=75. In the code chunk below, we use the values of the coefficients and Equation 3 to calculate the probability of private insurance for this person.
exp(-1.66+0.22+3-(0.02*75)) / (1 + exp(-1.66+0.22+3-(0.02*75)))
[1] 0.5149955
We see a probability of approximately 51\% as the value of P(Y=1 \, \vert \, \mathbf{x}) for this specific person.
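The same probability can be obtained from the fitted model itself, which avoids rounding the coefficients by hand. A minimal sketch using the predict function, assuming retire_cat is stored as numerical 0 and 1 values as described above (the small difference from the hand calculation comes from the unrounded coefficients).
# New data for a retired 75-year-old with household income >= $50K
new_person <- data.frame(
  retire_cat = 1,
  hhincome_cat = factor(">= $50K", levels = levels(dfr$hhincome_cat)),
  age = 75
)

# type = "response" returns the predicted probability P(Y = 1 | x)
predict(full_model, newdata = new_person, type = "response")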
Now that we can calculate and interpret the estimated coefficients, we can explore inference with respect to these coefficients in terms of confidence intervals and hypothesis testing.
For large enough samples, we have that the sampling distribution of estimated coefficients are approximately normally distributed, written as in Equation 8, where \sigma_{\hat{\beta}_{j}} is the standard error of \hat{\beta}_{j} for j = \left\{ 0, 1, \ldots , p \right\}.
\hat{\beta}_{j} \sim \text{N} (\beta_{j} , \sigma_{\hat{\beta}_{j}}^{2}) \tag{8}
In Equation 8, we have a normal distribution with mean \beta_{j} and variance, \sigma_{\hat{\beta}_{j}}^{2}. As a result, we can construct Wald confidence intervals (CIs) for the coefficients, shown in Equation 9, where \widehat{\text{SE}}(\hat{\beta}_{j}) is the standard error of \hat{\beta}_{j}.
100 (1 - \alpha) \text{\% Wald CI for } \beta_{j} : \; \hat{\beta}_{j} \pm z_{\frac{\alpha}{2}} \widehat{\text{SE}}(\hat{\beta}_{j}) \tag{9}
The confidence coefficient, z_{\frac{\alpha}{2}}, for \alpha=0.05, is calculated in the code chunk below, using the qnorm function.
z_crit <- qnorm(0.975)
z_crit
[1] 1.959964
We note in the table of results for the illustrative example that \hat{\beta}_{1} = 0.222 (retirement), with \widehat{\text{SE}}(\hat{\beta}_{1}) = 0.084. We use Equation 9 to calculate the lower and upper bounds, as shown in Equation 10.
\hat{\beta}_{1} \pm z_{\frac{\alpha}{2}} \widehat{\text{SE}}(\hat{\beta}_{j}) = 0.222 \pm 1.96 (0.084) = (0.057, 0.387) \tag{10}
lower <- 0.222 - z_crit * 0.084
upper <- 0.222 + z_crit * 0.084

lower
[1] 0.05736303
upper
[1] 0.386637
Taking the antilogs of the two values, we have the confidence interval of the adjusted OR for x_{1} (retirement).
exp(lower)
[1] 1.05904
exp(upper)
[1] 1.472022
In the illustrative example, we saw that the adjusted OR for private insurance was 1.25, comparing retired people to non-retired people, adjusting for household income and age. We have the confidence interval for this OR as 1.06 - 1.47. The CI does not contain 1, the value under the null hypothesis, and we can conclude that the odds of purchase are indeed greater for retired people compared to non-retired people, adjusting for household income and age.
Indeed, we are 95% confident that the true adjusted OR is between 1.06 and 1.47.
The confint function takes a LR model as argument and returns the intervals for each variable in the model, with the confidence level specified by the level argument (which is 0.95 by default).
confint(full_model)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -3.14729115 -0.169970191
retire_cat 0.05682768 0.388073868
hhincome_cat[$10K, $30K) 1.50207594 2.407110503
hhincome_cat[$30K, $50K) 2.33859127 3.255684008
hhincome_cat>= $50K 2.56861885 3.478466289
age -0.04119060 0.003115709
The result shows the confidence intervals, but on the log scale. We can pass the result of the confint function as argument to the exp function to return the confidence intervals for the adjusted odds ratios themselves.
exp(confint(full_model))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 0.04296836 0.843690
retire_cat 1.05847340 1.474139
hhincome_cat[$10K, $30K) 4.49100244 11.101836
hhincome_cat[$30K, $50K) 10.36662247 25.937350
hhincome_cat>= $50K 13.04779100 32.409976
age 0.95964620 1.003121
Next up, we consider hypothesis testing.
We start by considering the omnibus test, which is a likelihood ratio test (LRT) for hypothesis testing regarding our model overall.
Hypothesis testing depends on our research question. If we want to consider the association between having private medical insurance and retirement status (for which we use x_{1} when we created the model), adjusting for household income and age, we have the null hypothesis and alternative hypothesis as shown in Equation 11.
\begin{align} &H_{0}: \; \beta_{1} = 0 \\ &H_{1}: \; \beta_{1} \ne 0 \end{align} \tag{11}
To test whether having private medical insurance is associated with annual household income (dummy variables x_{2},x_{3},x_{4}), adjusting for retirement status and age, we have the null and alternative hypothesis in Equation 12.
\begin{align} &H_{0}: \; \beta_{2} = \beta_{3} = \beta_{4} = 0 \\ &H_{1}: \; \text{any one of more of } \beta_{2} , \beta_{3} , \beta_{4} \ne 0 \end{align} \tag{12}
If we want to test whether private medical insurance is jointly associated with retirement status (x_{1}), annual household income, (dummy variables x_{2},x_{3},x_{4}), and age (x_{5}), then we have the null hypothesis and alternative hypothesis in Equation 13.
\begin{align} &H_{0}: \; \beta_{1} = \beta_{2} = \beta_{3} = \beta_{4} = \beta_{5} = 0 \\ &H_{1} : \; \text{any one or more of } \beta_{1} , \beta_{2} , \beta_{3} , \beta_{4} , \beta_{5} \ne 0 \end{align} \tag{13}
In general terms, there are two types of null hypotheses for generalized linear model (GLM) parameters: a hypothesis about a single parameter, as in Equation 11, and a hypothesis about multiple parameters simultaneously, as in Equations 12 and 13.
Furthermore, we have three types of tests: the likelihood ratio test, the Wald test, and the score test.
Note that software often drives the choice of test.
For large sample sizes, these tests typically lead to the same results. If the results do not concur, it is recommended to use the results from a likelihood ratio test.
We have encountered the LRT before. Here, we use it in the case of multiple parameters, comparing a full model to a reduced model. Before we understand what this means, we have to explore LRTs a bit more. We remember that LRTs are based on the transformation of Equation 14.
\Lambda = \frac{\text{maximum likelihood when parameters satisfy } H_{0}}{\text{maximum likelihood when parameters are unrestricted}} = \frac{L_{0}}{L_{1}} \tag{14}
There is no, or weak, evidence against H_{0} if \Lambda is approximately equal to 1, that is to say, L_{0} \approx L_{1}. There is strong evidence against H_{0} if \Lambda is much less than 1, that is to say, L_{0} is much less than L_{1}.
As before, rather than using \Lambda, we use the test statistic, G^{2}, shown in Equation 15.
G^{2} = -2 \log{\frac{L_{0}}{L_{1}}} = -2 \left( \log{L_{0}} - \log{L_{1}} \right) \tag{15}
Under the null hypothesis, G^{2} follows a \chi^{2} distribution for given degrees of freedom. In general, the degrees of freedom is the number of parameters under H_{1} minus the number of parameters under H_{0}.
Consider model A (with 3 parameters) and model B (with 2 parameters), shown below in Equation 16.
\begin{align} &\text{Model A: } \log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} \\ &\text{Model B: } \log{\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}} = \beta_{0} + \beta_{1} x_{1} \end{align} \tag{16}
We state that Model B is nested within model A or that model B is a special case of model A. Note that model B is equivalent to model A, but with \beta_{2} = 0. In this case, model A is the full model and model B is the reduced model.
We can now state precisely how we use LRTs: we test whether one or more parameters are equal to 0 by comparing a full model to a nested or reduced model. The degrees of freedom is then the number of parameters in the full model minus the number of parameters in the reduced model.
As an example, we consider our full model with private insurance status as the response, and retirement status, annual household income, and age, as explanatory variables. This full model is shown in Equation 17.
\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5} \tag{17}
We use a LRT with \alpha = 0.05 to test the null hypothesis that private insurance is independent of retirement status, annual household income, and age, with the null and alternative hypothesis shown in Equation 13.
Our reduced model consists only of the intercept, shown in Equation 18 with the null hypothesis and alternative hypothesis (the null hypothesis states that all the slope parameters are 0).
\begin{align} &\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} \\ &H_{0}: \; \beta_{1} = \beta_{2} = \beta_{3} = \beta_{4} = \beta_{5} = 0 \\ &H_{1}: \; \text{any one or more of } \beta_{1} , \ldots , \beta_{5} \ne 0 \end{align} \tag{18}
To perform the test, we need to compute the test statistic in Equation 15. We fit the full model to get \log{L_{1}} and we fit the reduced model to get \log{L_{0}}. The G^{2} test statistic is then compared to the \chi^{2} distribution under the null hypothesis. The degrees of freedom will be the number of parameters in the full model, which is 6, minus the number of parameters in the reduced model, which is 1, hence 6-1=5 degrees of freedom.
We have already created the full model above, assigned to the variable full_model. Below, we create the reduced model. Note the 1 in the formula keyword argument, indicating that we are only fitting the intercept.
reduced_model <- glm(
  formula = private_cat ~ 1,
  data = dfr,
  family = binomial(link = "logit")
)
We use the lrtest function from the lmtest library to perform the LRT. The reduced model is passed as the first argument and the full model as the second argument.
lmtest::lrtest(
  reduced_model,
  full_model
)
Likelihood ratio test
Model 1: private_cat ~ 1
Model 2: private_cat ~ retire_cat + hhincome_cat + age
#Df LogLik Df Chisq Pr(>Chisq)
1 1 -2139.8
2 6 -1934.0 5 411.48 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see \log{L_{0}} = -2139.8, \log{L_{1}} = -1934.0, 5 degrees of freedom, G^{2} = -2(\log{L_{0}} - \log{L_{1}}) = 411.48, and a p value of <0.001. We reject the null hypothesis. The data provide evidence that private health insurance status is associated with at least one of retirement status, annual household income, or age.
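As a quick check, we can compute G^{2} directly from the two log-likelihoods, using the logLik function on the two fitted models.
# G^2 = -2 (log L0 - log L1)
as.numeric(-2 * (logLik(reduced_model) - logLik(full_model)))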
We refer to this LRT as an omnibus test (reduced model with only the intercept, against the full model with all explanatory variables). If we reject the null hypothesis given the data, we can move on to the individual explanatory variables (single parameters).
As an example, we consider the logistic model with private insurance status as the outcome and retirement status, annual household income, and age as explanatory variables. We use a LRT with \alpha = 0.05 to test the null hypothesis that private insurance is independent of age, adjusting for retirement status and annual household income. The reduced model and the null and alternative hypotheses are shown in Equation 19.
\begin{align} &\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} \\ &H_{0}: \; \beta_{5} = 0 \\ &H_{1}: \; \beta_{5} \ne 0 \end{align} \tag{19}
We follow the same procedure as before. Now we have that the degrees of freedom is 6-5=1, with the 5 coming from Equation 19, where the parameters are \beta_{0},\beta_{1},\beta_{2},\beta_{3}, and \beta_{4}. We create this new reduced model below, where we omit the age variable, and assign the model to the variable reduced_model_age.
reduced_model_age <- glm(
  formula = private_cat ~ retire_cat + hhincome_cat, # Omit age
  data = dfr,
  family = binomial(link = "logit")
)
The lrtest function returns the results.
lmtest::lrtest(
  reduced_model_age,
  full_model
)
Likelihood ratio test
Model 1: private_cat ~ retire_cat + hhincome_cat
Model 2: private_cat ~ retire_cat + hhincome_cat + age
#Df LogLik Df Chisq Pr(>Chisq)
1 5 -1935.5
2 6 -1934.0 1 2.8332 0.09233 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that \log{L_{0}}=-1935.5 with 5 parameters, \log{L_{1}} = -1934.0 with 6 parameters, G^{2} = 2.8332 with 6-5=1 degree of freedom, and a p value of 0.09. We fail to reject the null hypothesis. The data do not provide evidence that private health insurance is associated with age, adjusting for retirement status and household income.
Instead of a LRT for an individual coefficient, we can use the Wald test (as reported by R). Although Wald tests can also be used for multiple parameters, we will not cover that case in this textbook.
For an arbitrary coefficient, \beta_{j} (log odds ratio), we have a null hypothesis and alternative hypothesis a shown in Equation 20.
\begin{align} &H_{0}: \; \beta_{j} = 0 \\ &H_{1}: \; \beta_{j} \ne 0 \end{align} \tag{20}
The Wald test statistic, shown in Equation 21, approximately follows the standard normal distribution under the null hypothesis.
Z = \frac{\hat{\beta}_{j}}{\widehat{\text{SE}}(\hat{\beta}_{j})} \tag{21}
As an alternative, we can also consider the test statistic, Z^{2}, which approximately follows a \chi^{2} distribution with 1 degree of freedom.
The coefficient (or estimate) table shows the standard errors, \widehat{\text{SE}}(\hat{\beta}_{j}), the Z statistics, and the p values. We use the coefficients attribute when calling the summary function with the full model as argument.
round(summary(full_model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.657 0.759 -2.183 0.029
retire_cat 0.222 0.084 2.629 0.009
hhincome_cat[$10K, $30K) 1.930 0.230 8.397 0.000
hhincome_cat[$30K, $50K) 2.773 0.233 11.905 0.000
hhincome_cat>= $50K 2.999 0.231 12.979 0.000
age -0.019 0.011 -1.680 0.093
For the results above, there are six separate Wald tests with null hypothesis H_{0}: \; \beta_{i} = 0 and alternative hypothesis H_{1}: \; \beta_{i} \ne 0, for i =\left\{ 0,1,2,3,4,5 \right\}.
In the case of age (x_{5}), for instance, we have that Z=-1.680, with a p value of 0.093, in close agreement with the LRT that we used above. The data do not provide evidence that private insurance status and age are associated, adjusting for retirement status and annual household income.
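As a check, here is a minimal sketch reproducing this Wald test for age directly from the fitted model, using the unrounded estimate and standard error.
# Extract the estimate and standard error for age from the model summary
est <- summary(full_model)$coefficients["age", ]
z <- est["Estimate"] / est["Std. Error"]
z

# Two-sided p value from the standard normal; the chi-squared form
# with Z^2 and 1 degree of freedom gives the same p value
2 * pnorm(-abs(z))
pchisq(z^2, df = 1, lower.tail = FALSE)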
While retirement status and age are single explanatory variables, x_{1} and x_{5} in this case, we have to consider how to approach a multilevel explanatory variable. We do so by starting with a LRT comparing a reduced model without the k-1 dummy variables x_{2}, x_{3}, and x_{4}, against the full model. The reduced model, the null hypothesis, and the alternative hypothesis are shown in Equation 22.
\begin{align} &\log{\left(\frac{\pi (\mathbf{x})}{1 - \pi (\mathbf{x})}\right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{5} x_{5} \\ &H_{0}: \; \beta_{2} = \beta_{3} = \beta_{4} = 0 \\ &H_{1}: \; \text{any one or more of } \beta_{2} , \beta_{3}, \beta_{4} \ne 0 \end{align} \tag{22}
We create the reduced model below and assign it to the variable reduced_model_k_level. While we keep writing x_{5} for age for the sake of clarity, note that in this reduced model age would be renumbered as x_{2}.
reduced_model_k_level <- glm(
  formula = private_cat ~ retire_cat + age,
  data = dfr,
  family = binomial(link = "logit")
)
The lrtest function returns the results.
lmtest::lrtest(
  reduced_model_k_level,
  full_model
)
Likelihood ratio test
Model 1: private_cat ~ retire_cat + age
Model 2: private_cat ~ retire_cat + hhincome_cat + age
#Df LogLik Df Chisq Pr(>Chisq)
1 3 -2129.3
2 6 -1934.0 3 390.54 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The LRT test statistic, G^{2} = 390.54, with a p value of < 0.001. We reject the null hypothesis. The data provide evidence that there is an association between private insurance status and annual household income, adjusting for retirement status and age.
Since we have evidence against the null hypothesis, we can continue with the Wald test results from the coefficient table above, which test each specific level of the k-level variable.
Remember that the values shown for \hat{\beta}_{j} are log values. Below, we use the antilog to express useful results. The coef function returns the coefficients, \hat{\beta}_{j}. The confint.default function returns the confidence intervals for a confidence level set by the level argument. We also use the cbind function to print the columns of our two outputs next to each other.
cbind(
exp(coef(full_model)),
exp(confint.default(full_model, level = 0.95))
)
2.5 % 97.5 %
(Intercept) 0.1907165 0.04307452 0.8444152
retire_cat 1.2487590 1.05819268 1.4736438
hhincome_cat[$10K, $30K) 6.8865213 4.38931595 10.8044570
hhincome_cat[$30K, $50K) 16.0011499 10.13713177 25.2573216
hhincome_cat>= $50K 20.0585488 12.75374161 31.5472424
age 0.9811983 0.95971143 1.0031663
For instance, we see that the odds of private insurance for those with an income of \ge \$50K are 20.1 times the odds for those at the reference level. The confidence interval does not contain 1.
Create your own flowchart (or similar map) with respect to the steps required for statistical inference with multivariable logistic regression models.
You are investigating maternal complications following normal vaginal delivery (NVD) using a multivariable logistic regression model. The variables are described in Table 3.
Variable Name | Description | Levels or Units |
---|---|---|
\texttt {Complications} | Complications following NVD | Yes and No |
\texttt {Duration} | Duration of labor | Hours |
\texttt {ANC} | Ante-natal clinic visit | Yes and No |
The data set is available in the comma-separated file Complications_Mat.csv, which is imported below and assigned to the variable comp.
comp <- import_dataset("Complications_Mat.csv")
Create a report of summary and comparative summary statistics and data visualization with respect to the explanatory and response variables. Generate a multivariable logistic regression model and use confidence intervals and hypothesis testing to add to the report.
You are part of a research team and tasked with creating a research report for presentation. Your team is investigating mortality in general-admission critical care units at the 5% significance level. The explanatory variables under consideration are age at admission, admission white cell count, admission hemoglobin, diabetes mellitus, and class of admission. The variables are summarized in Table 4.
Variable Name | Description | Levels or Units of Measurement |
---|---|---|
\texttt {Mortality} | Mortality (response) | Died vs. Survived |
\texttt {Age} | Age at admission | Age in years |
\texttt {WCC} | Admission white cell count | \times 10^{9} cells/L |
\texttt {HB} | Hemoglobin | g/dL |
\texttt {Diabetes} | Diabetes mellitus | None, Type I, Type II |
\texttt {Class} | Class of admission | Infectious and Non-infectious |
The data set is available in the comma-separated file ICU_Mortality.csv. The file is imported below and assigned to the variable lab.
lab <- import_dataset("ICU_Mortality.csv")
After including summary statistics and data visualization, both univariable and comparative with respect to the outcome variable (with commentary), answer the following research questions (in any order).