This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In this chapter, we consider response variables that are multilevel categorical. The variables can be either nominal or ordinal. We will explore two model types: the baseline-category logistic model (for nominal response variables) and the cumulative logistic model (for ordinal response variables).
To start our exploration, we consider the multinomial distribution.
Load the required libraries for Chapter 11. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
library(VGAM)
library(brant)
We are already familiar with the binomial distribution, described by parameters n and \pi. The binomial distribution in n trials, with a population probability of success of \pi, is a special case of the multinomial distribution.
For the multinomial distribution, we consider a J-level categorical response variable with levels i = 1, 2, 3, \ldots, J. Each of the J levels of the response variable has a probability \pi_{i} such that Equation 1 holds.
\sum_{i=1}^{J} \pi_{i} = 1 \tag{1}
For any specific experiment, we can express the frequency of the i^{\text{th}} level of a response variable Y as n_{i} such that Equation 2 holds, where n is the total number of observations in the experiment.
\sum_{i=1}^{J} n_{i} = n \tag{2}
The outcome of an experiment is then a vector of frequencies written as \left( n_{1} , n_{2} , \ldots , n_{J} \right), where as mentioned, n_{i} is the number of cases of the i^{\text{th}} level of Y. The distribution for these frequencies is written in Equation 3, where Mult refers to the multinomial distribution.
\left( n_{1} , n_{2} , \ldots , n_{J} \right) \sim \text{Mult} \left( n , \left( \pi_{1} , \pi_{2} , \ldots , \pi_{J} \right) \right) \tag{3}
The probability mass function for the multinomial distribution is written in Equation 4.
\begin{align} &P \left( n_{1} , n_{2} , \ldots , n_{J} \right) = \left( \frac{n!}{\prod_{i=1}^{J} n_{i}!} \prod_{i=1}^{J} \pi_{i}^{n_{i}} \right) \\ &P \left( n_{1} , n_{2} , \ldots , n_{J} \right) = \frac{n!}{n_{1}!\, n_{2}! \, \ldots \, n_{J}!} \left( \pi_{1}^{n_{1}} \, \pi_{2}^{n_{2}} \, \ldots \, \pi_{J}^{n_{J}} \right) \end{align} \tag{4}
As is clear from Equation 4, we have that the binomial distribution is indeed a special case with J=2. Furthermore, the marginal distribution of the frequency in each level is binomial. For an arbitrary level, j, n_{j} has a mean of n \pi_{j} and a variance of n \pi_{j} \left( 1 - \pi_{j} \right).
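To make the binomial-marginals fact concrete, here is a minimal simulation sketch (our own illustration, not from the text): with n = 10 and \pi_{2} = 0.5, the counts for level 2 should have mean n \pi_{2} = 5 and variance n \pi_{2} (1 - \pi_{2}) = 2.5.

# Simulate many multinomial draws and check the marginal moments of level 2
set.seed(1)
draws <- rmultinom(10000, size = 10, prob = c(0.25, 0.5, 0.25))
mean(draws[2, ]) # close to n * pi_2 = 10 * 0.5 = 5
var(draws[2, ])  # close to n * pi_2 * (1 - pi_2) = 2.5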
As an example, consider a nominal categorical response variable with three levels A, B, and C, with probabilities of 0.25, 0.5, and 0.25. If we randomly selected three subjects, we might have the 10 possible outcomes below. In the first row, we have that all three subjects have level A, and so on.
| \left( n_{1},n_{2},n_{3}\right) | Using the probability mass function |
|---|---|
| \left(3,0,0\right) | \left(\frac{3!}{3!0!0!}\right)0.25^{3}0.5^{0}0.25^{0} |
| \left(2,1,0\right) | \left(\frac{3!}{2!1!0!}\right)0.25^{2}0.5^{1}0.25^{0} |
| \left(2,0,1\right) | \left(\frac{3!}{2!0!1!}\right)0.25^{2}0.5^{0}0.25^{1} |
| \left(1,1,1\right) | \left(\frac{3!}{1!1!1!}\right)0.25^{1}0.5^{1}0.25^{1} |
| \left(1,2,0\right) | \left(\frac{3!}{1!2!0!}\right)0.25^{1}0.5^{2}0.25^{0} |
| \left(1,0,2\right) | \left(\frac{3!}{1!0!2!}\right)0.25^{1}0.5^{0}0.25^{2} |
| \left(0,1,2\right) | \left(\frac{3!}{0!1!2!}\right)0.25^{0}0.5^{1}0.25^{2} |
| \left(0,0,3\right) | \left(\frac{3!}{0!0!3!}\right)0.25^{0}0.5^{0}0.25^{3} |
| \left(0,2,1\right) | \left(\frac{3!}{0!2!1!}\right)0.25^{0}0.5^{2}0.25^{1} |
| \left(0,3,0\right) | \left(\frac{3!}{0!3!0!}\right)0.25^{0}0.5^{3}0.25^{0} |
We calculate each of the probabilities below using the dmultinom function.

probability <- c(0.25, 0.5, 0.25)
s <- 3

dmultinom(c(3,0,0), size = s, prob = probability)
[1] 0.015625
dmultinom(c(2,1,0), size = s, prob = probability)
[1] 0.09375
dmultinom(c(2,0,1), size = s, prob = probability)
[1] 0.046875
dmultinom(c(1,1,1), size = s, prob = probability)
[1] 0.1875
dmultinom(c(1,2,0), size = s, prob = probability)
[1] 0.1875
dmultinom(c(1,0,2), size = s, prob = probability)
[1] 0.046875
dmultinom(c(0,1,2), size = s, prob = probability)
[1] 0.09375
dmultinom(c(0,0,3), size = s, prob = probability)
[1] 0.015625
dmultinom(c(0,2,1), size = s, prob = probability)
[1] 0.1875
dmultinom(c(0,3,0), size = s, prob = probability)
[1] 0.125
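As a quick check, these ten outcomes exhaust the sample space, so their probabilities should sum to 1. A short sketch using the s and probability objects defined above:

# Enumerate all ten outcomes and total their probabilities
outcomes <- list(c(3,0,0), c(2,1,0), c(2,0,1), c(1,1,1), c(1,2,0),
                 c(1,0,2), c(0,1,2), c(0,0,3), c(0,2,1), c(0,3,0))
sum(sapply(outcomes, dmultinom, size = s, prob = probability)) # returns 1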
We can use the multinomial distribution in logistic regression, which we explore next, starting with the baseline-category logistic model.
For nominal responses, we use the baseline-category logistic model, sometimes referred to as the polytomous logistic model.
Consider the p explanatory variables x_{1} , x_{2} , \ldots , x_{p} and a response variable, Y, with J levels.
As the name of the model implies, we must select one of the levels of the response variable as the baseline level. For this explanation, we consider J to be the baseline level and construct J-1 unique binary logistic equations, shown in Equation 5.
\begin{align} &\log{\left( \frac{P \left( Y=1 \, \vert \, x_{1} , \ldots , x_{p} \right)}{P \left( Y=J \, \vert \, x_{1} , \ldots , x_{p} \right)} \right)} = \beta_{10} + \beta_{11} x_{1} + \ldots + \beta_{1p} x_{p} \\ \\ &\log{\left( \frac{P \left( Y=2 \, \vert \, x_{1} , \ldots , x_{p} \right)}{P \left( Y=J \, \vert \, x_{1} , \ldots , x_{p} \right)} \right)} = \beta_{20} + \beta_{21} x_{1} + \ldots + \beta_{2p} x_{p} \\ &\vdots \\ \\ &\log{\left( \frac{P \left( Y=J-1 \, \vert \, x_{1} , \ldots , x_{p} \right)}{P \left( Y=J \, \vert \, x_{1} , \ldots , x_{p} \right)} \right)} = \beta_{(J-1)0} + \beta_{(J-1)1} x_{1} + \ldots + \beta_{(J-1)p} x_{p} \end{align} \tag{5}
Note that we still have a random component, which now follows a multinomial distribution, a systematic component, X\beta (in matrix form), and the link function, shown in Equation 6.
\log{\left( \frac{P(Y=j)}{P(Y=J)} \right)}, \text{ for } j = 1,2, \ldots , J-1 \tag{6}
The following holds for the baseline-category logistic model, which should be clear from Equation 5.
As an example, we consider a study of children attending an emergency center with a fever due to infection. The three-level nominal response variable, Y, is the type of infection and has levels Pneu for bacterial pneumonia, UTI for a urinary tract infection, and Bact for all other infections.
If we consider a single numerical predictor, \texttt{age}, for age in years, and choose Bact as the baseline case, we can write our model as in Equation 7.
\begin{align} &\log{\left( \frac{P( Y = \text{Pneu} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})} \right)} = \beta_{10} + \beta_{11} \text{age} \\ \\ &\log{\left( \frac{P( Y = \text{UTI} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})} \right)} = \beta_{20} + \beta_{21} \text{age} \end{align} \tag{7}
We import the data set below, stored in the comma-separated values file SBI.csv. The response variable, Y, is named \texttt{sbi} for serious bacterial infection.

dfr <- import_dataset("SBI.csv")
We summarize the response variable in a table below.
xtabs(
~ sbi,
data = dfr
)
sbi
Bact NotApplicable Pneu UTI
34 1752 251 311
For the sake of this example, we omit all cases with NotApplicable using the filter function from the dplyr library.

dfr <- dfr %>%
  filter(sbi != "NotApplicable") # Using the not equal, !=, logical operator
We repeat the summary statistics of the response variable.
xtabs(
~ sbi,
data = dfr
)
sbi
Bact Pneu UTI
34 251 311
To create a model, we recast the \texttt{sbi} variable in the data set as a factor type and list Bact as the first level to choose it as the baseline level for the model.

dfr$sbi <- factor(dfr$sbi, levels = c("Bact", "Pneu", "UTI"))
We use the vglm function from the VGAM library to create our model and assign it to the variable bact_model. The family argument is set to multinomial(refLevel = 1) to indicate that we are using the multinomial distribution and that the level stated first, Bact, is the reference level.

bact_model <- VGAM::vglm(
  formula = sbi ~ age,
  data = dfr,
  family = VGAM::multinomial(refLevel = 1) # Set Bact (listed first in the factor) as the reference level
)
We must know the order of the levels of the response variable to interpret the summary.
bact_model
Call:
VGAM::vglm(formula = sbi ~ age, family = VGAM::multinomial(refLevel = 1),
data = dfr)
Coefficients:
(Intercept):1 (Intercept):2 age:1 age:2
2.2709729 3.5459520 -0.1263634 -0.8462970
Degrees of Freedom: 1192 Total; 1188 Residual
Residual deviance: 929.1852
Log-likelihood: -464.5926
This is a multinomial logit model with 3 levels
The 1 will then refer to Pneu and the 2 to UTI, in accordance with the order of the levels that we set before.
We write our model in Equation 8.
\begin{align} &\log{\left( \frac{P( Y = \text{Pneu} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})} \right)} = 2.271 - 0.1264 \times \text{age} \\ \\ &\log{\left( \frac{P( Y = \text{UTI} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})} \right)} = 3.546 - 0.8463 \times \text{age} \end{align} \tag{8}
The interpretation of the estimates is summarized below.

We can also compare Pneu to UTI by resetting the level order (or by changing the refLevel argument) and recreating the model. As an alternative, we can use the laws of logarithms as shown in Equation 9.
\begin{align} &\log{\left( \frac{P( Y = \text{Pneu} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})} \right)} - \log{\left( \frac{P( Y = \text{UTI} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})} \right)} \\ \\ = &\log{\left( \frac{\frac{P( Y = \text{Pneu} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})}}{\frac{P( Y = \text{UTI} \, \vert \, \text{age})}{P( Y = \text{Bact} \, \vert \, \text{age})}} \right)} \\ \\ = &\log{\left( \frac{P( Y = \text{Pneu} \, \vert \, \text{age})}{P( Y = \text{UTI} \, \vert \, \text{age})} \right)} \\ \\ = &2.271 - 0.1264 \times \text{age} - \left( 3.546 - 0.8463 \times \text{age} \right) \\ = &-1.275 + 0.7199 \times \text{age} \end{align} \tag{9}
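As a hedged check of Equation 9 (our own verification, not part of the original analysis), we can refit the model with UTI as the reference level; given the factor order set earlier, UTI is the third level, so refLevel = 3 makes the ':2' coefficients the Pneu vs. UTI comparison.

pneu_vs_uti <- VGAM::vglm(
  formula = sbi ~ age,
  data = dfr,
  family = VGAM::multinomial(refLevel = 3) # UTI (third factor level) as reference
)
coef(pneu_vs_uti) # the ':2' coefficients should be close to -1.275 and 0.7199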
The interpretation is as follows.
As with binary logistic models, we can perform inference for baseline-category logistic models. If we reconsider Equation 7, we can state the null hypothesis of no association between the explanatory variable \texttt{age} (age in years) and the response variable \texttt{sbi} (type of serious bacterial infection) as H_{0}: \; \beta_{11} = \beta_{21} = 0. The null hypothesis states that age has no effect on the log odds of pneumonia vs. other bacterial infection or on the log odds of urinary tract infection vs. other bacterial infection.
If we are interested in a specific association between an explanatory variable and the response variable, then we can test the null hypothesis that all of the slopes for that predictor are equal to 0. We test using a likelihood ratio test (LRT), where the reduced model is the intercept-only model, and the full model is as in Equation 7. If we reject the null hypothesis, that is, we find that at least one of the slopes is not equal to 0, then we can assess the individual slopes with either Wald tests or with confidence intervals (CIs).
Below, we start with the reduced model (intercept only) and assign it to the variable bact_model_red. Note the use of 1 in the formula argument of the vglm function.

bact_model_red <- VGAM::vglm(
  formula = sbi ~ 1,
  data = dfr,
  family = VGAM::multinomial(refLevel = 1)
)
The lrtest_vglm function performs the LRT. In this case we have that H_{0}: \; \beta_{11} = \beta_{21} = 0. We pass the full model and then the reduced model as arguments.

VGAM::lrtest_vglm(
  bact_model,    # Full model
  bact_model_red # Reduced (nested) model
)
Likelihood ratio test
Model 1: sbi ~ age
Model 2: sbi ~ 1
#Df LogLik Df Chisq Pr(>Chisq)
1 1188 -464.59
2 1190 -516.72 2 104.26 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see a G^{2} test statistic of 104.26, where G^{2} \sim \chi^{2}_{\text{df}} with df = 2. This is a very large test statistic for 2 degrees of freedom, with a p value that approaches 0, and we reject the null hypothesis at the 5% significance level. There is evidence in the data that at least one of the slopes is not equal to 0; in other words, there is an association between age and the type of infection.
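The reported p value can be recovered directly from the \chi^{2}_{2} distribution; a one-line check:

pchisq(104.26, df = 2, lower.tail = FALSE) # upper-tail area, effectively 0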
Since we rejected the null hypothesis, we can review the summary of the full model using the summaryvglm function from the VGAM library.

VGAM::summaryvglm(bact_model)
Call:
VGAM::vglm(formula = sbi ~ age, family = VGAM::multinomial(refLevel = 1),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 2.2710 0.3472 6.541 6.12e-11 ***
(Intercept):2 3.5460 0.3440 10.309 < 2e-16 ***
age:1 -0.1264 0.1315 -0.961 0.337
age:2 -0.8463 0.1421 -5.954 2.61e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: log(mu[,2]/mu[,1]), log(mu[,3]/mu[,1])
Residual deviance: 929.1852 on 1188 degrees of freedom
Log-likelihood: -464.5926 on 1188 degrees of freedom
Number of Fisher scoring iterations: 6
No Hauck-Donner effect found in any of the estimates
Reference group is level 1 of the response
For H_{0}: \; \beta_{11}=0, we see the results of the Wald test with a Z statistic value of -0.961 and a p value of 0.337. We fail to reject the null hypothesis at the 5% significance level, and we cannot conclude that there is an association between age and whether the type of serious bacterial infection is pneumonia compared to other infection.
For H_{0}: \; \beta_{21}=0, we see the results of the Wald test with a Z statistic value of -5.954 and a p value of < 0.001. We reject the null hypothesis at the 5% significance level and conclude that there is an association between age and whether the type of serious bacterial infection is a urinary tract infection compared to other infection.
We can also calculate a probability that Y=j for any of the J levels of the response variable. This is shown in Equation 10, where the parameters are all set to 0 for baseline level J.
P \left( Y = j \, \vert \, x_{1} , x_{2} , \ldots , x_{p} \right) = \frac{e^{\beta_{j0} + \beta_{j1} x_{1} + \ldots + \beta_{jp} x_{p}}}{\sum_{h=1}^{J} e^{\beta_{h0} + \beta_{h1} x_{1} + \ldots + \beta_{hp} x_{p}}} \tag{10}
The three estimated probabilities for our example problem are shown in Equation 11.
\begin{align} &P \left( Y = \text{Pneu} \, \vert \, \text{age} \right) = \frac{e^{2.2710 - 0.1264 \times \text{age}}}{e^{0}+e^{2.2710 - 0.1264 \times \text{age}} + e^{3.5460 - 0.8463 \times \text{age}}} \\ \\ &P \left( Y = \text{UTI} \, \vert \, \text{age} \right) = \frac{e^{3.5460 - 0.8463 \times \text{age}}}{e^{0}+e^{2.2710 - 0.1264 \times \text{age}} + e^{3.5460 - 0.8463 \times \text{age}}} \\ \\ &P \left( Y = \text{Bact} \, \vert \, \text{age} \right) = \frac{e^{0}}{e^{0}+e^{2.2710 - 0.1264 \times \text{age}} + e^{3.5460 - 0.8463 \times \text{age}}} \end{align} \tag{11}
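Rather than computing Equation 11 by hand, we can also ask the fitted object for these probabilities; a minimal sketch (the age value of 2 is illustrative):

VGAM::predictvglm(
  bact_model,
  newdata = data.frame(age = 2),
  type = "response" # returns P(Y = Bact), P(Y = Pneu), and P(Y = UTI)
)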
Since we have a numerical explanatory variable, we can plot all three estimated probabilities as line plots.
age <- seq(0, 5, length = 200)
denominator <- 1 + exp(2.2710 - 0.1264*age) + exp(3.5460 - 0.8463*age)
pneu <- exp(2.2710 - 0.1264*age) / denominator
uti <- exp(3.5460 - 0.8463*age) / denominator
bact <- 1 / denominator
plot(
  age,
  pneu,
  type = "l",
  col = "deepskyblue",
  main = "Model of serious bacterial infection",
  xlab = "Age [years]",
  ylab = "Probability",
  las = 1,
  lty = 1,
  ylim = c(0, 1),
  lwd = 2,
  panel.first = grid()
)

lines(
  age,
  uti,
  col = "orange",
  lty = 2
)

lines(
  age,
  bact,
  col = "black",
  lty = 3,
  lwd = 2
)

legend(
  0.1,
  1,
  legend = c("Pneumonia", "UTI", "Other"),
  col = c("deepskyblue", "orange", "black"),
  lty = c(1, 2, 3)
)
For any age, the probabilities add to 1.
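We can confirm this numerically with the vectors computed above:

# The three probability curves should sum to 1 at every age value
all.equal(pneu + uti + bact, rep(1, length(age))) # TRUE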
Next, we explore modeling for ordinal response variables.
We are still considering response variables with J levels, but now the levels are ordered. The benefit of models for ordinal response variables is that they are simpler and give us greater power to detect effects.
For models with an ordinal response variable, we consider the cumulative probability of Y, that is, the probability that Y falls at or below a particular level. For response level j, we write the cumulative probability as in Equation 12, for \pi_{1} , \pi_{2} , \ldots , \pi_{j}, where, since the events are disjoint (joined by a logical or), we can add the probabilities.
\begin{align} &P \left( Y \le j \right) = \sum_{i=1}^{j} \pi_{i} \\ &P(Y=1) + P(Y=2) + \ldots + P(Y=j) = \pi_{1} + \pi_{2} + \ldots + \pi_{j} \end{align} \tag{12}
The cumulative probabilities can be expressed as cumulative odds, shown in Equation 13.
\frac{P ( Y \le j)}{1 - P (Y \le j)} = \frac{P ( Y \le j )}{ P ( Y > j)} \tag{13}
The cumulative log odds (also known as cumulative logits) are shown in Equation 14.
\log{ \left( \frac{P ( Y \le j )}{ P ( Y > j)} \right)} = \log{\left( \frac{\sum_{i=1}^{j} \pi_{i}}{\sum_{i=j+1}^{J} \pi_{i}} \right)} \tag{14}
For a J-level ordinal response variable, we will have J-1 log odds functions.
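As a small numerical sketch (the probabilities here are assumed for illustration, not taken from the text), the two cumulative logits for a three-level ordinal variable:

probs <- c(0.2, 0.5, 0.3)   # assumed level probabilities
cum <- cumsum(probs)[1:2]   # P(Y <= 1) and P(Y <= 2)
log(cum / (1 - cum))        # the J - 1 = 2 cumulative logits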
As an example, we can consider a five-point Likert scale, commonly used for survey questions. The levels of Y are listed below.

We can therefore look at the following.
Notice that it is not interesting to look at the log odds of P (Y \le 5) in this case, as the cumulative probability is 1.
As an illustrative example, we consider a study comparing a new drug for arthritis against a placebo. Participants are asked to rate the improvement in their symptoms after one month of treatment. The levels of the explanatory variable, \texttt{trt}, are Active for the active drug and Placebo for the placebo. The levels of the response variable, \texttt{improvement}, are alot for a lot of improvement, some for some improvement, and none for no improvement. The arth_data.csv file contains the data and is imported below, assigned to the variable arth.

arth <- import_dataset("arth_data.csv")
The summary statistics are shown below.
xtabs(
  ~ trt + improvement,
  data = arth
)[2:1, ]
improvement
trt 1 2 3
Placebo 7 7 29
Active 21 7 13
We encode Y as shown in Equation 15.
Y = \begin{cases} 1 \text{ if a lot} \\ 2 \text{ if some} \\ 3 \text{ if none} \end{cases} \tag{15}
The three probabilities are P \left( Y = i \, \vert \, x_{1} \right) = \pi_{i}, for i = 1, 2, 3. The cumulative logits are shown in Equation 16.
\begin{align} &\text{logit} \left[ P \left( Y \le 1 \, \vert \, x_{1}\right) \right] = \log{\left( \frac{P \left( Y \le 1 \, \vert \, x_{1} \right)}{P \left( Y > 1 \, \vert \, x_{1} \right)} \right)} = \log{\left( \frac{\pi_{1}}{\pi_{2} + \pi_{3}} \right)} \\ &\text{(the log odds for improvement being a lot vs. some or none)} \\ \\ &\text{logit} \left[ P \left( Y \le 2 \, \vert \, x_{1}\right) \right] = \log{\left( \frac{P \left( Y \le 2 \, \vert \, x_{1} \right)}{P \left( Y > 2 \, \vert \, x_{1} \right)} \right)} = \log{\left( \frac{\pi_{1} + \pi_{2}}{\pi_{3}} \right)} \\ &\text{(the log odds for improvement being a lot or some vs. none)} \end{align} \tag{16}
One generalized model that we can now use for this illustrative example, is the cumulative logistic model. If we have p explanatory variables, we can write the cumulative logistic model as in Equation 17.
\begin{align} &\log\left(\frac{P(Y \le 1 |x_1, \ldots, x_p)}{P(Y > 1 |x_1, \ldots, x_p)}\right) = \beta_{10} + \beta_{11}x_1 + \ldots + \beta_{1p}x_p \\ \\ &\log\left(\frac{P(Y \le 2 |x_1, \ldots, x_p)}{P(Y > 2 |x_1, \ldots, x_p)}\right) = \beta_{20} + \beta_{21}x_1 + \ldots + \beta_{2p}x_p \\ &\vdots \\ &\log\left(\frac{P(Y \le J-1 |x_1, \ldots, x_p)}{P(Y > J - 1 |x_1, \ldots, x_p)}\right) = \beta_{(J-1)0} + \beta_{(J-1)1}x_1 + \ldots + \beta_{(J-1)p}x_{p} \end{align} \tag{17}
Notice that each equation depends on all of the response levels from 1 through J. This contrasts with the baseline-category logistic model for nominal responses, where each equation only depended on two response variable levels (one always being the baseline or reference level).
For equation j in Equation 17, each slope is the log odds ratio for a response in levels 1 or 2 or 3 or \ldots or j vs. a response in levels j+1 or j+2 or \ldots or J.
A greatly simplified model is the cumulative logistic model with proportional odds assumption, usually referred to as proportional odds models.
We make the assumption that the slopes for a specific predictor are the same across the J-1 logistic equations, shown in Equation 18.
\begin{align} &\log\left(\frac{P(Y \le 1 \, \vert \, x_1, \ldots, x_p)}{P(Y > 1 \, \vert \, x_1, \ldots, x_p)}\right) = \beta_{10} + \beta_{1}x_1 + \ldots + \beta_{p}x_p \\ \\ &\log\left(\frac{P(Y \le 2 \, \vert \, x_1, \ldots, x_p)}{P(Y > 2 \, \vert \, x_1, \ldots, x_p)}\right) = \beta_{20} + \beta_{1}x_1 + \ldots + \beta_{p}x_p \\ &\vdots \\ &\log\left(\frac{P(Y \le J-1 \, \vert \, x_1, \ldots, x_p)}{P(Y > J - 1 \, \vert \, x_1, \ldots, x_p)}\right) = \beta_{(J-1)0} + \beta_{1}x_1 + \ldots + \beta_{p}x_p \\ \end{align} \tag{18}
The number of parameters is reduced from (J-1)(p+1) to p+J-1. For our illustrative example, with J = 3 levels and p = 1 predictor, this is a reduction from 4 parameters to 3; the savings grow with J and p.
For our illustrative example, examining a new drug for arthritis symptoms, we have the two logit equations, shown in Equation 19.
\begin{align} &\log{\left( \frac{P \left( Y \le 1 \, \vert \, x_{1} \right)}{P \left( Y > 1 \, \vert \, x_{1} \right)} \right)} = \beta_{10} + \beta_{1} x_{1} \\ \\ &\log{\left( \frac{P \left( Y \le 2 \, \vert \, x_{1} \right)}{P \left( Y > 2 \, \vert \, x_{1} \right)} \right)} = \beta_{20} + \beta_{1} x_{1} \end{align} \tag{19}
In Equation 20 we interpret the slope of the first equation, and in Equation 21, the second equation. In both Equation 20 and Equation 21 the verbose explanation of \beta_{1} is key.
\begin{align}&\log{\left( \frac{P \left( Y \le 1 \, \vert \, x_{1} = 1 \right)}{P \left( Y > 1 \, \vert \, x_{1} = 1 \right)} \right)} - \log{\left( \frac{P \left( Y \le 1 \, \vert \, x_{1} = 0 \right)}{P \left( Y > 1 \, \vert \, x_{1} = 0 \right)} \right)} = \beta_{10} + \beta_{1} - \beta_{10} \\ \\ &\log{\left( \frac{\frac{P \left( Y \le 1 \, \vert \, x_{1} = 1 \right)}{P \left( Y > 1 \, \vert \, x_{1} = 1 \right)}}{\frac{P \left( Y \le 1 \, \vert \, x_{1} = 0 \right)}{P \left( Y > 1 \, \vert \, x_{1} = 0 \right)}} \right)} = \beta_{1} = \log{\left( \frac{\text{odds } Y \le 1 \, \vert \, \text{Active}}{\text{odds } Y \le 1 \, \vert \, \text{Placebo}} \right)} \end{align} \tag{20}
\begin{align}&\log{\left( \frac{P \left( Y \le 2 \, \vert \, x_{1} = 1 \right)}{P \left( Y > 2 \, \vert \, x_{1} = 1 \right)} \right)} - \log{\left( \frac{P \left( Y \le 2 \, \vert \, x_{1} = 0 \right)}{P \left( Y > 2 \, \vert \, x_{1} = 0 \right)} \right)} = \beta_{20} + \beta_{1} - \beta_{20} \\ \\ &\log{\left( \frac{\frac{P \left( Y \le 2 \, \vert \, x_{1} = 1 \right)}{P \left( Y > 2 \, \vert \, x_{1} = 1 \right)}}{\frac{P \left( Y \le 2 \, \vert \, x_{1} = 0 \right)}{P \left( Y > 2 \, \vert \, x_{1} = 0 \right)}} \right)} = \beta_{1} = \log{\left( \frac{\text{odds } Y \le 2 \, \vert \, \text{Active}}{\text{odds } Y \le 2 \, \vert \, \text{Placebo}} \right)} \end{align} \tag{21}
How do we reconcile the fact that we have two log odds ratio interpretations for \beta_{1}? We can express Equation 20 and Equation 21 as listed below.
The proportional odds assumption means that we assume the same effect of treatment no matter how we form groups with response variable levels, as long as we group low levels together vs. high levels together (that is, as long as we respect the order). We therefore have a simplified interpretation of \beta_{1} as the log odds ratio for improvement being more vs. less, comparing those taking the active treatment to those taking the placebo. In general, any slope from a proportional odds model is the log odds ratio for lower levels of Y vs. higher levels of Y comparing … (the rest of the sentence depending on the type of explanatory variable).
Below, we use the data from our illustrative example to build a proportional odds model. We use the mutate function from the dplyr library to add new variables and set the factor levels and order. See the code comments for an explanation of the data transformation.

arth <-
  arth %>%
  mutate(improvement = factor(improvement), # Convert to a factor
         improvement_fac = case_when(improvement == 1 ~ "alot",
                                     improvement == 2 ~ "some",
                                     improvement == 3 ~ "none"), # Add a column encoding words for the values 1, 2, 3
         improvement_fac = factor(improvement_fac, levels = c("alot", "some", "none")), # Convert to a factor
         improvement_ord = ordered(improvement_fac, levels = c("alot", "some", "none")), # Add an ordered factor
         trt = factor(trt, levels = c("Placebo", "Active"))) # Convert to a factor
We create a proportional odds model below, assigned to the variable po_model. We set the family argument to cumulative, with parallel = TRUE indicating that we want to use the proportional odds assumption.

po_model <- VGAM::vglm(
  formula = improvement_ord ~ trt,
  data = arth,
  family = VGAM::cumulative(parallel = TRUE) # Use PO assumption
)

VGAM::summaryvglm(po_model)
Call:
VGAM::vglm(formula = improvement_ord ~ trt, family = VGAM::cumulative(parallel = TRUE),
data = arth)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -1.5501 0.3579 -4.331 1.48e-05 ***
(Intercept):2 -0.7501 0.3233 -2.320 0.020349 *
trtActive 1.5679 0.4442 3.529 0.000416 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])
Residual deviance: 156.606 on 165 degrees of freedom
Log-likelihood: -78.303 on 165 degrees of freedom
Number of Fisher scoring iterations: 4
No Hauck-Donner effect found in any of the estimates
Exponentiated coefficients:
trtActive
4.796405
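The exponentiated coefficient above, about 4.80, is the common odds ratio for trtActive. A hedged sketch of the corresponding 95% Wald confidence interval, built from the estimate and standard error printed in the summary:

est <- 1.5679 # slope estimate from the summary above
se <- 0.4442  # its standard error
exp(est + c(-1, 1) * qnorm(0.975) * se) # 95% CI for the odds ratio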
The two equations for the model are written in Equation 22.
\begin{align} &\log{\left( \frac{\hat{P} \left( Y \le 1 \, \vert \, x_{1} \right)}{\hat{P} \left( Y > 1 \, \vert \, x_{1} \right)} \right)} = -1.5501 + 1.5679 x_{1} \\ \\ &\log{\left( \frac{\hat{P} \left( Y \le 2 \, \vert \, x_{1} \right)}{\hat{P} \left( Y > 2 \, \vert \, x_{1} \right)} \right)} = -0.7501 + 1.5679 x_{1} \end{align} \tag{22}
Our interpretation of the model is listed below.
Having interpreted the model, we can conduct hypothesis testing. The null hypothesis is H_{0} : \; \beta_{1} = 0. Without the proportional odds assumption, we would need to test multiple slopes (each slope corresponding to the treatment variable in each of the (J-1) equations). The proportional odds assumption gives us more power to detect an association if one exists. However, we should always check if the proportional odds assumption is reasonable!
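A minimal sketch of the LRT for H_{0} : \; \beta_{1} = 0, comparing the fitted proportional odds model to an intercept-only reduced model (the object name po_model_red is our own):

po_model_red <- VGAM::vglm(
  formula = improvement_ord ~ 1,
  data = arth,
  family = VGAM::cumulative(parallel = TRUE)
)
VGAM::lrtest_vglm(po_model, po_model_red) # Full model, then reduced model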
One method for evaluating the assumption is Brant's test. The test compares the proportional odds model to one that fits (J - 1) separate binary logistic models, allowing for different slopes in each model. The null hypothesis is that the proportional odds assumption holds.
It has been suggested that Brant’s test is too sensitive, especially in the case of large sample sizes, which are commonplace today. Other suggestions include the following.
If the proportional odds assumption appears to be badly violated, then we could consider the following.
We conduct Brant's test below, using the polr function from the MASS library to create the model and then the brant function from the brant library.

polrm <- MASS::polr(improvement_ord ~ trt, data = arth, Hess = TRUE)

brant(polrm)
--------------------------------------------
Test for X2 df probability
--------------------------------------------
Omnibus 0.21 1 0.64
trtActive 0.21 1 0.64
--------------------------------------------
H0: Parallel Regression Assumption holds
Using Brant's test, we fail to reject the null hypothesis that the proportional odds assumption holds; there is no evidence of a violation in this case.
Below, we also fit a model without the proportional odds assumption.
po_model_no_assumption <- VGAM::vglm(
  formula = improvement_ord ~ trt,
  data = arth,
  family = VGAM::cumulative(parallel = FALSE) # No proportional odds assumption
)

VGAM::summaryvglm(po_model_no_assumption)
Call:
VGAM::vglm(formula = improvement_ord ~ trt, family = VGAM::cumulative(parallel = FALSE),
data = arth)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -1.6376 0.4131 -3.964 7.36e-05 ***
(Intercept):2 -0.7282 0.3254 -2.238 0.02524 *
trtActive:1 1.6864 0.5179 3.256 0.00113 **
trtActive:2 1.4955 0.4675 3.199 0.00138 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2])
Residual deviance: 156.3861 on 164 degrees of freedom
Log-likelihood: -78.1931 on 164 degrees of freedom
Number of Fisher scoring iterations: 4
No Hauck-Donner effect found in any of the estimates
Exponentiated coefficients:
trtActive:1 trtActive:2
5.400000 4.461538
For the sake of completeness, we now note two different values for \hat{\beta}_{1}.
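Since the two models are nested, an LRT comparing them is another hedged check of the proportional odds assumption (a significant result would favor the model with separate slopes):

VGAM::lrtest_vglm(
  po_model_no_assumption, # Full model (separate slopes)
  po_model                # Reduced model (common slope)
)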
We can generate probabilities from the proportional odds model. For explanatory variables x_{i}, i = 1, 2, \ldots, p, we have Equation 23.
P \left(Y \le j \, \vert \, x_{i} \right) = \frac{e^{\beta_{j0} + \beta_{1}x_{1} + \ldots + \beta_{p} x_{p}}}{1 + e^{\beta_{j0} + \beta_{1}x_{1} + \ldots + \beta_{p} x_{p}}}, \text{ where } j = 1,2, \ldots , J-1 \tag{23}
Note that for the highest level, J, we have that P \left( Y \le J \, \vert \, x_{i} \right) = 1. Probabilities for individual levels make use of the fact shown in Equation 24, where we also show the equation for j=2 for our illustrative example.

\begin{align} &P \left( Y = j \, \vert \, x_{i} \right) = P \left( Y \le j \, \vert \, x_{i} \right) - P \left( Y \le j-1 \, \vert \, x_{i} \right) \\ &P \left( Y = 2 \, \vert \, x_{i} \right) = P \left( Y \le 2 \, \vert \, x_{i} \right) - P \left( Y \le 1 \, \vert \, x_{i} \right) \end{align} \tag{24}
In Equation 25, we use the fitted proportional odds model for our arthritis example to find the three predicted probabilities for someone on active treatment having a lot of improvement, j=1, some improvement, j=2, and no improvement, j=3.
\begin{align} &\hat{P} \left( Y = 1 \, \vert \, x_{1} = 1 \right) = \hat{P} \left( Y \le 1 \, \vert \, x_{1} = 1 \right) \\ &\hat{P} \left( Y = 1 \, \vert \, x_{1} = 1 \right) = \frac{e^{-1.5501 + 1.5679 (1)}}{1+ e^{-1.5501 + 1.5679 (1)}} \\ &\hat{P} \left( Y = 1 \, \vert \, x_{1} = 1 \right) \approx 0.50 \\ \\ &\hat{P} \left( Y = 2 \, \vert \, x_{1} = 1 \right) = \hat{P} \left( Y \le 2 \, \vert \, x_{1} = 1 \right) - \hat{P} \left( Y \le 1 \, \vert \, x_{1} = 1 \right) \\ &\hat{P} \left( Y = 2 \, \vert \, x_{1} = 1 \right) = \frac{e^{-0.7501 + 1.5679 (1)}}{1+ e^{-0.7501 + 1.5679 (1)}} - 0.50 \\ &\hat{P} \left( Y = 2 \, \vert \, x_{1} = 1 \right) \approx 0.69 - 0.50 = 0.19 \\ \\ &\hat{P} \left( Y = 3 \, \vert \, x_{1} = 1 \right) = 1 - \hat{P} \left( Y \le 2 \, \vert \, x_{1} = 1 \right) \\ &\hat{P} \left( Y = 3 \, \vert \, x_{1} = 1 \right) \approx 1-0.69 = 0.31 \end{align} \tag{25}
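The same three probabilities are available from the fitted object; a short sketch (trt is passed as a factor with the levels set earlier):

VGAM::predictvglm(
  po_model,
  newdata = data.frame(trt = factor("Active", levels = c("Placebo", "Active"))),
  type = "response" # P(Y = alot), P(Y = some), P(Y = none)
)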
In this chapter, we used a baseline-category logistic model in the case of a multilevel nominal response variable. We used a likelihood ratio test, comparing a reduced model (omitting the variable of interest) to the full model. If a significant result is found (at least one of the slopes in the J-1 equations is not equal to 0), we assess the individual slopes using Wald tests.

We also considered a cumulative logistic model with the proportional odds assumption for ordinal responses. The proportional odds assumption is often used for its simplicity when interpreting the results and its increased power to detect effects.
Consider a nominal response variable Y with levels 1, 2, and 3 and a single explanatory variable x. If we let level 3 of Y be the baseline category, then the corresponding baseline-category logistic model with Y as the response and x as the explanatory variable is as shown in Equation 26 below.
\begin{align} &\log{\left( \frac{P \left( Y = 1 \, \vert \, x \right)}{P \left( Y = 3 \, \vert \, x \right)} \right)} = \beta_{10} + \beta_{11} x \\ &\log{\left( \frac{P \left( Y = 2 \, \vert \, x \right)}{P \left( Y = 3 \, \vert \, x \right)} \right)} = \beta_{20} + \beta_{21} x \end{align} \tag{26}
The probability P \left( Y = j \, \vert \, x \right) = \pi_{j} is given as in Equation 10 above. The coefficients for the baseline level are all set to 0. Show that this is so.
A subset of data taken from a study to assess factors associated with women's attitudes, knowledge, and behavior towards mammograms is given in the file mammography_data.csv. At present, we are interested in investigating the association between whether a woman has had a mammography experience, \texttt{ME}, history of breast cancer in the family, \texttt{HIST}, and perceived benefit of mammography, \texttt{PB}. The values and levels are described in Table 1.
| Variable name | Description | Levels or Units of Measurement |
|---|---|---|
| \texttt{ME} | Mammography experience | 0 = never, 1 = within one year, 2 = over one year ago |
| \texttt{HIST} | Family history | 0 = No, 1 = Yes |
| \texttt{PB} | Perceived benefit | A value in the range of 5 – 20 \, ^{*} |
Answer the following questions.

1. Consider only the association between mammography experience and family history. Use R to construct a 2 \times 3 table that cross classifies these two variables. Test the null hypothesis of no association between mammography experience and family history using a 5% significance level.
2. Use \texttt{ME} = 0 as the baseline category to compute the sample odds ratio (OR) for having a mammogram within one year vs. never having one, comparing those women with a family history to those with no family history.
3. Use \texttt{ME} = 0 as the baseline category to compute the sample odds ratio (OR) for having a mammogram over a year ago vs. never having one, comparing those women with a family history to those with no family history.
4. Using \texttt{ME} = 0 as the baseline category, fit a baseline-category logistic model in R with \texttt{ME} as the response and \texttt{HIST} as the explanatory variable. Print the estimated coefficient table and write out the fitted model.
5. Take the antilogs of the slopes from the model that you fit. Interpret these values. How do they compare with the odds ratios computed in QUESTION 2 and QUESTION 3?
6. Use the logistic model that you fit in QUESTION 4 to test the same null hypothesis that you tested in QUESTION 1. Use a 5% significance level.
7. Add perceived benefit of mammography, \texttt{PB}, to the model. Adjusting for family history, is perceived benefit of mammography associated with mammography experience? Conduct an appropriate test at the 5% significance level. If it is associated, describe the association.
8. Based on the model that you fit in QUESTION 7, what is the predicted probability of having a mammogram within one year for a woman who has a family history of breast cancer and has a perceived benefit of mammography score of 10?
A study was conducted to assess the association between certain pollutants/behaviors and respiratory function. The response measured was chronic respiratory disease status, \texttt{level}. The explanatory variables of interest are air pollution, job exposure to pollution, and smoking status. The data are stored in the file resp_data.csv, with a summary of the variables shown in Table 2.
| Variable Name | Description | Levels |
|---|---|---|
| \texttt{level} | Chronic respiratory disease status | 1 = no symptoms, 2 = cough for 3 months or less per year, 3 = cough for more than 3 months per year, 4 = cough for more than 3 months per year and shortness of breath |
| \texttt{air} | Air pollution | low for low pollution or high for high pollution |
| \texttt{exposure} | Job exposure to pollution | no for no or yes for yes |
| \texttt{smoking} | Smoking | non for never smoked, ex for history of smoking, cur for current smoker |
<- import_dataset("resp_data.csv") poll
# Recode
<-
poll %>%
poll mutate(level_rev = case_when(level == 1 ~ 4, # Original no symptoms no last
== 2 ~ 3,
level == 3 ~ 2,
level == 4 ~ 1), # Original worse symptoms now first
level level_rev = factor(level_rev, ordered = TRUE, levels = c(1,2,3,4)), # Worse symptoms first
air = factor(air, levels = c("low", "high")),
exposure = factor(exposure, levels = c("no", "yes")),
smoking = factor(smoking, levels = c("non", "ex", "cur")))
Answer the following questions.

1. Suppose that we want to fit a proportional odds model with respiratory response level as the ordinal outcome. We are interested in comparing those with a worse respiratory response to those with a better response. In this case, we can recode the data so that the assigned values for the level variable are reversed. Suppose that we do this and call the recoded variable level_rev, so that level_rev = 4 for no symptoms, 3 = cough for 3 months or less per year, 2 = cough for more than 3 months per year, and 1 = cough for more than 3 months per year and shortness of breath. Define the probabilities of interest and write down the three cumulative logits that we would be modeling.

The data were originally encoded with 1 = no symptoms, 2 = cough for 3 months or less per year, 3 = cough for more than 3 months per year, and 4 = cough for more than 3 months per year and shortness of breath. If we were to pass this order, 1, 2, 3, and 4, the semantic meaning would place a lower degree of symptoms before a higher degree of symptoms. We re-encode the values in reverse order, so that when we pass the order (using the levels argument of the factor function), we have the highest degree of symptoms (worst symptoms) first, as needed.
lab_3_3_model <- MASS::polr(
  level_rev ~ air + exposure + smoking,
  data = poll
)

unique(poll)
air exposure smoking level level_rev
1 low no non 1 4
159 low no non 2 3
168 low no ex 1 4
335 low no ex 2 3
354 low no cur 1 4
661 low no cur 2 3
763 low yes non 1 4
789 low yes non 2 3
794 low yes ex 1 4
832 low yes ex 2 3
844 low yes cur 1 4
938 low yes cur 2 3
986 high no non 1 4
1080 high no non 2 3
1087 high no ex 1 4
1154 high no ex 2 3
1162 high no cur 1 4
1346 high no cur 2 3
1411 high yes non 1 4
1443 high yes non 2 3
1446 high yes ex 1 4
1485 high yes ex 2 3
1496 high yes cur 1 4
1573 high yes cur 2 3
1621 low no non 3 2
1626 low no ex 3 2
1631 low no ex 4 1
1634 low no cur 3 2
1717 low no cur 4 1
1785 low yes non 3 2
1790 low yes non 4 1
1791 low yes ex 3 2
1795 low yes ex 4 1
1799 low yes cur 3 2
1845 low yes cur 4 1
1905 high no non 3 2
1910 high no non 4 1
1911 high no ex 3 2
1915 high no ex 4 1
1918 high no cur 3 2
1951 high no cur 4 1
1987 high yes non 3 2
1993 high yes non 4 1
1994 high yes ex 3 2
1998 high yes ex 4 1
2000 high yes cur 3 2
2039 high yes cur 4 1
brant::brant(lab_3_3_model)
--------------------------------------------
Test for X2 df probability
--------------------------------------------
Omnibus 10.66 8 0.22
airhigh 1.39 2 0.5
exposureyes 0.04 2 0.98
smokingex 7.22 2 0.03
smokingcur 5.53 2 0.06
--------------------------------------------
H0: Parallel Regression Assumption holds
Warning in brant::brant(lab_3_3_model): 1 combinations in table(dv,ivs) do not
occur. Because of that, the test results might be invalid.
# Fit the cumulative logistic model without the proportional odds assumption
lab_3_3_cum_model <- VGAM::vglm(
  level_rev ~ air + exposure + smoking,
  family = cumulative(parallel = FALSE),
  data = poll
)

# Summary
VGAM::summaryvglm(lab_3_3_cum_model)
Call:
VGAM::vglm(formula = level_rev ~ air + exposure + smoking, family = cumulative(parallel = FALSE),
data = poll)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 -5.07486 0.59333 -8.553 < 2e-16 ***
(Intercept):2 -2.84239 0.22043 -12.895 < 2e-16 ***
(Intercept):3 -2.08562 0.16441 -12.685 < 2e-16 ***
airhigh:1 -0.01699 0.14598 -0.116 0.9073
airhigh:2 -0.11130 0.11312 -0.984 0.3252
airhigh:3 -0.01645 0.09946 -0.165 0.8687
exposureyes:1 0.91005 0.14403 6.318 2.64e-10 ***
exposureyes:2 0.87061 0.11256 7.734 1.04e-14 ***
exposureyes:3 0.84431 0.10292 8.204 2.33e-16 ***
smokingex:1 1.26099 0.65832 1.915 0.0554 .
smokingex:2 0.03130 0.28822 0.109 0.9135
smokingex:3 0.42984 0.20193 2.129 0.0333 *
smokingcur:1 3.05178 0.59282 5.148 2.63e-07 ***
smokingcur:2 1.75824 0.22084 7.962 1.70e-15 ***
smokingcur:3 1.82920 0.16581 11.032 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: logitlink(P[Y<=1]), logitlink(P[Y<=2]),
logitlink(P[Y<=3])
Residual deviance: 4161.64 on 6252 degrees of freedom
Log-likelihood: -2080.82 on 6252 degrees of freedom
Number of Fisher scoring iterations: 6
Warning: Hauck-Donner effect detected in the following estimate(s):
'(Intercept):1', '(Intercept):2'
Exponentiated coefficients:
airhigh:1 airhigh:2 airhigh:3 exposureyes:1 exposureyes:2
0.9831513 0.8946705 0.9836885 2.4844372 2.3883710
exposureyes:3 smokingex:1 smokingex:2 smokingex:3 smokingcur:1
2.3263795 3.5289310 1.0317980 1.5370140 21.1530498
smokingcur:2 smokingcur:3
5.8022116 6.2289294
We view the result in Equation 32, where x_{1} is an indicator for airhigh, x_{2} for exposureyes, x_{3} for smokingex, and x_{4} for smokingcur.

\begin{align*} \log\left(\frac{P(\text{level\_rev} \le 1)}{P(\text{level\_rev} > 1)}\right) &= -5.07486 - 0.01699 x_{1} + 0.91005 x_{2} + 1.26099 x_{3} + 3.05178 x_{4} \\ \log\left(\frac{P(\text{level\_rev} \le 2)}{P(\text{level\_rev} > 2)}\right) &= -2.84239 - 0.11130 x_{1} + 0.87061 x_{2} + 0.03130 x_{3} + 1.75824 x_{4} \\ \log\left(\frac{P(\text{level\_rev} \le 3)}{P(\text{level\_rev} > 3)}\right) &= -2.08562 - 0.01645 x_{1} + 0.84431 x_{2} + 0.42984 x_{3} + 1.82920 x_{4} \\ \end{align*} \tag{32}
Write the fitted model obtained in QUESTION 2 of PROBLEM 3.
Using the fitted model from QUESTION 2 of PROBLEM 3, interpret the odds ratio corresponding to the \texttt{exposure} predictor. Construct a 95% confidence interval (CI) for the corresponding population odds ratio.
Using the model that you fit in QUESTION 2 of PROBLEM 3, compute the predicted probability of the following event: that a person has cough equal to or less than 3 months per year, given that air pollution where they live is low, they are not exposed to pollution at their job, and they are a non-smoker.