Code
library(rio)
library(dplyr)
library(epitools)
library(DescTools)
library(epiR)
library(emmeans)
library(lmtest)
This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In the previous chapter, we explored the use of generalized linear models in terms of multiple explanatory variables. We paid particular attention to multivariable logistic regression (LR) models, using the logit link function.
In this session, we aim to use our knowledge of multivariable LR models to expand our ability to examine effect modification and confounding. LR models will give us a richer approach than table methods that we used in chapter 5.
Along the way, we will encounter a visual approach to understanding effect modification, in essence, the geometry of effect modification.
Load the required libraries for Chapter 7. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
library(epitools)
library(DescTools)
library(epiR)
library(emmeans)
library(lmtest)
A variable, Z, is an effect modifier of the association between the primary exposure, the explanatory variable, X, and the response, Y, if the association between X and Y depends on the levels of Z. In essence we ask: Are our measures of association between X and Y different for the different levels of Z?
As illustrative example, we consider hypothetical data for the cognitive behavioral therapy (CBT) versus placebo study, for major depressive disorder (MDD). Our potential effect modifier is age group. The data is stored in the file mdd_cbt_data.csv
. The variables are explained in Table 1.
Variable | Symbol | Name in data file | Levels |
---|---|---|---|
Cognitive behavioral therapy (CBT) | X | cbt |
CBT, E=1 & placebo \overline{E}=0 |
Remission of major depressive disorder | Y | remit |
Remission D=1 & no remission \overline{D}=0 |
Age group | Z | onset_age |
\ge 30 = 1 & <30=0 |
We want to know if the association between the exposure, with levels indicating CBT and placebo, and the response with levels indicating remission of depression or no remission, is different for each of the age groups, the potential effect modifier.
Our potential effect modifier is binary and table methods can be used for this analysis. Here, though, we consider a LR model. The data is imported below.
<- import_dataset("mdd_cbt_data.csv") dfr
We set the levels using the factor
function. We state the factor as per Table 1. Note that the levels of the binary variable are encoded with a 0 and a 1 in the data file.
$remit <- factor(
dfr$remit,
dfrlevels = c(0, 1)
)
$cbt <- factor(
dfr$cbt,
dfrlevels = c(0, 1)
)
$onset_age <- factor(
dfr$onset_age,
dfrlevels = c(0, 1)
)
The stratum specific odds ratios as measures or association are calculated from the stratified tables using the oddsratio.wald
function from the epitools
library. Two data frame objects are created using the filter
function from the dplyr
library for use by the oddsratio.wald
function. As a reminder, we can also use the epi.2by2
function from the epiR
function. We would have to remember that the order of the levels are different when using the epi.2by2
function.
<- dfr %>%
younger filter(onset_age == 0)
<- dfr %>%
older filter(onset_age == 1)
is.factor(dfr$onset_age)
[1] TRUE
<- table(younger$cbt, younger$remit)
younger_table <- table(older$cbt, older$remit) older_table
The oddsratio.wald
function returns the contingency tables with the odds ratio and confidence interval, together with p values using various tests.
::oddsratio.wald(younger_table) epitools
$data
0 1 Total
0 37 53 90
1 25 65 90
Total 62 118 180
$measure
odds ratio with 95% C.I.
estimate lower upper
0 1.000000 NA NA
1 1.815094 0.9727674 3.386799
$p.value
two-sided
midp.exact fisher.exact chi.square
0 NA NA NA
1 0.06253311 0.08405226 0.05980011
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
We do the same for the older age group.
::oddsratio.wald(older_table) epitools
$data
0 1 Total
0 45 45 90
1 14 76 90
Total 59 121 180
$measure
odds ratio with 95% C.I.
estimate lower upper
0 1.000000 NA NA
1 5.428571 2.68489 10.97601
$p.value
two-sided
midp.exact fisher.exact chi.square
0 NA NA NA
1 7.445016e-07 1.268881e-06 8.54776e-07
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
The odds ratio (OR) is much larger in the \ge 30 group. The confidence intervals for this group does not include 1. To determine if the odds ratios are different from each other, we make use of Woolf’s test.
We create stratified tables below, using X (cbt
), Y (remit
), and then Z (onset_age
) (in that order) as arguments.
<- table(dfr$cbt, dfr$remit, dfr$onset_age)
stratified_tables stratified_tables
, , = 0
0 1
0 37 53
1 25 65
, , = 1
0 1
0 45 45
1 14 76
Using Woolf’s test, we see a large X_{W}^{2} statistic.
::WoolfTest(stratified_tables) DescTools
Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)
data: stratified_tables
X-squared = 5.2112, df = 1, p-value = 0.02244
With the large X_{W}^{2} statistic, we reject the null hypothesis that there is homogeneity of odds ratios. The age of onset group variable, Z, is an effect modifier. We would describe the measures of association separately for the two strata (levels) of the effect modifier.
LR models can be used to give us the same results as table methods. We can do so much more with LR models with respect to effect modification and confounding, though. For one, we can consider binary, multi-level, and continuous numerical variables.
To examine an effect modifier in a LR model, we construct new variables, called interaction terms. We fit the main effects, called lower-order terms, X and Z, and the interaction term, called the higher-order term, X \times Z.
X \times Z combines the levels of X and Z to create a new variable for our model. The number of variables can be determined by Equation 1, where j is the number of levels of X, and k is the number of levels of Z.
(j-1)(k-1) \tag{1}
In our case, the result is (2-1)(2-1)=1. For multilevel categorical variables, many extra terms can be created. By definition, numerical variables have many values, and we will explore the approach to numerical variables later.
To summarize, in our illustrative example of CBT vs. placebo for major depressive disorder, with age-at-onset group as effect modifier, we have that Y=1 for those that go into remission and Y=0 otherwise, X=1 if in the CBT group, 0, otherwise, and Z=1 if onset age is \ge 30, and 0 if less than 30. Finally, we have our probability of success (remission), as shown in Equation 2.
P(Y=1 \, \vert \, x,z) \tag{2}
Using the logit link function, we write our model in Equation 3, where we see our new interaction term xz.
\log{\frac{\pi (x,z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{1} x + \beta_{2} z + \beta_{3} xz \tag{3}
This model is flexible enough to allow us to estimate potentially different associations (odds ratios) between X and Y at each level of Z. We simply need to consider the results carefully. This should be clear to see from some algebraic manipulation in Equation 4, where we factor out the X term.
\log{\left( \frac{\pi (x,z)}{1 - \pi (x,z)} \right)} = \beta_{0} + \beta_{2} z + (\beta_{1} + \beta_{3} z) x \tag{4}
Remember that the slopes are log odds ratios. The model can estimate different associations between explanatory variables and the response variable in each stratum (level) of Z. We note the log odds ratio, for X, depending on the value of Z=z is \log{\text{OR}_{z}} = \beta_{1} + \beta_{3}z. In Equation 5, we see the two possibilities.
\begin{align} &z=0: \; \text{OR}_{z=0} = e^{\beta_{1} + \beta_{3}(0)} = e^{\beta_{1}} \\ &z=1: \; \text{OR}_{z=1} = e^{\beta_{1} + \beta_{3}(1)} = e^{\beta_{1} + \beta_{3}} \end{align} \tag{5}
If \beta_{3} = 0 then \text{OR}_{z=0} = \text{OR}_{z=1} and Z is not an effect modifier. The odds ratio of remission for CBT, x=1, compared to placebo, x=0, is the same for both levels of Z.
With the flexibility of a LR model, we can use a Wald test or a likelihood ratio test to test the null hypothesis H_{0}: \beta_{3} = 0. If we have evidence to reject H_{0}, then Z is an effect modifier.
Before we do this, we consider the geometry of effect modification for binary variables X and Z.
Consider a model without interaction (only the main effects), shown in Equation 5.
\log{\frac{\pi (x,z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{1}x + \beta_{2}z
We have the two cases z=0 and z=1, shown in Equation 6.
\begin{align} &z=0: \; \log{\frac{\pi (x,0)}{1 - \pi (x,0)}} = \beta_{0} + \beta_{1}x \\ &z=1: \; \log{\frac{\pi (x,1)}{1 - \pi (x,1)}} = (\beta_{0} + \beta_{2}) + \beta_{1}x \end{align} \tag{6}
The slopes are the same (parallel lines). The intercept is different if \beta_{2} \ne 0.
For a model with an interaction term, shown in Equation 3, we have Equation 7.
\begin{align} &z=0: \; \log{\frac{\pi (x,0)}{1 - \pi (x,0)}} = \beta_{0} + \beta_{1}x \\ &z=1: \; \log{\frac{\pi (x,1)}{1 - \pi (x,1)}} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})x \end{align} \tag{7}
If \beta_{3} \ne 0 (evidence against H_{0}: \, \beta_{3}=0) then the two slopes are different (non-parallel lines). If \beta_{2} \ne 0, the intercepts are also different. We will see this in action a bit later.
We use the CBT for MDD example again and fit a model using the glm
function and assign the result to the variable model
. Note the use of the multiplication symbol, *
, to denote an interaction term.
<- glm(
model ~ cbt + onset_age + cbt * onset_age,
remit data = dfr,
family = binomial(link = "logit")
)
We look at the coefficient table below.
round(summary(model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.359 0.214 1.678 0.093
cbt1 0.596 0.318 1.873 0.061
onset_age1 -0.359 0.301 -1.196 0.232
cbt1:onset_age1 1.096 0.480 2.283 0.022
We can write our model as shown in Equation 8.
\begin{align} &\log{\left( \frac{\hat{\pi}(x,z)}{1 - \hat{\pi}(x,z)}\right)} = \hat{\beta_{0}} + \beta_{1} x + \beta_{2}z + \beta_{3}xz \\ &\log{\left( \frac{\hat{\pi}(x,z)}{1 - \hat{\pi}(x,z)}\right)} = 0.359 + 0.596 x - 0.359 z + 1.096 xz \end{align} \tag{8}
The coefficient table shows the result of the Wald test of H_{0}: \, \beta_{3}=0. The test statistic Z=2.283, with a p value less than 0.05. The onset of age does modify the effect of treatment on remitter status (the association between explanatory variable and the response variable). From the test summary at the start of this notebook, we had the Woolf test statistic, X_{W}^{2}=5.211, with a p value of 0.022, the same as the Wald test using LR.
Now that we have shown that Z is an effect modifier, we have to describe the association between X and Y for each of the strata of Z. We rewrite Equation 8 by factoring out x, as shown in Equation 9.
\log{\frac{\hat{\pi}(x,z)}{1 - \hat{\pi}(x,z)}} = 0.359 - 0.359 z + (0.596 + 1.096 z)x \tag{9}
For z=0 (age of onset before 30), we have that the slope (log odds ratio) for x is 0.596. Therefor \log{\widehat{\text{OR}}_{z=0}} = 0.596 \rightarrow \widehat{\text{OR}}_{z=0} = e^{0.596}=1.81. For those with an age of onset of MDD younger than 30 years, the odds of remitting for those in the CBT group are 1.81 - 1 = 0.81 = 81% greater than for those in the placebo group.
For z=1 (age of onset at 30 years or later), we have that the slope for x is 0.596+1.096=1.692. The \log{\widehat{\text{OR}}_{z=1}} = 1.692 \rightarrow \widehat{\text{OR}}_{z=1} = e^{1.1692}=5.43. For those with an age of onset of MDD at 30 years or older, the odds of remitting for those in the CBT group are 5.43 - 1 = 4.43 = 443% greater than for those in the placebo group.
We require confidence intervals or we need to do hypothesis testing. The emmeans
(estimated marginal means) library in R can help us do the calculations.
First, we need to create an emmGrid object using the emmeans
function (you can read the documentation to learn more about grid objects). The formula, ~ cbt | onset_age
, reads, X conditional on Z.
<- emmeans::emmeans(
emmgrid_mdd
model,~ cbt | onset_age
)
We use the contrast
function to do our calculations.
::contrast(
emmeans
emmgrid_mdd,"consec", # CBT at level 1 then level 0
name = "cbt", # Optional (easier to find in results)
infer = c(TRUE, TRUE), # Hypothesis test & confidence intervals
level = 0.95
)
onset_age = 0:
cbt estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cbt1 - cbt0 0.596 0.318 Inf -0.0276 1.22 1.873 0.0610
onset_age = 1:
cbt estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cbt1 - cbt0 1.692 0.359 Inf 0.9876 2.40 4.709 <.0001
Results are given on the log odds ratio (not the response) scale.
Confidence level used: 0.95
We see a repeat of our estimates (estimate
) for the two strata (levels of Z
). We also see the standard errors (SE
), the degrees of freedom (df
), lower and upper bounds of the confidence intervals (asymp.LCL
and asymp.UCL
), Wald statistic (z.ratio
), and a p value (p.value
).
The emmeans package also allows us to see the geometry, using the emmip
function, with th result in Figure 1.
::emmip(
emmeans
model,~ cbt | onset_age
)
As discussed in the geometry section, note that the intercepts and slopes are different. The y axis is linear (in log odds).
We unfortunately have to compute the antilogs by hand. Below, we use the exp
function to calculate the antilog values and we also use the round
function, with the keyword argument digits
set to 2, to print two decimal places.
round(exp(c(0.596, -0.0276, 1.22)), digits = 2)
[1] 1.81 0.97 3.39
We have that \widehat{\text{OR}}_{z=0} = e^{0.596} = 1.81 and 95% CI is e^{-0.0276},e^{1.22} = (0.97,3.39). The same syntax is used for z=1.
round(exp(c(1.692, 0.9876, 2.4)), digits = 2)
[1] 5.43 2.68 11.02
Here, we have that \widehat{\text{OR}}_{z=1} = e^{1.692} = 5.43 and 95% CI is e^{-0.9876},e^{2.4} = (2.68,11.02).
We use the example of endometrial carcinoma and the use of continued combined hormone replacement therapy (HRT) vs. placebo. Our possible effect modifier is body mass index (BMI), with three levels, which we sumamrize in Table 2.
Variable | Type in problem | Explanation |
---|---|---|
\texttt{cancer}: \, Y | Response | Y=1 is endometrial carcinoma and Y=0 is no carcinoma |
\texttt{HRT}: \, X | Explanatory variable | x=1 for HRT and x=0 for placebo |
\texttt{BMI}: \, Z | Possible effect modifier | z_{1} = 1 if BMI is 25-29 and with z_2=1 if BMI is \ge 30 and 0 otherwise |
We choose BMI <25 as the reference level for our potential effect modifier, Z. From this choice, we explore \pi (x,z_{1},z_{2}) = P(Y=1 \, \vert \, x , z_{1} , z_{2}). This is written with the logit link function for a logistic regression model in Equation 10, where we note that we have an additional (j-1)(k-1) = (2-1)(3-1) = 2 variables, which have the coefficients \beta_{4} and \beta_{5} (the log odds ratios for the interaction terms \beta_{1} , \beta_{2}, and \beta_{3} as the log odds ratios for the main effects x, z_{1}, and z_{2}).
\log{\left(\frac{\pi (x,z_{1},z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = \beta_{0} + \beta_{1}x + \beta_{2} z_{1} + \beta_{3} z_{2} + \beta_{4} x z_{1} + \beta_{5} x z_{2} \tag{10}
The equation in Equation 10 can be rearranged algebraically, as shown in Equation 11.
\log{\frac{\pi (x,z_{1},z_{2})}{1 - \pi (x,z_{1},z_{2})}} = \beta_{0} + \beta_{2} z_{1} + \beta_{3} z_{2} + (\beta_{1} + \beta_{4} z_{1} + \beta_{5} z_{2}) x \tag{11}
Since we have more than one parameter, we use the likelihood ratio test (LRT), with H_{0}:\; \beta_{4} = \beta_{5} = 0 which states that Z is not an effect modifier. If \beta_{4} = \beta_{5} = 0, then the levels of Z do not modify the log odds ratio for X, which will be \beta_{1} alone.
In Equation 12, we see the three possibilities, BMI < 25, BMI 25 - 29, and BMI \ge 30.
\begin{align} &z_{1} = 0 , z_{2} = 0 : \; \log(\frac{\pi (x, 0, 0)}{ 1 - \pi (x , 0 , 0)} = \beta_{0} + \beta_{1} x \\ &z_{1} = 1 , z_{2} = 0 : \; \log(\frac{\pi (x, 1, 0)}{ 1 - \pi (x , 1 , 0)} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{4}) x \\ &z_{1} = 0 , z_{2} = 1 : \; \log(\frac{\pi (x, 0, 1)}{ 1 - \pi (x , 0 , 1)} = (\beta_{0} + \beta_{3}) + (\beta_{1} + \beta_{5}) x \end{align} \tag{12}
If both \beta_{4} and \beta_{5} are not equal to 0 (and \beta_{4} \ne \beta_{5}), then the slopes are different (non-parallel lines). If \beta_{2} and \beta_{3} are not equal to 0 then the intercepts are different.
The data for this example is in the comma-separated file endo_data.csv
, which we import as a tibble object assigned to the variable cancer
.
<- import_dataset("endo_data.csv") cancer
$BMI <- factor(
cancer$BMI,
cancerlevels = c("BMI < 25", "BMI 25-29", "BMI >= 30")
)
$HRT <- factor(
cancer$HRT,
cancerlevels = c("never", "cont_comb")
)
The glm
function creates the logistic regression model and fits the data. We assign the results to the variable model
. As before, we use the multiplication symbol, *
, to create the interaction terms.
<- glm(
model formula = cancer ~ HRT + BMI + HRT * BMI,
data = cancer,
family = binomial(link = "logit")
)
The coefficient table shows the fitted results.
round(summary(model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.586 0.124 -20.861 0.000
HRTcont_comb 0.236 0.158 1.498 0.134
BMIBMI 25-29 0.174 0.172 1.011 0.312
BMIBMI >= 30 0.128 0.189 0.675 0.500
HRTcont_comb:BMIBMI 25-29 -0.489 0.238 -2.055 0.040
HRTcont_comb:BMIBMI >= 30 -1.713 0.284 -6.037 0.000
From the results above, we write the fitted model in Equation 13.
\log{\left(\frac{\pi (x , z_{1} , z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = -2.586 + 0.236 x + 0.174 z_{1} + 0.128 z_{2} - 0.489 x z_{1} - 1.713 x z_{2} \tag{13}
Before interpreting the model parameters, we should test H_{0}: \; \beta_{4} = \beta_{5} = 0. We already have the full model. Now we create a reduced model, so that we can use a likelihood ratio test (LRT).
<- glm(
reduced_model ~ HRT + BMI, # No interaction term
cancer data = cancer,
family = binomial(link = "logit")
)
The lrtest
function performs the LRT.
::lrtest(reduced_model, model) lmtest
Likelihood ratio test
Model 1: cancer ~ HRT + BMI
Model 2: cancer ~ HRT + BMI + HRT * BMI
#Df LogLik Df Chisq Pr(>Chisq)
1 4 -1483.7
2 6 -1464.6 2 38.291 4.845e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There are 6 parameters in the full model and 4 parameters in the reduced model, hence 6-4=2 degrees of freedom in the results. The test statistic G^{2} = 38.291 and the p value is <0.001. We reject the null hypothesis. BMI does modify the association between HRT and endometrial cancer.
Our model is now as in Equation 14.
\begin{align} &\log{\left(\frac{\pi (x , z_{1} , z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = -2.586 + 0.236 x + 0.174 z_{1} + 0.128 z_{2} - 0.489 x z_{1} - 1.713 x z_{2} \\ &\log{\left(\frac{\pi (x , z_{1} , z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = -2.586 + 0.174 z_{1} + 0.128 z_{2} + (0.236 - 0.489 z_{1} - 1.713 z_{2}) x \end{align} \tag{14}
The estimated log odds ratios and odds ratios for the three levels of Z are calculated in Equation 15.
\begin{align} \text{BMI} < 25: \; &\log{\widehat{\text{OR}}_{z_{1}=0,z_{2}=0}} = 0.236, \; \widehat{\text{OR}}_{z_{1}=0,z_{2}=0} = e^{0.236} = 1.27 \\ \text{BMI } 25-29: \; &\log{\widehat{\text{OR}}_{z_{1}=1,z_{2}=0}} = 0.236 - 0.489 = -0.253, \; \widehat{\text{OR}}_{z_{1}=1,z_{2}=0} = e^{0.236} = 0.78 \\ \text{BMI} \ge 30: \; &\log{\widehat{\text{OR}}_{z_{1}=0,z_{2}=1}} = 0.236 - 1.713 = -1.477, \; \widehat{\text{OR}}_{z_{1}=0,z_{2}=1} = e^{-1.477} = 0.23 \end{align} \tag{15}
Filling in the appropriate values for x, z_{1}, and z_{2} in Equation 14, we see the geometry of the results (the coordinates of the points, with a line connecting them). The emmip
function is used again, to provide Figure 2.
::emmip(
emmeans
model,~ HRT | BMI
)
We see very different slopes. As before, we create an emmGrid object with the emmeans
function and then use the contrast
function to to show us the individual results of hypothesis testing and confidence intervals (that are now required since we rejected the null hypothesis using the LRT test). The results shows the association between X and Y at each level of Z.
<- emmeans::emmeans(
model_emm
model,~ HRT | BMI
)
::contrast(
emmeans
model_emm,"consec",
name = "HRT",
infer = c(TRUE, TRUE),
level = 0.95
)
BMI = BMI < 25:
HRT estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cont_comb - never 0.236 0.158 Inf -0.0729 0.5455 1.498 0.1342
BMI = BMI 25-29:
HRT estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cont_comb - never -0.253 0.178 Inf -0.6014 0.0964 -1.419 0.1560
BMI = BMI >= 30:
HRT estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cont_comb - never -1.477 0.236 Inf -1.9390 -1.0144 -6.261 <.0001
Results are given on the log odds ratio (not the response) scale.
Confidence level used: 0.95
We see the three sets of estimates, standard errors, degrees of freedom, lower and upper bounds of the 95% confidence intervals, the Z statistics, and the p values.
As before, we have to take the antilogs by hand (code).
round(exp(c(0.236, -0.0729, 0.5455)), digits = 2)
[1] 1.27 0.93 1.73
round(exp(c(-0.253, -0.6014, 0.0964)), digits = 2)
[1] 0.78 0.55 1.10
round(exp(c(-1.477, -1.939, -1.0144)), digits = 2)
[1] 0.23 0.14 0.36
From the results, we see that in the cases for BMI <25 and BMI 25-29, there is not enough evidence to show an association between HRT and the development of endometrial carcinoma.
We have looked at effect modification for binary and multi-level variables. Now we consider a numerical variable as possible effect modifier.
to investigate a continuous numerical variable as a potential effect modifier, we revisit our first illustrative example. Now, though, Z, is a numerical variable. We use the data set on major depressive disorder (MDD) that we have used before. Our explanatory variable is X, cognitive behavioral therapy (CBT) vs. placebo, and the possible effect modifier is age of onset (a numerical data type). The variable summary is shown in Table 3
Variable | Type in problem | Explanation |
---|---|---|
Y | Response | Y=1 if remitter and Y=0 otherwise |
X | Explanatory variable | x=1 if CBT and x=0 for placebo |
Z | Possible effect modifier | $Z = $ age of onset in years |
Our aim is to investigate \pi (x,z) = P(Y=1 \, \vert \, x, z). It should be obvious that table methods cannot be used. Our model is now as written in Equation 16.
\begin{align} &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{1} x + \beta_{2} z + \beta_{3} xz \\ &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{2} z + (\beta_{1} + \beta_{3} z) x \end{align} \tag{16}
Since z can take on many values, we have have many slopes, (\beta_{1} + \beta_{3}), for x. In Equation 17, we choose z=16, z=29.
\begin{align} &z=16: \; \log{\text{OR}_{z=16}} = \beta_{1} + \beta_{3} (16), \; \text{OR}_{z=16} = e^{\beta_{1} + \beta_{3} (16)} \\ &z=29: \; \log{\text{OR}_{z=29}} = \beta_{1} + \beta_{3} (29), \; \text{OR}_{z=29} = e^{\beta_{1} + \beta_{3} (29)} \end{align} \tag{17}
If \beta_{3}=0, then the z does not modify the slope of x. We have H_{0}: \; \beta_{3} = 0.
In Equation 18, we write the log odds ratios for the two example values of the two values of z.
\begin{align} &z=16: \; \log{\frac{\pi (x,16)}{1 - \pi (x,16)}} = [\beta_{0} + \beta_{2} (16)] + [\beta_{1} + \beta_{3} (16)] x \\ &z=29: \; \log{\frac{\pi (x,29)}{1 - \pi (x,29)}} = [\beta_{0} + \beta_{2} (29)] + [\beta_{1} + \beta_{3} (29)] x \end{align} \tag{18}
Many lines are possible. If \beta_{3} \ne 0 the lines are non-parallel. Intercepts are different too if \beta_{2} \ne 0.
The data is in the mdd_cbt_data.csv
comma-separated values file. We import the data and assign it to the variable dfr
.
<- import_dataset("mdd_cbt_data.csv") dfr
The model is created in the usual way and we assign it again to the variable model
.
<- glm(
model formula = remit ~ cbt + onset_age_num + cbt * onset_age_num,
data = dfr,
family = binomial(link = "logit")
)
The model is written in Equation 19, from the table of coefficients.
round(summary(model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.656 0.472 1.389 0.165
cbt -0.280 0.720 -0.390 0.697
onset_age_num -0.016 0.015 -1.069 0.285
cbt:onset_age_num 0.047 0.023 2.009 0.045
\begin{align} &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = 0.656 - 0.280 x - 0.016 z + 0.047 x z \\ &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = 0.656 - 0.016 z + (-0.280 + 0.047) x \end{align} \tag{19}
In this case, we read the Wald statistic for H_{0}:\; \beta_{3} = 0 from the table, with Z=2.009 for a p value of 0.045. The data provides evidence that age of onset in years, modifies the association between treatment and remitter status.
In the geometry shown in Figure 3, we choose three different values for the effect modifier, Z, using the at
keyword argument.
::emmip(
emmeans
model,~ cbt | onset_age_num,
at = list(onset_age_num = c(20, 29.8, 40.3))
)
This gives us a visual indication of the effect modification of age.
One way to describe the effect modification numerically is to choose three values as with the geometry. The three values can be the first, second (median), and third quartiles, which we calculate below, using the quantile
function.
quantile(
$onset_age_num,
dfrprobs = c(0.25, 0.5, 0.75)
)
25% 50% 75%
20.0 29.8 40.3
The contrast
function shows the results.
<- emmeans::emmeans(
model_emm
model,~ cbt | onset_age_num,
at = list(onset_age_num = c(20, 29.8, 40.3))
)
::contrast(
emmeans
model_emm,"consec",
name = "cbt",
infer = c(TRUE, TRUE),
level = 0.95
)
onset_age_num = 20.0:
cbt estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cbt1 - cbt0 0.66 0.318 Inf 0.0368 1.28 2.076 0.0379
onset_age_num = 29.8:
cbt estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cbt1 - cbt0 1.12 0.238 Inf 0.6543 1.59 4.710 <.0001
onset_age_num = 40.3:
cbt estimate SE df asymp.LCL asymp.UCL z.ratio p.value
cbt1 - cbt0 1.61 0.355 Inf 0.9193 2.31 4.553 <.0001
Results are given on the log odds ratio (not the response) scale.
Confidence level used: 0.95
We see the estimates and their confidence intervals, and the test statistics and p values.
The antilogs must be done by hand (code) as before.
round(exp(c(0.66, 0.0368, 1.28)), digits = 2)
[1] 1.93 1.04 3.60
round(exp(c(1.12, 0.6543, 1.59)), digits = 2)
[1] 3.06 1.92 4.90
round(exp(c(1.61, 0.9193, 2.31)), digits = 2)
[1] 5.00 2.51 10.07
We see that the odds ratio for remission comparing CBT to placebo at the age of onset of MDD of 20 years is 1.93 (95% CI 1.04 to 3.6). As the age of onset increases, the odds ratios increase.
We could write: Researchers wanted to investigate whether there is an association between treatment types (CBT vs. placebo) and remission of MDD. They needed to account for the age of onset of disease as a possible effect modifier. A Wald test showed that age of onset was indeed an effect modifier. We could then continue with describing the associations by confidence intervals and / or hypothesis test.
In conclusion, we have looked at all three data types for the possible effect modifier. In general, we start by using the Wald or likelihood ratio test for the null hypothesis that the slopes of the interaction terms are all 0.
If significant, then we measure the association between X and Y at all levels of Z (some in the case of a numerical variable).
If the test Wald or LRT is not significant, we drop the interaction term and refit the model with only the main effects, X and Z.
We remind ourselves of the definition of a confounder. A variable, Z, is a confounder of the association between the primary exposure and response if the association between the primary exposure, X, and the response, Y is biased (distorted) because Z is associated with the primary exposure and causally associated with the response. To be a confounder, Z cannot be in the causal pathway from X to Y. If Z is on the causal pathway, then it is termed a mediator.
We return to our fun simulated research of the association between baldness, X (with E being bald and \overline{E} not being bald) and the development of chronic heart disease (CHD) (with D being the presence of CHD and \overline{D} being no IHD). Our potential confounder is age group, Z, with z=1 for the older group, and z=0 for the younger group.
Stratification can get out of hand with many levels for Z. We can use generalized linear models instead, where we follow these steps for a LR model approach to examining a confounder.
If these steps are followed, we compute the adjusted measures of association between X and Y (including Z). The slopes of X are the adjusted measures of association.
Finally, we compare the crude measures to the adjusted measures.
The data file, bald_chd_data_file.csv
, is imported below, and assigned to the variable chd
.
<- import_dataset("bald_chd_data_file.csv") chd
Now, we fit the crude model.
<- glm(
crude_model ~ bald,
CHD data = chd,
family = binomial(link = "logit")
)
We view the table of coefficients.
round(summary(crude_model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.367 0.060 -55.981 0
bald 1.123 0.133 8.460 0
The antilog of the crude estimates are calculated.
round(exp(coef(crude_model)), digits = 3)
(Intercept) bald
0.034 3.074
We see that \widehat{\text{OR}}_{\text{crdue}} = 3.07. We also calculate the confidence intervals.
round(exp(confint.default(crude_model, level = 0.95)), digits = 3)
2.5 % 97.5 %
(Intercept) 0.031 0.039
bald 2.370 3.987
We see that the interval does not contain the 1, with H_{0}: \; \text{OR} = 1.
Now, we fit the adjusted model.
<- glm(
adjusted_model ~ bald + age,
CHD data = chd,
family = binomial(link = "logit")
)
We see the table of coefficients.
round(summary(adjusted_model)$coefficients, digits = 3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.384 0.060 -55.995 0.000
bald 0.025 0.205 0.121 0.904
age 3.431 0.259 13.228 0.000
Using the antilog, we see the adjusted odds ratio for baldness.
round(exp(coef(adjusted_model)), digits = 3)
(Intercept) bald age
0.034 1.025 30.894
We see \widehat{\text{OR}}_{\text{after adjusting for age group}} = 1.025. We calculate the confidence intervals below.
round(exp(confint.default(adjusted_model)), digits = 3)
2.5 % 97.5 %
(Intercept) 0.030 0.038
bald 0.686 1.533
age 18.584 51.360
Odds ratios for having CHD comparing bald to non-bald group is as shown in Equation 20.
\begin{align} &\widehat{\text{OR}}_{\text{crude}} = 3.07 \text{ with } 95\% \text{CI } (2.37, 3.99) \\ &\widehat{\text{OR}}_{\text{adjusted}} = 1.03 \text{ with } 95\% \text{CI } (0.69, 1.53) \end{align} \tag{20}
We see very different results and conclude that age group is a confounder. We therefor report the adjusted measure of association.
Note, we assumed here that the requirements for testing a confounder have been met. We did this in chapter 5. Notice also, how the results are the same as for the Mantel-Haenszel estimates for chapter 5. This will not always be the case.
Reconsider the lab work that was completed in lecture 5. You are a member of a funded group conducting research into maternal perinatal complications in a disadvantaged community. You investigate the association between maternal age group and maternal perinatal complications. You also consider human immunodeficiency virus (HIV) status as possible effect modifier.
1
) and Older (\overline{E} or 0
)1
) and No (\overline{D} or 0
)1
) and Negative (0
)Consider the association between maternal age group and maternal perinatal complications and HIV status as possible effect modifier.
As review of stratified analysis, use table methods to determine if HIV status is an effect modifier of the association between maternal age group and complications. Write conclusions on the associations and effect modification, based on the outcome of appropriate tests.
Create a logistic regression model to investigate the interaction between maternal age group and HIV status. As revision, perform a likelihood ratio test to investigate the model. Write your conclusions based on the logistic regression model. Compare the results to those obtained using stratified analysis (table methods).
The data for this lab problem are in a spreadsheet file named PerinatalComplications.csv
, which is imported below and assigned to the variable comp
.
<- import_dataset("PerinatalComplications.csv") comp
Consider the data that are available in the data set empirical_injury_infection_data.csv
. You are investigating the effect of a novel empirical antibiotic (antibiotic given prior to surgery for established contamination), E, compared to an established prophylactic antibiotic , \overline{E}, on the development of surgical infection in the setting of penetrating abdominal trauma in five provinces in a Central African country. You are interested in the level of hollow-viscus injury as potential effect modifier, Z. The potential effect modifier has three levels, summarized in Table 4.
Levels of Z | Data file encoding | Values of dummy variables |
---|---|---|
No hollow viscus injury | None | z_{1}=0, z_{2}=0 |
Small bowel injury | SBI | z_{1}=1, z_{2}=0 |
Large intestine injury | LBI | z_{1}=0, z_{2}=1 |
The levels of X (antibiotic group) and Y (surgical infection are summarized in Table 5)
Variable Name | Description | Levels or Units of Measurement |
---|---|---|
\texttt Group | Type of antibiotic | Established vs. Novel |
\texttt Infection | Infection (response) | Yes vs. No |
Determine if the level of hollow viscus injury is an effect modifier of the association between the type of empirical antibiotic and the development of infection. Comment on your decisions and the results.
The data are in the file empirical_injury_infection_data.csv
and imported below, assigned to the variable lab
.
As part of a group that plans interventions among gang members to reduce the rate of human immunodeficiency virus (HIV) sero-conversion you investigate the association between membership of a gang (variable named Gang in the data set) and the sero-conversion (variable named HIV in data set). You consider the age of onset of smoking as a potential effect modifier.
Your group manages data for aid agencies in South Africa and in Brazil. Complete the analyses using a logistic regression model to investigate the interaction between gang membership and age of onset of smoking, and comment on your decisions and results for the two countries.
Note that the data are in the files gangs_hiv_africa.csv
and gangs_hiv_brazil.csv
.
Your are the biostatistician for a research project investigating the development of cancer of the cervix. In one of the sub projects you investigate the association between current smoking status and the development of cervical carcinoma. You have also collected data on the number of sexual partners and consider two levels for this variable, less than 2 and 2+ sexual partners and consider whether the two-group number of sexual partners modifies the effect of the association between current smoking status and the development of cervical carcinoma or if this association is biased by the level of sexual partners.
Prepare a thorough report for your funding agency oversight committee meeting. State your research questions, hypotheses, levels of significance and confidence, all your analyses decisions, and results. Contrast table methods and a generalized linear model approach.
The data are in the file Cervical.csv
and is imported below, assigned to the variable cerv
.
<- import_dataset("Cervical.csv") cerv