Chapter 7

Author
Affiliation
Juan H Klopper MD MMed, Do Hee Lee MPH

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

In the previous chapter, we explored the use of generalized linear models in terms of multiple explanatory variables. We paid particular attention to multivariable logistic regression (LR) models, using the logit link function.

In this session, we aim to use our knowledge of multivariable LR models to expand our ability to examine effect modification and confounding. LR models give us a richer approach than the table methods we used in chapter 5.

Along the way, we will encounter a visual approach to understanding effect modification, in essence, the geometry of effect modification.

Section 3. Table Methods Revisited: effect modifier
Section 4. Logistic Regression Model for Effect Modifiers: interaction terms, lower-order terms, higher-order terms
Section 6. Logistic Regression Model for Confounders: mediator, confounder

2 Libraries

Load the required libraries for Chapter 7. If these packages are not already installed, please do so before loading the libraries.

Code
library(rio)        # data import
library(dplyr)      # data manipulation (filter, pipes)
library(epitools)   # odds ratios from tables (oddsratio.wald)
library(DescTools)  # Woolf test of homogeneity of odds ratios
library(epiR)       # 2-by-2 table analysis (epi.2by2)
library(emmeans)    # estimated marginal means and contrasts
library(lmtest)     # likelihood ratio tests (lrtest)
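Note that the import function used throughout this chapter, import_dataset, is a helper defined in earlier chapters of the book. If it is not available in your session, a minimal stand-in could look like the sketch below; wrapping rio::import and returning a tibble are assumptions, so adapt it to however the helper was originally defined.

Code
# Hypothetical stand-in for the import_dataset helper from earlier chapters
# (assumes it simply wraps rio::import and returns a tibble)
import_dataset <- function(file_name) {
  rio::import(file_name, setclass = "tibble")
}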

3 Table Methods Revisited

A variable, Z, is an effect modifier of the association between the primary exposure, the explanatory variable, X, and the response, Y, if the association between X and Y depends on the levels of Z. In essence we ask: Are our measures of association between X and Y different for the different levels of Z?

As an illustrative example, we consider hypothetical data from a study of cognitive behavioral therapy (CBT) versus placebo for major depressive disorder (MDD). Our potential effect modifier is age group. The data are stored in the file mdd_cbt_data.csv. The variables are explained in Table 1.

Table 1: Variables for the Illustrative Example
Variable Symbol Name in data file Levels
Cognitive behavioral therapy (CBT) X cbt CBT, E=1 & placebo \overline{E}=0
Remission of major depressive disorder Y remit Remission D=1 & no remission \overline{D}=0
Age group Z onset_age \ge 30 = 1 & <30=0

We want to know if the association between the exposure, with levels indicating CBT and placebo, and the response with levels indicating remission of depression or no remission, is different for each of the age groups, the potential effect modifier.

Our potential effect modifier is binary and table methods can be used for this analysis. Here, though, we consider a LR model. The data is imported below.

Code
dfr <- import_dataset("mdd_cbt_data.csv")

We set the levels using the factor function, as per Table 1. Note that the levels of the binary variables are encoded as 0 and 1 in the data file.

Code
dfr$remit <- factor(
  dfr$remit,
  levels = c(0, 1)
)

dfr$cbt <- factor(
  dfr$cbt,
  levels = c(0, 1)
)

dfr$onset_age <- factor(
  dfr$onset_age,
  levels = c(0, 1)
)

The stratum-specific odds ratios, as measures of association, are calculated from the stratified tables using the oddsratio.wald function from the epitools library. Two data frame objects are created using the filter function from the dplyr library for use by the oddsratio.wald function. As a reminder, we can also use the epi.2by2 function from the epiR library. We would have to remember that the order of the levels is different when using the epi.2by2 function.

Code
younger <- dfr %>%
  filter(onset_age == 0)
older <- dfr %>%
  filter(onset_age == 1)

is.factor(dfr$onset_age)
[1] TRUE
Code
younger_table <- table(younger$cbt, younger$remit)
older_table <- table(older$cbt, older$remit)

The oddsratio.wald function returns the contingency tables with the odds ratio and confidence interval, together with p values using various tests.

Code
epitools::oddsratio.wald(younger_table)
$data
       
         0   1 Total
  0     37  53    90
  1     25  65    90
  Total 62 118   180

$measure
   odds ratio with 95% C.I.
    estimate     lower    upper
  0 1.000000        NA       NA
  1 1.815094 0.9727674 3.386799

$p.value
   two-sided
    midp.exact fisher.exact chi.square
  0         NA           NA         NA
  1 0.06253311   0.08405226 0.05980011

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

We do the same for the older age group.

Code
epitools::oddsratio.wald(older_table)
$data
       
         0   1 Total
  0     45  45    90
  1     14  76    90
  Total 59 121   180

$measure
   odds ratio with 95% C.I.
    estimate   lower    upper
  0 1.000000      NA       NA
  1 5.428571 2.68489 10.97601

$p.value
   two-sided
      midp.exact fisher.exact  chi.square
  0           NA           NA          NA
  1 7.445016e-07 1.268881e-06 8.54776e-07

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The odds ratio (OR) is much larger in the \ge 30 group. The confidence interval for this group does not include 1. To determine if the odds ratios are different from each other, we make use of Woolf’s test.

We create stratified tables below, using X (cbt), Y (remit), and then Z (onset_age) (in that order) as arguments.

Code
stratified_tables <- table(dfr$cbt, dfr$remit, dfr$onset_age)
stratified_tables
, ,  = 0

   
     0  1
  0 37 53
  1 25 65

, ,  = 1

   
     0  1
  0 45 45
  1 14 76

Using Woolf’s test, we see a large X_{W}^{2} statistic.

Code
DescTools::WoolfTest(stratified_tables)

    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  stratified_tables
X-squared = 5.2112, df = 1, p-value = 0.02244

With the large X_{W}^{2} statistic, we reject the null hypothesis that there is homogeneity of odds ratios. The age of onset group variable, Z, is an effect modifier. We would describe the measures of association separately for the two strata (levels) of the effect modifier.

LR models can be used to give us the same results as table methods. We can do so much more with LR models with respect to effect modification and confounding, though. For one, we can consider binary, multi-level, and continuous numerical variables.

4 Logistic Regression Model for Effect Modifiers

4.1 Binary Effect Modifiers

To examine an effect modifier in a LR model, we construct new variables, called interaction terms. We fit the main effects, called lower-order terms, X and Z, and the interaction term, called the higher-order term, X \times Z.

X \times Z combines the levels of X and Z to create a new variable for our model. The number of interaction terms is given by Equation 1, where j is the number of levels of X and k is the number of levels of Z.

(j-1)(k-1) \tag{1}

In our case, the result is (2-1)(2-1)=1. For multilevel categorical variables, many extra terms can be created. By definition, numerical variables have many values, and we will explore the approach to numerical variables later.
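For our example, this count can be verified directly in R (a trivial check, assuming the factors created in the previous section are still in dfr).

Code
# (j - 1)(k - 1) interaction terms for two binary factors
(nlevels(dfr$cbt) - 1) * (nlevels(dfr$onset_age) - 1)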

To summarize, in our illustrative example of CBT vs. placebo for major depressive disorder, with age-at-onset group as effect modifier, we have that Y=1 for those that go into remission and Y=0 otherwise, X=1 if in the CBT group, 0, otherwise, and Z=1 if onset age is \ge 30, and 0 if less than 30. Finally, we have our probability of success (remission), as shown in Equation 2.

P(Y=1 \, \vert \, x,z) \tag{2}

Using the logit link function, we write our model in Equation 3, where we see our new interaction term xz.

\log{\frac{\pi (x,z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{1} x + \beta_{2} z + \beta_{3} xz \tag{3}

This model is flexible enough to allow us to estimate potentially different associations (odds ratios) between X and Y at each level of Z. We simply need to consider the results carefully. This should be clear to see from some algebraic manipulation in Equation 4, where we factor out the X term.

\log{\left( \frac{\pi (x,z)}{1 - \pi (x,z)} \right)} = \beta_{0} + \beta_{2} z + (\beta_{1} + \beta_{3} z) x \tag{4}

Remember that the slopes are log odds ratios. The model can estimate different associations between explanatory variables and the response variable in each stratum (level) of Z. We note the log odds ratio, for X, depending on the value of Z=z is \log{\text{OR}_{z}} = \beta_{1} + \beta_{3}z. In Equation 5, we see the two possibilities.

\begin{align} &z=0: \; \text{OR}_{z=0} = e^{\beta_{1} + \beta_{3}(0)} = e^{\beta_{1}} \\ &z=1: \; \text{OR}_{z=1} = e^{\beta_{1} + \beta_{3}(1)} = e^{\beta_{1} + \beta_{3}} \end{align} \tag{5}

If \beta_{3} = 0 then \text{OR}_{z=0} = \text{OR}_{z=1} and Z is not an effect modifier. The odds ratio of remission for CBT, x=1, compared to placebo, x=0, is the same for both levels of Z.

With the flexibility of a LR model, we can use a Wald test or a likelihood ratio test to test the null hypothesis H_{0}: \beta_{3} = 0. If we have evidence to reject H_{0}, then Z is an effect modifier.

Before we do this, we consider the geometry of effect modification for binary variables X and Z.

Consider first a model with only the main effects (no interaction term).

\log{\frac{\pi (x,z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{1}x + \beta_{2}z

We have the two cases z=0 and z=1, shown in Equation 6.

\begin{align} &z=0: \; \log{\frac{\pi (x,0)}{1 - \pi (x,0)}} = \beta_{0} + \beta_{1}x \\ &z=1: \; \log{\frac{\pi (x,1)}{1 - \pi (x,1)}} = (\beta_{0} + \beta_{2}) + \beta_{1}x \end{align} \tag{6}

The slopes are the same (parallel lines). The intercept is different if \beta_{2} \ne 0.

For a model with an interaction term, shown in Equation 3, we have Equation 7.

\begin{align} &z=0: \; \log{\frac{\pi (x,0)}{1 - \pi (x,0)}} = \beta_{0} + \beta_{1}x \\ &z=1: \; \log{\frac{\pi (x,1)}{1 - \pi (x,1)}} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{3})x \end{align} \tag{7}

If \beta_{3} \ne 0 (evidence against H_{0}: \, \beta_{3}=0) then the two slopes are different (non-parallel lines). If \beta_{2} \ne 0, the intercepts are also different. We will see this in action a bit later.

We use the CBT for MDD example again and fit a model using the glm function and assign the result to the variable model. Note the use of the multiplication symbol, *: in an R formula, cbt * onset_age expands to the two main effects plus their interaction (cbt:onset_age), so the duplicated main effects in the formula below are harmless.

Code
model <- glm(
  remit ~ cbt + onset_age + cbt * onset_age,
  data = dfr,
  family = binomial(link = "logit")
)

We look at the coefficient table below.

Code
round(summary(model)$coefficients, digits = 3)
                Estimate Std. Error z value Pr(>|z|)
(Intercept)        0.359      0.214   1.678    0.093
cbt1               0.596      0.318   1.873    0.061
onset_age1        -0.359      0.301  -1.196    0.232
cbt1:onset_age1    1.096      0.480   2.283    0.022

We can write our model as shown in Equation 8.

\begin{align} &\log{\left( \frac{\hat{\pi}(x,z)}{1 - \hat{\pi}(x,z)}\right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x + \hat{\beta}_{2}z + \hat{\beta}_{3}xz \\ &\log{\left( \frac{\hat{\pi}(x,z)}{1 - \hat{\pi}(x,z)}\right)} = 0.359 + 0.596 x - 0.359 z + 1.096 xz \end{align} \tag{8}

The coefficient table shows the result of the Wald test of H_{0}: \, \beta_{3}=0. The test statistic is Z=2.283, with a p value less than 0.05. The age of onset group does modify the effect of treatment on remitter status (the association between the explanatory variable and the response variable). From the table-method analysis earlier in this chapter, the Woolf test statistic was X_{W}^{2}=5.211 with a p value of 0.022, in agreement with the Wald test from the LR model.

Now that we have shown that Z is an effect modifier, we have to describe the association between X and Y for each of the strata of Z. We rewrite Equation 8 by factoring out x, as shown in Equation 9.

\log{\frac{\hat{\pi}(x,z)}{1 - \hat{\pi}(x,z)}} = 0.359 - 0.359 z + (0.596 + 1.096 z)x \tag{9}

For z=0 (age of onset before 30), we have that the slope (log odds ratio) for x is 0.596. Therefore \log{\widehat{\text{OR}}_{z=0}} = 0.596 \rightarrow \widehat{\text{OR}}_{z=0} = e^{0.596}=1.81. For those with an age of onset of MDD younger than 30 years, the odds of remitting for those in the CBT group are 1.81 - 1 = 0.81 = 81% greater than for those in the placebo group.

For z=1 (age of onset at 30 years or later), we have that the slope for x is 0.596+1.096=1.692. The \log{\widehat{\text{OR}}_{z=1}} = 1.692 \rightarrow \widehat{\text{OR}}_{z=1} = e^{1.692}=5.43. For those with an age of onset of MDD at 30 years or older, the odds of remitting for those in the CBT group are 5.43 - 1 = 4.43 = 443% greater than for those in the placebo group.
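As a quick check, the same point estimates can be computed directly from the fitted coefficients. The sketch below uses the coefficient names printed in the table above.

Code
b <- coef(model)
# OR for CBT vs placebo when onset age < 30 (z = 0)
round(exp(b["cbt1"]), digits = 2)
# OR for CBT vs placebo when onset age >= 30 (z = 1)
round(exp(b["cbt1"] + b["cbt1:onset_age1"]), digits = 2)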

We require confidence intervals or we need to do hypothesis testing. The emmeans (estimated marginal means) library in R can help us do the calculations.

First, we need to create an emmGrid object using the emmeans function (you can read the documentation to learn more about grid objects). The formula, ~ cbt | onset_age, reads, X conditional on Z.

Code
emmgrid_mdd <- emmeans::emmeans(
  model,
  ~ cbt | onset_age
)

We use the contrast function to do our calculations.

Code
emmeans::contrast(
  emmgrid_mdd,
  "consec", # CBT at level 1 then level 0
  name = "cbt", # Optional (easier to find in results)
  infer = c(TRUE, TRUE), # Hypothesis test & confidence intervals
  level = 0.95
)
onset_age = 0:
 cbt         estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cbt1 - cbt0    0.596 0.318 Inf   -0.0276      1.22   1.873  0.0610

onset_age = 1:
 cbt         estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cbt1 - cbt0    1.692 0.359 Inf    0.9876      2.40   4.709  <.0001

Results are given on the log odds ratio (not the response) scale. 
Confidence level used: 0.95 

We see a repeat of our estimates (estimate) for the two strata (levels of Z). We also see the standard errors (SE), the degrees of freedom (df), lower and upper bounds of the confidence intervals (asymp.LCL and asymp.UCL), Wald statistic (z.ratio), and a p value (p.value).

The emmeans package also allows us to see the geometry, using the emmip function, with the result in Figure 1.

Code
emmeans::emmip(
  model,
  ~ cbt | onset_age
)

Figure 1: Geometry of the Association across the Levels of the Effect Modifier

As discussed in the geometry section, note that the intercepts and slopes are different. The y axis is linear (in log odds).

We unfortunately have to compute the antilogs by hand. Below, we use the exp function to calculate the antilog values and we also use the round function, with the keyword argument digits set to 2, to print two decimal places.

Code
round(exp(c(0.596, -0.0276, 1.22)), digits = 2)
[1] 1.81 0.97 3.39

We have that \widehat{\text{OR}}_{z=0} = e^{0.596} = 1.81 and 95% CI is e^{-0.0276},e^{1.22} = (0.97,3.39). The same syntax is used for z=1.

Code
round(exp(c(1.692, 0.9876, 2.4)), digits = 2)
[1]  5.43  2.68 11.02

Here, we have that \widehat{\text{OR}}_{z=1} = e^{1.692} = 5.43 and the 95% CI is \left(e^{0.9876},e^{2.4}\right) = (2.68,11.02).
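To avoid retyping the estimates, the contrast results can also be converted to a data frame and exponentiated in one step. This is a sketch; it assumes the confidence-limit columns are named asymp.LCL and asymp.UCL, as in the printout above.

Code
ctr <- emmeans::contrast(
  emmgrid_mdd,
  "consec",
  name = "cbt",
  infer = c(TRUE, TRUE),
  level = 0.95
)
ctr_df <- as.data.frame(ctr) # one row per stratum of the effect modifier
round(exp(ctr_df[, c("estimate", "asymp.LCL", "asymp.UCL")]), digits = 2)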

4.2 Multilevel Effect Modifiers

We use the example of endometrial carcinoma and the use of continuous combined hormone replacement therapy (HRT) vs. placebo. Our possible effect modifier is body mass index (BMI), with three levels, which we summarize in Table 2.

Table 2: Endometrial Carcinoma Variable
Variable Type in problem Explanation
\texttt{cancer}: \, Y Response Y=1 is endometrial carcinoma and Y=0 is no carcinoma
\texttt{HRT}: \, X Explanatory variable x=1 for HRT and x=0 for placebo
\texttt{BMI}: \, Z Possible effect modifier z_{1} = 1 if BMI is 25-29 (0 otherwise) & z_{2}=1 if BMI is \ge 30 (0 otherwise)

We choose BMI <25 as the reference level for our potential effect modifier, Z. From this choice, we explore \pi (x,z_{1},z_{2}) = P(Y=1 \, \vert \, x , z_{1} , z_{2}). This is written with the logit link function for a logistic regression model in Equation 10. Note that we have an additional (j-1)(k-1) = (2-1)(3-1) = 2 interaction terms, with coefficients \beta_{4} and \beta_{5}, while \beta_{1}, \beta_{2}, and \beta_{3} are the coefficients (log odds ratios) for the main effects x, z_{1}, and z_{2}.

\log{\left(\frac{\pi (x,z_{1},z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = \beta_{0} + \beta_{1}x + \beta_{2} z_{1} + \beta_{3} z_{2} + \beta_{4} x z_{1} + \beta_{5} x z_{2} \tag{10}

The equation in Equation 10 can be rearranged algebraically, as shown in Equation 11.

\log{\frac{\pi (x,z_{1},z_{2})}{1 - \pi (x,z_{1},z_{2})}} = \beta_{0} + \beta_{2} z_{1} + \beta_{3} z_{2} + (\beta_{1} + \beta_{4} z_{1} + \beta_{5} z_{2}) x \tag{11}

Since we have more than one parameter, we use the likelihood ratio test (LRT), with H_{0}:\; \beta_{4} = \beta_{5} = 0 which states that Z is not an effect modifier. If \beta_{4} = \beta_{5} = 0, then the levels of Z do not modify the log odds ratio for X, which will be \beta_{1} alone.

In Equation 12, we see the three possibilities, BMI < 25, BMI 25 - 29, and BMI \ge 30.

\begin{align} &z_{1} = 0 , z_{2} = 0 : \; \log{\left(\frac{\pi (x, 0, 0)}{ 1 - \pi (x , 0 , 0)}\right)} = \beta_{0} + \beta_{1} x \\ &z_{1} = 1 , z_{2} = 0 : \; \log{\left(\frac{\pi (x, 1, 0)}{ 1 - \pi (x , 1 , 0)}\right)} = (\beta_{0} + \beta_{2}) + (\beta_{1} + \beta_{4}) x \\ &z_{1} = 0 , z_{2} = 1 : \; \log{\left(\frac{\pi (x, 0, 1)}{ 1 - \pi (x , 0 , 1)}\right)} = (\beta_{0} + \beta_{3}) + (\beta_{1} + \beta_{5}) x \end{align} \tag{12}

If \beta_{4} or \beta_{5} differs from 0, the corresponding slope differs from the reference slope \beta_{1} (non-parallel lines), and all three slopes differ from each other when, in addition, \beta_{4} \ne \beta_{5}. Similarly, if \beta_{2} or \beta_{3} differs from 0, the intercepts differ.

The data for this example is in the comma-separated file endo_data.csv, which we import as a tibble object assigned to the variable cancer.

Code
cancer <- import_dataset("endo_data.csv")
Code
cancer$BMI <- factor(
  cancer$BMI,
  levels = c("BMI < 25", "BMI 25-29", "BMI >= 30")
)

cancer$HRT <- factor(
  cancer$HRT,
  levels = c("never", "cont_comb")
)

The glm function creates the logistic regression model and fits the data. We assign the results to the variable model. As before, we use the multiplication symbol, *, to create the interaction terms.

Code
model <- glm(
  formula = cancer ~ HRT + BMI + HRT * BMI,
  data = cancer,
  family = binomial(link = "logit")
)

The coefficient table shows the fitted results.

Code
round(summary(model)$coefficients, digits = 3)
                          Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -2.586      0.124 -20.861    0.000
HRTcont_comb                 0.236      0.158   1.498    0.134
BMIBMI 25-29                 0.174      0.172   1.011    0.312
BMIBMI >= 30                 0.128      0.189   0.675    0.500
HRTcont_comb:BMIBMI 25-29   -0.489      0.238  -2.055    0.040
HRTcont_comb:BMIBMI >= 30   -1.713      0.284  -6.037    0.000

From the results above, we write the fitted model in Equation 13.

\log{\left(\frac{\pi (x , z_{1} , z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = -2.586 + 0.236 x + 0.174 z_{1} + 0.128 z_{2} - 0.489 x z_{1} - 1.713 x z_{2} \tag{13}

Before interpreting the model parameters, we should test H_{0}: \; \beta_{4} = \beta_{5} = 0. We already have the full model. Now we create a reduced model, so that we can use a likelihood ratio test (LRT).

Code
reduced_model <- glm(
  cancer ~ HRT + BMI, # No interaction term
  data = cancer,
  family = binomial(link = "logit")
)

The lrtest function performs the LRT.

Code
lmtest::lrtest(reduced_model, model)
Likelihood ratio test

Model 1: cancer ~ HRT + BMI
Model 2: cancer ~ HRT + BMI + HRT * BMI
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   4 -1483.7                         
2   6 -1464.6  2 38.291  4.845e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There are 6 parameters in the full model and 4 parameters in the reduced model, hence 6-4=2 degrees of freedom in the results. The test statistic G^{2} = 38.291 and the p value is <0.001. We reject the null hypothesis. BMI does modify the association between HRT and endometrial cancer.
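The same comparison can be run with base R's anova function, which should return an identical chi-squared statistic and p value.

Code
# Likelihood ratio (deviance) test comparing the nested models
anova(reduced_model, model, test = "Chisq")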

Our model is now as in Equation 14.

\begin{align} &\log{\left(\frac{\pi (x , z_{1} , z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = -2.586 + 0.236 x + 0.174 z_{1} + 0.128 z_{2} - 0.489 x z_{1} - 1.713 x z_{2} \\ &\log{\left(\frac{\pi (x , z_{1} , z_{2})}{1 - \pi (x,z_{1},z_{2})}\right)} = -2.586 + 0.174 z_{1} + 0.128 z_{2} + (0.236 - 0.489 z_{1} - 1.713 z_{2}) x \end{align} \tag{14}

The estimated log odds ratios and odds ratios for the three levels of Z are calculated in Equation 15.

\begin{align} \text{BMI} < 25: \; &\log{\widehat{\text{OR}}_{z_{1}=0,z_{2}=0}} = 0.236, \; \widehat{\text{OR}}_{z_{1}=0,z_{2}=0} = e^{0.236} = 1.27 \\ \text{BMI } 25-29: \; &\log{\widehat{\text{OR}}_{z_{1}=1,z_{2}=0}} = 0.236 - 0.489 = -0.253, \; \widehat{\text{OR}}_{z_{1}=1,z_{2}=0} = e^{-0.253} = 0.78 \\ \text{BMI} \ge 30: \; &\log{\widehat{\text{OR}}_{z_{1}=0,z_{2}=1}} = 0.236 - 1.713 = -1.477, \; \widehat{\text{OR}}_{z_{1}=0,z_{2}=1} = e^{-1.477} = 0.23 \end{align} \tag{15}
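As before, these stratum-specific odds ratios can be checked directly against the fitted coefficients; the sketch below uses the coefficient names from the summary table above.

Code
b <- coef(model)
# ORs for HRT vs never, at BMI < 25, BMI 25-29, and BMI >= 30
round(exp(c(
  b["HRTcont_comb"],
  b["HRTcont_comb"] + b["HRTcont_comb:BMIBMI 25-29"],
  b["HRTcont_comb"] + b["HRTcont_comb:BMIBMI >= 30"]
)), digits = 2)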

Filling in the appropriate values for x, z_{1}, and z_{2} in Equation 14, we see the geometry of the results (the coordinates of the points, with a line connecting them). The emmip function is used again, to provide Figure 2.

Code
emmeans::emmip(
  model,
  ~ HRT | BMI
)

Figure 2: Geometry of the Association across the Levels of the Effect Modifier

We see very different slopes. As before, we create an emmGrid object with the emmeans function and then use the contrast function to show the individual hypothesis tests and confidence intervals (now required, since we rejected the null hypothesis using the LRT). The results show the association between X and Y at each level of Z.

Code
model_emm <- emmeans::emmeans(
  model,
  ~ HRT | BMI
)

emmeans::contrast(
  model_emm,
  "consec",
  name = "HRT",
  infer = c(TRUE, TRUE),
  level = 0.95
)
BMI = BMI < 25:
 HRT               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cont_comb - never    0.236 0.158 Inf   -0.0729    0.5455   1.498  0.1342

BMI = BMI 25-29:
 HRT               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cont_comb - never   -0.253 0.178 Inf   -0.6014    0.0964  -1.419  0.1560

BMI = BMI >= 30:
 HRT               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cont_comb - never   -1.477 0.236 Inf   -1.9390   -1.0144  -6.261  <.0001

Results are given on the log odds ratio (not the response) scale. 
Confidence level used: 0.95 

We see the three sets of estimates, standard errors, degrees of freedom, lower and upper bounds of the 95% confidence intervals, the Z statistics, and the p values.

As before, we have to take the antilogs by hand (code).

Code
round(exp(c(0.236, -0.0729, 0.5455)), digits = 2)
[1] 1.27 0.93 1.73
Code
round(exp(c(-0.253, -0.6014, 0.0964)), digits = 2)
[1] 0.78 0.55 1.10
Code
round(exp(c(-1.477, -1.939, -1.0144)), digits = 2)
[1] 0.23 0.14 0.36

From the results, we see that in the cases for BMI <25 and BMI 25-29, there is not enough evidence to show an association between HRT and the development of endometrial carcinoma.

5 Continuous Numerical Effect Modifiers

We have looked at effect modification for binary and multi-level variables. Now we consider a numerical variable as possible effect modifier.

To investigate a continuous numerical variable as a potential effect modifier, we revisit our first illustrative example. Now, though, Z is a numerical variable. We use the data set on major depressive disorder (MDD) that we have used before. Our explanatory variable, X, is cognitive behavioral therapy (CBT) vs. placebo, and the possible effect modifier is age of onset (a numerical data type). The variable summary is shown in Table 3.

Table 3: Variable Summary
Variable Type in problem Explanation
Y Response Y=1 if remitter and Y=0 otherwise
X Explanatory variable x=1 if CBT and x=0 for placebo
Z Possible effect modifier Z = age of onset in years

Our aim is to investigate \pi (x,z) = P(Y=1 \, \vert \, x, z). It should be obvious that table methods cannot be used. Our model is now as written in Equation 16.

\begin{align} &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{1} x + \beta_{2} z + \beta_{3} xz \\ &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = \beta_{0} + \beta_{2} z + (\beta_{1} + \beta_{3} z) x \end{align} \tag{16}

Since z can take on many values, we have many possible slopes, (\beta_{1} + \beta_{3} z), for x. In Equation 17, we choose z=16 and z=29 as examples.

\begin{align} &z=16: \; \log{\text{OR}_{z=16}} = \beta_{1} + \beta_{3} (16), \; \text{OR}_{z=16} = e^{\beta_{1} + \beta_{3} (16)} \\ &z=29: \; \log{\text{OR}_{z=29}} = \beta_{1} + \beta_{3} (29), \; \text{OR}_{z=29} = e^{\beta_{1} + \beta_{3} (29)} \end{align} \tag{17}

If \beta_{3}=0, then z does not modify the slope of x. We therefore test H_{0}: \; \beta_{3} = 0.

In Equation 18, we write the model for the two example values of z.

\begin{align} &z=16: \; \log{\frac{\pi (x,16)}{1 - \pi (x,16)}} = [\beta_{0} + \beta_{2} (16)] + [\beta_{1} + \beta_{3} (16)] x \\ &z=29: \; \log{\frac{\pi (x,29)}{1 - \pi (x,29)}} = [\beta_{0} + \beta_{2} (29)] + [\beta_{1} + \beta_{3} (29)] x \end{align} \tag{18}

Many lines are possible. If \beta_{3} \ne 0 the lines are non-parallel. Intercepts are different too if \beta_{2} \ne 0.

The data is in the mdd_cbt_data.csv comma-separated values file. We import the data and assign it to the variable dfr.

Code
dfr <- import_dataset("mdd_cbt_data.csv")

The model is created in the usual way and we assign it again to the variable model.

Code
model <- glm(
  formula = remit ~ cbt + onset_age_num + cbt * onset_age_num,
  data = dfr,
  family = binomial(link = "logit")
)

The model is written in Equation 19, from the table of coefficients.

Code
round(summary(model)$coefficients, digits = 3)
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)          0.656      0.472   1.389    0.165
cbt                 -0.280      0.720  -0.390    0.697
onset_age_num       -0.016      0.015  -1.069    0.285
cbt:onset_age_num    0.047      0.023   2.009    0.045

\begin{align} &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = 0.656 - 0.280 x - 0.016 z + 0.047 x z \\ &\log{\frac{\pi (x, z)}{1 - \pi (x,z)}} = 0.656 - 0.016 z + (-0.280 + 0.047 z) x \end{align} \tag{19}

In this case, we read the Wald statistic for H_{0}:\; \beta_{3} = 0 from the table, with Z=2.009 and a p value of 0.045. The data provide evidence that age of onset in years modifies the association between treatment and remitter status.

In the geometry shown in Figure 3, we choose three different values for the effect modifier, Z, using the at keyword argument.

Code
emmeans::emmip(
  model,
  ~ cbt | onset_age_num,
  at = list(onset_age_num = c(20, 29.8, 40.3))
)

Figure 3: Geometry of the Association at Selected Values of the Effect Modifier

This gives us a visual indication of the effect modification of age.

One way to describe the effect modification numerically is to choose three values as with the geometry. The three values can be the first, second (median), and third quartiles, which we calculate below, using the quantile function.

Code
quantile(
  dfr$onset_age_num,
  probs = c(0.25, 0.5, 0.75)
)
 25%  50%  75% 
20.0 29.8 40.3 

The contrast function shows the results.

Code
model_emm <- emmeans::emmeans(
  model,
  ~ cbt | onset_age_num,
  at = list(onset_age_num = c(20, 29.8, 40.3))
)

emmeans::contrast(
  model_emm,
  "consec",
  name = "cbt",
  infer = c(TRUE, TRUE),
  level = 0.95
)
onset_age_num = 20.0:
 cbt         estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cbt1 - cbt0     0.66 0.318 Inf    0.0368      1.28   2.076  0.0379

onset_age_num = 29.8:
 cbt         estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cbt1 - cbt0     1.12 0.238 Inf    0.6543      1.59   4.710  <.0001

onset_age_num = 40.3:
 cbt         estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 cbt1 - cbt0     1.61 0.355 Inf    0.9193      2.31   4.553  <.0001

Results are given on the log odds ratio (not the response) scale. 
Confidence level used: 0.95 

We see the estimates and their confidence intervals, and the test statistics and p values.

The antilogs must be done by hand (code) as before.

Code
round(exp(c(0.66, 0.0368, 1.28)), digits = 2)
[1] 1.93 1.04 3.60
Code
round(exp(c(1.12, 0.6543, 1.59)), digits = 2)
[1] 3.06 1.92 4.90
Code
round(exp(c(1.61, 0.9193, 2.31)), digits = 2)
[1]  5.00  2.51 10.07

We see that the odds ratio for remission comparing CBT to placebo at the age of onset of MDD of 20 years is 1.93 (95% CI 1.04 to 3.6). As the age of onset increases, the odds ratios increase.
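These odds ratios can also be reproduced (up to rounding) directly from the fitted coefficients, since the log odds ratio at onset age z is the cbt slope plus z times the interaction slope.

Code
b <- coef(model)
# ORs for CBT vs placebo at onset ages 20, 29.8, and 40.3 years
round(exp(b["cbt"] + b["cbt:onset_age_num"] * c(20, 29.8, 40.3)), digits = 2)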

We could write: Researchers wanted to investigate whether there is an association between treatment types (CBT vs. placebo) and remission of MDD. They needed to account for the age of onset of disease as a possible effect modifier. A Wald test showed that age of onset was indeed an effect modifier. We could then continue by describing the associations with confidence intervals and/or hypothesis tests.

In conclusion, we have looked at all three data types for the possible effect modifier. In general, we start by using the Wald or likelihood ratio test for the null hypothesis that the slopes of the interaction terms are all 0.

If significant, we then measure the association between X and Y at all levels of Z (at selected values in the case of a numerical variable).

If the Wald test or LRT is not significant, we drop the interaction term and refit the model with only the main effects, X and Z.

6 Logistic Regression Model for Confounders

We remind ourselves of the definition of a confounder. A variable, Z, is a confounder of the association between the primary exposure and the response if the association between the primary exposure, X, and the response, Y, is biased (distorted) because Z is associated with the primary exposure and causally associated with the response. To be a confounder, Z cannot be in the causal pathway from X to Y. If Z is on the causal pathway, then it is termed a mediator.

We return to our fun simulated research of the association between baldness, X (with E being bald and \overline{E} not being bald) and the development of chronic heart disease (CHD) (with D being the presence of CHD and \overline{D} being no CHD). Our potential confounder is age group, Z, with z=1 for the older group, and z=0 for the younger group.

Stratification can get out of hand with many levels for Z. We can use generalized linear models instead, where we follow these steps for a LR model approach to examining a confounder.

  1. To obtain the crude measures of association, we fit a model with X and Y (without Z). The slope for X is the crude measure of association.
  2. Test to see if there is an association between X and Z (we can use a LR model or table methods).
  3. Test to see if Z is causally associated with Y among the unexposed (we can use a LR model or table methods).
  4. Is Z in the causal pathway from X to Y?
  5. Is X associated with Y?

Once these steps are satisfied, we fit a model that includes both X and Z and compute the adjusted measure of association between X and Y. The slope for X is the adjusted measure of association.

Finally, we compare the crude measures to the adjusted measures.

The data file, bald_chd_data_file.csv, is imported below, and assigned to the variable chd.

Code
chd <- import_dataset("bald_chd_data_file.csv")
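Before fitting the models, steps 2 and 3 listed above can be checked with simple table methods. The sketch below assumes that bald, age, and CHD are 0/1 indicator columns in chd, as used in the models that follow.

Code
# Step 2: is baldness (X) associated with age group (Z)?
chisq.test(table(chd$bald, chd$age))

# Step 3: is age group (Z) associated with CHD (Y) among the non-bald (unexposed)?
chisq.test(table(chd$age[chd$bald == 0], chd$CHD[chd$bald == 0]))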

Now, we fit the crude model.

Code
crude_model <- glm(
  CHD ~ bald,
  data = chd,
  family = binomial(link = "logit")
)

We view the table of coefficients.

Code
round(summary(crude_model)$coefficients, digits = 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -3.367      0.060 -55.981        0
bald           1.123      0.133   8.460        0

The antilogs of the crude estimates are calculated below.

Code
round(exp(coef(crude_model)), digits = 3)
(Intercept)        bald 
      0.034       3.074 

We see that \widehat{\text{OR}}_{\text{crude}} = 3.07. We also calculate the confidence intervals.

Code
round(exp(confint.default(crude_model, level = 0.95)), digits = 3)
            2.5 % 97.5 %
(Intercept) 0.031  0.039
bald        2.370  3.987

We see that the interval for bald does not contain 1, the value under H_{0}: \; \text{OR} = 1.

Now, we fit the adjusted model.

Code
adjusted_model <- glm(
  CHD ~ bald + age,
  data = chd,
  family = binomial(link = "logit")
)

We see the table of coefficients.

Code
round(summary(adjusted_model)$coefficients, digits = 3)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -3.384      0.060 -55.995    0.000
bald           0.025      0.205   0.121    0.904
age            3.431      0.259  13.228    0.000

Using the antilog, we see the adjusted odds ratio for baldness.

Code
round(exp(coef(adjusted_model)), digits = 3)
(Intercept)        bald         age 
      0.034       1.025      30.894 

We see \widehat{\text{OR}}_{\text{after adjusting for age group}} = 1.025. We calculate the confidence intervals below.

Code
round(exp(confint.default(adjusted_model)), digits = 3)
             2.5 % 97.5 %
(Intercept)  0.030  0.038
bald         0.686  1.533
age         18.584 51.360

The odds ratios for CHD, comparing the bald to the non-bald group, are shown in Equation 20.

\begin{align} &\widehat{\text{OR}}_{\text{crude}} = 3.07 \text{ with } 95\% \text{CI } (2.37, 3.99) \\ &\widehat{\text{OR}}_{\text{adjusted}} = 1.03 \text{ with } 95\% \text{CI } (0.69, 1.53) \end{align} \tag{20}

We see very different results and conclude that age group is a confounder. We therefore report the adjusted measure of association.

Note, we assumed here that the requirements for testing a confounder have been met. We did this in chapter 5. Notice also how the results are the same as the Mantel-Haenszel estimates from chapter 5. This will not always be the case.

7 Lab Problems

7.1 Question

Reconsider the lab work that was completed in lecture 5. You are a member of a funded group conducting research into maternal perinatal complications in a disadvantaged community. You investigate the association between maternal age group and maternal perinatal complications. You also consider human immunodeficiency virus (HIV) status as possible effect modifier.

  • There are two levels for the maternal age group variable \texttt{AgeGroup}: Teenage (E or 1) and Older (\overline{E} or 0)
  • There are two levels to the maternal perinatal complication variable \texttt{Complication}: Yes (D or 1) and No (\overline{D} or 0)
  • There are two levels to the maternal HIV status \texttt{HIV}: Positive (1) and Negative (0)

Consider the association between maternal age group and maternal perinatal complications and HIV status as possible effect modifier.

As review of stratified analysis, use table methods to determine if HIV status is an effect modifier of the association between maternal age group and complications. Write conclusions on the associations and effect modification, based on the outcome of appropriate tests.

Create a logistic regression model to investigate the interaction between maternal age group and HIV status. As revision, perform a likelihood ratio test to investigate the model. Write your conclusions based on the logistic regression model. Compare the results to those obtained using stratified analysis (table methods).

The data for this lab problem are in a spreadsheet file named PerinatalComplications.csv, which is imported below and assigned to the variable comp.

Code
comp <- import_dataset("PerinatalComplications.csv")

Generate Summary Statistics and Data Visualization

First, we generate frequency tables for each of the variables of interest.

Code
xtabs(
  ~ AgeGroup,
  data = comp
)
AgeGroup
  Older Teenage 
    146     125 

In terms of \texttt {AgeGroup}, more women identify as Older compared to Teenage (146 vs. 125 participants).

Code
xtabs(
  ~ Complication,
  data = comp
)
Complication
 No Yes 
184  87 

In terms of \texttt {Complication}, more women reported not having maternal perinatal complications (184 vs. 87 participants).

Code
xtabs(
  ~ HIV,
  data = comp
)
HIV
Negative Positive 
     147      124 

In terms of \texttt {HIV}, there were more women who were HIV negative compared to positive (147 vs. 124 women).

To generate cross-tabulated results between \texttt {AgeGroup} and \texttt {Complication} by the two levels of \texttt {HIV}, we use the factor function to order each of the categorical predictors such that the reference level is listed first in the levels vector.

Code
comp$AgeGroup <- factor(comp$AgeGroup, levels = c("Older", "Teenage"))
comp$Complication <- factor(comp$Complication, levels = c("No", "Yes"))
comp$HIV <- factor(comp$HIV, levels = c("Negative", "Positive"))
Code
lab_stratified_tables <- table(comp$AgeGroup, comp$Complication, comp$HIV)
lab_stratified_tables
, ,  = Negative

         
          No Yes
  Older   65  14
  Teenage 63   5

, ,  = Positive

         
          No Yes
  Older   43  24
  Teenage 13  44

For women who are HIV negative, Older women demonstrate higher frequencies of perinatal maternal complications compared to Teenage women (14 vs 5 participants).

For women who are HIV positive, Teenage women demonstrate higher frequencies of perinatal maternal complications compared to Older women (44 vs. 24 participants).

Conduct Stratified Analysis to Assess for Effect Modification

To determine if HIV status is an effect modifier of the association between maternal age group and complications, we conduct the Woolf test.

Code
DescTools::WoolfTest(lab_stratified_tables[c(2,1), c(2,1),])

    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  lab_stratified_tables[c(2, 1), c(2, 1), ]
X-squared = 16.788, df = 1, p-value = 4.18e-05

Since the Woolf test shows a test statistic of 16.788 and corresponding p value of <0.001, we conclude that HIV status is an effect modifier and report separate measures of association with confidence intervals.

As such, we conduct separate Pearson’s \chi^{2} tests.

Code
negative <- comp %>% 
  filter(HIV == "Negative") %>% data.frame()
positive <- comp %>% 
  filter(HIV == "Positive") %>% data.frame()
Code
negative_table <- table(negative$AgeGroup, negative$Complication)
positive_table <- table(positive$AgeGroup, positive$Complication)
Code
negative_table
         
          No Yes
  Older   65  14
  Teenage 63   5
Code
epitools::oddsratio.wald(
  negative_table
)
$data
         
           No Yes Total
  Older    65  14    79
  Teenage  63   5    68
  Total   128  19   147

$measure
         odds ratio with 95% C.I.
           estimate     lower    upper
  Older   1.0000000        NA       NA
  Teenage 0.3684807 0.1253458 1.083228

$p.value
         two-sided
          midp.exact fisher.exact chi.square
  Older           NA           NA         NA
  Teenage 0.06603975   0.08402126  0.0617098

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
Code
epitools::oddsratio.wald(
  positive_table
)
$data
         
          No Yes Total
  Older   43  24    67
  Teenage 13  44    57
  Total   56  68   124

$measure
         odds ratio with 95% C.I.
          estimate    lower    upper
  Older   1.000000       NA       NA
  Teenage 6.064103 2.738134 13.43007

$p.value
         two-sided
            midp.exact fisher.exact   chi.square
  Older             NA           NA           NA
  Teenage 3.799353e-06 4.728884e-06 3.956588e-06

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

There is evidence at the 5% significance level that HIV status is an effect modifier of the association between age group and complications. We report individual measures of association (ORs with CIs) and do hypothesis testing using Pearson’s \chi^{2} test or the G^{2} test.

Code
chisq.test(negative_table)

    Pearson's Chi-squared test with Yates' continuity correction

data:  negative_table
X-squared = 2.6303, df = 1, p-value = 0.1048
Code
chisq.test(positive_table)

    Pearson's Chi-squared test with Yates' continuity correction

data:  positive_table
X-squared = 19.648, df = 1, p-value = 9.31e-06

There is an association between age group and complications in the positive group only.

Generate a Logistic Regression Model with Interaction

Code
lab_model <- glm(
  formula = Complication ~ AgeGroup * HIV,
  data = comp,
  family = binomial(link = "logit")
)

We also create an intercept-only model.

Code
lab_intercept_model <- glm(
  Complication ~ 1,
  data = comp,
  family = binomial(link = "logit")
)

A likelihood ratio test (LRT) compares the nested (intercept-only) model and the full model.

Code
lmtest::lrtest(
  lab_intercept_model,
  lab_model
)
Likelihood ratio test

Model 1: Complication ~ 1
Model 2: Complication ~ AgeGroup * HIV
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   1 -170.09                         
2   4 -129.08  3 82.022  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p value of the LRT is less than 0.001, so the model with age group, HIV status, and their interaction fits better than the intercept-only model.

We can use the coefficients to write our model.

Code
round(lab_model$coefficients, digits = 4)
                (Intercept)             AgeGroupTeenage 
                    -1.5353                     -0.9984 
                HIVPositive AgeGroupTeenage:HIVPositive 
                     0.9522                      2.8008 

\begin{align} &\log{\left( \frac{\pi ( \mathbf{x} )}{1 - \pi ( \mathbf{x} )} \right)} = - 1.5353 - 0.9984 x + 0.9522 z + 2.8008 xz \\ &\log{\left( \frac{\pi ( \mathbf{x} )}{1 - \pi ( \mathbf{x} )} \right)} = - 1.5353 + 0.9522 z + \left( - 0.9984 + 2.8008 z \right) x \end{align}

For z=0 (HIV negative), we have that the slope (log odds ratio) for x is -0.9984. Therefore \log{\widehat{\text{OR}}_{z=0}} = -0.9984 \rightarrow \widehat{\text{OR}}_{z=0} = e^{-0.9984}=0.3685. For those who are HIV negative, the odds of complications for those in the Teenage group are 1.0 - 0.3685 = 0.6315 = 63.2% lower than for those in the Older group. Note how this is the same as the odds ratio using table methods.

For z=1 (HIV positive), we have that the slope for x is -0.9984+2.8008=1.8024. The \log{\widehat{\text{OR}}_{z=1}} = 1.8024 \rightarrow \widehat{\text{OR}}_{z=1} = e^{1.8024}=6.0642. For those who are HIV positive, the odds of complications for those in the Teenage group are 6.064 - 1 = 5.064 = 506% greater than for those in the Older group. Note how this is the same as the odds ratio using table methods.
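The two odds ratios quoted above can be recovered directly from the model coefficients, as sketched below using the coefficient names from the output.

Code
b <- coef(lab_model)
# HIV negative stratum (z = 0): OR for Teenage vs Older
round(exp(b["AgeGroupTeenage"]), digits = 4)
# HIV positive stratum (z = 1): OR for Teenage vs Older
round(exp(b["AgeGroupTeenage"] + b["AgeGroupTeenage:HIVPositive"]), digits = 4)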

The null hypothesis that \beta_{3}=0 is rejected at the 5% significance level.

Code
summary(lab_model)

Call:
glm(formula = Complication ~ AgeGroup * HIV, family = binomial(link = "logit"), 
    data = comp)

Coefficients:
                            Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -1.5353     0.2946  -5.211 1.88e-07 ***
AgeGroupTeenage              -0.9984     0.5502  -1.815   0.0696 .  
HIVPositive                   0.9522     0.3895   2.444   0.0145 *  
AgeGroupTeenage:HIVPositive   2.8008     0.6836   4.097 4.18e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 340.18  on 270  degrees of freedom
Residual deviance: 258.16  on 267  degrees of freedom
AIC: 266.16

Number of Fisher Scoring iterations: 5

We can calculate confidence intervals.

Code
emmgrid_lab <- emmeans::emmeans(
  lab_model,
  ~ AgeGroup | HIV
)
Code
emmeans::contrast(
  emmgrid_lab,
  "consec", # AgeGroup at level younger then level older
  name = "AgeGroup", # Optional (easier to find in results)
  infer = c(TRUE, TRUE), # Hypothesis test & confidence intervals
  level = 0.95
)
HIV = Negative:
 AgeGroup        estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Teenage - Older   -0.998 0.550 Inf     -2.08    0.0799  -1.815  0.0696

HIV = Positive:
 AgeGroup        estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Teenage - Older    1.802 0.406 Inf      1.01    2.5975   4.443  <.0001

Results are given on the log odds ratio (not the response) scale. 
Confidence level used: 0.95 
Code
emmeans::emmip(
  lab_model,
  ~ AgeGroup | HIV
)

Code
round(exp(c(-0.998, -2.08, 0.0799)), digits = 2)
[1] 0.37 0.12 1.08

For the HIV negative group, we have that \widehat{\text{OR}}_{z=0} = e^{-0.998} = 0.37 and 95% CI is e^{-2.08},e^{0.0799} = (0.12,1.08).

Code
round(exp(c(1.802, 1.01, 2.5975)), digits = 2)
[1]  6.06  2.75 13.43

For the HIV positive group, we have that \widehat{\text{OR}}_{z=1} = e^{1.802} = 6.06 and 95% CI is e^{1.01},e^{2.5975} = (2.75,13.43).

Both the stratified analysis and logistic regression methods produce the result that HIV status is an effect modifier.

7.2 Question

Consider the data that are available in the data set empirical_injury_infection_data.csv. You are investigating the effect of a novel empirical antibiotic (antibiotic given prior to surgery for established contamination), E, compared to an established prophylactic antibiotic, \overline{E}, on the development of surgical infection in the setting of penetrating abdominal trauma in five provinces in a Central African country. You are interested in the level of hollow-viscus injury as a potential effect modifier, Z. The potential effect modifier has three levels, summarized in Table 4.

Table 4: Levels of the Potential Effect Modifier
Levels of Z Data file encoding Values of dummy variables
No hollow viscus injury None z_{1}=0, z_{2}=0
Small bowel injury SBI z_{1}=1, z_{2}=0
Large intestine injury LBI z_{1}=0, z_{2}=1

The levels of X (antibiotic group) and Y (surgical infection) are summarized in Table 5.

Table 5: Levels of the Explanatory and Response Variable
Variable Name Description Levels or Units of Measurement
\texttt{Group} Type of antibiotic Established vs. Novel
\texttt{Infection} Infection (response) Yes vs. No

Determine if the level of hollow viscus injury is an effect modifier of the association between the type of empirical antibiotic and the development of infection. Comment on your decisions and the results.

The data are in the file empirical_injury_infection_data.csv and imported below, assigned to the variable lab.

Code
lab <- import_dataset("empirical_injury_infection_data.csv")

Generate Summary Statistics

Code
xtabs(
  ~ Group,
  data = lab
)
Group
Established       Novel 
       2400        3680 

Overall, there are more participants who receive the novel compared to the established antibiotic (3680 vs. 2400 participants).

Code
xtabs(
  ~ Infection,
  data = lab
)
Infection
  No  Yes 
5697  383 

More participants do not have an infection (5697 vs. 383).

Code
xtabs(
  ~ Injury,
  data = lab
)
Injury
 LBI None  SBI 
2184 2023 1873 

Most participants have a large intestine injury, followed by no hollow viscus injury and small bowel injury (2184 vs. 2023 vs. 1873).

Generate a Multivariable Logistic Regression with an Interaction Term Between Type of Antibiotic and Injury

Code
lab$Group <- factor(lab$Group, levels = c("Established", "Novel"))
lab$Infection <- factor(lab$Infection, levels = c("No", "Yes"))
lab$Injury <- factor(lab$Injury, levels = c("None", "SBI", "LBI"))
Code
lab_model <- glm(
  formula = Infection ~ Group * Injury,
  data = lab,
  family = binomial(link = "logit")
)
summary(lab_model)

Call:
glm(formula = Infection ~ Group * Injury, family = binomial(link = "logit"), 
    data = lab)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           -2.6150     0.1410 -18.549  < 2e-16 ***
GroupNovel             0.2834     0.1730   1.638   0.1014    
InjurySBI              0.2030     0.1845   1.101   0.2711    
InjuryLBI              0.1571     0.2009   0.782   0.4341    
GroupNovel:InjurySBI  -0.5360     0.2482  -2.159   0.0308 *  
GroupNovel:InjuryLBI  -1.7602     0.2925  -6.017 1.78e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2859.1  on 6079  degrees of freedom
Residual deviance: 2771.9  on 6074  degrees of freedom
AIC: 2783.9

Number of Fisher Scoring iterations: 6

The fitted model, using the coefficients above, is written below.

\begin{align} &\log{\left( \frac{\hat{\pi} ( \mathbf{x} )}{1 - \hat{\pi} ( \mathbf{x} )} \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x + \hat{\beta}_{2} z_{1} + \hat{\beta}_{3} z_{2} + \hat{\beta}_{4} x z_{1} + \hat{\beta}_{5} x z_{2} \\ &\log{\left( \frac{\hat{\pi} ( \mathbf{x} )}{1 - \hat{\pi} ( \mathbf{x} )} \right)} = -2.6150 + 0.2834 x + 0.2030 z_{1} + 0.1571 z_{2} - 0.5360 x z_{1} - 1.7602 x z_{2} \\ \\ &\log{\left( \frac{\hat{\pi} ( \mathbf{x} )}{1 - \hat{\pi} ( \mathbf{x} )} \right)} = \hat{\beta}_{0} + \hat{\beta}_{2} z_{1} + \hat{\beta}_{3} z_{2} + \left( \hat{\beta}_{1} + \hat{\beta}_{4} z_{1} + \hat{\beta}_{5} z_{2} \right) x \\ &\log{\left( \frac{\hat{\pi} ( \mathbf{x} )}{1 - \hat{\pi} ( \mathbf{x} )} \right)} = -2.6150 + 0.2030 z_{1} + 0.1571 z_{2} + ( 0.2834 - 0.5360 z_{1} - 1.7602 z_{2} ) x \end{align}
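As a check on the factored form above, the stratum-specific odds ratios for the novel versus the established antibiotic can be computed directly from the coefficients; they should match the back-transformed emmeans contrasts further below.

Code
b <- coef(lab_model)
# ORs for Novel vs Established at injury levels None, SBI, and LBI
round(exp(c(
  b["GroupNovel"],
  b["GroupNovel"] + b["GroupNovel:InjurySBI"],
  b["GroupNovel"] + b["GroupNovel:InjuryLBI"]
)), digits = 2)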

We test the null hypothesis H_{0}: \; \beta_{4} = \beta_{5} = 0 with a likelihood ratio test.

Code
lab_reduced_model <- glm(
  formula = Infection ~ Group + Injury, # No interaction term
  data = lab,
  family = binomial(link = "logit")
)
Code
lmtest::lrtest(
  lab_reduced_model,
  lab_model
)
Likelihood ratio test

Model 1: Infection ~ Group + Injury
Model 2: Infection ~ Group * Injury
  #Df  LogLik Df  Chisq Pr(>Chisq)    
1   4 -1405.1                         
2   6 -1385.9  2 38.329  4.754e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
emmeans::emmip(
  lab_model,
  ~ Group | Injury
)

Code
lab_model_emm <- emmeans::emmeans(
  lab_model,
  ~ Group | Injury
)

emmeans::contrast(
  lab_model_emm,
  "consec",
  name = "Group",
  infer = c(TRUE, TRUE),
  level = 0.95
)
Injury = None:
 Group               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Novel - Established    0.283 0.173 Inf   -0.0557    0.6226   1.638  0.1014

Injury = SBI:
 Group               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Novel - Established   -0.253 0.178 Inf   -0.6014    0.0964  -1.419  0.1560

Injury = LBI:
 Group               estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Novel - Established   -1.477 0.236 Inf   -1.9390   -1.0144  -6.261  <.0001

Results are given on the log odds ratio (not the response) scale. 
Confidence level used: 0.95 
Code
none <- c(0.283, -0.0557, 0.6226)
sbi <- c(-0.253, -0.6014,  0.0964)
lbi <- c(-1.477, -1.9390, -1.0144)
Code
exp(none)
[1] 1.3271052 0.9458228 1.8637675
Code
exp(sbi)
[1] 0.7764679 0.5480438 1.1011995
Code
exp(lbi)
[1] 0.2283216 0.1438477 0.3626199

We are comparing a novel drug to an established drug. In the case of LBI, there is a 1.0-0.2283=0.7717=77.2% lower odds of infection for those receiving the novel antibiotic compared to those receiving the established antibiotic. This is a significant reduction. The CIs for the ORs for the other two types of injury contain the null OR of 1. On this evidence, we can recommend the novel antibiotic as empirical therapy for LBI.

7.3 Question

As part of a group that plans interventions among gang members to reduce the rate of human immunodeficiency virus (HIV) sero-conversion, you investigate the association between membership of a gang (variable named Gang in the data set) and sero-conversion (variable named HIV in the data set). You consider the age of onset of smoking as a potential effect modifier.

Your group manages data for aid agencies in South Africa and in Brazil. Complete the analyses using a logistic regression model to investigate the interaction between gang membership and age of onset of smoking, and comment on your decisions and results for the two countries.

Note that the data are in the files gangs_hiv_africa.csv and gangs_hiv_brazil.csv.

Conduct Analysis for South Africa

The South African data are in the file gangs_hiv_africa.csv. Level the \texttt{Gang} and \texttt{HIV} variables such that the reference groups are listed first in the levels vector. Then, construct a logistic regression model with an interaction term between \texttt{Gang} and \texttt{Age}.

Code
za <- import_dataset("gangs_hiv_africa.csv")
Code
za$Gang <- factor(za$Gang, levels = c("No", "Yes"))
za$HIV <- factor(za$HIV, levels = c("Negative", "Positive"))
Code
za_model <- glm(
  formula = HIV ~ Gang * Age,
  data = za,
  family = binomial(link = "logit")
)
summary(za_model)

Call:
glm(formula = HIV ~ Gang * Age, family = binomial(link = "logit"), 
    data = za)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.918324   1.108334   0.829    0.407
GangYes     -0.724440   1.696562  -0.427    0.669
Age          0.004614   0.079664   0.058    0.954
GangYes:Age -0.059248   0.119642  -0.495    0.620

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 172.57  on 126  degrees of freedom
Residual deviance: 155.20  on 123  degrees of freedom
AIC: 163.2

Number of Fisher Scoring iterations: 4

Next, construct a model with covariates \texttt {Gang} and \texttt {Age}, but no interaction term.

Code
za_model_no_interaction <- glm(
  formula = HIV ~ Gang + Age,
  data = za,
  family = binomial(link = "logit")
)
summary(za_model_no_interaction)

Call:
glm(formula = HIV ~ Gang + Age, family = binomial(link = "logit"), 
    data = za)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.27605    0.85031   1.501    0.133    
GangYes     -1.54495    0.39130  -3.948 7.87e-05 ***
Age         -0.02169    0.05936  -0.365    0.715    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 172.57  on 126  degrees of freedom
Residual deviance: 155.45  on 124  degrees of freedom
AIC: 161.45

Number of Fisher Scoring iterations: 4

Then, conduct a likelihood ratio test between the models with and without the interaction term.

Code
lmtest::lrtest(
  za_model,
  za_model_no_interaction
)
Likelihood ratio test

Model 1: HIV ~ Gang * Age
Model 2: HIV ~ Gang + Age
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   4 -77.600                     
2   3 -77.723 -1 0.2457     0.6201
Code
exp(za_model_no_interaction$coefficients)
(Intercept)     GangYes         Age 
  3.5824591   0.2133217   0.9785410 
Code
exp(confint.default(za_model_no_interaction))
                2.5 %     97.5 %
(Intercept) 0.6767051 18.9654450
GangYes     0.0990740  0.4593145
Age         0.8710677  1.0992745

Since the likelihood ratio test demonstrates a chi-squared statistic of 0.2457 and a p value greater than the 5\% significance level, we fail to reject the null hypothesis and conclude that the interaction term does not contribute to the model. Therefore, age of onset of smoking is not an effect modifier of the primary association between HIV status and gang membership. Hence, we report the associations from the model without the interaction term.

Adjusting for age of onset of smoking, the odds of HIV are 78.7\% lower for participants who were involved in a gang compared to those who were not (95\% CI of 0.099 to 0.459).

Conduct Analysis for Brazil

The Brazilian data are in the file gangs_hiv_brazil.csv. Level the \texttt{Gang} and \texttt{HIV} variables such that the reference groups are listed first in the levels vector. Then, construct a logistic regression model with an interaction term between \texttt{Gang} and \texttt{Age}.

Code
br <- import_dataset("gangs_hiv_brazil.csv")
Code
br$Gang <- factor(br$Gang, levels = c("No", "Yes"))
br$HIV <- factor(br$HIV, levels = c("Negative", "Positive"))
Code
br_lab_model <- glm(
  formula = HIV ~ Gang * Age,
  data = br,
  family = binomial(link = "logit")
)
summary(br_lab_model)

Call:
glm(formula = HIV ~ Gang * Age, family = binomial(link = "logit"), 
    data = br)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  10.1328     2.4342   4.163 3.14e-05 ***
GangYes      29.6178    17.5576   1.687   0.0916 .  
Age          -0.4285     0.1033  -4.148 3.35e-05 ***
GangYes:Age  -2.5544     1.2865  -1.986   0.0471 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 172.571  on 126  degrees of freedom
Residual deviance:  49.146  on 123  degrees of freedom
AIC: 57.146

Number of Fisher Scoring iterations: 8

The interaction term demonstrates a coefficient estimate of -2.5544 and a corresponding p value of 0.0471, which is less than the significance level of 0.05. Therefore, we can conclude that age of onset of smoking is an effect modifier in the primary association between HIV status and gang membership.

Since \texttt{Age} is a continuous variable, we use the quantiles to define cut-off points at which to report the effect modification. In general, however, cut-off points for a continuous variable should be chosen for practical or clinical relevance, or pre-defined in the research question.

Code
quantile(br$Age, probs = c(0.25, 0.5, 0.75))
25% 50% 75% 
 13  14  22 
Code
emmeans::emmip(
  br_lab_model,
  ~ Gang | Age,
  at = list(Age = c(13, 14, 22))
)

Code
model_emm_br <- emmeans::emmeans(
  br_lab_model,
  ~ Gang | Age,
  at = list(Age = c(13, 14, 22))
)

emmeans::contrast(
  model_emm_br,
  "consec",
  name = "Gang",
  infer = c(TRUE, TRUE),
  level = 0.95
)
Age = 13:
 Gang     estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Yes - No    -3.59  1.57 Inf     -6.67    -0.512  -2.286  0.0223

Age = 14:
 Gang     estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Yes - No    -6.14  1.45 Inf     -8.99    -3.295  -4.227  <.0001

Age = 22:
 Gang     estimate    SE  df asymp.LCL asymp.UCL z.ratio p.value
 Yes - No   -26.58 10.88 Inf    -47.91    -5.248  -2.442  0.0146

Results are given on the log odds ratio (not the response) scale. 
Confidence level used: 0.95 
Code
exp(c(-3.59, -6.67, -0.512))
[1] 0.027598330 0.001268399 0.599295788
Code
exp(c(-6.14, -8.99, -3.295))
[1] 0.0021549236 0.0001246501 0.0370680450
Code
exp(c(-26.58, -47.91, -5.248))
[1] 2.860571e-12 1.559378e-21 5.258024e-03

For participants who are 13 years of age (i.e., the 25th percentile of age), the odds of a positive HIV status are 97.2\% lower for those in a gang compared to those who are not in a gang. The 95\% CI for the odds ratio is (0.001, 0.599).

For participants who are 14 years of age (i.e., the 50th percentile of age), the odds of a positive HIV status are 99.8\% lower for those in a gang compared to those who are not in a gang. The 95\% CI for the odds ratio is (0.0001, 0.037).

For participants who are 22 years of age (i.e., the 75th percentile of age), the odds ratio is 2.86 \times 10^{-12}; the odds of a positive HIV status for those in a gang are essentially zero compared to the odds for those who are not in a gang. The 95\% CI for the odds ratio is (1.56 \times 10^{-21}, 0.005).
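
The percentage reductions quoted above follow from exponentiating the contrast estimates; a quick arithmetic check is shown below.

Code
# Percentage reduction in odds at ages 13 and 14
(1 - exp(-3.59)) * 100  # approximately 97.2
(1 - exp(-6.14)) * 100  # approximately 99.8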

7.4 Question

You are the biostatistician for a research project investigating the development of cancer of the cervix. In one of the subprojects, you investigate the association between current smoking status and the development of cervical carcinoma. You have also collected data on the number of sexual partners and consider two levels for this variable: fewer than two, and two or more sexual partners. You must consider whether this two-group number of sexual partners modifies the effect of the association between current smoking status and the development of cervical carcinoma, or whether the association is biased (confounded) by the number of sexual partners.

Prepare a thorough report for your funding agency oversight committee meeting. State your research questions, hypotheses, levels of significance and confidence, all your analysis decisions, and your results. Contrast table methods and a generalized linear model approach.

The data are in the file Cervical.csv and are imported below, assigned to the variable cerv.

Code
cerv <- import_dataset("Cervical.csv")

Research Questions

For this subproject, the primary research question concerns the association between current smoking status and the development of cervical carcinoma.

We then consider whether the two-group number of sexual partners modifies the effect of this association (effect modification) or whether the association is biased by the number of sexual partners (confounding).

Hypothesis

We hypothesize that the two-group number of sexual partners confounds the primary association between current smoking status and the development of cervical carcinoma.

The significance level is set at 0.05, with corresponding 95\% confidence intervals.

Generate Univariate Summary Statistics

Code
xtabs(
  ~ Smoke,
  data = cerv
)
Smoke
Non smoker     Smoker 
       386        278 
Code
xtabs(
  ~ Carcinoma,
  data = cerv
)
Carcinoma
 No Yes 
450 214 
Code
xtabs(
  ~ Partner,
  data = cerv
)
Partner
    Less Two plus 
     542      122 

Overall, there are more non-smokers than smokers, and more participants without cervical carcinoma than with it. Additionally, more participants have had fewer than two sexual partners than have had two or more.
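
For the report, these counts can also be expressed as proportions; one simple option, sketched below, is prop.table.

Code
# Relative frequencies for each univariate summary
prop.table(xtabs(~ Smoke, data = cerv))
prop.table(xtabs(~ Carcinoma, data = cerv))
prop.table(xtabs(~ Partner, data = cerv))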

Assess the Association Between Smoking Status and Cervical Carcinoma Development

Code
chisq.test(cerv$Smoke, cerv$Carcinoma)

    Pearson's Chi-squared test with Yates' continuity correction

data:  cerv$Smoke and cerv$Carcinoma
X-squared = 10.124, df = 1, p-value = 0.001464

The chi-square test between smoking status and carcinoma yields a test statistic of 10.124 and a p value of 0.001464, which is less than the significance level of 0.05. Therefore, we conclude that there is a statistically significant association between smoking status and the development of cervical carcinoma.

Assess for Effect Modification

First, we set the levels and assign reference groups.

Code
cerv$Smoke <- factor(cerv$Smoke, levels = c("Smoker", "Non smoker"))
cerv$Carcinoma <- factor(cerv$Carcinoma, levels = c("Yes", "No"))
cerv$Partner <- factor(cerv$Partner, levels = c("Less", "Two plus"))
Code
crude_table_lab <- xtabs(
  ~ Smoke + Carcinoma,
  data = cerv
)
crude_table_lab
            Carcinoma
Smoke        Yes  No
  Smoker     109 169
  Non smoker 105 281
Code
stratified_tables_lab <- table(
  cerv$Smoke, cerv$Carcinoma, cerv$Partner
)
stratified_tables_lab
, ,  = Less

            
             Yes  No
  Smoker      97 148
  Non smoker  93 204

, ,  = Two plus

            
             Yes  No
  Smoker      12  21
  Non smoker  12  77

When stratified by the two-group number of sexual partners, there are more smokers with cervical carcinoma in the fewer-than-two stratum than in the two-or-more stratum.
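
Before the formal test of homogeneity, the stratum-specific odds ratios can be computed by hand from the cross-products of the stratified tables above (the values in the comments are approximate).

Code
# Stratum-specific odds ratios (smoker vs non-smoker) from the cross-products
(97 * 204) / (148 * 93)  # fewer than two partners, approximately 1.44
(12 * 77) / (21 * 12)    # two or more partners, approximately 3.67

The two stratum-specific odds ratios differ noticeably in their point estimates, which is exactly what the Woolf test evaluates formally.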

Next, we formally compare the partial (stratum-specific) measures of association using the Woolf test of homogeneity of odds ratios.

Code
DescTools::WoolfTest(stratified_tables_lab)

    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  stratified_tables_lab
X-squared = 3.3716, df = 1, p-value = 0.06633

The Woolf test yields a chi-square statistic of 3.3716 and a corresponding p value that is greater than the significance level. We therefore fail to reject the null hypothesis of homogeneous stratum-specific odds ratios and conclude that the number of sexual partners is not an effect modifier. We proceed to evaluate the number of sexual partners as a potential confounder.

Assess for Confounding

Here, we let X represent smoking status (i.e., primary predictor), Z represent the two-group number of sexual partners (i.e., potential confounder), and Y represent the development of carcinoma (i.e., response variable).

To check one of the criteria for confounding, we assess the association between X and Z.

Code
smoke_partner_table_lab <- table(
  cerv$Smoke,
  cerv$Partner
)

smoke_partner_table_lab
            
             Less Two plus
  Smoker      245       33
  Non smoker  297       89
Code
chisq.test(smoke_partner_table_lab)

    Pearson's Chi-squared test with Yates' continuity correction

data:  smoke_partner_table_lab
X-squared = 12.749, df = 1, p-value = 0.0003562

We reject the null hypothesis. There is an association between the explanatory variable, X, and the potential confounder, Z.

Then, using the data set containing only the unexposed (the non-smokers), we assess the second criterion for confounding: a causal association between Z and Y.

Code
unexposed_lab <- cerv %>% 
  filter(Smoke == "Non smoker") %>% tibble()

Then, we generate a table of Z and Y in the unexposed group.

Code
cancer_partners_unexposed_lab <- table(
  unexposed_lab$Carcinoma,
  unexposed_lab$Partner
)
cancer_partners_unexposed_lab
     
      Less Two plus
  Yes   93       12
  No   204       77

Within the non-smoking group, most participants have had fewer than two sexual partners and no cervical carcinoma.

Next, we test for independence.

Code
chisq.test(cancer_partners_unexposed_lab)

    Pearson's Chi-squared test with Yates' continuity correction

data:  cancer_partners_unexposed_lab
X-squared = 10.112, df = 1, p-value = 0.001473

We reject the null hypothesis. There is an association between the potential confounder and the response variable in the unexposed group; that is, the number of sexual partners is related to the development of cervical carcinoma, satisfying the second criterion.

To satisfy the third criterion for confounding, we note that the number of sexual partners is not on the causal pathway between smoking and cervical carcinoma, that is, smoking does not cause the number of sexual partners (the potential confounder is not a mediator).

Since the three criteria for confounding are satisfied, the two-group number of sexual partners may bias the association between smoking status and the development of cervical carcinoma.

However, we must still determine whether the crude and the Mantel-Haenszel (MH) common measures of association differ by at least 10\% before concluding that the two-group number of sexual partners variable is a confounder.

\left\lvert \frac{\text{crude measure} - \text{adjusted measure}}{\text{crude measure}} \right\rvert
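
A small helper function, introduced here only for illustration, captures this calculation.

Code
# Hypothetical helper implementing the relative difference used in the 10% rule
relative_difference <- function(crude, adjusted) {
  abs((crude - adjusted) / crude)
}

# For example, relative_difference(1.73, 1.61) gives approximately 0.069,
# matching the odds ratio calculation further below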

Code
stratified_tables_lab
, ,  = Less

            
             Yes  No
  Smoker      97 148
  Non smoker  93 204

, ,  = Two plus

            
             Yes  No
  Smoker      12  21
  Non smoker  12  77
Code
epiR::epi.2by2(stratified_tables_lab)
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +          109          169        278     39.21 (33.43 to 45.22)
Exposed -          105          281        386     27.20 (22.82 to 31.93)
Total              214          450        664     32.23 (28.68 to 35.93)


Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio (crude)                         1.44 (1.16, 1.79)
Inc risk ratio (M-H)                           1.37 (1.10, 1.70)
Inc risk ratio (crude:M-H)                     1.05
Inc odds ratio (crude)                         1.73 (1.24, 2.40)
Inc odds ratio (M-H)                           1.61 (1.15, 2.23)
Inc odds ratio (crude:M-H)                     1.08
Attrib risk in the exposed (crude) *           12.01 (4.75, 19.26)
Attrib risk in the exposed (M-H) *             10.50 (1.80, 19.20)
Attrib risk (crude:M-H)                        1.14
-------------------------------------------------------------------
 M-H test of homogeneity of IRRs: chi2(1) = 4.136 Pr>chi2 = 0.042
 M-H test of homogeneity of ORs: chi2(1) = 3.361 Pr>chi2 = 0.067
 Test that M-H adjusted OR = 1:  chi2(1) = 8.019 Pr>chi2 = 0.002
 Wald confidence limits
 M-H: Mantel-Haenszel; CI: confidence interval
 * Outcomes per 100 population units 

Risk ratio

Code
(1.44 - 1.37) / 1.44
[1] 0.04861111

Odds ratio

Code
(1.73 - 1.61) / 1.73
[1] 0.06936416

Whether we calculate the relative difference using the risk ratio or the odds ratio, it is less than 10\%. Therefore, we conclude that the two-group number of sexual partners variable is not a confounder, and we report the crude measures of association.

As such, the odds of cervical carcinoma are 73\% greater for participants who smoke compared to those who do not (95\% CI of 1.24 to 2.40).
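
The oversight report also asks for a contrast with a generalized linear model approach. A minimal sketch of the corresponding adjusted logistic regression model is given below; the factor levels are re-ordered here so that glm models the probability of carcinoma, and the object names cerv_glm and cerv_model are introduced only for illustration (output not shown).

Code
# Re-level so that glm models P(Carcinoma = "Yes"), with non-smokers and
# fewer-than-two partners as the reference groups
cerv_glm <- cerv
cerv_glm$Carcinoma <- relevel(cerv_glm$Carcinoma, ref = "No")
cerv_glm$Smoke <- relevel(cerv_glm$Smoke, ref = "Non smoker")
cerv_glm$Partner <- relevel(cerv_glm$Partner, ref = "Less")

# Logistic regression model adjusting for the two-group number of sexual partners
cerv_model <- glm(
  Carcinoma ~ Smoke + Partner,
  data = cerv_glm,
  family = binomial(link = "logit")
)

# Adjusted odds ratios and Wald confidence intervals
exp(coef(cerv_model))
exp(confint.default(cerv_model))

The adjusted odds ratio for smoking from such a model can then be contrasted with the crude and Mantel-Haenszel adjusted odds ratios obtained from the table methods above.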