Chapter 5

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

In the previous notebooks, we considered mostly single predictor variables. First, we considered table methods of associations, and then we explored generalized linear models with single binary, multilevel, or continuous numerical explanatory variables.

In this notebook, we repeat our steps, starting with tables of observed data, and the measures of association that we know so well by now. These are risk difference, risk ratio, and odds ratio. This time, though, we consider another variable in reference to the explanatory variable, X, and the response variable, Y. This variable, Z, will be viewed as either an effect modifier or as a confounder. We want to interrogate the way that Z affects the association between X and Y.

As an example, consider a trial of a new antibiotic, E, to prevent surgical site sepsis following surgery to the colon (large intestine). These are procedures prone to infection due to the high load of bacteria in the colon. The control might be a standard antibiotic, \bar{E}. The two groups of prophylactic (preventative) antibiotics serve as the levels of the explanatory variable, X. The development, D, and absence of the development of postoperative infection, \bar{D}, are the two levels of the response variable Y. We might consider that smoking, Z, might influence this association and we would consider three levels of Z such as never smoked, a prior smoking history, or current smoking.

While we will concentrate on a single variable, Z, the concepts that we will explore, termed the principles of stratified analysis, extend to more than one variable.

To understand the principles of stratified analysis, we start to explore the two concepts of effect modification and confounding.

Key Words

Section 3. Defining Effect Modification and Confounding	effect modifier, confounder
Section 4. Stratified Analysis	stratified analysis, pooled table or marginal table, partial table or stratum specific tables or conditional tables, crude measures of association, conditional or partial or stratum specific measure of association

2 Libraries

Load the required libraries for Chapter 5. If these packages are not already installed, please do so before loading the libraries.

Code

library(rio)
library(tibble)
library(epiR)
library(epitools)
library(dplyr)

3 Definitions for Effect Modification and Confounders

3.1 Effect Modification

Effect modification is the simpler of the two concepts. A variable, Z, is an effect modifier of the association between the (primary) explanatory variable, X, and the response variable, Y, if the association between X and Y depends on the levels of Z. This means that the measures of association between X and Y are different at each level of Z.

3.2 Confounding

A variable, Z, is a confounder of the association between the (primary) explanatory variable, X, and the response variable, Y, if the association between X and Y is biased (distorted) because Z is associated with X and causally associated with Y. Note that for Z to be a confounder, it cannot be in the causal pathway between X and Y, though.

An example of a variable, Z, that is in the causal pathway might be the presence or absence of a high cholesterol level (hypercholesterolemia). If X is age group (say those younger and older than 65 years of age) and Z is the presence or absence of increased cholesterol levels, and Y is the presence or absence of ischemic heart disease, we could consider Z in the causal pathway between age group and a heart disease. As subjects get older, we see a higher risk of hypercholesterolemia, which can cause ischemic heart disease.

To be a confounder, therefore, Z must fulfill the following criteria.

Be associated with X
Be causally associated with Y
Not be in the causal pathway from X to Y

Now that we have an idea of effect modifiers and confounders, we can start to explore stratified analysis.

4 Stratified Analysis

Effect modification and confounding are examined using stratified analysis. The association between X and Y is determined at each level of Z.

Stratified analysis requires the creation of separate tables of observed data between X and Y at each level of Z.

The table of observed data for X and Y is referred to as the pooled table or the marginal table. In the pooled table Table 1, the levels of the explanatory variable, X, are exposure, E, and non-exposure \bar{E}. The levels of the response variable, Y, are disease, D, and no disease, \bar{D}.

Table 1: Pooled Table
POOLED TABLE	D	\bar{D}	Total
E	n_{11}	n_{12}	n_{1+}
\bar{E}	n_{21}	n_{22}	n_{2+}
Total	n_{+1}	n_{+2}	n

Separate tables at each level of Z are termed partial tables or stratum specific tables or conditional tables. In Table 2 and Table 3, we create two stratified tables for two levels of Z.

For level 1 of Z, we have the stratified table in Table 2. We see the extra subscript _{1} to denote that the frequencies are only for those observations in level 1 of Z.

Table 2: Stratified Table for First Level of Potential Effect Modifier
Stratified table 1	D	\bar{D}	Total
E	n_{111}	n_{121}	n_{1+1}
\bar{E}	n_{211}	n_{221}	n_{2+1}
Total	n_{+11}	n_{+21}	n_{1}

For level 2 of Z, we would have the stratified table in Table 3, and so on, if there are more levels in the sample space of Z. Note the use of the subscript _{2}.

Table 3: Stratified Table for Second Level of Potential Effect Modifier
Stratified table 2	D	\bar{D}	Total
E	n_{112}	n_{122}	n_{1+2}
\bar{E}	n_{212}	n_{222}	n_{2+2}
Total	n_{+12}	n_{+22}	n_{2}

We can now consider measures of association between X and Y, with and without Z.

If we ignore Z (using the pooled table), the measures of association, risk difference (RD), risk ratio (RR), and odds ratio (OR), are now referred to as crude measures of association, and for the estimates of these, we use the notation \widehat{\text{RD}}_{\text{crude}}, \widehat{\text{RR}}_{\text{crude}}, and \widehat{\text{OR}}_{\text{crude}}, respectively.

If we account for the k = \left\{ 1, 2, 3, 4, \ldots , K \right\} levels of Z, we will use the notation \widehat{\text{RD}}_{k}, \widehat{RR}_{k}, and \widehat{\text{OR}}_{k}, respectively. These are termed the conditional or partial or stratum specific measures of association.

5 Stratified Analysis for Effect Modification

We always start by evaluating whether Z is an effect modifier.

If all the conditional measures of association between X and Y are equal, such as \widehat{\text{OR}}_{1} = \widehat{\text{OR}}_{2} = \ldots = \widehat{\text{OR}}_{K}, for the k levels of Z, then Z is not and effect modifier of the association between X and Y.

If it is, we describe the individual measures of association for each level of Z, together with the results of the analysis (separate measures of association and separate Pearson’s \chi^{2} tests).

Only if Z is not an effect modifier, do we start to consider it as a confounder and investigate if it fulfills the criteria of a confounder, outlined before. For this we will use the Mantel-Haenszel test.

To determine if Z is an effect modifier (if the measures of association are different from each other at each level of Z), we can use various tests of significance. Two such tests are Woolf’s test and Breslow-Day test.

Woolf’s test considers odds ratios. For the k = \left\{ 1, 2, \ldots , k \right\} levels of Z, we state the null hypothesis, H_{0} (homogeneous association across all levels or strata of Z), and the alternative hypothesis, H_{1} as in Equation 1.

\begin{align} &H_{0}: \; \text{OR}_{1} = \text{OR}_{2} = \ldots = \text{OR}_{k} \\ &H_{1}: \; \text{not all odds ratios are equal} \end{align} \tag{1}

The test statistic for Woolf’s test is X_{W}^{2} and is shown in Equation 2, with the expansion of its components.

\begin{align} &X_{W}^{2} = \sum_{k=1}^{K} {\left( \frac{\log{(\widehat{\text{OR}}_{k}) - \bar{\log{\text{OR}}}}}{\text{SE} (\log{\widehat{\text{OR}}_{k}})} \right)}^{2} \\ &\text{SE}(\log{\widehat{\text{OR}}_{k}}) = \sqrt{\frac{1}{n_{11k}} + \frac{1}{n_{12k}} + \frac{1}{n_{21k}} + \frac{1}{n_{22k}}} \\ &\bar{\log{\text{OR}}} = \frac{\sum_{k=1}^{K} w_{k} \log{\widehat{\text{OR}}_{k}}}{\sum_{k=1}^{K} w_{k}} \\ &w_{k} = \frac{1}{{\left[ \text{SE}(\log{\widehat{\text{OR}}_{k}}) \right]}^{2}} \end{align} \tag{2}

The denominator of X_{W}^{2} is a measure of the variability of the estimate that we are measuring. The numerator is a measure of the difference between the data that we observe, \log(\widehat{\text{OR}}_{k}), and the weighted average of the stratum specific log odds ratios, \bar{\log{\text{OR}}}.

Small weights, w_{k}, result from large variances in the estimated log odds ratios and larger weights result from smaller variances. Woolf’s test statistic, X_{W}^{2}, therefor, puts more weight on partial tables with less variance.

We have that \widehat{\text{OR}}_{k} is the estimated OR for the k^{\text{th}} partial table.

Under the null hypothesis, we find that X_{W}^{2} approximately follows a \chi^{2} distribution with k-1 degrees of freedom.

As illustrative example, we consider data from the Multi-Women Cohort Study conducted in the United Kingdom between 1996 and 2001. The study recruited women between the ages of 50 and 56 years of age.

For our purposes, we will consider the use of combined hormone replacement therapy (HRT) as explanatory variable, X. There are two levels. We have, E (exposure), as the continuous use of combined HRT, and \bar{E} (no exposure), as the absence of use of combined HRT.

The development of endometrial carcinoma (cancer of the lining of the womb) is the response variable, Y, with the two levels, D (disease), the development of endometrial carcinoma, and \bar{D} (no disease), the absence of endometrial carcinoma.

We also consider the potential effect modifier, Z, the class of body mass index (BMI), with the three levels, Z_{1}, a BMI less than 25, Z_{2}, a BMI on the interval 25 to 29, and Z_{3}, a BMI of 30 or more.

The data for this illustrative example is in a comma-separated file, which we import below. (Note that the data file is in the same folder as this notebook. Since we set the working directory to be the current directory at the start of the file, we can simply import the file without specifying the directory structure.)

Code

dfr <- import_dataset("endo_data.csv")

We view a table summary for each of the three variables in the data set using the xtabs function. Note that we make use of a formula as first argument. We only list the \texttt{HRT} variable and the function will return the unique elements in the set, together with it frequency.

Code

xtabs(
  ~ HRT,
  data = dfr
)

HRT
cont_comb     never 
     3769      2607

We do the same for the \texttt{BMI} and the \texttt{cancer} variables.

Code

xtabs(
  ~ BMI,
  data = dfr
)

BMI
 BMI < 25 BMI >= 30 BMI 25-29 
     2319      2184      1873

Code

xtabs(
  ~ cancer,
  data = dfr
)

cancer
   0    1 
5971  405

For consistency and for the use of the appropriate R functions, we need the order of the variables to conform with the tables above. The order of the levels are E then \bar{E} for X, D and then \bar{D} for Y. For Z we use the order \text{BMI} < 25, \text{BMI 25-29}, and \text{BMI} >= 30. We use the levels keyword argument to set the order of the levels.

Code

dfr$HRT <- factor(
  dfr$HRT,
  levels = c("cont_comb", "never")
)

dfr$cancer <- factor(
  dfr$cancer,
  levels = c(1, 0)
)

dfr$BMI <- factor(
  dfr$BMI,
  levels = c(
    "BMI < 25",
    "BMI 25-29",
    "BMI >= 30"
  )
)

We now use the table function to create the pooled table. We use attribute, \$, notation to include the two variables (column in the data file).

Code

pooled_table <- table(dfr$HRT, dfr$cancer)
pooled_table

           
               1    0
  cont_comb  205 3564
  never      200 2407

We note from the pooled table that \widehat{\text{OR}}_{\text{crude}} = (205 \times 2407) \div (3564 \times 200) = 0.69.

We can also create all the stratified tables using the table function. Note that the (potential) effect modifier variable, Z, is placed after the explanatory (row) variable X, and the response (column) variable, Y. We assign the the stratified table to the variable stratified_tables.

Code

stratified_tables <- table(dfr$HRT, dfr$cancer, dfr$BMI)
stratified_tables

, ,  = BMI < 25

           
               1    0
  cont_comb  115 1205
  never       70  929

, ,  = BMI 25-29

           
               1    0
  cont_comb   61  876
  never       77  859

, ,  = BMI >= 30

           
               1    0
  cont_comb   29 1483
  never       53  619

We now have three (one for each level of Z) contingency tables. From the stratified tables, we have the following stratified odds ratios.

\widehat{\text{OR}}_{<25} = (115 \times 929) \div (1205 \times 70) = 1.27
\widehat{\text{OR}}_{25-29} = (61 \times 859) \div (876 \times 77) = 0.78
\widehat{\text{OR}}_{\ge 30} = (29 \times 619) \div (1483 \times 53) = 0.23

It seems that \widehat{\text{OR}}_{<25} \ne \widehat{\text{OR}}_{25-29} \ne \widehat{\text{OR}}_{\ge 30}, but we need to confirm this, using Woolf’s test.

We let \text{OR}_k be the odds ratio for endometrial carcinoma comparing users of combined HRT in the k^{\text{th}} BMI stratum for k = \left\{ <25 , 25- 29 , \ge 30 \right\}.

We state the null hypothesis and the alternative hypothesis in Equation 3.

\begin{align} &H_{0} : \; \text{OR}_{<25} = \text{OR}_{25-29} = \text{OR}_{\ge 30} \text{ (homogeneous association)} \\ &H_{1} : \; \text{not all conditional odds ratios are the same} \end{align} \tag{3}

Below, we use the WoolfTest function from the DescTools library and pass the variable that contains the stratified tables.

Code

DescTools::WoolfTest(stratified_tables)


    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  stratified_tables
X-squared = 36.474, df = 2, p-value = 1.202e-08

The result prints the test statistic value, 36.474, the degrees of freedom, 3-1=2, and the p value, <0.01. We reject the null hypothesis of homogeneous association at the 5\% level of significance. The data provided sufficient evidence that the association between the use of combined HRT and endometrial carcinoma differs by BMI class.

The epi2by2 function from the epiR library can be used for Woolf’s test as well.

Code

epiR::epi.2by2(stratified_tables)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +          205         3564       3769        5.44 (4.74 to 6.21)
Exposed -          200         2407       2607        7.67 (6.68 to 8.76)
Total              405         5971       6376        6.35 (5.77 to 6.98)


Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio (crude)                         0.71 (0.59, 0.86)
Inc risk ratio (M-H)                           0.77 (0.64, 0.93)
Inc risk ratio (crude:M-H)                     0.92
Inc odds ratio (crude)                         0.69 (0.57, 0.85)
Inc odds ratio (M-H)                           0.76 (0.62, 0.93)
Inc odds ratio (crude:M-H)                     0.91
Attrib risk in the exposed (crude) *           -2.23 (-3.48, -0.98)
Attrib risk in the exposed (M-H) *             -1.74 (-3.24, -0.24)
Attrib risk (crude:M-H)                        1.28
-------------------------------------------------------------------
 M-H test of homogeneity of IRRs: chi2(2) = 36.780 Pr>chi2 = <0.001
 M-H test of homogeneity of ORs: chi2(2) = 36.429 Pr>chi2 = <0.001
 Test that M-H adjusted OR = 1:  chi2(1) = 7.528 Pr>chi2 = 0.003
 Wald confidence limits
 M-H: Mantel-Haenszel; CI: confidence interval
 * Outcomes per 100 population units

The line Odds ratio crude shows the \widehat{\text{OR}}_{\text{crude}} to be 0.69, with 95% confidence intervals 0.62 to 0.93.

The line M-H test of homogeneity of ORs: chi2(2) = 36.429 Pr>chi2 = <0.001 shows the test statistic and the p value. It is slightly different from the values that we saw before, since the epi2by2 functions uses a small continuity correction.

Since, we have rejected the null hypothesis, we report the stratum specific measures of association, having confirmed that the BMI class, Z, is an effect modifier.

The oddsratio.wald test from the epitools library can be used to calculate the \widehat{\text{OR}} and confidence intervals for each of the levels of Z. We use indexing to select the three separate stratified tables. Since the epitools package requires a reverse order of the levels of both the explanatory and the response variable, we use the rev argument and set the value to "both". \bar{E} will come before E, and \bar{D} will come before D.

Code

# All rows, all columns, subtable 1
epitools::oddsratio.wald(stratified_tables[, , 1], rev = "both")

$data
           
               0   1 Total
  never      929  70   999
  cont_comb 1205 115  1320
  Total     2134 185  2319

$measure
           odds ratio with 95% C.I.
            estimate     lower   upper
  never     1.000000        NA      NA
  cont_comb 1.266568 0.9297033 1.72549

$p.value
           two-sided
            midp.exact fisher.exact chi.square
  never             NA           NA         NA
  cont_comb   0.133436     0.141764  0.1334362

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

We see \widehat{\text{OR}}_{<25} = 1.27 (95% confidence interval 0.93 to 1.72). While there is 1.27 - 1.0 = 0.27 = 27% increase in the odds of endometrial carcinoma comparing combined HRT to no HRT in this BMI group. The confidence interval includes 1.0 so describe both an greater and a lower odds of endometrial cancer for this BMI group.

A \chi^{2} test for independence shows that the association for the low BMI group in not significant.

Code

chisq.test(stratified_tables[, , 1])


    Pearson's Chi-squared test with Yates' continuity correction

data:  stratified_tables[, , 1]
X-squared = 2.0258, df = 1, p-value = 0.1546

We look now at the results from the \text{BMI} = 25-29 level of Z.

Code

# All rows, all columns, subtable 2
epitools::oddsratio.wald(stratified_tables[, , 2], rev = "both")

$data
           
               0   1 Total
  never      859  77   936
  cont_comb  876  61   937
  Total     1735 138  1873

$measure
           odds ratio with 95% C.I.
             estimate     lower    upper
  never     1.0000000        NA       NA
  cont_comb 0.7768339 0.5480361 1.101152

$p.value
           two-sided
            midp.exact fisher.exact chi.square
  never             NA           NA         NA
  cont_comb  0.1565288     0.158206  0.1551257

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

We see \widehat{\text{OR}}_{25-29} = 0.78 (95% confidence interval 0.55 to 1.10). While there is 1.0 - 0.78 = 0.22 = 22% decrease in the odds of endometrial carcinoma comparing combined HRT to no HRT in this BMI group. The confidence interval includes 1.0 so describe both a greater and a lesser odds of endometrial cancer for this BMNI group.

A \chi^{2} test for independence shows that the association for the medium BMI group in not significant.

Code

chisq.test(stratified_tables[, , 2])


    Pearson's Chi-squared test with Yates' continuity correction

data:  stratified_tables[, , 2]
X-squared = 1.7775, df = 1, p-value = 0.1825

Lastly, we consider the \text{BMI} \ge 30 level for Z.

Code

# All rows, all columns, subtable 3
epitools::oddsratio.wald(stratified_tables[, , 3])

$data
           
             1    0 Total
  cont_comb 29 1483  1512
  never     53  619   672
  Total     82 2102  2184

$measure
           odds ratio with 95% C.I.
             estimate     lower     upper
  cont_comb 1.0000000        NA        NA
  never     0.2283871 0.1438428 0.3626228

$p.value
           two-sided
              midp.exact fisher.exact   chi.square
  cont_comb           NA           NA           NA
  never     1.479089e-10 1.519617e-10 1.264398e-11

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

We see \widehat{\text{OR}}_{\ge 30} = 0.23 (95% confidence interval 0.14 to 0.36). There is 1.0 - 0.23 = 0.76 = 76% lesser odds of endometrial carcinoma comparing combined HRT to no HRT in this BMI group. The confidence interval does not include 1.0 and the reduction in the odds is significant as shown by the \chi^{2} test.

Code

chisq.test(stratified_tables[, , 3])


    Pearson's Chi-squared test with Yates' continuity correction

data:  stratified_tables[, , 3]
X-squared = 44.232, df = 1, p-value = 2.917e-11

Since we have identified Z to be an effect modifier by using the stratified tables and Woolf’s test, we do not continue to examine Z as a confounder.

6 Confounders

In the introduction, we defined a variable, Z, to be a confounder of the association between a variable X and a variable Y if the association between X and Y is biased or distorted because Z is associated with X and causally associated with Y, but is not in the causal pathway from X to Y.

If Z is indeed a confounder, it follows that the crude associations, risk difference (RD), risk ratio (RR), and odds ratio (OR), are biased estimates of the association between between X and Y.

Consider the following two contrived examples for two levels of an explanatory variable X (exposed, E, and not exposed, \bar{E}), and the two levels of a response variable, Y, (diseased or success D, and not diseased or failure, \bar{D}.

We see the pooled (marginal) table in Table 4 and then two stratified tables, Table 5 and Table 6, for two levels of Z, together with the estimates of association. (The values in the stratified tables sum to the values in the pooled table.)

Table 4: Pooled Table
Pooled table	D	\bar{D}
E	103	147
\bar{E}	103	147

\widehat{\text{RD}}_{\text{crude}} = 0, \; \widehat{\text{RR}}_{\text{crude}} = 1, \; \widehat{\text{OR}}_{\text{crude}} = 1

Table 5: First Stratified Table
Z level 1	D	\bar{D}
E	55	135
\bar{E}	7	153

\widehat{\text{RD}}_{1} = 0.17, \; \widehat{\text{RR}}_{1} = 2.48, \; \widehat{\text{OR}}_{1} = 3.08

Table 6: Second Stratified Table
Z level 2	D	\bar{D}
E	48	12
\bar{E}	96	94

\widehat{\text{RD}}_{2} = 0.29, \; \widehat{\text{RR}}_{2} = 1.58, \; \widehat{\text{OR}}_{2} = 3.92

With the results from the pooled table, it seems as if there is no association. Given the levels of Z, though, there is an association.

To the contrary, in Table 7 Table 8 and Table 9 below, there seems to be an association, but when looking at the stratified tables, there is none.

Table 7: Pooled Table
Pooled table	D	\bar]{D}
E	20	20
\bar{E}	20	40

\widehat{\text{RD}}_{\text{crude}} = 0.17, \; \widehat{\text{RR}}_{\text{crude}} = 1.50, \; \widehat{\text{OR}}_{\text{crude}} = 2

Table 8: First Stratified Table
Z level 1	D	\bar{D}
E	18	12
\bar{E}	12	8

\widehat{\text{RD}}_{1} = 0, \; \widehat{\text{RR}}_{1} = 1, \; \widehat{\text{OR}}_{1} = 1

Table 9: Second Stratified Table
Z level 2	D	\bar{D}
E	2	8
\bar{E}	8	32

\widehat{\text{RD}}_{2} = 0, \; \widehat{\text{RR}}_{2} = 1, \; \widehat{\text{OR}}_{2} = 1

As illustrative example, we will use the data from the first three tables above. We have to consider the three criteria for a confounder (after having ruled out that it is not an effect modifier).

We start by considering if the two-level variable Z is associated with the explanatory variable, X, then if Z is causally related to Y, and finally we consider if Z is in the causal pathway from X (the explanatory variable) to Y (the response variable).

6.1 Association with the Explanatory Variable

We form a table of observed data where Z becomes the response variable, shown in Table 10.

Table 10: The Levels of the Confounder Become the Response
	Z = \text{level } 1	Z = \text{level } 2
E	55+135=190	48+12=60
\bar{E}	7+53=60	96+94=190

We can use Pearson’s \chi^{2} test to determine if the association is significant. We save the data as a matrix object.

Code

tbl_XZ <- matrix(
  c(190, 60, 60, 190),
  nrow = 2
)

tbl_XZ

     [,1] [,2]
[1,]  190   60
[2,]   60  190

Below, we use the chisq.test function to determine significance of association between X and Z.

Code

chisq.test(
  tbl_XZ
)


    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl_XZ
X-squared = 133.13, df = 1, p-value < 2.2e-16

We note a large \chi^{2} statistic with a p value less than 0.01. We reject the null hypothesis that there is no association between X and Z. The potential confounder Z is associated with the explanatory variable, X.

6.2 Causal Association with the Response Variable

To determine a causal relationship between Z and Y, we consider the unexposed only, and create Table 11 below.

Table 11: Contingency Table between Y and Z (only for the unexposed)
	Z = \text{level } 1	Z = \text{level } 2
D	7	96
\bar{D}	53	94

The Pearson’s \chi^{2} test shows a significant association between Z and Y in the unexposed.

Code

tbl_YZ <- matrix(
  c(7, 53, 96, 94),
  nrow = 2
)

chisq.test(tbl_YZ)


    Pearson's Chi-squared test with Yates' continuity correction

data:  tbl_YZ
X-squared = 26.843, df = 1, p-value = 2.207e-07

6.3 Z is Not in the Causal Pathway Between X and Y

We determine this requirement mostly through domain knowledge using logical arguments that experts agree upon. We can also draw on knowledge from prior research.

Now that we have determined that all three criteria are met (assuming that we have excluded Z as an effect modifier), we can measure the association between X and Y accounting (adjusting) for Z. This is done by computing adjusted measures of association.

Once computed, we compare the adjusted measures and the crude measures. If they are very different from each other (see rule of thumb later), we declare Z to be a confounder of the association between X and Y and we report the adjusted measures of association, together with their confidence intervals. We also perform hypothesis testing for the (adjusted) association between X and Y using the Mantel-Haenszel test.

If the adjusted and crude measures are similar, then Z is not a confounder and we simply report the crude measures of association, together with confidence intervals and the results of Pearson’s \chi^{2} test.

One set of adjusted measures of association is the Mantel-Haenszel (MH) common risk difference, risk ratio, and odds ratio.

6.4 Mantel-Haenszel Adjusted Common Measures of Association

Given Table 12 below, which serves as a generic table of observed data for each of the k= \left\{1, 2, 3, \ldots , K \right\} levels of Z, we can calculate the MH common estimated measures of association given in Equation 4.

Table 12: First Level of Potential Confounder
Z = \text{level }1	D	\bar{D}	Total
E	n_{11k}	n_{12k}	n_{1+k}
\bar{E}	n_{21k}	n_{22k}	n_{2+k}
Total	n_{+1k}	n_{+2k}	n_{k}

\begin{align} &\widehat{\text{OR}}_{\text{MH}} = \frac{\sum_{k=1}^{K} w_{k} \widehat{\text{OR}}_{k}}{\sum_{k=1}^{K} w_{k}} , \; w_{k} = \frac{n_{12k} n_{21k}}{n_{k}} \\ \\ &\widehat{\text{RR}}_{\text{MH}} = \frac{\sum_{k=1}^{K} u_{k} \widehat{\text{RR}}_{k}}{\sum_{k=1}^{K} u_{k}} , \; u_{k} = \frac{n_{21k} (n_{11k} + n_{12k})}{n_{k}} \\ \\ &\widehat{\text{RR}}_{\text{MH}} = \frac{\sum_{k=1}^{K} w_{k} \widehat{\text{RD}}_{k}}{\sum_{k=1}^{K} v_{k}} , \; v_{k} = \frac{(n_{11k} + n_{12k})(n_{21k} + n_{22k})}{n_{k}} \end{align} \tag{4}

The values w_{k}, u_{k}, and v_{k} are used to generate weighted averages of the partial measures of association.

Do remember that we only compute these common estimates if the corresponding partial measures are the same, that is, if the RD, RR, and OR values are similar across the stratified tables (which means that Z is not an effect modifier). We use Woolf’s test for this.

If the partial measures are similar (Z is not an effect modifier), and the adjusted and crude measures are very different, then Z is a confounder and we report the adjusted measures, with confidence intervals, together with conducting the Mantel-Haenszel test.

Very different is hard to define. There are several rules of thumb. We will consider a difference of 0.1 or more, using Equation 5 as very different.

\lvert \frac{\text{crude measure} - \text{adjusted measure}}{\text{crude measure}} \rvert \tag{5}

As illustrative, hypothetical example, researchers want to determine if there is an association between baldness and ischemic heart disease. That is to say, the association between the explanatory variable, X, with levels being bald and not being bald, and the response variable, Y, with levels ischemic heart disease and no ischemic heart disease. As possible confounder, they consider the age group of participants, Z, with levels young and old.

The results are as shown in the pooled (marginal) table Table 13 and stratified tables Table 14 and Table 15 below.

Table 13: Pooled Table
Pooled table	IHD	No IHD	Total
Bald	79	745	824
Not bald	286	8290	8576
Total	365	9035	9400

Table 14: Stratified Table for Younger Participants
Stratified by Young	IHD	No IHD	Total
Bald	55	51	106
Not bald	5	5	10
Total	60	56	116

Table 15: Stratified Table for Older Participants
Stratified by Old	IHD	No IHD	Total
Bald	24	694	718
Not bald	281	8285	8566
Total	305	8979	9284

Instead of importing data for this simulated research project, we will construct the tables by hand. We start with the pooled table.

Code

pooled_table <- matrix(
  c(79, 286, 745, 8290),
  nrow = 2,
  dimnames = list(
    c("bald", "not bald"),
    c("IHD", "no IHD")
  )
)

pooled_table

         IHD no IHD
bald      79    745
not bald 286   8290

The next table contains the data for X and Z.

Code

age_bald_table <- matrix(
  c(106, 10, 718, 8566),
  nrow = 2,
  dimnames = list(
    c("old", "young"),
    c("bald", "not bald")
  )
)

age_bald_table

      bald not bald
old    106      718
young   10     8566

Next, is a table of Y and Z in the non-exposed (not bald).

Code

age_ihd_non_bald_table <- matrix(
  c(5, 281, 5, 8285),
  nrow = 2,
  dimnames = list(
    c("old", "young"),
    c("IHD", "no IHD")
  )
)

age_ihd_non_bald_table

      IHD no IHD
old     5      5
young 281   8285

The next two tables are the partial tables for Z with k=1 being the older group and k=2 being the younger group.

Code

# k = 1
old <- matrix(
  c(55, 5, 51, 5),
    nrow = 2,
    dimnames = list(
      c("bald", "not bald"),
      c("IHD", "no IHD")
  )
)

# k = 2
young <- matrix(
  c(24, 281, 694, 8285),
    nrow = 2,
    dimnames = list(
      c("bald", "not bald"),
      c("IHD", "no IHD")
  )
)

We can now start our analysis by considering the requirements for Z to be a confounder.

We must test whether there is an association between Z and X. We use the epi.2by2 function for this.

Code

epiR::epi.2by2(age_bald_table)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +          106          718        824     12.86 (10.65 to 15.34)
Exposed -           10         8566       8576        0.12 (0.06 to 0.21)
Total              116         9284       9400        1.23 (1.02 to 1.48)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 110.32 (57.92, 210.15)
Inc odds ratio                                 126.46 (65.83, 242.93)
Attrib risk in the exposed *                   12.75 (10.46, 15.03)
Attrib fraction in the exposed (%)            99.09 (98.27, 99.52)
Attrib risk in the population *                1.12 (0.88, 1.35)
Attrib fraction in the population (%)         90.55 (82.92, 94.77)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 1002.294 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units

By both the confidence intervals and the small p value, we see that there is an association between baldness, X, and age group, Z.

Next, we must consider if Z is causally related to Y.

The causal association between Z and Y uses a table of observed data, where we only include the non-exposed, in this case, those without baldness. We use the epi.2by2 function again.

Code

epiR::epi.2by2(age_ihd_non_bald_table)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +            5            5         10     50.00 (18.71 to 81.29)
Exposed -          281         8285       8566        3.28 (2.91 to 3.68)
Total              286         8290       8576        3.33 (2.96 to 3.74)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 15.24 (8.11, 28.63)
Inc odds ratio                                 29.48 (8.49, 102.42)
Attrib risk in the exposed *                   46.72 (15.73, 77.71)
Attrib fraction in the exposed (%)            93.44 (87.68, 96.51)
Attrib risk in the population *                0.05 (-0.48, 0.59)
Attrib fraction in the population (%)         1.63 (0.15, 3.09)
-------------------------------------------------------------------
Yates corrected chi2 test that OR = 1: chi2(1) = 53.914 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units

By the confidence intervals and the small p values, we conclude that there is a causal association between age group, Z, and baldness, Y.

Perhaps we can all agree that baldness does not cause the age group. If Z is in the causal pathway from X to Z, we consider it to be a mediator.

We are finally ready to examine age group, Z, as a confounder. We start by considering the partial measures of association, to rule Z out as an effect modifier. We must ensure that the measures of association are similar across the levels of Z. Below, we use the epi.2by2 function for each level.

Code

epiR::epi.2by2(
  old
)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           55           51        106     51.89 (41.97 to 61.70)
Exposed -            5            5         10     50.00 (18.71 to 81.29)
Total               60           56        116     51.72 (42.26 to 61.10)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.04 (0.54, 1.98)
Inc odds ratio                                 1.08 (0.29, 3.94)
Attrib risk in the exposed *                   1.89 (-30.53, 34.30)
Attrib fraction in the exposed (%)            3.64 (-83.91, 49.51)
Attrib risk in the population *                1.72 (-30.57, 34.02)
Attrib fraction in the population (%)         3.33 (-74.82, 46.55)
-------------------------------------------------------------------
Yates corrected chi2 test that OR = 1: chi2(1) = 0.000 Pr>chi2 = 1.000
Fisher exact test that OR = 1: Pr>chi2 = 1.000
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units

Code

epiR::epi.2by2(
  young
)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           24          694        718        3.34 (2.15 to 4.93)
Exposed -          281         8285       8566        3.28 (2.91 to 3.68)
Total              305         8979       9284        3.29 (2.93 to 3.67)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 1.02 (0.68, 1.54)
Inc odds ratio                                 1.02 (0.67, 1.56)
Attrib risk in the exposed *                   0.06 (-1.31, 1.43)
Attrib fraction in the exposed (%)            1.86 (-47.85, 34.86)
Attrib risk in the population *                0.00 (-0.52, 0.53)
Attrib fraction in the population (%)         0.15 (-3.13, 3.31)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 0.008 Pr>chi2 = 0.928
Fisher exact test that OR = 1: Pr>chi2 = 0.913
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units

The results are summarized in Table 16. We see that the measures are very similar indeed.

Table 16: Summary Table of Partial Measures of Association
Group	\widehat{\text{RR}}	\widehat{\text{OR}}
Old	1.04 (0.54, 1.98)	1.08 (0.29, 3.94)
Young	1.02 (0.68, 1.54)	1.02 (0.67, 1.56)

The measures of association at the levels of Z are very similar. We could use Woolf’s test in this case to confirm the similarity.

We can now continue to determine the difference between the crude and the MH common measures of association. We create the required data table below.

The tables are printed below.

Code

bald_ihd_age_tab

, , age_var = old

          ihd_var
bald_var    IHD no IHD
  bald       55     51
  not bald    5      5

, , age_var = young

          ihd_var
bald_var    IHD no IHD
  bald       24    694
  not bald  281   8285

We pass the tables to the epi.2by2 function.

Code

epiR::epi.2by2(bald_ihd_age_tab)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           79          745        824       9.59 (7.66 to 11.81)
Exposed -          286         8290       8576        3.33 (2.96 to 3.74)
Total              365         9035       9400        3.88 (3.50 to 4.29)


Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio (crude)                         2.87 (2.26, 3.65)
Inc risk ratio (M-H)                           1.02 (0.72, 1.46)
Inc risk ratio (crude:M-H)                     2.81
Inc odds ratio (crude)                         3.07 (2.37, 3.99)
Inc odds ratio (M-H)                           1.03 (0.69, 1.53)
Inc odds ratio (crude:M-H)                     3.00
Attrib risk in the exposed (crude) *           6.25 (4.21, 8.30)
Attrib risk in the exposed (M-H) *             0.09 (-4.62, 4.79)
Attrib risk (crude:M-H)                        71.84
-------------------------------------------------------------------
 M-H test of homogeneity of IRRs: chi2(1) = 0.002 Pr>chi2 = 0.963
 M-H test of homogeneity of ORs: chi2(1) = 0.003 Pr>chi2 = 0.956
 Test that M-H adjusted OR = 1:  chi2(1) = 0.015 Pr>chi2 = 0.452
 Wald confidence limits
 M-H: Mantel-Haenszel; CI: confidence interval
 * Outcomes per 100 population units

Table 17 shows the results from the epi.2by2 function. Included is the rule of thumb.

Table 17: Results and Rule of Thumb
	Relative risk	Risk ratio	Odds ratio
\widehat{\text{measure}}_{\text{crude}}	0.0625	2.87	3.07
\widehat{\text{measure}}_{\text{MH}}	0.0009	1.02	1.03
Rule of thumb	\lvert \frac{0.0625-0.0009}{0.0625} \rvert = 0.99 \ge 0.10	\lvert \frac{2.87-1.02}{2.87} \rvert = 0.64 \ge 0.10	\lvert \frac{3.07 - 1.03}{3.07} \rvert = 0.66 \ge 0.10

All MH common measures of association are very different from the crude measures of association and we conclude that age group, Z, is a confounder. The estimates of the MH common measures must then be presented as the final results.

Together with these values, we also include the confidence intervals or conduct a hypothesis test. The confidence intervals are in the results of the epi.2by2 function call output. In the case of risk difference, we see that the confidence interval includes 0 and in the case of the risk and odds ratios, the confidence intervals include 1. We conclude that there is no evidence that there is an association between baldness and ischemic heart disease.

The null hypothesis of the MH test assumes that the row variable, X, and the column variable, Y for each partial tables are independent. Much like Fisher’s exact test, the expected value of the n_{11k} cell in each partial table can be computed, conditioned on the respective row and column totals. The X_{\text{MH}}^{2} test statistic measures the difference between the expected and the observed cell counts. The null hypothesis for the test statistic approximates a \chi^{2} distribution with 1 degree of freedom.

We see the results of the test in the line Test that M-H adjusted OR = 1 in the output of the `epi.2by2 function. We see that \chi^{2} = 0.015 with a large p value of 0.90. We fail to reject the null hypothesis and state that there is no association between baldness and ischemic heart disease.

7 Lab Problems

7.1 Question

You are a member of a funded group conducting research into maternal perinatal complications in a disadvantaged community. You investigate the association between maternal age group and maternal perinatal complications. You also consider human immunodeficiency virus (HIV) status as possible effect modifier.

There are two levels to the maternal age group variable AgeGroup: Teenage and Older. There are two levels to the maternal perinatal complication variable Complication: Yes and No There are two levels to the maternal HIV status HIV: Positive and Negative

Consider the association between maternal age group and maternal perinatal complications and HIV status as possible effect modifier. Prepare a thorough report for presentation to your (non-statistician) collaborators. Include your research question, hypotheses, level of significance, and confidence level. Use only table methods in your report.

The data for this lab problem are in a spreadsheet file named PerinatalComplications.csv.

Solution 7.1

The research question is whether HIV status is possible effect modifier in the primary association between maternal age group and maternal perinatal complications. We select a significance level of 0.05 before starting our analysis.

We import the data using the import_dataset user-defined function from Chapter 2.

Code

comp <- import_dataset("PerinatalComplications.csv")

Conduct Univariate Analysis

Then, we observe the frequencies of our variables of interest.

Code

xtabs(
  ~ AgeGroup,
  data = comp
)

AgeGroup
  Older Teenage 
    146     125

Code

xtabs(
  ~ HIV,
  data = comp
)

HIV
Negative Positive 
     147      124

In terms of the age groups, there are older than teenage participants. Also, there are more HIV negative than positive patients.

Before viewing the stratified descriptive statistics, ensure that the the levels of the predictor variables are such that the reference group is last in the list.

Code

comp$AgeGroup <- factor(comp$AgeGroup, levels = c("Teenage", "Older"))
comp$Complication <- factor(comp$Complication, levels = c("Yes", "No"))
comp$HIV <- factor(comp$HIV, levels = c("Negative", "Positive"))

Code

lab_stratified_tables <- table(comp$AgeGroup, comp$Complication, comp$HIV)
lab_stratified_tables

, ,  = Negative

         
          Yes No
  Teenage   5 63
  Older    14 65

, ,  = Positive

         
          Yes No
  Teenage  44 13
  Older    24 43

Within the HIV negative group, there are more older than teenage patients. Within the HIV positive group, there are more teenage than older patients.

Assess for Effect Modification

Then, we implement the Woolf’s test assess for effect modification by HIV status within the primary association between maternal age group and maternal perinatal complications.

The hypotheses are as follows,

H_{0}: Homogeneous association among the stratified odds ratios between maternal age and maternal perinatal complications by HIV status

H_{1}: Not all conditional odds ratios are the same

Code

DescTools::WoolfTest(lab_stratified_tables)


    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  lab_stratified_tables
X-squared = 16.788, df = 1, p-value = 4.18e-05

The result prints the test statistic value of 16.788 with 1 degree of freedom and a corresponding p-value less than the 0.05 significance level. So, we reject the null hypothesis of homogeneous association at the 5% significance level and conclude that HIV status is an effect modifier within the primary association between maternal age and maternal perinatal complications. As such, we report separate measures of association with confidence intervals and do separate Pearson’s \chi^{2}tests.

Code

negative <- matrix(c(5, 14, 63, 65), nrow = 2)
negative

     [,1] [,2]
[1,]    5   63
[2,]   14   65

Code

positive <- matrix(c(44, 24, 13, 43), nrow = 2)
positive

     [,1] [,2]
[1,]   44   13
[2,]   24   43

Code

epiR::epi.2by2(negative)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +            5           63         68       7.35 (2.43 to 16.33)
Exposed -           14           65         79     17.72 (10.04 to 27.94)
Total               19          128        147      12.93 (7.96 to 19.45)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 0.41 (0.16, 1.09)
Inc odds ratio                                 0.37 (0.13, 1.08)
Attrib risk in the exposed *                   -10.37 (-20.83, 0.09)
Attrib fraction in the exposed (%)            -141.01 (-534.68, 8.48)
Attrib risk in the population *                -4.80 (-14.81, 5.22)
Attrib fraction in the population (%)         -37.11 (-78.33, -5.41)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 3.491 Pr>chi2 = 0.062
Fisher exact test that OR = 1: Pr>chi2 = 0.084
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units

Code

epiR::epi.2by2(positive)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           44           13         57     77.19 (64.16 to 87.26)
Exposed -           24           43         67     35.82 (24.47 to 48.47)
Total               68           56        124     54.84 (45.65 to 63.79)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 2.15 (1.52, 3.06)
Inc odds ratio                                 6.06 (2.74, 13.43)
Attrib risk in the exposed *                   41.37 (25.55, 57.20)
Attrib fraction in the exposed (%)            53.60 (34.14, 67.31)
Attrib risk in the population *                19.02 (4.58, 33.46)
Attrib fraction in the population (%)         34.68 (17.29, 48.41)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 21.286 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units

For mothers who are HIV negative, the odds of perinatal complications are 37% lower for those in the teenage age group compared to the older age group. The odds ratio of 0.37 has a corresponding 95% CI of (0.13, 1.08).

For mothers who are HIV positive, the odds of perinatal complications are 6.06 times higher (506% greater) for those in the teenage age group compared to the older age group. The odds ratio of 6.06 has a corresponding 95% CI of (2.74, 13.43).

7.2 Question

Your are the biostatistician for a research project investigating the development of cancer of the cervix. In one of the sub projects you investigate the association between current smoking status and the development of cervical carcinoma. You have also collected data on the number of sexual partners and consider two levels for this variable, less than 2 and 2+ sexual partners and consider whether the two-group number of sexual partners modifies the effect of the association between current smoking status and the development of cervical carcinoma or if this association is biased by the level of sexual partners.

Prepare a thorough report for your funding agency oversight committee meeting. State your research questions, hypotheses, levels of significance and confidence, all your analyses decisions, and results. Use table methods only.

The data are in the file Cervical.csv.

Solution 7.2

The research question is whether whether the two-group number of sexual partners modifies the effect of the association between current smoking status and the development of cervical carcinoma or if this association is biased by the level of sexual partners (i.e., confounder).

Code

cerv <- import_dataset("Cervical.csv")

Conduct Univariate Analysis

First, we generate univariate summary statistics.

Code

xtabs(
  ~ Smoke,
  data = cerv
)

Smoke
Non smoker     Smoker 
       386        278

Code

xtabs(
  ~ Carcinoma,
  data = cerv
)

Carcinoma
 No Yes 
450 214

Code

xtabs(
  ~ Partner,
  data = cerv
)

Partner
    Less Two plus 
     542      122

Overall, there are more non-smoker than smoker participants.there are more patients without carcinoma. And, there are more participants who report having had less than 2 sexual partners.

Then, we set levels for our variables, ensuring that for the \texttt {Smoke} and \texttt {Carcinoma} variables, the reference group is listed last in the order.

Code

cerv$Smoke <- factor(cerv$Smoke, levels = c("Smoker", "Non smoker"))
cerv$Carcinoma <- factor(cerv$Carcinoma, levels = c("Yes", "No"))
cerv$Partner <- factor(cerv$Partner, levels = c("Less", "Two plus"))

Code

crude_table_lab <- xtabs(
  ~ Smoke + Carcinoma,
  data = cerv
)
crude_table_lab

            Carcinoma
Smoke        Yes  No
  Smoker     109 169
  Non smoker 105 281

Code

stratified_tables_lab <- table(
  cerv$Smoke, cerv$Carcinoma, cerv$Partner
)
stratified_tables_lab

, ,  = Less

            
             Yes  No
  Smoker      97 148
  Non smoker  93 204

, ,  = Two plus

            
             Yes  No
  Smoker      12  21
  Non smoker  12  77

In the cross tabulation between smoking and carcinoma status, most participants report non-smoker and no carcinoma. Similarly, within each stratification by number of sexual partners, most participants are non-smokers without carcinoma.

Assess for Effect Modification

Then, with the Woolf’s test, we compare partial measures.

The hypotheses are as follows,

H_{0}: Homogeneous association among the stratified odds ratios between current smoking status and the development of cervical carcinoma

H_{1}: Not all conditional odds ratios are the same

Code

DescTools::WoolfTest(stratified_tables_lab)


    Woolf Test on Homogeneity of Odds Ratios (no 3-Way assoc.)

data:  stratified_tables_lab
X-squared = 3.3716, df = 1, p-value = 0.06633

The results demonstrate a test statistic of 3.3716 with 1 degree of freedom and a corresponding p-value that is greater than the 0.05 significance level. Hence, the two-group number of sexual partners does not modify the effect of the association between current smoking status and the development of cervical carcinoma. So, we move onto assess for confounding effect.

Assess for Confounding

For one of the criteria for confounding, we evaluate the association between \texttt Smoke (X) and \texttt Partner (Z) with a Pearson’s \chi^{2} test.

Code

smoke_partner_table_lab <- table(
  cerv$Smoke,
  cerv$Partner
)

smoke_partner_table_lab

            
             Less Two plus
  Smoker      245       33
  Non smoker  297       89

Code

chisq.test(smoke_partner_table_lab)


    Pearson's Chi-squared test with Yates' continuity correction

data:  smoke_partner_table_lab
X-squared = 12.749, df = 1, p-value = 0.0003562

Given a \chi^{2} test statistic of 12.749 with 1 degree of freedom and a corresponding p-value less than the 0.05 significance level, we reject the null hypothesis. Thus, there is an association between the explanatory variable, \texttt Smoke, and the potential confounder, \texttt Partner.

For the second criteria for confounding, we assess whether there is a causal association with the response variable.

Table of \texttt {Partner} (Z) and \texttt {Carcinoma} (Y).

Code

cancer_partners_lab <- table(
  cerv$Carcinoma,
  cerv$Partner
)
cancer_partners_lab

     
      Less Two plus
  Yes  190       24
  No   352       98

Then, we test for independence with the Chi-squared test.

Code

chisq.test(cancer_partners_lab)


    Pearson's Chi-squared test with Yates' continuity correction

data:  cancer_partners_lab
X-squared = 10.097, df = 1, p-value = 0.001485

Given that the \chi^{2} statistic of 10.097 with 1 degree of freedom has a corresponding p-value that is less than the 0.05 significance level, we reject the null hypothesis. Thus, there is a causal relationship between the potential confounder and the explanatory variable. This means that there is a causal relationship between the number of sexual partners and the development of cervical carcinoma.

There is no causal relationship between smoking and the number of sexual partners.

For the third criteria of confounding, we assess whether the two-group number of sexual partners is in the causal pathway between current smoking status and the development of cervical carcinoma. Since there is no causal relationship between smoking and the number of sexual partners, this third criteria is satisfied.

With all three criteria for confounding met, we determine the difference between the crude and the MH common measures of association.

Code

epiR::epi.2by2(stratified_tables_lab)

             Outcome +    Outcome -      Total                 Inc risk *
Exposed +          109          169        278     39.21 (33.43 to 45.22)
Exposed -          105          281        386     27.20 (22.82 to 31.93)
Total              214          450        664     32.23 (28.68 to 35.93)


Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio (crude)                         1.44 (1.16, 1.79)
Inc risk ratio (M-H)                           1.37 (1.10, 1.70)
Inc risk ratio (crude:M-H)                     1.05
Inc odds ratio (crude)                         1.73 (1.24, 2.40)
Inc odds ratio (M-H)                           1.61 (1.15, 2.23)
Inc odds ratio (crude:M-H)                     1.08
Attrib risk in the exposed (crude) *           12.01 (4.75, 19.26)
Attrib risk in the exposed (M-H) *             10.50 (1.80, 19.20)
Attrib risk (crude:M-H)                        1.14
-------------------------------------------------------------------
 M-H test of homogeneity of IRRs: chi2(1) = 4.136 Pr>chi2 = 0.042
 M-H test of homogeneity of ORs: chi2(1) = 3.361 Pr>chi2 = 0.067
 Test that M-H adjusted OR = 1:  chi2(1) = 8.019 Pr>chi2 = 0.002
 Wald confidence limits
 M-H: Mantel-Haenszel; CI: confidence interval
 * Outcomes per 100 population units

Risk Ratio

Code

(1.44 - 1.37) / 1.44

[1] 0.04861111

Odds ratio

Code

(1.73 - 1.61) / 1.73

[1] 0.06936416

Whether using the risk ratio or the odds ratio, there is not more than 10% difference between the crude and respective MH measures. Since the two-group number of sexual partners does not confound the association between current smoking status and the development of cervical carcinoma, we report report crude measures.

As such, the odds of cervical carcinoma are 1.73 times higher (73% greater) for smokers compared to non-smokers (OR: 1.73, 95% CI: (1.24, 2.40)).