This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In chapter 2, we establish definitions for the measures of association: risk difference, relative risk, and odds ratio. Then, using the epiR and epitools libraries, we calculate these measures of association. Additionally, we discuss which measures of association are most effective based on study design and perform relevant hypothesis testing.
Load the required libraries for Chapter 2. If these packages are not already installed, please do so before loading the libraries.
library(ggplot2)
library(latex2exp)
library(crosstable)
library(gmodels)
library(ggstatsplot)
library(sjPlot)
library(DescTools)
library(rio)
Also, all data for this chapter and subsequent chapters are from a GitHub repository. Below is a user-defined function to allow for more accessible data import into R. An example of how to apply the user-defined function import_dataset is demonstrated in Section 5 Example 5.4.
import_dataset <- function(filename) {
  # GitHub repository URL
  github_url <- "https://raw.githubusercontent.com/juanklopper/TutorialData/main/"
  file_url <- paste0(github_url, filename)
  dataset <- import(file_url)
  return(dataset)
}
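As a quick usage sketch, the call below fetches one of the files used later in this chapter (it assumes the function above has been defined, the rio library is loaded, and an internet connection is available):

# Import one of the chapter's data files from the GitHub repository
dfr <- import_dataset("measures_assoc.csv")
head(dfr)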
Studies concerning categorical variables allow us to consider the frequency of outcomes for each variable and the association of these frequencies between pairs of variables. We can express such outcomes in a contingency table or a table of observed values, an example of which is shown in Table 1. Participants are randomized into two groups, A and B. They are observed for one of two outcomes, Outcome 1 and Outcome 2.
Group | Outcome 1 | Outcome 2 | Total |
---|---|---|---|
A | 509 | 116 | 625 |
B | 398 | 104 | 502 |
Total | 907 | 220 | 1127 |
A total of 509 participants in group A have Outcome 1 and 116 in group A have Outcome 2. A total of 398 in group B have Outcome 1 and 104 in group B have Outcome 2.
Note that for this chapter, we will usually have the explanatory variable, in this case, the groups, on the left-hand side (the rows), with a row for each of the levels of the explanatory group. The levels of the response variable are along the top (the columns), with a column for each level of the response variable. In the case of this example, we have 2 rows and 2 columns resulting in a 2 \times 2 contingency table.
By this structure of the contingency table, we can sum all the frequencies along the rows and columns. We note the following:
Irrespective of which outcome participants were observed to have, there are 509 + 116 = 625 participants in group A and 398 + 104 = 502 participants in group B.
Irrespective of which group participants were in, there are 509 + 398 = 907 participants who have Outcome 1 and 116 + 104 = 220 who have Outcome 2.
We also note a total of 1127 participants. This value must be the same for the sum of the row or the column totals or the sum of the four joint frequencies.
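To make the margin bookkeeping concrete, here is a minimal sketch that rebuilds Table 1 in R and appends the totals (the object name tab1 is ours):

# Reconstruct Table 1 and append row, column, and grand totals
tab1 <- matrix(
  c(509, 116,
    398, 104),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("A", "B"), Outcome = c("Outcome 1", "Outcome 2"))
)

addmargins(tab1) # the grand total, 1127, appears in the bottom-right cell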
Our general notation for a 2 \times 2 table will be as in Table 2. The subscripts denote the row and then the column number. We use the + symbol to indicate summation over an index: placed second, it gives a row total, and placed first, it gives a column total.
Group | Outcome 1 | Outcome 2 | Total |
---|---|---|---|
A | n_{1,1} | n_{1,2} | n_{1,+} |
B | n_{2,1} | n_{2,2} | n_{2,+} |
Total | n_{+,1} | n_{+,2} | n |
Different textbooks use different notation for the groups and outcomes in a 2 \times 2 table of observed values. You might see a table as in Table 3, where E is exposure and \bar{E} is non-exposure, with D being disease and \bar{D} being non-disease. Other textbooks might use success for D and failure for \bar{D}. None of these terms should be taken literally. They merely identify the outcome of interest (D or success).
D | \bar{D} | Total | |
---|---|---|---|
E | n_{1,1} | n_{1,2} | n_{1,+} |
\bar{E} | n_{2,1} | n_{2,2} | n_{2,+} |
Total | n_{+,1} | n_{+,2} | n |
We are not restricted to a 2 \times 2 contingency table. If the explanatory variable has m outcomes and the response variable has n outcomes, then we would have an m \times n contingency table.
In Table 4, we have the results of a survey. The data are frequencies of observed values for the levels of two categorical variables, Group (explanatory variable) and Choice (response variable). There are two categories for the Group variable, A and B, and two for the Choice variable, I and II.
I | II | |
---|---|---|
A | 80 | 100 |
B | 120 | 25 |
The total number of participants is 80 + 100 + 120 + 25 = 325.
There are 80 + 100 = 180 participants in group A and 120 + 25 = 145 participants in group B.
Of all the participants, 80 + 120 = 200 chose I and 100 + 25 = 125 chose II. Table 5 shows these values.
I | II | Total | |
---|---|---|---|
A | 80 | 100 | 180 |
B | 120 | 25 | 145 |
Total | 200 | 125 | 325 |
We can express a relative frequency or a proportion by dividing each frequency by the total number of participants. The result is a contingency table, Table 6, that is a probability distribution.
I | II | Total | |
---|---|---|---|
A | 0.246 | 0.308 | 0.554 |
B | 0.369 | 0.077 | 0.446 |
Total | 0.615 | 0.385 | 1.0 |
The values 0.246, 0.308, 0.369, and 0.077 are termed joint probabilities. For the first value, we could state that the probability of being in group A and having chosen I is 0.246. We write this as P(A \cap I). The four values together form the joint probability distribution. In some textbooks, the notation \hat{P} is used to denote that the results are from a sample and are used to estimate the true population joint probability.
Note also that when we refer to a probability here, we are actually calculating a proportion.
In general, we have Equation 1 for a joint probability, where i denotes the row number and j the column number. \frac{n_{ij}}{n} \tag{1}
The values 0.554, 0.446, 0.615, and 0.385 are termed marginal probabilities and form the marginal probability distribution. For 0.554 we could write P(\text{being in group A}) or then \hat{P}(A). Each of the two sets of marginal probabilities (along the rows or the columns) sums to 1.0.
Depending on whether we consider row or column marginal probabilities, we divide the relevant marginal total by n. In Equation 2 we consider only the two column marginal probabilities, \hat{P}(I) and \hat{P}(II), where j = 1, 2.
\frac{n_{+j}}{n} \tag{2}
What about the probability of having chosen I given that the participant is in group A? Of the 180 participants in group A, 80 chose I. The conditional probability is \hat{P}(I \vert A) = 80 \div 180 \approx 0.444.
We have to remember that these are all values for a sample taken from a population and all three types of probabilities are probability estimates.
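These three kinds of probability estimates are easy to verify in R with prop.table; a short sketch using the Table 5 counts (the object name tab5 is ours):

# Table 5 counts
tab5 <- matrix(
  c(80, 100,
    120, 25),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("A", "B"), Choice = c("I", "II"))
)

prop.table(tab5)                  # joint probabilities, as in Table 6
margin.table(tab5, 1) / sum(tab5) # row marginal probabilities
prop.table(tab5, margin = 1)      # conditional on the rows, e.g. P(I | A) = 0.444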
In the code cell below, we construct observed data for a simulated study. There are 40 participants. The explanatory variable is Group and the participants are divided equally into groups A and B. Participants in group A receive novel therapy to alleviate a specific symptom and those in group B receive a placebo. Researchers measure whether they improve (a success for this binary response variable, encoded as Yes) or not (a failure, encoded as No).
We create the simulated data and display the results as a table.
set.seed(12)

explanatory <- rep(
  c("A", "B"),
  each = 20
)

response <- sample(
  c("No", "Yes"),
  size = 40,
  replace = TRUE
)

data <- data.frame(
  Group = explanatory,
  Outcome = response
)

xtabs(
  ~Group + Outcome,
  data = data
)
Outcome
Group No Yes
A 7 13
B 10 10
We use the crosstable function from the crosstable library to construct a more elegant table of observed values, together with marginal totals, shown in Table 7. Note that since we did not use the factor function to set the order of the levels, the crosstable function uses the \bar{D} level (No) in the first column.
crosstable::crosstable(
  data,
  Group,
  by = Outcome,
  total = "both",
  percent_pattern = "{n}"
) %>%
  crosstable::as_flextable(
    keep_id = FALSE
  )
label | variable | No | Yes | Total |
---|---|---|---|---|
Group | A | 7 | 13 | 20 (50.00%) |
 | B | 10 | 10 | 20 (50.00%) |
Total | | 17 (42.50%) | 23 (57.50%) | 40 (100.00%) |
The CrossTable function from the gmodels library provides the joint, marginal, and conditional probabilities.
gmodels::CrossTable(
  data$Group,
  data$Outcome,
  chisq = FALSE,
  prop.chisq = FALSE,
  prop.c = TRUE,
  prop.r = TRUE
)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| data$Outcome
data$Group | No | Yes | Row Total |
-------------|-----------|-----------|-----------|
A | 7 | 13 | 20 |
| 0.350 | 0.650 | 0.500 |
| 0.412 | 0.565 | |
| 0.175 | 0.325 | |
-------------|-----------|-----------|-----------|
B | 10 | 10 | 20 |
| 0.500 | 0.500 | 0.500 |
| 0.588 | 0.435 | |
| 0.250 | 0.250 | |
-------------|-----------|-----------|-----------|
Column Total | 17 | 23 | 40 |
| 0.425 | 0.575 | |
-------------|-----------|-----------|-----------|
Note that the order of the response variable levels is once again left as the default. Using the Cell Contents legend at the top of the output as a key, each cell displays the following values, as per the function arguments used:
Observed frequency
Conditional probability upon the explanatory variable level
Conditional probability upon the response variable level
Joint probability
From the table we read the joint probabilities \hat{P}(A \cap \text{No}) = 0.175, \hat{P}(A \cap \text{Yes}) = 0.325, \hat{P}(B \cap \text{No}) = 0.250, and \hat{P}(B \cap \text{Yes}) = 0.250.
The marginal probabilities are \hat{P}(A) = 0.500, \hat{P}(B) = 0.500, \hat{P}(\text{No}) = 0.425, and \hat{P}(\text{Yes}) = 0.575.
Finally, conditioning on the groups, we read the conditional probabilities \hat{P}(\text{No} \vert A) = 0.350, \hat{P}(\text{Yes} \vert A) = 0.650, \hat{P}(\text{No} \vert B) = 0.500, and \hat{P}(\text{Yes} \vert B) = 0.500.
Note from the study design that given that the group allocation was fixed, it does not make sense to consider the probabilities of group allocation given the outcomes. We will revisit this discussion in a later section when we consider risks and ratios in terms of study design.
We can derive and understand the values that we calculate for risks and ratios from a 2 \times 2 table of observed values. There are three commonly used measures. We use them to quantify the magnitude and direction of association between two binary variables. The measures are risk difference, risk ratio (relative risk), and the odds ratio.
We define the risk difference (RD) in Equation 3, the risk of success, D, given exposure, E, minus the risk of success, D, given non-exposure, \bar{E}. To be clear, we are only considering those with the disease and we calculate the difference in the conditional probabilities - with the disease given exposure and no exposure.
\begin{align} &\text{Definition: } \text{RD} = P(D \vert E) - P(D \vert \bar{E}) \\ &\text{Estimate: } \widehat{\text{RD}} = \hat{P}(D \vert E) - \hat{P}(D \vert \bar{E}) \end{align} \tag{3}
Note that our interest lies with the success level, D, of the response variable.
As an illustrative example, consider a trial in which participants were randomized to receive either aspirin or a placebo and were followed for the development of a myocardial infarction (MI). The two levels of the explanatory variable are independent; each participant took only one of aspirin or placebo. Of the n_{1+} = 11037 participants in the aspirin group, 104 developed a myocardial infarction, for p_{1} = 104 \div 11037 \approx 0.0094, as shown in Equation 4. Of the n_{2+} = 11034 participants in the placebo group, 189 developed a myocardial infarction. The conditional probability of a myocardial infarction in this group was p_{2} = 189 \div 11034 \approx 0.0171, as shown in Equation 5.
p_{1} = \hat{P}(D \, \vert \, E) = \frac{n_{11}}{n_{1+}} = \frac{104}{11037} \approx 0.0094 \tag{4}
p_{2} = \hat{P}(D \, \vert \bar{E}) = \frac{n_{21}}{n_{2+}} = \frac{189}{11034} \approx 0.0171 \tag{5}
We state that the risk of the development of a myocardial infarction in the placebo group was 0.017 and the risk of the development of a myocardial infarction in the treatment group was 0.009.
The difference in the risk of participants who developed a myocardial infarction between the two groups is \Delta p = 0.0094 - 0.0171 = -0.0077 \in [-1, 1]. This difference is termed the risk difference.
We can interpret the result by stating that the risk (chance or probability) of developing an MI in the aspirin group is 0.77 percentage points lower than in the placebo group.
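A minimal sketch of this calculation in R, using the counts from the aspirin example (the variable names are ours):

# Risk difference for the aspirin example
p1 <- 104 / 11037 # estimated P(D | E), aspirin group
p2 <- 189 / 11034 # estimated P(D | non-exposure), placebo group
p1 - p2           # approximately -0.0077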
The RD has an interval -1 \le \text{RD} \le 1. The interval boundary values can be calculated from the extreme summary results in Table 8 and Table 9.
RD=-1 | D | \bar{D} | Total |
---|---|---|---|
E | 0 | 20 | 20 |
\bar{E} | 20 | 0 | 20 |
Total | 20 | 20 | 40 |
The risk difference is shown in Equation 6.
P(D \, \vert E) - P \left( D \, \vert \bar{E} \right) = \frac{0}{20} - \frac{20}{20} = -1 \tag{6}
RD=1 | D | \bar{D} | Total |
---|---|---|---|
E | 20 | 0 | 20 |
\bar{E} | 0 | 20 | 20 |
Total | 20 | 20 | 40 |
The risk difference in this case is shown in Equation 7.
P(D \, \vert E) - P \left( D \, \vert \bar{E} \right) = \frac{20}{20} - \frac{0}{20} = 1 \tag{7}
While considering the difference in risk, we have to take cognizance of the number of cases and hence the proportions. A \Delta p of 0.009 is the same when we subtract 0.010 - 0.001 as when we subtract 0.410 - 0.401, yet the proportions are very different.
To mitigate the differences, we consider relative risk or risk ratio (RR), shown in Equation 8.
\begin{align} \text{Definition: } &\text{RR} = \frac{P(D \vert E)}{P(D \vert \bar{E})} \\ \text{Estimate: } &\widehat{\text{RR}} = \frac{\hat{P}(D \vert E)}{\hat{P}(D \vert \bar{E})} \end{align} \tag{8}
For the two pairs of proportions mentioned above, the RRs differ markedly (0.010 \div 0.001 = 10 versus 0.410 \div 0.401 \approx 1.02) even though the RDs are identical. For our aspirin example, the estimated RR is shown in Equation 9.
\widehat{\text{RR}} = \frac{\frac{104}{11037}}{\frac{189}{11034}} \approx 0.5501 \tag{9}
The RR in the case of our example is approximately 0.550. We can interpret this as the risk of an MI in the aspirin group being 0.550 times that in the placebo group, or as the risk of an MI among those taking aspirin being 1 - 0.5501 = 0.4499 \approx 45% less than that among the placebo group. (See the explanation below in Thinking about lowering and increasing the measures of association.)
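The point estimate is a one-liner in R (the proportions are redefined here so the block runs on its own):

# Relative risk for the aspirin example
p1 <- 104 / 11037
p2 <- 189 / 11034
p1 / p2 # approximately 0.5501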
The RR has an interval 0 \le \text{RR} < \infty. The interval boundary values can be calculated from the extreme summary results in Table 10 and Table 11.
RR=0 | D | \bar{D} | Total |
---|---|---|---|
E | 0 | 20 | 20 |
\bar{E} | 20 | 0 | 20 |
Total | 20 | 20 | 40 |
The relative risk is shown in Equation 10.
\frac{P(D \, \vert \, E)}{P \left( D \, \vert \, \bar{E} \right)} = \frac{\frac{0}{20}}{\frac{20}{20}} = 0 \tag{10}
RR=\infty | D | \bar{D} | Total |
---|---|---|---|
E | 20 | 0 | 20 |
\bar{E} | 0 | 20 | 20 |
Total | 20 | 20 | 40 |
The relative risk is shown in Equation 11.
\frac{P(D \, \vert \, E)}{P \left( D \, \vert \, \bar{E} \right)} = \frac{\frac{20}{20}}{\frac{0}{20}} = \infty \tag{11}
If the population proportion of success for a binary outcome of an experiment is \pi then the odds of success is as shown in Equation 12.
\text{odds}=\frac{\pi}{1 - \pi} \tag{12}
The sample proportion of success (a myocardial infarction) in the group taking aspirin in our illustrative example was 104 \div 11037 \approx 0.0094. This is \hat{P}(D \vert E). We calculate the odds of success in the code chunk below, assigning the result to the variable odds_aspirin.
odds_aspirin <- (104 / 11037) / (1 - (104 / 11037))
odds_aspirin
[1] 0.009512485
We can do the same for the group taking the placebo. We calculate the odds of success in the placebo group in the code chunk below and assign it to the variable odds_placebo. This uses \hat{P}(D \vert \bar{E}).
odds_placebo <- (189 / 11034) / (1 - (189 / 11034))
odds_placebo
[1] 0.01742739
The odds ratio comparing the odds of success between the two groups is calculated as shown in Equation 13.
\begin{align} \text{odds ratio} = &\theta = \frac{\text{odds}_{1}}{\text{odds}_{2}} \\ \\ \text{Definition: } &\text{OR}_{\text{D}} = \frac{\frac{P(D \vert E)}{1 - P(D \vert E)}}{\frac{P(D \vert \bar{E})}{1 - P(D \vert \bar{E})}} = \frac{\frac{P(D \, \vert \, E)}{P(\bar{D} \, \vert \, E)}}{\frac{P(D \, \vert \, \bar{E})}{P(\bar{D} \, \vert \, \bar{E})}} \\ \\ \text{Estimate: } &\widehat{\text{OR}}_{\text{D}} = \frac{\frac{\hat{P}(D \vert E)}{1 - \hat{P}(D \vert E)}}{\frac{\hat{P}(D \vert \bar{E})}{1 - \hat{P}(D \vert \bar{E})}} = \frac{\frac{\hat{P}(D \, \vert \, E)}{\hat{P}(\bar{D} \, \vert \, E)}}{\frac{\hat{P}(D \, \vert \, \bar{E})}{\hat{P}(\bar{D} \, \vert \, \bar{E})}} \end{align} \tag{13}
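For the aspirin example, the sample odds ratio follows directly from the two odds computed above; a self-contained sketch:

# Odds ratio for the aspirin example
odds_aspirin <- (104 / 11037) / (1 - (104 / 11037))
odds_placebo <- (189 / 11034) / (1 - (189 / 11034))
odds_aspirin / odds_placebo # approximately 0.546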
The data for this section is in a file named measures_assoc.csv, which we import below using the user-defined function import_dataset from Section 2.
<- import_dataset("measures_assoc.csv") dfr
We create a table of observed data between Exposure and Disease.
tb <- table(dfr$Exposure, dfr$Disease)
tb
D ND
E 21 19
NE 8 32
We create another table of calculations for the measures of association between Exposure and Disease.
epiR::epi.2by2(tb)
Outcome + Outcome - Total Inc risk *
Exposed + 21 19 40 52.50 (36.13 to 68.49)
Exposed - 8 32 40 20.00 (9.05 to 35.65)
Total 29 51 80 36.25 (25.79 to 47.76)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 2.62 (1.32, 5.21)
Inc odds ratio 4.42 (1.64, 11.93)
Attrib risk in the exposed * 32.50 (12.67, 52.33)
Attrib fraction in the exposed (%) 61.90 (24.33, 80.82)
Attrib risk in the population * 16.25 (-0.02, 32.52)
Attrib fraction in the population (%) 44.83 (8.43, 66.76)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 9.141 Pr>chi2 = 0.002
Fisher exact test that OR = 1: Pr>chi2 = 0.005
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
The Inc risk is the incidence risk, which is simply \hat{P}(D \vert E) = 52.5, \hat{P}(D \vert \bar{E}) = 20.0, and \hat{P}(D) = 36.2, expressed per 100 population units. We also see the estimated odds \widehat{\text{odds}}(D \vert E) = 21 \div 19 \approx 1.105, \widehat{\text{odds}}(D \vert \bar{E}) = 8 \div 32 = 0.250, and \widehat{\text{odds}}(D) = 29 \div 51 \approx 0.569.
Within the table titled Point estimates and 95% CIs, the Inc risk ratio represents \widehat{\text{RR}} and its corresponding CI, the Inc odds ratio gives \widehat{\text{OR}} and its CI, and the Attrib risk in the exposed is the \widehat{\text{RD}} and its CI.
RD is an absolute measure of association, whereas the RR (relative risk) and OR are relative measures of association. We need to consider the use of these measures.
As an example, suppose the conditional probabilities are small, say P(D \vert E) = 0.010 and P(D \vert \bar{E}) = 0.001, as used above.
The measures of association are then \text{RD} = 0.009, \text{RR} = 10, and \text{OR} \approx 10.09.
If we only report the RR, we might consider the value to be large. The RD, though, shows the minor absolute difference. The OR is approximately the same as the RR since we have small values for P(D \vert E) and P(D \vert \bar{E}).
Contrast the above to a pair of large conditional probabilities. The OR is then quite large when compared to the RR (because P(D \vert E) and P(D \vert \bar{E}) are large). It would be useful to also state the RR or RD (if the study design allows for it).
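The pattern is easy to verify numerically; a sketch with illustrative probabilities of our own choosing:

# Small risks: the OR closely tracks the RR
p1 <- 0.010
p2 <- 0.001
c(RD = p1 - p2, RR = p1 / p2, OR = (p1 / (1 - p1)) / (p2 / (1 - p2)))

# Large risks: the OR is pulled well away from the RR
p1 <- 0.90
p2 <- 0.80
c(RD = p1 - p2, RR = p1 / p2, OR = (p1 / (1 - p1)) / (p2 / (1 - p2)))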
In the next section, we explore case-control, cohort, and cross-sectional study designs to determine which risks and ratios can be used in each study design.
A case-control study design is a form of retrospective study design, where participants are divided into groups as indicated by levels of a categorical variable. We then consider events that took place prior to this.
In the table below, n_{+1} and n_{+2} are fixed by design.
D | \bar{D} | |
---|---|---|
E | n_{11} | n_{12} |
\bar{E} | n_{21} | n_{22} |
Total | n_{+1} | n_{+2} |
We have that n_{11} and n_{12} are random and can be described by the binomial distribution, shown in Equation 14.
\begin{align} &n_{11} \sim \text{Bin}(n_{+1} , P(E \vert D)) \\ &n_{12} \sim \text{Bin}(n_{+2} , P(E \vert \bar{D})) \end{align} \tag{14}
We can only use associations that use P(E \vert D) and P(E \vert \bar{D}), that is to say odds ratios (either for the exposure or the disease, since they are equal). If the disease is rare, then we can use the odds ratio to estimate the relative risk.
Consider a study that investigates surgical site sepsis following colonic resection (removal of the large bowel). Surgical site sepsis is any infection of tissues impacted by surgical incisions and interventions.
As an example, we imagine that researchers investigated 100 participants. A total of 33 had a surgical infection and 67 did not. The proportions for this variable are fixed by the study design.
The researchers want to investigate the role of prophylactic antibiotics on the development of surgical site sepsis. Prophylactic antibiotics are given intravenously just prior to surgery. It penetrates the tissues and the aim is for a high enough concentration of the antibiotic in the tissue when incisions are made, stopping the proliferation of bacteria and preventing infection.
In this simulated study, participants were given either a novel or a regimented antibiotic as a prophylactic agent, and the choice between these was up to the attending surgeon (which, for the sake of this example, we shall take as random).
The researchers find the results in Table 12.
Group | Infection | No infection | Total |
---|---|---|---|
Novel | 5 | 30 | 35 |
Control | 28 | 37 | 65 |
Total | 33 | 67 | 100 |
Generally, the table above would be deemed incorrect. In a case-control design, the researchers fix the number of cases for the explanatory variable. In this design, the presence or absence of surgical site sepsis (infection) is the explanatory variable. We write the levels of this variable along the left-hand side (the rows), shown in Table 13.
Group | Novel | Control | Total |
---|---|---|---|
Infection | 5 | 28 | 33 |
No infection | 30 | 37 | 67 |
Total | 35 | 65 | 100 |
The probability estimates are shown in Table 14.
Group | Novel | Control | Total |
---|---|---|---|
Infection | 0.05 | 0.28 | 0.33 |
No infection | 0.30 | 0.37 | 0.67 |
Total | 0.35 | 0.65 | 1.0 |
We cannot express a probability estimate for the development of infection in the larger population, as the number of cases in each group was fixed. In other words, we cannot say that the estimated probability of infection in people who have the type of surgery involved in this study is 33%. We can, though, consider the marginal probabilities of receiving the new drug or the regimented drug. These probability estimates are the marginal probabilities 0.35 and 0.65. Of more interest are the estimated conditional probabilities. Given that an infection occurred (33 cases), the estimated probability of having received the new (novel) drug is 5 \div 33 \approx 0.15 = 15% and of having received the regimented (control) drug is 28 \div 33 \approx 0.85 = 85%. If no infection occurred, the estimated probability of having received the new drug is 30 \div 67 \approx 0.45 = 45% and of having received the regimented drug is 37 \div 67 \approx 0.55 = 55%.
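A quick sketch confirming these conditional probabilities with prop.table, conditioning on the rows (infection status); the object and dimension names are ours:

# Table 13 counts: infection status on the rows (fixed by design)
tab13 <- matrix(
  c(5, 28,
    30, 37),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Status = c("Infection", "No infection"), Drug = c("Novel", "Control"))
)

prop.table(tab13, margin = 1) # e.g. P(Novel | Infection) = 5/33, about 0.15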
A cohort study design and a randomized controlled trial are examples of a prospective study design. The levels of the explanatory variable divide the cohort or trial participants into two groups. These groups are followed over time, and results are measured for the response variable.
In Table 15, n_{1+} and n_{2+} are fixed.
D | \bar{D} | Total | |
---|---|---|---|
E | n_{11} | n_{12} | n_{1+} |
\bar{E} | n_{21} | n_{22} | n_{2+} |
Now n_{11} and n_{21} are random variables, described as shown in Equation 15.
\begin{align} &n_{11} \sim \text{Bin}(n_{1+} , P(D \vert E)) \\ &n_{21} \sim \text{Bin}(n_{2+} , P(D \vert \bar{E})) \end{align} \tag{15}
We can use measures of association that involve P(D \vert E) and P(D \vert \bar{E}), so we can use the risk difference, relative risk, and odds ratio.
As an example, we consider a much better study than the case-control study above. Researchers compare those who are given a new prophylactic antibiotic to those who received a regimented prophylactic antibiotic prior to surgery. One week after the surgery, they measure the outcome as whether surgical site sepsis (infection) occurred (response variable). The results are as shown in Table 16.
Group | Infection | No infection | Total |
---|---|---|---|
Novel | 12 | 38 | 50 |
Control | 15 | 35 | 50 |
Total | 27 | 73 | 100 |
The table of estimated probabilities is Table 17.
Group | Infection | No infection | Total |
---|---|---|---|
Novel | 0.12 | 0.38 | 0.5 |
Control | 0.15 | 0.35 | 0.5 |
Total | 0.27 | 0.73 | 1.0 |
From this design, we can state that the estimated probability of infection is 27 \div 100 = 0.27 = 27% and for not developing infection is 73 \div 100 = 0.73 = 73%. These are the marginal probabilities along the bottom. We can also state that given the new drug, the estimated probability of an infection is 12 \div 50 =0.24 = 24% and that given the regimented drug, the estimated probability is 15/50 = 0.3 = 30%. These are conditional probabilities. We conclude that the estimated probability of surgical site sepsis is lower for the new prophylactic antibiotic.
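The conditional probabilities in this paragraph can be verified with prop.table, conditioning on the rows; a short sketch (the names are ours):

# Table 16 counts: treatment groups on the rows (fixed by design)
tab16 <- matrix(
  c(12, 38,
    15, 35),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("Novel", "Control"), Outcome = c("Infection", "No infection"))
)

prop.table(tab16, margin = 1) # P(Infection | Novel) = 0.24, P(Infection | Control) = 0.30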
In a cross-sectional study design, we measure values for variables from a sample at one time.
All the cell frequencies in Table 18 are now random.
D | \bar{D} | Total | |
---|---|---|---|
E | n_{11} | n_{12} | n_{1+} |
\bar{E} | n_{21} | n_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
All are described by the multinomial distribution, shown in Equation 16.
(n_{11}, n_{12}, n_{21}, n_{22}) \sim \text{Mult}(n, P(D \cap E) , P(\bar{D} \cap E) , P(\bar{E} \cap D) , P(\bar{E} \cap \bar{D})) \tag{16}
All measures of association can be used.
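A minimal simulation sketch of this sampling model, with joint probabilities of our own choosing (they must sum to 1):

# One simulated cross-sectional 2x2 table: all four cells are multinomial
set.seed(12)                            # reproducibility
cell_probs <- c(0.25, 0.22, 0.20, 0.33) # assumed joint probabilities
counts <- rmultinom(1, size = 100, prob = cell_probs)

matrix(
  counts,
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("E", "NE"), c("D", "ND"))
)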
As an example, we may have a list of patients who visited a local clinic that opened two calendar years ago. We randomly select 100 people to contact over a two-week period and ask them to answer a few survey questions. We record the values in each case. Note that although the data collection took two weeks, the data are a snapshot, with frequencies for two variables calculated.
Imagine then that among other questions, we asked them whether their last visit was in the first year or in the second year (with levels Second and First) and if they were satisfied with the waiting period (levels Yes and No). The results are shown in Table 19.
Year | Yes | No | Total |
---|---|---|---|
Second | 24 | 23 | 47 |
First | 20 | 33 | 53 |
Total | 44 | 56 | 100 |
The estimated probabilities are in Table 20.
Year | Yes | No | Total |
---|---|---|---|
Second | 0.24 | 0.23 | 0.47 |
First | 0.2 | 0.33 | 0.53 |
Total | 0.44 | 0.56 | 1.0 |
In this case, neither variable is fixed. We can express the estimated probability of a last visit being in the second year as 0.47 and in the first year as 0.53. We can also state that the probability of being satisfied with the waiting times is 0.44 and of not being satisfied is 0.56. In all these cases, we are considering marginal probabilities.
We can also consider all eight conditional probabilities.
Next, we consider confidence intervals (CIs) for the measures of association.
Confidence intervals are useful for estimation of population parameters, precision assessment of estimates, and hypothesis testing.
To calculate the CI for risk difference (RD), we use the two conditional probabilities in Equation 17.
\begin{align} &p_{1} = \hat{P}(D \vert E) = \frac{n_{11}}{n_{1+}} \\ \\ &p_{2} = \hat{P}(D \vert \bar{E}) = \frac{n_{21}}{n_{2+}} \end{align} \tag{17}
Now we can calculate the 100(1 - \alpha)% CI for RD, as shown in Equation 18.
(p_{1} - p_{2}) \pm z_{\frac{\alpha}{2}} \sqrt{\frac{p_{1} (1 - p_{1})}{n_{1+}} + \frac{p_{2} (1 - p_{2})}{n_{2+}}} \tag{18}
We use the normal approximation to the binomial distribution for n_{11} and n_{21}, which requires that all four cell frequencies are at least 5.
As an example, we consider the results from a study in Table 21.
D | \bar{D} | Total | |
---|---|---|---|
E | 21 | 19 | 40 |
\bar{E} | 8 | 32 | 40 |
Total | 29 | 51 | 80 |
We substitute the values in the table into Equation 18.
p1 <- 21 / 40
p2 <- 8 / 40
z_a2 <- 1.96

lower <- (p1 - p2) - (z_a2 * sqrt(((p1 * (1 - p1)) / (40)) + ((p2 * (1 - p2)) / (40))))
upper <- (p1 - p2) + (z_a2 * sqrt(((p1 * (1 - p1)) / (40)) + ((p2 * (1 - p2)) / (40))))
lower
[1] 0.1267164
upper
[1] 0.5232836
We are 95% confident that the true RD is between 12.67 and 52.33 percentage points.
We could also state that we are 95% confident that the true risk of disease in the exposure group is between 12.67 and 52.33 percentage points greater than the risk of disease in the non-exposure group.
We note that the null value for RD, which is 0, is not in the interval.
The sampling distribution for relative risk is not normally distributed. So, we cannot use the normal approximation to calculate the confidence interval for the relative risk.
In the code cell below, we simulate 1000 studies so that we can visualize the sampling distribution of the estimated relative risk. Each simulated study has 100 participants per group, with success probabilities 0.3 in the exposure group and 0.2 in the non-exposure group.
n_11 <- rbinom(1000, 100, 0.3)
n_21 <- rbinom(1000, 100, 0.2)

rr <- n_11 / n_21
hist(
  rr,
  main = "Simulated sampling distribution of estimated relative risk",
  xlab = "Estimated relative risk",
  ylab = "Frequency",
  col = "#4197D9",
  las = 1,
  panel.first = grid()
)
hist(
  log(rr),
  main = "Simulated sampling distribution of logarithm of estimated relative risk",
  xlab = "Logarithm of estimated relative risk",
  ylab = "Frequency",
  col = "#98FB98",
  las = 1,
  panel.first = grid()
)
The histogram in Figure 1 shows the distribution of the simulated values. The sampling distribution is right skewed. In Figure 2, we use the natural logarithm transformation and plot a histogram of the sampling distribution of \log{\widehat{\text{RR}}}.
Using Equation 17 for p_{1} and p_{2}, we show the equation for the confidence interval of \log{\widehat{\text{RR}}} in Equation 19.
\log{\left( \frac{p_{1}}{p_{2}} \right)} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{1 - p_{1}}{n_{11}} + \frac{1 - p_{2}}{n_{21}}} \tag{19}
The 95\% CI with z_{\frac{\alpha}{2}} = 1.96 for \log{\widehat{\text{RR}}} is calculated below, substituting the values from our table of observed values in Table 21.
lower <- log(p1 / p2) - z_a2 * sqrt(((1 - p1) / 21) + ((1 - p2) / 8))
upper <- log(p1 / p2) + z_a2 * sqrt(((1 - p1) / 21) + ((1 - p2) / 8))
lower
[1] 0.2787476
upper
[1] 1.651414
To calculate the CI for the estimated relative risk from our table of observed values, we need to exponentiate each value (that is, take the anti-log). In Equation 20, we see the anti-log for our example values.
(e^{0.2787476} , e^{1.651414}) \tag{20}
Below, we calculate the estimated relative risk and corresponding lower and upper bounds of the 95% CI.
p1 / p2
[1] 2.625
round(exp(lower), digits = 3)
[1] 1.321
round(exp(upper), digits = 3)
[1] 5.214
The \widehat{\text{RR}} = 2.625 with a 95% CI of 1.321 to 5.214. We can state that we are 95% confident that the true population RR is in this interval. Alternatively, we state that we are 95% confident that the true risk of disease in the exposure group is between 1.321 - 1 = 0.321 = 32% and 5.214 - 1 = 4.214 = 421% greater than the risk of disease in the non-exposure group.
We performed a natural logarithm transformation (log transformation) to use the normal approximation to the binomial distribution. We still need to make sure that the assumptions for the use of this approximation are met. That is n_{ij} \ge 5 (which in our case it is).
Note also that the null value for the relative risk (which is 1) is not within the interval.
The sampling distribution of the odds ratio (OR) is also not normally distributed and we use the log-transformation.
The equation for the 100(1-\alpha)% CI for the log OR is shown in Equation 21.
\log{\left( \frac{\frac{p_{1}}{1 - p_{1}}}{\frac{p_{2}}{1 - p_{2}}} \right)} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \tag{21}
The estimated CI for the log \widehat{\text{OR}} is calculated below, substituting the values from our table of observed values in Table 21.
log_odds_ratio <- log((p1 / (1 - p1)) / (p2 / (1 - p2)))
log_odds_ratio
[1] 1.486378
Below, we calculate the estimated OR and the corresponding lower and upper bounds of the 95% CI.
round((p1 / (1 - p1)) / (p2 / (1 - p2)), digits = 3)
[1] 4.421
se_or <- sqrt((1 / 21) + (1 / 19) + (1 / 8) + (1 / 32))
lower <- log_odds_ratio - z_a2 * se_or
upper <- log_odds_ratio + z_a2 * se_or
round(exp(lower), digits = 3)
[1] 1.638
round(exp(upper), digits = 3)
[1] 11.93
The \widehat{\text{OR}} = 4.421 with a 95% CI of 1.638 to 11.93. We can state that we are 95% confident that the true population OR is in this interval. Alternatively, we state that we are 95% confident that the odds of disease in the exposure group are between 1.638 - 1 = 0.638 = 63.8% and 11.93 - 1 = 10.93 = 1093% greater than the odds of disease in the non-exposure group.
We performed a natural logarithm transformation (log transformation) to use the normal approximation to the binomial distribution. We still need to make sure that the assumptions for the use of this approximation are met. That is n_{ij} \ge 5, which is true for this case.
Note also that the null value for the OR (which is 1) is not within the interval. There is a positive association between exposure and disease.
Our first task is to develop values for the null hypothesis for measures of association.
Consider the table below, where we have two binary variables exposure (with levels E and \bar{E}) and disease (with levels D and \bar{D}). In Table 22, we express the joint probabilities \pi_{ij}, as well as the marginal probabilities.
D | \bar{D} | Total | |
---|---|---|---|
E | \pi_{11} | \pi_{12} | \pi_{1+} |
\bar{E} | \pi_{21} | \pi_{22} | \pi_{2+} |
Total | \pi_{+1} | \pi_{+2} | 1.0 |
We remember from probability theory that two events, A and B, are independent if and only if P(A \cap B) = P(A)P(B). The two binary variables are then independent if and only if each joint probability equals the product of its marginal probabilities, that is, \pi_{ij} = \pi_{i+} \pi_{+j} for all i and j.
This independence has implications for the measures of association, shown in Equation 23 and Equation 24, where we use Equation 22 from probability theory.
P(A \vert B) = \frac{P(A \cap B)}{P(B)} \tag{22}
\begin{align} \text{RD} = P(D \vert E) - P(D \vert \bar{E}) &= \frac{P(D \cap E)}{P(E)} - \frac{P(D \cap \bar{E})}{P(\bar{E})} \\ &= \frac{\pi_{1+} \pi_{+1}}{\pi_{1+}} - \frac{\pi_{2+} \pi_{+1}}{\pi_{2+}} \\ &= \pi_{+1} - \pi_{+1} \\ &= 0 \end{align} \tag{23}
\text{RR} = \frac{P(D \vert E)}{P(D \vert \bar{E})} = \frac{\frac{\pi_{1+} \pi_{+1}}{\pi_{1+}}}{\frac{\pi_{2+}\pi_{+1}}{\pi_{2+}}} = 1 \tag{24}
The null value for the OR is the same, shown in Equation 25.
\begin{align} &\text{OR} = 1 \end{align} \tag{25}
In the section below, we state the null hypothesis and the alternative hypothesis in words, statistical terms, and in values.
In words,
H_{0}: Outcome level and exposure level are independent (not associated)
H_{1}: Outcome level and exposure level are not independent (are associated)
In statistical terms,
H_{0}: \; \pi_{ij} = \pi_{i+} \pi_{+j} (joint probability equals product of its marginal probabilities)
H_{1}: \; \pi_{ij} \ne \pi_{i+} \pi_{+j}
In values by measure of association,
RD | OR | RR |
---|---|---|
H_{0}: \; \text{RD} = 0 \quad H_{1}: \; \text{RD} \ne 0 | H_{0}: \; \text{OR} = 1 \quad H_{1}: \; \text{OR} \ne 1 | H_{0}: \; \text{RR} = 1 \quad H_{1}: \; \text{RR} \ne 1 |
The Pearson \chi^{2} test for independence compares the observed frequencies with the frequencies that would be expected if the null hypothesis were true.
We have Table 23 of observed values.
D | \bar{D} | Total | |
---|---|---|---|
E | n_{11} | n_{12} | n_{1+} |
\bar{E} | n_{21} | n_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
Table 23 can be rewritten, as shown in Table 24, using O_{ij} for the observed values.
D | \bar{D} | Total | |
---|---|---|---|
E | O_{11} | O_{12} | n_{1+} |
\bar{E} | O_{21} | O_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
The aim is to calculate the expected values under the null hypothesis. We use the notation in Table 25, indicating the table of expected values.
D | \bar{D} | Total | |
---|---|---|---|
E | E_{11} | E_{12} | n_{1+} |
\bar{E} | E_{21} | E_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
If the null hypothesis were true, then disease and exposure are independent variables, and each joint probability would factor into the product of the corresponding marginal probabilities (as we remember from the probability theory mentioned before). Under the assumption of the null hypothesis, we can then estimate the joint probabilities as in Equation 26, Equation 27, Equation 28, and Equation 29.
\begin{align} &\hat{P}(D \cap E) = \frac{n_{11}}{n} = \hat{P}(D) \hat{P}(E) = \frac{n_{+1}}{n} \frac{n_{1+}}{n} \\ &\frac{n_{11}}{n} = \frac{n_{+1}}{n} \frac{n_{1+}}{n} \\ &n_{11} = \frac{n_{+1} n_{1+}}{n}, \text{ the expected value } E_{11} \end{align} \tag{26}
\begin{align} &\hat{P}(\bar{D} \cap E) = \frac{n_{12}}{n} = \hat{P}(\bar{D}) \hat{P}(E) = \frac{n_{+2}}{n} \frac{n_{1+}}{n} \\ &\frac{n_{12}}{n} = \frac{n_{+2}}{n} \frac{n_{1+}}{n} \\ &n_{12} = \frac{n_{+2} n_{1+}}{n}, \text{ the expected value } E_{12} \end{align} \tag{27}
\begin{align} &\hat{P}(D \cap \bar{E}) = \frac{n_{21}}{n} = \hat{P}(D) \hat{P}(\bar{E}) = \frac{n_{+1}}{n} \frac{n_{2+}}{n} \\ &\frac{n_{21}}{n} = \frac{n_{+1}}{n} \frac{n_{2+}}{n} \\ &n_{21} = \frac{n_{+1} n_{2+}}{n}, \text{ the expected value } E_{21} \end{align} \tag{28}
\begin{align} &\hat{P}(\bar{D} \cap \bar{E}) = \frac{n_{22}}{n} = \hat{P}(\bar{D}) \hat{P}(\bar{E}) = \frac{n_{+2}}{n} \frac{n_{2+}}{n} \\ &\frac{n_{22}}{n} = \frac{n_{+2}}{n} \frac{n_{2+}}{n} \\ &n_{22} = \frac{n_{+2} n_{2+}}{n}, \text{ the expected value } E_{22} \end{align} \tag{29}
If E is the expected frequency under the null hypothesis (not used here to indicate exposure), then we have the expected cell frequencies derived in Equation 30.
\begin{align} &E_{ij} = \frac{n_{i+}}{n} \frac{n_{+j}}{n} \times n \\ &E_{ij} = \frac{n_{i+} n_{+j}}{n} \end{align} \tag{30}
Any expected cell frequency, E_{ij}, under the null hypothesis is the product of its (row and column) marginal totals, n_{i+} and n_{+j}, divided by the sample size, n, with the calculations summarized in Table 26.
D | \bar{D} | |
---|---|---|
E | E_{11} = \frac{n_{+1} n_{1+}}{n} | E_{12} = \frac{n_{+2} n_{1+}}{n} |
\bar{E} | E_{21} = \frac{n_{+1} n_{2+}}{n} | E_{22} = \frac{n_{+2} n_{2+}}{n} |
We can use Equation 31 to calculate a test statistic (standardized by the expected values), which we denote as X^{2}.
X^{2} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left( O_{ij} - E_{ij}\right)^{2}}{E_{ij}} \tag{31}
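As a check on Equation 30 and Equation 31, a minimal sketch computing the expected counts and the X^{2} statistic for the 2 \times 2 table used elsewhere in this chapter (Table 21); the object names are ours:

# Observed counts from Table 21
obs <- matrix(
  c(21, 19,
    8, 32),
  nrow = 2,
  byrow = TRUE
)

# Expected counts under independence: row total times column total over n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected # 14.5 and 25.5 in each row

# Pearson X^2 statistic
sum((obs - expected)^2 / expected) # 9.141, matching chisq.test(correct = FALSE) below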
If the null hypothesis is true, then X^{2} approximately follows a \chi^{2} distribution (X^{2} \sim \chi_{1}^{2}), that is, with a single degree of freedom. This approximation holds if each expected value is greater than or equal to 5.
While we have been considering a 2 \times 2 table of observed values, we can use two multilevel variables (with I levels and J levels, respectively). Provided that no more than 20% of the expected values are less than 5, we can use the approximation shown in Equation 32. (See below in Degrees of freedom for the \chi^{2} distribution for more information.)
X^{2} \sim \chi_{(I-1)(J-1)}^{2} \tag{32}
The degrees of freedom that we use for tables of observed values (and expected values) is the number of parameters under the alternative hypothesis minus the number of parameters under the null hypothesis.
Consider three random numbers. If we want them to sum to a certain total, two of the values are free (they can take on any value). This immediately fixes the third value, though (such that we still sum to the same total). In this sense, we have two degrees of freedom in this simple explanation. We say that two of the values are non-redundant.
In a table of observed values, we have I rows and J columns. The marginal probabilities must sum to 1 along the rows and along the columns, leaving I-1 and J-1 free values, respectively. In total, then, we have (I-1) + (J-1) = I + J - 2 non-redundant parameters under the null hypothesis.
Under the alternative hypothesis, there are no restrictions on the joint probabilities (except that they sum to 1.0). We therefore have IJ - 1 non-redundant parameters.
The difference is then as shown in Equation 33.
(IJ-1) - (I + J -2) = IJ - I - J + 1 = (I-1)(J-1) \tag{33}
We note the following general properties of the \chi^{2} distribution with df degrees of freedom: its mean is df, its variance is 2 \, df, and it is right skewed, approaching a normal distribution as df increases.
We can use either a critical value approach or a p value approach to determine significance using the Pearson \chi^{2} test.
The critical \chi^{2}_{1, \alpha} values can be computed using the qchisq function. We do this below for \alpha = 0.1, \alpha = 0.05, and \alpha = 0.01 (all for a single degree of freedom).
# alpha value 0.1
qchisq(0.1, df = 1, lower.tail = FALSE)
[1] 2.705543

# alpha value 0.05
qchisq(0.05, df = 1, lower.tail = FALSE)
[1] 3.841459

# alpha value 0.01
qchisq(0.01, df = 1, lower.tail = FALSE)
[1] 6.634897
In Figure 3, using the dist_chisq function from the sjPlot library, we see the area of rejection for \alpha = 0.05.
sjPlot::dist_chisq(
  chi2 = NULL,
  deg.f = 1,
  p = 0.05,
  xmax = NULL,
  geom.alpha = 0.7
)
In the critical value approach, we reject H_{0} if X^{2} > \chi_{1, \alpha}^{2}.
If we use the p value approach, we reject H_{0} if P(X^{2} > X^{2}_{\text{observed}}) < \alpha.
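The p value in the second approach is the upper-tail area beyond the observed statistic; a one-line sketch using the X^{2} value computed for Table 21 above:

# Upper-tail p value for an observed X^2 of 9.141 with 1 degree of freedom
pchisq(9.141, df = 1, lower.tail = FALSE) # approximately 0.0025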
We use the measures_assoc data from a previous section.
<- import_dataset("measures_assoc.csv") dfr
The table function returns a table of observed values.
tb <- table(dfr$Exposure, dfr$Disease)
tb
D ND
E 21 19
NE 8 32
The null hypothesis is that the two variables are not associated (they are independent) and the alternative hypothesis is that they are associated (not independent).
The chisq.test function takes the table of observed values as its argument and returns the X^{2} statistic and the p value.
# Don't use continuity correction
tst <- chisq.test(tb, correct = FALSE)
tst
Pearson's Chi-squared test
data: tb
X-squared = 9.1413, df = 1, p-value = 0.002499
From the p value approach, we reject the null hypothesis for \alpha = 0.05.
Since we assigned the test to a computer variable, tst, we can extract the statistic attribute.
tst$statistic
X-squared
9.141312
Now, we can use the dist_chisq function again. Unfortunately, the X^{2} value is large and the area under the curve is too small to see for df = 1 in Figure 4.
sjPlot::dist_chisq(
  chi2 = tst$statistic,
  deg.f = 1,
  p = 0.05
)
Remember that the expected cell frequencies must all be 5 or more for us to use Pearson's \chi^{2} test. The expected attribute returns the table of expected values.
tst$expected
D ND
E 14.5 25.5
NE 14.5 25.5
In reporting the results of our study, we can use confidence intervals of the measures of association and/or hypothesis testing. In practice, we use both. The measures of association help us understand the direction of the association.
In this sense, the epi.2by2 function from the epiR library returns all the information that we require.
epiR::epi.2by2(tb)
Outcome + Outcome - Total Inc risk *
Exposed + 21 19 40 52.50 (36.13 to 68.49)
Exposed - 8 32 40 20.00 (9.05 to 35.65)
Total 29 51 80 36.25 (25.79 to 47.76)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 2.62 (1.32, 5.21)
Inc odds ratio 4.42 (1.64, 11.93)
Attrib risk in the exposed * 32.50 (12.67, 52.33)
Attrib fraction in the exposed (%) 61.90 (24.33, 80.82)
Attrib risk in the population * 16.25 (-0.02, 32.52)
Attrib fraction in the population (%) 44.83 (8.43, 66.76)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 9.141 Pr>chi2 = 0.002
Fisher exact test that OR = 1: Pr>chi2 = 0.005
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
In this last section, we simulate data with an exposure variable with 3 levels and a disease variable with 2 levels.
set.seed(12) # Reproducibility

exposure <- sample(
  c("Low dose", "High dose", "Placebo"),
  size = 100,
  replace = TRUE,
  prob = c(0.3, 0.4, 0.3) # Stated probabilities for each of 3 levels
)

outcome <- sample(
  c("Improve", "not improve"),
  size = 100,
  replace = TRUE,
  prob = c(0.4, 0.6)
)

dfr <- data.frame(
  Exposure = exposure,
  Disease = outcome
)
Below, we create a table of observed values.
tb <- table(dfr$Exposure, dfr$Disease)
tb
Improve not improve
High dose 25 23
Low dose 11 17
Placebo 7 17
The Pearson \chi^{2} test is performed below.
tst <- chisq.test(tb)
tst
Pearson's Chi-squared test
data: tb
X-squared = 3.6472, df = 2, p-value = 0.1614
We cannot express measures of association as we do not have a 2 \times 2 design.
Finally, we visualize the results in Figure 5.
sjPlot::dist_chisq(
  chi2 = tst$statistic,
  deg.f = 2
)
Epitools as an Alternative to Epi2by2
The epitools library returns the same results as the epiR library. Note that epitools requires the 2 \times 2 table of observed values to be in the order shown in Table 27.
\bar{D} | D | Total | |
---|---|---|---|
\bar{E} | |||
E | |||
Total |
We import a comma-separated values file to demonstrate the use of epitools.
<- import_dataset("measures_assoc.csv") dfr
The table of observed values is shown below and is in the format required by the epiR library.
tab <- xtabs(
  ~ Exposure + Disease,
  data = dfr
)

tab
Disease
Exposure D ND
E 21 19
NE 8 32
The data are summarized with a mosaic plot.
(Desc(tab, verbose = "high", expected = TRUE))
------------------------------------------------------------------------------
tab (xtabs, table)
Summary:
n: 80, rows: 2, columns: 2
Pearson's Chi-squared test (cont. adj):
X-squared = 7.789, df = 1, p-value = 0.005256
Fisher's exact test p-value = 0.004827
D ND Sum
E freq 21 19 40
perc 26.2% 23.8% 50.0%
p.row 52.5% 47.5% .
p.col 72.4% 37.3% .
exp 14.500 25.500 .
NE freq 8 32 40
perc 10.0% 40.0% 50.0%
p.row 20.0% 80.0% .
p.col 27.6% 62.7% .
exp 14.500 25.500 .
Sum freq 29 51 80
perc 36.3% 63.7% 100.0%
p.row . . .
p.col . . .
exp . . .
----------
' 95% conf. level
The data can also be summarized with a stacked bar plot.
ggstatsplot::ggbarstats(
  data = dfr,
  x = Disease,
  y = Exposure,
  type = "parametric",
  bf.message = FALSE,
  results.subtitle = FALSE,
  palette = "Blues"
)
riskratio.wald
The riskratio.wald function calculates the relative risk estimate and its confidence interval. Note the use of the keyword argument rev, set to both such that both the rows and columns of the table of observed values are reversed.
epitools::riskratio.wald(
  tab,
  rev = "both"
)
$data
Disease
Exposure ND D Total
NE 32 8 40
E 19 21 40
Total 51 29 80
$measure
risk ratio with 95% C.I.
Exposure estimate lower upper
NE 1.000 NA NA
E 2.625 1.32149 5.214283
$p.value
two-sided
Exposure midp.exact fisher.exact chi.square
NE NA NA NA
E 0.002892717 0.004827425 0.002499019
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
The results show \widehat{\text{RR}} = 2.625 with a Wald 95% confidence interval of (1.32, 5.21), as with the epiR epi.2by2 function. We also see the results of hypothesis testing using exact tests and the Pearson \chi^{2} test for association.
riskratio
The riskratio function takes the method keyword argument, which lets us choose among Wald, small-sample adjusted, and bootstrap estimation. Below, we select wald as the method.
epitools::riskratio(
  tab,
  method = "wald",
  rev = "both"
)
$data
Disease
Exposure ND D Total
NE 32 8 40
E 19 21 40
Total 51 29 80
$measure
risk ratio with 95% C.I.
Exposure estimate lower upper
NE 1.000 NA NA
E 2.625 1.32149 5.214283
$p.value
two-sided
Exposure midp.exact fisher.exact chi.square
NE NA NA NA
E 0.002892717 0.004827425 0.002499019
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
oddsratio.wald
The oddsratio.wald function is similar to the riskratio.wald function, but for the odds ratio. It computes the odds ratio estimate and its corresponding confidence interval.
epitools::oddsratio.wald(
  tab,
  rev = "both"
)
$data
Disease
Exposure ND D Total
NE 32 8 40
E 19 21 40
Total 51 29 80
$measure
odds ratio with 95% C.I.
Exposure estimate lower upper
NE 1.000000 NA NA
E 4.421053 1.638427 11.92956
$p.value
two-sided
Exposure midp.exact fisher.exact chi.square
NE NA NA NA
E 0.002892717 0.004827425 0.002499019
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
The first lab problem questions are based on the following simulated data for an explanatory variable Procedure with two levels, I and II, denoting two types of inguinal hernia repair. A total of 50 participants were randomly assigned to one of the two methods of surgical repair of these groin hernias. The response variable, Recurrence, measured whether they developed a recurrence (return of the hernia), with levels No and Yes.
The data is stored in a data frame object and assigned to the variable data.
set.seed(12)

group <- rep(
  c("I", "II"),
  each = 25
)

recurrence <- sample(
  c("No", "Yes"),
  size = 50,
  replace = TRUE,
  prob = c(0.8, 0.2)
)

data <- data.frame(
  Group = group,
  Recurrence = factor(recurrence, levels = c("Yes", "No"))
)
We use a stacked bar plot to visualize the results.
ggbarstats(
data = data,
x = Recurrence,
y = Group,
type = "parametric",
bf.message = FALSE,
results.subtitle = FALSE,
palette = "Blues",
title = "Proportion of recurrence and no recurrence in two type of hernia repair",
xlab = "Group",
legend.title = "Recurrence",
ggplot.component = list(ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 2)))
)
Complete the following four questions.
Create a 2 \times 2 table of observed values.
Calculate the joint probabilities.
Calculate the marginal probabilities.
Calculate the relevant conditional probabilities.
Find an example of each study design in the published literature that contains the results of risk and ratio analysis and comment on these results.
The research data for this lab pertains to the development of a cerebrovascular accident (a stroke). The data is available in a comma-separated file named stroke.csv.
Do NOT use a library function to answer the questions of problems 3.1 through 3.4.
Complete the following three questions.
Import the data as a data frame object and assign it to the computer variable dfr.
Verify that there is no missing data.
Both \texttt{hypertension} and \texttt{stroke} are binary variables. The levels are No and Yes for each. Consider \texttt{hypertension} to be the explanatory variable, with the level Yes as E and the level No as \bar{E}. Consider Yes (the development of a stroke) as D, and No (the absence of the development of a stroke) as \bar{D}. Generate a table of observed values and assign it to the variable dfr.
Answer the following five questions.
Calculate the risk estimate of a stroke given the presence of hypertension.
Calculate the risk estimate of a stroke given the absence of hypertension.
Calculate the risk difference estimate for a stroke comparing those with and without hypertension.
Calculate a 95% confidence interval for the population risk difference of a stroke comparing those with and without hypertension.
Comment on your results.
Answer the following three questions.
Calculate the relative risk estimate for a stroke comparing those with and without hypertension.
Calculate a 95% confidence interval for the population relative risk of a stroke comparing those with and without hypertension.
Comment on your results.
Answer the following five questions.
Calculate the odds of a stroke among those with hypertension.
Calculate the odds of a stroke among those without hypertension.
Calculate the odds ratio for a stroke comparing those with and without hypertension.
Calculate a 95% confidence interval for the odds ratio for a stroke comparing those with and without hypertension.
Comment on your results.
Answer the following question using an appropriate R function. Researchers consider the association between the presence, E = \text{Yes}, and absence \bar{E} = \text{No}, of heart disease and the development, D, and non-development, \bar{D}, of a stroke.
Calculate the risk difference and 95\% confidence interval for the risk difference pertaining to the research.
Calculate the relative risk and 95\% confidence interval for the relative risk pertaining to the research.
Calculate the odds ratio and 95\% confidence interval for the odds ratio pertaining to the research.
Write a comment regarding the results.
The following questions pertain to the same data set that was used in the previous notebook. Patients either have hypertension, E, or do not have hypertension, \bar{E}. The presence of hypertension is our explanatory variable. The outcome or response variable levels are stroke, D, and no stroke, \bar{D}. Use a significance level of 0.05 for this problem.
The data is available in the stroke.csv file that is imported below and assigned to the computer variable stroke.
<- import_dataset("stroke.csv") stroke
Create a contingency table from the data and use the usual order of the levels (as used in this course). Assign the table to the variable tab.
Calculate the table of expected values (without using any R functions).
Can we use the \chi^{2} distribution to approximate our Pearson test statistic?
Calculate the X^{2} test statistic without using any R function and assign it to the variable xs.
How many degrees of freedom are there for this problem?
Calculate the critical \chi^{2} value for \alpha=0.05.
Is the test statistic larger than the critical statistic value?
Use an appropriate R function to calculate a p-value for the test statistic.
What is the test conclusion?
Use an appropriate R function to confirm the results of your calculations using the Pearson \chi^{2} tests for independence.
Confirm the values for the table of expected values.
Calculate the estimated odds ratio and 95\% Wald confidence interval to support your test conclusion. Does the interval contain the null value \text{OR} = 1?