Chapter 2

Author
Affiliation
Juan H Klopper (MD), Do Hee Lee (MPH)

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

In Chapter 2, we establish definitions for the measures of association: risk difference, relative risk, and odds ratio. Then, using the epiR and epitools libraries, we calculate these measures of association. Additionally, we discuss which measures of association are appropriate for each study design and perform the relevant hypothesis testing.

Section 3. Anatomy of a 2 \times 2 Contingency Table: contingency table
Section 4. Probabilities Derived from a Contingency Table: relative frequency, joint probabilities, joint probability distribution, marginal probabilities, marginal probability distribution, conditional probabilities
Section 5. Measures of Association for a 2 \times 2 Contingency Table: risk difference, relative risk, odds ratio
Section 6. Application of Measures of Association Based on Study Design: absolute vs. relative measure of association, case-control design, cohort design, randomized controlled trial, cross-sectional design
Section 7. Confidence Intervals for Measures of Association: confidence intervals for the risk difference, relative risk, and odds ratio
Section 8. Hypothesis Testing for the Measures of Association (Pearson’s \chi^{2} Approach): Pearson’s \chi^{2} test, degrees of freedom, \chi^{2} distribution

2 Libraries

Load the required libraries for Chapter 2. If these packages are not already installed, please do so before loading the libraries.

Code
library(ggplot2)
library(latex2exp)
library(crosstable)
library(gmodels)
library(ggstatsplot)
library(sjPlot)
library(DescTools)
library(rio)

Also, all data for this chapter and subsequent chapters are from a GitHub repository. Below is a user-defined function that allows for more convenient data import into R. An example of how to apply the user-defined function import_dataset is demonstrated in Section 5.4.

Code
import_dataset <- function(filename) {
  # GitHub repository URL
  github_url <- "https://raw.githubusercontent.com/juanklopper/TutorialData/main/"

  # Full URL to the raw data file
  file_url <- paste0(github_url, filename)

  # Import the file using rio::import
  dataset <- import(file_url)

  return(dataset)
}

3 Anatomy of a 2 \times 2 Contingency Table

3.1 Sample Contingency Table

Studies concerning categorical variables allow us to consider the frequency of outcomes for each variable and the association of these frequencies between pairs of variables. We can express such outcomes in a contingency table or a table of observed values, an example of which is shown in Table 1. Participants are randomized into two groups, A and B. They are observed for one of two outcomes, Outcome 1 and Outcome 2.

Table 1: Joint Frequencies
Group Outcome 1 Outcome 2 Total
A 509 116 625
B 398 104 502
Total 907 220 1127

A total of 509 participants in group A have Outcome 1 and 116 in group A have Outcome 2. A total of 398 in group B have Outcome 1 and 104 in group B have Outcome 2.

Note that for this chapter, we will usually have the explanatory variable, in this case the groups, on the left-hand side (the rows), with a row for each level of the explanatory variable. The levels of the response variable are along the top (the columns), with a column for each level of the response variable. In this example, we have 2 rows and 2 columns, resulting in a 2 \times 2 contingency table.

Given this structure of the contingency table, we can sum the frequencies along the rows and columns. We note the following:

  • Irrespective of which outcome participants were observed to have, there are 509 + 116 = 625 participants in group A and 398 + 104 = 502 participants in group B.

  • Irrespective of which group participants were in, there are 509 + 398 = 907 participants who have Outcome 1 and 116 + 104 = 220 who have Outcome 2.

  • We also note a total of 1127 participants. This value must equal the sum of the row totals, the sum of the column totals, and the sum of the four joint frequencies.

3.2 General Notation

Our general notation for a 2 \times 2 table will be as in Table 2. The subscripts denote the row and then the column number. We use the + symbol to indicate summation over an index: placed second, it sums over the columns (a row total), and placed first, it sums over the rows (a column total).

Table 2: Frequency Table of Observed Data
Group Outcome 1 Outcome 2 Total
A n_{1,1} n_{1,2} n_{1,+}
B n_{2,1} n_{2,2} n_{2,+}
Total n_{+,1} n_{+,2} n

Different textbooks use different notation for the groups and outcomes in a 2 \times 2 table of observed values. You might see a table as in Table 3, where E is exposure and \bar{E} is non-exposure, with D being disease and \bar{D} being non-disease. Other textbooks might use success for D and failure for \bar{D}. None of these terms should be taken literally. They merely help us identify the outcome of interest (D or success).

Table 3: Expressing Exposure and Disease
D \bar{D} Total
E n_{1,1} n_{1,2} n_{1,+}
\bar{E} n_{2,1} n_{2,2} n_{2,+}
Total n_{+,1} n_{+,2} n

We are not restricted to a 2 \times 2 contingency table. If the explanatory variable has m levels and the response variable has n levels, then we have an m \times n contingency table.

4 Probabilities Derived from a Contingency Table

In Table 4, we have the results of a survey. The data are frequencies of observed values for the levels of two categorical variables, Group (explanatory variable) and Choice (response variable). There are two categories for the Group variable, A and B, and two for the Choice variable, I and II.

Table 4: Example Data
I II
A 80 100
B 120 25

The total number of participants is 80 + 100 + 120 + 25 = 325.

  • There are 80 + 100 = 180 participants in group A and 120 + 25 = 145 participants in group B.

  • Of all the participants, 80 + 120 = 200 chose I and 100 + 25 = 125 chose II. Table 5 shows these values.

Table 5: Row and Column Totals Added
I II Total
A 80 100 180
B 120 25 145
Total 200 125 325

We can express a relative frequency or a proportion by dividing each frequency by the total number of participants. The result is a contingency table, Table 6, that is a probability distribution.

Table 6: Relative Frequencies
I II Total
A 0.246 0.308 0.554
B 0.369 0.077 0.446
Total 0.615 0.385 1.0

4.1 Joint Probabilities

The values 0.246, 0.308, 0.369, and 0.077 are termed joint probabilities. For the first value, we could state that the probability of being in group A and having chosen I is 0.246. We can write this as P(A \cap I), the probability of being in group A and having chosen I. The four values together form the joint probability distribution. In some textbooks, the notation \hat{P} is used to denote that the results are from a sample and are used to estimate the true population joint probability.

Note also that when we refer to a probability here, we are actually calculating a proportion.

In general, we have Equation 1 for a joint probability, where i denotes the row number and j the column number.

\frac{n_{ij}}{n} \tag{1}
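As a quick check, we can reproduce the joint probabilities in Table 6 in R. The sketch below is a minimal illustration: the matrix tab simply holds the frequencies from Table 5, and prop.table divides each cell by the total.

Code
# Frequencies from Table 5 (Group by Choice)
tab <- matrix(
  c(80, 100,
    120, 25),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("A", "B"), Choice = c("I", "II"))
)

round(prop.table(tab), 3) # joint probabilities, each cell divided by n = 325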

4.2 Marginal Probabilities

The values 0.554, 0.446, 0.615, and 0.385 are termed marginal probabilities and form the marginal probability distribution. For 0.554 we could write P(\text{being in group A}) or then \hat{P}(A). Each of the two sets of marginal probabilities (along the rows or the columns) sums to 1.0.

The form of the equation depends on whether we are considering row or column marginal probabilities. Equation 2 shows the two column marginal probabilities, \hat{P}(I) and \hat{P}(II), where j = 1, 2.

\frac{n_{+j}}{n} \tag{2}
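Continuing the sketch above, the marginal probabilities follow by summing the joint probabilities along the rows or the columns of tab.

Code
round(rowSums(prop.table(tab)), 3) # row marginals: P(A) = 0.554, P(B) = 0.446
round(colSums(prop.table(tab)), 3) # column marginals: P(I) = 0.615, P(II) = 0.385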

4.3 Conditional Probabilities

What about the probability of having chosen I given that the participant is in group A? A total of 80 of the 180 participants in group A chose I. The conditional probability is \hat{P}(I \vert A) = 80 \div 180 = 0.444.

We have to remember that these are all values for a sample taken from a population and all three types of probabilities are probability estimates.
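The conditional probabilities fit into the same sketch: prop.table with margin = 1 divides each cell by its row total, so each row holds the probabilities of the choices given the group.

Code
round(prop.table(tab, margin = 1), 3) # P(I | A) = 0.444, P(II | A) = 0.556, and so on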

4.4 Example

In the code cell below, we construct observed data for a simulated study. There are 40 participants. The explanatory variable is Group and the participants are divided equally into groups A and B. Participants in group A receive a novel therapy to alleviate a specific symptom and those in group B receive a placebo. Researchers measure whether they improve (a success for this binary response variable, encoded as Yes) or not (a failure, encoded as No).

We create the simulated data and display the results as a table.

Code
set.seed(12)           # Reproducibility

explanatory <- rep(    # Repeat the group labels
  c("A", "B"),         # The two groups
  each = 20            # 20 participants per group
)

response <- sample(    # Random sampling
  c("No", "Yes"),      # The two outcome levels
  size = 40,           # One value per participant
  replace = TRUE       # Sample with replacement
)

data <- data.frame(    # Build a data frame of the simulated study
  Group = explanatory, # Explanatory variable
  Outcome = response   # Response variable
)

xtabs(                 # Table of observed values
  ~Group + Outcome,    # Formula interface
  data = data
)
     Outcome
Group No Yes
    A  7  13
    B 10  10

We use the crosstable function from the crosstable library to construct a more elegant table of observed values, together with marginal totals, shown in Table 7. Note that since we did not use the factor function to set the order of the levels, the crosstable function uses the \bar{D} level (the No) in the first column.

Code
crosstable::crosstable(
  data,
  Group,
  by=Outcome,
  total = "both",
  percent_pattern="{n}"
) %>% 
  crosstable::as_flextable(
    keep_id = FALSE
  )
Table 7: Using the crosstable package
label variable Outcome: No Outcome: Yes Total
Group A 7 13 20 (50.00%)
Group B 10 10 20 (50.00%)
Total 17 (42.50%) 23 (57.50%) 40 (100.00%)

The CrossTable function from the gmodels library provides the joint, marginal, and conditional probabilities.

Code
gmodels::CrossTable(
  data$Group,
  data$Outcome,
  chisq = FALSE,
  prop.chisq = FALSE,
  prop.c = TRUE,
  prop.r = TRUE
)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  40 

 
             | data$Outcome 
  data$Group |        No |       Yes | Row Total | 
-------------|-----------|-----------|-----------|
           A |         7 |        13 |        20 | 
             |     0.350 |     0.650 |     0.500 | 
             |     0.412 |     0.565 |           | 
             |     0.175 |     0.325 |           | 
-------------|-----------|-----------|-----------|
           B |        10 |        10 |        20 | 
             |     0.500 |     0.500 |     0.500 | 
             |     0.588 |     0.435 |           | 
             |     0.250 |     0.250 |           | 
-------------|-----------|-----------|-----------|
Column Total |        17 |        23 |        40 | 
             |     0.425 |     0.575 |           | 
-------------|-----------|-----------|-----------|

 

Note that the order of the response variable is once again left as the default. Also, using the Cell Contents output as a key, note that each cell displays the following values, as determined by the function arguments:

  1. Observed frequency

  2. Conditional probability given the explanatory variable level (row proportion)

  3. Conditional probability given the response variable level (column proportion)

  4. Joint probability

From the table, we read the following joint probabilities.

  • P(\text{A} \cap \text{no improvement}) = 0.175
  • P(\text{A} \cap \text{improvement}) = 0.325
  • P(\text{B} \cap \text{no improvement}) = 0.25
  • P(\text{B} \cap \text{improvement}) = 0.25

We read the following marginal probabilities from the table.

  • P(\text{A}) = 0.5
  • P(\text{B}) = 0.5
  • P(\text{no improvement}) = 0.425
  • P(\text{improvement}) = 0.575

Finally, we read the following conditional probabilities from the table.

  • P(\text{no improvement} \vert \text{A}) = 7 \div 20 = 0.35
  • P(\text{improvement} \vert \text{A}) = 13 \div 20 = 0.65
  • P(\text{no improvement} \vert \text{B}) = 10 \div 20 = 0.5
  • P(\text{improvement} \vert \text{B}) = 10 \div 20 = 0.5

Note from the study design that, because the group allocation was fixed, it does not make sense to consider the probabilities of group allocation given the outcomes. We will revisit this discussion in a later section when we consider risks and ratios in terms of study design.

5 Measures of Association for a 2 \times 2 Contingency Table

We can derive and understand the values that we calculate for risks and ratios from a 2 \times 2 table of observed values. There are three commonly used measures. We use them to quantify the magnitude and direction of association between two binary variables. The measures are risk difference, risk ratio (relative risk), and the odds ratio.

5.1 Risk Difference

We define the risk difference (RD) in Equation 3: the risk of success, D, given exposure, E, minus the risk of success, D, given non-exposure, \bar{E}. To be clear, our interest is in the disease outcome, and we calculate the difference between two conditional probabilities: disease given exposure and disease given non-exposure.

\begin{align} &\text{Definition: } \text{RD} = P(D \vert E) - P(D \vert \bar{E}) \\ &\text{Estimate: } \widehat{\text{RD}} = \hat{P}(D \vert E) - \hat{P}(D \vert \bar{E}) \end{align} \tag{3}

Note that our interest lies with the success level, D, of the response variable.

As an illustrative example, consider a trial in which participants were randomized to receive either aspirin or a placebo and were followed for the development of a myocardial infarction (MI). The two levels of the explanatory variable are independent: each participant took only one of either aspirin or placebo. Of the n_{1+} = 11037 participants in the aspirin group, 104 developed a myocardial infarction, for p_{1} = 104 \div 11037 \approx 0.0094, as shown in Equation 4. Of the n_{2+} = 11034 participants in the placebo group, 189 developed a myocardial infarction. The conditional probability of a myocardial infarction in this group was p_{2} = 189 \div 11034 \approx 0.0171, as shown in Equation 5.

p_{1} = \hat{P}(D \, \vert \, E) = \frac{n_{11}}{n_{1+}} = \frac{104}{11037} \approx 0.0094 \tag{4}

p_{2} = \hat{P}(D \, \vert \bar{E}) = \frac{n_{21}}{n_{2+}} = \frac{189}{11034} \approx 0.0171 \tag{5}

We state that the risk of the development of a myocardial infarction in the placebo group was 0.0171 and the risk of the development of a myocardial infarction in the treatment (aspirin) group was 0.0094.

The difference in the risk of participants who developed a myocardial infarction between the two groups is \Delta p = 0.0094 - 0.0171 = -0.0077 \in [-1, 1]. This difference is termed the risk difference.

We can interpret the result by stating that the risk (chance or probability) of developing an MI in the aspirin group is 0.77 percentage points lower than in the placebo group.
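A short sketch confirms the arithmetic in Equation 4, Equation 5, and the risk difference.

Code
p_aspirin <- 104 / 11037 # P(D | E), Equation 4
p_placebo <- 189 / 11034 # P(D | E-bar), Equation 5

p_aspirin - p_placebo    # risk difference, approximately -0.0077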

The RD has an interval -1 \le \text{RD} \le 1. The interval boundary values can be calculated from the extreme summary results in Table 8 and Table 9.

Table 8: Risk Difference Extreme Case
RD=-1 D \bar{D} Total
E 0 20 20
\bar{E} 20 0 20
Total 20 20 40

The risk difference is shown in Equation 6.

P(D \, \vert E) - P \left( D \, \vert \bar{E} \right) = \frac{0}{20} - \frac{20}{20} = -1 \tag{6}

Table 9: Risk Difference Extreme Case
RD=1 D \bar{D} Total
E 20 0 20
\bar{E} 0 20 20
Total 20 20 40

The risk difference in this case is shown in Equation 7.

P(D \, \vert E) - P \left( D \, \vert \bar{E} \right) = \frac{20}{20} - \frac{0}{20} = 1 \tag{7}

5.2 Relative Risk

While considering the difference in risk, we have to take cognizance of the number of cases and hence the proportions. A \Delta p of 0.009 is the same when we subtract 0.010 - 0.001 as when we subtract 0.410 - 0.401, yet the proportions are very different.

To mitigate the differences, we consider relative risk or risk ratio (RR), shown in Equation 8.

\begin{align} \text{Definition: } &\text{RR} = \frac{P(D \vert E)}{P(D \vert \bar{E})} \\ \text{Estimate: } &\widehat{\text{RR}} = \frac{\hat{P}(D \vert E)}{\hat{P}(D \vert \bar{E})} \end{align} \tag{8}

For the two pairs of proportions mentioned above, the RR makes the distinction clear: 0.010 \div 0.001 = 10, while 0.410 \div 0.401 \approx 1.02, even though the risk differences are equal. For our aspirin example, the estimated RR is shown in Equation 9.

\widehat{\text{RR}} = \frac{\frac{104}{11037}}{\frac{189}{11034}} \approx 0.5501 \tag{9}

The RR in the case of our example is 0.5501. We can interpret this as follows: the risk of an MI in the aspirin group is 0.5501 times that in the placebo group, or the risk of an MI among those taking aspirin is 1 - 0.5501 = 0.4499 \approx 45% less than that among the placebo group. (See explanation below in Thinking about lowering and increasing the measures of association).
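Reusing p_aspirin and p_placebo from the risk difference sketch above, the estimated RR follows with a single division.

Code
p_aspirin / p_placebo # estimated relative risk, approximately 0.5501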

The RR has an interval 0 \le \text{RR} < \infty. The interval boundary values can be calculated from the extreme summary results in Table 10 and Table 11.

Table 10: Relative Risk Extreme Case
RR=0 D \bar{D} Total
E 0 20 20
\bar{E} 20 0 20
Total 20 20 40

The relative risk is shown in Equation 10.

\frac{P(D \, \vert \, E)}{P \left( D \, \vert \, \bar{E} \right)} = \frac{\frac{0}{20}}{\frac{20}{20}} = 0 \tag{10}

Table 11: Relative Risk Extreme Case
RR=\infty D \bar{D} Total
E 20 0 20
\bar{E} 0 20 20
Total 20 20 40

The relative risk is shown in Equation 11.

\frac{P(D \, \vert \, E)}{P \left( D \, \vert \, \bar{E} \right)} = \frac{\frac{20}{20}}{\frac{0}{20}} = \infty \tag{11}

5.3 Odds Ratio

If the population proportion of success for a binary outcome of an experiment is \pi then the odds of success is as shown in Equation 12.

\text{odds}=\frac{\pi}{1 - \pi} \tag{12}

The sample proportion of success (a myocardial infarction) in the group taking aspirin in our illustrative example was 104 \div 11037 \approx 0.0094. This is \hat{P}(D \vert E). We calculate the odds of success in the code chunk below, assigning the result to the variable odds_aspirin.

Code
odds_aspirin <- (104 / 11037) / (1 - (104 / 11037))
odds_aspirin
[1] 0.009512485

We can do the same for the group taking placebo. We calculate the odds of success in the placebo group in the code chunk below and assign it to the variable odds_placebo. This is P(D \vert \bar{E}).

Code
odds_placebo <- (189 / 11034) / (1 - (189 / 11034))
odds_placebo
[1] 0.01742739

The odds ratio comparing the odds of success between the two groups is calculated as shown in Equation 13.

\begin{align} \text{odds ratio} = &\theta = \frac{\text{odds}_{1}}{\text{odds}_{2}} \\ \\ \text{Definition: } &\text{OR}_{\text{D}} = \frac{\frac{P(D \vert E)}{1 - P(D \vert E)}}{\frac{P(D \vert \bar{E})}{1 - P(D \vert \bar{E})}} = \frac{\frac{P(D \, \vert \, E)}{P(\bar{D} \, \vert \, E)}}{\frac{P(D \, \vert \, \bar{E})}{P(\bar{D} \, \vert \, \bar{E})}} \\ \\ \text{Estimate: } &\widehat{\text{OR}}_{\text{D}} = \frac{\frac{\hat{P}(D \vert E)}{1 - \hat{P}(D \vert E)}}{\frac{\hat{P}(D \vert \bar{E})}{1 - \hat{P}(D \vert \bar{E})}} = \frac{\frac{\hat{P}(D \, \vert \, E)}{\hat{P}(\bar{D} \, \vert \, E)}}{\frac{\hat{P}(D \, \vert \, \bar{E})}{\hat{P}(\bar{D} \, \vert \, \bar{E})}} \end{align} \tag{13}
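As a check, the estimated odds ratio for the aspirin example follows directly from the two odds computed above.

Code
odds_aspirin / odds_placebo # estimated odds ratio, approximately 0.546

Note how close this value is to the \widehat{\text{RR}} of 0.5501; the two measures agree closely here because the outcome is rare in both groups.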

5.4 Example

The data for this section are in a file named measures_assoc.csv, which we import below using the user-defined function import_dataset from Section 2.

Code
dfr <- import_dataset("measures_assoc.csv")

We create a table of observed data between Exposure and Disease.

Code
tb <- table(dfr$Exposure, dfr$Disease)
tb
    
      D ND
  E  21 19
  NE  8 32

We then pass this table to the epi.2by2 function from the epiR library to calculate the measures of association between Exposure and Disease.

Code
epiR::epi.2by2(tb)                      
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           21           19         40     52.50 (36.13 to 68.49)
Exposed -            8           32         40      20.00 (9.05 to 35.65)
Total               29           51         80     36.25 (25.79 to 47.76)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 2.62 (1.32, 5.21)
Inc odds ratio                                 4.42 (1.64, 11.93)
Attrib risk in the exposed *                   32.50 (12.67, 52.33)
Attrib fraction in the exposed (%)            61.90 (24.33, 80.82)
Attrib risk in the population *                16.25 (-0.02, 32.52)
Attrib fraction in the population (%)         44.83 (8.43, 66.76)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 9.141 Pr>chi2 = 0.002
Fisher exact test that OR = 1: Pr>chi2 = 0.005
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

The Inc risk is the incidence risk, which is simply \hat{P}(D \vert E) = 52.5%, \hat{P}(D \vert \bar{E}) = 20.0%, and \hat{P}(D) = 36.2% (expressed per 100 population units). We can also calculate the estimated odds: \widehat{\text{odds}}(D \vert E) = 21 \div 19 \approx 1.105, \widehat{\text{odds}}(D \vert \bar{E}) = 8 \div 32 = 0.250, and \widehat{\text{odds}}(D) = 29 \div 51 \approx 0.569.

Within the table titled Point estimates and 95% CIs, the Inc risk ratio is \widehat{\text{RR}} with its corresponding CI, the Inc odds ratio is \widehat{\text{OR}} with its CI, and the Attrib risk in the exposed * is \widehat{\text{RD}} with its CI (expressed per 100 population units).

6 Application of Measures of Association Based on Study Design

RD is an absolute measure of association, whereas the RR (relative risk) and OR are relative measures of association. We need to consider when each measure is appropriate.

As an example, we have the following conditional probabilities.

  • P(D \vert E) = 0.02
  • P(D \vert \bar{E}) = 0.01

We see the following measures of association.

  • \text{RD} = 0.02 - 0.01 = 0.01
  • \text{RR} = 0.02 \div 0.01 = 2
  • \text{OR} = [0.02 \div (1 - 0.02)] \div [0.01 \div (1 - 0.01)] = 2.02

If we only report the RR, we might consider the value to be large. The RD, though, shows the minor difference. The OR is approximately the same as the RR since we have small values for P(D \vert E) and P(D \vert \bar{E}).

Contrast the above to the values below.

  • P(D \vert E) = 0.99
  • P(D \vert \bar{E}) = 0.89

The measures of association are as follows.

  • \text{RD} = 0.99 - 0.89 = 0.1
  • \text{RR} = 0.99 \div 0.89 = 1.11
  • \text{OR} = [0.99 \div (1 - 0.99)] \div [0.89 \div (1 - 0.89)] \approx 12.24

The OR is quite large when compared to the RR (because P(D \vert E) and P(D \vert \bar{E}) are large). It would be useful to also state the RR or RD (if the study design allows for it).
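The contrast between the two scenarios is easy to verify in R. The helper function measures below is our own illustrative construction (not from any library); it returns all three measures for a pair of conditional probabilities.

Code
# All three measures of association, given p1 = P(D | E) and p2 = P(D | E-bar)
measures <- function(p1, p2) {
  c(RD = p1 - p2,
    RR = p1 / p2,
    OR = (p1 / (1 - p1)) / (p2 / (1 - p2)))
}

round(measures(0.02, 0.01), 3) # RD = 0.010, RR = 2.000, OR = 2.020
round(measures(0.99, 0.89), 3) # RD = 0.100, RR = 1.112, OR = 12.236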

In the next sections, we explore case-control, cohort, and cross-sectional study designs to determine which risks and ratios can be used in each study design.

6.1 Case-control Design

A case-control study design is a form of retrospective study design, where participants are selected according to the levels of the outcome variable (the cases and the controls). We then consider events, such as exposures, that took place prior to selection.

In the table below, Fixed Outcomes, n_{+1} and n_{+2} are fixed by design.

Fixed Outcomes
D \bar{D}
E n_{11} n_{12}
\bar{E} n_{21} n_{22}
Total n_{+1} n_{+2}

We have that n_{11} and n_{12} are random and can be described by the binomial distribution, shown in Equation 14.

\begin{align} &n_{11} \sim \text{Bin}(n_{+1} , P(E \vert D)) \\ &n_{12} \sim \text{Bin}(n_{+2} , P(E \vert \bar{D})) \end{align} \tag{14}

We can only use measures of association based on P(E \vert D) and P(E \vert \bar{D}), that is to say, odds ratios (either for the exposure or the disease, since they are equal). If the disease is rare, then we can use the odds ratio to estimate the relative risk.

6.1.1 Example Scenario

Consider a study that investigates surgical site sepsis following colonic resection (removal of the large bowel). Surgical site sepsis is any infection of tissues impacted by surgical incisions and interventions.

As an example, we imagine that researchers investigated 100 participants. A total of 33 had a surgical infection and 67 did not. The proportions for this variable are fixed by the study design.

The researchers want to investigate the role of prophylactic antibiotics on the development of surgical site sepsis. Prophylactic antibiotics are given intravenously just prior to surgery. It penetrates the tissues and the aim is for a high enough concentration of the antibiotic in the tissue when incisions are made, stopping the proliferation of bacteria and preventing infection.

In this simulated study, participants were given a novel or a regimented antibiotic as prophylactic agent and the choice between these was up to the attending surgeon (which for the sake of this example, we shall take as random).

The researchers find the results in Table 12.

Table 12: Results Table
Group Infection No infection Total
Novel 5 30 35
Control 28 37 65
Total 33 67 100

Generally, the table above would be deemed incorrect. In a case-control design, the researchers fix the number of cases for the explanatory variable. In this design, the presence or absence of surgical site sepsis (infection) is the explanatory variable. We write the levels of this variable along the left-hand side (the rows), shown in Table 13.

Table 13: Standard Form
Group Novel Control Total
Infection 5 28 33
No infection 30 37 67
Total 35 65 100

The probability estimates are shown in Table 14.

Table 14: Probability Estimates
Group Novel Control Total
Infection 0.05 0.28 0.33
No infection 0.30 0.37 0.67
Total 0.35 0.65 1.0

We cannot express a probability estimate for the development of infection in the larger population, as the number of cases in each group was fixed. In other words, we cannot say that the estimated probability of infection in people who have the type of surgery involved in this study is 33%. We can, though, consider the marginal probabilities of receiving the new drug or the regimented drug. These probability estimates are the marginal probabilities 0.35 and 0.65. Of more interest are the estimated conditional probabilities. Given that an infection occurred (33 cases), the estimated probability of having received the new (novel) drug is 5 \div 33 \approx 0.15 = 15% and of having received the regimented (control) drug is 28 \div 33 \approx 0.85 = 85%. If no infection occurred, the estimated probability of having received the new drug is 30 \div 67 \approx 0.45 = 45% and of having received the regimented drug is 37 \div 67 \approx 0.55 = 55%.
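These conditional probabilities are quick to verify in R; the sketch below rebuilds Table 13 and conditions on the rows.

Code
# Table 13: infection status (rows, fixed by design) by drug received
cc <- matrix(
  c(5, 28,
    30, 37),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Infection = c("Yes", "No"), Drug = c("Novel", "Control"))
)

round(prop.table(cc, margin = 1), 2) # P(drug | infection status), rows sum to 1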

6.2 Cohort Design or Randomized Controlled Trial

A cohort study design and a randomized controlled trial are examples of a prospective study design. The levels of the explanatory variable divide the cohort or trial participants into two groups. These groups are followed over time and results are measured for the outcome variable.

In Table 15, n_{1+} and n_{2+} are fixed.

Table 15: Generic Table of Observed Values
D \bar{D} Total
E n_{11} n_{12} n_{1+}
\bar{E} n_{21} n_{22} n_{2+}

Now n_{11} and n_{21} are random variables, described as shown in Equation 15.

\begin{align} &n_{11} \sim \text{Bin}(n_{1+} , P(D \vert E)) \\ &n_{21} \sim \text{Bin}(n_{2+} , P(D \vert \bar{E})) \end{align} \tag{15}

We can use measures of association based on P(D \vert E) and P(D \vert \bar{E}), so the risk difference, relative risk, and odds ratio are all available.

6.2.1 Example Scenario

As an example, we consider a much better study than the case-control study above. Researchers compare those who are given a new prophylactic antibiotic to those who received a regimented prophylactic antibiotic prior to surgery. One week after the surgery, they measure the outcome as whether surgical site sepsis (infection) occurred (response variable). The results are as shown in Table 16.

Table 16: Data Comparing Novel and Control Groups
Group Infection No infection Total
Novel 12 38 50
Control 15 35 50
Total 27 73 100

The table of estimated probabilities is Table 17.

Table 17: Relative Frequencies
Group Infection No infection Total
Novel 0.12 0.38 0.5
Control 0.15 0.35 0.5
Total 0.27 0.73 1.0

From this design, we can state that the estimated probability of infection is 27 \div 100 = 0.27 = 27% and for not developing infection is 73 \div 100 = 0.73 = 73%. These are the marginal probabilities along the bottom. We can also state that given the new drug, the estimated probability of an infection is 12 \div 50 =0.24 = 24% and that given the regimented drug, the estimated probability is 15/50 = 0.3 = 30%. These are conditional probabilities. We conclude that the estimated probability of surgical site sepsis is lower for the new prophylactic antibiotic.
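The same check applies here; the sketch below rebuilds Table 16 and conditions on the rows to give the infection risk per group.

Code
# Table 16: group (rows, fixed by design) by infection status
cohort <- matrix(
  c(12, 38,
    15, 35),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("Novel", "Control"), Infection = c("Yes", "No"))
)

round(prop.table(cohort, margin = 1), 2) # P(infection status | group), rows sum to 1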

6.3 Cross-sectional Design

In a cross-sectional study design, we measure values for variables from a sample at one time.

All the cell frequencies in the Table 18 are now random.

Table 18: Generic Table
D \bar{D} Total
E n_{11} n_{12} n_{1+}
\bar{E} n_{21} n_{22} n_{2+}
Total n_{+1} n_{+2} n

All are described by the multinomial distribution, shown in Equation 16.

(n_{11}, n_{12}, n_{21}, n_{22}) \sim \text{Mult}(n, P(D \cap E) , P(\bar{D} \cap E) , P(D \cap \bar{E}) , P(\bar{D} \cap \bar{E})) \tag{16}

All measures of association can be used.

6.3.1 Example Scenario

As an example, we may have a list of patients who visited a local clinic that opened two calendar years ago. We randomly select 100 people to contact over a two-week period and ask them to answer a few survey questions. We record the values in each case. Note that although the data collection took two weeks, the data are a snapshot, with frequencies for the two variables calculated at one point in time.

Imagine then that among other questions, we asked them whether their last visit was in the first year or in the second year (with levels Second and First) and if they were satisfied with the waiting period (levels Yes and No). The results are shown in Table 19.

Table 19: Example Data
Year Yes No Total
Second 24 23 47
First 20 33 53
Total 44 56 100

The estimated probabilities are in Table 20.

Table 20: Estimated Probabilities
Year Yes No Total
Second 0.24 0.23 0.47
First 0.2 0.33 0.53
Total 0.44 0.56 1.0

In this case, neither variable is fixed. We can express the estimated probability of a last visit being in the second year as 0.47 and in the first year as 0.53. We can also state that the probability of being satisfied with the waiting times is 0.44 and of not being satisfied is 0.56. In all these cases, we are considering marginal probabilities.

We can also consider all eight conditional probabilities, which the sketch after this list verifies.

  • \hat{P}(\text{satisfied} \, \vert \, \text{second year}) = 24 \div 47 \approx 0.51
  • \hat{P}(\text{not satisfied} \, \vert \, \text{second year}) = 23 \div 47 \approx 0.49
  • \hat{P}(\text{satisfied} \, \vert \, \text{first year}) = 20 \div 53 \approx 0.38
  • \hat{P}(\text{not satisfied} \, \vert \, \text{first year}) = 33 \div 53 \approx 0.62
  • \hat{P}(\text{second year} \, \vert \, \text{satisfied}) = 24 \div 44 \approx 0.55
  • \hat{P}(\text{second year} \, \vert \, \text{not satisfied}) = 23 \div 56 \approx 0.41
  • \hat{P}(\text{first year} \, \vert \, \text{satisfied}) = 20 \div 44 \approx 0.45
  • \hat{P}(\text{first year} \, \vert \, \text{not satisfied}) = 33 \div 56 \approx 0.59
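Since neither margin is fixed, both directions of conditioning are meaningful, and prop.table computes each set at once: margin = 1 conditions on the year, margin = 2 on satisfaction.

Code
# Table 19: year of last visit by satisfaction with the waiting period
cs <- matrix(
  c(24, 23,
    20, 33),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Year = c("Second", "First"), Satisfied = c("Yes", "No"))
)

round(prop.table(cs, margin = 1), 2) # P(satisfaction | year)
round(prop.table(cs, margin = 2), 2) # P(year | satisfaction)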

Next, we consider confidence intervals (CIs) for the measures of association.

7 Confidence Intervals for Measures of Association

Confidence intervals are useful for estimation of population parameters, precision assessment of estimates, and hypothesis testing.

7.1 CI for Risk Difference

To calculate the CI for risk difference (RD), we use the two conditional probabilities in Equation 17.

\begin{align} &p_{1} = \hat{P}(D \vert E) = \frac{n_{11}}{n_{1+}} \\ \\ &p_{2} = \hat{P}(D \vert \bar{E}) = \frac{n_{21}}{n_{2+}} \end{align} \tag{17}

Now we can calculate the 100(1 - \alpha)% CI for RD, as shown in Equation 18.

(p_{1} - p_{2}) \pm z_{\frac{\alpha}{2}} \sqrt{\frac{p_{1} (1 - p_{1})}{n_{1+}} + \frac{p_{2} (1 - p_{2})}{n_{2+}}} \tag{18}

We use the normal approximation to the binomial distribution for n_{11} and n_{21} and we check the following assumptions.

  • n_{1+} p_{1} \ge 5 and n_{1+}(1 - p_{1}) \ge 5 or then n_{11} \ge 5 and n_{12} \ge 5
  • n_{2+} p_{2} \ge 5 and n_{2+}(1 - p_{2}) \ge 5 or then n_{21} \ge 5 and n_{22} \ge 5

This means that all four cell frequencies should be at least 5.

7.1.1 Example

As an example, we consider the results from a study in Table 21.

Table 21: Study Data
D \bar{D} Total
E 21 19 40
\bar{E} 8 32 40
Total 29 51 80

We substitute the values in the table into Equation 18.

Code
p1 <- 21 / 40
p2 <- 8 / 40
z_a2 <- 1.96

lower <- (p1 - p2) - (z_a2 * sqrt(((p1 * (1 - p1)) / (40)) + ((p2 * (1 - p2)) / (40))))
upper <- (p1 - p2) + (z_a2 * sqrt(((p1 * (1 - p1)) / (40)) + ((p2 * (1 - p2)) / (40))))

lower
[1] 0.1267164
Code
upper
[1] 0.5232836

We are 95% confident that the true RD is between 12.67 and 52.33 percentage points.

We could also state that we are 95% confident that the true risk of disease in the exposure group is between 12.67 and 52.33 percentage points greater than the risk of disease in the non-exposure group.

We note that the null value for RD, which is 0, is not in the interval.

7.2 CI for Relative Risk

The sampling distribution of the estimated relative risk is not normal, so we cannot use the normal approximation directly to calculate the confidence interval for the relative risk.

In the code cell below, we simulate 1000 studies so that we can visualize the sampling distribution of relative risk. We use the following parameters.

  • n_{11} \sim \text{Bin}(100, P(D \vert E) = 0.3)
  • n_{21} \sim \text{Bin}(100, P(D \vert \bar{E}) = 0.2)
Code
n_11 <- rbinom(1000, 100, 0.3) # Simulated counts of D in the exposed group
n_21 <- rbinom(1000, 100, 0.2) # Simulated counts of D in the non-exposed group

rr <- n_11 / n_21              # Estimated RR (both group sizes are 100)
Code
hist(
  rr,
  main = "Simulated sampling distribution of estimated relative risk",
  xlab = "Estimated relatve risk",
  ylab = "Frequency",
  col = "#4197D9",
  las = 1,
  panel.first = grid()
)

Figure 1: Histogram of Simulated Values

Code
hist(
  log(rr),
  main = "Simulated sampling distribution of logarithm of estimated relative risk",
  xlab = "Logarithm of estimated relatve risk",
  ylab = "Frequency",
  col = "#98FB98",
  las = 1,
  panel.first = grid()
)

Figure 2: Logarithm Transformation

The histogram in Figure 1 shows the distribution of the simulated values. The sampling distribution is right skewed. In Figure 2, we use the natural logarithm transformation and plot a histogram of the sampling distribution of \log{\widehat{\text{RR}}}.

Using Equation 17 for p_{1} and p_{2}, we show the equation for the confidence interval of \log{\widehat{\text{RR}}} in Equation 19.

\log{\left( \frac{p1}{p2} \right)} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{1 - p_{1}}{n_{11}} + \frac{1 - p_{2}}{n_{21}}} \tag{19}

7.2.1 Example

The 95% CI with z_{\frac{\alpha}{2}} = 1.96 for \log{\widehat{\text{RR}}} is calculated below, substituting the values from our table of observed values, Table 21.

Code
lower <- log(p1 / p2) - z_a2 * sqrt(((1 - p1) / 21) + ((1 - p2) / 8))
upper <- log(p1 / p2) + z_a2 * sqrt(((1 - p1) / 21) + ((1 - p2) / 8))
lower
[1] 0.2787476
Code
upper
[1] 1.651414

To calculate the CI for the estimated relative risk from our table of observed values, we need to exponentiate each value (that is, take the anti-log). In Equation 20, we see the anti-log for our example values.

(e^{0.2787476} , e^{1.651414}) \tag{20}

Below, we calculate the estimated relative risk and corresponding lower and upper bounds of the 95% CI.

Code
p1 / p2       
[1] 2.625
Code
round(exp(lower), digits = 3) 
[1] 1.321
Code
round(exp(upper), digits = 3)  
[1] 5.214

The \widehat{\text{RR}}=2.625 with 95% CI of 1.321 to 5.214. We can state that we are 95% confident that the true population RR is in this interval. Alternatively we state that we are 95% confident that the true risk of disease in the exposure group is between 1.32 - 1 = 0.32 = 32% and 5.214 - 1 = 4.214 = 421% greater than the risk of disease in the non-exposure group.

We performed a natural logarithm transformation (log transformation) to use the normal approximation to the binomial distribution. We still need to make sure that the assumptions for the use of this approximation are met, that is, n_{ij} \ge 5 for all cells (which holds in our case).

Note also that the null value for the relative risk (which is 1) is not within the interval.

7.3 CI for Odds Ratios

The sampling distribution of the odds ratio (OR) is also not normally distributed and we use the log-transformation.

The equation for the 100(1-\alpha)% CI for the log OR is shown in Equation 21.

\log{\left( \frac{\frac{p_{1}}{1 - p_{1}}}{\frac{p_{2}}{1 - p_{2}}} \right)} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \tag{21}

7.3.1 Example

The estimated CI for the log \widehat{\text{OR}}, is calculated below, substituting the values from our table of observed values from Table 21.

Code
log_odds_ratio <- log((p1 / (1 - p1)) / (p2 / (1 - p2)))
log_odds_ratio
[1] 1.486378

Below, we calculate the estimated OR and corresponding lower and upper of the 95% CI.

Code
round((p1 / (1 - p1)) / (p2 / (1 - p2)), digits = 3)
[1] 4.421
Code
se_or <- sqrt((1 / 21) + (1 / 19) + (1 / 8) + (1 / 32))
lower <- log_odds_ratio - z_a2 * se_or
upper <- log_odds_ratio + z_a2 * se_or

round(exp(lower), digits = 3)
[1] 1.638
Code
round(exp(upper), digits = 3)
[1] 11.93

The \widehat{\text{OR}}=4.421 with 95% CI of 1.638 to 11.93. We can state that we are 95% confident that the true population OR is in this interval. Alternatively, we state that we are 95% confident that the odds of disease in the exposure group are between 1.638 - 1 = 0.638 = 63.8% and 11.93 - 1 = 10.93 = 1093% greater than the odds of disease in the non-exposure group.

We performed a natural logarithm transformation (log transformation) to use the normal approximation to the binomial distribution. We still need to make sure that the assumptions for the use of this approximation are met. That is n_{ij} \ge 5, which is true for this case.

Note also that the null value for the OR (which is 1) is not within the interval. There is a positive association between exposure and disease.

8 Hypothesis Testing for the Measures of Association: Pearson’s \chi^{2} Approach

8.1 Null Values

Our first task is to develop values for the null hypothesis for measures of association.

Consider the table below, where we have two binary variables exposure (with levels E and \bar{E}) and disease (with levels D and \bar{D}). In Table 22, we express the joint probabilities \pi_{ij}, as well as the marginal probabilities.

Table 22: Table of Probabilities
D \bar{D} Total
E \pi_{11} \pi_{12} \pi_{1+}
\bar{E} \pi_{21} \pi_{22} \pi_{2+}
Total \pi_{+1} \pi_{+2} 1.0

We remember from probability theory that two events, A and B, are independent if and only if P(A \cap B) = P(A) P(B). The two binary variables are then independent if and only if the following hold.

  • P(D \cap E) = P(D) P(E) \Rightarrow \pi_{11} = \pi_{1+} \pi_{+1}
  • P(\bar{D} \cap E) = P(\bar{D}) P(E) \Rightarrow \pi_{12} = \pi_{1+} \pi_{+2}
  • P(D \cap \bar{E}) = P(D) P(\bar{E}) \Rightarrow \pi_{21} = \pi_{2+} \pi_{+1}
  • P(\bar{D} \cap \bar{E}) = P(\bar{D}) P(\bar{E}) \Rightarrow \pi_{22} = \pi_{2+} \pi_{+2}

In all cases, the joint probability being equal to the product of the marginal probabilities indicates independence.

This independence has implications for the measures of association, shown in Equation 23 and Equation 24, where we use Equation 22 from probability theory.

P(A \vert B) = \frac{P(A \cap B)}{P(B)} \tag{22}

\begin{align} \text{RD} = P(D \vert E) - P(D \vert \bar{E}) &= \frac{P(D \cap E)}{P(E)} - \frac{P(D \cap \bar{E})}{P(\bar{E})} \\ &= \frac{\pi_{1+} \pi_{+1}}{\pi_{1+}} - \frac{\pi_{2+} \pi_{+1}}{\pi_{2+}} \\ &= \pi_{+1} - \pi_{+1} \\ &= 0 \end{align} \tag{23}

\text{RR} = \frac{P(D \vert E)}{P(D \vert \bar{E})} = \frac{\frac{\pi_{1+} \pi_{+1}}{\pi_{1+}}}{\frac{\pi_{2+}\pi_{+1}}{\pi_{2+}}} = 1 \tag{24}

The null value for the OR is the same, shown in Equation 25.

\begin{align} &\text{OR} = 1 \end{align} \tag{25}

8.2 Null and Alternative Hypotheses

In the section below, we state the null hypothesis and the alternative hypothesis in words, statistical terms, and in values.

  • In words,

    • H_{0}: Outcome level and exposure level are independent (not associated)

    • H_{1}: Outcome level and exposure level are not independent (are associated)

  • In statistical terms,

    • H_{0}: \; \pi_{ij} = \pi_{i+} \pi_{+j} (joint probability equals product of its marginal probabilities)

    • H_{1}: \; \pi_{ij} \ne \pi_{i+} \pi_{+j}

  • In values by measure of association,

    • RD: H_{0}: \text{RD} = 0 \quad \text{vs. } H_{1}: \text{RD} \ne 0

    • OR: H_{0}: \text{OR} = 1 \quad \text{vs. } H_{1}: \text{OR} \ne 1

    • RR: H_{0}: \text{RR} = 1 \quad \text{vs. } H_{1}: \text{RR} \ne 1

8.3 Pearson’s \chi^{2} Test

The Pearson \chi^{2} test for independence compares the observed frequencies with the frequencies that would be expected if the null hypothesis was true.

We have Table 23 of observed values.

Table 23: Generic Table of Observed Values
D \bar{D} Total
E n_{11} n_{12} n_{1+}
\bar{E} n_{21} n_{22} n_{2+}
Total n_{+1} n_{+2} n

Table 23 can be rewritten, as shown in Table 24, using O_{ij} for the observed frequencies.

Table 24: Table of Observed Values
D \bar{D} Total
E O_{11} O_{12} n_{1+}
\bar{E} O_{21} O_{22} n_{2+}
Total n_{+1} n_{+2} n

The aim is to calculate the expected values under the null hypothesis. We use the notation in Table 25, indicating the table of expected values.

Table 25: Expected Values
D \bar{D} Total
E E_{11} E_{12} n_{1+}
\bar{E} E_{21} E_{22} n_{2+}
Total n_{+1} n_{+2} n

If the null hypothesis were true, then disease and exposure would be independent variables and the statements below would hold (as we remember from the probability theory mentioned before).

  • P(D \cap E) = P(D) P(E)
  • P(\bar{D} \cap E) = P(\bar{D}) P(E)
  • P(D \cap \bar{E}) = P(D) P(\bar{E})
  • P(\bar{D} \cap \bar{E}) = P(\bar{D}) P(\bar{E})

Under the assumption of the null hypothesis, we can then estimate the joint probabilities as in Equation 26, Equation 27, Equation 28, and Equation 29.

  • The expected value E_{11}

\begin{align} &\hat{P}(D \cap E) = \frac{n_{11}}{n} = \hat{P}(D) \hat{P}(E) = \frac{n_{+1}}{n} \frac{n_{1+}}{n} \\ &\frac{n_{11}}{n} = \frac{n_{+1}}{n} \frac{n_{1+}}{n} \\ &n_{11} = \frac{n_{+1} n_{1+}}{n}, \text{ the expected value } E_{11} \end{align} \tag{26}

  • The expected value E_{12}

\begin{align} &\hat{P}(\bar{D} \cap E) = \frac{n_{12}}{n} = \hat{P}(\bar{D}) \hat{P}(E) = \frac{n_{+2}}{n} \frac{n_{1+}}{n} \\ &\frac{n_{12}}{n} = \frac{n_{+2}}{n} \frac{n_{1+}}{n} \\ &n_{12} = \frac{n_{+2} n_{1+}}{n}, \text{ the expected value } E_{12} \end{align} \tag{27}

  • The expected value E_{21}

\begin{align} &\hat{P}(D \cap \bar{E}) = \frac{n_{21}}{n} = \hat{P}(D) \hat{P}(\bar{E}) = \frac{n_{+1}}{n} \frac{n_{2+}}{n} \\ &\frac{n_{21}}{n} = \frac{n_{+1}}{n} \frac{n_{2+}}{n} \\ &n_{21} = \frac{n_{+1} n_{2+}}{n}, \text{ the expected value } E_{21} \end{align} \tag{28}

  • The expected value E_{22}

\begin{align} &\hat{P}(\bar{D} \cap \bar{E}) = \frac{n_{22}}{n} = \hat{P}(\bar{D}) \hat{P}(\bar{E}) = \frac{n_{+2}}{n} \frac{n_{2+}}{n} \\ &\frac{n_{22}}{n} = \frac{n_{+2}}{n} \frac{n_{2+}}{n} \\ &n_{22} = \frac{n_{+2} n_{2+}}{n}, \text{ the expected value } E_{22} \end{align} \tag{29}

If E is the expected frequency under the null hypothesis (not used here to indicate exposure), then we have the expected cell frequencies derived in Equation 30.

\begin{align} &E_{ij} = \frac{n_{i+}}{n} \frac{n_{+j}}{n} \times n \\ &E_{ij} = \frac{n_{i+} n_{+j}}{n} \end{align} \tag{30}

Any expected cell frequency, E_{ij}, under the null hypothesis is the product of its (row and column) marginal totals, n_{i+} and n_{+j}, divided by the sample size, n, with the calculations summarized in Table 26.

Table 26: Expected Values
D \bar{D}
E E_{11} = \frac{n_{+1} n_{1+}}{n} E_{12} = \frac{n_{+2} n_{1+}}{n}
\bar{E} E_{21} = \frac{n_{+1} n_{2+}}{n} E_{22} = \frac{n_{+2} n_{2+}}{n}

We can use Equation 31 to calculate a test statistic (standardized by the expected values), which we denote as X^{2}.

X^{2} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left( O_{ij} - E_{ij}\right)^{2}}{E_{ij}} \tag{31}

If the null hypothesis is true, then X^{2} approximately follows a \chi^{2} distribution (X^{2} \sim \chi_{1}^{2}), that is, with a single degree of freedom. This approximation holds if each expected value is greater than or equal to 5.
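Equation 30 and Equation 31 can be verified against the study data in Table 21. The sketch below computes the table of expected values with an outer product of the marginal totals and then the X^{2} statistic.

Code
obs <- matrix(c(21, 19, 8, 32), nrow = 2, byrow = TRUE) # Table 21

# Expected values under H0: (row total x column total) / n, as in Equation 30
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected # 14.5 and 25.5 in each row

# Pearson X^2 statistic, Equation 31
sum((obs - expected)^2 / expected) # approximately 9.141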

While we have considered a 2 \times 2 table of observed values, we can also use two multilevel variables (with I levels and J levels, respectively). Provided that no more than 20% of the expected values are less than 5, we can use the approximation shown in Equation 32. (See Degrees of freedom for the \chi^{2} distribution below for more information.)

X^{2} \sim \chi_{(I-1)(J-1)}^{2} \tag{32}

8.4 Degrees of freedom for the \chi^{2} distribution

The degrees of freedom that we use for tables of observed values (and expected values) is the number of parameters under the alternative hypothesis minus the number of parameters under the null hypothesis.

Consider three random numbers. If we want them to sum to a certain total, two of the values are free (they can take on any value). This immediately fixes the third value, though (such that we still sum to the same total). In this sense, we have two degrees of freedom in this simple illustration. We say that two of the values are non-redundant.

In a table of observed values, we have I rows and J columns. Under the null hypothesis, the row marginal probabilities must sum to 1, leaving I - 1 non-redundant parameters, and similarly the column marginal probabilities leave J - 1. In total, then, we have (I-1) + (J-1) = I + J - 2 non-redundant parameters under the null hypothesis.

Under the alternative hypothesis there are no restrictions on the joint probabilities (except that they sum to 1.0). We therefore have IJ - 1 non-redundant parameters.

The difference is then as shown in Equation 33.

(IJ-1) - (I + J -2) = IJ - I - J + 1 = (I-1)(J-1) \tag{33}

8.5 The \chi^{2} Distribution

We have the following general properties of the \chi^{2} distribution given df degrees of freedom; a short simulation after this list checks the first two.

  • The mean is df
  • The variance is 2df
  • It is right-skewed, but becomes less so as df increases
  • If X \sim \text{N}(0 ,1^2) then X^{2} \sim \chi_{1}^{2}
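The simulation below draws a large sample from a \chi^{2} distribution with, say, df = 3.

Code
set.seed(12)
x <- rchisq(10000, df = 3)
mean(x) # approximately df = 3
var(x)  # approximately 2 x df = 6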

8.6 Pearson \chi^{2} Test Approach

We can use either a critical value approach or a p value approach to determine significance using the Pearson \chi^{2} test.

The critical \chi^{2}_{1, \alpha} values can be computed using the qchisq function. We do this below for \alpha = 0.1, \alpha = 0.05, and \alpha = 0.01 (all for a single degree of freedom).

alpha value 0.1

Code
qchisq(0.1, df = 1, lower.tail = FALSE) 
[1] 2.705543

alpha value 0.05

Code
qchisq(0.05, df = 1, lower.tail = FALSE) 
[1] 3.841459

alpha value 0.01

Code
qchisq(0.01, df = 1, lower.tail = FALSE) 
[1] 6.634897

In Figure 3, using the dist_chisq function from the sjPlot library, we see the area of rejection for \alpha = 0.05.

Code
sjPlot::dist_chisq(
  chi2 = NULL,
  deg.f = 1,
  p = 0.05,
  xmax = NULL,
  geom.alpha = 0.7
)

Figure 3: Area of Rejection

In the critical value approach, we reject H_{0} if X^{2} > \chi_{1, \alpha}^{2}.

If we use the p value approach, we reject H_{0} if P(X^{2} > X^{2}_{\text{observed}}) < \alpha.

8.6.1 Example

We use the measures_assoc.csv data from a previous section.

Code
dfr <- import_dataset("measures_assoc.csv")

The table function returns a table of observed values.

Code
tb <- table(dfr$Exposure, dfr$Disease)
tb
    
      D ND
  E  21 19
  NE  8 32

The null hypothesis is that the two variables are not associated (they are independent) and the alternative hypothesis is that they are associated (not independent).

The chisq.test function takes the table of observed values as argument and returns the X^{2} statistic and the p value.

Code
# Don't use continuity correction
tst <- chisq.test(tb, correct = FALSE) 
tst

    Pearson's Chi-squared test

data:  tb
X-squared = 9.1413, df = 1, p-value = 0.002499

From the p value approach, we reject the null hypothesis for \alpha = 0.05.

Since we assigned the test to a computer variable, tst, we can extract the statistic attribute.

Code
tst$statistic
X-squared 
 9.141312 
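We can reproduce the reported p value directly from this statistic using the pchisq function.

Code
pchisq(tst$statistic, df = 1, lower.tail = FALSE) # approximately 0.0025, as reported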

Now, we can use the dist_chisq function again. Unfortunately, the X^{2} value is so large that the upper-tail area beyond it is too small to see for df=1 in Figure 4.

Code
sjPlot::dist_chisq(
  chi2 = tst$statistic,
  deg.f = 1,
  p = 0.05
)

Figure 4: Large Test Statistic

Remember that the expected cell frequencies must all be 5 or more for us to use Pearson’s \chi^{2} test. The expected attribute returns the table of expected values.

Code
tst$expected
    
        D   ND
  E  14.5 25.5
  NE 14.5 25.5

8.6.1.1 Conclusion Reporting

In reporting the results of our study, we can use confidence intervals for the measures of association and/or hypothesis testing. In practice, we use both. The measures of association help us understand the direction and magnitude of the association.

In this sense, the epi.2by2 function from the epiR library returns all the information that we require.

Code
epiR::epi.2by2(tb)
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           21           19         40     52.50 (36.13 to 68.49)
Exposed -            8           32         40      20.00 (9.05 to 35.65)
Total               29           51         80     36.25 (25.79 to 47.76)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 2.62 (1.32, 5.21)
Inc odds ratio                                 4.42 (1.64, 11.93)
Attrib risk in the exposed *                   32.50 (12.67, 52.33)
Attrib fraction in the exposed (%)            61.90 (24.33, 80.82)
Attrib risk in the population *                16.25 (-0.02, 32.52)
Attrib fraction in the population (%)         44.83 (8.43, 66.76)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 9.141 Pr>chi2 = 0.002
Fisher exact test that OR = 1: Pr>chi2 = 0.005
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

8.6.2 Example

In this last section, we simulate data with an exposure variable with 3 levels and a disease variable with 2 levels.

Code
set.seed(12) # Reproducibility
exposure <- sample(
  c("Low dose", "High dose", "Placebo"),
  size = 100,
  replace = TRUE,
  prob = c(0.3, 0.4, 0.3) # Stated probabilities for each of 3 levels
)

outcome <- sample(
  c("Improve", "not improve"),
  size = 100,
  replace = TRUE,
  prob = c(0.4, 0.6)
)

dfr <- data.frame(
  Exposure = exposure,
  Disease = outcome
)

Below, we create a table of observed values.

Code
tb <- table(dfr$Exposure, dfr$Disease)
tb
           
            Improve not improve
  High dose      25          23
  Low dose       11          17
  Placebo         7          17

The Pearson \chi^{2} test is performed below.

Code
tst <- chisq.test(tb)
tst

    Pearson's Chi-squared test

data:  tb
X-squared = 3.6472, df = 2, p-value = 0.1614

We cannot express measures of association as we do not have a 2 \times 2 design.

Finally, we visualize the results in Figure 5.

Code
sjPlot::dist_chisq(
  chi2 = tst$statistic,
  deg.f = 2
)

Figure 5: Visualizing the Results of the Example

9 Epitools as an Alternative to Epi2by2

The epitools library returns the same results as the epiR library. Note that epitools requires the 2 \times 2 table of observed values to be in the order shown in Table 27.

Table 27: Format for epitools library
\bar{D} D Total
\bar{E}
E
Total

9.1 Example

We import a comma-separated values file to demonstrate the use of epitools.

Code
dfr <- import_dataset("measures_assoc.csv")

The table of observed values is shown below; it is in the format required by the epiR library.

Code
tab <-  xtabs(
  ~ Exposure + Disease,
  data = dfr
)

tab
        Disease
Exposure  D ND
      E  21 19
      NE  8 32

The data are summarized with the Desc function from the DescTools library (which also generates a mosaic plot).

Code
(Desc(tab, verbose="high", expected=TRUE))
------------------------------------------------------------------------------ 
tab (xtabs, table)

Summary: 
n: 80, rows: 2, columns: 2

Pearson's Chi-squared test (cont. adj):
  X-squared = 7.789, df = 1, p-value = 0.005256
Fisher's exact test p-value = 0.004827
                                
                 D     ND    Sum
                                
E     freq      21     19     40
      perc   26.2%  23.8%  50.0%
      p.row  52.5%  47.5%      .
      p.col  72.4%  37.3%      .
      exp   14.500 25.500      .
                                
NE    freq       8     32     40
      perc   10.0%  40.0%  50.0%
      p.row  20.0%  80.0%      .
      p.col  27.6%  62.7%      .
      exp   14.500 25.500      .
                                
Sum   freq      29     51     80
      perc   36.3%  63.7% 100.0%
      p.row      .      .      .
      p.col      .      .      .
      exp        .      .      .
                                

----------
' 95% conf. level

The data can also be summarized with a stacked bar plot.

Code
ggstatsplot::ggbarstats(
  data = dfr,
  x = Disease,
  y = Exposure,
  type = "parametric",
  bf.message = FALSE,
  results.subtitle = FALSE, 
  palette = "Blues"
)

9.1.1 riskratio.wald

The riskratio.wald function calculates the relative risk estimate and its confidence interval. Note the use of the keyword argument rev, set to both so that both the rows and columns of the table of observed values are reversed.

Code
epitools::riskratio.wald(
  tab,
  rev = "both"
)
$data
        Disease
Exposure ND  D Total
   NE    32  8    40
   E     19 21    40
   Total 51 29    80

$measure
        risk ratio with 95% C.I.
Exposure estimate   lower    upper
      NE    1.000      NA       NA
      E     2.625 1.32149 5.214283

$p.value
        two-sided
Exposure  midp.exact fisher.exact  chi.square
      NE          NA           NA          NA
      E  0.002892717  0.004827425 0.002499019

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The results show \widehat{\text{RR}} = 2.625 with a Wald 95% confidence interval of (1.32, 5.21), as with the epiR epi.2by2 function. We also see the results of hypothesis testing using exact tests and the Pearson \chi^{2} test for association.

9.1.2 riskratio

The riskratio function takes the method keyword argument and lets us choose the estimation method, such as Wald or bootstrap resampling. Below, we select wald as the method.

Code
epitools::riskratio(
  tab,
  method = "wald", 
  rev = "both"
)
$data
        Disease
Exposure ND  D Total
   NE    32  8    40
   E     19 21    40
   Total 51 29    80

$measure
        risk ratio with 95% C.I.
Exposure estimate   lower    upper
      NE    1.000      NA       NA
      E     2.625 1.32149 5.214283

$p.value
        two-sided
Exposure  midp.exact fisher.exact  chi.square
      NE          NA           NA          NA
      E  0.002892717  0.004827425 0.002499019

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

9.1.3 oddsratio.wald

The oddsratio.wald function is similar to the riskratio.wald function, but for odds ratios. Thus, the odds ratio estimate and its corresponding confidence interval are computed.

Code
epitools::oddsratio.wald(
  tab,
  rev = "both"
)
$data
        Disease
Exposure ND  D Total
   NE    32  8    40
   E     19 21    40
   Total 51 29    80

$measure
        odds ratio with 95% C.I.
Exposure estimate    lower    upper
      NE 1.000000       NA       NA
      E  4.421053 1.638427 11.92956

$p.value
        two-sided
Exposure  midp.exact fisher.exact  chi.square
      NE          NA           NA          NA
      E  0.002892717  0.004827425 0.002499019

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

10 Lab Problems

10.1 Question

The first lab problem questions are based on the following simulated data for an explanatory variable Group, with two levels I and II denoting two types of inguinal hernia repair. A total of 50 participants were randomly assigned to one of the two methods of surgical repair of these groin hernias. The response variable, Recurrence, measured whether they developed a recurrence (return of the hernia), with levels No and Yes.

The data is stored in a data frame object and assigned to the variable data.

Code
set.seed(12)

group <- rep(
  c("I", "II"),
  each = 25
)

recurrence <- sample(
  c("No", "Yes"),
  size = 50,
  replace = TRUE,
  prob = c(0.8, 0.2)
)

data <- data.frame(
  Group = group,
  Recurrence = factor(recurrence, levels = c("Yes", "No"))
)

We use a stacked bar plot to visualize the results.

Code
ggbarstats(
  data = data,
  x = Recurrence,
  y = Group,
  type = "parametric",
  bf.message = FALSE,
  results.subtitle = FALSE, 
  palette = "Blues",
  title = "Proportion of recurrence and no recurrence in two type of hernia repair",
  xlab = "Group",
  legend.title = "Recurrence",
  ggplot.component = list(ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 2)))
)

Complete the following four questions.

  1. Create a 2 \times 2 table of observed values.

  2. Calculate the joint probabilities.

  3. Calculate the marginal probabilities.

  4. Calculate the relevant conditional probabilities.

The solution uses the xtabs function.

Code
xtabs(
  ~ Group + Recurrence,
  data = data
)
     Recurrence
Group Yes No
   I    3 22
   II   7 18

The joint probabilities are calculated below.

\begin{align*} &P(\text{Group I } \cap \text{ Recurrence}) = \frac{3}{50} = 0.06 \\ &P(\text{Group I } \cap \text{ No recurrence}) = \frac{22}{50} = 0.44 \\ &P(\text{Group II } \cap \text{ Recurrence}) = \frac{7}{50} = 0.14 \\ &P(\text{Group II } \cap \text{ No recurrence}) = \frac{18}{50} = 0.36 \end{align*}
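These joint probabilities can also be computed in R by dividing each cell frequency by the grand total, for example with the prop.table function. In the sketch below the table of observed values is assigned to a hypothetical variable tab_recur.

Code
tab_recur <- xtabs(
  ~ Group + Recurrence,
  data = data
)

# Joint probabilities: each cell frequency divided by the grand total of 50
prop.table(tab_recur)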

Additionally, the CrossTable function from the gmodels library returns all the probabilities.

Code
gmodels::CrossTable(
  data$Group,
  data$Recurrence,
  chisq = FALSE,
  prop.chisq = FALSE,
  prop.c = TRUE,
  prop.r = TRUE
)

 
   Cell Contents
|-------------------------|
|                       N |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  50 

 
             | data$Recurrence 
  data$Group |       Yes |        No | Row Total | 
-------------|-----------|-----------|-----------|
           I |         3 |        22 |        25 | 
             |     0.120 |     0.880 |     0.500 | 
             |     0.300 |     0.550 |           | 
             |     0.060 |     0.440 |           | 
-------------|-----------|-----------|-----------|
          II |         7 |        18 |        25 | 
             |     0.280 |     0.720 |     0.500 | 
             |     0.700 |     0.450 |           | 
             |     0.140 |     0.360 |           | 
-------------|-----------|-----------|-----------|
Column Total |        10 |        40 |        50 | 
             |     0.200 |     0.800 |           | 
-------------|-----------|-----------|-----------|

 

See the output from the CrossTable function call in the previous solution, where the row and column totals give the marginal probabilities. These are also written out below.

\begin{align*} &P(\text{Group I}) = \frac{25}{50} = 0.5 \\ &P(\text{Group II}) = \frac{25}{50} = 0.5 \\ &P(\text{Recurrence}) = \frac{10}{50} = 0.2 \\ &P(\text{No recurrence}) = \frac{40}{50} = 0.8 \end{align*}

The conditional probabilities are shown below.

\begin{align*} &P(\text{Recurrence} \, \vert \, \text{Group I}) = \frac{3}{25} = 0.12 \\ &{P(\text{Recurrence} \, \vert \, \text{Group II})} = \frac{7}{25} = 0.28 \end{align*}
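The marginal and conditional probabilities can likewise be recovered in R from the joint probability table; the sketch below reuses the hypothetical tab_recur object created above.

Code
# Marginal probabilities for Group (rows) and Recurrence (columns)
rowSums(prop.table(tab_recur))
colSums(prop.table(tab_recur))

# Conditional probabilities of recurrence given group: each row sums to 1
prop.table(tab_recur, margin = 1)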

10.2 Question

Find an example of each study design in the published literature that contains the results of a risk or ratio analysis, and comment on these results.

There are many possible solutions in the published literature. A study for each of the four study designs reviewed in Chapter 2 is described below.

  • Cohort

    • Physical Activity and Incident Depression: A Meta-Analysis of Prospective Cohort Studies (DOI: 10.1176/appi.ajp.2018.17111194)

    • The study presented both odds ratios and relative risks to support the claim that physical activity can confer protection against emergent depression.

  • RCT

    • Randomized Trial of Metformin, Ivermectin, and Fluvoxamine for Covid-19 (DOI: 10.1056/NEJMoa2201662)

    • Odds ratios were reported to support the conclusion that none of the three medications (metformin, ivermectin, and fluvoxamine) prevents the occurrence of hypoxemia, an emergency department visit, hospitalization, or death associated with Covid-19.

  • Case-control

    • A Case-Control Study of Screening Sigmoidoscopy and Mortality from Colorectal Cancer (DOI: 10.1056/NEJM199203053261001)

    • Odds ratios were presented to demonstrate the finding that screening sigmoidoscopy can reduce mortality from cancers of the rectum and distal colon.

  • Cross-sectional

    • Prevalence of Pulmonary Embolism among Patients Hospitalized for Syncope (DOI: 10.1056/NEJMoa1602172)

    • Odds ratios were calculated from logistic regression to support the conclusion that pulmonary embolism is prevalent among patients hospitalized for a first episode of syncope.

10.3 Question

The research data for this lab pertains to the development of a cerebrovascular accident (a stroke). The data is available in a comma-separated file named stroke.csv.

Do NOT use a library function to answer the questions of problems 3.1 through 3.4.

Complete the following three questions.

  1. Import the data as a data frame object and assign it to the computer variable dfr.

  2. Verify that there is no missing data.

  3. Both \texttt{hypertension} and \texttt{stroke} are binary variables. The levels are No and Yes for each. Consider \texttt{hypertension} to be the explanatory variable, with the level Yes as E and the level No as \bar{E}. Consider Yes (the development of a stroke) as D and No (the absence of a stroke) as \bar{D}. Generate a table of observed values and assign it to the variable tab.

The code below uses the import_dataset user-defined function to import the data.

Code
dfr <- import_dataset("stroke.csv")

The is.na function is used to verify that there is no missing data.

Code
dfr %>% is.na() %>% sum()
[1] 0
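Had the sum been greater than zero, a per-column count is a quick way to locate the missing values; a brief sketch is below.

Code
# Count missing values in each column (all zeros for this data set)
colSums(is.na(dfr))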

A table of observed values is generated below.

Code
dfr$hypertension <- factor(dfr$hypertension, levels = c("Yes", "No"))
dfr$stroke <- factor(dfr$stroke, levels = c("Yes", "No"))

tab <- xtabs(
  ~ hypertension + stroke,
  data = dfr
)

tab
            stroke
hypertension  Yes   No
         Yes   66  413
         No   182 4320

10.4 Question

Answer the following five questions.

  1. Calculate the risk estimate of a stroke given the presence of hypertension.

  2. Calculate the risk estimate of a stroke given the absence of hypertension.

  3. Calculate the risk difference estimate for a stroke comparing those with and without hypertension.

  4. Calculate a 95% confidence interval for the population risk difference of a stroke comparing those with and without hypertension.

  5. Comment on your results.

The risk estimate of a stroke given the presence of hypertension is below.

Code
n1 <- 66 + 413
p1 <- 66 / n1
p1
[1] 0.1377871

The risk estimate of a stroke given the absence of hypertension is below.

Code
n2 <- 182 + 4320
p2 <- 182 / n2
p2
[1] 0.04042648

The risk difference estimate for a stroke comparing those with and without hypertension is below.

Code
rd <- p1 - p2
rd
[1] 0.09736058

The lower and upper bounds of the 95\% confidence interval are presented below.

Code
z_a2 <- qnorm(0.975)

lower <- (p1 - p2) - (z_a2 * sqrt(((p1 * (1 - p1)) / (n1)) + ((p2 * (1 - p2)) / (n2))))
upper <- (p1 - p2) + (z_a2 * sqrt(((p1 * (1 - p1)) / (n1)) + ((p2 * (1 - p2)) / (n2))))

round(lower, digits = 3)
[1] 0.066
Code
round(upper, digits = 3)
[1] 0.129

We are 95% confident that the true population risk difference for stroke, comparing those with hypertension to those without hypertension, is between 0.066 and 0.129.
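The BinomDiffCI function from the DescTools library, loaded at the start of the chapter, should confirm this manual calculation; a sketch is below, where we assume the "wald" method matches the normal-approximation interval used above.

Code
DescTools::BinomDiffCI(
  x1 = 66, n1 = 479,    # strokes and total among those with hypertension
  x2 = 182, n2 = 4502,  # strokes and total among those without hypertension
  method = "wald"       # assumed method name for the Wald interval
)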

10.5 Question

Answer the following three questions.

  1. Calculate the relative risk estimate for a stroke comparing those with and without hypertension.

  2. Calculate a 95% confidence interval for the population relative risk of a stroke comparing those with and without hypertension.

  3. Comment on your results.

The relative risk estimate for a stroke comparing those with and without hypertension is below.

Code
rr <- p1 / p2
rr
[1] 3.408337

The lower and upper bounds of the 95\% confidence interval are presented below.

Code
n11 <- 66
n21 <- 182
lower <- log(p1 / p2) - z_a2 * sqrt(((1 - p1) / n11) + ((1 - p2) / n21))
upper <- log(p1 / p2) + z_a2 * sqrt(((1 - p1) / n11) + ((1 - p2) / n21))
round(exp(lower), digits = 3)
[1] 2.614
Code
round(exp(upper), digits = 3)
[1] 4.444

We are 95% confident that the true population risk ratio for stroke comparing those with hypertension to those without hypertension is between 2.614 and 4.444.
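As in Section 9, the riskratio.wald function from the epitools library should reproduce this interval, giving \widehat{\text{RR}} = 3.41 with approximately (2.61, 4.44); a sketch is below. Since tab has the exposed row (Yes) and the disease column (Yes) first, we reverse both, as before.

Code
epitools::riskratio.wald(
  tab,
  rev = "both"  # reverse rows and columns to match the epitools orientation
)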

10.6 Question

Answer the following five questions.

  1. Calculate the odds of a stroke among those with hypertension.

  2. Calculate the odds of a stroke among those without hypertension.

  3. Calculate the odds ratio for a stroke comparing those with and without hypertension.

  4. Calculate a 95% confidence interval for the odds ratio for a stroke comparing those with and without hypertension.

  5. Comment on your results.

The odds of a stroke among those with hypertension are calculated below.

Code
odds_str_hpt <- 66 / 413
odds_str_hpt
[1] 0.1598063

The odds of a stroke among those without hypertension are calculated below.

Code
odds_str_nohpt <- 182 / 4320
odds_str_nohpt
[1] 0.04212963

The odds ratio for a stroke comparing those with and without hypertension is below.

Code
or <- odds_str_hpt / odds_str_nohpt
or
[1] 3.793204
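Equivalently, the odds ratio is the cross-product of the cell frequencies, \widehat{\text{OR}} = (n_{11} n_{22}) / (n_{12} n_{21}), which avoids computing the two odds separately; a short check is below.

Code
# Cross-product form of the odds ratio estimate
(66 * 4320) / (413 * 182)
[1] 3.793204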

The lower and upper bounds of the 95\% confidence interval are presented below.

Code
log_odds_ratio <- log((p1 / (1 - p1)) / (p2 / (1 - p2)))
se_or <- sqrt((1 / 66) + (1 / 413) + (1 / 182) + (1 / 4320))
lower <- log_odds_ratio - z_a2 * se_or
upper <- log_odds_ratio + z_a2 * se_or
round(exp(lower), digits = 3)
[1] 2.812
Code
round(exp(upper), digits = 3)
[1] 5.116

We are 95\% confident that the true population odds ratio for a stroke comparing those with and without hypertension is between 2.812 and 5.116.

10.7 Question

Answer the following question using an appropriate R function. Researchers consider the association between the presence, E = \text{Yes}, and absence, \bar{E} = \text{No}, of hypertension and the development, D, and non-development, \bar{D}, of a stroke.

  1. Calculate the risk difference and 95\% confidence interval for the risk difference pertaining to the research.

  2. Calculate the relative risk and 95\% confidence interval for the relative risk pertaining to the research.

  3. Calculate the odds ratio and 95\% confidence interval for the odds ratio pertaining to the research.

  4. Write a comment regarding the results.

We answer all the questions using the epi.2by2 function from the epiR library, applied to the contingency table tab created in Question 10.3.

Code
epiR::epi.2by2(tab)
             Outcome +    Outcome -      Total                 Inc risk *
Exposed +           66          413        479     13.78 (10.82 to 17.19)
Exposed -          182         4320       4502        4.04 (3.49 to 4.66)
Total              248         4733       4981        4.98 (4.39 to 5.62)

Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio                                 3.41 (2.61, 4.44)
Inc odds ratio                                 3.79 (2.81, 5.12)
Attrib risk in the exposed *                   9.74 (6.60, 12.88)
Attrib fraction in the exposed (%)            70.66 (61.74, 77.50)
Attrib risk in the population *                0.94 (0.10, 1.77)
Attrib fraction in the population (%)         18.80 (12.71, 24.48)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 86.743 Pr>chi2 = <0.001
Fisher exact test that OR = 1: Pr>chi2 = <0.001
 Wald confidence limits
 CI: confidence interval
 * Outcomes per 100 population units 

We see the estimated risk ratio (Inc risk ratio) is 3.41, 95\% CI 2.61 to 4.44. The estimated odds ratio (Inc odds ratio) is 3.79, 95\% CI 2.81 to 5.12. The estimated risk difference (Attrib risk in the exposed *) is 9.74, 95\% CI 6.60 to 12.88. Note that the latter is expressed as outcomes per 100 population units, not as a proportion.
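Because epi.2by2 reports risks per 100 population units, the attributable risk in the exposed of 9.74 is simply the risk difference from Question 10.4 rescaled; a one-line check using the rd variable computed there is below.

Code
# Risk difference from Question 10.4, rescaled to outcomes per 100 population units
round(rd * 100, digits = 2)
[1] 9.74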

10.8 Question

The following questions pertain to the same data set that was used in the previous questions. Patients either have hypertension, E, or do not have hypertension, \bar{E}. The presence of hypertension is our explanatory variable. The outcome or response variable levels are stroke, D, and no stroke, \bar{D}. Use a significance level of 0.05 for this problem.

The data is available in the stroke.csv file that is imported below and assigned to the computer variable stroke.

Code
stroke <- import_dataset("stroke.csv")

  1. Create a contingency table from the data and use the usual order of the levels (as used in this course). Assign the table to the variable tab.

  2. Calculate the table of expected values (without using any R functions).

  3. Can we use the \chi^{2} distribution to approximate our Pearson test statistic?

  4. Calculate the X^{2} test statistic without using an R function and assign it to the variable xs.

  5. What are the degrees of freedom for this problem?

  6. Calculate the critical \chi^{2} value for \alpha=0.05.

  7. Is the test statistic larger than the critical statistic value?

  8. Use an appropriate R function to calculate a p-value for the test statistic.

  9. What is the test conclusion?

  10. Use an appropriate R function to confirm the results of your calculations using the Pearson \chi^{2} tests for independence.

  11. Confirm the values for the table of expected values.

  12. Calculate the estimated odds ratio and a 95\% Wald confidence interval to support your test conclusion. Does the interval contain the null value \text{OR}=1?

A contingency table of observed values, with the levels ordered as used in this course, is created below.

Code
stroke$hypertension <- factor(stroke$hypertension, levels = c("Yes", "No"))
stroke$stroke <- factor(stroke$stroke, levels = c("Yes", "No"))
tab <- xtabs(
  ~ hypertension + stroke,
  data  = stroke
)

tab
            stroke
hypertension  Yes   No
         Yes   66  413
         No   182 4320
Code
gmodels::CrossTable(stroke$hypertension, stroke$stroke)

 
   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

 
Total Observations in Table:  4981 

 
                    | stroke$stroke 
stroke$hypertension |       Yes |        No | Row Total | 
--------------------|-----------|-----------|-----------|
                Yes |        66 |       413 |       479 | 
                    |    74.498 |     3.904 |           | 
                    |     0.138 |     0.862 |     0.096 | 
                    |     0.266 |     0.087 |           | 
                    |     0.013 |     0.083 |           | 
--------------------|-----------|-----------|-----------|
                 No |       182 |      4320 |      4502 | 
                    |     7.926 |     0.415 |           | 
                    |     0.040 |     0.960 |     0.904 | 
                    |     0.734 |     0.913 |           | 
                    |     0.037 |     0.867 |           | 
--------------------|-----------|-----------|-----------|
       Column Total |       248 |      4733 |      4981 | 
                    |     0.050 |     0.950 |           | 
--------------------|-----------|-----------|-----------|

 

Each of the expected values is calculated manually below, without using R package functions.

Code
n1plus <- 66 + 413
n2plus <- 182 + 4320
nplus1 <- 66 + 182
nplus2 <- 413 + 4320
n <- n1plus + n2plus
e11 <- nplus1 * n1plus / n
e12 <- nplus2 * n1plus / n
e21 <- nplus1 * n2plus / n
e22 <- nplus2 * n2plus / n
e11
[1] 23.84903
Code
e12
[1] 455.151
Code
e21
[1] 224.151
Code
e22
[1] 4277.849

Yes. Since all the expected values are at least 5, we can use the \chi^{2} distribution to approximate the sampling distribution of the Pearson test statistic.

The \chi^{2} statistic is calculated manually.

Code
n11 <- 66
n12 <- 413
n21 <- 182
n22 <- 4320
xs <- ((n11 - e11)^2/e11) + ((n12 - e12)^2/e12) + ((n21 - e21)^2/e21) + ((n22 - e22)^2/e22)
xs
[1] 86.74324
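The same statistic can be computed in a vectorized form directly from the contingency table, which generalizes to larger tables; in the sketch below the matrix of expected values is built from the row and column totals.

Code
# Expected values from the outer product of the row and column totals
e <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson X^2 statistic summed over all four cells
sum((tab - e)^2 / e)
[1] 86.74324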

Since this is a 2 \times 2 contingency table, there is (2 - 1)(2 - 1) = 1 degree of freedom.

The critical \chi^{2} for \alpha = 0.05 with 1 degree of freedom is below.

Code
qchisq(0.95, df = 1)
[1] 3.841459

Yes, since the \chi^{2} statistic is greater than the critical value, 86.74324 > 3.841459.

The pchisq function is used to calculate a p-value for the test statistic.

Code
pchisq(xs, df = 1, lower.tail = FALSE)
[1] 1.235652e-20

At the 5% significance level, there is enough evidence to suggest that there is an association between the presence of hypertension and the development of a stroke.

The chisq.test function is used to confirm the results of the manual Pearson \chi^{2} test for independence.

Code
tst <- chisq.test(tab, correct = FALSE)
tst

    Pearson's Chi-squared test

data:  tab
X-squared = 86.743, df = 1, p-value < 2.2e-16

Below are the expected values for the table.

Code
tst$expected
            stroke
hypertension       Yes       No
         Yes  23.84903  455.151
         No  224.15097 4277.849

Below we calculate the estimated odds ratio and 95\% Wald confidence intervals.

Code
epitools::oddsratio.wald(tab)
$data
            stroke
hypertension Yes   No Total
       Yes    66  413   479
       No    182 4320  4502
       Total 248 4733  4981

$measure
            odds ratio with 95% C.I.
hypertension estimate    lower    upper
         Yes 1.000000       NA       NA
         No  3.793204 2.812414 5.116031

$p.value
            two-sided
hypertension   midp.exact fisher.exact   chi.square
         Yes           NA           NA           NA
         No  2.220446e-15 1.740745e-15 1.235652e-20

$correction
[1] FALSE

attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"

The interval of plausible population odds ratios, (2.81, 5.12), does not contain the null value, \text{OR}=1, supporting the test conclusion of an association between hypertension and stroke.