This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In chapter 2, we establish definitions for the measures of association: risk difference, relative risk, and odds ratio. Then, using the epiR and epitools libraries, we calculate these measures of association. Additionally, we discuss which measures of association are most effective based on study design and perform relevant hypothesis testing.
Load the required libraries for Chapter 2. If these packages are not already installed, please do so before loading the libraries.
library(ggplot2)
library(latex2exp)
library(crosstable)
library(gmodels)
library(ggstatsplot)
library(sjPlot)
library(DescTools)
library(rio)
Also, all data for this chapter and subsequent chapters are from a GitHub repository. Below is a user-defined function to allow for more accessible data import into R. An example of how to apply the user-defined function import_dataset is demonstrated in Section 5 Example 5.4.
import_dataset <- function(filename) {
  # GitHub repository URL
  github_url <- "https://raw.githubusercontent.com/juanklopper/TutorialData/main/"
  file_url <- paste0(github_url, filename)
  dataset <- import(file_url)
  return(dataset)
}
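As a quick usage sketch, the call below fetches one of the files used later in this chapter (it assumes the function above has been defined, the rio library is loaded, and an internet connection is available):

# Import one of the chapter's data files from the GitHub repository
dfr <- import_dataset("measures_assoc.csv")
head(dfr)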
Studies concerning categorical variables allow us to consider the frequency of outcomes for each variable and the association of these frequencies between pairs of variables. We can express such outcomes in a contingency table or a table of observed values, an example of which is shown in Table 1. Participants are randomized into two groups, A and B. They are observed for one of two outcomes, Outcome 1 and Outcome 2.
Group | Outcome 1 | Outcome 2 | Total |
---|---|---|---|
A | 509 | 116 | 625 |
B | 398 | 104 | 502 |
Total | 907 | 220 | 1127 |
A total of 509 participants in group A have Outcome 1 and 116 in group A have Outcome 2. A total of 398 in group B have Outcome 1 and 104 in group B have Outcome 2.
Note that for this chapter, we will usually have the explanatory variable, in this case, the groups, on the left-hand side (the rows), with a row for each of the levels of the explanatory group. The levels of the response variable are along the top (the columns), with a column for each level of the response variable. In the case of this example, we have 2 rows and 2 columns resulting in a 2 \times 2 contingency table.
By this structure of the contingency table, we can sum all the frequencies along the rows and columns. We note the following:
Irrespective of which outcome participants were observed to have, there are 509 + 116 = 625 participants in group A and 398 + 104 = 502 participants in group B.
Irrespective of which group participants were in, there are 509 + 398 = 907 participants who have Outcome 1 and 116 + 104 = 220 who have Outcome 2.
We also note a total of 1127 participants. This value must be the same for the sum of the row or the column totals or the sum of the four joint frequencies.
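To make the margin bookkeeping concrete, here is a minimal sketch that rebuilds Table 1 in R and appends the totals (the object name tab1 is ours):

# Reconstruct Table 1 and append row, column, and grand totals
tab1 <- matrix(
  c(509, 116,
    398, 104),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("A", "B"), Outcome = c("Outcome 1", "Outcome 2"))
)

addmargins(tab1) # the grand total, 1127, appears in the bottom-right cell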
Our general notation for a 2 \times 2 table will be as in Table 2. The subscripts denote the row and then the column number. We use the + symbol to indicate summation over an index: placed second, it gives a row total, and placed first, it gives a column total.
Group | Outcome 1 | Outcome 2 | Total |
---|---|---|---|
A | n_{1,1} | n_{1,2} | n_{1,+} |
B | n_{2,1} | n_{2,2} | n_{2,+} |
Total | n_{+,1} | n_{+,2} | n |
Different textbooks use different notation for the groups and outcomes in a 2 \times 2 table of observed values. You might see a table as in Table 3, where E is exposure and \bar{E} is non-exposure, with D being disease and \bar{D} being non-disease. Other textbooks might use success for D and failure for \bar{D}. None of these terms should be taken literally. They merely identify the outcome of interest (D or success).
D | \bar{D} | Total | |
---|---|---|---|
E | n_{1,1} | n_{1,2} | n_{1,+} |
\bar{E} | n_{2,1} | n_{2,2} | n_{2,+} |
Total | n_{+,1} | n_{+,2} | n |
We are not restricted to a 2 \times 2 contingency table. If the explanatory variable has m outcomes and the response variable has n outcomes, then we would have an m \times n contingency table.
In Table 4, we have the results of a survey. The data are frequencies of observed values for the levels of two categorical variables, Group (explanatory variable) and Choice (response variable). There are two categories for the Group variable, A and B, and two for the Choice variable, I and II.
I | II | |
---|---|---|
A | 80 | 100 |
B | 120 | 25 |
The total number of participants is 80 + 100 + 120 + 25 = 325.
There are 80 + 100 = 180 participants in group A and 120 + 25 = 145 participants in group B.
Of all the participants, 80 + 120 = 200 chose I and 100 + 25 = 125 chose II. Table 5 shows these values.
I | II | Total | |
---|---|---|---|
A | 80 | 100 | 180 |
B | 120 | 25 | 145 |
Total | 200 | 125 | 325 |
We can express a relative frequency or a proportion by dividing each frequency by the total number of participants. The result is a contingency table, Table 6, that is a probability distribution.
I | II | Total | |
---|---|---|---|
A | 0.246 | 0.308 | 0.554 |
B | 0.369 | 0.077 | 0.446 |
Total | 0.615 | 0.385 | 1.0 |
The values 0.246, 0.308, 0.369, and 0.077 are termed joint probabilities. For the first value, we could state that the probability of being in group A and having chosen I is 0.246. We write this as P(A \cap I). The four values together form the joint probability distribution. In some textbooks, the notation \hat{P} is used to denote that the results are from a sample and are used to estimate the true population joint probability.
Note also that when we refer to a probability here, we are actually calculating a proportion.
In general, we have Equation 1 for a joint probability, where i denotes the row number and j the column number. \frac{n_{ij}}{n} \tag{1}
The values 0.554, 0.446, 0.615, and 0.385 are termed marginal probabilities and form the marginal probability distribution. For 0.554 we could write P(\text{being in group A}) or then \hat{P}(A). Each of the two sets of marginal probabilities (along the rows or the columns) sums to 1.0.
Depending on whether we consider row or column marginal probabilities, we divide the relevant marginal total by n. In Equation 2 we consider only the two column marginal probabilities, \hat{P}(I) and \hat{P}(II), where j = 1, 2.
\frac{n_{+j}}{n} \tag{2}
What about the probability of having chosen I given that the participant is in group A? Of the 180 participants in group A, 80 chose I. The conditional probability is \hat{P}(I \vert A) = 80 \div 180 \approx 0.444.
We have to remember that these are all values for a sample taken from a population and all three types of probabilities are probability estimates.
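These three kinds of probability estimates are easy to verify in R with prop.table; a short sketch using the Table 5 counts (the object name tab5 is ours):

# Table 5 counts
tab5 <- matrix(
  c(80, 100,
    120, 25),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("A", "B"), Choice = c("I", "II"))
)

prop.table(tab5)                  # joint probabilities, as in Table 6
margin.table(tab5, 1) / sum(tab5) # row marginal probabilities
prop.table(tab5, margin = 1)      # conditional on the rows, e.g. P(I | A) = 0.444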
In the code cell below, we construct observed data for a simulated study. There are 40 participants. The explanatory variable is Group and the participants are divided equally into groups A and B. Participants in group A receive novel therapy to alleviate a specific symptom and those in group B receive a placebo. Researchers measure whether they improve (a success for this binary response variable, encoded as Yes) or not (a failure, encoded as No).
We create the simulated data and display the results as a table.
set.seed(12)

explanatory <- rep(
  c("A", "B"),
  each = 20
)

response <- sample(
  c("No", "Yes"),
  size = 40,
  replace = TRUE
)

data <- data.frame(
  Group = explanatory,
  Outcome = response
)

xtabs(
  ~Group + Outcome,
  data = data
)
Outcome
Group No Yes
A 7 13
B 10 10
We use the crosstable function from the crosstable library to construct a more elegant table of observed values, together with marginal totals, shown in Table 7. Note that since we did not use the factor function to set the order of the levels, the crosstable function uses the \bar{D} level (No) in the first column.
crosstable::crosstable(
  data,
  Group,
  by = Outcome,
  total = "both",
  percent_pattern = "{n}"
) %>%
  crosstable::as_flextable(
    keep_id = FALSE
  )
label | variable | No | Yes | Total |
---|---|---|---|---|
Group | A | 7 | 13 | 20 (50.00%) |
 | B | 10 | 10 | 20 (50.00%) |
Total | | 17 (42.50%) | 23 (57.50%) | 40 (100.00%) |
The CrossTable function from the gmodels library provides the joint, marginal, and conditional probabilities.
gmodels::CrossTable(
  data$Group,
  data$Outcome,
  chisq = FALSE,
  prop.chisq = FALSE,
  prop.c = TRUE,
  prop.r = TRUE
)
Cell Contents
|-------------------------|
| N |
| N / Row Total |
| N / Col Total |
| N / Table Total |
|-------------------------|
Total Observations in Table: 40
| data$Outcome
data$Group | No | Yes | Row Total |
-------------|-----------|-----------|-----------|
A | 7 | 13 | 20 |
| 0.350 | 0.650 | 0.500 |
| 0.412 | 0.565 | |
| 0.175 | 0.325 | |
-------------|-----------|-----------|-----------|
B | 10 | 10 | 20 |
| 0.500 | 0.500 | 0.500 |
| 0.588 | 0.435 | |
| 0.250 | 0.250 | |
-------------|-----------|-----------|-----------|
Column Total | 17 | 23 | 40 |
| 0.425 | 0.575 | |
-------------|-----------|-----------|-----------|
Note that the order of the response variable levels is once again left as the default. Using the Cell Contents legend at the top of the output as a key, each cell displays the following values, as per the function arguments used:
Observed frequency
Conditional probability upon the explanatory variable level
Conditional probability upon the response variable level
Joint probability
From the table we read the joint probabilities \hat{P}(A \cap \text{No}) = 0.175, \hat{P}(A \cap \text{Yes}) = 0.325, \hat{P}(B \cap \text{No}) = 0.250, and \hat{P}(B \cap \text{Yes}) = 0.250.
The marginal probabilities are \hat{P}(A) = 0.500, \hat{P}(B) = 0.500, \hat{P}(\text{No}) = 0.425, and \hat{P}(\text{Yes}) = 0.575.
Finally, conditioning on the groups, we read the conditional probabilities \hat{P}(\text{No} \vert A) = 0.350, \hat{P}(\text{Yes} \vert A) = 0.650, \hat{P}(\text{No} \vert B) = 0.500, and \hat{P}(\text{Yes} \vert B) = 0.500.
Note from the study design that given that the group allocation was fixed, it does not make sense to consider the probabilities of group allocation given the outcomes. We will revisit this discussion in a later section when we consider risks and ratios in terms of study design.
We can derive and understand the values that we calculate for risks and ratios from a 2 \times 2 table of observed values. There are three commonly used measures. We use them to quantify the magnitude and direction of association between two binary variables. The measures are risk difference, risk ratio (relative risk), and the odds ratio.
We define the risk difference (RD) in Equation 3, the risk of success, D, given exposure, E, minus the risk of success, D, given non-exposure, \bar{E}. To be clear, we are only considering those with the disease and we calculate the difference in the conditional probabilities - with the disease given exposure and no exposure.
\begin{align} &\text{Definition: } \text{RD} = P(D \vert E) - P(D \vert \bar{E}) \\ &\text{Estimate: } \widehat{\text{RD}} = \hat{P}(D \vert E) - \hat{P}(D \vert \bar{E}) \end{align} \tag{3}
Note that our interest lies with the success level, D, of the response variable.
As an illustrative example, consider a trial in which participants were randomized to receive either aspirin or a placebo and were followed for the development of a myocardial infarction (MI). The two levels of the explanatory variable are independent; each participant took only one of aspirin or placebo. Of the n_{1+} = 11037 participants in the aspirin group, 104 developed a myocardial infarction, for p_{1} = 104 \div 11037 \approx 0.0094, as shown in Equation 4. Of the n_{2+} = 11034 participants in the placebo group, 189 developed a myocardial infarction. The conditional probability of a myocardial infarction in this group was p_{2} = 189 \div 11034 \approx 0.0171, as shown in Equation 5.
p_{1} = \hat{P}(D \, \vert \, E) = \frac{n_{11}}{n_{1+}} = \frac{104}{11037} \approx 0.0094 \tag{4}
p_{2} = \hat{P}(D \, \vert \bar{E}) = \frac{n_{21}}{n_{2+}} = \frac{189}{11034} \approx 0.0171 \tag{5}
We state that the risk of the development of a myocardial infarction in the placebo group was 0.017 and the risk of the development of a myocardial infarction in the treatment group was 0.009.
The difference in the risk of participants who developed a myocardial infarction between the two groups is \Delta p = 0.0094 - 0.0171 = -0.0077 \in [-1, 1]. This difference is termed the risk difference.
We can interpret the result by stating that the risk (chance or probability) of developing an MI in the aspirin group is 0.77 percentage points lower than in the placebo group.
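A minimal sketch of this calculation in R, using the counts from the aspirin example (the variable names are ours):

# Risk difference for the aspirin example
p1 <- 104 / 11037 # estimated P(D | E), aspirin group
p2 <- 189 / 11034 # estimated P(D | non-exposure), placebo group
p1 - p2           # approximately -0.0077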
The RD has an interval -1 \le \text{RD} \le 1. The interval boundary values can be calculated from the extreme summary results in Table 8 and Table 9.
RD=-1 | D | \bar{D} | Total |
---|---|---|---|
E | 0 | 20 | 20 |
\bar{E} | 20 | 0 | 20 |
Total | 20 | 20 | 40 |
The risk difference is shown in Equation 6.
P(D \, \vert E) - P \left( D \, \vert \bar{E} \right) = \frac{0}{20} - \frac{20}{20} = -1 \tag{6}
RD=1 | D | \bar{D} | Total |
---|---|---|---|
E | 20 | 0 | 20 |
\bar{E} | 0 | 20 | 20 |
Total | 20 | 20 | 40 |
The risk difference in this case is shown in Equation 7.
P(D \, \vert E) - P \left( D \, \vert \bar{E} \right) = \frac{20}{20} - \frac{0}{20} = 1 \tag{7}
While considering the difference in risk, we have to take cognizance of the number of cases and hence the proportions. A \Delta p of 0.009 is the same when we subtract 0.010 - 0.001 as when we subtract 0.410 - 0.401, yet the proportions are very different.
To mitigate the differences, we consider relative risk or risk ratio (RR), shown in Equation 8.
\begin{align} \text{Definition: } &\text{RR} = \frac{P(D \vert E)}{P(D \vert \bar{E})} \\ \text{Estimate: } &\widehat{\text{RR}} = \frac{\hat{P}(D \vert E)}{\hat{P}(D \vert \bar{E})} \end{align} \tag{8}
For the two pairs of proportions mentioned above, the RRs differ markedly (0.010 \div 0.001 = 10 versus 0.410 \div 0.401 \approx 1.02) even though the RDs are identical. For our aspirin example, the estimated RR is shown in Equation 9.
\widehat{\text{RR}} = \frac{\frac{104}{11037}}{\frac{189}{11034}} \approx 0.5501 \tag{9}
The RR in the case of our example is approximately 0.550. We can interpret this as the risk of an MI in the aspirin group being 0.550 times that in the placebo group, or as the risk of an MI among those taking aspirin being 1 - 0.5501 = 0.4499 \approx 45% less than that among the placebo group. (See the explanation below in Thinking about lowering and increasing the measures of association.)
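The point estimate is a one-liner in R (the proportions are redefined here so the block runs on its own):

# Relative risk for the aspirin example
p1 <- 104 / 11037
p2 <- 189 / 11034
p1 / p2 # approximately 0.5501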
The RR has an interval 0 \le \text{RR} < \infty. The interval boundary values can be calculated from the extreme summary results in Table 10 and Table 11.
RR=0 | D | \bar{D} | Total |
---|---|---|---|
E | 0 | 20 | 20 |
\bar{E} | 20 | 0 | 20 |
Total | 20 | 20 | 40 |
The relative risk is shown in Equation 10.
\frac{P(D \, \vert \, E)}{P \left( D \, \vert \, \bar{E} \right)} = \frac{\frac{0}{20}}{\frac{20}{20}} = 0 \tag{10}
RR=\infty | D | \bar{D} | Total |
---|---|---|---|
E | 20 | 0 | 20 |
\bar{E} | 0 | 20 | 20 |
Total | 20 | 20 | 40 |
The relative risk is shown in Equation 11.
\frac{P(D \, \vert \, E)}{P \left( D \, \vert \, \bar{E} \right)} = \frac{\frac{20}{20}}{\frac{0}{20}} = \infty \tag{11}
If the population proportion of success for a binary outcome of an experiment is \pi then the odds of success is as shown in Equation 12.
\text{odds}=\frac{\pi}{1 - \pi} \tag{12}
The sample proportion of success (a myocardial infarction) in the group taking aspirin in our illustrative example was 104 \div 11037 \approx 0.0094. This is \hat{P}(D \vert E). We calculate the odds of success in the code chunk below, assigning the result to the variable odds_aspirin.
odds_aspirin <- (104 / 11037) / (1 - (104 / 11037))
odds_aspirin
[1] 0.009512485
We can do the same for the group taking the placebo. We calculate the odds of success in the placebo group in the code chunk below and assign it to the variable odds_placebo. This uses \hat{P}(D \vert \bar{E}).
odds_placebo <- (189 / 11034) / (1 - (189 / 11034))
odds_placebo
[1] 0.01742739
The odds ratio comparing the odds of success between the two groups is calculated as shown in Equation 13.
\begin{align} \text{odds ratio} = &\theta = \frac{\text{odds}_{1}}{\text{odds}_{2}} \\ \\ \text{Definition: } &\text{OR}_{\text{D}} = \frac{\frac{P(D \vert E)}{1 - P(D \vert E)}}{\frac{P(D \vert \bar{E})}{1 - P(D \vert \bar{E})}} = \frac{\frac{P(D \, \vert \, E)}{P(\bar{D} \, \vert \, E)}}{\frac{P(D \, \vert \, \bar{E})}{P(\bar{D} \, \vert \, \bar{E})}} \\ \\ \text{Estimate: } &\widehat{\text{OR}}_{\text{D}} = \frac{\frac{\hat{P}(D \vert E)}{1 - \hat{P}(D \vert E)}}{\frac{\hat{P}(D \vert \bar{E})}{1 - \hat{P}(D \vert \bar{E})}} = \frac{\frac{\hat{P}(D \, \vert \, E)}{\hat{P}(\bar{D} \, \vert \, E)}}{\frac{\hat{P}(D \, \vert \, \bar{E})}{\hat{P}(\bar{D} \, \vert \, \bar{E})}} \end{align} \tag{13}
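For the aspirin example, the sample odds ratio follows directly from the two odds computed above; a self-contained sketch:

# Odds ratio for the aspirin example
odds_aspirin <- (104 / 11037) / (1 - (104 / 11037))
odds_placebo <- (189 / 11034) / (1 - (189 / 11034))
odds_aspirin / odds_placebo # approximately 0.546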
The data for this section is in a file named measures_assoc.csv, which we import below using the user-defined function import_dataset from Section 2.
<- import_dataset("measures_assoc.csv") dfr
We create a table of observed data between Exposure and Disease.
tb <- table(dfr$Exposure, dfr$Disease)
tb
D ND
E 21 19
NE 8 32
We create another table of calculations for the measures of association between Exposure and Disease.
epiR::epi.2by2(tb)
Outcome + Outcome - Total Inc risk *
Exposed + 21 19 40 52.50 (36.13 to 68.49)
Exposed - 8 32 40 20.00 (9.05 to 35.65)
Total 29 51 80 36.25 (25.79 to 47.76)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 2.62 (1.32, 5.21)
Inc odds ratio 4.42 (1.64, 11.93)
Attrib risk in the exposed * 32.50 (12.67, 52.33)
Attrib fraction in the exposed (%) 61.90 (24.33, 80.82)
Attrib risk in the population * 16.25 (-0.02, 32.52)
Attrib fraction in the population (%) 44.83 (8.43, 66.76)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 9.141 Pr>chi2 = 0.002
Fisher exact test that OR = 1: Pr>chi2 = 0.005
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
The Inc risk is the incidence risk, which is simply \hat{P}(D \vert E) = 52.5, \hat{P}(D \vert \bar{E}) = 20.0, and \hat{P}(D) = 36.2, expressed per 100 population units. We also see the estimated odds \widehat{\text{odds}}(D \vert E) = 21 \div 19 \approx 1.105, \widehat{\text{odds}}(D \vert \bar{E}) = 8 \div 32 = 0.250, and \widehat{\text{odds}}(D) = 29 \div 51 \approx 0.569.
Within the table titled Point estimates and 95% CIs, the Inc risk ratio represents \widehat{\text{RR}} and its corresponding CI, the Inc odds ratio gives \widehat{\text{OR}} and its CI, and the Attrib risk in the exposed is the \widehat{\text{RD}} and its CI.
RD is an absolute measure of association, whereas the RR (relative risk) and OR are relative measures of association. We need to consider the use of these measures.
As an example, suppose the conditional probabilities are small, say P(D \vert E) = 0.010 and P(D \vert \bar{E}) = 0.001, as used above.
The measures of association are then \text{RD} = 0.009, \text{RR} = 10, and \text{OR} \approx 10.09.
If we only report the RR, we might consider the value to be large. The RD, though, shows the minor absolute difference. The OR is approximately the same as the RR since we have small values for P(D \vert E) and P(D \vert \bar{E}).
Contrast the above to a pair of large conditional probabilities. The OR is then quite large when compared to the RR (because P(D \vert E) and P(D \vert \bar{E}) are large). It would be useful to also state the RR or RD (if the study design allows for it).
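The pattern is easy to verify numerically; a sketch with illustrative probabilities of our own choosing:

# Small risks: the OR closely tracks the RR
p1 <- 0.010
p2 <- 0.001
c(RD = p1 - p2, RR = p1 / p2, OR = (p1 / (1 - p1)) / (p2 / (1 - p2)))

# Large risks: the OR is pulled well away from the RR
p1 <- 0.90
p2 <- 0.80
c(RD = p1 - p2, RR = p1 / p2, OR = (p1 / (1 - p1)) / (p2 / (1 - p2)))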
In the next section, we explore case-control, cohort, and cross-sectional study designs to determine which risks and ratios can be used in each study design.
A case-control study design is a form of retrospective study design, where participants are divided into groups as indicated by levels of a categorical variable. We then consider events that took place prior to this.
In the table below, n_{+1} and n_{+2} are fixed by design.
D | \bar{D} | |
---|---|---|
E | n_{11} | n_{12} |
\bar{E} | n_{21} | n_{22} |
Total | n_{+1} | n_{+2} |
We have that n_{11} and n_{12} are random and can be described by the binomial distribution, shown in Equation 14.
\begin{align} &n_{11} \sim \text{Bin}(n_{+1} , P(E \vert D)) \\ &n_{12} \sim \text{Bin}(n_{+2} , P(E \vert \bar{D})) \end{align} \tag{14}
We can only use associations that use P(E \vert D) and P(E \vert \bar{D}), that is to say odds ratios (either for the exposure or the disease, since they are equal). If the disease is rare, then we can use the odds ratio to estimate the relative risk.
Consider a study that investigates surgical site sepsis following colonic resection (removal of the large bowel). Surgical site sepsis is any infection of tissues impacted by surgical incisions and interventions.
As an example, we imagine that researchers investigated 100 participants. A total of 33 had a surgical infection and 67 did not. The proportions for this variable are fixed by the study design.
The researchers want to investigate the role of prophylactic antibiotics on the development of surgical site sepsis. Prophylactic antibiotics are given intravenously just prior to surgery. It penetrates the tissues and the aim is for a high enough concentration of the antibiotic in the tissue when incisions are made, stopping the proliferation of bacteria and preventing infection.
In this simulated study, participants were given either a novel or a regimented antibiotic as a prophylactic agent, and the choice between these was up to the attending surgeon (which, for the sake of this example, we shall take as random).
The researchers find the results in Table 12.
Group | Infection | No infection | Total |
---|---|---|---|
Novel | 5 | 30 | 35 |
Control | 28 | 37 | 65 |
Total | 33 | 67 | 100 |
Generally, the table above would be deemed incorrect. In a case-control design, the researchers fix the number of cases for the explanatory variable. In this design, the presence or absence of surgical site sepsis (infection) is the explanatory variable. We write the levels of this variable along the left-hand side (the rows), shown in Table 13.
Group | Novel | Control | Total |
---|---|---|---|
Infection | 5 | 28 | 33 |
No infection | 30 | 37 | 67 |
Total | 35 | 65 | 100 |
The probability estimates are shown in Table 14.
Group | Novel | Control | Total |
---|---|---|---|
Infection | 0.05 | 0.28 | 0.33 |
No infection | 0.30 | 0.37 | 0.67 |
Total | 0.35 | 0.65 | 1.0 |
We cannot express a probability estimate for the development of infection in the larger population, as the number of cases in each group was fixed. In other words, we cannot say that the estimated probability of infection in people who have the type of surgery involved in this study is 33%. We can, though, consider the marginal probabilities of receiving the new drug or the regimented drug. These probability estimates are the marginal probabilities 0.35 and 0.65. Of more interest are the estimated conditional probabilities. Given that an infection occurred (33 cases), the estimated probability of having received the new (novel) drug is 5 \div 33 \approx 0.15 = 15% and of having received the regimented (control) drug is 28 \div 33 \approx 0.85 = 85%. If no infection occurred, the estimated probability of having received the new drug is 30 \div 67 \approx 0.45 = 45% and of having received the regimented drug is 37 \div 67 \approx 0.55 = 55%.
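A quick sketch confirming these conditional probabilities with prop.table, conditioning on the rows (infection status); the object and dimension names are ours:

# Table 13 counts: infection status on the rows (fixed by design)
tab13 <- matrix(
  c(5, 28,
    30, 37),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Status = c("Infection", "No infection"), Drug = c("Novel", "Control"))
)

prop.table(tab13, margin = 1) # e.g. P(Novel | Infection) = 5/33, about 0.15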
A cohort study design and a randomized controlled trial are examples of a prospective study design. The levels of the explanatory variable divide the cohort or trial participants into two groups. These groups are followed over time, and results are measured for the response variable.
In Table 15, n_{1+} and n_{2+} are fixed.
D | \bar{D} | Total | |
---|---|---|---|
E | n_{11} | n_{12} | n_{1+} |
\bar{E} | n_{21} | n_{22} | n_{2+} |
Now n_{11} and n_{21} are random variables, described as shown in Equation 15.
\begin{align} &n_{11} \sim \text{Bin}(n_{1+} , P(D \vert E)) \\ &n_{21} \sim \text{Bin}(n_{2+} , P(D \vert \bar{E})) \end{align} \tag{15}
We can use measures of association that involve P(D \vert E) and P(D \vert \bar{E}), so we can use the risk difference, relative risk, and odds ratio.
As an example, we consider a much better study than the case-control study above. Researchers compare those who are given a new prophylactic antibiotic to those who received a regimented prophylactic antibiotic prior to surgery. One week after the surgery, they measure the outcome as whether surgical site sepsis (infection) occurred (response variable). The results are as shown in Table 16.
Group | Infection | No infection | Total |
---|---|---|---|
Novel | 12 | 38 | 50 |
Control | 15 | 35 | 50 |
Total | 27 | 73 | 100 |
The table of estimated probabilities is Table 17.
Group | Infection | No infection | Total |
---|---|---|---|
Novel | 0.12 | 0.38 | 0.5 |
Control | 0.15 | 0.35 | 0.5 |
Total | 0.27 | 0.73 | 1.0 |
From this design, we can state that the estimated probability of infection is 27 \div 100 = 0.27 = 27% and for not developing infection is 73 \div 100 = 0.73 = 73%. These are the marginal probabilities along the bottom. We can also state that given the new drug, the estimated probability of an infection is 12 \div 50 =0.24 = 24% and that given the regimented drug, the estimated probability is 15/50 = 0.3 = 30%. These are conditional probabilities. We conclude that the estimated probability of surgical site sepsis is lower for the new prophylactic antibiotic.
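The conditional probabilities in this paragraph can be verified with prop.table, conditioning on the rows; a short sketch (the names are ours):

# Table 16 counts: treatment groups on the rows (fixed by design)
tab16 <- matrix(
  c(12, 38,
    15, 35),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(Group = c("Novel", "Control"), Outcome = c("Infection", "No infection"))
)

prop.table(tab16, margin = 1) # P(Infection | Novel) = 0.24, P(Infection | Control) = 0.30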
In a cross-sectional study design, we measure values for variables from a sample at one time.
All the cell frequencies in Table 18 are now random.
D | \bar{D} | Total | |
---|---|---|---|
E | n_{11} | n_{12} | n_{1+} |
\bar{E} | n_{21} | n_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
All are described by the multinomial distribution, shown in Equation 16.
(n_{11}, n_{12}, n_{21}, n_{22}) \sim \text{Mult}(n, P(D \cap E) , P(\bar{D} \cap E) , P(\bar{E} \cap D) , P(\bar{E} \cap \bar{D})) \tag{16}
All measures of association can be used.
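A minimal simulation sketch of this sampling model, with joint probabilities of our own choosing (they must sum to 1):

# One simulated cross-sectional 2x2 table: all four cells are multinomial
set.seed(12)                            # reproducibility
cell_probs <- c(0.25, 0.22, 0.20, 0.33) # assumed joint probabilities
counts <- rmultinom(1, size = 100, prob = cell_probs)

matrix(
  counts,
  nrow = 2,
  byrow = TRUE,
  dimnames = list(c("E", "NE"), c("D", "ND"))
)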
As an example, we may have a list of patients who visited a local clinic that opened two calendar years ago. We randomly select 100 people to contact over a two-week period and ask them to answer a few survey questions. We record the values in each case. Note that although the data collection took two weeks, the data are a snapshot, with frequencies for two variables calculated.
Imagine then that among other questions, we asked them whether their last visit was in the first year or in the second year (with levels Second and First) and if they were satisfied with the waiting period (levels Yes and No). The results are shown in Table 19.
Year | Yes | No | Total |
---|---|---|---|
Second | 24 | 23 | 47 |
First | 20 | 33 | 53 |
Total | 44 | 56 | 100 |
The estimated probabilities are in Table 20.
Year | Yes | No | Total |
---|---|---|---|
Second | 0.24 | 0.23 | 0.47 |
First | 0.2 | 0.33 | 0.53 |
Total | 0.44 | 0.56 | 1.0 |
In this case, neither variable is fixed. We can express the estimated probability of a last visit being in the second year as 0.47 and in the first year as 0.53. We can also state that the probability of being satisfied with the waiting times is 0.44 and of not being satisfied is 0.56. In all these cases, we are considering marginal probabilities.
We can also consider all eight conditional probabilities.
Next, we consider confidence intervals (CIs) for the measures of association.
Confidence intervals are useful for estimation of population parameters, precision assessment of estimates, and hypothesis testing.
To calculate the CI for risk difference (RD), we use the two conditional probabilities in Equation 17.
\begin{align} &p_{1} = \hat{P}(D \vert E) = \frac{n_{11}}{n_{1+}} \\ \\ &p_{2} = \hat{P}(D \vert \bar{E}) = \frac{n_{21}}{n_{2+}} \end{align} \tag{17}
Now we can calculate the 100(1 - \alpha)% CI for RD, as shown in Equation 18.
(p_{1} - p_{2}) \pm z_{\frac{\alpha}{2}} \sqrt{\frac{p_{1} (1 - p_{1})}{n_{1+}} + \frac{p_{2} (1 - p_{2})}{n_{2+}}} \tag{18}
We use the normal approximation to the binomial distribution for n_{11} and n_{21}, which requires that all four cell frequencies are at least 5.
As an example, we consider the results from a study in Table 21.
D | \bar{D} | Total | |
---|---|---|---|
E | 21 | 19 | 40 |
\bar{E} | 8 | 32 | 40 |
Total | 29 | 51 | 80 |
We substitute the values in the table into Equation 18.
p1 <- 21 / 40
p2 <- 8 / 40
z_a2 <- 1.96

lower <- (p1 - p2) - (z_a2 * sqrt(((p1 * (1 - p1)) / (40)) + ((p2 * (1 - p2)) / (40))))
upper <- (p1 - p2) + (z_a2 * sqrt(((p1 * (1 - p1)) / (40)) + ((p2 * (1 - p2)) / (40))))
lower
[1] 0.1267164
upper
[1] 0.5232836
We are 95% confident that the true RD is between 12.67 and 52.33 percentage points.
We could also state that we are 95% confident that the true risk of disease in the exposure group is between 12.67 and 52.33 percentage points greater than the risk of disease in the non-exposure group.
We note that the null value for RD, which is 0, is not in the interval.
The sampling distribution for relative risk is not normally distributed. So, we cannot use the normal approximation to calculate the confidence interval for the relative risk.
In the code cell below, we simulate 1000 studies so that we can visualize the sampling distribution of the estimated relative risk. Each simulated study has 100 participants per group, with success probabilities 0.3 in the exposure group and 0.2 in the non-exposure group.
n_11 <- rbinom(1000, 100, 0.3)
n_21 <- rbinom(1000, 100, 0.2)

rr <- n_11 / n_21
hist(
  rr,
  main = "Simulated sampling distribution of estimated relative risk",
  xlab = "Estimated relative risk",
  ylab = "Frequency",
  col = "#4197D9",
  las = 1,
  panel.first = grid()
)
hist(
  log(rr),
  main = "Simulated sampling distribution of logarithm of estimated relative risk",
  xlab = "Logarithm of estimated relative risk",
  ylab = "Frequency",
  col = "#98FB98",
  las = 1,
  panel.first = grid()
)
The histogram in Figure 1 shows the distribution of the simulated values. The sampling distribution is right skewed. In Figure 2, we use the natural logarithm transformation and plot a histogram of the sampling distribution of \log{\widehat{\text{RR}}}.
Using Equation 17 for p_{1} and p_{2}, we show the equation for the confidence interval of \log{\widehat{\text{RR}}} in Equation 19.
\log{\left( \frac{p_{1}}{p_{2}} \right)} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{1 - p_{1}}{n_{11}} + \frac{1 - p_{2}}{n_{21}}} \tag{19}
The 95\% CI with z_{\frac{\alpha}{2}} = 1.96 for \log{\widehat{\text{RR}}} is calculated below, substituting the values from our table of observed values in Table 21.
lower <- log(p1 / p2) - z_a2 * sqrt(((1 - p1) / 21) + ((1 - p2) / 8))
upper <- log(p1 / p2) + z_a2 * sqrt(((1 - p1) / 21) + ((1 - p2) / 8))
lower
[1] 0.2787476
upper
[1] 1.651414
To calculate the CI for the estimated relative risk from our table of observed values, we need to exponentiate each value (that is, take the anti-log). In Equation 20, we see the anti-log for our example values.
(e^{0.2787476} , e^{1.651414}) \tag{20}
Below, we calculate the estimated relative risk and corresponding lower and upper bounds of the 95% CI.
p1 / p2
[1] 2.625
round(exp(lower), digits = 3)
[1] 1.321
round(exp(upper), digits = 3)
[1] 5.214
The \widehat{\text{RR}} = 2.625 with a 95% CI of 1.321 to 5.214. We can state that we are 95% confident that the true population RR is in this interval. Alternatively, we state that we are 95% confident that the true risk of disease in the exposure group is between 1.321 - 1 = 0.321 = 32% and 5.214 - 1 = 4.214 = 421% greater than the risk of disease in the non-exposure group.
We performed a natural logarithm transformation (log transformation) to use the normal approximation to the binomial distribution. We still need to make sure that the assumptions for the use of this approximation are met. That is n_{ij} \ge 5 (which in our case it is).
Note also that the null value for the relative risk (which is 1) is not within the interval.
The sampling distribution of the odds ratio (OR) is also not normally distributed and we use the log-transformation.
The equation for the 100(1-\alpha)% CI for the log OR is shown in Equation 21.
\log{\left( \frac{\frac{p_{1}}{1 - p_{1}}}{\frac{p_{2}}{1 - p_{2}}} \right)} \pm z_{\frac{\alpha}{2}} \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \tag{21}
The estimated CI for the log \widehat{\text{OR}} is calculated below, substituting the values from our table of observed values in Table 21.
log_odds_ratio <- log((p1 / (1 - p1)) / (p2 / (1 - p2)))
log_odds_ratio
[1] 1.486378
Below, we calculate the estimated OR and the corresponding lower and upper bounds of the 95% CI.
round((p1 / (1 - p1)) / (p2 / (1 - p2)), digits = 3)
[1] 4.421
se_or <- sqrt((1 / 21) + (1 / 19) + (1 / 8) + (1 / 32))
lower <- log_odds_ratio - z_a2 * se_or
upper <- log_odds_ratio + z_a2 * se_or
round(exp(lower), digits = 3)
[1] 1.638
round(exp(upper), digits = 3)
[1] 11.93
The \widehat{\text{OR}} = 4.421 with a 95% CI of 1.638 to 11.93. We can state that we are 95% confident that the true population OR is in this interval. Alternatively, we state that we are 95% confident that the odds of disease in the exposure group are between 1.638 - 1 = 0.638 = 63.8% and 11.93 - 1 = 10.93 = 1093% greater than the odds of disease in the non-exposure group.
We performed a natural logarithm transformation (log transformation) to use the normal approximation to the binomial distribution. We still need to make sure that the assumptions for the use of this approximation are met. That is n_{ij} \ge 5, which is true for this case.
Note also that the null value for the OR (which is 1) is not within the interval. There is a positive association between exposure and disease.
Our first task is to develop values for the null hypothesis for measures of association.
Consider the table below, where we have two binary variables exposure (with levels E and \bar{E}) and disease (with levels D and \bar{D}). In Table 22, we express the joint probabilities \pi_{ij}, as well as the marginal probabilities.
D | \bar{D} | Total | |
---|---|---|---|
E | \pi_{11} | \pi_{12} | \pi_{1+} |
\bar{E} | \pi_{21} | \pi_{22} | \pi_{2+} |
Total | \pi_{+1} | \pi_{+2} | 1.0 |
We remember from probability theory that two events, A and B, are independent if and only if P(A \cap B) = P(A)P(B). The two binary variables are then independent if and only if each joint probability equals the product of its marginal probabilities, that is, \pi_{ij} = \pi_{i+} \pi_{+j} for all i and j.
This independence has implications for the measures of association, shown in Equation 23 and Equation 24, where we use Equation 22 from probability theory.
P(A \vert B) = \frac{P(A \cap B)}{P(B)} \tag{22}
\begin{align} \text{RD} = P(D \vert E) - P(D \vert \bar{E}) &= \frac{P(D \cap E)}{P(E)} - \frac{P(D \cap \bar{E})}{P(\bar{E})} \\ &= \frac{\pi_{1+} \pi_{+1}}{\pi_{1+}} - \frac{\pi_{2+} \pi_{+1}}{\pi_{2+}} \\ &= \pi_{+1} - \pi_{+1} \\ &= 0 \end{align} \tag{23}
\text{RR} = \frac{P(D \vert E)}{P(D \vert \bar{E})} = \frac{\frac{\pi_{1+} \pi_{+1}}{\pi_{1+}}}{\frac{\pi_{2+}\pi_{+1}}{\pi_{2+}}} = 1 \tag{24}
The null value for the OR is the same, shown in Equation 25.
\begin{align} &\text{OR} = 1 \end{align} \tag{25}
In the section below, we state the null hypothesis and the alternative hypothesis in words, statistical terms, and in values.
In words,
H_{0}: Outcome level and exposure level are independent (not associated)
H_{1}: Outcome level and exposure level are not independent (are associated)
In statistical terms,
H_{0}: \; \pi_{ij} = \pi_{i+} \pi_{+j} (joint probability equals product of its marginal probabilities)
H_{1}: \; \pi_{ij} \ne \pi_{i+} \pi_{+j}
In values by measure of association,
RD | OR | RR |
---|---|---|
H_{0}: \; \text{RD} = 0 \quad H_{1}: \; \text{RD} \ne 0 | H_{0}: \; \text{OR} = 1 \quad H_{1}: \; \text{OR} \ne 1 | H_{0}: \; \text{RR} = 1 \quad H_{1}: \; \text{RR} \ne 1 |
The Pearson \chi^{2} test for independence compares the observed frequencies with the frequencies that would be expected if the null hypothesis were true.
We have Table 23 of observed values.
D | \bar{D} | Total | |
---|---|---|---|
E | n_{11} | n_{12} | n_{1+} |
\bar{E} | n_{21} | n_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
Table 23 can be rewritten, as shown in Table 24, using O_{ij} for the observed values.
D | \bar{D} | Total | |
---|---|---|---|
E | O_{11} | O_{12} | n_{1+} |
\bar{E} | O_{21} | O_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
The aim is to calculate the expected values under the null hypothesis. We use the notation in Table 25, indicating the table of expected values.
D | \bar{D} | Total | |
---|---|---|---|
E | E_{11} | E_{12} | n_{1+} |
\bar{E} | E_{21} | E_{22} | n_{2+} |
Total | n_{+1} | n_{+2} | n |
If the null hypothesis were true, then disease and exposure are independent variables, and each joint probability would factor into the product of the corresponding marginal probabilities (as we remember from the probability theory mentioned before). Under the assumption of the null hypothesis, we can then estimate the joint probabilities as in Equation 26, Equation 27, Equation 28, and Equation 29.
\begin{align} &\hat{P}(D \cap E) = \frac{n_{11}}{n} = \hat{P}(D) \hat{P}(E) = \frac{n_{+1}}{n} \frac{n_{1+}}{n} \\ &\frac{n_{11}}{n} = \frac{n_{+1}}{n} \frac{n_{1+}}{n} \\ &n_{11} = \frac{n_{+1} n_{1+}}{n}, \text{ the expected value } E_{11} \end{align} \tag{26}
\begin{align} &\hat{P}(\bar{D} \cap E) = \frac{n_{12}}{n} = \hat{P}(\bar{D}) \hat{P}(E) = \frac{n_{+2}}{n} \frac{n_{1+}}{n} \\ &\frac{n_{12}}{n} = \frac{n_{+2}}{n} \frac{n_{1+}}{n} \\ &n_{12} = \frac{n_{+2} n_{1+}}{n}, \text{ the expected value } E_{12} \end{align} \tag{27}
\begin{align} &\hat{P}(D \cap \bar{E}) = \frac{n_{21}}{n} = \hat{P}(D) \hat{P}(\bar{E}) = \frac{n_{+1}}{n} \frac{n_{2+}}{n} \\ &\frac{n_{21}}{n} = \frac{n_{+1}}{n} \frac{n_{2+}}{n} \\ &n_{21} = \frac{n_{+1} n_{2+}}{n}, \text{ the expected value } E_{21} \end{align} \tag{28}
\begin{align} &\hat{P}(\bar{D} \cap \bar{E}) = \frac{n_{22}}{n} = \hat{P}(\bar{D}) \hat{P}(\bar{E}) = \frac{n_{+2}}{n} \frac{n_{2+}}{n} \\ &\frac{n_{22}}{n} = \frac{n_{+2}}{n} \frac{n_{2+}}{n} \\ &n_{22} = \frac{n_{+2} n_{2+}}{n}, \text{ the expected value } E_{22} \end{align} \tag{29}
If E is the expected frequency under the null hypothesis (not used here to indicate exposure), then we have the expected cell frequencies derived in Equation 30.
\begin{align} &E_{ij} = \frac{n_{i+}}{n} \frac{n_{+j}}{n} \times n \\ &E_{ij} = \frac{n_{i+} n_{+j}}{n} \end{align} \tag{30}
Any expected cell frequency, E_{ij}, under the null hypothesis is the product of its (row and column) marginal totals, n_{i+} and n_{+j}, divided by the sample size, n, with the calculations summarized in Table 26.
D | \bar{D} | |
---|---|---|
E | E_{11} = \frac{n_{+1} n_{1+}}{n} | E_{12} = \frac{n_{+2} n_{1+}}{n} |
\bar{E} | E_{21} = \frac{n_{+1} n_{2+}}{n} | E_{22} = \frac{n_{+2} n_{2+}}{n} |
We can use Equation 31 to calculate a test statistic (standardized by the expected values), which we denote as X^{2}.
X^{2} = \sum_{i=1}^{2} \sum_{j=1}^{2} \frac{\left( O_{ij} - E_{ij}\right)^{2}}{E_{ij}} \tag{31}
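As a check on Equation 30 and Equation 31, a minimal sketch computing the expected counts and the X^{2} statistic for the 2 \times 2 table used elsewhere in this chapter (Table 21); the object names are ours:

# Observed counts from Table 21
obs <- matrix(
  c(21, 19,
    8, 32),
  nrow = 2,
  byrow = TRUE
)

# Expected counts under independence: row total times column total over n
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
expected # 14.5 and 25.5 in each row

# Pearson X^2 statistic
sum((obs - expected)^2 / expected) # 9.141, matching chisq.test(correct = FALSE) below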
If the null hypothesis is true, then X^{2} approximately follows a \chi^{2} distribution (X^{2} \sim \chi_{1}^{2}), that is, with a single degree of freedom. This approximation holds if each expected value is greater than or equal to 5.
While we have been considering a 2 \times 2 table of observed values, we can use two multilevel variables (with I levels and J levels, respectively). Provided that no more than 20% of the expected values are less than 5, we can use the approximation shown in Equation 32. (See below in Degrees of freedom for the \chi^{2} distribution for more information.)
X^{2} \sim \chi_{(I-1)(J-1)}^{2} \tag{32}
The degrees of freedom that we use for tables of observed values (and expected values) is the number of parameters under the alternative hypothesis minus the number of parameters under the null hypothesis.
Consider three random numbers. If we want them to sum to a certain total, two of the values are free (they can take on any value). This immediately fixes the third value, though (such that we still sum to the same total). In this sense, we have two degrees of freedom in this simple explanation. We say that two of the values are non-redundant.
In a table of observed values, we have I rows and J columns. The marginal probabilities must sum to 1 along the rows and along the columns, leaving I-1 and J-1 free values, respectively. In total, then, we have (I-1) + (J-1) = I + J - 2 non-redundant parameters under the null hypothesis.
Under the alternative hypothesis, there are no restrictions on the joint probabilities (except that they sum to 1.0). We therefore have IJ - 1 non-redundant parameters.
The difference is then as shown in Equation 33.
(IJ-1) - (I + J -2) = IJ - I - J + 1 = (I-1)(J-1) \tag{33}
We note the following general properties of the \chi^{2} distribution with df degrees of freedom: its mean is df, its variance is 2 \, df, and it is right skewed, approaching a normal distribution as df increases.
We can use either a critical value approach or a p value approach to determine significance using the Pearson \chi^{2} test.
The critical \chi^{2}_{1, \alpha} values can be computed using the qchisq function. We do this below for \alpha = 0.1, \alpha = 0.05, and \alpha = 0.01 (all for a single degree of freedom).
# alpha value 0.1
qchisq(0.1, df = 1, lower.tail = FALSE)
[1] 2.705543

# alpha value 0.05
qchisq(0.05, df = 1, lower.tail = FALSE)
[1] 3.841459

# alpha value 0.01
qchisq(0.01, df = 1, lower.tail = FALSE)
[1] 6.634897
In Figure 3, using the dist_chisq function from the sjPlot library, we see the area of rejection for \alpha = 0.05.
sjPlot::dist_chisq(
  chi2 = NULL,
  deg.f = 1,
  p = 0.05,
  xmax = NULL,
  geom.alpha = 0.7
)
In the critical value approach, we reject H_{0} if X^{2} > \chi_{1, \alpha}^{2}.
If we use the p value approach, we reject H_{0} if P(X^{2} > X^{2}_{\text{observed}}) < \alpha.
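The p value in the second approach is the upper-tail area beyond the observed statistic; a one-line sketch using the X^{2} value computed for Table 21 above:

# Upper-tail p value for an observed X^2 of 9.141 with 1 degree of freedom
pchisq(9.141, df = 1, lower.tail = FALSE) # approximately 0.0025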
We use the measures_assoc data from a previous section.
<- import_dataset("measures_assoc.csv") dfr
The table function returns a table of observed values.
tb <- table(dfr$Exposure, dfr$Disease)
tb
D ND
E 21 19
NE 8 32
The null hypothesis is that the two variables are not associated (they are independent) and the alternative hypothesis is that they are associated (not independent).
The chisq.test function takes the table of observed values as its argument and returns the X^{2} statistic and the p value.
# Don't use continuity correction
tst <- chisq.test(tb, correct = FALSE)
tst
Pearson's Chi-squared test
data: tb
X-squared = 9.1413, df = 1, p-value = 0.002499
From the p value approach, we reject the null hypothesis for \alpha = 0.05.
Since we assigned the test to a computer variable, tst, we can extract the statistic attribute.
tst$statistic
X-squared
9.141312
Now, we can use the dist_chisq function again. Unfortunately, the X^{2} value is large and the area under the curve is too small to see for df = 1 in Figure 4.
sjPlot::dist_chisq(
  chi2 = tst$statistic,
  deg.f = 1,
  p = 0.05
)
Remember that the expected cell frequencies must all be 5 or more for us to use Pearson's \chi^{2} test. The expected attribute returns the table of expected values.
tst$expected
D ND
E 14.5 25.5
NE 14.5 25.5
In reporting the results of our study, we can use confidence intervals of the measures of association and/or hypothesis testing. In practice, we use both. The measures of association help us understand the direction of the association.
In this sense, the epi.2by2 function from the epiR library returns all the information that we require.
epiR::epi.2by2(tb)
Outcome + Outcome - Total Inc risk *
Exposed + 21 19 40 52.50 (36.13 to 68.49)
Exposed - 8 32 40 20.00 (9.05 to 35.65)
Total 29 51 80 36.25 (25.79 to 47.76)
Point estimates and 95% CIs:
-------------------------------------------------------------------
Inc risk ratio 2.62 (1.32, 5.21)
Inc odds ratio 4.42 (1.64, 11.93)
Attrib risk in the exposed * 32.50 (12.67, 52.33)
Attrib fraction in the exposed (%) 61.90 (24.33, 80.82)
Attrib risk in the population * 16.25 (-0.02, 32.52)
Attrib fraction in the population (%) 44.83 (8.43, 66.76)
-------------------------------------------------------------------
Uncorrected chi2 test that OR = 1: chi2(1) = 9.141 Pr>chi2 = 0.002
Fisher exact test that OR = 1: Pr>chi2 = 0.005
Wald confidence limits
CI: confidence interval
* Outcomes per 100 population units
In this last section, we simulate data with an exposure variable with 3 levels and a disease variable with 2 levels.
set.seed(12) # Reproducibility

exposure <- sample(
  c("Low dose", "High dose", "Placebo"),
  size = 100,
  replace = TRUE,
  prob = c(0.3, 0.4, 0.3) # Stated probabilities for each of 3 levels
)

outcome <- sample(
  c("Improve", "not improve"),
  size = 100,
  replace = TRUE,
  prob = c(0.4, 0.6)
)

dfr <- data.frame(
  Exposure = exposure,
  Disease = outcome
)
Below, we create a table of observed values.
tb <- table(dfr$Exposure, dfr$Disease)
tb
Improve not improve
High dose 25 23
Low dose 11 17
Placebo 7 17
The Pearson \chi^{2} test is performed below.
tst <- chisq.test(tb)
tst
Pearson's Chi-squared test
data: tb
X-squared = 3.6472, df = 2, p-value = 0.1614
We cannot express measures of association as we do not have a 2 \times 2 design.
Finally, we visualize the results in Figure 5.
sjPlot::dist_chisq(
  chi2 = tst$statistic,
  deg.f = 2
)
Epitools as an Alternative to Epi2by2
The epitools library returns the same results as the epiR library. Note that epitools requires the 2 \times 2 table of observed values to be in the order shown in Table 27.
\bar{D} | D | Total | |
---|---|---|---|
\bar{E} | |||
E | |||
Total |
We import a comma-separated values file to demonstrate the use of epitools.
<- import_dataset("measures_assoc.csv") dfr
The table of observed values is shown below and is in the format required by the epiR library.
tab <- xtabs(
  ~ Exposure + Disease,
  data = dfr
)

tab
Disease
Exposure D ND
E 21 19
NE 8 32
The data are summarized with a mosaic plot.
(Desc(tab, verbose = "high", expected = TRUE))
------------------------------------------------------------------------------
tab (xtabs, table)
Summary:
n: 80, rows: 2, columns: 2
Pearson's Chi-squared test (cont. adj):
X-squared = 7.789, df = 1, p-value = 0.005256
Fisher's exact test p-value = 0.004827
D ND Sum
E freq 21 19 40
perc 26.2% 23.8% 50.0%
p.row 52.5% 47.5% .
p.col 72.4% 37.3% .
exp 14.500 25.500 .
NE freq 8 32 40
perc 10.0% 40.0% 50.0%
p.row 20.0% 80.0% .
p.col 27.6% 62.7% .
exp 14.500 25.500 .
Sum freq 29 51 80
perc 36.3% 63.7% 100.0%
p.row . . .
p.col . . .
exp . . .
----------
' 95% conf. level
The data can also be summarized with a stacked bar plot.
ggstatsplot::ggbarstats(
  data = dfr,
  x = Disease,
  y = Exposure,
  type = "parametric",
  bf.message = FALSE,
  results.subtitle = FALSE,
  palette = "Blues"
)
riskratio.wald
The riskratio.wald function calculates the relative risk estimate and its confidence interval. Note the use of the keyword argument rev, set to both such that both the rows and columns of the table of observed values are reversed.
epitools::riskratio.wald(
  tab,
  rev = "both"
)
$data
Disease
Exposure ND D Total
NE 32 8 40
E 19 21 40
Total 51 29 80
$measure
risk ratio with 95% C.I.
Exposure estimate lower upper
NE 1.000 NA NA
E 2.625 1.32149 5.214283
$p.value
two-sided
Exposure midp.exact fisher.exact chi.square
NE NA NA NA
E 0.002892717 0.004827425 0.002499019
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
The results show \widehat{\text{RR}} = 2.625 with a Wald 95% confidence interval of (1.32, 5.21), as with the epiR epi.2by2 function. We also see the results of hypothesis testing using exact tests and the Pearson \chi^{2} test for association.
riskratio
The riskratio function takes the method keyword argument, which lets us choose among Wald, small-sample adjusted, and bootstrap estimation. Below, we select wald as the method.
epitools::riskratio(
  tab,
  method = "wald",
  rev = "both"
)
$data
Disease
Exposure ND D Total
NE 32 8 40
E 19 21 40
Total 51 29 80
$measure
risk ratio with 95% C.I.
Exposure estimate lower upper
NE 1.000 NA NA
E 2.625 1.32149 5.214283
$p.value
two-sided
Exposure midp.exact fisher.exact chi.square
NE NA NA NA
E 0.002892717 0.004827425 0.002499019
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
oddsratio.wald
The oddsratio.wald function is similar to the riskratio.wald function, but for the odds ratio. It computes the odds ratio estimate and its corresponding confidence interval.
epitools::oddsratio.wald(
  tab,
  rev = "both"
)
$data
Disease
Exposure ND D Total
NE 32 8 40
E 19 21 40
Total 51 29 80
$measure
odds ratio with 95% C.I.
Exposure estimate lower upper
NE 1.000000 NA NA
E 4.421053 1.638427 11.92956
$p.value
two-sided
Exposure midp.exact fisher.exact chi.square
NE NA NA NA
E 0.002892717 0.004827425 0.002499019
$correction
[1] FALSE
attr(,"method")
[1] "Unconditional MLE & normal approximation (Wald) CI"
The first lab problem questions are based on the following simulated data for an explanatory variable Procedure with two levels, I and II, denoting two types of inguinal hernia repair. A total of 50 participants were randomly assigned to one of the two methods of surgical repair of these groin hernias. The response variable, Recurrence, measured whether they developed a recurrence (return of the hernia), with levels No and Yes.
The data is stored in a data frame object and assigned to the variable data.
set.seed(12)

group <- rep(
  c("I", "II"),
  each = 25
)

recurrence <- sample(
  c("No", "Yes"),
  size = 50,
  replace = TRUE,
  prob = c(0.8, 0.2)
)

data <- data.frame(
  Group = group,
  Recurrence = factor(recurrence, levels = c("Yes", "No"))
)
We use a stacked bar plot to visualize the results.
ggbarstats(
data = data,
x = Recurrence,
y = Group,
type = "parametric",
bf.message = FALSE,
results.subtitle = FALSE,
palette = "Blues",
title = "Proportion of recurrence and no recurrence in two type of hernia repair",
xlab = "Group",
legend.title = "Recurrence",
ggplot.component = list(ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 2)))
)
Complete the following four questions.
Create a 2 \times 2 table of observed values.
Calculate the joint probabilities.
Calculate the marginal probabilities.
Calculate the relevant conditional probabilities.
Find an example of each study design in the published literature that contains the results of risk and ratio analysis and comment on these results.
The research data for this lab pertains to the development of a cerebrovascular accident (a stroke). The data is available in a comma-separated file named stroke.csv.
Do NOT use a library function to answer the questions of problems 3.1 through 3.4.
Complete the following three questions.
Import the data as a data frame object and assign it to the computer variable dfr.
Verify that there is no missing data.
Both \texttt{hypertension} and \texttt{stroke} are binary variables. The levels are No and Yes for each. Consider \texttt{hypertension} to be the explanatory variable, with the level Yes as E and the level No as \bar{E}. Consider Yes (the development of a stroke) as D, and No (the absence of the development of a stroke) as \bar{D}. Generate a table of observed values and assign it to the variable dfr.
Answer the following five questions.
Calculate the risk estimate of a stroke given the presence of hypertension.
Calculate the risk estimate of a stroke given the absence of hypertension.
Calculate the risk difference estimate for a stroke comparing those with and without hypertension.
Calculate a 95% confidence interval for the population risk difference of a stroke comparing those with and without hypertension.
Comment on your results.
Answer the following three questions.
Calculate the relative risk estimate for a stroke comparing those with and without hypertension.
Calculate a 95% confidence interval for the population relative risk of a stroke comparing those with and without hypertension.
Comment on your results.
Answer the following five questions.
Calculate the odds of a stroke among those with hypertension.
Calculate the odds of a stroke among those without hypertension.
Calculate the odds ratio for a stroke comparing those with and without hypertension.
Calculate a 95% confidence interval for the odds ratio for a stroke comparing those with and without hypertension.
Comment on your results.
Answer the following question using an appropriate R function. Researchers consider the association between the presence, E = \text{Yes}, and absence \bar{E} = \text{No}, of heart disease and the development, D, and non-development, \bar{D}, of a stroke.
Calculate the risk difference and 95\% confidence interval for the risk difference pertaining to the research.
Calculate the relative risk and 95\% confidence interval for the relative risk pertaining to the research.
Calculate the odds ratio and 95\% confidence interval for the odds ratio pertaining to the research.
Write a comment regarding the results.
The following questions pertain to the same data set that was used in the previous notebook. Patients either have hypertension, E, or do not have hypertension, \bar{E}. The presence of hypertension is our explanatory variable. The outcome or response variable levels are stroke, D, and no stroke, \bar{D}. Use a significance level of 0.05 for this problem.
The data is available in the stroke.csv file that is imported below and assigned to the computer variable stroke.
<- import_dataset("stroke.csv") stroke
Create a contingency table from the data and use the usual order of the levels (as used in this course). Assign the table to the variable tab.
Calculate the table of expected values (without using any R functions).
Can we use the \chi^{2} distribution to approximate our Pearson test statistic?
Calculate the X^{2} test statistic without using any R function and assign it to the variable xs.
How many degrees of freedom are there for this problem?
Calculate the critical \chi^{2} value for \alpha=0.05.
Is the test statistic larger than the critical statistic value?
Use an appropriate R function to calculate a p-value for the test statistic.
What is the test conclusion?
Use an appropriate R function to confirm the results of your calculations using the Pearson \chi^{2} tests for independence.
Confirm the values for the table of expected values.
Calculate the estimated odds ratio and 95\% Wald confidence interval to support your test conclusion. Does the interval contain the null value \text{OR} = 1?