This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In this chapter, we develop the intuition for inference using single proportions. For a single proportion, we consider an experiment of independent trials, where each trial has two outcomes. For example, in an experiment of multiple coin flips, the two outcomes are heads and tails. In probability vocabulary, these two outcomes are referred to as success and failure: success applies to the outcome of interest and failure to the alternative outcome. As such, if we are interested in the proportion of heads, then the outcome of heads is considered a success.
Then, this concept of a single proportion is expanded to make inferences for larger sample sizes.
Expanding on base R, packages provide specific features that generally result in code that is not only easier to read, but also more efficient.
To access the package required for Chapter 1,
1) Install. Locate the console, which is the bottom left pane of your R program. In the console, copy and paste the code below (remove the leading # to run it). To copy the code, click on the clipboard icon at the right end of the code cell.
Note, to install other packages in subsequent chapters, replace DescTools inside the parentheses with the name of the desired package.
#install.packages("DescTools")
2) Call the Library. Locate the source editor, which is the top left pane of your R program. Create a new code cell (code chunk) and run the code below.
library(DescTools)
There are four key concepts that are important to single proportion analysis: frequency, relative frequency or proportion, sample proportion, and population proportion.
Frequency: the number of successes and/or the number of failures in an experiment of independent trials with a binary outcome.
Suppose we flip a coin 100 times, where the binary outcomes of heads and tails are equally likely at each coin flip. Let's assign the outcome of heads as success and the outcome of tails as failure. What would be the frequencies for success and failure?
We generate 100 random independent trials to represent the 100 coin flips.
set.seed(12)
vals <- sample(
  c("Success", "Failure"),
  size = 100,
  replace = TRUE
)
table(vals)
vals
Failure Success
58 42
We have frequencies of 42 successes and 58 failures. Note, the sum of the frequencies equals the sample size.
Relative frequency (proportion): the frequency of each level divided by the sample size. The sum of the proportions across all levels equals 1.
What are the relative frequencies for success and failure?
table(vals) / length(vals)
vals
Failure Success
0.58 0.42
We have relative frequencies of 0.42 for success and 0.58 for failure.
Sample proportion: denoted p, as shown in (Equation 1), where x is the frequency of the level of interest and n is the sample size.
p = \frac{x}{n} \tag{1}
Population proportion: denoted \pi, as shown in (Equation 2), where N is the population size.
\pi = \frac{x}{N} \tag{2}
Using the Bernoulli distribution, we can describe a single trial with a binary outcome, such as one coin flip.
If X is a random variable for a single trial with a binary outcome, 1 or 0, then X is a Bernoulli random variable, where \pi represents the probability of success, as shown in (Equation 3).
X = \begin{cases} 1, & \text{with a probability of } \pi \\ 0, & \text{with a probability of } 1 - \pi \end{cases} \tag{3}
As such, the probability mass function (PMF) for the Bernoulli distribution is as shown in (Equation 4).
P(X=k) = \pi^{k} (1 - \pi )^{1 - k} \tag{4}
We say that X follows a Bernoulli distribution with parameter \pi, as denoted in (Equation 5).
X \sim \text{Ber} ( \pi ) \tag{5}
A Bernoulli random variable has a mean (expected value) equal to the probability of success, as shown in (Equation 6). Its variance is equal to the product of the probability of success and probability of failure, and its standard deviation is the square root of the variance, as shown in (Equation 7).
\text{E}(X) = \pi \tag{6}
\begin{align} &\sigma^{2}_{\text{Bernoulli}} = \pi (1 - \pi) \\ &\sigma_{\text{Bernoulli}} = \sqrt{\pi ( 1 - \pi )} \end{align} \tag{7}
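As a quick sketch (the success probability of 0.3 is an arbitrary choice for illustration), we can simulate Bernoulli draws with rbinom(size = 1) and confirm that the sample mean and variance are close to \pi and \pi(1 - \pi).

# Simulate 10000 Bernoulli trials with an arbitrary success probability
set.seed(12)
pi_success <- 0.3
x <- rbinom(n = 10000, size = 1, prob = pi_success)
mean(x)  # approximately pi = 0.3
var(x)   # approximately pi * (1 - pi) = 0.21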
Suppose that we have not just 1, but n independent trials, where each trial has the same success probability of \pi. For example, instead of one coin flip, we have an experiment of multiple coin flips, where the probability of success continues to be the probability of heads.
This experiment consists of the sum of n independent and identically distributed (iid) Bernoulli trials. Each trial is denoted by X_1, X_2, ..., X_n, where any given X_i has probability of success \pi, and Y denotes the random variable for the binomial distribution, as shown in (Equation 8).
Y = \sum_{i=1}^{n} X_{i} \tag{8}
The PMF for the binomial distribution is shown in (Equation 9), where k is the number of successes.
\begin{align} &P(Y=k) = \binom{n}{k} \pi^{k} (1 - \pi )^{n - k} \\ &P(Y=k)= \frac{n!}{k! (n-k)!} \pi^{k} (1 - \pi)^{n-k} \end{align} \tag{9}
We say that Y follows a binomial distribution with parameters n (sample size) and \pi (probability of success), as denoted in (Equation 10). Y \sim \text{Bin}(n, \pi) \tag{10} The mean (expected value) of the binomial distribution is shown in (Equation 11), and the variance and standard deviation in (Equation 12).
\text{E}(Y) = n \pi \tag{11}
\begin{align} &\sigma^{2}_{\text{binomial}} = n \pi (1 - \pi) \\ &\sigma_{\text{binomial}} = \sqrt{n \pi ( 1 - \pi )} \end{align} \tag{12}
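As a quick check (a sketch; the values n = 10, \pi = 0.2, and k = 3 are arbitrary choices), we can compute the PMF in (Equation 9) with choose() and compare it to dbinom(), and recover the mean n\pi as the weighted sum of the PMF over all outcomes.

n_trials <- 10
pi_success <- 0.2
# PMF from the factorial formula matches dbinom()
choose(n_trials, 3) * pi_success^3 * (1 - pi_success)^(n_trials - 3)
dbinom(x = 3, size = n_trials, prob = pi_success)
# The mean n * pi recovered as the weighted sum of the PMF
sum(0:n_trials * dbinom(0:n_trials, size = n_trials, prob = pi_success))
n_trials * pi_success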
As an illustrative example, consider that 7% of patients who have a prostatectomy (surgical removal of the prostate) develop urinary retention that requires the (re)placement of a catheter. What is the probability that 3 of 7 patients having undergone a prostatectomy will develop urinary retention? We derive the result in (Equation 13), where we start by defining the random variable as a binomial random variable with parameters n=7 and \pi = 0.07.
\begin{align} &\text{Given } Y \sim \text{Bin}(n, \pi), \\ &P(Y=y) = \frac{n!}{y!(n - y)!} \pi^{y} (1 - \pi)^{n-y} \\ &\text{If } n = 7, \ y = 3, \text{ then } Y \sim \text{Bin}(7, 0.07) \\ &P(Y=3) = \frac{7!}{3!(7 - 3)!} 0.07^{3} (1 - 0.07)^{7-3} \approx 0.009 \end{align} \tag{13}
round(dbinom(x = 3, size = 7, prob = 0.07), digits = 3)
[1] 0.009
As the sample size increases, the binomial distribution approximates a normal distribution with mean n \pi and variance n \pi (1 - \pi). Examine the bar plots below with the same \pi = 0.2, but different sample sizes of 10 and 100, as shown in (Figure 2).
par(mfrow = c(1, 2))

barplot(
  dbinom(
    seq(from = 0, to = 10, by = 1),
    size = 10,
    prob = 0.2
  )
)

barplot(
  dbinom(
    seq(from = 0, to = 100, by = 1),
    size = 100,
    prob = 0.2
  )
)
We say that X follows a normal distribution with two parameters, mean (\mu) and variance (\sigma^2) as denoted in (Equation 14).
\begin{align}X \sim N(\mu, \sigma^2) \end{align} \tag{14}
As a continuous distribution, the normal distribution has a probability density function (PDF) as shown in (Equation 15).
\begin{align} f_X(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} \end{align} \tag{15}
Similar to the PMF, the PDF describes the probability distribution over the values x that the random variable X can take on. Unlike the PMF, however, the PDF applies to continuous random variables.
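As a small illustration (a sketch; the values \mu = 10, \sigma^{2} = 8, and x = 5 anticipate the PTB example below), we can evaluate (Equation 15) by hand and compare it to R's dnorm(). Note that pi in this code is R's built-in mathematical constant, not a proportion.

mu <- 10
sigma <- sqrt(8)
x <- 5
# Equation 15 evaluated directly; pi is the constant 3.14159...
(1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma)^2)
dnorm(x, mean = mu, sd = sigma)  # same value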
As sample size increases, calculating probability by finding the sum of multiple discrete values with the binomial distribution becomes inefficient. The normal distribution allows for probability estimation based on area under the curve.
If the following two properties are satisfied, we use the normal distribution as an approximation for the binomial distribution: n \pi \ge 5 and n (1 - \pi) \ge 5.
Consider taking a sample of 50 adults from a rural setting in a certain country, where each adult in this setting has a 20% chance of having pulmonary tuberculosis (PTB). What is the probability that fewer than 6 of the 50 have PTB?
Our random variable, Y, is the number of people in the sample of 50 that have PTB. We require P(Y \le 5) and know that Y \sim \text{Bin}(n=50, \ \pi = 0.2).
First, we check that the properties for the use of the normal approximation are met. Since 50 \times 0.2 \ge 5 and 50 \times (1 - 0.2) \ge 5, we proceed to use the normal approximation.
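A one-line check of the two conditions (a small sketch, not part of the worked example):

# Both n * pi and n * (1 - pi) should be at least 5
c(50 * 0.2, 50 * (1 - 0.2))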
Now, for these binomial parameters, we have \text{E}(Y) = \mu = 50(0.2) = 10, and the variance is \sigma^{2} = 50(0.2)(1-0.2) = 8. We can write Y \sim \text{N}(\mu = 10, \sigma^{2} = 8).
The result is now much simpler compared to the many terms using the binomial distribution. The derivation is shown in (Equation 16), where we use the Z transformation.
\begin{align} P(Y \le 5) &= P\left(\frac{Y - \mu}{\sigma} \le \frac{5 - 10}{\sqrt{8}}\right) \\ &= P(Z \le -1.767...) \\ &\approx 0.039 \end{align} \tag{16}
The pnorm function confirms the result.
z_stat <- (5 - 10) / sqrt(8)
pnorm(z_stat)
[1] 0.03854994
We compare the result to the exact binomial calculation using the pbinom function.
round(pbinom(q = 5, size = 50, prob = 0.2), digits = 3)
[1] 0.048
The normal approximation yields similar results to the exact binomial calculation. The exact binomial calculation is more accurate and can be computed readily with computers. However, the advantage of the normal approximation is that it allows us to conduct inferential analysis (e.g., confidence intervals and hypothesis testing).
We often do not know the population proportion, \pi, unless we have information from an entire population. As such, we can take a representative sample to answer a question about a suspected value \pi = \pi_{0}.
This time, consider drawing a sample of adults from an area with a high proportion of PTB. Researchers might consider that each individual has a probability of testing positive for PTB of 0.39. The task is to estimate the population proportion, \pi, of testing positive for PTB.
We have seen that the sample proportion, p, is an estimate for the population proportion, \pi. This time we sample 2306 random adults from this region with a high PTB prevalence.
We create a sample of 2306 such adults, where we place the probability of success at 0.39. Then, we generate a table of frequencies for the variable with two levels, PTB and No PTB.
set.seed(12)
ptb <- sample(
  c("PTB", "No PTB"),
  size = 2306,
  replace = TRUE,
  prob = c(0.39, 1 - 0.39)
)

dfr <- data.frame(
  PTB = factor(ptb, levels = c("PTB", "No PTB"))
)

table(dfr$PTB)
PTB No PTB
897 1409
We have calculated the sample proportion of PTB as an estimate for the population proportion of PTB, as shown in (Equation 17).
p = \frac{x}{n} = \frac{897}{2306} = 0.389 \approx \hat{\pi} \tag{17}
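Equivalently (a small sketch, not part of the worked example), the base-R prop.table function converts the frequency table to relative frequencies, returning p directly.

# Relative frequencies of the two levels; the PTB entry is p
prop.table(table(dfr$PTB))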
Additionally, we can use maximum likelihood estimation (MLE) to justify the use of p to estimate \pi.
In general, MLE estimates the parameters of a model such that the parameters maximize the likelihood of the observed data.
Let's find the MLE:
1. Specify the parameter(s) to estimate based on the data.
The chance of selecting a participant testing positive for PTB is \pi.
2. Determine the likelihood function.
The likelihood function expresses the probability of the observed data as a function of the parameter(s) of interest.
For our random sample of n=2306 Bernoulli random variables, the likelihood function (\mathscr{L}(\hat{\pi})) represents the probability of observing X_1,X_2, ..., X_{2306} as a function of \pi, as shown in (Equation 18). Note that the likelihood function is the joint distribution of the n=2306 Bernoulli random variables.
\mathscr{L}(\hat{\pi}) = \hat{\pi}^{\sum_{i}{x_i}}(1-\hat{\pi})^{n-{\sum_{i}{x_i}}}\\ \rightarrow \frac{2306!}{897! \times 1409!} \hat{\pi}^{897} (1 - \hat{\pi})^{2306-897} \tag{18}
3. Find the value(s) of the parameter(s) that makes the probability of seeing the observed data as large as possible.
For our PTB example problem, we can find the value of \hat{\pi} that maximizes \mathscr{L}(\hat{\pi}) with calculus. We take the first derivative of \mathscr{L} with respect to \hat{\pi}, set the result to 0, and solve for \hat{\pi}. Note, \hat{\pi}=p=0.389.
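Working with the log-likelihood, which has the same maximizer, a sketch of that derivation is:
\begin{align} &\ln \mathscr{L}(\hat{\pi}) = \left(\sum_{i} x_i\right) \ln \hat{\pi} + \left(n - \sum_{i} x_i\right) \ln (1 - \hat{\pi}) + \text{constant} \\ &\frac{d \ln \mathscr{L}}{d \hat{\pi}} = \frac{\sum_{i} x_i}{\hat{\pi}} - \frac{n - \sum_{i} x_i}{1 - \hat{\pi}} = 0 \\ &\hat{\pi} = \frac{\sum_{i} x_i}{n} = \frac{897}{2306} \approx 0.389 \end{align}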
Below, we generate a user-defined function called likeli.mle to output a plot of the likelihood function of a binomial distribution, annotated with a vertical orange line through the MLE, \hat{\pi}.
likeli.mle <- function(k, n) {
  p <- seq(0, 1, by = 0.001)
  likelihood <- function(p) dbinom(k, n, p)

  likelihood_values <- sapply(p, likelihood)
  mle_index <- which.max(likelihood_values)
  mle <- p[mle_index]

  plot(p, likelihood(p),
       type = "l",
       xlab = expression(pi),
       ylab = "Likelihood")
  abline(v = mle, col = "orange")
  text(x = 0.6, y = 0.01, labels = paste("MLE ≈", round(mle, 3)))
}
likeli.mle(897, 2306)
Our researchers considered a null hypothesis H_{0}: \, \pi = \pi_{0} = 0.4, where \pi is the true population probability of PTB. The alternative hypothesis is then H_{1}: \; \pi \ne 0.4.
From Y \sim \text{Bin}(2306, \pi), we have that Y \sim \text{N}(\mu = n\pi, \sigma^{2} = n\pi(1 - \pi)) using the normal approximation. The conversion calculation is shown in (Equation 19), where \pi_{0} is the population proportion under the null hypothesis and p is the sample proportion of successes as indicated.
Z = \frac{p - \pi_{0}}{\sqrt{\frac{\pi_{0} (1 - \pi_{0})}{n}}} , \; p = \frac{x}{n} \tag{19}
We calculate the z statistic below, where s is the number of successes, n is the sample size, p is the sample proportion of success, and pi_0 is the population proportion under the null hypothesis.
s <- 897
n <- 2306
p <- s / n
pi_0 <- 0.4

z_score <- (p - pi_0) / sqrt((pi_0 * (1 - pi_0)) / n)
z_score
[1] -1.079688
Then, we calculate the p value for this z score given a two-tailed alternative hypothesis.
pnorm(z_score) * 2
[1] 0.280281
For \alpha = 0.05, we fail to reject the null hypothesis: the data do not provide evidence that the population proportion differs from 0.4.
The code chunk below shows the critical z value for \alpha = 0.05 and a two-sided alternative hypothesis.
z_crit_lower <- qnorm(0.025)
z_crit_lower
[1] -1.959964
Since our test statistic value is not less than the critical value, we fail to reject the null hypothesis using the critical value approach as well.
Alternatively, the prop.test function performs the same calculation. The function uses the X^{2} test statistic, where Z^{2} \sim \chi^{2}(\text{df} = 1). Parameters for this function include the number of successes (x), the sample size (n), the probability of success under the null hypothesis (p), the confidence level (conf.level), and whether to apply a continuity correction (correct). The reported confidence interval is based on the score test estimate of the standard error (see Equation 20).
prop.test(
  x = s,
  n = n,
  p = 0.40,
  conf.level = 0.95,
  correct = FALSE
)
1-sample proportions test without continuity correction
data: s out of n, null probability 0.4
X-squared = 1.1657, df = 1, p-value = 0.2803
alternative hypothesis: true p is not equal to 0.4
95 percent confidence interval:
0.3692875 0.4090523
sample estimates:
p
0.3889853
Other than hypothesis testing, we can also use the normal approximation to compute confidence intervals (CIs). The CIs here are based on the score test (as performed by the prop.test function). The standard error for the score test is shown in (Equation 20).
\begin{align} &p \pm z_{\frac{\alpha}{2}} \times \widehat{\text{SE}}(\pi_{0}), \\ &\text{where } \widehat{\text{SE}}(\pi_{0}) = \sqrt{ \frac{\pi_{0}(1 - \pi_{0})}{n}} \end{align} \tag{20}
Using the score test standard error, we calculate the lower and upper bounds of the 95% CI below.
se_score_test <- sqrt((pi_0 * (1 - pi_0)) / n)
lower <- p - (-z_crit_lower) * se_score_test
upper <- p + (-z_crit_lower) * se_score_test
round(lower, digits = 3)
[1] 0.369
round(upper, digits = 3)
[1] 0.409
We can also calculate the Wald CIs. If Y \sim \text{Bin}(n, \pi) and the properties of the normal approximation hold, then the 100( 1 − \alpha )% CI for \pi is given as shown in (Equation 21) for the sample proportion, p.
p \pm z_{\frac{\alpha}{2}} \widehat{\text{SE}} (p) \tag{21}
Here \widehat{\text{SE}}(p) is the standard error of the sample proportion, as shown in (Equation 22).
\widehat{\text{SE}}(p) = \sqrt{\frac{p(1 - p)}{n}} \tag{22}
Given that we can use the normal approximation, we calculate the lower and upper bounds for a 95% CI as demonstrated below.
se_p <- sqrt(p * (1 - p) / n)

lower <- p - (-z_crit_lower) * se_p
upper <- p + (-z_crit_lower) * se_p
lower
[1] 0.3690872
upper
[1] 0.4088833
The BinomCI function in the DescTools library also calculates the confidence interval.
DescTools::BinomCI(
  x = s,
  n = n,
  conf.level = 0.95,
  method = "wald"
)
est lwr.ci upr.ci
[1,] 0.3889853 0.3690872 0.4088833
In conclusion, we are 95% confident that the population proportion of PTB is between 0.369 and 0.409.
In the case that the normal approximation assumptions are not satisfied, we apply the exact binomial test.
Recall that the sample frequency is 897, so the sample proportion p is smaller than \pi_{0} = 0.4.
The frequency corresponding to \pi_{0} = 0.4 is calculated below.
null_freq <- 0.4 * n
null_freq
[1] 922.4
Since we are dealing with discrete random variables, we round the null hypothesis frequency to 922. The difference between the frequency under the null hypothesis and the sample frequency is 922 - 897 = 25. Note that the null hypothesis frequency is larger than the sample frequency.
We subtract the difference from the null hypothesis frequency to find 922 - 25 = 897, which implies that we require 25 fewer successes than the null hypothesis frequency of 922. Then, we add the difference to the null hypothesis frequency, 922 + 25 = 947, which implies that we require 25 more successes than the null hypothesis frequency.
For the exact binomial test, the p-value is calculated as P(Y \le 897) + P(Y \ge 947), using the exact null distribution Y \sim \text{Bin}(2306, 0.40).
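As a rough check (a sketch based on the two tail frequencies above; binom.test computes the exact two-sided p-value, which may differ slightly from this sum), we can add the two tails with pbinom:

# P(Y <= 897) plus P(Y >= 947) = 1 - P(Y <= 946), under Bin(2306, 0.4)
pbinom(897, size = 2306, prob = 0.4) +
  (1 - pbinom(946, size = 2306, prob = 0.4))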
The binom.test function performs the calculation for us.
binom.test(
  x = s,
  n = n,
  p = 0.40,
  conf.level = 0.95
)
Exact binomial test
data: s and n
number of successes = 897, number of trials = 2306, p-value = 0.2879
alternative hypothesis: true probability of success is not equal to 0.4
95 percent confidence interval:
0.369022 0.409231
sample estimates:
probability of success
0.3889853
We can also obtain the Clopper-Pearson (exact) confidence interval with the BinomCI function.
DescTools::BinomCI(
  x = s,
  n = n,
  conf.level = 0.95,
  method = "clopper-pearson"
)
est lwr.ci upr.ci
[1,] 0.3889853 0.369022 0.409231
In conclusion, we are 95% confident that the population proportion of PTB is between 0.369 and 0.409.
Which scale of measurement is most appropriate for the following variables? State either nominal or ordinal.
A small pilot study is conducted to test a new therapy. Either patients recover or they do not recover. Seven patients are enrolled and it is believed that there is a 20% chance of recovery. Use each method below to find the probability that five or more patients recover.
Compute the probability using the exact method. Do the calculation by hand and confirm using the dbinom function in R.
Compute the probability using a normal approximation. Use the pnorm function in R.
Should we have expected the normal approximation in part (2) to be close to the exact calculation in part (1)?
Suppose that we observed five out of 40 subjects lost to follow up in the first month of an observational study. Assume that the number of subjects lost to follow up follows a binomial distribution with n = 40 and unknown population proportion \pi.
Construct and interpret a 95% Wald confidence interval for \pi. Do the calculation by hand and confirm using R.
Use R to compute an exact 95% confidence interval for \pi. Interpret this interval.
A longitudinal study of 50 randomly selected patients requires those patients to follow up with investigators every month for assessments. In the first month, 12 of the 50 subjects are lost to follow up. Suppose that the probability of being lost to follow up in the first month is the same for all subjects.
What is the probability distribution for the number of subjects lost to follow up in the first month of this study?
Based on the results of the first month, what is the maximum likelihood estimate of the proportion of subjects lost to follow up in the first month?
Compute a 95% Wald confidence interval, by hand and by using R, for the population proportion of subjects lost to follow up in the first month. Interpret the resulting confidence interval.
Use R to compute a 95% score confidence interval for the population proportion of subjects lost to follow up in the first month. Interpret the resulting confidence interval.
Use the confidence interval from part (4) to reject or fail to reject the null hypothesis H_{0}: \; \pi = 0.25, using a 5% significance level.
Out of 20 randomly selected subjects who worked in demolitions of old buildings, 15 were accidentally exposed to lead during their first year on the job. Test the null hypothesis that 30% of workers in demolitions of old buildings are accidentally exposed to lead. Use a 2% level of significance.
Is it reasonable to conduct a score test? Regardless of your answer, conduct the relevant score hypothesis test and compute the corresponding score confidence interval. Interpret the interval.
Conduct the relevant exact hypothesis test and compute the corresponding exact confidence interval. Interpret the interval.