Chapter 1

Author
Affiliation
Juan H Klopper (MD), Do Hee Lee (MPH)

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

In this chapter, we develop the intuition for inference using single proportions. For a single proportion, we consider an experiment of independent trials, where each trial has two outcomes. For example, in an experiment of multiple coin flips, our two outcomes are heads and tails. In terms of probability vocabulary, these two outcomes are referred to as success and failure. Note, success applies to the outcome of interest and failure applies to the alternative outcome. As such, if we are interested in the proportion of heads, then the outcome of heads would be considered success.

Then, this concept of a single proportion is expanded to make inferences for larger sample sizes.

Section 3. Proportions: frequency, relative frequency (proportion), sample proportion, population proportion
Section 4. Distributions to Describe Proportions: Bernoulli distribution, probability mass function (PMF), binomial distribution
Section 5. Normal Approximation to the Binomial Distribution: normal distribution, probability density function (PDF), maximum likelihood estimate (MLE), likelihood function, exact binomial test

2 Libraries

Packages expand on base R with specific features, generally resulting in code that is not only easier to read, but also more efficient.

To access the package required for Chapter 1,

1) Install. Locate the console, which is the bottom left pane of your R program. Copy and paste the code below into the console. To access the code below, click on the Code icon. Then, click on the clipboard icon at the right end of the code cell to copy the code.

Note, to install other packages in subsequent chapters, replace the DescTools inside the parentheses with the name of the desired package.

Code
#install.packages("DescTools")

2) Call the Library. Locate the source editor, which is the top left pane of your R program. Create a new code cell and run the code below.

Code
library(DescTools)   

3 Proportions

There are four key concepts that are important to single proportion analysis: frequency, relative frequency or proportion, sample proportion, and population proportion.

Frequency

The number of successes and/or the number of failures in an experiment of independent trials with a binary outcome.

Suppose we flip a coin 100 times, where the binary outcomes of heads and tails are equally likely at each coin flip. Let’s assign the outcome of heads as success and outcome of tails as failure. What would be the frequencies for success and failure?

We generate 100 random independent trials to represent the 100 coin flips.

Code
set.seed(12)                   

vals <- sample(                
  c("Success", "Failure"),     
  size = 100,                  
  replace = TRUE               
)

table(vals)                    
vals
Failure Success 
     58      42 

We have frequencies of 42 successes and 58 failures. Note, the sum of the frequencies equals the sample size.

Relative Frequency or Proportion

The frequency of each level divided by the sample size. The sum of the proportions across all levels is equal to 1.

What are the relative frequencies for success and failure?

Code
table(vals) / length(vals)  
vals
Failure Success 
   0.58    0.42 

We have relative frequencies of 0.42 for success and 0.58 for failure.

Sample Proportion

Denoted in (Equation 1), where x is the frequency of the level of interest and n is the sample size.

p = \frac{x}{n} \tag{1}

Population Proportion

Denoted in (Equation 2), where N is the population size.

\pi = \frac{x}{N} \tag{2}
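The arithmetic is identical in both cases; only the meaning of the denominator changes. A minimal sketch in R, using made-up numbers:

```r
# Sketch (hypothetical numbers): the same x / n arithmetic gives a sample
# proportion p, or a population proportion pi when N is an entire population.
x_sample <- 42          # successes in a sample of n = 100
n <- 100
p <- x_sample / n       # sample proportion

x_pop <- 4200           # successes in a hypothetical population of N = 10000
N <- 10000
pi_pop <- x_pop / N     # population proportion

p        # 0.42
pi_pop   # 0.42
```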

4 Distributions to Describe Proportions

Using the Bernoulli distribution, we can describe a single trial with a binary outcome, such as one coin flip.

Bernoulli Distribution

If X is a random variable for a single trial with a binary outcome, 1 or 0, then X is a Bernoulli random variable, where \pi represents the probability of success, as shown in (Equation 3).

X = \begin{cases} &1, \; \text{ with a probability of } \pi \\ &0, \; \text{ with a probability of } 1 - \pi \end{cases} \tag{3}

As such, the probability mass function (PMF) for the Bernoulli distribution is as shown in (Equation 4).

P(X=k) = \pi^{k} (1 - \pi )^{1 - k} \tag{4}

The PMF is the probability distribution of a discrete random variable: the PMF of X gives the probability of each possible value, k, that X can take on.

For the Bernoulli distribution, the only possible values of k are either 0 or 1. As such, when k=1 the probability of success is \pi. Then, when k=0 the probability of failure is 1-\pi.
P(X=1)= \pi^{1} (1 - \pi )^{1 - 1}= \pi \\ P(X=0)= \pi^{0} (1 - \pi )^{1 - 0}= 1-\pi
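As a quick sanity check (a sketch with an arbitrary probability of success of 0.3), the PMF in Equation 4 agrees with R's dbinom function when size = 1:

```r
# Sketch: the Bernoulli PMF in Equation 4, written by hand, matches
# dbinom() with size = 1. The value 0.3 is an arbitrary example.
pi_hat <- 0.3
bern_pmf <- function(k, pi) pi^k * (1 - pi)^(1 - k)

bern_pmf(1, pi_hat)                            # 0.3
bern_pmf(0, pi_hat)                            # 0.7
dbinom(x = c(0, 1), size = 1, prob = pi_hat)   # 0.7 0.3
```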

We say that X follows a Bernoulli distribution with parameter \pi, as denoted in (Equation 5).

X \sim \text{Ber} ( \pi ) \tag{5}

A Bernoulli random variable has a mean (expected value) equal to the probability of success, as shown in (Equation 6). Its variance is equal to the product of the probability of success and probability of failure, and its standard deviation is the square root of the variance, as shown in (Equation 7).

\text{E}(X) = \pi \tag{6}

\begin{align} &\sigma^{2}_{\text{Bernoulli}} = \pi (1 - \pi) \\ &\sigma_{\text{Bernoulli}} = \sqrt{\pi ( 1 - \pi )} \end{align} \tag{7}

Let’s explore the relationship between mean and variance for the Bernoulli distribution.

We create a data frame object with different paired combinations of probabilities for success and failure.

Code
success <- seq(0, to = 1, by = 0.1)         
failure <- (1 - seq(0, to = 1, by = 0.1))   

var_Ber <- data.frame(   
  Success = success,   
  Failure = failure,  
  Variance = success * failure   
)   

var_Ber   
   Success Failure Variance
1      0.0     1.0     0.00
2      0.1     0.9     0.09
3      0.2     0.8     0.16
4      0.3     0.7     0.21
5      0.4     0.6     0.24
6      0.5     0.5     0.25
7      0.6     0.4     0.24
8      0.7     0.3     0.21
9      0.8     0.2     0.16
10     0.9     0.1     0.09
11     1.0     0.0     0.00

Note, the highest variance of 0.25 is achieved when \pi=0.5.

Binomial Distribution

Suppose that we have not just 1, but n independent trials, where each trial has the same success probability of \pi. For example, instead of one coin flip, we have an experiment of multiple coin flips, where the probability of success continues to be the probability of heads.

This experiment consists of the sum of n independent and identically distributed (iid) Bernoulli trials. Each of the n trials is denoted by X_1, X_2, ..., X_n, where any given X_i has probability of success \pi, and Y denotes the random variable for the binomial distribution, as shown in (Equation 8).

Y = \sum_{i=1}^{n} X_{i} \tag{8}

The following properties must be satisfied for a random variable to be considered a binomial random variable:

  1. A fixed number of trials, n
  2. Independent trials
  3. Binary outcome, success and failure
  4. Fixed probability of success for each trial
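The sum construction in Equation 8 can be sketched by simulation; the seed, n = 10, and probability of 0.5 below are arbitrary choices:

```r
# Sketch: a binomial random variable is the sum of n iid Bernoulli trials
# (Equation 8). Both constructions give simulated means close to n * prob.
set.seed(1)            # arbitrary seed, for reproducibility
n <- 10
prob <- 0.5

# 10000 replicates of Y = X_1 + ... + X_n, each X_i ~ Ber(prob)
y_sum <- replicate(10000, sum(rbinom(n, size = 1, prob = prob)))
# 10000 draws taken directly from Bin(n, prob)
y_bin <- rbinom(10000, size = n, prob = prob)

mean(y_sum)   # close to n * prob = 5
mean(y_bin)   # close to n * prob = 5
```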

The PMF for the binomial distribution is shown in (Equation 9), where k is the number of successes.

\begin{align} &P(Y=k) = \binom{n}{k} \pi^{k} (1 - \pi )^{n - k} \\ &P(Y=k)= \frac{n!}{k! (n-k)!} \pi^{k} (1 - \pi)^{n-k} \end{align} \tag{9}

Suppose in our experiment, we flip a coin 3 times and we consider the outcome of heads as success. Since the probability of heads and tails are equally likely, the probability of success is \frac{1}{2} and the probability of failure is \frac{1}{2} for each coin flip. But, what would be the probability of obtaining k=2 heads from the n=3 total coin flips?

First, we consider one particular ordering of k=2 heads and 1 tail from the n=3 total coin flips. Since each trial is a Bernoulli random variable, heads occurs with probability \pi and tails with probability 1-\pi. Since the trials are independent, the probability of flipping 2 heads and 1 tail in a given order is \pi^2(1-\pi), as illustrated below (Figure 1).

Figure 1: 2 Heads and 1 Tail

Second, we consider all the different combinations of 2 heads and 1 tail as demonstrated in (Table 1).

Table 1: Combinations of 2 Heads and 1 Tail
Scenario 1: Heads, Heads, Tails
Scenario 2: Heads, Tails, Heads
Scenario 3: Tails, Heads, Heads

Note, this is equivalent to applying the combination formula C(n, k)={n\choose k}=\frac{n!}{k!(n-k)!}.

Therefore, the probability of obtaining k=2 heads from the n=3 trials is calculated as the product of this binomial coefficient and the Bernoulli PMF for the n trials.

P(Y=2)= {3 \choose 2}\pi^2(1-\pi)^{3-2} = 3 \left( \frac{1}{2} \right)^{2} \left( \frac{1}{2} \right) = \frac{3}{8}
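We can check the probability of 2 heads in 3 fair coin flips in R:

```r
# Quick check of the hand calculation: P(Y = 2) heads in n = 3 fair coin
# flips, first from the formula and then with dbinom().
choose(3, 2) * 0.5^2 * (1 - 0.5)^(3 - 2)   # 0.375
dbinom(x = 2, size = 3, prob = 0.5)        # 0.375
```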

We say that Y follows a binomial distribution with parameters n (sample size) and \pi (probability of success), as denoted in (Equation 10).

Y \sim \text{Bin}(n, \pi) \tag{10}

The mean (expected value) of the binomial distribution is shown in (Equation 11), and the variance and standard deviation in (Equation 12).

\text{E}(Y) = n \pi \tag{11}

\begin{align} &\sigma^{2}_{\text{binomial}} = n \pi (1 - \pi) \\ &\sigma_{\text{binomial}} = \sqrt{n \pi ( 1 - \pi )} \end{align} \tag{12}
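A short simulation sketch (with arbitrary n = 20 and probability of success of 0.3) illustrates Equations 11 and 12:

```r
# Sketch: simulated mean and variance of Bin(n, prob) compared against the
# formulas in Equations 11 and 12 (n = 20 and prob = 0.3 are arbitrary).
set.seed(12)
n <- 20
prob <- 0.3
y <- rbinom(100000, size = n, prob = prob)

mean(y)   # close to n * prob = 6
var(y)    # close to n * prob * (1 - prob) = 4.2
```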

4.1 Example

As an illustrative example, consider that 7% of patients who have a prostatectomy (surgical removal of the prostate) develop urinary retention, which requires the (re)placement of a catheter. What is the probability that 3 of 7 patients having undergone a prostatectomy will develop urinary retention? We derive the result in (Equation 13), where we start by defining the random variable as a binomial random variable with parameters n=7 and \pi = 0.07.

\begin{align} &\text{Given } Y \sim \text{Bin}(n, \pi), \\ &P(Y=y) = \frac{n!}{y! (n - y )!} \pi^{y} (1 - \pi)^{n-y} \\ &\text{If } n= 7, \ y=3, \text{ then } Y \sim \text{Bin}(7, 0.07) \\ &P(Y=3) = \frac{7!}{3! (7 - 3)!} 0.07^3 (1- 0.07)^{7-3} \approx 0.009 \end{align} \tag{13}

Code
round(dbinom(   
  x = 3,        
  size = 7,     
  prob = 0.07), 
  digits = 3)   
[1] 0.009

5 Normal Approximation to the Binomial Distribution

As the sample size increases, the binomial distribution approximates a normal distribution with mean n \pi and variance n \pi (1 - \pi). Examine the bar plots below, which use the same \pi = 0.2 but different sample sizes of 10 and 100, as shown in (Figure 2).

Code
par(mfrow = c(1, 2))   

barplot(   
  dbinom(   
    seq(from = 0, to = 10, by = 1),   
    size = 10,  
    prob = 0.2   
  )   
)   

barplot(   
  dbinom(   
    seq(from = 0, to = 100, by = 1),   
    size = 100,   
    prob = 0.2   
  )   
)   

Figure 2: Binomial to Normal Distribution

We say that X follows a normal distribution with two parameters, mean (\mu) and variance (\sigma^2) as denoted in (Equation 14).

\begin{align}X \sim N(\mu, \sigma^2) \end{align} \tag{14}

As a continuous distribution, the normal distribution has a probability density function (PDF) as shown in (Equation 15).

\begin{align} f_X(x)=\frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} \end{align} \tag{15}

Similar to the PMF, the PDF describes a probability distribution over the values of x that the random variable X can take on. Unlike the PMF, however, the PDF applies to continuous random variables and gives a density rather than a probability: probabilities are obtained as areas under the curve over intervals of x.
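As a sketch, evaluating Equation 15 directly matches R's built-in dnorm function (the values of x, \mu, and \sigma below are arbitrary examples):

```r
# Sketch: Equation 15 coded by hand agrees with R's built-in dnorm().
# mu = 0, sigma = 1, and x = 1.5 are arbitrary example values.
normal_pdf <- function(x, mu, sigma) {
  (1 / (sigma * sqrt(2 * pi))) * exp(-0.5 * ((x - mu) / sigma)^2)
}

normal_pdf(1.5, mu = 0, sigma = 1)
dnorm(1.5, mean = 0, sd = 1)
```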

As sample size increases, calculating probability by finding the sum of multiple discrete values with the binomial distribution becomes inefficient. The normal distribution allows for probability estimation based on area under the curve.

If the following two properties are satisfied, we use the normal distribution as an approximation for the binomial distribution:

  1. Expected number of successes, n \pi \ge 5
  2. Expected number of failures, n (1 - \pi) \ge 5
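These two checks can be wrapped in a small helper function; the function name below is our own, not from any package:

```r
# A small helper (not from the text) that checks both normal-approximation
# conditions at once for a given n and probability of success.
normal_approx_ok <- function(n, prob) {
  n * prob >= 5 && n * (1 - prob) >= 5
}

normal_approx_ok(n = 50, prob = 0.2)   # TRUE: 10 successes, 40 failures expected
normal_approx_ok(n = 10, prob = 0.2)   # FALSE: only 2 successes expected
```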

5.1 Example

Consider taking a sample of 50 adults from a rural setting in a certain country, where each adult in this setting has a 20% chance of having pulmonary tuberculosis (PTB). What is the probability that fewer than 6 of the 50 have PTB?

Our random variable, Y, is the number of people in the sample of 50 that have PTB. We require the P(Y \le 5) and know that Y \sim \text{Bin}(n=50, \ \pi = 0.2).

First, we check that the properties for the use of the normal approximation are met. Since 50 \times 0.2 \ge 5 and 50 \times (1 - 0.2) \ge 5, we proceed with the normal approximation.

Now, for these binomial parameters, we have \text{E}(Y) = \mu = 50(0.2) = 10. The variance is \sigma^{2} = 50(0.2)(1-0.2) = 8. We can therefore write Y \sim \text{N}(\mu = 10 , \sigma^{2} = 8).

The result is now much simpler compared to the many terms using the binomial distribution. The derivation is shown in (Equation 16), where we use Z transformation.

\begin{align} P(Y \le 5) &= P(\frac{Y - \mu}{\sigma} < \frac{5 - 10}{\sqrt{8}}) \\ &=P(Z < -1.767...) \\\\ & \approx 0.039 \end{align} \tag{16}

The pnorm function confirms the results.

Code
z_stat <- (5 - 10) / sqrt(8)
pnorm(z_stat)
[1] 0.03854994

We compare the result to the exact binomial calculation using the pbinom function.

Code
round(pbinom(
  q = 5,
  size = 50,
  prob = 0.2), 
  digits = 3)
[1] 0.048

The normal approximation yields similar results to the exact binomial calculation. The exact binomial calculation is more accurate and can be computed readily with computers. However, the advantage of the normal approximation is that it allows us to conduct inferential analysis (e.g., confidence intervals and hypothesis testing).
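As an aside, a continuity correction (adding 0.5 to the cutoff before standardizing) is a common refinement of the normal approximation for discrete distributions; it is not used above, but here it moves the approximate value closer to the exact one:

```r
# Optional refinement: a continuity correction of 0.5 accounts for
# approximating the discrete P(Y <= 5) with a continuous distribution.
pnorm((5 + 0.5 - 10) / sqrt(8))        # approximately 0.056
pbinom(q = 5, size = 50, prob = 0.2)   # exact: approximately 0.048
```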

5.2 Inferences based on Normal Approximations

We often do not know the population proportion, \pi, unless we have information from the entire population. As such, we can take a representative sample to answer a question about a suspected value, \pi = \pi_{0}.

5.3 Example

This time, consider drawing a sample of adults from an area with a high proportion of PTB. Researchers might consider that each individual has a probability of 0.39 of testing positive for PTB. Estimate the population proportion, \pi, of testing positive for PTB.

We have seen that the sample proportion, p, is an estimate for the population proportion, \pi. This time we sample 2306 random adults from this region with a high PTB prevalence.

We create a sample of 2306 such adults, where we set the probability of success at 0.39. Then, we generate a table of frequencies for the factor with two levels, PTB and No PTB.

Code
set.seed(12)
ptb <- sample(
  c("PTB", "No PTB"),
  size = 2306,
  replace = TRUE,
  prob = c(0.39, 1 - 0.39)
)


dfr <- data.frame(
  PTB = factor(
    ptb,
    levels = c("PTB", "No PTB")
  )
)

table(dfr$PTB)

   PTB No PTB 
   897   1409 

We have calculated the sample proportion of PTB as an estimate for the population proportion of PTB, as shown in (Equation 17).

p = \frac{x}{n} = \frac{897}{2306} = 0.389 \approx \hat{\pi} \tag{17}

Additionally, we can use maximum likelihood estimation to justify the use of p to estimate \pi.

In general, the MLE is used to estimate the parameters of a model such that parameters maximize the likelihood of getting the observed data.

Let’s find the MLE:

1. Specify the parameter(s) to estimate based on the data.

The chance of selecting a participant testing positive for PTB is \pi.

2. Determine the likelihood function.

The likelihood function expresses the probability of the observed data as a function of the parameter(s) of interest.

For our random sample of n=2306 Bernoulli random variables, the likelihood function, \mathscr{L}(\pi), represents the probability of observing X_1, X_2, ..., X_{2306} as a function of \pi, as shown in (Equation 18). Note that the likelihood function is the joint distribution of the n=2306 Bernoulli random variables; the binomial coefficient is a constant in \pi and does not change where the maximum occurs.

\mathscr{L}(\pi) = \pi^{\sum_{i}{x_i}}(1-\pi)^{n-{\sum_{i}{x_i}}} \propto \frac{2306!}{897! \times 1409!} \pi^{897} (1 - \pi)^{1409} \tag{18}

3. Find the value(s) of the parameter(s) that makes the probability of seeing the observed data as large as possible.

For our PTB example problem, we can find the value of \pi that maximizes \mathscr{L}(\pi) with calculus. We take the first derivative of \mathscr{L} with respect to \pi, set the result to 0, and solve for \pi. The maximizing value is \hat{\pi}=p=0.389.
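In practice, the derivative is easier to take with the log-likelihood, which has the same maximizer; the steps fill in as follows.

```latex
\begin{align*}
\ell(\pi) &= \ln \mathscr{L}(\pi) = \sum_{i} x_i \ln \pi + \left( n - \sum_{i} x_i \right) \ln (1 - \pi) \\
\frac{d \ell}{d \pi} &= \frac{\sum_{i} x_i}{\pi} - \frac{n - \sum_{i} x_i}{1 - \pi} = 0 \\
\Rightarrow \hat{\pi} &= \frac{\sum_{i} x_i}{n} = \frac{897}{2306} \approx 0.389
\end{align*}
```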

Below, we generate a user-defined function called likeli.mle to output a plot of the likelihood function of a binomial distribution annotated with a vertical orange line through the MLE, \hat{\pi}.

Code
likeli.mle <- function(k, n) {
  p <- seq(0, 1, by = 0.001)                  # candidate values of pi
  likelihood <- function(p) dbinom(k, n, p)   # binomial likelihood
  
  likelihood_values <- sapply(p, likelihood)
  mle <- p[which.max(likelihood_values)]      # pi that maximizes the likelihood
  
  plot(p, likelihood_values,
       type = "l",
       xlab = expression(pi),
       ylab = "Likelihood")
  abline(v = mle, col = "orange")
  text(x = 0.6, y = 0.01,
       labels = paste("MLE =", round(mle, digits = 3)))
}

likeli.mle(897, 2306)

Figure 3: MLE

5.4 Example

Our researchers considered a null hypothesis H_{0}: \, \pi = \pi_{0}=0.4, where \pi is the true population probability of PTB. The alternative hypothesis is then H_{1}: \; \pi \ne 0.4.

From Y \sim \text{Bin}(2306 , \pi), we have that Y \sim \text{N}(\mu = n\pi , \sigma^{2} = n\pi(1 - \pi)) using the normal approximation. The conversion calculation is shown in (Equation 19), where \pi_{0} is the population proportion under the null hypothesis and p is the sample proportion of successes, as indicated.

Z = \frac{p - \pi_{0}}{\sqrt{\frac{\pi_{0} (1 - \pi_{0})}{n}}} , \; p = \frac{x}{n} \tag{19}

We calculate the z statistic below, where s is the number of successes, n is the sample size, p is the sample proportion of success, and pi_0 is the population proportion under the null hypothesis.

Code
s <- 897 
n <- 2306 
p <- s / n 
pi_0 <- 0.4 

z_score <- (p - pi_0) / sqrt((pi_0 * (1 - pi_0)) / n)
z_score
[1] -1.079688

Then, we calculate the p value for this z score given a two-tailed alternative hypothesis.

Code
pnorm(z_score) * 2
[1] 0.280281

For \alpha = 0.05, we fail to reject the null hypothesis and state that the data do not provide evidence that the population proportion differs from 0.4.

The code chunk below shows the critical z value for \alpha = 0.05 and a two-sided alternative hypothesis.

Code
z_crit_lower <- qnorm(0.025)
z_crit_lower
[1] -1.959964

Since our test statistic value is not less than the critical value, we fail to reject the null hypothesis using the critical value approach as well.

Alternatively, the prop.test function performs the same calculation. The function uses the X^{2} test statistic, where Z^{2} \sim \chi^{2}(\text{df}=1). Parameters for this function include the number of successes (x), the sample size (n), the null probability of success (p), the confidence level (conf.level), and the continuity correction (correct). The reported confidence interval is based on the score test estimate of the standard error (see Equation 20).

Code
prop.test(
  x = s, 
  n = n, 
  p = 0.40, 
  conf.level = 0.95,
  correct = FALSE
)

    1-sample proportions test without continuity correction

data:  s out of n, null probability 0.4
X-squared = 1.1657, df = 1, p-value = 0.2803
alternative hypothesis: true p is not equal to 0.4
95 percent confidence interval:
 0.3692875 0.4090523
sample estimates:
        p 
0.3889853 

Other than hypothesis testing, we can also use the normal approximation to compute confidence intervals (CIs). The CIs here are based on the score test (performed by the prop.test function). The score interval and its standard error are shown in (Equation 20).

\begin{align} &p \pm z_{\frac{\alpha}{2}} \times \widehat{\text{SE}}(\pi_{0}), \\ &\text{where } \widehat{\text{SE}}(\pi_{0}) = \sqrt{ \frac{\pi_{0}(1 - \pi_{0})}{n}} \end{align} \tag{20}

Using the Score test, we calculate the lower and upper bounds of the 95% CIs below.

Code
se_score_test <- sqrt((pi_0 * (1 - pi_0)) / n)
lower <- p - (-z_crit_lower) * se_score_test
upper <- p + (-z_crit_lower) * se_score_test
round(lower, digits = 3)
[1] 0.369
Code
round(upper, digits = 3)
[1] 0.409

We can also calculate the Wald CIs. If Y \sim \text{Bin}(n, \pi) and the properties of the normal approximation hold, then the 100( 1 − \alpha )% CI for \pi is given as shown in (Equation 21) for the sample proportion, p.

p \pm z_{\frac{\alpha}{2}} \widehat{\text{SE}} (p) \tag{21}

Here \widehat{\text{SE}}(p) is the standard error of the sample proportion as is shown in (Equation 22).

\widehat{\text{SE}}(p) = \sqrt\frac{p(1 - p)}{n} \tag{22}

Given that we can use the normal approximation, we calculate the lower and upper bounds for a 95% CI as demonstrated below.

Code
se_p <- sqrt(p * (1 - p) / n)

lower <- p - (-z_crit_lower) * se_p
upper <- p + (-z_crit_lower) * se_p

lower
[1] 0.3690872
Code
upper
[1] 0.4088833

The BinomCI function in the DescTools library also calculates the confidence intervals.

Code
DescTools::BinomCI(
  x = s, 
  n = n, 
  conf.level = 0.95,
  method = "wald"
)
           est    lwr.ci    upr.ci
[1,] 0.3889853 0.3690872 0.4088833

In conclusion, we are 95% confident that the population proportion of PTB is between 0.369 and 0.409.

5.5 Exact Binomial Test

In the case that the normal approximation assumptions are not satisfied, we apply the exact binomial test.

Suppose the sample frequency of successes is 897, so that the sample proportion p is smaller than \pi_{0}=0.4.

The frequency expected under \pi_{0}=0.4 for our sample size is calculated below.

Code
null_freq <- 0.4 * n
null_freq
[1] 922.4

Since we are dealing with discrete random variables, we round the null hypothesis frequency to 922. The difference between the frequency under the null hypothesis and the sample frequency is 922 - 897 = 25. Note that the null hypothesis frequency is larger than the sample frequency.

We subtract the difference from the null hypothesis frequency and find 922 - 25 = 897, that is, 25 fewer successes than the null hypothesis frequency of 922. Then, we add the difference to the null hypothesis frequency, 922 + 25 = 947, that is, 25 more successes than the null hypothesis frequency.

For the exact binomial test, the p value is calculated as P(Y \le 897) + P(Y \ge 947), using the exact null distribution Y \sim \text{Bin}(2306, 0.40).
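This two-tail sum can be sketched directly with pbinom. Note that binom.test defines its two-sided p value slightly differently (summing all outcomes at least as unlikely as the one observed), so the two results need not match exactly:

```r
# Sketch of the two-tail sum described above, computed with pbinom()
# under the exact null distribution Bin(2306, 0.40).
p_lower <- pbinom(q = 897, size = 2306, prob = 0.40)                       # P(Y <= 897)
p_upper <- pbinom(q = 946, size = 2306, prob = 0.40, lower.tail = FALSE)   # P(Y >= 947)
p_lower + p_upper
```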

The binom.test function performs the calculation for us.

Code
binom.test(
  x = s,
  n = n,
  p = 0.40,
  conf.level = 0.95
)

    Exact binomial test

data:  s and n
number of successes = 897, number of trials = 2306, p-value = 0.2879
alternative hypothesis: true probability of success is not equal to 0.4
95 percent confidence interval:
 0.369022 0.409231
sample estimates:
probability of success 
             0.3889853 

We can also obtain the confidence interval using the Clopper-Pearson method with the BinomCI function.

Code
DescTools::BinomCI(
  x = s, 
  n = n, 
  conf.level = 0.95,
  method = "clopper-pearson"
)
           est   lwr.ci   upr.ci
[1,] 0.3889853 0.369022 0.409231

In conclusion, we are 95% confident that the population proportion of PTB is between 0.369 and 0.409.

6 Lab Problems

6.1 Question

Which scale of measurement is most appropriate for the following variables? State either nominal or ordinal.

  1. Smoking (current smoker, previous smoker, never smoked)
  2. Highest degree obtained (none, high school, bachelor’s, master’s, doctorate)
  3. Patient hemodynamic status (normal, fluid responsive shock, fluid unresponsive shock)
  4. Facility location (Takoma Park, Vienna, Potomac)
  5. Favorite beverage (beer, juice, milk, soft drink, wine, other)
  6. How often a patient experiences nausea (never, occasionally, often, always)
  1. Nominal
  2. Ordinal
  3. Ordinal
  4. Nominal
  5. Nominal
  6. Ordinal

6.2 Question

A small pilot study is conducted to test a new therapy. Either patients recover or they do not recover. Seven patients are enrolled and it is believed that there is a 20% chance of recovery. Use each method below to find the probability that five or more patients recover.

  1. Compute the probability using the exact method. Do the calculation by hand and confirm using the dbinom function in R.

  2. Compute the probability using a normal approximation. Use the pnorm function in R.

  3. Should we have expected the normal approximation in part (2) to be close to the exact calculation in part (1)?

Let Y be the number of subjects who recover out of seven. We assume Y \sim \text{Bin}(7,0.2) and calculate.

\begin{align*} P(Y \ge 5) &= P(Y=5) + P(Y=6) + P(Y=7) \\ &= \frac{7!}{5!2!}{( 0.2)}^{5}{( 1-0.2 )}^{2} + \frac{7!}{6!1!}{( 0.2 )}^{6}{( 1-0.2)}^{1} + \frac{7!}{7!0!}{( 0.2)}^{7}{( 1-0.2)}^{0} \\\\ &= 0.004672 \end{align*}

Below are two ways to perform the calculations using R.

Code
dbinom(x = 5, size = 7, prob = 0.2) + dbinom(x = 6, size = 7, prob = 0.2) + dbinom(x = 7, size = 7, prob = 0.2)
[1] 0.004672
Code
sum(dbinom(x = 5:7, size = 7, prob = 0.2))
[1] 0.004672

Using the normal approximation, we have \text{Bin}(7,0.2) \approx \text{N}(\mu = 7 \times 0.2 = 1.4 , \ \sigma^{2} = 7 \times 0.2 \times 0.8 = 1.12), and we compute P(Y \ge 5).

\begin{align*} P(Y \ge 5) &= 1 - P(\frac{Y - \mu}{\sigma} < \frac{5 - 1.4}{\sqrt{1.12}}) \\ &= 1 - P(Z < 3.40168) \end{align*}

Below are two ways to perform the calculations using R.

Code
1 - pnorm(3.40168)
[1] 0.000334865
Code
pnorm(q = 5, mean = 1.4, sd = sqrt(1.12), lower.tail = FALSE)
[1] 0.0003348647

No, since we did not meet the assumptions for the use of the normal approximation.

n \pi = 7 \times 0.2 = 1.4 < 5

6.3 Question

Suppose that we observed five out of 40 subjects lost to follow up in the first month of an observational study. Assume that the number of subjects lost to follow up follows a binomial distribution with n = 40 and unknown population proportion \pi.

  1. Construct and interpret a 95% Wald confidence interval for \pi. Do the calculation by hand and confirm using R.

  2. Use R to compute an exact 95% confidence interval for \pi. Interpret this interval.

The 95% Wald CI for the true population proportion is shown below.

\frac{5}{40} \pm 1.96 \sqrt{\frac{\frac{5}{40} \left( 1-\frac{5}{40} \right)}{40}} = (0.0225 , 0.2275)

Code
DescTools::BinomCI(
  x = 5,
  n = 40,
  conf.level = 0.95,
  method = "wald"
)
       est     lwr.ci   upr.ci
[1,] 0.125 0.02251103 0.227489

We are 95% confident that the true proportion of those who will be lost to follow up is between 2.25% and 22.75%.

Below are two ways to perform the calculations using R.

Code
binom.test(
  x = 5,
  n = 40,
  conf.level = 0.95
)

    Exact binomial test

data:  5 and 40
number of successes = 5, number of trials = 40, p-value = 1.383e-06
alternative hypothesis: true probability of success is not equal to 0.5
95 percent confidence interval:
 0.04185963 0.26803292
sample estimates:
probability of success 
                 0.125 
Code
DescTools::BinomCI(
  x = 5,
  n = 40,
  conf.level = 0.95,
  method = "clopper-pearson"
)
       est     lwr.ci    upr.ci
[1,] 0.125 0.04185963 0.2680329

We are 95% confident that the true proportion of those who will be lost to follow up is between 4.19% and 26.80%.

6.4 Question

A longitudinal study of 50 randomly selected patients requires those patients to follow up with investigators every month for assessments. In the first month, 12 of the 50 subjects are lost to follow up. Suppose that the probability of being lost to follow up in the first month is the same for all subjects.

  1. What is the probability distribution for the number of subjects lost to follow up in the first month of this study?

  2. Based on the results of the first month, what is the maximum likelihood estimate of the proportion of subjects lost to follow up in the first month?

  3. Compute a 95% Wald confidence interval, by hand and by using R, for the population proportion of subjects lost to follow up in the first month. Interpret the resulting confidence interval.

  4. Use R to compute a 95% score confidence interval for the population proportion of subjects lost to follow up in the first month. Interpret the resulting confidence interval.

  5. Use the confidence interval from part (4) to reject or fail to reject null hypothesis: H_{0}: \; \pi = 0.25, using a 5% significance level.

The probability distribution for the number of subjects lost to follow up in the first month of this study is \text{Bin}(50, \pi) where \pi is the unknown true proportion (probability) of being lost to follow up in the first month.

The MLE for the proportion of subjects lost to follow up in the first month is the sample proportion, as shown below.

p = \frac{12}{50} = 0.24

The 95% Wald CI for the true population proportion is shown below.

\frac{12}{50} \pm 1.96 \sqrt{\frac{\frac{12}{50} \left( 1 - \frac{12}{50} \right)}{50}} = (0.1216, 0.3584)

Code
DescTools::BinomCI(
  x = 12,
  n = 50,
  conf.level = 0.95,
  method = "wald"
)
      est    lwr.ci    upr.ci
[1,] 0.24 0.1216208 0.3583792

We are 95% confident that the true proportion of those who will be lost to follow up is between 12.16% and 35.84%.

Code
prop.test(
  x = 12,
  n = 50,
  conf.level = 0.95,
  correct = FALSE
)

    1-sample proportions test without continuity correction

data:  12 out of 50, null probability 0.5
X-squared = 13.52, df = 1, p-value = 0.000236
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.1429739 0.3741268
sample estimates:
   p 
0.24 

We are 95% confident that the true proportion of those who will be lost to follow up is between 14.3% and 37.4%.

Since the 95% score CI in part (4) contains the null hypothesized value of 0.25, we fail to reject the null hypothesis at the 5% significance level.

6.5 Question

Out of 20 randomly selected subjects who worked in demolitions of old buildings, 15 were accidentally exposed to lead during their first year on the job. Test the null hypothesis that 30% of workers in demolitions of old buildings are accidentally exposed to lead. Use a 2% level of significance.

  1. Is it reasonable to conduct a score test? Regardless of your answer, conduct the relevant score hypothesis test and compute the corresponding score confidence interval. Interpret the interval.

  2. Conduct the relevant exact hypothesis test and compute the corresponding exact confidence interval. Interpret the interval.

Let \pi = the true proportion of workers in demolitions of old buildings who are accidentally exposed to lead and n = 20. The null hypothesized value for \pi is \pi_{0} = 0.30. It is reasonable to conduct a score test since n \pi_{0} = 20 \times 0.30 = 6 \ge 5 and n(1− \pi_{0})=20 \times 0.70 = 14 \ge 5.

Code
prop.test(
  x = 15,
  n = 20,
  p = 0.3,
  conf.level = 0.98,
  correct = FALSE
)

    1-sample proportions test without continuity correction

data:  15 out of 20, null probability 0.3
X-squared = 19.286, df = 1, p-value = 1.125e-05
alternative hypothesis: true p is not equal to 0.3
98 percent confidence interval:
 0.4899589 0.9035577
sample estimates:
   p 
0.75 

We reject the null hypothesis that \pi = 0.3 at the 2% significance level. The corresponding 98% CI of (0.490, 0.904) means that we are 98% confident that the true proportion of workers in demolitions of old buildings who are accidentally exposed to lead is between 49.0% and 90.4%.

Code
binom.test(
  x = 15,
  n = 20,
  p = 0.30,
  conf.level = 0.98
)

    Exact binomial test

data:  15 and 20
number of successes = 15, number of trials = 20, p-value = 4.294e-05
alternative hypothesis: true probability of success is not equal to 0.3
98 percent confidence interval:
 0.4678936 0.9311550
sample estimates:
probability of success 
                  0.75 

We reject the null hypothesis that \pi = 0.3 at the 2% significance level. The corresponding 98% CI of (0.468, 0.931) means that we are 98% confident that the true proportion of workers in demolitions of old buildings who are accidentally exposed to lead is between 46.8% and 93.1%.