Chapter 12

Author
Affiliation
Juan H Klopper (MD), Do Hee Lee (MPH)

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

There are many examples in public health and in clinical practice in which we count the occurrences of events. We might count the number of mutations in one region of a chromosome, the number of visits to an emergency center in a day, or the number of births per month at an obstetric unit.

Naturally, there are factors that influence such counts. If a count is our response variable (as opposed to a binary or multi-level categorical variable), then these influencing factors can serve as explanatory variables, allowing us to consider generalized linear models. As with all generalized linear models, we have three components.

  • A random component
  • A systematic component
  • A link function

In order to generate such models, we start by considering the Poisson distribution with respect to describing the random component.

Section 3. The Poisson Distribution: Poisson distribution, Poisson log-linear model
Section 7. Overdispersion: overdispersion, dispersion factor
Section 8. Count Regression for Rate Data: offset

2 Libraries

Load the required libraries for Chapter 12. If these packages are not already installed, please do so before loading the libraries.

Code
#library(DT)
library(rio)
library(hms)
library(dplyr)
library(tibbletime)
library(timetk)
library(lubridate)
library(ggplot2)

3 The Poisson Distribution

A random variable, Y, follows a Poisson distribution with parameter, \lambda, the mean number of events per interval, if the following hold.

  • Y is the number of occurrences of an event over a specified interval (time, length, etc.)
  • The occurrences are independent in an interval and between intervals
  • The occurrences happen at a constant (uniform) rate over the interval

If these assumptions hold, we write Y \sim \text{Poiss} \left( \lambda \right). The probability mass function of this discrete type distribution is shown in Equation 1.

P \left( Y = y \right) = \frac{\lambda^{y} \, e^{-\lambda}}{y!} \tag{1}

The mean and variance of Y are shown in Equation 2.

\begin{align} &\text{E} \left( Y \right) = \lambda \\ &\text{Var} \left( Y \right) = \lambda \end{align} \tag{2}
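
As a quick numerical check of Equation 2, we can simulate a large Poisson sample and compare its sample mean and variance (a small sketch; the exact values vary slightly with the seed).

Code
set.seed(12)                        # arbitrary seed for reproducibility
y_sim <- rpois(100000, lambda = 27) # large sample from Poiss(27)
mean(y_sim)                         # sample mean, close to 27
var(y_sim)                          # sample variance, also close to 27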

Imagine then, as an illustrative example of the Poisson distribution, that an obstetric unit in a rural environment delivers an average of 27 babies per 24 hours. What is the probability that, on a given day, 30 babies are born?

The result is calculated in Equation 3, making use of Equation 1, with \lambda = 27, where Y is the number of babies born per day and Y \sim \text{Poiss} \left( 27 \right).

P \left( Y = 30 \right) = \frac{27^{30} \, e^{-27}}{30!} \tag{3}

We use the dpois function in R to do the calculation. The first argument is the values for Y = y and the second argument is the value for the parameter \lambda.

Code
dpois(
  30,
  lambda = 27
)
[1] 0.06184461

The probability of 30 babies being born in one day at this unit is about 6.18%. In the bar plot below, we show the probability for the number of births on the interval 10 to 40 newborns per day.

Code
babies <- 10:40
barplot(
  dpois(
    babies,
    lambda = 27
  ),
  names.arg = 10:40,
  main = "Probabilities over number of newborns per day",
  xlab = "Number of newborns per day",
  ylab = "Probability",
  col = "orange",
  las = 1,
  panel.first = grid()
)

In building a model, our aim is to model the expected count of the response variable, that is \text{E}(Y), as a function of p explanatory variables, x_{1} , x_{2} , \ldots , x_{p}. We do this by introducing the log link function, shown in Equation 4, where X is a design matrix (which includes a first column of ones).

\log{\left( \lambda \right)} = X \mathbf{\beta} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{4}

The model in Equation 4 is termed the Poisson log-linear model when all the explanatory variables x_{i} are categorical.

We estimate the coefficients using maximum likelihood estimation. Whereas the exponentiated slopes of a logistic regression model are odds ratios, here the exponentiated slopes are multiplicative effects on the mean count (rate ratios). We can also calculate predicted counts given values for the explanatory variables using simple algebra, the result of which is shown in Equation 5.

\lambda \left( x_{1} , \ldots , x_{p} \right) = e^{\beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p}} \tag{5}

4 Models with a Binary Explanatory Variable

Throughout the next section, we will explore some of the functionality in R and in various libraries to work with dates. When creating models for count data with respect to the response variable, it is very useful to be able to manage dates and times. To this end, the illustrative examples will be interspersed with learning about date and time functionality in R.

By definition, counts require an interval. Given time data, we are free to set an interval and summarize over periods (time intervals or time ranges). To illustrate this, we generate a data set of daily death counts in a critical care unit during the year 2020 (during the coronavirus disease 2019 pandemic).

Code
days <- seq(as.Date("2020-04-05"), as.Date("2020-12-26"), by="days") # Generating dates
ds <- seq(1, length(days), by = 1) # Sequence of days
counts <- floor(exp(ds/(length(ds)/2))) # Formula for number of deaths on a specific day

dfr <- data.frame(
  ID = 1:length(days),
  Dates = days,
  Deaths = counts
)

#DT::datatable(dfr)

Data are recorded every day. The \texttt{Dates} column shows the day and the \texttt{Deaths} column shows the number of deaths recorded in the particular critical care unit on that day. We can plot the daily number of deaths as a scatter plot.

Code
plot(
  dfr$Dates,
  dfr$Deaths,
  main = "Average number of deaths per day",
  xlab = "Dates",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)

We can now aggregate (sum) over every week using the as_tbl_time and collapse_by functions from the tibbletime library. We pass "weekly" as the period to collapse_by and use the "start" value for the side argument to indicate that the start of each week is used as the row value.

Code
dfr_weekly <- dfr %>% 
  dplyr::arrange(Dates) %>% 
  tibbletime::as_tbl_time(Dates) %>% 
  tibbletime::collapse_by("weekly", side = "start", clean = T) %>% 
  dplyr::group_by(Dates) %>% 
  summarise(Deaths = sum(Deaths)) %>% 
  dplyr::mutate(ID = 1:38)

#DT::datatable(dfr_weekly)

We plot the result.

Code
plot(
  dfr_weekly$Dates,
  dfr_weekly$Deaths,
  main = "Average number of deaths per week",
  xlab = "Dates",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)

We can also do the same aggregation over every month.

Code
dfr_monthly <- dfr %>% 
  dplyr::arrange(Dates) %>% 
  tibbletime::as_tbl_time(Dates) %>% 
  tibbletime::collapse_by("monthly", side = "start", clean = T) %>% 
  dplyr::group_by(Dates) %>% 
  summarise(Deaths = sum(Deaths)) %>% 
  mutate(ID = 1:9)

#DT::datatable(dfr_monthly)

A plot shows the aggregate of deaths for each month.

Code
plot(
  dfr_monthly$Dates,
  dfr_monthly$Deaths,
  main = "Average number of deaths per month",
  xlab = "Dates",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)

There are many more libraries in R that allow us to manipulate dates and times.
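
As a small illustration, the lubridate library (loaded above) can extract components from date objects; a brief sketch is shown below.

Code
d <- as.Date("2020-04-05")
lubridate::year(d)               # year component
lubridate::month(d)              # month component
lubridate::wday(d, label = TRUE) # day of the week as a labelled factor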

Back to our illustrative example. We want to consider our response variable, Y (death counts), explained by a single, binary explanatory variable. We generate a new variable Binary_bin in our data frame object that serves as our binary explanatory variable x, with x=0 if an observation falls in the first half of the date range and x=1 if it falls in the second half.

Code
dfr <- dfr %>% 
  mutate(
    Binary_bin = case_when(
      ID <= floor(nrow(dfr) / 2) ~ "first",
      ID > floor(nrow(dfr) / 2) ~ "second"
    )
  )

#DT::datatable(dfr)

We set the explanatory variable as a factor.

Code
dfr$Binary_bin <- factor(dfr$Binary_bin, levels = c("first", "second"))

Now, we create a model using the glm function. Note the use of the poisson value for the family keyword argument, with the link set to "log".

Code
bin_model <- glm(
  Deaths ~ Binary_bin,
  data = dfr,
  family = poisson(link = "log")
)

summary(bin_model)$coefficients
                  Estimate Std. Error   z value     Pr(>|z|)
(Intercept)      0.2687062 0.07580977  3.544479 3.933891e-04
Binary_binsecond 1.1617130 0.08686588 13.373641 8.621559e-41

If \lambda (x) = \text{E} \left( Y \, \vert \, x \right), then our model is as in Equation 6.

\begin{align} &\log{\left( \lambda (x) \right)} = \beta_{0} + \beta_{1} x \\ &\lambda (x) = e^{\beta_{0} + \beta_{1}x} \end{align} \tag{6}

Using the results from our model, we have Equation 7.

\begin{align} &\log{\left( \hat{\lambda} (x) \right)} = 0.2687062 + 1.1617130 x \\ &\hat{\lambda} (x) = e^{0.2687062 + 1.1617130 \, x} \end{align} \tag{7}

If we use x=0 (first half), we have that \hat{\lambda}(0) = e^{\hat{\beta}_{0}} = e^{0.2687062} \approx 1.308. This means that e^{\hat{\beta}_{0}} is the estimated mean number of deaths per day in the unexposed. In our example, the estimated mean number of deaths in the unexposed (observations in the first half of the date range) is about 1.308.

We can also interpret the slope. In Equation 8, we make use of the law of exponents and the fact that we have just noted that e^{\hat{\beta}_{0}} = \hat{\lambda} (0).

\begin{align} &\hat{\lambda} (1) = e^{\hat{\beta}_{0} + \hat{\beta}_{1} (1)} = e^{\hat{\beta}_{0}}e^{\hat{\beta}_{1}} \\ &\hat{\lambda} (1) = \hat{\lambda} (0) \, e^{\hat{\beta}_{1}} \end{align} \tag{8}

We interpret Equation 8 as saying that the estimated mean count in the exposed is e^{\hat{\beta}_{1}} times the estimated mean count in the unexposed. The result for our illustrative example is shown in Equation 9.

\begin{align} &\hat{\lambda} (1) = 1.308 \times e^{1.1617130 \left( 1 \right)} \\ &\hat{\lambda} (1) \approx 4.180451 \end{align} \tag{9}

The estimated mean number of deaths given being in the second half of the date range is about 4.18. That is e^{1.161713} = 3.195402 times the estimated mean number of deaths in the unexposed.
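
As a quick check (a sketch using the objects created above), the fitted means from this simple model coincide with the observed mean daily death counts in the two halves of the date range.

Code
tapply(dfr$Deaths, dfr$Binary_bin, mean) # observed mean count per group
exp(coef(bin_model)[1])                  # lambda_hat(0), first half
exp(sum(coef(bin_model)))                # lambda_hat(1), second half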

Note that there are three cases for the slope, \beta_{1}.

  • \beta_{1} > 0: i.e. e^{\beta_{1}} > 1 and the exposed have a higher mean count compared to the unexposed
  • \beta_{1} = 0: i.e. e^{\beta_{1}} = 1 and the exposed have the same mean count compared to the unexposed (used for the null hypothesis)
  • \beta_{1} < 0: i.e. e^{\beta_{1}} < 1 and the exposed have a lower mean count compared to the unexposed

The summary of the model above shows the result of a Wald test, with a Wald statistic, Z_{W}, and a p value.
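
Wald-type confidence intervals for the coefficients, and hence for the factor e^{\beta_{1}}, can be obtained with the confint.default function; a brief sketch for the binary model is shown below.

Code
exp(confint.default(bin_model, level = 0.95)) # Wald intervals for exp(beta)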

5 Models with a k-Level Explanatory Variable

In the code below, we create a new column for our data frame object called TriBin. It represents a 3-level categorical variable with the levels first if an observation is in April, May, or June, second, if an observation is in July, August, or September, and third, if an observation is in October, November or December.

Code
dfr <- dfr %>% 
  mutate(TriBin = case_when(
    lubridate::month(Dates) %in% c(4, 5, 6) ~ "first",
    lubridate::month(Dates) %in% c(7, 8, 9) ~ "second",
    lubridate::month(Dates) %in% c(10, 11, 12) ~ "third"
  ))

With a 3-level categorical variable, we can create 3-1=2 dummy variables. We let x_{1} indicate \texttt{TriBin} being second and x_{2} indicate \texttt{TriBin} being third, each taking the value 1 (if in that level) and 0 (otherwise).

Before generating the model, we set the levels.

Code
dfr$TriBin <- factor(dfr$TriBin, levels = c("first", "second", "third"))

We build the model and show the coefficients and Wald test results.

Code
tri_model <- glm(
  Deaths ~ TriBin,
  data = dfr,
  family = poisson(link = "log")
)

summary(tri_model)$coefficients
                  Estimate Std. Error       z value     Pr(>|z|)
(Intercept)  -1.386773e-15  0.1072113 -1.293496e-14 1.000000e+00
TriBinsecond  8.347977e-01  0.1273233  6.556517e+00 5.507893e-11
TriBinthird   1.600200e+00  0.1175348  1.361469e+01 3.275489e-42

The results above are used to write the fitted model and the estimated mean counts, shown in Equation 10.

\begin{align} &\log{\left( \hat{\lambda} \left( x_{1} , x_{2} \right) \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} \\ &\hat{\lambda} \left( x_{1} , x_{2} \right) = e^{\hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2}} \\ \\ &\log{\left( \hat{\lambda} \left( x_{1} , x_{2} \right) \right)} = 0 + 0.8347977 x_{1} + 1.6002 x_{2} \\ &\hat{\lambda} \left( x_{1} , x_{2} \right) = e^{0 + 0.8347977 x_{1} + 1.6002 x_{2}} \end{align} \tag{10}

The interpretation is as follows.

  • e^{\hat{\beta}_{0}} = e^{0} = 1 is the estimated mean number of deaths in April, May, and June (base date range) (x_{1} = 0, \; x_{2} = 0)
  • e^{\hat{\beta}_{1}} = e^{0.8347977} \approx 2.304 shows that the estimated mean number of deaths in July, August, and September (x_{1} = 1, \; x_{2} = 0) is 2.304 times the mean number of deaths in April, May, and June
  • e^{\hat{\beta}_{2}} = e^{1.6002} \approx 4.954 shows that the mean number of deaths in October, November, and December (x_{1} = 0, \; x_{2} = 1) is 4.954 times the mean number of deaths in April, May, and June (these fitted means are checked in the sketch below)
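
As a check on these interpretations, we can ask the model for the estimated mean number of deaths at each level of \texttt{TriBin} (a small sketch using the predict function).

Code
predict(
  tri_model,
  newdata = data.frame(TriBin = c("first", "second", "third")),
  type = "response"
)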

6 Model with a Numerical Explanatory Variable

In the case of our illustrative example, and for a single numerical explanatory variable, we have that Y is the number of deaths, \lambda (x) = \text{E} \left( Y \, \vert \, x \right), and x is the numerical explanatory variable. Our simple model will then be as in Equation 11.

\begin{align} &\log{\left( \lambda \left( x \right) \right)} = \beta_{0} + \beta_{1} x \\ &\lambda (x) = e^{\beta_{0} + \beta_{1} x} \end{align} \tag{11}

The interpretation of \beta_{0} is that \lambda (0) = e^{\beta_{0}} is the mean count for cases where x=0. We interpret the slope by considering a one-unit difference in x, i.e. x=\nu+1 versus x=\nu, as shown in Equation 12.

\begin{align} &\lambda \left( \nu + 1 \right) = e^{\beta_{0} + \beta_{1} \left( \nu + 1 \right)} \\ &\lambda \left( \nu + 1 \right) = e^{\beta_{0} + \beta_{1} \nu}e^{\beta_{1}} \\ &\text{but } e^{\beta_{0} + \beta_{1} \nu} = \lambda \left( \nu \right) \text{, therefore} \\ &\lambda \left( \nu + 1 \right) = \lambda \left( \nu \right) e^{\beta_{1}} \end{align} \tag{12}

We interpret Equation 12 by stating that for a one-unit increase in x, the estimated mean count changes by a factor of e^{\beta_{1}}. If \beta_{1}>0 then the change is an increase and if \beta_{1}<0 then the change is a decrease.

As with the derivation for a constant c-unit increase that we performed in the case of simple logistic regression, a c-unit increase in x brings about a factor change of e^{c \beta_{1}}.

For our illustrative example, we can choose the actual day, with the first day of data collection as x=1. We create a model and show the coefficients and results of a Wald test.

Code
num_model <- glm(
  Deaths ~ ID,
  data = dfr,
  family = poisson(link = "log")
)

summary(num_model)$coefficients
                Estimate  Std. Error   z value     Pr(>|z|)
(Intercept) -0.352732361 0.105001900 -3.359295 7.814159e-04
ID           0.008627318 0.000544697 15.838748 1.681362e-56

The model is written in Equation 13.

\begin{align} &\log{\left( \hat{\lambda} \left( x \right) \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\ &\hat{\lambda} (x) = e^{\hat{\beta}_{0} + \hat{\beta}_{1} x} \\ \\ &\log{\left( \hat{\lambda} \left( x \right) \right)} = -0.352732361 + 0.008627318 (x) \\ &\hat{\lambda} (x) = e^{-0.352732361 + 0.008627318 (x)} \end{align} \tag{13}

A one-day increase in the number of days since the start multiplies the estimated mean number of deaths by e^{\hat{\beta}_{1}} = e^{0.008627318} \approx 1.008665, an increase of about 0.87%.
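
A small sketch of this calculation, using the fitted coefficient, also shows the factor change for a 30-day increase (c = 30).

Code
b1 <- coef(num_model)["ID"] # estimated slope
exp(b1)                     # factor change for a one-day increase
exp(30 * b1)                # factor change for a 30-day increase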

We can plot our model over the data.

Code
plot(
  ds,
  dfr$Deaths,
  main = "Average number of deaths per day",
  xlab = "Days (as numerical variable)",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)
lines(
  ds,
  exp(-0.352732361 + 0.008627318 * ds)
)

7 Overdispersion

For the simple, single-variable cases above (binary and numerical explanatory variables), we can use a Wald test. A likelihood ratio test is used for models with multiple explanatory variables, including the k-1 dummy variable case.

We noted the results of Wald tests in the summary reports above, where we have the null hypothesis that \beta_{j} = 0 and a Wald statistic, as shown in Equation 14.

Z_{W} = \frac{\hat{\beta}_{j}}{\widehat{\text{SE}} \left( \hat{\beta}_{j} \right)} \tag{14}

The assumption for these models is that if Y \sim \text{Poiss} \left( \lambda \right) then \text{E} \left( Y \right) = \text{Var} \left( Y \right) = \lambda. Count data often have greater variance than assumed under a Poisson model. This greater variance is termed overdispersion, and ignoring it produces standard errors in the model summaries that are too small.

There are several approaches to dealing with overdispersion. A simple approach is to assume a model that allows for the variance to be different from the mean by a constant factor, \phi, termed the dispersion factor, with \text{Var} \left( Y \right) = \phi \lambda. We state that when \phi > 1, then we have overdispersion.

We obtain an estimate, \hat{\phi}, using Equation 15, where n is the number of observations, and p is the number of slopes.

\hat{\phi} = \frac{\sum_{i=1}^{n} \left( \text{Pearson residual} \right)_{i}^{2}}{n-p-1} \tag{15}

If \hat{\phi}>1 then we have overdispersion, and we multiply the standard error in Equation 14 by \sqrt{\hat{\phi}}, as shown below, where QP stands for quasi-Poisson.

\widehat{\text{SE}}_{\text{QP}} \left( \hat{\beta}_{j} \right) = \sqrt{\hat{\phi}} \left[ \widehat{\text{SE}} \left( \hat{\beta}_{j} \right) \right]

As an illustrative example, we consider the death_data.csv file. The file contains data from a study of the number of deaths due to acquired immunodeficiency syndrome (AIDS) per three-month period in Australia from January 1983 through June 1986.

Code
aus <- import_dataset("death_data.csv")

The time frame contains data for 14 periods. The period is in the \texttt{period} column. We will use \texttt{period} as a numerical explanatory variable and create a simple model. The \texttt{count} variable indicates the number of deaths in each of the 14 periods.

Code
aus_period_model <- glm(
  count ~ period,
  data = aus,
  family = poisson(link = "log")
)

summary(aus_period_model)

Call:
glm(formula = count ~ period, family = poisson(link = "log"), 
    data = aus)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.33963    0.25119   1.352    0.176    
period       0.25652    0.02204  11.639   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 207.272  on 13  degrees of freedom
Residual deviance:  29.654  on 12  degrees of freedom
AIC: 86.581

Number of Fisher Scoring iterations: 5

From the summary of our numerical explanatory model above, we see an estimated standard error for \hat{\beta}_{1} of 0.02204. We also see that the dispersion parameter was taken to be 1.

We can calculate the Pearson residual in each case, using the resid function. We show the first five such residuals.

Code
reds <- resid(aus_period_model, type = "pearson")
reds[1:5]
         1          2          3          4          5 
-1.3472679 -0.8787483 -0.5926473 -0.4640352 -1.8060830 

In the code below we calculate \hat{\phi} by using Equation 15.

Code
phi_hat <- sum(reds^2) / (length(reds) - 1 - 1)
phi_hat
[1] 2.403942

The standard error for \hat{\beta}_{1} should be multiplied by \sqrt{\hat{\phi}}.

By using the quasipoisson keyword as value for the family argument, R returns a model that compensates for overdispersion.

Code
aus_period_model_quasi <- glm(
  count ~ period,
  data = aus,
  family = quasipoisson(link = "log")
)

summary(aus_period_model_quasi)

Call:
glm(formula = count ~ period, family = quasipoisson(link = "log"), 
    data = aus)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.33963    0.38946   0.872      0.4    
period       0.25652    0.03417   7.507 7.17e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 2.403942)

    Null deviance: 207.272  on 13  degrees of freedom
Residual deviance:  29.654  on 12  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

We see the new standard error, which we confirm below, by multiplying the square root of the dispersion factor by the original standard error for \hat{\beta}_{1}.

Code
sqrt(phi_hat) * 0.02204
[1] 0.03417225

From the summary of the model, we note that after adjustment, the variability in \hat{\beta}_{1} is described by a t distribution.

Other languages and software use a Wald test statistic, shown in Equation 16, together with a confidence interval for the coefficient, \hat{\beta}_{j}.

\begin{align} &Z_{W} = \frac{\hat{\beta}_{j}-0}{\widehat{\text{SE}}_{\text{QP}} \left( \hat{\beta}_{j} \right)} \\ &\hat{\beta}_{j} \pm Z_{\frac{\alpha}{2}} \, \widehat{\text{SE}}_{\text{QP}} \left( \hat{\beta}_{j} \right) \end{align} \tag{16}

There are considerations that we must be aware of when correcting for overdispersion. After correction, estimates are not obtained by maximizing a legitimate likelihood function, so a likelihood ratio test cannot be used. Wald test statistics and confidence intervals can still be calculated, though.

An alternative to consider is the negative binomial regression model, which uses the negative binomial distribution for the random component. Under this distribution, the variance can differ from the mean.
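
As a brief sketch, a negative binomial model for the same data can be fit with the glm.nb function from the MASS package (assuming MASS is installed; it is not loaded in the Libraries section above).

Code
aus_nb_model <- MASS::glm.nb(count ~ period, data = aus) # negative binomial fit
summary(aus_nb_model)$coefficients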

8 Count Regression for Rate Data

The last model type that we consider is used for rate data, expressed as a number of events per unit. As an illustrative example, we consider a study conducted in Denmark that collected data on cases of lung cancer for different age groups in different cities.

For this example, we can model the rate of lung cancer cases as a function of age and city. The data for this study are in the lung_cancer_data.csv file, which we import below.

Code
denmark <- import_dataset("lung_cancer_data.csv")
#DT::datatable(denmark)

We note that we have columns for \texttt{cases} and \texttt{population}. Our response count is Y, observed over an index (here the population size) that we denote by t. Hence we have a sample rate, shown in Equation 17, together with the mean value of the rate.

\begin{align} &\frac{Y}{t} \\ \\ &\frac{E \left( Y \right)}{t} = \frac{\lambda}{t} \end{align} \tag{17}

A log-linear model for the rate in Equation 17 with single explanatory variable x_{1} is shown in Equation 18.

\log{\left( \frac{\lambda(x_{1})}{t} \right)} = \beta_{0} + \beta_{1} x_{1} \tag{18}

Simple algebra expresses the mean count from this model, as derived in Equation 19.

\begin{align} &e^{\log{\left( \frac{\lambda(x_{1})}{t} \right)}} = e^{\beta_{0} + \beta_{1} x_{1}} \\ &\lambda \left( x_{1} \right) = t \, e^{\beta_{0} + \beta_{1} x_{1}} \end{align} \tag{19}

In Equation 20, we show that e^{\beta_{1}} is a rate ratio.

\frac{\lambda \left( 1 \right)}{\lambda \left( 0 \right)} = \frac{e^{\beta_{0} + \beta_{1} (1)}}{e^{\beta_{0} + \beta_{1} (0)}} = e^{\beta_{1}} \tag{20}

The equation in Equation 20 can be rewritten as in Equation 21, showing that the rate of event occurrences for those with x_{1} = 1 is e^{\beta_{1}} times the rate of event occurrences for those with x_{1} = 0.

\frac{\lambda \left( 1 \right)}{t} = e^{\beta_{1}} \frac{\lambda \left( 0 \right)}{t} \tag{21}

Using the laws of logarithms, we can also express Equation 18 as the difference in logs, shown in Equation 22.

\begin{align} &\log{\left( \lambda \left( x_{1} \right) \right)} - \log{\left( t \right)} = \beta_{0} + \beta_{1} x_{1} \\ &\log{\left( \lambda \left( x_{1} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \log{\left( t \right)} \end{align} \tag{22}

The term \log{\left( t \right)} is the offset. Assuming that Y follows a Poisson distribution, Equation 22 shows a Poisson regression model with explanatory variables x_{1} and \log{\left( t \right)}, where the coefficient of the latter is fixed at 1, as written in Equation 23.

\begin{align} &\log{\left( \lambda \left( x_{1} , \log{\left( t \right)} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} \log{\left( t \right)} \\ &\log{\left( \lambda \left( x_{1} , \log{\left( t \right)} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \log{\left( t \right)}, \text{ with } \beta_{2} = 1 \end{align} \tag{23}

Below, we generate a model, where Y is the number of lung cancer cases in the \texttt{cases} variable (column), t is the population size in the \texttt{population} variable, with the four levels of the \texttt{city} variable indicating one of four cities, Fredericia (x_{1}), Horsens (x_{2}), Kolding (x_{3}), and Vejle, the latter chosen as the reference city. We also include the age midpoint variable in the \texttt{age_midpt} variable (x_{4}).

With \mathbf{x} = \left( x_{1} , x_{2} , x_{3} , x_{4} \right) and E \left( Y \, \vert \, \mathbf{x} \right) = \lambda \left( \mathbf{x} \right), we fit a model as shown in Equation 24.

\log{\left( \lambda \left( \mathbf{x} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \log{\left( t \right)} \tag{24}

Code
denmark$city <- factor(denmark$city, levels = c("Vejle", "Fredericia", "Horsens", "Kolding"))
Code
denmark_model <- glm(
  cases ~ city + age_midpt + offset(log(population)),
  data = denmark,
  family = poisson(link = "log")
)

summary(denmark_model)

Call:
glm(formula = cases ~ city + age_midpt + offset(log(population)), 
    family = poisson(link = "log"), data = denmark)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.228237   0.440333 -18.686   <2e-16 ***
cityFredericia  0.246016   0.187769   1.310    0.190    
cityHorsens    -0.059263   0.191972  -0.309    0.758    
cityKolding    -0.104262   0.198081  -0.526    0.599    
age_midpt       0.056759   0.006527   8.696   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 129.908  on 23  degrees of freedom
Residual deviance:  46.448  on 19  degrees of freedom
AIC: 152.84

Number of Fisher Scoring iterations: 5

We see the fitted model in Equation 25.

\begin{align} &\log{\left( \hat{\lambda} \left( \mathbf{x} \right) \right)} = -8.228237 + 0.246016 \, x_{1} - 0.059263 \, x_{2} - 0.104262 \, x_{3} + 0.056759 \, x_{4} + \log{\left( t \right)} \end{align} \tag{25}

If we consider x_{1}, we can state that the rate of lung cancer cases in Fredericia is about e^{0.246016} \approx 1.28 times the rate in Vejle, that is, about 28% higher, adjusting for age.

If we consider x_{4}, we can state that for each one-year increase in age, the rate of lung cancer cases increases by a factor of about e^{0.056759} \approx 1.06, that is, by about 6%, adjusting for city.
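
The antilogs of the coefficients, together with Wald-type confidence intervals, can be extracted directly (a small sketch; the antilogs of the city and age coefficients are the rate ratios discussed above).

Code
exp(coef(denmark_model))            # antilogs of the coefficients
exp(confint.default(denmark_model)) # Wald confidence intervals on the antilog scale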

The model is linear on the log scale, but as can be seen from the plot below, the observed log rates are not linear in age at all.

Code
denmark <- denmark %>% 
  mutate(cancer_rate = cases / population)

ggplot(
  denmark,
  aes(
    x = age_midpt,
    y = log(cancer_rate),
    color = city
  )
) +
  geom_point() +
  geom_line() +
  labs(
    title = "Cancer rate (cases per population size) for each city"
  ) +
  xlab("Age") + 
  ylab("Log of rate")

One way to obtain predictions with ease is to create the dummy variables first, before creating the model. We do so below.

Code
fredericia <- ifelse(denmark$city == "Fredericia", 1, 0)
horsens <- ifelse(denmark$city == "Horsens", 1, 0)
kolding <- ifelse(denmark$city == "Kolding", 1, 0)

denmark_alt_model <- glm(
  cases ~ fredericia + horsens + kolding + age_midpt + offset(log(population)),
  data = denmark,
  family = poisson(link = "log")
)

summary(denmark_alt_model)

Call:
glm(formula = cases ~ fredericia + horsens + kolding + age_midpt + 
    offset(log(population)), family = poisson(link = "log"), 
    data = denmark)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.228237   0.440333 -18.686   <2e-16 ***
fredericia   0.246016   0.187769   1.310    0.190    
horsens     -0.059263   0.191972  -0.309    0.758    
kolding     -0.104262   0.198081  -0.526    0.599    
age_midpt    0.056759   0.006527   8.696   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 129.908  on 23  degrees of freedom
Residual deviance:  46.448  on 19  degrees of freedom
AIC: 152.84

Number of Fisher Scoring iterations: 5

The model is exactly the same, but now it is easier to supply the data to the predict function.

Code
pred <- predict(
  denmark_alt_model,
  denmark %>% select(
    all_of(fredericia),
    all_of(horsens),
    all_of(kolding),
    age_midpt,
    population
  ),
  type = "response"
) / denmark$population

denmark <- denmark %>% 
  mutate(predicted = pred)

Now we plot the predicted rates and verify that the fitted model is linear on the log scale.

Code
ggplot(
  denmark,
  aes(
    x = age_midpt,
    y = log(predicted),
    color = city
  )
) +
  geom_point() +
  geom_line() +
  labs(
    title = "Predicted cancer rate (cases per population size) for each city"
  ) +
  xlab("Age") + 
  ylab("Log of predicted rate")

We have to consider all the tools for model evaluation that we have learned about in this course. Below, we create an extra term, the square of the age midpoint.

Code
denmark <- denmark %>% 
  mutate(age_midpt_squared = age_midpt^2)

We create a new model by adding this term.

Code
denmark_new_model <- glm(
  cases ~ fredericia + horsens + kolding + age_midpt + age_midpt_squared + offset(log(population)),
  data = denmark,
  family = poisson(link = "log")
)

summary(denmark_new_model)

Call:
glm(formula = cases ~ fredericia + horsens + kolding + age_midpt + 
    age_midpt_squared + offset(log(population)), family = poisson(link = "log"), 
    data = denmark)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -2.173e+01  3.096e+00  -7.017 2.27e-12 ***
fredericia         2.740e-01  1.878e-01   1.459    0.145    
horsens           -5.864e-02  1.920e-01  -0.305    0.760    
kolding           -1.015e-01  1.981e-01  -0.512    0.608    
age_midpt          5.097e-01  1.021e-01   4.991 6.01e-07 ***
age_midpt_squared -3.701e-03  8.298e-04  -4.461 8.17e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 129.908  on 23  degrees of freedom
Residual deviance:  26.017  on 18  degrees of freedom
AIC: 134.41

Number of Fisher Scoring iterations: 5

We add predictions using the new model and then visualize the new model.

Code
pred_new <- predict(
  denmark_new_model,
  denmark %>% select(
    all_of(fredericia),
    all_of(horsens),
    kolding,
    age_midpt,
    age_midpt_squared,
    population
  ),
  type = "response"
) / denmark$population
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
  # Was:
  data %>% select(kolding)

  # Now:
  data %>% select(all_of(kolding))

See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
Code
denmark <- denmark %>% 
  mutate(predicted_new = pred_new)
Code
ggplot(
  denmark,
  aes(
    x = age_midpt,
    y = log(predicted_new),
    color = city
  )
) +
  geom_point() +
  geom_line() +
  labs(
    title = "Predicted cancer rate (cases per population size) for each city"
  ) +
  xlab("Age") + 
  ylab("Log of predicted rate")

We can view each city in its own plot.

Code
denmark %>% 
  ggplot(
    aes(
      x = age_midpt,
      y = value,
      col = city
    )
  ) +
  geom_point(aes(y = log(cancer_rate), col = "cancer_rate")) +
  geom_line(aes(y = log(cancer_rate), col = "cancer_rate")) +
  geom_point(aes(y = log(predicted_new), col = "predicted_new")) +
  geom_line(aes(y = log(predicted_new), col = "predicted_new")) +
  labs(
    title = "Actual (red) and predicted log of rates (green) for each city"
  ) +
  xlab("Age midpoint") +
  ylab("Log of rate") +
  facet_grid(
    ~ city
  ) +
  theme(legend.position="none")

9 Lab Problems

9.1 Question

The file campuscrime_data.csv contains the number of burglaries reported at a collection of 47 U.S. public universities with over 10000 students in the year 2009. In addition, explanatory variables are included which may explain differences in crime rates, including total number of students, percentage of men, average ACT test scores, and tuition.

The variables of interest are shown in Table 1.

Table 1: Variables for this Problem
Variable Meaning and coding
\texttt{burg09} Number of burglaries
\texttt{tuition} Cost of tuition
\texttt{act.comp} Average ACT score
\texttt{pct.male} Percentage males
\texttt{total} Number of students
  1. Import the data file as a data frame or tibble object.

  2. Scale the tuition variable by 1000 to create a new variable \texttt{tuitionthou} that gives the tuition in thousands of dollars. Then fit a Poisson regression model with \texttt{burg09} as the response and with \texttt{tuitionthou}, \texttt{act.comp}, \texttt{pct.male}, and \texttt{total} as the predictors. Report the coefficient table from R. Write the fitted model.

  3. Interpret the coefficients of \texttt{tuitionthou} and \texttt{pct.male} in terms of the predicted mean number of burglaries.

  4. Is there evidence of overdispersion? If so, fit an appropriate model to account for the overdispersion. Report the coefficient table for this model.

  5. It may make sense to include \log{(\text{total})} as an offset in the model. Refit the last model above with \log{(\text{total})} as a predictor (rather than \texttt{total}). Construct an approximate Wald 95% CI for the coefficient for the \log{(\text{total})} predictor. How can we use this interval to make a decision about whether it makes sense to include \log{(\text{total})} as an offset? Does it make sense to include \log{(\text{total})} as an offset? HINT: This question shows how to use an offset when considering counts. If the confidence interval includes 1, then it makes sense to use \log{(\text{total})} as an offset (since its coefficient may be 1).

  6. Refit the model from Question 5, but include \log{(\text{total})} as an offset rather than as a predictor. Interpret the antilogs of the coefficients for the \texttt{tuitionthou} and \texttt{pct.male} predictors.

  7. Based on the model that you fit in Question 6, is \texttt{tuitionthou} associated with the rate of burglaries in 2009, adjusting for mean ACT score and percentage of male students? Use a 5% significance level to test this association.

Code
school <- import_dataset("campuscrime_data.csv")
Code
school <- school %>% 
  mutate(tuitionthou = tuition / 1000)

school_model_1 <- glm(
  burg09 ~ tuitionthou + act.comp + pct.male + total,
  data = school,
  family = poisson(link = "log")
)

summary(school_model_1)

Call:
glm(formula = burg09 ~ tuitionthou + act.comp + pct.male + total, 
    family = poisson(link = "log"), data = school)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  2.741e+00  3.238e-01   8.465  < 2e-16 ***
tuitionthou -3.896e-02  3.937e-03  -9.894  < 2e-16 ***
act.comp     5.352e-02  1.008e-02   5.308 1.11e-07 ***
pct.male    -1.313e-02  5.355e-03  -2.452   0.0142 *  
total        2.666e-05  2.745e-06   9.709  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1400.0  on 46  degrees of freedom
Residual deviance: 1133.6  on 42  degrees of freedom
AIC: 1389

Number of Fisher Scoring iterations: 5

The model is as below, where \mathbf{x} = \left( x_{1} , x_{2} , x_{3} , x_{4} \right) represents tuition in thousands of dollars, average ACT score, percentage males, and number of students.

\log{\left( \hat{\lambda} \left( \mathbf{x} \right) \right)} = 2.741 - 0.03896 x_{1} + 0.05352 x_{2} - 0.01313 x_{3} + 0.00002666 x_{4}

The predicted mean number of burglaries is calculated as below.

\hat{\lambda} \left( \mathbf{x} \right) = e^{2.741 - 0.03896 x_{1} + 0.05352 x_{2} - 0.01313 x_{3} + 0.00002666 x_{4}}

  • \texttt{tuitionthou} For every \$1000 increase in tuition, the estimated mean number of burglaries decreases by about 1 - e^{-0.03896} \approx 1 - 0.9617 = 0.038, or 3.8%, adjusting for average ACT scores, percentage males, and number of students
  • \texttt{pct.male} For each one percentage point increase in male students, the estimated mean number of burglaries decreases by about 1 - e^{-0.01313} \approx 1 - 0.9870 = 0.013, or 1.3%, adjusting for tuition in thousands of dollars, average ACT scores, and number of students

We fit a new model that compensates for overdispersion.

Code
school_model_2 <- glm(
  burg09 ~ tuitionthou + act.comp + pct.male + total,
  data = school,
  family = quasipoisson(link = "log") # Note the use of the quasipoisson keyword
)

summary(school_model_2)

Call:
glm(formula = burg09 ~ tuitionthou + act.comp + pct.male + total, 
    family = quasipoisson(link = "log"), data = school)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  2.741e+00  1.735e+00   1.579   0.1218  
tuitionthou -3.896e-02  2.110e-02  -1.846   0.0720 .
act.comp     5.352e-02  5.405e-02   0.990   0.3277  
pct.male    -1.313e-02  2.870e-02  -0.458   0.6497  
total        2.666e-05  1.472e-05   1.811   0.0772 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 28.72955)

    Null deviance: 1400.0  on 46  degrees of freedom
Residual deviance: 1133.6  on 42  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

We see that the estimated dispersion parameter is \hat{\phi} = 28.72955 > 1, showing evidence for overdispersion. The coefficients are as shown above and we select this model for reporting.
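
We can confirm the dispersion estimate by hand using Equation 15 and the Pearson residuals from the original Poisson model (a small sketch; the data set has 47 observations and the model has 4 slopes).

Code
school_reds <- resid(school_model_1, type = "pearson") # Pearson residuals
sum(school_reds^2) / (nrow(school) - 4 - 1)            # estimate of phi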

Code
school_model_3 <- glm(
  burg09 ~ tuitionthou + act.comp + pct.male + log(total),
  data = school,
  family = quasipoisson(link = "log")
)

summary(school_model_3)

Call:
glm(formula = burg09 ~ tuitionthou + act.comp + pct.male + log(total), 
    family = quasipoisson(link = "log"), data = school)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -4.107092   4.815548  -0.853   0.3986  
tuitionthou -0.036648   0.021400  -1.713   0.0942 .
act.comp     0.065634   0.053879   1.218   0.2300  
pct.male    -0.009317   0.028915  -0.322   0.7489  
log(total)   0.696416   0.542476   1.284   0.2063  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 29.49233)

    Null deviance: 1400  on 46  degrees of freedom
Residual deviance: 1177  on 42  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
Code
exp(confint.default(school_model_3))
                   2.5 %     97.5 %
(Intercept) 1.310119e-06 206.687464
tuitionthou 9.244184e-01   1.005308
act.comp    9.608211e-01   1.186770
pct.male    9.361404e-01   1.048495
log(total)  6.929369e-01   5.810399

We see a confidence interval for e^{\beta_{4}}, the antilog of the coefficient of \log{(\text{total})}, of (0.6929369, 5.810399), which contains 1. We should therefore consider using \log{(\text{total})} as an offset.

Code
school_model_4 <- glm(
  burg09 ~ tuitionthou + act.comp + pct.male + offset(log(total)),
  data = school,
  family = quasipoisson(link = "log")
)

summary(school_model_4)

Call:
glm(formula = burg09 ~ tuitionthou + act.comp + pct.male + offset(log(total)), 
    family = quasipoisson(link = "log"), data = school)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -6.63085    1.65998  -3.995 0.000249 ***
tuitionthou -0.03970    0.02090  -1.900 0.064187 .  
act.comp     0.05145    0.04853   1.060 0.294994    
pct.male    -0.01429    0.02796  -0.511 0.612061    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 29.91953)

    Null deviance: 1328.3  on 46  degrees of freedom
Residual deviance: 1186.2  on 43  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5
  • \texttt{tuitionthou} For each \$1000 increase in tuition, the estimated rate of burglaries decreases by about 1 - e^{-0.03970} \approx 1 - 0.9611 = 0.04, or 4%, adjusting for mean ACT score and percentage of male students
  • \texttt{pct.male} For each one percentage point increase in male students, the estimated rate of burglaries decreases by about 1 - e^{-0.01429} \approx 1 - 0.9858 = 0.014, or 1.4%, adjusting for tuition and mean ACT score

R reports a t statistic. We use the value of the statistic and its p value as the result of a Wald-type test.
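
The relevant row of the coefficient table can be extracted directly for reporting (a small sketch).

Code
summary(school_model_4)$coefficients["tuitionthou", ]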

The data do not provide evidence of an association between tuition and burglary rate, adjusting for mean ACT score and percentage of male students, at the 5% level of significance.

9.2 Question

The file smoking_data.csv contains data from a study of death by lung disease among individuals from different age groups and with different smoking status.

The variables and their meaning and coding are shown in Table 2.

Table 2: Variables for this Problem
Variable Meaning and coding
\texttt{age} 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+ (9-level)
\texttt{smoke} cigarPipeOnly, cigaretteOnly, cigarettePlus, no (4-level)
\texttt{pop} population at risk
\texttt{dead} number of deaths by lung disease
  1. Import the data file as a data frame object or tibble object.

  2. Fit a Poisson regression model with the rate of death by lung disease (that is, deaths per population at risk) as the response and age group as the only predictor. Report the coefficient table and write down the fitted model.

  3. Based on the model that you fit in Question 2, is age associated with rate of death by lung disease? Conduct an appropriate test at the 5% significance level. Hint: Use a likelihood ratio test.

  4. With respect to the model that you fit in Question 2, is there evidence of overdispersion? If there is overdispersion, fit an appropriate model to account for the overdispersion. Report the coefficient table for this model.

  5. Interpret the antilogs of the coefficients for the \texttt{age45-49} and \texttt{age80+} dummy variables from the model that you fit in Question 4.

Code
lung <- import_dataset("smoking_data.csv")
Code
lung_model_1 <- glm(
  dead ~ age + offset(log(pop)),
  data = lung,
  family = poisson(link = "log")
)

summary(lung_model_1)

Call:
glm(formula = dead ~ age + offset(log(pop)), family = poisson(link = "log"), 
    data = lung)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.39572    0.05842 -58.125  < 2e-16 ***
age45-49     0.55603    0.07999   6.951 3.62e-12 ***
age50-54     0.98815    0.07681  12.864  < 2e-16 ***
age55-59     1.37145    0.06526  21.017  < 2e-16 ***
age60-64     1.62900    0.06254  26.049  < 2e-16 ***
age65-69     1.95715    0.06269  31.218  < 2e-16 ***
age70-74     2.20577    0.06410  34.409  < 2e-16 ***
age75-79     2.45779    0.06713  36.610  < 2e-16 ***
age80+       2.68749    0.07080  37.958  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4055.98  on 35  degrees of freedom
Residual deviance:  191.72  on 27  degrees of freedom
AIC: 449.75

Number of Fisher Scoring iterations: 4

We write the model for \mathbf{x} = \left( x_{1} , x_{2} , \ldots , x_{8}\right), the dummy variables for the 8 older age groups with the youngest group as reference, where \hat{\lambda} \left( \mathbf{x} \right) is the estimated mean number of deaths and t is the population at risk.

\log{\left( \hat{\lambda} \left( \mathbf{x} \right) \right)}= -3.40 + 0.56 \, x_{1} + 0.99 \, x_{2} + 1.37 \, x_{3} + 1.63 \, x_{4} + 1.96 \, x_{5} + 2.21 \, x_{6} + 2.46 \, x_{7} + 2.69 \, x_{8} + \log{\left( \text{pop} \right)}

Code
lung_int_model <- glm(
  dead ~ 1 + offset(log(pop)),
  data = lung,
  family = poisson(link = "log")
)
Code
lmtest::lrtest(
  lung_int_model, # Intercept only (reduced) model
  lung_model_1 # Initial full model
)
Likelihood ratio test

Model 1: dead ~ 1 + offset(log(pop))
Model 2: dead ~ age + offset(log(pop))
  #Df   LogLik Df  Chisq Pr(>Chisq)    
1   1 -2148.00                         
2   9  -215.87  8 3864.3  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see a large test statistic, G^{2}=3864.3, and we reject the null hypothesis of no association between age and rate of death by lung disease. These data provide evidence of an association between age and rate of death by lung disease.

We fit a model that compensates for overdispersion.

Code
lung_model_2 <- glm(
  dead ~ age + offset(log(pop)),
  data = lung,
  family = quasipoisson(link = "log")
)

summary(lung_model_2)

Call:
glm(formula = dead ~ age + offset(log(pop)), family = quasipoisson(link = "log"), 
    data = lung)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.3957     0.1539 -22.067  < 2e-16 ***
age45-49      0.5560     0.2107   2.639   0.0136 *  
age50-54      0.9881     0.2023   4.884 4.16e-05 ***
age55-59      1.3715     0.1719   7.979 1.42e-08 ***
age60-64      1.6290     0.1647   9.889 1.80e-10 ***
age65-69      1.9571     0.1651  11.852 3.29e-12 ***
age70-74      2.2058     0.1689  13.063 3.47e-13 ***
age75-79      2.4578     0.1768  13.899 8.03e-14 ***
age80+        2.6875     0.1865  14.411 3.38e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 6.938038)

    Null deviance: 4055.98  on 35  degrees of freedom
Residual deviance:  191.72  on 27  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

From the output, we see that \hat{\phi}= 6.938038 > 1. We have evidence of overdispersion. If we have to choose between the model fit in Question 2 and the quasi-Poisson model, then we should choose the quasi-Poisson model, since it uses the correct standard errors.

  • For \texttt{age45-49}, the rate of death by lung disease for those who are 45 - 49 years old is e^{0.556} = 1.75 times the rate of death by lung disease for those who are 40 - 44 years old
  • For \texttt{age80+}, the rate of death by lung disease for those who are 80 years and older is e^{2.69} = 14.73 times the rate of death by lung disease for those who are 40 - 44 years old (these antilogs are computed in the sketch below)
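
These antilogs can be computed directly from the quasi-Poisson model (a small sketch).

Code
exp(coef(lung_model_2)[c("age45-49", "age80+")])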