This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 International license.
There are many examples in public and clinical health in which we count the occurrences of events. We might count the number of mutations in one region of a chromosome, the number of visits to an emergency center in a day, or the number of births per month at an obstetric unit.
It arises naturally that there are factors that influence counts. If a count is our response variable (which is different from a binary or multi-level categorical variable), then these influencing factors can serve as explanatory variables, allowing us to consider generalized linear models. As with all generalized linear models, we still have three components: a random component, a systematic component, and a link function.
In order to build such models, we start by considering the Poisson distribution as a description of the random component.
Load the required libraries for Chapter 12. If these packages are not already installed, please do so before loading the libraries.
#library(DT)
library(rio)
library(hms)
library(dplyr)
library(tibbletime)
library(timetk)
library(lubridate)
library(ggplot2)
A random variable, Y, follows a Poisson distribution with parameter, \lambda, the mean number of events per interval, if the following hold: the numbers of events in disjoint intervals are independent of each other, events occur one at a time (not simultaneously), and the rate at which events occur is constant over the observation period.
If these assumptions hold, we write Y \sim \text{Poiss} \left( \lambda \right). The probability mass function of this discrete distribution is shown in Equation 1.
P \left( Y = y \right) = \frac{\lambda^{y} \, e^{-\lambda}}{y!} \tag{1}
The mean and variance of Y are shown in Equation 2.
\begin{align} &\text{E} \left( Y \right) = \lambda \\ &\text{Var} \left( Y \right) = \lambda \end{align} \tag{2}
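As a quick check of Equation 2 (a simulation sketch of our own, not part of the original text), the sample mean and sample variance of a large Poisson sample should both be close to \lambda.
set.seed(314) # hypothetical seed for reproducibility

sims <- rpois(100000, lambda = 27) # large sample from Poiss(27)
mean(sims) # should be close to 27
var(sims) # should also be close to 27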
Imagine then, as an illustrative example of the Poisson distribution, that an obstetric unit in a rural environment delivers an average of 27 babies per 24 hours. What is the probability that, on a given day, 30 babies are born?
The result is calculated in Equation 3, making use of Equation 1, where \lambda = 27 and Y, the number of babies born per day, satisfies Y \sim \text{Poiss} \left( 27 \right).
P \left( Y = 30 \right) = \frac{27^{30} \, e^{-27}}{30!} \tag{3}
We use the dpois function in R to do the calculation. The first argument is the value for Y = y and the second argument is the value for the parameter \lambda.
dpois(
30,
lambda = 27
)
[1] 0.06184461
The probability of 30 babies being born in one day at this unit is about 6.18%. In the bar plot below, we show the probability for the number of births on the interval 10 to 40 newborns per day.
babies <- 10:40

barplot(
  dpois(
    babies,
    lambda = 27
  ),
  names.arg = 10:40,
  main = "Probabilities over number of newborns per day",
  xlab = "Number of newborns per day",
  ylab = "Probability",
  col = "orange",
  las = 1,
  panel.first = grid()
)
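Cumulative probabilities follow from the ppois function (an addition of our own for illustration), which gives P(Y \le y); the complement gives the probability of at least a given number of births.
ppois(30, lambda = 27) # P(Y <= 30)
1 - ppois(29, lambda = 27) # P(Y >= 30)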
In building a model, our aim is to model the expected count of the response variable, that is, \text{E}(Y), as a function of p explanatory variables, x_{1} , x_{2} , \ldots , x_{p}. We do this by introducing the log link function, shown in Equation 4, where X is a design matrix (which includes a first column of ones).
\log{\left( \lambda \right)} = X \mathbf{\beta} = \beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p} \tag{4}
The model in Equation 4 is termed the Poisson log-linear model when all the explanatory variables x_{i} are categorical.
We estimate the coefficients using maximum likelihood estimation. Whereas the exponentiated slopes in a logistic regression model are odds ratios, here they are ratios of mean counts. We can also calculate predicted counts given values for the explanatory variables using simple algebra, the result of which is shown in Equation 5.
\lambda \left( x_{1} , \ldots , x_{p} \right) = e^{\beta_{0} + \beta_{1} x_{1} + \ldots + \beta_{p} x_{p}} \tag{5}
Throughout the next section, we will explore some of the functionality in R and in various libraries for working with dates. When the response variable is a count, it is very useful to be able to manage dates and times. To this end, the illustrative examples will be interspersed with learning about date and time functionality in R.
By definition, counts require a time interval. Given times, we are free to set an interval and summarize over periods (time intervals or time ranges). To illustrate this, we generate a data set of daily death counts in a critical care unit during the year 2020 (during the coronavirus disease 2019 pandemic).
days <- seq(as.Date("2020-04-05"), as.Date("2020-12-26"), by = "days") # Generating dates
ds <- seq(1, length(days), by = 1) # Sequence of days
counts <- floor(exp(ds / (length(ds) / 2))) # Formula for number of deaths on a specific day

dfr <- data.frame(
  ID = 1:length(days),
  Dates = days,
  Deaths = counts
)
#DT::datatable(dfr)
Data are recorded every day. The \texttt{Dates} column shows the day and the \texttt{Deaths} column shows the number of deaths recorded in the particular critical care unit per day. We can plot the daily number of deaths as a scatter plot.
plot(
  dfr$Dates,
  dfr$Deaths,
  main = "Average number of deaths per day",
  xlab = "Dates",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)
We can now aggregate (sum) over every week using the as_tbl_time and collapse_by functions from the tibbletime library. We pass "weekly" as the period and use the "start" value for the side argument to indicate that the start of each week is used as the row value.
dfr_weekly <- dfr %>%
  dplyr::arrange(Dates) %>%
  tibbletime::as_tbl_time(Dates) %>%
  tibbletime::collapse_by("weekly", side = "start", clean = T) %>%
  dplyr::group_by(Dates) %>%
  summarise(Deaths = sum(Deaths)) %>%
  dplyr::mutate(ID = 1:38)
#DT::datatable(dfr_weekly)
We plot the result.
plot(
  dfr_weekly$Dates,
  dfr_weekly$Deaths,
  main = "Average number of deaths per week",
  xlab = "Dates",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)
We can also do the same aggregation over every month.
dfr_monthly <- dfr %>%
  dplyr::arrange(Dates) %>%
  tibbletime::as_tbl_time(Dates) %>%
  tibbletime::collapse_by("monthly", side = "start", clean = T) %>%
  dplyr::group_by(Dates) %>%
  summarise(Deaths = sum(Deaths)) %>%
  mutate(ID = 1:9)
#DT::datatable(dfr_monthly)
A plot shows the aggregate of deaths for each month.
plot(
  dfr_monthly$Dates,
  dfr_monthly$Deaths,
  main = "Average number of deaths per month",
  xlab = "Dates",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)
There are many more libraries in R that allow us to manipulate dates and times.
Back to our illustrative example. We want to consider our response variable, Y (death counts), explained by a single, binary explanatory variable. We generate a new variable, Binary_bin, in our data frame object that serves as our binary explanatory variable x, with x=0 if an observation falls in the first half of the date range and x=1 if it falls in the second half.
dfr <- dfr %>%
  mutate(
    Binary_bin = case_when(
      ID <= floor(nrow(dfr) / 2) ~ "first",
      ID > floor(nrow(dfr) / 2) ~ "second"
    )
  )
#DT::datatable(dfr)
We set the explanatory variable as a factor.
dfr$Binary_bin <- factor(dfr$Binary_bin, levels = c("first", "second"))
Now, we create a model using the glm function. Note the use of the poisson value for the family keyword argument, with the link set to "log".
bin_model <- glm(
  Deaths ~ Binary_bin,
  data = dfr,
  family = poisson(link = "log")
)
summary(bin_model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.2687062 0.07580977 3.544479 3.933891e-04
Binary_binsecond 1.1617130 0.08686588 13.373641 8.621559e-41
If \lambda (x) = \text{E} \left( Y \, \vert \, x \right), then our model is as in Equation 6.
\begin{align} &\log{\left( \lambda (x) \right)} = \beta_{0} + \beta_{1} x \\ &\lambda (x) = e^{\beta_{0} + \beta_{1}x} \end{align} \tag{6}
Using the results from our model, we have Equation 7.
\begin{align} &\log{\left( \hat{\lambda} (x) \right)} = 0.2687062 + 1.1617130 x \\ &\hat{\lambda} (x) = e^{0.2687062 + 1.1617130 \, x} \end{align} \tag{7}
If we use x=0 (first half), we have that \hat{\lambda}(0) = e^{\hat{\beta}_{0}} = e^{0.2687062} \approx 1.308. This means that e^{\hat{\beta}_{0}} is the estimated mean number of deaths in the unexposed. In our example, the mean number of deaths in the unexposed (cases in the first half of the date range) is 1.308.
We can also interpret the slope. In Equation 8, we make use of the law of exponents and the fact that we have just noted that e^{\hat{\beta}_{0}} = \hat{\lambda} (0).
\begin{align} &\hat{\lambda} (1) = e^{\hat{\beta}_{0} + \hat{\beta}_{1}x} = e^{\hat{\beta}_{0}}e^{\hat{\beta}_{1}x} \\ &\hat{\lambda} (1) = \hat{\lambda} (0) e^{\hat{\beta}_{1} x} \end{align} \tag{8}
We interpret Equation 8 as stating that the estimated mean count in the exposed is e^{\hat{\beta}_{1}} times the mean count in the unexposed. The result for our illustrative example is shown in Equation 9.
\begin{align} &\hat{\lambda} (1) = 1.308 \times e^{1.1617130 \left( 1 \right)} \\ &\hat{\lambda} (1) \approx 4.180451 \end{align} \tag{9}
The estimated mean number of deaths given being in the second half of the date range is about 4.18. That is e^{1.161713} \approx 3.195402 times the estimated mean number of deaths in the unexposed.
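We can verify these interpretations against the data (a quick sketch using the objects created above): the exponentiated coefficients should reproduce the observed group means.
exp(coef(bin_model)) # mean deaths in "first" and ratio of means "second" / "first"
tapply(dfr$Deaths, dfr$Binary_bin, mean) # observed mean deaths in each half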
Note that there are three cases for the slope, \beta_{1}: if \beta_{1} > 0 (e^{\beta_{1}} > 1), the mean count in the exposed is larger than in the unexposed; if \beta_{1} = 0 (e^{\beta_{1}} = 1), the mean counts are equal and there is no association; and if \beta_{1} < 0 (e^{\beta_{1}} < 1), the mean count in the exposed is smaller.
The summary of the model above shows the result of a Wald test, with a Wald statistic, Z_{W}, and a p value.
In the code below, we create a new column for our data frame object called TriBin. It represents a 3-level categorical variable with the levels first if an observation is in April, May, or June, second if an observation is in July, August, or September, and third if an observation is in October, November, or December.
dfr <- dfr %>%
  mutate(TriBin = case_when(
    lubridate::month(Dates) %in% c(4, 5, 6) ~ "first",
    lubridate::month(Dates) %in% c(7, 8, 9) ~ "second",
    lubridate::month(Dates) %in% c(10, 11, 12) ~ "third"
  ))
With a 3-level categorical variable, we can create 3-1=2 dummy variables. We let x_{1} indicate that \texttt{TriBin} is second and x_{2} indicate that \texttt{TriBin} is third, each taking the value 1 (if in that bin) or 0 (otherwise).
Before generating the model, we set the levels.
dfr$TriBin <- factor(dfr$TriBin, levels = c("first", "second", "third"))
We build the model and show the coefficients and Wald test results.
tri_model <- glm(
  Deaths ~ TriBin,
  data = dfr,
  family = poisson(link = "log")
)
summary(tri_model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.386773e-15 0.1072113 -1.293496e-14 1.000000e+00
TriBinsecond 8.347977e-01 0.1273233 6.556517e+00 5.507893e-11
TriBinthird 1.600200e+00 0.1175348 1.361469e+01 3.275489e-42
The results above are used to write the fitted model, shown in Equation 10.
\begin{align} &\log{\left( \hat{\lambda} \left( x_{1} , x_{2} \right) \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} \\ &\hat{\lambda} \left( x_{1} , x_{2} \right) = e^{\hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2}} \\ \\ &\log{\left( \hat{\lambda} \left( x_{1} , x_{2} \right) \right)} = 0 + 0.8347977 x_{1} + 1.6002 x_{2} \\ &\hat{\lambda} \left( x_{1} , x_{2} \right) = e^{0 + 0.8347977 x_{1} + 1.6002 x_{2}} \end{align} \tag{10}
The interpretation is as follows. Since \hat{\beta}_{0} \approx 0, the estimated mean number of deaths per day in the first bin is e^{0} = 1. The estimated mean count in the second bin is e^{0.8347977} \approx 2.30 times that of the first bin, and the estimated mean count in the third bin is e^{1.6002} \approx 4.95 times that of the first bin.
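As before, we can check this interpretation by comparing the exponentiated coefficients with the observed means per bin (a sketch of our own, using the objects above).
exp(coef(tri_model)) # mean count in "first" and ratios for "second" and "third"
tapply(dfr$Deaths, dfr$TriBin, mean) # observed mean deaths per three-month bin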
In the case of our illustrative example, and for a single numerical explanatory variable, we have that Y is the number of deaths, \lambda (x) = \text{E} \left( Y \, \vert \, x \right), and x is the numerical explanatory variable. Our simple model will then be as in Equation 11.
\begin{align} &\log{\left( \lambda \left( x \right) \right)} = \beta_{0} + \beta_{1} x \\ &\lambda (x) = e^{\beta_{0} + \beta_{1} x} \end{align} \tag{11}
The interpretation of \beta_{0} shows that \lambda (0) = e^{\beta_{0}} is the mean count for cases where x=0. We interpret the slope by considering a one-unit difference in x, i.e. x=\nu+1 versus x=\nu, shown in Equation 12.
\begin{align} &\lambda \left( \nu + 1 \right) = e^{\beta_{0} + \beta_{1} \left( \nu + 1 \right)} \\ &\lambda \left( \nu + 1 \right) = e^{\beta_{0} + \beta_{1} \nu}e^{\beta_{1}} \\ &\text{but } e^{\beta_{0} + \beta_{1} \nu} = \lambda \left( \nu \right) \text{, therefore} \\ &\lambda \left( \nu + 1 \right) = \lambda \left( \nu \right) e^{\beta_{1}} \end{align} \tag{12}
We interpret Equation 12 by stating that for a one-unit increase in x, the estimated mean count changes by a factor of e^{\beta_{1}}. If \beta_{1} > 0 then the change is an increase and if \beta_{1} < 0 then the change is a decrease.
As with our derivation for a constant c-unit increase which we performed in the case of simple logistic regression, we have that a c-unit increase in x brings about a factor change of e^{c \beta_{1}}.
For our illustrative example, we can choose the actual day, with the first day of data collection as x=1. We create a model and show the coefficients and results of a Wald test.
num_model <- glm(
  Deaths ~ ID,
  data = dfr,
  family = poisson(link = "log")
)
summary(num_model)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.352732361 0.105001900 -3.359295 7.814159e-04
ID 0.008627318 0.000544697 15.838748 1.681362e-56
The model is written in Equation 13.
\begin{align} &\log{\left( \hat{\lambda} \left( x \right) \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\ &\hat{\lambda} (x) = e^{\hat{\beta}_{0} + \hat{\beta}_{1} x} \\ \\ &\log{\left( \hat{\lambda} \left( x \right) \right)} = -0.352732361 + 0.008627318 (x) \\ &\hat{\lambda} (x) = e^{-0.352732361 + 0.008627318 (x)} \end{align} \tag{13}
A one-day increase in the number of days since the start multiplies the estimated mean number of deaths by a factor of e^{\hat{\beta}_{1}} = e^{0.008627318} \approx 1.008665, an increase of about 0.87% per day.
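As a sketch of the c-unit rule from above (the choice of c = 7 days is ours, for illustration), the factor change over a week follows directly from the fitted slope.
beta_1 <- coef(num_model)["ID"] # fitted slope for the day number

exp(7 * beta_1) # factor change in mean deaths over a 7-day increase, about 1.06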
We can plot our model over the data.
plot(
  ds,
  dfr$Deaths,
  main = "Average number of deaths per day",
  xlab = "Days (as numerical variable)",
  ylab = "Death",
  las = 1,
  panel.first = grid()
)

lines(
  ds,
  exp(-0.352732361 + 0.008627318 * ds)
)
For the simple, single-variable cases above (binary and numerical explanatory variables), we can use a Wald test or a likelihood ratio test (the latter specifically for models with more than one explanatory variable, including the k-1 dummy-variable case).
We noted the results of Wald tests in the summary reports above, where we have the null hypothesis that \beta_{j} = 0 and a Wald statistic, as shown in Equation 14.
Z_{W} = \frac{\hat{\beta}_{j}}{\widehat{\text{SE}} \left( \hat{\beta}_{j} \right)} \tag{14}
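We can reproduce the z value column of a coefficient table directly from Equation 14 (a quick sketch, using bin_model from above).
coefs <- summary(bin_model)$coefficients

coefs[, "Estimate"] / coefs[, "Std. Error"] # matches the z value column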
The assumption for these models is that if Y \sim \text{Poiss} \left( \lambda \right) then \text{E} \left( Y \right) = \text{Var} \left( Y \right) = \lambda. Count data often have greater variance than assumed under a Poisson model. This greater variance is termed overdispersion; if it is ignored, the standard errors in the model summaries are too small, overstating the precision of the estimates.
There are several approaches to dealing with overdispersion. A simple approach is to assume a model that allows for the variance to be different from the mean by a constant factor, \phi, termed the dispersion factor, with \text{Var} \left( Y \right) = \phi \lambda. We state that when \phi > 1, then we have overdispersion.
We obtain an estimate, \hat{\phi}, using Equation 15, where n is the number of observations, and p is the number of slopes.
\hat{\phi} = \frac{\sum_{i=1}^{n} \left( \text{Pearson residual} \right)_{i}^{2}}{n-p-1} \tag{15}
If \hat{\phi}>1 then we have overdispersion and we multiply the standard error in Equation 14 by \sqrt{\hat{\phi}}, as shown below, where QP stands for quasi-Poisson.
\widehat{\text{SE}}_{\text{QP}} \left( \hat{\beta}_{j} \right) = \sqrt{\hat{\phi}} \left[ \widehat{\text{SE}} \left( \hat{\beta}_{j} \right) \right]
As an illustrative example, we consider the death_data.csv file. The file contains data from a study of the number of deaths due to acquired immunodeficiency syndrome (AIDS) per three-month period in Australia from January 1983 through June 1986.
aus <- import_dataset("death_data.csv")
The time frame contains data for 14 periods. The period is in the \texttt{period} column. We will use \texttt{period} as a numerical explanatory variable and create a simple model. The \texttt{count} variable indicates the number of deaths in each of the 14 periods.
aus_period_model <- glm(
  count ~ period,
  data = aus,
  family = poisson(link = "log")
)
summary(aus_period_model)
Call:
glm(formula = count ~ period, family = poisson(link = "log"),
data = aus)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.33963 0.25119 1.352 0.176
period 0.25652 0.02204 11.639 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 207.272 on 13 degrees of freedom
Residual deviance: 29.654 on 12 degrees of freedom
AIC: 86.581
Number of Fisher Scoring iterations: 5
From the summary of our numerical explanatory model above, we see an estimated standard error for \hat{\beta}_{1} of 0.02204. We also see that the dispersion parameter was taken to be 1.
We can calculate the Pearson residual for each observation using the resid function. We show the first five such residuals.
reds <- resid(aus_period_model, type = "pearson")
reds[1:5]
1 2 3 4 5
-1.3472679 -0.8787483 -0.5926473 -0.4640352 -1.8060830
In the code below we calculate \hat{\phi} by using Equation 15.
phi_hat <- sum(reds^2) / (length(reds) - 1 - 1)
phi_hat
[1] 2.403942
The standard error for \hat{\beta}_{1} should be multiplied by \sqrt{\hat{\phi}}.
By using the quasipoisson keyword as the value for the family argument, R returns a model that compensates for overdispersion.
aus_period_model_quasi <- glm(
  count ~ period,
  data = aus,
  family = quasipoisson(link = "log")
)
summary(aus_period_model_quasi)
Call:
glm(formula = count ~ period, family = quasipoisson(link = "log"),
data = aus)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.33963 0.38946 0.872 0.4
period 0.25652 0.03417 7.507 7.17e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasipoisson family taken to be 2.403942)
Null deviance: 207.272 on 13 degrees of freedom
Residual deviance: 29.654 on 12 degrees of freedom
AIC: NA
Number of Fisher Scoring iterations: 5
We see the new standard error, which we confirm below, by multiplying the square root of the dispersion factor by the original standard error for \hat{\beta}_{1}.
sqrt(phi_hat) * 0.02204
[1] 0.03417225
From the summary of the model, we note that after adjustment, the variability in \hat{\beta}_{1} is described by a t distribution.
Other languages and software use a Wald test statistic, shown in Equation 16, together with a confidence interval for the coefficient, \hat{\beta}_{j}.
\begin{align} &Z_{W} = \frac{\hat{\beta}_{j}-0}{\widehat{\text{SE}}_{\text{QP}} \left( \hat{\beta}_{j} \right)} \\ &\hat{\beta}_{j} \pm Z_{\frac{\alpha}{2}} \, \widehat{\text{SE}}_{\text{QP}} \left( \hat{\beta}_{j} \right) \end{align} \tag{16}
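A sketch of Equation 16 in R (using aus_period_model_quasi from above) for the slope of \texttt{period}:
qp_coefs <- summary(aus_period_model_quasi)$coefficients

qp_coefs["period", "Estimate"] / qp_coefs["period", "Std. Error"] # Wald statistic

# Approximate 95% Wald confidence interval for the coefficient
qp_coefs["period", "Estimate"] + c(-1, 1) * qnorm(0.975) * qp_coefs["period", "Std. Error"]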
There are considerations that we must be aware of when correcting for overdispersion. After correction, the estimates are not obtained by maximizing a legitimate likelihood function, so a likelihood ratio test cannot be used. Wald test statistics and confidence intervals can still be calculated, though.
An alternative to consider is the negative binomial regression model, which uses the negative binomial distribution for the random component. Under this distribution, the variance is allowed to differ from the mean.
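As a minimal sketch of such a fit (our own illustration, using the MASS package, which the original text does not load), we refit the Australian AIDS data with a negative binomial model.
# MASS::glm.nb fits a negative binomial GLM; we call it with the namespace
# prefix to avoid masking dplyr::select, which is used later in the chapter
aus_nb_model <- MASS::glm.nb(count ~ period, data = aus)
summary(aus_nb_model)$coefficients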
The last model type that we consider is used for rate data, expressed as a number per unit. As an illustrative example, we consider a study conducted in Denmark that collected data on cases of lung cancer for different age groups in different cities.
For this example, we can model the rate of lung cancer cases as a function of age and city. The data for this study are in the lung_cancer_data.csv file, which we import below.
denmark <- import_dataset("lung_cancer_data.csv")

#DT::datatable(denmark)
We note that we have columns for \texttt{cases} and \texttt{population}. Our response count is Y, which depends on the population size, which we denote by t. We therefore work with a sample rate, shown in Equation 17, together with the mean value of the rate.
\begin{align} &\frac{Y}{t} \\ \\ &\frac{E \left( Y \right)}{t} = \frac{\lambda}{t} \end{align} \tag{17}
A log-linear model for the rate in Equation 17 with single explanatory variable x_{1} is shown in Equation 18.
\log{\left( \frac{\lambda(x_{1})}{t} \right)} = \beta_{0} + \beta_{1} x_{1} \tag{18}
Simple algebra expresses the mean count from this model, as derived in Equation 19.
\begin{align} &e^{\log{\left( \frac{\lambda(x_{1})}{t} \right)}} = e^{\beta_{0} + \beta_{1} x_{1}} \\ &\lambda \left( x_{1} \right) = t \, e^{\beta_{0} + \beta_{1} x_{1}} \end{align} \tag{19}
In Equation 20, we show that e^{\beta_{1}} is a rate ratio.
\frac{\lambda \left( 1 \right)}{\lambda \left( 0 \right)} = \frac{e^{\beta_{0} + \beta_{1} (1)}}{e^{\beta_{0} + \beta_{1} (0)}} = e^{\beta_{1}} \tag{20}
Equation 20 can be rewritten as in Equation 21, showing that the rate of event occurrences for those with x_{1} = 1 is e^{\beta_{1}} times the rate of event occurrences for those with x_{1} = 0.
\frac{\lambda \left( 1 \right)}{t} = e^{\beta_{1}} \frac{\lambda \left( 0 \right)}{t} \tag{21}
Using the laws of logarithms, we can also express Equation 18 as the difference in logs, shown in Equation 22.
\begin{align} &\log{\left( \lambda \left( x_{1} \right) \right)} - \log{\left( t \right)} = \beta_{0} + \beta_{1} x_{1} \\ &\log{\left( \lambda \left( x_{1} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \log{\left( t \right)} \end{align} \tag{22}
The term \log{\left( t \right)} is the offset. Assuming that Y follows a Poisson distribution, Equation 22 shows a Poisson regression model with explanatory variables x_{1} and \log{\left( t \right)}, where the coefficient of the latter is fixed at 1, shown in Equation 23.
\begin{align} &\log{\left( \lambda \left( x_{1} , \log{\left( t \right)} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} \log{\left( t \right)} \\ &\log{\left( \lambda \left( x_{1} , \log{\left( t \right)} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \log{\left( t \right)}, \text{ with } \beta_{2} = 1 \end{align} \tag{23}
Below, we generate a model, where Y is the number of lung cancer cases in the \texttt{cases} variable (column), t is the population size in the \texttt{population} variable, and the four levels of the \texttt{city} variable indicate one of four cities, Fredericia (x_{1}), Horsens (x_{2}), Kolding (x_{3}), and Vejle, the latter chosen as the reference city. We also include the age midpoint, in the \texttt{age_midpt} variable (x_{4}).
With \mathbf{x} = \left( x_{1} , x_{2} , x_{3} , x_{4} \right) and E \left( Y \, \vert \, \mathbf{x} \right) = \lambda \left( \mathbf{x} \right), we fit a model as shown in Equation 24.
\log{\left( \lambda \left( \mathbf{x} \right) \right)} = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \log{\left( t \right)} \tag{24}
denmark$city <- factor(denmark$city, levels = c("Vejle", "Fredericia", "Horsens", "Kolding"))
denmark_model <- glm(
  cases ~ city + age_midpt + offset(log(population)),
  data = denmark,
  family = poisson(link = "log")
)
summary(denmark_model)
Call:
glm(formula = cases ~ city + age_midpt + offset(log(population)),
family = poisson(link = "log"), data = denmark)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.228237 0.440333 -18.686 <2e-16 ***
cityFredericia 0.246016 0.187769 1.310 0.190
cityHorsens -0.059263 0.191972 -0.309 0.758
cityKolding -0.104262 0.198081 -0.526 0.599
age_midpt 0.056759 0.006527 8.696 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 46.448 on 19 degrees of freedom
AIC: 152.84
Number of Fisher Scoring iterations: 5
We see the fitted model in Equation 25.
\begin{align} &\log{\left( \hat{\lambda} \left( \mathbf{x} \right) \right)} = -8.228237 + 0.246016 \, x_{1} - 0.059263 \, x_{2} - 0.104262 \, x_{3} + 0.056759 \, x_{4} + \log{\left( t \right)} \end{align} \tag{25}
If we consider x_{1}, we can state that the rate of lung cancer cases in Fredericia is e^{0.246016} \approx 1.28 times the rate in Vejle, that is, about 28% higher, adjusting for age.
If we consider x_{4}, we can state that for each 1-year increase in age, the rate of lung cancer cases increases by a factor of e^{0.056759} \approx 1.06, that is, by about 6%, adjusting for city.
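The fitted rate ratios and approximate intervals can be read off directly (a sketch using denmark_model from above; confint.default gives Wald intervals on the coefficient scale).
exp(coef(denmark_model)) # rate ratios relative to the reference levels
exp(confint.default(denmark_model)) # approximate 95% Wald intervals as ratios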
The models are log-linear, but as can be seen from the plot below, the observed log rates are not linear in age at all.
denmark <- denmark %>%
  mutate(cancer_rate = cases / population)
ggplot(
  denmark,
  aes(
    x = age_midpt,
    y = log(cancer_rate),
    color = city
  )
) +
  geom_point() +
  geom_line() +
  labs(
    title = "Cancer rate (cases per population size) for each city"
  ) +
  xlab("Age") +
  ylab("Log of rate")
One way to obtain the predictions with ease is to create the dummy variables first, before creating the model. We do so below.
fredericia <- ifelse(denmark$city == "Fredericia", 1, 0)
horsens <- ifelse(denmark$city == "Horsens", 1, 0)
kolding <- ifelse(denmark$city == "Kolding", 1, 0)

denmark_alt_model <- glm(
  cases ~ fredericia + horsens + kolding + age_midpt + offset(log(population)),
  data = denmark,
  family = poisson(link = "log")
)
summary(denmark_alt_model)
Call:
glm(formula = cases ~ fredericia + horsens + kolding + age_midpt +
offset(log(population)), family = poisson(link = "log"),
data = denmark)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.228237 0.440333 -18.686 <2e-16 ***
fredericia 0.246016 0.187769 1.310 0.190
horsens -0.059263 0.191972 -0.309 0.758
kolding -0.104262 0.198081 -0.526 0.599
age_midpt 0.056759 0.006527 8.696 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 46.448 on 19 degrees of freedom
AIC: 152.84
Number of Fisher Scoring iterations: 5
The model is exactly the same, but now it is easier to supply the data to the predict function.
pred <- predict(
  denmark_alt_model,
  denmark %>% select(
    all_of(fredericia),
    all_of(horsens),
    all_of(kolding),
    age_midpt,
    population
  ),
  type = "response"
) / denmark$population
denmark <- denmark %>%
  mutate(predicted = pred)
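Note that the same predicted rates can be obtained from the original factor-based model (a sketch of our own; the two models are identical parameterizations and give identical fitted values).
pred_alt <- predict(denmark_model, type = "response") / denmark$population

all.equal(unname(pred_alt), unname(denmark$predicted)) # should be TRUE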
Now we plot the predicted rates and verify that the model is linear on the log scale.
ggplot(
  denmark,
  aes(
    x = age_midpt,
    y = log(predicted),
    color = city
  )
) +
  geom_point() +
  geom_line() +
  labs(
    title = "Predicted cancer rate (cases per population size) for each city"
  ) +
  xlab("Age") +
  ylab("Log of predicted rate")
We have to consider all the tools for model evaluation that we have learned about in this course. Below, we fit an extra term, which is the square of the age.
denmark <- denmark %>%
  mutate(age_midpt_squared = age_midpt^2)
We create a new model by adding this term.
denmark_new_model <- glm(
  cases ~ fredericia + horsens + kolding + age_midpt + age_midpt_squared + offset(log(population)),
  data = denmark,
  family = poisson(link = "log")
)
summary(denmark_new_model)
Call:
glm(formula = cases ~ fredericia + horsens + kolding + age_midpt +
age_midpt_squared + offset(log(population)), family = poisson(link = "log"),
data = denmark)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.173e+01 3.096e+00 -7.017 2.27e-12 ***
fredericia 2.740e-01 1.878e-01 1.459 0.145
horsens -5.864e-02 1.920e-01 -0.305 0.760
kolding -1.015e-01 1.981e-01 -0.512 0.608
age_midpt 5.097e-01 1.021e-01 4.991 6.01e-07 ***
age_midpt_squared -3.701e-03 8.298e-04 -4.461 8.17e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 129.908 on 23 degrees of freedom
Residual deviance: 26.017 on 18 degrees of freedom
AIC: 134.41
Number of Fisher Scoring iterations: 5
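The residual deviance drops from 46.448 to 26.017 for one additional degree of freedom, which suggests the squared term helps. A likelihood ratio test comparing the two nested models (a sketch of our own, using anova from base R) makes this comparison formal.
anova(denmark_alt_model, denmark_new_model, test = "LRT") # likelihood ratio test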
We add predictions using the new model and then visualize the new model.
pred_new <- predict(
  denmark_new_model,
  denmark %>% select(
    all_of(fredericia),
    all_of(horsens),
    kolding,
    age_midpt,
    age_midpt_squared,
    population
  ),
  type = "response"
) / denmark$population
Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
ℹ Please use `all_of()` or `any_of()` instead.
# Was:
data %>% select(kolding)
# Now:
data %>% select(all_of(kolding))
See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
denmark <- denmark %>%
  mutate(predicted_new = pred_new)
ggplot(
  denmark,
  aes(
    x = age_midpt,
    y = log(predicted_new),
    color = city
  )
) +
  geom_point() +
  geom_line() +
  labs(
    title = "Predicted cancer rate (cases per population size) for each city"
  ) +
  xlab("Age") +
  ylab("Log of predicted rate")
We can view each city in its own plot.
denmark %>%
  ggplot(
    aes(x = age_midpt)
  ) +
  geom_point(aes(y = log(cancer_rate), col = "cancer_rate")) +
  geom_line(aes(y = log(cancer_rate), col = "cancer_rate")) +
  geom_point(aes(y = log(predicted_new), col = "predicted_new")) +
  geom_line(aes(y = log(predicted_new), col = "predicted_new")) +
  labs(
    title = "Actual (red) and predicted log of rates (green) for each city"
  ) +
  xlab("Age midpoint") +
  ylab("Log of rate") +
  facet_grid(
    ~ city
  ) +
  theme(legend.position = "none")
The file campuscrime_data.csv contains the number of burglaries reported at a collection of 47 U.S. public universities with over 10,000 students in the year 2009. In addition, explanatory variables are included that may explain differences in crime rates, including the total number of students, percentage of men, average ACT test scores, and tuition.
The variables of interest are shown in Table 1.
| Variable | Meaning and coding |
|---|---|
| \texttt{burg09} | Number of burglaries |
| \texttt{tuition} | Cost of tuition (in dollars) |
| \texttt{act.comp} | Average ACT score |
| \texttt{pct.male} | Percentage of male students |
| \texttt{total} | Number of students |
Import the data file as a data frame or tibble object.
Scale the tuition variable by 1000 to create a new variable \texttt{tuitionthou} that gives the tuition in thousands of dollars. Then fit a Poisson regression model with \texttt{burg09} as the response and with \texttt{tuitionthou}, \texttt{act.comp}, \texttt{pct.male}, and \texttt{total} as the predictors. Report the coefficient table from R. Write the fitted model.
Interpret the coefficients of \texttt{tuitionthou} and \texttt{pct.male} in terms of the predicted mean number of burglaries.
Is there evidence of overdispersion? If so, fit an appropriate model to account for the overdispersion. Report the coefficient table for this model.
It may make sense to include \log{(\text{total})} as an offset in the model. Refit the last model above with \log{(\text{total})} as a predictor (rather than \texttt{total}). Construct an approximate Wald 95% CI for the coefficient for the \log{(\text{total})} predictor. How can we use this interval to make a decision about whether it makes sense to include \log{(\text{total})} as an offset? Does it make sense to include \log{(\text{total})} as an offset? HINT: This question shows how to use an offset when considering counts. If the confidence interval includes 1, then it makes sense to use \log{(\text{total})} as an offset (since its coefficient may be 1).
Refit the model from Question 5, but include \log{(\text{total})} as an offset rather than as a predictor. Interpret the antilogs of the coefficients for the \texttt{tuitionthou} and \texttt{pct.male} predictors.
Based on the model that you fit in Question 6, is \texttt{tuitionthou} associated with the rate of burglaries in 2009, adjusting for mean ACT score and percentage of male students? Use a 5% significance level to test this association.
The file smoking_data.csv contains data from a study of death by lung disease among individuals from different age groups and with different smoking statuses.
The variables and their meaning and coding are shown in Table 2.
| Variable | Meaning and coding |
|---|---|
| \texttt{age} | 40-44, 45-49, 50-54, 55-59, 60-64, 65-69, 70-74, 75-79, 80+ (9-level) |
| \texttt{smoke} | cigarPipeOnly, cigaretteOnly, cigarettePlus, no (4-level) |
| \texttt{pop} | population at risk |
| \texttt{dead} | number of deaths by lung disease |
Import the data file as a data frame object or tibble object.
Fit a Poisson regression model with rate of death by lung disease (that is, by population at risk) as the response and age group as the only predictor. Report the coefficient table and write down the fitted model.
Based on the model that you fit in Question 2, is age associated with rate of death by lung disease? Conduct an appropriate test at the 5% significance level. Hint: Use a likelihood ratio test.
With respect to the model that you fit in Question 2, is there evidence of overdispersion? If there is overdispersion, fit an appropriate model to account for the overdispersion. Report the coefficient table for this model.
Interpret the antilogs of the coefficients for the \texttt{age45-49} and \texttt{age80+} dummy variables from the model that you fit in Question 3.