Code
library(rio)
library(dplyr)
library(lmtest)
library(ResourceSelection)
This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International
In the previous chapter we explored ways of determining model building, deciding on which explanatory variables to use in our final model.
In this chapter, we explore how well our model or models explain or describe the data that we have, and if any observations influence the model estimates more than others.
There are two ways to assess model fit depending on whether we have group level or individual level data. While it is more common in many areas of research to generate models using individual level data, we begin our exploration with group level data.
Load the required libraries for Chapter 9. If these packages are not already installed, please do so before loading the libraries.
library(rio)
library(dplyr)
library(lmtest)
library(ResourceSelection)
A group level data set is one in which the number of unique predictor profiles is smaller (often much smaller) than the number of observations. A predictor profile describes a unique observation in a data set (a row vector of values for all the variables, an example of which is in Table 1).
As an example we have a binary outcome \texttt{low infant birth weight} with levels low, s, and not low, f, explained by mother’s age, \texttt{Age} (in the interval [14,18]), and smoking status, \texttt{Smoke}, with levels Yes and No.
We can assess success and failure for each combination of age and smoking status as shown in the table below, where we can treat the age variable as a categorical variable.
Smoke | Age | Successes | Failures |
---|---|---|---|
Yes | 14 | s_{1} | f_{1} |
No | 14 | s_{2} | f_{2} |
Yes | 15 | s_{3} | f_{3} |
No | 15 | s_{4} | f_{4} |
Yes | 16 | s_{5} | f_{5} |
No | 16 | s_{6} | f_{6} |
Yes | 17 | s_{7} | f_{7} |
No | 17 | s_{8} | f_{8} |
Yes | 18 | s_{9} | f_{9} |
No | 18 | s_{10} | f_{10} |
In R we will use the glm
function as follows for a logistic regression model, where success and failure are two element vectors of outcomes.
glm(cbind(success, failure) ~ x1 + x2), family = binomial(link = "logit"), data = data
An individual level data set is one in which the number of unique predictor profiles and outcome combinations equal (or closely equal) the number of observations. The number of successes and failures for each combination of explanatory variables are typically very small.
As an example, we can add years of education as another explanatory variable to the example above. The number of successes and failures for every combination of explanatory variables will now be very small. There simply won’t be many observations with the same explanatory variable profile.
In R, we use the usual formula that we have been using up until now in cases of individual level data.
Before building a model, we do have data, which we can feed to the model, termed the working model, and evaluate what the prediction would be for each observation given that we do know what the outcome variable values are.
A saturated model, S, is a model that we can construct that perfectly predicts the data (group level). We can measure a working model against a saturated model.
The maximum likelihood of data under the saturated model, L_{S}, is larger than that of any other model that is nested within S. This then also holds for the log of the maximum likelihood of the data, \log{(L_{S})}.
A saturated model will treat each explanatory variable as a categorical type variable and includes all the main effects and all possible interaction terms.
A saturated model has as many parameters (intercept and slopes) as there are unique combinations of the levels of the explanatory models (predictor profiles) and outcomes in the data.
For group level data, the group saturated model provides perfect prediction of the observed proportion of successes for each unique combination of explanatory variable levels. As mentioned, saturated models treat each predictor as categorical (even if numerical) and includes all main effects and possible interaction terms.
The saturated model for individual level data cannot be used to assess model fit as with group level data, and will not be discussed. It could have as many parameters as observations.
As illustrative example of a group level data set, we explore a model that uses a mother’s age (as a categorical variable with levels 14, 15, 16, 17, 18) as predictor of her infant’s birth weight, Y, which can be low, 1, or not low, 0.
The dummy variables for age would then correspond to 15 through 18, assigned to x_{1} through x_{4}. The model for low birth weight is as shown in Equation 1, for \pi (\mathbf{X}) = P(Y \vert x_{1} , x_{2} , x_{3} , x_{4}).
\text{logit} ( \pi ( \mathbf{x} ) ) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} \tag{1}
The data is summarized in Table 2.
Age | Sample size | Y=1 | Probability (proportion of success) |
---|---|---|---|
14 | 15 | 6 | 6 \div 15 = 0.40 |
15 | 67 | 30 | 30 \div 67 = 0.45 |
16 | 125 | 44 | 44 \div 125 = 0.35 |
17 | 71 | 21 | 21 \div 71 = 0.30 |
18 | 20 | 5 | 5 \div 20 = 0.25 |
Using R, we find the model to be \text{logit}(\hat{\pi} (\mathbf{x})) = -0.4055 + 0.1957 x_{1} - 0.2048 x_{2} - 0.4620 x_{3} - 0.6931 x_{4}, as calculated in the code below.
We start by importing the data. There are rows with missing data, which we delete by using the dplyr filter
function with the %in%
keyword. We also create a vector of the age values, assigned to the variable age_levels
.
<- import_dataset("sim_birthweight.csv")
dfr <- dfr %>%
dfr filter(age %in% 14:18) # Drop missing data
= 14:18 age_levels
We have to create a table of profiles, using the table
function. We assign this table to the variable dfr_table
.
<- table(
dfr_table Age = dfr$age,
LBW = dfr$lbw
)
dfr_table
LBW
Age 0 1
14 9 6
15 37 30
16 81 44
17 50 21
18 15 5
The first column in the table above represents the maternal age levels. Each row is a profile. Remember that all variables are treated as categorical variables when using profiles. The second column counts the number of failures from the response variable for each profile. The last column is the frequency of successes.
Now we create a model using the table as response variable. We reverse the order of the failure and success frequency columns using indexing. Using the formula
keyword argument, we set our variable age_level
(as a factor using the factor
function) to represent the explanatory variable. Note that our data only contains five rows of data. We do not pass the dfr
data frame as argument to the glm
function, as we are using the group level data that we created.
<- glm(
model_s formula = dfr_table[, 2:1] ~ factor(age_levels),
family = binomial(link = "logit")
)
summary(model_s)
Call:
glm(formula = dfr_table[, 2:1] ~ factor(age_levels), family = binomial(link = "logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4055 0.5270 -0.769 0.442
factor(age_levels)15 0.1957 0.5815 0.337 0.736
factor(age_levels)16 -0.2048 0.5593 -0.366 0.714
factor(age_levels)17 -0.4620 0.5877 -0.786 0.432
factor(age_levels)18 -0.6931 0.7379 -0.939 0.348
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.7103e+00 on 4 degrees of freedom
Residual deviance: 2.7978e-14 on 0 degrees of freedom
AIC: 30.736
Number of Fisher Scoring iterations: 3
Using the estimates from our results, we create a table below, that shows perfect prediction in Table 3.
Age | Sample size | Y=1 | Proportion | Saturated model \hat{\pi} |
---|---|---|---|---|
14 | 15 | 6 | 0.40 | \frac{e^{-0.4055}}{1 + e^{-0.4055}} = 0.40 |
15 | 67 | 30 | 0.45 | \frac{e^{-0.4055 + 0.1957}}{1 + e^{-0.4055 + 0.1957}} = 0.45 |
16 | 125 | 44 | 0.35 | \frac{e^{-0.4055 - 0.2048}}{1 + e^{-0.4055 - 0.2048}} = 0.35 |
17 | 71 | 21 | 0.30 | \frac{e^{-0.4055 - 0.4620}}{1 + e^{-0.4055 - 0.4620}} = 0.30 |
18 | 20 | 5 | 0.25 | \frac{e^{-0.4055 - 0.6931}}{1 + e^{-0.4055 - 0.6931}} = 0.25 |
We use code to confirm the probabilities using the data.
<- apply(dfr_table, 1, sum) # Sum the success encoded as 1
SS
cbind(SS, dfr_table[, 2], round(dfr_table[,2] / SS, 2)) # Bind the two row vectors
SS
14 15 6 0.40
15 67 30 0.45
16 125 44 0.35
17 71 21 0.30
18 20 5 0.25
The \log{( L_{S} )}=-10.368 for this saturated model. This value is larger than the log likelihood of any model that is nested in the saturated model.
In Table 4, we see the summary of a second example, now with two binary predictor variables, and hence, four unique combinations. The two binary explanatory variables are race ( black and white ) and AZT use ( Yes and No ). These variables are used as predictors for the binary outcome variable AIDS symptoms.
Profile | Race | AZT use | AIDS symptoms | Sample size | Proportion |
---|---|---|---|---|---|
1 | White | Yes | 14 | 107 | 0.131 |
2 | White | No | 32 | 113 | 0.283 |
3 | Black | Yes | 11 | 63 | 0.175 |
4 | Black | No | 12 | 55 | 0.218 |
If x_{1} is AZT use, and x_{2} is race, then the group level saturated logistic regression model with interaction term is \text{logit}(\pi ( \mathbf{x}) ) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{1} x_{2}. There are four parameters and four unique predictor profiles.
The model is \text{logit}(\pi ( \mathbf{x}) ) = -1.2763 - 0.2771 x_{1} + 0.3476 x_{2} - 0.6878 x_{1} x_{2}. Substitution will yield the same proportions as in Table 4 and is shown in Table 5.
Profile | Proportion | Saturated model for \hat{\pi} |
---|---|---|
1 | 0.131 | \frac{e^{-1.2763-0.2771+0.3476-0.687}}{1+e^{-1.2763-0.2771+0.3476-0.687}} = 0.131 |
2 | 0.283 | \frac{e^{-1.2763+0.3476}}{1+e^{-1.2763+0.3476}} = 0.283 |
3 | 0.175 | \frac{e^{-1.2763-0.2771}}{1+e^{-1.2763-0.2771}} = 0.175 |
4 | 0.218 | \frac{e^{-1.2763}}{1+e^{-1.2763}} = 0.218 |
The \log{( \pi ( \mathbf{x}) )} = - 8.738 and is larger than the log likelihood of any model nested within the saturated model.
While it may seem as if a saturated model is best to use, we quickly run into problems as the number of explanatory variables increase. If we have four binary explanatory variables for instance, we would have 2^{4}=16 parameters.
The saturated model, instead, is used as a benchmark to compare nested models that are contained within it.
Suppose then, that we have a nested model, M, within the group saturated model, S.
If M fits the data well, then \log{ L_{M} } will be close to \log{ L_{S} }, otherwise, if it does not fit the data well, the values will not be close.
Instead of simply considering the difference, we use the deviance, shown in Equation 2.
\text{Deviance} = G_{M}^{2} = -2 [ \log{ L_{M} } - \log{ L_{S} } ] \tag{2}
A large deviance (large difference between M and S) indicates that M does not fit the data well and a small deviance indicates that M fits the data well. Note that the formula in Equation 2 suggests a likelihood ratio test (LRT).
There is a spectrum of models from L_{0}, a model with only the intercept (largest deviance), up to the saturated model (lowest deviance). (The null deviance is G_{0}^{2} = -2 [ \log{ L_{0} } - \log{ L_{S} } ].)
A goodness of fit test (GOF) is a test that uses the deviance statistic, G_{M}^{2}, to evaluate a nested model M. The GOF test is a LRT with the null hypothesis that M has adequate fit vs. the alternative hypothesis, that is does not have a good fit, with the test statistic elucidated in Equation 3.
G_{GOF}^{2} = -2 [ \log{ ( L_{M} ) } - \log{ ( L_{S} ) } ] \sim \chi^{2}_{\text{df}} \text{ under } H_{0} \tag{3}
(Note that we do NOT want to reject the null hypothesis. Our aim is good model fit after all.)
In Equation 3, df is the number of parameters in S minus the number of parameters in M, referred to as the residual degrees of freedom. Note that we can use the distribution of the test statistics only if there is a finite number of predictor patterns and the fitted count for each predictor pattern is at least 5. As with other tests using a \chi^{2} distribution approximation, a large test statistic indicates a small p value and rejection of H_{0} (evidence that M has poor fit).
As illustrative example, we return to the model of birth weight given age, in which we viewed age as a k-level categorical variable. A simpler model, M, views age as a numerical explanatory variable with M: \; \text{logit} ( \pi ( \text{age} )) = \beta_{0} + \beta_{1} (\text{age}), created below. Note that M is nested within S.
# Working model M
<- glm(
model_m formula = dfr_table[, 2:1] ~ age_levels,
family = binomial(link = "logit")
)
summary(model_m)
Call:
glm(formula = dfr_table[, 2:1] ~ age_levels, family = binomial(link = "logit"))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.4851 2.0472 1.702 0.0887 .
age_levels -0.2547 0.1279 -1.992 0.0464 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4.71033 on 4 degrees of freedom
Residual deviance: 0.66599 on 3 degrees of freedom
AIC: 25.402
Number of Fisher Scoring iterations: 3
Finally, we use the lrtest
function to perform the LRT given the two models.
::lrtest(model_m, model_s) lmtest
Likelihood ratio test
Model 1: dfr_table[, 2:1] ~ age_levels
Model 2: dfr_table[, 2:1] ~ factor(age_levels)
#Df LogLik Df Chisq Pr(>Chisq)
1 2 -10.701
2 5 -10.368 3 0.666 0.8812
We see G^{2}_{\text{GOF, df=}3} = \text{ deviance } = 0.666, with a p value of 0.88. For \alpha = 0.05, we fail to reject the null hypothesis and state that the simpler model fits the data well.
In R the model has attributes null.deviance
and deviance
that return the null deviance and the model deviance respectively.
$null.deviance model_s
[1] 4.710326
$deviance model_s
[1] 2.797762e-14
Note that the deviance for the saturated model is 0, as expected. We do the same for the working model.
$null.deviance model_m
[1] 4.710326
$deviance model_m
[1] 0.6659857
The null deviance is the same. We see the deviance for the working model, which is G^{2}_{\text{GOF}}, that was returned by the lrtest
function.
We do not have to create a saturated model and use the lrtest
function in R. We simply create the working model and return the deviance
attribute, which we use as argument in the pchisq
function to return the p value for the test statistic.
pchisq(model_m$deviance, df = 3, lower.tail = FALSE)
[1] 0.8811738
We cannot use the goodness of fit test for individual level data. Instead, we use the Hosmer-Lemeshow (LM) test.
The null hypothesis and the alternative hypothesis is the same as for the GOF test. If we fail to reject the null hypothesis, then there is no evidence that our working model, M, has poor fit.
There are six steps in conducting the Hosmer-Lemeshow test
\text{HL} = \sum_{k=1}^{g} \frac{( O_{k} - E_{k})^{2}}{n_{k}\overline{\hat{\pi}}_{k} (1 - \overline{\hat{\pi}}_{k})} \tag{4}
We note that \text{HL} \sim \chi^{2}_{g-2} under the null hypothesis. We fail to reject the null hypothesis when HL is small. In this case we state that the model fits the data well.
As illustrative example, we review the supplemental private insurance data set.
<- import_dataset("mus14data.csv") dfr
We fit a model assigned to the variable model_hrs
, with household income (numerical), the number of chronic disease (numerical), marital status (binary), and age (numerical) as explanatory variables for the response indicating the purchase of supplemental healthcare insurance.
<- glm(
model_hrs formula = private ~ hhincome + chronic + married + age,
data = dfr,
family = binomial(link = "logit")
)
summary(model_hrs)
Call:
glm(formula = private ~ hhincome + chronic + married + age, family = binomial(link = "logit"),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1082974 0.6860336 -1.616 0.106
hhincome 0.0052360 0.0008433 6.209 5.33e-10 ***
chronic -0.0379298 0.0267946 -1.416 0.157
married 0.5649570 0.0916783 6.162 7.17e-10 ***
age 0.0009817 0.0102040 0.096 0.923
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4279.5 on 3205 degrees of freedom
Residual deviance: 4151.4 on 3201 degrees of freedom
AIC: 4161.4
Number of Fisher Scoring iterations: 4
We see values for the null and residual deviance in the summary table. These cannot be used for this individual level data set. Note that large value for the degrees of freedom, indicating 3206 unique combinations of these four explanatory variables. A corresponding saturated model would have 3206 parameters.
Instead, we conduct the HL test. The hoslem.test
function in the ResourceSelection
library conducts the test as per the steps set out above. We assign the test to the variable hl_test_info
.
<- ResourceSelection::hoslem.test(
hl_test_info $y,
model_hrsfitted(model_hrs),
g = 10
) hl_test_info
Hosmer and Lemeshow goodness of fit (GOF) test
data: model_hrs$y, fitted(model_hrs)
X-squared = 101.15, df = 8, p-value < 2.2e-16
Here, we reject the null hypothesis. We have evidence that the model does not fit the data well.
Below, we generate a table of the 10 groups. The Total indicates the number of observation in each of the 10 groups (with the indicated left-open, right-closed group intervals). The number of success observations that are observed and expected are shown in the Observed and Expected columns.
cbind(
Total = apply(hl_test_info$observed, 1, sum),
Observed = hl_test_info$observed[, 2],
Expected = hl_test_info$expected[, 2]
)
Total Observed Expected
[0.212,0.256] 321 38 78.21675
(0.256,0.276] 321 91 85.23210
(0.276,0.367] 320 111 102.12663
(0.367,0.387] 321 81 121.61421
(0.387,0.399] 320 105 125.75329
(0.399,0.409] 321 135 129.70913
(0.409,0.423] 320 151 133.03939
(0.423,0.442] 321 189 138.48492
(0.442,0.477] 320 169 146.16868
(0.477,0.998] 321 171 180.65489
We see the large difference between the observed and expected values and hence the poor fit.
In the sections, we looked at the likelihood ratio test as goodness-of-fit test for group level data and the Hosmer-Lemeshow test for individual level data to assess how well our working models fit the data.
We can explore the residuals to understand why there is poor fit for group level models, similar to linear regression, where large residuals (positive or negative difference between observed and predicted values) indicate poor fit.
Assuming group level data with k different combinations (profiles) of the levels for the explanatory variables, we have the following notation.
There are two different residuals that we explore here, Pearson residuals and standardized residuals.
Pearson residuals are denoted by e_{i} and is shown in Equation 5.
e_{i} = \frac{y_{i} - n_{i} \hat{\pi}_{i}}{\sqrt{n_{i} \hat{\pi}_{i} (1 - \hat{\pi}_{i})}} \tag{5}
The numerator in Equation 5 is the difference between the observed and the expected counts (of successes). The denominator is the binomial standard deviation of the observed count, y_{i}. Note that e_{i} is approximately normally distributed with a variance less than 1.
Also note that Equation 5 follows from the description y_{i} \sim \text{Bin}(n_{i} , \hat{\pi}_{i}) with expected value, E(y_{i}) = n_{i} \hat{\pi}_{i}, and variance, \text{var}(y_{i}) = n_{i} \hat{\pi}_{i} (1 - \hat{\pi}_{i}).
Standardized residuals, as the name implies, standardizes the Pearson residuals, and are calculated as in Equation 6, where h_{i} is known as the the leverage (covered in more advanced classes on generalized linear models).
\frac{y_{i} - n_{i} \hat{\pi}_{i}}{\sqrt{n_{i} \hat{\pi}_{i} (1 - \hat{\pi}_{i})} (1 - h_{i})} = \frac{e_{i}}{\sqrt{1 - h_{i}}} \tag{6}
The standardized residual is approximately normally distributed, with \text{N}(0,1^{2}) when the model holds. Residuals < -2 or >2 are considered large.
As illustrative example, we consider again the low birth weight predicted by maternal age example from the previous notebook. Table 6 below is a summary of the data.
Age | Sample size | Y=1 | Probability (proportion of success) |
---|---|---|---|
14 | 15 | 6 | 0.40 |
15 | 67 | 30 | 0.45 |
16 | 125 | 44 | 0.35 |
17 | 71 | 21 | 0.30 |
18 | 20 | 5 | 0.25 |
The model \text{logit} (\pi (\text{age})) = \beta_{0} + \beta_{1} (\text{age}) shows the relationship between age and risk of low birth weight. In the previous section, we used the goodness of fit test and had no evidence for a poor fit.
The fitted model was \text{logit}(\hat{\pi}(\text{age})) = 3.4851 - 0.2547 (\text{age}) and the probability of low birth weight was then as shown in Equation 7.
\hat{\pi}(\text{age}) = \frac{e^{3.4851 - 0.2547 (\text{age})}}{1 + e^{3.4851 - 0.2547 (\text{age})}} \tag{7}
Table 7, summarizes the results from the model, where SR is the standardized residuals.
i | Age | n_{i} | y_{i} | Prop | \hat{\pi}_{i} | n_{i}\hat{\pi}_{i} | e_{i} | SR |
---|---|---|---|---|---|---|---|---|
1 | 14 | 15 | 6 | 0.40 | 0.480 | 7.198 | -0.618 | -0.739 |
2 | 15 | 67 | 30 | 0.45 | 0.417 | 27.934 | 0.514 | 0.723 |
3 | 16 | 125 | 44 | 0.35 | 0.357 | 44.576 | -0.104 | -0.138 |
4 | 17 | 71 | 21 | 0.30 | 0.301 | 21.337 | -0.085 | -0.117 |
5 | 18 | 20 | 5 | 0.25 | 0.250 | 4.997 | 0.003 | 0.004 |
The residuals are small for all combinations. Ages 14 and 15 show the worst fit (residuals are largest - over-predicting for age 14 (negative residual) and under-predicting for age 15 (positive residual)).
To end this chapter, we discuss two more topic relating to model evaluation, separation and sparseness.
The calculation of parameters for a generalized linear model in R uses the Fisher Scoring Algorithm. The algorithm is an iterative process that aims to find the maximum likelihood for estimate values.
For an algorithm to find a maximum (or a minimum), it needs to converge during the iterative process. It can happen that an algorithm does not converge, instead resulting in values that are infinite or non-existent.
There are patterns to data that can lead to failure to converge. These are complete separation and sparseness.
Complete separation or perfect discrimination is a state of the data such that there is a cutoff in the the values for an explanatory variable that completely separates the levels of the response variable.
As illustrative example, we consider a hospital admission scoring system that prognosticates death. The score takes values from 1 through 8. The outcome is, death, with Y=1 for those who die, and Y=0 for those who do not die. The data is created below as a data frame object.
<- 1:8
score <- rep(c(0,1), each = 4)
outcome <- data.frame(
dfr Score = score,
Outcome = outcome
)
We plot the data as a scatter plot shown in Figure 1.
plot(
$Score,
dfr$Outcome,
dfrmain = "Scatter plot of admission scoring system and outcome",
xlab = "Score",
ylab = "Ourcome",
las = 1,
pch = 23,
cex = 3,
ylim = c(-0.1, 1.1),
panel.first = grid()
)
It is clear from the plot that there is separation of the scores. For a score of less than or equal to 4 (on the horizontal axis), we always have that Y=0, and for a score greater than 4, we always have Y=1. Below, we build a model for the data.
<- glm(
model_sep ~ Score,
Outcome data = dfr,
family = binomial(link = "logit")
)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
The output returns a warning that probabilities were fitted that were 0 and 1. The summary output appears at first glance to be without problems.
summary(model_sep)
Call:
glm(formula = Outcome ~ Score, family = binomial(link = "logit"),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -206.1 365130.9 -0.001 1
Score 45.8 80643.9 0.001 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1.1090e+01 on 7 degrees of freedom
Residual deviance: 4.5253e-10 on 6 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
With closer inspection, we note that the standard errors are very large. We therefore see very small Wald statistic values and p values of death. We would expect to see small p values, since the score does seem to predict death.
We also note that the Fisher Scoring Algorithm took more than the usual 4 steps. In fact, it took the maximum (in R) of 25 steps, indicating that there was not good convergence for the estimates.
In Figure 2, we see the values of \beta_{1}, through the iterations. If there were more iterations, we see that we would have \beta_{1} \rightarrow \infty.
= c(0.92646, 1.54454, 2.40424, 3.75172, 5.67804, 7.74615, 9.78218,
beta_1 11.79651, 13.80191, 15.80391, 17.80465, 19.80492, 21.80502,
23.80505, 25.80507, 27.80507, 29.80508, 31.80508, 33.80508,
35.80508, 37.80507, 39.80507, 41.80506, 43.80502)
plot(
1:length(beta_1),
beta_1,main = "Values of slope through iterations",
xlab = "Iterations",
ylab = "Value of slope",
type = "l",
lwd = 2,
col = "orangered",
panel.first = grid()
)
While we have had a single explanatory variable in our illustrative example, the same can happen with multiple explanatory variables. Complete separation occurs when when some linear combination of the predictors perfectly predicts the dependent variable.
Warnings to look out for as warning messages include, when the log likelihood and deviance are 0, when the coefficients for the explanatory variables are large, or the standard errors are extremely large.
When complete separation occurs, we can remove the offending explanatory variable since it completely accounts for all the variation in the response. Remember to mention when reporting the results that the variable was removed and why.
Sparseness or quasi-complete separation refers to an imbalance in the frequencies of the levels of the response variables for some or all combinations of the explanatory variables. In machine learning, this problem is referred to as class imbalance.
In the code chunk below, we import a data set for participants that either received a novel, active drug or a placebo at one of 5 centers (the binary explanatory variables). The response variable recorded whether they improved or not. We also set the levels of the multi-level explanatory variable and the outcome variable.
<- import_dataset("sparseness.csv")
dfr
$Center <- factor(dfr$Center, levels = 1:5)
dfr$Outcome <- factor(dfr$Outcome, levels = c("Not improve", "Improve"))
dfr$Recode <- factor(dfr$Recode, levels = 1:3) dfr
The 5 centers indicate a 5-level categorical variable (with dummy variables x_{1} through x_{4}) and the treatment variable is binary (x_{5}).
Our model is then the log of the odds of improvement \text{logit}( \pi ) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5}.
The xtabs
function show that we have only 16 successes (improvement) 78 failures (no improvement).
xtabs(
~ Outcome,
data = dfr
)
Outcome
Not improve Improve
78 16
When we consider a table by center, we see that center 1 and 3 had no participants that show improvement.
xtabs(
~ Center + Outcome,
data = dfr
)
Outcome
Center Not improve Improve
1 14 0
2 22 1
3 12 0
4 9 8
5 21 7
<- glm(
model_sparse formula = Outcome ~ Center + Treatment,
data = dfr,
family = binomial(link = "logit")
)
summary(model_sparse)
Call:
glm(formula = Outcome ~ Center + Treatment, family = binomial(link = "logit"),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -18.7727 2749.8777 -0.007 0.9946
Center2 16.1162 2749.8779 0.006 0.9953
Center3 -0.3194 4060.6214 0.000 0.9999
Center4 19.3595 2749.8777 0.007 0.9944
Center5 18.2964 2749.8777 0.007 0.9947
TreatmentPlacebo -1.5460 0.7017 -2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 85.77 on 93 degrees of freedom
Residual deviance: 57.74 on 88 degrees of freedom
AIC: 69.74
Number of Fisher Scoring iterations: 18
We see no warning. The clue that a problem exists, though, is in the large values for the standard errors, relative to the estimates.
Since center 1, which is the reference center, has no successes, the odds of success is 0. For the odds ratios, we will therefor divide by 0.
Before generating models, or doing any statistical analysis, data visualization, and exploratory analysis, whether that be summary statistics or bivariate modeling (single explanatory variable).
Solutions to sparseness would be to re-code the data by combining levels or converting from a k-level categorical variable to a numerical variable. We can remove observations, but this is dubious as it does not conform to the original research question and protocol that precedes data capture and analysis.
Below, we combine centers 1, 2, and 3 into a single center. Note that a decision such as this must make sense in the setting of the research.
<- glm(
model_recode formula = Outcome ~ Recode + Treatment,
data = dfr,
family = binomial(link = "logit")
)
summary(model_recode)
Call:
glm(formula = Outcome ~ Recode + Treatment, family = binomial(link = "logit"),
data = dfr)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.3737 1.0219 -3.301 0.000963 ***
Recode2 3.9660 1.1557 3.432 0.000600 ***
Recode3 2.9012 1.1171 2.597 0.009399 **
TreatmentPlacebo -1.5588 0.6998 -2.228 0.025905 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 85.770 on 93 degrees of freedom
Residual deviance: 59.147 on 90 degrees of freedom
AIC: 67.147
Number of Fisher Scoring iterations: 6
We note that our standard errors are now closer in value to the estimates.
You conduct a study to find the association between transfer time to hospital (measured in minutes), type of road vehicle accident (4-level categorical variable), the need for pre-hospital resuscitation measures (5-level categorical variable), and whether victims return to working functionality six months after the incident (binary categorical response variable). To measure the association, you build a logistic regression model.
What type of goodness of fit test would you consider and why?
You conduct a study on quinquagenarians (people in their fifties) to find the association between age and the development of permanent disabilities after motor vehicle accidents. To measure the association, you build a logistic regression model.
The data is in the file crash.csv
.
Generate a saturated model and a working model (where age is a numerical explanatory variable), and conduct a goodness of fit test on the latter.
You analyze data on the time (measured in minutes) that patients are hypoxic (lack of oxygen in the blood) during the first 24 hours following admission to a critical care unit for acute respiratory distress syndrome. Data on whether they required the use of inotropes (epinephrine) during the first 24 hours is also collected (a binary variable with No being no inotropes and Yes confirming the use of inotropes). The response variable is Death with Yes confirming that a patient died and No, that they survived.
The data are in the hypoxia.csv
file.
Create a logistic regression to understand the association between hypoxic time and the use of inotropes, and mortality. Determine the goodness of fit of your model.
In the code chunk below, we create a data set as a data frame object, assigned to the variable chd_data
. The data set contains information on 609 males who were involved in a study of coronary heart disease (CHD). There are three predictor variables.
CHD_case
contains the number of subjects at each combination of the predictor levels that have CHD while the variable CHD_control
contains the number of subjects without CHD<- factor(c(0,0,0,0,1,1,1,1), levels = c(0,1))
CAT <- factor(c(0,1,0,1,0,1,0,1), levels = c(0,1))
AGE <- factor(c(0,0,1,1,0,0,1,1), levels = c(0,1))
ECG <- c(17,15,7,5,1,9,3,14)
CHD_case <- c(257,107,52,27,7,30,14,44)
CHD_control <- data.frame(CAT, AGE, ECG, CHD_case, CHD_control)
chd_data #DT::datatable(chd_data <- data.frame(CAT, AGE, ECG, CHD_case, CHD_control))
You consider the model in Equation 8.
\log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1} \left( \text{AGE} \right) + \beta_{2} \left( \text{CAT} \right) + \beta_{3} \left( \text{ECG} \right) \tag{8}
Answer the following questions
Is chd_data
an individual level data set or group level data set?
How many unique predictor level combinations are there?
Is the model that you want fit in (Lab problem 1 model) the saturated model? Explain why or why not. If it is not the saturated model, then write down the saturated model.
Fit the saturated model and report the estimated intercept and slopes. Assign the model to the variable model_lab_1
.
Compute the predicted probability for CHD for each predictor level combination. Then, show that the saturated model perfectly predicts these probabilities.
Fit a model with ECG as the only predictor (group level data) and show the estimated intercept and coefficient.
Does this model have poor fit? Conduct a goodness of fit test for this model using the lrtest
function. Use a 5% significance level. When creating the simple model, assign it to the variable model_ecg
.
If the model has poor fit, examine the model predicted probabilities and the standardized residuals to see which observations fit poorly.
Fit the working model in (Lab problem 1 model) and show the estimated intercept and coefficients. Assign the model to the variable model_main
.
Does this model have poor fit? Conduct a goodness of fit test for this model using the lrtest
function. Use a 5% significance level. Confirm the p value for this test using the pchisq
function.
Suppose that we are interested in investigating the factors that influence the development of germinal matrix hemorrhages (gmh), which are hemorrhages (bleeding) in the brain in low birth weight infants. In particular, we are interested in investigating the influence of a baby’s head circumference and gender on the probability of having a germinal matrix hemorrhage. The data set gmh.csv
contains the variables shown in Table 8.
Variable Name | Description | Levels or Units of Measurement |
---|---|---|
gmh | presence of hemorrhage | 0 for no hemorrhage and 1 for hemorrhage |
headcirc | head circumference | centimeters |
sex | 0 for female and 1 for male |
Suppose that we want to use the model shown in Equation 10.
\log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1} \left( \text{head circumference} \right) + \beta_{2} \left( \text{sex} \right) \tag{10}
Answer the following questions.
Import the data set and assign it to the variable gmh
.
How many observations are in the data set?
How many unique values are there for the headcirc variable?
Is this an individual level data set or a group level data set?
Is the model in Equation 11 a saturated model? If so, or if not, explain your reasoning.
\begin{align} \log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + &\beta_{1} \left( \text{head circumference} \right) + \beta_{2} \left( \text{sex} \right) \\ &\beta_{3} \left( \text{head circumference} \times \text{sex} \right) \end{align} \tag{11}
model_lab_2
. Conduct an appropriate goodness of fit test of the null hypothesis that the logistic model with head circumference and sex as the predictors (main effects only) adequately fits the data. Use a 5% significance level.\begin{align} \log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1} \left( \text{head circumference} \right) + \beta_{2} \left( \text{sex} \right) \end{align} \tag{12}