Chapter 9

Author
Affiliation
Juan H Klopper (MD), Do Hee Lee (MPH)

The George Washington University

This chapter of Applied Categorical Data Analysis by JH Klopper and DH Lee is licensed under Attribution-NonCommercial-NoDerivatives 4.0 International

1 Introduction

In the previous chapter we explored model building, deciding which explanatory variables to include in our final model.

In this chapter, we explore how well our model or models explain or describe the data that we have, and whether any observations influence the model estimates more than others.

There are two ways to assess model fit depending on whether we have group level or individual level data. While it is more common in many areas of research to generate models using individual level data, we begin our exploration with group level data.

Section 3. Levels of Data: group level, predictor profile, individual level
Section 4. Group Level Data: working model, saturated model, group saturated model, deviance, goodness of fit test, residual degrees of freedom
Section 7. Separation and Sparseness: complete separation or perfect discrimination, sparseness or quasi-complete separation

2 Libraries

Load the required libraries for Chapter 9. If these packages are not already installed, please install them before loading the libraries.

Code
library(rio)
library(dplyr)
library(lmtest)
library(ResourceSelection)

3 Levels of Data

3.1 Group Level

A group level data set is one in which the number of unique predictor profiles is smaller (often much smaller) than the number of observations. A predictor profile describes a unique observation in a data set (a row vector of values for all the variables, an example of which is in Table 1).

As an example, we have a binary outcome, \texttt{low infant birth weight}, with levels low, s, and not low, f, explained by mother's age, \texttt{Age} (in the interval [14,18]), and smoking status, \texttt{Smoke}, with levels Yes and No.

We can count the successes and failures for each combination of age and smoking status, as shown in Table 1, where we treat the age variable as a categorical variable.

Table 1: Group Level Data
Smoke Age Successes Failures
Yes 14 s_{1} f_{1}
No 14 s_{2} f_{2}
Yes 15 s_{3} f_{3}
No 15 s_{4} f_{4}
Yes 16 s_{5} f_{5}
No 16 s_{6} f_{6}
Yes 17 s_{7} f_{7}
No 17 s_{8} f_{8}
Yes 18 s_{9} f_{9}
No 18 s_{10} f_{10}

In R we will use the glm function as follows for a group level logistic regression model, where success and failure are vectors containing the number of successes and failures for each profile.

glm(cbind(success, failure) ~ x1 + x2, family = binomial(link = "logit"), data = data)

3.2 Individual Level

An individual level data set is one in which the number of unique predictor profile and outcome combinations equals (or closely equals) the number of observations. The number of successes and failures for each combination of explanatory variables is typically very small.

As an example, we can add years of education as another explanatory variable to the example above. The number of successes and failures for every combination of explanatory variables will now be very small. There simply won’t be many observations with the same explanatory variable profile.

In R, in cases of individual level data, we use the usual formula interface that we have been using up until now, as in the sketch below.
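As a minimal sketch, assuming a data frame data with a binary (0 or 1) response y and explanatory variables x1 and x2, the call looks as follows.

Code
# Individual level data: the response is the observed 0/1 outcome itself
glm(y ~ x1 + x2, family = binomial(link = "logit"), data = data)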

4 Group Level Data

The model that we build and feed our data to is termed the working model. Since we do know the values of the outcome variable, we can evaluate what the working model predicts for each observation.

A saturated model, S, is a model that we can construct that perfectly predicts the data (group level). We can measure a working model against a saturated model.

The maximum likelihood of data under the saturated model, L_{S}, is larger than that of any other model that is nested within S. This then also holds for the log of the maximum likelihood of the data, \log{(L_{S})}.

A saturated model treats each explanatory variable as a categorical variable and includes all the main effects and all possible interaction terms.

A saturated model has as many parameters (intercept and slopes) as there are unique combinations of the levels of the explanatory variables (predictor profiles) and outcomes in the data.

For group level data, the group saturated model provides perfect prediction of the observed proportion of successes for each unique combination of explanatory variable levels. As mentioned, saturated models treat each predictor as categorical (even if numerical) and include all main effects and possible interaction terms.

The saturated model for individual level data cannot be used to assess model fit as with group level data, and will not be discussed. It could have as many parameters as observations.

As an illustrative example of a group level data set, we explore a model that uses a mother's age (as a categorical variable with levels 14, 15, 16, 17, and 18) as predictor of her infant's birth weight, Y, which can be low, 1, or not low, 0.

The dummy variables for age then correspond to ages 15 through 18, assigned to x_{1} through x_{4}. The model for low birth weight is shown in Equation 1, for \pi (\mathbf{x}) = P(Y = 1 \vert x_{1} , x_{2} , x_{3} , x_{4}).

\text{logit} ( \pi ( \mathbf{x} ) ) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} \tag{1}

The data is summarized in Table 2.

Table 2: Summary Data for Group Level Example
Age Sample size Y=1 Probability (proportion of success)
14 15 6 6 \div 15 = 0.40
15 67 30 30 \div 67 = 0.45
16 125 44 44 \div 125 = 0.35
17 71 21 21 \div 71 = 0.30
18 20 5 5 \div 20 = 0.25

Using R, we find the model to be \text{logit}(\hat{\pi} (\mathbf{x})) = -0.4055 + 0.1957 x_{1} - 0.2048 x_{2} - 0.4620 x_{3} - 0.6931 x_{4}, as calculated in the code below.

We start by importing the data. There are rows with missing data, which we delete using the dplyr filter function with the %in% operator. We also create a vector of the age values, assigned to the variable age_levels.

Code
dfr <- import_dataset("sim_birthweight.csv")
dfr <- dfr %>% 
  filter(age %in% 14:18) # Drop missing data

age_levels <- 14:18

We have to create a table of profiles, using the table function. We assign this table to the variable dfr_table.

Code
dfr_table <- table(
  Age = dfr$age,
  LBW = dfr$lbw
)

dfr_table
    LBW
Age   0  1
  14  9  6
  15 37 30
  16 81 44
  17 50 21
  18 15  5

Each row in the table above corresponds to a maternal age level and is a profile. Remember that all variables are treated as categorical variables when using profiles. The first column of counts (LBW = 0) is the number of failures from the response variable for each profile, and the second column of counts (LBW = 1) is the frequency of successes.

Now we create a model using the table as the response variable. We reverse the order of the failure and success frequency columns using indexing, since glm expects the success counts first. Using the formula keyword argument, we set our variable age_levels (as a factor using the factor function) to represent the explanatory variable. Note that our table contains only five rows of data. We do not pass the dfr data frame as an argument to the glm function, as we are using the group level data that we created.

Code
model_s <- glm(
  formula = dfr_table[, 2:1] ~ factor(age_levels),
  family = binomial(link = "logit")
)

summary(model_s)

Call:
glm(formula = dfr_table[, 2:1] ~ factor(age_levels), family = binomial(link = "logit"))

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)           -0.4055     0.5270  -0.769    0.442
factor(age_levels)15   0.1957     0.5815   0.337    0.736
factor(age_levels)16  -0.2048     0.5593  -0.366    0.714
factor(age_levels)17  -0.4620     0.5877  -0.786    0.432
factor(age_levels)18  -0.6931     0.7379  -0.939    0.348

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.7103e+00  on 4  degrees of freedom
Residual deviance: 2.7978e-14  on 0  degrees of freedom
AIC: 30.736

Number of Fisher Scoring iterations: 3

Using the estimates from our results, we create Table 3 below, which shows perfect prediction.

Table 3: Perfect Prediction for Group Level Data Model
Age Sample size Y=1 Proportion Saturated model \hat{\pi}
14 15 6 0.40 \frac{e^{-0.4055}}{1 + e^{-0.4055}} = 0.40
15 67 30 0.45 \frac{e^{-0.4055 + 0.1957}}{1 + e^{-0.4055 + 0.1957}} = 0.45
16 125 44 0.35 \frac{e^{-0.4055 - 0.2048}}{1 + e^{-0.4055 - 0.2048}} = 0.35
17 71 21 0.30 \frac{e^{-0.4055 - 0.4620}}{1 + e^{-0.4055 - 0.4620}} = 0.30
18 20 5 0.25 \frac{e^{-0.4055 - 0.6931}}{1 + e^{-0.4055 - 0.6931}} = 0.25

We use code to confirm the probabilities using the data.

Code
SS <- apply(dfr_table, 1, sum) # Sample size for each profile (row totals)

cbind(SS, dfr_table[, 2], round(dfr_table[,2] / SS, 2)) # Bind columns: size, successes, proportion
    SS        
14  15  6 0.40
15  67 30 0.45
16 125 44 0.35
17  71 21 0.30
18  20  5 0.25

The \log{( L_{S} )}=-10.368 for this saturated model. This value is larger than the log likelihood of any model that is nested in the saturated model.
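We can confirm this value with the logLik function.

Code
logLik(model_s) # Log likelihood of the saturated model: -10.368 on df = 5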

In Table 4, we see the summary of a second example, now with two binary predictor variables and hence four unique combinations. The two binary explanatory variables are race (black and white) and AZT use (Yes and No). These variables are used as predictors for the binary outcome variable AIDS symptoms.

Table 4: Summary Statistics for Second Example
Profile Race AZT use AIDS symptoms Sample size Proportion
1 White Yes 14 107 0.131
2 White No 32 113 0.283
3 Black Yes 11 63 0.175
4 Black No 12 55 0.218

If x_{1} is AZT use, and x_{2} is race, then the group level saturated logistic regression model with interaction term is \text{logit}(\pi ( \mathbf{x}) ) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{1} x_{2}. There are four parameters and four unique predictor profiles.

The model is \text{logit}(\pi ( \mathbf{x}) ) = -1.2763 - 0.2771 x_{1} + 0.3476 x_{2} - 0.6878 x_{1} x_{2}. Substitution will yield the same proportions as in Table 4 and is shown in Table 5.

Table 5: Results from Saturated Model
Profile Proportion Saturated model for \hat{\pi}
1 0.131 \frac{e^{-1.2763-0.2771+0.3476-0.6878}}{1+e^{-1.2763-0.2771+0.3476-0.6878}} = 0.131
2 0.283 \frac{e^{-1.2763+0.3476}}{1+e^{-1.2763+0.3476}} = 0.283
3 0.175 \frac{e^{-1.2763-0.2771}}{1+e^{-1.2763-0.2771}} = 0.175
4 0.218 \frac{e^{-1.2763}}{1+e^{-1.2763}} = 0.218

The \log{( L_{S} )} = - 8.738 for this saturated model and is larger than the log likelihood of any model nested within the saturated model.
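As a check, we can reproduce this saturated model by entering the counts from Table 4 directly. The sketch below is our own reconstruction; the variable names (azt, white, model_azt) and the coding of White as x_{2} = 1 are inferred from the estimates above.

Code
azt      <- c(1, 0, 1, 0)               # x1: AZT use (Yes = 1)
white    <- c(1, 1, 0, 0)               # x2: race (White = 1)
cases    <- c(14, 32, 11, 12)           # AIDS symptoms (successes)
controls <- c(107, 113, 63, 55) - cases # No symptoms (failures)

model_azt <- glm(
  cbind(cases, controls) ~ azt * white,
  family = binomial(link = "logit")
)

coef(model_azt)   # Should match the coefficients reported above
logLik(model_azt) # Should match log(L_S) = -8.738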

While it may seem as if a saturated model is best to use, we quickly run into problems as the number of explanatory variables increases. If we have four binary explanatory variables, for instance, the saturated model would have 2^{4}=16 parameters.

The saturated model, instead, is used as a benchmark to compare nested models that are contained within it.

Suppose then, that we have a nested model, M, within the group saturated model, S.

If M fits the data well, then \log{ L_{M} } will be close to \log{ L_{S} }, otherwise, if it does not fit the data well, the values will not be close.

Instead of simply considering the difference, we use the deviance, shown in Equation 2.

\text{Deviance} = G_{M}^{2} = -2 [ \log{ L_{M} } - \log{ L_{S} } ] \tag{2}

A large deviance (large difference between M and S) indicates that M does not fit the data well and a small deviance indicates that M fits the data well. Note that the formula in Equation 2 suggests a likelihood ratio test (LRT).

There is a spectrum of models, from the intercept-only model with maximum likelihood L_{0} (largest deviance) up to the saturated model (lowest deviance, namely zero). (The null deviance is G_{0}^{2} = -2 [ \log{ L_{0} } - \log{ L_{S} } ].)

A goodness of fit (GOF) test uses the deviance statistic, G_{M}^{2}, to evaluate a nested model M. The GOF test is an LRT with the null hypothesis that M has adequate fit vs. the alternative hypothesis that it does not have good fit, with the test statistic elucidated in Equation 3.

G_{GOF}^{2} = -2 [ \log{ ( L_{M} ) } - \log{ ( L_{S} ) } ] \sim \chi^{2}_{\text{df}} \text{ under } H_{0} \tag{3}

(Note that we do NOT want to reject the null hypothesis. Our aim is good model fit after all.)

In Equation 3, df is the number of parameters in S minus the number of parameters in M, referred to as the residual degrees of freedom. Note that we can use the \chi^{2} approximation to the distribution of the test statistic only if there is a finite number of predictor patterns and the fitted count for each predictor pattern is at least 5. As with other tests using a \chi^{2} distribution approximation, a large test statistic indicates a small p value and rejection of H_{0} (evidence that M has poor fit).

As an illustrative example, we return to the model of birth weight given age, in which we viewed age as a k-level categorical variable. A simpler model, M, views age as a numerical explanatory variable with M: \; \text{logit} ( \pi ( \text{age} )) = \beta_{0} + \beta_{1} (\text{age}), created below. Note that M is nested within S.

Code
# Working model M
model_m <- glm(
  formula = dfr_table[, 2:1] ~ age_levels,
  family = binomial(link = "logit")
)

summary(model_m)

Call:
glm(formula = dfr_table[, 2:1] ~ age_levels, family = binomial(link = "logit"))

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   3.4851     2.0472   1.702   0.0887 .
age_levels   -0.2547     0.1279  -1.992   0.0464 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4.71033  on 4  degrees of freedom
Residual deviance: 0.66599  on 3  degrees of freedom
AIC: 25.402

Number of Fisher Scoring iterations: 3

Finally, we use the lrtest function to perform the LRT given the two models.

Code
lmtest::lrtest(model_m, model_s)
Likelihood ratio test

Model 1: dfr_table[, 2:1] ~ age_levels
Model 2: dfr_table[, 2:1] ~ factor(age_levels)
  #Df  LogLik Df Chisq Pr(>Chisq)
1   2 -10.701                    
2   5 -10.368  3 0.666     0.8812

We see G^{2}_{\text{GOF, df=}3} = \text{ deviance } = 0.666, with a p value of 0.88. For \alpha = 0.05, we fail to reject the null hypothesis and state that the simpler model fits the data well.

In R the model has attributes null.deviance and deviance that return the null deviance and the model deviance respectively.

Code
model_s$null.deviance
[1] 4.710326
Code
model_s$deviance
[1] 2.797762e-14

Note that the deviance for the saturated model is 0, as expected. We do the same for the working model.

Code
model_m$null.deviance
[1] 4.710326
Code
model_m$deviance
[1] 0.6659857

The null deviance is the same. The deviance for the working model is the G^{2}_{\text{GOF}} value that was returned by the lrtest function.
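As a quick check, we can compute the deviance directly from the two log likelihoods, following Equation 2.

Code
# G^2 = -2 * (log L_M - log L_S), using the models created above
-2 * (as.numeric(logLik(model_m)) - as.numeric(logLik(model_s)))
[1] 0.6659857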

We do not have to create a saturated model and use the lrtest function in R. We can simply create the working model and return its deviance attribute, which we use as an argument to the pchisq function to return the p value for the test statistic.

Code
pchisq(model_m$deviance, df = 3, lower.tail = FALSE)
[1] 0.8811738

5 Individual Level Data

We cannot use the goodness of fit test for individual level data. Instead, we use the Hosmer-Lemeshow (HL) test.

The null hypothesis and the alternative hypothesis are the same as for the GOF test. If we fail to reject the null hypothesis, then there is no evidence that our working model, M, has poor fit.

  • H_{0}: \; M \text{ has adequate fit}
  • H_{1}: \; M \text{ has poor fit}

There are six steps in conducting the Hosmer-Lemeshow test.

  • Step 1
    • Fit a working model, M
  • Step 2
    • Calculate the model-based probability of success for each observation
  • Step 3
    • Sort the probabilities from the previous step
  • Step 4
    • Split the ordered probabilities into g percentile groups
    • The default is g = 10, resulting in deciles
  • Step 5
    • Compute the expected number of successes in each group
      • The average of the model-based probabilities in group k is \overline{\hat{\pi}}_{k} for k = 1,2, \ldots , g and is used as the model-based probability of success for that group
      • For n_{k} as the number of observations in each group, we have the expected values as E_{k} = n_{k} \overline{\hat{\pi}}_{k}
      • The number of observed successes in each group is denoted by O_{k}
  • Step 6
    • Compute the HL test statistic, HL, as given in Equation 4, where each term's denominator is the estimated binomial variance of the observed count O_{k}

\text{HL} = \sum_{k=1}^{g} \frac{( O_{k} - E_{k})^{2}}{n_{k}\overline{\hat{\pi}}_{k} (1 - \overline{\hat{\pi}}_{k})} \tag{4}

We note that \text{HL} \sim \chi^{2}_{g-2} under the null hypothesis. We fail to reject the null hypothesis when HL is small. In this case we state that the model fits the data well.
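To make the steps concrete, below is a minimal sketch of the HL computation written as an R function. The function name hl_by_hand and the grouping by sample quantiles are our own; the hoslem.test function used later may form the groups slightly differently when there are ties in the fitted probabilities.

Code
hl_by_hand <- function(y, pi_hat, g = 10) {
  # Steps 3 and 4: order the fitted probabilities and cut them
  # into g percentile groups
  breaks <- quantile(pi_hat, probs = seq(0, 1, length.out = g + 1))
  groups <- cut(pi_hat, breaks = unique(breaks), include.lowest = TRUE)
  
  # Step 5: observed and expected successes in each group
  n_k  <- tapply(y, groups, length)    # Group sizes
  O_k  <- tapply(y, groups, sum)       # Observed successes
  pi_k <- tapply(pi_hat, groups, mean) # Mean model-based probability
  E_k  <- n_k * pi_k                   # Expected successes
  
  # Step 6: the HL statistic and its p value
  HL <- sum((O_k - E_k)^2 / (n_k * pi_k * (1 - pi_k)))
  list(statistic = HL,
       p.value = pchisq(HL, df = length(levels(groups)) - 2, lower.tail = FALSE))
}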

As an illustrative example, we review the supplemental private insurance data set.

Code
dfr <- import_dataset("mus14data.csv")

We fit a model assigned to the variable model_hrs, with household income (numerical), the number of chronic diseases (numerical), marital status (binary), and age (numerical) as explanatory variables for the response indicating the purchase of supplemental healthcare insurance.

Code
model_hrs <- glm(
  formula = private ~ hhincome + chronic + married + age,
  data = dfr,
  family = binomial(link = "logit")
)

summary(model_hrs)

Call:
glm(formula = private ~ hhincome + chronic + married + age, family = binomial(link = "logit"), 
    data = dfr)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.1082974  0.6860336  -1.616    0.106    
hhincome     0.0052360  0.0008433   6.209 5.33e-10 ***
chronic     -0.0379298  0.0267946  -1.416    0.157    
married      0.5649570  0.0916783   6.162 7.17e-10 ***
age          0.0009817  0.0102040   0.096    0.923    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4279.5  on 3205  degrees of freedom
Residual deviance: 4151.4  on 3201  degrees of freedom
AIC: 4161.4

Number of Fisher Scoring iterations: 4

We see values for the null and residual deviance in the summary table, but these cannot be used to assess fit for this individual level data set. Note the large value for the residual degrees of freedom; with nearly as many unique combinations of the four explanatory variables as the 3206 observations, a corresponding saturated model would have close to 3206 parameters.

Instead, we conduct the HL test. The hoslem.test function in the ResourceSelection library conducts the test as per the steps set out above. We assign the test to the variable hl_test_info.

Code
hl_test_info <- ResourceSelection::hoslem.test(
  model_hrs$y,
  fitted(model_hrs),
  g = 10
)
hl_test_info

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  model_hrs$y, fitted(model_hrs)
X-squared = 101.15, df = 8, p-value < 2.2e-16

Here, we reject the null hypothesis. We have evidence that the model does not fit the data well.

Below, we generate a table of the 10 groups. The Total column indicates the number of observations in each of the 10 groups (with the indicated left-open, right-closed group intervals). The numbers of observed and expected successes are shown in the Observed and Expected columns.

Code
cbind(
  Total = apply(hl_test_info$observed, 1, sum),
  Observed = hl_test_info$observed[, 2],
  Expected = hl_test_info$expected[, 2]
)
              Total Observed  Expected
[0.212,0.256]   321       38  78.21675
(0.256,0.276]   321       91  85.23210
(0.276,0.367]   320      111 102.12663
(0.367,0.387]   321       81 121.61421
(0.387,0.399]   320      105 125.75329
(0.399,0.409]   321      135 129.70913
(0.409,0.423]   320      151 133.03939
(0.423,0.442]   321      189 138.48492
(0.442,0.477]   320      169 146.16868
(0.477,0.998]   321      171 180.65489

We see the large differences between the observed and expected values and hence the poor fit.

6 Residuals

In the previous sections, we looked at the likelihood ratio test as a goodness-of-fit test for group level data and the Hosmer-Lemeshow test for individual level data to assess how well our working models fit the data.

We can explore the residuals to understand why there is poor fit for group level models, similar to linear regression, where large residuals (positive or negative difference between observed and predicted values) indicate poor fit.

Assuming group level data with k different combinations (profiles) of the levels for the explanatory variables, we have the following notation.

  • Each of the combinations will be denoted by a subscript i, where i = 1,2, \ldots , k
  • We will use n_{i} as the number of observations for predictor combination i
  • We will use y_{i} as the observed number of successes for predictor combination i
  • \hat{\pi}_{i} will be the model based probability of success for predictor combination i
  • Finally, n_{i} \hat{\pi}_{i} will be the expected number of successes for predictor combination i

There are two different residuals that we explore here, Pearson residuals and standardized residuals.

6.1 Pearson Residuals

Pearson residuals are denoted by e_{i} and are defined in Equation 5.

e_{i} = \frac{y_{i} - n_{i} \hat{\pi}_{i}}{\sqrt{n_{i} \hat{\pi}_{i} (1 - \hat{\pi}_{i})}} \tag{5}

The numerator in Equation 5 is the difference between the observed and the expected counts (of successes). The denominator is the binomial standard deviation of the observed count, y_{i}. Note that e_{i} is approximately normally distributed with a variance less than 1.

Also note that Equation 5 follows from the description y_{i} \sim \text{Bin}(n_{i} , \hat{\pi}_{i}) with expected value, E(y_{i}) = n_{i} \hat{\pi}_{i}, and variance, \text{var}(y_{i}) = n_{i} \hat{\pi}_{i} (1 - \hat{\pi}_{i}).
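For a group level model such as model_m from Section 4, Pearson residuals are returned by the residuals function with type = "pearson". The sketch below also computes them from first principles using Equation 5.

Code
e_i <- residuals(model_m, type = "pearson") # Pearson residuals
e_i

# The same values computed from Equation 5
n_i  <- apply(dfr_table, 1, sum) # Profile sample sizes
y_i  <- dfr_table[, 2]           # Observed successes
pi_i <- fitted(model_m)          # Model-based probabilities
(y_i - n_i * pi_i) / sqrt(n_i * pi_i * (1 - pi_i))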

6.2 Standardized Residuals

Standardized residuals, as the name implies, standardize the Pearson residuals, and are calculated as in Equation 6, where h_{i} is known as the leverage (covered in more advanced classes on generalized linear models).

\frac{y_{i} - n_{i} \hat{\pi}_{i}}{\sqrt{n_{i} \hat{\pi}_{i} (1 - \hat{\pi}_{i}) (1 - h_{i})}} = \frac{e_{i}}{\sqrt{1 - h_{i}}} \tag{6}

The standardized residual is approximately normally distributed, with \text{N}(0,1^{2}) when the model holds. Residuals < -2 or >2 are considered large.
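In R, the hatvalues function returns the leverages h_{i}, and rstandard with type = "pearson" returns the standardized Pearson residuals directly. A sketch for model_m, which should reproduce the e_{i} and SR columns of Table 7 below:

Code
h_i <- hatvalues(model_m) # Leverages
residuals(model_m, type = "pearson") / sqrt(1 - h_i)

# Equivalently, using the built-in function
rstandard(model_m, type = "pearson")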

As an illustrative example, we consider again the low birth weight predicted by maternal age example from earlier in the chapter. Table 6 below is a summary of the data.

Table 6: Summary Statistics of the Low Birth Weight Example
Age Sample size Y=1 Probability (proportion of success)
14 15 6 0.40
15 67 30 0.45
16 125 44 0.35
17 71 21 0.30
18 20 5 0.25

The model \text{logit} (\pi (\text{age})) = \beta_{0} + \beta_{1} (\text{age}) shows the relationship between age and risk of low birth weight. In the previous section, we used the goodness of fit test and had no evidence for a poor fit.

The fitted model was \text{logit}(\hat{\pi}(\text{age})) = 3.4851 - 0.2547 (\text{age}) and the probability of low birth weight was then as shown in Equation 7.

\hat{\pi}(\text{age}) = \frac{e^{3.4851 - 0.2547 (\text{age})}}{1 + e^{3.4851 - 0.2547 (\text{age})}} \tag{7}

Table 7 summarizes the results from the model, where SR denotes the standardized residuals.

Table 7: Standardized Residuals
i Age n_{i} y_{i} Prop \hat{\pi}_{i} n_{i}\hat{\pi}_{i} e_{i} SR
1 14 15 6 0.40 0.480 7.198 -0.618 -0.739
2 15 67 30 0.45 0.417 27.934 0.514 0.723
3 16 125 44 0.35 0.357 44.576 -0.104 -0.138
4 17 71 21 0.30 0.301 21.337 -0.085 -0.117
5 18 20 5 0.25 0.250 4.997 0.003 0.004

The residuals are small for all combinations. Ages 14 and 15 show the worst fit (the largest residuals), with the model over-predicting for age 14 (negative residual) and under-predicting for age 15 (positive residual).

7 Separation and Sparseness

To end this chapter, we discuss two more topics relating to model evaluation: separation and sparseness.

The calculation of parameter estimates for a generalized linear model in R uses the Fisher scoring algorithm. The algorithm is an iterative process that aims to find the maximum likelihood estimates.

For an algorithm to find a maximum (or a minimum), it needs to converge during the iterative process. It can happen that an algorithm does not converge, instead producing estimates that are infinite or non-existent.

There are patterns to data that can lead to failure to converge. These are complete separation and sparseness.

7.1 Complete Separation

Complete separation or perfect discrimination is a state of the data such that there is a cutoff in the values for an explanatory variable that completely separates the levels of the response variable.

As an illustrative example, we consider a hospital admission scoring system that prognosticates death. The score takes values from 1 through 8. The outcome is death, with Y=1 for those who die, and Y=0 for those who do not die. The data is created below as a data frame object.

Code
score <- 1:8
outcome <- rep(c(0,1), each  = 4)
dfr <- data.frame(
  Score = score,
  Outcome = outcome
)

We plot the data as a scatter plot shown in Figure 1.

Code
plot(
  dfr$Score,
  dfr$Outcome,
  main = "Scatter plot of admission scoring system and outcome",
  xlab = "Score",
  ylab = "Ourcome",
  las = 1,
  pch = 23,
  cex = 3,
  ylim = c(-0.1, 1.1),
  panel.first = grid()
)

Figure 1: Scatter Plot of Example Data

It is clear from the plot that there is separation of the scores. For a score of less than or equal to 4 (on the horizontal axis), we always have that Y=0, and for a score greater than 4, we always have Y=1. Below, we build a model for the data.

Code
model_sep <- glm(
  Outcome ~ Score,
  data = dfr,
  family = binomial(link = "logit")
)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The output returns a warning that fitted probabilities of 0 or 1 occurred. At first glance, the summary output appears to be without problems.

Code
summary(model_sep)

Call:
glm(formula = Outcome ~ Score, family = binomial(link = "logit"), 
    data = dfr)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -206.1   365130.9  -0.001        1
Score           45.8    80643.9   0.001        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1.1090e+01  on 7  degrees of freedom
Residual deviance: 4.5253e-10  on 6  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

On closer inspection, we note that the standard errors are very large. We therefore see very small Wald statistics and large p values. We would expect to see small p values, since the score does seem to predict death.

We also note that the Fisher scoring algorithm took many more than the usual 4 or so steps. In fact, it took the default maximum in R of 25 steps, indicating that the estimates did not converge.
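The iteration cap can be raised with the control argument to glm, but under complete separation this does not rescue the fit; a sketch:

Code
model_sep_50 <- glm(
  Outcome ~ Score,
  data = dfr,
  family = binomial(link = "logit"),
  control = glm.control(maxit = 50)
)
coef(model_sep_50) # Still no meaningful convergence; the estimates remain huge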

In Figure 2, we see the values of \beta_{1} through the iterations. If more iterations were allowed, \beta_{1} would keep growing, with \beta_{1} \rightarrow \infty.

Code
beta_1 <- c(
  0.92646, 1.54454, 2.40424, 3.75172, 5.67804, 7.74615, 9.78218,
  11.79651, 13.80191, 15.80391, 17.80465, 19.80492, 21.80502,
  23.80505, 25.80507, 27.80507, 29.80508, 31.80508, 33.80508,
  35.80508, 37.80507, 39.80507, 41.80506, 43.80502
) # Slope estimate at each Fisher scoring iteration

plot(
  1:length(beta_1),
  beta_1,
  main = "Values of slope through iterations",
  xlab = "Iterations",
  ylab = "Value of slope",
  type = "l",
  lwd = 2,
  col = "orangered",
  panel.first = grid()
)

Figure 2: The Estimate Does Not Converge

While we have had a single explanatory variable in our illustrative example, the same can happen with multiple explanatory variables. Complete separation occurs when some linear combination of the predictors perfectly predicts the dependent variable.

Besides the warning message itself, signs to look out for include a log likelihood and deviance of (essentially) 0, very large estimated coefficients for the explanatory variables, and extremely large standard errors.

When complete separation occurs, we can remove the offending explanatory variable since it completely accounts for all the variation in the response. Remember to mention when reporting the results that the variable was removed and why.

7.2 Sparseness

Sparseness or quasi-complete separation refers to an imbalance in the frequencies of the levels of the response variable for some or all combinations of the explanatory variables. In machine learning, this problem is referred to as class imbalance.

In the code chunk below, we import a data set for participants who either received a novel, active drug or a placebo at one of 5 centers (the explanatory variables). The response variable recorded whether they improved or not. We also set the levels of the multi-level explanatory variable and the outcome variable.

Code
dfr <- import_dataset("sparseness.csv")

dfr$Center <- factor(dfr$Center, levels = 1:5)
dfr$Outcome <- factor(dfr$Outcome, levels = c("Not improve", "Improve"))
dfr$Recode <- factor(dfr$Recode, levels = 1:3)

The 5 centers indicate a 5-level categorical variable (with dummy variables x_{1} through x_{4}) and the treatment variable is binary (x_{5}).

Our model for the log odds of improvement is then \text{logit}( \pi ) = \beta_{0} + \beta_{1} x_{1} + \beta_{2} x_{2} + \beta_{3} x_{3} + \beta_{4} x_{4} + \beta_{5} x_{5}.

The xtabs function shows that we have only 16 successes (improvement) and 78 failures (no improvement).

Code
xtabs(
  ~ Outcome,
  data = dfr
)
Outcome
Not improve     Improve 
         78          16 

When we consider a table by center, we see that centers 1 and 3 had no participants who showed improvement.

Code
xtabs(
  ~ Center + Outcome,
  data = dfr
)
      Outcome
Center Not improve Improve
     1          14       0
     2          22       1
     3          12       0
     4           9       8
     5          21       7
Code
model_sparse <- glm(
  formula = Outcome ~ Center + Treatment,
  data = dfr,
  family = binomial(link = "logit")
)

summary(model_sparse)

Call:
glm(formula = Outcome ~ Center + Treatment, family = binomial(link = "logit"), 
    data = dfr)

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)  
(Intercept)       -18.7727  2749.8777  -0.007   0.9946  
Center2            16.1162  2749.8779   0.006   0.9953  
Center3            -0.3194  4060.6214   0.000   0.9999  
Center4            19.3595  2749.8777   0.007   0.9944  
Center5            18.2964  2749.8777   0.007   0.9947  
TreatmentPlacebo   -1.5460     0.7017  -2.203   0.0276 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 85.77  on 93  degrees of freedom
Residual deviance: 57.74  on 88  degrees of freedom
AIC: 69.74

Number of Fisher Scoring iterations: 18

We see no warning. The clue that a problem exists, though, is in the large values for the standard errors, relative to the estimates.

Since center 1, which is the reference center, has no successes, its odds of success are 0. The odds ratios comparing the other centers to center 1 therefore involve division by 0.

Before generating models, we should always conduct exploratory analysis, whether that be data visualization, summary statistics, or bivariate modeling (a single explanatory variable at a time), so that problems such as sparseness come to light early.

Solutions to sparseness include re-coding the data by combining levels or converting a k-level categorical variable to a numerical variable. We could also remove observations, but this is dubious, as it does not conform to the original research question and protocol that precede data capture and analysis.

Below, we combine centers 1, 2, and 3 into a single center. Note that a decision such as this must make sense in the setting of the research.
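The Recode variable already exists in the imported data set. As a hypothetical sketch, such a variable could be constructed with the dplyr case_when function.

Code
# A possible construction of the Recode variable: centers 1, 2, and 3
# become level 1, center 4 becomes level 2, and center 5 becomes level 3
dfr <- dfr %>% 
  mutate(
    Recode = factor(
      case_when(
        Center %in% c("1", "2", "3") ~ 1,
        Center == "4" ~ 2,
        Center == "5" ~ 3
      ),
      levels = 1:3
    )
  )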

Code
model_recode <- glm(
  formula = Outcome ~ Recode + Treatment,
  data = dfr,
  family = binomial(link = "logit")
)

summary(model_recode)

Call:
glm(formula = Outcome ~ Recode + Treatment, family = binomial(link = "logit"), 
    data = dfr)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -3.3737     1.0219  -3.301 0.000963 ***
Recode2            3.9660     1.1557   3.432 0.000600 ***
Recode3            2.9012     1.1171   2.597 0.009399 ** 
TreatmentPlacebo  -1.5588     0.6998  -2.228 0.025905 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 85.770  on 93  degrees of freedom
Residual deviance: 59.147  on 90  degrees of freedom
AIC: 67.147

Number of Fisher Scoring iterations: 6

We note that our standard errors are now closer in value to the estimates.

8 Lab Problems

8.1 Question

You conduct a study to find the association between transfer time to hospital (measured in minutes), type of road vehicle accident (4-level categorical variable), the need for pre-hospital resuscitation measures (5-level categorical variable), and whether victims return to working functionality six months after the incident (binary categorical response variable). To measure the association, you build a logistic regression model.

What type of goodness of fit test would you consider and why?

We use the Hosmer-Lemeshow test, since we have individual level data.

8.2 Question

You conduct a study on quinquagenarians (people in their fifties) to find the association between age and the development of permanent disabilities after motor vehicle accidents. To measure the association, you build a logistic regression model.

The data is in the file crash.csv.

Generate a saturated model and a working model (where age is a numerical explanatory variable), and conduct a goodness of fit test on the latter.

We import the data and assign the tibble to the variable crash.

Code
crash <- import_dataset("crash.csv")

The xtabs function generates a table summarizing the data.

Code
xtabs(
  ~ Disabled,
  data = crash
)
Disabled
 No Yes 
 32  25 

Overall, there are more participants who are not disabled than disabled.

Then, we re-level factor variables so that the reference group comes first in the list.

Code
crash$Disabled <- factor(crash$Disabled, levels = c("No", "Yes"), labels = c(0, 1))
age_levels <- sort(unique(crash$Age))
Code
crash_table <- table(
  Age = crash$Age,
  Disabled = crash$Disabled
)
crash_table
    Disabled
Age  0 1
  50 9 0
  51 2 0
  52 1 0
  53 3 0
  54 4 2
  55 5 6
  56 5 4
  57 3 4
  58 0 5
  59 0 4

We can use indexing to reverse the order of the levels.

Code
crash_table[, c(2, 1)]
    Disabled
Age  1 0
  50 0 9
  51 0 2
  52 0 1
  53 0 3
  54 2 4
  55 6 5
  56 4 5
  57 4 3
  58 5 0
  59 4 0

The indexing above is used to generate a model for group level data. The saturated model is assigned to the variable crash_model_s.

Code
crash_model_s <- glm(
  formula = crash_table[, c(2,1)] ~ factor(age_levels),
  family = binomial(link = "logit")
)

summary(crash_model_s)

Call:
glm(formula = crash_table[, c(2, 1)] ~ factor(age_levels), family = binomial(link = "logit"))

Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)          -2.603e+01  9.061e+04       0        1
factor(age_levels)51  1.123e+00  1.422e+05       0        1
factor(age_levels)52  1.460e+00  1.593e+05       0        1
factor(age_levels)53  8.664e-01  1.363e+05       0        1
factor(age_levels)54  2.533e+01  9.061e+04       0        1
factor(age_levels)55  2.621e+01  9.061e+04       0        1
factor(age_levels)56  2.580e+01  9.061e+04       0        1
factor(age_levels)57  2.631e+01  9.061e+04       0        1
factor(age_levels)58  5.156e+01  1.314e+05       0        1
factor(age_levels)59  5.139e+01  1.332e+05       0        1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3.3435e+01  on 9  degrees of freedom
Residual deviance: 4.2309e-10  on 0  degrees of freedom
AIC: 30.252

Number of Fisher Scoring iterations: 23

We now generate a nested model assigned to the variable crash_model_m.

Code
crash_model_m <- glm(
  formula = crash_table[, c(2,1)] ~ age_levels,
  data = crash,
  family = binomial(link = "logit")
)

summary(crash_model_m)

Call:
glm(formula = crash_table[, c(2, 1)] ~ age_levels, family = binomial(link = "logit"), 
    data = crash)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -43.7804    12.5031  -3.502 0.000463 ***
age_levels    0.7882     0.2250   3.504 0.000459 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33.435  on 9  degrees of freedom
Residual deviance:  6.377  on 8  degrees of freedom
AIC: 20.629

Number of Fisher Scoring iterations: 5
Code
1 - pchisq(6.377, df = 8)
[1] 0.6050807
Code
pchisq(6.377, 8, lower.tail = FALSE)
[1] 0.6050807
Code
lmtest::lrtest(
  crash_model_m,
  crash_model_s
)
Likelihood ratio test

Model 1: crash_table[, c(2, 1)] ~ age_levels
Model 2: crash_table[, c(2, 1)] ~ factor(age_levels)
  #Df  LogLik Df Chisq Pr(>Chisq)
1   2 -8.3145                    
2  10 -5.1260  8 6.377     0.6051

The likelihood ratio test yields a chi-square statistic of 6.377 and a corresponding p value that is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis that the model has adequate fit.

8.3 Question

You analyze data on the time (measured in minutes) that patients are hypoxic (lack of oxygen in the blood) during the first 24 hours following admission to a critical care unit for acute respiratory distress syndrome. Data on whether they required the use of inotropes (epinephrine) during the first 24 hours is also collected (a binary variable with No being no inotropes and Yes confirming the use of inotropes). The response variable is Death with Yes confirming that a patient died and No, that they survived.

The data are in the hypoxia.csv file.

Create a logistic regression to understand the association between hypoxic time and the use of inotropes, and mortality. Determine the goodness of fit of your model.

We import the dataset.

Code
hypoxia <- import_dataset("hypoxia.csv")

Then, we re-level the factor variables so that the reference group is listed first.

Code
hypoxia$Death <- factor(hypoxia$Death, levels = c("No", "Yes"))
hypoxia$Inotropes <- factor(hypoxia$Inotropes, levels = c("No", "Yes"))

The box-and-whisker plot in Figure 3 summarizes the data.

Code
boxplot(
  Time ~ Death + Inotropes,
  data = hypoxia,
  col = "#4197D9"
)

Figure 3: Box-and-Whisker Plot

We generate a model and assign it to the variable hypoxia_model.

Code
hypoxia_model <- glm(
  formula = Death ~ Time + Inotropes,
  data = hypoxia,
  family = binomial(link = "logit")
)

summary(hypoxia_model)

Call:
glm(formula = Death ~ Time + Inotropes, family = binomial(link = "logit"), 
    data = hypoxia)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -10.979280   5.505458  -1.994  0.04612 * 
Time           0.015202   0.008131   1.870  0.06154 . 
InotropesYes   3.189674   1.041148   3.064  0.00219 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 46.070  on 33  degrees of freedom
Residual deviance: 29.589  on 31  degrees of freedom
AIC: 35.589

Number of Fisher Scoring iterations: 5
Code
psych::describe(hypoxia$Time)
   vars  n   mean   sd median trimmed   mad min max range  skew kurtosis    se
X1    1 34 633.41 64.4    647  636.14 48.93 483 777   294 -0.42    -0.03 11.04

The descriptive statistics show a mean hypoxic time of 633.41 minutes with a standard deviation of 64.4 minutes. Since we have individual level data, we conduct a Hosmer-Lemeshow goodness-of-fit test.

Code
hl_test_info_hypoxia <- ResourceSelection::hoslem.test(
  hypoxia_model$y,
  fitted(hypoxia_model),
  g = 3
)
hl_test_info_hypoxia

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  hypoxia_model$y, fitted(hypoxia_model)
X-squared = 2.7352, df = 1, p-value = 0.09816

The Hosmer-Lemeshow GOF test yields a chi-square statistic of 2.7352 and a p value that is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis that the model is a good fit.

Code
cbind(Total = apply(hl_test_info_hypoxia$observed, 1, sum), 
      Observed = hl_test_info_hypoxia$observed[,2], 
      Expected = hl_test_info_hypoxia$expected[,2])
               Total Observed Expected
[0.0257,0.431]    12        3 2.308408
(0.431,0.868]     11        6 7.739984
(0.868,0.934]     11       11 9.951609

8.4 Question

In the code chunk below, we create a data set as a data frame object, assigned to the variable chd_data. The data set contains information on 609 males who were involved in a study of coronary heart disease (CHD). There are three predictor variables.

  • AGE with 1 if age is more than 55 and 0 if age is equal to or less than 55
  • CAT is a binary variable that indicates high 1 or normal 0 catecholamine level
  • ECG test results are 1 for an abnormal test and 0 for a normal test
  • The variable CHD_case contains the number of subjects at each combination of the predictor levels that have CHD while the variable CHD_control contains the number of subjects without CHD
Code
CAT <- factor(c(0,0,0,0,1,1,1,1), levels = c(0,1))
AGE <- factor(c(0,1,0,1,0,1,0,1), levels = c(0,1))
ECG <- factor(c(0,0,1,1,0,0,1,1), levels = c(0,1))
CHD_case <- c(17,15,7,5,1,9,3,14)
CHD_control <- c(257,107,52,27,7,30,14,44)
chd_data <- data.frame(CAT, AGE, ECG, CHD_case, CHD_control)

You consider the model in Equation 8.

\log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1} \left( \text{AGE} \right) + \beta_{2} \left( \text{CAT} \right) + \beta_{3} \left( \text{ECG} \right) \tag{8}

Answer the following questions

  1. Is chd_data an individual level data set or group level data set?

  2. How many unique predictor level combinations are there?

  3. Is the model that you want fit in (Lab problem 1 model) the saturated model? Explain why or why not. If it is not the saturated model, then write down the saturated model.

  4. Fit the saturated model and report the estimated intercept and slopes. Assign the model to the variable model_lab_1.

  5. Compute the predicted probability for CHD for each predictor level combination. Then, show that the saturated model perfectly predicts these probabilities.

  6. Fit a model with ECG as the only predictor (group level data) and show the estimated intercept and coefficient.

  7. Does this model have poor fit? Conduct a goodness of fit test for this model using the lrtest function. Use a 5% significance level. When creating the simple model, assign it to the variable model_ecg.

  8. If the model has poor fit, examine the model predicted probabilities and the standardized residuals to see which observations fit poorly.

  9. Fit the working model in (Lab problem 1 model) and show the estimated intercept and coefficients. Assign the model to the variable model_main.

  10. Does this model have poor fit? Conduct a goodness of fit test for this model using the lrtest function. Use a 5% significance level. Confirm the p value for this test using the pchisq function.

This is a group level data set because the number of unique predictor profiles is smaller than the number of observations. Generally, data sets can be categorized as group level data if all of the variables are categorical and the sample size is sufficiently large.

There are 8 unique predictor level combinations.

\texttt{AGE} | \texttt{CAT} | \texttt{ECG}
less than or equal to 55 | high catecholamine level | abnormal ECG test
less than or equal to 55 | high catecholamine level | normal ECG test
less than or equal to 55 | normal catecholamine level | abnormal ECG test
less than or equal to 55 | normal catecholamine level | normal ECG test
more than 55 | high catecholamine level | abnormal ECG test
more than 55 | high catecholamine level | normal ECG test
more than 55 | normal catecholamine level | abnormal ECG test
more than 55 | normal catecholamine level | normal ECG test

No, the model above is not the saturated model. By definition, a saturated model has as many parameters as there are unique subject profiles (predictor level and outcome combinations). Since we just showed that there are 8 unique profiles, and there are only 4 parameters in the model (\beta_{0},\beta_{1},\beta_{2},\beta_{3}), the model above is not the saturated model. The saturated model is shown in Equation 9.

\begin{align} \log{\left( \frac{\pi}{1 - \pi} \right)} = &\beta_{0} + \beta_{1} \left( \text{AGE} \right) + \beta_{2} \left( \text{CAT} \right) + \beta_{3} \left( \text{ECG} \right) + \\ &\beta_{4} \left( \text{AGE} \times \text{CAT} \right) + \beta_{5} \left( \text{AGE} \times \text{ECG} \right) + \\ &\beta_{6} \left( \text{CAT} \times \text{ECG} \right) + \beta_{7} \left( \text{AGE} \times \text{CAT} \times \text{ECG} \right) \end{align} \tag{9}

The output below displays the estimated intercept and slopes.

Code
model_lab_1 <- glm(cbind(CHD_case, CHD_control) ~ AGE*CAT*ECG, family = binomial(link = "logit"), data = chd_data)

summary(model_lab_1)

Call:
glm(formula = cbind(CHD_case, CHD_control) ~ AGE * CAT * ECG, 
    family = binomial(link = "logit"), data = chd_data)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -2.715863   0.250429 -10.845   <2e-16 ***
AGE1            0.751084   0.372461   2.017   0.0437 *  
CAT1            0.769953   1.097985   0.701   0.4832    
ECG1            0.710529   0.474133   1.499   0.1340    
AGE1:CAT1      -0.009147   1.194164  -0.008   0.9939    
AGE1:ECG1      -0.432149   0.733384  -0.589   0.5557    
CAT1:ECG1      -0.305064   1.331323  -0.229   0.8188    
AGE1:CAT1:ECG1  0.085525   1.524490   0.056   0.9553    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance:  2.1332e+01  on 7  degrees of freedom
Residual deviance: -6.2172e-15  on 0  degrees of freedom
AIC: 44.712

Number of Fisher Scoring iterations: 4

We can do the computations by hand, but it is easier to do them in R. We can use the mutate function to create a new proportion variable that we will add to the data set. We then use the predict function on the fitted model object to get the predicted probabilities.

Code
chd_data <- chd_data %>%
  dplyr::mutate(prop_CHD = CHD_case / (CHD_case + CHD_control))

Code
predict(model_lab_1, type = "response")
        1         2         3         4         5         6         7         8 
0.0620438 0.1229508 0.1186441 0.1562500 0.1250000 0.2307692 0.1764706 0.2413793 
Code
model_ecg <- glm(cbind(CHD_case, CHD_control) ~ ECG, family = binomial(link = "logit"), data = chd_data)

summary(model_ecg)

Call:
glm(formula = cbind(CHD_case, CHD_control) ~ ECG, family = binomial(link = "logit"), 
    data = chd_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.2563     0.1622 -13.912  < 2e-16 ***
ECG1          0.7036     0.2609   2.697  0.00701 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 21.332  on 7  degrees of freedom
Residual deviance: 14.357  on 6  degrees of freedom
AIC: 47.069

Number of Fisher Scoring iterations: 4
Code
lmtest::lrtest(
  model_ecg,
  model_lab_1
)
Likelihood ratio test

Model 1: cbind(CHD_case, CHD_control) ~ ECG
Model 2: cbind(CHD_case, CHD_control) ~ AGE * CAT * ECG
  #Df  LogLik Df  Chisq Pr(>Chisq)  
1   2 -21.534                       
2   8 -14.356  6 14.357    0.02589 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The likelihood ratio test yields a chi-square statistic of 14.357 and a corresponding p value that is less than the 5\% significance level. Therefore, we reject the null hypothesis that the model is a good fit.

Since the model demonstrated poor fit, we calculate the predicted probabilities as shown below.

Code
predict(model_ecg, type = "response")
         1          2          3          4          5          6          7 
0.09480813 0.09480813 0.17469880 0.17469880 0.09480813 0.09480813 0.17469880 
         8 
0.17469880 

Below are the standardized residuals to see which observations fit poorly.

Code
rstandard(model_ecg)
          1           2           3           4           5           6 
-3.18069007  1.19740984 -1.48271625 -0.31037455  0.28183134  2.61721435 
          7           8 
 0.02028013  1.58714308 

From the output above, profiles 1 and 6 have standardized residuals greater than 2 in absolute value, so these are the observations that fit poorly. The estimated intercept and coefficients of the working model are displayed below.

Code
model_main <- glm(cbind(CHD_case, CHD_control) ~ AGE + CAT + ECG,
                  data = chd_data,
                  family = binomial(link = "logit"))

summary(model_main)

Call:
glm(formula = cbind(CHD_case, CHD_control) ~ AGE + CAT + ECG, 
    family = binomial(link = "logit"), data = chd_data)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.6163     0.2123 -12.322   <2e-16 ***
AGE1          0.6157     0.2838   2.169   0.0301 *  
CAT1          0.6223     0.3194   1.948   0.0514 .  
ECG1          0.3620     0.2904   1.247   0.2126    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 21.33202  on 7  degrees of freedom
Residual deviance:  0.95443  on 4  degrees of freedom
AIC: 37.666

Number of Fisher Scoring iterations: 4
Code
lmtest::lrtest(
  model_main,
  model_lab_1
)
Likelihood ratio test

Model 1: cbind(CHD_case, CHD_control) ~ AGE + CAT + ECG
Model 2: cbind(CHD_case, CHD_control) ~ AGE * CAT * ECG
  #Df  LogLik Df  Chisq Pr(>Chisq)
1   4 -14.833                     
2   8 -14.356  4 0.9544     0.9166
Code
#Confirm the p value

pchisq(0.9544, df = 4, lower.tail = FALSE)
[1] 0.9166298

Based on the likelihood ratio test, we have a chi-square statistic of 0.9544 and a p value that is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis and conclude that the model without the interaction terms (i.e., the simpler model_main) has adequate fit.

8.5 Question

Suppose that we are interested in investigating the factors that influence the development of germinal matrix hemorrhages (gmh), which are hemorrhages (bleeding) in the brain in low birth weight infants. In particular, we are interested in investigating the influence of a baby’s head circumference and gender on the probability of having a germinal matrix hemorrhage. The data set gmh.csv contains the variables shown in Table 8.

Table 8: Variables for Problem 5
Variable Name Description Levels or Units of Measurement
gmh presence of hemorrhage 0 for no hemorrhage and 1 for hemorrhage
headcirc head circumference centimeters
sex 0 for female and 1 for male

Suppose that we want to use the model shown in Equation 10.

\log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1} \left( \text{head circumference} \right) + \beta_{2} \left( \text{sex} \right) \tag{10}

Answer the following questions.

  1. Import the data set and assign it to the variable gmh.

  2. How many observations are in the data set?

  3. How many unique values are there for the headcirc variable?

  4. Is this an individual level data set or a group level data set?

  5. Is the model in Equation 11 a saturated model? If so, or if not, explain your reasoning.

\begin{align} \log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + &\beta_{1} \left( \text{head circumference} \right) + \beta_{2} \left( \text{sex} \right) \\ &\beta_{3} \left( \text{head circumference} \times \text{sex} \right) \end{align} \tag{11}

  6. Create the model in Equation 12 and assign it to the variable model_lab_2. Conduct an appropriate goodness of fit test of the null hypothesis that the logistic model with head circumference and sex as the predictors (main effects only) adequately fits the data. Use a 5% significance level.

\begin{align} \log{\left( \frac{\pi}{1 - \pi} \right)} = \beta_{0} + \beta_{1} \left( \text{head circumference} \right) + \beta_{2} \left( \text{sex} \right) \end{align} \tag{12}

Code
gmh <- import_dataset("gmh.csv")

There are 100 observations in the data set.

Code
nrow(gmh)
[1] 100

There are 77 unique values for the \texttt {headcirc} variable.

Code
length(unique(gmh$headcirc))
[1] 77

This is individual level data because the number of unique predictor profile and outcome combinations closely equals the number of observations. Generally, when a data set contains continuous variables, it is likely to be an individual level data set.

Code
str(gmh)
'data.frame':   100 obs. of  3 variables:
 $ gmh     : int  1 0 1 0 1 0 1 1 0 0 ...
 $ headcirc: num  21 22 22 22.4 22.9 ...
 $ sex     : int  0 0 0 0 0 1 1 0 0 0 ...

This is not the saturated model. The saturated model contains as many parameters as there are unique combinations of the predictor levels and outcome. There are at least 77 of these (based on the fact that we reported 77 unique head circumference values in the data set). This cannot be the saturated model since it only has 4 parameters (1 intercept and 3 slopes).

We create the model_lab_2 model with \texttt {gmh} as the outcome and \texttt {headcirc} and \texttt {sex} as the predictor variables as shown below.

Code
model_lab_2 <- glm(
  formula = gmh ~ headcirc + sex,
  data = gmh,
  family = binomial(link = "logit")
)

summary(model_lab_2)

Call:
glm(formula = gmh ~ headcirc + sex, family = binomial(link = "logit"), 
    data = gmh)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)   8.3328     2.8305   2.944  0.00324 **
headcirc     -0.3540     0.1091  -3.245  0.00118 **
sex          -0.1737     0.4972  -0.349  0.72682   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 114.61  on 99  degrees of freedom
Residual deviance: 101.61  on 97  degrees of freedom
AIC: 107.61

Number of Fisher Scoring iterations: 4

We generate descriptive statistics for the \texttt {headcirc} variable.

Code
psych::describe(gmh$headcirc)
   vars   n  mean  sd median trimmed  mad min   max range skew kurtosis   se
X1    1 100 26.79 2.6     27   26.75 2.45  21 35.78 14.78 0.29     0.57 0.26

The mean head circumference is 26.79 cm with a standard deviation of 2.6 cm.

Code
ResourceSelection::hoslem.test(
  model_lab_2$y,
  fitted(model_lab_2),
  g = 10
)

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  model_lab_2$y, fitted(model_lab_2)
X-squared = 3.466, df = 8, p-value = 0.9018

Since we have individual level data, we use the Hosmer-Lemeshow test to assess the fit of the model. With a chi-square statistic of 3.466 and a p value that is greater than the significance level of 5\%, we fail to reject the null hypothesis that the model has adequate fit.