INTRODUCTION TO STATISTICS USING PYTHON

Learn to use Python for Statistical Analysis

Author
Affiliation
Dr Juan H Klopper

The Department of Biostatistics and Bioinformatics

Introduction

In this open textbook, created as a Quarto document, we consider the use of Python to conduct the majority of the statistical tests covered in a postgraduate introduction to biostatistics course.

At The George Washington University (GWU), Milken Institute School of Public Health, our postgraduate introduction to biostatistics course is Biostatistical Applications for Public Health, PubH 6002. Many students in the School take PubH 6002, as it is a core requirement for programs including Biostatistics, Bioinformatics, and Public Health, among others.

The Biostatistical Applications for Public Health course develops the intuition behind the most commonly used statistical tests. Students are encouraged to understand the equations, and to do the majority of the calculations by hand.

When I teach this course, I do provide the students with supplemental documentation showing how to perform the tests using the R language for statistical computing and similarly how to use Microsoft® Excel®. They are not obliged to use any specific language or application, though. Since students must show all their calculations during assessment, a computer language only serves the purpose of verifying their own results. The point is to understand the reasoning behind the equations, and hence the emphasis is not on a computer language.

With the supplemental documentation using R, I thought it a good idea to do something similar in Python. This notebook is then a showcase of how to perform the statistical analysis covered in PubH 6002, but in Python. As such, this notebook does not attempt to teach a full biostatistics course. Instead, I assume a basic understanding of biostatistics and concentrate on the use of Python. I do take the time to explain the major topics, though. The most often used equations are also included.

It is important to note that some knowledge of Python is required as well. At GWU, we have a postgraduate course, Introduction to Python for Public Health Research, PubH 6852. If you have never considered this course, please do so. Python is the leading language in data science and data analysis.

In the notebook, we will cover the topics listed in the Contents section that you see on the top-right of this page. We will generate simulated data and import data in comma-separated values (CSV) files, to show as much of the power and ease of use of Python as possible.

Packages

As with R, we need to make use of a variety of packages to conduct all of our analysis. The first three packages are the stats module in the scipy package for statistical tests, the pandas package for data import and wrangling, and the numpy package for numerical computations and random value generation.

from scipy import stats
import pandas
import numpy

The statsmodels package contains many functions for statistical analysis in a variety of modules. We will use one of the functions in the proportion submodule to conduct a one-sample z test for proportions.

from statsmodels.stats.proportion import proportions_ztest

The three measures of association, risk difference, risk ratio, and odds ratio, can be calculated using functions from the zepid package. We can also calculate sensitivity, specificity, and positive and negative predictive values using this package.

from zepid import RiskDifference, RiskRatio, OddsRatio
from zepid.calc import sensitivity, specificity, ppv_converter, npv_converter

Modeling is a very important topic in biostatistics. Functions and modules in the statsmodels package are used to generate a variety of models.

from statsmodels.formula.api import ols, logit
from statsmodels.api import add_constant
from statsmodels.api import stats as sm_stats # Namespace abbreviation naming convention not to override stats from scipy

Post-hoc analysis is used after we reject the null hypothesis that the means of three or more groups are all equal. We will use the pingouin package for post-hoc tests. The package can also be used to conduct a variety of statistical tests instead of the scipy stats module.

import pingouin

We will generate interactive plots using the plotly package.

from plotly import express, graph_objects, io
io.templates.default = 'gridon' # Setting the default plotting style

Matplotlib is the most established plotting package in Python. If you would like a comprehensive overview of using Matplotlib, watch my video HERE.
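As a minimal sketch (assuming Matplotlib is installed in the environment), the snippet below produces a basic histogram of normally distributed values; none of the names used here appear elsewhere in this notebook.

import matplotlib.pyplot as plt

demo_values = numpy.random.normal(loc=100, scale=10, size=80) # Arbitrary illustrative values
plt.hist(demo_values, bins=10)
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()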

Specificity, sensitivity, positive predictive value, and negative predictive value

We will touch on concepts of probability where we make use of joint, marginal, and conditional probabilities throughout this notebook. We start by considering the concepts of specificity, sensitivity, positive predictive value (PPV), and negative predictive value (NPV).

As an illustrative example, we consider the analysis of a new diagnostic test. Table 1 shows the results where the test is used for a disease with a prevalence in the population of 5\%. TP indicates a true positive, where a patient has the disease and the test returns a positive result. TN indicates a true negative, where the test returns a negative result in a patient who does not have the disease. FP indicates a false positive, where the patient does not have the disease, but the test returns a positive result. Lastly, we have FN, a false negative, where the patient does have the disease, but the test returns a negative result. A total of 33+8+4+102=147 tests were conducted.

Table 1: Results of a new diagnostic test
                        Disease present    Disease absent
Test result positive    TP = 33            FP = 8
Test result negative    FN = 4             TN = 102

Sensitivity is the proportion of cases where the test returns a positive result, given that the disease is present. It is written as the conditional probability P \left( + \, \vert \, D \right), shown in Equation 1, where + indicates a positive test result and D is the presence of disease.

\begin{align} &\text{sensitivity} = P \left( + \, \vert \, D \right) = \frac{TP}{TP + FN} \\ \\ &\text{sensitivity} = \frac{33}{33+4} \approx 0.892 \end{align} \tag{1}

The sensitivity function from the calc module of the zepid package can calculate sensitivity. The arguments are detected for the number of true positives, cases for the total number of cases with the disease (TP+FN), alpha for the level of significance used to calculate two-sided confidence intervals, and confint, which can be set to 'wald' or 'hypergeometric' for Wald or hypergeometric confidence intervals. The function returns the sensitivity, the 95\% confidence interval for the sensitivity, and the standard error.

sensitivity(
  detected=33,
  cases=33 + 4,
  alpha=0.05,
  confint='wald'
)
(0.8918918918918919,
 0.7918383492469495,
 0.9919454345368343,
 0.05104866387043433)

We state that the test can detect about 89.2\% of cases where the disease is present.

Specificity is the proportion of cases where the test returns a negative result, given that the disease is not present. It is written as the conditional probability P \left( - \, \vert \, \bar{D} \right), shown in Equation 2, where - is a negative test result and \bar{D} is the absence of disease.

\begin{align} &\text{specificity} = P \left( - \, \vert \, \bar{D} \right) = \frac{TN}{TN + FP} \\ \\ &\text{specificity} = \frac{102}{102+8} \approx 0.927 \end{align} \tag{2}

The specificity function from the calc module of the zepid package can calculate specificity. The first argument, detected, is the number of false positives. The noncases argument is the number of cases without the disease (FP+TN). The alpha argument is the \alpha value used to calculate two-sided confidence intervals for the specificity. The confint argument can either be 'wald' (the default) for a Wald or 'hypergeometric' for a hypergeometric confidence interval. The function returns the specificity, the 95\% confidence interval for the specificity, and the standard error.

specificity(detected=8, noncases=8+102, alpha=0.05, confint='wald')
(0.9272727272727272,
 0.8787434143460249,
 0.9758020401994296,
 0.024760308510511086)

We state that the test correctly identifies about 92.7\% of cases where the disease is absent.

PPV is the probability that the disease is present given that the test returns a positive result, written as the conditional probability P \left( D \, \vert \, + \right). The equation is based on Bayes’ formula and incorporates the prevalence, q, of the disease, as shown in Equation 3, where sens is the sensitivity and spec is the specificity.

\begin{align} &\text{PPV} = P(D \vert +) = \frac{\text{sens} \times q}{\left( \text{sens} \times q \right) + \left( 1 - \text{spec} \right) \times \left( 1 - q \right)} \\ \\ &\text{PPV} = \frac{0.89189 \times 0.05}{(0.89189 \times 0.05) + (1-0.92727) \times (1-0.05)} \approx 0.392 \end{align} \tag{3}
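As a quick check of Equation 3, we can compute the PPV directly in Python (a sketch; the rounded sensitivity and specificity values are taken from the output above).

sens, spec, q = 0.89189, 0.92727, 0.05
(sens * q) / ((sens * q) + (1 - spec) * (1 - q)) # Approximately 0.392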

The ppv_converter function calculates the PPV. We pass the values for the sensitivity, the specificity, and the prevalence to the function.

ppv_converter(0.89189, 0.92727, 0.05)
0.3922533600731829

We state that a person whose test returns a positive result has about a 39.2\% probability of actually having the disease.

NPV is the probability that the disease is absent given that the test returns a negative result, written as the conditional probability P \left( \bar{D} \, \vert \, - \right). The equation is also based on Bayes’ formula and incorporates the prevalence, q, of the disease, as shown in Equation 4, where sens is the sensitivity and spec is the specificity.

\begin{align} &\text{NPV} = P(\overline{D} \vert -) = \frac{\text{spec} \times (1 - q)}{\text{spec} \times (1 - q) + (( 1 - \text{sens}) \times q)} \\ \\ &\text{NPV} = \frac{0.92727 \times (1 - 0.05)}{(0.92727 \times (1 - 0.05)) + ((1 - 0.89189) \times 0.05)} \approx 0.994 \end{align} \tag{4}

The npv_converter function calculates the NPV. We pass the values for the sensitivity, the specificity, and the prevalence to the function.

npv_converter(0.89189, 0.92727, 0.05)
0.993901131881324

We state that a person whose test returns a negative result has about a 99.4\% probability of not having the disease.

Exploratory data analysis

Statistical data analysis starts with exploratory data analysis (EDA). Here we summarize and visualize the data. This step helps us understand the data and discover potential errors such as missing data and outliers.

Generating simulated data

For this section on EDA, we generate simulated data. We create a pandas dataframe object with two variables, Group and Creatinine. The data simulates two exposure groups, Placebo and Treatment, representing a study in which participants received either a placebo or an active intervention, respectively.

The outcome is the serum creatinine level of each participant.

In the code, we make use of the repeat function in the numpy package to repeat the strings 'Placebo' and 'Treatment' 39 and 41 times, respectively.

We then create two separate arrays of the appropriate length for the serum creatinine levels. In the placebo group, we take the values from a normal distribution with \mu=100 and \sigma=10. For the intervention group, we create 41 serum creatinine values taken from a normal distribution with \mu=95 and \sigma=12.

To make the pseudo-random number generation reproducible, we use the seed function from the random module of the numpy package. We seed the generator with the integer 12.

Finally, we add the random values to two columns of a pandas dataframe object, assigned to the variable project_1_data. Note the use of the numpy concatenate function to concatenate the two arrays containing data for the Group variable and the two arrays for the Creatinine variable.

placebo_n = 39
treatment_n = 41

numpy.random.seed(12)

placebo_group = numpy.repeat(
  'Placebo',
  placebo_n
)
placebo_serum_creatinine = numpy.round(
  numpy.random.normal(
    loc=100,
    scale=10,
    size=placebo_n
  ),
  0
)

treatment_group = numpy.repeat(
  'Treatment',
  treatment_n
)

treatment_serum_creatinine = numpy.round(
  numpy.random.normal(
    loc=95,
    scale=12,
    size=treatment_n
  )
)

project_1_data = pandas.DataFrame(
  {
    'Group':numpy.concatenate([placebo_group, treatment_group]),
    'Creatinine':numpy.concatenate([placebo_serum_creatinine, treatment_serum_creatinine])
  }
)

We view the first five observations for the dataframe object below.

project_1_data[:5] # Use indexing shorthand to display rows 0 through 4
Group Creatinine
0 Placebo 105.0
1 Placebo 93.0
2 Placebo 102.0
3 Placebo 83.0
4 Placebo 108.0

Now that we have some data, we can start with EDA.

Measures of central tendency

Measures of central tendency calculate a value for a variable that is representative of all the values for that variable.

The mean of a continuous variable, \bar{X}, is calculated as shown in Equation 5 for the random variable X, where n is the sample size and x_{i} is the variable value for observation i.

\bar{X} = \frac{\sum_{i=1}^{n} x_{i}}{n} \tag{5}

There are many ways to calculate the mean in Python. Since we do not have illegal characters such as spaces in the variable (column) name Creatinine, we can use dot notation, project_1_data.Creatinine. This syntax returns a pandas series object containing the values of the indicated column. The mean method calculates the mean.

Note that in this notebook, we refer to a method as a function that is applied to an object. The code project_1_data.Creatinine returns a series object, and we apply the mean function to it as a method. Note the use of parentheses to indicate that mean is a method.

project_1_data.Creatinine.mean()
96.625

The mean function from the numpy library calculates the mean of a numpy array object. To create an array object, we add the to_numpy method to the pandas series object that is returned by the code project_1_data.Creatinine. The resulting numpy array object is passed as an argument to the numpy mean function.

numpy.mean(project_1_data.Creatinine.to_numpy())
96.625

After generating a pandas series object, we can use the median method to calculate the median. Alternatively, we could also create a numpy array object and use the numpy median function. The median is used when the data for a continuous variable are skewed. We order the values in ascending (or descending) order, and select or calculate the value for which half of the values are smaller and half are larger.

project_1_data.Creatinine.median()
96.5

The mode is used for categorical data type variables. The mode is then the value that occurs most commonly. We can show the unique elements of a categorical variable and show the frequency of each by using the value_counts method for a pandas series object. We do so below for the Group variable using the project_1_data.Group syntax.

project_1_data.Group.value_counts()
Group
Treatment    41
Placebo      39
Name: count, dtype: int64

We generated the data ourselves and know that Treatment is the mode for the Group variable. Note that more than one level (class) of a categorical variable can share the highest frequency. In this case, we refer to the variable as multi-modal.

Comparative summary statistics calculate statistics for a variable after grouping by the levels (classes) of a categorical variable. Below, we use the groupby method and pass the string 'Group' as argument. The dataframe object is split along all the unique levels of the Group variable. The variable that we are concerned with is passed as an index in square brackets. Then we use the mean method.

project_1_data.groupby('Group')['Creatinine'].mean()
Group
Placebo      97.512821
Treatment    95.780488
Name: Creatinine, dtype: float64

The results show the mean of the Creatinine variable for each level of the Group variable.

Since the variable column name does not contain illegal characters, we can also use dot notation.

project_1_data.groupby('Group').Creatinine.mean()
Group
Placebo      97.512821
Treatment    95.780488
Name: Creatinine, dtype: float64

The loc property allows us to use indexing to select rows and columns. Below, we see two separate ways to use loc. The first uses row, column notation. The row section is a conditional. We want to select only the rows for the observations that are in the Placebo group in the Group variable. The column refers to the Creatinine variable. Finally, we calculate the mean using the mean method.

project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].mean()
97.51282051282051

The result is the mean creatinine value for those that received a placebo. We can also use the alternative syntax examples below, where we use dot notation and then index notation.

# Alternative syntax
project_1_data.loc[project_1_data.Group == 'Placebo'].Creatinine.mean()
97.51282051282051
# Alternative syntax
project_1_data.loc[project_1_data.Group == 'Placebo']['Creatinine'].mean()
97.51282051282051

Measures of dispersion

Measures of dispersion provide us with information about the spread of data for a continuous variable. Below, we use the min and the max methods to calculate the minimum and maximum values for the Creatinine variable by the levels of the Group variable.

project_1_data.groupby('Group')['Creatinine'].min()
Group
Placebo      78.0
Treatment    57.0
Name: Creatinine, dtype: float64
project_1_data.groupby('Group')['Creatinine'].max()
Group
Placebo      129.0
Treatment    122.0
Name: Creatinine, dtype: float64

The range is the difference between the minimum and the maximum values of a continuous variable. We use simple subtraction below.

project_1_data.groupby('Group')['Creatinine'].max() - project_1_data.groupby('Group')['Creatinine'].min()
Group
Placebo      51.0
Treatment    65.0
Name: Creatinine, dtype: float64

The quartiles of a continuous variable can be calculated using the quantile method of a series object. Below, we use the groupby method again. We pass a Python list object to the quantile method. These are the percentile values expressed as fractions of 1. The 0 returns the minimum value and the 1 returns the maximum value. The 0.25 returns the first quartile and the 0.75 the third quartile value. The 0.5 returns the second quartile or median value. Note that we can use any percentile value that may be of interest to us.

project_1_data.groupby('Group')['Creatinine'].quantile([0, 0.25, 0.5, 0.75, 1])
Group          
Placebo    0.00     78.0
           0.25     90.5
           0.50     99.0
           0.75    105.0
           1.00    129.0
Treatment  0.00     57.0
           0.25     88.0
           0.50     94.0
           0.75    103.0
           1.00    122.0
Name: Creatinine, dtype: float64

The iqr function in the scipy stats module calculates the interquartile range, the difference between the first and third quartile values. We pass a numpy array of the creatinine values of those in the placebo group as argument.

stats.iqr(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy()
)
14.5

The sample variance is the sum of the squared differences between each continuous variable value, x_{i}, and the mean of that variable, \bar{\mathbf{x}}, divided by one less than the sample size, that is n-1, as shown in Equation 6.

s^{2} = \frac{\sum_{i=1}^{n} \left( x_{i} - \bar{\mathbf{x}} \right)^{2}}{n-1} \tag{6}

We use the var method. Since we require a sample variance, we have to use the ddof argument and set it to 1, indicating n-1. The corresponding equation for the variance of a population would have only the population size in the denominator (no subtraction of 1). Below, we calculate the sample variance of the Creatinine variable for each level of the Group variable.

project_1_data.groupby('Group')['Creatinine'].var(ddof=1)
Group
Placebo      112.993252
Treatment    161.425610
Name: Creatinine, dtype: float64

The standard deviation is simply the square root of the variance. The std method can be used to calculate the sample standard deviation, as is shown for the Creatinine variable as grouped by the levels of the Group variable.

project_1_data.groupby('Group')['Creatinine'].std(ddof=1)
Group
Placebo      10.629828
Treatment    12.705338
Name: Creatinine, dtype: float64

Descriptive statistics

The describe method returns the important sample descriptive statistics for a continuous variable. We do so for the Creatinine variable, for each of the two levels of the Group variable.

project_1_data.groupby('Group').Creatinine.describe()
count mean std min 25% 50% 75% max
Group
Placebo 39.0 97.512821 10.629828 78.0 90.5 99.0 105.0 129.0
Treatment 41.0 95.780488 12.705338 57.0 88.0 94.0 103.0 122.0

Data visualization

Data visualization is an excellent form of EDA. There are many Python packages for data visualization. We see two examples using the plotly express module below.

Figure 1 shows a bar plot of the levels of the Group variable.

express.histogram(
  data_frame=project_1_data,
  x='Group'
).update_yaxes(
  title='Frequency'
)

Figure 1: Bar plot of levels of treatment group

Figure 2 shows a stacked histogram of the Creatinine variable for each of the two levels of the Group variable.

express.histogram(
  data_frame=project_1_data,
  x='Creatinine',
  color='Group',
  marginal='box'
)

Figure 2: Histogram of serum creatinine levels for each treatment group

Comparing means of two populations

There are three parametric tests for the comparison of the mean of a continuous variable between two populations. We have the equal variance t test, known as Student’s t test, the unequal variance t test, known as Welch’s test, and the paired-sample t test for repeated measurements or dependent samples.

Bartlett’s test considers equality of the variance of each population. Under the null hypothesis, we have that \sigma^{2}_{1} - \sigma^{2}_{2} = 0. The bartlett function in the scipy stats module performs this test, returning a test statistic and p value. The creatinine values for each treatment group are each passed as a numpy array.

stats.bartlett(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy(),
  project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine'].to_numpy()
)
BartlettResult(statistic=1.2140216636032084, pvalue=0.27053710182505164)

For \alpha=0.05, we fail to reject the null hypothesis and proceed under the assumption of equal variances.

We need to ensure that the data are from populations in which the values are normally distributed. In this case, we can use the Shapiro-Wilk test, with a null hypothesis that the data are from a normal distribution. We use the shapiro function from the scipy stats module. The creatinine values for each group are passed as argument to the function as numpy array objects.

stats.shapiro(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy()
)
ShapiroResult(statistic=0.9753268361091614, pvalue=0.5372494459152222)
stats.shapiro(
  project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine'].to_numpy()
)
ShapiroResult(statistic=0.9770594239234924, pvalue=0.5659596920013428)

For \alpha=0.05, we fail to reject the null hypothesis in both cases.

In this case, we will use Student’s t test. The test statistic, t_{\text{data}}, and the degrees of freedom, \nu, are shown in Equation 7, assuming a null hypothesis where \mu_{1} - \mu_{2} = 0. We have that n_{1} and n_{2} are the sample sizes, and \bar{X}_{1} and \bar{X}_{2} are the two means under consideration.

\begin{align} &t_{\text{data}} = \frac{(\overline{X}_{1} - \overline{X}_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{\frac{(n_{1} - 1) S_{1}^{2} + (n_{2} - 1) S_{2}^{2}}{n_{1} + n_{2} - 2}} \sqrt{\frac{1}{n_{1}} + \frac{1}{n_{2}}}} \sim t_{\nu} \\ \\ &\nu = n_{1} + n_{2} - 2 \end{align} \tag{7}
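As an illustration of Equation 7, a manual calculation (a sketch using the group summaries computed earlier) reproduces the statistic returned by the ttest_ind function below.

placebo_values = project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine']
treatment_values = project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine']

n_1, n_2 = placebo_values.count(), treatment_values.count()
pooled_variance = ((n_1 - 1) * placebo_values.var(ddof=1) + (n_2 - 1) * treatment_values.var(ddof=1)) / (n_1 + n_2 - 2)
t_data = (placebo_values.mean() - treatment_values.mean()) / (numpy.sqrt(pooled_variance) * numpy.sqrt(1 / n_1 + 1 / n_2))
t_data # Approximately 0.66, matching the ttest_ind result below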

The two sets of creatinine values are passed as numpy array objects to the ttest_ind function of the scipy stats module. We use a two-sided alternative hypothesis. The default value of the alternative argument is 'two-sided'.

stats.ttest_ind(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy(),
  project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine'].to_numpy(),
  alternative='two-sided'
)
TtestResult(statistic=0.6596872652900193, pvalue=0.5113977195100912, df=78.0)

The function returns a t_{\text{data}} statistic and a p value. We note that there is not enough evidence at the 5\% level of significance to show that there is a difference in the serum creatinine levels of the two populations.

To use a critical value approach for this two-sided alternative hypothesis, we require the critical value for a t distribution, in this case with \nu = n_{1} + n_{2} - 2 = 39 + 41 - 2 = 78 degrees of freedom. For \alpha=0.05, the rejection region in the upper tail contains half of 0.05, so we need the quantile at 0.975. We use the t.ppf function from the stats module.

stats.t.ppf(
  0.975,
  df=placebo_n + treatment_n - 2
)
1.990847068555052

We fail to reject the null hypothesis since \lvert t_{\text{data}} \rvert \le t_{\text{crit}}.

The p value can be calculated using the cdf method. We have to multiply the result by 2, since we have a two-sided alternative hypothesis and the order of subtraction of the two means reflects symmetrically around the mean of the t distribution, which is 0. We subtract the value of the cumulative distribution function from 1, since we want to calculate the area in the upper tail of the distribution. We use the t function to specify the degrees of freedom.

2 * (1 - stats.t(placebo_n + treatment_n - 2).cdf(0.6596872652900193))
0.5113977195100912

We have to take a look at one-sided alternative hypotheses as well. In the case of our simulated research data, we might have that \mu_{1} - \mu_{2} > 0. In this case we would have a different critical value. The region of rejection in the upper tail for \alpha=0.05 would have a region of non-rejection of 0.95. We calculate the critical value with the t.ppf function again.

stats.t.ppf(
  0.95,
  df=placebo_n + treatment_n - 2
)
1.6646246444385238

The t_{\text{data}} value is still not larger than the critical value, and we still fail to reject the null hypothesis. The alternative argument in the ttest_ind function is simply set to 'greater' in this case. We must take note of the order in which the data for the two populations are passed. Since we have that \mu_{1} is the creatinine level in the population receiving the placebo, this array of values is passed first.

stats.ttest_ind(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy(),
  project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine'].to_numpy(),
  alternative='greater'
)
TtestResult(statistic=0.6596872652900193, pvalue=0.2556988597550456, df=78.0)

As before, we can calculate the p value using the cdf method. Note that we have a one-sided alternative hypothesis and do not multiply the result by 2.

1 - stats.t(placebo_n + treatment_n - 2).cdf(0.6596872652900193)
0.2556988597550456

In the case of an unequal variance t test, or Welch’s test, we simply pass the value False to the equal_var argument. The t_{\text{data}} statistic and the degrees of freedom are calculated as shown in Equation 8, where S_{1}^{2} and S_{2}^{2} are the respective variances of the two groups.

\begin{align} &t_{\text{data}} = \frac{(\overline{X}_{1} - \overline{X}_{2}) - (\mu_{1} - \mu_{2})}{\sqrt{\frac{S_{1}^{2}}{n_{1}} + \frac{S_{2}^{2}}{n_{2}}}} \sim t_{\nu} \\ \\ &\nu = \frac{{\left( \frac{S_{1}^{2}}{n_{1}}+ \frac{S_{2}^{2}}{n_{2}} \right)}^{2}}{\frac{{\left( \frac{S_{1}^{2}}{n_{1}} \right)}^{2}}{n_{1} -1} + \frac{{\left( \frac{S_{2}^{2}}{n_{2}} \right)}^{2}}{n_{2} - 1}} \end{align} \tag{8}
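A minimal sketch of the call is shown below (our simulated data do not require Welch’s test, since Bartlett’s test did not reject equality of variances).

stats.ttest_ind(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy(),
  project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine'].to_numpy(),
  equal_var=False # Welch's test
)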

When the assumptions for the use of the two parametric tests above (Student’s t test and Welch’s test) are not met, we must consider a non-parametric alternative. This is the Mann-Whitney U test.

Below, we use the mannwhitneyu function from the stats module of the scipy package to conduct the Mann-Whitney U test. We pass two numpy arrays as arguments, as we did when using the ttest_ind function.

stats.mannwhitneyu(
  project_1_data.loc[project_1_data.Group == 'Placebo', 'Creatinine'].to_numpy(),
  project_1_data.loc[project_1_data.Group == 'Treatment', 'Creatinine'].to_numpy(),
)
MannwhitneyuResult(statistic=860.5, pvalue=0.5599556077102408)

As with the parametric test, we have that p>\alpha and we fail to reject the null hypothesis at the 5\% level of significance.

For a paired-sample t test, we create a new data set. We simulate measuring the serum creatinine level before and after an intervention.

numpy.random.seed(12)

before = numpy.round(
  numpy.random.normal(
    loc=120,
    scale=20,
    size=80
  )
)

after = numpy.round(
  before - numpy.random.normal(
    loc=1,
    scale=5,
    size=80
  )
)

project_2_data = pandas.DataFrame(
  {
    'Before':before,
    'After':after
  }
)

The t_{\text{data}} statistic and the degrees of freedom, \nu, are shown in Equation 9, where \bar{d} is the mean of the differences between the two measurements, s_{d} is the standard deviation of the differences, and n is the sample size.

\begin{align} &t_{\text{data}} = \frac{\overline{d}}{\frac{s_{d}}{\sqrt{n}}} \sim t_{\nu} \\ \\ &\nu = n - 1 \end{align} \tag{9}

We can consider a paired-sample t test as a one-sample t test where we measure the difference between the repeated measures against a difference of 0. Below, we do a simple subtraction between the two measures of the continuous variable and use the ttest_1samp function. The popmean (difference under the null hypothesis) argument is set to 0.

stats.ttest_1samp(
  project_2_data.Before - project_2_data.After,
  popmean=0
)
TtestResult(statistic=0.48746592754436224, pvalue=0.6272781896360737, df=79)

We can also use the ttest_rel function and pass the two pandas series objects as two arguments.

stats.ttest_rel(
  project_2_data.Before,
  project_2_data.After
)
TtestResult(statistic=0.48746592754436224, pvalue=0.6272781896360737, df=79)

The null hypothesis is that \Delta=0, where \Delta is the difference in the measurements in the population. There is not enough evidence at the 5\% level of significance to show that there is a difference in the serum creatinine level before and after the intervention.

Care must be taken when considering one-tailed alternative hypothesis for the paired-sample t test. The order of subtraction is important.
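As a sketch of a one-sided paired test (hypothesizing that the serum creatinine level decreases after the intervention, so that the mean of Before minus After is greater than 0), the alternative argument is set to 'greater' and the order of the arguments determines the direction of the subtraction.

stats.ttest_rel(
  project_2_data.Before,
  project_2_data.After,
  alternative='greater' # H1: mean of (Before - After) > 0
)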

The Wilcoxon signed-rank test is the non-parametric alternative to the paired-sample t test. Below, we use the wilcoxon function from the scipy stats module. We pass the difference between the two sets of measurements as argument.

stats.wilcoxon(
  project_2_data.Before - project_2_data.After
)
WilcoxonResult(statistic=1466.0, pvalue=0.8566244954627896)

As with the paired-sample t test, we fail to reject the null hypothesis at the 5\% level of significance.

Categorical data analysis

For a categorical data type variable, we can consider the frequency and relative frequency (proportion) of the levels (classes) of the variable.

Below, we generate simulated data for a new variable for our project_1_data dataframe object. The disease (outcome) variable is Improvement with levels No and Yes. The Group variable will serve as our exposure variable with levels Placebo (the non-exposure) and Treatment (the exposure).

numpy.random.seed(12)

improvement_placebo = numpy.random.choice(
  ['No', 'Yes'],
  size=placebo_n,
  p=[0.7, 0.3]
)

improvement_treatment = numpy.random.choice(
  ['No', 'Yes'],
  size=treatment_n,
  p=[0.4, 0.6]
)

project_1_data['Improvement'] = numpy.concatenate(
  [
    improvement_placebo,
    improvement_treatment
  ]
)

We view the first five observations.

project_1_data[:5]
Group Creatinine Improvement
0 Placebo 105.0 No
1 Placebo 93.0 Yes
2 Placebo 102.0 No
3 Placebo 83.0 No
4 Placebo 108.0 No

We start by considering the proportions of the levels of a single categorical variable.

One-sample z test for proportions

The one-sample z test for proportions considers the proportion of the success level of a binary categorical variable. The value_counts method shows the frequency of the levels of the Improvement variable below.

project_1_data.Improvement.value_counts()
Improvement
Yes    42
No     38
Name: count, dtype: int64

Adding the normalize argument set to True shows the relative frequencies (proportions).

project_1_data.Improvement.value_counts(normalize=True)
Improvement
Yes    0.525
No     0.475
Name: proportion, dtype: float64

If Yes is our successful outcome, we can consider a one-sample z test for proportions. Each observation is independent and neither proportion (success or failure) is close to 0 or to 1. The test statistic is calculated as shown in Equation 10, where \hat{p} is the proportion of successes in our data and p_{0} is the proportion of successes under the null hypothesis.

\begin{align} &z_{\text{data}} = \frac{\hat{p} - p_{0}}{\sqrt{\frac{p_{0}(1-p_{0})}{n}}} \sim \text{N} \left( 0,1^{2} \right) \end{align} \tag{10}
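As a check of Equation 10 (a sketch using \hat{p} = 42/80 and p_{0} = 0.5), the calculation below gives z \approx 0.447. Note that the proportions_ztest function used next defaults to the sample proportion in the standard error, which is why its statistic (\approx 0.448) differs very slightly.

p_hat = 42 / 80
p_0 = 0.5
(p_hat - p_0) / numpy.sqrt(p_0 * (1 - p_0) / 80) # Approximately 0.447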

The proportions_ztest function from the proportion submodule of the statsmodels package performs a one-sample z test for proportions. The count argument is the frequency of successes, the nobs argument is the sample size, and the value argument is the proportion of successes under the null hypothesis. We will use p_{0} = 0.5 under the null hypothesis for this example.

proportions_ztest(
  count = 42,
  nobs = 80,
  value = 0.5
)
(0.44777366283964537, 0.6543165528058583)

We see z_{\text{data}} \approx 0.44777. The critical value, z_{\text{crit}}, for \alpha=0.05 (two-sided alternative hypothesis) is calculated below.

stats.norm.ppf(0.975)
1.959963984540054

We have that \lvert z_{\text{data}} \rvert \ngtr z_{\text{crit}}. There is not enough evidence at the 5\% level of significance to show that the proportion of those that improved is different from a half.

We can also express a confidence interval for the proportion of successes. Equation 11 shows how it is calculated, where c is the confidence coefficient for a given level of confidence.

\hat{p} \pm c \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}} \tag{11}

For a 95\% confidence level, we calculate c below using the norm.ppf function from the scipy stats module.

c = stats.norm.ppf(0.975)
c
1.959963984540054

We now use Equation 11 to calculate the lower and the upper bounds.

n = 42 + 38
p_hat = 42 / n
margin_of_error = c * numpy.sqrt(p_hat * (1 - p_hat) / n)
lower = p_hat - margin_of_error
upper = p_hat + margin_of_error

round(lower, 4), round(upper, 4)
(0.4156, 0.6344)

Our proportion of successes (Improvement = Yes) was 0.525, with a 95\% confidence interval of approximately 0.42 to 0.63. Note how the proportion of successes under the null hypothesis, 0.5, is within this interval, supporting the fact that we failed to reject the null hypothesis.

\chi^{2} test for proportions

When we have a binary or a multi-level categorical variable, we can consider the \chi^{2} test for proportions.

The \chi^{2} test for proportions compares the observed data (O) to the expected data (E) under the null hypothesis as shown in Equation 12, where X^{2} is the test statistic, k is the number of levels (classes) of the categorical variable, and \nu is the degrees of freedom. Note that the test statistic follows the \chi^{2} distribution with \nu degrees of freedom.

\begin{align} &X^{2} = \sum_{i=1}^{k} \frac{{\left( O_{i} - E_{i} \right)}^{2}}{E_{i}} \sim \chi^{2}_{\nu} \\ \\ &\nu = k - 1 \end{align} \tag{12}

If we use the same data as before, we have 42 successes and 38 failures. For a sample size of 80, we would expect 40 successes and 40 failures under a null hypothesis of equal proportions.

We can use the chisquare function in the stats module of scipy to conduct a \chi^{2} test for proportions. The first argument is a Python list of observed frequencies and the second argument is a Python list of expected frequencies.

stats.chisquare(
  [42, 38], # Observed frequencies
  [40, 40] # Expected frequencies
)
Power_divergenceResult(statistic=0.2, pvalue=0.6547208460185768)

We see X^{2}_{\text{data}} = 0.2 and p > \alpha, and we fail to reject the null hypothesis. To use a critical value decision rule, we can calculate a critical value using the chi2.ppf function. For \alpha = 0.05 we require the X^{2} value for which the non-rejection region is 1-0.05=0.95. Since we have two levels for the Improvement variable, we have \nu = 2-1=1 degree of freedom.

stats.chi2.ppf(0.95, df=1)
3.841458820694124

We have that X^{2}_{\text{data}} \ngtr X^{2}_{\text{crit}}. There is not enough evidence at the 5\% level of significance to show that the proportion of successes is not a half.

Measures of association

Measures of association refer to the relationship between two categorical variables. There are three such measures: the risk difference (RD), the risk ratio or relative risk (RR), and the odds ratio (OR).

Equation 13 shows the calculation for odds, where p is the probability of success (for a binary variable).

\text{odds} = \frac{p}{1-p} \tag{13}
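For example, using the overall proportion of improvement calculated earlier (\hat{p} = 0.525), the odds of improvement in the full sample are roughly 1.11.

p = 0.525
p / (1 - p) # Approximately 1.11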

Table 2 shows a table of two binary variables, Exposure, with levels exposed E and not exposed \bar{E}, and Disease, with levels disease D and no disease \bar{D}. The n_{ij} represent the joint frequencies, with the respective marginal frequencies.

Table 2: Table of observed values
            D           \bar{D}     Total
E           n_{11}      n_{12}      n_{1+}
\bar{E}     n_{21}      n_{22}      n_{2+}
Total       n_{+1}      n_{+2}      n

Equation 14 shows the equations for the three measures of association.

\begin{align} &\widehat{\text{RD}} = \hat{P} \left( D \, \vert \, E \right) - \hat{P} \left( D \, \vert \, \bar{E} \right) = \frac{n_{11}}{n_{1+}} - \frac{n_{21}}{n_{2+}} \\ \\ &\widehat{\text{RR}} = \frac{\hat{P} \left( D \, \vert \, E \right)}{\hat{P} \left( D \, \vert \, \bar{E} \right)} = \frac{n_{11}n_{2+}}{n_{1+}n_{21}} \\ \\ &\widehat{\text{OR}} = \frac{n_{11} n_{22}}{n_{12} n_{21}} \end{align} \tag{14}

A contingency table is a table of observed frequencies. We use the value_counts method for the Improvement variable, but use the groupby method to group by the levels of the Group variable.

project_1_data.groupby('Group').Improvement.value_counts()
Group      Improvement
Placebo    No             24
           Yes            15
Treatment  Yes            27
           No             14
Name: count, dtype: int64

The result is a 2 \times 2 contingency table. The crosstab function returns a table that is easier to interpret. The pandas series objects along the rows and then the columns are passed as arguments.

pandas.crosstab(
  project_1_data.Group,
  project_1_data.Improvement
)
Improvement No Yes
Group
Placebo 24 15
Treatment 14 27

By adding the margins argument and setting the value to True, we have that the marginal totals are displayed as well.

pandas.crosstab(
  project_1_data.Group,
  project_1_data.Improvement,
  margins=True
)
Improvement No Yes All
Group
Placebo 24 15 39
Treatment 14 27 41
All 38 42 80

One way to calculate measures of association is to encode the levels of the two binary variables. Below, we use the replace method. A Python dictionary object is used, with the original level name as the dictionary keys and the encoded value as the dictionary values. We select the success level and encode it as a 1, with the failure level encoded as a 0.

project_1_data['Group_code'] = project_1_data.Group.replace(
  {
  'Placebo':0,
  'Treatment':1
}
)

project_1_data['Improvement_code'] = project_1_data.Improvement.replace(
  {
  'No':0,
  'Yes':1
}
)

The crosstab function can be used with these new variables. Note that the order of the levels is reversed from that in Table 2.

pandas.crosstab(
  project_1_data.Group_code,
  project_1_data.Improvement_code,
  margins=True
)
Improvement_code 0 1 All
Group_code
0 24 15 39
1 14 27 41
All 38 42 80

We calculate the estimated risk difference in Equation 15.

\hat{P} \left( D \, \vert \, E \right) - \hat{P} \left( D \, \vert \, \bar{E} \right) = \frac{27}{41} - \frac{15}{39} \approx 0.274 \tag{15}
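A quick check of Equation 15 directly from the joint frequencies in the contingency table gives the same value that the zepid output reports below.

27 / 41 - 15 / 39 # Approximately 0.274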

The zepid package has many functions pertinent to the field of epidemiology. Below, we instantiate the RiskDifference class using the variable rd. We then use the fit method from this class. The first argument is the dataframe object. We then use the exposure and outcome arguments to specify the two binary variables under consideration.

rd = RiskDifference()

rd.fit(
  project_1_data,
  exposure='Group_code',
  outcome='Improvement_code'
)

rd.summary()
Comparison:0 to 1
+-----+-------+-------+
|     |   D=1 |   D=0 |
+=====+=======+=======+
| E=1 |    27 |    14 |
+-----+-------+-------+
| E=0 |    15 |    24 |
+-----+-------+-------+ 

======================================================================
                         Risk Difference                              
======================================================================
        Risk  SD(Risk)  Risk_LCL  Risk_UCL
Ref:0  0.385     0.078     0.232     0.537
1      0.659     0.074     0.513     0.804
----------------------------------------------------------------------
       RiskDifference  SD(RD)  RD_LCL  RD_UCL
Ref:0           0.000     NaN     NaN     NaN
1               0.274   0.107   0.063   0.485
----------------------------------------------------------------------
       RiskDifference    CLD  LowerBound  UpperBound
Ref:0           0.000    NaN         NaN         NaN
1               0.274  0.421      -0.341       0.659
----------------------------------------------------------------------
Missing E:    0
Missing D:    0
Missing E&D:  0
======================================================================

We see the estimated risk difference \widehat{\text{RD}} \approx 0.274 in the second set of rows. The function also returns the 95\% confidence interval for the risk difference. The calculation for the confidence interval is shown in Equation 16, using the notation from Table 2.

\begin{align} &\text{Let } \hat{p}_{1} = \hat{P}(D \vert E) \\ &\text{Let } \hat{p}_{2} = \hat{P}(D \vert \overline{E}) \\ &\text{Let } n_{1} = n_{1+} \\ &\text{Let } n_{2} = n_{2+} \\ &\text{Let } c = \text{ confidence coefficient} \\ \\ &\left( \hat{p}_{1} - \hat{p}_{2} \right) \pm c \sqrt{\frac{\hat{p}_{1} \left( 1 - \hat{p}_{1} \right)}{n_{1}} + \frac{\hat{p}_{2} \left( 1 - \hat{p}_{2} \right)}{n_{2}}} \end{align} \tag{16}

As a conclusion, we would state that the risk of improvement is about 27.4 percentage points higher in those receiving the active intervention compared to those receiving placebo (95\% CI 6.3 - 48.5 percentage points).

Under the null hypothesis, the risk difference is 0, that is, the risk of improvement is the same in the two exposure groups. Note that 0 is not within the confidence interval, and we would reject the null hypothesis. We have shown that the risk of improvement is increased.

The estimated risk ratio is calculated in Equation 17.

\frac{\hat{P} \left( D \, \vert \, E \right)}{\hat{P} \left( D \, \vert \, \bar{E} \right)} = \frac{27}{41} \times \frac{39}{15} \approx 1.712 \tag{17}

To calculate the risk ratio, we instantiate the RiskRatio class and follow a similar syntax as before.

rr = RiskRatio()

rr.fit(
  project_1_data,
  exposure='Group_code',
  outcome='Improvement_code'
)

rr.summary()
Comparison:0 to 1
+-----+-------+-------+
|     |   D=1 |   D=0 |
+=====+=======+=======+
| E=1 |    27 |    14 |
+-----+-------+-------+
| E=0 |    15 |    24 |
+-----+-------+-------+ 

======================================================================
                            Risk Ratio                                
======================================================================
        Risk  SD(Risk)  Risk_LCL  Risk_UCL
Ref:0  0.385     0.078     0.232     0.537
1      0.659     0.074     0.513     0.804
----------------------------------------------------------------------
       RiskRatio  SD(RR)  RR_LCL  RR_UCL
Ref:0      1.000     NaN     NaN     NaN
1          1.712   0.232   1.087   2.696
----------------------------------------------------------------------
Missing E:    0
Missing D:    0
Missing E&D:  0
======================================================================

The result is in the second set of rows with \widehat{\text{RR}} \approx 1.712. The function also returns the 95\% confidence interval for the risk ratio. The calculation for the confidence intervals is shown in Equation 18, using the notation from Table 2.

\log{\left( \widehat{\text{RR}} \right)} \pm c \sqrt{\frac{n_{12}}{n_{11}(n_{11} + n_{12})} + \frac{n_{22}}{n_{21} (n_{21} + n_{22})}} \tag{18}
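As a sketch of Equation 18, the confidence interval can be reproduced from the joint frequencies; the interval is constructed on the log scale and then exponentiated.

rr_hat = (27 / 41) / (15 / 39)
se_log_rr = numpy.sqrt(14 / (27 * (27 + 14)) + 24 / (15 * (15 + 24)))
c = stats.norm.ppf(0.975)
numpy.exp(numpy.log(rr_hat) - c * se_log_rr), numpy.exp(numpy.log(rr_hat) + c * se_log_rr) # Approximately (1.087, 2.696), matching the zepid output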

As a conclusion, we would state that the risk of improvement is 1.712 times higher in those receiving the active intervention compared to those receiving the placebo (95\% CI 1.087 - 2.696). Alternatively, we could state that the risk of improvement is 71.2\% higher in those that received the active intervention compared to those that received the placebo (95\% CI 8.7\% - 169.6\%).

Under the null hypothesis, the relative risk is 1, that is, the risk of improvement is the same in the two exposure groups. Note that 1 is not within the confidence interval and we would reject the null hypothesis. We have shown that the risk of improvement is increased.

The estimated odds ratio is calculated in Equation 19.

\widehat{\text{OR}} = \frac{27 \times 24}{14 \times 15} \approx 3.086 \tag{19}

To calculate the odds ratio, we instantiate the OddsRatio class and follow a similar syntax as before.

oddsratio = OddsRatio()

oddsratio.fit(
  project_1_data,
  exposure='Group_code',
  outcome='Improvement_code'
)

oddsratio.summary()
Comparison:0 to 1
+-----+-------+-------+
|     |   D=1 |   D=0 |
+=====+=======+=======+
| E=1 |    27 |    14 |
+-----+-------+-------+
| E=0 |    15 |    24 |
+-----+-------+-------+ 

======================================================================
                           Odds Ratio                                 
======================================================================
       OddsRatio  SD(OR)  OR_LCL  OR_UCL
Ref:0      1.000     NaN     NaN     NaN
1          3.086   0.466   1.239   7.686
----------------------------------------------------------------------
Missing E:    0
Missing D:    0
Missing E&D:  0
======================================================================

The result is in the second set of rows with \widehat{\text{OR}} \approx 3.086. The function also returns the 95\% confidence interval for the odds ratio. The calculation for the confidence intervals is shown in Equation 20, using the notation from Table 2.

\log{\left( \widehat{OR} \right)} \pm c \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}} \tag{20}

As a conclusion, we would state that the odds of improvement are 3.086 times higher in those receiving the active intervention compared to those receiving the placebo (95\% CI 1.239 - 7.686). Alternatively, we could state that the odds of improvement are 208.6\% higher in those that received the active intervention compared to those that received the placebo (95\% CI 23.9\% - 668.6\%).

Under the null hypothesis, the odds ratio is 1, that is, the odds of improvement are the same in the two exposure groups. Note that 1 is not within the confidence interval and we would reject the null hypothesis. We have shown that the odds of improvement are increased.

Pearson’s \chi^{2} test

We can express a test statistic for the relationship between two categorical variables based on the difference between observed and expected frequencies. Under the null hypothesis, the two variables are not associated, in other words, they are independent.

From probability theory we have that if the outcomes for two experiments are A and B, and they are independent, then Equation 21 holds.

P \left( A \cap B \right) = P \left( A \right) \, P \left( B \right) \tag{21}

We can calculate expected values under a null hypothesis of independence. We review the table of observed values from our previous example.

pandas.crosstab(
  project_1_data.Group,
  project_1_data.Improvement,
  margins=True
)
Improvement No Yes All
Group
Placebo 24 15 39
Treatment 14 27 41
All 38 42 80

The joint probability of being in the placebo group and showing no improvement is \hat{P} \left( \text{Placebo and No} \right) = \frac{24}{80}. The probability of being in the placebo group is \hat{P} \left( \text{Placebo} \right) = \frac{39}{80} and of no improvement is \hat{P} \left( \text{No} \right) = \frac{38}{80}. Given independence, we see the calculation for the expected joint frequency of being in the placebo group and reporting no improvement, E_{11}, as shown in Equation 22.

\begin{align} &\frac{E_{11}}{80} = \frac{39}{80} \times \frac{38}{80} \\ \\ &E_{11} = \frac{39 \times 38}{80} \end{align} \tag{22}

Note how E_{11}, the expected value in row 1 column 1, is the product of the corresponding row total and column total divided by the sample size under the null hypothesis of independence between the two variables. The pattern repeats for all the joint frequencies.

We can express the difference between the observed joint frequencies and the expected joint frequencies under the null hypothesis as shown in Equation 23, where r is the number of levels of the exposure variable, c is the number of levels of the disease variable, and \nu is the degrees of freedom.

X^{2} = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{{\left( O_{ij} - E_{ij} \right)}^{2}}{E_{ij}} \sim \chi^{2}_{\nu = (r-1)(c-1)}, \text{ under } H_{0} \tag{23}

We assign the result of the crosstab function to a variable so that we can pass it to the chi2_contingency function. This function performs Pearson’s \chi^{2} test.

observed = pandas.crosstab(
  project_1_data.Group,
  project_1_data.Improvement
)
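A sketch of the call follows. The chi2_contingency function returns the test statistic, the p value, the degrees of freedom, and the table of expected joint frequencies; for a 2 \times 2 table it applies the Yates continuity correction by default.

chi2, p, dof, expected = stats.chi2_contingency(observed)
chi2, p, dof, expected # X2 is approximately 4.966 and p is approximately 0.026, together with the expected frequencies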

We see X^{2} = 4.966 with a p value of 0.026. For \alpha=0.05 we reject the null hypothesis that there is no association between Group and Improvement. This supports the fact that none of the confidence intervals for the three measures of association that we calculated contained the measure of association value under the null hypothesis.

To use Pearson’s \chi^{2} test, at least 80\% of the expected joint frequencies must be 5 or more. For a 2 \times 2 table, this means that all of the expected values must be 5 or more. We see the expected values in the result above. All values are 5 or more.

Linear models

Linear regression allows us to understand the linear relationship between explanatory variables and a continuous outcome variable. We can also use a model to determine the value of the outcome variable, given the value(s) of the explanatory variable(s).

Simple linear regression

In simple linear regression, we have a single continuous explanatory variable.

Below, we import a spreadsheet file to use in this section. The read_csv pandas function imports a comma-separated values (CSV) file. In this case, the CSV file is in a GitHub repository online.

bodyfat_file_url = 'https://raw.githubusercontent.com/juanklopper/TutorialData/main/bodyfat.csv'
bodyfat = pandas.read_csv(bodyfat_file_url)

The info method returns information about our data set.

bodyfat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252 entries, 0 to 251
Data columns (total 15 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Density  252 non-null    float64
 1   BodyFat  252 non-null    float64
 2   Age      252 non-null    int64  
 3   Weight   252 non-null    float64
 4   Height   252 non-null    float64
 5   Neck     252 non-null    float64
 6   Chest    252 non-null    float64
 7   Abdomen  252 non-null    float64
 8   Hip      252 non-null    float64
 9   Thigh    252 non-null    float64
 10  Knee     252 non-null    float64
 11  Ankle    252 non-null    float64
 12  Biceps   252 non-null    float64
 13  Forearm  252 non-null    float64
 14  Wrist    252 non-null    float64
dtypes: float64(14), int64(1)
memory usage: 29.7 KB

We see a list of 15 variables and 252 observations. There are no data missing. All values are either integers, indicated by int64, or values with decimal places, indicated by float64. This is an indication that all the variables are continuous type variables.

Below, we use the describe method to show the summary statistics of the Height variable.

bodyfat.Height.describe()
count    252.000000
mean      70.148810
std        3.662856
min       29.500000
25%       68.250000
50%       70.000000
75%       72.250000
max       77.750000
Name: Height, dtype: float64

We do the same for the Weight variable.

bodyfat.Weight.describe()
count    252.000000
mean     178.924405
std       29.389160
min      118.500000
25%      159.000000
50%      176.500000
75%      197.000000
max      363.150000
Name: Weight, dtype: float64

Figure 3 shows the relationship between the Height and Weight variables.

express.scatter(
  data_frame=bodyfat,
  x='Height',
  y='Weight'
)

Figure 3: Scatter plot of height and weight values

The data visualization confirms suspected outliers. These can be removed by identifying their indices and using the drop method. Setting the inplace argument to True makes the change to the dataframe object permanent.

indexHeight = bodyfat.loc[bodyfat.Height == 29.5].index
bodyfat.drop(indexHeight , inplace=True)

indexWeight = bodyfat.loc[bodyfat.Weight == 363.15].index
bodyfat.drop(indexWeight, inplace=True)

With the suspected outliers removed, we repeat the scatter plot in Figure 4.

express.scatter(
  data_frame=bodyfat,
  x='Height',
  y='Weight'
)

Figure 4: Scatter plot of height and weight values (after removal of outlier)

Pearson’s correlation coefficient, r, is an estimate of the population correlation, \rho, between two continuous variables. Correlation is the linear association between two continuous variables. The correlation coefficient indicates the strength and direction of the linear relationship and is calculated as shown in Equation 24, where x_{i} are the values for a variable X, y_{i} are the values for a variable Y, with the sample size n and the respective variable means \bar{X} and \bar{Y}. The symbols s_{X} and s_{Y} indicate the standard deviations of the two variables.

\begin{align} &r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_{i} - \bar{X}}{s_{X}} \right) \left( \frac{y_{i} - \bar{Y}}{s_{Y}} \right) \\ \\ &r = \frac{\sum_{i=1}^{n} \left[ (x_{i} - \bar{X})(y_{i} - \bar{Y}) \right]}{\sqrt{\sum_{i=1}^{n}{\left( x_{i} - \bar{X} \right)}^{2} \sum_{i=1}^{n} {\left( y_{i} - \bar{Y} \right)}^{2}}} \end{align} \tag{24}

The numpy corrcoef function returns the correlation coefficient. We pass two pandas series objects, one for each of the variables, as arguments.

numpy.corrcoef(
  bodyfat.Weight,
  bodyfat.Height
)
array([[1.        , 0.51291305],
       [0.51291305, 1.        ]])

The resultant array of values shows r = 0.513, a moderate positive correlation.

The pearsonr function from the stats module of scipy returns a test statistic and p value for the null hypothesis that there is no correlation between the two variables, \rho = 0. The test statistic t_{\text{data}} is calculated using Equation 25; a quick manual check of the equation is shown below, followed by the pearsonr call.

t_{\text{data}} = \frac{r - 0}{\sqrt{\frac{1 - r^{2}}{n-2}}} = r \sqrt{\frac{n-2}{1 - r^{2}}} \sim t_{\nu = n-2} \text{ under } H_{0} \tag{25}
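As a sketch of Equation 25, the test statistic can be computed directly from the correlation coefficient; the resulting value, about 9.41, also appears later as the t value for Height in the regression summary.

r = 0.5129130498282941 # Pearson's correlation coefficient from above
n = len(bodyfat)
r * numpy.sqrt((n - 2) / (1 - r**2)) # Approximately 9.41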

stats.pearsonr(
  bodyfat.Height.to_list(),
  bodyfat.Weight.to_list()
)
PearsonRResult(statistic=0.5129130498282941, pvalue=3.545130750585916e-18)

We have enough evidence at the 5\% level of significance to show that the correlation between Height and Weight is not 0.

If the data fail the assumptions for the use of Pearson’s correlation coefficient, we can consider the non-parametric Spearman’s correlation coefficient. The spearmanr function from the scipy stats module returns Spearman’s correlation coefficient and the corresponding p value under the null hypothesis that there is no correlation.

stats.spearmanr(
  bodyfat.Height.to_list(),
  bodyfat.Weight.to_list()
)
SignificanceResult(statistic=0.5255648381181571, pvalue=3.720022005717529e-19)

To create a linear model, where we use Height as explanatory variable and Weight as outcome variable, we use the ols function from the statsmodels package. The formula argument takes a string as value. The formula is in the form '<Outcome variable> ~ <Explanatory variable>'. The fit method fits the data to the model. The model is assigned to the variable weight_height_model. We call the summary method on this model to see a summary.

weight_height_model = ols(
  formula='Weight ~ Height',
  data = bodyfat
).fit()

weight_height_model.summary()
OLS Regression Results
Dep. Variable: Weight R-squared: 0.263
Model: OLS Adj. R-squared: 0.260
Method: Least Squares F-statistic: 88.54
Date: Tue, 05 Mar 2024 Prob (F-statistic): 3.55e-18
Time: 14:49:12 Log-Likelihood: -1140.4
No. Observations: 250 AIC: 2285.
Df Residuals: 248 BIC: 2292.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -194.4861 39.623 -4.908 0.000 -272.527 -116.446
Height 5.2995 0.563 9.409 0.000 4.190 6.409
Omnibus: 19.162 Durbin-Watson: 1.538
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.410
Skew: 0.655 Prob(JB): 2.24e-05
Kurtosis: 3.583 Cond. No. 1.90e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.9e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Equation 26 shows the model given the coefficients from the results above.

\begin{align} &\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} \left( \text{Height} \right) \\ \\ &\hat{y} = -194.4861 + 5.2995 \left( \text{Height} \right) \end{align} \tag{26}
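
As an illustration, the model in Equation 26 predicts the following weight for a hypothetical height of 70 (the value 70 is chosen purely for illustration); the result is approximately 176.5.

# Predicted weight for a hypothetical height of 70, using the coefficients in Equation 26
-194.4861 + 5.2995 * 70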

We can perform analysis of variance on the model using the anova_lm function from the statsmodels package.

sm_stats.anova_lm(weight_height_model, typ=2)
sum_sq df F PR(>F)
Height 47880.125599 1.0 88.535759 3.545131e-18
Residual 134118.363841 248.0 NaN NaN

The F_{\text{data}}-ratio is calculated using Equation 27.

\begin{align} &F = \frac{\frac{\sum_{i=1}^{n}{\left( \hat{y}_{i} - \bar{\mathbf{y}} \right)^{2}}}{k-1}}{\frac{\sum_{i=1}^{n}{\left( y_{i} - \hat{y}_{i} \right)^{2}}}{N-k}} \end{align} \tag{27}

To understand Equation 27, we will calculate all the components manually.

We need to use the model to calculate the estimated outcome values. We create a pandas series object of the explanatory variable. The add_constant function adds a column of ones for the intercept term. This creates a design matrix. We then use the predict method and pass the design matrix as argument. We assign the predicted (estimated) outcome values to a new column, WeightPredicted, in our original dataframe object.

explanatory_data = bodyfat.Height
explanatory_data = add_constant(explanatory_data)
bodyfat['WeightPredicted'] = weight_height_model.predict(explanatory_data)

In Figure 5 we plot the actual outcome values against the predicted outcome values, given our simple linear regression model.

express.scatter(
  data_frame=bodyfat,
  x='Weight',
  y='WeightPredicted',
  labels={
    'WeightPredicted':'Predicted weight'
  }
)

Figure 5: Scatter plot of weight and predicted weight

The sum of squares due to the regression (ssr) is the sum of the squared differences between the predicted values of the outcome variable and the mean of the outcome variable, and is calculated using Equation 28.

\text{ssr} = \sum_{i=1}^{n} \left( \hat{y}_{i} - \bar{y} \right)^{2} \tag{28}

We perform the calculation in Equation 28 and assign the result to the variable ssr.

ssr = ((bodyfat.WeightPredicted - bodyfat.Weight.mean())**2).sum()
ssr
47880.12559869983

The residual sum of squares (sse) is the sum of the squared differences between the actual and the predicted outcome variable values (in other words, the residuals), and is calculated using Equation 29.

\text{sse} = \sum_{i=1}^{n} \left( y_{i} - \hat{y}_{i} \right)^{2} \tag{29}

sse = ((bodyfat.Weight - bodyfat.WeightPredicted)**2).sum()
sse
134118.3638413002

The numerator degrees of freedom is the number of parameters in the model minus 1. In our model we only have \hat{\beta}_{0} and \hat{\beta}_{1} and hence \nu_{1} = 2-1=1. The denominator degrees of freedom is the sample size minus the number of parameters in the model, in our case \nu_{2} = 250-2=248.

The numerator of the F_{\text{data}}-ratio is then ssr divided by the numerator degrees of freedom. The result is the mean square due to the regression (msr). The denominator of the F_{\text{data}}-ratio is the sse divided by the denominator degrees of freedom. The result is the mean squared error (mse). We calculate the F_{\text{data}}-ratio below.

f_ratio = (ssr / (2 - 1)) / (sse / (250 - 2))
f_ratio
88.53575907418738

The critical value for F_{\nu_{1}=1, \, \nu_{2}=248} is calculated below using the f.ppf function from the stats module of the scipy package, using \alpha=0.05.

stats.f.ppf(
  0.95,
  1,
  248
)
3.87922825535667
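
Equivalently, we can calculate the p value for F_{\text{data}} using the survival function of the F distribution; it should match the Prob (F-statistic) entry in the model summary above.

# p value for the F statistic with 1 and 248 degrees of freedom
stats.f.sf(f_ratio, 1, 248)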

We have that F_{\text{data}} > F_{\text{crit}} and we reject the null hypothesis that \beta_{1}=0. There is a linear relationship between Height and Weight. Under the null hypothesis, \beta_{1}=0, the value of Height would not affect the value of Weight.

Multiple linear regression

We can add more explanatory variables to our linear regression model. In Figure 6 and Figure 7 we see a scatter plot matrix of the rest of the explanatory variables in the data set.

fig = graph_objects.Figure(
  graph_objects.Splom(
    dimensions=[
      {'label': 'Weight', 'values': bodyfat.Weight},
      {'label': 'Age', 'values': bodyfat.Age},
      {'label': 'Neck', 'values': bodyfat.Neck},
      {'label': 'Chest', 'values': bodyfat.Chest},
      {'label': 'Abdomen', 'values': bodyfat.Abdomen},
      {'label': 'Hip', 'values': bodyfat.Hip}
    ]
  )
)

fig.show()

Figure 6: Scatter plot matrix of weight, age, neck, chest, abdomen, and hip measurements

fig = graph_objects.Figure(
  graph_objects.Splom(
    dimensions=[
      {'label': 'Weight', 'values': bodyfat.Weight},
      {'label': 'Thigh', 'values': bodyfat.Thigh},
      {'label': 'Knee', 'values': bodyfat.Knee},
      {'label': 'Ankle', 'values': bodyfat.Ankle},
      {'label': 'Biceps', 'values': bodyfat.Biceps},
      {'label': 'Forearm', 'values': bodyfat.Forearm}
    ]
  )
)

fig.show()

Figure 7: Scatter plot matrix of weight, thigh, knee, ankle, biceps, forearm measurements

From the data visualization, we see that the Age variable does not seem to be linearly associated with Weight. We will use the other measurement variables as explanatory variables.

weight_height_model = ols(
  'Weight ~ Height + Neck + Chest + Abdomen + Hip + Thigh + Knee + Ankle + Biceps + Forearm + Wrist',
  data = bodyfat
).fit()

weight_height_model.summary()
OLS Regression Results
Dep. Variable: Weight R-squared: 0.977
Model: OLS Adj. R-squared: 0.976
Method: Least Squares F-statistic: 917.8
Date: Tue, 05 Mar 2024 Prob (F-statistic): 5.45e-188
Time: 14:49:12 Log-Likelihood: -707.16
No. Observations: 250 AIC: 1438.
Df Residuals: 238 BIC: 1481.
Df Model: 11
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -332.8193 7.991 -41.648 0.000 -348.562 -317.077
Height 2.0741 0.132 15.713 0.000 1.814 2.334
Neck 0.8705 0.224 3.886 0.000 0.429 1.312
Chest 0.9003 0.089 10.103 0.000 0.725 1.076
Abdomen 0.3938 0.077 5.127 0.000 0.243 0.545
Hip 0.9882 0.128 7.714 0.000 0.736 1.241
Thigh 0.5860 0.131 4.475 0.000 0.328 0.844
Knee 0.2973 0.234 1.273 0.204 -0.163 0.757
Ankle 0.6905 0.211 3.274 0.001 0.275 1.106
Biceps 0.4853 0.165 2.948 0.004 0.161 0.810
Forearm 0.4701 0.200 2.347 0.020 0.076 0.865
Wrist 0.8444 0.490 1.724 0.086 -0.120 1.809
Omnibus: 37.167 Durbin-Watson: 1.895
Prob(Omnibus): 0.000 Jarque-Bera (JB): 92.279
Skew: -0.678 Prob(JB): 9.16e-21
Kurtosis: 5.650 Cond. No. 6.24e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.24e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

The summary shows all the coefficients and an F statistic (among many other results).

We use the anova_lm function again to conduct analysis of variance on our new model.

sm_stats.anova_lm(weight_height_model, typ=2)
sum_sq df F PR(>F)
Height 4348.144432 1.0 246.883206 1.201929e-38
Neck 266.008622 1.0 15.103698 1.319844e-04
Chest 1797.520370 1.0 102.061373 3.371004e-20
Abdomen 463.035834 1.0 26.290702 6.083775e-07
Hip 1047.964809 1.0 59.502373 3.334331e-13
Thigh 352.654722 1.0 20.023375 1.186142e-05
Knee 28.555968 1.0 1.621379 2.041419e-01
Ankle 188.805802 1.0 10.720201 1.217330e-03
Biceps 153.074147 1.0 8.691394 3.515357e-03
Forearm 97.021071 1.0 5.508757 1.974202e-02
Wrist 52.359351 1.0 2.972911 8.596783e-02
Residual 4191.692069 238.0 NaN NaN

There is a remarkable decrease in the residual sum of squares (sse), which now has a value of 4191.692069.

We use the same procedure as before to create a design matrix to use in the predict method. We overwrite the value in the WeightPredicted column.

explanatory_data = bodyfat[['Height', 'Neck', 'Chest', 'Abdomen', 'Hip', 'Thigh', 'Knee', 'Ankle', 'Biceps', 'Forearm', 'Wrist']]
explanatory_data = add_constant(explanatory_data)
bodyfat['WeightPredicted'] = weight_height_model.predict(explanatory_data)

Figure 8 shows the relationship of the outcome variable with the new predicted values.

express.scatter(
  data_frame=bodyfat,
  x='Weight',
  y='WeightPredicted'
)

Figure 8: Scatter plot of weight and predicted weight

As before, we calculate the F_{\text{data}} statistic manually.

ssr = ((bodyfat.WeightPredicted - bodyfat.Weight.mean())**2).sum()
ssr
177806.79737114842
sse = ((bodyfat.WeightPredicted - bodyfat.Weight)**2).sum()
sse
4191.692068851403
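
From these two sums of squares we can also recover the coefficient of determination reported in the summary above, since the model includes an intercept; the result should be approximately 0.977.

# R squared as the proportion of the total variation explained by the model
r_squared = ssr / (ssr + sse)
r_squared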

We have that \nu_{1}=12-1=11 and \nu_{2}=250-12=238.

f_ratio = (ssr / (12 - 1)) / (sse / (250 - 12))
f_ratio
917.7898714285956

This is much larger than the critical value that we calculate below.

stats.f.ppf(
  0.95,
  11,
  238
)
1.8290341139711406

Analysis of variance

Analysis of variance (ANOVA) follows the same calculations for the F_{\text{data}} statistic that we used before. We use ANOVA to determine whether there is a difference between the mean values of a continuous variable among more than two groups (formed by the levels of a categorical variable). This is in contrast to the use of t tests when we only have two groups.

We can view ANOVA as a linear regression model with a categorical explanatory variable. To demonstrate, we import a data set containing information about stroke victims.

stroke_file_url = 'https://raw.githubusercontent.com/juanklopper/TutorialData/main/stroke.csv'
stroke = pandas.read_csv(stroke_file_url)

The info method returns information about the data set.

stroke.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4981 entries, 0 to 4980
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             4981 non-null   object 
 1   age                4981 non-null   float64
 2   hypertension       4981 non-null   object 
 3   heart_disease      4981 non-null   object 
 4   ever_married       4981 non-null   object 
 5   work_type          4981 non-null   object 
 6   Residence_type     4981 non-null   object 
 7   avg_glucose_level  4981 non-null   float64
 8   bmi                4981 non-null   float64
 9   smoking_status     4981 non-null   object 
 10  stroke             4981 non-null   object 
dtypes: float64(3), object(8)
memory usage: 428.2+ KB

The value_counts method returns the frequencies of the levels of the smoking_status variable.

stroke.smoking_status.value_counts()
smoking_status
never smoked       1838
Unknown            1500
formerly smoked     867
smokes              776
Name: count, dtype: int64

For this example, we remove all the observations for which the level of the smoking status variable is Unknown, using the same indexing technique that we used before. We also drop any observations with missing values for the bmi (body mass index) variable using the dropna method.

indexSmoke = stroke.loc[stroke.smoking_status == 'Unknown'].index
stroke.drop(indexSmoke, inplace=True)
stroke.dropna(inplace=True)

Figure 9 shows a box-and-whisker plot of the body mass index for the three levels of the smoking status variable.

express.box(
  data_frame=stroke,
  x='smoking_status',
  y='bmi'
)

Figure 9: Distribution of the body mass index for each smoking group

There are many ways to create a model that uses smoking status as explanatory variable for the body mass index of participants. Here, we encode the three levels with the values never smoked =0, formerly smoked =1, and smokes =2 using the replace method and a dictionary. We also assign the encoded values to a new variable, smoking_status_encoded.

stroke['smoking_status_encoded'] = stroke.smoking_status.replace(
  {
  'never smoked':0,
  'formerly smoked':1,
  'smokes':2
}
)

Now, we use the pandas get_dummies function to create dummy variables for our new variable. We specify the dataframe object and use the columns argument to select the required variable, passed as a Python list object.

stroke = pandas.get_dummies(stroke, columns=['smoking_status_encoded'])

The first five observations of our dataframe object are shown below. We see the new dummy variables.

stroke[:5]
gender age hypertension heart_disease ever_married work_type Residence_type avg_glucose_level bmi smoking_status stroke smoking_status_encoded_0 smoking_status_encoded_1 smoking_status_encoded_2
0 Male 67.0 No Yes Yes Private Urban 228.69 36.6 formerly smoked Yes False True False
1 Male 80.0 No Yes Yes Private Rural 105.92 32.5 never smoked Yes True False False
2 Female 49.0 No No Yes Private Urban 171.23 34.4 smokes Yes False False True
3 Female 79.0 Yes No Yes Self-employed Rural 174.12 24.0 never smoked Yes True False False
4 Male 81.0 No No Yes Private Urban 186.21 29.0 formerly smoked Yes False True False

With never smoked =0, or smoking_status_encoded_0, as the reference level, we create a linear regression model using the dummy variables formerly smoked =1, or smoking_status_encoded_1, and smokes =2, or smoking_status_encoded_2, as explanatory variables.

smoke_bmi_model = ols(
  'bmi ~ smoking_status_encoded_1 + smoking_status_encoded_2',
  data=stroke
).fit()

smoke_bmi_model.summary()
OLS Regression Results
Dep. Variable: bmi R-squared: 0.003
Model: OLS Adj. R-squared: 0.003
Method: Least Squares F-statistic: 5.978
Date: Tue, 05 Mar 2024 Prob (F-statistic): 0.00256
Time: 14:49:12 Log-Likelihood: -11300.
No. Observations: 3481 AIC: 2.261e+04
Df Residuals: 3478 BIC: 2.262e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 29.4688 0.145 203.135 0.000 29.184 29.753
smoking_status_encoded_1[T.True] 0.8065 0.256 3.147 0.002 0.304 1.309
smoking_status_encoded_2[T.True] 0.6211 0.266 2.333 0.020 0.099 1.143
Omnibus: 142.310 Durbin-Watson: 1.989
Prob(Omnibus): 0.000 Jarque-Bera (JB): 159.699
Skew: 0.525 Prob(JB): 2.10e-35
Kurtosis: 3.019 Cond. No. 3.19


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We write the model in Equation 30.

\begin{align} &\hat{y}_{i} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} \\ \\ &\text{estimated mean body mass index} = 29.4688 + \\ \\ &\;\;\;0.8065 \text{ (formerly smoked)} + 0.6211 \text{ (smokes)} \end{align} \tag{30}

Since the pair \left( x_{1}, x_{2} \right) can only take the values (0,0), (1,0), and (0,1), our model can only produce three estimated mean body mass index values. These three values are the mean body mass index for each of the levels of the smoking_status variable.
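
As a check, we can compute the group means directly; they should equal the intercept (never smoked) and the intercept plus the respective coefficient (formerly smoked and smokes). A minimal sketch, assuming the stroke dataframe object from above.

# Mean body mass index for each level of smoking status
stroke.groupby('smoking_status')['bmi'].mean()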

We use the anova_lm function to conduct analysis of variance on our model.

sm_stats.anova_lm(smoke_bmi_model)
df sum_sq mean_sq F PR(>F)
smoking_status_encoded_1 1.0 251.968502 251.968502 6.513982 0.010745
smoking_status_encoded_2 1.0 210.502747 210.502747 5.441994 0.019715
Residual 3478.0 134533.137047 38.681178 NaN NaN

Using our model, we create a design matrix to use in the predict method. We assign the predicted values to a new column, bmiPredicted, in the dataframe object.

explanatory_data = stroke[['smoking_status_encoded_1', 'smoking_status_encoded_2']]
explanatory_data = add_constant(explanatory_data)
stroke['bmiPredicted'] = smoke_bmi_model.predict(explanatory_data)

Now, we manually confirm the results from the anova_lm and ols functions.

ssr = ((stroke.bmiPredicted - stroke.bmi.mean())**2).sum()
ssr
462.4712491657241
sse = ((stroke.bmiPredicted - stroke.bmi)**2).sum()
sse
134533.13704730076

We have that \nu_{1}=3-1=2 and \nu_{2}=3481-3=3478.

f_ratio = (ssr / (3 - 1)) / (sse / (3481 - 3))
f_ratio
5.977988174144142

Below, we calculate a critical value for \nu_{1}=2 and \nu_{2}=3478.

stats.f.ppf(
  0.95,
  2,
  3478
)
2.998314

We have that F_{\text{data}} > F_{\text{crit}} and we reject the null hypothesis that none of the explanatory variables is linearly associated with the outcome variable, that is, that all the slopes are 0.

We can also use the f_oneway function from the stats module of the scipy package to compare more than two means. The values for each group are passed as arguments. Below, we generate these using loc indexing and conditionals.

stats.f_oneway(
  stroke.loc[stroke.smoking_status == 'never smoked', 'bmi'],
  stroke.loc[stroke.smoking_status == 'formerly smoked', 'bmi'],
  stroke.loc[stroke.smoking_status == 'smokes', 'bmi']
)
F_onewayResult(statistic=5.977988174144142, pvalue=0.002560029001165349)

We see the same F_{\text{data}} statistic and p value.

When the data do not meet the assumptions for the use of ANOVA, we can consider the non-parametric Kruskal-Wallis test instead. The kruskal function from the scipy stats module is used in the same way as the f_oneway function.

stats.kruskal(
  stroke.loc[stroke.smoking_status == 'never smoked', 'bmi'],
  stroke.loc[stroke.smoking_status == 'formerly smoked', 'bmi'],
  stroke.loc[stroke.smoking_status == 'smokes', 'bmi']
)
KruskalResult(statistic=15.580390166194027, pvalue=0.00041377215615102063)

Since we have that F_{\text{data}} > F_{\text{crit}}, or equivalently p < \alpha, we can conduct post-hoc analysis.

One such approach is the Bonferroni method. We calculate pairwise t statistics and compare each p value to an adjusted level of significance. The Bonferroni correction of the level of significance is shown in Equation 31.

\text{new } \alpha = \frac{\text{original }\alpha}{\text{number of pairwise tests}} \tag{31}

In our case, this would be \alpha_{\text{adjusted}}=0.017.
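
A quick calculation confirms this adjusted level of significance.

# Bonferroni-adjusted significance level for three pairwise comparisons
alpha_adjusted = 0.05 / 3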

We can use the pairwise_tests function from the pingouin package to do the pairwise analysis. The dv argument is the outcome variable and the between argument is the multilevel categorical explanatory variable. We assign 'bonf' to the padjust argument. We can, however, simply look at the uncorrected p value column and use our adjusted \alpha value.

posthocs = pingouin.pairwise_tests(
  dv='bmi',
  between='smoking_status',
  padjust='bonf',
  data=stroke
  )

posthocs.round(3)
Contrast A B Paired Parametric T dof alternative p-unc p-corr p-adjust BF10 hedges
0 smoking_status formerly smoked never smoked False True 3.165 1746.846 two-sided 0.002 0.005 bonf 6.663 0.129
1 smoking_status formerly smoked smokes False True 0.614 1622.626 two-sided 0.539 1.000 bonf 0.067 0.030
2 smoking_status never smoked smokes False True -2.355 1507.070 two-sided 0.019 0.056 bonf 0.754 -0.099

Analysis of covariance

Analysis of covariance adds a continuous variable to the ANOVA model in the previous section. This continuous variable is termed a covariate.

We use ANCOVA when we want to adjust the main effect for a covariate that we could not control. In the case of our example, we may suppose (unrealistically in this case, but considered for the sake of explanation) that we selected the cases based on their smoking status and could not control for age. The covariate is then used to adjust the effect of smoking status on the outcome, which is body mass index in this case.

Figure 10 shows a scatter plot of the age against the body mass index of the observations in our example data set. The markers are grouped by smoking status. A linear model (trendline) is added for each group. We note that the slopes are fairly similar.

express.scatter(
  data_frame=stroke,
  x='age',
  y='bmi',
  color='smoking_status',
  trendline='ols',
  labels={
    'age':'Age [years]',
    'bmi':'Body mass index',
    'smoking_status':'Smoking status'
  }
).update_traces(
  marker={'opacity':0.2}
)

Figure 10: Scatter plot of age and body mass index for every level of smoking status

We create a model below, assigned to the variable smoke_age_bmi_model, using the ols function. We use the age covariate and the two dummy variables for former smokers and those currently smoking.

smoke_age_bmi_model = ols(
  'bmi ~ age + smoking_status_encoded_1 + smoking_status_encoded_2',
  data=stroke
).fit()

smoke_age_bmi_model.summary()
OLS Regression Results
Dep. Variable: bmi R-squared: 0.014
Model: OLS Adj. R-squared: 0.013
Method: Least Squares F-statistic: 16.39
Date: Tue, 05 Mar 2024 Prob (F-statistic): 1.41e-10
Time: 14:49:12 Log-Likelihood: -11281.
No. Observations: 3481 AIC: 2.257e+04
Df Residuals: 3477 BIC: 2.260e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 27.8509 0.302 92.123 0.000 27.258 28.444
smoking_status_encoded_1[T.True] 0.5317 0.259 2.054 0.040 0.024 1.039
smoking_status_encoded_2[T.True] 0.6175 0.265 2.331 0.020 0.098 1.137
age 0.0344 0.006 6.090 0.000 0.023 0.045
Omnibus: 175.480 Durbin-Watson: 1.987
Prob(Omnibus): 0.000 Jarque-Bera (JB): 202.000
Skew: 0.588 Prob(JB): 1.37e-44
Kurtosis: 3.099 Cond. No. 168.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The anova_lm function returns the results for analysis of variance for our model.

sm_stats.anova_lm(smoke_age_bmi_model)
df sum_sq mean_sq F PR(>F)
smoking_status_encoded_1 1.0 251.968502 251.968502 6.581580 1.034551e-02
smoking_status_encoded_2 1.0 210.502747 210.502747 5.498468 1.908910e-02
age 1.0 1420.042838 1420.042838 37.092436 1.249417e-09
Residual 3477.0 133113.094210 38.283892 NaN NaN

We will calculate the F_{\text{data}} statistic manually again. We start by generating a design matrix.

explanatory_data = stroke[['age', 'smoking_status_encoded_1', 'smoking_status_encoded_2']]
explanatory_data = add_constant(explanatory_data)
stroke['bmiPredicted'] = smoke_age_bmi_model.predict(explanatory_data)

The ssr and sse are calculated below.

ssr = ((stroke.bmiPredicted - stroke.bmi.mean())**2).sum()
ssr
1882.514086894581
sse = ((stroke.bmiPredicted - stroke.bmi)**2).sum()
sse
133113.09420957183

Finally, we have an F_{\text{data}} statistic, which we compare to the F_{\text{crit}} statistic for \nu_{1} = 4-1 = 3 and \nu_{2} = 3481-4 = 3477.

f_ratio = (ssr / (4 - 1)) / (sse / (3481 - 4))
f_ratio
16.390828
stats.f.ppf(
  0.95,
  3,
  3477
)
2.61

Equation 32 shows our model.

\begin{align} &\hat{y} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} + \hat{\beta}_{3} x_{3} \\ \\ &\text{estimated mean body mass index} = 27.8509 + \\ \\ &\;\;\;0.0344 \text{ (age)} + 0.5317 \text{ (former smoker)} + \\ \\ &\;\;\;0.6175 \text{ (current smoker)} \end{align} \tag{32}
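
For example, the model in Equation 32 gives the following adjusted mean body mass index for a hypothetical 60-year-old former smoker (the age of 60 is chosen purely for illustration).

# Estimated mean body mass index for a hypothetical 60-year-old former smoker (Equation 32)
27.8509 + 0.0344 * 60 + 0.5317 * 1 + 0.6175 * 0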

Note how age adjusts the estimated mean body mass index in this model. We see the difference between the original body mass index values, shown in Figure 11, and the adjusted values in Figure 12.

express.box(
  data_frame=stroke,
  x='smoking_status',
  y='bmi',
  labels={
    'smoking_status':'Smoking status',
    'bmi':'Body mass index'
  }
)

Figure 11: Body mass index

express.box(
  data_frame=stroke,
  x='smoking_status',
  y='bmiPredicted',
  labels={
    'smoking_status':'Smoking status',
    'bmiPredicted':'Adjusted body mass index'
  }
)

Figure 12: Body mass index adjusted by covariate

We have that F_{\text{data}} > F_{\text{crit}} and we reject the null hypothesis that all the slopes for our model are 0. Not all the adjusted mean body mass index values are equal.

Logistic regression

A logistic regression model estimates the log odds of success of a binary outcome variable given one or more explanatory variables. If we encode the failure level of the outcome variable with 0 and the success level with 1, we can create a scatter plot, considering a continuous explanatory variable. Figure 13 shows bmi as the explanatory variable and heart_disease as the binary outcome variable. Note that the outcome variable can only take on values of 0 and 1 in the scatter plot.

express.scatter(
  data_frame=stroke,
  x='bmi',
  y=stroke.heart_disease.replace(
    {
      'No':0,
      'Yes':1
    }
  ),
  trendline='ols',
  labels={
    'bmi':'Body mass index',
    'heart_disease':'Heart disease'
  }
)

Figure 13: Body mass index versus presence of heart disease

The line (model) that was fitted to the data is nearly flat (little linear association). In other cases we might find a line (model) with more of a slope, and such a line extends below 0 and above 1. Such values are nonsensical for a binary outcome.

Instead, we can view the closed interval from 0 to 1 as the probability of success. In this sense, we make use of a link function. A link function links the random component (the binary variable) with the linear component. The link function in the case of a logistic regression is the logit function. The model that we create for n explanatory variables is shown in Equation 33, where \hat{p} = P \left( Y=1 \, \vert \, x_{1},x_{2}, \ldots , x_{n} \right) (the probability of success given values for the explanatory variables).

\log{\left( \frac{\hat{p}}{1 - \hat{p}} \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \ldots + \hat{\beta}_{n} x_{n} \tag{33}

The left-hand side of Equation 33 is the log odds of success. Through simple algebra, we can derive a function \hat{p} \left( \hat{\beta}_{0},\hat{\beta}_{1}, \ldots , \hat{\beta}_{n} \right), shown in Equation 34.

\hat{p} \left( \hat{\beta}_{0},\hat{\beta}_{1}, \ldots , \hat{\beta}_{n} \right) = \frac{e^{\hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \ldots + \hat{\beta}_{n} x_{n}}}{1 + e^{\hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \ldots + \hat{\beta}_{n} x_{n}}} \tag{34}

In the example for this section, we consider the binary outcome variable heart_disease. We see the frequency of the two levels of this variable below.

stroke.heart_disease.value_counts()
heart_disease
No     3254
Yes     227
Name: count, dtype: int64

Figure 14 shows the distribution of body mass index between those with and without heart disease. We add a new column to the dataframe object and encode 0 as No and 1 as Yes for the two levels of the heart_disease variable, using the replace method and a Python dictionary object.

stroke['heart_disease_class'] = stroke.heart_disease.replace(
  {
  0:'No',
  1:'Yes'
}
)

express.box(
  data_frame=stroke,
  x='heart_disease_class',
  y='bmi'
).update_layout(
  xaxis={'title':'Heart Disease'},
  yaxis={'title':'BMI'}
)

Figure 14: Distribution of body mass index for those with and without heart disease

The statsmodels logit function is used to generate a logistic regression model.

stroke['heart_disease_vals'] = stroke.heart_disease.replace(
  {
  'No':0,
  'Yes':1
}
)
heart_disease_bmi_model = logit(
  formula='heart_disease_vals ~ bmi',
  data=stroke
).fit()

heart_disease_bmi_model.summary()
Optimization terminated successfully.
         Current function value: 0.240937
         Iterations 7
Logit Regression Results
Dep. Variable: heart_disease_vals No. Observations: 3481
Model: Logit Df Residuals: 3479
Method: MLE Df Model: 1
Date: Tue, 05 Mar 2024 Pseudo R-squ.: 0.0005572
Time: 14:49:13 Log-Likelihood: -838.70
converged: True LL-Null: -839.17
Covariance Type: nonrobust LLR p-value: 0.3335
coef std err z P>|z| [0.025 0.975]
Intercept -2.9793 0.335 -8.893 0.000 -3.636 -2.323
bmi 0.0106 0.011 0.971 0.331 -0.011 0.032

Equation 35 shows the model based on the results above, where P \left( Y = 1 \, \vert \, \text{bmi} \right) is the probability of success (presence of heart disease) given a value for body mass index.

\begin{align} &\log{\left( \frac{\hat{P} \left( Y=1 \, \vert \, x \right)}{1 - \hat{P} \left( Y=1 \, \vert \, x \right)} \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\ \\ &\log{\left( \frac{\hat{P} \left( Y=1 \, \vert \, x \right)}{1 - \hat{P} \left( Y=1 \, \vert \, x \right)} \right)} = -2.9793 + 0.0106 x \\ \\ &\hat{P} \left( Y=1 \, \vert \, x \right) = \frac{e^{ -2.9793 + 0.0106 x}}{1 + e^{ -2.9793 + 0.0106 x}} \end{align} \tag{35}
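
As an illustration, we can use Equation 35 to estimate the probability of heart disease for a hypothetical body mass index of 30 (the value 30 is chosen purely for illustration).

# Estimated probability of heart disease for a hypothetical body mass index of 30 (Equation 35)
log_odds = -2.9793 + 0.0106 * 30
numpy.exp(log_odds) / (1 + numpy.exp(log_odds))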

We can use simple Python code to generate a table of results. Because the coefficients are on the log odds scale, we exponentiate them and their confidence limits to obtain odds ratios.

odds_ratios = pandas.DataFrame(
    {
        "OR": heart_disease_bmi_model.params,
        "Lower CI": heart_disease_bmi_model.conf_int()[0],
        "Upper CI": heart_disease_bmi_model.conf_int()[1],
    }
)
odds_ratios = numpy.exp(odds_ratios)

odds_ratios
OR Lower CI Upper CI
Intercept 0.050827 0.026359 0.098008
bmi 1.010615 0.989313 1.032375

We can also use a categorical variable as the explanatory variable. The hypertension variable is a binary variable. We can make use of the functionality built into the formula argument. The C below indicates that hypertension is a categorical variable. The Treatment argument specifies the reference level; setting reference=0 selects the first level, No (no hypertension), as the reference in this case.

heart_disease_hypertension_model = logit(
  formula='heart_disease_vals ~ C(hypertension, Treatment(reference=0))',
  data=stroke
).fit()

heart_disease_hypertension_model.summary()
Optimization terminated successfully.
         Current function value: 0.236201
         Iterations 7
Logit Regression Results
Dep. Variable: heart_disease_vals No. Observations: 3481
Model: Logit Df Residuals: 3479
Method: MLE Df Model: 1
Date: Tue, 05 Mar 2024 Pseudo R-squ.: 0.02021
Time: 14:49:13 Log-Likelihood: -822.21
converged: True LL-Null: -839.17
Covariance Type: nonrobust LLR p-value: 5.763e-09
coef std err z P>|z| [0.025 0.975]
Intercept -2.8430 0.079 -35.821 0.000 -2.999 -2.687
C(hypertension, Treatment(reference=0))[T.Yes] 1.0070 0.161 6.251 0.000 0.691 1.323

Equation 36 shows the model.

\begin{align} &\log{\left( \frac{\hat{P} \left( Y=1 \, \vert \, x \right)}{1 - \hat{P} \left( Y=1 \, \vert \, x \right)} \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x \\ \\ &\log{\left( \frac{\hat{P} \left( Y=1 \, \vert \, x \right)}{1 - \hat{P} \left( Y=1 \, \vert \, x \right)} \right)} = -2.8430 + 1.0070 x \\ \\ &\hat{P} \left( Y=1 \, \vert \, x \right) = \frac{e^{ -2.8430 + 1.0070 x}}{1 + e^{ -2.8430 + 1.0070 x}} \end{align} \tag{36}

Using this model, we can only calculate two values for the log odds of success, as x can only take a value of 0 (no hypertension) or 1 (hypertension). We print the full results below, remembering that we have to exponentiate the coefficients to obtain odds ratios.

odds_ratios = pandas.DataFrame(
    {
        "OR": heart_disease_hypertension_model.params,
        "Lower CI": heart_disease_hypertension_model.conf_int()[0],
        "Upper CI": heart_disease_hypertension_model.conf_int()[1],
    }
)
odds_ratios = numpy.exp(odds_ratios)

odds_ratios
OR Lower CI Upper CI
Intercept 0.058252 0.049861 0.068057
C(hypertension, Treatment(reference=0))[T.Yes] 2.737387 1.996244 3.753695

The odds of heart disease are 2.74 times as high for those with hypertension compared to those without hypertension (95\% CI 2.00-3.75, p < 0.01). The presence of hypertension increases the odds of heart disease by approximately 174\%.
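
For completeness, the two fitted probabilities implied by Equation 36, for those without (x=0) and with (x=1) hypertension, can be recovered directly; a minimal sketch using the coefficients above.

# Estimated probability of heart disease without and with hypertension (Equation 36)
p_no = numpy.exp(-2.8430) / (1 + numpy.exp(-2.8430))
p_yes = numpy.exp(-2.8430 + 1.0070) / (1 + numpy.exp(-2.8430 + 1.0070))
p_no, p_yes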

The smoking_status variable has three levels. Below, we create a logistic regression model. Using the same syntax notation as before, we select those that have never smoked as the reference level.

heart_disease_smoking_model = logit(
  formula='heart_disease_vals ~ C(smoking_status, Treatment(reference="never smoked"))',
  data=stroke
).fit()

heart_disease_smoking_model.summary()
Optimization terminated successfully.
         Current function value: 0.238400
         Iterations 7
Logit Regression Results
Dep. Variable: heart_disease_vals No. Observations: 3481
Model: Logit Df Residuals: 3478
Method: MLE Df Model: 2
Date: Tue, 05 Mar 2024 Pseudo R-squ.: 0.01108
Time: 14:49:13 Log-Likelihood: -829.87
converged: True LL-Null: -839.17
Covariance Type: nonrobust LLR p-value: 9.140e-05
coef std err z P>|z| [0.025 0.975]
Intercept -2.9782 0.109 -27.407 0.000 -3.191 -2.765
C(smoking_status, Treatment(reference="never smoked"))[T.formerly smoked] 0.6499 0.161 4.026 0.000 0.334 0.966
C(smoking_status, Treatment(reference="never smoked"))[T.smokes] 0.5168 0.172 3.004 0.003 0.180 0.854

Equation 37 shows the model for the summary results above.

\begin{align} &\log{\left( \frac{\hat{P} \left( Y=1 \, \vert \, x_{1},x_{2} \right)}{1 - \hat{P} \left( Y=1 \, \vert \, x_{1},x_{2} \right)} \right)} = \hat{\beta}_{0} + \hat{\beta}_{1} x_{1} + \hat{\beta}_{2} x_{2} \\ \\ &\log{\left( \frac{\hat{P} \left( Y=1 \, \vert \, x_{1},x_{2} \right)}{1 - \hat{P} \left( Y=1 \, \vert \, x_{1},x_{2} \right)} \right)} = -2.9782 + \\ \\ &\;\;\;0.6499 x_{1} + 0.5168 x_{2} \\ \\ &\hat{P} \left( Y=1 \, \vert \, x_{1},x_{2} \right) = \frac{e^{ -2.9782 + 0.6499 x_{1} + 0.5168 x_{2}}}{1 + e^{ -2.9782 + 0.6499 x_{1} + 0.5168 x_{2}}} \end{align} \tag{37}

We cannot simply look at the results of the Wald test (the z statistic and p value) for each of the dummy variables. Our task remains to determine if there is an association between smoking status and heart disease. To do this, we conduct a likelihood ratio test, comparing the model with the smoking_status variable to a nested model without smoking_status. Since smoking_status is the only explanatory variable in what we will term the full model, the only nested model that we can create is the intercept-only model.

heart_disease_intercept_model = logit(
  formula='heart_disease_vals ~ 1',
  data=stroke
).fit()

heart_disease_intercept_model.summary()
Optimization terminated successfully.
         Current function value: 0.241072
         Iterations 7
Logit Regression Results
Dep. Variable: heart_disease_vals No. Observations: 3481
Model: Logit Df Residuals: 3480
Method: MLE Df Model: 0
Date: Tue, 05 Mar 2024 Pseudo R-squ.: 9.702e-13
Time: 14:49:13 Log-Likelihood: -839.17
converged: True LL-Null: -839.17
Covariance Type: nonrobust LLR p-value: nan
coef std err z P>|z| [0.025 0.975]
Intercept -2.6627 0.069 -38.787 0.000 -2.797 -2.528

Equation 38 shows the equation of the likelihood ratio statistic, G^{2}, where ll is the log likelihood.

G^{2} = -2 \left( \text{ll of nested model} - \text{ll of full model} \right) \tag{38}

The llf attribute returns the log likelihood of a model.

G2 = -2*(
  heart_disease_intercept_model.llf - heart_disease_smoking_model.llf
)

G2
18.600523264298317

The G^{2} test statistic follows a \chi^{2} distribution. The degrees of freedom is the difference in the number of parameters between the full model and the nested model, in this case 3-1=2. The chi2.sf function determines the p value for the G^{2} statistic.

stats.chi2.sf(G2, 2)
9.140031508879114e-05
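
The fitted statsmodels model stores this likelihood ratio test as well; assuming the llr and llr_pvalue attributes of the results object, we can confirm the values above (compare with the LLR p-value in the model summary).

# Likelihood ratio statistic and p value stored on the fitted model
heart_disease_smoking_model.llr, heart_disease_smoking_model.llr_pvalue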

We note that p<\alpha and state that there is enough evidence in the data at the 5\% level of significance to show that there is an association between smoking_status and heart_disease. We can now consider each of the dummy variables individually.

odds_ratios = pandas.DataFrame(
    {
        "OR": heart_disease_smoking_model.params,
        "Lower CI": heart_disease_smoking_model.conf_int()[0],
        "Upper CI": heart_disease_smoking_model.conf_int()[1],
    }
)
odds_ratios = numpy.exp(odds_ratios)

odds_ratios
OR Lower CI Upper CI
Intercept 0.050886 0.041125 0.062964
C(smoking_status, Treatment(reference="never smoked"))[T.formerly smoked] 1.915417 1.395891 2.628302
C(smoking_status, Treatment(reference="never smoked"))[T.smokes] 1.676577 1.196680 2.348925

The odds of heart disease are 1.92 times as high for those who formerly smoked compared to those who never smoked (95\% CI 1.40-2.63, \, p < 0.01). In other words, the odds of heart disease are about 92\% higher for those who formerly smoked compared to those who never smoked.

The odds of heart disease are 1.68 times as high for those who currently smoke compared to those who never smoked (95\% CI 1.20-2.35, \, p < 0.01). In other words, the odds of heart disease are about 68\% higher for those who currently smoke compared to those who never smoked.

Conclusion

We can conduct many useful and commonly used statistical tests and data analyses in Python. Python is the leading language for data science, and it is worthwhile to take the time to learn how to use Python for biostatistics.

Learning to use Python in this setting opens the door to more advanced analysis techniques such as machine learning.