Predictive Data Analysis for Monitoring Diabetes in Healthcare

Author: Balakumar Janakiraman

Course: IT6513, EHR Systems & Applications

Institution: Kennesaw State University, Georgia, USA

Dataset: PIMA Indian Diabetes (768 records, 9 fields)

Abstract

Data analytics plays an important role in healthcare by helping to predict and manage chronic diseases. Electronic health record data, combined with the PIMA Indian Diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and datasets from the Centers for Disease Control and Prevention (CDC), is analyzed using machine learning classification models to predict diabetes outcomes and identify key risk factors.

Table of Contents

Introduction
Supervised and Unsupervised Learning
Data Mining — PIMA Indians Diabetes Dataset
Machine Learning Models for Classification
Visualization Using Seaborn
Obtaining Risk Factors for Diabetes Complications
Conclusion
References

Dataset Records: 768

Best Model AUC: 80%

CDC Dataset Rows: 29K+

1. Introduction

According to the Centers for Disease Control and Prevention, "Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy." [2] Over time, diabetes can lead to heart disease, kidney disease, and vision problems.

In this paper, the PIMA Indian Diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases is analyzed with classification algorithms that predict the target variable, the diabetes outcome. Datasets from the CDC are also analyzed to identify risk factors and complications for diabetes.

2. Supervised and Unsupervised Learning

Supervised learning is the task of inferring a function from labeled training data. Unsupervised learning is the task of inferring structure from unlabeled data.

Fig. 1 — Machine Learning taxonomy: Supervised Learning (Regression, Classification) and Unsupervised Learning (Association rule mining, Clustering)

2.1 Classification Algorithm

The objective is to predict a discrete target variable. Binary classification determines whether a disease is present or not. Multiclass classification assigns observations to one of more than two classes. The classification algorithm uses a set of pre-classified examples to develop a classification model.

Fig. 2 — Classification Algorithm

In this project, the classification models are trained on the training dataset, which includes the input variables (X) and the output variable (Y). Each trained model is then used to make predictions on the test dataset.

3. Data Mining — PIMA Indians Diabetes Dataset

The PIMA Indians Diabetes dataset is loaded and analyzed with the following Python 3.6 code:

Fig. 3 — Python 3.6 code loading and reading the PIMA Indians Diabetes dataset

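The dataset provides diagnostic measurements associated with diabetes. It includes 768 records and 9 fields. Most fields are stored as 64-bit integers (int64); BMI and Diabetes Pedigree Function hold decimal values and are stored as 64-bit floats (float64).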
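Because the loading code survives here only as a figure, the following is a minimal sketch of how such a load might look with pandas. The column names, the helper name `load_pima`, and the two inline sample rows are assumptions for illustration; a real run would pass the path of the downloaded CSV file instead of the in-memory text.

```python
import io

import pandas as pd

# Assumed column names (not confirmed by the figure); adjust to the actual file.
COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]


def load_pima(source):
    """Read the dataset from a file path or file-like object (hypothetical helper)."""
    df = pd.read_csv(source)
    missing = set(COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return df[COLUMNS]


# Two illustrative rows stand in for the full 768-record file.
sample_csv = io.StringIO(
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
df = load_pima(sample_csv)

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each field
```

On the full file, `df.shape` would be expected to report the 768 records and 9 fields described above.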

Variables (eight inputs and one target):

1. Pregnancies — Numeric
2. Glucose — Numeric
3. Blood Pressure — Numeric
4. Skin Thickness — Numeric
5. Insulin — Numeric
6. BMI — Numeric
7. Diabetes Pedigree Function — Numeric
8. Age — Numeric
9. Outcome — Numeric (Target Variable: 0 = No Diabetes, 1 = Diabetes)

3.1 Data Exploration and Visualization Using Correlation

Correlation is a statistic that measures the degree to which two variables move in relation to each other. The correlation coefficient has a value that must be between -1.0 and +1.0. A correlation coefficient of +1 indicates a perfect positive correlation, while a value of -1.0 indicates a perfect negative correlation.

The correlation matrix of the independent variables is plotted as a heat map, and the code that produces it is shown below:

Fig. 4 — Correlation matrix heat map showing relationships between all input variables in the PIMA dataset

Fig. 5 — Correlation Analysis Code

The .corr() method can be used to obtain the correlation matrix showing the degree of relationship between each pair of variables.
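As a rough illustration of the .corr() call, the sketch below builds a small synthetic DataFrame and computes its correlation matrix. The data and column relationships are invented for the example and are not taken from the PIMA dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data (assumption): Insulin is constructed to correlate with Glucose.
glucose = rng.normal(120, 30, size=200)
df = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.5 * glucose + rng.normal(0, 20, size=200),
    "Age": rng.integers(21, 80, size=200),
})

# DataFrame.corr() returns pairwise Pearson correlations by default.
corr = df.corr()
print(corr.round(2))
```

The resulting matrix is symmetric with ones on the diagonal, and it could then be handed to a plotting function such as seaborn's `heatmap` to produce a figure like the one above.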

3.2 Data Preparation for Classification Models

The independent variables are grouped into the X data frame, and the target variable “Outcome” is stored in the Y series. The first few rows are shown below:

Fig. 6 — Python code grouping independent variables into X dataframe and target outcome variable into Y series with first rows output
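One way the grouping could be written is sketched below. The three-row frame and its column names are assumptions used only to keep the example self-contained.

```python
import pandas as pd

# Tiny illustrative frame (assumption); the real data has 768 rows.
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8],
    "Glucose": [148, 85, 183],
    "BloodPressure": [72, 66, 64],
    "SkinThickness": [35, 29, 0],
    "Insulin": [0, 0, 0],
    "BMI": [33.6, 26.6, 23.3],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

X = df.drop(columns=["Outcome"])  # the eight input variables
y = df["Outcome"]                 # the target variable as a Series

print(X.head())
print(y.head())
```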

3.3 Train / Test & Validate

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. A dataset is commonly split as follows:

Train: Use a large (~60%) sample of data to determine model coefficients
Validate: Use a smaller (~20%) sample to assess performance and correct model coefficients
Test: Use a different (~20%) sample to assess how the model will perform on unseen data

Scikit-learn (sklearn) is a Python machine learning library. Its train/test split utility is used to split the dataset:

Fig. 7 — Python Scikit-learn code implementing Train/Test Split on the PIMA dataset

The split places 576 of the 768 records (75%) in the training set and 192 (25%) in the test set. The dimensions are:

X_train: 576 rows and 8 columns of input variables
X_test: 192 rows and 8 columns of input variables
Y_train: 576 rows of the target variable OUTCOME
Y_test: 192 rows of the target variable OUTCOME
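The dimensions above can be reproduced with a sketch like the following. The 25% test fraction is inferred from the 576/192 row counts rather than read from the original code, and the random arrays merely stand in for the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Stand-in data (assumption): 768 rows, 8 input columns, binary target.
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# test_size=0.25 is an assumption that matches the reported 576/192 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (576, 8) (192, 8) (576,) (192,)
```

Fixing `random_state` makes the split repeatable, which helps when comparing several models on the same rows.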

4. Machine Learning Models for Classification

4.1 Dummy Classifier

"Dummy Classifier is a classifier that makes predictions using simple rules. The most_frequent strategy always predicts the most frequent label in the training set."

— Scikit-learn Documentation [17]

A Dummy Classifier using this strategy is trained as a baseline against which the other classification models are compared. The code and results are shown below:

Fig. 8 — Python code applying Dummy Classifier as baseline for classification algorithms

Fig. 9 — Dummy Classifier accuracy score output showing 0.6320 baseline accuracy

The accuracy score for Dummy Classifier is 0.6320 (63.20%). Any model performing below this score is worse than a naive prediction.
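To show why the baseline score equals the share of the majority class, here is a small sketch with scikit-learn's DummyClassifier. The 65/35 class balance and the random features are assumptions chosen for the example, not the PIMA values.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(0)

# Illustrative data (assumption): 65 negatives and 35 positives.
X = rng.normal(size=(100, 8))
y = np.array([0] * 65 + [1] * 35)

# most_frequent ignores X and always predicts the majority label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

predictions = baseline.predict(X)
accuracy = baseline.score(X, y)
print(f"Baseline accuracy: {accuracy:.4f}")
```

Because every prediction is the majority label, the accuracy is simply the majority class's share of the records being scored.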

4.2 Logistic Regression

A logistic regression model is trained on the training data. The code for the logistic regression model is shown below:

Fig. 10 — Python code training a Logistic Regression model on the PIMA training dataset

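The model's predictions are evaluated with a confusion matrix, from which the accuracy, recall, and precision scores are calculated. The confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. In the formulas below, TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives: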

Confusion Matrix — Performance Metrics
Metric	Formula
Overall Accuracy	(TN + TP) / Total Observations
Overall Error Rate	(FN + FP) / Total Observations
Recall (TPR)	TP / (FN + TP)
Precision	TP / (TP + FP)
False Negative Rate (FNR)	FN / (FN + TP) = 1 − TPR
True Negative Rate (TNR)	TN / (TN + FP)
False Positive Rate (FPR)	FP / (TN + FP) = 1 − TNR
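The formulas in the table translate directly into code. The sketch below uses a hypothetical helper, `confusion_metrics`, and made-up counts purely to illustrate the arithmetic.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Compute the table's metrics from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "error_rate": (fn + fp) / total,
        "recall": tp / (fn + tp),      # true positive rate
        "precision": tp / (tp + fp),
        "fnr": fn / (fn + tp),         # equals 1 - recall
        "tnr": tn / (tn + fp),
        "fpr": fp / (tn + fp),         # equals 1 - tnr
    }


# Made-up counts for illustration only.
metrics = confusion_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, 85 of 100 observations are classified correctly, so the accuracy is 0.85 and the error rate is 0.15.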

Predictions are made on the training and test data, and the first 6 values of each are printed below:

Fig. 11 — First 6 prediction values from Logistic Regression on training and test data

The accuracy, recall, and precision scores on the training data are shown below:

Fig. 12 — Accuracy, Recall and Precision score results for Logistic Regression on training data
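A minimal sketch of this train, predict, and score sequence is given below. It runs on synthetic data from scikit-learn's `make_classification`, so the scores it prints are not the paper's results; `max_iter=1000` is likewise an assumption to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 768 rows, 8 features, binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
print("First 6 train predictions:", train_pred[:6])
print("First 6 test predictions: ", test_pred[:6])

train_accuracy = accuracy_score(y_train, train_pred)
train_recall = recall_score(y_train, train_pred)
train_precision = precision_score(y_train, train_pred)
print(f"accuracy={train_accuracy:.4f} recall={train_recall:.4f} "
      f"precision={train_precision:.4f}")
```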

4.3 ROC Curves and AUC Values

A Receiver Operating Characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The code below predicts probabilities on the test data and prints the first 5 probabilities of “Outcome”:

Fig. 13 — Python code predicting probabilities on test data for ROC curve generation

ROC curves and AUC Values are shown below:

Fig. 14 — ROC Curves showing AUC scores for Logistic Regression models

Fig. 15 — Area under blue and red ROC curves showing AUC score comparison

The AUC score is the area under an ROC curve, shown here for the blue and red curves. A higher AUC score indicates a model that separates the two classes better.
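The probability prediction and AUC calculation might be sketched as follows, again on synthetic data rather than the PIMA records, so the printed AUC is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 768 rows, 8 features, binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class (1).
probs = model.predict_proba(X_test)[:, 1]
print("First 5 probabilities:", probs[:5])

# One (FPR, TPR) point per threshold; plotting tpr against fpr gives the curve.
fpr, tpr, thresholds = roc_curve(y_test, probs)

auc_score = roc_auc_score(y_test, probs)
area = auc(fpr, tpr)  # trapezoidal area under the same curve
print(f"AUC: {auc_score:.4f}")
```

The two area calculations agree, which is the sense in which the AUC score is "the area under the curve".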

4.4 Random Forest

Random forest is an ensemble classification method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes of the individual trees. The Random Forest code is shown below:

Fig. 16 — Python code implementing Random Forest ensemble classifier on the PIMA dataset

The accuracy, recall, precision, and AUC scores on the test data for Random Forest are shown below:

Fig. 17 — Random Forest accuracy, recall, precision and AUC score results on test data
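For orientation, a Random Forest run of this kind could look like the sketch below. The data is synthetic, and `n_estimators=100` and the fixed seed are assumptions rather than the settings used in the figure.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 768 rows, 8 features, binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# An ensemble of 100 decision trees (assumed setting).
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

test_pred = forest.predict(X_test)
test_probs = forest.predict_proba(X_test)[:, 1]

scores = {
    "accuracy": accuracy_score(y_test, test_pred),
    "recall": recall_score(y_test, test_pred),
    "precision": precision_score(y_test, test_pred),
    "auc": roc_auc_score(y_test, test_probs),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Note that scikit-learn's implementation combines the trees by averaging their predicted class probabilities rather than taking a strict majority vote, which in most cases leads to the same predicted class.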

4.5 Decision Tree

A decision tree predicts the value of a target variable by learning simple decision rules from the input variables. The target can be categorical (classification) or continuous (regression). The model is created by importing the decision tree classifier:

Fig. 18 — Python code for Decision Tree classifier implementation and results

Decision Tree results on Test Data:

Accuracy of test data: 70%
Recall of test data: 57%
Precision for test data: 61%
AUC score for test data: 60%
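A decision tree classifier can be fitted and scored in much the same way. In this sketch the data is synthetic, and `max_depth=3` is an assumed setting added to keep the tree small; the figure's actual parameters are not known.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 768 rows, 8 features, binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_depth limits how many successive decision rules the tree may learn.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

test_pred = tree.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)
test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
print(f"accuracy={test_accuracy:.4f} auc={test_auc:.4f}")
```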

4.6 Logistic Regression with Polynomial Features

The logistic regression model with polynomial features, built by importing polynomial features, is shown below:

Fig. 19 — Python code for Logistic Regression with Polynomial Features implementation

The AUC score for logistic regression with polynomial features is 67%.
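One possible way to combine polynomial features with logistic regression is a scikit-learn pipeline, sketched below on synthetic data. The degree of 2, the added scaling step, and `max_iter=2000` are all assumptions, not the configuration behind the reported score.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): 768 rows, 8 features, binary target.
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Degree-2 expansion adds squares and pairwise products of the 8 inputs.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=2000),
)
poly_model.fit(X_train, y_train)

n_features = poly_model.named_steps["polynomialfeatures"].n_output_features_
poly_auc = roc_auc_score(y_test, poly_model.predict_proba(X_test)[:, 1])
print(f"expanded features: {n_features}, AUC: {poly_auc:.4f}")
```

With 8 inputs, the degree-2 expansion produces 44 features (8 original, 8 squares, and 28 pairwise products), so the model is considerably more flexible and can overfit a dataset of this size.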

Comparison of results of all the techniques is shown below:

Fig. 20 — Comparison of results showing AUC scores for all classification techniques

Model Performance Summary
Model	AUC Score	Assessment
Random Forest	80% (0.8055)	Best Model
Logistic Regression	80% (0.8014)	Excellent
Logistic + Polynomial	67% (0.6783)	Good
Decision Tree	60% (0.6813)	Good
Dummy Classifier (Baseline)	63% (0.6320, accuracy score)	Baseline

5. Visualization of Data Using Seaborn

Fig. 21 — Python code generating Seaborn pair plot visualization for the PIMA dataset

Fig. 22 — Seaborn pair plot showing variable distributions and pairwise relationships. Blue = No Diabetes (0), Orange = Diabetes (1)

In the pair plot, the observations labeled as 0 are shown in blue and 1 in orange. Inspection of the graph suggests that observations can be classified by several variables. The pair plots show the distribution of each variable and the relationships between pairs of variables.

6. Obtaining Risk Factors for Diabetes-Related Complications

Two diabetes datasets from the Centers for Disease Control and Prevention (CDC) are used. One is the national data in CSV format, which gives national-level statistics. The other is a lookup table that provides detailed risk factor information for diabetes complications.

The Python code below reads the two datasets (National_Data.csv and lookup_table.xlsx) and forms a combined dataset, df.csv:

Fig. 23 — Python code reading and merging two CDC diabetes datasets: National_Data.csv and lookup_table.xlsx into combined df.csv
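The merge step might resemble the following sketch with pandas. The join key `RiskFactorID`, the other column names, and all values are hypothetical; small in-memory frames stand in for the two files, which in practice would be read with `pd.read_csv` and `pd.read_excel`.

```python
import pandas as pd

# Stand-in for National_Data.csv (hypothetical columns and values).
national = pd.DataFrame({
    "RiskFactorID": [1, 2, 2, 3],
    "GenderID": [1, 2, 1, 2],
    "Age": [45, 60, 52, 38],
})

# Stand-in for lookup_table.xlsx (hypothetical columns and values).
lookup = pd.DataFrame({
    "RiskFactorID": [1, 2, 3],
    "RiskFactor": ["Obesity", "Hypertension", "Smoking"],
})

# A left join keeps every national row and attaches the matching description.
combined = national.merge(lookup, on="RiskFactorID", how="left")

print(combined.shape)
print(combined.head())
```

The combined frame could then be written out with `combined.to_csv("df.csv", index=False)` to produce the merged file described above.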

The screenshots below show the result of merging the two CDC datasets into a single, more meaningful dataset:

Fig. 24 — Results of merged CDC datasets showing Risk factors, GenderID, Age and related fields

Fig. 25 — Combined CDC dataset structure showing all columns and data types

The combined dataset has 29,102 rows and 18 columns. The data shape is shown below:

Fig. 26 — Output showing merged CDC dataset shape of 29,102 rows and 18 columns

The 2000th row, picked as an arbitrary example, is shown below with its risk factors, GenderID, Age, and other fields:

Fig. 27 — Randomly located 2000th row showing Risk factors, GenderID, Age and other variables

6.1 The Risk Factor for Diabetes Disease

Fig. 28 — Analysis output showing primary risk factors for diabetes disease from CDC dataset

6.2 The Risk Factor Complication for Diabetes Disease

Fig. 29 — Risk factor complications for diabetes-related conditions including cardiovascular, kidney and vision complications

7. Conclusion

This predictive analysis illustrates how classification models such as Logistic Regression, Decision Tree, and Random Forest can predict diseases like diabetes with reasonable accuracy. These models learn from the labeled training dataset and apply what they have learned to make predictions.

This project shows how machine learning models can use input variables such as pregnancies, blood pressure, insulin, BMI, glucose, skin thickness, diabetes pedigree function, and age to determine whether a person is diabetic or not.

Visualizing the dataset supports the classification modeling by showing how the variables relate to one another before the models are trained, validated, and tested. These models yield data-driven insights that can be used to proactively manage diabetes in healthcare settings.

References

[1] What is predictive analytics? TechTarget. searchbusinessanalytics.techtarget.com
[2] Centers for Disease Control and Prevention. Diabetes Basics. (2021). cdc.gov/diabetes/basics
[3] Hayes, A. Correlation. Investopedia. (2021). investopedia.com
[4] Confusion matrix. Wikipedia. (2021). en.m.wikipedia.org/wiki/Confusion_matrix
[5] Training, validation, and test sets. Wikipedia. (2021). en.m.wikipedia.org/wiki
[6] scikit-learn. scikit-learn.org
[7] Logistic Regression 3-class Classifier. scikit-learn. scikit-learn.org
[8] Random forest. Wikipedia. (2021). en.m.wikipedia.org/wiki/Random_forest
[9] Receiver operating characteristic. Wikipedia. (2021). en.m.wikipedia.org
[10] Palle, S. Big Data & Data Analytics Program. Emory Continuing Education. (2019).
[11] Bresnick, J. 10 High-Value Use Cases for Predictive Analytics in Healthcare. HealthITAnalytics. (2019). healthitanalytics.com
[12] Division of Diabetes Translation. CDC. (2021). cdc.gov/diabetes
[13] Pima Indians Diabetes Database. Kaggle. (2016). kaggle.com
[14] Decision Trees. scikit-learn. scikit-learn.org/stable/modules/tree.html
[15] Python Decision Tree Classification with Scikit-Learn. DataCamp. datacamp.com
[16] Supervised vs Unsupervised Learning. Guru99. guru99.com
[17] sklearn.dummy.DummyClassifier. scikit-learn. scikit-learn.org
[18] Johnson, C. Linear Regression with Python. (2014). connor-johnson.com
[19] Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier. (2005).