Machine Learning · Healthcare Analytics · Python

Predictive Data Analysis for Monitoring Diabetes in Healthcare

📄 Term Project Report · IT6513 — EHR Systems & Applications · Kennesaw State University
Author: Balakumar Janakiraman
Course: IT6513 — EHR Systems & Applications
Institution: Kennesaw State University, Georgia, USA
Dataset: PIMA Indian Diabetes — 768 records, 9 fields
Abstract

Data analytics plays an important role in healthcare today in predicting and managing chronic diseases. In this report, electronic health record data, combined with the PIMA Indian Diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and datasets from the Centers for Disease Control and Prevention (CDC), is analyzed using machine learning classification models to predict diabetes outcomes and identify key risk factors.

Table of Contents
  1. Introduction
  2. Supervised and Unsupervised Learning
  3. Data Mining — PIMA Indians Diabetes Dataset
  4. Machine Learning Models for Classification
  5. Visualization Using Seaborn
  6. Obtaining Risk Factors for Diabetes Complications
  7. Conclusion
  8. References
Dataset Records: 768
Best Model AUC: 80%
ML Models Tested: 4
CDC Dataset Rows: 29K+

1. Introduction

"Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy." [2] Over time, diabetes can lead to heart disease, kidney disease, and vision problems.

— Centers for Disease Control and Prevention

In this paper, the PIMA Indian Diabetes dataset is analyzed with classification algorithms that learn to predict the target variable (the diabetes outcome) from the input variables. CDC datasets are also analyzed to identify risk factors for diabetes and its complications.

2. Supervised and Unsupervised Learning

Supervised learning is the task of inferring functions from labeled training data. Unsupervised learning is the task of inferring functions from unlabeled data.

Fig. 1 — Machine Learning taxonomy: Supervised vs Unsupervised Learning
  Supervised Learning (Predictive)
    Regression: predict continuous value
    Classification: predict class label
  Unsupervised Learning (Descriptive)
    Association rule mining: find frequent patterns
    Clustering: group similar data

2.1 Classification Algorithm

The objective of classification is to predict a discrete target variable. Binary classification determines whether or not a disease is present. Multiclass classification assigns each observation to one of more than two classes. A classification algorithm uses a set of pre-classified examples to build a classification model.

Fig. 2 — Classification Algorithm
Fig. 2 — Classification Algorithm overview showing training and prediction workflow

In this project, the classification models are trained on the training dataset, which includes input variables (X) and an output variable (Y). Each trained model is then used to make predictions on the test dataset.

3. Data Mining — PIMA Indians Diabetes Dataset

The PIMA Indians Diabetes dataset is loaded and inspected using the following code in Python 3.6:

Fig. 3 — Python Code Loading PIMA Dataset
Fig. 3 — Python 3.6 code loading and reading the PIMA Indians Diabetes dataset

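The dataset contains diagnostic measurements related to diabetes. It includes 768 records and 9 fields. Most fields are stored as 64-bit integers (int64); in the standard Kaggle CSV, BMI and DiabetesPedigreeFunction are floating-point (float64).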
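The loading step can be sketched roughly as follows. This is a minimal illustration, not the report's original code: the file name `diabetes.csv` is an assumption, and the three rows embedded below are only illustrative rows in the PIMA column layout so that the sketch runs without the file.

```python
import io

import pandas as pd

# Illustrative rows in the PIMA column layout. In practice the data would be
# read from disk, e.g. pd.read_csv("diabetes.csv") (assumed file name).
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type of each field
print(df.head())  # first rows of the data frame
```

On the full file, `df.shape` would be expected to report the 768 records and 9 fields described above.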

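Input variables: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, and Age. The target variable is Outcome.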

3.1 Data Exploration and Visualization Using Correlation

Correlation is a statistic that measures the degree to which two variables move in relation to each other. The correlation coefficient has a value that must be between -1.0 and +1.0. A correlation coefficient of +1 indicates a perfect positive correlation, while a value of -1.0 indicates a perfect negative correlation.

The code below computes the correlation matrix of the independent variables and plots it as a heat map:

Fig. 4 — Correlation Matrix Heat Map
Fig. 4 — Correlation matrix heat map showing relationships between all input variables in the PIMA dataset
Fig. 5 — Correlation Analysis Code

The .corr() method can be used to obtain the correlation matrix showing the degree of relationship between each pair of variables.
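As a small sketch of what `.corr()` returns, the example below uses a hypothetical five-row frame (the values are made up for illustration; only the column names follow the PIMA layout). The resulting matrix is the kind of input a heat map function such as seaborn's `heatmap` would take.

```python
import pandas as pd

# Hypothetical toy values; not taken from the real dataset.
df = pd.DataFrame({
    "Glucose": [85, 100, 120, 148, 183],
    "BMI": [26.6, 28.1, 30.0, 33.6, 23.3],
    "Outcome": [0, 0, 0, 1, 1],
})

corr = df.corr()  # Pearson correlation is the pandas default
print(corr.round(2))

# Correlation of each input variable with the target, strongest first
print(corr["Outcome"].drop("Outcome").sort_values(ascending=False))
```

Every entry lies between -1.0 and +1.0, the diagonal is exactly 1 (each variable against itself), and the matrix is symmetric.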

Fig. 5 — Correlation matrix heat map showing relationships between all input variables in the PIMA dataset

3.2 Data Preparation for Classification Models

The independent variables are grouped into the X data frame and the target variable “Outcome” into the Y series. The first few rows are shown below:

Fig. 6 — X and Y Data Frame Preparation
Fig. 6 — Python code grouping independent variables into X dataframe and target outcome variable into Y series with first rows output
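One way this grouping might look in pandas is sketched below. The four rows are hypothetical and only a subset of the input columns is included, to keep the example short.

```python
import pandas as pd

# Hypothetical rows with a subset of the PIMA columns.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Age": [50, 31, 32, 21],
    "Outcome": [1, 0, 1, 0],
})

X = df.drop(columns=["Outcome"])  # independent variables (data frame)
y = df["Outcome"]                 # target variable (series)

print(X.head())
print(y.head())
```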

3.3 Train / Test & Validate

In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. The dataset is split as follows:

Scikit-learn (sklearn) is a Python machine learning library. It provides the utilities used here to split the dataset into training and test sets:

Fig. 7 — Scikit-learn Train/Test Split Code
Fig. 7 — Python Scikit-learn code implementing Train/Test Split on the PIMA dataset
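A split of this kind could be sketched with scikit-learn's `train_test_split`. The data here is a synthetic stand-in with the same shape as the PIMA inputs (768 rows, 8 features), and the 25% test size and `random_state` are assumptions, since the report's actual ratio is not stated in the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PIMA inputs; not the real dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.25,   # assumed ratio
    random_state=0,   # fixed seed for reproducibility
    stratify=y,       # keep the class balance similar in both parts
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```

With 768 rows and a 25% test size, this yields 576 training rows and 192 test rows.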

The dimensions are:

4. Machine Learning Models for Classification

4.1 Dummy Classifier

"Dummy Classifier is a classifier that makes predictions using simple rules. The most_frequent strategy always predicts the most frequent label in the training set."

— Scikit-learn Documentation [17]

To establish a baseline, a Dummy Classifier is trained with the most_frequent strategy. The code and results are shown below:

Fig. 8 — Dummy Classifier Code
Fig. 8 — Python code applying Dummy Classifier as baseline for classification algorithms
Fig. 9 — Dummy Classifier Accuracy Score
Fig. 9 — Dummy Classifier accuracy score output showing 0.6320 baseline accuracy

The accuracy score for Dummy Classifier is 0.6320 (63.20%). Any model performing below this score is worse than a naive prediction.
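The baseline idea can be sketched as below, assuming synthetic data in place of the PIMA set (so the printed score will not match 0.6320) and an assumed 25% test split. With the most_frequent strategy, the accuracy is simply the share of test rows that belong to the majority class of the training set.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the PIMA data; not the real dataset.
X, y = make_classification(
    n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed split ratio
)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

y_pred = dummy.predict(X_test)  # every prediction is the majority label
baseline = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline:.4f}")
```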

4.2 Logistic Regression

Next, a logistic regression model is trained on the training data. The code for the logistic regression model is shown below:

Fig. 10 — Logistic Regression Code
Fig. 10 — Python code training a Logistic Regression model on the PIMA training dataset
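A minimal sketch of this training step is given below. It uses synthetic stand-in data and an assumed split ratio, and `max_iter=1000` is an illustrative choice rather than a setting taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PIMA data (768 rows, 8 features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed split ratio
)

model = LogisticRegression(max_iter=1000)  # raised limit to help convergence
model.fit(X_train, y_train)

# Class predictions (0 or 1) on both parts of the data
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
print(train_pred[:6])
print(test_pred[:6])
```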

This section explains predictions, the confusion matrix, and how the accuracy, recall, and precision scores are calculated. The confusion matrix is a specific table layout that allows visualization of the performance of an algorithm:

Confusion Matrix — Performance Metrics
Metric                       Formula
Overall Accuracy             (TN + TP) / Total Observations
Overall Error Rate           (FN + FP) / Total Observations
Recall (TPR)                 TP / (FN + TP)
Precision                    TP / (TP + FP)
False Negative Rate (FNR)    FN / (FN + TP) = 1 − TPR
True Negative Rate (TNR)     TN / (TN + FP)
False Positive Rate (FPR)    FP / (TN + FP) = 1 − TNR
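The formulas above can be checked on a small worked example. The ten labels below are hypothetical (1 = diabetes, 0 = no diabetes); the counts are taken from scikit-learn's `confusion_matrix`, and each metric is then computed directly from the table's formulas.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for ten patients.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
error_rate = (fn + fp) / total
recall = tp / (fn + tp)      # true positive rate (TPR)
precision = tp / (tp + fp)
fnr = fn / (fn + tp)         # equals 1 - TPR
tnr = tn / (tn + fp)
fpr = fp / (tn + fp)         # equals 1 - TNR

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.2f} error={error_rate:.2f}")
print(f"recall={recall:.2f} precision={precision:.2f}")
print(f"FNR={fnr:.2f} TNR={tnr:.2f} FPR={fpr:.2f}")
```

Here TP = 3, TN = 4, FP = 1, and FN = 2, so accuracy is 7/10 = 0.70, recall is 3/5 = 0.60, and precision is 3/4 = 0.75.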

Predictions on the training and test data, with the first 6 values printed, are shown below:

Fig. 11 — Logistic Regression — First 6 Prediction Values
Fig. 11 — First 6 prediction values from Logistic Regression on training and test data

The accuracy, recall, and precision scores on the training data are shown below:

Fig. 12 — Logistic Regression — Accuracy, Recall & Precision Scores
Fig. 12 — Accuracy, Recall and Precision score results for Logistic Regression on training data

4.3 ROC Curves and AUC Values

The Receiver Operating Characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds. The code that predicts probabilities on the test data and prints the first 5 probabilities of “Outcome” is shown below:

Fig. 13 — ROC Curve — Probability Prediction Code
Fig. 13 — Python code predicting probabilities on test data for ROC curve generation

ROC curves and AUC Values are shown below:

Fig. 14 — ROC Curves and AUC Values
Fig. 14 — ROC Curves showing AUC scores for Logistic Regression models
Fig. 15 — ROC Curve — AUC Score Comparison
Fig. 15 — Area under blue and red ROC curves showing AUC score comparison

A higher AUC score indicates a better model. The AUC score is the area under the corresponding ROC curve (the blue and red curves in the figure).
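The probability prediction and AUC calculation described in this section might be sketched as follows, again on synthetic stand-in data with an assumed split. The sketch computes the area under the ROC curve in two ways to show that they agree.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PIMA data; scores will differ from the report.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed split ratio
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of the positive class.
probs = model.predict_proba(X_test)[:, 1]
print(probs[:5])

# ROC curve points: one (FPR, TPR) pair per threshold
fpr, tpr, thresholds = roc_curve(y_test, probs)

area = auc(fpr, tpr)                       # trapezoidal area under the curve
auc_score = roc_auc_score(y_test, probs)   # same quantity computed directly
print(f"AUC = {auc_score:.4f}")
```

Plotting `fpr` against `tpr` would give a curve of the kind shown in the figures.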

4.4 Random Forest

Random forest is an ensemble classification method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes of the individual trees. The Random Forest code is shown below:

Fig. 16 — Random Forest Classifier Code
Fig. 16 — Python code implementing Random Forest ensemble classifier on the PIMA dataset

The accuracy, recall, precision, and AUC scores on the test data for Random Forest are shown below:

Fig. 17 — Random Forest — Accuracy, Recall, Precision & AUC Score
Fig. 17 — Random Forest accuracy, recall, precision and AUC score results on test data
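A Random Forest evaluation of this kind could look roughly like the sketch below. The data is synthetic, and `n_estimators=100`, the split ratio, and the random seeds are assumptions, so the printed scores are not the report's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the PIMA data.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed split ratio
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 trees
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)              # predicted class labels
probs = rf.predict_proba(X_test)[:, 1]   # positive-class probabilities

accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
auc_score = roc_auc_score(y_test, probs)
print(f"accuracy={accuracy:.4f} recall={recall:.4f}")
print(f"precision={precision:.4f} AUC={auc_score:.4f}")
```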

4.5 Decision Tree

A decision tree predicts the value of a target variable, which may be continuous, discrete, or categorical, by learning simple decision rules from the input features. The model is created by importing the decision tree classifier:

Fig. 18 — Decision Tree Classifier Code
Fig. 18 — Python code for Decision Tree classifier implementation and results
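As a hedged sketch of the same step, the example below fits a decision tree on synthetic stand-in data. Limiting the tree with `max_depth=3` is an illustrative choice to curb overfitting, not a setting taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PIMA data.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed split ratio
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # assumed depth
tree.fit(X_train, y_train)

y_pred = tree.predict(X_test)
probs = tree.predict_proba(X_test)[:, 1]

accuracy = accuracy_score(y_test, y_pred)
auc_score = roc_auc_score(y_test, probs)
print(f"depth={tree.get_depth()} accuracy={accuracy:.4f} AUC={auc_score:.4f}")
```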

Decision Tree results on Test Data:

4.6 Logistic Regression with Polynomial Features

The logistic regression model with polynomial features, built by importing PolynomialFeatures, is shown below:

Fig. 19 — Logistic Regression with Polynomial Features
Fig. 19 — Python code for Logistic Regression with Polynomial Features implementation

The AUC score for logistic regression with polynomial features is 67%.
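One possible way to combine polynomial features with logistic regression is a scikit-learn `Pipeline`, sketched below on synthetic stand-in data. The degree of 2, the added scaling step, and the step names are assumptions for illustration; the report's own code may be organized differently.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the PIMA data (8 input features).
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0  # assumed split ratio
)

pipe = Pipeline([
    # Degree-2 expansion: original features, squares, and pairwise products
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    # Scaling helps the solver converge on the expanded features
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=5000)),
])
pipe.fit(X_train, y_train)

n_terms = pipe.named_steps["poly"].n_output_features_
auc_score = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
print(f"{n_terms} polynomial terms, AUC = {auc_score:.4f}")
```

With 8 inputs, a degree-2 expansion without the bias column produces 44 terms (8 linear, 8 squared, and 28 pairwise products), which makes overfitting easier and may explain a lower test AUC than plain logistic regression.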

Comparison of results of all the techniques is shown below:

Fig. 20 — Comparison of All Classification Models
Fig. 20 — Comparison of results showing AUC scores for all classification techniques
Model Performance Summary
Model                          AUC Score       Rating
Random Forest                  80% (0.8055)    ⭐ Best Model
Logistic Regression            80% (0.8014)    Excellent
Logistic + Polynomial          67% (0.6783)    Good
Decision Tree                  60% (0.6813)    Good
Dummy Classifier (Baseline)    63% (0.6320)    Baseline

5. Visualization of Data Using Seaborn

Fig. 21 — Seaborn Pair Plot Code
Fig. 21 — Python code generating Seaborn pair plot visualization for the PIMA dataset
Fig. 22 — Seaborn Pair Plot Output
Fig. 22 — Seaborn pair plot showing variable distributions and pairwise relationships. Blue = No Diabetes (0), Orange = Diabetes (1)

In the pair plot, the observations labeled as 0 are shown in blue and 1 in orange. Inspection of the graph suggests that observations can be classified by several variables. The pair plots show the distribution of each variable and the relationships between pairs of variables.

6. Obtaining Risk Factors for Diabetes-Related Complications

The data comes from two diabetes datasets published by the Centers for Disease Control and Prevention (CDC). One is the national data in CSV format, which gives national-level statistics; the other is a lookup table providing detailed risk factor information for diabetes complications.

The Python code below reads the two datasets (National_Data.csv and lookup_table.xlsx) and merges them into a combined dataset, df.csv:

Fig. 23 — CDC Dataset Merge Code
Fig. 23 — Python code reading and merging two CDC diabetes datasets: National_Data.csv and lookup_table.xlsx into combined df.csv
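The merge can be illustrated with pandas as sketched below. The real column names of the CDC files are not shown in the text, so the two small frames, the `IndicatorID` join key, and the example risk factor labels are all hypothetical stand-ins for National_Data.csv and lookup_table.xlsx.

```python
import pandas as pd

# Hypothetical stand-in for National_Data.csv (would be pd.read_csv in practice)
national = pd.DataFrame({
    "IndicatorID": [1, 1, 2],
    "GenderID": [1, 2, 1],
    "Age": ["18-44", "45-64", "65+"],
    "Value": [10.2, 14.8, 21.5],
})

# Hypothetical stand-in for lookup_table.xlsx (would be pd.read_excel in practice)
lookup = pd.DataFrame({
    "IndicatorID": [1, 2],
    "RiskFactor": ["Obesity", "Smoking"],
})

# A left join keeps every national row and attaches its risk factor label.
df = national.merge(lookup, on="IndicatorID", how="left")
print(df.shape)
print(df.head())
```

Calling `df.to_csv("df.csv", index=False)` would then write the combined dataset to disk.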

The screenshots below show the results of merging two datasets from CDC to get a meaningful dataset:

Fig. 24 — Merged Dataset Results
Fig. 24 — Results of merged CDC datasets showing Risk factors, GenderID, Age and related fields
Fig. 25 — Combined Dataset Structure
Fig. 25 — Combined CDC dataset structure showing all columns and data types

The dataset has 29,102 rows and 18 columns. The data shape is shown below:

Fig. 26 — Dataset Shape: 29,102 rows × 18 Columns
Fig. 26 — Output showing merged CDC dataset shape of 29,102 rows and 18 columns

An arbitrarily chosen row, the 2000th, is shown below with its risk factors, GenderID, Age, and other fields:

Fig. 27 — Row 2000 — Risk Factors, GenderID, Age
Fig. 27 — Randomly located 2000th row showing Risk factors, GenderID, Age and other variables

6.1 The Risk Factor for Diabetes Disease

Fig. 28 — Risk Factors for Diabetes Disease
Fig. 28 — Analysis output showing primary risk factors for diabetes disease from CDC dataset

6.2 The Risk Factor Complication for Diabetes Disease

Fig. 29 — Risk Factor Complications for Diabetes Disease
Fig. 29 — Risk factor complications for diabetes-related conditions including cardiovascular, kidney and vision complications

7. Conclusion

This predictive analysis illustrates how classification models such as Logistic Regression, Decision Tree, and Random Forest can predict diseases like diabetes with reasonable accuracy. These models learn from the labeled training dataset and apply what they have learned to make predictions.

This project shows how machine learning models can use input variables such as number of pregnancies, blood pressure, insulin, BMI, glucose, skin thickness, diabetes pedigree function, and age to predict whether a person is diabetic or not.

Visualizing the dataset complements the classification modeling techniques, which are trained, validated, and tested on the data. Together they yield data-driven insights that can be used to proactively manage diabetes in healthcare settings.

References

  1. What is predictive analytics? TechTarget. searchbusinessanalytics.techtarget.com
  2. Centers for Disease Control and Prevention. Diabetes Basics. (2021). cdc.gov/diabetes/basics
  3. Hayes, A. Correlation. Investopedia. (2021). investopedia.com
  4. Confusion matrix. Wikipedia. (2021). en.m.wikipedia.org/wiki/Confusion_matrix
  5. Training, validation, and test sets. Wikipedia. (2021). en.m.wikipedia.org/wiki
  6. scikit-learn. scikit-learn.org
  7. Logistic Regression 3-class Classifier. scikit-learn. scikit-learn.org
  8. Random forest. Wikipedia. (2021). en.m.wikipedia.org/wiki/Random_forest
  9. Receiver operating characteristic. Wikipedia. (2021). en.m.wikipedia.org
  10. Palle, S. Big Data & Data Analytics Program. Emory Continuing Education. (2019).
  11. Bresnick, J. 10 High-Value Use Cases for Predictive Analytics in Healthcare. HealthITAnalytics. (2019). healthitanalytics.com
  12. Division of Diabetes Translation. CDC. (2021). cdc.gov/diabetes
  13. Pima Indians Diabetes Database. Kaggle. (2016). kaggle.com
  14. Decision Trees. scikit-learn. scikit-learn.org/stable/modules/tree.html
  15. Python Decision Tree Classification with Scikit-Learn. DataCamp. datacamp.com
  16. Supervised vs Unsupervised Learning. Guru99. guru99.com
  17. sklearn.dummy.DummyClassifier. scikit-learn. scikit-learn.org
  18. Johnson, C. Linear Regression with Python. (2014). connor-johnson.com
  19. Witten, I. H. and Frank, E. Data Mining. Practical Machine Learning Tools and Techniques. Elsevier. (2005).

Tags

Machine Learning, Healthcare Analytics, Python, Diabetes Prediction, Logistic Regression, Random Forest, Decision Tree, Scikit-learn, Seaborn, EHR Systems, KSU
About the Author
Balakumar Janakiraman
Business Data Analyst & IT Consultant with 5+ years in data analytics, machine learning and digital transformation. Post-Baccalaureate Certificate in IT from Kennesaw State University, USA. MBA from University of Madras. Founder of Vinayak Consulting Services, Toronto, Canada.