Data analytics plays an important role in healthcare in predicting and managing chronic diseases. Electronic health record data, combined with the PIMA Indians Diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases and datasets from the Centers for Disease Control and Prevention (CDC), is analyzed using machine learning classification models to predict diabetes outcomes and identify key risk factors.
1. Introduction
According to the Centers for Disease Control and Prevention, "Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy." [2] Over time, diabetes can lead to heart disease, kidney disease, and vision problems.
In this paper, the PIMA Indians Diabetes dataset from the National Institute of Diabetes and Digestive and Kidney Diseases is analyzed using classification algorithms that predict the target variable, the diabetes outcome. Datasets from the CDC are also analyzed to identify risk factors and complications for diabetes.
2. Supervised and Unsupervised Learning
Supervised learning is the task of inferring functions from labeled training data. Unsupervised learning is the task of inferring functions from unlabeled data.
2.1 Classification Algorithm
The objective is to predict a discrete target variable. Binary classification determines whether a disease is present or not. Multiclass classification assigns observations to more than two classes. A classification algorithm uses a set of pre-classified examples to develop a classification model.
In this project the classification models are trained on the training dataset, which includes the input variables (X) and the output variable (Y). Each trained model is then used to make predictions on the test dataset.
3. Data Mining — PIMA Indians Diabetes Dataset
The PIMA Indians Diabetes dataset is analyzed using Python 3.6.
The dataset contains information on factors associated with diabetes. It includes 768 records and 9 fields. The values are stored as 64-bit numbers: int64 for most fields and float64 for BMI and Diabetes Pedigree Function.
Variables (eight inputs and one target):
- 1. Pregnancies — Numeric
- 2. Glucose — Numeric
- 3. Blood Pressure — Numeric
- 4. Skin Thickness — Numeric
- 5. Insulin — Numeric
- 6. BMI — Numeric
- 7. Diabetes Pedigree Function — Numeric
- 8. Age — Numeric
- 9. Outcome — Numeric (Target Variable: 0 = No Diabetes, 1 = Diabetes)
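A minimal sketch of loading and inspecting the data with pandas follows. The file name `diabetes.csv` and the column spellings are assumptions based on the commonly distributed Kaggle CSV; a three-row in-memory sample stands in for the real file so the snippet runs on its own.

```python
import io

import pandas as pd

# Illustrative three-row sample in the assumed layout of the Kaggle CSV.
# The real file (assumed name: diabetes.csv) has 768 rows and the same 9 columns.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

# Replace io.StringIO(SAMPLE_CSV) with the path to the real file.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # storage type of each field
print(df.head())  # first few records
```

On the full file, `df.shape` would be expected to report the 768 records and 9 fields described above.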
3.1 Data Exploration and Visualization Using Correlation
Correlation is a statistic that measures the degree to which two variables move in relation to each other. The correlation coefficient has a value that must be between -1.0 and +1.0. A correlation coefficient of +1 indicates a perfect positive correlation, while a value of -1.0 indicates a perfect negative correlation.
The correlation matrix of the independent variables is plotted as a heat map. The .corr() method can be used to obtain the correlation matrix showing the degree of relationship between each pair of variables.
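As a rough illustration of this step, the sketch below computes a correlation matrix with `.corr()` on synthetic data and defines an optional heat map helper. The column values are made up, and `plot_heatmap` is a hypothetical helper name, not the original code; it imports seaborn and matplotlib only when called.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the dataset: three made-up numeric columns.
rng = np.random.default_rng(0)
glucose = rng.normal(120, 30, size=200)
df = pd.DataFrame({
    "Glucose": glucose,
    "Insulin": 0.5 * glucose + rng.normal(0, 10, size=200),  # built to correlate with Glucose
    "Age": rng.integers(21, 80, size=200),                   # generated independently
})

corr = df.corr()  # pairwise Pearson correlation coefficients
print(corr.round(2))


def plot_heatmap(corr_matrix):
    """Draw the correlation matrix as an annotated heat map (requires seaborn)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()
```

Calling `plot_heatmap(corr)` would render the plot; the diagonal is always 1 because each variable is perfectly correlated with itself.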
3.2 Data Preparation for Classification Models
The independent variables are grouped into the X data frame and the target variable “Outcome” into the Y Series. The first few rows are shown below:
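One way this separation might look in pandas is sketched here. The rows are illustrative, only three of the eight input columns are included for brevity, and the column names are assumed from the field list.

```python
import pandas as pd

# Illustrative rows only; column names are assumed from the field list.
df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Age": [50, 31, 32],
    "Outcome": [1, 0, 1],
})

X = df.drop(columns="Outcome")  # DataFrame of independent variables
y = df["Outcome"]               # Series holding the target variable

print(X.head())
print(y.head())
```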
3.3 Train / Test & Validate
In machine learning, a common task is the study and construction of algorithms that can learn from and make predictions on data. A dataset is commonly split as follows:
- Train: Use a large (~60%) sample of data to determine model coefficients
- Validate: Use a smaller (~20%) sample to assess performance and correct model coefficients
- Test: Use a different (~20%) sample to assess how the model will perform on unseen data
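Scikit-learn (sklearn) is a Python machine learning library. It provides the utilities used here to split the dataset into training and test sets.
In this project the 768 records are divided into a training set (75%) and a test set (25%). The dimensions are: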
- X_train: 576 rows and 8 columns of input variables
- X_test: 192 rows and 8 columns of input variables
- Y_train: 576 rows of the target variable OUTCOME
- Y_test: 192 rows of the target variable OUTCOME
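The dimensions listed above are what a 75/25 split of 768 rows produces. A minimal sketch with scikit-learn's `train_test_split` follows; the data is synthetic, and `test_size=0.25` and `random_state=42` are assumptions rather than the settings used in the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the dataset: 768 rows, 8 inputs.
rng = np.random.default_rng(42)
X = rng.normal(size=(768, 8))
y = rng.integers(0, 2, size=768)

# test_size and random_state are assumed values, chosen to reproduce the 576/192 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# (576, 8) (192, 8) (576,) (192,)
```

A separate validation set, as in the 60/20/20 scheme, could be obtained by applying the same function a second time to the training portion.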
4. Machine Learning Models for Classification Model Algorithms
4.1 Dummy Classifier
According to the scikit-learn documentation, "Dummy Classifier is a classifier that makes predictions using simple rules. The most_frequent strategy always predicts the most frequent label in the training set." [17]
To apply this strategy, a Dummy Classifier is trained on the training data as the first of the classification models. The results are shown below:
The accuracy score for Dummy Classifier is 0.6320 (63.20%). Any model performing below this score is worse than a naive prediction.
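The baseline idea can be sketched as follows, using synthetic labels in which class 0 is the majority. The class proportions and array sizes are assumptions for illustration, so the printed accuracy will not match the 0.6320 reported above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with roughly 65% in class 0. A dummy model ignores the features,
# so placeholder zeros are enough for X.
rng = np.random.default_rng(0)
y_train = (rng.random(576) < 0.35).astype(int)
y_test = (rng.random(192) < 0.35).astype(int)
X_train = np.zeros((576, 8))
X_test = np.zeros((192, 8))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

# The baseline accuracy equals the share of the majority class in the test labels.
baseline = accuracy_score(y_test, y_pred)
print(f"Baseline accuracy: {baseline:.4f}")
```

Because every prediction is the majority class, this score is a floor that any useful model should exceed.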
4.2 Logistic Regression
A logistic regression model is trained on the training data. The code for the logistic regression model is shown below:
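A minimal sketch of fitting a logistic regression classifier with scikit-learn is given here. It uses synthetic data from `make_classification` as a stand-in for the real dataset, and `max_iter=1000` is an assumed setting chosen so the solver converges.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the real dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_iter is an assumed value, raised so the solver converges.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

print(logreg.predict(X_test)[:6])    # first six predicted labels
print(logreg.score(X_test, y_test))  # accuracy on the test set
```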
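This section explains predictions, the confusion matrix, and how the accuracy, recall, and precision scores are calculated. The confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. Its four cells count the true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP):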
| Metric | Formula |
|---|---|
| Overall Accuracy | (TN + TP) / Total Observations |
| Overall Error Rate | (FN + FP) / Total Observations |
| Recall (TPR) | TP / (FN + TP) |
| Precision | TP / (TP + FP) |
| False Negative Rate (FNR) | FN / (FN + TP) = 1 − TPR |
| True Negative Rate (TNR) | TN / (TN + FP) |
| False Positive Rate (FPR) | FP / (TN + FP) = 1 − TNR |
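The formulas in the table can be checked against scikit-learn's built-in metrics. The sketch below uses ten made-up labels, not results from the dataset.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels for illustration: 5 negatives and 5 positives.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
total = tn + fp + fn + tp

accuracy = (tn + tp) / total    # overall accuracy
error_rate = (fn + fp) / total  # overall error rate
recall = tp / (fn + tp)         # true positive rate (TPR)
precision = tp / (tp + fp)
fnr = fn / (fn + tp)            # 1 - TPR
tnr = tn / (tn + fp)
fpr = fp / (tn + fp)            # 1 - TNR

print(tn, fp, fn, tp)               # 4 1 1 4
print(accuracy, recall, precision)  # 0.8 0.8 0.8
```

The hand-computed values agree with `accuracy_score`, `recall_score`, and `precision_score` on the same labels.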
The predictions on the training and test data, with the first 6 values printed, are shown below:
The accuracy, recall, and precision scores on the training data are shown below:
4.3 ROC Curves and AUC Values
A Receiver Operating Characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at varying classification thresholds. The predicted probabilities on the test data, with the first 5 probabilities of “Outcome” printed, are shown below:
ROC curves and AUC Values are shown below:
The AUC score is the area under a model's ROC curve, here the blue and red curves. A higher AUC score indicates a better model, and a score of 0.5 corresponds to random guessing.
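A sketch of how ROC points and an AUC value might be computed with scikit-learn follows. The data is synthetic and the model settings are assumptions, so the printed AUC is illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Column 1 of predict_proba holds the estimated probability of class 1.
proba = model.predict_proba(X_test)[:, 1]
print(proba[:5])

# One (FPR, TPR) point per threshold; plotting tpr against fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_test, proba)
auc = roc_auc_score(y_test, proba)
print(f"AUC: {auc:.4f}")
```

The curve is drawn from probabilities rather than hard labels, which is why `predict_proba` is used instead of `predict`.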
4.4 Random Forest
Random forest is an ensemble classification method that constructs a multitude of decision trees at training time and outputs the class that is the mode of the classes of the individual trees. The Random Forest code is shown below:
The accuracy, recall, precision, and AUC scores on the test data for Random Forest are shown below:
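The following sketch trains a random forest and reports the four scores on a held-out test set. It runs on synthetic data, and `n_estimators=100` and `random_state=0` are assumed settings, so the numbers will differ from those reported in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Assumed hyperparameters: 100 trees, fixed seed for repeatability.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

y_pred = forest.predict(X_test)               # majority vote of the trees
y_proba = forest.predict_proba(X_test)[:, 1]  # estimated probability of class 1

scores = {
    "accuracy": accuracy_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```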
4.5 Decision Tree
A decision tree predicts the value of a target variable, which may be continuous, discrete, or categorical, by learning simple decision rules from the input features. By importing the decision tree classifier, the model can be created:
Decision Tree results on Test Data:
- Accuracy of test data: 70%
- Recall of test data: 57%
- Precision for test data: 61%
- AUC score for test data: 60%
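To illustrate what a fitted tree looks like, the sketch below trains a shallow decision tree on synthetic data and prints its learned rules with `export_text`. The depth limit, the seed, and the feature names `x0` to `x7` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# max_depth=3 is an assumed setting that keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

accuracy = accuracy_score(y_test, tree.predict(X_test))
print(f"Test accuracy: {accuracy:.4f}")

# Text rendering of the if/else rules the tree has learned.
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names
rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each printed branch is a threshold test on one feature, and each leaf ends in a predicted class.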
4.6 Logistic Regression with Polynomial Features
The logistic regression model with polynomial features, built by importing PolynomialFeatures, is shown below:
The AUC score for logistic regression with polynomial features is 67%.
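One possible way to combine polynomial features with logistic regression is a scikit-learn pipeline, sketched here on synthetic data. The degree of 2, the added `StandardScaler` step, and `max_iter=2000` are assumptions and not necessarily the configuration used for the result above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data standing in for the real dataset.
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale, expand to squares and pairwise products, then fit the classifier.
poly_model = make_pipeline(
    StandardScaler(),                                  # assumed preprocessing step
    PolynomialFeatures(degree=2, include_bias=False),  # assumed degree
    LogisticRegression(max_iter=2000),
)
poly_model.fit(X_train, y_train)

# 8 original features + 8 squares + 28 pairwise products = 44 columns.
n_expanded = poly_model.named_steps["polynomialfeatures"].n_output_features_
print(n_expanded)

auc = roc_auc_score(y_test, poly_model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.4f}")
```

The expanded feature set lets the linear model capture interactions, at the cost of more parameters to fit on the same number of rows.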
Comparison of results of all the techniques is shown below:
| Model | AUC Score | Assessment |
|---|---|---|
| Random Forest | 80% (0.8055) | Best model |
| Logistic Regression | 80% (0.8014) | Excellent |
| Logistic + Polynomial | 67% (0.6783) | Good |
| Decision Tree | 60% (0.6813) | Good |
| Dummy Classifier (Baseline) | 63% (0.6320, accuracy score) | Baseline |
5. Visualization of Data Using Seaborn
In the pair plot, the observations labeled as 0 are shown in blue and 1 in orange. Inspection of the graph suggests that observations can be classified by several variables. The pair plots show the distribution of each variable and the relationships between pairs of variables.
6. Obtaining Risk Factors for Diabetes-Related Complications
The data comes from two diabetes datasets from the Centers for Disease Control and Prevention (CDC). One is the national data in CSV format, which gives national-level statistics. The other is a lookup table providing detailed risk factor information for diabetes complications.
The Python code reads the two datasets (National_Data.csv and lookup_table.xlsx) and forms a combined dataset, df.csv:
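A hedged sketch of such a merge in pandas is given below. The structure of the CDC files is not shown in this paper, so the column names, the join key `RiskFactorID`, and the row values are all hypothetical; small in-memory frames stand in for the two files.

```python
import pandas as pd

# Tiny made-up frames standing in for National_Data.csv and lookup_table.xlsx.
# All column names, including the join key "RiskFactorID", are hypothetical.
national = pd.DataFrame({
    "RiskFactorID": [1, 2, 2, 3],
    "GenderID": [1, 2, 1, 2],
    "Estimate": [10.5, 8.2, 9.1, 12.0],
})
lookup = pd.DataFrame({
    "RiskFactorID": [1, 2, 3],
    "RiskFactor": ["Obesity", "Smoking", "Physical inactivity"],
})

# With the real files, the frames would instead be read as:
#   national = pd.read_csv("National_Data.csv")
#   lookup = pd.read_excel("lookup_table.xlsx")

# A left join keeps every national row and attaches the matching description.
df = national.merge(lookup, on="RiskFactorID", how="left")

print(df.shape)    # (rows, columns) of the combined dataset
print(df.iloc[1])  # inspect a single row by position

# df.to_csv("df.csv", index=False)  # uncomment to write the combined dataset
```

A left join is used here so that no national-level record is dropped if its key is missing from the lookup table; whether the original code used the same join type is not known.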
The screenshots below show the results of merging the two CDC datasets into a single, meaningful dataset:
The dataset has 29,102 rows and 18 columns. The data shape is shown below:
The 2000th row, selected as a sample, is shown below with its risk factors, GenderID, age, and so on:
6.1 The Risk Factor for Diabetes Disease
6.2 The Risk Factor Complication for Diabetes Disease
7. Conclusion
This predictive analysis illustrates how classification models such as Logistic Regression, Decision Tree, and Random Forest can predict diseases like diabetes with reasonable accuracy. These models learn from the labeled training dataset and apply that learning to make predictions.
This project shows which machine learning models can be applied to diseases such as diabetes, using variables such as pregnancies, blood pressure, insulin, BMI, glucose, skin thickness, diabetes pedigree function, and age to predict whether a person is diabetic or not.
Visualization of the dataset complements the classification modeling techniques, which are trained, validated, and tested on the data. These models yield data-driven insights that can be used to proactively manage diabetes in healthcare settings.
References
- [1] What is predictive analytics? TechTarget. searchbusinessanalytics.techtarget.com
- [2] Centers for Disease Control and Prevention. Diabetes Basics. (2021). cdc.gov/diabetes/basics
- [3] Hayes, A. Correlation. Investopedia. (2021). investopedia.com
- [4] Confusion matrix. Wikipedia. (2021). en.m.wikipedia.org/wiki/Confusion_matrix
- [5] Training, validation, and test sets. Wikipedia. (2021). en.m.wikipedia.org/wiki
- [6] scikit-learn. scikit-learn.org
- [7] Logistic Regression 3-class Classifier. scikit-learn. scikit-learn.org
- [8] Random forest. Wikipedia. (2021). en.m.wikipedia.org/wiki/Random_forest
- [9] Receiver operating characteristic. Wikipedia. (2021). en.m.wikipedia.org
- [10] Palle, S. Big Data & Data Analytics Program. Emory Continuing Education. (2019).
- [11] Bresnick, J. 10 High-Value Use Cases for Predictive Analytics in Healthcare. HealthITAnalytics. (2019). healthitanalytics.com
- [12] Division of Diabetes Translation. CDC. (2021). cdc.gov/diabetes
- [13] Pima Indians Diabetes Database. Kaggle. (2016). kaggle.com
- [14] Decision Trees. scikit-learn. scikit-learn.org/stable/modules/tree.html
- [15] Python Decision Tree Classification with Scikit-Learn. DataCamp. datacamp.com
- [16] Supervised vs Unsupervised Learning. Guru99. guru99.com
- [17] sklearn.dummy.DummyClassifier. scikit-learn. scikit-learn.org
- [18] Johnson, C. Linear Regression with Python. (2014). connor-johnson.com
- [19] Witten, I. H. and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Elsevier. (2005).