Abstract
BACKGROUND/AIMS
Diabetes is one of the most pressing public health challenges, affecting millions of people worldwide. Classification models can support early detection and aid treatment, particularly for type 2 diabetes. This study therefore applies an ensemble learning approach to type 2 diabetes classification, combining multiple machine learning techniques through a soft voting classifier on the Centers for Disease Control and Prevention Diabetes Health Indicators Dataset.
MATERIALS AND METHODS
An ensemble model was developed in which the predictions of five machine learning algorithms were combined: XGBoost, Random Forest, Gradient Boosting, Support Vector Machine, and a convolutional neural network-long short-term memory (CNN-LSTM) network. Each model was trained using bootstrap resampling, and predictions were aggregated through soft voting to improve classification performance on the test set.
RESULTS
On the test set, the ensemble model achieved a classification accuracy of 87.8%, a precision of 99.5%, a recall of 99.51%, and an F1 score of 99.2%, demonstrating high efficacy in identifying type 2 diabetes cases.
CONCLUSION
The proposed ensemble model classifies type 2 diabetes with high precision and recall, underscoring the value of ensemble learning in improving classification accuracy. It may provide a reliable tool for the early detection of diabetes, contributing to better patient outcomes through timely intervention.
INTRODUCTION
Diabetes mellitus is a chronic disease that impairs the body’s ability to convert food into energy, resulting in elevated blood sugar levels due to insufficient or ineffective insulin production.1 Type 2 diabetes accounts for 90-95% of cases, in which insulin is produced but is inadequate in its action. High blood sugar can damage blood vessels and organs, leading to severe complications such as cardiovascular disease, kidney failure, and neuropathy. Moreover, undiagnosed diabetes can reduce life expectancy by up to 8 years, highlighting the urgent need for early detection and intervention.2
Common symptoms include frequent thirst, nighttime urination, fatigue, unplanned weight loss, slow wound healing, increased hunger, and blurred vision.3 Diagnosis relies primarily on blood glucose measurement,4 while the risk of developing diabetes is influenced by various health indicators. Research indicates that obesity, high blood pressure (HighBP), high cholesterol (HighCol), stroke history, and cardiovascular diseases are significant risk factors for diabetes.5-7 These health-related factors complicate diabetes management, necessitating effective predictive models.
In public health, classifying patients as diabetic or non-diabetic using advanced machine learning techniques can significantly enhance early detection and treatment strategies. This study aims to leverage ensemble learning models to improve classification accuracy for type 2 diabetes, integrating multiple machine learning algorithms through a voting classifier. This approach seeks to provide a reliable tool for healthcare professionals to identify at-risk individuals, ultimately contributing to better patient outcomes.
Previous studies have explored various machine learning approaches for diabetes prediction. For instance, Singh and Singh8 achieved 83.6% accuracy with a stacking-based ensemble framework. Kibria et al.9 reached 90% accuracy using a soft voting classifier. Dogru et al.10 developed a hybrid model achieving 99.6% accuracy. Sunny et al.11 also proposed a soft voting ensemble method for accurate diabetes risk diagnosis.
MATERIALS AND METHODS
Statistical Analysis
To conduct this study, the Centers for Disease Control and Prevention (CDC) Diabetes Health Indicators Dataset was selected. The dataset contains 253,680 samples with 35 features, consisting of medical and behavioural data of individuals. The proposed method was applied to the feature group shown in Table 1. Since this study primarily focuses on classifying diabetes from medical data, five medical features were selected from the dataset: HighBP, HighCol, body mass index (BMI), which indicates whether an individual is at a healthy weight, history of stroke, and history of heart disease or heart attack. Age and gender were used as demographic data.
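The snippet below is a minimal sketch of this data-preparation step, assuming the dataset is available as a CSV file; the file name and exact column spellings (e.g., HighChol, HeartDiseaseorAttack, Diabetes_binary) are assumptions and should be matched to the actual export of the CDC dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the CDC Diabetes Health Indicators data (file name is assumed).
df = pd.read_csv("diabetes_health_indicators.csv")

# Five medical features plus two demographic features, as described above.
features = ["HighBP", "HighChol", "BMI", "Stroke",
            "HeartDiseaseorAttack", "Age", "Sex"]
X, y = df[features].values, df["Diabetes_binary"].values

# 80-20 train-test split and feature standardization (see "Proposed model").
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```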
Machine Learning Algorithms
Random forest: Random Forest builds multiple decision trees, training each tree on a different bootstrap sample of the observations. The algorithm obtains a prediction from every tree and aggregates these predictions to produce the final classification.12
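For illustration, a Random Forest classifier can be fit on the prepared training data as follows; the number of trees is an assumed value rather than a tuned setting, and X_train, y_train, and X_test come from the data-preparation sketch above.

```python
from sklearn.ensemble import RandomForestClassifier

# Each tree is grown on a bootstrap sample of the training data; the forest
# aggregates the trees' class-probability estimates into a single prediction.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
rf_probs = rf.predict_proba(X_test)[:, 1]  # probability of the diabetic class
```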
Gradient boosting: In the first stage, an initial tree is created. The error is then calculated from the difference between the actual value of the target variable and the value predicted by the tree, and a second tree is created to reduce this error. The second tree estimates the negative gradient of the loss of the predictions made by the first tree.14
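A corresponding sketch with scikit-learn's Gradient Boosting classifier is shown below; the number of estimators and the learning rate are illustrative assumptions.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Trees are added sequentially; each new tree is fit to the negative gradient
# of the loss left by the current ensemble's predictions.
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
gb.fit(X_train, y_train)
gb_probs = gb.predict_proba(X_test)[:, 1]
```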
eXtreme gradient boosting: XGBoost is an ensemble method built on decision trees. Unlike traditional gradient boosting (GB), it minimizes a regularized objective that combines the training loss with a penalty on model complexity:
Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₖ Ω(fₖ)    (1)
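A hedged sketch of the corresponding XGBoost classifier is given below; the regularization strength and other hyperparameters are assumptions, not the study's tuned settings.

```python
from xgboost import XGBClassifier

# reg_lambda corresponds to the complexity penalty Ω(f) in Eq. (1).
xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, reg_lambda=1.0,
                    eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)
xgb_probs = xgb.predict_proba(X_test)[:, 1]
```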
Support vector machines: As represented in Figure 1, the support vector machine (SVM) is a supervised learning technique used for classification and regression. The algorithm finds a boundary that separates two or more classes of points in the feature space; for two classes, this boundary is chosen to maximize the distance to the nearest points of each class. The decision boundary with the best margin between the classes defines the separating hyperplane.15
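An illustrative SVM fit is sketched below; the RBF kernel and C = 1.0 are assumptions, and probability=True is needed so the model can later contribute class probabilities to the soft-voting step. Training a kernel SVM on a training set of this size can be slow, so subsampling may be required in practice.

```python
from sklearn.svm import SVC

# probability=True enables predict_proba, which the soft-voting step requires.
svm = SVC(kernel="rbf", C=1.0, probability=True, random_state=42)
svm.fit(X_train, y_train)
svm_probs = svm.predict_proba(X_test)[:, 1]
```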
Convolutional neural network-long short-term memory: A deep learning architecture formed by combining convolutional neural networks (CNNs) and long short-term memory (LSTM) networks is known as CNN-LSTM, represented in Figure 2. CNNs capture spatial relationships within the data through convolution. LSTM, a type of recurrent neural network, is effective at capturing long-term dependencies. The combined CNN-LSTM learns both spatial and temporal features of the data.16
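The sketch below shows one possible CNN-LSTM for this tabular task, assuming Keras; the feature vector is treated as a short sequence so that Conv1D can be applied, and the layer sizes, epochs, and batch size are assumptions rather than the architecture reported in Figure 2.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, LSTM, Dense

# Reshape the 7 standardized features into a (samples, 7, 1) "sequence".
X_train_seq = np.asarray(X_train, dtype="float32").reshape(-1, X_train.shape[1], 1)
X_test_seq = np.asarray(X_test, dtype="float32").reshape(-1, X_test.shape[1], 1)

cnn_lstm = Sequential([
    Conv1D(32, kernel_size=2, activation="relu", input_shape=(X_train.shape[1], 1)),
    MaxPooling1D(pool_size=2),   # CNN part: local (spatial) patterns
    LSTM(16),                    # LSTM part: dependencies along the sequence
    Dense(1, activation="sigmoid"),
])
cnn_lstm.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
cnn_lstm.fit(X_train_seq, y_train, epochs=10, batch_size=256, verbose=0)
cnn_probs = cnn_lstm.predict(X_test_seq, verbose=0).ravel()
```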
Correlation: Figure 3 illustrates the correlation between Diabetes Binary and eight characteristic variables. Health-related factors such as HighBP, BMI, HighCol, and heart disease show a positive correlation with diabetes. Among the demographic variables, education level shows a negative correlation and stands out as significant, even when compared with the medical factors. Examining both positive and negative correlations provides a comprehensive view of the dataset.
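The correlations behind Figure 3 can be approximated as follows, assuming the DataFrame and column names from the earlier sketch; further columns such as an education indicator can be appended to the feature list if present in the export.

```python
# Pearson correlation of each selected feature with the diabetes label.
corr = df[features + ["Diabetes_binary"]].corr()["Diabetes_binary"]
print(corr.drop("Diabetes_binary").sort_values(ascending=False))
```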
Proposed model: The dataset was loaded, preprocessed, and split 80-20 into training and test sets, followed by feature standardization. Bootstrap resampling was applied to the training set to enhance model robustness. Multiple models (XGBoost, Random Forest, Gradient Boosting, SVM, and CNN-LSTM) were trained on the resampled datasets (Figure 4), and their predictions were combined using soft voting, averaging the predicted class probabilities. The model's performance was evaluated using accuracy, precision, recall, and F1 score. This ensemble approach improves robustness and mitigates overfitting.
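A minimal sketch of the soft-voting step is given below, assuming the base estimators and the standardized split from the earlier sketches; the seeds, the 0.5 threshold, and the way the CNN-LSTM probabilities are appended are illustrative choices, not the study's exact implementation.

```python
import numpy as np
from sklearn.utils import resample
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Refit each scikit-learn-compatible base model on its own bootstrap resample
# of the training set, then collect the predicted class probabilities.
probas = []
for seed, model in enumerate((xgb, rf, gb, svm)):
    X_boot, y_boot = resample(X_train, y_train, random_state=seed)
    model.fit(X_boot, y_boot)
    probas.append(model.predict_proba(X_test)[:, 1])
probas.append(cnn_probs)  # CNN-LSTM probabilities from the earlier sketch

# Soft voting: average the probabilities and threshold at 0.5.
avg_proba = np.mean(probas, axis=0)
y_pred = (avg_proba >= 0.5).astype(int)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```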
RESULTS
Performance Analysis
The effectiveness of different machine learning algorithms in binary classification varies. Accuracy was used to evaluate Decision Tree, Random Forest, K-nearest neighbor (KNN), CatBoost, Gaussian Naive Bayes, Logistic Regression, Linear Discriminant, Gradient Boosting, and the proposed model. As shown in Table 2 and Figure 5, the proposed model achieved the highest accuracy (87.8%) in classifying the CDC Diabetes Health Indicators Dataset.
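This baseline comparison can be approximated with default scikit-learn models as sketched below (CatBoost is omitted to avoid an extra dependency); the hyperparameters are defaults, so the resulting numbers are illustrative rather than a reproduction of Table 2.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

baselines = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Linear Discriminant": LinearDiscriminantAnalysis(),
}
for name, clf in baselines.items():
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    p, r, f1, _ = precision_recall_fscore_support(y_test, pred, average="binary")
    print(f"{name:22s} acc={accuracy_score(y_test, pred):.3f} "
          f"prec={p:.3f} rec={r:.3f} f1={f1:.3f}")
```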
Beyond accuracy, precision, recall, and F1 score provide deeper insight into model performance. Random Forest and KNN excel in these metrics, demonstrating strong predictive power and minimizing false positives and false negatives. While the proposed model has the highest accuracy, its recall and F1 score confirm its ability to correctly classify positive cases. In contrast, Gaussian Naive Bayes and Logistic Regression show moderate accuracy with lower precision, recall, and F1 scores, indicating a higher rate of misclassification. Decision Trees and linear models, though less accurate, outperform Gaussian Naive Bayes and Logistic Regression in precision, recall, and F1 score.
Finally, Gradient Boosting shows competitive performance with high precision, recall, and F1 score, though it is slightly weaker than Random Forest and KNN. Overall, this analysis emphasizes the importance of examining metrics beyond accuracy to fully evaluate model performance, especially in scenarios where the costs of false positives and false negatives differ.
DISCUSSION
This study presents a robust ensemble learning model for classifying type 2 diabetes, utilizing five machine learning algorithms: XGBoost, Random Forest, Gradient Boosting, SVM, and CNN-LSTM. The ensemble approach, employing soft voting, achieved a classification accuracy of 87.8%, with precision, recall, and F1 scores of 99.5%, 99.51%, and 99.2%, respectively. These results demonstrate the model's effectiveness in accurately identifying diabetic patients while minimizing false positives and negatives, which is crucial in clinical settings.
The high performance of the ensemble model is attributed to its ability to leverage the strengths of diverse algorithms. Each algorithm contributes unique insights, allowing the ensemble to capture complex patterns in the data. For instance, Random Forest handles overfitting through its ensemble of decision trees, while XGBoost optimizes predictive accuracy via gradient boosting. The integration of CNN-LSTM captures both spatial and temporal features, which is beneficial for analyzing health-related time-series data.
However, the study has limitations. Reliance on a single dataset may restrict the generalizability of findings across different populations. Future research should validate the model on diverse datasets to ensure broader applicability. Additionally, the complexity of the ensemble model may pose challenges in real-time clinical implementation and interpretability. Simplifying the model or employing explainable AI techniques could enhance usability for healthcare professionals. In conclusion, this study highlights the potential of ensemble learning in diabetes prediction, offering a powerful tool for early detection and intervention.
Study Limitations
This study’s reliance on a single dataset may limit the generalizability of findings across diverse populations. Additionally, the model’s complexity could present challenges in real-time implementation and interpretability in clinical settings.
CONCLUSION
Diabetes prediction remains essential in public health for early intervention and management. This study’s ensemble model, leveraging the strengths of various machine learning techniques, achieved superior predictive accuracy and reliability in classifying diabetes based on health indicators. With high scores in precision, recall, and F1 metrics, the model proves valuable for accurately identifying diabetic patients and minimizing classification errors. Although further work is needed to enhance interpretability and generalizability, the findings suggest that ensemble learning offers a powerful approach for diabetes prediction, potentially aiding clinicians in early detection and reducing the impact of diabetes on patient health outcomes.
MAIN POINTS
• An ensemble learning model combining XGBoost, Random Forest, Gradient Boosting, SVM, and CNN-LSTM was developed for type 2 diabetes classification.
• The model achieved high performance with 87.8% accuracy, 99.5% precision, 99.51% recall, and 99.2% F1 score.
• The results highlight the potential of ensemble models in improving early detection and management of diabetes.