An Improved Diabetes Prediction using Support Vector Machine Method with a Radial Basis Function (RBF) kernel

Edgar Osakpanwan Osaghae,
Department of Computer Science,
Federal University Lokoja,
Lokoja, Kogi State, Nigeria.
Email: edgar.osaghae@fulokoja.edu.ng

Abstract

Diabetes, a chronic metabolic condition, is one of the most challenging diseases to predict at an early stage in patients. Healthcare practitioners rely on machine learning to predict the disease early in patients, so that treatment can commence before the disease causes serious damage to the patient's body. Several researchers have applied different machine learning models to this problem, but the prediction accuracies they recorded were still below 100%. Consequently, in this research, a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel was trained on a diabetes dataset to predict diabetes from a set of features. When the trained model was tested, it recorded an accuracy of 100%. With this impressive prediction accuracy, the model can be deployed in hospitals and healthcare centres to assist healthcare practitioners in predicting diabetes early in patients, so that the health complications associated with the full-blown diabetes stage can be averted.

Keywords: Machine Learning, Support Vector Machine, Radial Basis Function, Confusion Matrix, Diabetes

1. Introduction

Diabetes is a condition in which the human body fails to produce sufficient insulin, resulting in a rise in the sugar content of the blood. Common symptoms of diabetes are increased urination, thirst and hunger. There are two types of diabetes, namely type 1 diabetes and type 2 diabetes. In type 1 diabetes, the pancreas does not produce sufficient insulin for the body, while type 2 diabetes combines insufficient insulin production with insulin resistance. If diabetes is not treated or properly monitored, it can lead to serious health challenges (Sistla, 2022; Jiby, 2021; Soliman & AboElhamd, 2014; Vaishali & Pandey, 2018; Gandhi & Pandi, 2022; Verma & Harshavardhanan, 2024). Diabetes, a chronic metabolic condition, is a global health care burden. According to the International Diabetes Federation (IDF), 463 million people between the ages of 20 and 79 years have diabetes, and 374 million have impaired glucose tolerance. While 8.8% of the world population was reported to have diabetes in 2017, the figure is projected to rise to 10% by 2045 (Rani, 2020; Tejas et al., 2018; Ellahham, 2020; Massari et al., 2024; Anggoro & Permatasari, 2023; Varun et al., 2021). Early detection of diabetes can avert the complications it could cause and also save the patient's life. Some of the health challenges caused by diabetes are Coronary Artery Disease (CAD), Chronic Kidney Disease (CKD), Chronic Obstructive Aspiratory Sickness, Hypertension and Hypothyroidism (Kayalvizhi & Maheswari, 2020; Pappula et al., 2024; Asli & Adnan, 2023; Shamriz & Mete, 2021; Ozge & Hamza, 2021; Heet & Briskilal, 2022). Early prediction of diabetes is very challenging for medical practitioners because of the complex interdependence of various factors that are difficult to analyze (Reddy et al., 2023).

A Support Vector Machine (SVM) assigns class labels by separating the data points with a straight line or, more generally, a hyperplane (Praneeth, 2019). SVM is a linear classifier in the parameter space, which can be easily extended to a nonlinear classifier of the Φ-machine type by mapping the input space S = {x} into a high-dimensional (possibly infinite-dimensional) feature space F = {Φ(x)}. By choosing an adequate mapping Φ, the points become linearly separable, or mostly linearly separable, in the high-dimensional space, so that structural risk minimization can be applied easily. The mapped patterns Φ(x) need not be computed explicitly; only the dot products between mapped patterns are needed, and these are directly available from the kernel function that generates Φ(x). By choosing different kinds of kernels, SVM can realize Radial Basis Function (RBF), polynomial and multilayer perceptron classifiers. Compared with the traditional way of implementing them, SVM has the extra advantage of automatic model selection, in the sense that both the optimal number and the locations of the basis functions are obtained automatically during training (Amari & Wu, 1999; Ramachandro & Bhramaramba, 2019).
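The point that only dot products between mapped patterns are needed can be checked numerically. The sketch below is illustrative and not taken from this research: it uses a degree-2 polynomial kernel in two dimensions, because its feature map can be written out explicitly, and it includes the standard RBF kernel form exp(-gamma * ||x - y||^2), whose feature space is infinite-dimensional. The function names and the sample points are assumptions made for the example.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def phi_poly2(x):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, y):
    """The same quantity computed directly in the input space: (x . y)^2."""
    return dot(x, y) ** 2


def rbf_kernel(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Hypothetical sample points, chosen only for illustration.
x = (1.0, 2.0)
y = (3.0, -1.0)

explicit = dot(phi_poly2(x), phi_poly2(y))  # dot product in the feature space
implicit = poly2_kernel(x, y)               # kernel evaluated in the input space

print("explicit mapping:", explicit)
print("kernel shortcut: ", implicit)
print("RBF similarity:  ", rbf_kernel(x, y, gamma=0.5))
```

Both routes give the same value up to floating-point rounding, which is why the mapping Φ never has to be formed when a kernel function is available.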

In the SVM algorithm, there are several common types of kernel functions, such as linear, radial basis function (RBF), sigmoid and polynomial. Each kernel function has specific parameters that need to be optimized to achieve the best performance. The regularization parameter C determines the magnitude of the penalty for errors. This affects the balance between the smoothness of the decision boundary and the ability to classify the training data accurately. If the value of C is high, the hyperplane is fitted so that the training data are classified accurately; conversely, if the value of C is low, the optimization seeks a wider margin around the separating hyperplane. Another factor that affects the performance of SVM classification is gamma, which controls how far the influence of a single training sample reaches. A high value of gamma gives more weight to data points close to the decision boundary. Conversely, a low value of gamma means that data points far from the decision boundary are also taken into account when the boundary is computed (Dewi et al., 2024).
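The role of gamma can be made concrete by evaluating the RBF kernel value exp(-gamma * d^2) at a few distances d. The sketch below is illustrative only: the gamma values and distances are assumptions chosen for the example, not the settings of this research, and the sketch does not cover the effect of C.

```python
import math


def rbf_similarity(distance, gamma):
    """RBF kernel value for two points separated by the given distance."""
    return math.exp(-gamma * distance ** 2)


# Illustrative values only (assumptions, not tuned settings).
distances = [0.1, 0.5, 1.0, 2.0]
gammas = [0.01, 1.0, 10.0]

# One row of kernel values per gamma, one column per distance.
table = {g: [rbf_similarity(d, g) for d in distances] for g in gammas}

print("distance   " + "  ".join(f"{d:6.1f}" for d in distances))
for g in gammas:
    row = "  ".join(f"{v:6.4f}" for v in table[g])
    print(f"gamma={g:<5} {row}")
```

With a low gamma the kernel value stays close to 1 even for distant points, so far-away samples still influence the boundary; with a high gamma the value drops towards 0 quickly, so only nearby samples matter.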

Researchers have been using machine learning methods to develop systems that help predict diseases from the symptoms shown by patients. Different machine learning methods are suitable for predicting diabetes, but the main challenge is choosing the ones that predict the disease with high accuracy (Heet & Briskilal, 2022; Ehtisham & Jameel, 2020). Support Vector Machine (SVM) is one of the most sophisticated machine learning techniques, known for its efficiency in solving challenging classification problems. SVM can manage high-dimensional data and spot subtle trends within datasets, which makes it suitable for medical diagnosis applications such as diabetes prediction (Srividhya et al., 2023). Pappula et al. (2024) explored a combination of various machine learning algorithms to predict diabetes. The algorithms they used and their prediction accuracies are Gaussian Naïve Bayes Classifier (96.43%), Support Vector Machine (96.69%), Linear Regression (98.52%) and Decision Tree (98.86%). Decision Tree recorded the highest prediction accuracy, at 98.86%.

Although the models presented have high accuracies, they mix classification and regression algorithms rather than focusing on either the regression or the classification approach to building the models. Certain features in a dataset sample can help in training machine learning models to predict diabetes, namely gender, pregnancy, smoking, glucose, alcohol, blood pressure, sleeping hours, skin thickness, insulin, Body Mass Index (BMI), age, regular medicine, etc. (Shruthi et al., 2021; Asaad, 2022).

2. Related Works

Researchers, clinical practitioners and people in the industry widely believe that artificial intelligence has the power to change the current situation of late detection and medication caused by human error. Automation has the capability to construct efficient and reliable medical detection systems. Machine learning, by means of its powerful predictive and classification models, plays an important role in helping to achieve this. In recent years, several models have been proposed for the prediction of diabetes based on machine learning techniques (Virgolici & Virgolici, 2024). There are many machine learning models for training on diabetes datasets, among them AdaBoost, Decision Tree, Support Vector Machine (SVM), Neural Networks, etc. (Patel et al., 2021; Hang et al., 2024; Mishra et al., 2020; Asfaw, 2019; Suryadevara & Balaji, 2024; Nar & Palkar, 2021; Alam et al., 2024). Research works related to this study are as follows.

Waberi et al. (2024) combined LSTM’s ability to analyze sequential data with XGBoost’s strength in handling structured datasets. When the hybrid LSTM-XGBoost model was trained on the dataset, it recorded a prediction accuracy of 99%.

Dewi et al. (2024) used diabetes mellitus datasets from Kaggle to train a Support Vector Machine with a Radial Basis Function (RBF) kernel. They set the regularization parameter C to 10 and gamma to 0.01, and achieved an accuracy of 98.25%.

Srividhya et al. (2023) trained an SVM model to predict diabetes. They used 100,000 data samples with 8 features and 2 classification classes, and reported an impressive accuracy of 96.5%. The number of samples used (100,000) was large, and the study did not report any problems encountered that would need further study.

Krishnenedhu et al. (2020) used diabetes datasets to train Support Vector Machine and Logistic Regression models. After training, SVM recorded an accuracy of 93% and Logistic Regression recorded an accuracy of 76%.

Anuja & Chitra (2013) developed an SVM model with an RBF kernel using 460 samples, 8 features and 2 classification classes, and obtained an accuracy of 76%. This is an impressive accuracy given the small number of dataset samples used. The authors suggested further research on improving the feature subset selection process.

Verma et al. (2017) used a Support Vector Machine model with an RBF kernel, trained on diabetes datasets from the UCI diabetes repository. After training, the model recorded an accuracy of 88%.

3. Materials and Methods

This research was conducted on the Ubuntu Linux 18.04 operating system, using Python 3.9, and was implemented in the Jupyter Notebook editor.

3.1. Dataset Collection

The dataset used for this research was obtained from the Kaggle online dataset repository and contains 34,862 instances with 9 features. The features are gender, age, hypertension, heart disease, Body Mass Index (bmi), HbA1c_level, Smoking History, Blood Glucose Level and diabetes. The diabetes feature is the target feature that the model tries to predict.

3.2. Preprocessing

The values of the features in the dataset were checked for null values and duplicates. While exploring the dataset, it was observed that the gender feature contains 18 null values, and that the non-null values are the strings ‘Male’ and ‘Female’. The instances having the 18 null values were dropped entirely, leaving 34,844 dataset instances. Since machine learning models can only be trained on numerical values, the gender strings ‘Male’ and ‘Female’ were transformed to the numeric values 0 and 1 respectively. The Smoking History feature was dropped from the dataset because, apart from trainable values like ‘Smoking’ and ‘Not Smoking’, it also has the value ‘not sure’, which may reduce the prediction accuracy of the proposed model. To speed up the training of the model and to ensure numerical stability, the dataset values were normalized to the range between 0 and 1.
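The preprocessing steps described above can be sketched as follows. This is a minimal illustration and not the notebook used in this research: it works on a list of dictionaries with the standard library only, the column names (such as smoking_history and blood_glucose_level) and the four sample rows are hypothetical, and min-max scaling is assumed as the normalization method because it maps values into the range 0 to 1.

```python
TARGET = "diabetes"
GENDER_CODES = {"Male": 0, "Female": 1}


def preprocess(rows):
    """Drop null-gender rows, encode gender, drop smoking history, min-max scale."""
    # Keep only instances whose gender value is present.
    kept = [dict(r) for r in rows if r.get("gender") is not None]
    if not kept:
        return kept
    for r in kept:
        r["gender"] = GENDER_CODES[r["gender"]]  # 'Male' -> 0, 'Female' -> 1
        r.pop("smoking_history", None)           # feature removed entirely
    # Scale every remaining feature (not the target) to the range [0, 1].
    feature_names = [name for name in kept[0] if name != TARGET]
    for name in feature_names:
        values = [r[name] for r in kept]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in kept:
            r[name] = (r[name] - lo) / span if span else 0.0
    return kept


# Hypothetical rows for illustration; these are not records from the dataset.
sample = [
    {"gender": "Male", "age": 40.0, "hypertension": 0, "heart_disease": 0,
     "smoking_history": "not sure", "bmi": 20.0, "HbA1c_level": 5.0,
     "blood_glucose_level": 100, "diabetes": 0},
    {"gender": "Female", "age": 60.0, "hypertension": 1, "heart_disease": 0,
     "smoking_history": "Smoking", "bmi": 30.0, "HbA1c_level": 7.0,
     "blood_glucose_level": 200, "diabetes": 1},
    {"gender": None, "age": 55.0, "hypertension": 0, "heart_disease": 1,
     "smoking_history": "Not Smoking", "bmi": 27.0, "HbA1c_level": 6.5,
     "blood_glucose_level": 160, "diabetes": 0},
    {"gender": "Female", "age": 50.0, "hypertension": 0, "heart_disease": 0,
     "smoking_history": "Not Smoking", "bmi": 25.0, "HbA1c_level": 6.0,
     "blood_glucose_level": 150, "diabetes": 0},
]

cleaned = preprocess(sample)
for row in cleaned:
    print(row)
```

A feature whose values are all identical has no range to scale, so the sketch maps it to 0.0 rather than dividing by zero.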

3.3. Training Support Vector Machines with RBF Kernel

After experimenting with the gamma and C parameters, a gamma value of ‘scale’ and a C value of 1 were selected as the parameters for the SVM with RBF kernel. The dataset of 34,844 instances was split, using 80% for training and 20% for testing. After training the model, an impressive accuracy of 1.0 (100%) was recorded, and the evaluation of the result using a confusion matrix is shown in Figure 1 below.

Fig. 1: Output Results of Diabetes Prediction
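The training and evaluation procedure of this section can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the code used in this research: the data are a synthetic stand-in generated with make_classification (7 features, 1000 instances, imbalanced classes), and the random seed and stratified split are choices made for the example. Only the kernel, C = 1, gamma = ‘scale’, the normalization to the range 0 to 1 and the 80/20 split follow the text, so the accuracy printed here says nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed dataset (assumption, not real data).
X, y = make_classification(
    n_samples=1000,
    n_features=7,
    n_informative=5,
    n_redundant=0,
    weights=[0.85, 0.15],
    random_state=0,
)

# Normalize feature values to the range [0, 1], as described in Section 3.2.
X = MinMaxScaler().fit_transform(X)

# 80% of the instances for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# SVM with RBF kernel, using the parameter values selected in the text.
model = SVC(kernel="rbf", C=1, gamma="scale")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

print("Accuracy:", accuracy)
print("Confusion matrix:")
print(cm)
```

In scikit-learn, gamma="scale" sets gamma to 1 / (n_features * X.var()), so the value adapts to the spread of the training data instead of being fixed by hand.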

To check for any indication of overfitting, the program for predicting diabetes in this research was run several times with an increasing number of dataset instances, and each run was evaluated using a confusion matrix. The first sample contained 1000 instances, and each subsequent sample was increased by 1000. The last sample used to test this model contained 34,000 instances, and none of the results deviated from the accuracy of 100%. The results of the experiment are shown in Table 1, and the variables used as column headings are:

i) n is the number of instances
ii) Accuracy is the prediction percentage
iii) TP and TN are the true positive and true negative counts from the confusion matrix (CM).

Table 1: Confusion Matrix Results for Some Diabetes Samples

n	Accuracy	TP	TN
1000	100%	174	26
2000	100%	359	41
3000	100%	536	64
4000	100%	722	78
5000	100%	903	97
6000	100%	1106	94
7000	100%	1255	145
8000	100%	1453	147
9000	100%	1640	160
10000	100%	1846	154
11000	100%	2009	191
12000	100%	2202	198
13000	100%	2367	233
14000	100%	2555	245
15000	100%	2739	291
16000	100%	2890	310
17000	100%	3084	316
18000	100%	3294	306
19000	100%	3476	324
20000	100%	3651	349
21000	100%	3817	383
22000	100%	3993	407
23000	100%	4158	441
24000	100%	4389	410
25000	100%	4582	417
26000	100%	4713	486
27000	100%	4928	471
28000	100%	5092	507
29000	100%	5263	536
30000	100%	5441	558
31000	100%	5625	574
32000	100%	5813	586
33000	100%	5974	625
34000	100%	6182	617
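The accuracy column in Table 1 follows from the confusion matrix counts as (TP + TN) / (TP + TN + FP + FN). The sketch below applies this formula to four rows of the table. Since the table lists only TP and TN, the sketch assumes that FP and FN are zero for those rows, which is what an accuracy of 100% implies; the function name and the choice of rows are made for illustration.

```python
def accuracy_from_counts(tp, tn, fp=0, fn=0):
    """Accuracy = correctly classified instances / all evaluated instances."""
    total = tp + tn + fp + fn
    return (tp + tn) / total


# (n, TP, TN) copied from four rows of Table 1; FP and FN assumed to be zero.
table_rows = [
    (1000, 174, 26),
    (2000, 359, 41),
    (10000, 1846, 154),
    (20000, 3651, 349),
]

accuracies = []
evaluated_sizes = []
for n, tp, tn in table_rows:
    acc = accuracy_from_counts(tp, tn)
    accuracies.append(acc)
    evaluated_sizes.append(tp + tn)  # number of instances in the test split
    print(f"n={n:>5}  evaluated={tp + tn:>4}  accuracy={acc:.0%}")

# A single misclassified instance is enough to pull the accuracy below 100%.
print(accuracy_from_counts(174, 25, fp=1))
```

For these rows, TP + TN equals 20% of n, which matches the 80/20 split described in Section 3.3.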

4. Results and Discussion

The results shown in Figure 1 and Table 1 indicate that it is possible to predict diabetes with an accuracy of 100% using SVM with an RBF kernel as the trained machine learning model. To check that the model did not overfit, the confusion matrix was used as a metric to test the different sample sizes of the dataset shown in Table 1, and all the results recorded 100% accuracy. With these impressive results, this model can be conveniently deployed in hospitals and healthcare centres to help medical practitioners predict diabetes at an early stage, before it becomes full-blown and causes serious health challenges to patients.

5. Conclusion

Diabetes is a chronic metabolic condition which, if left untreated, can cause enormous damage to the patient's body organs. Healthcare practitioners rely on machine learning to predict the disease early in patients, so that early treatment can commence and avert the damage the disease can cause. Several researchers have used different machine learning models to solve the problem but still recorded prediction accuracies below 100%. In this research, a Support Vector Machine with a Radial Basis Function (RBF) kernel was trained on the diabetes dataset to predict diabetes from a set of features. When the trained model was tested, it recorded an accuracy of 100%. With this impressive prediction accuracy, the model can be deployed in hospitals and healthcare centres to assist healthcare practitioners in predicting diabetes early in patients, so that the health complications associated with the full-blown diabetes stage can be averted. In further research, there are plans to adopt the Support Vector Machine with a Radial Basis Function (RBF) kernel to predict other related diseases and see whether their accuracies can be improved.

References

Amari S. & Wu S. (1999). Improving Support Vector Machine Classifiers by Modifying Kernel Functions, Neural Networks, (12): 783-789.
Alam S., Ferdous J. & Neera N. S. (2024). Enhancing Diabetes Prediction: An Improved Boosting Algorithm for Diabetes Prediction, International Journal of Advanced Computer Science and Applications, 15(5): 1273-1286.
Anggoro D. A. & Permatasari D. (2023). Performance Comparison of the Kernels of Support Vector Machine Algorithm for Diabetes Mellitus Classification, International Journal of Advanced Computer Science and Applications, 14(2): 214-219.
Anuja V. A. & Chitra R. (2013). Classification of Diabetes Using Support Vector Machine, International Journal of Engineering Research and Application, 3(2): 1797-1801.
Asaad R. R. (2022). Support Vector Machine Classification Learning Algorithm for Diabetes Prediction, International Research Journal of Science, Technology, Education and Management, 2(2): 26-34.
Asfaw T. A. (2019). Prediction of Diabetes Mellitus using Machine Learning Techniques, International Journal of Computer Applications, 1(1): 31-38.
Asli G. & Adnan K. (2023). Performance Comparison Machine Algorithms in Diabetes Disease Prediction, European Mechanical Science, 7(3): 178-183.
Dewi C., Zendrato J. & Christanto H. J. (2024). Improvement of Support Vector Machine for Predicting Diabetes Mellitus with Machine Learning Approach, Journal of Autonomous Intelligence, 7(2): 1-12.
Ehtisham F. & Jameel A. (2020). Disease Prediction System using Support Vector Machine and Multilinear Regression, International Journal of Innovative Research in Computer Science & Technology (IJIRCST), 8(4): 331-336.
Ellahham S. (2020). Artificial Intelligence: The Future for Diabetes Care, The American Journal of Medicine, 133(8): 895-900.
Hang O. Y., Wiwied V. & Rosaida R. (2024). Diabetes Prediction using Machine Learning Ensemble Model, Journal of Advanced Research in Applied Sciences and Engineering Technology, 37(1): 82-98.
Heet P. & Briskilal J. (2022). Prediction of Diabetes using Machine Learning Algorithm, Journal of Current Research in Engineering and Science, 5(1): 1-6.
Gandhi P. & Pandi G. S. (2022). International Journal of Research Publication and Reviews, 3(2): 77-82.
Jiby T. C. (2021). A Study on Various Machine Learning Classification Algorithms for Diabetes Prediction, International Journal of Engineering Research and Technology, 10(8): 425-427.
Joshi T. N. & Chawan P. M. (2018). Diabetes Prediction using Machine Learning Techniques, International Journal of Engineering Research and Application, 8(1): 9-13.
Kayalvizhi M. & Maheswari D. (2020). A Hybrid Deep Learning Algorithms for Diabetes Mellitus Prediction using Thermal Foot Images, European Journal of Molecular and Clinical Medicine, 7(11): 5176-5183.
Krishnenedhu J., Arnesh G., Harish B. & Vidhya K. (2020). Diabetes Prediction Using SVM and Logistic Regression, International Journal of Research in Engineering, Science and Management, 3(2): 327-329.
Massari H. E., Gherabi N., Qanouni F. & Mhammedi S. (2024). Diabetes Prediction using Machine Learning with Feature Engineering and Hyperparameter Tuning, International Journal of Advanced Computer Science and Applications, 15(8): 171-179.
Mishra P., Sharma A. & Badholi A. (2020). Predictive Modelling and Analytics for Diabetes using a Machine Learning Approach, International Journal of Engineering Research in Computer Science and Engineering, 7(10): 9-18.
Nar P. & Palkar B. (2021). Diabetes Prediction using Machine Learning Technique, International Journal of Computer Applications, 183(14): 34-37.
Ozge N. E. & Hamza O. I. (2021). Early Stage Diabetes Prediction Using Machine Learning Methods, European Journal of Science and Technology, (29): 52-57.
Pappula P., Kotha S. R., Arukonda M., Akuthota S. & Karingula R. R. (2024). Establishing a Diabetes Prediction Decision Support System with Machine Learning as its Foundation, African Journal of Biological Sciences, 6(7): 1257-1262.
Praneeth D. H. (2019). An Overview on Support Vector Machine (SVM) and Classification using Intersection Kernel Support, International Journal of Management and Engineering, 9(1): 3604-3611.
Patel K., Nair M. & Phansekar S. (2021). Diabetes Prediction using Machine Learning, International Journal of Scientific and Engineering Research, 12(3): 63-66.
Ramachandro M. & Bhramaramba R. (2019). Classification of Gene Expression Data Set using Support Vector Machine with RBF Kernel, International Journal of Recent Technology and Engineering, 8(2): 2907-2913.
Rani K. J. (2020). Diabetes Prediction using Machine Learning, International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 6(4): 294-305.
Reddy A. Y., Vandana K., Pranalika T., Bhavani Y. & Nandini T. S. (2023). Diabetes Prediction Using Extreme Learning Machine: Application of Health Systems, Journal of Cardiovascular Disease Research, 14(7): 2241-2255.
Shamriz N. & Mete Y. (2021). Diabetes Prediction Using Machine Learning Classification Algorithms, European Journal of Science and Technology, (24): 53-59.
Shruthi U., Talari N. K., Allingaram A., Allaparthi C., Adavi D., Kumar Y. & Kusetty R. (2021). Diabetes Prediction using Machine Learning Technique, International Research Journal of Modernization in Engineering Technology and Science, 3(5): 2930-2936.
Sistla S. (2022). Predicting Diabetes using SVM implemented by Machine Learning, International Journal of Soft Computing and Engineering, 12(2): 16-18.
Soliman O. S. & AboElhamd E. (2014). Classification of Diabetes Mellitus using Modified Particle Swarm Optimization and Least Squares Support Vector Machine, International Journal of Computer Trends and Technology (IJCTT), 8(1): 38-44.
Srividhya N., Divya K., Sanjana N., Krishna K. & Rambhupal M. (2023). Diabetes Prediction Using Support Vector Machine, International Journal of Multidisciplinary Research, 9(10): 421-426.
Suryadevara T. & Balaji V. (2024). Prediction of Diabetes using Machine Learning Techniques, International Journal of Research Publication and Reviews, 5(8): 4191-1496.
Vaishali & Pandey N. (2018). Diabetes Prediction using Linear Regression, Decision Tree & Least Square Support Vector Machine, International Journal of Innovative Research in Computer and Communication Engineering, 6(4): 3756-3763.
Varun J., Anji N. & Tarun P. (2021). A Review on Current Advances in Machine Learning Based Diabetes Prediction, Primary Care Diabetes, 15(2021): 435-443.
Verma K. & Harshavardhanan P. (2024). Type-2 Diabetes Prediction using Machine Learning Algorithms and Ensembles with Hyperparameters, Journal of Metabolic Disorders diabetes, 2(105): 1-11.
Verma P., Kaur I. & Kaur J. (2017). Novel Approach of Diabetes Disease Classification by Support Vector Machine with RBF Kernel, International Journal of Advance Research, Ideas and Innovations in Technology, 3(1): 276-280.
Virgolici O. & Virgolici B. (2024). Diabetes Prediction using Machine Learning Techniques: A Brief Overview, Diabetes & its Complications, 8(1): 1-9.
Waberi A. D., Mwangi R. R. & Rimiru R. M. (2024). Advancing Type II Diabetes Predictions with a Hybrid LSTM-XGBoost Approach, Journal of Data Analysis and Processing, (12): 163-188.