particularly valuable in situations where false
negatives are considered costly or undesirable. For
Gradientboosting, SVM and RandomForest, the recall
values are 0.75, 0.73 and 0.74, respectively. KNN and
LogisticRegression perform poorly on both of these
metrics, with the lowest scores.
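As an illustration of how a recall value such as those above is obtained, the following minimal sketch counts the confusion-matrix cells for a binary classifier and derives recall from them. The helper names and the toy labels are hypothetical and are not taken from the dataset used in this study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(y_true, y_pred):
    """Recall = TP / (TP + FN): the share of actual positives that were found."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical toy labels, for illustration only (1 = diabetic, 0 = non-diabetic).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
print(recall(y_true, y_pred))            # 0.75: 3 of the 4 positives are detected
```

Each false negative lowers recall directly, which is why the metric matters when a missed diagnosis is costly.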
For a more comprehensive evaluation, the F1 score is
introduced to strike a balance between precision and
recall. It is the harmonic mean of the two and lies
between 0 and 1; the closer the F1 score is to 1, the
better the model (Eusebi 2013). For
Gradientboosting, SVM and RandomForest, the F1 score
is 0.75, 0.73 and 0.74, respectively. These three
models therefore give the most accurate predictions
on the dataset, and they are taken forward for
further evaluation and comparison.
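The balance that the F1 score strikes can be sketched as follows. The function name and the input values are illustrative assumptions; they are not the precision figures measured in this study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative inputs only (hypothetical, not measured values).
balanced = f1_score(0.75, 0.75)    # equal precision and recall
lopsided = f1_score(0.90, 0.10)    # high precision, very low recall

print(balanced)  # 0.75: F1 equals precision and recall when they coincide
print(round(lopsided, 2))
```

Because the harmonic mean is dominated by the smaller of the two values, a model cannot reach a high F1 score by maximizing precision alone or recall alone.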
The effectiveness of a classification model
can also be evaluated with the assessment statistic
known as AUC (Area Under the ROC Curve). It
measures the model's ability to distinguish between
positive and negative classes across various decision
thresholds, and thus complements the four metrics above,
each of which is computed at a single threshold. An ideal
model has an AUC
value of 1.0, while a non-discriminating model has a
value of 0.5 (Eusebi 2013). For Gradientboosting, the
AUC is 0.81, much higher than SVM at 0.71 and
RandomForest at 0.72. From the above results, it can
be concluded that Gradientboosting is the best of the
tested classifiers for diagnosing diabetes.
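The interpretation of AUC as a threshold-free measure can be made concrete with a short sketch. It uses the rank-based formulation, in which AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. This is an assumption made for illustration and is not necessarily how the values reported above were computed; the function name and the toy scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

print(round(roc_auc(y_true, scores), 3))    # 8 of 9 pairs ranked correctly
print(roc_auc([1, 0], [0.9, 0.1]))          # 1.0: perfect separation
print(roc_auc([1, 0, 1, 0], [0.5] * 4))     # 0.5: no discrimination
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a small dataset; a sort-based implementation would be preferable for large ones.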
Additionally, for each model, the metrics
differ from one another by only about 1 percentage point,
indicating that each model behaves consistently across metrics.
This suggests that the models are well balanced
and reliable in their classifications. However, almost all
metric values are around 73% with small variance,
suggesting that Gradientboosting brings only a minor
improvement rather than a fundamental performance
gain over the other machine learning models.
5 CONCLUSION
Through evaluation and comparison, the following
conclusions can be drawn from the results of each model.
Gradientboosting gives the highest accuracy at 0.75,
while KNN has the lowest accuracy at 0.71. After
comprehensive comparison, it is clear that
Gradientboosting performs best, but it does not
significantly outperform the other models on the given dataset.
The contribution of this research is to examine and
improve the performance of traditional machine learning
algorithms on disease diagnosis. Generally speaking,
machine learning models are able to handle large
datasets efficiently and make predictions automatically.
However, the models studied here are not yet reliable enough
to be put into practice: the available samples are limited,
the models are relatively simple, and the
prediction accuracy is not high enough. In the future,
the author will apply practical machine learning
techniques such as model blending, or deep learning
algorithms, to improve the model. In addition, abundant data
and samples will be collected and used for model
training. After appropriate improvements, the model
could be applied to the prevention and treatment of diabetes.
It could not only predict the risk of heart disease based on
clinical indicators but also account for
individual differences and help draw up optimal
treatment plans.
REFERENCES
E.A.M. Gale and K.M. Gillespie, “Diabetes and gender,”
Diabetologia, pp.3-15, 2001.
G. Roglic, “WHO Global report on diabetes: A summary,”
International Journal of Noncommunicable Disease,
vol.1, pp.3-8, 2016.
Q. Zou, K. Qu, Y. Luo, D. Yin, Y. Ju and H. Tang, “Predicting
Diabetes Mellitus With Machine Learning
Techniques,” Front. Genet., vol.9, pp.1-10, 2018.
Y.A. Christobel and P. Sivaprakasam, “A New Classwise k
Nearest Neighbor (CKNN) Method for the
Classification of Diabetes Dataset,” International
Journal of Engineering and Advanced Technology
(IJEAT), vol.2, pp.396-400, 2013.
V.A. Kumari and R. Chitra, “Classification Of Diabetes
Disease Using Support Vector Machine,”
International Journal of Engineering Research and
Applications, vol.3, pp.1797-1801, 2013.
S. Hina, A. Shaikh and S.A. Satter, “Analyzing Diabetes
Datasets using Data Mining,” Journal of Basic &
Applied Sciences, vol.13, pp.466-471, 2017.
F.Y. Osisanwo, J.E.T. Akinsola, O. Awodele, J.O.
Hinmikaiye, O. Olakanmi and J. Akinjobi,
“Supervised Machine Learning Algorithms:
Classification and Comparison,” International
Journal of Computer Trends and
Technology (IJCTT), vol.48, pp.128-138, 2017.
M. Nilashi, O. Ibrahim, M. Dalvi, H. Ahmadi and
L. Shahmoradi, “Accuracy Improvement for Diabetes
Disease Classification: A Case on a Public Medical
Dataset,” Fuzzy Information and Engineering,
pp.345-357, 2017.
A. Mujumdar and V. Vaidehi, “Diabetes Prediction using
Machine Learning Algorithms,” International
Conference on Recent Trends in Advanced
Computing (ICRTAC), vol.165, pp.292-299, 2019.
P. Eusebi, “Diagnostic Accuracy Measures,”
Cerebrovascular Disease, pp.267-272, 2013.