Diabetes Prediction Using Machine Learning Algorithms
R. Sathishkumar and G. Anitha
Department of Computer Applications Karpagam Academy of Higher Education, Coimbatore, India
Keywords: Machine Learning, Support Vector Machines, KNN, Boosting Algorithm.
Abstract: Diabetes is a prevalent and life-threatening condition with severe implications, including heart attacks,
blindness, and neuropathy. The study aims to predict diabetes and its risk factors using machine learning
algorithms, specifically Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Logistic Regression,
and Boosting Algorithms. The research employs a diverse dataset with 768 cases, emphasizing data
preprocessing for improved accuracy. The results show that SVM performs exceptionally well in predicting
diabetes cases due to its ability to create a hyperplane, making it an effective supervised machine learning
algorithm. KNN identifies similarities between data points for classification. Logistic Regression is suitable
for supervised binary classification problems. Boosting algorithms collaborate to improve predictive accuracy,
resembling teamwork. Moreover, Random Forest, a bagging ensemble technique, also exhibits high accuracy.
Machine learning holds the potential to significantly enhance diabetes risk prediction and facilitate early
intervention. Accurate predictions are vital for effective diabetes management and informed clinical decisions.
1 INTRODUCTION
Diabetes stands as one of the deadliest conditions in the world. It is not just a disease but also a
precursor to various other serious ailments such as
heart attacks, blindness, and neuropathies. The typical
routine involves patients needing to visit medical
centers regularly, consult their healthcare providers,
and adhere closely to a strict regimen for their well-
being. Moreover, each time they need medical
attention, they must navigate through bureaucratic
hurdles, often leading to frustration.
Diabetes Mellitus (DM) is defined as a group of
metabolic disorders primarily caused by irregularities
in insulin production and function. Insufficient
insulin leads to elevated blood glucose levels
(hyperglycemia) and impaired absorption of
carbohydrates, fats, and proteins (Siper et al., 2005).
DM ranks among the most prevalent endocrine
disorders globally, affecting over 200 million
individuals. The incidence of diabetes is projected to
escalate significantly in the coming years. Clinically, there are two main classifications: Type 1 Diabetes (T1D) and Type 2 Diabetes (T2D), each
characterized by distinct pathologies. T2D, which
appears to be the most common form, is
predominantly associated with insulin resistance.
Major risk factors for T2D include lifestyle choices,
physical inactivity, dietary habits, and genetic
predisposition. On the other hand, T1D is believed to
result from autoimmune destruction of pancreatic β
cells within the Langerhans islets. T1D affects
approximately 10% of all diabetic cases worldwide,
with about 10% of them eventually developing
idiopathic diabetes. Other forms of DM, based on
insulin secretion patterns and/or etiology, include
Gestational Diabetes.
2 LITERATURE REVIEW
In this section, we discuss the classification and prediction techniques aimed at predicting diabetes in healthcare. In particular, the significance of Bluetooth Low Energy (BLE)-based sensors and machine learning protocols is emphasized for personalized diabetes mellitus care in healthcare delivery, offering healthcare authorities the flexibility to explore and identify therapeutic insights (Siper et al., 2005).
In this realm, there is substantial ongoing effort
directly tied to the prevailing landscape. Artificial
Neural Networks (ANNs) have been adapted to
achieve the highest accuracy in prediction within the
medical field. The robust multilayer perceptron (MLP) of ANNs is employed to forecast emotional
distress. The outcomes are interconnected within the
same domain and prove to be superior. The
integration of diabetes patient data with ANN,
Decision Trees, Support Vector Machines (SVM),
and Naive Bayes (NB) forms a hybrid approach that
challenges existing methods, yielding significant
results (Devi and Shyla, 2016; Berry and Linoff, 1997; Witten et al., 2016; Giri et al., 2013).
Figure 1: Various Notations: Data Flow diagram.
3 METHODOLOGY
The purpose of this study is to review tabular patient data for early signs of diabetes, aiming to predict the condition and promote healthier lifestyles. We conducted tests using various classification methodologies and ensemble approaches to forecast diabetes. Subsequently, we successfully concluded this phase.
Data Collection Explanation: The initial step
involves gathering information from a repository. The
dataset comprises 768 cases with multiple attributes.
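To make this step concrete, a minimal loading sketch in Python is shown below; the file name diabetes.csv, the Pima Indians Diabetes layout of 768 rows with eight predictor attributes, and the column name Outcome are assumptions, since the paper does not state them.

import pandas as pd

# Load the repository of 768 cases into a DataFrame (file name is assumed).
df = pd.read_csv("diabetes.csv")
print(df.shape)                        # expected: (768, 9) for a Pima-style layout
print(df["Outcome"].value_counts())    # distribution of diabetic vs. non-diabetic cases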
Data Pre-processing: Data pre-processing is a
crucial procedure. It involves handling missing data
and other inconsistencies that could affect the
accuracy of the information. By optimizing this
process, we ensure better quality and efficiency in
subsequent analyses. Proper pre-processing is
essential for accurate results and successful
prediction using machine learning algorithms.
Zero-Value Removal: We eliminate entries containing zero values in fields where zero is not a plausible measurement, as these could skew the analysis. This step, called zero-value removal, streamlines the dataset, facilitating more efficient analysis while retaining valuable information.
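A hedged sketch of this step, continuing the loading sketch above, is given below; the list of columns in which zeros are treated as invalid is an assumption based on the Pima-style attributes.

import numpy as np

# Columns where a zero is physiologically implausible (assumed list).
zero_invalid_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Mark the zeros as missing, then eliminate the affected entries.
df[zero_invalid_cols] = df[zero_invalid_cols].replace(0, np.nan)
df = df.dropna()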
Standardization: Standardization involves scaling
all features to a similar range. This ensures that all
characteristics are measured on comparable scales,
aiding in fair comparisons and accurate predictions.
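The following sketch illustrates standardization with scikit-learn's StandardScaler, continuing the preprocessing sketches above; the 80/20 split and random seed are assumptions.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns="Outcome")
y = df["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale every feature to zero mean and unit variance, fitting only on the training split.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)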
Feature Engineering: Once the data is prepared,
we employ various techniques to extract meaningful
features. This involves utilizing different methods
and collaborative approaches tailored to the dataset.
By dissecting these techniques, we can assess their
effectiveness and identify key factors contributing to
accurate predictions.
Support Vector Machine (SVM): One such
method is the Support Vector Machine (SVM), a
prevalent algorithm for classification and regression
tasks. SVM constructs a hyperplane that separates
data points into distinct categories. It can also classify
new data based on learned patterns from the training
set. Fine-tuning the hyperplane's parameters allows
for precise separation, enhancing the algorithm's
predictive capabilities.
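A minimal sketch of fitting such an SVM classifier on the prepared data is shown below; the RBF kernel and C=1.0 are assumed settings, not parameters reported in this paper.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Train an SVM that learns a separating hyperplane in the (kernel-induced) feature space.
svm_clf = SVC(kernel="rbf", C=1.0)
svm_clf.fit(X_train, y_train)
print("SVM accuracy:", accuracy_score(y_test, svm_clf.predict(X_test)))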
By following these steps and employing
sophisticated methodologies like SVM, we aim to
improve diabetes prediction accuracy and contribute
to promoting healthier lifestyles.
4 DATA FLOW DIAGRAMS
Data flow diagrams are essential tools for depicting the flow of information within a business information system. Two primary types of diagrams are commonly utilized: logical and physical data flow diagrams.
The purpose of these diagrams is to illustrate how
information flows through an organization to achieve
specific business objectives. Logical diagrams depict
the conceptual flow of information, while physical
diagrams represent the actual implementation of the
information flow within the organization. Sometimes referred to as data flow graphs or bubble charts,
these diagrams reveal how data enters and exits the
system, the processes involved in handling the data,
and where data is stored.
In the realm of machine learning, once information has been prepared, it is passed on to the modeling stage. Various classification and ensemble methods are employed to forecast outcomes such as diabetes. These methods rely heavily on established patterns in the data. The key to effective modeling lies in dissecting these patterns, verifying their accuracy, and identifying the significant factors that contribute to prediction.
One approach relies on supervised learning algorithms such as Support Vector Machines (SVM). SVM is a prevalent method in supervised learning, capable of generating a hyperplane that separates two classes or predicts values for regression. It discerns properties within specific
datasets and can categorize data based on the
information provided. Separation is achieved by
optimizing a hyperplane that maximizes the margin
between classes or regression points.
5 MACHINE LEARNING
ALGORITHMS OVERVIEW
Support Vector Machine (SVM): SVM, short for
Support Vector Machine, stands out as one of the
most ubiquitous algorithms in machine learning.
Operating primarily within supervised learning, SVM
tackles both Classification and Regression challenges, with a predominant focus on solving Classification problems. Its core objective is to delineate the feature space with an optimal boundary so that new data points can be placed correctly in future predictions.
SVM identifies pivotal points or vectors to
construct hyperplanes, often referred to as Support
Vectors, thereby earning its moniker. This supervised
machine learning technique excels in generating
hyperplanes that effectively segregate data points into
distinct classes or categories. Beyond mere
classification, SVM also facilitates regression tasks,
adept at discerning patterns within datasets and
accurately categorizing new observations based on the provided data. Segregation is achieved through carefully delineated hyperplanes that divide data clusters while maximizing the margin to the nearest support vectors.
K-Nearest Neighbors (KNN): K-Nearest
Neighbors (KNN) emerges as one of the simplest yet
effective machine learning algorithms, rooted in the
paradigm of Supervised Learning. Operating on the
premise of similarity measurement, KNN compares new data points with the existing training data. Its classification
methodology not only enhances data accuracy but
also fosters unique insights into diverse datasets.
Figure 2: Governing function for Logistic Regression.
Consider an instance where a creature exhibits
traits reminiscent of both feline and canine attributes,
yet its classification as either a cat or a dog remains
uncertain. In such scenarios, the KNN algorithm
meticulously evaluates the shared features between
the queried data point and the existing dataset,
ultimately delivering a classification based on the
closest resemblance, whether it be feline or canine.
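As an illustrative sketch of this neighbour-based classification on the prepared data, assuming k=5 and the Euclidean distance (neither is specified in the paper):

from sklearn.neighbors import KNeighborsClassifier

# Classify each new point by the majority label among its 5 nearest neighbours.
knn_clf = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn_clf.fit(X_train, y_train)
print("KNN accuracy:", knn_clf.score(X_test, y_test))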
Logistic Regression: Delving into the
fundamentals and applications of Logistic
Regression, this supervised learning algorithm stands
as a pivotal component in the machine learning
landscape. In essence, Logistic Regression represents
a type of supervised classification methodology.
Within the confines of a classification problem, the
dependent variable (outcome), denoted as 'y',
becomes contingent upon distinct features (or inputs),
denoted as 'X'. Contrary to common misconception, Logistic Regression operates as a classification model, utilizing a regression technique to forecast the probability of an observed data point belonging to a particular category.
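A brief sketch of this probabilistic view follows, using scikit-learn defaults (the max_iter value is only an assumption to ensure convergence on the scaled data):

from sklearn.linear_model import LogisticRegression

log_clf = LogisticRegression(max_iter=1000)
log_clf.fit(X_train, y_train)
probs = log_clf.predict_proba(X_test)[:, 1]   # estimated probability of diabetes per case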
In summary, these machine learning algorithms (SVM, KNN, and Logistic Regression) embody
diverse methodologies, each tailored to address
specific challenges within the machine learning
domain. From delineating intricate data spaces to
discerning nuanced patterns and probabilities, these
algorithms collectively enrich the spectrum of
supervised learning techniques, propelling
advancements across various industries and
applications.
Figure 3: Random Forest Prediction Diagram.
Logistic regression becomes a binary classifier once a decision boundary (threshold) is introduced into the equation. The determination of the threshold
value is a crucial aspect of logistic regression and
depends on the context of the problem at hand.
This discussion delves into the basics of logistic
regression and its application, which falls under the
category of supervised learning techniques. In a
binary classification problem, the target variable
(outcome), y, can only take two values based on
certain features (or predictors), X. Contrary to
common belief, logistic regression is a classification
model. It utilizes a regression model to predict the
probability of an observed data point belonging to a
certain class.
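The sketch below applies such a threshold to the probabilities produced by the logistic regression sketch above; 0.5 is the conventional default and 0.3 is a purely illustrative lower cut-off.

# Convert probabilities into binary predictions at two different thresholds.
default_preds = (probs >= 0.5).astype(int)        # standard cut-off
recall_biased_preds = (probs >= 0.3).astype(int)  # lower cut-off flags more potential cases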
Low Precision/High Recall: In scenarios where minimizing false negatives is crucial, without necessarily reducing false positives, we choose a decision threshold with a low value of precision or a high value of recall. For example, in a medical diagnosis scenario, we want any potentially affected case to be flagged without too much concern for falsely identifying healthy cases as affected, as the consequence of a missed diagnosis can be severe.
High Precision/Low Recall: In contexts where reducing false positives is imperative, without necessarily reducing false negatives, we opt for a decision threshold with a high value of precision or a low value of recall. For instance, if we are evaluating responses from customers regarding a particular product, we must ensure that the responses identified as positive really are positive, as acting on a wrongly flagged response might lead to a loss in potential sales.
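The trade-off described in the two cases above can be inspected numerically, as in the following sketch built on the logistic regression probabilities:

from sklearn.metrics import precision_score, recall_score

for thr in (0.3, 0.5, 0.7):
    preds = (probs >= thr).astype(int)
    print(f"threshold={thr:.1f}  "
          f"precision={precision_score(y_test, preds):.2f}  "
          f"recall={recall_score(y_test, preds):.2f}")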
Figure 4: Flow chart of the proposed method.
5.1 Boosting Algorithms
Boosting is all about “teamwork”. Each model that is trained provides information about which examples the upcoming models should focus on. By correcting the errors of its predecessors in sequence, the ensemble turns a set of weak learners into a strong predictive model, which is why the technique is known as boosting.
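A minimal sketch of this idea with AdaBoost, one common boosting algorithm, is shown below; the choice of AdaBoost and 100 estimators is an assumption, as the paper does not name a specific booster.

from sklearn.ensemble import AdaBoostClassifier

# Each successive weak learner focuses on the examples its predecessors misclassified.
boost_clf = AdaBoostClassifier(n_estimators=100, random_state=42)
boost_clf.fit(X_train, y_train)
print("Boosting accuracy:", boost_clf.score(X_test, y_test))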
5.2 Bootstrap Aggregation (Bagging)
Bootstrap aggregation draws random samples, with replacement, from the original data, which helps in understanding the bias and variance inherent in conventional models. It is a systematic approach that can be used to reduce variance in various learning processes, often leading to more robust outcomes. By incorporating many resampled views of the original data, it provides a comprehensive perspective.
Random Forest, as a supervised learning procedure, is an ensemble method aimed at classification and regression. It amalgamates predictions from multiple decision trees: a collection of trees is assembled during training and their aggregated output is used for prediction or classification tasks.
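A short sketch of this bagging-style ensemble with a Random Forest follows; 200 trees is an assumed setting, not a value reported in the paper.

from sklearn.ensemble import RandomForestClassifier

# Many decision trees are trained on bootstrap samples and their votes are aggregated.
rf_clf = RandomForestClassifier(n_estimators=200, random_state=42)
rf_clf.fit(X_train, y_train)
print("Random Forest accuracy:", rf_clf.score(X_test, y_test))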
5.3 Implementation
This work adopted several distinct methodologies. The proposed methods utilized diverse classifiers alongside ensemble techniques, implemented in the Python programming language. These approaches are standard Machine Learning techniques employed to enhance the accuracy of information processing. Notably, the Random Forest classifier outperforms the other methods. Overall, we employed sophisticated Machine Learning techniques for prediction, achieving high precision. The figure illustrates the outcomes of these Machine Learning methodologies. Furthermore, the significant features influencing the prediction are highlighted for the Random Forest algorithm. The cumulative importance of each influential feature for diabetes prediction is depicted, with the X-axis representing the significance of each feature and the Y-axis displaying their respective names.
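A sketch of producing such a feature-importance view from the Random Forest fitted above is given below; it does not reproduce the exact figure in the paper, and the feature names come from the assumed Pima-style dataset.

import matplotlib.pyplot as plt
import pandas as pd

importances = pd.Series(rf_clf.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh")          # importance on the X-axis, feature names on the Y-axis
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.tight_layout()
plt.show()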
5.4 Data Flow Diagram
The data flow diagram illustrates the logical flow of information within an organization, serving specific industry purposes. It delineates how data circulates through the system, exchanging details and storing information. The aim of a Data Flow Diagram is to depict both the capabilities and limitations of an organization's data flow. Also referred to as a data flow graph or bubble chart, it visualizes the movement of data within the system.
6 CONCLUSIONS
Machine learning possesses the remarkable capability to refine the prognosis of diabetes risk through advanced computational methodologies and the vastness of extensive epidemiological and hereditary diabetes risk datasets. Detecting diabetes in its early stages remains crucial for effective management. This work has elucidated a machine learning approach to predicting diabetes scenarios. This methodology could also aid clinicians in developing a precise and efficient tool that translates into practical use in the clinic, assisting them in making better-informed decisions regarding patient care.
REFERENCES
Siper, A., Farley, R., & Lombardo, C. (2005). Machine
Learning and Data Mining Methods in Diabetes
Research. In Proceedings of Student/Faculty Research
Day, CSIS, Pace University, May 6th.
Devi, M. R., & Shyla, J. M. (2016). Analysis of Various Data Mining Techniques to Predict Diabetes Mellitus. International Journal of Applied Engineering Research, 11(1), 727-730.
Berry, M., & Linoff, G. (1997). Data mining techniques:
For marketing, sales, and customer support. John Wiley
& Sons, Inc.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
Giri, D., Sengottuvelan, M., & Rajaram, P. R. (2013). Automated diagnosis of coronary artery disease affected patients using LDA, PCA, ICA and discrete wavelet transform. Knowledge-Based Systems, 37, 274-282.