Dataset Balancing in Disease Prediction
Vincenza Carchiolo (https://orcid.org/0000-0002-1671-840X) and Michele Malgeri (https://orcid.org/0000-0002-9279-3129)
Dip. Ingegneria Elettrica Elettronica Informatica (DIEEI), Università di Catania, Via Santa Sofia 64, Catania, Italy
Keywords:
Machine Learning, Data Analysis, Health Informatics.
Abstract:
The utilization of machine learning in the prevention of serious diseases such as cancer or heart disease is in-
creasingly crucial. Various studies have demonstrated that enhanced forecasting performance can significantly
extend patients’ life expectancy. Naturally, having sufficient datasets is vital for employing techniques to
classify the clinical situation of patients, facilitating predictions regarding disease onset. However, available
datasets often exhibit imbalances, with more records featuring positive metrics than negative ones. Hence,
data preprocessing assumes a pivotal role. In this paper, we aim to assess the impact of machine learning and
SMOTE (Synthetic Minority Over-sampling Technique) methods on prediction performance using a given
set of examples. Furthermore, we will illustrate how the selection of an appropriate SMOTE process signifi-
cantly enhances performance, as evidenced by several metrics. Nonetheless, in certain instances, the effect of
SMOTE is scarcely noticeable, contingent upon the dataset and machine learning methods employed.
1 INTRODUCTION
The importance of machine learning (ML) in health-
care is increasingly evident and significant. ML mod-
els can analyze large amounts of data, such as medi-
cal images, vital signs, and medical histories, to assist
physicians in the early and accurate diagnosis of dis-
eases. This can lead to better outcomes for patients,
as it allows for the timely and precise identification
of conditions. Furthermore, through the analysis of
patient data, advanced customization of treatments is
possible. Indeed, ML can help develop personalized
treatment plans, taking into account individual varia-
tions in biological data, test results, and treatment re-
sponses, thereby improving the effectiveness of care.
Finally, machine learning plays a central role in dis-
ease prevention, since ML models can identify risk
factors for specific medical conditions and help pre-
vent diseases through the early detection of predictive
signs and the implementation of preventive interven-
tions.
Medical datasets often suffer from imbalance, a
critical issue for predictive modelling. When applied
to imbalanced datasets, models may exhibit a bias to-
ward predicting the majority class, resulting in Clas-
sification Bias. This bias can lead to Inaccurate Per-
formance, particularly for underrepresented classes,
where the model fails to learn effectively from those
examples or may suffer from overfitting. Balancing
the dataset is thus paramount for building accurate
disease prediction models. It directly impacts the
model’s ability to generalize correctly and make ac-
curate predictions across all disease classes. With-
out proper balancing, models may struggle to gener-
alize from the training data to new instances, impair-
ing their predictive performance. In medical applica-
tions, dataset balancing is one of the most significant
problems for several critical reasons, with the primary
concern being patient safety. Many characteristics of a dataset affect balancing, such as disease prevalence, particularly when studying rare diseases, whose class might be underrepresented.
Existing literature is considered in Section 2,
while the datasets are introduced in Section 3. In Sec-
tion 4, the methods and results are discussed in detail.
Moreover, a comparative study is presented to point
out the appropriateness of the results with respect to
several metrics. Finally, we consider further works
and concluding remarks in Section 5.
2 DATA SET BALANCING
As commonly acknowledged, there are numerous
methods for balancing a dataset. In this section, we
discuss balancing methods for classification and pro-
vide an overview of related work in the literature on
methods to balance datasets, particularly focusing on
health-related research. In many scenarios, the classes
of interest, such as those related to rare diseases or
clinically significant events, can be significantly un-
derrepresented compared to control or normal classes.
This is especially challenging during the training of
machine learning models, as models tend to be influ-
enced more by the majority class, thereby overlooking
the minority class. Consequently, the model’s ability
to generalize to new data and correctly identify posi-
tive cases in the minority class may be compromised.
Therefore, it is crucial to carefully address the issue
of data imbalance in health-related datasets to ensure
the construction of accurate and reliable models.
There are numerous techniques available for bal-
ancing datasets, but in this article, our focus will be
on SMOTE (Synthetic Minority Over-sampling Tech-
nique) (Pradipta et al., 2021). SMOTE is one of
the most widely used methods for addressing the is-
sue of class imbalance in datasets, especially when
there is a significant under-representation of minor-
ity classes compared to others. This technique is
commonly applied in machine learning contexts, in-
cluding classification models used to predict diseases,
frauds, or other rare events. SMOTE operates based on three main components:
1. Minority Definition: this component identifies the minority class in the dataset, which is characterized by having fewer examples compared to the other classes;
2. Generation of Synthetic Examples: this step involves generating synthetic examples of the minority class. These examples are created by linearly combining nearby samples in the feature space;
3. SMOTE Procedure: for each example in the minority class, this procedure selects some of its nearest neighbors and creates new synthetic examples through a linear combination of the feature values.
Finally, by adding these synthetic examples to the
dataset, SMOTE increases the amount of data avail-
able for the minority class, thus helping to bal-
ance the dataset. In addition to SMOTE, several
other methods address the issue of class imbalance
in datasets; Table 1 reports a comparison among the most common ones. Each technique has its advan-
tages and disadvantages, and the choice depends on
the specific characteristics of the dataset and the prob-
lem being addressed. In some cases, experimenting
with different approaches may be effective in deter-
mining which one works best for the specific case.
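To make the interpolation step at the core of SMOTE concrete, the following minimal sketch (Python/NumPy; the toy data and function name are ours, not from the paper) generates one synthetic sample from a minority-class point and a randomly chosen nearest neighbor.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class samples in a two-feature space (illustrative values only).
minority = np.array([[1.0, 2.0],
                     [1.2, 1.9],
                     [0.9, 2.3],
                     [1.1, 2.1]])

def smote_like_sample(X_min, i, k=3):
    """Create one synthetic sample for minority point i, SMOTE-style."""
    # Euclidean distances from point i to every other minority point.
    d = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(d)[1:k + 1]  # k nearest neighbors, skipping i itself
    j = rng.choice(neighbors)           # pick one neighbor at random
    gap = rng.random()                  # interpolation factor in [0, 1)
    return X_min[i] + gap * (X_min[j] - X_min[i])

print(smote_like_sample(minority, i=0))
```

Repeating this step until the minority class reaches the desired size is, in essence, what library implementations of SMOTE do.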
Several authors propose various approaches to
address class imbalance and feature selection prob-
lems in Clinical Decision Support Systems (CDSS).
In (Sreejith et al., 2020) the authors introduce a frame-
work that balances the dataset at the data level and
employs a wrapper approach for feature selection,
utilizing Chaotic Multi-Verse Optimization (CMVO)
for subset selection. Performance evaluation using
the arithmetic mean of Matthews correlation coef-
ficient (MCC) and F-score (F1) indicates compet-
itiveness of the proposed framework. Paper (Xu
et al., 2021) presents a cluster-based oversampling
algorithm (KNSMOTE), which combines Synthetic
Minority Oversampling Technique (SMOTE) and k-
means clustering. This algorithm identifies “safe
samples” from clustered classes and synthesizes new
samples through linear interpolation, effectively ad-
dressing class imbalance. In a different study (Li
et al., 2021; Xu et al., 2020) SMOTE is highlighted
as a successful method with practical applications,
alongside the introduction of a novel oversampling
approach called SMOTE-NaN-DE, which improves
class-imbalance data by generating synthetic samples.
Additionally, a hybrid sampling algorithm named
RFMSE, combining M-SMOTE and Edited Nearest
Neighbor based on Random Forest, is proposed to
enhance sampling effectiveness. Jakhmola and Prad-
han in (Jakhmola and Pradhan, 2015) propose an in-
teractive algorithm allowing users to customize pre-
processing requirements, yielding higher quality data
suitable for correlation and multiple regression anal-
ysis, as demonstrated on a diabetes dataset. Finally,
in (Khushi et al., 2021) the authors investigate class imbalance techniques for lung can-
cer prediction, employing various methods includ-
ing under-sampling, over-sampling, and hybrid tech-
niques. Evaluation metrics, such as AUC, reveal
the superiority of over-sampling methods, particularly
random forest with random over-sampling, in predict-
ing lung cancer presence.
3 DATASET DESCRIPTION
To analyze the balance issue in the healthcare domain,
we will leverage five diverse datasets. These datasets
differ significantly in terms of the number of features,
observations, and imbalance ratio. Despite these dif-
ferences, they all revolve around predicting medical
situations through binary classification tasks. Given
the inherent imbalance in these datasets, our objec-
tive is twofold: to evaluate the performance of predictions when using the imbalanced datasets, and to assess the impact of preliminary dataset balancing as a preprocessing step to enhance prediction performance.
Table 1: Comparison of some Imbalanced Data Handling Methods.
Method | Advantages | Disadvantages
SMOTE | Preserves information from the minority class, reducing the risk of data loss. Can improve the generalization of the model. | May introduce noise in the synthetic data, especially if the data distribution is complex. Could require more computational time compared to other methods.
Random Undersampling | Simple and fast to implement. Can reduce the training time on very large datasets. | May lead to loss of important information in the majority class, increasing the risk of under-representation.
Random Oversampling | Simple to implement. Can improve the accuracy of models on imbalanced datasets. | May lead to overfitting if not used cautiously, especially with excessive replication.
Cluster-Based Oversampling | Effective when minority class examples form distinct clusters. Reduces the risk of generating synthetic data in inconsistent regions. | Requires careful parameter tuning and can be computationally expensive.
Tomek Links | Enhances class separation without adding noise. | May not be effective in complex class distributions.
ENN | Can improve model performance by reducing misclassification. | May excessively reduce dataset size, potentially losing important information.
SMOTE-ENN | Combines benefits of both techniques, enhancing class separation and mitigating overfitting risks. | Computationally intensive, particularly on large datasets.
ADASYN | More effective in complex and non-uniform data distributions. | Requires more computational resources compared to SMOTE.
Random Oversampling with replacement | Simple to implement. Can enhance model performance on imbalanced datasets. | Risk of overfitting if replication is excessive, especially on small datasets.
Cost-Sensitive Learning | Improves model performance on imbalanced datasets without synthetic data addition. | Requires careful weight selection and may not be universally effective.
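Most of the techniques compared in Table 1 are available in the imbalanced-learn package; the sketch below shows how a few of them are typically instantiated (default parameters of our own choosing, not settings reported in the paper).

```python
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import (RandomUnderSampler, TomekLinks,
                                     EditedNearestNeighbours)
from imblearn.combine import SMOTEENN

# A selection of the resamplers from Table 1; all share the same
# fit_resample(X, y) interface. Cost-sensitive learning is not a resampler
# and is handled instead through class weights in the classifier.
resamplers = {
    "SMOTE": SMOTE(random_state=0),
    "Random Undersampling": RandomUnderSampler(random_state=0),
    "Random Oversampling": RandomOverSampler(random_state=0),
    "Tomek Links": TomekLinks(),
    "ENN": EditedNearestNeighbours(),
    "SMOTE-ENN": SMOTEENN(random_state=0),
    "ADASYN": ADASYN(random_state=0),
}

# Example: X_res, y_res = resamplers["SMOTE"].fit_resample(X, y)
```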
We will define the Imbalance Ratio as the propor-
tion between the number of examples in the minor-
ity class and the number of examples in the majority
class. This ratio provides a quantitative measure of
the degree of class imbalance within each dataset. For
example, if there are 100 negative examples (majority class) and 20 positive examples (minority class), the imbalance ratio will be 20/100 = 0.2. Obviously, the more imbalanced the dataset, the closer this value is to zero.
The first dataset we discuss, Wisconsin Diagnos-
tic Breast Cancer (WDBC) (Repository, ), is the well-
known dataset that collects data for breast cancer pre-
diction. Since breast cancer is the most common
cause of cancer deaths in women and is a type of
cancer that can be treated when diagnosed early, pre-
diction is a very important aspect. This dataset has
been extensively studied in the literature (Elter et al.,
), which is why it is utilized in this paper. The dataset
originates from the University of Wisconsin Hospitals and can
be downloaded from both the UCI Machine Learning
Repository and Kaggle. It consists of 569 samples
and 33 features, computed from a digitized image of
a fine needle aspiration (FNA) of a breast mass and re-
lated to some characteristics of each cell nucleus (e.g.,
radius, texture, perimeter, area, etc.). Some of these
features are more selective and decisive than others,
and the determination of these features significantly
increases the success of the models, which is why
Feature Selection is applied to select them.
The second dataset, also widely referenced in
literature, is the Heart Failure Clinical Records
dataset (Chicco and Jurman, 2020a). Cardiovascular
diseases (CVDs) are the leading cause of death glob-
ally, claiming approximately 17.9 million lives each
year, representing 31% of all deaths worldwide. Heart
failure, a common occurrence resulting from CVDs,
is the focus of this dataset, which comprises 12 fea-
tures aimed at predicting mortality associated with
heart failure. Many CVDs are preventable through ad-
dressing behavioral risk factors such as tobacco use,
poor diet, obesity, physical inactivity, and excessive
alcohol consumption via population-wide interven-
tions. Individuals with existing CVD or those at high
cardiovascular risk, often due to hypertension, dia-
betes, hyperlipidemia, or other established diseases,
require early detection and management, where ma-
chine learning models can offer significant assistance.
This dataset includes medical records from 299 heart
failure patients, gathered at the Faisalabad Institute of
Cardiology and Allied Hospital in Faisalabad, Punjab,
Pakistan, between April and December 2015. It en-
compasses 13 features encompassing clinical, physi-
ological, and lifestyle-related information.
The third dataset used is Pima Indians Diabetes
Database (Sigillito, ). The Pima Indians Diabetes
Database is a well-known dataset in the field of ma-
chine learning and healthcare research. It contains
medical data from the Pima Indian population, specif-
ically focused on women aged 21 and above from the
Gila River Indian Community near Phoenix, Arizona.
The dataset includes various health-related attributes
such as glucose level, insulin level, BMI (Body Mass
Index), age, and the presence or absence of diabetes
within a five-year period following the initial exami-
nation. This dataset is widely used for developing pre-
dictive models to identify individuals at risk of devel-
oping diabetes. Due to its large sample size and com-
prehensive health information, the Pima Indians Di-
abetes Database has been instrumental in advancing
research in diabetes prediction and management. De-
spite its significance, the dataset also poses challenges
due to its inherent class imbalance and missing data,
necessitating careful preprocessing and model evalu-
ation techniques. Its availability in the public domain
has facilitated numerous studies aimed at improving
diabetes diagnosis and treatment strategies, contribut-
ing significantly to the broader efforts in public health
and medical informatics.
The fourth dataset is more recent. The
Differentiated Thyroid Cancer Recurrence dataset
(Borzooei et al., 2023) is a valuable resource in the
domain of thyroid cancer research. It comprises
clinical data from patients diagnosed with differenti-
ated thyroid cancer (DTC) who underwent thyroidec-
tomy and subsequent treatment. The dataset includes
various demographic and clinical variables such as
age, sex, tumor size, histopathological characteristics,
treatment modalities, and follow-up information. A
key focus of the dataset is to predict the recurrence
of thyroid cancer following initial treatment based on
these factors. Researchers utilize machine learning
and statistical methods to develop predictive mod-
els that can identify patients at higher risk of recur-
rence, thereby aiding in personalized treatment strate-
gies and follow-up care. Due to its specialized nature
and importance in thyroid cancer management, the
Differentiated Thyroid Cancer Recurrence dataset has
garnered attention from researchers worldwide. How-
ever, challenges such as limited sample size and data
heterogeneity need to be addressed to enhance the ro-
bustness and generalizability of predictive models de-
rived from this dataset. Overall, it serves as a valuable
tool in advancing our understanding of thyroid cancer
recurrence and improving patient outcomes through
tailored interventions.
Finally, the last dataset used is the Sepsis Sur-
vival Minimal Clinical Records (Chicco and Jur-
man, 2020b). The Sepsis Survival Minimal Clini-
cal Records dataset is an essential and widely used
dataset in sepsis study and research, a severe medi-
cal condition caused by a systemic inflammatory re-
sponse to an infection. This dataset contains clinically
relevant and minimal information about patients with
sepsis, including demographic data, vital signs, labo-
ratory test results, and treatment information. Its sim-
plified structure makes it particularly suitable for de-
veloping predictive models of sepsis survival and for
evaluating clinical management strategies. Thanks to
its availability and focused nature, the Sepsis Sur-
vival Minimal Clinical Records dataset has signifi-
cantly contributed to the understanding of sepsis and
the research of effective clinical interventions to im-
prove outcomes for patients with this severe medical
condition. However, it is important to consider limita-
tions and potential biases in the data to obtain accurate
and generalizable results.
The common feature shared by the aforemen-
tioned datasets is the presence of only two classes.
The main data of the five datasets are summarized in
Table 2, demonstrating varying numbers of features,
observations, and Imbalance Ratios.
4 EXPERIMENT AND DISCUSSION

We implemented the following 10 supervised algorithms.
1. Logistic Regression (LG) is a machine learning method used for binary classification problems. Its principle of operation is based on estimating the conditional probabilities that an instance belongs to one of the two classes. It uses the logistic function (or sigmoid function) to transform a linear combination of features into a value between 0 and 1, representing the estimated probability. This value is then compared with a threshold to assign the instance to one of the two classes.
2. Support Vector Machine (SVM) operates by seeking the optimal separating hyperplane between classes in the case of binary classification. The separating hyperplane is defined as the hyperplane that maximizes the margin between the nearest class instances, which are called support vectors. SVM can effectively handle datasets with many features, and it tends to generalize well to test data, reducing the risk of overfitting.
3. Gaussian Naive Bayes (GNB) is based on Bayes' theorem and assumes that features are independent and follow a Gaussian distribution.
4. Decision Tree (DT) recursively splits the dataset into subsets based on the value of features, aiming to maximize the purity of each subset in terms of class labels.
5. Random Forest (RF) is an ensemble learning method that builds multiple decision trees and combines their predictions through voting or averaging.
6. Extra Tree (ET) is similar to RF but introduces additional randomness in the feature selection process.
7. K-Nearest Neighbors (KNN) operates by classifying an instance based on the majority class among its k nearest neighbors in the feature space. A distance metric (e.g., Euclidean distance) is used to measure the similarity between instances.
8. Hist Gradient Boosting (HGB) is a boosting algorithm that builds a series of decision trees sequentially, each one correcting the errors of its predecessors. It uses histogram-based techniques to speed up training.
9. Bagging Classifier (BC) is an ensemble learning method that trains multiple models on bootstrap samples of the dataset and combines their predictions. It reduces variance and improves stability.
10. Finally, Multilayer Perceptron (MLP) is a type of artificial neural network consisting of multiple layers of interconnected neurons.

The selection
of these methods is intended to facilitate experiments
showcasing the diverse impacts of various smoothing
techniques. Through these experiments, we aim to
ascertain the complexity of asserting a universally su-
perior smoothing method. As we will observe, the efficacy of a particular smoothing method, which may excel in certain scenarios, could result in inferior outcomes compared to the unbalanced dataset in other cases.

Table 2: Dataset Information.
Dataset | # Instances | # Features | Imbalance Ratio | # Numeric Features | # Symbolic Features
Breast Cancer Wisconsin (Diagnostic) | 699 | 9 | 0.59 | 9 | 0
Heart failure | 299 | 12 | 0.94 | 12 | 0
Pima Indians Diabetes Database | 768 | 8 | 0.54 | 8 | 0
Differentiated Thyroid Cancer Recurrence | 383 | 16 | 0.39 | 6 | 10
Sepsis Survival Minimal Clinical Records | 137 | 3 | 0.21 | 3 | 0
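For reference, the ten algorithms listed above correspond to the following scikit-learn estimators; the hyperparameters shown are a plausible setup of our own choosing, since the paper does not report the exact settings.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              HistGradientBoostingClassifier, BaggingClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# The ten supervised classifiers (settings are illustrative, not the paper's).
classifiers = {
    "LG":  LogisticRegression(max_iter=1000),
    "SVM": SVC(probability=True),   # probability=True enables AUC computation
    "GNB": GaussianNB(),
    "DT":  DecisionTreeClassifier(random_state=0),
    "RF":  RandomForestClassifier(random_state=0),
    "ET":  ExtraTreesClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "HGB": HistGradientBoostingClassifier(random_state=0),
    "BC":  BaggingClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=2000, random_state=0),
}
```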
Furthermore, we will employ four distinct over-
sampling techniques, commonly utilized in address-
ing imbalanced datasets, which will be referred to
throughout the remainder of the paper as Smote1,
Smote2, Smote3, and Smote4.
BorderlineSMOTE with m_neighbors = 20 (Smote1) is a variant of the SMOTE algorithm that generates synthetic samples only for those minority class instances that are misclassified or lie near the decision boundary (i.e., borderline instances). It generates synthetic samples by selecting a minority class instance and finding its k nearest neighbors. It then selects one of these neighbors randomly and generates a synthetic sample along the line segment joining the original instance and the selected neighbor. Setting m_neighbors = 20 specifies the number of nearest neighbors to consider when generating synthetic samples.

BorderlineSMOTE with m_neighbors = 10 and sampling_strategy = 'minority' (Smote2) is a variant of BorderlineSMOTE that also generates synthetic samples near the decision boundary between the minority and majority classes. Additionally, it adjusts the sampling strategy to focus on the minority class by specifying sampling_strategy = 'minority'. Setting m_neighbors = 10 specifies a different number of nearest neighbors to consider when generating synthetic samples compared to the previous variant.

SMOTE with k_neighbors = 10 (Smote3) is a popular oversampling technique that generates synthetic samples by interpolating between existing minority class instances. It selects a minority class instance and finds its k nearest neighbors. It then selects one of these neighbors randomly and generates a synthetic sample along the line segment joining the original instance and the selected neighbor. Setting k_neighbors = 10 specifies the number of nearest neighbors to consider when generating synthetic samples.

SMOTE with sampling_strategy = 'minority' and k_neighbors = 10 (Smote4), similar to Smote3, generates synthetic samples by interpolating between existing minority class instances. It further adjusts the sampling strategy to focus on the minority class by specifying sampling_strategy = 'minority'. Setting k_neighbors = 10 specifies a different number of nearest neighbors to consider when generating synthetic samples compared to the previous variant. Both BorderlineSMOTE and SMOTE aim to address class imbalance by oversampling the minority class.
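In imbalanced-learn terms, the four oversamplers correspond to configurations along the following lines; this is our reading of the parameters reported above, with random_state added only for reproducibility.

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

# The four oversampling configurations referred to as Smote1..Smote4.
samplers = {
    "Smote1": BorderlineSMOTE(m_neighbors=20, random_state=0),
    "Smote2": BorderlineSMOTE(m_neighbors=10, sampling_strategy="minority",
                              random_state=0),
    "Smote3": SMOTE(k_neighbors=10, random_state=0),
    "Smote4": SMOTE(k_neighbors=10, sampling_strategy="minority",
                    random_state=0),
}

# Each sampler rebalances a training set with the same call:
# X_res, y_res = samplers["Smote1"].fit_resample(X_train, y_train)
```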
To assess the impact of balancing, we will utilize Accuracy and AUC. Accuracy is a general measure of the model's precision and represents the percentage of instances classified correctly out of the total. It is calculated as the ratio of the number of correct predictions to the total number of predictions made; it is particularly useful when the classes in the dataset are balanced, but can be misleading in the presence of imbalance. AUC measures the model's discriminative ability, i.e., its ability to correctly classify positive examples as positive and negative examples as negative. All the figures are on a logarithmic scale; the blue bar refers to the analysis of the imbalanced dataset, while the others refer to the datasets obtained with the four balancing methods.
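A condensed sketch of the evaluation protocol follows; it is our reconstruction, not the authors' code, and the split proportions are assumptions. Each classifier is trained on the original training split and on each resampled version of it, and Accuracy and AUC are computed on an untouched test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

def evaluate(X, y, clf, samplers):
    """Return Accuracy/AUC for the imbalanced data and each balancing variant."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    results = {}
    for name, sampler in {"none": None, **samplers}.items():
        # Resampling is applied to the training split only.
        if sampler is None:
            X_bal, y_bal = X_tr, y_tr
        else:
            X_bal, y_bal = sampler.fit_resample(X_tr, y_tr)
        model = clf.fit(X_bal, y_bal)
        proba = model.predict_proba(X_te)[:, 1]
        results[name] = {"accuracy": accuracy_score(y_te, model.predict(X_te)),
                         "auc": roc_auc_score(y_te, proba)}
    return pd.DataFrame(results).T

# Example (X, y must be a numeric feature matrix and binary labels):
# print(evaluate(X, y, ExtraTreesClassifier(random_state=0),
#                {"Smote1": BorderlineSMOTE(m_neighbors=20, random_state=0),
#                 "Smote3": SMOTE(k_neighbors=10, random_state=0)}))
```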
The Breast Cancer accuracy and AUC scores are shown in Figure 1. Note that while some datasets yield consistent accuracy values across methods, others exhibit significant variability depending on the method used. For the Breast Cancer dataset, the highest accuracy values (see Figure 1a) are achieved with the ET and HGB methods, whether smoothing is applied or not. The maximum score of 0.973684 is obtained for ET when no balancing is performed and for HGB when Smote1 is applied. The maximum AUC score, 0.974206 (see Figure 1b), is obtained for HGB with Smote1. Considering that AUC is less prone to overfitting, this allows us to affirm that smoothing yields an advantage, albeit a small one.
The Heart Failure accuracy and AUC scores are shown in Figure 2. The accuracy values reach 1 (see Figure 2a) with various methods, but almost always with smoothing. Note that this dataset consistently performs quite well in terms of accuracy (the worst value being 0.829268). Analyzing the results for AUC (see Figure 2b) leads to the same considerations, and therefore its study is not particularly significant for our purposes.
Figure 1: Breast cancer scores: (a) Accuracy; (b) AUC.

Figure 2: Heart failure scores: (a) Accuracy; (b) AUC.

For the third dataset (Pima), all methods perform better with appropriate balancing, both in terms of accuracy and AUC score (Figure 3). In this case, several methods with SMOTE achieve an accuracy of 0.892857 (Figure 3a). Figure 3b clearly shows that the best AUC (with the DT or ET method) is consistently obtained with the datasets to which Smote2 has been applied (0.856522). This result allows us to conclude that for this dataset, characterized by an imbalance ratio of 0.54, balancing is often beneficial.
Figure 3: Pima scores: (a) Accuracy; (b) AUC.
For the fourth dataset, the Differentiated Thyroid Cancer Recurrence dataset (see Figure 4), the highest accuracy value (0.961039) was achieved using the DT method with Smote1 (refer to Figure 4a). Interestingly, all methods improved with balancing, which is significant considering this dataset is more imbalanced (i.e., has a lower imbalance ratio) than the previous three, making the impact of balancing generally beneficial. The AUC analysis further distinguishes the methods, confirming the most effective solutions (refer to Figure 4b). The top AUC score is 0.945455, obtained using the DT method with Smote1.
Figure 4: Thyroid score: (a) Accuracy; (b) AUC.

Finally, for the fifth dataset (the most imbalanced), the behavior is nearly equivalent for any method, provided the appropriate smoothing method (not always the same one) is adopted (see Figure 5). The best choice allows achieving a value of 0.892857. A particular situation occurs for the MLP method, which performs poorly with any of the four balancing methods. In terms of AUC score, the results are still extremely diverse depending on the method and smoothing used, much like with accuracy. However, it should be noted that for this dataset, which is the most imbalanced one, balancing can be crucial, as it allows us to achieve the best result. At the same time, if used inadequately, it can even worsen the results. The best AUC score is obtained with DT and ET using Smote2.

Figure 5: Sepsis scores: (a) Accuracy; (b) AUC.
The positive effect of smoothing is better appre-
ciated by analyzing AUC, which is less influenced
by overfitting compared to accuracy. For datasets
where the choice of method substantially alters ac-
curacy values, the impact of data balancing can be
significant. Figures 1 to 5 summarize the variation of the accuracy and AUC values for each dataset and method. Therefore, it can be stated that there
is no single most effective Smote method, but rather,
the (method, Smote) pair yielding better performance
should be sought.
5 CONCLUSION
A detailed analysis was conducted on five distinct
datasets, utilizing various machine learning tech-
niques to assess the impact of data preprocess-
ing (Carchiolo et al., 2022). For each dataset, we ex-
plored the behavior of ten distinct algorithms, each
with its own characteristics and tuning parameters. In
order to assess the impact of data balancing, we per-
formed the analysis both with and without data bal-
ancing techniques, considering four different smooth-
ing approaches to handle the presence of underrepre-
sented classes. The results obtained highlighted a sig-
nificant variation in model performance based on the
different combinations of machine learning method,
data balancing, and smoothing techniques. It became
clear that the choice of machine learning method and
the application of balancing strategies must be closely
integrated to achieve optimal results. In particular,
we observed that while data balancing can signifi-
cantly improve model performance on heavily imbal-
anced datasets, inadequate implementation could lead
to inferior results. Furthermore, we recognized that
parameter optimization for datasets characterized by
imbalance requires a particularly careful and targeted
approach, as the specific dataset characteristics can
significantly influence the effectiveness of proposed
solutions. While various approaches exist, compar-
ing them can be challenging due to numerous tun-
ing parameters and variations within articles. How-
ever, ongoing research suggests the importance of ex-
ploring diverse classifiers and imbalance techniques,
including deep learning models, to enhance predic-
tion capabilities and address imbalance issues effec-
tively. In conclusion, our analysis underscored the
importance of carefully considering the specific con-
text of each dataset and adopting a flexible and tar-
geted approach to address the issue of data imbal-
ance in machine learning contexts. From the analy-
sis conducted, it emerged that in the vast majority of cases, accuracy and AUC are better when balancing is applied. Nonetheless, as fu-
ture work, the analysis will be extended to a greater
number of datasets and balancing methods. Another
activity for future work concerns the application to
datasets that involve non-binary classification to an-
alyze whether balancing is advantageous in this case
as well. Finally, precision and recall analysis could be
conducted to add further confidence in the quality of
the results.
ACKNOWLEDGEMENTS
The work is partially supported by the UDMA project,
CUP: G69J18001040007.
REFERENCES
Borzooei, S., Briganti, G., Golparian, M., et al. (2023).
Machine learning for risk stratification of thyroid
cancer patients: a 15-year cohort study. European
Archives of Oto-Rhino-Laryngology.
Carchiolo, V., Grassia, M., Malgeri, M., and Mangioni, G.
(2022). Co-authorship networks analysis to discover
collaboration patterns among italian researchers. Fu-
ture Internet, 14(6).
Chicco, D. and Jurman, G. (2020a). Machine learning can
predict survival of patients with heart failure from
serum creatinine and ejection fraction alone. BMC
Medical Informatics and Decision Making, 20(16).
Chicco, D. and Jurman, G. (2020b). Survival prediction of
patients with sepsis from age, sex, and septic episode
number alone. Sci Rep, 10:17156.
Elter, M., Schulz-Wendtland, R., and Wittenberg, T. The
prediction of breast cancer biopsy outcomes using two
CAD approaches that both emphasize an intelligible de-
cision process.
Jakhmola, S. and Pradhan, T. (2015). A computational ap-
proach of data smoothening and prediction of diabetes
dataset. In Proceedings of the Third International
Symposium on Women in Computing and Informatics,
WCI ’15, page 744–748, New York, NY, USA. Asso-
ciation for Computing Machinery.
Khushi, M., Shaukat, K., Alam, T. M., Hameed, I. A., Ud-
din, S., Luo, S., Yang, X., and Reyes, M. C. (2021). A
comparative performance analysis of data resampling
methods on imbalance medical data. IEEE Access,
9:109960–109975.
Li, J., Zhu, Q., Wu, Q., Zhang, Z., Gong, Y., He, Z., and
Zhu, F. (2021). SMOTE-NaN-DE: Addressing the noisy
and borderline examples problem in imbalanced clas-
sification by natural neighbors and differential evolu-
tion. Knowledge-Based Systems, 223:107056.
Pradipta, G. A., Wardoyo, R., Musdholifah, A., Sanjaya,
I. N. H., and Ismail, M. (2021). SMOTE for handling
imbalanced data problem: A review. In 2021 Sixth
International Conference on Informatics and Comput-
ing (ICIC), pages 1–8.
Repository, U. M. L. UCI Machine Learning Repository:
Mammographic Mass Data Set. http://archive.ics.uci.
edu/ml/datasets/mammographic+mass.
Sigillito, V. National Institute of Diabetes and Digestive and Kidney Diseases. Donor of database: Vincent Sigillito (vgs@aplcen.apl.jhu.edu), RMI Group Leader, Applied Physics Laboratory, The Johns Hopkins University, Johns Hopkins Road, Laurel, MD 20707, (301) 953-6231. Date received: 9 May 1990.
Sreejith, S., Khanna Nehemiah, H., and Kannan, A. (2020).
Clinical data classification using an enhanced SMOTE
and chaotic evolutionary feature selection. Computers
in Biology and Medicine, 126:103991.
Xu, Z., Shen, D., Nie, T., and Kou, Y. (2020). A hybrid sam-
pling algorithm combining M-SMOTE and ENN based on
random forest for medical imbalanced data. Journal
of Biomedical Informatics, 107:103465.
Xu, Z., Shen, D., Nie, T., Kou, Y., Yin, N., and Han, X.
(2021). A cluster-based oversampling algorithm com-
bining SMOTE and k-means for imbalanced medical
data. Information Sciences, 572:574–589.