
Table 1: Comparison of some Imbalanced Data Handling Methods.

Method | Advantages | Disadvantages
SMOTE | Preserves information from the minority class, reducing the risk of data loss. Can improve the generalization of the model. | May introduce noise in the synthetic data, especially if the data distribution is complex. May require more computational time compared to other methods.
Random Undersampling | Simple and fast to implement. Can reduce the training time on very large datasets. | May lead to loss of important information in the majority class, increasing the risk of under-representation.
Random Oversampling | Simple to implement. Can improve the accuracy of models on imbalanced datasets. | May lead to overfitting if not used cautiously, especially with excessive replication.
Cluster-Based Oversampling | Effective when minority class examples form distinct clusters. Reduces the risk of generating synthetic data in inconsistent regions. | Requires careful parameter tuning and can be computationally expensive.
Tomek Links | Enhances class separation without adding noise. | May not be effective in complex class distributions.
ENN | Can improve model performance by reducing misclassification. | May excessively reduce dataset size, potentially losing important information.
SMOTE-ENN | Combines the benefits of both techniques, enhancing class separation and mitigating overfitting risks. | Computationally intensive, particularly on large datasets.
ADASYN | More effective in complex and non-uniform data distributions. | Requires more computational resources compared to SMOTE.
Random Oversampling with Replacement | Simple to implement. Can enhance model performance on imbalanced datasets. | Risk of overfitting if replication is excessive, especially on small datasets.
Cost-Sensitive Learning | Improves model performance on imbalanced datasets without adding synthetic data. | Requires careful weight selection and may not be universally effective.
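Several of these methods are implemented in the imbalanced-learn library. The following is a minimal, illustrative sketch (not part of the original study; the toy data and parameter choices are our own assumptions) showing how three of the techniques in Table 1 can be applied to a feature matrix X and label vector y:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.combine import SMOTEENN

    # Toy imbalanced problem: roughly 90% majority, 10% minority.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print("original:", Counter(y))

    samplers = [
        ("SMOTE", SMOTE(random_state=0)),
        ("Random undersampling", RandomUnderSampler(random_state=0)),
        ("SMOTE-ENN", SMOTEENN(random_state=0)),
    ]
    for name, sampler in samplers:
        # Every sampler shares the same fit_resample interface.
        X_res, y_res = sampler.fit_resample(X, y)
        print(name, Counter(y_res))

Because all samplers expose the same fit_resample interface, the techniques in Table 1 can be swapped with minimal code changes.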
We define the Imbalance Ratio as the proportion between the number of examples in the minority class and the number of examples in the majority class. This ratio provides a quantitative measure of the degree of class imbalance within each dataset. For example, if there are 100 negative examples (majority class) and 20 positive examples (minority class), the imbalance ratio is 20/100 = 0.2. Clearly, the more imbalanced the dataset, the closer this value is to zero.
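As a minimal illustration of this definition (the helper function below is hypothetical, not taken from the paper), the ratio can be computed directly from a label vector:

    from collections import Counter

    def imbalance_ratio(y):
        """Minority-class count divided by majority-class count."""
        counts = Counter(y)
        return min(counts.values()) / max(counts.values())

    # The example from the text: 100 negative (majority), 20 positive (minority).
    y = [0] * 100 + [1] * 20
    print(imbalance_ratio(y))  # 0.2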
The first dataset we discuss, Wisconsin Diagnostic Breast Cancer (WDBC) (Repository, ), is a well-known dataset collected for breast cancer prediction. Since breast cancer is the most common cause of cancer deaths in women and is a type of cancer that can be treated when diagnosed early, prediction is a very important aspect. This dataset has been extensively studied in the literature (Elter et al., ), which is why it is utilized in this paper. The dataset originates from the University of Wisconsin Hospitals and can be downloaded from both the UCI Machine Learning Repository and Kaggle. It consists of 569 samples and 30 real-valued features computed from a digitized image of a fine needle aspiration (FNA) of a breast mass, describing characteristics of each cell nucleus (e.g., radius, texture, perimeter, area). Some of these features are more selective and decisive than others, and identifying them significantly increases the success of the models, which is why feature selection is applied, as sketched below.
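As an illustrative sketch of this step (the paper does not specify the selection criterion here, so the univariate ANOVA filter and the choice of k = 10 are assumptions), the dataset can be loaded directly from scikit-learn:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif

    # WDBC: 569 samples, 30 real-valued features, binary diagnosis.
    X, y = load_breast_cancer(return_X_y=True)

    # Keep the 10 features with the highest ANOVA F-score (k is an arbitrary choice).
    selector = SelectKBest(score_func=f_classif, k=10)
    X_selected = selector.fit_transform(X, y)
    print(X_selected.shape)  # (569, 10)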
The second dataset, also widely referenced in the literature, is the Heart Failure Clinical Records dataset (Chicco and Jurman, 2020a). Cardiovascular diseases (CVDs) are the leading cause of death globally, claiming approximately 17.9 million lives each year and accounting for 31% of all deaths worldwide. Heart failure, a common consequence of CVDs, is the focus of this dataset, whose features are aimed at predicting mortality associated with heart failure. Many CVDs are preventable by addressing behavioral risk factors such as tobacco use, poor diet, obesity, physical inactivity, and excessive alcohol consumption through population-wide interventions. Individuals with existing CVD or those at high cardiovascular risk, often due to hypertension, diabetes, hyperlipidemia, or other established diseases, require early detection and management, where machine learning models can offer significant assistance. The dataset includes the medical records of 299 heart failure patients, gathered at the Faisalabad Institute of Cardiology and the Allied Hospital in Faisalabad, Punjab, Pakistan, between April and December 2015. It encompasses 13 attributes covering clinical, physiological, and lifestyle-related information: 12 predictive features plus a binary death-event target.
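As an illustrative sketch (the local CSV file name is an assumption; DEATH_EVENT is the target column in the published dataset), the class imbalance of this dataset can be quantified as follows:

    import pandas as pd

    # Assumed local file name; the dataset is available from UCI and Kaggle.
    df = pd.read_csv("heart_failure_clinical_records_dataset.csv")

    counts = df["DEATH_EVENT"].value_counts()  # binary death-event target
    print(counts)
    print("imbalance ratio:", counts.min() / counts.max())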
The third dataset used is the Pima Indians Diabetes Database (Sigillito, ), a well-known dataset in the fields of machine learning and healthcare research. It contains medical data from the Pima Indian population, specifically women aged 21 and above from the Gila River Indian Community near Phoenix, Arizona. The dataset includes various health-related attributes, such as glucose level, insulin level, BMI (Body Mass Index), and age, together with the presence or absence of diabetes within the five-year period following the initial examination. It is widely used for developing predictive models that identify individuals at risk of developing diabetes. Thanks to its accessibility and comprehensive health information, the Pima Indians Diabetes Database has been instrumental in advancing research in diabetes prediction and management. Despite its significance, the dataset also poses challenges due to its inherent class imbalance and missing data, necessitating careful preprocessing and model evaluation techniques. Its public availability has facilitated numerous studies aimed at improving diabetes diagnosis and treatment strategies, contributing significantly to broader efforts in public health and medical informatics.
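A minimal preprocessing sketch along these lines is shown below (the file name is an assumption; the column names follow the common Kaggle release, in which physiologically impossible zeros encode missing values):

    import numpy as np
    import pandas as pd

    # Assumed local file name; column names follow the common Kaggle release.
    df = pd.read_csv("diabetes.csv")

    # In these columns a zero is physiologically impossible and encodes a missing value.
    zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
    df[zero_as_missing] = df[zero_as_missing].replace(0, np.nan)

    # One simple choice: impute per-column medians; other strategies are possible.
    df[zero_as_missing] = df[zero_as_missing].fillna(df[zero_as_missing].median())

    counts = df["Outcome"].value_counts()
    print("imbalance ratio:", counts.min() / counts.max())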
The fourth dataset is a more recent one. The