the ANN-GA model delivered the highest performance,
achieving an accuracy rate of 95.82% (Akgül et al
2019).
Class imbalance is a common issue in raw data,
particularly prevalent in the field of medical
diagnostics, where the majority of classification data
often lean towards negative class values (Thabtah et
al 2020). The number of healthy patients often
greatly exceeds the number of patients with heart
disease, which can considerably distort model
accuracy. The primary aim of the research
is to anticipate the existence of cardiovascular
ailments by considering a wide range of influencing
factors such as Body Mass Index (BMI), Smoking,
and Alcohol Drinking. In addition, the class
imbalance problem is addressed to improve the
accuracy of the heart disease classification model. In
the study, Exploratory Data Analysis (EDA) is first
used to gain an understanding of the distribution and
features of a dataset, laying the groundwork for
subsequent modeling. Second, the Synthetic Minority
Over-sampling Technique (SMOTE) is employed to
balance the dataset. Rather than simply replicating
examples from the minority class, the fundamental
concept of the technique is to generate synthetic
samples. SMOTE is considered the standard
benchmark for learning from imbalanced data
(Fernández et al 2018). Subsequently, the RF is
applied to make predictions. RF has significant
advantages in handling high-dimensional features
and large-scale data, while also maintaining high
interpretability. The model attains commendable
performance metrics, highlighting its resilience in
identifying individuals who are susceptible to heart
disease. The resultant model has an accuracy of
93.39%, a precision of 94.25%, a recall of 92.42%,
and an F1 score of 93.33%. Furthermore, by
analyzing feature importance, the experimental
results demonstrate that BMI emerges as the most
influential factor in predicting the presence of heart
disease, facilitating a better understanding of the
influential variables in this critical healthcare context.
Importantly, the knowledge derived from the study
also furnishes a valuable framework for predicting
other rare medical conditions with similar class
imbalance challenges.
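As an illustration of the SMOTE idea described above (synthesizing new minority samples instead of replicating existing ones), the following minimal sketch interpolates between a minority sample and one of its nearest minority neighbours. It is a simplification under stated assumptions, not the implementation used in the study: the function name `smote_sketch`, the toy data, and the choice of `k` are hypothetical, and only continuous features are handled.

```python
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position on the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical toy data: (BMI, sleep hours) for four minority-class patients.
minority = [[31.0, 6.0], [28.5, 5.0], [35.2, 7.0], [30.1, 6.5]]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Each synthetic point lies on a line segment between two real minority samples, so it stays inside the region the minority class already occupies. In practice a library implementation such as `SMOTE` from imbalanced-learn would normally be used; the source does not state which implementation the study relied on.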
2 METHODOLOGY
2.1 Dataset Description and
Preprocessing
The Personal Key Indicators of Heart Disease dataset
from Kaggle is designed to investigate key indicators
associated with heart disease, a leading cause of
mortality in the United States (Dataset 2023). The
dataset has been refined to retain 319,795 data points
with 18 relevant variables. The target variable is
"heart disease," which serves as a binary indicator of
the existence or non-existence of cardiovascular
ailments. In addition, there are 13 categorical features:
smoking status, alcohol drinking habits, stroke
history, difficulty walking, gender, age category, race,
diabetic status, physical activity levels, general health
assessments, asthma, kidney disease, and skin cancer.
The dataset also includes 4 numerical features: BMI,
physical health assessments, mental health
assessments, and sleep time, all of which contribute
to the dataset's richness.
In the data preprocessing phase, it is observed that
the dataset exhibits an imbalanced distribution of
heart disease cases, with only 9% labeled as "Yes".
This class imbalance has been taken into
consideration during model development to prevent
bias. Moreover, to ensure data accuracy and enhance
model effectiveness, a total of 18,078 duplicated data
points are identified and removed from the dataset.
In addition, through label encoding, each distinct
category is assigned a unique integer value so that
ML models can interpret and use these features.
Together, these steps safeguard the dataset's
integrity for subsequent analysis and modeling.
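The two preprocessing steps described above, duplicate removal and label encoding, can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names `drop_duplicates` and `label_encode`, the column names, and the four records are hypothetical and do not reproduce the dataset's actual headers or values.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each record, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def label_encode(rows, column):
    """Replace each distinct category in `column` with an integer code,
    assigned in sorted order of the category names."""
    categories = sorted({r[column] for r in rows})
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [{**r, column: mapping[r[column]]} for r in rows], mapping

# Hypothetical records for illustration.
rows = [
    {"BMI": 26.6, "Smoking": "Yes", "HeartDisease": "No"},
    {"BMI": 31.2, "Smoking": "No", "HeartDisease": "Yes"},
    {"BMI": 26.6, "Smoking": "Yes", "HeartDisease": "No"},  # exact duplicate
    {"BMI": 22.4, "Smoking": "No", "HeartDisease": "No"},
]

rows = drop_duplicates(rows)
for col in ("Smoking", "HeartDisease"):
    rows, mapping = label_encode(rows, col)

# Share of positive cases after cleaning, used to gauge class imbalance.
positive_share = sum(r["HeartDisease"] for r in rows) / len(rows)
```

Computing the positive share after cleaning is a quick way to confirm how skewed the target remains before any resampling is applied.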
2.2 Proposed Approach
The primary objective of this study is to explore key
indicators related to heart disease. As depicted in Fig.
1, the study initially conducts preliminary EDA to
gain an in-depth understanding of the dataset's
characteristics. Given the dataset's inherent class
imbalance, the SMOTE method is then used to
balance the two classes.
Following data preprocessing, RF is utilized for heart
disease prediction. The performance of the model is
assessed through a range of metrics, including
recall, precision, accuracy, and F1 score, which
together indicate the model's robustness and
generalization capability. Additionally, feature
importance scores provided by the RF are leveraged
to analyze the significance of various influencing
factors, enhancing the understanding of which factors
play a pivotal role in predicting heart disease.
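For reference, the four evaluation metrics named above can be computed from confusion-matrix counts as in the sketch below. The function name `classification_metrics` and the counts passed to it are hypothetical and chosen only for illustration. The last lines apply the standard F1 formula to the precision and recall reported in this paper as a consistency check.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the study.
m = classification_metrics(tp=90, fp=10, fn=20, tn=80)

# Consistency check on the reported figures (precision 94.25%, recall 92.42%):
# their harmonic mean should match the reported F1 score.
reported_f1 = 2 * 94.25 * 92.42 / (94.25 + 92.42)
print(round(reported_f1, 2))
```

The harmonic mean of the reported precision and recall rounds to 93.33%, which agrees with the F1 score stated in the paper.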
DAML 2023 - International Conference on Data Analysis and Machine Learning