Prediction of Heart Attack on Random Forest and Logistic Regression
Xinyi Huang
Applied Statistics, Northeastern University at Qinhuangdao, Qinhuangdao, China
Keywords: Heart Attack Prediction, Random Forest, Logistic Regression.
Abstract: The heart is one of the most important organs. There are various kinds of heart disease, with coronary artery disease (CAD) being the most prevalent one. CAD refers to a group of diseases caused by damage to the coronary arteries that supply oxygen and blood, resulting in ischemia, hypoxia, or necrosis. About 2% of the
world's population suffers from CAD, which causes 17.5 million deaths worldwide every year. Therefore,
monitoring and prevention of heart disease is essential. The study uses two different algorithms, Random
Forest Classifier and Logistic Regression to establish models, with respective accuracies of 0.9798 and 0.8965.
RF is found to be more effective than LR in this small classification problem. Future research should focus
on integrating larger, more diverse datasets and introducing other advanced machine learning algorithms in
conjunction with the RF algorithm to explore hidden patterns in the data. These models can help predict the
risk of heart disease.
1 INTRODUCTION
Trans fats have been added to a wide variety of foods. However, trans fats increase low-density lipoprotein (LDL) cholesterol, a compound that clogs arteries and contributes to heart attacks (Ganguly and Pierce 2015). According to data released by the World Health Organization, individuals with pre-existing heart disease face a higher risk of severe illness or death from COVID-19. Therefore, predicting heart disease is
important for both individuals and society. The
traditional methods of detecting heart disease are
expensive and time-consuming. Owing to the
application of machine learning in the medical field,
machine learning algorithms can autonomously learn
and extract valuable information and patterns from
historical data, enabling automated decision-making
and prediction (Litjens et al 2017).
It has been discovered that some factors
contributing to a heart attack can be detected through
daily measurements and blood checks. Therefore, this
paper uses machine learning to effectively predict
heart attacks. The selection of suitable machine
learning algorithms based on the dataset
characteristics is crucial for achieving accurate results.
Jindal et al. employed the K-nearest neighbors (KNN) algorithm as the primary model to improve prediction accuracy on a collected dataset of medical attributes (Jindal et al 2021). Additionally, Javeed et al. developed a novel approach, the Random Search Algorithm - Random Forest (RSA-RF) model, which effectively discovers the optimal subset of features and reduces feature dimensionality, thereby reducing time complexity (Javeed et al 2019).
Machine learning techniques are widely used in
predicting heart diseases, with methods such as
Support Vector Machines (SVM) (Raju et al 2018),
Random Forest Classifier (RF) and Logistic
Regression (LR) receiving significant attention.
RF is an ensemble learning method known for its
high accuracy and robustness. The construction of an RF involves two stages: training and prediction.
During the training stage, multiple decision trees are
trained by randomly sampling data instances and
selecting subsets of features (Speiser et al 2019). Each
decision tree is generated based on different subsets of
data and features. In the prediction stage, input
samples are classified through each decision tree, and
the final classification result is determined by voting
or averaging the outcomes (Breiman 1984). However,
the predictive performance of RF may sometimes fall
short of expectations. To achieve optimal
performance, the number of trees can be adjusted
using cross-validation techniques.
LR is a common binary classification algorithm
used to predict the probability of an event occurring. It
applies a sigmoid function to a linear combination of the input variables, mapping continuous real values to probability values between 0 and 1,
which represent the likelihood of a sample belonging to a specific category (Hilbe 2009). The model parameters are estimated by maximum likelihood estimation. LR is commonly used in classification problems such as spam filtering (Wijaya and Bisri 2016) and web page text classification (Gangurde and Kumar 2020). Performance can be improved by using regularization to limit the size of the model parameters, adding interaction terms, and other techniques.
In this paper, the author uses RF and LR to
construct models and evaluates them to compare their
characteristics, ultimately determining which model performs better.
2 METHOD
2.1 Dataset
The dataset is called the Heart Attack dataset. It contains 1,320 samples, and each sample has 9 characterization dimensions. The task is a binary classification problem: the response variable indicates whether the subject is diseased or not, and the remaining 8 characterization dimensions serve as predictors. In this study, binary labels were used: a label value of 0 signifies the absence of a heart attack, whereas a label value of 1 signifies the presence of a heart attack. A sample of the dataset is presented below.
2.2 Dataset Visualization
A partial display of samples from the dataset is shown
in Table 1.
The seaborn heatmap function was used to
generate a correlation heatmap of the independent
variables. Figure 1 displays the correlation strength
between different features, where darker color
indicates stronger correlation. There is almost no multicollinearity problem between the variable pairs, as the correlations between them are very low (|r| < 0.1).
Figure 1: Correlation heatmap (Original).
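The heatmap can be generated with seaborn as in the minimal sketch below; the file name heart.csv and the label column name "class" are assumptions for illustration and are not specified in the paper.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Heart Attack dataset; the file name "heart.csv" and the label
# column name "class" are assumptions for illustration
df = pd.read_csv("heart.csv")
features = df.drop(columns=["class"])

# Pairwise correlations between the eight predictors
corr = features.corr()

# Darker cells indicate stronger correlation (cf. Figure 1)
sns.heatmap(corr, annot=True, fmt=".2f", cmap="Blues")
plt.title("Correlation between features")
plt.tight_layout()
plt.show()
```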
2.3 Prerequisite
The proposed methodology consists of multiple steps, which are depicted in Figure 2, starting with data collection, extraction of significant values, and a preprocessing phase for exploring the data. Data preprocessing involves dealing with missing values and cleaning and normalizing the data according to the requirements of each algorithm. After preprocessing, the data is classified using the two classifiers, as sketched below. Finally, the proposed models are evaluated to assess their performance and accuracy using various evaluation metrics.
Figure 2: Algorithm Flowchart (Original).
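A minimal sketch of the preprocessing steps outlined above, under assumed choices (dropping rows with missing values, an 80/20 train-test split, and standard scaling), since the paper does not specify these details:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Load the dataset; file and label column names are assumptions
df = pd.read_csv("heart.csv")

# Deal with missing values (dropping incomplete rows is one possible choice)
df = df.dropna()

X = df.drop(columns=["class"])
y = df["class"]  # 1 = heart attack, 0 = no heart attack

# Hold out a test set for the evaluation in Section 2.5 (80/20 split assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Normalize the features; this mainly matters for Logistic Regression,
# since tree-based models such as RF are insensitive to feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```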
Table 1: Examples in the Dataset.

age  gender  impluse  pressurehight  pressurelow  glucose  kcm    troponin
64   1       66       160            83           160      1.8    0.012
63   0       60       150            83           198      2.39   0.013
50   0       93       120            71           120      0.609  0.013
60   1       86       135            65           187      35.55  0.006
78   1       77       116            64           104      2.24   0.021
70   0       91       147            65           86       1.54   0.02
78   0       61       136            70           111      1.09   0.082
43   0       64       160            77           191      1.15   0.011
54   0       60       104            60           50       8.14   0.006
48   1       84       118            68           96       5.33   0.006
2.4 Algorithm
2.4.1 Random Forest Classifier
Feature Selection: Using the Bootstrap Sampling
method, data subsets are randomly selected with replacement from the original dataset, and a decision
tree is constructed for each subset.
Suppose n represents the number of samples present in D. During the process of bootstrap sampling, a sample is randomly drawn from D, copied into the sampling set Ds, and then returned to D. After n iterations, the sampling set Ds
contains n samples, and it is possible to have duplicate
samples. The probability that a sample from D is not
selected at all during the entire process can be
calculated as follows:

lim_{n→∞} (1 − 1/n)^n = 1/e ≈ 0.368 (1)
Through Bootstrap Sampling, approximately
36.8% of the samples in the initial dataset D do not
appear in Ds.
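A quick numerical check of Equation (1) for a dataset of this size (a sketch, not part of the original analysis):

```python
import math

n = 1320  # number of samples in the original dataset D
p_not_selected = (1 - 1 / n) ** n

print(p_not_selected)  # ~0.3677: about 36.8% of D never appears in Ds
print(1 / math.e)      # limiting value 1/e ~ 0.3679
```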
In this study, RF was used to construct 100
decision trees, and the ultimate classification outcome
is determined through the majority vote or average
value of all the trees. By increasing the number of
decision trees, the model's accuracy, robustness, and
generalization ability can be improved. RF effectively
reduces the risk of overfitting and exhibits good
performance in solving classification and regression
problems.
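A minimal scikit-learn sketch of this step; n_estimators=100 matches the 100 trees described above, while the remaining parameters are assumed to keep their defaults:

```python
from sklearn.ensemble import RandomForestClassifier

# 100 decision trees, each grown on a bootstrap sample of the training data,
# with a random subset of features considered at every split
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)      # X_train, y_train from the preprocessing sketch

rf_pred = rf.predict(X_test)  # final class decided by majority vote of the trees
```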
2.4.2 Logistic Regression
It is crucial to find the best parameter combination
within an acceptable time frame to improve model
performance and reduce overfitting risk. In this study,
cross-validation and grid search are employed to
explore the parameter space and identify the optimal
parameter combination. Consequently, a
regularization coefficient of 10.0, a penalty parameter
of l1 to prevent overfitting, and liblinear as the
optimization algorithm are determined as the best
parameter values. Accordingly, the objective function becomes:
J(θ) = −(1/m) Σ_{i=1..m} [y(i) log(h_θ(x(i))) + (1 − y(i)) log(1 − h_θ(x(i)))] + (λ/(2m)) Σ_j θ_j² (2)
m represents the number of training samples, y(i) denotes the actual label of the i-th sample, h_θ(x) is the prediction function, calculated as 1 / (1 + exp(−θ^T x)), θ is the parameter vector of the model, x(i) is the feature vector of the i-th sample, and λ is the regularization parameter.
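A sketch of the grid search described above, assuming scikit-learn's LogisticRegression and GridSearchCV; the candidate grids are assumptions, chosen so that the reported best combination (C = 10.0, penalty = 'l1', solver = 'liblinear') is contained in them:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Candidate values are assumptions; the best combination reported in the text
# (C = 10.0, penalty = 'l1', solver = 'liblinear') lies inside this grid
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0, 100.0],  # inverse of regularization strength
    "penalty": ["l1", "l2"],
    "solver": ["liblinear"],             # liblinear supports both penalties
}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)        # scaled features from the preprocessing sketch

lr = grid.best_estimator_
lr_pred = lr.predict(X_test_scaled)
```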
2.5 Evaluation Criteria
2.5.1 Confusion Matrix
It is a tabular form used to assess the performance of classification models, indicating the relation between the predictions of a classification model on a test set and the actual labels.
It consists of the following four metrics: True
Positive (TP), True Negative (TN), False Positive (FP),
False Negative (FN).
2.5.2 Accuracy
In this article, accuracy refers to the proportion of the total sample set for which the model correctly predicts whether a subject has heart disease.
Accuracy = (TP + TN) / (TP + TN + FP + FN) (3)
2.5.3 Precision
In this context, precision refers to the percentage of
actual heart disease patients among all the cases
predicted as having heart disease.
Precision = (TP) / (TP + FP) (4)
2.5.4 Recall
Recall is the proportion of actual heart disease patients that the model correctly identifies. A higher recall rate implies that the model can accurately identify individuals with heart disease.
Recall = (TP) / (TP + FN) (5)
2.5.5 F1-score
An elevated F1-score suggests that the model has
attained an enhanced equilibrium and overall
effectiveness in precisely detecting individuals
afflicted with cardiovascular conditions.
F1-score = (2 * (Precision * Recall)) / (Precision +
Recall) (6)
2.5.6 ROC Curve and AUC Value
They are commonly used metrics for binary
classification model evaluation. When the ROC curve
approaches the upper left corner, the performance of
the classifier is better. A larger AUC value indicates higher discrimination between positive and negative instances.
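The metrics in this section can be computed with scikit-learn as sketched below; y_test and the prediction arrays are assumed to come from the earlier sketches:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

for name, model, pred, X_eval in [("RF", rf, rf_pred, X_test),
                                  ("LR", lr, lr_pred, X_test_scaled)]:
    # Confusion matrix: rows are actual labels, columns are predicted labels
    print(name, "\n", confusion_matrix(y_test, pred))
    print("Accuracy :", accuracy_score(y_test, pred))
    print("Precision:", precision_score(y_test, pred))
    print("Recall   :", recall_score(y_test, pred))
    print("F1-score :", f1_score(y_test, pred))
    # AUC is computed from the predicted probability of the positive class
    print("AUC      :", roc_auc_score(y_test, model.predict_proba(X_eval)[:, 1]))
```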
3 RESULT
3.1 Predict Result
Table 2: Prediction results of RF.

Class  F1 Score  Precision  Recall  Accuracy
1      0.9835    0.9794     0.9875  0.9798
0      0.9740    0.9804     0.9677
Table 3: Prediction results of LR.

Class  F1 Score  Precision  Recall  Accuracy
1      0.9126    0.9386     0.8880  0.8965
0      0.8731    0.8393     0.9097
From Table 2 and Table 3, the results of the RF model are better than those of LR in all aspects. In particular, Accuracy, the most important metric here, is significantly higher for RF than for LR.
3.2 Evaluation
Figure 3: Confusion Matrix of LR (Original).
Figure 4: Confusion Matrix of RF (Original).
From Figure 3 and Figure 4, both models show high
accuracy, precision, and recall in predicting heart
disease in patients, demonstrating good performance.
The author particularly focuses on False Negative,
which represents cases where the model fails to detect
the presence of heart disease in individuals. Such errors
can lead to delays in diagnosis and treatment, resulting
in serious consequences for patients. However, according to the results, the probability of this type of error is lower for RF than for LR.
Figure 5: ROC curve of RF (Original).
Figure 6: ROC curve of LR (Original).
From Figure 5 and Figure 6, both models exhibit
excellent performance with ROC curves that approach
the upper left corner and have high AUC values.
However, when comparing the two, RF performs
better.
4 DISCUSSION
The results all indicate that the RF model outperforms
LR overall. The accuracy of 0.9798 clearly demonstrates the excellence of the model. Moreover,
concerning the ROC curve and AUC value, it is
noticeable that the ROC curve of the RF model
approaches the top-left corner, exhibiting the
superiority of the model's performance. The AUC
value of 0.98, which is close to 1, further consolidates
the success of the RF model. These findings suggest
that the RF model could be an optimal solution for
heart attack prediction.
5 CONCLUSION
A heart disease prediction model has successfully been
constructed by adopting two different techniques.
After comparison, the RF model is found to be more suitable and effective for this small-scale classification problem.
Early prediction and intervention are crucial for patient health, as heart disease is a common and serious illness. By helping more people assess their risk of heart disease, this research can provide new ideas and methods for heart disease prevention.
To improve the accuracy and reliability of the model's predictions, future research should focus on incorporating larger-scale and more diverse datasets.
By collecting heart disease-related data with various
attributes and features, models can gain a more
comprehensive understanding.
Additionally, further research could introduce other advanced machine learning algorithms to explore more patterns and features hidden in the data, thereby enhancing the performance and stability of the prediction model.
In conclusion, by expanding the scale of the dataset
and employing advanced techniques, further
improvement of the heart disease prediction model can
lead to positive contributions in preventing heart
disease.
REFERENCES
R. Ganguly and G. N. Pierce, "The toxicity of dietary trans fats," Food and Chemical Toxicology, vol. 78, pp. 170-176, April 2015.
G. Litjens, T. Kooi, B. E. Bejnordi, et al., "A Survey on Deep Learning in Medical Image Analysis," Medical Image Analysis, vol. 42, pp. 60-88, December 2017.
H. Jindal, S. Agrawal, R. Khera, et al., "Heart disease prediction using machine learning algorithms," IOP Conference Series: Materials Science and Engineering, vol. 1022, p. 012072, 2021.
A. Javeed, S. J. Zhou, Y. J. Liao, et al., "An Intelligent Learning System based on Random Search Algorithm and Optimized Random Forest Model for Improved Heart Disease Detection," IEEE Access, vol. 7, pp. 180235-180243, November 2019.
C. Raju, E. Philipsy, S. Chacko, et al., "A Survey on Predicting Heart Disease using Data Mining Techniques," ICEDSS, March 2018.
J. L. Speiser, M. E. Miller, J. Tooze, and E. Ip, "A comparison of random forest variable selection methods for classification prediction modeling," Expert Systems with Applications, vol. 134, pp. 93-101, November 2019.
L. Breiman, Classification and Regression Trees, 1st ed., New York: Routledge, 1984, pp. 4-8.
J. M. Hilbe, Logistic Regression Models, London: CRC Press, 2009, pp. 243-250.
A. Wijaya and A. Bisri, "Hybrid decision tree and logistic regression classifier for email spam detection," ICITEE, October 2016.
R. Gangurde and B. Kumar, "Web Page Prediction Using Genetic Algorithm and Logistic Regression based on Weblog and Web Content Features," ICESC, August 2020.