Predictive Assessment of Heart Disease Based on Multiple Machine
Learning Models
Zhongyi Zhang
https://orcid.org/0009-0004-3324-3938
Software Engineering, Tianjin University of Commerce, Tianjin, China
Keywords: Heart Disease Prediction, Machine Learning, Logistic Regression, Support Vector Machines.
Abstract: Cardiovascular disease is one of the leading causes of death, and early diagnosis can improve treatment outcomes and survival rates. In this paper, six mainstream algorithmic models for binary classification are compared in predicting whether a person suffers from heart disease based on a number of features, such as gender, age, type of chest pain, resting Electrocardiograph (ECG) results, and maximum heart rate. The study first explored the relationships among these feature values and analyzed which of them are the main factors affecting heart disease. The dataset was then split into a 70% training set and a 30% test set for model training, and the six algorithmic models were used to make predictions on the dataset. The results showed that the Support Vector Machine (SVM) model provided the most accurate predictions of heart disease. This paper provides new tools and ideas for clinical diagnosis, treatment, and prevention in the field of cardiovascular medicine, contributing to improved patient quality of life and reduced medical costs.
1 INTRODUCTION
Heart disease is a common cardiovascular condition that directly involves the structure or function of the heart, including coronary artery disease, myocardial infarction, heart failure, arrhythmia, and others. Stroke, pulmonary embolism, and severe hypertension are also relatively common complications. These conditions have a great impact on patients' quality of life and pose a serious threat to human health and life. Therefore, the ability to predict the likelihood of disease more accurately can help people develop preventive strategies to improve quality of life and longevity.
In recent years, with advances in medicine and growing public attention to health, an increasing number of people undergo regular testing to check whether they currently have heart disease or may develop it in the future, and then pursue treatment or prevention according to the results. However, relying only on a doctor's own experience to make this judgment is inefficient and labor-intensive, and carries a certain rate of misdiagnosis. There is therefore a need for Artificial Intelligence (AI) to assist in diagnosis.
Currently, the field of artificial intelligence is developing rapidly and has been applied in many fields, including transportation, education, civil engineering, finance, biology, and healthcare (Qiu, 2022; Qiu, 2024; Sun, 2020; Wu, 2024; Zhou, 2023).
Especially in the medical field, there are many
remarkable achievements and breakthroughs in
recent years. For example, Zhang, Wu, et al. focused on medical AI and proposed the first knowledge-enhanced chest X-ray foundation model for disease diagnosis (Zhang, 2023). Yang et al. developed a respiratory-signal detection model for Parkinson's diagnosis through nocturnal breathing detection (Yang, 2022); one important direction of such work is the diagnosis and prediction of heart disease. In addition, Ouyang et al. used the EchoNet-Dynamic model for video-based assessment of echocardiograms (Ouyang, 2020), and Topol surveyed new multimodal medical AI models, including the detection of atrial fibrillation (AFib) (Topol, 2023). In conclusion, the rapid development of AI is bringing great changes and
innovations in various fields, providing people with smarter, more efficient, and more sustainable solutions.
Currently, numerous models exist for illness prediction; however, for diseases like heart disease, different model structures and parameter choices significantly influence the prediction outcomes. In addition, the emergence of cardiac disease is an intricate process that is typically affected by multiple causes, such as the environment, individual differences, and lifestyle habits. For a major disease such as heart disease, identifying self-consistent prediction models and techniques that enhance diagnostic precision is a matter requiring immediate attention.
To address these problems, this article predicts heart disease based on a dataset from Kaggle using several machine learning models, such as Logistic Regression, Random Forest, Naive Bayes, and Support Vector Machines (SVM), and then selects the most suitable prediction model and method for heart disease by comparing the learning results.
2 METHOD
2.1 Dataset Preparation
The dataset used in this study comes from the Kaggle platform (Kaggle, 2019). It contains a total of 1025 records, and each record has 14 attribute features, such as chest pain type, thalassemia, and resting electrocardiographic results. For the chest pain type, '0' represents typical angina, '1' atypical angina, '2' non-anginal pain, and '3' asymptomatic.
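As an illustration, such integer codes can be mapped to readable labels. The sketch below assumes the Kaggle CSV has been loaded with pandas and that the chest pain column is named cp (a hypothetical name; the actual file layout may differ):

import pandas as pd

# Hypothetical file and column names; adjust to the actual Kaggle download.
df = pd.read_csv("heart.csv")
chest_pain_labels = {0: "typical angina", 1: "atypical angina",
                     2: "non-anginal pain", 3: "asymptomatic"}
df["chest_pain_type"] = df["cp"].map(chest_pain_labels)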
2.1.1 Preprocessing
In order to improve the performance and accuracy of
the machine learning algorithms, increase the
generalization ability of the model, and reduce the
overfitting and underfitting problems of the model, it
is necessary to preprocess the dataset in the file.
Firstly, there will modify the attribute names of the
data by expanding the original abbreviated form of
the names to full names to enhance our understanding
of the attribute features. Secondly, it is necessary to
check the tables for missing values and display them
in a heat map. Then, convert variable numbers to text
names while defining the feature data and target data,
and divide the data proportionally into training and
testing sets. Finally, standardising the data so that the
models have the same scale when processing the data,
which enables the algorithms to learn and generalize
more efficiently, and it also helps this projection to
choose the most appropriate learning model.
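A minimal sketch of these steps, assuming the DataFrame df from above and a target column named target (an assumption about this dataset's layout), might look as follows:

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Visualize missing values as a heat map (all-dark means no missing values).
sns.heatmap(df.isnull(), cbar=False)
plt.show()

# Define feature data and target data; keep only numeric columns for scaling.
X = df.select_dtypes("number").drop(columns=["target"])
y = df["target"]

# 70% training set, 30% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Standardize: fit on training data only, then apply the same scale to both.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)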
2.2 Machine Learning Models
This paper uses six algorithmic models: Logistic Regression, Decision Tree, Random Forest, Naive Bayes, K-Nearest Neighbors (KNN), and Support Vector Machines (SVM). To train, evaluate, and select the models, this study uses various tools provided by sklearn. For instance, the dataset can be divided with the train_test_split function in the model_selection module, and the accuracy of the six models can be calculated with the accuracy_score function in the sklearn.metrics module.
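A sketch of this comparison procedure is given below; the models use illustrative default hyperparameters rather than the exact settings of this study:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
}

# Train each model and report training and testing accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: train={train_acc:.4f}, test={test_acc:.4f}")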
2.2.1 Logistic Regression
The main idea of the logistic regression model is to make binary predictions by building a linear model whose output is mapped to a probability in the range [0, 1] by a logistic function. The sigmoid (or logistic) function is

$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma: \mathbb{R} \to (0, 1)$ (1)
The threshold is typically set to 0.5. Predictions
with a probability greater than 0.5 are classified as
positive, while those with a probability less than or
equal to 0.5 are classified as negative.
During the training phase, the logistic regression
model estimates the model parameters using methods
such as maximum likelihood estimation or gradient
descent. This is done so that the model's predicted
probabilities for the training data are as close as
possible to the actual labels. During the prediction
phase, the logistic regression model utilises the
learned parameters to calculate the probabilities of the
input samples and make classification predictions
based on a predetermined threshold. The performance
of the model can be evaluated by various evaluation
metrics (e.g., accuracy, precision, recall, F1 value,
etc.).
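As a small numeric illustration of this mapping and thresholding (the scores are made up, not drawn from the experiments):

import numpy as np

def sigmoid(x):
    # Maps any real-valued score into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

scores = np.array([-2.0, 0.0, 1.5])   # hypothetical linear outputs w.x + b
probs = sigmoid(scores)               # approx. [0.119, 0.500, 0.818]
preds = (probs > 0.5).astype(int)     # threshold at 0.5 -> [0, 0, 1]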
2.2.2 Decision Tree
The main idea of the decision tree model is to partition and predict data through a series of feature selections and node splits. It uses a hierarchical tree structure to depict the decision-making pathway, where each branch signifies a condition based on a feature's value or a specific threshold. Every non-terminal node represents a feature or attribute, while terminal nodes, or leaves, denote a particular category or
outcome. According to the selected features and division criteria, the dataset is divided into subsets: for discrete features, each subset corresponds to one value of the feature; for continuous features, the data are divided into two subsets based on a threshold value.
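For instance, a single continuous-feature split of the kind just described might look like the following sketch (the column name and threshold are purely illustrative):

# Illustrative split on a continuous feature; 150 is an arbitrary threshold.
threshold = 150
left = df[df["max_heart_rate_achieved"] <= threshold]   # one subset
right = df[df["max_heart_rate_achieved"] > threshold]   # the other subset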
2.2.3 Random Forest
The main idea of the random forest model is to perform classification or regression by combining multiple decision trees. It is an ensemble learning method that builds on the strengths of decision trees, enhancing the performance and generalization of the model by randomly selecting features and samples for training.
A random forest combines multiple decision trees; for each tree, samples are drawn randomly from the dataset with replacement (bootstrap sampling), and a random subset of the features is selected as input. In the context of
classification, the combiner determines the final
outcome by choosing the option that has the majority
vote among various classification results. For
regression problems, it computes the final result by
taking the average of outcomes from multiple
regressions.
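These two sources of randomness correspond to standard sklearn parameters, as in the hedged configuration below (the values are illustrative, not tuned):

from sklearn.ensemble import RandomForestClassifier

# bootstrap=True draws each tree's samples with replacement;
# max_features="sqrt" limits the random feature subset tried at each split.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True,
                            max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)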
2.2.4 Naive Bayes
The naive Bayes model is based on Bayes' theorem and the assumption of conditional independence among features. It is a simple but effective probabilistic classification algorithm; it is called "naive" because it simplifies the Bayesian approach with the most basic assumption of Bayesian classification, namely that all features are mutually independent.
Equation (2) represents a common Bayesian
formula, where P(A) represents the a priori
probability, i.e. the probability of event A occurring
before the occurrence of event B. P(A|B) represents
the a posteriori probability, i.e. the probability of
event A occurring after the occurrence of event B.
The likelihood function, P(B|A)/P(B), is an adjusting
factor that makes the predicted probability closer to
the true probability:
$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ (2)
As naive Bayes is founded on the independence of features, the above equation can be factorized as follows, given the category $a$ (where $B_i$ denotes the $i$-th feature):

$P(B \mid A = a) = \prod_{i=1}^{n} P(B_i \mid A = a)$ (3)
Finally, Equation (4) is derived:

$P(A \mid b_1, \dots, b_n) = \frac{P(A) \prod_{i=1}^{n} P(b_i \mid A)}{P(B)}$ (4)
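Since several features here (e.g., age, blood pressure) are continuous, a Gaussian variant of naive Bayes is a natural choice; the sketch below uses sklearn's GaussianNB, which models each feature's class-conditional likelihood as a normal distribution:

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()                     # models P(b_i | A) as a per-feature Gaussian
nb.fit(X_train, y_train)
print(nb.predict_proba(X_test[:3]))   # posterior P(A | b_1, ..., b_n) per class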
2.2.5 K-Nearest Neighbors (KNN)
The KNN model is an instance-based learning algorithm mainly used for classification and regression problems. Its main idea is to make classification or regression predictions based on the nearest-neighbor samples in the feature space, under the assumption that similar samples have similar classes or target values.
The KNN algorithm is relatively simple to implement. To predict an input vector $x$, the set of the k vectors nearest to $x$ in the training dataset is identified, and the category of $x$ is predicted as the majority class among these k samples. This is where the k in the name KNN comes from.
Equations (5) and (6) give the mathematical setting of the KNN algorithm. Assume the training dataset $D$ has $m$ samples, where $x_i$ is the feature vector of the $i$-th sample with $n$ features, and $y_i$ is the category corresponding to that sample:

$D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ (5)

$x_i = (x_i^{(1)}, x_i^{(2)}, \dots, x_i^{(n)})$ (6)
The distance between nearest-neighbor samples can be measured by the Minkowski distance:

$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}, \quad p \ge 1$ (7)
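In sklearn, this distance corresponds to the p parameter of KNeighborsClassifier (p = 2 gives the Euclidean special case); a minimal sketch, with k chosen arbitrarily:

from sklearn.neighbors import KNeighborsClassifier

# k=5 neighbors and p=2 (Euclidean distance) are illustrative, untuned choices.
knn = KNeighborsClassifier(n_neighbors=5, p=2)
knn.fit(X_train, y_train)
print(knn.predict(X_test[:5]))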
2.2.6 Support Vector Machines (SVM)
The main concept behind SVM is to identify the most suitable hyperplane in the feature space, the one that maximizes the margin between samples of different classes. This idea of margin maximization enhances the model's generalization ability and mitigates the risk of overfitting.
In this algorithm, the separating hyperplane is usually written as $\omega \cdot x + b = 0$. For a linearly separable dataset there are infinitely many hyperplanes that fit this description, but the one with the largest geometric margin is unique.
For nonlinear classification problems in the input space, a kernel function can be used to perform linear discrimination in a higher-dimensional space, as shown in expression (8):

$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b \right)$ (8)
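A sketch of a kernelized SVM in sklearn follows; the RBF kernel and default regularization strength are illustrative, not necessarily the configuration behind the results reported here:

from sklearn.svm import SVC

# kernel="rbf" applies K(x, x_i) implicitly; C controls regularization strength.
svm = SVC(kernel="rbf", C=1.0)
svm.fit(X_train, y_train)
print(svm.score(X_test, y_test))   # mean test accuracy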
3 RESULTS AND DISCUSSION
This study compares six different machine learning algorithms in order to better predict heart disease. The dataset is split so that the test set is 30% and the training set is 70%. As shown in Table 1, the SVM model is particularly effective, with an accuracy of 89.01%. In addition, this study evaluated the correlations among the feature values at the outset; the results are shown in Figure 1. Heart disease has a strong correlation with the features chest pain type, maximum heart rate achieved, exercise-induced angina, and ST depression, with correlation coefficients above 0.4; a certain correlation with age, sex, st_slope, num_major_vessels, and thalassemia, with correlation coefficients between 0.2 and 0.4; and only a weak correlation with chol and fasting_blood_sugar.
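A correlation map of this kind can be produced along the following lines (a sketch, assuming the numeric DataFrame df from the preprocessing step):

import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr(numeric_only=True)   # Pearson correlation matrix
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation map of features")
plt.show()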
Table 1: Model performance.

Model                            Training Accuracy    Testing Accuracy
Logistic Regression (LR)         0.8956               0.8774
Decision Tree (DT)               0.8323               0.7282
Random Forest (RF)               0.8864               0.8721
Naive Bayes (NB)                 0.8241               0.7442
K-Nearest Neighbors (KNN)        0.8147               0.7623
Support Vector Machines (SVM)    0.9128               0.8901
Figure 1: The correlation map of features (Picture credit: Original).
However, through further analysis, as shown in
Figure 2, it can be found that angina induced by
exercise is not strong evidence to confirm the
diagnosis of heart disease.
Figure 2: The relationship between exercise-induced angina and heart disease (Picture credit: Original).
From the data in Table 1, it can be seen that the LR, RF, and SVM models achieve high accuracy. There are two reasons for this. First, the problem addressed in this research is a binary classification problem, namely predicting whether or not a person suffers from heart disease. Second, all three models are robust: they handle outliers and noise well and are not easily affected by extreme samples. They are also highly interpretable, which helps in understanding the contribution and impact of different features on heart disease. However, the relatively complex relationships between the features in this case, together with the larger number of feature values to be categorized, leave the Logistic Regression and Random Forest models slightly less accurate than the SVM model.
At the same time, the data in Table 1 show that the Naive Bayes and K-Nearest Neighbors models perform poorly in this study. For Naive Bayes, the algorithm assumes that all features are mutually independent, yet in a problem of this kind there may well be correlations between features, for example between blood pressure and cholesterol levels. The algorithm also typically assumes discrete features, so processing continuous features such as age and blood pressure may introduce a certain degree of error.
The KNN algorithm, on the other hand, is more sensitive to the balance of the data. In this study, the number of samples is unevenly distributed across certain feature values, such as gender and type of chest pain, which is the main reason the prediction accuracy of this model is unsatisfactory.
4 CONCLUSIONS
In this article, predictions about heart disease were made using six different machine learning algorithms: Logistic Regression, Decision Tree, Random Forest, Naive Bayes, K-Nearest Neighbors, and Support Vector Machines. The best model for this study was found to be the Support Vector Machines classifier, whose testing accuracy reaches 89.01%.
This study nonetheless has some shortcomings. For example, the samples included in the dataset were not balanced enough, the study did not reach the expected accuracy of 95%, and only classical machine learning models were used for training; deep learning algorithms were not considered for prediction.
In the future, in order to obtain a more balanced data sample, this research will continue to collect relevant data for training, will tune and optimize the parameters of the experimental models, and will try stronger algorithms such as XGBoost and CatBoost to further improve the prediction results. The research will then expand the heart disease prediction task into a multi-class problem, classifying heart disease into different types or severity levels to better assist doctors in risk assessment and in personalizing treatment for patients.
REFERENCES
Kaggle. 2019. Heart disease dataset. Retrieved from https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset
Ouyang, D., He, B., Ghorbani, A. et al. 2020. Video-based
AI for beat-to-beat assessment of cardiac function.
Nature, 580, 252–256. https://doi.org/10.1038/s41586-
020-2145-8
Qiu, Y., Wang, J., Jin, Z., Chen, H., Zhang, M., & Guo, L.
2022. Pose-guided matching based on deep learning for
assessing quality of action on rehabilitation training.
Biomedical Signal Processing and Control, 72, 103323.
Qiu, Y., Hui, Y., Zhao, P., Cai, C. H., Dai, B., Dou, J., ... &
Yu, J. 2024. A novel image expression-driven modeling
strategy for coke quality prediction in the smart
cokemaking process. Energy, 130866.
Sun, G., Zhan, T., Owusu, B.G., Daniel, A.M., Liu, G., &
Jiang, W. 2020. Revised reinforcement learning based
on anchor graph hashing for autonomous cell activation
in cloud-RANs. Future Generation Computer Systems,
104, 60-73.
Topol, E. J. 2023. As artificial intelligence goes
multimodal, medical applications multiply. Science,
381, eadk6139. DOI:10.1126/science.adk6139
Wu, Y., Jin, Z., Shi, C., Liang, P., & Zhan, T. 2024.
Research on the Application of Deep Learning-based
BERT Model in Sentiment Analysis. arXiv preprint
arXiv:2403.08217.
Yang, Y. et al. 2022. Artificial intelligence-enabled
detection and assessment of Parkinson's disease using
nocturnal breathing signals. Nature Medicine.
doi:10.1038/s41591-022-01932-x
Zhang, X., Wu, C., Zhang, Y. et al. 2023. Knowledge-
enhanced visual-language pre-training on chest
radiology images. Nat Commun, 14, 4542.
https://doi.org/10.1038/s41467-023-40260-7
Zhou, Y., Osman, A., Willms, M., Kunz, A., Philipp, S.,
Blatt, J., & Eul, S. 2023. Semantic Wireframe
Detection. publica.fraunhofer.de.