good to balance the precision and recall values and
the absence of outlier values on the boxplot.
In a study optimizing the SVM and k-NN algorithms with Particle Swarm Optimization for sentiment analysis of the hashtag phenomenon #2019gantipresiden (Saepudin et al., 2022), the SVM method achieved an accuracy of 88.00% and an AUC of 0.964, while SVM + PSO produced an accuracy of 92.75% and an AUC of 0.973. The k-NN and PSO-based k-NN methods were also compared: k-NN achieved an accuracy of 88.50% and an AUC of 0.948, whereas the PSO-based k-NN method actually decreased performance, with an accuracy of 75.25% and an AUC of 0.768.
A comparison of PSO-based optimization of the C4.5 and Naïve Bayes data classification algorithms for credit risk determination (Rifai and Aulianita, 2018) reported an accuracy of 85.40% for the C4.5 algorithm and 85.09% for Naïve Bayes. Each algorithm was then combined with Particle Swarm Optimization; C4.5 + PSO achieved the highest accuracy, 87.61%, with an AUC of 0.860 and a precision of 88.96%, while the highest recall, 96.75%, was obtained by Naïve Bayes + PSO. The
classification results of each algorithm in this study will be compared to obtain the best performance evaluation in heart disease detection. Thus, a data optimization technique is needed to improve the performance of the chosen conventional data mining classification method. One optimization algorithm that is quite popular is Particle Swarm Optimization (PSO), which has been used to solve many algorithm optimization problems (Yoga and Prihandoko, 2018).
2 RESEARCH METHODOLOGY
2.1 Dataset Acquisition
The dataset used in this study was uploaded by Ronan Azarias to the kaggle.com page entitled heart desease dataset. The dataset contains 500 records. The attributes contained in the data include:
a. Age: patient’s age (years)
b. Sex: patient’s sex (M: Male, F: Female)
c. ChestPainType: chest pain type (TA: Typical
Angina, ATA: Atypical Angina, NAP: Non-
Anginal Pain, ASY: Asymptomatic)
d. RestingBP: resting blood pressure (mm Hg)
e. Cholesterol: serum cholesterol (mg/dl)
f. FastingBS: fasting blood glucose (1: if FastingBS > 120 mg/dl, 0: otherwise)
g. RestingECG: Resting ECG results (Normal: nor-
mal, ST: with ST-T wave abnormality, LVH: show-
ing probable or definite left ventricular hypertro-
phy by Estes criteria)
h. MaxHR: maximum heart rate reached (Numeric
value between 60 and 202)
i. ExerciseAngina: exercise-induced angina (Y: Yes,
N: No)
j. Oldpeak: ST depression induced by exercise relative to rest (numeric value measured in depression)
k. ST Slope: the slope of the peak exercise ST segment (Up: upsloping, Flat: flat, Down: downsloping)
In addition, there is the response variable, which in this case is a binary variable:
l. HeartDisease: output class (1: heart disease, 0:
normal)
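As an illustration, the data can be loaded and inspected with pandas as in the following sketch; the local file name heart.csv is an assumption.

```python
import pandas as pd

# Load the Kaggle heart disease data (local file name assumed).
df = pd.read_csv("heart.csv")
print(df.shape)   # expected: (500, 12) -- 11 attributes plus HeartDisease
print(df.dtypes)  # the attributes listed above
```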
2.2 Pre-Processing
Data cleaning is a step performed before entering the data mining process [9]. It comprises several activities whose main purpose is to examine and improve the data to be studied. Such improvement is needed because raw data tends not to be ready for mining. A frequent case is the presence of missing values in the data. Missing values in datasets come from attributes that carry no informational value; this information may be lost, for example, during the process of merging data. Missing values in this study were handled by removing the affected data objects (under-sampling). As a result of the data cleaning, 456 records remained from the initial 500.
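The following is a minimal sketch of this cleaning step, assuming missing entries are encoded as NaN in the file (both the file name and the encoding of missing values are assumptions).

```python
import pandas as pd

df = pd.read_csv("heart.csv")   # file name assumed
cleaned = df.dropna()           # remove records containing missing values
# The paper reports 500 -> 456 records after cleaning.
print(f"{len(df)} -> {len(cleaned)} records")
```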
2.3 K-Nearest Neighbor
K-Nearest Neighbor is also called lazy learner be-
cause it is learning-based. K-Nearest Neighbor delays
the process of modeling training data until it is needed
to classify samples of test data. The sample train data
is described by numeric attributes on the n-dimension
and stored in n-dimensional space. When a sample of
test data (label of unknown class) is given, K-Nearest
Neighbor searches for the training k sample closest to
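As an illustration of this scheme, the following sketch classifies the dataset with scikit-learn's k-NN; the one-hot encoding of categorical attributes, the train/test split, and k = 5 are assumptions rather than the study's final configuration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load and clean the data as in Section 2.2 (file name assumed).
df = pd.read_csv("heart.csv").dropna()

# One-hot encode categorical attributes (Sex, ChestPainType, ...).
X = pd.get_dummies(df.drop(columns="HeartDisease"))
y = df["HeartDisease"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)  # k = 5 is an assumed value
knn.fit(X_train, y_train)
print("Accuracy:", knn.score(X_test, y_test))
```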