PCA performs eigenvalue decomposition on the covariance matrix to obtain its eigenvalues and corresponding eigenvectors. It then selects the top N principal components according to the magnitude of the eigenvalues, where N is the desired dimensionality after reduction and is chosen by the author. After computing how much information each principal component retains, the components are ranked and the top 25 PCs are selected to train the KNN model.
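A minimal sketch of this step, assuming scikit-learn and hypothetical arrays X_train, X_test, and y_train:

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Keep the top 25 principal components, then train KNN on the reduced features.
pca = PCA(n_components=25)
X_train_pca = pca.fit_transform(X_train)   # fit PCA on the training data only
X_test_pca = pca.transform(X_test)         # project the test data onto the same PCs

knn = KNeighborsClassifier()
knn.fit(X_train_pca, y_train)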
KNN: Prediction. K-Nearest Neighbors (KNN) is a machine learning technique used for both classification and regression; here it is used to classify the three types of celestial bodies. The KNN algorithm stores every feature vector in the training set. In this research, for each sample in the test set, the algorithm finds the k training points with the smallest Euclidean distance to that sample and assigns the sample to the majority category among those nearest neighbors. The accuracy generally changes with the value of k. Because KNN merely stores the training dataset at first and only uses it to categorize or predict new data as needed, it is sometimes referred to as a lazy learning algorithm (Bansal et al 2022). Niu et al. noted that the choice of distance metric is crucial and significantly affects nearest-neighbor-based algorithms (Niu et al 2013). In this paper, the author uses the Euclidean distance:
$d = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$ (5)
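A small numeric sketch of Eq. (5), using NumPy and two made-up feature vectors:

import numpy as np

# Euclidean distance between a test sample x and a stored training sample y
# (hypothetical values).
x = np.array([0.2, 1.5, -0.7])
y = np.array([0.1, 1.1, 0.3])
d = np.sqrt(np.sum((x - y) ** 2))   # square root of the summed squared differences
print(d)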
2.5 Evaluation Criteria
Confusion matrix: The confusion matrix counts the number of samples assigned to the correct and incorrect categories, displaying the prediction outcome for every pair of true and predicted classes. It not only helps detect mistakes but also shows the type of each error. At the same time, the confusion matrix makes it simple to compute other higher-level classification metrics.
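A minimal sketch of building such a matrix with scikit-learn, assuming hypothetical label arrays y_test (true classes) and y_pred (KNN predictions) and hypothetical names for the three classes:

from sklearn.metrics import confusion_matrix

# Rows correspond to the true classes and columns to the predicted classes.
labels = ["STAR", "GALAXY", "QSO"]   # assumed class names
cm = confusion_matrix(y_test, y_pred, labels=labels)
print(cm)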
Accuracy: The percentage of correct predictions. It is one of the most commonly used metrics in multi-class classification; its formula takes the number of correctly predicted examples as the numerator and the sum of all entries of the confusion matrix as the denominator (Grandini et al 2020). In this study, it represents the percentage of correctly predicted test samples out of all test samples.
$\mathrm{Accuracy} = \dfrac{\text{correctly predicted test samples}}{\text{total test samples}}$ (6)
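As a short illustration of Eq. (6), reusing the hypothetical confusion matrix cm from the sketch above (its diagonal holds the correctly classified samples):

import numpy as np

# Correctly classified samples divided by all test samples.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)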
Macro Average Precision (MAP): The average of each category's precision, where precision is the proportion of correct predictions among the instances predicted as positive. For the STAR class, the numerator is the number of examples correctly identified as STAR, and the denominator is the total number of examples predicted as STAR, including non-STAR examples misclassified as STAR.
$\mathrm{Macro\ Average\ Precision} = \dfrac{1}{N}\sum_{i=1}^{N} \mathrm{Precision}_i$ (7)

where N is the number of classes (three in this study).
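As a sketch of Eq. (7), the macro average precision can also be computed with scikit-learn's precision_score, again assuming the hypothetical y_test and y_pred arrays:

from sklearn.metrics import precision_score

# Per-class precision (e.g. true STARs among all samples predicted as STAR),
# averaged with equal weight over the classes.
map_score = precision_score(y_test, y_pred, average="macro")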
Macro Average Recall (MAR): The average of the recall of each category. In this research, the recall of a class is the percentage of samples that the model correctly predicts as belonging to that class out of all the samples that actually belong to it.
$\mathrm{Macro\ Average\ Recall} = \dfrac{1}{N}\sum_{i=1}^{N} \mathrm{Recall}_i$ (8)
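Correspondingly, a sketch of Eq. (8) under the same assumptions:

from sklearn.metrics import recall_score

# Per-class recall (e.g. true STARs among all actual STAR samples),
# averaged with equal weight over the classes.
mar_score = recall_score(y_test, y_pred, average="macro")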
Macro F1-Score: The Macro F1-score is the harmonic mean of the macro average precision and the macro average recall. Because neither MAP nor MAR can be used on its own to assess a model, the Macro F1-score balances the two indicators. Algorithms that perform well across all categories exhibit a high Macro F1-score, while algorithms with inaccurate predictions produce a low one (Grandini et al 2020).
$\mathrm{Macro\ F1\text{-}score} = \dfrac{2 \times \mathrm{MAP} \times \mathrm{MAR}}{\mathrm{MAP} + \mathrm{MAR}}$ (9)
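As a small illustration of Eq. (9), reusing the hypothetical map_score and mar_score values from the sketches above:

# Harmonic mean of the macro average precision and macro average recall.
macro_f1 = 2 * map_score * mar_score / (map_score + mar_score)
print(macro_f1)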
3 RESULTS
3.1 Data Dimensionality Reduction
In the experiment, the author compared the explained variance, accuracy, and training time when using from 1 to 25 principal components (PCA(n_components=i)) together with KNN. The three metrics are plotted against the number of principal components in the corresponding figures.
The explained variance increases with the number of PCs, but the rate of increase gradually slows down. After the number of PCs reaches 23, there is no significant further increase, indicating that essentially all of the retainable information has been captured with 23 PCs.
In Fig. 7, accuracy rises as the number of PCs increases, but the pace of improvement slows once there are five PCs. The accuracy increases only slightly between 5 and 20 PCs and no longer improves when the number of PCs exceeds 20.