3.1 Predicting MPs Using Only Amino Acid Composition
First, only the compositions of the 20 amino acids were used to calculate the distance (i.e., BS-WED) from each test protein to each training protein. Ten-fold cross-validation was used to evaluate performance on the dataset. The parameter θ was set to 1, and k values ranging from 1 to 30 were tried. For comparison, we also searched for the best performance of the traditional KNN method using the standard Euclidean distance over the same range of k (i.e., 1-30). As can be seen from Table 1 (row 2), MPKNN achieved 80.3% accuracy and 0.579 MCC when k = 20. In comparison, the standard KNN achieved its best performance of 76.4% accuracy and 0.490 MCC when k = 19. This indicates that BS-WED is a better distance measure than the standard Euclidean distance for quantifying the relationship between a query protein and the training proteins (i.e., MPs or non-MPs). The detailed prediction performance of MPKNN using only the compositions of the 20 amino acids is shown in Table 2 (Column 2).
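For illustration only, the following Python sketch shows one way a BS-WED-based KNN prediction could be organized. The weighting form used here (scaling the Euclidean distance by a factor derived from the pairwise bit score and θ) is an assumption for the sketch; the exact BS-WED definition is the one given in Methods and Materials.

```python
import numpy as np

def bs_wed(x_query, x_train, bit_score, theta=1.0):
    # Hypothetical sketch: down-weight the Euclidean distance between two
    # feature vectors when the pairwise BLAST bit score is high. The exact
    # weighting (and the role of theta) follows Methods and Materials,
    # not this assumed form.
    euclid = np.linalg.norm(np.asarray(x_query, float) - np.asarray(x_train, float))
    weight = 1.0 / (1.0 + theta * bit_score)  # assumed weighting form
    return weight * euclid

def mpknn_predict(x_query, X_train, y_train, bit_scores, k=20, theta=1.0):
    # Classify a query protein (1 = MP, 0 = non-MP) by majority vote
    # of its k nearest training proteins under BS-WED.
    y_train = np.asarray(y_train)
    dists = np.array([bs_wed(x_query, x, b, theta)
                      for x, b in zip(X_train, bit_scores)])
    nearest = np.argsort(dists)[:k]
    return int(y_train[nearest].sum() * 2 >= k)
```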
3.2 Prediction Performance of Each
Feature Set
The proposed KNN method based on BS-WED was tested on each feature set individually to assess its usefulness in predicting MPs. The parameter θ was set to its default value (i.e., 1). Various k values ranging from 1 to 30 were tried, and the best performance was kept. We also evaluated the prediction performance of all features combined. The results are listed in Table 1. The prediction performance could not be further improved by simply combining all of the features investigated in this study. This could be due to the fact that not all features were useful for the prediction. In addition, some features may be correlated with each other, which could impair the prediction performance.
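As a sketch of the evaluation protocol described above (ten-fold cross-validation repeated for k = 1 to 30, keeping the best result), the following Python fragment uses scikit-learn utilities; predict_fn is a hypothetical stand-in for the BS-WED KNN classifier and is not part of the original method description.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, matthews_corrcoef

def evaluate_feature_set(X, y, predict_fn, k_values=range(1, 31), n_splits=10):
    # X: (n_proteins, n_features) numpy array for one feature set
    # y: binary labels (1 = MP, 0 = non-MP)
    # predict_fn(X_tr, y_tr, X_te, k): hypothetical BS-WED KNN classifier
    best = None
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for k in k_values:
        y_true, y_pred = [], []
        for train_idx, test_idx in cv.split(X, y):
            pred = predict_fn(X[train_idx], y[train_idx], X[test_idx], k)
            y_true.extend(y[test_idx])
            y_pred.extend(pred)
        acc = accuracy_score(y_true, y_pred)
        mcc = matthews_corrcoef(y_true, y_pred)
        if best is None or acc > best[1]:
            best = (k, acc, mcc)
    return best  # (best k, accuracy, MCC)
```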
Table 2: Comparison of different methods.

Method      MPKNN + 20 AAs   MPKNN + 18 selected features   Shirafkan et al.'s method
Recall      91.6%            91.6%                          -
Precision   79.4%            82.4%                          74%
Acc         80.3%            82.9%                          77%
MCC         0.579            0.635                          -
F-Measure   0.851            0.868                          0.75
AUC         0.750            0.862                          0.75
3.3 Performance After Feature
Selection
We then sought to improve the prediction performance by applying the heuristic feature selection process described in Methods and Materials to search for a subset of features that was (almost) most useful for the prediction. In the end, a set of 18 features was chosen, consisting of the compositions of 12 amino acids {A, N, D, C, E, G, I, L, K, F, S, T}, 4 delta-function factors, and 2 physicochemical factors {EISD840101_2 and HOPT810101_10}. Adding more features did not improve the prediction performance. As can be seen from Table 2 (Column 3), using the selected features, MPKNN improved its performance to 91.6% recall, 82.4% precision, 82.9% accuracy, 0.635 MCC, 0.868 F-measure, and 0.862 AUC.
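As a simple consistency check, the reported F-measure follows directly from the listed precision and recall: F = 2 × Precision × Recall / (Precision + Recall) = (2 × 0.824 × 0.916) / (0.824 + 0.916) ≈ 0.868, matching the value in Table 2 (Column 3).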
We also compared our final MPKNN (i.e., using the 18 selected features) with the previously published method by Shirafkan et al. (2021) on the same benchmark dataset. The prediction performance of Shirafkan et al.'s method was obtained directly from their report (as shown in Table 2, Column 4). Table 2 clearly shows that our method achieves superior performance to that of Shirafkan et al.
4 CONCLUSIONS
In this study, we present MPKNN, a KNN method that predicts MPs with 91.6% recall, 82.4% precision, 82.9% accuracy, 0.635 MCC, 0.868 F-measure, and 0.862 AUC. The method is based on a bit-score weighted Euclidean distance (BS-WED) to measure the similarity between proteins. Compared to the standard Euclidean distance, BS-WED takes into account both composition and sequence similarity.
The benchmark dataset used in this study was relatively small, and therefore feature selection (a wrapper method) and cross-validation were performed on the same dataset to avoid insufficient training. To better estimate the generalization ability of our method, it would be preferable to carry out these two processes on two separate, non-overlapping datasets. In future work, we plan to curate a more comprehensive dataset with a larger number of MP and non-MP proteins from various protein databases so that we can split it into two parts, with the first part reserved for feature selection and the second part for cross-validation (i.e., training and testing)
using the selected features. We also plan to
investigate the possibility of improving the prediction