3 RESULTS AND DISCUSSION
First, we had to determine specific cutoff values so as to systematically control how similar or dissimilar the complexes incorporated for training are to the test set. After trying different settings, we adopted the following scheme: for protein structure similarity, the cutoff value c increases from 0.40 to 1.00 with a step size of 0.01 in one sweep direction, and decreases from 0.99 to 0.40 and then to 0 in the opposite direction; for ligand fingerprint similarity, c increases from 0.55 to 1.00 in the former direction, and decreases from 0.99 to 0.55 and then to 0 in the latter; for pocket topology dissimilarity, c decreases from 10.0 to 0 with a step size of 0.2 in the former direction, and increases from 0.2 to 10.0 and then to +∞ in the latter.
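As a minimal sketch (our own illustration, not the authors' code), the three cutoff grids described above could be generated with NumPy as follows; the variable names are assumptions made for clarity.

    import numpy as np

    # Cutoff grids for the three similarity measures (values from the text).
    # The first array of each pair follows one sweep direction, the second the
    # opposite direction; names are illustrative only.

    # Protein structure similarity: 0.40 -> 1.00 in steps of 0.01 (61 cutoffs),
    # then 0.99 -> 0.40 followed by 0 in the opposite direction.
    ps_up   = np.round(np.arange(0.40, 1.00 + 1e-9, 0.01), 2)
    ps_down = np.append(np.round(np.arange(0.99, 0.40 - 1e-9, -0.01), 2), 0.0)

    # Ligand fingerprint similarity: 0.55 -> 1.00, then 0.99 -> 0.55 followed by 0.
    lf_up   = np.round(np.arange(0.55, 1.00 + 1e-9, 0.01), 2)
    lf_down = np.append(np.round(np.arange(0.99, 0.55 - 1e-9, -0.01), 2), 0.0)

    # Pocket topology dissimilarity: 10.0 -> 0 in steps of 0.2,
    # then 0.2 -> 10.0 followed by +inf in the opposite direction.
    pt_down = np.round(np.arange(10.0, 0.0 - 1e-9, -0.2), 1)
    pt_up   = np.append(np.round(np.arange(0.2, 10.0 + 1e-9, 0.2), 1), np.inf)

    print(len(ps_up), len(ps_down))  # 61 cutoffs in each protein-structure sweep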
Next, we plotted the number of training complexes against the three types of cutoff (Figure 1), which shows visibly that these distributions are far from even. In fact, the distribution of training
complexes under the protein structure similarity
measure is extraordinarily skewed, e.g. as many as
859 training complexes (accounting for 31% of the
original full training set of 2764 complexes) have a
test set similarity greater than 0.99 (note the sheer
height of the rightmost bar), and 156 training
complexes have a test set similarity in the range of
(0.98, 0.99]. Incrementing the cutoff by just 0.01 from
0.99 to 1.00 will include 859 additional training
complexes, whereas incrementing the cutoff by the
same step size from 0.90 to 0.91 will include merely
17 additional training complexes, and none at all from
0.72 to 0.73. Therefore, one would seemingly expect
a significant performance gain from raising the cutoff
by just 0.01 if the cutoff is already at 0.99. This is also
true, although less apparent, for ligand fingerprint
similarity, where 179 training complexes have a test
set similarity greater than 0.99. The distribution under
the pocket topology dissimilarity measure, however,
seems relatively more uniform, with just 15
complexes falling in the range of [0, 0.2) and just 134
complexes in the range of [10, +∞). Hence this supplementary similarity measure based on pocket topology, which is novel in this study, offers a different tool for investigating the influence of data similarity on the scoring power of SFs, one in which the training set size is not biased towards either end of the cutoff range.
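The bar heights in Figure 1 can be thought of as the number of complexes added per cutoff increment. A minimal sketch of this bookkeeping, assuming a precomputed matrix of similarities between training and test complexes and the selection rule that a training complex enters the nested set when its maximum similarity to the test set does not exceed the cutoff (both are our assumptions for illustration):

    import numpy as np

    # Placeholder similarity matrix: 2764 training complexes x 195 test complexes
    # (the size of the PDBbind v2013 core set); real values would come from the
    # similarity measures described earlier.
    rng = np.random.default_rng(0)
    sim_matrix = rng.random((2764, 195))
    test_set_similarity = sim_matrix.max(axis=1)   # one value per training complex

    cutoffs = np.round(np.arange(0.40, 1.00 + 1e-9, 0.01), 2)
    # Nested training set size at each cutoff (assumed rule: similarity <= cutoff).
    set_sizes = np.array([(test_set_similarity <= c).sum() for c in cutoffs])
    # Complexes added by each 0.01 increment of the cutoff (the bar heights).
    added = np.diff(set_sizes)
    for c, n in zip(cutoffs[1:], added):
        print(f"cutoff {c:.2f}: +{n} training complexes")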
Keeping in mind the uneven distributions illustrated above, we re-trained the three classical SFs
(MLR::Xscore, MLR::Vina, MLR::Cyscore) and the
four machine-learning SFs (RF::Xscore, RF::Vina,
RF::Cyscore, RF::XVC) on the 61 nested training sets
generated with the protein structure similarity measure,
evaluated their scoring power on the PDBbind v2013
core set, and plotted their predictive performance (in
terms of Rp, Rs and RMSE) on a consistent scale
against both cutoff value and number of training
complexes in two similarity directions (Figure 2).
Looking at the top row alone, where RF::Xscore was
not able to surpass MLR::Xscore until the similarity
cutoff reached 0.99, it is not surprising that Li and Yang concluded that, after the
removal of training proteins that are highly similar to
the test proteins, machine-learning SFs did not
outperform classical SFs in Rp (Li and Yang, 2017)
(note that the v2007 dataset employed in previous studies exhibits a skewed distribution analogous to that of the v2013 dataset employed in this study; data not
shown). Nonetheless, if one looks at the second row,
which plots essentially the same result but against the
associated number of training complexes instead, it
becomes clear that RF::Xscore, trained on the 1905 complexes associated with cutoff 0.99 (about 69% of the full 2764 complexes), was able to outperform MLR::Xscore, which was already the best-performing classical SF
considered here. In terms of RMSE, RF::Xscore
surpassed MLR::Xscore at cutoff=0.91, when both were trained on just 1458 (53%) complexes whose
proteins are not so similar to those in the test set. This
is more apparent for RF::XVC, which outperformed
MLR::Xscore at a cutoff of just 0.70, corresponding
to only 1243 (45%) training complexes. In other
words, even if the original training set was split into
two halves and the half with proteins dissimilar to the
test set was used for training, machine-learning SFs
would still produce a smaller prediction error than the
best classical SF. Having said that, it does not make
sense for anyone to exclude the most relevant samples
for training (Li et al., 2018). When the full training set
was used, a large performance gap between machine-
learning and classical SFs was observed. From a
different viewpoint, when one compares the top two rows, which show basically the same result but with different horizontal axes, the crossing point where
RF::Xscore started to overtake MLR::Xscore is
located near the right edge of the subfigures in the
first row, whereas the same crossing point is
noticeably shifted to the left in the second row, suggesting that the outstanding scoring power of RF::Xscore and RF::XVC is actually attributable to the increasing training set size rather than exclusively to a high similarity cutoff value, as claimed previously.
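As a minimal sketch of the nested-training evaluation described at the start of this paragraph, under the assumption that a training complex is selected when its maximum similarity to the test set does not exceed the cutoff, and with scikit-learn's LinearRegression and RandomForestRegressor standing in for the MLR- and RF-based SFs (they are not the authors' exact implementations of MLR::Xscore or RF::Xscore):

    import numpy as np
    from scipy.stats import pearsonr, spearmanr
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression

    def evaluate_at_cutoff(cutoff, X_train, y_train, train_similarity,
                           X_test, y_test):
        """Train MLR- and RF-style regressors on the nested training set defined
        by `cutoff` and report Rp, Rs and RMSE on the fixed test set.
        Illustrative sketch only; the selection rule and models are assumptions."""
        mask = train_similarity <= cutoff              # nested training set
        X, y = X_train[mask], y_train[mask]

        results = {}
        for name, model in (("MLR", LinearRegression()),
                            ("RF", RandomForestRegressor(n_estimators=500,
                                                         random_state=0))):
            model.fit(X, y)
            pred = model.predict(X_test)
            rp = pearsonr(y_test, pred)[0]             # Pearson correlation
            rs = spearmanr(y_test, pred)[0]            # Spearman correlation
            rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
            results[name] = {"Rp": rp, "Rs": rs, "RMSE": rmse,
                             "n_train": int(mask.sum())}
        return results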
Given the skewness of the distribution of training complexes under the protein structure similarity measure, it is understandable to anticipate a remarkable performance gain from raising the cutoff by just 0.01 once it reaches 0.99, because it will