Table 1: Performance comparison of different prediction
methods. Highest values for each measure are marked with
an asterisk (*). Se = sensitivity, Sp = specificity,
Acc = accuracy; Comp = Complete Model, Red = Reduced Model;
Unb, Over, Under = unbalanced, oversampled, and
undersampled training data; g2p = geno2pheno.

Model        Se (%)   Sp (%)   Acc (%)   MCC
Comp Unb     90.1*    91.4     91.2      0.759
Comp Over    84.2     94.0     92.0      0.796*
Comp Under   88.1     92.5     91.6      0.709
Red Unb      88.1     92.7     91.8      0.796*
Red Over     87.1     94.3     92.8      0.753
Red Under    84.2     93.8     91.8      0.722
g2p 10%      86.1     87.8     87.4      0.671
g2p 5%       79.2     93.8     90.7      0.721
g2p 2.5%     73.3     98.2*    93.0*     0.778
WebPSSM      68.3     93.5     88.3      0.635
T-CUP 2.0    82.2     93.2     90.9      0.733

usage. We used two strategies to enhance model
performance: balancing the training data and removing
zero- or near-zero-variance predictors. In
total, six approaches were evaluated (Complete or
Reduced predictor set, each with unbalanced,
oversampled, or undersampled training data).
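
The predictor-removal step can be illustrated with a minimal sketch. It assumes one common heuristic for flagging near-zero-variance predictors (a high ratio between the two most frequent values combined with a low percentage of distinct values). The function names and the default cutoffs `freq_cut` and `unique_cut` are assumptions made for illustration, not the settings used in this study.

```python
from collections import Counter


def near_zero_variance(column, freq_cut=19.0, unique_cut=10.0):
    """Return True if a predictor column has zero or near-zero variance.

    freq_cut and unique_cut are illustrative defaults (assumptions).
    """
    counts = Counter(column).most_common()
    if len(counts) == 1:
        return True  # zero variance: a single distinct value
    # Ratio of the most frequent value to the second most frequent one.
    freq_ratio = counts[0][1] / counts[1][1]
    # Percentage of distinct values relative to the number of samples.
    percent_unique = 100.0 * len(counts) / len(column)
    return freq_ratio > freq_cut and percent_unique < unique_cut


def drop_uninformative(rows, freq_cut=19.0, unique_cut=10.0):
    """Return the indices of the kept columns and the filtered rows."""
    columns = list(zip(*rows))
    keep = [
        j for j, col in enumerate(columns)
        if not near_zero_variance(col, freq_cut, unique_cut)
    ]
    return keep, [[row[j] for j in keep] for row in rows]
```

Applied to the Complete predictor set, a filter of this kind would yield a smaller set analogous to the Reduced Model.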
The proposed models performed very similarly to
each other. This result corroborates previous
studies showing that the random forest algorithm
is robust even with unbalanced training data
(Dittman, Khoshgoftaar, & Napolitano, 2015).
However, the error rates of the models suggest that
oversampling is the better-suited approach for
this type of problem.
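
As a rough illustration of the two balancing strategies, the sketch below implements plain random oversampling (duplicating minority-class samples) and random undersampling (discarding majority-class samples). The text does not specify which resampling procedure was used, so this simple variant, the function name `rebalance`, and the class labels in the usage note are assumptions.

```python
import random


def rebalance(samples, labels, mode="over", seed=0):
    """Randomly over- or undersample so that all classes have equal size.

    mode="over": duplicate random samples up to the largest class size.
    mode="under": keep a random subset down to the smallest class size.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    sizes = [len(xs) for xs in by_class.values()]
    target = max(sizes) if mode == "over" else min(sizes)
    out = []
    for y, xs in by_class.items():
        if mode == "over":
            # Draw with replacement to fill the gap to the target size.
            chosen = xs + [rng.choice(xs) for _ in range(target - len(xs))]
        else:
            # Draw without replacement down to the target size.
            chosen = rng.sample(xs, target)
        out.extend((x, y) for x in chosen)
    rng.shuffle(out)
    return [x for x, _ in out], [y for _, y in out]
```

For example, with two samples labelled "X4" and eight labelled "non-X4", `mode="over"` returns eight of each class and `mode="under"` returns two of each. Resampling should be applied to the training data only, so that duplicated samples do not leak into the evaluation set.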
T-CUP 2.0 uses the random forest algorithm, like our
model. However, the data preparation for our model
influenced its performance, perhaps through the
choice of the Engelman hydrophobicity scale instead of
the Kyte-Doolittle scale (Kyte & Doolittle, 1982) used
for T-CUP 2.0. Therefore, the numerical conversion
of amino acids should be evaluated and
considered an important factor in the development
of genotypic models.
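
The numerical conversion discussed above can be sketched as a lookup from each residue to its value on a hydrophobicity scale. The example below includes only the Kyte-Doolittle values; the Engelman scale could be passed in the same way as a dictionary, but its values are not reproduced here. The function name `encode_sequence`, the treatment of gaps and unknown residues, and the sequence fragment in the usage note are assumptions for illustration.

```python
# Kyte-Doolittle hydropathy index (Kyte & Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(seq, scale, gap_value=0.0):
    """Convert an amino acid sequence into a list of numeric descriptors.

    scale maps one-letter residue codes to numbers. Gaps and unknown
    residues receive gap_value, which is an assumption of this sketch.
    """
    return [scale.get(residue, gap_value) for residue in seq.upper()]
```

Calling `encode_sequence("CTRPNN", KYTE_DOOLITTLE)` turns the six-residue fragment into six numeric predictors. Because the scale is a parameter, two encodings of the same alignment can be compared by training the same classifier on each.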
Regarding the number of explanatory variables in
the model, both approaches (Complete and Reduced
models) showed comparable performance, suggesting
that there is no substantial difference between
them. However, for the sake of parsimony, a model
with fewer explanatory variables is preferable.
Therefore, the Reduced Model is more suitable for
our objective.
The charts also showed that the error rate of the
models barely changes after 200-250 trees in the
forest, except for the undersampled models. Thus, the
model can perform equally well with a smaller number
of trees, which speeds up prediction.
Notably, our models achieved the highest
sensitivity values. Although the geno2pheno
algorithm achieved the best specificity and
accuracy, our models showed the best
MCC values, a robust measure for evaluating
prediction methods. Our main goal in this study
was to improve the ability of algorithms to identify
viral specimens with X4 tropism. The Complete
Model with no balancing showed sensitivity and
specificity above 90%, which brings our model in line
with the European guidelines on the clinical
management of HIV-1 tropism testing
(Vandekerckhove et al., 2011).
Therefore, our results are a promising step toward
a new and more accurate genotypic predictor.
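
For reference, the four measures reported in Table 1 can be computed from a confusion matrix as in the sketch below, taking X4 as the positive class. The function name `tropism_metrics` is hypothetical, and the counts in the usage note are invented for illustration; they are not the data of this study.

```python
from math import sqrt


def tropism_metrics(tp, fn, tn, fp):
    """Compute Se, Sp, Acc, and MCC from confusion-matrix counts.

    tp/fn: X4 samples predicted correctly/incorrectly (positive class).
    tn/fp: non-X4 samples predicted correctly/incorrectly.
    """
    se = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal is zero; return 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": se, "Sp": sp, "Acc": acc, "MCC": mcc}
```

With hypothetical counts such as `tropism_metrics(tp=90, fn=10, tn=360, fp=40)`, sensitivity, specificity, and accuracy are all 0.90, while MCC is noticeably lower. Unlike accuracy, MCC uses all four cells of the confusion matrix, which is why it is informative for unbalanced classes.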
REFERENCES
Barré-Sinoussi, F., Chermann, J. C., Rey, F., Nugeyre, M.
T., Chamaret, S., Gruest, J., … Montagnier, L. (1983).
Isolation of a T-lymphotropic retrovirus from a patient
at risk for acquired immune deficiency syndrome
(AIDS). Science (New York, N.Y.), 220(4599), 868–871.
https://doi.org/10.1126/science.6189183
Berger, E. A., Doms, R. W., Fenyö, E.-M. M., Korber, B.
T. M., Littman, D. R., Moore, J. P., … Weiss, R. A.
(1998). A new classification for HIV-1. Nature,
391(6664), 240. https://doi.org/10.1038/34571
Breiman, L. (2001). Random forests. Machine Learning.
https://doi.org/10.1023/A:1010933404324
Clapham, P. R., & McKnight, Á. (2001). HIV-1 receptors
and cell tropism. British Medical Bulletin, 58, 43–59.
https://doi.org/10.1093/bmb/58.1.43
Dietterich, T. (1995). Overfitting and Undercomputing in
Machine Learning. ACM Computing Surveys (CSUR),
27(3), 326–327. https://doi.org/10.1145/212094.212114
Dittman, D. J., Khoshgoftaar, T. M., & Napolitano, A.
(2015). The Effect of Data Sampling When Using
Random Forest on Imbalanced Bioinformatics Data. In
Proceedings - 2015 IEEE 16th International
Conference on Information Reuse and Integration, IRI
2015. https://doi.org/10.1109/IRI.2015.76
Engelman, D. M., Steitz, T. A., & Goldman, A. (1986).
Identifying nonpolar transbilayer helices in amino acid
sequences of membrane proteins. Annual Review of
Biophysics and Biophysical Chemistry.
https://doi.org/10.1146/annurev.bb.15.060186.001541
Gallo, R. C., Sarin, P. S., Gelmann, E. P., Robert-Guroff,
M., Richardson, E., Kalyanaraman, V. S., … Popovic,
M. (1983). Isolation of human T-cell leukemia virus in
acquired immune deficiency syndrome (AIDS).
Science. https://doi.org/10.1126/science.6601823
Heider, D., Dybowski, J. N., Wilms, C., & Hoffmann, D.
(2014). A simple structure-based model for the
prediction of HIV-1 co-receptor tropism. BioData
Mining, 7, 14. https://doi.org/10.1186/1756-0381-7-14
Jensen, M. A., Li, F.-S., van ’t Wout, A. B., Nickle, D. C.,
Shriner, D., He, H.-X., … Mullins, J. I. (2003).