Eduardo Campos dos Santos, Braulio Roberto Gonçalves Marinho Couto, Marcos A. dos Santos, Julio Cesar Dias Lopes


Drug target identification and validation are critical steps in the drug discovery pipeline, so predicting potential “druggable targets”, that is, targets that can be modulated by some drug, is highly relevant to drug discovery. Structural bioinformatics approaches for predicting “druggable domains” have been proposed, but they apply only to proteins with a solved structure or a reliable homology model. We show that available protein annotation terms can be used to build semantic similarity measures that support target similarity searching, and we use them to develop a tool for predicting potential drug targets. We analysed 1,541 human protein drug targets and 29,580 human proteins that are not validated as drug targets but share some InterPro annotations with a known drug target. We developed a semantic similarity measure by applying singular value decomposition to the InterPro terms associated with drug targets, performed statistical analyses and built logistic regression models. The resulting probabilistic model, summarised in a closed mathematical formula, predicts human protein drug targets with a sensitivity of 89% and a specificity of 67%.
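The abstract does not reproduce the closed formula, so the following is only a minimal sketch of the general shape such a model could take: a logistic function applied to a similarity score, followed by the standard definitions of sensitivity and specificity. The coefficients `INTERCEPT` and `SLOPE`, the 0.5 threshold, the function names and the toy data are all hypothetical and are not taken from the paper.

```python
import math


def target_probability(similarity, intercept, slope):
    """Generic logistic model: P = 1 / (1 + exp(-(intercept + slope * similarity)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * similarity)))


def sensitivity_specificity(labels, probabilities, threshold=0.5):
    """Return (sensitivity, specificity) for predictions thresholded at `threshold`.

    Assumes `labels` contains at least one positive and one negative example.
    """
    tp = fn = tn = fp = 0
    for is_target, p in zip(labels, probabilities):
        predicted = p >= threshold
        if is_target and predicted:
            tp += 1
        elif is_target:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical coefficients and toy data, for illustration only.
INTERCEPT, SLOPE = -2.0, 4.0
similarities = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]      # made-up similarity scores
labels = [True, True, True, False, False, False]   # True = known drug target

probabilities = [target_probability(s, INTERCEPT, SLOPE) for s in similarities]
sens, spec = sensitivity_specificity(labels, probabilities)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Sensitivity is the fraction of known targets that the model flags, and specificity is the fraction of non-targets that it correctly leaves unflagged. Moving the threshold trades one against the other.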



Paper Citation

in Harvard Style

Campos dos Santos E., Gonçalves Marinho Couto B., A. dos Santos M. and Dias Lopes J. (2012). PREDICTING NEW HUMAN DRUG TARGETS BY USING FEATURE SELECTION TECHNIQUES. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012), ISBN 978-989-8425-90-4, pages 137-142. DOI: 10.5220/0003734501370142

in Bibtex Style

author={Eduardo Campos dos Santos and Braulio Roberto Gonçalves Marinho Couto and Marcos A. dos Santos and Julio Cesar Dias Lopes},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)},

in EndNote Style

JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012)
SN - 978-989-8425-90-4
AU - Campos dos Santos E.
AU - Gonçalves Marinho Couto B.
AU - A. dos Santos M.
AU - Dias Lopes J.
PY - 2012
SP - 137
EP - 142
DO - 10.5220/0003734501370142