Computer Annotation of Nucleic Acid Sequences in Bacterial Genomes Using Phylogenetic Profiles

Mikhail A. Golyshev, Eugene V. Korotkov


In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes and then estimating the statistical significance of the similarities found. The profiles of genes with known functions were used to predict possible functions and functional groups for new genes. We conducted the functional annotation of genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For genes that had already been annotated, the known function matched the function we predicted 63% of the time, and 86% of the time the known function was found within the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or a complement to current annotation systems.
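
The profile-based prediction described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the significance threshold, the Jaccard similarity measure, the function names `build_profile` and `rank_functions`, and the toy data are all assumptions made for illustration, and the six-genome profiles stand in for the 1204 reference genomes.

```python
from collections import defaultdict


def build_profile(scores, threshold):
    """Binarize per-genome similarity scores: 1 if significant, else 0.

    `scores` holds one (hypothetical) significance score per reference genome.
    """
    return tuple(1 if s >= threshold else 0 for s in scores)


def jaccard(p, q):
    """Jaccard similarity of two binary profiles (an assumed measure)."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def rank_functions(query, annotated, top_n=5):
    """Rank known functions by their best profile similarity to the query."""
    best = defaultdict(float)
    for profile, function in annotated:
        best[function] = max(best[function], jaccard(query, profile))
    # Sort by descending similarity, then by name for a stable order.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Toy data: profiles over six reference genomes with invented annotations.
annotated = [
    ((1, 1, 0, 0, 1, 0), "DNA repair"),
    ((1, 1, 0, 1, 1, 0), "DNA repair"),
    ((0, 0, 1, 1, 0, 1), "flagellar assembly"),
    ((1, 0, 1, 0, 0, 1), "ABC transporter"),
]

query = build_profile([9.1, 7.4, 0.3, 1.2, 8.8, 0.5], threshold=5.0)
ranked = rank_functions(query, annotated)
for function, score in ranked:
    print(f"{function}: {score:.2f}")
```

Returning a ranked list rather than a single label mirrors the evaluation reported above, where a prediction is counted both when the top function matches and when the known function appears among the top five.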


Paper Citation

in Harvard Style

Golyshev M. A. and Korotkov E. V. (2015). Computer Annotation of Nucleic Acid Sequences in Bacterial Genomes Using Phylogenetic Profiles. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015) ISBN 978-989-758-070-3, pages 134-143. DOI: 10.5220/0005236201340143

in Bibtex Style

author={Mikhail A. Golyshev and Eugene V. Korotkov},
title={Computer Annotation of Nucleic Acid Sequences in Bacterial Genomes Using Phylogenetic Profiles},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)},

in EndNote Style

JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2015)
TI - Computer Annotation of Nucleic Acid Sequences in Bacterial Genomes Using Phylogenetic Profiles
SN - 978-989-758-070-3
AU - Golyshev, M. A.
AU - Korotkov, E. V.
PY - 2015
SP - 134
EP - 143
DO - 10.5220/0005236201340143