
Although the developed system does not predict gene 
functions exactly (its precision is about 63%, see 
Table 5), it can be used as an alternative or a 
complement to existing annotation systems: the 
existing systems predict functions for genes from sets 
C4 and C5, whereas our system covers genes from 
sets C3 and C5. Using our system can therefore 
increase the share of annotated bacterial genes by 
19% (the size of the C3 set).  
The precision of 63% was obtained for $P_0 = 10^{-7}$ 
and $Z = 5.0$ (see 2.1). Both thresholds were chosen 
with a large margin. An upper limit on the number of 
false positives in the C2 set can be estimated from the 
profiles obtained for mixed genes (see 2.1). Only 0.4% 
of these profiles contain at least one "1"; the 
remaining profiles contain only zeros. All-zero 
profiles have $P > P_0$ and are automatically 
eliminated from consideration. However, 39 "random 
genes" with at least one "1" received profiles with 
$P < P_0$. This means that the upper limit of false 
positives for N2 (the C2 set) is less than 0.01%. Thus, 
false positives have only a small effect on our results. 
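The false-positive estimate above can be illustrated with a minimal Python sketch. It assumes that every profile built for a randomized ("mixed") gene has already been assigned a probability P as described in 2.1. The function name, the input list, and the toy values are hypothetical and are not taken from the study; only the threshold 10^-7 comes from the text.

```python
def false_positive_fraction(p_values, p0=1e-7):
    """Return the fraction of control profiles that pass the filter P < p0.

    p_values: probabilities P assigned to profiles of randomized genes
              (hypothetical input; how P is computed is not shown here).
    p0:       significance threshold; 1e-7 is the value quoted in the text.
    """
    if not p_values:
        raise ValueError("p_values must not be empty")
    passed = sum(1 for p in p_values if p < p0)
    return passed / len(p_values)


# Toy data, for illustration only: all-zero profiles are given P = 1.0
# (they always fail the filter), one profile passes, one does not.
toy_p_values = [1.0] * 9998 + [1e-9, 1e-3]
fraction = false_positive_fraction(toy_p_values)
print(f"Fraction of control profiles with P < P0: {fraction:.4%}")
```

Under these assumptions, the returned fraction plays the role of the upper limit discussed above: it measures how often a profile with no real signal still passes the threshold.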
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms