Although the developed system does not predict
gene functions exactly (the precision is about 63%,
see Table 5), it can be used as an alternative or a
complement to the existing annotation systems: the
existing systems predict functions for genes from
sets C4 and C5, whereas our system covers functions
for genes from sets C3 and C5. Therefore, the use of
our system can increase the share of annotated
bacterial genes by 19% (the size of the C3 set).
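The coverage argument above can be illustrated with a small set computation. This is only a sketch on toy data: the gene identifiers, the total of 100 genes, and the sizes chosen for C4 and C5 are hypothetical, and the size of C3 is set to 19 genes purely so that the toy result matches the 19% figure quoted in the text. The function name `coverage_gain` is likewise an assumption made for illustration.

```python
def coverage_gain(existing, ours, all_genes):
    """Fraction of all genes that only the new system annotates."""
    newly_covered = ours - existing  # genes not reached by existing systems
    return len(newly_covered) / len(all_genes)


# Hypothetical toy data: 100 gene identifiers split into disjoint sets.
all_genes = set(range(100))
c3 = set(range(0, 19))    # covered only by our system (assumed size)
c4 = set(range(19, 40))   # covered only by existing systems (assumed size)
c5 = set(range(40, 60))   # covered by both (assumed size)

existing = c4 | c5        # what existing annotation systems cover
ours = c3 | c5            # what the developed system covers

gain = coverage_gain(existing, ours, all_genes)
print(f"additional share of annotated genes: {gain:.0%}")
```

Because C5 is shared by both systems, the set difference leaves exactly C3, which is why the gain equals the relative size of the C3 set.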
The 63% precision of gene function predictions
was obtained for $P_0 = 10^{-7}$ and $Z = 5.0$ (see
2.1). $P_0$ and $Z$ were chosen with a large margin.
It is possible to define an upper limit for the number
of false positives in the C2 set. For this purpose we
can use the number of profiles that have at least one
"1" obtained for mixed genes (see 2.1). These
profiles made up 0.4% of the total; the other
profiles contain only zeros. Profiles containing only
zeros have $P > P_0$ and are automatically
eliminated from our consideration. However, 39
"random genes" that have at least one "1" received
profiles with $P < P_0$. This means that the upper
limit on the share of false positives for N2 (the C2
set) is less than 0.01%. Thus, false positives have a
small effect on our results.
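The arithmetic behind this upper limit can be sketched as follows. The text gives the count of 39 "random genes" passing the threshold and the bound of 0.01%, but not the total against which the share is computed. The sketch therefore only derives how large that total must be for the stated bound to hold. The function names are hypothetical, and exact fractions are used to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction


def false_positive_share(n_passing, n_total):
    """Exact share of passing random profiles among n_total profiles."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return Fraction(n_passing, n_total)


def min_total_for_bound(n_passing, bound):
    """Smallest integer total for which n_passing / total < bound."""
    return math.floor(Fraction(n_passing) / bound) + 1


n_passing = 39               # random genes with P < P_0 (from the text)
bound = Fraction(1, 10_000)  # 0.01% (from the text)

needed = min_total_for_bound(n_passing, bound)
print(f"the total must be at least {needed} for the share to stay below 0.01%")
```

Under these assumptions the stated bound implies a total of more than 390,000 profiles; the actual total is not given in this section and is not assumed here.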
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms