ognize the presence of novel gene annotations using
the obsolete annotation profile of the gene. The appli-
cation of this method requires two different versions
of the annotation matrix to build representations of
the training data. However, biologists typically have
available only the most updated version of the gene
annotation matrix. Given this constraint, we have proposed
a method to represent the training data using
a single annotation matrix as input. It creates a second
annotation matrix, which stands in for an older version
of the input matrix, by perturbing the input so that some
of its annotations are randomly removed. This allows
supervised algorithms to be used even on datasets without
labels, and allows their results to be compared with those
obtained by unsupervised methods on the same originally
unlabeled datasets.
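The perturbation step described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: it assumes a binary gene-by-term matrix stored as nested lists, treats p as an independent removal probability for each existing annotation, and ignores the ontology hierarchy. The function names perturb_annotations and build_training_pairs are hypothetical.

```python
import random


def perturb_annotations(matrix, p, seed=None):
    """Return a copy of a binary gene-by-term matrix in which each
    existing annotation (1) is set to 0 with probability p.

    Assumption: removals are independent and the ontology structure
    is not taken into account.
    """
    rng = random.Random(seed)
    return [
        [0 if value == 1 and rng.random() < p else value for value in row]
        for row in matrix
    ]


def build_training_pairs(matrix, p, seed=None):
    """Pair each perturbed gene profile (features, the simulated older
    version) with its profile in the input matrix (labels)."""
    older = perturb_annotations(matrix, p, seed)
    return list(zip(older, matrix))


# Toy input: 3 genes (rows) by 4 ontology terms (columns).
current = [
    [1, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
]

pairs = build_training_pairs(current, p=0.3, seed=0)
for features, labels in pairs:
    print(features, "->", labels)
```

Each pair can then be passed to a supervised learner: the perturbed profile plays the role of the older annotation profile, and the unperturbed profile supplies the labels.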
The results obtained are very encouraging, since they
show a marked improvement over unsupervised
techniques. Furthermore, these results could likely
improve further with appropriate tuning of the parameters
of the supervised algorithms used; we intend to
investigate this aspect thoroughly in the future.
The results also show that increasing the number of
perturbed (removed) annotations improves the predictions,
which reach a peak when the number of artificially
missing annotations in the training set is comparable
to the number of those in the validation set, i.e. when
the variety of missing annotations has been fully
mapped in the training set.
It is also noteworthy that leaving the training matrix
unperturbed, which avoids tuning the parameter p
altogether, still yields good results.
We plan to further verify the effectiveness of the
proposed approach, also by applying weighting schemes to
the data representation.
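The observation that results peak when the number of artificially missing annotations in the training set is comparable to the number missing in the validation set suggests a simple starting value for p. The sketch below is only an illustrative heuristic under stated assumptions, not a procedure from this work: it assumes p is an independent per-annotation removal probability, so that the expected number of removals is p times the number of annotations in the training matrix. All function names and the toy matrices are hypothetical.

```python
def count_annotations(matrix):
    """Number of annotations (1 entries) in a binary matrix."""
    return sum(sum(row) for row in matrix)


def count_new_annotations(older, newer):
    """Number of entries that are 0 in the older matrix and 1 in the newer one."""
    return sum(
        1
        for row_old, row_new in zip(older, newer)
        for o, n in zip(row_old, row_new)
        if o == 0 and n == 1
    )


def matching_removal_probability(train_matrix, n_missing_validation):
    """Heuristic p whose expected number of removals in the training
    matrix equals the number of missing annotations in the validation set."""
    total = count_annotations(train_matrix)
    if total == 0:
        return 0.0
    return min(1.0, n_missing_validation / total)


# Toy data: the training matrix holds 8 annotations; the validation
# matrices differ by 2 annotations that are missing in the older version.
train = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 1]]
validation_older = [[1, 0, 0], [0, 1, 0]]
validation_newer = [[1, 1, 0], [0, 1, 1]]

n_missing = count_new_annotations(validation_older, validation_newer)
p = matching_removal_probability(train, n_missing)
print(n_missing, p)
```

In practice such a value would only be a first guess to refine by validation, since the number of missing annotations in real data is not known in advance.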
ACKNOWLEDGEMENTS
This research is part of the “GenData 2020” project
funded by the Italian MIUR. The authors would like
to thank Claudio Sartori for the useful discussions
about data mining algorithms.