Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations

Giacomo Domeniconi, Marco Masseroli, Gianluca Moro, Pietro Pinoli

Abstract

Genomic annotations describing functional features of genes and proteins through controlled terminologies and ontologies are extremely valuable, especially for computational analyses aimed at inferring new biomedical knowledge. Thanks to the revolution in biology brought about by novel DNA sequencing technologies, several repositories of such annotations have become available in the last decade; among them, those that include Gene Ontology annotations are the most relevant. Nevertheless, the available set of genomic annotations is incomplete, and only some of the available annotations represent highly reliable, human-curated information. In this paper we propose a novel representation of the annotation discovery problem that enables supervised algorithms to be applied to predict Gene Ontology annotations of genes of different organisms. To use supervised algorithms even though labeled data for training the prediction model are not available, we propose a random perturbation method of the training set, which creates a new annotation matrix that is used to train the model to recognize new annotations. We tested the effectiveness of our approach on nine Gene Ontology annotation datasets. The results show that our technique improves novel annotation predictions with respect to state-of-the-art unsupervised methods.
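
The abstract does not specify how the perturbation is carried out, so the following is only a minimal sketch of one plausible reading: known annotations in a binary gene-by-term matrix are randomly removed with some probability, and the perturbed matrix supplies the features while the original matrix supplies the labels. The function names, the removal probability p, and the choice to exclude the target term from the features are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def perturb_annotations(A, p, rng):
    """Return a copy of the binary matrix A in which each known
    annotation (a 1) is removed with probability p.

    Assumption: perturbation only deletes annotations, never adds them.
    """
    A = np.asarray(A, dtype=np.int8)
    drop = (rng.random(A.shape) < p) & (A == 1)
    return np.where(drop, 0, A).astype(np.int8)


def training_pairs(A, A_perturbed, term):
    """Build a supervised training set for one ontology term.

    Features: each gene's perturbed annotation profile, with the
    target term's column removed (an illustrative choice).
    Labels: the original, unperturbed annotation for that term.
    """
    X = np.delete(A_perturbed, term, axis=1)
    y = np.asarray(A)[:, term]
    return X, y


# Toy data: 6 genes x 5 terms, values drawn at random for illustration only.
rng = np.random.default_rng(0)
A = (rng.random((6, 5)) < 0.5).astype(np.int8)

A_pert = perturb_annotations(A, p=0.3, rng=rng)
X, y = training_pairs(A, A_pert, term=2)
```

Any supervised classifier could then be fit on (X, y) for each term; an annotation that is absent from the original matrix but receives a high predicted score would be a candidate new annotation under this reading.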



Paper Citation


in Harvard Style

Domeniconi G., Masseroli M., Moro G. and Pinoli P. (2014). Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014) ISBN 978-989-758-048-2, pages 107-116. DOI: 10.5220/0005087801070116


in Bibtex Style

@conference{kdir14,
author={Giacomo Domeniconi and Marco Masseroli and Gianluca Moro and Pietro Pinoli},
title={Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},
year={2014},
pages={107-116},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005087801070116},
isbn={978-989-758-048-2},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)
TI - Discovering New Gene Functionalities from Random Perturbations of Known Gene Ontological Annotations
SN - 978-989-758-048-2
AU - Domeniconi G.
AU - Masseroli M.
AU - Moro G.
AU - Pinoli P.
PY - 2014
SP - 107
EP - 116
DO - 10.5220/0005087801070116