Ana Stanescu, Doina Caragea


Advances in DNA sequencing technologies have made it possible to obtain tremendous amounts of data quickly and inexpensively. As a consequence, annotating the resulting genomes has become the bottleneck in our understanding of genes and their functions. Traditionally, data from biological domains have been analyzed using supervised learning techniques. However, because genomics offers large amounts of unlabeled data alongside only small amounts of labeled data, semi-supervised learning algorithms are a natural fit. Our purpose is to study the applicability of semi-supervised learning frameworks to DNA prediction problems, with a focus on alternative splicing, a natural biological process that contributes to protein diversity. More specifically, we address the problem of predicting alternatively spliced exons. To utilize the unlabeled data, we train classifiers via the Expectation Maximization (EM) method and variants of this method. Our experiments show that prediction models trained with unlabeled data are of higher quality than supervised prediction models, which do not make use of the unlabeled data.
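
The general idea of training a classifier with EM on labeled and unlabeled sequences can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the base classifier or the features, so the multinomial naive Bayes model, the 2-mer count features, the toy sequences, and all function names (`kmer_counts`, `m_step`, `e_step`, `semi_supervised_em`, `predict_proba`) are hypothetical choices made for the example.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_counts(seq, k=2):
    """Count overlapping k-mers, skipping any that contain non-ACGT symbols."""
    kmers = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return Counter(km for km in kmers if set(km) <= set(ALPHABET))


def m_step(feats, resp, vocab, n_classes=2, alpha=1.0):
    """Re-estimate naive Bayes parameters from (possibly soft) class weights."""
    log_prior, log_like = [], []
    for c in range(n_classes):
        weight_c = sum(r[c] for r in resp)
        # Laplace-smoothed class prior.
        log_prior.append(
            math.log((weight_c + alpha) / (len(resp) + alpha * n_classes))
        )
        # Laplace-smoothed, responsibility-weighted k-mer counts.
        counts = {w: alpha for w in vocab}
        for f, r in zip(feats, resp):
            for w, n in f.items():
                counts[w] += r[c] * n
        z = sum(counts.values())
        log_like.append({w: math.log(v / z) for w, v in counts.items()})
    return log_prior, log_like


def e_step(feats, log_prior, log_like):
    """Posterior class probabilities for each example under the current model."""
    resp = []
    for f in feats:
        scores = [
            lp + sum(n * ll[w] for w, n in f.items())
            for lp, ll in zip(log_prior, log_like)
        ]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        resp.append([e / z for e in exps])
    return resp


def semi_supervised_em(labeled, labels, unlabeled, k=2, n_iter=10):
    """Fit naive Bayes with EM; labeled examples keep their fixed labels."""
    vocab = ["".join(p) for p in product(ALPHABET, repeat=k)]
    lab_feats = [kmer_counts(s, k) for s in labeled]
    unl_feats = [kmer_counts(s, k) for s in unlabeled]
    lab_resp = [[1.0 if c == y else 0.0 for c in range(2)] for y in labels]
    # Initialise from the labeled data only.
    model = m_step(lab_feats, lab_resp, vocab)
    for _ in range(n_iter):
        unl_resp = e_step(unl_feats, *model)               # E-step
        model = m_step(lab_feats + unl_feats,              # M-step
                       lab_resp + unl_resp, vocab)
    return model


def predict_proba(seq, model, k=2):
    """Class probabilities [P(class 0), P(class 1)] for one sequence."""
    return e_step([kmer_counts(seq, k)], *model)[0]


# Toy data (made up for illustration): class 0 is AT-rich, class 1 is GC-rich.
labeled = ["ATATATAT", "GCGCGCGC"]
labels = [0, 1]
unlabeled = ["ATTATAAT", "GCCGCGGC", "TATATTAA", "CGCGGCGC"]
model = semi_supervised_em(labeled, labels, unlabeled)
```

In this sketch the labeled examples keep hard labels throughout, while the unlabeled examples contribute fractional counts weighted by their current posterior. Variants of the method typically differ in how those unlabeled contributions are weighted or thresholded.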



Paper Citation

in Harvard Style

Stanescu A. and Caragea D. (2012). SEMI-SUPERVISED LEARNING OF ALTERNATIVELY SPLICED EXONS USING EXPECTATION MAXIMIZATION TYPE APPROACHES. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2012), ISBN 978-989-8425-90-4, pages 240-245. DOI: 10.5220/0003791802400245
