Naïve Bayes Domain Adaptation for Biological Sequences

Nic Herndon, Doina Caragea


The growing volume of biological data calls for automated computational tools to analyze it. Machine learning methods have been applied successfully to biological sequences in a supervised framework, but their accuracy usually suffers when a classifier is learned on a source domain and applied to a different, less studied target domain, that is, in a domain adaptation setting. To address this issue, we propose an algorithm that combines labeled sequences from a well-studied organism, the source domain, with labeled and unlabeled sequences from a related, less studied organism, the target domain. Our experimental results show that this algorithm achieves high classification accuracy on the target domain.
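The abstract does not spell out the algorithm, so the following is only a minimal illustrative sketch of the general idea it describes, not the authors' method. It assumes overlapping k-mer features, a multinomial naïve Bayes model with Laplace smoothing, a fixed down-weighting of source examples, and a few rounds of soft self-labeling on the unlabeled target sequences. All names (`kmers`, `train`, `posterior`, `adapt`, `source_weight`, `rounds`) and the toy sequences are hypothetical.

```python
import math
from collections import Counter

def kmers(seq, k=2):
    """Split a sequence into overlapping k-mers (assumed feature representation)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(weighted_examples, k=2, alpha=1.0):
    """Fit multinomial naive Bayes from (sequence, {label: weight}) pairs."""
    class_weight = Counter()
    feature_weight = {}
    vocab = set()
    for seq, label_weights in weighted_examples:
        feats = kmers(seq, k)
        vocab.update(feats)
        for label, w in label_weights.items():
            class_weight[label] += w
            counts = feature_weight.setdefault(label, Counter())
            for f in feats:
                counts[f] += w
    return class_weight, feature_weight, vocab, alpha, k

def posterior(model, seq):
    """Return P(label | sequence) under the model, computed in log space."""
    class_weight, feature_weight, vocab, alpha, k = model
    total = sum(class_weight.values())
    log_scores = {}
    for label, cw in class_weight.items():
        counts = feature_weight[label]
        denom = sum(counts.values()) + alpha * len(vocab)
        score = math.log(cw / total)
        for f in kmers(seq, k):
            if f in vocab:  # ignore k-mers never seen in training
                score += math.log((counts[f] + alpha) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def adapt(source, target_labeled, target_unlabeled, source_weight=0.5, rounds=5):
    """Combine weighted source labels, target labels, and soft-labeled target data."""
    labeled = [(s, {y: source_weight}) for s, y in source]
    labeled += [(s, {y: 1.0}) for s, y in target_labeled]
    model = train(labeled)
    for _ in range(rounds):
        # Assumed self-labeling step: use current posteriors as soft labels.
        soft = [(s, posterior(model, s)) for s in target_unlabeled]
        model = train(labeled + soft)
    return model

# Toy data, invented purely for illustration.
source = [("ACACACAC", "pos"), ("ACACAC", "pos"),
          ("GTGTGTGT", "neg"), ("GTGTGT", "neg")]
target_labeled = [("ACACAT", "pos"), ("GTGTGA", "neg")]
target_unlabeled = ["ACATAC", "GTGAGT"]

model = adapt(source, target_labeled, target_unlabeled)
probs = posterior(model, "ACACACAT")
```

Here `probs` maps each label to its estimated probability for a new target sequence. The `source_weight` parameter is one simple way to let the scarcer target labels count for more than the abundant source labels; other weighting schemes would fit the same outline.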



Paper Citation

in Harvard Style

Herndon N. and Caragea D. (2013). Naïve Bayes Domain Adaptation for Biological Sequences. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013) ISBN 978-989-8565-35-8, pages 62-70. DOI: 10.5220/0004245500620070

in Bibtex Style

author={Nic Herndon and Doina Caragea},
title={Naïve Bayes Domain Adaptation for Biological Sequences},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)},

in EndNote Style

JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2013)
TI - Naïve Bayes Domain Adaptation for Biological Sequences
SN - 978-989-8565-35-8
AU - Herndon N.
AU - Caragea D.
PY - 2013
SP - 62
EP - 70
DO - 10.5220/0004245500620070