Empirical Study of Domain Adaptation with Naïve Bayes on the Task of Splice Site Prediction

Nic Herndon, Doina Caragea

2014

Abstract

For many machine learning problems, training an accurate classifier in a supervised setting requires a substantial volume of labeled data. While large volumes of labeled data are currently available for some of these problems, little or no labeled data exists for others. Manually labeling data can be costly and time-consuming. An alternative is to learn classifiers in a domain adaptation setting, in which existing labeled data from a related problem, referred to as the source domain, is leveraged in conjunction with a small amount of labeled data and a large amount of unlabeled data for the problem of interest, or target domain. In this paper, we propose two similar domain adaptation classifiers based on a naïve Bayes algorithm. We evaluate these classifiers on the difficult task of splice site prediction, which is essential for gene prediction. Results show that the algorithms correctly classified instances, with the highest average area under the precision-recall curve (auPRC) values between 18.46% and 78.01%.
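The auPRC metric named in the abstract can be illustrated with a minimal sketch. The step-wise average-precision estimator below is one common way to approximate the area under the precision-recall curve; it is an assumption for illustration, since the abstract does not state which estimator the authors used. The function name `average_precision` and the 0/1 label encoding are likewise hypothetical.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    labels: 1 for a positive instance (e.g. a true splice site), 0 otherwise.
            This encoding is an assumption made for the example.
    scores: classifier confidence that each instance is positive.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    # Rank instances from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision measured at the rank of each recovered positive.
            precision_sum += hits / rank

    # Averaging over all positives weights each recall step equally.
    return precision_sum / total_pos


if __name__ == "__main__":
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.1]
    print(f"auPRC (average precision): {average_precision(toy_labels, toy_scores):.4f}")
```

Unlike accuracy, this measure depends on how well the positives are ranked, which is why it suits tasks such as splice site prediction where true sites are rare compared with decoys.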



Paper Citation


in Harvard Style

Herndon N. and Caragea D. (2014). Empirical Study of Domain Adaptation with Naïve Bayes on the Task of Splice Site Prediction. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014) ISBN 978-989-758-012-3, pages 57-67. DOI: 10.5220/0004806800570067


in Bibtex Style

@conference{bioinformatics14,
author={Nic Herndon and Doina Caragea},
title={Empirical Study of Domain Adaptation with Naïve Bayes on the Task of Splice Site Prediction},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014)},
year={2014},
pages={57-67},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004806800570067},
isbn={978-989-758-012-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014)
TI - Empirical Study of Domain Adaptation with Naïve Bayes on the Task of Splice Site Prediction
SN - 978-989-758-012-3
AU - Herndon N.
AU - Caragea D.
PY - 2014
SP - 57
EP - 67
DO - 10.5220/0004806800570067