Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification

Karthik Tangirala, Doina Caragea


Recent advances in the biological sciences have made large amounts of sequence data (both DNA and protein) available. Such data can be annotated using machine learning techniques, which require the input to be represented as a vector of features. In the absence of biologically known features, a common approach is to generate k-mers using a sliding window. A larger k usually yields better features; however, the number of possible k-mers grows exponentially with k, and many k-mers are not informative. Feature selection techniques can identify the most informative features, but they are computationally expensive when applied to the set of all k-mers, especially over the space of variable-length k-mers (which presumably capture the information in the data better). Instead of working with all k-mers, we propose to generate features using an approach based on the Burrows-Wheeler Transformation (BWT). Our approach generates variable-length k-mers that form a small subset of all k-mers. Experimental results on both DNA (alternative splicing prediction) and protein (protein localization) sequences show that BWT features combined with feature selection result in models that are better than models learned directly from k-mers. This shows that the BWT-based approach to feature generation can be used to obtain informative variable-length features for DNA and protein prediction problems.
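
For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal Python sketch of sliding-window k-mer counting and of the textbook Burrows-Wheeler transform. It is not the authors' feature-generation procedure, which the abstract does not specify. The function names `kmer_counts` and `bwt`, the `$` sentinel, and the example sequences are assumptions made for illustration only.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring seen by a sliding window of step 1.

    A sequence of length n yields n - k + 1 windows. Over the DNA alphabet
    there are 4**k possible k-mers, which is why the feature space grows
    exponentially with k.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bwt(seq: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform (illustrative, not memory-efficient).

    Appends a sentinel that sorts before all other symbols, sorts every
    rotation of the string, and returns the last column of the sorted
    rotations. Practical tools build the transform from a suffix array
    instead of materialising all rotations.
    """
    s = seq + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


if __name__ == "__main__":
    # Hypothetical toy inputs, chosen only to keep the output small.
    print(kmer_counts("ACGTACGA", 3))
    print(bwt("banana"))
```

The transform tends to place identical symbols next to each other when they share a following context, which is the property that BWT-based methods exploit to find repeated substrings without enumerating all k-mers.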



Paper Citation

in Harvard Style

Tangirala K. and Caragea D. (2014). Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014) ISBN 978-989-758-012-3, pages 196-203. DOI: 10.5220/0004806201960203
