interesting to explore different kernels for SVM, and
to tune different sets of parameters, in order to
further improve the performance.
Another interesting direction could be to analyze
the performance of features generated using BWT for
big data, with various feature selection techniques, in
addition to the ECCD technique used in this paper.
ACKNOWLEDGEMENTS
We would like to thank Ana Stanescu in the Department
of Computing and Information Sciences at Kansas State
University for making the D. melanogaster data available
for this study. We would also like to acknowledge
Dr. Adrian Silvescu for insightful discussions regarding
the BWT, and Dr. Torben Amtoft for useful discussions
regarding the time and space complexity of the algorithms
studied in this paper.
Generating Features using Burrows Wheeler Transformation for Biological Sequence Classification