A Novel Feature Generation Method for Sequence Classification - Mutated Subsequence Generation

Hao Wan, Carolina Ruiz, Joseph Beck

Abstract

In this paper, we present a new feature generation algorithm for sequence data sets called Mutated Subsequence Generation (MSG). Given a data set of sequences, the MSG algorithm generates features from these sequences by incorporating mutative positions in subsequences. We compare this algorithm with other sequence-based feature generation algorithms, including position-based, k-grams, and k-gapped pairs. Our experiments show that the MSG algorithm outperforms these other algorithms in domains in which the presence, not the specific location, of sequential patterns discriminates among classes in a data set.
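
For readers unfamiliar with the baseline methods named above, the following is a minimal sketch of position-based, k-gram, and k-gapped-pair feature generation. It does not implement MSG itself, which the abstract does not specify in detail. The function names are hypothetical, and the k-gapped-pair definition used here (two symbols with exactly k positions between them) is an assumption that may differ from the definition in the paper.

```python
from collections import Counter


def position_based(seq):
    """One feature per position: the (index, symbol) pair."""
    return {(i, s) for i, s in enumerate(seq)}


def k_grams(seq, k):
    """Count every contiguous substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def k_gapped_pairs(seq, k):
    """Count ordered symbol pairs with exactly k positions between them.

    Assumption: this gap convention is illustrative and may not match
    the paper's definition.
    """
    return Counter((seq[i], seq[i + k + 1]) for i in range(len(seq) - k - 1))


if __name__ == "__main__":
    s = "ACGTAC"
    print(position_based(s))
    print(k_grams(s, 2))
    print(k_gapped_pairs(s, 1))
```

Position-based features tie each symbol to its index, whereas k-grams and k-gapped pairs record only that a pattern occurs somewhere in the sequence, which is the distinction the abstract draws between location and presence.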


Paper Citation


in Harvard Style

Wan H., Ruiz C. and Beck J. (2014). A Novel Feature Generation Method for Sequence Classification - Mutated Subsequence Generation. In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014) ISBN 978-989-758-012-3, pages 68-79. DOI: 10.5220/0004808200680079


in Bibtex Style

@conference{bioinformatics14,
author={Hao Wan and Carolina Ruiz and Joseph Beck},
title={A Novel Feature Generation Method for Sequence Classification - Mutated Subsequence Generation},
booktitle={Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014)},
year={2014},
pages={68-79},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004808200680079},
isbn={978-989-758-012-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms - Volume 1: BIOINFORMATICS, (BIOSTEC 2014)
TI - A Novel Feature Generation Method for Sequence Classification - Mutated Subsequence Generation
SN - 978-989-758-012-3
AU - Wan H.
AU - Ruiz C.
AU - Beck J.
PY - 2014
SP - 68
EP - 79
DO - 10.5220/0004808200680079