Authors:
Hao Wan
;
Carolina Ruiz
and
Joseph Beck
Affiliation:
Worcester Polytechnic Institute, United States
Keyword(s):
Sequence Classification, Feature Generation, Mutated Subsequences.
Related
Ontology
Subjects/Areas/Topics:
Algorithms and Software Tools
;
Bioinformatics
;
Biomedical Engineering
;
Data Mining and Machine Learning
;
Pattern Recognition, Clustering and Classification
;
Sequence Analysis
Abstract:
In this paper, we present a new feature generation algorithm for sequence data sets called Mutated Subsequence Generation (MSG). Given a data set of sequences, the MSG algorithm generates features from these sequences by incorporating mutative positions in subsequences. We compare this algorithm with other sequence-based feature generation algorithms, including position-based, k-grams, and k-gapped pairs. Our experiments show that the MSG algorithm outperforms these other algorithms in domains in which presence, not specific location, of sequential patterns discriminate among classes in a data set.