Authors: Enrico Maiorino (1); Francesca Possemato (1); Valerio Modugno (2) and Antonello Rizzi (1)
Affiliations: (1) SAPIENZA University of Rome, Italy; (2) University La Sapienza, Italy
Keyword(s):
Granular Modeling, Sequence Data Mining, Inexact Sequence Matching, Frequent Subsequences Extraction, Evolutionary Computation.
Related Ontology Subjects/Areas/Topics: Artificial Intelligence; Computational Intelligence; Evolutionary Computing; Genetic Algorithms; Hybrid Systems; Informatics in Control, Automation and Robotics; Intelligent Control Systems and Optimization; Knowledge Discovery and Information Retrieval; Knowledge-Based Systems; Machine Learning; Representation Techniques; Soft Computing; Symbolic Systems
Abstract:
The widespread development of techniques to communicate and store information of all kinds has raised the need for new methods to analyze and interpret large quantities of data. One of the most important problems in sequential data analysis is frequent pattern mining, which consists of finding frequent subsequences (patterns) in a sequence database in order to highlight and extract interesting knowledge from the data at hand. Real-world data is usually affected by several noise sources, which makes the analysis more challenging, so that approximate pattern matching methods are required. A common procedure for identifying recurrent patterns in noisy data is based on clustering algorithms that rely on some edit distance between subsequences. When facing inexact mining problems, this plain approach can produce many spurious patterns due to multiple pattern matchings on the same sequence excerpt. In this paper we present a method to overcome this drawback by applying an optimization-based filter that identifies the most descriptive patterns among those found by the clustering process, yielding clusters that are more compact and more easily interpretable. We evaluate the mining system's performance using synthetic data with variable amounts of noise, showing that the algorithm performs well in synthesizing retrieved patterns with acceptable information loss.
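The spurious-match problem mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method: it assumes the plain Levenshtein distance as the edit distance and fixed-length sliding windows as candidate subsequences, and the names `edit_distance` and `approximate_matches` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two symbol sequences (assumed metric)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def approximate_matches(pattern: str, sequence: str, max_dist: int):
    """Hypothetical helper: list every window of len(pattern) symbols
    whose edit distance from the pattern is at most max_dist."""
    n = len(pattern)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        d = edit_distance(pattern, window)
        if d <= max_dist:
            hits.append((start, window, d))
    return hits


# One true occurrence of "ABCD" is reported by three overlapping windows.
hits = approximate_matches("ABCD", "XXABCDXX", max_dist=2)
for start, window, d in hits:
    print(start, window, d)
```

With a tolerance of 2, the single occurrence of the pattern is matched by the windows starting at positions 1, 2 and 3, which is the kind of redundancy on one sequence excerpt that a later filtering step would need to resolve.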