mining is similar to the problem of mining frequent
item sets and subsequences. For this type of problem
many works (Zaki, 2001) (Yan et al., 2003) describe
search techniques for non-contiguous sequences of
objects. For example, the first work (Agrawal and
Srikant, 1995) of this sub-field of data mining is related
to the analysis and prediction of consumer behaviour.
In this context, a transaction consists of the
sale of a set of items and a sequence is an ordered
set of transactions. If we consider, for instance, a sequence
of transactions $\langle a, b, a, c, a, c \rangle$, a possible non-contiguous
subsequence could be $\langle a, b, c, c \rangle$. However,
this approach is not ideal when the objective is
to extract patterns where the contiguity of the com-
ponent objects inside a sequence plays a fundamental
role in the extraction of information. The computational
biology community has developed many methods
for detecting frequent patterns, which in this field are
called motifs. Some works (Sinha and Tompa, 2003),
(Pavesi et al., 2004) use Hamming distances to search
for recurrent motifs in data. Other works employ a suffix
tree data structure (Zhu et al., 2007) or a suffix array
(Ji and Bailey, 2007) to store and organize the search
space, or use a GrC framework for the extraction
of frequent patterns in data (Rizzi et al., 2013).
Most methods focus only on the recurrence of pat-
terns in data without taking into account the concept
of “information redundancy”, or, in other words, the
existence of overlapping among retrieved patterns. In
this paper we present a new approximate subsequence
mining algorithm called FRL-GRADIS (Filtered Re-
inforcement Learning-based GRanular Approach for
DIscrete Sequences) aiming to reduce the information
redundancy of RL-GRADIS (Rizzi et al., 2012) by ex-
ecuting an optimization-based refinement process on
the extracted patterns. In particular, this paper intro-
duces the following contributions:
1. our approach finds the patterns that maximize the
knowledge about the process that generates the se-
quences;
2. we employ a dissimilarity measure that can ex-
tract patterns despite the presence of noise and
possible corruptions of the patterns themselves;
3. our method can be applied to any kind of sequence
of objects, given a suitable similarity or
dissimilarity function defined on the domain of the
objects;
4. the filtering operation produces results that can be
interpreted more easily by domain experts;
5. when this procedure is used as an inner module of
a more complex classification system, it further
reduces the dimension of the feature space,
thus better addressing the curse of dimensionality
problem.
This paper consists of three parts. In the first part we
provide some useful definitions and a proper notation;
in the second part we present FRL-GRADIS as a two-step
procedure, consisting of a subsequence extraction
step and a subsequence filtering step. Finally, in
the third part, we report the results obtained by apply-
ing the algorithm to synthetic data, showing a good
overall performance in most cases.
2 PROBLEM DEFINITION
Let $D = \{\alpha_i\}$ be a domain of objects $\alpha_i$. The objects
represent the atomic units of information. A sequence
$S$ is an ordered list of $n$ objects that can be represented
by the set of pairs
$$S = \{(i \to \beta_i) \mid i = 1, \dots, n;\ \beta_i \in D\},$$
where the integer $i$ is the order index of the object $\beta_i$
within the sequence $S$. $S$ can also be expressed with
the compact notation
$$S \equiv \langle \beta_1, \beta_2, \dots, \beta_n \rangle.$$
A sequence database $SDB$ is a set of sequences $S_i$ of
variable lengths $n_i$. For example, the DNA sequence
$S = \langle G, T, C, A, A, T, G, T, C \rangle$ is defined over the domain
of the four nucleotides $D = \{A, C, G, T\}$.
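As a minimal illustration of the two notations above, the following Python sketch builds the set-of-pairs representation from the compact one. The function name `to_pairs` is hypothetical and is not part of the original work; order indices are assumed to start at 1, as in the definition.

```python
def to_pairs(objects, start=1):
    """Return the set {(i, beta_i)} for a compact sequence <beta_1, ..., beta_n>.

    `start` is the order index assigned to the first object (1 by default).
    """
    return {(i, obj) for i, obj in enumerate(objects, start=start)}


# Domain D and the example DNA sequence from the text.
D = {"A", "C", "G", "T"}
S = to_pairs("GTCAATGTC")

# Every object of the sequence must belong to the domain D.
assert all(obj in D for _, obj in S)

# Sorting the pairs by order index recovers the compact notation.
compact = [obj for _, obj in sorted(S)]
print(compact)
```

Storing a sequence as a set of (index, object) pairs makes the set-based definitions that follow directly expressible with ordinary set operations.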
A sequence $S_1 = \langle \beta'_1, \beta'_2, \dots, \beta'_{n_1} \rangle$ is a subsequence
of a sequence $S_2 = \langle \beta''_1, \beta''_2, \dots, \beta''_{n_2} \rangle$ if $n_1 \le n_2$ and
$S_1 \subseteq S_2$. The position $\pi_{S_2}(S_1)$ of the subsequence $S_1$
with respect to the sequence $S_2$ corresponds to the order
index of its first element (in this case the order
index of the object $\beta'_1$) within the sequence $S_2$. The
subsequence $S_1$ is also said to be connected if
$$\beta'_j = \beta''_{j+k-1} \quad \forall\, j = 1, \dots, n_1,$$
where $k = \pi_{S_2}(S_1)$.
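The definitions of subsequence, position and connection can be sketched in Python on the set-of-pairs representation. This is only an illustrative reading, not the authors' implementation: the function names are hypothetical, subsequences are assumed to be non-empty, and connection is interpreted here as the inherited order indices being consecutive.

```python
def is_subsequence(s1, s2):
    """S1 is a subsequence of S2 if n1 <= n2 and S1 is a subset of S2 (as sets of pairs)."""
    return len(s1) <= len(s2) and s1 <= s2


def position(s1):
    """Order index of the first element of S1 within the containing sequence."""
    return min(i for i, _ in s1)


def is_connected(s1):
    """True if the order indices of S1 form a run of consecutive integers
    (assumed interpretation of the connection property)."""
    indices = sorted(i for i, _ in s1)
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))


# Containing sequence: the DNA example, with order indices starting at 1.
S2 = {(i, b) for i, b in enumerate("GTCAATGTC", start=1)}
S1 = {(5, "A"), (6, "T"), (7, "G")}   # connected subsequence
S3 = {(1, "G"), (3, "C"), (4, "A")}   # subsequence, but not connected

print(is_subsequence(S1, S2), position(S1), is_connected(S1))
print(is_subsequence(S3, S2), position(S3), is_connected(S3))
```

Because a subsequence is a subset of pairs of the containing sequence, its objects agree with the container by construction, so only the order indices need to be inspected to test for contiguity.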
Two subsequences $S_1$ and $S_2$ of a sequence $S$ are overlapping
if
$$S_1 \cap S_2 \neq \emptyset.$$
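The overlap condition translates directly into a set intersection. The sketch below is illustrative only; `overlapping` and `span` are hypothetical helper names, and the sequence used is the DNA example with order indices starting at 1.

```python
def overlapping(s1, s2):
    """Two subsequences of the same sequence overlap if they share at least one pair."""
    return bool(s1 & s2)


def span(first, last, sequence):
    """Connected subsequence covering order indices first..last (inclusive)."""
    return {(i, b) for i, b in sequence if first <= i <= last}


S = {(i, b) for i, b in enumerate("GTCAATGTC", start=1)}
A = span(4, 6, S)   # objects A, A, T at indices 4..6
B = span(6, 8, S)   # objects T, G, T at indices 6..8
C = span(7, 9, S)   # objects G, T, C at indices 7..9

print(overlapping(A, B))   # A and B share the pair at index 6
print(overlapping(A, C))   # no shared index, although both contain the object T
```

Note that overlap is decided on (index, object) pairs, so two subsequences containing the same object at different positions do not overlap.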
In the example described above, the complete notation
for the sequence $S = \langle G, T, C, A, A, T, G, T, C \rangle$ is
$$S = \{(1 \to G), (2 \to T), (3 \to C), \dots\}$$
and a possible connected subsequence $S_1 = \langle A, T, G \rangle$
corresponds to the set
$$S_1 = \{(5 \to A), (6 \to T), (7 \to G)\}.$$
Notice that the objects of the subsequence $S_1$ inherit
the order indices from the containing sequence $S$, so
that they refer unambiguously to their original positions
in $S$. From now on we will focus only on connected
subsequences, therefore the connection property
will be implicitly assumed.
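Restricting attention to connected subsequences means that locating a pattern amounts to sliding a fixed-length window over the sequence. The following Python sketch shows this for exact matches only; it does not reproduce the approximate, dissimilarity-based matching of the algorithm, and the function name `connected_occurrences` is hypothetical.

```python
def connected_occurrences(pattern, sequence):
    """Return the positions (1-based order indices) at which `pattern`
    occurs as a connected subsequence of `sequence` (exact match only)."""
    n, m = len(sequence), len(pattern)
    if m == 0:
        return []
    return [k + 1                       # convert 0-based offset to order index
            for k in range(n - m + 1)
            if list(sequence[k:k + m]) == list(pattern)]


S = "GTCAATGTC"
print(connected_occurrences("ATG", S))
print(connected_occurrences("GT", S))
```

Each returned value is the position of one occurrence, that is, the order index of its first object in the containing sequence.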
Information Granules Filtering for Inexact Sequential Pattern Mining by Evolutionary Computation