2 SEQUENTIAL PATTERN MINING PROBLEM
In this section, we first define the problem of
sequential pattern mining and then illustrate some
of the well-known sequential pattern mining
algorithms, explaining how they work with examples.
2.1 Definitions
The formal statement of sequential pattern mining is
given in (Agrawal, 1995) as follows:
Let $I = \{x_1, \ldots, x_m\}$ be a set of items. An
itemset is a non-empty subset of items, and an itemset
with l items is called an l-itemset.
A sequence $s = \langle s_1, \ldots, s_n \rangle$ is an
ordered list of itemsets, where $s_i$ is the i-th element
of s and is called a transaction. The number of
transactions in a sequence is called the length of the
sequence and is denoted by $|s|$. A sequence of length k
is called a k-sequence.
Consider two data sequences $s = \langle s_1, \ldots, s_n \rangle$
and $t = \langle t_1, \ldots, t_m \rangle$. We say that s is a
subsequence of t if s is a “projection” of t derived by
deleting elements and/or items from t. More formally, s is
a subsequence of t if there exist integers
$j_1 < j_2 < \cdots < j_n$ such that $s_1 \subseteq t_{j_1}$,
$s_2 \subseteq t_{j_2}$, …, and $s_n \subseteq t_{j_n}$.
For example, the sequences $\langle 1\ 3 \rangle$ and
$\langle 1\ 2\ 4 \rangle$ are subsequences of
$\langle 1\ 2\ 3\ 4 \rangle$, while $\langle 3\ 1 \rangle$ is not.
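The subsequence test can be sketched in Python as follows. This is a minimal illustration rather than code from the cited papers: the names `seq` and `is_subsequence` are hypothetical, each element is modelled as a `frozenset` of items, and the example sequences are assumed to consist of single-item elements.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[int]]


def seq(*elements: set) -> Sequence:
    """Hypothetical helper: build a sequence from plain sets of items."""
    return [frozenset(e) for e in elements]


def is_subsequence(s: Sequence, t: Sequence) -> bool:
    """True if s_1 <= t_{j1}, ..., s_n <= t_{jn} for some j1 < ... < jn."""
    j = 0
    for element in s:
        # Take the earliest remaining element of t that contains `element`.
        while j < len(t) and not element <= t[j]:
            j += 1
        if j == len(t):
            return False
        j += 1  # the next element of s must map to a strictly later position
    return True


t = seq({1}, {2}, {3}, {4})
print(is_subsequence(seq({1}, {3}), t))       # True
print(is_subsequence(seq({1}, {2}, {4}), t))  # True
print(is_subsequence(seq({3}, {1}), t))       # False
```

Matching each element at the earliest possible position is sufficient here because, without a distance constraint, an earlier match never rules out a later one.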
Following (Srikant, 1996), the sequence s is defined
to be a subsequence with a maximum distance
constraint of δ, or alternately a δ-distance
subsequence, of t if there exist integers
$j_1 < j_2 < \cdots < j_n$ such that $s_1 \subseteq t_{j_1}$,
$s_2 \subseteq t_{j_2}$, …, $s_n \subseteq t_{j_n}$ and
$j_k - j_{k-1} \le \delta$ for each $k = 2, 3, \ldots, n$.
That is, occurrences of adjacent elements of s within t
are not separated by more than δ elements.
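A possible way to check the δ-distance condition is sketched below. The function name `is_delta_subsequence` and the helper `seq` are hypothetical, and elements are again modelled as `frozenset` objects. Unlike the unconstrained case, picking the earliest match is not enough, so the sketch tracks every position in t at which the matched prefix of s can end.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[int]]


def seq(*elements: set) -> Sequence:
    """Hypothetical helper: build a sequence from plain sets of items."""
    return [frozenset(e) for e in elements]


def is_delta_subsequence(s: Sequence, t: Sequence, delta: int) -> bool:
    """True if s maps into t with j_k - j_{k-1} <= delta for all k >= 2."""
    if not s:
        return True  # assumption: the empty sequence is trivially contained
    # Positions in t at which the prefix of s processed so far can end.
    ends = {j for j, tj in enumerate(t) if s[0] <= tj}
    for element in s[1:]:
        ends = {
            j
            for j, tj in enumerate(t)
            if element <= tj and any(0 < j - p <= delta for p in ends)
        }
        if not ends:
            return False
    return bool(ends)


t = seq({1}, {1}, {2}, {3})
print(is_delta_subsequence(seq({1}, {2}), t, 1))  # True
print(is_delta_subsequence(seq({1}, {3}), t, 1))  # False
print(is_delta_subsequence(seq({1}, {3}), t, 2))  # True
```

In the first call the match must use the second occurrence of item 1, which an earliest-match strategy would miss.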
As a special case of the above definition, we say that
s is a contiguous subsequence of t if s is a 1-
distance subsequence of t, i.e., the elements of s can
be mapped to a contiguous segment of t.
A sequence s is said to contain a sequence p if p is a
subsequence of s.
The support of a pattern p is defined as the fraction
of sequences in the input database that contain p.
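As a small illustration of this definition, the sketch below computes the support of a pattern over a toy database. The database contents and the names `seq`, `contains`, and `support` are made up for the example, and containment is tested without any distance constraint.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[int]]


def seq(*elements: set) -> Sequence:
    """Hypothetical helper: build a sequence from plain sets of items."""
    return [frozenset(e) for e in elements]


def contains(s: Sequence, p: Sequence) -> bool:
    """True if p is a subsequence of s (no distance constraint)."""
    j = 0
    for element in p:
        while j < len(s) and not element <= s[j]:
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def support(p: Sequence, database: List[Sequence]) -> float:
    """Fraction of database sequences that contain the pattern p."""
    return sum(contains(s, p) for s in database) / len(database)


database = [
    seq({1}, {2}, {3}, {4}),
    seq({1, 5}, {3}),
    seq({3}, {1}),
    seq({2}, {4}),
]
print(support(seq({1}, {3}), database))  # 0.5
```

Only the first two sequences contain the pattern, so its support is 2/4.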
Given a set of sequences S, we say that s ∈ S is
maximal if there are no sequences in S - { s } that
contain it.
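The maximality condition can be illustrated with the following sketch, which keeps only those sequences of S that are not contained in any other sequence of S. The names `seq`, `contains`, and `maximal_sequences` are hypothetical, and S is assumed to hold distinct sequences.

```python
from typing import FrozenSet, List

Sequence = List[FrozenSet[int]]


def seq(*elements: set) -> Sequence:
    """Hypothetical helper: build a sequence from plain sets of items."""
    return [frozenset(e) for e in elements]


def contains(s: Sequence, p: Sequence) -> bool:
    """True if p is a subsequence of s (no distance constraint)."""
    j = 0
    for element in p:
        while j < len(s) and not element <= s[j]:
            j += 1
        if j == len(s):
            return False
        j += 1
    return True


def maximal_sequences(sequences: List[Sequence]) -> List[Sequence]:
    """Keep each s that no other sequence in the list contains."""
    return [
        s
        for i, s in enumerate(sequences)
        if not any(i != k and contains(other, s)
                   for k, other in enumerate(sequences))
    ]


S = [seq({1}, {3}), seq({1}, {2}, {3}), seq({4})]
for s in maximal_sequences(S):
    print([sorted(e) for e in s])
```

Here the sequence with elements 1 and 3 is dropped because the three-element sequence contains it.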
2.2 Sequential Pattern Mining Algorithms
Sequential pattern mining has been studied
intensively in recent years, and a great diversity of
algorithms for it exists.
Most of these algorithms are based on the Apriori
property proposed in association rule mining
(Agrawal, 1994), which states that any sub-pattern of
a frequent pattern must be frequent. Based on this
heuristic, a series of Apriori-like algorithms have
been proposed: AprioriAll, AprioriSome, and
DynamicSome in (Agrawal, 1995), and GSP
(Srikant, 1996). Later, a series of data-projection-based
algorithms became popular because of their efficiency,
including FreeSpan (Han, 2000) and PrefixSpan
(Pei, 2001). Recently, Zaki proposed an efficient
lattice-based algorithm called SPADE (Zaki, 2001).
After that, a fast algorithm called SPAM (Ayres,
2002) was proposed, which uses a vertical bitmap
representation of the data. A memory-indexing-based
approach called MEMISP (Ming-Yen, 2002) was also
proposed, which uses a memory indexing scheme to
reduce the I/O complexity.
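To make the Apriori property concrete, the sketch below shows how it can be used to discard candidates: a candidate pattern is kept only if every sub-pattern obtained by dropping one element is already known to be frequent. This is a simplified illustration and not the candidate generation procedure of any of the algorithms cited above. It assumes patterns whose elements are single items, represented as tuples, and the name `survives_apriori_pruning` is hypothetical.

```python
from typing import Set, Tuple

Pattern = Tuple[int, ...]


def survives_apriori_pruning(candidate: Pattern,
                             frequent_shorter: Set[Pattern]) -> bool:
    """Keep `candidate` only if each sub-pattern with one element dropped
    appears in the set of frequent patterns that are one element shorter."""
    return all(
        candidate[:i] + candidate[i + 1:] in frequent_shorter
        for i in range(len(candidate))
    )


frequent_2 = {(1, 2), (1, 3), (2, 3)}
print(survives_apriori_pruning((1, 2, 3), frequent_2))  # True
print(survives_apriori_pruning((1, 2, 4), frequent_2))  # False
```

The second candidate is pruned without counting its support, because its sub-pattern (2, 4) is not frequent.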
3 SEQUENTIAL PATTERN MINING AND CONSTRAINTS
Like many frequent pattern mining problems, sequential
pattern mining faces two major difficulties: (1)
effectiveness: mining may return a huge number of
patterns, many of which could be uninteresting to
users, and (2) efficiency: mining the complete set of
sequential patterns in a large sequence database often
takes substantial processing power.
Constraint-based mining may overcome both
difficulties, since constraints usually represent the user's
interest and focus, which confines the patterns to be
found to a particular set of conditions. Moreover, if
constraints can be pushed deep into the mining
process, the search can be focused and is likely to be
more efficient. This motivates the study of
constraint-based mining of sequential patterns.
3.1 Categories of Constraints
For real-world data mining, it is useful to
examine some interesting constraints from the
application point of view. These constraints are
presented in (Pei, 2002). Although this list is by no
means complete, it covers most of the interesting
constraints in applications.
Alternatively, constraints can be categorized
according to their properties for constraint pushing
in the candidate generation and pruning
processes (Ng, 1998; Pei, 2000; Pei, 2001).
Monotonicity, anti-monotonicity, and succinctness
are three categories of constraints that we briefly
discuss below.
MINING SEQUENTIAL PATTERNS WITH REGULAR EXPRESSION CONSTRAINTS USING SEQUENTIAL
PATTERN TREE