be expressively mined by mapping them as sets of
itemset sequences (Henriques et al., 2013). Here,
these mappings are seen as a pre-processing step of
the target methods. Sequential pattern mining (SPM),
originally proposed by Agrawal and Srikant (1995), remains a default option to explore itemset sequences.
Let an item be an element from an ordered set Σ.
An itemset I is a set of non-repeated items. A se-
quence s is an ordered set of itemsets. A sequence
$a=a_1 \dots a_n$ is a subsequence of $b=b_1 \dots b_m$ ($a \subseteq b$) if $\exists\, 1 \le i_1 < \dots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, \dots, a_n \subseteq b_{i_n}$. A sequence is max-
imal, with respect to a set of sequences, if it is not
contained in any other sequence of the set. The il-
lustrative sequence $s_1$={a}{be}=a(be) is contained in $s_2$=(ad)c(bce) and is maximal w.r.t. S={ae, (ab)e}.
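As a rough illustration (the function and variable names below are ours, not the paper's), the containment relation a ⊆ b can be checked with a greedy left-to-right scan:

def contains(a, b):
    """True if a=a_1..a_n is a subsequence of b=b_1..b_m, i.e. there exist
    indices i_1 < .. < i_n with a_k ⊆ b_{i_k} (greedy leftmost matching)."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

s1 = [{'a'}, {'b', 'e'}]                   # {a}{be} = a(be)
s2 = [{'a', 'd'}, {'c'}, {'b', 'c', 'e'}]  # (ad)c(bce)
print(contains(s1, s2))                    # True: s1 is contained in s2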
Definition 1. Given a set of sequences S and some
user-specified minimum support threshold θ, a se-
quence s ∈ S is frequent if it is contained in at least θ sequences. The sequential pattern mining task aims to discover the set of maximal frequent sequences
(sequential patterns) in S.
Considering a database S={(bc)a(abc)d, a(ac)c,
cad(acd)} and a support threshold θ=3, the set
of maximal sequential patterns for S under θ is
{a(ac), cc}. Traditional SPM approaches rely on pre-
fixes and suffixes, subsequences with specific mean-
ings, and on the (anti-)monotonicity property to de-
liver complete and deterministic outputs. However, these outputs are commonly highly voluminous, and frequency is a deterministic function that cannot flexibly account for underlying noise distributions.
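Under Definition 1, the support of a candidate is simply the number of database sequences that contain it; a naive check of the running example, reusing the containment test sketched above (illustrative code, not the paper's implementation), is:

def contains(a, b):
    # Greedy subsequence test over itemset sequences (as sketched above).
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def support(pattern, database):
    # Number of database sequences containing the pattern.
    return sum(contains(pattern, s) for s in database)

S = [
    [{'b', 'c'}, {'a'}, {'a', 'b', 'c'}, {'d'}],  # (bc)a(abc)d
    [{'a'}, {'a', 'c'}, {'c'}],                   # a(ac)c
    [{'c'}, {'a'}, {'d'}, {'a', 'c', 'd'}],       # cad(acd)
]
theta = 3
print(support([{'a'}, {'a', 'c'}], S) >= theta)   # a(ac): support 3, frequent
print(support([{'c'}, {'c'}], S) >= theta)        # cc: support 3, frequent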
Alternatives have been proposed, with a first class
focused on formal languages and on the construction
of acyclic graphs that define partial orders and con-
straints between items (Guralnik et al., 1998; Lax-
man et al., 2005). Probabilistic generative models such as neural networks, hidden Markov models (HMMs)
and stochastic grammars hold the promise to deliver
compact representations given by the underlying lat-
tices (Ge and Smyth, 2000). The expressive power, simplicity, and suitability for sequential data of HMMs make them an attractive candidate.
Definition 2. Given a discrete alphabet Σ, a first-order discrete HMM is a pair (T, E) that defines a stochastic finite automaton where a set of connected hidden states $X=\{x_1, \dots, x_k\}$ is expressed by a probability transition matrix $T=(t_{ij})$, with observable emissions described by a probability emission matrix $E=(e_i(\sigma))=(e_{i\sigma})$, where $1 \le i \le k$, $1 \le j \le k$ and $\sigma \in \Sigma$.
Under a first-order Markov assumption, emissions
depend on the current state only. Let the system be in
state $x_i$: it has probability $t_{ij}=P(x_j \mid x_i)$ of moving to state $x_j$ and probability $e_{i\sigma}=P(\sigma \mid x_i)$ of emitting item σ. (T, E) defines the HMM architecture.
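To make the (T, E) notation concrete, the toy sketch below (all parameter values are made up for illustration) samples an item sequence from a two-state first-order HMM:

import numpy as np

rng = np.random.default_rng(0)
alphabet = ['a', 'b', 'c']               # Σ
T = np.array([[0.7, 0.3],                # t_ij = P(x_j | x_i)
              [0.4, 0.6]])
E = np.array([[0.8, 0.1, 0.1],           # e_iσ = P(σ | x_i)
              [0.1, 0.1, 0.8]])

def generate(length, state=0):
    # Emit one item from the current state, then transition, at each step.
    items = []
    for _ in range(length):
        items.append(alphabet[rng.choice(len(alphabet), p=E[state])])
        state = rng.choice(len(T), p=T[state])
    return ''.join(items)

print(generate(12))  # runs of 'a' while in state 0, runs of 'c' while in state 1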
Preferred emissions and transitions (paths with
higher generation probability) are usually associated
with regions that may have structural and functional
significance. For specific architectures, different pat-
terns such as periodicities or gap-based patterns can
be revealed by analyzing the learned (T, E) parame-
ters (Baldi and Brunak, 2001). Based on this observa-
tion, alternative Markov-based approaches have been
proposed for the mining of patterns using different:
i) task formulations, ii) assumptions, and iii) learning
settings (Chudova and Smyth, 2002; Ge and Smyth,
2000; Laxman et al., 2005; Murphy, 2002).
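As a rough illustration of this observation (with made-up parameter values rather than learned ones), a pattern can be read off (T, E) by taking each state's preferred emission and following the highest-probability transitions:

import numpy as np

alphabet = ['a', 'b', 'c']
T = np.array([[0.05, 0.90, 0.05],   # state 0 mostly moves to state 1
              [0.05, 0.05, 0.90],   # state 1 mostly moves to state 2
              [0.90, 0.05, 0.05]])  # state 2 mostly moves back to state 0
E = np.array([[0.90, 0.05, 0.05],   # state 0 prefers 'a'
              [0.05, 0.90, 0.05],   # state 1 prefers 'b'
              [0.05, 0.05, 0.90]])  # state 2 prefers 'c'

preferred_emission = [alphabet[i] for i in E.argmax(axis=1)]
preferred_next = T.argmax(axis=1)
print(preferred_emission)  # ['a', 'b', 'c']
print(preferred_next)      # [1 2 0]: the preferred path cycles, revealing a periodic 'abc' pattern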
The commonly targeted tasks include the discovery of generative strings¹ as consensus patterns and
profiles (across a set of sequences) or motifs (within
one sequence). These tasks have been mainly applied
to univariate sequences (Chudova and Smyth, 2002;
Ge and Smyth, 2000; Fujiwara et al., 1994; Mur-
phy, 2002), with some exceptions allowing numeric
sequences with a fixed multivariate order (Bishop,
2006) and graph structures (Xiang et al., 2010). Ad-
ditionally, the majority is centered on the discovery of contiguous items, not accounting for item precedences at arbitrary distances.
Previous works (Laxman et al., 2005; Jacquemont et al., 2009; Cao et al., 2010) provide important principles for the decoding of sequential patterns, but all fail to model co-occurrences.
What makes the problem difficult is that little is known a priori about what these patterns may look like. Typically, the number and arrangement of precedences and co-occurrences can vary significantly
across patterns. State-of-the-art approaches (Chudova
and Smyth, 2002; Murphy, 2002) place assumptions
regarding the type, length and number of patterns, and
commonly assume that patterns do not overlap. These
restricted formulations require background knowl-
edge that may not be available. Even so, traditional
learning settings of HMMs may still present signifi-
cant additional challenges to pattern-based tasks. One
of them is the convergence of emission probabilities.
Spurious background matches in long sequences
can lead to false detections, making pattern discov-
ery difficult. The Viterbi algorithm alleviates this
problem (Bishop, 2006) but does not guarantee the
convergence of emission probabilities. In the literature, three learning settings have been proposed. Murphy (2002) requires emission distributions to be (nearly)
deterministic, i.e., each state should only emit a single
symbol, although this symbol is not specified. This
is achieved using the minimum entropy prior (Brand,
¹ Given an alphabet Σ, a generative string is a distribution over Σ allowing substitutions with noise probability ε.