be expressively mined by mapping them as sets of
itemset sequences (Henriques et al., 2013). Here,
these mappings are seen as a pre-processing step of
the target methods. Sequential pattern mining (SPM),
originally proposed by Agrawal and Srikant (1995), remains a default option to explore itemset sequences.
Let an item be an element from an ordered set Σ.
An itemset I is a set of non-repeated items. A se-
quence s is an ordered set of itemsets. A sequence
$a=a_1 \dots a_n$ is a subsequence of $b=b_1 \dots b_m$ ($a \subseteq b$) if $\exists\, 1 \le i_1 < \dots < i_n \le m$ such that $a_1 \subseteq b_{i_1}, \dots, a_n \subseteq b_{i_n}$. A sequence is max-
imal, with respect to a set of sequences, if it is not
contained in any other sequence of the set. The il-
lustrative sequence $s_1$={a}{be}=a(be) is contained in $s_2$=(ad)c(bce) and is maximal w.r.t. S={ae, (ab)e}.
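As a rough illustration (the function and variable names below are ours, not the paper's), the containment relation a ⊆ b can be checked with a greedy left-to-right scan:

def contains(a, b):
    """True if a=a_1..a_n is a subsequence of b=b_1..b_m, i.e. there exist
    indices i_1 < .. < i_n with a_k ⊆ b_{i_k} (greedy leftmost matching)."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

s1 = [{'a'}, {'b', 'e'}]                   # {a}{be} = a(be)
s2 = [{'a', 'd'}, {'c'}, {'b', 'c', 'e'}]  # (ad)c(bce)
print(contains(s1, s2))                    # True: s1 is contained in s2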
Definition 1. Given a set of sequences S and some
user-specified minimum support threshold θ, a se-
quence s ∈ S is frequent if it is contained in at least θ sequences. The sequential pattern mining task aims to discover the set of maximal frequent sequences
(sequential patterns) in S.
Considering a database S={(bc)a(abc)d, a(ac)c,
cad(acd)} and a support threshold θ=3, the set
of maximal sequential patterns for S under θ is
{a(ac), cc}. Traditional SPM approaches rely on pre-
fixes and suffixes, subsequences with specific mean-
ings, and on the (anti-)monotonicity property to de-
liver complete and deterministic outputs. However, these outputs are commonly highly voluminous, and frequency is a deterministic function that cannot flexibly account for underlying noise distributions.
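Under Definition 1, the support of a candidate is simply the number of database sequences that contain it; a naive check of the running example, reusing the containment test sketched above (illustrative code, not the paper's implementation), is:

def contains(a, b):
    # Greedy subsequence test over itemset sequences (as sketched above).
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def support(pattern, database):
    # Number of database sequences containing the pattern.
    return sum(contains(pattern, s) for s in database)

S = [
    [{'b', 'c'}, {'a'}, {'a', 'b', 'c'}, {'d'}],  # (bc)a(abc)d
    [{'a'}, {'a', 'c'}, {'c'}],                   # a(ac)c
    [{'c'}, {'a'}, {'d'}, {'a', 'c', 'd'}],       # cad(acd)
]
theta = 3
print(support([{'a'}, {'a', 'c'}], S) >= theta)   # a(ac): support 3, frequent
print(support([{'c'}, {'c'}], S) >= theta)        # cc: support 3, frequent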
Alternatives have been proposed, with a first class
focused on formal languages and on the construction
of acyclic graphs that define partial orders and con-
straints between items (Guralnik et al., 1998; Lax-
man et al., 2005). Probabilistic generative models such as neural networks, hidden Markov models (HMMs)
and stochastic grammars hold the promise to deliver
compact representations given by the underlying lat-
tices (Ge and Smyth, 2000). The expressive power, simplicity, and suitability for sequential data of HMMs make them an attractive candidate.
Definition 2. Given a discrete alphabet Σ, a first-order discrete HMM is a pair (T, E) that defines a stochastic finite automaton where a set of connected hidden states $X=\{x_1, \dots, x_k\}$ is expressed by a probability transition matrix $T=(t_{ij})$, with observable emissions described by a probability emission matrix $E=(e_i(\sigma))=(e_{i\sigma})$, where $1 \le i \le k$, $1 \le j \le k$ and $\sigma \in \Sigma$.
Under a first-order Markov assumption, emissions
depend on the current state only. Let the system be in
state $x_i$: it has probability $t_{ij}=P(x_j \mid x_i)$ of moving to state $x_j$ and probability $e_{i\sigma}=P(\sigma \mid x_i)$ of emitting item σ. (T, E) defines the HMM architecture.
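To make the (T, E) notation concrete, the toy sketch below (all parameter values are made up for illustration) samples an item sequence from a two-state first-order HMM:

import numpy as np

rng = np.random.default_rng(0)
alphabet = ['a', 'b', 'c']               # Σ
T = np.array([[0.7, 0.3],                # t_ij = P(x_j | x_i)
              [0.4, 0.6]])
E = np.array([[0.8, 0.1, 0.1],           # e_iσ = P(σ | x_i)
              [0.1, 0.1, 0.8]])

def generate(length, state=0):
    # Emit one item from the current state, then transition, at each step.
    items = []
    for _ in range(length):
        items.append(alphabet[rng.choice(len(alphabet), p=E[state])])
        state = rng.choice(len(T), p=T[state])
    return ''.join(items)

print(generate(12))  # runs of 'a' while in state 0, runs of 'c' while in state 1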
Preferred emissions and transitions (paths with
higher generation probability) are usually associated
with regions that may have structural and functional
significance. For specific architectures, different pat-
terns such as periodicities or gap-based patterns can
be revealed by analyzing the learned (T, E) parame-
ters (Baldi and Brunak, 2001). Based on this observa-
tion, alternative Markov-based approaches have been
proposed for the mining of patterns using different:
i) task formulations, ii) assumptions, and iii) learning
settings (Chudova and Smyth, 2002; Ge and Smyth,
2000; Laxman et al., 2005; Murphy, 2002).
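As a rough illustration of this observation (with made-up parameter values rather than learned ones), a pattern can be read off (T, E) by taking each state's preferred emission and following the highest-probability transitions:

import numpy as np

alphabet = ['a', 'b', 'c']
T = np.array([[0.05, 0.90, 0.05],   # state 0 mostly moves to state 1
              [0.05, 0.05, 0.90],   # state 1 mostly moves to state 2
              [0.90, 0.05, 0.05]])  # state 2 mostly moves back to state 0
E = np.array([[0.90, 0.05, 0.05],   # state 0 prefers 'a'
              [0.05, 0.90, 0.05],   # state 1 prefers 'b'
              [0.05, 0.05, 0.90]])  # state 2 prefers 'c'

preferred_emission = [alphabet[i] for i in E.argmax(axis=1)]
preferred_next = T.argmax(axis=1)
print(preferred_emission)  # ['a', 'b', 'c']
print(preferred_next)      # [1 2 0]: the preferred path cycles, revealing a periodic 'abc' pattern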
The commonly targeted tasks include the discovery of generative strings¹ as consensus patterns and
profiles (across a set of sequences) or motifs (within
one sequence). These tasks have been mainly applied
to univariate sequences (Chudova and Smyth, 2002;
Ge and Smyth, 2000; Fujiwara et al., 1994; Mur-
phy, 2002), with some exceptions allowing numeric
sequences with a fixed multivariate order (Bishop,
2006) and graph structures (Xiang et al., 2010). Ad-
ditionally, the majority is centered on the discovery of contiguous items, not accounting for item precedences at arbitrary distances.
Previous works (Laxman et al., 2005; Jacquemont et al., 2009; Cao et al., 2010) provide important principles for the decoding of sequential patterns, but all fail to model co-occurrences.
What makes the problem difficult is that little is known a priori about what these patterns may look like. Typically, the number and arrangement of precedences and co-occurrences can vary significantly
across patterns. State-of-the-art approaches (Chudova
and Smyth, 2002; Murphy, 2002) place assumptions
regarding the type, length and number of patterns, and
commonly assume that patterns do not overlap. These
restricted formulations require background knowl-
edge that may not be available. Even so, traditional
learning settings of HMMs may still present signifi-
cant additional challenges to pattern-based tasks. One
of them is the convergence of emission probabilities.
Spurious background matches in long sequences
can lead to false detections, making pattern discov-
ery difficult. The Viterbi algorithm alleviates this
problem (Bishop, 2006) but does not guarantee the
convergence of emission probabilities. In the literature, three learning settings have been proposed. Murphy (2002) requires emission distributions to be (nearly)
deterministic, i.e., each state should only emit a single
symbol, although this symbol is not specified. This
is achieved using the minimum entropy prior (Brand,
¹ Given an alphabet Σ, a generative string is a distribution over Σ allowing substitutions with noise probability ε.