Extracting Multi-item Sequential Patterns by Wap-tree Based Approach

Kezban Dilek Onal and Pinar Karagoz

Department of Computer Engineering, Middle East Technical University, Ankara, Turkey

Keywords:

WAP-Tree (Web Access Pattern Tree), Sequential Pattern Mining, FOF (First Occurrence Forest), Sibling

Principle, Web Usage Mining.

Abstract:

Sequential pattern mining constitutes a basis for solution of problems in web mining, especially in web us-

age mining. Research on sequence mining continues seeking faster algorithms. WAP-Tree based algorithms

that emerged from the web usage mining literature have shown a remarkable performance on single-item

sequence databases. In this study, we investigate the application of WAP-Tree based mining to multi-item se-

quential pattern mining and we present MULTI-WAP-Tree, which extends WAP-Tree for multi-item sequence

databases. In addition, we propose a new algorithm MULTI-FOF-SP (MULTI-FOF-Sibling Principle) that

extracts patterns on MULTI-WAP-Tree. MULTI-FOF-SP is based on the previous WAP-Tree based algorithm

FOF (First Occurrence Forest) and an early pruning strategy called ”Sibling Principle” from the literature.

Experimental results reveal that MULTI-FOF-SP ﬁnds patterns faster than PreﬁxSpan on dense multi-item

sequence databases with small alphabets.

1 INTRODUCTION

Sequential pattern mining is one of the major tasks

in data mining and it constitutes a basis for solu-

tion of pattern discovery problems in various domains

(Mooney and Roddick, 2013). Use of sequential pat-

tern mining in web usage mining problems helps dis-

covering user navigation patterns on web sites that

can guide processes like recommendation and web

site design.

In some applications of sequence mining like web

usage mining and bioinformatics, the sequences have

ﬁxed transaction size of 1. This speciﬁc case of se-

quence mining is referred as Single-Item Sequential

Pattern Mining. In accordance with this naming, the

general form of the problem is referred as Multi-Item

Sequential Pattern Mining.

WAP-Tree (Web Access Pattern Tree) based al-

gorithms have shown remarkable execution time per-

formance on single-item sequential pattern mining

(Mabroukeh and Ezeife, 2010). WAP-Tree is a com-

pact data structure for representing single-item se-

quence databases (Pei et al., 2000). There is a con-

siderable number of WAP-Tree based algorithms in

the literature. Among the previous WAP-Tree based

algorithms, PLWAP (Ezeife and Lu, 2005) is re-

ported to outperform well known general sequen-

tial pattern mining algorithms PreﬁxSpan and LAPIN

on single-item sequence databases (Mabroukeh and

Ezeife, 2010) for single item patterns.

Inspired by the success of WAP-Tree, we designed

a new data structure MULTI-WAP-Tree which ex-

tends WAP-Tree for representing multi-item/general

sequence databases. Secondly, we propose a new

sequential pattern mining algorithm MULTI-FOF-SP

based on MULTI-WAP-Tree inspired by the WAP-

Tree based algorithm FOF (First Occurrence For-

est) (Peterson and Tang, 2008). MULTI-FOF-SP,

integrates FOF approach and the early prunning

idea ”Sibling Principle” from the previous studies

(Masseglia et al., 2000) and (Song et al., 2005).

We have analyzed the performance of the pro-

posed method on several test cases. The results show

that MULTI-WAP-Tree and the associated mining al-

gorithm on this structure presents successful results,

especially for dense multi-item sequence databases

with small alphabets.

The rest of the paper is composed of four sections.

In Section 2, we provide background information on

WAP-Tree and the FOF algorithm. We present our

contributions, MULTI-WAP-Tree and MULTI-FOF-

SP, in Section 3 and Section 4, respectively. Finally,

we present experiment results in Section 5 and con-

clude in Section 6.

215

Onal K. and Karagoz P..

Extracting Multi-item Sequential Patterns by Wap-tree Based Approach.

DOI: 10.5220/0004788102150222

In Proceedings of the 10th International Conference on Web Information Systems and Technologies (WEBIST-2014), pages 215-222

ISBN: 978-989-758-024-6

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

2 RELATED WORK AND

BACKGROUND

Sequential pattern mining is the extraction of the se-

quences that occur at least as frequently as the mini-

mum support minSupport in a sequence database D

(Agrawal and Srikant, 1995). The sequential pat-

tern mining algorithms in the literature can be re-

viewed under four categories (Mabroukeh and Ezeife,

2010), namely Apriori based, vertical projection, pat-

tern growth and early prunning.

The WAP-Tree based algorithms follow the pat-

tern growth approach on single-item sequential pat-

tern mining. Since our study is focused on WAP-

Tree based mining, we ﬁrst review the WAP-Tree data

structure in the next subsection. Secondly, we sum-

marize the pattern growth approach in Subsection 2.2.

Finally, we review the FOF (First Occurrence Forest)

algorithm in comparison with the other WAP-Tree

based algorithms in Subsection 2.3.

2.1 WAP-Tree

The WAP-Tree (Web Access Pattern Tree) is a tree

data structure that is designed to represent a single-

item sequence database. This data structure was ﬁrst

introduced together with the WAP-Mine algorithm

(Pei et al., 2000). WAP-Mine algorithm converts the

sequence database into a WAP-Tree with an initial

database scan and performs mining on the WAP-Tree

to ﬁnd frequent patterns.

Table 1: Sample Single-item Sequence Database.

Sequence Id Sequence

1 aba

2 adcdb

3 beae

4 ac

Figure 1 illustrates the WAP-Tree for the sequence

database given in Table 1 under minimum support 0.5.

Each node of the WAP-Tree comprises two ﬁelds:

Item and Count. Count ﬁeld of a node n stores the

count of the sequences starting with the preﬁx ob-

tained by following the path from root to n. The two

children of R, (a:3) and (b:1) in Figure 1 indicate that

there are 3 sequences starting with a and a single se-

quence starting with b. WAP-Tree provides a compact

representation since shared preﬁxes can be encoded

on the nodes.

a:3

b:1

a:1

c:2

b:1

a:1

Figure 1: WAP-Tree For The Sequence Database in Table 1

Under Support Threshold 0.5.

2.2 Pattern Growth Approach

The pattern growth approach follows the divide and

conquer paradigm (Han et al., 2005). The original

problem of ﬁnding the frequent sequences is recur-

sively broken down into smaller problems by shrink-

ing the database into projected databases. The lex-

icographic search space is traversed depth-ﬁrst and

whenever a pattern is found to be frequent, the

database is projected by the pattern. The mining pro-

cess is recursed on the projected database.

The realization of the pattern growth approach de-

pends on the pattern growing direction, how the pro-

jected databases are represented and located during

mining (Han et al., 2005). There are two alternative

directions for growing patterns: namely suﬃx grow-

ing and preﬁx growing. The patterns are grown by ap-

pending symbols in preﬁx growing whereas they are

grown by prepending symbols in suﬃx growing.

D |



aba

adcdb

beae

(a)

D |

dcdb

(b)

D |



(c)

Figure 2: Projected Databases for Diﬀerent Patterns.

Three sample projected databases for preﬁx-

growing are given in horizontal database representa-

tion in Figure 2. From left to right, the tables in Figure

2 are the projected databases of the patterns  ,b,ba in

the sequence database D given in Table 1. The pro-

jections of the sequences can be traced by following

the rows of the tables at the same level. For exam-

ple, the a-projection of the sequence aba is ba. The

b-projection of the sequence ba is a and it is equal to

the ab-projection of aba.

2.3 FOF (First Occurrence Forest)

FOF is a preﬁx growing WAP-Tree based algorithm.

The FOF algorithm represents the sequence database

as a WAP-Tree and it adopts the recursive projected

WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies

216

a:3

b:1

a:1

c:2

b:1

a:1

(a) FOF for the Pattern .

a:3

b:1

a:1

c:2

b:1

a:1

(b) FOF for the Pattern

a:3

b:1

a:1

c:2

b:1

a:1

ab.

Figure 3: FOFs for Preﬁx Growing On WAP-Tree.

database mining approach of the pattern growth algo-

rithms. The FOF algorithm represents the projected

databases as First Occurrence Forests (FOFs). First

Occurrence Forest is a list of WAP-Tree nodes which

root a forest of WAP-Trees. The WAP-Trees rooted

by the FOF nodes of a pattern encode the projections

of the sequences by the pattern. FOF representation

enables easy support counting for a pattern by sum-

ming the count values of the FOF nodes.

The mining process of the FOF algorithm is based

on searching the ﬁrst level occurrences of symbols re-

cursively. At each pattern growing step by a symbol,

the ﬁrst level occurrences of the symbol in the for-

est of WAP-Trees deﬁned by the FOF of the original

pattern are assigned as the FOF nodes for the grown

pattern. The mining process is recursed on the FOF

nodes for the grown pattern.

The mining process of the FOF algorithm on the

WAP-Tree in Figure 1 is partially illustrated in Fig-

ure 3. FOF of three diﬀerent patterns are presented in

Figure 3. The shaded nodes in the WAP-Trees indi-

cate the FOF nodes. Initially, the FOF for the pattern

 is {R : 4}. Secondly, the ﬁrst level occurrences of the

symbol a under the node R:4 are found as the FOF

for the pattern a. The node n

is not included in the

FOF of the symbol a since it is not a ﬁrst level occur-

rence. In the next level of recursion, the FOF algo-

rithm mines the FOF for a. The FOF for ab is the set

of ﬁrst level b occurrences in the sub-trees rooted by

the shaded nodes for pattern a. FOF of ab is found by

searching the ﬁrst level b occurrences in the sub-trees

rooted by the FOF nodes for the pattern a.

The projected database representation as a list of

nodes, i.e. FOF, was adapted by previous WAP-

Tree based algorithms except WAP-Mine. All WAP-

Tree based algorithms in the literature follow pattern-

growth approach. However, they diﬀer in the pattern

growing direction. WAP-Mine does suﬃx growing

and represents projected databases as WAP-Trees. All

of the latter algorithms, namely PLWAP (Ezeife and

Lu, 2005), FLWAP (Tang et al., 2006), FOF (Peterson

and Tang, 2008) and BLWAP (Liu and Liu, 2010), do

preﬁx growing.

Preﬁx growing WAP-Tree based algorithms dif-

fer in their approach for locating the ﬁrst level oc-

currences of a symbol in a WAP-Tree. The FOF al-

gorithm locates ﬁrst level occurrences of an item by

simple depth-ﬁrst search. On the contrary, PLWAP,

BLWAP and FLWAP leverage additional data struc-

tures links and header table besides the WAP-Tree.

The links connect all occurrences of an item in a

WAP-Tree and the links can be followed starting from

the header table. The ﬁrst level occurences of an item

in a WAP-Tree can be easily found by following the

links of an item and ﬁltering the occurrences which

are found at the ﬁrst level. The FOF algorithm is re-

ported to outperform PLWAP and FLWAP in (Peter-

son and Tang, 2008) although. FOF uses less mem-

ory since it mines WAP-Tree without additional struc-

tures. Besides, although the links provide direct ac-

cess to the occurrences, ﬁltering the ﬁrst level occur-

rences brings an extra cost to PLWAP and FLWAP.

3 MULTI-WAP-TREE

MULTI-WAP-Tree is an extended WAP-Tree which

can represent both single and multi-item sequence

databases. MULTI-WAP-Tree is identical to the

WAP-Tree in that it contains only frequent items in

its nodes and each sequence in the sequence database

is encoded in MULTI-WAP-Tree on a path from the

root to a node of the tree.

MULTI-WAP-Tree diﬀers from the WAP-Tree in

two points:

• MULTI-WAP-Tree has two types of edges be-

tween nodes : S-Edge and I-Edge whereas WAP-

Tree has a single edge type.

• MULTI-WAP-Tree nodes keep a pointer to their

parent nodes in addition to the ﬁelds of WAP-

Tree.

MULTI-WAP-Tree is able to represent a multi-

item sequence database owing to two diﬀerent edge

types between nodes. A multi-item sequence is com-

posed of a series of item-sets. WAP-Tree is not able

to represent multi-item databases since it cannot ex-

press boundaries between item-sets in multi-item se-

ExtractingMulti-itemSequentialPatternsbyWap-treeBasedApproach

217

quences. S-Edges of MULTI-WAP-Tree can encode

item set boundaries. An S-Edge from a node to its

child indicates the separator between two item sets:

item set ending with the parent and item set starting

with the child. On the contrary, the nodes connected

with I-Edges are always in the same item set.

Figure 4(d) shows the MULTI-WAP-Tree for the

sample multi-item database in Table 2 under the sup-

port threshold 0.5. The dashed edges in the ﬁgure are

I-Edges, whereas the plain edges are S-Edges. The

preﬁx represented by the gray coloured node is (ab).

Although both of the edges originating from this node

point to c labeled nodes, they yield diﬀerent preﬁxes

since their edge types are diﬀerent. The child of this

node connected with an S-Edge represents the preﬁx

(ab)(c), whereas the other child connected with the

I-Edge represents the preﬁx (abc).

Table 2: Sample Multi-item Sequence Database.

Sequence Id Sequence

1 (ab)(c)

2 (a)(b)(c)

3 (abc)(c)

The second extension to the WAP-Tree, ”pointer

to parent node” in MULTI-WAP-Tree nodes, enables

tracking item sets upwards in the tree in the mining

phase. This additional ﬁeld does not contribute to

the database representation but it is an important con-

struct for mining MULTI-WAP-Tree.

4 MULTI-FOF-SP

MULTI-FOF-SP algorithm is a multi-item sequen-

tial pattern mining algorithm based on the representa-

tion MULTI-WAP-Tree. MULTI-FOF-SP algorithm

has three basic steps, as in all WAP-Tree based algo-

rithms:

1. Scan database to ﬁnd frequent items.

2. Scan database and build MULTI-WAP-Tree.

3. Mine frequent patterns from MULTI-WAP-Tree.

4.1 Building MULTI-WAP-Tree

For MULTI-WAP-Tree construction, after eliminat-

ing the infrequent items, each sequence is inserted

into the tree starting from the root, updating counts

of shared preﬁx nodes and inserting new nodes for

the unshared suﬃx part, as in WAP-Tree construc-

tion. While constructing a WAP-Tree, checking only

equality of item ﬁelds of nodes is suﬃcient. However,

in the MULTI-WAP-Tree case, checking the equality

of edge types is also required. To illustrate, Figure 4

depicts the MULTI-WAP-Tree database construction

algorithm on the mini database in Table 2. When in-

serting the second sequence (a)(b)(c), although the a

node already has a b child, since this child is con-

nected with an I-Edge, a new b node is added with an

S-Edge.

(a)

a:1

b:1

c:1

(b)

a:2

b:1

c:1

b:1

c:1

(c)

a:3

b:2

c:1 c:1

c:1

b:1

c:1

(d)

Figure 4: Building Steps of the MULTI-WAP-Tree for the

Database in Table 2. Left to right: MULTI-WAP-Tree af-

ter sequences (ab)(c), (a)(b)(c), (abc)(c) are inserted succe-

sively.

4.2 Mining MULTI-WAP-Tree

MULTI-FOF-SP follows preﬁx growing and FOF ap-

proach for extracting multi-item sequential patterns

on MULTI-WAP-Tree. MULTI-FOF-SP represents

projected databases in FOF form and ﬁnds ﬁrst level

occurrences of an item by simple depth ﬁrst search on

the MULTI-WAP-Tree, as in FOF algorithm. How-

ever, MULTI-FOF-SP algorithm diﬀers from FOF in

two points which are explained in detail in the follow-

ing two subsections.

4.2.1 S-Occurrence vs. I-Occurrence

Existence of two diﬀerent edge types S-Edge and

I-Edge in MULTI-WAP-Tree necessitates distinc-

tion between S-Occurrences and I-Occurrences. S-

Occurrences are the ﬁrst level occurrences that yield

a sequence extension, whereas I-Occurrences yield an

item-set extension.

MULTI-FOF-SP algorithm treats S-Occurrences

and I-Occurrences as occurrences of diﬀerent sym-

bols and searches for them separately. Whenever an

occurrence is located, it is subjected to a test for its

occurrence type. If an occurrence is found in the sub-

tree rooted by startNode, the algorithm decides on the

type of occurrence based on the three rules below:

1. A node is an S-Occurrence if there exists at least

one S-Edge on the path from the startNode to this

node.

WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies

218

2. A node is an I-Occurrence if there are only I-

Edges on the path from the startNode to this node.

3. A node is an I-Occurrence if the last itemset of

the grown pattern can be found by following the

ancestor nodes of the node before an S-Edge is

encountered.

R:3

a:2

a:1

b:1

c:1

b:1

b:2

a:1

c:1

Figure 5: Find First Occurrences Illustration.

It is crucial to note that a node can be both an S-

Occurrence and I-Occurrence at the same time. To

illustrate the rules above, consider the MULTI-WAP-

Tree in Figure 5. The gray shaded nodes r1 and r2

represent the FOF for the pattern (a). There exists

three c nodes, namely n1, n2 and n3, under this FOF.

There are two S-Occurrences {n1,n3} of c and two I-

Occurrences {n1, n2} of c under the FOF for the pat-

tern (a). I-Occurrences contribute to support count

of (ac), whereas S-Occurrences contribute to that of

(a)(c).

n2 is an I-Occurrence according to Rule 1 since

there exists no S-Edges between n2 and its ances-

tor r2. On the contrary, n3 is an S-Occurrence since

there exists an S-Edge on the route from n3 to r2. Fi-

nally, n1 is both an S-Occurrence and I-Occurrence

since it matches both Rule 1 and Rule 3. n1 is an S-

Occurrence because there exists an S-Edge between

n1 and r1. In addition, n1 is an I-Occurrence since

the last item-set of the grown pattern, namely (ac),

can be obtained by backtracking with only I-Edges

from n1 to the ancestor node r3. The bactracking op-

eration is performed using the pointers to the parent

nodes which were mentioned as a diﬀerence from the

WAP-Tree previously.

4.2.2 Sibling Principle

Sibling principle is an early pruning idea which was

used in the previous studies (Song et al., 2005) and

(Masseglia et al., 2000). It is an expression of the

Apriori principle (Agrawal and Srikant, 1995) on the

lexicographic search tree. According to the Apriori

principle, the sequences can be judged as infrequent

if any of its sub-sequences is known to be infrequent.

This principle can be modeled in lexicographic tree

considering the siblings of a frequent pattern. Sibling

principle requires checking sibling nodes of a node in

the lexicographic tree and imposes constraints on the

set of grown patterns. If a sibling node s of a node n

is not frequent, a sequence which is a super-sequence

of both s and n is pruned.

Figure 6 shows the subspace of the lexicographic

search space processed by the MULTI-FOF-SP algo-

rithm during mining the database in Table 2. The

dashed edges in the tree indicate item-set extensions

whereas the normal edges correspond to sequence ex-

tensions. The X sign indicates the sequence is fre-

quent. All the patterns represented by the nodes ex-

cept the underlined nodes are subjected to support

counting by the algorithm MULTI-FOF-SP during

mining the WAP-Tree given in Figure 4(d). The un-

derlined nodes encode the sequences that are pruned

owing to the sibling principle by MULTI-FOF-SP al-

gorithm. For instance, the leftmost c node in the tree

which represents the pattern (abc) can be early pruned

by the sibling principle since the sibling pattern (ac)

is found to be infrequent.

The numbers on the nodes indicate the order of

traverse by MULTI-FOF-SP during mining. It is cru-

cial to note that, the search space is traversed with a

hybrid traversal strategy in order to apply the sibling

principle. MULTI-FOF-SP combines the depth-ﬁrst

traversal of pattern growth approach and the breadth

ﬁrst approach imposed by the Apriori principle.

5 EXPERIMENTS

In this section, we present the experiments we have

conducted in order to evaluate performance of the

algorithm MULTI-FOF-SP. We compared execution

time and memory usage performance of MULTI-

FOF-SP on multi-item sequence databases with the

algorithms PreﬁxSpan

and LAPIN-LCI

We generated several synthetic sequence

databases using IBM Quest Data Generator (Agrawal

et al., 1993). We downloaded the IBM Quest Data

Generator executable in Illimine Software Package

Version 1.1.0 from the web site

. In order to obtain

We downloaded PreﬁxSpan executable in Illimine Soft-

ware Package Version 1.1.0 from web site of Illimine

Project http://illimine.cs.uiuc.edu/download/.

We downloaded LAPIN-LCI ex-

ecutable from http://www.tkl.iis.u-

tokyo.ac.jp/ yangzl/soft/LAPIN/index.htm

http://illimine.cs.uiuc.edu/download/

ExtractingMulti-itemSequentialPatternsbyWap-treeBasedApproach

219



a X

b X

c a

c X

b X

c X

Figure 6: The Subspace of the Lexicographic Search Space Processed by the MULTI-FOF-SP Algorithm for Mining the

Database in Table. 2

a comprehensive data set, we determined a set of

values for each of the parameters C (Average length

of sequences) ={25}, T (average size of itemsets) =

{3,7}, D (number of sequences) = {200K,800K} and N

(size of database alphabet) = {10,500}. We generated

databases with combination of values from these sets.

Each sequence database in the experiment set is

named in accordance with the parameter values. For

instance C25T3S25I3N10D200k speciﬁes a database

generated with parameters C=25, T=3, S=25, I=3,

N=10 and D=200K. The support values for the ex-

periment sets are chosen such that comparative per-

formance results can be obtained in a reasonable time.

We performed experiments on a personal com-

puter with Intel(R) Core(TM) i7 860 @2.80Ghz 2.80

Ghz CPU, 8 GB installed memory and Windows 7

Professional 64-bit operating system. For each exper-

iment case identiﬁed by the 4-tuple (Algorithm, Se-

quence Database, MinSupport), we report the execu-

tion time and memory consumption of algorithm on

the sequence database. We repeated each test case run

at least three times and reported the average value for

all the algorithms. Due to space limitation, we present

results of only a part of the conducted experiments.

As the ﬁrst set of experiments, we present results

on two diﬀerent sequence databases of alphabet size

N=500 in Figure 7 and Figure 8. In these experiments

with large alphabet, PreﬁxSpan is faster then MULTI-

FOF-SP in all of the cases. Although, LAPIN-LCI is

faster than MULTI-FOF-SP in some cases, there are

also cases in which LAPIN ends unexpectedly con-

suming more than 2 GBs of memory. MULTI-FOF-

SP can prune the search space owing to sibling prin-

ciple yet it may not be suﬃcient when the alphabet is

large.

As the second set of experiments, we present re-

sults on three diﬀerent sequence databases of alphabet

size N=10 in Figure 9 and Figure 10.

MULTI-FOF-SP outperforms PreﬁxSpan in terms

of execution time on all sequence databases with al-

phabet size N=10. In most of the cases, MULTI-

30 20 10

500

1,000

Min Support(%)

Execution Time (seconds)

C25T3S25I3D200k

30 20 10

2,000

4,000

Min Support(%)

Execution Time (seconds)

C25T3S25I3D800k

PreﬁxSpan

LAPIN-LCI

MULTI-FOF-SP

Figure 7: Execution Times of Algorithms on Databases

with N=500.

FOF-SP is faster than LAPIN-LCI yet there are cases

in which LAPIN-LCI is slightly faster. However,

there are also cases LAPIN-LCI could not com-

plete execution since it ran out of memory. One

such case is given in Figure 10 for the database

C25T7S25I7N10D800k when the minimum support

value is decreased to 99.3.

MULTI-FOF-SP ranks the second on databases

with N=10 in terms of memory consumption. The al-

gorithm stores many FOFs in the memory in order to

apply sibling principle. However, the memory con-

sumption is moderate because this cost is compen-

WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies

220

30 20 10

500

1,000

Min Support(%)

Peak Memory Consumption (MB)

C25T3S25I3D200k

30 20 10

500

1,000

Min Support(%)

Peak Memory Consumption (MB)

C25T3S25I3D800k

PreﬁxSpan

LAPIN-LCI MULTI-FOF-SP

Figure 8: Memory Consumption of Algorithms on Databases with N=500.

99 97

Min Support(%)

Execution Time (seconds)

C25T3S25I3D200k

99 97

200

400

600

Min Support(%)

Execution Time (seconds)

C25T3S25I3D800k

99.7

99.5

99.3

2,000

4,000

6,000

8,000

Min Support(%)

Execution Time (seconds)

C25T7S25I7D800k

PreﬁxSpan

LAPIN-LCI MULTI-FOF-SP

Figure 9: Execution Times of Algorithms on Databases with N=10.

99 97

100

200

300

Min Support(%)

Peak Memory Consumption (MB)

C25T3S25I3D200k

99 97

500

1,000

Min Support(%)

Peak Memory Consumption (MB)

C25T3S25I3D800k

99.7

99.5

99.3

800

1,000

1,200

1,400

1,600

1,800

Min Support(%)

Peak Memory Consumption (MB)

C25T7S25I7D800k

PreﬁxSpan

LAPIN-LCI MULTI-FOF-SP

Figure 10: Memory Consumption Of Algorithms on Databases with N=10.

sated by the compression provided by the MULTI-

WAP-Tree data structure.

MULTI-FOF-SP is faster on the databases with

smaller alphabets owing to the compression provided

by the MULTI-WAP-Tree. For instance, MULTI-

WAP-Tree has at most 20 child nodes of the root re-

gardless of the number of sequences in the database

when the alphabet size is 10. Moreover, the width

of the WAP-Tree is never larger than the number of

sequences. However, it is crucial to note that it is

more eﬃcient to mine the projected database in hor-

izontal representation than mining the tree structure.

Scanning the tree requires tracking the edges between

the nodes. Besides, WAP-Tree node occupies more

ExtractingMulti-itemSequentialPatternsbyWap-treeBasedApproach

221

space than the space occupied by an item in hori-

zontol representation. Consequently, if the FOF ap-

proach were applied on the MULTI-WAP-Tree with-

out the sibling principle, both memory and time re-

quirements of FOF approach would be higher in cases

of large alphabets. A high degree of compression by

the MULTI-WAP-Tree is required to outperform Pre-

ﬁxSpan and LAPIN in terms of both execution time

and memory. We observed that this requirement can-

not be met in case of large alphabets.

6 CONCLUSIONS AND FUTURE

WORK

In this paper, we introduced a new data structure

MULTI-WAP-Tree and a new algorithm MULTI-

FOF-SP for extracting multi-item sequence patterns.

MULTI-WAP-Tree is the ﬁrst tree structure for repre-

senting general sequence databases. MULTI-FOF-SP

employs the early pruning idea Sibling Principle.

We have experimented on several test cases to

compare MULTI-FOF-SP with previous multi-item

sequence mining algorithms, PreﬁxSpan and LAPIN-

LCI. Experiments revealed that MULTI-FOF-SP out-

performs PreﬁxSpan and has a performance close

to LAPIN-LCI in terms of execution time on dense

multi-item databases with small alphabets. In addi-

tion, it has a better performance than LAPIN-LCI in

terms of memory usage for these databases.

In this work, we devised a MULTI-WAP-Tree

based algorithm that uses sibling principle and ob-

tained good results. As a continuation of this line,

other existing tree based algorithms can be inves-

tigated for multi-item sequence mining using the

MULTI-WAP-Tree data structure.

REFERENCES

Agrawal, R., Imelinski, T., and Swami, A. (1993). Min-

ing association rules between sets of items in large

databases. In Proceedings of the ACM SIGMOD

Conference on Management of Data, pages 207–216.

ACM.

Agrawal, R. and Srikant, R. (1995). Mining sequential pat-

terns. In Proceedings of the Eleventh International

Conference on Data Engineering (ICDE’95), pages

3–14. IEEE.

Ezeife, C. and Lu, Y. (2005). Mining web log sequen-

tial patterns with position coded pre-order linked

wap-tree. Data Mining and Knowledge Discovery,

10(1):5–38.

Han, J., Pei, J., and Yan, X. (2005). Sequential pat-

tern mining by pattern-growth: Principles and exten-

sions*. In Chu, W. and Lin, T., editors, Foundations

and Advances in Data Mining, volume 180 of Stud-

ies in Fuzziness and Soft Computing, pages 183–220.

Springer Berlin Heidelberg.

Liu, L. and Liu, J. (2010). Mining web log sequential pat-

terns with layer coded breadth-ﬁrst linked wap-tree. In

International Conference of Information Science and

Management Engineering (ISME’2010), volume 1,

pages 28–31. IEEE.

Mabroukeh, N. and Ezeife, C. (2010). A taxonomy of se-

quential pattern mining algorithms. ACM Computing

Surveys (CSUR), 43(1):3.

Masseglia, F., Poncelet, P., and Cicchetti, R. (2000). An

eﬃcient algorithm for web usage mining. Networking

and Information Systems Journal, 2(5/6):571–604.

Mooney, C. H. and Roddick, J. F. (2013). Sequential pattern

mining – approaches and algorithms. ACM Comput.

Surv., 45(2):19:1–19:39.

Pei, J., Han, J., Mortazavi-Asl, B., and Zhu, H. (2000). Min-

ing access patterns eﬃciently from web logs. Knowl-

edge Discovery and Data Mining. Current Issues and

New Applications, pages 396–407.

Peterson, E. and Tang, P. (2008). Mining frequent sequen-

tial patterns with ﬁrst-occurrence forests. In Proceed-

ings of the 46th Annual Southeast Regional Confer-

ence (ACMSE), pages 34–39. ACM.

Song, S., Hu, H., and Jin, S. (2005). Hvsm: A new sequen-

tial pattern mining algorithm using bitmap representa-

tion. In Li, X., Wang, S., and Dong, Z., editors, Ad-

vanced Data Mining and Applications, volume 3584

of Lecture Notes in Computer Science, pages 455–

463. Springer Berlin Heidelberg.

Tang, P., Turkia, M., and Gallivan, K. (2006). Mining

web access patterns with ﬁrst-occurrence linked wap-

trees. In Proceedings of the 16th International Confer-

ence on Software Engineering and Data Engineering

(SEDE’07), pages 247–252. Citeseer.

WEBIST2014-InternationalConferenceonWebInformationSystemsandTechnologies

222