Constraint-programming Approach for Multiset and Sequence Mining
Pablo Gay, Beatriz L
´
opez and Joaquim Mel
´
endez
University of Girona, Campus Montilivi, P4 Building, E17071, Girona, Spain
Keywords:
Data Mining, Itemset Mining, Multiset Mining, Sequence Mining, Pattern Discovery.
Abstract:
Constraint-based data mining is a field that recently has started to receive more attention. Describing a problem
through a declarative model enables very descriptive and easy to extend implementations. Our work uses a
previous itemset mining model in order to extend it with the capabilities to discover different and interesting
patterns that have not been explored yet: multisets and sequences. The classic example domain is the retailer
organizations, trying to mine the most common combinations of items bought together. Multisets would allow
mining not only this itemsets but also the quantities of each item and sequences the order in with the items
are retrieved. In this paper, we provide the background of the original work and we describe the modifications
done to the model to extend it and support these new patterns. We also test the new models using real world
data to prove their feasibility.
1 INTRODUCTION
Itemset mining, a problem originated in the study of
retailer organization databases, has been a major con-
cern in the machine learning community due its im-
pact on different areas, such as games, census, traf-
fic accidents, among other (Rep, 2012). The learn-
ing objective is to find the most common combina-
tion of products bought toghether. After itemset min-
ing, sequence (Agrawal and Srikant, 1995) and mul-
tiset (David and Nourine, ence) mining algorithms
have been developed. The importance of sequences
is shown in fields as recommendations (Burke, 1999),
web browsing (Cadez et al., 2000), health care (Zhang
et al., 2003) or music (Brand, 1998) for instance. In
sequence mining, item order matters, while in multi-
sets item repetitions is taken into account.
The original itemset mining algorithm known is
Apriori (Agrawal and Srikant, 1994) and consists in
three phases. First, it builds an item lattice. Sec-
ond it scans the lattice looking for itemsets that have
at least a support of a user-specified threshold. And
third, and in order to build associational rules, builds
up the appropriate relationship between premises and
consequences. A similar process follows with se-
quences. Since the approach is bread first search, it
is computationally expensive, and other authors have
been explored several alternatives to improve its effi-
ciency. (Srikant and Agrawal, 1996) proposed the use
of taxonomies, some primitive data structures to count
the support of patterns, sliding windows to relax the
definition of frequent sequences, and time constraints
to discriminate whenever two elements of a transac-
tion belong to the same sequence or not. (David and
Nourine, ence) goes one step further by mixing se-
quences and multisets. All of these approaches are
procedural based, and they search in one way or an-
other a tree structure.
Recently, there has been an interest to use declar-
ative approaches for data mining (Bonchi and Luc-
chese, 2007). In the particular case of itemsets,
(De Raedt et al., 2008) proposes to use constraint
programming. Although constraint programming is
a well-studied field in computer science, its use for
itemset mining had gone unnoticed until then. The
main achievement is the reformulation of the itemset
mining problem as a constraint satisfaction problem,
exploiting current and powerful solvers to find out the
frequent itemset patterns. The framework proposed
at (De Raedt et al., 2008) is revisited in (De Raedt
et al., 2010), where the authors explore in deep the
formulation, propose models for frequent, closed, and
weighted itemset mining, and analyze the efficiency
gained with solvers. The constraint programming ap-
proach enables the extension, in a declarative manner,
of the data mining problem by the addition of new
constraints.
Our work concerns the extension of the previous
work of (De Raedt et al., 2010), to mine two different
kind of frequent patterns: multisets and sequences.
212
Gay P., López B. and Meléndez J..
Constraint-programming Approach for Multiset and Sequence Mining.
DOI: 10.5220/0004135302120220
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2012), pages 212-220
ISBN: 978-989-8565-29-7
Copyright
c
2012 SCITEPRESS (Science and Technology Publications, Lda.)
Our ultimate goal is to provide a constraint program-
ming framework for frequent multiset and sequence
mining.
This paper is organized as follows. First, in Sec-
tion 2 we contextualize our research within some
other related work. Next, we provide in Section 3 the
basis of the constraint programming approach in (De
Raedt et al., 2010), that is then extended with our new
proposals in Sections 4 (multiset mining) and 5 (se-
quence mining). The experimental results and their
discussion is presented in Section 6. We end with
some conclusions and future work directions in Sec-
tion 7.
2 RELATED WORK
Constraint-based mining has been understood in fre-
quent itemset mining as the process of defining con-
straints in the mining process. That means, that we
find itemset patterns by constraining their length, du-
ration, gaps between adjacent items, prefixes, and so
on (Han et al., 2007). These methods follow a proce-
dural approach to find out the frequent patterns. Our
work concerns on the use of the constraint program-
ming paradigm for mining frequent patterns, follow-
ing a declarative definition of the mining problem,
and letting the solvers find the patterns, as in (De
Raedt et al., 2010).
An example of such systems is Molfea (De Raedt
and Kramer, 2001), which mines chemical structures
for sequences of atoms and bonds that are defined
by some user criteria. These criteria are primitives
that represent fragments of the sequences of atoms
and may require having a minimum frequency on
the database. Another example is ConQuest (Bonchi
et al., 2009), a constraint-based querying system,
which interacts with the user to achieve the desired
learning results. MusicDFS (Soulet et al., 2006) is an-
other application example of a tool which implements
its own efficient deep search first algorithm to mine
constrained patterns. These kind of frameworks have
in common that they focus on the use of constraints in
the search process to improve mining results whereas
our approach focuses on create a model definition to
solve the problem of mining multisets and sequence
patterns using a constraint programming approach.
Other related works are in the field of plan recog-
nition. The plan recognition problem (Schmidt et al.,
1978) takes as input a set of sequential actions per-
formed by some actors within the system and tries to
organize these actions in order to recreate the plan and
to infer the pursued goal. Moreover, the plans can be
considered sequential patterns if they are mined using
data from several actors. Recently, this community
has also moved to consider constraint-satisfaction for-
mulations to define their problem (Gal et al., 2012).
3 BACKGROUND
In this work, the modelization of frequent itemset
problem has been considered as starting point to ex-
tend it towards multiset and ordered itemset min-
ing. However, formalization can be generalized to
cover closed, maximal or typologies of itemset min-
ing problems.
As it has been commented above, our model is
based on the original and simpler design of (Guns
et al., 2011). It is a constraint programming approach
for itemset mining. We explain the essentials of the
method (the problem definition and the constraint pro-
gramming model) in this section to proceed afterward
with the definition of our models.
3.1 Problem Statement
The original motivation of itemset mining was on the
retailers, from which most of the terminology has
been derived. The input information is a database
containing all the products acquired by the clients ac-
cording to transactions. The output, the most frequent
items bought.
Let I = {1, . .. , m} be a subset of N that map
some items as for example {A, B, . . . , M} and A =
{1, . . . , n} a set of transaction identifiers. Now, let D
0
be a set of tuples with a transaction id t and an item-
set I. Finally, let D be the binary representation of D
0
such that:
D
0
= {(t,I) | t A, I I , i I : D
t,i
= 1} (1)
where D
t,i
is the position (t,i) of the D matrix. Us-
ing this definition itemsets can be discriminated using
their id and D contains 1s only when the itemset con-
tains that specific item. Table 1 shows an example of
this representations.
The occurrences of a specific itemset I within the
matrix D, that is, the coverage ϕ, is defined as fol-
lows:
ϕ
D
(I) = {t A | i I : D
t,i
= 1} (2)
For example, using matrix D from Table 1,
ϕ
D
({2, 3}) = {5, 8, 9, 10}.
The amount of transactions that contain the item-
set is denoted as support and is defined as:
support
D
(I) = |ϕ
D
(I)| (3)
So using again the previous example,
support
D
({2, 3}) = |ϕ
D
({2, 3})| = |{5,8, 9, 10}| = 4
Constraint-programmingApproachforMultisetandSequenceMining
213
Table 1: Example of item database. Left: original itemset database. Central (D
0
): transactions database. Right (D): binary
representation.
A B C D
Transactions (id, Itemset) id 1 2 3 4
{C} (1, {3}) 1 0 0 1 0
{B} (2, {2}) 2 0 1 0 0
{A,C} (3, {1, 3}) 3 1 0 1 0
{B, D} (4, {2, 4}) 4 0 1 0 1
{B,C} (5, {2, 3}) 5 0 1 1 0
{C, D} (6, {3, 4}) 6 0 0 1 1
{A, D} (7, {1, 4}) 7 1 0 0 1
{B,C, D} (8, {2,3, 4}) 8 0 1 1 1
{A, B,C} (9, {1, 2, 3}) 9 1 1 1 0
{A, B,C, D} (10, {1, 2, 3, 4}) 10 1 1 1 1
D
0
D
Frequent itemset mining consists on finding all the
itemsets which support is equal or higher than a cer-
tain threshold θ. That is:
frequent itemsets = {I | I I , support
D
(I) θ} (4)
3.2 Constraint Programming Model
First of all, (Guns et al., 2011) define a boolean vari-
able I
i
for each individual item. Itemsets I are then
represented as a collection of this binary variables.
Another set of binary variables, T
t
, represent the trans-
actions of the set of T that cover a given itemset,
T = ϕ
D
(I). For example, transaction 8 of Table 1
is represented by the following setting of the binary
variables settings: I
1
= 0, I
2
= 1, I
3
= 1, I
4
= 1, T
8
= 1
for ϕ
D
(2, 3, 4).
Thanks to this new binary variables, the coverage
and support constraints are formulated. The coverage
constraints is modeled as follows:
T = ϕ
D
(I) (t A : T
t
= 1
iI
I
i
(1 D
t,i
) = 0)
(5)
The second one, the frequency constrain, requires
computing if the sum of the binary vector T is greater
or equal to the θ threshold, as follows:
|T | > θ
tA
T
t
θ (6)
Therefore, the itemset mining from a constraint
programming approach consist on finding the set
(I, T )|I I , T A, T = ϕ
D
(I), |T | > θ (7)
4 MULTISET MINING
The previous section presented how to find frequent
itemsets. This kind of patterns can be very useful,
since a lot of databases can be transformed into the bi-
nary representation shown at Equation 1. In the clas-
sical scenario, the retailer database, each row will be
considered as a client transaction and each column as
an item. Then the resulting itemset patterns are the
most common products bought together.
However, using this same scenario, we can see
that we could extract more information. Frequent pat-
terns at this point represent the products commonly
purchased together, but the amount of each product is
still unknown. We can know that beer and crisps are
bought together, but probably there is a proportional
relation among them, like for each six-pack of beer,
two bags of crisps. Consequently, the original prob-
lem definition and the model should be extended to
learn patterns with such property.
4.1 Problem Statement
Until now, we have been working under the assump-
tion that we were using sets of items called itemsets
that did not contain any item repeated. Now itemsets
can contain the same item more than once. Hence, we
need to modify also our description of D
0
and D to
use multisets:
D
0
= {(t,I) | t A, I I , i I , D
t,i
= Q
i
(I)} (8)
This new representation differs with the one at
Equation 1 when assigning the value of D
t,i
. In this
case instead of using a binary value to assert when-
ever the item exists or not in the itemset, we propose
the function Q
i
(I), which return the cardinal of the set
of all the items in I that are equal to i or, what is the
same, the Q uantity of i in I:
Q
i
(I) = |{k I | k = i}| (9)
Table 2 show an example of how this new matrix
D is.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
214
Table 2: Example of item database with repetition. Left
(D
0
): transaction database, where items are repeated. Right
(D): new representation using repetitions.
(T
id
, Itemset)
(1, {2, 3, 2})
(2, {4})
(3, {1, 3})
(4, {1})
(5, {2, 3})
(6, {4})
(7, {3, 4})
(8, {1, 2, 2, 3})
(9, {1, 2})
(10, {1, 2, 3})
A B C D
T
id
1 2 3 4
1 0 2 1 0
2 0 0 0 1
3 1 0 1 0
4 1 0 0 0
5 0 1 1 0
6 0 0 0 1
7 0 0 1 1
8 1 2 1 0
9 1 1 0 0
10 1 1 1 0
D
0
D
Since the input matrix now instead of containing
binary values it contains the specific amounts of items
in the transactions, the coverage definition (Equa-
tion 2) must change accordingly:
ϕ
R
D
(I) = {t A | i I, D
t,i
Q
i
(I)} (10)
With this new definition we assure that support is
estimated properly, since itemsets with high amounts
of repeated items than the desired one provide support
but not vice versa. For example, {1, 3, 3} supports the
pattern {1, 3} but not the {1, 3, 3, 3} one. This new
definition does not affect the way support is estimated
(i.e. Equation 3), but we redefine it for convenience,
since the coverage set to be taken into account is dif-
ferent:
support
R
D
(I) = |ϕ
R
D
(I)| (11)
Therefore, the multiset mining problem from a
constraint programming approach consists on finding
the set
frequent itemsets = {I | I I , support
R
D
(I) θ}
(12)
Observe that the problem definition for multisets
is still compatible with the original model since it is
equivalent to discover patterns from multisets where
at most there is one repetition for item.
4.2 Constraint Programming Model
According to the problem definition, the model main
extension is related to the coverage constraint. Now,
the instantiated vector T contain a 1 in the t
th
only if
the amount of each item i within a certain itemset I is
greater or equal in the t
th
transaction of the matrix D:
t A : T
t
= 1
^
iI
D
t,i
Q
i
(I) (13)
Table 3: Output from the multiset mining.
A B C D
1 2 3 4 Support Itemset
0 0 0 1 3 {D}
0 2 0 0 2 {B,B}
0 2 1 0 2 {B,B,C}
1 1 1 0 2 {A,B,C}
1 0 1 0 3 {A,C}
0 1 1 0 4 {B,C}
0 0 1 0 6 {C}
1 1 0 0 3 {A,B}
1 0 0 0 5 {A}
0 1 0 0 5 {B}
The support constraint and the problem formula-
tion from the remain the same as in Section 3.2.
4.3 Example
Here we provide an illustrative example of the model
proposed. We use example provided in Table 2. The
θ threshold is 2 and it represents the 20% of the
database.
Thus, the multiset {B, B,C} has its representa-
tion in the D matrix as {0, 2, 1, 0}. To estimate the
coverage of {B, B,C}, all the transactions from D
must be checked to see if they meet the specifications
shown at Equation 10. Transactions 1 and 8 do, so
ϕ
R
D
({0, 2, 1, 0}) = {1, 8}. The next step is calculat-
ing the support, so as it is the cardinal of the cov-
erage, support
R
D
({0, 2, 1, 0}) = |ϕ
R
D
({0, 2, 1, 0})| =
|{1, 8}| = 2. And finally, multiset {0, 2, 1, 0} will be
considered a frequent patterns since this support is
greater or equal to the previously specified θ.
Following the same procedure, multiset
{C, D} is represented as {0, 0, 1, 1} in D. Then,
ϕ
R
D
({0, 0, 1, 1}) = {7} since 7 is the only transaction
that has elements C and D greater or equal to 1.
Finally, support
R
D
({0, 0, 1, 1}) = |ϕ
R
D
({0, 0, 1, 1})| =
|{7}| = 1, which obviously is lower than the θ thresh-
old an therefore it is not considered as a frequent
pattern.
All the frequent multiset patterns found for the ex-
ample are shown in Table 3.
5 SEQUENCE MINING
In this section we are considering itemsets with item
order information but without any repetition (see Ta-
ble 4).Back again to the retailers example, sequence
mining allows knowing the order in which items are
retrieved. Such knowledge can be useful for differ-
ent optimization purposes, as the product distribution:
Constraint-programmingApproachforMultisetandSequenceMining
215
Table 4: Example of ordered itemsets database. Left (D
0
):
transaction database. Right (D): representation including
order.
(T
id
, Itemset)
(1, {4})
(2, {3, 2})
(3, {3, 1, 2})
(4, {1, 3, 2})
(5, {4, 1})
(6, {3, 4})
(7, {1, 2, 3})
(8, {3, 4, 2})
(9, {2, 4, 3, 1})
(10, {1, 4, 3, 2})
A B C D
T
id
1 2 3 4
1 0 0 0 1
2 0 2 1 0
3 2 3 1 0
4 1 3 2 0
5 2 0 0 1
6 0 0 1 2
7
1 2 3 0
8 0 3 1 2
9 4 1 3 2
10 1 4 3 2
D
0
D
the clients could find a simpler route to gather ev-
erything or the store manager could introduce in the
mined routes products that usually are not bought by
the clients but they could be interested in.
5.1 Problem Statement
In this case, itemsets are ordered sets of items, so in
a certain itemset I, I
n
would be its n
th
item. Conse-
quently, the matrix D must be able to represent or-
der information. Our approach is similar to the ones
presented before, keeping the definitions as simple as
possible:
D
0
= {(t, I) | t A, I I , i I , D
t,i
= O
i
(I)} (14)
The main difference with Equation 1 is the binary
value assigned to D
t,i
to represent the existence of the
item in the itemset. Here we have replaced it with
the new function O
i
(I), where given a certain ordered
itemset I and a certain item i returns the position of i
within I, for example O
4
({2, 4, 3, 1}) = 2. The Order
function, O
i
(I), is the following:
O
i
(I) = k | k |I|, I
k
= i (15)
The coverage constraint needs to be redefined
too. In the original work and in our previous ex-
tension, coverage is a measure that estimates how
many times a combination of items or a quantity of
items is contained by the database. The new coverage
function we propose needs to change completely this
paradigm. Instead of counting the items, the items
must fulfill specific conditions between themselves.
That means that if we want to count the support of
a certain ordered itemset {C, A, B}, first we need to
know which of the transactions in the database con-
tain items where the item C goes before A and B,
where item A goes after item C and before B and
where item B goes after items C and A, or what’s the
same, which transactions in the database follow the
order specified by the itemset. It is easy to see that
these conditions can be also considered rules or con-
straints, what makes the constraint-based definition an
excellent choice. The following equation shows how
simple is to represent this conditions:
ϕ
O
D
(I) = {t A | i, j I, i 6= j, (I
i
< I
j
) (D
t,i
< D
t, j
)}
(16)
The new coverage represents the set of all the
transactions indices t where given all possible posi-
tions i and j (and i 6= j) within an itemset I, if item
I
i
appears before then item I
j
then in the item in the
i
th
position of the transaction must appear also before
the one at the j
th
.
Consistently, the support is defined as follows:
support
O
D
(I) = |ϕ
O
D
(I)| (17)
Finally, the sequence mining problem from a con-
straint programming approach consists on finding the
set
frequent itemsets = {I | I I , support
D
O
(I) θ}
(18)
5.2 Constraint Programming Model
According to the problem definition, the coverage
constraint form is modeled as a constrained program-
ming approach as follows:
t A : T
t
= 1
^
iI
^
jI\{i}
(I
i
< I
j
) (D
t,i
< D
t, j
)
(19)
The model denoted by the above equation, comes
straightforward from Equation 16, but it has three
problems. The first and most important one is the
following: the constraint checks the condition (I
i
<
I
j
) (D
t,i
< D
t, j
) and if D
t,i
is equals to 0 (i.e. the
item does not belong to the transaction), a true as-
sertion will be occur. Consequently, we add an addi-
tional condition to the constraint, as follows:
t A : T
t
= 1
^
iI
^
jI\{i}
(I
i
< I
j
) . . .
. . . ((D
t,i
> 0) (D
t,i
< D
t, j
))
(20)
The second problem relies in the constraint defi-
nition itself. It looks for how many transactions ful-
fill some denoted conditions between items, so the re-
sulting patterns will have at minimum two items. In
the case that patterns with single items were requested
the procedure becomes the same one as the previous
model: counting occurrences of items. In that case,
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
216
Table 5: Output from the sequence mining.
A B C D
1 2 3 4 Support Itemset
1 2 0 0 4 {A,B}
1 0 2 0 3 {A,C}
1 3 2 0 2 {A,C,B}
0 1 2 0 2 {B,C}
2 0 1 0 2 {C,A}
0 2 1 0 5 {C,B}
0 0 1 2 2 {C,D}
2 0 0 1 2 {D,A}
0 2 0 1 2 {D,B}
0 0 2 1 2 {D,C}
the previous model should be used adding a new con-
straint limiting the pattern length like |I| = 1 can be
used.
The third problem is that any solver fed with this
model could return malformed patterns. If we look
back to Table 4, the ordered itemset {0, 2, 1, 0} can
be considered a pattern since half of the transac-
tions contain this sequence. What makes inconsistent
the model is that the solver could return {0, 2, 1, 0},
{0, 3, 1, 0} or {0, 3, 2, 0} as well since we are check-
ing the transactions precedences without any point of
reference. All three of them have the same meaning:
that the item C has a lower value of than item B so
therefore it goes before, but only the first one should
be returned. An auxiliary constrain like i I, i |I|
limiting the maximum value that the items can have
should be more than enough to avoid this.
Finally, the support constraint and the problem
formulation from the remain the same as in Section
3.2.
5.3 Example
To provide an illustrative example of our mode, we
use the data shown in Table 4. We assume the thresh-
old value: θ = 2
Like in the previous example, the first thing to
do is to translate the itemset to match the represen-
tation from Table 4. If we choose {C, B} (transac-
tion 2), then its representation is {0, 2, 1, 0}. If we
what know if it is a frequent pattern, from Equa-
tion 16, item C has a lower value (goes before)
than item B, and all the transactions in matrix D
that contain a lower value in item C than in item
B support this itemset. That is, ϕ
O
D
({0, 2, 1, 0}) =
{2, 3, 4, 8, 10}. Consequently, through Equation 17,
support
O
D
({0, 2, 1, 0}) = |{2, 3, 4, 8, 10}| = 5 and
since it is a value greater than the θ specified it is con-
sidered a frequent pattern.
All the frequent sequence patterns found for the
example are shown in Table 5.
Table 6: Multiset patterns.
Support Pattern
10% 2x frontpage, 2x tech, 1x msn-sports
12% 1x msn-sports
12% 1x frontpage, 1x news, 1x business
21% 1x news
21% 3x frontpage
22% 1x local
23% 1x tech
26% 1x on-air
32% 2x frontpage
51% 1x frontpage
6 EXPERIMENTATION
In order to test our approach with real data, the UCI
(Frank and Asuncion, 2010) dataset corresponding
to the MSNBC anonymous web navigation has been
used.
6.1 Experimental Setup
The raw information provided by the dataset con-
sist on 1000000 data lines, each one is the activity
of a user session and contains a sequence of num-
bers from 1 to 17 that represents the category of the
web page that was loaded. For instance the sequence
{2, 4, 4, 4, 3} means that the user loaded the “news”
section, then loaded three times the “local” news and
finally loaded the “technology” section.
The original work presented at (Guns et al.,
2011), used an implementation based in Essence lan-
guage (Frisch et al., 2008), but we have adopted a
solution using MiniZinc (Nethercote et al., 2007) be-
cause it allows the specification of models using natu-
ral mathematical-like notation, speeding up our work.
6.2 Multisets Mining Resuts
As the original data from MSNCB.com contains re-
peated items per each transaction, so it can be used to
test our multiset model. First of all the database needs
to be flattened into the matrix representation shown in
Section 4. That means that given the 17 different pos-
sible categories, a multiset {2, 4, 4, 4, 3} needs to be
transformed into {0, 1, 1, 3, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}. On the other hand, for
our testing experiments we have used only the first
100 data entries.
Table 6 shows some of the patterns found. The
most common behavior found within the explored
data is to open the web front page, since 51% of users
loads it. Opening the front page is a rather common
Constraint-programmingApproachforMultisetandSequenceMining
217
way to start the browsing, but with our extension we
can know how many of them revisit this front page.
Results show that 32% of users during its web explo-
ration open a second time the front page and also that
a 21% opens it three times. If we don’t mind to re-
duce the support, we can find that 10% of explored
users open twice the front page, twice again the tech-
nology page and also read the msn-sports.
Using the original approach, these patterns would
have been found too, but without knowing how many
times the users visit the pages. This proofs that there
is still more information to mine than just the common
pages browsed in the same session. Despite that we
have worked with little subset of information, the con-
clusions of our results can be considered quite similar
to previous works focused in the visualization of the
sequences with this database (Cadez et al., 2000): af-
ter the front page, tech an local sections recive a lot of
user visits.
6.3 Sequence Mining Results
Regarding sequence mining, the MSNBC data we
have used only the first 100 data lines.
In Table 7 we present some of the patterns found.
In this case the support is not as high as in the pre-
vious example, but the sequences are still interesting.
The most common sequence followed by the 5% of
the explored users consists in loading first msn-sports
and then the sports ones, which are obvious related
making a sensible pattern. The second most common
sequence is also reasonable, since 4% of the users first
open the general news web page and after that the lo-
cal news.
We have to take under consideration that the found
patterns are quite short (no more than two consecutive
web sections) and that the support for these patterns is
not very hight (no more than 5%). That can be caused
by the original transformation we did in order to re-
move the repeated items, but still sustain the point that
there was still information that could not be mined us-
ing the existing constraint-based approaches. In this
case our transformation makes more difficult to com-
pare our results with the ones at (Cadez et al., 2000),
but still there are some patters in common like the
{misc, local}.
6.4 Discussion
On one hand the experimentatrion done until now has
shown that the running time for these models depends
exclusively on the solver used. Minizinc has the ca-
pability of being a high level constraint programming
language what allows the users to change its under-
Table 7: Sequence patterns.
Support Pattern
3% frontpage, business
3% local, health
3% on-air, msn-news
3% frontpage, local
3% frontpage, news
3% tech, local
3% misc, local
3% on-air, misc
4% news, local
5% msn-sports, sports
lying solver easyly. Therefore, the implementation
of the models “as they are” in this paper resulted in
processing a set of data with a multipurpose solver
for several hours while a fully optimized one returned
the solutions with only less than a second of execu-
tion. Since the scope of this work does not involve
surveying different solver implementations we rele-
gate this task for future works. Another future work
regarding process optimization may be the reification
of the already known contraints in order to simplyfy
the solver’s work allowuing it to work even faster.
On the other hand, the results presented in this sec-
tion show the feasibility of the constraint program-
ming paradigm for multiset and sequence learning.
Further experiments should be performed in other
scenarios, and particularly with big data to find out the
limits of the methodology. However, the important
issue is that the learning process is not encapsulated
in a complex and static procedure, but provided in a
declarative way. This will allow the addition of new
constraints without the need of modifying the base
model, for instance, looking for closed or maximal
patterns instead of frequent.
7 CONCLUSIONS
In this paper we have explored some of the state-
of-the-art constraint-based mining and we have pre-
sented the modifications done to (Guns et al., 2011)
that extend the original model so now it can support
multiset and sequence mining.
Our model for multiset mining requires a change
on the definition of the coverage constraint that takes
into account the amount of items in the frequent pat-
tern. The second, sequence mining, has the handi-
cap of requiring more constraints in order to return
valid outputs but the learning problem is provided in
a declarative and simple way. We have shown exam-
ples of its application into toy problems to easily com-
prehend how the constraints works. Moreover, we
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
218
have applied the constraint programming models to
the MSNBC database from the UCI repository, hav-
ing successful results.
Our models where simple enough to work with
academic examples, but further experimentation is re-
quired, focussing on escalation (both: length of trans-
action and database size). In this sense, our research
is directed towards the use of reified constraints that
optimize the constraint programming model.
Note that we have presented two independent
models: one for mining item repetitions within item-
sets and another one or mining the itemsets with or-
der relations. The combination of both can lead us
into a complete sequence patter mining. Once the
combination is done, this model will be equiparable
to classical algorithms used nowadays like prefixspan
or clospan with the advantage of being declarative and
easily extensible.
ACKNOWLEDGEMENTS
This work is supported by the research projects
CTQ2008-06865-C02-02 and DPI2011-24929, and
the grant FPU-AP2009-2831, all of them funded
by the Spanish Government. The MSNBC dataset
has been extracted from the UCI Machine Reposi-
tory (Frank and Asuncion, 2010).
REFERENCES
(2012). Frequent itemset mining dataset repository. http://
fimi.ua.ac.be/data/.
Agrawal, R. and Srikant, R. (1994). Fast algorithms for
mining association rules, volume 1215, pages 487–
499. Morgan Kaufmann.
Agrawal, R. and Srikant, R. (1995). Mining sequential pat-
terns. In Proceedings of the Eleventh International
Conference on Data Engineering, ICDE ’95, pages 3–
14, Washington, DC, USA. IEEE Computer Society.
Bonchi, F., Giannotti, F., Lucchese, C., Orlando, S., Perego,
R., and Trasarti, R. (2009). A constraint-based query-
ing system for exploratory pattern discovery. Informa-
tion Systems, 34(1):3–27.
Bonchi, F. and Lucchese, C. (2007). Soft constraint-based
pattern mining. Data Knowl. Eng., 60:377–399.
Brand, M. (1998). Pattern discovery via entropy minimiza-
tion. Technical report, MERL - A Mitsubishi Electric
Research Laboratory.
Burke, R. (1999). The wasabi personal shopper: a case-
based recommender system. In Proceedings of the
sixteenth national conference on Artificial intelligence
and the eleventh Innovative applications of artificial
intelligence conference innovative applications of ar-
tificial intelligence, AAAI ’99/IAAI ’99, pages 844–
849, Menlo Park, CA, USA. American Association
for Artificial Intelligence.
Cadez, I., Heckerman, D., Meek, C., Smyth, P., and White,
S. (2000). Visualization of navigation patterns on a
web site using model-based clustering. In Proceed-
ings of the sixth ACM SIGKDD international confer-
ence on Knowledge discovery and data mining, KDD
’00, pages 280–284, New York, NY, USA. ACM.
David, J. and Nourine, L. (Submitted to Theoretical
Computer Science). A generic algorithm for
sequence mining. 15 pages. http://www-lipn.univ-
paris13.fr/ david/old.version/articles/generation.pdf
[Accessed: 9.2.2012].
De Raedt, L., Guns, T., and Nijssen, S. (2008). Con-
straint programming for itemset mining. In Proceed-
ings of the 14th ACM SIGKDD international confer-
ence on Knowledge discovery and data mining, KDD
’08, pages 204–212, New York, NY, USA. ACM.
De Raedt, L., Guns, T., and Nijssen, S. (2010). Constraint
programming for data mining and machine learning.
In AAAI, pages 1671–1675.
De Raedt, L. and Kramer, S. (2001). The levelwise ver-
sion space algorithm and its application to molecular
fragment finding. In Proceedings of the 17th inter-
national joint conference on Artificial intelligence -
Volume 2, pages 853–859, San Francisco, CA, USA.
Morgan Kaufmann Publishers Inc.
Frank, A. and Asuncion, A. (2010). UCI machine learning
repository. http://archive.ics.uci.edu/ml.
Frisch, A., Harvey, W., Jefferson, C., Martinez-Hernandez,
B., and Miguel, I. (2008). Essence: A constraint lan-
guage for specifying combinatorial problems. Con-
straints, 13:268–306. 10.1007/s10601-008-9047-y.
Gal, Y., Reddy, S., Shieber, S. M., Rubin, A., and Grosz,
B. J. (2012). Plan recognition in exploratory domains.
Artificial Intelligence, 176(1):2270 – 2290.
Guns, T., Nijssen, S., and Raedt, L. D. (2011). Itemset min-
ing: A constraint programming perspective. Artificial
Intelligence, 175(12-13):1951 – 1983.
Han, J., Cheng, H., Xin, D., and Yao, X. (2007). Frequent
pattern mining: current status and future directions.
Data Min Knowl Disc, 15:55–86.
Nethercote, N., Stuckey, P. J., Becket, R., Brand, S., Duck,
G. J., and Tack, G. (2007). Minizinc: towards a stan-
dard cp modelling language. In Proceedings of the
13th international conference on Principles and prac-
tice of constraint programming, CP’07, pages 529–
543, Berlin, Heidelberg. Springer-Verlag.
Schmidt, C., Sridharan, N., and Goodson, J. (1978). The
plan recognition problem: An intersection of psychol-
ogy and artificial intelligence. Artificial Intelligence,
11(1-2):45 83. Applications to the Sciences and
Medicine.
Soulet, A., Kl
´
ema, J., and Cremilleux, B. (2006). Effi-
cient mining under flexible constraints through sev-
eral datasets. In D
ˇ
zeroski, S. and Struyf, J., editors,
Proceedings of 5th International Workshop on Knowl-
edge Discovery in Inductive Databases, pages 131–
142, Berlin, Germany. Humbolt Universit
¨
at Berlin,
2006, s.
Constraint-programmingApproachforMultisetandSequenceMining
219
Srikant, R. and Agrawal, R. (1996). Mining sequential
patterns: Generalizations and performance improve-
ments. In Apers, P. M. G., Bouzeghoub, M., and
Gardarin, G., editors, Proc. 5th Int. Conf. Extending
Database Technology, EDBT, volume 1057, pages 3–
17. Springer-Verlag.
Zhang, Z., Kwok, J. T., and Yeung, D.-Y. (2003). Paramet-
ric distance metric learning with label information. In
Proceedings of the 18th international joint conference
on Artificial intelligence, pages 1450–1452, San Fran-
cisco, CA, USA. Morgan Kaufmann Publishers Inc.
KDIR2012-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
220