Symmetry Breaking in Itemset Mining
Belaïd Benhamou¹,², Saïd Jabbour², Lakhdar Sais² and Yacoub Salhi²
¹ Aix-Marseille Université, Laboratoire des Sciences de l'Information et des Systèmes (LSIS), Domaine Universitaire de Saint Jérôme, Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France
² Centre de Recherche en Informatique de Lens (CRIL), Université d'Artois, Rue Jean Souvenir, SP 18 F 62307 Lens Cedex, France
Keywords:
Data Mining, Itemset Mining, Symmetry, Satisfiability, Constraint Programming.
Abstract:
The concept of symmetry has been extensively studied in constraint programming and in propositional satisfiability. Several methods for detecting and removing symmetries have been developed, and their integration into well-known solvers of these domains has dramatically improved their effectiveness on a large variety of problems considered difficult to solve. The concept of symmetry may be exported to other domains where some structures can be exploited effectively, particularly to data mining, where some tasks can be expressed as constraints. In this paper, we are interested in the detection and elimination of symmetries in the problem of finding frequent itemsets of a transaction database and its variants. Recent works have provided effective encodings of these data mining tasks as Boolean constraints, and some recent works on symmetry detection and elimination in itemset mining problems have been proposed. In this work we propose a generic framework that can be used to eliminate symmetries for data mining tasks expressed in a declarative constraint language. We show how symmetries between the items of the transactions are detected and eliminated by adding symmetry-breaking predicates (SBP) to the Boolean encoding of the data mining task.
1 INTRODUCTION
In this paper, we investigate the notion of symmetry elimination in Frequent Itemset Mining (FIM) (Agrawal et al., 1993). The itemset mining problem has several applications and remains central in the data mining research field. The best-known example is the one considered by large retail organizations, called basket data. A record of such data essentially contains the customer identification, the transaction date and the items bought by the customer. Advances in bar-code technology and the use of credit cards or frequent-customer cards now make it possible to collect and store large amounts of sales data. It is then important for retail firms to know the sets of items that are frequently bought by customers. This is the frequent itemset mining problem. Since its introduction in 1993 (Agrawal et al., 1993), several highly scalable algorithms have been introduced (Agrawal and Srikant, 1994; Han et al., 2000; Zaki and Hsiao, 2005; Uno et al., 2003; Uno et al., 2004; Burdick et al., 2001; Grahne and Zhu, 2005; Minato et al., 2007) to enumerate the sets of frequent items. The two challenging questions investigated in such algorithms are, on the one hand, how to compute all the frequent itemsets in a reasonable CPU time and, on the other hand, how to compact the output and reduce its size when there is a huge number of frequent itemsets. Many other data mining tasks exist, such as association rule mining, frequent pattern mining, clustering and episode mining, but almost all of them are closely related to itemset mining, which appears to be the canonical problem. Many efficient and scalable algorithms have been developed for specific mining tasks. As stated in (Tiwari et al., 2010), different methods for itemset mining are available. They mainly differ from each other in the way they explore the search space, the data structures they use, and how they exploit the anti-monotonicity property. The other important point is the size of the output of such algorithms. Some solutions have been found: for instance, one can enumerate only the closed, the maximal, the condensed, the preferred, or the discriminative itemsets instead of all the frequent itemsets.
The data mining community introduced the constraint-based mining framework in order to specify, in terms of constraints, the properties of the patterns to be mined (Bonchi and Lucchese, 2007; Bucilă et al., 2003; Pei et al., 2004; Besson et al., 2010). A wide variety of constraints have been successfully integrated and implemented in different specific data mining algorithms.
Benhamou B., Jabbour S., Sais L. and Salhi Y.
Symmetry Breaking in Itemset Mining.
DOI: 10.5220/0005078200860096
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2014), pages 86-96
ISBN: 978-989-758-048-2
Copyright © 2014 SCITEPRESS (Science and Technology Publications, Lda.)
Recently, De Raedt et al. (Raedt et al., 2008; Guns et al., 2011a) introduced the alternative of using constraint programming in data mining. They showed that such an alternative can be efficiently applied to a wide range of pattern mining problems. Most pattern mining constraints (e.g. frequency, closedness, maximality, and anti-monotonicity constraints) have been expressed in a declarative constraint programming language. The data mining problem is modeled as a constraint satisfaction problem (CSP) and a solver (e.g. Gecode) is then used to enumerate the solutions corresponding to the set of interesting patterns. A strong point here is that different constraints can be combined without the need to modify the solver, unlike in the existing specific data mining algorithms. Since the introduction of this declarative approach, there has been a growing interest in finding generic and declarative approaches to model and solve data mining tasks. For instance, several works expressed data mining problems as propositional satisfiability (Jabbour et al., 2013c; Henriques et al., 2012; Métivier et al., 2012; Khiari et al., 2010; Raedt et al., 2010; Jabbour et al., 2013b) and used efficient modern SAT solvers as black boxes to solve them. More recently, a declarative constraint framework for solving data mining tasks, called MiningZinc (Guns et al., 2013), has been introduced.
On the other hand, symmetry is by definition a multidisciplinary concept. It appears in many fields ranging from mathematics to Artificial Intelligence, chemistry and physics. It takes different forms and has different uses, even inside the same field. In general, it refers to a transformation which leaves an object (a figure, a molecule, a physical system, a formula or a constraint network...) invariant, that is, which does not modify its fundamental structure and/or its properties. For instance, rotating a chessboard by 180 degrees gives a board that is indistinguishable from the original one. Symmetry is a fundamental property that can be used to study these various objects, to finely analyze these complex systems or to reduce the computational complexity when dealing with combinatorial problems.
As far as we know, the principle of symmetry was first introduced by Krishnamurthy (Krishnamurthy, 1985) to improve resolution in propositional logic. Symmetries for Boolean constraints are studied in depth in (Benhamou and Sais, 1992a; Benhamou and Sais, 1994a). The authors showed how to detect them and proved that their exploitation is a real improvement for the efficiency of several automated deduction algorithms. Since then, many research works on symmetry have appeared. For instance, the static approach used by James Crawford et al. in (Crawford et al., 1996) for propositional logic theories consists in adding constraints expressing the global symmetry of the problem. This technique has been improved in (Aloul et al., 2003b) and extended to 0-1 Integer Linear Programming in (Aloul et al., 2004). The notion of interchangeability in Constraint Satisfaction Problems (CSPs) is introduced in (Freuder, 1991) and symmetry for CSPs was studied early on in (Puget, 1993; Benhamou, 1994).
In the context of constraint programming, Guns
et al. (Guns et al., 2011b) used symmetry breaking
constraints to impose a strict ordering on the patterns
in k-pattern set mining. More recently, symmetry de-
tection and elimination are integrated in itemset min-
ing problems (Jabbour et al., 2012; Jabbour et al.,
2013a). Two different approaches are proposed. In
the first one, symmetries are eliminated by rewriting
the transaction database (eliminating items), while in
the second approach the authors integrate symmetry
elimination in Apriori-like algorithms. For other pre-
vious studies on symmetries in data mining, we refer
the reader to the related work section.
The work that we present in this paper goes in this direction. It consists in detecting and eliminating symmetries in the itemset mining problem expressed as a Boolean satisfiability problem. We will show how the global symmetries¹ of the given transaction database are detected and expressed in terms of symmetry-breaking predicates. Such predicates are added to the Boolean encoding of the itemset mining problem in a preprocessing step and a SAT solver is used as a black box to enumerate the non-symmetrical solutions (the non-symmetrical frequent itemsets). In most data mining tasks, we usually need to enumerate interesting patterns and this usually leads to an output of huge size. Eliminating symmetries might reduce the size of the output and lead to the discovery of the non-symmetrical patterns, which are the most important and representative of the knowledge.
The rest of the paper is organized as follows. In Section 2, we give some necessary background on the satisfiability problem, permutations and the itemset mining problem. We study the notion of symmetry in itemset mining represented as Boolean constraints in Section 3. In Section 4 we show how symmetries can be detected by means of graph automorphism. We show in Section 5 how these symmetries can be eliminated by adding symmetry-breaking predicates to the Boolean encoding. Section 6 gives experiments on different data-sets to show the advantage of using symmetries in itemset mining. Section 7 investigates the related works and Section 8 concludes the work.
¹ Symmetries that are present in the initial formulation of the problem.
2 BACKGROUND
We summarize in this section some background on the satisfiability problem, permutations, and the itemset mining problem.
2.1 Propositional Satisfiability (SAT)
We shall assume that the reader is familiar with propositional logic. We give here a short description. Let $V$ be the set of propositional variables, simply called variables. Variables will be distinguished from literals, which are variables with an assigned parity 1 or 0, meaning True or False, respectively. This distinction will be ignored whenever it is convenient and not confusing. For a propositional variable $p$, there are two literals: $p$, the positive literal, and $\neg p$, the negative one.
A clause is a disjunction of literals such that no literal appears more than once, nor a literal and its negation at the same time. This clause is denoted by $p_1 \vee p_2 \vee \dots \vee p_n$. A formula $F$ in conjunctive normal form (CNF) is a conjunction of clauses.
A truth assignment to a CNF formula $F$ is a mapping $\rho$ defined from the set of variables of $F$ into the set $\{True, False\}$. If $\rho[p]$ is the value of the positive literal $p$, then $\rho[\neg p] = \neg\rho[p]$. The value of a clause $p_1 \vee p_2 \vee \dots \vee p_n$ in $\rho$ is True if the value True is assigned to at least one of its literals in $\rho$, and False otherwise. By convention, we define the value of the empty clause ($n = 0$) to be False. The value $\rho[F]$ is True if the value of each clause of $F$ is True, and False otherwise. We say that a CNF formula $F$ is satisfiable if there exists some truth assignment $\rho$ that assigns the value True to $F$; it is unsatisfiable otherwise. In the first case $\rho$ is called a model of $F$. Let us remark that a CNF formula which contains the empty clause is unsatisfiable.
It is well known (Tseitin, 1968) that for every propositional formula $F$ there exists a formula $F'$ in conjunctive normal form (CNF) such that $F'$ is satisfiable iff $F$ is satisfiable. In the following we will assume that the formulas are given in CNF.
2.2 Permutations
Let $\Omega = \{1, 2, \dots, N\}$ for some integer $N$, where each integer might represent a propositional variable. A permutation of $\Omega$ is a bijective mapping $\sigma$ from $\Omega$ to $\Omega$ that is usually represented as a product of cycles. We denote by $Perm(\Omega)$ the set of all permutations of $\Omega$ and by $\circ$ the composition of the permutations of $Perm(\Omega)$. The pair $(Perm(\Omega), \circ)$ forms the permutation group of $\Omega$. That is, $\circ$ is closed and associative, the inverse of a permutation is a permutation and the identity permutation is a neutral element. A pair $(T, \circ)$ forms a sub-group of $(S, \circ)$ iff $T$ is a subset of $S$ and forms a group under the operation $\circ$.
The orbit $\omega^{Perm(\Omega)}$ of an element $\omega$ of $\Omega$ on which the group $Perm(\Omega)$ acts is $\omega^{Perm(\Omega)} = \{\omega^{\sigma} \mid \omega^{\sigma} = \sigma(\omega), \sigma \in Perm(\Omega)\}$. A generating set of the group $Perm(\Omega)$ is a subset $Gen$ of $Perm(\Omega)$ such that each element of $Perm(\Omega)$ can be written as a composition of elements of $Gen$. We write $Perm(\Omega) = \langle Gen \rangle$. An element of $Gen$ is called a generator. The orbit of $\omega \in \Omega$ can be computed by using only the set of generators $Gen$.
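The last remark can be illustrated with a short sketch. It assumes that each generator is given as a Python dictionary mapping every point to its image; the function name `orbit` and the two example generators are illustrative choices, not part of the original text. The orbit is obtained by a breadth-first closure that applies only the generators.

```python
from collections import deque

def orbit(element, generators):
    """Orbit of `element` under the group generated by `generators`.

    Each generator is a dict mapping every point to its image (assumed format).
    """
    seen = {element}
    queue = deque([element])
    while queue:
        point = queue.popleft()
        for sigma in generators:
            image = sigma[point]
            if image not in seen:
                seen.add(image)
                queue.append(image)
    return seen

# Points 1..5 with two generators written as full mappings.
swap_12 = {1: 2, 2: 1, 3: 3, 4: 4, 5: 5}    # the cycle (1 2)
cycle_234 = {1: 1, 2: 3, 3: 4, 4: 2, 5: 5}  # the cycle (2 3 4)
gens = [swap_12, cycle_234]

orbit_of_1 = orbit(1, gens)
orbit_of_5 = orbit(5, gens)
```

Applying generators alone is sufficient for a finite group, since the inverse of a permutation is one of its own powers.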
2.3 Frequent, Closed and Maximal
Itemset Mining Problems
Let $L = \{0, \dots, m-1\}$ be a set of $m$ items and $\mathcal{T} = \{0, \dots, n-1\}$ a set of $n$ transactions (transaction identifiers). A subset $I \subseteq L$ is called an itemset and a transaction $t \in \mathcal{T}$ over $L$ is in fact a pair $(t_{id}, I)$ where $t_{id}$ is the transaction identifier and $I$ the corresponding itemset. In the basket data example, $t_{id}$ represents the customer identification and $I$ the set of items the customer bought. Usually, when there is no confusion, a transaction is just denoted by its identifier. A transaction database $D$ over $L$ is a finite set of transactions such that no two different transactions have the same identifier. In the basket data example, such a data set records the different transactions made by customers. A transaction database can be seen as an $n \times m$ binary matrix, where $n = |\mathcal{T}|$ and $m = |L|$, with $D_{t,i} \in \{0, 1\}$ for all $t \in \mathcal{T}$ and for all $i \in L$. More precisely, a transaction database is expressed by the set $D = \{(t, I) \mid t \in \mathcal{T}, I \subseteq L, \forall i \in I : D_{t,i} = 1\}$. The coverage $C_D(I)$ of an itemset $I$ in a transaction database $D$ is the set of all transactions in which $I$ occurs. That is, $C_D(I) = \{t \in \mathcal{T} \mid \forall i \in I, D_{t,i} = 1\}$. The support $S_D(I)$ of an itemset $I$ in $D$ is the number $|C_D(I)|$ of transactions supporting $I$, i.e. the cardinality of its coverage set. Moreover, the frequency $F_D(I)$ of $I$ in $D$ is defined by $\frac{|C_D(I)|}{|D|}$.
Example 1. Consider the transaction database $D$ made over the set of drink items $L$ = {Beer, Wine, Whisky, Cognac, Vodka, Pastis, Ricard, Gin, Coke, Pepsi, Shweps, Juice, Water, Orangina}. For example, we can see in Table 1 that the itemset $I$ = {Beer, Wine} has $C_D(I) = \{001, 002, 003, 004\}$, $S_D(I) = |C_D(I)| = 4$, and $F_D(I) = 0.4$.
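The values of Example 1 can be recomputed with a minimal sketch. It assumes the database of Table 1 is stored as a dictionary from transaction identifiers to itemsets; the helper names `cover`, `support` and `frequency` are illustrative and simply mirror $C_D$, $S_D$ and $F_D$.

```python
# Table 1 as a mapping from transaction identifier to itemset.
D = {
    "001": {"Beer", "Wine", "Whisky", "Vodka", "Cognac", "Water"},
    "002": {"Beer", "Wine", "Whisky", "Vodka", "Gin", "Water"},
    "003": {"Beer", "Wine", "Whisky", "Water"},
    "004": {"Beer", "Wine", "Vodka", "Water"},
    "005": {"Ricard", "Coke", "Pepsi", "Water"},
    "006": {"Pastis", "Pepsi", "Coke", "Water"},
    "007": {"Shweps", "Orangina", "Pepsi"},
    "008": {"Shweps", "Orangina", "Coke"},
    "009": {"Juice", "Orangina", "Pepsi"},
    "010": {"Juice", "Orangina", "Coke"},
}

def cover(itemset, db):
    """C_D(I): identifiers of the transactions that contain every item of I."""
    return {t for t, items in db.items() if itemset <= items}

def support(itemset, db):
    """S_D(I): cardinality of the cover."""
    return len(cover(itemset, db))

def frequency(itemset, db):
    """F_D(I): support divided by the number of transactions."""
    return support(itemset, db) / len(db)

I = {"Beer", "Wine"}
print(sorted(cover(I, D)), support(I, D), frequency(I, D))
```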
Table 1: An instance of a transaction database.
t_id   itemset
001    Beer, Wine, Whisky, Vodka, Cognac, Water
002    Beer, Wine, Whisky, Vodka, Gin, Water
003    Beer, Wine, Whisky, Water
004    Beer, Wine, Vodka, Water
005    Ricard, Coke, Pepsi, Water
006    Pastis, Pepsi, Coke, Water
007    Shweps, Orangina, Pepsi
008    Shweps, Orangina, Coke
009    Juice, Orangina, Pepsi
010    Juice, Orangina, Coke
Given a transaction database $D$ over $L$ and a minimal support threshold $\theta$, an itemset $I$ is said to be frequent if $S_D(I) \geq \theta$. $I$ is a closed frequent itemset if, in addition to the frequency constraint, it satisfies the following constraint: for every itemset $J$ such that $I \subset J$, $S_D(I) > S_D(J)$. $I$ is said to be a maximal frequent itemset if, in addition to the frequency constraint, it satisfies the following constraint: for every itemset $J$ such that $I \subset J$, $S_D(J) < \theta$. Closed and maximal itemsets are two well-known condensed representations of frequent itemsets. The data mining tasks we are dealing with in this work are defined as follows:
Definition 1.
1. The frequent itemset mining task consists in computing the following set: $\mathcal{FIM}_D(\theta) = \{I \subseteq L \mid S_D(I) \geq \theta\}$.
2. The closed frequent itemset mining task consists in computing the following set: $\mathcal{CLO}_D(\theta) = \{I \in \mathcal{FIM}_D(\theta) \mid \forall J \subseteq L, I \subset J, S_D(I) > S_D(J)\}$.
3. The maximal frequent itemset mining task consists in computing the following set: $\mathcal{MAX}_D(\theta) = \{I \in \mathcal{FIM}_D(\theta) \mid \forall J \subseteq L, I \subset J, S_D(J) < \theta\}$.
The anti-monotonicity property in itemset mining
expresses the fact that all the subsets of a frequent
itemset are also frequent itemsets. More precisely:
Proposition 1. (Anti-monotonicity) Let $\theta$ be a minimal support threshold. If the itemset $I$ is such that $S_D(I) \geq \theta$, then $\forall J \subseteq I$, $S_D(J) \geq \theta$.
3 SYMMETRY IN BOOLEAN
SATISFIABILITY BASED
ITEMSET MINING
Constraint programming and satisfiability are two well-known declarative programming frameworks where the user just has to specify the problem to solve rather than how to solve it. The frequent itemset mining task and some of its variants (closed, maximal, etc.) were encoded for the first time in (Raedt et al., 2008; Guns et al., 2011a) as constraint programming tasks where a constraint solver can be used as a black box to solve them. Since then, other works (Jabbour et al., 2013c; Henriques et al., 2012; Métivier et al., 2012; Khiari et al., 2010; Raedt et al., 2010; Jabbour et al., 2013b) expressed the data mining tasks as satisfiability problems where the mining tasks are represented by propositional formulas that are translated into conjunctive normal form (CNF) and given as input to a SAT solver. In this work we use the encoding proposed in (Jabbour et al., 2013c), which we augment with symmetry-breaking predicates that are used to avoid enumerating the symmetrical models or the symmetrical no-goods of the resulting CNF encoding.
The general idea behind the CNF encoding of an itemset mining task defined on a transaction database $D$ is to express each of its interpretations as a pair $(I, T)$ where $I$ represents an itemset and $T$ its covering transaction subset in $D$. To do that, a Boolean variable $I_i$ is associated with each item $i \in L$ and a variable $T_t$ is associated with each transaction $t \in \mathcal{T}$. The itemset $I$ is then defined by all the variables $I_i$ that are true. That is, $I_i = 1$ if $i \in I$, and $I_i = 0$ if $i \notin I$. The set of transactions $T$ covered by $I$ is then defined by the set of variables $T_t$ that are true. That is, $T_t = 1$ if $t \in C_D(I)$ and $T_t = 0$ if $t \notin C_D(I)$.
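For instance, the $\mathcal{FIM}_D(\theta)$ task can be seen as the search of the set of models $M = \{(I, T) \mid I \subseteq L, T \subseteq \mathcal{T}, T = C_D(I), |T| \geq \theta\}$. We have to encode both the covering constraint $T = C_D(I)$ and the frequency constraint $|T| \geq \theta$. These constraints are expressed by the following Boolean constraints:
$$\bigwedge_{t \in \mathcal{T}} \Big(\neg T_t \leftrightarrow \bigvee_{i \in L,\, D_{t,i} = 0} I_i\Big)$$
$$\sum_{t \in \mathcal{T}} T_t \geq \theta$$
The frequent closed itemset task is specified by adding to the two previous constraints the following constraints:
$$\bigwedge_{t \in \mathcal{T}} \Big(\neg T_t \leftrightarrow \bigvee_{i \in L,\, D_{t,i} = 0} I_i\Big)$$
$$\bigwedge_{i \in L} \Big(\Big(\bigwedge_{t \in \mathcal{T}} (T_t \rightarrow D_{t,i} = 1)\Big) \rightarrow I_i\Big)$$
The maximal frequent itemset mining is specified by adding the following constraint:
$$\bigwedge_{i \in L} \Big(\Big(\sum_{t \in \mathcal{T}} T_t \times D_{t,i} \geq \theta\Big) \rightarrow I_i\Big)$$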
We denote by $CNF(k, D)$ the CNF formula encoding the data mining task $k$ over the transaction database $D$, where $k$ refers to $\mathcal{FIM}_D(\theta)$, $\mathcal{CLO}_D(\theta)$ or $\mathcal{MAX}_D(\theta)$. We also denote by $P^k_D$ a predicate representing the task $k$ in $D$. Then an itemset $I \subseteq L$ having $T \subseteq \mathcal{T}$ as a cover verifies $P^k_D$ ($P^k_D(I, T) = true$) if $I$ is an itemset which is an answer to the data mining task $k$ and $T$ is its cover.
Remark 1. We recall that a model $J$ of $CNF(k, D)$ is a pair $(I, T)$ where the part $I$ expresses the itemset which is an answer to the considered task $k$ and the part $T$ encodes its cover. More precisely, each literal $I_i$ which is true in $I$ represents the item $i$ in the itemset $I'$ which is an answer to the task $k$, and each literal $T_t$ which is true in $T$ represents the transaction $t$ in $T'$, which is the corresponding cover of $I'$. In the sequel we denote by the pair $(I', T')$ the itemset and its cover that are extracted from an interpretation $J = (I, T)$ of $CNF(k, D)$.
Symmetry is well studied in constraint programming and propositional satisfiability. Since Krishnamurthy's symmetry definition (Krishnamurthy, 1985) and the one given in (Benhamou and Sais, 1992b; Benhamou and Sais, 1994b) for propositional logic, several other definitions have been given by the CP community.
Symmetry has already been defined in itemset
mining (Jabbour et al., 2012; Jabbour et al., 2013a).
We give in the following a similar definition and show
how to eliminate such symmetry by means of sym-
metry breaking predicates that we add to the Boolean
encoding to solve efficiently some data mining tasks
like frequent, closed or maximal itemset mining.
Definition 2. Let $D$ be a transaction database over a set of items $L$. A symmetry of $D$ is a permutation $\sigma$ defined on $L$ such that $\sigma(D) = D$.
Remark 2. It is easy to see that a permutation $\sigma$ on the set of items $L$ induces a permutation $\sigma_{\mathcal{T}}$ on the set of transactions $\mathcal{T}$ and a permutation $\sigma_D$ on the data-set $D$ itself. We denote such permutations only by $\sigma$ when there is no confusion.
A symmetry of $D$ is an item permutation that leaves $D$ invariant. If we denote by $Perm(L)$ the group of permutations of $L$ and by $Sym(L) \subseteq Perm(L)$ the subset of permutations of $L$ that are the symmetries of $D$, then $Sym(L)$ is trivially a sub-group of $Perm(L)$.
Theorem 1. Let $\sigma$ be a symmetry of a transaction database $D$, $I \subseteq L$ an itemset having a cover $T \subseteq \mathcal{T}$, and $P^k_D$ the predicate expressing the data mining task $k$ in $D$. Then $P^k_D(I, T) = true$ iff $P^k_D(\sigma(I), \sigma(T)) = true$.
Proof. It is easy to see that a symmetry of $D$ verifies this property. Indeed, if $\sigma$ is a symmetry of $D$, then $\sigma(D) = D$; thus $D$ and $\sigma(D)$ have the same itemsets and covers satisfying the predicate $P^k_D$. Thus $\sigma$ must transform each itemset $I$ with a cover $T$ verifying the predicate $P^k_D$ into an itemset $\sigma(I)$ with a cover $\sigma(T)$ verifying the predicate $P^k_D$.
In other words the symmetry σ of D transforms
each itemset I having a cover T which is a solution to
the data mining task k into a symmetrical itemset σ(I)
having a cover σ(T) which is also a solution of the
task k. It also transforms each itemset which is not a
solution to the task k into a symmetrical itemset which
will not be a solution to the task k. For instance, if the task k concerns the frequent itemset mining problem, then by applying σ to a frequent itemset I we obtain a symmetrical frequent itemset σ(I). If I is not frequent, then σ(I) is not frequent either.
Example 2. Consider the transaction database defined in Table 1 of Example 1 and the permutation $\sigma$ = (Whisky, Vodka)(Cognac, Gin)(Ricard, Pastis)(Wine, Beer)(Shweps, Juice)(Pepsi, Coke), which is defined on the set of items $L$ of $D$. We can see that $\sigma(D) = D$; hence $\sigma$ is a symmetry of $D$.
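The claim $\sigma(D) = D$ of Example 2 can be checked mechanically. The sketch below assumes that a permutation is given as a list of disjoint transpositions and that the database is compared as a multiset of itemsets, ignoring transaction identifiers; the helper names are illustrative.

```python
from collections import Counter

# Table 1 (transaction identifier -> itemset).
D = {
    "001": set("Beer Wine Whisky Vodka Cognac Water".split()),
    "002": set("Beer Wine Whisky Vodka Gin Water".split()),
    "003": set("Beer Wine Whisky Water".split()),
    "004": set("Beer Wine Vodka Water".split()),
    "005": set("Ricard Coke Pepsi Water".split()),
    "006": set("Pastis Pepsi Coke Water".split()),
    "007": set("Shweps Orangina Pepsi".split()),
    "008": set("Shweps Orangina Coke".split()),
    "009": set("Juice Orangina Pepsi".split()),
    "010": set("Juice Orangina Coke".split()),
}
ITEMS = set().union(*D.values())

def make_permutation(transpositions, items):
    """Build a full item mapping from a list of disjoint transpositions."""
    sigma = {i: i for i in items}
    for a, b in transpositions:
        sigma[a], sigma[b] = b, a
    return sigma

def apply_to_itemset(sigma, itemset):
    return frozenset(sigma[i] for i in itemset)

def is_symmetry(sigma, db):
    """True when sigma maps the multiset of itemsets of db onto itself."""
    original = Counter(frozenset(items) for items in db.values())
    image = Counter(apply_to_itemset(sigma, items) for items in db.values())
    return image == original

sigma = make_permutation(
    [("Whisky", "Vodka"), ("Cognac", "Gin"), ("Ricard", "Pastis"),
     ("Wine", "Beer"), ("Shweps", "Juice"), ("Pepsi", "Coke")],
    ITEMS,
)
not_a_symmetry = make_permutation([("Beer", "Water")], ITEMS)
print(is_symmetry(sigma, D), is_symmetry(not_a_symmetry, D))
```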
Now, we give an important property which es-
tablishes a relationship between the symmetries of
a transaction database D and the Boolean encoding
CNF(k,D) of the data mining task k defined over D.
Proposition 2. Let D be a transaction database,
CNF(k,D) the Boolean encoding of the data mining
task k, σ a symmetry of D and J = (I,T) an interpre-
tation of CNF(k,D), then J is a model of CNF(k, D)
iff σ(J) is a model of CNF(k,D).
Proof. Let $\sigma$ be a symmetry of the transaction database $D$ and $J = (I, T)$ a model of the Boolean encoding $CNF(k, D)$. It follows that the corresponding pair of itemset and cover $(I', T')$ verifies the predicate $P^k_D$ of the data mining task $k$, that is, $P^k_D(I', T') = true$. We have to prove that $\sigma(J) = (\sigma(I), \sigma(T))$ is also a model of $CNF(k, D)$. The permutation $\sigma$ is a symmetry of $D$; thus, by Theorem 1, the pair $(\sigma(I'), \sigma(T'))$ verifies the predicate $P^k_D$, that is, $P^k_D(\sigma(I'), \sigma(T')) = true$. Therefore $\sigma(J)$ is also a model of $CNF(k, D)$, since the pair $(\sigma(I'), \sigma(T'))$ verifying the predicate $P^k_D$ is extracted from the interpretation $\sigma(J) = (\sigma(I), \sigma(T))$ of $CNF(k, D)$. The converse follows by applying the same argument to the symmetry $\sigma^{-1}$.
Remark 3. The previous proposition allows us to use the symmetries of a transaction database $D$ in its corresponding Boolean encoding $CNF(k, D)$ in order to detect symmetrical models and consider only one element in each symmetry equivalence class. This gives an important alternative for symmetry exploitation in constraint-based data mining methods. Indeed, we can just compute the symmetries of $D$ instead of computing those of its Boolean encoding $CNF(k, D)$, which could be time consuming. This could accelerate the symmetry detection, as the size of the transaction database $D$ is generally substantially smaller than the size of its corresponding Boolean encoding $CNF(k, D)$.
In Example 1, if we consider $\theta = 2$ and the symmetry $\sigma$ of Example 2, then there will be symmetrical frequent itemsets in $D$. For instance, both $I_1$ = {Beer, Wine, Whisky, Water} and $I_2$ = {Shweps, Orangina} are frequent itemsets in $D$. By the symmetry $\sigma$ we can deduce that $\sigma(I_1)$ = {Beer, Wine, Vodka, Water} and $\sigma(I_2)$ = {Juice, Orangina} are also frequent itemsets. These are what we call symmetrical frequent itemsets of $D$ or symmetrical models² of $CNF(k, D)$. A symmetry $\sigma$ transforms each frequent itemset (a model of the CNF encoding) into a frequent itemset and each non-frequent itemset (a no-good of the CNF encoding) into a non-frequent itemset. Symmetry elimination offers the advantage of enumerating only non-symmetrical patterns (like $I_1$ and $I_2$ here), which are considered the most pertinent to the user for understanding the data.
4 SYMMETRY DETECTION
The best-known technique to detect syntactic symmetries of CNF formulas in satisfiability consists in reducing the considered formula to a graph (Crawford et al., 1996; Aloul et al., 2002; Aloul et al., 2003b; Aloul et al., 2004) whose automorphism group is identical to the symmetry group of the original formula. We adapt the same approach here to detect the syntactic symmetries of a transaction database $D$. As is done in (Jabbour et al., 2012), we represent the database $D$ by a graph $G_D$ that we use to compute the symmetry group of $D$ by means of its automorphism group. Once this graph is built, we use a graph automorphism tool like Saucy (Aloul et al., 2002) to compute its automorphism group, which gives the symmetry group of $D$. We summarize below the construction of the graph which represents the transaction database $D$. Given a transaction database $D$, the associated colored graph $G_D(V, E)$ is defined as follows:
The set of colored vertices $V = L \cup \mathcal{T}$ is built as follows:
1. Each item $i \in L$ is represented by a vertex $i \in V$ of color 1 in $G_D(V, E)$.
2. Each transaction $t \in \mathcal{T}$ is represented by a vertex $t \in V$ of color 2 in $G_D(V, E)$.
The set of edges $E$ is defined by $E = \{(t, i) \mid D_{t,i} = 1\}$. That is, an edge connects each transaction vertex $t \in \mathcal{T}$ to each vertex representing an item that occurs in $t$.
² Here, we omitted the part $T$ of the model representing the cover of $I$.
Figure 1: The graph of the transaction database of Table 1.
Example 3. Consider the transaction database $D$ of Table 1 given in Example 1. Its corresponding graph $G_D(V, E)$ is shown in Figure 1. We can see for instance that the vertex permutation $\gamma$ = (Whisky, Vodka)(Cognac, Gin)(Ricard, Pastis)(Pepsi, Coke)(Shweps, Juice)(001, 002)(003, 004)(005, 006)(007, 010)(008, 009) is one of the automorphisms of $G_D(V, E)$. The restriction of the automorphism $\gamma$ to $L$ is the symmetry $\sigma$ = (Whisky, Vodka)(Cognac, Gin)(Ricard, Pastis)(Pepsi, Coke)(Shweps, Juice), that is, the symmetry used in Example 2 without the transposition (Wine, Beer).
An important property of the graph $G_D(V, E)$ is that it preserves the group of symmetries of $D$. That is, the symmetry group of $D$ is identical to the automorphism group of its graph representation $G_D(V, E)$; thus we can use a graph automorphism system like Saucy on $G_D(V, E)$ to detect the symmetry group of $D$. The graph automorphism system returns a set of generators $Gen$ of the symmetry group from which we can deduce each symmetry of $D$.
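The graph construction of this section and the automorphism of Example 3 can be sketched in a few lines. The code below does not call Saucy; it only builds the colored graph $G_D$ with plain Python sets and checks that a given vertex permutation preserves colors and edges. The helper names are assumptions made for the illustration.

```python
# Table 1 (transaction identifier -> itemset).
D = {
    "001": set("Beer Wine Whisky Vodka Cognac Water".split()),
    "002": set("Beer Wine Whisky Vodka Gin Water".split()),
    "003": set("Beer Wine Whisky Water".split()),
    "004": set("Beer Wine Vodka Water".split()),
    "005": set("Ricard Coke Pepsi Water".split()),
    "006": set("Pastis Pepsi Coke Water".split()),
    "007": set("Shweps Orangina Pepsi".split()),
    "008": set("Shweps Orangina Coke".split()),
    "009": set("Juice Orangina Pepsi".split()),
    "010": set("Juice Orangina Coke".split()),
}

def build_graph(db):
    """Return (color, edges) for the colored graph G_D(V, E)."""
    items = set().union(*db.values())
    color = {i: 1 for i in items}     # item vertices have color 1
    color.update({t: 2 for t in db})  # transaction vertices have color 2
    edges = {(t, i) for t, its in db.items() for i in its}
    return color, edges

def from_transpositions(pairs, vertices):
    """Vertex mapping defined by a list of disjoint transpositions."""
    gamma = {v: v for v in vertices}
    for a, b in pairs:
        gamma[a], gamma[b] = b, a
    return gamma

def is_automorphism(gamma, color, edges):
    bijective = set(gamma.values()) == set(color)
    keeps_colors = all(color[gamma[v]] == color[v] for v in color)
    keeps_edges = {(gamma[t], gamma[i]) for t, i in edges} == edges
    return bijective and keeps_colors and keeps_edges

color, edges = build_graph(D)
item_part = [("Whisky", "Vodka"), ("Cognac", "Gin"), ("Ricard", "Pastis"),
             ("Pepsi", "Coke"), ("Shweps", "Juice")]
transaction_part = [("001", "002"), ("003", "004"), ("005", "006"),
                    ("007", "010"), ("008", "009")]
gamma = from_transpositions(item_part + transaction_part, color)
items_only = from_transpositions(item_part, color)
print(is_automorphism(gamma, color, edges), is_automorphism(items_only, color, edges))
```

The second check fails because moving the items without moving the transactions does not preserve the edges; the transaction part of $\gamma$ is the permutation induced on the transactions.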
5 SYMMETRY ELIMINATION
Here we deal with the global symmetry which is present in the formulation of the given problem, which is represented by the transaction database $D$. Global symmetry can be eliminated in a static way in a preprocessing phase by just adding the symmetry-breaking predicates to the Boolean encoding $CNF(k, D)$ and using a SAT solver as a black box on the resulting CNF formula.
We shall compute the Lex-Leader Symmetry-Breaking Predicate (LL-SBP) induced by the automorphisms of $G_D$. More precisely, the group of automorphisms $Aut(G_D)$ of the graph $G_D$ (or the symmetry group $Sym(D)$ of $D$) induces an equivalence relation on the set of interpretations of $CNF(k, D)$. That is, an interpretation $I$ is equivalent to another interpretation $J$ of $CNF(k, D)$ if there exists a symmetry $\sigma$ of $D$ such that $J = \sigma(I)$. The symmetry-breaking predicates are chosen such that they are true for exactly one interpretation in each equivalence class (the least interpretation in the lex ordering). In general, we introduce an ordering on the variables $I_i$ corresponding to the items of $L$ and use it to construct a lexicographical order on the set of interpretations.
The construction of the symmetry-breaking predicate is based on the lex-leader method introduced by Crawford et al. (Crawford et al., 1996). Given a symmetry group $Sym(D) = \{\sigma_1, \sigma_2, \dots, \sigma_k\}$ of $D$ and a total ordering $I_1 < I_2 < \dots < I_n$ on the variables of $CNF(k, D)$ corresponding to the items of $L$, the partial lex-leader symmetry-breaking predicate (PLL-SBP) (Aloul et al., 2003a) that we have to add to $CNF(k, D)$ is expressed as follows:
$$PP(\sigma_l) = \bigwedge_{1 \leq i \leq n} \Big[\bigwedge_{1 \leq j \leq i-1} (I_j = I_j^{\sigma_l}) \rightarrow (I_i \leq I_i^{\sigma_l})\Big]$$
$$PLL\text{-}SBP(Sym(D)) = \bigwedge_{\sigma_l \in GEN(Sym(D))} PP(\sigma_l)$$
$PP(\sigma_l)$ is the permutation predicate corresponding to the symmetry generator $\sigma_l$, $I_i^{\sigma_l}$ is the image of the variable $I_i$ under $\sigma_l$, and the expression $(I_i \leq I_i^{\sigma_l})$ denotes the clause $(I_i \rightarrow I_i^{\sigma_l})$.
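The effect of the permutation predicates can be illustrated by evaluating them directly on assignments, without translating them to CNF. In the sketch below, a generator is assumed to be a list `sigma` where `sigma[i]` is the index of the image of variable `i` (indices start at 0), and the two generators chosen are hypothetical; they generate all permutations of three variables.

```python
from itertools import product

def permutation_predicate(sigma, assignment):
    """PP(sigma): for every i, if I_j equals its image for all j < i,
    then I_i must be less than or equal to its image."""
    n = len(assignment)
    for i in range(n):
        prefix_equal = all(assignment[j] == assignment[sigma[j]] for j in range(i))
        if prefix_equal and assignment[i] > assignment[sigma[i]]:
            return False
    return True

def pll_sbp(generators, assignment):
    """Conjunction of the permutation predicates of all generators."""
    return all(permutation_predicate(s, assignment) for s in generators)

# Hypothetical generators on three variables: the swaps (0 1) and (1 2).
generators = [[1, 0, 2], [0, 2, 1]]
survivors = [a for a in product([0, 1], repeat=3) if pll_sbp(generators, a)]
print(survivors)
```

With these two generators the predicate keeps the assignments whose values are non-decreasing along the variable ordering, that is, one assignment for each number of variables set to true.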
The LL-SBP is translated into a CNF formula of linear size by introducing auxiliary variables $e_j$ to represent the expressions $(I_j = I_j^{\sigma_l})$. For example, $e_j \leftrightarrow (I_j = I_j^{\sigma_l})$ gives rise to the following clauses:
$$(\neg I_j \vee \neg I_j^{\sigma_l} \vee e_j),\ (I_j \vee I_j^{\sigma_l} \vee e_j)$$
$$(\neg I_j \vee \neg e_j \vee I_j^{\sigma_l}),\ (I_j \vee \neg e_j \vee \neg I_j^{\sigma_l})$$
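The four clauses can be checked against the intended equivalence by a truth table. The sketch below assumes DIMACS-style literals (a positive integer for a variable, a negative integer for its negation); the function names and the variable numbering are illustrative.

```python
from itertools import product

def equality_clauses(x, y, e):
    """Clauses stating that e is true exactly when x and y have the same value."""
    return [[-x, -y, e], [x, y, e], [-x, -e, y], [x, -e, -y]]

def satisfies(clauses, assignment):
    """assignment maps each variable number to a Boolean."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

# Variables 1 and 2 stand for I_j and its image, variable 3 for e_j.
clauses = equality_clauses(1, 2, 3)
table = {
    (a, b, e): satisfies(clauses, {1: a, 2: b, 3: e})
    for a, b, e in product([False, True], repeat=3)
}
```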
Some optimizations, such as the ones studied in (Aloul et al., 2003a), could be applied to get a more compact CNF PLL-SBP.
6 EXPERIMENTS
In this section, we present an experimental analysis
of our symmetry breaking approach for SAT based
itemset mining.
6.1 Input Data-sets
We choose for our experiments two classes of data-sets:
Simulated data-sets: In this class, we use simulated data-sets, generated specifically to involve interesting symmetries. The data is available at http://www.cril.fr/decMining.
Public data-sets: The data-sets used in this class are well known in the data mining community and are available at https://dtai.cs.kuleuven.be/CP4IM/datasets/
6.2 The Experimented Methods
As we aim to enumerate all the frequent/closed itemsets on the SAT-based encoding, our experiments are conducted using MiniSAT-Enum, which is dedicated to the enumeration of all the models of a given CNF formula. MiniSAT-Enum is obtained from MiniSAT 2.2³ as follows: each time a model is found, a no-good (clause) is generated and added to the formula in order to avoid enumerating the same model again. MiniSAT-Enum takes as input a CNF formula and a set of item variables and returns the set of frequent/closed itemsets.
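The enumeration scheme described above can be sketched as follows. This is not the MiniSAT-Enum code: a naive exhaustive search stands in for the SAT solver, and the choice of building each no-good over the projection variables only (the item variables) is an assumption made for the illustration.

```python
from itertools import product

def solve(clauses, n_vars):
    """Naive stand-in for a SAT solver: return a model as a dict, or None."""
    for bits in product([False, True], repeat=n_vars):
        model = {v + 1: bits[v] for v in range(n_vars)}
        if all(any(model[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return model
    return None

def enumerate_models(clauses, n_vars, projection):
    """Enumerate the models projected on `projection`, blocking each one found."""
    clauses = [list(c) for c in clauses]
    found = []
    while True:
        model = solve(clauses, n_vars)
        if model is None:
            return found
        found.append({v: model[v] for v in projection})
        # No-good: at least one projected variable must take a different value.
        clauses.append([-v if model[v] else v for v in projection])

# Hypothetical formula (x1 or x2) over two variables.
all_models = enumerate_models([[1, 2]], 2, projection=[1, 2])
projected = enumerate_models([[1, 2]], 2, projection=[1])
```

Projecting on the item variables yields each itemset once, since in the encoding of Section 3 the transaction variables are determined by the item variables.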
The methods that we evaluated and compared are the following:
1. MiniSAT-Enum: search without symmetry breaking on the CNF encoding of the data mining task, CNF(k, D).
2. MiniSAT-Enum-SBP: search with symmetry breaking. This method generates the symmetry-breaking predicates in a pre-processing phase, then applies MiniSAT-Enum to the resulting CNF instance CNF(k, D) + PLL SBP. The CPU time of MiniSAT-Enum-SBP includes the time spent generating the PLL SBP.
3. MiniSAT-Enum-ISB: this method (Jabbour et al., 2012), called ItemPair symmetry breaking (ISB), eliminates symmetries in a pre-processing step by rewriting the transaction database D as a new database D′ in which symmetric items are eliminated. MiniSAT-Enum is then applied to the CNF formula CNF(k, D′) encoding the new transaction database.
³ MiniSAT: http://minisat.se/
KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
In our experiments, we exploit Saucy⁴, a new implementation of the Nauty system, to detect symmetries. It was originally proposed in (Aloul et al., 2002) and significantly improved in (Darga et al., 2008). The latest version of Saucy is reported to outperform earlier tools by many orders of magnitude, in some cases improving runtime from several days to a fraction of a second.
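To make concrete what the symmetry detection step computes, the following Python sketch enumerates item symmetries of a very small transaction database by brute force. It does not use Saucy or the colored-graph construction; it assumes that a symmetry is a permutation of the items that maps the multiset of transactions onto itself, and the function names are hypothetical. The search is factorial in the number of items, so it is only meant for toy examples.

```python
from collections import Counter
from itertools import permutations


def is_symmetry(transactions, sigma):
    """True if renaming items with sigma leaves the database unchanged."""
    image = [frozenset(sigma[item] for item in t) for t in transactions]
    return Counter(image) == Counter(frozenset(t) for t in transactions)


def item_symmetries(transactions):
    """List every item permutation (as a dict) that is a symmetry."""
    items = sorted({item for t in transactions for item in t})
    found = []
    for perm in permutations(items):
        sigma = dict(zip(items, perm))
        if is_symmetry(transactions, sigma):
            found.append(sigma)
    return found
```

On the database `[{"a", "b"}, {"a", "c"}]`, the only symmetries are the identity and the transposition of `b` and `c`.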
We are interested in the CPU time and in the number of models (closed/frequent itemsets) found with and without symmetry breaking. All the experimental results presented in this section were obtained on a cluster of quad-core Intel Xeon X5550 machines (2.66 GHz, 32 GB RAM).
6.3 The Obtained Results
In Figures 2 and 3, we present the results obtained on the simulated data-set dataset-gen-jss-5. The experiment compares MiniSAT-Enum (CFIM), MiniSAT-Enum-SBP (CFIM-SBP) and MiniSAT-Enum-ISB (CFIM-ISB) w.r.t. CPU time in seconds (Figure 2) and number of patterns (Figure 3). As we can see, breaking symmetries significantly reduces both the number of closed frequent itemsets (the output) and the CPU time. This reduction in the size of the output induces a significant reduction of the search time. Interestingly, breaking symmetries by adding the SBP to the CNF encoding of the itemset mining task (CFIM-SBP) is clearly better than eliminating symmetric items from the original transaction database (CFIM-ISB). This experiment shows that our approach breaks more symmetries than the one proposed in (Jabbour et al., 2012).
[Plot: CPU time (seconds) versus quorum on dataset-gen-jss-5, with one curve each for CFIM, CFIM-ISB and CFIM-SBP.]
Figure 2: Results on simulated data (Closed frequent itemsets): CPU time.
The second experiment is conducted on well-known academic datasets. In this experiment, we are interested in the frequent itemset mining problem. In Figures 4 and 5, we present the comparative results
⁴ Saucy2: Fast symmetry discovery - http://vlsicad.eecs.umich.edu/BK/SAUCY/
[Plot: number of patterns (#Patterns) versus quorum on dataset-gen-jss-5, with one curve each for CFIM, CFIM-ISB and CFIM-SBP.]
Figure 3: Results on simulated data (Closed frequent itemsets): number of patterns.
[Plot: CPU time (seconds) versus quorum on australian-credit, with one curve each for FIM-SBP, FIM-ISB and FIM.]
Figure 4: Results on public data - Australian - (frequent itemsets): CPU time.
[Plot: CPU time (seconds) versus quorum on mushroom, with one curve each for FIM-SBP, FIM-ISB and FIM.]
Figure 5: Results on public data - Mushroom - (frequent itemsets): CPU time.
w.r.t. the computation time. No reduction is observed in the number of frequent itemsets. On these datasets, most of the symmetries found involve items in the same transactions, which explains why these particular symmetries do not reduce the number of closed/frequent itemsets. However, even when the size of the output is not reduced, breaking symmetries with our approach significantly reduces the search space. In general, symmetry breaking dramatically reduces the search space and the corresponding CPU time for this declarative approach, but it does not reach the performance of optimized dedicated algorithms such as FPgrowth.
SymmetryBreakinginItemsetMining
93
7 RELATED WORKS
The purpose of eliminating symmetry in data mining tasks is, in general, to obtain a more compact output, to decrease the CPU time needed to generate it, or to handle new mining properties in order to find interesting frequent patterns. Several works on symmetry have been introduced in the field of data mining along these lines.
Symmetries in graph mining are studied by Desrosiers et al. (2007) and by Vanetik (2010). Graph mining is of great importance in many applications. In (Desrosiers et al., 2007), symmetry is exploited to prune the search space of sub-graph mining algorithms. In (Vanetik, 2010), by contrast, symmetry is used to find interesting frequent sub-graphs (those having limited diameter and high symmetry). Such graphs represent the most structurally important patterns across chemical, text and genetic data-sets. This technique also reduces the CPU time needed to find such graphs.
Murtagh and Contreras (2010) use symmetry as a powerful means of structuring and analyzing massive, high-dimensional data stores. They illustrate the power of hierarchical clustering in case studies in chemistry and finance.
Symmetry is also studied in transaction databases using zero-suppressed BDDs (Minato, 2006). These symmetries are very particular, since they are just transpositions of two items that leave the remaining items fixed. They are used to study the properties of symmetrical patterns. Such symmetries are also used in (Gély et al., 2005) to explain why, in some cases, the number of rules of a minimal cover of a relation is exponential in the number of items.
Two symmetry elimination approaches for frequent itemset mining are introduced in (Jabbour et al., 2012). They consist in rewriting the transaction database in a pre-processing phase by eliminating the symmetric counterparts of some items. These approaches are specific to the data mining task considered. They could be combined with our method for the itemset mining task. Another approach integrates dynamic symmetry elimination into an Apriori-like algorithm (Jabbour et al., 2013a) in order to prune the search space when enumerating all the frequent itemsets of a transaction database.
All of these methods are specific to the data mining task considered and to the target method used to solve it. They differ from the approach that we develop here, since our approach is generic and declarative. It will work with any data mining task that is expressed in a constraint language.
8 CONCLUSION
We studied in this work the notion of symmetry for data mining tasks expressed as declarative constraints. We showed how the symmetries of the given transaction database can be detected and eliminated by adding symmetry-breaking predicates to the constraint encoding of the considered data mining task. We showed that even though such symmetries may not be syntactic symmetries of the CNF encoding of the data mining problem, they preserve the set of its models (the set of interesting patterns). Detecting symmetry on the given transaction database rather than on the CNF encoding of the considered data mining task can save considerable effort in symmetry detection. Indeed, the transaction database is in general smaller than its corresponding CNF encoding. The transaction database is represented by a colored graph that is used to compute its symmetries. The symmetry group of the transaction database is identical to the automorphism group of the corresponding graph. The graph automorphism tool Saucy is therefore used on the obtained graph to detect the group of symmetries of the transaction database. These symmetries are eliminated statically by adding, in a pre-processing phase, the well-known lex-order symmetry-breaking predicates to the CNF encoding of the considered data mining task. We then applied a SAT model enumeration algorithm as a black box to the resulting encoding to solve the data mining problem.
The proposed symmetry breaking method was implemented and evaluated on a variety of transaction data-sets. The first experimental results confirm that eliminating symmetry is profitable for the considered data mining tasks.
As future work, we plan to eliminate symmetry in other data mining problems and to extend symmetry exploitation to the local symmetries that may exist at some nodes of the search tree. Both kinds of exploitation could be complementary, so one can naturally expect an advantage from combining them.
REFERENCES
Agrawal, R., Imieliński, T., and Swami, A. (1993). Mining association rules between sets of items in large
databases. In Proceedings of the 1993 ACM SIGMOD
International Conference on Management of Data,
SIGMOD ’93, pages 207–216, New York, NY, USA.
ACM.
Agrawal, R. and Srikant, R. (1994). Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB ’94, pages 487–499, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
Aloul, F. A., Markov, I. L., and Sakallah, K. A. (2003a).
Shatter: efficient symmetry-breaking for boolean sat-
isfiability. In DAC, pages 836–839. ACM.
Aloul, F. A., Ramani, A., Markov, I. L., and Sakallah, K. A.
(2002). Solving difficult SAT instances in the pres-
ence of symmetry. In Proceedings of the 39th Design
Automation Conference (DAC 2002), pages 731–736.
ACM Press.
Aloul, F. A., Ramani, A., Markov, I. L., and Sakallah, K. A.
(2003b). Solving difficult instances of boolean satis-
fiability in the presence of symmetry. IEEE Trans. on
CAD of Integrated Circuits and Systems, 22(9):1117–
1137.
Aloul, F. A., Ramani, A., Markov, I. L., and Sakallah, K. A. (2004). Symmetry breaking for pseudo-boolean satisfiability. In ASPDAC’04, pages 884–887.
Benhamou, B. (1994). Study of symmetry in constraint sat-
isfaction problems. PPCP’94, pages 246–254.
Benhamou, B. and Sais, L. (1992a). Theoretical study of
symmetries in propositional calculus and application.
In CADE’11, pages 281–294.
Benhamou, B. and Sais, L. (1992b). Theoretical study of
symmetries in propositional calculus and applications.
In CADE, pages 281–294.
Benhamou, B. and Sais, L. (1994a). Tractability through
symmetries in propositional calculus. In JAR, 12:89–
102.
Benhamou, B. and Sais, L. (1994b). Tractability through
symmetries in propositional calculus. J. Autom. Rea-
soning, 12(1):89–102.
Besson, J., Boulicaut, J.-F., Guns, T., and Nijssen, S.
(2010). Generalizing itemset mining in a con-
straint programming setting. In Inductive Databases
and Constraint-Based Data Mining, pages 107–126.
Springer.
Bonchi, F. and Lucchese, C. (2007). Extending the state-
of-the-art of constraint-based pattern discovery. Data
Knowl. Eng., 60(2):377–399.
Bucilă, C., Gehrke, J., Kifer, D., and White, W. (2003).
Dualminer: A dual-pruning algorithm for itemsets
with constraints. Data Mining and Knowledge Dis-
covery, 7(3):241–272.
Burdick, D., Calimlim, M., and Gehrke, J. (2001). Mafia: A
maximal frequent itemset algorithm for transactional
databases. In ICDE, pages 443–452.
Crawford, J., Ginsberg, M., Luks, E., and Roy, A. (1996).
Symmetry-breaking predicates for search problems.
In Knowledge Representation (KR), pages 148–159.
Morgan Kaufmann.
Darga, P. T., Sakallah, K. A., and Markov, I. L. (2008).
Faster symmetry discovery using sparsity of symme-
tries. In Proceedings of the 45th Annual Design Au-
tomation Conference, DAC ’08, pages 149–154, New
York, NY, USA. ACM.
Desrosiers, C., Galinier, P., Hansen, P., and Hertz, A.
(2007). Improving frequent subgraph mining in the
presence of symmetry. In MLG.
Freuder, E. (1991). Eliminating interchangeable values
in constraints satisfaction problems. AAAI-91, pages
227–233.
Grahne, G. and Zhu, J. (2005). Fast algorithms for frequent
itemset mining using fp-trees. IEEE Trans. on Knowl.
and Data Eng., 17(10):1347–1362.
Guns, T., Dries, A., Tack, G., Nijssen, S., and Raedt, L. D. (2013). MiningZinc: A modeling language for constraint-based mining. In International Joint Conference on Artificial Intelligence, Beijing, China.
Guns, T., Nijssen, S., and De Raedt, L. (2011a). Itemset
mining: A constraint programming perspective. Artif.
Intell., 175(12-13):1951–1983.
Guns, T., Nijssen, S., and de Raedt, L. (2011b). k-pattern set
mining under constraints. IEEE TKDE, 99(PrePrints).
Gély, A., Medina, R., Nourine, L., and Renaud, Y. (2005). Uncovering and reducing hidden combinatorics in Guigues-Duquenne bases. In Ganter, B. and Godin, R., editors, ICFCA, Lecture Notes in Computer Science, pages 235–248. Springer.
Han, J., Pei, J., and Yin, Y. (2000). Mining frequent pat-
terns without candidate generation. In Proceedings
of the 2000 ACM SIGMOD International Conference
on Management of Data, SIGMOD ’00, pages 1–12,
New York, NY, USA. ACM.
Henriques, R., Lynce, I., and Manquinho, V. M. (2012). On
when and how to use sat to mine frequent itemsets.
CoRR, abs/1207.6253.
Jabbour, S., Khiari, M., Sais, L., Salhi, Y., and Tabia, K.
(2013a). Symmetry-based pruning in itemset mining.
In 25th International Conference on Tools with Arti-
ficial Intelligence(ICTAI’13), Washington DC, USA.
IEEE Computer Society.
Jabbour, S., Sais, L., and Salhi, Y. (2013b). Boolean satisfi-
ability for sequence mining. In CIKM, pages 649–658.
Jabbour, S., Sais, L., and Salhi, Y. (2013c). Top-k fre-
quent closed itemset mining using top-k sat prob-
lem. In European Conference on Machine Learning
and Principles and Practice of Knowledge Discovery
in Databases (ECML/PKDD’13), volume 146, pages
131–140. Springer.
Jabbour, S., Sais, L., Salhi, Y., and Tabia, K. (2012). Symmetries in itemset mining. In 20th European Conference on Artificial Intelligence (ECAI ’12), pages 432–437. IOS Press.
Khiari, M., Boizumault, P., and Crémilleux, B. (2010). Constraint programming for mining n-ary patterns. In Cohen, D., editor, CP, volume 6308 of Lecture Notes in Computer Science, pages 552–567. Springer.
Krishnamurthy, B. (1985). Short proofs for tricky formulas.
Acta Inf., 22(3):253–275.
Métivier, J.-P., Boizumault, P., Crémilleux, B., Khiari, M.,
and Loudni, S. (2012). A constraint language for
declarative pattern discovery. In Proceedings of the
27th Annual ACM Symposium on Applied Computing,
SAC ’12, pages 119–125, New York, NY, USA. ACM.
Minato, S. I. (2006). Symmetric item set mining based on zero-suppressed BDDs. In Todorovski, L., Lavrac, N., and Jantke, K. P., editors, Discovery Science, volume 4265 of Lecture Notes in Computer Science, pages 321–326. Springer.
Minato, S. I., Uno, T., and Arimura, H. (2007). Fast gen-
eration of very large-scale frequent itemsets using a
compact graph-based representation.
Murtagh, F. and Contreras, P. (2010). Hierarchical
clustering for finding symmetries and other pat-
terns in massive, high dimensional datasets. CoRR,
abs/1005.2638.
Pei, J., Han, J., and Lakshmanan, L. V. S. (2004). Push-
ing convertible constraints in frequent itemset mining.
Data Min. Knowl. Discov., 8(3):227–252.
Puget, J. F. (1993). On the satisfiability of symmetrical constrained satisfaction problems. In Komorowski, J. and Ras, Z. W., editors, Proceedings of ISMIS’93, LNAI 689, pages 350–361.
Raedt, L. D., Guns, T., and Nijssen, S. (2008). Constraint programming for itemset mining. In KDD, pages 204–212.
Raedt, L. D., Guns, T., and Nijssen, S. (2010). Constraint
programming for data mining and machine learning.
In AAAI.
Tiwari, A., Gupta, R., and Agrawal, D. (2010). A survey on
frequent pattern mining: Current status and challeng-
ing issues. Inform. Technol. J, 9:1278–1293.
Tseitin, G. S. (1968). On the complexity of derivation in
propositional calculus. In Structures in the construc-
tive Mathematics and Mathematical logic, pages 115–
125. H.A.O Shsenko.
Uno, T., Asai, T., Uchida, Y., and Arimura, H. (2003). LCM: An efficient algorithm for enumerating frequent closed item sets. In Proceedings of Workshop on Frequent Itemset Mining Implementations (FIMI’03).
Uno, T., Kiyomi, M., and Arimura, H. (2004).
Lcm ver. 2: Efficient mining algorithms for fre-
quent/closed/maximal itemsets. In FIMI.
Vanetik, N. (2010). Mining graphs with constraints on
symmetry and diameter. In Shen, H. T., Pei, J., Özsu,
M. T., Zou, L., Lu, J., Ling, T.-W., Yu, G., Zhuang, Y.,
and Shao, J., editors, WAIM Workshops, volume 6185
of Lecture Notes in Computer Science, pages 1–12.
Springer.
Zaki, M. J. and Hsiao, C.-J. (2005). Efficient algorithms
for mining closed itemsets and their lattice structure.
IEEE Trans. on Knowl. and Data Eng., 17(4):462–
478.
KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval
96