Symmetry Breaking in Itemset Mining

Belaïd Benhamou (1,2), Saïd Jabbour, Lakhdar Sais and Yacoub Salhi

Aix-Marseille Université, Laboratoire des Sciences de l’information et des Systèmes (LSIS), Domaine Universitaire de Saint Jérôme, Avenue Escadrille Normandie Niemen, 13397 Marseille Cedex 20, France

Centre de Recherche en Informatique de Lens (CRIL), Université d’Artois, Rue Jean Souvenir, SP 18 F 62307 Lens Cedex, France

Keywords:

Data Mining, Itemset Mining, Symmetry, Satisfiability, Constraint Programming.

Abstract:

The concept of symmetry has been extensively studied in the fields of constraint programming and propositional satisfiability. Several methods for the detection and removal of symmetries have been developed, and their integration into known solvers of these domains has dramatically improved their effectiveness on a large variety of problems considered difficult to solve. The concept of symmetry may be exported to other domains where some structures can be exploited effectively, particularly data mining, where some tasks can be expressed as constraints. In this paper, we are interested in the detection and elimination of symmetries in the problem of finding frequent itemsets of a transaction database and its variants. Recent works have provided effective encodings of these data mining tasks as Boolean constraints, and some recent works on symmetry detection and elimination in itemset mining problems have been proposed. In this work we propose a generic framework that could be used to eliminate symmetries for data mining tasks expressed in a declarative constraint language. We show how symmetries between the items of the transactions are detected and eliminated by adding symmetry-breaking predicates (SBP) to the Boolean encoding of the data mining task.

1 INTRODUCTION

In this paper, we investigate the notion of symmetry elimination in Frequent Itemset Mining (FIM) (Agrawal et al., 1993). The itemset mining problem has several applications and remains central in the data mining research field. The best-known example is the one considered by large retail organizations, called basket data. A record of such data essentially contains the customer identification, the transaction date and the items bought by the customer. Advances in bar-code technology and the use of credit cards or frequent-customer cards now make it possible to collect and store large amounts of sales data. It is then important for retail firms to know the sets of items that are frequently bought by customers. This is the frequent itemset mining problem. Since its introduction in 1993 (Agrawal et al., 1993), several highly scalable algorithms have been introduced (Agrawal and Srikant, 1994; Han et al., 2000; Zaki and Hsiao, 2005; Uno et al., 2003; Uno et al., 2004; Burdick et al., 2001; Grahne and Zhu, 2005; Minato et al., 2007) to enumerate the sets of frequent items. The two challenging questions investigated in such algorithms are, on the one hand, how to compute all the frequent itemsets in a reasonable CPU time and, on the other hand, how to compact the output and reduce its size when there is a huge number of frequent itemsets. Many other data mining tasks exist, such as association rule mining, frequent pattern mining, clustering and episode mining, but almost all of them are closely related to itemset mining, which appears to be the canonical problem. Many efficient and scalable algorithms have been developed for targeted and specific mining tasks. As stated in (Tiwari et al., 2010), different methods for itemset mining have been proposed. They mainly differ from each other in the way they explore the search space, the data structures they use, and the way they exploit the anti-monotonicity property. The other important point is the size of the output of such algorithms. Some solutions have been found: for instance, one can enumerate only the closed, the maximal, the condensed, the preferred, or the discriminative itemsets instead of all the frequent itemsets.

The data mining community introduced the constraint-based mining framework in order to specify, in terms of constraints, the properties of the patterns to be mined (Bonchi and Lucchese, 2007; Bucilă et al., 2003; Pei et al., 2004; Besson et al., 2010). A wide variety of constraints have been successfully integrated and implemented in different specific data mining algorithms.

Benhamou B., Jabbour S., Sais L. and Salhi Y. Symmetry Breaking in Itemset Mining. DOI: 10.5220/0005078200860096. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2014), pages 86-96. ISBN: 978-989-758-048-2. © 2014 SCITEPRESS (Science and Technology Publications, Lda.)

Recently, De Raedt et al. (Raedt et al., 2008; Guns et al., 2011a) introduced the alternative of using constraint programming in data mining. They showed that such an alternative can be efficiently applied to a wide range of pattern mining problems. Most of the pattern mining constraints (e.g. frequency, closedness, maximality, and anti-monotonicity constraints) have been expressed in a declarative constraint programming language. The data mining problem is modeled as a constraint satisfaction problem (CSP) and a solver (e.g. Gecode) is then used to enumerate the solutions corresponding to the set of interesting patterns. A strong point here is that different constraints can be combined without the need to modify the solver, unlike in the existing specific data mining algorithms. Since the introduction of this declarative approach, there has been a growing interest in finding generic and declarative approaches to model and solve data mining tasks. For instance, several works expressed data mining problems as propositional satisfiability problems (Jabbour et al., 2013c; Henriques et al., 2012; Métivier et al., 2012; Khiari et al., 2010; Raedt et al., 2010; Jabbour et al., 2013b) and used efficient modern SAT solvers as black boxes to solve them. More recently, a declarative constraint framework for solving data mining tasks, called MiningZinc (Guns et al., 2013), has been introduced.

On the other hand, symmetry is by definition a multidisciplinary concept. It appears in many fields ranging from mathematics to Artificial Intelligence, chemistry and physics. It takes different forms and has different uses, even inside the same field. In general, it refers to a transformation which leaves an object (a figure, a molecule, a physical system, a formula or a constraint network...) invariant, that is, which does not modify its fundamental structure and/or its properties. For instance, rotating a chessboard by 180 degrees gives a board that is indistinguishable from the original one. Symmetry is a fundamental property that can be used to study these various objects, to finely analyze these complex systems or to reduce the computational complexity when dealing with combinatorial problems.

As far as we know, the principle of symmetry was first introduced by Krishnamurthy (Krishnamurthy, 1985) to improve resolution in propositional logic. Symmetries for Boolean constraints are studied in depth in (Benhamou and Sais, 1992a; Benhamou and Sais, 1994a). The authors showed how to detect them and proved that their exploitation is a real improvement for the efficiency of several automated deduction algorithms. Since then, many research works on symmetry have appeared. For instance, the static approach used by James Crawford et al. in (Crawford et al., 1996) for propositional logic theories consists in adding constraints expressing the global symmetry of the problem. This technique has been improved in (Aloul et al., 2003b) and extended to 0-1 Integer Linear Programming in (Aloul et al., 2004). The notion of interchangeability in Constraint Satisfaction Problems (CSPs) was introduced in (Freuder, 1991), and symmetry for CSPs was studied early on in (Puget, 1993; Benhamou, 1994).

In the context of constraint programming, Guns et al. (Guns et al., 2011b) used symmetry breaking constraints to impose a strict ordering on the patterns in k-pattern set mining. More recently, symmetry detection and elimination have been integrated in itemset mining problems (Jabbour et al., 2012; Jabbour et al., 2013a). Two different approaches are proposed. In the first one, symmetries are eliminated by rewriting the transaction database (eliminating items), while in the second approach the authors integrate symmetry elimination in Apriori-like algorithms. For other previous studies on symmetries in data mining, we refer the reader to the related work section.

The work that we investigate in this paper goes in this direction. It consists in detecting and eliminating symmetries in the itemset mining problem expressed as a Boolean satisfiability problem. We will show how the global symmetries of the given transaction database (the symmetries that are present in the initial formulation of the problem) are detected and expressed in terms of symmetry breaking predicates. Such predicates are added to the Boolean encoding of the itemset mining problem in a preprocessing step, and a SAT solver is used as a black box to enumerate the non-symmetrical solutions (the non-symmetrical frequent itemsets). In most data mining tasks, we usually need to enumerate interesting patterns, and this usually leads to an output of huge size. Eliminating symmetries might reduce the size of the output and lead to the discovery of the non-symmetrical patterns, which are the most important and the most representative of the knowledge.

The rest of the paper is organized as follows. In Section 2, we give some necessary background on the satisfiability problem, permutations and the itemset mining problem. We study the notion of symmetry in itemset mining represented as Boolean constraints in Section 3. In Section 4 we show how symmetries can be detected by means of graph automorphism. We show in Section 5 how this symmetry can be eliminated by adding symmetry breaking predicates to the Boolean encoding. Section 6 gives experiments on different data-sets to show the advantage of using symmetries in itemset mining. Section 7 investigates the related works and Section 8 concludes the work.

2 BACKGROUND

We summarize in this section some background on the satisfiability problem, permutations, and the itemset mining problem.

2.1 Propositional Satisfiability (SAT)

We shall assume that the reader is familiar with propositional logic. We give here a short description. Let V be the set of propositional variables, simply called variables. Variables will be distinguished from literals, which are variables with an assigned parity 1 or 0 that means True or False, respectively. This distinction will be ignored whenever it is convenient and not confusing. For a propositional variable p, there are two literals: p, the positive literal, and ¬p, the negative one.

A clause is a disjunction of literals such that no literal appears more than once, nor a literal and its negation at the same time. This clause is denoted by $p_1 \vee p_2 \vee \dots \vee p_n$. A formula F in conjunctive normal form (CNF) is a conjunction of clauses.

A truth assignment to a CNF F is a mapping ρ defined from the set of variables of F into the set {True, False}. If ρ[p] is the value for the positive literal p, then ρ[¬p] = ¬ρ[p]. The value of a clause $p_1 \vee p_2 \vee \dots \vee p_n$ in ρ is True if the value True is assigned to at least one of its literals in ρ, False otherwise. By convention, we define the value of the empty clause (n = 0) to be False. The value ρ[F] is True if the value of each clause of F is True, False otherwise. We say that a CNF formula F is satisfiable if there exists some truth assignment ρ that assigns the value True to F; it is unsatisfiable otherwise. In the first case, ρ is called a model of F. Let us remark that a CNF formula which contains the empty clause is unsatisfiable.
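The definitions above can be illustrated by a minimal Python sketch. The integer representation of literals (p for the positive literal, -p for the negative one) and the function names are assumptions made for this illustration only; they are not taken from the paper.

```python
from itertools import product

# Assumed representation: a literal is a nonzero integer (p or -p),
# a clause is a list of literals, a CNF formula is a list of clauses.

def clause_value(clause, rho):
    """True if at least one literal of the clause is True under rho.
    The empty clause evaluates to False, as in the convention above."""
    return any(rho[abs(lit)] == (lit > 0) for lit in clause)

def formula_value(cnf, rho):
    """True if every clause of the formula is True under rho."""
    return all(clause_value(clause, rho) for clause in cnf)

def models(cnf, variables):
    """Enumerate all models by brute force (only sensible for tiny formulas)."""
    for values in product([False, True], repeat=len(variables)):
        rho = dict(zip(variables, values))
        if formula_value(cnf, rho):
            yield rho

# (p1 OR p2) AND (NOT p1 OR p2)
cnf = [[1, 2], [-1, 2]]
found = list(models(cnf, [1, 2]))
```

Here `found` contains the two assignments in which p2 is True, and any formula containing the empty clause `[]` has no model.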

It is well-known (Tseitin, 1968) that for every propositional formula F there exists a formula F′ in conjunctive normal form (CNF) such that F′ is satisfiable iff F is satisfiable. In the following we will assume that the formulas are given in CNF.

2.2 Permutations

Let Ω = {1, 2, ..., N} for some integer N, where each integer might represent a propositional variable. A permutation of Ω is a bijective mapping σ from Ω to Ω that is usually represented as a product of cycles. We denote by Perm(Ω) the set of all permutations of Ω and by ◦ the composition of the permutations of Perm(Ω). The pair (Perm(Ω), ◦) forms the permutation group of Ω. That is, ◦ is closed and associative, the inverse of a permutation is a permutation, and the identity permutation is a neutral element. A pair (T, ◦) forms a sub-group of (S, ◦) iff T is a subset of S and forms a group under the operation ◦.

The orbit $\omega^{Perm(\Omega)}$ of an element ω of Ω on which the group Perm(Ω) acts is $\omega^{Perm(\Omega)} = \{\omega^{\sigma} \mid \omega^{\sigma} = \sigma(\omega), \sigma \in Perm(\Omega)\}$. A generating set of the group Perm(Ω) is a subset Gen of Perm(Ω) such that each element of Perm(Ω) can be written as a composition of elements of Gen. We write Perm(Ω) = <Gen>. An element of Gen is called a generator. The orbit of ω ∈ Ω can be computed by using only the set of generators Gen.
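As a hedged illustration of the last remark, the sketch below computes an orbit by repeatedly applying the generators until no new point appears. Representing a permutation as a Python dictionary and the helper name `from_cycles` are assumptions of this example.

```python
def from_cycles(points, cycles):
    """Build a permutation (dict point -> image) from a list of disjoint cycles."""
    perm = {p: p for p in points}
    for cycle in cycles:
        for a, b in zip(cycle, cycle[1:] + cycle[:1]):
            perm[a] = b
    return perm

def orbit(omega, generators):
    """Orbit of omega under the group generated by `generators`.
    For a finite group, closing under the generators alone is enough,
    since every inverse is a power of the corresponding generator."""
    seen = {omega}
    frontier = [omega]
    while frontier:
        point = frontier.pop()
        for g in generators:
            image = g[point]
            if image not in seen:
                seen.add(image)
                frontier.append(image)
    return seen

omega_set = [1, 2, 3, 4, 5]
gens = [from_cycles(omega_set, [(1, 2, 3)]), from_cycles(omega_set, [(4, 5)])]
```

With these two generators, `orbit(1, gens)` is `{1, 2, 3}` and `orbit(5, gens)` is `{4, 5}`.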

2.3 Frequent, Closed and Maximal Itemset Mining Problems

Let L = {0, ..., m − 1} be a set of m items and T = {0, ..., n − 1} a set of n transactions (transaction identifiers). A subset I ⊆ L is called an itemset, and a transaction t ∈ T over L is in fact a pair $(t_{id}, I)$ where $t_{id}$ is the transaction identifier and I the corresponding itemset. In the basket data example, $t_{id}$ represents the customer identification and I the set of items that the customer put in the basket (bought). Usually, when there is no confusion, a transaction is just expressed by its identifier. A transaction database D over L is a finite set of transactions such that no two different transactions have the same identifier. In the basket data example, such a data set expresses the different transactions made by customers. A transaction database can be seen as an n × m binary matrix, where n = |T| and m = |L|, with $D_{t,i} \in \{0,1\}$ for all t ∈ T and for all i ∈ L. More precisely, a transaction database is expressed by the set $D = \{(t, I) \mid t \in T, I \subseteq L, \forall i \in I : D_{t,i} = 1\}$. The coverage $C_D(I)$ of an itemset I in a transaction database D is the set of all transactions in which I occurs. That is, $C_D(I) = \{t \in T \mid \forall i \in I, D_{t,i} = 1\}$. The support $S_D(I)$ of an itemset I in D is the number $|C_D(I)|$ of transactions supporting I. It is just the cardinality of its coverage set. Moreover, the frequency $F_D(I)$ of I in D is defined by $F_D(I) = \frac{|C_D(I)|}{|D|}$.

Example 1. Consider the transaction database D made over the set of drink items L = {Beer, Wine, Whisky, Cognac, Vodka, Pastis, Ricard, Gin, Coke, Pepsi, Shweps, Juice, Water, Orangina}. For example, we can see in Table 1 that the itemset I = {Beer, Wine} has $C_D(I) = \{001, 002, 003, 004\}$, $S_D(I) = |C_D(I)| = 4$, and $F_D(I) = 0.4$.

Table 1: An instance of a transaction database.

tid  itemset
001  Beer, Wine, Whisky, Vodka, Cognac, Water
002  Beer, Wine, Whisky, Vodka, Gin, Water
003  Beer, Wine, Whisky, Water
004  Beer, Wine, Vodka, Water
005  Ricard, Coke, Pepsi, Water
006  Pastis, Pepsi, Coke, Water
007  Shweps, Orangina, Pepsi
008  Shweps, Orangina, Coke
009  Juice, Orangina, Pepsi
010  Juice, Orangina, Coke

Given a transaction database D over L and a minimal support threshold θ, an itemset I is said to be frequent if $S_D(I) \ge \theta$. I is a closed frequent itemset if, in addition to the frequency constraint, it satisfies the following constraint: for every itemset J such that I ⊂ J, $S_D(I) > S_D(J)$. I is said to be a maximal frequent itemset if, in addition to the frequency constraint, it satisfies the following constraint: for every itemset J such that I ⊂ J, $S_D(J) < \theta$. Closed and maximal itemsets are two known condensed representations of frequent itemsets. The data mining tasks we are dealing with in this work are defined as follows:

Definition 1.
1. The frequent itemset mining task consists in computing the following set: $\mathcal{FIM}_D(\theta) = \{I \subseteq L \mid S_D(I) \ge \theta\}$.
2. The closed frequent itemset mining task consists in computing the following set: $\mathcal{CLO}_D(\theta) = \{I \in \mathcal{FIM}_D(\theta) \mid \forall J \subseteq L, I \subset J, S_D(I) > S_D(J)\}$.
3. The maximal frequent itemset mining task consists in computing the following set: $\mathcal{MAX}_D(\theta) = \{I \in \mathcal{FIM}_D(\theta) \mid \forall J \subseteq L, I \subset J, S_D(J) < \theta\}$.

The anti-monotonicity property in itemset mining expresses the fact that all the subsets of a frequent itemset are also frequent itemsets. More precisely:

Proposition 1. (Anti-monotonicity) Let θ be a minimal support threshold. If the itemset I is such that $S_D(I) \ge \theta$, then $\forall J \subseteq I$, $S_D(J) \ge \theta$.

3 SYMMETRY IN BOOLEAN SATISFIABILITY BASED ITEMSET MINING

Constraint programming and satisfiability are two known declarative programming frameworks where users only have to specify the problem they want to solve, rather than specifying how to solve it. The frequent itemset mining task and some of its variants (closed, maximal, etc.) were first encoded in (Raedt et al., 2008; Guns et al., 2011a) as constraint programming tasks where a constraint solver could be used as a black box to solve them. Since then, other works (Jabbour et al., 2013c; Henriques et al., 2012; Métivier et al., 2012; Khiari et al., 2010; Raedt et al., 2010; Jabbour et al., 2013b) expressed the data mining tasks as satisfiability problems where the mining tasks are represented by propositional formulas that are translated into their conjunctive normal forms (CNF), which are given as inputs to a SAT solver. In this work we use the encoding proposed in (Jabbour et al., 2013c), which we augment with symmetry breaking predicates that are used to avoid enumerating the symmetrical models or the symmetrical no-goods of the resulting CNF encoding.

The general idea behind the CNF encoding of an itemset mining task defined on a transaction database D is to express each of its interpretations as a pair (I, T) where I represents an itemset and T its covering transaction subset in D. To do that, a Boolean variable $I_i$ is associated with each item i ∈ L and a variable $T_t$ is associated with each transaction t ∈ T. The itemset I is then defined by all the variables $I_i$ that are true. That is, $I_i = 1$ if i ∈ I, and $I_i = 0$ if i ∉ I. The set of transactions T covered by I is then defined by the set of variables $T_t$ that are true. That is, $T_t = 1$ if $t \in C_D(I)$ and $T_t = 0$ if $t \notin C_D(I)$.

For instance, the $\mathcal{FIM}_D(\theta)$ task can be seen as the search for the set of models $M = \{(I, T) \mid I \subseteq L, T \subseteq \mathcal{T}, T = C_D(I), |T| \ge \theta\}$. We have to encode both the covering constraint $T = C_D(I)$ and the frequency constraint $|T| \ge \theta$. These constraints are expressed by the following Boolean constraints:

$$\bigwedge_{t \in T} \Big( \neg T_t \leftarrow \bigvee_{i \in L,\, D_{t,i} = 0} I_i \Big)$$

$$\sum_{t \in T} T_t \ge \theta$$

The frequent closed itemset task is specified by adding to the two previous constraints the following constraints:

$$\bigwedge_{t \in T} \Big( \neg T_t \rightarrow \bigvee_{i \in L,\, D_{t,i} = 0} I_i \Big)$$

$$\bigwedge_{i \in L} \Big( \Big( \bigwedge_{t \in T} (T_t \rightarrow D_{t,i} = 1) \Big) \rightarrow I_i \Big)$$

The maximal frequent itemset mining task is specified by adding the following constraint:

$$\bigwedge_{i \in L} \Big( \Big( \sum_{t \in T} T_t \times D_{t,i} \ge \theta \Big) \rightarrow I_i \Big)$$

We denote by CNF(k, D) the CNF formula encoding the data mining task k over the transaction database D, where k refers to $\mathcal{FIM}_D(\theta)$, $\mathcal{CLO}_D(\theta)$ or $\mathcal{MAX}_D(\theta)$. We also denote by $P_k^D$ a predicate representing the task k in D. Then an itemset I ⊆ L having T ⊆ T as a cover verifies $P_k^D$ (that is, $P_k^D(I, T) = true$) if I is an itemset which is an answer to the data mining task k and T is its cover.

Remark 1. We recall that a model J of CNF(k, D) is a pair (I, T) where the part I expresses the itemset which is an answer to the considered task k and the part T encodes its cover. More precisely, each literal $I_i$ which is true in I represents the item i in the itemset I′ which is an answer to the task k, and each literal $T_t$ which is true in T represents the transaction t in T′ which is the corresponding cover of I′. In the sequel we denote by the pair (I′, T′) the itemset and its cover that are extracted from an interpretation J = (I, T) of CNF(k, D).

Symmetry is well studied in constraint programming and propositional satisfiability. Since Krishnamurthy’s symmetry definition (Krishnamurthy, 1985) and the one given in (Benhamou and Sais, 1992b; Benhamou and Sais, 1994b) for propositional logic, several other definitions have been given by the CP community.

Symmetry has already been defined in itemset mining (Jabbour et al., 2012; Jabbour et al., 2013a). We give in the following a similar definition and show how to eliminate such symmetry by means of symmetry breaking predicates that we add to the Boolean encoding in order to solve efficiently some data mining tasks like frequent, closed or maximal itemset mining.

Definition 2. Let D be a transaction database over a set of items L. A symmetry of D is a permutation σ defined on L such that σ(D) = D.

Remark 2. It is easy to see that a permutation σ on the set of items L induces a permutation $\sigma_T$ on the set of transactions T and a permutation $\sigma_D$ on the data-set D itself. We denote such permutations only by σ when there is no confusion.

A symmetry of D is an item permutation that

leaves D invariant. If we denote by Perm(L)

the group of permutations of L and by Sym(L) ⊂

Perm(L) the subset of permutations of L that are the

symmetries of D, then Sym(L) is trivially a sub-group

of Perm(L).

Theorem 1. Let σ be a symmetry of a transaction database D, I ⊆ L an itemset having a cover T ⊆ T, and $P_k^D$ the predicate expressing the data mining task k in D. Then $P_k^D(I, T) = true$ iff $P_k^D(\sigma(I), \sigma(T)) = true$.

Proof. It is easy to see that a symmetry of D verifies such a property. Indeed, if σ is a symmetry of D, then σ(D) = D, thus D and σ(D) have the same itemsets and covers satisfying the predicate $P_k^D$. Thus σ must transform each itemset I with a cover T verifying the predicate $P_k^D$ into an itemset σ(I) with a cover σ(T) verifying the predicate $P_k^D$.

In other words, the symmetry σ of D transforms each itemset I having a cover T which is a solution to the data mining task k into a symmetrical itemset σ(I) having a cover σ(T) which is also a solution to the task k. It also transforms each itemset which is not a solution to the task k into a symmetrical itemset which is not a solution to the task k. For instance, if the task k concerns the frequent itemset mining problem, then by applying σ to a frequent itemset I we obtain a symmetrical frequent itemset σ(I). If I is not frequent, then σ(I) is not frequent either.

Example 2. Consider the transaction database defined in Table 1 of Example 1 and the permutation σ = (Whisky, Vodka)(Cognac, Gin)(Ricard, Pastis)(Wine, Beer)(Shweps, Juice)(Pepsi, Coke), which is defined on the set of items L of D. We can see that σ(D) = D, hence σ is a symmetry of D.
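The claim σ(D) = D of Example 2 can be checked mechanically. The sketch below is an illustration under the assumption that D is compared as a collection of itemsets, the transaction identifiers being permuted accordingly (Remark 2); the helper names are hypothetical.

```python
# Table 1, transcribed as transaction identifier -> itemset.
D = {
    "001": {"Beer", "Wine", "Whisky", "Vodka", "Cognac", "Water"},
    "002": {"Beer", "Wine", "Whisky", "Vodka", "Gin", "Water"},
    "003": {"Beer", "Wine", "Whisky", "Water"},
    "004": {"Beer", "Wine", "Vodka", "Water"},
    "005": {"Ricard", "Coke", "Pepsi", "Water"},
    "006": {"Pastis", "Pepsi", "Coke", "Water"},
    "007": {"Shweps", "Orangina", "Pepsi"},
    "008": {"Shweps", "Orangina", "Coke"},
    "009": {"Juice", "Orangina", "Pepsi"},
    "010": {"Juice", "Orangina", "Coke"},
}

def swaps(*pairs):
    """Permutation (as a dict) made of disjoint transpositions."""
    sigma = {}
    for a, b in pairs:
        sigma[a], sigma[b] = b, a
    return sigma

def apply(sigma, itemset):
    """Image of an itemset; items not mentioned in sigma are fixed points."""
    return frozenset(sigma.get(i, i) for i in itemset)

def is_symmetry(sigma, db):
    """Check sigma(D) = D, comparing the itemsets of D as a multiset."""
    original = sorted(sorted(items) for items in db.values())
    image = sorted(sorted(apply(sigma, items)) for items in db.values())
    return original == image

def induced_transaction_permutation(sigma, db):
    """Permutation induced on the transaction identifiers.
    Assumes sigma is a symmetry and that all itemsets of db are distinct."""
    tid_of = {frozenset(items): t for t, items in db.items()}
    return {t: tid_of[apply(sigma, items)] for t, items in db.items()}

sigma = swaps(("Whisky", "Vodka"), ("Cognac", "Gin"), ("Ricard", "Pastis"),
              ("Wine", "Beer"), ("Shweps", "Juice"), ("Pepsi", "Coke"))
induced = induced_transaction_permutation(sigma, D)
```

On Table 1, `is_symmetry(sigma, D)` holds, and the induced permutation exchanges 001 with 002, 003 with 004, 005 with 006, 007 with 010 and 008 with 009. A permutation such as the single swap of Beer and Coke is rejected.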

Now, we give an important property which establishes a relationship between the symmetries of a transaction database D and the Boolean encoding CNF(k, D) of the data mining task k defined over D.

Proposition 2. Let D be a transaction database, CNF(k, D) the Boolean encoding of the data mining task k, σ a symmetry of D and J = (I, T) an interpretation of CNF(k, D). Then J is a model of CNF(k, D) iff σ(J) is a model of CNF(k, D).

Proof. Let σ be a symmetry of the transaction database D and J = (I, T) a model of the Boolean encoding CNF(k, D). It results that the corresponding pair of itemset and cover (I′, T′) verifies the predicate $P_k^D$ of the data mining task k, that is, $P_k^D(I', T') = true$. We have to prove that σ(J) = (σ(I), σ(T)) is also a model of CNF(k, D). The permutation σ is a symmetry of D, thus by Theorem 1 the pair (σ(I′), σ(T′)) verifies the predicate $P_k^D$, that is, $P_k^D(\sigma(I'), \sigma(T')) = true$. Therefore σ(J) is also a model of CNF(k, D), since the pair (σ(I′), σ(T′)) verifying the predicate $P_k^D$ is extracted from the model σ(J) = (σ(I), σ(T)) of CNF(k, D).

Remark 3. The previous proposition allows us to use the symmetries of a transaction database D in its corresponding Boolean encoding CNF(k, D) in order to detect symmetrical models and consider only one element in each symmetry equivalence class. This gives an important alternative for symmetry exploitation in constraint-based data mining methods. Indeed, we can just compute the symmetries of D instead of computing those of its Boolean encoding CNF(k, D), which could be time consuming. This could accelerate the symmetry detection, as the size of the transaction database D is generally substantially smaller than the size of its corresponding Boolean encoding CNF(k, D).

In Example 1, if we consider θ = 2 and the symmetry σ of Example 2, then there will be symmetrical frequent itemsets in D. For instance, both $I_1$ = {Beer, Wine, Whisky, Water} and $I_2$ = {Shweps, Orangina} are frequent itemsets in D. By the symmetry σ we can deduce that $\sigma(I_1)$ = {Beer, Wine, Vodka, Water} and $\sigma(I_2)$ = {Juice, Orangina} are also frequent itemsets. These are what we call symmetrical frequent itemsets of D or symmetrical models of CNF(k, D) (here, we omit the part T of the model representing the cover of I). A symmetry σ transforms each frequent itemset (a model of the CNF encoding) into a frequent itemset and each non-frequent itemset (a no-good of the CNF encoding) into a non-frequent itemset. Symmetry elimination offers the advantage of enumerating only non-symmetrical patterns (like $I_1$ and $I_2$ here), which are considered as the most pertinent to the user for understanding the data.

4 SYMMETRY DETECTION

The best-known technique to detect syntactic symmetries of CNF formulas in satisfiability consists in reducing the considered formula to a graph (Crawford et al., 1996; Aloul et al., 2002; Aloul et al., 2003b; Aloul et al., 2004) whose automorphism group is identical to the symmetry group of the original formula. We adapt the same approach here to detect the syntactic symmetries of a transaction database D. As is done in (Jabbour et al., 2012), we represent the database D by a graph $G_D$ that we use to compute the symmetry group of D by means of its automorphism group. When this graph is built, we use a graph automorphism tool like Saucy (Aloul et al., 2002) to compute its automorphism group, which gives the symmetry group of D. We summarize below the construction of the graph which represents the transaction database D. Given a transaction database D, the associated colored graph $G_D(V, E)$ is defined as follows:

• The set of colored vertices V = L ∪ T is built as follows:
1. Each item i ∈ L is represented by a vertex i ∈ V of color 1 in $G_D(V, E)$.
2. Each transaction t ∈ T is represented by a vertex t ∈ V of color 2 in $G_D(V, E)$.

• The set of edges E is defined by $E = \{(t, i) \mid D_{t,i} = 1\}$. That is, an edge connects each transaction vertex t ∈ T to each vertex representing an item contained in t.

Figure 1: The graph of the transaction database of Table 1. [Image not reproduced: the item vertices (Beer, Wine, Whisky, Vodka, Cognac, Gin, Ricard, Pastis, Coke, Pepsi, Shweps, Juice, Orangina, Water) are linked to the transaction vertices 001 to 010.]

Example 3. Consider the transaction database D of Table 1 given in Example 1. Its corresponding graph $G_D(V, E)$ is shown in Figure 1. We can see for instance that the vertex permutation γ = (Whisky, Vodka)(Cognac, Gin)(Ricard, Pastis)(Pepsi, Coke)(Shweps, Juice)(001, 002)(003, 004)(005, 006)(007, 010)(008, 009) is one of the automorphisms of $G_D(V, E)$. The restriction of the automorphism γ to L represents the symmetry σ = (Whisky, Vodka)(Cognac, Gin)(Ricard, Pastis)(Pepsi, Coke)(Shweps, Juice), which coincides with the symmetry used in Example 2 up to the cycle (Wine, Beer).

An important property of the graph $G_D(V, E)$ is that it preserves the group of symmetries of D. That is, the symmetry group of D is identical to the automorphism group of its graph representation $G_D(V, E)$, thus we could use a graph automorphism system like Saucy on $G_D(V, E)$ to detect the symmetry group of D. The graph automorphism system returns a set of generators Gen of the symmetry group, from which we can deduce each symmetry of D.
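The graph construction and the automorphism γ of Example 3 can be checked with a short sketch. It only verifies that a given vertex permutation preserves colors and edges; it does not search for generators as a tool such as Saucy does. The helper names are assumptions of this example.

```python
# Table 1, transcribed as transaction identifier -> itemset.
D = {
    "001": {"Beer", "Wine", "Whisky", "Vodka", "Cognac", "Water"},
    "002": {"Beer", "Wine", "Whisky", "Vodka", "Gin", "Water"},
    "003": {"Beer", "Wine", "Whisky", "Water"},
    "004": {"Beer", "Wine", "Vodka", "Water"},
    "005": {"Ricard", "Coke", "Pepsi", "Water"},
    "006": {"Pastis", "Pepsi", "Coke", "Water"},
    "007": {"Shweps", "Orangina", "Pepsi"},
    "008": {"Shweps", "Orangina", "Coke"},
    "009": {"Juice", "Orangina", "Pepsi"},
    "010": {"Juice", "Orangina", "Coke"},
}

# Colored bipartite graph: items have color 1, transactions have color 2.
edges = {(t, i) for t, items in D.items() for i in items}
color = {i: 1 for items in D.values() for i in items}
color.update({t: 2 for t in D})

def swaps(*pairs):
    """Vertex permutation (as a dict) made of disjoint transpositions."""
    gamma = {}
    for a, b in pairs:
        gamma[a], gamma[b] = b, a
    return gamma

def is_automorphism(gamma, edges, color):
    """True if the bijection gamma preserves vertex colors and the edge set."""
    def image(v):
        return gamma.get(v, v)
    if any(color[image(v)] != color[v] for v in color):
        return False
    return {(image(t), image(i)) for t, i in edges} == edges

gamma = swaps(("Whisky", "Vodka"), ("Cognac", "Gin"), ("Ricard", "Pastis"),
              ("Pepsi", "Coke"), ("Shweps", "Juice"),
              ("001", "002"), ("003", "004"), ("005", "006"),
              ("007", "010"), ("008", "009"))

# Restriction of gamma to the item vertices: the symmetry sigma of D.
sigma = {v: w for v, w in gamma.items() if color[v] == 1}
```

Here `is_automorphism(gamma, edges, color)` holds, while a permutation that only exchanges 001 and 003 is rejected because transaction 003 does not contain Vodka.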

5 SYMMETRY ELIMINATION

Here we deal with the global symmetry which is present in the formulation of the given problem, represented by the transaction database D. Global symmetry can be eliminated statically, in a preprocessing phase, by just adding the symmetry breaking predicates to the Boolean encoding CNF(k, D) and using a SAT solver as a black box on the resulting CNF formula.

We shall compute the Lex-Leader Symmetry Breaking Predicate (LL-SBP) induced by the automorphisms of $G_D$. More precisely, the group of automorphisms $Aut(G_D)$ of the graph $G_D$ (or the symmetry group Sym(D) of D) induces an equivalence relation on the set of interpretations of CNF(k, D). That is, an interpretation I is equivalent to another interpretation J of CNF(k, D) if there exists a symmetry σ of D such that J = σ(I). The symmetry breaking predicates are chosen such that they are true for exactly one interpretation in each equivalence class (the least interpretation in the lex ordering). In general, we introduce an ordering on the variables $I_i$ corresponding to the items of L and use it to construct a lexicographical order on the set of interpretations.

The construction of the symmetry-breaking predicate is based on the lex-leader method introduced by Crawford et al. (Crawford et al., 1996). Given a symmetry group $Sym(D) = \{\sigma_1, \sigma_2, \dots, \sigma_r\}$ of D and a total ordering $I_1 < I_2 < \dots < I_n$ on the variables of CNF(k, D) corresponding to the items of L (n denotes here the number of ordered item variables), the partial lex-leader symmetry-breaking predicate (PLL-SBP) (Aloul et al., 2003a) that we have to add to CNF(k, D) is expressed as follows:

$$PP(\sigma_l) = \bigwedge_{1 \le i \le n} \Big[ \bigwedge_{1 \le j \le i-1} (I_j = I_j^{\sigma_l}) \rightarrow (I_i \le I_i^{\sigma_l}) \Big]$$

$$PLL\text{-}SBP(Sym(D)) = \bigwedge_{\sigma_l \in GEN(Sym(D))} PP(\sigma_l)$$

$PP(\sigma_l)$ is the permutation predicate corresponding to the symmetry generator $\sigma_l$, and the expression $(I_i \le I_i^{\sigma_l})$ denotes the clause $(I_i \rightarrow I_i^{\sigma_l})$.
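To make the permutation predicate more concrete, the following sketch evaluates PP(σ) directly on a 0/1 assignment and compares it with a plain lexicographic test. It is an illustration only: variables are indexed from 0 (the text counts from 1), `sigma[i]` is assumed to give the index of the image of variable i, and no CNF is produced.

```python
from itertools import product

def pp_holds(x, sigma):
    """Evaluate PP(sigma) on the assignment x (tuple of 0/1 values).
    For each i: if x_j equals its image for all j < i, then x_i <= image of x_i."""
    for i in range(len(x)):
        prefix_equal = all(x[j] == x[sigma[j]] for j in range(i))
        if prefix_equal and not (x[i] <= x[sigma[i]]):
            return False
    return True

def image(x, sigma):
    """Assignment read through sigma: position i receives the value of x at sigma[i]."""
    return tuple(x[sigma[i]] for i in range(len(x)))

# Three item variables and the generator that swaps the first two.
sigma = [1, 0, 2]
kept = [x for x in product((0, 1), repeat=3) if pp_holds(x, sigma)]
```

With this generator, PP(σ) keeps the 6 assignments in which the first variable is not greater than the second, that is, exactly one assignment per orbit of the two-element group generated by the swap. More generally, `pp_holds(x, sigma)` agrees with the lexicographic comparison `x <= image(x, sigma)`.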

The LL-SBP is translated to a linear-size CNF formula by introducing auxiliary variables $e_j$ to represent the expressions $(I_j = I_j^{\sigma_l})$. For example, $e_j \leftrightarrow (I_j = I_j^{\sigma_l})$ gives rise to the following clauses:

$$(\neg I_j \vee \neg I_j^{\sigma_l} \vee e_j),\ (I_j \vee I_j^{\sigma_l} \vee e_j)$$

$$(\neg I_j \vee \neg e_j \vee I_j^{\sigma_l}),\ (I_j \vee \neg e_j \vee \neg I_j^{\sigma_l})$$

Some optimizations, such as the ones studied in (Aloul et al., 2003a), could be applied to get a more compact CNF PLL-SBP.
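The four clauses defining an auxiliary variable can be checked exhaustively. In this sketch the integer encoding of literals and the function names are assumptions; variables 1, 2 and 3 stand for an item variable, its image under the generator, and the auxiliary variable.

```python
from itertools import product

def equality_clauses(x, y, e):
    """Clauses encoding e <-> (x = y), with literals written as signed integers."""
    return [[-x, -y, e], [x, y, e], [-x, -e, y], [x, -e, -y]]

def satisfied(clauses, rho):
    """True if every clause has at least one literal made True by rho."""
    return all(any(rho[abs(lit)] == (lit > 0) for lit in clause)
               for clause in clauses)

clauses = equality_clauses(1, 2, 3)
models = [values for values in product([False, True], repeat=3)
          if satisfied(clauses, dict(zip((1, 2, 3), values)))]
```

The list `models` contains exactly the four assignments (x, y, e) in which e is True precisely when x and y take the same value.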

6 EXPERIMENTS

In this section, we present an experimental analysis

of our symmetry breaking approach for SAT based

itemset mining.

6.1 Input Data-sets

We choose for our experiments two classes of data-sets:

• Simulated data-sets: In this class, we use simulated data-sets, generated specifically to involve interesting symmetries. The data is available at http://www.cril.fr/decMining.

• Public data-sets: The data-sets used in this class are well known in the data mining community and are available at https://dtai.cs.kuleuven.be/CP4IM/datasets/

6.2 The Experimented Methods

As we aim to enumerate all the frequent/closed itemsets on the SAT-based encoding, our experiments are conducted using MiniSAT-Enum, which is dedicated to the enumeration of all models of a given CNF formula. MiniSAT-Enum is obtained from MiniSAT 2.2 as follows: each time a model is found, a no-good (clause) is generated and added to the formula in order to avoid enumerating the same model again. MiniSAT-Enum takes as input a CNF formula and a set of item variables and returns the set of frequent/closed itemsets.
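The enumeration loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not MiniSAT-Enum itself: the small recursive solver stands in for MiniSAT, and the names solve and enumerate_projected_models are hypothetical. Clauses are lists of DIMACS-style integer literals, and the no-good is built over the item variables only, so each itemset is reported once.

```python
def solve(clauses, assignment=None):
    """Tiny DPLL-style search (stand-in for a real SAT solver).

    Returns a dict var -> bool satisfying all clauses, or None.
    Variables absent from the returned dict are unconstrained.
    """
    assignment = dict(assignment or {})
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue  # clause already satisfied
        rest = [l for l in clause if abs(l) not in assignment]
        if not rest:
            return None  # clause falsified
        simplified.append(rest)
    if not simplified:
        return assignment
    lit = simplified[0][0]
    for value in (lit > 0, lit < 0):
        result = solve(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None


def enumerate_projected_models(clauses, item_vars):
    """Enumerate models projected on item_vars using no-good clauses."""
    clauses = [list(c) for c in clauses]
    found = []
    while True:
        model = solve(clauses)
        if model is None:
            return found
        # Unconstrained item variables are completed with False.
        found.append(frozenset(v for v in item_vars if model.get(v, False)))
        # No-good: the next model must differ on at least one item variable.
        clauses.append([-v if model.get(v, False) else v for v in item_vars])


if __name__ == "__main__":
    # Variables 1 and 2 play the role of items; 3 is an auxiliary variable.
    for itemset in enumerate_projected_models([[1, 2], [3, -1]], [1, 2]):
        print(sorted(itemset))
```

Restricting the no-good to the item variables is a design choice of this sketch: two models that differ only on auxiliary variables describe the same itemset and are therefore blocked together.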

The methods that we experimented with and compared are the following:

1. MiniSAT-Enum: search without symmetry breaking on the CNF encoding of the data mining task CNF(k, D).

2. MiniSAT-Enum-SBP: search with symmetry breaking. This method generates the symmetry-breaking predicates in a pre-processing phase, then applies MiniSAT-Enum to the resulting CNF instance CNF(k,D) + PLL-SBP. The CPU time of MiniSAT-Enum-SBP includes the time spent generating the PLL-SBP.

3. MiniSAT-Enum-ISB: this method (Jabbour et al., 2012), called ItemPair symmetry breaking (ISB), eliminates symmetries in a preprocessing step: it rewrites the transaction database D as D′ by eliminating symmetric items. MiniSAT-Enum is then applied to the CNF formula CNF(k, D′) encoding the new transaction database.

MiniSAT: http://minisat.se/

KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

In our experiments, we exploit Saucy, a new implementation of the Nauty system. It was originally proposed in (Aloul et al., 2002) and significantly improved in (Darga et al., 2008). The latest version of Saucy outperforms all the existing tools by many orders of magnitude, in some cases improving runtime from several days to a fraction of a second.
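To make concrete what such a tool computes, here is a minimal brute-force sketch. It assumes that a symmetry of a transaction database is a permutation of the items that maps the multiset of transactions onto itself; this definition, as stated here, and the function name item_symmetries are assumptions for illustration. The exhaustive search over all permutations is only usable on tiny examples and is not how Saucy works.

```python
from collections import Counter
from itertools import permutations


def item_symmetries(transactions):
    """Return all item permutations that leave the database unchanged.

    transactions is an iterable of item collections; the database is
    compared as a multiset of itemsets (an assumption of this sketch).
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    database = Counter(transactions)
    symmetries = []
    for image in permutations(items):
        sigma = dict(zip(items, image))
        # Apply sigma to every item of every transaction.
        mapped = Counter(frozenset(sigma[i] for i in t) for t in transactions)
        if mapped == database:
            symmetries.append(sigma)
    return symmetries


if __name__ == "__main__":
    db = [{"a", "b"}, {"a", "c"}]
    for sigma in item_symmetries(db):
        print(sigma)
```

On the two-transaction example, only the identity and the exchange of b and c preserve the database, since a is the only item occurring in both transactions.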

We are interested in the CPU time and in the number of models or closed/frequent itemsets found with and without symmetry breaking. All the experimental results presented in this section have been obtained on a Quad-core Intel Xeon X5550 (2.66GHz, 32 GB RAM) cluster.

6.3 The Obtained Results

In Figures 2 and 3, we present the results obtained on the simulated data-set dataset-gen-jss-5. The experiment compares MiniSAT-Enum (CFIM), MiniSAT-Enum-SBP (CFIM-SBP) and MiniSAT-Enum-ISB (CFIM-ISB) w.r.t. CPU time in seconds (Figure 2) and number of patterns (Figure 3). As we can see, by breaking symmetries, we significantly reduce both the number of closed frequent itemsets (output) and the CPU time. Such a reduction of the size of the output induces a significant reduction of the search time. Interestingly, breaking symmetries by adding the SBP to the CNF encoding of the itemset mining task (CFIM-SBP) is clearly better than eliminating symmetric items from the original transaction database (CFIM-ISB). This experiment shows that our approach breaks more symmetries than the one proposed in (Jabbour et al., 2012).

Figure 2: Results on simulated data (Closed frequent itemsets): CPU time. [Plot: time (seconds) as a function of the quorum on dataset-gen-jss-5 for CFIM, CFIM-ISB and CFIM-SBP.]

Figure 3: Results on simulated data (Closed frequent itemsets): number of patterns. [Plot: #Patterns as a function of the quorum on dataset-gen-jss-5 for CFIM, CFIM-ISB and CFIM-SBP.]

The second experiment is conducted on well-known academic datasets. In this experiment, we are interested in the frequent itemset mining problem. In Figures 4 and 5, we present the comparative results w.r.t. the computation time. No reduction is observed in the number of frequent itemsets. On these datasets, most of the symmetries found involve items in the same transactions. This explains why these particular symmetries do not reduce the number of closed/frequent itemsets. However, even when the size of the output is not reduced, breaking symmetries using our approach significantly reduces the search space. In general, symmetry breaking dramatically reduces the search space and the corresponding CPU time for this declarative approach, but it does not reach the performance of optimized dedicated algorithms such as FPgrowth.

Figure 4: Results on public data - Australian - (frequent itemsets): CPU time. [Plot: time (seconds) as a function of the quorum on australian-credit for FIM-SBP, FIM-ISB and FIM.]

Figure 5: Results on public data - Mushroom - (frequent itemsets): CPU time. [Plot: time (seconds) as a function of the quorum on mushroom for FIM-SBP, FIM-ISB and FIM.]

Saucy2: Fast symmetry discovery - http://vlsicad.eecs.umich.edu/BK/SAUCY/

SymmetryBreakinginItemsetMining

7 RELATED WORKS

The purpose of eliminating symmetry in data mining tasks is in general either to obtain a more compact output, to decrease the CPU time needed to generate it, or to handle new mining properties in order to find interesting frequent patterns. Several works on symmetry in the field of data mining follow this direction.

Symmetries in graph mining are studied in Desrosiers et al. (Desrosiers et al., 2007) and in Vanetik (Vanetik, 2010). The area of graph mining is of great importance in many applications. In Desrosiers et al. (Desrosiers et al., 2007), symmetry is exploited to prune the search space of sub-graph mining algorithms. In Vanetik (Vanetik, 2010), by contrast, symmetry is used to find interesting frequent sub-graphs (those having limited diameter and high symmetry). Such graphs represent the more structurally important patterns in chemical, text and genetic data-sets. This technique also reduces the CPU time needed to find such graphs.

Murtagh and Contreras (Murtagh and Contreras, 2010) used symmetry as a powerful means of structuring and analyzing massive, high dimensional data stores. They illustrate the power of hierarchical clustering in case studies in chemistry and finance.

Symmetry is also studied in transaction databases using Zero-BDDs (Minato, 2006). These symmetries are very particular, since they are just transpositions of two items that act as the identity on the remaining items. Such symmetries are used to study the properties of symmetrical patterns. They are also used in (Gély et al., 2005) to explain, in some cases, why the number of rules of a minimal cover of a relation is exponential in the number of items.

Two symmetry elimination approaches for frequent itemset mining are introduced in (Jabbour et al., 2012). They consist in rewriting the transaction database in a pre-processing phase by eliminating the symmetrical counterparts of some items. These approaches are specific to the data mining task considered. They could be combined with our method for the itemset mining task. Another approach integrates dynamic symmetry elimination into an Apriori-like algorithm (Jabbour et al., 2013a) in order to prune the search space when enumerating all the frequent itemsets of a transaction database.

All of these methods are specific to the data mining task considered and to the method used to solve it. They differ from the approach that we develop here, since our approach is generic and declarative: it works with any data mining task that is expressed in a constraint language.

8 CONCLUSION

We studied in this work the notion of symmetry for data mining tasks expressed as declarative constraints. We showed how the symmetries of the given transaction database can be detected and eliminated by adding symmetry-breaking predicates to the constraint encoding of the considered data mining task. We showed that even though such symmetries may not be syntactic symmetries of the CNF encoding of the data mining problem, they preserve the set of its models (the set of interesting patterns). Detecting symmetry on the given transaction database rather than on the CNF encoding of the considered data mining task could save considerable effort in symmetry detection. Indeed, the transaction database is in general smaller than its corresponding CNF encoding. The transaction database is represented by a colored graph that is used to compute its symmetries. The symmetry group of the transaction database is identical to the automorphism group of the corresponding graph. The graph automorphism tool Saucy is naturally used on the obtained graph to detect the group of symmetries of the transaction database. These symmetries are eliminated statically by adding, in a pre-processing phase, the well-known lex order symmetry-breaking predicates to the CNF encoding of the considered data mining task. We then applied a SAT model enumeration algorithm, as a black box, to the resulting encoding to solve the data mining problem.

The proposed symmetry-breaking method has been implemented and evaluated on a variety of transaction data-sets. The first experimental results confirm that eliminating symmetry is profitable for the considered data mining tasks.

As future work, we plan to eliminate symmetry in other data mining problems and to extend symmetry exploitation to the local symmetries that could exist at some nodes of the search tree. Both kinds of exploitation could be complementary, so one can naturally consider the advantage of combining them.

REFERENCES

Agrawal, R., Imieliński, T., and Swami, A. (1993). Min-

ing association rules between sets of items in large

databases. In Proceedings of the 1993 ACM SIGMOD

International Conference on Management of Data,

SIGMOD ’93, pages 207–216, New York, NY, USA.

ACM.

Agrawal, R. and Srikant, R. (1994). Fast algorithms for

mining association rules in large databases. In Pro-

KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

ceedings of the 20th International Conference on Very

Large Data Bases, VLDB ’94, pages 487–499, San

Francisco, CA, USA. Morgan Kaufmann Publishers

Inc.

Aloul, F. A., Markov, I. L., and Sakallah, K. A. (2003a).

Shatter: efﬁcient symmetry-breaking for boolean sat-

isﬁability. In DAC, pages 836–839. ACM.

Aloul, F. A., Ramani, A., Markov, I. L., and Sakallah, K. A.

(2002). Solving difﬁcult SAT instances in the pres-

ence of symmetry. In Proceedings of the 39th Design

Automation Conference (DAC 2002), pages 731–736.

ACM Press.

Aloul, F. A., Ramani, A., Markov, I. L., and Sakallah, K. A.

(2003b). Solving difﬁcult instances of boolean satis-

ﬁability in the presence of symmetry. IEEE Trans. on

CAD of Integrated Circuits and Systems, 22(9):1117–

1137.

Aloul, F. A., Ramani, A., Markov, I. L., and Sakallah, K. A.

(2004). Symmetry breaking for pseudo-boolean satis-

fiability. In ASPDAC’04, pages 884–887.

Benhamou, B. (1994). Study of symmetry in constraint sat-

isfaction problems. PPCP’94, pages 246–254.

Benhamou, B. and Sais, L. (1992a). Theoretical study of

symmetries in propositional calculus and application.

In CADE’11, pages 281–294.

Benhamou, B. and Sais, L. (1992b). Theoretical study of

symmetries in propositional calculus and applications.

In CADE, pages 281–294.

Benhamou, B. and Sais, L. (1994a). Tractability through

symmetries in propositional calculus. In JAR, 12:89–

102.

Benhamou, B. and Sais, L. (1994b). Tractability through

symmetries in propositional calculus. J. Autom. Rea-

soning, 12(1):89–102.

Besson, J., Boulicaut, J.-F., Guns, T., and Nijssen, S.

(2010). Generalizing itemset mining in a con-

straint programming setting. In Inductive Databases

and Constraint-Based Data Mining, pages 107–126.

Springer.

Bonchi, F. and Lucchese, C. (2007). Extending the state-

of-the-art of constraint-based pattern discovery. Data

Knowl. Eng., 60(2):377–399.

Bucilă, C., Gehrke, J., Kifer, D., and White, W. (2003).

Dualminer: A dual-pruning algorithm for itemsets

with constraints. Data Mining and Knowledge Dis-

covery, 7(3):241–272.

Burdick, D., Calimlim, M., and Gehrke, J. (2001). Maﬁa: A

maximal frequent itemset algorithm for transactional

databases. In ICDE, pages 443–452.

Crawford, J., Ginsberg, M., Luks, E., and Roy, A. (1996).

Symmetry-breaking predicates for search problems.

In Knowledge Representation (KR), pages 148–159.

Morgan Kaufmann.

Darga, P. T., Sakallah, K. A., and Markov, I. L. (2008).

Faster symmetry discovery using sparsity of symme-

tries. In Proceedings of the 45th Annual Design Au-

tomation Conference, DAC ’08, pages 149–154, New

York, NY, USA. ACM.

Desrosiers, C., Galinier, P., Hansen, P., and Hertz, A.

(2007). Improving frequent subgraph mining in the

presence of symmetry. In MLG.

Freuder, E. (1991). Eliminating interchangeable values

in constraints satisfaction problems. AAAI-91, pages

227–233.

Grahne, G. and Zhu, J. (2005). Fast algorithms for frequent

itemset mining using fp-trees. IEEE Trans. on Knowl.

and Data Eng., 17(10):1347–1362.

Guns, T., Dries, A., Tack, G., Nijssen, S., and Raedt,

L. D. (2013). Miningzinc: A modeling language for

constraint-based mining. In International Joint Con-

ference on Artificial Intelligence, Beijing,

China.

Guns, T., Nijssen, S., and De Raedt, L. (2011a). Itemset

mining: A constraint programming perspective. Artif.

Intell., 175(12-13):1951–1983.

Guns, T., Nijssen, S., and de Raedt, L. (2011b). k-pattern set

mining under constraints. IEEE TKDE, 99(PrePrints).

Gély, A., Medina, R., Nourine, L., and Renaud, Y. (2005).

Uncovering and reducing hidden combinatorics in

guigues-duquenne bases. In Ganter, B. and Godin, R.,

editors, ICFCA, Lecture Notes in Computer Science,

pages 235–248. Springer.

Han, J., Pei, J., and Yin, Y. (2000). Mining frequent pat-

terns without candidate generation. In Proceedings

of the 2000 ACM SIGMOD International Conference

on Management of Data, SIGMOD ’00, pages 1–12,

New York, NY, USA. ACM.

Henriques, R., Lynce, I., and Manquinho, V. M. (2012). On

when and how to use sat to mine frequent itemsets.

CoRR, abs/1207.6253.

Jabbour, S., Khiari, M., Sais, L., Salhi, Y., and Tabia, K.

(2013a). Symmetry-based pruning in itemset mining.

In 25th International Conference on Tools with Arti-

ﬁcial Intelligence(ICTAI’13), Washington DC, USA.

IEEE Computer Society.

Jabbour, S., Sais, L., and Salhi, Y. (2013b). Boolean satisﬁ-

ability for sequence mining. In CIKM, pages 649–658.

Jabbour, S., Sais, L., and Salhi, Y. (2013c). Top-k fre-

quent closed itemset mining using top-k sat prob-

lem. In European Conference on Machine Learning

and Principles and Practice of Knowledge Discovery

in Databases (ECML/PKDD’13), volume 146, pages

131–140. Springer.

Jabbour, S., Sais, L., Salhi, Y., and Tabia, K. (2012). Sym-

metries in itemset mining. In 20th European Confer-

ence on Artiﬁcial Intelligence (ECAI ’12), pages 432–

437. IOS Press.

Khiari, M., Boizumault, P., and Crémilleux, B. (2010). Con-

straint programming for mining n-ary patterns. In Co-

hen, D., editor, CP, volume 6308 of Lecture Notes in

Computer Science, pages 552–567. Springer.

Krishnamurthy, B. (1985). Short proofs for tricky formulas.

Acta Inf., 22(3):253–275.

Krishnamurty, B. (1985). Short proofs for tricky formulas.

Acta Inf., (22):253–275.

Métivier, J.-P., Boizumault, P., Crémilleux, B., Khiari, M.,

and Loudni, S. (2012). A constraint language for

declarative pattern discovery. In Proceedings of the

27th Annual ACM Symposium on Applied Computing,

SAC ’12, pages 119–125, New York, NY, USA. ACM.

Minato, S. I. (2006). Symmetric item set mining based on

zero-suppressed bdds. In Todorovski, L., Lavrac, N.,

SymmetryBreakinginItemsetMining

and Jantke, K. P., editors, Discovery Science, volume

4265 of Lecture Notes in Computer Science, pages

321–326. Springer.

Minato, S. I., Uno, T., and Arimura, H. (2007). Fast gen-

eration of very large-scale frequent itemsets using a

compact graph-based representation.

Murtagh, F. and Contreras, P. (2010). Hierarchical

clustering for ﬁnding symmetries and other pat-

terns in massive, high dimensional datasets. CoRR,

abs/1005.2638.

Pei, J., Han, J., and Lakshmanan, L. V. S. (2004). Push-

ing convertible constraints in frequent itemset mining.

Data Min. Knowl. Discov., 8(3):227–252.

Puget, J. F. (1993). On the satisfiability of symmetrical con-

strained satisfaction problems. In J. Komorowski

and Z. W. Ras, editors, Proceedings of ISMIS’93, LNAI

689, pages 350–361.

Raedt, L. D., Guns, T., and Nijssen, S. (2008). Constraint

programming for itemset mining. In KDD, pages 204–

212.

Raedt, L. D., Guns, T., and Nijssen, S. (2010). Constraint

programming for data mining and machine learning.

In AAAI.

Tiwari, A., Gupta, R., and Agrawal, D. (2010). A survey on

frequent pattern mining: Current status and challeng-

ing issues. Inform. Technol. J, 9:1278–1293.

Tseitin, G. S. (1968). On the complexity of derivation in

propositional calculus. In Structures in the construc-

tive Mathematics and Mathematical logic, pages 115–

125. H.A.O Shsenko.

Uno, T., Asai, T., Uchida, Y., and Arimura, H. (2003).

Lcm: An efﬁcient algorithm for enumerating frequent

closed item sets. In Proceedings of Workshop on

Frequent itemset Mining Implementations (FIMI03).

Uno, T., Kiyomi, M., and Arimura, H. (2004).

Lcm ver. 2: Efﬁcient mining algorithms for fre-

quent/closed/maximal itemsets. In FIMI.

Vanetik, N. (2010). Mining graphs with constraints on

symmetry and diameter. In Shen, H. T., Pei, J., Özsu,

M. T., Zou, L., Lu, J., Ling, T.-W., Yu, G., Zhuang, Y.,

and Shao, J., editors, WAIM Workshops, volume 6185

of Lecture Notes in Computer Science, pages 1–12.

Springer.

Zaki, M. J. and Hsiao, C.-J. (2005). Efﬁcient algorithms

for mining closed itemsets and their lattice structure.

IEEE Trans. on Knowl. and Data Eng., 17(4):462–

478.

KDIR2014-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval