INTEGRATED CANDIDATE GENERATION IN PROCESSING
BATCHES OF FREQUENT ITEMSET QUERIES USING APRIORI
Piotr Jedrzejczak and Marek Wojciechowski
Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 2, 60-965 Poznan, Poland
Keywords: Data Mining, Frequent Itemsets, Apriori Algorithm, Data Mining Queries.
Abstract: Frequent itemset mining can be regarded as advanced database querying where a user specifies constraints
on the source dataset and patterns to be discovered. Since such frequent itemset queries can be submitted to
the data mining system in batches, a natural question arises whether a batch of queries can be processed
more efficiently than by executing each query individually. So far, two methods of processing batches of
frequent itemset queries have been proposed for the Apriori algorithm: Common Counting, which integrates
only the database scans required to process the queries, and Common Candidate Tree, which extends the
concept by allowing the queries to also share their main memory structures. In this paper we propose a new
method called Common Candidates, which further integrates processing of the queries from a batch by
performing integrated candidate generation.
1 INTRODUCTION
Frequent itemset discovery (Agrawal et al., 1993) is an important data mining problem with numerous practical applications. Its goal is to discover the most frequently occurring sets of items in a database of sets of items, called transactions. Despite significant advances in frequent itemset mining, the most widely implemented and most commonly used algorithm in practice is still the classic Apriori algorithm (Agrawal and Srikant, 1994), owing to its simplicity and satisfactory performance in real-world scenarios. Apriori iteratively generates candidates (i.e., potentially frequent itemsets) from previously found smaller frequent itemsets and counts their occurrences in the database. To improve the efficiency of testing which candidates are contained in a transaction read from the database, the candidates are stored in a hash tree.
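To make the level-wise scheme concrete, the following minimal Python sketch (illustrative only, not the implementation evaluated later in this paper) generates candidates from the frequent itemsets of the previous iteration and counts them in one pass over the transactions; minsup is taken here as an absolute count, and the nested containment loop stands in for the hash-tree lookup mentioned above.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Minimal level-wise Apriori sketch (minsup is an absolute count)."""
        transactions = [frozenset(t) for t in transactions]
        # frequent 1-itemsets
        counts = {}
        for t in transactions:
            for item in t:
                c = frozenset([item])
                counts[c] = counts.get(c, 0) + 1
        frequent = {i for i, n in counts.items() if n >= minsup}
        result = set(frequent)
        k = 2
        while frequent:
            # join phase: unite pairs of frequent (k-1)-itemsets into k-itemsets
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
            # prune phase: every (k-1)-subset of a candidate must be frequent
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
            # counting phase: one scan of the database per iteration
            counts = {c: 0 for c in candidates}
            for t in transactions:
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            frequent = {c for c, n in counts.items() if n >= minsup}
            result |= frequent
            k += 1
        return result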
Frequent itemset mining is often regarded as
advanced database querying where a user specifies
the source dataset, the minimum support threshold,
and optionally pattern constraints within a given
constraint model (Imielinski and Mannila, 1996). A
significant amount of research on efficient
processing of frequent itemset queries has been done
in recent years, focusing mainly on constraint
handling (see e.g. (Pei and Han, 2000) for an
overview) and reusing results of previous queries
(Baralis and Psaila, 1999) (Meo, 2003).
Recently, a new problem of optimizing the processing of sets of frequent itemset queries has been considered, bringing the concept of multiple-query optimization, a problem extensively studied in the area of database systems (see (Sellis, 1988) for an overview), to the domain of frequent itemset mining. The idea was to process the queries concurrently rather than sequentially and to exploit the overlapping of the queries’ source datasets.
Two general approaches have been taken to
design methods of processing batches of frequent
itemset queries: (1) providing methods independent of a particular frequent pattern mining algorithm,
and (2) tailoring dedicated methods for the most
prominent frequent pattern mining algorithms with a
particular emphasis on Apriori (Wojciechowski and
Zakrzewicz, 2002). It has been shown that the latter
approach yields more efficient algorithms than the
former, due to better sharing of computations and
I/O operations among the queries forming a batch.
The first method of processing batches of
frequent itemset queries proposed for Apriori was
Common Counting (Wojciechowski and
Zakrzewicz, 2002), which consists in concurrent
execution of the queries with the integration of scans
of parts of the database shared among the queries.
Later, Common Counting was improved by
additionally sharing the hash tree structures used to
store candidates, resulting in the Common Candidate
Tree method (Grudzinski and Wojciechowski,
2007). In this paper we present a new algorithm
called Common Candidates, which builds on the
success of Common Candidate Tree, offering further
integration of computations among the queries by
performing integrated candidate generation.
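As a rough illustration of the Common Counting idea, the following Python sketch (assumed data structures, not the authors' implementation) reads each elementary partition of the database once and, during that single scan, updates the candidate counters of every query whose source dataset contains the partition.

    def common_counting_scan(partitions, candidates_per_query, minsup_per_query):
        """Sketch of Common Counting's shared scan. `partitions` is a list of
        (transactions, query_ids) pairs: the transactions of one elementary
        partition and the ids of the queries whose datasets contain it."""
        counts = {q: {c: 0 for c in cands}
                  for q, cands in candidates_per_query.items()}
        for transactions, query_ids in partitions:
            for t in transactions:                 # each partition is read only once...
                for q in query_ids:                # ...but serves all queries covering it
                    for c in candidates_per_query[q]:
                        if c <= t:                 # candidate contained in the transaction
                            counts[q][c] += 1
        return {q: {c for c, n in q_counts.items() if n >= minsup_per_query[q]}
                for q, q_counts in counts.items()}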
2 RELATED WORK
To the best of our knowledge, apart from the
problem considered in this paper, multiple-query
optimization for frequent pattern queries has been
considered only in the context of frequent pattern
mining on multiple datasets (Jin et al., 2005). The
idea was to reduce the common computations
appearing in different complex queries, each of
which compared the support of patterns in several
disjoint datasets. This is fundamentally different
from our problem, where each query refers to only
one dataset and the queries' datasets overlap.
Earlier, the need for multiple-query optimization had been postulated in the area of inductive logic programming, where a technique based on ideas similar to those behind Common Counting was proposed (Blockeel et al., 2002).
3 BASIC DEFINITIONS
Frequent Itemset Query. A frequent itemset query is a tuple dmq = (R, a, Σ, Φ, minsup), where R is a database relation, a is a set-valued attribute of R, Σ is a condition involving the attributes of R called the data selection predicate, Φ is a condition involving discovered itemsets called the pattern constraint, and minsup is the minimum support threshold. The result of dmq is the set of itemsets discovered in π_a(σ_Σ(R)), satisfying Φ and having support ≥ minsup (π and σ denote the relational projection and selection operations, respectively).
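For readers who prefer code, such a query can be mirrored by a small record type. The following Python sketch is purely illustrative (field names and types are not from the paper) and simply captures the five components of dmq.

    from dataclasses import dataclass
    from typing import Callable, FrozenSet

    @dataclass
    class FrequentItemsetQuery:
        """One dmq = (R, a, Sigma, Phi, minsup); names are illustrative."""
        relation: str                                    # R: source database relation
        set_attribute: str                               # a: set-valued attribute of R
        data_selection: str                              # Sigma: data selection predicate (e.g., a WHERE clause)
        pattern_constraint: Callable[[FrozenSet], bool]  # Phi: constraint on discovered itemsets
        minsup: float                                    # minimum support threshold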
Elementary Data Selection Predicates. The set of elementary data selection predicates for a set of frequent itemset queries DMQ = {dmq_1, dmq_2, ..., dmq_n} is the smallest set S = {s_1, s_2, ..., s_k} of data selection predicates over the relation R such that for each u, v (u ≠ v) we have σ_{s_u}(R) ∩ σ_{s_v}(R) = ∅ and for each dmq_i there exist integers a, b, ..., m such that σ_{Σ_i}(R) = σ_{s_a}(R) ∪ σ_{s_b}(R) ∪ ... ∪ σ_{s_m}(R). The set of elementary data selection predicates represents the partitioning of the database determined by the overlapping of the queries’ datasets.
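As a purely illustrative example (not taken from the paper): if dmq_1 selects transactions with 1 ≤ id ≤ 20 and dmq_2 selects transactions with 10 ≤ id ≤ 30, then S = {s_1, s_2, s_3} with s_1: 1 ≤ id < 10, s_2: 10 ≤ id ≤ 20, and s_3: 20 < id ≤ 30, so that σ_{Σ_1}(R) = σ_{s_1}(R) ∪ σ_{s_2}(R) and σ_{Σ_2}(R) = σ_{s_2}(R) ∪ σ_{s_3}(R).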
Problem Statement. Given a set of frequent itemset queries DMQ = {dmq_1, dmq_2, ..., dmq_n}, the problem of multiple-query optimization of DMQ consists in generating an algorithm to execute DMQ that minimizes the overall processing time.
4 COMMON CANDIDATES
The only part of Apriori that is still performed
separately for each query in Common Candidate
Tree (CCT) is the candidate itemset generation. In
order to introduce concurrency in that area, we
propose a new method: Common Candidates
(CCan), which makes it possible to generate
candidates for all queries in a batch at once while
preserving all the optimizations present in CCT. The
pseudo-code for CCan is presented in Figure 1.
CCT used two representations of an itemset: a standard, single-query representation (to store the frequent itemsets and freshly generated candidates) and an extended, multiple-query one (to store the candidate itemsets inside a common hash tree). CCan abandons the former completely and stores both the frequent and candidate itemsets using the extended representation, in which a bitmap (fromQuery[]) indicates which queries generated a candidate itemset and is later updated to show in which queries that itemset has been verified to be frequent.
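A minimal Python sketch of such an extended representation is shown below; the field and type names are illustrative and not taken from the original implementation.

    from dataclasses import dataclass
    from typing import FrozenSet, List

    @dataclass
    class ExtendedItemset:
        """Multiple-query itemset representation used for both candidates and
        frequent itemsets: one flag and one support counter per query in the batch."""
        items: FrozenSet[str]
        fromQuery: List[bool]   # queries the itemset applies to / is frequent in
        counters: List[int]     # per-query support counters filled during counting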
The general idea of candidate generation remains
identical to that of Apriori and is composed of the
join phase and pruning phase. There are, however,
some significant differences. Unlike all the previous methods, which performed the join phase with the itemsets of only one query at a time, CCan joins the frequent itemsets of all queries simultaneously.
To avoid generating candidates that do not apply to any query, only those pairs of itemsets that share at least one query are considered. After a candidate has been generated, its bitmap is calculated during the mandatory pruning phase by performing a logical AND of the bitmaps of all of its subsets that are one item smaller. The resulting bitmap has bits set only for the queries in which all of these subsets are frequent (i.e., the queries the candidate actually applies to); candidates with an empty bitmap are pruned automatically. As candidates generated this way already use the extended itemset representation, they can be stored in the common hash tree without any merging or conversion.
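A minimal Python sketch of this integrated join and prune step is given below. It is illustrative only: the classical Apriori join merges two (k-1)-itemsets sharing their first k-2 items, whereas the sketch uses a simpler union-based join, and the frequent itemsets are passed as a plain mapping from itemsets to their fromQuery bitmaps.

    from itertools import combinations

    def generate_candidates(frequent, n_queries):
        # frequent: dict mapping each frequent (k-1)-itemset (a frozenset of items)
        # to its fromQuery bitmap (list of booleans, one per query in the batch).
        if not frequent:
            return {}
        k = len(next(iter(frequent))) + 1
        candidates = {}
        for (items_a, bits_a), (items_b, bits_b) in combinations(frequent.items(), 2):
            # join phase: consider a pair only if the two itemsets share a query
            if not any(a and b for a, b in zip(bits_a, bits_b)):
                continue
            joined = items_a | items_b
            if len(joined) != k or joined in candidates:
                continue
            # prune phase: AND the bitmaps of all (k-1)-subsets of the candidate;
            # a missing subset clears the bitmap and the candidate is dropped
            bitmap = [True] * n_queries
            for sub in combinations(joined, k - 1):
                sub_bits = frequent.get(frozenset(sub))
                if sub_bits is None:
                    bitmap = [False] * n_queries
                    break
                bitmap = [x and y for x, y in zip(bitmap, sub_bits)]
            if any(bitmap):
                candidates[joined] = bitmap
        return candidates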
The advantage of the integrated candidate generation
of CCan as compared to CCT is two-fold. Firstly,
each candidate is generated only once, no matter
how many queries it applies to. Secondly, there is no
need to convert the itemsets between the standard
and extended representations, as the latter is used in
both the generation and count phases.
5 EXPERIMENTAL RESULTS
In order to evaluate the performance of CCan
compared to CCT, we conducted a series of
experiments on a synthetic dataset generated with
GEN (Agrawal et al., 1996) using the following
settings: number of transactions = 1000000, average
number of items in a transaction = 8, number of
different items = 1000, number of patterns = 1500,
average pattern length = 4. The dataset was stored in
an index-organized table inside an Oracle database
to facilitate efficient access to its fragments
processed by frequent itemset queries. The
experiments were carried out on a Mac with 2.2
GHz Intel Core 2 Duo processor and 4 GB of
memory, running Snow Leopard; the database was
deployed on a PC with Athlon 64 3800+ processor
and 2 GB of memory, running SuSE Linux.
In the experiments we varied the level of
overlapping between the queries and the number of
queries in a batch. Each query referred to a dataset
containing 100000 consecutive transactions from the generated dataset. A support threshold of 0.7%, which resulted in 7-8 Apriori iterations, was used for all queries. In addition to measuring the total execution times of both algorithms, we also measured the time spent on candidate generation, which is the target of the optimizations introduced in CCan.
Figure 2: Generation and total execution times for two
queries and different levels of overlapping.
The goal of the first experiment was to examine how
the level of overlapping between the queries affects
the generation and total execution times of CCan
compared to CCT. The batch used in this experiment consisted of two queries. The obtained results are shown in Figure 2.
The generation times of both CCT and CCan remain almost constant regardless of the level of overlapping, with CCan significantly outperforming CCT. The difference in total execution times is less significant, due to the fact that candidate occurrence counting is considerably more time consuming than candidate generation in Apriori-based methods.

Input: DMQ = {dmq_1, dmq_2, ..., dmq_n}, where dmq_i = (R, a, Σ_i, Φ_i, minsup_i)
(1)  C_1 ← all possible 1-itemsets;
(2)  for (k = 1; C_k ≠ ∅; k++) do begin
(3)    for each s_j ∈ S do begin
(4)      CC ← {c ∈ C_k : ∃i such that s_j is part of Σ_i and c.fromQuery[i] = true};
(5)      if CC ≠ ∅ then count(CC, σ_{s_j}(R)); end
(6)    for each c ∈ C_k do
(7)      for (i = 1; i ≤ n; i++) do
(8)        if c.counters[i] < minsup_i then c.fromQuery[i] ← false;
(9)    F_k ← {c ∈ C_k : ∃i such that c.fromQuery[i] = true};
(10)   C_{k+1} ← generate_candidates(F_k);
(11) end
(12) for (i = 1; i ≤ n; i++) do
       Answer_i ← ∪_k {f ∈ F_k : f.fromQuery[i] = true};

Figure 1: Common Candidates.
The second experiment examined how well the algorithms scale with an increasing number of concurrently executed queries. In order to keep the queries equally similar, the level of overlapping between each pair of subsequent queries in the batch was fixed at 75%. As can be seen in Figure 3, the generation time of CCT grows linearly with the number of queries in a batch, while that of CCan remains largely unaffected. Total execution times increase similarly for both methods, with CCan performing slightly better, especially for larger batches.
Figure 3: Generation and total execution times for
different numbers of similar queries.
6 CONCLUSIONS
In this paper we addressed the problem of efficient
processing of batches of frequent itemset queries in
the context of the Apriori algorithm. We proposed a
new algorithm, called Common Candidates, built
upon Common Candidate Tree, offering further
integration of computations performed for a batch of
queries thanks to the integrated candidate generation
procedure.
The conducted experiments showed that the new method results in a significant reduction of the total time spent on candidate generation. The impact of
the integrated candidate generation procedure on the
overall execution time is less spectacular but still
noticeable.
In the future we plan to investigate how several optimizations applied to Apriori in its practical implementations affect our batch processing algorithms.
REFERENCES
Agrawal, R., Imielinski, T., Swami, A., 1993. Mining
Association Rules Between Sets of Items in Large
Databases, In Proc. of the 1993 ACM SIGMOD Conf.
Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning,
A., Bollinger, T., 1996. The Quest Data Mining
System, In Proc. of the 2nd KDD Conference.
Agrawal, R., Srikant, R., 1994. Fast Algorithms for
Mining Association Rules, In Proc. of the 20th VLDB
Conference.
Baralis, E., Psaila, G., 1999. Incremental Refinement of
Mining Queries, In Proceedings of the 1st DaWaK
Conference.
Blockeel, H., Dehaspe, L., Demoen, B., Janssens, G.,
Ramon, J., Vandecasteele, H., 2002. Improving the
Efficiency of Inductive Logic Programming Through
the Use of Query Packs, Journal of Artificial
Intelligence Research, Vol. 16.
Grudzinski, P., Wojciechowski, M., 2007. Integration of
Candidate Hash Trees in Concurrent Processing of
Frequent Itemset Queries Using Apriori, In Proc. of
the 3rd ADMKD Workshop.
Imielinski, T., Mannila, H., 1996. A Database Perspective
on Knowledge Discovery, Communications of the
ACM, Vol. 39.
Jin, R., Sinha, K., Agrawal, G., 2005. Simultaneous
Optimization of Complex Mining Tasks with a
Knowledgeable Cache, In Proc. of the 11th KDD
Conference.
Meo, R., 2003. Optimization of a Language for Data
Mining, In Proc. of the ACM SAC Conference.
Pei, J., Han, J., 2000. Can We Push More Constraints into
Frequent Pattern Mining?, In Proc. of the 6th KDD
Conference.
Sellis, T., 1988. Multiple-query optimization, ACM
Transactions on Database Systems, Vol. 13.
Wojciechowski, M., Zakrzewicz, M., 2002. Methods for
Batch Processing of Data Mining Queries, In Proc. of
the 5th DB&IS Conference.