A Novel Clustering Algorithm to Capture Utility Information in

Transactional Data

Piyush Lakhawat, Mayank Mishra and Arun Somani

Department of Electrical and Computer Engineering, Iowa State University, Ames, Iowa, U.S.A.

Keywords:

Clustering Algorithm, Transactional Data, High Utility Patterns.

Abstract:

We develop and design a novel clustering algorithm to capture utility information in transactional data. Trans-

actional data is a special type of categorical data where transactions can be of varying length. A key objective

for all categorical data analysis is pattern recognition. Therefore, transactional clustering algorithms focus

on capturing the information on high frequency patterns from the data in the clusters. In recent times, utility

information for category types in the data has been added to the transactional data model for a more realistic

representation of data. As a result, the key information of interest has become high utility patterns instead of

high frequency patterns. To the best our knowledge, no existing clustering algorithm for transactional data

captures the utility information in the clusters found. Along with our new clustering rationale we also de-

velop corresponding metrics for evaluating quality of clusters found. Experiments on real datasets show that

the clusters found by our algorithm successfully capture the high utility patterns in the data. Comparative

experiments with other clustering algorithms further illustrate the effectiveness of our algorithm.

1 INTRODUCTION AND

MOTIVATION

Transactional data model is used in many important

applications of data mining and analytics like market

basket analysis, bioinformatics, click stream analysis

etc. Clustering is one of key analysis techniques for

these applications. For example, clustering of cus-

tomer transactions data is performed in market basket

analysis for segmentation and identiﬁcation of vari-

ous customer types (Ngai et al., 2009). An impor-

tant data type in bioinformatics experiments is gene

co-expression data which can be modeled as transac-

tional data (Andreopoulos et al., 2009). Identifying

clusters of genes exhibiting similar expression in con-

trolled conditions can lead to critical discoveries in

medicine.

A typical transaction of size k in a transactional

data set can represented as T

=(X

, X

. . . X

), where

is a category type(also called item type) and ID

is the transaction ID. For example, transactions from

a retail store data can be like T

=(Candy, Pop,

Chips), T

=(HDTV, Hi-Deﬁnition Speakers) and so

on, where each transaction is a list of items bought

by a customer in one trip to the store. If we work

with only this level of information then we assume

equal importance of each category type in the analy-

sis. This would mean that if more number of customer

purchase patterns include Pop than HDTVs then our

analysis would implicitly assume Pop being a more

important category type. However, due to signiﬁ-

cantly more proﬁt per sale of a HDTV, it is possible

that overall HDTV sales yield much more proﬁt to the

store than Pop sales. The general idea is that various

category types in a transactional data have different

level of relative importance (called utility) in the anal-

ysis. Therefore to make the transactional data model

more realistic utility values are assigned to each cate-

gory type in the data. An important data mining prob-

lem which also uses transactional data model is item-

set mining. The aforementioned idea has led to the

evolution of traditional frequent itemset mining prob-

lem into the current high utility itemset mining prob-

lem (Tseng et al., 2015). Such a transition has not yet

occurred for the transactional data clustering problem.

To the best of our knowledge no existing clustering al-

gorithm for transactional data captures the utility in-

formation.

Current transactional clustering algorithms cap-

ture the frequency information (number of occur-

rences) while ﬁnding clusters (Yan et al., 2010) but do

not use the utility values. This will lead to misleading

clusters especially when various category types have

signiﬁcant difference in their utility. In order to cap-

456

Lakhawat, P., Mishra, M. and Somani, A.

A Novel Clustering Algorithm to Capture Utility Information in Transactional Data.

DOI: 10.5220/0006092104560462

In Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2016) - Volume 1: KDIR, pages 456-462

ISBN: 978-989-758-203-5

ture high utility patterns (patterns based on combina-

tion of frequency and utility) in the data, we propose a

concept of relative utility of category types in clusters.

We use it to deﬁne a new similarity metric to guide

the clustering process. Based on our similarity met-

ric we develop a hierarchical agglomerative clustering

algorithm. The algorithm design allows two tunable

parameters to be set to achieve clusters with desired

properties. Since we are introducing a new clustering

rationale, we propose two corresponding cluster eval-

uation criterion as the existing ones dont take utility

values into account.

1.1 Contributions

Our work makes the following contributions:

• A novel clustering algorithm to capture utility in-

formation in transactional data. The algorithm can

be tuned to obtain clusters with desired properties

of a strong core and a balanced structure.

• Two validation criterion for evaluating the ob-

tained cluster structure based on how accurately

it captures the high utility patterns in the data.

• Experimental results on real data sets showing

that the clustering algorithm successfully captures

the high utility patterns in the data. Compara-

tive experiments with other algorithms illustrate

the effectiveness of our algorithm.

2 PRIOR WORK ON

CATEGORICAL AND

TRANSACTIONAL

CLUSTERING

Transactional data is a special type of categorical data

where transactions can be of varying lengths. While it

is possible to use categorical clustering algorithms for

transactional data, most of them lead to a data explo-

sion problem as described in (Yan et al., 2010). Spe-

ciﬁc algorithms targeting transactional data cluster-

ing have been developed as well (Wang et al., 1999),

(Yan et al., 2005), (Yan et al., 2010), (Yang et al.,

2002). Use of similarity measures and cluster quality

measures are two common approaches used in most

of the categorical and transactional clustering algo-

rithms. Based on our review of the published work

we classify them into the following three categories:

• Similarity Measures. In (Huang, 1998) an ex-

tension of the popular k-means algorithm (which

is for numerical data) is presented called k modes.

It replaces the means with modes of clusters to

minimize a cost function. An in-depth analysis

of subsequent updates of the k-modes algorithm

is done in (Bai et al., 2013). In (Guha et al.,

1999) a link based similarity measure is proposed

to guide the clustering. An entropy based similar-

ity criterion is presented in (Chen and Liu, 2005).

A transformation technique is presented in (Qian

et al., 2015) to map categorical data into a space

structure followed by using similarity measures.

A similarity measure based using information bot-

tleneck theory is presented in (Tishby and Slonim,

2000) and (Andritsos et al., 2004). (Ganti et al.,

1999), (Gibson et al., 1998) and (Wang et al.,

1999) are some other works with the focus on sim-

ilarity type measures for clustering. A mapping of

categorical data into a tree structure is performed

in (Chen et al., 2016) before performing cluster-

ing using a tree similarity metric.

• Quality Measures. (Yang et al., 2002), (Yan

et al., 2005), (Barbar

a et al., 2002) and (Yan

et al., 2010) present algorithms which iteratively

improve overall cluster quality. These deﬁnitions

are based on various entropy based concepts from

information theory. An in-depth discussion of en-

tropy based criterion used for the categorical clus-

tering is presented in (Li et al., 2004).

• Miscellaneous. Categorical clustering has been

presented as an optimization problem in (Xiong

et al., 2012). A learning based multi-objective

fuzzy approach is presented in (Saha and Maulik,

2014). An entropy based subspace clustering

technique for categorical data is presented in (Sun

et al., 2015).

While the current body of work has numerous clus-

tering algorithms, all lack of capturing the utility in-

formation in the data. We propose the following algo-

rithm to ﬁll the void.

3 A NOVEL CLUSTERING

ALGORITHM TO CAPTURE

UTILITY INFORMATION IN

TRANSACTIONAL DATA

Any clustering algorithm is designed with a clustering

criterion at its core. The clustering criterion behind

our algorithm is to gather transactions which share

common high utility category types in a cluster. At the

convergence of clustering we expect to capture high

utility patterns of the data. While the current algo-

rithms for transactional clustering aim to capture high

frequency patterns while clustering, we argue that this

A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

457

can lead to inaccurate clustering in cases where the

utilities of category types have a steep distribution. A

high utility cluster (containing transactions with com-

mon high utility category types) can be missed (i.e.

not found) if we focus only on frequency information

while clustering. Before further describing our clus-

tering algorithm in detail we need to deﬁne the fol-

lowing preliminaries (common in utility itemset min-

ing (Tseng et al., 2015)):

I = {a

, a

, . . . , a

} = Set of item(category) types

(1)

D = {T

, T

, . . . , T

} = Transactional dataset (2)

where each T

= {x

, x

, . . .} ⊂ I

X = size k itemset = {x

, x

, . . . , x

∈ I} (3)

eu(a

) = external utility of a

(4)

iu(a

, T

) = internal utility of a

in T

(5)

eu(a

) is the measure of unit importance for item type

. While iu(a

, T

) is a quantity measure of a

in T

The absolute utility of an item in a transaction is de-

ﬁned as the product of its internal and external utility.

au(a

, T

) = eu(a

).iu(a

, T

) (6)

Absolute utility of an itemset in a transaction is the

sum of absolute utilities of its constituent items.

au(X, T

) =

∑

∈X

au(x

, T

) (7)

Absolute utility of a transaction (also called transac-

tion utility) is the sum of absolute utilities of all its

constituent items.

TU(T

) =

∑

∈T

au(x

, T

) (8)

Absolute utility of an itemset in the dataset D is the

sum of absolute of that itemset in all transactions that

it occurs in.

au(X, D) =

∑

X∈T

∧T

∈D

au(X, T

) (9)

The set of HUI in D is the collection of all itemsets

which have absolute utility more than or equal to φ in

the dataset D.

HUI in D = {X such that au(X, D) ≥ φ} (10)

A cluster C

of is a subset of transactions from D.

= {T

, T

. . . T

∈ D} (11)

C is the set of all given clusters.

= {a

∈ T

∧ T

∈ C

} = item types in C

(12)

We introduce following new concepts of cluster

utility, relative utility of a category type in a cluster

and the afﬁnity between clusters.

CU(C

) =

∑

∈C

TU(T

) = Cluster utility of C

(13)

We deﬁne cluster utility as an overall measure of im-

portance of a cluster, since it is the sum of utilities of

all transactions in it.

∀a

∈ I

, ru(a

) =

∑

∈I

∧T

∈C

au(a

, T

)

CU(C

)

(14)

= relative utility of a

in C

We deﬁne relative utility as the relative impor-

tance (since utility is a unit of importance) given to

among all I

in C

. The set {ru(a

)|a

∈ I

} is

then a of signature of the cluster. It is a concise rep-

resentation of the utility information we are interested

in about the cluster. And the collection of all such sets

gives us the concise global information of the cluster

structure present in the dataset on the basis of high

utility patterns in it.

For clusters C

and C

a f f inity(C

) =

∑

a∈I

∧a∈I

min(ru(a,C

), ru(a,C

))

(15)

We deﬁne afﬁnity as the sum of shared utility of

common category types among two clusters. So it is a

representative of the similar importance given by two

clusters to their common category types. This aligns

with our goal of clustering which is to gather transac-

tions which share common high utility category types

in a cluster.

3.1 Clustering Algorithm

Using our deﬁnition of afﬁnity we perform clustering

on the dataset using a bottom-up (also called agglom-

erative) hierarchical approach. Agglomerative hier-

archical approach is well established for successful

clustering of categorical data(Guha et al., 1999). We

initialize our algorithm by assigning each transaction

as an individual cluster. We then calculate afﬁnity of

each cluster with every other cluster. Following that

we incrementally merge the two clusters which have

most afﬁnity to each other. A merge implies union of

the two clusters to form a single cluster. After each

merge we calculate all the afﬁnity values associated

with the new cluster. This includes ﬁrst computing the

cluster utility and relative utilities for the new cluster.

The cluster structure represents a complete graph

with each cluster as a node connected to every other

node. The node weights represent the cluster utility

and the edge weights represent the afﬁnity values. At

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval

458

each merge we calculate the node weight for the new

node and the edge weights for the new edges while

keeping the graph complete. We terminate the al-

gorithm when the maximum afﬁnity found between

any two clusters is below a certain predeﬁned afﬁnity

threshold min

a f f

. The cluster distribution can have

a long tail, meaning there can be many transactions

which do not fall under any signiﬁcant cluster struc-

ture. min

a f f

ensures that clusters with high cluster

utility do not end up merging just because all other re-

maining small clusters are very dissimilar among each

other.

At this point we only select clusters with more

than or equal to a minimum cluster utility value. We

do this by predeﬁning a parameter min

uty

and accept-

ing only those clusters which have cluster utility more

than or equal to min

uty

of the cluster with the max-

imum cluster utility. Figure 1 shows a small syn-

thetic example illustrating the clustering process. It

shows afﬁnities and cluster utilities as edge and node

weights respectively. The change in cluster structure

can be observed based on the choice of min

a f f

and

the min

uty

parameters. Algorithm 1 below provides a

pseudocode.

Input: C ;

while max

a f f

≥ min

a f f

for C

∈ C do

if afﬁnity(C

) > max

a f f

then

max

a f f

= afﬁnity(C

);

= C

;

= C

;

merge(C

);

update relevant afﬁnities;

for C

∈ C do

CU(C

)

max(CU(C

)∀C

∈C)

≤ min

uty

then

delete C

;

return C;

Algorithm 1: A novel clustering algorithm for categorical

data with utility information.

4 PROPERTIES OF THE

CLUSTER STRUCTURE

The obtained cluster structure by our algorithm will

have the following properties by design:

• A strong core: We terminate the clustering pro-

cess when the maximum afﬁnity found between

any two clusters is less than min

a f f

. Therefore,

each cluster will have a set of category types for

which each transaction in the cluster has at least

Figure 1: Illustration of the clustering process.

min

a f f

of shared relative utility among them.

This is evident from the deﬁnition of afﬁnity.

• A balanced structure: Ratio of cluster utility of

each cluster to that of the cluster with the maxi-

mum cluster utility will be more than or equal to

the predeﬁned threshold min

uty

• Greedy path taken: Clusters with the most afﬁnity

with each other merge at each step.

Besides the above three properties, we expect the

cluster structure to capture the high utility patterns in

the data. This is the primary goal of this work. A re-

cent transactional clustering framework SCALE(Yan

et al., 2010) introduces a cluster validation crite-

rion called LISR(Large Item Size Ratio) which cap-

tures large items(frequently occurring category types)

preserved in clustering. Since we capture patterns

based on combination criterion of frequency and util-

ity, using validation criterion like LISR is inadequate.

Therefore we introduce two new validation criterion

based on the high utility itemsets (HUI) present in the

clusters. We call them HUI

capture

and HUI

inaccurate

They are deﬁned as follows:

HUI in C = {X such that au(X,C

) >= φ

∈ C}

(16)

where φ

|D|

∑

∈C

The set of HUI in C represents the high utility pat-

terns present in the cluster structure C. If the entire

data set D is used in clustering then φ = φ

. How-

ever, since clustering is an expensive operation often

random sampling techniques are used to select trans-

actions for clustering. In those cases we scale down

the threshold φ to φ

HU I

capture

|{HUI in C} ∩ {HUI in D}|

|{HUI in D}|

(17)

HU I

capture

represents the degree to which the

cluster structure C captures the high utility patterns

A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

459

present in the dataset D. Successful HU I

capture

characterized by a value greater than the ratio of size

of data used for clustering to the size of the whole

dataset

∑

∈C

|D|

. Higher values of HUI

capture

imply

higher quality of the cluster structure.

HU I

inaccurate

|{HUI in C} − {HUI in C ∩ HUI in D}|

|{HUI in D}|

(18)

HU I

inaccurate

represents the degree of inaccurately

captured high utility patterns in C with respect to the

high utility patterns present in the data D. Lower val-

ues of HUI

inaccurate

imply higher quality of the cluster

structure.

5 EXPERIMENTAL EVALUATION

AND DISCUSSION

To validate our algorithm we perform clustering on a

real data set obtained from (Ret, 2003) and provided

by (Brijs et al., 1999). It contains anonymous cus-

tomer transaction data from a Belgian retail store and

contains 88,163 transactions. We randomly generated

the external utilities (between 1-50) for various cate-

gory types by using a uniform random number gen-

erator. Using this data we performed the following

experiment:

• For φ=50000 we calculated the list of high utility

itemsets (HUI) in the complete data set D. We

found 111 HUIs. We did this by implementing a

popular itemset mining technique called the two-

phase algorithm (Liu et al., 2005). The choice of

φ was based on obtaining a reasonably large but

manageable number of HUI.

• For four combinations of our algorithm parame-

ters we calculated the cluster structure and the list

of HUI for each cluster structures. The selection

of data for clustering is done using uniform ran-

dom sampling.

• Finally, we evaluate our validation metrics:

HU I

capture

and HU I

inaccurate

. We present the re-

sults in Figure 2.

The following inferences can be drawn from the

results:

• The HUI captured in clusters are signiﬁcantly

greater than the data used for clustering. This im-

plies high quality of the cluster structure.

• The extremely low (and 0) values of HUI inaccu-

rately captured imply high accuracy of the cluster

structure.

Figure 2: Cluster validations metrics for the retail store data

set.

Table 1: Comparative results for Soybean dataset: patterns

missed.

φ Our Algorithm BKPlot K-modes

800 0 0 8

400 2 5 7

200 0 2 4

We also perform comparative experiments with

two other categorical clustering algorithms: K-modes

(Huang, 1998) and BKPlot (Chen and Liu, 2005).

Since the provided implementations of K-modes

(Kmo, 2015) and BKPlot (BKP, 2008) accept data

sets only with uniform transaction length, we perform

the comparative analysis on the Soybean data set ob-

tained from (UCI, 1987). It is a collection of attributes

of various soybean plants type. Soybean data set is

reasonably small so we dont use the random sampling

and instead perform clustering on the entire dataset.

To maintain uniformity in cluster structure we eval-

uate a four cluster solution for each algorithm. For

using in our algorithm we also generate utility val-

ues(between 1-50) for each category type using a uni-

form random number generator. Since the data set is

small and we dont use random sampling, we perform

comparison based on patterns (high utility itemsets)

missed by clusters formed by each algorithm. Ta-

ble 1 shows the comparative results. φ represents the

threshold value used to calculate high utility patterns

(itemsets). Our algorithm performs better (or equal)

than the other two for all cases. While the soybean

data set is a small one, it still illustrates the fact that

well established algorithms like K-modes and BKPlot

miss out on high utility patterns.

6 CONCLUSIONS AND FUTURE

WORKS

In the growing body of work of clustering algorithms

for categorical data we identiﬁed the need for an al-

gorithm which captures the utility information in the

transactional data. The current algorithms for trans-

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval

460

actional clustering aim to capture high frequency pat-

terns while clustering. We argue that this can re-

sult in misleading clusters especially for cases where

the utilities of category types have a steep distribu-

tion. High utility clusters can be missed if we focus

only on frequency information while clustering. We

propose a novel clustering algorithm for transactional

data which captures the utility information. We also

propose two validation criterion for the obtained clus-

ter structure based on how accurately it captures the

high utility patterns in the data.

Our experiments on a real data sets show that the

clustering algorithm successfully captures the high

utility patterns in the data. Our comparative experi-

ment results further illustrate the effectiveness of our

algorithm over the popular K-modes and BKPlot al-

gorithms. For future work, we plan to do experiments

on data sets from various applications like bioinfor-

matics, click stream data etc. We believe interpreta-

tions of clusters found in different data sets will lead

to interesting results and evolution of our algorithm.

We plan to develop the idea of utility aware clustering

for entropy based clustering methods as well.

ACKNOWLEDGEMENTS

The research reported in this paper is funded in part

by Philip and Virginia Sproul Professorship Endow-

ment and Jerry R. Junkins Endowments at Iowa State

University. The research computation is supported

by the HPC@ISU equipment at Iowa State Univer-

sity, some of which has been purchased through fund-

ing provided by NSF under MRI grant number CNS

1229081 and CRI grant number 1205413. Any opin-

ions, ﬁndings, and conclusions or recommendations

expressed in this material are those of the author(s)

and do not necessarily reﬂect the views of the funding

agencies.

REFERENCES

(1987). Uci machine learning repository. archive.ics.

uci.edu/ml/datasets/Soybean+(Small). Accessed:

2016-09-01.

(2003). Frequent itemset mining dataset repository. http://

ﬁmi.ua.ac.be/data/. Accessed: 2016-06-14.

(2008). Bkplot implementation. cecs.wright.edu/

∼keke.chen/. Accessed: 2016-09-01.

(2015). K-modes implementation. https://github.com/

nicodv/kmodes. Accessed: 2016-09-01.

Andreopoulos, B., An, A., Wang, X., and Schroeder, M.

(2009). A roadmap of clustering algorithms: ﬁnding a

match for a biomedical application. Brieﬁngs in Bioin-

formatics, 10(3):297–314.

Andritsos, P., Tsaparas, P., Miller, R. J., and Sevcik, K. C.

(2004). Limbo: Scalable clustering of categori-

cal data. In International Conference on Extending

Database Technology, pages 123–146. Springer.

Bai, L., Liang, J., Dang, C., and Cao, F. (2013). The impact

of cluster representatives on the convergence of the k-

modes type clustering. IEEE transactions on pattern

analysis and machine intelligence, 35(6):1509–1522.

Barbar

a, D., Li, Y., and Couto, J. (2002). Coolcat: an

entropy-based algorithm for categorical clustering. In

Proceedings of the eleventh international conference

on Information and knowledge management, pages

582–589. ACM.

Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. (1999).

Using association rules for product assortment deci-

sions: A case study. In Proceedings of the ﬁfth ACM

SIGKDD international conference on Knowledge dis-

covery and data mining, pages 254–260. ACM.

Chen, K. and Liu, L. (2005). The” best k” for entropy-based

categorical data clustering.

Chen, X., Huang, J. Z., and Luo, J. (2016). Purtreeclust:

A purchase tree clustering algorithm for large-scale

customer transaction data. In 32nd IEEE Interna-

tional Conference on Data Engineering, pages 661–

672. IEEE.

Ganti, V., Gehrke, J., and Ramakrishnan, R. (1999). Cac-

tusclustering categorical data using summaries. In

Proceedings of the ﬁfth ACM SIGKDD international

conference on Knowledge discovery and data mining,

pages 73–83. ACM.

Gibson, D., Kleinberg, J., and Raghavan, P. (1998). Cluster-

ing categorical data: An approach based on dynamical

systems. Databases, 1:75.

Guha, S., Rastogi, R., and Shim, K. (1999). Rock: A robust

clustering algorithm for categorical attributes. In Data

Engineering, 1999. Proceedings., 15th International

Conference on, pages 512–521. IEEE.

Huang, Z. (1998). Extensions to the k-means algorithm for

clustering large data sets with categorical values. Data

mining and knowledge discovery, 2(3):283–304.

Li, T., Ma, S., and Ogihara, M. (2004). Entropy-based

criterion in categorical clustering. In Proceedings of

the twenty-ﬁrst international conference on Machine

learning, page 68. ACM.

Liu, Y., Liao, W.-k., and Choudhary, A. (2005). A two-

phase algorithm for fast discovery of high utility item-

sets. In Paciﬁc-Asia Conference on Knowledge Dis-

covery and Data Mining, pages 689–695. Springer.

Ngai, E. W., Xiu, L., and Chau, D. C. (2009). Application of

data mining techniques in customer relationship man-

agement: A literature review and classiﬁcation. Ex-

pert systems with applications, 36(2):2592–2602.

Qian, Y., Li, F., Liang, J., Liu, B., and Dang, C. (2015).

Space structure and clustering of categorical data.

Saha, I. and Maulik, U. (2014). Incremental learning based

multiobjective fuzzy clustering for categorical data.

Information Sciences, 267:35–57.

A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

461

Sun, H., Chen, R., Jin, S., and Qin, Y. (2015). A hierarchical

clustering for categorical data based on holo-entropy.

In 2015 12th Web Information System and Application

Conference (WISA), pages 269–274. IEEE.

Tishby, N. and Slonim, N. (2000). Data clustering by

markovian relaxation and the information bottleneck

method. In NIPS, pages 640–646. Citeseer.

Tseng, V. S., Wu, C.-W., Fournier-Viger, P., and Philip,

S. Y. (2015). Efﬁcient algorithms for mining the con-

cise and lossless representation of high utility item-

sets. IEEE transactions on knowledge and data engi-

neering, 27(3):726–739.

Wang, K., Xu, C., and Liu, B. (1999). Clustering transac-

tions using large items. In Proceedings of the eighth

international conference on Information and knowl-

edge management, pages 483–490. ACM.

Xiong, T., Wang, S., Mayers, A., and Monga, E. (2012).

Dhcc: Divisive hierarchical clustering of categori-

cal data. Data Mining and Knowledge Discovery,

24(1):103–135.

Yan, H., Chen, K., Liu, L., and Yi, Z. (2010). Scale: a

scalable framework for efﬁciently clustering transac-

tional data. Data mining and knowledge Discovery,

20(1):1–27.

Yan, H., Zhang, L., and Zhang, Y. (2005). Clustering cat-

egorical data using coverage density. In International

Conference on Advanced Data Mining and Applica-

tions, pages 248–255. Springer.

Yang, Y., Guan, X., and You, J. (2002). Clope: a fast and

effective clustering algorithm for transactional data.

In Proceedings of the eighth ACM SIGKDD interna-

tional conference on Knowledge discovery and data

mining, pages 682–687. ACM.

KDIR 2016 - 8th International Conference on Knowledge Discovery and Information Retrieval

462