actional clustering aim to capture high-frequency
patterns while clustering. We argue that this can
result in misleading clusters, especially in cases where
the utilities of category types have a steep
distribution. High-utility clusters can be missed if we
focus only on frequency information while clustering. We
propose a novel clustering algorithm for transactional
data which captures the utility information. We also
propose two validation criteria for the obtained cluster
structure based on how accurately it captures the
high-utility patterns in the data.
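The argument above can be illustrated with a toy case. The following Python sketch is not the clustering algorithm proposed in this paper; it only contrasts a frequency ranking with a utility ranking. The transactions, item names, and utility values are hypothetical and chosen so that the utilities have a steep distribution.

```python
from collections import Counter

# Hypothetical toy data: each transaction is a set of item categories.
transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"laptop"},
]

# Hypothetical per-item utilities (e.g. unit profit) with a steep distribution.
utility = {"bread": 1.0, "milk": 1.5, "butter": 2.0, "laptop": 400.0}


def item_frequency(transactions):
    """Count how many transactions contain each item."""
    counts = Counter()
    for t in transactions:
        counts.update(t)
    return counts


def item_total_utility(transactions, utility):
    """Sum each item's utility over all transactions that contain it."""
    freq = item_frequency(transactions)
    return {item: freq[item] * utility[item] for item in freq}


freq = item_frequency(transactions)
total = item_total_utility(transactions, utility)

# The most frequent item and the item with the highest total utility differ.
most_frequent = max(freq, key=freq.get)
most_valuable = max(total, key=total.get)
print(most_frequent, most_valuable)
```

In this toy data the item that dominates by frequency is not the one that dominates by utility, so a method driven only by frequency would give little weight to the single high-utility transaction.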
Our experiments on real data sets show that the
clustering algorithm successfully captures the
high-utility patterns in the data. Our comparative
experiment results further illustrate the effectiveness of
our algorithm over the popular K-modes and BKPlot
algorithms. For future work, we plan to run experiments
on data sets from other applications, such as
bioinformatics and click-stream data. We believe that
interpreting the clusters found in different data sets will
lead to interesting results and to further evolution of our
algorithm. We also plan to develop the idea of
utility-aware clustering for entropy-based clustering methods.
ACKNOWLEDGEMENTS
The research reported in this paper is funded in part
by Philip and Virginia Sproul Professorship Endow-
ment and Jerry R. Junkins Endowments at Iowa State
University. The research computation is supported
by the HPC@ISU equipment at Iowa State Univer-
sity, some of which has been purchased through fund-
ing provided by NSF under MRI grant number CNS
1229081 and CRI grant number 1205413. Any opin-
ions, findings, and conclusions or recommendations
expressed in this material are those of the author(s)
and do not necessarily reflect the views of the funding
agencies.
REFERENCES
(1987). UCI machine learning repository.
archive.ics.uci.edu/ml/datasets/Soybean+(Small).
Accessed: 2016-09-01.
(2003). Frequent itemset mining dataset repository.
http://fimi.ua.ac.be/data/. Accessed: 2016-06-14.
(2008). BKPlot implementation.
cecs.wright.edu/~keke.chen/. Accessed: 2016-09-01.
(2015). K-modes implementation.
https://github.com/nicodv/kmodes. Accessed: 2016-09-01.
Andreopoulos, B., An, A., Wang, X., and Schroeder, M.
(2009). A roadmap of clustering algorithms: finding a
match for a biomedical application. Briefings in Bioin-
formatics, 10(3):297–314.
Andritsos, P., Tsaparas, P., Miller, R. J., and Sevcik, K. C.
(2004). LIMBO: Scalable clustering of categorical
data. In International Conference on Extending
Database Technology, pages 123–146. Springer.
Bai, L., Liang, J., Dang, C., and Cao, F. (2013). The impact
of cluster representatives on the convergence of the k-
modes type clustering. IEEE transactions on pattern
analysis and machine intelligence, 35(6):1509–1522.
Barbará, D., Li, Y., and Couto, J. (2002). COOLCAT: an
entropy-based algorithm for categorical clustering. In
Proceedings of the eleventh international conference
on Information and knowledge management, pages
582–589. ACM.
Brijs, T., Swinnen, G., Vanhoof, K., and Wets, G. (1999).
Using association rules for product assortment deci-
sions: A case study. In Proceedings of the fifth ACM
SIGKDD international conference on Knowledge dis-
covery and data mining, pages 254–260. ACM.
Chen, K. and Liu, L. (2005). The "best k" for entropy-based
categorical data clustering.
Chen, X., Huang, J. Z., and Luo, J. (2016). PurTreeClust:
A purchase tree clustering algorithm for large-scale
customer transaction data. In 32nd IEEE Interna-
tional Conference on Data Engineering, pages 661–
672. IEEE.
Ganti, V., Gehrke, J., and Ramakrishnan, R. (1999).
CACTUS: clustering categorical data using summaries. In
Proceedings of the fifth ACM SIGKDD international
conference on Knowledge discovery and data mining,
pages 73–83. ACM.
Gibson, D., Kleinberg, J., and Raghavan, P. (1998). Cluster-
ing categorical data: An approach based on dynamical
systems. Databases, 1:75.
Guha, S., Rastogi, R., and Shim, K. (1999). ROCK: A robust
clustering algorithm for categorical attributes. In Data
Engineering, 1999. Proceedings., 15th International
Conference on, pages 512–521. IEEE.
Huang, Z. (1998). Extensions to the k-means algorithm for
clustering large data sets with categorical values. Data
mining and knowledge discovery, 2(3):283–304.
Li, T., Ma, S., and Ogihara, M. (2004). Entropy-based
criterion in categorical clustering. In Proceedings of
the twenty-first international conference on Machine
learning, page 68. ACM.
Liu, Y., Liao, W.-k., and Choudhary, A. (2005). A two-
phase algorithm for fast discovery of high utility item-
sets. In Pacific-Asia Conference on Knowledge Dis-
covery and Data Mining, pages 689–695. Springer.
Ngai, E. W., Xiu, L., and Chau, D. C. (2009). Application of
data mining techniques in customer relationship man-
agement: A literature review and classification. Ex-
pert systems with applications, 36(2):2592–2602.
Qian, Y., Li, F., Liang, J., Liu, B., and Dang, C. (2015).
Space structure and clustering of categorical data.
Saha, I. and Maulik, U. (2014). Incremental learning based
multiobjective fuzzy clustering for categorical data.
Information Sciences, 267:35–57.
A Novel Clustering Algorithm to Capture Utility Information in Transactional Data