# A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

### Piyush Lakhawat, Mayank Mishra, Arun Somani

#### Abstract

We design and develop a novel clustering algorithm to capture utility information in transactional data. Transactional data is a special type of categorical data in which transactions can be of varying length. A key objective of all categorical data analysis is pattern recognition. Transactional clustering algorithms therefore focus on capturing information about high frequency patterns from the data in the clusters. In recent times, utility information for the category types in the data has been added to the transactional data model for a more realistic representation of the data. As a result, the key information of interest has become high utility patterns instead of high frequency patterns. To the best of our knowledge, no existing clustering algorithm for transactional data captures the utility information in the clusters found. Along with our new clustering rationale, we also develop corresponding metrics for evaluating the quality of the clusters found. Experiments on real datasets show that the clusters found by our algorithm successfully capture the high utility patterns in the data. Comparative experiments with other clustering algorithms further illustrate the effectiveness of our algorithm.
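
The distinction between high frequency and high utility patterns can be illustrated with a minimal sketch. It is not the clustering algorithm of the paper. It assumes the utility model commonly used in high utility itemset mining, where the utility of an item in a transaction is its quantity multiplied by a per-unit utility such as profit. The toy transactions, the item names, the unit utilities, and the helper functions `support` and `utility` are all hypothetical.

```python
from typing import Dict, Iterable, List

# Hypothetical toy data: each transaction maps an item to its purchased quantity.
# Transactions have varying length, as in the transactional data model.
transactions: List[Dict[str, int]] = [
    {"bread": 2, "milk": 1},
    {"bread": 1, "milk": 1, "caviar": 1},
    {"bread": 3},
    {"caviar": 2, "wine": 1},
    {"bread": 1, "milk": 2},
]

# Hypothetical per-unit utility (for example, profit) of each item.
unit_utility: Dict[str, float] = {
    "bread": 1.0,
    "milk": 2.0,
    "caviar": 50.0,
    "wine": 20.0,
}


def support(itemset: Iterable[str], txns: List[Dict[str, int]]) -> int:
    """Count the transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in txns if items.issubset(t))


def utility(
    itemset: Iterable[str],
    txns: List[Dict[str, int]],
    unit: Dict[str, float],
) -> float:
    """Sum quantity * unit utility of the itemset over the transactions containing it."""
    items = set(itemset)
    total = 0.0
    for t in txns:
        if items.issubset(t):
            total += sum(t[i] * unit[i] for i in items)
    return total


for pattern in ({"bread", "milk"}, {"caviar", "wine"}):
    print(sorted(pattern), support(pattern, transactions),
          utility(pattern, transactions, unit_utility))
```

In this toy data the pattern {bread, milk} occurs in three transactions but carries a utility of only 12, while {caviar, wine} occurs once and carries a utility of 120. A frequency-driven criterion would favor the first pattern, whereas a utility-driven one would favor the second.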

#### Paper Citation

#### in Harvard Style

Lakhawat P., Mishra M. and Somani A. (2016). **A Novel Clustering Algorithm to Capture Utility Information in Transactional Data**. In *Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)*, ISBN 978-989-758-203-5, pages 456-462. DOI: 10.5220/0006092104560462

#### in Bibtex Style

@conference{kdir16,

author={Piyush Lakhawat and Mayank Mishra and Arun Somani},

title={A Novel Clustering Algorithm to Capture Utility Information in Transactional Data},

booktitle={Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)},

year={2016},

pages={456-462},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0006092104560462},

isbn={978-989-758-203-5},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the 8th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management - Volume 1: KDIR, (IC3K 2016)

TI - A Novel Clustering Algorithm to Capture Utility Information in Transactional Data

SN - 978-989-758-203-5

AU - Lakhawat P.

AU - Mishra M.

AU - Somani A.

PY - 2016

SP - 456

EP - 462

DO - 10.5220/0006092104560462