# Arbitrary Shape Cluster Summarization with Gaussian Mixture Model

### Elnaz Bigdeli, Mahdi Mohammadi, Bijan Raahemi, Stan Matwin

#### Abstract

A central concern in arbitrary shape clustering is how to summarize clusters. The most accurate representation of an arbitrarily shaped cluster is the full set of its members, but this approach is neither practical nor efficient. In many applications, such as stream data mining, preserving all samples for a long period while thousands of new samples keep arriving is not feasible. Moreover, in the absence of labelled data, clusters serve as representatives of classes, and for arbitrary shape clusters, finding the cluster closest to a new incoming sample by comparing it against all objects of every cluster is neither accurate nor efficient. In this paper, we present a new algorithm for summarizing arbitrary shape clusters. The proposed method, called SGMM, summarizes a cluster using a set of core objects and then represents each cluster with a corresponding Gaussian Mixture Model (GMM). Using the GMM, the cluster closest to a new test sample is identified at low computational cost. We compared the proposed method with ABACUS, a well-known algorithm, in terms of time, space, and accuracy for both categorization and summarization purposes. The experimental results confirm that the proposed method outperforms ABACUS on various datasets, including synthetic and real datasets.
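The idea of summarizing a cluster by core objects and a mixture model can be illustrated with a minimal sketch. This is not the SGMM implementation: the abstract does not say how core objects are selected or how the mixture parameters are estimated, so everything below is an assumption made for illustration. The core objects are supplied by hand, each one becomes a single spherical Gaussian component fitted to the points nearest to it, and the function names `summarize_cluster`, `log_likelihood`, and `closest_cluster` are hypothetical.

```python
import math


def summarize_cluster(points, core_objects):
    """Summarize one cluster as a mixture of spherical Gaussians.

    Assumption for illustration: each point is attached to its nearest
    core object, and every core object yields one component
    (weight, mean, variance).
    """
    dim = len(points[0])
    groups = [[] for _ in core_objects]
    for p in points:
        nearest = min(range(len(core_objects)),
                      key=lambda i: math.dist(p, core_objects[i]))
        groups[nearest].append(p)

    components = []
    for members in groups:
        if not members:
            continue  # a core object that attracted no points is dropped
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        # Mean squared distance per dimension, floored to avoid zero variance.
        var = sum(math.dist(p, mean) ** 2 for p in members) / (len(members) * dim)
        components.append((len(members) / len(points), mean, max(var, 1e-6)))
    return components


def log_likelihood(x, components):
    """Log density of x under a mixture of spherical Gaussians."""
    dim = len(x)
    terms = []
    for weight, mean, var in components:
        sq_dist = math.dist(x, mean) ** 2
        terms.append(math.log(weight)
                     - 0.5 * dim * math.log(2 * math.pi * var)
                     - 0.5 * sq_dist / var)
    top = max(terms)  # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))


def closest_cluster(x, summaries):
    """Return the label of the cluster whose mixture gives x the highest density."""
    return max(summaries, key=lambda label: log_likelihood(x, summaries[label]))


# Toy data: an L-shaped cluster "A" and a compact cluster "B".
cluster_a = ([(float(i), 0.0) for i in range(6)]
             + [(0.0, float(j)) for j in range(1, 6)])
cluster_b = [(8.0, 8.0), (8.5, 8.0), (8.0, 8.5), (8.5, 8.5)]

# Hand-picked core objects (hypothetical); only the summaries are kept afterwards.
summaries = {
    "A": summarize_cluster(cluster_a, [(1.0, 0.0), (4.0, 0.0), (0.0, 3.0)]),
    "B": summarize_cluster(cluster_b, [(8.25, 8.25)]),
}

print(closest_cluster((0.0, 4.5), summaries))
print(closest_cluster((8.2, 8.1), summaries))
```

In this toy setup, a new sample is scored against three components for cluster A and one for cluster B rather than against all fifteen stored points, which is the kind of saving in time and space that a mixture-based summary is meant to provide. The first query lies on the vertical arm of the L-shaped cluster and is assigned to "A"; the second lies inside the compact cluster and is assigned to "B".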

#### Paper Citation

#### in Harvard Style

Bigdeli E., Mohammadi M., Raahemi B. and Matwin S. (2014). **Arbitrary Shape Cluster Summarization with Gaussian Mixture Model**. In *Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)*, ISBN 978-989-758-048-2, pages 43-52. DOI: 10.5220/0005071500430052

#### in BibTeX Style

@conference{kdir14,

author={Elnaz Bigdeli and Mahdi Mohammadi and Bijan Raahemi and Stan Matwin},

title={Arbitrary Shape Cluster Summarization with Gaussian Mixture Model},

booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)},

year={2014},

pages={43-52},

publisher={SciTePress},

organization={INSTICC},

doi={10.5220/0005071500430052},

isbn={978-989-758-048-2},

}

#### in EndNote Style

TY - CONF

JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2014)

TI - Arbitrary Shape Cluster Summarization with Gaussian Mixture Model

SN - 978-989-758-048-2

AU - Bigdeli E.

AU - Mohammadi M.

AU - Raahemi B.

AU - Matwin S.

PY - 2014

SP - 43

EP - 52

DO - 10.5220/0005071500430052