studied in different applications, but most of the
attention has gone to non-arbitrary shape clustering
such as k-means (Mohammadi et al., 2014)(Gaddam et
al., 2007). Arbitrary shape clustering methods
preserve all samples of each cluster, and to find the
closest cluster to a new sample, the distance from the
new sample to every cluster member is calculated.
Such an approach is clearly time consuming. The
other way to find the closest cluster
is to create a boundary for each cluster. If the new
sample is inside the boundary of a cluster then the
new sample belongs to that cluster. Finding the
boundary of arbitrary shape clusters, especially in
high-dimensional problems, is a complex and
time-consuming process. Moreover, merely storing the
border of a cluster as a convex hull in higher
dimensions requires keeping a very large number of
faces, and this number grows exponentially with
dimension (Kersting et al., 2010)(Hershberger, 2009).
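The cost of the first option, keeping every member and scanning all of
them for each new sample, can be illustrated with a minimal sketch. The
function name `nearest_cluster` and the toy data are hypothetical and
are not taken from the paper; the point is only that every stored
member is visited for every query.

```python
import math

def nearest_cluster(sample, clusters):
    """Return (label, distance) of the cluster whose closest stored
    member is nearest to `sample`.

    `clusters` maps a label to the list of ALL member points kept for
    that cluster, so one query costs one distance per stored point.
    """
    best_label, best_dist = None, math.inf
    for label, members in clusters.items():
        for member in members:  # every stored member is visited
            d = math.dist(sample, member)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist

# Hypothetical toy data: two small 2-D clusters.
clusters = {
    "A": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "B": [(0.0, 5.0), (1.0, 5.0), (2.0, 5.0)],
}
label, dist = nearest_cluster((1.0, 1.0), clusters)
print(label, dist)
```

With N stored points in total, each query needs N distance
computations, which is what motivates a compact summary of each
cluster.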
In this paper, we propose a new approach that fulfils
the mentioned requirements: a summarization method
that represents arbitrary shape clusters using a
Gaussian Mixture Model (GMM). We first find the core
objects of each cluster and then take these core
objects as the centres of the GMM, so that each
cluster is represented by a GMM. Since the GMM-based
method keeps all statistical information of each
cluster, it summarizes
each cluster in a way that we can use it for pattern
extraction, pattern matching, and pattern merging.
Moreover, this model is able to classify new objects.
Each new test sample is fed into the GMM of a
cluster, and if its membership probability for that
cluster exceeds a threshold, the object is assigned
to that cluster.
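The idea of placing mixture components on core objects and thresholding
the resulting score can be sketched as follows. This is only an
illustration under stated assumptions, not the algorithm of this paper:
it assumes equally weighted components, a single shared isotropic
standard deviation `sigma`, and it uses the mixture density as the
membership score. The core objects, `sigma` and `threshold` values are
hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of an isotropic Gaussian N(mean, sigma^2 I) at x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def gmm_density(x, core_objects, sigma):
    """Mixture density with one component per core object.

    Assumption: all components share the same weight and sigma.
    """
    weight = 1.0 / len(core_objects)
    return sum(weight * gaussian_pdf(x, c, sigma) for c in core_objects)

def assign(x, cluster_cores, sigma, threshold):
    """Return the best-scoring cluster label, or None when even the
    best score does not exceed `threshold`."""
    scores = {label: gmm_density(x, cores, sigma)
              for label, cores in cluster_cores.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

# Hypothetical summaries: an L-shaped cluster and a compact blob.
cluster_cores = {
    "L": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)],
    "blob": [(6.0, 6.0)],
}
print(assign((2.0, 1.5), cluster_cores, sigma=0.5, threshold=0.01))
print(assign((10.0, -10.0), cluster_cores, sigma=0.5, threshold=0.01))
```

Only the core objects and the component parameters are stored, so the
cost of scoring a new sample depends on the number of core objects
rather than on the number of cluster members.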
The structure of the paper is as follows: In Section 2,
we review related work on arbitrary shape clusters
and summarization approaches. In Section 3, we
explain the general structure of the proposed
algorithm for summarization. In Section 4, we
present some discussions about the features of the
proposed method. In Section 5, we explain the
complexity of the algorithm in more detail. Section 6
presents the experimental results of the proposed
algorithm in comparison with well-known
summarization algorithms. Finally, the conclusion
and future work are presented in Section 7.
2 RELATED WORK
There are various algorithms available for clustering,
which are categorized into four groups: partition-based,
hierarchical, density-based and grid-based
clustering (Han, 2006). K-means is one of the
best-known algorithms in the area of partition-based
clustering. However, using a centre and radius
makes the shape of clusters spherical, which is
undesirable in many applications. In hierarchical
clustering methods such as Chameleon, data is
clustered in hierarchical form but still with a
spherical shape, which is undesirable. Moreover, tuning
the parameters of methods like Chameleon is still
difficult (Karypis et al., 1999). Grid-based methods
such as STING (Wang et al., 1997) and CLIQUE (Agrawal
et al., 1998) are able to create arbitrary shape
clusters, but their major drawback is the
complexity of creating an efficient grid. The grid
size varies across dimensions, and both setting
different grid sizes and merging grid cells to find
clusters are difficult. These difficulties make the
algorithms inaccurate in many cases. In the area of
arbitrary shape clustering, density-based methods
are of particular interest, and DBSCAN (Ester et al.,
1996) and DENCLUE (Hinneburg et al., 1998) are
the best known. In density-based methods,
clusters are created using the concept of connecting
dense regions to find arbitrary shape clusters. Given
the prevalence of real-time applications, there is
growing interest in making these algorithms fast for
streaming
applications (Guha et al., 2003)(Bifet et al.,
2009)(Charu et al., 2003).
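The notion of connecting dense regions can be illustrated with a
simplified, DBSCAN-style sketch. It is not a full DBSCAN implementation:
it only detects core objects (points with at least `min_pts` points,
themselves included, within distance `eps`) and groups core objects that
are linked through chains of `eps`-neighbours; border points are not
attached. The function names, parameter values and data are
hypothetical.

```python
import math

def find_core_objects(points, eps, min_pts):
    """Return points having at least `min_pts` points (the point itself
    included) within distance `eps`."""
    cores = []
    for p in points:
        neighbours = sum(1 for q in points if math.dist(p, q) <= eps)
        if neighbours >= min_pts:
            cores.append(p)
    return cores

def connect_cores(cores, eps):
    """Group indices of core objects linked by chains of eps-neighbours."""
    groups, unvisited = [], set(range(len(cores)))
    while unvisited:
        start = unvisited.pop()
        group, frontier = [start], [start]
        while frontier:
            i = frontier.pop()
            linked = [j for j in unvisited
                      if math.dist(cores[i], cores[j]) <= eps]
            for j in linked:
                unvisited.remove(j)
            group.extend(linked)
            frontier.extend(linked)
        groups.append(sorted(group))
    return sorted(groups)

# Hypothetical data: two elongated chains and one isolated noise point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
          (10.0, 10.0), (11.0, 10.0), (12.0, 10.0),
          (50.0, 50.0)]
cores = find_core_objects(points, eps=1.1, min_pts=2)
groups = connect_cores(cores, eps=1.1)
print(len(cores), groups)
```

Each chain is recovered as one group even though its end points are far
apart, which a single centre and radius could not express, and the
isolated point is not a core object.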
Summarization is the solution to ease the complexity
of arbitrary shape clustering methods. The naïve
way to represent an arbitrary shape cluster is to
represent each cluster with all cluster members.
Obviously, this approach is neither practical nor
does it reflect the cluster properties. In k-means, a
simple representation using a centre and radius
summarizes the cluster. This summarization clearly
does not capture how the data is
distributed in the cluster.
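A small worked example may make the limitation of a centre and radius
summary concrete. The ring-shaped toy cluster and the helper names below
are hypothetical; the sketch simply computes the mean as the centre and
the largest member distance as the radius.

```python
import math

# Hypothetical toy cluster: 12 points evenly spaced on a ring of radius 5.
ring = [(5.0 * math.cos(2.0 * math.pi * k / 12),
         5.0 * math.sin(2.0 * math.pi * k / 12)) for k in range(12)]

def centre_radius_summary(points):
    """Summarize a cluster by its mean and its largest member distance."""
    n, dim = len(points), len(points[0])
    centre = tuple(sum(p[i] for p in points) / n for i in range(dim))
    radius = max(math.dist(centre, p) for p in points)
    return centre, radius

def inside_summary(x, centre, radius):
    """Membership test implied by the centre and radius summary."""
    return math.dist(x, centre) <= radius

centre, radius = centre_radius_summary(ring)
# Distance from the centre to the nearest actual member of the cluster.
gap = min(math.dist(centre, p) for p in ring)
print(radius, gap, inside_summary(centre, centre, radius))
```

Here the summary accepts the centre of the ring as a member, although
the nearest real member is about one full radius away: the pair (centre,
radius) describes a filled disc and says nothing about where the data
actually lies inside it.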
There are different ways to summarize arbitrary
shape clusters (Yang et al., 2011)(Cao et al.,
2006)(Chaoji et al., 2011). These algorithms use the
general idea behind the clustering methods for
arbitrary shape clusters. In summarization, the idea
is to detect dense regions and summarize them using
core objects. Then, a set of suitable features is
chosen to describe the dense regions and their
connectivity. In (Yang et al., 2011), a grid is
created for each cluster and, based on the idea of
connecting dense regions, the core (dense) cells are
kept together with their connections and related
features. In all summarization approaches, these
features play a crucial role. In (Yang et al., 2011),
the location, the range of values and the status
connection vector are kept; however, it has some
KDIR 2014 - International Conference on Knowledge Discovery and Information Retrieval