Density-based Clustering using Automatic Density Peak Detection

Huanqian Yan, Yonggang Lu and Heng Ma

School of Information Science and Engineering, Lanzhou University, Lanzhou, Gansu 730000, China

Keywords: Clustering, Pattern Recognition, Decision Graph, Image Segmentation.

Abstract: Clustering is an important unsupervised machine learning method which has played an important role in

various fields. Density-based clustering methods are capable of dealing with clusters of different sizes and

shapes. As suggested by Alex Rodriguez et al. in a paper published in Science in 2014, the 2D decision

graph of the estimated density value versus the minimum distance from the points with higher density

values for all the data points can be used to identify the cluster centroids. However, automatic
methods for determining the cluster centroids from the decision graph are lacking. In this work, a novel
statistic-based method is designed to identify the cluster centroids automatically from the decision graph, so
the number of clusters is also determined automatically. Experiments on several synthetic and real-world

datasets show the superiority of the proposed method in centroid identification from the datasets with

various distributions and dimensionalities. Furthermore, it is also shown that the proposed method can be

effectively applied to image segmentation.

1 INTRODUCTION

Clustering is the process of grouping a set of data

objects into multiple groups or clusters so that

objects within a cluster have high similarity, but are

very dissimilar to objects in other clusters.

Dissimilarities or similarities are assessed based on

the attribute values describing the objects using

certain distance measures (Law, Urtasun, and Zemel,

2017). Clustering is an important technique for

exploratory data analysis, and has been studied for

many years. It has been shown to be useful in many

practical domains such as data classification and

image processing (Piotr, 2012).

Clustering is generally considered as a difficult

problem because the optimal number of clusters

cannot be easily determined and clusters may have

different distributions, shapes and sizes (Lu and

Wan, 2012). It has been shown that clustering is a

nonconvex, discrete optimization problem. Due to

the existence of many local minima, there is

typically no way to find a globally minimal solution

without trying all possible partitions (Kleinberg,

2003). Although many heuristic methods have been

developed, most of them are not generic enough and

can only be used for particular clustering problems.

Most clustering algorithms are based on two popular

techniques known as hierarchical and partitional

clustering. The partitional clustering algorithms

include square-error-based clustering methods,

density-based clustering methods, distribution-based

clustering methods and so on.

Hierarchical methods can be classified as either agglomerative or divisive, based on how the hierarchical decomposition of the given set of data objects is formed (Grant and Flynn, 2016; Charikar and Chatziafratis, 2017). Hierarchical clustering methods do not require strict initial conditions, but they suffer from the limitation that a previous merge or split cannot be undone in later steps.

Square-error-based clustering methods, such as k-means (Wagstaff et al., 2001), k-medoids (Kaufman and Rousseeuw, 2009), and affinity propagation (Frey and Dueck, 2007; Serdah and Ashour, 2016), optimize an objective function, typically the sum of the distances to a set of putative cluster centers, until the best cluster center candidates are found (Serdah and Ashour, 2016; Ward, 1963; Hoppner, 1999; Jain, 2010). However, because k-means and k-medoids always assign a data point to the nearest center, they cannot be used to detect non-globular clusters (Jain, 2010). Affinity propagation may fail to work properly when the initial exemplar preference is chosen improperly.

Most square-error-based methods are greedy algorithms that depend on initial conditions and may converge to suboptimal solutions.

Yan, H., Lu, Y. and Ma, H.
Density-based Clustering using Automatic Density Peak Detection.
DOI: 10.5220/0006572300950102
In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 95-102
ISBN: 978-989-758-276-9

Unlike square-error-based clustering, density-

based clustering can detect clusters with arbitrary

shapes or sizes. The most popular density-based

clustering methods include DBSCAN (Ester et al.,

1996), mean-shift (Fukunaga et al., 1975), OPTICS

(Campello et al., 2015), and DENCLUE (Campello

et al., 2015). A drawback of these methods is that setting their parameters is not straightforward and is left to the user. An excellent density-based clustering method, Clustering by Fast Search and Find of Density Peaks (CFSFDP), was published in Science by Rodriguez and Laio (2014). It is based on the simple idea that a cluster centroid has a higher density value than its neighbors and is far away from the other objects with higher density values. CFSFDP can predict the number of clusters by identifying the cluster centroids in a 2D decision graph whose axes are the density value and the minimum distance from the points with higher density values, respectively. However, the cluster centroids in the decision graph must be selected manually.

To address this issue, we propose a novel

clustering method called Automatic Density Peak

Detection (ADPD) in this paper. A new outlier

detection method is designed to identify the cluster

centroids automatically from the decision graph

using a statistics-based nonparametric density

estimation. This method can identify clusters with

arbitrary shapes or sizes and can determine the

number of clusters automatically.

The rest of the paper is organized as follows. The original CFSFDP method is introduced in Section 2. The proposed ADPD algorithm is described in Section 3. The experimental results are presented in Section 4, and conclusions are drawn in Section 5.

2 THE CFSFDP METHOD

The CFSFDP method (Rodriguez and Laio, 2014) generates clusters by assigning each data point to the same cluster as its nearest neighbor with higher density, after the cluster centroids have been selected by the user. A heuristic tool called the decision graph is designed to help select these centroids. For each data point $i$, two important indicators are considered in the decision graph: the local density $\rho_i$ and the minimum distance $\delta_i$ from points of higher density values. Their definitions are:

Local Density $\rho_i$: The local density $\rho_i$ of point $i$ is defined as

$$\rho_i = \sum_{j \neq i} \chi(d_{ij} - d_c) \tag{1}$$

where $\chi(\cdot)$ is a kernel function, $d_{ij}$ is the distance between point $i$ and point $j$, and $d_c$ is the cutoff distance threshold. In the CFSFDP method, $d_c$ is a parameter which needs to be determined manually. In our experiments, the Gaussian kernel function is used, so the local density $\rho_i$ is defined as:

$$\rho_i = \sum_{j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right)$$

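Minimum Distance $\delta_i$: The minimum distance $\delta_i$ of point $i$ is measured by computing the minimum distance between point $i$ and any other point of higher density value:

$$\delta_i = \begin{cases} \min\limits_{j:\,\rho_j > \rho_i} d_{ij}, & \text{if } \exists\, j \text{ such that } \rho_j > \rho_i \\ \max\limits_{j} d_{ij}, & \text{otherwise} \end{cases} \tag{2}$$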
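The two indicators can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes Euclidean distances, the Gaussian kernel $\exp(-d_{ij}^2/d_c^2)$, and the convention that the point of highest density receives its largest distance to any other point. The function names `pairwise_distances`, `local_density` and `min_distance` are hypothetical.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix d_ij for the rows of X (assumed metric)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def local_density(D, d_c):
    """Gaussian-kernel local density: rho_i = sum_{j != i} exp(-(d_ij / d_c)^2)."""
    K = np.exp(-(D / d_c) ** 2)
    np.fill_diagonal(K, 0.0)  # a point does not count as its own neighbor
    return K.sum(axis=1)

def min_distance(D, rho):
    """Return delta_i and the index of the nearest point of higher density.

    A point with no higher-density point gets delta_i = max_j d_ij
    and the index -1.
    """
    n = len(rho)
    delta = np.empty(n)
    nearest_higher = np.full(n, -1)
    for i in range(n):
        higher = np.flatnonzero(rho > rho[i])
        if higher.size == 0:
            delta[i] = D[i].max()
        else:
            j = higher[np.argmin(D[i, higher])]
            delta[i] = D[i, j]
            nearest_higher[i] = j
    return delta, nearest_higher
```

Plotting `delta` against `rho` gives the decision graph. The loop is quadratic in the number of points, which is acceptable for datasets of a few thousand points.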

The value $\delta_i$ is much larger than the typical distance between nearest neighbors if point $i$ is a local or global maximum of the density. This observation, which is the core of the algorithm, is illustrated by an example in Figure 1. Figure 1A shows 30 points from two normal distributions. Figure 1B is the decision graph, which plots $\delta_i$ as a function of $\rho_i$ for each point. From the decision graph, the two points having high local density values and large minimum distances can be easily identified. These two points are identified as cluster centroids and are shown as a filled triangle and a filled square in both Figure 1A and Figure 1B.

Figure 1: (A) Points distribution in a 2D space. (B)

Decision graph for the data in (A).
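Once the centroids are chosen, the assignment step described at the start of this section (each point joins the cluster of its nearest neighbor with higher density) can be sketched as follows. The function name `assign_clusters` and the array `nearest_higher`, which holds for each point the index of its nearest higher-density point or -1 if none exists, are hypothetical names introduced for illustration.

```python
import numpy as np

def assign_clusters(rho, nearest_higher, centroids):
    """Label every point with the cluster of its nearest higher-density neighbor."""
    rho = np.asarray(rho, dtype=float)
    labels = np.full(len(rho), -1)
    labels[centroids] = np.arange(len(centroids))
    # Visit points from highest to lowest density, so the neighbor a point
    # inherits its label from has always been visited already.
    for i in np.argsort(-rho):
        if labels[i] == -1 and nearest_higher[i] >= 0:
            labels[i] = labels[nearest_higher[i]]
    return labels
```

A single pass suffices because labels only propagate from denser points to sparser ones. If the point of highest density is not among the centroids, it and the points that inherit from it keep the label -1 in this sketch.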


3 IDENTIFYING CENTROIDS FROM THE DECISION GRAPH AUTOMATICALLY

As shown in Figure 1B, the cluster centroids are usually the points that have large $\delta_i$ values and relatively high $\rho_i$ values. A simple threshold-based method suggested by Rodriguez and Laio (2014) for selecting the cluster centroids is to use the following formula:

$$\gamma_i = \rho_i \delta_i > \gamma_c \tag{3}$$

where the threshold parameter $\gamma_c$ has to be decided by users. The drawbacks of the method are that it does not use the distribution information of the points in the decision graph and that the parameter $\gamma_c$ cannot be easily determined for different datasets.
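For comparison, a threshold rule of this kind takes only a few lines. The sketch below assumes the rule is the product of density and minimum distance compared against a user-chosen value, as suggested by Rodriguez and Laio; the names `select_by_gamma` and `gamma_threshold` are hypothetical.

```python
import numpy as np

def select_by_gamma(rho, delta, gamma_threshold):
    """Return indices of points whose product rho_i * delta_i exceeds the threshold."""
    gamma = np.asarray(rho, dtype=float) * np.asarray(delta, dtype=float)
    return np.flatnonzero(gamma > gamma_threshold)

# The selected set depends entirely on the hand-picked threshold.
rho = [10.0, 1.0, 8.0, 2.0]
delta = [5.0, 0.1, 4.0, 0.2]
print(select_by_gamma(rho, delta, gamma_threshold=10.0))
print(select_by_gamma(rho, delta, gamma_threshold=40.0))
```

The two calls select different numbers of centroids from the same decision graph, which is the sensitivity the drawback above refers to.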

To deal with the drawbacks of the above method, a statistic-based method for selecting the cluster centroids is developed based on the following observation: the value $\delta_i$ is much larger than the typical distances between nearest neighbors if point $i$ has a local or global maximum density value. Thus, an important feature for identifying a cluster centroid is that its $\delta_i$ value is anomalously large. In our method, cluster centroids are therefore identified using a specially designed outlier detection method which contains three main steps: first, the probability density $P(\rho_i, \delta)$ in the decision graph at a specific density value $\rho_i$ and an arbitrary distance value $\delta$ is estimated; second, the expectation value and the variance of the distance $\delta$ are computed at the specific $\rho_i$ value using the probability density $P(\rho_i, \delta)$; third, the cluster centroids are identified using the expectation value and the variance of the distance $\delta$.

A two-dimensional Gaussian function is used to estimate the probability density at the specific $\rho_i$ in the decision graph, which is given by:

$$P(\rho_i, \delta) = \frac{\sum_{j=1}^{N} \exp\left(-\frac{(\rho_i - \rho_j)^2}{2\sigma_\rho^2} - \frac{(\delta - \delta_j)^2}{2\sigma_\delta^2}\right)}{\sum_{j=1}^{N} \int \exp\left(-\frac{(\rho_i - \rho_j)^2}{2\sigma_\rho^2} - \frac{(\delta' - \delta_j)^2}{2\sigma_\delta^2}\right) d\delta'} \tag{4}$$

where $N$ is the total number of the data points, and $\sigma_\rho$ and $\sigma_\delta$ are the 2D kernel widths. The denominator is a normalization factor which is used to ensure that $\int P(\rho_i, \delta)\, d\delta = 1$.

The selection of the values for the 2D kernel widths $\sigma_\rho$ and $\sigma_\delta$ is important. It is found that $\sigma_\rho$ and $\sigma_\delta$ can be estimated using the standard deviations of $\rho_i$ and $\delta_i$ of all the data points:

$$\sigma_\rho = \lambda_\rho \cdot \mathrm{std}(\rho), \qquad \sigma_\delta = \lambda_\delta \cdot \mathrm{std}(\delta) \tag{5}$$

The selection of the parameters $\lambda_\rho$ and $\lambda_\delta$ will be discussed in Subsection 4.2.

Using the probability density defined in (4), the expectation value and the variance of the distance $\delta$ at the specific $\rho_i$ can be computed using:

$$E(\delta \mid \rho_i) = \int \delta \, P(\rho_i, \delta)\, d\delta \tag{6}$$

$$\mathrm{Var}(\delta \mid \rho_i) = \int \bigl(\delta - E(\delta \mid \rho_i)\bigr)^2 \, P(\rho_i, \delta)\, d\delta \tag{7}$$

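By substituting (4) into (6) and (7), it follows that:

$$E(\delta \mid \rho_i) = \frac{\sum_{j=1}^{N} \exp\left(-\frac{(\rho_i - \rho_j)^2}{2\sigma_\rho^2}\right) \delta_j}{\sum_{j=1}^{N} \exp\left(-\frac{(\rho_i - \rho_j)^2}{2\sigma_\rho^2}\right)} \tag{8}$$

$$\mathrm{Var}(\delta \mid \rho_i) = \frac{\sum_{j=1}^{N} \exp\left(-\frac{(\rho_i - \rho_j)^2}{2\sigma_\rho^2}\right) \left(\delta_j^2 + \sigma_\delta^2\right)}{\sum_{j=1}^{N} \exp\left(-\frac{(\rho_i - \rho_j)^2}{2\sigma_\rho^2}\right)} - E(\delta \mid \rho_i)^2 \tag{9}$$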
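The closed-form sums can be evaluated for all points at once. The sketch below assumes a Gaussian kernel with exponent $-(\cdot)^2 / (2\sigma^2)$ in both dimensions, under which the integral over $\delta$ reduces to a weighted average of the $\delta_j$ with weights that depend only on the density differences. The function name `conditional_moments` and the argument names `sigma_rho` and `sigma_delta` are hypothetical.

```python
import numpy as np

def conditional_moments(rho, delta, sigma_rho, sigma_delta):
    """Expectation and variance of delta at each rho_i from weighted sums.

    Assumes w_ij = exp(-(rho_i - rho_j)^2 / (2 * sigma_rho^2)) and a Gaussian
    of width sigma_delta along the delta axis.
    """
    rho = np.asarray(rho, dtype=float)
    delta = np.asarray(delta, dtype=float)
    w = np.exp(-((rho[:, None] - rho[None, :]) ** 2) / (2.0 * sigma_rho ** 2))
    w /= w.sum(axis=1, keepdims=True)           # normalize each row to sum to 1
    mean = w @ delta                            # weighted mean of delta_j
    second_moment = w @ (delta ** 2) + sigma_delta ** 2
    var = second_moment - mean ** 2
    return mean, var
```

Each point contributes to its own estimate with weight one before normalization, so the variance is never smaller than `sigma_delta ** 2`. No numerical integration is needed, and the cost is one N-by-N weight matrix.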

Using (8) and (9), the expectation value and the variance at the specific $\rho_i$ can be easily computed using summations instead of the integrations in (6) and (7). Then, the outliers are identified using the following threshold:

























   











(10)

For any point $i$, if its minimum distance $\delta_i$ exceeds the threshold in (10) evaluated at $\rho_i$, it is identified as an outlier, and thus is used as a cluster centroid in our experiments.
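The outlier test can be sketched as below. The exact form of the threshold is an assumption here: the sketch uses the common rule "expectation plus k standard deviations", and both that form and the default multiplier `k=3.0` are placeholders for illustration, not values taken from (10). The function name `detect_centroids` is also hypothetical.

```python
import numpy as np

def detect_centroids(rho, delta, sigma_rho, sigma_delta, k=3.0):
    """Flag points whose delta_i is anomalously large for their density rho_i.

    k is a hypothetical multiplier; the rule mean + k * std is an assumed
    stand-in for the threshold used in the text.
    """
    rho = np.asarray(rho, dtype=float)
    delta = np.asarray(delta, dtype=float)
    # Gaussian weights in the density dimension, one row per query point.
    w = np.exp(-((rho[:, None] - rho[None, :]) ** 2) / (2.0 * sigma_rho ** 2))
    w /= w.sum(axis=1, keepdims=True)
    mean = w @ delta
    var = w @ (delta ** 2) + sigma_delta ** 2 - mean ** 2
    threshold = mean + k * np.sqrt(var)
    return np.flatnonzero(delta > threshold)
```

The number of returned indices is the number of clusters, so no count has to be supplied by the user. Because each point contributes to its own mean and variance, a point with few neighbors in density may not exceed a large multiplier, which is one reason to treat `k` as a tunable assumption in this sketch.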

The process and result of identifying the cluster centroids using the proposed method are shown in Figure 2, where the data are those of Figure 1A.

Figure 2: (A) The process of identifying cluster centroids

in the decision graph. (B) The result of clustering,

different colors correspond to different clusters.


Using the threshold defined in (10), two points, represented as a filled triangle and a filled square, are identified as the cluster centroids.

4 EXPERIMENTAL RESULTS

Six synthetic datasets and eight real-world datasets are used in the experiments. Two synthetic datasets, called Dataset A and Dataset B, were generated by ourselves and are shown in Figure 3, where different colors represent different classes. The other four synthetic datasets, Aggregation, Flame, Spiral, and R15, were downloaded from the internet and are listed in Table 1. Eight UCI real-world datasets, Iris, Breast cancer (Wisconsin), Glass Identification, Wine Quality-red, Liver Disorders, Seeds, Banknote authentication, and Ecoli, are also listed in Table 1.

Figure 3: Dataset A: Three 2D normal distributions of the same size. Dataset B: Four 2D normal distributions of different sizes.

4.1 Evaluation Criterion

Because ground truth cluster labels are available for all the datasets, the Fowlkes-Mallows index (FM-index) is used to evaluate the clustering result (Fowlkes and Mallows, 1983), which is defined as:

$$FM = \sqrt{\frac{TP}{TP + FP} \cdot \frac{TP}{TP + FN}} \tag{11}$$

where TP is the number of true positives, FP is the

number of false positives, and FN is the number of

false negatives. A higher value for the FM-index

indicates a greater similarity between the clusters

and the ground truth.
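The FM-index counts pairs of points rather than individual points. A minimal sketch, assuming the usual pair-counting definitions (a pair is a true positive when both labelings put its two points in the same cluster), is given below; the function name `fowlkes_mallows` is chosen for illustration.

```python
import numpy as np

def fowlkes_mallows(labels_true, labels_pred):
    """FM-index computed by counting pairs of points."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    same_true = labels_true[:, None] == labels_true[None, :]
    same_pred = labels_pred[:, None] == labels_pred[None, :]
    # Keep each unordered pair once (strict upper triangle).
    upper = np.triu(np.ones_like(same_true, dtype=bool), k=1)
    tp = np.sum(same_true & same_pred & upper)    # together in both
    fp = np.sum(~same_true & same_pred & upper)   # together only in prediction
    fn = np.sum(same_true & ~same_pred & upper)   # together only in ground truth
    if tp == 0:
        return 0.0
    return float(np.sqrt(tp / (tp + fp) * tp / (tp + fn)))
```

The index does not depend on how clusters are numbered, so a perfect clustering with permuted label values still scores 1.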

4.2 Parameter Selection

In the CFSFDP method, the parameter $d_c$ must be determined before computing the density values. It can be chosen under the condition that the average number of neighbors is around 1% to 2% of the total number of the points, as suggested by Rodriguez and Laio (2014). In our experiments, it is found that 4% is a better choice. So, in our experiments, the parameter $d_c$ is determined with the condition that the average number of neighbors is around 4% of the total number of the points.
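One way to meet this condition is to take the cutoff distance as a quantile of the sorted pairwise distances. The sketch below follows that idea; the function name `choose_cutoff`, the `percent` argument and the choice of "within the cutoff distance" as the definition of a neighbor are assumptions made for illustration.

```python
import numpy as np

def choose_cutoff(D, percent=4.0):
    """Pick d_c so that about `percent` % of all point pairs lie within d_c.

    D is the symmetric pairwise distance matrix. With this choice the average
    number of neighbors per point is about percent/100 * (N - 1).
    """
    n = D.shape[0]
    pair_d = np.sort(D[np.triu_indices(n, k=1)])   # each pair counted once
    position = int(round(len(pair_d) * percent / 100.0))
    position = min(max(position, 1), len(pair_d))  # keep the index in range
    return pair_d[position - 1]
```

For example, with 100 points and `percent=4.0`, the 198 closest of the 4950 pairs fall within the cutoff, giving an average of 3.96 neighbors per point.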

In our method, the parameters



and



defined in

(5) have to be determined. To decide the value of



we first set



=0.5, then different values of



are used

to compute 



and identify the cluster centroids.

The clustering results of three datasets including Iris,

Seeds and Dataset B are shown in Figure 4A when

different



values are used. It can be seen from

Figure 4A that the clustering results are not sensitive

Table 1: Details of datasets in our experiments.

Dataset

Source

Aggregation

788

http://cs.joensuu.fi/sipu/datasets/

Flame

240

http://cs.joensuu.fi/sipu/datasets/

R15

600

http://cs.joensuu.fi/sipu/datasets/

Spiral

312

http://cs.joensuu.fi/sipu/datasets/

Iris

150