Kernel Hierarchical Agglomerative Clustering
Comparison of Different Gap Statistics to Estimate the Number of Clusters
Na Li, Nicolas Lefebvre and Régis Lengellé
Charles Delaunay Institute - UMR CNRS 6281, Université de Technologie de Troyes, 12 Rue Marie Curie, Troyes, France
Keywords:
Hierarchical Agglomerative Clustering, Gap Statistics, Kernel Alignment, Number of Clusters.
Abstract:
Clustering algorithms, as unsupervised analysis tools, are useful for exploring data structure and have achieved
great success in many disciplines. For most clustering algorithms, like k-means, determining the number
of clusters is a crucial step and one of the most difficult problems. Hierarchical Agglomerative Clustering
(HAC) has the advantage of representing the data by a dendrogram, which allows clustering by cutting
the dendrogram at some optimal level. In recent years and within the context of HAC, efficient statistics have
been proposed to estimate the number of clusters, and the Gap Statistic of Tibshirani has shown interesting
performance. In this paper, we propose some new Gap Statistics to further improve the determination of the
number of clusters. Our work focuses on the kernelized version of the widely used Hierarchical Agglomerative
Clustering algorithm.
1 INTRODUCTION
Clustering is the task of grouping objects according to
some measured or perceived characteristics, and it has
achieved great success in exploring the hidden structure
of unlabeled data sets. As a useful tool for unsupervised
classification, it has drawn increasing attention in various
domains including psychology (Harman, 1960), biology
(Sneath et al., 1973) and computer security (Barbará and
Jajodia, 2002). Clustering algorithms have a long history:
they originated in anthropology with Driver and Kroeber
(Driver and Kroeber, 1932). In 1967, one of the most useful
and simple clustering algorithms, k-means (Mac Queen
et al., 1967), was proposed. Since then, many classical
algorithms, like fuzzy c-means (Bezdek et al., 1984) and
Hierarchical Agglomerative Clustering (HAC), have emerged.
Meanwhile, another family of methods, kernel-based
clustering, has arisen and achieved great success
because of its ability to perform linear tasks in
nonlinearly transformed spaces. In machine learning,
the kernel trick was first introduced by Aizerman
(Aizerman et al., 1964). It became famous with Support
Vector Machines (SVM), initially proposed by Cortes and
Vapnik (Cortes and Vapnik, 1995). SVMs have shown good
performance on many problems, and this success has brought
an extensive use of the kernel trick into other algorithms
like kernel PCA (Schölkopf et al., 1998) and nonlinear
(adaptive) filtering (Príncipe et al., 2011). Kernel
methods have been widely used in supervised classification
tasks like SVM and were then extended to unsupervised
classification. Many kernel-induced clustering algorithms
have emerged due to the extensive use of inner products.
Most of these algorithms are kernelized versions of the
corresponding conventional algorithms. Surveys of
kernel-induced methods for clustering can be found in
(Filippone et al., 2008; Kim et al., 2005; Muller et al.,
2001). The first proposed and most well-known kernel-induced
algorithm is kernel k-means by Schölkopf (Schölkopf et al.,
1998); a further version has been proposed by Girolami
(Girolami, 2002). After that, several kernel-induced
algorithms have emerged such as kernel fuzzy c-means, kernel
Self Organizing Maps and kernel average-linkage. Compared
with the corresponding conventional algorithms, kernelized
criteria have shown better performance, especially for
nonlinearly separable data sets.
According to our literature survey, little work has
been done on kernel-based HAC (see, e.g., (Qin et al.,
2003), (Kim et al., 2005)). The prominence of HAC
lies in the data description provided by a dendrogram,
which represents a tree of nested partitions of the data.
Hierarchical clustering usually depends on distance
calculations (to compute between-class and within-class
dispersions, minimum linkage, maximum linkage, etc.),
which are based on inner products.
In order to explore wider classes of classifiers, we
can perform HAC after mapping the input data into a
higher-dimensional space using a nonlinear transform.
This is one of the main ideas of kernel methods, where
the transformed space is selected as a Reproducing
Kernel Hilbert Space (RKHS) in which distance
calculations can be easily evaluated with the help of
the kernel trick (Aizerman et al., 1964).
In this paper we focus on the kernel-based HAC
framework, which is robust for clustering nonlinearly
separable data sets, and we propose several criteria,
inspired by the Gap Statistic of Tibshirani (Tibshirani
et al., 2001), to determine the number of clusters. This
paper is organized as follows: we start with an
introduction of kernel HAC. Then we introduce the
principle of Gap Statistics and our criteria to estimate
the number of clusters. Subsequently, we provide the
results of a simulation study and we compare the results
obtained with those of Tibshirani, in standard and kernel
HAC. We show that alternative Gap Statistics can be used
to estimate the number of clusters and that one of them,
the Delta Level Gap, is efficient and robust to variations
in the clustering procedure. Finally, we propose some
perspectives on this work in order to obtain a fully
automatic clustering algorithm.
2 KERNEL HIERARCHICAL AGGLOMERATIVE CLUSTERING (K-HAC)
2.1 Kernel Trick
The key notion in kernel-based algorithms is the kernel
trick. To introduce the kernel trick, we first recall
Mercer's theorem (Mercer, 1909). Let $X$ be the original
space. A kernel function $K : X \times X \to \mathbb{R}$ is
called a positive definite kernel (or Mercer kernel) if
and only if:

$K$ is symmetric: $K(x,y) = K(y,x)$ for all $(x,y) \in X \times X$

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j K(x_i, x_j) \ge 0, \quad \forall\, c_i \in \mathbb{R},\ i = 1,\dots,n,\ \forall\, n \ge 2$$
For each Mercer kernel we have:

$$K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle \qquad (1)$$

where $\Phi : X \to F$ performs the mapping from the original
space onto the (high dimensional) feature space. As shown in
equation (1), inner product calculations in the feature space
can be computed by a kernel function in the original space,
without explicitly specifying the mapping function $\Phi$.
The computation of the Euclidean distance in the feature
space $F$ benefits from this idea:
$$d^2(\Phi(x_i), \Phi(x_j)) = \|\Phi(x_i) - \Phi(x_j)\|^2 = K(x_i, x_i) + K(x_j, x_j) - 2 K(x_i, x_j) \qquad (2)$$
Several commonly used Mercer kernels are listed in
(Vapnik, 2000). In this paper, we only consider the
Gaussian kernel defined as follows:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \qquad (3)$$
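As an illustration, the following sketch (ours, not part of the original paper; it only assumes NumPy) computes the Gaussian Gram matrix of equation (3) and the feature-space squared distances of equation (2):

```python
import numpy as np

def gaussian_gram(X, sigma):
    """Gram matrix K with K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)), eq. (3)."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-sq_dists / (2.0 * sigma**2))

def feature_space_sq_dist(K):
    """Pairwise squared distances in the RKHS, eq. (2):
    d^2(Phi(x_i), Phi(x_j)) = K_ii + K_jj - 2 K_ij."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K
```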
2.2 Kernel-HAC
The advantage of HAC lies in the dendrogram it produces
(see an example in Figure 3), which gives a description of
the data structure. By cutting the dendrogram at a given
level, one can perform data clustering. The aim of
introducing the kernel trick in HAC is to easily explore
wider classes of similarity measures. Figures 1 and 2 show
an example of a data set that cannot be clustered by
standard HAC (Figure 1) while kernel HAC allows perfect
clustering (Figure 2).
The HAC algorithm steps are: 1) assign each data point to
be a singleton; 2) calculate some similarity/dissimilarity
between each pair of clusters; 3) merge the pairwise
closest clusters into one; 4) repeat the two previous steps
until only one final cluster is obtained.
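Under the assumption that the kernel-induced distances of equation (2) are fed to a standard agglomerative routine, a minimal kernel HAC sketch could look as follows (our illustration using SciPy; `gaussian_gram` refers to the earlier sketch; since RKHS distances are Euclidean distances in F, Ward's linkage remains applicable):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def kernel_hac(K, method="ward"):
    """Agglomerative clustering on feature-space distances derived from the Gram matrix K."""
    diag = np.diag(K)
    d2 = diag[:, None] + diag[None, :] - 2.0 * K      # eq. (2)
    d2 = np.maximum(d2, 0.0)                          # guard against numerical negatives
    np.fill_diagonal(d2, 0.0)
    condensed = squareform(np.sqrt(d2), checks=False)  # condensed distance vector
    return linkage(condensed, method=method)

# Example usage: cut the dendrogram to obtain 3 clusters.
# Z = kernel_hac(gaussian_gram(X, sigma=0.8))
# labels = fcluster(Z, t=3, criterion="maxclust")
```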
Different similarity measures have been proposed and
surveyed in (Murtagh, 1983; Olson, 1995). Examples of
linkage criteria are:

Single linkage: $d(r,s) = \min d(x_{ri}, x_{sj})$, $x_{ri} \in r$, $x_{sj} \in s$

Complete linkage: $d(r,s) = \max d(x_{ri}, x_{sj})$, $x_{ri} \in r$, $x_{sj} \in s$

Average linkage: $d(r,s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} d(x_{ir}, x_{js})$

Ward's linkage: $d^2(r,s) = \frac{n_r n_s \,\|\bar{x}_r - \bar{x}_s\|_2^2}{n_r + n_s}$
In this paper, we focus on the Gaussian kernel based HAC
using Ward's linkage criterion. In Ward's linkage, $\bar{x}_r$
denotes the centroid of cluster $r$,
$\bar{x}_r = \frac{1}{n_r}\sum_{i=1}^{n_r} x_i$. So in the
feature space $F$, we have:

$$d^2(r_\Phi, s_\Phi) = \frac{n_r n_s}{n_r + n_s}\left( \frac{1}{n_r^2}\sum_{i=1}^{n_r}\sum_{j=1}^{n_r} K(x_i, x_j) + \frac{1}{n_s^2}\sum_{i=1}^{n_s}\sum_{j=1}^{n_s} K(x_i, x_j) - \frac{2}{n_r n_s}\sum_{i=1}^{n_r}\sum_{j=1}^{n_s} K(x_i, x_j) \right) \qquad (4)$$
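For completeness, a direct evaluation of equation (4) from the Gram matrix might be sketched as follows (our code, with hypothetical index-set arguments for the two clusters):

```python
import numpy as np

def ward_distance_feature_space(K, idx_r, idx_s):
    """Squared Ward merging distance between clusters r and s in feature space, eq. (4),
    computed directly from the Gram matrix K and the members of each cluster."""
    n_r, n_s = len(idx_r), len(idx_s)
    within_r = K[np.ix_(idx_r, idx_r)].sum() / n_r**2
    within_s = K[np.ix_(idx_s, idx_s)].sum() / n_s**2
    between = 2.0 * K[np.ix_(idx_r, idx_s)].sum() / (n_r * n_s)
    return (n_r * n_s) / (n_r + n_s) * (within_r + within_s - between)
```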
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
256
Figure 1: Result using HAC.

Figure 2: Result using kernel HAC.

Figure 3: Dendrogram of this data set using kernel HAC.
As can be seen in Figures 1 and 2, kernel HAC allows
clustering of separable classes with small between-class
dispersion in the original space.
3 DETERMINING THE NUMBER OF CLUSTERS
Determining the number of clusters is one of the most
difficult problems in cluster analysis. For some algorithms
like k-means, the number of clusters needs to be provided
in advance. Over the past years, many methods have emerged,
and the Gap Statistic proposed by Tibshirani (Tibshirani
et al., 2001) is probably one of the most promising
approaches.
Earlier, Milligan and Cooper (Milligan and Cooper, 1985)
studied criteria for estimating the number of clusters,
reporting the results of a simulation experiment designed
to determine the validity of 30 criteria proposed in the
literature. Several other criteria have been proposed, like
Hartigan's rule (Hartigan, 1975), Krzanowski and Lai's index
(Krzanowski and Lai, 1988), the silhouette statistic
suggested by Kaufman and Rousseeuw (Kaufman and Rousseeuw,
2009) and Calinski and Harabasz's index (Caliński and
Harabasz, 1974), which demonstrated better performance under
most of the situations considered in Milligan and Cooper's
study. More recent methods for determining the number of
clusters include an approach using approximate Bayes factors
proposed by Fraley and Raftery (Fraley and Raftery, 1998)
and a jump method by Sugar and James (Sugar and James, 2003).
Unfortunately, most of these methods are somewhat ad hoc or
model-based and hence sometimes require parametric
assumptions, which leads to a lack of generality. In
contrast, the principle of the Gap Statistic proposed by
Tibshirani et al. (Tibshirani et al., 2001) is to compare
the within-cluster dispersion obtained by the considered
clustering algorithm with the one we would obtain under a
single cluster hypothesis. It is designed to be applicable
to virtually any clustering method, like the commonly used
k-means and HAC.
3.1 Gap Statistics
Consider a data set $x_{ij}$, $i = 1,2,\dots,n$,
$j = 1,2,\dots,p$, consisting of $p$ features, all of them
measured on $n$ independent observations. $d_{ii'}$ denotes
some dissimilarity between observations $i$ and $i'$. The
most commonly used dissimilarity measure is the squared
Euclidean distance $\sum_j (x_{ij} - x_{i'j})^2$. Suppose
that the data set is composed of $k$ clusters, that $C_r$
denotes the indices of observations in cluster $r$, and that
$n_r = |C_r|$. According to Tibshirani, and using his
notations, we define:

$$D_r = \sum_{i, i' \in C_r} d_{ii'} \qquad (5)$$

as the sum of all the distances between any two observations
in cluster $r$. So, using again the notations of Tibshirani,

$$W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r \qquad (6)$$
KernelHierarchicalAgglomerativeClustering-ComparisonofDifferentGapStatisticstoEstimatetheNumberofClusters
257
is the within-class dispersion. $W_k$ monotonically
decreases as the number of clusters $k$ increases but,
according to Tibshirani (Tibshirani et al., 2001), from some
$k$ onwards, the rate of decrease is dramatically reduced.
It has been shown that the location of such an elbow
indicates the appropriate number of clusters.
The main idea of the Gap Statistic is to compare the graph
of $\log(W_k)$ with the expectation we would observe under a
single cluster hypothesis (the importance of the choice of
an appropriate null reference distribution of the data has
been studied in (Gordon, 1996)).
According to (Tibshirani et al., 2001), the assumed null
model of the data set must be a single cluster model. The
most commonly considered reference distributions are:

a uniform distribution over the range of the observed data
set;

a uniform distribution over an area which is aligned with
the principal components of the data set.

The first method has the advantage of simplicity while the
second is more accurate in terms of consistency because it
takes into account the shape of the data distribution (a
generation sketch is given below).
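A possible sketch of the two null reference generators (our reading of Tibshirani's constructions; function names are ours):

```python
import numpy as np

def null_uniform_range(X, rng):
    """Reference sample drawn uniformly over the bounding box of the observed data."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=X.shape)

def null_uniform_pca(X, rng):
    """Reference sample drawn uniformly over a box aligned with the principal components."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # rows of Vt are the principal directions (right singular vectors)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt.T
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    null_scores = rng.uniform(lo, hi, size=X.shape)
    return null_scores @ Vt + mu

# rng = np.random.default_rng(0)
# Z = null_uniform_pca(X, rng)
```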
Here again, using the notations of Tibshirani, we define the
Gap Statistic as:

$$\mathrm{Gap}(k) = E_n\{\log(W_k)/H_0\} - \log(W_k) \qquad (7)$$

Here $E_n\{\log(W_k)/H_0\}$ denotes the expectation of
$\log(W_k)$ under some null reference distribution $H_0$.
The estimated number of clusters $\hat{k}$ falls at the
point where $\mathrm{Gap}_n(k)$ is maximum. The expectation
is estimated by averaging the results obtained from
different realizations of the data set under the null
reference distribution.
Figures 4 and 5 show an example using kernel HAC. Data are
composed of three distinct bidimensional Gaussian clusters
centred on $(0,0)$, $(0,1.3)$ and $(1.4,1)$ respectively,
with unit covariance matrix $I_2$ and 100 observations per
class. The functions $\log(W_k)$ and the estimate of
$E_n\{\log(W_k)/H_0\}$ are shown in Figure 4. The Gap
Statistic is shown in Figure 5. In this example,
$E_n\{\log(W_k)/H_0\}$ was estimated using 150 independent
realizations of the null data set. We also estimated the
standard deviation $sd(k)$ of $\log(W_k)/H_0$. Let
$s_k = \sqrt{1 + \frac{1}{B}}\, sd(k)$, which is represented
by vertical bars in Figure 5. Then, according to Tibshirani,
the estimated number of clusters $\hat{k}$ is the smallest
$k$ such that:

$$\mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1} \qquad (8)$$

Figure 5 shows that, for the considered data set,
$\hat{k} = 3$, which is correct.
Figure 4: Graphs of $E_n\{\log(W_k)/H_0\}$ (upper curve) and $\log(W_k)$ (lower curve).

Figure 5: Gap Statistic as a function of the number of clusters.
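Assembling the above, a sketch of the Gap Statistic computation of equations (6)-(8) could be the following (our code, not the authors'; `cluster_fn` and `null_fn` are hypothetical callbacks returning, respectively, cluster labels for a given k and a null reference sample; squared Euclidean dissimilarities are used here, and the feature-space distances of equation (2) would be substituted for the kernel variant):

```python
import numpy as np

def log_wk(d2, labels):
    """log of the within-class dispersion W_k of eq. (6), from squared pairwise distances d2."""
    w = 0.0
    for r in np.unique(labels):
        idx = np.where(labels == r)[0]
        w += d2[np.ix_(idx, idx)].sum() / (2.0 * len(idx))
    return np.log(w)

def gap_statistic(X, cluster_fn, k_max, null_fn, B=100, rng=None):
    """Gap(k) = E_n{log W_k | H0} - log W_k, with the decision rule of eq. (8)."""
    rng = rng or np.random.default_rng()
    d2_obs = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    ks = np.arange(1, k_max + 1)
    log_w = np.array([log_wk(d2_obs, cluster_fn(X, k)) for k in ks])
    log_w_null = np.empty((B, k_max))
    for b in range(B):
        Z = null_fn(X, rng)
        d2_null = np.sum((Z[:, None, :] - Z[None, :, :])**2, axis=-1)
        log_w_null[b] = [log_wk(d2_null, cluster_fn(Z, k)) for k in ks]
    gap = log_w_null.mean(axis=0) - log_w
    s = log_w_null.std(axis=0) * np.sqrt(1.0 + 1.0 / B)
    # smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
    for i in range(k_max - 1):
        if gap[i] >= gap[i + 1] - s[i + 1]:
            return ks[i], gap, s
    return ks[-1], gap, s
```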
3.2 New Criteria to Estimate the Number of Clusters in Kernel HAC
The main idea of Gap Statistics was to compare the
within-class dispersion obtained on the data with that
of an appropriate reference distribution. Inspired by
this idea, we propose to extend it to other criteria
which are suitable for HAC to estimate the number
of clusters.
These criteria are:
Modified Gap Statistic
In the Gap Statistic proposed by Tibshirani, the estimated
number of clusters $\hat{k}$ is chosen according to
equation 8. In this modified Gap Statistic, we define
$\hat{k}$ as the number of clusters where $\mathrm{Gap}(k)$
(equation 7) is maximum. As will be shown later in this
paper, this potentially improves the estimate, at least for
standard HAC.
Centered Alignment Gap
A good guess for the nonlinear function $\Phi(x_i)$ would be
to directly produce the expected result $y_i$ for all
observations. In this case, the best Gram matrix $K$, whose
general term is $K_{ij} = K(x_i, x_j)$, becomes $YY'$ where
$Y$ is the column vector (or matrix, depending on the
selected code book) of all the data labels. Shawe-Taylor
et al. (Shawe-Taylor and Kandola, 2002) have proposed a
function that depends on both the labels and the Gram
matrix, called alignment, to measure the degree of agreement
between a kernel function and the clustering task. The
alignment is defined by:

$$\mathrm{Alignment} = \frac{\langle K, YY' \rangle_F}{\sqrt{\langle K, K \rangle_F \, \langle YY', YY' \rangle_F}} \qquad (9)$$

where the subscript $F$ denotes the Frobenius inner product.
In this paper, $Y$ is estimated after the clustering
process. The Centered Alignment (CA) is defined as:

$$\mathrm{CA} = \frac{\langle K_c, Y_c Y_c' \rangle_F}{\sqrt{\langle K_c, K_c \rangle_F \, \langle Y_c Y_c', Y_c Y_c' \rangle_F}} \qquad (10)$$

Here the centered matrix $K_c$ associated with the matrix
$K$ is defined by:

$$K_c(x, y) = \langle \Phi(x) - E_x[\Phi],\, \Phi(y) - E_y[\Phi] \rangle = K(x,y) - E_x[K(x,y)] - E_y[K(x,y)] + E_{x,y}[K(x,y)]$$

where the expectation operator is evaluated by averaging
over the data. $Y_c$ is defined by $Y_c = Y - E[Y]$. So the
Centered Alignment Gap is defined by:

$$\mathrm{Gap}_{\mathrm{CA}}(k) = E_n\{\mathrm{CA}/H_0\} - \mathrm{CA} \qquad (11)$$

The estimated number of clusters $\hat{k}$ is the $k$ such
that $\mathrm{Gap}_{\mathrm{CA}}(k)$ is maximum.
Delta Level Gap
In the dendrogram, we consider each level of the similarity
measure $h_i$, $i = 1,\dots,n-1$, where $h_1$ is the highest
level. Then we define $\Delta h(k) = h_{k-1} - h_k$,
$k = 2,\dots,n-1$. We define the criterion Delta Level Gap
as:

$$\mathrm{Gap}_{\Delta h}(k) = \Delta h(k) - E_n\{\Delta h(k)/H_0\} \qquad (12)$$

The estimated number of clusters $\hat{k}$ is the value of
$k$ where $\mathrm{Gap}_{\Delta h}(k)$ is maximum (a
computational sketch is given after this list of criteria).
Weighted Delta Level Gap
This criterion is related to the previous one. We define the
Weighted Delta Level, denoted by $\Delta h_W(k)$, by
$\Delta h_W(2) = \Delta h(2)$ and
$\Delta h_W(k) = \frac{\Delta h(k)}{\sum_{i=2}^{k-1} \Delta h(i)}$
for $k > 2$. We define the criterion Weighted Delta Level
Gap as:

$$\mathrm{Gap}_{\Delta h_W}(k) = \Delta h_W(k) - E_n\{\Delta h_W(k)/H_0\} \qquad (13)$$

The estimated number of clusters $\hat{k}$ is the value of
$k$ where $\mathrm{Gap}_{\Delta h_W}(k)$ is maximum.
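As a computational sketch of the Delta Level Gap (our code, not the authors'; it assumes the dendrograms are available as SciPy linkage matrices whose third column holds the merge heights, and that null-reference dendrograms have already been built):

```python
import numpy as np

def delta_levels(Z):
    """Delta Level: differences between consecutive dendrogram levels.
    Z is a SciPy linkage matrix; Z[:, 2] holds the merge heights. With h_1 the
    highest level, Delta h(k) = h_{k-1} - h_k, indexed by the number of clusters k."""
    heights = Z[:, 2][::-1]   # h_1 (highest), h_2, ...
    return {k: heights[k - 2] - heights[k - 1] for k in range(2, len(heights) + 1)}

def delta_level_gap(Z_obs, Z_nulls, k_max):
    """Gap_{Delta h}(k) = Delta h(k) - E_n{Delta h(k) | H0}, eq. (12); the estimated
    number of clusters is the k maximizing this gap (k_max must not exceed n - 1)."""
    obs = delta_levels(Z_obs)
    gaps = {}
    for k in range(2, k_max + 1):
        null_mean = np.mean([delta_levels(Zb)[k] for Zb in Z_nulls])
        gaps[k] = obs[k] - null_mean
    return max(gaps, key=gaps.get), gaps
```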
4 SIMULATIONS
We have generated 6 different data sets to compare the
proposed criteria with the one from Tibshirani:

1. Five clusters in two dimensions (1)
The clusters consist of Gaussian bidimensional distributions
$N(0, 1.5^2)$ centered at (0,0), (-3,3), (3,-3), (3,3) and
(-3,-3). Each cluster is composed of 50 observations.
Clusters strongly overlap. The kernel parameter is σ = 0.90.

2. Five clusters in two dimensions (2)
The data are generated in the same way as in the previous
case but the variance of the clusters is now $1.25^2$.
Clusters slightly overlap. σ = 0.85.

3. Five clusters in two dimensions (3)
Same case but with variance equal to $1.0^2$. There is no
overlap. σ = 0.80.

4. Three clusters in two dimensions
This is a data set used by Tibshirani (Tibshirani et al.,
2001). The clusters are unit variance Gaussian bidimensional
distributions with 25, 25 and 50 observations, centered at
(0,0), (0,5) and (5,3), respectively. σ = 0.80.

5. Two nested circles and one outside isolated disk in two
dimensions
These three circles are centered at (0,0), (0,0) and (0,8)
with 150, 100 and 100 observations. The respective radii are
uniformly distributed over [0,1], [4,5] and [0,1]. σ = 0.55.
(A generation sketch for this data set is given after this
list.)

6. Two elongated clusters in three dimensions
This is also a data set used by Tibshirani (Tibshirani
et al., 2001) to show the interest of performing PCA to
define the null distribution. Clusters are aligned with the
first diagonal of a cube. We have $x_1 = x_2 = x_3 = x$ with
$x$ composed of 100 equally spaced values between -0.5 and
0.5. To each component of $x$ a Gaussian perturbation with
standard deviation 0.1 is added. The second cluster is
generated similarly, except that a constant value of 10 is
added to each component. The result is two elongated
clusters, stretching out along the main diagonal of a three
dimensional cube. σ = 1.00.
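A possible generation sketch for data set 5 (ours; the paper does not specify the angular distribution of the points, which we assume uniform):

```python
import numpy as np

def nested_circles_dataset(rng):
    """Data set 5: two concentric rings centred at (0, 0) plus an isolated disk at (0, 8)."""
    def ring(center, r_lo, r_hi, n):
        radius = rng.uniform(r_lo, r_hi, n)
        theta = rng.uniform(0.0, 2.0 * np.pi, n)   # assumed uniform angles
        return np.column_stack([center[0] + radius * np.cos(theta),
                                center[1] + radius * np.sin(theta)])
    inner = ring((0.0, 0.0), 0.0, 1.0, 150)   # radii uniform over [0, 1]
    outer = ring((0.0, 0.0), 4.0, 5.0, 100)   # radii uniform over [4, 5]
    disk  = ring((0.0, 8.0), 0.0, 1.0, 100)   # radii uniform over [0, 1]
    return np.vstack([inner, outer, disk])

# X = nested_circles_dataset(np.random.default_rng(0))
```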
Simulation results with kernel HAC are shown in Ta-
ble 1. 50 realizations were generated for each case
and we used principal component analysis to define
the distributions used as the null reference samples.
A special scenario using a uniform distribution over
the initial area covered by the data (without PCA) is
provided in case 7. For every simulation, expectation
appearing in equation 7 was estimated over 100 inde-
pendent realizations of the null distribution.
KernelHierarchicalAgglomerativeClustering-ComparisonofDifferentGapStatisticstoEstimatetheNumberofClusters
259
Table 1: Simulation results using Kernel HAC. Each number represents the number of times each criterion gives the number
of clusters indicated in the corresponding column, out of the 50 realizations. The column corresponding to the right number
of clusters is indicated in boldface. Numbers between parentheses indicate the results obtained using standard HAC, when of
some interest. NF stands for Not Found.
Number of clusters 2 3 4 5 6 7 8 9 10 NF
1. Five clusters in two dimensions (1)
Gap Statistic 0 (48) 0 (2) 4 (0) 28 (0) 15 (0) 3 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Modified Gap Statistic 0 (2) 0 (0) 0 (10) 9 (36) 9 (1) 5 (0) 5 (0) 3 (0) 19 (0) 0 (0)
Delta Level Gap 0 (0) 0 (0) 18 (9) 30 (38) 1 (1) 1 (1) 0 (0) 0 (0) 0 (1) 0 (0)
Weighted Delta Level Gap 0 (47) 7 (0) 20 (1) 22 (1) 1 (1) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Centered Alignment Gap 6 (1) 4 (0) 11 (1) 23 (6) 6 (10) 0 (9) 0 (8) 0 (9) 0 (6) 0 (0)
2. Five clusters in two dimensions (2)
Gap Statistic 0 (48) 0 (0) 0 (0) 43 (2) 6 (0) 1 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Modified Gap Statistic 0 (0) 0 (0) 0 (1) 19 (45) 10 (3) 6 (1) 5 (0) 2 (0) 8 (0) 0 (0)
Delta Level Gap 0 (0) 0 (0) 2 (1) 48 (48) 0 (1) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Weighted Delta Level Gap 0 (45) 3 (0) 12 (0) 35 (5) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Centered Alignment Gap 0 (0) 1 (0) 2 (0) 40 (10) 5 (15) 2 (9) 0 (7) 0 (4) 0 (5) 0 (0)
3. Five clusters in two dimensions (3)
Gap Statistic 0 (39) 0 (0) 0 (0) 48 (11) 2 (0) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Modified Gap Statistic 0 (0) 0 (0) 0 (0) 26 (47) 12 (3) 2 (0) 0 (0) 1 (0) 9 (0) 0 (0)
Delta Level Gap 0 (0) 0 (0) 0 (0) 50 (50) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Weighted Delta Level Gap 0 (30) 1 (0) 3 (0) 46 (20) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)
Centered Alignment Gap 0 (0) 0 (0) 0 (0) 50 (10) 0 (18) 0 (12) 0 (6) 0 (3) 0 (1) 0 (0)
4. Three clusters in two dimensions
Gap Statistic 0 38 12 0 0 0 0 0 0 0
Modified Gap Statistic 0 14 18 3 0 5 4 3 3 0
Delta Level Gap 5 45 0 0 0 0 0 0 0 0
Weighted Delta Level Gap 0 50 0 0 0 0 0 0 0 0
Centered Alignment Gap 35 15 0 0 0 0 0 0 0 0
5. Two nested circles and one outside isolated disk in two dimensions
Gap Statistic 0 (0) 0 (0) 0 (38) 0 (6) 0 (1) 0 (3) 0 (1) 1 (0) 0 (0) 49 (1)
Modified Gap Statistic 0 (0) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0) 0 (1) 0 (0) 50 (49) 0 (0)
Delta Level Gap 0 (0) 50 (3) 0 (42) 0 (2) 0 (0) 0 (1) 0 (2) 0 (0) 0 (0) 0 (0)
Weighted Delta Level Gap 0 (48) 7 (0) 0 (2) 0 (0) 2 (0) 8 (0) 6 (0) 13 (0) 14 (0) 0 (0)
Centered Alignment Gap 50 (12) 0 (23) 0 (0) 0 (0) 0 (1) 0 (1) 0 (2) 0 (3) 0 (8) 0 (0)
6. Two elongated clusters in three dimensions
Gap Statistic 50 0 0 0 0 0 0 0 0 0
Modified Gap Statistic 45 0 5 0 0 0 0 0 0 0
Delta Level Gap 50 0 0 0 0 0 0 0 0 0
Weighted Delta Level Gap 0 0 0 0 0 0 4 10 36 0
Centered Alignment Gap 50 0 0 0 0 0 0 0 0 0
7. Two elongated clusters in three dimensions (without PCA to define the null reference)
Gap Statistic 0 0 1 0 18 13 15 3 0 0
Modified Gap Statistic 0 0 0 0 1 2 6 7 34 0
Delta Level Gap 50 0 0 0 0 0 0 0 0 0
Weighted Delta Level Gap 0 0 0 0 0 0 0 3 47 0
Centered Alignment Gap 50 0 0 0 0 0 0 0 0 0
The kernel parameter σ has been selected as the
value which maximizes the centered kernel alignment
defined by equation 10.
To evaluate the centered kernel alignment, labels must be
known. They are obtained as the result of kernel HAC
clustering. Coding of $Y$ is performed in such a way that
the centered kernel alignment is invariant to the
(arbitrary) cluster numbering. In this paper, the vector
code book is, for an observation $x_i$ belonging to cluster
$m$, $m = 1,\dots,k$:

$$y_{ij} = 1 \text{ if } j = m, \qquad y_{ij} = -1 \text{ if } j \ne m. \qquad (14)$$
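A sketch of the centered alignment of equation (10) with the code book of equation (14), and of the selection of σ as its maximizer over a grid, could be the following (our code; `gaussian_gram` refers to the sketch of Section 2.1 and `cluster_fn` is a hypothetical callback returning kernel HAC labels for a given width and number of clusters):

```python
import numpy as np

def center_gram(K):
    """Centered Gram matrix K_c, expectations estimated by averaging over the data."""
    n = K.shape[0]
    ones = np.ones((n, n)) / n
    return K - ones @ K - K @ ones + ones @ K @ ones

def centered_alignment(K, labels):
    """Centered kernel alignment of eq. (10), with the +/-1 code book of eq. (14)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    Y = np.where(labels[:, None] == classes[None, :], 1.0, -1.0)   # n x k code book
    Kc = center_gram(K)
    Yc = Y - Y.mean(axis=0)
    YYc = Yc @ Yc.T
    return np.sum(Kc * YYc) / np.sqrt(np.sum(Kc * Kc) * np.sum(YYc * YYc))

def select_sigma(X, sigmas, cluster_fn, k):
    """Pick the kernel width maximizing the centered alignment of the clustering it induces."""
    return max(sigmas,
               key=lambda s: centered_alignment(gaussian_gram(X, s), cluster_fn(X, s, k)))
```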
Results in Table 1 clearly indicate that one of the proposed
criteria, the Delta Level Gap, outperforms the other
criteria (including the Gap Statistic by Tibshirani) in
almost all cases.
Tibshirani (Tibshirani et al., 2001) mainly focused on
well-separated clusters. Our simulations also show that the
Gap Statistic is not good at identifying the number of
clusters when they highly overlap. See data sets 1, 2 and 3
in Table 1: the lower the overlap, the better the
performances. Looking at data set 1 in Table 1, none of the
criteria gives the expected result. The number of clusters
estimated by the Delta Level Gap is mostly 4 or 5, which is
better. However, all the criteria suffer from overlap.
Another important conclusion is the importance of the choice
of an appropriate null reference distribution. Comparing
scenarios 6 and 7 in Table 1, results vary a lot between the
uniform distribution and the uniform distribution aligned
with the principal components. This gives some room to
improve our framework by choosing a more appropriate null
reference distribution.
We have also run simulations using standard HAC on some of
these data sets. Comparing the results shows that kernel HAC
generally improves the performances. We can observe that the
Centered Alignment Gap, the Gap Statistic and the Weighted
Delta Level Gap do not perform well on examples 1, 2 and 3
for standard HAC. The evolution of these statistics as
functions of the number of clusters, which can be seen in
Figures 6, 7 and 8, explains these results. Figure 6 shows
the evolution of the Gap Statistic as a function of the
number of clusters $k$ for a realization of the second data
set. As can be seen, the criterion initially proposed gives
$\hat{k} = 2$. Figure 7 represents the evolution of the
Centered Alignment Gap. The behavior is rather erratic and
explains the variations in the estimated number of clusters.
This can also be observed for the Weighted Delta Level Gap
shown in Figure 8.
Furthermore, the Delta Level Gap shows excellent
performances on data set 5, which cannot be clustered using
standard HAC while it is easily separated by kernel HAC, as
can be seen from Table 1.
According to all the experiments we have done (not all of
them are presented here), we have observed that the initial
Gap Statistic of Tibshirani is not always efficient at
estimating the right number of clusters. Looking for the
maximum value, instead of selecting the value given by
equation 8, appears to be a possible alternative. However,
the Delta Level Gap seems to be the most reliable estimate
among those we have studied and potentially one of the least
sensitive to the null distribution of the data.
Figure 6: Number of clusters using Gap Statistic.

Figure 7: Number of clusters using Centered Alignment Gap.

Figure 8: Number of clusters using Weighted Delta Level Gap.
5 CONCLUSIONS
In this paper, we have proposed kernel HAC as a clustering
tool. It is robust for clustering separable classes which
have a small between-class dispersion in the input space.
Using an adequate Gap Statistic, it also allows
determination of the number of clusters in the data. Among
the criteria studied, simulation results have shown that the
Delta Level Gap, one of our criteria, outperforms the
conventional Gap Statistic in many cases. Our future work
will focus on new methods for determining the optimal kernel
parameter; preliminary work has already shown good
prospects. Then, kernel engineering (adaptation of the
kernel function to the data) will be considered.
REFERENCES
Aizerman, A., Braverman, E. M., and Rozoner, L. (1964).
Theoretical foundations of the potential function
KernelHierarchicalAgglomerativeClustering-ComparisonofDifferentGapStatisticstoEstimatetheNumberofClusters
261
method in pattern recognition learning. Automation
and remote control, 25:821–837.
Barbará, D. and Jajodia, S. (2002). Applications of data
mining in computer security, volume 6. Springer.
Bezdek, J. C., Ehrlich, R., and Full, W. (1984). Fcm: The
fuzzy c-means clustering algorithm. Computers &
Geosciences, 10(2):191–203.
Caliński, T. and Harabasz, J. (1974). A dendrite method for
cluster analysis. Communications in Statistics-theory
and Methods, 3(1):1–27.
Cortes, C. and Vapnik, V. (1995). Support-vector networks.
Machine learning, 20(3):273–297.
Driver, H. E. and Kroeber, A. L. (1932). Quantitative ex-
pression of cultural relationships. University of Cali-
fornia Press.
Filippone, M., Camastra, F., Masulli, F., and Rovetta, S.
(2008). A survey of kernel and spectral methods for
clustering. Pattern recognition, 41(1):176–190.
Fraley, C. and Raftery, A. E. (1998). How many clusters?
which clustering method? answers via model-based
cluster analysis. The computer journal, 41(8):578–
588.
Girolami, M. (2002). Mercer kernel-based clustering in fea-
ture space. Neural Networks, IEEE Transactions on,
13(3):780–784.
Gordon, A. D. (1996). Null models in cluster validation. In
From data to knowledge, pages 32–44. Springer.
Harman, H. H. (1960). Modern factor analysis.
Hartigan, J. A. (1975). Clustering Algorithms. John Wiley
& Sons, Inc., New York, NY, USA, 99th edition.
Kaufman, L. and Rousseeuw, P. J. (2009). Finding groups
in data: an introduction to cluster analysis, volume
344. Wiley-Interscience.
Kim, D.-W., Lee, K. Y., Lee, D., and Lee, K. H. (2005).
Evaluation of the performance of clustering algo-
rithms in kernel-induced feature space. Pattern
Recognition, 38(4):607–611.
Krzanowski, W. J. and Lai, Y. (1988). A criterion for deter-
mining the number of groups in a data set using sum-
of-squares clustering. Biometrics, pages 23–34.
Mac Queen, J. et al. (1967). Some methods for classifi-
cation and analysis of multivariate observations. In
Proceedings of the fifth Berkeley symposium on math-
ematical statistics and probability, volume 1, page 14.
California, USA.
Mercer, J. (1909). Functions of positive and negative type,
and their connection with the theory of integral equa-
tions. Philosophical transactions of the royal society
of London. Series A, containing papers of a mathe-
matical or physical character, 209:415–446.
Milligan, G. W. and Cooper, M. C. (1985). An examination
of procedures for determining the number of clusters
in a data set. Psychometrika, 50(2):159–179.
Muller, K.-R., Mika, S., Ratsch, G., Tsuda, K., and
Scholkopf, B. (2001). An introduction to kernel-based
learning algorithms. Neural Networks, IEEE Transac-
tions on, 12(2):181–201.
Murtagh, F. (1983). A survey of recent advances in hierar-
chical clustering algorithms. The Computer Journal,
26(4):354–359.
Olson, C. F. (1995). Parallel algorithms for hierarchical
clustering. Parallel computing, 21(8):1313–1325.
Príncipe, J. C., Liu, W., and Haykin, S. (2011). Kernel
Adaptive Filtering: A Comprehensive Introduction,
volume 57. John Wiley & Sons.
Qin, J., Lewis, D. P., and Noble, W. S. (2003). Kernel hi-
erarchical gene clustering from microarray expression
data. Bioinformatics, 19(16):2097–2104.
Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Non-
linear component analysis as a kernel eigenvalue prob-
lem. Neural computation, 10(5):1299–1319.
Shawe-Taylor, N. and Kandola, A. (2002). On kernel target
alignment. Advances in neural information processing
systems, 14:367.
Sneath, P. H., Sokal, R. R., et al. (1973). Numerical taxon-
omy. The principles and practice of numerical classi-
fication.
Sugar, C. A. and James, G. M. (2003). Finding the num-
ber of clusters in a dataset. Journal of the American
Statistical Association, 98(463).
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimat-
ing the number of clusters in a data set via the gap
statistic. Journal of the Royal Statistical Society: Se-
ries B (Statistical Methodology), 63(2):411–423.
Vapnik, V. (2000). The nature of statistical learning theory.
Springer.
ICPRAM2014-InternationalConferenceonPatternRecognitionApplicationsandMethods
262