ROBUST CENTROID-BASED CLUSTERING USING DERIVATIVES OF PEARSON CORRELATION

Marc Strickert, Nese Sreenivasulu

Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben, Germany

Thomas Villmann

Clinic for Psychotherapy, University of Leipzig, Germany

Barbara Hammer

Institute of Computer Science, University of Clausthal, Germany

Keywords:

Centroid-based clustering, correlation, quantization cost optimization.

Abstract:

Modern high-throughput facilities provide the basis of -omics research by delivering extensive biomedical

data sets. Mass spectra, multi-channel chromatograms, or cDNA arrays are such data sources of interest for

which accurate analysis is desired. Centroid-based clustering provides helpful data abstraction by representing

sets of similar data vectors by characteristic prototypes, placed in high-density regions of the data space. This

way, speciﬁc modes can be detected, for example, in gene expression proﬁles or in lists containing protein

and metabolite abundances. Despite their widespread use, k-means and self-organizing maps (SOM) often produce only suboptimal results in centroid computation: the final clusters depend strongly on the initialization, and they do not quantize the data as accurately as possible, particularly if a measure other than the Euclidean distance is chosen for data comparison. Neural gas (NG) is a mathematically rigorous clustering method that optimizes the centroid positions by minimizing their quantization errors. Originally formulated for the Euclidean distance, NG is mathematically generalized in this work to give accurate and robust results for the Pearson correlation similarity measure. The benefits of the new NG for correlation (NG-C) are demonstrated for sets of gene expression data and mass spectra.

1 INTRODUCTION

Massive data sets with a high number of samples

and/or attributes create challenges in de novo data

analysis. Particularly, high-throughput biomedical

devices like mass spectrometers or gene expression

arrays generate thousands of data points in parallel

for which accurate data models are required in order

to faithfully reduce the data complexity and to facili-

tate the analysis.

Centroid-based data representations provide highly intuitive interpretations, because a centroid can be regarded as a noise-free prototype of its surrounding cloud of data. Especially for many data vectors, centroids can be assessed much more easily than, for example, results from hierarchical clustering. Some well-known clustering algorithms are freely available (de Hoon et al., 2004), implementing the widely used methods of Eisen et al. (1998).

As a matter of fact, self-organizing maps (SOM)

and k-means clustering are frequently used methods

for extracting a pre-deﬁned number of centroids from

the data (Kohonen, 2001; MacQueen, 1967). While

centroids in k-means get specialized by an iterative

averaging procedure applied to data that they do cur-

rently represent, SOM is a cooperative model with

updates of the best-matching centroid and also of its

neighbors. Since SOM neighbors are related to lateral

centroids arranged on a grid structure, the SOM up-

date triggers a mapping of similar high-dimensional

data vectors onto similar positions of that usually low-

dimensional grid, often, a 2D-plane for visualization.

Due to topological constraints induced by the grid,

quantization accuracy, i.e. data representation, of the

SOM centroids is often not optimum (Villmann and

Claussen, 2006). Thus, if the dimension reduction feature of SOM is not needed, better representations are obtained without grid structure. This can be realized by a SOM-like algorithm called neural gas (NG) that will be of interest here.

Strickert M., Sreenivasulu N., Villmann T. and Hammer B. (2008).
ROBUST CENTROID-BASED CLUSTERING USING DERIVATIVES OF PEARSON CORRELATION.
In Proceedings of the First International Conference on Bio-inspired Systems and Signal Processing, pages 197-203
DOI: 10.5220/0001062601970203
SciTePress

Data condensation requires similarity criteria in

order to gather related items. Besides Euclidean dis-

tance, Pearson correlation is one of the most often

used comparison criteria in biological studies. In

principle, a wide range of similarity measures, ex-

pressed as Minkowski metric or correlation, is avail-

able in self-organizing maps (SOM) and k-means.

There is a commonly overlooked problem con-

nected to similarity rating and model update, though.

SOM centroids, for example, are deﬁned by their

closeness to data points, and it is thus straight-forward

to describe closeness by task-specific similarity measures. Yet, the SOM update rule 'make close centroids more similar to the data' is traditionally implemented as a claim for identity: centroids are moved on straight lines in Euclidean space, in portions depending on their closeness, towards presented data points.

There is thus a difference between the update rule

for a strict one-to-one correspondence of all centroid

components with those of the represented data vec-

tors, and the more relaxed desire of merely represent-

ing high similarity. Although, for vector pairs, iden-

tity is equivalent to maximum similarity, the situation

is different for a single centroid representing many

data points; then, similarity constraints do usually al-

low more degrees of freedom for the centroid place-

ment than strict identity constraints. Analogous con-

siderations apply to the k-means clustering method,

in which custom measures deﬁne data assignments

to centroids, but in which average data centroids are

strictly computed (by averaging) in Euclidean space.

The discrepancy between similarity computation

and subsequent update of data models can be circum-

vented by coupling the update procedure with ana-

lytic properties of the selected similarity measure. In

cost function frameworks the model parameters can

be adapted by optimization of similarity relationships.

Here, gradient-based optimization of centroid loca-

tions is discussed for Pearson correlation similarity.

Correlation is often used in biomedical analysis tasks. It has favorable pattern matching characteristics, it allows the calculation of formal derivatives, and it can be directly used in gradient methods such as the Heskes variant of SOM (Heskes, 1999), neural gas (Martinetz and Schulten, 1991), and generalized learning vector quantization (Sato and Yamada, 1995). The derivative given below is integrated into the highly accurate neural gas clustering method, for which superior performance is demonstrated for gene expression data and mass spectrum data.

2 METHODS

Faithful data representation requires robust centroid

locations within the data. Self-organizing maps

(SOM) realize a cooperative centroid placement strategy by iterative presentation of data points that trigger further improvements of previously placed centroids. A general formulation of this simple procedure is given in Algorithm 1.

Algorithm 1 SOM / NG centroid update

repeat
    choose randomly a data vector $x$
    $k \leftarrow \arg\min_i \{ d(w^i, x) \}$    { $w^k$ is the closest centroid to data vector $x$ }
    for all $m$ centroids $j$ do
        $w^j \leftarrow w^j + \gamma \cdot h_\sigma\big(D(w^k, w^j)\big) \cdot U(x, w^j)$    { $\gamma, h, \sigma, D, U$: see text }
    end for
until no more major changes

SOM Mode of Algorithm 1. Since SOM centroids cooperate laterally on a grid structure, updates imply spatial specialization with similar grid neighbors. Grid dependencies between centroids $k$ and $j$ are expressed by the neighborhood index $D(w^k, w^j)$. For example, rectangular 2D grids possess four direct neighbors $N^k$ of non-boundary centroids with $D(w^k, w^{N^k}) = 1$. The $\sigma$-range of neighborhood cooperation is expressed by the decreasing function $h_\sigma$ with maximum value at $h_\sigma(0) = 1$. Often a Gaussian bell $h_\sigma(D) = e^{-D^2/\sigma^2}$ is put upon the grid, contracted during update by shrinking $\sigma \to 0$. In addition to neighborhood characterization, the update strategy of centroid $w^j$ facing data vector $x$ is described by $U(x, w^j)$. As said above, centroids are most often moved on straight Euclidean lines towards the data vector, i.e. by the term $U(x, w^j) = (x - w^j)$, in small steps depending on the update rate $\gamma < 1$.
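The grid neighborhood of the SOM mode can be illustrated by a minimal sketch. It assumes a rectangular grid with row-major centroid indices, the Manhattan grid distance as neighborhood index $D$ (which is consistent with four direct neighbors at $D = 1$), and a Gaussian bell of the form $e^{-D^2/\sigma^2}$. The function names are illustrative and not taken from any particular implementation.

```python
import math

def grid_distance(k, j, n_cols):
    """Manhattan distance between grid cells k and j (row-major indices)."""
    row_k, col_k = divmod(k, n_cols)
    row_j, col_j = divmod(j, n_cols)
    return abs(row_k - row_j) + abs(col_k - col_j)

def gaussian_neighborhood(dist, sigma):
    """Assumed form h_sigma(D) = exp(-D^2 / sigma^2); equals 1 at D = 0."""
    return math.exp(-(dist ** 2) / (sigma ** 2))

# On a 3x3 grid the center cell (index 4) is a non-boundary centroid
# and has exactly four direct neighbors with D = 1.
direct_neighbors = [j for j in range(9) if grid_distance(4, j, 3) == 1]
```

Shrinking `sigma` concentrates the update on the best-matching centroid, since `gaussian_neighborhood` then drops quickly for every grid distance larger than zero.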

NG Mode of Algorithm 1. The neural gas algorithm works exactly as described for the previous SOM mode, with one crucial exception: the centroid neighborhood is no longer defined on a pre-defined grid structure. Instead, the neighborhood changes dynamically in the course of data presentation. The centroid closest to the currently presented data vector $x$ is assigned a rank of zero, the runner-up gets a rank of one, and so forth. In general, the neighborhood is defined by the ranks relative to only the data vector: $D(w^k, w^j) = D(w^j) = \mathrm{rnk}(x, w^j)$ with

$$\mathrm{rnk}(x, w^j) = \big|\{\, d(x, w^i) < d(x, w^j),\ i = 1 \ldots m \,\}\big|$$

BIOSIGNALS 2008 - International Conference on Bio-inspired Systems and Signal Processing


In contrast to SOM, the best-matching centroid $w^k$ does not induce a specialized structure on the grid neighbors, and the rank-based neighborhood is always data optimum. Centroid update profits from ranks, because they are useful for breaking ties, i.e. for differentiation of very similar data. Ranks are exponentially wrapped by $h_\sigma(D) = e^{-D/\sigma}$, again with $\sigma \to 0$ during update iterations. As for SOM, $U(x, w^j) = (x - w^j)$ and $\gamma < 1$.
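A single update of the NG mode of Algorithm 1 may be sketched as follows. The sketch assumes the squared Euclidean distance and the Euclidean update term $(x - w^j)$. Ranks are obtained by sorting, so that exactly tied distances are resolved by centroid index, which is a simplification of the strict rank definition. All names are illustrative.

```python
import math

def squared_euclidean(x, w):
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def ng_update(x, centroids, gamma, sigma, dist=squared_euclidean):
    """One NG step: rank all centroids by their distance to x, then move
    each centroid towards x with strength gamma * exp(-rank / sigma).
    The list of centroids is updated in place."""
    distances = [dist(x, w) for w in centroids]
    order = sorted(range(len(centroids)), key=lambda j: distances[j])
    for rank, j in enumerate(order):
        h = math.exp(-rank / sigma)
        centroids[j] = [wi + gamma * h * (xi - wi)
                        for xi, wi in zip(x, centroids[j])]
    return order  # order[0] is the index of the best-matching centroid

centroids = [[0.0, 0.0], [4.0, 0.0], [10.0, 0.0]]
order = ng_update([3.0, 0.0], centroids, gamma=0.5, sigma=1.0)
```

In this toy call the second centroid is closest to the data vector and moves half of the way towards it, while the other two centroids move by fractions damped according to their ranks.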

It is known that the NG algorithm asymptotically realizes a stochastic gradient descent on the cost function (Martinetz et al., 1993):

$$E(W, \sigma) = \frac{1}{C(\sigma)} \sum_{j=1}^{m} \sum_{i=1}^{n} h_\sigma\big(\mathrm{rnk}(x^i, w^j)\big) \cdot d(x^i, w^j) \qquad (1)$$

The scaling factor $C(\sigma) = \sum_{i=0}^{m-1} h_\sigma(i)$ is used for normalization. In the limit $\sigma \to 0$, the NG mode of Algorithm 1 leads to a centroid placement that minimizes the total quantization error, defined by $d(x^i, w^j)$, between $m$ centroids and $n$ data vectors. This property does not hold for the SOM version. Even worse, in general the mathematical optimization target of SOM is undefined (Cottrell et al., 1994), unless the costly modification proposed by Heskes is implemented (Heskes, 1999).
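The relation between the cost function and the quantization error can be checked numerically. The following sketch evaluates Eqn. 1 with $h_\sigma(D) = e^{-D/\sigma}$ and compares it, for a very small $\sigma$, with the plain sum of distances between each data vector and its closest centroid. The toy data and all function names are assumptions made for illustration only.

```python
import math

def squared_euclidean(x, w):
    return sum((xi - wi) ** 2 for xi, wi in zip(x, w))

def ng_cost(data, centroids, sigma, dist):
    """E(W, sigma) of Eqn. 1: rank-weighted distances summed over all
    data vectors and centroids, normalized by C(sigma)."""
    m = len(centroids)
    norm = sum(math.exp(-i / sigma) for i in range(m))  # C(sigma)
    total = 0.0
    for x in data:
        distances = [dist(x, w) for w in centroids]
        for j in range(m):
            # rank = number of centroids strictly closer to x than w^j
            rank = sum(1 for d_i in distances if d_i < distances[j])
            total += math.exp(-rank / sigma) * distances[j]
    return total / norm

def quantization_error(data, centroids, dist):
    """Sum of distances between each data vector and its closest centroid."""
    return sum(min(dist(x, w) for w in centroids) for x in data)

data = [[0.0], [1.0], [5.0]]
centroids = [[0.5], [5.0]]
qe = quantization_error(data, centroids, squared_euclidean)
cost_small_sigma = ng_cost(data, centroids, 0.01, squared_euclidean)
cost_large_sigma = ng_cost(data, centroids, 10.0, squared_euclidean)
```

For small `sigma` only rank zero contributes noticeably, so `cost_small_sigma` practically coincides with `qe`; for large `sigma` distant centroids are weighted in as well and the cost grows.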

The benefits of neural gas are: mathematical understanding of centroid specialization, high reproducibility of results, neighborhood cooperation for robustness against initialization, and easy implementation. Very importantly, the generic formulation of the neural gas algorithm allows modifications with respect to the choice of the data similarity measure. A minor drawback of NG is the sorting operation required for rank calculation, i.e. a computing complexity of $O(n \log n)$. Therefore, a fast batch version of neural gas with quadratic convergence based on Newton's method has been proposed recently (Cottrell et al., 2006), complementing the iterative online approach discussed here. The authors also present a method for clustering data only defined by a similarity matrix. For its simplicity, we stick to Algorithm 1 in the following, and we introduce a derivation making full use of the analytic properties of Pearson correlation for an improved centroid update rule.

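Neural Gas Clustering with Pearson Correlation. Pearson correlation is our focus of choice, because it provides a certain degree of invariance to additive or multiplicative effects induced by measuring devices or biochemical probe concentrations. Thus, pattern-based analysis is enhanced by choosing Pearson similarity for data vectors and centroids, mathematically described with abbreviation $r(x,w) = B / \sqrt{C \cdot D}$:

$$r(x,w) = \frac{\sum_{i=1}^{d} (x_i - \mu_x) \cdot (w_i - \mu_w)}{\sqrt{\sum_{i=1}^{d} (x_i - \mu_x)^2 \cdot \sum_{i=1}^{d} (w_i - \mu_w)^2}} \qquad (2)$$

Here $B$ denotes the numerator, $C$ and $D$ denote the sums of squares of $x$ and $w$ under the root, and the sums run over the $d$ vector components.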

In principle, the covariance of $x$ and $w$ gets standardized by the product of the individual standard deviations of $x$ and $w$. However, due to dynamic centroid update, there is not much use in making the implicit standardization explicit by data preprocessing, such as z-score transformation. Furthermore, in cases when correlation is just a building block, like in the dissimilarity measure (1−r) (Zhou et al., 2002), it is much more natural to think in terms of a self-contained equation (Eqn. 2) than in terms of statically preprocessed data.
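The invariance of Pearson correlation to additive and multiplicative effects can be illustrated with a few lines of code. The sketch below implements Eqn. 2 directly; the example vectors, the offset of 7 and the gain of 3 are arbitrary choices for illustration.

```python
import math

def pearson(x, w):
    """Pearson correlation r(x, w) = B / sqrt(C * D) as in Eqn. 2."""
    mu_x = sum(x) / len(x)
    mu_w = sum(w) / len(w)
    b = sum((xi - mu_x) * (wi - mu_w) for xi, wi in zip(x, w))
    c = sum((xi - mu_x) ** 2 for xi in x)
    d = sum((wi - mu_w) ** 2 for wi in w)
    return b / math.sqrt(c * d)

x = [1.0, 2.0, 4.0, 3.0]
w = [0.5, 1.5, 3.0, 3.5]

# The same profile measured with a different offset and a positive gain.
w_shifted_scaled = [3.0 * wi + 7.0 for wi in w]

r_original = pearson(x, w)
r_transformed = pearson(x, w_shifted_scaled)
```

Both correlations agree up to rounding, whereas the Euclidean distance between `x` and `w_shifted_scaled` differs strongly from that between `x` and `w`. A negative gain would flip the sign of the correlation.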

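Correlation as described by Eqn. 2 can be plugged into the cost function Eqn. 1, which is optimized by gradients along the partial derivatives of $E$ with respect to the coordinates of all centroids $w$. In general, these derivatives indicate the contribution of the $k$-th component $w_k$ of a centroid $w$ to the distance or similarity measure. For the squared Euclidean distance $d^2(x,w) = \sum_{i=1}^{d} (x_i - w_i)^2$ this corresponds to the previously mentioned term $U(x,w) = (x - w)$, which points along the negative gradient:

$$\frac{\partial d^2(x,w)}{\partial w_k} = -2 \cdot (x_k - w_k) \propto U(x,w)_k$$

For Pearson correlation the derivative is

$$\frac{\partial r(x,w)}{\partial w_k} = \frac{(x_k - \mu_x) - \frac{B}{D} \cdot (w_k - \mu_w)}{\sqrt{C \cdot D}} \qquad (3)$$

with $B = \sum_i (x_i - \mu_x)(w_i - \mu_w)$, $C = \sum_i (x_i - \mu_x)^2$, and $D = \sum_i (w_i - \mu_w)^2$ as abbreviated in Eqn. 2.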

Since the cost function should be minimized, correlation $r$ is turned by negative sign into a dissimilarity measure, $d(x,w) = -r(x,w)$. Descending on this dissimilarity means ascending on $r$, so the term $U(x,w)_k = -\partial d(x,w)/\partial w_k = \partial r(x,w)/\partial w_k$ is inserted into Algorithm 1, which constitutes the new version of neural gas for correlation-based centroid placement, NG-C for short. It can be shown that this correlation-based update rule yields a valid gradient descent also at the boundaries of the receptive fields. A proof, originally for the Euclidean case, is provided by Martinetz et al. (1993), where a vanishing contribution of the ranks was shown. Since the proof does not rely on specific properties of the Euclidean metric, a direct transfer to Pearson correlation is possible. Therefore, Eqn. 1 is still a cost function that gets optimized by the neural gas algorithm.
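The derivative of Eqn. 3 can be validated against central finite differences. The sketch below does so for an arbitrary pair of vectors and then applies one update step of size $\gamma = 0.1$ in the direction that increases the correlation, i.e. along $+\partial r/\partial w$, which is descent on the dissimilarity $-r$. The helper names and the test vectors are illustrative assumptions.

```python
import math

def pearson_terms(x, w):
    """Return mu_x, mu_w and the abbreviations B, C, D of Eqn. 2."""
    mu_x = sum(x) / len(x)
    mu_w = sum(w) / len(w)
    b = sum((xi - mu_x) * (wi - mu_w) for xi, wi in zip(x, w))
    c = sum((xi - mu_x) ** 2 for xi in x)
    d = sum((wi - mu_w) ** 2 for wi in w)
    return mu_x, mu_w, b, c, d

def pearson(x, w):
    _, _, b, c, d = pearson_terms(x, w)
    return b / math.sqrt(c * d)

def pearson_gradient(x, w):
    """Gradient of r(x, w) with respect to w, following Eqn. 3."""
    mu_x, mu_w, b, c, d = pearson_terms(x, w)
    root = math.sqrt(c * d)
    return [((xk - mu_x) - (b / d) * (wk - mu_w)) / root
            for xk, wk in zip(x, w)]

def numeric_gradient(x, w, eps=1e-6):
    """Central finite differences of r(x, w) with respect to each w_k."""
    grad = []
    for k in range(len(w)):
        up, down = list(w), list(w)
        up[k] += eps
        down[k] -= eps
        grad.append((pearson(x, up) - pearson(x, down)) / (2.0 * eps))
    return grad

x = [1.0, 2.0, 4.0, 3.0]
w = [0.5, 3.0, 1.0, 2.0]
analytic = pearson_gradient(x, w)
numeric = numeric_gradient(x, w)
max_deviation = max(abs(a - n) for a, n in zip(analytic, numeric))

# One update step that increases the correlation between w and x.
gamma = 0.1
w_new = [wk + gamma * gk for wk, gk in zip(w, analytic)]
```

The gradient components sum to zero, which reflects that adding a constant offset to all components of $w$ does not change the correlation.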

Usually, good convergence is reached after 50–1000 repeated data cycles, depending on the size $n$ of the data set and the number $m$ of centroids. Thereby, the neighborhood range σ is exponentially decreased from a starting size of σ = m to a small value of σ = 0.001. This involves all prototypes strongly in the beginning, contracting centroids towards the data


’center’, and it leads to a ﬁne-tuning of data-speciﬁc

centroids in the ﬁnal phase.
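The text only states that σ is decreased exponentially between its start and end value. One common way to realize such a schedule, assumed here and not necessarily the one used for the experiments, is a geometric interpolation over the update steps:

```python
def sigma_schedule(t, t_max, sigma_start, sigma_end=0.001):
    """Exponential decay from sigma_start at t = 0 to sigma_end at t = t_max."""
    return sigma_start * (sigma_end / sigma_start) ** (t / t_max)

# Example: 23 centroids, schedule evaluated at 101 equidistant steps.
values = [sigma_schedule(t, 100, 23.0) for t in range(101)]
```

With this choice the ratio between consecutive values is constant, so early steps involve all centroids, while the late steps affect practically only the best-matching centroid.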

3 RESULTS

The following three applications show the superior-

ity of NG-C clustering over traditional methods with

Pearson correlation. As demonstrated, cost function

optimization by NG-C provides better data represen-

tations and higher reproducibility of results.

3.1 Single Cluster Representation of

Gene Expression Data

A ﬁrst proof of concept is given for the simple,

but illustrative task of ﬁnding only a single centroid

position. This points out structural differences be-

tween Euclidean- and correlation-based centroid up-

date. We use an exemplary 14-dimensional gene ex-

pression data set, where macroarrays were used to

cover 14 temporal developmental stages in the en-

dosperm tissue of developing barley grains, sampled

from day 0 after ﬂowering in steps of two days to day

26. After quality-based ﬁltering, 4824 highly reliable

genes were obtained. Conforming to standards, ex-

pression values were quantile normalized and log

transformed. However, for maintaining overall ex-

pression levels, z-score was not applied to the 14-

dimensional expression series. For illustration, the

set was further reduced to 344 genes of prominent temporal up-regulation with more than 10 transitions $x_t < x_{t+1}$.

Neural gas has been run with Euclidean update

U(x,w) = (x − w) and with updates based on the

derivative of correlation according to Eqn. 3. Both approaches have been re-run 50 times with random centroid initialization. Each run has been carried out with 100 update iterations using γ = 0.001 for the Euclidean approach and γ = 0.01 for the correlation-based one.

Neighborhood size σ does not have any inﬂuence and

even d is not important for data assignments, because

there is only one centroid to be assigned to. Thus,

only the effect of the derivative of d on the centroid

specialization is studied here.

The results are displayed in the plot panel of

Fig. 1. The plots show the 14-dimensional expression

series together with their centroids, projected by PCA

and embedded by multi-dimensional scaling (MDS)

in two dimensions. PCA represents the Euclidean

view on the data, MDS the correlation-based view. To

summarize the displayed results, Euclidean update is

very stringent in both data views, the top left panel

indicating that all 50 centroids are almost perfectly

located in the center of gravity at point (0,0), which

is the k-means solution for k = 1. Complementary to that, correlation-based update exhibits many degrees of freedom in the Euclidean view, but shows very high specificity in the correlation view, which is exactly what it has been designed for.

In addition to visual validation, which might suffer from shortcomings of the built-in dimension reduction, quantization errors have been calculated. For the average data vector, analogous to the deterministic k-means result with k = 1, an average correlation of r = 0.96226 to the data vectors is found. The Euclidean NG update yields an average correlation of the generated centroids of $r = 0.96222 \pm 5.583 \cdot 10^{-5}$, which is virtually the result of the average vector, affected by minor update-specific fluctuations. Correlation-based centroid update yields the best results with an average correlation of $r = 0.96403 \pm 8.173 \cdot 10^{-5}$. In combination with the bottom left panel in Fig. 1 it can be concluded that there are non-unique solutions that can be reached only if Euclidean constraints are relaxed to updates operating in correlation space. Despite the small differences for the presented data set, the results are quite fundamental, because they show that better solutions exist beyond averages. On a good mathematical basis, similarity-specific updates induce fewer constraints on the cost function and yield better data representations.
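That better solutions exist beyond averages can be reproduced on a toy example. The sketch below uses two hypothetical three-dimensional profiles with very different amplitudes, not the barley data, and a deterministic batch gradient ascent on the mean correlation instead of the stochastic NG update. It starts at the average vector, which is the Euclidean single-centroid solution, and follows the derivative of Pearson correlation.

```python
import math

def pearson_terms(x, w):
    mu_x = sum(x) / len(x)
    mu_w = sum(w) / len(w)
    b = sum((xi - mu_x) * (wi - mu_w) for xi, wi in zip(x, w))
    c = sum((xi - mu_x) ** 2 for xi in x)
    d = sum((wi - mu_w) ** 2 for wi in w)
    return mu_x, mu_w, b, c, d

def pearson(x, w):
    _, _, b, c, d = pearson_terms(x, w)
    return b / math.sqrt(c * d)

def pearson_gradient(x, w):
    mu_x, mu_w, b, c, d = pearson_terms(x, w)
    root = math.sqrt(c * d)
    return [((xk - mu_x) - (b / d) * (wk - mu_w)) / root
            for xk, wk in zip(x, w)]

def mean_correlation(data, w):
    return sum(pearson(x, w) for x in data) / len(data)

# Hypothetical toy data: a low-amplitude and a high-amplitude profile.
data = [[0.0, 1.0, 5.0], [0.0, 30.0, 30.0]]

# Euclidean solution for a single centroid: the component-wise mean.
mean_vector = [sum(column) / len(data) for column in zip(*data)]
before = mean_correlation(data, mean_vector)

# Batch gradient ascent on the mean correlation, starting at the mean.
w = list(mean_vector)
gamma = 1.0
for _ in range(500):
    grads = [pearson_gradient(x, w) for x in data]
    w = [wk + gamma * sum(g[k] for g in grads) / len(data)
         for k, wk in enumerate(w)]
after = mean_correlation(data, w)
```

The average vector is dominated by the high-amplitude profile, whereas the correlation-driven centroid balances both profiles and thereby reaches a higher mean correlation (`after` exceeds `before`).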

3.2 Clustering of Gene Expression Data

Mining for principal shapes in large lists of gene ex-

pression patterns is a central tool for the identiﬁca-

tion of co-expressed genes. Neural gas with corre-

lation is used to meet this purpose for the data set

described in the last paragraph containing 4824 gene

expression levels at 14 time points. For comparison,

Eisen’s implementation of k-means and Gasch’s and

Eisen’s fuzzy k-means are taken as reference mod-

els (de Hoon et al., 2004). Both make use of Pear-

son correlation for creating sets of similar patterns for

centroid calculation, but they compute centroid po-

sitions in Euclidean space. Calculations were done

with 100 cycles for neural gas, i.e. 482,400 centroid

updates, and 100 cycles for the k-means models.

A number of 23 centroids was used in all models, because fuzzy k-means is, due to its built-in PCA, limited to 3 × #experiments + 2 = 3 × 14 + 2 = 44 prototypes, of which only 23 were identified as unique by fuzzy k-means (Gasch and Eisen, 2002). Contrary to the k-means methods, unused prototypes do not occur in NG-C, because of its neighborhood cooperation. The exponential NG-C neighborhood influence is realized as exponential decay from σ = 23 to σ = 0.001,


[Figure 1, four panels, each showing the data and 50 centroids. Top row: "PCA of gene expressions with NG-centroids from Euclidean update" and "Correlation-based MDS with NG-centroids from Euclidean update". Bottom row: "PCA of gene expressions with NG-centroids from correlation update" and "Correlation-based MDS with NG-centroids from correlation update". Axes of the PCA panels: 1st and 2nd principal component; axes of the MDS panels: MDS axis 1 and MDS axis 2.]

Figure 1: Centroid specialization for single cluster representation. Top row: Euclidean update rule, bottom row: update based

on derivative of Pearson correlation. Left column: principal component plots, right column: multidimensional scaling of

(1−r) data relationships. In 50 individual NG runs, Euclidean updates (top row) show high speciﬁcity for both the Euclidean

data view (shown as PCA) and the correlation-based view (shown as MDS). Correlation-based updates (bottom row) exhibit

large diversity in Euclidean view (PCA) and high speciﬁcity in correlation view (MDS).

the update rate is set to γ = 0.001. Two quality criteria

are considered for model comparison: reproducibility

of the obtained centroids for different runs of the al-

gorithms and quantization accuracy.

Reproducibility of Clusters. One major aspect of

clustering is the consistency of the results. This has

been tested by running NG-C and k-means 10 times

from random starting conﬁgurations of the 23 cen-

troids. For fuzzy k-means the standard initialization is

ﬁxed, which makes repeats unnecessary. Visual com-

parison is thus restricted to NG-C and k-means. An

informative comparison between both methods is dis-

played in Figure 2, created using the free TreeView

software. Both horizontal intensity bars contain the

23 centroids of 10 runs, i.e. 230 columns. Shades of

gray denote speciﬁc gene expression intensities. Pat-

terns of temporal up- and down-regulations present

in the underlying data set are nicely captured by cen-

troids of both models. The tic marks attached to the

bottom of the NG-C bar point out 23 prominent bands

that reﬂect a high reproducibility of the centroids con-

tained therein, independent of their random initializa-

tion. For k-means, displayed in the row above, the re-

sult is very different: an unspeciﬁc continuous range

of ﬁnal states is obtained, which supports the experi-

ence of many users of k-means who complain about

the poor reproducibility of results.

Quantization Accuracy. Table 1 provides a summary of the quantization accuracy of the found clusters. For each run, the average correlation of expression patterns with their corresponding centroids is measured, and the respective standard deviations are also calculated. These two values are averaged over all centroids. Finally, mean values for the 10 experiments are determined and listed in Table 1. As a major outcome, NG-C shows a superior data representation over k-means and fuzzy k-means. Fuzzy k-means is a little better than simple k-means, but its major disadvantage is the limitation to 44 centroids, of which 21 are even unused. The good results of NG-C, however, are not too surprising, because neural gas has been mathematically designed to optimize the goal of

Table 1: Average correlations between data samples and their centroids for 10 independent runs of NG-C and k-means. The deterministic result of fuzzy k-means is 0.9335 ± 0.07216. In terms of quantization accuracy and data assignment variability, NG-C performs best. Both k-means and its fuzzy variant yield slightly worse quantizations, but fuzzy k-means covers data more homogeneously.

Method     mean               std.-dev.
NG-C       0.9516 ± 0.0001    0.0573 ± 0.0004
k-means    0.9329 ± 0.0017    0.0881 ± 0.0038


Figure 2: Cluster reproducibility for correlation-based neural gas (NG-C) and k-means. Both models, corresponding to the

upper and lower bar, were run 10 times with random initialization. For the considered number of 23 centroids this yields

a total of 230 centroids (gray columns) for comparison. While the ﬁnal states of k-means cover a continuum of centroid

locations, exhibiting only one major break, the ﬁnal states of NG-C are highly conserved and displayed as 23 characteristic

bands containing approx. 10 speciﬁc centroids.

maximum quantization accuracy (minimum quantiza-

tion error), while the k-means methods are acting on

assumptions about Euclidean data centers iteratively

assessed by averaging.
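The quantization accuracy measure described above can be sketched as follows. The sketch assumes that each data vector is assigned to its most correlated centroid, that the population standard deviation is used, and that centroids without assigned data are skipped. These details, like the function names and the toy data, are assumptions and not taken from the original evaluation.

```python
import math

def pearson(x, w):
    mu_x = sum(x) / len(x)
    mu_w = sum(w) / len(w)
    b = sum((xi - mu_x) * (wi - mu_w) for xi, wi in zip(x, w))
    c = sum((xi - mu_x) ** 2 for xi in x)
    d = sum((wi - mu_w) ** 2 for wi in w)
    return b / math.sqrt(c * d)

def quantization_accuracy(data, centroids):
    """Assign each vector to its most correlated centroid, then average
    the per-centroid mean and standard deviation of these correlations."""
    per_centroid = [[] for _ in centroids]
    for x in data:
        corrs = [pearson(x, w) for w in centroids]
        best = max(range(len(centroids)), key=lambda j: corrs[j])
        per_centroid[best].append(corrs[best])
    used = [c for c in per_centroid if c]  # skip centroids without data
    means = [sum(c) / len(c) for c in used]
    stds = [math.sqrt(sum((r - mu) ** 2 for r in c) / len(c))
            for c, mu in zip(used, means)]
    return sum(means) / len(means), sum(stds) / len(stds)

# Toy data: two rising and two falling profiles, one centroid per shape.
data = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0], [6.0, 4.0, 2.0]]
centroids = [[0.0, 1.0, 2.0], [2.0, 1.0, 0.0]]
mean_corr, mean_std = quantization_accuracy(data, centroids)
```

Repeating this evaluation for several independently initialized runs and averaging the two returned values corresponds to the kind of summary listed in Table 1.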

3.3 Clustering of Mass Spectra

The last application concerns clustering of mass spec-

trum data from a clinical cancer study with 1050 mass

spectra taken from sliced frozen tissue probes, us-

ing a linear MALDI-TOF MS, Autoﬂex, in a range

of 2000-10000Da (by courtesy of Bruker Daltonik

GmbH, Bremen). The data preparation protocol of

the measured spectra followed the default workﬂow

for baseline correction, alignment and peak picking.

Robust peaks with signal to noise ratio S/N > 5 were

used for further analysis, and only maxima of the ex-

tracted peaks were considered. This led to a high

quality data set of 1050 samples, each described by

32 peaks. Clustering assists in tasks of data inspec-

tion and hypothesis generation.

Neural gas is applied in two manners to address

the task of deriving tissue-speciﬁc spectrum centroids

from the 32-dimensional data: one with Pearson cor-

relation for centroids assignment, but with Euclidean

update, the other fully correlation-driven for both pat-

tern matching and update.

A small number of 11 clusters has been chosen

in order to force sparse representations and to make

the constraints of stringent Euclidean updates apparent. Both approaches have been trained in 10 independent runs using 1000 data cycles, i.e. 1000 × 1050 iterations, starting with randomly initialized centroids.

Euclidean update was performed with an update rate

of γ = 0.01. A value of γ = 10

was used for the

correlation-based update. This large value compen-

sates for the very small variability of the derivatives

of correlations, which are caused by very similar mass

peak proﬁles.

Both methods yield accurate data abstractions,

as shown in Fig. 3. The MDS visualization faith-

fully displays the correlation relationships of the 32-

dimensional centroids and the data. Since similar

scatter points correspond to highly correlated data

vectors, excellent reproducibility of the final configurations and a good data coverage can be observed. With respect to quantization, centroids from Euclidean update correlate on average at a level of r = 92.8106 ± 0.0043 (in percent) with the represented data. Update by Pearson correlation yields an improvement to r = 93.4854 ± 0.0790 (in percent) for the same number of prototypes. The small standard deviation for Euclidean update again points out (indirectly) the very strong attraction to the final centroid configuration, which is, however, not optimum in terms of quantization accuracy (data representation), for which the correlation update is clearly a better choice.

4 CONCLUSIONS

Based on the mathematical derivative of the Pear-

son correlation coefﬁcient, we developed a new ap-

proach to maximize correlation in prototype-based

data models. Particularly, the derivative can be di-

rectly plugged into the update step of a generalized

version of the neural gas clustering method. Well-

reproducible high-quality clusters were obtained by

the new NG-C method. For the data clustered here,

k-means and fuzzy k-means, although offering corre-

lation similarity, are clearly outperformed by NG-C.

In general, correlation-based centroid matching com-

bined with Euclidean update, as usually realized in

k-means and SOM implementations, leads to sub-

optimal data representations.

Although Pearson correlation is one of the gold

standards in biomedical data analysis, the above con-

cept can be easily generalized by replacing the deriva-

tive of Pearson correlation by that of other suitable

similarity measures. This opens directions to process


[Figure 3 plot: "Embedded Data Set with NG Centroids"; axes: MDS axis 1 and MDS axis 2.]

Figure 3: Visualization of data (small dots) and neural gas

(NG) centroids (circles and crosses). Centroids correspond

to 10 independent runs of NG, each run comprising 11 cen-

troids, for two different update rules. Circles correspond

to NG centroids obtained by Euclidean update; they do ex-

hibit an extremely high reproducibility. Crosses correspond

to centroids with correlation-based update; their ﬁnal states

are less stringently ﬁxed, but their quantization quality is

better (see text). In many cases, both update rules yield similar final configurations, but the boxes highlight regions with systematic differences.

data from wide scientiﬁc ﬁelds where domain knowl-

edge needs to be carefully considered.

ACKNOWLEDGEMENTS

Thanks to Frank-Michael Schleif (University of

Leipzig) for the mass spectrum data and for fruit-

ful discussions. This work is supported by

the Ministry of Culture of Saxony-Anhalt, grant

XP3624HP/0606T.

REFERENCES

Cottrell, M., Fort, J., and Pagès, G. (1994). Two or three things that we know about the Kohonen algorithm. In Verleysen, M., editor, European Symposium on Artificial Neural Networks (ESANN), pages 235–244. D-facto Publications.

Cottrell, M., Hammer, B., Hasenfuss, A., and Villmann, T. (2006). Batch and median neural gas. Neural Networks, 19(6–7):762–771.

de Hoon, M., Imoto, S., Nolan, J., and Miyano, S. (2004). Open source clustering software. Bioinformatics, 20(9):1453–1454.

Eisen, M., Spellman, P., Brown, P., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. PNAS, 95(25):14863–14868.

Gasch, A. and Eisen, M. (2002). Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biology, 3(11):1–22.

Heskes, T. (1999). Energy functions for self-organizing maps. In Oja, E. and Kaski, S., editors, Kohonen Maps, pages 303–316. Elsevier, Amsterdam.

Kohonen, T. (2001). Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press.

Martinetz, T., Berkovich, S., and Schulten, K. (1993). "Neural-gas" network for vector quantization and its application to time-series prediction. IEEE Transactions on Neural Networks, 4(4):558–569.

Martinetz, T. and Schulten, K. (1991). A "neural-gas" network learns topologies. Artificial Neural Networks, pages 397–402.

Sato, A. and Yamada, K. (1995). Generalized Learning Vector Quantization. In Tesauro, G., Touretzky, D., and Leen, T., editors, Advances in Neural Information Processing Systems 7 (NIPS), volume 7, pages 423–429. MIT Press.

Villmann, T. and Claussen, J. (2006). Magnification control in self-organizing maps and neural gas. Neural Computation, 18(2):446–469.

Zhou, X., Kao, M.-C., and Wong, W. (2002). Transitive functional annotation by shortest-path analysis of gene expression data. PNAS, 99(20):12783–12788.
