’center’, and it leads to a fine-tuning of data-specific
centroids in the final phase.
3 RESULTS
The following three applications show the superior-
ity of NG-C clustering over traditional methods with
Pearson correlation. As demonstrated, cost function
optimization by NG-C provides better data represen-
tations and higher reproducibility of results.
3.1 Single Cluster Representation of Gene Expression Data
A first proof of concept is given for the simple,
but illustrative task of finding only a single centroid
position. This points out structural differences be-
tween Euclidean- and correlation-based centroid up-
date. We use an exemplary 14-dimensional gene ex-
pression data set, where macroarrays were used to
cover 14 temporal developmental stages in the en-
dosperm tissue of developing barley grains, sampled
from day 0 after flowering in steps of two days to day
26. After quality-based filtering, 4824 highly reliable
genes were obtained. Conforming to standards, expression
values were quantile normalized and $\log_2$-transformed.
However, for maintaining overall expression levels,
z-score was not applied to the 14-dimensional
expression series. For illustration, the
set was further reduced to 344 genes of prominent
temporal up-regulation with more than 10 transitions
$x^j_t < x^j_{t+1}$.
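The up-regulation filter can be illustrated with a minimal sketch. It assumes that a transition means a strict increase between consecutive time points of one gene, so a 14-point series has at most 13 transitions. The function names and the toy data are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def count_up_transitions(series):
    """Count the time steps t with x[t] < x[t+1] for every row (gene)."""
    series = np.asarray(series, dtype=float)
    return np.sum(np.diff(series, axis=1) > 0, axis=1)

def select_upregulated(series, more_than=10):
    """Keep the rows whose number of increasing steps exceeds `more_than`."""
    series = np.asarray(series, dtype=float)
    return series[count_up_transitions(series) > more_than]

# Toy data (hypothetical, not the barley data set): 3 genes, 14 time points.
rng = np.random.default_rng(0)
toy = np.vstack([
    np.arange(14.0),        # strictly increasing: 13 up-transitions
    np.arange(14.0)[::-1],  # strictly decreasing: 0 up-transitions
    rng.normal(size=14),    # unstructured series
])
print(count_up_transitions(toy))
print(select_upregulated(toy).shape)
```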
Neural gas has been run with Euclidean update
U(x,w) = (x − w) and with updates based on the
derivative of correlation according to Eqn. 3. Both
approaches have been re-run 50 times with random
centroid initialization. Each run has been carried out with
100 update iterations using γ = 0.001 for the Euclidean
approach and γ = 0.01 for the correlation-based one.
Neighborhood size σ does not have any influence and
even d is not important for data assignments, because
there is only one centroid to be assigned to. Thus,
only the effect of the derivative of d on the centroid
specialization is studied here.
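The two update rules for a single centroid can be compared in a short sketch. The following is an illustration under several assumptions: the correlation-based update uses the standard analytic gradient of the Pearson correlation with respect to w, which may differ from Eqn. 3 in scaling; one update iteration is read as one pass over the data; and the toy data, the initialization, and all function names are hypothetical.

```python
import numpy as np

def corr_gradient(x, w):
    """Gradient of the Pearson correlation r(x, w) with respect to w."""
    xc = x - x.mean()
    wc = w - w.mean()
    b = xc @ wc   # covariance term
    c = xc @ xc   # variance term of x
    d = wc @ wc   # variance term of w
    return (xc - (b / d) * wc) / np.sqrt(c * d)

def train_single_centroid(data, update, gamma, n_epochs=100, seed=0):
    """Online training of one centroid: w <- w + gamma * U(x, w)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=data.shape[1])  # random initialization (assumed form)
    for _ in range(n_epochs):
        for x in data[rng.permutation(len(data))]:
            w = w + gamma * update(x, w)
    return w

def mean_correlation(data, w):
    """Average Pearson correlation between w and all data vectors."""
    return float(np.mean([np.corrcoef(x, w)[0, 1] for x in data]))

# Toy data (hypothetical): noisy rising profiles with gene-specific offsets.
rng = np.random.default_rng(1)
trend = np.linspace(0.0, 3.0, 14)
data = trend + 0.3 * rng.normal(size=(50, 14)) + rng.normal(size=(50, 1))

w_euc = train_single_centroid(data, lambda x, w: x - w, gamma=0.001)
w_cor = train_single_centroid(data, corr_gradient, gamma=0.01)
print(mean_correlation(data, w_euc), mean_correlation(data, w_cor))
```

In this sketch the Euclidean update pulls the centroid towards the center of gravity of the data. The correlation gradient is orthogonal to the centered centroid, so it changes only the shape of the profile and leaves offset and scale unconstrained.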
The results are displayed in the plot panel of
Fig. 1. The plots show the 14-dimensional expression
series together with their centroids, projected by PCA
and embedded by multi-dimensional scaling (MDS)
in two dimensions. PCA represents the Euclidean
view on the data, MDS the correlation-based view. To
summarize the displayed results, Euclidean update is
very stringent in both data views, the top left panel
indicating that all 50 centroids are almost perfectly
located in the center of gravity at point (0,0), which
is the k-means solution for k = 1. Complementary to
that, correlation-based update exhibits many degrees
of freedom in Euclidean view, but shows very high
specificity in the correlation view – which is exactly
what it has been designed for.
In addition to visual validation, which might suf-
fer from shortcomings of the built-in dimension re-
duction, quantization errors have been calculated. For
the average data vector, analogous to the deterministic
k-means result with k = 1, an average correlation
of r = 0.96226 to the data vectors is found.
The Euclidean NG-update yields a result with an average
correlation of the generated centroids of
$r = 0.96222 \pm 5.583 \cdot 10^{-5}$, which is virtually the result of
the average vector, affected by minor update-specific
fluctuations. Correlation-based centroid update yields
the best results with an average correlation of
$r = 0.96403 \pm 8.173 \cdot 10^{-5}$. In combination with the bottom
left panel in Fig. 1 it can be concluded that there
are non-unique solutions that can be reached only if
Euclidean constraints are relaxed to updates operating
in correlation space. Despite the small differences
for the presented data set, the results are quite
fundamental, because they show that better solutions
exist beyond averages. On a good mathematical basis,
similarity-specific updates induce fewer constraints on
the cost function and yield better data representations.
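The quality measure used above can be sketched as follows. The code assumes that the reported values are the mean Pearson correlation between a centroid and all data vectors, and that the ± term is a standard deviation over repeated runs; both readings are assumptions. The closed-form vector unit_mean (the average of centered, unit-length profiles) is not the NG-C update. It is included only because it maximizes the average correlation by the Cauchy-Schwarz inequality and therefore illustrates that the plain average is not optimal. All names and the toy data are hypothetical.

```python
import numpy as np

def average_correlation(data, centroid):
    """Mean Pearson correlation between one centroid and every data row."""
    dc = data - data.mean(axis=1, keepdims=True)
    cc = centroid - centroid.mean()
    r = (dc @ cc) / (np.linalg.norm(dc, axis=1) * np.linalg.norm(cc))
    return float(r.mean())

def summarize_runs(data, centroids):
    """Mean and standard deviation of the average correlation over runs."""
    scores = np.array([average_correlation(data, w) for w in centroids])
    return float(scores.mean()), float(scores.std(ddof=1))

# Toy data (hypothetical): rising profiles with varying amplitude.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 3.0, 14)
amplitude = rng.uniform(0.5, 3.0, size=(200, 1))
data = amplitude * trend + 0.4 * rng.normal(size=(200, 14))

# Plain average vector (the k-means solution for k = 1).
mean_vec = data.mean(axis=0)

# Average of centered, unit-length profiles (illustration only).
dc = data - data.mean(axis=1, keepdims=True)
unit_mean = (dc / np.linalg.norm(dc, axis=1, keepdims=True)).mean(axis=0)

print(average_correlation(data, mean_vec))
print(average_correlation(data, unit_mean))
# Shifting and positively rescaling a centroid leaves its score unchanged,
# which is one source of non-unique solutions in correlation space.
print(average_correlation(data, 2.0 * unit_mean + 5.0))
```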
3.2 Clustering of Gene Expression Data
Mining for principal shapes in large lists of gene
expression patterns is a central tool for the identification
of co-expressed genes. Neural gas with correlation
is used for this purpose on the data set
described in Section 3.1, containing the expression
levels of 4824 genes at 14 time points. For comparison,
Eisen’s implementation of k-means and Gasch’s and
Eisen’s fuzzy k-means are taken as reference mod-
els (de Hoon et al., 2004). Both make use of Pear-
son correlation for creating sets of similar patterns for
centroid calculation, but they compute centroid po-
sitions in Euclidean space. Calculations were done
with 100 cycles for neural gas, i.e. 482,400 centroid
updates, and 100 cycles for the k-means models.
A total of 23 centroids were used in all models,
because fuzzy k-means is, due to its built-in PCA,
limited to 3 × #experiments + 2 = 3 × 14 + 2 = 44 prototypes,
of which only 23 were identified as unique by
fuzzy k-means (Gasch and Eisen, 2002). Contrary to
the k-means methods, unused prototypes do not occur
in NG-C, because of its neighborhood cooperation.
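The counts quoted for this experiment follow from simple bookkeeping, reproduced here as a small check. It assumes that one cycle presents each of the 4824 expression series exactly once and that each presentation counts as one centroid update; the variable names are hypothetical.

```python
n_genes = 4824       # expression series in the data set
n_cycles = 100       # training cycles for neural gas
n_experiments = 14   # time points per series

# One update step per presented data vector (assumption).
ng_updates = n_cycles * n_genes

# Upper limit on prototypes stated for fuzzy k-means.
fuzzy_limit = 3 * n_experiments + 2

print(ng_updates, fuzzy_limit)
```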
The exponential NG-C neighborhood influence is re-
alized as exponential decay from σ = 23 to σ = 0.001,
BIOSIGNALS 2008 - International Conference on Bio-inspired Systems and Signal Processing