Authors: Marc Strickert 1; Nese Sreenivasulu 1; Thomas Villmann 2 and Barbara Hammer 3
Affiliations: 1 Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben, Germany; 2 Clinic for Psychotherapy, University of Leipzig, Germany; 3 Institute of Computer Science, University of Clausthal, Germany
Keyword(s): Centroid-based clustering, correlation, quantization cost optimization.
Related Ontology Subjects/Areas/Topics: Applications; Applications and Services; Artificial Intelligence; Biomedical Engineering; Biomedical Signal Processing; Biometrics; Biometrics and Pattern Recognition; Computational Intelligence; Computer Vision, Visualization and Computer Graphics; Data Manipulation; Health Engineering and Technology Applications; Human-Computer Interaction; Medical Image Detection, Acquisition, Analysis and Processing; Methodologies and Methods; Multimedia; Multimedia Signal Processing; Neural Networks; Neurocomputing; Neurotechnology, Electronics and Informatics; Pattern Recognition; Physiological Computing Systems; Sensor Networks; Signal Processing; Soft Computing; Telecommunications; Theory and Methods
Abstract:
Modern high-throughput facilities provide the basis of -omics research by delivering extensive biomedical data sets. Mass spectra, multi-channel chromatograms, and cDNA arrays are data sources of interest for which accurate analysis is desired. Centroid-based clustering provides helpful data abstraction by representing sets of similar data vectors by characteristic prototypes placed in high-density regions of the data space. This way, specific modes can be detected, for example, in gene expression profiles or in lists containing protein and metabolite abundances. Despite their widespread use, k-means and self-organizing maps (SOM) often produce only suboptimal results in centroid computation: the final clusters depend strongly on the initialization, and they do not quantize the data as accurately as possible, particularly if a distance other than the Euclidean one is chosen for data comparison. Neural gas (NG) is a mathematically rigorous clustering method that optimizes the centroid positions by minimizing their quantization errors. Originally formulated for the Euclidean distance, NG is mathematically generalized in this work to give accurate and robust results for the Pearson correlation similarity measure. The benefits of the new NG for correlation (NG-C) are demonstrated for sets of gene expression data and mass spectra.
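To make the idea of rank-based prototype updates under a correlation measure concrete, the following is a minimal Python sketch. It is not the NG-C update derived in the paper: it is a common surrogate, assumed here for illustration, that runs batch neural gas on row-standardized profiles and ranks prototypes by the dissimilarity 1 - r. The function names, the annealing schedule, and all default parameter values are hypothetical choices, and the sketch assumes no profile is constant (zero variance).

```python
import numpy as np

def pearson_similarity(X, W):
    """Pearson correlation between each row of X (n, d) and each row of W (k, d)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Wc = W - W.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Wn = Wc / np.linalg.norm(Wc, axis=1, keepdims=True)
    return Xn @ Wn.T

def neural_gas_correlation(X, k, epochs=50, seed=0):
    """Batch neural gas on standardized rows, ranked by 1 - Pearson r (illustrative surrogate)."""
    rng = np.random.default_rng(seed)
    Z = np.asarray(X, dtype=float)
    # Center and scale every profile, so correlation becomes a plain dot product.
    Z = Z - Z.mean(axis=1, keepdims=True)
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    # Initialize the prototypes from randomly chosen data vectors.
    W = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    lam_start, lam_end = k / 2.0, 0.1  # assumed neighborhood-range schedule
    for t in range(epochs):
        lam = lam_start * (lam_end / lam_start) ** (t / max(epochs - 1, 1))
        D = 1.0 - pearson_similarity(Z, W)           # (n, k) dissimilarities
        ranks = D.argsort(axis=1).argsort(axis=1)    # rank 0 = best-matching prototype
        H = np.exp(-ranks / lam)                     # rank-based neighborhood weights
        mass = H.sum(axis=0)
        ok = mass > 1e-12                            # skip prototypes with negligible weight
        W[ok] = (H.T @ Z)[ok] / mass[ok, None]
    return W
```

A possible usage, under the same assumptions: call `W = neural_gas_correlation(X, k=2)` on a matrix `X` whose rows are expression profiles, then assign each profile to the prototype with the largest value in `pearson_similarity(X, W)`. Because the ranking depends only on correlation, profiles that differ by offset or positive scaling fall into the same cluster.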