Authors: Philipp Baumann, Dorit S. Hochbaum and Quico Spaen
Affiliation: University of California, United States
Keyword(s): Large-Scale Data Mining, Classification, Data Reduction, Supervised Normalized Cut.
Related Ontology Subjects/Areas/Topics: Classification; Embedding and Manifold Learning; ICA, PCA, CCA and other Linear Models; Pattern Recognition; Sparsity; Theory and Methods
Abstract:
Machine learning techniques that rely on pairwise similarities have proven to be leading algorithms for classification. Despite their good and robust performance, similarity-based techniques are rarely chosen for large-scale data mining because the time required to compute all pairwise similarities grows quadratically with the size of the data set. To address this issue of scalability, we introduced a method called sparse computation, which efficiently generates a sparse similarity matrix that contains only significant similarities. Sparse computation achieves significant reductions in running time with minimal and often no loss in accuracy. However, for massively large data sets, even such a sparse similarity matrix may lead to considerable running times. In this paper, we propose an extension of sparse computation, called sparse-reduced computation, that not only avoids computing very low similarities but also avoids computing similarities between highly similar or identical objects by compressing them into a single object. Our computational results show that sparse-reduced computation allows highly accurate classification of data sets with millions of objects in seconds.
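To illustrate the idea, the following Python sketch combines the two ingredients named in the abstract: near-identical objects are compressed into a single representative, and similarities are computed only between representatives that are close to each other. The specific choices here (a PCA projection to a low-dimensional space, a uniform grid for binning, centroids as representatives, and a Gaussian similarity) are assumptions for the sake of a self-contained example, not the authors' reference implementation.

import numpy as np
from itertools import product

def sparse_reduced_similarities(X, grid_resolution=10, sigma=1.0, n_dims=3):
    """Sketch of sparse-reduced computation (assumed details, see above).

    Collapses objects that fall into the same grid cell of a
    low-dimensional projection into one representative, then computes
    similarities only between representatives of the same or adjacent
    cells. All other pairwise similarities are treated as zero.
    """
    # Project onto the top principal components (assumed choice of projection).
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Xc @ Vt[:n_dims].T

    # Assign each object to a cell of a uniform grid over the projection.
    lo, hi = P.min(axis=0), P.max(axis=0)
    cells = np.floor((P - lo) / (hi - lo + 1e-12) * grid_resolution).astype(int)
    cells = np.minimum(cells, grid_resolution - 1)

    # Data reduction: collapse each occupied cell to a single representative,
    # keeping the running sum, member count, and member indices.
    reps = {}
    for i, c in enumerate(map(tuple, cells)):
        s, n, idx = reps.get(c, (np.zeros(X.shape[1]), 0, []))
        reps[c] = (s + X[i], n + 1, idx + [i])

    # Sparsity: compute similarities only between centroids of neighboring
    # cells; pairs of distant cells are never evaluated at all.
    key_set = set(reps)
    offsets = list(product((-1, 0, 1), repeat=n_dims))
    sims = {}
    for ka in reps:
        ca = reps[ka][0] / reps[ka][1]
        for off in offsets:
            kb = tuple(x + o for x, o in zip(ka, off))
            if kb in key_set and kb > ka:  # visit each unordered pair once
                cb = reps[kb][0] / reps[kb][1]
                sims[(ka, kb)] = np.exp(-np.sum((ca - cb) ** 2) / (2 * sigma**2))
    return reps, sims

# Example usage on synthetic data:
# X = np.random.rand(1_000_000, 20)
# reps, sims = sparse_reduced_similarities(X)

Under these assumptions, the cost of the similarity computation is driven by the number of occupied grid cells rather than by the number of objects, which is how compressing highly similar objects can keep even very large data sets tractable.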