Authors:
Larissa Teixeira (1); Igor Eleutério (1); Mirela Cazzolato (1, 2); Marco A. Gutierrez (2); Agma J. M. Traina (1) and Caetano Traina-Jr. (1)
Affiliations:
(1) Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP), São Carlos, Brazil
(2) The Heart Institute (InCor), University of São Paulo (USP), São Paulo, Brazil
Keyword(s):
Dimensional Data, k-medoids, Clustering, Indexing, Metric Access Method.
Abstract:
Clustering algorithms are powerful data mining techniques for identifying patterns and extracting information from datasets. Scalable algorithms have become crucial to enable data mining on large datasets. In the literature, k-medoid-based clustering algorithms stand out as one of the most widely used approaches. However, these methods face scalability challenges when applied to massive datasets and high-dimensional vector spaces, mainly due to the high computational cost of the swap step. In this paper, we propose the KluSIM method to improve the computational efficiency of the swap step in the k-medoids clustering process. KluSIM leverages Metric Access Methods (MAMs) to prune the search space, speeding up the swap step. Additionally, KluSIM eliminates the need to maintain a distance matrix in memory, overcoming a memory limitation of existing methods. Experiments on real and synthetic data show that KluSIM outperforms the baseline FasterPAM, with a speedup of up to 881 times, requiring up to 3,500 times fewer distance calculations while maintaining comparable clustering quality. KluSIM is well suited to big data analysis, being effective and scalable for clustering large datasets.
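To make the cost of the swap step concrete, the following is a minimal sketch of a naive PAM-style swap step in Python. It is not KluSIM and not FasterPAM; it is only an unoptimized baseline, written under the assumption of Euclidean distance, that recomputes every distance on demand instead of storing a distance matrix. The names `total_deviation`, `naive_swap_step`, and `CountingDistance` are hypothetical and chosen for illustration.

```python
import itertools
import math


def total_deviation(points, medoids, dist):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def naive_swap_step(points, medoids, dist):
    """Try every (medoid, non-medoid) swap and return the best one found.

    Returns (new_medoids, new_cost). If no swap lowers the cost, the
    original medoids are returned unchanged.
    """
    best_medoids = list(medoids)
    best_cost = total_deviation(points, medoids, dist)
    for pos, cand in itertools.product(range(len(medoids)), range(len(points))):
        if cand in medoids:
            continue  # a current medoid cannot be swapped in again
        trial = list(medoids)
        trial[pos] = cand
        cost = total_deviation(points, trial, dist)
        if cost < best_cost:
            best_medoids, best_cost = trial, cost
    return best_medoids, best_cost


class CountingDistance:
    """Euclidean distance (an assumption) that counts how often it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return math.dist(a, b)


# Toy data: two well-separated groups, with both initial medoids in one group.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
dist = CountingDistance()
medoids, cost = naive_swap_step(points, [0, 1], dist)
```

With n points and k medoids, this sketch evaluates k(n - k) candidate swaps and spends n·k distance computations on each, which is the kind of cost that pruning the search space is meant to reduce.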