Authors:
Michele Ianni [1]; Elio Masciari [2]; Giuseppe M. Mazzeo [3] and Carlo Zaniolo [4]
Affiliations:
[1] DIMES, University of Calabria, Rende (CS), Italy; [2] ICAR-CNR, Rende (CS), Italy; [3] Facebook, Menlo Park, U.S.A.; [4] UCLA, Los Angeles, U.S.A.
Keyword(s):
Clustering, Big Data, Spark.
Abstract:
The need to support advanced analytics on Big Data is driving data scientists’ interest toward massively parallel distributed systems and software platforms, such as Map-Reduce and Spark, that make their scalable utilization possible. However, when complex data mining algorithms are required, their fully scalable deployment on such platforms faces a number of technical challenges that grow with the complexity of the algorithms involved. Thus, algorithms that were originally designed for sequential execution must often be redesigned in order to use distributed computational resources effectively. In this paper, we explore these problems and then propose a solution that has proven to be very effective on the complex hierarchical clustering algorithm CLUBS+. By using four stages of successive refinement, CLUBS+ delivers high-quality clusters of data grouped around their centroids, working in a totally unsupervised fashion. Experimental results confirm the accuracy and scalability of CLUBS+ on Map-Reduce platforms.