clustering algorithm, based on the widely used Big
Data framework Apache Spark (Zaharia et al., 2010).
The implementation of a distributed algorithm calls for suitable solutions to some crucial problems of distributed programming environments, such as load balancing and fault tolerance. A very popular solution proposed in recent years for implementing parallel algorithms tailored for Big Data is Apache Spark (http://spark.apache.org/). Indeed, it offers very useful functions that relieve programmers from the explicit management of process assignment and memory. In this respect, we implemented our CLUBS-P algorithm on top of Spark.
This paper is organized as follows. In Section 2 we provide an overview of CLUBS+, while in Section 3 we first discuss the crucial points of CLUBS+ with a view to its parallelization, which lead to devising CLUBS-P, for which we present a new implementation based on Apache Spark (Section 4). In Section 5 we present the experimental results and, finally, in Section 6 we draw our conclusions.
2 AN OVERVIEW OF CLUBS+
In (Mazzeo et al., 2017) the authors introduced CLUBS+, a parameter-free around-centroid clustering algorithm based on a fast hierarchical approach that combines the benefits of the divisive and agglomerative approaches. The first operation performed by CLUBS+ is the definition of a binary space partition of the domain of the dataset D, which yields a set of coarse clusters that are then refined. The next operation is an agglomerative procedure performed over the previously refined clusters, which is followed by a final refinement phase. In the refinement phases, outliers are identified and the remaining points are assigned to the nearest cluster. In our running example of Fig. 1, we show how the successive steps of CLUBS+ applied to the two-dimensional data distribution of Fig. 1(a) produce the results shown in Fig. 1(b-e). In order to make this paper self-contained, in the following we provide some details about the four steps of CLUBS+, allowing a full comprehension of the parallel algorithm implemented in this paper; a more exhaustive description of the algorithm phases can be found in (Mazzeo et al., 2017).
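To fix ideas, the data flow through the four phases can be rendered as the following minimal structural sketch; the phase functions here are trivial placeholders standing in for the actual phases described in the rest of this section, not the authors' code.

    def divisive(points):          # phase 1: binary space partition
        return [points]            # placeholder: a single coarse block

    def refine(blocks):            # phases 2 and 4: outlier detection and
        return blocks              # nearest-cluster assignment (placeholder)

    def agglomerate(clusters):     # phase 3: bottom-up merging of clusters
        return clusters            # placeholder

    def clubs_plus(points):
        coarse = divisive(points)      # top-down splitting into blocks
        refined = refine(coarse)       # intermediate refinement
        merged = agglomerate(refined)  # agglomerative merging
        return refine(merged)          # final refinement

    print(clubs_plus([(0.0, 0.0), (1.0, 1.0)]))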
The divisive step of CLUBS+ performs a top-down binary partitioning of the data set to isolate hyper-rectangular blocks whose points are as close as possible to each other: this is equivalent to minimizing the within-cluster sum of squares (WCSS) of the blocks, an objective also pursued by k-means and many other algorithms.
Since finding the partitioning that minimizes a measure such as the WCSS is NP-hard even for two-dimensional points (Muthukrishnan et al., 1999), CLUBS+ uses a greedy approach where the splitting hyperplanes are orthogonal to the dimension axes, and the blocks are recursively split into pairs of clusters that optimize specific clustering criteria, discussed later. Given an input dataset, the algorithm begins with a single cluster S corresponding to the whole data set. S is entered into a priority queue Q, which contains the blocks of the partition that is iteratively built over the dataset. While Q is not empty, a block B is removed and split into a pair of blocks: if the split is effective, the pair of blocks replaces B in Q; otherwise, B becomes a 'final' block for this phase.
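A single-machine sketch of this control flow is given below (an illustration, not the authors' code): the priority queue is keyed by block WCSS, and a naive median split with a hypothetical minimum-gain threshold stands in for the marginal-based split search and the effectiveness criteria discussed next.

    import heapq
    import numpy as np

    def wcss(points):
        # Within-cluster sum of squares of a block around its centroid.
        c = points.mean(axis=0)
        return float(((points - c) ** 2).sum())

    def best_axis_split(points):
        # Try the median split on every axis and keep the one with the
        # largest WCSS reduction (a crude stand-in for the real scan).
        best = None
        for i in range(points.shape[1]):
            thr = np.median(points[:, i])
            left = points[points[:, i] <= thr]
            right = points[points[:, i] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            gain = wcss(points) - wcss(left) - wcss(right)
            if best is None or gain > best[0]:
                best = (gain, left, right)
        return best

    def divisive(points, min_gain):
        # Priority queue Q holding the blocks of the evolving partition;
        # the counter only breaks ties between equal-priority entries.
        q = [(-wcss(points), 0, points)]
        counter, final_blocks = 1, []
        while q:
            _, _, block = heapq.heappop(q)
            split = best_axis_split(block)
            if split is not None and split[0] > min_gain:  # effective split
                for part in split[1:]:
                    heapq.heappush(q, (-wcss(part), counter, part))
                    counter += 1
            else:
                final_blocks.append(block)  # 'final' block for this phase
        return final_blocks

    rng = np.random.default_rng(0)
    data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
    print(len(divisive(data, min_gain=50.0)), "final blocks")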
In order to efficiently find the best split for a block,
the marginal distributions of the block must be com-
puted. In particular, for each dimension $i$ of a $d$-dimensional block $B$ we must compute the functions $C^i_B : \mathbb{R} \to \mathbb{N}$ and $LS^i_B : \mathbb{R} \to \mathbb{R}^d$, defined as follows:

$$C^i_B(x) = \bigl|\{\, p \in B \wedge p[i] = x \,\}\bigr| \tag{1}$$

$$LS^i_B(x) = \sum_{p \in B \,\wedge\, p[i] = x} p \tag{2}$$
where p[i] is the i-th coordinate of the point p.
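Since these marginals are sums over disjoint groups of points, they are easy to compute in parallel. The following is a minimal PySpark sketch (an illustration, not the authors' implementation), assuming the points of a block are stored as coordinate tuples in an RDD: each point emits one record per dimension, and a reduceByKey pass accumulates the pair (C_B^i(x), LS_B^i(x)) of equations (1) and (2).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("marginals-sketch").getOrCreate()
    sc = spark.sparkContext

    # Toy 2-dimensional block: each point is a tuple of coordinates.
    points = sc.parallelize([(1.0, 2.0), (1.0, 3.0), (2.0, 2.0), (2.0, 5.0)])
    d = 2

    def emit(p):
        # One record per dimension: key (i, p[i]), value (count, point).
        for i in range(d):
            yield ((i, p[i]), (1, p))

    def combine(a, b):
        # Sum counts and linear sums component-wise.
        return (a[0] + b[0], tuple(u + v for u, v in zip(a[1], b[1])))

    # For each key (i, x) this yields the pair (C_B^i(x), LS_B^i(x)).
    marginals = points.flatMap(emit).reduceByKey(combine)
    for (i, x), (count, lsum) in sorted(marginals.collect()):
        print("dim", i, "x", x, "C =", count, "LS =", lsum)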
These functions can be represented as maps or, assuming that the coordinates of the points are integers, as arrays. In (Mazzeo et al., 2017) the authors showed that the split that minimizes the WCSS can be found through a linear scan of these maps/arrays.
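To see why a linear scan suffices: since $WCSS(B) = \sum_{p \in B} \|p\|^2 - \|S\|^2/n$ (with $S$ the linear sum and $n$ the size of $B$), the WCSS reduction achieved by splitting $B$ into $B_1$ and $B_2$ equals $\|S_1\|^2/n_1 + \|S_2\|^2/n_2 - \|S\|^2/n$, which depends only on counts and linear sums; prefix sums over the marginal arrays therefore give the gain of every candidate position in constant time. Below is a single-machine sketch of this scan along one dimension (an illustration under these assumptions, not the authors' code).

    import numpy as np

    def best_split_1d(counts, lsums):
        # counts[x]: C_B^i(x), number of points whose i-th coordinate is x
        # lsums[x]:  LS_B^i(x), d-dimensional linear sum of those points
        # Returns (position, gain) of the split, between x and x + 1,
        # that maximizes the WCSS reduction.
        n = counts.sum()
        total = lsums.sum(axis=0)
        base = total.dot(total) / n              # ||S||^2 / n
        c_pref = np.cumsum(counts)
        s_pref = np.cumsum(lsums, axis=0)
        best_pos, best_gain = None, 0.0
        for x in range(len(counts) - 1):
            n1 = c_pref[x]
            n2 = n - n1
            if n1 == 0 or n2 == 0:
                continue
            s1 = s_pref[x]
            s2 = total - s1
            gain = s1.dot(s1) / n1 + s2.dot(s2) / n2 - base
            if gain > best_gain:
                best_pos, best_gain = x, gain
        return best_pos, best_gain

    # Toy 1-d block: 3 points at 0, 2 points at 1, 4 points at 3.
    counts = np.array([3, 2, 0, 4])
    lsums = np.array([[0.0], [2.0], [0.0], [12.0]])
    print(best_split_1d(counts, lsums))          # split between x=1 and x=2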
The authors conducted an extensive evaluation of the effectiveness of the criteria proposed in the literature for estimating the quality and naturalness of a cluster set (Arbelaitz et al., 2013), and they concluded that the Calinski-Harabasz index (CH-index, for short) is the most suitable to their needs (Calinski and Harabasz, 1974). Briefly, after each step, we compute the new CH-index: if the split increases it, we consider the split effective and continue the divisive phase. Otherwise, we check a "local" criterion, based on the presence of a 'valley' in the marginal distribution. In fact, even if a split does not increase the overall CH-index, a very large local discontinuity can justify splitting a block anyway.
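For reference, for $n$ points partitioned into $k$ clusters the CH-index is defined as (Calinski and Harabasz, 1974):

$$\mathrm{CH} = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(n-k)}$$

where SSB is the between-cluster sum of squares (cluster sizes weighted by the squared distances of their centroids from the global centroid) and SSW is the total within-cluster sum of squares, i.e., the WCSS minimized by the divisive step; a split is deemed effective when it increases this ratio.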
When the divisive phase is completed, the overall space is partitioned into (a) blocks containing clusters, and (b) blocks that only contain noise points. In this intermediate refinement phase, CLUBS+ seeks to achieve the objectives of (I) separating the blocks that contain clusters from those that do not, and (II) generating well-rounded clusters for the blocks in the previous group.
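As an illustration of objective (II) in a distributed setting, the following PySpark sketch assigns every point to the nearest centroid and labels points that are far from all centroids as outliers; the fixed distance threshold here is a hypothetical placeholder, not the actual outlier criterion of CLUBS+.

    import numpy as np
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("refine-sketch").getOrCreate()
    sc = spark.sparkContext

    # Hypothetical centroids produced by the divisive phase, broadcast
    # to all workers, and a toy point set.
    centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
    bc = sc.broadcast(centroids)
    points = sc.parallelize([(0.5, 0.2), (9.8, 10.1), (50.0, 50.0)])

    def assign(p, radius=5.0):
        # Distance to every centroid; a point farther than `radius` from
        # all centroids is labeled as an outlier (cluster id -1).
        dist = np.linalg.norm(bc.value - np.array(p), axis=1)
        k = int(dist.argmin())
        return (k if dist[k] <= radius else -1, p)

    print(points.map(assign).collect())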