the MapReduce. Since its early development, k-means clustering (Lloyd, 1982) has been identified as having a very high complexity, and significant effort has been spent on tuning the algorithm and improving its performance. While k-means is a very simple and straightforward algorithm, it has two main issues: 1) the choice of the number of clusters and of the initial centroids, and 2) the iterative nature of the algorithm, which heavily impacts its scalability as the size of the dataset increases. The right set of initial centroids will lead to compact and well-formed clusters (Jin et al., 2006). On the other hand, in (McCallum et al., 2000) and (Guha et al., 1998), the authors explained how the number of iterations can be reduced by partitioning the dataset into overlapping subsets and iterating only over the data objects within the overlapping areas. This technique is called Canopy Clustering.
While the above studies largely concentrated on improving the k-means algorithm and reducing the number of iterations, there have been many other studies on its scalability. Recently, more research has been done on the MapReduce framework. In (Zhao et al., 2009), the authors presented a parallel k-means clustering algorithm based on MapReduce and showed that the k-means clustering algorithm can scale well and can be parallelized. They concluded that MapReduce can efficiently process large datasets.
Some researchers have conducted comparative studies of the various MapReduce frameworks available in the market and studied their effectiveness in the area of clustering large datasets. In (S. Ibrahim et al., 2009), the authors analysed the performance benefits of running Hadoop on virtual machines, and it was shown that MapReduce is a good tool for cloud-based data analysis. There have also been developments with Microsoft's DryadLINQ to perform data-intensive analysis, and the performance of DryadLINQ has been compared with Hadoop implementations (Ekanayake et al., 2009). In another study, the authors implemented a slightly enhanced model and architecture of MapReduce called Twister (Ekanayake et al., 2010). They compared the performance of Twister with Hadoop and DryadLINQ with the aim of expanding the applicability of MapReduce to data-intensive scientific applications. Two important observations can be made from this study. First, for computation-intensive workloads, threads and processes did not show any significant difference in performance. Second, for memory-intensive workloads, processes are 20 times faster than threads. In (Jiang et al., 2009), a comparative study of Hadoop MapReduce and a Framework for Rapid Implementation of Data Mining Engines was performed. The authors concluded that Hadoop is not well suited for modest-sized databases; however, when the datasets are large, there is a clear performance benefit in using Hadoop.
3 MAPREDUCE K-MEANS
TECHNIQUE
The sequential k-means algorithm starts by choosing k initial centroids, one for each cluster, and assigning each object of the dataset to the nearest centroid. It then recalculates the centroid of each cluster based on its member objects, goes through each data object again, and assigns it to its closest centroid. This step is repeated until there is no change in the centroids.
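For reference, the following is a minimal sketch of this sequential procedure in Java; the class and helper names (SequentialKMeans, closestCentroid) are illustrative assumptions and not part of the original algorithm description.

import java.util.Arrays;

public class SequentialKMeans {

    // Index of the centroid nearest to the given point (squared Euclidean distance).
    static int closestCentroid(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    // Repeats the assignment and update steps until no centroid changes.
    static double[][] cluster(double[][] data, double[][] centroids) {
        int k = centroids.length, dim = data[0].length;
        boolean changed = true;
        while (changed) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] point : data) {                       // assignment step
                int c = closestCentroid(point, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) sums[c][d] += point[d];
            }
            changed = false;
            for (int c = 0; c < k; c++) {                       // update step
                if (counts[c] == 0) continue;                   // leave empty clusters unchanged
                double[] mean = new double[dim];
                for (int d = 0; d < dim; d++) mean[d] = sums[c][d] / counts[c];
                if (!Arrays.equals(mean, centroids[c])) changed = true;
                centroids[c] = mean;
            }
        }
        return centroids;
    }
}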
In this work we transformed the original k-means
algorithm to meet the MapReduce requirements. The
new k-means technique consists of four parts: a map-
per, a reducer, a mapper with a combiner, and a re-
ducer with a combiner.
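As an illustration of how these parts could be wired together on Hadoop, the driver sketch below runs one MapReduce job per k-means iteration and feeds the recomputed centroids back into the next iteration. The class names (KMeansDriver, KMeansMapper, KMeansReducer), the configuration key kmeans.centroids, and the fixed iteration count are assumptions made for this sketch, not details taken from the paper; the mapper and reducer themselves are sketched in the following subsections.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {

    public static void main(String[] args) throws Exception {
        String input = args[0], outputBase = args[1];
        String centroids = args[2];                    // e.g. "1.0,2.0;3.5,0.5" (assumed format)
        for (int iter = 0; iter < 20; iter++) {        // a convergence test could replace the fixed count
            Configuration conf = new Configuration();
            conf.set("kmeans.centroids", centroids);   // made available to every mapper
            Job job = Job.getInstance(conf, "kmeans-iteration-" + iter);
            job.setJarByClass(KMeansDriver.class);
            job.setMapperClass(KMeansMapper.class);
            // job.setCombinerClass(KMeansCombiner.class);   // only for the combiner variants
            job.setReducerClass(KMeansReducer.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(input));
            String out = outputBase + "-" + iter;
            FileOutputFormat.setOutputPath(job, new Path(out));
            if (!job.waitForCompletion(true)) System.exit(1);
            centroids = readCentroids(conf, out);      // new centroid list returned to the start-up program
        }
    }

    // Collects the reducer output ("<id>\t<coordinates>") of one iteration into a centroid string.
    static String readCentroids(Configuration conf, String dir) throws Exception {
        StringBuilder sb = new StringBuilder();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus f : fs.listStatus(new Path(dir))) {
            if (!f.getPath().getName().startsWith("part-")) continue;
            try (BufferedReader r = new BufferedReader(new InputStreamReader(fs.open(f.getPath())))) {
                String line;
                while ((line = r.readLine()) != null) {
                    if (sb.length() > 0) sb.append(';');
                    sb.append(line.split("\t")[1]);    // keep only the centroid coordinates
                }
            }
        }
        return sb.toString();
    }
}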
3.1 Mapper and Reducer (MR)
The input dataset is distributed across the mappers. The initial set of centroids is either placed in a common location and accessed by all the mappers or distributed to each mapper. The centroid list uses an identifier for each centroid as the key and the centroid itself as the value. Each data object in the subset $(x_1, x_2, \ldots, x_m)$ is assigned to its closest centroid by the mapper. We use the Euclidean distance to measure the proximity between a data object and the centroids, and the data object is assigned to the closest centroid. When all the objects have been assigned to centroids, the mapper sends all the data objects, together with the centroids they are assigned to, to the reducer (Algorithm 1).
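A possible Hadoop implementation of this mapper is sketched below. The text input format, the comma-separated encoding of the data objects, and the kmeans.centroids configuration key are assumptions made for the sketch rather than details stated in the paper.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private double[][] centroids;

    // Loads the shared centroid list once per mapper (here from the job configuration).
    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        String[] rows = conf.get("kmeans.centroids").split(";");
        centroids = new double[rows.length][];
        for (int i = 0; i < rows.length; i++) centroids[i] = parsePoint(rows[i]);
    }

    // Assigns one data object to its closest centroid and emits (centroid id, data object).
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        double[] point = parsePoint(line.toString());
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;                          // squared Euclidean distance
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; nearest = c; }
        }
        context.write(new IntWritable(nearest), line);
    }

    private static double[] parsePoint(String text) {
        String[] parts = text.split(",");
        double[] p = new double[parts.length];
        for (int i = 0; i < parts.length; i++) p[i] = Double.parseDouble(parts[i]);
        return p;
    }
}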
After the execution of the mappers, the reducer takes as input the mapper outputs, i.e. the (key, value) pairs, and loops through them to calculate the new centroid
values. For each centroid, the reducer calculates a
new value based on the objects assigned to it in that
iteration. This new centroid list is sent back to the
start-up program (Algorithm 2).
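A matching reducer sketch is shown below; again, the comma-separated encoding of the objects is an assumption of the sketch. For each centroid identifier, it averages the member objects received from the mappers and emits the recomputed centroid, which the start-up program reads back for the next iteration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    // Receives all objects assigned to one centroid and emits the new centroid value.
    @Override
    protected void reduce(IntWritable centroidId, Iterable<Text> points, Context context)
            throws IOException, InterruptedException {
        double[] sum = null;
        long count = 0;
        for (Text point : points) {
            String[] parts = point.toString().split(",");
            if (sum == null) sum = new double[parts.length];
            for (int d = 0; d < parts.length; d++) sum[d] += Double.parseDouble(parts[d]);
            count++;
        }
        // The mean of the member objects becomes the new centroid.
        StringBuilder centroid = new StringBuilder();
        for (int d = 0; d < sum.length; d++) {
            if (d > 0) centroid.append(',');
            centroid.append(sum[d] / count);
        }
        context.write(centroidId, new Text(centroid.toString()));
    }
}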
3.2 Mapper and Reducer with
Combiner (MRC)
In the MapReduce model, the output of the map function is written to disk and then read by the reduce function; in the meantime, the output is sorted and shuffled. In order to reduce the time overhead between the mappers and the reducer, Algorithm 1 was modified to combine the map outputs in order to re-