Authors: Ilias K. Savvas (1) and M-Tahar Kechadi (2)
Affiliations: (1) T.E.I. of Larissa, Greece; (2) UCD, Ireland
Keyword(s): MapReduce, HDFS, Hadoop, Clustering, K-means Algorithm.
Related Ontology Subjects/Areas/Topics: Cloud Application Architectures; Cloud Application Scalability and Availability; Cloud Computing; Cloud Middleware Frameworks; Energy and Economy; Load Balancing in Smart Grids; Platforms and Applications; Smart Grids
Abstract:
The Apache Hadoop software library is a framework for the distributed processing of large data sets. Within it, HDFS is a distributed file system that provides high-throughput access to application data, and MapReduce is a software framework for distributed computation over large data sets. Such huge collections of raw data require a fast and accurate mining process in order to extract useful knowledge. One of the most popular data mining techniques is the K-means clustering algorithm. In this paper, we developed a distributed version of the K-means algorithm using the MapReduce framework on the Hadoop Distributed File System. Theoretical and experimental results demonstrate the efficiency of the technique.
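
As an illustration of how K-means can be phrased in map and reduce steps, the following is a minimal in-memory sketch in Python. It is not the authors' implementation and does not use Hadoop or HDFS: the function names (kmeans_map, kmeans_reduce, kmeans_iteration, kmeans), the Euclidean distance, the convergence tolerance, and the choice to leave an empty cluster's centroid unchanged are all assumptions made for this example. In a real MapReduce job, the grouping done here by a dictionary would be performed by the framework's shuffle phase.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans_map(point, centroids):
    """Map step (hypothetical): emit (nearest centroid index, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step (hypothetical): average the points assigned to one centroid."""
    total = None
    count = 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """One map/shuffle/reduce round; the dictionary stands in for the shuffle."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid with no assigned points is left where it is.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat rounds until no centroid moves more than tol, or max_iter is hit."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(new_centroids, centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids
```

For example, calling kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)]) assigns the first two points to one cluster and the last two to the other, then returns the mean of each group as the final centroids. Emitting a count alongside each point is a design choice that would also allow partial sums to be combined before the reduce step.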