Authors:
Dominique Thiébaut; Yang Li; Diana Jaunzeikare; Alexandra Cheng; Ellysha Raelen Recto; Gillian Riggs; Xia Ting Zhao; Tonje Stolpestad and Cam Le T. Nguyen
Affiliation:
Smith College, United States
Keyword(s):
Grid computing, XGrid, Hadoop, Wikipedia, Data mining, Performance.
Related Ontology Subjects/Areas/Topics:
Cloud Applications Performance and Monitoring; Cloud Computing; Platforms and Applications
Abstract:
We present a simple comparison of the performance of three different cluster platforms: Apple’s XGrid, a local Hadoop cluster of Linux workstations, and an Elastic MapReduce cluster rented from Amazon. Hadoop is the open-source implementation of Google’s MapReduce. Performance is measured as the total execution time each platform takes to parse a 27-GByte XML dump of the English Wikipedia. We show that for this specific workload, XGrid yields the fastest execution time, with the local Hadoop cluster a close second. The overhead of fetching data from Amazon’s Simple Storage Service (S3), along with the inability to skip the reduce, sort, and merge phases on Amazon, penalizes this platform, which is targeted at much larger data sets.