such as evaluating the scheduling policy only (Kokaly
et al., 2009). Because Hadoop follows the approach
of Google’s MapReduce, it is well suited to process
large data sets in text format, and its performance and tuning for sorting large text files have been carefully studied (O’Malley and Murthy, 2009).
There are few published comparisons of Hadoop/MapReduce against other computing solutions; the study by Pavlo et al. (Pavlo et al., 2009), together with a rebuttal by Google’s Dean and Ghemawat (Dean and Ghemawat, 2010), stands out. Pavlo et al. compare the MapReduce paradigm to parallel and distributed SQL database management systems (DBMS) and conclude that DBMSs perform better on their benchmarks, while recognizing Hadoop’s advantages in fault tolerance and ease of installation. The rebuttal published in CACM is an indication that fairly comparing different approaches to solving identical parallel problems is nontrivial, and that the assumptions made can greatly influence the results.
Iosup and Epema (Iosup and Epema, 2006) propose the creation and use of synthetic benchmarks to rate the performance of grids, and evaluate their methodology on only one system, which unfortunately uses neither XGrid nor Hadoop.
While the work in (Raicu et al., 2006) is a very comprehensive framework for measuring the performance of various grids, it covers world-wide networks of research computers and centralized Linux networks, but not Hadoop or XGrid clusters.
The evaluation of MapReduce on multicore systems presented by Ranger et al. (Ranger et al., 2007) is also worth mentioning; although it does not compare Hadoop’s performance to that of other grids, it points to multicores as a competitor to clusters in some areas, and we present data supporting this argument.
It is a challenging task to provide a fair comparison of the computational power of systems that operate at different clock speeds, with different memory support, different file systems, and different network bandwidths. Nonetheless, we feel that our attempt to find the fastest solution for our Wikipedia processing problem is worth reporting, and we hope our results will help others pick the best cluster for their needs.
3 PROBLEM STATEMENT
Our goal is simple: find, among the available platforms, the parallel approach that provides the fastest parsing of the entire contents of a dump of the English Wikipedia, gathering word-frequency statistics for all pages.
Figure 1: Main block diagram of the processing of the XML dump.
The process, illustrated in Figure 1, is the following: the data dump (pages-articles, collected 10/17/2009) is first retrieved from MediaWiki (MediaWiki, 2002) and then parsed by a serial C++ program, divide, which formats 9.3 million XML entries into 431 XML files, each containing as many whole entries as can fit in 64 Mbytes. We refer to these 431 files as splits, following Hadoop terminology. Each individual entry in a split represents a Wikipedia page and is formatted as a single line of text, making it directly compatible with the Hadoop data input format.
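As an illustration, the following is a minimal sketch of this splitting step, assuming the dump is scanned for <page>...</page> elements and that split files are simply numbered; the entry extraction and file naming used by the actual divide program may differ.

// Sketch of the "divide" step: pack whole <page> entries, flattened to
// one line each, into numbered ~64 Mbyte split files. The entry detection
// and file naming here are simplifications, not the authors' exact code.
#include <fstream>
#include <sstream>
#include <string>

int main() {
    const std::size_t SPLIT_BYTES = 64 * 1024 * 1024;    // 64 Mbytes per split
    std::ifstream dump("pages-articles.xml");            // hypothetical input name
    std::ofstream split;
    std::size_t written = SPLIT_BYTES;                    // forces opening split_0
    int splitId = 0;
    std::string line, entry;
    bool inPage = false;

    while (std::getline(dump, line)) {
        if (line.find("<page>") != std::string::npos) { inPage = true; entry.clear(); }
        if (inPage) entry += line + ' ';                   // flatten the entry to one line
        if (inPage && line.find("</page>") != std::string::npos) {
            inPage = false;
            if (written + entry.size() + 1 > SPLIT_BYTES) { // entry would overflow: new split
                std::ostringstream name;
                name << "split_" << splitId++ << ".xml";
                if (split.is_open()) split.close();
                split.open(name.str());
                written = 0;
            }
            split << entry << '\n';                        // one whole entry per line
            written += entry.size() + 1;
        }
    }
    return 0;
}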
The 431 splits are then processed in parallel (parallel parse): stop words are removed, and word statistics are computed and stored in N output files, where N varies depending on the platform and configuration used. The storing of these files into a database is not covered in this comparison; only the parallel parse module is the subject of our comparative study (a sketch of its core logic follows this paragraph). We pick four different frameworks that are easily accessible on and off campus: a local XGrid cluster of iMacs (in computer labs and classrooms), a local cluster of Fedora workstations running Hadoop, and Amazon’s Elastic MapReduce framework (Amazon, 2002), both as a cluster and as a single multicore instance. The input and output files may reside in different storage systems, which we select so that the parallel computation of Wikipedia becomes an easily interchangeable stage of a larger computational pipeline. For this reason, we include setup, upload, and download times of input and output files in our comparison data.
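To make the parallel parse step concrete, the following is a minimal sketch of the work done on a single split, assuming a simple alphabetic tokenizer and an abbreviated stop-word list; the actual tokenization rules, stop-word list, and output format of our program are simplified here. Each platform then runs this per-split work concurrently on its workers, producing the N output files described above.

// Sketch of the core of "parallel parse" on one split: each input line is
// a whole Wikipedia page; words are lowercased, stop words are dropped,
// and word frequencies are accumulated. The tokenizer and the stop-word
// list below are illustrative assumptions.
#include <cctype>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
#include <unordered_map>

void parseSplit(const std::string& splitPath,
                const std::set<std::string>& stopWords,
                std::unordered_map<std::string, long>& counts) {
    std::ifstream split(splitPath);
    std::string page, word;
    while (std::getline(split, page)) {                    // one page per line
        word.clear();
        for (unsigned char c : page) {
            if (std::isalpha(c)) {
                word += static_cast<char>(std::tolower(c));
            } else if (!word.empty()) {                    // end of a token
                if (!stopWords.count(word)) ++counts[word];
                word.clear();
            }
        }
        if (!word.empty() && !stopWords.count(word)) ++counts[word];
    }
}

int main() {
    const std::set<std::string> stopWords = {"the", "a", "of", "and", "in"};  // abbreviated
    std::unordered_map<std::string, long> counts;
    parseSplit("split_0.xml", stopWords, counts);          // hypothetical split name
    for (const auto& kv : counts)
        std::cout << kv.first << '\t' << kv.second << '\n';
    return 0;
}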
The target execution time to beat is 15 hours, 34 minutes, and 28 seconds (56,068 seconds) for the serial execution of parallel parse on one of the 2-core iMac workstations used as a standard agent in our XGrid, with all files (executable, input, and output) stored on the iMac’s local disk. This execution involves only one core, because the serial C++ program does not use multiple threads.
In the next section we present the XGrid infrastructure and the performance measured on this network.