path and forecast the job’s completion time. Tests
have been conducted on the software prototype in order to verify that the actual job completion time (the one measured at execution time) stays close to the forecast.
The paper is organized as follows. In Section 2 the literature is reviewed. In Section 3 an overview of the proposal is presented. Technical details of the proposed system architecture are discussed in Section 4. In Section 5 we delve into the strategy implemented by the job scheduler component. In Section 6 the results of the tests run on the system's software prototype are presented. Section 7 concludes the work.
2 RELATED WORK
In the literature, two main approaches are followed by researchers to efficiently process geo-distributed data: a) enhanced versions of the plain Hadoop implementation that account for node and network heterogeneity (Geo-hadoop approach); b) hierarchical frameworks that gather and merge the results of many Hadoop instances run locally on distributed clusters (Hierarchical approach). The former approach aims at optimizing job performance by smartly orchestrating the Hadoop steps. The latter's philosophy is to exploit the native potential of Hadoop on a local basis and then merge the results collected from the distributed computations. In the following, a brief review of these works is provided.
Geo-hadoop approaches reconsider the phases of a job's execution flow (Push, Map, Shuffle, Reduce) from a perspective where data are distributed at a geographic scale and the available resources are not homogeneous. With the aim of reducing a job's average makespan, the phases and their relative timing must be adequately coordinated. Some researchers have proposed enhanced versions of Hadoop capable of optimizing only a single phase (Kim et al., 2011; Mattess et al., 2013). Heintz et al. (Heintz et al., 2014) analyze the dynamics of the phases and address the need for a comprehensive, end-to-end optimization of the job's execution flow. To this end, they present an analytical model that accounts for parameters such as the network links, the nodes' capacity and the application profile, and casts the makespan minimization problem as a linear program solvable with mixed-integer programming techniques.
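For concreteness, the following is a minimal sketch, in Python with the PuLP modeling library, of how a makespan-minimization problem of this kind can be cast as a mixed-integer program. It is not Heintz et al.'s actual model: the binary assignment variables below consider task placement only, and the task sizes and node speeds are invented for illustration.

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, PULP_CBC_CMD

task_mb = [60, 40, 30, 20, 10]     # input size of each map task (assumed values)
speed = {"n1": 10.0, "n2": 5.0}    # MB/s processed by each node (assumed values)

prob = LpProblem("makespan_min", LpMinimize)
# x[t, n] = 1 if task t is assigned to node n
x = {(t, n): LpVariable(f"x_{t}_{n}", cat="Binary")
     for t in range(len(task_mb)) for n in speed}
T = LpVariable("makespan", lowBound=0)

prob += T                          # objective: minimize the makespan
for t in range(len(task_mb)):      # every task runs on exactly one node
    prob += lpSum(x[t, n] for n in speed) == 1
for n in speed:                    # each node's total busy time bounds T
    prob += lpSum(task_mb[t] / speed[n] * x[t, n]
                  for t in range(len(task_mb))) <= T

prob.solve(PULP_CBC_CMD(msg=False))
print("forecast makespan (s):", T.value())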
Hierarchical approaches tackle the problem from a perspective that envisions two (or sometimes more) computing levels: a bottom level, where several plain MapReduce computations occur on local data only, and a top level, where a central entity coordinates the gathering of the local results and the packaging of the final result. In (Luo et al., 2011) the authors present a hierarchical MapReduce architecture and introduce a load-balancing algorithm that distributes the workload across multiple clusters. The balancing is guided by the number of cores available on each cluster, the number of Map tasks potentially runnable at each cluster and the nature (CPU or I/O bound) of the application. The authors also propose to compress data before their migration from one data center to another. Jayalath et al. (Jayalath et al., 2014) make an exhaustive analysis of the issues concerning the execution of MapReduce on geo-distributed data. The particular context they address is one in which multiple MapReduce operations need to be performed in sequence on the same data.
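To fix ideas, here is a deliberately simplified, core-proportional workload split in the spirit of the balancing criterion of (Luo et al., 2011). The actual algorithm also weighs the runnable Map tasks and the application's CPU/I/O profile, which this toy sketch ignores; cluster names and numbers are invented.

def balance_tasks(num_tasks, cores_per_cluster):
    """Split num_tasks across clusters proportionally to their core counts."""
    total = sum(cores_per_cluster.values())
    share = {c: num_tasks * k // total for c, k in cores_per_cluster.items()}
    leftover = num_tasks - sum(share.values())
    # hand the rounding remainder to the largest clusters first
    for c in sorted(cores_per_cluster, key=cores_per_cluster.get, reverse=True):
        if leftover == 0:
            break
        share[c] += 1
        leftover -= 1
    return share

print(balance_tasks(100, {"clusterA": 64, "clusterB": 32, "clusterC": 16}))
# -> {'clusterA': 58, 'clusterB': 28, 'clusterC': 14}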
With respect to the cited works, ours places itself among the hierarchical ones. The approach we propose differs in that it strives to exploit fresh information continuously sensed from the distributed computing context (the available computing capacity of the sites and the inter-site bandwidth), and it calls on the integer partitioning technique to compose the space of the job's potential execution paths and seek the best one.
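As a generic illustration of the integer partitioning technique just mentioned (the sketch below is ours, not the scheduler's actual code, whose strategy is detailed in Section 5), the partitions of a small integer n enumerate the candidate ways n units of work can be spread over the available sites:

def partitions(n, max_part=None):
    """Yield every partition of the integer n as a non-increasing tuple."""
    if max_part is None or max_part > n:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(max_part, 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest

# Each partition is one candidate execution path: e.g. (2, 1, 1) means
# two work units on one site and one unit on each of two other sites.
for p in partitions(4):
    print(p)   # (4,) (3, 1) (2, 2) (2, 1, 1) (1, 1, 1, 1)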
3 SYSTEM DESIGN
According to the MapReduce paradigm, a generic computation is called a “job”. Upon a job submission, a scheduling system is responsible for splitting the job into several tasks and mapping them to a set of available nodes within a cluster. The performance of a job execution is measured by its completion time (some refer to it as the makespan), i.e., the time the job takes to complete. Apart from the size of the data to be processed, that time heavily depends on the job's execution flow determined by the scheduling system and on the computing power of the cluster nodes where the tasks are actually executed. In a scenario where the computing nodes reside in clusters that are geographically distant from one another, additional factors may affect the job performance. First, communication links among clusters (inter-cluster links) are often inhomogeneous and have a much lower bandwidth than communication links among nodes within a cluster (intra-cluster links). Second, clusters are not designed to have similar or comparable computing capacity, so they may turn out to be heterogeneous in terms of computing power. Third, it is not rare that the data set to be processed is unevenly distributed over the clusters.
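A back-of-the-envelope computation illustrates why the inter-cluster bandwidth weighs so heavily on the completion time; the link capacities below are assumptions for illustration, not measurements:

data_gb = 50          # data to be moved for a remote computation (assumed)
intra_gbps = 10.0     # typical intra-cluster link capacity (assumed)
inter_gbps = 0.5      # typical inter-cluster WAN link capacity (assumed)

def transfer_seconds(size_gb, link_gbps):
    return size_gb * 8 / link_gbps   # GB -> gigabits, divided by Gb/s

print(transfer_seconds(data_gb, intra_gbps))   # 40 s within a cluster
print(transfer_seconds(data_gb, inter_gbps))   # 800 s between clusters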