2015) we proposed a job scheduling algorithm capable of generating all possible combinations of mappers and the related assigned data fragments by leveraging combinatorial theory. The strategy of that approach was to explore the entire space of potential execution paths and find the one providing the best (minimum) execution time. Unfortunately, the number of potential paths to visit may be very large if we consider that many sites may be involved in the computation and that the data sets targeted by a job might be fragmented at any level of granularity. Of course, the time needed to find the best execution plan increases considerably with the number of fragments and the number of sites in the network. That time may turn into an unacceptable overhead affecting the performance of the overall job. While such an approach guarantees the optimal solution, it is not scalable.
To overcome the scalability problem, in this work we propose a new approach that searches for a good (not necessarily the best) job execution plan which is still capable of providing an acceptable execution time for the job. Let us consider the whole job makespan as divided into two phases: a pre-processing phase, during which the job execution plan is defined, and a processing phase, during which the actual execution is carried out. The new approach aims to keep the pre-processing phase as short as possible, even though this may stretch the processing phase. We will prove that, despite this stretch of the job's execution, the overall job makespan benefits.
Well-known and common optimization algorithms follow an approach based on a heuristic search paradigm known as one-point iterative search. One-point search algorithms are relatively simple to implement, computationally inexpensive and quite effective for large-scale problems. In general, the search starts by generating a random initial solution and exploring the nearby area. A neighboring candidate can be accepted or rejected according to a given acceptance condition, which is usually based on the evaluation of a cost function. If it is accepted, it serves as the current solution for the next iteration; the search ends when no further improvement is possible. Several methodologies have been introduced in the literature for accepting candidates with worse cost function scores. In many one-point search algorithms, this mechanism is based on a so-called cooling schedule (CS) (Hajek, 1988). A weak point of the cooling schedule is that its optimal form is problem-dependent. Moreover, it is difficult to find this optimal cooling schedule manually.
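For concreteness, the sketch below (in Python, purely for illustration) shows a typical one-point search whose acceptance rule relies on a geometric cooling schedule, in the style of simulated annealing. The function and parameter names are our own; in particular, t0 (initial temperature) and alpha (cooling rate) are exactly the kind of problem-dependent control parameters discussed above.

    import math
    import random

    def cooling_schedule_search(initial, neighbor, cost,
                                t0=100.0, alpha=0.95, iterations=10000):
        # One-point search with a geometric cooling schedule
        # (simulated-annealing style). A poor choice of t0 or alpha either
        # freezes the search too early or wastes iterations accepting almost
        # every worse candidate.
        current, current_cost = initial, cost(initial)
        temperature = t0
        for _ in range(iterations):
            candidate = neighbor(current)
            candidate_cost = cost(candidate)
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse candidates with a
            # probability that shrinks as the temperature cools down.
            if delta <= 0 or random.random() < math.exp(-delta / temperature):
                current, current_cost = candidate, candidate_cost
            temperature *= alpha
        return current, current_cost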
The generation and evaluation of the job's execution path, which constitute our optimization problem, are strictly dependent on the physical context in which the data to process are distributed. An optimization algorithm based on the cooling schedule mechanism would very likely not fit our purpose. Finding a control parameter that is good for any variation of the physical context and in any scenario is not an easy task, and if it is set up incorrectly, the optimization algorithm fails to shorten the search time. As this parameter is problem-dependent, its fine-tuning would always require preliminary experiments. Unfortunately, such a preliminary study can lead to additional processing overhead. Based on these considerations, we have discarded optimization algorithms that envision a cooling schedule phase.
The optimization algorithm we propose to use to seek a job execution plan is the Late Acceptance Hill Climbing (LAHC) (Burke and Bykov, 2008). The LAHC is a one-point iterative search algorithm which starts from a randomly generated initial solution and, at each iteration, evaluates a new candidate solution. The LAHC maintains a fixed-length list of the previously computed values of the cost function. The candidate solution's cost is compared with the last element of the list: if it is not worse, the candidate is accepted. After the acceptance procedure, the cost of the current solution is added on top of the list and the last element of the list is removed. This method allows some worsening moves, which may prolong the search time but, at the same time, helps to avoid local minima. The LAHC approach is simple, easy to implement and yet an effective search procedure. The algorithm depends on a single input parameter L, representing the length of the list. It is possible to make the processing time of LAHC independent of the length of the list by eliminating the shifting of the whole list at each iteration.
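The following sketch (in Python, for illustration only) follows this description: the only tunable parameter is the list length L, and a circular buffer replaces the explicit shifting of the list, so each iteration costs the same regardless of L. The idle-iteration stopping bound is an assumption made for the example, not the stopping condition used in our implementation.

    def lahc(initial, neighbor, cost, list_length=50, max_idle=20000):
        # Late Acceptance Hill Climbing as described above. The cost history
        # of length L is kept in a circular buffer, so no list shifting is
        # needed and each iteration runs in constant time regardless of L.
        current, current_cost = initial, cost(initial)
        history = [current_cost] * list_length
        iteration, idle = 0, 0
        while idle < max_idle:          # assumed stopping condition
            candidate = neighbor(current)
            candidate_cost = cost(candidate)
            idle = 0 if candidate_cost < current_cost else idle + 1
            v = iteration % list_length  # position of the "last" list element
            # Accept the candidate if it is not worse than the cost recorded
            # L iterations ago.
            if candidate_cost <= history[v]:
                current, current_cost = candidate, candidate_cost
            history[v] = current_cost    # overwrite in place instead of push/pop
            iteration += 1
        return current, current_cost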
The search procedure carried out by the LAHC is detailed in the Algorithm 1 listing. The LAHC algorithm first generates an initial solution, which consists of a random assignment of data blocks to mappers. The resulting graph represents the execution path. The cost evaluated for this execution path becomes the current solution's cost and is added to the list. At each iteration, the algorithm evaluates a new candidate (an assignment of data blocks to mapper nodes) and calculates the cost of the related execution path. The candidate's cost is compared with the last element of the list and, if not worse, the candidate is accepted as the new current solution and its cost is added on top of the list. This procedure continues until a stopping condition is reached. The last solution found is chosen as the execution path to enforce.
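To make the procedure concrete, the short usage example below applies the lahc sketch given earlier to a toy instance of the block-to-mapper assignment. The block identifiers, mapper site names and the load-based cost function are illustrative placeholders: the actual cost of an execution path depends on the geo-distributed context (network transfer times, per-site compute capacity, and so on), which is not reproduced here.

    import random

    # Hypothetical problem instance (placeholders, not real sites or blocks).
    BLOCKS = list(range(20))
    MAPPERS = ["site-A", "site-B", "site-C"]

    def random_assignment():
        # Initial solution: each data block is assigned to a random mapper.
        return {b: random.choice(MAPPERS) for b in BLOCKS}

    def reassign_one_block(assignment):
        # Neighbor move: move one randomly chosen block to a (possibly
        # different) mapper node.
        candidate = dict(assignment)
        candidate[random.choice(BLOCKS)] = random.choice(MAPPERS)
        return candidate

    def plan_cost(assignment):
        # Placeholder cost: the load of the busiest mapper; a real evaluation
        # would estimate the makespan of the corresponding execution path.
        load = {m: 0 for m in MAPPERS}
        for site in assignment.values():
            load[site] += 1
        return max(load.values())

    best_plan, best_cost = lahc(random_assignment(), reassign_one_block,
                                plan_cost, list_length=50, max_idle=5000)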
In the next section we compare the LAHC algo-