HANDLING DATA SKEW IN MAPREDUCE
Benjamin Gufler¹, Nikolaus Augsten², Angelika Reiser¹ and Alfons Kemper¹
¹Technische Universität München, München, Germany
²Free University of Bozen-Bolzano, Bolzano, Italy
Keywords:
MapReduce, Data skew, Load balancing.
Abstract:
MapReduce systems have become popular for processing large data sets and are increasingly being used in
e-science applications. In contrast to simple application scenarios like word count, e-science applications in-
volve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime
complexity of the reducer task is typically high, and (b) scientific data is often skewed. This leads to highly
varying execution times for the reducers. Varying execution times result in low resource utilisation and high
overall execution time since the next MapReduce cycle can only start after all reducers are done.
In this paper we address the problem of efficiently processing MapReduce jobs with complex reducer tasks
over skewed data. We define a new cost model that takes into account non-linear reducer tasks, and we provide
an algorithm to estimate the cost in a distributed environment. We propose two load balancing approaches, fine
partitioning and dynamic fragmentation, that are based on our cost model and can deal with both skewed data
and complex reduce tasks. Fine partitioning produces a fixed number of data partitions; dynamic fragmentation
dynamically splits large partitions into smaller portions and replicates data if necessary. Our approaches can
be seamlessly integrated into existing MapReduce systems like Hadoop. We empirically evaluate our solution
on both synthetic data and real data from an e-science application.
1 INTRODUCTION
Over the last years, MapReduce has become popular
for processing massive data sets. Most research in
this area considers simple application scenarios like
log file analysis, word count, or sorting, and current
systems adopt a simple hashing approach to distribute
the load to the reducers.
Processing massive amounts of data is also a key
challenge in e-science. However, scientific applica-
tions exhibit properties to which current MapReduce
systems are not geared. First, the runtime complex-
ity of the reducer tasks is often non-linear. Second,
the distribution of scientific data is typically skewed.
The high runtime complexity amplifies the skew and
leads to highly varying execution times of the reduc-
ers. Thus reducers with a low load have to wait for
the reducers with high load.
MapReduce jobs with high reducer complexity
include data mining tasks, which are popular in
e-science and often have higher polynomial or even
exponential worst case complexity. Consider, for ex-
ample, the Millennium simulation (Springel et al.,
2005), an important astrophysical data set that conta-
ins more than 18 million trees with a total of 760 mil-
lion nodes describing the evolution of the universe.
Experiments with frequent subtree mining on a sub-
set of the Millennium trees resulted in execution time
differences of several hours between the reducers.
Scientific data is often skewed. Skew arises from
physical properties of the observed objects (e.g., the
height of patients in medical studies), from research
interests focussing on subsets of the entire domain
(e.g., areas with active volcanoes in geosciences), or
from properties of the instruments and software em-
ployed to gather the data. In the Millennium simula-
tion, each tree node has a mass. The mass distribution
is highly skewed, with the 7 most frequent values ap-
pearing over 20 million times each, while almost 75%
of the values appear no more than 10 times.
In the map phase, MapReduce systems generate
(key,value) pairs from the input data. A cluster is the
subset of all (key,value) pairs, or tuples, sharing the
same key. Standard systems like Hadoop (http://hadoop.apache.org) use hashing
to distribute the clusters to the reducers. Each reducer
gets approximately the same number of clusters. For
skewed data, this approach is not good enough since
clusters may vary considerably in size. With non-
linear reducers, the problem is even worse. The non-
linear reduce function is evaluated for each cluster
and even sets of clusters with the same overall number
of tuples can have very different execution times. Pro-
cessing a small number of large clusters takes much
longer than processing many small clusters.
Example 1. Assume a set of clusters consisting of four tuples. The cost of a cluster is the number of tuples to the third. If the four tuples belong to one cluster, its cost is 4^3 = 64. If the set consists of two clusters with two tuples each, the cost is only 2 · 2^3 = 16.
In this paper, we design a new cost model that
takes into account non-linear reducer functions and
skewed data distributions. Instead of considering only
the size of the data partition (set of clusters) that is as-
signed to each reducer, we estimate its execution cost.
This is a challenging problem since a single cluster
may be produced by different mappers in a distributed
manner. Computing detailed statistics for each clus-
ter is too expensive since the number of clusters may
be proportional to the data size. We estimate cluster
cardinalities and their cost from aggregated statistics
computed on distributed mappers.
We design two new algorithms that use our cost
model to distribute the work load to reducers. The
first algorithm, fine partitioning, splits the input data
into a fixed number of partitions. The number of par-
titions is larger than the number of reducers, and the
goal is to distribute the partitions such that the execu-
tion times for all reducers are similar. Fine partition-
ing does not control the cost of the partitions while
they are created, but achieves balanced loads by dis-
tributing expensive partitions to different reducers. In
our second approach, dynamic fragmentation, expen-
sive partitions are split locally by each mapper while
they are created, and tuples are replicated if neces-
sary. As a result, the cost of the partitions is more
uniform and a good load balancing is easier to achieve
for highly skewed distributions.
Summarising, our contribution is the following:
• We present a new cost model that takes into account non-linear reducers and skewed data distributions, and we propose an efficient algorithm to estimate the cost in a distributed environment.
• We propose two load balancing algorithms that are based on our cost model and evenly distribute the load on the reducers. The first algorithm, fine partitioning, splits the data into a fixed number of partitions, estimates their cost, and distributes them appropriately. The second approach, dynamic fragmentation, controls the cost of the partitions while they are created.
• We empirically evaluate our techniques on synthetic data sets with controlled skew, as well as on real e-science data from the astrophysics domain.
2 DATA SKEW IN MapReduce
From a data-centric perspective, a MapReduce system works as follows. m mappers transform the input to a MapReduce job into a bag of (key, value) pairs, the intermediate result I ⊆ K × V. The sub-bag of I containing all (key, value) pairs with a specific key k is a cluster

C(k) = {(k, v) ∈ I}
The intermediate result is split into p partitions. The
partition for an intermediate tuple is determined by
applying a partitioning function
π : K → {1, . . . , p}
to the key of the tuple. This way, all tuples belonging
to the same cluster are placed into the same partition.
A partition is thus a “container”, or bucket, for one or
more clusters. We denote a partition j as
P(j) = ⊎_{k∈K: π(k)=j} C(k)
The partitions are distributed to r reducers which pro-
duce the output of the MapReduce job. All partitions
assigned to the same reducer form a partition bundle.
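For illustration, the following Python sketch (our own illustration, not part of any MapReduce framework; all names are ours) groups intermediate (key, value) pairs into clusters and partitions using a simple hash-based partitioning function π.

from collections import defaultdict

def build_partitions(intermediate, p):
    """Group intermediate (key, value) pairs into clusters and partitions.

    `intermediate` is an iterable of (key, value) pairs (the bag I);
    `p` is the number of partitions. The partitioning function
    pi(k) = hash(k) mod p sends all tuples of a cluster to the same
    partition.
    """
    partitions = defaultdict(lambda: defaultdict(list))  # j -> key -> cluster C(k)
    for key, value in intermediate:
        j = hash(key) % p          # partitioning function pi
        partitions[j][key].append((key, value))
    return partitions

# Tiny usage example: 6 tuples, 3 clusters, 2 partitions.
I = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5), ("a", 6)]
parts = build_partitions(I, p=2)
for j, clusters in sorted(parts.items()):
    print(j, {k: len(c) for k, c in clusters.items()})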
A good data distribution tries to balance the clus-
ters such that all reducers will require roughly the
same time for processing. There are two aspects
which need to be considered.
1. Number of Clusters. Some reducers might get
more clusters than others, leading to larger par-
tition bundles and longer execution times.
2. Difficulty of Clusters. The execution times may
vary from cluster to cluster. Reducers with “diffi-
cult” clusters might take much longer to complete
than reducers with “easy” clusters, even if the
overall size of the partition bundles is the same.
The first of these two points can be solved by
using an appropriate hash function for partitioning
the data. The second point describes two challenges
which can not be handled by optimal hashing: clusters
of varying size and clusters of varying complexity. In
the following we will elaborate on these two aspects
of load balancing in MapReduce systems.
3 COST MODEL
The reducer workload should be evenly distributed
to all participating nodes. This maximises resource
utilisation, as no reducers remain idle, waiting for
some overloaded reducers to complete. Moreover,
well-balanced execution times minimise the time un-
til job completion, because parallel processing is bet-
ter exploited. Finally, similar execution times are
a common (and often implicit) assumption in both
scheduling and failure detection strategies proposed
for MapReduce (Dean and Ghemawat, 2008; Zaharia
et al., 2008).
3.1 Current Situation
In state of the art MapReduce systems, like Hadoop,
every mapper partitions the share of intermediate re-
sults it creates into r partitions (i.e., p = r in the par-
titioning function π defined in Section 2, and all par-
tition bundles consist of a single partition only). As
all mappers use the same partitioning function, tuples
belonging to the same cluster are all placed into the
same partition. This is visualised in Figure 1, where
we have two reducers. Thus, two partitions are cre-
ated per mapper. The partitions of the first mapper
are shown in more detail on the left: the first partition
contains four clusters, the second one holds three.
Figure 1: Traditional Data Distribution.
Typically, a hash function is used for partitioning.
Assuming a reasonably good hash function, the clus-
ters are uniformly distributed to the partitions. Ev-
ery partition is then assigned to a dedicated reducer
for further processing. Figure 1 shows an example of
the current data distribution strategy in MapReduce:
partition P0 of every mapper is assigned to the first
reducer, P1 to the second one.
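A minimal sketch of this default scheme follows (the function name is ours); Hadoop's default HashPartitioner behaves analogously, deriving the partition from the key's hash code modulo the number of reduce tasks, so that partition index and reducer index coincide.

def default_partition(key, r):
    """Default distribution: p = r partitions, one per reducer."""
    return hash(key) % r  # partition index == reducer index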
This approach is perfectly suitable in situations
where the key frequencies are (almost) uniformly dis-
tributed, and the amount of work a reducer spends per
cluster does not vary strongly. In many other situa-
tions, however, distributing the keys uniformly is sub-
optimal. The most prominent problems are:
1. Skewed Key Frequencies. If some keys appear
more frequently in the intermediate data tuples,
the number of tuples per cluster will vary. Even if
every reducer receives the same number of clus-
ters, the overall number of tuples per reducer will
be different.
2. Skewed Tuple Sizes. In applications which hold
complex objects within the tuples, unbalanced
cluster sizes can arise from skewed tuple sizes.
3. Skewed Execution Times. If the execution time
of the reducer is worse than linear, processing a
single, large cluster may take much longer than
processing a higher number of small clusters.
Even if the overall number of tuples per reducer
is the same, the execution times of the reducers
may differ.
Example 2. Consider a reducer which compares all items within a cluster to each other. Obviously, the reducer's complexity is quadratic in the number of tuples within a cluster. Processing a cluster with six tuples thus has a cost of 6^2 = 36. Three clusters of size two only have a total cost of 3 · 2^2 = 12. The total number of tuples is, however, six in both cases.
Skew is symbolised by smaller and larger partition
icons and reducer boxes in Figure 1. In this example,
partition P0 is much larger than partition P1 on two
mappers. The reducer on the left thus gets a much
larger share of the data than the one on the right.
We will now propose an approach for obtaining
the best possible load balancing.
3.2 Optimal Solution
In order to balance the workload on the reducers, we
need to know the amount of work required for every
cluster. Typically, the work per cluster depends either
on the number of tuples in the cluster, or on the byte
size of the cluster, or both these parameters. There-
fore, while creating the clusters, we monitor for every
cluster C(k) the number of tuples it contains, |C(k)|,
and its (byte) size, ‖C(k)‖. Based on the complexity of the reducer algorithm, we can then calculate the weight, w(|C(k)|, ‖C(k)‖), i.e., the amount of work
for each cluster k as a function of tuple count and size.
Example 3. For the reducer described in Example 2,
we estimate the weight of a cluster as w(t, s) = t^2.
As an example of a reducer complexity depending on
the byte size of the processed cluster, consider the fol-
lowing scenario: Every tuple contains an array of
values. The reducer's task is to find the median of
these values per cluster over all contained tuples. The
amount of work to spend thus depends on the com-
bined size of all arrays within a cluster, rather than
the tuple count. If the array size is proportional to the
tuple sizes, we will therefore base the work estimation
on the byte size of the clusters. As finding the median
element of an array of length n is possible in n log n time, we estimate the work as w(t, s) = s log s.
We obtain the optimal, i. e., the best weight bal-
anced assignment of clusters to reducers by solving
the associated bin packing problem. The optimal so-
lution is not feasible for two reasons.
1. In a worst-case scenario, the monitored data
grows linearly in the size of the intermediate data
I. Such a situation arises, e. g., when joining two
tables on their primary key columns: every key
value can appear only once per table, and the re-
sulting clusters contain at most two tuples.
2. The bin packing problem is NP hard. Hence, even
for a moderate number of clusters, calculating the
assignment of clusters to reducers can become
more expensive than the actual execution of the
reducers.
In the following we will address these two problems,
and develop heuristics for approximately solving the
load balancing problem.
3.3 Approximate Cost Estimation
The first problem with the optimal solution is the size
of the monitored data. In the worst case, the number
of clusters, |K|, grows linearly with the number of in-
termediate data tuples. With MapReduce being a sys-
tem designed for processing terabyte scale data sets,
we can therefore not afford to monitor every cluster
individually. Instead, we do the monitoring on parti-
tion level, i. e., we create a histogram of monitoring
data using the partitions P( j) as histogram buckets.
Besides tuple count, t( j), and total size, s( j), we also
include the number of clusters per partition, c(j), in our monitoring data µ: µ(j) = (c(j), t(j), s(j)) with

c(j) = |{C(k) : k ∈ K, C(k) ⊆ P(j)}|
t(j) = |P(j)|
s(j) = Σ_{k∈K, C(k)⊆P(j)} ‖C(k)‖
Recall from the preceding section that we need the
weight for each cluster. We estimate the tuple counts
and sizes of the clusters based on the monitoring in-
formation for the partitions using average values:
t̄_c(j) = t(j) / c(j)    s̄_c(j) = s(j) / c(j)

We can now determine the processing cost per cluster, w(t̄_c(j), s̄_c(j)), using the tuple count and size estimates. Summing up all processing costs for a partition, we obtain the partition cost, W(j):

W(j) = c(j) · w(t̄_c(j), s̄_c(j))
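The estimation can be summarised in a few lines of Python; this is a sketch under the average-based assumptions above, with weight functions modelled on Examples 2 and 3 (all names are ours).

import math

def weight_quadratic(t, s):
    """w(t, s) = t^2: pairwise comparison of all tuples in a cluster (Example 2)."""
    return t * t

def weight_median(t, s):
    """w(t, s) = s * log s: byte-size based estimate (Example 3)."""
    return s * math.log(s) if s > 1 else 0.0

def partition_cost(mu, w):
    """Estimate W(j) from the monitoring tuple mu = (c, t, s): assume all
    c clusters have average tuple count t/c and average size s/c, then
    W(j) = c * w(t/c, s/c)."""
    c, t, s = mu
    if c == 0:
        return 0.0
    return c * w(t / c, s / c)

# Example: a partition with 4 clusters, 100 tuples, 6400 bytes.
print(partition_cost((4, 100, 6400), weight_quadratic))  # 4 * 25**2 = 2500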
Since the input data is assumed to be skewed, the
average cost values for the clusters can differ substan-
tially from the actual values. Despite this approxi-
mation error we achieve much better load balancing
than current MapReduce implementations. We will
discuss this issue in Section 4.1 and present support-
ing experimental evaluation in Section 6.3.
Collecting accurate statistics for clusters is an
open research problem. As discussed in Section 3.2,
exact monitoring at the cluster level is not feasible.
Possible solutions could collect monitoring data on
a granularity level between clusters and partitions,
or selectively monitor only the most relevant clusters
within each partition.
3.4 Distributed Monitoring
The bag I holding all intermediate tuples is not mate-
rialised on a single host. Therefore, we need to col-
lect our monitoring data in a distributed manner, and
then aggregate it in order to obtain information on the
global data distribution. We denote by I_i the bag of intermediate (key, value) pairs generated by mapper i, i.e., {I_1, I_2, . . . , I_m} is a partitioning of I. Every mapper i gathers monitoring information for all partitions j based on the tuples in its share of the intermediate data, I_i. We collect all this monitoring data on a central controller and aggregate it in order to obtain an approximation of µ. Note that for our cost estimations we do not need to introduce a new centralised component in MapReduce, but we exploit the centralised controller for task scheduling. For the tuple count and size of partition j, we collect, on every mapper i, the local tuple count, t_i(j), and size, s_i(j). By summing up those values, we can reconstruct the exact number of tuples per partition, t(j), and their total size, s(j):

t(j) = Σ_{1≤i≤m} t_i(j)    s(j) = Σ_{1≤i≤m} s_i(j)
For the number of clusters per partition, the same ap-
proach is not applicable, as clusters are typically dis-
tributed over multiple mappers. We employ the linear
counting approach (Whang et al., 1990) for approxi-
mating the cluster count per partition.
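The following Python sketch illustrates linear counting in this distributed setting: each mapper fills a per-partition bitmap, the controller ORs the bitmaps and derives the estimate from the fraction of zero bits. The bitmap size and the use of Python's built-in hash are illustrative assumptions.

import math

BITMAP_BITS = 1 << 14  # illustrative bitmap size per partition

def local_bitmap(keys, bits=BITMAP_BITS):
    """On a mapper: mark one bit per observed key (linear counting)."""
    bitmap = bytearray(bits // 8)
    for k in keys:
        b = hash(k) % bits
        bitmap[b // 8] |= 1 << (b % 8)
    return bitmap

def merge_bitmaps(bitmaps):
    """On the controller: bitwise OR of the mappers' bitmaps."""
    merged = bytearray(len(bitmaps[0]))
    for bm in bitmaps:
        for i, byte in enumerate(bm):
            merged[i] |= byte
    return merged

def estimate_distinct(bitmap, bits=BITMAP_BITS):
    """Linear counting estimate: n ~ -m * ln(V), V = fraction of zero bits."""
    zero_bits = sum(8 - bin(byte).count("1") for byte in bitmap)
    if zero_bits == 0:
        return float(bits)  # bitmap saturated; the estimate is only a lower bound
    return -bits * math.log(zero_bits / bits)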
4 LOAD BALANCING
Now that we have a cost model that takes into account
non-linear reducer tasks, we define two load balanc-
ing approaches based on this model.
4.1 Fine Partitioning
By creating more partitions than there are reducers
(i.e., by choosing p > r, in contrast to current MapRe-
duce systems where p = r), we retain some degree of
freedom for balancing the load on the reducers. The
range of p is obviously bounded by the number of
reducers, r, on the lower end, and the number of clus-
ters, |K|, on the upper end. With p < r, some reducers
would not obtain any input. With p > |K|, some par-
titions will remain empty.
The number of partitions, p, influences the quality
of the obtained load balancing. The higher we choose
p, the more possibilities the controller has to balance
the load. On the other hand, the management over-
head grows with p. This overhead impacts on the ex-
ecution of the MapReduce job twice. First, we need
to collect and process more monitoring data. For very
high values of p (close to |K|), handling the monitor-
ing data could thus become a bottleneck in the job
execution. Second, partitions are the units of data
transfer (i. e., files) from the mappers to the reducers.
Transferring a few large files is faster and results in
less overhead than transferring many small files. We
need to be aware of this trade-off when choosing p.
The goal of assigning partitions to reducers is
to balance the load. The optimal load balance is
achieved by solving the respective bin packing prob-
lem. Unfortunately, bin packing is NP hard. We pro-
pose a greedy heuristic (sketched in Algorithm 1) to
determine the partition bundles. We pick the most ex-
pensive partition not yet assigned to a reducer, and
assign it to the reducer which has smallest total load.
The load of a reducer is the sum of the costs of all par-
titions assigned to that reducer. We repeat these steps
until all partitions have been assigned.
Algorithm 1: Assign Partitions to Reducers.
Input: W : {1, . . . , p} → R⁺
Output: R: a set of partition bundles
1: R ← ∅
2: P ← {1, . . . , p}
3: while P ≠ ∅ do
4:   q = argmax_{j∈P} W(j)
5:   P ← P \ {q}
6:   if |R| < r then
7:     R ← R ∪ {{q}}
8:   else
9:     s = argmin_{l∈R} Σ_{j∈l} W(j)
10:    R ← (R \ {s}) ∪ {s ∪ {q}}
11:  end if
12: end while
13: return R
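A runnable Python version of this greedy heuristic might look as follows (names are ours). Sorting by descending cost replaces the repeated argmax; for positive partition costs the resulting bundles match Algorithm 1.

def assign_partitions(W, r):
    """Greedy heuristic of Algorithm 1: repeatedly take the most expensive
    unassigned partition and give it to the reducer with the currently
    lowest total load.

    W: dict mapping partition id -> estimated cost W(j)
    r: number of reducers
    Returns a list of r partition bundles (sets of partition ids).
    """
    bundles = [set() for _ in range(r)]
    loads = [0.0] * r
    for j in sorted(W, key=W.get, reverse=True):        # descending cost
        target = min(range(r), key=lambda i: loads[i])   # least loaded reducer
        bundles[target].add(j)
        loads[target] += W[j]
    return bundles

# Usage: 8 partitions, 3 reducers.
costs = {0: 10, 1: 50, 2: 20, 3: 5, 4: 40, 5: 15, 6: 25, 7: 8}
print(assign_partitions(costs, r=3))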
Note that we calculate the partition bundles only
after all mappers have completed their execution,
which prevents the reducer slow-start optimisation of
Hadoop. We will discuss this aspect in Section 4.3.
Figure 2: Partitioned Data Distribution.
An example for an approximate bin packing so-
lution is shown in Figure 2. Even though we only
have two reducers, every mapper creates three parti-
tions. Based on the monitoring data obtained from the
mappers, the controller determines the assignment of
partitions to reducers. P1 is the most expensive parti-
tion and is assigned to a dedicated reducer, while P0
and P2, which are cheaper, share a reducer.
Recall from Section 3 that we do not know the
exact cost for every partition, but only approximated
values. This impacts the load balancing on the reduc-
ers as follows.
1. If the clusters are similar in cost, our cost estima-
tion is accurate, and the load balanced well.
2. If the clusters are heavily skewed, i. e., there are
very few clusters which are considerably larger
than the others, also the partitions containing
these clusters will be much larger than the others.
The estimated cost for those partitions will, there-
fore, also be higher than that of partitions contain-
ing only small clusters. Partitions containing large
clusters will thus very likely be assigned to dedi-
cated reducers, as long as the total number of re-
ducers is sufficiently large.
3. Finally, for moderately skewed data, two situa-
tions may arise.
(a) The larger clusters are evenly distributed over
all partitions. Then we overestimate the cost of all partitions. This is, however, not a problem since the absolute cost is irrelevant for assigning the partitions to reducers, and we still obtain a reasonably good load balancing.
(b) The partitioning function assigns the larger
clusters to a small number of partitions. Then
the same reasoning as for heavily skewed data
applies.
4.2 Dynamic Fragmentation
With the fine partitioning approach presented above,
some partitions may grow excessively large, making
a good load balancing impossible. In this section we
present a strategy which dynamically splits very large
partitions into smaller fragments. We define a parti-
tion to be very large if it exceeds the average partition
size by a predefined factor. Similar to partitions, frag-
ments are containers for multiple clusters. In contrast
to partitions, however, the number of fragments may
vary from mapper to mapper.
As before, every mapper starts creating its output
partitions according to the partitioning function π. If a
partition gains excessively more weight than the oth-
ers, the mapper splits this partition into fragments. We
choose the number of fragments, f, to be the smallest integer greater than 1 such that p ≢ 0 mod f. This is shown in Figure 3. The leftmost mapper splits partition P2, which has almost twice the weight of the other partitions, into two fragments (3 ≡ 1 mod 2). The mapper in the middle splits partitions P0 and P2, while the rightmost mapper splits partition P1.
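The fragment-count rule can be sketched in a few lines (the function name is ours); for p = 3 it yields f = 2, matching the example above.

def num_fragments(p):
    """Smallest integer f > 1 such that p is not divisible by f."""
    f = 2
    while p % f == 0:
        f += 1
    return f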
Figure 3: Fragmented Data Distribution.
Upon completion, each mapper sends a list of par-
titions which it has split into fragments, along with the
monitoring data, to the controller. For each partition
which has been fragmented on at least one mapper,
the controller considers both exploiting the fragments
or ignoring them. This is achieved by calculating the
partition bundles (the set R in Algorithm 1) for each
possible combination and then picking the best one.
When the fragments of a partition are sent to differ-
ent reducers, data from mappers which have not frag-
mented that partition needs to be replicated to all re-
ducers which get assigned one of the fragments. In
Figure 3, fragment F2a is assigned to the reducer on
the right, whereas fragment F2b is assigned to the left
one. Partition P2 from the rightmost mapper must be
copied to both reducers, as it might contain data be-
longing to both fragments. A filtering step is inserted
at the reducer side that eliminates data items not be-
longing to the fragments of that reducer immediately
after receiving the file.
We choose the best partition assignment using a
cost based strategy. The first aspect that the cost func-
tion needs to consider is how well the weight is bal-
anced. We use the standard deviation σ of the weight
of the partition bundles to express this aspect. The
lower the standard deviation, the better the data is bal-
anced. The second aspect to include is the amount of
replication. In the cost function we use the average
weight w of the partition bundles. We want to keep
w, and thus the amount of replicated data, as low as
possible and define the cost of an assignment R as
C (R) =
w(R) · (1+ σ(R))
e
We strive for an assignment with low cost. The pa-
rameter e controls the influence of the balancing over
replication. Low values of e favour assignments with
lower replication at the cost of unbalanced partition
bundles, high values favour well balanced partition
bundles at the cost of replication. A good choice for e
depends on the complexity of the reducer task.
Example 4. For the reducer in Example 2 with
quadratic runtime complexity, we choose a smaller
value for e than for a reducer with exponential worst
case complexity. The difference in execution time due
to unbalanced loads is much higher for expensive
reducers, and the additional communication cost for replication is likely to be outweighed by the gain from better balanced reducers.
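Read directly from the formula above, the assignment cost can be computed as in the following sketch; the helper name and the example weights are ours, and σ is taken as the absolute standard deviation of the bundle weights, as written.

import statistics

def assignment_cost(bundle_weights, e):
    """C(R) = mean(W) * (1 + stdev(W))^e over the bundle weights W.
    Lower is better; larger e penalises imbalance more strongly than
    replication."""
    mean_w = statistics.mean(bundle_weights)
    sigma = statistics.pstdev(bundle_weights)
    return mean_w * (1.0 + sigma) ** e

# Replication-free but skewed vs. replicated but balanced bundles
# (the balanced variant has a higher mean because of replicated data).
print(assignment_cost([100, 40], e=0.3))   # unbalanced, no replication
print(assignment_cost([78, 78], e=0.3))    # balanced, some replication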
In the example of Figure 3, the benefit of assign-
ing fragments F2a and F2b to different reducers out-
weighed the increased cost resulting from the replica-
tion of partition P2 to both reducers. Partition P1, on
the other hand, was only fragmented on the rightmost
mapper. Placing its fragments on different reducers
would require replicating the partitions P1 from the
other mappers to both reducers, which in our example
did not pay off.
4.3 Reducer Slow-start
In “traditional” MapReduce systems, the first reduc-
ers are already launched when a small percentage of
mappers is done. During this slow-start phase, reduc-
ers fetch their inputs from completed mappers. Both
approaches presented in the preceding sections, how-
ever, require all mappers to have completed process-
ing before the assignment of partitions (and possibly
fragments) to reducers can be calculated. For highly
complex reduce algorithms, the time savings due to
slow-start are negligible. For reduce algorithms of
moderate complexity, we can derive an initial assign-
ment of partitions to reducers based on the monitoring
data from the first completed mappers, and adapt the
assignment later if necessary. Empirical evaluations
(which are not further discussed due to space limi-
tations) show a fast convergence of the assignments
after r mappers are completed.
5 HANDLING LARGE CLUSTERS
The techniques presented so far aim at distributing
clusters to reducers such that the resulting load on
all reducers is balanced. In some situations, however,
good load balancing is not possible. Such situations
arise, e.g., when we have fewer clusters than reducers (|K| < r), or when the cluster costs are heavily skewed and very few of the clusters account for most of the
total cost.
According to the MapReduce processing model, a
single cluster must not be distributed to multiple re-
ducers for processing. The processing code executed
on the reducers is supplied by the user. The possi-
bilities for the framework to react to expensive clus-
ters are therefore very limited. We propose to pro-
vide an optional extension to the interface, allowing
the framework to notify the user code if expensive
clusters are encountered. Hence, the user can react to
large clusters with application specific solutions, e. g.
using multi-threading or approximate algorithms.
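Such an extension is not part of existing MapReduce APIs; the following is a purely hypothetical sketch of what a notification hook could look like.

class ExpensiveClusterAware:
    """Hypothetical reducer-side hook: the framework could call this before
    handing over a cluster whose estimated cost exceeds a threshold, so that
    user code can switch to a multi-threaded or approximate implementation."""

    def on_expensive_cluster(self, key, estimated_cost):
        # Illustrative reaction: remember that the reduce call for this key
        # should use the cheaper, approximate algorithm.
        self.approximate_keys = getattr(self, "approximate_keys", set())
        self.approximate_keys.add(key)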
6 EXPERIMENTAL EVALUATION
In this section we report on the experimental evalua-
tion of the presented partitioning strategies and their
impact on a specific e-science application.
6.1 Measurement Environment
We evaluate our partitioning strategies and their im-
pact using both synthetic and real e-science data. We
generate synthetic data sets based on Zipf distribu-
tions with 200 clusters and varying z parameter. The
e-science data set consists of all merger tree nodes from the Millennium run (http://www.g-vo.org/Millennium) (Springel et al., 2005), a
294 GB data set with more than 760 million tuples.
We simulated a MapReduce environment, building the histogram described in Section 3.4 on each mapper. We then calculated the partition bun-
dles based on these histograms, using the bin pack-
ing heuristic of Section 4.1. For the Millennium
data set, we used the number of mappers (389) and
the actual distribution of data to the mappers chosen
by Hadoop, configured with default settings except
for the HDFS block size, which we increased from 64 MB to 512 MB. We altered the block size because
Hadoop chooses the number of mappers based on the
number of input data blocks. With simple map tasks,
it is thus reasonable to use large block sizes in or-
der to avoid creating a huge number of short running
mappers. For the synthetic data, we chose parameters
close to the values observed in real world datasets. We
scheduled 400 mappers, generating 1.3 million inter-
mediate tuples each. We repeated all measurements
20 times and report the averages.
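A minimal generator for such Zipf-distributed intermediate keys is sketched below; the 200 clusters and 1.3 million tuples per mapper match the setup above, while the sampling method itself is an illustrative choice.

import numpy as np

def zipf_keys(n_tuples, n_clusters=200, z=0.5, seed=0):
    """Draw keys from a Zipf-like distribution over n_clusters keys,
    with P(key i) proportional to 1 / i^z."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, n_clusters + 1)
    probs = 1.0 / ranks ** z
    probs /= probs.sum()
    return rng.choice(n_clusters, size=n_tuples, p=probs)

keys = zipf_keys(1_300_000, z=0.5)   # one mapper's share, as in Section 6.1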
6.2 Partitioning Strategies
In our first evaluation, we compare the current data
redistribution scheme (Section 3.1) with the fine par-
titioning (Section 4.1) and the dynamic fragmenta-
tion (Section 4.2) approaches for varying parameters
e (0.05, 0.15, 0.3) in the cost function. We choose the
number of partitions, p, to be four times the number
of reducers. With this choice, we obtain a sufficient
number of partitions to balance the load quite well,
while not exceeding the number of clusters.
We show the obtained results for varying numbers
of reducers in Figure 4. The left graph in each fig-
ure shows the replication overhead introduced by dy-
namic fragmentation with varying e parameter. Stan-
dard MapReduce and fine partitioning are not shown
in these graphs as they introduce no replication. If dy-
namic fragmentation chooses the same result as fine
partitioning, i. e., the fragments are not exploited, then
no fragmentation overhead is incurred and no bars are vis-
ible in the diagram. The right part of the figures shows
the standard deviation in the tuple count per reducer.
The values are relative to the average number of tu-
ples per reducer without replication.
Both fine partitioning and dynamic fragmentation
balance the load considerably better than standard
MapReduce systems. Dynamic fragmentation has the
highest impact in the scenario with moderate skew
(Figure 4b) and with moderate reducer count. The
remaining situations are as follows. For low skew
(Figure 4a), except for the scenario with 10 reduc-
ers, no partition grows noticeably larger than the oth-
ers. Therefore no fragments are created, and dynamic
fragmentation falls back to fine partitioning. For high
skew (Figure 4c), the partition(s) with very expensive
clusters are fragmented. Expensive clusters, however,
cannot be split. Therefore, the possible gain in bal-
ancedness is low, and fragments are exploited only
for high e values. An exception is the scenario with
10 reducers. Due to the very low number of reducers,
splitting the partition with the most expensive clus-
ter has a strong impact, justifying even a high fragmentation overhead.
Figure 4: Data Balancing. (a) Synthetic, z = 0.1. (b) Synthetic, z = 0.5. (c) Synthetic, z = 0.9. (d) Millennium data. For each reducer count (10, 20, 25, 50), the left graph shows the fragmentation overhead [%] and the right graph the standard deviation of tuples per reducer [% of mean], comparing standard MapReduce, fine partitioning (p = 4r), and dynamic fragmentation (p = 4r) with e = 0.05, 0.15, 0.30.
Comparing the different configurations for dy-
namic fragmentation, we see that e is a reasonable pa-
rameter for configuring the amount of replication tol-
erated in order to achieve better data balancing. The
larger e is, the more replication is accepted if this re-
sults in better balancing. The choice of e should thus
depend on the expected execution time of the reduc-
ers. For fast reducers, slightly skewed execution times
are typically acceptable. For long-running reducers,
more replication overhead will be outweighed by bet-
ter balancing the reducer execution times.
With the Millennium data (Figure 4d), the benefit
of our load balancing techniques becomes even more
evident. For most of the reducer numbers, even the
fine partitioning approach is able to reduce the devia-
tion by far more than 50%.
6.3 Influence on Applications
Finally, we evaluate the impact of our load balancing
approaches on the execution times of a MapReduce
application. Figure 5 shows the execution times for
a reducer side algorithm with quadratic complexity in
the number of input tuples, e.g., an algorithm doing a
pairwise comparison of all tuples within a cluster. The
total bar heights show the synthetic execution time for
the longest running reducer. We calculate the syn-
thetic execution time according to the algorithm com-
plexity, based on exact cluster sizes. Bar heights up to
the cut show the shortest reducer time. The (red) line
spanning over both bars for the same reducer count
gives the average execution time for a reducer. The
average is identical for both evaluated approaches, as
the same work is performed in both scenarios. Fi-
nally, every diagram includes the processing time re-
quired for the most expensive cluster (green, dotted
line). This value is a lower bound of the execution
time since clusters can not be split.
For the synthetic data sets, we observe the highest
impact of fine partitioning on moderately skewed data
(z = 0.3, Figures 5b and 5e). Here, we are able to re-
duce the time required on the longest-running reducer
(the bar heights) by over 30%. For 25 and more reduc-
ers, the time spent for the longest-running reducer is
very close to the most expensive cluster (green, dotted
line), which gives the lowest reachable value. Note
also the small difference between average (red line)
and shortest (cut in the bars) execution times, indi-
cating that only very few reducers require more time.
Comparing Figures 5b and 5e, we see the positive im-
pact of a higher number of partitions, especially for
the configurations with 20 and 25 reducers.
For both very balanced data (Figure 5a) and very
skewed data (Figure 5d), we see only small differ-
ences between the two approaches. In the balanced
scenario, the naïve data distribution in current Hadoop
obtains a reasonably good result. The average ex-
ecution time (red line) is roughly half way between
the shortest and the longest execution times. For the
very skewed scenario, a single expensive cluster dom-
inates the overall execution time (see the execution
time for the most expensive cluster). The only way to
reduce the total execution time is thus to isolate this
expensive cluster. Fine partitioning achieves this goal
already for a lower number of reducers.
Figure 5: Execution Times. (a) Synthetic, z = 0.1, p = 4r. (b) Synthetic, z = 0.3, p = 4r. (c) Synthetic, z = 0.5, p = 4r. (d) Synthetic, z = 0.9, p = 4r. (e) Synthetic, z = 0.3, p = 8r. (f) Millennium data. Bars show the maximum synthetic execution time per reducer for standard MapReduce and fine partitioning, interrupted at the minimum time per reducer; lines mark the average time per reducer and the time for the most expensive cluster.
For the Millennium data set (Figure 5f) fine parti-
tioning reaches the optimum (the maximum cost is the
cost of the most expensive cluster) already with 20
reducers, while the standard approach requires more
than 50 reducers.
7 RELATED WORK
Despite the popularity of MapReduce systems, which
have been at the centre of distributed systems research
over the last years, skew handling has received little
attention. Only very recently, the SkewReduce sys-
tem (Kwon et al., 2010) was proposed. In SkewRe-
duce, data characteristics are collected by a sampling
approach. The user has to provide cost functions
which derive, from this sampling data, information
on the runtime behaviour of the application. With the
techniques described in this paper, the user needs to
specify only the runtime complexity of the reducer
side algorithm; all remaining components are pro-
vided by the framework.
When processing joins on MapReduce systems,
data skew might arise as well. A recent publication
(Afrati and Ullman, 2010) shows how to best use
Symmetric Fragment-Replicate Joins (Stamos and
Young, 1993) on MapReduce systems in order to min-
imise communication. Based on the input relation
sizes, the presented system finds the optimal degree
of replication for all relations. Our work is orthog-
onal to this approach. Skewed join attribute distribu-
tion can lead to load imbalance on the reducers, which
is tackled by the techniques presented in this paper.
A scheduling algorithm for MapReduce in het-
erogeneous environments was presented in (Zaharia
et al., 2008). They show that an improved scheduling
strategy can effectively decrease the response time of
Hadoop. The scheduling strategy determines invoca-
tion time and hosts for the single reduce tasks, but not
the assignment of clusters to reducers. Their approach
can thus be combined with our load balancing tech-
niques in order to further reduce the response time.
MapReduce and (distributed) database systems
are often used for similar tasks. Hence, over the
last years there has been substantial effort from the
database community to both compare (Pavlo et al.,
2009; Stonebraker et al., 2010), and to combine the
two approaches. Database systems are used as intel-
ligent storage back-ends for MapReduce (Abouzeid
et al., 2009), and indexing is integrated with Hadoop's
native data storage (Dittrich et al., 2010). MapReduce
applications written in dedicated high level languages
are optimised using techniques from query optimisa-
tion (Gates et al., 2009; Battré et al., 2010). All of this
work is orthogonal to the data skew handling tech-
niques for MapReduce we presented in this paper.
Distributed database literature offers much prior
work on handling skewed data distributions. Our
dynamic fragmentation approach is inspired by dis-
tributed hash join processing techniques (Zeller and
Gray, 1990), and extends these techniques such that
multiple mappers can contribute as data sources.
Data skew was also tackled in the Gamma project
(DeWitt et al., 1992). Some of their techniques are
applicable to MapReduce. Our fine partitioning ap-
proach is similar to Virtual Processor Partitioning.
Other techniques are very specific to distributed join
processing and cannot be directly transferred to our
scenario. An example is the Subset-Replicate ap-
proach. Similar to the Fragment-Replicate Join, this
approach allows distributing one cluster over multi-
ple sites. Such a technique is not applicable to ar-
bitrary distributed grouping/aggregation tasks, which
we need for load balancing in MapReduce.
Our bin packing heuristic for distributing the load
to reducers resembles the First Fit Decreasing (FFD)
algorithm (Johnson, 1973). Different from the stan-
dard bin packing scenario, the bins in our scenario
have no capacity limit. We choose the bin with the
lowest load to place the next item in.
8 SUMMARY AND ONGOING
WORK
Motivated by skewed reducer execution times in e-
science workflows, we analysed the behaviour of
MapReduce systems with skewed data distributions
and complex reducer side algorithms. We presented
two approaches, fine partitioning and dynamic frag-
mentation, allowing for improved load balancing.
In future work, we will consider collecting more
sophisticated statistics on the partitions in order to
estimate the workload per partition more accurately.
Moreover, we will focus on skewed data distributions
on the mappers. Such skew can arise, e. g., in data
warehouses capturing a shifting trend.
ACKNOWLEDGEMENTS
This work was funded by the German Federal Min-
istry of Education and Research (BMBF, contract
05A08VHA) in the context of the GAVO-III project
and by the Autonomous Province of Bolzano - South
Tyrol, Italy, Promotion of Educational Policies, Uni-
versity and Research Department.
REFERENCES
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silber-
schatz, A., and Rasin, A. (2009). HadoopDB: An Ar-
chitectural Hybrid of MapReduce and DBMS Tech-
nologies for Analytical Workloads. In VLDB.
Afrati, F. N. and Ullman, J. D. (2010). Optimizing Joins in
a Map-Reduce Environment. In EDBT.
Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., and
Warneke, D. (2010). Nephele/PACTs: A Program-
ming Model and Execution Framework for Web-Scale
Analytical Processing. In SoCC.
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified
Data Processing on Large Clusters. CACM, 51(1).
DeWitt, D., Naughton, J. F., Schneider, D. A., and Seshadri,
S. (1992). Practical Skew Handling in Parallel Joins.
In VLDB.
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty,
V., and Schad, J. (2010). Hadoop++: Making a Yellow
Elephant Run Like a Cheetah. In VLDB.
Gates, A. F., Natkovich, O., Chopra, S., Kamath, P.,
Narayanamurthy, S. M., Olston, C., Reed, B., Srini-
vasan, S., and Srivastava, U. (2009). Building a High-
Level Dataflow System on top of Map-Reduce: The
Pig Experience. In VLDB.
Johnson, D. S. (1973). Approximation Algorithms for
Combinatorial Problems. In STOC.
Kwon, Y., Balazinska, M., Howe, B., and Rolia, J. A.
(2010). Skew-resistant Parallel Processing of Feature-
Extracting Scientific User-Defined Functions. In
SoCC.
Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D.,
Madden, S., and Stonebraker, M. (2009). A Compar-
ison of Approaches to Large-Scale Data Analysis. In
SIGMOD.
Springel, V., White, S., Jenkins, A., Frenk, C., Yoshida, N.,
Gao, L., Navarro, J., Thacker, R., Croton, D., Helly,
J., Peacock, J., Cole, S., Thomas, P., Couchman, H.,
Evrard, A., Colberg, J., and Pearce, F. (2005). Sim-
ulating the Joint Evolution of Quasars, Galaxies and
their Large-Scale Distribution. Nature, 435.
Stamos, J. W. and Young, H. C. (1993). A Symmetric Frag-
ment and Replicate Algorithm for Distributed Joins.
IEEE TPDS, 4(12).
Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paul-
son, E., Pavlo, A., and Rasin, A. (2010). MapReduce
and Parallel DBMSs: Friends or Foes? CACM, 53(1).
Whang, K.-Y., Zanden, B. T. V., and Taylor, H. M. (1990).
A Linear-Time Probabilistic Counting Algorithm for
Database Applications. TODS, 15(2).
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and
Stoica, I. (2008). Improving MapReduce Performance
in Heterogeneous Environments. In OSDI.
Zeller, H. and Gray, J. (1990). An Adaptive Hash Join Al-
gorithm for Multiuser Environments. In VLDB.