to reducers resembles the First Fit Decreasing (FFD)
algorithm (Johnson, 1973). Unlike the standard
bin packing scenario, the bins in our scenario
have no capacity limit. We place the next item in
the bin with the lowest current load.
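The assignment rule described above can be illustrated with a minimal sketch. It assumes that each partition comes with an estimated cost and that partitions are considered in order of decreasing cost, as the comparison with FFD suggests. The function name assign_partitions, the partition identifiers, and the cost values are hypothetical and chosen only for illustration.

```python
import heapq


def assign_partitions(costs, num_reducers):
    """Greedily assign partitions to reducers (illustrative sketch).

    costs: dict mapping a partition id to its estimated cost (assumed input).
    Returns (assignment, loads), both keyed by reducer index.
    """
    # Min-heap of (current load, reducer index): the root is always
    # the reducer ("bin") with the lowest load. Bins have no capacity limit.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}

    # Assumption: partitions are processed in decreasing cost order, as in FFD.
    ordered = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for partition, cost in ordered:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(partition)
        heapq.heappush(heap, (load + cost, reducer))

    loads = {r: sum(costs[p] for p in parts) for r, parts in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    example_costs = {"p0": 7, "p1": 5, "p2": 4, "p3": 3, "p4": 1}
    assignment, loads = assign_partitions(example_costs, 2)
    print(assignment)
    print(loads)
```

Because the bins are unbounded, every partition is always placed. The heap only serves to find the least-loaded reducer in logarithmic time.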
8 SUMMARY AND ONGOING WORK
Motivated by skewed reducer execution times in
e-science workflows, we analysed the behaviour of
MapReduce systems with skewed data distributions
and complex reducer-side algorithms. We presented
two approaches, fine partitioning and dynamic
fragmentation, that allow for improved load balancing.
In future work, we will consider collecting more
sophisticated statistics on the partitions in order to
estimate the workload per partition more accurately.
Moreover, we will focus on skewed data distributions
on the mappers. Such skew can arise, e.g., in data
warehouses capturing a shifting trend.
ACKNOWLEDGEMENTS
This work was funded by the German Federal Min-
istry of Education and Research (BMBF, contract
05A08VHA) in the context of the GAVO-III project
and by the Autonomous Province of Bolzano - South
Tyrol, Italy, Promotion of Educational Policies, Uni-
versity and Research Department.
REFERENCES
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silber-
schatz, A., and Rasin, A. (2009). HadoopDB: An Ar-
chitectural Hybrid of MapReduce and DBMS Tech-
nologies for Analytical Workloads. In VLDB.
Afrati, F. N. and Ullman, J. D. (2010). Optimizing Joins in
a Map-Reduce Environment. In EDBT.
Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., and
Warneke, D. (2010). Nephele/PACTs: A Program-
ming Model and Execution Framework for Web-Scale
Analytical Processing. In SoCC.
Dean, J. and Ghemawat, S. (2008). MapReduce: Simplified
Data Processing on Large Clusters. CACM, 51(1).
DeWitt, D., Naughton, J. F., Schneider, D. A., and Seshadri,
S. (1992). Practical Skew Handling in Parallel Joins.
In VLDB.
Dittrich, J., Quiané-Ruiz, J.-A., Jindal, A., Kargin, Y., Setty,
V., and Schad, J. (2010). Hadoop++: Making a Yellow
Elephant Run Like a Cheetah. In VLDB.
Gates, A. F., Natkovich, O., Chopra, S., Kamath, P.,
Narayanamurthy, S. M., Olston, C., Reed, B., Srini-
vasan, S., and Srivastava, U. (2009). Building a High-
Level Dataflow System on top of Map-Reduce: The
Pig Experience. In VLDB.
Johnson, D. S. (1973). Approximation Algorithms for
Combinatorial Problems. In STOC.
Kwon, Y., Balazinska, M., Howe, B., and Rolia, J. A.
(2010). Skew-resistant Parallel Processing of Feature-
Extracting Scientific User-Defined Functions. In
SoCC.
Pavlo, A., Paulson, E., Rasin, A., Abadi, D., DeWitt, D.,
Madden, S., and Stonebraker, M. (2009). A Compar-
ison of Approaches to Large-Scale Data Analysis. In
SIGMOD.
Springel, V., White, S., Jenkins, A., Frenk, C., Yoshida, N.,
Gao, L., Navarro, J., Thacker, R., Croton, D., Helly,
J., Peacock, J., Cole, S., Thomas, P., Couchman, H.,
Evrard, A., Colberg, J., and Pearce, F. (2005). Sim-
ulating the Joint Evolution of Quasars, Galaxies and
their Large-Scale Distribution. Nature, 435.
Stamos, J. W. and Young, H. C. (1993). A Symmetric Frag-
ment and Replicate Algorithm for Distributed Joins.
IEEE TPDS, 4(12).
Stonebraker, M., Abadi, D., DeWitt, D., Madden, S., Paul-
son, E., Pavlo, A., and Rasin, A. (2010). MapReduce
and Parallel DBMSs: Friends or Foes? CACM, 53(1).
Whang, K.-Y., Zanden, B. T. V., and Taylor, H. M. (1990).
A Linear-Time Probabilistic Counting Algorithm for
Database Applications. TODS, 15(2).
Zaharia, M., Konwinski, A., Joseph, A. D., Katz, R., and
Stoica, I. (2008). Improving MapReduce Performance
in Heterogeneous Environments. In OSDI.
Zeller, H. and Gray, J. (1990). An Adaptive Hash Join Al-
gorithm for Multiuser Environments. In VLDB.