Benjamin Gufler, Nikolaus Augsten, Angelika Reiser, Alfons Kemper


MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is often skewed. This leads to highly varying execution times for the reducers. Varying execution times result in low resource utilisation and high overall execution time, since the next MapReduce cycle can only start after all reducers are done. In this paper we address the problem of efficiently processing MapReduce jobs with complex reducer tasks over skewed data. We define a new cost model that takes into account non-linear reducer tasks, and we provide an algorithm to estimate the cost in a distributed environment. We propose two load balancing approaches, fine partitioning and dynamic fragmentation, that are based on our cost model and can deal with both skewed data and complex reduce tasks. Fine partitioning produces a fixed number of data partitions, whereas dynamic fragmentation dynamically splits large partitions into smaller portions and replicates data if necessary. Our approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We empirically evaluate our solution on both synthetic data and real data from an e-science application.
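
The interaction of skew and non-linear reducer cost can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the algorithm from the paper: the quadratic cost function, the Zipf-like key distribution, the modulo partitioner, and the greedy "most expensive partition to least loaded reducer" assignment are all hypothetical choices made for illustration, as are the names `cluster_cost`, `partition_costs`, and `assign_greedy`. It compares one partition per reducer against a larger fixed number of partitions spread over the same reducers.

```python
import heapq


def cluster_cost(size, exponent=2):
    """Assumed cost of reducing one cluster (all tuples sharing a key)."""
    return size ** exponent


def partition_costs(key_counts, num_partitions, exponent=2):
    """Place each cluster in a partition (key modulo partition count) and
    sum the assumed cluster costs per partition."""
    costs = [0] * num_partitions
    for key, count in key_counts.items():
        costs[key % num_partitions] += cluster_cost(count, exponent)
    return costs


def assign_greedy(costs, num_reducers):
    """Assign partitions to reducers, most expensive first, always picking
    the reducer with the smallest load so far. Returns the load per reducer."""
    loads = [0] * num_reducers
    heap = [(0, r) for r in range(num_reducers)]
    for cost in sorted(costs, reverse=True):
        load, r = heapq.heappop(heap)
        loads[r] = load + cost
        heapq.heappush(heap, (loads[r], r))
    return loads


# Hypothetical skewed input: 64 integer keys with Zipf-like cluster sizes.
key_counts = {k: 1000 // (k + 1) for k in range(64)}

# Baseline: as many partitions as reducers, so each reducer gets exactly one.
coarse = assign_greedy(partition_costs(key_counts, 4), 4)
# More partitions than reducers, balanced by estimated cost.
fine = assign_greedy(partition_costs(key_counts, 32), 4)

print("max reducer load, 4 partitions: ", max(coarse))
print("max reducer load, 32 partitions:", max(fine))
```

Because the job finishes only when the slowest reducer does, the maximum load is the figure to compare. In this toy input the largest cluster dominates the cost, so a finer partitioning helps only up to the cost of that single cluster; it cannot reduce the load below it.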



Paper Citation

in Harvard Style

Gufler B., Augsten N., Reiser A. and Kemper A. (2011). Handling Data Skew in MapReduce. In Proceedings of the 1st International Conference on Cloud Computing and Services Science - Volume 1: CLOSER, ISBN 978-989-8425-52-2, pages 574-583. DOI: 10.5220/0003391105740583
