Benchmarking Hadoop Performance in the Cloud - An in Depth Study of Resource Management and Energy Consumption

Aymen Jlassi, Patrick Martineau

Abstract

Virtualization technologies have proven their ability to deliver good performance in high performance computing (HPC). Over the last decade, big data tools have emerged with their own performance and infrastructure requirements. Experts with broad experience in the HPC domain can readily evaluate the infrastructures used to run big data tools. This paper evaluates two virtualization technologies in the context of big data tools: we compare the performance and energy consumption of Docker containers and VMware, and benchmark Hadoop (JoshBaer, 2015) in both environments. First, the aim is to reduce the cost of deploying Hadoop in the cloud. Second, we discuss and analyze the assumptions learned from HPC experiments and their applicability in the big data context. Third, we provide the Hadoop community with an in-depth study of resource consumption as a function of the deployment environment. We find that Docker containers give better performance in most experiments, while energy consumption varies with the executed workload.

References

  1. U.S. Environmental Protection Agency (2007). Report to congress on server and data center energy efficiency.
  2. CoreOS (2015). Getting started with systemd. https://coreos.com/docs/launching-containers/launching/getting-started-with-systemd/.
  3. Natural Resources Defense Council (2014). American data centers are wasting huge amounts of energy.
  4. David, M. (2007). Understanding full virtualization, paravirtualization and hardware assist. White Paper.
  5. Dean, J. and Ghemawat, S. (2008). Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113.
  6. Advanced Micro Devices (2012). Hadoop performance tuning guide - AMD.
  7. Fadika, Z., Dede, E., Govindaraju, M., and Ramakrishnan, L. (2011). Benchmarking mapreduce implementations for application usage scenarios. In Grid 2011, pages 90-97.
  8. Fadika, Z., Govindaraju, M., Canon, R., and Ramakrishnan, L. (2012). Evaluating hadoop for data-intensive scientific operations. IEEE CLOUD 2012.
  9. Institute for Energy and Transport (IET) (2015). Data centres energy efficiency. http://iet.jrc.ec.europa.eu/energyefficiency/ict-codes-conduct/data-centres-energy-efficiency.
  10. Gandomi, A. and Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2):137-144.
  11. Gantikow, H., Klingberg, S., and Reich, C. (2015). Container-based virtualization for hpc. In CLOSER 2015, pages 543-551.
  12. Gomes Xavier, M., Veiga Neves, M., and Fonticielha de Rose, C. (2014). A performance comparison of container-based virtualization systems for mapreduce clusters. In 22nd Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP).
  13. Gu, Y. and Grossman, R. L. (2009). Lessons learned from a year's worth of benchmarks of large data clouds. MTAGS '09, pages 31-36. ACM.
  14. Huang, S., Huang, J., Dai, J., Xie, T., and Huang, B. (2010). The hibench benchmark suite: Characterization of the mapreduce-based data analysis. In Data Engineering Workshops (ICDEW), 2010, pages 41-51.
  15. Intel (2015). Intel virtualization technology (intel vt). http://www.intel.com/content/www/us/en/virtualization/virtualization-technology/intel-virtualizationtechnology.html.
  16. Jiang, D., Ooi, B. C., Shi, L., and Wu, S. (2010). The performance of mapreduce: An in-depth study. Proc. VLDB Endow., 3(1-2):472-483.
  17. Jlassi, A., Martineau, P., and Tkindt, V. (2015). Offline scheduling of map and reduce tasks on hadoop systems. In CLOSER 2015, pages 178-185.
  18. JoshBaer (2015). Hadoop wiki powerby. https://wiki.apache.org/hadoop/PoweredBy.
  19. Kontagora, M. and Gonzalez-Velez, H. (2010). Benchmarking a mapreduce environment on a full virtualisation platform. In CISIS 2010, pages 433-438.
  20. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. (2009a). A comparison of approaches to large-scale data analysis. SIGMOD '09, pages 165-178, New York, NY, USA.
  21. Pavlo, A., Paulson, E., Rasin, A., Abadi, D. J., DeWitt, D. J., Madden, S., and Stonebraker, M. (2009b). A comparison of approaches to large-scale data analysis. SIGMOD '09, pages 165-178.
  22. Peinl, R. and Holzschuher, F. (2015). The docker ecosystem needs consolidation. In CLOSER 2015, pages 535-542.
  23. Reshetova, E., Karhunen, J., Nyman, T., and Asokan, N. (2014). Security of OS-level virtualization technologies: Technical report. ArXiv e-prints.
  24. Shafer, J., Rixner, S., and Cox, A. (2010). The hadoop distributed filesystem: Balancing portability and performance. In ISPASS 2010, pages 122-133.
  25. Stonebraker, M., Abadi, D., DeWitt, D. J., Madden, S., Paulson, E., Pavlo, A., and Rasin, A. (2010). Mapreduce and parallel dbmss: Friends or foes? Commun. ACM, 53(1):64-71.
  26. Wen, Y., Zhao, J., Zhao, G., Chen, H., and Wang, D. (2012). A survey of virtualization technologies focusing on untrusted code execution. In IMIS'12, pages 378-383.
  27. Xavier, M., Neves, M., Rossi, F., Ferreto, T., Lange, T., and De Rose, C. (2013). Performance evaluation of container-based virtualization for high performance computing environments. In PDP, 2013, pages 233-240.
  28. Xu, G., Xu, F., and Ma, H. (2012). Deploying and researching hadoop in virtual machines. In ICAL '12, pages 395-399.


Paper Citation


in Harvard Style

Jlassi A. and Martineau P. (2016). Benchmarking Hadoop Performance in the Cloud - An in Depth Study of Resource Management and Energy Consumption. In Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 2: CLOSER, ISBN 978-989-758-182-3, pages 192-201. DOI: 10.5220/0005861701920201


in Bibtex Style

@conference{closer16,
author={Aymen Jlassi and Patrick Martineau},
title={Benchmarking Hadoop Performance in the Cloud - An in Depth Study of Resource Management and Energy Consumption},
booktitle={Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 2: CLOSER},
year={2016},
pages={192-201},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005861701920201},
isbn={978-989-758-182-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 6th International Conference on Cloud Computing and Services Science - Volume 2: CLOSER
TI - Benchmarking Hadoop Performance in the Cloud - An in Depth Study of Resource Management and Energy Consumption
SN - 978-989-758-182-3
AU - Jlassi A.
AU - Martineau P.
PY - 2016
SP - 192
EP - 201
DO - 10.5220/0005861701920201