Data Warehouse MFRJ Query Execution Model for MapReduce
Aleksey Burdakov, Uriy Grigorev, Victoria Proletarskaya, Artem Ustimov
2017
Abstract
The growing number of MapReduce applications makes estimating Data Warehouse access time an important task. Processing large volumes of data takes significant time and may exceed required thresholds, and fixing such problems once they are discovered during system operation is very costly. It is therefore beneficial to estimate the data processing time under peak load at the design stage, i.e. before the MapReduce tasks are implemented, so that design decisions can be made in time. Mathematical models are an indispensable analytical instrument for this purpose. This paper gives an overview of the n-dimensional MapReduce-based Data Warehouse Multi-Fragment-Replication Join (MFRJ) access method, analyzes the MapReduce workflow, and develops an analytical model that estimates the average execution time of a Data Warehouse query. The modeling results allow a system designer to make recommendations on the technical parameters of the query execution environment, the Data Warehouse, and the query itself, which is important when restrictions are imposed on the query execution time. The paper also describes the preparation and execution of an experiment in a cloud environment used to assess the adequacy of the model.
Paper Citation
in Harvard Style
Burdakov A., Grigorev U., Proletarskaya V. and Ustimov A. (2017). Data Warehouse MFRJ Query Execution Model for MapReduce. In Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-245-5, pages 206-215. DOI: 10.5220/0006238502060215
in Bibtex Style
@conference{iotbds17,
author={Aleksey Burdakov and Uriy Grigorev and Victoria Proletarskaya and Artem Ustimov},
title={Data Warehouse MFRJ Query Execution Model for MapReduce},
booktitle={Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2017},
pages={206-215},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0006238502060215},
isbn={978-989-758-245-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 2nd International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - Data Warehouse MFRJ Query Execution Model for MapReduce
SN - 978-989-758-245-5
AU - Burdakov A.
AU - Grigorev U.
AU - Proletarskaya V.
AU - Ustimov A.
PY - 2017
SP - 206
EP - 215
DO - 10.5220/0006238502060215