Paper

Authors: Marin Fotache; Marius-Iulian Cluci and Valerică Greavu-Şerban

Affiliation: Al. I. Cuza University of Iasi, Romania

ISBN: 978-989-758-426-8

Keyword(s): Big Data, Beowulf Clusters, Apache Spark, Spark SQL, Machine Learning, Distributed Computing, TPC-H.

Abstract: With distributed computing platforms deployed on affordable hardware, Big Data technologies have democratised the processing of huge volumes of structured and semi-structured data. Still, the cost of installing and operating even a relatively small cluster of commodity servers, or of hiring cloud resources, can be prohibitive for many companies and institutions. This paper builds two predictive models for estimating the main drivers of data processing performance for one of the most popular Big Data systems (Apache Spark), deployed on a gradually increasing number of nodes of a Beowulf cluster. Data processing performance was measured with randomly generated Spark SQL queries on the TPC-H database schema, with a variable number of joins (including self-joins), predicates, groups, aggregate functions and subqueries in the FROM clause. Using two machine learning techniques, random forest and extreme gradient boosting, the predictive models estimate query duration from predictors related to cluster setup and query structure, and assess the importance of each predictor for the variability of the outcome. The results were positive and encourage extending both the number of cluster nodes and the database scale.
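The query-generation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' generator: the function `random_query`, the predicate and aggregate pools, and the choice to always start from `lineitem` are assumptions made for illustration. Only the table and column names come from the public TPC-H schema. Self-joins and FROM-clause subqueries, which the paper also varies, are left out to keep the example short.

```python
import random

# Foreign-key chain from the TPC-H schema: (parent table, join condition).
# The chain always starts at lineitem (an assumption of this sketch).
TPCH_JOIN_CHAIN = [
    ("orders", "l_orderkey = o_orderkey"),
    ("customer", "o_custkey = c_custkey"),
    ("nation", "c_nationkey = n_nationkey"),
    ("region", "n_regionkey = r_regionkey"),
]

# Hypothetical predicate pool per table; the literal values are arbitrary.
PREDICATES = {
    "lineitem": ["l_quantity > 10", "l_discount < 0.05"],
    "orders": ["o_orderstatus = 'F'", "o_totalprice > 1000"],
    "customer": ["c_acctbal > 0"],
    "nation": ["n_name <> 'ROMANIA'"],
    "region": ["r_name = 'EUROPE'"],
}

# One grouping column per table (illustrative choice).
GROUP_COLUMNS = {
    "lineitem": "l_returnflag",
    "orders": "o_orderpriority",
    "customer": "c_mktsegment",
    "nation": "n_name",
    "region": "r_name",
}

# Aggregates only touch lineitem, which is present in every generated query.
AGGREGATES = [
    "COUNT(*)",
    "SUM(l_extendedprice)",
    "AVG(l_quantity)",
    "MAX(l_discount)",
]


def random_query(rng, max_joins=4):
    """Return (sql_text, features) for one randomly structured query.

    `features` holds the structural counts that could serve as predictors
    of query duration (number of joins, predicates, aggregates).
    """
    n_joins = rng.randint(0, min(max_joins, len(TPCH_JOIN_CHAIN)))
    tables = ["lineitem"]
    from_clause = "lineitem"
    for parent, condition in TPCH_JOIN_CHAIN[:n_joins]:
        from_clause += f" JOIN {parent} ON {condition}"
        tables.append(parent)

    # Draw predicates only from tables that are actually in the FROM clause.
    candidates = [p for t in tables for p in PREDICATES[t]]
    n_predicates = rng.randint(0, len(candidates))
    predicates = rng.sample(candidates, n_predicates)

    group_column = GROUP_COLUMNS[rng.choice(tables)]
    n_aggregates = rng.randint(1, len(AGGREGATES))
    aggregates = rng.sample(AGGREGATES, n_aggregates)

    sql = f"SELECT {group_column}, {', '.join(aggregates)} FROM {from_clause}"
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)
    sql += f" GROUP BY {group_column}"

    features = {
        "n_joins": n_joins,
        "n_predicates": n_predicates,
        "n_aggregates": n_aggregates,
    }
    return sql, features


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the workload is reproducible
    for _ in range(3):
        sql, features = random_query(rng)
        print(features, sql)
```

In a setup like the one the abstract describes, each generated query would be timed on the cluster, and the resulting feature rows (together with cluster-setup variables such as the number of nodes) would form the training data for the random forest and gradient boosting models.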

CC BY-NC-ND 4.0

Paper citation in several formats:
Fotache, M.; Cluci, M. and Greavu-Şerban, V. (2020). Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters. In Proceedings of the 5th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS, ISBN 978-989-758-426-8, pages 327-334. DOI: 10.5220/0009407903270334

@conference{iotbds20,
author={Marin Fotache and Marius{-}Iulian Cluci and Valerică Greavu{-}Şerban},
title={Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters},
booktitle={Proceedings of the 5th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS},
year={2020},
pages={327-334},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0009407903270334},
isbn={978-989-758-426-8},
}

TY - CONF
JO - Proceedings of the 5th International Conference on Internet of Things, Big Data and Security - Volume 1: IoTBDS
TI - Low Cost Big Data Solutions: The Case of Apache Spark on Beowulf Clusters
SN - 978-989-758-426-8
AU - Fotache, M.
AU - Cluci, M.
AU - Greavu-Şerban, V.
PY - 2020
SP - 327
EP - 334
DO - 10.5220/0009407903270334
