using (mainly outside office hours) a small subset of
modest organisational workstations.
Even though the variability of some predictors (such as
the number of cluster nodes) was low, both machine
learning models predicted the query duration well
from the main query and cluster parameters. The
Random Forests model performed slightly better than
the xgboost model, with a concordance correlation
coefficient above 90% and an R² of about 85%.
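For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how they can be computed, assuming Lin's definition of the concordance correlation coefficient (with population moments) and R² taken as the coefficient of determination; the study itself may have used a different R² definition or library implementation. The function names and the sample durations are hypothetical and serve only as illustration.

```python
from statistics import fmean


def concordance_correlation(observed, predicted):
    """Lin's concordance correlation coefficient, using 1/n (population) moments."""
    n = len(observed)
    mean_o, mean_p = fmean(observed), fmean(predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n
    # Penalises both weak correlation and systematic shift between the series.
    return 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_o = fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical query durations (seconds); not data from the study.
    observed = [12.0, 35.5, 8.2, 60.1, 22.4]
    predicted = [14.1, 33.0, 9.0, 55.7, 25.3]
    print("CCC:", concordance_correlation(observed, predicted))
    print("R2: ", r_squared(observed, predicted))
```

Unlike the Pearson correlation, the concordance coefficient drops when predictions are consistently shifted or rescaled relative to the observed durations, which is why it is a stricter measure of agreement.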
The variable importance provided by both models
suggests, as expected, that query complexity
(approximated by the number of Spark tasks needed for
query completion and the number of joins) is the main
driver of query performance. The database size was
also ranked as an important predictor.
Unexpectedly, predictors such as the number of
cluster nodes, the gap between the cluster memory
and the database size, tuple grouping and group
filtering, and the cluster manager were ranked as less
important (for the outcome variability) by both
models.
Some further research directions may include:
Increasing the number of cluster nodes;
Running the queries on TPC-H databases with
larger sizes;
Adding Kubernetes as a cluster manager in
order to have a complete picture of all the available
resource managers;
Optimizing the JVM, the garbage collection, and
OS parameters in order to accelerate
Spark performance;
Assessing the performance of other Spark features
such as Streaming, Machine Learning and
GraphX in order to see how they perform on a
Beowulf cluster;
Testing with the dataset in other formats and storage
back ends, not just the default generated by TPC-H
(e.g. Avro, Parquet, blob storage and AWS S3), to see
if there are any performance gains;
Diversifying the hardware resources and storage
types (e.g. adding SSDs or RAID configurations);
Taking into account the hardware bottlenecks
which might occur during the testing, and
quantifying their effect on performance;
Running the queries on other Big Data systems (such
as Hive and Pig) to compare the performance.
Overall, the results suggest that running SQL queries
on Spark using modest Beowulf clusters is a viable
solution, but this needs to be confirmed by subsequent
comparisons with other Big Data solutions, either
on-disk (e.g. Hive, Pig) or in-memory (e.g. in-memory
features of SQL servers, MemSQL, VoltDB, Impala).