
5 CONCLUSIONS
Several big data processing platforms have emerged,
with Databricks standing out as one of the most
prominent options. It offers a range of cloud-based
services for executing complex queries and
transactional workloads. However, it requires users to
dimension the cloud environment by specifying the type
and number of VMs for deployment, a task that can be
far from trivial. Identifying the characteristics of a
SQL-like workload and estimating the appropriate VM
type from a selection of more than 100 options is a
complex challenge. While mechanisms like autoscaling
exist in Databricks, they are costly and may not be
straightforward for non-expert users to configure.
Improperly dimensioning the cloud environment, through
either over- or under-dimensioning, can hurt workload
performance, increase financial costs, or both.
This paper proposes a middleware named APOENA,
designed to dimension the cloud environment for
specific workloads, i.e., SQL-like queries. APOENA
collects provenance data and uses this historical data
to train ML models that predict query performance for a
particular combination of query and virtual cluster
configuration, where the configuration comprises the
type and number of VMs involved in the execution.
Although APOENA focuses on Databricks in this paper, it
could be extended to work with other big data
frameworks such as Apache Spark.
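The prediction-driven dimensioning summarized above can be illustrated with a minimal sketch. It is not APOENA's implementation: the provenance records, the feature encoding, the VM names and prices, and the nearest-neighbour classifier are all hypothetical stand-ins for the trained ML models, chosen only so that the example runs with the Python standard library.

```python
from math import dist

# Hypothetical VM catalogue: vCPUs and hourly price per VM (not real Databricks data).
VCPUS = {"small": 4, "large": 16}
PRICE = {"small": 0.5, "large": 2.0}

# Hypothetical provenance records:
# (number of joins, input size in GB, VM type, number of VMs, observed runtime class)
HISTORY = [
    (1, 10, "small", 2, "fast"),
    (4, 200, "small", 2, "slow"),
    (4, 200, "small", 8, "medium"),
    (4, 200, "large", 4, "fast"),
    (2, 50, "small", 2, "medium"),
    (2, 50, "large", 2, "fast"),
]


def encode(joins, input_gb, vm_type, n_vms):
    """Map a (query, cluster configuration) pair to a crude numeric feature vector."""
    return (joins, input_gb / 50.0, VCPUS[vm_type] * n_vms / 16.0)


def predict_class(joins, input_gb, vm_type, n_vms):
    """Predict the runtime class as that of the nearest historical record (1-NN)."""
    point = encode(joins, input_gb, vm_type, n_vms)
    nearest = min(HISTORY, key=lambda rec: dist(point, encode(*rec[:4])))
    return nearest[4]


def recommend(joins, input_gb, candidates, target="fast"):
    """Return the cheapest candidate (vm_type, n_vms) predicted to reach `target`."""
    suitable = [
        (vm, n) for vm, n in candidates
        if predict_class(joins, input_gb, vm, n) == target
    ]
    if not suitable:
        return None  # no candidate cluster is predicted to meet the target class
    return min(suitable, key=lambda cfg: PRICE[cfg[0]] * cfg[1])
```

A call such as `recommend(4, 200, [("small", 2), ("small", 8), ("large", 4)])` returns the cheapest candidate cluster whose predicted class matches the target, or `None` when no candidate qualifies.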
Experiments with real-world workloads demonstrated that
APOENA classified query execution times with accuracy
and weighted F1 scores above 90%. Future work involves
implementing a retraining mechanism for APOENA, as the
current version does not retrain its models
automatically. Additionally, we plan to evaluate APOENA
using a broader range of real-world workloads.
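For reference, the two evaluation metrics mentioned above can be computed as in the following sketch. It assumes the usual definition of weighted F1 (per-class F1 averaged with weights proportional to each class's support); the function names and class labels are illustrative and are not taken from APOENA.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp  # true instances of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical execution-time classes for six queries.
y_true = ["fast", "fast", "slow", "slow", "slow", "medium"]
y_pred = ["fast", "slow", "slow", "slow", "slow", "medium"]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```

Weighting by support keeps a rare execution-time class from dominating the score while still counting every class in proportion to how often it occurs.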
ACKNOWLEDGMENTS
This study was financed in part by the Coordenação
de Aperfeiçoamento de Pessoal de Nível Superior
- Brasil (CAPES) - Finance Code 001. This paper
was also partially financed by CNPq (grant
nº 311898/2021-1) and FAPERJ (grant
nº E-26/202.806/2019).
ICEIS 2024 - 26th International Conference on Enterprise Information Systems