geneous, requiring the adaptation of existing rou-
tines, which can be error-prone (Rahm and Bernstein,
2006). Those errors may result in loading inconsistent
data into the BIDW, which makes the identification of
such inconsistencies a top priority.
Provenance (Freire et al., 2008; Herschel et al.,
2017) presents a natural solution to support users in
tracking the steps of a process, following the data
derivation path, and monitoring relevant metadata
and parameters associated with data transformations.
Provenance data have been employed in workflows
from various domains, such as healthcare (Corrigan
et al., 2019). We observe that ETL routines are similar to large-scale workflows: both are iterative processes that face the challenge of feedback loops and involve multiple data transformations, users, datasets, and parameters (Silva et al., 2021).
Therefore, in this study, we propose the inte-
gration of provenance capturing and storing mecha-
nisms into the ETL routines of a BIDW. Specifically,
we introduce a novel strategy for automatically in-
jecting instructions into the ETL transformations to
collect and store both prospective and retrospective
provenance in a dedicated database. This database
serves as a valuable source of information for mon-
itoring, auditing, and correcting issues in the BIDW,
thereby enhancing data quality. Our solution, named
ProvETL, integrates a provenance schema with in-
jection strategies to create a BIDW extension that ad-
dresses the following questions: (i) Which data out-
put was produced after the ETL transformation con-
suming a given input? (ii) Which user executed this
transformation with specific input data and setup?
(iii) What is the execution summary for this transformation and input data (in terms of an aggregation function, such as COUNT or AVG)? (iv) What
transformation sequence generates this data product?
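To make these four questions concrete, the sketch below answers each of them over a tiny in-memory provenance store. The record layout, table names, and function names are illustrative assumptions for this example, not ProvETL's actual schema.

```python
# Hypothetical retrospective-provenance records; in ProvETL these would
# live in the dedicated provenance database.
executions = [
    {"transformation": "standardize_currency", "user": "alice",
     "input": "staging.payroll", "output": "dw.payroll",
     "params": {"target": "USD"}, "rows_out": 1200},
    {"transformation": "load_dimension", "user": "bob",
     "input": "dw.payroll", "output": "dw.dim_employee",
     "params": {}, "rows_out": 310},
]

def output_of(transformation, input_name):
    """(i) Which output was produced from a given input?"""
    return [e["output"] for e in executions
            if e["transformation"] == transformation
            and e["input"] == input_name]

def who_ran(transformation, input_name):
    """(ii) Which user executed this transformation on this input?"""
    return [e["user"] for e in executions
            if e["transformation"] == transformation
            and e["input"] == input_name]

def summary(transformation, agg):
    """(iii) Aggregate summary (e.g., sum, len) over a transformation."""
    return agg([e["rows_out"] for e in executions
                if e["transformation"] == transformation])

def lineage(product):
    """(iv) Transformation sequence that generates a data product,
    found by walking the derivation path backwards."""
    chain, current = [], product
    while True:
        step = next((e for e in executions if e["output"] == current), None)
        if step is None:
            break
        chain.append(step["transformation"])
        current = step["input"]
    return list(reversed(chain))
```

For instance, `lineage("dw.dim_employee")` walks back through the two executions and yields the ordered sequence of transformations that produced that table.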
ProvETL also provides a web-based analytical
layer with a provenance graph of the executed ETL
routine, allowing users to interactively explore an-
swers to previous questions by traversing the graph.
To showcase the capabilities of our proposed solu-
tion, we investigated three real-world scenarios re-
lated to personnel admission, public information in
paycheck reports, and staff dismissals from our Uni-
versity BIDW. The results indicate that ProvETL
was effective in (i) detecting poor-quality inputs and (ii) identifying transformations susceptible to noisy data, while incurring acceptable processing overhead.
The remainder of this paper is organized as follows.
Section 2 discusses the background. Section 3 in-
troduces the material and methods, while Section 4
presents the qualitative and quantitative results. Fi-
nally, Section 5 offers the conclusions.
2 BACKGROUND
Provenance Data. This term denotes metadata that
explains the process of generating a specific piece of
data. Thus, provenance records data derivation paths
within a particular context for further querying and
exploration, which fosters reproducibility in scientific
experiments. We apply this concept as a rich source
of information, encompassing the routines, parameter
values, and data transformations executed by users
during the ETL process, i.e., a resource for moni-
toring and assuring the data quality of the BIDW.
The W3C recommendation, PROV (Groth and
Moreau, 2013), defines a data model for representing
provenance data. It conceptualizes provenance in
terms of Entities, Agents, Activities, and various
relationship types. Here, an entity could represent
a specific table of the OLTP database. Activities
are actions performed within the ETL routines that
affect entities, such as currency standardization,
with execution times and error messages. Lastly,
an Agent is a user responsible for executing ETL
routines. The specification of an ETL pipeline can be viewed as Prospective Provenance (p-prov), a form of provenance data that records the steps to be carried out in the ETL routines. Another category is Retrospective Provenance (r-prov), which captures details of the actual execution, such as when each transformation was executed.
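As a rough illustration, the PROV concepts above can be modeled in a few lines of Python. The class and field names below are our own simplification for an ETL step, not the W3C serialization or ProvETL's internal representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Entity:
    """Something provenance is recorded about, e.g., an OLTP table."""
    identifier: str

@dataclass
class Agent:
    """Who is responsible, e.g., the user running the ETL routine."""
    identifier: str

@dataclass
class Activity:
    """An action within the ETL routine, e.g., currency standardization."""
    identifier: str
    used: List[Entity] = field(default_factory=list)          # PROV 'used'
    generated: List[Entity] = field(default_factory=list)     # 'wasGeneratedBy'
    associated_with: Optional[Agent] = None                   # 'wasAssociatedWith'

# A single ETL step expressed with the three PROV concepts:
raw = Entity("oltp.payroll")
clean = Entity("dw.payroll")
alice = Agent("alice")
step = Activity("standardize_currency",
                used=[raw], generated=[clean], associated_with=alice)
```

Chaining such activities, where one step's generated entity is used by the next, yields exactly the derivation paths that provenance queries traverse.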
Provenance Capturing. Capturing provenance is
a process that can take two forms: (i) based on
instrumentation or (ii) instrumentation-free. In the
first category, the collection of provenance or the in-
vocation of a provenance system, e.g., YesWorkflow
(McPhillips et al., 2015), requires injecting specific
calls into the script or program where provenance
needs to be captured. While this approach offers the
advantage of capturing only the necessary prove-
nance, it does require some effort from the users.
The latter category captures every action performed in the script without requiring any modification to the source code. Approaches in this category, e.g., noWorkflow (Murta et al., 2015), have the advantage of being transparent to the user, but they may collect a substantial volume of data, a significant portion of which may be irrelevant to the user's analysis, e.g., the record of a file being opened.
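A minimal sketch of the instrumentation-based approach, assuming a Python ETL step: a decorator stands in for the injected provenance calls and appends retrospective provenance (user, parameters, timing, errors) to an in-memory log. The function names and record fields are illustrative; a real system such as ProvETL would write to a provenance database instead.

```python
import functools
import getpass
import time

# Illustrative in-memory provenance store.
provenance_log = []

def capture_provenance(func):
    """Simulates the injected instrumentation call around an ETL step."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {"activity": func.__name__,
                  "agent": getpass.getuser(),
                  "params": kwargs,
                  "started": time.time()}
        try:
            result = func(*args, **kwargs)
            record["status"] = "ok"
            return result
        except Exception as exc:
            record["status"] = f"error: {exc}"   # errors are provenance too
            raise
        finally:
            record["finished"] = time.time()
            provenance_log.append(record)
    return wrapper

@capture_provenance
def standardize_currency(rows, target="USD"):
    # Toy transformation: tag every row with the target currency.
    return [{**row, "currency": target} for row in rows]

standardize_currency([{"amount": 10}], target="EUR")
```

Only the decorated steps are captured, which reflects the trade-off described above: the user decides what provenance is collected, at the cost of touching the code, whereas instrumentation-free tools record everything indiscriminately.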
Data Warehousing. The BIDW model encompasses
four main layers. The first layer represents External
Data Sources, containing all input data that may
provide valuable insights. Data sources include
relational databases (i.e., OLTP databases), JSON
ICEIS 2024 - 26th International Conference on Enterprise Information Systems