A Quality-based ETL Design Evaluation Framework
Zineb El Akkaoui¹, Alejandro Vaisman² and Esteban Zimányi²
¹SEEDS Team, INPT Lab, Rabat, Morocco
²CoDE Lab, Université Libre de Bruxelles, Brussels, Belgium
Keywords:
ETL processes, Data Integration Performance, Design Quality, Theoretical Validation, Empirical Validation.
Abstract:
The Extraction, Transformation and Loading (ETL) process is a crucial component of a data warehousing
architecture. ETL processes are usually complex and time-consuming. Particularly important (although over-
looked) in ETL development is the design phase, since it impacts the subsequent ones, i.e., implementation
and execution. Addressing ETL quality at the design phase allows taking actions that can have a positive and
low-cost impact on process efficiency. Using the well-known Briand et al. framework (a theoretical validation
framework for system artifacts), we formally specify a set of internal metrics that we conjecture to be corre-
lated with process efficiency. We also provide empirical validation of this correlation, as well as an analysis of
the metrics that have stronger impact on efficiency. Although there exist proposals in the literature addressing
design quality in ETL, as far as we are aware, this is the first proposal aimed at using metrics over ETL
models to predict the performance associated with these models.
1 INTRODUCTION
A data warehouse (DW) is a data repository that con-
solidates data coming from multiple and heteroge-
neous sources, in order to be exploited for decision-
making. This consolidation is performed through a
collection of processes denoted Extraction, Transfor-
mation and Loading (ETL). ETL development is usu-
ally composed of four phases, shown in Fig. 1 (Inmon,
2002). It has been widely argued that the ETL process
is complex and time-consuming, to the point that
roughly 80% of a DW project's effort is due to this phase (In-
mon, 2002; A. Simitsis and Sellis, ).
ETL processes are typically implemented as SQL
or Java programs orchestrated to answer organiza-
tion requirements. Thus, ETL optimization cannot
be reduced to SQL query optimization as usual in
databases: it should also take into account the process
structure. Structural optimization must manage two
main aspects: task combination and task ordering. The latter
has been studied in (A. Simitsis and Sellis, 2005) as a
state space problem where, by changing the order of
the process tasks, an algorithm finds the process al-
ternative providing the best execution. However, the
combination aspect has not been covered in the lit-
erature. This factor is also relevant, since to tackle
the same integration problem, several process alter-
natives are possible, having different numbers of tasks
and work combinations (e.g., a process containing
few tasks, each one performing heavy work, can be
equivalent to a process that contains many lightly loaded
tasks). Comprehensive approaches aimed at finding
the best process structure are still lacking.
A line of research oriented towards enhancing
the ETL process proposes conceptual modeling lan-
guages to define ETL workflows (Z. El Akkaoui
and Zimányi, 2012; Trujillo and Luján-Mora, 2003;
P. Vassiliadis and Skiadopoulos, ), sometimes pre-
sented together with ETL design guidelines (U. Dayal
and Wilkinson, 2009). We comment on these propos-
als in Section 2. Most of them propose metrics that
can help improve characteristics like usability and
maintenance. However, to the best of our knowledge,
no work has studied and validated ETL model design
criteria with respect to the crucial issue of ETL pro-
cess efficiency.
To provide a solution to this problem, in this pa-
per we present tools that, based on the ETL design
characteristics, can help to predict the efficiency of
the process. We do this by means of a set of structural
(also denoted internal) metrics, defined based on the
Briand et al. (L. Briand and Basili, 1996) theoretical
framework in order to guarantee the mathematical va-
lidity of the approach. The nice property of these met-
rics is that they can be computed at design time, and
that, as we will show, they are correlated with ETL ef-
ficiency, specifically with the process throughput (an
external metric). Therefore, proving the correlation
stated above is an important achievement in ETL de-
sign, since, given a collection of alternative ETL mod-
els, it would allow predicting the one that is likely to
deliver the best performance without needing to write
a single line of code, dramatically reducing the costs
of the DW project.
Figure 1: Development steps & improvement criteria.
The paper is organized as follows. Section 2 dis-
cusses related work. Section 3 presents our running
example, while in Section 4 we introduce the formal
data model for the ETL process graph. Section 5 stud-
ies the internal and external measures used to evaluate
the design quality of the ETL model. Section 6 reports
the experimental validation and its results, concluding
in Section 7.
2 RELATED WORK
A quality model (e.g., (B. Boehm and Merritt, 1978))
defines a set of quality goals together with a set of
practices for achieving and evaluating them. The
ISO/IEC 9126 standard
1
provides a standard qual-
ity model for software products. Built on soft-
ware quality approaches, ETL process quality eval-
uation methods have been proposed. These proposals
mainly address ETL usability or maintainability eval-
uations. Quality evaluations for ETL design proposed
in (L. Muñoz and Trujillo, 2010; P. Vassiliadis and
Skiadopoulos, ) derive measures based on the Briand
et al. (L. Briand and Basili, 1996) evaluation frame-
work. While Vassiliadis et al. (P. Vassiliadis and Skia-
dopoulos, ) limit themselves to proposing supposedly
useful measures for ETL models, Muñoz et al. (L.
Muñoz and Trujillo, 2010) validate the measures em-
pirically by studying their effect on two ISO quality
dimensions, namely usability and maintenance. Fur-
ther, (Z. El Akkaoui and Trujillo, 2013; G. Papaste-
fanatos and Vassiliou, 2009) cover maintainability and
support for data source evolution.
Enhancing ETL efficiency at the implementation
step has been studied in a few works. Simitsis et
al. (A. Simitsis and Dayal, 2013) propose an opti-
Figure 2: (a) Tables from source database; (b) Location hierarchy.
mizer that converts the logical flow to an executable
form that is optimized for the underlying infrastruc-
ture according to user-specified objectives. Another
optimization approach (A. Simitsis and Dayal, 2012)
aims to avoid bad execution plans by partitioning
the original data model into submodels that are run
on different engines. In another contribution (T. Ma-
jchrzak and Kuchen, 2011), a set of performance
measures is defined and a comparison is made be-
tween Talend (a code generation-based tool) and Pen-
taho (an engine-based tool). Results showed that Ta-
lend is more efficient in terms of execution time and
CPU usage, while Pentaho is less memory-consuming.
Further, (Ali and Wrembel, 2017; G. Kougka and
Simitsis, 2018) survey most of the aforementioned
data flow optimization techniques; the latter expands
the study to various other data-centric flow topics
besides ETL processes, including database engines,
MapReduce systems, and business processes.
In the present work we go further than existing pro-
posals, quantitatively assessing the relationship be-
tween a set of design quality metrics evaluated over
the structure of the ETL process model, and external
metrics obtained at execution time.
3 RUNNING EXAMPLE
To model a DW we use the well-known star-
schema (Kimball and Ross, 2002), where DW tables
are of two kinds: dimension and fact tables. Dimen-
sions actually represent aggregation hierarchies along
which fact data are summarized. ETL processes take
data from the sources into the star-schema tables. To
illustrate our evaluation framework, we use a portion
of an ETL process that loads a collection of dimension
tables containing customer and location data. Fig. 2b
shows dimension tables DimArea, DimCountry, Dim-
State, and DimGeography. The first three are
populated using an XML file denoted Territories.xml.
DimGeography is populated using geographical data
with attributes City, State, ZipCode, and Country,
present in the Customer and Supplier tables of the
Northwind database (http://www.microsoft.com/en-us/download/details.aspx?id=23654)
(Fig. 2a). Before populating the
Location hierarchy, the geography data needs some
cleansing and transformation operations to fit the data
warehouse requirements, namely: (a) Data comple-
tion, which requires dealing with null values. For
example, the attribute State may be null in the Cus-
tomer and Supplier source tables. The ETL process
fixes this by using an external source file Cities.txt,
which contains three fields: city, state, and country.
(b) Data consolidation. For example, in the source
databases, attribute State contains either a state name
(e.g., California) or a state code (e.g., CA). In the lat-
ter case, the state code is either left empty or con-
verted into the state name using the State table with at-
tributes StateId, StateName, and Code (the ISO stan-
dard code), which contains the link between the state
name and its code. (c) Consistency, in particular with
respect to referential integrity constraints. During the
loading of the data into the data warehouse, referen-
tial integrity between all the hierarchy tables must be
ensured.
To model complex processes, several concep-
tual modeling tools (mentioned in Sections 1 and 2)
have been proposed in the literature. In this paper
we use an ETL model based on BPMN4ETL (El
Akkaoui and Zimányi, 2012), which models ETL pro-
cesses as workflows, extending the BPMN notation.
Fig. 3 shows the loading process for the DimGeog-
raphy dimension of our running example, modeled
in BPMN4ETL. In more detail, an ETL model in
BPMN4ETL is perceived as composed of a control
process containing several data processes. A control
process (top process of Fig. 3) manages the coarse-
grained groups of tasks and/or sub-processes, while a
data process (bottom processes of Fig. 3) operates at
finer granularity, detailing how input data are trans-
formed and output data are produced. For exam-
ple, populating each fact (dimension, view, tempo-
rary, etc.) table in the data warehouse constitutes a
data process, whereas sequencing the load of differ-
ent dimension levels constitutes a control process.
4 DATA AND QUALITY MODELS
4.1 ETL Data Process Graph
We now present the ETL model we use in this pa-
per. Due to space limitations we restrict ourselves to
study the data process perspective, which constitutes
the portion of the ETL process where transformations
occur, and we do not deal with the control process.
There is T, a set of node types. T = {“data input”,
“data output”, “filter”, “field lookup”, “field deriva-
tion”, “field generation”, “join”, “union”, “aggrega-
tion”, “sort”, “pivot”, “script”}. There is also a set A
containing a list of possible actions performed by a
node. A = {“field manipulation”, “field generation”,
“join”, “lookup”, “branching”, “extraction”, “load”},
where the “field manipulation” action covers field com-
putation, deletion, addition, sorting, pivoting, and
splitting. In addition, a node has a traversing stream.
The input stream of a node has schema
((field_1, field_1.datatype), . . . , (field_i, field_i.datatype)).
Definition 1 (Data Process Graph (DPG)). A data
process graph is a directed graph G(N, E) where N =
{n_1, n_2, . . . , n_p} is the set of nodes representing the
data tasks, and E = {e_1, e_2, . . . , e_q} is the set of edges
between nodes. An edge e = (a, b) ∈ E means that b
receives data from node a for further processing. In
addition, the following holds:
There is a function type with signature N → T that
maps a node n to its type.
Each node has an associated set of actions, which
is a subset of A. There is a relation actions ⊆ N ×
A reflecting this association.
A relation schema ⊆ N × S defines the input fields
of a node, where S is the set of input schemas.
A node belongs to a flow category. It is either a
branching node (“filter”), a join node (“join”), a
union node (“union”), a lookup node (“field lookup”),
or a unitary node, for the other node types.
A node belongs to a script category. It is either a
script node “script”, or a non-script node. In ad-
dition, “datainput” and “fieldlookup” nodes can
also be considered script nodes when they include
a data extraction script. Otherwise, they are con-
sidered non-script nodes.
A node belongs to a stream category accord-
ing to the applied treatment on its traversing
data. It is either a row-by-row node (“field
derivation”, “field generation”), a row-set node
(“sort”, “pivot”, “aggregation”), or an input-
output node (“data input”, “data output”). Row-
by-row nodes are asynchronous (each row is processed
when it arrives), while row-set nodes are synchronous
(processing starts only when the whole row-set arrives).
Data annotations can be associated to nodes and
edges as free text.
Figure 3: DimGeography load data processes.
Definition 2 (Valid Data Process Graph). A Data Pro-
cess graph G is valid if the following constraints hold:
G has at least two nodes: a “data input” node
and a “data output” node.
Each node in G respects the allowed in- and out-
degrees, predefined according to its type.
Each node in G has at least one predecessor ex-
cept for the “data input” nodes.
Each node performs a number of actions. Script
nodes have no predefined number of actions,
while non-script nodes have one action.
The ETL process in Fig. 3 depicts a graph with
12 nodes. According to the definitions above, the
node n_1 = DataInput is a unitary, non-script, and row-
by-row node, such that type(n_1) = “data input” and
actions(n_1) = {“extraction”}.
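To make the above definitions concrete, the following short Python sketch shows one possible in-memory encoding of DPG nodes and of the flow, script, and stream categorizations of Definition 1. It is only an illustration under our own naming assumptions (Node, DPG, flow_category, stream_category, and is_script are not part of BPMN4ETL or of any ETL tool).

from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

# Node types T from Section 4.1 (actions A are plain strings as well).
NODE_TYPES = {"data input", "data output", "filter", "field lookup",
              "field derivation", "field generation", "join", "union",
              "aggregation", "sort", "pivot", "script"}

@dataclass(frozen=True)
class Node:
    name: str                     # e.g., "n1"
    type: str                     # an element of NODE_TYPES
    actions: FrozenSet[str]       # a subset of the action set A

@dataclass
class DPG:
    nodes: List[Node]
    edges: List[Tuple[str, str]]  # (a, b): node b receives data from node a

def flow_category(n: Node) -> str:
    if n.type == "filter":
        return "branching"
    if n.type in ("join", "union"):
        return "join/union"
    if n.type == "field lookup":
        return "lookup"
    return "unitary"

def stream_category(n: Node) -> str:
    if n.type in ("field derivation", "field generation"):
        return "row-by-row"       # asynchronous: rows processed as they arrive
    if n.type in ("sort", "pivot", "aggregation"):
        return "row-set"          # synchronous: waits for the whole row set
    if n.type in ("data input", "data output"):
        return "input-output"
    return "other"

def is_script(n: Node) -> bool:
    # "script" nodes, plus data input / field lookup nodes that embed a
    # data extraction script (modeled here as having more than one action).
    return n.type == "script" or (
        n.type in ("data input", "field lookup") and len(n.actions) > 1)

Under this encoding, node n_1 of Fig. 3 could be written as Node("n1", "data input", frozenset({"extraction", "join"})), following the annotations later used in Example 3.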
4.2 Measure Families
Quality concepts like complexity, coupling, cohesion
or size are very often subject to interpretation. Thus,
Briand et al. (L. Briand and Basili, 1996) pro-
posed a framework facilitating the definition of qual-
ity measures based on mathematical properties. This
guides the designer in creating multi-aspect and non-
redundant measures. The authors proposed a set of
mathematical properties general enough to be appli-
cable to a wide set of artifacts, not only programming
code. We next show how the proposal can be applied
to the ETL process context.
The framework defined by Briand et al. identi-
fies the following families of measures, which also
can be called quality dimensions (each family allows
creating a collection of associated measures that ‘op-
erationalize’ the concept): (a) Size: reflects the num-
ber of elements in a system; (b) Coupling: measures
the strength of the connections between system el-
ements; (c) Cohesion: characterizes the interaction
between elements in sub-systems or modules. It as-
sesses the tightness with which related program fea-
tures are grouped together in sub-systems or mod-
ules. A highly cohesive system has few interactions
between its elements; (d) Complexity: in general it
is used for assessing complex system behavior. It in-
forms about the effort needed to maintain, change and
understand a system. Complexity is a system property
that depends on the relationships between elements,
rather than a property of an isolated element.
Fig. 4a depicts the modular system decomposition
borrowed from (L. Briand and Basili, 1996), and its
correspondence with the DPG of Definition 1. The
modular system includes modules that contain ele-
ments. A system is a tuple < N, E > where N is the
set of elements of the system, and E is the set of edges
between the elements. A module m is a subset of the
elements of the system. In general, modules can over-
lap. When the modules partition the nodes in a sys-
tem, then this system is called modular. The DPG
corresponds to a modular system where nodes corre-
spond to modules and actions to elements.
Figure 4: (a) The structure of a system in (L. Briand and Basili, 1996); (b) Its mapping to the DPG.
By definition, size and complexity measures are
computed over the whole modular system, while the
other measures are computed over modules compos-
ing this modular system. Following (L. Briand and
Basili, 1996), the measures belonging to a family
must verify a set of properties (a toy check is sketched
after this list), e.g.:
Non-negativity: the measure can never be nega-
tive. (Applies to Size, Coupling and Complexity).
Null value: the measure must be null when a sys-
tem does not contain any element. (Applies to the
measures in the four families).
Additivity: when several modules do not have el-
ements in common, the size of the system is the
size of its modules. (Applies to Size).
Non-negativity and normalization: the measure is
independent of the size of the system or module,
and belongs to a certain interval. (Applies to Co-
hesion).
Monotonicity: states that adding internal relation-
ships in modules does not decrease a measure’s
value (however, adding an edge between modules
decreases the measure). (Applies to Coupling and
Cohesion).
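As an illustration of how such properties can be checked mechanically, the small Python snippet below verifies the additivity property for a simple size measure (the number of elements of a given kind) on a toy example; the encoding of modules as plain lists of elements is our own simplification.

# Toy check of the Size additivity property, using two disjoint modules
# that partition a tiny system.
def size(elements, kind):
    """A simple size measure: number of elements of a given kind."""
    return sum(1 for e in elements if e == kind)

m1 = ["filter", "join"]      # first module
m2 = ["data input"]          # second, disjoint module
system = m1 + m2             # the modules partition the system

# Additivity: the size of the system equals the sum of the sizes of its
# disjoint modules.
assert size(system, "filter") == size(m1, "filter") + size(m2, "filter")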
From Definition 1 we can devise the node clas-
sification hierarchy of Fig. 4b. At the bottom of
this classification, we identify seven node categories.
Actually, the rationale behind these seven size mea-
sure categories is that we conjecture that these cate-
gories impact on efficiency (we validate this in Sec-
tion 6). Moreover, Fig. 4b shows the relationship be-
tween the graph definition and the measure families
in (L. Briand and Basili, 1996). For example, the Ac-
tion category is related to the Cohesion family. The
other node categories are linked to the Size family.
In our approach, we consider a DPG as a modu-
lar system, where a node is a module, an edge is the
inter-module relationship, and actions are the module
elements. Hence, size measures are defined over the
whole graph, and cohesion measures are defined over
its modules. The cohesion measure is aimed at reflect-
ing the intra-module interactions between elements
within modules (actions). No coupling and complex-
ity measures are defined because coupling is an inter-
module relationship, and there is no inter-module in-
teraction among their elements (actions). Complex-
ity is linked to the inter-module relationships, and be-
cause of the DPG structure, the size measures also
inform about the complexity.
5 ETL DESIGN QUALITY
MEASURES
Efficiency is an ISO 9126 quality dimension that eval-
uates the capability of a model to provide appropri-
ate performance relative to the amount of resources
used, under stated conditions. In this section we first
present efficiency (external) measures from the ISO
9126 (Becker, 2008), typically used to compute the
ETL process performance. Then, based on the mea-
sure families studied in Section 4, we propose a set of
structural (internal) measures, particularly describing
the graph node combination, to be applied at the de-
sign level. These internal measures are likely to im-
pact the ETL process efficiency.
5.1 External Measures
The ISO 9126 quality dimensions include function-
ality, reliability, usability, efficiency, maintainability,
and portability. The set of measures proposed by the
ISO standard to evaluate ETL execution efficiency is
depicted in Table 1. For example, Execution Time
(ET) is the server time required to complete the exe-
cution of an ETL process. A more significant measure
to evaluate ETL efficiency is Throughput (Th), which
takes into account the size of the data processed by the
ETL system. The throughput is computed as the num-
ber of rows processed per unit of time. Additional
performance measures address resource usage. For
example, the Disk I/O computes the number of disk
read/write actions performed during the ETL execu-
tion. Note that the evaluation of the external measures
requires a complete implementation and execution of
the ETL process.
Table 1: External measures.
Measure  Name              Description
ET       Execution Time    Time the server takes to complete the execution
NL       Network Latency   Time the network takes to transfer data from the data source/target to the data staging
Th       Throughput        Size of data served by the ETL model per unit of time
DIO      Disk I/O          Number of disk reads/writes
Me       Memory            Amount of memory used
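For instance, the Throughput measure used later in Section 6 is simply the amount of data processed divided by the elapsed execution time; a trivial sketch is given below, where the data size and elapsed time are assumed to be extracted from the ETL server's execution log (the function name and units are our own choices).

def throughput_kb_per_ms(data_size_kb: float, elapsed_ms: float) -> float:
    """Th: size of data served by the ETL model per unit of time (Kb/ms)."""
    return data_size_kb / elapsed_ms

# Hypothetical example: a 65 Mb (66560 Kb) load completing in 2 minutes.
print(throughput_kb_per_ms(66560, 120000))  # ~0.55 Kb/ms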
5.2 Internal Measures
We next propose a collection of internal measures,
formalize them, and show that they satisfy the proper-
ties defined in (L. Briand and Basili, 1996). We base
the definition of the internal measures on the formal
definition of a DPG (Definition 1). To represent the
framework in (L. Briand and Basili, 1996), we call
G = (N, E) a graph, and m = (N_1, E_1) a module in G,
such that N_1 is a set of nodes linked by the edges in
E_1, with N_1 ⊆ N and E_1 ⊆ E. A module can include
only one node.
Size Family. It appears straightforward to assume
that the larger the graph size, the lower the efficiency
we can expect from an ETL process, and that each ad-
ditional node will increase the execution time and/or
resource usage. Importantly, the data processing of a
decomposed ETL graph is delayed by the data trans-
fer time between nodes. In addition, a decomposed
graph has a latency time due to different transforma-
tions (instead of a single combined transformation)
applied over each row. As mentioned in Definition
1, a node belongs to a stream category according to
the way it is processed (row-by-row, row-set or input-
output). A higher latency is hence expected for de-
composed row-set nodes. For this reason, we define a
size measure for each node category since we expect
that each has a specific influence on the efficiency. We
next define the measures in the Size family.
Definition 3 (Branching Nodes). Given a graph G,
the Branching Nodes measure of G, BN(G), is the
number of branching nodes in G. It is computed as
the cardinality of the nodes in the “branching” cate-
gory:
BN(G) = card({n ∈ N | type(n) ∈ {“filter”}})
Definition 4 (Joining Nodes). Given a graph G, the
Joining Nodes measure of G, JN(G), is the number of
nodes in the “joining” category:
JN(G) = card({n ∈ N | type(n) ∈ {“join”, “union”}})
Definition 5 (Lookup Nodes). Given a graph G, the
Lookup Nodes measure of G, LN(G), is the number of
nodes in G belonging to the “lookup” category and
which perform only one action (it does not include
scripts):
LN(G) = card({n ∈ N | type(n) ∈ {“field lookup”} ∧ card(actions(n)) = 1})
Definition 6 (Script Nodes). Given a graph G, the
Script Nodes measure of G, SN(G), is the number of
script nodes in G, computed as the cardinality of the
nodes having a number of actions greater than or
equal to 2:
SN(G) = card({n ∈ N | card(actions(n)) ≥ 2})
Example 1. For G = DimGeography in Fig. 3, we
have:
BN(G) = card({n_2}) = 1.
JN(G) = card({n_3, n_11}) = 2.
LN(G) = card({n_4, n_5, n_6, n_7, n_9}) = 5.
SN(G) = card({n_1, n_6, n_7}) = 3.
Finally, we define three measures on the stream
category. The most resource-consuming stream type
is the row-set type because it implies a blocking strat-
egy that delays the execution, in particular when deal-
ing with large data volumes. The row-by-row type
is less time-consuming. The input-output type is in
charge of importing and exporting data from and to
databases.
Definition 7 (RS, RbR, and IO Nodes). Given a
graph G, the Row-Set (RS) Nodes measure of G,
RSN(G), is the number of row-set nodes in G:
RSN(G) = card({n ∈ N | type(n) ∈ {“sort”, “pivot”, “aggregation”}})
The Row-by-Row (RbR) Nodes measure of G,
RbRN(G), is the number of row-by-row nodes in G:
RbRN(G) = card({n ∈ N | type(n) ∈ {“field derivation”, “field generation”}})
The Input-Output (IO) Nodes measure of G,
ION(G), is the number of data input and data output
nodes in G:
ION(G) = card({n ∈ N | type(n) ∈ {“data input”, “data output”} ∧ card(actions(n)) = 1})
Example 2. For G = DimGeography graph of Fig. 3,
we have: RSN(G) = 0 (G does not contain any row-set
node), RbRN(G) = 0 (G does not contain any row-by-
row node), and ION(G) = card({n_1, n_8, n_10, n_12}) = 4.
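A minimal Python sketch of how the size measures of Definitions 3 to 7 can be computed is given below, assuming each node exposes its type and its set of actions. The small node list is purely illustrative and is not meant to reproduce the DimGeography graph of Fig. 3.

# Each node is (name, type, actions); the node list is purely illustrative.
nodes = [
    ("n1", "data input",   {"extraction", "join"}),   # input with embedded script
    ("n2", "filter",       {"branching"}),
    ("n3", "field lookup", {"lookup"}),
    ("n4", "aggregation",  {"field manipulation"}),
    ("n5", "union",        {"join"}),
    ("n6", "data output",  {"load"}),
]

def count(pred):
    """Cardinality of the nodes satisfying a predicate on (type, actions)."""
    return sum(1 for _, t, a in nodes if pred(t, a))

measures = {
    "BN":   count(lambda t, a: t == "filter"),
    "JN":   count(lambda t, a: t in ("join", "union")),
    "LN":   count(lambda t, a: t == "field lookup" and len(a) == 1),
    "SN":   count(lambda t, a: len(a) >= 2),
    "RSN":  count(lambda t, a: t in ("sort", "pivot", "aggregation")),
    "RbRN": count(lambda t, a: t in ("field derivation", "field generation")),
    "ION":  count(lambda t, a: t in ("data input", "data output") and len(a) == 1),
}
print(measures)  # {'BN': 1, 'JN': 1, 'LN': 1, 'SN': 1, 'RSN': 1, 'RbRN': 0, 'ION': 1}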
The mathematical properties that must character-
ize any size measure are verified by the proposed ones
as shown next (proofs omitted).
1. Non-negativity. Meaning that for any graph G,
BN(G) ≥ 0, JN(G) ≥ 0, LN(G) ≥ 0, SN(G) ≥ 0,
RSN(G) ≥ 0, RbRN(G) ≥ 0, and ION(G) ≥ 0;
2. Null value, implying that if G is empty (N = ∅)
then BN(G) = 0, JN(G) = 0, LN(G) = 0, SN(G) = 0,
RSN(G) = 0, RbRN(G) = 0, and ION(G) = 0;
3. Module additivity, having m_1 = <N_m1, E_m1> and
m_2 = <N_m2, E_m2>, it holds that (m_1 ⊆ G and
m_2 ⊆ G and N = N_m1 ∪ N_m2 and N_m1 ∩ N_m2 = ∅) ⇒
BN(G) = BN(m_1) + BN(m_2), JN(G) = JN(m_1) +
JN(m_2), ...
It is worth mentioning that in our graph system,
a module is a unitary node sub-system, therefore
N_m1 contains only one node.
Size measures inform about the high-level orga-
nization of the work in the graph. We need a com-
plementary characteristic in order to quantify the dis-
tribution of the work through the graph nodes. We
define it next.
Cohesion Family. Cohesion is related to the amount
of work processed by the graph nodes. Depending
on the scripting category, some nodes may perform
more actions than other ones. In particular, script,
field lookup, and data input nodes can perform a large,
unpredictable, amount of actions. The Cohesion Ac-
tion measure, defined next, is aimed at reflecting the
amount of work carried out by such nodes. We ex-
pect that nodes with a low cohesion (i.e., perform-
ing more actions) consume more time and resources
than nodes with a high cohesion. Note that this mea-
sure is defined for each system module or node.
Definition 8 (Cohesion Action). Given a module m =
<N_m, E_m> from a graph G = (N, E), the Cohesion
Action measure of m, CA(m), is defined as:
CA(m) = LCAN(m) / CAC(G)
where CAC(G) = Σ_{n ∈ N} card(actions(n)) and
LCAN(m) = Σ_{n ∈ N_m, card(actions(n)) ≥ 2} card(actions(n)).
CA(m) is the proportion of the graph's work carried
out by the low cohesion nodes of m. It determines the
participation of each node in performing the work.
CAC(G) is the number of all actions performed by the
nodes in G. It estimates the work produced by the
whole graph. LCAN(m) is the cardinality of the actions
of the low cohesion nodes in m.
Example 3 (Cohesion Action). For our example
graph, n_1, n_6, n_7 are low cohesion nodes. Node n_1
is a “data input” node using a script which applies
both extraction and join actions (annotated e, j as
specified by the set of actions A in Definition 1). Also,
n_6 and n_7 are “field lookup” nodes containing a join
script in their lookup condition (annotated lk, j).
CAC(G) = card({e, j}) + 4 × card({lk}) +
2 × card({lk, j}) + 3 × card({l}) + 2 × card({j}) = 15;
LCAN({n_1}) = card({e, j}) = 2;
LCAN({n_6}) = card({lk, j}) = 2;
LCAN({n_7}) = card({lk, j}) = 2;
CA({n_1}) = CA({n_6}) = CA({n_7}) = 2/15 = 0.13.
The mathematical properties that must be satisfied
by any cohesion measure are verified by CA; the
proofs are omitted due to space limitations.
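To illustrate Definition 8, a small Python sketch is given below; the per-node action annotations are our own reconstruction of Fig. 3 in the spirit of Example 3 (the single-letter labels e, j, lk, b, l are just mnemonics), and the resulting value matches the 2/15 ≈ 0.13 of the example.

# Actions per node for a DimGeography-like graph (our reconstruction):
# n1 carries an extraction + join script (e, j); n2 is the filter (b);
# n3 and n11 are unions (j); n4, n5, n9 are plain lookups (lk); n6 and n7
# are lookups embedding a join (lk, j); n8, n10, n12 load data (l).
actions = {
    "n1": {"e", "j"},   "n2": {"b"},         "n3": {"j"},
    "n4": {"lk"},       "n5": {"lk"},        "n6": {"lk", "j"},
    "n7": {"lk", "j"},  "n8": {"l"},         "n9": {"lk"},
    "n10": {"l"},       "n11": {"j"},        "n12": {"l"},
}

def cac(actions):
    """CAC(G): total number of actions performed by the whole graph."""
    return sum(len(a) for a in actions.values())

def lcan(module_nodes, actions):
    """LCAN(m): actions of the low cohesion nodes (>= 2 actions) in m."""
    return sum(len(actions[n]) for n in module_nodes if len(actions[n]) >= 2)

def ca(module_nodes, actions):
    """CA(m) = LCAN(m) / CAC(G), as in Definition 8."""
    return lcan(module_nodes, actions) / cac(actions)

print(cac(actions))                   # 15 actions in total
print(round(ca({"n1"}, actions), 2))  # 0.13, i.e., 2/15 as in Example 3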
6 EXPERIMENTAL VALIDATION
In this section, we describe a set of experiments aimed
at providing a preliminary empirical validation of the
relationship between the proposed internal measures,
and the external measure Throughput (Th). We chose
Th as the external measure since it can be computed
using the execution elapsed time extracted from the
system’s log file, allowing obtaining meaningful re-
sults. Our hypotheses are:
H1: The number of join, lookup, script, row-set,
and input-output nodes are correlated with Th;
H2: The number of row-by-row nodes has low im-
pact on the Th;
H3: Adding branching nodes does not decrease
Th;
H4: The best script node form is the data input
node using a script. This is better than using a
specific script node.
H5: For optimization purposes, replacing join
nodes with a data input script node increases Th.
On the contrary, replacing lookup nodes with data
input script nodes is not beneficial for throughput.
The validation of these hypotheses will guide the
ETL designer in answering fundamental questions
such as: which construct should I use to optimally add
a specific ETL transformation? What are the design
criteria that are increasing or decreasing the through-
put? Given two graphs, can we anticipate on which
one will have the best throughput?
6.1 Experimental Setting
We ran our experiments using Microsoft’s SQL Server
Integration Services (SSIS, http://msdn.microsoft.com/en-us/library/ms141026.aspx/),
on a computer equipped with an Intel Dual-Core pro-
cessor at 2.10 GHz and 4 GB of RAM, running the
Windows 7 OS.
The internal measures to be evaluated may influ-
ence each other. At this point we are interested
in finding out the relation between a single variable
(measure) and Th. Thus, we performed a set of con-
trolled experiments, in order to detect the individual
impact over Th of each internal design factor. We cre-
ated 54 graphs in the following way: we defined an
initial ETL model using BPMN4ETL, and translated
the graph to a data flow SSIS component. To analyze
each measure we modified the initial graph by modi-
fying the combination of nodes (ETL tasks). For ex-
ample, to study the influence of the number of branch-
ing nodes (measure BN), we started with a version
of the graph having BN = 1, and then we increased
BN by adding neutral branching nodes to obtain seven
graphs with BN = 1, 3, 5, 8, 10, 15, and 20. We
performed the same with join nodes, lookup nodes,
script nodes, row-by-row nodes, row-set nodes, input-
output nodes. To assess cohesion action we created 7
graphs with different number of actions in data input
and script nodes. All these yielded 54 graphs to per-
form, basically, the same job. We measured Th for
each graph execution. In addition, and to take into
account the data size, we ran each graph for differ-
ent data sizes: the initial data source size is 65Kb,
including 290 rows. We multiplied it by 100, 500,
and 1000, producing, respectively, a data input vary-
ing from 65Kb to 65Mb. In the figures of this section,
Th is expressed in kilobytes per millisecond (Kb/ms).
6.2 Discussion of Results
We now analyze the results from our experiments with
respect to the hypotheses H1 through H5. First, we
studied the correlation between each measure and Th,
for different data source sizes (1x, 100x,..). The re-
sults are depicted in Table 2, where measures are in-
dicated as BN, JN, etc. Table 2 shows that all the mea-
sures are strongly correlated to the throughput, except
for BN, JN and CA-S. For BN, this implies that an
additional branching node has a low impact on the
throughput. For the JN and CA-S measures, the very
weak throughput results reflect the low performance
of join nodes in the SSIS tool. Thus, further exper-
iments are required over different tools. All other
measures are negatively correlated to the throughput,
confirming that adding nodes decreases Th. There-
fore, as a first result, these strong correlations vali-
date our measures and confirm their influence over
the efficiency goal represented in this study by the
throughput external measure. For the CA measure we
performed two kinds of experiments: (a) adding node
actions to the data input (DI) script, yielding the CA
measure (denoted CA-DI) computed on a DI script
node; (b) adding the specific script (S) node action,
yielding the CA measure (denoted CA-S) computed
on a specific script node. The correlations for the
(CA-DI) measure were good, although no correlations
could be computed for (CA-S) because of the very
low throughput.
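For completeness, the correlations of Table 2 are standard Pearson correlation coefficients between the values of an internal measure across the graph variants and the observed throughput. A minimal sketch of this computation is shown below; the data points are invented for illustration only and are not taken from our experiments.

from statistics import correlation  # Pearson's r (Python 3.10+)

# Hypothetical data: throughput (Kb/ms) observed for graph variants with an
# increasing number of lookup nodes (values are illustrative only).
ln_values  = [1, 3, 5, 8, 10, 15, 20]
throughput = [0.95, 0.80, 0.66, 0.51, 0.43, 0.30, 0.22]

print(round(correlation(ln_values, throughput), 2))  # about -0.96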
The second analysis, shown in Fig. 5a, captures
(for the largest dataset) the throughput variation for
each measure. On the X axis we represent the number
of nodes of each type, and on the Y axis, the through-
put. Fig. 5a shows that adding all kinds of nodes
(except for branching nodes) strongly decreases Th.
Adding branching nodes does not reduce throughput,
confirming results shown in Table 2. However, adding
BNs could be beneficial for other external measures
(e.g., related with parallel execution). These results
were the same for all dataset sizes. Thus, the results
above confirm hypotheses H1 through H3.
We now analyze hypotheses H4 and H5. Fig. 5b
shows (also for the largest dataset) that script nodes
(CA-DI) can be used (whenever possible) instead of
join ones since the latter deliver worse throughput.
Analogously, we can join several input nodes in a sin-
gle script node to increase Th. Fig. 5b also shows
that lookup nodes are better than script ones. This can
be explained by the optimization capabilities provided
by the ETL server in managing such type of nodes. In
summary, our results suggest that ETL throughput can
be enhanced if at the design stage we choose more
efficient kinds of nodes. Note, however, that this is
not always possible (e.g., it is not reasonable to group
several row-set nodes in the same script). Again, re-
sults were the same for all dataset sizes. Therefore,
hypotheses H4 and H5 are confirmed by our experi-
ments.
7 CONCLUSION
We have formally presented, and empirically vali-
dated, a collection of measures that, computed over
alternative ETL process graphs (i.e., representing dif-
ferent combinations of tasks), allow predicting how
they will perform with respect to each other before
writing even a single line of programming code.
Table 2: Correlations.
Size    BN     JN     LN     SN     RSN    RbRN   ION    CA-DI  CA-S
65Kb   -0.82   -     -0.90  -0.93  -0.81  -0.94  -0.88  -0.97   -
6.5Mb  -0.69   -     -0.88  -0.74  -0.72  -0.93  -0.81  -0.97   -
32.5Mb  0.13   -     -0.82  -0.62  -0.81  -0.97  -0.93  -0.96   -
65Mb    0.64  -0.51  -0.80  -0.69  -0.78  -0.99  -0.93  -1.00   -
Figure 5: (a) Throughput vs. # of nodes; (b) Lookup nodes vs. join nodes vs. script nodes.
As
a consequence, our results allow us to draw a set of
guidelines to design efficient ETL workflows. As fu-
ture work, we will address the assessment of the over-
all impact on performance of the measures proposed
in this paper, since at this stage we focused on the
individual (partial) impact of each of these measures.
REFERENCES
A. Simitsis, K. W. and Dayal, U. (2013). Hybrid analytic
flows - the case for optimization. Fundamenta Infor-
maticae, pages 303–335.
A. Simitsis, K. Wilkinson, M. C. and Dayal, U. (2012). Op-
timizing analytic data flows for multiple execution en-
gines. In SIGMOD’12, 12th International Conference
on Management of Data. ACM Press.
A. Simitsis, P. V. and Sellis, T. State-space optimization
of ETL workflows. IEEE Trans. Knowl. Data Eng.,
17(10).
A. Simitsis, P. V. and Sellis, T. (2005). Optimizing ETL pro-
cesses in data warehouse environments. In ICDE’21,
21st International Conference on Data Engineering.
IEEE Computer Society Press.
Ali, S. and Wrembel, R. (2017). From conceptual design
to performance optimization of etl workflows: current
state of research and open problems. The VLDB Jour-
nal, 26(6):777–801.
B. Boehm, J. Brown, H. K. M. L. G. M. and Merritt, M.
(1978). Characteristics of Software Quality (TRW se-
ries of software technology). Elsevier.
Becker, S. (2008). Performance-related metrics in the ISO
9126 standard. In Eusgeld, I., Freiling, F., and Reuss-
ner, R., editors, Dependability Metrics, pages 204–206.
Springer, Berlin, Heidelberg.
El Akkaoui, Z. and Zimányi, E. (2012). Defining ETL
workflows using BPMN and BPEL. In DOLAP’09,
9th International Workshop on Data Warehousing and
OLAP. ACM Press.
G. Kougka, A. G. and Simitsis, A. (2018). The many faces
of data-centric workflow optimization: a survey. In-
ternational Journal of Data Science and Analytics,
6(2):81–107.
G. Papastefanatos, P. Vassiliadis, A. S. and Vassiliou, Y.
(2009). Policy-regulated management of ETL evolu-
tion. Journal Data Semantics, 5530:146–176.
Inmon, W. (2002). Building the Data Warehouse. Wiley.
Kimball, R. and Ross, M. (2002). The Data Warehouse
Toolkit, 2nd. Ed. Wiley.
L. Briand, S. M. and Basili, V. (1996). Property-based soft-
ware engineering measurement. IEEE Transactions
on Software Engineering, 22(2):68–86.
L. Muñoz, J. M. and Trujillo, J. (2010). A family of exper-
iments to validate measures for UML activity. Infor-
mation and Software Technology, 52(11):1188–1203.
P. Vassiliadis, A. Simitsis, P. G. M. T. and Skiadopoulos, S.
A generic and customizable framework for the design
of ETL scenarios. Information Systems, 30(7).
T. Majchrzak, T. J. and Kuchen, H. (2011). Efficiency eval-
uation of open source ETL tools. In SAC’11, 11th
Proceedings of the ACM Symposium on Applied Com-
puting. ACM Press.
Trujillo, J. and Luján-Mora, S. (2003). A UML-based
approach for modeling ETL processes in data ware-
houses. In ER’22, 22nd International Conference on
Conceptual Modeling. Springer.
U. Dayal, M. Castellanos, A. S. and Wilkinson, K. (2009).
Data integration flows for business intelligence. In
EDBT’09, 9th International Conference on Extending
Database Technology. ACM Press.
Z. El Akkaoui, J. Mazón, A. V. and Zimányi, E. (2012).
BPMN-based conceptual modeling of ETL processes.
In DAWAK’12, 12th International Conference on Data
Warehousing and Knowledge Discovery. Springer.
Z. El Akkaoui, E. Zimányi, J. M. A. V. and Trujillo, J.
(2013). A BPMN-based design and maintenance frame-
work for ETL processes. International Journal of
Data Warehousing and Mining (IJDWM), 9(3):46–72.