of an ETL process that loads a collection of dimension
tables containing customer and location data. Fig. 2b
shows dimension tables DimArea, DimCountry, DimState, and DimGeography. The first three are populated using an XML file denoted Territories.xml. DimGeography is populated using geographical data with attributes City, State, ZipCode, and Country, present in the Customer and Supplier tables of the Northwind database² (Fig. 2a). Before populating the
Location hierarchy, the geography data needs some cleansing and transformation operations to fit the data warehouse requirements, namely: (a) Data completion, which requires dealing with null values. For example, the attribute State may be null in the Customer and Supplier source tables. The ETL process fixes this by using an external source file Cities.txt, which contains three fields: city, state, and country. (b) Data consolidation. For example, in the source databases, attribute State contains either a state name (e.g., California) or a state code (e.g., CA). In the latter case, the state code is either left empty or converted into the state name using the State table with attributes StateId, StateName, and Code (the ISO standard code), which contains the link between the state name and its code. (c) Consistency, in particular with respect to referential integrity constraints. During the loading of the data into the data warehouse, referential integrity between all the hierarchy tables must be ensured.
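Cleansing steps (a) and (b) can be sketched as follows. This is a minimal illustration, not the paper's actual ETL implementation: the function name `cleanse` and all data values are hypothetical, and the external sources Cities.txt and the State table are assumed to have been loaded into in-memory dictionaries.

```python
# (a) Data completion: (city, country) -> state, built from Cities.txt.
# The entries below are illustrative placeholders.
cities = {("Portland", "USA"): "Oregon"}

# (b) Data consolidation: ISO state code -> state name, from the State table.
state_codes = {"CA": "California", "OR": "Oregon"}

def cleanse(row):
    """Fill a null State (completion) and normalize state codes (consolidation)."""
    city, state, country = row["City"], row["State"], row["Country"]
    if state is None:                 # (a) completion via the Cities.txt lookup
        state = cities.get((city, country))
    elif state in state_codes:        # (b) consolidation via the State table
        state = state_codes[state]
    return {**row, "State": state}

row = {"City": "Portland", "State": None, "ZipCode": "97201", "Country": "USA"}
print(cleanse(row)["State"])  # -> Oregon
```

Step (c), referential integrity between the hierarchy tables, is enforced at load time by the target database rather than in such a transformation function.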
To model complex processes, several concep-
tual modeling tools (mentioned in Sections 1 and 2)
have been proposed in the literature. In this paper
we use an ETL model based on BPMN4ETL (El Akkaoui and Zimányi, 2012), which models ETL processes as workflows, extending the BPMN notation.
Fig. 3 shows the loading process for the DimGeog-
raphy dimension of our running example, modeled
in BPMN4ETL. In more detail, an ETL model in BPMN4ETL is composed of a control process containing several data processes. A control
process (top process of Fig. 3) manages the coarse-
grained groups of tasks and/or sub-processes, while a
data process (bottom processes of Fig. 3) operates at
finer granularity, detailing how input data are trans-
formed and output data are produced. For exam-
ple, populating each fact (dimension, view, tempo-
rary, etc.) table in the data warehouse constitutes a
data process, whereas sequencing the load of differ-
ent dimension levels constitutes a control process.
² http://www.microsoft.com/en-us/download/details.aspx?id=23654
4 DATA AND QUALITY MODELS
4.1 ETL Data Process Graph
We now present the ETL model we use in this paper. Due to space limitations we restrict ourselves to the data process perspective, which constitutes the portion of the ETL process where transformations occur, and we do not deal with the control process.
Let T be a set of node types, T = {"data input", "data output", "filter", "field lookup", "field derivation", "field generation", "join", "union", "aggregation", "sort", "pivot", "script"}. Let A be a set of possible actions performed by a node, A = {"field manipulation", "field generation", "join", "lookup", "branching", "extraction", "load"}, where the "field manipulation" action covers field computation, deletion, addition, sorting, pivoting, and splitting. In addition, a node has a traversing stream.
The input stream of a node has schema:
((field_1, field_1.datatype), ..., (field_i, field_i.datatype))
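The node vocabulary just defined can be sketched as plain Python data. Only the contents of T and A and the (field, datatype) schema shape come from the text; the class, attribute, and instance names below are illustrative assumptions.

```python
from dataclasses import dataclass

NODE_TYPES = {  # the set T of node types
    "data input", "data output", "filter", "field lookup",
    "field derivation", "field generation", "join", "union",
    "aggregation", "sort", "pivot", "script",
}

ACTIONS = {  # the set A of possible node actions
    "field manipulation", "field generation", "join",
    "lookup", "branching", "extraction", "load",
}

@dataclass
class Node:
    name: str
    ntype: str                     # an element of T
    actions: set[str]              # a subset of A
    schema: list[tuple[str, str]]  # ((field_1, datatype), ..., (field_i, datatype))

# A hypothetical lookup node resolving State values, as in the running example.
lookup_state = Node(
    name="LookupState",
    ntype="field lookup",
    actions={"lookup"},
    schema=[("City", "varchar"), ("State", "varchar"), ("Country", "varchar")],
)
assert lookup_state.ntype in NODE_TYPES and lookup_state.actions <= ACTIONS
```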
Definition 1 (Data Process Graph (DPG)). A data process graph is a directed graph G(N, E), where N = {n_1, n_2, ..., n_p} is the set of nodes representing the data tasks, and E = {e_1, e_2, ..., e_q} is the set of edges between nodes. An edge e = (a, b) ∈ E means that node b receives data from node a for further processing. In addition, the following holds:
• There is a function with signature N → T that maps a node n to its type.
• Each node has an associated set of actions, which is a subset of A. There is a relation actions ⊆ N × A reflecting this association.
• A relation schema ⊆ N × S defines the input fields of a node.
• A node belongs to a flow category. It is either a branching node ("filter"), a join node ("join"), a union node ("union"), a lookup node ("field lookup"), or a unitary node, for the other node types.
• A node belongs to a script category. It is either a script node ("script") or a non-script node. In addition, "data input" and "field lookup" nodes can also be considered script nodes when they include a data extraction script; otherwise, they are considered non-script nodes.
• A node belongs to a stream category accord-
ing to the applied treatment on its traversing
data. It is either a row-by-row node (“field
derivation”, “field generation”), a row-set node
(“sort”, “pivot”, “aggregation”), or an input-
output node (“data input”, “data output”). Row-
by-row nodes are asynchronous (each row is pro-
A Quality-based ETL Design Evaluation Framework
251