Task Clustering on ETL Systems
A Pattern-Oriented Approach
Bruno Oliveira and Orlando Belo
ALGORITMI R&D Centre, University of Minho, Campus de Gualtar, Braga, Portugal
Keywords: Data Warehousing Systems, ETL Conceptual Modelling, Task Clustering, ETL Patterns, ETL Skeletons,
BPMN Specification Models, and Kettle.
Abstract: Usually, data warehousing populating processes are data-oriented workflows composed of dozens of
granular tasks that are responsible for the integration of data coming from different data sources. Specific
subsets of these tasks can be grouped into a collection, together with their relationships, in order to form higher-
level constructs. Increasing task granularity allows for the generalization of processes, simplifying their
views and providing methods to carry expertise over to new applications. Well-proven practices can be used
to describe general solutions that use basic skeletons configured and instantiated according to a set of
specific integration requirements. Patterns can be applied to ETL processes not only to simplify their
conceptual representation but also to reduce the gap that often exists between the two design
perspectives. In this paper, we demonstrate the feasibility and effectiveness of an ETL pattern-based
approach using task clustering, analyzing a real-world ETL scenario through the definition of two
commonly used clusters of tasks: a data lookup cluster and a data conciliation and integration cluster.
1 INTRODUCTION
Extract, Transform, and Load (ETL) development is
considered a very time-consuming, error-prone and
complex task, involving several stakeholders from
different knowledge domains. The amount of data
extracted from (heterogeneous) data sources and the
complexity associated with data transformation and
cleaning tasks represent a significant effort in any
data warehousing project (Kimball and Caserta,
2004). Thus, commercial tools with the ability to
support ETL development and implementation have a
crucial impact on the development of such processes,
enabling the generation of very detailed and
documented models. However, such tools provide
very specific notations that were developed to support
very specific architecture choices. So, they tend to
produce very large populating processes, often
difficult to understand, with granular activities
associated with specific modelling rules. Even when
supporting composite constructs that represent
commonly used routines, they are still very platform-
specific, since their implementation is not based on
solid formalisms such as Relational Algebra operators
(Santos and Belo, 2013). Considering this, it is
clear that such tools cannot be used for
conceptual representation, since communication with
non-technical users (e.g. business users and database
administrators) would be compromised. In the
development process of an ETL system, we need to
consider all the aspects related to the development phases
as well as the evolution the system may undergo over
time. Both are critical success factors.
At early development phases, business
processes, and consequently business users, play
an important role because they really know the business
from the inside, providing requirements validation that
is crucial to the feasibility of a decision support
system. Several works addressed this important issue,
proposing the use of well-known modelling languages
such as the Business Process Model and Notation
(BPMN) (OMG, 2011) to support ETL conceptual
modelling (Akkaoui and Zimanyi, 2009; Akkaoui et
al., 2011). BPMN is widely used to describe and
implement common business processes. Choosing
BPMN for conceptual modelling provides several
advantages. The BPMN notation offers very expressive
constructs, allowing for the representation of data
flows in a large variety of ways, capturing both data
and control flow constraints. However, such
expressiveness represents a delicate problem,
especially when an ETL conceptual model needs to
be translated and interpreted by a machine. The
research community has addressed these problems over
the years, specifically regarding the mapping of
BPMN processes to the BPEL language
(Ouyang et al., 2007). The detail and scope disparities
between more abstract and more concrete models represent
a huge distance between the two representations, simply
because they serve different purposes. In order to
simplify ETL development, we propose the
application of a task-clustering technique to group a
set of finer-grained tasks into a collection of tasks, flows
and control primitives, providing a method to
organize them using layers of abstraction and
supporting different levels of detail to serve the several
stakeholders involved in different project phases. The cluster
categorization provides a way to identify and classify
patterns that can be instantiated and reused in
different ETL processes. Each pattern is scope
independent and provides a specific skeleton that not
only specifies its internal behaviour but also
enables communication and data interchange with
other patterns in a workflow context.
In this paper, we demonstrate the feasibility and
effectiveness of the approach we developed and
followed, analysing a real-world ETL scenario and
identifying common task clusters. In section 2 we
discuss their generalization and use as general
constructs in an ETL package. Then, we demonstrate
how activities can be grouped to build ETL
conceptual models at different abstraction layers
(section 3), presenting two ETL skeletons (data
lookup, and data conciliation and integration) and
exposing their configuration and behaviour in a way
that they can be executed in a target ETL tool (section
4). Next, some related work is presented (section 5).
Finally, in section 6, we evaluate the work done,
pointing out some research guidelines for future
work.
2 ETL TASK CLUSTERING
On-line Transaction Processing (OLTP) systems are
responsible for recording all business transactions
performed in enterprise operational systems. They
are built to support specific business areas, storing
operational data from daily business operations.
Therefore, they are the main data sources used by
data warehousing systems. In more complex cases,
data are distributed following the distinct business
branches of a company. These data can be stored in
sophisticated relational databases or in simpler
data structures (e.g. text or spreadsheet files). Due
to this variety of information sources, problems on
populating a data warehouse often occur (Rahm and
Do, 2000).
Generally, tasks used in the extraction step (E)
are responsible for gathering data from the data sources and
putting it into a data staging area (DSA). The DSA is a
working area where data is prepared before going to
the data warehouse. For that, the DSA provides the
necessary metadata to support the entire ETL process,
providing, for example, support for data correction
and data recovery mechanisms using domain-oriented
dictionaries, mapping mechanisms or quarantine tables.
Transformation and cleaning (T) procedures are
applied after data extraction, using the data
that was already stored in temporary structures in the
DSA. After this second step, data is loaded (L) into a
target data warehouse, following schema rules, and
operational and business constraints. Essentially, an
ETL process is a data-driven workflow
representing a set of tasks and their associated control
flows and business rules that together express how
the system should be coordinated. Typically,
commercial tools use workflows for the representation
of very specific tasks that have to be grouped
together each time we want to represent the same
procedure.
To reduce the impact of such situations, we
propose an approach (also used in other application
scenarios (Singh et al., 2008)) that allows us to
organize ETL tasks into clusters and execute them as
a single block. However, we went a little further,
formalizing a set of “standard” clusters representing
some of the most common techniques used in real-
world ETL scenarios. Based on a set of input
parameters, the tasks that compose a cluster should
produce a specific output. In fact, we are creating
routines or software patterns that can be used with the
aim of simplifying ETL development, reducing
implementation errors and the time needed to implement
ETL processes. With the ETL patterns identified, we
propose a multi-layer approach to represent them,
using BPMN pools to represent the several layers of
abstraction, composed of several lanes representing
the tasks that should be applied to each group of
similar records to be processed.
To demonstrate the potential of our approach, we
present a common ETL scenario, representing an
ordinary data extraction process that uses two
different data sources, prepares and conforms the data,
and inserts it afterwards into a data
warehouse. The first source, a relational schema,
stores data about flights (dates and flight time), travel
locations (city and country) and planes (brand, model
and type, and manufacturer data). The spreadsheet
source represents a subset of the data structures that
can be found in the relational schema.
DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications
208
Figure 1: The “Flights” star schema.
This example was selected to illustrate all the steps
related to extracting, conciliating, and integrating data into a
data warehouse schema (Figure 1), revealing how we
used our ETL pattern-based approach, based on task
clustering, in a real-world ETL scenario. The data
warehouse model is represented using the notation of
(Golfarelli and Rizzi, 2009). It integrates four distinct
dimension tables (“Calendar”, “Planes”, “Flights”, and
“Locations”) and a single fact table (“FT-Flights”),
providing the necessary structures to analyze flight
data of a particular set of company branches over
time.
Once the data warehouse’s storage
structure is specified, it is necessary to define the ETL process,
adjusting it to the specificities of the system. An
incorrect or incomplete specification of the ETL
activities will seriously affect the quality of the data
warehouse. Even in such a simple example, the
populating process of the data warehouse has
dozens of tasks. Analysing each phase individually,
several clusters of tasks with a finer grain of detail can
be identified - see Figure 2, which represents a
minimal subset of the ETL workflow process
needed to populate the “Location” dimension,
implemented in Kettle (Pentaho, 2015). This process
represents the normalization of an attribute
(“Country”) that was collected from the spreadsheet data
source, and simply replaces the abbreviation values
(e.g. ‘PRT’, ‘ESP’, or ‘FRA’) used to identify
countries with the corresponding full names (Portugal,
Spain, or France). This transformation is done using a
dictionary table that holds all the correspondences
between country full names and the abbreviation
values of each country. This process
represents a simple lookup operation that can be
reused in many other similar situations. If we group
all the tasks involved into a data lookup cluster,
including a well-defined description of their behavior,
we have the ability to define a new ETL container, a
Data Lookup (DL) cluster or pattern.
Figure 2: Implementation of a normalization process.
Additionally, the semantics of the populating process
from Figure 2 is strongly related to the methodology
followed by the Kettle ETL implementation tool - for
example, before we can proceed to the Merge Join
task between the two data sets, we need to explicitly
sort each data set by the attributes that will participate
in the join condition. To generalize the behaviour
of this cluster, we only need to change its
configuration parameters. All the rest remains the same.
However, the output configuration metadata should
be provided, i.e. a compatible target repository and the
mappings to store execution results should be
indicated.
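To illustrate the idea, the following Python sketch shows how such a lookup cluster might be driven purely by configuration parameters; the function and parameter names, and the in-memory dictionary standing in for the dictionary table, are illustrative assumptions and do not reproduce the actual Kettle implementation of Figure 2.

    # A minimal sketch of a configurable data lookup cluster (assumed names).
    def data_lookup_cluster(records, config):
        """Replace abbreviated values by their full names using a dictionary table."""
        dictionary = config["dictionary"]          # e.g. {'PRT': 'Portugal', ...}
        attribute = config["lookup_attribute"]     # e.g. 'Country'
        quarantine = []                            # records without correspondence
        output = []
        for record in records:
            value = record.get(attribute)
            if value in dictionary:
                record[attribute] = dictionary[value]   # replace abbreviation by full name
                output.append(record)
            else:
                quarantine.append(record)               # keep for later analysis
        return output, quarantine

    # Reusing the same cluster only requires a different configuration:
    config = {"lookup_attribute": "Country",
              "dictionary": {"PRT": "Portugal", "ESP": "Spain", "FRA": "France"}}
    clean, suspect = data_lookup_cluster([{"City": "Braga", "Country": "PRT"}], config)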
Following the previous example, it is quite clear
that the physical approach does not help ETL
planning and development. The use of abstract
models can reduce the effects of unsuitable
planning, providing task encapsulation that hides
process complexity from end-users. Thus, users can
focus on non-technical details, helping in early ETL
development stages to reveal the most critical
business and operational requirements.
In this work we explored two important stages of
the development of an ETL system: the ETL
conceptual representation, using a set of abstract
constructs that contribute to improving process
interpretation, and the enrichment of conceptual
representations, giving them enough detail to
transform abstract constructs into a set of detailed tasks
that can be mapped and executed in a commercial
ETL implementation tool. ETL patterns establish a
bridge between those models. Besides, they provide
the necessary meaning to drill down more general
constructs and support their (semi-)direct execution. So,
we can generate executable models using general
templates describing how a problem must be solved
independently of the context in which they will be
applied. Besides their strong reuse capabilities, patterns
in workflow systems provide a very flexible way to
specify and share communication protocols, increase
data interchange across processes, and allow for the
incremental integration of new ETL patterns. All this
will improve the quality of an ETL system solution.
TaskClusteringonETLSystems-APattern-OrientedApproach
209
3 A MULTI-LAYER APPROACH
In large ETL projects, conceptual models play an
important role, being very useful in preliminary
development stages where users must validate
business requirements. They help users with
different backgrounds to understand the meaning of the
concepts involved. Basically, BPMN provides a
very simple and powerful notation for process
representation (very suitable for processes such as
ETL), coupled with its expressiveness and the
implementation constructs and control tasks it
provides. Additionally, companies already use BPMN to
describe and implement internal business processes.
So, generically, we can identify two advantages: 1)
business users are already familiar with the process
construction language; and 2) existing business
processes can be used and integrated with ETL
processes, taking advantage of existing routines and process
logic. To support different abstraction levels, we can
use BPMN collapsed sub-processes, which can be
applied successfully to ETL processes. This supports
process conceptualization, since complex constructs
are decomposed into different levels. This is very
advantageous for high-level users when presenting,
discussing and understanding process concepts.
Considering the example used, we designed a
possible subset of the activities needed for data extraction
from the two referred information sources in order to
feed some temporary data structures stored in the
DSA (Figure 3). Figure 3a) represents ETL layer
2, showing the level that can be derived from the
extraction phase included in the main layer (layer 1),
and representing the logic of the process related to the
extraction of data from both sources. Two parallel
BPMN gateways were used to indicate that the tasks
belonging to each flow can be executed in parallel.
Two different BPMN sub-processes are also
represented, namely ‘Extraction Tasks’ and
‘#CDC# - spreadsheet’. The first is a very
ordinary BPMN sub-process, which hides the
complexity of the data extraction process applied to
the relational source. The second one represents an
ETL pattern (we used the ‘#’ symbol to distinguish
it), which represents the logic of the differential
loading process performed over the collected data.
The CDC pattern was defined as a composite task.
However, for conceptual representation, the user does
not need to know its composition, but rather the
configuration of each associated parameter in order
to enable it. In contrast with this pattern, the first
extraction process can be easily decomposed (Figure
3b)), representing the data extraction from the tables
“Planes” and “Location”.
Figure 3: Representation of an ETL extraction phase -
layers 2 (a) and 3 (b).
After the conclusion of both processes, the data
stored in the table “Flights” is extracted considering
DSA-specific requirements. The composition of an ETL
layer is represented through a BPMN pool, where the
lanes represent the roles of the processes associated
with the ETL tasks. The transformation cluster can be
represented using only software patterns (Figure 4).
To consolidate and normalize data from each data
source, we defined several BPMN lanes to represent
the transformations that should be applied to the
entities of each information source.
Figure 4: Representation of the ETL Transformation phase
(layer 2).
DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications
210
Figure 5: Generic algorithm for DL pattern.
The process starts with the execution of three
concurrent flows. For the spreadsheet source, two
DQE (Data Quality Enhancement) ETL patterns were
used to apply data cleaning and normalization
techniques over the tables “Planes” and “Countries”.
Next, a set of DCI (Data Conciliation and Integration)
patterns was used in order to integrate data about
planes, locations and flights, ensuring data integration
and consistency between data sources. The DCI
pattern covers several integration problems and
uses other external patterns. For this particular case, a
DCI pattern needs to call an SK (surrogate key) pattern
in order to generate the dimensions’ surrogate keys.
Upon completion of each flow, data are synchronized
and the loading procedures activated.
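As a rough illustration of the role played by the SK pattern, the sketch below keeps a mapping table from natural keys to system-generated surrogate keys; the class and method names are assumptions made for the example and are not part of the pattern specification.

    # A minimal sketch of a surrogate key (SK) pattern (illustrative names).
    class SKPattern:
        def __init__(self):
            self.mapping = {}        # natural key -> surrogate key (mapping table)
            self.next_key = 1        # simple sequence generator

        def get_surrogate(self, natural_key):
            """Return the surrogate key for a natural key, creating it when unseen."""
            if natural_key not in self.mapping:
                self.mapping[natural_key] = self.next_key
                self.next_key += 1
            return self.mapping[natural_key]

    sk = SKPattern()
    sk.get_surrogate(("PT", "Braga"))    # -> 1 (new dimension member)
    sk.get_surrogate(("PT", "Braga"))    # -> 1 (already mapped)
    sk.get_surrogate(("ES", "Madrid"))   # -> 2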
The use of more abstract BPMN constructs,
especially oriented to receive ETL requirements,
provides a way to encapsulate a considerable set of
repetitive tasks by grouping them into clusters that
represent ETL patterns. As such, this not only allows
for simplified process modeling but also provides
more guarantees of consistency to ETL processes.
Using BPMN for ETL process specification is quite
interesting. It supplies a useful and expressive
notation at the ETL conceptualization stage, allowing for
the representation of powerful orchestration
mechanisms at process instantiation. We can enrich
BPMN models with some specific implementation
details to support the automatic generation of
physical models based on their corresponding
conceptual representation. Eventual inconsistencies
related to the expressiveness of BPMN constructs are
minimized when we use ETL patterns.
4 PATTERN SKELETONS
To enrich the ETL pattern approach we followed, we
provide a set of pattern definitions that can be
instantiated for specific ETL scenarios. We describe
patterns as a set of tasks that need to be executed in a
specific order. Such tasks can be atomic or composite,
depending on the level of detail associated with their
implementation. We do not intend to provide a
workflow engine to execute ETL processes. Instead,
we want to present a generic way to describe ETL
patterns that ensures their mapping and execution on a
commercial ETL tool based on specific
transformation templates. For that, the architecture
and behaviour of each pattern must be described as a
set of functional components, i.e. an input
configuration, an output configuration, an exception-
handling configuration, and a log-tracking
configuration.
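The following sketch makes these functional components more concrete by grouping them in a minimal pattern skeleton; the class and field names are illustrative assumptions, not a prescribed interface.

    # A minimal sketch of an ETL pattern skeleton and its functional components.
    from dataclasses import dataclass, field

    @dataclass
    class PatternConfiguration:
        input_config: dict            # source schemas, matching attributes, bindings
        output_config: dict           # target repository and attribute mappings
        exception_config: dict = field(default_factory=dict)   # quarantine/cancelation policies
        log_config: dict = field(default_factory=dict)         # ETL log journal settings

    class ETLPattern:
        """Base skeleton: concrete patterns (DL, DCI, SK, SCD, ...) override run()."""
        def __init__(self, config: PatternConfiguration):
            self.config = config

        def run(self, records):
            raise NotImplementedError("each pattern provides its own behaviour")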
We can view an ETL pattern as a single unit of
work, representing several operations that can be used
to transfer data between data sources. Since data
representation is intrinsically connected to the data
schemas and the implementation technology, the use of
ETL patterns allows for the representation of data in
several states, ensuring its consistency with target
repository requirements. The ACID properties
(atomicity, consistency, isolation, and durability) for
database transactions provide a set of base
characteristics that can also be considered in the
context of ETL patterns. In addition to instance-level
constraints, schema-level exceptions can occur,
which may lead to a compensation or a cancelation
event to handle such cases appropriately. For
example, a normalization procedure applied to a
non-existent attribute should activate a cancelation
event. However, if some specific records have
unexpected values, those records can be stored in
quarantine tables for further evaluation.
Figure 6: Generic algorithm for DCI pattern.
Generally, the behaviour of the process presented
previously in Figure 2 can be described by appealing to a
well-known procedure – a data lookup operation over
a specific table in order to identify the
correspondences that can be used to detect incorrect
or missing values, correcting them automatically
when possible.
Figure 5 presents a generic algorithm for the
DL pattern, which is scope independent and can be used
in single or composite contexts. The DL pattern starts
by reading the input metadata, which allows it to
identify all binding maps between source data and
data stored in the data warehouse. Configuration data
can be passed between activities at the process scope.
Next, records from the source data are read (Load Record
activity). For each one of them, a target
correspondence is verified using predefined matching
attributes over the corresponding lookup table. A
simple mapping <old value, new value> is used to
support the Search Correspondences activity. If no
correspondence is found for a given value, the
process identifies an anomaly and starts a
compensation event in order to preserve, in a specific
quarantine table, all the records without
correspondence. These records will later be
analyzed by a system administrator or, if possible, handled by
a correction and recovery mechanism.
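Assuming the lookup table is available as an in-memory mapping, the generic behaviour just described for Figure 5 could be sketched as follows; the activity names in the comments mirror the figure, while the data structures themselves are illustrative.

    # A minimal sketch of the generic DL pattern algorithm (assumed data structures).
    def dl_pattern(source_records, config, quarantine_table, log_journal):
        lookup = config["lookup_table"]            # {old value: new value} correspondences
        attribute = config["matching_attribute"]   # attribute checked against the lookup table
        processed = []
        for record in source_records:              # Load Record activity
            value = record.get(attribute)
            if value in lookup:                    # Search Correspondences activity
                record[attribute] = lookup[value]
                processed.append(record)           # Replace data activity
            else:
                quarantine_table.append(record)    # compensation: preserve anomalous records
        log_journal.append({"processed": len(processed),
                            "quarantined": len(quarantine_table)})  # Update log activity
        return processed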
A BPMN Business Rule task was also used to
represent automatic policies to handle specific
application scenarios – e.g. to solve a non-existing
record correspondence, one may specify a rule to
delete the record. Tasks like these fit our
implementation purposes completely, since rules can be
added or changed without affecting other tasks,
hiding their complexity from more general users.
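A minimal sketch of such pluggable rules, assuming they are registered as simple functions consulted by the DL pattern when no correspondence is found, could look as follows; the rule names are hypothetical.

    # A minimal sketch of pluggable business rules for unmatched records (assumed policies).
    RULES = {}

    def rule(name):
        def register(func):
            RULES[name] = func
            return func
        return register

    @rule("quarantine")
    def quarantine_rule(record, quarantine_table):
        quarantine_table.append(record)        # keep the record for later analysis

    @rule("delete")
    def delete_rule(record, quarantine_table):
        pass                                   # simply drop the record

    # The active policy is part of the pattern configuration,
    # e.g. config["no_match_rule"] = "delete".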
The Replace data activity is responsible for
materializing processed data (i.e. records with
correspondence) in the target repository specified in
the pattern output metadata. Finally, the DL pattern
should update an ETL log journal with output data
summarizing the entire process.
Internally, each pattern is responsible for preserving
temporary data structures to handle intermediate
results and materialize them if needed. Several
BPMN artifacts were used to represent data with
persistent characteristics beyond the process scope. They
represent the target schemas with which each activity will
interact. The use of DL patterns is very useful in
the specification of the DCI procedure, whose
activities are formulated in Figure 6. Similarly to the
DL pattern, the process starts with the specification of the
configuration parameters needed to access the
source data schemas. The DCI metadata also includes
the binding rules that must be applied to common
attributes, allowing for the preservation of
relationships between fields with different names or
ordinal positions. Next, the process handles each
record according to a set of specific rules that define
its integration in the data warehouse.
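One possible way to express such binding rules, assuming each source exposes its records as attribute/value pairs, is sketched below; the source and attribute names merely echo the running example.

    # A minimal sketch of DCI binding rules aligning attributes with different names (assumed schemas).
    BINDINGS = {
        "relational": {"country_name": "Country", "city_name": "City"},
        "spreadsheet": {"Country": "Country", "Town": "City"},
    }

    def conform(record, source):
        """Rename source attributes to the common (target) attribute names."""
        mapping = BINDINGS[source]
        return {target: record[original] for original, target in mapping.items()}

    conform({"country_name": "Portugal", "city_name": "Braga"}, "relational")
    # -> {"Country": "Portugal", "City": "Braga"}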
The Load source task is responsible for loading the
records collected from each data source, providing the
means to extract the records and use them as input to
identify common records. For the representation of
the DCI pattern, we used a BPMN collaboration
diagram. Unlike the DL pattern, the DCI pattern needs
to communicate with the DL pattern and, depending
on the requirements established, may communicate
with the SK and SCD (Slowly Changing Dimension)
patterns in order to accomplish its function. This
behaviour was modeled using intermediate message
flow events, which send messages to specific pools
that represent patterns - pools are collapsed due to
space limitations. Each pattern receives the request
and internally executes the respective procedure.
DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications
212
The DL pattern is responsible for identifying common
records among the data sources. This is necessary
because the DCI pattern may receive data from two
or more data sources. The dictionary table that
preserves the binding rules must be queried to
identify the mappings between the data schemas of each
source. The records that must be processed are never
transferred in event messages. Instead, the exchanged
messages carry information about the input/output
repository that keeps the records and the specificities
that should be used to process them. As already
mentioned, for any record without correspondence, a
compensation event is launched internally in the DL
pattern as a way to signal a no-occurrence
exception. The event can occur inside an external
pattern, but it is always reported to the DCI pattern.
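The message exchange just described can be sketched as a small record that references the repositories holding the data rather than the data itself; the field values shown are hypothetical.

    # A minimal sketch of the message exchanged between the DCI pattern and external patterns.
    from dataclasses import dataclass

    @dataclass
    class PatternMessage:
        sender: str              # e.g. "DCI"
        receiver: str            # e.g. "DL", "SK" or "SCD"
        input_repository: str    # where the records to be processed are kept
        output_repository: str   # where the results should be materialized
        settings: dict           # processing specificities (matching attributes, bindings, ...)

    msg = PatternMessage(sender="DCI", receiver="DL",
                         input_repository="dsa.flights_staging",
                         output_repository="dsa.flights_conformed",
                         settings={"matching_attribute": "Country"})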
The error handling used is very similar to that of the DL
pattern. After processing all sources, a BPMN
inclusive gateway is used to determine which paths
will be taken next. For the new and updated records
without conflicts among data sources, selected by the
Load records activity, a surrogate key generation
pattern is invoked to generate a new
surrogate key and to update the corresponding
mapping tables. For the updated records, a flow can
be started to handle new versions of data using a
slowly changing dimensions technique. History
preservation should be handled using an SCD pattern
with a history table that keeps old versions of the data.
However, in some cases records have multiple
versions between sources, which leads to many
redundancy problems. To solve these issues, a
compensation event in the Load Source activity was used
to specify an alternate path to unlock these cases. For
example, a set of priority rules can be defined to resolve
conflicting records. Finally, the process ends with the
Update log activity, which stores all the operations
performed during the process. Like the DL pattern,
the representation of the DCI pattern also includes the
Error handling layer (also collapsed due to space
limitations), which describes the error handling
processes, not only those fired by internal tasks but also
those fired by the ETL patterns involved.
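The routing performed after the inclusive gateway can be sketched as follows, with the SK and SCD patterns represented by the hypothetical callables sk_pattern and scd_pattern, and conflict resolution reduced to a simple source priority rule.

    # A minimal sketch of the post-gateway routing in the DCI pattern (assumed callables).
    def route_records(records, sk_pattern, scd_pattern, priorities):
        for record in records:
            versions = record["versions"]                      # one version per data source
            if len(versions) > 1:                              # conflicting versions between sources
                source = max(versions, key=lambda s: priorities[s])
                version = versions[source]                     # priority rule resolves the conflict
            else:
                (source, version), = versions.items()
            if record["is_new"]:
                sk_pattern(version)                            # generate surrogate key, update mappings
            else:
                scd_pattern(version)                           # keep history of the old version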
Following the ETL pattern context presented, both
the DL and DCI patterns can be categorized at different
levels. The DL pattern can be categorized as a single
pattern because its internal composition does not
imply choreography with other patterns.
However, the DCI pattern is a composite pattern that
needs to exchange messages with other patterns in
order to accomplish a common goal. Moreover, the
basic configuration provided facilitates the dynamic
generation of instances that follow it.
5 RELATED WORK
In the field of ETL patterns, as far as we know, there
is not much work to refer to. (Köppen et al., 2011) proposed a
pattern approach to support ETL development,
providing a general description for a set of patterns,
e.g. aggregator, history and duplicate elimination
patterns. In our opinion, this work presents important
aspects, defining composition properties that are
executed before or after each pattern. However, for
our ETL approach, we believe that the formalization
of these composition properties would limit ETL design,
because populating processes represent very specific
needs of a decision support system that can be
handled in a large variety of ways. Considering the
requirements of a conventional ETL process, we
provided a way to represent each pattern using visual
constructs based on the BPMN language (Akkaoui and
Zimanyi, 2009).
Alternatively, (Muñoz et al., 2009) proposed an
MDA approach to derive ETL conceptual models,
modelled with UML activity diagrams, in order to carry out
automatic code generation for ETL tools. Later,
(Akkaoui et al., 2011) complemented their initial
proposal, providing this time a BPMN-based meta-
model for independent ETL modelling. They
explored and discussed the bridges to a model-to-text
translation, providing its execution using some
commercial ETL tools. Still using the BPMN notation,
(Akkaoui et al., 2012) provided a BPMN meta-model
covering two important perspectives: data process
operations and the workflow orchestration layer.
More recently, following the same guidelines
of their previous works, the same authors (Akkaoui et
al., 2013) provided a framework that allows for the
translation of abstract BPMN models into their concrete
execution in a target ETL tool using model-to-text
transformations. Based on these contributions, we are
working on the physical representation of BPMN
conceptual models in order to provide their execution
using existing commercial ETL tools. Additionally,
we want to explore their mapping using a set of model-
to-code transformations, providing dynamic
translation from our conceptual model to a
serialization model that can be executed in a specific
ETL tool. Previously, we had already explored the
specification of several ETL patterns and their physical mapping,
for example using a SOA architecture (Oliveira and Belo,
2012; Oliveira and Belo, 2013).
TaskClusteringonETLSystems-APattern-OrientedApproach
213
6 CONCLUSIONS AND FUTURE
WORK
In this paper we identified and discussed two ETL
clusters as a way to group and organize a set of
inter-related tasks. These clusters can be generalized
and, when integrating specific configuration
parameters, can form ETL patterns. These have the
ability to represent a general solution for some typical
ETL composite tasks, which are commonly used in
conventional ETL processes – e.g. surrogate key
pipelining, slowly changing dimensions, change data
capture, or intensive data loading. Using the BPMN
language, we presented a conceptual approach to
model such processes applying ETL patterns. With
this component-based ETL development approach,
ETL designers and developers only have to take into
account the interfaces that define the interaction and
communication among ETL patterns. From a
conceptual point of view, we consider that ETL
models should not include any kind of
implementation infrastructure specification. BPMN
provides this kind of abstraction, since it allows for
the representation of several levels of detail in the same
process, fitting well the needs of the proposed conceptual
approach, and contributing to reduce
functional and operational errors and to decrease
overall costs. To prove the adequacy of our approach,
we presented a global specification of an ETL
model using a set of ETL patterns. As
future work, we intend to provide an extended family
of ETL patterns to build a complete ETL system,
covering the integration of tasks that can be used in a
regular ETL system.
REFERENCES
Akkaoui, Z. El et al., 2013. A BPMN-Based Design and
Maintenance Framework for ETL Processes.
International Journal of Data Warehousing and Mining
(IJDWM), 9.
Akkaoui, Z. El et al., 2011. A model-driven framework for
ETL process development. In DOLAP ’11 Proceedings
of the ACM 14th international workshop on Data
Warehousing and OLAP. pp. 45–52.
Akkaoui, Z. El et al., 2012. BPMN-Based Conceptual
Modeling of ETL Processes. Data Warehousing and
Knowledge Discovery Lecture Notes in Computer
Science, 7448, pp.1–14.
Akkaoui, Z. El & Zimanyi, E., 2009. Defining ETL
workflows using BPMN and BPEL. In DOLAP '09
Proceedings of the ACM twelfth international workshop
on Data warehousing and OLAP. pp. 41–48.
Golfarelli, M. & Rizzi, S., 2009. Data Warehouse Design:
Modern Principles and Methodologies, McGraw-Hill.
Kimball, R. & Caserta, J., 2004. The Data Warehouse ETL
Toolkit: Practical Techniques for Extracting, Cleaning,
Conforming, and Delivering Data. Wiley.
Köppen, V., Brüggemann, B. & Berendt, B., 2011.
Designing Data Integration: The ETL Pattern
Approach. The European Journal for the Informatics
Professional, XII(3).
Muñoz, L., Mazón, J.-N. & Trujillo, J., 2009. Automatic
Generation of ETL Processes from Conceptual Models.
In Proceedings of the ACM Twelfth International
Workshop on Data Warehousing and OLAP. DOLAP
’09. New York, NY, USA: ACM, pp. 33–40.
Oliveira, B. & Belo, O., 2012. BPMN Patterns for ETL
Conceptual Modelling and Validation. The 20th
International Symposium on Methodologies for
Intelligent Systems: Lecture Notes in Artificial
Intelligence.
Oliveira, B. & Belo, O., 2013. Approaching ETL
Conceptual Modelling and Validation Using BPMN
and BPEL. In 2nd International Conference on Data
Management Technologies and Applications (DATA).
OMG, 2011. Documents Associated With Business Process
Model And Notation (BPMN) Version 2.0.
Ouyang, C. et al., 2007. Pattern-based translation of BPMN
process models to BPEL web services. International
Journal of Web Services Research (JWSR), 5, pp.42–62.
Pentaho, 2015. Pentaho Data Integration. Available at:
http://www.pentaho.com/product/data-integration
[Accessed March 16, 2015].
Rahm, E. & Do, H.H., 2000. Data Cleaning: Problems and
Current Approaches. IEEE Data Engineering Bulletin,
23(4), pp.3–13.
Santos, V. & Belo, O., 2013. Modeling ETL Data Quality
Enforcement Tasks Using Relational Algebra
Operators. Procedia Technology, 9(0), pp.442–450.
Singh, G., Su, M. & Vahi, K., 2008. Workflow task
clustering for best effort systems with Pegasus.
Proceedings of the 15th international conference on
Advanced information systems engineering.
DATA2015-4thInternationalConferenceonDataManagementTechnologiesandApplications
214