research community has addressed these problems over the years, particularly regarding the mapping of BPMN processes to the BPEL language (Ouyang et al., 2007). The disparities in detail and scope between more abstract and more concrete models create a considerable gap between the two representations, simply because they serve different purposes. To simplify ETL development, we propose the application of a task-clustering technique that groups a set of finer-grained tasks into a collection of tasks, flows and control primitives, providing a method to organize them into layers of abstraction and supporting different levels of detail to serve the various stakeholders in different project phases. The cluster categorization provides a way to identify and classify patterns that can be instantiated and reused in different ETL processes. Each pattern is scope-independent and provides a specific skeleton that not only specifies its internal behaviour but also enables communication and data interchange with other patterns in a workflow context.
In this paper, we demonstrate the feasibility and effectiveness of the approach we developed, analysing a real-world ETL scenario and identifying common task clusters. In section 2 we discuss their generalization and use as general constructs in an ETL package. Then, we demonstrate how activities can be grouped to build ETL conceptual models in different abstraction layers (section 3), presenting two ETL skeletons (data lookup, and data conciliation and integration) and exposing their configuration and behaviour in a way that allows them to be executed in a target ETL tool (section 4). Next, some related work is discussed (section 5). Finally, in section 6, we evaluate the work done, pointing out some research guidelines for future work.
2 ETL TASK CLUSTERING
On-line Transaction Processing (OLTP) systems are responsible for recording all business transactions performed in enterprise operational systems. These systems are built to support specific business needs, storing operational data from daily business operations. Therefore, they are the main data sources used by data warehousing systems. In more complex cases, data are distributed across the distinct business branches of a company. These data can be stored in sophisticated relational databases or in simpler data structures (e.g. text or spreadsheet files). Due to this variety of information sources, problems often occur when populating a data warehouse (Rahm and Do, 2000).
Generally, the tasks used in the extraction step (E) are responsible for gathering data from the data sources and placing it into a data staging area (DSA). The DSA is a working area where data is prepared before going to the data warehouse. To that end, the DSA provides the metadata needed to support the entire ETL process, offering, for example, data correction and data recovery mechanisms based on domain-oriented dictionaries, mapping mechanisms or quarantine tables.
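As an illustration of this extraction step, the following is a minimal sketch of a task landing source records in a DSA, assuming a CSV source and an SQLite staging database; the file, table and column names (stg_flights, quarantine_flights) are hypothetical, not constructs from our approach.

```python
import csv
import sqlite3

def extract_to_dsa(source_path: str, dsa: sqlite3.Connection) -> None:
    # staging table for minimally valid records, quarantine table for the rest
    dsa.execute("CREATE TABLE IF NOT EXISTS stg_flights "
                "(flight_id TEXT, flight_date TEXT, city TEXT)")
    dsa.execute("CREATE TABLE IF NOT EXISTS quarantine_flights "
                "(flight_id TEXT, flight_date TEXT, city TEXT, reason TEXT)")
    with open(source_path, newline="") as f:
        for row in csv.DictReader(f):
            values = (row.get("flight_id"), row.get("flight_date"), row.get("city"))
            if all(values):
                dsa.execute("INSERT INTO stg_flights VALUES (?, ?, ?)", values)
            else:
                # incomplete records are kept aside for later correction/recovery
                dsa.execute("INSERT INTO quarantine_flights VALUES (?, ?, ?, ?)",
                            values + ("missing mandatory field",))
    dsa.commit()

dsa = sqlite3.connect(":memory:")
# extract_to_dsa("flights.csv", dsa)  # hypothetical source file
```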
Transformation and cleaning (T) procedures are applied after data extraction, using the data already stored in temporary structures in the DSA. After this second step, data is loaded (L) into a target data warehouse, following schema rules as well as operational and business constraints. Essentially, an ETL process is a data-driven workflow comprising a set of tasks and their associated control flows and business rules that together express how the system should be coordinated. Typically, commercial tools use workflows to represent very specific tasks that must be grouped together each time the same procedure is represented.
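To make this workflow view concrete, the sketch below chains extract, transform and load tasks so that each one consumes the output of its predecessor; the task bodies and record shapes are illustrative assumptions, not part of our method.

```python
from typing import Callable, Iterable

Record = dict
Task = Callable[[Iterable[Record]], Iterable[Record]]

def extract(_: Iterable[Record]) -> Iterable[Record]:
    yield {"city": " porto ", "country": "PT"}   # stand-in for a real source

def transform(records: Iterable[Record]) -> Iterable[Record]:
    for r in records:                            # cleaning: trim and normalize
        yield {**r, "city": r["city"].strip().title()}

def load(records: Iterable[Record]) -> Iterable[Record]:
    for r in records:
        print("loading into DW:", r)             # stand-in for a warehouse insert
        yield r

def run_workflow(tasks: list[Task]) -> None:
    data: Iterable[Record] = []
    for task in tasks:                           # the workflow coordinates the tasks
        data = task(data)
    for _ in data:                               # drain the lazy pipeline
        pass

run_workflow([extract, transform, load])
```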
To reduce the impact of such situations, we propose an approach (also used in other application scenarios (Singh et al., 2008)) that allows us to organize ETL tasks into clusters and execute them as a single block. However, we went a step further, formalizing a set of “standard” clusters representing some of the most common techniques used in real-world ETL scenarios. Based on a set of input parameters, the tasks that compose a cluster should produce a specific output. In fact, we are creating routines or software patterns that can be used to simplify ETL development, reducing implementation errors and the time needed to implement ETL processes. With the ETL patterns identified, we propose a multi-layer approach to represent them, using BPMN pools to represent the several layers of abstraction, each composed of several lanes representing the tasks that should be applied to each group of similar records to be processed.
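As an illustration, a cluster can be thought of as a parameterized block that hides its internal tasks and produces a specific output from its input. The sketch below assumes a simple Python representation; the Cluster class and the example tasks are hypothetical, not a formal definition from this work.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Cluster:
    name: str
    tasks: list[Callable[[list[dict]], list[dict]]] = field(default_factory=list)

    def run(self, records: list[dict]) -> list[dict]:
        for task in self.tasks:   # the grouped tasks execute as a single block
            records = task(records)
        return records

# a "standard" cluster instance: normalize country codes, then drop duplicates
normalize = Cluster("normalize_locations", [
    lambda rs: [{**r, "country": r["country"].upper()} for r in rs],
    lambda rs: list({(r["city"], r["country"]): r for r in rs}.values()),
])

print(normalize.run([{"city": "Porto", "country": "pt"},
                     {"city": "Porto", "country": "PT"}]))
```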
To demonstrate the potential of our approach, we present a common ETL scenario, representing an ordinary data extraction process that uses two different data sources, prepares and conforms the data, and subsequently inserts it into a data warehouse. The first source, a relational schema, stores data about flights (dates and flight time), travel locations (city and country) and planes (brand, model and type, and manufacturer data). The spreadsheet source represents a subset of the data structures that can be found in the relational schema.
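A minimal sketch of this scenario, assuming illustrative column names and sample values rather than the actual schemas, could read both sources and conform their records to a common shape before integration:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE flights (flight_date TEXT, city TEXT, country TEXT)")
db.execute("INSERT INTO flights VALUES ('2015-05-02', 'Porto', 'PT')")

# stand-in for rows parsed from the spreadsheet subset
spreadsheet_rows = [["2015-05-03", "lisbon", "pt"]]

def conform(date: str, city: str, country: str) -> dict:
    # bring both sources to a common, normalized shape
    return {"flight_date": date, "city": city.title(), "country": country.upper()}

merged = [conform(*row) for row in db.execute("SELECT * FROM flights")]
merged += [conform(*row) for row in spreadsheet_rows]
print(merged)  # conformed records ready for conciliation and loading
```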