domain independent approach to data integration, but
require the mappings to be manually engineered.
Schema-less NoSQL database management systems such as Bigtable (Chang et al., 2008) or semi-structured data models (Acharya et al., 2008) efficiently store NULL values, thus mitigating the drawback of domain-oriented data schemas of containing many NULL values compared to generic schemas like EAV. However, NoSQL is targeted towards semi-structured mass data and is not particularly suited to strongly structured, relationship-heavy data such as master data. This often leads to heterogeneous IT infrastructures comprising both NoSQL and traditional relational database systems. Thus, it is still desirable for many companies to exclusively use a relational database management system.
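To make this NULL-value trade-off concrete, the following minimal sketch (all table and column names are hypothetical and not part of the cited approaches) contrasts a domain-oriented table with a generic EAV table:

-- Domain-oriented schema: one column per attribute,
-- NULLs for attributes a product variant does not have.
CREATE TABLE product_domain (
  product_id INTEGER PRIMARY KEY,
  weight_kg  NUMERIC,        -- NULL for variants without a weight attribute
  voltage_v  NUMERIC,        -- NULL for variants without a voltage attribute
  colour     VARCHAR(30)     -- NULL for variants without a colour attribute
);

-- Generic EAV schema: one row per existing attribute value, no NULLs needed.
CREATE TABLE product_eav (
  product_id INTEGER,
  attribute  VARCHAR(50),
  value      VARCHAR(200),
  PRIMARY KEY (product_id, attribute)
);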
Business intelligence development tools for ETL
(Stumptner et al., 2012) typically use available meta-
data, e.g., constraints and foreign keys, to create tem-
plates which need to be further parameterized by the
domain experts. In our approach, domain knowledge is specified by end users beforehand; our generator then uses this information to create the ETL processes without further adjustments needed by the end user.
Thus, the domain metadata is available for all ETL
processes to be developed in the future.
(Muñoz et al., 2009) presents an approach to the automatic generation of ETL processes. It is based on the
Model Driven Architecture (MDA) framework and
generates Platform Specific Models (PSM) from Platform Independent Models (PIM) using Query/View/Transformation (QVT). One PIM describes a sin-
gle ETL process and is completely implementation-
independent. The PIM is the main source of doc-
umentation for the ETL process. In (Atigui et al.,
2012) a framework to automatically integrate data
warehouse and ETL design within the MDA is intro-
duced. It is based on the Object Constraint Language
(OCL). In the paper at hand, the descriptor table is
closely related to the PIM, as it describes the ETL
process in a platform independent way using classes,
attributes and filter criteria, even though our example
implementation is based on the relational data model.
The descriptor table serves as documentation of the ETL process, with the possibility of transforming its condensed representation into a more human-readable format. Instead of platform specific models, our approach uses SQL as a common
language in the data warehousing world.
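As a minimal sketch of this idea (the descriptor column values, the source table and the staging table are hypothetical), a single descriptor row could be rendered by the generator into plain SQL:

-- Hypothetical descriptor row:
--   class = 'Motor', attribute = 'nominal_power', filter = "status = 'active'"
-- Possible generated staging statement:
INSERT INTO stage_eav (entity_id, attribute, value)
SELECT m.motor_id, 'nominal_power', CAST(m.nominal_power AS VARCHAR(200))
FROM   motor m
WHERE  m.status = 'active';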
In (Skoutas and Simitsis, 2006), ontologies are
used to specify structure and semantics of multiple
data source schemata as well as the data warehouse
schema. Using reasoning, conceptual ETL processes
are then inferred automatically, which specify the
transformation of one or more source schemata to the
data warehouse schema. The main motivation for us-
ing ontologies is to overcome structural and semantic
heterogeneity. Comparing this to our approach, we
use a semi-generic data model for different products
in a product line, where the non-generic, i.e., product
specific parts are explicitly mapped via the descrip-
tor table. Thus, because of our generic approach, we
have no need for an inferred mapping of schemata.
(Skoutas and Simitsis, 2006) is only concerned with the generation of conceptual ETL processes; a subsequent transformation into a platform dependent implementation is still required.
(Khedri and Khosravi, 2013) proposes and imple-
ments a delta-oriented approach to handling variabil-
ity in database schemata for software product lines.
They start out with a core schema containing mandatory features, which is then modified using delta scripts depending on which optional or alternative features are selected for a specific product. In contrast, our ap-
proach uses a strict separation of database objects
which are common to all products, i.e., master data,
and product specific parts of the database schema.
3 ARCHITECTURE
Figure 2 gives an overview of the data analysis ar-
chitecture presented in this paper. Data is trans-
formed from the operational database to the analysis database, passing through three different stages. The operational database is a relational database, while the analysis database is implemented as an OLAP (on-line analytical processing) database (Chaudhuri et al., 2011). The first stage is responsible for dealing with activities specific to the operational database. The second stage executes activities that do not depend on the source or target database. The last stage is responsible for dealing with specifics of the analysis database. Thus, each stage is responsible for executing an arbitrary number of activities falling into one of three groups: operational database specific, independent, and analysis database specific. When a new type of activity is needed, it
can be implemented as a template for instantiation
and reuse. Operational as well as analysis database
specific activities are generated based on a descriptor
table and use an interval definition table.
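The following minimal sketch illustrates the three stages in plain SQL, assuming hypothetical table names and a generic attribute-value staging structure:

-- Stage 1 (operational database specific): stage product-specific data
-- in a generic attribute-value form; generated from the descriptor table.
INSERT INTO stage_eav (entity_id, attribute, value)
SELECT o.order_id, 'order_quantity', CAST(o.quantity AS VARCHAR(200))
FROM   operational_orders o;

-- Stage 2 (source/target independent): e.g., a generic cleansing activity
-- that operates only on the staging structure (illustrative assumption).
DELETE FROM stage_eav WHERE value IS NULL;

-- Stage 3 (analysis database specific): load the staged data into the
-- analysis (OLAP) schema.
INSERT INTO analysis_fact (entity_id, attribute, value)
SELECT entity_id, attribute, value FROM stage_eav;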
The main task of the operational database spe-
cific activities is to retrieve data from the operational
database representing the specific product variation of
interest and transform it into a domain independent
data structure, in our case into an EAV model. To do
so, at least a change data capture activity and a staging activity must be implemented. The change data capture ac-