TOWARDS MODEL-DRIVEN EVOLUTION

OF DATA WAREHOUSES

Christian Kurze, Marcus Hofmann, Frieder Jacobi, André Müller and Peter Gluchowski

Chemnitz University of Technology, Faculty of Economics and Business Administration, 09107 Chemnitz, Germany

Keywords: Data Warehouse, Model-driven Architecture, Architecture-driven Modernization, Model Management,

Re-engineering, Data Warehouse Evolution.

Abstract: Data warehouse systems represent centralized data collections used for the purposes of data analysis and

decision support. Development and maintenance of extensive data warehouse systems require appropriate

support through methods and tools. Therefore, we introduce the project Computer-Aided Warehouse Engi-

neering (CAWE). It is a model-driven approach to the engineering of data warehouse systems. Especially

the process of data warehouse evolution, i.e. the maintenance of the core data storage component, the data

warehouse in narrow sense, is a tedious task since the change of data structures implies the challenge of en-

suring integrity and history of corporate data. The paper at hand provides research in progress and suggests

the adoption of a multidimensional algebraic formalism to the model-driven development paradigm.

1 INTRODUCTION

Nowadays, data warehouse systems form the back-

bone of many companies. They contain a subject

oriented, integrated, time variant and consistent col-

lection of data and therefore build the central com-

ponent of modern decision support systems (Inmon

et al., 2008).

An increased company size leads to a considera-

ble complexity of these systems. At the same time,

the applied methods and tools for system develop-

ment and management cannot deal with this grade of

complexity. Subsequently, extensive data warehouse

systems are hard to handle. A loss of efficiency,

faulty analyses, the absence of necessary adaptations

of the system and also high costs in maintenance and

further development form the negative conse-

quences.

We adopt the promising approach of model-

driven architecture (MDA) (Object Management

Group, 2003; Czarnecki and Helsen, 2006) in order

to ease development and evolution of complex data

warehouse systems. Our focus is put on the data

warehouse in a narrow sense, i.e. the core data sto-

rage component.

The paper is structured as follows: after introduc-

ing the CAWE approach, we describe the multidi-

mensional algebraic formalism of (Blaschka, 2000)

and show a possible adoption in the CAWE proto-

type.

2 THE CAWE APPROACH

Aspects like data safety, data security and com-

pliance put higher demands on development and

documentation of data warehouses as well as meta-

data management. Furthermore, there is a shift in

emphasis from the IT perspective towards the busi-

ness perspective. Because of that, there is a need for

adequate, continuous methods and tools to support

the lifecycle management of data warehouses from

the business layer towards the implementation. The

following tasks are practically relevant:

 Documentation and assessment of existing sys-

tems.

 Implementation of new systems on the basis of

the requirements of the business layer.

 Migration of legacy systems into modern archi-

tectures.

Data warehouse systems consist of different lay-

ers: operational systems, ETL processes, multidi-

mensional data storage, and applications (Chaudhuri

and Dayal, 1997). Each layer has to be taken into

account during the development process in order to

get a fully fledged data warehouse system.

356

Kurze C., Hofmann M., Jacobi F., Müller A. and Gluchowski P..

TOWARDS MODEL-DRIVEN EVOLUTION OF DATA WAREHOUSES.

DOI: 10.5220/0003444303560360

In Proceedings of the 13th International Conference on Enterprise Information Systems (ICEIS-2011), pages 356-360

ISBN: 978-989-8425-53-9

 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Combining the typical layers of data warehouse

systems and the viewpoints of MDA, (Mazón and

Trujillo, 2008) presented a promising framework

which has been extended and modified by (Kurze

and Gluchowski, 2010), as shown in figure 1. The

intended advantages are: definition of a systematic

and well-structured way for the development of he-

terogeneous data warehousing layers; increased

productivity by generating large parts of the imple-

mentation; better portability and adaptability by us-

ing the concept of separation of concerns; an inte-

grated modeling framework which supports compre-

hensive requirements definition support, as pro-

posed, for example, by (Guo et al., 2006); support

for system evolution based on an architecture-driven

modernization (ADM) (Ulrich and Newcomb, 2010)

and the explicit integration of the business domain.

Since the development of data warehouse sys-

tems is strongly aligned to software development,

the research framework is based on the general

terms forward, reverse and re-engineering. They are

well discussed within the context of software-

engineering (Chikofsky and Cross, 1990). Further-

more, their application has been transferred to data

engineering problems by (Aiken, 1996).

In order to support a triple-driven design of data

warehouses as a combination of data provided by

operational systems, user requirements against the

data warehouse as well as goals defined indepen-

dently of available data and user requirements (Guo

et al., 2006), the framework defines forward and

reverse engineering for data warehouses as a whole

spreading over the given layers of a data warehouse

system (Kurze and Gluchowski, 2010):

 Forward engineering of data warehouses re-

quires a reverse engineering of operational sys-

tems in order to support data-driven require-

ments engineering as well as forward engineer-

ing of multidimensional data storage, applica-

tions and ETL processes.

 Reverse engineering of data warehouses is also

based on a reverse engineering of operational

systems as a requirement for reverse engineer-

ing of ETL processes. Furthermore, it consists

of reverse engineering of multidimensional data

storage and applications.

 Re-engineering of data warehouses as a combi-

nation of forward and reverse engineering is

based on a reverse engineering of operational

systems, and a re-engineering of ETL

processes, multidimensional data storage, and

applications.

Figure 1 summarizes the framework and shows

the necessary metamodels of each data warehousing

layer as well as their corresponding MDA view-

point. The whole framework is encompassed by a

frame that offers the theoretical concepts of model

management as well as facilities to generate technic-

al and end-user documentation. It ensures the seam-

less integration into metadata management systems.

The connection of requirements of end-users,

corporate strategy and data sources supports a triple-

driven requirements engineering and helps to in-

crease the quality of reverse engineering initiatives

by aligning and improving automatically generated

artifacts with the implicit knowledge of users.

Furthermore, the framework includes a metamo-

del for each layer of data warehouses according to

the MDA viewpoint. Requirements can be captured

in a metamodel suitable for end-users, e.g. ADAPT

(Bulos and Forsman, 2006; Gluchowski et al., 2009).

They form the basis for model-to-model (M2M)

transformations into Platform Independent Models

(PIMs) for ETL, multidimensional data storage

(MD), and applications. Within each layer there are

further M2M and model-to-text (M2T) transforma-

tions that generate source code. In the reverse direc-

tion there are text-to-model (T2M) and M2M trans-

formation to gather abstract models from actual im-

plementations.

Due to different implementation alternatives, the

multidimensional data layer requires the distinction

between a relational and a pure multidimensional

implementation on Platform Specific Model (PSM)

and Code viewpoints.

Requirements

(CIM)

Data Sources

PIM

ETL

PIM

Application

PIM

Data Sources

PSM

ETL

PSM

Relational

PSM

Application

PSM

Data Sources

Code

ETL

Code

Relational

Code

Application

Code

CIM Layer

PIM Layer

PSM Layer

Code Layer

PIM

Model Management, Technical and End-User Documentation

Corporate

Strategy

End-User

M2MM2M

M2M

M2M M2M M2M M2M

M2T/T2MT2M M2T/T2M M2T/T2M

Figure 1: Framework for model-driven forward, reverse,

and re-engineering of data warehouse systems (Kurze and

Gluchowski, 2010).

Focusing on the multidimensional data layer,

figure 2 shows the metamodels which are necessary

for its evolution in the ADM horseshoe model.

Within the first step, an MD PIM and a Require-

ments model are reverse engineered from the current

implementation (depicted as Relational PSM and

TOWARDS MODEL-DRIVEN EVOLUTION OF DATA WAREHOUSES

357

Relational Code as well as MD PSM and MD Code).

The second evolution step is the enrichment of the

requirements. The result is a second requirements

model: Requirements’ which is transformed into a

new MD PIM’ as well as into target PSMs and cor-

responding code (Relational PSM’, MD PSM’, Rela-

tional Code’, and MD Code’).

While this approach would be suitable for pure

software artifacts, it is not suitable for data engineer-

ing problems: it does not take into account the in-

stance layer of data. When generating new code,

especially for the relational data storage, instance

data might be modified or even deleted. The particu-

lar feature of data warehouses is the offering of his-

torical data, i.e. data must not be deleted: after an

evolution, analyses with historical data must still be

possible. Therefore, it is essential not to modify any

instance data in an uncontrolled way. The approach-

es of schema and data evolution are promising to

overcome this issue, cf., for example, (Favre, 2009;

Blaschka, 2000; Kaas et al., 2004).

Target Solution

Existing Solution

Relational Code

MD PIM

Requirements Requirements’

MDPIM’

MD Code

Relational PSM MD PSM

Relational Code’ MD Code’

Relational PSM’ MD PSM’

Model changes by

business users or IT

Figure 2: Re-engineering of multidimensional data in the

ADM Horseshoe Model.

For data-intensive applications like data ware-

house systems the following research issue results:

Which methods support the evolution of data ware-

house systems within the scope of the model-driven

engineering considering the instance level?

Basically, the above remarks lead to two use cas-

es for the re-engineering:

1) The transfer of all the structures and instance

data from one existing platform to another.

2) Evolution of the data warehouse system

based on new functional requirements, in the

meaning of an adjustment of existing struc-

tures persistent over all levels of abstraction.

Particularly because of missing real-world im-

plementations, the latter is in the focus of further

treatment in the paper at hand.

3 DATA WAREHOUSE

EVOLUTION

3.1 Theoretical Approach

(Blaschka, 2000) and (Kaas et al., 2004) take into

account the challenges of schema evolution with

respect to instance data. (Blaschka, 2000) developed

an algebraic formalism for the treatment of multidi-

mensional schemata. He provides two models for

multidimensional data: multidimensional schema

and multidimensional instance. Based on these mod-

els, it has been possible to deduce a set of 14 ele-

mentary operators, which can represent all changes

within the given multidimensional model. The

schema change operations by a user are logged, so

that a sequence of operators for schema evolution is

derived. Well-defined priority rules for each opera-

tor ensure a correct transformation. Based on the 14

conceptual operators, Blaschka derived a mapping of

logical operators, which are executable on the rela-

tional star schema. They form the basis to generate

SQL code which performs the actual evolution. Kaas

et al. further examined the logical operators and

show particular implementation challenges in star

and snowflake schemata.

The existence of two models, before and after

the change, can lead to a theoretically infinite se-

quence of operators to represent the changes. Never-

theless, the sequence of operators can be determined

using planning algorithms developed in the area of

artificial intelligence (Barr and Feigenbaum, 1990).

The approach of model management (Bernstein and

Melnik, 2007) proposes several abstract operators.

Particularly interesting for the evolution of data

warehouses is the Diff operator which describes the

changes done on a model. Subsequently, this opera-

tor allows deriving a difference model representing

the changes a user made on requirements artifacts.

Our evolution approach is grounded on Blasch-

ka’s concept. We make use of the abstract Diff oper-

ator to determine changes on a model. These

changes are the basis to determine the necessary

operations on platform-specific models which

represent the actual implementation of evolution.

Figure 3 illustrates the concept described.

The framework assumes a platform-independent

model as well as the associated platform-specific

models and their implementation. They can either be

generated by a forward engineering or be the result

of a reverse engineering. Furthermore, there exists a

manipulated platform-independent model (PIM’).

Basically, this model could be transformed into plat-

ICEIS 2011 - 13th International Conference on Enterprise Information Systems

358

form-specific models and code which would over-

ride existing schemata and instance data.

Relational

PSM

PIM

M2M

Relational

Code

M2T

Relational

PSM

Diff

PSM

Diff

PIM

Diff

M2M

Relational

Code

Diff

Code

Diff

M2T

Relational

PSM’

PIM’

M2M

Relational

Code’

M2T

Difference Model contains

model changes

Given Required

Can be generated, but schemata

and instance data will be

overwritten

Figure 3: Framework for re-engineering of multidimen-

sional data models.

Therefore, we assume a difference model MD

PIM

Diff

which contains all changes made by a user

on the MD PIM. This model is used to deduce a se-

quence of operators which, applied to the source

model MD PIM, represents a path from MD PIM to

the target model MD PIM’. These operators conform

to the operators by Blaschka on his multidimension-

al schema model. The mapping of these abstract

operators to specific systems (as done by Blaschka

and Kaas et al. for the relational implementation) is

the basis to generate various difference models on a

platform-specific viewpoint.

Based on this sequence of operations, code,

which adjusts the actual implementation, is generat-

ed through M2T transformations. While changes to

the structure of the data are executed mostly auto-

matically, changes to the data pool itself may require

manual assistance under certain circumstances (e.g.

the insertion of a new dimension level requires the

specification of hierarchical classifications).

3.2 Adoption in CAWE

The next step consists of transferring the theoretical

problem description to the prototype developed

within the CAWE project. An adequate starting

point is the schema evolution algebra introduced by

Blaschka. Since its capabilities are constrained by

his underlying mathematical models, some peculiari-

ties of multidimensional modeling cannot be de-

scribed properly. To expand Blaschka’s formalism

to the metamodels developed for CAWE, an imple-

mentation of this formalism via a computer algebra

system is planned. This enables the extended set of

operators to be tested for correctness. Furthermore,

there has to be a seamless integration, i.e. a formal

mapping, of abstract algebra and the CAWE meta-

models. This ensures the fit between metamodels

and algebra as well as the determination of model

transformations representing the operators.

Another task is the determination of the MD

PIM

Diff

model and the underlying operators.

(Blaschka, 2000) supposes to log the change actions

done by a user on the conceptual model. However, a

sequence determined in this way does not necessari-

ly need to be optimal. The Eclipse Modeling

Framework (EMF) technology underlying the

CAWE prototype has an appropriate built-in tool for

this purpose: EMF Model Transaction. It has to be

shown in which way the generic operators of the

EMF can be mapped on those of the extended alge-

bra. A second way to determine the difference model

is the usage of EMF model compare. Its result is a

formal difference model. Following this approach, it

is necessary to deduce the operations by the given

difference between the source and the target model.

Once the needed operations are determined, the

next task is to distinguish their correct sequence

while considering that pre- and post-conditions hold.

Basically, this issue can be described best as a plan-

ning problem. A solution using methods from the

area of artificial intelligence research seems to be

obvious (Barr and Feigenbaum, 1990). Further anal-

ysis whether existing algorithms can be used or uti-

lized as basis for a solution of this problem is essen-

tial. This process is supported through the imple-

mentation of the computer algebra system, too.

For the creation of transformations from multi-

dimensional operators into their relational imple-

mentation, the following step is necessary: It has to

be determined algorithmically, which multidimen-

sional operators have to be realized as a composite

operation in the relational database (e.g. insert

attribute with a following connect attribute to...). If

these operators have been defined on a relational

level, the data definition and data manipulation

scripts being necessary for the schema evolution can

be derived. Blaschka already derived scripts for the

star schema using his algorithm. Also (Kaas et al.,

2004) followed up this topic and described the im-

plementation of eight operators identified by them

for star and snowflake schemata. The CAWE ap-

proach takes up this state of research, adjusts it to

the developed metamodels and integrates it into the

prototype.

TOWARDS MODEL-DRIVEN EVOLUTION OF DATA WAREHOUSES

359

4 CONCLUSIONS AND FUTURE

WORK

The problem of the data warehouse evolution is of

significant practical relevance and has been analyzed

multiple times before. On an abstract, theoretical

level this problem has been formalized by the ap-

proach of model management (Bernstein and Mel-

nik, 2003). Particular research focusing on the evo-

lution of data warehouses has been done, for exam-

ple, by (Blaschka, 2000) and (Kaas et al., 2004).

Their approaches are very promising. Therefore, the

CAWE approach adopts the aforementioned ideas to

support the evolution of data warehouse schemata as

well as instance data.

To be able to adopt this idea to the model-driven

data warehouse engineering in general and to its

integration in CAWE in detail, we intend the follow-

ing further steps:

 Implementing Blaschka’s algebra in a computer

algebra system like Mathematica.

 Determination of operators from a given differ-

ence model.

 Selection and implementation of an appropriate

planning algorithm from the area of artificial

intelligence to determine the correct sequence

of operators.

 Development of metamodels for difference

models to represent the operators which are ne-

cessary on the different MDA viewpoints.

 Design and implementation of transformations

between the developed difference metamodels

considering the results of Blaschka and Kaas et

al.

 Extension of Blaschka’s formalism according

to the CAWE metamodels in order to support

advanced multidimensional modeling concepts.

ACKNOWLEDGEMENTS

This research has been partly funded by the Euro-

pean Social Fund and the Free State of Saxony.

REFERENCES

Aiken, P., 1996. Data Reverse Engineering: Slaying the

Legacy Dragon, McGraw-Hill. New York et al.

Barr, A., Feigenbaum, E. A., 1990. The Handbook of Ar-

tificial Intelligence, Volume 1, Addison-Wesley.

Reading et al, 2

edition.

Bernstein, P. A., Melnik, S., 2007. Model management

2.0: manipulating richer mappings. In Proceedings of

the 2007 ACM SIGMOD international conference on

Management of data. ACM, pp. 1–12.

Blaschka, M., 2000. FIESTA: A Framework for Schema

Evolution in Multidimensional Databases. Dissertation

Thesis at Technical University Munich, Germany.

Bulos, D., Forsman, S., 2006. Getting started with

ADAPT. Online available: http://www.symcorp.com/

downloads/ADAPT_white_paper.pdf. Last access:

2011-01-11.

Chaudhuri, S., Dayal, U., 1997. An overview of data ware

housing and OLAP technology. ACM SIGMOD

Record, 26(1), pp. 65-74.

Chikofsky, E. J., Cross, J. H., 1990. Reverse engineering

and design recovery: a taxonomy. IEEE Software,

7(1), pp. 13–17.

Czarnecki, K., Helsen, S., 2006. Feature-based survey of

model transformation approaches. IBM Systems Jour-

nal, 45(3), pp.621–645.

Favre, C., Bentayeb, F., Boussaid, O., 2009. A Survey of

Data Warehouse Model Evolution. In Handbook of

Research on Innovations in Database Technologies

and Applications: Current and Future Trends, Her-

shey. New York.

Gluchowski, P., Kurze, C., Schieder, C., 2009. A Model-

ing Tool for Multidimensional Data using the ADAPT

Notation. In Proceedings of the 42

Hawaii Interna-

tional Conference on System Sciences. IEEE.

Guo, Y., Tang, S., Tong, Y., Yang, D., 2006. Triple-driven

data modeling methodology in data warehousing: a

case study. In Proceedings of the 9th ACM interna-

tional workshop on Data warehousing and OLAP.

ACM, p. 59–66.

Inmon, W. H., Strauss, D., Neushloss, G., 2008. DW 2.0:

the architecture for the next generation of data ware-

housing, Morgan Kaufmann. Amsterdam et al.

Kaas, C., Pedersen, T. B., Rasmussen, B., 2004. Schema

evolution for stars and snowflakes. In Proceedings of

the 6

International Conference on Enterprise Infor-

mation Systems. INSTICC, pp. 425–433.

Kurze, C., Gluchowski, P., 2010. Computer-Aided Ware-

house Engineering (CAWE): Leveraging MDA and

ADM for the development of Data Warehouses. In

Proceedings of the 16

Americas Conference on In-

formation Systems. AIS, paper 282.

Mazón, J. N., Trujillo, J., 2008. An MDA approach for the

development of data warehouses. Decision Support

Systems, 45(1), pp. 41–58.

Object Management Group, 2003. MDA Guide Version

1.0.1. Online available: http://www.omg.org/cgi-bin/

doc?omg/03-06-01. Last access: 2011-01-11.

Ulrich, W., Newcomb, P. H., 2010. Information Systems

Transformation: Architecture-Driven Modernization

Case Studies, Morgan Kaufmann. Amsterdam et al.

ICEIS 2011 - 13th International Conference on Enterprise Information Systems

360