Mixed Driven Reﬁnement Design of Multidimensional Models based on

Agglomerative Hierarchical Clustering

Lucile Sautot

1,2

, Sandro Bimonte

, Ludovic Journaux

and Bruno Faivre

Biog

eosciences Laboratory, University of Burgundy, 6 boulevard Gabriel, Dijon, France

AgroParisTech, 19 rue du Main, Paris, France

IRSTEA Centre de Clermont-Ferrand, 9 avenue Blaise Pascal, Aubi

ere, France

LE2I, University of Burgundy, all

e Alain Savary, Dijon, France

Keywords:

Multidimensional Design, Data Warehouse, OLAP, Data Mining.

Abstract:

Data warehouses (DW) and OLAP systems are business intelligence technologies allowing the on-line anal-

ysis of huge volume of data according to users’ needs. The success of DW projects essentially depends on

the design phase where functional requirements meet data sources (mixed design methodology) (Phipps and

Davis, 2002). However, when dealing with complex applications existing design methodologies seem inef-

ﬁcient since decision-makers deﬁne functional requirements that cannot be deduced from data sources (data

driven approach) and/or they have not sufﬁcient application domain knowledge (user driven approach) (Sautot

et al., 2014b). Therefore, in this paper we propose a new mixed reﬁnement design methodology where the

classical data-driven approach is enhanced with data mining to create new dimensions hierarchies. A tool

implementing our approach is also presented to validate our theoretical proposal.

1 INTRODUCTION

Data warehouses (DW) and OLAP systems are busi-

ness intelligence technologies allowing the on-line

analysis of huge volume of data. Warehoused data

is organized according to the multidimensional model

that deﬁnes the concepts of dimensions and facts. Di-

mensions represent analysis axes and they are orga-

nized in hierarchies. Facts are the analysis subjects

and they are described by numerical indicators called

measures. Warehoused data are then explored and ag-

gregated using OLAP operators (e.g. Roll-up, Slice,

etc.) (Kimball, 1996).

The success of DW projects essentially depends

on the design phase where functional requirements

meet data sources (Phipps and Davis, 2002). Three

main methodologies have been developed: user-

driven, datadriven and mixed (Romero and Abello,

2009). User-driven approach puts decision-makers at

the center of the design phase by providing them tools

to deﬁne the multidimensional model exclusively ac-

cording to their analysis needs. Usually, data driven

methodology proposals deduce the multidimensional

model from structured and semistructured (Mahboubi

et al., 2009; Jensen et al., 2004) data sources exploit-

ing metadata (e.g. foreign keys) and some empirical

values. Finally, mixed approaches fusion the two pre-

vious described methods.

Hierarchies are crucial structures in DW since

they allow aggregation of measures in order to pro-

vide a global and general analytic view of warehoused

data. For that reasons, some works investigate deﬁni-

tion of hierarchies by means of Data Mining (DM)

algorithms (Favre et al., 2006; Sautot et al., 2014b).

However, this design step is applied once the multi-

dimensional model has been deﬁned, and it takes into

account only members of one dimension.

From our point of view, these methodologies

present an important limitation since in real DW

projects often those DM algorithms need data of dif-

ferent dimensions and facts. Thus, in this paper we

present a framework for a mixed design of multidi-

mensional models by integrating DM algorithms in

a classical data driven-approach. This allows deﬁn-

ing hierarchical structures, according to decisional

users’ requirements, that cannot be deduced by clas-

sical datadriven methods. This hierarchical organi-

zation of dimensional data is translated in a complex

multi-factual multidimensional model in order to rep-

resent as well as possible semantic of data sources.

547

Sautot L., Bimonte S., Journaux L. and Faivre B..

Mixed Driven Reﬁnement Design of Multidimensional Models based on Agglomerative Hierarchical Clustering.

DOI: 10.5220/0005404605470555

In Proceedings of the 17th International Conference on Enterprise Information Systems (ICEIS-2015), pages 547-555

ISBN: 978-989-758-096-3

 2015 SCITEPRESS (Science and Technology Publications, Lda.)

The paper is organized in the following way: Sec-

tion 2 introduces related work; a retail case study and

the motivation are presented in Section 3; our design

method is detailed in Section 4 and its implementation

is shown on Section 5.

2 RELATED WORK

Three types of approaches can be used to design a

data warehouse: (i) Methods based on user speciﬁ-

cations, or demand-driven approaches; (ii) Methods

based on available data, or data-driven approaches;

(iii) Mixed methods, or hybrid approaches. For ex-

ample, (Jovanovic et al., 2012) is an iterative demand-

driven method where at each iteration, the system

searches for the best data corresponding with the in-

formation required by the user in terms of dimensions

or facts. Moreover, several other have proposed sys-

tems based on hybrid approach such as (Romero and

Abello, 2010) that propose to express functional re-

quirements using SQL queries.

Relational data driven approaches deduce multi-

dimensional structures (facts and dimensions) from

conceptual (Phipps and Davis, 2002) and/or logical

models (Carme et al., 2010; Jensen et al., 2004). In

particular some works investigate automatic discover-

ing facts using some heuristics (Carme et al., 2010).

About dimensions some works propose using logical

database metadata such as foreign keys (Jensen et al.,

2004) or some heuristics.

Other works use more complex algorithm to iden-

tify dimensions hierarchies. (Nguyen and Tjoa, 2000)

propose a system to dynamically build hierarchies

based on data from Twitter (Nguyen and Tjoa, 2000).

(Messaoud et al., 2004) present a new OLAP opera-

tor named OPAC that allows to aggregate facts that

refer to complex objects, such as images. This op-

erator is based on hierarchical clustering algorithm.

(Favre et al., 2006) provide a framework for automatic

deﬁning hierarchies according to user rules. In order

to personalize the multidimensional schema, (Ben-

tayeb, 2008) propose to create new levels in a hier-

archy with the K-means algorithm. (Leonhardi et al.,

2010) propose to increase the OLAP cube exploration

functionalities by providing the user data mining al-

gorithms to analyze data. (Ceci et al., 2011) use a hi-

erarchical clustering to integrate continuous variables

as dimensions in an OLAP schema. In the same line,

(Sautot et al., 2014b) propose using Agglomerative

Clustering for designing hierarchies, and the integra-

tion in a rapid prototyping methodology is presented

in (Sautot et al., 2014a). However, all existing works

deﬁne hierarchies using only either dimensional data

(i.e. attributes of dimension members) or factual data

(i.e. measures) (see Table 1). But, in a constellation

schema, a dimension can be enriched with a hierarchy

created by using other dimensions and facts. It means

that the creation of a new hierarchy can involved a

reﬁnement of facts and dimensions in the entier con-

stellation schema. We detail this issue in the follow-

ing section, using a real application case from bird

biodiversity.

Table 1: Summary of literature review related to automatic

hierarchy building.

Data sources

Star schema Constellation

schema

Facts One di-

mension

Facts and

dimensions

Algorithm

K-means (Bentayeb,

2008)

(Bentayeb,

2008)

Hierarchical

classiﬁcation

(Ceci

et al.,

2011)

(Sautot

et al.,

2014a)

(Sautot

et al.,

2014b)

(Messaoud

et al.,

2004)

Our proposal

Other (Favre

et al.,

2006)

(Nguyen

and Tjoa,

2000)

(Leonhardi

et al.,

2010)

3 MOTIVATION

In order to describe motivation of our new DW de-

sign methodology we present in this section a real

case study concerning the bird biodiversity analysis

(Sautot et al., 2014b). This dataset has been collected

to analyze spatio-temporal changes in bird popula-

tions along the Loire River (France) and to identify

local and global environmental factors that can ex-

plain these changes. Data sources are stored in a rela-

tional database (PostGIS). Applying the data driven

algorithm proposed in (Romero and Abello, 2010)

we obtain the constellation schema depicted in Fig-

ure 1, which presents two facts as described in the

following. Abundances is one fact, and can be an-

alyzed according to three dimensions (an instance is

shown on Table 3): (i) the species dimension, which

stores species names and attributes, (ii) the time di-

mension, which corresponds to the census years and

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

548

(iii) the spatial dimension, which describes census

points along the river. Using this model decision-

maker can answer to queries like: “What is the total

of birds per year and census point?” or “What is the

total of birds per year and altitude?”. To complete

bird census, the landscape and the river are described

around each census point. Environment descriptions

are represented by another fact, which is associated to

the time dimension and the spatial dimension. With

this model, it is possible to describe census points, for

example a possible OLAP query is “What is the per-

centage of forest per census point in 2012?”.

Note that descriptions of census points that are not

dependent from time, such as altitude and geology,

are used as spatial dimension levels, while other at-

tributes are represented as measures of another fact

(e.g. percentage of forest). Unfortunately, abun-

dances for a specie have not meaning if not related

to environmental data of census points. In this sit-

uation a drill-across operation is not adequate since

it will hide the species dimension. Indeed, with the

drill-across operators facts are joined only on com-

mon dimensions. Moreover, the multidimensional

model of Figure 1 does not make possible to provide

the decision-makers with OLAP queries aggregating

abundance by classes of environmental variable (30%

of forest, 50% of water, etc.), for example “What is

the total of birds per year and group of census point

with 30% of forest?” or “What is the total of birds per

year and group of census point with 50% of water?”,

since environmental parameters do not appear as lev-

els, but as measures, prohibiting group-by queries.

Therefore, in our case study, decision-makers

need for a new design method that group census

points (dimensional data) by environmental parame-

ters (factual data) and year (dimensional data).

The multidimensional model allowing correct

OLAP analysis should be the one shown on Figure 2

(Miquel et al., 2002b). This multidimensional schema

presents only one fact and the spatial dimension is

enriched with some levels representing group of en-

vironmental parameters for each year. Indeed, envi-

ronmental parameters for census points in 2001 can

be different from ones of 2002 implying that the same

census point is not grouped in the same level on two

different years as shown on Table 3.

For example, data describing agricultural activi-

ties around the census points, are available only for

the 2002 census campaign. Therefore, it is important

to take into this different classiﬁcation when navigat-

ing on the temporal dimension during an OLAP anal-

ysis session. For example, the query “What is the to-

tal of birds in 2002 and in census points with the same

environmental parameters?” has to use the environ-

Figure 1: Bird biodiversity case study: Data-driven constel-

lation schema.

Figure 2: Bird biodiversity case study:manually driven

multi-version schema.

ment type 2002 level, and “What is the total of birds

in 2011 and in census points with the same environ-

mental parameters?” has to use the environment type

2011 level. For example an OLAP query using the en-

vironment type 2002 level and the temporal member

2011 is not coherent since it associates the number of

birds on 2011 in the past geographical-environmental

conﬁguration of 2002, leading to erroneous interpre-

tation.

Table 2: Factual data of “Environments” node.

Years Census

Points

Agencies Percent

Forest

Percent

Grass-

land

2002 1 LE2I 0.176 0.250

2002 1 ONEMA 0.356 0.261

2002 2 LE2I 0.311 0.420

2002 2 ONEMA 0.255 0.574

2011 1 LE2I 0.189 0.278

2011 1 ONEMA 0.241 0.385

2011 2 LE2I 0.322 0.568

2011 2 ONEMA 0.257 0.575

MixedDrivenRefinementDesignofMultidimensionalModelsbasedonAgglomerativeHierarchicalClustering

549

Table 3: Factual data of “Abundances” node.

Years Census

points

Species Abundance

2002 1 Yellowhammer 1.5

2002 1 Coal Tit 0.5

2002 2 Yellowhammer 1.5

2002 2 Coal Tit 0

2011 1 Yellowhammer 1

2011 1 Coal Tit 3

2011 2 Yellowhammer 1

2011 2 Coal Tit 2

4 OUR PROPOSAL

In this section we introduce our framework for the re-

ﬁnement of multidimensional in a mixed approach.

The main idea of our proposal is using an existing

data driven methodology in a ﬁrst step. Then, in our

new design step, we collect user needs about hierar-

chies that are not been deduced in the multidimen-

sional schema by means of the functional dependen-

cies. These users’ needs are expressed in the form

of facts existing in the constellation multidimensional

model. In particular, the main idea is to provide an al-

gorithm that transforms the constellation multidimen-

sional schema by eliminating a fact node and integrat-

ing factual data in an associated dimension used for

creating new levels.

To perform this algorithm, we translate the multi-

dimensional model in a multidimensional graph.

In the following section we describe the multidi-

mensional graph deﬁnitions (Section 4.1), the main

algorithm is detailed in Section 4.2 and the calcula-

tion of new versioned hierarchies is explained in Sec-

tion 4.3.

4.1 Preliminaries

In this subsection, we present some preliminary deﬁ-

nitions.

We represent a multidimensional model using a

graph.

Deﬁnition 1. Multidimensional Graph.

A multidimensional graph is a directed graph

=< D, F, A > with:

D =

{

, ..., d

}

, dimensional nodes, which repre-

sent dimensions.

F =

{

, ..., f

}

, fact nodes representing facts.

A =

{

, ..., a

}

| ∀i ∈ [[1, p]], a

= ( f

, d

), with j ∈

[[1, n]] and k ∈ [[1, m]], are arcs

, meaning that arcs are

In this paper, the notation ( f

, d

) represents the arc

only directed from a fact node to a dimensional node.

Moreover, M

contains no alone node, isolated of

another node, but can contain possibly disconnected

sets of nodes if each sub-graph must contain at least

one fact node.

Example. An example of multidimensional graph

is shown on Figure 3. “Species” dimension, “Cen-

sus points” dimension, “Years” dimension, “Abun-

dances” fact and “Environments” fact are described

in previous sections. “Sources” dimension represents

agencies, which collect data. “Budget” fact represents

the funds allowed by each agency for each year to col-

lecting data.

Figure 3: Multidimensional graph M

In our approach decision-maker want to enrich a

dimension with some new hierarchies using some fac-

tual data. That dimension is called Target dimension

Deﬁnition 2. Target Dimension.

The target dimension d

of a multidimensional

graph M

is a dimension such as:

∈ D | ∃( f

, d

), ..., ( f

, d

) with u ∈ [[2, n]]

This means that dt is associated at least to two

facts since one has to be removed and used to create

its new levels.

Example. An example of possible target dimen-

sion is the “census point” dimension (Figure 3).

Let us now formalize the fact node that is used to

create levels.

Deﬁnition 3. Source Node.

The source node of a M

with a target dimension

is a fact node f

∈

{

, ..., f

}

Example. With “census point” dimension as target

node, an example of possible source node is the fact

node “Environments”.

from fact node f

to dimensional node d

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

550

As we have said before our algorithm removes

the source node from the graph. Therefore, a part of

the structure of the graph is changed. Note that only

nodes related to the source nodes are affected. We

deﬁne this sub-graph in the following way

Deﬁnition 4. Source-target Multidimensional

Sub-graph.

Let M

a multidimensional graph with a target

dimension d

and a source node f

then the Source-

target multidimensional sub-graph M

is a multidi-

mensional graph such as: M

=< D

, F

, A

> with:

{

∈ F | ∃( f

, d

)

}

{

∈ D | ∃( f

, d

)

}



( f

, d

) | f

∈ F

, d

∈ D



contains thereby only fact nodes linked to d

and dimensional nodes linked to f

. In M

, all fact

nodes are so linked to at least one dimensional node

and all dimensional nodes are so linked to at least one

fact node. There is no isolated node in this sub-graph.

is so a well-formed multidimensional graph.

Example. An example of Source-target multidi-

mensional sub-graph using the previous example is

shown on Figure 3.

In order to formalize inputs of the agglomerative

hierarchical clustering algorithm used for the creation

of levels of the target dimension, we formalize factual

data aggregated to a set of dimensions levels using the

deﬁnition of instance fact node.

Deﬁnition 5. Instance Fact Node.

Let M

a multidimensional graph. Let m

a mem-

ber of the dimension d

. Then the instance fact node

I( f , d

, ..., d

) is the set of tuples represent-

ing facts of f aggregated to the dimensions members

, ..., d

Example. Let, Table 2 representing the instance

fact node for the node “Environments”, then Table 4

represents facts aggregated to the All member of the

“Agencies” dimension:

(I(“Environments”, “Agencies.ALL”,

“Years.1990”, “Census points.*”))

4.2 Algorithm

In this section we provide details and formalize our

approach.

Removing a fact node from the multidimensional

graph implies its redeﬁnition. Thus, the main idea

is in a ﬁrst step to work on the source-target mul-

tidimensional graph exclusively, transform this sub-

graph adding levels to the target dimension and re-

moving the source node, and then ﬁnally re-integrate

‘*’ means ‘all members of the dimension’

the new sub-graph in the rest of original multidimen-

sional graph.

Removing the source node implies to handle its

associated dimensions. It is possible to distinguish

three types of dimensions:

• The target dimension that will rest in the trans-

formed sub-graph,

• the Non Context dimensions D

, and

• the Context dimensions D

The Non context dimensions D

are dimensions that

are only associated to the source node fact. In order to

remove one dimension it is possible to provide a clas-

sical Dice operator, which consists in aggregating fact

data to the top dimension member. Let us note that

in order to avoid summarazability problems (aggre-

gation cannot be reused) (Lenz and Thalheim, 2009),

in our approach we allow using only distributive and

algebraic aggregation functions for the Dice operator.

Example. An example of Non contextual dimen-

sion is the “Agencies” node. In Table 4 is shown an

example of the Dice operator on the Agencies dimen-

sion, which is a Non contextual dimension.

Table 4: Factual data of “Environments” node aggregated

on “Agencies”.

Years Census

Points

Percent

Forest

Percent

Grass-

land

2002 1 0.266 0.256

2002 2 0.283 0.497

2011 1 0.215 0.332

2011 2 0.290 0.572

Formally,

Deﬁnition 6. Non Contextual Dimension.

Let Source-target multidimensional sub-graph

=< D

, F

, A

>, then the set of non contextual

dimension D

{

, ..., d

}

⊂ D

| ∀i ∈ [[1, v]]∃!(d

, f

) | f

∈ F

Note that in the previous formula, all dimensional

nodes in D

are only linked to f

. Indeed, all dimen-

sional nodes in M

are linked to f

and dimensional

nodes in D

are linked to one (and only one) dimen-

sional node.

The Context dimensions D

are dimensions in M

that are associated to f

and another fact node f . With

the future reﬁned graph, users analyze facts in f ac-

cording to d

. But, data used for calculating new hier-

archies in d

come from f

and are thereby dependent

of dimensions in D

. Therefore, we need to ensure

MixedDrivenRefinementDesignofMultidimensionalModelsbasedonAgglomerativeHierarchicalClustering

551

that data used to create the hierarchy are coherent with

data consulted by the user during their OLAP analy-

sis. With this in mind, we offer a system that cal-

culates hierarchies according a context, this context

deﬁning with D

Formally,

Deﬁnition 7. Contextual Dimension.

Let Source-target multidimensional sub-graph

, then the set of contextual dimension D

⊂ D

| D

= D

− (D

∪ {d

})

Example. An example of contextual dimension is

the “Years” node. On Table 3, we present data from

“Abundances” node: data are dependent of “Years”

dimensional node.

Once we have deﬁned non context and contex di-

mensions let us provide our algorithm supposing that

we have only one context dimension.

The input of this algorithm is the multidimen-

sional graph M

presented on Figure 3.

Begin of the Reﬁnement Algorithm

1. Identify the Source-target multidimensional sub-graph

2. Calculate a hierarchy for each instance of each context.

This part of the algorithm is detailed in particular in the

section 4.3.

3. Remove f

from M

4. Remove isolated nodes. The isolated nodes can be

only dimensional nodes linked to f

. Then M

is well

formed.

End of the Reﬁnement Algorithm

The output of this algorithm is a multidimensional

graph, presented on Figure 4. We note that f

has been

removed and there are new hierarchies in the “census

points” node. Moreover, M

remains a well-formed

multidimensional graph and can be also implmented

in a ROLAP architecture.

Figure 4: Reﬁned multidimensional graph M

4.3 Automatic Creation of Hierarchies

In this section we describe how the is applied to create

new levels of the target dimension.

A complete methodology to create new hierar-

chies in a multidimensional model with Hierarchi-

cal Agglomerative Clustering is presented in (Sautot

et al., 2014b). The main idea of this methodology is

to build a new hierarchy into a dimension by using

data, which describe items at the lowest level of the

hierarchy. In our case, items are census points and

description data are factual data. We suggest to use

the Hierarchical Agglomerative Clustering, due to the

similarity between the output of the Hierarchical Ag-

glomerative Clustering and a hierarchy into an OLAP

dimension (Messaoud et al., 2004).

Main steps of this algorithm are: (1) Calculation

of distances between individuals; (2) Choice of the

two nearest individuals. (3) Aggregation of the two

nearest individuals in a cluster. The cluster is consid-

ered an individual. (4) Go back to the step 1 and loop

while there is more than one individual.

In our approach the clustering (AHC) takes as in-

puts the instance of the source node f

evaluated on

each member of the context dimension and dicing it

non context dimensions.

Formally, the step 2 of our algorithm is the

following:

Begin of the Hierarchy Builder Algorithm

for each member

of d

. create a new hierarchy of d

. AHC(I( f

.ALL,... ,d

. ALL, d

.member

, d

.*))

End of the Hierarchy Builder Algorithm

An example is presented on Figure 5. We note that

two hierarchies for the spatial dimension have been

created for years 2002 and 2011.

Figure 5: Contextual hierarchies of census points.

5 VALIDATION AND

EXPERIMENTS

In this section we present the implementation our pro-

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

552

posal. A semantic and performance evaluations are

detailed in Sec 5.1 and 5.2 respectively.

The reﬁnement tool implements our algorithm us-

ing Matlab. It allows deﬁning graph using a simple

visual interface as shown on Figure 7. The consid-

ered multidimensional graph is presented on the top

part of the visual interface. On the bottom one, the

algorithm ask inputs to users in a command window.

5.1 Semantic Evaluation

In this section, we describe the added-value of our

methodology from a design point of view (i.e. does

the reﬁnement methodology corresponds to decision-

makers needs?). For that goal two we have investi-

gated two aspects: 1) Do dimensions and facts cre-

ated using our methodology correspond to decision-

makers analysis needs?; 2) Do hierarchies created us-

ing our methodology improve analysis capabilities?

Therefore have decided to compare the result of

our methodology with with one proposed in (Miquel

et al., 2002a). Indeed, (Miquel et al., 2002a) propose

a manually method to obtain a multi-version multi-

dimensional schema,and when the time dimension is

chosen as the context dimension our approach results

a multi-version multidimensional schema. The re-

sult of this validation shows that the multidimensional

schema produced with the manual methodology and

our automatic methodology are equal.

Moreover, in order to validate the semantic cor-

rectness of using AHC for hierarchies deﬁnition, we

have asked to ecologists of the project to choice be-

tween a spatial dimension with only one level, and

a spatial dimension with a hierarchy created using

AHC. When the number of created levels is not supe-

rior to 5, decision-makers prefer having hierarchies,

since they can reveal interesting pattern such as agri-

cultural proﬁles of census points. For example, data

in the “Environments” fact table contains data that de-

scribe agriculture policies around each census point

at each year. The data clustering according to these

data can classify census points and allows decision-

makers analyzing impact of agricultural practices on

bird biodiversity. For example, decision-makers can

analyze biodiversity according to agricultural forest

and grassland parameters of census points, by using

this simple OLAP query: “What is the biodiversity

value per group of census points (ﬁrst level of the hi-

erarchy obtained with clustering) in 2002 and 2003?”.

This query can reveal that for the same year, for exam-

ple 2002, biodiversity is very affected by agricultural

parameters since the aggregated biodiversity value for

each group of census point is different.

5.2 Performance Evaluation

In this section, we test time performance of our

methodology in order to validate its feasibility from

a project deployment process point of view.

In particular we study time performance related

to: 1) reﬁnement algorithm for facts and dimension

design, and 2) hierarchy creation using AHC.

In order to test the ﬁrst point, we have created a

set of 200 simulated constellation schema using from

2 to 100 dimensions, since real usable multidimen-

sional schema presents maximum between 3 and 10

dimensions (Kimball, 1996). Finally, the worst time

execution is 15.23 s. The average execution time is

equal to 11.7 s with a standard deviation equal to 1.17

s. These performances are satisfactory for are good

for an off-line design phase.

In this paragraph, we study time performances of

the AHC algorithm. In this paragraph, “classiﬁed

items” are census points (which are members of the

“census points” dimension, the target dimension) and

“attributes” are aggregated facts from the “Environ-

ments” fact node (which is the source fact node). The

AHC algorithm has been also implemented in Mat-

lab and its performance has been also tested. Us-

ing our case study data, we perform 2090 tests, with

a number of classiﬁed items (source node instances-

Enverinments facts) between 10 and 190, and a num-

ber of attributes (source node attributes-Enverinments

fact measures) between 10 and 100, and the average

calculation time is equal to 0.072 s, with a standard

deviation equal to 0.002 s. To complete our evalu-

ation, we simulate a data set with 10,000 classiﬁed

items and 150 attributes. In this case, the AHC calcu-

lates a hierarchy in 147.36 s, with a standard deviation

equal to 4.03 , with a maximal calculation time equal

to 214 s. All time performances are shown on Figure

6. This calculation time (approximately four minutes)

is efﬁcient for an off-line design phase.

Figure 6: Execution times according the number of at-

tributes and classiﬁed items.

MixedDrivenRefinementDesignofMultidimensionalModelsbasedonAgglomerativeHierarchicalClustering

553

Figure 7: Visual interface of the reﬁnement tool.

6 CONCLUSION AND FUTURE

WORK

Design data warehouses system is a complex and cru-

cial task depending on available data sources and de-

cisional requirements. Existing work do not exploit

the semantics of data to automatically create complex

hierarchies. Thus in this paper, we present a mixed

multidimensional reﬁnement methodology, that trans-

form constellation schema to deﬁne hierarchy level

using a hierarchical clustering algorithm. Our reﬁne-

ment methodology enriches a dimension with factual

data, and considers the context of factual data. We

present also the implementation of our method in a

ROLAP architecture.

We perform the proposed methodology on a real

application case from bird biodiversity. We have

noted that actual automatic multidimensional design

methodologies cannot produce a multidimensional

schema, which covers all decision-maker needs due

to the data complexity. Our methodology offers a so-

lution to enrich dimensions with factual data and, by

this way, to reﬁne the multidimensional schema.

Our ongoing work is the extension of our method-

ology to simplify and reduce the number of created

levels, using other DM algorithms such as SVM, etc.,

in order to provide decision-makers with easy OLAP

exploration analysis and its implementation in a RO-

LAP architecture.

Moreover, we are also working to integrate our

approach in the rapid prototyping methodology pro-

posed in (Sautot et al., 2014a), and extending to help

decision-makers and DW experts choose the right DM

algorithms and parameters of the reﬁnement algo-

rithm (source node, contextual dimensions, etc.). Fu-

ture work concerns the usage of the formal evaluation

framework Goal Question Metric (Briand et al., 2002)

to evaluate our methodology.

ACKNOWLEDGEMENTS

Data acquisition received ﬁnancial support from the

FEDER Loire, Etablissement Public Loire, DREAL

de Bassin Centre, the R

egion Bourgogne (PARI, Pro-

jet Agrale 5) and the French Ministry of Agriculture.

We also thank heartily Pr. John Aldo Lee, from the

Catholic University of Leuven, for his help.

REFERENCES

Bentayeb, F. (2008). K-means based approach for olap di-

mension updates. In 10th International Conference on

Enterprise Information Systems (ICEIS), pages 531–

534.

Briand, L. C., Morasca, S., and Basili, V. R. (2002). An

operational process for goal-driven deﬁnition of mea-

sures. IEEE Trans. Software Eng., 28(12):1106–1125.

Carme, A., Mazon, J.-N., and Rizzi, S. (2010). A model-

driven heuristic approach for detecting multidimen-

sional facts in relational data sources. In Pedersen,

T., Mohania, M., and Tjoa, A. M., editors, Proceed-

ings of 12th International Conference on Data Ware-

housing and Knowledge Discovery (DaWaK), volume

LNCS 6263, pages 13–24.

Ceci, M., Cuzzocrea, A., and Malerba, D. (2011). Olap

over continuous domains via density-based hierar-

chical clustering. In 15th International Conference

on Knowledge-Based and Intelligent Information and

Engineering Systems (KES 2011), volume 2, pages

559–570.

Favre, C., Bentayeb, F., and Boussaid, O. (2006). A

knowledge-driven data warehouse model for analysis

evolution. Frontiers in Artiﬁcial Intelligence and Ap-

plications, 143:271.

Jensen, M. R., Holmgren, T., and Torben (2004). Discov-

ering multidimensional structure in relational data. In

Data Warehousing and Knowledge Discovery: 6th In-

ternational Conference (DaWaK).

Jovanovic, P., Romero, O., Simitsis, A., and Abell

o, A.

(2012). Ore: An iterative approach to the design

and evolution of multi-dimensional schemas. In Pro-

ceedings of the Fifteenth International Workshop on

Data Warehousing and OLAP, DOLAP ’12, pages 1–

8, New York, NY, USA. ACM.

Kimball, R. (1996). The Data Warehouse Toolkit: Practi-

cal Techniques for Building Dimensional Data Ware-

houses. Wiley.

Lenz, H.-J. and Thalheim, B. (2009). A formal framework

of aggregation for the olap-oltp model. Journal of

Universal Computer Science, 15(1):273–303.

ICEIS2015-17thInternationalConferenceonEnterpriseInformationSystems

554

Leonhardi, B., Mitschang, B., Pulido, R., Sieb, C., and

Wurst, M. (2010). Augmenting olap exploration

with dynamic advanced analytics. In 13th Interna-

tional Conference on Extending Database Technology

(EDBT 2010).

Mahboubi, H., Ralaivao, J.-C., Loudcher, S., Boussa

ıd, O.,

Bentayeb, F., Darmont, J., et al. (2009). X-wacoda:

an xml-based approach for warehousing and analyz-

ing complex data. Data Warehousing Design and Ad-

vanced Engineering Applications: Methods for Com-

plex Construction, pages 38–54.

Messaoud, R. B., Boussaid, O., and Rabas

eda, S. (2004). A

new olap aggregation based on the ahc technique. In

DOLAP 2004, ACM Seventh International Workshop

on Data Warehousing and OLAP, pages 65–72.

Miquel, M., Bdard, Y., and Brisebois, A. (2002a). Concep-

tion d’entrepts de donnes gospatiales partir de sources

htrognes. exemple d’application en foresterie. Ing-

nieries des Systmes d’information, 7(3):89–111.

Miquel, M., B

edard, Y., Brisebois, A., Pouliot, J., Marc-

hand, P., and Brodeur, J. (2002b). Modeling multi-

dimensional spatio-temporal data werehouses in a

context of evolving speciﬁcations. International

Archives Of Photogrammetry Remote Sensing And

Spatial Information Sciences, 34(4):142–147.

Nguyen, T. B. and Tjoa, A. M. (2000). An object oriented

multidimensional data model for olap. In In Proc.

of 1st Int. Conf. on Web-Age Information Manage-

ment (WAIM), number 1846 in LNCS, pages 69–82.

Springer.

Phipps, C. and Davis, K. C. (2002). Automating data ware-

house conceptual schema design and evaluation. In

Proceedings of the 4th International Workshop on De-

sign and Management of Data Warehouses (DMDW),

volume 2.

Romero, O. and Abello, A. (2009). A survey of multidimen-

sional modeling methodologies. International Journal

of Data Warehousing and Mining, 5(2):1–23.

Romero, O. and Abello, A. (2010). Automatic validation

of requirements to support multidimensional design.

Data and Knowledge Engineering, 69:917–942.

Sautot, L., Bimonte, S., Journaux, L., and Faivre, B.

(2014a). A methodology and tool for rapid prototyp-

ing of data warehouses using data mining: Applica-

tion to birds biodiversity. In Proceedings of 4th Inter-

national Conference on Model & Data Engineering

(MEDI). In Press.

Sautot, L., Faivre, B., Journaux, L., and Molin, P. (2014b).

The hierarchical agglomerative clustering with gower

index: a methodology for automatic design of olap

cube in ecological data processing context. Ecological

Informatics. In Press.

MixedDrivenRefinementDesignofMultidimensionalModelsbasedonAgglomerativeHierarchicalClustering

555