Adapting OLAP Analysis to User’s Constraints through Semantic Hierarchies

Fadila Bentayeb and Rym Khemiri

ERIC Laboratory, University of Lyon, Lumière Lyon 2

5 av. P. Mendès-France, 69676 Bron Cedex, France

Keywords:

Analysis Level, Constrained K-means Clustering, OLAP, Personalization, PRoCK, Semantic Dimension Hierarchy, User Constraints.

Abstract:

The objective of this paper is to provide a personalized on-line aggregate operator, namely PRoCK (Personalized Rollup operator with Constrained K-means), based on data mining techniques. The use of data mining techniques, and more precisely the constrained K-means clustering method, helps to discover new grouping sets with respect to user requirements. In the context of data warehouses, PRoCK adapts dimension hierarchies to the user constraints. Indeed, applied to the instances of a given dimension hierarchy, the constrained K-means clustering method gives a new natural classification. The clustering results constitute a new, semantically richer hierarchy level, namely a personalized level, on which the user may elaborate more sophisticated OLAP analyses. PRoCK is integrated inside the Oracle RDBMS (Relational DataBase Management System), and the experiments we carried out validated the relevance of our operator.

1 INTRODUCTION

Dimension hierarchies represent a substantial part of the data warehouse model (Pedersen and Jensen, 2001). Indeed, hierarchies allow decision makers to examine data at different levels of detail with OLAP operators such as drill-down and roll-up. However, current data warehouse models usually consider OLAP dimensions as static entities, whereas in practice structural changes of dimension schemas are often necessary to adapt the multidimensional database to changing requirements.

On the other hand, even though data warehouses and OLAP are considered to be user-centric systems, the user is hardly involved in the system. In fact, only a few analysis possibilities are known at the design stage of a data warehouse, according to the identified global analysis needs of the users. However, business requirements often change over time at the client level, where some specific constraints must be satisfied. In this case, the data warehouse must be user-centric to cope with user analysis requirements.

Therefore, to improve decision support systems

and to give increasingly relevant information to the

user, the need to integrate user requirements into the

data warehouse is becoming unavoidable.

Unfortunately, OLAP does not provide automatic tools for structuring analysis axes. We thus base our research on data mining techniques that make it possible to integrate knowledge into the OLAP process and to create new relevant analysis axes by exploiting the data warehouse content. We show that combining OLAP technology with data mining techniques can provide more elaborate and more relevant analyses.

The objective of this paper is to adapt OLAP analysis to users by enriching existing hierarchies with derived semantic hierarchies. Indeed, a user may need to define semantic aggregates other than those defined in the design step of the data warehouse. To this end, we propose a personalized on-line aggregate operator called PRoCK (Personalized Rollup operator with Constrained K-means). This operator automatically creates new roll-up functions based on user preferences, using the COP K-means clustering algorithm. The user preferences are specified in the form of must-link and cannot-link constraints (Wagstaff et al., 2001).

To achieve our objective, we propose an on-line personalization process of the data warehouse schema which follows these steps. Given a hierarchical level l, PRoCK classifies its instances by using the COP K-means clustering algorithm based on the user constraints. A new hierarchical level pl is then created by applying a roll-up function which relates the instances of the level l to the instances of the level pl. The domain of the personalized level pl is composed of the k instances representing the k obtained clusters. The obtained personalized semantic hierarchies provide new multidimensional ways of analyzing data and yield more relevant, semantically richer analyses.

Bentayeb F. and Khemiri R..

Adapting OLAP Analysis to User’s Constraints through Semantic Hierarchies.

DOI: 10.5220/0004444901930200

In Proceedings of the 15th International Conference on Enterprise Information Systems (ICEIS-2013), pages 193-200

ISBN: 978-989-8565-59-4

© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

Our operator is integrated inside Oracle DBMS

where we carried out some experimentation which

validated the relevance of our approach.

The remainder of this paper is organized as fol-

lows: Section 2 presents related works and compares

our approach to existing ones in the literature. In sec-

tion 3, we present our PRoCK operator. The frame-

work of creating semantic hierarchies using PRoCK is

described in section 4. After an illustrative example

presented in section 5, we describe implementation

with some preliminary experimental results in section

6. Finally, conclusions and our expected future work

are given in section 7.

2 RELATED WORKS

Since data warehouses are characterized by voluminous data and are based on a user-centric analysis process, including personalization into the data warehousing process has become a new research issue (Rizzi, 2007). The first approaches to personalization in data warehouses focus on user definition with specific data, as defined on traditional databases; other approaches are based on the conceptual model and its multidimensional concepts (fact, dimension, hierarchy, measure, attribute). For example, using annotations, a personalization technique based on a user preference model is proposed in which weights are associated with the components of multidimensional databases (Ravat and Teste, 2008). To assign priority weights to attributes of a multidimensional schema, the personalization rules are described using the Condition-Action formalism. More recently, this model has been used for handling the notion of context in order to closely relate user requirements to their current context (Jerbi et al., 2009).

Moreover, the importance of dimension hierar-

chies was reﬂected in (Bentayeb, 2008) where the au-

thor used data mining techniques as aggregation op-

erators to update dimension hierarchies in data ware-

houses without taking into account user preferences.

Garrigós et al. use the data warehouse multidimensional model, a user model and rules for data warehouse personalization (Garrigós et al., 2009). As a result, a data warehouse user is able to work with a personalized OLAP schema which best matches his/her needs. Based on ECA (Event-Condition-Action) rules (Thalhammer et al., 2001), PRML (Personalization Rules Modeling Language) is used in (Garrigós et al., 2009) for the specification of OLAP personalization rules. The structure of such PRML rules can be presented with the following statement: when event do if condition then action endIf endWhen.

After that, in (Kozmina and Niedrite, 2010), a new

method was proposed which provides exhaustive de-

scription of interaction between user and data ware-

house. A set of user-describing proﬁles (user prefer-

ence, temporal, spatial, preferential and recommenda-

tional) have been developed. A metamodel which for-

mulates user preferences for OLAP schema elements

and aggregate functions has been proposed. This

model reﬂects connections among user-describing

proﬁles.

Recently, inspired by (Kießling and Köstler, 2002) and (Golfarelli and Rizzi, 2009), (Golfarelli et al., 2011) propose an approach to adapt preference constructors to the multidimensional context. Formulated on the schema, preferences can be expressed not only over attributes, and thus over cuboids, but also over numerical values (measures). The composition of preferences is modeled using predicate logic and expressed through Pareto composition (two preferences are equally relevant) or Prioritization (a preference is more relevant than another).

We argue that multidimensional structures such as

dimension hierarchies have a strong impact in OLAP

analysis and they should be considered in OLAP per-

sonalization. For this reason, users must be able to

express their preferences on dimension hierarchies.

In fact, preference model is considered a main open

problem in OLAP personalization in (Rizzi, 2007).

Our proposal comes close to a previous work that proposes a structural update of OLAP dimensions (Bentayeb, 2008). However, it differs in that it personalizes hierarchies by exploiting user preferences. Our method aims at improving the OLAP analysis process by taking into account the individual interests of users.

In this section we have reviewed the current approaches for personalization in data warehouses. Table 1 compares the proposed approaches according to the following criteria, which we consider relevant for comparing personalization approaches.

• Source: this criterion presents the object to exploit

for personalization which can be a user proﬁle (in-

terests, preferences, constraints,...), query history

(log ﬁle) or user context.

ICEIS2013-15thInternationalConferenceonEnterpriseInformationSystems

194

• Personalization Time: the time of personalization:

before querying, while querying or after querying.

• Personalization Object: this criterion presents the

object of the proposed method if it is a query, an

interface or a content to personalize.

• Input: this criterion presents the inputs of the pro-

posed method if it is DW schema or DW instance

or both of them.

• Output: this criterion presents the outputs of the

proposed method if it is a query, a set of tuples or

a personalized schema.

3 PERSONALIZED SEMANTIC

HIERARCHIES

Our method enriches existing hierarchies with derived semantic hierarchies based on the current user needs (user preferences).

3.1 User Preferences

The user profile is the key element of a personalization system. In this work, however, the user profile is reduced to user preferences. In the context of database systems, preference queries were first introduced in order to soften “the rigid way in which the researched data characteristics must be specified” (Lacroix and Lavency, 1987). When no object (no record) meets these characteristics, it is nevertheless possible in some applications to accept objects that match the search criteria less well. After that, several extensive investigations were carried out and two major lines emerged in the literature for expressing preferences: quantitative and qualitative approaches (Chomicki, 2003). In the qualitative approach, preferences are specified directly, whereas in the quantitative approach, preferences are expressed indirectly by using scoring functions.

In this paper, we depart from the classical definition of preferences by using user constraints. The user constraints are specified in the form of must-link (two instances must be placed together) and cannot-link (two instances must not be placed together) constraints (Wagstaff et al., 2001). These kinds of constraints are explicitly defined by the user.

3.2 Principle

Our personalization method consists in changing the structure of the data warehouse on-line by creating personalized semantic hierarchies. We enrich existing hierarchies with derived semantic hierarchies to allow the user to get his/her own personalized analysis. To achieve this purpose, we use data mining techniques whose parameters are fixed by the user in an interactive way, according to his/her own preferences, expressed as the aggregation constraints accepted by COP K-means. We selected the COP K-means clustering method in order to highlight aggregates that are semantically richer than those provided by existing hierarchies, with respect to user constraints.

In our method, user preferences are represented by

user constraints. In fact, users are asked to provide

their preferences about the obtained clusters which

may form a new granularity level in the considered

dimension hierarchy. We use a constrained cluster-

ing problem in which the user has some pre-existing

knowledge about their desired partitions. Besides

the number of clusters k, user can iteratively pro-

vide his/her constraints about how items should be

grouped in the form of must-link and cannot-link con-

straints. A must-link constraint enforces that two

instances must be placed in the same cluster while

a cannot-link constraint enforces that two instances

must not be placed in the same cluster. The user con-

straints reﬁne the clusters towards the desired data.

Our PRoCK operator automatically generates the new roll-up function by exploiting user knowledge, especially his/her hard constraints. Therefore, PRoCK provides a way to deal with the structure of the hierarchy and its data with respect to user preferences (user constraints).

To deﬁne the domain of the parent level and the

aggregation function from a child to the parent level,

our operator classiﬁes all instances of a child level

into k clusters with the cop k-means clustering al-

gorithm. Therefore, users are asked to choose cop

K-means parameters (k + constraints) following their

preferences about the obtained clusters which may

form a new granularity level in the considered dimen-

sion hierarchy.

4 FRAMEWORK FOR CREATING

PERSONALIZED SEMANTIC

HIERARCHIES

In this section, we present a declarative framework for

creating semantic hierarchies that addresses the chal-

lenges discussed earlier in the introduction. We show

the different deﬁnitions of used concepts, the cluster-

ing algorithm and the personalization algorithm.

AdaptingOLAPAnalysistoUser'sConstraintsthroughSemanticHierarchies

195

Table 1: Survey of OLAP personalization approaches.

Approaches (columns): Bellatreche et al. 2005; Bentayeb 2008; Jerbi et al. 2008, 2009; Ravat and Teste 2008; Garrigos et al. 2007; Kozmina et al. 2010; Golfarelli et al. 2009, 2011; Our approach.

Source
  User profile   × × × × ×
  Query log      × ×
  Context        ×

Time (% querying)
  Before         × × × × ×
  While          × ×
  After          × ×

Object
  Query          × × × × × ×
  Interface      ×

Input
  DW schema      × × × × ×
  DW instance    × ×

Output
  Query          × ×
  Tuples         ×
  Schema         × × ×

4.1 Basic deﬁnitions

Definition 1. Data warehouse. A data warehouse is a multidimensional database that can be defined as $\mu = (\delta, \varphi)$ where $\delta$ is a set of dimensions and $\varphi$ is a set of facts (Hurtado et al., 1999).

Definition 2. Dimension. A dimension schema is a tuple $D = (L, \preceq)$ where:

• $L$ is a finite set of levels which contains a distinguished level named $all$, such that $dom(all) = \{all\}$.

• $\preceq$ is a transitive and reflexive relation over the elements of $L$. The relation $\preceq$ contains a unique bottom level called $l_{bottom}$ and a unique top level called $all$:

$L = \{ l_{bottom}, \ldots, l, \ldots, all \mid \forall l,\ l_{bottom} \preceq l \preceq all \}$

A dimension instance is a tuple $(D, f)$ where $D$ is a dimension schema and $f$ is a set of partial functions $f_{l}^{l'}$ between instances of two adjacent hierarchical levels, such that:

$\forall l, l' \in L \mid l \preceq l',\ \exists f_{l}^{l'} \in f \mid f_{l}^{l'} : dom(l) \rightarrow dom(l')$

Definition 3. Fact. A fact schema $F$ is defined as $F = (I, M)$ where $I$ is a set of dimension identifiers and $M$ is a set of measures. A fact instance is a tuple where the set of values for each identifier is unique.

Definition 4. Cube. To create data cubes, we use the CUBE operator (Gray et al., 1996), which is defined as follows: for a given fact $F = (I = \{ I_1 \in D_1, \ldots, I_p \in D_p \}, M)$, a set of levels $GL = \{ l_1 \in D_1, \ldots, l_p \in D_p \mid I_i \preceq l_i\ \forall i = 1 \ldots p \}$ and a set of measures $m$ with $m \subset M$, the operation $CUBE(F, GL, m)$ gives a new fact $F' = (GL, m')$ where $m'$ is the result of the aggregation (with roll-up functions $f_{I_1}^{l_1}, \ldots, f_{I_p}^{l_p}$) of the set of measures $m$ from $I$ to $GL$.

4.2 Constrained K-means Clustering

Cluster analysis, an important technology in data min-

ing, is an effective method of analyzing and discov-

ering useful information from numerous data. COP

K-means algorithm groups the data into classes or

clusters with respect to user constraints. COP-K-

means is an iterative partitioning algorithm for semi-

supervised clustering introduced in (Wagstaff et al.,

2001). COP-K-means extends K-means (MacQueen,

1967) by applying constraints based on background

knowledge.

Let $\Lambda = \{\lambda_1, \ldots, \lambda_n\}$ be the given set of instances

which must be partitioned such that the number of

clusters is not given beforehand. In the context of

clustering algorithms, instance-level constraints are a

useful way to express a priori knowledge that con-

strains a placement of instances into clusters. In gen-

eral, constraints may be derived from partially labeled

data or from background knowledge about the domain

of real data set. We consider the clustering problem of

the data set Λ under the following types of constraints.

• Must-Link constraints, denoted by $ML(\lambda_i, \lambda_j)$, indicate that two instances $\lambda_i$ and $\lambda_j$ must be in the same cluster.

• Cannot-Link constraints, denoted by $CL(\lambda_i, \lambda_j)$, indicate that two instances $\lambda_i$ and $\lambda_j$ must not be in the same cluster.

• Transitively derived Instance-Level constraints:

– $ML(\lambda_i, \lambda_j)$ and $ML(\lambda_j, \lambda_k)$ imply $ML(\lambda_i, \lambda_k)$;

– $ML(\lambda_i, \lambda_j)$, $ML(\lambda_k, \lambda_l)$ and $CL(\lambda_i, \lambda_k)$ imply both $CL(\lambda_i, \lambda_l)$ and $CL(\lambda_j, \lambda_k)$.
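The transitive derivation of constraints listed above can be illustrated with a minimal Python sketch. It is not part of the PRoCK implementation: the function name `close_constraints` and the union-find representation are assumptions made for illustration only. Must-link pairs are merged into groups, and every cannot-link pair is then propagated to all members of the two groups it connects.

```python
from itertools import combinations, product


def close_constraints(must_link, cannot_link):
    """Return the transitive closure (ml, cl) of instance-level constraints.

    Hypothetical helper: pairs are returned as frozensets of two instances.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # ML(a, b) and ML(b, c) imply ML(a, c): merge must-linked instances.
    for a, b in must_link:
        parent[find(a)] = find(b)
    for a, b in cannot_link:
        find(a)
        find(b)

    # Group instances by must-link component.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)

    ml = {frozenset(p) for g in groups.values()
          for p in combinations(sorted(g), 2)}
    cl = set()
    for a, b in cannot_link:
        ga, gb = groups[find(a)], groups[find(b)]
        if ga is gb:
            raise ValueError(f"inconsistent constraints on {a!r} and {b!r}")
        # A cannot-link between two groups holds for every cross pair.
        cl |= {frozenset(p) for p in product(ga, gb)}
    return ml, cl


ml, cl = close_constraints(
    must_link=[("Morocco", "Egypt"), ("Morocco", "Tunisia"), ("Morocco", "Algeria")],
    cannot_link=[("Morocco", "Nigeria")],
)
print(len(ml), len(cl))
```

With the constraints of the illustrative example of section 5, the single cannot-link constraint on Morocco is propagated to every country that is must-linked with it.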

We selected the COP K-means method because we want to exploit user knowledge, especially his/her hard constraints about the desired partitions. The Constrained K-means Algorithm is as follows:

COP-KMEANS (dataset $D$, number of clusters $k$, must-link constraints $Con_{=} \subset D \times D$, cannot-link constraints $Con_{\neq} \subset D \times D$)

1. Let $C_1 \ldots C_k$ be the $k$ initial cluster centers.

2. For each point $d_i \in D$, assign it to the closest cluster $C_j$ such that VIOLATE-CONSTRAINTS ($d_i$, $C_j$, $Con_{=}$, $Con_{\neq}$) is false. If no such cluster exists, fail (return {}).

3. For each cluster $C_i$, update its center by averaging all of the points $d_j$ that have been assigned to it.

4. Iterate between (2) and (3) until convergence.

5. Return $\{C_1 \ldots C_k\}$.

VIOLATE-CONSTRAINTS (data point $d$, cluster $C$, must-link constraints $Con_{=} \subset D \times D$, cannot-link constraints $Con_{\neq} \subset D \times D$)

1. For each $(d, d_{=}) \in Con_{=}$: if $d_{=} \notin C$, return true.

2. For each $(d, d_{\neq}) \in Con_{\neq}$: if $d_{\neq} \in C$, return true.

3. Otherwise, return false.
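To make the two procedures above concrete, the following is a minimal Python sketch of COP K-means. It is an illustration under stated assumptions, not the PL/SQL implementation described in section 6: the names `cop_kmeans` and `violates` are hypothetical, points are referred to by their index, the first k points seed the centers, and a constraint is only checked against points already assigned in the current pass.

```python
import math


def violates(i, cluster, assign, must_link, cannot_link):
    """True if placing point i in `cluster` breaks a constraint
    with a point that is already assigned in the current pass."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assign.get(other) == cluster:
            return True
    return False


def cop_kmeans(points, k, must_link=(), cannot_link=(), max_iter=100):
    """Return one cluster index per point, or None if no admissible
    cluster exists for some point (the 'fail' case of the algorithm)."""
    # Assumption: the first k points are used as initial centers.
    centers = [list(p) for p in points[:k]]
    previous = None
    for _ in range(max_iter):
        assign = {}
        for i, p in enumerate(points):
            # Try clusters from the nearest center to the farthest.
            order = sorted(range(k), key=lambda c: math.dist(p, centers[c]))
            for c in order:
                if not violates(i, c, assign, must_link, cannot_link):
                    assign[i] = c
                    break
            else:
                return None
        labels = [assign[i] for i in range(len(points))]
        if labels == previous:  # convergence: assignments are stable
            break
        previous = labels
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(cop_kmeans(pts, 2))
print(cop_kmeans(pts, 2, cannot_link=[(0, 1)]))
```

As in the original algorithm, the result depends on the order of the points and on the initial centers, and the procedure may fail even when a feasible partition exists.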

4.3 Formalization

The COP K-means method enables us to classify in-

stances of a level l on its own attributes. We exploit

then the COP K-means clustering results to create a

new level pl and a roll-up function which relates in-

stances of the child level l with the domain of the par-

ent level pl with respect to user constraints.

Dimension Projection. The DimProjection operator allows a projection of a dimension $D$ from a hierarchical level $l_k$. Thus, the level $l_k$ becomes the finest hierarchical level of the new dimension $D'$, which summarizes $D$ at the level of detail $l_k$. Assume a dimension $D = (L, \preceq, f)$ and a hierarchical level $l_k \in L$; $DimProjection(D, l_k)$ is a new dimension $D' = (L', \preceq', f')$ such that:

• $L' = L \setminus \{ l_i \mid \forall l_i \in L,\ l_i \prec l_k \}$,

• $\preceq' = \preceq \setminus \{ (l_{bottom} \rightarrow l_1), \ldots, (l_{k-1} \rightarrow l_k) \}$,

• $f' = f \setminus \{ f_{l_{bottom}}^{l_1}, \ldots, f_{l_{k-1}}^{l_k} \}$.
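The projection can be sketched in a few lines of Python, under the simplifying assumption that the hierarchy is a single chain of levels ordered from the finest level to all. The function name `dim_projection` and the Day, Month, Year levels are hypothetical and serve only as an illustration.

```python
def dim_projection(chain, level):
    """Project a linear hierarchy on `level`.

    chain: list of level names ordered from the finest level to 'all'
    (assumption: the dimension has a single aggregation path).
    Returns the remaining levels and the remaining (lower, upper) edges.
    """
    k = chain.index(level)          # raises ValueError if level is unknown
    new_chain = chain[k:]           # drop every level finer than `level`
    new_edges = list(zip(new_chain, new_chain[1:]))
    return new_chain, new_edges


levels, edges = dim_projection(["Day", "Month", "Year", "all"], "Month")
print(levels)
print(edges)
```

The levels below the projection level disappear together with the roll-up edges that start from them, which mirrors the three set differences of the definition.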

Roll-up with Constrained K-means operator. In our case, the $f_{l}^{pl}$ function is provided by our operator “PRoCK” (Roll-up with Constrained K-means). Assume a positive integer $k$, a population $\Lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$ composed of $n$ instances, a set of $k$ classes $C = \{C_1, \ldots, C_k\}$ and a set of constraints $Cons = Cons_{=} \cup Cons_{\neq}$. By using the COP K-means algorithm described in section 4.2, $PRoCK(\Lambda, k, Cons)$ calculates the set $C = \{c_1, \ldots, c_k \mid \forall i = 1..k,\ c_i = barycenter(C_i)\}$ and returns the roll-up function:

$f_{l}^{pl} = \{ (\lambda_i \rightarrow C_j) \mid \forall i = 1..n \text{ and } \forall m = 1..k,\ dist(\lambda_i, c_j) \leq dist(\lambda_i, c_m) \text{ and violate-constraints}(\lambda_i, C_j, Cons_{=}, Cons_{\neq}) = False \}$.

Insert a personalized level in a dimension. The operator PRoCK creates a new level $pl$, to which a pre-existing level $l$ rolls up. A function $f$ must be defined from the instance set of $l$ to the domain of $pl$. We can summarize the formal definition of this operator as follows: given a dimension $D = (L = \{l_{bottom}, \ldots, l, \ldots, all\}, \preceq)$, two levels $l \in L$ and $pl \notin L$, and a function $f_{l}^{pl} : instanceSet(l) \rightarrow dom(pl)$, $PRoCK(D, l, pl, f_{l}^{pl})$ is a new dimension $D' = (L', \preceq')$ where $L' = L \cup \{pl\}$ and $\preceq' = \preceq \cup \{(l \rightarrow pl), (pl \rightarrow all)\}$ according to the roll-up function $f_{l}^{pl}$.
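The insertion of a personalized level can be sketched as follows in Python. This is only an illustration of the set-based definition above: the function name `insert_personalized_level` and the representation of a dimension as a set of level names plus a set of (lower, upper) edges are assumptions, and the top level is assumed to be named "all".

```python
def insert_personalized_level(levels, edges, child, new_level, rollup):
    """Sketch of PRoCK(D, l, pl, f): add `new_level` above `child`.

    levels: set of level names; edges: set of (lower, upper) pairs;
    rollup: dict mapping each instance of `child` to an instance of
    `new_level` (for example, a cluster label).
    """
    if child not in levels:
        raise ValueError("the child level must belong to the dimension")
    if new_level in levels:
        raise ValueError("the personalized level must not exist yet")
    new_levels = levels | {new_level}                       # L' = L ∪ {pl}
    new_edges = edges | {(child, new_level), (new_level, "all")}
    domain = set(rollup.values())                           # dom(pl)
    return new_levels, new_edges, domain


levels = {"Country", "Continent", "all"}
edges = {("Country", "Continent"), ("Continent", "all")}
rollup = {"Tunisia": "C1", "Kenya": "C2", "Nigeria": "C3"}  # partial, for illustration
print(insert_personalized_level(levels, edges, "Country", "CountryGroup", rollup))
```

The existing path through Continent is left untouched: the new level simply opens a second aggregation path from the child level to all.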

Our personalization approach is then original

since the new roll-up function is generated automati-

cally. It is more than a conceptual operator and pro-

vides a way to deal not only with the structure of the

hierarchy, but also with the data of this hierarchy.

4.4 Algorithm

We present in the following the input parameters and the different steps of the personalization algorithm. The first step of our algorithm consists in generating a learning set $\Lambda$ from the instances of the pre-existing analysis level $l$. We consider a variable called dataSource. If the value of this variable is ‘Dimension’, the population $\Lambda$ is described by a subset of the attributes of the dimension $D$ chosen by the user. Otherwise, $\Lambda$ is generated by executing the operation $CUBE(F, GL, m)$, whose parameters are also fixed by the user. Then, the algorithm applies the COP K-means method to the learning set $\Lambda$ with respect to the defined constraints Cons, which specify, for every partitioning, which pairs of instances are subject to must-link or cannot-link constraints. Finally, once the user has validated the result, our algorithm implements the new analysis level $pl$ in the data warehouse model. To do this, it applies the PRoCK operator to the dimension $D$, from the level $l$, by using the roll-up function $f_{l}^{pl}$ generated during the previous step.

Algorithm 1: How to create a semantic hierarchy level.

Input:

• A dimension $D = (L, \preceq)$, a level $l \in L$ and a set of measures $m \subset M$ (if required)

• A level name $pl \notin L$

• A positive integer $k \geq 2$, which will be the number of modalities of $pl$

• Constraints Cons in the form of must-link or cannot-link constraints

• A variable dataSource that can take the value ‘Fact’ or ‘Dimension’

Output: Personalized hierarchy

Λ ← ∅
PersDim ← ∅
if dataSource = ‘Dimension’ then
    Λ ← DimProjection(D, l)
else
    if dataSource = ‘Fact’ then
        Λ ← CUBE(F, GL, m)
    end
end
$f_{l}^{pl}$ ← COP K-means(Λ, k, Cons)
PersDim ← PRoCK(D, l, pl, $f_{l}^{pl}$)
return PersDim

5 ILLUSTRATIVE EXAMPLE

To illustrate our method, we present an analysis of the impact of the Internet, which constitutes a development indicator. Indeed, the Internet is a new vector of development and trade, and we can measure its impact in each country by relating the number of Internet users to the population.

Table 2 gives the number of Internet users and the population of 9 African countries. Statistics vary from country to country, and Nigeria occupies a rather exceptional place as the most populous country in Africa.

Assume the user's analysis objective is to know whether a country is developed or not through the impact of Internet use. To answer this question, he/she will explore the use of the Internet across the “Country” dimension, whose current hierarchy is organized as in Figure 1.

https://www.cia.gov/library/publications/the-world-factbook/rankorder/2153rank.html

Table 2: Internet users in Africa.

Country         Users (2009)    Population
Tunisia         3 500 000       10 589 025
Zimbabwe        1 423 000       11 651 858
Uganda          3 200 000       31 367 972
Morocco         13 213 000      31 671 474
Algeria         4 700 000       36 057 838
Kenya           3 996 000       39 002 772
South Africa    4 420 000       49 052 489
Egypt           20 136 000      84 474 000
Nigeria         43 989 000      149 283 240

Figure 1: Schema of Country dimension (Country → Continent → All).

For a more focused analysis, the user may then feel the need to add a new level of analysis, CountryGroup, which groups countries according to the rate of Internet use. To achieve this goal, our idea consists in extracting knowledge automatically from the data warehouse content to provide possibly relevant clusters of countries. In this case, it is interesting to directly describe each country by the two following attributes: population and number of Internet users. Our PRoCK operator is then in charge of grouping countries according to this new information and creating a new granularity level CountryGroup for further, more elaborate OLAP queries.
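As a worked illustration of the rate of Internet use, the following Python sketch relates the number of users to the population for the nine countries of Table 2. The variable names are chosen for illustration; the figures are those of Table 2, with country names written in English.

```python
# country: (Internet users in 2009, population), as given in Table 2
table2 = {
    "Tunisia": (3_500_000, 10_589_025),
    "Zimbabwe": (1_423_000, 11_651_858),
    "Uganda": (3_200_000, 31_367_972),
    "Morocco": (13_213_000, 31_671_474),
    "Algeria": (4_700_000, 36_057_838),
    "Kenya": (3_996_000, 39_002_772),
    "South Africa": (4_420_000, 49_052_489),
    "Egypt": (20_136_000, 84_474_000),
    "Nigeria": (43_989_000, 149_283_240),
}

# Rate of Internet use = users / population.
rates = {country: users / pop for country, (users, pop) in table2.items()}

# List the countries from the highest rate to the lowest.
for country in sorted(rates, key=rates.get, reverse=True):
    print(f"{country:<13} {rates[country]:.1%}")
```

Nigeria has by far the largest number of users, yet Morocco has the highest rate, which shows why the two attributes describe a country better together than separately.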

However, each user may want a specific clustering of the data. In this case, the best way to find the personalized clustering for each user is to incorporate his/her preferences. As discussed earlier, in our method, the user's preferences are presented as must-link and cannot-link constraints between pairs of data instances. Our operator can then invoke the COP K-means clustering method to group countries automatically. We present hereafter three application scenarios.

5.1 Scenario 1

Let $\Lambda$ = {Tunisia, Kenya, Zimbabwe, Algeria, Uganda, Morocco, Egypt, Nigeria, South Africa}. By fixing $k = 3$ and without applying any constraints, we obtain the clusters $C_1$, $C_2$ and $C_3$ illustrated in Table 3.

Table 3: Internet users with clusters.

Country         Users (2009)    Population      Cluster
Tunisia         3 500 000       10 589 025      C
Zimbabwe        1 423 000       11 651 858      C
Uganda          3 200 000       31 367 972      C
Morocco         13 213 000      31 671 474      C
Algeria         4 700 000       36 057 838      C
Kenya           3 996 000       39 002 772      C
South Africa    4 420 000       49 052 489      C
Egypt           20 136 000      84 474 000      C
Nigeria         43 989 000      149 283 240     C

5.2 Scenario 2

The user may want to find the North African countries Egypt, Morocco, Tunisia and Algeria in the same group. Thus, he/she can introduce the following constraints:

• ML(Morocco, Egypt)

• ML(Morocco, Tunisia)

• ML(Morocco, Algeria)

Therefore, $\Lambda$ = {Tunisia, Kenya, Zimbabwe, Algeria, Uganda, Morocco, Egypt, Nigeria, South Africa}, $Cons_{=}$ = {(Morocco, Egypt), (Morocco, Tunisia), (Morocco, Algeria)} and $Cons_{\neq} = \emptyset$.

$PRoCK(\Lambda, 3, Cons)$ returns the set $C$ = {$c_1$ = {Egypt, Morocco, Algeria, Nigeria, Tunisia}, $c_2$ = {Uganda, South Africa, Kenya}, $c_3$ = {Zimbabwe}}.

Rollup function $f_{Country}^{CountryGroup}$ = {(Morocco → $C_1$), (Algeria → $C_1$), (Nigeria → $C_1$), (Tunisia → $C_1$), (Egypt → $C_1$), (Uganda → $C_2$), (Kenya → $C_2$), (South Africa → $C_2$), (Zimbabwe → $C_3$)}.

5.3 Scenario 3

We can see that Nigeria is placed in the cluster $C_1$, which is unrealistic given the atypical character of this country. Thus, we can introduce the following constraint: cannot-link(Morocco, Nigeria). By applying these constraints, we obtain the desired result.

Let $\Lambda$ = {Tunisia, Kenya, Zimbabwe, Algeria, Uganda, Morocco, Egypt, Nigeria, South Africa}, $Cons_{=}$ = {(Morocco, Egypt), (Morocco, Tunisia), (Morocco, Algeria)} and $Cons_{\neq}$ = {(Morocco, Nigeria)}.

$PRoCK(\Lambda, 3, Cons)$ returns the set $C$ = {$c_1$ = {Egypt, Morocco, Algeria, Tunisia}, $c_2$ = {Uganda, South Africa, Kenya, Zimbabwe}, $c_3$ = {Nigeria}}.

Rollup function $f_{Country}^{CountryGroup}$ = {(Morocco → $C_1$), (Algeria → $C_1$), (Tunisia → $C_1$), (Egypt → $C_1$), (Uganda → $C_2$), (Zimbabwe → $C_2$), (Kenya → $C_2$), (South Africa → $C_2$), (Nigeria → $C_3$)}.
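The effect of the added constraint can be checked mechanically. The short Python sketch below (the helper name `satisfies` is an assumption made for illustration) tests the partitions reported for scenarios 2 and 3 against the must-link and cannot-link constraints of scenario 3.

```python
def satisfies(clusters, must_link, cannot_link):
    """True if a partition (list of sets) respects every constraint."""
    label = {x: i for i, members in enumerate(clusters) for x in members}
    same = all(label[a] == label[b] for a, b in must_link)
    apart = all(label[a] != label[b] for a, b in cannot_link)
    return same and apart


ml = [("Morocco", "Egypt"), ("Morocco", "Tunisia"), ("Morocco", "Algeria")]
cl = [("Morocco", "Nigeria")]

scenario2 = [
    {"Egypt", "Morocco", "Algeria", "Nigeria", "Tunisia"},
    {"Uganda", "South Africa", "Kenya"},
    {"Zimbabwe"},
]
scenario3 = [
    {"Egypt", "Morocco", "Algeria", "Tunisia"},
    {"Uganda", "South Africa", "Kenya", "Zimbabwe"},
    {"Nigeria"},
]

print(satisfies(scenario2, ml, []))   # scenario 2 under its own constraints
print(satisfies(scenario2, ml, cl))   # scenario 2 under the added cannot-link
print(satisfies(scenario3, ml, cl))   # scenario 3 under all constraints
```

The partition of scenario 2 respects the must-link constraints but not the new cannot-link constraint, whereas the partition of scenario 3 respects both.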

5.4 Discussion

At the end of the classification, our algorithm creates a new analysis level CountryGroup in the Country dimension, as in Figure 2.

To materialize the new “CountryGroup” level in the “Country” dimension, our algorithm applies the operator PRoCK(Country, Country, CountryGroup, $f_{Country}^{CountryGroup}$).

Therefore, PRoCK provides a way to deal with the structure of the hierarchy and its data with respect to user constraints. In fact, it automatically generates the new roll-up function based on user constraints. Based on this new personalized semantic level, the user obtains a personalized hierarchy, and hence a personalized dimension, that allows him/her to directly target personalized analyses.

Figure 2: Enriched schema of Country dimension (levels: Country, Continent, CountryGroup, All).
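Once the CountryGroup level exists, a roll-up simply aggregates the measures of the countries through the roll-up function. The following Python sketch illustrates this with the figures of Table 2 and the roll-up function of scenario 3; the function name `roll_up` and the use of a sum as aggregation function are assumptions made for illustration.

```python
# country: (Internet users in 2009, population), as given in Table 2
table2 = {
    "Tunisia": (3_500_000, 10_589_025),
    "Zimbabwe": (1_423_000, 11_651_858),
    "Uganda": (3_200_000, 31_367_972),
    "Morocco": (13_213_000, 31_671_474),
    "Algeria": (4_700_000, 36_057_838),
    "Kenya": (3_996_000, 39_002_772),
    "South Africa": (4_420_000, 49_052_489),
    "Egypt": (20_136_000, 84_474_000),
    "Nigeria": (43_989_000, 149_283_240),
}

# Roll-up function from Country to CountryGroup obtained in scenario 3.
rollup = {
    "Morocco": "C1", "Algeria": "C1", "Tunisia": "C1", "Egypt": "C1",
    "Uganda": "C2", "Zimbabwe": "C2", "Kenya": "C2", "South Africa": "C2",
    "Nigeria": "C3",
}


def roll_up(facts, rollup):
    """Sum users and population of each country into its CountryGroup."""
    totals = {}
    for country, (users, population) in facts.items():
        group = rollup[country]
        u, p = totals.get(group, (0, 0))
        totals[group] = (u + users, p + population)
    return totals


for group, (users, population) in sorted(roll_up(table2, rollup).items()):
    print(group, users, population, f"{users / population:.1%}")
```

The same aggregation could be expressed at the Continent level; the personalized level only adds a second, user-defined path over the same base data.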

6 IMPLEMENTATION

AND EXPERIMENTS

We developed our approach within the Oracle 11g RDBMS. We implemented the K-prototypes algorithm by using PL/SQL stored procedures. K-prototypes is a variant of the K-means method that allows the clustering of large datasets with mixed numeric and categorical values.

In order to assess the relevance of the results of the PRoCK operator on real data, our tests were conducted with the “Foodmart” data warehouse, where sales are represented as a fact table, namely “Sales fact”, and the axes of analysis are represented as dimension tables, namely Product, Promotion, Time, Store and Customer. We considered the following test scenario: create an axis of analysis which classifies the 1560 products according to their weight into 3 clusters with respect to user constraints. Let us consider a marketing manager who wants products of the same brand to be grouped together, but does not want products with recyclable packages to be grouped with products with non-recyclable ones. One way to find the best clustering for this user is to incorporate must-link and cannot-link constraints between pairs of data instances of the Product dimension in order to obtain a personalized analysis level ProductGroup. The results of our test are shown in Figure 3.

As a consequence, the marketing manager may have personalized analysis possibilities over the semantic hierarchy of the dimension Product, and especially on the personalized level ProductGroup.

Figure 3: Test results.

Personalized Analysis level: ProductGroup (columns: Class, Range, Average weight)

Range            Average weight
[1.3 – 4.8]      15.11
[5.9 – 10.1]     14.98
[10.8 – 20.2]    16.66

Analysis level: Product (columns: Product_name, Net_weight, Recyclable_package, Cluster)

Product_name                   Net_weight
Washington Berry Juice         6,39
Washington Mango Drink         4,42
Washington Strawberry Drink    11,1
Washington Cream Soda          9,6
Washington Diet Soda           4,65
Washington Cola                13,8
Washington Diet Cola           …

7 CONCLUSIONS AND FUTURE

WORKS

In this paper, we defined a personalized aggregation operator, PRoCK, which changes the data warehouse structure on-line by enriching existing hierarchies with derived semantic hierarchies. Thus, the user may have new analysis possibilities over the semantic hierarchies, especially the new aggregation levels. To define the domain of the new level and the aggregation function from an existing level to the personalized level, our PRoCK operator classifies all instances of an existing level into k clusters according to user constraints with the COP K-means clustering algorithm.

Finally, let us point out that as such operators mature, there are additional research issues that need to be pursued. To provide users with only relevant data from the huge amount of available information, personalization systems use preferences to allow users to express their interest in specific data. Most often, user preferences vary depending on the circumstances. For instance, decision maker requirements can change from one context to another. As a consequence, we are currently considering support for constraints that depend on the user context.

REFERENCES

Bentayeb, F. (2008). K-means based approach for olap di-

mension updates. In ICEIS (1), pages 531–534.

Chomicki, J. (2003). Preference formulas in relational

queries. ACM Trans. Database Syst., 28(4):427–466.

Garrigós, I., Pardillo, J., Mazón, J.-N., and Trujillo, J. (2009). A conceptual modeling approach for olap personalization. In ER, pages 401–414.

Golfarelli, M. and Rizzi, S. (2009). Expressing olap prefer-

ences. In SSDBM, pages 83–91.

Golfarelli, M., Rizzi, S., and Biondi, P. (2011). myolap:

An approach to express and evaluate olap preferences.

IEEE Trans. Knowl. Data Eng., 23(7):1050–1064.

Gray, J., Bosworth, A., Layman, A., and Pirahesh, H.

(1996). Data cube: A relational aggregation opera-

tor generalizing group-by, cross-tab, and sub-total. In

ICDE, pages 152–159.

Hurtado, C. A., Mendelzon, A. O., and Vaisman, A. A.

(1999). Maintaining data cubes under dimension up-

dates. In ICDE, pages 346–355.

Jerbi, H., Ravat, F., Teste, O., and Zurfluh, G. (2009). Modèle de préférences contextuelles pour les analyses olap. In EGC, pages 253–258.

Kießling, W. and Köstler, G. (2002). Preference sql - design, implementation, experiences. In VLDB, pages 990–1001.

Kozmina, N. and Niedrite, L. (2010). Olap personalization

with user-describing proﬁles. In BIR, pages 188–202.

Lacroix, M. and Lavency, P. (1987). Preferences; putting

more knowledge into queries. In VLDB, pages 217–

225.

MacQueen, J. B. (1967). Some methods for classiﬁcation

and analysis of multivariate observations. In Cam, L.

M. L. and Neyman, J., editors, Proc. of the ﬁfth Berke-

ley Symposium on Mathematical Statistics and Prob-

ability, volume 1, pages 281–297. University of Cali-

fornia Press.

Pedersen, T. B. and Jensen, C. S. (2001). Multidimensional

database technology. IEEE Computer, 34(12):40–46.

Ravat, F. and Teste, O. (2008). Personalization and OLAP Databases. In New Trends in Data Warehousing and Data Analysis, chapter 4, pages 1–22.

Rizzi, S. (2007). Olap preferences: a research agenda. In

DOLAP, pages 99–100.

Thalhammer, T., Schreﬂ, M., and Mohania, M. K. (2001).

Active data warehouses: complementing olap with

analysis rules. Data Knowl. Eng., 39(3):241–269.

Wagstaff, K., Cardie, C., Rogers, S., and Schrödl, S. (2001). Constrained k-means clustering with background knowledge. In ICML, pages 577–584.

ICEIS2013-15thInternationalConferenceonEnterpriseInformationSystems

200