ONTOLOGICAL WAREHOUSING ON SEMANTICALLY INDEXED DATA

Reusing Semantic Search Engine Ontologies to Develop Multidimensional Schemas

Filippo Sciarrone and Paolo Starace

Business Intelligence Division, Open Informatica srl, Via dei Castelli Romani 12/A, Pomezia, Italy

Keywords: Business Intelligence, Data Warehousing, OLAP.

Abstract:

In this article we present a first experimental Business Intelligence solution for dynamically developing multidimensional OLAP schemas by reusing ontologies that are stored in concept and relation dictionaries and used by semantic indexing engines. The distinctive aspect of the proposed solution is the integration of ontology-based semantic indexing techniques for unstructured documents with techniques for the dynamic management of unbalanced hierarchies in a Data Warehouse. As a case study, we embedded our solution into a real system built for the analysis and management of experts' curricula in an e-government environment. We show how OLAP dimensions can be built automatically by inheriting the hierarchical structure of ontologies, with the goal of using the semantically indexed data to carry out multidimensional OLAP analyses. The first experimental results are encouraging.

1 MOTIVATIONS AND GOALS

In this article we present a solution for the dynamic management of ontologies in a BI environment. The main feature of our proposal is a process that dynamically extracts concepts from a semantically indexed database. Our approach extends traditional On-Line Analytical Processing (OLAP) analysis so that dimensions can be designed and built over ontologies. The basic idea of our work is a new methodology aimed at reusing predefined ontologies in a concept-based dictionary to develop multidimensional OLAP schemas. Dimensions are obtained from the structure of the ontologies dynamically, namely by defining only the root level of the ontology and letting the system build the cube dimension automatically. Other studies in this field have focused on different aspects of this problem. Some aim at extracting schemas without human involvement. In this context, ontologies are sometimes used to describe the application domain (Simitsis et al., 2008; Skoutas and Simitsis, 2006), to generate mediators (Critchlow et al., 1998) and to semantically describe data sources (Toivonen and Niemi, 2004), in order to support and automate the definition of Extract-Transform-Load (ETL) processes. In all such cases, ontologies are used at a lower level of the application architecture than in our approach.

The paper is organized as follows: Section 2 describes the Ontology Processing process, that is, the characteristics of the integrated processes on which this proposal is based, ranging from the working hypotheses to the treatment of bridge tables. Section 3 describes the case study through which our proposal was tested. Finally, Section 4 presents the conclusions and outlines future work.

2 ONTOLOGY PROCESSING

In this section we present the main characteristics of the integrated processes that support the treatment and management of ontologies.

To try out our approach in a real case, we adapted our solution to a pre-existing system that executes a two-step indexing process of unstructured documents across the three layers illustrated in Figure 1. During the first step, from the unstructured docs layer to the terms set layer, the engine, based on the API of the open-source search engine Lucene (http://lucene.apache.org), indexed each document, thus obtaining a set of index terms. In the second step, from the terms set layer to the ontologies layer, these terms were contextualized and associated with the concepts of predefined ontologies.

Sciarrone F. and Starace P. (2009). ONTOLOGICAL WAREHOUSING ON SEMANTICALLY INDEXED DATA - Reusing Semantic Search Engine Ontologies to Develop Multidimensional Schemas. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval, pages 315-318. DOI: 10.5220/0002307103150318. SciTePress.

Figure 1: The Two-Step Indexing Process.
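The two steps can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the tokenizer stands in for the Lucene-based indexer, the contextualization of terms is not modelled, and the dictionary `CONCEPT_TERMS` with its concept names and terms is hypothetical.

```python
import re

# Hypothetical concept dictionary: each concept lists the index terms that evoke it.
CONCEPT_TERMS = {
    "programmer": {"java", "python", "developer"},
    "database administrator": {"oracle", "dba", "sql"},
}

def extract_terms(text):
    """Step 1 (stand-in for the Lucene-based indexer): lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def associate_concepts(terms, concept_terms=CONCEPT_TERMS):
    """Step 2: keep each concept whose term set overlaps the document terms."""
    hits = {}
    for concept, vocabulary in concept_terms.items():
        matched = terms & vocabulary
        if matched:
            hits[concept] = matched
    return hits

cv = "Senior Java developer with Oracle SQL experience"
concepts = associate_concepts(extract_terms(cv))
```

Here `concepts` links the sample curriculum to both hypothetical concepts, each with the terms that triggered the association.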

To integrate our module with this system, the following assumptions were made:

• The concepts included in the dictionary were linked exclusively by hypernymy and hyponymy relations;

• Each ontology was based on a hierarchical structure.

These restrictions limit the experiment to this context, although the approach could also apply to other real cases.

In this setting, an ontology is therefore regarded as a hierarchy of non-uniform levels whose number may change and, above all, is not known beforehand. In developing our solution, it was deemed crucial that the ontological tree be extracted dynamically, so that the use of concepts as dimensions is flexible and easily adaptable. Representing an arbitrary and irregular hierarchy is an intrinsically hard task in a relational environment. The adopted solution introduces a bridge table between the concept dimension and the fact table. In the literature, Kimball suggests this method for managing dimensions that recursively refer to records in the same table (Kimball et al., 1998; Kimball and Ross, 2002). The goal of the bridge table is to help the OLAP engine aggregate data more quickly.
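As an illustration of the bridge-table idea, the following Python sketch derives one row per ancestor/descendant pair from a parent-child table. It is a sketch under assumptions, not the system's implementation: the concept names in `PARENT_OF` and the `(ancestor, descendant, depth)` row layout are hypothetical, and the input is assumed to be a tree (no cycles).

```python
def build_bridge(parent_of):
    """Return (ancestor, descendant, depth) rows for every ancestor/descendant
    pair, including each concept paired with itself at depth 0."""
    rows = []
    for concept in parent_of:
        ancestor, depth = concept, 0
        while ancestor is not None:
            rows.append((ancestor, concept, depth))
            ancestor = parent_of.get(ancestor)  # None once the root is passed
            depth += 1
    return rows

# Hypothetical unbalanced hierarchy: child concept -> parent concept.
PARENT_OF = {
    "IT expert": None,
    "programmer": "IT expert",
    "web programmer": "programmer",
    "system administrator": "IT expert",
}

bridge = build_bridge(PARENT_OF)
```

With such rows, aggregating at any concept reduces to a single join on the ancestor column, regardless of how deep the branch below it is.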

Dynamic dimensions are generated starting from the tree's root concept. This is a central idea of the project we are presenting. The goal is to relieve the user of the manual definition of hierarchies, placing him/her at a higher level in the application architecture. As a result, the user need not know the logical structure of the hierarchy, because it is defined automatically.
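A minimal Python sketch of this root-driven extraction is given below. It assumes the dictionary exposes hyponymy relations as a mapping from a concept to its direct hyponyms; the mapping `HYPONYMS` and the concept names are hypothetical.

```python
from collections import deque

def extract_dimension(root, hyponyms):
    """Breadth-first walk from the root; returns (concept, parent, level) rows."""
    rows = [(root, None, 0)]
    queue = deque([(root, 0)])
    seen = {root}
    while queue:
        concept, level = queue.popleft()
        for child in hyponyms.get(concept, []):
            if child in seen:
                continue  # keep only the first occurrence of a concept
            seen.add(child)
            rows.append((child, concept, level + 1))
            queue.append((child, level + 1))
    return rows

# Hypothetical dictionary content: concept -> direct hyponyms.
HYPONYMS = {
    "IT expert": ["programmer", "system administrator"],
    "programmer": ["web programmer"],
    "lawyer": ["notary"],
}

dimension = extract_dimension("IT expert", HYPONYMS)
depth = max(level for _, _, level in dimension)
```

Only the root name is supplied by the user; the number of levels (`depth`) is discovered during the walk, and concepts outside the chosen ontology are never visited.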

Figure 2 shows the entire process performed by the overall system. In particular, we developed a custom ETL module to integrate semantically indexed data with operational data. Particular attention must be given to the management of the ontologies' structure. To ensure hierarchical navigation, the structure must be brought back to a tree. The presence of navigable cycles in the structures is ruled out, even theoretically, by the type of relation existing between the concepts, i.e., the part-of relation. One must therefore consider the management of Directed Acyclic Graphs (DAGs).

Figure 2: The Overall Process.
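The text does not state how a DAG is brought back to a tree. One possible approach, shown here purely as an assumption, is to unfold the graph so that a concept with several parents appears once under each of them, identified by its path from the root. The mapping `CHILDREN` and the concept names are hypothetical.

```python
def unfold_to_tree(root, children, path=()):
    """Yield one path (tuple of concepts) per tree node; a concept reachable
    through several parents is emitted once under each parent."""
    if root in path:
        raise ValueError(f"cycle detected at {root!r}")
    path = path + (root,)
    yield path
    for child in children.get(root, []):
        yield from unfold_to_tree(child, children, path)

# Hypothetical DAG: "SQL developer" has two parents.
CHILDREN = {
    "IT expert": ["programmer", "database administrator"],
    "programmer": ["SQL developer"],
    "database administrator": ["SQL developer"],
}

tree_nodes = list(unfold_to_tree("IT expert", CHILDREN))
```

Duplicating shared concepts keeps navigation strictly hierarchical, at the cost of making the same candidate reachable through more than one branch when counts are aggregated.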

3 CASE STUDY

In this section we illustrate the case study, namely the integration of our module with the pre-existing system and the OLAP engine. The proposed solution was implemented successfully in an experiment within a corporate context for the analysis and semantic search of experts' curricula, using data generated by a web application used in the e-governance field. The web application indexed the unstructured textual information contained in the curricula uploaded by candidates applying for jobs offered by an Italian government body. The application's function was to retrieve the information stored in these text documents in order to create a complete profile of the worker and make it easier to find. In this context we chose to define a star schema to demonstrate the proposed approach. The experimental parameters are shown in Table 1. Note that the system works off-line with respect to the user, so the parameters shown in the table do not affect on-line performance.

Figure 3: The Computer Expert Ontology.

KDIR 2009 - International Conference on Knowledge Discovery and Information Retrieval

Table 1: The Experimental Parameters.

Variable                         Value
# curricula                      150,000
# ontologies                     10
# concepts per candidate         >10
ontology extraction avg. time    10 s
index cleaning avg. time         5 min

3.1 An Ontology Star-schema Example

The ontology to be integrated in the schema was partially defined by hand (a predefined one from the semantic dictionary could equally have been chosen), and its contents describe the IT expert concept. The operational database provides the candidates' personal, working and academic data. We therefore thought it would be interesting to show the relations between such data and a concept concerning a working environment such as the chosen one. The structure of the ontological tree of the IT expert concept is shown in Figure 3. Consistently with the limitations mentioned in the previous paragraphs, the ontological graph is a directed n-ary tree. To highlight the use of the ontology, the star schema we chose to implement involves this dimension only. The resulting schema is shown in Figure 4. The measure to be aggregated is the number of candidates referring to a single concept. The resulting table is therefore a factless fact table (Kimball and Caserta, 2004). In this situation, managing many-to-many relations is clearly a crucial aspect, because a superficial treatment of the issue inevitably leads to inconsistent results. In more detail, the problem can be illustrated with two concepts stemming from the same parent concept. If both child concepts refer to the same candidate, aggregating them at the parent concept level yields a wrong value, because the candidate is counted twice. This is a typical problem in data warehousing (Song et al., 2001; Golfarelli et al., 1998). Our proposal does not avoid the issue: the problem is neither worsened nor mitigated, and all its aspects are inherited. It does occur more frequently, however, because many-to-many relations are frequent when concepts are treated as hierarchical dimensions. A solution may be sought, for example, in the weighted management of the graph nodes (Song et al., 2001). The problem does not arise if the concepts are on different levels of the same branch, since the records that refer to the upper levels can be removed, keeping the result consistent with the query level. The number of dimensions that may be aggregated, both standard and dynamic, is not limited, which makes the solution adaptable to different types of problems.
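The double-counting effect can be reproduced with a few lines of Python. The sketch below is illustrative only: the candidates, concept names and helper functions are hypothetical, and the distinct count shown as a remedy is a simple alternative chosen for the example, not the weighted approach of Song et al. (2001).

```python
# Hypothetical two-level hierarchy and factless fact rows (candidate, concept).
PARENT_OF = {
    "IT expert": None,
    "programmer": "IT expert",
    "system administrator": "IT expert",
}
FACTS = [
    ("anna", "programmer"),
    ("anna", "system administrator"),
    ("bruno", "programmer"),
]

def ancestors_or_self(concept, parent_of):
    """Yield the concept and then each of its ancestors up to the root."""
    while concept is not None:
        yield concept
        concept = parent_of.get(concept)

def naive_rollup(facts, parent_of):
    """Sum fact rows up the hierarchy; shared candidates are counted twice."""
    counts = {}
    for _, concept in facts:
        for node in ancestors_or_self(concept, parent_of):
            counts[node] = counts.get(node, 0) + 1
    return counts

def distinct_rollup(facts, parent_of):
    """Count distinct candidates per concept instead of fact rows."""
    members = {}
    for candidate, concept in facts:
        for node in ancestors_or_self(concept, parent_of):
            members.setdefault(node, set()).add(candidate)
    return {node: len(cands) for node, cands in members.items()}

naive = naive_rollup(FACTS, PARENT_OF)
distinct = distinct_rollup(FACTS, PARENT_OF)
```

At the parent level the naive sum reports three candidates although only two exist, while the two child concepts are unaffected.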

Figure 4: The Star-Schema with Bridge Table.

3.2 The OLAP Module

The ETL process ends with the generation of a fact table with a dimension stemming from the IT expert concept. To query it we chose the Mondrian OLAP engine of the Pentaho Business Intelligence Suite (http://www.pentaho.org). It was therefore necessary to create the structures required to interact with this specific query tool which, aside from the choice made in this example, can be selected without any particular restrictions. Regardless of the choice, it is best to manage the creation of these structures through a purpose-built solution, because the requirements depart from the standard use of the tools. The Pentaho suite presents the data processed by the Mondrian OLAP engine through JSP libraries that generate pivot tables for navigating multidimensional cubes with roll-up and drill-down operations. The result obtained from navigating the schema illustrated in the previous section is shown in Figure 5. The image shows that the quality of the ontologies contained in the dictionary is the basic element for a good presentation of the information. For this reason, the names in the attribute field have deliberately not been refined, so as to highlight this aspect.

Figure 5: The Resulting Pivot Table.
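The structures required by the query tool are not detailed in the text. As a hedged illustration, the Python sketch below generates a Mondrian parent-child dimension that points at a closure (bridge) table, using Mondrian's `Level` attributes `parentColumn` and `nullParentValue` and its `Closure` element. All table and column names (`dim_concept`, `concept_bridge`, `concept_id`, `parent_id`, `ancestor_id`, `descendant_id`) are assumptions introduced for the example, as is the convention that the root row has parent id 0.

```python
import xml.etree.ElementTree as ET

def concept_dimension_xml(dim_table, bridge_table):
    """Build a Mondrian <Dimension> fragment for a parent-child concept
    hierarchy backed by a closure table (all names are hypothetical)."""
    dim = ET.Element("Dimension", name="Concept", foreignKey="concept_id")
    hier = ET.SubElement(dim, "Hierarchy", hasAll="true",
                         allMemberName="All Concepts", primaryKey="concept_id")
    ET.SubElement(hier, "Table", name=dim_table)
    # parentColumn makes the level recursive; rows whose parent_id equals
    # nullParentValue are treated as roots (assumed to be 0 here).
    level = ET.SubElement(hier, "Level", name="Concept", column="concept_id",
                          nameColumn="concept_name", parentColumn="parent_id",
                          nullParentValue="0", uniqueMembers="true")
    closure = ET.SubElement(level, "Closure", parentColumn="ancestor_id",
                            childColumn="descendant_id")
    ET.SubElement(closure, "Table", name=bridge_table)
    return ET.tostring(dim, encoding="unicode")

schema_fragment = concept_dimension_xml("dim_concept", "concept_bridge")
```

Generating the fragment programmatically means the schema can be rebuilt whenever a different root concept is chosen, without editing the cube definition by hand.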

4 CONCLUSIONS AND FUTURE WORK

In this article we have proposed a solution for the integration of indexing data generated by semantic search engines and the reuse of the ontologies defined in their dictionaries as OLAP dimensions. The goal is to dynamically develop multidimensional schemas for BI applications based on ontologies. We use this technology to simplify the management of ontology-based information and to reduce, without eliminating, human involvement. The studies mentioned in Section 1 could be applied to enhance the process that generates ontology-based dimensions. The defined process is currently stable and yields positive results in a company environment. Finally, the fact-definition process can be improved by extending the logic to the "join" base of the data. To provide a complete BI service, the system must be able to perform several types of aggregation, not just the basic ones. As future work we plan to enhance indexed data management with the introduction of a cache-based engine, and to address the problems related to the management of many-to-many relations.

REFERENCES

Critchlow, T., Ganesh, M., and Musick, R. (1998). Automatic generation of warehouse mediators using an ontology engine. In Borgida, A., Chaudhri, V. K., and Staudt, M., editors, KRDB, volume 10 of CEUR Workshop Proceedings, pages 8.1–8.8. CEUR-WS.org.

Golfarelli, M., Maio, D., and Rizzi, S. (1998). Conceptual design of data warehouses from E/R schema. In HICSS '98: Proceedings of the Thirty-First Annual Hawaii International Conference on System Sciences, Volume 7, page 334, Washington, DC, USA. IEEE Computer Society.

Kimball, R. and Caserta, J. (2004). The Data Warehouse ETL Toolkit: Practical Techniques for Extracting, Cleaning, Conforming and Delivering Data. Wiley.

Kimball, R., Reeves, L., Thornthwaite, W., and Ross, M. (1998). The Data Warehouse Lifecycle Toolkit: Expert Methods for Designing, Developing and Deploying Data Warehouses with CD Rom. John Wiley & Sons, Inc., New York, NY, USA.

Kimball, R. and Ross, M. (2002). The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition). Wiley.

Simitsis, A., Skoutas, D., and Castellanos, M. (2008). Natural language reporting for ETL processes. In DOLAP '08: Proceeding of the ACM 11th international workshop on Data warehousing and OLAP, pages 65–72, New York, NY, USA. ACM.

Skoutas, D. and Simitsis, A. (2006). Designing ETL processes using semantic web technologies. In DOLAP '06: Proceedings of the 9th ACM international workshop on Data warehousing and OLAP, pages 67–74, New York, NY, USA. ACM.

Song, I.-Y., Medsker, C., Ewen, E., and Rowen, W. (2001). An analysis of many-to-many relationships between fact and dimension tables in dimensional modeling. In Proc. of the Intl Workshop on Design and Management of Data Warehouses, pages 6–1.

Toivonen, S. and Niemi, T. (2004). Describing Data Sources Semantically for Facilitating Efficient Creation of OLAP Cubes. In Poster Proceedings of the Third International Semantic Web Conference.
