ONTOLOGIES BASED APPROACH FOR SEMANTIC INDEXING

IN DISTRIBUTED ENVIRONMENTS

Claude Moulin

CNRS Heudiasyc, University of Compiègne, France

Cristian Lai

CRS4, Center for Advanced Studies in Sardinia, Italy

Keywords: Semantics and Ontology Engineering, Information Search and Retrieval, Distributed Data Structures,

Distributed Systems.

Abstract: In this paper we present some issues relating the semantic indexing of resources in distributed decentralized

systems. We discuss some matter regarding the navigation between indexed resources and the way to

enhance our model for answering more generic queries. The argument is introduced with a brief scenario

focusing on e-learning domain, even if our goal is more general purpose. We discuss the role of ontologies

in semantic indexing, moving from centralized systems to the distributed paradigm. We explain how to

distribute a common index of resources semantically identified, composing a distributed knowledge base.

The paper highlights the importance of an ontology system for the key generation and the use of specific

domain ontologies.

1 INTRODUCTION

In the last years, research in the field of Knowledge

Engineering has addressed technologies able to

create modular and more efficient knowledge-based

systems. An important requirement is the

improvement of response time and the quality of the

discovery facilitating the selection of results more

related to a specific request. Semantic indexing

seems to be a valid candidate to fulfill this

requirement. Main search algorithms based on

sequences of keywords also try to get better results

but they often lack of contextual information.

Instead, selective search takes into account the

meaning, the importance of the words within the

resources content, mutual vicinity, etc. Moving from

client-server paradigm to Peer to Peer (P2P)

networks model, information retrieval problem

becomes more articulated. Even if P2P networks

architectures can not offer a hyper textual navigation

between resources and lost the hierarchical

organization of data, they keep some advantages

such as the easy scalability of resources, the

reliability on data transferring, and low operating

costs. A P2P architecture will avoid both physical

and semantic bottlenecks that limit information and

knowledge exchange (Staab, Stuckenschmidt, 2006).

In this paper we face the problem of the semantic

indexing of resources in decentralized networks, in

which traditional discovery is based on documents

title. Our studies address techniques able to create an

index of managed resources, distributed and

semantically annotated. Semantic structures are

introduced through ontologies. Distribution is gained

by P2P paradigm. The distribution of the index is

performed by a Distributed Hash Table (DHT), and

thus the generation of keys remains an important

aspect. The semantic information strictly related to a

resource is included in a semantic annotation and

then inserted in the index.

In the case of textual documents annotation, two

approaches can be followed: metadata are either

inserted within the document or found outside the

content as external resources. In our work we prefer

extracting the semantics of annotation from

ontologies. The solution we propose allows to fasten

to a document the meta-information that describes

its content and to publish it in the network. The URL

of the resource tied up to the meta-information

420

Molin C. and Lai C. (2009).

ONTOLOGIES BASED APPROACH FOR SEMANTIC INDEXING IN DISTRIBUTED ENVIRONMENTS.

In Proceedings of the International Conference on Knowledge Engineering and Ontology Development, pages 420-423

DOI: 10.5220/0002298604200423

 SciTePress

allows its further access. Our solution lets open the

choice of the ontologies required for the

descriptions. The user has just to select ontologies

really shared among communities (Gruber, 1993)

else the discovery of the documents published in the

network would be impossible. The only

characteristic we use is the Unified Resource

Identifiers (URI) of concepts, relations and instances

of concepts. We also chose ontologies because we

want that any software agent can reason on the

structure of knowledge in order to build the most

appropriate queries when searching for resources.

Indeed, we consider that the resources published in

the network may be used diversely. The resources

we want to index are not always textual, so a manual

indexing is necessary; some interesting information

is not contained within the resource, thus the

automatic indexing would not have effect on this.

It’s the case when somebody wants to express a

point of view on a resource.

The paper is organized as follows: Section 2

describe a scenario where our indexing system can

be used. Section 3 reports briefly the related works.

Section 4 presents the core of our indexing system

and the way keys are generated.

2 SCENARIO

We consider teachers writing didactic material for

their courses and wanting to share them with other

teachers or students thanks to a simple mechanism.

They don’t want to deal with heavy tools or to

depend on centralized repositories. Teachers only

have to semantically describe their material in

relation with one or more ontologies. Obviously,

tools are supplied for that and the use of ontologies

is transparent to users. Then the documents are

published in a P2P network in conjunction with their

semantic description. Publication of semantically

annotated document in P2P networks was also

presented as a real challenge in (Davies et al., 2003).

Students and other teachers can discover the

resources making queries based on the concepts

expressed from ontologies. In this scenario the

choice of the ontologies is crucial and ontologies

really shared among communities should actually be

used.

Resources of various types can be handled and

are not necessarily textual. Semantic annotations can

represent objective information about resources

(nature, concepts of scientific domain, etc.) but can

also represent a point of view on documents

(difficulty level, usefulness in some context, etc.).

This example of semantic indexing in P2P

networks is planned to be integrated in a large

project concerning Organizational Memories. In this

kind of memories, resources can also be

semantically indexed on ontologies and can be

annotated. An ontology used for a memory is also a

part of this memory and is used for navigation.

Centralized memories may have some disadvantages

when they need to be filled up. Such a distributed

system could be a solution to this issue.

3 RELATED WORKS

Using ontologies in distributed systems like P2P

networks is closely considered by scientific

community. In 2002 Edutella project (Nejdl et al,

2001), handled by Sun Microsystems, did a first

approach for the association of semantics to

educational content through an open source

infrastructure based on RDF metadata for the

interoperation between different schemes

(IEEE/LOM, IMS Learning Design

(http://www.imsglobal.org/learningdesign/index.htm

l), ADL SCORM), performing a mapping among

them. The SWAP project (Ehrig et al, 2003),

managed by the University of Karlsruhe, pays a

special attention to topics related to Semantic Web.

Its aim is to allow computers to actually comprehend

the meaning of its processed data. Using the model

of ontologies, the project allows to develop a

technology in the area of knowledge management

and P2P. Complex structures can be easily encoded

in a set of RDF triples. In (Della Valle et al, 2006) is

supposed that RDF should become the bases of the

Semantic Web. Nevertheless, RDF isn’t enough; it

does not supply a sufficient expressive ability to

represent the whole knowledge schema. DHT based

overlays systems offer an interesting alternative to

existing information system architectures. We

propose to express the semantic classification

through concepts articulated by ontologies that

describe the specific domain and to formulate such

expression by the OWL formalism.

4 SEMANTIC INDEXING

We generally consider two kinds of models for

indexing resources: boolean (see Salton et al., 1982)

and vectorial. In the boolean model, the index of

documents is an inverse file which associates to each

keyword of the indexing system the set of

ONTOLOGIES BASED APPROACH FOR SEMANTIC INDEXING IN DISTRIBUTED ENVIRONMENTS

421

documents that contain it. A user’s query is a logical

expression combining keywords and Boolean

operators (NOT, AND OR). The system answers

with the list of documents that satisfy the logical

expression. The keywords proposed by the user are

supposed contained in the index.

In the vectorial model a document is represented

by a vector whose dimensions are the keywords of a

vector and the coordinates correspond to the weight

of the document in each dimension. A request is also

a vector of the same nature. The system answers a

request with the list of documents which present a

similarity with the request thanks to a specific

measure based on the vectors coordinates.

In centralized indexing both models are

available. However, in our case the index must be

distributed and the numbers of queries sent to the

index when searching for resources should

minimized, because they are time consuming. The

model of a distributed index is necessarily Boolean.

Our model should be able to satisfy logical

expressions and we have to construct different keys

for keeping this feature.

Our solution also allows a file sharing but the

semantic information is not contained in the title of

the resource. It’s contained in keys that describe the

resources. Ontology element identifiers are the

essential part of the indexing keys.

4.1 Ontological Elements

Our solution lets open the choice of the ontologies

requested for the descriptions. However, users of the

community should share them else the discovery of

the documents published in the network would be

impossible. The resource provider is responsible of

the choice of the ontology describing the concepts.

In our example at least two ontologies are pointed

out, one for the theoretical domain of the resource

(theory of language) and another for the description

of the resources. It’s generally the case when

indexing resources for e-learning. In case of a

manual semantic indexing, first is necessary to select

the ontologies used for building the indexing key. A

key may contain several concepts belonging to

different ontologies. For having homogeneity in our

knowledge representation and for preserving agent

reasoning based on ontologies, we use the LOM

ontology developed at “Université de Technologie

de Compiègne” for representing learning objects.

Studying the previous example and analyzing the

selected ontologies, we are led to the following

conclusion: a user can select a concept (the concept

of grammar is described in the first ontology) or an

instance of a concept (Exercise is an instance of the

concept of Learning Resource Type, and Difficult is

an instance of the concept of Difficulty Level).

Within the ontology, an element is completely

defined by its unique URI. It is enough to insert the

URIs of ontological elements for characterizing a

document in the key that indexes it. From this

information any software agent can discover the type

of the element inside a key and can decide to build

new queries if some do not give satisfying results.

4.2 Distributed Knowledge Base

We aim at creating a community, i.e. a set of nodes

of a P2P network, and at distributing a common

index of resources semantically identified. The data

structure that has been considered suitable for that is

a Distributed Hash Table (DHT) (Stoica et al.,

2006). The index is composed of entries that are

pairs of data (key, value). Each node of the network

contains a part of the whole data structure. The

publication, or indexing, is the operation of insertion

of a resource inside the DHT. More exactly it is the

operation of inserting a new index entry in the DHT.

The discovery is the operation which allows to find

some resources in the network that correspond to a

research key.

4.3 System Ontology

We consider that a key used for indexing a

document is based on a RDF semantic description,

i.e. a graph pattern of triples representing subjects

linked to objects by predicates. What is the meaning

of “indexing a document on the concept of

automaton” (in theory of language)? That

corresponds to the fact that the content of the

document has something to see with the concept. In

RDF, the document is represented as a blank node of

type Document and treating of the concept of

“automaton”. We had to create a system ontology

for representing these data. The RDF representation

of the document is then in N3 notation:

[] rdf:type syst:Document ;

syst:hasInterest lt:Automaton.

The system ontology is represented by the syst

prefix and the domain ontology by lt. “hasInterest”

is an annotation property allowing to attach any type

of elements (concept, relation or individual). The

system ontology also contains the concept of

Ontology, subconcept of Document for representing

the ontologies used by the users for indexing. Such

KEOD 2009 - International Conference on Knowledge Engineering and Ontology Development

422

an ontology is published in the network under the

description:

[] rdf:type syst:Ontology

The discovery system first looks for all the

ontologies used in the network for indexing and can

load them in the navigation sub system. For indexing

a document that occurs to be an ontology people can

use the following description:

[] rdf:type syst:Document ;

syst:hasType syst:Ontology.

4.4 Key Generation

An interesting feature of the navigating through an

ontology is the way to suggest for accessing other

resources. We can say that resources are close if

they are indexed by close ontological elements.

Concepts are close if one is a specialization of the

other or if they are domain and range of the same

property. From a selected concept, it’s simple to find

close concepts and then to build the query that

allows to access close resources. Instances of

concepts are close if they have the same type or if

their types are close.

Due to the Boolean model of the distributed

index, it’ necessary to create different keys when

publishing a document, so to be discovered with

different requests. The indexing must allow the

reasoning by subsumption. A request on a super-

concept must allow the discovery of documents

indexed on a sub-concept. The difficulty is to stop

the reasoning about the transitivity of subsumption,

for not indexing on a too general concept. We

consider that only two levels are enough in this case.

It is also possible to index on an attribute. A

document may show the interest for a country to

have a population and if this notion is modeled by

using an attribute, the description could be:

[] rdf:type syst:Document ;

syst:hasInterest ex:hasPopulation.

5 CONCLUSIONS

In this paper we have presented a solution which

aims at distributing semantically indexed resources

on P2P networks. The distribution of the index is

performed by a Distributed Hash Table. The

semantic information strictly related to a resource

and representing a point of view on the resource is

inserted in the key used to index the resource. The

semantic information comes from ontologies. Any

ontology can be used by our system. The drawback

of our solution is that the user has to navigate the

suitable ontologies and this operation can be time

consuming. Domain specific expert users have to

look for interesting ontologies and to publish them

in the network. Currently we are enhancing the tools

used to manage the ontologies in a way to hide their

underlined structures and to present them in a

comprehensive way. Our solution can supports the

building of online communities of users that want

share easily digital resources. We consider that the

building, storage and maintenance of ontologies are

the duty of the community the user belongs to. The

semantic indexing is strictly related to these issues.

REFERENCES

Davies, J., Fensel, D., and Van Harmelen, F., 2003.

Towards the Semantic Web: Ontology-Driven

Knowledge Management. New York : John Wiley &

Sons.

Della Valle, E., Turati, A. and Ghigni, A., 2006. PAGE: A

Distributed Infrastructure for Fostering RDF-Based

Interoperability. Distributed Applications and

Interoperable Systems. Berlin : Springer.

Gruber, T. R., 1993. Towards Principles for the Design of

Ontologies Used for Knowledge Sharing. In N.

Guarino and R. Poli, editors, Formal Ontology in

Conceptual Analysis and Knowledge Representation,

Deventer. Kluwer Academic Publishers.

Ehrig, M., Tempich, C., Broekstra, J., Van Harmelen, F.,

Sabou, M., Siebes, R., Staab, S. and H.

Stuckenschmidt, 2003. Swap - ontology-based

knowledge management with peer-to-peer technology.

WIAMIS'03, London, pp. 557-562. World Scientific,

London.

Nejdl, W., Wolf, B., Qu, C., Decker, S., Sintek, M.,

Naeve, A., Nilsson, M., Palmer, M. and Risch,

T.,2001. Edutella: A p2p networking infrastructure

based on Rdf.

Salton, G., Fox, Edward A. et Wu, Harry, 1982. Extended

Boolean information retrieval. Technical Report,

Cornell University.

Staab, S., Stuckenschmidt, H., 2006. Semantic Web and

Peer-to-Pee: Decentralized Management and

Exchange of Knowledge and Information. Springer.

Stoica, I., Morris, R., Karger, D., Kaashoek, M. F. and

Balakrishnan, H., 2006. Chord: A scalable peer-to-

peer lookup service for internet applications. In

Proceedings of the ACM SIGCOMM '01 Conference,

San Diego, California, pp 149-160.

ONTOLOGIES BASED APPROACH FOR SEMANTIC INDEXING IN DISTRIBUTED ENVIRONMENTS

423