A PROTOTYPE FOR KNOWLEDGE EXTRACTION FROM

SEMANTIC WEB BASED ON ONTOLOGICAL COMPONENTS

CONSTRUCTION

Nesrine Ben Mustapha, Hajer Baazaoui-Zghal

Riadi Laboratory., ENSI Campus Universitaire de la Manouba, 2010 Tunis, Tunisie

Marie-Aude Aufaure

Supelec, Computer Science Department, Plateau du Moulon, 91 192 Gif sur Yvette, France

Keywords: Semantic web, ontology construction, knowledge extraction, OntoCoSemWeb (Ontology Construction for

the Semantic Web) prototype.

Abstract: Adding a semantic dimension to web pages is a response to some problems of the present web and is known

as the semantic web. Many methods and methodologies can be found in the literature. Generally, they are

dedicated to particular data types like text, semi-structured data, relational data, etc. This paper presents a

prototype for knowledge extraction from web pages based on ontological components construction. Our

work deals with web pages. We will first study the state of the art of methodologies defined to learn

ontologies from texts. Then, we will define architecture of ontological components for the Semantic web.

An implementation and experimentation of the proposed architecture are presented.

1 INTRODUCTION

The volume of available information on the web is

growing exponentially. Consequently, integration of

heterogeneous data sources and information

retrieval, have become more and more complex.

Adding a semantic dimension to web pages is a

response to this problem and is known as the

semantic web (Berners-Lee, 2001). Ontologies can

be seen as a fundamental part of the semantic web.

They can be defined as an explicit, formal

specification of a shared conceptualization (Gruber,

1993). Meanwhile, building ontology manually is a

long and tedious task. We are interesting in learning

ontologies from text. We present in section 2

semantic web and ontological components and our

approach to build a domain ontology. In section 3

and 4, implementation and experimentation are

presented. Section 5 analyses the results. At last, we

conclude and give some perspectives for this work.

2 SEMANTIC WEB AND

ONTOLOGICAL

COMPONENTS

Starting from the state of the art, we propose a

hybrid approach to build domain ontology; our

objective is to increase the capability of this

ontology to specify and extract web knowledge in

order to contribute to the semantic web. Analyzing

the web content is a difficult task relative to

relevance, redundancies and incoherencies of web

structures and information. For these reasons,

proposing an approach to build automatically an

ontology still remains utopian. Our approach is

based on the cyclic relation between web mining,

semantic web and ontology building as stated in

(Berendt and al., 2002). Our proposal is based on the

following statements: (1) satisfy the fact that the

ontology is useful to specify and extract knowledge

from the web, (2) link the semantic content within

the web documents structure, and (3) combine

linguistic and learning techniques taking into

account the scalability and the evolution of the

451

Ben Mustapha N., Baazaoui-Zghal H. and Aufaure M. (2007).

A PROTOTYPE FOR KNOWLEDGE EXTRACTION FROM SEMANTIC WEB BASED ON ONTOLOGICAL COMPONENTS CONSTRUCTION.

In Proceedings of the Third International Conference on Web Information Systems and Technologies - Web Interfaces and Applications, pages 451-454

DOI: 10.5220/0001285304510454

 SciTePress

ontology. Our ontology is produced using web

mining techniques. We mainly focus on web content

and web structure mining. Building this ontology

leads us to solve two main problems. The first one is

relative to the heterogeneity of web documents

structure while the second one is more technical and

concerns technical choices to extract concepts,

relationships and axioms as well as the selection of

learning sources and scalability. An architecture of

ontological components is proposed to represent the

domain knowledge, the web sites structure and a set

of services. These ontological components are

integrated into a customizable ontology building

environment (Ben Mustapha and al., 2006).

2.1 Our Approach

Learning ontologies from web sites is more complex

than texts. Indeed, web pages can contain more

images, hypertext and frames than text. Learning

concepts is a task that requires texts able to

explicitly specify the properties of a particular

domain. Starting from the state of the art, we can say

that no learning method to extract concepts and

relationships is better. For these reasons, we propose

a customizable ontology building environment

taking into consideration the criteria defined in our

synthesis. In this environment, we propose a set of

interdependent ontologies to build a knowledge base

on a particular domain, made up of a set of web

documents, their structure and associated services.

We distinguish three ontologies, namely a generic

ontology of web sites structures, a domain ontology

and a service ontology. The generic ontology of web

sites structure contains a set of concepts and

relationships allowing a common structure

description of HTML, XML and DTD web pages.

This ontology enables users to learn axioms that

specify the semantic of web documents patterns. The

main objective is to ease the structure of web mining

knowing that the results can help to populate the

domain ontology. The domain ontology is divided

into three layers according to their level of

abstraction. The ontology of services is defined

starting from the concept of task ontology (Gomez-

Perez and al., 2003). In our web context, we speak

of web services instead of tasks. This ontology

specifies the domain services and will be useful to

map web knowledge into a set of interdependent

services. This ontology is hierarchically structured:

the upper level is the root service while the leaves

are elementary tasks for which a triplet “concept-

relation-concept” belonging to the domain ontology

is associated. These three ontological components

are interdependent where the axioms included in an

ontology are used to enhance another ontology

component. Meanwhile, these ontologies differ from

their use. The domain ontology is used to specify the

domain knowledge. The service ontology specifies

the common services that can be solicited by web

users and can be attached to several ontologies

defined on subparts of the domain. As we said

previously, the axioms of the structure ontology are

used to extract instances of the domain ontology.

2.2 Building the Domain Ontology

In this section, we focus on the domain ontology

extraction. Our strategy is based on three steps. The

first one is the initialization step. The second one is

an incremental learning process based on linguistic

and statistic techniques. The last one is a learning

step based on web structure mining. Here is their

definition. The initialization is based on the

following steps: (1) The design and manual building

of a minimal ontology related to the domain; this

construction is based on concepts and relationships

of Wordnet, (2) Composition of concepts and

relationships learning sources which consist in: (1)

Web search of documents related to our domain

using the concepts defined in the minimal ontology

as requests, (2) Classification of these web

documents, (3) Composition of a textual corpus

containing a set of phrases in which we can find at

least one concept of the minimal domain ontology

and (4) Composition of a corpus of HTML and

XML documents indexed by their URL. Each

iteration of the second stage includes two steps. The

first one (Procedure A) is defined by the following

tasks: (1) Enrichment of the ontology with new

concepts extracted from semi-structured data found

in the web pages (XML, DTD, tables), (2)

Construction of a word space based on the concepts

of the minimal domain ontology, (3) Lexico-

syntactic patterns learning based on the method

defined in (Alfonseca and Manandhar, 2002); these

patterns are related to non taxonomic relationships

between the concepts of the minimal ontology, (4)

Lexico-syntactic patterns learning to extract

synonymy, hyponymy and part-of relationships

(lexical layer of the domain ontology), (5) Similarity

matrix building: this matrix allows computing the

similarity between pairs of concepts found in the

multidimensional space word. The second step

(Procedure B) consists in: (1) Updating the textual

corpus and the web documents collection by

searching them according to the concepts defined in

the minimal ontology, (2) New concepts and non

WEBIST 2007 - International Conference on Web Information Systems and Technologies

452

taxonomic relationships extraction by the application

of lexico-syntactic patterns, (3) Attribution of a

weight for each extracted relationship relative to the

frequency of the relationships that apply the lexico-

syntactic pattern, (4) Updating the minimal

ontology. Each iteration can be validated by the

domain expert. This process is incremental:

procedures A and B are repeated until no integration

of new data is required. The last stage consists in an

enrichment of the ontology structure and an

extraction of structure patterns for each relationship

of the domain ontology. The implementation of this

strategy is still in progress. We have realized a little

case study to illustrate the first iteration (Ben

Mustapha and al., 2006).

3 ONTOCOSEMWEB

PROTOTYPE: APPLICATION

TO TOURISM

An implementation and experimentation of the

suggested approach were carried. The principal

objective is to automate the process of ontology

construction. In fact a prototype named

OntoCoSemWeb had been developed. The main

purposes of the developed prototype are: (1) to

automate the construction of ontology by combining

the method of (Hearst, 1998), the semantic signature

and a method of text mining which consists in the

construction of the space word and the ASIUM

method of syntactic Frames learning (Faure and Al,

1998 ) ; (2) to proceed to the learning of the possible

relationships while saving information of knowledge

extraction in the metaontology ; (3) To annotate the

chosen web pages by using a minimal ontology of a

specific field and enrich it by learning from these

same Web pages which will be then annotated by the

result of the ontology learning. Thus, after the

elimination of the HTML tags from web documents,

using API DOM (Document Object Model) a web

mining technique has been implemented, to perform

the segmentation of the texts in sentences. For the

construction of the minimal ontology and meta-

ontology, Protege 2000 has been used. The

alimentation of the meta-ontology is made by the

nominal expressions of each concept, the definition

of the semantic signature and Hyponyms of each

concept and the research of syntactic Frames of each

minimal ontology relation. Concepts are built

according to existing words in the corpus and in

Wordnet which is used to select the most similar

words or expressions, which will be considered as

the topics signatures of, as an example, the concept

"hotel ". The construction of the corpus patterns was

done from 10 Web sites to obtain a textual corpus of

10 groups where each group contains between 130

and 300 textual files. Each file represents one Web

page of the site associated with only one group. So

that the frequency of a pattern is computed

according to its occurrence in a Web page and for all

the pages of the Web site.

4 DESCRIPTION OF

ONTOCOSEMWEB

The experimentation starts by choosing a concept of

the minimal ontology of tourism to begin the

learning step. The metaontology enrichment starts

by searching synonyms, “part-of” relations and

nominal expressions referring to the concept. These

concepts are “part-of” the concept and will be

inserted temporarily in the metaontology during step

A in order to enrich the domain ontology in step B.

Then, the expressions referring to it are constructed.

The existing generic linguistic axioms of the

metaontology are used to deduce new concepts or

instances (from nominal propositions patterns or

other patterns). As an example, the NP_Concept

pattern, NP means « proper noun » allows us to

insert the instance « Arkansas hotel » as an instance

of the concept « Hotel ». Expressions referring to the

concept « hotel “ are generated. The

multidimensional space is built with terms existing

in the corpus and in Wordnet (to select the most

similar words or expressions associated to “hotel”).

These terms are the semantic signature of the

concept « hotel ». The concepts which are similar to

the concept « hotel » (palace for example) are

extracted. The next step consists in searching a

relation between a concept and its semantic

signature. The semantic signatures represent close

concepts. Some of them have no taxonomic relation

in the domain ontology. Then, we have to verify the

existence of a taxonomic or a non taxonomic

relation, in order to filter the semantic signatures list

related to a given concept. With the developed tool,

the user can select a concept and "Extract senses ",

which allows the visualization of the possible senses

of the word "Airline " and search in Wordnet. Two

senses are found: the user has to choose the suitable

one and "show synonym" to visualize the synonyms

relating to the selected sense. The synonyms in the

last stage are inserted in the metaontology. When

the concept is referred by a term having at least one

A PROTOTYPE FOR KNOWLEDGE EXTRACTION FROM SEMANTIC WEB BASED ON ONTOLOGICAL

COMPONENTS CONSTRUCTION

453

sense, the user cans «Enrich Nominal Proposition ».

A lexico-syntactic pattern and its occurrence are

associated to each nominal expression. The

following step allows the enrichment of the meta-

ontology with the nominal expressions of the given

concept, its occurrence in the corpus and the

occurrence of each nominal expression. The next

step consists of the construction of the

Multidimensional word space and concept similarity

matrix.

5 RESULT ANALYSIS

The experimentation carried out on Web pages

concerning tourism was presented in this paper. The

main aim was to show the feasibility of the approach

of ontologies construction for the semantic Web by

applying techniques of knowledge extraction. The

techniques of knowledge extraction are mainly text

mining techniques: the extraction technique of

lexical patterns, the construction of word space and

the construction of matrix similarity. The originality

of the suggested approach consists in ontology

construction to generate knowledge. Thus, the

iterative character of the approach makes it possible

to obtain after the last iteration association rules

such as : If " hotel " near to " sea" is expensive, or

80% of hotels of “Paris 2” are classified hotels 2

stars and less. Such rules represent a knowledge

being able to contribute to the decision-making.

6 CONCLUSIONS AND

PERSPECTIVES

Learning methodologies try to give a response to

the time-consuming manual ontology building task.

Learning techniques can be either numeric or

symbolic. They have been exploited to semi-

automate some foundation tasks such as concept

hierarchy building, taxonomic relationships

extraction, non taxonomic relationships learning,

etc. The proposed architecture is based on self-

learning ontological components which define Web

content semantics: the domain ontology ; Web

structure semantics : the ontology structure and

domain web services semantics: the web services

ontology. The metaontology is reusable and can be

applied to other domain building ontology processes

(from corpus with a determined language). Axioms

to extract concepts, relationships and instances are

learned incrementally. This metaontology can be

extended to other extraction techniques, where each

technique can use another similarity measure. A

hybrid domain ontology combining three techniques

is built: lexico-syntactic patterns, syntactic frames

and multidimensional word space. The proposed

approach is based on extracted information weighted

by its frequency in the corpus. The corpus is

updated, from one iteration to another, which is

useful to revise this information and the associated

weightings.

200

400

600

800

1000

1200

1400

1600

1800

initialisation Step A Step B

Concept number

number of new taxonomic

relationships

number of new non

taxonomic relationships

number of axiomes

linguistic patterns

lexicosyntaxic patterns

erroneous lexicosyntaxic

patterns

nominal expressions

erroneous linguistic

patterns

Figure 1: Results analysis.

REFERENCES

Berners-Lee, T, Hendler, J, Lassila, O., 2001: The

Semantic Web, Scientific American.

Berendt, B., Hotho, A., Stumme, G., 2002: Towards

semantic web mining. In: International Semantic Web

Conference, volume 2342 of Lecture Notes in

Computer Science, Springer, 264–278.

Ben Mustapha, N., Aufaure, M.-A., Baazaoui-Zghal, H.,

2006: Towards and Architecture of Ontological

Components for the Semantic Web, Web Information

Systems Modeling Workhop (WISM), in conjunction

with CAISE'2006, Luxembourg.

Gruber, T., 1993: Toward principles for the design of

ontologies used for knowledge sharing. International

Journal of Human-Computer Studies, special issue on

Formal Ontology in Conceptual Analysis and

Knowledge Representation.Eds, Guarino, N.&Poli , R.

Grüninger, M., Fox M.S., 1995: Methodology for the

design and evaluation of ontologies. IJCAI’95

Workshop on Basic Ontological Issues in Knowledge

Sharing, Montreal, Canada.

Alfonseca, E., Manandhar, S., 2002: Improving an

Ontology Refinement Method with Hyponymy

Patterns. Language Resources and Evaluation

(LREC), Las Palmas, Spain.

Gomez-Perez, A., Fernandez-Lopez, M., Corcho, O.,

2003: Ontological Engineering. Springer.

WEBIST 2007 - International Conference on Web Information Systems and Technologies

454