ontology. Our ontology is produced using web
mining techniques. We mainly focus on web content
and web structure mining. Building this ontology
requires us to solve two main problems. The first
concerns the heterogeneity of web document
structures, while the second is more technical and
concerns the choice of techniques to extract concepts,
relationships and axioms, as well as the selection of
learning sources and scalability. An architecture of
ontological components is proposed to represent the
domain knowledge, the web site structure and a set
of services. These ontological components are
integrated into a customizable ontology building
environment (Ben Mustapha et al., 2006).
2.1 Our Approach
Learning ontologies from web sites is more complex
than learning them from texts. Indeed, web pages can
contain more images, hypertext and frames than text,
whereas learning concepts requires texts that
explicitly specify the properties of a particular
domain. Moreover, the state of the art shows that no
single learning method for extracting concepts and
relationships is better than the others. For these reasons, we propose
a customizable ontology building environment
taking into consideration the criteria defined in our
synthesis. In this environment, we propose a set of
interdependent ontologies to build a knowledge base
on a particular domain, made up of a set of web
documents, their structure and associated services.
We distinguish three ontologies, namely a generic
ontology of web site structures, a domain ontology
and a service ontology. The generic ontology of web
site structures contains a set of concepts and
relationships allowing a common structural
description of HTML, XML and DTD web pages.
This ontology enables users to learn axioms that
specify the semantics of web document patterns. The
main objective is to ease web structure mining,
knowing that the results can help to populate the
domain ontology. The domain ontology is divided
into three layers according to their level of
abstraction. The ontology of services is defined
starting from the concept of task ontology (Gomez-Perez
et al., 2003). In our web context, we speak
of web services instead of tasks. This ontology
specifies the domain services and will be useful to
map web knowledge onto a set of interdependent
services. This ontology is hierarchically structured:
the upper level is the root service, while the leaves
are elementary tasks, each associated with a
“concept-relation-concept” triplet belonging to the
domain ontology. These three ontological components
are interdependent: the axioms included in one
ontology are used to enhance another ontology
component. However, these ontologies differ in
their use. The domain ontology is used to specify the
domain knowledge. The service ontology specifies
the common services that can be requested by web
users and can be attached to several ontologies
defined on subparts of the domain. As we said
previously, the axioms of the structure ontology are
used to extract instances of the domain ontology.
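
As an illustration of the hierarchical structure of the service
ontology described above, the following minimal sketch represents
services as a tree whose leaves (elementary tasks) each carry one
"concept-relation-concept" triplet of the domain ontology. The class
name Service, the function collect_triplets and all service, concept
and relation names are assumptions introduced here for illustration;
they are not taken from the environment described in this paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# A (concept, relation, concept) triplet taken from the domain ontology.
Triplet = Tuple[str, str, str]


@dataclass
class Service:
    """A node of the service ontology (hypothetical representation)."""
    name: str
    children: List["Service"] = field(default_factory=list)
    triplet: Optional[Triplet] = None  # set only on elementary tasks (leaves)

    def is_elementary(self) -> bool:
        # A service without sub-services is an elementary task.
        return not self.children


def collect_triplets(service: Service) -> List[Triplet]:
    """Return the triplets of all elementary tasks below `service`, left to right."""
    if service.is_elementary():
        return [service.triplet] if service.triplet is not None else []
    found: List[Triplet] = []
    for child in service.children:
        found.extend(collect_triplets(child))
    return found


# Illustrative data only: a made-up domain, not the one studied in the paper.
root = Service("root_service", children=[
    Service("book_accommodation", children=[
        Service("find_hotel", triplet=("Hotel", "located_in", "City")),
        Service("check_price", triplet=("Room", "has_price", "Price")),
    ]),
    Service("find_event", triplet=("Event", "takes_place_in", "City")),
])
```

Calling collect_triplets(root) gathers the domain-ontology triplets
that a given service depends on, which is one possible way to relate
a service to the subpart of the domain ontology it uses.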
2.2 Building the Domain Ontology
In this section, we focus on the domain ontology
extraction. Our strategy is based on three steps. The
first is the initialization step. The second is
an incremental learning process based on linguistic
and statistical techniques. The last is a learning
step based on web structure mining. These steps are
defined as follows. The initialization is based on the
following steps: (1) the design and manual building
of a minimal ontology related to the domain, based
on concepts and relationships from WordNet; (2) the
composition of the learning sources for concepts and
relationships, which consists of: (a) a web search
for documents related to our domain, using the
concepts defined in the minimal ontology as queries,
(b) the classification of these web documents, (c) the
composition of a textual corpus containing a set of
phrases in which at least one concept of the minimal
domain ontology occurs and (d) the composition of a
corpus of HTML and XML documents indexed by their
URL. Each
iteration of the second stage includes two steps. The
first one (Procedure A) is defined by the following
tasks: (1) enrichment of the ontology with new
concepts extracted from semi-structured data found
in the web pages (XML, DTD, tables), (2)
construction of a word space based on the concepts
of the minimal domain ontology, (3) learning of
lexico-syntactic patterns based on the method
defined in (Alfonseca and Manandhar, 2002); these
patterns are related to non-taxonomic relationships
between the concepts of the minimal ontology, (4)
learning of lexico-syntactic patterns to extract
synonymy, hyponymy and part-of relationships
(lexical layer of the domain ontology), (5) building
of a similarity matrix, which allows computing the
similarity between pairs of concepts found in the
multidimensional word space. The second step
(Procedure B) consists of: (1) updating the textual
corpus and the web document collection by
searching for documents according to the concepts defined in
the minimal ontology, (2) New concepts and non
WEBIST 2007 - International Conference on Web Information Systems and Technologies