efficacy of such tools depends directly on the way
the resources are described on the Web. According
to Moura (2001), search tools are classified into
two main classes:
·Search in Directories: Tools introduced when the
content of the Web was still small enough to be
collected in a non-automatic way. Documents are
classified manually according to a taxonomy.
·Search Engines: Tools whose main concern is the
size of their databases. In such systems, documents
are gathered by software agents that traverse the
Web collecting data.
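The traversal performed by such software agents can be sketched as a breadth-first walk over the link structure of the Web. The snippet below is only an illustration: a hypothetical in-memory link graph stands in for real HTTP fetches and indexing.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": [],
}

def crawl(start, graph):
    """Breadth-first traversal, as a stand-in for a Web crawler."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real agent would fetch and index the page here
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(crawl("a.com", links))  # → ['a.com', 'b.com', 'c.com']
```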
The aspects involved in these classes of tools
guide the development of mechanisms that try to
perform searches based on the meaning embedded in
the keywords. The problem is that current search
tools do not consider the semantic aspects of the
submitted keywords; they analyze the words only
syntactically.
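This purely syntactic behavior can be sketched as plain string matching: any document containing the literal keyword is retrieved, regardless of its sense. The documents below are invented for illustration.

```python
# Hypothetical corpus with two different senses of the word "bank".
docs = {
    "d1": "The bank raised its interest rates.",
    "d2": "We had a picnic on the river bank.",
}

def keyword_match(keyword, corpus):
    """Return the ids of documents containing the literal keyword."""
    kw = keyword.lower()
    return sorted(d for d, text in corpus.items() if kw in text.lower())

# Both senses are retrieved; the tool cannot tell them apart:
print(keyword_match("bank", docs))  # → ['d1', 'd2']
```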
A complementary field to Information Retrieval
is Information Extraction, which aims at extracting
relevant information from semi-structured
documents and organizing it in a friendly format.
NLP, wrapper development, and ontology-based
methods, among others, are techniques that can be
used to perform Information Extraction (Laender,
2002).
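As a minimal sketch of the wrapper technique mentioned above, the snippet below pulls structured records out of a semi-structured HTML fragment with a regular expression. The HTML fragment, class names, and fields are hypothetical illustrations, not the extraction rules of any particular system.

```python
import re

# Hypothetical semi-structured document listing offers.
html = """
<li class="offer"><span class="name">USB cable</span> <span class="price">$4.99</span></li>
<li class="offer"><span class="name">Mouse</span> <span class="price">$12.50</span></li>
"""

# A wrapper: extraction rules tied to the document's layout.
pattern = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)

# Organize the extracted data in a friendly (structured) format.
offers = [{"name": m["name"], "price": float(m["price"])}
          for m in pattern.finditer(html)]
print(offers)
# → [{'name': 'USB cable', 'price': 4.99}, {'name': 'Mouse', 'price': 12.5}]
```

A drawback of wrappers, implicit in this sketch, is their fragility: if the page layout changes, the extraction rules must be rewritten.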
2.1 The Web
In the beginning of the nineties, the first efforts were
made to develop the WWW as we know it today.
Tim Berners-Lee, the Web's creator, faced several
challenges before his project was understood by the
scientific community (Fernández, 2001). However,
his efforts were rewarded, since the Web consolidated
itself as the fastest-expanding means of information
distribution in world history. Its intensive use,
combined with its exponential growth, radically
changed the lives of the people who access the Web.
On the other hand, this fast expansion turned the Web
into a content repository as huge as it is disorganized.
Such troubles cause constant frustration among
inexperienced users, especially when they search for
specific information.
One of the factors that complicates this situation
is the standard language used to create Web pages,
HTML (Hypertext Markup Language). HTML does
not specify any kind of semantics for a page's
contents; it is responsible only for the document's
presentation. Consequently, a gap appears between
the information available for processing by Web
services and the information available to human
readers. The lack of meaning in Web documents
created the need to insert some "intelligence" into
current WWW resources. The idea of inserting
meaning into Web documents can be summarized by
one term: "Semantic Web" (Berners-Lee et al., 2001).
2.1.1 The Semantic Web
The Semantic Web enables the evolution from a
Web composed of documents to a Web formed by
information, in which every piece of data possesses a
well-defined meaning that can be interpreted,
"understood", and processed cooperatively by people
and software.
To understand the Semantic Web, assume that
someone is looking for pages about a species of
bird, an eagle, for instance. Typing "eagle" into a
search engine retrieves several answers: besides the
requested information about eagles, also pages about
the "War Eagles Air Museum" or the American
football club "Philadelphia Eagles", among other
results. This happens because the software analyzes
the word only syntactically, failing to discern the
football club from the birds or the museum. If these
resources were marked up with a Semantic Web
language, this would not occur, because the words
could be distinguished semantically.
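The disambiguation described above can be sketched with RDF-style (subject, predicate, object) triples that assign each resource a type. The URIs and types below are illustrative, not markup taken from real pages.

```python
# Hypothetical RDF-style triples typing each "eagle" resource.
triples = [
    ("ex:GoldenEagle",        "rdf:type", "ex:Bird"),
    ("ex:PhiladelphiaEagles", "rdf:type", "ex:FootballClub"),
    ("ex:WarEaglesAirMuseum", "rdf:type", "ex:Museum"),
]

def resources_of_type(wanted, kb):
    """Return the resources whose declared type matches the query."""
    return [s for s, p, o in kb if p == "rdf:type" and o == wanted]

# A semantic query for birds returns only the bird, not the club or museum:
print(resources_of_type("ex:Bird", triples))  # → ['ex:GoldenEagle']
```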
To develop the Semantic Web, an architecture
formed by a set of layers was proposed by Tim
Berners-Lee (Berners-Lee et al., 2001). In this
paper, the RDF and ontology layers are considered.
2.1.2 Ontology
Ontology is a term widely known in areas such as
Philosophy and Epistemology, where it means,
respectively, a "subject's existence" and the
"knowledge of knowing" (Chandrasekaran et al., 1999).
Recently, the term has also been used in Artificial
Intelligence (AI) to describe the concepts and
relationships used by agents. In the Database (DB)
community, an ontology is a partial specification of a
domain, which expresses entities, relationships
between these entities, and integrity rules (Mello et
al., 2000). From these definitions, an ontology can be
seen as a conceptual data model that describes, at a
high abstraction level, the structure of the data stored
in a DB.
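The DB-community definition cited above can be sketched as a small domain specification: declared entities, declared relationships between them, and an integrity rule checked against instance data. The domain (stores and offers) and all names are illustrative, not the paper's ontology.

```python
# Partial specification of a hypothetical domain.
entities = {"Store", "Product", "Offer"}
relations = {
    "offeredBy": ("Offer", "Store"),    # relation: (domain, range)
    "refersTo":  ("Offer", "Product"),
}
# Instance data: each individual is typed with one of the entities.
instances = {"store1": "Store", "usbCable": "Product", "offer1": "Offer"}

def valid(subject, relation, obj):
    """Integrity rule: a relation may only link instances whose
    types match the relation's declared domain and range."""
    if relation not in relations:
        return False
    dom, rng = relations[relation]
    return instances.get(subject) == dom and instances.get(obj) == rng

print(valid("offer1", "offeredBy", "store1"))    # → True
print(valid("usbCable", "offeredBy", "store1"))  # → False (wrong domain)
```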
Through these definitions, an ontology is able to
provide a common, shared understanding of concepts
in specific knowledge domains. This work uses a task
ontology designed to aid the extraction process
defined in the system architecture, which is
introduced in Section 4.
USING ONTOLOGIES TO PROSPECT OFFERS ON THE WEB