documents evoking "Bathtub overflow", "Bidet
plugging", "Shower too-low-flow" or "Wash basin
fissure". Thus the query should be able to include all
these alternatives, which are more precise than
"Sanitary equipment problem". The problem is how to
form the query and send it to the search engine.
In the ideal case, the user’s query, considered as a
list of concepts of the ontology, is replaced by a
system query that contains all descendant concepts
of all initial concepts in the user’s query. However,
Google limits the number of keywords in a query and,
unfortunately, in most real situations the number of
descendant concepts exceeds this limit.
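The ideal case can be illustrated with a minimal sketch.
The toy ontology, the function names and the value of
MAX_KEYWORDS are assumptions made for illustration only;
they are not taken from the OntoWatch implementation, and
the real Google limit is not stated here.

```python
# Hypothetical toy ontology: each concept maps to its direct sub-concepts.
ONTOLOGY = {
    "Sanitary equipment problem": [
        "Bathtub overflow", "Bidet plugging",
        "Shower too-low-flow", "Wash basin fissure",
    ],
}

MAX_KEYWORDS = 3  # assumed limit, chosen only to make the example overflow


def all_descendants(concept, ontology):
    """Collect every descendant of `concept` (children, grandchildren, ...)."""
    found = []
    stack = list(reversed(ontology.get(concept, [])))
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.append(current)
        stack.extend(reversed(ontology.get(current, [])))
    return found


def ideal_system_query(user_concepts, ontology):
    """Replace the user's concepts by all their descendants, without duplicates."""
    query = []
    for concept in user_concepts:
        for descendant in all_descendants(concept, ontology):
            if descendant not in query:
                query.append(descendant)
    return query


query = ideal_system_query(["Sanitary equipment problem"], ONTOLOGY)
exceeds_limit = len(query) > MAX_KEYWORDS  # the situation the text describes
```

Even with this very small ontology the expanded query is
longer than the assumed limit, which is why the ideal
algorithm alone is rarely usable.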
Thus, in our previous work (Cao et al, 2004),
besides implementing the algorithm for the ideal
situation above, we developed two algorithms that
overcome this Google limit by using the branches of
the domain ontology. Our evaluation of those
previous algorithms shows that they are more
appropriate for obtaining specialized documents than
general documents. However, they have the drawback
that they may privilege some branches and sometimes
retrieve documents not related to all the concepts
of the user’s query. Therefore we needed another
solution that would better ensure that the documents
retrieved by Google are related to all the concepts
of the user’s query. More precisely, the generated
system queries, each consisting of a list of
descendant concepts, should not privilege any
initial concept of the user’s query.
3.1 Balanced Concept Selection
The new algorithm of the OntoWatch system retrieves
and annotates documents that are related as closely
as possible to the initial concepts of the user’s
query. As mentioned before, the major difficulty in
using the ontology for searching and annotating
documents is the large number of descendant
concepts. Our solution is therefore not to take as
many descendant concepts as possible for each
initial concept, but rather to keep a balanced
distribution among the descendant concepts selected
from the different initial concepts of the user’s
query. As the number of concepts permitted in a
Google query is very small compared with the number
of descendant concepts available for selection, the
algorithm has to make several selections. In other
words, several queries corresponding to various
selections of concepts are generated.
To ensure a balanced distribution among the selected
descendant concepts, we rely on one criterion: the
number of descendant concepts permitted for an
initial concept depends on the weight of that
concept relative to the other concepts in the user’s
query.
Let Total_Desc be the number of all descendant
concepts of all initial concepts in the user’s
query, and Local_Desc(C) be the number of
descendant concepts of an initial concept C.
The weight of C is the value of
Local_Desc(C)/Total_Desc.
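The weight criterion can be made concrete with a short
sketch. Two points are assumptions of the sketch and are
not specified in the text: Total_Desc is computed as the
sum of the local counts (i.e. the initial concepts share
no descendants, and at least one descendant exists), and
the quota of an initial concept is the weight times the
keyword limit, rounded down, with a minimum of one.

```python
def concept_weights(local_desc):
    """Weight of each initial concept C: Local_Desc(C) / Total_Desc."""
    total_desc = sum(local_desc.values())  # assumes disjoint descendant sets
    return {concept: n / total_desc for concept, n in local_desc.items()}


def concept_quotas(local_desc, limit):
    """Assumed rule: quota = floor(weight * limit), but never less than 1."""
    total_desc = sum(local_desc.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    return {
        concept: max(1, n * limit // total_desc)
        for concept, n in local_desc.items()
    }


# Hypothetical query with three initial concepts A, B and C.
local_desc = {"A": 12, "B": 6, "C": 2}
weights = concept_weights(local_desc)
quotas = concept_quotas(local_desc, limit=10)
```

With these hypothetical counts the weights are 0.6, 0.3
and 0.1, and a limit of 10 keywords is split into 6, 3
and 1 descendant concepts. Because of the minimum of one,
the quotas may in general add up to slightly more than
the limit, so a real implementation would need a final
adjustment step.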
The presence in each generated query of at least
one descendant concept for each concept of the
user’s query avoids the drawbacks of the two
previous algorithms. For each initial concept, there
is a limit on the number of descendant concepts it
may contribute to the final query. The remaining
problem is the strategy for selecting descendant
concepts. Contrary to our previous algorithms, which
select concepts in the depth direction, for an
initial concept we select its descendant concepts in
the breadth direction. The result of this process is
a set of partial queries for each initial concept.
The combination of these partial queries gives us
the final set of system queries.
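The selection strategy can be sketched as follows. The
text does not say how partial queries are cut or
combined, so the sketch makes three assumptions: the
breadth-first list of descendants is cut into chunks of
the size allowed for the concept, a concept without
descendants contributes itself, and the partial queries
are combined by a Cartesian product. All names and the
toy ontology are hypothetical.

```python
from collections import deque
from itertools import product


def breadth_first_descendants(concept, ontology):
    """List the descendants of `concept` level by level."""
    ordered, seen = [], set()
    queue = deque(ontology.get(concept, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        queue.extend(ontology.get(current, []))
    return ordered


def partial_queries(concept, ontology, quota):
    """Cut the breadth-first list into chunks of at most `quota` concepts."""
    descendants = breadth_first_descendants(concept, ontology) or [concept]
    return [descendants[i:i + quota] for i in range(0, len(descendants), quota)]


def system_queries(user_concepts, ontology, quotas):
    """Combine one partial query per initial concept (assumed: all combinations)."""
    parts = [partial_queries(c, ontology, quotas[c]) for c in user_concepts]
    return [[term for chunk in combo for term in chunk] for combo in product(*parts)]


# Hypothetical ontology: A has two children and one grandchild, B has one child.
ontology = {"A": ["A1", "A2"], "A1": ["A11"], "B": ["B1"]}
queries = system_queries(["A", "B"], ontology, {"A": 2, "B": 1})
```

Here every generated query contains at least one
descendant of A and one of B, which is the balance
property the algorithm is designed to guarantee. A
Cartesian product grows quickly with the number of
partial queries, so another combination rule may be
preferable in practice.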
3.2 Detailed Description
BalancedOntologySearchAnnotation is the main
module. It searches for documents on the Internet
through the Google search engine, using the set of
queries generated by the algorithm. Then, for each
document in the results, the terms corresponding to
concepts of the ontology are extracted and stored in
a hash table whose key is the document URL, in order
to generate an RDF annotation of this document. By
comparing the URLs of the documents, all
redundancies are eliminated, and the system
aggregates the concept lists when the same document
is found by different queries. The module takes the
user’s query Request_u (which is in fact a list of
concepts) as input and produces annotations
represented in RDF as output.
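The URL-keyed aggregation described above can be
sketched as follows. The search and extraction functions
are passed in as parameters because the real Google
access and term extraction are not shown here; the
stand-ins fake_search and fake_extract, the example URLs
and the document format are all hypothetical.

```python
def search_and_annotate(queries, search, extract_concepts):
    """Map each document URL to the set of concepts found in that document."""
    annotations = {}
    for query in queries:
        for document in search(query):
            concepts = extract_concepts(document)
            # A URL already seen is not duplicated: its concept set is extended.
            annotations.setdefault(document["url"], set()).update(concepts)
    return annotations


def fake_search(query):
    """Hypothetical stand-in for the search engine call."""
    corpus = [
        {"url": "http://example.org/a", "text": "bathtub overflow in the flat"},
        {"url": "http://example.org/b", "text": "bidet plugging and bathtub overflow"},
    ]
    return [d for d in corpus if any(term in d["text"] for term in query)]


def fake_extract(document):
    """Hypothetical stand-in for ontology-based term extraction."""
    known_concepts = ["bathtub overflow", "bidet plugging"]
    return [c for c in known_concepts if c in document["text"]]


annotations = search_and_annotate(
    [["bathtub overflow"], ["bidet plugging"]], fake_search, fake_extract
)
```

The second document is returned by both queries but
appears only once in the result, with the union of its
concepts. Generating the RDF annotation from each entry
is left out of the sketch.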
Algorithm BalancedOntologySearchAnnotation(Request_u)
  Q ← GenerateQueries(Request_u)
  // Ann = hash table whose key is the URL of each document retrieved
  Ann ← ∅
  for each query q ∈ Q do
    Send q to Google
    for each document D in Google results do
      K ← ExtractConcept(D)
      if D.URL is not in Ann
        Add K and URL of D to Ann
ICEIS 2008 - International Conference on Enterprise Information Systems