words they contain while similarities between words
are themselves dependent on similarities between
the documents they are contained in. In our first
paper (Champclaux, 2007), we showed that using
only the structural similarity we proposed was not
sufficient to improve the performance of an IRS.
In a later paper (Champclaux, 2008), we presented
a different model that combines structural and
surface similarities, and we showed that our
SimRank measure, combined with the cosine, can
improve high precision. In this paper, we
experiment with our SimRank measure in
combination with a more effective measure,
namely Okapi.
The remainder of this paper is structured as
follows: Section 2 presents related work on graph
theory as used in IR and information management.
Section 3 describes our approach. Section 4 deals
with the evaluation of our method, and Section 5
comments on and discusses the results we obtain.
Finally, we conclude and give some perspectives
on our work.
2 RELATED WORK
The earliest paper on graph theory is said to be the
one by Leonhard Euler (Euler, 1736), in which he
discusses whether it is possible to stroll around
the town of Königsberg crossing each of its
bridges across the river exactly once; Euler gave
the necessary conditions for doing so. Two
centuries later, Claude Berge laid the groundwork
of the field in his book (Berge, 1958). From the
sixties to the present, graphs have been used to
model real-world problems, especially those
related to networks: electric circuits, biological
networks, social networks, transport networks,
computer networks, and the World Wide Web.
Modeling problems with graphs has paved the
way to new approaches to solving them. We use a
sub-field of graph theory, namely graph
comparison, to provide new solutions for IR.
In (Blondel, 2004), Vincent Blondel laid the
foundations of graph comparison in the context of
information management. Blondel's method
compares each node of one graph to every node of
another graph. The approach in (Blondel, 2004) is
presented as a generalization of Kleinberg’s method
(Kleinberg, 1999) which associates authority and
hub scores to web pages to enhance web search
accuracy. Blondel’s comparison is based on a
similarity measure that takes into account the
neighboring nodes of the compared nodes in the
graph they belong to. This makes it possible to
determine which node of a graph being analyzed
behaves like a given node of a graph considered as a
model. This method has been successfully applied to
web searching and synonym extraction (Blondel,
2004).
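The iteration behind Blondel's measure can be sketched as follows. This is a minimal illustration of the update rule from (Blondel, 2004), not the exact implementation used there: the function name, the uniform initialization, and the fixed iteration count are assumptions made for the example.

```python
import numpy as np

def blondel_similarity(A, B, iterations=20):
    """Blondel-style node similarity between two directed graphs.

    A (n_A x n_A) and B (n_B x n_B) are adjacency matrices.
    Entry S[i, j] scores how much node i of graph B behaves
    like node j of graph A, based on their neighborhoods.
    """
    n_A, n_B = A.shape[0], B.shape[0]
    S = np.ones((n_B, n_A))              # start from uniform similarity
    for _ in range(iterations):
        # compare out-neighborhoods (B S A^T) and in-neighborhoods (B^T S A)
        S = B @ S @ A.T + B.T @ S @ A
        S /= np.linalg.norm(S)           # Frobenius normalization
    return S
```

In (Blondel, 2004) convergence is established along the even iterates of this sequence; stopping after a fixed, even number of iterations is a simplification made here.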
More generally, graph comparison has been used
in many fields, such as the comparison of
biological networks using phylogenetic trees built
from metabolic pathway data (Heymans, 2003);
social network mapping and small-world
phenomena (Milgram, 1967)(Watts, 1999); and
chemical structure matching and the uncovering of
similar structures in a chemical database (Hattori,
2003). Our method
could be related to Latent Semantic Indexing
(Deerwester, 1990) or to neural networks (NN)
(Belew, 1989). Indeed, all these methods try to
capture the added value of document-term
interrelationships.
The LSI method decomposes the document-term
matrix into a product of three matrices, which
represents the information of the original matrix
in a different space where similar documents and
similar terms are closer, as a direct consequence
of the underlying space reduction. Our method creates
links between pairs of objects when each element of
the pair is related to an element of a previously
linked pair. As in LSI, we can build similarity
measures between documents, between terms, and
between documents and terms. The aim of our
method is not to reduce the representation space
but, rather, to find all the indirect similarities that
exist with the queries. Regarding NNs, the retrieval
mechanism is based on the neural activation that is
propagated from query nodes to document nodes
through the network synapses. In our approach,
term nodes and document nodes are directly
related to each other under indexing
considerations, whereas, in NN-based IR, they are
related to each other following an a priori
heuristic that may involve a hidden neuron layer.
In our approach there is a
back and forth calculation of documents and terms
similarities. At the initial step of our method, we
consider that the similarity between any pair of
separate documents (resp. terms) is nil; then we
evaluate the similarity between terms on the basis of
the similarity between documents they index
(calculated similarity). After this, we evaluate the
document to document similarity on the basis of the
previously calculated term similarities, and then
repeat these steps until convergence is reached.
This amounts to an automatic back-and-forth
refinement of the similarities.
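The back-and-forth refinement described above can be sketched as follows. This is an illustrative simplification under assumed choices (a binary document-term index matrix, Frobenius-norm rescaling, a fixed iteration count standing in for a convergence test), not the authors' exact SimRank formulation:

```python
import numpy as np

def refine_similarities(M, iterations=10):
    """Alternately refine term-term and document-document similarities.

    M: document-term index matrix (n_docs x n_terms).
    Distinct documents start with similarity 0 (identity matrix);
    each pass derives term similarities from document similarities
    and vice versa, as in the back-and-forth scheme described above.
    """
    n_docs, n_terms = M.shape
    S_doc = np.eye(n_docs)                   # initial step: separate documents are nil-similar
    S_term = np.eye(n_terms)
    for _ in range(iterations):
        # term similarity from the documents the terms index
        S_term = M.T @ S_doc @ M
        S_term /= np.linalg.norm(S_term)     # rescale to keep values bounded
        # document similarity from the refreshed term similarities
        S_doc = M @ S_term @ M.T
        S_doc /= np.linalg.norm(S_doc)
    return S_doc, S_term
```

Note how two documents sharing no term can still become similar through intermediate documents and terms, which is precisely the indirect similarity the method aims to capture.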
Sophisticated neural approaches (Mothe, 1994)
propagate and back-propagate neural activation
just once. Hyperspace Analog to Language (Burges,
ICEIS 2009 - International Conference on Enterprise Information Systems