Authors:
Yaël Champclaux
;
Taoufiq Dkaki
and
Josiane Mothe
Affiliation:
Université de Toulouse, France
Keyword(s):
Information Retrieval, Structural similarity, Graph comparison, Term weighting.
Related
Ontology
Subjects/Areas/Topics:
Enterprise Information Systems
;
Information Engineering Methodologies
;
Information Systems Analysis and Specification
;
Methodologies, Processes and Platforms
;
Model-Driven Software Development
;
Software Engineering
;
Systems Engineering
Abstract:
In this paper, we present a new similarity measure in the context of Information Retrieval (IR). The main objective of IR systems is to select relevant documents, related to a user’s information need, from a collection of documents. Traditional approaches for document/query comparison use surface similarity, i.e. the comparison engine uses surface attributes (indexing terms). We propose a new method which combines the use of both surface and structural similarities with the aim of enhancing precision of top retrieved documents. In a previous work, we showed that the use of structural similarity in combination with cosine improves bare cosine ranking. In this paper, we compare our method to Okapi based on BM25 on the Cranfield collection. We show that structural similarities improve average precision and precision at top 10 retrieved documents about 50%. Experiments also address the term weighting influences on system performances.In this paper, we present a graph-based model which be
longs to the vector space family. A vector space model considers each document as a vector in the term space. Each coordinate of a vector is a value representing the importance in a document or in a query of an indexing term. The vector space is defined by the set of terms that the system collects during the indexing phase. Many similarity measures such as Cosine, Jaccard, Dice… are used to determine how well a document corresponds to a query. Such measures determine local similarities between a document and a query on the basis of the terms they have in common. Our goal is to exploit another type of similarities called structural similarities. These similarities identify resemblances between elements on the basis of relationships they have. The structural relationship that we use originates from the fact that documents contain words and that words are contained in documents. The idea is to compare these documents through the similarities between the words they contain while similarities between words are themselves dependent on similarities between the documents they are contained in. In a previous paper, we have shown that the use of structural similarities alone was not sufficient to improve the performance of an IRS. In this paper, we present a different method that combines the use of both structural and surface similarities with the aim of enhancing high precision.
Surface similarity is computed as an Okapi measure. Selected documents are then stored in a graph then sorted using a SimRank-based score.
We call this 2-stages method OkaSim. We have performed different experiments with different term-weightings on the Cranfield Corpus and show that the structural similarities can improve an Okapi ranking. We show that those similarities can improve average precision more than 50% and precision at top 10 retrieved documents about 50% of an Okapi ranking.
Tests and experiments also address the term weighting influences on system performances.
(More)