COMBINING INDEXING METHODS AND QUERY SIZES
IN INFORMATION RETRIEVAL IN FRENCH
Désiré Kompaoré, Josiane Mothe
Institut de Recherche en Informatique de Toulouse, Université de Toulouse, France
Ludovic Tanguy
ERSS, Université de Toulouse, France
Keywords: Information retrieval, data fusion, indexing, information retrieval in French.
Abstract: This paper analyses three different indexing methods applied to French test collections (CLEF from
2000 to 2005): lemmas, truncated terms and single words. The same search engine with the same settings is
used regardless of the indexing method, to avoid variability in the analysis. When evaluated on the French
CLEF collections, indexing by lemmas is the best method, compared to single words and truncated terms.
We also analyse the impact of combining indexing methods using the CombMNZ function. As CLEF topics
are composed of different parts, we also examine the influence of these topic parts by comparing the results
when topic parts are considered individually and when they are combined. Finally, we combine both
indexing methods and query parts, and show that MAP can be improved by up to 8% compared to the best
individual method.
1 INTRODUCTION
Information retrieval is composed of various
processes that distinguish systems from each other.
Indexing aims at building a reduced representation of
the contents of documents and queries. Most indexing
techniques analyse the contents of texts and eliminate
stop words, keeping only representative descriptors.
Depending on the method used, these descriptors can
take several forms: the original terms from the
documents, stems, or lemmas. When considering
documents, these descriptors are generally weighted
in order to reflect how well a term represents the
document content and how well it can separate
relevant from non-relevant documents when a query
contains this term. The weighting function also helps
in comparing system performance. The searching
function (or model) is used to match the query and
document representations resulting from indexing, in
order to decide which documents to retrieve. This
function calculates the document scores and
characterizes systems; these scores indicate the degree
of possible relevance of the retrieved documents.
Specific studies focus on the influence of a given
parameter on retrieval effectiveness, varying that
parameter while keeping the others fixed. Other
works study the combination of various searches for
a given query in order to improve the results:
different representations of the queries (Fox and
Shaw, 1994), various search models (McCabe et al.,
1999), or various search systems (Hubert and Mothe,
2007).
Such studies are made possible by the existence
of test collections, such as those of TREC, CLEF, or
INEX, and of criteria to measure the effectiveness of
search engines. An evaluation collection is
composed of a set of documents, a set of topics, and
the set of documents judged relevant for each topic.
Evaluation is usually based on various criteria
computed with the trec_eval tool (trec.nist.gov).
Basically, evaluation is based on recall, which
measures whether the relevant documents are
retrieved, and on precision, which measures whether
the retrieved documents are relevant.
Our study focuses on monolingual information
retrieval in French and analyses three indexing
techniques, where the indexes are single terms
(document or query terms), truncated terms, or
lemmas. In a first step, we analyse the indexing modes
considered individually; in a second step, we
combine them with the CombMNZ function (Fox and
Shaw, 1994). We also study the influence of the query
size on the results. In this study, the query size is
based on the structure of the topics from evaluation
campaigns. The evaluation is carried out on six years
of the French ad-hoc CLEF collections.
This paper is organized as follows: Section 2
presents some related work. Section 3 presents the
various indexing modes we analyse. Section 4
presents the collections as well as the evaluation
criteria. Section 5 reports and discusses the results
obtained. We then conclude this paper and indicate
some directions for future work.
2 RELATED WORK
Some studies consider the various parts of a topic
when analysing system results. (Savoy, 2003)
indicates that, on the French CLEF 2000 collection
(34 queries), the majority of the ten systems studied
improve their mean average precision (MAP) when
the title and the description are taken into account:
the improvement over title only is 5% on average,
and rises to 10.21% when the complete topic is used.
Within the framework of the TREC Terabyte track,
(Metzler et al., 2005) show that average precision is
improved on average by 5% when the description is
added to the title, and by 10.4% when the narrative is
added to the title and description. (Ahlgren and
Kekäläinen, 2006) study the Swedish CLEF 2003
collection and various indexing strategies: four of the
seven combinations studied are based on a
morphological analyser, one is based on truncation,
and one on a stemming approach. Truncation gives
the best results.
3 DOCUMENT AND QUERY
INDEXING METHODS
Text indexing consists of two principal steps:
extracting the terms that characterize the document
contents and assigning a weight to each of these
indexing terms. This weight reflects how well the
term characterizes the document.
The three indexing approaches we study vary
according to the indexing unit. The weighting
function remains the same and is based on the BM25
function (Robertson et al., 1995). We use the
Mercure system (Boughanem et al., 1998) in these
experiments. Search is based on matching queries
and documents indexed by the same method.
3.1 Single Words Indexing
With this indexing method, stop words are
eliminated but accents are preserved. The remaining
single terms are the indexes; they thus correspond
exactly to the terms in the documents and queries.
This indexing mode, called SW indexing in this
paper, is expected to increase precision, since the
retrieved documents contain the terms exactly as
they appear in the query.
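A minimal sketch of SW indexing, using a toy stop-word list for illustration:

```python
STOP_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "à"}  # toy list

def sw_index(tokens):
    """Single-word indexing: drop stop words, keep terms (and accents) as-is."""
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]

print(sw_index(["Architecture", "à", "Berlin"]))  # -> ['architecture', 'berlin']
```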
3.2 Indexing by Truncated Terms
Truncated terms (TT) are used to conflate the
different forms of a term into a single one. This
should improve recall, since documents and queries
use various forms of the same terms. Using TT is a
simple way to reduce the morphological variation of
words. Several methods exist for generating
truncated terms: some are based on statistical
considerations, such as deleting the most frequent
word endings; others apply rules, as in Porter-like
stemming algorithms. In our approach, we truncate
terms at 7 characters, as suggested by (Denjean,
1989). Stop words and accents are also removed.
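A minimal sketch of TT indexing under these choices (7-character truncation, stop-word and accent removal); the stop-word list is again a toy placeholder:

```python
import unicodedata

STOP_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "à"}  # toy list

def strip_accents(word):
    """Drop accents via Unicode decomposition (TT indexing removes accents)."""
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if unicodedata.category(c) != "Mn")

def tt_index(tokens, cutoff=7):
    """Keep non-stop-word tokens truncated to their first `cutoff` characters."""
    return [strip_accents(t.lower())[:cutoff]
            for t in tokens if t.lower() not in STOP_WORDS]

# "architecture" and "architecturales" conflate to the same index "archite".
print(tt_index(["architecture", "à", "Berlin", "architecturales"]))
# -> ['archite', 'berlin', 'archite']
```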
3.3 Indexing by Lemmas
Indexing by lemmas (L) has the same objective as
stem indexing: to limit the risk of not retrieving
documents that contain other forms of the query
terms. However, lemma extraction is more precise
than word truncation. For example, “computers” and
“compute” are associated with the same radical
“compute” by truncation at 7 characters.
Lemmatization does not make this kind of association
and instead yields two distinct indexing terms
(“computer” as a singular noun and “compute” as a
verb).
In the experiments reported here, lemmas are
extracted using TreeTagger (H. Schmid; www.ims.uni-
stuttgart.de/projekte/corplex/TreeTagger); stop
words are removed, but accents are preserved, which
is consistent with retrieving lemmas.
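To make the contrast concrete, the sketch below replays the “computers”/“compute” example, with a toy dictionary standing in for TreeTagger’s lemma output:

```python
# Toy stand-in for TreeTagger output: surface form -> lemma.
TOY_LEMMAS = {"computers": "computer", "compute": "compute"}

def truncate7(word):
    """7-character truncation, as in TT indexing."""
    return word[:7]

for w in ("computers", "compute"):
    print(w, "-> TT:", truncate7(w), "| L:", TOY_LEMMAS[w])
# computers -> TT: compute | L: computer   (truncation conflates the two)
# compute   -> TT: compute | L: compute    (lemmas keep them distinct)
```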
4 EXPERIMENTS
4.1 Evaluation Collections
The evaluation is based on the French monolingual
ad-hoc CLEF collections from 2000 to 2005. We
chose these collections because they are the main
ones used for French.
The collection contains documents from ATS
(SDA) 1994 and 1995, and articles from the
newspaper Le Monde 1994. It also includes about 50
topics per year. Table 1 indicates various
characteristics of the collections and the number of
indexing terms resulting from the various indexing
methods. CLEF 2001 and CLEF 2002 use the same
document collection; for that reason, we consider a
single set of (49+50) topics on this collection.
Topics are composed of three parts (see Figure 1): a
title (noted T), which is limited to a few words; a
description (noted D), which explains the title in one
or two sentences; and a narrative (noted N), which
indicates what a relevant document is and possibly
what is not relevant. In our study, 9 queries can thus
be built from one topic, resulting from the three
types of indexes (SW, TT, L) crossed with the use of
T only, T+D, or T+D+N (see the sketch after
Figure 1).
<num> C001
<title> Architecture à Berlin
<desc> Trouver des documents au sujet de
l'architecture à Berlin.
<narr> Les documents pertinents parlent, en général,
des caractéristiques architecturales de Berlin ou, en
particulier, de la reconstruction de certaines parties de
cette ville après la chute du mur.
Figure 1: Example of a CLEF topic (in French).
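The three query sizes can be assembled from a parsed topic as sketched below; the dict-based representation is an illustrative assumption, since actual CLEF topics use SGML-like tags as in Figure 1:

```python
# Fields of the Figure 1 topic, assumed already parsed from the SGML-like tags.
topic = {
    "title": "Architecture à Berlin",
    "desc": "Trouver des documents au sujet de l'architecture à Berlin.",
    "narr": "Les documents pertinents parlent des caractéristiques "
            "architecturales de Berlin.",
}

# One query string per query size: T, T+D, T+D+N.
queries = {
    "T": topic["title"],
    "T+D": " ".join((topic["title"], topic["desc"])),
    "T+D+N": " ".join((topic["title"], topic["desc"], topic["narr"])),
}
print(queries["T+D"])
```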
4.2 Evaluation and Methodology
Trec_eval (trec.nist.gov) is a program used to
evaluate system performance according to a number
of measures. In this study, we use the following
measures:
- MAP (Mean Average Precision): the average
precision for a topic is the mean of the precision
values obtained each time a new relevant document
is retrieved; MAP is the mean of these average
precisions over a topic set.
- P5 (precision at 5 documents): the proportion of
relevant documents among the first 5 retrieved
documents, averaged over the topics.
MAP is used for global comparisons (Voorhees,
2007). P5, on the other hand, is a good indicator of
user satisfaction, since users generally look at the
top retrieved documents.
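As a toy illustration of these two measures (a simplified re-implementation, not the trec_eval code):

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values observed at each relevant document."""
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Proportion of relevant documents among the first k retrieved."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

run = ["d3", "d7", "d1", "d9", "d4"]     # toy ranked result list
qrels = {"d3", "d9"}                     # toy relevance judgments
print(average_precision(run, qrels))     # (1/1 + 2/4) / 2 = 0.75
print(precision_at_k(run, qrels, k=5))   # 2/5 = 0.4
```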
In a first step, we analyze the impact of the
indexing methods on system performance. These
indexing methods are applied successively to the T,
T+D, and T+D+N topic fields; the results are then
compared. In a second step, we combine the
different types of indexing (SW, TT, L), using the
CombMNZ function to merge the results.
The CombMNZ function (Fox and Shaw, 1994)
is widely used in data fusion studies (Beitzel et al.,
2004). Formula (1) indicates how the CombMNZ
function calculates the score of a document j after
fusion, taking two parameters into account:
- the score of the document in each fused result list,
- the number of systems that retrieved the document.

    ScoreCombMNZ_j = Count_j × Σ_{i=1}^{nbre_syst} Score_{ij}    (1)

where Score_{ij} is the score calculated by system i
for document j, and Count_j is the number of fused
systems that retrieve document j.
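A minimal sketch of CombMNZ, assuming each run is represented as a dict mapping document ids to scores; note that in practice scores are often normalized per run before fusion, a step the paper does not detail:

```python
def comb_mnz(runs):
    """CombMNZ fusion (Fox and Shaw, 1994): sum of a document's scores
    across runs, multiplied by the number of runs that retrieved it."""
    fused = {}
    for run in runs:                      # each run: {doc_id: score}
        for doc_id, score in run.items():
            total, count = fused.get(doc_id, (0.0, 0))
            fused[doc_id] = (total + score, count + 1)
    return {doc_id: total * count for doc_id, (total, count) in fused.items()}

# Fusing toy SW, TT and L result lists for one query:
runs = [{"d1": 0.9, "d2": 0.4}, {"d1": 0.7, "d3": 0.5}, {"d1": 0.8, "d2": 0.3}]
print(comb_mnz(runs))  # {'d1': 7.2, 'd2': 1.4, 'd3': 0.5}
```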
5 RESULTS
5.1 Results without Combinations
Table 2 indicates the mean average precision (MAP)
and precision at 5 documents (P5) on the sets of
topics. In this table, T+D sections of the topics are
considered and the three methods of indexing are
used independently on these sections. We chose to
present first these results since using T+D is the
most used method in CLEF by participants. Values
in bold font indicate the best results for a measure
for a given year. The line entitled “average”
indicates the MAP and P5 value for each method,
averaged over years. The line entitled “Var. in %”
indicates the variation in percentage of performance.
The baseline is the SW indexing method.
Table 1: Characteristics of the test collections used.

                      CLEF 2000   CLEF 2001/2002   CLEF 2003   CLEF 2004   CLEF 2005
Number of topics      34          49+50            52          49          50
Number of documents   44 013      87 191           129 806     90 261      177 452
Number of SW          295 156     339 879          390 742     387 386     563 199
Number of TT          195 047     228 354          270 281     290 646     410 155
Number of lemmas      246 400     290 037          340 887     340 811     509 475
Table 2: Results of each indexing method – Title and description.

              SW              TT              L
Year          MAP     P5      MAP     P5      MAP     P5
2000          0.3946  0.4118  0.4067  0.4000  0.4333  0.4588
2001/2002     0.3916  0.4524  0.4401  0.5168  0.4326  0.4786
2003          0.4888  0.4654  0.5043  0.4615  0.4831  0.4731
2004          0.4174  0.4612  0.4311  0.4408  0.4479  0.4612
2005          0.2827  0.4400  0.3125  0.4840  0.3241  0.4840
Average       0.3950  0.4462  0.4190  0.4606  0.4242  0.4711
Var. in %     -       -       +6.06   +3.24   +7.39   +5.60
Table 3: Results of each indexing method – Title only.

              SW              TT              L
Year          MAP     P5      MAP     P5      MAP     P5
2000          0.4002  0.4059  0.3963  0.3941  0.4107  0.4235
2001/2002     0.3350  0.3859  0.3933  0.4324  0.3939  0.4346
2003          0.4212  0.4000  0.4567  0.4192  0.4382  0.4192
2004          0.3644  0.3918  0.3995  0.4286  0.3962  0.4245
2005          0.2158  0.4160  0.2867  0.4760  0.2948  0.4920
Average       0.3473  0.3999  0.3865  0.4301  0.3868  0.4388
Var. in %     -       -       +11.28  +7.54   +11.36  +9.71
Var. in % TD  -12.08  -10.36  -7.74   -6.63   -8.83   -6.87
For four of the five collections, lemma-based
indexing outperforms the other methods. For
example, on the CLEF 2000 collection, the MAP is
improved by approximately 7% compared to SW
indexing. Lemmas are the most effective indexes for
all CLEF collections except 2001/2002. SW indexing
is never the best choice: TT and L both improve
MAP and P5. More precisely, on average, MAP is
improved by 6.06% (resp. 7.39%) compared to SW,
and P5 by 3.24% (resp. 5.6%). Regarding statistical
significance, we use the Wilcoxon test with a p-value
< 0.05. TT is significantly better than SW; however,
the difference between L and TT indexing is not
statistically significant.
Table 3 shows the results obtained by each
indexing method on the title section of the topics.
The additional line (Var. in % TD) indicates the
average variation observed compared to the T+D
results of Table 2 (same measure and same indexing
units). When only titles are considered, the
performances are overall lower than when the
description section is also used. This result is not
surprising, since the description enriches the title
and thus potentially makes it possible to better meet
the user’s need. On average, the best method is L
indexing, and SW remains the worst; L or TT
indexing is on average better than SW. Over all
collections, L indexing improves MAP by 11.36%
on average compared to SW when titles only are
considered, whereas this improvement is 7.39%
when T+D is considered. This means that the
indexing method has a greater impact on short
queries. Again, regarding statistical significance, TT
is better than SW, but L and TT do not differ
significantly.
When complete topics are considered, there is no
notable improvement of the best results compared to
searches in which queries are built from the title and
description sections. Lemma-based indexing is still
the best method and obtains about the same results
as with T+D, whatever the measure (MAP or P5),
when averaged over the years. However, SW
indexing benefits from the enrichment of the query
by the N section (+3.40% for MAP and +5.14% for
P5) compared to T+D. Detailed results using
T+D+N are not presented here.
The comparative results obtained with the
various query sections tend to indicate that a more
semantic indexing (by lemmas) is especially
effective on short queries (title only). Indeed, it is in
this setting that we obtain the greatest variations
between the indexing techniques, with L indexing
coming out on top.
Table 4: Results of each indexing method – Combining the query sizes.

              SW              TT              L
Year          MAP     P5      MAP     P5      MAP     P5
2000          0.4320  0.4529  0.4354  0.4294  0.4568  0.4765
2001/2002     0.4352  0.5229  0.4516  0.4972  0.4740  0.4352
2003          0.5041  0.4962  0.5309  0.4962  0.5392  0.4846
2004          0.4331  0.4612  0.4472  0.4735  0.4442  0.4531
2005          0.2914  0.4800  0.3386  0.5360  0.3454  0.5000
Average       0.4192  0.4826  0.4407  0.4865  0.4519  0.4699
Var. % vs TD  +6.13   +8.16   +5.18   +5.62   +6.53   -0.25
Table 5: MAP and P5 – Results obtained when combining indexing methods and query sizes.

              MAP                           P5
Year          L       Comb.   %             L       Comb.   %
2000          0.4333  0.4647  +7.25%        0.4588  0.4464  -2.70%
2001/2002     0.4326  0.5336  +23.35%       0.4786  0.4635  -3.16%
2003          0.4831  0.5115  +5.88%        0.4731  0.5355  +13.19%
2004          0.4479  0.4497  +0.40%        0.4612  0.4571  -0.89%
2005          0.3241  0.3477  +7.28%        0.4840  0.5280  +9.09%
Average       0.4242  0.4614  +8.78%        0.4711  0.4861  +3.18%
5.2 Combination of Indexing Methods
In this section, we study the combination of the
indexing methods and the influence of these
combinations on performance. The variability of
MAP according to the indexing method suggested
that combining them would be relevant: depending
on the topic, this variation can reach 600%.
For this reason, we apply CombMNZ to pairs of
methods and also to the combination of the three
indexing methods. Surprisingly, the results showed
that a single method (without fusion) obtains a MAP
comparable to or higher than any of the
combinations. The only exception occurs for the
CLEF 2003 collection, for which a combination
(SW + TT + L) improves the results compared to the
best of the simple techniques. When averaged over
the years, the combination of the three indexing
methods does not improve MAP compared to the
best single indexing, by lemmas.
If we analyse the precision at 5 documents, we
can likewise always find a single indexing method
that performs at least as well as any of the
combinations. Even if this best simple indexing
method is not the same on every collection, on
average the non-fused L indexing remains the best.
The details of these results are not displayed in this
paper.
5.3 Combination of Query Sizes
The combination of query sizes consists of fusing
the results obtained using a single index extraction
method but considering the three versions of the
topic: T, T+D, and T+D+N. This means, for
example, that when considering SW, we fuse the
results of three runs: one using T only, one using
T+D, and one using all three parts of the topic (see
the sketch below). Table 4 presents the results of
this combination. On average, the best results are
obtained using L indexing.
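In terms of the comb_mnz sketch of Section 4.2, this fusion for one topic under SW indexing looks as follows (the scores are illustrative placeholders; real runs come from Mercure):

```python
# Toy result lists for one topic under SW indexing, one per query size.
run_t = {"d1": 0.62, "d2": 0.40}                 # query built from T only
run_td = {"d1": 0.71, "d3": 0.55}                # query built from T+D
run_tdn = {"d1": 0.66, "d2": 0.35, "d3": 0.50}   # query built from T+D+N

fused_sw = comb_mnz([run_t, run_td, run_tdn])    # comb_mnz from Section 4.2
print(sorted(fused_sw.items(), key=lambda kv: -kv[1]))
# [('d1', 5.97), ('d3', 2.1), ('d2', 1.5)]
```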
In addition, whatever the collection and
whatever the indexing method, system performance
is improved. This performance is compared to single
runs and to fused runs that consider T+D only (cf.
Table 2 and the last line of Table 4). Combining
query sizes also improves the results compared to
those obtained with the title section only (cf. Table 3,
average line) and compared to those obtained with
T+D+N, except for P5 when indexing by lemmas.
Apart from a few cases, the improvements are
statistically significant.
5.4 Combination of Query Sizes
and Indexing Methods
In this section, we investigate the effect of
combining both the indexing methods and the query
sizes (3 query sizes × 3 indexing methods); see
Table 5. This combination also slightly improves
high precision (P5). The baseline for the analysis is L
indexing, which is the best non-combined method on
average. MAP is improved by more than 8%. The
results are better than those of each of the three
preceding combinations and, apart from a few cases,
the improvements are statistically significant.
6 CONCLUSIONS
In this study, we consider French monolingual
information retrieval and analyze the influence of
different indexing methods and query sizes. We
analyzed three indexing methods: single words,
truncated terms, and lemmatization. Our experiments
were run on the French CLEF collections from 2000
to 2005, and we have shown experimentally that
lemma-based indexing is the most effective single
method when one is interested in mean average
precision and high precision. The difference is more
than 7% for MAP and about 6% for P5 (statistically
significant). We have also shown that it is relevant to
combine the results obtained using different query
sizes (using the various sections of the topics). In
contrast, combining the various indexing methods
does not significantly improve the results.
Future work will focus on the contextual
combination of the indexing methods and of the
query sizes. Indeed, one can expect that a query
whose terms have many variants but belong to
distinct concepts should rather be indexed by
lemmas. On the other hand, a query for which few
documents are retrieved would benefit from being
indexed by truncated terms, in order to expand it.
Recent works in the literature have studied how to
predict query difficulty in terms of recall and
precision. Others have been interested in predicting
the possibility of finding relevant documents in the
collection. Our future work will consider these
aspects: system fusion will be based on some
knowledge of the difficulty of the query and on other
elements of its context, in order to decide which
sections of the query to consider (query size) and
which indexes should be used (either a single type or
a combination).
ACKNOWLEDGEMENTS
This work has been partially supported by the
French ANR through the TCAN program (ARIEL
project) and by the European Commission through
the WS-Talk project under the 6th FP (COOP-
006026). However, the views expressed herein are
ours alone.
REFERENCES
Ahlgren P. and Kekäläinen J., 2006. Swedish full text
retrieval: Effectiveness of different combinations of
indexing strategies with query terms, Information
Retrieval journal, 9(6): 681-697.
Beitzel S.M. et al., 2004, Fusion of Effective Retrieval
Strategies in the Same Information Retrieval System.
JASIST, 55(10): 859-868.
Boughanem M., Dkaki T., Mothe J., Soulé-Dupuy C.,
1998, Mercure at trec7, NIST 500-242, 413-418.
Denjean P., 1989, Interrogation d’un systeme videotex :
l’indexation automatique des textes, PhD dissertation,
Université de Toulouse, France.
Fox E.A. and Shaw J.A., 1994, Combination of Multiple
Searches, TREC-2, NIST 500-215, 243-252.
Hubert G. and Mothe J., 2007, Relevance feedback as an
indicator to select the best search engine, ICEIS 2007,
184-189.
Kompaoré N. D. and Mothe J., 2007, Probabilistic fusion
and categorization of queries based on linguistic
features, ACM PIKM, 63-68.
Lee J., 1997, Analysis of multiple evidence combination,
ACM SIGIR, 267-276.
Lu X. A. and Keefer R. B., 1994, Query
Expansion/Reduction and its Impact on Retrieval
Effectiveness, NIST 500-225, TREC-3, 231-240.
McCabe M. C., Chowdhury A., Grossman D.A., Frieder
O., 1999, A unified Environment for Fusion of
Information Retrieval, ACM CIKM, 330-334.
Metzler D., Strohman T., Zhou Y., Croft W. B., 2005,
Indri at TREC 2005: Terabyte Track.
Mothe J. and Tanguy L., 2005, Linguistic features to
predict query difficulty, SIGIR wkshop on Predicting
Query Difficulty - Methods and Applications.
Robertson S E, et al., 1995, Okapi at TREC-3, Overview
of the Third Text REtrieval Conference, 109-128.
Savoy J., 2003, Cross-language information retrieval:
experiments based on CLEF 2000 corpora, IPM, V. 39,
75-115.
Voorhees, E.M., 2007. Overview of TREC 2006, NIST,
MD 20899, 1-16.