QUERY EXPANSION WITH MATRIX CORRELATION
TECHNIQUES
A Systematic Approach
Claudio Biancalana, Antonello Lapolla and Alessandro Micarelli
Department of Computer Science and Automation, Artificial Intelligence Laboratory
Roma Tre University, Via Della Vasca Navale, 79, 00146 Rome, Italy
Keywords:
Information Retrieval, Query Expansion, Personalization.
Abstract:
This paper presents an Information Retrieval system that employs techniques based on Personalization and
Query Expansion (QE). The system was developed incrementally and iteratively, starting from a simple system
and moving toward more complex ones, to the point that it is possible to speak of several systems, each based
on a specific, thoroughly analyzed approach: four systems sharing the concept of term co-occurrence. Starting from
a simple system based on bigrams, we moved on to a system based on term proximity, through an approach
known in the literature as Hyperspace Analogue to Language (HAL), and eventually developed a solution
based on co-occurrence at page level. The last system adopts a hybrid approach that also exploits term proximity.
To our knowledge, this architecture is presented here for the first time.
1 INTRODUCTION
The considerable quantitative increase in the amount of documents on the World Wide Web has led to a scenario in which disorganization has gained the upper hand, due to the many different languages in which documents are written, typically drafted by a huge number of authors on the Web. This leads to the need to support the user more effectively in retrieving information on the Web. Users easily run into problems when retrieving information by means of a simple Boolean search that checks for the presence of the searched-for term in web texts (Jansen et al., 2000). Indeed, texts consisting of terms that are synonyms, or that are only related to similar topics, do not allow a proper search to be conducted when only a few terms are taken into consideration, terms which may be input by a user who is likely to have little or no experience in on-line searching. The Query Expansion (QE) technique fits into this disordered scenario to support users in their searches and to widen the search domain, so as to include sets of words that are somehow linked to the terms the user specified in his/her query (Bai et al., 2005). These may be simple synonyms, or terms that are apparently unconnected at the syntactic level but are nevertheless linked to a context similar or identical to the one expressed by the original search provided by the user (Burgess et al., 1999). Such information may be obtained in several ways, the main difference being the source used to obtain it: it can be retrieved through the preferences explicitly indicated by the user, through the user's interaction with the system (Gasparetti and Micarelli, 2007), through the incremental collection of information that links query terms to document terms (Gao et al., 2004), for instance search session logs (Anick, 2003), or by means of a simple syntactic analysis of the phrase forms that compose the documents (Teevan et al., 2005; Radlinski and Joachims, 2005).
This paper is organized as follows. Section 2 is an overview of the implemented systems. In Section 3 we introduce the general architecture of the developed systems. Section 4 presents our experimental setup and gives a detailed description of the results. Finally, in Section 5 we illustrate our conclusions.
2 THE SYSTEMS
The systems we present in this work are based on the
same approach: the query input by the user into the
search engine is expanded through terms linked with
the content of previously visited web pages, hence pertaining to the user's information needs. The QE process is thus accompanied by a mechanism that builds the user model. The QE process itself is based on the automatic expansion of the query using implicit feedback, formed by the pages previously visited by the user. Moreover, the developed systems can be classified as global analysis techniques, since no preliminary search is performed with the original query.
The personalization process and the construction of the data structures for the user model are thus incremental, so as to adapt dynamically to the user's changing interests. All the developed systems build the user model around the concept of term co-occurrence (Schütze and Pedersen, 1997). By co-occurrence we mean the extent to which two terms tend to appear simultaneously in the same context.
In this research we implemented and evaluated four systems, each characterized by two or more different versions. The four systems differ in the way they define the co-occurrence between two terms.
System I: this is the most straightforward system among the implemented ones, both conceptually and computationally. The user model is built around the concept of bigrams, namely pairs of adjacent terms in the text of a web page. Two terms are considered co-occurring only if they are adjacent. The context of a term is thus exclusively limited to the terms directly next to it, either to the left or to the right;
System II: this system is based on the Hyperspace Analogue to Language approach, in which the context of a term is expanded to a window of N adjacent terms. Given a window of N terms that can be slid across the page text, two terms are considered co-occurring only if they fall within such a window. The co-occurrence value is inversely proportional to the distance between the two terms within the window;
System III: in this system, the context of a term is expanded to the entire page. Two terms are deemed co-occurring only if they are both present, simultaneously, in the same page;
System IV: this is a hybrid system, where the co-occurrence of two terms takes into account not only the frequency with which the two terms appear in the same document, but also the distance between the two terms within the text. It is therefore a system which tries to find a compromise between the approaches used in the previous three systems.
3 GENERAL ARCHITECTURE
OF THE DEVELOPED
SYSTEMS
The four developed systems share the same architecture for creating the user model and executing the QE. They differ in the method adopted to create the co-occurrence matrix. The main architectural elements are the following:
3.1 Creation of the User Model
1. For every training link, the corresponding HTML page is downloaded, and the textual information is extracted from it through a parser, the purpose of which is to get rid of HTML tags;
2. the extracted textual information is analyzed with a part-of-speech (POS) tagger, MontyLingua (http://web.media.mit.edu/hugo/montylingua), which performs a linguistic analysis of the text. The text is broken down into sentences, in which nominal and verbal phrases are tracked down; each term is then tagged as adjective, noun, proper noun or preposition;
3. the terms included in the stop-word list (a list of words that, owing to their high frequency within text documents, are considered irrelevant for Information Retrieval) are removed from the textual information analyzed by the POS tagger;
4. the textual information previously analyzed by the POS tagger and cleansed of stop words then undergoes stemming, by means of Porter's algorithm (Porter, 1997). The stemming algorithm makes it possible to trace terms with the same root back to the same word;
5. the textual information is analyzed according to the chosen system, taking into consideration or not the extra information gathered by the POS tagger. In all the implemented systems the user model consists of a co-occurrence matrix where, depending on the case, co-occurrence is computed over bigrams, sentences, nominal phrases or whole documents.
In all the implemented systems a co-occurrence matrix is derived from each page. The rows of these matrices are normalized so as to obtain comparable values. The matrices of the single pages are added incrementally, in order to form one single co-occurrence matrix for the entire corpus.
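To illustrate this accumulation step, the following sketch (our own illustration, not the authors' code; the function and variable names are hypothetical) shows how a per-page co-occurrence matrix, represented as a dictionary of dictionaries, can be row-normalized and then added into an incrementally growing corpus matrix:

from collections import defaultdict

def normalize_rows(matrix):
    """Scale every row so that its values sum to 1, making pages comparable."""
    normalized = {}
    for term, row in matrix.items():
        total = sum(row.values())
        if total > 0:
            normalized[term] = {t: v / total for t, v in row.items()}
    return normalized

def add_into_global(global_matrix, page_matrix):
    """Incrementally accumulate a page's normalized matrix into the corpus matrix."""
    for t1, row in page_matrix.items():
        for t2, value in row.items():
            global_matrix[t1][t2] += value

# Usage: the global user-model matrix grows as new pages are visited.
user_model = defaultdict(lambda: defaultdict(float))
page_matrix = {"query": {"expans": 2.0, "term": 1.0}}  # hypothetical per-page counts
add_into_global(user_model, normalize_rows(page_matrix))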
3.2 Execution of the Expansion
1. Given a query Q consisting of n terms $q_i$, $i = 1, \ldots, n$, for each term the corresponding stemmed term $q'_i$ is computed, hence obtaining the new query $Q'$, represented as the vector $\langle q'_1, \ldots, q'_n \rangle$;
2. for each term $q'_i$, $i = 1, \ldots, n$, belonging to query $Q'$, the corresponding co-occurrence vector $\vec{cv}_{q'_i} = \langle c_1, \ldots, c_m \rangle$ is taken from the co-occurrence matrix, where $m$ is the number of stemmed terms found in the training corpus, a value corresponding to the size of the co-occurrence matrix;
3. given a stemmed query term $q'_j$, $j = 1, \ldots, n$, and the corresponding co-occurrence vector $\vec{cv}_{q'_j} = \langle c_1, \ldots, c_m \rangle$, the latter can be weighted by means of a further co-occurrence measure, indicated as c_index. Each component $c_i$, $i = 1, \ldots, m$, is replaced by the value $c_i \times c\_index(t_i, q'_j)$, with $t_i$ standing for the term corresponding to the co-occurrence value $c_i$. For two terms $a$ and $b$:
$$c\_index(a, b) = \frac{n_{ab}}{n_a + n_b - n_{ab}}$$
where $n_{ab}$ stands for the number of documents in the training corpus in which words $a$ and $b$ are both present, while $n_a$ and $n_b$ indicate the number of documents in the training corpus in which word $a$ and word $b$ are present, respectively. The c_index of two words therefore increases with the frequency with which the two words appear together in documents rather than alone. Hence, given a query term and the corresponding co-occurrence vector, the use of the c_index tends to consolidate co-occurrences with words that usually appear together with the query term, and rarely alone.
4. the co-occurrence vectors of the query terms, possibly weighted with the c_index, are added up to obtain a single vector representing the terms that most strongly co-occur with all the query terms;
5. once the vector $\vec{cv}_{Q'} = \langle c_1, \ldots, c_m \rangle$ referring to query $Q'$ has been obtained, each co-occurrence value $c_i$, $i = 1, \ldots, m$, can be weighted with the Inverse Document Frequency (IDF) (Salton et al., 1975) of the corresponding term $t_i$. The IDF of each term was calculated beforehand over the entire corpus;
6. the components of the vector $\vec{cv}_{Q'} = \langle c_1, \ldots, c_m \rangle$ referring to $Q'$ are ordered by decreasing co-occurrence value, removing the terms whose co-occurrence value is 0.0. Starting from the vector $\vec{cv}_{Q'}$, we obtain the vector of ordered pairs $\vec{cvPair}_{Q'} = \langle (t_1, c_1), \ldots, (t_s, c_s) \rangle$, where $s$ stands for the number of terms co-occurring with the query terms;
7. the stemmed terms present in the vector of ordered
pairs are replaced by the original terms through
the stemming table, provided that they are not al-
ready present in the starting query Q. The origi-
nal terms are given the co-occurrence value of the
corresponding stemmed term;
8. the expansion query relating to query Q is obtained by taking the first n terms of the vector of ordered pairs $\vec{cvPair\_ex}_{Q}$. The original query Q terms are then given a co-occurrence value of 1.0 and added to the first n terms. Each query term is weighted according to its co-occurrence value;
9. the expanded query is then submitted to the search engine, which searches for the pages most pertinent to the query. The use of co-occurrence values in the query makes it possible to assign a greater weight to the words with higher co-occurrence values. A sketch of the whole expansion procedure is given below.
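The following compact sketch illustrates steps 1 to 9 under the assumptions stated above (Jaccard-style c_index over document frequencies, IDF weighting, top-n selection). It is our own illustration rather than the authors' code; names such as cooc_matrix, doc_freq and expand_query are hypothetical, and the replacement of stemmed terms with their original forms (step 7) is omitted for brevity:

def c_index(a, b, doc_freq, co_doc_freq):
    """Jaccard-style co-occurrence index over document frequencies."""
    n_ab = co_doc_freq.get((a, b), co_doc_freq.get((b, a), 0))
    denom = doc_freq.get(a, 0) + doc_freq.get(b, 0) - n_ab
    return n_ab / denom if denom > 0 else 0.0

def expand_query(query_terms, cooc_matrix, idf, doc_freq, co_doc_freq, stem, n=10):
    """Return (term, weight) pairs forming the expanded query."""
    stemmed = [stem(q) for q in query_terms]                          # step 1
    combined = {}
    for q in stemmed:
        row = cooc_matrix.get(q, {})                                  # step 2
        for t, c in row.items():
            weighted = c * c_index(t, q, doc_freq, co_doc_freq)       # step 3
            combined[t] = combined.get(t, 0.0) + weighted             # step 4
    for t in combined:
        combined[t] *= idf.get(t, 0.0)                                # step 5
    ranked = sorted(((t, w) for t, w in combined.items() if w > 0.0),
                    key=lambda p: p[1], reverse=True)                 # step 6
    expansion = [(t, w) for t, w in ranked if t not in stemmed][:n]   # step 8 (step 7 omitted)
    return [(q, 1.0) for q in query_terms] + expansion                # original terms weighted 1.0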
It is interesting to notice that, given two terms $t_1$ and $t_2$, their similarity value can be read directly from the co-occurrence matrix, by considering the element $(t_1, t_2)$ or $(t_2, t_1)$, given the symmetry of the matrix itself.
What follows is a description of the main characteristics of the individual systems implemented.
3.3 Bigram-based System (System I)
This is the simplest QE system among the implemented ones. It is based on a very simple approach: that of limiting the context of a word to its two adjacent words, to the left and to the right. Each word thus forms two pairs, one with the word to its right and one with the word to its left. Given a document, the terms contained in it are stemmed, and all the pairs of adjacent words, known as bigrams, are searched for. The pair (a, b) is considered equal to the pair (b, a), while pairs such as (a, a), where the two terms are identical, are left out. The final co-occurrence value is equal to the number of times the two words are adjacent in the document. For each document, a co-occurrence matrix is then built, whose rows are normalized. Finally, all the matrices of the training documents are summed up to obtain the co-occurrence matrix.
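A minimal sketch of the bigram counting step, assuming the document is already available as a list of stemmed terms (the function name is hypothetical; it produces the symmetric per-document matrix that is later row-normalized and summed as described above):

from collections import defaultdict

def bigram_cooccurrences(stemmed_terms):
    """Count adjacent pairs; (a, b) and (b, a) share the same value, (a, a) is skipped."""
    matrix = defaultdict(lambda: defaultdict(float))
    for left, right in zip(stemmed_terms, stemmed_terms[1:]):
        if left == right:
            continue
        matrix[left][right] += 1.0   # store both orientations so the matrix stays symmetric
        matrix[right][left] += 1.0
    return matrix

# e.g. bigram_cooccurrences(["queri", "expans", "queri", "model"])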
3.4 HAL-based System (System II)
Considering the limits of the bigram-based approach, with reference to the small size of a term's context, exclusively associated with the adjacent terms, we decided to expand this context to a window of N terms, using the Hyperspace Analogue to Language approach (Bruza and Song, 2002). The co-occurrence matrix is generated as follows: given a term, its co-occurrence is calculated with the N terms to its right (or to its left). In particular, given a term $t$ and the window of $N$ terms to its right (or left), $f_t = \{w_1, \ldots, w_N\}$, the pair $(t, w_i)$ is assigned a co-occurrence value inversely proportional to the distance $i$, i.e. proportional to $\frac{1}{i}$, for $i = 1, \ldots, N$. During the testing phase, N was given a value of 10. As in the bigram-based approach, pair (a, b) is equal to pair (b, a): hence even in this case the co-occurrence matrix is symmetric. For each of the training documents a co-occurrence matrix is generated, whose rows are then normalized. The matrices of the single documents are then summed up, generating one single co-occurrence matrix representing the entire training corpus. In a further version, the text is broken down into nominal expressions, as before, but instead of gathering all the terms into one single document, the breakdown into nominal expressions is kept intact, and the HAL algorithm is applied separately to the single nominal expressions of the document. We want to ascertain if and how much the addition of semantic information, such as the breakdown into nominal expressions, can help enhance performance, while still applying the weighting of co-occurrences based on the joint use of IDF and c_index.
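A sketch of the HAL window step, again over a list of stemmed terms. The 1/distance weighting is our reading of "inversely proportional to the distance", consistent with the 1.0/i increment used later in System IV; the function name and the window default are illustrative only:

from collections import defaultdict

def hal_cooccurrences(stemmed_terms, window=10):
    """For each term, look at the next `window` terms; closer terms receive larger values."""
    matrix = defaultdict(lambda: defaultdict(float))
    for pos, term in enumerate(stemmed_terms):
        for dist, other in enumerate(stemmed_terms[pos + 1:pos + 1 + window], start=1):
            if other == term:
                continue
            value = 1.0 / dist            # assumed decay, inversely proportional to distance
            matrix[term][other] += value  # symmetric, as in the bigram-based system
            matrix[other][term] += value
    return matrix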
3.5 System based on Co-occurrence at
Page Level (System III)
The systems implemented so far base the construction of the co-occurrence matrix on the proximity of words: in the case of bigrams, co-occurrence is limited to two adjacent words, while in the HAL-based approach it is extended to a window of N terms. Both methods take advantage of the concept of word proximity: the closer two words are in the text, the higher the probability that they are semantically linked. In the approach we are about to describe, we decided to pursue a totally different method for building the co-occurrence matrix, which makes it possible to overcome the limit of considering two terms co-occurring only if they are close to each other in the text. Indeed, we implemented a system which exploits co-occurrence at page level, namely one that tries to track down the pairs of words that usually co-occur within the same training document, regardless of the distance between them; each term in a document is considered co-occurring with all the other terms in that very document. The number of times each term appears in the document is counted, and the vector $\vec{ov} = \langle (t_1, tf_{t_1}), \ldots, (t_N, tf_{t_N}) \rangle$ is generated, where $N$ stands for the number of different stemmed terms within the training document under discussion. Such a vector consists of pairs $(t_i, tf_{t_i})$, $i = 1, \ldots, N$, where $t_i$ stands for a term present in the document, and $tf_{t_i}$ for the number of times it appears in the document. The benefits of this weighting mechanism become evident when the QE is performed. As seen before, for each query term the co-occurrence vector is calculated, and these vectors are then summed up. The weighting mechanism makes the co-occurrence contribution of the query terms that are hardly relevant in the document corpus less important than the contribution of the relevant ones. Hence, for each training document, a co-occurrence matrix is generated. These matrices are then summed up so as to form one single co-occurrence matrix, which is used for the QE. Once the textual information is obtained from the training links, the POS tagger extracts the nouns, proper nouns and adjectives. Not all these terms are selected: only the first $k$ are used, following an order based on $tf \times idf$. Co-occurrences at page level are then calculated exclusively using these first $k$ keywords, where $k$ is a fixed parameter of the system, the same for every page to be analyzed.
3.6 System based on Co-occurrence at
Page Level and Term Proximity
(System IV)
System III is exclusively based on the concept of co-occurrence at page level: it attempts to track down the terms that are usually present simultaneously in the same pages, without considering the distance between the words within the text. Ignoring term proximity within the same document can lead to a considerable loss of information, since two words that are close to each other are more likely to be correlated, also from a semantic viewpoint. For this reason we decided to use a hybrid approach, which does not rely on page-level co-occurrence only, but also considers term proximity, as the bigram-based and HAL-based systems do. Following this idea we implemented and tested a hybrid approach, starting from the extraction of nominal expressions and exclusively considering nouns, proper nouns and adjectives. Moreover, the weighting mechanism based on IDF and c_index is used. In order to carry out the QE, the two vectors of co-occurrence with the query terms are obtained separately, following the HAL-based approach and the approach based on co-occurrence at page level, without extracting the keywords. Such vectors are therefore the same ones obtained from systems II and III respectively. Each vector contains a different type of information: the page-level co-occurrence vector consists of the terms that are usually present in the same documents in which the query terms appear, while the HAL-type co-occurrence vector contains the terms that are usually present in the documents in proximity of the query terms. The blending of the two types of information is done by introducing a new element, known as the proximity matrix. This matrix is basically an extension of the HAL-based approach. While forming the co-occurrence matrix, the HAL-based approach, given a term t, considers t co-occurring with the N adjacent terms, associating a greater co-occurrence value to the terms that are closer to t. The HAL method thus envisages the use of a window of N adjacent terms. We asked ourselves whether the use of a preset-size window entails a loss of information, and we decided to employ a method that makes it possible to consider the proximity of term t to all the other terms in the document. Let us see how the proximity matrix P is built. A matrix P of size M is constructed, where M is the number of stemmed terms present in the training corpus. Each matrix cell contains two values, $v_1$ and $v_2$, initialized to 0.0 and 0, respectively. For each document d of the training corpus and for each term t present in d, all terms to the right of t are considered, and their distance i from t is measured. Assuming that term $t'$ is at distance $i$ from $t$, we extract from matrix P the pair of values $(v_1, v_2)$ in cell $(t, t')$, and we increase $v_1$ by $\frac{1.0}{i}$ and $v_2$ by 1. As for the construction of the expanded query, it follows the same method adopted in system III. The only difference lies in the proximity weighting applied when the page-level co-occurrence values of the query terms are extracted. Given a term $q'$ belonging to query $Q'$, and having extracted the corresponding page-level co-occurrence vector $\vec{cv}_{q'}$, the co-occurrence value of each term $t$ belonging to $\vec{cv}_{q'}$ is multiplied by $\frac{v_1}{v_2}$, where $v_1$ and $v_2$ are the values in the proximity matrix corresponding to the pair of terms $(q', t)$. The resulting system is therefore based on both co-occurrence at page level and term proximity, and makes it possible to appraise the strengths and weak points of each approach, also in comparison with the systems based on different approaches.
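The following sketch illustrates the proximity matrix and the resulting weighting under the definitions above. It is our own illustration: the function names are hypothetical, only terms to the right are scanned (as in the text), and the fallback weight of 1.0 for pairs never observed together is an assumption not stated in the paper:

from collections import defaultdict

def build_proximity_matrix(training_docs):
    """P[(t, u)] = [v1, v2]: v1 accumulates 1/distance, v2 counts the observations."""
    P = defaultdict(lambda: [0.0, 0])
    for doc in training_docs:                        # each doc is a list of stemmed terms
        for pos, t in enumerate(doc):
            for dist, u in enumerate(doc[pos + 1:], start=1):
                cell = P[(t, u)]
                cell[0] += 1.0 / dist                # v1: proximity-weighted evidence
                cell[1] += 1                         # v2: number of times the pair was seen
    return P

def proximity_weight(P, q, t):
    """Average proximity of t to the query term q; assumed fallback of 1.0 for unseen pairs."""
    v1, v2 = P.get((q, t), (0.0, 0))
    return v1 / v2 if v2 > 0 else 1.0

# During expansion, each page-level co-occurrence value c of term t with query term q
# is rescaled as c * proximity_weight(P, q, t).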
3.7 Pseudo-code of the Main Algorithms (System III and System IV)
In this subsection we show the pseudo-code of the algorithm which calculates the co-occurrence matrix starting from the set of training documents (see Algorithm 1), and the pseudo-code of the algorithm for the actual execution of the QE, starting from the data contained in this matrix (see Algorithm 2). With reference to the algorithm calculating co-occurrence at page level with extraction of the first k keywords, we note that:
- the co-occurrence matrix is represented by a map of maps. In this way, we avoid initializing a square matrix whose size is the overall number of different terms in the training document set. Using such a matrix would entail a huge waste of memory, since the majority of its elements are equal to zero, given how sparse the matrix is. By using a map of maps, on the other hand, co-occurrence values between pairs of terms are stored only when such pairs are actually present in the training documents;
- the keys() method of a map yields the list of its keys;
- the items() method of a map yields the list of its (key, value) pairs;
- the idf(t) method yields the inverse document frequency of term t, calculated previously over the documents belonging to the third level of the DMOZ directory (see Section 4.1).
As for the algorithm calculating the QE through the matrix of co-occurrence at page level, we note that:
- the sum(coocMap1, coocMap2) method yields a new map, given by the union of the (key, value) pairs present in the two maps coocMap1 and coocMap2; should a key be present in both maps, the corresponding value is the sum of the values contained in the starting maps;
- the stemming(Q) method yields a new query, obtained by stemming the terms of query Q;
- the expand(t) method yields the list of non-stemmed terms, found in the training documents, whose root is equal to term t.
A sketch of these helper functions is given below.
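One possible Python rendering of these helpers, to make Algorithms 1 and 2 easier to follow. The pseudo-code only names them, so the concrete signatures below are our assumption; sum is renamed sum_maps to avoid shadowing the Python built-in, and expand receives the stemming table explicitly:

def map_to_ord_list(m):
    """(key, value) pairs of a map, ordered by decreasing value."""
    return sorted(m.items(), key=lambda kv: kv[1], reverse=True)

def normalize(pairs):
    """Scale a list of (key, value) pairs so that the values sum to 1."""
    total = sum(v for _, v in pairs)
    return [(k, v / total) for k, v in pairs] if total > 0 else pairs

def sum_maps(cooc_map1, cooc_map2):
    """Union of two maps; values of shared keys are added together."""
    merged = dict(cooc_map1)
    for k, v in cooc_map2.items():
        merged[k] = merged.get(k, 0) + v
    return merged

def expand(term, stemming_table):
    """Non-stemmed forms found in the training documents whose root equals `term`."""
    return stemming_table.get(term, [])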
Algorithm 1: Pseudo-code of the co-occurrence-based system at page level (System III).

begin
    # global co-occurrence matrix, represented by a map of maps
    M ← Map([])
    # training documents analysis
    for doc in D do
        # term occurrence map for a single document
        tf ← Map([])
        # term frequency calculation
        for t in doc do
            if t not in tf.keys() then
                tf[t] = 0
            tf[t] = tf[t] + 1
        # tf-idf calculation for every term in the document
        for t in tf.keys() do
            tf[t] = tf[t] * idf(t)
        # list of the first K pairs (t, tfidf_t), ordered by tf-idf
        tfidf_list = map_to_ord_list(tf)[0:K]
        # normalize the tf-idf values
        tfidf_list = normalize(tfidf_list)
        # transform the list into a map
        tfidf_map = Map(tfidf_list)
        # unique document-term list
        term_list = tfidf_map.keys()
        # update the global co-occurrence matrix
        for t1, tfidf in tfidf_map.items() do
            if t1 not in M.keys() then
                M[t1] ← Map([])
            for t2 in term_list do
                if t2 not in M[t1].keys() then
                    M[t1][t2] = 0.0
                M[t1][t2] = M[t1][t2] + tfidf * tfidf_map[t2]
end
4 EXPERIMENTATION
For each of the adopted approaches, the results of the corresponding systems are presented. The comparison between different systems is made by using comparative performance values obtained from the system under examination, on one single topic or on the entire
Algorithm 2: Pseudo-code of the co-occurrence-based system at page level and term proximity (System IV).

begin
    # input query
    Q ← [q1, q2, ..., qn]
    # co-occurrence map initialization
    cooc_map ← Map([])
    # stemming
    Q ← stemming(Q)
    # co-occurrence map update for every term in the query
    for q in Q do
        cooc_map = sum(cooc_map, M[q])
    # ordered list of pairs (t, cooc_val), sorted by co-occurrence value
    cooc_list = map_to_ord_list(cooc_map)
    # transform the list into a map
    cooc_map = Map(cooc_list)
    # expanded query initialization
    exp_query = []
    # query expansion
    for term, cooc_val in cooc_map.items() do
        # non-stemmed term list associated with term
        exp_list = expand(term)
        for orig_term in exp_list do
            exp_query.append((orig_term, cooc_val))
    # limit the query expansion to the first K co-occurring terms
    exp_query = exp_query[0:K]
    # merge the original query terms
    for q in Q do
        exp_query.append((q, 1.0))
    return exp_query
end
benchmark. Such performances are expressed as F1-measures, so as to summarize precision and recall in one single measure. The performance measures taken into consideration are precision, recall and the F1-measure:
$$precision(t) = \frac{n_t}{50} \qquad recall(t) = \frac{n_t}{N_t} \qquad F1 = \frac{2 \times precision \times recall}{precision + recall}$$
where $n_t$ stands for the number of returned links (among the first fifty results) belonging to topic $t$, and $N_t$ for the overall number of test links belonging to topic $t$ present in the index.
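A small sketch of this evaluation, assuming the search results and the topic's test links are available as lists of link identifiers (the function name and the fixed cutoff of 50, matching the precision definition above, are ours):

def f1_for_topic(returned_links, topic_test_links, cutoff=50):
    """Precision over the first `cutoff` results, recall over all test links of the topic."""
    top = returned_links[:cutoff]
    n_t = sum(1 for link in top if link in topic_test_links)
    precision = n_t / cutoff
    recall = n_t / len(topic_test_links) if topic_test_links else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)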
4.1 The Employed Benchmark: The
Open Directory Project
The Open Directory Project (ODP), also known as DMOZ (http://dmoz.org), is a multilanguage directory of links belonging to the World Wide Web, namely a system to collect and classify links. The Open Directory Project has a hierarchical structure: the links are grouped into categories, also known as topics, and subcategories. It is therefore possible to identify a level-based organization within the hierarchy. Given the large quantity of links contained in ODP, we decided to consider only Level III links. The pages corresponding to such links were downloaded from the World Wide Web; the textual information was extracted from them by using a parser and then indexed by means of the Lucene indexing system (http://lucene.apache.org). Ten topics were then chosen from the Level III topics, five of which correspond to the user's information needs, while the other five serve exclusively to generate noise in the creation of the user model. Each topic's links were then subdivided into a training set, corresponding to 25% of the links, and a test set, corresponding to 75% of the links (see Table 1).
4.2 Experimentation Methods
Once the user model is generated, it is possible to carry out the actual tests as follows. A query is built for each topic belonging to the user's information needs. The terms of the query are simply the terms that form the topic name. This query is then expanded according to the user model, and used to search for web pages within the created index, which covers all third-level links. The pages belonging to the training set of the considered topic are removed from the returned pages; only the first fifty are taken into consideration, and among them the number of pages belonging to the topic under consideration is counted. The index obtained with Lucene, starting from the third-level links of ODP, consists of 131,394 links belonging to 5,888 topics.
Table 2 compares the performance of the four systems. It can be noticed that the system based on co-occurrence at page level (system III) clearly achieves better results than the hybrid system (system IV) and the systems based purely on term proximity, namely bigrams (system I) and HAL (system II). Indeed, both of the latter
Table 1: The employed benchmark: statistics.

Topic              Test links   Training links   Information Needs
S/C/H P V              15             5               yes
C/H A/P and M          27             7               yes
B/M and D/C            74            18               yes
G/R/D and P            52            14               yes
B/A and F/F           100            27               yes
S/C/P                  35             7               no
A/P A/M                25             6               no
S/P/M and E            26             7               no
S/S S/L                13             5               no
R/G/R                  15             5               no
Total                 382           101
Table 2: Comparative F1-measure of the implemented systems.

Topic              System I   System II   System III   System IV
C/H A/P and M        0.00       0.00        0.16         0.16
S/C/H P V            0.13       0.06        0.09         0.06
G/R/D and P          0.14       0.12        0.18         0.16
B/M and D/C          0.21       0.27        0.19         0.21
B/A and F/F          0.34       0.36        0.57         0.48
Average F1           0.16       0.16        0.24         0.21
Table 3: Comparative F1-measure.

Topic              no QE     RF     System III
C/H A/P and M       0.05    0.08       0.16
S/C/H P V           0.09    0.13       0.09
G/R/D and P         0.10    0.18       0.18
B/M and D/C         0.19    0.14       0.19
B/A and F/F         0.05    0.14       0.57
Average F1          0.10    0.13       0.24
are based on the concept of term proximity, and more
specifically they imply a correlation between two
terms when they are close to each other in the text.
This approach, however, entails the loss of informa-
tion linked to terms that usually co-occur in the doc-
uments themselves, but which are not always close
to each other in the text. The approach based on co-
occurrence at page level hence steers away from the
term proximity concept, and tries to track down the
terms that are usually simultaneously present in the
same documents. The experimentation results show
that page-level correlations are stronger than those
based exclusively on term proximity within the text:
given a page relating to a particular topic, this will fea-
ture correlated terms, not because of their proximity,
but because they refer to the same topic. We also notice that the system based on keyword extraction is the one that offered the best performance among the presented ones: to calculate the correlation values, this system takes into consideration only the most relevant terms in the text, thus preventing a large quantity of noise from impairing the performance.
Table 3 compares System III with a system that performs no Query Expansion and with a system based on a traditional content-based user-modeling approach, in which documents are represented in the Vector Space Model (VSM). The latter system focuses on the update of the user model by means of Relevance Feedback (RF) techniques (Salton and Buckley, 1997), applied to the content of the training pages: for each category, the first ten keywords are taken from the corresponding training pages. The keywords are selected according to tf × idf, and are then used to expand the query.
5 CONCLUSIONS
In this research we implemented and analyzed an Information Retrieval system, based on QE and Personalization, that may help the user search for information on the Web with reference to his/her information needs. The four systems implemented are based on the following approaches: bigrams, Hyperspace Analogue to Language (HAL), co-occurrence at page level, and co-occurrence at page level with term proximity. Among the developed systems, the one based on co-occurrence at page level and keyword extraction stood out. Indeed, this system, based on an algorithm calculating page-level co-occurrences, obtained the best results, in terms of performance, on the reference benchmark.
REFERENCES
Anick, P. (2003). Using terminological feedback for web
search refinement: a log-based study. In SIGIR ’03:
Proceedings of the 26th annual international ACM
SIGIR conference on Research and development in
informaion retrieval, pages 88–95, New York, NY,
USA. ACM Press.
Bai, J., Song, D., Bruza, P., Nie, J.-Y., and Cao, G. (2005).
Query expansion using term relationships in language
models for information retrieval. In CIKM, pages
688–695.
Bruza, P. D. and Song, D. (2002). Inferring query models
by computing information flow. In CIKM ’02: Pro-
ceedings of the eleventh international conference on
Information and knowledge management, pages 260–
269, New York, NY, USA. ACM Press.
Burgess, C., Livesay, K., and Lund, K. (1999). Exploration
in Context Space: Words, Sentences, Discourse. Dis-
course Processes, 25(2&3):211–257.
Gao, J., Nie, J.-Y., Wu, G., and Cao, G. (2004). Depen-
dence language model for information retrieval. In
SIGIR ’04: Proceedings of the 27th annual interna-
tional ACM SIGIR conference on Research and devel-
opment in information retrieval, pages 170–177, New
York, NY, USA. ACM Press.
Gasparetti, F. and Micarelli, A. (2007). Personalized search
based on a memory retrieval theory. International
Journal of Pattern Recognition and Artificial Intel-
ligence (IJPRAI): Special Issue on Personalization
Techniques for Recommender Systems and Intelligent
User Interfaces, 21(2):207–224.
Jansen, B. J., Spink, A., and Saracevic, T. (2000). Real
life, real users, and real needs: a study and analysis of
user queries on the web. Information Processing and
Management, 36(2):207–227.
Porter, M. F. (1997). An algorithm for suffix stripping. In Readings in Information Retrieval, pages 313–316. Morgan Kaufmann.
Radlinski, F. and Joachims, T. (2005). Query chains: Learning to rank from implicit feedback. In KDD '05: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press.
Salton, G. and Buckley, C. (1997). Improving retrieval performance by relevance feedback. In Readings in Information Retrieval, pages 355–364. Morgan Kaufmann.
Salton, G., Wong, A., and Yang, C. S. (1975). A vector
space model for automatic indexing. Commun. ACM,
18(11):613–620.
Schütze, H. and Pedersen, J. O. (1997). A cooccurrence-based thesaurus and two applications to information retrieval. Inf. Process. Manage., 33(3):307–318.
Teevan, J., Dumais, S. T., and Horvitz, E. (2005). Per-
sonalizing search via automated analysis of interests
and activities. In SIGIR ’05: Proceedings of the
28th annual international ACM SIGIR conference on
Research and development in information retrieval,
pages 449–456, New York, NY, USA. ACM Press.