A NOVEL QUERY EXPANSION TECHNIQUE BASED ON A MIXED GRAPH OF TERMS

Fabio Clarizia, Francesco Colace, Massimo De Santo, Luca Greco and Paolo Napoletano

Department of Electronics and Computer Engineering, University of Salerno

Via Ponte Don Melillo 1, 84084 Fisciano, Italy

Keywords:

Text retrieval, Query expansion, Term extraction, Probabilistic topic model, Relevance feedback.

Abstract:

It is well known that one way to improve the accuracy of a text retrieval system is to expand the original query with additional knowledge coded through topic-related terms. In the case of an interactive environment, the expansion, which is usually represented as a list of words, is extracted from documents whose relevance is known thanks to the feedback of the user. In this paper we argue that the accuracy of a text retrieval system can be improved if we employ a query expansion method based on a mixed Graph of Terms representation instead of a method based on a simple list of words. The graph, which is composed of a directed and an undirected subgraph, can be automatically extracted from a small set of only relevant documents (namely the user feedback) using a method for term extraction based on the probabilistic Topic Model. The proposed method has been evaluated by comparing it with two less complex structures: one represented as a set of pairs of words and another that is a simple list of words.

1 INTRODUCTION AND RELATED WORK

It is well documented that queries in typical information retrieval systems are rather short, usually two or three words (Jansen et al., 2000; Jansen et al., 2008). Such queries may not be long enough to avoid the inherent ambiguity of language (polysemy etc.), so text retrieval systems that rely on a term-frequency based index generally suffer from low precision, that is, low quality of document retrieval.

In turn, the idea of taking advantage of additional knowledge, by expanding the original query with other topic-related terms, to retrieve relevant documents has been widely discussed in the literature, where manual, interactive and automatic techniques have been proposed (Efthimiadis, 1996; Manning et al., 2008; Baeza-Yates and Ribeiro-Neto, 1999). The idea behind these techniques is that, in order to avoid ambiguity, it may be sufficient to better specify “the meaning” of what the user has in mind when performing a search, or in other words “the main concept” (or a set of concepts) of the preferred topic in which the user is interested.

A better specialization of the query can be obtained with additional knowledge, which can be extracted from exogenous sources (e.g. ontology, WordNet, data mining) or endogenous sources (i.e. extracted only from the documents contained in the repository) (Bhogal et al., 2007; Piao et al., 2010; Manning et al., 2008).

In this paper we focus on those techniques which make use of “Relevance Feedback” (in the case of endogenous knowledge), which takes into account the results that are initially returned from a given query and uses the information about the relevance of each result to perform a new expanded query. In the literature we can distinguish between three types of procedures for the assignment of relevance: explicit feedback, implicit feedback, and pseudo feedback (Baeza-Yates and Ribeiro-Neto, 1999). The feedback is obtained from assessors (or other users of a system) indicating the relevance of a document retrieved for a query. If the assessors know that the feedback provided is interpreted as relevance judgments then the feedback is considered explicit, otherwise it is implicit. By contrast, pseudo relevance feedback automates the manual part of the relevance labeling by assuming that the top “n” ranked documents after the initial query are relevant, and then performs relevance feedback as before under this assumption.

Clarizia F., Colace F., De Santo M., Greco L. and Napoletano P.
A NOVEL QUERY EXPANSION TECHNIQUE BASED ON A MIXED GRAPH OF TERMS.
DOI: 10.5220/0003660500840093
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2011), pages 84-93
ISBN: 978-989-8425-79-9
© 2011 SCITEPRESS (Science and Technology Publications, Lda.)

Most existing methods make use of pseudo relevance feedback, because the human labeling task is enormously annoying and time consuming (Ko and Seo, 2009; Ruthven, 2003). Nevertheless, fully automatic methods suffer from obvious errors when the initial query is intrinsically ambiguous. As a consequence, in recent years some hybrid techniques have been developed which take into account a minimal explicit human feedback (Okabe and Yamada, 2007; Dumais et al., 2003) and use it to automatically identify other topic-related documents. The performance achieved by these methods is usually moderate, with a mean average precision of about 30% (Okabe and Yamada, 2007).

However, whatever the technique that selects the set of documents representing the feedback, the expanded terms are usually computed by making use of well known approaches for term selection such as Rocchio, Robertson, Chi-Square, Kullback-Leibler, etc. (Robertson and Walker, 1997; Carpineto et al., 2001). In this case the reformulated query consists of a simple (sometimes weighted) list of words.

Although such term selection methods have proven their effectiveness in terms of accuracy and computational cost, several more complex alternative methods have been proposed. These usually consider the extraction of a structured set of words, so that the related expanded query is no longer a list of words but a weighted set of clauses combined with suitable operators (Callan et al., 1992; Collins-Thompson and Callan, 2005; Lang et al., 2010).

In this paper we propose a query expansion method based on explicit relevance feedback that expands the initial query with a new structured query representation, or vector of features, that we call a mixed Graph of Terms and that can be automatically extracted from a set of documents D using a global method for term extraction based on a supervised Term Clustering technique weighted by the Latent Dirichlet Allocation implemented as the Probabilistic Topic Model.

The evaluation of the method has been conducted on a web repository collected by crawling a large number of web pages from the website ThomasNet.com. We have considered several topics and performed a comparison with two less complex structures: one represented as a set of pairs of words and another that is a simple list of words. The results obtained, independently of the context, show that a more complex representation is capable of retrieving a greater number of relevant documents, achieving a mean average precision of about 50%.

2 THE PROPOSED APPROACH

The vector of features needed to expand the query is obtained as a result of an interactive process between the user and the system. The user initially performs a retrieval by inputting a query to the system and later identifies a small set D of relevant documents from the hit list of documents returned by the system. This set is considered as the training set (the relevance feedback).

Existing query expansion techniques mostly use the relevance feedback of both relevant and irrelevant documents. Usually they obtain the term selection through the scoring function proposed in (Robertson, 1991; Carpineto et al., 2001), which assigns a weight to each term depending on its occurrence in both relevant and irrelevant documents. In contrast, in this paper we do not consider irrelevant documents, and the extraction of the vector of features is performed through a method based on a supervised Term Clustering technique.

Precisely, the vector of features, which we call mixed Graph of Terms, can be automatically extracted from a set of documents D using a method for term extraction based on a supervised Term Clustering technique (Sebastiani, 2002) weighted by the Latent Dirichlet Allocation (Blei et al., 2003) implemented as the Probabilistic Topic Model (Griffiths et al., 2007).

The graph is composed of a directed and an undirected subgraph (or levels). The lowest level, namely the word level, is obtained by grouping terms with a high degree of pairwise semantic relatedness; so there are several groups (clusters), each of them represented as a cloud of words connected to their respective centroids (directed edges), alternatively called concepts (see fig. 1(b)). The second level, namely the conceptual level, is obtained by inferring semantic relatedness between centroids, and so between concepts (undirected edges, see fig. 1(a)).

The general idea of this paper is supported by previous works (Noam and Naftali, 2001) that have confirmed the potential of supervised clustering methods for term extraction, also in the case of query expansion (Cao et al., 2008; Lee et al., 2009).

2.1 Extracting a Mixed Graph of Terms

A mixed Graph of Terms (mGT) is a hierarchical structure composed of two levels of information represented through an undirected and a directed subgraph: the conceptual level and the word level, respectively.

We consider extracting it from a corpus $D = \{\mathbf{w}_1, \cdots, \mathbf{w}_M\}$ of $M$ documents (that we call the training set), where each document is, following the Vector Space Model (Manning et al., 2008), a vector of feature weights $\mathbf{w}_j = (w_{1j}, \ldots, w_{|T|j})$, where $T = \{t_1, \cdots, t_{|T|}\}$ is the set of features that occur at least once in at least one document of $D$, and $0 \le w_{kj} \le 1$ represents how much the feature $t_k$ contributes to the semantics of document $\mathbf{w}_j$. We choose to identify features with words, that is the bag of words assumption, and in this case $t_k = v_k$, where $v_k$ is one of the words of a vocabulary $T$.

Figure 1: Theoretical representation of the Graph of Terms's levels. 1(a) The conceptual level, here the weight $\psi_{ij}$ represents the probability that two concepts are semantically related. 1(b) The word level, here the weight $\rho_{is}$ represents the probability that a word is semantically related to a concept (centroid).

Figure 2: Vector of features for the topic Storage Tanks: a mixed Graph of Terms.

The word level is composed of a set of words $v_s$ that specify, through a directed weighted edge, the concept $c_i$ (see fig. 1(b), tab. 1 and fig. 2), or better the centroid of such a set (group or cluster), which is, therefore, still lexically denoted as a word. The weight $\rho_{is}$ measures how far a word is related to a concept, or how much we need such a word to specify that concept, and it can be considered as a probability: $\rho_{is} = P(c_i \mid v_s)$. The resulting structure is a subgraph rooted on $c_i$.

On the other hand, the conceptual level is composed of a set of concepts $c_i$ interconnected through undirected weighted edges (see fig. 1(a), tab. 1 and fig. 2), so forming a subgraph of pairs of centroids. The weight $\psi_{ij}$ can be considered as the degree of semantic correlation between two concepts and it can be considered as a probability: $\psi_{ij} = P(c_i, c_j)$.

Figure 3: Simpler vectors of features for the topic Storage Tanks. 3(a) A Graph of Terms. 3(b) A List of Terms.

2.1.1 Graph Drawing

A mGT is well determined through the learning of the weights, the Relation Learning, and through the learning of three parameters, the Parameter Learning, that are $\Lambda = (H, \tau, \mu)$, which specify the shape of the graph. In fact, we have:

1. $H$: the number of concepts (namely the number of clusters) of the corpus $D$;

2. $\mu_i$: the threshold that establishes for each concept the number of edges of the directed subgraph, and so the number of concept/word pairs of the corpus $D$. An edge between the word $s$ and the concept $i$ can be saved if $\rho_{is} \ge \mu_i$. We consider, to simplify the formulation, $\mu_i = \mu, \forall i$;

3. $\tau$: the threshold that establishes the number of edges of the undirected subgraph, and so the number of concept/concept pairs of the corpus $D$. An edge between the concept $i$ and the concept $j$ can be saved if $\psi_{ij} \ge \tau$.
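The role of the three parameters can be illustrated with a minimal sketch. It assumes that the weights are already available as dictionaries and that the H concepts have already been selected (how they are chosen is not covered here); the function name `build_mgt` and all numeric values are hypothetical and serve only as an illustration.

```python
from typing import Dict, Iterable, Tuple

Pair = Tuple[str, str]


def build_mgt(
    psi: Dict[Pair, float],
    rho: Dict[Pair, float],
    concepts: Iterable[str],
    tau: float,
    mu: float,
) -> Tuple[Dict[Pair, float], Dict[Pair, float]]:
    """Keep only the edges that pass the thresholds tau and mu.

    psi maps (concept_i, concept_j) to a weight, rho maps (concept_i, word_s)
    to a weight. `concepts` is the set of H centroids (assumed to be given).
    """
    chosen = set(concepts)
    # Undirected subgraph: concept/concept pairs with psi_ij >= tau.
    conceptual = {
        (ci, cj): w
        for (ci, cj), w in psi.items()
        if ci in chosen and cj in chosen and w >= tau
    }
    # Directed subgraph: concept/word pairs with rho_is >= mu (mu_i = mu for all i).
    word = {
        (ci, s): w
        for (ci, s), w in rho.items()
        if ci in chosen and w >= mu
    }
    return conceptual, word


# Hypothetical weights, for illustration only.
psi = {("tank", "roof"): 0.40, ("tank", "water"): 0.25, ("tank", "valve"): 0.05}
rho = {("tank", "larg"): 0.30, ("tank", "steel"): 0.02, ("roof", "dome"): 0.20}
conceptual, word = build_mgt(psi, rho, ["tank", "roof", "water"], tau=0.2, mu=0.1)
```

Raising tau or mu prunes edges and yields a smaller graph, which is why different values of the parameters produce graphs of different shape.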

2.1.2 Relations Learning

Since each concept is lexically represented by a word of the vocabulary, we have that $\rho_{is} = P(c_i \mid v_s) = P(v_i \mid v_s)$ and $\psi_{ij} = P(c_i, c_j) = P(v_i, v_j)$.

As a result, we can obtain each possible relation by computing the joint probability $P(v_i, v_j)$ $\forall i, j \in V$, which can be considered as a word association problem and so can be solved through a smoothed version of the generative model introduced in (Blei et al., 2003), called Latent Dirichlet Allocation, which makes use of Gibbs sampling (Griffiths et al., 2007).

Furthermore, it is important to make clear that the mixed Graph of Terms cannot be considered as a co-occurrence matrix. In fact, the core of the graph is the probability $P(v_i, v_j)$, which we regard as a word association problem, and which in the topic model is considered as a problem of prediction: given that a cue is presented, which new words might occur next in that context?

This means that the model does not merely take into account the fact that two words occur in the same document, but that they occur in the same document when a specific topic (and so a context) is assigned to that document (Griffiths et al., 2007).
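To make the difference from plain co-occurrence counting concrete, the following sketch computes word association through topics. It is a simplified illustration, not the smoothed formulation used by the authors: it assumes that the two words are generated by the same topic z, so that P(v_i, v_j) = Σ_z P(v_i | z) P(v_j | z) P(z). The names `phi` (per-topic word distributions), `theta` (topic probabilities) and the toy numbers are hypothetical.

```python
from typing import Dict, List


def joint(phi: List[Dict[str, float]], theta: List[float], vi: str, vj: str) -> float:
    """P(v_i, v_j) under the assumption that both words share one topic."""
    return sum(
        p_z * topic.get(vi, 0.0) * topic.get(vj, 0.0)
        for p_z, topic in zip(theta, phi)
    )


def marginal(phi: List[Dict[str, float]], theta: List[float], vs: str) -> float:
    """P(v_s) = sum over topics of P(v_s | z) P(z)."""
    return sum(p_z * topic.get(vs, 0.0) for p_z, topic in zip(theta, phi))


def conditional(phi: List[Dict[str, float]], theta: List[float], vi: str, vs: str) -> float:
    """P(v_i | v_s) = P(v_i, v_s) / P(v_s)."""
    p_s = marginal(phi, theta, vs)
    return joint(phi, theta, vi, vs) / p_s if p_s > 0 else 0.0


# Toy model with two topics (illustrative numbers only).
theta = [0.5, 0.5]
phi = [
    {"tank": 0.6, "roof": 0.4},
    {"tank": 0.1, "roof": 0.1, "pump": 0.8},
]
psi_tank_roof = joint(phi, theta, "tank", "roof")        # candidate psi_ij
rho_tank_roof = conditional(phi, theta, "tank", "roof")  # candidate rho_is
```

In this toy model "roof" and "pump" receive a non-zero association only through the second topic, which reflects the idea that relatedness depends on the topic assigned to the context rather than on raw co-occurrence.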

2.1.3 Parameters Learning

Given a corpus $D$, once each $\psi_{ij}$ and $\rho_{is}$ is known $\forall i, j, s$, letting the parameters assume different sets of values $\Lambda_t$, we can observe a different graph $mGT_t$, where $t$ is representative of different parameter values.

The authors reported the mathematical formulation that leads from the Latent Dirichlet Allocation to $P(v_i, v_j)$ in (Clarizia et al., 2011).


A way of saying that a mGT is the best possible

for that set of documents is to demonstrate that it pro-

duces the maximum score attainable for each of the

documents when the same graph is used as a knowl-

edge base for querying in a set containing just those

documents which have fed the mGT builder.

Each graph $mGT_t$ can be represented, following again the Vector Space Model (Manning et al., 2008), as a vector of feature weights $mGT_t = (w_{1t}, \ldots, w_{|T_p|t})$, where $|T_p|$ represents the total number of pairs.

Each feature is $t_k = (v_i, v_j)$, which is not the simple bag of words assumption, and $w_{kj}$ is the weight calculated thanks to the tf-idf model applied to the pairs represented through $t_k$, with the addition of the boost $b_k$, that is the semantic relatedness between the words of each pair, of both the conceptual and the word level, namely $\psi_{ij}$ and $\rho_{is}$. Recall that both $\psi_{ij}$ and $\rho_{is}$ are real values (probabilities) of the interval [0,1], and so, to distinguish the relevance between the three cases, the traditional case ($b_k = 1$), the concept/word pair and the concept/concept pair, we have distributed such values in a wider interval. Specifically:

1. $b_k = 1$ being the lowest level of relatedness;

2. $b_k \in [\rho_{min}, \rho_{max}]$ with $\rho_{min} \ge 1$ and $(\rho_{max} - \rho_{min}) = 1$;

3. $b_k \in [\psi_{min}, \psi_{max}]$ with $\psi_{min} > \rho_{max}$ and $(\psi_{max} - \psi_{min}) = 1$.

In the experiments we have chosen $\rho_{min} = 1$, $\psi_{min} = 3$ (see table 1).
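One way to distribute the probabilities over the two unit-width intervals is sketched below. The exact mapping is not stated in the text; a min-max rescaling is assumed here because the values in Table 1 reach the endpoints of the intervals (1 and 2.0 for the word level, 4.0 for the conceptual level). The function name `rescale` and the input probabilities are hypothetical.

```python
from typing import Dict, Hashable


def rescale(probabilities: Dict[Hashable, float], low: float) -> Dict[Hashable, float]:
    """Min-max rescale values from [0, 1] into [low, low + 1] (assumed mapping)."""
    smallest = min(probabilities.values())
    span = max(probabilities.values()) - smallest
    return {
        key: low + ((p - smallest) / span if span > 0 else 0.0)
        for key, p in probabilities.items()
    }


RHO_MIN, PSI_MIN = 1.0, 3.0  # values chosen in the experiments

# Illustrative probabilities, not taken from the paper.
rho_boost = rescale(
    {("tank", "larg"): 0.75, ("tank", "construct"): 0.5, ("liquid", "fix"): 0.25},
    RHO_MIN,
)
psi_boost = rescale({("tank", "roof"): 0.75, ("tank", "water"): 0.25}, PSI_MIN)
```

With these choices every concept/concept boost lies in [3, 4] and every concept/word boost in [1, 2], so the two kinds of pairs never overlap and both stay at or above the default boost of 1.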

At this point, a document $\mathbf{w}_j$ can also be viewed as a vector of weights in the space $|T_p|$, and so the general formula of each weight is:

$$ w_{kj} = \frac{\text{tf-idf}(t_k) \cdot b_k}{\sum_{s=1}^{|T_p|} \left( \text{tf-idf}(t_s) \cdot b_s \right)} \qquad (1) $$

The score for each graph at time $t$ can be computed following the cosine similarity model in the space $|T_p|$, and so we have a score for each document: $S_t = \{S(q_t, \mathbf{w}_1), \cdots, S(q_t, \mathbf{w}_M)\}$.

As a result, to compute the optimum set of parameters $\Lambda_t$ we can maximise the function Fitness ($F$), and so,

$$ \Lambda^{*} = \arg\max_{t} \{ F(\Lambda_t) \}, \qquad (2) $$

where $F(\Lambda_t) = E_m[S(q_t, \mathbf{w}_m)] - \sigma_m[S(q_t, \mathbf{w}_m)]$, $E_m$ is the mean value of all elements of $S_t$ and $\sigma_m$ is the standard deviation. Since the space of possible solutions could grow exponentially, we have limited the space to $|T_p| < 150$.
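The selection of the optimum parameters can be sketched as follows, assuming that the scores of the training documents have already been computed for each candidate parameter set. The population standard deviation is used here as an assumption, since the text does not say which estimator is meant; the function names and the candidate values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence, Tuple

Params = Tuple[int, float, float]  # (H, tau, mu)


def fitness(scores: Sequence[float]) -> float:
    """F = mean of the document scores minus their standard deviation."""
    return mean(scores) - pstdev(scores)


def best_parameters(scores_by_params: Dict[Params, Sequence[float]]) -> Params:
    """Return the parameter set whose graph maximises the fitness."""
    return max(scores_by_params, key=lambda p: fitness(scores_by_params[p]))


# Illustrative scores of three training documents for two candidate graphs.
candidates = {
    (5, 0.30, 0.20): [0.9, 0.1, 0.5],  # high on one document, low on another
    (8, 0.20, 0.10): [0.5, 0.5, 0.4],  # uniformly good
}
optimum = best_parameters(candidates)
```

Subtracting the standard deviation favours graphs that score uniformly well on all the feedback documents over graphs that match only some of them.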

Furthermore, we have reduced the remaining space of possible solutions by applying a clustering method, that is the K-means algorithm, to all $\psi_{ij}$ and $\rho_{is}$ values, so that the optimum solution can be exactly obtained after the exploration of the entire space.

This reduction allows us to compute a mGT from a repository composed of a few documents in a reasonable time (e.g. for 3 documents it takes about 3 seconds on a Mac OS X based computer with a 2.66 GHz Intel Core i7 CPU and 8 GB of RAM).

Table 1: An example of a mGT for the topic Storage Tank.

Conceptual Level
Concept i    Concept j    Relation Factor ($\psi_{ij}$)
tank         roof         4.0
tank         water        3.37246
tank         liquid       3.13853
···          ···          ···
liquid       type         3.43828
liquid       pressur      3.07028
···          ···          ···

Word Level
Concept i    Word s       Relation Factor ($\rho_{is}$)
tank         larg         2.0
tank         construct    1.6
···          ···          ···
liquid       type         1.21123
liquid       maker        1.11673
liquid       hose         1.06024
liquid       fix          1
···          ···          ···

3 EXTRACTING SIMPLER REPRESENTATION FROM A mGT

From the mixed Graph of Terms we can select different subsets of features and so obtain simpler representations (see figs. 3(a) and 3(b)). Before discussing this in detail, we recall that $\psi_{ij} = P(v_i, v_j)$ and $\rho_{is} = P(v_i \mid v_s)$ are computed through the Topic Model, which also computes the probability for each word $\eta_s = P(v_s)$.

3.1 Graph of Terms

We can obtain a simpler representation by first selecting all distinct possible pairs from the mGT (see table 1) and then making all their weights uniform.

Note that even if both $\psi_{ij}$ and $\rho_{is}$ are real values of the interval [0,1], they are not comparable because the former is a joint probability and the latter is a conditional one. Therefore, in order to make them comparable, we consider the product $\rho_{is} \cdot \eta_s$ instead of each $\rho_{is}$. Finally, to make all weights uniform, we do not shift the $\psi_{ij}$ and $\rho_{is} \cdot \eta_s$ values from [0,1] to $[\psi_{min}, \psi_{max}]$ and $[\rho_{min}, \rho_{max}]$ respectively, which means that we compress the conceptual level onto the word level. Following this procedure we obtain a single-level representation named Graph of Terms (GT), composed of weighted pairs of words as in fig. 3(a).

3.2 List of Terms

We can obtain the simplest representation by selecting from the mGT all distinct terms and associating with each of them its weight $\eta_s = P(v_s)$ computed through the Topic Model. We name this representation List of Terms (LT), see fig. 3(b).
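The two simplifications of this section can be sketched together. The sketch assumes that the unshifted probabilities are available as dictionaries; when the same pair appears at both levels, the conceptual-level weight is kept, which is an assumption made here only to obtain distinct pairs. Function names and numbers are hypothetical.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]


def to_graph_of_terms(
    psi: Dict[Pair, float], rho: Dict[Pair, float], eta: Dict[str, float]
) -> Dict[Pair, float]:
    """Flatten both levels into a single set of weighted word pairs (GT)."""
    pairs = dict(psi)  # joint probabilities P(v_i, v_j), used as they are
    for (concept, word), conditional in rho.items():
        # P(v_i | v_s) * P(v_s) = P(v_i, v_s): now comparable with psi.
        pairs.setdefault((concept, word), conditional * eta[word])
    return pairs


def to_list_of_terms(
    psi: Dict[Pair, float], rho: Dict[Pair, float], eta: Dict[str, float]
) -> Dict[str, float]:
    """Collect all distinct terms of the graph with their weight eta (LT)."""
    terms = {term for pair in list(psi) + list(rho) for term in pair}
    return {term: eta[term] for term in terms}


# Illustrative probabilities, not taken from the paper.
psi = {("tank", "roof"): 0.2}
rho = {("tank", "larg"): 0.5}
eta = {"tank": 0.3, "roof": 0.1, "larg": 0.2}

gt = to_graph_of_terms(psi, rho, eta)
lt = to_list_of_terms(psi, rho, eta)
```

Both structures are derived from the same mGT, so the three representations compared in the experiments contain the same vocabulary and differ only in how much relational information they retain.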

4 EXPERIMENTS

We have compared 3 different query expansion methodologies based on different vectors of features: the mixed Graph of Terms (mGT), the Graph of Terms (GT) and the List of Terms (LT).

We have embedded all the techniques in an open source text-based search engine, Lucene from the Apache project. Here the score function $S(q, \mathbf{w})$ is based on the standard vector cosine similarity, used in a Vector Space Model combined with the Boolean Model (Manning et al., 2008), which takes into account the boost factor $b_k$, whose default value is 1 and which is assigned to the words that compose the original query.

Such a function permits the assignment of a rank to documents $\mathbf{w}$ that match a query $q$ and permits the transformation of each vector of features, that is the mGT, GT and LT, into a set of Boolean clauses. For instance, in the case of the mGT, since it is represented as pairs of related words (see Table 1), where the relationship strength is described by a real value (namely $\psi_{ij}$ and $\rho_{is}$, the Relation factors), the expanded query is:

((tank AND roof)^4.0) OR ((tank AND larg)^2.0) ...

As a consequence we search the pair of words tank AND roof with a boost factor of 4.0 OR the pair of words tank AND larg with a boost factor of 2.0 and so on. For all the experiments we have considered $\rho_{min} = 1$ and $\psi_{min} = 3$ (table 1).

We have used Lucene version 2.4; details on the similarity can be found at http://lucene.apache.org
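The transformation of a set of weighted pairs into a boosted Boolean query string can be sketched as follows. The output mirrors the form of the expanded query shown above and relies on the `^` boost and parenthesis grouping of the Lucene query syntax; the helper name `expanded_query` is hypothetical, and the sketch does not escape special characters or add the original query terms.

```python
from typing import Dict, Tuple


def expanded_query(pairs: Dict[Tuple[str, str], float]) -> str:
    """Build one boosted AND clause per pair and join the clauses with OR."""
    # Strongest relations first, only to make the string easier to read.
    ranked = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
    return " OR ".join(f"(({a} AND {b})^{boost})" for (a, b), boost in ranked)


# Two pairs taken from Table 1.
query = expanded_query({("tank", "larg"): 2.0, ("tank", "roof"): 4.0})
print(query)
```

The same helper can be applied to the GT, whose pairs carry unshifted weights; an LT would instead produce single-term clauses.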

4.1 Data Preparation

The evaluation of the method has been conducted on a web repository collected at the University of Salerno by crawling 154,243 web pages, for a total of about 3.0 GB, using the website ThomasNet (http://www.thomasnet.com) as the index of URLs, the reference language being English.

ThomasNet, known as the “big green books” and

“Thomas Registry”, is a multi-volume directory of in-

dustrial product information covering 650,000 distrib-

utors, manufacturers and service companies within

67,000-plus industrial categories. We have down-

loaded webpages from the company websites related

to 150 categories of products (considered as topics),

randomly chosen from the ThomasNet directory.

Note that, although the presence or absence of categories in the repository depends on the random choices made during the crawling stage, webpages from some companies may cover categories that are different from those randomly chosen.

This means that the repository should not be considered as representative of a small number of categories (that is, 150) but as a reasonable collection of hundreds of categories. In this work we have considered 50 test questions (queries) extracted from 50 out of the initial 150 categories (topics). Each original query corresponds to the name of the topic; for instance, if we search for information about the topic "generator", the query will be exactly "generator". All the initial queries have been expanded through the methodologies explored in section 3.

Here we show the summary results obtained on all

the 50 topics and the results obtained on the ﬁrst 10

examples, that are: 1. Lubricant, 2. Pump, 3. Adhe-

sive, 4. Generator, 5. Transformers, 6. Inverter, 7.

Valve, 8. LAN Cable, 9. Storage Tank, 10. Extractor.

4.2 Evaluation Measures

For each example, the procedure that obtains the reformulation of the query is as follows. A person who is interested in the topic "generator" performs the initial query "generator" and then interactively chooses 3 relevant documents for that topic, which represent the minimal positive feedback.

From those documents the system automatically extracts the three vectors of features. In table 2 we show the average size of the list of terms and the list of pairs, that is 57 and 73 respectively for each topic. The user has interactively assigned the relevance of the documents by following an XML-based schema coding his intentions, represented as follows:

<topic>
<query>storage tanks</query>
<description>
I am looking for information on storage tanks.
</description>
<subtopic>
I am looking for web pages containing datasheets of several storage tank types
</subtopic>
<subtopic>
···
</subtopic>
<subtopic>
I am looking for descriptions of storage tanks as products
</subtopic>
</topic>

The repository will be public on our website to allow further investigations from other researchers.

Table 2: Number of terms and pairs for each mGT.

Topic  Query          # of terms  # of pairs
1      Lubricant      54          69
2      Pump           63          70
3      Adhesive       45          67
4      Generator      58          68
5      Transformers   67          82
6      Inverter       62          84
7      Valve          47          66
8      LAN Cable      69          85
9      Storage Tank   51          66
10     Extractor      53          71
Average Size          55          72

The expanded queries have then been performed, and for each context we have asked different human assessors to assign graded judgments of relevance to the first 100 pages returned by the system.

Since the number of evaluations for each topic, and the number of topics itself, is small, the assessors have judged, in contrast to the Minimal Test Collection method (Carterette et al., 2008), all the results obtained. The assessment is based on three levels of relevance (highly relevant, relevant and not relevant), assigned, to avoid cases of ambiguity, by following the XML-based schema coding the user intentions introduced above.

The accuracy has been measured through standard indicators provided by (Manning et al., 2008) and based on Precision and Recall,

$$ eAP = \frac{1}{ER} \sum_{i=1}^{k} \left( \frac{x_i}{i} + \sum_{j>i} \frac{x_i x_j}{j} \right) \qquad (3) $$

$$ ePrec@k = eP@k = \frac{1}{k} \sum_{i=1}^{k} x_i \qquad (4) $$

$$ ERprec = \frac{1}{ER} \sum_{i=1}^{ER} x_i \qquad (5) $$

$$ ER = \sum_{i=1} x_i \qquad (6) $$

where eAP indicates the average precision on a topic, $x_i$ and $x_j$ are Boolean indicators of relevance, $k$ is the cardinality of the considered result set ($k = 100$), and ER is a subset of relevant documents. The factor ERprec is the precision at the level ER, while the measure eMAP is the average of all eAP over topics. The measure eP@k is the precision at level k (for instance eP5 is the precision calculated by taking the top 5 results).
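For readers who want to reproduce this kind of measure, the sketch below implements the textbook definitions of precision at k, average precision and R-precision on a ranked list of Boolean relevance indicators. It is an illustration under the assumption that the number of relevant documents (the role played by ER) is supplied by the caller; the function names and the example ranking are hypothetical.

```python
from typing import Sequence


def precision_at_k(x: Sequence[int], k: int) -> float:
    """Fraction of relevant documents among the top k results."""
    return sum(x[:k]) / k


def average_precision(x: Sequence[int], n_relevant: int) -> float:
    """Sum of the precision values at each relevant rank, divided by n_relevant."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(x, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / n_relevant if n_relevant else 0.0


def r_precision(x: Sequence[int], n_relevant: int) -> float:
    """Precision computed at the rank equal to the number of relevant documents."""
    return sum(x[:n_relevant]) / n_relevant


# Illustrative ranking: 1 = relevant, 0 = not relevant; 4 relevant documents exist.
ranking = [1, 0, 1, 1, 0]
p5 = precision_at_k(ranking, 5)
ap = average_precision(ranking, 4)
rp = r_precision(ranking, 4)
```

Averaging `average_precision` over all topics gives the mean average precision reported per run.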

Further, we have considered other standard measures of performance which take into account the quality of the results related to the position in which they are presented. We have considered the Cumulative Gain (CG), the Discounted Cumulative Gain (DCG), the Ideal DCG (IDCG) and the normalized Discounted Cumulative Gain (nDCG), that is $nDCG = \frac{DCG}{IDCG}$. Specifically:

$$ CG = \sum_{i=1}^{k} rel_i \qquad (7) $$

$$ DCG = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(1 + i)} \qquad (8) $$

where we have considered $rel_i = 2$, $rel_i = 1$ and $rel_i = 0$ in case of Highly Relevant, Relevant and Not Relevant documents respectively.
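These gain-based measures can be computed with a few lines of code. The sketch below uses the exponential gain $2^{rel} - 1$ and the $\log_2(1 + i)$ discount, and it assumes that the ideal ordering is obtained by sorting the judgments of the same result list, which appears consistent with the IDCG values differing from run to run in table 5. Function names and the example judgments are hypothetical.

```python
import math
from typing import Sequence


def cumulative_gain(rels: Sequence[int]) -> int:
    """CG: plain sum of the graded relevance values."""
    return sum(rels)


def dcg(rels: Sequence[int]) -> float:
    """DCG with gain 2^rel - 1 and discount log2(1 + rank)."""
    return sum(
        (2 ** rel - 1) / math.log2(1 + rank)
        for rank, rel in enumerate(rels, start=1)
    )


def ndcg(rels: Sequence[int]) -> float:
    """nDCG = DCG / IDCG, with the ideal ranking built from the same judgments."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal else 0.0


# Illustrative judgments: 2 = highly relevant, 1 = relevant, 0 = not relevant.
judgments = [2, 0, 1]
cg_value = cumulative_gain(judgments)
dcg_value = dcg(judgments)
ndcg_value = ndcg(judgments)
```

A run that places its highly relevant documents first obtains an nDCG close to 1 even when it retrieves few relevant documents overall, so nDCG should be read together with CG and the Rel count.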

4.3 Discussion

In table 3 we find all the measures for each topic, while in table 4 we find summary results across topics; both tables report results for each vector of features.

The overall behavior of the mGT method is better than both the GT and LT, especially in the case of topics 2, 3 and 7. In fact, in these cases the proposed method has listed 62, 67 and 76 relevant or highly relevant documents in the top 100, that is about 68% on average (see also the column Rel of table 5).

However, in the case of topics 4, 6 and 8 the number of relevant documents is comparable between the systems, with the percentage of relevant documents retrieved being about 30%, that is less than half of the worst value obtained for the topic 2. This suggests that the systems are comparable only if the total number of relevant documents returned by both systems is less than 50%.

Note that $ER = |R_{mGT} \cup R_{GT} \cup R_{LT} - R_{mGT} \cap R_{GT} \cap R_{LT}|$, where $R_{vf}$ is the set of relevant and highly relevant documents obtained for a given topic and vf = vector of features.

Table 3: Indices of performance on different topics.

Topic  Run  ER   eAP    ERprec  eP5    eP10   eP20   eP30   eP100
1      mGT  64   0.594  0.703   1.000  0.778  0.737  0.586  0.546
       GT   64   0.517  0.625   1.000  0.778  0.684  0.552  0.495
       LT   64   0.330  0.406   0.750  0.667  0.737  0.655  0.354
2      mGT  76   0.561  0.592   1.000  1.000  0.737  0.690  0.626
       GT   76   0.481  0.500   1.000  1.000  0.684  0.690  0.566
       LT   76   0.254  0.395   0.750  0.667  0.632  0.552  0.374
3      mGT  75   0.740  0.720   1.000  1.000  1.000  0.793  0.667
       GT   75   0.626  0.693   1.000  1.000  1.000  0.759  0.576
       LT   75   0.366  0.440   0.500  0.778  0.895  0.621  0.444
4      mGT  73   0.501  0.589   1.000  0.667  0.842  0.862  0.485
       GT   73   0.534  0.603   1.000  0.667  0.842  0.862  0.525
       LT   73   0.683  0.658   0.750  0.889  0.947  0.828  0.616
5      mGT  49   0.484  0.469   1.000  0.889  0.842  0.552  0.364
       GT   49   0.439  0.429   1.000  0.889  0.790  0.517  0.333
       LT   49   0.299  0.429   1.000  0.556  0.368  0.379  0.313
6      mGT  39   0.575  0.590   0.750  0.778  0.842  0.724  0.333
       GT   39   0.580  0.590   0.750  0.778  0.842  0.690  0.343
       LT   39   0.657  0.667   0.750  0.889  0.895  0.724  0.354
7      mGT  100  0.615  0.760   1.000  0.889  0.842  0.828  0.758
       GT   100  0.633  0.780   1.000  0.778  0.790  0.828  0.788
       LT   100  0.392  0.570   1.000  0.667  0.632  0.621  0.566
8      mGT  28   0.318  0.321   0.500  0.556  0.316  0.345  0.242
       GT   28   0.327  0.357   0.500  0.556  0.316  0.345  0.242
       LT   28   0.465  0.393   1.000  0.556  0.474  0.379  0.273
9      mGT  45   0.735  0.667   1.000  1.000  0.895  0.793  0.434
       GT   45   0.679  0.600   1.000  1.000  0.947  0.759  0.404
       LT   45   0.146  0.156   0.750  0.556  0.368  0.241  0.162
10     mGT  63   0.584  0.693   0.999  0.768  0.727  0.576  0.536
       GT   63   0.507  0.615   0.999  0.768  0.674  0.542  0.485
       LT   63   0.320  0.396   0.740  0.657  0.727  0.645  0.344

Table 4: Average values of performance.

run eMAP eRprec eP5 eP10 eP20 eP30 eP100

mGT 0.569 0.601 0.917 0.840 0.784 0.686 0.495

GT 0.535 0.575 0.917 0.827 0.766 0.667 0.475

LT 0.399 0.457 0.806 0.691 0.661 0.556 0.384

This probably happens because the documents feeding the vector of features builder have not covered, in terms of subtopics, all the examples present in the repository.

Notwithstanding this, the most important fact is that, when the graph is added to the initial query, the search engine performs better than with either a graph of word pairs or a simple word list. As we can see in Table 5, the results on topics 4, 6 and 8 are the worst cases, while topics 2, 3, 5, 7, 9 and 10 are the best, as confirmed by the previous discussion of table 3.

5 CONCLUSIONS

In this work we have demonstrated that a mixed Graph of Terms based on a hierarchical representation is capable of retrieving a greater number of relevant documents than less complex representations based on either simple interconnected pairs of words or a list of words, even if the size of the training set is small and composed of only relevant documents.

These results suggest that our approach can be employed in a kind of interactive query expansion process, where the user can initially perform a query composed of key words, and later can select only relevant examples from the result set and so feed the mGT builder. At this point, the system can add the knowledge extracted from those documents suggested by the user, and the query can be performed again.

Table 5: Cumulative Gain (CG), Discounted Cumulative Gain (DCG), Normalized Discounted Cumulative Gain (nDCG).

Topic  Run  Rel  CG   DCG     IDCG    nDCG
1      mGT  55   80   25.536  30.030  0.850
       GT   49   74   24.528  28.985  0.846
       LT   35   42   15.502  17.421  0.890
2      mGT  62   81   24.257  28.577  0.849
       GT   57   75   22.654  27.271  0.831
       LT   37   55   17.053  23.690  0.720
3      mGT  67   76   18.568  24.288  0.764
       GT   57   64   16.320  21.385  0.763
       LT   44   50   12.048  18.435  0.654
4      mGT  48   63   19.330  24.267  0.797
       GT   52   67   20.361  24.970  0.815
       LT   61   74   21.352  25.495  0.837
5      mGT  36   60   23.714  26.175  0.906
       GT   33   55   22.325  24.728  0.903
       LT   31   44   16.330  20.072  0.814
6      mGT  33   39   10.698  16.366  0.654
       GT   34   40   10.823  16.561  0.654
       LT   35   41   11.069  16.754  0.661
7      mGT  76   98   25.405  32.205  0.789
       GT   78   100  25.696  32.522  0.790
       LT   57   85   23.748  31.621  0.751
8      mGT  24   32   11.817  15.826  0.747
       GT   24   32   11.943  15.826  0.755
       LT   27   35   12.369  16.457  0.752
9      mGT  43   60   20.977  24.336  0.862
       GT   40   55   19.763  22.814  0.866
       LT   16   20   8.775   11.229  0.781
10     mGT  54   79   24.436  29.920  0.818
       GT   48   73   23.428  27.885  0.840
       LT   34   41   14.402  16.321  0.882

REFERENCES

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern In-

formation Retrieval. ACM Press, New York.

Bhogal, J., Macfarlane, A., and Smith, P. (2007). A review

of ontology based query expansion. Information Pro-

cessing & Management, 43(4):866 – 886.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent

dirichlet allocation. Journal of Machine Learning Re-

search, 3(993–1022).

Callan, J., Croft, W. B., and Harding, S. M. (1992). The

inquery retrieval system. In In Proceedings of the

Third International Conference on Database and Ex-

pert Systems Applications, pages 78–83. Springer-

Verlag.

Cao, G., Nie, J.-Y., Gao, J., and Robertson, S. (2008).

Selecting good expansion terms for pseudo-relevance

feedback. In Proceedings of the 31st annual inter-

national ACM SIGIR conference on Research and de-

velopment in information retrieval, SIGIR ’08, pages

243–250, New York, NY, USA. ACM.

Carpineto, C., de Mori, R., Romano, G., and Bigi, B.

(2001). An information-theoretic approach to auto-

matic query expansion. ACM Trans. Inf. Syst., 19:1–

27.

Carterette, B., Allan, J., and Sitaraman, R. (2008). Minimal

test collections for retrieval evaluation. In 29th In-

ternational ACM SIGIR Conference on Research and

development in information retrieval.

Manning, C. D., Raghavan, P., and Schütze, H. (2008). In-

troduction to Information Retrieval. Cambridge Uni-

versity Press.

Clarizia, F., Greco, L., and Napoletano, P. (2011). An adap-

tive optimisation method for automatic lightweight

ontology extractions. In Filipe, J. and Cordeiro, J., ed-

itors, Lecture Notes in Business Information Process-

ing, pages 357–371. Springer-Verlag Berlin Heidelberg.

Collins-Thompson, K. and Callan, J. (2005). Query expan-

sion using random walk models. In Proceedings of

the 14th ACM international conference on Informa-

tion and knowledge management, CIKM ’05, pages

704–711, New York, NY, USA. ACM.

Dumais, S., Joachims, T., Bharat, K., and Weigend, A.

(2003). SIGIR 2003 workshop report: implicit mea-

sures of user interests and preferences. SIGIR Forum,

37(2):50–54.

Efthimiadis, E. N. (1996). Query expansion. In Williams,

M. E., editor, Annual Review of Information Systems

and Technology, pages 121–187.

Griffiths, T. L., Steyvers, M., and Tenenbaum, J. B. (2007).

Topics in semantic representation. Psychological Re-

view, 114(2):211–244.

Jansen, B. J., Booth, D. L., and Spink, A. (2008). Determin-

ing the informational, navigational, and transactional

intent of web queries. Information Processing & Man-

agement, 44(3):1251–1266.

Jansen, B. J., Spink, A., and Saracevic, T. (2000). Real

life, real users, and real needs: a study and analysis

of user queries on the web. Information Processing &

Management, 36(2):207–227.

Ko, Y. and Seo, J. (2009). Text classification from unlabeled

documents with bootstrapping and feature projection

techniques. Inf. Process. Manage., 45:70–83.

Lang, H., Metzler, D., Wang, B., and Li, J.-T. (2010).

Improved latent concept expansion using hierarchi-

cal Markov random fields. In Proceedings of the

19th ACM international conference on Information

and knowledge management, CIKM ’10, pages 249–

258, New York, NY, USA. ACM.

Lee, C.-J., Lin, Y.-C., Chen, R.-C., and Cheng, P.-J. (2009).

Selecting effective terms for query formulation. In

Lee, G., Song, D., Lin, C.-Y., Aizawa, A., Kuriyama,

K., Yoshioka, M., and Sakai, T., editors, Information

Retrieval Technology, volume 5839 of Lecture Notes

in Computer Science, pages 168–180. Springer Berlin

/ Heidelberg.

Slonim, N. and Tishby, N. (2001). The power of word clus-

ters for text classification. In 23rd European Collo-

quium on Information Retrieval Research.

Okabe, M. and Yamada, S. (2007). Semisupervised query

expansion with minimal feedback. IEEE Transactions

on Knowledge and Data Engineering, 19:1585–1589.

Piao, S., Rea, B., McNaught, J., and Ananiadou, S. (2010).

Improving full text search with text mining tools. In

Horacek, H., Métais, E., Muñoz, R., and Wolska, M., ed-

itors, Natural Language Processing and Information

Systems, volume 5723 of Lecture Notes in Computer

Science, pages 301–302. Springer Berlin / Heidelberg.

Robertson, S. E. (1991). On term selection for query expan-

sion. J. Doc., 46:359–364.

Robertson, S. E. and Walker, S. (1997). On relevance

weights with little relevance information. In Proceed-

ings of the 20th annual international ACM SIGIR con-

ference on Research and development in information

retrieval, SIGIR ’97, pages 16–24, New York, NY,

USA. ACM.

Ruthven, I. (2003). Re-examining the potential effective-

ness of interactive query expansion. In Proceedings

of the 26th annual international ACM SIGIR confer-

ence on Research and development in information re-

trieval, SIGIR ’03, pages 213–220, New York, NY,

USA. ACM.

Sebastiani, F. (2002). Machine learning in automated text

categorization. ACM Comput. Surv., 34:1–47.
