consuming (Ko and Seo, 2009; Ruthven, 2003), make
use of pseudo-relevance feedback. Nevertheless,
fully automatic methods suffer from obvious errors
when the initial query is intrinsically ambiguous. As
a consequence, in recent years some hybrid techniques
have been developed which take into account
minimal explicit human feedback (Okabe and Yamada,
2007; Dumais et al., 2003) and use it to automatically
identify other topic-related documents. The
performance achieved by these methods is usually
moderate, with a mean average precision of about 30%
(Okabe and Yamada, 2007).
However, whatever the technique that selects the
set of documents representing the feedback, the
expansion terms are usually computed by making use
of well-known approaches for term selection such as
Rocchio, Robertson, chi-square, Kullback-Leibler, etc.
(Robertson and Walker, 1997; Carpineto et al., 2001).
In this case the reformulated query consists of a
simple (sometimes weighted) list of words.
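As a minimal sketch of this kind of term selection, the following Python fragment scores candidate expansion terms with a Kullback-Leibler style weight, p_R(t) log(p_R(t) / p_C(t)), where p_R is the probability of a term in the feedback documents and p_C its probability in the whole collection. The function names, the pre-tokenized input and the cut-off k are assumptions made for illustration only; they are not taken from the cited works.

```python
import math
from collections import Counter


def kl_term_scores(feedback_docs, collection_docs):
    """Score each feedback term by p_R(t) * log(p_R(t) / p_C(t)).

    Both arguments are lists of token lists (hypothetical, pre-tokenized input).
    """
    rel = Counter(t for doc in feedback_docs for t in doc)
    col = Counter(t for doc in collection_docs for t in doc)
    rel_total = sum(rel.values())
    col_total = sum(col.values())
    scores = {}
    for term, freq in rel.items():
        if col[term] == 0:
            # Skip terms that never occur in the collection (avoids division by zero).
            continue
        p_r = freq / rel_total
        p_c = col[term] / col_total
        scores[term] = p_r * math.log(p_r / p_c)
    return scores


def expand_query(query_terms, feedback_docs, collection_docs, k=3):
    """Return the original terms and the k best-scoring new (term, weight) pairs."""
    scores = kl_term_scores(feedback_docs, collection_docs)
    candidates = [(t, s) for t, s in scores.items() if t not in query_terms]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return list(query_terms), candidates[:k]
```

The result is exactly the flat, optionally weighted list of words described above: the original query terms followed by the selected expansion terms and their scores.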
Although such term selection methods have
proven their effectiveness in terms of accuracy and
computational cost, several more complex alternative
methods have been proposed. These usually
extract a structured set of words, so that the
related expanded query is no longer a list of words
but a weighted set of clauses combined with suitable
operators (Callan et al., 1992; Collins-Thompson
and Callan, 2005; Lang et al., 2010).
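To make the difference from a flat list of words concrete, the following Python sketch renders a weighted set of clauses as a query string. The representation (terms AND-ed inside a clause, clauses OR-ed together, a caret followed by the clause weight) is an assumption chosen for display purposes, and the function name is hypothetical; the cited systems each define their own operators.

```python
def render_structured_query(clauses):
    """Render weighted clauses as a single query string.

    clauses is a list of (terms, weight) pairs. Terms inside a clause are
    AND-ed and the clauses are OR-ed (an illustrative choice of operators).
    """
    parts = []
    for terms, weight in clauses:
        body = " AND ".join(terms)
        parts.append(f"({body})^{weight:g}")
    return " OR ".join(parts)
```

For example, the clauses `[(["pump", "valve"], 0.7), (["motor"], 0.3)]` are rendered as `(pump AND valve)^0.7 OR (motor)^0.3`.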
In this paper we propose a query expansion
method based on explicit relevance feedback that
expands the initial query with a new structured query
representation, or vector of features, that we call a
mixed Graph of Terms. This structure can be automatically
extracted from a set of documents D using a global
method for term extraction based on a supervised
Term Clustering technique weighted by the Latent
Dirichlet Allocation implemented as the Probabilistic
Topic Model.
The evaluation of the method has been conducted
on a web repository collected by crawling a large
number of web pages from the website ThomasNet.com.
We have considered several topics and performed
a comparison with two less complex structures:
one represented as a set of pairs of words and
another that is a simple list of words. The results
show that, independently of the context, a
more complex representation is capable of retrieving
a greater number of relevant documents, achieving a
mean average precision of about 50%.
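For readers unfamiliar with the measure, the following Python sketch computes mean average precision in its standard form: the average precision of a topic is the mean of the precision values at the ranks where relevant documents appear, divided by the number of relevant documents, and the mean is then taken over topics. The function names and the document identifiers are hypothetical, and the sketch makes no claim about the exact evaluation protocol used in the experiments.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank: relevant documents seen so far / rank.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs, one per topic."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

With the ranking d1, d2, d3, d4 and relevant documents d1 and d3, the average precision is (1/1 + 2/3) / 2, that is, about 0.83.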
2 THE PROPOSED APPROACH
The vector of features needed to expand the query is
obtained as a result of an interactive process between
the user and the system. The user initially performs a
retrieval by inputting a query to the system and later
identifies a small set D of relevant documents from
the hit list of documents returned by the system. This
set is considered as the training set (the relevance
feedback).
Existing query expansion techniques mostly use
the relevance feedback of both relevant and irrelevant
documents. Usually they obtain the term selection
through the scoring function proposed in (Robertson,
1991; Carpineto et al., 2001), which assigns a weight
to each term depending on its occurrence in both
relevant and irrelevant documents. In contrast, in this
paper we do not consider irrelevant documents, and
the extraction of the vector of features is performed
through a method based on a supervised Term Clustering
technique.
More precisely, the vector of features, which we call a
mixed Graph of Terms, can be automatically extracted
from a set of documents D using a method
for term extraction based on a supervised Term Clustering
technique (Sebastiani, 2002) weighted by the
Latent Dirichlet Allocation (Blei et al., 2003)
implemented as the Probabilistic Topic Model (Griffiths
et al., 2007).
The graph is composed of two subgraphs (or levels),
one directed and one undirected. The lowest level,
namely the word level, is obtained by grouping
terms with a high degree of pairwise semantic
relatedness; there are thus several groups (clusters), each
represented as a cloud of words connected
to its centroid by directed edges. The centroids are
alternatively called concepts (see fig. 1(b)). The
second level, namely the conceptual level, is
obtained by inferring semantic relatedness between
centroids, and so between concepts (undirected edges,
see fig. 1(a)).
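The two-level structure can be pictured with a small data structure. The Python sketch below is only an illustration of the description above: the class name, the field names and the convention of storing a directed edge as a (concept, word) pair are assumptions, and the weights are placeholders rather than values produced by the extraction method.

```python
from dataclasses import dataclass, field


@dataclass
class MixedGraphOfTerms:
    # Conceptual level: undirected weighted edges between concepts (centroids).
    concept_edges: dict = field(default_factory=dict)
    # Word level: directed weighted edges linking a concept and a word of its cluster.
    word_edges: dict = field(default_factory=dict)

    def add_concept_edge(self, c1, c2, weight):
        if c1 == c2:
            raise ValueError("a concept edge needs two distinct concepts")
        # A frozenset key makes (c1, c2) and (c2, c1) the same undirected edge.
        self.concept_edges[frozenset((c1, c2))] = weight

    def add_word_edge(self, concept, word, weight):
        # The (concept, word) ordering is an assumed convention for the direction.
        self.word_edges[(concept, word)] = weight

    def concepts(self):
        """All concepts that appear at either level."""
        found = {c for c, _ in self.word_edges}
        for pair in self.concept_edges:
            found.update(pair)
        return found

    def words_of(self, concept):
        """Words in the cluster of a concept, with their edge weights."""
        return {w: wt for (c, w), wt in self.word_edges.items() if c == concept}
```

Keeping the two edge sets separate mirrors the distinction between the conceptual level and the word level, and lets each level be inspected on its own.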
The general idea of this paper is supported by
previous work (Noam and Naftali, 2001) that has
confirmed the potential of supervised clustering methods
for term extraction, including in the case of query
expansion (Cao et al., 2008; Lee et al., 2009).
2.1 Extracting a Mixed Graph of Terms
A mixed Graph of Terms (mGT) is a hierarchical
structure composed of two levels of information
represented through an undirected and a directed
subgraph: the conceptual level and the word level, respectively.
We consider extracting it from a corpus D =
A NOVEL QUERY EXPANSION TECHNIQUE BASED ON A MIXED GRAPH OF TERMS