regression model to correctly order these pairs. Its input space consists of feature vectors of pairs of documents, and its output space consists of the pairwise preference {+1, −1} between each pair of documents. In the listwise approach, the input space consists of the set of documents associated with a single query, which is treated as a single training example, and the output space contains the ranked list of those documents. The main problem
with the pointwise and pairwise approaches is that
their loss functions are associated with particular
documents while most evaluation metrics of
information retrieval compute the ranking quality for
individual queries and not for individual documents. The goal of the listwise approach is to maximize evaluation metrics such as NDCG and MAP.
Many real ranking procedures do, in practice, take the relationships between documents into account, yet none of the proposed learning-to-rank algorithms belonging to any of the above approaches does so. Such relationships can be thought of, for example, as relationships between clusters, parent-child hierarchies, etc.
Similar to the toy example in Kurland's PhD
thesis (Kurland, 2006), let q = {computer, printer}
be a query, and consider the documents:
d1 = computer, company, employ, salary
d2 = computer, investment, employer, company
d3 = taxes, printer, salary, company, employer
d4 = computer, printer, disk, tape, hardware
d5 = disk, tape, hardware, laptop
d6 = disk, tape, floppy, cd rom, hardware
Both the documents and the query are
represented using a vector space representation
(Baeza-Yates and Ribeiro-Neto, 2011) and the
weight for each term in a vector is its frequency
within the corresponding document (or query). If we rank the documents with respect to q, we may get the following ranking:
Ranked list = d4, d1, d2, d3, d5, d6 (d4 is the top retrieved document.)
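For illustration, the following sketch (in Python; the example itself does not prescribe an implementation, and the dot product is used here merely as one possible similarity measure) builds the term-frequency vectors for the toy corpus and ranks the documents against q:

from collections import Counter

# Toy corpus from the example above; each document is a bag of terms.
docs = {
    "d1": ["computer", "company", "employ", "salary"],
    "d2": ["computer", "investment", "employer", "company"],
    "d3": ["taxes", "printer", "salary", "company", "employer"],
    "d4": ["computer", "printer", "disk", "tape", "hardware"],
    "d5": ["disk", "tape", "hardware", "laptop"],
    "d6": ["disk", "tape", "floppy", "cd rom", "hardware"],
}
query = ["computer", "printer"]

def score(doc_terms, query_terms):
    # Dot product of the two term-frequency vectors.
    dv, qv = Counter(doc_terms), Counter(query_terms)
    return sum(dv[t] * qv[t] for t in qv)

ranked = sorted(docs, key=lambda d: score(docs[d], query), reverse=True)
print(ranked)  # ['d4', 'd1', 'd2', 'd3', 'd5', 'd6'] (ties broken by the stable sort)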
However, since it is more reasonable to assume that the underlying topic of the query is “computer hardware” rather than “business”, we would like d5 and d6 to be ranked as high as possible in the list. Clustering the documents into two clusters under a hard clustering scheme, in which each document belongs to exactly one cluster, could result in the following clusters: A = {d1, d2, d3}, B = {d4, d5, d6}. If we took this clustering into account and applied the cluster hypothesis, then d5 and d6 would be ranked higher than d1, d2 and d3. This is the desirable outcome, since d5 and d6, although they contain none of the terms that occur in q, are closer to the query's topic (computer hardware) than d1, d2 and d3, which each contain one query term but do not seem to discuss the query topic.
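One possible way to obtain such a hard clustering of the same term-frequency vectors is sketched below (k-means with k = 2 via scikit-learn, reusing the docs dictionary defined above; the example does not fix a particular clustering algorithm, so this is only an illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

texts = [" ".join(terms) for terms in docs.values()]
X = CountVectorizer().fit_transform(texts)             # term-frequency matrix
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for name, label in zip(docs, labels):
    print(name, "-> cluster", label)
# With this toy data the split is expected to be {d1, d2, d3} versus {d4, d5, d6}.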
As another indication of the significance of clustering, it is noted in (Zeng et al., 2004) that existing search engines such as Google (www.google.com), Yahoo (http://search.yahoo.com/) and Bing (www.bing.com) often return a long list of search results, ranked by their relevance to the given query. As a consequence, Web users must scan the list sequentially and examine the titles and snippets to identify their desired results. Undoubtedly, this is a time-consuming procedure when multiple sub-topics of the given query are mingled together. The authors propose, as a possible solution to this problem, clustering the search results online into different groups and enabling users to identify the group they need.
Carrot2 (http://search.carrot2.org/stable/search) is a real-world illustration of this approach.
The aim of the present work is to investigate whether the information gained by clustering, following the well-known cluster hypothesis of information retrieval (Kurland, 2006; Gan, Ma and Wu, 2007; van Rijsbergen, 1984), can be integrated into a learning-to-rank algorithm without user intervention, and to examine the results of this venture. To this end, after building the clusters offline, we provide each document, during the algorithm's operation, with a bonus that corresponds to its cluster. Through this procedure we build on the assumption that a document belonging to a given cluster will appear near the other documents of that cluster in the ranked list. More specifically, we expect the documents of the best cluster to appear at the top of the ranked list and, as a consequence, that we will obtain better ranked lists and better values of the evaluation metrics.
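In outline, this re-ranking mechanism can be sketched as follows (an illustrative sketch only; the concrete learning-to-rank model and the way the cluster bonuses are computed are described in the following sections, and all names below are hypothetical):

def rerank_with_cluster_bonus(base_scores, cluster_of, cluster_bonus):
    # base_scores:   dict doc_id -> score produced by the learning-to-rank model
    # cluster_of:    dict doc_id -> id of the (offline-built) cluster it belongs to
    # cluster_bonus: dict cluster_id -> bonus assigned to that cluster
    final = {d: s + cluster_bonus[cluster_of[d]] for d, s in base_scores.items()}
    return sorted(final, key=final.get, reverse=True)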
Before concluding the introduction we describe
some basic notions:
The BM25 weighting scheme (Robertson et al.,
2004) is a ranking function used by search engines
to rank matching documents according to their
relevance to a given search query.
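In a commonly used form (the exact variant described by Robertson et al., 2004 may differ in details), the BM25 score of a document D for a query Q is
\[ \mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}, \]
where f(q_i, D) is the frequency of term q_i in D, |D| is the length of D, avgdl is the average document length in the collection, and k_1 and b are free parameters.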
Mean Average Precision (MAP) (Baeza-Yates and Ribeiro-Neto, 2011) for a set of queries $q_1, \ldots, q_s$ is the mean of the average precision scores for each query.
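In symbols, following the standard definition,
\[ \mathrm{MAP} = \frac{1}{s} \sum_{i=1}^{s} \mathrm{AP}(q_i), \]
where AP(q_i) denotes the average precision of the ranked list returned for query q_i.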
DCG (Baeza-Yates and Ribeiro-Neto, 2011)
measures the usefulness, or gain, of a document
based on its position in the result list. The gain is
accumulated from the top to the bottom of the result
list with each result’s gain being discounted at lower