be aggregated within the document ranking process.
In this paper we propose an aggregation mechanism
that allows a multitude of query-independent attributes to be combined.
We use two approaches, one aggregating the attribute scores and the other
aggregating ranks, with a weighted sum and logistic regression serving as
the aggregation vehicles. We present an evaluation framework that targets
the CDS document collection, a production database used at CERN.
In Section 2 we outline the aggregation method, in Section 3 we describe
the experimental data set-up, in Section 4 we present the results obtained
on a test data set, and we conclude in Section 5.
2 SCORE AGGREGATION
We divide the process of score aggregation into three
steps: (i) first we select the ranking attributes that are suitable for
aggregation, (ii) in the second step, the scores are normalized and
re-scaled, and (iii) finally, the scores are aggregated via a score
aggregation function.
Selection of Attributes. In the first phase we select
attributes that are suitable for aggregation. Attributes that are not
correlated are good candidates for aggregation. On the other hand,
attributes that are highly correlated can be considered substitutes, in
which case we select only one of them.
We noticed that traditional correlation coefficients such as Spearman's
rank correlation or Kendall's tau do not take into account the importance
of the top ranks. For this reason the correlations should be adjusted so
as to put more weight on changes that occur in the upper part of the
ranked list. Some work in this direction has also been suggested by
(Yilmaz et al., 2008).
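As an illustration, the following is a minimal sketch of this selection step. The attribute names ("downloads", "citations") and the top-weighted agreement measure are hypothetical choices made for the example; the standard Kendall tau is computed with SciPy, and the weighted variant simply down-weights rank changes that occur deeper in the list.

```python
# Illustrative sketch only: compare a standard rank correlation with a
# simple top-weighted agreement measure between two hypothetical
# query-independent attributes scored over the same documents.
import numpy as np
from scipy.stats import kendalltau, rankdata

def top_weighted_agreement(scores_a, scores_b):
    """Rank agreement that emphasizes the upper part of the ranked list."""
    ranks_a = rankdata(-np.asarray(scores_a))   # rank 1 = best document
    ranks_b = rankdata(-np.asarray(scores_b))
    weights = 1.0 / ranks_a                     # top ranks carry more weight
    n = len(ranks_a)
    displacement = np.sum(weights * np.abs(ranks_a - ranks_b)) / (np.sum(weights) * (n - 1))
    return 1.0 - displacement                   # 1.0 means identical orderings

downloads = [120, 80, 75, 10, 5]                # hypothetical attribute scores
citations = [40, 2, 35, 1, 0]
tau, _ = kendalltau(downloads, citations)
print(tau, top_weighted_agreement(downloads, citations))
```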
Score Normalization. In the second step we normalize the scores so that
they reflect the underlying distribution of values. The idea is that a
normalized score should reflect the proportion of the population of
documents with lower scores, as observed for a given ranking attribute.
For example, if a score of N corresponds to the median score among all
observed scores, it should be converted into a normalized score of 0.5.
To determine the normalization function for each attribute, we first
calculated values at the percentile level. We then smoothed the obtained
values using standard density estimation techniques to approximate the
underlying densities, and constructed the cumulative distribution
function by summing the values over the corresponding intervals.
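A minimal sketch of this normalization step is given below, assuming the raw scores of one attribute are available as an array. The use of a Gaussian kernel density estimate is our assumption for the "standard density estimation techniques"; the resulting cumulative distribution maps the median raw score to approximately 0.5.

```python
# Illustrative sketch: build an empirical-CDF normalizer for one attribute.
import numpy as np
from scipy.stats import gaussian_kde

def build_normalizer(raw_scores, grid_size=512):
    density = gaussian_kde(raw_scores)              # smooth the observed scores
    grid = np.linspace(min(raw_scores), max(raw_scores), grid_size)
    cdf = np.cumsum(density(grid))                  # sum density over intervals
    cdf /= cdf[-1]                                  # rescale to [0, 1]
    return lambda score: np.interp(score, grid, cdf)

# Example with synthetic, skewed scores: the median maps close to 0.5.
normalize = build_normalizer(np.random.lognormal(mean=3.0, sigma=1.0, size=10_000))
print(normalize(np.exp(3.0)))
```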
Score Aggregation. The task of score aggregation has been addressed in
several previous works. Garcin et al. (Garcin et al., 2009) analyze the
aggregation of feedback ratings into a single value. They consider
different aggregations with respect to informativeness, robustness and
strategyproofness, and show that on all these criteria the mean seems to
be the worst way of aggregating ratings, while the median is more robust.
In previous works, logistic regression was also used as a vehicle to
aggregate scores (Le Calvé and Savoy, 2000), (Savoy and Vrajitoru, 1996),
(Craswell et al., 1999). In our preliminary study we adopted two such
aggregation models, one based on logistic regression and one on a
weighted sum.
To rank documents with logistic regression, we first compute the value of
the logit that corresponds to a particular combination of the scores of an
individual document. We then project the obtained result onto the logistic
curve and read the resulting aggregated score on the Y-axis.
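The following sketch illustrates both aggregation models on the normalized scores of a single document; the weights and the intercept are placeholder values (in our experiments the coefficients were chosen manually, as noted below).

```python
# Illustrative sketch of the two aggregation models used in this study.
import math

def weighted_sum(scores, weights):
    return sum(w * s for w, s in zip(weights, scores))

def logistic_aggregation(scores, weights, intercept=0.0):
    logit = intercept + weighted_sum(scores, weights)  # combine attribute scores
    return 1.0 / (1.0 + math.exp(-logit))              # project onto the logistic curve

doc_scores = [0.9, 0.4, 0.7]     # normalized scores of one document (hypothetical)
weights = [0.5, 0.2, 0.3]        # placeholder regression coefficients / weights
print(weighted_sum(doc_scores, weights), logistic_aggregation(doc_scores, weights))
```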
More detail about the implementation of rank aggregation with logistic
regression in d-Rank can be found in (Vesely and Rajman, 2009). In our
study we worked with manually chosen regression coefficients, for which we
tested a variety of combinations; eventually the coefficients should be
learned through an automated procedure. The way we generated the data for
our experiments is described in more detail in the next section.
3 EXPERIMENTAL SETUP
In our work we plan to perform two types of evaluation: a system
evaluation using a referential that we extracted from the user access
logs, and a user-centric evaluation (Voorhees, 2002).
In order to proceed with the system evaluation, we needed a referential
that would allow us to evaluate and compare our system using standard
information retrieval measures. To our knowledge, the CDS collection has
not been used in an information retrieval evaluation in the past. One of
the results of our work is thus a referential that allows for a system
evaluation of the various document retrieval scenarios featured by the
CDS retrieval system.
To create the referential of relevance judgments,
we opted for parsing the user access logs for queries
that were issued by users of the CDS search system
in the past. In this way we obtained a set of test
queries for experimentation that is close to a real-
world scenario. For this purpose, we created a tool
that allows us to parse user access logs and extract