expansion technique in Section 5. Section 6 contains the experimental evaluation of our methods, and we conclude in Section 7.
2 RELATED WORK
A statistical language model assigns probabilities to a
sequence of words by means of a probability distribu-
tion. This concept has been used in various applica-
tions such as speech recognition and machine trans-
lation. The first to employ language models for in-
formation retrieval were Ponte and Croft (Ponte and
Croft, 1998). Their basic idea was to estimate a language model for each document and to rank documents by the likelihood that each document's model assigns to the query. Since then, several variants of and improvements to this basic method have been proposed.
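For concreteness, in the standard multinomial form of query-likelihood scoring, each document D is represented by a unigram language model \theta_D, and documents are ranked by the probability that this model generates the query Q:

p(Q \mid \theta_D) = \prod_{w \in Q} p(w \mid \theta_D),

where the product runs over the query terms w.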
Zhai and Lafferty (Zhai and Lafferty, 2004) demonstrated the importance of the selection of smoothing parameters. The term smoothing refers to the adjustment of the maximum likelihood estimator of a language model so that it becomes more accurate and, in particular, does not assign zero probability to words that do not occur in a document, which is known as the zero probability problem.
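As an example of such an adjustment, Jelinek-Mercer smoothing, one of the schemes compared in that study, linearly interpolates the maximum likelihood document model with a collection (background) model:

p_\lambda(w \mid D) = (1 - \lambda)\, p_{ml}(w \mid D) + \lambda\, p(w \mid C),

so that a word w absent from document D still receives the non-zero probability \lambda\, p(w \mid C).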
Recent studies have attempted to exploit the corpus structure, that is, to use clusters of documents as a form of document smoothing within the language modelling retrieval framework. Language models over topics are constructed from clusters, and documents are smoothed with these topic models to improve document retrieval ((Kurland and Lee, 2004); (Liu, 2006); (Liu and Croft, 2004)). Along these lines, Liu and Croft (Liu and Croft, 2004) cluster the documents and smooth each document with the cluster that contains it. Similarly, Kurland and Lee (Kurland and Lee, 2004) suggested a framework which, for each document, obtains the most similar documents in the collection and then smooths the document with these "neighbour documents". These neighbourhoods of documents are formed using as a measure the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), an asymmetric measure of how different two probability distributions are. This measure has been used in several studies ((Lafferty and Zhai, 2001); (Kurland, 2006)).
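For two discrete distributions p and q over the vocabulary, the KL divergence is defined as

D_{KL}(p \,\|\, q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)},

and in general D_{KL}(p \,\|\, q) \neq D_{KL}(q \,\|\, p); this asymmetry is the property that our clustering method later exploits.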
The use of clusters for smoothing is one approach to cluster-based retrieval, while the most common approach aims to create and retrieve one or more clusters in response to a query. Cluster-based retrieval is an idea with a long history, and many researchers have worked on it. Building on the cluster hypothesis (Rijsbergen, 1979), several hard clustering methods have been employed ((Croft, 1980); (Jardine and van Rijsbergen, 1971); (Voorhees, 1985)), as well as several probabilistic (soft) clustering methods ((Blei et al., 2003); (Hofmann, 2001)). Soft clustering methods allow for the case where a document discusses multiple topics: if one assumes that clusters represent topics, then such a document should be associated with all of the corresponding clusters. This is the assumption adopted in this paper.
Finally, another important problem in information retrieval that should be mentioned is information redundancy. Several approaches have been proposed to address this problem. Some of them use ranking algorithms to reduce the redundant information in the search results ((Agrawal et al., 2009); (Chen and Karger, 2006); (Radlinski et al., 2009)), while others reward novelty and diversity so as to cover all aspects of a topic ((Agrawal et al., 2009); (Clarke et al., 2008)). An interesting approach (Plegas and Stamou, 2013) extracts the novel information from documents with semantically equivalent content and creates a single text, called a SuperText, free of the duplicated content. The suggestion we make in this paper is that clusters of documents, assuming that clusters represent topics, are an ideal group of documents in which to locate repeated information.
Since a query can have many interpretations and the documents retrieved for a query can discuss different aspects of the same topic, the ranking of documents in the results list should take these factors into account. Inspired by the related work discussed above, we consider the idea of forming overlapping clusters of relevant documents, taking advantage of the asymmetry of the KL-divergence when it is applied over document language models. The clustering method we suggest not only locates pairs of similar documents, but also determines which of the two is more similar to the other. The first method we suggest locates inter-document lexical similarities using our soft clustering method applied over document language models. Subsequently, we implement the same idea, but this time we attempt to locate inter-document semantic similarities. To do this, we examine a new way to embed semantic information within document language models. This method is query-independent, since the query is used only for ranking purposes. As mentioned above, we consider the task of locating duplicated content inside the formed clusters in order to further improve search results. For this purpose, we apply a redundancy elimination method (Plegas and Stamou, 2013) not over the initial search results list, but over each of our clusters,
which is also a new attempt to remove duplicated information.
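To illustrate how the asymmetry of the KL-divergence can be used, the following minimal Python sketch (an illustration under our own simplifying assumptions, not the exact procedure of this paper) smooths two toy document models and computes the divergence in both directions; comparing the two values indicates which document is closer to the other:

import math

def smoothed_model(doc_counts, collection_probs, lam=0.5):
    # Jelinek-Mercer smoothed unigram model over the whole vocabulary.
    total = sum(doc_counts.values())
    return {w: (1 - lam) * doc_counts.get(w, 0) / total + lam * p_c
            for w, p_c in collection_probs.items()}

def kl(p, q):
    # KL divergence D(p || q); smoothing guarantees q(w) > 0 everywhere.
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

# Toy data: two documents and a uniform background collection model.
d1 = {"cluster": 3, "retrieval": 2}
d2 = {"cluster": 1, "retrieval": 1, "language": 2, "model": 2}
background = {"cluster": 0.25, "retrieval": 0.25, "language": 0.25, "model": 0.25}

p1 = smoothed_model(d1, background)
p2 = smoothed_model(d2, background)

# The two directions generally differ, so beyond detecting that the
# documents are similar, we also learn which one is "closer" to the other.
print(kl(p1, p2), kl(p2, p1))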