a number of parameters of our model, as well as
by exploring different document preprocessing steps.
Another important feature of this paper is the use
of passages rather than entire documents. The
motivation behind adopting passages for query
expansion was to reduce noise that could lead to
topic drift in the resulting query. Moreover,
retrieving indexed passages rather than full
documents from the IR engine reduced the amount
of text to be processed by our graph-based approach.
The paper is organized as follows: Section 2 presents
an overview of previous work on query expansion
using different language modelling approaches, and
also highlights how passage-level evidence is used
to extract semantic relatedness and to expand queries
in order to improve the effectiveness of an IR system.
Section 3 gives an overview of the methodology used,
outlining the details of the graph-based approach and
its application to document- and passage-level
retrieval. It also describes the different similarity
functions employed to extract the top passages for
query augmentation, along with a brief overview of
the test collection used in the experiments. Section 5
reports the experimental results obtained when
comparing query augmentation approaches at the
document and passage level. Lastly, we provide a
brief summary of the main conclusions and outline
future work.
2 RELATED WORK
Many approaches have been explored in previous
research to augment user queries to better reflect
the user's intended information need, thereby
improving the accuracy of the system. One of the
earliest approaches was proposed by Rocchio
(Rocchio, 1971), which attempts to augment a query
to better distinguish between relevant and
non-relevant documents. An ideal query is one that
returns all of the relevant documents while avoiding
all of the irrelevant ones. To estimate this ideal
query, Rocchio suggests an iterative feedback process
whereby the positive and negative feedback provided
by a user is used to guide the query modification. To
achieve this, the author suggests giving each word a
weight relative to its presence in either the relevant
or irrelevant document set.
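The classic Rocchio update can be sketched as follows; the weights alpha, beta and gamma below are commonly used illustrative values, not taken from the original paper:

```python
import numpy as np

def rocchio(query_vec, rel_docs, nonrel_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Modify a query vector using relevance feedback (Rocchio, 1971).

    query_vec: 1-D term-weight vector for the original query.
    rel_docs / nonrel_docs: 2-D arrays, one row per judged document.
    """
    modified = alpha * query_vec
    if len(rel_docs):
        # Pull the query toward the centroid of the relevant documents.
        modified = modified + beta * np.mean(rel_docs, axis=0)
    if len(nonrel_docs):
        # Push it away from the centroid of the non-relevant documents.
        modified = modified - gamma * np.mean(nonrel_docs, axis=0)
    # Negative term weights are usually clipped to zero in practice.
    return np.maximum(modified, 0.0)

# Toy vocabulary of four terms: the query mentions only term 0,
# the relevant documents also share terms 1 and 2.
q = np.array([1.0, 0.0, 0.0, 0.0])
rel = np.array([[1.0, 1.0, 0.0, 0.0],
                [1.0, 0.0, 1.0, 0.0]])
nonrel = np.array([[0.0, 0.0, 0.0, 1.0]])
print(rocchio(q, rel, nonrel))
```

Terms that occur in the relevant set gain weight in the modified query, which is exactly how the feedback loop surfaces expansion terms the user never typed.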
Zhang et al. (Zhang et al., 2005) attempt to improve
search results using metadata found within the
corpus. They designated two features found within
the documents as indicators of how to rank them:
information richness and diversity. Information
richness is the extent to which a document relates
to a particular topic, and diversity is the number of
topics found within the corpus. In addition to
computing these scores for the documents, they
represent the collection as a graph in which each
node corresponds to a document and edges are
determined by the inter-document similarity. Using
this approach they improved the overall ranking in
terms of information gain and diversity by 12% and
31%, respectively.
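A document graph of this kind can be sketched in a few lines; the cosine measure and the 0.2 threshold below are our own illustrative choices, not the exact formulation of Zhang et al.:

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_graph(docs, threshold=0.2):
    """Build an undirected graph: nodes are documents, and an edge
    links any pair whose similarity exceeds the threshold.

    docs: doc_id -> {term: weight} mapping.
    Returns doc_id -> list of (neighbour_id, similarity).
    """
    graph = {d: [] for d in docs}
    for d1, d2 in combinations(docs, 2):
        sim = cosine(docs[d1], docs[d2])
        if sim >= threshold:
            graph[d1].append((d2, sim))
            graph[d2].append((d1, sim))
    return graph

docs = {"d1": {"a": 1.0, "b": 1.0},
        "d2": {"a": 1.0},
        "d3": {"c": 1.0}}
print(similarity_graph(docs))
```

Once the graph is built, a document's neighbourhood gives a cheap proxy for how central (information-rich) or isolated (diverse) it is within the collection.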
Hyperspace Analogue to Language (HAL) was proposed
by Lund and Burgess in a theoretical analysis of
capturing interdependencies between terms (Lund and
Burgess, 1996). In this work, they applied a sliding
window of 10 words to their document corpus and
measured the co-occurrence of terms. Yan et al.
(Yan et al., 2010) use an approach related to HAL
and apply it to three TREC datasets for query
augmentation. They identify the drawbacks of
standard bag-of-words approaches and apply HAL to
capture the semantic relationships between terms.
Additionally, they model the syntactic elements of
terms around a target event to better inform which
words to use when augmenting a query.
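The core of HAL is a distance-weighted co-occurrence matrix: within the sliding window, a co-occurring term at distance d receives weight (window - d + 1), so nearer terms count more. A minimal forward-direction sketch (HAL proper also records the reverse direction as separate row/column entries):

```python
from collections import defaultdict

def hal_matrix(tokens, window=10):
    """Build a HAL-style co-occurrence matrix over a token sequence.

    Each term that follows a target within `window` positions is
    credited with weight (window - distance + 1), so adjacent terms
    receive the full window weight and the farthest receive 1.
    """
    hal = defaultdict(lambda: defaultdict(float))
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            hal[target][tokens[j]] += window - d + 1
    return hal

# With window=2: "b" is adjacent to "a" (weight 2), "c" is two
# positions away from "a" (weight 1).
m = hal_matrix(["a", "b", "c"], window=2)
print(dict(m["a"]))
```

Summing or comparing a term's row vector against another term's then yields the semantic-strength scores used for expansion.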
Similarly, Kotov and Zhai (Kotov and Zhai, 2011)
propose using HAL to provide alternative senses for
words. Their dataset comprised three TREC
collections: AP88-89, ROBUST04 and AQUAINT.
They applied a mutual information measure and HAL
to the dataset to ascertain the semantic strength
between terms, and used these values to select the
strongest alternative candidate terms. Six
participants were asked to input the queries found
in the respective datasets and were offered the
option of using the alternative terms if they felt
the search results were insufficient. By combining
these two methodologies they were able to improve
the overall performance of the system. They
concluded that 20 was the ideal window size when
computing the HAL score.
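A pointwise mutual information score of the kind used to rank candidate terms can be estimated roughly as below; the window-based counting and the normalisation by corpus length are our own simplifications, not the exact estimator of Kotov and Zhai:

```python
import math
from collections import Counter

def pmi_scores(tokens, window=20):
    """Rough pointwise mutual information for term pairs that
    co-occur within `window` positions of each other.

    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ); pairs that co-occur
    more often than their marginal frequencies predict score > 0.
    """
    term_counts = Counter(tokens)
    pair_counts = Counter()
    for i, t in enumerate(tokens):
        for u in tokens[i + 1:i + window]:
            if t != u:
                pair_counts[tuple(sorted((t, u)))] += 1
    n = len(tokens)
    scores = {}
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        scores[(a, b)] = math.log(
            p_ab / ((term_counts[a] / n) * (term_counts[b] / n)))
    return scores

# "a" and "b" alternate, so they co-occur far more often than
# chance and should outscore the incidental pair ("b", "c").
print(pmi_scores(["a", "b", "a", "b", "c"], window=2))
```

Ranking pairs by this score, and keeping only the strongest partners of each query term, gives the shortlist of alternative terms offered to the user.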
Passage-level retrieval has been used in the past
for multiple purposes. Callan (Callan, 1994) used
passage-level evidence to improve document-level
ranking. Similarly, Jong and Buckley (Jong et al.,
2015) and Sarwar et al. (Sarwar et al., 2017)
followed the same concept and considered alternative
forms of passage evidence, i.e. the passage score,
the summation of passage scores, the inverse rank,
and evaluation function scores, to retrieve
documents more effectively. Moreover, several
techniques have been used to choose the best passage
boundaries. Callan (Callan, 1994) proposed bounded
passages and an overlapping-window-based approach.
Similarly, text-tiling, the usage of arbitrary passages and
KDIR 2018 - 10th International Conference on Knowledge Discovery and Information Retrieval