of α is that we do not want any absent concept to outweigh any present concept.
We expect the above domain-knowledge inferences to be applied before standard information retrieval methods are deployed. The row vector at the top of Figure 3 is an indicator-based term vector. This term vector is either transformed with the help of the knowledge model (our model) or retained without change (standard model). We use the knowledge model to expand the term vectors: every unobserved concept receives the probability inferred for it from the PHITS models. A variety of existing information retrieval methods (e.g., PLSI (Hofmann, 1999) and LDA (Wei and Croft, 2006)) can then be applied to either of these two term-vector options. This makes it easy to combine our knowledge model with many existing retrieval techniques.
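To make the expansion step concrete, the following Python sketch fills each absent entry of an indicator term vector with its inferred probability. The function infer_absent_prob is a hypothetical interface standing in for the PHITS inference of Equation 1; this is an illustrative sketch, not the authors' implementation.

import numpy as np

def expand_document_vector(term_vector, infer_absent_prob):
    # term_vector: binary numpy array with 1 for each concept observed in d.
    # infer_absent_prob(i, observed): hypothetical callable returning the
    # PHITS-inferred probability of absent concept i given the observed
    # concepts (the quantity defined in Equation 1).
    observed = np.flatnonzero(term_vector)
    expanded = term_vector.astype(float)
    for i in np.flatnonzero(term_vector == 0):   # every unobserved concept
        expanded[i] = infer_absent_prob(i, observed)
    return expanded   # real-valued vector usable by PLSI, LDA, etc.

The expanded vector can then be passed to any vector-space retrieval method in place of the original indicator vector.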
3.2.3 Query Expansion Retrieval
An alternative to expanding document vectors is to expand the query vectors. Briefly, the aim is to select concepts that are not in the original query but are likely to provide a relevant match. To implement this approach, we first calculate $P(e = b_i \mid o_1, o_2, \ldots, o_k)$, the probability of seeing an absent concept $b_i$ given the list of observed concepts $o_1, o_2, \ldots, o_k$ in a query, as defined in Equation 1. We then sort the absent concepts by their conditional probabilities and choose the top $m$ concepts to expand the query vector.
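A minimal sketch of this selection step, assuming a hypothetical cond_prob function that evaluates the conditional probability of Equation 1:

def expand_query(query_concepts, all_concepts, cond_prob, m):
    # cond_prob(b, query_concepts): hypothetical stand-in for
    # P(e = b_i | o_1, ..., o_k) from Equation 1.
    observed = set(query_concepts)
    absent = [b for b in all_concepts if b not in observed]
    ranked = sorted(absent, key=lambda b: cond_prob(b, query_concepts),
                    reverse=True)
    return list(query_concepts) + ranked[:m]   # original plus top-m concepts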
We applied two different inferences to expand documents and queries because queries carry much less information, or evidence, than documents. Choosing only the top $m$ relevant concepts lets us control the amount of noise introduced by the expansion. Our evaluation results show that this approach is effective in improving retrieval performance.
4 EXPERIMENTS
To demonstrate the benefit of the knowledge model, we incorporate it into several document retrieval techniques and compare them to the state-of-the-art research search engine Lemur/Indri. We learned the probabilistic PHITS knowledge model from two sources: the document corpus and the MSigDB database. Furthermore, we combine these two knowledge models using model averaging. To combine the inferences from the two models, we rewrite Equation 1 as follows:
$$P(e \mid d) \;=\; \sum_{m} P(e \mid d, m)\,\frac{P(d \mid m)\,P(m)}{\sum_{m'} P(d \mid m')\,P(m')} \qquad (3)$$
where e is an absent concept in document d and m is
a knowledge model. We assume a uniform prior probability over the two models. Similarly, we apply model
averaging in all inference steps to combine the two
models.
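A direct transcription of Equation 3 in Python, assuming each knowledge model exposes hypothetical methods cond_prob(e, d) for $P(e \mid d, m)$ and likelihood(d) for $P(d \mid m)$:

def average_models(e, d, models, priors=None):
    # With a uniform prior P(m) the priors cancel in the ratio, but we
    # keep them explicit to match Equation 3.
    if priors is None:
        priors = [1.0 / len(models)] * len(models)   # uniform P(m)
    weights = [mdl.likelihood(d) * p for mdl, p in zip(models, priors)]
    total = sum(weights)                             # denominator of Eq. 3
    return sum(mdl.cond_prob(e, d) * w
               for mdl, w in zip(models, weights)) / total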
We use the standard retrieval evaluation metric,
Mean Average Precision (MAP), to measure the re-
trieval performance in our experiments.
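For reference, MAP averages, over all queries, the precision accumulated at each rank where a relevant document is retrieved. A standard computation in Python (illustrative, not tied to our system):

def average_precision(ranked_docs, relevant):
    # Precision is recorded at every rank holding a relevant document.
    hits, score = 0, 0.0
    for k, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    # runs: list of (ranked_docs, relevant_set) pairs, one per query.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)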
4.1 Document Expansion Retrieval
In the first set of experiments, we evaluate the doc-
ument expansion retrieval approach on a PubMed-
Cancer literature database. It consists of over 6000
research articles on 10 common cancers. Our corpus
contains both full documents and their abstracts.
Since document expansion can easily be incorporated into vector-space information retrieval methods, we combine it with two IR methods, namely LDA and PLSI. We compare these to Lemur/Indri with default settings, except that we use Dirichlet smoothing (µ = 2500), which had the best performance in our experiments.
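Dirichlet smoothing interpolates document term frequencies with the collection language model. The following sketch shows the standard query-likelihood score that this setting corresponds to; Indri's actual implementation differs in details:

import math

def dirichlet_score(query_terms, doc_tf, doc_len, coll_prob, mu=2500):
    # p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu); the score sums
    # log p(w|d) over query terms. coll_prob must assign nonzero
    # probability to every query term.
    return sum(math.log((doc_tf.get(w, 0) + mu * coll_prob[w])
                        / (doc_len + mu))
               for w in query_terms)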
The knowledge model was learned from abstracts, and the complete text of the documents was used to assess the relevance of documents returned by the system. We randomly selected a subset of 20% of the articles as the test set. To learn the probabilistic PHITS knowledge model, we used (i) 80% of the document corpus and (ii) the MSigDB database. The combination of the two models was done at the inference level.
Judging the relevance of a scientific document to a query, especially when partial matches must be assessed, is best done by human experts. Unfortunately, this is a very time-consuming process, so we adopted the following experimental setup: we perform all knowledge-model learning and retrieval analysis on abstracts only, and use exact matches of queries on full texts as surrogate measures of true relevance. Briefly, for a given query we retrieve a document based on its abstract, and its relevance is judged (automatically) by the query's match against the full document.
We generated a set of 500 queries, each consisting of a pair of domain concepts (proteins or genes): 100 of these queries were generated by randomly pairing any two concepts identified in the training corpus, and 400 queries were generated from documents in the test corpus by the following process. To generate a query, we first randomly picked a test document and then randomly selected a pair of concepts that were associated with each other in the full text of that document. Thus, each generated query had a perfect match in the full text of at least one document.
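The generation procedure can be summarized by the following sketch; the data layout (a list of co-occurring concept pairs per test document) is illustrative rather than the authors' actual code:

import random

def make_queries(train_concepts, test_doc_pairs, n_random=100, n_paired=400, seed=0):
    # test_doc_pairs: for each test document, the concept pairs that are
    # associated with each other in its full text (assumed non-empty).
    rng = random.Random(seed)
    queries = [tuple(rng.sample(train_concepts, 2)) for _ in range(n_random)]
    for _ in range(n_paired):
        doc_pairs = rng.choice(test_doc_pairs)   # pick a random test document
        queries.append(rng.choice(doc_pairs))    # pick one associated pair
    return queries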
All 500 queries were run on abstracts, and the rele-