would appear in the retrieval results as irrelevant ones.
If these are ranked high in the result list and used as
the seemingly relevant documents for the pseudofeedback
method, a significant amount of irrelevant information
will almost certainly be included in the final results.
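
As a rough illustration of this risk, the following Python sketch treats the top-ranked documents as relevant and picks the most frequent non-query words as expansion terms. The function name, the toy documents, and the frequency-based selection are assumptions made for illustration only; they are not taken from any system discussed here.

```python
from collections import Counter


def pseudo_feedback_terms(ranked_docs, query, top_k=2, n_terms=3):
    """Treat the top_k ranked documents as relevant and return the
    n_terms most frequent words that are not already in the query."""
    query_words = set(query.lower().split())
    counts = Counter()
    for doc in ranked_docs[:top_k]:
        for word in doc.lower().split():
            if word not in query_words:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n_terms)]


# Toy ranking for the query "jaguar". The two top-ranked documents are
# about the car, although the (hypothetical) user meant the animal.
ranked = [
    "jaguar car dealer price",
    "jaguar car engine price",
    "jaguar animal habitat rainforest",
]
terms = pseudo_feedback_terms(ranked, "jaguar")
```

In this toy case the expansion terms are drawn entirely from the car-related documents, so nothing about the animal survives into the expanded query.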
Therefore, we present a new method and a system
for related word extraction that uses Wikipedia
(footnote 4) as the information source. Wikipedia is a
well-known online encyclopedia that is well organized
and rich in content, vocabulary, and internal links.
Moreover, since it can easily be updated by anyone, it
has attracted the attention of many researchers. Using
Wikipedia, systems will be able to remove irrelevant
information from their retrieval results and improve
the accuracy of the results, especially when the area
of interest is unfamiliar to the user.
The problem with Wikipedia is as follows: the keyword
given by the user is quite likely to be absent from it,
since the total amount of information contained in
Wikipedia is quite small compared with that contained
in all the Web pages on the Internet, and the user's
query may be foreign to Wikipedia. We therefore
categorize queries in advance and build a system that
is most useful for a user who wants to use the Web as
if it were a large virtual dictionary. Its usefulness
is evaluated on the basis of accuracy improvement and
the quantity of information that is new and interesting
to the user.
2 RELATED WORK
Bedsides pseudofeedback, there are two more feed-
back methods called explicit feedback and implict
feedback. The former depends on the users evaluation
of documents, whereas the latter automatically col-
lects documents by analyzing users operations such
as scroll, click and zoom.
Another type of system uses personal information that
reflects a user profile as auxiliary information (Sieg
et al., 2007; Yoshinori, 2004; Qiu and Cho, 2006).
Concretely, a user profile is derived from schedules
or from a database containing his/her interests or
favorites. These help the system to offer related words
that match his/her intention. For instance, when a user
wants to know the weather forecast, the system would
examine just the regions near the place where he/she
lives. As long as the user wants information that is
very specific to his/her interests, it is more
appropriate to derive related words from user profiles
than to search for them on the Web, which includes
words from many fields of general interest. Yet these
systems would have difficulty showing relevant words
when the user really wants to obtain information that
is quite new to him/her. This is because there should
be little information in the user files that could
suggest words about foreign cultures, unseen incidents,
unfamiliar history, and so on.
4 http://ja.wikipedia.org/
As for work concerning Wikipedia, Nakayama et al. (Ito
et al., 2008) have succeeded in constructing an
association thesaurus dictionary for extracting related
words from a given query. They calculate the
co-occurrence of words that have links to other pages
in terms of the relatedness between those linked pages.
However, since Wikipedia's internal links are provided
only at the discretion of the author of each page,
some pages contain many links, others very few, and
some no internal links at all. In such cases, their
system will not work. Our system, on the other hand,
uses a related word extraction algorithm of our own in
which all words, whether linked or not, are taken into
account, and is therefore free from the above problem.
The details of the algorithm are described in Section 5.
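
To make the contrast concrete, the sketch below compares candidate words drawn only from internal links with candidates drawn from every word on a page. It is a toy illustration, not the algorithm of Section 5 nor that of Ito et al.: the function names and sample pages are hypothetical, the `[[target|label]]` pattern is standard MediaWiki link markup, and the simple ASCII tokenizer is an assumption that stands in for proper morphological analysis of Japanese text.

```python
import re

# Matches [[target]] and [[target|label]], capturing the link target.
LINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")


def link_candidates(wikitext):
    """Candidate words taken only from internal-link targets."""
    return [target.strip().lower() for target in LINK.findall(wikitext)]


def all_word_candidates(wikitext):
    """Candidate words taken from every word, linked or not."""
    # Replace each link with its visible text, then tokenize.
    plain = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", wikitext)
    return re.findall(r"[a-z]+", plain.lower())


page_with_links = "Sushi uses [[rice]] and [[seafood|fish]]"
page_without_links = "Sushi is vinegared rice with seafood"

linked_only = link_candidates(page_without_links)
every_word = all_word_candidates(page_without_links)
```

For the page without internal links, the link-only approach yields no candidates at all, whereas the all-words approach still has material to work with.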
3 SYSTEM OVERVIEW
Figure 1 shows an overview of our system.
Figure 1: Image of system.
First, the user inputs a query, referred to as the
initial query, to the system. The goal of our system is
to find related words for this query.
Second, the system collects documents, called relevant
documents, that are related to the initial query and
serve as sources for extracting related words. Our
system takes Wikipedia as the source of relevant
documents and extracts paragraphs related to the
initial query from Wikipedia pages. The details are
described in Section 4.
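
The second step can be pictured with the following minimal sketch, which keeps only the paragraphs of a page that mention the query. The selection rule (a case-insensitive substring match), the function name, and the sample page are assumptions made for illustration; the actual extraction criteria are those given in Section 4.

```python
def select_paragraphs(page_text, query):
    """Return the paragraphs of page_text that mention the query.

    Paragraphs are assumed to be separated by blank lines, and the
    match is a plain case-insensitive substring test.
    """
    paragraphs = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    needle = query.lower()
    return [p for p in paragraphs if needle in p.lower()]


page = (
    "Sushi is a Japanese dish.\n\n"
    "Tokyo is the capital of Japan.\n\n"
    "Many kinds of sushi use raw fish."
)
relevant = select_paragraphs(page, "sushi")
```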
Third, the system performs morphological analysis
on the relevant documents collected above by