the user’s input retrieval query sentence by using the
question sentence in the first contribution. Figure 1
shows an outline of the proposed retrieval method.
Figure 1: Overview of question article retrieval by natural
sentence input.
The question article database of a Web bulletin
board consists of pairs, each comprising a question
article and its article question sentence, which is a
set of sentences extracted from the first contribution
(Skowron, 2005; Li, 2002). The retrieval procedure
is as follows.
Step 1 A user inputs the retrieval query sentence.
Step 2 The retrieval query sentence is analyzed
morphologically, and its nouns are extracted.
Step 3 Candidate articles are retrieved from the
question article database by using the set of
extracted nouns.
Step 4 The question sentence of each candidate
article is judged for similarity to the retrieval
query sentence.
Step 5 Based on the similarity judgment, users
receive the article question sentences and question
articles in order of their similarity to the query,
with the most similar at the top.
The system proposed here extracts the article ques-
tion sentence from the first contribution as prepro-
cessing, then judges whether the retrieval query sen-
tence is similar to the article question sentences.
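
The five steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word filter stands in for real morphological analysis and noun extraction, the Jaccard overlap stands in for the similarity judgment, and all names (`extract_nouns`, `build_index`, `noun_overlap`, `retrieve`) and the sample database are hypothetical.

```python
from collections import defaultdict

# Hypothetical stand-in for morphological analysis: a real system would run a
# morphological analyzer and keep only nouns. Here every token outside a small
# stop list is treated as a "noun".
STOP_WORDS = {"how", "is", "it", "to", "the", "with", "on", "a",
              "can", "i", "my", "does", "do", "what"}

def extract_nouns(sentence):
    tokens = sentence.lower().replace("?", " ").replace(",", " ").split()
    return {t for t in tokens if t not in STOP_WORDS}

def build_index(database):
    """Map each noun to the ids of articles whose question sentence contains it."""
    index = defaultdict(set)
    for article_id, (question_sentence, _article) in database.items():
        for noun in extract_nouns(question_sentence):
            index[noun].add(article_id)
    return index

def noun_overlap(query_nouns, candidate_nouns):
    """Placeholder similarity (Jaccard overlap), not the paper's index."""
    union = query_nouns | candidate_nouns
    return len(query_nouns & candidate_nouns) / len(union) if union else 0.0

def retrieve(query, database, index, similarity=noun_overlap):
    query_nouns = extract_nouns(query)                 # Steps 1-2
    candidates = set()
    for noun in query_nouns:                           # Step 3
        candidates |= index.get(noun, set())
    scored = []
    for article_id in candidates:                      # Step 4
        question_sentence, _article = database[article_id]
        score = similarity(query_nouns, extract_nouns(question_sentence))
        scored.append((score, article_id, question_sentence))
    scored.sort(key=lambda item: (-item[0], item[1]))  # Step 5: best first
    return scored

# Hypothetical database: id -> (article question sentence, question article).
database = {
    1: ("Can I keep miniSD data after a model change?", "full article text 1"),
    2: ("How long does the battery last?", "full article text 2"),
    3: ("Is the miniSD card included?", "full article text 3"),
}
index = build_index(database)
results = retrieve("Does a model change erase miniSD data?", database, index)
for score, article_id, question_sentence in results:
    print(article_id, round(score, 3), question_sentence)
```

Passing the similarity function as a parameter keeps candidate retrieval (Step 3) separate from the similarity judgment (Step 4), so a different index can be substituted without changing the rest of the flow.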
2.2 Features of a Targeted Bulletin Board
Because articles contributed by consumers are
published as they are, and their sentences are often
not well-formed, Web bulletin boards have certain
special characteristics. The description characteristics
of question sentences are shown below.
Description characteristic 1 Though questions may
have the same content, the sequence of their words
may be different.
Description characteristic 2 Though the questions
may have the same content, their length may be dif-
ferent.
Ex. 1 How convenient is it to write mail with the
SH901ic?
Ex. 2 On the SH901ic, is it possible to input T9
input and the bell when a character is input?
For “description characteristic 1,” similarity must be
judged independently of the sequence of the words.
In the examples for “description characteristic 2,”
both questions concern convenience when writing
mail. The first question (Ex. 1) is a vague one, while
Ex. 2 asks specifically about character input.
Similarity must therefore be judged independently of
both the sequence of words and the length of the
sentence (number of words). Here we judge similarity
by the modified cosine similarity index. Next, we
describe the problem that arises when judging the
similarity between the retrieval query sentence and
the article question sentence.
2.3 Judgment by Cosine Similarity Index
Figure 2 shows the definition used to calculate the
conventional cosine similarity index. The cosine
similarity index treats the retrieval query sentence
and the article question sentence each as a set of
words and maps each to a word vector of n dimensions.
The similarity index between the sentences is
calculated from the angle between the two word
vectors.
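
For reference, the cosine of the angle between two n-dimensional word vectors is commonly written as below. The symbols are assumptions introduced here (q_i and d_i denote the weight of the i-th word in the retrieval query sentence and in the article question sentence, respectively); the notation in Figure 2 may differ.

```latex
\cos(\mathbf{q}, \mathbf{d})
  = \frac{\sum_{i=1}^{n} q_i \, d_i}
         {\sqrt{\sum_{i=1}^{n} q_i^{2}} \; \sqrt{\sum_{i=1}^{n} d_i^{2}}}
```

Because the sums run over word dimensions rather than word positions, the value does not depend on the order in which words appear in either sentence.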
The cosine similarity index tends to be high when
many words are common to both the retrieval query
sentence and the article question sentence. In
practice, however, few words are shared, because
sentences on Web bulletin boards are generally short.
As a result, the calculated cosine similarity index
is mostly low, and the judgment is often erroneous.
In the example of Figure 2, it is difficult to judge
from this cosine similarity index whether the article
question sentence is similar.
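
The effect can be illustrated with a small sketch of a term-frequency cosine similarity. The function name `cosine_similarity` and the word lists are hypothetical (they are not the sentences of Figure 2), and raw term frequency is assumed as the word weight.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between the term-frequency vectors of two word lists."""
    tf_a, tf_b = Counter(words_a), Counter(words_b)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(tf_a[w] * tf_b[w] for w in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sentence has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical short sentences that share a single word.
query_words = ["model", "change", "minisd", "data"]
article_words = ["minisd", "card", "photo", "transfer"]

score = cosine_similarity(query_words, article_words)
# Reordering the words leaves the score unchanged.
reordered = cosine_similarity(list(reversed(query_words)), article_words)
print(score, reordered)
```

With only four words per sentence, each shared or missing word moves the score by a large step, which is one way to see why short bulletin-board sentences tend to produce low and unstable values.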
In Figure 2, “kisyu,” “henkou,” and “miniSD” in the
Japanese article question sentence are words shared
with the retrieval query sentence. However, it is
difficult to judge similarity from the cosine
similarity index when only the term frequency of
words is considered.
Therefore, we introduce not only the words but also
the structure of the sentence for this similarity cal-