A RETRIEVAL METHOD OF SIMILAR QUESTION ARTICLES

FROM WEB BULLETIN BOARD

Yohei Sakurai, Soichiro Miyazaki, Masanori Akiyoshi

Osaka University

Yamadaoka 2-1, Suita, Osaka 565-0871, Japan

Keywords:

Web bulletin board, Natural sentence input, Question articles, Cosine similarity index.

Abstract:

This paper proposes a method for retrieving similar question articles from Web bulletin boards, which basically

uses the cosine similarity index derived from a user’s query sentence and article question sentences. Since these

sentences are mostly short, it is difﬁcult to distinguish whether article question sentences are similar to a user’s

query sentence or not simply by applying the conventional cosine similarity index. In an attempt to overcome

this problem, our method modiﬁes the elements of the word vectors used in the cosine similarity index, which

are derived from a sentence structure from the viewpoints of common words and non-common words between

a user’s query sentence and article question sentences. Experimental results indicate that our proposed method

is effective.

1 INTRODUCTION

Web bulletin boards are used in many domains. Chat-style sentences are published on them as they are, and the contributed articles often include useful information such as consumers’ genuine opinions. However, users who want this information cannot at present inspect the relevant articles efficiently, because there are simply too many articles on Web bulletin boards.

Users generally retrieve articles by keyword or by narrowing down a search, and then judge whether an article includes the necessary information by reading its title or first sentence. They must sometimes inspect articles that contain irrelevant information, too. As a result, it takes a lot of time for them to judge whether the articles are indeed useful.

In this research we propose a method for retrieving question articles similar to a query given as natural sentence input, in order to improve retrieval accuracy. Recently, as information retrieval technology (Ohtsuka, 2004; Kishida, 1997; Mochihashi, 2004) has improved, various methods for judging the similarity of sentences have been developed. Question-answering systems (Sasaki, 2002; Tamura, 2005), which retrieve the answer to an input question from documents, have also been examined as a natural style of sentence input. However, these studies only deal with formal sentences like those in newspapers.

A question article on a Web bulletin board takes a form in which a question is asked in the first contribution and the answers are given in the following contributions. Similarity judgments by conventional methods that only match words are, however, insufficient because the retrieval query sentence and the article question sentence are short.

Consequently, in this research we consider not only the matching of words in the retrieval query sentence and the article question sentence, but also the structure of the sentences. As a result, the accuracy of similarity judgment is improved.

In Section 2, we describe the problems with the retrieval method using natural sentence input. In Section 3, we describe a question sentence retrieval method that applies co-occurrence information about non-common words, together with concrete procedures, to solve these problems. In Section 4, the proposed method and a conventional method are compared by retrieving practical data in order to evaluate the effectiveness of the proposed retrieval method. Section 5 provides a summary.

2 ARTICLE RETRIEVAL FROM A

WEB BULLETIN BOARD

2.1 Question Article Retrieval

A question sentence might be included in the first contribution of an article on a Web bulletin board, and users judge whether the article includes the required information by reading it. In this paper we propose a retrieval method that judges similarity to the user’s input retrieval query sentence by using the question sentence in the first contribution. Figure 1 shows an outline of the proposed retrieval method.

Sakurai Y., Miyazaki S. and Akiyoshi M. (2006).
A RETRIEVAL METHOD OF SIMILAR QUESTION ARTICLES FROM WEB BULLETIN BOARD.
In Proceedings of the First International Conference on Software and Data Technologies, pages 238-243
DOI: 10.5220/0001315202380243
SciTePress

Figure 1: Overview of question article retrieval by natural

sentence input.

The question article database of a Web bulletin board is composed of pairs, each comprising a question article and its article question sentence, which is a set of sentences extracted from the first contribution (Skowron, 2005; Li, 2002). The flow of the retrieval procedure is shown below.

Step 1 A user inputs the retrieval query sentence.

Step 2 The input retrieval query sentence is morphologically analyzed, and the nouns are extracted.

Step 3 Candidate articles are retrieved from the question article database by using the set of extracted nouns.

Step 4 It is judged whether the question sentence in

the candidate article is similar to the retrieval query

sentence.

Step 5 As a result of the similarity judgment, users

receive the article question sentence and question

articles in order of their similarity to the question,

with the most similar at the top.

The system proposed here extracts the article ques-

tion sentence from the ﬁrst contribution as prepro-

cessing, then judges whether the retrieval query sen-

tence is similar to the article question sentences.
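The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `extract_nouns` is a hypothetical stand-in that treats every whitespace-separated token as a noun (the paper applies morphological analysis to Japanese text), candidate articles are assumed to be those sharing at least one noun with the query, and `overlap` is only a placeholder for the similarity judgment of Step 4.

```python
def extract_nouns(sentence):
    # Hypothetical stand-in for morphological analysis (Step 2):
    # every whitespace-separated token is treated as a noun.
    return set(sentence.lower().split())


def overlap(query_nouns, article_nouns):
    # Placeholder similarity (Jaccard overlap), used only to make the
    # sketch runnable; the paper uses a modified cosine similarity index.
    return len(query_nouns & article_nouns) / len(query_nouns | article_nouns)


def retrieve(query, database, similarity=overlap):
    """database: list of (article question sentence, question article) pairs."""
    query_nouns = extract_nouns(query)                    # Steps 1 and 2
    ranked = []
    for question, article in database:
        nouns = extract_nouns(question)
        if query_nouns & nouns:                           # Step 3: candidate articles
            score = similarity(query_nouns, nouns)        # Step 4: similarity judgment
            ranked.append((score, question, article))
    ranked.sort(key=lambda item: item[0], reverse=True)   # Step 5: most similar first
    return ranked
```

Passing the similarity function as a parameter keeps the candidate retrieval of Step 3 separate from the similarity judgment of Step 4, which is the part the paper modifies.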

2.2 Features of a Targeted Bulletin

Board

Since articles contributed by consumers are published as they are, and their sentences are not well formed, Web bulletin boards have certain special characteristics. The description characteristics of question sentences are shown below.

Description characteristic 1 Though questions may

have the same content, the sequence of their words

may be different.

Description characteristic 2 Though the questions

may have the same content, their length may be dif-

ferent.

Ex. 1 How convenient is it to write mail with the

SH901ic?

Ex. 2 On the SH901ic, is it possible to input T9

input and the bell when a character is input ?

In “description characteristic 1,” it is necessary to

judge similarity without depending on the sequence

of the words. In the examples of “description charac-

teristic 2,” both questions are about convenience when

mail is written. The ﬁrst question (Ex. 1) is a vague

one, while Ex. 2 asks a question about character in-

put. It is necessary to judge the sentence without de-

pending on the sequence of words and the length of

the sentence (number of words). Here we judge similarity by a modified cosine similarity index. Next, the problem that arises in similarity judgments between the retrieval query sentence and the article question sentence is described.

2.3 Judgment by Cosine Similarity

Index

Figure 2 shows the definition of the conventional cosine similarity index. The cosine similarity index regards the retrieval query sentence and the article question sentence as sets of words and maps each of them to a word vector of n dimensions. The similarity index between the sentences is calculated from the angle between the two word vectors.

The cosine similarity index tends to be high when

there are a lot of words common to both the re-

trieval query sentence and the article question sen-

tence. However, common words are in fact few be-

cause the sentence length is generally short on Web

bulletin boards. Therefore, a mostly low cosine simi-

larity index is calculated, and judgement is often erro-

neous. In the example of Fig 2, it is difﬁcult to judge

whether the article question sentence is similar by this

cosine similarity index.

In Fig 2, “kisyu,” “henkou,” and “miniSD” in a Japanese article question sentence are words in common with the retrieval query sentence. However, it is difficult to judge similarity from the cosine similarity index when only the term frequency of words is considered.
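For reference, the conventional cosine similarity index over term-frequency word vectors can be written as the short sketch below. The function name and the list-of-words input format are illustrative choices, not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(query_words, article_words):
    """Conventional cosine similarity between two bags of words."""
    q = Counter(query_words)      # term frequencies of the query sentence
    a = Counter(article_words)    # term frequencies of the article question sentence
    # Only words present in both sentences contribute to the dot product.
    dot = sum(q[w] * a[w] for w in q.keys() & a.keys())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in a.values())))
    return dot / norm if norm else 0.0
```

Because only common words contribute to the numerator, two short sentences that share few words receive a low index even when they ask about the same thing.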

Therefore, we introduce not only the words but also the structure of the sentence into this similarity calculation. The values of the word vectors of the retrieval query sentence and the article question sentence are modified accordingly, improving similarity judgment by the cosine similarity index.

Figure 2: Example of difficulty in similarity judgment that uses cosine similarity index.

3 QUESTION ARTICLE

RETRIEVAL METHOD BY

CO-OCCURRENCE

INFORMATION

3.1 Approach for the Problem of

Few Matching Words

Here, we propose similarity judgment based on the

modiﬁed cosine similarity index. Erroneous similar-

ity judgement is caused by few matching words be-

tween the retrieval query sentence and article ques-

tion sentence. The value of the word vector needs to

be modiﬁed in consideration of structural similarity

among sentences.

In similarity judgment of the retrieval query sentence and the article question sentence, it is necessary to distinguish among article question sentences that share the same common words with the query. In Fig 3, both article question sentences have the same common words, yet article question sentence 1 is similar to the retrieval query sentence in Japanese, whereas sentence 2 is not. Since it is difficult to judge similarity from the common words alone, the system pays attention to non-common words.

Figure 3: Example of similarity judgment of article question sentence in case common words are same.

Figure 4 shows an example of modifying the value of the vector in consideration of the sentence structure. The upper part of Fig 4 shows the structural similarity of the sentences as judged by a person. Phrases with similar structure often have the same meaning, so in this study we use the partial structure of sentences composed of pairs of a common word and a non-common word of the retrieval query sentence and the article question sentence. In the lower part of Fig 4, it is judged that the sentence structure “meka” and “kisyu” of the retrieval query sentence is similar to the sentence structure “w11h” and “kisyu” in the article question sentence. The similarity index is improved by modifying the vector element of the non-common words included in the structure of the retrieval query sentence to α.

Figure 4: Example of modifying word vector based on sen-

tence structure.

Conversely, there are words to which neither word matching nor structural similarity of sentences applies. These words are considered to suppress the similarity of sentences. As shown in Fig 4, the non-common word “ikou” in the article question sentence is not included in any sentence structure composed of common words and non-common words. Such non-common words in the article question sentence are assumed not to be related to words in the retrieval query sentence. In this case, the similarity index is reduced by modifying the corresponding word vector element of the retrieval query sentence to β (< 0).

ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES

By using the structural feature, some elements of

the word vector are modiﬁed to augment or suppress

the similarity. Then, we have to solve the following

problems of modifying the word vector.

• A comparison method for sentences composed of

common words and non-common words of the re-

trieval query sentence and the article question sen-

tence

• A calculation method for the value when the word

vector is modiﬁed

3.2 Comparison of the Structures of

Sentences that Consist of

Common Words and

Non-Common Words

Here we consider the structures of sentences composed of common words and non-common words. The common and non-common words are all nouns.

The structure of a sentence is decided by using the dependency analysis tool “Cabocha.” The sentence structure that the person used for the similarity judgment in Fig 4 can be determined with a dependency analysis tool. When dependency analysis is used, the structure in a sentence that includes common and non-common words is defined as follows.

• A clause including common words qualiﬁes a

clause with non-common words.

• A clause including non-common words qualiﬁes a

clause with common words.

The sample tree structure output, which is the dependency analysis result of the retrieval query sentence, is shown in Fig 5. The sentence structures determined from the above-mentioned definition are the two pairs “meka” and “kisyu,” and “miniSD” and “kopi.”

Figure 5: Dependency analysis result of retrieval query.
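The definition above can be illustrated with a small sketch that extracts (common word, non-common word) pairs from dependency relations between clauses. The input format, a list of (nouns in the qualifying clause, nouns in the qualified clause) tuples, is a hypothetical simplification of a dependency analyzer's output, and the edges in the usage note are assumed for illustration rather than taken from Fig 5.

```python
def structure_pairs(dependencies, common_words):
    """Return the set of (common word, non-common word) structure pairs.

    dependencies: iterable of (modifier_nouns, head_nouns), one tuple per
    dependency relation between two clauses (hypothetical input format).
    """
    pairs = set()
    for modifier_nouns, head_nouns in dependencies:
        for m in modifier_nouns:
            for h in head_nouns:
                # Keep the relation only if exactly one of the two nouns is a
                # common word; either clause may be the qualifying one.
                if (m in common_words) != (h in common_words):
                    pair = (m, h) if m in common_words else (h, m)
                    pairs.add(pair)
    return pairs
```

For example, with the assumed edges `[(["meka"], ["kisyu"]), (["miniSD"], ["kopi"])]` and common words `{"kisyu", "henkou", "miniSD"}`, the function yields the two pairs named in the text, “kisyu” with “meka” and “miniSD” with “kopi”.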

http://chasen.org/~taku/software/cabocha/

The structure of the retrieval query sentence is decided as shown above. However, dependency analysis is not applicable to the article question sentence because its style is not well formed. Therefore, it is impossible to determine the sentence structure of the article question sentence with a dependency analysis tool.

Common words are, by definition, included in the article question sentence. If a word related to a non-common word of the retrieval query sentence is also included in the article question sentence, it is assumed that the structure of the retrieval query sentence is included in the article question sentence. The structural similarity of the sentences is therefore compared using the relation between words given by their co-occurrence. Two words co-occur when they exist simultaneously in the same sentence. The co-occurrence dictionary for the words is automatically produced beforehand from all the article question sentences.

Figure 6 shows a comparison of the structure of the retrieval query sentence and the article question sentence. In this figure, the structure “meka” and “kisyu” of the retrieval query sentence is decided by the dependency analysis, and the partial structure of the sentences is judged to be similar because the non-common word “w11h” in the article question sentence co-occurs with “meka.” Moreover, two or more non-common words in the article question sentence, such as “ongaku,” “mubi,” and “tensou,” may co-occur with a non-common word of the retrieval query sentence, here “kopi.” In this case, the three partial structures “miniSD” and “ongaku,” “miniSD” and “mubi,” and “miniSD” and “tensou” in the article question sentence are assumed to be similar to “miniSD” and “kopi” in the retrieval query sentence.

Figure 6: Comparison of sentence structure including sen-

tence of common words and non-common words by co-

occurrence of word.
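The comparison described above can be sketched in two parts: building a co-occurrence dictionary from tokenized sentences, and using it to find the partial structures of an article question sentence that are assumed similar to those of the query. Function names and data layouts are illustrative assumptions, and the tiny corpus in the usage note is invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_dictionary(sentences):
    """Map each word to the set of words that share a sentence with it."""
    dictionary = defaultdict(set)
    for words in sentences:
        for a, b in combinations(set(words), 2):
            dictionary[a].add(b)
            dictionary[b].add(a)
    return dictionary


def similar_structures(query_pairs, article_words, common_words, dictionary):
    """For each (common, non_common) query pair, list the article pairs
    (common, article non-common word) assumed to be similar to it."""
    article_non_common = set(article_words) - set(common_words)
    matches = {}
    for common, non_common in query_pairs:
        if common not in article_words:
            continue
        # An article non-common word matches when it co-occurs with the
        # query's non-common word somewhere in the corpus.
        partners = sorted(w for w in article_non_common
                          if w in dictionary.get(non_common, ()))
        if partners:
            matches[(common, non_common)] = [(common, w) for w in partners]
    return matches
```

With a corpus in which “meka” co-occurs with “w11h” and “kopi” co-occurs with “ongaku,” “mubi,” and “tensou,” the query pair “miniSD” and “kopi” is matched to three article pairs, as in the example of Fig 6, while a word such as “ikou” that co-occurs with neither query non-common word is left unmatched.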

If a non-common word of the article question sentence does not co-occur with any non-common word of the retrieval query sentence in the comparison of the sentence structure, we consider it a word that suppresses the similarity of the retrieval query sentence and the article question sentence.

3.3 Modiﬁcation of the Word Vector

3.3.1 Modiﬁcation of the Word Vector by the

Structural Similarity of Sentences

The word vector of the article question sentence is modified to be close to the word vector of the retrieval query sentence when the sentence structure of the retrieval query sentence is similar to the structure of the article question sentence.

In our research, the co-occurrence of a word is used

to ﬁnd a sentence structure that is similar to the struc-

ture of the retrieval query sentence in an article ques-

tion sentence. It is thought that the higher the level

of co-occurrence, the more similar the sentence struc-

ture. Therefore, the co-occurrence index is used as an

index when the value of the vector is modified. The co-occurrence index $C_{i,j}$ of a non-common word $W_i$ of the retrieval query sentence and a non-common word $W_j$ in the article question sentence is defined by the following formula.

$$C_{i,j} = \frac{S_{i,j}}{S_{all}}$$

($S_{i,j}$ = number of sentences in which word $W_i$ and word $W_j$ co-occur, $S_{all}$ = number of all sentences)
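As a small worked illustration, the co-occurrence index is the fraction of all sentences in which both words appear. The function below is a sketch; representing sentences as lists of words is an assumption made for the example.

```python
def cooccurrence_index(word_i, word_j, sentences):
    """Fraction of sentences that contain both word_i and word_j."""
    if not sentences:
        return 0.0
    both = sum(1 for words in sentences if word_i in words and word_j in words)
    return both / len(sentences)
```

For instance, if “meka” and “w11h” appear together in one of four sentences, the index is 0.25.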

Figure 6 shows that two or more non-common words in an article question sentence might co-occur with a non-common word of the retrieval query sentence. Normalization is therefore necessary when the vector is modified with the words’ co-occurrence, and the value of the word vector is normalized so that its maximum is 1. The modification expression for the word vector is defined as follows, where $\alpha_i$ is a modification value. This value is used when the vector element of word $W_i$ in the retrieval query sentence is corrected. $X_i$ is the average of the co-occurrence index of non-common word $W_i$ in the retrieval query sentence and the non-common words in the article question sentences. The bigger the value of $X_i$, the higher the value of $\alpha_i$.

$$\alpha_i = \frac{1}{1 + e^{0.5 - X_i}}, \qquad X_i = \frac{\sum_{j} C_{i,j}}{Q \cdot n}$$

$Q$: number of article question sentences
$n$: number of non-common words of article question sentences that co-occur with $W_i$
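The modification value can be sketched numerically as below. The code assumes a logistic form, α = 1 / (1 + e^(0.5 − X)), with X taken as the sum of the co-occurrence indices divided by Q · n; both the exact form and the function signature are assumptions made for illustration.

```python
import math


def modification_value_alpha(cooccurrence_indices, q):
    """Assumed logistic modification value for one query non-common word.

    cooccurrence_indices: co-occurrence index of the query word with each
    co-occurring non-common word of the article question sentences.
    q: number of article question sentences.
    """
    n = len(cooccurrence_indices)
    if n == 0 or q <= 0:
        raise ValueError("need at least one co-occurring word and q > 0")
    x = sum(cooccurrence_indices) / (q * n)   # averaged co-occurrence index
    return 1.0 / (1.0 + math.exp(0.5 - x))    # increases with x, stays below 1
```

Under this assumed form the value stays strictly between 0 and 1 and grows with the averaged co-occurrence index, which matches the statement that a bigger X gives a higher α.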

3.3.2 Modiﬁcation of the Word Vector by a

Word That Suppresses the Similarity of

Sentences

A non-common word of the article question sentence that does not co-occur with the non-common words of the retrieval query sentence is considered to suppress the similarity of the sentences. The value of the word vector of the retrieval query sentence is modified so that it diverges from the word vector of the article question sentence. The more common words there are in the retrieval query sentence and the article question sentence, the more similar both sentences tend to be. Thus, it is thought that the smaller the number of common words is, the higher the possibility that the retrieval query sentence is not similar to the article question sentence. The expression that modifies the word vector of the retrieval query sentence is defined as follows, where $\beta_j$ is the modification value of the vector element corresponding to non-common word $W_j$ of the article question sentence.

$$\beta_j = -\frac{1}{\text{Number of common words}}$$
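The combined effect of the two modifications on the cosine similarity index can be sketched as follows. This is an interpretation under explicit assumptions, not the authors' implementation: word vectors are binary, a structurally similar query non-common word receives the value α in the article vector, a suppressing article non-common word receives β = −1 / (number of common words) in the query vector, and the α values and the list of suppressing words are supplied by the caller.

```python
import math


def modified_cosine(query_words, article_words, alpha, suppressing_words):
    """Cosine similarity after the assumed alpha and beta modifications.

    alpha: dict mapping a query non-common word to its modification value.
    suppressing_words: article non-common words without a similar structure.
    """
    query, article = set(query_words), set(article_words)
    common = query & article
    if not common:
        return 0.0
    beta = -1.0 / len(common)                 # assumed suppressing value (< 0)
    q_vec = {w: 1.0 for w in query}           # binary word vectors (assumption)
    a_vec = {w: 1.0 for w in article}
    for word, value in alpha.items():
        if word in query and word not in article:
            a_vec[word] = value               # pull the vectors together
    for word in suppressing_words:
        if word in article and word not in query:
            q_vec[word] = beta                # push the vectors apart
    dot = sum(v * a_vec.get(w, 0.0) for w, v in q_vec.items())
    norm = (math.sqrt(sum(v * v for v in q_vec.values()))
            * math.sqrt(sum(v * v for v in a_vec.values())))
    return dot / norm
```

With empty `alpha` and `suppressing_words` the function reduces to the conventional index on binary vectors, so the effect of each modification can be compared against that baseline.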

4 EVALUATION EXPERIMENT

To confirm the effectiveness of the proposed method, we ran each retrieval query sentence in turn against 1,162 article question sentences collected from Web bulletin boards. All the retrieval results were sorted by the cosine similarity index. The data for correct question article sentences were constructed manually in advance. Since in this paper we aim to improve a user’s retrieval accuracy, we performed the evaluation according to the number of correct question article sentences in the top 10 retrieval results and the distribution of the correct question article sentences in the top 20.
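The two evaluation measures can be computed from a ranked result list and a manually prepared set of correct sentences, as in this sketch. The function names and the use of sentence identifiers are illustrative assumptions.

```python
def correct_in_top_k(ranked, correct, k=10):
    """Count correct question article sentences among the first k results."""
    return sum(1 for sentence in ranked[:k] if sentence in correct)


def ranks_of_correct(ranked, correct, k=20):
    """Return the 1-based ranks of correct sentences within the first k results."""
    return [rank for rank, sentence in enumerate(ranked[:k], start=1)
            if sentence in correct]
```

Comparing the output of `ranks_of_correct` for the conventional and the proposed ranking of the same query shows how the correct sentences move within the top 20.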

4.1 Comparison of Top 10 Correct

Question Article Sentences

Table 1 shows the number of correct question article sentences in the top 10 retrieval results. For every query, the proposed method placed at least as many correct question article sentences in the top 10 as the conventional cosine similarity index, and more for five of the seven queries, thus confirming the effectiveness of this method. We obtained the same result as the conventional cosine similarity index in query (7) because there were few correct question article sentences.

Furthermore, the number of correct question article sentences in the top 10 was small relative to the total number of correct question article sentences. It is preferable that correct question article sentences are all ranked highly. This is a problem to solve in the near future.

Table 1: Number of correct question article sentences in the top 10 retrieval results.

Query number          (1)  (2)  (3)  (4)  (5)  (6)  (7)
Correct sentences      10   24    7   10   15   37    2
Proposed method         4    2    3    1    2    5    1
Conventional method     1    0    1    1    1    3    1

4.2 Comparison of Emergence

Distribution of Correct Question

Article Sentences in the Top 20

The emergence distribution of the correct question ar-

ticle sentences for the cosine similarity index in the

top 20 is shown in the following ﬁgures 7 and 8. The

arrows in the ﬁgures indicate a change in the correct

question article sentences’ order in the top 20. The

order of all the correct question article sentences un-

der the conventional cosine similarity index has im-

proved in Fig 7, while in Fig 8, the correct question

article sentences in the top 20 and under the conven-

tional cosine similarity index move within the top 20.

The cosine similarity index thus improves, and the

correct question article sentences move to a higher

rank. The effectiveness of this method is conﬁrmed

from both results. As a result, we also found that incorrect question article sentences in subordinate positions occasionally exceeded the similarity index of high-ranking correct question article sentences when using the conventional cosine similarity index. Our future work is to more effectively suppress the similarity index of the question sentences of incorrect articles.

Figure 7: Correct question article sentences’ distribution in the top 20 of query (1).

Figure 8: Correct question article sentences’ distribution in the top 20 of query (6).

5 CONCLUSION

In this research we proposed a retrieval method that presents articles containing question sentences similar to the retrieval query sentences input by bulletin board users. In similarity judgments between the article question sentence in the first contribution of the question article and the retrieval query sentence, the method considers not only word matches but also the structure of the sentences. The values of the word vectors of the retrieval query sentence and the article question sentence are modified, making possible a similarity judgment by the cosine similarity index.

The proposed technique was applied to 1,162 question sentences from articles on Web bulletin boards, with manually entered retrieval queries. The results were evaluated using the number of correct question article sentences in the top 10 and the distribution of correct question article sentences in the top 20 retrieval results sorted by the cosine similarity index. The results confirmed that the proposed method performs at least as well as the conventional one for all retrieval query sentences and better for most of them.

REFERENCES

Ohtsuka, T. (2004). An Evaluation Method of Web Search Engines based on User’s Sense. NTCIR Workshop 4 Meeting Working Notes, Supplement Volume 1: WEB Task.

Mochihashi, D. (2004). Learning Nonstructural Distance Metric by Minimum Cluster Distortions. EMNLP-2004, pp. 341-348.

Kishida, K. (1997). International publication patterns in social sciences: a quantitative analysis of the IBSS file. Scientometrics, Vol. 40, No. 2, pp. 277-298.

Sasaki, Y. (2002). NTT’s QA Systems for NTCIR QAC-1. Working Notes, NTCIR Workshop 3, Tokyo.

Tamura, T. (2005). Classification of Multiple-Sentence Questions. In Proceedings of the 2nd IJCNLP-05.

Skowron, M. (2005). Effectiveness of Combined Features for Machine Learning Based Question Classification. Special Issue on Question Answering and Text Summarization, Journal of Natural Language Processing, Vol. 6, pp. 63-83.

Li, X. (2002). Learning Question Classifiers. COLING 2002, pp. 556-562.
