WRITING SUPPORT SYSTEM BASED ON A CONTEXT SENSITIVE VARIANT DICTIONARY

Aya Nishikawa, Ryo Nishimura, Yasuhiko Watanabe, Yoshihiro Okada

Ryukoku University, Dep. of Media Informatics, Seta, Otsu, Shiga, Japan

Masaki Murata

NICT, Seika-cho, Soraku-gun, Kyoto, Japan

Keywords:

Writing support system, Notational variant, Context suitable variant, Context sensitive dictionary.

Abstract:

In Japanese, it is difficult to learn which variant is suitable for various contexts in official, business, and
technical documents because there are a large number of notational variants of Japanese words and Japanese
writing rules have many exceptions. From the viewpoint of information retrieval, a considerable number of
studies have been made on notational variants; however, previous Japanese writing support systems were not
sufficiently concerned with them. To solve this problem, we developed a writing support system which detects
notational variants unsuitable for the contexts in students’ reports and shows suitable ones to the students.
This system is based on the idea that context suitable variants are used dominantly in the context of official,
business, and technical documents. In this study, we first show the diversity of notational variants of Japanese
words and how to develop a context sensitive variant dictionary by which our system determines which variant
is suitable for the contexts in official, business, and technical documents. Finally, we report a controlled
experiment that shows the effectiveness of our system.

1 INTRODUCTION

In English, there are few words which are spelled in

several different ways, such as, color and colour. In

contrast, in Japanese, there are a large number of no-

tational variants of words. This is because Japanese

words are written in three kinds of characters:

• kanji (Chinese) characters,

• hiragana letters, and

• katakana letters.

For example, kagi [key] is written in three ways, as

shown in Figure 1. Basic rules of Japanese writing

are announced by the Cabinet, and Japanese students

study them in school for many years. However, it is

difﬁcult to learn the rules because they have many

exceptions. In fact, we often ﬁnd the confusion of

variant selection in Japanese university students’ re-

ports, including unsuitable notational variants for of-

ﬁcial, business, and technical documents. As a result,

it is important for students to learn which notational

variant is suitable for official, business, and technical documents. To solve this problem,
Nishikawa et al. (Nishikawa 09a) developed a writing support system which detects unsuitable
notational variants in students’ reports and shows suitable ones to the students. This system is
based on the assumption that suitable variants are used dominantly in official, business, and
technical documents. If the assumption holds, unsuitable notational variants can be detected by
checking whether they are used dominantly in official, business, and technical documents. We think
the system of (Nishikawa 09a) is promising; however, it has a problem: the system was based on a
context free variant dictionary. As a result, it is possible that the system lets users select
variants which are most frequent but unsuitable for the contexts. Take kagi [key] for example. As
shown in Figure 2, in newspaper articles, kagi is dominantly written in katakana letters. However,
as shown in Figure 3, kagi is dominantly written in a kanji character when it is used with kakeru
[lock]. As a result, it is important that writing support systems show the variant information of
Figure 3, not Figure 2, when kagi [key] and kakeru [lock] are used together. To solve this problem,
we developed a writing support system based on a context sensitive variant dictionary.

Nishikawa A., Nishimura R., Watanabe Y., Okada Y. and Murata M. (2010).
WRITING SUPPORT SYSTEM BASED ON A CONTEXT SENSITIVE VARIANT DICTIONARY.
In Proceedings of the 2nd International Conference on Computer Supported Education, pages 251-256
DOI: 10.5220/0002789702510256
SciTePress

Figure 1: Notational variants of “kagi [key]”.

              hiragana  katakana  kanji
kagi [key]           1       279    198

Figure 2: The frequencies of notational variants of noun “kagi [key]” in the newspaper articles
[Mainichi Newspaper (January 2006 – June 2006)].

              hiragana  katakana  kanji
kagi [key]           0        10     64

Figure 3: The frequencies of notational variants of noun “kagi [key]” in the newspaper articles
[Mainichi Newspaper (2005 – 2007)] in the case that the word is used with “kakeru [lock]”.
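The kagi example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frequencies are copied from Figures 2 and 3, while the dictionary layout and the function names (variant_frequencies, most_frequent) are hypothetical and are not taken from the actual system.

```python
# Frequencies copied from Figure 2 (context free) and Figure 3 (used with "kakeru").
# The data layout and all names below are hypothetical, for illustration only.
CONTEXT_FREE = {
    "kagi": {"hiragana": 1, "katakana": 279, "kanji": 198},
}
CONTEXT_SENSITIVE = {
    ("kagi", "kakeru"): {"hiragana": 0, "katakana": 10, "kanji": 64},
}


def variant_frequencies(word, co_word=None):
    """Return the context sensitive table if one is registered, else the context free one."""
    if co_word is not None and (word, co_word) in CONTEXT_SENSITIVE:
        return CONTEXT_SENSITIVE[(word, co_word)]
    return CONTEXT_FREE[word]


def most_frequent(freqs):
    """Return the variant with the highest frequency in a frequency table."""
    return max(freqs, key=freqs.get)


print(most_frequent(variant_frequencies("kagi")))            # katakana
print(most_frequent(variant_frequencies("kagi", "kakeru")))  # kanji
```

A context free lookup alone would recommend the katakana variant even when kagi is used with kakeru, which is the problem described above.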

Our system shows the frequencies of notational

variants to students because they are objective and

concrete measures. As a result, the system gives stu-

dents chances to consider the reasons why they used

variants unsuitable for the contexts. There are two reasons why our system does not automatically
replace unsuitable variants with context suitable ones:

• It is not appropriate to restrict the use of various variants because this variety is one of the
sources of the richness of Japanese expressions.

• It is important for students to consider the reasons why they used variants unsuitable for the
contexts and to choose context suitable ones themselves, especially in educational institutions.

From the viewpoint of information retrieval, a considerable number of studies have been made on
notational variants (Kubomura 03) (Kouda 06) (Bamba 08); however, spell checkers in Japanese word
processors, such as Microsoft Word 2007, and previous Japanese writing support systems
(Shimomura 92) (Araki 93) (Murata 01) were not sufficiently concerned with notational variants.
This is because their main purpose was misspelling detection. Students often use variants which
are not misspellings but are unsuitable for the contexts in official, business, or technical
documents. In contrast, Yokoyama dealt with variants of kanji characters (Yokoyama 06), but not
with variants of words. Furthermore, he did not consider this variant problem from the viewpoint
of contexts.

names of plants            hiragana  katakana  kanji
sakura [cherry blossom]         184        39    736
bara [rose]                       0       217      0
himawari [sunflower]             42         8      0
tsubaki [camellia]                9        25     83
tsutsuji [azalea]                 5        15      0
ringo [apple]                     8        71     10
mikan [orange]                   66        37      2

Figure 4: The frequencies of notational variants of nouns (plant names) in the newspaper articles
[Mainichi Newspaper (January 2006 – June 2006)].

In this study, we ﬁrst show the diversity of no-

tational variants of Japanese words and how to de-

velop a context sensitive variant dictionary by which

our system determines which variant is suitable for

the context in official, business, and technical documents. Finally, we report a controlled
experiment that shows the effectiveness of our system.

2 NOTATIONAL VARIANTS OF

JAPANESE WORDS

In this section, we show the diversity and exceptions

of notational variants of Japanese words.

First, we show the diversity and exceptions of no-

tational variants of Japanese nouns. We have shown

an example of notational variants of Japanese nouns,

kagi [key], in section 1. Furthermore, Figure 4 shows

the frequencies of notational variants of plant names

in the Mainichi newspaper articles (January 2006 –

June 2006). As shown in Figure 4, dominant ways

of writing plant names are inconsistent. In this study,

we will use the term dominant variant of a word to

refer to the most frequent variant of the word, as

(Nishikawa 09a) did. One reason for this inconsistency is that writers choose variants according
to the contexts.
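The definition of a dominant variant can be illustrated with the plant names of Figure 4. The following is a minimal sketch: the frequencies are copied from Figure 4, and the names PLANT_FREQS and dominant_variant are hypothetical.

```python
# Frequencies copied from Figure 4 (Mainichi Newspaper, January 2006 to June 2006).
# PLANT_FREQS and dominant_variant are hypothetical names used only for this sketch.
PLANT_FREQS = {
    "sakura":   {"hiragana": 184, "katakana": 39,  "kanji": 736},
    "bara":     {"hiragana": 0,   "katakana": 217, "kanji": 0},
    "himawari": {"hiragana": 42,  "katakana": 8,   "kanji": 0},
    "tsubaki":  {"hiragana": 9,   "katakana": 25,  "kanji": 83},
    "tsutsuji": {"hiragana": 5,   "katakana": 15,  "kanji": 0},
    "ringo":    {"hiragana": 8,   "katakana": 71,  "kanji": 10},
    "mikan":    {"hiragana": 66,  "katakana": 37,  "kanji": 2},
}


def dominant_variant(freqs):
    """Return the most frequent variant of a word (its dominant variant)."""
    return max(freqs, key=freqs.get)


for word, freqs in PLANT_FREQS.items():
    print(word, dominant_variant(freqs))
```

Even within this single semantic category, the dominant script differs from word to word: kanji for sakura and tsubaki, katakana for bara, tsutsuji, and ringo, and hiragana for himawari and mikan.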

Next, we show the diversity and exceptions of notational variants of Japanese declinable words.
Figure 5 shows the frequencies of notational variants of hikiageru [pull up] in the Mainichi
newspaper articles. As shown in Figure 5, the kanji+(2) variant is the dominant variant of
hikiageru [pull up]. However, as shown in Figure 6, a nondominant variant of hikiageru, the
kanji+(3) variant, is used dominantly when hikiageru is used with toushi [investment]. Exceptions
of this kind often confuse learners of Japanese, not only foreign students but also Japanese
students. In fact, the authors are often confronted with confused variant selection in their
reports.

                     hiragana  kanji+(1)  kanji+(2)  kanji+(3)
hikiageru [pull up]         1          4        774        146

Figure 5: The frequencies of notational variants of verb “hikiageru [pull up]” in the newspaper
articles [Mainichi Newspaper (January 2006 – June 2006)].

                     hiragana  kanji+(1)  kanji+(2)  kanji+(3)
hikiageru [pull up]         0          0          2         15

Figure 6: The frequencies of notational variants of verb “hikiageru [pull up]” in the newspaper
articles [Mainichi Newspaper (2005 – 2007)] in the case that the word is used with “toushi
[investment]”.

CSEDU 2010 - 2nd International Conference on Computer Supported Education

3 WRITING SUPPORT SYSTEM

BASED ON A CONTEXT

SENSITIVE VARIANT

DICTIONARY

3.1 System Overview

Figure 7: System overview.

Figure 7 shows the overview of our system based on a context sensitive variant dictionary. Figure
8 shows an example of how to use our writing support system. As shown in Figure 7, users can
access the system and send input sentences to it via web browsers by using CGI based HTML forms.
Input sentences are segmented into words by using a Japanese morphological analyzer, JUMAN
(JUMAN 05). Then, the dependency relations between the words are analyzed by using a Japanese
parser, KNP (KNP 05). Finally, by using the context sensitive variant dictionary, the system
checks whether variants are suitable for the contexts in official, business, and technical
documents. When the system detects a variant unsuitable for the context in an input sentence, the
system underlines it and turns it red, shows the frequency information of the variant in the
context, and gives users chances to consider the reasons why they used the variant. In Figure 8
(a), a user gives the following two input sentences to the system.

• kakugi de zeikin wo hikiageru koto ga kettei sareta

[the plan to raise taxes was approved by the Cabi-

net]

• New York no sijyo kara toushi wo hikiageru koto

ni shita [we decided to withdraw our investments

from the New York market]

Then, as shown in Figure 8 (b), the system detects that

variants of hikiageru in both input sentences are un-

suitable for the contexts. In each sentence, the variant

of hikiageru is underlined and turns red, and the con-

text sensitive frequency information of the variant is

shown. In this way, the key to detecting variants un-

suitable for the contexts is a context sensitive variant

dictionary. In section 3.2, we show how to develop a

context sensitive variant dictionary.
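The detection step described above can be sketched as follows. This is a simplified illustration, not the actual implementation: the output of JUMAN and KNP is replaced by hand-written Token objects, the variants typed by the user are assumed for the sake of the example, and all names (Token, lookup, check) are hypothetical. The frequencies are copied from Figures 5 and 6.

```python
from dataclasses import dataclass

# Frequencies copied from Figure 5 (context free) and Figure 6 (used with "toushi").
CONTEXT_FREE = {
    "hikiageru": {"hiragana": 1, "kanji+(1)": 4, "kanji+(2)": 774, "kanji+(3)": 146},
}
CONTEXT_SENSITIVE = {
    ("hikiageru", "toushi"): {"hiragana": 0, "kanji+(1)": 0, "kanji+(2)": 2, "kanji+(3)": 15},
}


@dataclass
class Token:
    """Hypothetical stand-in for one analyzed word of an input sentence."""
    word: str       # label shared by all notational variants of the word
    variant: str    # variant actually written by the user (assumed here)
    related: tuple  # words linked to this word by a dependency relation


def lookup(token):
    """Prefer a context sensitive entry; fall back to the context free entry."""
    for other in token.related:
        freqs = CONTEXT_SENSITIVE.get((token.word, other))
        if freqs is not None:
            return freqs, "used with " + other
    return CONTEXT_FREE.get(token.word), "no registered context"


def check(tokens):
    """Return one report per token whose variant is not the most frequent in its context."""
    reports = []
    for token in tokens:
        freqs, context = lookup(token)
        if freqs is None:
            continue  # word is not in the dictionary
        suitable = max(freqs, key=freqs.get)
        if token.variant != suitable:
            reports.append((token.word, token.variant, suitable, context, freqs))
    return reports


# Assumed user input: kanji+(3) in the tax sentence, kanji+(2) in the investment sentence.
sentence_1 = [Token("hikiageru", "kanji+(3)", ("zeikin",))]
sentence_2 = [Token("hikiageru", "kanji+(2)", ("toushi",))]
for report in check(sentence_1) + check(sentence_2):
    print(report)
```

The sketch returns the frequency table together with each flagged variant, so that a front end could show the user the evidence instead of replacing the variant automatically.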

3.2 Context Sensitive Variant

Dictionary

In order to develop a context sensitive variant dic-

tionary, we expand a context free variant dictionary

by adding information of context suitable variants

and the contexts. The context free variant dictionary (Nishikawa 09b), which we used and expanded
in this study, contains the dominant variants of 20929 words. These dominant variants were
extracted from 296364 articles published in the Mainichi Newspaper from January 2006 to June 2006
(Mainichi 06-08), and their credibility was confirmed by using binomial tests. These words can be
classified into two types:

TYPE I a word of this type actually has two or more variants; however, only one of them was found
in the newspaper articles. 14659 TYPE I words were extracted from the Mainichi Newspaper (January
2006 – June 2006).

TYPE II a word of this type has two or more variants found in the newspaper articles. 6270 TYPE II
words were extracted from the Mainichi Newspaper (January 2006 – June 2006). Words which have
context suitable variants are classified as TYPE II words.

(a) two input sentences, both of which include “hikiageru [pull up]”, are given to the system.

(b) the system detects unsuitable variants of “hikiageru [pull up]” for the contexts in the input
sentences and shows the context sensitive frequency information of the variants.

Figure 8: An example of how to use our writing support system. English system messages are
inserted ad hoc for convenience of non-Japanese readers of this paper.

In order to show how strongly the dominant variant of a word dominates, (Nishikawa 09b) introduced
the dominant degree. Suppose that a word has variants i (∈ I). The utilization rate of variant i is
calculated as follows:

\[ u_i = \frac{f_i}{\sum_{j \in I} f_j} \]

where u_i and f_i are the utilization rate and frequency of variant i, respectively. The dominant
degree of the word is calculated as follows:

\[ d = \max_{i \in I} u_i \]

where d is the dominant degree of the word. Figure 9 shows the histograms of the dominant degrees
of TYPE II words extracted from the Mainichi Newspaper (January 2006 – June 2006). The broken line
in Figure 9 shows the histogram of the dominant degrees of all the TYPE II words extracted from the
Mainichi Newspaper (January 2006 – June 2006). On the other hand, the thick line shows the
histogram of the dominant degrees of TYPE II words whose variants were extracted credibly by using
binomial tests from the Mainichi Newspaper (January 2006 – June 2006).

Figure 9: The histograms of the dominant degrees of TYPE II words in the newspaper articles
[Mainichi Newspaper (January 2006 – June 2006)]. (Horizontal axis: dominant degree, 0.2 to 1;
vertical axis: word frequency; series: TYPE II words (extracted by binomial test) and TYPE II
words (all).)
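As a worked example of the two definitions above, the following sketch computes the utilization rates and the dominant degree of hikiageru from the frequencies in Figure 5. The function names are hypothetical; only the formulas and the frequencies come from the text.

```python
def utilization_rates(freqs):
    """u_i = f_i / (sum of f_j over all variants j of the word)."""
    total = sum(freqs.values())
    return {variant: f / total for variant, f in freqs.items()}


def dominant_degree(freqs):
    """d = max over all variants i of u_i."""
    return max(utilization_rates(freqs).values())


# Frequencies copied from Figure 5.
HIKIAGERU = {"hiragana": 1, "kanji+(1)": 4, "kanji+(2)": 774, "kanji+(3)": 146}

rates = utilization_rates(HIKIAGERU)
d = dominant_degree(HIKIAGERU)
print(rates)
print(d)  # 774 / 925, the utilization rate of kanji+(2)
```

The total frequency is 1 + 4 + 774 + 146 = 925, so the dominant degree of hikiageru is 774 / 925, or about 0.84.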

We expanded this variant dictionary by adding the following kinds of information:

• context suitable variants and

• the contexts where the variants are used dominantly.

The information was extracted as follows. Suppose that word A has a variant which is a nondominant
variant of word A but is used dominantly in the context where word A is used with word B. We
extracted

• the context suitable variant of word A and

• the context where word A is used with word B

in the following steps.

Step 1 Apply Japanese morphological analysis and dependency analysis to newspaper articles. In
this study, we used a Japanese morphological analyzer, JUMAN (JUMAN 05), and a Japanese parser,
KNP (KNP 05).

Step 2 From the results of the analyses, extract variants of word A which have a dependency
relation to word B. In the morphological analysis, JUMAN gives variant labels to variants.
Variants of a certain word can be detected because JUMAN gives the same variant label to them.

Step 3 Determine which variant of word A is used dominantly in the context where word A is used
with word B. If the variant is not the dominant variant of word A, go to Step 4. Otherwise,
terminate the process. The dominant variant of word A is registered in the variant dictionary
(Nishikawa 09b).

Step 4 In order to confirm that the variant is a credible context suitable variant, measure its
credibility by using a binomial test: the variant is regarded as a credible context suitable
variant when the lower limit of the one-sided 95% binomial confidence interval of the utilization
rate of the variant in the context is more than 0.5.

Figure 10: The outline of the experiment.

In this study, we extracted 3598 context suitable variants and their contexts from 1786752
articles published in the Mainichi Newspaper from 2005 to 2007 (Mainichi 06-08).
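Steps 2 to 4 can be sketched as follows. Several points are assumptions and not taken from the paper: the analyzed corpus is represented by hypothetical (word, variant, co-occurring word) tuples instead of real JUMAN and KNP output, the confidence limit is computed with the exact (Clopper-Pearson) method, which the text does not specify, and all function names are illustrative. The counts in the example are those of Figure 6.

```python
from collections import Counter
from math import comb


def count_context_variants(observations, word, co_word):
    """Step 2: count variants of `word` that have a dependency relation to `co_word`.

    `observations` is a hypothetical list of (word, variant, co_word) tuples.
    """
    return Counter(v for w, v, c in observations if w == word and c == co_word)


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k, n + 1))


def lower_limit(k, n, alpha=0.05):
    """Lower limit of the one-sided (1 - alpha) exact binomial confidence interval.

    Assumption: Clopper-Pearson interval, found by bisection on p.
    """
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(k, n, mid) < alpha:
            lo = mid  # the limit lies above mid
        else:
            hi = mid
    return lo


def context_suitable_variant(context_freqs, dominant_variant, threshold=0.5):
    """Steps 3 and 4: return a credible context suitable variant, or None."""
    total = sum(context_freqs.values())
    if total == 0:
        return None
    best = max(context_freqs, key=context_freqs.get)
    if best == dominant_variant:
        return None  # Step 3: the dominant variant also dominates in this context
    if lower_limit(context_freqs[best], total) > threshold:
        return best  # Step 4: credible context suitable variant
    return None


# Counts of Figure 6: hikiageru used with toushi (kanji+(3): 15, kanji+(2): 2).
observations = [("hikiageru", "kanji+(3)", "toushi")] * 15 \
             + [("hikiageru", "kanji+(2)", "toushi")] * 2
freqs = count_context_variants(observations, "hikiageru", "toushi")
print(context_suitable_variant(freqs, dominant_variant="kanji+(2)"))  # kanji+(3)
```

Under these assumptions, the lower limit for 15 occurrences out of 17 is well above 0.5, so the kanji+(3) variant is registered for the context of toushi, whereas a variant observed only two times out of three would be rejected as not credible.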

4 EXPERIMENTAL RESULTS

To evaluate our method, we conducted a controlled experiment. Figure 10 shows the outline of the
experiment. Twenty subjects, university students in computer science, were divided into two
groups: a control group and an experimental group. As shown in Figure 10, we gave tests 1 and 2 to
the control group, and tests 1 and 3 to the experimental group. In these three tests, we gave the
same five problems of variant selection with the following kinds of information:

test 1 no information

test 2 context free variant information

test 3 context sensitive variant information

Each problem consisted of two sentences, in each of which one word was underlined, and variant
choices for the word. From the variant choices of the underlined word, the subjects were requested
to choose the one variant which seemed to be suitable for the contexts in official, business, and
technical documents. One sentence in each problem had a context for which the dominant variant was
suitable. The other had a context for which the dominant variant was not suitable. For example,
the following two sentences were used in a problem of the experiment.

Problem 1(a) kakugi de zeikin wo hikiageru koto ga kettei sareta [the plan to raise taxes was
approved by the Cabinet]

Problem 1(b) New York no sijyo kara toushi wo hikiageru koto ni shita [we decided to withdraw our
investments from the New York market]

Table 1: The choosing rate of variants suitable for the contexts.

group          test 1   test 2 / 3
control           68%          77%
experimental      73%          81%

The dominant variant of hikiageru [pull up] is suitable for the context of problem 1(a) but
unsuitable for the context of problem 1(b), because hikiageru is used with toushi [investment]
there. When subjects in the control group tried to solve problems 1(a) and 1(b) in test 2, they
received the frequency information which is shown in Figure 5 and which is unsuitable for the
context of problem 1(b). On the other hand, subjects in the experimental group received context
sensitive frequency information which

• is shown in Figure 5 when they tried to solve problem 1(a) in test 3

• is shown in Figure 6 when they tried to solve problem 1(b) in test 3

In other words, subjects in the experimental group received the same context sensitive frequency
information which our system gives to users. Figure 8 (b) shows the advice of our system when
problems 1(a) and 1(b) are given to the system.

Table 1 shows the choosing rate of variants suitable for the contexts in tests 1, 2, and 3. Table
1 shows that notational variant selection is a serious problem. In test 1, some subjects chose
unsuitable variants for no particular reason and were totally unaware of doing so. However, Table
1 also implies that students do not have confidence in their notational variant selection and
flexibly change their decisions when reasons are given to them. Actually, in test 3, five subjects
in the experimental group changed their decisions, and two other subjects did not change their
decisions but felt sure of them. Some of them said that they could follow the system’s advice more
readily than a teacher’s instructions given without concrete evidence. On the other hand, in test
2, five subjects in the control group changed their decisions, and two of them selected variants
unsuitable for the contexts because of the context free variant information.

5 CONCLUSIONS

In this paper, we ﬁrst proposed a method of devel-

oping a context sensitive variant dictionary by which

our writing support system determines which variant

is suitable for the contexts in ofﬁcial, business, and

technical documents. Then, we conducted a controlled experiment and showed the effectiveness of
our system.

ACKNOWLEDGEMENTS

This research has been supported partly by the

Grant-in-Aid for Scientiﬁc Research (C) under Grant

No.20500106.

REFERENCES

Nishikawa, Nishimura, Watanabe, and Okada: Writing

Support System Dealing with Notational Variant Se-

lection, CSEDU 2009, (2009).

Nishikawa, Nishimura, Watanabe, Murata, and Okada:

Dominant Variant Dictionaries for Supporting Variant

Selection, IADIS AS 2009, (2009).

Kubomura and Kameda: Information Retrieval System with

Abilities of Processing Katakana-Allographs, Trans.

of IEICE, Vol.J86-D-II, No.3, (2003).

Kouda: Search method of variant notations on a science and

technology document retrieval system, IPSJ SIG NL,

Vol.2006, No.118, (2006).

Bamba, Shinzato, and Kurohashi: Development of a Large-

scale Web Page Clustering System using an Open

Search Engine Infrastructure TSUBAKI, IPSJ SIG

NL, Vol.2008, No.4, (2008).

Shimomura, Namiki, Nakagawa, and Takahashi: A method

for detecting errors in Japanese sentences based

on morphological analysis using minimal cost path

search, Trans. of IPSJ, Vol.33, No.4, (1992).

Araki, Ikehara, and Tukahara: A method for detect-

ing and correcting of characters wrongly substituted,

deleted or inserted in Japanese strings using 2nd-order

Markov model, IPSJ SIG NL, Vol.93, No.79, (1993).

Murata and Isahara: Extraction of negative examples based

on positive examples: automatic detection of mis-

spelled Japanese expressions and relative clauses that

do not have case relations with their heads, IPSJ SIG

NL, Vol.2001, No.69, (2001).

Yokoyama: Can we predict preference for kanji form from

newspaper data on character frequency?, IPSJ SIG

CH, Vol.2006, No.10, (2006).

Kurohashi and Kawahara: JUMAN Manual version 5.1 (in

Japanese), Kyoto University, (2005).

Kurohashi and Kawahara: KNP Manual version 2.0 (in

Japanese), Kyoto University, (2005).

Mainichi Shinbun CD-Rom data set 2005, 2006, and 2007,

Nichigai Associates Co., (2006-2008).
