2 RELATED WORK
Sakai, Kumano, and Manabe describe a
transliteration system for automatic information
retrieval (Sakai, Kumano, and Manabe, 2002). Their
system supports both transliteration of Roman script
into Japanese Katakana and back-transliteration.
They report that the system produces over 75%
correct results when transliterating, and over 55%
correct results when back-transliterating. One reason
back-transliteration performs worse than
transliteration is that Katakana cannot represent
certain sounds (for example, it makes no distinction
between “l” and “r”).
Kawtrakul et al. propose a similar
back-transliteration system for information retrieval
(Kawtrakul et al., 1998). Their system transliterates
Roman script to the alphasyllabary Thai script. Its
main use is to automatically generate foreign words
(such as proper nouns and technical terms) in Thai
script. The back-transliteration system consists of
three steps: syllable formation, phonetic
transcription, and a fuzzy search for a matching
English word. As with the system of Sakai, Kumano,
and Manabe, back-transliteration is problematic
because some English sounds do not exist in Thai.
Kang and Kim propose an English-to-Korean
transliteration scheme (Kang and Kim, 2003). This
scheme operates in three steps: constructing
phonetic sequences that represent all possible
transliterations of a given phrase; checking the
validity of each of the transliterations; and choosing
the most probable transliteration.
Grefenstette, Yan, and Evans describe a
Katakana transliteration scheme based on finding
matches on the Web (Grefenstette, Yan, and Evans,
2004). This is somewhat similar to the scheme of
Kang and Kim in that both schemes look at a
number of possible outcomes, and choose one based
on a criterion. The criterion that Grefenstette, Yan,
and Evans use is the highest hit score for a phrase on
the Web.
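The shared pattern of the two schemes above (generate
several candidates, discard invalid ones, keep the
best under some criterion) can be sketched as follows.
This is only an illustration, not the implementation
of either cited system: the function name
select_transliteration, the validity predicate, and
the hit counts are hypothetical placeholders, and the
counts are invented for the example rather than
measured on the Web.

```python
from typing import Callable, Iterable, Optional


def select_transliteration(
    candidates: Iterable[str],
    score: Callable[[str], float],
    is_valid: Callable[[str], bool] = lambda c: True,
) -> Optional[str]:
    """Return the valid candidate with the highest score, or None."""
    valid = [c for c in candidates if is_valid(c)]
    if not valid:
        return None
    # max() keeps the first candidate in case of a tie.
    return max(valid, key=score)


# Placeholder hit counts (invented for illustration only).
HYPOTHETICAL_HITS = {
    "コンピュータ": 3,
    "コンピューター": 2,
    "コムピュータ": 1,
}

best = select_transliteration(
    HYPOTHETICAL_HITS.keys(),
    score=lambda c: HYPOTHETICAL_HITS[c],
)
```

Swapping the score function changes the criterion
(for example, a probability instead of a hit count)
without changing the selection step.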
Al-Onaizan and Knight propose a transliteration
scheme from Arabic to English based on
probabilistic finite state machines (Al-Onaizan and
Knight, 2002). Their evaluation indicates
transliteration accuracies from 15% to 55%,
depending on whether the phrase to be transliterated
is of Arabic, English, or other origin. The absence of
short vowels in Arabic and the existence of silent
letters in English (such as the P in Psychology)
cause major transliteration inaccuracies.
3 TRANSLITERATION ENGINE
The transliteration engine has three goals: to
produce text phrases that can be pasted into
electronic documents; to produce XHTML Unicode
strings that can be used for publishing on the
Internet; and to produce C strings that can be used
in C-like programming languages.
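To make the three output forms concrete, the sketch
below renders one Katakana string in each of them. It
rests on assumptions not stated in the text: that
"XHTML Unicode strings" means hexadecimal numeric
character references, and that "C strings" means \u
escapes of the kind listed in Table 1. The function
names to_text, to_xhtml, and to_c_string are
hypothetical.

```python
import html


def to_text(text: str) -> str:
    """Plain Unicode text, ready to paste into a document."""
    return text


def to_xhtml(text: str) -> str:
    """Encode non-ASCII characters as numeric character references."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(html.escape(ch))  # escapes <, >, &, and quotes
        else:
            out.append(f"&#x{ord(ch):04X};")
    return "".join(out)


def to_c_string(text: str) -> str:
    """Encode non-ASCII characters as C-style \\u or \\U escapes."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # ASCII is copied as-is (quotes are not escaped here)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")
        else:
            out.append(f"\\U{cp:08X}")
    return "".join(out)


sample = "\u30D0\u30FC"  # the Katakana target for "baa" in Table 1
as_text = to_text(sample)
as_xhtml = to_xhtml(sample)
as_c = to_c_string(sample)
```

All three forms derive from the same code points, so
a single script mapping can serve every output
format.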
The transliteration script mappings in the
transliteration engine use the Hepburn Romanization
system. This system was originally developed in
1867 by Reverend James Hepburn to transcribe
Japanese words into the Roman alphabet. Several
variants of the system exist. The transliteration
engine uses the modified Hepburn system, in which
long vowels are indicated by doubling the vowel.
Table 1 illustrates a partial script mapping for
Japanese Katakana using the modified Hepburn system.
Table 1: A partial script mapping for Katakana.
Source string   Target string   Unicode symbol
b               \u30C4          ツ
ba              \u30D0          バ
baa             \u30D0\u30FC    バー
be              \u30D9          ベ
bee             \u30D9\u30FC    ベー
bi              \u30D3          ビ
bii             \u30D3\u30FC    ビー
bo              \u30DC          ボ
boo             \u30DC\u30FC    ボー
bu              \u30D6          ブ
buu             \u30D6\u30FC    ブー
bya             \u30D3\u30E3    ビャ
byo             \u30D3\u30E7    ビョ
byu             \u30D3\u30E5    ビュ
The source side of the script mapping is a set of
strings in the Roman alphabet, and the target side is
a set of Unicode strings. This makes it possible to configure the
transliteration engine for a variety of scripts. The
engine processes the input text by looking for the
longest match of source letter sequences within the
script mapping.
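A minimal sketch of longest-match processing over a
few entries from Table 1 is given below. It is not
the engine's actual code: the names SCRIPT_MAP and
transliterate are hypothetical, and copying unmapped
characters through unchanged is an assumption, since
the text does not say how such input is handled.

```python
# A few source -> target entries taken from Table 1.
SCRIPT_MAP = {
    "ba": "\u30D0",
    "baa": "\u30D0\u30FC",
    "be": "\u30D9",
    "bee": "\u30D9\u30FC",
    "bi": "\u30D3",
    "bii": "\u30D3\u30FC",
    "bo": "\u30DC",
    "boo": "\u30DC\u30FC",
    "bu": "\u30D6",
    "buu": "\u30D6\u30FC",
}


def transliterate(text: str, mapping: dict = SCRIPT_MAP) -> str:
    """Greedy left-to-right transliteration using the longest match."""
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible source sequence first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += size
                break
        else:
            # Assumption: characters with no mapping are copied unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


result = transliterate("baabu")
```

Trying the longest sequence first is what makes "baa"
map to a single long-vowel target instead of "ba"
followed by a stray "a".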
3.1 Script Mapping
There are challenging issues to consider in
formulating the script mapping.
A fundamental requirement for using
transliteration is familiarity with the target language.
It is important that the user knows how a given word
is pronounced in the target language, and uses this