handled without part-of-speech and other lexical
information.
First, a sentence may be split into translation
units at semicolons, colons, and bars. Each unit is
then translated independently.
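The delimiter split can be sketched as follows. This is a minimal illustration, not the system's implementation: the function name is hypothetical, "bar" is assumed to mean the vertical bar character, and colons inside tokens such as clock times are not protected.

```python
import re

# Delimiters named in the text: semicolon, colon, and (assumed) vertical bar.
_DELIMITERS = re.compile(r"\s*[;:|]\s*")


def split_translation_units(sentence):
    """Split a sentence into translation units at ';', ':' and '|'."""
    return [unit for unit in _DELIMITERS.split(sentence) if unit]


print(split_translation_units("He left; she stayed: nobody spoke"))
```

A production version would need exceptions for colons that do not separate clauses, such as "10:30".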
Second, parts of a sentence enclosed by a pair of
single or double quotation marks are separate
translation units. Parts enclosed by parentheses or
angle brackets are translation units as well. The
enclosed parts must be separated and translated
independently, but they are also elements of another
translation unit. Figure 2 shows two examples. The
enclosed part Q1 must be translated together with the
given sentence, as in TU1, while P1 can be translated
independently. In the case of P1, the target word for
TU2 must be identified so that post-processing can
append the translation of TU2 to the translation of
the target word.
Figure 2: Examples of sentences with translation units
enclosed by double quotation marks and parentheses.
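The separation of enclosed parts can be sketched on the two Figure 2 sentences. The code below is an illustrative assumption, not the authors' procedure: it handles only straight double quotes and round parentheses, the function name and the (label, text) output format are hypothetical, and the identification of the target word for a parenthesized part is left out.

```python
import re
from typing import List, Tuple


def extract_enclosed(sentence: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return the outer unit and a list of (label, enclosed text) pairs.

    A quoted part is replaced by a placeholder (Q1, Q2, ...) because the
    outer unit is translated together with it. A parenthesized part is
    removed (P1, P2, ...) because it can be translated on its own.
    """
    units: List[Tuple[str, str]] = []
    counts = {"Q": 0, "P": 0}

    def next_label(kind: str) -> str:
        counts[kind] += 1
        return "%s%d" % (kind, counts[kind])

    def quoted(match):
        label = next_label("Q")
        inner = match.group(1).strip()
        # Move a comma that closes the quotation out to the outer unit.
        comma = "," if inner.endswith(",") else ""
        units.append((label, inner.rstrip(",").strip()))
        return label + comma

    def parenthesized(match):
        units.append((next_label("P"), match.group(1).strip()))
        return ""

    outer = re.sub(r'"([^"]*)"', quoted, sentence)
    outer = re.sub(r"\s*\(([^()]*)\)", parenthesized, outer)
    return outer, units


print(extract_enclosed(
    '"Next year, I may evaluate it a little closer," said Stan Guest, '
    "an uninsured farmer."))
```

Single quotation marks are omitted on purpose, since they collide with apostrophes and would need extra disambiguation.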
Third, some sentences begin with a head mark that
introduces a list item. When the mark is a symbol, it
is easily removed. When it is a number or an
alphabetic numeral (e.g., i, ii, I, II, …), it must
first be recognized and then removed.
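A head-mark remover might look like the sketch below. The inventory of marks (a few bullet symbols, digits, Roman numerals, single letters) and the requirement of a closing "." or ")" are assumptions made for illustration; the paper does not list the marks it handles.

```python
import re

# A leading bullet symbol, or a number / Roman numeral / single letter
# followed by '.' or ')'. The inventory is an illustrative assumption.
_HEAD_MARK = re.compile(
    r"""^\s*(?:
        [-*•·]                                # symbol bullets
      | \(?(?:\d+|[ivxIVX]+|[A-Za-z])[.)]     # 1.  2)  (iii)  a.  B)
    )\s+""",
    re.VERBOSE,
)


def strip_head_mark(sentence):
    """Remove one leading list mark, if present."""
    return _HEAD_MARK.sub("", sentence, count=1)


print(strip_head_mark("ii) Install the parser"))
```

Requiring the closing "." or ")" keeps a sentence-initial "I" from being mistaken for a Roman numeral, although an initial such as "A. Smith" would still be stripped by this sketch.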
Fourth, words abbreviated with an apostrophe must be
restored to facilitate lexical analysis. For example,
“don’t” must be converted to “do not”. We build a
dictionary of such words.
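A dictionary-based restoration can be sketched as below. The entries are a tiny illustrative sample, not the dictionary built by the authors, and the function name is hypothetical.

```python
import re

# Illustrative sample only; the actual dictionary is not given in the text.
CONTRACTIONS = {
    "don't": "do not",
    "doesn't": "does not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "I am",
}

# A word containing a straight or curly apostrophe.
_APOSTROPHE_WORD = re.compile(r"[A-Za-z]+['’][A-Za-z]+")


def restore_contractions(sentence):
    """Replace known contractions with their full forms."""
    def expand(match):
        word = match.group(0)
        full = CONTRACTIONS.get(word.lower().replace("’", "'"))
        if full is None:
            return word  # unknown form, e.g. a possessive: leave as is
        if word[0].isupper():
            full = full[0].upper() + full[1:]  # keep initial capital
        return full

    return _APOSTROPHE_WORD.sub(expand, sentence)


print(restore_contractions("I don’t know"))
```

Forms such as "it's" or "he'd" are ambiguous ("it is" or "it has", "he had" or "he would") and cannot be resolved by a plain lookup, so they are left out of the sample.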
Fifth, words representing numbers must be analyzed
to determine whether they are ordinal or cardinal. We
must also identify combinations of a number and a
unit word and combine the two words into one. For
this purpose, we need information about unit and
number words.
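The two number tasks can be sketched with small word lists. The lists, the function names, and the underscore used to join a number with its unit are all assumptions for illustration.

```python
import re

# Illustrative word lists; a real system needs full number and unit lexicons.
CARDINALS = {"one", "two", "three", "ten"}
ORDINALS = {"first", "third", "tenth"}
UNITS = {"km", "kg", "meters", "dollars", "percent"}


def number_kind(token):
    """Return 'ordinal', 'cardinal', or None for a single token."""
    t = token.lower()
    if t in ORDINALS or re.fullmatch(r"\d+(st|nd|rd|th)", t):
        return "ordinal"
    if t in CARDINALS or re.fullmatch(r"\d+(\.\d+)?", t):
        return "cardinal"
    return None


def combine_number_units(tokens):
    """Merge a cardinal number and a following unit word into one token."""
    out, i = [], 0
    while i < len(tokens):
        if (i + 1 < len(tokens)
                and number_kind(tokens[i]) == "cardinal"
                and tokens[i + 1].lower() in UNITS):
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


print(combine_number_units("It weighs 5 kg".split()))
```

A word such as "second" is both an ordinal and a unit of time, which is one reason the text calls for explicit information about number and unit words.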
Sixth, we must identify composite words, in which
two or more words play the role of a single noun,
verb, adverb, preposition, or conjunction. Composite
nouns can be translated by the idiom translation
method, while composite verbs, adverbs, prepositions,
and conjunctions can be collected and combined into
one word. We need a list of composite words with
their translations.
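Combining composite words into one token can be sketched as a longest-match lookup against a word list. The three entries and the underscore-joined output tokens are hypothetical; translations are omitted, and inflected forms such as "gave up" are not handled.

```python
# Hypothetical composite-word list: token sequence -> combined token.
COMPOSITES = {
    ("in", "front", "of"): "in_front_of",
    ("as", "well", "as"): "as_well_as",
    ("give", "up"): "give_up",
}
MAX_LEN = max(len(key) for key in COMPOSITES)


def merge_composites(tokens):
    """Replace each listed word sequence by a single token (longest first)."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in COMPOSITES:
                out.append(COMPOSITES[key])
                i += n
                break
        else:
            # No composite starts at position i: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


print(merge_composites("He stood in front of the door".split()))
```

Trying the longest candidate first prevents a shorter entry from shadowing a longer one that starts at the same position.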
Seventh, some special patterns must be handled.
For example, sentences containing the [~ so that ~]
pattern can be rewritten as [~, so ~], and the
rewritten sentences are easier to analyze. We collect
the patterns that require sentence rewriting and
build the corresponding rewriting patterns.
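The [~ so that ~] rewriting can be sketched with a single regular expression. This is a surface-level illustration under the assumption that the first "so that" in the sentence is the conjunction; the function name is hypothetical.

```python
import re

# [~ so that ~]  ->  [~, so ~]
_SO_THAT = re.compile(
    r"^(?P<left>.+?),?\s+so that\s+(?P<right>.+)$", re.IGNORECASE)


def rewrite_so_that(sentence):
    """Rewrite '[~ so that ~]' as '[~, so ~]'; return other input unchanged."""
    match = _SO_THAT.match(sentence)
    if match is None:
        return sentence
    return "%s, so %s" % (match.group("left"), match.group("right"))


print(rewrite_so_that("He spoke slowly so that everyone could follow."))
```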
Eighth, phrases expressing dates must be identified
and treated as one word. These phrases are translated
in a separate post-processing step for date
translation. Dates can be written in several
patterns. We collect these patterns for
identification and build the corresponding
translation patterns.
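Paired identification and translation patterns for dates can be sketched as below. Only two surface patterns are covered, both chosen for illustration, and the Korean output uses the standard year-month-day form (년, 월, 일); the actual pattern inventory of the system is not given in the text.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
_MONTH = "|".join(MONTHS)

# Two illustrative identification patterns: "March 5, 1995", "5 March 1995".
DATE_PATTERNS = [
    re.compile(r"\b(?P<month>%s) (?P<day>\d{1,2}), (?P<year>\d{4})\b" % _MONTH),
    re.compile(r"\b(?P<day>\d{1,2}) (?P<month>%s) (?P<year>\d{4})\b" % _MONTH),
]


def to_korean_date(match):
    """Translation pattern: year, month, day in Korean order."""
    month = MONTHS.index(match.group("month")) + 1
    return "%s년 %d월 %d일" % (match.group("year"), month,
                              int(match.group("day")))


def find_dates(sentence):
    """Return (matched phrase, Korean rendering) for each date found."""
    found = []
    for pattern in DATE_PATTERNS:
        for match in pattern.finditer(sentence):
            found.append((match.group(0), to_korean_date(match)))
    return found


print(find_dates("The plan started on March 5, 1995 in Seoul."))
```

Keeping each identification pattern next to a shared translation function mirrors the pairing of identification and translation patterns described above.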
3.2.2 Tasks after Lexical Analysis
Some pre-processing problems need lexical
information such as part-of-speech, part-of-speech
probability, and so on. We present five
pre-processing tasks as follows.
First, sentences may include phrases expressing a
person's name and age. Such a phrase must be combined
and treated as one word during lexical and syntactic
analysis. This requires a corresponding
post-processing step in which the combined phrase is
translated into Korean.
Second, geographical names consisting of proper
nouns and a comma must be combined. For example, in
the sentence “I lived in Brynmawr, PA.”, “Brynmawr,
PA.” is combined so that the phrase can be translated
as one word. To solve the above two problems, we need
to know whether a word is a proper noun or not.
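Combining a geographical name can be sketched over part-of-speech-tagged tokens. The lexical test used here is a proper-noun tag; the Penn Treebank-style tag names (NNP and so on), the (word, tag) input format, and the function name are assumptions for illustration.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag); tagset is assumed


def combine_place_names(tokens: List[Tagged]) -> List[Tagged]:
    """Merge [proper noun , proper noun] into a single proper-noun token."""
    out: List[Tagged] = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens)
                and tokens[i][1] == "NNP"
                and tokens[i + 1][0] == ","
                and tokens[i + 2][1] == "NNP"):
            out.append((tokens[i][0] + ", " + tokens[i + 2][0], "NNP"))
            i += 3
        else:
            out.append(tokens[i])
            i += 1
    return out


sentence = [("I", "PRP"), ("lived", "VBD"), ("in", "IN"),
            ("Brynmawr", "NNP"), (",", ","), ("PA", "NNP"), (".", ".")]
print(combine_place_names(sentence))
```

Two proper nouns separated by a comma can also be items of a list, so a real system would need a further check, for example against a gazetteer.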
Third, some sentences start with an adverb or
adverbial phrase that modifies the following
sentence. Separating the modifier reduces the parsing
complexity.
Fourth, some sentences include patterns that are
difficult to translate. Such patterns include
[not only ~ but (also) ~], [insist ~ that ~ VERB
(base form) ~], [no sooner had ~ than ~], and so on.
We need lexical information to match the ‘~’ parts of
the patterns. For these patterns, we adopt a
rewriting method based on rewriting patterns: a
sentence that matches a defined pattern is rewritten
as directed by the corresponding pattern. The
corresponding patterns have compatible meanings and
forms that are easier to analyze in the rule-based
framework. For example, the [no sooner had ~ than ~]
pattern has [as soon as ~, ~] as its corresponding
pattern.
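The [no sooner had ~ than ~] rewriting can be sketched over tagged tokens, with a part-of-speech check standing in for the lexical matching of the ‘~’ parts. The Penn Treebank-style tags, the (word, tag) input format, and the rule that the first slot must contain a verb are assumptions for illustration, not the authors' matching conditions.

```python
from typing import List, Optional, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag); tagset is assumed


def rewrite_no_sooner(tokens: List[Tagged]) -> Optional[List[str]]:
    """[no sooner had X than Y] -> [as soon as X, Y]; None if no match."""
    words = [word.lower() for word, _ in tokens]
    if words[:3] != ["no", "sooner", "had"] or "than" not in words[3:]:
        return None
    split = words.index("than", 3)
    x, y = tokens[3:split], tokens[split + 1:]
    # Lexical check (assumed): the first slot must contain a verb.
    if not y or not any(tag.startswith("V") for _, tag in x):
        return None
    return (["as", "soon", "as"] + [word for word, _ in x]
            + [","] + [word for word, _ in y])


sentence = [("No", "DT"), ("sooner", "RB"), ("had", "VBD"), ("he", "PRP"),
            ("arrived", "VBN"), ("than", "IN"), ("it", "PRP"),
            ("began", "VBD"), ("to", "TO"), ("rain", "VB"), (".", ".")]
print(rewrite_no_sooner(sentence))
```

The sketch copies the verb of the first slot unchanged, which works for "arrived" but not for an irregular participle such as "gone"; choosing the right verb form is another place where lexical information is needed.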
[Figure 2 examples]
Example 1: "Next year, I may evaluate it a little closer,"
said Stan Guest, an uninsured farmer.
  TU1: Q1, said Stan Guest, an uninsured farmer
  TU2 = Q1: Next year, I may evaluate it a little closer
Example 2: During the eighth five-year plan period (from
1991 to 1995), the reform successfully completed.
  TU1: During the eighth five-year plan period, the
  reform successfully completed
  TU2 = P1: from 1991 to 1995

Fifth, we consider comma rewriting to prevent
segmentation from producing non-constituent
segments. For example, in “I need small, fast