2 MORPHOLOGICAL ANALYSIS
AND SEGMENTATION
Documents consist mainly of text, figures, and tables; the text contains sentences, which are sequences of words. A word is a character string separated by spaces or punctuation, but the matter is not really that simple: how should we treat compound words such as "U.S.A.", idioms (groups of words that carry a different meaning when used together), or collocations (the way some words regularly occur when other words are used) such as "not only ... but also"? A sentence in a natural language consists of morphemes. A morpheme is a unit string carrying minimal meaning. Each sentence can be decomposed into a sequence of morphemes, each described by a token (single word), an inflection (stem), and a part of speech (POS) such as noun or verb. This process is called morphological analysis. In this work, by morpheme we mean a pair of token and part-of-speech attributes.
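The notion of a morpheme as a (token, POS) pair can be illustrated by a toy decomposition; the dictionary and function below are hypothetical sketches for illustration, not the analyzer used in this work:

```python
# A morpheme, in the sense used here, is a (token, POS) pair.
# This toy lookup is purely illustrative; real morphological
# analyzers combine dictionaries with statistical models.
TOY_DICTIONARY = {
    "John": "noun",
    "Mary": "noun",
    "calls": "verb",
}

def analyze(sentence):
    """Decompose a whitespace-separated sentence into morphemes."""
    return [(token, TOY_DICTIONARY.get(token, "unknown"))
            for token in sentence.split()]

print(analyze("John calls Mary"))
# [('John', 'noun'), ('calls', 'verb'), ('Mary', 'noun')]
```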
Morphological analysis relies on a dictionary, which describes the relationships among morphemes, and on grammatical knowledge about them. We divide sentences into word segments and examine their roles (and meanings) as well as their structural relationships. Morphological analysis is one of the key steps toward syntactic and semantic analysis.
In English, a word expresses grammatical roles such as case and plurality by means of word order or inflection. The difference between "John calls Mary" and "Mary calls John" corresponds to the two interpretations of who calls whom between John and Mary. Such a language is called an inflectional language.
On the other hand, in some languages such as Japanese and Chinese, grammatical relationships can be described by means of postpositional particles; such languages are called agglutinative languages. For example, "I", "my", and "me" correspond to "私は", "私の", and "私を" respectively, where "私" is the first-person pronoun. The differences lie in the three postpositional particles "は", "の", and "を", which mark the subjective, possessive, and objective cases respectively. As for John and Mary, the two sentences "ジョンはメアリーを呼ぶ" and "ジョンをメアリーは呼ぶ" correspond to "John calls Mary" and "Mary calls John": the positions of "ジョン" (John), "メアリー" (Mary), and "呼ぶ" (call) are exactly the same, and only the postpositional particles differ. There is another problem: there are no boundaries between words in Japanese. Basically, if we identify the words and their relevant postpositional particles, we can determine the segmentation. Morphological analysis examines the roles of words and determines which parts should be attached to which words. In our case, we put tags between words, as in "/ジョンは/メアリーを/呼ぶ/" and "/ジョンを/メアリーは/呼ぶ/". In agglutinative languages, word segmentation plays an essential role in syntactic and semantic analysis.
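The role assignment that segmentation enables can be sketched as follows. The particle-to-role table, the slash-delimited input format, and the Japanese rendering ジョンはメアリーを呼ぶ for "John calls Mary" are illustrative assumptions, not the system described in this paper:

```python
# Map postpositional particles to grammatical roles
# (は: subjective, の: possessive, を: objective).
PARTICLE_ROLES = {"は": "subject", "の": "possessive", "を": "object"}

def roles(tagged_sentence):
    """Read a '/'-segmented sentence and report each word's role
    from the particle (if any) that ends its segment."""
    result = []
    for segment in tagged_sentence.strip("/").split("/"):
        for particle, role in PARTICLE_ROLES.items():
            if segment.endswith(particle):
                result.append((segment[:-len(particle)], role))
                break
        else:
            # no particle: treat the segment as the predicate
            result.append((segment, "predicate"))
    return result

print(roles("/ジョンは/メアリーを/呼ぶ/"))
# [('ジョン', 'subject'), ('メアリー', 'object'), ('呼ぶ', 'predicate')]
```

Swapping the particles, as in "/ジョンを/メアリーは/呼ぶ/", flips the subject and object while the word positions stay fixed, which is exactly the contrast drawn above.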
It is also important for the analysis step to examine compound words. For example, we can decompose "大学教育" (university education) into "大学" (university) and "教育" (education), but not "大学生" (university student) into "大学" and "生". Such segmentation rules often depend on the application domain. In this investigation, we propose an experimental and efficient approach to domain-dependent word segmentation based on a stochastic process: we apply an n-gram model to Japanese and examine the relationships between morphemes with a Hidden Markov Model (HMM) for word segmentation.
3 SEGMENTING WORDS
In the case of inflectional languages, there is no sharp distinction between word segmentation and part-of-speech (POS) tagging, and similar techniques can be applied to analyze and examine sentences. Two major approaches have been proposed so far: rule-based tagging and probability-based tagging. In the former, we extract characteristic patterns between words or POS tags, including several positions before and after the words of interest, and put them into the form of rules. For example, we may have a rule such as "this word is not a verb if the preceding word is a determiner". With well-structured rule tables, we can obtain rather good results. The problem is how to extract useful, general, consistent, and efficient rules: the process is not trivial and must be hand-coded, so it takes much time and is still not reliable.
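A minimal sketch of such a hand-coded rule is shown below; the determiner list and the candidate-tag representation are invented for illustration:

```python
# One hand-written disambiguation rule of the kind quoted above:
# a word cannot be a verb if the preceding word is a determiner.
DETERMINERS = {"a", "an", "the", "this", "that"}

def apply_rules(words, candidate_tags):
    """candidate_tags[i] is the set of possible POS tags for words[i];
    the rule prunes impossible tags using the left context."""
    tags = []
    for i, cands in enumerate(candidate_tags):
        pruned = set(cands)
        if i > 0 and words[i - 1].lower() in DETERMINERS:
            pruned.discard("verb")  # the quoted rule
        tags.append(pruned)
    return tags

words = ["the", "book"]
cands = [{"det"}, {"noun", "verb"}]
print(apply_rules(words, cands))
# [{'det'}, {'noun'}]
```

Even this single rule hints at the scaling problem noted above: each new ambiguity requires another carefully ordered, hand-verified rule.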
On the other hand, in probability-based tagging, we apply probabilistic approaches (such as naive Bayes classification) and stochastic-process approaches to tagging. A Hidden Markov Model (HMM) is a typical example, in which tags are considered as states and words as observation symbols. Given word sequences (sentences), we estimate tag sequences by means of Maximum Likelihood Estimation (MLE) under a simple Markov model over the states.
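This formulation — tags as hidden states, words as observation symbols, and the best tag sequence found by maximum-likelihood decoding — can be sketched with a small Viterbi decoder. The transition and emission probabilities below are toy values chosen for illustration, not estimates from any corpus:

```python
import math

# Toy HMM: tags are hidden states, words are observation symbols.
TAGS = ["noun", "verb"]
START = {"noun": 0.8, "verb": 0.2}                  # P(tag_1)
TRANS = {"noun": {"noun": 0.3, "verb": 0.7},        # P(tag_i | tag_{i-1})
         "verb": {"noun": 0.8, "verb": 0.2}}
EMIT = {"noun": {"John": 0.4, "Mary": 0.4, "calls": 0.2},  # P(word | tag)
        "verb": {"John": 0.1, "Mary": 0.1, "calls": 0.8}}

def viterbi(words):
    """Return the most likely tag sequence under the toy HMM."""
    # scores[t] = best log-probability of any tag path ending in t
    scores = {t: math.log(START[t] * EMIT[t][words[0]]) for t in TAGS}
    back = []
    for word in words[1:]:
        prev, scores, ptrs = scores, {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p] + math.log(TRANS[p][t]))
            scores[t] = prev[best] + math.log(TRANS[best][t] * EMIT[t][word])
            ptrs[t] = best
        back.append(ptrs)
    # follow back-pointers from the best final state
    tag = max(TAGS, key=lambda t: scores[t])
    path = [tag]
    for ptrs in reversed(back):
        tag = ptrs[tag]
        path.append(tag)
    return list(reversed(path))

print(viterbi(["John", "calls", "Mary"]))
# ['noun', 'verb', 'noun']
```

Log probabilities are used so that products of many small probabilities do not underflow on longer sentences.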
These investigations are also very useful for agglutinative languages such as Japanese, since POS tagging is needed there as well. However, given sentences in such languages, we need word segmentation techniques distinct from POS tagging: we should examine word boundaries so that every word is consistent with its postpositional particles. It is not easy to detect the boundaries, and approaches such as "ChaSen" have been proposed based on HMMs.
As for compound words in morphological anal-
ICAART 2009 - International Conference on Agents and Artificial Intelligence