occurrences so that we may expect to clarify details of natural language processing and their aspects.
In this investigation, we discuss how to extract collocations by means of both data mining and statistical techniques. First, we extend n-grams so that they consist of independent words, and we take frequencies over them after filtering on colligation (Sonoda, 2012). Then, in the second phase, we apply statistical filters to the candidates. Here we compare these feature selection methods in statistical learning with each other. Five methods are evaluated: term frequency (TF), Pairwise Mutual Information (PMI), Dice Coefficient (DC), T-Score (TS) and Pairwise Log-Likelihood ratio (PLL). In Section 2 we review collocation in Japanese and how to characterize it. In Section 3, we discuss a new approach to extracting collocations, as well as details of the feature selection methods in statistical learning. Section 4 contains some experiments, several analyses and a comparison with other approaches. We conclude our investigation in Section 5.
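As a rough illustration of the five measures named above, the following sketch scores a single word pair from raw counts. It uses common textbook definitions (pointwise-style mutual information, Dice, t-score and a log-likelihood ratio over a 2x2 contingency table); these formulas, the function name `association_scores` and its parameters are our assumptions for illustration and may differ from the exact variants evaluated in this paper.

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Score a word pair (x, y) from raw counts (illustrative names).

    f_xy: co-occurrence count of the pair, f_x / f_y: counts of each word,
    n: total number of observations. Textbook formulas, assumed here.
    """
    if f_xy <= 0:
        raise ValueError("the pair must be observed at least once")
    expected = f_x * f_y / n              # count expected under independence
    pmi = math.log2(f_xy / expected)
    dice = 2 * f_xy / (f_x + f_y)
    t_score = (f_xy - expected) / math.sqrt(f_xy)
    # Log-likelihood ratio over the 2x2 table (x / not x) by (y / not y).
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected_cells = [r * c / n for r in rows for c in cols]
    llr = 2 * sum(o * math.log(o / e)
                  for o, e in zip(observed, expected_cells) if o > 0)
    return {"TF": f_xy, "PMI": pmi, "DC": dice, "TS": t_score, "PLL": llr}
```

When the observed count equals the count expected under independence, PMI, the t-score and the log-likelihood ratio are all zero, which is a convenient sanity check for such filters.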
2 COLLOCATION IN JAPANESE
Before developing our story, let us see how word structure works in the Japanese language. In English, a word expresses grammatical roles such as case and plurality by means of word order or inflection. For example, consider two sentences:
John calls Mary.
Mary calls John.
The difference corresponds to the two interpretations of the positions, i.e., who calls whom between John and Mary. Such a language is called inflectional. On the other hand, in Japanese, grammatical relationships are described by means of postpositional particles, and such a language is called agglutinative. For example, let us see two sentences:
John/ga/Mary/wo/yobu. (John calls Mary)
John/wo/Mary/ga/yobu. (Mary calls John)
In these sentences, the positions of John, Mary and yobu (call) are exactly the same; the only difference is in the postpositional particles ("ga", "wo"). With the postpositional particles, we can put any word in any place³. Independent word(s) and a postpositional particle constitute a clause. Clearly, in Japanese, many approaches for inflectional languages cannot be applied in a straightforward manner⁴.
³ One exception is the predicate. In fact, the predicate should appear as the last verb in each sentence.
⁴ Morphological analysis means both word segmentation and part-of-speech processing in Japanese. For exam-
The main reasons come from inherent aspects of Japanese: it is agglutinative while English is inflectional.
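The role of the particles in the two example sentences can be sketched in a few lines. The mapping `PARTICLE_ROLES` and the function `parse_roles` are hypothetical names introduced only for this illustration, and only the two particles appearing in the examples are covered.

```python
# Hypothetical particle-to-role mapping; only "ga" and "wo" are handled.
PARTICLE_ROLES = {"ga": "subject", "wo": "object"}

def parse_roles(segmented):
    """Assign roles from particles in a '/'-segmented sentence."""
    tokens = segmented.rstrip(".").split("/")
    roles = {}
    i = 0
    while i < len(tokens):
        # A word followed by a known particle forms a clause.
        if i + 1 < len(tokens) and tokens[i + 1] in PARTICLE_ROLES:
            roles[PARTICLE_ROLES[tokens[i + 1]]] = tokens[i]
            i += 2
        else:
            # A word without a particle is treated as the predicate.
            roles["predicate"] = tokens[i]
            i += 1
    return roles

calls_mary = parse_roles("John/ga/Mary/wo/yobu.")
calls_john = parse_roles("John/wo/Mary/ga/yobu.")
reordered = parse_roles("Mary/wo/John/ga/yobu.")
```

Here `calls_mary` and `reordered` describe the same relationship although the word order differs, whereas `calls_john` swaps subject and object with the word order unchanged.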
As for collocation in Japanese, each clause contains several morphemes, and we see many co-occurrences among nouns and postpositional particles, which look like colligation but are language-dependent and useless for collocation. To obtain frequent co-occurrences, there has been much investigation in text mining (Han, 2006). Here we apply the Apriori and FP-tree algorithms to obtain frequent word sets. Since we would like to examine collocation, we should extend the n-gram approach so that it contains independent words only. Then, to screen out trivial and useless collocations, we should have some filters to remove noise such as function words and stop words. To screen out trivial colligation in English, several investigations have been proposed so far that use part of speech and sentence structure, and these could be useful for our case. Very often proper nouns cause noise (as unknown words such as "iPad") or confusion (e.g., "Apple" is a computer). From an ontological viewpoint, we may introduce abstraction for these words, especially proper nouns and numerals. For instance, given "Ichiro at bat" and "Matsui at bat", we may have "<Baseball Player> at bat" as a frame.
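The abstraction of proper nouns into a frame can be sketched as follows. The dictionary `ONTOLOGY` is a toy stand-in that we assume for illustration; a real system would consult an actual ontology or named-entity resource.

```python
from collections import Counter

# Toy ontology (an assumption for illustration): proper noun -> class label.
ONTOLOGY = {"Ichiro": "<Baseball Player>", "Matsui": "<Baseball Player>"}

def abstract(words):
    """Replace each known proper noun by its class label."""
    return tuple(ONTOLOGY.get(w, w) for w in words)

phrases = [["Ichiro", "at", "bat"], ["Matsui", "at", "bat"]]
frames = Counter(abstract(p) for p in phrases)
```

After abstraction, the two phrases collapse into one frame whose count is the sum of the individual counts, so a pattern that is rare for each player becomes frequent for the class.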
To tackle semantic preference issues over word occurrences, there seem to be several approaches. It seems easier to utilize case frame dictionaries. Generally, such dictionaries allow us to analyze case structure, but the results depend on the dictionary as well as the domain corpus. Another idea is to apply statistical filters to the words to characterize relationships among words. They provide us with feature selection criteria to extract collocations.
3 EXTRACTING COLLOCATION
IN JAPANESE
Let us describe how we extract collocations in Japanese. Our approach consists of several steps: filtering irrelevant morphemes, generalizing proper nouns, generating extended n-grams (n-Xgram), extracting frequent word sets over the n-Xgrams, and applying statistical filters.
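Two of these steps, filtering and n-Xgram generation, can be sketched under simplifying assumptions. Here an n-Xgram is taken to be a contiguous n-gram over the independent words that remain once particles are dropped, and plain counting with a minimum support stands in for the Apriori and FP-tree algorithms. The particle list and the function names are hypothetical.

```python
from collections import Counter

# Assumed particle list; a real system would use part-of-speech tags.
PARTICLES = {"ga", "wo", "mo", "no"}

def n_xgrams(morphemes, n):
    """n-grams over independent words only (particles are skipped)."""
    independent = [m for m in morphemes if m not in PARTICLES]
    return [tuple(independent[i:i + n])
            for i in range(len(independent) - n + 1)]

def frequent_n_xgrams(sentences, n, min_count):
    """Keep n-Xgrams that occur at least min_count times in the corpus."""
    counts = Counter(g for s in sentences for g in n_xgrams(s, n))
    return {g: c for g, c in counts.items() if c >= min_count}

corpus = [
    "John/ga/Mary/wo/yobu".split("/"),
    "John/wo/Mary/ga/yobu".split("/"),
    "sumomo/mo/momo/mo/momo/no/uchi".split("/"),
]
frequent = frequent_n_xgrams(corpus, n=2, min_count=2)
```

Because particles are removed before the n-grams are built, the first two sentences yield the same word pairs even though their particles differ.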
ple, "sumomo/mo/momo/mo/momo/no/uchi" means that both plums and peaches are kinds of peach, which is a typical tongue twister where you should say "mo" many times. There are two nouns, "sumomo" (plum) and "momo" (peach). There is no delimiter between words (no space, no comma, and no slash) and everything goes into one string as "sumomomomomomomomonouchi".
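The difficulty of segmenting such an undelimited string can be illustrated by enumerating every way to split it using a small lexicon. The lexicon below is a toy assumption containing only the morphemes of the tongue twister, and `segmentations` is a hypothetical helper, not a morphological analyzer.

```python
# Toy lexicon (an assumption): only the morphemes of the example.
LEXICON = {"sumomo", "momo", "mo", "no", "uchi"}

def segmentations(text, lexicon=LEXICON):
    """Return every split of text into lexicon entries (exhaustive search)."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results

intended = "sumomo/mo/momo/mo/momo/no/uchi".split("/")
text = "".join(intended)          # the string without delimiters
candidates = segmentations(text)
```

Even with this tiny lexicon the intended reading is only one of several candidates, which is why word segmentation and part-of-speech processing have to be solved together.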
ICEIS 2013 - 15th International Conference on Enterprise Information Systems