attributive relationship and the coordinative
relationship (equal relationship). Attributive
relationships are specified as allocative (rumah
makan, ‘restaurant’), instrumental (meja tulis,
‘writing desk’), possessive (hulubalang ‘guard’),
final/aim (bina marga ‘road development’), partitive
(luar negeri ‘overseas’), ablative (orang Jawa
‘Javanese people’), comparative (merah jambu
‘pink’), quantitative (penuh sesak ‘crowded’).
After the distinction between word groups and
compound words has been established, the next
challenge is to find criteria to differentiate idiomatic
relations among compound words. Idioms can take
various forms, from word to sentence, e.g. tikus
kantor ‘office mouse (=corruptor)’, suara emas
‘golden voice (=good singer)’, besar kepala ‘big head
(=vain)’, badan Amran setelah sakit tinggal tulang
berbalut kulit ‘After being sick, the body of Amran is
only bone and skin (=Amran is now getting slim)’.
Compound words can be distinguished from a
semantic perspective. For instance, the idiomatic
phrase kambing hitam is a full idiom, because its
meaning cannot be traced from the meanings of its
elements: the new meaning is unrelated to the
meanings of the elements (opaque). Kambing hitam
can mean (1) black goat (a group of words) or (2) a
person who is blamed (idiom). Therefore, kambing
hitam could be a phrase or a compound word with an
idiomatic meaning. Semi-idioms can be recognized
because one of their constituent elements keeps its
literal meaning. In daerah hitam ‘black area’, the
meaning of daerah still refers to a place, but the
meaning of hitam changes from a colour to an
environment where people commit crimes,
prostitution, etc. Based on these examples, it can be
concluded that idiomaticity can be regarded as a
feature of compound words in Bahasa Indonesia.
In relation to the need for tagging to identify word
groups and compound words, the following steps are
proposed:
(1) The construction consists of more than one word;
(2) Is there an attributive or coordinative
relationship? If yes, it is a group of words; if not,
it is possibly a compound word;
(3) An idiomatic construction shows a high degree
of closeness, so that it forms an integral unit whose
elements cannot be replaced with another element
(e.g. duta besar ‘ambassador’ cannot be replaced
by *duta kecil or *duta tua). One or both of the
elements have a metaphorical meaning (e.g. besar
here does not denote size; its meaning is idiomatic).
If the outcomes of points 2 and 3 both point to a
compound (no attributive or coordinative
relationship, and idiomatic closeness), the
construction is a compound word.
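The steps above can be read as a small decision procedure. The following is a minimal sketch in Python, assuming that the linguistic judgements (relation type, replaceability, metaphorical meaning) are supplied by a human annotator or a lexicon. The Candidate class, its field names, and the returned labels are hypothetical and are not part of the tagger described in this paper.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hand-assigned judgements for one construction (hypothetical schema)."""
    text: str
    has_attr_or_coord_relation: bool  # step 2: attributive/coordinative?
    elements_replaceable: bool        # step 3: substitution test
    has_metaphorical_element: bool    # step 3: metaphorical meaning?


def classify(c: Candidate) -> str:
    # Step 1: only constructions of more than one word are candidates.
    if len(c.text.split()) < 2:
        return "not a candidate"
    # Step 2: an attributive or coordinative relationship means a word group.
    if c.has_attr_or_coord_relation:
        return "word group"
    # Step 3: fixed elements plus a metaphorical meaning mean a compound word.
    if not c.elements_replaceable and c.has_metaphorical_element:
        return "compound word"
    # Cases the three steps do not settle are left open.
    return "undecided"


# duta besar: elements cannot be replaced, besar is not literal size.
print(classify(Candidate("duta besar", False, False, True)))
# kambing hitam in its literal reading 'black goat'.
print(classify(Candidate("kambing hitam", True, True, False)))
```

The "undecided" branch is a design choice of this sketch: the steps as stated only say that a compound is possible after step 2, so constructions that fail step 3 are not forced into either class.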
Once the development of the POS tagger was
completed, we conducted an experiment in which the
tagger was applied to our training corpus. The
experiment was carried out on a corpus of 2 million
words.
Along with the machine annotation experiment,
we also conducted a human annotation experiment on
the same corpus. The results of the POS and
compound annotation were then compared using
corpus methods. Corpus methods were chosen to
make the comparative analysis more practical,
robust, and fast.
In this research, we conducted an experiment with
a dataset consisting of a 2-million-word corpus. The
machine annotation reached an accuracy of
69.229%. The reference set is a compound word
annotation produced by experts in Indonesian
linguistics. Among the correctly annotated results,
some recognized compound words correspond to
multiword expressions (MWEs) with high
frequencies in Sketch Engine
(https://www.sketchengine.co.uk/). This suggests
that the more frequent a compound word is in a
corpus, the more readily the machine annotator
identifies it.
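As an illustration of how an accuracy figure of this kind could be computed, the following minimal sketch compares machine labels with reference labels token by token. The token-level metric is an assumption (the exact scoring procedure is not specified here), the label "O" for an unlabelled token is hypothetical, and the data are toy values, not results from the experiment.

```python
def accuracy(machine_tags, reference_tags):
    """Percentage of tokens whose machine label equals the reference label."""
    if len(machine_tags) != len(reference_tags):
        raise ValueError("both annotations must cover the same tokens")
    if not reference_tags:
        raise ValueError("cannot score an empty annotation")
    correct = sum(m == r for m, r in zip(machine_tags, reference_tags))
    return 100.0 * correct / len(reference_tags)


# Toy example only: four tokens, one disagreement.
reference = ["COMP", "COMP", "O", "O"]
machine = ["COMP", "O", "O", "O"]
print(accuracy(machine, reference))
```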
Aside from the machine annotation experiment, we
also assigned a group of students to annotate the
same corpus. As predicted, the results of human
annotation were much better than those of machine
annotation. However, in a few cases the human
annotators mislabelled a phrase as a compound
word, e.g. jalan raya ‘main road’ (jalan ‘road’;
raya ‘big, large, main’), whereas the machine left it
unlabelled.
In the experiment, there were 92 erroneously
annotated results, which can be classified into four
types of errors: incompleteness, miscategorization,
contextual error, and other.
Incompleteness refers to annotation errors due to
incomplete annotation, e.g. (bekerja, VB) ‘to
work’, (sama, COMP) ‘together’, > bekerja sama
(COMP) ‘to collaborate’. In this example, the
machine labelled only sama correctly; the other
component, bekerja, was not recognized as part of
the compound.
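An incompleteness error of this kind can be detected mechanically once the reference compounds are known. The following is a minimal sketch, assuming that each reference compound is given as a half-open range of token indices. The function name and the span format are hypothetical; only the VB and COMP labels come from the example above.

```python
COMP = "COMP"


def find_incomplete(tokens, machine_tags, reference_spans):
    """Return reference compounds that the machine labelled only partly.

    reference_spans holds (start, end) token index pairs, end exclusive,
    marking the compounds in the expert annotation.
    """
    incomplete = []
    for start, end in reference_spans:
        flags = [machine_tags[i] == COMP for i in range(start, end)]
        # Some but not all tokens of the compound carry the COMP label.
        if any(flags) and not all(flags):
            incomplete.append(" ".join(tokens[start:end]))
    return incomplete


# The bekerja sama example: only sama received the COMP label.
tokens = ["bekerja", "sama"]
machine_tags = ["VB", "COMP"]
print(find_incomplete(tokens, machine_tags, [(0, 2)]))
```

A compound with no COMP label at all is deliberately not reported here, since that case falls under miscategorization rather than incompleteness.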
Miscategorization is an error due to a failure in
recognizing a compound word. It comprises two
types: over-identification and under-identification.
Over-identification refers to a situation in which
the machine labels as a compound word a
construction that is not one, for example: ('jangka',
'COMP'), ('pendek', 'COMP'), > not labelled. In this
case, the machine should not have categorized
jangka pendek as a compound word, but rather as a
phrase, because its meaning is compositional.
Meanwhile, under-identification refers to the
situation opposite to