A Cross-lingual Part-of-Speech Tagging for Malay Language

Norshuhani Zamin and Zainab Abu Bakar

Faculty of Science and Information Technology, Universiti Teknologi PETRONAS, Perak, Malaysia

Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Selangor, Malaysia

Keywords: Annotation Projection, Part-of-Speech Tagging, Malay Corpus, Dice Coefficient, Bigram Similarity.

Abstract: Cross-lingual annotation projection methods can exploit rich-resourced languages to improve the performance of Natural Language Processing (NLP) tasks in less-resourced languages. In this research, Malay serves as the less-resourced language and English as the rich-resourced language. The research aims to ease the deadlock in Malay computational linguistic research, caused by the shortage of Malay tools and annotated corpora, by exploiting state-of-the-art English tools. This paper proposes a cross-lingual annotation projection based on word alignment of two languages with syntactic differences. A word alignment method known as MEWA (Malay-English Word Aligner), which integrates the Dice Coefficient and a bigram string similarity measure, is proposed. MEWA is used to induce annotations automatically from a Malay test collection on terrorism and a selected English tool. In the POS annotation projection experiment, the algorithm achieved an accuracy rate of 79%.

1 INTRODUCTION

POS tagging is the first stage in automated text min-

ing. The development of language technologies can

scarcely begin without this initial phase. Automated

POS tagging aims to label or tag each word in the

text with its correct POS tag to reflect its syntactic

and morphological categories. The term “POS tag”

is often used interchangeably with the term “catego-

ry”, “word class” or “lexical category” in many lin-

guistic publications. The POS tags are usually rep-

resented by some abbreviations to indicate word

categories. The research outcome is exemplified in

Figure 1.

Figure 1: Example Outcome of Proposed Research.

Text-processing tools such as lemmatisers, POS taggers, analyzers, stemmers and parsers are available for only a few rich-resourced languages, such as English, German and Japanese. Many advanced English POS taggers have reached almost 98% accuracy, leaving little room for further enhancement (Tsuruoka et al., 2005; Indurkhya and Damerau, 2010).

Malay is a language spoken by approximately 300,000,000 users in Malaysia, Brunei, Singapore and Indonesia (El-Imam and Don, 2005). It is an Austronesian language. A corpus is a collection of written or spoken texts in machine-readable form and is usually unstructured. A corpus plays a vital role in text-based NLP research. Malay has relatively few resources and a limited collection of corpora. This limitation has been a hurdle for linguists who wish to investigate the language computationally (Hassan, 1974).

Malay is an inflectional language that makes heavy use of affixation, reduplication and composition (Tan, 2003). These characteristics attract linguists to explore the underlying challenges and opportunities of the Malay language. There are three types of word derivation in Malay: noun derivation, verb derivation and adjective derivation. New words are created by attaching affixes to a root word. Some derived words keep the meaning of the root word and some differ completely in meaning. There are four types of affixes in Malay: prefixes, suffixes, circumfixes and infixes (Sharum et al., 2010). This inflectional property gives a unique identity to Malay morphology. Examples of noun, verb and adjective derivation for common affixes in Malay are listed in Table 1.

Input: Saya datang dari Malaysia.
(Translation: I come from Malaysia.)
Output: Saya/PRP datang/VBD dari/IN Malaysia/NNP
(NOTE: PRP = Pronoun, VBD = Verb, IN = Preposition, NNP = Proper Noun)

Zamin N. and Abu Bakar Z.
A Cross-lingual Part-of-Speech Tagging for Malay Language.
DOI: 10.5220/0005150602320240
In Proceedings of the International Conference on Agents and Artificial Intelligence (ICAART-2015), pages 232-240
ISBN: 978-989-758-074-1
© 2015 SCITEPRESS (Science and Technology Publications, Lda.)

Table 1: Derivation of New Malay Words Using Affixes.

Malay Affix                      Malay Word     English Word
ajar + an                        ajaran         teaching
bel + ajar                       belajar        to learn
meng + ajar                      mengajar       to teach
di + ajar (transitive)           diajar         being taught
di + ajar + kan (intransitive)   diajarkan      being taught
mem + pel + ajar + i             mempelajari    to study
di + pel + ajar + i              dipelajari     being studied
pel + ajar                       pelajar        student
peng + ajar                      pengajar       teacher
pel + ajar + an                  pelajaran      subject / education
peng + ajar + an                 pengajaran     lesson / moral of the story
pem + bel + ajar + an            pembelajaran   learning
ter + ajar                       terajar        accidentally taught
ter + pel + ajar                 terpelajar     well-educated
ber + pel + ajar + an            berpelajaran   educated

Unfortunately, annotated textual data in Malay are currently scarce. In Malaysia, few Malay corpora have been made available. They are mainly developed for academic use and are not publicly accessible. They cover various genres, from story books and novels to management studies and political issues. Examples of private data include the Malay Practical Grammar Corpus (Abdullah et al., 2004), the Dewan Bahasa dan Pustaka (DBP) Database Corpus (http://www.dbp.gov.my), the Malay Corpus by Unit Terjemahan Melalui Komputer from Universiti Sains Malaysia (Ranaivo, 2004; Chuah and Yusoff, 2002) and, more recently, the Malay LEXicon (MALEX) (Don, 2010). The freely available Malay Concordance Project Corpus (http://mcp.anu.edu.au/) is a collection of 3,000,000 words extracted from classical Malay texts, which are not related to this research domain. Limited access to such important sources of linguistic knowledge is a major hurdle in Malay NLP research.

Annotating a corpus manually is a laborious and expensive task. This task is called corpus annotation in linguistics. Some studies provide annotation standards and guidelines for natural language annotation in various domains, such as Galescu and Blaylock (2012) and Roberts et al. (2009). Annotations can be of different kinds, such as prosodic, semantic or historical annotation. The most common form of annotated corpus is the grammatically tagged one, such as the POS-tagged corpus. Annotation projection is a task within parallel corpora where information from the Source Language (SL) is projected, or mapped, onto the Target Language (TL). Parallel corpora, or bitexts, are extensively studied in the machine translation field, where the aligned phrases and words are used to create translation models. A survey of the literature revealed that reuse of resources helps to reduce costs and overheads in system development (Mayobre, 1991; Bollinger and Pfleeger, 1990; Barnes and Bollinger, 1991; Kim and Stohr, 1998). This presents a significant opportunity to leverage pre-existing resources from a rich-resourced language, such as English, and avoid building text-processing tools for a new language from scratch. This research therefore proposes to transfer linguistic information from one language to another.

Annotation projection using pre-aligned parallel corpora has been demonstrated successfully for coreference resolution in an English-Portuguese parallel corpus (de Souza and Orăsan, 2011), relation detection in an English-Korean parallel corpus (Kim et al., 2010), dependency analysis in an English-Swahili parallel corpus (De Pauw et al., 2009), semantic roles in an English-German parallel corpus (Padó and Lapata, 2005) and syntactic relations in an English-Romanian parallel corpus (Mititelu and Ion, 2005).

2 PART-OF-SPEECH (POS) TAGGING

POS tagging is considered an already solved problem in NLP (Søgaard, 2010). State-of-the-art POS tagging accuracy is about 96%-97%, but only for most Indo-European languages such as English, French, Dutch and German, and not for others (Indurkhya and Damerau, 2010). There are three approaches to POS tagging: unsupervised, semi-supervised and supervised. Christodoulopoulos et al. (2010) evaluated seven unsupervised POS taggers spanning nearly 20 years of work using the Wall Street Journal (WSJ) corpus. The seven unsupervised POS taggers were built using different techniques and algorithms.

The comprehensive review finds that many older algorithms perform better than the newer ones because the methods used in newer systems do not suit the POS induction task. Nevertheless, the success of POS tagging is mostly due to supervised and semi-supervised learning algorithms. For example, the fully supervised Stanford POS tagger (Toutanova et al., 2003), trained on the Wall Street Journal (WSJ) corpus, records 97.24% accuracy (Dickinson, 2007), and the semi-supervised POS tagger of Søgaard (2010), which uses a Support Vector Machine (SVM) trained on the POS-tagged WSJ corpus, achieved 97% accuracy.

An error-driven transformation-based tagger for English, known as Brill's tagger, automatically learns and induces tagging rules from a pre-tagged English corpus (Brill, 1995; Brill, 1992). It is the first widely used tagger to achieve an accuracy above 95%. The latest version, a mirror of Brill's tagger, is known as the CST tagger. It is currently available online.
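To make the idea of a transformation rule concrete, the following is a minimal sketch of how one learned rule might be applied to an initial tagging. It is not the CST or Brill implementation; the function name `apply_rule`, the example sentence and its deliberately wrong initial tags are assumptions made for illustration. The rule shown (retag NN as VB when the previous tag is TO) is of the contextual kind used in transformation-based tagging.

```python
from typing import List, Tuple

TaggedWord = Tuple[str, str]


def apply_rule(tagged: List[TaggedWord], from_tag: str, to_tag: str,
               prev_tag: str) -> List[TaggedWord]:
    """Retag from_tag as to_tag wherever the preceding word carries prev_tag.

    The context is read from the input tagging, so one pass of the rule
    does not feed on its own changes.
    """
    result = []
    for i, (word, tag) in enumerate(tagged):
        if tag == from_tag and i > 0 and tagged[i - 1][1] == prev_tag:
            tag = to_tag
        result.append((word, tag))
    return result


# Hypothetical initial tagging in which "increase" is wrongly tagged as a noun.
initial = [("We", "PRP"), ("want", "VBP"), ("to", "TO"),
           ("increase", "NN"), ("security", "NN")]

corrected = apply_rule(initial, from_tag="NN", to_tag="VB", prev_tag="TO")
print(corrected)
```

A full transformation-based learner would repeatedly pick the rule that removes the most errors against a pre-tagged corpus; only the application step is sketched here.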

Supervised learning algorithms have advanced to the level of inter-annotator agreement because of the massive volume of labelled training data that serves as input to the tagger. However, manual labelling of data is highly expensive and involves laborious preparatory work. An unsupervised POS tagger, on the other hand, appears to be a more natural solution. Such a tagger does not use any pre-tagged corpora for training but relies on a computational method to induce word groups automatically. One of the earliest studies of this kind was by Merialdo (1994), who used maximum likelihood estimation to train a trigram Hidden Markov Model (HMM) tagger. Banko and Moore (2004) proposed an improved framework using a discriminative model rather than an HMM. Goldwater and Griffiths (2007) presented a Bayesian tagger based on a standard trigram HMM, with accuracy close to that of most discriminative models.

In recent years, there have been a few attempts

on the development of unsupervised POS taggers for

less-studied languages, with accuracies reaching up

to 76.1% (Christodoulopoulos et al., 2010). Hence,

this study is conducted to help bridge the gap be-

tween the supervised and unsupervised algorithms.

3 EXISTING ENGLISH POS TAGGERS

The aim of this research is to overcome the shortage of NLP resources for the Malay language by utilizing existing computational linguistic resources in other languages. As NLP has made rapid progress over the last decades, it seems realistic to apply an existing POS tagger to a bilingual corpus (English and Malay) as a ‘bridge’. The resulting annotations are projected to the second, resource-poor language, i.e. Malay, using the proposed word alignment algorithm. The projection algorithm for resource induction requires an aligned parallel corpus or bilingual corpus.

CST tagger: http://cst.dk/online/pos_tagger/uk/

As the initial phase, it is important to test the performance of some existing POS taggers on the English terrorism corpus. The corpus was previously created by translation with Google Translate, the best statistical translator found thus far for the Malay language. However, as Google Translate often produces translations that contain grammatical errors, the assistance of a human expert is required to correct them.

The comparative study involved three of the most advanced open-source POS taggers (i.e. CST, CLAWS and Stanford) and their capacity to recognize some standard word classes in 25 selected news articles on Indonesian terrorism. These POS taggers were selected for the comparison because they are publicly available. Of the three, the best POS tagger is chosen based on how accurately it performed compared to human annotations.

The CST Tagger was built on the methodology proposed by Brill (1992, 1995). It is said to be the mirror image of Brill's original Transformation-based Learning tagger. Brill's tagger was trained on the Brown Corpus. The corpus was compiled in the 1960s at Brown University, Providence, Rhode Island as a general corpus (text collection) in the field of corpus linguistics (Kučera and Francis, 1967). The next candidate tool is the CLAWS tagger for English part-of-speech tagging (http://ucrel.lancs.ac.uk/claws/). This linguistic tool was developed by a group of linguists at the University of Lancaster, UK, known as the University Centre for Computer Corpus Research on Language (UCREL). CLAWS stands for the Constituent Likelihood Automatic Word-tagging System and is UCREL's major achievement. It is a stochastic tagger that uses a tagset of 134 tags, trained on the human-tagged British National Corpus (BNC) (http://www.natcorp.ox.ac.uk) of multiple genres, with an accuracy rate between 96% and 97% (Garside, 1987; Garside and Smith, 1997; Leech et al., 1994). The system uses frequency statistics drawn from the BNC.

The third tool in the comparative study is the Stanford POS Tagger (http://nlp.stanford.edu/software/tagger.shtml). It is a state-of-the-art stochastic tagger using the Maximum Entropy algorithm and trained on the Wall Street Journal (WSJ) corpus, a collection of news articles of multiple genres annotated with the Penn Treebank tagset (Toutanova et al., 2003). Tagged sentences are extracted and split into training, development and test data sets. The Stanford POS tagger recorded a best accuracy of 97.24%.

This comparative experiment processed 25 news articles consisting of 263 sentences and 5413 words. The results are presented in Table 2, followed by discussion and justification. The evaluation metrics used are Precision (P), Recall (R) and F1-Score (F1).

Table 2: A comparative study between the CST, CLAWS and Stanford POS taggers on the English Terrorism Corpus.

Tagger     Correct  Missed  Wrong  P     R     F1
CST        2353     275     317    0.88  0.80  0.84
CLAWS      1981     193     419    0.83  0.80  0.79
Stanford   2426     550     722    0.77  0.66  0.71

Although most recent research in automated POS tagging has explored stochastic algorithms, the results show that the rule-based CST Tagger outperformed the stochastic taggers with 84% accuracy (its F1-score in Table 2). Accuracy represents the percentage of correctly identified matches that the algorithm detects out of the possible matches. The success of the CST Tagger lies in its rule-based algorithm for tagging unknown words. The CST Tagger obtained high accuracy because linguistic information is captured indirectly, through Transformation-based Error-driven Learning, in a small number of simple non-stochastic rules. In this experiment, the CST Tagger performed reasonably well on terrorism news articles probably because it was trained on the Brown Corpus, which contains 500 samples of English-language texts, totalling roughly one million words, compiled from works published in the United States in 1961. Among these are sample texts on political science and government documents, which might describe cases, stories or issues related to terrorism.

4 MALAY-ENGLISH WORD ALIGNER (MEWA)

MEWA is a hybrid method, combining heuristic and probabilistic approaches, that aligns each English word to its corresponding Malay word and then projects the part-of-speech (POS) tag assigned to the English word by the off-the-shelf POS tagger onto the newly aligned Malay word.

Penn Treebank: http://www.cis.upenn.edu/~treebank

In detail, MEWA uses a probabilistic approach, i.e. an N-gram scoring method over two characters, commonly referred to as a bigram, and integrates it with a heuristic approach, the Dice Coefficient function (Dice, 1945), in order to calculate the probability distribution of letter sequences between Malay and English texts. This hybrid method has never been applied in any research involving a Malay corpus.

4.1 Dice Coefficient Function

The Dice Coefficient is a heuristic alignment method, which differs from statistical alignment. It uses specific associative measures rather than pure statistical measures. The function applies a basic heuristic rule: a word pair is chosen as the aligned pair if it has the highest co-occurrence score. It is a simple and intuitive estimation approach for word alignment (Moore, 2004). The Dice Coefficient function is shown in Equation 1.

\[ \mathrm{Dice}(e, m) = \frac{2\, C(e, m)}{C(e) + C(m)} \qquad (1) \]

The Dice Coefficient is calculated by counting the number of sentences where the words co-occur, $C(e, m)$, the number of occurrences of the English word, $C(e)$, and the number of occurrences of the Malay word, $C(m)$. The measure is always between 0 and 1. The use of the Dice Coefficient function in bitext alignment research is inspired by the research of Dien (2005).
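The co-occurrence form of the Dice Coefficient described above can be sketched as follows. This is only an illustration under stated assumptions: all three counts are taken at sentence level (a word is counted once per sentence in which it appears), which keeps the score between 0 and 1, and the tiny sentence-aligned bitext and the function name `dice_coefficient` are hypothetical.

```python
from typing import List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (English tokens, Malay tokens)


def dice_coefficient(bitext: List[SentencePair], english: str, malay: str) -> float:
    """Dice score 2*C(e,m) / (C(e) + C(m)) using sentence-level counts."""
    c_em = c_e = c_m = 0
    for english_tokens, malay_tokens in bitext:
        has_e = english in english_tokens
        has_m = malay in malay_tokens
        c_e += has_e            # sentences containing the English word
        c_m += has_m            # sentences containing the Malay word
        c_em += has_e and has_m  # sentence pairs where both words occur
    if c_e + c_m == 0:
        return 0.0
    return 2 * c_em / (c_e + c_m)


# Hypothetical toy bitext, invented for this example.
toy_bitext: List[SentencePair] = [
    (["police", "accused", "four", "men"], ["polis", "mendakwa", "empat", "lelaki"]),
    (["police", "arrested", "two", "men"], ["polis", "menahan", "dua", "lelaki"]),
    (["four", "suspects", "escaped"], ["empat", "suspek", "melarikan", "diri"]),
]

print(dice_coefficient(toy_bitext, "police", "polis"))
print(dice_coefficient(toy_bitext, "four", "polis"))
```

Under the heuristic rule, the Malay word paired with an English word would be the candidate with the highest score among those tried.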

4.2 Bigram String Similarity Measure

The bigram method is a simple probabilistic way to measure the string closeness of two different texts and often generates good results. It performs well on languages of different structures and is widely implemented in text-mining research, as shown in Jiang and Liu (2010), Tsuruoka et al. (2005), Tan et al. (2002) and the pioneering text projection research of Yarowsky et al. (2001). A bigram model is also referred to as a first-order Markov model, as it looks one token into the past; a trigram model is a second-order Markov model, and an N-gram model is an (N-1)th-order Markov model (Jurafsky and Martin, 2000). N-grams have been widely demonstrated in early research on word prediction (Jurafsky et al., 1994; Jurafsky et al., 1997). The success of an N-gram model is measured by its perplexity: the better model has the lower perplexity.
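Perplexity is not defined in the text. For reference, the usual textbook definition for a test sequence $W = w_1 w_2 \ldots w_N$, with the bigram approximation on the right, can be written as follows (the notation is ours, not taken from the experiments reported here):

```latex
\[
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      \approx \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})} \right)^{\frac{1}{N}}
\]
```

A model that assigns higher probability to the test sequence therefore has lower perplexity.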

ACross-lingualPart-of-SpeechTaggingforMalayLanguage

235

In most word prediction research, both the bigram and trigram approaches have shown great success. However, a bigram model is suggested if the training data is insufficient (Gibbon et al., 1997). A bigram-based model is faster and smaller at matching two languages with dissimilar patterns (Dunning, 1994) and performs better than the sequence-of-morphemes algorithm, as tested in Florian and Ngai (2001). Based on these findings, this research adopted the bigram probabilistic model without any prior evaluation.

Additionally, the character bigram model is proposed to address the limitations found in the bitext projection research of Dien and Kiem (2003). Their research proposed a word alignment algorithm using the Dice Coefficient function, a dictionary look-up and a Vietnamese morphological analyzer to automatically align Vietnamese and English bilingual corpora. The positions of the words in each pair of sentences are pre-aligned by an existing tool called GIZA++ (Och and Ney, 2000). The accuracy of Vietnamese-English word alignment was approximately 87% (Dien, 2002). The English corpus is tagged using the fnTBL toolkit (Florian and Ngai, 2001). The similarity of morphemes between the source word (English) and the target word (Vietnamese) retrieved from a dictionary look-up is calculated. A morpheme is the smallest meaningful unit of a language that cannot be further divided. For example, the word unbreakable consists of three morphemes: un (signifying not), break (the base word) and able (signifying can be done). A morphological analyzer is required to obtain the morphemes. The authors generalized the Vietnamese POS tagset against the tagset used by the fnTBL toolkit, the adopted English tagger.

To date, no relevant research on bitext alignment involving Malay has been found except the work of Al-Adhaileh et al. (2009) and Nasharuddin et al. (2013). Al-Adhaileh et al. (2009) proposed a text alignment algorithm to align Malay and English text using the Smooth Injective Map Recognizer (SIMR) and the Geometric Segment Alignment (GSA) algorithm. Both algorithms were successfully used to align French-English, Korean-English and Chinese-English text in previous research. The hybrid method was tested on a 100,000-word Malay-English bitext extracted from books of multiple genres.

5 HOW DOES MEWA WORK?

MEWA works by aligning one Malay word to the most probable translated English word at a time. The overall process flow of MEWA is depicted in Figure 2. Consider an example of POS annotation projection in Malay. We start with a source text, the Malay sentence Polis Indonesia mendakwa empat lelaki bersenjata. The text is translated to English using Google Translate and then tagged using the CST Tagger, which gives Indonesian/NNP police/NN accused/VBN four/CD armed/VBN men/NNS. MEWA uses both texts and information from a Malay-English lexicon to produce a tagged Malay sentence.

To illustrate, consider the process of aligning the word mendakwa. D is a Malay-English lexicon that consists of terrorism-related words in Malay and their possible translations in English. Lexemes for dakwa are stored, and their translations are the thesaurus entries of the word, as exemplified in Figure 3. The main use of lexemes in the lexicon is to avoid the laborious process of lemmatizing the heavily inflected words of the Malay language.

Figure 2: MEWA process flow.

Figure 3: Sample lexicon entries for the Malay word

dakwa.

Both input texts are tokenized into a Malay word vector, $W_M$, and a tagged English word vector, $W_E$, as follows:

$W_M$ = {Polis, Indonesia, mendakwa, empat, lelaki, bersenjata}

$W_E$ = {Indonesian/NNP, police/NN, accused/VBN, four/CD, armed/VBN, men/NNS} = {$E_1$, $E_2$, ..., $E_6$}

For each word in $W_M$ we need to find the matching word in $W_E$. Taking a word from $W_M$, we use $D$ to find the possible English translations for the word, {$d_1$, ..., $d_n$}. The morpheme similarity of each word $d_i$ with each of the English words in $W_E$ is then calculated using the Sørensen-Dice function (Sørensen, 1948), as formulated in Equation 2.

\[ \mathrm{Sim}(W_M, W_E) = \max \{ \mathrm{Sim}(d_i, E_j) \mid d_i \in D(W_M),\ E_j \in W_E \} \qquad (2) \]

where $W_M$ is the Malay word vector; $W_E$ is the tagged, translated English word vector; $d_i$ is a translation of the Malay word; and $E_j$ is a word of the translated English text.
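The pairwise score Sim(d, E) in Equation 2 can be sketched as a Dice score over character bigrams: twice the number of shared bigrams divided by the total number of bigrams in both words. The function names below are assumptions, and so is the decision to count repeated bigrams as a multiset and to keep letter case unchanged; neither detail is specified in the text.

```python
from collections import Counter
from typing import Iterable, List


def char_bigrams(word: str) -> List[str]:
    """Return the overlapping two-character substrings of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def bigram_similarity(a: str, b: str) -> float:
    """Sorensen-Dice score over character bigrams of two words."""
    grams_a, grams_b = char_bigrams(a), char_bigrams(b)
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0
    # Multiset intersection: a repeated bigram is shared as often as it occurs in both.
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 2 * shared / total


def best_translation_score(translations: Iterable[str], english_word: str) -> float:
    """Maximum similarity between any lexicon translation and one English word."""
    return max((bigram_similarity(d, english_word) for d in translations), default=0.0)


print(char_bigrams("accuse"))
print(round(best_translation_score(["claim", "accuse"], "accused"), 2))
print(best_translation_score(["claim", "accuse"], "Indonesian"))
```

With the lexicon translations claim and accuse, the score against accused is 2×5/(5+6), and the score against Indonesian is 0 because no bigram is shared.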

The lexicon search for the Malay word mendakwa finds the equivalent translations available in the manually prepared Malay-English lexicon. The word mendakwa is derived from the base word dakwa. From the lexicon, we get $D$(mendakwa) = {claim, accuse} = {$d_1$, $d_2$}, where $d_1$ = claim and $d_2$ = accuse. Next, each of the words $d_1$ and $d_2$ is compared with each of the words $E_1$, $E_2$, ..., $E_6$. For example, the first iteration compares $E_1$ = Indonesian against $d_1$ = claim and $d_2$ = accuse, which gives the following calculation:

Sim(mendakwa, Indonesian) = max{Sim(claim, Indonesian), Sim(accuse, Indonesian)}
= max{Sim({cl,la,ai,im}, {In,nd,do,on,ne,es,si,ia,an}), Sim({ac,cc,cu,us,se}, {In,nd,do,on,ne,es,si,ia,an})}
= max{2×0/(4+9), 2×0/(5+9)}
= max{0, 0} = 0.

This calculation continues for the word $E_2$ (police) in the second iteration, and the iterations continue until $E_6$ has been evaluated. The third iteration, which compares $E_3$ = accused against $d_1$ = claim and $d_2$ = accuse, returns the highest similarity score, as shown below:

Sim(mendakwa, accused) = max{Sim(claim, accused), Sim(accuse, accused)}
= max{Sim({cl,la,ai,im}, {ac,cc,cu,us,se,ed}), Sim({ac,cc,cu,us,se}, {ac,cc,cu,us,se,ed})}
= max{2×0/(4+6), 2×5/(5+6)}
= max{0, 0.91} = 0.91.

In this example, the word accused is considered the most probable translation of the Malay word mendakwa, as it returns the highest similarity score, 0.91. After all words in $W_M$ and $W_E$ have been compared, the bitext is mapped. Finally, the annotation, i.e. the POS tag of each English word, is projected onto its corresponding Malay word to create the annotated Malay terrorism corpus, Polis/NN Indonesia/NNP mendakwa/VBN empat/CD lelaki/NNS bersenjata/VBN, as shown in Figure 4.

Figure 4: Malay-English bitext mapping.
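The alignment and projection steps for the example sentence can be sketched end to end as below. This is a simplified illustration, not the MEWA implementation: only the entry for mendakwa (claim, accuse) comes from the text, while the other lexicon entries, the function names and the choice to leave a word untagged when it has no lexicon entry or no bigram overlap are assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def bigram_similarity(a: str, b: str) -> float:
    """Dice score over character bigrams: 2 * shared / (count_a + count_b)."""
    grams_a = [a[i:i + 2] for i in range(len(a) - 1)]
    grams_b = [b[i:i + 2] for i in range(len(b) - 1)]
    total = len(grams_a) + len(grams_b)
    if total == 0:
        return 0.0
    shared = sum((Counter(grams_a) & Counter(grams_b)).values())
    return 2 * shared / total


def parse_tagged(text: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


def project_tags(malay_words: List[str],
                 tagged_english: List[Tuple[str, str]],
                 lexicon: Dict[str, List[str]]) -> List[Tuple[str, Optional[str]]]:
    """Give each Malay word the tag of its most similar English word."""
    projected = []
    for malay in malay_words:
        best_tag, best_score = None, 0.0
        for translation in lexicon.get(malay.lower(), []):
            for english, tag in tagged_english:
                score = bigram_similarity(translation, english)
                if score > best_score:
                    best_tag, best_score = tag, score
        projected.append((malay, best_tag))  # None marks a missed word
    return projected


# Hypothetical lexicon; only the "mendakwa" entry is taken from the text.
LEXICON = {
    "polis": ["police"],
    "indonesia": ["Indonesia", "Indonesian"],
    "mendakwa": ["claim", "accuse"],
    "empat": ["four"],
    "lelaki": ["man", "men"],
    "bersenjata": ["armed"],
}

malay_sentence = "Polis Indonesia mendakwa empat lelaki bersenjata".split()
english_tagged = parse_tagged(
    "Indonesian/NNP police/NN accused/VBN four/CD armed/VBN men/NNS")

result = project_tags(malay_sentence, english_tagged, LEXICON)
tagged_malay = " ".join(f"{word}/{tag}" for word, tag in result if tag is not None)
print(tagged_malay)
```

Because each Malay word independently takes its best-scoring English word, this sketch does not prevent two Malay words from mapping to the same English word; handling such many-to-one cases would need additional rules.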

6 RESULTS AND DISCUSSION

MEWA was evaluated using an English-Malay parallel corpus of 25 news articles on Indonesian terrorism, with 263 sentence pairs and 5413 word tokens. The Malay texts were manually tagged by a human annotator using a standard tagset, which was made equivalent to Brill's tagset. The test tagset was purposely reduced to six tags, which refer to the broad categories of Brill's tagset, in order to simplify the human verification task. In this work, all variants of verbs, nouns, pronouns, cardinal numbers and adjectives produced by the CST Tagger are generalized as VB, CN, PN, CD and ADJ respectively to ease the evaluation process. These entities are significant for later development. Regular words and delimiters were removed from the Malay corpus, leaving only 3466 prominent word tokens. The experiment is divided into two sub-experiments:

• Experiment I: To test a small corpus of 50 randomly picked sentences from the collection of news articles.

• Experiment II: To test a larger corpus consisting of 263 sentences.
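The generalization of fine-grained tags into broad categories described above can be sketched as a prefix mapping. The prefixes are an assumption based on Penn Treebank-style tags such as those in the examples (VBN, NNS, CD, JJ); in particular, mapping every NN-prefixed tag, including NNP, to CN is a guess, since the treatment of proper nouns is not stated.

```python
from typing import Optional, Tuple

# (tag prefix, broad category); the prefixes are assumed, not taken from the paper.
BROAD_CATEGORIES: Tuple[Tuple[str, str], ...] = (
    ("VB", "VB"),    # all verb variants, e.g. VBD, VBN, VBZ
    ("NN", "CN"),    # noun variants, e.g. NN, NNS
    ("PRP", "PN"),   # pronouns, e.g. PRP, PRP$
    ("CD", "CD"),    # cardinal numbers
    ("JJ", "ADJ"),   # adjective variants, e.g. JJ, JJR, JJS
)


def generalize(tag: str) -> Optional[str]:
    """Map a fine-grained tag to its broad category, or None if it is not evaluated."""
    for prefix, broad in BROAD_CATEGORIES:
        if tag.startswith(prefix):
            return broad
    return None


print([generalize(t) for t in ["VBN", "NNS", "PRP", "CD", "JJ", "IN"]])
```

Tags that fall outside the broad categories return None here, which loosely corresponds to removing regular words and delimiters before evaluation.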

The result of Experiment I is shown in Table 3. In Experiment II, a total of 263 sentences were extracted from a set of 25 Malay news articles on Indonesian terrorism. The system performance against human annotation is shown in Table 4.

In both experiments, the numbers of correct, wrong and missed tags between the output produced by the system and the annotations of our human experts were manually counted for all words. The results of Experiment II show that the system achieved 86.87% Precision, 72.56% Recall and 79.07% F1. These results indicate that the tagger attains better annotation accuracy than in Experiment I (refer to Table 3). The reason could be that the corpus is about five times larger in sentences (263 versus 50). The experiments suggest that performance improves as the size of the data grows. Observations have found that the size of the dictionary look-up also increases with the size of the tested corpus. Consequently, the probability of an English word being correctly aligned with a Malay word increases. A total of 1444 pairs of Malay-English words have been collected in the dictionary look-up from these experiments. The results indicate that this implementation works well when the sentence pair is good, i.e. when there is no data sparsity problem.

Table 3: Experiment I – Test result for 50 sentences.

# Sentences  # Words  # Correct  # Wrong  # Missed  P       R       F1
50           646      432        140      70        76%     67%     71.2%

Table 4: Experiment II – Test result for 263 sentences.

# Sentences  # Words  # Correct  # Wrong  # Missed  P       R       F1
263          5413     2515       380      571       86.87%  72.56%  79.07%
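The reported scores can be reproduced from the correct, wrong and missed counts. The formulas below are inferred from the figures in Tables 3 and 4 rather than stated in the text: precision is taken as correct / (correct + wrong) and recall as correct / (correct + wrong + missed). The function name `evaluate` is hypothetical.

```python
from typing import Tuple


def evaluate(correct: int, wrong: int, missed: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from manually counted tagging outcomes."""
    attempted = correct + wrong          # words the system actually tagged
    expected = correct + wrong + missed  # words the human annotators tagged
    precision = correct / attempted if attempted else 0.0
    recall = correct / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


exp1 = evaluate(correct=432, wrong=140, missed=70)     # counts from Table 3
exp2 = evaluate(correct=2515, wrong=380, missed=571)   # counts from Table 4

for name, (p, r, f1) in (("Experiment I", exp1), ("Experiment II", exp2)):
    print(f"{name}: P={p:.2%} R={r:.2%} F1={f1:.2%}")
```

For Experiment II the denominator of recall, 2515 + 380 + 571 = 3466, matches the number of prominent word tokens retained in the Malay corpus.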

The performance also benefited from the CST POS Tagger's expressive set of rules, which is able to resolve ambiguity problems in English. Ambiguity is among the challenging problems in word tagging (Dickinson, 2007). Tagging ambiguity arises when a word can take more than one POS tag. This occurs more frequently in English than in Malay, as in the example We can can the can. The three occurrences of the word can correspond to the auxiliary, verb and noun categories respectively. In Malay, only a limited number of words carry different meanings in different contexts.

Consider this example: Muzium Perang itu berwarna perang. Here the former perang means war while the latter means brown. The whole statement means The War Museum is brown in colour. The latter perang is an adjective that describes the proper noun Muzium Perang, which contains the noun perang. In Experiment II, there were 571 words in the “Missed” category. “Missed” means skipped or not processed by the system. Observation showed that a word was missed either because it was not in the dictionary / lexicon or because it was associated with a many-to-one entity.

Clearly, system performance can be increased further by adding the missing words to the dictionary / lexicon. On the other hand, ambiguous tagging results returned by the CST Tagger, which is based on Brill's tagger, reduced system performance. Figure 5 shows an erroneous tagging produced by the CST POS Tagger. Words given the wrong tag are underlined.

Dulmatin/NNP ,/, 39/NNP ,/, is/VBZ among/IN three/CD suspected/VBN militants/NNS who/WP were/VBD shot/NN dead/JJ

Figure 5: CST Tagger’s tagging errors.

The word suspected in this context should be classified as an adjective and given the tag JJ, while the words shot and dead should be tagged as verb (VBD) and adjective (JJ) respectively. However, ambiguous tagging results were found in only 112 words, which is about 29% of the total wrongly tagged words. Meanwhile, it was also observed in the dataset that the system's output and the human annotation conflict in some word classes. For example, the human annotators agreed that an adjective preceding a noun is also classified as a common noun (CN), as shown in Table 5.

Table 5: Contradiction between MEWA's tagging and human tagging.

Case  Examples
1     CST-tagged:    anti-terror/JJ laws/NNS
      System-tagged: undang-undang/NNS antikeganasan/JJ
      Human-tagged:  undang-undang/CN antikeganasan/CN
2     CST-tagged:    mountainous/JJ areas/NNS
      System-tagged: kawasan/NNS pergunungan/JJ
      Human-tagged:  kawasan/CN pergunungan/CN

7 CONCLUSIONS

A new method to automate POS tagging for Malay using parallel data from a rich-resourced language has been devised. To the best of our knowledge, no previous work has studied the alignment of a Malay corpus with an English corpus by projecting tags using hybrid algorithms. The bitext alignment method appears to be a powerful unsupervised learning algorithm for mapping two dissimilar languages at minimum computational cost. MEWA has successfully mapped and projected POS annotations from an English corpus to a Malay corpus using existing English resources, with reasonable accuracy. MEWA is a first attempt at cross-lingual language research in Malay.

Finally, MEWA heavily reduces the labour required to annotate a Malay corpus and generates quick results.

ICAART2015-InternationalConferenceonAgentsandArtificialIntelligence

238

8 FUTURE WORKS

The limitations of this research are outlined in this section as topics to address in our future research to improve the performance of MEWA. One visible limitation observed during the experiments is the disagreement with the human annotations. As shown in Table 5, the human annotators agreed that an adjective preceding a noun is also classified as a common noun (CN). However, the existing English tagger keeps it as an adjective because of the semantics of the word, which uniquely describes the noun. This contradictory categorization has decreased the system’s performance.

Difficulties are identified in one-to-many, many-to-one and many-to-many word associations. These have been common problems in Machine Translation (MT) research for many years. A many-to-one association is the most common kind of association, where several objects are associated with a single object. However, these problems can be resolved using rules. For a complete application, it is recommended to focus on an automated lexicon builder. The compilation of such a lexicon (interchangeably referred to as a dictionary look-up or gazetteer) is often a stumbling block in Natural Language Processing (NLP) research. Machine learning methods, especially rule-based algorithms, are often used to perform this process. Advanced algorithms that use words extracted from Wikipedia have been widely explored recently.
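As a rough illustration of how rules might handle the one-to-many and many-to-one word associations discussed above, the sketch below copies a tag to every Malay word linked to one English word, and picks a single tag by priority when several English words link to one Malay word. The priority table, the function name and the example sentence pair are all hypothetical and are not part of MEWA.

```python
from collections import defaultdict

# Hypothetical priority (lower wins) used when several English tags
# compete for one Malay word; content-word tags are preferred.
TAG_PRIORITY = {"NN": 0, "NNS": 0, "VB": 0, "JJ": 1, "IN": 2, "DT": 3}


def project_with_rules(malay_words, english_tagged, links):
    """Project tags along (english_index, malay_index) links.

    One-to-many: the English tag is copied to every linked Malay word.
    Many-to-one: the tag with the best priority is kept.
    Unaligned Malay words are left untagged (None).
    """
    candidates = defaultdict(list)
    for e_idx, m_idx in links:
        candidates[m_idx].append(english_tagged[e_idx][1])

    projected = []
    for m_idx, word in enumerate(malay_words):
        tags = candidates.get(m_idx)
        if not tags:
            projected.append((word, None))
        else:
            best = min(tags, key=lambda t: TAG_PRIORITY.get(t, len(TAG_PRIORITY)))
            projected.append((word, best))
    return projected


# Illustrative pair: "look for" -> "mencari" (many-to-one) and
# "newspaper" -> "surat khabar" (one-to-many).
english = [("look", "VB"), ("for", "IN"), ("newspaper", "NN")]
malay = ["mencari", "surat", "khabar"]
links = [(0, 0), (1, 0), (2, 1), (2, 2)]
print(project_with_rules(malay, english, links))
```

Many-to-many associations could be treated as a combination of the two rules, although whether such simple rules are sufficient would need to be checked against annotated data.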
