A Cross-lingual Part-of-Speech Tagging for Malay Language
Norshuhani Zamin, Zainab Abu Bakar
2015
Abstract
Cross-lingual annotation projection methods exploit resource-rich languages to improve the performance of Natural Language Processing (NLP) tasks in less-resourced languages. In this research, Malay serves as the less-resourced language and English as the resource-rich language. The research aims to ease the deadlock in Malay computational linguistics research, caused by the shortage of Malay tools and annotated corpora, by exploiting state-of-the-art English tools. This paper proposes a cross-lingual annotation projection method based on word alignment between two syntactically different languages. It introduces a word alignment method, MEWA (Malay-English Word Aligner), which integrates the Dice coefficient with a bigram string similarity measure. MEWA is used to induce annotations automatically from a Malay test collection on terrorism and a selected English tool. In the POS annotation projection experiment, the algorithm achieved an accuracy of 79%.
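The abstract names the two ingredients of MEWA but not how they are combined, so the following is only a minimal Python sketch of the general idea, not the authors' implementation. It computes a Dice coefficient from word co-occurrence in sentence-aligned pairs, a character-bigram string similarity, and then projects English POS tags onto Malay tokens. The function names, the `weight` parameter, the weighted-sum combination, and the toy sentences and tags are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def char_bigrams(word):
    """Return the multiset of character bigrams in a lowercased word."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def bigram_similarity(a, b):
    """Dice-style overlap of character bigrams, in [0, 1]."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2.0 * shared / total


def dice_cooccurrence(pairs):
    """Dice coefficient for each (malay, english) word pair,
    counted once per sentence pair in which the words appear."""
    src_freq, tgt_freq, joint = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        joint.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_freq[s] + tgt_freq[t])
        for (s, t), c in joint.items()
    }


def project_tags(malay_tokens, tagged_english, dice, weight=0.5):
    """Give each Malay token the tag of its best-scoring English token.
    The weighted sum below is a hypothetical combination, not the paper's."""
    projected = []
    for m in malay_tokens:
        best = max(
            tagged_english,
            key=lambda et: weight * dice.get((m, et[0]), 0.0)
            + (1 - weight) * bigram_similarity(m, et[0]),
        )
        projected.append((m, best[1]))
    return projected


# Toy sentence-aligned data, invented for this example.
pairs = [
    (["polis", "tahan", "suspek"], ["police", "detain", "suspect"]),
    (["polis", "tiba"], ["police", "arrive"]),
]
dice = dice_cooccurrence(pairs)
print(project_tags(["polis", "tiba"], [("police", "NN"), ("arrive", "VB")], dice))
```

In this sketch the string similarity helps with cognates and loanwords such as "polis"/"police", while the co-occurrence score carries pairs with no shared spelling such as "tiba"/"arrive".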
Paper Citation
in Harvard Style
Zamin N. and Abu Bakar Z. (2015). A Cross-lingual Part-of-Speech Tagging for Malay Language. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 232-240. DOI: 10.5220/0005150602320240
in Bibtex Style
@conference{icaart15,
author={Norshuhani Zamin and Zainab Abu Bakar},
title={A Cross-lingual Part-of-Speech Tagging for Malay Language},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={232-240},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005150602320240},
isbn={978-989-758-074-1},
}
in EndNote Style
TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - A Cross-lingual Part-of-Speech Tagging for Malay Language
SN - 978-989-758-074-1
AU - Zamin N.
AU - Abu Bakar Z.
PY - 2015
SP - 232
EP - 240
DO - 10.5220/0005150602320240