A Cross-lingual Part-of-Speech Tagging for Malay Language

Norshuhani Zamin, Zainab Abu Bakar

2015

Abstract

Cross-lingual annotation projection methods draw on resource-rich languages to improve the performance of Natural Language Processing (NLP) tasks in less-resourced languages. In this research, Malay serves as the less-resourced language and English as the resource-rich language. The research aims to ease the deadlock in Malay computational linguistics research, caused by the shortage of Malay tools and annotated corpora, by exploiting state-of-the-art English tools. This paper proposes a cross-lingual annotation projection method based on word alignment between two languages with syntactic differences. It introduces a word alignment method, MEWA (Malay-English Word Aligner), that integrates the Dice coefficient with a bigram string similarity measure. MEWA is used to induce annotations automatically, using a Malay test collection on terrorism and a selected English tool. In the POS annotation projection experiment, the algorithm achieved an accuracy of 79%.
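
The abstract does not say how MEWA combines its two measures, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes the Dice coefficient is applied to sets of character bigrams to score string similarity between an English and a Malay word, and that POS tags are then copied across a given word alignment. The function names, the alignment format, and the example sentence are all hypothetical.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs in a lowercased word."""
    w = word.lower()
    return {w[i:i + 2] for i in range(len(w) - 1)}


def dice(a, b):
    """Dice coefficient 2|A & B| / (|A| + |B|) for two sets."""
    if not a and not b:
        return 0.0  # no bigrams on either side: treat as no evidence
    return 2 * len(a & b) / (len(a) + len(b))


def bigram_similarity(w1, w2):
    """String similarity: Dice coefficient over character bigrams."""
    return dice(char_bigrams(w1), char_bigrams(w2))


def project_tags(english_tagged, alignment):
    """Copy each English POS tag to the Malay word aligned with it.

    english_tagged: list of (english_word, tag) pairs
    alignment: list of (english_index, malay_word) pairs
               (hypothetical format, assumed for this sketch)
    """
    return [(malay, english_tagged[i][1]) for i, malay in alignment]


# Illustrative data only, not taken from the paper's test collection.
english = [("the", "DT"), ("police", "NN"), ("arrested", "VBD"), ("him", "PRP")]
alignment = [(1, "polis"), (2, "menangkap"), (3, "dia")]

similarity = bigram_similarity("terrorist", "teroris")
projected = project_tags(english, alignment)
```

Bigram similarity of this kind rewards cognates and loanwords that share spelling, while unrelated word pairs score near zero, which is why an aligner would need a second signal such as corpus co-occurrence for those pairs.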



Paper Citation


in Harvard Style

Zamin N. and Abu Bakar Z. (2015). A Cross-lingual Part-of-Speech Tagging for Malay Language. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-074-1, pages 232-240. DOI: 10.5220/0005150602320240


in Bibtex Style

@conference{icaart15,
author={Norshuhani Zamin and Zainab Abu Bakar},
title={A Cross-lingual Part-of-Speech Tagging for Malay Language},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2015},
pages={232-240},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005150602320240},
isbn={978-989-758-074-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - A Cross-lingual Part-of-Speech Tagging for Malay Language
SN - 978-989-758-074-1
AU - Zamin N.
AU - Abu Bakar Z.
PY - 2015
SP - 232
EP - 240
DO - 10.5220/0005150602320240