Building TALAA, a Free General and Categorized Arabic Corpus

Essma Selab, Ahmed Guessoum

Abstract

Arabic natural language processing (ANLP) has attracted increasing interest over the last decade. However, the development of ANLP tools depends on the availability of large corpora. Unfortunately, the scientific community lacks large and varied Arabic corpora, especially freely accessible ones. As the Internet continues its exponential growth, Arabic Internet content has followed the same trend, yielding large amounts of textual data available through various Arabic websites. This paper describes the TALAA corpus, a large general Arabic corpus built from daily Arabic newspaper websites. The corpus is a collection of more than 14 million words with 15,891,729 tokens contained in 57,827 different articles. Part of the TALAA corpus has been tagged to build an annotated Arabic corpus of about 7,000 tokens, using a POS tagger with a set of 58 detailed tags. The annotated corpus was manually checked by two human experts. The methodology used to construct TALAA is presented, and various metrics are applied to the corpus to show its usefulness. The corpus can be made available to the scientific community upon authorisation.


Paper Citation


in Harvard Style

Selab E. and Guessoum A. (2015). Building TALAA, a Free General and Categorized Arabic Corpus. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ICAART 2015) ISBN 978-989-758-073-4, pages 284-291. DOI: 10.5220/0005352102840291


in Bibtex Style

@conference{puanlp15,
author={Essma Selab and Ahmed Guessoum},
title={Building TALAA, a Free General and Categorized Arabic Corpus},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ICAART 2015)},
year={2015},
pages={284-291},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005352102840291},
isbn={978-989-758-073-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: PUaNLP, (ICAART 2015)
TI - Building TALAA, a Free General and Categorized Arabic Corpus
SN - 978-989-758-073-4
AU - Selab E.
AU - Guessoum A.
PY - 2015
SP - 284
EP - 291
DO - 10.5220/0005352102840291