COMPARATIVE STUDY OF ARABIC AND FRENCH STATISTICAL LANGUAGE MODELS

Karima Meftouh, Kamel Smaili, Mohamed Tayeb Laskri

Abstract

In this paper, we present a comparative study of statistical language models for Arabic and French. The objective is to understand how best to model each of the two languages. Several experiments using different smoothing techniques were carried out. For French, trigram models are the most appropriate whatever the smoothing technique used. For Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more efficient. The tests were carried out with corpora and vocabularies of comparable size.
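
To make the smoothing step concrete, the following is a minimal sketch of an interpolated Witten-Bell bigram model with a perplexity measure. It is not the authors' experimental setup: the class name `WittenBellBigram`, the sentence markers `<s>` and `</s>`, the default vocabulary size (seen types plus one slot for unknown words), and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class WittenBellBigram:
    """Toy interpolated Witten-Bell bigram model (illustrative only)."""

    def __init__(self, sentences, vocab_size=None):
        self.unigrams = Counter()
        self.bigrams = defaultdict(Counter)
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            # <s> is only ever a history, so it is not counted as a predicted word.
            self.unigrams.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[prev][word] += 1
        self.total = sum(self.unigrams.values())  # number of predicted tokens
        self.types = len(self.unigrams)           # number of distinct predicted words
        # Assumption: one extra vocabulary slot stands for any unseen word.
        self.vocab_size = vocab_size or self.types + 1

    def p_unigram(self, word):
        # Witten-Bell: reserve mass proportional to the number of seen types
        # and spread it uniformly over the vocabulary.
        uniform = 1.0 / self.vocab_size
        return (self.unigrams[word] + self.types * uniform) / (self.total + self.types)

    def p_bigram(self, word, prev):
        followers = self.bigrams.get(prev)
        if not followers:
            # Unseen history: back off entirely to the unigram estimate.
            return self.p_unigram(word)
        count_prev = sum(followers.values())
        distinct = len(followers)  # distinct word types seen after `prev`
        return (followers[word] + distinct * self.p_unigram(word)) / (count_prev + distinct)

    def perplexity(self, sentences):
        log_sum, n = 0.0, 0
        for sent in sentences:
            tokens = ["<s>"] + list(sent) + ["</s>"]
            for prev, word in zip(tokens, tokens[1:]):
                log_sum += math.log(self.p_bigram(word, prev))
                n += 1
        return math.exp(-log_sum / n)


# Hypothetical toy data, standing in for a real training corpus.
model = WittenBellBigram([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(model.p_bigram("cat", "the"), model.perplexity([["the", "dog", "sat"]]))
```

The same recursion extends to trigrams and higher orders by interpolating each order with the next lower one, which is how the model order can be varied when comparing languages.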



Paper Citation


in Harvard Style

Meftouh K., Smaili K. and Laskri M. T. (2009). COMPARATIVE STUDY OF ARABIC AND FRENCH STATISTICAL LANGUAGE MODELS. In Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8111-66-1, pages 156-160. DOI: 10.5220/0001537501560160


in Bibtex Style

@conference{icaart09,
author={Karima Meftouh and Kamel Smaili and Mohamed Tayeb Laskri},
title={COMPARATIVE STUDY OF ARABIC AND FRENCH STATISTICAL LANGUAGE MODELS},
booktitle={Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2009},
pages={156-160},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0001537501560160},
isbn={978-989-8111-66-1},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - COMPARATIVE STUDY OF ARABIC AND FRENCH STATISTICAL LANGUAGE MODELS
SN - 978-989-8111-66-1
AU - Meftouh K.
AU - Smaili K.
AU - Laskri M. T.
PY - 2009
SP - 156
EP - 160
DO - 10.5220/0001537501560160