A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM

Ebru Celikel

2005

Abstract

The problem of language discrimination arises when many texts from different source languages are at hand but the language of each text is unknown. This is often the case during information retrieval over the Internet. We propose a cryptographic solution to the language identification problem: employing the Prediction by Partial Matching (PPM) model, we generate a language model and then use this model to discriminate between languages. PPM is a cryptographic tool based on an adaptive statistical model. It yields compression rates, measured in bits per character (bpc), that are far better than those of many conventional lossless compression tools. We report language identification results obtained on sample texts from corpora in five languages: English, French, Turkish, German and Spanish. The success rates show that the performance of the system depends strongly on the diversity, as well as on the sizes of the target text and training text files. The results also indicate that the PPM model is highly sensitive to the input language. From a cryptographic standpoint, if the training text itself is kept secret, our language identification system would provide a promising degree of security.
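The idea in the abstract, training one model per language and assigning a target text to the language whose model compresses it best, can be illustrated with a short sketch. This is not the paper's implementation: it assumes a simplified, static order-2 character model with a PPM-style escape to shorter contexts (no exclusion, no arithmetic coder, no adaptive updating during scoring), and it scores texts by estimated bits per character. The names `CharModel`, `bits_per_char`, `identify`, the constant `ALPHABET_SIZE`, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict

ALPHABET_SIZE = 65536  # assumed size of the fallback uniform alphabet


class CharModel:
    """Simplified order-k character model with PPM-style back-off (illustrative only)."""

    def __init__(self, order=2):
        self.order = order
        # counts[k][context] holds next-character counts for contexts of length k
        self.counts = [defaultdict(Counter) for _ in range(order + 1)]

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i >= k:
                    self.counts[k][text[i - k:i]][ch] += 1

    def prob(self, context, ch):
        # Try the longest context first; on a miss, pay an escape
        # probability and fall back to a shorter context.
        p_escape = 1.0
        for k in range(min(self.order, len(context)), -1, -1):
            seen = self.counts[k].get(context[len(context) - k:])
            if not seen:
                continue  # context never observed: back off at no cost
            total, distinct = sum(seen.values()), len(seen)
            if ch in seen:
                return p_escape * seen[ch] / (total + distinct)
            p_escape *= distinct / (total + distinct)
        # Character unseen in every context: uniform fallback
        return p_escape / ALPHABET_SIZE


def bits_per_char(model, text):
    """Estimated code length of `text` under `model`, in bits per character."""
    if not text:
        raise ValueError("text must be non-empty")
    bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - model.order):i]
        bits -= math.log2(model.prob(context, ch))
    return bits / len(text)


def identify(models, text):
    """Return the language whose model yields the lowest bpc for `text`."""
    return min(models, key=lambda lang: bits_per_char(models[lang], text))


# Toy training data (assumed; real experiments would use large corpora)
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat ",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato se sentó en la alfombra ",
}

models = {}
for lang, sample in SAMPLES.items():
    models[lang] = CharModel(order=2)
    models[lang].train(sample * 5)

for target in ("the dog sat on the mat", "el gato salta sobre el perro"):
    scores = {lang: round(bits_per_char(m, target), 2) for lang, m in models.items()}
    print(target, "->", identify(models, target), scores)
```

With training texts this small the scores are only indicative; the abstract's observation that performance depends on training and target text sizes applies to any sketch of this kind.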



Paper Citation


in Harvard Style

Celikel E. (2005). A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM. In Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS, ISBN 972-8865-19-8, pages 213-219. DOI: 10.5220/0002556102130219


in Bibtex Style

@conference{iceis05,
author={Ebru Celikel},
title={A CRYPTOGRAPHIC APPROACH TO LANGUAGE IDENTIFICATION: PPM},
booktitle={Proceedings of the Seventh International Conference on Enterprise Information Systems - Volume 2: ICEIS},
year={2005},
pages={213-219},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0002556102130219},
isbn={972-8865-19-8},
}