STATISTICAL LANGUAGE IDENTIFICATION OF SHORT TEXTS

Fela Winkelmolen, Viviana Mascardi

Abstract

Although correctly identifying the language of short texts should prove useful in a large number of applications, few satisfactory attempts are reported in the literature. In this paper we describe a Naive Bayes classifier that performs well on very short texts, as well as the corpus that we created from movie subtitles for training it. Both the corpus and the algorithm are available under the GNU Lesser General Public License.
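
To make the approach named in the abstract concrete, here is a minimal sketch of a Naive Bayes language identifier. It is not the authors' implementation: the use of character trigrams, add-one smoothing, a uniform prior over languages, and all names (`char_ngrams`, `NaiveBayesLanguageIdentifier`) are assumptions chosen for illustration, and the toy training sentences stand in for a real corpus such as the subtitle corpus described above.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased, space-padded text."""
    padded = " " + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLanguageIdentifier:
    """Hypothetical sketch: one multinomial n-gram model per language."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-gram frequencies
        self.totals = {}    # language -> total number of n-grams seen
        self.vocab = set()  # every n-gram seen in any language

    def train(self, language, text):
        grams = char_ngrams(text, self.n)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.vocab.update(grams)

    def log_scores(self, text):
        """Log-likelihood of the text under each language (uniform prior assumed)."""
        grams = char_ngrams(text, self.n)
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen n-grams
        scores = {}
        for language, counter in self.counts.items():
            total = self.totals[language]
            # Add-one (Laplace) smoothing keeps unseen n-grams from zeroing the score.
            scores[language] = sum(
                math.log((counter[g] + 1) / (total + v)) for g in grams
            )
        return scores

    def identify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, for illustration only.
model = NaiveBayesLanguageIdentifier()
model.train("en", "the quick brown fox jumps over the lazy dog and the cat sleeps on the mat")
model.train("it", "la volpe veloce salta sopra il cane pigro e il gatto dorme sul tappeto")

print(model.identify("the dog"))
print(model.identify("il gatto"))
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow, and the smoothing term matters most for very short inputs, where a single unseen n-gram would otherwise dominate the decision.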



Paper Citation


in Harvard Style

Winkelmolen F. and Mascardi V. (2011). STATISTICAL LANGUAGE IDENTIFICATION OF SHORT TEXTS. In Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART, ISBN 978-989-8425-40-9, pages 498-503. DOI: 10.5220/0003294404980503


in Bibtex Style

@conference{icaart11,
author={Fela Winkelmolen and Viviana Mascardi},
title={STATISTICAL LANGUAGE IDENTIFICATION OF SHORT TEXTS},
booktitle={Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART},
year={2011},
pages={498-503},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003294404980503},
isbn={978-989-8425-40-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 3rd International Conference on Agents and Artificial Intelligence - Volume 1: ICAART
TI - STATISTICAL LANGUAGE IDENTIFICATION OF SHORT TEXTS
SN - 978-989-8425-40-9
AU - Winkelmolen F.
AU - Mascardi V.
PY - 2011
SP - 498
EP - 503
DO - 10.5220/0003294404980503