Multinomial Mixture Modelling for Bilingual Text Classification

Jorge Civera, Alfons Juan

2006

Abstract

Mixture modelling of class-conditional densities is a standard pattern classification technique. In text classification, the use of class-conditional multinomial mixtures can be seen as a generalisation of the Naive Bayes text classifier that relaxes its class-conditional feature independence assumption. In this paper, we describe and compare several extensions of the class-conditional multinomial mixture-based text classifier for bilingual texts.
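As an illustration of the modelling idea in the abstract (not the authors' code), the sketch below trains one mixture of multinomials per class with EM on bag-of-words count vectors and classifies a document by the class whose mixture assigns it the highest likelihood; with a single component per class it reduces to Naive Bayes. All function names, the toy data, and the smoothing constant are hypothetical choices for this example.

```python
# Sketch: class-conditional mixture-of-multinomials text classifier (EM).
# Assumption: documents are rows of a bag-of-words count matrix X.
import numpy as np

def fit_multinomial_mixture(X, n_components, n_iter=50, seed=0, alpha=1e-2):
    """Fit a mixture of multinomials to count matrix X via EM.

    Returns mixture weights pi (K,) and word distributions theta (K, V).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = X.shape
    pi = np.full(n_components, 1.0 / n_components)
    theta = rng.dirichlet(np.ones(n_words), size=n_components)
    for _ in range(n_iter):
        # E-step: responsibilities r[d, k] ∝ pi[k] * prod_w theta[k, w]^X[d, w]
        log_r = np.log(pi) + X @ np.log(theta).T
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and smoothed word probabilities
        pi = r.sum(axis=0) / n_docs
        pi = np.clip(pi, 1e-12, None)       # guard against dead components
        pi /= pi.sum()
        counts = r.T @ X + alpha            # Laplace-style smoothing
        theta = counts / counts.sum(axis=1, keepdims=True)
    return pi, theta

def log_likelihood(X, pi, theta):
    """Per-document log-likelihood under the mixture (the multinomial
    coefficient is dropped: it is constant across classes)."""
    log_joint = np.log(pi) + X @ np.log(theta).T
    m = log_joint.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_joint - m).sum(axis=1, keepdims=True))).ravel()

def classify(X, class_models):
    """Assign each document to the class whose mixture scores it highest."""
    scores = np.column_stack([log_likelihood(X, pi, th) for pi, th in class_models])
    return scores.argmax(axis=1)
```

With equal class priors this argmax over class-conditional likelihoods is the Bayes decision rule; unequal priors would add a per-class log-prior term to the scores.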



Paper Citation


in Harvard Style

Civera J. and Juan A. (2006). Multinomial Mixture Modelling for Bilingual Text Classification. In 6th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2006) ISBN 978-972-8865-55-9, pages 93-103. DOI: 10.5220/0002471900930103


in BibTeX Style

@conference{pris06,
  author={Jorge Civera and Alfons Juan},
  title={Multinomial Mixture Modelling for Bilingual Text Classification},
  booktitle={6th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2006)},
  year={2006},
  pages={93-103},
  publisher={SciTePress},
  organization={INSTICC},
  doi={10.5220/0002471900930103},
  isbn={978-972-8865-55-9},
}


in EndNote Style

TY - CONF
JO - 6th International Workshop on Pattern Recognition in Information Systems - Volume 1: PRIS, (ICEIS 2006)
TI - Multinomial Mixture Modelling for Bilingual Text Classification
SN - 978-972-8865-55-9
AU - Civera J.
AU - Juan A.
PY - 2006
SP - 93
EP - 103
DO - 10.5220/0002471900930103