An Information Theoretic Approach to Text Sentiment Analysis

David Pereira Coutinho, Mário A. T. Figueiredo

Abstract

Most approaches to text sentiment analysis rely on human-generated lexicon-based feature selection, supervised vector-based learning, or other techniques designed to capture sentiment information. To reach acceptable accuracy, most of these methods require a complex preprocessing stage and careful feature engineering. This paper introduces a coding-theoretic sentiment analysis method that dispenses with text preprocessing and explicit feature engineering, yet still achieves state-of-the-art accuracy. By applying the Ziv-Merhav method to estimate the relative entropy (Kullback-Leibler divergence) and the cross parsing length between pairs of sequences of text symbols, we obtain information-theoretic measures that make very few assumptions about the models assumed to have generated the sequences. Using these measures, we follow a dissimilarity space approach, in which we apply a standard support vector machine classifier. Experimental evaluation of the proposed approach on a text sentiment analysis problem (specifically, sentiment polarity classification of movie reviews) shows that it outperforms the previous state of the art, despite being much simpler than the competing methods.
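
The measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the substring search is a naive quadratic one, and the relative entropy formula follows the form commonly given for the Ziv-Merhav estimator, with an LZ78-style self-parsing assumed for the second term. The last function shows the dissimilarity space idea, where a document is described by its cross parsing length against a set of prototype texts.

```python
import math


def cross_parse_count(z: str, x: str) -> int:
    """Count the phrases obtained by parsing z into the longest substrings found in x.

    A symbol of z that never occurs in x forms a one-symbol phrase.
    """
    count = 0
    i = 0
    n = len(z)
    while i < n:
        length = 1
        # Grow the candidate phrase while it still occurs somewhere in x.
        while i + length <= n and z[i:i + length] in x:
            length += 1
        # length - 1 is the longest match; always consume at least one symbol.
        i += max(length - 1, 1)
        count += 1
    return count


def lz78_phrase_count(z: str) -> int:
    """Count the phrases of an LZ78-style incremental self-parsing of z."""
    phrases = set()
    current = ""
    count = 0
    for ch in z:
        current += ch
        if current not in phrases:
            phrases.add(current)
            count += 1
            current = ""
    if current:
        count += 1  # trailing phrase that repeats an earlier one
    return count


def zm_relative_entropy(z: str, x: str) -> float:
    """Estimate the relative entropy of z with respect to x (assumed estimator form):
    (c(z|x) * log2(n) - c(z) * log2(c(z))) / n, with n = len(z).
    """
    n = len(z)
    if n == 0:
        raise ValueError("z must be non-empty")
    c_cross = cross_parse_count(z, x)
    c_self = lz78_phrase_count(z)
    return (c_cross * math.log2(n) - c_self * math.log2(c_self)) / n


def dissimilarity_vector(doc: str, prototypes: list[str]) -> list[int]:
    """Represent doc by its cross parsing length against each prototype text."""
    return [cross_parse_count(doc, p) for p in prototypes]
```

A text that shares long substrings with a prototype is parsed into few phrases, so a small cross parsing length indicates similarity. Vectors produced by `dissimilarity_vector` could then be passed to any standard support vector machine implementation, which is the classification step the abstract describes. On short strings the entropy estimate can be negative; it is meant for long sequences.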



Paper Citation


in Harvard Style

Pereira Coutinho, D. and Figueiredo, M. A. T. (2013). An Information Theoretic Approach to Text Sentiment Analysis. In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-8565-41-9, pages 577-580. DOI: 10.5220/0004269005770580


in Bibtex Style

@conference{icpram13,
author={David Pereira Coutinho and Mário A. T. Figueiredo},
title={An Information Theoretic Approach to Text Sentiment Analysis},
booktitle={Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2013},
pages={577-580},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004269005770580},
isbn={978-989-8565-41-9},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - An Information Theoretic Approach to Text Sentiment Analysis
SN - 978-989-8565-41-9
AU - Pereira Coutinho, D.
AU - Figueiredo, M. A. T.
PY - 2013
SP - 577
EP - 580
DO - 10.5220/0004269005770580