Addressing the Problem of Unbalanced Data Sets in Sentiment Analysis

Asmaa Mountassir, Houda Benbrahim, Ilham Berrada

Abstract

Sentiment analysis is a research area concerned with processing and analysing the opinions available on the web. This paper deals with the problem of unbalanced data sets in supervised sentiment classification. We propose three methods for under-sampling the majority-class documents: Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of these methods with that of the common random under-sampling. For classification, we use three standard classifiers: Naïve Bayes, Support Vector Machines and k-Nearest Neighbours. The experiments are carried out on two Arabic data sets that we built and labelled manually. Results on the first data set, which is slightly skewed, are better than those on the second, which is highly skewed. The results also show that the proposed techniques are reliable and typically competitive with random under-sampling.
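For readers unfamiliar with the baseline named above, random under-sampling can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function name `random_undersample`, the `seed` parameter, the toy documents and the choice to balance every class down to the size of the smallest class are all assumptions made for the example. The three proposed methods are not sketched here because the abstract does not describe how they select documents.

```python
import random
from collections import Counter


def random_undersample(docs, labels, seed=0):
    """Randomly drop documents so that every class ends up with as many
    documents as the smallest class (hypothetical helper, for illustration)."""
    rng = random.Random(seed)

    # Group documents by their class label.
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)

    # The minority class size is the target size for every class.
    target = min(len(group) for group in by_class.values())

    sampled = []
    for label, group in by_class.items():
        # Keep the minority class whole; sample the others without replacement.
        kept = group if len(group) == target else rng.sample(group, target)
        sampled.extend((doc, label) for doc in kept)

    # Shuffle so that classes are not stored in contiguous runs.
    rng.shuffle(sampled)
    return [d for d, _ in sampled], [l for _, l in sampled]


# Toy skewed data set: 8 positive and 2 negative documents (made-up data).
docs = [f"review {i}" for i in range(10)]
labels = ["pos"] * 8 + ["neg"] * 2

balanced_docs, balanced_labels = random_undersample(docs, labels)
counts = Counter(balanced_labels)
```

After the call, `counts` holds the per-class sizes of the reduced training set. Which majority documents are discarded depends only on the random seed, which is the property that informed selection strategies such as those proposed in the paper aim to improve on.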



Paper Citation


in Harvard Style

Mountassir A., Benbrahim H. and Berrada I. (2012). Addressing the Problem of Unbalanced Data Sets in Sentiment Analysis. In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012), ISBN 978-989-8565-29-7, pages 306-311. DOI: 10.5220/0004142603060311


in Bibtex Style

@conference{kdir12,
author={Asmaa Mountassir and Houda Benbrahim and Ilham Berrada},
title={Addressing the Problem of Unbalanced Data Sets in Sentiment Analysis},
booktitle={Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)},
year={2012},
pages={306-311},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004142603060311},
isbn={978-989-8565-29-7},
}


in EndNote Style

TY - CONF
JO - Proceedings of the International Conference on Knowledge Discovery and Information Retrieval - Volume 1: KDIR, (IC3K 2012)
TI - Addressing the Problem of Unbalanced Data Sets in Sentiment Analysis
SN - 978-989-8565-29-7
AU - Mountassir A.
AU - Benbrahim H.
AU - Berrada I.
PY - 2012
SP - 306
EP - 311
DO - 10.5220/0004142603060311