presents the minority class.
Our experiments consist in balancing the two
classes of each data set using the four studied
under-sampling methods, i.e. RR, RS, RF and RC.
We then evaluate the performance of the three
classifiers on the balanced data sets.
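As an illustration of the balancing step, the following minimal sketch under-samples the majority class at random until both classes have the same size. It is not the implementation used in our experiments: the function name `undersample_random`, the label values and the fixed seed are assumptions made for the example, and plain random removal stands in for whichever of the four methods is applied.

```python
import random

def undersample_random(docs, labels, seed=0):
    """Randomly drop majority-class items until both classes have equal size."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two classes")
    # Indices of the documents belonging to each class.
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in classes}
    minority, majority = sorted(classes, key=lambda c: len(by_class[c]))
    # Keep every minority item and an equally sized random subset of the majority.
    kept = by_class[minority] + rng.sample(by_class[majority], len(by_class[minority]))
    kept.sort()  # preserve the original document order
    return [docs[i] for i in kept], [labels[i] for i in kept]

# Toy data set (hypothetical): 8 negative and 2 positive documents.
docs = [f"doc{i}" for i in range(10)]
labels = ["neg"] * 8 + ["pos"] * 2
bal_docs, bal_labels = undersample_random(docs, labels)
print(bal_labels.count("neg"), bal_labels.count("pos"))
```

The balanced lists can then be passed to any of the classifiers for training and evaluation.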
Our results show that the performance obtained on
DSMR is better than that obtained on DSPo. This
suggests that the more unbalanced the data set is,
the worse the results are.
Comparing the under-sampling methods, we can say
that, in general, the four methods give similar
results, but in most cases RR yields the best
results. RF is not recommended for NB; it is better
suited to SVM. For kNN, we do not recommend using
RS.
As future work, we plan to perform the same
experiments on unbalanced data sets that are more
homogeneous, so as to validate our hypothesis about
the impact of heterogeneity on the performance of
the proposed techniques. We will also study the
effectiveness of the four under-sampling methods
when the size of the majority class is decreased
progressively. On the one hand, we aim to see
whether a 50%-50% balance is necessary to obtain
the best results. On the other hand, we aim to
observe how our classifiers behave, under the
different under-sampling methods, at each step of
the reduction of the majority class. Finally, we
also plan to study feature selection techniques on
unbalanced data sets for SA.
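The progressive reduction of the majority class described above could be organised as in the sketch below, which lists the majority-class sizes to test between the original size and a full 50%-50% balance. The helper name `majority_size_schedule`, the number of steps and the class sizes in the usage lines are assumptions for illustration only; the actual step sizes have not been fixed.

```python
def majority_size_schedule(n_majority, n_minority, steps=5):
    """Return majority-class sizes from the original size down to full balance."""
    if n_minority > n_majority or steps < 1:
        raise ValueError("need n_minority <= n_majority and steps >= 1")
    gap = n_majority - n_minority
    # Evenly spaced sizes; the last one equals the minority size (50%-50%).
    return [n_majority - round(gap * k / steps) for k in range(steps + 1)]

# Hypothetical data set: 900 majority and 100 minority documents.
schedule = majority_size_schedule(n_majority=900, n_minority=100, steps=4)
for size in schedule:
    share = size / (size + 100)
    print(f"majority size {size}: {share:.0%} of the data set")
```

Evaluating each classifier at every size in the schedule would show whether performance peaks before full balance is reached.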