Addressing the Problem of Unbalanced Data Sets in Sentiment
Analysis
Asmaa Mountassir, Houda Benbrahim and Ilham Berrada
ALBIRONI Research Team, ENSIAS, Mohamed 5 University, Souissi, Rabat, Morocco
Keywords: Sentiment Analysis, Opinion Mining, Unbalanced Data Sets, Machine Learning, Text Classification,
Natural Language Processing, Arabic Language.
Abstract: Sentiment Analysis is a research area whose studies focus on processing and analysing the opinions
available on the web. This paper deals with the problem of unbalanced data sets in supervised sentiment
classification. We propose three different methods to under-sample the majority class documents, namely
Remove Similar, Remove Farthest and Remove by Clustering. Our goal is to compare the effectiveness of
the proposed methods with the common random under-sampling. We use for classification three standard
classifiers: Naïve Bayes, Support Vector Machines and k-Nearest Neighbours. The experiments are carried
out on two different Arabic data sets that we have built and labelled manually. We show that results
obtained on the first data set, which is slightly skewed, are better than those obtained on the second one
which is highly skewed. The results also show that the proposed techniques are reliable and typically
competitive with random under-sampling.
1 INTRODUCTION
Nowadays, the web is no longer just a source of
information for internet users; it also represents a
space where ordinary users can provide information.
With the emergence of social media (such as social
networking sites, online news sites, online web
forums, personal blogs and online review sites),
internet users are more and more invited to express
their opinions, post comments or share experiences
about any topic. Therefore, online opinion has
become an important resource for much research,
especially in the fields of Opinion Mining (OM) and
Sentiment Analysis (SA).
SA is a subfield of Text Mining concerned with
processing and analysing opinions expressed on the
web by different kinds of authors. There are
several trends in this area. Some studies deal with
Subjectivity Analysis where the classification
classes are OBJECTIVE vs. SUBJECTIVE (Abdul-
Mageed et al., 2011), while others focus on
Sentiment Classification (i.e. classification by
polarity), where classification classes are POSITIVE
vs. NEGATIVE (Pang et al., 2002). We can find
also some studies about opinion summarization
(Zhuang et al., 2006).
Most studies in SA do not tackle the problem of
unbalanced data sets (UD). It is often assumed that
the positive and negative classes are balanced: after
building and labelling the data set, it is equalized
so as to have the same number of documents in each
class, or, when document collection is based on a
rating system, the same number of documents is
gathered for each class (Pang et al., 2002;
Rushdi-Saleh et al., 2011a). Nevertheless, this
assumption may not hold in the real world, since for
a given subject we cannot always find as many
positive documents as negative ones. Hence the
importance of addressing the problem of UD in SA.
To resolve the problem of UD, there are
generally two approaches (Japkowicz, 2003). The
first approach modifies the classifier, for instance
through cost-sensitive learning (Brank et al., 2003)
or by adapting the classifier to handle UD (Wu and
Chang, 2003). The second approach modifies the
data set itself, with two common methods: under-
sampling (Kubat and Matwin, 1997) and over-
sampling (Chawla et al., 2002). Under-sampling
reduces the number of majority class members in
the training set, while over-sampling increases the
number of minority class members in
the training set. Note that the majority class refers
to the class with more documents, and the minority
class denotes the class with fewer documents.
Existing works that address UD in SA have
mostly used random under-sampling. Li et al. (2011)
tested several methods for UD and found random
under-sampling to be the most effective; they then
applied this technique together with a semi-
supervised method to classify four different
unbalanced data sets for OM. Their study is not
directly comparable to ours since they follow a
semi-supervised approach while our work focuses
on supervised learning. Rushdi-Saleh et al. (2011b)
used, among others, a highly unbalanced English
data set of product reviews, with Support Vector
Machines as classifier. They achieved good results
on the unbalanced data set; however, they did not
give details about how they handled the skewed
data. Burns et al. (2011) performed a comparative
study between balanced and unbalanced sentiment
classification for customer reviews. Likewise, they
used random sampling to balance their data sets,
employing Naïve Bayes and a Language Model
(Carpenter, 2005) as classifiers. They show that a
realistic unbalanced data set can achieve
substantially better results. However, their results
may not be reliable since they used accuracy as the
performance measure: according to Kubat and
Matwin (1997), the classifier's performance in
applications of this kind cannot be expressed in
terms of average accuracy. We give more details
about this problem in section 4.
In this paper, we focus on under-sampling
methods in supervised sentiment classification in an
Arabic context. We propose three different
techniques, namely Remove Similar (RS), Remove
Farthest (RF) and Remove by Clustering (RC). The
idea behind these methods is that we seek to keep,
among the majority class documents, only those that
are representative of their class. Our initial
hypothesis is that targeted removal from the
majority class may be more effective than random removal.
This is why we compare the effectiveness of these
techniques with that of random under-sampling that
we call Random Removal (RR). We point out that
these methods are independent of domain, language
and classification technique. We have built and
manually labelled two Arabic data sets of different
sizes, drawn from two different domains. The first data set
is slightly skewed, while the second is highly
skewed. We use three standard classifiers, namely
Naïve Bayes (Mitchell, 1996), Support Vector
Machines (Vapnik, 1995) and k-Nearest Neighbours
(Dasarathy, 1991). Our main goal is to evaluate the
performance of the different under-sampling
methods used in this study on two data sets with
different degrees of imbalance.
The remainder of this paper is organized as
follows. The second section describes the three
under-sampling methods that we propose. The third
section presents the data collection that we have
used. The fourth section describes the experimental
environment and the performed experiments.
Afterward, the obtained results are presented and
discussed. We finish with a conclusion and future
work in the last section.
2 UNDER-SAMPLING METHODS
In this section we describe each of the proposed
under-sampling methods. These techniques are
respectively Remove Similar, Remove Farthest and
Remove by Clustering. In the following, we denote
the majority class by C.
2.1 Remove Similar (RS)
This method eliminates from the majority class the
documents that are very similar to other documents
of the same class. The intuition is that similar
documents may not give the classifier supplementary
information to learn from, so removing them may
help to balance the data sets without losing much
information. The algorithm of this method is as
follows (a minimal code sketch is given after the
steps).
a. Compute for each document of C its
distance to each document of C.
b. Assign a score to each document of C. This
score is the minimum of the distances
calculated in step a.
c. Remove from C the document with the
smallest score.
d. If the desired number of documents to
remove is reached, the algorithm ends.
Otherwise, return to step b (removed
documents are not taken into account when
assigning scores).
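
The following is a minimal sketch of RS in Python, assuming a document-term matrix for the majority class and the cosine-based distance that we use in our experiments (section 4); the function name and interface are illustrative, not part of the method description.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_distances

    def remove_similar(X_major, n_remove):
        """RS: iteratively drop the majority-class document whose minimum
        distance to the remaining documents of its class is smallest."""
        dist = cosine_distances(X_major)   # step a: all pairwise distances
        np.fill_diagonal(dist, np.inf)     # a document is not its own neighbour
        keep = list(range(X_major.shape[0]))
        for _ in range(n_remove):
            sub = dist[np.ix_(keep, keep)]
            scores = sub.min(axis=1)            # step b: minimum distance as score
            del keep[int(np.argmin(scores))]    # step c: remove the smallest score
        return keep                             # step d: indices of kept documents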
2.2 Remove Farthest (RF)
The principle of this method is to eliminate from
the majority class the documents that are farthest
from the rest of its members. We think that the
farthest documents might be a source of noise or
might represent overly specific documents. In both
cases, the removal of such documents may help to
AddressingtheProblemofUnbalancedDataSetsinSentimentAnalysis
307
balance the data sets without losing much
information. We specify that, in our study, the
maximum value the distance between two documents can
take is 1 (with the cosine distance on term vectors,
a distance of 1 means the documents share no terms).
This notion is used in the algorithm of this method,
whose steps are as follows (a code sketch is given
after the steps).
a. Compute for each document of C its
distance to each document of C.
b. Assign a score to each document of C. This
score is the number of distances computed in
step a that are equal to 1.
c. Sort, in descending order, documents of C
by their scores.
d. Remove from C the first n documents in the
order obtained in step c, where n is the
number of documents we wish to remove.
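
A corresponding sketch for RF, under the same assumptions; with the cosine distance on non-negative term vectors, a distance of 1 means that two documents share no terms.

    import numpy as np
    from sklearn.metrics.pairwise import cosine_distances

    def remove_farthest(X_major, n_remove):
        """RF: score each document by how many of its pairwise distances
        reach the maximum value 1, then drop the n highest-scoring ones."""
        dist = cosine_distances(X_major)            # step a
        np.fill_diagonal(dist, 0.0)                 # ignore self-distances
        scores = np.isclose(dist, 1.0).sum(axis=1)  # step b: count distances equal to 1
        order = np.argsort(-scores)                 # step c: sort scores descending
        return sorted(order[n_remove:].tolist())    # step d: remove the first n documents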
2.3 Remove by Clustering (RC)
The principle of this method is different from that of
RS and RF. RC is based on clustering while RS and
RF are based on computing distances between
documents. The intuition behind this method is that,
if we apply a clustering algorithm to the majority
class documents, selecting the clusters' centres to
represent the class might allow an optimal balancing
of the data sets. The steps of this method are
described below (a code sketch follows them).
a. Apply a clustering algorithm on C
documents; the number of clusters to form
corresponds to the number of C documents
that we seek to keep.
b. For each cluster, compute a score for each of
its documents, obtained by averaging the
document's distances to the other members of
the same cluster.
c. Select from each cluster the document with
the smallest score; the selected documents
act, in a way, as the clusters' centres.
d. Remove all documents of C except those
selected in step c.
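
A sketch of RC, assuming k-means for step a (as in section 4) and cosine distance for the in-cluster scores of step b; the method description does not fix the distance used in step b, so cosine is one plausible choice here.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_distances

    def remove_by_clustering(X_major, n_keep, seed=0):
        """RC: cluster the majority class into n_keep clusters, then keep
        from each cluster the document closest on average to its cluster
        mates (a medoid) and remove everything else."""
        labels = KMeans(n_clusters=n_keep, random_state=seed).fit_predict(X_major)  # step a
        keep = []
        for c in range(n_keep):
            idx = np.where(labels == c)[0]
            dist = cosine_distances(X_major[idx])     # step b: in-cluster distances
            scores = dist.mean(axis=1)                # average distance to cluster mates
            keep.append(int(idx[np.argmin(scores)]))  # step c: the cluster "centre"
        return sorted(keep)                           # step d: indices of kept documents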
3 DATA COLLECTION
We collected our data sets from the online forums of
Aljazeera's web site (http://www.aljazeera.net). We
built two data sets from two different domains. The
first consists of 594 comments reviewing a movie
that generated much controversy in the Arab-Muslim
world; we call it DSMR. The second consists of
1082 comments about a political issue titled "Arab
support for the Palestinian affair"; we call it
DSPo.
Since our study deals with supervised learning,
we had to label our data sets manually. The
categories considered are POSITIVE, NEGATIVE,
OBJECTIVE and NOT_ARABIC. The POSITIVE
(POS) category includes all comments that reflect a
positive opinion regardless of the opinion object.
Likewise, the NEGATIVE (NEG) category contains
comments with negative sentiments. The OBJECTIVE
(OBJ) category consists of non-opinionated
comments (as in “The movie relates the war
events”), comments that reflect neutral opinions
(such as “I hope to know more details about this
story”) or comments that contain a mixture of
positive and negative opinions (as in “I enjoyed the
movie; however there was a falsification of the real
story”). Finally, the NOT_ARABIC (N_AR) category
is used to clean our data sets, since it includes
all comments not written in Arabic. Table 1 gives an
overview of the distribution of each category in the
two data sets.
Table 1: Number of documents per category for each data set.

        POS    NEG    OBJ    N_AR
DSMR    184    284    106    20
DSPo    149    462    383    88
Since the present study focuses on sentiment
classification, the categories we are interested in are
POSITIVE and NEGATIVE ones. Documents of
OBJECTIVE and NOT_ARABIC categories were
ignored. We will return to objective documents in
future studies, since they represent an important
percentage (up to 35.4% of the whole data set).
Table 2 illustrates the final structure of the used data
sets.
Table 2: Structure of the final data sets.

        POS            NEG            Total
DSMR    184 (39.4%)    284 (60.6%)    468
DSPo    149 (24.4%)    462 (75.6%)    611
As we can see from table 2, both DSMR and DSPo
are unbalanced. We can see also that DSMR is
slightly skewed since it contains 39.4% of positive
documents versus 60.6% of negative documents.
However, DSPo is highly skewed in comparison
with DSMR. DSPo contains 24.4% of positive
documents versus 75.6% of negative documents. We
specify that, for our study, the majority class is
the NEGATIVE category and the minority class is the
POSITIVE one. This unbalance reflects one of the
main characteristics of comments derived from
Aljazeera’s site: positive comments are largely
outnumbered by negative ones. This is due to the
nature of the issues discussed in the forums of
Aljazeera’s web site, which often deal with problems
of the Arab-Muslim world. Finally, while labelling
the data, we noticed that both data sets contain a
great number of off-topic comments. Such comments
can be a source of noise; moreover, the larger their
proportion, the more heterogeneous the data set
becomes. This lack of document homogeneity may have
a great impact on classification performance.
4 EXPERIMENTS
In this section, we present the experiments that we
have performed for this study. Firstly, we describe
the experimental environment, i.e. some details
about the pre-processing techniques, the classifiers,
the validation method and the performance measure
used. Secondly, we present the obtained results and
discuss the different findings.
4.1 Experimental Design
4.1.1 Pre-Processing
As a pre-processing step, we have removed from the
textual documents all characters that are not
Arabic letters. We have also removed stop words.
As a stemming process, we have applied the light-
stemming method (Khoja and Garside, 1999).
Moreover, we have eliminated from the feature space
terms that occur only once in the data set. This
elimination may help to clean the data sets of
typing errors made by document authors, who are
ordinary internet users. As a weighting scheme, we
have used binary weighting, which is based on term
presence. Finally, documents were considered as
bags-of-words.
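
As an illustration, the term-presence weighting can be approximated as follows; docs stands for the already cleaned, stop-word-free and light-stemmed texts, and min_df=2 is a close (document-frequency based) analogue of dropping terms that occur only once in the data set.

    from sklearn.feature_extraction.text import CountVectorizer

    # docs: pre-processed documents (placeholder strings here; real input
    # would be the cleaned and light-stemmed Arabic comments)
    docs = ["first comment tokens", "second comment tokens"]

    # Binary term-presence weights; min_df=2 drops terms that appear in a
    # single document, approximating the removal of terms occurring once.
    vectorizer = CountVectorizer(binary=True, min_df=2)
    X = vectorizer.fit_transform(docs)   # documents as bags-of-words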
4.1.2 Classification and Algorithms
In this study, we are interested in a single-label
binary classification where each document is
assigned one of the two categories, POSITIVE or
NEGATIVE.
For the pre-processing and classification tasks, we
have used the data mining package Weka (Witten
and Frank, 2005). We have used three standard
classifiers; namely Naïve Bayes (NB), Support
Vector Machines (SVM) and k-Nearest Neighbours
(k-NN). For the NB classifier, we have used a kernel
estimator, rather than a normal distribution, for
numeric attributes (John and Langley, 1995).
Concerning the SVM classifier, we have used a
normalized polynomial kernel with Sequential
Minimal Optimization (SMO) (Platt, 1999). For the
k-NN classifier, we have used a linear search with a
cosine-based distance (Salton and McGill, 1983).
For our proposed under-sampling techniques, we
have used the cosine-based distance for the RS and
RF methods, and k-means (Hartigan, 1975) as the
clustering algorithm for the RC method.
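
For readers outside Weka, rough scikit-learn analogues of this configuration might look as follows; BernoulliNB is only a stand-in suited to binary features (not Weka's kernel-estimator NB), and the SVC parameters approximate, rather than reproduce, the normalized polynomial kernel.

    from sklearn.naive_bayes import BernoulliNB
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier

    classifiers = {
        "NB": BernoulliNB(),                           # stand-in for Weka's NB
        "SVM": SVC(kernel="poly"),                     # SMO-style polynomial-kernel SVM
        "kNN": KNeighborsClassifier(metric="cosine"),  # cosine distance, brute-force search
    }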
4.1.3 Validation Method
Both DSMR and DSPo were randomly split into two
sets: a training set representing 75% of the data set,
and a test set representing 25% of the data set. We
specify that this split is stratified, i.e. we keep the
same category distribution in both training and test
sets as in the initial data set. This process was
repeated 25 times to obtain, as output, 25 samples
for each data set.
Each experiment on a data set is run on all 25 of
its samples, and the final result for a given data
set is obtained by averaging the samples' results.
We point out that all under-sampling methods are
applied only to the training sets; the test sets are
left as they are. Training sets need to be balanced
in order not to bias the classifiers' learning,
while test sets are not modified so that classifiers
are tested on sets representing the reality.
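
One way to reproduce this protocol, assuming the matrix X and a label vector y as in the sketches above:

    from sklearn.model_selection import StratifiedShuffleSplit

    # 25 stratified 75%/25% train/test splits; under-sampling is applied
    # to the training side only and results are averaged over the samples.
    splitter = StratifiedShuffleSplit(n_splits=25, test_size=0.25, random_state=0)
    for train_idx, test_idx in splitter.split(X, y):
        # balance X[train_idx] here (RR, RS, RF or RC); leave the test set as-is
        pass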
4.1.4 Performance Measure
To evaluate our proposed under-sampling methods,
we have adopted the popular g-performance which is
obtained as follows:
g-performance = sqrt(Acc+ * Acc-) (2)

where Acc+ and Acc- denote the accuracy for the
positive and negative class respectively. These two
accuracies are obtained from the confusion matrix,
which presents all possible category assignments for
a binary classification. It is illustrated in table 3.
Table 3: Confusion matrix.

                  Positive prediction    Negative prediction
Positive class    True Positive (TP)     False Negative (FN)
Negative class    False Positive (FP)    True Negative (TN)
AddressingtheProblemofUnbalancedDataSetsinSentimentAnalysis
309
TP and TN denote the numbers of correctly
classified positive and negative documents, while FN
and FP denote the numbers of misclassified positive
and negative documents, respectively. The positive
and negative accuracies are respectively defined as
follows:
Acc+ = TP / (TP + FN) (3)

Acc- = TN / (TN + FP) (4)
This metric is called g-performance because it
corresponds to the geometric mean of the positive
and negative accuracies. G-performance is suitable
for unbalanced classification since it is high only
when the accuracies of both classes are high at the
same time (Kubat and Matwin, 1997).
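
A small helper computing equations (2)-(4), assuming labels coded as 0 for NEGATIVE and 1 for POSITIVE:

    import numpy as np
    from sklearn.metrics import confusion_matrix

    def g_performance(y_true, y_pred):
        """Geometric mean of the per-class accuracies (Kubat and Matwin, 1997)."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        acc_pos = tp / (tp + fn)            # equation (3): positive accuracy
        acc_neg = tn / (tn + fp)            # equation (4): negative accuracy
        return np.sqrt(acc_pos * acc_neg)   # equation (2): g-performance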
4.2 Results
The experiments we have carried out consist in
equalizing the two classes of both DSMR and DSPo
by applying the four under-sampling methods under
study (RS, RF, RR and RC).
Afterward, we employ each of the three classifiers
(NB, SVM and k-NN) to classify the balanced data
sets.
Tables 4 and 5 present the results in g-
performance obtained respectively on DSMR and
DSPo. For each classifier, we show the result of
classification after equalization of the two classes by
each of the four under-sampling methods. The best
result for each classifier is marked with an asterisk.
Table 4: Results in g-performance by classifier on DSMR.

        RR      RS      RF      RC
NB      71.9*   65.5    60.9    70.6
SVM     72.1*   68.1    72.1*   69.2
kNN     66.6    60      67*     64.5
Table 5: Results in g-performance by classifier on DSPo.

        RR      RS      RF      RC
NB      65.7*   60.9    51      65
SVM     53.1    58.8    61.4*   53.6
kNN     62.9*   56.7    61.8    60.3
Comparing the two data sets, we can see that
results on DSMR typically outperform those obtained
on DSPo: results range from 60% to 72.1% for
DSMR, and from 51% to 65.7% for DSPo. Recall
that DSPo is highly skewed compared with DSMR,
which could explain the difference in results.
Hence, the more unbalanced the data set, the less
promising the results.
Comparing the four under-sampling methods, we
can see from tables 4 and 5 that, generally and
taking the standard deviation into account, the four
methods yield close results. However, RR mostly
gives the best results of the four. Moreover, RF
seems competitive with RR for SVM on both DSMR
and DSPo. We can also see from tables 4 and 5 that
RF is not recommended when we use NB. Likewise,
RS does not suit kNN. This is understandable since
kNN relies on neighbourhood to perform
classification; by removing similar documents, we
do not help this classifier to classify documents
correctly.
The fact that the three proposed methods did not
outperform random under-sampling can be explained
by the nature of our data sets, which are not
homogeneous (see section 3). Indeed, when the
documents of a data set are not homogeneous, all
documents are, generally, far from each other, so
neither removing similar documents nor removing the
farthest ones is meaningful in that case. The same
holds for removal by clustering: even if we choose a
centre for each cluster, this centre may not be able
to represent all the documents of its cluster,
because the number of clusters to form is fixed in
advance.
5 CONCLUSION AND FUTURE WORK
The present study deals with the problem of
unbalanced data sets in Sentiment Analysis. We
propose three different under-sampling methods that
we call respectively Remove Similar, Remove
Farthest and Remove by Clustering. Our main goal
in this study is to evaluate the effectiveness of these
three methods. Furthermore, we compare them with
the random under-sampling which is commonly
used in the literature of unbalanced classification.
We focus on supervised classification of Arabic
documents. We use three standard classifiers known
for their effectiveness: Naïve Bayes, Support Vector
Machines and k-Nearest Neighbours. We use as data
collection two unbalanced Arabic data sets (DSMR
and DSPo) that we have built internally and labelled
manually. These data sets are different in many
aspects, namely document size, domain and
unbalance degree. DSMR contains 468 documents;
39.4% of them are positive and 60.6% are negative.
So, DSMR is slightly skewed. DSPo contains in
total 611 documents, where 24.4% are positive and
75.6% are negative. DSPo is therefore highly
skewed. The majority class in our study is the
NEGATIVE category, while the POSITIVE one
represents the minority class.
Our experiments consist in balancing the two
classes of each data set using the four studied
under-sampling methods, i.e. RR, RS, RF and RC.
Then we evaluate the performance of the three
classifiers on the balanced data sets.
Our results show that the performance obtained on
DSMR is better than that obtained on DSPo. This
supports the observation that the more unbalanced
the data set, the worse the results.
Comparing the under-sampling methods, we can
say that, generally, the four methods give close
results, but in most cases RR yields the best ones.
RF is not recommended for NB; it is rather suited to
SVM. For kNN, we do not recommend using RS.
As future work, we plan to perform the same
experiments on unbalanced data sets that are more
homogeneous, so as to validate our hypothesis about
the impact of heterogeneity on the performance of
the proposed techniques. We will also study the
effectiveness of the four under-sampling methods
while progressively decreasing the majority class
size. On one hand, we aim to see whether a 50%-50%
balance is necessary to obtain the best results. On
the other hand, we aim to observe how our
classifiers behave, with the different under-sampling
methods, at each step of this decrease. Finally, we
also plan to study feature selection techniques on
unbalanced data sets in SA.
REFERENCES
Abdul-Mageed, M., Diab, M.T., Korayem, M., 2011.
Subjectivity and Sentiment Analysis of Modern
Standard Arabic. In Proc. ACL (Short Papers).
pp.587-591.
Brank, J., Grobelnik, M., Milić-Frayling, N., Mladenić, D.,
2003. Training text classifiers with SVM on very few
positive examples. Technical report, MSR-TR-2003-
34.
Burns, N., Bi, Y., Wang, H., Anderson, T., 2011.
Sentiment Analysis of Customer Reviews: Balanced
versus Unbalanced Datasets. KES 2011, Part I, LNAI
6881, pp. 161-170.
Carpenter, B., 2005. Scaling High-Order Character
Language Models to Gigabytes. In: Workshop on
Software. Association for Computational Linguistics,
Morristown. pp. 86–99.
Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer,
W. P., 2002. SMOTE: Synthetic Minority Over-
sampling Technique. Journal of Artificial Intelligence
Research (JAIR), Volume 16, pp. 321-357.
Dasarathy, B. V., 1991. Nearest Neighbor (NN) Norms:
NN Pattern Classification Techniques. McGraw-Hill
Computer Science Series. Los Alamitos, California:
IEEE Computer Society Press.
Hartigan, J., 1975. Clustering Algorithms. John Wiley &
Sons, New York, NY.
Japkowicz, N., 2003. Class Imbalances: Are we Focusing
on the Right Issue? In Proc. of ICML’03.
Khoja, S., Garside, R., 1999. Stemming Arabic text.
Computer Science Department, Lancaster University,
Lancaster, UK.
Kubat, M., Matwin, S., 1997. Addressing the Curse of
Imbalanced Data Sets: One-Sided Sampling. In
Proceedings of the Fourteenth International
Conference on Machine Learning, pp. 179-186.
Li, S., Wang, Z., Zhou, G., Lee, S. Y. M., 2011. Semi-
Supervised Learning for Imbalanced Sentiment
Classification. In Proc. of the Twenty-Second
International Joint Conference on Artificial
Intelligence, pp.1826-1831.
Mitchell, T., 1996. Machine Learning. McGraw Hill.
Pang, B., Lee, L., Vaithyanathan, S., 2002. Thumbs up?
Sentiment classification using machine learning
techniques. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing.
pp.79-86.
Platt, J., 1999. Fast training of support vector machines
using sequential minimal optimization. In Scholkopf, B., Burges, C.,
and Smola, A. (Ed.), Advances in Kernel Methods:
Support Vector Learning, MIT Press, Cambridge, MA,
pp.185-208.
Rushdi-Saleh, M., Martin-Valdivia, M. T., Urena-Lopez,
L. A., Perea-Ortega, J. M., 2011a. Bilingual
Experiments with an Arabic-English Corpus for
Opinion Mining. In Proc. of Recent Advances in
Natural Language Processing, Hissar, Bulgaria.
pp.740-745.
Rushdi-Saleh, M., Martin-Valdivia, M. T., Urena-Lopez,
L. A., Perea-Ortega, J. M., 2011b. Experiments with
SVM to classify opinions in different domains. Expert
Systems with Applications 38, pp.14799-14804.
Salton, G., McGill, M., 1983. Modern Information
Retrieval. New York: McGraw-Hill.
Vapnik, V., 1995. The Nature of Statistical Learning Theory.
Springer-Verlag.
Witten, I. H., Frank, E., 2005. Data Mining: Practical
machine learning tools and techniques, 2nd Edition,
Morgan Kaufmann, San Francisco, California.
Wu, G., Chang, E., 2003. Class-Boundary Alignment for
Imbalanced Dataset Learning. In Proc. of ICML’03.
Zhuang, L., Jing, F., Zhu, X., 2006. Movie Review Mining
and Summarization. In CIKM’06. Virginia, USA.
AddressingtheProblemofUnbalancedDataSetsinSentimentAnalysis
311