For the classification task, we select only the Rating and Review text attributes. Rating is a numerical value from 1 to 5, and Review text is a string containing the opinion of the user. Before using the dataset, we apply a few steps to obtain better results. These steps are described as follows:
1. Assign Rating values 1 and 2 to Negative;
2. Assign Rating values 4 and 5 to Positive;
3. Remove all instances with a Rating value equal to 3.
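A minimal sketch of this labelling step, assuming the data is loaded into a pandas DataFrame with "Rating" and "Review" columns (the library choice and file name are our assumptions, since the paper does not name an implementation):

    import pandas as pd

    # Hypothetical input file with the Rating and Review attributes described above.
    df = pd.read_csv("reviews.csv", usecols=["Rating", "Review"])

    # Step 3: drop the neutral reviews (Rating == 3).
    df = df[df["Rating"] != 3]

    # Steps 1 and 2: map ratings 1-2 to Negative and 4-5 to Positive.
    df["Label"] = df["Rating"].apply(lambda r: "Negative" if r <= 2 else "Positive")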
3.2 Pre-processing and Text Transformation
In order to improve the results of the four algorithms that we study in this paper, it is necessary to apply some pre-processing steps, which reduce the data dimension without affecting the classification task (Eler et al., 2018). The first step is to convert all the instances of the dataset into lowercase. Next, we remove noisy formatting such as HTML tags and punctuation. Tokenization, removal of stop words and stemming are described as follows (a combined sketch of the whole pipeline is given after the list):
- Tokenization: the process that splits strings of text into small pieces called tokens (Mouthami, Devi and Bhaskaran, 2013). This process is widely used in pre-processing tasks.
- Removal of Stop Words: a stop word is a commonly used word that appears frequently in any document. These words are usually articles and prepositions; examples are "the", "is", "are", "I" and "of" (Eler et al., 2018). Since these terms do not add meaning to a sentence, we can remove them from the text before the classification task. For this study, we use a list of about 150 common words of the English language.
- Stemming: the process that reduces a word to its base or root form. For example, the words "swims" and "swimming" are transformed into "swim" after stemming. In this study, we use the Porter Stemmer because it is one of the most popular rule-based stemmers for English (Jasmeet and Gupta, 2016) and, compared with the Lovins Stemmer, it is a lighter stemmer. Moreover, it produces the best output compared to other stemmers (Ganesh Jivani, 2011).
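The following sketch combines the steps above (lowercasing, removal of HTML tags and punctuation, tokenization, stop-word removal and Porter stemming). The use of NLTK is our assumption, and NLTK's English stop-word list stands in for the roughly 150-word list used in the paper:

    import re
    import string
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    # nltk.download("stopwords") and nltk.download("punkt") may be needed once.
    STOP_WORDS = set(stopwords.words("english"))
    STEMMER = PorterStemmer()

    def preprocess(text):
        text = text.lower()                                   # lowercase
        text = re.sub(r"<[^>]+>", " ", text)                  # strip HTML tags
        text = text.translate(
            str.maketrans("", "", string.punctuation))        # strip punctuation
        tokens = word_tokenize(text)                          # tokenization
        tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
        return [STEMMER.stem(t) for t in tokens]              # Porter stemming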
Text Transformation: machine learning algorithms do not work directly with text features, so we need to convert the text into numerical features. To do this, we use TF-IDF (Term Frequency-Inverse Document Frequency). This scheme assigns to each word of a sentence a weight based on its TF and IDF values (Yang and Salton, 1973).
The TF (term frequency) of a word is the number of times the word appears in a document. The IDF (inverse document frequency) of a term measures how important, i.e., how discriminative, the term is across the collection: terms that occur in many documents receive a low weight, commonly computed as idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t (Salton and Buckley, 1988; Yang and Salton, 1973).
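As a minimal sketch, the transformation can be done with scikit-learn's TfidfVectorizer; this is our choice of implementation, and scikit-learn uses a smoothed variant of the IDF formula above:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical mini-corpus of pre-processed reviews, for illustration only.
    docs = ["great hotel clean room", "terrible service dirty room", "great service"]

    vectorizer = TfidfVectorizer()
    X_tfidf = vectorizer.fit_transform(docs)   # sparse document-term matrix

    print(vectorizer.get_feature_names_out())  # vocabulary learned from docs
    print(X_tfidf.toarray())                   # tf-idf weight of each term per document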
3.3 Classification Process
After cleaning the dataset and applying the pre-processing and text transformation steps, we split the data into training and test sets: 80% of the instances are used for training and the remaining 20% for testing. The training data is used to fit the classifiers and the test data is used to evaluate them.
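A minimal sketch of this split, assuming scikit-learn's train_test_split and the feature matrix X_tfidf and label vector y produced earlier (the fixed random seed is our addition for reproducibility):

    from sklearn.model_selection import train_test_split

    # 80/20 split, matching the proportions reported above.
    X_train, X_test, y_train, y_test = train_test_split(
        X_tfidf, y, test_size=0.20, random_state=42)

The four classifiers that we use are described in the following: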
- Random Forest: defined as a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, ...}, where the {Θk} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. When a large number of trees is generated, each of them votes for a class, and the winner is the class with the most votes (Breiman, 2001). For this study, we evaluate the Random Forest classifier with different numbers of trees used to construct the decision forest; in particular, we test the classifier with 50, 100, 200 and 400 trees.
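A minimal sketch of this sweep over tree counts, assuming scikit-learn's RandomForestClassifier and the split from above (accuracy is our choice of metric for illustration):

    from sklearn.ensemble import RandomForestClassifier

    # Try the four forest sizes tested in the paper.
    for n_trees in (50, 100, 200, 400):
        clf = RandomForestClassifier(n_estimators=n_trees)
        clf.fit(X_train, y_train)
        print(n_trees, clf.score(X_test, y_test))  # accuracy on the test set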
- Naive Bayes: a probabilistic machine learning classifier based on the Bayes Theorem with an assumption of independence among predictors; in other words, the algorithm assumes that the presence of a feature in a class is independent of any other feature (Ahmad, Aftab and Muhammad, 2017). For this study we evaluate two variants: Multinomial and Bernoulli.
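A sketch of both variants, again assuming scikit-learn; note that BernoulliNB internally binarizes the tf-idf features into presence/absence:

    from sklearn.naive_bayes import BernoulliNB, MultinomialNB

    for nb in (MultinomialNB(), BernoulliNB()):
        nb.fit(X_train, y_train)
        print(type(nb).__name__, nb.score(X_test, y_test))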
- Support Vector Machine: a supervised learning model which can achieve good results in text categorization. Basically, this classifier locates the best possible boundary separating the positive and negative training samples (Ahmad, Aftab and Muhammad, 2017). For this study, we evaluate two distinct kernels for the Support Vector Machine: RBF and linear (Minzenmayer et al., 2014).
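A sketch of the two kernels, assuming scikit-learn's SVC with all other hyper-parameters left at their defaults (our assumption, as the paper does not report them):

    from sklearn.svm import SVC

    for kernel in ("rbf", "linear"):
        svm = SVC(kernel=kernel)
        svm.fit(X_train, y_train)
        print(kernel, svm.score(X_test, y_test))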
- Decision Trees: an algorithm that uses a tree structure to predict the outcome of an instance. Essentially, a test node computes an outcome based on the attribute values of the instance, where each possible outcome is associated with one of the subtrees. The process of classifying an instance starts at the root node of the tree. If the root node is a test, the outcome for the instance is computed and classification continues in the subtree associated with that outcome, until a leaf node assigns the final class.
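A corresponding sketch with scikit-learn's DecisionTreeClassifier (hyper-parameters left at their defaults, which is our assumption):

    from sklearn.tree import DecisionTreeClassifier

    # Each internal node tests a feature; an instance is routed down the
    # matching subtree until a leaf assigns its class.
    tree = DecisionTreeClassifier()
    tree.fit(X_train, y_train)
    print(tree.score(X_test, y_test))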