3.1 Description of the Dataset
The dataset chosen for this investigation
(Sentiment analysis with hotel reviews | 515K Hotel
Reviews Data in Europe | Kaggle, 2017) contains
515,000 customer reviews and scores for 1,493
luxury hotels across Europe. The geographical
location of each hotel is also provided for
further analysis.
This dataset presents seventeen attributes (“Hotel
Address”, “Additional Number of Scoring”, “Review
Date”, “Average Score”, “Hotel Name”, “Reviewer
Nationality”, “Negative_Review”, “Review Total
Negative Word Counts”, “Total Number of Reviews”,
“Positive_Review”, “Review Total Positive Word
Counts”, “Total Number of Reviews Reviewer Has
Given”, “Reviewer Score”, “Tags”, “Days Since
Review”, “LAT”, “LNG”), but we will use only two of
them, “Positive_Review” and “Negative_Review”,
because this work only requires the review text as
training data, so the other attributes are not
necessary for this investigation.
For the classification task, we select Positive_Review
and Negative_Review and assign them the labels
positive and negative, respectively. Each user's
review text goes into a string called review, and
reviewers who appear in the dataset without having
written any review are removed from the training
dataset.
Here is an example of one review from this dataset.
“This hotel is awesome I took it sincerely because a
bit cheaper but the structure seem in an hold church
close to one awesome park Arrive in the city are like
10 minutes by tram and is super easy The hotel inside
is awesome and really cool and the room is incredible
nice with two floor and up one super big comfortable
room I’ll come back for sure there The staff very
gentle one Spanish man really good.”
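The labelling step described above can be sketched in Python as follows. The toy rows and the placeholder strings "No Positive" / "No Negative" for missing review text are illustrative assumptions, not guaranteed to match the study's exact handling:

```python
# Toy rows standing in for the Kaggle dataset; the placeholder strings
# "No Positive" / "No Negative" for missing text are an assumption.
rows = [
    {"Positive_Review": "Great location", "Negative_Review": "Noisy at night"},
    {"Positive_Review": "No Positive", "Negative_Review": "Rude staff"},
]

samples = []
for row in rows:
    for column, label in (("Positive_Review", "positive"),
                          ("Negative_Review", "negative")):
        review = row[column].strip()
        # Skip reviewers who left no text for this side of the review.
        if review and review not in ("No Positive", "No Negative"):
            samples.append({"review": review, "label": label})

print(len(samples))  # 3 labelled samples
```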
3.2 Pre-processing of the Text
Online text usually contains a lot of noise and
uninformative parts such as HTML tags, scripts,
advertisements, and punctuation. We therefore need
to apply a cleaning process that removes this kind
of noise in order to obtain better classification
results (Haddi, Liu and Shi, 2013).
The first step of this process is to convert all
instances of the dataset to lowercase, which allows
the words to be compared consistently across all the
models that are created. Then, we remove HTML tags
and punctuation. After that, we can either remove
stop words or apply our created domains; in both
cases, we then need to remove the reviews left empty
and stem the text of the remaining reviews. We now
explain stemming and stop-word removal in detail,
and then discuss the use of our created domains:
Stemming: this process reduces words to their stems.
For example, the two words "fishing" and "fished"
are both reduced to the stem "fish" by this process.
In this experimental study, we use the Porter
Stemmer because it is one of the most popular
English rule-based stemmers. Various studies have
shown that stemming helps to improve the quality of
the language model (Allan and Kumaran, 2003;
Brychcín and Konopík, 2015), and this improvement
carries over to the classification task in which
the model is used.
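As a rough illustration, suffix stripping can be sketched as below. This is not the full Porter algorithm, which applies many more ordered rules and conditions; in practice one would use an existing implementation such as NLTK's PorterStemmer:

```python
def simple_stem(word):
    """Very simplified suffix stripping in the spirit of the Porter
    stemmer; the real algorithm applies many more ordered rules."""
    word = word.lower()
    for suffix in ("sses", "ies", "ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if suffix == "sses":
                stem += "ss"   # e.g. "caresses" -> "caress"
            elif suffix == "ies":
                stem += "i"    # e.g. "ponies" -> "poni"
            return stem
    return word

print(simple_stem("fishing"), simple_stem("fished"))  # fish fish
```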
Removing Stop Words: stop-word removal is a
standard technique in text categorization (Yang et al.,
2007). It relies on a list of commonly used words,
such as articles and prepositions, which carry no
information for our classification task, so we
remove them from the text. For this experimental
study, we use a list of common English words that
includes about 100-200 words.
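Stop-word removal amounts to a simple set-membership filter. The list below is an illustrative subset; the one used in the study contains roughly 100-200 common English words:

```python
# Illustrative subset of a stop-word list.
STOP_WORDS = {"a", "an", "the", "in", "on", "to", "of", "and", "is", "was"}

def remove_stop_words(text):
    # Keep only the words that are not in the stop-word list.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

print(remove_stop_words("the room is close to a park"))  # room close park
```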
Created Domains: these domains are what make the
difference in our study. We decided to create two
word domains: the first, “Hotel_Domain”, with
596 common hotel-related words, and the second,
“Adjectives”, with 197 adjectives that can be used
to describe hotels. We use these word domains like
the stop-word list, either removing those words from
the text or restricting the text to only those
words, so that we can compare how word restriction
affects Sentiment Analysis and Text Classification.
Eliminate Empty Reviews: since we use our domains to
restrict the text in this pre-processing task, some
reviews end up empty, so we must remove them from
the data considered for training and testing.
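The domain restriction and the subsequent removal of empty reviews can be sketched as follows; the five-word HOTEL_DOMAIN here is a toy stand-in for the 596-word domain used in the study:

```python
# Toy stand-in for the 596-word "Hotel_Domain".
HOTEL_DOMAIN = {"room", "staff", "hotel", "breakfast", "location"}

def restrict_to_domain(review, domain):
    """Keep only the words that belong to the given domain."""
    return " ".join(w for w in review.split() if w in domain)

reviews = ["the room was clean", "great staff and location", "loved it"]
restricted = [restrict_to_domain(r, HOTEL_DOMAIN) for r in reviews]
# Eliminate the reviews left empty by the restriction.
restricted = [r for r in restricted if r]
print(restricted)  # ['room', 'staff location']
```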
Text Transformation (TF-IDF): TF-IDF assigns a value
to each word in a document in proportion to the
word's frequency within that document and in inverse
proportion to the percentage of documents in which
the word appears (Medina and Ramon, 2015). Therefore,
this weighting gives more relevance to terms that
are frequent in a document but rare across the
collection than to terms that appear frequently
everywhere. This text transformation step must be
used because machine learning algorithms cannot work
with raw text features.
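A minimal sketch of one common TF-IDF variant is given below; libraries such as scikit-learn add smoothing and normalization, so exact values differ, but the proportionality is the same:

```python
import math

docs = [
    "great hotel great staff",
    "small room",
    "great location",
]

def tf_idf(term, doc, docs):
    words = doc.split()
    tf = words.count(term) / len(words)             # term frequency
    df = sum(1 for d in docs if term in d.split())  # document frequency
    return tf * math.log(len(docs) / df)            # tf * idf

# "great" is more frequent in docs[0], but it appears in 2 of the 3
# documents, while "staff" appears in only 1, so "staff" scores higher.
print(round(tf_idf("great", docs[0], docs), 3))  # 0.203
print(round(tf_idf("staff", docs[0], docs), 3))  # 0.275
```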
3.3 Experiment Models
We use 7 experiment models of word restriction,
which we describe below:
KDIR 2019 - 11th International Conference on Knowledge Discovery and Information Retrieval