MapReduce Algorithm for Big Data Sentiment
Analysis on Twitter implemented in Hadoop (White,
2012), the open source MapReduce implementation
(Dean and Ghemawat, 2004). Our algorithm exploits
the hashtags and emoticons inside a tweet as
sentiment labels, in order to avoid the time-intensive
manual annotation task. After that, we build the feature
vectors of the training and test sets and proceed to a
classification procedure in a fully distributed manner
using an AkNN query. Additionally, we encode features
using Bloom filters to compress the storage space of
the feature vectors. Through an extensive experimental
evaluation we show that our solution is efficient,
robust and scalable, and we confirm the quality of our
sentiment identification.
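To illustrate the idea of encoding features with Bloom filters, the following is a minimal Python sketch and not the implementation used in this paper. The class name `BloomFilter`, the helper `encode_features`, the filter size of 64 bits, the three hash functions, and the use of salted SHA-256 are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over string features (illustrative sketch)."""

    def __init__(self, num_bits=64, num_hashes=3):
        # num_bits and num_hashes are assumed values, not taken from the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit vector

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def encode_features(features, num_bits=64, num_hashes=3):
    """Compress a collection of string features into one Bloom filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    for feature in features:
        bloom.add(feature)
    return bloom
```

Under these assumptions, a feature vector of arbitrary length is stored in a fixed number of bits. Membership tests never produce false negatives but may produce false positives, which is the usual trade-off for the saved space.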
The rest of the paper is organized as follows: in
Section 2 we discuss related work and in Section 3 we
present how our algorithm works. After that, we pro-
ceed to the experimental evaluation of our approach in
Section 4, while in Section 5 we conclude the paper
and present future steps.
2 RELATED WORK
Early opinion mining studies focus on document level
sentiment analysis concerning movie or product re-
views (Hu and Liu, 2004; Zhuang et al., 2006) and
posts published on webpages or blogs (Zhang et al.,
2007). Similarly, many efforts have been made towards
sentence level sentiment analysis (Wilson
et al., 2009; Yu and Hatzivassiloglou, 2003) which
examines phrases and assigns to each one of them a
sentiment polarity (positive, negative, neutral).
Many researchers confront the problem of sen-
timent analysis by applying machine learning ap-
proaches and/or natural language processing tech-
niques. In (Pang et al., 2002), the authors em-
ploy three machine learning techniques to classify
movie reviews as positive or negative. On the other
hand, the authors in (Nasukawa and Yi, 2003) in-
vestigate the proper identification of semantic rela-
tionships between the sentiment expressions and the
subject within online articles. Moreover, the method
described in (Ding and Liu, 2007) proposes a set of
linguistic rules together with a new opinion aggrega-
tion function to detect sentiment orientations in online
product reviews.
Recently, Twitter has received much attention
for sentiment analysis, as it provides a source of massive
user-generated content that captures a wide range
of published opinions. In (Barbosa and Feng,
2010), the authors propose a 2-step classifier that sep-
arates messages as subjective and objective, and fur-
ther distinguishes the subjective tweets as positive or
negative. The approach in (Davidov et al., 2010)
exploits the hashtags and smileys in tweets and evaluates
the contribution of different features (e.g. unigrams)
together with a kNN classifier. In this paper, we adopt
this approach and create a parallel and distributed ver-
sion of the algorithm for large scale Twitter data. A
three-step classifier is proposed in (Jiang et al., 2011)
that follows a target-dependent sentiment classifica-
tion strategy. Moreover, the authors in (Wang et al.,
2011) perform a topic sentiment analysis in Twitter
data through a graph-based model. A more recent
approach (Yamamoto et al., 2014) investigates the role
of emoticons for multidimensional sentiment analysis
of Twitter by constructing a sentiment and emoticon
lexicon. A large scale solution is presented in (Khuc
et al., 2012) where the authors build a sentiment lexi-
con and classify tweets using a MapReduce algorithm
and a distributed database model. Although the
classification performance is quite good, the construction of
the sentiment lexicon is time-consuming. Our approach is
much simpler and, to the best of our knowledge, we are the
first to present a robust large scale approach for
opinion mining on Twitter data without the need to
build a sentiment lexicon or perform any manual
data annotation.
3 MR-SAT APPROACH
Assume a set of hashtags $H = \{h_1, h_2, \ldots, h_n\}$ and
a set of emoticons $E = \{em_1, em_2, \ldots, em_m\}$
associated with a set of tweets $T = \{t_1, t_2, \ldots, t_l\}$ (training
set). Each $t \in T$ carries only one sentiment label
from $L = H \cup E$. This means that tweets containing
more than one label from $L$ are not candidates
for $T$, since their sentiment tendency may be vague.
However, there is no limitation on the number of
hashtags or emoticons a tweet can contain, as long as they
are non-conflicting with $L$. Given a set of unlabelled
tweets $TT = \{tt_1, tt_2, \ldots, tt_k\}$ (test set), we aim to
infer the sentiment polarities $p = \{p_1, p_2, \ldots, p_k\}$ for
$TT$, where $p_i \in L \cup \{neu\}$ and $neu$ means that the
tweet carries no sentiment information. We build a
tweet-level classifier $C$ and adopt a kNN strategy to
decide the sentiment tendency $\forall tt \in TT$. We
implement $C$ by adapting an existing MapReduce
classification algorithm based on AkNN queries (Nodarakis
et al., 2014), as described in Subsection 3.3.
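The rule for selecting training tweets can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's MapReduce implementation: the label sets `HASHTAG_LABELS` and `EMOTICON_LABELS` are hypothetical placeholders, the function names are invented for this example, and tweets are tokenized by simple whitespace splitting.

```python
# Hypothetical label sets; the hashtags and emoticons actually used as
# sentiment labels are not listed in this section.
HASHTAG_LABELS = {"#happy", "#sad"}
EMOTICON_LABELS = {":)", ":("}
LABELS = HASHTAG_LABELS | EMOTICON_LABELS  # L = H ∪ E


def sentiment_labels(tweet):
    """Return the distinct labels from L that occur in the tweet.

    Assumes whitespace tokenization, so '#happy!' would not match '#happy'.
    """
    tokens = tweet.lower().split()
    return {token for token in tokens if token in LABELS}


def build_training_set(tweets):
    """Keep only tweets carrying exactly one label from L.

    Tweets with no label or with more than one distinct label are skipped.
    Hashtags or emoticons outside L do not affect the decision.
    """
    training = []
    for tweet in tweets:
        found = sentiment_labels(tweet)
        if len(found) == 1:
            training.append((tweet, next(iter(found))))
    return training
```

For instance, with these placeholder labels, "great day #happy #sunny" is kept with label `#happy` because `#sunny` is not in $L$, whereas "mixed feelings :) :(" is discarded because it carries two labels from $L$.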
3.1 Feature Description
In this subsection, we present in detail the features
used in order to build classifier C. For each tweet we