name, regardless of the tweets' sentiment. As a result
of these errors, other researchers have even
questioned the validity of social media content for
forecasting events and movements (Metaxas, 2010;
Mustafaraj, 2011; Metaxas, 2011).
In this paper, we develop an accurate method for
mining election-relevant data to produce a
statistically sound prediction of the outcome. We have
gathered a reliable large-scale dataset from Twitter
and Google Trends search interest, which is highly
correlated with the real trends of the 2016 US
election. We apply Gaussian process regression to
produce weekly predictions. Unlike other papers, this
model predicts the candidates’ vote shares instead of
an absolute winner. The paper proceeds as follows.
Section 2 describes our method for predicting a
large-scale election. Section 3 applies the method to
data from the 2016 US election, and Section 4 gives
concluding remarks.
2 THE METHOD
Four main steps are followed in this method. First, a
uniformly sampled large dataset of tweets is gathered.
This data is then processed and augmented by adding
sentiment information to each tweet, collecting
relevant keyword data from Google Trends, and
arranging the results of various online polls. The
authenticity of this data is then checked with a
correlation test. Finally, a feature matrix is created
and the Gaussian process regression model is trained.
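As an illustration of the correlation check, the following Python sketch computes a correlation coefficient between two weekly series, for example a social media signal and the poll results. The use of the Pearson coefficient, the function name `pearson`, and the sample values are assumptions made for illustration; the specific test is not fixed by the description above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly values, not taken from the dataset.
weekly_signal = [0.42, 0.45, 0.47, 0.44, 0.49]
weekly_polls = [44.0, 45.5, 46.0, 45.0, 47.0]
r = pearson(weekly_signal, weekly_polls)
```

A value of `r` close to 1 would indicate that the collected signal moves together with the polls.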
2.1 Data Collection
Social and political events often have a short time
span and great complexity. As mentioned in DiGarzia
(2013), large datasets of online social content must be
used to achieve accurate results. The online data
sources used in this paper are Twitter and Google
Trends, as well as online election polls published by
polling firms and news outlets, such as HuffPost
Pollster. These polls are refined and later used as
labels when training the statistical model. Because
the surveys are scattered over time, the polls are
arranged chronologically and a final poll result is
calculated for each week as the weighted sum of the
surveys held in that week.
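The weekly aggregation of polls can be sketched in Python as follows. The weighting scheme is an assumption: here each survey is weighted by its sample size and the weights are normalized within the week, since the weights are not specified above. The function name `weekly_poll_labels` and the poll values are hypothetical.

```python
from collections import defaultdict
from datetime import date

def weekly_poll_labels(polls, start):
    """Combine scattered polls into one weighted value per week.

    polls: iterable of (poll_date, vote_share_percent, weight) tuples.
    start: first day of week 1.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # week -> [weighted sum, weight sum]
    for poll_date, share, weight in polls:
        week = (poll_date - start).days // 7 + 1
        totals[week][0] += weight * share
        totals[week][1] += weight
    # Normalizing by the weight sum is an assumption of this sketch.
    return {week: s / w for week, (s, w) in sorted(totals.items())}

# Hypothetical polls: (date, vote share in percent, sample size used as weight).
polls = [
    (date(2016, 6, 1), 44.0, 1000),
    (date(2016, 6, 3), 46.0, 3000),
    (date(2016, 6, 9), 45.0, 500),
]
labels = weekly_poll_labels(polls, start=date(2016, 6, 1))
```

The resulting dictionary maps each week index to a single label for that week.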
The data has been gathered from public tweets
containing the candidates’ names at a high
sampling rate of 1000 tweets per day per candidate
during the active election months (about 6 months for
the US election). The method was also applied to a
dataset of 100 tweets per day per candidate, which
produced unsatisfactory results. Around 370,000
tweets were gathered; about 70,000 of them are
repeated tweets that contain both candidates’ names,
and these were removed, leaving a final dataset of
300,000 tweets. Contrary to Sang (2012), the number
of tweets containing a candidate’s name does not
necessarily reflect the users’ voting intentions. Thus,
the tweets’ sentiment needs to be taken into account.
Table 1 illustrates this with an example in which the
first user is unlikely to vote for Clinton.
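The removal of tweets that contain both candidates’ names can be sketched in Python as follows. The name lists in `CANDIDATE_NAMES`, the function names, and the sample tweets are hypothetical, and plain substring matching is a simplifying assumption of this sketch.

```python
# Hypothetical name variants per candidate.
CANDIDATE_NAMES = {
    "clinton": ("clinton", "hillary"),
    "trump": ("trump",),
}

def mentioned_candidates(text):
    """Return the set of candidates whose name appears in the tweet."""
    lowered = text.lower()
    return {
        candidate
        for candidate, names in CANDIDATE_NAMES.items()
        if any(name in lowered for name in names)  # naive substring match
    }

def keep_single_candidate(tweets):
    """Keep only tweets that mention exactly one candidate."""
    kept = []
    for text in tweets:
        candidates = mentioned_candidates(text)
        if len(candidates) == 1:
            kept.append((next(iter(candidates)), text))
    return kept

tweets = [
    "Clinton rally tonight",
    "Trump vs Clinton debate",  # mentions both candidates, so it is dropped
    "Trump speaks in Ohio",
]
kept = keep_single_candidate(tweets)
```

Each remaining tweet is paired with the single candidate it mentions, so that its sentiment can later be attributed to that candidate.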
The sentiment of a sentence can be analyzed using
the grammatical structure and the choice of words.
The RNTN algorithm (Socher, 2013) can determine
the sentiment of a phrase as positive or negative with
an accuracy rate of 80.7%. Due to processing
limitations, a simpler algorithm is used in our
experiment (Bose, 2017; Rinker, 2017).
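To illustrate what a simpler, word-based approach may look like, the following Python sketch scores a tweet with a small polarity lexicon and flips the polarity of a word that directly follows a negator. This is a generic sketch and not the cited algorithms; the word lists and the function names `polarity` and `label` are hypothetical.

```python
import re

# Tiny hypothetical lexicons; a real lexicon would be far larger.
POSITIVE = {"well", "calm", "reasonable", "good", "great"}
NEGATIVE = {"crooked", "bad", "corrupt", "liar"}
NEGATORS = {"not", "no", "never"}

def polarity(text):
    """Sum word polarities, flipping a word that directly follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, token in enumerate(tokens):
        value = (token in POSITIVE) - (token in NEGATIVE)
        if value and i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score

def label(text):
    """Map the polarity score to a sentiment label."""
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

With this lexicon, `label("not good")` yields a negative label because the negator flips the polarity of the word that follows it.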
After eliminating common terms, frequent
hashtags and words are extracted from the Twitter
data and manually grouped into meaningful word
sets (26 sets in our case). Each set contains
election-relevant terms that are used frequently in
tweets. The word representing each set is called a
‘keyword’. This classification is done using common
knowledge of election events. Table 2 explains this
process with an example.
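The extraction and grouping steps can be sketched in Python as follows. The stop-word list, the two word sets in `KEYWORD_SETS`, and the sample tweets are hypothetical; in practice the grouping is done manually, so the mapping below only stands in for the result of that manual step.

```python
import re
from collections import Counter

# Hypothetical list of common terms to eliminate.
STOPWORDS = {"the", "a", "to", "of", "and", "in", "is", "for", "on"}

# Hypothetical manually built word sets: keyword -> terms in its set.
KEYWORD_SETS = {
    "email": {"email", "emails", "#emailgate", "server"},
    "debate": {"debate", "#debatenight", "#debates"},
}

def tokenize(text):
    """Lower-case the tweet and split it into words and hashtags."""
    return re.findall(r"#?\w+", text.lower())

def frequent_terms(tweets, top_n):
    """Most frequent words and hashtags after removing common terms."""
    counts = Counter()
    for text in tweets:
        counts.update(t for t in tokenize(text) if t not in STOPWORDS)
    return counts.most_common(top_n)

def keyword_counts(tweets):
    """Number of tweets that contain at least one term of each word set."""
    term_to_keyword = {
        term: keyword for keyword, terms in KEYWORD_SETS.items() for term in terms
    }
    counts = Counter()
    for text in tweets:
        counts.update({term_to_keyword[t] for t in tokenize(text) if t in term_to_keyword})
    return counts

tweets = [
    "Emails on the server again #EmailGate",
    "Watching the debate #DebateNight",
    "More emails",
]
top = frequent_terms(tweets, top_n=1)
per_keyword = keyword_counts(tweets)
```

Counting each tweet at most once per keyword prevents a single tweet with several terms from the same set from inflating that keyword.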
The keywords are later used as search queries for
collecting the Google Trends (2017) data. Google
Trends returns a vector of ‘search interest factor’
values, which represents the popularity of a search
query over time. Assuming $k$ to be a keyword, we
define:

$$g_k(t) \triangleq \text{Google Trends search interest for keyword } k \text{ in week } t, \quad t \in \{1, \dots, T\}, \tag{1}$$

where $T$ is the total number of weeks in the dataset.
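As a sketch of how the weekly search interest vectors of equation (1) may be arranged for later use, the following Python function stacks one vector per keyword into a matrix with one row per week and one column per keyword. The function name `trends_matrix`, the layout, and the sample values are assumptions; the values are hypothetical and were not retrieved from Google Trends.

```python
def trends_matrix(interest, keywords):
    """Arrange weekly search interest into rows (weeks) and columns (keywords).

    interest: dict mapping each keyword to its list of weekly values.
    keywords: column order of the resulting matrix.
    """
    lengths = {len(interest[k]) for k in keywords}
    if len(lengths) != 1:
        raise ValueError("all keywords must cover the same number of weeks")
    num_weeks = lengths.pop()
    return [[interest[k][week] for k in keywords] for week in range(num_weeks)]

# Hypothetical search interest values for three weeks.
interest = {
    "email": [10, 40, 25],
    "debate": [5, 80, 30],
}
matrix = trends_matrix(interest, ["email", "debate"])
```

Row `week` of `matrix` then holds the search interest of every keyword in that week.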
Table 1: An example showing that not all tweets containing a
candidate’s name are posted by that candidate’s supporters.
Sentiment | Tweet
Negative | Crooked Hillary: Not In The Pocket Of Anyone After Receiving $6 Million From Soros #WakeUpAmerica
Positive | I thought Hillary did well on #60Minutes. So calm and reasonable. Such a change from the Republican'ts.