from the source to filtering the data, pre-processing, sentiment analysis, and reporting the results. VADER is used to analyse the sentiments and classify the tweets as positive, negative, or neutral.
3.1 Data Collection
In this research, 'Tweepy', a Python library, is used to extract data from Twitter. Tweepy was chosen over its closest rival, Twint, because Tweepy is the official Python library authorized to access the Twitter API.
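As an illustration, a minimal Tweepy sketch for authenticating and pulling recent tweets is given below; the credential values are placeholders, the query is an example, and the call name follows Tweepy v4 (older versions use api.search instead of api.search_tweets).

```python
import tweepy

# Placeholder credentials obtained from the Twitter developer portal
CONSUMER_KEY = "YOUR_CONSUMER_KEY"
CONSUMER_SECRET = "YOUR_CONSUMER_SECRET"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
ACCESS_TOKEN_SECRET = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate with the Twitter API through Tweepy
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Pull a batch of recent English tweets for an example hashtag
for status in tweepy.Cursor(api.search_tweets, q="#Election2020", lang="en",
                            tweet_mode="extended").items(100):
    print(status.id, status.full_text[:80])
```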
The Twitter API enables developers to extract user tweets. However, it restricts extraction to tweets from roughly the past three months, which explains why our extraction covered only one month. The data is returned in JSON format, and a Python script was used to parse the list of key-value pairs and extract the required keys from the JSON schema.
The downloaded data consists of 1.5 million tweets collected over three weeks. This research focused on tweets that mentioned the two running candidates from the Democratic and Republican parties. The data was extracted from JSON objects, a semi-structured data interchange file format.
The information obtained from each JSON object includes hashtags, profile image URL, counts of followers, friends and statuses, tweet text, and locale, among other useful variables describing the tweet and the user profile.
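A minimal sketch of this kind of parsing script is shown below; the field names follow the standard Twitter v1.1 tweet object, while the file name and the helper extract_fields are illustrative assumptions rather than the exact script used in this study.

```python
import json

def extract_fields(raw_tweet: str) -> dict:
    """Parse one tweet JSON string and keep only the keys needed for analysis."""
    tweet = json.loads(raw_tweet)
    user = tweet.get("user", {})
    return {
        "text": tweet.get("full_text", tweet.get("text", "")),
        "hashtags": [h["text"] for h in tweet.get("entities", {}).get("hashtags", [])],
        "profile_image_url": user.get("profile_image_url_https"),
        "followers_count": user.get("followers_count"),
        "friends_count": user.get("friends_count"),
        "statuses_count": user.get("statuses_count"),
        "locale": user.get("location"),
    }

# Assumes one JSON tweet object per line in the downloaded file
with open("tweets.json", encoding="utf-8") as fh:
    records = [extract_fields(line) for line in fh]
```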
The lists below show the keywords used for extracting user tweets related to each party and leader; a sketch of how such hashtags can be combined into a search query appears after the lists. We used the # (Explore) feature to identify the unique hashtags for the candidates.
For Donald Trump and Mike Pence (Republican Party):
(#Republican, #DonaldTrump, #voting, #Trump, #HarrisCounty, #MAGA2020, #TrumpIsANationalDisgrace, #TrumpVirusDeathToll, #TrumpCovid, #EndTrumpChaos, #TrumpTaxReturns, #DumpTrump2020, #TrumpLies)
For Joe Biden and Kamala Harris (Democratic Party):
(#BidenLies, #VoteBlueToSaveAmerica, #VoteBlue, #VoteBidenHarris2020, #BidenHarrisToSaveAmerica)
For anything else related to elections, we used the
following keywords:
(#ExGOP, #GOPSuperSpreaders, #AmericasGreatestMistake, #TrumpVsBiden, #PresidentialDebate, #PresidentialDebate2020, #Election2020, #TrumpBidenDebate, #Propaganda, #USPresidentialDebate2020, #USElection2020, #BountyGate, #BLM, #BlackLivesMatter, #Elections, #VoteLibertarian, #Opinion, #CountryOverParty, #nhPolitics, #VoteForAmerica, #PlatinumPlan, #PresidentialDebates2020, #Debate, #SuperSpreader)
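As referenced above, a hedged sketch of how these hashtag lists might be combined into a Tweepy search query is given below; the OR-joined query string, the subset of tags shown, and the variable names are illustrative, and the sketch assumes the authenticated api object from the earlier example.

```python
import tweepy

# Subsets of the hashtag lists above; the full lists can be substituted directly
republican_tags = ["#Republican", "#DonaldTrump", "#MAGA2020", "#TrumpLies"]
democratic_tags = ["#BidenLies", "#VoteBlue", "#VoteBidenHarris2020"]
general_tags = ["#Election2020", "#PresidentialDebate2020", "#TrumpVsBiden"]

# Twitter's standard search accepts OR-joined terms as a single query string
query = " OR ".join(republican_tags + democratic_tags + general_tags)

# `api` is the authenticated tweepy.API object from the earlier sketch
tweets = [
    status._json  # keep the raw JSON object for later key extraction
    for status in tweepy.Cursor(api.search_tweets, q=query, lang="en",
                                tweet_mode="extended").items(500)
]
```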
3.2 Data Pre-Processing
Natural Language Processing (NLP): This is a field of study at the intersection of computer science and linguistics. Computers use NLP to extract meaning from natural human language. Unstructured text is processed through a sequence of NLP steps and then analysed for word polarity using lexical resources such as WordNet and SentiWordNet. The mechanisms used in extracting information include word stemming and lemmatization, stop-word analysis, word tokenization, and word sense disambiguation, among others.
Natural Language Toolkit (NLTK): This is a free, open-source Python package that provides tools for building programs and classifying data. NLTK offers easy-to-use interfaces to over 50 corpora and lexical resources, together with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
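A hedged sketch of the kind of pre-processing pipeline these NLTK tools enable is shown below; the exact steps and stop-word list used in this study are not specified, so the sample tweet and function name are illustrative.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the required NLTK resources
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

def preprocess(tweet: str) -> list:
    """Tokenize a tweet, drop stop words, and lemmatize the remaining tokens."""
    tokens = word_tokenize(tweet.lower())
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

print(preprocess("The candidates are debating the economy tonight!"))
# ['candidate', 'debating', 'economy', 'tonight']
```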
Figure 2: Word cloud for Donald Trump and Joe Biden.
Figure 2 shows a word cloud, a visualization in which more frequent words appear larger and less frequent words appear smaller. It is used to identify the most commonly used words in the tweets.
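The paper does not state which tool produced Figure 2; a common approach, sketched below with sample text, uses the wordcloud package on the concatenated tweet text.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Sample tweet texts; in practice this would be the full extracted corpus
tweets = ["Vote in the election", "The debate was heated", "Election results tonight"]
text = " ".join(tweets)

# Build and display a word cloud of the most frequent terms
cloud = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```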
VADER Sentiment Analysis: VADER is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed on social media. It scores how positive or negative a piece of text is and labels it as positive, negative, or neutral. VADER is available in the NLTK