3.4 Sentiment Analysis
Once preprocessing was finished, the next step in the dataset creation was to analyse the emotions contained in the preprocessed texts. For this purpose, we developed a tool in Python that computes the frequency of each word in a preprocessed text and looks each word up in an emotional lexicon; a word found in the lexicon is taken to carry emotional content. Using the NRC Emotion Lexicon (Mohammad and Turney, 2013) and the NRC Affect Intensity Lexicon (Mohammad, 2018), we counted the frequencies of all emotional words according to Plutchik's emotion model and extracted the distribution of each emotion in a text. For affect intensity, the approach was to sum all intensities of each emotion over each preprocessed text.
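The lexicon lookup described above can be sketched as follows. The miniature lexicons here are placeholders for illustration only; the real NRC files map thousands of words to emotion labels and to intensity scores:

```python
from collections import Counter

# Hypothetical miniature lexicons standing in for the NRC files.
NRC_EMOTION = {"sad": {"sadness"}, "happy": {"joy"}, "fear": {"fear"}}
NRC_INTENSITY = {"sad": {"sadness": 0.72}, "fear": {"fear": 0.8}}

def emotion_counts(tokens):
    """Count occurrences of each emotion via lexicon lookup."""
    counts = Counter()
    for word, freq in Counter(tokens).items():
        for emotion in NRC_EMOTION.get(word, ()):
            counts[emotion] += freq
    return counts

def intensity_sums(tokens):
    """Sum the affect intensities of each emotion over the text."""
    sums = Counter()
    for word, freq in Counter(tokens).items():
        for emotion, score in NRC_INTENSITY.get(word, {}).items():
            sums[emotion] += score * freq
    return sums
```

Per text, `emotion_counts` yields the emotion distribution and `intensity_sums` the summed intensities described above.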
Later, we computed the mean emotional profile of each depressive Twitter author (TAMEP) to validate the depressive authors (and filter out fake self-identified depressives). This measure is calculated by normalizing each emotion (the sum of each emotion divided by the sum of all emotions). It results in a 12-dimension vector with values between 0 and 1, representing the average of each emotion/intensity in a post.
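A minimal sketch of this normalization, assuming the per-author sums are already arranged as a 12-dimensional vector (the ordering of emotions and intensities is not specified in the text):

```python
import numpy as np

def mean_emotional_profile(emotion_sums):
    """Normalize summed emotion/intensity values into a vector in [0, 1]."""
    sums = np.asarray(emotion_sums, dtype=float)
    total = sums.sum()
    # Each component becomes its share of the total, so the vector sums to 1.
    return sums / total if total > 0 else sums
```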
For the emotional profile validation of depressive authors, we created a depressive emotional baseline using the solution presented by Kim et al. (2020), which analysed Reddit's r/depression community as a source of depressive texts. In our case, we computed the mean emotional profile (RMEP) of the last 600 depressive Reddit posts and performed the same steps described earlier to create the emotional profile validator.
To validate whether a Twitter author has a genuinely depressive profile, we performed a hypothesis test between each author's TAMEP and the RMEP. When the null hypothesis could not be rejected (i.e., the p-value between TAMEP and RMEP was greater than or equal to 0.05), the author profile in TAMEP was considered depressive.
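The text does not name the specific hypothesis test; as one plausible reading, a paired t-test over the 12 profile components could implement this check, with `alpha` matching the 0.05 threshold above:

```python
from scipy import stats

def is_depressive(tamep, rmep, alpha=0.05):
    """Treat an author as depressive when we cannot reject the null
    hypothesis that their profile matches the Reddit baseline.
    The choice of a paired t-test is an assumption, not stated in the text."""
    _, p_value = stats.ttest_rel(tamep, rmep)
    return p_value >= alpha  # fail to reject H0 -> profiles are similar
```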
On the other hand, to identify the non-depressive profiles, we measured the cosine distance between each non-depressive author's profile in TAMEP and the RMEP. Since each emotional profile is a 12-dimensional vector (Plutchik's eight basic emotions plus the four affect intensities), we defined as non-depressive the profiles whose cosine distance between TAMEP and RMEP is less than or equal to 0.
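Cosine distance here is the standard 1 minus cosine similarity between the two profile vectors; a minimal sketch:

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance (1 - cosine similarity) between two profile vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```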
This approach resulted in a set of 1788 authors (947 classified as non-depressive and 841 as depressive) and 492,178 collected tweets.
Finally, we created a dataset containing the mean of all emotions and intensities grouped by author and trimester, where each author must have posted at least 150 messages. This minimum of 150 messages was defined empirically: it is necessary to collect as many messages as possible, and authors with fewer messages could bias the analysis by providing too little information about their emotions.
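Assuming one row per tweet with its emotion/intensity scores, this grouping could be sketched with pandas. The column names are hypothetical, and since the text does not say whether the 150-message minimum applies per author overall or per trimester, per author is assumed here:

```python
import pandas as pd

def build_dataset(tweets: pd.DataFrame, min_messages: int = 150) -> pd.DataFrame:
    """Average emotion/intensity scores per (author, trimester),
    keeping only authors with at least `min_messages` tweets."""
    counts = tweets.groupby("author")["text"].transform("count")
    kept = tweets[counts >= min_messages]
    score_cols = [c for c in tweets.columns
                  if c not in ("author", "trimester", "text")]
    return (kept.groupby(["author", "trimester"])[score_cols]
                .mean().reset_index())
```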
This approach - known as bag-of-words - was adopted because we intended to identify which emotions and intensities are relevant for detecting depression, besides creating a “depressive emotional fingerprint” from the words used in the texts.
Some recent NLP techniques - such as word vectors and transfer learning - were discarded in this study due to the nature of detecting the problem in the real world. Since a depressive person's comments tend to be characterized by negative emotions in almost all situations of their life, these techniques, which capture context better than a bag of words, offer no advantage here: the context of the sentences is not relevant either. We consider that when someone is depressive, their words are loaded with negative emotions in all situations; for this reason, what matters is a snapshot of the author's emotional profile over time - exactly as a psychologist builds during treatment.
The final result was a dataset containing 1250 registers, divided into 686 non-depressive and 564 depressive authors, indicating the average emotions and intensities of all their posts during the trimester. To avoid problems with unbalanced data, we decided to remove 11 non-depressive authors at random, resulting in 564 authors in each class. An overview of the dataset is presented in Fig. 4.
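The balancing step above amounts to random undersampling of the majority class; a minimal sketch (the function name and seed are illustrative):

```python
import random

def balance_classes(non_depressive, depressive, seed=0):
    """Randomly undersample the larger class so both have the same size."""
    rng = random.Random(seed)
    n = min(len(non_depressive), len(depressive))
    return rng.sample(non_depressive, n), rng.sample(depressive, n)
```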
4 DATA ANALYSIS
The data analysis was divided into two different analyses: exploratory analysis and machine learning-based analysis.
4.1 Exploratory Analysis
Exploratory analysis is an approach aimed at analysing datasets to summarize their main features, often with visual methods, in order to find information hidden in the data.
In this work, the initial step of the exploratory analysis was to determine whether the emotional data follow a normal distribution. For this purpose, we performed a Shapiro-Wilk test for each emotion and intensity in the dataset. During our tests, all
ICAART 2021 - 13th International Conference on Agents and Artificial Intelligence