tages have been applied, resulting in longer pseudo-
documents (Hong and Davison, 2010; Mehrotra et al.,
2013).
LDA was used in (Hong and Davison, 2010) to
evaluate the differences between topics learned from
messages from the same user aggregated into a single
profile scheme and topics learned by the aggregation
of the user profiles, which in turn resulted from the
aggregation of messages from the same user. Their
results show that the two approaches generated
substantially different topics, meaning that topics
learned using different data aggregation strategies
differ from each other. They also demonstrated that
the length of the documents influences the
effectiveness of the trained topic models: a better
model can be trained by aggregating short messages.
LDA was also applied in (Alvarez-Melis and Saveski,
2016), in which tweets belonging to the same
conversation were grouped, with each group of related
tweets corresponding to a single document. The authors
evaluated whether this technique outperforms
alternative pooling schemes, and the resulting topics
performed better than those derived by hashtag-based
pooling.
In (Hu et al., 2012), the researchers modeled the
topics of specific events as well as their associated
tweets, while performing event segmentation, with an
event consisting of several paragraphs, each
discussing a particular set of topics. They assumed
that an event, or a segment of it, can impose topical
influences on the related tweets, resulting either in
general topics, which are constant during the event,
or in specific topics, which are related to specific
segments of the event.
The researchers in (Mehrotra et al., 2013) proposed,
among others, a temporal pooling scheme that
aggregates tweets into what the authors refer to as
macro-documents, based on the assumption that when
important events occur, a great number of users start
posting about the event within a short time span.
Accordingly, the authors pooled together tweets posted
within the same hour. They found that such a scheme
can improve topic modeling on Twitter without having
to modify the LDA machinery.
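The hourly pooling idea can be illustrated with a minimal sketch. It is not
the implementation of (Mehrotra et al., 2013): the function name
`pool_by_hour`, the (timestamp, text) input format, the sample tweets, and
the reading of "same hour" as "same clock hour" are all assumptions made
for illustration.

```python
from collections import defaultdict
from datetime import datetime


def pool_by_hour(tweets):
    """Group tweet texts into one macro-document per clock hour.

    `tweets` is an iterable of (timestamp, text) pairs, where `timestamp`
    is a datetime. Returns a dict mapping each hour (a datetime truncated
    to the hour) to the concatenated text of the tweets posted in it.
    """
    pools = defaultdict(list)
    for timestamp, text in tweets:
        # Truncate the timestamp so all tweets of one hour share a key.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        pools[hour].append(text)
    return {hour: " ".join(texts) for hour, texts in pools.items()}


# Hypothetical sample data, not taken from any dataset.
sample = [
    (datetime(2017, 5, 1, 14, 5), "new sneakers released today"),
    (datetime(2017, 5, 1, 14, 40), "queue outside the store already"),
    (datetime(2017, 5, 1, 15, 2), "sold out in one hour"),
]
macro_docs = pool_by_hour(sample)
```

Each value of `macro_docs` can then be passed to an unmodified LDA
implementation as a single document.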
Twitter posts present challenges due to sparseness,
as short documents (posts) might not contain enough
data to establish satisfactory term co-occurrences.
Although LDA has been shown to produce good results
when applied to corpora of long documents, such as
news articles (Zhao et al., 2011) and academic
abstracts (Yau et al., 2014), it often produces less
coherent results when applied to posts from
micro-blogging platforms such as Twitter. This is due
to the sparsity of tweets, and of short documents in
general. Therefore, to alleviate this disadvantage,
several pooling schemes that group tweets into longer
individual documents have been proposed, so that LDA
performance is improved without having to modify its
basic machinery.
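The effect of pooling on term co-occurrences can be seen in a toy example.
The sketch below counts the distinct term pairs that share at least one
document, first with each tweet as its own document and then with all
tweets pooled into one. The function name, the whitespace tokenization, and
the sample tweets are assumptions made for illustration only.

```python
from itertools import combinations


def cooccurring_pairs(documents):
    """Return the set of unordered term pairs that share a document."""
    pairs = set()
    for doc in documents:
        # Naive whitespace tokenization; sorting gives canonical pairs.
        terms = sorted(set(doc.lower().split()))
        pairs.update(combinations(terms, 2))
    return pairs


# Hypothetical toy tweets.
tweets = ["love my new vans", "vans sale today", "new sale"]

separate = cooccurring_pairs(tweets)            # one document per tweet
pooled = cooccurring_pairs([" ".join(tweets)])  # one pooled document

print(len(separate), len(pooled))  # prints "10 15"
```

Pooling here exposes pairs such as ("love", "sale") that never appear
together in any single tweet, which is the kind of evidence a
co-occurrence-based model lacks on short documents.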
Examples of these techniques are hashtag-based
aggregation (Mehrotra et al., 2013; Steinskog et al.,
2017), user-based aggregation (Hong and Davison,
2010), and user-to-user conversation aggregation
(Alvarez-Melis and Saveski, 2016). A topic model based
on self-aggregation was also presented in (Quan et
al., 2015), under the assumption that each text
snippet is sampled from a long pseudo-document.
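User-based and hashtag-based aggregation differ only in the key used to
group tweets, which the following sketch makes explicit. It is an
illustration under stated assumptions rather than the procedure of any
cited work: the dict-based tweet format, the helper names, the hashtag
regular expression, and the sample tweets are all hypothetical.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")


def pool(tweets, keys_of):
    """Concatenate the texts of tweets that share a pooling key.

    `tweets` is an iterable of dicts with "user" and "text" entries.
    `keys_of` maps a tweet to the keys it belongs to; a tweet with
    several keys (e.g. several hashtags) joins every matching pool.
    """
    pools = defaultdict(list)
    for tweet in tweets:
        for key in keys_of(tweet):
            pools[key].append(tweet["text"])
    return {key: " ".join(texts) for key, texts in pools.items()}


def by_user(tweet):
    return {tweet["user"]}


def by_hashtag(tweet):
    # Case-insensitive, and a repeated hashtag counts only once.
    return {tag.lower() for tag in HASHTAG.findall(tweet["text"])}


# Hypothetical sample tweets.
tweets = [
    {"user": "ana", "text": "first run with the new shoes #running"},
    {"user": "rui", "text": "#Running before work"},
    {"user": "ana", "text": "rest day"},
]
user_docs = pool(tweets, by_user)
hashtag_docs = pool(tweets, by_hashtag)
```

In this sketch a tweet without any hashtag falls into no hashtag pool; how
such tweets should be handled is a design choice left open here.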
3 DATASET
This study uses a dataset previously used in
(Lopes-Teixeira et al., 2018), consisting of about
357944 geolocated tweets, written in Portuguese,
posted by 159615 users from 206 countries (according
to the location indicated by the platform). The tweets
were collected between May 2014 and November 2017,
covering 192 consecutive weeks, which corresponds
approximately to a four-year time span. Each tweet
includes the following metadata: user id, username,
user description, country and city from which the
tweet was posted, date and time, tweet id, and
message content.
During collection, a brand filter was applied so that
only tweets mentioning at least one of the 16 selected
brands were retained. The brands, which were selected
based on the number of followers and the number of
tweets, are the following: Adidas, Nike, Vans, Puma,
Victoria’s Secret, Gucci, Valentino, Versace, Converse,
Michael Kors, Burberry, Marc Jacobs, Armani, Tommy
Hilfiger, Christian Louboutin, and Dolce & Gabbana. As
in (Lopes-Teixeira et al., 2018), this study considers
only the top 10 brands, that is, the brands with the
most tweets in the dataset. Additional processing
steps were applied to remove irrelevant tweets. For
instance, for the “Valentino” brand, posts mentioning
“Bobby Valentino” or “Valentino Rossi” were removed
from the database, as were all tweets mentioning
“Valentino” posted by users from Argentina. The last
step was needed because the word “Valentino” is
commonly mentioned in posts from Argentina, where it
most likely refers to a person or a pet with that
name. Tweets containing both the words “Valentino” and
“Humoro” were also removed, as in these cases the
users were not talking about the brand.
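The filtering steps described above can be sketched as follows. This is a
hedged illustration, not the pipeline actually used to build the dataset:
the brand list is shortened, whole-word matching is a simplification, and
the "country" and "text" field names, the country value "Argentina", and
the sample tweets are assumptions.

```python
import re
from collections import Counter

# Shortened brand list for brevity; the study selected 16 brands.
BRANDS = ["adidas", "nike", "puma", "valentino", "vans"]
BRAND_PATTERNS = {
    brand: re.compile(r"\b" + re.escape(brand) + r"\b") for brand in BRANDS
}
EXCLUDED_PHRASES = ["bobby valentino", "valentino rossi"]


def brands_in(text):
    """Return the brands mentioned as whole words in `text`."""
    lowered = text.lower()
    return [b for b, pattern in BRAND_PATTERNS.items() if pattern.search(lowered)]


def is_relevant(tweet):
    """Apply the brand filter and the "Valentino" clean-up rules."""
    text = tweet["text"].lower()
    if not brands_in(text):
        return False  # no selected brand mentioned
    if "valentino" in text:
        if any(phrase in text for phrase in EXCLUDED_PHRASES):
            return False  # refers to a person, not the brand
        if tweet["country"] == "Argentina":
            return False  # most likely a person or pet name
        if "humoro" in text:
            return False
    return True


def top_brands(tweets, n):
    """Return the `n` brands with the most mentioning tweets."""
    counts = Counter(b for tweet in tweets for b in brands_in(tweet["text"]))
    return [brand for brand, _ in counts.most_common(n)]


# Hypothetical sample tweets.
tweets = [
    {"country": "Portugal", "text": "Adoro os meus Nike novos"},
    {"country": "Brazil", "text": "Valentino Rossi venceu a corrida"},
    {"country": "Argentina", "text": "O meu cachorro Valentino"},
    {"country": "Portugal", "text": "Vestido Valentino lindo"},
    {"country": "Brazil", "text": "Bom dia a todos"},
    {"country": "Brazil", "text": "Nike ou Adidas?"},
]
kept = [tweet for tweet in tweets if is_relevant(tweet)]
```

Calling `top_brands(kept, 10)` on the full brand list would correspond to
restricting the analysis to the ten brands with the most tweets.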
KDIR 2018 - 10th International Conference on Knowledge Discovery and Information Retrieval