Discovering Trends in Brand Interest through Topic Models

Diana Lopes-Teixeira, Fernando Batista 1,2 and Ricardo Ribeiro 1,2

1 Instituto Universitário de Lisboa (ISCTE-IUL), Lisboa, Portugal

2 L2F, INESC-ID Lisboa, Lisboa, Portugal

Keywords:

Topic Modeling, Topics Evolution, LDA, Preprocessing, Brand Interest.

Abstract:

Topic Modeling is a well-known unsupervised learning technique used when dealing with text data. It is used to discover latent patterns, called topics, in a collection of documents (corpus). This technique provides a convenient way to retrieve information from unclassified and unstructured text. Topic Modeling tasks have been performed for tracking events, topics, and trends in different domains such as academia, public health, marketing, and news. In this paper, we propose a framework for extracting topics from a large dataset of short messages, for brand interest tracking purposes. The framework consists of training LDA topic models for each brand using time intervals, and then applying the model to aggregated documents. Additionally, we present a set of preprocessing tasks that helped to improve the topic models and the corresponding outputs. The experiments demonstrate that topic modeling can successfully track people’s discussions on social networks even in massive datasets, and capture the topics sparked by real-life events.

1 INTRODUCTION

The rapid growth of the Internet has led to the growth of social media websites like Twitter, a micro-blogging platform launched in 2006. On social media websites, people share diverse aspects of their lives and talk about events they are aware of. Thus, these websites produce tremendous amounts of data that can be used in many ways. For instance, the data can be used to track emerging events, to discover trending topics, or to evaluate consumers’ satisfaction with a product in the market. Topic Modeling is among the Text Mining techniques applied to exploit Twitter data. However, performing Topic Modeling tasks on short messages, such as those available on Twitter, differs from performing them on longer documents, such as academic abstracts or newspaper articles. This is mainly because Topic Models infer topics based on the co-occurrence of words in documents, and short messages limit this ability. Aggregating Twitter posts generates richer documents from which better topic models can be learned.

In this work, we focus on brand interest on Twitter. Our goal is to understand what people say about brands and how that changes over time, and to point out tendencies in those changes. In order to overcome the document length disadvantage, we present a pooling technique that consists of grouping tweets by day and by brand to create longer documents, which are then used to train our Topic Model. We also show that performing specific preprocessing steps has an impact on the quality of the output of a Latent Dirichlet Allocation (LDA) (Blei et al., 2003) Topic Model applied to aggregations of Twitter posts written in Portuguese.

The remainder of this paper is organized as follows: Section 2 describes the related work; Section 3 describes the dataset; Section 4 describes the preprocessing and Topic Model training; Section 5 presents the analysis and discussion of results; and Section 6 draws the major conclusions, presents the limitations of the current work, and proposes a set of tasks to perform as future work.

2 RELATED WORK

LDA, an unsupervised probabilistic model, models documents as distributions over topics, with topics being represented as distributions over words (Blei et al., 2003). This model has been applied to long documents such as academic abstracts (Moro et al., 2015), mid-sized documents such as customers’ reviews (Calheiros et al., 2017), and short documents such as microblogging posts (Paul and Dredze, 2014). For the latter, some aggregation methods have been applied to reduce the length and sparseness disadvantages, resulting in longer pseudo-documents (Hong and Davison, 2010; Mehrotra et al., 2013).

Lopes-Teixeira, D., Batista, F. and Ribeiro, R.

Discovering Trends in Brand Interest through Topic Models.

DOI: 10.5220/0006936202450252

In Proceedings of the 10th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2018) - Volume 1: KDIR, pages 245-252

ISBN: 978-989-758-330-8

LDA was used in (Hong and Davison, 2010) to evaluate the differences between topics learned from messages of the same user aggregated into a single profile and topics learned from the aggregation of user profiles, which in turn resulted from the aggregation of messages from the same user. Their results show that the two approaches generated substantially different topics, meaning that topics learned using different data aggregation strategies differ from each other. They also demonstrated that the length of the documents influences the effectiveness of trained topic models; namely, a better model can be trained by aggregating short messages.

Another application of LDA was conducted in (Alvarez-Melis and Saveski, 2016), in which tweets belonging to the same conversation were grouped, with each group of related tweets corresponding to a single document. They evaluated whether the proposed technique outperforms alternative schemes. The resulting topics performed better than those derived by hashtag-based pooling.

In (Hu et al., 2012), the researchers modeled the topics of specific events as well as their associated tweets, while performing event segmentation, with an event consisting of several paragraphs, each one discussing a particular set of topics. They assumed that an event, or a segment of it, can impose topical influences on the related tweets, resulting either in general topics, which are constant during the event, or in specific topics, which are related to specific segments of the event.

The researchers in (Mehrotra et al., 2013) proposed, among others, a temporal pooling scheme to aggregate tweets into what the authors referred to as macro-documents, based on the assumption that when important events occur, a great number of users start posting about the event within a short time span. As such, the authors pooled together tweets posted within the same hour. They found that such a scheme can improve topic modeling on Twitter without having to modify the LDA machinery.

Twitter posts present some challenges due to sparseness, as short documents (posts) might not contain enough data to establish satisfactory term co-occurrences. Although LDA has been shown to produce good results when applied to corpora of long documents, such as news articles (Zhao et al., 2011) and academic abstracts (Yau et al., 2014), it often produces less coherent results when applied to posts from micro-blogging platforms such as Twitter. This is due to the sparse nature of tweets and of short documents in general. Therefore, in order to alleviate these disadvantages, several pooling schemes that group tweets into longer individual documents have been proposed, so that LDA performance is improved without having to modify its basic machinery.

Examples of these techniques are hashtag-based aggregation (Mehrotra et al., 2013; Steinskog et al., 2017), user-based aggregation (Hong and Davison, 2010), or user-to-user conversation aggregation (Alvarez-Melis and Saveski, 2016). A Topic Model based on self-aggregation was also presented by (Quan et al., 2015), which is based on the assumption that each text snippet is sampled from a long pseudo-document.

3 DATASET

This study uses a dataset previously used in (Lopes-Teixeira et al., 2018), consisting of about 357944 geolocated tweets, written in Portuguese, posted by 159615 users from 206 countries across the world (according to the platform indication), collected between May 2014 and November 2017, covering 192 consecutive weeks, which corresponds to a time span of approximately four years. Each tweet includes the following metadata: user id, username, user description, country and city from which the tweet was posted, date and time, the tweet id, and the message content.

During the collection process, a brand filter was applied, so that only tweets mentioning at least one of the 16 selected brands would be retained. The brands, which were selected based on the number of followers and the number of tweets, are the following: Adidas, Nike, Vans, Puma, Victoria’s Secret, Gucci, Valentino, Versace, Converse, Michael Kors, Burberry, Marc Jacobs, Armani, Tommy Hilfiger, Christian Louboutin, and Dolce & Gabbana. As in (Lopes-Teixeira et al., 2018), for this study we only consider the top 10 brands, which are the brands with the most tweets in the dataset. Additional processing steps were applied to remove irrelevant tweets. For instance, regarding the “Valentino” brand, posts mentioning “Bobby Valentino” and “Valentino Rossi” were removed from the database, as well as all the tweets mentioning “Valentino” posted by users from Argentina. The last step was needed because the word “Valentino” is commonly mentioned in posts from Argentina, but these posts were most likely referring to a person or to pets with the same name. Tweets containing the words “Valentino” and “Humoro” were also removed, as in these cases the users were not talking about the brand.

KDIR 2018 - 10th International Conference on Knowledge Discovery and Information Retrieval


Table 1: Database properties.

| Brand | Users | Tweets | Tweets/User |
|---|---|---|---|
| Nike | 68098 | 126427 | 1.86 |
| Adidas | 65870 | 120784 | 1.83 |
| Vans | 41071 | 70091 | 1.71 |
| Puma | 12763 | 18710 | 1.47 |
| Victoria’s Secret | 9574 | 12642 | 1.32 |
| Gucci | 5312 | 7988 | 1.50 |
| Versace | 4989 | 7312 | 1.47 |
| Valentino | 3924 | 6083 | 1.55 |
| Converse All Star | 4893 | 5975 | 1.22 |
| Michael Kors | 1075 | 1558 | 1.45 |

Similarly, as long as no other brand had been mentioned in the post, tweets containing ice-cream related words and the word “Valentino” were also stripped, as they were referring to an ice-cream shop named Valentino. The same was done for tweets containing the word “Versace”, as there is also an ice-cream shop with this name. Tweets mentioning “Gucci Mane”, “Gucci gang”, and “Gucci fica bem com ela” (Gucci looks good on her) were filtered out. All the posts from a specific user from Indonesia were discarded, as this user presented an unusual number of posts, and we found that the corresponding account was used solely for advertising purposes. There were several other accounts used for the same purpose, mainly among United States users. Tweets posted by these users were discarded as well.
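The brand-specific filtering rules described above can be illustrated with a small rule-based filter. This is only a sketch under stated assumptions: the tweet fields (`text`, `country`), the ice-cream word list, and the function name `is_relevant` are hypothetical, and the rule set is limited to the examples mentioned in the text rather than the full set the study may have used.

```python
import re

# Exclusion rules modeled on the examples given in the text (not exhaustive).
EXCLUDED_PHRASES = {
    "valentino": ["bobby valentino", "valentino rossi"],
    "gucci": ["gucci mane", "gucci gang", "gucci fica bem com ela"],
}
EXCLUDED_COUNTRIES = {"valentino": {"Argentina"}}
# Assumed word list; the actual ice-cream related words are not given.
ICE_CREAM_WORDS = {"sorvete", "sorveteria", "gelato"}
ICE_CREAM_BRANDS = {"valentino", "versace"}


def is_relevant(tweet, brand, all_brands):
    """Return True if the tweet should be kept for the given brand."""
    text = tweet["text"].lower()
    # Drop tweets containing a known off-topic phrase for this brand.
    if any(phrase in text for phrase in EXCLUDED_PHRASES.get(brand, [])):
        return False
    # Drop tweets from countries where the name is mostly not the brand.
    if tweet["country"] in EXCLUDED_COUNTRIES.get(brand, set()):
        return False
    # Drop ice-cream shop mentions unless another brand is also mentioned.
    if brand in ICE_CREAM_BRANDS:
        words = set(re.findall(r"\w+", text))
        mentions_other = any(b in text for b in all_brands if b != brand)
        if words & ICE_CREAM_WORDS and not mentions_other:
            return False
    return True
```

Keeping the rules in plain dictionaries makes it easy to add further phrases or countries per brand without touching the filtering logic.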

Table 1 shows user and tweet statistics for the selected brands, revealing that Nike and Adidas are two of the most well-known brands, being mentioned by the majority of the users in our database.

As the country field appeared written in several different languages, we conducted a normalization step, which consisted of defining a translation table where all the values were translated into English, except for “Cabo Verde”, “Côte d’Ivoire”, and “Costa Rica”. Although Hong Kong and Macao are currently part of China (officially the People’s Republic of China), both were treated as separate regions, as they hold the status of special administrative regions. Taiwan was also treated separately, even though this country is sometimes still considered a province of China. Finally, a total of 86 tweets had the location field filled with a hyphen. For some of them, the location of another tweet posted by the same user was assigned. Those for which no other located tweet by the same user was found were removed, as they were not valid for this analysis. In the dataset, all the instances of the brands Michael Kors, Converse, and Victoria’s Secret were concatenated, so that the words composing each brand’s name could be treated as a single word, thus counting as one.
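The country normalization and the handling of hyphen-only locations can be sketched as follows. The translation entries, the field names (`user_id`, `country`), and the choice of the first located tweet of a user as the replacement value are assumptions made for illustration; the text does not specify them.

```python
# Hypothetical excerpt of a translation table (values in English).
COUNTRY_TRANSLATIONS = {
    "Brasil": "Brazil",
    "Estados Unidos": "United States",
    "Espanha": "Spain",
}


def normalize_country(name):
    """Translate a country name to English; unknown names are kept as-is."""
    return COUNTRY_TRANSLATIONS.get(name, name)


def fill_missing_locations(tweets, missing_marker="-"):
    """Fill a missing country from another tweet by the same user.

    Tweets whose user has no located tweet at all are dropped.
    """
    known = {}
    for tweet in tweets:
        if tweet["country"] != missing_marker:
            # Assumption: use the first located tweet found for each user.
            known.setdefault(tweet["user_id"], tweet["country"])
    cleaned = []
    for tweet in tweets:
        if tweet["country"] == missing_marker:
            if tweet["user_id"] not in known:
                continue  # no located tweet from this user: discard
            tweet = {**tweet, "country": known[tweet["user_id"]]}
        cleaned.append(tweet)
    return cleaned
```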

In order to perform topic modeling tasks, all the tweets mentioning the same brand and posted during the same week were aggregated using a concatenation script. This step resulted in a dataset composed of 1918 documents, spanning a total of 192 weeks, with an average of approximately 10 documents per week.
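The brand-and-week pooling described above can be sketched in a few lines. The field names (`brand`, `date`, `text`) and the start date used to number the weeks are assumptions; the original concatenation script is not shown in the paper.

```python
from collections import defaultdict
from datetime import date


def week_index(day, start):
    """Zero-based index of the week containing `day`, counted from `start`."""
    return (day - start).days // 7


def pool_tweets(tweets, start):
    """Concatenate all tweets of the same brand and week into one document."""
    pools = defaultdict(list)
    for tweet in tweets:
        key = (tweet["brand"], week_index(tweet["date"], start))
        pools[key].append(tweet["text"])
    return {key: " ".join(texts) for key, texts in pools.items()}


# Usage with an assumed collection start date.
start = date(2014, 5, 5)
tweets = [
    {"brand": "nike", "date": date(2014, 5, 5), "text": "quero um tenis"},
    {"brand": "nike", "date": date(2014, 5, 11), "text": "comprei"},
    {"brand": "nike", "date": date(2014, 5, 12), "text": "camisa nova"},
    {"brand": "puma", "date": date(2014, 5, 6), "text": "chuteira"},
]
documents = pool_tweets(tweets, start)
```

With 10 brands and 192 weeks, this kind of pooling yields at most 1920 documents, which is consistent with the 1918 documents reported.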

4 PREPROCESSING AND TOPIC MODEL TRAINING

(Vijayarani et al., 2015) provide an overview of preprocessing tasks and discuss what they consider the three key steps of preprocessing, namely: removing stop words, stemming, and using TF-IDF weighting algorithms. Similarly, in (Srividhya and Anitha, 2010) the researchers evaluated several preprocessing techniques and analyzed the effect of such preprocessing tasks on text classification using machine learning algorithms.

Our experiments apply a set of preprocessing steps to the dataset, so that more coherent and informative topics can be produced. Because it is common to find tweets containing URLs, slang, misspellings, and hashtags, the steps applied consisted of removing URLs, stop words, hashtags, punctuation, numbers, and whitespace. In order to retain a good vocabulary to represent the whole dataset, a Term Frequency-Inverse Document Frequency (TF-IDF) weighting scheme was applied. Also, terms that are not present in at least 0.1% of the documents, which corresponds to approximately two documents, were not included in the vocabulary. The objective of this step is to avoid, as much as possible, that misspellings, which may occur only a few times, are picked up by the TF-IDF weighting measure, without stripping away low-frequency words that could actually be important. This step resulted in removing words occurring fewer than four times in the whole dataset. The vocabulary was restricted to 5000 words.
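The cleaning steps and the vocabulary selection described above can be sketched as follows. Several details are assumptions, since the text does not give them: the stop-word list is a tiny illustrative sample, the TF-IDF variant is the plain `tf * log(N / df)` form, and terms are ranked by their mean TF-IDF over the corpus before the vocabulary is cut at `max_terms`.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word sample; a full Portuguese list is assumed in practice.
STOP_WORDS = {"que", "com", "meu", "uma", "um", "nao", "pra", "de", "o", "a", "e"}


def clean(text):
    """Lowercase, strip URLs, hashtags, punctuation and digits, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)          # hashtags
    text = re.sub(r"[^\w\s]|\d", " ", text)    # punctuation and digits
    return [word for word in text.split() if word not in STOP_WORDS]


def build_vocabulary(docs, min_doc_share=0.001, max_terms=5000):
    """Select up to `max_terms` terms by mean TF-IDF.

    `docs` is a list of token lists. Terms present in fewer than
    `min_doc_share` of the documents are ignored.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    min_df = max(1, math.ceil(min_doc_share * n_docs))
    scores = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            if doc_freq[term] < min_df:
                continue
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            scores[term] += tf * idf / n_docs
    return [term for term, _ in scores.most_common(max_terms)]
```

For a corpus of 1918 documents, `min_doc_share=0.001` gives a minimum document frequency of `ceil(1.918) = 2`, matching the "approximately two documents" threshold mentioned in the text.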

Along with the preprocessing steps previously explained, several other preprocessing steps were applied to the dataset, namely: removing adverbs, cardinal numbers, ordinal numbers, punctuation, conjunctions, common social network slang and abbreviations, and verbs. Verbs expressing some kind of willingness to acquire or buy brand items, or demonstrating liking or love for the brand, were kept. Brand names composed of two words (e.g. Michael Kors) had their words concatenated, so that the TF-IDF algorithm, which was applied to create the vocabulary, could handle all the occurrences properly. Additionally, only terms present in at least two documents were considered (sparse factor > 0.99983), thus wiping out misspelled words that could be interpreted by TF-IDF as low-frequency relevant words.

Table 2: Top 5 Puma topics from stemmed text.

| # | Terms |
|---|---|
| 1 | camisa uniforme adidas nike disc cola chuteira jogos arsenal novo |
| 2 | disc adidas cop novo nike bandido agua tenis whisky red |
| 3 | tenis quero rihanna cop fenty novo colecao adidas bts |
| 4 | disc cop adidas camisa quero nike mizuno novo tenis bota |
| 5 | disc rihanna adidas gira novo camisa cop paulo catraca tenis |

Table 3: Top 5 Nike topics from preprocessed text.

| # | Terms |
|---|---|
| 1 | adidas comprar quero comprei loja air queria chinelo bone casaco |
| 2 | meia canela botajoga adidas comercial camisa shox quero propaganda |
| 3 | adidas pes quero comprar celular querendo camisa bone comprei role |
| 4 | adidas quero comprar air shox bone comprei camisa quer coroa |
| 5 | compra adidas quero shox air comprei camisa fuzil mola red |

Although mentioned in (Vijayarani et al., 2015) as an important preprocessing task, we opted not to use stemming, as it does not work well for Portuguese. For instance, the word “copa” (cup), which refers to the Football World Championship, was reduced to “cop”, which figures in almost every topic of the Top 5 (Table 2). Table 3 shows that more informative topics can be produced when preprocessing tasks are applied. For example, apart from the brand name, only 3 out of 12 terms can be considered informative in topics 1 and 6 from Table 2. Table 4 shows that topics produced from unprocessed texts contain several irrelevant words such as “http” (URL prefix), stop words, and the name of the brand itself. As the Term Frequency (TF) weighting measure was applied instead of TF-IDF, which reduces the importance of non-relevant words appearing frequently, all the topics begin with the name of the brand, which is not informative, as each brand was evaluated separately. Also, as stop words are frequent throughout the dataset and were not removed, all the topics produced contain several stop words.

In order to limit the number of documents used to train our topic model, we created documents that aggregate groups of tweets, either on a daily or on a weekly basis, and we concluded that documents grouping a day of tweets produce better results. This is in line with the work presented in (Mehrotra et al., 2013), in which the researchers grouped their tweets by time spans of 2 hours, based on the assumption that a great number of users post about happenings within short time spans. Nonetheless, in order to obtain a clearer visualization of trends, the models were applied to documents that aggregate a longer time span, usually one week of data. Figures 1 and 3 were created using the same model, but the former groups tweets by day, while the latter groups tweets by week. It can be observed that both figures show the same clear trends over time. While this approach works well for Adidas, as the week and day charts are very similar, it does not work so well for Versace, where Figure 2 shows a less clear visualization of trends than Figure 7. Therefore, we opted to perform a week-based analysis for every brand.

Table 4: Top 5 Nike topics from unprocessed text.

| # | Terms |
|---|---|
| 1 | adidas http meia meu nao nike que tenis uma vou |
| 2 | com era meu nike nos que tenis https sem pes |
| 3 | adidas com meu nao nike que tenis uma vou https |
| 4 | adidas com comercial esse http nao nike pra propaganda que |
| 5 | com mais meu nao nike pra que tem vou quero |

Figure 1: Adidas topics daily evolution (x-axis: day; y-axis: proportion). Topics:

Topic 1: nike comprar tenis quero comprei casaco superstar bone loja queria

Topic 2: tenis quero nike brusinha queria comprar moletom patrocina casaco blusa

Topic 3: tenis comprar quero nike queria site comprei chinelo loja colecao

Topic 4: nike tenis camisa flamengo queria comprar adidasbrasil comprei patrocinio peito

Topic 5: nike camisa puma flamengo novo nova copa palmeiras propaganda brasil

Figure 2: Versace topics daily evolution (x-axis: day; y-axis: proportion). Topics:

Topic 1: versace donatella gaga musica zayn vestido clipe quero desfile queria

Topic 2: versace quer calcinha vestidinho rainha boate linda linha socialite adrenalina

Topic 3: versace comprar grana gucci chanel crime cama banco donatella raf

Topic 4: versace riachuelo colecao donatella vestido anitta desfile paulo nova comprar


Concerning the number of topics, that number was picked after several iterations, considering both the information each topic conveyed and the clarity of the resulting plot. As our goal is to observe changes over time in order to point out trends, we tried to reduce overlapping topics, which result in noisier plots.
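The train-then-apply procedure of this section (fit an LDA model on finer-grained documents, then estimate topic proportions for aggregated weekly documents) can be illustrated with a compact, dependency-free sketch. It is not the implementation used in the study: it uses collapsed Gibbs sampling for training and a simple fold-in estimate for unseen documents, and the hyperparameters `alpha` and `beta`, the iteration counts, and all function names are assumptions.

```python
import random


def train_lda(docs, vocab, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; returns (topic-word matrix, word ids)."""
    rng = random.Random(seed)
    word_id = {w: i for i, w in enumerate(vocab)}
    corpus = [[word_id[w] for w in doc if w in word_id] for doc in docs]
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in corpus]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(corpus):  # random initial topic assignments
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                z = assign[d][i]  # remove the current assignment
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][w] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z  # record the resampled assignment
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + n_words * beta)
         for w in range(n_words)]
        for k in range(n_topics)
    ]
    return phi, word_id


def topic_proportions(doc, phi, word_id, alpha=0.1, n_iter=50):
    """Estimate topic shares of an unseen document with the topics held fixed."""
    ids = [word_id[w] for w in doc if w in word_id]
    n_topics = len(phi)
    theta = [1.0 / n_topics] * n_topics
    for _ in range(n_iter):
        counts = [alpha] * n_topics
        for w in ids:
            resp = [theta[k] * phi[k][w] for k in range(n_topics)]
            total = sum(resp)
            for k in range(n_topics):
                counts[k] += resp[k] / total
        norm = sum(counts)
        theta = [c / norm for c in counts]
    return theta


def top_terms(phi, vocab, n=10):
    """Return the n most probable terms of each topic."""
    order = range(len(vocab))
    return [[vocab[w] for w in sorted(order, key=lambda w: -row[w])[:n]] for row in phi]
```

Calling `topic_proportions` once per weekly document of a brand gives one proportion vector per week, which is the kind of series plotted in the topic evolution figures.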

5 ANALYSIS

In line with the work presented in (Lopes-Teixeira et al., 2018), it can be observed that brand interest, i.e. the volume of posts mentioning the brands under analysis, changed over time. It shows ups and downs, and several peaks could be related to real-world events, as other studies have demonstrated (Mehrotra et al., 2013; Paul and Dredze, 2014).

Figure 3: Adidas topics weekly evolution (x-axis: week; y-axis: proportion). Topics:

Topic 1: nike comprar tenis quero comprei casaco superstar bone loja queria

Topic 2: tenis quero nike brusinha queria comprar moletom patrocina casaco blusa

Topic 3: tenis comprar quero nike queria site comprei chinelo loja colecao

Topic 4: nike tenis camisa flamengo queria comprar adidasbrasil comprei patrocinio peito

Topic 5: nike camisa puma flamengo novo nova copa palmeiras propaganda brasil

Figure 3 shows that the first weeks have more shares of the last two topics. The last one, which is about Nike, “comercial”/“propaganda” (commercial), “Brasil” (Brazil), “copa” (World Cup) and “Messi” (the football player), is clearly related to the Football World Championship that took place in Brazil, which spiked interest in sport brands such as Adidas, Nike and Puma during the championship period, in 2014 (Lopes-Teixeira et al., 2018). The first, second, and third topics express the intention of purchasing new items: “novo”/“nova” (new), “camisa” (shirt), “tênis” (sneakers), “moletom” (pullover), “chinelo” (flip-flops/slippers), “boné” (cap), and so on. The fourth topic, in which figure the names of two Brazilian football teams (Flamengo and Palmeiras), had more shares until roughly the 50th week. This might be, in part, related to the two matches in which these two teams faced each other, more specifically in May 2014 and September 2014. A possible reason for Nike and Puma being present in topics from Adidas data is that sport brands are very often compared with each other.

• “Não resisti entrei no site da adidas e comprei a tal camisa edicao limitada de 300 dilmas do Flamengo”. “Couldn’t resist, I accessed Adidas website and bought that 300 Flamengo limited edition shirt”.

• “Esses novos uniformes dos clubes europeus estão muito bonitos Adidas mandando ver e deixando a Nike pra trás”. “These new European team uniforms are very beautiful, Adidas is leaving Nike behind”.

Figure 4: Nike topics weekly evolution (x-axis: week; y-axis: proportion). Topics:

Topic 1: camisa adidas inter comprar corinthians uniforme camisas quero nova queria

Topic 2: adidas quero air comprar casaco max queria loja comprei bone

Topic 3: meia adidas canela bota joga comercial propaganda brasil camisa copa

Topic 4: adidas pes comprar quero celular querendo comprei bone role air

Topic 5: quero comprar adidas shox comprei meia air bone bota quer

Topic 6: adidas comprar quero queria comprei air chinelo loja camisa blusa

The first topic in Figure 4 is related to football equipment items such as “uniforme” (uniform) and “camisas” (shirts), and two Brazilian football teams (Corinthians and Inter). This topic is present in almost every document, which might be due to the Brazilian Football Championship, which takes place throughout the year. The World Cup topic, which is the third, can also be spotted in the plot. It can be observed that this topic was more discussed in the early weeks of the dataset, and that its proportion decreased as time went by. Similarly to Adidas topics, Nike topics also mention Adidas, demonstrating that these brands are mentioned in the same document several times. The topic in which figure the terms “Shox” and “Air” (Nike sneakers), “boné” (cap), and “bota” (boot) was discussed from the beginning to the middle of the set of weeks, and then faded. This is in line with the launch of the Nike Spring/Summer collection, which occurred around the first semester of 2015 (Lopes-Teixeira et al., 2018). The desire to purchase is common to almost every topic, and it is shared across the weeks. What distinguishes the topics is, in essence, the items that are the object of desire. Topic 5, for instance, mentions “camisa” (shirt), “Air” (sneakers), and “blusa” (blouse/top), while the second topic mentions “Max” (sneakers), “boné” (cap), and “chinelo” (slippers). Clearly, this indicates that, for this brand, the items users are interested in changed over time.


• “Adoro mt o trailer a publicidade da Nike para o mundial”. “I really love the trailer of the Nike ad for the World Cup”.

• “Esse comercial da nike ta oh ???? Que venha Copa Nike”. “This Nike commercial is lit! Let the World Cup begin”.

Figure 5: Puma topics weekly evolution (x-axis: week; y-axis: proportion). Topics:

Topic 1: camisa camisas adidas nike uniformes uniforme disc chuteira arsenal copa

Topic 2: disc adidas tenis nike mizuno bota comprei novo calca delicia

Topic 3: disc camisa clube janeiro encontro catraca gira shopping pumas kit

Topic 4: tenis rihanna quero chinelo queria fenty comprar colecao adidas bts

Topic 5: adidas disc novo comprei agua nike bandido whisky paulo camisa

Figure 5 illustrates how the first topic, which is about “camisas” (shirts), “uniformes” (uniforms), “chuteira” (football boot), “copa” (World Cup), Nike, Adidas, and “Disc” (Puma sneakers), starts with a high proportion but is almost not discussed by the end of the set of weeks. The high proportion of this topic is most likely due to the Football Championship that took place in Brazil, back in 2014. The second topic, which mentions “camisas” (shirts), “tenis” (sneakers), “Mizuno” and “Disc” (Puma sneakers), and “calça” (trousers/pants), along with the brands Adidas and Nike, follows the trend of the first topic, being more discussed until the middle of the dataset and losing relevance from that point until the end. In the fall of 2015, the first sneaker of Rihanna’s collaboration with Puma was released, and it sold out online during the pre-sale launch. Over the next two years, Rihanna also released several others, which were all met positively by both critics and buyers. In 2016, Rihanna debuted her first clothing line in collaboration with Puma. In the spring of the same year, the second collection was also unveiled. In autumn 2017, their autumn collection debuted. The chart shows that the evolution of the fourth topic is in line with these events, as terms such as “Fenty”, “Rihanna”, “coleção” (collection), “queria” (I wanted), “comprar” (purchase), and “tenis” (sneakers), among others, can be spotted in this topic.

The last topic, in which figure the terms “tenis” (sneakers), “Mizuno” (Puma sneakers), “camisa” (shirt), Adidas, and Nike, along with the word “comprei” (I bought), has a higher proportion from the beginning until the middle of the dataset, losing strength afterwards.

• “que adidas o que sua louca eu quero um creeper da puma q a dona rihanna fez”. “What Adidas?! I want a Puma Creeper made by Lady Rihanna.”

Figure 6: Victoria’s Secret topics weekly evolution (x-axis: week; y-axis: proportion). Topics:

Topic 1: victoriassecret angel modelo creme victoriassecrets perfume modelos loja cheiro comprei

Topic 2: victoriassecret quer roupa morango cesto kit chantily chique corpo baunilha

Topic 3: victoriassecret desfile show fashion modelos assistir victoriassecrets assistindo ano angel

Figure 6 shows that the last topic for the Victoria’s Secret brand, which is essentially about their annual fashion show, presents a seasonal behavior, which reflects the trend pointed out in (Lopes-Teixeira et al., 2018). The other two topics do not follow this trend; rather, the second topic, which is about clothes, Victoria’s Secret kits, and strawberry and Chantilly scented lotions, is more talked about in the first set of weeks. The first topic, on the other hand, has its proportion increased in the second set of weeks.

• “Adorava estar no Victorias Secret Show”. “I’d love to be at the Victoria’s Secret Show”.

• “To vendo os desfiles da Victorias Secret amoo Meu sonho e ser modelo da Victorias Secret”. “I’m watching the Victoria’s Secret Fashion Show, love it. My dream is to become a Victoria’s Secret model”.

Figure 7 shows that the first topic, which is about Riachuelo having a Versace collection, has an unusually high proportion somewhere before the fiftieth week of the dataset. This high proportion is in line with (Lopes-Teixeira et al., 2018), coinciding with the fashion show in which Riachuelo presented its Versace collection, in November 2014. As this topic is also about “desfile” (fashion show), Donatella Versace, and “roupas” (clothes), the topic never really fades away. In fact, it presents ups and downs that are most likely related to the brand’s fashion shows carried out every year. The third topic, which seems to talk about haute couture brands, as it mentions Gucci, Chanel, and “grana” (Portuguese slang for money), is what people talked about in the late weeks. Before this topic showed up, people were talking about “vestidinho” (short dress), “socialite”, and “boate” (nightclub).


Figure 7: Versace topics weekly evolution (x-axis: week; y-axis: proportion). Topics:

Topic 1: versace donatella gaga musica zayn vestido clipe quero desfile queria

Topic 2: versace quer calcinha vestidinho rainha boate linda linha socialite adrenalina

Topic 3: versace comprar grana gucci chanel crime cama banco donatella raf

Topic 4: versace riachuelo colecao donatella vestido anitta desfile paulo nova comprar

• “Hoje eu vi as peças da Versace pra Riachuelo e meu Deus, que colecção linda”. “Just saw the Versace for Riachuelo clothes and my God, what a beautiful collection!”

• “Riachuelo fecha parceria com linha da Versace OMG ja querooo Versace tem que chegar na Riachuelo antes da primeira prova do enem”. “Riachuelo starts a partnership with Versace for a Versace collection OMG I want it now. I hope it hits Riachuelo stores before the first national exam”.

The first topic illustrated in Figure 8 is about the semi-annual fashion event named Paris Fashion Week. This topic also mentions the former boy band leader Zayn Malik, who attended the fashion event back in March 2017. As this fashion event is semi-annual, several increases in the proportion of the first topic can be spotted in the chart. The second topic captured shares about brands similar to Valentino, such as Dior and Gucci. It also mentions (Valentino) “Khan”, who is a well-known DJ and producer, and “Ricky Martin”, a Puerto Rican singer. Topic 3 is composed of terms such as “feliz” (happy), “coleção” (collection), “nova” (new), and “modelo” (model). The last topic, though, has nothing to do with the Valentino brand; rather, it seems to be about a motorcycle racing event, as it mentions “motogp”, “ganhar” (to win), “seguidores” (followers), and Valentino (Rossi), a professional motorcyclist.

Figure 8: Valentino topics weekly evolution (x-axis: week; y-axis: proportion). Topics:

Topic 1: valentino fashion show paris zayn desfile week maisonvalentino new brazil

Topic 2: valentino bar khan vestido dior amo quero linda rickymartin gucci

Topic 3: valentino maisonvalentino feliz shows colecao realmrvalentino seguidores nova modelo cara

Topic 4: valentino seguidores motogp melhores motogpnosportv cade avidaaos familia grande sabado

The first topic in Figure 9 is composed of terms such as “Gucci”, “cinto” (belt), “tenis” (sneakers), “coleção” (collection), “bordado” (embroidered), and another haute couture brand, “Chanel”. This topic is present in most of the documents, with its proportion increasing from the middle to the end of the chart. By the end of the chart, we can quickly spot Topic 3, which is about the appreciation of Kim Taehyung (from the South Korean boy band “Beyond The Scene”) for Gucci clothes; his appreciation for the brand started to be noticed and talked about in 2016. Although mentioning “Gucci”, the fourth topic is not quite related to the brand. In fact, this topic is related to a song by a Brazilian singer, in which brands like Armani, Oakley, and Lacoste are also mentioned. The last topic also mentions other haute couture brands such as Prada, Chanel, and Louis (Vuitton), along with “bolsa” (bag) and “jaqueta” (jacket). The name “Harry” also figures in this topic, referring to the former member of a British boy band, Harry Styles, whose appreciation for the Gucci brand culminated in him being the new face of Gucci’s Tailoring Collection.

Figure 9: Gucci topics weekly evolution (x-axis: week; y-axis: proportion). Topics (top terms):
Topic 1: gucci comprar quero queria cinto chanel tenis colecao bordado nova
Topic 2: gucci prada mao harry comprada chanel dada pede coisas amigo
Topic 3: gucci taehyung btsbbmas tae modelo btstwt kim bts top vote
Topic 4: gucci armani lacoste oakley passo banco bote versace civil light
Topic 5: gucci harry comprar bolsa queria quero chanel prada louis jaqueta
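As an illustration of how curves like those in Figures 8 and 9 could be derived, the following minimal Python sketch averages per-document topic distributions by week. The input format (a list of `(week, distribution)` pairs), the function name `weekly_topic_proportions`, and the sample values are assumptions made for illustration only; this is not necessarily the procedure used to produce the figures.

```python
from collections import defaultdict


def weekly_topic_proportions(doc_topics):
    """Average the topic distributions of documents that share a week.

    doc_topics: iterable of (week, [p_topic1, ..., p_topicK]) pairs, where
    each list is one document's topic distribution (assumed to sum to 1).
    Returns {week: [mean proportion per topic]} with weeks in sorted order.
    """
    sums = {}
    counts = defaultdict(int)
    for week, dist in doc_topics:
        if week not in sums:
            sums[week] = [0.0] * len(dist)
        for k, p in enumerate(dist):
            sums[week][k] += p
        counts[week] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sorted(sums)}


# Hypothetical distributions over three topics for documents from two weeks.
docs = [
    (100, [0.50, 0.25, 0.25]),
    (100, [0.25, 0.25, 0.50]),
    (101, [0.125, 0.75, 0.125]),
]
proportions = weekly_topic_proportions(docs)
```

Plotting each topic's column of `proportions` against the week index would give one curve per topic, on the same 0 to 1 scale as the proportion axis of the figures.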

6 CONCLUSIONS AND FUTURE WORK

This study demonstrates that grouping tweets by brand and by the day they were posted before performing Topic Modeling can produce coherent and informative topics. It also shows that the topics people discuss and share opinions about change over time and can be related to real-life events, which is in line with the work presented in (Lopes-Teixeira et al., 2018).


For instance, commercials, product launches, and events can lead to the emergence of new topics, which may or may not cause older topics to fade. Additionally, the plots presented show that each brand has a different brand interest pattern, which was also stated in (Lopes-Teixeira et al., 2018). For example, the Victoria’s Secret topic about the brand’s fashion show comes and goes several times. Moreover, the importance of preprocessing in Natural Language Processing was emphasized. The experiments show that preprocessing steps do have an impact on the quality of the topics resulting from documents written in Portuguese: more informative topics were produced when the documents were preprocessed. Tasks such as removing URLs, removing stop words, and choosing the representation vocabulary based on TF-IDF can avoid common issues that reduce the coherence of the topics. The results demonstrate that this framework can be followed to obtain coherent topics, enabling one to gain insights into people’s conversations on Social Networks.
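The steps summarized above (pooling tweets by brand and day, removing URLs and stop words, and selecting the vocabulary with TF-IDF) can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the function names, the tiny stop-word list, the sample tweets, the example URL, and the rule of ranking terms by their maximum TF-IDF score are all hypothetical choices made for illustration, not the exact pipeline used in the experiments.

```python
import math
import re
from collections import Counter, defaultdict

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accented ones included

# Tiny illustrative stop-word list (assumption); a real list is far larger.
STOP_WORDS = {"a", "o", "e", "de", "da", "do", "que", "um", "uma", "para"}


def preprocess(text):
    """Lowercase, strip URLs, tokenize, and drop stop words."""
    text = URL_RE.sub(" ", text.lower())
    return [t for t in TOKEN_RE.findall(text) if t not in STOP_WORDS]


def aggregate(tweets):
    """Pool the tokens of all tweets sharing the same (brand, day) key."""
    docs = defaultdict(list)
    for brand, day, text in tweets:
        docs[(brand, day)].extend(preprocess(text))
    return dict(docs)


def tfidf_vocabulary(docs, size):
    """Keep the `size` terms with the highest maximum TF-IDF score."""
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    best = defaultdict(float)
    for tokens in docs.values():
        for term, count in Counter(tokens).items():
            score = (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            best[term] = max(best[term], score)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(best, key=lambda t: (-best[t], t))
    return set(ranked[:size])


# Hypothetical tweets: (brand, day, text).
tweets = [
    ("gucci", "2017-03-01", "Quero um cinto da Gucci https://example.com/x"),
    ("gucci", "2017-03-01", "A nova colecao da Gucci"),
    ("gucci", "2017-03-02", "Gucci e Prada: que bolsa linda"),
]
docs = aggregate(tweets)
vocabulary = tfidf_vocabulary(docs, 3)
```

With this particular scoring, a term that occurs in every aggregated document (here the brand name) receives a score of zero, so a production pipeline might whitelist such terms or use a different weighting. The resulting token lists, restricted to `vocabulary`, would then be passed to the topic model.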

The limitations of this study stem from the fact that tweets frequently contain slang, hashtags made of concatenated words, abbreviations, and misspellings. Although the documents were preprocessed, not all instances of these cases could be filtered out. Another limitation is that the available stop-word lists for the Portuguese language are still limited. To overcome this, we created our own set of Portuguese words considered non-relevant for this study, so that meaningless words could be properly removed.

Future work includes applying another Topic Modeling algorithm in order to evaluate which one is better suited to a large dataset. Discovering community patterns, i.e., how topics change from one community to another, is also a subject of future research.

ACKNOWLEDGEMENTS

This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.

REFERENCES

Alvarez-Melis, D. and Saveski, M. (2016). Topic modeling in Twitter: Aggregating tweets by conversations. ICWSM, 2016:519–522.

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Calheiros, A. C., Moro, S., and Rita, P. (2017). Sentiment classification of consumer-generated online reviews using topic modeling. Journal of Hospitality Marketing & Management, 26(7):675–693.

Hong, L. and Davison, B. D. (2010). Empirical study of topic modeling in Twitter. In Proceedings of the First Workshop on Social Media Analytics, pages 80–88. ACM.

Hu, Y., John, A., Wang, F., and Kambhampati, S. (2012). ET-LDA: Joint topic modeling for aligning events and their Twitter feedback. In AAAI, volume 12, pages 59–65.

Lopes-Teixeira, D., Batista, F., and Ribeiro, R. (2018). Spatio-temporal analysis of brand interest using social networks. In CISTI’2018 - 13th Iberian Conference on Information Systems and Technologies.

Mehrotra, R., Sanner, S., Buntine, W., and Xie, L. (2013). Improving LDA topic models for microblogs via tweet pooling and automatic labeling. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 889–892. ACM.

Moro, S., Cortez, P., and Rita, P. (2015). Business intelligence in banking: A literature analysis from 2002 to 2013 using text mining and latent Dirichlet allocation. Expert Systems with Applications, 42(3):1314–1324.

Paul, M. J. and Dredze, M. (2014). Discovering health topics in social media using topic models. PLoS ONE, 9(8):e103408.

Quan, X., Kit, C., Ge, Y., and Pan, S. J. (2015). Short and sparse text topic modeling via self-aggregation. In IJCAI, pages 2270–2276.

Srividhya, V. and Anitha, R. (2010). Evaluating preprocessing techniques in text categorization. International Journal of Computer Science and Application, 47(11):49–51.

Steinskog, A., Therkelsen, J., and Gambäck, B. (2017). Twitter topic modeling by tweet aggregation. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pages 77–86.

Vijayarani, S., Ilamathi, M. J., and Nithya, M. (2015). Preprocessing techniques for text mining - an overview. International Journal of Computer Science & Communication Networks, 5(1):7–16.

Yau, C.-K., Porter, A., Newman, N., and Suominen, A. (2014). Clustering scientific documents with topic modeling. Scientometrics, 100(3):767–786.

Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E.-P., Yan, H., and Li, X. (2011). Comparing Twitter and traditional media using topic models. In European Conference on Information Retrieval, pages 338–349. Springer.

KDIR 2018 - 10th International Conference on Knowledge Discovery and Information Retrieval
