FakeSpreadersWhatsApp.BR: Misinformation Spreaders Detection in Brazilian Portuguese WhatsApp Messages
Lucas Cabral, Diogo Martins, José Maria Monteiro, Javam Machado and Wellington Franco
Universidade Federal do Ceará, Brazil
Keywords:
Datasets, Machine Learning, Misinformation Spreaders Detection.
Abstract:
In the past few years, the large-scale dissemination of misinformation through social media has become a critical issue. In many developing countries, such as Brazil, India, and Mexico, one of the primary sources of misinformation is the messaging application WhatsApp. Recently, a few methods for automatic misinformation detection on the WhatsApp platform were proposed. However, identifying the users who spread fake news is another key aspect of mitigating misinformation dissemination effectively, and features to describe users on the WhatsApp platform were not found in the literature. This paper proposes a set of 23 features and two approaches (one supervised and one unsupervised) to identify possible misinformation spreaders on WhatsApp. Our results indicate that the proposed features can be used to distinguish between potential misinformation spreaders and users who share credible information, with an F1 Score of 0.923.
1 INTRODUCTION
In recent years, the large-scale dissemination of misinformation through social media has become a critical problem, undermining public health, social stability, and even democracy. In many developing countries, such as Brazil, India, and Mexico, the WhatsApp messaging app is one of the main sources of misinformation (Martins et al., 2021; Martins et al., 2022). In this context, identifying users who spread false content on this platform is a central task in the fight against disinformation, since it allows identifying its origin or its main spreaders. The misinformation spreaders detection task aims to identify malicious users responsible for spreading misinformation on a large scale (Morais and Digiampietri, 2022). The identification of misinformation spreaders makes it possible to create mechanisms that block the misinformation flow, mitigating its dissemination. However, proposals of features to describe users on the WhatsApp platform were not found in the literature.
This work brings three important contributions:
(i) A set of 23 features to describe the behavior of users on WhatsApp, which can be used in the automatic detection of misinformation spreaders.
(ii) Two distinct approaches to identify possible spreaders of misinformation on WhatsApp: thresholding and logistic regression.
(iii) A large-scale, labeled, and public dataset of misinformation spreaders on the WhatsApp platform. This dataset, called FakeSpreadersWhatsApp.Br, contains 5,364 instances, where each instance represents a user described by 23 different features, collected from public chat groups using the platform proposed by (de Sá et al., 2021).
Our results indicate that the proposed approaches and features can be effectively used to distinguish between potential misinformation spreaders and users who share reliable information on WhatsApp. The thresholding approach (an unsupervised method) obtained an F1 Score of 0.840. The approach based on logistic regression (a supervised method) presented an F1 Score of 0.923. We hope this paper can help researchers understand misinformation propagation in Brazil. The presented ideas can also be used to build misinformation detection systems, which aim to assist users in detecting and filtering out deceptive news.
The remainder of this paper is organized as follows. Section 2 presents our "misinformation spreader" definition. Section 3 discusses the main related work. Section 4 describes the methodology used in this investigation. Section 5 presents the experimental results. Conclusions and future work are presented in Section 6.
2 FUNDAMENTAL CONCEPTS
In this work, we propose an encompassing definition for the concept of "misinformation spreader", formulated next:

Definition 2.1. Let $u = \{U_u, E_u\}$ be a user of a social network $N$, who has associated with him a set $U_u = \{u_1, u_2, \ldots, u_m\}$ of $m$ other users with whom $u$ has a connection, and a set of engagements $E_u = \{e^u_1, e^u_2, \ldots, e^u_n\}$, where each $e^u_i = \{p_i, a, t\}$ represents an engagement of $u$ with the publication $p_i$, for the action $a$, at time $t$. Let $Q(u) \mapsto [0, 1]$ be a misinformation score assigned to $u$ and $\tau$ be a decision threshold. Detecting misinformation spreaders is the task of learning a prediction function $G(u, \tau) \mapsto \{0, 1\}$, satisfying:

$$G(u, \tau) = \begin{cases} 1 \ \text{(is a misinf. spreader)}, & \text{if } Q(u) \geq \tau \\ 0 \ \text{(is not a misinf. spreader)}, & \text{if } Q(u) < \tau \end{cases}$$
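To make the decision rule concrete, the following minimal sketch (Python; the misinformation_score function is a hypothetical stand-in for Q, not something defined in the paper) implements G(u, τ) as a threshold test:

```python
from typing import Callable

def predict_spreader(user: object, tau: float,
                     misinformation_score: Callable[[object], float]) -> int:
    """G(u, tau) from Definition 2.1 as a simple threshold rule.

    misinformation_score is a hypothetical stand-in for Q(u): any model
    that maps a user to a score in [0, 1] can be plugged in here.
    """
    q = misinformation_score(user)   # Q(u) in [0, 1]
    return 1 if q >= tau else 0      # 1: misinformation spreader
```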
A specific definition to categorize a user as a misinformation spreader may vary according to the analyzed social network or the particular behavior one wants to detect. However, it should require that the user posts or shares misinformation with unusual frequency or proportion compared to other users of that social network. That is, a misinformation spreader publishes a high number of misleading publications, or most of his publications contain false information. A misinformation spreader is not, therefore, a gullible user who has regular activity on the social network and occasionally publishes unreliable information, but a user engaged in disseminating misinformation abnormally compared to regular users. It is essential to highlight that, depending on the social network, this behavior often violates its community policies.
3 RELATED WORK
Misinformation spreaders detection is still a problem that has received little attention in the context of the Portuguese language. Most of the existing works address the bot detection problem. In (Leite et al., 2020), a set of rules was proposed to describe and classify bots on Twitter. The rules are based on the users' behavior and use as input the number of bookmarked tweets, the index of answered tweets, and the average number of retweets. Using a decision tree, users can be classified by these rules. The best result achieved an AUC of 0.97 using the dataset collected by (Cresci et al., 2017).
In (Benevenuto et al., 2008), the authors investigate the problem of detecting malicious users (spammers) on the YouTube platform. Users are represented by three groups of features: user features, video features, and social network features. User features include the number of videos added to YouTube, number of friends, number of videos watched, number of videos added as favorites, number of response videos sent and received, number of subscriptions, number of subscribers, and the maximum number of videos added in a day. Video features include the average video length and the number of views, ratings, comments, favorites, honorable mentions, and external links on posted videos. Social network features include clustering coefficient, user rank, betweenness, reciprocity, and assortativity. Using these features, an F1 Score of 0.81 was obtained in the malicious user detection task.
The effectiveness of the most popular classifiers, such as Random Forest and AdaBoost, in detecting bots was evaluated in (Morais and Digiampietri, 2022). The obtained results pointed to a degradation of classifier performance when exposed to new datasets, different from the dataset used during model training. This result derives, among other factors, from the dependence on information based on the user's profile, which is frequently changed by bot developers whenever they realize that certain features are being used by the detection algorithms.
In (Shahid et al., 2022), the authors provided a comprehensive survey of state-of-the-art methods for detecting malicious users and bots based on different features. In (Rath et al., 2021), the authors presented SCARLET (truSt and Credibility bAsed gRaph neuraL nEtwork model using aTtention), a model to predict misinformation spreaders on Twitter. Using real-world Twitter datasets, they show that SCARLET is able to predict false information spreaders with an accuracy of over 87%. In (Rath and Srivastava, 2022), the authors proposed a framework based on a complementary approach to false information mitigation on Twitter, inspired by the domain of Epidemiology, where false information is analogous to an infection, the social network is analogous to a population, and the likelihood of people believing a piece of information is analogous to their vulnerability to infection.
In (Heidari et al., 2021), the authors analyzed sentiment features and their effect on the accuracy of machine learning models for social media bot detection on Twitter. A new set of sentiment features was extracted from the tweets' text and used to train bot detection models. Besides, they proposed a new model for the Dutch language and achieved more than 87% accuracy on Dutch tweets based on the new sentiment features.
4 METHODOLOGY
Regarding the research methodology, the nature of this research is applied. As for the approach, this research is quantitative, organized through the following methodological procedures: (i) extraction of data referring to messages sent by users in WhatsApp public groups, (ii) controlled experiments in the laboratory, and (iii) modeling/simulation for the construction of classifiers based on supervised and unsupervised learning. It is important to highlight that this project seeks to identify misinformation spreaders on the WhatsApp platform, considering features that are independent of the language used.
4.1 The FakeSpreadersWhatsApp.Br Dataset
In this work, we start from the FakeWhatsApp.Br dataset (Cabral et al., 2021), where each line represents a message that a particular user sent in a specific WhatsApp public group. The FakeWhatsApp.Br dataset has 282,601 messages sent by 5,364 users in 59 public groups between July and November 2018, corresponding to the Brazilian election campaign period. The columns of the FakeWhatsApp.Br dataset are: the date and time the message was sent, the sender's phone number, the international phone code, the Brazilian state (if the user is from Brazil), the content (text) of the message, the number of words, the number of characters, and whether the message contained media such as audio, image, or video. However, the FakeWhatsApp.Br dataset does not contain any media files. Furthermore, the authors computed how often the same message (with the exact same text) appears in the dataset. A message was considered viral if it was observed more than once in the dataset. Only viral messages with identical textual content and more than five words were considered, in order to filter out common messages such as greetings. This subset of the FakeWhatsApp.Br dataset contains 6,926 viral messages. Additionally, the messages were anonymized in order to remove personal information such as identity document numbers, individual taxpayer identification numbers, zip codes, and telephone numbers, among others. Finally, the messages were manually labeled (as misinformation or not misinformation). The strategy used to build the FakeWhatsApp.Br dataset is described in (Cabral et al., 2021). The corpus of FakeWhatsApp.Br is publicly available in an online repository at https://github.com/cabrau/FakeWhatsApp.Br.
From FakeWhatsApp.Br, we built a new dataset called FakeSpreadersWhatsApp.Br, containing 5,364 instances, where each instance represents a user, with features calculated from the original dataset that describe the user's behavior in the groups during the observed period. For each user, 23 different features were computed, organized into two large groups: activity features and network features. Next, we detail each of these features.
4.1.1 Activity Features
As the name implies, activity features quantify the actions taken by users in the observed groups. Activity features can be organized into three subgroups: count, proportion, and temporal activity features.
The count features are: groups (number of groups the user joined), total messages (total number of messages sent by the user), texts (number of text messages sent), media (number of media messages sent), viral (number of viral messages sent), repeated messages (number of repeated messages sent by the user), and misinformation (number of messages labeled as misinformation sent by the user). High values in count features indicate that the user was very active in the observed scope. In addition, high values in the media, viral, and repeated messages attributes indicate propagandist behavior. High values in the misinformation feature indicate misinformation behavior.
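As an illustration, the count features can be computed directly from the message-level table; a minimal sketch with pandas, assuming hypothetical column names (user, group, is_media, is_viral, is_repeated, is_misinformation) that may differ from those in the released dataset:

```python
import pandas as pd

def count_features(msgs: pd.DataFrame) -> pd.DataFrame:
    """Count-type activity features, one row per user.

    Assumes one row per message, with hypothetical boolean columns
    is_media, is_viral, is_repeated, and is_misinformation.
    """
    g = msgs.groupby("user")
    feats = pd.DataFrame({
        "groups": g["group"].nunique(),              # groups the user joined
        "total_messages": g.size(),                  # all messages sent
        "media": g["is_media"].sum(),                # media messages
        "viral": g["is_viral"].sum(),                # viral messages
        "repeated_messages": g["is_repeated"].sum(),
        "misinformation": g["is_misinformation"].sum(),
    })
    feats["texts"] = feats["total_messages"] - feats["media"]  # text messages
    return feats
```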
Table 1 shows statistical measures describing the count features in the FakeSpreadersWhatsApp.Br dataset. We can see that spreading misinformation is not common among users, as 75% of users sent at most one message containing misinformation. It is also noticeable that most users do not actively participate in the groups. All count features have distributions with a high concentration of low values but a high standard deviation and a high maximum value, characterizing long-tail distributions.
The proportion features are: texts (ratio between the number of text messages and the total number of messages sent by the user), media (ratio between the number of messages containing media and the total number of messages sent by the user), viral (ratio between the number of viral messages and the total number of messages sent by the user), repeated messages (ratio between the number of repeated messages and the total number of messages sent by the user), and misinformation (ratio between the number of messages labeled as misinformation and the total number of messages sent by the user). These features seek to capture the relationship between a specific type of message and the total number of messages sent by the user.
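Given the counts above, the proportion features are simple ratios; a minimal continuation of the previous sketch:

```python
def proportion_features(feats: pd.DataFrame) -> pd.DataFrame:
    """Proportion-type features: share of each message type per user."""
    for col in ["texts", "media", "viral",
                "repeated_messages", "misinformation"]:
        feats[f"prop_{col}"] = feats[col] / feats["total_messages"]
    return feats
```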
Table 1: Statistical measures of the count-type activity features.
Groups Total messages Texts Media Viral Repeated messages Misinformation
average 1.16 52.68 29.13 23.55 3.89 2.57 2.13
standard deviation 0.65 138.06 89.74 63.19 15.01 16.26 7.33
min 1.00 1.00 0.00 0.00 0.00 0.00 0.00
Q1 1.00 3.00 2.00 1.00 0.00 0.00 0.00
median 1.00 13.00 6.00 4.00 0.00 0.00 0.00
Q3 1.00 45.00 23.00 19.00 2.00 1.00 1.00
max 11.00 4396.00 3742.00 1360.00 564.00 609.00 147.00
Table 2 shows statistical measures describing the proportion features in the FakeSpreadersWhatsApp.Br dataset. We have observed empirically that some of these features indicate unusual activity, especially when users have not been very active but had high values for the media and viral features. To illustrate this point with a real example, a user who shared 17 viral messages might not seem relevant when looking at the raw count alone. However, 100% of this user's messages were viral, a behavior that does not correspond to that of a regular user. This user is not using the application for conversations but as a propagandist, just passing on content. The misinformation feature could only be obtained because the data went through a previous manual labeling process. However, the other features can be obtained from an unlabeled dataset.
The temporal activity features are: active days (number of days on which the user sent messages) and the mean, standard deviation, median, and maximum number of messages per day. Table 3 shows statistical measures describing the temporal activity features in the FakeSpreadersWhatsApp.Br dataset. Temporal activity features describe the user's behavior over time and can be extracted without a previous manual labeling process. Suspicious activity includes very sharp bursts of messages alternating with days of no activity. A low mean combined with a high standard deviation and a high maximum value can be a solid indicator to identify misinformation spreaders.
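A sketch of these per-day statistics, assuming a datetime column named date and counting days without messages inside the observation window as zeros (an assumption consistent with the zero medians in Table 3):

```python
import pandas as pd

def temporal_features(msgs: pd.DataFrame) -> pd.DataFrame:
    """Per-user statistics over daily message counts.

    Days without messages inside the observation window count as zero;
    the window boundaries are taken from the dataset itself.
    """
    days = pd.date_range(msgs["date"].min().normalize(),
                         msgs["date"].max().normalize(), freq="D")
    per_day = (msgs.groupby(["user", msgs["date"].dt.normalize()])
                   .size()
                   .unstack(fill_value=0)           # users x days matrix
                   .reindex(columns=days, fill_value=0))
    return pd.DataFrame({
        "active_days": (per_day > 0).sum(axis=1),   # days with >= 1 message
        "daily_mean": per_day.mean(axis=1),
        "daily_std": per_day.std(axis=1),
        "daily_median": per_day.median(axis=1),
        "daily_max": per_day.max(axis=1),
    })
```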
4.1.2 Network Features
Modeling user relationships through a network, or graph, can provide relevant information about the misinformation flow. In some social networks, such as Twitter or Facebook, there are well-defined connections between users through the following (Twitter) or friendship (Facebook) relationships. However, on WhatsApp, these connections are not explicit. Thus, we propose modeling the relationships between WhatsApp users in the form of directed, weighted graphs, considering the sending of messages in groups. In this modeling, each node represents a user, and we can consider a graph for each type of message: i) messages in general (General Graph), ii) viral messages (Viral Graph), and iii) messages with misinformation (Misinformation Graph). Considering the general graph, where each node represents a user, a directed edge exists from user i to user j if user i sent a message to a group that user j belongs to. The weight of this edge is the number of messages sent by user i to that group. If user i and user j jointly participate in k groups, the weight of the edge from i to j is the sum of the numbers of messages sent by i to these k groups.
Analogous reasoning applies to the viral graph: there is a directed edge from user i to user j if user i sent a viral message to a group to which user j belongs, and the weight of this edge is the number of viral messages sent by user i to that group. The same goes for the misinformation graph. Note that the three graphs have the same number of nodes; only the number of edges varies. The number of edges of each type is presented in Table 4. As can be seen, these are large graphs with many connections.
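A minimal sketch of this construction with networkx, assuming two hypothetical inputs precomputed from the dataset: the set of members of each group and the number of messages each user sent to each group. The viral and misinformation graphs follow by restricting the message counts to the corresponding message type:

```python
import networkx as nx
from collections import defaultdict

def build_general_graph(group_members: dict, messages_sent: dict) -> nx.DiGraph:
    """Directed, weighted general graph.

    group_members:  {group_id: set of user ids in that group}
    messages_sent:  {(user_id, group_id): number of messages sent there}
    The edge i -> j accumulates the messages i sent to every group
    that both i and j participate in.
    """
    weights = defaultdict(int)
    for (i, group), n_msgs in messages_sent.items():
        for j in group_members[group]:
            if j != i:
                weights[(i, j)] += n_msgs
    graph = nx.DiGraph()
    graph.add_weighted_edges_from((i, j, w) for (i, j), w in weights.items())
    return graph
```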
Figure 1 exemplifies the shape of the general graph, using a sample of 2,000 users. For simplification purposes, the weights and directions of the edges are omitted in this representation. Note that there are isolated groups and a strongly connected cluster of users in the center. This is because there are engaged users who actively participate in many groups. Figure 2 illustrates this cluster in more detail. Observe the existence of users with high centrality, that is, users who interacted with several other users, as well as users who interact only with their local group.
From these graphs, we can obtain some network metrics, which we call "network features": general centrality degree, general strength, viral centrality degree, viral strength, misinformation centrality degree, and misinformation strength. The last two features can only be computed thanks to the manually assigned labels, while the rest can be computed from unlabeled data. Table 5 shows statistical measures describing the network features in the FakeSpreadersWhatsApp.Br dataset.
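A sketch of how these metrics can be read off such a graph; since the paper does not state whether in-, out-, or total degree is used, total degree is assumed here, with strength taken as the weighted degree (sum of incident edge weights):

```python
import networkx as nx
import pandas as pd

def network_features(graph: nx.DiGraph, prefix: str) -> pd.DataFrame:
    """Centrality degree (number of incident edges) and strength
    (sum of incident edge weights) for every node of a graph."""
    return pd.DataFrame({
        f"{prefix}_centrality_degree": dict(graph.degree()),
        f"{prefix}_strength": dict(graph.degree(weight="weight")),
    })

# Joining network_features(g, "general") with the "viral" and
# "misinformation" variants yields the six network features.
```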
Table 2: Statistical measures of proportion-type activity features.
Proportion
Texts Media Viral Repeated messages Misinformation
average 0.567 0.433 0.069 0.039 0.041
standard deviation 0.317 0.317 0.140 0.107 0.107
min 0.000 0.000 0.000 0.000 0.000
Q1 0.333 0.158 0.000 0.000 0.000
median 0.571 0.429 0.000 0.000 0.000
Q3 0.842 0.667 0.085 0.014 0.042
max 1.000 1.000 1.000 0.941 1.000
Table 3: Statistical measures of temporal activity features.
Active days Average daily messages Standard deviation of daily messages Median of daily messages Maximum daily messages
average 33.3 2.1 2.8 1.4 10.3
standard deviation 30.0 5.0 4.7 4.8 16.9
min 1.0 0.0 0.0 0.0 1.0
Q1 3.0 0.4 0.7 0.0 2.0
median 28.0 1.0 1.4 0.0 4.0
Q3 59.0 2.0 3.2 1.0 12.0
max 120.0 149.0 148.5 149.0 294.0
Figure 1: Sample of general graph, with a selection of 2,000 users.
Table 4: Numbers of nodes and edges of the graphs generated to model the relationships between users.
Graph statistics
Number of nodes 5,364
Number of edges (General graph) 1,125,326
Number of edges (Viral graph) 551,069
Number of edges (Misinformation graph) 433,204
4.2 Misinformation Spreaders Detection
First, we analyzed the general activity of the 5,364 users of the FakeSpreadersWhatsApp.Br dataset through the distribution of the total number of messages each user sent in the public groups. We observe that the distribution of total messages, like that of other user features, has a long tail, with the vast majority of users having low activity. In fact, only 25% of the users sent more than 45 messages. As we are interested in users who had relevant activity, to identify possible misinformation spreaders we kept only the users who sent more messages than the median, corresponding to 13 messages. This subset contains 2,633 users, called active users. Figure 3 shows the distribution of the number of messages sent by users in the FakeSpreadersWhatsApp.Br dataset.
Next, we analyzed the feature distributions for this subset of users. As already mentioned, we are interested in anomalous behavior to define key users, so we used the well-known outlier detection method based on the interquartile range.
Figure 2: Detail of the graph illustrated in Figure 1, highlighting the strongly connected groups.
Table 5: Statistical measures of network features.
General centrality degree General strength Viral centrality degree Viral strength Misinformation centrality degree Misinformation strength
average 215 10598 105 713 83 386
standard deviation 142 29226 151 2859 136 1392
min 3 7 0 0 0 0
Q1 105 494 0 0 0 0
median 200 2114 0 0 0 0
Q3 278 8322 200 404 153 273
max 1710 672588 1681 96342 1506 28601
Figure 3: Number of messages sent by users in the FakeSpreadersWhatsApp.Br dataset.
In this method, an outlier is defined as any value above Q3 + 1.5 · IQR, where Q3 is the third quartile and IQR is the interquartile range of the distribution, considering only the subset of active users.
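A sketch of the active-user clipping and this outlier rule, reusing the per-user feature table from the earlier sketches (the misinformation_strength column is assumed to have been computed from the misinformation graph):

```python
import pandas as pd

def iqr_outlier_threshold(values: pd.Series) -> float:
    """Outlier threshold Q3 + 1.5 * IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    return q3 + 1.5 * (q3 - q1)

# Active users: above the median number of messages (13 in the paper).
active = feats[feats["total_messages"] > feats["total_messages"].median()]

# Misinformation spreaders: active users with an anomalous value of
# misinformation strength (threshold 28,601 in the paper).
tau = iqr_outlier_threshold(active["misinformation_strength"])
spreaders = active[active["misinformation_strength"] > tau]
```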
Then, we propose a definition of misinformation spreader based on the misinformation strength feature. In this work, the misinformation spreaders constitute the group that, among the active users, has an anomalous value for the misinformation strength feature, according to the interquartile range method.
In the FakeSpreadersWhatsApp.Br dataset, an outlier value of the misinformation strength feature is one above the threshold of 28,601. It is important to note that this user category could also have been defined using another feature, such as viral centrality degree, general strength, or active days. However, we chose misinformation strength because we are interested not only in the amount of misinformation shared but also in the reach it had, and this feature encapsulates both pieces of information. That is, our definition of misinformation spreader encompasses users who can be characterized as "spreaders" due to the scope and frequency of their actions, which cause more damage than low-reach gullible users.
Next, we show four different messages sent by the most active misinformation spreader, according to the definition presented previously:
1. Original Message:
https://youtu.be/iXi3X2XDg6A ................... *URGENTE* !! multipliquem este vídeo ao máximo!!
Message Translated to English:
https://youtu.be/iXi3X2XDg6A ................... *URGENT* !! multiply this video to the maximum!!
2. Original Message:
https://youtu.be/WcXXsERafNA. *MAIS UMA FAKE NEWS do HADDAD DESMACARADA!!!* *COMPARTILHEM com todos os seus contatos!!!* vamos colocar este vídeo *EM ALTA* no YouTube!!!!
Message Translated to English:
https://youtu.be/WcXXsERafNA. *ANOTHER FAKE NEWS from HADDAD UNMASKED!!!* *SHARE this with all your contacts!!!* let's put this video *TRENDING* on YouTube!!!!
3. Original Message:
Mais uma fake News da mídia........o assassinato do capoeirista não teve nada a ver com política ou muito menos com apoiador de Bolsonaro....... *CANALHAS!! ..... Divulgue este vídeo para todos os seus contatos e grupos do WhatsApp*
Message Translated to English:
Another fake news from the media........the murder of the capoeirista had nothing to do with politics, much less with a Bolsonaro supporter....... *SCOUNDRELS!! ..... Share this video with all your WhatsApp contacts and groups*
4. Original Message:
"*No Ceará, o Comando Vermelho( CV ) PROIBIU propaganda de BOLSONARO nos territórios que* *"administra"* *Somente LULA E CIRO Podem. Por serem aliados do CRIME.* Alguém tem dúvida agora da quadrilha?
Message Translated to English:
"*In Ceará, the Comando Vermelho ( CV ) PROHIBITED BOLSONARO's propaganda in the territories it* *"manages"* *Only LULA AND CIRO can. Because they are allies of CRIME.* Does anyone have any doubts now about the gang?
Table 6 presents information about the misinformation spreaders. One can observe that the misinformation spreader category comprises only 2.5% of the users, but these users are responsible for a large share of the total misinformation, almost 40%. This shows that most misinformation is propagated by a small number of users, whether acting maliciously or not, which reinforces the need to identify these users as a way to mitigate the spread of misinformation.
4.3 Experimental Evaluation
Based on our definition of misinformation spreader presented in the previous section (a value of the misinformation strength feature greater than 28,601) and the 23 features of the FakeSpreadersWhatsApp.Br dataset, binary classification experiments were carried out to identify whether a user is a misinformation spreader (positive) or not (negative).
It is important to note that the misinformation label can only be assigned thanks to the message labeling process, which is used to calculate the misinformation strength metric that defines a user's class. However, other features (such as viral centrality degree, general strength, and active days) are known a priori, without any manual labeling process, and these can also be used to identify misinformation spreaders. Nevertheless, exploring these features for misinformation spreaders detection is outside the scope of this paper.
Additionally, it is worth mentioning that the dataset presents a high imbalance between the classes (misinformation spreader and non-spreader). The positive class (misinformation spreaders) is the minority, with 132 users, while the negative class (non-spreaders) is the majority, with 5,232 users, as illustrated in Figure 4. This imbalance often increases the difficulty of classification, as classifiers may tend to recognize only the majority class patterns.
Table 6: Description of the misinformation spreader category in terms of the number of users, the percentage of these users relative to the total, the amount of misinformation sent by users of this category, and the percentage of misinformation sent by users of this category relative to the total misinformation.
Category Feature Threshold # of users % of users # of misinformation messages % of misinformation
Misinformation Spreader Misinformation Strength 28,601 132 2.5% 4,533 39.7%
Figure 4: Class balance. The problem exhibits extreme class imbalance, with the positive class (misinformation spreaders) in the minority.
To evaluate the performance of the two approaches proposed in this work for classifying misinformation spreaders, we randomly split the data into training and test sets in a stratified way, maintaining the proportion between the classes. Thus, the total number of users was split, with 80% for the training set and 20% for the test set. Table 7 presents the number of instances of each class present in each set.
Table 7: Number of negative and positive instances in the
training and test sets.
Training Test
Positive (misinformation spreaders) 106 26
Negative (non misinformation spreaders) 4,185 1,047
Total 4,291 1,073
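This stratified split can be reproduced with scikit-learn; a sketch in which X is the feature matrix, y the binary spreader label, and the seed an assumption (the paper does not report one):

```python
from sklearn.model_selection import train_test_split

# 80/20 split stratified on the label, so the ~2.5% positive rate is
# preserved in both sets (as in Table 7).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)
```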
4.3.1 Approach 1: Thresholding
This approach assumes a strong correlation between the misinformation strength and viral strength features, since every misinformation message is also a viral message in the FakeSpreadersWhatsApp.Br dataset. In fact, when analyzing the correlation of misinformation strength with the other features that can be obtained from unlabeled data, the most strongly correlated variable is viral strength, with a correlation coefficient of 0.87. Thus, a user committed to publicizing and spreading viral messages has a good chance of spreading messages labeled as misinformation. We use this intuition as an approach to detect misinformation spreaders.
The thresholding approach classifies as a misinformation spreader every user whose viral strength is greater than or equal to the outlier threshold of this feature. Thus, every user with an anomalous viral strength value, who disseminates viral messages on a large scale, is classified as a misinformation spreader. The threshold observed in the FakeSpreadersWhatsApp.Br dataset for the viral strength feature was 5,675. Note that this approach can be used even with unlabeled data; that is, the thresholding approach is an unsupervised method.
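The whole approach thus reduces to a one-feature rule; a minimal sketch:

```python
import pandas as pd

def thresholding_classifier(viral_strength: pd.Series,
                            threshold: float = 5675.0) -> pd.Series:
    """Unsupervised rule: flag users at or above the viral strength
    outlier threshold (5,675 in FakeSpreadersWhatsApp.Br)."""
    return (viral_strength >= threshold).astype(int)
```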
4.3.2 Approach 2: Logistic Regression
The second approach proposed in this work uses logistic regression. The input is a subset of the 23 user features from the FakeSpreadersWhatsApp.Br dataset, normalized by the z-score method using the mean and variance of the training set. This subset is selected using a decision tree: we train the tree on the training set and obtain the Gini importance, which counts the times a feature is used to split a node, weighted by the number of samples it splits. The resulting feature importances are illustrated in Figure 5. Note that the most important feature is, in fact, viral strength, but other features, such as the proportion of repeated messages, the amount of media, daily messages, and the number of viral messages, also add information to the classifier. Thus, we chose the ten most important features: viral strength, amount of media, proportion of repeated messages, general centrality degree, general strength, average daily messages, amount of viral messages, viral centrality degree, active days, and the 95th percentile of the number of daily messages.

In addition to feature selection, another important step is optimizing the decision threshold. Due to the imbalance, the model may tend to estimate lower probabilities for the positive class, so it is necessary to choose an appropriate decision threshold, which in this case may be less than 0.5. We used the accuracy on a validation subset, held out from the training set, to choose the decision threshold. To avoid inserting noise into the training data, oversampling techniques were not used; the imbalance is dealt with by choosing an appropriate decision threshold.
Note that this approach can only be used if labeled data is available; that is, the logistic regression approach is a supervised method.
Figure 5: Feature importances.
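A sketch of this supervised pipeline end to end, continuing the split sketch above: Gini-importance ranking with a decision tree, z-score normalization fit on the training data, logistic regression, and a decision threshold tuned for accuracy on a held-out validation subset. The seeds, validation fraction, and threshold grid are assumptions not stated in the paper:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Rank the 23 features by Gini importance and keep the top 10.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
top10 = np.argsort(tree.feature_importances_)[::-1][:10]

# 2. z-score normalization using the training set's mean and variance.
scaler = StandardScaler().fit(X_train[:, top10])

# 3. Hold out a validation subset from the training data.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train[:, top10], y_train, test_size=0.2,
    stratify=y_train, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_fit), y_fit)

# 4. Choose the decision threshold that maximizes validation accuracy
#    (the paper reports an optimal threshold of 0.24).
probs = clf.predict_proba(scaler.transform(X_val))[:, 1]
best_tau = max(np.arange(0.05, 0.51, 0.01),
               key=lambda t: accuracy_score(y_val, probs >= t))
```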
5 RESULTS
The results obtained in the evaluation are presented in Table 8. Although accuracy is not a suitable metric for this problem, since the classes are very unbalanced, it is also reported. The thresholding approach obtained a reasonable result in terms of precision and recall, with an F1 Score of 0.840; with a recall of 0.807, this approach identified approximately 81% of all misinformation spreaders in the test set.
The results obtained by logistic regression were achieved with a decision threshold of 0.24; that is, the model classifies as a misinformation spreader (positive) any user with an estimated probability greater than 24%. The performance obtained by logistic regression was superior in all metrics, in particular in recall, where approximately 92.3% of the misinformation spreaders were identified, and with high precision, which means a low rate of false positives. Additionally, logistic regression allows interpretability techniques to be applied, which make it possible, for example, to understand the contribution of each feature to individual predictions. However, logistic regression requires the data to be labeled.
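Continuing the previous sketch, the test-set numbers in Table 8 can then be computed by applying the tuned threshold:

```python
from sklearn.metrics import (precision_score, recall_score,
                             f1_score, roc_auc_score)

test_probs = clf.predict_proba(scaler.transform(X_test[:, top10]))[:, 1]
y_pred = (test_probs >= 0.24).astype(int)  # tuned decision threshold

print("PRE:", precision_score(y_test, y_pred))
print("REC:", recall_score(y_test, y_pred))
print("F1: ", f1_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, test_probs))  # threshold-independent
```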
Since viral strength and misinformation strength are strongly correlated, we retrained the logistic regression model without the viral strength feature. In this case, the decision threshold was 0.19. We obtained an F1 Score of 0.807, a recall of 0.807, and an AUC of 0.994. The performance dropped compared to the methods that rely on the viral strength feature. Even so, the logistic regression approach without viral strength identified approximately 81% of all misinformation spreaders in the test set.
Table 8 shows the results of the misinformation spreaders classification using the thresholding and logistic regression approaches. Figure 6 illustrates the logistic regression confusion matrix, and Figure 7 shows the confusion matrix for the logistic regression trained without the viral strength feature.

Figure 6: Logistic regression confusion matrix.

Figure 7: Confusion matrix for the logistic regression without viral strength.
Table 8: Results of the misinformation spreaders classi-
fication using the thresholding and logistic regression ap-
proaches.
Method ACC PRE REC F1 Score AUC
Thresholding 0.992 0.875 0.807 0.840 -
Logistic Regression 0.996 0.923 0.923 0.923 0.998
Logistic Regression w/o Viral Strength 0.990 0.807 0.807 0.807 0.994
6 CONCLUSION
In this paper, we propose a set of 23 features organized into two groups (activity and network features) and two distinct approaches to identify possible misinformation spreaders on WhatsApp: thresholding (an unsupervised method) and logistic regression (a supervised method). Our results indicate that the proposed approaches and features can be effectively used to distinguish between potential misinformation spreaders and users who share reliable information on WhatsApp. The thresholding approach obtained an F1 Score of 0.840. The approach based on logistic regression presented an F1 Score of 0.923. When removing viral strength from the feature set, the logistic regression model presented an F1 Score of 0.807.
REFERENCES
Benevenuto, F., Rodrigues, T., Almeida, V., Almeida, J., and Gonçalves, M. (2008). Detectando usuários maliciosos em interações via vídeos no YouTube. In Proceedings of the 14th Brazilian Symposium on Multimedia and the Web, pages 138-145.

Cabral, L., Monteiro, J. M., da Silva, J. W. F., Mattos, C. L., and Mourão, P. J. C. (2021). FakeWhatsApp.Br: NLP and machine learning techniques for misinformation detection in Brazilian Portuguese WhatsApp messages.

Cresci, S., Di Pietro, R., Petrocchi, M., Spognardi, A., and Tesconi, M. (2017). The paradigm-shift of social spambots: Evidence, theories, and tools for the arms race. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 963-972.

de Sá, I. C., Monteiro, J. M., da Silva, J. W. F., Medeiros, L. M., Mourão, P. J. C., and da Cunha, L. C. C. (2021). Digital Lighthouse: A platform for monitoring public groups in WhatsApp.

Heidari, M., Jones, J. H. J., and Uzuner, O. (2021). An empirical study of machine learning algorithms for social media bot detection. In 2021 IEEE International IOT, Electronics and Mechatronics Conference (IEMTRONICS), pages 1-5.

Leite, M. A. G. L., Guelpeli, M. V. C., and Santos, C. Q. (2020). Um modelo baseado em regras para a detecção de bots no Twitter. In Anais do IX Brazilian Workshop on Social Network Analysis and Mining, pages 37-48. SBC.

Martins, A. D. F., Cabral, L., Mourão, P. J. C., Monteiro, J. M., and Machado, J. (2021). Detection of misinformation about COVID-19 in Brazilian Portuguese WhatsApp messages. In International Conference on Applications of Natural Language to Information Systems, pages 199-206. Springer.

Martins, A. D. F., Monteiro, J. M., and Machado, J. C. (2022). Understanding misinformation about COVID-19 in WhatsApp messages. In Chiusano, S., Cerquitelli, T., Wrembel, R., Nørvåg, K., Catania, B., Vargas-Solar, G., and Zumpano, E., editors, New Trends in Database and Information Systems - ADBIS 2022 Short Papers, Doctoral Consortium and Workshops: DOING, K-GALS, MADEISD, MegaData, SWODCH, Turin, Italy, September 5-8, 2022, Proceedings, volume 1652 of Communications in Computer and Information Science, pages 14-23. Springer.

Morais, D. and Digiampietri, L. A. (2022). Evaluating social bots detection approaches in different domains. In XVIII Brazilian Symposium on Information Systems, SBSI, New York, NY, USA. Association for Computing Machinery.

Rath, B., Morales, X., and Srivastava, J. (2021). SCARLET: Explainable attention based graph neural network for fake news spreader prediction. In Karlapalem, K., Cheng, H., Ramakrishnan, N., Agrawal, R. K., Reddy, P. K., Srivastava, J., and Chakraborty, T., editors, Advances in Knowledge Discovery and Data Mining, pages 714-727, Cham. Springer International Publishing.

Rath, B. and Srivastava, J. (2022). Spreader-Centric Fake News Mitigation Framework Based on Epidemiology, pages 31-54. Springer International Publishing, Cham.

Shahid, W., Li, Y., Staples, D., Amin, G., Hakak, S., and Ghorbani, A. (2022). Are you a cyborg, bot or human? A survey on detecting fake news spreaders. IEEE Access, 10:27069-27083.