Towards Predicting Mentions to Verified Twitter Accounts:
Building Prediction Models over MongoDB with Keras
Ioanna Kyriazidou¹, Georgios Drakopoulos², Andreas Kanavos¹, Christos Makris¹ and Phivos Mylonas²
¹ CEID, University of Patras, Patras, Hellas, Greece
² Department of Informatics, Ionian University, Kerkyra, Hellas, Greece
Keywords:
Social Analytics, Digital Influence, Digital Trust, Verified Account, Document Databases, MongoDB,
Convolutional Network, Logistic Regression, Keras, Pandas, TensorFlow.
Abstract:
Digital influence and trust are central research topics in social media analysis with a plethora of applications
ranging from social login to geolocation services and community structure discovery. In the evolving and
diverse microblogging sphere of Twitter, verified accounts reinforce digital influence through trust. These
typically correspond either to an organization, to a person of high social status, or to netizens who have been
consistently proven to be highly influential. This conference paper presents a framework for estimating the
probability that the next mention of any account will be to a verified account, an important metric of digital
influence. At the heart of this framework lies a convolutional neural network (CNN) implemented in keras
over TensorFlow. The training features are extracted from a dataset of tweets regarding the presentation of
the Literature Nobel prize to Bob Dylan collected with the Twitter Streaming API and stored in MongoDB.
In order to demonstrate the performance of the CNN, the results obtained by applying logistic regression
to the same training features are shown in the form of statistical metrics computed from the corresponding
contingency matrices, which are obtained using the pandas Python library.
1 INTRODUCTION
At the dawn of the big data or 5V era online so-
cial media constitute a major driver of research and
marketing alike. The main reason is that social media
abound with interactions between accounts from a
wide array of multimodal options ranging from hash-
tags to affective states, geolocation information, and
live video streams. All these alternatives allow the
real time mining of invaluable knowledge regarding
numerous topics of social interest. Most importantly,
regardless of the level of technological sophistication
or the type of data involved, these interaction types
ultimately serve as vehicles for communication. And
along with communication come trust and digital in-
fluence, either explicit or latent.
A major factor towards establishing trust in the
digital sphere is determining whether behind a given
social network account lies an actual real world entity,
whether an individual, a hacker group, a multinational
conglomerate, or any other type of formal or infor-
mal organization. Among the most recent cases where this factor was questioned was the hunt undertaken in
2017 by the journalists of Counterpunch to approach the alleged freelance reporter Alice Donovan
(www.counterpunch.org, 2017). Moreover, arguments for establishing the identity of an account owner
were put forward during the 2013 public discussions regarding the alleged activities of Twitter eggs,
which eventually led to the dropping of the famous egg avatar, a joking reference to the bird logo,
in favor of a gender neutral avatar (www.theguardian.com, 2017).
In the aftermath of these discussions Twitter un-
dertook the initiative of introducing a distinction be-
tween its own accounts in order to reinforce the cred-
ibility of a subset of accounts as well as of the tweets
from them. Specifically, Twitter accounts are divided
into the following categories:
Verified accounts are selected by Twitter and cor-
respond to significant real world entities such as
governments, sports groups, and academic insti-
tutions.
Unverified accounts are ordinary accounts and
constitute the overwhelming majority of the Twit-
ter ecosystem.
The primary contribution of this conference paper
is a framework for predicting whether the next men-
tion of a Twitter account, whether verified or not, will
be to a verified account. This framework is based on a
convolutional neural network (CNN) implemented on
TensorFlow using keras as a front end in Python. The
mix of structural and functional training features was
extracted from Twitter in JSON format with Twit-
ter4j over Java and stored in MongoDB, which is ideal
for natively storing and processing documents follow-
ing this format. For comparison purposes a logistic
classifier was also applied on the same dataset using
Python and scikit-learn. The results indicate that the
CNN, trained under three different scenarios, outper-
forms this classifier in terms of a number of statisti-
cal metrics derived from the contingency matrix, most
prominently accuracy, precision, and type I and II er-
ror rates.
The remainder of this work is structured as fol-
lows. Section 2 briefly summarizes recent scientific
literature regarding social network analysis. The ar-
chitecture of the Twitter topic sampler is described
in section 3. Section 4 presents the fundamentals of
the two prediction models used in this work. The re-
lationship between influence, trust, and community
structure in the digital sphere as well as the intuition
behind the training features are examined in section
5, while in section 6 the results obtained from these
models are analyzed. Section 7 recapitulates the main
findings and outlines future research directions. Vec-
tors are displayed in lowercase boldface and scalars in
lowercase. Finally, the notation of this work is sum-
marized in table 1.
Table 1: Notation of this conference paper.

Symbol             Meaning
≜                  Definition or equality by definition
{s_1, ..., s_n}    Set with elements s_1, ..., s_n
|S|                Set cardinality
Φ(t_k)             Set of accounts following t_k
Ψ(t_k)             Set of accounts followed by t_k
Var[f]             (Deterministic) variance of feature f
I[P]               Indicator function for predicate P
2 PREVIOUS WORK
In the digital sphere influence revolves around the
fundamental dynamics of the relationship between
two or more accounts comprising a formal or infor-
mal group and essentially pertains to why, how, and
when the online behavior of a subset of accounts is
imitated by the remaining ones in that group (Rus-
sell, 2013)(Gilbert and Karahalios, 2009). There is
a plethora of tools for representing and assessing in
context the significance of these imitation patterns in-
cluding set theoretic similarity metrics such as the
Tversky coefficient (Tversky, 1977), which in con-
trast to the Tanimoto coefficient is asymmetric in the
sense that one set is considered the template and the
other assumes the role of the variant with the two roles
not being interchangeable. Estimating the cardinality of large sets may not always be easy, thus heuristics
based on the HyperLogLog method have been developed (Drakopoulos et al., 2016b). The algorithmic corner-
stone for evaluating influence, digital or otherwise,
is the concept of meme, approximately the cultural
counterpart of a gene (Blackmore, 2000).
The associated concept of trust is more difficult to model algorithmically, as it has no discernible digital
traits. Nonetheless, there have been approaches for
inferring trust through natural language processing
or matrix factorization techniques combined with the
silent but inherent trust transitivity (Jamali and Ester,
2010). The latter was identified as an important factor
in software engineering and computer security well
before the advent of online social media (Thompson,
1984). An effort to model Web trust with an applica-
tion to Semantic Web has been presented in (Mislove
et al., 2007). Other trust models have been proposed
in (Golbeck, 2005), in (Golbeck and Hendler, 2006)
which also presents a trust inference method, and in
(Golbeck et al., 2006) where trust is used as the basis
for movie recommendation. An alternative approach
to trust would be to rely on the fact that it is ultimately
rooted in human emotion. As such, affective comput-
ing techniques such as the ones proposed in (Muller,
2004) and in (Pang and Lee, 2008) attempt to discover
community structure in social media. For an overview
of affective computing see (Picard et al., 2001) or (Pi-
card, 2003).
Given that both influence and trust can be time
evolving, it makes sense to model account-account-
time trust triplets with third order tensors and ap-
ply higher order clustering techniques in order to
determine account groups which bond and consoli-
date or decay over time (Papalexakis and Faloutsos,
2015)(Papalexakis et al., 2014)(Drakopoulos et al.,
2019). Tensors have been already applied to network-
specific problems, for instance for aligning size with
tweet and retweet activity in Twitter (Drakopoulos
et al., 2018) and for extending in PubMed the term-
document vector query model to a term-keyword-
document one (Drakopoulos et al., 2017a).
Figure 1: System architecture (OAuth and JSON data collection, MongoDB storage, analytics application).
3 ARCHITECTURE
In order to implement the proposed analytics of section 4, a new system was developed in Java with
NetBeans 8.1. Its main components are the Twitter4j library, which provides a Twitter API for collecting
data such as account names, followers, hashtags, tweets, and retweets; a MongoDB database for managing
and querying this information in JSON format; and the prediction models, which have been developed in
Python 3.6. Figure 1 displays the architecture for collecting and analyzing Twitter data.
The primary reason for developing our own client is the ability to extend it through more Twitter analytics,
in the form of either Python modules or Java jars, in order to facilitate more experiments. A fundamental
design concept of Twitter is that it is meant to be integrated into other applications. To this end, it provides
a dashboard for authenticated developers to rapidly create and register applications.
Twitter4j (www.twitter4j.org) is an open source Java library for harnessing Twitter information. Among
its other capabilities it provides integration with OAuth for efficient checking of the four Twitter
authentication tokens, two for the developer (consumer key, secret key) and two (access token, secret
token) for the application itself, as well as with Maven in order to automatically satisfy dependencies.
Calls to the appropriate Twitter4j methods have been placed in the source code of our social media crawler.
In general, there are two main operational modes for such a crawler, namely account and topic sampling.
The latter was selected since we are interested in assessing a functional Twitter feature.
MongoDB is a mainstay of the NoSQL movement and one of the most popular document databases, designed
to store JSON formatted documents in a schemaless manner and to support, among others, ad hoc queries,
indexing, and real time aggregation. Its architecture is distributed, allowing horizontal and geographical
scaling if needed, the former achieved through key sharding. Currently transactions are not implemented,
leaving data consistency and replication policy enforcement to the local administrator, which has led to
the development of a number of third party consistency control tools. MongoDB supports drivers for most
of the popular programming languages including Python, C++, and Java.
Finally, JSON is an open Internet standard described in RFC 8259. Originally developed for asynchronous
server-client communication, it has been widely adopted for formally describing documents in human
readable form. Its primary data structure is the associative array, consisting of key-value pairs where
the values may belong to different primitive types including strings. Document databases such as MongoDB
natively support storing and iteratively or recursively querying JSON structures.
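As a minimal illustration of this storage scheme, consider the following Python sketch, which assumes a
local MongoDB instance reachable through pymongo; the database, collection, and field names are
hypothetical and chosen only for readability, since the actual crawler of this work is written in Java
with Twitter4j.

# Minimal sketch assuming a local MongoDB instance; the database,
# collection, and field names below are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tweets = client["nobel"]["tweets"]

# A tweet stored as a JSON (BSON) document with nested associative arrays.
tweets.insert_one({
    "tweet_id": 1234567890,
    "user": {"screen_name": "example_user", "verified": False,
             "followers_count": 42, "friends_count": 17},
    "hashtags": ["NobelPrize", "BobDylan"],
    "mentions": [{"screen_name": "NobelPrize", "verified": True}],
})

# Ad hoc query over the nested structure: tweets mentioning a verified account.
for doc in tweets.find({"mentions.verified": True}):
    print(doc["tweet_id"], [m["screen_name"] for m in doc["mentions"]])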
4 PREDICTION MODELS
4.1 Logistic Regression
Logistic regression appears in a number of cases in economics, statistics, and engineering, and it is a
generalization of least squares regression. Recall that the latter estimates, in the presence of noise
with a known distribution, the parameters of an n-th order model

p \triangleq \begin{bmatrix} \vartheta_0 & \vartheta_1 & \ldots & \vartheta_{n-1} \end{bmatrix}^T    (1)

based on a set of m observation vectors

v_k \triangleq \begin{bmatrix} 1 & x_{0,k} & x_{1,k} & \ldots & x_{n-1,k} & y_k \end{bmatrix}^T    (2)

where each vector contains observations of the n input or independent variables X_0, ..., X_{n-1} and the
corresponding value of the output or dependent variable Y. Thus, the possibly over- or underdetermined
linear system of equation (3) is formed

\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{m-1} \end{bmatrix} =
\begin{bmatrix} v_0 \\ v_1 \\ \vdots \\ v_{m-1} \end{bmatrix} p + w    (3)
where w is a noise vector of known distribution. Typically the noise and the model parameters are assumed
to be uncorrelated. A more detailed formulation of the same system is that of equation (4).

\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{m-1} \end{bmatrix} =
\begin{bmatrix}
1 & x_{0,0} & \ldots & x_{n-1,0} \\
1 & x_{0,1} & \ldots & x_{n-1,1} \\
\vdots & \vdots & \ddots & \vdots \\
1 & x_{0,m-1} & \ldots & x_{n-1,m-1}
\end{bmatrix}
\begin{bmatrix} \vartheta_0 \\ \vartheta_1 \\ \vdots \\ \vartheta_{n-1} \end{bmatrix} + w    (4)

Table 2: NoSQL database types.

Type           Abstract type        Formal description         Prominent software
document       formatted document   JSON, BSON, XML, YAML      MongoDB, CouchDB, OrientDB
key-value      associative array    JSON, BSON, YAML           Riak, Amazon Dynamo, Redis
column family  wide columns         JSON, BSON, XML, YAML      Apache Cassandra, KeySpace
graph          property graph       JSON-LD, RDF, ontologies   Neo4j, TitanDB, Sparksee
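As a small worked example of equation (4), the following sketch solves an overdetermined instance of the
system in the least squares sense; the data, the noise level, and all variable names are synthetic and
purely illustrative.

# Illustrative sketch of the overdetermined system of equation (4),
# solved in the least squares sense over synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 4                               # observations and input variables
X = rng.normal(size=(m, n))                 # observations of X_0, ..., X_{n-1}
V = np.hstack([np.ones((m, 1)), X])         # design matrix with the constant column
p_true = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
w = 0.1 * rng.normal(size=m)                # noise vector of known distribution
y = V @ p_true + w

p_hat, *_ = np.linalg.lstsq(V, y, rcond=None)
print(np.round(p_hat, 3))                   # estimated model parameters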
Binomial logistic regression applies a logistic transformation to the observations using a single output
variable. Moreover, the latter is assumed to be binary, essentially a Bernoulli trial, and logistic
regression can compute the probability that the output variable takes either value or, alternatively, it
estimates the model parameters such that

y_k =
\begin{cases}
1, & p^T v_k + \eta_k \geq 0 \\
0, & p^T v_k + \eta_k < 0
\end{cases}    (5)
The logistic probability density function is defined as:

f_X(x; \mu_0, \sigma_0) \triangleq
\frac{e^{-\frac{x-\mu_0}{\sigma_0}}}{\sigma_0 \left(1 + e^{-\frac{x-\mu_0}{\sigma_0}}\right)^2} =
\frac{1}{\sigma_0} \left( e^{\frac{x-\mu_0}{2\sigma_0}} + e^{-\frac{x-\mu_0}{2\sigma_0}} \right)^{-2} =
\frac{1}{4\sigma_0} \operatorname{sech}^2\!\left( \frac{x-\mu_0}{2\sigma_0} \right)    (6)
where the hyperbolic secant is in turn defined as:

\operatorname{sech}(\beta_0 x) \triangleq \frac{1}{\cosh(\beta_0 x)} = \frac{2}{e^{\beta_0 x} + e^{-\beta_0 x}}    (7)
Alternatively, the hyperbolic secant function can be defined in terms of the power series:

\operatorname{sech}(\beta_0 x) \triangleq 1 + \sum_{k > 0,\ k \equiv 0 \bmod 2} \frac{E_{k/2}}{k!} (\beta_0 x)^k    (8)

where E_k is the k-th secant Euler number (OEIS integer sequences A046976 and A046977).
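For comparison purposes, the baseline classifier of this work was built with scikit-learn. A minimal sketch
of such a setup is given below; the feature matrices are synthetic placeholders standing in for the eleven
features of section 5, so the sketch illustrates the workflow rather than the exact experiment.

# Minimal sketch of the scikit-learn logistic baseline; the feature
# matrices here are synthetic placeholders for the features of section 5.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
train_X = rng.normal(size=(1250, 11))            # 1250 training rows, 11 features
train_y = (rng.random(1250) > 0.5).astype(int)   # binary output indicator
test_X = rng.normal(size=(3750, 11))

clf = LogisticRegression(max_iter=1000)
clf.fit(train_X, train_y)
pred = clf.predict(test_X)                       # hard 0/1 decisions as in equation (5)
prob = clf.predict_proba(test_X)[:, 1]           # probability of a mention to a verified account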
4.2 Convolutional Neural Network
In addition to the logistic classifier, three convolu-
tional neural networks (CNNs) were created using
keras acting as the front end of TensorFlow. The four
stages of any deep learning model in TensorFlow are:
Definition. At this stage the architecture of the network is specified, including the number of
neurons at each layer, the connectivity between layers, and the number and location of the feedback
loops. The latter are important in determining the memory of the classifier.
Translation. This is achieved with the selection
of the loss function and the call of the compile
method which handles model setup.
Training. Fitting is done with the fit and
evaluate methods.
Prediction. The model generates actual predic-
tions based on the predict method.
The architecture of the network is defined by the following template, which is used for each new layer
of processing neurons:

from keras.models import Sequential
from keras.layers import Dense

layers = [Dense(3)]
model = Sequential(layers)
model.add(Dense(n, input_dim=m,
                init='uniform', activation='softplus'))
The model.add() method adds an additional hid-
den or output layer which can be fully, densely, or
sparsely connected with the previous one. The synap-
tic weights were initialized with a uniform distribu-
tion in [0,1]. Four activation functions were consid-
ered, the sigmoid, the softmax, the rectifier, and the
softplus. The sigmoid is defined as:

g_s(s; \beta_0) \triangleq \frac{1}{1 + e^{-\beta_0 s}} = \frac{1}{2}\left(1 + \tanh\!\left(\frac{\beta_0 s}{2}\right)\right)    (9)
The softmax function provides a ranking for the elements of any real valued data vector

x \triangleq \begin{bmatrix} x_0 & x_1 & \ldots & x_{n-1} \end{bmatrix}^T    (10)

by assigning to each individual element x_k the score:

g_m(x_k; \beta_0) \triangleq \frac{e^{\beta_0 x_k}}{\sum_{j=0}^{n-1} e^{\beta_0 x_j}}    (11)
The rectifier or ramp activation function is defined as:

g_r(s; \beta_0) \triangleq \max\{0, |\beta_0| s\} =
\begin{cases}
|\beta_0| s, & s \geq 0 \\
0, & s < 0
\end{cases}    (12)
In many scenarios a smoother version of g_r, termed the softplus function, is used; it is defined as:

g_p(s; \beta_0) \triangleq \ln\left(1 + e^{\beta_0 s}\right)    (13)
The softplus function has two interesting properties with respect to the network training process. First,
it avoids the vanishing gradient problem, which causes the saturation of a number of synaptic weights in
large neural networks. Moreover, up to the factor β_0, it is the antiderivative of the sigmoid function:

g_p(s; \beta_0) = \beta_0 \int^{s} g_s(\tau; \beta_0)\, d\tau, \quad \beta_0 \neq 0    (14)

This implies that a softplus value is essentially composed of a broad spectrum of sigmoid values, which
makes it less prone to instantaneous spikes caused by an isolated training sample. This results in a
smoother trajectory of the weights in the parameter space, rendering the training process more robust.
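A quick numerical sanity check of this antiderivative property, written as a small sketch with an arbitrary
value of β_0, is the following; it verifies that the derivative of the softplus, divided by β_0, coincides
with the sigmoid.

# Numerical check of equation (14): the derivative of the softplus,
# divided by beta_0, matches the sigmoid.
import numpy as np

beta = 1.5
s = np.linspace(-5.0, 5.0, 2001)
softplus = np.log1p(np.exp(beta * s))
sigmoid = 1.0 / (1.0 + np.exp(-beta * s))

d_softplus = np.gradient(softplus, s)                 # central-difference derivative
print(np.max(np.abs(d_softplus / beta - sigmoid)))    # close to zero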
Eventually, g_p was used in all hidden layers, whereas the logistic activation function was placed in the
output layer. The layer sizes, including the input and the output layers, were fixed to [11 : 17 : 11 : 1].
Initially, the first hidden layer maps the input features to a 50% larger space, which is big enough to
offer sufficient discrimination between features which are very close in the original space, but also
small enough to avoid overfitting. Then, the transformed features are mapped back to the original space.
Finally, a logistic regression is applied to these new features.
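Putting the above together, a minimal keras sketch of this [11 : 17 : 11 : 1] architecture is shown below.
It is an illustrative reconstruction rather than the exact training script, and the Keras 2 argument name
kernel_initializer is assumed in place of the older init.

# Illustrative reconstruction of the [11:17:11:1] architecture; Keras 2
# argument names are assumed here.
from keras.models import Sequential
from keras.layers import Dense

def build_model():
    model = Sequential()
    model.add(Dense(17, input_dim=11,
                    kernel_initializer='uniform', activation='softplus'))
    model.add(Dense(11, kernel_initializer='uniform', activation='softplus'))
    model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

model = build_model()
model.summary()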
The cost function to be minimized was selected to be the binary cross-entropy, which, as its name
suggests, is suitable for binary classification problems. Training used the Adam gradient descent
algorithm, which extends the RMSprop and AdaGrad algorithms by maintaining per-parameter learning rates
and using an averaged second moment to update them. Adam is well suited to large and sparse networks and
has low memory requirements. In the source code this choice was declared as:

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
Following that, the dataset is split into two parts. The training part consists of the variables train_X
and train_y, which contain the features and the output variable respectively. Along the same line of
reasoning, the test part comprises the variables test_X and test_y. Once the CNN is trained, its
performance is evaluated and predictions are generated. The training phase consists of J epochs, where the
training vectors are presented to the CNN in mini-batches of Q. In the source code:

model.fit(train_X, train_y,
          nb_epoch=J, batch_size=Q)
model.evaluate(test_X, test_y)
p = model.predict(test_X)
r = [round(float(s)) for s in p]   # threshold the predicted probabilities at 0.5
The training of the CNN plays a crucial role in shaping its generalization capability. In general, a
sweet spot between presenting too few input vectors, in terms of feature variability, and too many
distinguishes a properly trained CNN from either a CNN which cannot capture all the dimensions of the
feature space or a CNN operating effectively like a dictionary. To this end, the following three
scenarios were selected with respect to the parameters J and Q, as sketched below:

J = 12500 and Q = 1000 (CNN1 in table 5)
J = 1250 and Q = 100 (CNN2 in table 5)
J = 150 and Q = 100 (CNN3 in table 5)
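A minimal driver over these three scenarios could look as follows; it reuses the build_model helper from
the architecture sketch above, the training and test arrays are assumed to be already loaded, and the
Keras 2 keyword epochs is used.

# Hypothetical driver over the three (J, Q) training scenarios; build_model
# is the helper sketched earlier and the data arrays are assumed loaded.
scenarios = {"CNN1": (12500, 1000), "CNN2": (1250, 100), "CNN3": (150, 100)}

results = {}
for name, (J, Q) in scenarios.items():
    model = build_model()                            # fresh network per scenario
    model.fit(train_X, train_y, epochs=J, batch_size=Q, verbose=0)
    results[name] = model.evaluate(test_X, test_y, verbose=0)
    print(name, results[name])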
Finally, in order to obtain a better understanding of the predictor performance, the contingency matrix
is generated using the pandas_ml library:

from pandas_ml import ConfusionMatrix

cm = ConfusionMatrix(test_y, r)
cm.print_stats()
5 TRAINING FEATURES
In this section the intuition behind selecting the train-
ing features of the prediction model is given. The lat-
ter is based on the properties of trust, influence, and
community structure. Additionally, the importance of
introducing verified accounts is analyzed. The latter
is rooted in an axiom common in Twitter, and in any
social medium for that matter, stating that:
Axiom 1. Accounts are Created Equal but Hardly Re-
main So.
This can be attributed to a number of factors including
topical variety in the form of hashtags, tweeting fre-
quency, response ratio, or elevated status deriving di-
rectly from the real world entity or person behind the
account. For instance, a Twitter account of an estab-
lished newspaper is typically considered more reliable
and far less prone to fake news posting than that of
Towards Predicting Mentions to Verified Twitter Accounts: Building Prediction Models over MongoDB with Keras
29
an individual netizen. Qualitative metrics for assess-
ing digital influence in Twitter have been proposed in
(Drakopoulos et al., 2016a).
Establishing digital trust is quite complex as it in-
volves a number of psychological variables besides
multimodal interaction. Additionally, it may need to
be re-verified over time, as for instance when a two
factor authentication is required when a trusted ac-
count appears to be active from an unknown location.
At any rate:
Axiom 2. Trust Implies Communication.
Once trust is established between a group of two or more accounts, it usually takes one out of a small
number of behavioral forms sharing imitation as a key concept. One option is that an account takes a lead-
ing role and the others tend to imitate its actions, per-
haps with a varying time lag and some minor alter-
ations. Clearly in this case the leader enjoys more
trust than the remaining accounts and the trust rela-
tionships form a star with some possible edges con-
necting non-leading accounts.
Another alternative is the peer group where each
account enjoys a comparable amount of trust from the
other peer accounts. Again, in this case behavioral
correlations tend to appear, although the variance in time lag may be initially larger since each peer is
likely to do some fact checking by comparing the behavior of its peers. However, once a critical mass of
peers has performed the same action, the others will
imitate these pioneering peers with a very short time
lag and with much smaller variance. Thus, depending
on the strength of trust connections, in peer groups
imitation might be a two-phase phenomenon. The
trust relationships in this scenario tend to form a cycle
with many cross-connecting edges, or at the extreme
case a clique.
An intermediate case emerges when there is a
leading subgroup within the trust group. In that case
time lags tend to be somewhat smaller, since imitation
is quick both inside the leading subgroup and between
the latter and the remaining group. Functionally, the
leading subgroup plays approximately the role of the
vertex cover as memes tend to be copied quickly from
the center to the periphery.
As mentioned earlier, the key in all three cases is
imitation with a varying time lag. Thus, delayed cor-
relations are bound to appear in the features of the
training set. Therefore, a good classifier is expected
to exploit this critical property by having some form of memory. This is a significant difference between
the logistic regressor, whose coefficients might contain a limited reflection of these correlations, and
the convolutional neural network, whose structure and training process facilitate memory at the expense,
of course, of a more complicated and slower training procedure.
As a sidenote, convolutional networks are not the only class of classifiers with memory. Ordinary
feedforward neural networks assimilate features of the training set in their synaptic weights, Kohonen
networks or self organizing maps rely on the spatial clustering of their processing units as a form of
memory resulting from a Hebbian training process, tensor stack networks depend on the cross training of a
forest of neural networks, whereas the models relying on Volterra kernels incorporate memory in the number
and lags of the kernel indices as follows:

g(x) \approx \sum_{k=1}^{p} \sum_{j=1}^{p_k} h_{i_{j,1},\ldots,i_{j,k}} \prod_{s=1}^{k} g[n - i_{j,s}]    (15)
One more fundamental axiom is:
Axiom 3. Influence Implies Trust.
This element is important in its own right, but also allows the indirect determination of trust through
influence. This methodology has been employed in social network analysis in order to infer the proper
cluster size distribution out of a number of possible ones generated by various community discovery
algorithms (Drakopoulos et al., 2017c)(Drakopoulos et al., 2017b).
The data for the prediction models consist of 5000 rows in total, where 1250 of these comprise the
training set and the remaining 3750 the test set. Each row contains features extracted from a single
tweet, where every tweet pertains to the presentation of the Literature Nobel prize to Bob Dylan, a very
popular topic at the time the dataset was collected.
In order to construct the prediction model, a mix of structural and functional features will be used.
Moreover, the feature set has been augmented with semantic information: the eleven top trending hashtags,
shown in table 3, for the day the dataset was being collected have been retrieved. This was done on the
grounds that they typically carry more semantic information in comparison to an ordinary word in a tweet,
in a relationship similar to the one between the index terms and the ordinary terms in a document
(Drakopoulos et al., 2017a). Out of them only #NobelPrize, #BobDylan, #literature, and #Nobel were kept
as the frequency of the remaining hashtags was very low.
Thus each row consists of eleven features (f_1 to f_11) plus the value of the output indicator (y):

f_1-f_4: Four individual binary indicators, each corresponding to whether one of the selected hashtags is present in the tweet.
f_5: The number of followers Φ(k) of the account.
Table 3: Percentage of appearance of the top trending hashtags.

NobelPrize    78.86%      literature   56.53%
medicine       0%         malala        0%
PremioNobel   14.53%      Nobel        66.12%
BobDylan      89.35%      peace         0.04%
AliceMunro     0%         physics       0.02%
f_6: The number of followees Ψ(k) of the account.
f_7: The number of tweets of the account.
f_8: The tweet id.
f_9: The number of mentions in the tweet.
f_10: Is the poster verified?
f_11: Is a verified account mentioned in the tweet?
y: Is the mention towards a verified account?
Since the tweet id takes large values compared to the boolean indicators, the latter were rescaled by
multiplying them by 1000 in order to keep the features at a comparable range. This contributes to the
acceleration and the stability of the training process. Additionally, the original dataset is balanced in
terms of value variability, facilitating an easy partitioning into training and test sets, as shown in
table 4. If a feature f_k is an indicator, then the number of rows for which it is non-zero, |I[f_k] = 1|,
is shown; otherwise its deterministic variance Var[f_k] is shown. Also the output variable y is treated as
an indicator feature. From the values of table 4 it follows that the partitioning of the original dataset
is correct in the sense of having comparable feature variability in both the training and the test sets.
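A sketch of this preprocessing step in pandas is given below; the feature file name and the DataFrame
column names are hypothetical, and only the ordering of the rows is assumed to match the 1250/3750 split
of table 4.

# Hypothetical preprocessing sketch: rescale the boolean indicator features
# and split the 5000 rows into the 1250/3750 training and test partitions.
import pandas as pd

df = pd.read_json("tweet_features.json")              # assumed export of the feature rows
indicator_cols = ["f1", "f2", "f3", "f4", "f10", "f11"]
df[indicator_cols] = df[indicator_cols] * 1000         # keep features at comparable ranges

train, test = df.iloc[:1250], df.iloc[1250:]
train_X, train_y = train.drop(columns="y").values, train["y"].values
test_X, test_y = test.drop(columns="y").values, test["y"].values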
6 RESULTS
The contingency matrices for the four predictors are
shown in table 5. Additionally, in the same table the
values of a number of statistical metrics based on the
contingency matrix are shown as well as the training
time for each predictor. In the case of the logistic re-
gressor, the time for computing the regression coeffi-
cients is used. Recall that tp, tn, fp, and fn stand for
true positives, true negatives, false positives, and false
negatives respectively as is customary in data mining
literature.
The most significant statistical metrics which can
be directly computed from any contingency matrix in-
clude:
Accuracy (ac): Measures how many mentions are
properly classified, regardless of whether an ac-
count is verified or not.
Specificity (sp): Measures how many of the men-
tions to unverified accounts are actually discov-
ered.
Precision (pr): Measures how many of the men-
tions classified as relevant to verified accounts are
actually such.
Sensitivity or recall (rc): Measures how many of
the mentions to verified accounts are actually discovered.
False positive rate or type I error rate (t1): Mea-
sures how many of the mentions to unverified ac-
counts are misclassified.
False negative rate or type II error rate (t2): Mea-
sures how many of the mentions to verified ac-
counts are misclassified.
Negative predictive value (npv): Measures how
many of the mentions classified as relevant to un-
verified are actually such.
The Matthews correlation coefficient or phi coefficient (mcc): Measures the agreement between
the predicted values and the actual ones. It works even when the samples from the two categories
are unevenly represented in the dataset.
Also, the latter is defined as:

mcc \triangleq \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}    (16)
Additional metrics which can be defined in terms of the ones presented here are the F1 score (f1), the
informedness (if), and the markedness (mk), which are respectively defined as:

f1 \triangleq \frac{2}{\frac{1}{rc} + \frac{1}{pr}}, \quad
if = sn + sp - 1, \quad
mk = pr + npv - 1    (17)
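The following helper collects these definitions in one place; it is an illustrative reimplementation
written in this paper's notation, not the pandas_ml routine used in section 4.

# Illustrative helper computing the contingency matrix metrics defined above.
from math import sqrt

def contingency_metrics(tp, tn, fp, fn):
    ac = (tp + tn) / (tp + tn + fp + fn)             # accuracy
    sp = tn / (tn + fp)                              # specificity
    pr = tp / (tp + fp)                              # precision
    rc = tp / (tp + fn)                              # sensitivity / recall
    t1, t2 = fp / (fp + tn), fn / (fn + tp)          # type I and II error rates
    npv = tn / (tn + fn)                             # negative predictive value
    mcc = (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    f1 = 2 / (1 / rc + 1 / pr)
    informedness, markedness = rc + sp - 1, pr + npv - 1
    return dict(ac=ac, sp=sp, pr=pr, rc=rc, t1=t1, t2=t2, npv=npv,
                mcc=mcc, f1=f1, informedness=informedness, markedness=markedness)

# Example usage with the tp/tn/fp/fn counts of one predictor from table 5.
print(contingency_metrics(tp=1285, tn=1236, fp=742, fn=487))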
The values of the contingency tables summarized in table 5 can be interpreted as follows. Regarding the
CNN training process, the fewer training rows the network is presented with, the more accurate and precise
it becomes, at the expense of recall. This might imply that the CNN can develop only a limited
generalization capability from a low number of training rows. Moreover, the type I and II error rates
increase with the number of training rows.
Table 4: Dataset partition.

set     |I[f_1 = 1]|    |I[f_2 = 1]|    |I[f_3 = 1]|    |I[f_4 = 1]|    Var[f_5]   Var[f_6]
train   591 (47.28%)    630 (50.41%)    636 (50.88%)    621 (49.68%)    1216.24    2264.11
test    1904 (50.77%)   2003 (53.41%)   1822 (48.58%)   1852 (49.38%)   1693.44    2253.43

set     Var[f_7]   Var[f_8]   Var[f_9]   |I[f_10 = 1]|   |I[f_11 = 1]|   |I[y = 1]|
train   64.11      399812     37.21      367 (29.36%)    663 (53.04%)    628 (50.24%)
test    53.43      482703     33.32      1102 (29.38%)   1897 (50.58%)   1925 (51.33%)
Table 5: Contingency matrices, metrics values, and predictor training time (sec).
model tp tn fp fn ac sp pr rc (sn)
logistic 1285 1236 742 487 0.6727 0.6248 0.6639 0.7251
CNN1 1438 1316 535 461 0.7344 0.7109 0.7288 0.7572
CNN2 1400 1361 490 499 0.7363 0.7352 0.7407 0.7372
CNN3 1365 1472 379 534 0.7565 0.7952 0.7826 0.7187
model t1 t2 npv mcc f1 if mk time
logistic 0.3361 0.3752 0.7173 0.3506 0.6932 0.3499 0.3812 17
CNN1 0.2890 0.2427 0.7405 0.4688 0.7427 0.4861 0.4693 731
CNN2 0.2592 0.2627 0.7317 0.4724 0.7389 0.4724 0.4724 362
CNN3 0.2173 0.2812 0.7337 0.5152 0.7493 0.5119 0.5163 242
However, even their lowest recorded values might not be considered acceptable for a real world
application. We believe that this can be remedied by fine tuning the architecture and the training of
the CNN.
In terms of accuracy, precision, type I and II error rates, and the Matthews correlation, every CNN
outperforms the standard logistic classifier. Moreover, in two training cases the CNN achieves better
recall. This can be attributed to the extended training process of the CNN as well as to the fact that
its considerably more numerous synaptic weights provide more memory compared to the limited memory
present in the logistic regressor.
7 CONCLUSIONS AND FUTURE
WORK
This conference paper presents the implementation in keras, using TensorFlow as a backend, of a
convolutional neural network (CNN) under three different training scenarios in order to predict whether
the next tweet will contain a mention to a verified Twitter account. The dataset is stored as a JSON
collection in a MongoDB instance and comprises 5000 tweets pertaining to the award of the Literature
Nobel prize to Bob Dylan. In order to demonstrate the inherent power of the CNN, a logistic regression
was performed on the same dataset and nine statistical metrics were computed from the four contingency
matrices. In terms of accuracy, precision, and type I and type II error rates the CNN is superior.
The methodological framework of this work can
be extended in a number of ways. Concerning im-
mediate research, a number of combinations of hid-
den layers and activation functions can be constructed
in TensorFlow in order to find the architecture which
yields the optimum value for accuracy or for other
metrics. Additionally, the deployment of TensorFlow
or a similar tool such as theano or torch7 over a GPU
or an array of GPUs would accelerate the training pro-
cess.
Longer term research objectives include tech-
niques for accelerating the training process by using
additional constraints for pruning highly correlated
synaptic weights, selecting a different objective func-
tion which offers increased interpretability within the
given Twitter context, and augmenting the training
dataset with affective information. The latter is the
primary driving force behind the digital activity of ne-
tizens and, thus, has a plethora of applications to so-
cial media such as evaluating brand loyalty, predict-
ing the outcome of political campaigns, and assess-
ing the digital influence of accounts (Drakopoulos,
2016). As for the synaptic weight constraints, seman-
tic metrics such as Wu-Palmer or Leacock-Chodorow
in conjunction with information extracted from the
features can be used to control the variance of synap-
tic weights in neighboring layers. Finally, when the
mentions to verified accounts are rare, results from
the extreme value theory such as the Fisher-Tippett-
Gnedenko theorem can be used to model the proba-
bilistic behavior of mentions.
ACKNOWLEDGEMENTS
The financial support by the European Union and Greece (Partnership Agreement for the Development
Framework 2014-2020) under the Regional Opera-
tional Programme Ionian Islands 2014-2020 for the
project “TRaditional corfU Music PresErvation thr-
ough digiTal innovation - TRUMPET” is gratefully
acknowledged.
REFERENCES
Blackmore, S. (2000). The meme machine. Oxford University Press.
Drakopoulos, G. (2016). Tensor fusion of social structural
and functional analytics over Neo4j. In IISA. IEEE.
Drakopoulos, G., Gourgaris, P., and Kanavos, A. (2018).
Graph communities in Neo4j: Four algorithms at
work. Evolving Systems.
Drakopoulos, G., Kanavos, A., Karydis, I., Sioutas, S., and
Vrahatis, A. G. (2017a). Tensor-based semantically-
aware topic clustering of biomedical documents.
Computation, 5(3).
Drakopoulos, G., Kanavos, A., Mylonas, P., and Sioutas,
S. (2017b). Defining and evaluating Twitter influence
metrics: A higher order approach in Neo4j. SNAM,
71(1).
Drakopoulos, G., Kanavos, A., and Tsakalidis, A. (2016a).
Evaluating Twitter influence ranking with system the-
ory. In WEBIST.
Drakopoulos, G., Kanavos, A., and Tsakalidis, K. (2017c).
Fuzzy random walkers with second order bounds: An
asymmetric analysis. Algorithms, 10(2).
Drakopoulos, G., Kontopoulos, S., and Makris, C. (2016b).
Eventually consistent cardinality estimation with ap-
plications in biodata mining. In SAC. ACM.
Drakopoulos, G., Stathopoulou, F., Kanavos, A.,
Paraskevas, M., Tzimas, G., Mylonas, P., and Il-
iadis, L. (2019). A genetic algorithm for spatiosocial
tensor clustering: Exploiting TensorFlow potential.
Evolving Systems.
Gilbert, E. and Karahalios, K. (2009). Predicting tie
strength with social media. In SIGCHI conference on
human factors in computing systems, pages 211–220.
ACM.
Golbeck, J. and Hendler, J. (2006). Inferring binary trust
relationships in Web-based social networks. TOIT,
6(4):497–529.
Golbeck, J., Hendler, J., et al. (2006). Filmtrust: Movie
recommendations using trust in Web-based social net-
works. In Proceedings of the IEEE Consumer commu-
nications and networking conference, pages 282–286.
Golbeck, J. A. (2005). Computing and applying trust in
web-based social networks. PhD thesis, University of
Maryland, College Park.
Jamali, M. and Ester, M. (2010). A matrix factorization
technique with trust propagation for recommendation
in social networks. In Proceedings of the fourth ACM
conference on Recommender systems, pages 135–142.
ACM.
Mislove, A., Marcon, M., Gummadi, K. P., Druschel, P.,
and Bhattacharjee, B. (2007). Measurement and anal-
ysis of online social networks. In Proceedings of the
7th ACM SIGCOMM conference on Internet measure-
ment, pages 29–42. ACM.
Muller, M. (2004). Multiple paradigms in affective comput-
ing. Interacting with Computers, 16(4):759–768.
Pang, B. and Lee, L. (2008). Opinion mining and senti-
ment analysis. Foundations and trends in information
retrieval, 2(1-2):1–135.
Papalexakis, E. E. and Faloutsos, C. (2015). Fast effi-
cient and scalable core consistency diagnostic for the
PARAFAC decomposition for big sparse tensors. In
ICASSP, pages 5441–5445.
Papalexakis, E. E., Pelechrinis, K., and Faloutsos, C.
(2014). Spotting misbehaviors in location-based so-
cial networks using tensors. In WWW, pages 551–552.
Picard, R. W. (2003). Affective computing: Challenges.
International Journal of Human-Computer Studies,
59(1):55–64.
Picard, R. W., Vyzas, E., and Healey, J. (2001). Toward
machine emotional intelligence: Analysis of affective
physiological state. TPAMI, 23(10):1175–1191.
Russell, M. A. (2013). Mining the social Web: Analyzing
data from Facebook, Twitter, LinkedIn, and other so-
cial media sites. O’Reilly, 2nd edition.
Thompson, K. (1984). Reflections on trusting trust. Com-
munications of the ACM, 27(8):761–763.
Tversky, A. (1977). Features of similarity. Psychological
review, 84(4):327.
www.counterpunch.org (2017). Go ask Alice: The curious
case of Alice Donovan.
www.theguardian.com (2017). Twitter drops egg avatar in attempt to break association with Internet trolls.