Traffic Data Analysis from Social Media
Aiden Bezzina and Luana Chetcuti Zammit
Department of Systems and Control Engineering, University of Malta, Malta
Keywords: Big Data Analytics, Traffic Management Reporting System, Traffic-Based Information System, Social Media Analysis.
Abstract: Social networking sites serve a very important role in our daily lives, providing a platform where thoughts can be easily shared and expressed. As a result, these sites generate an endless amount of information about an extensive range of topics. Nowadays, analysing the content of social media is made possible through Application Program Interfaces (APIs). One particular application of such content analysis is traffic: traffic events can be determined from these sites. Social networking sites therefore have the potential to be utilised as a very cost-effective social sensor, whereby social media posts serve as the sensor information. Advancements in the field of machine learning have provided techniques with which social media posts can be harvested to detect small-scale events, particularly traffic events, in a timely manner. This work aims to develop a traffic-based information system that relies on analysing the content of social media data. Social media content is classified as either ‘traffic-related’ or ‘non-traffic-related’. ‘Traffic-related’ events are further classified into various ‘traffic-related’ sub-categories, such as: ‘accidents’, ‘incidents’, ‘traffic jams’, and ‘construction/road works’. The date, time, and geographical information of each associated traffic event are also determined. To reach these aims, several algorithms are developed: i) an adaptive data acquisition algorithm that gathers events from social media; ii) several supervised binary classification algorithms that analyse the content of social media and classify it as either ‘traffic-related’ or ‘non-traffic-related’; iii) a topic classification algorithm that further classifies the ‘traffic-related’ events into the sub-categories previously mentioned; and iv) a geoparser algorithm that obtains the date, time and geographical information of each traffic event. A fully functional, real-time, automated system is developed by interconnecting all these algorithms. The developed system produces very promising results when applied to Twitter data as a source of information. The results show that social networking sites have the potential to serve as a very efficient means of detecting not only small-scale events, such as traffic events, but, when scaled up, also large-scale events.
1 INTRODUCTION
Traffic congestion is a significant problem in many cities around the world. Generally, traffic congestion can be subdivided into two different types of congestion, namely: recurrent congestion and non-recurrent congestion (Gu et al., 2016). Recurrent congestion is a type of congestion that occurs on a repetitive day-to-day basis, resulting in recurrent flow patterns, whereas non-recurrent congestion is typically induced by an abnormal or unexpected event, such as road works and incidents (Gu et al., 2016). Consequently, detecting these types of abnormal events in both a timely and efficient manner provides commuters the possibility to plan their route accordingly, thus mitigating any future traffic congestion (Gu et al., 2016).
Two techniques are typically found in the literature to detect abnormal traffic events, namely: traditional traffic event detection techniques and online traffic event detection techniques. Traditional traffic event detection techniques usually encompass some form of data acquisition through a physical medium, such as sensors, which is then typically analysed to infer or derive conclusions (Gu et al., 2016). In contrast, online traffic event detection techniques acquire data from social networking sites, such as Twitter or Facebook. Traditional methods tend to be restricted by sensor coverage due to sparsely placed sensors. This in turn makes such approaches quite inefficient when it comes to traffic event detection, due to the natural
randomness in the location and time of such events
(Gu et al., 2016). Moreover, social networking sites tend to have a very large user base and allow users to share both images and videos, thus generating an endless amount of data daily. This makes online traffic event detection a very cost-effective and efficient technique relative to traditional methods (Gu et al., 2016). Various works in the literature make use of social media messages to detect traffic events, such as the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012). Gu et al. (2016) developed a classifier based on a Semi-Naïve Bayes (SNB) model to filter out ‘non-traffic-related’ tweets; ‘traffic-related’ tweets are then analysed and further classified into ‘traffic-related’ sub-categories using a supervised Latent Dirichlet Allocation (sLDA) algorithm. Schulz et al. (2013) developed classifiers to detect small-scale car accidents reported on Twitter; some of these classifiers are based on the Naïve Bayes Binary (NBB) model and the Support Vector Machine (SVM). Li et al. (2012) proposed TEDAS, a system capable of retrieving, pre-processing, classifying, and geoparsing ‘traffic-related’ tweets to extract both the nature of the traffic events and their associated geographic information; this system is based on a set of rules to analyse the tweets. Similar to the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012), the aim of this work is to develop a traffic-based information system that relies on analysing the content of social media data from Twitter. Differently from the other works, an adaptive data acquisition algorithm is developed in which a rule ‘r’ is chosen if it is found within a specific percentage of all newly and previously classified traffic-related tweets. Furthermore, pre-processing is carried out as shown in Table 1, which summarises the differences between this work and the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012).
Tweets are classified as either ‘traffic-related’ or ‘non-traffic-related’. Unlike the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012), where only one or two classifiers were developed, in this work four supervised binary classification algorithms are developed, with the aim of analysing their performance in the Results Section. Classifiers based on the Multinomial Naïve Bayes (MNB) model, the SNB model, the Multivariate Bernoulli Naïve Bayes (MVBNB) model and the SVM are developed. ‘Traffic-related’ tweets are analysed and further classified into ‘traffic-related’ sub-categories, namely ‘accidents’, ‘incidents’, ‘traffic jams’, and ‘construction/road works’, using an sLDA algorithm. The performance of the classifiers of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012) is compared to that of the classifiers developed in this work, as detailed in Section 3. The proposed traffic-based information system, which also determines the date, time, and geographical information of each associated traffic event, is described in Section 2 of the paper. Section 3 presents the results of the proposed system, followed by conclusions and possible future work in Section 4.
2 METHODOLOGY
The stages involved in the developed system, as shown in Figure 1, are described in this section. All stages are implemented in the R programming language, which provides a vast number of analysis tools and access to many useful off-the-shelf packages (The R Foundation, 2022).
Figure 1: Developed system stages.
2.1 Data Acquisition
An adaptive data acquisition approach is developed to ensure that the best quality and the maximum number of ‘traffic-related’ tweets are gathered (Gu et al., 2016). All gathered tweets are in English. An adaptive ‘traffic-related’ keyword dictionary is formed to filter the Twitter stream sessions. To extract tweets, the Twitter REST API is used (IBM Cloud Education, 2021). An initial keyword dictionary is generated using a unigram, document frequency (DF) based bag-of-words (BOW) model: DF-based filtration against a predefined threshold is applied to extract the initial keywords. To make the data acquisition adaptive, new ‘traffic-related’ keywords are generated and appended to the initial dictionary by repeating the same procedure on streamed tweets that have since been classified as ‘traffic-related’. As a result, the algorithm is capable of expanding its initial dictionary to adapt to the language of newly streamed tweets.
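As an illustration, the following minimal R sketch shows how such a DF-based unigram dictionary could be built and adaptively expanded; the function names and the threshold value are hypothetical rather than those of the actual implementation.

```r
# Sketch of the DF-based keyword extraction: build a unigram bag-of-words,
# compute document frequencies, and keep unigrams above a threshold.
extract_keywords <- function(tweets, df_threshold = 0.05) {
  tokens <- lapply(tolower(tweets), function(t) unique(strsplit(t, "\\s+")[[1]]))
  df <- table(unlist(tokens)) / length(tweets)  # fraction of tweets containing each unigram
  names(df[df >= df_threshold])
}

# Adaptive step: mine keywords from newly classified 'traffic-related'
# tweets and append them to the existing dictionary.
update_dictionary <- function(dictionary, new_traffic_tweets, df_threshold = 0.05) {
  unique(c(dictionary, extract_keywords(new_traffic_tweets, df_threshold)))
}
```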
For ease of implementation, streaming sessions are initiated through the rtweet R package. In particular, the stream_tweets function provides an interface with a large range of input arguments, which makes streaming tweets very simple; however, it is limited to filtering tweets based upon only one type of query, be it location, keywords, or user IDs.
For further analysis, parsing is applied to convert the streamed tweets, stored in a JSON file, into an R object via the parse_stream function, also found in the rtweet R package. Subsequently, any retweets or duplicate tweets gathered during the streaming instance are deleted.
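A minimal sketch of a streaming and parsing session is shown below, assuming the pre-1.0 rtweet interface and result schema (columns such as is_retweet and text) and a previously built keyword dictionary; authentication setup is omitted.

```r
library(rtweet)  # assumes the pre-1.0 rtweet interface; API tokens omitted

dictionary <- c("traffic", "accident", "roadworks")  # illustrative keywords

# Stream tweets matching the keyword dictionary for 60 seconds, writing the
# raw results to a JSON file (parse = FALSE defers parsing).
stream_tweets(q = paste(dictionary, collapse = ","),
              timeout = 60,
              file_name = "traffic_stream.json",
              parse = FALSE)

# Parse the streamed JSON into an R object, then drop retweets and
# duplicate texts gathered during the streaming instance.
tweets <- parse_stream("traffic_stream.json")
tweets <- tweets[!tweets$is_retweet, ]
tweets <- tweets[!duplicated(tweets$text), ]
```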
Table 1: Similarities and differences with the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012).

Data acquisition:
- This work: traffic-related keyword and language filtering.
- Gu et al. (2016): keyword filtering.
- Schulz et al. (2013): spatial, temporal and language filtering.
- Li et al. (2012): traffic-related keyword filtering.

Adaptive data acquisition:
- This work: yes. A rule ‘r’ is chosen if it is found within a specific percentage of all newly and previously classified traffic-related tweets.
- Gu et al. (2016): yes. Based on the assumption that a good rule generally associates with the tweets relating to the subject at hand; every unigram and bigram is extracted as a candidate rule, a rule ‘r’ is passed if its confidence passes a certain threshold, and a rule validator is then utilised to examine the usefulness of each new rule.
- Schulz et al. (2013): no.
- Li et al. (2012): yes. A tokenizer is first applied to all ‘traffic-related’ tweets; a reducer then counts the total of positive and negative labels for each token combination, and the token combinations with the maximum positive counts are chosen as new rules.

Pre-processing:
- This work: removing Twitter mentions, links, emoticons, tweet-associated words, stop words, punctuation, brackets, blank spaces and numbers; converting to lowercase; resolving abbreviations; replacing contractions and symbols.
- Gu et al. (2016): no.
- Schulz et al. (2013): removing retweets, @ mentions and stop words; resolving abbreviations; applying the Google Spellchecking API; replacing temporal and spatial expressions; applying the Stanford lemmatization function and the Stanford POS tagger to extract only nouns and proper nouns.
- Li et al. (2012): no.

Binary classification:
- This work: Multinomial NB, SNB, Bernoulli NB, SVM.
- Gu et al. (2016): SNB.
- Schulz et al. (2013): MNB, Ripper rule learner, SVM.
- Li et al. (2012): TEDAS.

Multi-class classification:
- This work: supervised LDA accompanied by a multi-class SVM.
- Gu et al. (2016): supervised LDA accompanied by a multi-class SNB.
- Schulz et al. (2013): no.
- Li et al. (2012): no.

Geotagging:
- This work: utilising an NER model.
- Gu et al. (2016): utilising the GPS tag in a tweet, the content of a tweet, and predicting the location based on the user’s history, friends, etc.
- Schulz et al. (2013): Stanford NER model.
- Li et al. (2012): regular expressions geotagger and a fuzzy geotagger.
2.2 Pre-Processing
Pre-processing techniques are applied to transform the streamed tweets into a format that eases classification by removing redundant information, such as stop words, using the package defined in (Hornik, 2021), as detailed in Table 1. Figure 2 shows the pre-processing steps carried out in this work.
Figure 2: Pre-processing stages.
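A minimal R sketch of some of these steps is given below; it mirrors only a subset of the stages in Figure 2 (abbreviation resolution and contraction replacement are omitted), and the regular expressions are illustrative.

```r
library(tm)  # provides stopwords() and removeWords()

preprocess_tweet <- function(text) {
  text <- tolower(text)                        # convert to lowercase
  text <- gsub("@\\w+", "", text)              # remove Twitter mentions
  text <- gsub("http\\S+|www\\S+", "", text)   # remove links
  text <- removeWords(text, stopwords("en"))   # remove stop words
  text <- gsub("[[:punct:]]", " ", text)       # remove punctuation and brackets
  text <- gsub("[0-9]+", " ", text)            # remove numbers
  gsub("\\s+", " ", trimws(text))              # remove redundant blank spaces
}

preprocess_tweet("Heavy TRAFFIC near the airport! @user1 https://t.co/abc 123")
```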
2.3 Classification
The initial step of classification comprises classifying tweets as either ‘traffic-related’ or ‘non-traffic-related’. Four binary text classifiers are developed, namely: the MNB model, the SNB model, the MVBNB model and the SVM. The optimal classifier from these four is chosen.
2.3.1 NB Classifiers
The core of all NB classifiers is Bayes’ theorem, given by Equation (1):

\[
P(\text{class} \mid \text{features}) = \frac{P(\text{features} \mid \text{class}) \cdot P(\text{class})}{P(\text{features})}
\tag{1}
\]

where P(class | features) is the posterior probability distribution, i.e. the probability of a specific instance belonging to a particular class given its observed features; P(features | class) is the conditional probability, i.e. the probability of observing a set of features within a particular class; and P(class) is the prior probability, i.e. the probability of a particular class within a given dataset, hence depicting the likelihood of encountering a specific class.
In practice, for ease of implementation, NB classifiers make two assumptions, namely:

Independent and identically distributed: random variables must be unrelated to one another whilst also being derived from the same probability distribution (Raschka, 2014).

Conditional independence of features: all features in a dataset are mutually independent given the class (Raschka, 2014). This assumption is what gives NB classifiers their ‘naïve’ property. As a result, the likelihoods or class-conditional probabilities of a particular feature set can be calculated directly from the given dataset with ease, as shown in Equation (2):

\[
P(\text{features} \mid \text{class}) = \prod_{j} P(f_j \mid \text{class})
\tag{2}
\]

where f represents the feature set given by \{f_1, f_2, f_3, \ldots, f_j\}.
In reality, this assumption is more often than not violated. Nonetheless, NB classifiers still tend to perform extremely well under this assumption (Raschka, 2014; Askari et al., 2020; Rish, 2001). However, strong violations of this assumption and nonlinear classification tasks tend to lead NB classifiers to perform very poorly (Raschka, 2014).
Considering a binary classification task, the decision rule for NB classifiers can be defined as:

\[
\text{class} =
\begin{cases}
A & \text{if } \dfrac{\prod_j P(f_j \mid \text{class}_A) \cdot P(\text{class}_A)}{P(\text{features})} \geq \dfrac{\prod_j P(f_j \mid \text{class}_B) \cdot P(\text{class}_B)}{P(\text{features})} \\
B & \text{otherwise}
\end{cases}
\tag{3}
\]
In practice, the evidence term can be neglected, since it is merely used as a scaling factor (Raschka, 2014). Therefore, the final decision rule can be defined as:

\[
\text{class} =
\begin{cases}
A & \text{if } \prod_j P(f_j \mid \text{class}_A) \cdot P(\text{class}_A) \geq \prod_j P(f_j \mid \text{class}_B) \cdot P(\text{class}_B) \\
B & \text{otherwise}
\end{cases}
\tag{4}
\]
In practice, NB classifiers tend to suffer from the problem of zero probabilities. This problem arises whenever a specific feature f_1 is not available within a particular class, thus leading to its class-conditional probability being equal to zero. One solution is additive smoothing, a technique commonly utilised to smooth categorical data. With the introduction of additive smoothing, the class-conditional probability for a specific feature f_1 can be defined as:

\[
P(f_1 \mid \text{class}) = \frac{N_{f_1} + \alpha}{N_f + \alpha d}
\tag{5}
\]

where N_{f_1} is the count of feature f_1 within the class, N_f is the total count of all features within the class, \alpha is an additive smoothing parameter, and d gives the dimensionality of the feature set within the class. By setting the value of the smoothing parameter greater than zero, it is guaranteed that the zero probability problem is avoided. Generally, Lidstone smoothing (\alpha < 1) and Laplace smoothing (\alpha = 1) are the two most common additive smoothing types (Raschka, 2014).
2.3.2 MNB Classifier
The MNB classifier models the distribution of each feature, P(features | class), as a multinomial distribution, making it ideal for data that can be easily transformed into numerical counts (Russell and Norvig, 1995). In general, the MNB model operates on term frequencies, denoted TF. The binary decision rule of the MNB classifier can be defined as:

\[
\text{class} =
\begin{cases}
A & \text{if } \prod_j P(t_j \mid \text{class}_A) \cdot P(\text{class}_A) \geq \prod_j P(t_j \mid \text{class}_B) \cdot P(\text{class}_B) \\
B & \text{otherwise}
\end{cases}
\tag{6}
\]

where

\[
P(t_j \mid \text{class}_A) = \frac{TF_{t_j,A} + \alpha}{TF_{t,A} + \alpha d}
\tag{7}
\]

TF_{t_j,A} represents the frequency of term t_j within class A, and TF_{t,A} represents the total count of all the term frequencies within class A.
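The following R sketch implements Equations (6) and (7) directly, with Laplace smoothing (α = 1); it is a compact illustration under assumed input formats (a list of token vectors and a factor of labels), not the tuned classifier used in this work.

```r
# Minimal multinomial Naive Bayes with additive smoothing.
train_mnb <- function(docs, labels, alpha = 1) {
  vocab <- unique(unlist(docs))
  classes <- levels(labels)
  # Term frequencies TF_{t,c}: count of each vocabulary term per class
  tf <- sapply(classes, function(cl) {
    as.numeric(table(factor(unlist(docs[labels == cl]), levels = vocab)))
  })
  rownames(tf) <- vocab
  priors <- table(labels) / length(labels)  # P(class)
  list(tf = tf, priors = priors, vocab = vocab, alpha = alpha)
}

predict_mnb <- function(model, doc) {
  doc <- doc[doc %in% model$vocab]  # unseen terms are ignored
  d <- length(model$vocab)          # dimensionality of the feature set
  scores <- sapply(colnames(model$tf), function(cl) {
    # log P(class) + sum_j log P(t_j | class), smoothed as in Equation (7)
    lik <- (model$tf[doc, cl] + model$alpha) /
           (sum(model$tf[, cl]) + model$alpha * d)
    log(model$priors[[cl]]) + sum(log(lik))
  })
  names(which.max(scores))
}

m <- train_mnb(list(c("heavy", "traffic", "jam"), c("sunny", "day")),
               factor(c("traffic-related", "non-traffic-related")))
predict_mnb(m, c("traffic", "jam"))  # returns "traffic-related"
```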
2.3.3 SNB Classifier
NB classifiers tend to frequently violate the assumption of conditional independence of features. With regards to text, this assumption implies that a specific word has no bearing on the likelihood of observing additional words in the same document or sentence (Raschka, 2014). As previously underlined, strong violations of this assumption can lead NB classifiers to perform very poorly in practice (Raschka, 2014). A countermeasure to this issue is to extend NB classifiers in such a manner that they are capable of detecting dependencies between features (Kononenko, 1991). The main idea behind the SNB classifier is to relax the conditional independence assumption whilst retaining both the simplicity and efficiency of NB classifiers (Zheng and Webb, 2017). In other words, the SNB classifier seeks the optimal balance between ‘non-naïvety’ and the accuracy of the approximations of the conditional probabilities (Zheng and Webb, 2017).
2.3.4 MVBNB Classifier
In contrast to the MNB classifier, the features in the MVBNB model are independent binary values that represent document frequency, denoted DF. The Bernoulli trials for a specific feature set can be defined as:

\[
P(\text{features} \mid \text{class}) = \prod_j P(f_j \mid \text{class})^{b} \cdot \left(1 - P(f_j \mid \text{class})\right)^{(1-b)}
\tag{8}
\]

where b is a boolean term expressing the occurrence or absence of the term in the vocabulary. Consequently, the binary decision rule of the MVBNB classifier can be defined as:

\[
\text{class} =
\begin{cases}
A & \text{if } \prod_j P(t_j \mid \text{class}_A) \cdot P(\text{class}_A) \geq \prod_j P(t_j \mid \text{class}_B) \cdot P(\text{class}_B) \\
B & \text{otherwise}
\end{cases}
\tag{9}
\]

where

\[
P(t_j \mid \text{class}_A) = \frac{DF_{t_j,A} + \alpha}{DF_{t,A} + \alpha d}
\tag{10}
\]

DF_{t_j,A} represents the document frequency of term t_j within class A, and DF_{t,A} represents the total count of all the document frequencies within class A.
2.3.5 SVM Classifier
For a given classification task, the SVM utilises the principle of a maximum margin classifier to discriminate between the data (Senekane and Taele, 2016). In general, the margin can be defined as the distance between the generated decision boundary and the support vectors, i.e. the data points which lie closest to the decision boundary (Senekane and Taele, 2016). For a binary, linearly separable classification task, the decision boundary is a hyperplane given by Equation (11), and classification is based upon the perpendicular distance of the instance to be classified from the generated decision boundary, as given by Equation (12):

\[
y = w^{T} x + b
\tag{11}
\]

\[
\text{class} =
\begin{cases}
A & \text{if } w^{T} x + b \geq 0 \\
B & \text{otherwise}
\end{cases}
\tag{12}
\]

where w is the weight vector (w^T its transpose), x represents the instance to be classified, and b is the bias, a constant.
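As an illustration, a maximum margin classifier can be trained with an off-the-shelf SVM implementation; the sketch below uses the e1071 R package and toy data, both of which are assumptions of this example rather than details stated in the text.

```r
library(e1071)  # off-the-shelf SVM implementation

set.seed(1)
features <- matrix(rpois(60, lambda = 1), nrow = 12)  # toy term-frequency matrix
labels <- factor(rep(c("traffic-related", "non-traffic-related"), each = 6))

# A linear-kernel SVM fits the hyperplane of Equation (11); predict()
# applies the sign-based rule of Equation (12) internally.
model <- svm(x = features, y = labels, kernel = "linear", cost = 1)
pred <- predict(model, features)
```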
2.3.6 Topic Modeling
In the second stage of classification, ‘traffic-related’ tweets are organised into various ‘traffic-related’ sub-categories. In this work, five sub-categories are considered, namely: ‘accident-related information’, ‘incident-related information’, ‘traffic-related information’, ‘construction-related information’, and ‘NA’, which encompasses any tweets that do not fall in any of the other sub-categories. An sLDA algorithm is developed, utilising the document-topic distributions as the feature vectors, whereby each specific tweet is represented by a unique, normalised topic distribution.
The main idea behind sLDA is that documents are represented as a random distribution over a predefined number of latent or hidden topics, whereby each topic is characterised by a unique distribution of the words or terms observed within the corpus (Zrigui et al., 2012). One assumption of sLDA is that each unique document within the corpus can be represented by a BOW model, or equivalently a collection of words, thus neglecting both the specific order and the grammatical role of the words in each document. sLDA defines the generative process as a joint distribution, summarised in Equation (13):
\[
P(\mu_{1:K}, \theta_{1:D}, Z_{1:D}, W_{1:D}) = \prod_{i=1}^{K} P(\mu_i, \beta) \prod_{d=1}^{D} P(\theta_d, \alpha) \prod_{n=1}^{N} P(Z_{d,n} \mid \theta_d) \, P(W_{d,n} \mid \beta_{1:K}, Z_{d,n})
\tag{13}
\]
where:
α is the document-topic density;
β is the topic-word density;
k is a specific topic;
K represents all topics;
µ_k represents the word distribution of topic k;
d is a specific document;
D represents all documents;
θ_d represents the topic distribution of document d;
Z_{d,n} represents the topic assignment of the n-th term in document d;
W_{d,n} represents the n-th term in document d;
N represents all terms within a particular document;
P(µ_i, β) and P(θ_d, α) represent Dirichlet distributions; and
P(Z_{d,n} | θ_d) and P(W_{d,n} | β_{1:K}, Z_{d,n}) represent multinomial distributions.
In practice, maximising Equation (13) proves to be very challenging; thus it is generally opted to maximise Equation (13) through only the words W_{d,n}. As a result, Gibbs sampling is utilised to successively sample the conditional distribution P(W_{d,n} | β_{1:K}, Z_{d,n}).
Furthermore, lemmatization is applied to improve the interpretability of each generated sLDA topic, as highlighted in (Russell and Norvig, 1995). The number of features depicting each tweet is very small relative to the number of feature vectors, thus potentially giving rise to a nonlinear classification task. As a result, an SVM classifier is utilised as part of the sLDA algorithm to learn both linear and nonlinear classification tasks.
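The sketch below illustrates this pipeline using a standard (unseeded) LDA from the topicmodels R package as a stand-in for the seeded sLDA variant developed in this work; the toy corpus, labels, and package choice are assumptions of the example.

```r
library(tm)           # corpus and document-term matrix construction
library(topicmodels)  # standard LDA, used here in place of the seeded sLDA
library(e1071)        # multi-class SVM on the topic-distribution features

texts <- c("crash between two cars on the highway",
           "accident blocking the left lane",
           "roadworks cause long delays",
           "overnight roadworks on the bridge",
           "heavy traffic jam near the city centre",
           "standstill traffic on the bypass")
dtm <- DocumentTermMatrix(Corpus(VectorSource(texts)))

# Fit a 5-topic LDA via Gibbs sampling; the per-document topic
# distributions (theta) become normalised feature vectors.
lda <- LDA(dtm, k = 5, method = "Gibbs", control = list(seed = 1, iter = 200))
theta <- posterior(lda)$topics

# An SVM with a nonlinear kernel then learns the sub-category labels.
labels <- factor(rep(c("accident", "construction", "traffic jam"), each = 2))
clf <- svm(x = theta, y = labels, kernel = "radial")
```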
Other pre-processing techniques capable of reducing the computational overhead of the supervised LDA algorithm are utilised, as depicted in Figure 3.
Figure 3: sLDA Pre-processing stages.
A 10-fold cross-validation technique is applied to the training data to both train and validate all the algorithms. The individual validation scores are then averaged across all folds to generate an overall validation score. The final evaluation is then performed on a separate hold-out set, with the results shown in Section 3.
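A minimal sketch of this validation scheme is given below; train_fn and eval_fn are placeholders for the actual classifiers and the F1 metric used in this work.

```r
# 10-fold cross-validation: each fold serves once as the validation
# set and the per-fold scores are averaged.
cross_validate <- function(x, y, train_fn, eval_fn, k = 10) {
  folds <- sample(rep(1:k, length.out = length(y)))  # random fold assignment
  scores <- sapply(1:k, function(i) {
    model <- train_fn(x[folds != i, , drop = FALSE], y[folds != i])
    eval_fn(model, x[folds == i, , drop = FALSE], y[folds == i])
  })
  mean(scores)  # overall validation score, averaged over all folds
}
```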
In general, text documents tend to store most of their thematic information in nouns and verbs, making other words irrelevant with regards to topic classification. Consequently, POS tagging is performed to extract only the nouns and verbs from each specific text document. For ease of implementation, POS tagging is performed via a pre-trained English model found in the UDPipe R package. To avoid overfitting, whilst also removing frequent words that contribute very little to topic interpretability, nouns and verbs which have a DF of less than five, or which are observed in more than 60% of the dataset, are filtered out.
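The following sketch shows this POS-based filtering via the UDPipe R package, with the DF thresholds stated above; the input texts are illustrative.

```r
library(udpipe)

# Load a pre-trained English UDPipe model (downloaded on first use).
model_file <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(model_file$file_model)

texts <- c("Accident causing delays on the bypass",
           "Road works near the junction")  # illustrative input

# Annotate, then keep only nouns and verbs.
anno <- as.data.frame(udpipe_annotate(ud_model, x = texts))
nouns_verbs <- subset(anno, upos %in% c("NOUN", "VERB"))

# DF filtration: drop lemmas seen in fewer than five documents or in
# more than 60% of the dataset.
df <- tapply(nouns_verbs$doc_id, nouns_verbs$lemma,
             function(ids) length(unique(ids)))
n_docs <- length(unique(anno$doc_id))
keep <- names(df[df >= 5 & df <= 0.6 * n_docs])
```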
To transform the LDA algorithm into a supervised algorithm, words that are capable of discriminating between different topics are initially seeded towards specific topics instead of being given random topic assignments. In this work, five topics are considered, namely: ‘accident-related information’, ‘incident-related information’, ‘traffic-related information’, ‘construction-related information’, and a hidden topic to encompass any text documents which do not fall in any of the preceding topics. To improve the convergence rate of the supervised LDA algorithm, the algorithm is capable of generating extra seed words that are likely to be observed together with the initial seed words. In other words, the algorithm is capable of determining which specific words are interdependent with each initial seed.
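The seeding idea can be illustrated with the base-R sketch below; the seed word lists and the assignment function are hypothetical examples, not the actual seeds used in this work.

```r
# Illustrative seed word lists; topic 5 is the hidden 'NA' topic.
seeds <- list(accident     = c("crash", "collision", "accident"),
              incident     = c("breakdown", "hazard", "spillage"),
              traffic      = c("jam", "congestion", "queue"),
              construction = c("roadworks", "resurfacing", "closure"))

# A token matching a seed list is initialised to that topic; all other
# tokens receive a random initial topic, as in standard Gibbs-sampled LDA.
init_topic <- function(token, n_topics = 5) {
  hit <- which(sapply(seeds, function(s) token %in% s))
  if (length(hit) == 1) hit else sample(n_topics, 1)
}

sapply(c("crash", "queue", "sunny"), init_topic)
```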
2.4 Geoparsing
Tweets are restricted in length, thus tending to omit information about both the time and date of their associated traffic event. Consequently, the REST API is used to determine the time and date of each extracted traffic event. Forward geocoding is applied to transform the extracted locations into their associated latitude and longitude coordinates. To help in the visualisation of the traffic events, a web application is developed, whereby a worldwide map depicting all the different types of geocoded traffic events is generated.
To determine both the time and date of each extracted traffic event, the created_at field obtained during the streaming process is used. The created_at field provides both the time and date when each streamed tweet was posted. As a result, the determination of the time of each extracted traffic event is based on the assumption that traffic events occur at roughly the same instance as when their associated tweets are posted.
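As an illustration, the created_at field can be parsed as follows, assuming Twitter's REST timestamp format and an English locale.

```r
# Twitter's REST API returns created_at strings such as
# "Wed Oct 10 20:19:24 +0000 2018"; parsing assumes an English locale.
created_at <- "Wed Oct 10 20:19:24 +0000 2018"
event_time <- as.POSIXct(created_at,
                         format = "%a %b %d %H:%M:%S %z %Y", tz = "UTC")
```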
To extract the location of each traffic event, NER is applied to label location entities. For ease of implementation, NER is applied through the location_entity function found in the entity R package. Following location extraction, forward geocoding is applied to transform the extracted locations into their associated coordinates in terms of longitude and latitude, through the geocode function found in the tidygeocoder R package. Whenever no location or more than one location is extracted for a specific traffic event, or the utilised geocoding service is unable to transform a specific location into its associated coordinates, the location and its respective coordinates are assigned ‘NA’.
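A minimal sketch of the NER and forward geocoding steps is shown below, using the location_entity and geocode functions named above; the example tweets and the choice of the ‘osm’ geocoding service are assumptions of this sketch.

```r
library(entity)        # location_entity(): NER for location entities
library(tidygeocoder)  # geocode(): forward geocoding

events <- data.frame(text = c("Heavy traffic on Regional Road, Msida",
                              "Accident reported near Valletta"))

# Label location entities; keep tweets yielding exactly one location.
locs <- location_entity(events$text)
events$location <- sapply(locs, function(l) if (length(l) == 1) l else NA)

# Resolve each extracted location to latitude/longitude coordinates.
events <- geocode(events, address = location, method = "osm")
```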
3 RESULTS
The four binary text classifiers were trained, validated, and tested on an annotated dataset (Dabiri, 2018) containing a total of 48,000 tweets, with 50.03% of the tweets being traffic-related. A train-test split ratio of 10 to 2 is applied to the dataset. Each of the text classifiers was evaluated using a 10-fold cross-validation technique, and a separate hold-out set was then utilised to generate benchmark performance metrics. The generated performance metrics of all the tuned classifiers were then compared and analysed to determine the optimal classifier for the classification task at hand.
Figure 4 shows the performance of the four binary text classifiers, corresponding to F1 scores ranging between 0.978 and 0.983. The performance of these classifiers can be compared to those in the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012). Gu et al. (2016) obtained an F1 score of 0.926 with the SNB classifier. Schulz et al. (2013) obtained F1 scores ranging between 0.555 and 0.607 for the MNB, Ripper rule learner and SVM. Li et al. (2012) obtained an F1 score of 0.80 for TEDAS. In all cases, the four classifiers developed in this work outperformed the results obtained in the former works.
Tukey’s Honestly-Significant-Difference results indicate that significant pairwise differences exist between the F1 scores of the tuned SVM classifier and those of all other tuned classifiers at the 95% confidence level. Consequently, the tuned SVM classifier was the optimum classifier.
Table 2 presents the performance of the sLDA algorithm during the training and testing stages with separate hold-out sets. This resulted in an average weighted F1 score of 0.988 over all hold-out sets, thus quantifying the promising classification results of sLDA.
Figure 4: Performance Metrics of all four classifiers.
To ease the visualisation of the results whilst also providing the user some direct control of the system, a web application was developed.
Table 2: Weighted F1 score for the sLDA algorithm.

Training (10-fold average): 0.985
1st hold-out set for testing: 0.988
2nd hold-out set for testing: 0.985
3rd hold-out set for testing: 0.987
4th hold-out set for testing: 0.991
The web application consisted of a simple GUI, whereby the user is given direct control of both the streaming time and the number of iterations to be executed by the system. A worldwide map depicting all the geocoded traffic events is generated, as shown in Figure 5, with blue implying an accident, red an incident, green traffic, and black road works.
Figure 5: Generated worldwide map.
4 CONCLUSION
This work proposes a traffic-based information system that relies on social media data. Tweets are classified as either ‘traffic-related’ or ‘non-traffic-related’ using four binary text classifiers: the MNB model, the SNB model, the MVBNB model and the SVM. The SVM classifier proved to be the optimum classifier, with an F1 score of 0.983. ‘Traffic-related’ events are further classified into various ‘traffic-related’ categories, such as ‘accidents’, ‘traffic jams’, and ‘road works’, using an sLDA algorithm, resulting in a weighted F1 score of 0.988. A fully functional web application, capable of automating the whole procedure, is developed. Future work aims to address the number of topics of the sLDA algorithm: the algorithm requires the number of topics to be known a priori, which is not always possible. More seed words can be defined for each topic category so that each generated sLDA topic can be easily discriminated from the others. A hierarchical Dirichlet process could also be utilised, whereby the number of topics is learnt automatically from the dataset. Social media provides access to an endless amount of information about a vast range of topics, and the system can be scaled up to detect large-scale events such as the COVID-19 pandemic.
REFERENCES
Askari, A., d’Aspremont, A., and Ghaoui, L. E. (2020).
Naive Feature Selection: Sparsity in Naive Bayes.
ArXiv, abs/1905.09884.
Dabiri, S. (2018). Tweets with traffic-related labels for developing a Twitter-based traffic information system. Mendeley Data, V1. doi: 10.17632/c3xvj5snvv.1 [Last accessed 30/06/2022].
Gu, Y., Qian, Z., and Chen, F. (2016). From Twitter to detector: Real-time traffic incident detection using social media data. Transportation Research Part C: Emerging Technologies, 67, 321-342.
Hornik, K. (2021). Stopwords. Available online at: https://rdrr.io/rforge/tm/man/stopwords.html [Last accessed 30/06/2022].
IBM Cloud Education (2021). REST APIs. Available online at: https://www.ibm.com/cloud/learn/rest-apis [Last accessed 30/06/2022].
Kononenko, I. (1991). Semi-Naive Bayesian Classifier. Proceedings of the 5th European Working Session on Learning (EWSL-91).
Li, R., Lei, K., Khadiwala, R., and Chang, K. (2012). TEDAS: A Twitter-based Event Detection and Analysis System. IEEE 28th International Conference on Data Engineering.
Raschka, S. (2014). Naive Bayes and Text Classification I - Introduction and Theory. CoRR, abs/1410.5329.
Rish, I. (2001). An empirical study of the naive Bayes classifier. Available online at: https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.330.2788 [Last accessed 30/06/2022].
Russell, S. and Norvig, P. (1995). Artificial intelligence.
Englewood Cliffs, N.J.: Prentice Hall.
Schulz, A., Ristoski, P., and Paulheim, H. (2013). I See a Car Crash: Real-Time Detection of Small Scale Incidents in Microblogs. The Semantic Web: ESWC 2013 Satellite Events, 22-33.
Senekane, M. and Taele, B. (2016). Prediction of Solar Irradiation Using Quantum Support Vector Machine Learning Algorithm. Smart Grid and Renewable Energy, 7(12), 293-301.
The R Foundation (2022). R: The R Project for Statistical Computing. Available online at: https://www.r-project.org/ [Last accessed 30/06/2022].
Zheng, F. and Webb, G. (2017). Semi-naive Bayesian
Learning. Encyclopedia of Machine Learning and
Data Mining, 1137-1142.
Zrigui, M., Ayadi, R., Mars, M., and Maraoui, M. (2012). Arabic Text Classification Framework Based on Latent Dirichlet Allocation. Journal of Computing and Information Technology, 20(2).