randomness in the location and time of such events
(Gu et al., 2016). Moreover, social networking sites
tend to have a very large user base and allow users to share both images and videos, generating an endless amount of data daily. This makes online traffic event detection a very cost-effective and efficient technique relative to traditional methods (Gu et al., 2016). Various works in the literature make use of social media messages to detect traffic events, such as those of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012). Gu et al. (2016) developed a classifier based on a Semi-Naïve Bayes (SNB) model to filter out 'non-traffic-related' tweets. The remaining 'traffic-related' tweets are then analysed and further classified into 'traffic-related' sub-categories using a supervised Latent Dirichlet Allocation (sLDA) algorithm.
Schulz et al. (2013) developed classifiers to detect small-scale car accidents reported on Twitter. Among the classifiers developed, some are based on the Naïve Bayes Binary (NBB) model and the Support Vector Machine (SVM). Li et al. (2012) proposed TEDAS, a system capable of retrieving, pre-processing, classifying, and geoparsing 'traffic-related' tweets to extract both the nature of the traffic events and their associated geographic information. This system is based on a set of
rules to analyse the tweets. Similar to the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012), the aim of this work is to develop a traffic-based information system that relies on analysing the content of social media data from Twitter. Differently from these works, an adaptive data acquisition approach is developed, in which a rule 'r' is chosen if it is found within a specific percentage of all newly and previously classified 'traffic-related' tweets. Furthermore, preprocessing is carried out as shown in Table 1, which also summarizes the differences between this work and the works of Gu et al. (2016), Schulz et al. (2013) and Li et al. (2012).
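The rule-selection criterion described above (a rule 'r' is kept if it occurs in a specific percentage of all classified 'traffic-related' tweets) can be sketched as follows. This is a minimal illustration only: the function name, sample tweets, and the 25% threshold are assumptions for the example, not values from this work.

```python
def select_rules(candidate_rules, traffic_tweets, min_fraction=0.25):
    """Keep a candidate rule only if it appears in at least `min_fraction`
    of all tweets classified as 'traffic-related'.
    The threshold value is illustrative, not taken from the paper."""
    n = len(traffic_tweets)
    selected = []
    for rule in candidate_rules:
        hits = sum(1 for t in traffic_tweets if rule in t.lower())
        if n and hits / n >= min_fraction:
            selected.append(rule)
    return selected

# Invented sample of already-classified 'traffic-related' tweets.
tweets = [
    "heavy traffic on the m1 after an accident",
    "accident blocking lane two, expect delays",
    "roadworks causing a traffic jam near exit 5",
    "traffic moving slowly due to roadworks",
]
print(select_rules(["accident", "roadworks", "flood"], tweets))
# → ['accident', 'roadworks']
```

Rules that occur too rarely among classified tweets (here, 'flood') are discarded, which keeps the acquisition dictionary focused on terms that reliably co-occur with traffic reports.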
Tweets are classified as either ‘traffic-related’ or
‘non-traffic-related’. Unlike the works of Gu et al.
(2016); Schulz et al. (2013); Li et al. (2012), where
only one or two classifiers were developed, in this
work, four supervised binary classification algorithms
are developed with the aim to analyse their perfor-
mance in the Results Section. Classifiers based on
the Multinomial Na
¨
ıve Bayes model (MNB), the SNB
model, the Multivariate Bernoulli Na
¨
ıve Bayes model
(MVBNB) and the SVM are developed. ‘Traffic-
related’ tweets are analysed and further classified
into ‘traffic-related’ sub-categories using a sLDA
algorithm. The sub-categories are namely: ‘ac-
cidents’, ‘incidents’, ‘traffic jams’, and ‘construc-
tion/road works’. The performance of the classifiers
of Gu et al. (2016); Schulz et al. (2013); Li et al.
(2012) are compared to the classifiers developed in
this work as detailed in Section 3. The date, time,
and the geographical information of each associated
traffic event are also determined. The proposed traffic-based information system is described in Section 2 of the paper. Section 3 presents the results of the proposed system, followed by conclusions and possible future work in Section 4.
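To make one of these binary classification models concrete, the following is a minimal from-scratch sketch of a Multinomial Naïve Bayes text classifier with Laplace smoothing. The toy tweets, labels, and function names are invented for illustration; this is not the authors' R implementation or their data.

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Fit a Multinomial Naive Bayes model with Laplace (add-alpha) smoothing."""
    vocab = {w for d in docs for w in d.split()}
    classes = set(labels)
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocab)))
            for w in vocab
        }
    return {"log_prior": log_prior, "log_lik": log_lik, "vocab": vocab}

def predict_mnb(model, doc):
    """Return the class with the highest posterior log-probability."""
    scores = {}
    for c, prior in model["log_prior"].items():
        scores[c] = prior + sum(
            model["log_lik"][c][w] for w in doc.split() if w in model["vocab"]
        )
    return max(scores, key=scores.get)

# Toy corpus: invented 'traffic-related' vs 'non-traffic-related' tweets.
docs = [
    "accident on highway causing long delays",
    "traffic jam near the city centre",
    "road works closing two lanes",
    "collision causing heavy congestion",
    "great concert last night in town",
    "new pasta recipe for dinner",
    "my team won the match today",
    "lovely sunset at the beach",
]
labels = ["traffic", "traffic", "traffic", "traffic",
          "non-traffic", "non-traffic", "non-traffic", "non-traffic"]

model = train_mnb(docs, labels)
print(predict_mnb(model, "accident causing delays on the road"))  # → traffic
print(predict_mnb(model, "pasta recipe for the party"))           # → non-traffic
```

The MVBNB variant differs only in using binary word-presence features rather than word counts, and the SVM instead learns a separating hyperplane over the same bag-of-words vectors.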
2 METHODOLOGY
The stages involved in the developed system, shown in Figure 1, are described in this section. All stages are implemented in the R programming language, which provides a vast number of analysis tools and access to many useful off-the-shelf packages (The R Foundation, 2022).
Figure 1: Developed system stages.
2.1 Data Acquisition
An adaptive data acquisition approach is developed to ensure that the maximum number of good-quality 'traffic-related' tweets is gathered (Gu et al., 2016). All gathered tweets are in English. An adaptive 'traffic-related' keyword dictionary is formed to filter the Twitter stream sessions.
the REST API is used (IBM Cloud Education, 2021). An
initial keyword dictionary is generated using a uni-
gram, DF (document frequency) based BOW (bag of
words) model. Based on a predefined threshold, DF-based filtration is applied to extract the initial keywords. To make the data acquisition adaptive, new 'traffic-related' keywords are generated and appended to the initial dictionary by repeating the same procedure on streamed tweets that are newly classified as 'traffic-related'. As a result, the algorithm is capable of expanding its initial dictionary to adapt to the language of newly streamed tweets.
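The dictionary-building and expansion steps above can be sketched as follows. This is a minimal illustration: the stop-word list, the threshold of 2, and the sample tweets are assumptions for the example, not the settings used in this work.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = frozenset({"the", "a", "an", "on", "in", "to", "is", "and", "for"})

def df_keywords(tweets, min_df=2):
    """Unigram, document-frequency (DF) based bag-of-words keyword extraction:
    keep tokens whose DF meets a predefined threshold."""
    df = Counter()
    for t in tweets:
        tokens = set(re.findall(r"[a-z0-9']+", t.lower())) - STOP_WORDS
        df.update(tokens)
    return {w for w, c in df.items() if c >= min_df}

# Initial dictionary built from a small seed of known 'traffic-related' tweets.
seed = [
    "accident on the m25 causing delays",
    "long delays after an accident near junction 4",
    "traffic jam on the ring road",
]
dictionary = df_keywords(seed, min_df=2)
print(sorted(dictionary))  # → ['accident', 'delays']

# Adaptive step: expand the dictionary with keywords mined from newly
# streamed tweets that were classified as 'traffic-related'.
new_traffic = [
    "roadworks causing a traffic jam on the bypass",
    "roadworks and a jam near the bypass exit",
]
dictionary |= df_keywords(new_traffic, min_df=2)
print(sorted(dictionary))  # → ['accident', 'bypass', 'delays', 'jam', 'roadworks']
```

Repeating the same DF-based extraction on newly classified tweets lets the dictionary absorb vocabulary (here, 'roadworks', 'jam', 'bypass') that was absent from the initial seed.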
For ease of implementation, streaming sessions are initiated through the rtweet R package. In particular, the stream_tweets function provides an interface with a large range of input arguments, making the streaming of tweets very simple; however, it is limited to filtering tweets based upon only one type of query at a time, be it location, keywords, or user IDs.
For further analysis, parsing is applied to convert the streamed tweets, stored in a JSON file, into an R object via the parse_stream function, found also in the
Traffic Data Analysis from Social Media