Real-Time Data Harvesting Method for Czech Twitter

Pavel Král 1,2 and Václav Rajtmajer

1 Dept. of Computer Science & Engineering, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic

2 NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, University of West Bohemia, Plzeň, Czech Republic

Keywords: Czech, Data, Harvesting, Social Media, Twitter.

Abstract: This paper deals with automatic analysis of Czech social media. The main goal is to propose an approach for harvesting interesting messages from Twitter in the Czech language at a high download speed. The method uses user lists to discover potentially interesting tweets to download. It is motivated by the observation that only about 20% of Twitter users post informative messages, whereas the remaining 80% do not, and that the “important” users can be identified through user lists. The experimental results show that the proposed method is very efficient: it harvests about 6 times more data than the other approaches. The approach is intended to be integrated into an experimental system for the Czech News Agency that monitors the current data flow on Twitter, downloads messages in real time, analyzes them and extracts relevant events.

1 INTRODUCTION

Social media are virtual computer networks that allow individuals, companies, and other organizations to create, share, view and analyze information, mainly in the form of short messages. The importance and size of today's social media are growing very rapidly, which creates a need for automatic processing methods.

Twitter is a social network which uses very short messages limited to 140 characters. They are posted on-line as status updates, so-called tweets. Tweets can be accompanied by photos, videos, geolocation, links to other users (words preceded by the sign @) and trending topics (words preceded by the sign #). A posted tweet can be liked, commented on by other tweets, or redistributed by other users by forwarding, a so-called retweet. Due to its simplicity and easy access, Twitter covers a very wide range of topics, from common everyday conversations through sports news to news about ongoing disasters such as earthquakes, floods or typhoons. Twitter is without doubt a very interesting source of on-line information which can be used for further analysis and data mining. In this work, we focus on Twitter because of its large size, the significant amount of existing work on this network, and particularly because a number of Twitter users post interesting news on various topics in real time.

We would like to use Twitter for automatic real-time event detection because it would help many journals and news agencies to discover new interesting information very quickly. In particular, the Czech News Agency (ČTK) requires a system to automatically harvest data from Czech Twitter and to discover potential events. Several definitions of events exist; we use the definition from the Cambridge Dictionary, in which an event is defined as “anything that happens, especially something important and unusual”.

The first main task of this system consists in analyzing the Twitter stream and harvesting the appropriate tweets in the Czech language in real time. The second important task is the subsequent analysis of the downloaded data in order to discover new events. The main goal of this paper is to propose and implement a novel method to solve the first task. Note that the activity of Czech Twitter users is significantly lower than that of users of other languages, which is particularly evident in comparison with English or French. Therefore, the common methods provided by the Twitter API do not deliver enough data, and a novel method is necessary. The core of the proposed method consists in using user lists to download a sufficient number of Czech tweets in real time.

ČTK: http://www.ctk.eu/

Cambridge Dictionary, entry “event”: http://dictionary.cambridge.org/dictionary/british/event

Král P. and Rajtmajer V.
Real-Time Data Harvesting Method for Czech Twitter.
DOI: 10.5220/0006212402590265
In Proceedings of the 9th International Conference on Agents and Artificial Intelligence (ICAART 2017), pages 259-265
ISBN: 978-989-758-220-2


The rest of the paper is organized as follows. Section 2 is a short review of Twitter analysis methods. Section 3 presents the architecture of the whole event detection system. Section 4 describes the individual components of this system; the proposed method for high-speed harvesting of tweets in the Czech language is presented in Section 4.1.3. Section 5 deals with the results of our experiments. In the last section, we summarize the experimental results and propose some future research directions.

2 SHORT REVIEW OF TWITTER ANALYSIS METHODS

Numerous studies have investigated Twitter because it offers many possibilities for data processing and analysis. This social network can be used as a data source for sentiment analysis and opinion mining, as shown for example in (Pak and Paroubek, 2010). The authors collected a sentiment analysis corpus from Twitter and built an efficient sentiment classifier on this data. Another work dealing with sentiment analysis on Twitter is proposed in (Kouloumpis et al., 2011), where the authors show the importance of linguistic features for this task.

Twitter data can also be used for sociological surveys, as shown for instance in (Yardi and Boyd, 2010). The authors analyzed group polarization using data collected from dynamic debates. Another study analyzes the Twitter community (Java et al., 2009) to discover user activities and presents a taxonomy characterizing the underlying intentions of the users.

Twitter can also be used successfully for event detection, as presented for instance in (Sakaki et al., 2010; Earle et al., 2012). These approaches are generally based on capturing the presence of, or an increase in, particular keywords. For instance, an increase in the frequency of the words “earthquake” or “typhoon” is used for disaster detection.

Some more sophisticated Twitter analysis approaches have also been proposed, for instance in (Li et al., 2012). The authors propose a system called Twevent, which first detects event segments and then clusters them, considering both their frequency distribution and content similarity, to discover events. Wikipedia is used as a knowledge base to derive the most interesting segments for describing the identified events and to discover realistic events. The main advantage of this system over the previous ones is that it is domain independent and can therefore identify all event types. Further event detection techniques on Twitter are described in the survey (Atefeh and Khreich, 2015).

Figure 1: System architecture.

Twitter analysis methods focus particularly on English (sometimes also on French or Chinese), and relatively few works are oriented toward other languages. The Twitter activity of users in these major languages is very high, so the common harvesting methods provided by the Twitter API yield a sufficient amount of data for further analysis. We assume that this explains why, to the best of our knowledge, no dedicated Twitter harvesting method exists. Therefore, we evaluate our proposed harvesting method against the standard ones provided by Twitter.

It is also worth noting that, to our knowledge, no other study on automatic event detection on Czech Twitter exists.

3 SYSTEM DESCRIPTION

In order to present the whole problem, we first describe the general architecture of the event detection system and then detail the proposed method for fast harvesting of Twitter data in the Czech language.

The event detection system is composed of three main functional units (Tweet Stream Analysis, Preprocessing and Event Detection), which are further decomposed into six tasks as depicted in Figure 1.

The first task, Data acquisition, harvests appropriate Czech-language data from Twitter on-line at high speed. Then, Spam filtering removes tweets with useless information (so-called “spam”). The third task is Lemmatization, which is used for word normalization. The next step is Non-significant word filtering. While the previous filtering operates at the tweet level, this one operates at the word level and removes non-significant words which could decrease the detection performance. The next step toward discovering events is Clustering: we group together tweets with similar content using a clustering method. The final decision about an event is based on thresholding. The last step, Results representation, presents the detected event to the users in an acceptable form.

All these steps are described in detail below, with a particular focus on data acquisition, which is the main contribution of this paper.

4 METHOD DESCRIPTION

4.1 Data Acquisition

We first summarize the requirements that guide the choice of a data acquisition method:

• working in real time;

• downloading messages in the Czech language;

• harvesting a “sufficient” number of tweets for further processing (from our point of view, this means as many as possible);

• free of charge;

• downloading only informative messages (optional).

In the following text, we analyze the different possibilities that Twitter offers for data harvesting. For each method, we give the maximum download speed defined by the Twitter constraints. Unfortunately, this speed usually does not correspond to the real one, because the activity of the Twitter users is not sufficient to reach these limits.

4.1.1 Search API

This API is a part of the Twitter REST API. It allows queries against the indices of recent or popular tweets and behaves similarly to, but not exactly like, the search feature available in web clients. This API searches against a sampling of “recent” tweets published in the past 7 days, and its maximum download speed is 72,000 tweets/hour. The query can be restricted by several constraints, for instance by a geolocation or by a target language.

Another important property is that this API is focused on relevance and not on completeness. This means that some tweets and users may be missing from the search results. The first approach that we evaluate and compare uses this API and is hereafter referred to as the Search API method.
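The 72,000 tweets/hour ceiling is consistent with a simple windowed rate-limit calculation. The sketch below is only a back-of-the-envelope check: the request quota per window and the page size per request are assumptions about the REST API limits in force at the time and are not stated in this paper.

```python
# Assumed limits (NOT given in the paper): 180 search requests per
# 15-minute window and at most 100 tweets returned per request.
REQUESTS_PER_WINDOW = 180
WINDOW_MINUTES = 15
TWEETS_PER_REQUEST = 100


def max_tweets_per_hour(requests_per_window, window_minutes, tweets_per_request):
    """Upper bound on tweets/hour implied by a windowed request limit."""
    windows_per_hour = 60 // window_minutes
    return requests_per_window * windows_per_hour * tweets_per_request


if __name__ == "__main__":
    # 180 requests * 4 windows * 100 tweets = 72000
    print(max_tweets_per_hour(REQUESTS_PER_WINDOW, WINDOW_MINUTES, TWEETS_PER_REQUEST))
```

As the text notes, this is only a theoretical bound; the rate actually observed depends on how many matching tweets exist.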

4.1.2 Streaming API

This API is intended for monitoring (or processing) tweets in real time. Three different streams with three different connection types exist; however, only the Public stream is suitable for our task. It provides public data from different users about different topics, while the other two (User and Site) cover only the data of specific users.

Regarding the connection type, we can use only the Filter connection, because Sample provides only a sample of all the data, and Firehose, which provides all available data, is not free of charge. As in the case of the Search API, the query can be restricted by several constraints (e.g. geolocation or target language). Unfortunately, the maximum download speed of this method is not specified. The second evaluated approach uses this API and is hereafter referred to as the Filtered Streaming API method.

4.1.3 UserList

The design of this method is motivated by the following two facts:

• our preliminary studies have shown that the methods provided by the Twitter API are not very suitable for our task;

• about 20% of Twitter users post informative tweets, whereas the remaining 80% do not (Naaman et al., 2010).

UserList is a Twitter feature that allows each user to create 20 lists, each storing up to 5,000 users. These lists can be used to show all tweets posted by their members, and in combination with the Twitter API this gives access to all data published by up to 100,000 (20 × 5,000) selected users.

The proposed method uses these lists to acquire a significant number of tweets in a given language (Czech in our case; however, the method is general enough to handle other languages). The downloaded messages should contain information that is valuable for data mining and further analysis, for instance potential events.

The remaining issue is how to select representative users in order to capture appropriate tweets. Our system is designed for general event detection. Therefore, it must cover all Twitter topics through active authors from all fields. We start from a small sample of interesting people provided by the Czech News Agency, and this sample is automatically extended by our algorithm.


The algorithm that completes the UserList is based on the following assumptions:

• we already have a representative group of users (the sample provided by ČTK);

• this set covers a representative part of our domain of interest;

• their followers are likely to be users with similar interests.

Therefore, we use the Twitter API to obtain detailed information about all the followers of our initial group. Then, we filter out all foreign (non-Czech) users and repeat the first step with the remaining followers. The algorithm stops when the requested number of users has been explored.
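The expansion loop described above can be sketched as a breadth-first traversal of the follower graph. This is a minimal illustration under stated assumptions, not the implementation used in the system: `fetch_followers` and `is_czech` are hypothetical callables standing in for the API client and the language filter, and the breadth-first visiting order is our own choice.

```python
from collections import deque


def expand_user_set(seed_users, fetch_followers, is_czech, max_users):
    """Breadth-first expansion of a seed user set through follower links.

    fetch_followers(user) -> iterable of follower ids (hypothetical wrapper
    around whatever API client is used).
    is_czech(user) -> bool (hypothetical filter for non-Czech users).
    Stops once max_users users have been explored.
    """
    seeds = list(seed_users)
    seen = set(seeds)          # users already queued or explored
    queue = deque(seeds)       # users whose followers still need fetching
    explored = []
    while queue and len(explored) < max_users:
        user = queue.popleft()
        explored.append(user)
        for follower in fetch_followers(user):
            if follower not in seen and is_czech(follower):
                seen.add(follower)
                queue.append(follower)
    return explored


if __name__ == "__main__":
    # Toy follower graph; ids starting with "x_" play the role of foreign users.
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["x_foreign"], "d": []}
    users = expand_user_set(
        ["a"],
        fetch_followers=lambda u: graph.get(u, []),
        is_czech=lambda u: not u.startswith("x_"),
        max_users=10,
    )
    print(users)
```

In a real deployment the follower lookups would be subject to API rate limits, so the traversal would have to be throttled or spread over time.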

For every user u, we then compute a rank R based on the user's number of followers F_n and number of submitted tweets T_n as follows:

$$R = w \cdot F_n + (1 - w) \cdot T_n \qquad (1)$$

where w is the weight that balances the two criteria; it was set experimentally to 0.5.

Our list is sorted by this rank and the “best” 100,000 users are added to our Twitter lists for further processing.
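As an illustration of the ranking and selection step, the following sketch applies Equation (1) to raw follower and tweet counts, keeps the top-ranked users, and splits them into list-sized chunks. The tuple layout and the function names are assumptions made for this example; whether the counts are normalized before ranking is not specified in the text, so raw counts are used here.

```python
def rank(followers, tweets, w=0.5):
    """Equation (1): R = w * F_n + (1 - w) * T_n, on raw counts (assumption)."""
    return w * followers + (1.0 - w) * tweets


def select_top_users(users, limit=100_000, w=0.5):
    """users: iterable of (user_id, followers_count, tweets_count) tuples.

    Returns the ids of the `limit` highest-ranked users, best first.
    """
    ranked = sorted(users, key=lambda u: rank(u[1], u[2], w), reverse=True)
    return [user_id for user_id, _, _ in ranked[:limit]]


def split_into_lists(user_ids, list_size=5_000, max_lists=20):
    """Chunk the selected users into at most max_lists lists of list_size."""
    chunks = [user_ids[i:i + list_size] for i in range(0, len(user_ids), list_size)]
    return chunks[:max_lists]


if __name__ == "__main__":
    sample = [("a", 10, 0), ("b", 0, 30), ("c", 5, 5)]
    top = select_top_users(sample, limit=2)
    print(top, split_into_lists(top, list_size=1))
```

With w = 0.5 both criteria contribute equally, so a user with many tweets but few followers can outrank a widely followed but inactive one.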

The Twitter “ecosystem” is very dynamic and evolves very quickly. Therefore, this list must be updated periodically to keep it current.

Our proposed method then harvests the data from this representative list of 100,000 users via the Twitter API. This method is hereafter referred to as the UserList method. It is also worth noting that this method is language independent.

4.2 Pre-processing

4.2.1 Spam Filtering

As already stated, this task removes tweets with useless information. These tweets are filtered with a manually defined set of rules (or with a list of entire tweets). Table 1 shows some examples of whole tweets. The rules are based on predefined patterns.

Of course, this simple method does not filter out all useless tweets. However, we assume that the remaining ones will not be detected as events by our detection algorithm because they are too few. Therefore, the current system does not need a more sophisticated filtering algorithm.
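A rule-based filter of this kind can be sketched as a combination of prefix patterns and a list of whole tweets. The patterns below are illustrative only and are modeled on the examples in Table 1; the actual rule set used in the system is not given in the paper.

```python
import re

# Illustrative patterns for automatically created messages (assumption:
# such messages start with a fixed phrase, as in the Table 1 examples).
SPAM_PATTERNS = [
    re.compile(r"^Přidal jsem novou fotku na Facebook", re.IGNORECASE),
    re.compile(r"^Líbí se mi video @YouTube", re.IGNORECASE),
    re.compile(r"^Označil\(-a\) jsem video @YouTube", re.IGNORECASE),
]

# Illustrative list of entire tweets to drop, compared case-insensitively.
SPAM_TWEETS = {"dobré ráno!", "jdu obědvat, dobrou chuť."}


def is_spam(text):
    """Return True if the tweet matches a whole-tweet entry or a pattern."""
    normalized = text.strip()
    if normalized.lower() in SPAM_TWEETS:
        return True
    return any(pattern.search(normalized) for pattern in SPAM_PATTERNS)


def filter_spam(tweets):
    """Keep only the tweets that are not flagged as spam."""
    return [tweet for tweet in tweets if not is_spam(tweet)]


if __name__ == "__main__":
    stream = [
        "Dobré ráno!",
        "Líbí se mi video @YouTube http://example.invalid/v",
        "Vláda schválila státní rozpočet.",
    ]
    print(filter_spam(stream))
```

Keeping the rules as data (two plain collections) makes it easy to extend them manually, which matches the manual definition described above.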

4.2.2 Lemmatization

Lemmatization consists in replacing a particular (inflected) word form by its lemma (base form). It decreases the number of features of the system and is successfully used in many natural language processing tasks. We assume that lemmatization can improve the detection performance of our method. It can be particularly useful in clustering, to group together appropriate words.

Table 1: Examples of tweets to filter.

Tweet | English translation

Automatically created messages
Přidal jsem novou fotku na Facebook. | I have added a new photo on Facebook.
Líbí se mi video @YouTube. | I like @YouTube movie.
Označil(-a) jsem video @YouTube. | I have marked @YouTube movie.

(Everyday) useless tweets created by the users
Dobré ráno! | Good morning!
Jdu obědvat, dobrou chuť. | I'm going to have lunch, enjoy your meal.

Following the definition from the Prague Dependency Treebank (PDT) 2.0 (Zeman et al., 2014) project, we use only the first part of the lemma. This is a unique identifier of the lexical item (e.g. the infinitive for a verb), possibly followed by a digit to disambiguate different lemmas with the same base form. For instance, the Czech word “třeba”, whose lemma is identical to the word form, can mean necessary or for example, depending on the context. In the PDT notation, this is differentiated by two lemmas: “třeba-1” and “třeba-2”. The second part of the lemma, containing additional information such as semantic or derivational information, is not taken into account in this work.
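Keeping only the first part of a lemma amounts to cutting off the additional-information suffix. The sketch below assumes that this suffix starts at the first underscore or backtick after the first character of the lemma string; this delimiter convention and the sample suffix are assumptions made for illustration, not details given in this paper.

```python
import re

# Assumption: the additional-information part begins at the first "_" or "`"
# that is not the very first character of the lemma string.
_SUFFIX_START = re.compile(r"[_`]")


def lemma_proper(pdt_lemma):
    """Return the first part of a lemma, including any "-digit" sense index."""
    match = _SUFFIX_START.search(pdt_lemma, 1)  # start searching at index 1
    return pdt_lemma[:match.start()] if match else pdt_lemma


if __name__ == "__main__":
    # The suffix below is a made-up placeholder, not a real annotation.
    print(lemma_proper("třeba-1_^(illustrative_comment)"))
    print(lemma_proper("třeba-2"))
```

Because the sense index is kept, the two meanings of “třeba” remain distinct features after this step.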

4.2.3 Non-Signiﬁcant Word Filtering

Non-significant words (sometimes also called stop words) are high-frequency words that carry mainly grammatical meaning in a sentence, for instance prepositions or conjunctions. In this version, the filtering is based on a manually defined list. We plan to implement a more sophisticated method based on Part-of-Speech (POS) tags in a future version. However, we assume that this improved removal will play only a marginal role in event detection.
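List-based filtering of non-significant words is a one-line operation once the tweet has been lemmatized. The stop list below is a deliberately tiny, hypothetical sample of Czech prepositions and conjunctions; the manually defined list used in the system is not reproduced in the paper.

```python
# Hypothetical mini stop list (Czech prepositions, conjunctions, particles).
STOP_WORDS = frozenset({"a", "i", "v", "na", "se", "že", "do", "s", "z", "k", "o"})


def remove_stop_words(lemmas, stop_words=STOP_WORDS):
    """Drop lemmas found in the stop list, keeping the original order."""
    return [lemma for lemma in lemmas if lemma.lower() not in stop_words]


if __name__ == "__main__":
    print(remove_stop_words(["vláda", "a", "parlament", "v", "Praha"]))
```

A POS-based variant, as planned for a future version, would replace the membership test with a check on the tag of each token.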

4.3 Event Detection

4.3.1 Clustering

After acquiring the data, we face the problem of extracting events. We use a clustering technique for this purpose. Consider that we receive, in real time, filtered and lemmatized tweets which (thanks to the UserList method) very probably describe events. We transform every tweet into a binary representation using a bag-of-words method, which defines its unique location in an n-dimensional space. The clustering algorithm is then as follows:

1. take an (unprocessed) tweet;

2. calculate the cosine distance between the vector representing this tweet and all the others;

3. choose the closest tweet (or cluster of tweets, if any) and group them together (the maximum allowed distance is given by the threshold Th);

4. repeat the previous operations (go to step 1) until all tweets are processed.
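The four steps above can be sketched as a greedy, single-pass clustering over binary bag-of-words vectors. This is a minimal illustration, not the system's implementation: binary vectors are represented as sets of lemmas, the distance of a tweet to a cluster is taken as the distance to its nearest member, and Th is interpreted as the maximum allowed distance, as stated in step 3. The last two points are assumptions.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two binary bag-of-words vectors given as sets."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def cluster_tweets(tweets, th):
    """Greedy clustering: tweets is a list of lemma lists, th the max distance.

    Each tweet joins the closest existing cluster if its distance to that
    cluster is at most th (distance to a cluster = distance to its nearest
    member, an assumption); otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of frozensets
    for tokens in tweets:
        vector = frozenset(tokens)
        best_cluster, best_distance = None, th
        for cluster in clusters:
            distance = min(cosine_distance(vector, member) for member in cluster)
            if distance <= best_distance:
                best_cluster, best_distance = cluster, distance
        if best_cluster is None:
            clusters.append([vector])
        else:
            best_cluster.append(vector)
    return clusters


if __name__ == "__main__":
    sample = [
        ["vláda", "schválit", "rozpočet"],
        ["vláda", "schválit", "rozpočet", "2016"],
        ["sparta", "vyhrát", "derby"],
    ]
    for cluster in cluster_tweets(sample, th=0.5):
        print(sorted(sorted(member) for member in cluster))
```

For binary vectors the cosine similarity reduces to the number of shared words divided by the geometric mean of the two vocabulary sizes, which is why plain sets suffice here.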

The clusters created by this algorithm represent the events. Of course, the clustering itself does not guarantee that the created clusters represent only events. This should be ensured by the preceding steps:

• the UserList data acquisition method harvests particularly informative tweets, which mainly describe events;

• the Spam filtering step removes several useless tweets (non-events).

We also define a parameter T which indicates the time period for the clustering. We assume that different events spread at different “speeds” (different activity of Twitter users). For instance, information about the winner of a football championship can spread faster (more contributions in a short period) than information about a new director of some company.
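One simple way to apply the time period T is to split the incoming tweets into consecutive windows of T hours and cluster each window separately. The sketch below assumes fixed, non-overlapping windows aligned to the epoch and tweets given as (timestamp in seconds, payload) pairs; the paper does not say how the windows are formed, so both choices are assumptions.

```python
from collections import defaultdict


def bucket_by_period(tweets, period_hours):
    """Group (timestamp_seconds, payload) pairs into windows of period_hours.

    Windows are fixed and non-overlapping (an assumption); the result is a
    list of payload lists ordered by time.
    """
    period_seconds = period_hours * 3600
    buckets = defaultdict(list)
    for timestamp, payload in tweets:
        buckets[int(timestamp // period_seconds)].append(payload)
    return [buckets[index] for index in sorted(buckets)]


if __name__ == "__main__":
    stream = [(0, "tweet a"), (3600, "tweet b"), (7200, "tweet c")]
    print(bucket_by_period(stream, period_hours=2))
```

A shorter T favors fast-spreading events, while a longer T gives slowly developing events a chance to accumulate enough tweets.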

It is worth noting that we also considered using the gradient of the frequencies within event clusters. Unfortunately, this improvement did not work because of the low activity of users on Czech Twitter.

4.3.2 Results Representation

The results of the clustering are thus groups of tweets with some common words. Each group is represented by its most significant tweet, defined as the message with the maximum number of common words and the minimum number of other words. This representation is used because we want to present an answer in natural language instead of a list of keywords or a phrase.
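The selection of the most significant tweet can be sketched as a scoring function over the cluster. The text does not define “common words” precisely, so the example below assumes that a word is common if it occurs in more than one tweet of the cluster, and it scores each tweet as the number of its common words minus the number of its other words; both choices are assumptions.

```python
from collections import Counter


def representative_tweet(cluster):
    """Return the index of the most significant tweet in a cluster.

    cluster: list of lemma lists. A word counts as "common" if it occurs in
    more than one tweet of the cluster (assumption). The score of a tweet is
    (#common words) - (#other words); the highest score wins, ties going to
    the earliest tweet.
    """
    document_frequency = Counter(word for tokens in cluster for word in set(tokens))
    common = {word for word, count in document_frequency.items() if count > 1}

    def score(tokens):
        words = set(tokens)
        shared = len(words & common)
        return shared - (len(words) - shared)

    return max(range(len(cluster)), key=lambda i: score(cluster[i]))


if __name__ == "__main__":
    group = [
        ["vláda", "schválit", "rozpočet", "dnes", "večer"],
        ["vláda", "schválit", "rozpočet"],
        ["vláda", "rozpočet", "kritika", "opozice"],
    ]
    print(representative_tweet(group))
```

The function returns an index rather than the lemma list so that the caller can display the original, unlemmatized text of the chosen tweet.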

5 EXPERIMENTAL RESULTS

This section describes the experiments carried out to evaluate the proposed tweet harvesting method based on user lists. The overall functionality of the proposed event detection system is also evaluated here. This evaluation was done off-line.

5.1 Evaluation of the Data Acquisition

5.1.1 Comparison of the Czech and French Twitter Activity

In the first experiment, we would like to confirm our claim that the activity on Czech Twitter is significantly lower than for other languages. We have chosen French Twitter and the Search API method (see Section 4.1.1) for this comparison.

First, we discovered that it is not possible to use language constraints to obtain only Czech tweets. Unfortunately, a Czech constraint is missing; only the “sk” field is available, which contains Czech and Slovak tweets together.

Therefore, we decided to filter tweets according to geolocation. For each country, we chose a square region covering most of its territory (the Czech Republic and France, respectively) as our area of interest. We analyzed the download rate in the interval from 22 to 29 August 2015. Figure 2 shows the results of this analysis.

Figure 2: Comparison of the Czech and French Twitter activity (download speed in tweets/hour over time, 23 to 29 August; series: Czech Republic, France).

This figure shows that the activity on French Twitter is more than 10 times higher than on Czech Twitter. The average Czech download rate is about 495 tweets/hour. However, after a detailed examination, we found that fewer than 20% of these tweets are written in the Czech language, that is, fewer than about 100 tweets per hour.

Unfortunately, this number is insufficient for successful further analysis, for instance for real-time event detection. Therefore, we must analyze other approaches to data acquisition.

5.1.2 Comparison of the Different Data Acquisition Methods

In this experiment, we compare the download speed of the two standard methods provided by the Twitter API (namely the Search API and Filtered Streaming API methods, see Sections 4.1.1 and 4.1.2, respectively) with the proposed UserList approach (see Section 4.1.3). We executed all these methods over the same two-day period and then calculated the average value per hour.

Figure 3: Event detection example (time period T = 2h and acceptance threshold Th = 0.5). The rectangle on the right contains six tweets that were saved by our acquisition method. The left “bubbles” show the results of our clustering (two groups containing three and two tweets). The representative tweets are chosen (marked by the bold text on the left side) to be presented to the user.

Table 2: Comparison of the download speed of the different methods on the Czech Twitter.

Method | Tweets / hour
Search API | 43.5
Filtered Streaming API | 56.6
UserList (proposed) | 330.3

The results of this experiment are shown in Table 2. The table shows that the proposed method provides about 6 times more data than the Filtered Streaming API method and about 7.6 times more than the Search API method. Based on these results, we have chosen the UserList approach for integration into our event detection system.
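The gain over each baseline follows directly from the rates in Table 2. The short calculation below only restates that arithmetic; the dictionary keys and the function name are chosen for this example.

```python
# Average download rates from Table 2, in tweets per hour.
RATES = {
    "Search API": 43.5,
    "Filtered Streaming API": 56.6,
    "UserList (proposed)": 330.3,
}


def speedup(rates, proposed="UserList (proposed)"):
    """Ratio of the proposed method's rate to each baseline, to one decimal."""
    base = rates[proposed]
    return {name: round(base / rate, 1) for name, rate in rates.items() if name != proposed}


if __name__ == "__main__":
    # 330.3 / 43.5 is about 7.6 and 330.3 / 56.6 is about 5.8
    print(speedup(RATES))
```

The factor of about 6 quoted in the text therefore corresponds to the stronger of the two baselines, the Filtered Streaming API.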

5.2 Event Detection

We used 15,856 tweets downloaded by the UserList approach to evaluate the detection performance of our system. We executed the event detection algorithm with different values of the acceptance threshold (Th ∈ [0; 1]) and analyzed the results. The analysis of the resulting clusters has shown that with Th > 0.5 the algorithm still detects the majority of events correctly (high precision). However, our main interest is to have the recall as high as possible. The precision is less important, because incorrectly detected events can be filtered out manually. Therefore, we set a slightly lower acceptance threshold in our system, which leads to the detection of more events at the cost of some false positives.

These preliminary results were shown to and discussed with our client, who is ready to test this experimental version of the system. The current version should already help reporters reduce the work they spend on manually checking the available data sources.

One sample of the results is depicted in Figure 3. This figure shows six tweets saved by our acquisition method (right rectangle). They are then clustered into two groups containing three and two tweets (left “bubbles”). Finally, one representative tweet is chosen from each cluster to be presented to the user (bold text on the left).

6 CONCLUSIONS AND PERSPECTIVES

The main goal of this paper was to propose an approach for harvesting messages from Twitter in the Czech language at a high download speed. The proposed method uses user lists to discover potentially interesting tweets to harvest. We have experimentally shown that the proposed method is very efficient because it harvests about 6 times more data than the two other approaches provided by the Twitter API. This method will be integrated into our event detection system. We have also experimentally shown that the results of the event detection are promising because the algorithm detects a significant number of potential events.

The proposed harvesting method is language independent. Therefore, the first perspective consists in evaluating this method on other (particularly European) languages. Another perspective is a thorough evaluation of the event detection method. We would also like to improve this method using more sophisticated semantic similarity functions. A further perspective is the adaptation of the whole detection system to other languages and its evaluation on them.

ACKNOWLEDGEMENTS

This work has been partly supported by the project

LO1506 of the Czech Ministry of Education, Youth

and Sports and by Grant No. SGS-2016-018 Data and

Software Engineering for Advanced Applications.

REFERENCES

Atefeh, F. and Khreich, W. (2015). A survey of techniques for event detection in Twitter. Computational Intelligence, 31(1):132–164.

Earle, P. S., Bowden, D. C., and Guy, M. (2012). Twitter earthquake detection: earthquake monitoring in a social world. Annals of Geophysics, 54(6).

Java, A., Song, X., Finin, T., and Tseng, B. (2009). Why we Twitter: An analysis of a microblogging community. In Advances in Web Mining and Web Usage Analysis, pages 118–138. Springer.

Kouloumpis, E., Wilson, T., and Moore, J. D. (2011). Twitter sentiment analysis: The good the bad and the omg! ICWSM, 11:538–541.

Li, C., Sun, A., and Datta, A. (2012). Twevent: segment-based event detection from tweets. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 155–164. ACM.

Naaman, M., Boase, J., and Lai, C.-H. (2010). Is it really about me?: message content in social awareness streams. In Proceedings of the 2010 ACM conference on Computer supported cooperative work, pages 189–192. ACM.

Pak, A. and Paroubek, P. (2010). Twitter as a corpus for sentiment analysis and opinion mining. In LREC, volume 10, pages 1320–1326.

Sakaki, T., Okazaki, M., and Matsuo, Y. (2010). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web, pages 851–860. ACM.

Yardi, S. and Boyd, D. (2010). Dynamic debates: An analysis of group polarization over time on Twitter. Bulletin of Science, Technology & Society, 30(5):316–327.

Zeman, D., Dušek, O., Mareček, D., Popel, M., Ramasamy, L., Štěpánek, J., Žabokrtský, Z., and Hajič, J. (2014). HamleDT: Harmonized multi-language dependency treebank. Language Resources and Evaluation, 48(4):601–637.
