which is used for word normalization. The next step is
Non-significant word filtering. While the previous filtering
operated at the tweet level, this one operates at the word level
and removes non-significant words that could decrease
detection performance. The next step in discovering events is
Clustering: we group together tweets with similar content
using a clustering method. The final decision about an event
is based on thresholding. The last step, Results representation,
presents the detected events to the users in an acceptable form.
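The last three processing steps can be illustrated with a minimal sketch. It is not the implementation used in this paper: the stop list, the greedy Jaccard-based grouping, and the parameters `min_sim` and `min_size` are all assumptions chosen only to make the idea concrete.

```python
from typing import List, Set, Tuple

# Hypothetical stop list; the actual list of non-significant words is not given here.
STOP_WORDS: Set[str] = {"a", "an", "the", "is", "in", "of", "and"}


def significant_words(tweet: str) -> Set[str]:
    """Word-level filtering: lowercase the tweet and drop non-significant words."""
    return {w for w in tweet.lower().split() if w not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(tweets: List[str], min_sim: float = 0.5) -> List[List[str]]:
    """Greedy clustering: a tweet joins the first cluster whose first member
    is similar enough; otherwise it starts a new cluster (assumed method)."""
    clusters: List[Tuple[Set[str], List[str]]] = []
    for tweet in tweets:
        words = significant_words(tweet)
        for rep_words, members in clusters:
            if jaccard(words, rep_words) >= min_sim:
                members.append(tweet)
                break
        else:
            clusters.append((words, [tweet]))
    return [members for _, members in clusters]


def detect_events(tweets: List[str], min_size: int = 3,
                  min_sim: float = 0.5) -> List[List[str]]:
    """Thresholding: keep only clusters with at least `min_size` tweets."""
    return [c for c in cluster(tweets, min_sim) if len(c) >= min_size]
```

Calling `detect_events` on a list of tweet texts returns the groups large enough to be reported as candidate events; a real system would replace both the similarity measure and the threshold with the ones described later in the paper.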
All these steps are described below in detail, with
a particular focus on data acquisition, which is
the main contribution of this paper.
4 METHOD DESCRIPTION
4.1 Data Acquisition
We first summarize our requirements for choosing an
optimal data acquisition method:
• working in real-time;
• downloading messages in the Czech language;
• harvesting a “sufficient” number of tweets for
further processing (from our point of view, this
means as many as possible);
• being free of charge;
• downloading only informative messages (optional).
In the following text, we analyze the different possibilities
that Twitter offers for data harvesting. For each method, we
give the maximum download speed defined by the Twitter
constraints. Unfortunately, this speed usually does not
correspond to the real one, because the activity of Twitter
users is not sufficient to reach these limits.
4.1.1 Search API
This API is a part of the Twitter REST API. It allows
queries against the indices of recent or popular tweets
and behaves similarly to, but not exactly like, the
search feature available in web clients. This API
searches against a sampling of “recent” tweets published
in the past 7 days, and its maximum download
speed is 72,000 tweets/hour. The query can be restricted
by several constraints, for instance by geolocation
or by target language.
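The stated ceiling can be reproduced with simple arithmetic. The sketch below assumes a rate limit of 180 requests per 15-minute window and at most 100 tweets per request; these two values are not given in the text and are used only because they are consistent with the figure of 72,000 tweets/hour.

```python
# Assumed limits (not stated in the text); chosen to match 72,000 tweets/hour.
REQUESTS_PER_WINDOW = 180
TWEETS_PER_REQUEST = 100
WINDOW_MINUTES = 15


def max_tweets_per_hour(requests_per_window: int = REQUESTS_PER_WINDOW,
                        tweets_per_request: int = TWEETS_PER_REQUEST,
                        window_minutes: int = WINDOW_MINUTES) -> int:
    """Upper bound on tweets per hour under a windowed rate limit."""
    windows_per_hour = 60 // window_minutes
    return requests_per_window * tweets_per_request * windows_per_hour


def seconds_between_requests(requests_per_window: int = REQUESTS_PER_WINDOW,
                             window_minutes: int = WINDOW_MINUTES) -> float:
    """Even spacing of requests that never exceeds the window limit."""
    return window_minutes * 60 / requests_per_window


print(max_tweets_per_hour())       # 180 * 100 * 4 = 72000
print(seconds_between_requests())  # 900 / 180 = 5.0
```

Under these assumptions, a harvester would issue one request every 5 seconds; the ceiling is reached only if every request returns a full page of results.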
Another important property is that this API is focused
on relevance and not on completeness. This
means that some tweets and users may be missing
from the search results. The first approach, which
is further evaluated and compared, uses this API and is
hereafter referred to as the Search API method.
4.1.2 Streaming API
This API is intended for monitoring (or processing) tweets in
real-time. Three different streams with three different
connection types exist; however, only the Public stream
is suitable for our task. It provides public
data from different users about different topics, while
the other two (User and Site) cover only the data
from specific users.
Regarding the connection type, we can use only the
Filter connection, because Sample provides only a
sample of all the data, and Firehose, which
provides all available data, is not free of charge. The
query can be, as in the case of the Search API, restricted
by several constraints (e.g. geolocation or target
language). Unfortunately, the maximum download speed
of this method is not given. The second evaluated
approach uses this API and is hereafter referred to
as the Filtered Streaming API method.
4.1.3 UserList
The design of this method is motivated by the following
two facts:
• our preliminary studies have shown that the methods
provided by the Twitter API are not very suitable
for our task;
• about 20% of Twitter users post informative
tweets, whereas the remaining 80% do not
(Naaman et al., 2010).
UserList is a Twitter feature that allows each user
to create up to 20 lists, each storing up to 5,000
users. These lists can be used to show all
tweets that their members have posted, and this procedure
can be used with the Twitter API to get all data published
by up to 100,000 (20 × 5,000) particular users.
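The capacity constraint can be made concrete with a short sketch that splits a set of user identifiers into lists respecting the two limits above. The function name and the use of integer identifiers are illustrative assumptions; creating the lists through the Twitter API is not shown.

```python
from typing import List

MAX_LISTS = 20               # lists per account, as stated above
MAX_MEMBERS_PER_LIST = 5000  # users per list, as stated above


def partition_users(user_ids: List[int],
                    max_lists: int = MAX_LISTS,
                    max_members: int = MAX_MEMBERS_PER_LIST) -> List[List[int]]:
    """Split user ids into consecutive chunks of at most `max_members`,
    refusing inputs that would need more than `max_lists` lists."""
    capacity = max_lists * max_members  # 20 * 5,000 = 100,000
    if len(user_ids) > capacity:
        raise ValueError(
            f"{len(user_ids)} users exceed the capacity of {capacity}")
    return [user_ids[i:i + max_members]
            for i in range(0, len(user_ids), max_members)]


# Example: 12,000 users need three lists of sizes 5,000, 5,000 and 2,000.
lists = partition_users(list(range(12_000)))
print([len(chunk) for chunk in lists])
```

Filling the lists in order keeps the number of lists minimal, which leaves room for users added later.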
The proposed method uses these lists to acquire a
significant number of tweets in a given language
(Czech in our case; however, the method is
general enough to handle other languages). The
downloaded messages should contain valuable information
for data mining and further analysis, for instance
potential events.
The issue is now to select representative users
in order to obtain appropriate tweets. Our system is
designed for general event detection. Therefore, it
must cover all Twitter topics through active authors
from all fields. We use a small sample of interesting
people provided by the Czech News Agency, and this
sample is automatically extended by our algorithm.
Real-Time Data Harvesting Method for Czech Twitter