Examining the Intra-Location Differences Among Twitter Samples

Rositsa V. Ivanova (1, a), Ema Kušen (2, b) and Stefan Sobernig (1, c)

(1) Institute of Information Systems and New Media, Vienna University of Economics and Business, Vienna, Austria

(2) Faculty of Informatics, University of Vienna, Vienna, Austria

Keywords:

Data Analysis, Data Collection, Data Quality, Network Science, Social Networks, Twitter.

Abstract:

In this paper, we explore Twitter data samples collected from ﬁve different geographical locations. For each

of these geographical locations, we compare variations occurring within samples collected simultaneously

from two different machines running Twitter API clients. In addition, we split the collected data samples into

“complete” and “incomplete” datasets. An incomplete dataset is a collection of Twitter messages where at

least one machine received a smaller data sample due to some interruption. A complete dataset is one that

includes all tweets that Twitter’s API delivers for a particular set of search parameters. Our ﬁndings indicate

that 86% of the complete samples show some variations in the attribute values attached to extracted tweets.

While the complete datasets show comparable attribute values and network characteristics, the incomplete

data samples exhibit substantial differences. We arrive at recommendations for researchers on Online Social

Networks on how to mine Twitter data while mitigating these risks.

1 INTRODUCTION

In the search for representative and freely acces-

sible data on Online Social Networks (OSN), re-

searchers frequently rely on datasets extracted from

Twitter. Tweets (Twitter messages) have been uti-

lized in various research ﬁelds from applied network

science to medicine (Grinberg et al., 2019; Morone

and Makse, 2015; Broniatowski et al., 2018; Boutyline and Willer, 2017; Kušen and Strembeck, 2021).

Twitter’s well-documented and publicly available

application programming interface (API) grants re-

searchers automated access to large datasets, only

requiring an existing Twitter account. As a major

downside, Twitter data is only made available to re-

searchers free of charge as a blackbox sample. Access

types with fewer limitations (e.g., higher monthly rate

limits, access to a full archive) are offered by Twit-

ter either via commercial or special purpose accounts

(e.g., for academic research). While these types of

accounts are used in business and academic settings,

they are either paid for or granted upon request after

fulﬁlling speciﬁc criteria. In many cases, researchers

may not be able to obtain an academic account or cannot afford paying to lift the paywall. This leaves researchers with free-of-charge API access to samples of Twitter data.

(a) https://orcid.org/0000-0002-4149-5017
(b) https://orcid.org/0000-0003-1145-6778
(c) https://orcid.org/0009-0002-5018-7961

Despite Twitter’s popularity as a data source, there

is still little awareness among OSN researchers of

how this sampled data from Twitter may affect their

research and how sampling has to be accounted for

in their research designs. Regarding representative-

ness, Twitter data samples were found to potentially

under- or over-represent certain user accounts (Pfef-

fer et al., 2018). Dataset sizes and the user accounts

contained therein vary substantially based on the ac-

cess types (Kim et al., 2020). Furthermore, Wang

et al. (2015) found that the various access types re-

sult in different user-activity patterns and sentiments.

Similarly, Morstatter et al. (2013) compared differ-

ent Twitter API endpoints and discovered that Twit-

ter data obtained via the free-of-charge API performs

worse than other access types in terms of reﬂecting

the statistical properties of Twitter activity. When

comparing data samples collected using popular and

non-popular search terms, Campan et al. (2018) re-

vealed that only unpopular search terms lead to un-

biased samples and that, otherwise, samples cannot

be considered random. Recently, Pfeffer et al. (2022)

found that approximately 10% of tweets are deleted in

the short term and up to 30% over the period of four

years. Moreover, Timoneda (2018) found that even in the short term, 20-30% of tweets with strong political content could not be recovered via the free-of-charge API.

Ivanova, R., Kušen, E. and Sobernig, S.
Examining the Intra-Location Differences Among Twitter Samples.
DOI: 10.5220/0011990600003485
In Proceedings of the 8th International Conference on Complexity, Future Information Systems and Risk (COMPLEXIS 2023), pages 94-101
ISBN: 978-989-758-644-6; ISSN: 2184-5034
© 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

In Ivanova et al. (2022), we investigated the dif-

ferences in data samples that were collected from var-

ious geolocations. In this context, geolocations re-

fer to network-topological locations (Internet nodes)

that send requests to the free-of-charge Twitter API.

In particular, we showed that data collections ex-

tracted from the same geolocation as well as the same

network-topological zone may vary in terms of users

and tweets collected, as well as in terms of the associated metadata. As a consequence, we showed that the

derived networks, such as retweet and mention net-

works, also exhibit substantial differences relevant to

a research design.

This paper extends our previous study by exam-

ining the differences between data samples extracted

from two different machines running at the same ge-

ographical location (i.e., intra-location differences).

For this purpose, we distinguish between two types

of data collections: 1) an incomplete collection re-

sults from at least one dataset not being fully retrieved

due to various server errors; 2) a complete collection

results from all available data having been received

from the Twitter API endpoint in a proven manner.

The remainder of this paper is structured as fol-

lows. Section 2 gives an overview of the background

information fundamental to this paper. Section 3 then

describes our technical approach for mining Twit-

ter data in an orchestrated manner. Next, Section 4

presents key ﬁndings and limitations. Section 5 con-

cludes the paper.

2 BACKGROUND

2.1 Twitter Data Model

For each request, the Twitter Search API v1.1 returns

tweets containing a wide variety of attributes. These

attributes include data related to the tweet itself (e.g.

creation time), geographical information (e.g. loca-

tion), a collection of the entities within the tweet (e.g.

urls), and data related to the user who published it

(e.g. screen name). According to the Twitter API’s documentation (https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/tweet), certain attributes are expected to change over time (e.g. retweet count) while others remain static (e.g. status id).

All links last accessed on March 7th, 2023.

2.2 Network Derivation

Combinations of the received attributes can be used

to derive various types of networks (e.g., mention

and retweet networks). Retweet networks represent

retweet activity and retweet frequency. In a retweet

network, each message (i.e., each retweet) is rep-

resented via a separate vertex and each retweet is

connected to the original message being retweeted.

Mention networks depict interactions between users

based on the @-mentions included in their respective

tweets. The derived networks can be used for per-

forming different types of network analysis tasks (see,

e.g., Kwak et al. 2010; Xiong et al. 2019; Gruzd and

Roy 2014; Kušen and Strembeck 2021).
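As a minimal sketch of this derivation (not the implementation used in our study), the following Python snippet builds a retweet edge list and a weighted mention edge list from a few tweet records. The field names "status_id", "user", "retweet_of", and "mentions" are hypothetical stand-ins for the corresponding attributes returned by the API.

```python
from collections import Counter


def retweet_edges(tweets):
    """Directed edges (retweet ID -> ID of the original message)."""
    return {
        (t["status_id"], t["retweet_of"])
        for t in tweets
        if t.get("retweet_of")
    }


def mention_edges(tweets):
    """Weighted directed edges (author -> mentioned user)."""
    weights = Counter()
    for t in tweets:
        for mentioned in t.get("mentions", []):
            weights[(t["user"], mentioned)] += 1
    return weights


# Hypothetical records; field names are assumptions for illustration only.
sample = [
    {"status_id": "1", "user": "alice", "retweet_of": None, "mentions": ["bob"]},
    {"status_id": "2", "user": "bob", "retweet_of": "1", "mentions": ["alice"]},
    {"status_id": "3", "user": "carol", "retweet_of": "1", "mentions": ["alice", "bob"]},
]

rt = retweet_edges(sample)
mn = mention_edges(sample)
print(sorted(rt))
print(sorted(mn.items()))
```

In this representation, every retweet contributes exactly one edge to the message it retweets, so two samples that differ by a single retweet yield retweet networks that differ by one vertex and one edge.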

3 APPROACH

3.1 Infrastructure

For the simultaneous collection of multiple datasets,

we created a distributed infrastructure of virtual ma-

chines (VMs) deployed in various geographical lo-

cations (see Figure 1). We created ten VMs in ﬁve

availability zones offered by Amazon Web Services

(AWS) at the time, each hosting two VMs (we will

use the term geolocation pair for referring to the vir-

tual machines running at the same geographical loca-

tion): Frankfurt (Germany), Mumbai (India), Sydney

(Australia), Seoul (South Korea), and Virginia (USA).

The selection of these ﬁve availability zones followed

the rationale of covering different geographical loca-

tions worldwide.

The collection process was coordinated by an ad-

ditional host located in Vienna (Austria) acting as the

orchestrator. Furthermore, a second host in Vienna

served as a central and permanent storage device for

the datasets collected from the ﬁve different geoloca-

tions (see Figure 1). Each collection was initiated via

the orchestrator by sending collection scripts to each

VM. The only variation in these scripts resulted from

distinct Twitter developer credentials used for the in-

dividual API calls. We set all developer accounts’ lo-

cations to correspond to the respective geolocations

of the AWS VMs to reduce potential bias. The exe-

cution time for the collection procedure was synchro-

nized within the infrastructure, meaning that all 10

VMs would begin requesting data from the Twitter

API at the same time. The individual requests were

done using the rtweet package (https://cran.r-project.org/web/packages/rtweet/rtweet.pdf), which by default collects a mixture of popular and most recent tweets.


Figure 1: Infrastructure for multi-site Twitter mining;

Adapted from Ivanova et al. (2022, Figure 1).

For the purposes of this paper, we assumed two

general collection types: 1) “complete” collections

and 2) “incomplete” collections. An incomplete

dataset is a collection of Twitter messages where at

least one machine received a smaller data sample due

to some interruption (e.g. timeout, over capacity). A

complete dataset is one that includes all messages that

Twitter’s API delivers for a particular set of search pa-

rameters.

3.2 Datasets

Table 1 depicts the 14 collections that have been ex-

tracted for our study, the requested hashtags, the max-

imum number of tweets received, the tweets’ publi-

cation period, and the respective collection date. The

hashtag selection aims to cover a wide variety of glob-

ally discussed topics (e.g. pop culture, conﬂicts) and

dataset sizes. A detailed overview of the number of

tweets collected per VM for each topic is presented

in Table 2. The asterisk superscript of the collection

number indicates that at least one of the data sam-

ples was interrupted and the corresponding collection

is incomplete.

4 FINDINGS

For the purposes of this paper, we analyze the data

along the following aspects. First, we focus on

the tweet IDs (i.e., unique tweet identiﬁers) in each

collection. We then examine whether the tweet at-

tributes are consistent. Finally, we explore the char-

acteristics of retweet and mention networks derived

from our collections.

4.1 Node Level

Figures 2 and 3 depict the (dis-)similarities for the 14

data collections between the pairs of Internet nodes

at each of the ﬁve geolocations in our study (Frank-

furt, Mumbai, Sydney, Seoul and Virginia). The exact

overlap (see Figure 2) refers to tweets that include the

same attributes for the same tweet ID. In contrast, the

partial overlap (see Figure 3) refers to tweets that have

the same ID but exhibit partial differences in attribute

values, such as unequal retweet counts.
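The two overlap notions can be illustrated with a short Python sketch. It is not the code used for Figures 2 and 3: the choice of denominator (the union of tweet IDs seen by either machine) and the attribute name "retweet_count" are assumptions made for this example only. Each sample is modelled as a dictionary that maps a tweet ID to its attribute values.

```python
def overlap(sample_a, sample_b):
    """Compare two samples given as {tweet_id: {attribute: value}}."""
    shared = sample_a.keys() & sample_b.keys()
    union = sample_a.keys() | sample_b.keys()
    # Same ID and identical attribute values on both machines.
    exact = sum(1 for tweet_id in shared if sample_a[tweet_id] == sample_b[tweet_id])
    total = len(union)
    if total == 0:
        return {"exact": 0.0, "same_id": 0.0, "differing": 0.0}
    return {
        "exact": exact / total,
        "same_id": len(shared) / total,  # same ID, attributes ignored
        "differing": (len(shared) - exact) / total,  # same ID, unequal attributes
    }


# Hypothetical samples from the two machines of one geolocation pair.
vm_1 = {"1": {"retweet_count": 3}, "2": {"retweet_count": 0}, "3": {"retweet_count": 5}}
vm_2 = {"1": {"retweet_count": 3}, "2": {"retweet_count": 1}, "4": {"retweet_count": 2}}

result = overlap(vm_1, vm_2)
print(result)
```

Reporting the three shares side by side makes it explicit whether a low exact overlap stems from missing tweets or from tweets that were retrieved by both machines with different attribute values.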

The overlap values offer a clear view of the

similarities between intra-location datasets. Yet, it

is essential to consider the effects of an incomplete

dataset. For the interrupted collection C05*, we ob-

serve a considerable drop in the exact overlap value

for all ﬁve locations. Furthermore, for the geolocation

Seoul we ﬁnd an even lower value of the exact overlap

(i.e., < 1%), which occurs because Seoul 1 received

a considerably smaller dataset than Seoul 2. A rela-

tively low exact overlap can also be found across all

geolocations for C02, yet in this case all data sam-

ples are complete. However, the reason for the exact

overlap varying from 26.29% for Seoul to 93.47% for

Frankfurt remains unclear.

For the two locations without interruptions (i.e.,

Sydney and Virginia), the partial overlap is close to

100% for all collections. In some cases, such as the

interrupted collection C09* in Frankfurt, we see a

similar drop in the partial and the exact overlap per-

centages, affected by the incomplete collection for

Frankfurt 1. However, in other cases, such as C05*,

a similar correlation is noticeable despite the datasets

within the geolocation pairs being complete. The lo-

cation Seoul stands out in this regard, as we ﬁnd the

most differences in terms of varying overlap values

for the individual collections, yet merely three collec-

tions have been categorized as being incomplete (i.e.,

C05, C10, and C13).

There is a noteworthy indication of a potential pat-

tern in the overlap values, which can be found across

all locations. Overall, we observe low exact-overlap

values for eight of the 14 collections: C02, C05,

C06, C09, C11, C12, C13, C14. However, with the

exception of C02 and C06, the rest of the collections

appear to be prone to API errors.

4.2 Attribute Level

We take a closer look at the partial overlaps by ex-

amining the individual attributes of the overlapping

tweets. We match the tweets across databases based

on their tweet IDs, thus assuming it to be a constant

attribute. The frequent variation of attribute values in

COMPLEXIS 2023 - 8th International Conference on Complexity, Future Information Systems and Risk

Table 1: Summary of tweet collections including the used hashtag, maximum number of tweets collected (i.e., #T), time

window for data collection, and date of collection. An asterisk denotes that the collection is incomplete. Adapted from

Ivanova et al. (2022, Table 1).

Collection  Hashtag                 #T (max)   From      Until     Collected on
C01         covid19                 310.879    18.11.21  21.11.21  25.11.21
C02         BlackFriday             550.361    26.11.21  28.11.21  01.12.21
C03         Omicron                 526.927    29.11.21  03.12.21  06.12.21
C04         HongKong                18.957     19.12.21  24.12.21  29.12.21
C05*        HappyBirthdayTaehyung   2.577.930  29.12.21  01.01.22  04.01.22
C06         Djokovic                218.966    04.01.22  08.01.22  11.01.22
C07         tsunami                 125.793    14.01.22  20.01.22  24.01.22
C08         Ukraine                 85.641     18.01.22  22.01.22  25.01.22
C09*        SuperBowl               1.826.490  13.02.22  15.02.22  20.02.22
C10*        Putin                   197.941    21.02.22  23.02.22  27.02.22
C11*        Ukraine                 436.420    21.02.22  23.02.22  25.02.22
C12*        Putin                   590.680    23.02.22  25.02.22  01.03.22
C13*        Ukraine                 1.144.923  10.03.22  13.03.22  17.03.22
C14*        Ukraine                 1.474.915  15.03.22  20.03.22  23.03.22

[Plot: axes “Collections” (C01 to C14*) and “Overlap in %” (ticks 0.00 to 1.00); legend: Frankfurt, Mumbai, Sydney, Seoul, Virginia.]

Figure 2: Exact intra-location overlap of the tweet populations per collection (i.e., identical attribute values).

tweets extracted at Seoul (as depicted by the orange

line in Figure 3) is also visible within the attribute-

level analysis. Table 3 depicts one such example of

the count-related attributes (e.g. retweet count) in col-

lection C10*. In this case, the number of differences

found between the two VMs in Seoul is considerably

higher than within the rest of the geolocations. A sim-

ilar pattern can also be found in other collections such

as collection C02 (see Table 4), which is categorized

as a complete collection.
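A per-attribute difference count of the kind reported in Tables 3 and 4 can be sketched in Python as follows. This is an illustrative sketch rather than our analysis code; the attribute names and the sample values are hypothetical.

```python
from collections import Counter


def attribute_differences(sample_a, sample_b):
    """Count, per attribute, the tweets (matched by ID) whose values differ."""
    differences = Counter()
    for tweet_id in sample_a.keys() & sample_b.keys():
        attrs_a, attrs_b = sample_a[tweet_id], sample_b[tweet_id]
        for attribute in attrs_a.keys() | attrs_b.keys():
            if attrs_a.get(attribute) != attrs_b.get(attribute):
                differences[attribute] += 1
    return differences


# Hypothetical samples from the two machines at one geolocation.
seoul_1 = {
    "10": {"retweet_count": 4, "followers_count": 100},
    "11": {"retweet_count": 0, "followers_count": 20},
}
seoul_2 = {
    "10": {"retweet_count": 5, "followers_count": 100},
    "11": {"retweet_count": 1, "followers_count": 21},
    "12": {"retweet_count": 7, "followers_count": 3},
}

diffs = attribute_differences(seoul_1, seoul_2)
print(diffs.most_common())
```

Tweets that only one machine received (ID "12" above) are ignored here, since the comparison is restricted to overlapping tweets.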

User Object. In addition to count variables (e.g.

retweet count or like count) which are expected to

have variations over time, we also examined the con-

sistency of user-object attributes. Table 5 depicts the

difference within these attributes based on collection

[Plot: axes “Collections” (C01 to C14*) and “Overlap in %” (ticks 0.00 to 1.00); legend: Frankfurt, Mumbai, Sydney, Seoul, Virginia.]

Figure 3: Partial intra-location overlap of the tweet populations per collection (i.e., same tweet id, variations of other attribute values).

C02. The column “Expected change” is based on Twitter’s official API documentation for the respective user-object attributes. This documentation characterizes attributes as either “relatively constant” or as expected to change frequently (over time), such as the number of tweets the account has posted (“statuses_count”) and its number of followers (“followers_count”). Because some count attributes are explicitly described as non-constant, we assume that the other count values are likewise subject to change over time (referred to via “l.yes” in the respective tables). The remaining attributes are marked

as “unknown”. The number of total changes for col-

https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user


Table 2: Number of tweets collected per location per hashtag and their respective logged messages. An asterisk denotes that

the collection is incomplete. Adapted from Ivanova et al. (2022, Table 2).

Coll. Frankfurt 1  Frankfurt 2  Mumbai 1     Mumbai 2     Sydney 1     Sydney 2     Seoul 1      Seoul 2      Virginia 1   Virginia 2
C01   310.802      310.803      310.811      310.810      310.808      310.806      310.805      310.879      310.814      310.808
C02   550.321      550.312      550.361      550.361      550.332      550.339      550.330      550.313      550.310      550.304
C03   526.804      526.811      526.803      526.804      526.813      526.815      526.927      526.830      526.809      526.812
C04   18.886       18.868       18.366       18.366       18.371       18.371       17.883       18.532       18.935       18.957
C05*  2.351.611    2.577.831    2.574.730    2.577.930    2.575.355    2.574.470    38.985       2.577.219    2.575.290    2.576.477
C06   218.831      218.827      218.841      218.842      218.841      218.842      218.965      218.966      218.858      218.836
C07   125.789      125.793      78.711       78.711       78.710       78.712       124.518      78.714       125.779      125.788
C08   85.640       85.637       84.406       84.406       84.409       84.412       85.622       85.622       85.640       85.641
C09*  554.498      1.825.922    247.731      1.825.615    1.825.616    1.825.577    1.826.056    1.826.146    1.826.476    1.826.490
C10*  197.909      197.910      197.937      197.941      197.939      197.932      197.939      63.356       197.907      197.909
C11*  436.391      436.392      436.379      432.209      436.401      436.409      436.420      436.399      436.409      436.409
C12*  590.543      590.548      34.577       590.600      590.646      590.674      590.653      590.680      590.607      590.601
C13*  1.068.768    1.100.893    1.100.851    1.100.832    1.144.885    1.144.892    1.144.923    150.033      1.100.894    1.100.907
C14*  1.309.517    1.408.257    590.046      445.348      1.381.884    1.381.923    1.382.012    1.382.025    1.467.636    1.474.915

Logged messages:
Two confirmations that the script exited correctly
Error in curl::curl_fetch_memory(url, handle = handle) : OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104
Error in curl::curl_fetch_memory(url, handle = handle) : transfer closed with outstanding read data remaining
Killed
Over capacity - 130

Table 3: Absolute number of differences in count attributes

of overlapping tweets among geolocation pairs for C10*.

Note regarding abbreviations: FRA - Frankfurt, MUM -

Mumbai, SYD - Sydney, SEL - Seoul, VA - Virginia, rt -

retweet, qt - quoted.

Attribute FRA MUM SYD SEL VA

rt followers 520 1467 4288 18944 508

qt followers 216 291 756 1893 217

favourites 69 358 1272 13659 102

statuses 62 282 1193 3104 78

rt statuses 27 96 453 1775 40

followers 25 101 350 9144 38

rt friends 22 45 116 5579 11

friends 7 34 119 6637 29

qt statuses 5 9 37 171 8

retweet 3 1 14 73 1

rt favorite 3 10 29 2158 7

rt retweet 3 1 14 71 1

listed 1 1 9 1130 14

qt favorite 1 6 42 391 1

qt retweet 2 11 65

favorite 1 39 3

qt friends 1 4 340 1

lection C02 per location and per user-object attribute

is depicted in the respective columns.

4.3 Network Level

To better understand how intra-location differences

in the collected datasets may affect network analy-

ses, we take a closer look at two selected network

types that can be derived from the collected Twit-

ter datasets. In particular, we examine retweet and

Table 4: Absolute number of differences in count attributes

of overlapping tweets among geolocation pairs for C02.

Note regarding abbreviations: FRA - Frankfurt, MUM -

Mumbai, SYD - Sydney, SEL - Seoul, VA - Virginia, rt -

retweet, qt - quoted.

Attribute FRA MUM SYD SEL VA

rt followers 1826 9002 7926 67319 29239

favourites 1200 5081 4774 39875 14940

statuses 933 2740 2709 9855 14527

rt statuses 480 1926 2125 8906 12494

qt followers 399 900 899 3506 2297

followers 143 2508 2175 27482 3373

rt favorite 89 1316 1020 13825 3344

friends 60 2061 1744 21030 1667

retweet 53 76 28 524 1006

rt retweet 53 76 28 520 1003

listed 33 279 285 2761 401

qt statuses 33 83 100 444 855

rt friends 31 2471 2025 21622 1212

qt favorite 19 84 106 713 440

favorite 1 7 8 76 17

qt retweet 1 2 6 43 41

qt friends 61 53 706 35

mention networks. For each retweet or mention net-

work, we counted the number of vertices and edges,

calculated the number of connected components, and

compared degree distributions. Degree distributions

are contrasted using empirical quantile-quantile (QQ)

plots to highlight the (dis-)similarity of the network

structures.
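The descriptors named above can be computed with a few lines of standard-library Python. The sketch below is illustrative only and makes two assumptions: vertices are taken from the edge endpoints (isolated vertices are not modelled), and connected components ignore edge direction. The function names and the toy edge lists are hypothetical.

```python
import math
from collections import Counter


def describe_network(edges):
    """Vertices, edges, connected components, and sorted degree sequence."""
    parent = {}

    def find(vertex):
        # Union-find lookup with path halving.
        parent.setdefault(vertex, vertex)
        while parent[vertex] != vertex:
            parent[vertex] = parent[parent[vertex]]
            vertex = parent[vertex]
        return vertex

    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
        root_source, root_target = find(source), find(target)
        if root_source != root_target:
            parent[root_source] = root_target

    return {
        "vertices": len(parent),
        "edges": len(edges),
        "components": len({find(vertex) for vertex in list(parent)}),
        "degrees": sorted(degree.values()),
    }


def qq_points(degrees_a, degrees_b, probs=(0.5, 1.0)):
    """Paired empirical quantiles of two non-empty sorted degree sequences."""
    def quantile(values, p):
        return values[max(0, math.ceil(p * len(values)) - 1)]

    return [(quantile(degrees_a, p), quantile(degrees_b, p)) for p in probs]


# Hypothetical retweet edge lists from the two machines of one location.
edges_1 = {("2", "1"), ("3", "1"), ("5", "4")}
edges_2 = {("2", "1"), ("5", "4")}

net_1 = describe_network(edges_1)
net_2 = describe_network(edges_2)
points = qq_points(net_1["degrees"], net_2["degrees"])
print(net_1, net_2, points)
```

If both samples produced the same degree distribution, every pair returned by qq_points would lie on the diagonal; pairs off the diagonal indicate where the two distributions diverge.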

Retweet Networks. We constructed retweet net-

works from all datasets and compared them per lo-


Table 5: User-object attributes as per Twitter’s API documentation and their actual variation per geolocation pairs for C02. Note regarding abbreviations: “l.yes” refers to a logically derived yes based on the Twitter API documentation.

Attribute                 Expected change  Frankfurt  Mumbai  Sydney  Seoul   Virginia
id                        no               assumed to be constant (see 4.2)
screen_name               yes              -          -       -       -       -
location                  yes              1          -       -       -       -
url                       unknown          -          -       1       7       -
description               yes              1          -       -       -       5
verified                  unknown          -          -       -       -       -
followers_count           yes              143        2508    2175    27482   3373
friends_count             l.yes            60         2061    1744    21030   1667
listed_count              l.yes            33         279     285     2761    401
favourites_count          l.yes            1200       5081    4774    39875   14940
statuses_count            yes              933        2740    2709    9855    14527
created_at                no               -          -       -       200     -
profile_banner_url        unknown          -          -       -       1       3
profile_image_url_https   unknown          -          1       -       -       9
default_profile           unknown          -          -       -       -       -
default_profile_image     unknown          -          -       -       -       -

cation in terms of their network characteristics. It is

worth noting that we observe only one perfect match

between the network characteristics across all collected complete collections (see C08).

The remainder of the complete collections contain

differences in at least one of the dataset pairs. Table 6

presents one such example based on collection C07.

Here, ﬁve of the VMs (i.e., Mumbai 1, Mumbai 2,

Sydney 1, Sydney 2, and Seoul 2) have retrieved fewer

tweets even though Twitter’s API did not produce any

error messages. An analysis of the extracted retweet

networks shows no differences between the datasets

collected at Frankfurt and at Mumbai. In Seoul, how-

ever, we notice mismatches regarding all network de-

scriptors. For the remaining two geolocations (Syd-

ney, Virginia) the differences in the number of ver-

tices and edges are comparatively minor (i.e., < 1%).

Lastly, we found that all incomplete collections

show differences in terms of all network character-

istics for the respective geolocations. As expected,

when a collection has been interrupted we found vari-

ations in all values and a mismatch in the graphical vi-

sualisation of the degree distributions. However, fur-

ther network differences can also be observed for the

geolocations with complete datasets.

Mention Networks. In addition to the retweet networks, we also derived mention networks and explored the same network characteristics (e.g., number of vertices). We established the existence of the same three groups of collections as for retweet networks: perfect match, complete collections with variations, and incomplete collections with variations.

Details on the exact network characteristics are available as supplemental material at https://rivanova.org/complexis2023.

4.4 Limitations

This paper builds on our previous study (Ivanova

et al., 2022) by reusing some of the datasets. When

designing our infrastructure for tweet extraction (see

Section 3), our main goal was to make the experiment as reproducible as possible, while reducing

collection bias. The variety in the geographical loca-

tions was achieved via the use of a commercial cloud

platform (i.e., Amazon Web Services). We selected

five AWS availability zones to maximize diversity

and global distribution (i.e., as many continents as

possible). We assume that the setup behind the in-

dividual VMs is equivalent (i.e., we assume AWS is

running comparable hardware at different locations),

yet further comparative research is needed in order to

conclude whether the use of another hosting platform

may yield different results. Furthermore, we used an

iterative collection approach. In particular, we col-

lected 200 messages every 10 seconds. Another op-

tion would be to request larger batches of tweets us-

ing longer time intervals. Investigating these sources

of possible variation is future work.

5 CONCLUSIONS

In this paper, we analyzed 14 Twitter data samples

collected simultaneously from ﬁve pairs of virtual

machines (VMs) running at ﬁve different geographi-

cal and network-topological locations (i.e., Frankfurt,

Mumbai, Sydney, Seoul and Virginia). We found vari-


Table 6: Comparison of number of vertices, number of edges, number of connected components and degree distributions of retweet networks per geolocation pair (two virtual machines) for C07 (i.e., a complete collection with differences).

Virtual machine  |V|      |E|      Connected components  Degree distribution
Frankfurt 1      124.704  118.296  6.408                 (plot)
Frankfurt 2      124.704  118.296  6.408                 (plot)
Mumbai 1 †       78.221   72.995   5.226                 (plot)
Mumbai 2 †       78.221   72.995   5.226                 (plot)
Sydney 1 †       78.220   72.994   5.226                 (plot)
Sydney 2 †       78.221   72.995   5.226                 (plot)
Seoul 1          123.464  117.067  6.397                 (plot)
Seoul 2 †        78.220   72.994   5.226                 (plot)
Virginia 1       124.696  118.288  6.408                 (plot)
Virginia 2       124.706  118.298  6.408                 (plot)

† Two confirmations that the collection script exited correctly, yet fewer tweets

ations among the data samples from the pairs running

at the same geolocation. These differences manifest

in terms of the collected tweet IDs, tweet attribute val-

ues, and the characteristics of the derived networks.

In addition, we split the collections into two

groups – complete collections that contain all tweets

provided via Twitter’s API, and incomplete collections that were stopped prematurely when Twitter’s API threw an error message.

Our findings show that complete collections tend to have a similar number of received tweets, yet exhibit variations regarding the exact tweets (matched based on their tweet IDs) that have been collected. The overlap between the tweets has an observed range from

63.21% to 100% with a median of 99.97%. For in-

complete collections, we conﬁrmed that the result-

ing collections exhibited considerable differences in

terms of number of collected tweets. In this case, the

fractions of overlapping tweet IDs ranged from 1.51%

to 99.98% with a median of 13.56%.

When looking at the attributes of the collected

tweets, we found that count attributes, such as retweet

count, may be different regardless of whether the col-

lection was complete or incomplete. While Twitter’s

API documentation states that count attributes are ex-

pected to change over time, we found that there are no

consistency guarantees when retrieving datasets even

in a synchronized manner.

For our analysis, we also created retweet and men-

tion networks from the collected datasets. Upon ex-

amining the networks’ characteristics, we observed

variations for both complete and incomplete collections.

Within the complete collections, we discovered exact

intra-location matches for only one collection. For


the remaining six collections, we found rather small,

yet noticeable variations in the number of vertices,

number of edges, number of connected components,

and the degree distribution. As expected, all incom-

plete collections showed differences in terms of net-

work characteristics. However, we also found smaller

network-level differences regarding certain network

characteristics for some complete collections.

Based on our study, we derive the following rec-

ommendations for researchers using Twitter’s free-

of-charge API. First, the status codes and error mes-

sages issued by Twitter’s API must be handled and

documented properly in order to avoid incomplete

data samples. Second, we advise researchers to be

cautious when relying on attribute values which are

expected to change in time and in space, such as

count attributes (e.g. retweet count or like count). In

addition, our previous research on (dis-)similarities

between individual geolocations (see Ivanova et al.

2022) recommended the use of three or more geolo-

cations for accessing the Twitter API in parallel, and

the use of a three-day delay.
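The first recommendation can be sketched as a collection loop that records how each run ended. The Python snippet below is a hedged illustration, not the collection script used in our study: “fetch_batch” is a hypothetical callable standing in for an API client, and the default batch size and pause merely mirror the iterative setup described in Section 4.4.

```python
import time


def collect(fetch_batch, batch_size=200, pause_seconds=10, sleep=time.sleep):
    """Collect tweets batch by batch and log how the run ended.

    fetch_batch(cursor, batch_size) is a hypothetical callable returning
    (tweets, next_cursor); next_cursor is None once no data remains.
    """
    tweets, log, cursor = [], [], None
    while True:
        try:
            batch, cursor = fetch_batch(cursor, batch_size)
        except Exception as error:
            # Keep the error message so the sample can be flagged as incomplete.
            log.append(f"interrupted: {error!r}")
            return {"tweets": tweets, "complete": False, "log": log}
        tweets.extend(batch)
        if cursor is None:
            log.append("exited correctly")
            return {"tweets": tweets, "complete": True, "log": log}
        sleep(pause_seconds)


def fake_fetch(cursor, batch_size):
    # Stand-in for an API client: two pages, then no further cursor.
    pages = {None: (["t1", "t2"], "p2"), "p2": (["t3"], None)}
    return pages[cursor]


def failing_fetch(cursor, batch_size):
    # Stand-in for an API client that fails on the second request.
    if cursor == "p2":
        raise ConnectionError("Over capacity - 130")
    return ["t1", "t2"], "p2"


def no_wait(seconds):
    # Skip the pause so the example finishes immediately.
    return None


result = collect(fake_fetch, sleep=no_wait)
failed = collect(failing_fetch, sleep=no_wait)
print(result["complete"], len(result["tweets"]), result["log"])
print(failed["complete"], len(failed["tweets"]), failed["log"])
```

Storing the completeness flag and the logged message alongside each sample allows interrupted collections to be identified and excluded, or analyzed separately, later on.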

REFERENCES

Boutyline, A. and Willer, R. (2017). The social struc-

ture of political echo chambers: Variation in ideologi-

cal homophily in online networks. Political Psychology,

38(3):551–569.

Broniatowski, D. A., Jamison, A. M., Qi, S., AlKulaib,

L., Chen, T., Benton, A., Quinn, S. C., and Dredze, M.

(2018). Weaponized health communication: Twitter bots

and Russian trolls amplify the vaccine debate. American

Journal of Public Health, 108(10):1378–1384.

Campan, A., Atnafu, T., Truta, T. M., and Nolan, J. (2018).

Is data collection through Twitter streaming API useful

for academic research? In Proc. IEEE International

Conference on Big Data, pages 3638–3643. IEEE.

Grinberg, N., Joseph, K., Friedland, L., Swire-Thompson,

B., and Lazer, D. (2019). Fake news on Twitter during the

2016 US presidential election. Science, 363(6425):374–

378.

Gruzd, A. and Roy, J. (2014). Investigating political po-

larization on Twitter: A Canadian perspective. Policy &

Internet, 6(1):28–45.

Ivanova, R. V., Sobernig, S., and Strembeck, M. (2022).

Does geographical location have an impact on data sam-

ples extracted from Twitter? In Proc. 9th International

Conference on Social Networks Analysis, Management

and Security (SNAMS) (in press). IEEE.

Kim, Y., Nordgren, R., and Emery, S. (2020). The story

of Goldilocks and three Twitter’s APIs: A pilot study

on Twitter data sources and disclosure. International

Journal of Environmental Research and Public Health,

17(3):864.

Kušen, E. and Strembeck, M. (2021). Building blocks of

communication networks in times of crises: Emotion-

exchange motifs. Computers in Human Behavior, 123.

Kwak, H., Lee, C., Park, H., and Moon, S. (2010). What

is Twitter, a social network or a news media? In Proc.

19th International Conference on World Wide Web, pages

591–600. ACM.

Morone, F. and Makse, H. A. (2015). Inﬂuence maximiza-

tion in complex networks through optimal percolation.

Nature, 524(7563):65–68.

Morstatter, F., Pfeffer, J., Liu, H., and Carley, K. M. (2013).

Is the sample good enough? Comparing data from Twitter’s streaming API with Twitter’s firehose. In Kiciman,

E., Ellison, N. B., Hogan, B., Resnick, P., and Soboroff,

I., editors, Proc. International AAAI Conference on Web

and Social Media, pages 400–408. The AAAI Press.

Pfeffer, J., Mayer, K., and Morstatter, F. (2018). Tampering

with Twitter’s sample API. EPJ Data Science, 7(1):50.

Pfeffer, J., Mooseder, A., Hammer, L., Stritzel, O., and Gar-

cia, D. (2022). This sample seems to be good enough!

Assessing coverage and temporal reliability of Twitter’s

academic API. CoRR, abs/2204.02290.

Timoneda, J. C. (2018). Where in the world is my tweet:

Detecting irregular removal patterns on Twitter. PloS

one, 13(9).

Wang, Y., Callan, J., and Zheng, B. (2015). Should we

use the sample? Analyzing datasets sampled from Twitter’s stream API. ACM Transactions on the Web (TWEB),

9(3):1–23.

Xiong, Y., Cho, M., and Boatwright, B. (2019). Hashtag

activism and message frames among social movement

organizations: Semantic network analysis and thematic

analysis of Twitter during the #MeToo movement. Pub-

lic Relations Review, 45(1):10–23.
