Black Sheep Wall: Towards Multiple Vantage Point-Based Information Space Situational Awareness
Bernhards Blumbergs¹,²
¹Cyber Resilience Laboratory, Information Science Division, Nara Institute of Science and Technology, Nara, Japan
²CERT.LV, Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia
Keywords: Incident Response, Information Space, Situational Awareness, Cyber Threat Intelligence, Distributed Website Data Mining.
Abstract: CSIRTs rely on processing extensive amounts of incident and threat intelligence data. While the market is saturated with such solutions, they are limited to a narrow range of Internet positions for data collection, impeding the establishment of the security context and comprehensive awareness of the monitored Internet resources. To tackle this challenge, a novel approach is proposed for distributed content collection. By simultaneously employing multiple Internet positions and various content access techniques, a broader representation of the content may be obtained by combining data from all positions, followed by automated difference analysis and clustering. The solution enables fully automated large-scale deployments across globally distributed IP networks and seamless integration into existing toolsets. It enhances CSIRT capabilities in identifying content changes, access restrictions, and contextual intelligence on cybercrime and threat actor campaigns, as well as in detecting defacement attacks, availability attacks, and misinformation attempts. Initial evaluation of the prototype demonstrated its effectiveness by detecting significant and distinct changes in website content, thereby providing expanded visibility and intelligence. Prototype code and validation datasets are released publicly for further use, research, and validation.
1 INTRODUCTION
In the current information age, collecting large volumes of data, performing near real-time processing, and deriving relevant and actionable information provide superiority over rivals and adversaries. Data aggregation, storage, and processing demands have increased, leading to competitive market saturation by commercial and community initiatives (G2.com Inc., 2024). This bolsters the development of unique approaches and solutions, permitting each market player to focus on a bespoke aspect of data collection, analytics, and analysis of the information space. Incident response teams (i.e., CSIRTs/CERTs) establish their situational awareness based on information aggregated via automated commercial, community-provided, or own-collected threat feeds and reports. In addition, targeted data collection from websites or other information sources may be performed to enrich the aggregated threat feeds. Threat intelligence collection and web resource content monitoring solutions may be used to accomplish this task. However, information resource access and data collection are conducted from a single or limited set of positions within the service provider-controlled IP address space. Simultaneous data collection from the same information source from a wide and geographically distributed set of positions is not being considered. From a broader cyber security perspective, such an approach may present limitations towards establishing comprehensive information space awareness, gathering contextual information, and conducting informed incident response activities.
This work proposes the hypothesis that an information source (e.g., a website) may yield different content based on how it is accessed, with the identified changes providing contextual intelligence information. Varying access methods may reveal the dynamic nature of the information source and deliver additional information to enrich the incident response process that is otherwise not provided by current information space awareness solutions. In this work, the term vantage point is understood as the combination of a connection method and the position within the Internet from which an information source is being accessed. By deploying a set of collectors, the content from the same information source may be collected simultaneously from multiple vantage points (Figure 1). For example, accessing an information source from a different geographically assigned IP address space, using private or well-known VPN or proxy services, routing through anonymization network exit nodes, or changing connection parameters (e.g., user-agent) may result in differences in the retrieved content. In this work, the information space is understood as the set of information sources, either publicly available (e.g., clear-net) or with specific access requirements (e.g., deep-net or dark-net), relevant to the entity engaged in data collection. The information space may cover a broad range of resources, such as websites, news articles, social network posts, online databases and repositories, malicious website resources, and leaked information dumps. Collecting and analysing data retrieved from multiple vantage points may reveal the information source's changing behaviour and provide additional visibility to the CSIRT team, expanding their capabilities. Varying content representation based on the origin and position on the Internet is a commercially driven and well-accepted practice among benign services for performing geographic load-balancing, serving regionally relevant content, or search engine optimization. However, a broader contextual perspective is achieved by identifying the particular vantage points for which distinct changes may be identified in the delivered content among all the employed vantage points.
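For clarity, the vantage point notion may be summarized as a small data structure, as in the following illustrative sketch; the field names are assumptions made for presentation purposes and are not taken from the prototype code.

```python
# Illustrative sketch of a vantage point: a connection method combined with an
# Internet position. Field names and method labels are assumed, not the
# prototype's actual interface.
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VantagePoint:
    region: str                       # e.g. a cloud region supplying the source IP range
    method: str                       # "direct", "proxy", "tor", or "vpn"
    proxy_url: Optional[str] = None   # e.g. "socks5://127.0.0.1:9050" for a local Tor router
    user_agent: Optional[str] = None  # connection parameter that may alter served content
```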
This paper provides the following novel applied contributions (released publicly on https://github.com/lockout/b-swarm):
1. multiple vantage-point-based information space situational awareness concept description, its prototype design and implementation details, and complete source code (GPLv3 license);
2. multiple vantage-point-based benign and malicious website content snapshot collection, initial evaluation, and data set (MIT license).
This paper is structured as follows – Section 2 reviews the identified related work; Section 3 describes the proposed prototype design considerations and its implementation; Section 4 provides an overview of the prototype deployment, data set collection, and evaluation; and Section 5 concludes the paper and presents future development directions.
2 BACKGROUND AND RELATED WORK
Related work within open-source data gathering and information space awareness considers academic research, vendor whitepapers, solution descriptions, and technical blogs and write-ups. Searches were conducted using the Internet and academic database search engines (e.g., IEEE Xplore, ACM DL, Springer, Elsevier, ResearchGate, and Google Scholar). A primary time frame since 2017 was selected to represent the latest developments and advances in this evolving field. The search queries consist of the keyword combinations "request origin" OR "effect of origin", "impact of location", "website content change", "website content mining" AND "distributed website content mining", "open-source intelligence" OR "OSINT", and "cyber threat intelligence" OR "CTI".
A. Distributed Website Data Mining. News portal and social network content monitoring solutions for information space awareness are primarily aimed towards information space exposure assessment (Meltwater, 2024)(SK-CERT, 2024), emerging news alert identification (Google LLC, 2024), social media management (Hootsuite Inc., 2024), and marketing (Brand24 Global Inc., 2024). Open-source intelligence and cyber threat intelligence collection services aim at big data aggregation and analytics from clear-net (Gartner Inc., 2024) and dark-net resources (Flashpoint, 2024)(KELA, 2024)(Mandiant, 2024). There is an understandable lack of publicly available information related to the data collection specifics of commercial solutions. Based on the solution website descriptions and publicly available information, it has not been observed that these services provide information space visibility from multiple vantage points or offer identification of content differences among them. Instead, the end user accesses and queries a single stream of data related to specific monitoring, situational awareness, and data collection tasks. This may indicate that service providers do not consider the dynamic nature of the content based on the vantage point and provide only a limited, single-faceted visibility of the content.
Wan et al. (Wan et al., 2020) explore the impact of origin location on IPv4 network layer 3 and 4 scans of the TCP/80 HTTP, TCP/443 HTTPS, and TCP/22 SSH ports and identify that the origin of the scan impacts the results. The authors compared the collected results to the censys.io service results and recognized that visibility is lost due to the limited scan origins of the service. The paper proposes a multi-origin approach, with three origins giving the optimal scan results. It does not consider application layer 7 data or approaches to trigger content changes. Pham et al. (Pham et al., 2016) explore web crawling strategies to bypass the web robot detection mechanisms used by websites. Crafted HTTP GET requests with six well-known crawler user agents were sent directly and over the Tor network, and the websites' HTTP response headers were observed. The authors observed differences in the HTTP response status code and content length fields based on the changed user agents and the Tor network sources used for the HTTP requests. This paper limits itself to HTTP response header analysis.

Figure 1: Information source interaction via multiple vantage points.
B. Website Data Mining and Analysis. The website's dynamic nature, adaptive content delivery mechanisms, and saturation with unrelated data (e.g., linked external resources and paid content) pose a challenge for identifying and extracting the main intended information of a webpage. This area has been well researched, with multiple approaches being proposed for dynamic website content mining (Pujar and Mundada, 2021)(Russell and Klassen, 2018)(Neuendorf, 2017). More recognized approaches include webpage Document Object Model (DOM) parsing (Nadee and Prutsachainimmit, 2018)(Alarte et al., 2018) and natural language processing techniques (Nadee and Prutsachainimmit, 2018)(Thorleuchter and Van den Poel, 2012)(Gibson et al., 2007). Some researchers propose the use of document clustering or auto-regressive moving average algorithms for data processing and information extraction (Suganya and Vijayarani, 2020)(Calzarossa and Tessera, 2018), graph theory for pattern discovery (Leung et al., 2022), or analysis of website metadata (McGahagan et al., 2019a)(McGahagan et al., 2019b)(Song et al., 2020)(Kohli et al., 2012). Alternatively, website content identification, labelling, and extraction may be crowdsourced to human operators (e.g., Amazon Mechanical Turk) (Kiesel et al., 2020).
For dark-net data mining, the content specifics, the highly dynamic flux nature of such websites, and access restrictions (e.g., rate limiting, DDoS protection, and CAPTCHA challenges) need to be taken into account. Dark-net information tracking has been rising in prominence, and various initiatives relevant to CSIRT communities have been established (Shadowserver Foundation, 2024)(SISSDEN, 2024)(Lewis, 2024)(Expert Insights, 2024a)(Expert Insights, 2024b). The most common approaches include dark-net forum and marketplace crawling for topic keyword extraction (Yang et al., 2020)(Takaaki and Atsuo, 2019) and content scraping and analysis (Lawrence et al., 2017)(Crowder and Lansiquot, 2021)(Pantelis et al., 2021)(Samtani et al., 2021). Measures to bypass dark-net crawling protection mechanisms have been evaluated, such as a layer of SOCKS proxies, clearing stored cookies, changing the user-agent (Pantelis et al., 2021), limiting crawling speed, and crowdsourcing or automating CAPTCHA solving (Samtani et al., 2021)(Lawrence et al., 2017).
3 PROTOTYPE DEVELOPMENT CONCEPTS
This section presents and describes the prototype's core operational concepts and design approaches, based on the published technical report (Blumbergs, 2023). In this work, a complete set of simultaneously collected information source content instances is referred to as a snapshot, representing the dynamic nature of the same information source from all vantage points. Further automated snapshot analysis may identify changes at a given time between the instances within the same snapshot. Furthermore, creating multiple snapshots over time against the same data source and cross-correlating them with past snapshots may present the nature of changes in the data source over time. Detected changes are displayed to a human analyst in a structured and visual format for further analysis and evaluation. This work limits its scope to multi-vantage point-based data collection, basic snapshot analysis, and representation to the analyst. Although machine-assisted and guided, the in-depth analysis and derivation of the meaning of the changes will in most cases be context-driven, dependent on the information source and the reasons for engaging in the data collection.
The prototype aims at enhancing the following CSIRT operational capabilities:
1. identification of changes within a snapshot. Changes in website content collected from geographically distributed vantage points may permit activities such as the evaluation of content changes due to geographical distribution, identification of access restrictions, and assessment of phishing websites and resources used for cybercrime and targeted attacks. For example, the detection of a website serving malicious content only to connections originating from specific IP address ranges or countries may provide contextual information on the potential targeted scope, victims, and likely adversarial intentions;
2. identification of dynamic changes between a sequence of snapshots over time. Such changes may permit activities such as observation of content availability and tracking and identification of changes in website content, which may lead to the disclosure and tracking of misinformation and disinformation campaigns, as well as (D)DoS and defacement attacks;
3. detection of keywords in the website content. This may allow activities such as the identification of leaked data and the tracking of specific trigger words, for example, detection of information disclosure linked to organization-owned or affiliated domain names, employee email addresses, and user accounts (a minimal matching sketch follows this list).
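As a minimal illustration of the keyword detection capability, the following sketch matches a list of trigger words against extracted report text; the helper name, trigger list, and matching strategy are illustrative assumptions, not the prototype's implementation.

```python
# Hedged sketch: case-insensitive, word-bounded matching of literal trigger
# words (e.g., organization domains or employee addresses) in extracted text.
import re
from typing import Iterable, Set

def find_trigger_words(text: str, triggers: Iterable[str]) -> Set[str]:
    hits = set()
    for trigger in triggers:
        # escape the trigger so dots and @ signs are matched literally
        if re.search(rf"\b{re.escape(trigger)}\b", text, re.IGNORECASE):
            hits.add(trigger)
    return hits

print(find_trigger_words("login portal for jdoe@example.org",
                         ["example.org", "jdoe@example.org", "secret-project"]))
```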
The prototype is written in the Python3 language, and its functionality and implementation consist of the following stages: 1) collector deployment and configuration, 2) collector initialization and process management, 3) information space interaction and data collection, 4) snapshot report assembly and storage, and 5) snapshot processing and analysis. These stages are represented in Figure 2 and are described further in this section. The complete prototype source code is released publicly under the GNU GPLv3 license on GitHub, as listed in the contributions.

Figure 2: Collector process flow.
A. Collector Implementation. The collector engine uses Docker container technology to create a self-contained environment, enable flexible automation via Docker Compose, and permit scalable cloud-based deployments. The collector handles both the underlying connection establishment and the data collection from the target source. Implemented data connection types include the implicit deployment IP range within the cloud service's assigned region, the Tor anonymization network, proxy services, and support for VPN connection services. To establish a vantage point for the collector (Figure 1), the following approaches are implemented: 1) geographically distributed IP address space deployments (e.g., cloud or virtual private server hosting services), 2) network proxy services (e.g., public or private HTTP(S) or SOCKS servers), 3) anonymization networks with Internet exit nodes (e.g., a local Tor router as a SOCKS5 proxy), and 4) change of the HTTP User-Agent (i.e., a list of mobile and desktop browser user agents). A Google Chromium headless browser is used and automated with Selenium. A JSON-based configuration profile defines the required variables and settings for the collector instance, the collection process, and the interaction with the information source.
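As an illustrative sketch of this setup, the following shows how a collector might configure a Selenium-driven headless Chromium with a custom user-agent and an optional local Tor SOCKS5 proxy; the helper and its defaults are assumptions, not the prototype's actual interface.

```python
# Minimal sketch of vantage point establishment: headless Chromium driven by
# Selenium, optionally routed through a SOCKS5 proxy (e.g., a local Tor
# router), with a configurable User-Agent. Names and values are illustrative.
from typing import Optional

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def make_driver(user_agent: str, socks5_proxy: Optional[str] = None) -> webdriver.Chrome:
    opts = Options()
    opts.add_argument("--headless=new")              # no display inside the container
    opts.add_argument("--no-sandbox")                # commonly required in Docker images
    opts.add_argument(f"--user-agent={user_agent}")  # connection parameter to vary
    if socks5_proxy:
        # route all browser traffic through the SOCKS5 endpoint
        opts.add_argument(f"--proxy-server=socks5://{socks5_proxy}")
    return webdriver.Chrome(options=opts)

driver = make_driver(
    user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # sample desktop UA
    socks5_proxy="127.0.0.1:9050",                           # assumed local Tor endpoint
)
driver.get("https://example.com")
html = driver.page_source
driver.quit()
```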
B. Snapshot Data Collection. The structure of a single interaction report holds the data related to the target information source, with all the collected data and metadata representing the various features of that information source, its state, and its behaviour. The choices of the retrieved feature data are related to the unique identification of the target information source, its parameters, and the served content. Such a single interaction report structure is chosen to be ready for unsupervised machine learning algorithms in cases such as collected data feature analysis, clustering, and pattern and outlier identification. A complete set of all interaction reports is assembled into a single snapshot in the Apache Parquet columnar file format, which allows more efficient data storage, searching, and result processing.
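A hedged sketch of such a report and its assembly into a Parquet file follows; the column names are assumptions rather than the released schema, and the hash features anticipate the feature description below.

```python
# Illustrative only: one interaction report per vantage point, assembled into
# a Parquet file with pandas (pyarrow installed for Parquet output). ppdeep is
# the fuzzy-hashing library named in the paper; column names are assumed.
import hashlib

import pandas as pd
import ppdeep

def interaction_report(vantage_id: str, target_url: str, visited_url: str,
                       user_agent: str, html: bytes, duration_s: float) -> dict:
    return {
        "vantage_id": vantage_id,
        "target_url": target_url,
        "visited_url": visited_url,   # may differ after HTTP redirection
        "user_agent": user_agent,
        "duration_s": duration_s,
        "sha256": hashlib.sha256(html).hexdigest(),
        "fuzzy_hash": ppdeep.hash(html),
        "html": html,
    }

reports = [interaction_report("gcp-europe-north1", "https://example.com",
                              "https://example.com/", "Mozilla/5.0 ...",
                              b"<html>...</html>", 1.8)]
pd.DataFrame(reports).to_parquet("snapshot-gcp-europe-north1.parquet")
```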
Collected resource data consists of 1) collector metadata, 2) information source metadata, and 3) collected data and its metadata. Collector metadata includes the following features – vantage point unique identifier, specified target URL, visited URL (which may differ from the initial one in case of HTTP redirection), used user-agent string, and interaction time measurement. Information source metadata includes the following features – HTTP header data (e.g., session cookies, identifiers, and parameters) and the SSL certificate fingerprint. This data may allow further collection of additional information (e.g., passive-SSL or passive-DNS lookups), helping identify and classify the information source. Collected data and its metadata include the following features – the retrieved HTML content with its SHA256 hash and ppdeep fuzzy-hash, the Shannon entropy of the retrieved data, and a graphical representation of the information source (i.e., a screenshot). The retrieved content hashes are used for similarity analysis of the extracted content between the other instances within the same snapshot. The graphical screenshot is collected via the Selenium Chromium web driver, which permits visual comparison, change detection, and evaluation. To perform the similarity comparison between the captured screenshots, the visual difference mean squared error (MSE) is calculated. Calculating the derived content fuzzy-hashes and screenshot MSE values may permit the establishment of threshold criteria to identify and present relevant changes within and between snapshots when performing automated snapshot analysis.
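Two of these derived features can be sketched directly; the following illustrative helpers compute the Shannon entropy of retrieved data and the MSE between two equally sized screenshot arrays, where a lower MSE means higher visual similarity.

```python
# Sketches of two derived features, assuming raw HTML bytes and screenshots
# already loaded as equally sized numpy arrays.
import math
from collections import Counter

import numpy as np

def shannon_entropy(data: bytes) -> float:
    # entropy in bits per byte over the byte-value distribution
    counts = Counter(data)
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def screenshot_mse(a: np.ndarray, b: np.ndarray) -> float:
    # pixel-wise mean squared error; both screenshots must share dimensions
    if a.shape != b.shape:
        raise ValueError("screenshots must have identical dimensions")
    diff = a.astype(np.float64) - b.astype(np.float64)
    return float(np.mean(diff ** 2))
```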
C. Snapshot Data Parsing. Each collector instance produces a single file in the Apache Parquet format. The collected snapshot data is loaded into the ClickHouse high-performance column-oriented database management system to form a single table representing the collected data for the specified information space from the collector vantage points. For the initial snapshot data parsing and representation, a Jupyter Notebook with IPython Widgets is developed to give the analyst basic means for snapshot exploration; it is released publicly on GitHub under the GPLv3 licence as a part of the prototype package. The Notebook queries and loads snapshot table results for processing as Pandas DataFrames, one for each URL, representing the collected data across all related vantage points.
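The loading and per-URL querying steps might look as follows; this is a hedged sketch assuming the clickhouse-connect driver and a pre-created snapshot table whose columns mirror the Parquet schema.

```python
# Hedged sketch of the loading step: each collector's Parquet file is read
# with pandas and appended to one ClickHouse table; names are illustrative.
import glob

import clickhouse_connect
import pandas as pd

client = clickhouse_connect.get_client(host="localhost")
for path in glob.glob("snapshots/*.parquet"):  # one Parquet file per collector
    client.insert_df("snapshot", pd.read_parquet(path))

# one DataFrame per URL, spanning all related vantage points
df = client.query_df(
    "SELECT * FROM snapshot WHERE target_url = %(url)s",
    parameters={"url": "https://example.com"},
)
```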
To permit further automatic snapshot analysis and change identification, additional values are calculated based on the collected SHA256, fuzzy-hash, MSE, and Shannon entropy feature values (i.e., the key feature values). Within the URL-specific DataFrame, values from one vantage point are measured against all other vantage points to establish how one vantage point compares to the others within the same snapshot by calculating mean values for all key features. Lower mean values for a key feature indicate lower content similarity against the rest of the elements in the set, except that a lower mean value for the MSE feature indicates higher visual similarity against the other set elements. Based on the enriched data, an automated analysis is performed to identify clusters of vantage points by using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm. Similar content is assumed if the fuzzy-hash and MSE mean values are within the threshold range (91.0-100.0%) and all samples are assigned to the same cluster. The human analyst is presented with a smaller set of URLs and their clustering labels for which differences among vantage points have been identified.
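The enrichment and clustering steps described above can be sketched as follows, under assumed column names; the self-comparison simplification and the DBSCAN parameters are illustrative choices, not the prototype's tuned values.

```python
# Hedged sketch: mean ppdeep fuzzy-hash similarity (0-100) of each instance
# against all instances of the same URL becomes a key feature, and DBSCAN
# groups vantage points that serve alike content.
import pandas as pd
import ppdeep
from sklearn.cluster import DBSCAN

def enrich_and_cluster(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    hashes = list(df["fuzzy_hash"])
    # self-comparison (similarity 100) is included as a simplification
    df["fuzzy_mean"] = [
        sum(ppdeep.compare(h, other) for other in hashes) / len(hashes)
        for h in hashes
    ]
    features = df[["fuzzy_mean", "mse_mean"]]  # mse_mean computed beforehand
    df["cluster"] = DBSCAN(eps=5.0, min_samples=2).fit_predict(features)
    return df
```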
The human analyst may use the analysis Notebook to visually interact with the snapshot data and focus attention on the URLs with changes in their content representation instead of manually analyzing the whole snapshot data set (Figure 3). This rudimentary approach is sufficient for the initial evaluation of the collected snapshot data set and prototype validation. The applicability and evaluation of specific machine learning algorithms to perform more efficient data analysis, feature detection, and clustering are out of the scope of the current research and will be explored in depth in upcoming work.
4 PROTOTYPE ASSESSMENT
The validation aims to ensure that the prototype code performs according to the designed functionality: all automation tasks finish without errors, collector nodes deploy across globally distributed network ranges, a set of target information space resources can be processed, and data collection and storage complete successfully. Additionally, sample data sets are created to verify the collection process and assess the snapshot. For validation purposes only, the online resource selection is divided into two main categories – benign web resources (e.g., news portals, social networks, forums) and malicious web resources (e.g., phishing websites).
A. Prototype Deployment Process. The Google Cloud Platform (GCP) is chosen for the prototype validation due to its geographically distributed cloud network ranges, flexible command line automation, Python API support, native Docker engine support, and support for running Docker containers as one-time jobs. Docker specifications, configuration files, and automation scripts are released publicly on GitHub under the GPLv3 license as a part of the prototype package. For validation purposes, a subset of 18 out of all 37 publicly announced GCP regional networks is selected to ensure at least one vantage point from each regional network group covering Asia, Australia, Europe, the Middle East, North America, South America, and the United States. This ensures that data collection is performed with a broad perspective from globally distributed vantage points. The job deployment and execution time heavily depend on the number of URLs to be accessed, their network distance from the collector, the access method, and the connection speed and latency. Additional checks are implemented to avoid timed-out sessions, partially loaded resources, or extended loading times. For information space awareness, near real-time data collection is not required, and the time constraints may be more flexible to complete collector deployment and snapshot data collection.
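For illustration, fanning out one collector job per region could be automated as in the following sketch, assuming the collector image is already registered as a Cloud Run job named collector in each target region; the job name and region subset are assumptions.

```python
# Hedged sketch of per-region job fan-out using the documented
# "gcloud run jobs execute" command; region list and job name are illustrative.
import subprocess

REGIONS = ["asia-northeast1", "europe-north1", "southamerica-west1"]  # subset

for region in REGIONS:
    # launch one one-time collector job per regional vantage point
    subprocess.run(
        ["gcloud", "run", "jobs", "execute", "collector", "--region", region],
        check=True,
    )
```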
All automated container deployments and GCP Cloud Run Jobs were completed without errors and produced a prototype validation dataset. For informative purposes, Table 1 presents the individual data set collection metrics: 1) the URL count in the specified information space and the used vantage point count, 2) the total Parquet format snapshot file size and an average value per vantage point, 3) the average GCP job execution time per vantage point, 4) the average time spent by the collector instance to retrieve the data from a single URL, and 5) the count and ratio of URLs with notable changes identified by the automated analysis process.
B. Data Set Collection and Evaluation. Top benign domains. To form the benign web resource data set, the information on top domain names is selected from two publicly available resources: 1) the Cloudflare Radar domain rankings (top 500 domain CSV list) (Cloudflare, 2024), which represent domain rankings based on the Cloudflare service visibility, and 2) the Similarweb top websites ranking (top 50 domain names) (Similarweb, 2024), claiming to represent analytics for the most visited websites for business brand and marketing competitive analysis. Both lists are merged with duplicates removed to represent a combined list of 516 unique URLs. The automated analysis identifies that most resources (75%) present the same or nearly exact content across all vantage points. This similarity is mostly related to a unified corporate identity and to providing an equal experience to end users globally by highly ranked entities (e.g., Microsoft, Amazon). It has to be noted that the list of top domains is produced by assessing the traffic volumes for the corresponding domain name and may include domains related to content delivery networks. These networks do not serve any content when accessed directly without request parameters and would yield empty collection results. The dynamic content, as anticipated, is primarily observed for user-generated content resources, such as youtube.com, twitch.com, and tiktok.com, representing regionally relevant material and trends.
As a first use case, it was observed that the resource yandex.net (an Internet services and search engine platform originating from the Russian Federation) displays a crawling bot detection prompt (Figure 3) only if the request originates from European IP networks. While such a broader perspective gives an additional assessment of the website's behaviour, it is not unusual to observe content providers placing restrictions or content filtering based on their policies or collected metrics. Within the current geopolitical situation (Antoniuk, 2023), such restrictions might be anticipated and may give additional perspective to the human analyst towards intelligence collection and maintaining awareness over the cyber domain.
Tor benign domains. To evaluate the implemented Tor connectivity and proxy usage functionality, the top 50 benign domains were accessed with the Tor setting enabled in the collector profile. It has to be noted that Tor-based Internet-bound connection origins, regardless of container deployment location, are allocated from a larger pool of known and geolocated Tor exit nodes (BigDataCloud Pty Ltd, 2024). This makes such connections easily identifiable and may impose additional content access restrictions or modifications. Dynamic Tor exit node assignment implies a certain level of randomness, which may add an anticipated uncertainty to the collected results. As an example, it was observed that google.com displayed a CAPTCHA challenge for all Tor-based connections (Figure 4), except for the collectors deployed in GCP europe-north1 and southamerica-west1. While this observation suggests that Google may not treat all Tor-based connections equally, the exact Tor exit nodes and their geolocation are not known.
Table 1: Snapshot data set overview.

Data set name      | URL count | Vantage count | Total snapshot size (average per vantage) | Average job execution time | Average single URL harvest time | Distinct URL count (ratio)
Top benign domains | 516       | 18            | 2284MB (127MB)                            | 6.17 hours                 | 20 seconds                      | 131 (25%)
Tor benign domains | 50        | 18            | 420MB (23MB)                              | 22 minutes                 | 23 seconds                      | 40 (80%)
Phishing URLs      | 499       | 18            | 2226MB (124MB)                            | 1.65 hours                 | 9 seconds                       | 175 (35%)
Tor phishing URLs  | 50        | 18            | 118MB (6.5MB)                             | 11 minutes                 | 10 seconds                      | 4 (8%)
Figure 3: Top benign domain data set: Yandex on hxxps://yandex.net.
Phishing URLs. To form a malicious web resource data set, publicly disclosed phishing website URL lists were used, which may represent both ad hoc created and compromised resources. This data set is highly dynamic, since large volumes of phishing URLs are reported, tracked, and taken down or blocked daily. The data set comprises the following publicly available resources: 1) OpenPhish (phishing feed list) (OpenPhish, 2024b) and 2) the Phishing domain database (new phishing links today list) (Krog and Chababy, 2024). The final combined and deduplicated list includes 499 unique phishing URLs. The automated analysis identifies that 65% of phishing sites serve the same content across all vantage points. Public phishing URL lists present globally identified and reported mass phishing campaigns targeting brands worldwide, rather than campaigns with a narrower target group (i.e., spear-phishing). As anticipated, it may be observed that such mass phishing campaigns are primarily motivated by financial gain (e.g., gambling, cryptocurrency) and the collection of sensitive data (e.g., email and office authentication, streaming service accounts, and social network and communication solution credentials). The specific nature of such campaigns primarily entails cloning and impersonating existing benign resources while providing limited dynamic content. According to the OpenPhish statistics (OpenPhish, 2024a), the most targeted brands and sectors are Facebook (Meta), Bet365, Telegram, Office365, cryptocurrency services, WhatsApp, AT&T, and Outlook and other webmail providers. The initial assessment of the phishing data set confirms that the majority of collected resources fall within these categories.
As a second use case, a Microsoft scamming website (Figure 5) serves malicious-looking content (a fake Microsoft Defender threat scanner (Meskauskas, 2024)) only for connections originating from the GCP Asia-Pacific networks in Taiwan, Japan, and Australia, but not from India and Singapore. This finding provides threat intelligence from a global perspective, indicating that the threat actor selectively targets users in the Asia-Pacific region, specifically aiming at least at Taiwan, Japan, and Australia while excluding other countries in the same region. A dedicated targeted collection process was launched (its snapshot data is bundled with the phishing URL data set) to observe the behaviour of this specific phishing website from all 37 global GCP IP networks instead of the selected 18 globally spread GCP networks. The collected data showed that the website displayed phishing content for connections originating from Japan, South Korea, Taiwan, and Australia, but excluded the following countries – Indonesia, Singapore, India, and Hong Kong. This intelligence may support the human analyst towards in-depth investigation and identification of the threat actor's modus operandi.

Figure 4: Tor benign domains data set: Google on hxxps://google.com.
Figure 5: Phishing URL data set: Microsoft scamming on hxxp://zcxzcxzcx.d2jk5f4fer48s8.amplifyapp.com.
Tor phishing URLs. For data collection validation through the Tor network, 50 phishing sites from the phishing list were randomly selected. It was observed that phishing sites hosted on major cloud providers or behind domain fronting services (e.g., Cloudflare) will impose connection checks and CAPTCHA challenges for Tor-based connections. Most of the collected data represent access restrictions across all vantage points related to the used service providers and not to the phishing resources themselves. Such access control is primarily performed due to cases of Tor network abuse to perform malicious activities against online systems and resources. With most of the collected data containing access restriction messages or taken-down content, this data set does not present any significant findings beyond the already identified ones. However, it was observed that for the top 50 benign domains, the ratio of CAPTCHA-restricted service access over Tor-based connections was significantly smaller than for the phishing websites. Although this may depend on many conditions, an assumption may be made that top benign websites have a higher availability and security posture, not requiring immediate restriction of potentially suspicious connections. Additionally, it may lead to the assessment that cybercriminals are looking to maximise their impact, success ratio, and profit while investing the minimum required effort, being aware that phishing sites have a short life expectancy before being tracked and shut down.
5 CONCLUSIONS AND FUTURE WORK
This research paper delivers practical and applied contributions towards gaining expanded situational awareness of the information space (released publicly on https://github.com/lockout/b-swarm): 1) the developed prototype and related code, and 2) the collected snapshot data sets. It addresses the fundamental design limitations of current information space awareness solutions and proposes a novel approach towards expanding the information awareness capabilities of cyber security teams. A multiple vantage point-based approach offers a comprehensive view of the dynamic nature of the target information space. It provides contextual information, which may be ingested into the existing incident response tool set. Such augmented information space visibility may span beyond CSIRT team requirements and apply to a wider range of sectors dealing with information space tracking, such as law enforcement, combating financial fraud, intelligence officers, identifying disinformation, and strategic communications. The prototype implements a wide range of functionality, offers cloud deployment scalability and geographical distribution, and solves multiple design and operational complexities. The performed validation demonstrates the capabilities and strengths of the developed approach by collecting, evaluating, and identifying the changes in the specified information sources. The initial data set collection and evaluation supports the hypothesis of this research, confirming the dynamic nature of content representation depending on the employed vantage point and identifying notable changes for increased context and intelligence collection.
The current prototype uses the HTTP(S) protocol for content collection, one of the most widespread access methods. However, expanded support for diverse information sources (e.g., messaging platforms and API interaction) should be implemented and validated. Although full Tor network support is already implemented, interaction with Tor network resources (i.e., .onion sites) has been outside the scope of this research paper due to their specific access requirements. Additionally, by using non-public CERT-provided URL lists and data feeds, it may be expected that the likelihood of identifying targeted attacks would increase due to a narrower perspective on specific entities. To advance the capabilities of the solution and address the identified limitations, future work will focus on the following key directions: 1) automatic URL data acquisition and ingestion provided by a partner CSIRT (e.g., via the MISP API interface), 2) snapshot creation over time for change identification, 3) machine learning-based snapshot analysis for feature and pattern detection, and 4) an improved visualization interface.
ACKNOWLEDGEMENTS
This research is conducted under the Japan Society for the Promotion of Science (JSPS) International Postdoctoral Research Fellowship, supported by the KAKENHI grant, and hosted by Prof. Kadobayashi's Cyber Resilience Laboratory at NAIST.
REFERENCES
Alarte, J., Insa, D., Silva, J., and Tamarit, S. (2018). Main
content extraction from heterogeneous webpages. In
Web Information Systems Engineering WISE 2018.
Springer.
Antoniuk, D. (2023). Russia wants to isolate its internet,
but experts warn it won’t be easy. https://therecord.
media/russia-internet-isolation-challenges. Accessed:
2024/03/05.
BigDataCloud Pty Ltd (2024). TOR Exit Nodes Geolo-
cated. https://www.bigdatacloud.com/insights/tor-exit-
nodes. Accessed: 2024/01/24.
Blumbergs, B. (2023). A multiple vantage point-based
concept for open-source information space awareness.
Technical report, The Institute of Electronics, Informa-
tion and Communication Engineers, Hokkaido, Japan.
Brand24 Global Inc. (2024). Brand24. https://brand24.com/.
Accessed: 2024/01/11.
Calzarossa, M. C. and Tessera, D. (2018). Analysis and
forecasting of web content dynamics. In 2018 32nd In-
ternational Conference on Advanced Information Net-
working and Applications Workshops (WAINA), pages
12–17.
Cloudflare (2024). Cloudflare Radar. https://radar.cloudflare.
com/domains. Accessed: 2024/01/09.
Crowder, E. and Lansiquot, J. (2021). Darknet data mining
- a canadian cyber-crime perspective. ArXiv (Draft
publication), abs/2105.13957.
Expert Insights (2024a). The Top 10 Cyber Threat Intelli-
gence Solutions. https://expertinsights.com/insights/
the-top-cyber-threat-intelligence-solutions/. Accessed:
2024/01/23.
Expert Insights (2024b). The Top 11 Dark Web Moni-
toring Solutions. https://expertinsights.com/insights/
the-top-dark-web-monitoring-solutions/. Accessed:
2024/01/23.
Flashpoint (2024). Flashpoint Ignite Platform. https:
//flashpoint.io/ignite/. Accessed: 2024/01/11.
G2.com Inc. (2024). Best threat intelligence services
providers. https://www.g2.com/categories/threat-
intelligence-services. Accessed: 2024/04/25.
Gartner Inc. (2024). Managed Detection and
Response Services Reviews and Ratings.
https://www.gartner.com/reviews/market/managed-
detection-and-response-services. Accessed:
2024/01/11.
Gibson, J., Wellner, B., and Lubar, S. (2007). Adaptive
web-page content identification. In Proceedings of the
9th Annual ACM International Workshop on Web In-
formation and Data Management, WIDM ’07, page
105–112, New York, NY, USA. Association for Com-
puting Machinery.
Google LLC (2024). Google Alerts. https://www.google.
com/alerts. Accessed: 2024/01/11.
Hootsuite Inc. (2024). Hootsuite. https://www.hootsuite.
com/. Accessed: 2024/01/11.
KELA (2024). KELA Cyber Threat Intelligence Platform.
https://www.kelacyber.com/. Accessed: 2024/01/11.
Kiesel, J., Kneist, F., Meyer, L., Komlossy, K., Stein, B.,
and Potthast, M. (2020). Web page segmentation re-
visited: Evaluation framework and dataset. In Pro-
ceedings of the 29th ACM International Conference on
Information and Knowledge Management, CIKM ’20,
page 3047–3054, New York, NY, USA. Association
for Computing Machinery.
Kohli, S., Kaur, S., and Singh, G. (2012). A website content analysis approach based on keyword similarity analysis. In Proceedings of The 2012 IEEE/WIC/ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT ’12, page 254–257, USA. IEEE Computer Society.
Krog, M. and Chababy, N. (2024). Phishing Domain Database. https://github.com/mitchellkrogza/Phishing.Database. Accessed: 2024/09/09.
Lawrence, H., Hughes, A., Tonic, R., and Zou, C. (2017).
D-miner: A framework for mining, searching, visual-
izing, and alerting on darknet events. In 2017 IEEE
Conference on Communications and Network Security
(CNS), pages 1–9.
Leung, C. K., Madill, E. W., and Singh, S. P. (2022). A web
intelligence solution to support recommendations from
the web. In IEEE/WIC/ACM International Conference
on Web Intelligence and Intelligent Agent Technology,
WI-IAT ’21, page 160–167, New York, NY, USA. As-
sociation for Computing Machinery.
Lewis, S. J. (2024). OnionScan. https://onionscan.org/.
Accessed: 2024/01/12.
Mandiant (2024). Digital Threat Monitoring.
https://www.mandiant.com/advantage/digital-threat-
monitoring. Accessed: 2024/01/11.
McGahagan, J., Bhansali, D., Gratian, M., and Cukier,
M. (2019a). A Comprehensive Evaluation of HTTP
Header Features for Detecting Malicious Websites. In
2019 15th European Dependable Computing Confer-
ence (EDCC), pages 75–82.
McGahagan, J., Bhansali, D., Pinto-Coelho, C., and Cukier,
M. (2019b). A Comprehensive Evaluation of Webpage
Content Features for Detecting Malicious Websites. In
2019 9th Latin-American Symposium on Dependable
Computing (LADC), pages 1–10.
Meltwater (2024). Meltwater Suite. https://www.meltwater.
com/. Accessed: 2024/01/11.
Meskauskas, T. (2024). What is Windows Defender
Security Center? https://www.pcrisk.com/removal-
guides/12537-windows-defender-security-center-
pop-up-scam. Accessed: 2024/03/05.
Nadee, W. and Prutsachainimmit, K. (2018). Towards data
extraction of dynamic content from javascript web ap-
plications. In 2018 International Conference on Infor-
mation Networking (ICOIN), pages 750–754.
Neuendorf, K. A. (2017). The Content Analysis Guidebook.
SAGE Publications, Inc., California.
OpenPhish (2024a). Global Phishing Activity. https://openphish.com/phishing_activity.html. Accessed: 2024/01/09.
OpenPhish (2024b). Timely. Accurate. Relevant Phish-
ing Intelligence. https://openphish.com. Accessed:
2024/01/09.
Pantelis, G., Petrou, P., Karagiorgou, S., and Alexandrou, D.
(2021). On Strengthening SMEs and MEs Threat Intel-
ligence and Awareness by Identifying Data Breaches,
Stolen Credentials and Illegal Activities on the Dark
Web. In Proceedings of the 16th International Confer-
ence on Availability, Reliability and Security, ARES
21, New York, NY, USA. Association for Computing
Machinery.
Pham, K., Santos, A., and Freire, J. (2016). Understanding
website behavior based on user agent. In Proceedings
of the 39th International ACM SIGIR Conference on
Research and Development in Information Retrieval,
SIGIR ’16, page 1053–1056, New York, NY, USA.
Association for Computing Machinery.
Pujar, M. and Mundada, M. R. (2021). A systematic review
web content mining tools and its applications. Inter-
national Journal of Advanced Computer Science and
Applications.
Russell, M. A. and Klassen, M. (2018). Mining the Social
Web, Third Edition. O’Reilly Media, Inc., California.
Samtani, S., Li, W., Benjamin, V., and Chen, H. (2021).
Informing cyber threat intelligence through dark web
situational awareness: The azsecure hacker assets por-
tal. Digital Threats, 2(4).
Shadowserver Foundation (2024). Darknet events report.
https://www.shadowserver.org/what-we-do/network-
reporting/honeypot-darknet-events-report/. Accessed:
2024/01/12.
Similarweb (2024). Similarweb Top Websites Ranking.
https://www.similarweb.com/top-websites/. Accessed:
2024/01/09.
SISSDEN (2024). Secure Information Sharing Sensor Deliv-
ery Event Network Blog. https://sissden.eu/. Accessed:
2024/01/12.
SK-CERT (2024). Taranis NG. https://github.com/SK-
CERT/Taranis-NG. Accessed: 2024/01/11.
Song, Y.-D., Gong, M., and Mahanti, A. (2020). Measure-
ment and analysis of an adult video streaming service.
In Proceedings of the 2019 IEEE/ACM International
Conference on Advances in Social Networks Analysis
and Mining, ASONAM ’19, page 489–492, New York,
NY, USA. Association for Computing Machinery.
Suganya, E. and Vijayarani, S. (2020). Content based web
search and information extraction from heterogeneous
websites using ontology. International Journal of Sci-
entific and Technology Research.
Takaaki, S. and Atsuo, I. (2019). Dark web content anal-
ysis and visualization. In Proceedings of the ACM
International Workshop on Security and Privacy Ana-
lytics, IWSPA ’19, page 53–59, New York, NY, USA.
Association for Computing Machinery.
Thorleuchter, D. and Van den Poel, D. (2012). Using we-
bcrawling of publicly available websites to assess e-
commerce relationships. In 2012 Annual SRII Global
Conference, pages 402–410.
Wan, G., Izhikevich, L., Adrian, D., Yoshioka, K., Holz, R.,
Rossow, C., and Durumeric, Z. (2020). On the origin
of scanning: The impact of location on internet-wide
scans. In Proceedings of the ACM Internet Measure-
ment Conference, IMC ’20, page 662–679, New York,
NY, USA. Association for Computing Machinery.
Yang, J., Ye, H., and Zou, F. (2020). pyDNetTopic: A Frame-
work for Uncovering What Darknet Market Users Talk-
ing About. In Park, N., Sun, K., Foresti, S., Butler, K.,
and Saxena, N., editors, Security and Privacy in Com-
munication Networks, pages 118–139, Cham. Springer
International Publishing.