3 RELATED WORK
Web server fingerprinting is a widely studied topic.
Early studies leveraged differences in TCP/IP stack
implementation for fingerprinting servers. For ex-
ample, host operating system identification based on
analysis of encrypted communication was introduced
by Beverly (Beverly, 2004). Shamsi and Loguinov proposed
to automatically generate server signatures based on
TCP/IP packets for large-scale fingerprinting (Shamsi
and Loguinov, 2017).
Differences in network system implementation
were also leveraged by Yang et al. (Yang et al., 2019)
for fingerprinting IoT devices. The approach relied
on a neural network classification model built with
features extracted from the network, transport, and
application layers. He et al. (He et al., 2022)
introduced another approach, designing a set of IoT
platform fingerprinting workflows based on traffic
analysis. The authors manually analyzed the deciphered
traffic and found that some traffic in IoT platforms
using private protocols had clearly distinguishable
characteristics.
Significant research has been done in the area of
browser fingerprinting. Browser fingerprinting is the
process of collecting data from a client’s web browser
in order to create a fingerprint of the device
(Laperdrix et al., 2020). It usually gathers a large
amount of data about a user’s device, ranging from
hardware to operating system to browser configuration
(e.g., device model, operating system, screen
resolution, time zone, preferred language setting,
browser version, and technical specifications of the
CPU and graphics card).
As opposed to browser fingerprinting, web server
fingerprinting aims to determine the software
characteristics of the server. Lee et al. were among
the first to point out that different web servers
implement the HTTP response differently, even though
the RFC specification outlines the proper HTTP
response (Lee et al., 2002). They developed HMAP, an
automated tool that uses the characteristics of HTTP
messages to determine the identity of an HTTP server
with high reliability. For fingerprinting web servers,
three types of characteristics from HTTP responses
were taken into consideration: syntactic, semantic,
and lexical. HMAP sends variations of GET and HEAD
request lines that use the wrong capitalization of the
protocol name, an incorrect version, or long URIs, and
compares each of the responses with a list of known
server characteristics. The tool does not take other
available HTTP methods (e.g., DELETE, TRACE) into
consideration. The approach is based on the explicit
assumption that the Server header is present and
provides trustworthy information.
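The kind of request-line variation described above can be sketched as
follows. This is a minimal illustration, not HMAP's actual probe set: the
function name `build_request_variants`, the variant names, the bogus
version `HTTP/9.8`, and the default URI length are all assumptions made
for this example.

```python
from __future__ import annotations


def build_request_variants(host: str, long_uri_length: int = 1024) -> dict[str, bytes]:
    """Return raw HTTP requests keyed by a descriptive (hypothetical) variant name."""
    # A long URI built from a repeated filler character (length is an assumption).
    long_uri = "/" + "a" * long_uri_length
    request_lines = {
        "baseline_get": "GET / HTTP/1.1",
        "baseline_head": "HEAD / HTTP/1.1",
        # Protocol name written with the wrong capitalization.
        "lowercase_protocol": "GET / http/1.1",
        # A protocol version that no server implements.
        "unknown_version": "GET / HTTP/9.8",
        "long_uri": f"GET {long_uri} HTTP/1.1",
    }
    return {
        name: f"{line}\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode("ascii")
        for name, line in request_lines.items()
    }
```

Each value can be written to a TCP socket as-is; the responses to the
malformed variants are what a fingerprinting tool would compare against
its table of known server behaviors.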
The study performed by Shah applied the HTTPrint
tool to analyze web server fingerprinting (Shah,
2003b). The primary focus of this work was the
analysis of server banners from common web servers.
Only a few HTTP requests were considered, including
DELETE, an improper HTTP version, and a junk request.
Shrivastava (Shrivastava, 2011) provides examples
of fingerprinting mechanisms such as HTML data
inspection, detecting the presence of files based on
HTTP response codes, and checksum-based
identification. The author focused on fingerprinting
at the application level.
Auger outlined fingerprinting techniques based on
web architecture, server, application software, and
back-end database version. Banner grabbing from HTTP
responses was highlighted, as server headers are
likely to reveal identifying information, e.g.,
intermediate agents, the Via header, server version,
and error pages (Auger, 2009). The study analyzed the
lexical, syntactic, and semantic information provided
in HTTP responses produced by abnormal requests.
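As an illustration of banner grabbing, the sketch below pulls the Server
and Via headers out of a raw HTTP response. The function name
`extract_banner_headers` and the constant `BANNER_HEADERS` are
hypothetical, and the example assumes the response head uses CRLF line
endings.

```python
from __future__ import annotations

# Headers treated as banner sources in this sketch (an assumption; the two
# named in the text above).
BANNER_HEADERS = ("server", "via")


def extract_banner_headers(raw_response: bytes) -> dict[str, str]:
    """Return the banner-style headers found in a raw HTTP response."""
    # Everything before the first blank line is the status line plus headers.
    head, _, _ = raw_response.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    banners: dict[str, str] = {}
    for line in lines[1:]:  # lines[0] is the status line
        name, separator, value = line.partition(":")
        key = name.strip().lower()
        if separator and key in BANNER_HEADERS:
            banners[key] = value.strip()
    return banners
```

Header names are compared case-insensitively because HTTP field names are
case-insensitive; a server that omits or falsifies these headers defeats
this approach, which is why banner information alone is considered weak
evidence.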
Lavrenovs and Melón (Lavrenovs and Melón, 2018)
carried out an analysis of websites extracted from
Alexa’s top one million list and presented research on
the security of the most popular websites. Although
the study was not focused on server fingerprinting, it
provided insight into how much information can be
revealed through server-side headers. The analysis
reached two conclusions: a) the more popular domains
leak less information and b) HTTP sites are less
restrictive than HTTPS-served sites in terms of the
information that they provide, mostly for
server-related headers.
The study conducted by Book et al. (Book et al.,
2013) applied machine learning techniques to generate
server fingerprints automatically. The authors used
Bayesian inference without building initial server
features. They sent a set of 10 specialized HTTP
requests to 110,000 live servers. The analysis was
performed on the response codes and MIME types
returned by the server. The authors calculated a
unique fingerprint for each type of web server and
then matched the responses of unknown web servers
against the developed fingerprint set.
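To make the idea of matching on response codes and MIME types concrete,
the sketch below uses a naive Bayes classifier with Laplace smoothing.
This is a swapped-in, simplified technique chosen for illustration and is
not claimed to be the model used by Book et al.; the class name
`NaiveBayesFingerprinter`, the observation layout, and the smoothing
parameter `alpha` are assumptions.

```python
from __future__ import annotations

import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# One observation per probe: (probe name, status code, MIME type).
Observation = Tuple[str, int, str]


class NaiveBayesFingerprinter:
    """Illustrative naive Bayes matcher over (probe, code, MIME) observations."""

    def __init__(self, alpha: float = 1.0) -> None:
        self.alpha = alpha  # Laplace smoothing constant (assumption)
        self.feature_counts: Dict[str, Counter] = defaultdict(Counter)
        self.label_counts: Counter = Counter()
        self.vocabulary: set = set()

    def train(self, label: str, observations: Iterable[Observation]) -> None:
        """Record the observations from one server known to run `label`."""
        self.label_counts[label] += 1
        for observation in observations:
            self.feature_counts[label][observation] += 1
            self.vocabulary.add(observation)

    def log_posteriors(self, observations: Iterable[Observation]) -> Dict[str, float]:
        """Return an unnormalized log posterior for every known label."""
        observations = list(observations)
        total_servers = sum(self.label_counts.values())
        vocabulary_size = len(self.vocabulary) + 1  # extra bucket for unseen values
        scores: Dict[str, float] = {}
        for label, server_count in self.label_counts.items():
            counts = self.feature_counts[label]
            denominator = sum(counts.values()) + self.alpha * vocabulary_size
            score = math.log(server_count / total_servers)  # log prior
            for observation in observations:
                score += math.log((counts[observation] + self.alpha) / denominator)
            scores[label] = score
        return scores

    def classify(self, observations: Iterable[Observation]) -> str:
        """Return the label with the highest log posterior."""
        scores = self.log_posteriors(observations)
        return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when many probes are
multiplied together, and the smoothing term keeps a single unseen
response from driving a candidate's probability to zero.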
Techniques for detecting web servers from the
banner information, HTTP response characteristics
(order of the Server and Date headers), and special HTTP
requests were introduced by Huang et al. (Huang
et al., 2015). Through special HTTP requests which
HTTPFuzz: Web Server Fingerprinting with HTTP Request Fuzzing