ceives data from FileBeat and also from production
databases (see sec. 3) and provides a fault-tolerant,
durable data stream to the next module, devoted
to data processing, namely Logstash. Kafka supports
the standard publish-subscribe mechanism to manage
its input sources (publishers) as well as its outputs (sub-
scribers); it also arranges data into topics, logical
groupings of messages sharing common features. Each
topic contains a set of partitions, each being a se-
quence of records; a record is the atomic element
used to build the stream. Partitions increase paral-
lelism, which is especially useful with multiple con-
sumers (Logstash instances in our case), and allow
the amount of data held for a given topic to scale. Parti-
tions are generally spread over a set of servers for fault-
tolerance purposes; which partition is managed by which
server, and which consumer is associated with which
partition, are completely configurable to achieve the
best performance.
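To make the topic/partition model concrete, the following minimal sketch creates such a topic with the kafka-python client; the broker address, topic name, partition count, and replication factor are illustrative assumptions, not values from our deployment:

    # Illustrative sketch (kafka-python); names and counts are assumptions.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    # Six partitions let up to six consumers in the same group read in
    # parallel; replication_factor=3 keeps each partition on three
    # servers for fault tolerance.
    admin.create_topics([NewTopic(name="machine-logs",
                                  num_partitions=6,
                                  replication_factor=3)])
    admin.close()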
Kafka durably persists all published records
for a specific retention time; when it expires, records
are deleted to free space. Subscribers (Logstash in-
stances) can be configured to consume records from
a specific offset position established by the subscriber
itself, implementing a flexible tail-like operation
that tailors stream consumption to the subscriber's
actual capabilities; moreover, Kafka includes an
acknowledgment mechanism to guarantee that
consumers have received their data. The management of
offset positions is handled by ZooKeeper (Apache
Software Foundation, 2020e), a coordination service
for distributed applications that, like Kafka, is part of
the Apache Software Foundation. ZooKeeper manages
the set of servers running Kafka, preventing race condi-
tions and deadlocks, and exploits the servers' states to
redirect a client whenever its server is down or
the connection is lost.
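The offset-based, tail-like consumption described above can be sketched as follows (again with kafka-python; the topic, partition, and offset are purely illustrative):

    # Illustrative sketch (kafka-python); topic and offset are assumptions.
    from kafka import KafkaConsumer, TopicPartition

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                             enable_auto_commit=False)
    tp = TopicPartition("machine-logs", 0)
    consumer.assign([tp])    # manual partition assignment
    consumer.seek(tp, 42)    # resume from an offset chosen by the subscriber
    for record in consumer:  # stream records from that position onwards
        print(record.offset, record.value)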
2.2 Elaborating Data
The next stage of log analysis is the processing of
data, carried out by the Logstash module (Elastic.co,
2020e). As soon as data arrives (from Kafka in our ar-
chitecture), an event is triggered and stored in a queue,
from which a thread (one for each input source) peri-
odically fetches a batch of events, processes them using
custom filters, and delivers the processed data to other
modules. In the input stage, both the batch size and the
number of running threads are fully configurable, and
the queue itself can be made persistent (on disk rather
than in memory) to cope with unforeseen Logstash
crashes.
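These knobs live in Logstash's settings file; a minimal logstash.yml sketch with illustrative values:

    # logstash.yml -- illustrative values, not our production settings
    pipeline.workers: 4        # threads running the filter/output stages
    pipeline.batch.size: 125   # events fetched from the queue per batch
    queue.type: persisted      # disk-backed queue, survives crashes
    queue.max_bytes: 1gb       # cap on the persistent queue size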
The filter stage is the Logstash core, where several
predefined, customizable filters are available to allow
data to be structured, modified, added, or discarded.
Filters range from standard regular-expression pattern
matching to string-to-number conversion (and vice
versa), checking IP addresses against a list of network
blocks or converting them to geolocation information,
data splitting (e.g., from CSV format) or merging, date
conversion, parsing of unstructured event data into
fields, a JSON format plug-in, and many others. The
last stage of the Logstash pipeline is the output, where
processed data are sent to other modules such as Elastic-
search (as occurs in our architecture, see fig. 1),
files, services, or tools; output plug-ins are available
to support specific connections.
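A minimal pipeline configuration tying the three stages together might look as follows; the topic, grok pattern, and index name are illustrative assumptions, not the paper's actual filters:

    input {
      kafka {
        bootstrap_servers => "localhost:9092"
        topics => ["machine-logs"]
      }
    }
    filter {
      # parse unstructured event data into named fields
      grok {
        match => { "message" => "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
      }
      # use the parsed timestamp as the event's canonical date
      date { match => ["ts", "ISO8601"] }
    }
    output {
      elasticsearch {
        hosts => ["http://localhost:9200"]
        index => "machine-logs-%{+YYYY.MM.dd}"
      }
    }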
2.3 Visualizing and Mining Processed
Data
The last module of the proposed architecture includes
two different tools: the former, Elasticsearch, is
used to index and search the data previously processed
by the Logstash module, whereas the latter, Kibana,
provides an effective and fruitful visualization of the pro-
cessed data. Elasticsearch (Elastic.co, 2020b) is an
engine that uses standard RESTful APIs and JSON;
it allows indexing and searching the stored data coming
from Logstash.
Elasticsearch runs as a distributed, horizontally
scalable, and fault-tolerant cluster of nodes; indices
can be split into shards over separate nodes to im-
prove both search operations and the system's fault
tolerance.
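Sharding is decided when an index is created; a minimal sketch with the official elasticsearch Python client (index name and counts are illustrative):

    # Illustrative sketch (elasticsearch-py 8.x); values are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    # Three primary shards spread the index over separate nodes;
    # one replica per shard adds fault tolerance.
    es.indices.create(index="machine-logs",
                      settings={"number_of_shards": 3,
                                "number_of_replicas": 1})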
Elasticsearch can be used for several purposes, such as:
• log monitoring, adopting an old-style tail -f visu-
alization
• infrastructure monitoring, where either predefined
or customizable metrics can be leveraged to extract
and highlight relevant information
• application performance and availability monitor-
ing
Each document inside Elasticsearch is represented as
a JSON object with a key-value pair for each field;
APIs are provided to support full CRUD operations
over documents; optimized relevance models and
support for typo tolerance, stemming, and bigrams are
included. Full-text search is performed by leverag-
ing the Lucene engine library (Apache Software Foun-
dation, 2020c).
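Indexing a document and running a typo-tolerant full-text query can be sketched as follows (field names and values are illustrative):

    # Illustrative sketch (elasticsearch-py 8.x); data are assumptions.
    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    # each document is a JSON object with a key-value pair per field
    es.index(index="machine-logs", id="1",
             document={"level": "ERROR", "msg": "wafer alignment failed"})
    es.indices.refresh(index="machine-logs")  # make the document searchable
    # "alignmnet" is deliberately misspelled: "fuzziness" enables the
    # typo tolerance mentioned above
    hits = es.search(index="machine-logs",
                     query={"match": {"msg": {"query": "alignmnet",
                                              "fuzziness": "AUTO"}}})
    print(hits["hits"]["total"])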
The second tool included in this module is
Kibana (Elastic.co, 2020d), which allows us to visual-
ize and navigate the data managed by Elasticsearch, pro-
viding a complete set of tools such as histograms, line
graphs, pie charts, maps, time series, graphs, tables,
tag clouds, and others. Besides, it comes with ma-
chine learning capabilities that help, e.g., in time-se-
ries forecasting and anomaly detection; finally, all