information is available for the investigation, if only
network traffic is inspected. Thus, alternative data
sources need to be leveraged to effectively perform
anomaly detection. To the best of our knowledge, no
online IDS is available that processes log data to
perform anomaly detection.
Anomaly-based IDS normally implement white-listing
mechanisms; they are also called behavior-based
approaches because they compare the activities of a
computer network or system against a model that
characterizes the normal behavior. This reference is
considered to be “anomaly-free” and is also known as
ground truth. Today’s rapidly changing cyber threat
landscape calls for flexible and self-adaptive IDS
approaches. Thus, to build a system behavior model,
semi-supervised self-learning methods can be applied,
whose main strength is the capability to detect
unknown attacks without requiring attack signatures.
By leveraging self-learning and white-listing methods,
it is possible to implement anomaly detection systems
that are independent of the semantic interpretation
of the processed log data and of the syntax of the
data. Since the system behavior model
can be built automatically, the only requirement is
that the syntax of the log lines does not change over
time, as this would be considered a deviation from
the normal system behavior and therefore marked as
an anomaly. This is feasible because log data, unlike
free text, follows a predefined structure. Normally,
a log message comprises static chunks, i.e.,
constant strings that occur often in the log sequence
(e.g., prepositions, commands, protocol names), and
variable chunks occurring less frequently (e.g., IP ad-
dresses, host names, TCP port numbers). Moreover,
the order in which different parts occur in a log line is
deterministic and does not vary.
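The distinction between static and variable chunks can be illustrated with a minimal Python sketch. It is not the parser model of the approach described here; it only assumes that all sample lines belong to the same event type and contain the same number of whitespace-separated tokens. The names infer_template, matches and WILDCARD, as well as the sample log lines, are hypothetical.

```python
from typing import List

WILDCARD = "<*>"  # hypothetical placeholder marking a variable chunk


def infer_template(lines: List[str]) -> List[str]:
    """Keep tokens that are identical in every line; mark the rest as variable.

    Assumption: all lines describe the same event type and have the same
    number of whitespace-separated tokens.
    """
    tokenized = [line.split() for line in lines]
    if not tokenized:
        return []
    length = len(tokenized[0])
    if any(len(tokens) != length for tokens in tokenized):
        raise ValueError("lines must have the same number of tokens")
    template = []
    for position in range(length):
        values = {tokens[position] for tokens in tokenized}
        # A single distinct value means the chunk is static at this position.
        template.append(values.pop() if len(values) == 1 else WILDCARD)
    return template


def matches(template: List[str], line: str) -> bool:
    """Check whether a line has the same static chunks in the same order."""
    tokens = line.split()
    return len(tokens) == len(template) and all(
        expected == WILDCARD or expected == actual
        for expected, actual in zip(template, tokens)
    )


# Made-up sample lines, used only for illustration.
sample = [
    "Connection from 10.0.0.5 port 50122",
    "Connection from 10.0.0.9 port 41876",
    "Connection from 192.168.1.20 port 50122",
]
template = infer_template(sample)
```

In this sketch, a line whose static chunks or token order differ from the learned template does not match, which corresponds to the syntax change that would be marked as an anomaly.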
Another benefit of self-learning and white-listing
approaches utilized for anomaly detection is their in-
trinsic flexibility. They can be applied to process logs
produced by legacy systems and by appliances with
small market shares (like those largely employed in
cyber-physical systems (CPS)). Such systems often
lack vendor support and suffer from poor documentation.
Usually, security solution providers do not supply
signatures for monitoring these systems, nor do
vendors make patches available to keep the systems
up to date. Furthermore, modern networks, such as
the Internet of Things (IoT), comprise many devices
that are often not powerful enough to allocate the
large amounts of resources required to run anomaly
detection tools. Thus, light-weight anomaly detection
mechanisms that can operate while consuming a minimal
amount of memory and CPU are necessary. One
way to meet this requirement is to adopt a decentralized
architecture. Different light-weight processing
instances can be distributed across the infrastructure,
while a central control instance, running on a powerful
machine, executes resource-intensive operations
(such as machine-learning functions) and allows an
administrator to control the distributed light-weight
instances.
3.1 Detectable Anomalies
For an anomaly detection method to be considered
effective, it has to recognize different types of
anomalies with a certain level of confidence. In the
following, we list the main categories of anomalies
that a modern anomaly detection tool should be
capable of revealing.
The simplest type of anomaly is represented by
anomalous single events. On the one hand, these
can be so-called outliers representing rarely occurring
events, which appear so seldom that they are not part
of the normal system behavior model. On the other
hand, these anomalies can be events that carry
prohibited parameter values; for example, a client access
executed through a user agent that is normally not
utilized in a network and is therefore not white-listed.
With black-listing approaches, by contrast, user agents
that are not allowed need to be added one by one to the
blacklist, which implies a high risk of incompleteness.
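The white-listing idea described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the values seen during an anomaly-free training phase form the whitelist, and any value not seen before is reported. The class ValueWhitelist and the example user agent strings are hypothetical.

```python
from typing import Iterable, Set


class ValueWhitelist:
    """Learn permitted values from anomaly-free data, then flag unseen ones."""

    def __init__(self) -> None:
        self.allowed: Set[str] = set()

    def learn(self, values: Iterable[str]) -> None:
        # Training phase: every observed value is treated as normal behavior.
        self.allowed.update(values)

    def is_anomalous(self, value: str) -> bool:
        # Detection phase: anything outside the learned set is an anomaly.
        return value not in self.allowed


# Made-up user agent strings, used only for illustration.
training_agents = ["Mozilla/5.0", "curl/7.68.0", "Mozilla/5.0"]
whitelist = ValueWhitelist()
whitelist.learn(training_agents)
```

Unlike a blacklist, this sketch needs no list of known-bad user agents: a previously unseen agent is flagged without anyone having enumerated it in advance.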
Anomalous event parameters are point anomalies
such as IP addresses, port numbers or software ver-
sions that are not white-listed and therefore are not
part of the normal system behavior. This type of
anomaly includes, for example, events that occur
outside of business hours, or are triggered by accounts
of employees who are on vacation.
Anomalous single event frequencies concern events
that are usually considered normal but occur with an
anomalous frequency. For example, in the case of data
theft, an anomalously high number of database ac-
cesses from a single client would be recorded in the
log data, triggering an anomaly.
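A frequency check of this kind can be sketched with a sliding time window per client. The sketch below is an assumption-laden illustration, not the detection mechanism of the approach described here: the class FrequencyMonitor, the window length and the event limit are hypothetical, and timestamps are assumed to arrive in non-decreasing order. In a self-learning setting, the limit would be derived from training data rather than fixed by hand.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class FrequencyMonitor:
    """Flag a client whose event count within a time window exceeds a limit."""

    def __init__(self, window_seconds: float, max_events: int) -> None:
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._timestamps: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, client: str, timestamp: float) -> bool:
        """Record one event and return True if the frequency is anomalous.

        Assumption: timestamps per client are non-decreasing.
        """
        recent = self._timestamps[client]
        recent.append(timestamp)
        # Drop events that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) > self.max_events


# Hypothetical limit: at most 3 database accesses per client within 60 seconds.
monitor = FrequencyMonitor(window_seconds=60.0, max_events=3)
flags = [monitor.observe("10.0.0.5", t) for t in (0.0, 10.0, 20.0, 30.0, 200.0)]
```

Each individual access is normal on its own; only the fourth access within the same window is flagged, and the count resets once older events leave the window.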
Anomalous event sequences are anomalies that can be
revealed by observing the dependency between related
events. Such a dependency can be formalized by defining
correlation rules. A correlation rule describes a
series of events that have to occur in an ordered
sequence, within a given time window, to be considered
non-anomalous. To detect more complex anomalous
processes, which may involve different systems on
a network, multiple log lines need to be examined.
After a particular log line type (recording a condi-
tioning event) is observed, another specific log line
(recording the expected implied event) has to occur
within a predefined time slot. Otherwise, an anomaly
AECID: A Self-learning Anomaly Detection Approach based on Light-weight Log Parser Models