This type of strategy has several advantages:
• Reduces the effects of data sparsity and allows
training of simple models using a small amount
of data, without overfitting - see section 4;
• Enables macro-level analysis for a specific
instance, instead of triggering alerts for individual
events, which are often not informative. Usually,
a security compromise will trigger multiple alerts
whenever probing and lateral movement attempts
start. Macro-level analysis makes it easy to spot
spikes in alerts. Instances that are likely to be
affected by the intrusion are the ones interlinked
with the breached instance. This means that you
can also benefit from grouping together instances
that are dependent (we call them “accounts”), but
it is not mandatory to do so;
• Makes it easier for security analysts to go over the
alerts and see what triggered them by looking at the
labels.
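As a minimal illustration of the macro-level analysis mentioned above, the following sketch counts alerts per instance (or per account, when a grouping is supplied) and flags the ones whose count in the latest time window is far above their own history. The function names, the (instance, window) representation of an alert and the z-score threshold are assumptions made for illustration; they are not the method described in this paper.

```python
from collections import Counter
from statistics import mean, pstdev


def alerts_per_group(alerts, account_of=None):
    """Count alerts per (group, window) pair.

    `alerts` is an iterable of (instance_id, window_index) tuples.
    `account_of` optionally maps an instance to its account; instances
    without a mapping are treated as their own group. Both structures
    are hypothetical stand-ins for real pipeline output.
    """
    account_of = account_of or {}
    return Counter((account_of.get(inst, inst), win) for inst, win in alerts)


def spiking_groups(alerts, current_window, account_of=None, min_sigma=3.0):
    """Return groups whose alert count in `current_window` exceeds the
    mean of the earlier windows by more than `min_sigma` standard
    deviations. The rule and its default threshold are assumptions.
    """
    counts = alerts_per_group(alerts, account_of)
    groups = sorted({group for group, _ in counts})
    flagged = []
    for group in groups:
        history = [counts.get((group, w), 0) for w in range(current_window)]
        if not history:
            continue  # no earlier windows to compare against
        baseline = mean(history)
        spread = pstdev(history) or 1.0  # flat history: fall back to 1
        z = (counts.get((group, current_window), 0) - baseline) / spread
        if z > min_sigma:
            flagged.append(group)
    return flagged
```

Grouping is optional here, mirroring the remark that aggregating dependent instances into accounts helps but is not mandatory.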
2 RELATED WORK
Machine learning applied to the field of security has
received growing interest in recent years. Some
interesting contributions include behavior-based
analysis of malware, high-level feature generation
using various deep learning methods (e.g. vector
quantization), intrusion detection systems, malware
signature generation and many others (Noor et al.,
2019; Zhou et al., 2020; Gomathi et al., 2020; Das et
al., 2020; Gibert et al., 2020; Piplai et al., 2020).
When it comes to SIEM-based solutions, static
rules and machine learning (ML) for anomaly
detection are used as the primary filtering and alerting
mechanism. Their application is limited to specific
use cases. For instance, (Anumol, 2015) introduces a
statistical ML model for intrusion detection based on
network logs, while (Shi et al., 2018) use deep
learning to predict whether a domain is malicious.
(Feng et al., 2017) present a user-centred ML model
designed to reduce the number of false positive alerts
generated by static rules.
Judging by the number of research papers (Idhammad
et al., 2018; Suresh and Anitha, 2011; Zekri
et al., 2017; Osanaiye et al., 2016), Distributed
Denial of Service (DDoS) has received significantly
more attention, probably because the successful
execution of this type of attack results in major
service outages. However, none of these works
addresses the issue of finding indicators of
compromise that can firmly establish that a system was
compromised and that the attacker has control of it.
The work of (Hendler et al., 2018) is related to our
own research. However, there are several major
differences: the authors target PowerShell activity,
focus on the command line only rather than on other
attributes of the event, and use a purely supervised
approach.
It is also important to note that we focus on
aggregating risk scores at instance level, instead of
alerting on every single anomaly. A similar effort is
described by (Bryant and Saiedian, 2020). They propose
adding metadata to the cyber kill chain, following a
divide-and-conquer approach to different kinds of
system activities and their combinations. On the other
hand, a leading SIEM vendor in the risk-based alerting
(RBA) space tries to take this a step further by
monitoring system activity through a combination of
multiple data sources to look across the board¹. This
increases the likelihood of catching anomalous
activity, be it operational or an actual security
threat. However, these solutions still do not solve
the problem of static rules and the constant
intervention required from security analysts to
maintain them.
3 DATASET DESCRIPTION
The data involved in this research can generically be
described as host activity data, where by host we
mean an individual computing resource. In our
case, the computing resources are virtual machines in
the public cloud, which are equipped with a software
agent called Hubble². The role of the agent is to
collect information from the computing instance and to
send it to a centralized log management solution. The
agent collects data such as recorded users,
command-line history, outbound connections, processes
being executed, environment variables, critical files
modified and so on.
The work presented in this paper uses the data
extracted by Hubble for running processes. The three
main categories (source types) of events are: (a)
running processes; (b) running processes listening for
network connections; (c) running processes with
established outbound network connections.
The latter two source types overlap with
the first one, but they provide additional information,
such as source port/IP, destination and listening port.
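The relationship between the three source types can be sketched as a single record type in which the network attributes are optional. This is only an illustration: the class name, the field names and the helper method are hypothetical and do not reproduce the actual Hubble output format.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SourceType(Enum):
    PROCESS = "a"      # running processes
    LISTENING = "b"    # running processes listening for network connections
    ESTABLISHED = "c"  # running processes with established outbound connections


# Network attributes that only source types (b) and (c) are expected to carry.
NETWORK_FIELDS = ("source_ip", "source_port", "destination", "listening_port")


@dataclass(frozen=True)
class ProcessEvent:
    # All field names are hypothetical, chosen for illustration only.
    source_type: SourceType
    host: str
    process: str
    source_ip: Optional[str] = None
    source_port: Optional[int] = None
    destination: Optional[str] = None
    listening_port: Optional[int] = None

    def network_info(self) -> dict:
        """Return only the network attributes that are actually set."""
        return {
            name: getattr(self, name)
            for name in NETWORK_FIELDS
            if getattr(self, name) is not None
        }


# The same process seen through source type (a) and source type (b).
plain = ProcessEvent(SourceType.PROCESS, "vm-1", "sshd")
listening = ProcessEvent(SourceType.LISTENING, "vm-1", "sshd", listening_port=22)
```

Under this assumed layout, an event of type (b) or (c) is an event of type (a) with extra attributes filled in, which is one way to read the overlap described above.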
For clarity, we will enumerate all the fields present
in the meta-data associated with running processes:
¹ https://conf.splunk.com/files/2019/slides/SEC1803.pdf - Last Accessed 2020-10-31.
² https://hubblestack.io/
A Principled Approach to Enriching Security-related Data for Running Processes through Statistics and Natural Language Processing