A Principled Approach to Enriching Security-related Data for Running Processes through Statistics and Natural Language Processing

Tiberiu Boros, Andrei Cotaie, Kumar Vikramjeet, Vivek Malik, Lauren Park and Nick Pachis

Adobe Systems, Romania

Adobe Systems, U.S.A.

Formerly Adobe Systems, U.S.A.

Keywords: Infrastructure, Machine Learning, Statistical Approach, Natural Language Processing, Labeling, Tagging, Security, Process, Process Metadata, Enriching Data, Hubble Stack, Risk Based Anomaly Detection.

Abstract: We propose a principled method of enriching security-related information for running processes. Our methodology applies to large organizational infrastructures, where information is properly collected and stored. The data we use is based on the Hubble Stack (an open-source project), but any alternative solution that provides the same type of information will suffice. Using statistical and natural language processing (NLP) methods we enrich our data with tags and we provide an analysis of how these tags can be used in Machine Learning approaches for anomaly detection.

1 INTRODUCTION

Logging and monitoring are the most common tools used by security teams across organizations with the purpose of detecting breaches and reacting to incidents. In simple environments, rule-based systems are able to quickly identify and alert on suspicious events. However, in complex cloud-based infrastructures, the inter-dependencies between various cloud components (storage, computing, database, authentication etc.) make it hard to detect systems that operate outside of normal parameters. Also, maintenance or quick-fix operational events represent a large portion of the noise that poisons the data and adds overhead for the security teams. In addition to that, logging and verbosity are not necessarily standardised across the entire infrastructure, and this can be treated as a separate topic.

Arguably, a deterministic process for well-documented infrastructures can be as basic as: use a list of allowed processes for each instance and a table for scheduled execution, and alert on any event that contradicts these rules. In practice, this is not feasible for most infrastructures (a fact demonstrated by the time-lapse between the incidents and their detection) and the most common reasons are:

• Development cycles for applications and cloud services often lack behavioral documentation until the final (production) stage;

• Manual interventions such as planned maintenance and quick fixes for critical system outages (CSOs) often impact normal system behavior. As a rule of thumb, the larger the infrastructure the more noise/false positives you get in your data;

• Third party software that undergoes upgrades will often behave differently after an update. Also, there are not many cases where you receive a complete specification that documents all the changes/operations a given application will cause to your instances;

• Most of the open-source stack will often rely on scripts that run other applications for upgrade, backup and synchronization (curl, wget, git, cc, g++, awk, diff, etc.);

• Last but not least, rule-based systems are hard to maintain and often fall behind the number of changes that affect your cloud infrastructure.

As such, heuristic and statistical methods are always

welcomed as an additional security layer, since they

are low-maintenance and, if implemented properly,

are able to scale with your infrastructure.

We propose a different approach to cloud-based

anomaly detection in running processes: (a) en-

rich the data with labels, (b) automatically analyze

the labels to establish their importance and assign

weights and (c) score events and instances using these

weights.


Boros, T., Cotaie, A., Vikramjeet, K., Malik, V., Park, L. and Pachis, N.

A Principled Approach to Enriching Security-related Data for Running Processes through Statistics and Natural Language Processing.

DOI: 10.5220/0010381401400147

In Proceedings of the 6th International Conference on Internet of Things, Big Data and Security (IoTBDS 2021), pages 140-147

ISBN: 978-989-758-504-3

This type of strategy has several advantages:

• Reduces the effects of data sparsity and allows

training of simple models using a small amount

of data, without overﬁtting - see section 4;

• Enables the macro-level analysis for a speciﬁc in-

stance, instead of triggering alerts for individual

events, that are often not informative. Usually,

a security compromise will trigger multiple alerts

whenever probing and lateral movement attempts

start. Macro-level analysis makes it easy to spot

spikes in alerts. Instances that are likely to be

affected by the intrusion are the ones interlinked

with the breached instance. This means that you

can also beneﬁt from grouping together instances

that are dependent (we call them “accounts”), but

it is not mandatory to do so;

• Makes it easier for security analysts to go over the alerts and see what triggered them by looking at the labels.

2 RELATED WORK

Machine learning applied in the ﬁeld of security has

received a growing interest in recent years. Some in-

teresting contributions include behavioral based anal-

ysis of malware, high-level feature generation using

various deep learning methods (e.g. vector quantiza-

tion), intrusion detection systems, malware signature

generation and many others (Noor et al., 2019; Zhou

et al., 2020; Gomathi et al., 2020; Das et al., 2020;

Gibert et al., 2020; Piplai et al., 2020).

When it comes to SIEM-based solutions, static rules and machine learning (ML) for anomaly detection are used as a primary filtering and alerting mechanism. Their application is narrowed to specific use-cases. For instance, (Anumol, 2015) introduces a statistical ML model for intrusion detection based on network logs, while (Shi et al., 2018) use deep learning to predict if a domain is malicious or not. (Feng et al., 2017) present an ML user-centred model designed to reduce the number of false positive alerts generated by static rules.

Based on the number of research papers (Idhammad et al., 2018; Suresh and Anitha, 2011; Zekri et al., 2017; Osanaiye et al., 2016), Distributed Denial of Service (DDoS) was given significantly more attention, probably because the successful execution of these types of attacks results in major service outages. However, none of them addressed the issue of finding and uncovering indicators of compromise which can firmly tell that a system was compromised, with an attacker having control of it.

The work of (Hendler et al., 2018) is related to our own research. However, there are several major differences: the authors target PowerShell activity, focus on the command line only rather than on other attributes of the event, and use a purely supervised approach.

Also, it is important to note that we focus on aggregating risk scores at instance level, instead of alerting on every single anomaly. A similar effort is described by (Bryant and Saiedian, 2020). They propose adding metadata to the cyber kill chain, following a divide-and-conquer approach to different kinds of system activities and their combinations. On the other hand, a leading SIEM vendor in the risk-based alerting (RBA) space tries to take it a step further by combining multiple data sources to monitor system activity across the board. This increases the likelihood of catching anomalous activity, be it operational or an actual security threat. But these solutions again don’t solve the problem of static rules and constant intervention by the security analysts to maintain them.

3 DATASET DESCRIPTION

The data involved in the research can generically be described as host activity data, where by host we understand an individual computing resource. In our case, the computing resources are virtual machines in the public cloud, which are equipped with a software agent called Hubble. The role of the agent is to collect information from the computing instance and to send it to a centralized log management solution. The agent collects data such as: recorded users, command line history, outbound connections, processes being executed, environment variables, critical files modified and so on.

The work presented in this paper uses the data extracted by Hubble for running processes. The three main categories (source types) of events are: (a) running processes; (b) running processes listening for network connections; (c) running processes with established outbound network connections.

The latter two source types overlap with the first one, but they provide additional information such as: source port/IP, destination, listening port.

For clarity, we will enumerate all the ﬁelds present

in the meta-data associated with running processes:

https://conf.splunk.com/files/2019/slides/SEC1803.pdf - Last Accessed 2020-10-31.

https://hubblestack.io/


• Account ID: the account from where the data is

collected (this was mentioned in the introduction);

• Cloud Instance ID: unique identiﬁer of the com-

puting resource;

• Parent Process Name: the service/binary process

that generated the event (apache, bash, jupyter

and so on);

• Process Name: the main utility used for the exe-

cution of the command (bash, php, mysql and so

on);

• User: the local user account under which the

parent process is already running (and with what

privileges);

• Group: the local group to which the User is allo-

cated;

• Command Line: the full execution of the pro-

cess with all involved parameters or additional

services;

• Environment Variables: system and user vari-

ables associated with the event and user;

• Open Files: threads associated with the execu-

tion of the process (ﬁles that the process reads or

writes);

• Path: the logical location on the disk from where the process is being executed;

• Time: a timestamp of the moment when that

event was generated (when the process was exe-

cuted).

Each account generates a unique subset of data dependent on the services it provides. Also, each account has a different number of computing resources associated and a different computing load. Because of these factors, the data generated by an account over a period of time differs from the others. Some accounts are highly active and can generate up to tens of millions of events in a two-hour time frame, while others generate just tens of thousands of events. In order to preserve this uniqueness, we treat each account separately but with an identical processing pipeline.

Also, due to technical limitations, we aggregate the data before we process it. For aggregation we use all the fields mentioned before, except the time field, which is replaced with a count value. This way, we extract only the unique events themselves with an associated number of occurrences. By doing so, we limit the data generated by one account to a maximum of a couple of hundred thousand observations for every two-hour interval.
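The aggregation step can be illustrated with a minimal Python sketch. The field keys in `KEY_FIELDS` and the `aggregate` helper are hypothetical names chosen for this example (the paper lists the fields but not their keys), and field values are assumed to be hashable, for instance strings.

```python
from collections import Counter

# Hypothetical keys for the fields listed above; the timestamp is deliberately absent.
KEY_FIELDS = (
    "account_id", "instance_id", "parent_process", "process", "user",
    "group", "command_line", "env", "open_files", "path",
)


def aggregate(events):
    """Collapse raw events into unique events plus an occurrence count.

    Every field except the timestamp forms the grouping key, so two events
    that differ only in time are merged into one observation.
    """
    counts = Counter(
        tuple(event.get(field) for field in KEY_FIELDS) for event in events
    )
    return [dict(zip(KEY_FIELDS, key), count=n) for key, n in counts.items()]
```

Grouping on a tuple of field values keeps the sketch independent of any particular log store; in practice the same aggregation would likely run inside the log management solution.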

4 PROPOSED METHODOLOGY

While some anomalies can be caught by purely statistical approaches, it is often the case that data sparsity makes it difficult to spot outliers. This is a specific trait of text-based data and, because we are dealing with command-lines, our dataset partially falls into this category.

Our methodology (see Figure 1) overcomes this by adding a description layer to raw data examples in the form of tags. The role of these tags is two-fold: (a) bring domain-specific knowledge inspired by the way security analysts perform their investigation and (b) reduce data sparsity.

In what follows, we will describe our tagging strategy (Section 4.1), introduce two methods for scoring the events (one supervised and one unsupervised) (Section 4.2) and validate our approach (Section 5).

4.1 Describing Events using Tags

As security experts, whenever we analyse an event or a series of events, we don’t blindly go head-on and check every possible anomaly. Instead, we look for specific indicators of compromise such as particular commands (e.g. “netcat”, “telnet”, “useradd”) or system users (“www-data”, “mysql” etc.) running abnormal commands.

The tags we generate are meant to highlight some particular behaviors of compromised applications. Indeed, they are not straightforward indications of malicious activity, but when combined they are able to outline some suspicious events that are potential candidates for further analysis.

For clarity, we will detail each individual tag or

tagset:

• FIRST TIME SEEN ANOMALY: a tag applied to newly observed command-lines, based on historical records. This tag is applied after the cleanup process (see the note at the end of this list for clarification);

• RARE OUTBOUND CONNECTION: is ap-

plied if a process generates an outbound connec-

tion to a host that appears in less than 1% of the

data;

• RARE LISTENING PORT: is applied for a

process that opens a listening port and, based on

past observations, that listening port is used in less

than 1% of the data;

The distinction between malicious/benign examples only applies to the supervised part; see Section 4.2.2 for details.

IoTBDS 2021 - 6th International Conference on Internet of Things, Big Data and Security


Figure 1: Overview of our proposed methodology.

• CMD ENTROPY: this tagset is based on the perplexity of the command-line, which is computed using a tri-gram language model (LM) with simple interpolation smoothing. We also use corpus-wide statistics for perplexity (µ and σ) to generate three flavours of this tag:
  MEDIUM: if perplexity is between (µ + σ, µ + 2σ)
  HIGH: if perplexity is between (µ + 2σ, µ + 4σ)
  EXTREME: if perplexity is larger than µ + 4σ

• RARE PROCESS PATH: the presence of this

tag signiﬁes that the process is launched from an

atypical location - less than 1% of the instances of

this process are executed from this location;

• RARE PARENT: based on the process tree, this

process has had the given parent process in less

than 1% of the observations;

• RARE USER PROCESS PAIR: the speciﬁc

user has executed the current process in less than

1% of the observations. This only applies for

a speciﬁc list of known system users: “mysql”,

“root”, “www-data”, “postgresql”, “ldap” etc.

Any manually added user instance will never trig-

ger this tag.

• ENV RARE PATH: this tag is set whenever the PATH environment variable has a combination of values which is not frequent. In this case we use an absolute threshold value of 5;

• ENV MISSING PATH: this tag is set if the PATH variable is empty or missing for a specific process. The tag is only triggered if, for that specific process, based on past observations, the variable was present and non-empty in more than 95% of the cases;

• ENV RARE LD LIBRARY PATH: analogous to ENV RARE PATH;

• ENV MISSING LD LIBRARY PATH: analogous to ENV MISSING PATH;

• ENV RARE PWD: analogous to ENV RARE PATH;

• ENV MISSING PWD: analogous to ENV MISSING PATH;

• USER CMD: this tag is set if the user that

launched a command is not in our list of known

system users. This might not be entirely accurate,

since we maintain this list manually, but it holds

in most of the cases;

• CMD <command>: this is a multipurpose tag used in conjunction with a list of applications/command-line tools that we chose to track. This list includes most command-line-interface (CLI) UNIX/Linux tools (wget, curl, nc, telnet, ssh, useradd, usermod, groupadd etc.) and some additional non-standard packages. This tag is dynamically generated whenever we encounter one of the target CLI commands. For instance, when we see a curl command, we automatically add CMD CURL, when we see a pwd we add CMD PWD and so on.

• PATH <path>: this is also a multipurpose tag, based on a defined list of interesting system paths that might appear inside the text body of an executed command. Examples of such paths are: /dev/mem, /dev/tcp, /dev/kmem, /etc/hostname, /etc/ssh/sshd* and so on. The scope of such tags is to highlight activity that might indicate manual manipulation of system files or services. For example, a user can initiate an outbound connection from the host without using standard utilities like wget or curl, but by using a command such as:
  bash -i >& /dev/tcp/attacker_ip/attacker_port 0>&1

• PARENT <parent name>: the purpose of this tag is to highlight events which are being executed by parents that are interesting from a security point of view. Events spawned by parent processes like apache, nginx, httpd, cupsd, mysqld and so on are rare on their own and can highlight abuse of that particular process or service. Such tags bring more context and correlations to the tag analysis phase.

• IP PUBLIC\IP PRIVATE: this tag is attributed in two cases. First, the event is generated by a process which establishes an outbound connection and we have information regarding the destination IP. Second, the command line itself contains one or multiple character sequences representing IPs. Once all IPs are extracted they are classified as PUBLIC or PRIVATE IPs. Such information is highly useful in classifying what kind of activity a process is performing.

• REF LOCALHOST: this tag highlights events where the localhost component is referenced. This can be done by using the localhost syntax in network-connection-oriented commands or the localhost IP itself, 127.0.0.1. Network connection commands containing only localhost without other IP tags (especially the IP PUBLIC tag) should have a smaller impact in a decision process.

• PROCESS APPEARS LESS THAN 5 TIMES: this tag is highly dependent on the cloud environment for which the tag is generated. Its purpose is to inform on processes that occur very rarely (seen only a few times). Such a tag can have a short life, if the process itself becomes a common component of the environment. But in case of a malicious event, skilled attackers can use system utilities (which aren’t commonly executed) to reach their goals. Such a tag can filter down new and interesting events recently reported.
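Several of the tags above follow simple threshold rules, which the following Python sketch illustrates. It is not the authors' implementation: the helper names, the underscore spelling of the tag strings, the handling of values that fall exactly on an interval boundary, and the token-based matching in `command_tags` are all assumptions made for this example.

```python
from collections import Counter

# Illustrative subset of the tracked CLI tools named in the text.
TRACKED_COMMANDS = {"wget", "curl", "nc", "telnet", "ssh",
                    "useradd", "usermod", "groupadd"}


def entropy_tag(perplexity, mu, sigma):
    """Map a command-line perplexity to a CMD ENTROPY flavour, or None."""
    if perplexity > mu + 4 * sigma:
        return "CMD_EXTREME_ENTROPY"
    if perplexity > mu + 2 * sigma:
        return "CMD_HIGH_ENTROPY"
    if perplexity > mu + sigma:
        return "CMD_MEDIUM_ENTROPY"
    return None


def rare_pairs(observations, threshold=0.01):
    """Return the (key, value) pairs seen in less than `threshold` of the
    observations for that key, e.g. (process, path) for RARE PROCESS PATH."""
    pair_counts = Counter(observations)
    key_counts = Counter(key for key, _ in observations)
    return {pair for pair, n in pair_counts.items()
            if n / key_counts[pair[0]] < threshold}


def command_tags(command_line):
    """Emit a CMD_<NAME> tag for every tracked tool found among the tokens."""
    tags = set()
    for token in command_line.split():
        name = token.rsplit("/", 1)[-1]  # drop a directory prefix such as /usr/bin/
        if name in TRACKED_COMMANDS:
            tags.add("CMD_" + name.upper())
    return tags
```

The same `rare_pairs` helper can serve RARE PARENT (process, parent) and RARE USER PROCESS PAIR (user, process) by changing which two fields are paired.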

NOTE. For the LM and the FIRST TIME SEEN tags we use a pre-processing step in which all command-lines are stripped of numbers and random character sequences. This is done by passing them through a tool called Stringlifier (https://github.com/adobe/stringlifier), which is an open-source ML model that identifies high-entropy strings (passwords, hashes, API keys) as well as strings that follow specific patterns (e.g. IP addresses).
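To show why this cleanup matters for the FIRST TIME SEEN tag, here is a small Python sketch that uses regular expressions as a stand-in for the ML model. It does not use or reproduce the Stringlifier API; the patterns, the placeholder tokens and the `normalize` and `first_time_seen` helpers are hypothetical and far cruder than a learned model.

```python
import re

# Crude regex stand-ins for what a learned model would detect.
_PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<IP>"),   # IPv4-looking strings
    (re.compile(r"\b[0-9a-fA-F]{16,}\b"), "<HASH>"),        # long hex strings
    (re.compile(r"\d+"), "<NUM>"),                          # any remaining numbers
]


def normalize(command_line):
    """Replace volatile substrings with placeholders."""
    for pattern, placeholder in _PATTERNS:
        command_line = pattern.sub(placeholder, command_line)
    return command_line


def first_time_seen(command_line, history):
    """Return True if the normalized command is new, and record it in `history`."""
    key = normalize(command_line)
    is_new = key not in history
    history.add(key)
    return is_new
```

Without normalization, commands such as `kill -9 4211` and `kill -9 977` would each count as newly observed; after it, they map to the same string and only the first one is flagged.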

For event labelling, we look at various event data fields and generate tags according to the inherent characteristics of those fields. Tags capture certain facets of the event or log which are important from a security incident detection point of view. One of the common tags generated is FIRST TIME SEEN ANOMALY, applied when the command in an event is seen for the first time for a particular account. Another tag, USER CMD, accounts for commands that are executed explicitly by users on production instances. Certain tags account for environment variable abnormality. Tags are also generated for network related artifacts. For example, the IP PUBLIC or IP PRIVATE tags describe the type of the destination resource for that network connection (local network or public/internet). Tags like REF LOCALHOST describe processes that connect via network protocols to other services available on the same computing machine hosting the processes (e.g. 127.0.0.1 or localhost).
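The network-related tags can be sketched with Python's standard `ipaddress` module. This is an illustration under assumptions: the `ip_tags` helper, the IPv4-only regular expression, the underscore tag spelling, and the choice to report loopback addresses as REF_LOCALHOST rather than IP_PRIVATE are not taken from the paper.

```python
import ipaddress
import re

_IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def ip_tags(command_line, destination_ip=None):
    """Tag an event from IPs in its command line and its known destination."""
    candidates = _IPV4.findall(command_line)
    if destination_ip:
        candidates.append(destination_ip)
    tags = set()
    for text in candidates:
        try:
            address = ipaddress.ip_address(text)
        except ValueError:
            continue  # looked like an IP but is not one, e.g. 999.1.1.1
        if address.is_loopback:
            tags.add("REF_LOCALHOST")
        elif address.is_private:
            tags.add("IP_PRIVATE")
        else:
            tags.add("IP_PUBLIC")
    if re.search(r"\blocalhost\b", command_line):
        tags.add("REF_LOCALHOST")
    return tags
```

The loopback check comes first because `is_private` is also true for 127.0.0.0/8, which would otherwise hide the distinction the REF LOCALHOST tag is meant to capture.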

4.2 Risk based Scoring

As tags are not direct indicators of compromise, the

role of risk-based scoring (RBS) is to highlight sus-

pect activity by jointly analysing the assigned la-

bels. For simplicity, we deﬁne our scoring method as

a linear function between the input features (combina-

tions of labels/tags) with a scalar output. The primary

goal is to assign high scores to suspicious activity and

low scores to normal operations.

There are two main ways to compute the scoring model: supervised learning and unsupervised learning. Both methods have their merits and demerits. For instance, unsupervised learning is likely to perform worse on this dataset than its supervised counterpart. On the other hand, supervised learning requires labeled data, which is not easy to come by. Furthermore, the bias of this training data is likely to also bias the model.

This is why we will analyze both approaches in

the following sections and compare the results.

4.2.1 Unsupervised Scoring

Customarily, Term Frequency Inverse Document Frequency (TF-IDF) has been used to analyze observations based on associated tags/words. In our case, tags are unique per observation, thus there is a strong correlation between the term frequency and inverse document frequency, rendering TF-IDF not a good candidate for our goal. Instead, we focus on two unsupervised methodologies for detecting correlations between tags and assigning scores: a probabilistic n-gram approach and frequent itemset mining.

Table 1: Converting set to N-grams.

Set       (CMD CP, RARE PARENT, USER CMD)
Unigrams  (CMD CP), (RARE PARENT), (USER CMD)
Bigrams   (CMD CP, RARE PARENT), (CMD CP, USER CMD), (RARE PARENT, USER CMD)
Trigrams  (CMD CP, RARE PARENT, USER CMD)

Table 2: Example of N-grams and their probabilities.

Probability  N-Grams
0.001958     ENV RARE PWD, PARENT SSHD
0.176561     RARE USER PROCESS PAIR
0.015053     RARE PARENT
0.000012     CMD EXTREME ENTROPY, CMD PS, PARENT PYTHON

Probabilistic N-gram Approach. In this method, the event’s tags are arranged into unique combinations of tags to represent unigrams, bigrams and trigrams, as shown in Table 1. The occurrence frequency of a specific n-gram provides an idea about how often the tags in the n-gram are expected to appear together. We also expect that frequent n-grams are correlated with normal operations, while rare n-grams are related to what can be considered anomalous events (see Table 2 for examples). Thus, our scoring procedure is defined as follows: (a) compute n-gram probabilities (we use maximum likelihood estimates); (b) assign a “rarity score” for each n-gram, computed as the negative log likelihood of that n-gram appearing. If $G_k$ is the set of all n-gram combinations for event $E_k$ and $P_t$ is the probability of observing tag combination $t$, then the RBA score for $E_k$ is defined by Equation 1:

$$RBA_k = -\sum_{t \in G_k} \log P_t \qquad (1)$$
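The scoring procedure can be sketched in Python as follows. Several details are assumptions rather than statements from the paper: probabilities are estimated as relative frequencies among n-grams of the same order (one possible reading of "maximum likelihood estimates"), only bigrams and trigrams are scored, unseen combinations receive the constant $-\log(10^{-8})$, tags are spelled with underscores, and the helper names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

UNSEEN_SCORE = -math.log(1e-8)  # constant used for combinations never seen in training


def tag_ngrams(tags, sizes=(2, 3)):
    """All unordered tag combinations of the requested sizes (see Table 1)."""
    ordered = sorted(set(tags))
    return [combo for n in sizes for combo in combinations(ordered, n)]


def fit_ngram_model(benign_events, sizes=(2, 3)):
    """Estimate each n-gram's probability relative to n-grams of the same order."""
    counts, totals = Counter(), Counter()
    for tags in benign_events:
        for gram in tag_ngrams(tags, sizes):
            counts[gram] += 1
            totals[len(gram)] += 1
    return {gram: n / totals[len(gram)] for gram, n in counts.items()}


def rba_score(tags, model, sizes=(2, 3)):
    """Equation 1: sum of negative log probabilities over the event's n-grams."""
    return sum(
        -math.log(model[gram]) if gram in model else UNSEEN_SCORE
        for gram in tag_ngrams(tags, sizes)
    )
```

Sorting the tags before building combinations makes each n-gram a canonical tuple, so the same tag set always maps to the same dictionary keys regardless of the order in which tags were generated.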

Frequent Itemset Mining. A common use case of

Frequent Itemset Mining involves identifying rela-

tions among items purchased by shoppers, and gen-

erating association rules to better understand buying

behaviors and provide better incentives. Similarly, in

our use case we can derive associations among tags

and their frequency by using the Apriori Algorithm

(Agarwal et al., 1994). Table 3 shows the support of example tag itemsets generated using this approach.

Table 3: Example of frequent itemsets and support.

Support   Itemsets
0.016336  ENV RARE PWD, PARENT SSHD
0.449426  RARE USER PROCESS PAIR
0.039277  RARE PARENT
0.000347  CMD EXTREME ENTROPY, CMD PS, PARENT PYTHON

Similar to the probabilistic n-gram model, we use

the negative logarithm of the support value in our

scoring.
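As an illustration of this variant, the following Python sketch implements a minimal level-wise Apriori search and the corresponding score. It is a simplified sketch, not the authors' code: the `min_support` and `max_size` defaults, the constant for itemsets that are not frequent, and the function names are assumptions.

```python
import math
from itertools import combinations


def apriori(transactions, min_support=0.01, max_size=3):
    """Return {frozenset(itemset): support} for itemsets above min_support."""
    transactions = [frozenset(t) for t in transactions]
    total = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / total

    current = {}
    for item in {item for t in transactions for item in t}:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)
    size = 2
    while current and size <= max_size:
        # Join step: merge frequent itemsets into candidates of the next size.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, size - 1))}
        current = {c: s for c in candidates if (s := support(c)) >= min_support}
        frequent.update(current)
        size += 1
    return frequent


def itemset_score(tags, frequent, sizes=(2, 3), unseen=-math.log(1e-8)):
    """Sum of negative log supports over the event's tag combinations."""
    ordered = sorted(set(tags))
    score = 0.0
    for n in sizes:
        for combo in combinations(ordered, n):
            s = frequent.get(frozenset(combo))
            score += -math.log(s) if s is not None else unseen
    return score
```

Unlike the n-gram probabilities, support is the fraction of events that contain the itemset, which is why the values in Table 3 are larger than those in Table 2 for the same tag combinations.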

4.2.2 Supervised Scoring

The dataset used for supervised scoring is composed of 10K benign events (collected from our own logs) and 245 malicious command-lines extracted from open data-sources. For training/testing we split the data into two subsets, while carefully preserving the ratio between malicious and benign examples. 70% of the data was used for training and 30% for testing.
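A ratio-preserving split of this kind can be sketched in a few lines of Python. The `stratified_split` helper, its fixed seed and the per-class rounding are assumptions made for illustration; the paper does not describe how its split was implemented.

```python
import random


def stratified_split(examples, labels, train_fraction=0.7, seed=0):
    """Split each class separately so both subsets keep the class ratio."""
    rng = random.Random(seed)
    by_label = {}
    for example, label in zip(examples, labels):
        by_label.setdefault(label, []).append(example)
    train, test = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train += [(example, label) for example in group[:cut]]
        test += [(example, label) for example in group[cut:]]
    return train, test
```

Splitting per class matters here because the malicious class is small: a plain random split of the combined data could leave the test set with noticeably more or fewer malicious examples than the 30% intended.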

Our scoring model is a single layer perceptron

with sparse input. For computational efﬁciency, in-

stead of performing n-hot encoding and sparse ma-

trix multiplications, we preferred to link each feature

to the weight it would normally trigger in the per-

ceptron model. This way, the output activation for

a given example input is computed by summing over

the weights of the non-zero (present) features in that

particular example. This is similar to the probabilis-

tic n-gram approach, but instead of using the negative

log-likelihood for an n-gram we now use a precom-

puted weight.

To train our model, we initialized our feature weights with zeros (this is a linear model) and we used the delta-rule update (Equations 2 and 3).

$$y_k = \sum_{t \in G_k} w_t \qquad (2)$$

$$\Delta w_t = \eta \cdot (t_k - y_k) \qquad (3)$$

where $G_k$ has the same meaning as before, $w_t$ is the weight linked to feature $t \in G_k$, $y_k$ is the predicted output for example $E_k$, $t_k$ is the desired output for that example and $\eta$ is the learning rate.
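The sparse perceptron and its delta-rule training can be sketched in Python as below. This is an illustrative sketch, not the authors' code: the target encoding (1.0 for malicious, 0.0 for benign), the number of epochs, the use of bigram and trigram tag combinations as features, and the `unseen_weight` parameter are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def features(tags, sizes=(2, 3)):
    """Tag combinations that act as the non-zero inputs of the perceptron."""
    ordered = sorted(set(tags))
    return [combo for n in sizes for combo in combinations(ordered, n)]


def predict(weights, tags, unseen_weight=0.0):
    """Equation 2: sum the weights of the features present in the event."""
    return sum(weights.get(f, unseen_weight) for f in features(tags))


def train(examples, epochs=10, learning_rate=1e-4):
    """Delta-rule training on (tags, target) pairs, starting from zero weights."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for tags, target in examples:
            active = features(tags)
            error = target - sum(weights[f] for f in active)
            for f in active:
                weights[f] += learning_rate * error  # Equation 3
    return dict(weights)
```

Keeping the weights in a dictionary keyed by feature avoids building an n-hot vector: only the features present in an event are ever read or updated.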

Note 1. While both unsupervised methods generate

scores for unigrams, we don’t use this type of feature

in our scoring, because, as stated earlier, tags out of

context are not informative enough to provide the nec-

essary support for deciding if an event is malicious or

not.

We used a learning rate of 1e-4.


Note 2. In the training phase of the unsupervised methods we only use the benign dataset, because adding malicious examples to the mix would make the model prone to considering them non-anomalies at runtime.

Note 3. The supervised and unsupervised methods generate scores for known tag combinations. However, there are always situations where the algorithm can encounter previously unseen tag combinations during runtime. We observed that the best results are obtained by scoring this special class of rare events with a constant value.

5 EXPERIMENTAL VALIDATION

We validate our approach using an artiﬁcially con-

structed dataset, which was assembled through a sim-

ilar strategy as the one presented in Section 4.2.2. We

note that the manually introduced malicious events

have not been previously used in training the super-

vised scoring function.

Ideally, we would prefer that the risk score would

highlight malicious events by assigning high values

to them. As such, we use a linear and an exponen-

tial decay function for evaluating both the supervised

and unsupervised scoring methods. Alternatively, we

could use the F-score computed on a limited number

of events. However, we feel that hard limits such as

top k events or score thresholds don’t fully character-

ize the system, since their choice can yield high recall

values by generating large and unmanageable num-

bers of events.

Let e

be one malicious event with score s

and

the number of benign events that have a score

. Then, our linear decay function (L) is deﬁned

by Equation 4 and our exponential (E) decay function

by Equation 5.

1 + 2 · b

(4)

(5)

The model is characterized by the average scores over

all malicious events (

∑

k=1

and

∑

k=1

, re-

spectively).

Note. Benign examples are not explicitly modeled

in our evaluation, because they are used in scoring

of the malicious events, which makes them indirect

participants to the overall accuracy.

Table 4 shows the results for all the presented models,

using both evaluation metrics.

Our models are evaluated using a heuristically obtained value of $-\log(10^{-8})$.

Table 4: Model evaluation using linear and exponential decay scoring.

Method            Linear  Exponential
Supervised Delta  0.9470  0.9461
Itemset Mining    0.6255  0.5962
N-Grams           0.5033  0.4798

6 CONCLUSIONS AND FUTURE WORK

We introduced a methodology that simultaneously reduces data sparsity and introduces domain-specific knowledge into security-related application logs. We explored both supervised and unsupervised methods of performing risk-based scoring and we evaluated our methods using an artificially created dataset.

The results show the obvious: supervised score

generation works better than unsupervised methods

(at least for this type of data). However, we do ar-

gue that it is not always possible to collect or gener-

ate labeled data, hence both methods bring their own

value.

A potential use case for unsupervised methods is

to highlight operational hygiene issues for a speciﬁc

cloud account. If an environment follows operational

and security best practices the number and the diver-

sity of anomalous events would be uniform. An un-

supervised approach could highlight deviation from

such expectations.

We further plan to extend this methodology by adding more types of labels and introducing additional types of data. Extending our research on the unsupervised and supervised scoring methods is also on our road-map.

REFERENCES

Agarwal, R., Srikant, R., et al. (1994). Fast algorithms for mining association rules. In Proc. of the 20th VLDB Conference, pages 487–499.

Anumol, E. (2015). Use of machine learning algorithms with siem for attack prediction. In Intelligent Computing, Communication and Devices, pages 231–235. Springer.

Bryant, B. and Saiedian, H. (2020). Improving siem alert metadata aggregation with a novel kill-chain based classification model. Computers & Security, page 101817.

Das, S., Ashrafuzzaman, M., Sheldon, F. T., and Shiva, S. (2020). Network intrusion detection using natural language processing and ensemble machine learning. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pages 829–835. IEEE.


Feng, C., Wu, S., and Liu, N. (2017). A user-centric machine learning framework for cyber security operations center. In 2017 IEEE International Conference on Intelligence and Security Informatics (ISI), pages 173–175. IEEE.

Gibert, D., Mateu, C., and Planes, J. (2020). The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. Journal of Network and Computer Applications, 153:102526.

Gomathi, S., Parmar, N., Devi, J., and Patel, N. (2020). Detecting malware attack on cloud using deep learning vector quantization. In 2020 12th International Conference on Computational Intelligence and Communication Networks (CICN), pages 356–361. IEEE.

Hendler, D., Kels, S., and Rubin, A. (2018). Detecting malicious powershell commands using deep neural networks. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages 187–197.

Idhammad, M., Afdel, K., and Belouch, M. (2018). Semi-supervised machine learning approach for ddos detection. Applied Intelligence, 48(10):3193–3208.

Noor, U., Anwar, Z., Amjad, T., and Choo, K.-K. R. (2019). A machine learning-based fintech cyber threat attribution framework using high-level indicators of compromise. Future Generation Computer Systems, 96:227–242.

Osanaiye, O., Cai, H., Choo, K.-K. R., Dehghantanha, A., Xu, Z., and Dlodlo, M. (2016). Ensemble-based multi-filter feature selection method for ddos detection in cloud computing. EURASIP Journal on Wireless Communications and Networking, 2016(1):130.

Piplai, A., Mittal, S., Abdelsalam, M., Gupta, M., Joshi, A., and Finin, T. (2020). Knowledge enrichment by fusing representations for malware threat intelligence and behavior. In 2020 IEEE International Conference on Intelligence and Security Informatics (ISI), pages 1–6. IEEE.

Shi, Y., Chen, G., and Li, J. (2018). Malicious domain name detection based on extreme machine learning. Neural Processing Letters, 48(3):1347–1357.

Suresh, M. and Anitha, R. (2011). Evaluating machine learning algorithms for detecting ddos attacks. In International Conference on Network Security and Applications, pages 441–452. Springer.

Zekri, M., El Kafhali, S., Aboutabit, N., and Saadi, Y. (2017). Ddos attack detection using machine learning techniques in cloud computing environments. In 2017 3rd International Conference of Cloud Computing Technologies and Applications (CloudTech), pages 1–7. IEEE.

Zhou, H., Hu, Y., Yang, X., Pan, H., Guo, W., and Zou, C. C. (2020). A worm detection system based on deep learning. IEEE Access, 8:205444–205454.
