Deep Dive into Hunting for LotLs Using Machine Learning and Feature
Engineering
Tiberiu Boros (Security Coordination Center, Adobe Systems, Bucharest, Romania; https://orcid.org/0000-0003-1416-915X)
Andrei Cotaie (Security Operations, UIPath, Bucharest, Romania)
Keywords:
Machine Learning, Feature Engineering, Living Off the Land Attacks.
Abstract:
Living off the Land (LotL) is a well-known method in which attackers use pre-existing tools distributed with
the operating system to perform their attack or lateral movement. LotL enables them to blend in alongside
sysadmin operations, making this type of activity particularly difficult to spot. Our work is centered on
detecting LotL via Machine Learning and Feature Engineering while keeping the number of False Positives
to a minimum. The work described here is implemented in an open-source tool that is provided under the
Apache 2.0 License, alongside pre-trained models.
1 INTRODUCTION
Living off the Land (LotL) is not a brand-new concept.
The knowledge and resources have been out there
for several years now. Still, LotL is one of the preferred
approaches of highly skilled attackers and security
professionals. There are two main reasons for this:
Experts tend not to reinvent the wheel;
Attackers like to keep a low profile/footprint (no
random binaries/scripts on the disk).
The industry-standard approach to LotL detection involves
SIEM alerting (i.e. static rules). Static rules
are often too broad or too limited, which makes the
output somewhat unreliable. Some well-known
downsides of static rules:
(a) They depend heavily on the experience of
the SME (Subject Matter Expert) that creates
them;
(b) They can generate a high number of False Positives
(because of the thin line, in terms of tools and
syntax, between sysadmin operations and attacker
operations) or miss well-known LotL activity;
(c) They grow organically, to the point where it
is easier to retire and rewrite them rather than maintain
and update them.
In our previous work (citation will be added after
the blind review process) we introduced a supervised
model aimed at distinguishing between normal
operations and LotL activity, which was trained on a
custom-designed dataset. The model was a Random
Forest (Pal, 2005) and the dataset was composed of
7.9M examples, with a malicious-to-benign ratio of
roughly 0.02% (2/10,000).
Since then, we focused on increasing the size and
quality of our dataset (see Section 3) and explored
deep-learning alternatives that are able to learn mean-
ingful patterns and latent representations for our cor-
pus/task (see Section 5). Our evaluation shows that
the deep-learning approach provides better F-scores
for both the initial and updated datasets.
Furthermore, we discuss how these enhancements to
the model influence the accuracy of the classifier and
provide an in-depth analysis on how various feature
classes contribute (Section 6). Finally, we present our
conclusions and future work plans (Section 7).
2 RELATED WORK
Classical intrusion detection systems, including those
aimed at LotL, fall into three main categories: (a) Signature-Based
(SB); (b) Anomaly-Based (AB); and (c) Hybrid-Based (HB).
While signature-based detection relies on predefined
patterns (Modi et al., 2013), anomaly-based detection is
data-oriented. It works by spotting behavioral out-
liers and by assuming that bad-actor activity will be
included in this category (Boros et al., 2021; Bu-
tun et al., 2013; Lee et al., 1999; Silveira and Diot,
2010). Both categories show strengths and weak-
nesses. For instance, in signature-based systems, a
high number of rules implies high accuracy, but this
approach scales poorly, is high-maintenance and does
not cope well with previously unseen attack methods.
Thankfully, there are ways to automatically append
rules and signatures, including the well-known honeypot
strategy (Kreibich and Crowcroft, 2004). On the
other hand, because anomaly-based systems rely on
data modeling, they can automatically scale to new
attack methods. However, not all anomalies are bad-actor
activity and, as a rule of thumb, the number of
false positives is high for this class of detection, often
hindering production use.
One class of hybrid methods refers to supervised
classification via machine learning (ML). ML is hard to place
in either the SB or AB category, since it shares
traits with both. Given that such classifiers work
on anything from raw data to engineered features and
that they require labeled data, they fit well into the HB
category.
3 DATASET
Our initial dataset was composed of 1,609 examples of
LotL commands compiled from open data sources
(https://gtfobins.github.io/ and https://lolbas-project.github.io/)
and 7.9M benign examples, which were obtained
by randomly sampling logs from our infrastructure
(random sampling was used to reduce the risk of
including actual LotL activity as benign).
Since then, we have focused on further refining and
increasing our corpus for the second iteration of our
models.
The present version of the dataset contains 1,826
LotL examples and 24M benign examples.
There are two important notes about the skewness of
our dataset:
The ratio between malicious and benign examples
is closer to real-life scenarios: in practice,
you do not have a 1/2 split between LotL activity and
normal operations;
By preserving the same ratio during training, vali-
dation and testing, we ensure that reported results
are close to how the classifier behaves in produc-
tion environments.
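To illustrate how this class ratio is preserved across splits, the snippet below is a minimal sketch using scikit-learn's stratified splitting; the function and variable names are purely illustrative and this is not the exact code used in our pipeline.

```python
from sklearn.model_selection import train_test_split

def stratified_split(X, y, seed=42):
    """Split examples into 80/10/10 train/dev/test while preserving the
    malicious/benign ratio (1 = LotL, 0 = benign) in every partition."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    X_dev, X_test, y_dev, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (X_train, y_train), (X_dev, y_dev), (X_test, y_test)
```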
Currently, we are unable to share the dataset, because
it likely contains sensitive information. However,
we are working on a public version of the
dataset that can be open-sourced. This would likely
provide a common testing and reporting ground for
researchers. Until then, we are only able to offer some
information regarding the composition of our dataset.
Figure 1 shows the distribution between benign and
LotL examples, on a log scale, for the top 10 commands
in each of three categories:
(a) commands that mostly appear in benign exam-
ples: java, postgres, mv, ps, chown, sleep,
sshd, docker, vi, du;
(b) commands that mostly appear in LotL examples:
rvim, rlogin, masscan, byebug, socat, dirb,
cobc, ghc, ltrace, nc;
(c) ambiguous commands that appear in both types
of activity: jrunscript, xxd, mknod, tftp,
mkfifo, smbclient, aria2c, which, whois,
rvim.
4 FEATURE EXTRACTION
Given the sparsity of command lines, parameters and
attributes, performing adequate feature extraction is
crucial to preventing the over-fitting of the model.
This is primarily achieved by observing and captur-
ing re-occurring security-related events and avoiding
features that are based on rare yet discriminative patterns
that would likely lead to poor generalization of
the machine learning algorithms.
We distinguish 5 classes of features that we use in
our approach, which are based on (a) binaries, (b) pa-
rameters, (c) paths, (d) networking and (e) LotL Pat-
terns. Although they are well described in our previ-
ous paper, for completeness we will present them here
as well:
(a) Binaries. Binaries carry a significant weight in
determining if LotL activity is happening on a
station, since they capture the capabilities of a
command line. For instance, netcat usually
means bi-directional network communication ca-
pabilities, tcpdump means monitoring capabil-
ities and whoami indicates standard reconnais-
sance capabilities. Of course, whether these tools
are actually used for malicious purposes, or whether
they can be successfully exploited, depends on the
context. For a complete set of commands that are
used as features, see Table 2;
(b) Parameters. Most of the tools we monitor are
multipurpose and their parameters help determine
Figure 1: Command (binary) distribution in the dataset: (a) mostly used in benign activity; (b) mostly used in LotL activity;
(c) ambiguous usage.
Table 1: List of paths that are included in the feature-set.
/dev/mem /dev/tcp /dev/udp /dev/kmem
/dev/null /inet/tcp /bin/sh /inet/udp
/etc/crontab /var/spool/cron /bin/bash /etc/passwd
/etc/issue /etc/shadow /proc/version /etc/ssh/sshd_config
/etc/services /etc/network/interfaces /etc/sysconfig/network /etc/resolv.conf
/etc/fstab /etc/group /etc/hostname /etc/hosts
/etc/inetd.conf /boot .ssh/ /etc/fstab
/etc/profile /bin/bash /tmp/bash /etc/sudoers
/etc/vncpasswd /proc/ /var/tmp/ /dev/tty
/bin/ssh /tmp/ /usr/bin/yum
Table 2: List of commands that are included in the feature-set.
cat chmod cp curl sqlite3 postgres runuser
ldapsearch adduser export ftp history ping egrep
bzip ifconfig kill last mkdir mv passwd
lsof nkf ps rm sshd tar node
uname unzip nc useradd userdel vi vim
wget whoami sshd w socat telnet pip
easy_install nmap scp sftp smbclient tftp whois
finger crontab cpan apt-get byebug cobc cpulimit
rvim ltrace nohup top nice netstat find
pwd ls ssh gdb sudo netdiscover chroot
lsbrelease cut awk gawk chowm chkconfig selinux
visudo runlevel slapd openssl auditctl usermod groups
chgrp cobc emacs journalctl ltrace rsync strace
rvim cpan mkfifo perl lpstat python arp
php ruby node jrunscript ifconfig tcpdump echo
strings mono mknod backpipe tee busybox exec
route emacs rlogin xxd view rpcclient dnsdomainname
aria2c mysql debugfs iptables masscan bash tmux
screen gcc grep for while gpg hostname
sleep dpkg which fstab env set base64
sed dd ssh-keyscan locate unset printenv crash
miner minerd dirb mount ldconfig lld ftp
proxytunel yum mailcap openvpn valgrind rpm xargs
java postfix ansible git docker slsh hexdump
ghc chef salt showmount zsh go chattr
sha1sum syn.scan modprobe lsb-release du df timeout
time bzip apt-key chown javac more ssldump
traceroute ldapsearch mtr ntpdate csvtool
if the actual intent is malicious or not. For instance,
netcat can be used by legitimate scripts
to determine if an application is up and running,
by checking if a specific port is listening and
maybe sending a specific greeting message. On
the other hand, netcat used in conjunction
with “-e”, or with pipes and redirects to “bash”,
likely indicates reverse-shell activity;
(c) Paths. Features based on paths play an impor-
tant role in determining the legitimate/illegitimate
usage of binaries. From sensitive locations
such as “/etc/passwd” to special devices such
as “/dev/tcp”, paths are a strong indicator that
something less than honest is happening on a box.
Of course, there are many paths that can be lever-
aged on a system, but thankfully, most of the paths
used in LotL attacks belong to a finite set (see Ta-
ble 1 for a complete set of paths we are using in
the feature extraction process);
(d) Networking. Communication to other hosts represents
a notable use case for LotL attacks.
While the exact IP address, range or ASN is a
strong indicator of compromise, it is highly
volatile information, requires a reliable Threat
Intelligence (TI) source, and becomes obsolete
very quickly. Instead, our networking
features provide macro-level information
about the nature of the communication (internal, external,
loop-back or localhost references), trading
specificity and accuracy for stability over time and
TI independence;
(e) Well-known LotL Patterns. Features based on
known LotL patterns are regular expressions that
look for what can be considered as important sig-
natures that disambiguate the usage of a tool be-
tween legitimate and illegitimate. The number of
rules is significant and we are unable to include
them in the paper. However, if one wants to gain
access to them, they are available in the constants
file from our public repository.
For every raw command-line, we perform the fea-
ture extraction process and enrich the examples with
tags that are later used in the classification process.
We don’t rely on any text-based features in our pro-
cess (for instance n-grams) in order to reduce data
sparsity, avoid classifier over-fitting and gain better
generalization on previously unseen examples.
Note 1. The tags or features are discrete values
formed by concatenating a “class” prefix with the
command, parameter, path, networking or pattern attribute
that we detect. For example, if a command line
contains “netcat”, “bash”, “/var/tmp” and connects
to a public IP address, the extracted features
are: COMMAND_NC, COMMAND_BASH, PATH_/VAR/TMP
and IP_PUBLIC. While we could skip this step and
move directly to an ML-friendly representation (n-hot
encoding or embedding), we prefer to do this explicitly
in our tool and present the tags alongside the
classification, because this increases visibility into
the dataset and makes it easier for an analyst to understand
why a command line has been classified as
LotL.
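To make the tagging step concrete, the following is a minimal, hypothetical sketch of such an extractor; the command and path lists are tiny illustrative subsets of Tables 1 and 2, and the code is not the actual implementation from our repository.

```python
import ipaddress
import re

# Illustrative subsets of the command and path feature sets (Tables 1 and 2).
COMMANDS = {"nc": "COMMAND_NC", "netcat": "COMMAND_NC", "bash": "COMMAND_BASH"}
PATHS = ["/var/tmp", "/etc/passwd", "/dev/tcp"]
IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

def extract_tags(cmdline):
    """Map a raw command line to the discrete tags used by the classifier."""
    tags = set()
    for token in re.split(r"[\s|;&]+", cmdline):
        base = token.rsplit("/", 1)[-1]        # "/bin/bash" -> "bash"
        if base in COMMANDS:
            tags.add(COMMANDS[base])
    for path in PATHS:
        if path in cmdline:
            tags.add("PATH_" + path.upper())
    for ip in IP_RE.findall(cmdline):
        addr = ipaddress.ip_address(ip)
        if addr.is_loopback:
            tags.add("IP_LOOPBACK")
        elif addr.is_private:
            tags.add("IP_PRIVATE")
        else:
            tags.add("IP_PUBLIC")
    return tags

# extract_tags("netcat 8.8.8.8 4444 -e /bin/bash > /var/tmp/x")
# -> {"COMMAND_NC", "COMMAND_BASH", "PATH_/VAR/TMP", "IP_PUBLIC"}
```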
Note 2. Sometimes, connection information is not
present in the command line itself. However, most
implementations for collecting system logs rely on
Endpoint Detection and Response (EDR) solutions
that enrich the data with inbound/outbound connections.
In such cases, it is recommended that the IP
address(es) be concatenated to the command line, so
that the feature extraction process will pick them up and
generate the appropriate tags. All our training data has
been semi-automatically enhanced with this information
(automatically for the benign examples and manually
for the LotL examples).
Note 3. There is one more tag, called
“LOOKS_LIKE_KNOWN_LOL”, which is not used
in the training process, but is used at run-time
to override the decision of the classifier when a
previously unseen command resembles something
malicious in our dataset. More details about the tag
can be found in our initial work. However, we must
mention that it is not used in the model evaluation or in
the ablation study.
Note 4. Tags for patterns don’t follow the same nam-
ing convention. Instead, we allow for selecting the
name of the tag that is generated whenever a regu-
lar expression matches and multiple expressions can
yield the same tag.
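For example, several different expressions can emit the same tag; the patterns and tag names below are hypothetical and are not taken from our constants file.

```python
import re

# Hypothetical pattern-to-tag mapping: each regular expression carries the
# name of the tag it emits, and two expressions may emit the same tag.
PATTERNS = [
    (re.compile(r"nc\s+.*-e\s+/bin/(ba)?sh"), "REVERSE_SHELL"),
    (re.compile(r"bash\s+-i\s+>&\s*/dev/tcp"), "REVERSE_SHELL"),
    (re.compile(r"base64\s+(-d|--decode)"), "DECODE_PAYLOAD"),
]

def pattern_tags(cmdline):
    """Return the set of pattern tags triggered by a command line."""
    return {tag for rx, tag in PATTERNS if rx.search(cmdline)}
```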
5 ENHANCED CLASSIFICATION
USING CUSTOM DROPOUT
Dropout is a well-established regularization technique
(Srivastava et al., 2014) that is predominantly used
in the training process to prevent overfitting. During
each training iteration, every unit (neuron) has
a pre-defined probability of being masked (dropped
out) or, in some cases, of being scaled up. We extend
this framework to perform full input-feature dropout, which is
aimed at reducing the impact of two major issues:
(a) Because we have a complex feature extraction
process, some of the features might be redundant
and increase model instability, since
they are not linearly independent: “COMMAND_SSH”,
“KEYWORD_-R” and “SSH_R” are usually triggered
at the same time;
(b) Regular expression-based feature extraction is not
error-free and some rules might not trigger in all
cases, while their binary, keyword and IP-based
counterparts will work.
To clarify, by full input-feature dropout we mean that
we randomly mask input features or their corresponding
embeddings and we do not perform any scaling on
the features that are left untouched. This way, the
model learns to predict whether an example
is LotL or not using fewer features than the superfluous
set we generate in our FE process. This yields a robust
model with higher F-scores on previously unseen
data (see Section 6 for details).
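A minimal PyTorch sketch of the masking step follows; the module name and shapes are illustrative and may differ from our actual implementation. Whole input-feature embeddings are zeroed out with probability p at training time and, unlike standard dropout, the surviving features are not rescaled.

```python
import torch
import torch.nn as nn

class FullFeatureDropout(nn.Module):
    """Randomly zero out entire input-feature embeddings, without rescaling."""
    def __init__(self, p=0.1):
        super().__init__()
        self.p = p

    def forward(self, feat_embeds):
        # feat_embeds: (batch, num_tags, embed_dim)
        if not self.training or self.p == 0.0:
            return feat_embeds
        keep = (torch.rand(feat_embeds.shape[:2], device=feat_embeds.device)
                > self.p).unsqueeze(-1).float()
        # No 1/(1-p) scaling of the surviving features (see text above).
        return feat_embeds * keep
```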
To assess the performance of our methodology,
we perform 5-fold validation on our previous (small) dataset
and static validation with a fixed train/dev/test split
on our newly created (large) dataset (the size of this
new dataset prevented us from reporting k-fold validation).
In Table 3, MLP-300-300-300 represents the current
model, on which we apply full input-feature
dropout. The model is a three-layer Perceptron with
a gated activation function (tanh · sigmoid). The size
of each layer is 300, but the output is halved, since
we use half of the units for computing the sigmoid and
the other half for computing the tanh, after which they
are combined by element-wise multiplication. As can
be observed, MLP-300-300-300 outperforms all
other classifiers on the small dataset by a large margin
and by a smaller gap on the large dataset, probably
because there is less chance of overfitting on the
latter.
The dropout rate for the input and hidden layers is set
to 10%. For the layer sizes we used a truncated grid
search on the small dataset: we tested the same layer
size throughout the model from the set {50, 100, 150,
200, 250, 300, 350, 400, 500} by training on a fixed
train/dev/test split of the small corpus.
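For completeness, the sketch below shows how the gated layers and the MLP-300-300-300 stack described above could be assembled in PyTorch; this is a hypothetical re-implementation of the description (building on the FullFeatureDropout sketch above), not the code from our repository, and padding of the variable-length tag lists is omitted for brevity.

```python
import torch
import torch.nn as nn

class GatedLayer(nn.Module):
    """300-unit linear layer with gated activation: tanh(half) * sigmoid(half)."""
    def __init__(self, in_dim, out_dim=300):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        a, b = self.linear(x).chunk(2, dim=-1)   # output size is halved
        return torch.tanh(a) * torch.sigmoid(b)

class LotLClassifier(nn.Module):
    """Summed tag embeddings -> full input-feature dropout -> 3 gated layers."""
    def __init__(self, num_tags, embed_dim=300, p_drop=0.1):
        super().__init__()
        self.embed = nn.Embedding(num_tags, embed_dim)
        self.input_dropout = FullFeatureDropout(p_drop)   # from the sketch above
        self.layers = nn.Sequential(
            GatedLayer(embed_dim, 300), GatedLayer(150, 300), GatedLayer(150, 300))
        self.out = nn.Linear(150, 1)

    def forward(self, tag_ids):
        # tag_ids: (batch, max_tags) integer ids of the extracted feature tags
        e = self.input_dropout(self.embed(tag_ids))
        return torch.sigmoid(self.out(self.layers(e.sum(dim=1))))
```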
6 ABLATION STUDIES
We set out to evaluate two aspects of our approach:
(a) the effect of full input dropout and (b) the effect
of different classes of features (commands, keywords,
paths, IP and regular expressions).
In order to see how full input dropout affects the
F-score of the MLP classifier on previously unseen
data, we conducted our experiments on both the small
and the large dataset, by training the same classifier
with and without full input dropout (when full input
dropout was removed, we added in-place dropout on
the internal layers; the reported results are for a 10%
dropout rate, which provided the best results). In all cases
we used a static train/dev/test split, because training
the classifier is a time-consuming process. Table
4 shows that the full input dropout versions outperform
the non-input-dropout experiments. The gap is
larger on the smaller dataset, since there is a higher
chance of over-fitting the classifier. Using full input-feature
dropout we achieve F-scores around 0.97 on
both datasets, showing that the dataset size has less impact
when this type of masking is applied.
Next, we assess how features belonging to the 5
different classes (commands, keywords, paths, IP information
and patterns) contribute to the overall accuracy
of the model. The full input-feature dropout
scheme is no longer useful for this ablation study and,
since the MLP is harder to train than the Random
Forest, we prefer to use the latter in our experiments.
We use the merged training
and development set to build the classifier (the Random
Forest does not require validation) and train 6 models
by masking the features belonging to each class, with
an additional model that relies only on the regular expressions.
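The masking step of this ablation is straightforward; below is a hypothetical sketch using the tag-prefix convention from Section 4 and scikit-learn (pattern tags, which carry custom names, would need an explicit list rather than a prefix). It is not the actual experiment code.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

def train_without(tagged_examples, labels, prefix):
    """Drop one feature class (e.g. prefix='IP_') and fit a Random Forest."""
    reduced = [[t for t in tags if not t.startswith(prefix)]
               for tags in tagged_examples]
    mlb = MultiLabelBinarizer()           # n-hot encoding of the remaining tags
    X = mlb.fit_transform(reduced)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, labels)
    return clf, mlb
```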
Table 4 shows the results for all experiments, together
with the baseline results obtained by training a model on
the entire feature set. As can be seen, commands, keywords,
IP information and paths, each taken on its own, have
relatively small contributions to the overall accuracy,
with the smallest measured disruption for IP information
and the largest for commands. This is somewhat
expected, since there is an overlap between the regular
expressions and the features belonging to the aforementioned
classes. The contribution of the regular expressions
(i.e. human expert knowledge) is extremely
important: without this class of features, the overall
F-score of the model drops to 0.7923. To make sure
the model does not rely only on the REGEX feature
class, we trained a classifier on this type of features
alone and obtained an F-score of 0.8797. While this is a
fairly high score, it is still nowhere near the full model
(0.9649), which shows that a significant number
of features from the other classes contribute
to reaching top accuracy.
7 CONCLUSIONS AND FUTURE
WORK
As discussed in this paper, we made significant efforts
to further improve our LotL classifier and dataset. In
this endeavour, we increased the size of the dataset
and introduced a custom dropout strategy, targeted
at our overlapping features, which had a significant
impact on the robustness of the model on previously
unseen data.
To better understand the way features influence
the model, we trained the same classifier on versions
of the same dataset obtained by truncating the feature
set, with the results showing a strong contribution of the
“human expert knowledge” to the overall accuracy.
All the work presented here, except the dataset, is
reflected in the public GitHub repository
(https://github.com/adobe/libLOL), which we
Table 3: F-score results on the small dataset using 5-fold validation and on the large dataset using a static train/dev/test split.
Classifier Dataset F1-score Standard deviation Avg. train time
Random forest SMALL 0.9564 0.013 18 minutes
SVM SMALL 0.9518 0.027 3.5 hours
Logistic regression SMALL 0.9309 0.014 1.2 hours
MLP-300-300-300 SMALL 0.9722 0.062 6 hours
Random Forest LARGE 0.9649 N/A 35 minutes
MLP-300-300-300 LARGE 0.9756 N/A 23 hours
Table 4: Results of ablation experiments.
Classifier Dataset Experiment F1-score
MLP-300-300-300
SMALL
Full features and input dropout 0.9717
Full features and w/o input dropout 0.9421
LARGE
Full features and input dropout 0.9756
Full features and w/o input dropout 0.9609
Random Forest LARGE
Full features 0.9649
Full features w/o commands 0.9123
Full features w/o keywords 0.9517
Full features w/o paths 0.9589
Full features w/o IP 0.9609
Full features w/o regex 0.7923
Only regex 0.8797
actively maintain, and in the PIP-packaged version of
the code (https://pypi.org/project/lolc/).
REFERENCES
Boros, T., Cotaie, A., Vikramjeet, K., Malik, V., Park,
L., and Pachis, N. (2021). A principled approach to
enriching security-related data for running processes
through statistics and natural language processing.
IoTBDS 2021 - 6th International Conference on In-
ternet of Things, Big Data and Security.
Butun, I., Morgera, S. D., and Sankar, R. (2013). A sur-
vey of intrusion detection systems in wireless sensor
networks. IEEE communications surveys & tutorials,
16(1):266–282.
Kreibich, C. and Crowcroft, J. (2004). Honeycomb: cre-
ating intrusion detection signatures using honeypots.
ACM SIGCOMM computer communication review,
34(1):51–56.
Lee, W., Stolfo, S. J., and Mok, K. W. (1999). A data min-
ing framework for building intrusion detection mod-
els. Proceedings of the 1999 IEEE Symposium on Se-
curity and Privacy (Cat. No. 99CB36344), pages 120–
132.
Modi, C., Patel, D., Borisaniya, B., Patel, H., Patel, A., and
Rajarajan, M. (2013). A survey of intrusion detection
techniques in cloud. Journal of network and computer
applications, 36(1):42–57.
Pal, M. (2005). Random forest classifier for remote sensing
classification. International Journal of Remote Sensing,
26(1):217–222.
Silveira, F. and Diot, C. (2010). Urca: Pulling out anomalies
by their root causes. 2010 Proceedings IEEE INFO-
COM, pages 1–9.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: a simple way
to prevent neural networks from overfitting. The jour-
nal of machine learning research, 15(1):1929–1958.