Goyal et al., 2019). Feature extraction can result in
thousands of features, some of which may be redun-
dant and can be eliminated with data mining tech-
niques.
For efficient malware detection, machine
learning-based approaches require large numbers of benign
and malicious samples to train on. These samples are
often collected from so-called intelligence networks.
Nowadays, users’ machines run only a client-side
antivirus component, which may perform local
detection, but it can also request a server’s assistance
during the detection process. This setup is also
known as cloud-based malware detection. The
client-side component sends suspicious samples to a
server in the cloud, which performs a more in-depth
analysis, e.g., by executing the sample in a sandbox,
makes a decision, and informs the client. At the same
time, the server collects these submitted samples,
which can then be used for training machine learning
models.
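The client-server interaction described above can be illustrated with a minimal sketch. All class and parameter names below (CloudServer, Client, deep_analysis, local_check) are hypothetical and chosen for illustration only; the in-depth analysis is a stand-in callable rather than a real sandbox, and the three-valued local verdict is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CloudServer:
    """Hypothetical backend: analyzes submissions and keeps them for training."""
    # Stand-in for the in-depth analysis (e.g., sandbox execution).
    deep_analysis: Callable[[bytes], bool]
    collected: List[bytes] = field(default_factory=list)

    def analyze(self, sample: bytes) -> bool:
        # Every submitted sample is retained as potential training data.
        self.collected.append(sample)
        return self.deep_analysis(sample)


@dataclass
class Client:
    """Hypothetical client-side component with a cheap local check."""
    server: CloudServer
    # Assumed to return "benign", "malicious", or "suspicious".
    local_check: Callable[[bytes], str]

    def is_malicious(self, sample: bytes) -> bool:
        verdict = self.local_check(sample)
        if verdict == "suspicious":
            # Only unclear cases are deferred to the server.
            return self.server.analyze(sample)
        return verdict == "malicious"
```

In this sketch only samples that the local check cannot decide are sent to the server, so the server's collection grows with exactly those submissions.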
Cloud-based malware detection coupled with ma-
chine learning has also been proposed for embedded
IoT devices (Sun et al., 2017; Hussain et al., 2020).
This is an advantageous combination for embedded
IoT devices, because resource-heavy analysis is per-
formed in the cloud and the resource-constrained de-
vices need to run only a light-weight client-side com-
ponent. The client-side component either forwards all
files to the cloud for analysis or applies a pre-trained
machine learning model to detect malware. Proposed
machine learning models include light-weight convolutional
neural networks (Su et al., 2018), recurrent neural
networks (HaddadPajouh et al., 2018), random forest
classifiers (Takase et al., 2020), and fuzzy and fast
fuzzy pattern trees (Dovom et al., 2019).
Many existing works use static features (Ngo et al.,
2020), including function call graphs (Nguyen et al.,
2020), grey scale images of binaries (Karanja et al.,
2020), strings (Hwang et al., 2020), and instruction
opcodes (Nakhodchi et al., 2020).
2.2 SIMBIoTA
SIMBIoTA was proposed in (Tamás et al., 2021).
It is a light-weight antivirus solution with limited re-
quirements for storage, computation, and bandwidth,
hence suitable for embedded IoT devices. SIMBIoTA
relies on a large malware database maintained on a
backend server. This malware database is assumed
to be continuously updated with samples obtained
from an intelligence network as described above. The
server computes the TLSH hash values of the samples
in its database, and pushes a subset of these TLSH
hashes to the client-side antivirus component on the
embedded IoT devices, where a light-weight algo-
rithm uses them to detect malware based on binary
similarity. Therefore, SIMBIoTA requires resource-
constrained embedded IoT devices to store only a
small database with a few TLSH hash values.
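The client-side similarity check can be sketched as a linear scan over the stored hashes. This is only an illustration under stated assumptions: the distance function below is a toy Hamming distance over equal-length hex strings and is not the TLSH distance, and the threshold and its inclusive comparison are hypothetical choices rather than values taken from the SIMBIoTA paper.

```python
from typing import Callable, Iterable


def hamming_distance(a: str, b: str) -> int:
    """Toy stand-in for a similarity-hash distance (NOT the TLSH distance)."""
    if len(a) != len(b):
        raise ValueError("digests must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_malware(
    file_digest: str,
    known_digests: Iterable[str],
    threshold: int,  # hypothetical cut-off, not taken from the paper
    distance: Callable[[str, str], int] = hamming_distance,
) -> bool:
    """Flag the file if it lies within `threshold` of any stored digest."""
    # One distance computation per stored digest, stopping at the first match.
    return any(distance(file_digest, d) <= threshold for d in known_digests)
```

A real deployment would pass a TLSH distance function as `distance`; the scan itself stays the same.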
In (Tamás et al., 2021), SIMBIoTA was evaluated
on a total of 47,937 malicious samples and a total of
14,119 benign samples for the ARM and MIPS archi-
tectures. In the experiments, the set of samples was
divided into two groups: the samples known to the
backend via the intelligence network, and the sam-
ples found only in the wild. The samples known to
the backend were used to construct the database of
TLSH hash values. Based on the metadata of malicious
samples available in VirusTotal², the samples
were also put into so-called “weekly batches”, i.e.,
sets of samples that were first submitted to VirusTotal
in the same week. At the beginning of each
week, the database of TLSH hashes was updated and
the detection performance was measured in two ways.
First, we checked the true positive detection rate for
all samples in previous weeks’ weekly batches. Second,
we also submitted samples from the wild from the
next two weeks’ weekly batches to see SIMBIoTA’s
detection performance for new, previously unseen
malware samples. The experiments measured a false
positive detection rate of 0%, a true positive detection rate
above 90% for samples of previous weeks’ weekly
batches, and a true positive detection rate of ca. 90%
for the next two weeks’ weekly batches. Throughout
the experiments, fewer than 200 bytes were necessary
to update the TLSH hashes stored on the embedded
IoT device. By the end of the experiments, the storage
requirement on the embedded IoT device was 10 kB
in the case of ARM and 6.5 kB in the MIPS case.
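The batching and the reported rates can be made concrete with a short sketch. It assumes that a "week" means an ISO calendar week and that each sample comes with a first-submission date; both are illustrative assumptions, and the function names are hypothetical rather than taken from the original evaluation code.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List, Tuple

WeekKey = Tuple[int, int]  # (ISO year, ISO week number)


def weekly_batches(samples: Iterable[Tuple[str, date]]) -> Dict[WeekKey, List[str]]:
    """Group sample IDs by the ISO week of their first submission date."""
    batches: Dict[WeekKey, List[str]] = defaultdict(list)
    for sample_id, first_seen in samples:
        iso = first_seen.isocalendar()
        batches[(iso[0], iso[1])].append(sample_id)
    return dict(batches)


def detection_rate(verdicts: Iterable[bool]) -> float:
    """Fraction of samples flagged as malware.

    Over malicious samples this is the true positive rate;
    over benign samples it is the false positive rate.
    """
    flags = list(verdicts)
    if not flags:
        raise ValueError("no verdicts given")
    return sum(flags) / len(flags)
```

Applying `detection_rate` to the verdicts for one batch of malicious samples gives that batch's true positive rate, and applying it to the verdicts for the benign set gives the false positive rate.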
Despite its remarkable features, SIMBIoTA has
a number of limitations as well. First, similar to
other malware detection solutions relying on static
features, analyzing obfuscated or encrypted samples
is challenging for SIMBIoTA. Second, as we show in
this paper, the bigger the database of similarity hash
values, the longer it takes for SIMBIoTA to decide
whether a given file is malicious or not. This can be
a challenge in IoT environments where embedded devices
must comply with real-time requirements, because
the run-time delay introduced by SIMBIoTA depends on
the database size and is therefore hard to design for.
Last, even though a true positive
detection rate of 90% on average for new, previously
unseen malware samples is surprisingly good, exist-
ing literature suggests that machine learning-based
malware detection approaches can achieve even better
results.
In this paper, we modify SIMBIoTA’s architecture
² https://www.virustotal.com/ (accessed: Jan 8, 2022)
SIMBIoTA-ML: Light-weight, Machine Learning-based Malware Detection for Embedded IoT Devices