dynamically extracted features, Section 5 presents the
databases used in our experiments together with several
results, and finally, Section 6 draws several conclusions
on the practical aspects of our proposal.
2 RELATED WORK
Malware detection has attracted a lot of attention from
researchers, who have applied various machine learning
methods to it. Two main approaches are usually employed
when deciding which features will be used for training
and evaluating machine learning models.
The first approach extracts static features from the
files without executing them. It therefore usually
consumes fewer resources, runs fast, and can potentially
reach high accuracy.
(Ahmadi et al., 2016) proposed a malware family
classification system using a wide range of static
features extracted from the original PE executable
files, which were not unpacked or deobfuscated. By
combining the most relevant feature categories and
feeding them to an XGBoost-based classifier, their
model reached an accuracy of around 99.8% on the
Microsoft Malware Challenge dataset (Ronen et al.,
2018) of 20000 malware samples.
Other approaches demonstrated that minimal domain
knowledge is needed to extract relevant static
features from executable files. For instance, one study
showed that effective malware detection can be
obtained using only the first 300 bytes of the PE
header of executable files as input (Raff et al.,
2017b). The same year, a more comprehensive study
(Raff et al., 2017a) presented MalConv, a deep
convolutional neural network model that directly uses
the raw byte representation of executables (limited to
the first 2 MB) as input, without any intelligent
identification of specialized structures or of specific
executable or malware content. The model showed
good results, achieving 94% accuracy after training
on a large dataset of 2 million PE files.
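To make the raw-byte input concrete, the following is a minimal sketch of how a file could be turned into a fixed-length sequence for such a model. It is not the preprocessing code of the cited studies: the function names, the padding token 256, and the choice to pad short files are assumptions made for illustration only.

```python
from typing import List

MAX_LEN = 2 * 1024 * 1024  # 2 MB cap mentioned in the text
PAD = 256                  # assumed padding token, outside the 0..255 byte range


def raw_bytes_to_tokens(data: bytes, max_len: int = MAX_LEN,
                        pad: int = PAD) -> List[int]:
    """Truncate or pad a file's raw bytes to a fixed-length token sequence."""
    tokens = list(data[:max_len])                    # keep at most max_len bytes
    tokens.extend([pad] * (max_len - len(tokens)))   # pad short files
    return tokens


def header_bytes(data: bytes, n: int = 300) -> bytes:
    """Return the first n bytes, as in the header-only setting."""
    return data[:n]
```

No parsing of the PE structure is involved: both functions operate on the byte stream as-is, which is the point made by these studies.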
In a more recent study (Zhao et al., 2023), the
authors investigated a different method: converting
the bytecode extracted from the original files into
color images and using them as input features for
training an AlexNet convolutional neural network
(CNN). The results were promising, with the accuracy
of their model exceeding 99% on two rather small
public malware datasets, from Google Code Jam and
Microsoft, of around 10000 samples.
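One generic way to turn a byte sequence into a color image is sketched below. The exact mapping used by (Zhao et al., 2023) is not described here, so the scheme shown (three consecutive bytes per RGB pixel, zero padding, fixed row width) and the function name are assumptions chosen only to illustrate the idea.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def bytes_to_rgb_rows(data: bytes, width: int = 4) -> List[List[Pixel]]:
    """Map consecutive byte triples to RGB pixels and wrap them into rows."""
    # Zero-pad so the byte count is a multiple of 3 (one pixel = 3 bytes).
    padded = data + b"\x00" * (-len(data) % 3)
    pixels = [tuple(padded[i:i + 3]) for i in range(0, len(padded), 3)]
    # Pad with black pixels so that the last row is complete.
    pixels += [(0, 0, 0)] * (-len(pixels) % width)
    return [pixels[i:i + width] for i in range(0, len(pixels), width)]
```

The resulting rows can then be resized to the input resolution expected by the CNN; that step depends on the imaging library in use and is omitted.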
However, relying on static features alone can be
limiting in real-world malware detection scenarios,
where advanced obfuscation, packing or encryption are
used to create malicious files. A recent study
(Aghakhani et al., 2020) investigated this aspect and,
using a dataset of almost 400 thousand files,
demonstrated that static information alone is not
indicative of the actual behavior of the classified
files and that a substantial number of false positives
occur on packed benign files.
The other main approach is to extract dynamic
features that describe the behavior of the malware
during execution or that partially retain information
about that behavior.
One method is to include dynamic runtime opcodes
as input features, allowing the behavior of
executables to be captured. An extensive study (Carlin
et al., 2019) showed that this approach can accurately
detect malware, even on a continuously growing and
updatable dataset that requires retraining. The authors
compared 23 machine learning algorithms and
concluded that their method worked best using the
Random Forest model.
In a recent study (Zhang et al., 2023), the authors
proposed a method that combines dynamic features
based on API call sequences with the semantic
information of functions, adding context about the
action actually performed by each API call. Compared
to existing similar experiments that only used API
call information, their solution shows improvements
of 3% to 5% in detection accuracy.
(Ijaz et al., 2019) compare several machine
learning-based methods for detecting malicious Windows
OS executables. They use a small set of only
39000 malicious binaries and 10000 benign ones,
from which they statically extract a small set of 92
features from the PE headers using the PEFILE tool.
They also dynamically extract 2300 features from a
small part of the files by executing them in Cuckoo
Sandbox. Their detection measurements use the static
features and the dynamic features separately, never
in combination. In addition, relying on a sandbox for
training and evaluation with the dynamic features has
drawbacks, because it does not provide real-time
protection against new malicious files that need to
be evaluated.
In one of the first such approaches, (Santos et al.,
2013) present a hybrid malware detection system that
combines both static and dynamic features. Their
small dataset consists of 1000 malware and 1000
legitimate files, from which they extract two-byte
opcodes, rank them using Information Gain and select
the top 1000 as the static features. The dynamic
characteristics are extracted by monitoring the
behavior of the programs in a controlled sandboxed
environment.
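Feature selection by Information Gain, as mentioned above, can be sketched as follows. This is an illustrative implementation under simple assumptions and not the code of (Santos et al., 2013): each sample is represented as the set of features present in it, labels are 0 (legitimate) or 1 (malware), and the feature strings and function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(present: Sequence[bool], labels: Sequence[int]) -> float:
    """Entropy reduction obtained by splitting on a binary feature."""
    n = len(labels)
    if n == 0:
        return 0.0
    with_f = [y for p, y in zip(present, labels) if p]
    without_f = [y for p, y in zip(present, labels) if not p]
    conditional = (len(with_f) / n) * entropy(with_f) \
        + (len(without_f) / n) * entropy(without_f)
    return entropy(labels) - conditional


def top_k_features(samples: List[Set[str]], labels: Sequence[int],
                   k: int) -> List[str]:
    """Rank all observed features by information gain and keep the best k."""
    vocabulary = sorted(set().union(*samples))
    scored = [(information_gain([f in s for s in samples], labels), f)
              for f in vocabulary]
    # Highest gain first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [f for _, f in scored[:k]]
```

In the setting described above, `k` would be 1000 and the features would be the extracted opcodes; a feature that appears equally often in both classes receives a gain of zero and is ranked last.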
Another different hybrid approach that uses both
Real-Time Deep Learning-Based Malware Detection Using Static and Dynamic Features