Section 3 describes the fundamental background, such as the PE file format and LSTM networks. In Section 4, we describe feature preprocessing and propose a feature transformation using LSTM networks. Section 5 describes our experimental setup, from the dataset and hardware used to the final evaluation with supervised ML algorithms. We conclude our work in Section 6.
2 RELATED WORK
In this section, we review related research in the field of static malware detection. We focus on papers related to neural networks, notably recurrent neural networks (RNNs). However, we found little work dealing with the use of LSTM networks as a feature preprocessing step before the classification itself.
In (Lu, 2019), the authors used opcodes (operation codes, parts of machine-language instructions (Barron, 1978)) extracted from disassembled binary files. From these opcodes, they created a language with the help of word embeddings. This language was then processed by an LSTM network to obtain the prediction. They achieved an AUC-ROC score of 0.99; however, their dataset consisted of only 1,092 samples.
A much larger dataset of 90,000 samples was used in (Zhou, 2018). The authors used an LSTM network to process API call sequences, combined with a convolutional neural network, to detect malicious files. Combining static and dynamic features, they achieved an accuracy of 97.3%.
Deep neural networks were also used in (Saxe and Berlin, 2015), together with Bayesian statistics. The authors worked with a large dataset of more than 400 thousand binaries. With the FPR fixed at 0.1%, they reported an AUC-ROC of 0.99964 and a TPR of 95.2%.
The authors of (Hardy et al., 2016) used stacked
autoencoders for malware classification and achieved
an accuracy of 95.64% on 50,000 samples.
In (Vinayakumar et al., 2018), the authors trained a stacked LSTM network and achieved an accuracy of 97.5% with an AUC-ROC score of 0.998. However, they focused on Android files and collected only 558 APKs (Android application packages).
3 BACKGROUND
In this section, we explain the background necessary for this paper. The first part deals with the Portable Executable file format, describing its use cases and structure. In the second part, we study LSTM networks in detail. At the end, we also briefly mention autoencoder networks.
3.1 Portable Executable
The Portable Executable (PE) format is a file format for executables, DLLs (dynamic-link libraries), and other programs on Windows operating systems (Windows NT). Portable in the name denotes transferability between 32-bit and 64-bit systems. The format contains all the basic information needed by the OS loader (Kowalczyk, 2018).
The structure of a PE file is strictly defined: it starts with the MS-DOS stub and header, followed by the file, optional, and section headers, and ends with the program sections, as illustrated in Figure 1. A detailed description can be found in (Karl Bridge, 2019).
Figure 1: Structure of a PE file.
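To make the header layout concrete, the following minimal sketch reads the file, optional, and section headers of a PE file with the open-source pefile library. It is an illustration added here rather than part of the paper's pipeline; the file name sample.exe is a placeholder.

import pefile

# Parse the PE structure of a binary (the path is a placeholder).
pe = pefile.PE("sample.exe")

# File header: machine type and number of sections.
print(hex(pe.FILE_HEADER.Machine))
print(pe.FILE_HEADER.NumberOfSections)

# Optional header: entry point and preferred image base.
print(hex(pe.OPTIONAL_HEADER.AddressOfEntryPoint))
print(hex(pe.OPTIONAL_HEADER.ImageBase))

# Section headers: name, virtual size, and per-section entropy.
for section in pe.sections:
    name = section.Name.rstrip(b"\x00").decode(errors="ignore")
    print(name, hex(section.Misc_VirtualSize), section.get_entropy())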
3.2 LSTM Network
The long short-term memory (LSTM) network is a variant of recurrent neural networks. This architecture was introduced in (Hochreiter and Schmidhuber, 1997). The improvement lies in replacing the simple RNN node with a compound unit consisting of a hidden state $h_t$ (as in RNNs) and a so-called cell state $c_t$. In addition, an input node $g_t$ compiles the input at every time step $t$, and three gates control the flow of information. The gates are binary vectors, where 1 allows data to pass through and 0 blocks it. Gate operations are realized as the Hadamard (element-wise) product with another vector (Leskovec et al., 2020).
As mentioned above, the LSTM cell is formed by a group of simple units. The key difference from an RNN is the addition of three gates that regulate the input and output of the cell.
Note that $W_x$, $W_h$ and $\vec{b}$ with subscripts in all of the equations below are learned weight matrices and bias vectors, respectively, and $f$ denotes an activation function, e.g. the sigmoid. Subscripts distinguish the matrices and vectors used in the specific equations.
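As an illustration of this structure, the following minimal sketch (an assumption following the standard LSTM formulation, not code from the paper) computes one time step of an LSTM cell; the gate names i, f, o (input, forget, output) and the dictionary-based parameter layout are chosen here for readability.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wx, Wh, b):
    # Wx, Wh, b hold one weight matrix / bias vector per unit (i, f, o, g),
    # corresponding to the W_x, W_h and b with subscripts mentioned above.
    i_t = sigmoid(Wx["i"] @ x_t + Wh["i"] @ h_prev + b["i"])  # input gate
    f_t = sigmoid(Wx["f"] @ x_t + Wh["f"] @ h_prev + b["f"])  # forget gate
    o_t = sigmoid(Wx["o"] @ x_t + Wh["o"] @ h_prev + b["o"])  # output gate
    g_t = np.tanh(Wx["g"] @ x_t + Wh["g"] @ h_prev + b["g"])  # input node
    c_t = f_t * c_prev + i_t * g_t   # new cell state (element-wise products)
    h_t = o_t * np.tanh(c_t)         # new hidden state
    return h_t, c_t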