data architectures (James et al., 2018). On the other
hand, malware analysis can benefit from these models as
well.
The promise of machine learning (ML) for detecting
malware lies in capturing the characteristic features of
malicious software so that good and bad binaries can be
distinguished. Several steps are needed for that purpose:
malicious and benign binaries are collected and
malware-specific features are extracted (Saxe and
Sanders, 2018) in order to train an appropriate
inference model.
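The extraction step above can be illustrated with a minimal sketch. The snippet below reads a few header-level fields from raw PE bytes using only the Python standard library; the field offsets (e_lfanew at 0x3C, the "PE\0\0" signature, the COFF Machine and NumberOfSections fields) follow the PE/COFF format, but the feature set and the synthetic header are simplified illustrations only — a real pipeline would extract far richer features (imported APIs, section entropy, strings, and so on).

```python
import struct

def extract_basic_features(data: bytes) -> dict:
    """Extract a few header-level features from raw PE bytes.

    Minimal illustration only: real malware pipelines extract far
    richer features (imported APIs, section entropy, strings, ...).
    """
    features = {"has_mz": data[:2] == b"MZ", "size": len(data)}
    if features["has_mz"] and len(data) >= 0x40:
        # Offset 0x3C of the DOS header holds e_lfanew, the file
        # offset of the "PE\0\0" signature.
        (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
        features["valid_pe"] = data[e_lfanew:e_lfanew + 4] == b"PE\0\0"
        if features["valid_pe"]:
            # COFF header follows the signature: Machine (2 bytes),
            # then NumberOfSections (2 bytes).
            machine, n_sections = struct.unpack_from("<HH", data, e_lfanew + 4)
            features["machine"] = machine
            features["num_sections"] = n_sections
    return features

# Build a tiny synthetic header to demonstrate (not a real binary).
blob = bytearray(0x60)
blob[:2] = b"MZ"
struct.pack_into("<I", blob, 0x3C, 0x40)        # e_lfanew -> 0x40
blob[0x40:0x44] = b"PE\0\0"
struct.pack_into("<HH", blob, 0x44, 0x8664, 3)  # x86-64, 3 sections
print(extract_basic_features(bytes(blob)))
```

The resulting dictionary of per-binary features is what the learning algorithms discussed below consume.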
Multiple studies in this field have been conducted to
analyze malware based on APIs (Fan et al., 2015),
system calls (Nikolopoulos and Polenakis, 2017), and
network inspection (Boukhtouta et al., 2016), and to
detect Android malware (Wu et al., 2016). In this
paper, several ML algorithms are tested and used to
analyze input PE (Portable Executable) files and
establish their malicious or harmless nature. The
datasets were evaluated with several models, including
Random Forest, Logistic Regression, Naive Bayes,
Support Vector Machines, K-nearest neighbors, and
Neural Networks. Finally, multiple experiments are
conducted on real data to assess the accuracy of the
models.
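The evaluation protocol just described — train several models, then compare their accuracy on held-out data — can be sketched in miniature. The snippet below compares a majority-class baseline against a 1-nearest-neighbor classifier on tiny hypothetical 2-D feature vectors; the data, labels, and models are illustrative stand-ins, not the paper's datasets or algorithms.

```python
from collections import Counter
import math

def nn1_predict(train_X, train_y, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    dists = [math.dist(x, t) for t in train_X]
    return train_y[dists.index(min(dists))]

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

# Hypothetical 2-D feature vectors (e.g. scaled header features);
# label 1 = malicious, 0 = benign.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
train_y = [0, 0, 1, 1]
test_X = [(0.15, 0.15), (0.85, 0.85), (0.2, 0.3), (0.9, 0.7)]
test_y = [0, 1, 0, 1]

# Baseline: always predict the most frequent training label.
majority = Counter(train_y).most_common(1)[0][0]
baseline_pred = [majority] * len(test_X)
knn_pred = [nn1_predict(train_X, train_y, x) for x in test_X]

print("baseline:", accuracy(test_y, baseline_pred))
print("1-NN:    ", accuracy(test_y, knn_pred))
```

The same loop structure scales to the full set of models compared in this paper: fit each candidate on the training split, score it on the test split, and rank by accuracy.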
This paper is structured as follows: Section 2 gives
an overview of machine learning. Section 3 discusses
the aptness of several algorithms for analyzing
malware. Section 4 presents the results of each
detection algorithm. Finally, Section 5 concludes the
paper and outlines our future work.
2 MACHINE LEARNING
Historically, the beginnings of AI date back to Alan
Turing in the 1950s (Moor, 2003). In the popular
imagination, artificial intelligence is a program that
can perform human tasks and learn by itself. In
industry, however, AI rather denotes more or less
sophisticated algorithms that imitate human actions.
Subfields of AI include ML, NLP (Natural Language
Processing), Planning, Vision, and Robotics. ML is
the subfield of artificial intelligence that focuses
on creating machines that behave and operate
intelligently, or that simulate such intelligence. ML
is very effective in situations where insights must be
discovered in large and diverse datasets. ML
algorithms are grouped into five major classes, which
correspond to different types of learning (Russell and
Norvig, 2016):
1. Supervised learning: the algorithm is given a
certain number of examples (inputs) to learn from,
and these examples are labeled, i.e., each is
associated with a desired result (output). The
algorithm's task is then to learn the rule that
produces the output from the inputs, that is, to
estimate the best function f(x) connecting the input
(x) to the output (y). Two major types of problems
can be solved through supervised learning:
classification problems and regression problems.
2. Unsupervised learning: no labels are provided to
the algorithm, which discovers the characteristic
structure of the input without human assistance. The
algorithm builds its own representation, which a human
may find difficult to interpret. Common patterns are
identified in order to form homogeneous groups among
the observations. Unsupervised learning also splits
into two subcategories: clustering and association.
The idea behind clustering is to find similarities
within the data in order to form clusters. Slightly
different from clustering, association algorithms find
rules in the data; such rules can take the form "If
conditions X and Y are met, then event Z may occur".
3. Semi-supervised learning: it combines supervised
and unsupervised learning, leveraging both labeled and
unlabeled data in order to improve the quality of
learning (Zhu et al., 2003).
4. Reinforcement learning: it sits between the first
two approaches. This technique does not rely on
labeled data, but operates through an
experience-and-reward method: each outcome is
evaluated and fed back into the learning algorithm to
improve the decision rules and converge toward a
better solution to the problem. Oriented toward
decision-making, this learning is based on experience
(failures and successes) (Littman, 1994).
5. Transfer learning: a form of learning that
optimizes and improves a model already in place. The
idea, fairly conceptual, is to apply knowledge
acquired on one task to a second, related task.
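As a concrete instance of the clustering idea described under unsupervised learning above, the sketch below implements a minimal two-cluster k-means: points are grouped by nearest centroid, then each centroid moves to its group's mean. The fixed initial centroids and toy 2-D points are simplifying assumptions; production code would use a smarter initialization such as k-means++.

```python
def kmeans(points, centroids, iters=10):
    """Minimal k-means: alternate assignment and centroid update."""
    clusters = [[] for _ in centroids]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to its cluster's mean
        # (empty clusters keep their previous centroid).
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters

# Two clearly separated groups of hypothetical 2-D feature vectors.
pts = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
       (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
cents, groups = kmeans(pts, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(cents)
```

No labels are used anywhere: the grouping emerges purely from the similarity structure of the data, which is exactly the point of the unsupervised setting.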
Several ML algorithms are incorporated depending on
their relevance. This study focuses on the following
algorithms:
• Random Forest: this algorithm belongs to the family
of model-aggregation methods; it is in fact a special
case of bagging (bootstrap aggregating). Moreover,
random forests add randomness at the variable level:
for each tree, a bootstrap sample is selected, and
each node of the tree is built on a randomly drawn
subset of variables.
Comparing Machine Learning Techniques for Malware Detection