that samples were classified correctly and samples of
the same family are similar to each other and dissimilar
to the samples of other families, using our approach,
the analysts can focus only on those samples that
belong to the malware families in which they
specialize.
A good similarity measure plays an important role
in the performance of distance-based classifiers, such
as k-Nearest Neighbors (KNN). The distance between
two feature vectors with the same class label should
be small, while the distance between two feature
vectors of different classes should be large. This is
the goal of distance metric learning methods, which
learn the parameters of a distance metric from training
data and can thereby potentially improve the performance
of such classifiers.
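To make the role of the metric concrete, the following minimal sketch shows how a linear transformation L changes the outcome of a nearest-neighbor vote. It is an illustration only: the toy data, the function names, and the hand-picked matrix are assumptions introduced here, and the matrix merely stands in for one that a DML algorithm would learn from training data.

```python
import math
from collections import Counter

def transform(L, x):
    """Apply the linear map L (a list of rows) to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]

def learned_distance(L, x, y):
    """Euclidean distance after mapping both points through L.

    This equals the Mahalanobis-style distance sqrt((x-y)^T M (x-y))
    with M = L^T L.
    """
    return math.dist(transform(L, x), transform(L, y))

def knn_predict(L, train, query, k=1):
    """Majority vote among the k training points nearest to query under L."""
    ranked = sorted(train, key=lambda item: learned_distance(L, item[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data (assumed): feature 0 separates the classes, feature 1 is noise
# on a much larger scale.
train = [([0.0, 0.0], "A"), ([0.1, 9.0], "A"),
         ([1.0, 1.0], "B"), ([1.1, 8.0], "B")]
query = [0.3, 8.0]  # closer to class A along the informative feature

identity = [[1.0, 0.0], [0.0, 1.0]]      # plain Euclidean distance
hand_picked = [[10.0, 0.0], [0.0, 0.1]]  # assumed stand-in for a learned L

euclidean_label = knn_predict(identity, train, query)    # "B"
rescaled_label = knn_predict(hand_picked, train, query)  # "A"
```

Under the plain Euclidean distance the noisy feature dominates and the query is assigned to class B; after rescaling, the informative feature dominates and the query is assigned to class A. A DML method aims to find such a transformation automatically rather than by hand.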
In our experiments, we consider six malware
families, which is a relatively small number. An-
other limitation of our work lies in assuming that our
dataset is large enough for training distance metric
learning (DML) algorithms. However, in practice,
new families or new malware variants are continuously
emerging. Therefore, at some point the training set
may not contain enough samples of the desired malware
family to train a supervised learning classifier.
The contributions of this paper are as follows:
• We determined and described a list of 25 features,
all but one of which (the file size) are extracted
from the PE file format. For each feature from
a section header, we considered the order of the
section rather than the type of the section (such as
.text, .data, .rsrc, etc.). While the sections’ order
turns out to be important for malware detection,
this kind of information is often not mentioned in
research papers.
• Using three DML algorithms, LMNN, NCA, and
MLKR, we achieved significantly better multiclass
classification results than any of the state-of-the-art
ML algorithms considered in our experiments.
We provided practical information concerning
performance, computational time, and resource
usage.
• We showed that the DML-based methods might
improve multiclass classification results even
when standard methods such as feature selection
or algorithm tuning were already applied. As a
result, we suggest using DML algorithms as an
important preprocessing step.
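The idea of indexing section-header features by the order of the section rather than by its name can be sketched as follows. This is a hypothetical illustration, not the feature extractor used in our experiments: the function name, the field names (raw_size, virtual_size), the limit of three sections, and the zero padding for missing sections are all assumptions made for the example.

```python
def section_order_features(sections, fields=("raw_size", "virtual_size"),
                           max_sections=3):
    """Flatten per-section header values into features keyed by position.

    sections: list of dicts, in the order the sections appear in the file.
    Positions beyond the last section are padded with 0 (an assumption),
    so every sample yields a vector of the same length.
    """
    features = {}
    for position in range(max_sections):
        for field in fields:
            if position < len(sections):
                value = sections[position].get(field, 0)
            else:
                value = 0
            features[f"section{position + 1}_{field}"] = value
    return features

# Hypothetical parsed headers for two files that use different section names.
sample_a = [{"name": ".text", "raw_size": 4096, "virtual_size": 4000},
            {"name": ".data", "raw_size": 512, "virtual_size": 600}]
sample_b = [{"name": "UPX0", "raw_size": 0, "virtual_size": 8192},
            {"name": "UPX1", "raw_size": 2048, "virtual_size": 4096}]

features_a = section_order_features(sample_a)
features_b = section_order_features(sample_b)
```

Both samples produce the same feature names (section1_raw_size, section1_virtual_size, and so on) even though their section names differ, so the resulting vectors are directly comparable.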
The rest of the paper is organized as follows. In
Section 2, we review recent malware detection me-
thods based on machine learning focusing on the clas-
sification of malware families. In Section 3, we give
some theoretical background and discuss three dis-
tance metric learning techniques used in our experi-
ments. The experimental setup and results of feature
selection algorithms are presented in Section 4. Sec-
tion 5 describes DML-based experiments and results.
We summarize our research work in Section 6.
2 RELATED WORK
This section briefly reviews the previous research pa-
pers on malware family classification related to our
work.
In (Basole et al., 2020), the authors conducted experiments
based on byte n-gram features and considered 20 malware
families. Binary classification was performed at different
levels. At the first level, for each of the 20 families,
they performed binary classification on 1,000 malware
samples from one family and 1,000 benign samples. At the
second level, the malware class consists of two malware
families; at the third level, the malware class consists
of three malware families, and so on up to level 20,
where the malware class contains all 20 malware families.
The authors applied four state-of-the-art machine learning
algorithms: KNN, Support Vector Machines, Random Forest,
and Multilayer Perceptron. The best classification results
(balanced accuracy over 90% at level 20) were achieved
using KNN and Random Forest, with KNN achieving the most
consistent results.
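For reference, balanced accuracy is usually defined as the mean of the per-class recall values, so that each class contributes equally regardless of its size. The sketch below assumes this standard definition (we assume the cited work follows it); the function name and the toy labels are made up for illustration.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: each class counts equally, whatever its size."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, predicted in zip(y_true, y_pred):
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy labels (assumed): 8 malware samples, 2 benign samples.
y_true = ["malware"] * 8 + ["benign"] * 2
y_pred = ["malware"] * 10  # a classifier that always answers "malware"

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = balanced_accuracy(y_true, y_pred)
```

In this toy case the plain accuracy is 0.8, while the balanced accuracy is 0.5, which exposes that the classifier never recognizes the minority class.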
A fully automated system for analysis, classifi-
cation, and clustering of malware samples was in-
troduced in (Mohaisen et al., 2015). This system is
called AMAL and it collects behavior-based artifacts
describing files, registry, and network communica-
tion, to create features that are then used for classifica-
tion and clustering of malware samples into families.
The authors achieved more than 99% precision and
recall in classification and more than 98% precision
and recall for unsupervised clustering.
In (Ahmadi et al., 2016), the authors proposed
a malware classification system using different mal-
ware characteristics to assign malware samples to the
most appropriate malware family. The system allows
the classification of obfuscated and packed malware
without any deobfuscation or unpacking. A high
classification accuracy of 99.77% was achieved on
the publicly accessible Microsoft Malware Challenge
dataset.
(Islam et al., 2013) presented a classification
method based on static (function length frequency and
printable string) and dynamic (API function names
with API parameters) features that were integrated
ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy