Improving Classiﬁcation of Malware Families using Learning a Distance

Metric

Martin Jure

cek, Olha Jure

ckov

a and R

obert L

orencz

Faculty of Information Technology, Czech Technical University in Prague, Czech Republic

Keywords:

Malware Family, PE File Format, Distance Metric Learning, Machine Learning.

Abstract:

The objective of malware family classiﬁcation is to assign a tested sample to the correct malware family. This

paper concerns the application of selected state-of-the-art distance metric learning techniques to malware fam-

ilies classiﬁcation. The goal of distance metric learning algorithms is to ﬁnd the most appropriate distance

metric parameters concerning some optimization criteria. The distance metric learning algorithms considered

in our research learn from metadata, mostly contained in the headers of executable ﬁles in the PE ﬁle format.

Several experiments have been conducted on the dataset with 14,000 samples consisting of six prevalent mal-

ware families and benign ﬁles. The experimental results showed that the average precision and recall of the

k -Nearest Neighbors algorithm using the distance learned on training data were improved signiﬁcantly com-

paring when the non-learned distance was used. The k -Nearest Neighbors classiﬁer using the Mahalanobis

distance metric learned by the Metric Learning for Kernel Regression method achieved average precision and

recall, both of 97.04% compared to Random Forest with a 96.44% of average precision and 96.41% of average

recall, which achieved the best classiﬁcation results among the state-of-the-art ML algorithms considered in

our experiments.

1 INTRODUCTION

A large number of new malicious samples are cre-

ated every day, which makes manual analysis imprac-

tical. The majority of these samples are generated by

malware generators, which need to input some pa-

rameters. These malware generators, together with

their particular settings, deﬁne corresponding mal-

ware families. Samples generated from the same ge-

nerator with a ﬁxed setting (i.e., from the malware

family) may be potentially similar to each other and

different from samples belonging to other malware

families or benign ﬁles. This work focuses on lever-

aging these differences to distinguish between mal-

ware families. Note that samples from the same mal-

ware family, however, generated in a different time

period may be different from each other (Wadkar

et al., 2020).

Since the samples are usually obfuscated, it is dif-

ﬁcult to classify new (previously unseen) samples into

the correct malware families. Moreover, there is no

known general similarity measure suitable for a fea-

ture set extracted from the PE ﬁle format to correctly

cluster all malware families. (Jure

cek and L

orencz,

2018) presented distance metric specially designed

for the PE ﬁle format that can handle all data types

of features.

Our work focuses on the multiclass classiﬁcation

problem where each malware family and benign ﬁles

have their own class. This multiclass classiﬁcation

problem is more challenging than a binary classiﬁca-

tion problem where the goal is to distinguish between

malicious and benign ﬁles. However, the results of

(Basole et al., 2020) may indicate that an increasing

number of families (from 2 to 20 families) drops an

average balanced accuracy slightly.

The practical use of distinguishing between mal-

ware families lies in helping malware analysts to deal

with a large number of samples. Due to a large num-

ber of malicious ﬁles that come to antivirus vendors,

there is a need to automatically categorize malware

into groups corresponding to malware families. Sam-

ples belonging to the same group are similar to each

other with respect to some similarity measure (deter-

mined by distance metric). These groups are then dis-

tributed to malware analysts and assuming that ﬁles

belonging to the same group have similar behavior, it

may help speed up the further analysis.

Usually, malware analysts are specialists for some

limited number of malware families. If we assume

Jure

cek, M., Jure

cková, O. and Lórencz, R.

Improving Classiﬁcation of Malware Families using Learning a Distance Metric.

DOI: 10.5220/0010326306430652

In Proceedings of the 7th International Conference on Information Systems Security and Privacy (ICISSP 2021), pages 643-652

ISBN: 978-989-758-491-6

643

that samples were classiﬁed correctly and samples of

the same family are similar to each other and dissim-

ilar to the samples of other families, using our ap-

proach, the analysts can focus only on those samples

which belong to the malware families for which the

analysts are specialized.

Good similarity measure plays an important role

in the performance of distance-based classiﬁers, such

as k -Nearest Neighbors (KNN). The distance be-

tween two feature vectors having the same class la-

bel must be minimized while the distance between

two feature vectors of different classes must be max-

imized. This is the goal of distance metric learning

methods used to learn the parameters of distance met-

rics from training data. As a result, they can poten-

tially improve the performance of the classiﬁers.

In our experiments, we consider six malware

families, which is a relatively small number. An-

other limitation of our work lies in assuming that our

dataset is large enough for training distance metric

learning (DML) algorithms. However, in practice,

new families or new malware variants are continu-

ously emerging. Therefore the training set, at some

moment, may not contain enough samples of the de-

sired malware family to train some supervised learn-

ing classiﬁer.

The contributions of this paper are as follows:

• We determined and described the list of 25 fea-

tures all extracted (except one, i.e., size of a ﬁle)

from the PE ﬁle format. For each feature from

a section header, we considered the order of the

section rather than the type of the section (such as

.text, .data, .rsrc, etc.). While the sections’ order

turns out to be important for malware detection,

this kind of information is often not mentioned in

research papers.

• Using three DML algorithms, LMNN, NCA, and

MLKR, we achieved signiﬁcantly better multi-

class classiﬁcation results than any state-of-the-

art ML algorithms considered in our experiments.

We provided practical information concerning

performance, computational time, and resource

usage.

• We showed that the DML-based methods might

improve multiclass classiﬁcation results even

when standard methods such as feature selection

or algorithm tuning were already applied. As a

result, we suggest using DML algorithms as an

important preprocessing step.

The rest of the paper is organized as follows. In

Section 2, we review recent malware detection me-

thods based on machine learning focusing on the clas-

siﬁcation of malware families. In Section 3, we give

some theoretical background and discuss three dis-

tance metric learning techniques used in our experi-

ments. The experimental setup and results of feature

selection algorithms are presented in Section 4. Sec-

tion 5 describes DML-based experiments and results.

We summarize our research work in Section 6.

2 RELATED WORK

This section brieﬂy reviews the previous research pa-

pers on malware family classiﬁcation related to our

work.

In (Basole et al., 2020), the authors conducted ex-

periments based on byte n-gram features, and they

considered 20 malware families. A binary classi-

ﬁcation were performed on different levels. In the

ﬁrst level, for each of 20 families, they performed bi-

nary classiﬁcation for 1,000 malware samples from

one family and 1,000 benign samples. In the se-

cond level, the malware class consists of two malware

families; in the third level, the malware class consists

of three malware families, and so on up to level 20,

where the malware class contains all of the 20 mal-

ware families. The authors applied four state-of-the-

art machine learning algorithms: KNN, Support Vec-

tor Machines, Random Forest, and Multilayer Percep-

tron. The best classiﬁcation results (balanced accu-

racy) was achieved using KNN and Random Forest,

over 90% (at level 20), while KNN achieves the most

consistent results.

A fully automated system for analysis, classiﬁ-

cation, and clustering of malware samples was in-

troduced in (Mohaisen et al., 2015). This system is

called AMAL and it collects behavior-based artifacts

describing ﬁles, registry, and network communica-

tion, to create features that are then used for classiﬁca-

tion and clustering of malware samples into families.

The authors achieved more than 99% of precision and

recall in classiﬁcation and more than 98% of precision

and recall for unsupervised clustering.

In (Ahmadi et al., 2016), the authors proposed

a malware classiﬁcation system using different mal-

ware characteristics to assign malware samples to the

most appropriate malware family. The system allows

the classiﬁcation of obfuscated and packed malware

without doing any deobfuscation and unpacking pro-

cesses. High classiﬁcation accuracy of 99.77% was

achieved on the publicly accessible Microsoft Mal-

ware Challenge dataset.

(Islam et al., 2013) presented a classiﬁcation

method based on static (function length frequency and

printable sting) and dynamic (API function names

with API parameters) features that were integrated

ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

644

into one feature vector. The obtained results showed

that integrating features improved classiﬁcation accu-

racy signiﬁcantly. The highest weighted average ac-

curacy was achieved by the meta-Random Forest clas-

siﬁer.

Another malware family classiﬁcation system

referred to as VILO is presented in (Lakhotia

et al., 2013). They used TFIDF-weighted opcode

mnemonic permutation features and achieved bet-

ween 0.14% and 5.42% fewer misclassiﬁcations u-

sing KNN classiﬁer than does the usage of n-gram

features.

In the rest of this section, we survey some previ-

ous works on distance metric learning applied to the

problem of malware detection. There is only a cou-

ple of works that address this topic. (Jure

cek and

orencz, 2020) deals with measure learning and its

application to malware detection. Particle swarm op-

timization (PSO) was used to ﬁnd appropriate feature

weights for the heterogeneous distance function used

in the KNN classiﬁer. Positions of particles in the ini-

tialization step of PSO were set according to the in-

formation gain computed in the feature selection step

rather than randomly. As a result, PSO was acceler-

ated, and better classiﬁcation accuracy was achieved

using the weighted distance function.

Work (Kong and Yan, 2013) concerns with a mal-

ware detection method based on structural informa-

tion. The discriminant distance metric is learned to

cluster the malware samples belonging to the same

malware family.

3 BACKGROUND

Performance of some ML classiﬁers, such as KNN,

depends signiﬁcantly on the distance metric used to

compute similarity measure between two samples.

These classiﬁers rely on the assumption that samples

belonging to the same class are close to each other

(with respect to the distance function), and they are

far from samples belonging to the different classes.

The DML algorithms were designed to improve

the performance of distance-based classiﬁers via

learning the distance metric. This section provides

background and a brief description of three state-of-

the-art distance metric learning algorithms, LMNN,

NCA, and MLKR, used in our experiments.

Euclidean distance is by far the most commonly

used distance metric. Let x and y be two n-

dimensional feature vectors. The weighted Euclidean

distance is deﬁned as follows:

(x, y) =

∑

i=1

− y

)

(1)

where w

is a weight (non-negative real number)

associated with the jth feature. The distance metric

learning problem for weighted Euclidean distance is

deﬁned as ﬁnding (or learning) an appropriate weight

vector w = (w

, . . . , w

) using training data, with re-

spect to some optimization criterion, usually mini-

mizing error rate.

Several distance functions have been presented

(Wilson and Martinez, 1997). To improve classiﬁ-

cation or clustering results, many weighting schemes

were designed. A review of feature weighting me-

thods for lazy learning algorithms was proposed in

(Wettschereck et al., 1997).

Mahalanobis distance for two n-dimensional fea-

ture vectors x and y is deﬁned as

(x, y) =

(x − y)

M(x − y) (2)

where M is a positive semideﬁnite matrix. Maha-

lanobis distance can be considered as a generalization

of Euclidean distance, since if M is the identity ma-

trix, then d

in Eq. (2) is reduced to Euclidean dis-

tance. If M is diagonal, this corresponds to learning

the feature weights M

= w

from Eq. (1) deﬁned for

weighted Euclidean distance.

The goal of learning the Mahalanobis distance is

to ﬁnd an appropriate matrix M with respect to some

optimization criterion. In the context of the KNN

classiﬁer, the goal is to ﬁnd a matrix M, which is es-

timated from the training set, which leads to the lo-

west error rate of the KNN classiﬁer. Since a positive

semideﬁnite matrix M can always be decomposed as

M = L

L, distance metric learning problem can be

viewed as ﬁnding either M or L = M

. Mahalanobis

distance deﬁned in Eq. (2) expressed in terms of the

matrix L is deﬁned as

(x, y) = d

(x, y) = kL

(x − y)k

(3)

The matrix L can be used to projects the origi-

nal feature space into a new embedding feature space.

This projection is a linear transformation deﬁned for

feature vector x as

= Lx (4)

Note that the Mahanalobis distance d

(x, y) for

two samples from the original feature space equals the

Euclidean distance d(x

, y

) =

− y

)

− y

)

Improving Classiﬁcation of Malware Families using Learning a Distance Metric

645

in the space transformed by Eq. (4). This transfor-

mation is usefull since computation of Euclidean dis-

tance has lower time complexity than computation of

Mahalanobis distance.

In the rest of this paper, we will consider the

feature space as a real n-dimensional space R

. The

following subsections brieﬂy describe three distance

metric learning methods that we used in our experi-

ments.

3.1 Large Margin Nearest Neighbor

Large Margin Nearest Neighbor (LMNN) (Wein-

berger et al., 2006) is one of the state-of-the-art

distance metric learning algorithms used to learn a

Mahalanobis distance metric for KNN classiﬁcation.

LMNN consists of two steps. In the ﬁrst step, for each

instance, x, a set of k nearest instances belonging to

the same class as x (referred to as target neighbors)

is identiﬁed. In the second step, we adapt the Maha-

lanobis distance with the goal that the target neighbors

are closer to x than instances from different classes

that are separated by a large margin.

The Mahalanobis distance metric is estimated by

solving a semideﬁnite programming problem deﬁned

as:

min

∑

i, j: j→i



, x

)

+ µ

∑

k:y

6=y

max



0, 1 + d

, x

)

− d

, x

)



(5)

The notation j → i refers that the sample x

is a

target neighbor of the sample x

, and y

denotes the

class of x

. The parameter µ deﬁnes a trade-off be-

tween the two objectives.

3.2 Neighborhood Component Analysis

(Goldberger et al., 2005) proposed the Neighborhood

Component Analysis (NCA), a distance metric lear-

ning algorithm specially designed to improve KNN

classiﬁcation.

Let p

i j

be the probability that the sample x

is the

neighbor of the sample x

belonging to the same class

as x

. This probability is deﬁned as:

i j

exp(−||Lx

− Lx

)

∑

l6=i

exp(−||Lx

− Lx

)

, p

= 0 (6)

The goal of NCA is to ﬁnd the matrix L that ma-

ximizes the sum of probabilities p

argmax

N−1

∑

i=0

∑

j: j6=i,y

i j

(7)

The well-known gradient ascent algorithm is used

to solve this optimization problem. Note that both

LMNN and NCA algorithms do not make any as-

sumptions on the class distributions.

3.3 Metric Learning for Kernel

Regression

(Weinberger and Tesauro, 2007) proposed Metric

Learning for Kernel Regression (MLKR), which aims

at training a Mahalanobis matrix by minimizing the

error loss over the training samples:

L =

∑

− ˆy

)

(8)

where the prediction class ˆy

is derived from ker-

nel regression by calculating a weighted average of

the training samples:

ˆy

∑

j6=i

K(x

, x

)

∑

j6=i

K(x

, x

)

(9)

MLKR can be applied to many types of kernel

functions K(x

, x

) and distance metrics d(x, y).

Note that the mentioned distance metric learning

algorithms can be used as supervised dimensionality

reduction algorithms. Considering the matrix L ∈

d×n

with d < n then the dimension of transformed

sample x

= Lx is reduced to d.

4 EXPERIMENTAL SETUP

In this section, we present our dataset, describe eval-

uation measures and feature selection results.

4.1 Dataset

Our experiments are based on the dataset contain-

ing 14,000 samples consisting of 6 malware families

and benign ﬁles. The dataset is well-balanced since

each of the 6 malware families is of equal size, i.e.,

2,000 samples, and the number of benign ﬁles is also

2,000. The malicious programs were obtained from

(VirusShare, 2020), an online repository containing

various malware families. Benign ﬁles were gathered

from university computers. We conﬁrm that all mali-

cious samples considered in our experiments match

ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

646

known signatures from antivirus companies. Also,

none of our benign programs was detected as mal-

ware.

In our experiments, we used the following six

prevalent malware families:

Allaple – a polymorphic network worm that spreads

to other computers and performs denial-of-service

(DoS) attacks.

Skeeyah – a Trojan horse that inﬁltrates systems and

steals various personal information and adds the

computer to a botnet.

Virlock – ransomware that locks victims’ computer

and demands a payment to unlock it.

Virut – a virus with backdoor functionality that ope-

rates over an IRC-based communications proto-

col.

Vundo – a Trojan horse that displays pop-up adver-

tisements and also injects JavaScript into HTML

pages.

Zbot – also known as Zeus, is a Trojan horse that

steals conﬁguration ﬁles, credentials, and banking

details.

4.2 Evaluation Measures

In this section, we present the metrics we used to mea-

sure the performance of the classiﬁcation models. In a

binary classiﬁcation problem, the following classical

quantities are employed:

• True Positive (TP) represents the number of mali-

cious samples classiﬁed as malware

• True Negative (TN) represents the number of be-

nign samples classiﬁed as benign

• False Positive (FP) represents the number of be-

nign samples classiﬁed as malware

• False Negative (FN) represents the number of ma-

licious samples classiﬁed as benign

The performance of binary classiﬁers considered

in our experiments is measured using three standard

metrics. The most intuitive and commonly used eval-

uation metric is the error rate:

ERR =

FP + FN

TP + TN + FP + FN

(10)

It is deﬁned on a given test set as the percentage

of incorrectly classiﬁed samples. Alternative for error

rate is accuracy deﬁned as ACC= 1−ERR. The se-

cond metric is precision, and it is deﬁned as follows:

precision =

TP + FP

(11)

Precision is the percentage of samples classiﬁed as

malware that are truly malware. The third parameter,

recall (or true positive rate), is deﬁned as:

recall =

TP + FN

(12)

Recall is the percentage of truly malicious sam-

ples that were classiﬁed as malware.

In the multiclass evaluation, since all the classes

have the same number of samples, we use averaged

versions of error rate, precision and recall. Average

error rate is deﬁned as follows:

(average) ERR =

∑

i≤N

class

pred

6= class

true

(13)

where N is the size of our dataset, and 1 is the

indicator function. Average precision and average re-

call is deﬁned as an average resulting precisions and

recalls, respectively, across all classes.

4.3 Feature Selection

The features used in our experiments are extracted

from the portable executable (PE) ﬁle format (Mi-

crosoft, 2019), which is the ﬁle format for executa-

bles, DLLs, object code, and others used in 32-bit and

64-bit versions of the Windows operating system. The

PE ﬁle format is the most widely used ﬁle format for

malware samples run on desktop platforms.

For extracting features from PE ﬁles, we used

Python module pefile (Carrera, 2017). This modu-

le extracts all PE ﬁle attributes into an object from

which they can be easily accessed. We extracted 358

numeric features that are based on static information

only, i.e., without running the program. The dimen-

sionality is high since for each section, and for each

kind of characteristics (array of ﬂags), we consider

each ﬂag as a single feature.

Before applying feature selection methods,

all features were normalized using procedure

preprocessing.normalize from the Scikit-learn

library (Scikit-learn, 2020). We then employed the

six feature selection methods also imported from the

Scikit-learn library.

Table 1 shows error rates of the KNN (k = 1) clas-

siﬁer applied to the feature set reduced using the cor-

responding feature selection algorithms. The lowest

error rate of 4.13% was achieved for 25 selected fea-

tures by RFE Logistic Regression. The KNN for the

original feature set (i.e., 358 features) achieved an er-

ror rate of 4.31%. All feature selection algorithms

were evaluated by 5-fold cross-validation on the ran-

domly chosen training data containing 80% of the

whole dataset. The remaining 20% of the dataset was

Improving Classiﬁcation of Malware Families using Learning a Distance Metric

647

Table 1: Evaluation of the feature selection algorithms in terms of error rates of the KNN (k = 1) classiﬁer. The abbreviation

SFM refers to Scikit-learn procedure feature selection.SelectFromModel. The abbreviation RFE refers to Recursive

Feature Elimination implemented in feature selection.RFE from the Scikit-learn library as well.

Error rates [%] for speciﬁc number of features

Feature selection method 2 5 10 25 50 75

Principal component analysis 11.00 5.30 4.85 4.22 4.26 4.31

SFM Logistic Regression 13.61 7.90 4.76 4.20 4.29 4.31

SFM Decision Tree 26.56 8.44 6.21 4.72 4.50 4.37

Information Gain 15.17 8.77 7.79 5.61 4.31 4.52

RFE Logistic Regression 17.08 6.49 5.30 4.13 4.29 4.31

RFE Decision Tree 11.67 7.53 4.76 4.63 4.65 4.22

reserved for testing the DML and ML algorithms (see

Section 5).

The following Fig. 1 illustrates the performance

of RFE Logistic regression for a various number of

features. For 50 and more features, the corresponding

error rates are approximately the same as the error

rate achieved from the original feature set (i.e., 358

features).

Figure 1: Relation between dimensionality and error rate of

KNN classiﬁer (k = 1).

Notice that for each feature selection algorithm

under consideration, except SFM Decision Tree, we

achieved a surprisingly low error rate with only two

features, as shown in Table 1. We will provide more

experiments for this extremely low dimensionality in

Section 5.2.

We also performed all six feature selection algo-

rithms to explore an even higher number of features:

100, 125, and 150. However, the corresponding er-

ror rates achieved from all the feature selection algo-

rithms were higher than 4.13%. As a result, in all ex-

periments (except those in Section 5.2), we will con-

sider 25 features selected by RFE Logistic regression

algorithm. To make our results reproducible, Table 2

summarizes all features used in our experiments. We

keep the name of the ﬁelds in the same form as in

the documentation (Microsoft, 2015), in order that the

reader can easily ﬁnd the detailed description.

The ﬁeld ﬁle size (size of the ﬁle on disk) is not

contained within the PE structure, and note that it dif-

fers from the ﬁeld SizeOfImage, which is the size of

the image loaded in memory.

PE ﬁles are divided into one or more sections. The

sections contain code, data, imports, and various cha-

racteristics. PE section features are considered sepa-

rately for each section. The order of the sections is

not the same for each PE ﬁle. Moreover, malware au-

thors can change the order of the sections. Therefore,

we prefer to consider only the order of sections (rather

than their names). The special importance among all

sections of a PE ﬁle has the last one since it may

contain useful information, especially for some types

of malware, such as ﬁle infector, which typically at-

taches malicious code at the end of the ﬁle. To deal

with a various number of sections across the samples,

we have decided to consider only the ﬁrst four sec-

tions and the last section.

5 EXPERIMENTAL RESULTS

This section presents multiclass classiﬁcation results

based on six base ML classiﬁers: k-Nearest Neigh-

bor (k = 1), Logistic Regression, (Gaussian) Naive

Bayes, Random Forest (number of trees in the for-

est = 100), and Multilayer Perceptron (hidden layer

sizes=(200,100), maximum number of iterations =

300, activation function = ’relu’, solver for weight op-

timization = ’adam’, random number generation for

weights and bias initialization = 1). Implementations

of the DML algorithms, the ML classiﬁers, and the

classiﬁcation metrics, are based on the Scikit-learn

library (Scikit-learn, 2020). If not mentioned, the

hyperparameters of the ML classiﬁers and the DML

methods were set to their default values as set in the

Scikit-learn library.

The input feature vectors used in the following ex-

periments are described in Section 4.3, and its dimen-

sionality is 25, except in the experiment described in

Section 5.2 where we consider only two-dimensional

ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

648

Table 2: List of 25 features selected by the RFE Logistic Regression algorithm sorted by importance. All except one (ﬁle

size) are extracted from the PE ﬁle format.

Position Feature name Structure

1. PointerToRawData Section Header (the ﬁrst section)

2. Characteristics, ﬂag IMAGE SCN TYPE DSECT Section Header (the ﬁrst section)

3. Characteristics, ﬂag IMAGE SCN TYPE COPY Section Header (the ﬁrst section)

4. Characteristics, ﬂag IMAGE SCN TYPE NOLOAD Section Header (the ﬁrst section)

5. RVA of Exception Table Optional Header Data Directories

6. PointerToRawData Section Header (the third section)

7. Characteristics, ﬂag IMAGE SCN TYPE GROUP Section Header (the ﬁrst section)

8. RVA of Certiﬁcate Table Optional Header Data Directories

9. RVA of Base RelocationTable Optional Header Data Directories

10. RVA of Bound Import Optional Header Data Directories

11. VirtualSize Section Header (the ﬁrst section)

12. SizeOfRawData Section Header (the ﬁrst section)

13. Size of ﬁle not part of PE format

14. VirtualAddress Section Header (the fourth section)

15. VirtualSize Section Header (the second section)

16. RVA of Import Table Optional Header Data Directories

17. Characteristics, ﬂag IMAGE SCN TYPE REG Section Header (the ﬁrst section)

18. AddressOfEntryPoint Optional Header Standard Fields

19. RVA of Export Table Optional Header Data Directories

20. VirtualAddress Section Header (the ﬁrst section)

21. RVA of TLS Table Optional Header Data Directories

22. PointerToRawData Section Header (the second section)

23. VirtualAddress Section Header (the last section)

24. Size of Import Table Optional Header Data Directories

25. VirtualAddress Section Header (the second section)

feature vector.

All the following experiments were executed on a

single computer platform having two processors (Intel

Xeon Gold 6136, 3.0GHz, 12 cores each), with 32

GB of RAM running the Ubuntu server 18.04 LTS

operating system.

5.1 Application of Distance Metric

Learning Algorithms

In the ﬁrst set of experiments, we applied the

KNN classiﬁer using the following four distances:

Euclidean (non-learned) distance, and three Maha-

lanobis distances learned by three DML algorithms

(separately) LMNN, NCA, and MLKR. We used 5-

fold cross-validation as follows. The training dataset

(80 % of the whole dataset) was randomly divided

into ﬁve subsets of equal size, where four subsets

were used for training the DML algorithms, and one

subset was used for testing. First, the DML algorithm

was trained on the four subsets, and then KNN with

learned Mahalanobis distance metric was employed

using training and testing sets. This procedure was

run ﬁve times with different subset reserved for tes-

ting. Classiﬁcation results obtained for each fold are

then averaged to produce a single cross-validation es-

timate.

The ﬁrst experiment focuses on the hyperpara-

meter k that expresses the number of nearest neigh-

bors considered in the KNN classiﬁer and LMNN (the

learning rate of the optimization procedure = 10

−6

)

algorithm. Note that both MLKR nor NCA do not de-

pend on k. Fig. 2 shows the effect of k on the error

rate of the KNN using all four distances.

Figure 2: The relation between the number of nearest neigh-

bors (k) and error rate (ERR) compared for four distances.

Improving Classiﬁcation of Malware Families using Learning a Distance Metric

649

The KNN classiﬁer for k = 1 with Euclidean dis-

tance achieved the ERR of 3.70%. Error rates of the

KNN using Mahalanobis distance learned by DML al-

gorithms are equal to: 3.23% for LMNN, 3.51% for

NCA, and 3.57% for MLKR.

The result of distance metric learning algorithms

is n×n matrix, where n is the dimension of the feature

vector. Since the number of components of the matrix

to be learned grows at a quadratic rate and the size of

the training data is ﬁxed, we can expect that the size of

training data stops being sufﬁcient for high values of

n. In the next experiment, we used the Principal com-

ponent analysis to reduce the data’s dimension and

examine the learning ability of distance metric learn-

ing algorithms. We applied 5-fold cross-validation

technique described above. For our ﬁxed-size training

dataset, the highest performance improvement gained

by using DML algorithms was achieved for the fol-

lowing dimensions: n = 3 for LMNN, n = 4 for NCA,

and n = 6 for MLKR. These results may indicate that

considering 25-dimensional feature vectors used in

our experiments, our dataset’s size is insufﬁcient for

learning as many as 625 parameters (i.e., the number

of components of the matrix M).

We also explored the variation of classiﬁcation re-

sults based on DML algorithms across six malware

families and benign class. Precisions and recalls of

the KNN classiﬁer using the hyperparameter k = 1

are summarize in Table 3. The results indicate that

the precisions and recalls corresponding to malware

families depend signiﬁcantly on the DML algorithms.

Regarding resource usage of DML algorithms,

32GB of RAM was sufﬁciently enough (i.e., with-

out a need to use the disk as swap memory) for all

conducted experiments. The average computational

times of the DML algorithms applied on 8,960 sam-

ples are as follows: LMNN took 550 seconds, NCA

took 960 seconds, and MLKR took 4, 800 seconds.

5.2 Representation of Malware Families

in Two Dimensions

In this section, we examine the classiﬁcation results

based on two-dimensional feature vectors. Two of the

most important features, PointerToRawData and ﬂag

IMAGE SCN TYPE DSECT (see Table 2), were se-

lected by the RFE Logistic Regression and used for

multiclass classiﬁcation.

Fig. 3 presents the effectiveness of the KNN (k =

1) classiﬁcation of malware families and benign ﬁles

represented by only two-dimensional feature vectors.

While we achieved 99% accuracy for the malware

families Skeeyah and Virlock, benign ﬁles’ accuracy

was 58%.

Figure 3: Normalized confusion matrix comparing the ac-

curacy of KNN (k = 1) using Euclidean distance for six mal-

ware families and benign samples.

Two-dimensional representation of feature vectors

allows us to show malware families as points in the

plane. Three malware families, Allaple, Virut, and

Vundo, are illustrated in Fig. 4. One hundred samples

were chosen randomly from each of these classes,

and we achieved the average classiﬁcation accuracy

of 92.00%.

Figure 4: Three malware families: Allaple, Virut, and

Vundo represented in two dimensions.

5.3 Comparison with the

State-of-the-Art ML Algorithms

In the last experiment, LMNN, NCA, and MLKR

methods (each separately) were used to transform the

data, as it is shown in Eq. (4) in Section 3. We com-

pared the performance of the state-of-the-art ML al-

gorithms for original (non-transformed) data and for

the transformed data. Table 4 shows that the high-

ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

650

Table 3: KNN (k = 1) classiﬁcation results of particular malware families and class of benign samples.

Precision [%] Recall [%]

class Euclid LMNN NCA MLKR Euclid LMNN NCA MLKR

allaple 98.58 98.79 92.32 92.39 98.42 98.49 92.60 91.69

benign 94.48 94.24 99.24 99.09 92.08 92.15 97.47 97.47

skeeyah 99.36 98.45 98.88 98.41 98.89 98.30 99.04 99.04

virlock 98.79 99.55 99.72 100 99.39 99.40 99.15 99.15

virut 97.24 96.51 97.38 96.19 95.44 97.07 96.49 96.49

vundo 95.38 97.26 94.85 95.27 98.07 97.71 98.43 98.11

zbot 92.65 92.59 93.03 93.50 94.21 94.34 92.33 93.08

Averaged 96.64 96.77 96.49 96.41 96.64 96.78 96.50 96.43

Table 4: Classiﬁcation results of DML algorithms evaluated for several state-of-the-art ML algorithms.

Average precision [%] Average recall [%]

KNN LR NB DT RF MLP KNN LR NB DT RF MLP

original 96.15 86.38 82.98 94.76 96.44 96.22 96.14 86.17 77.79 94.78 96.41 96.17

LMNN 96.77 89.55 82.02 95.20 97.05 96.39 96.78 89.27 81.03 95.20 97.00 96.35

NCA 96.45 90.78 78.08 94.87 96.76 95.11 96.46 90.60 75.02 94.86 96.72 95.07

MLKR 97.04 87.94 77.96 95.15 97.05 96.50 97.04 88.12 75.41 95.13 97.02 96.49

est average precision of 97.05% was achieved using

Random Forest on the data transformed by LMNN or

by MLKR. The KNN classiﬁer achieved the highest

average recall of 97.04% on the data transformed by

MLKR.

Regarding original (non-transformed) data, the

highest average precision of 96.44% and the highest

average recall of 96.41% among base ML algorithms

were both achieved by Random Forest.

Focusing on the KNN (k = 1) classiﬁer, any of

the three DML methods achieved better classiﬁca-

tion results (both average precision and recall) than

the KNN classiﬁer using common (non-learned) Eu-

clidean distance. Regarding the Naive Bayes classi-

ﬁer, we achieved a very low precision of 54.53% for

Vundo and a very low recall of 47.02% for Zbot com-

pared to other ML classiﬁers. These two results cause

that average precision and recall for Naive Bayes are

signiﬁcantly lower compared to other ML algorithms.

6 CONCLUSIONS

In this paper, we employed three distance metric

learning algorithms to learn the Mahalanobis dis-

tance metric to improve multiclass classiﬁcation per-

formance for our dataset containing six prevalent mal-

ware families and benign ﬁles. We classiﬁed the

previously unseen samples using the KNN classiﬁer

with the learned distance and achieved signiﬁcantly

better results than using common (non-learned) Eu-

clidean distance. The classiﬁcation results demon-

strate that DML-based methods outperform any of the

state-of-the-art ML algorithms considered in our ex-

periments. Our results indicate that the classiﬁcation

performance based on DML methods could be further

improved if we use a larger dataset for training the

distance metric. Another experiment was concerned

with low-dimensional representations of the input fea-

ture space. We achieved surprisingly good classiﬁca-

tion results even for two-dimensional feature vectors.

In future work, other types of features, such as

byte sequences, opcodes, API and system calls, and

others, could be used and possibly improve the classi-

ﬁcation results. It would also be interesting to explore

the application of distance metric learning algorithms

to clustering into malware families.

ACKNOWLEDGEMENTS

The authors acknowledge the support of the OP VVV

MEYS funded project CZ.02.1.01/0.0/0.0/16 019/

0000765 ”Research Center for Informatics”.

REFERENCES

Ahmadi, M., Ulyanov, D., Semenov, S., Troﬁmov, M., and

Giacinto, G. (2016). Novel feature extraction, selec-

tion and fusion for effective malware family classiﬁ-

cation. In Proceedings of the sixth ACM conference

on data and application security and privacy, pages

183–194.

Basole, S., Di Troia, F., and Stamp, M. (2020). Multifamily

malware models. Journal of Computer Virology and

Hacking Techniques, pages 1–14.

Carrera, E. (2017). Peﬁle. https://github.com/erocarrera/

peﬁle.

Improving Classiﬁcation of Malware Families using Learning a Distance Metric

651

Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdi-

nov, R. R. (2005). Neighbourhood components anal-

ysis. In Advances in neural information processing

systems, pages 513–520.

Islam, R., Tian, R., Batten, L. M., and Versteeg, S. (2013).

Classiﬁcation of malware based on integrated static

and dynamic features. Journal of Network and Com-

puter Applications, 36(2):646–656.

Jure

cek, M. and L

orencz, R. (2018). Malware detection

using a heterogeneous distance function. Computing

and Informatics, 37(3):759–780.

Jure

cek, M. and L

orencz, R. (2020). Distance metric

learning using particle swarm optimization to improve

static malware detection. In Int. Conf. on Information

Systems Security and Privacy (ICISSP), pages 725–

732.

Kong, D. and Yan, G. (2013). Discriminant malware dis-

tance learning on structural information for automated

malware classiﬁcation. In Proceedings of the 19th

ACM SIGKDD international conference on Knowl-

edge discovery and data mining, pages 1357–1365.

ACM.

Lakhotia, A., Walenstein, A., Miles, C., and Singh, A.

(2013). Vilo: a rapid learning nearest-neighbor classi-

ﬁer for malware triage. Journal of Computer Virology

and Hacking Techniques, 9(3):109–123.

Microsoft (2019). Pe format - win32 apps.

https://docs.microsoft.com/en-us/windows/win32/

debug/pe-format.

Microsoft (Revision 9.3, 2015). Visual studio, microsoft

portable executable and common object ﬁle format

speciﬁcation.

Mohaisen, A., Alrawi, O., and Mohaisen, M. (2015).

Amal: High-ﬁdelity, behavior-based automated mal-

ware analysis and classiﬁcation. computers & secu-

rity, 52:251–266.

Scikit-learn (2020). scikit-learn.org. https://scikit-learn.

org/.

VirusShare (2020). Virusshare.com. http://virusshare.com/.

Wadkar, M., Di Troia, F., and Stamp, M. (2020). Detect-

ing malware evolution using support vector machines.

Expert Systems with Applications, 143:113022.

Weinberger, K. Q., Blitzer, J., and Saul, L. K. (2006). Dis-

tance metric learning for large margin nearest neigh-

bor classiﬁcation. In Advances in neural information

processing systems, pages 1473–1480.

Weinberger, K. Q. and Tesauro, G. (2007). Metric learn-

ing for kernel regression. In Artiﬁcial Intelligence and

Statistics, pages 612–619.

Wettschereck, D., Aha, D. W., and Mohri, T. (1997). A

review and empirical evaluation of feature weighting

methods for a class of lazy learning algorithms. Arti-

ﬁcial Intelligence Review, 11(1-5):273–314.

Wilson, D. R. and Martinez, T. R. (1997). Improved het-

erogeneous distance functions. Journal of artiﬁcial

intelligence research, 6:1–34.

ICISSP 2021 - 7th International Conference on Information Systems Security and Privacy

652