Empirical Analysis for Investigating the Effect of Machine Learning Techniques on Malware Prediction

Sanidhya Vijayvargiya (BITS-Pilani Hyderabad, India), Lov Kumar (NIT Kurukshetra, India), Lalita Bhanu Murthy (BITS-Pilani Hyderabad, India), Sanjay Misra (Østfold University College, Halden, Norway), Aneesh Krishna (Curtin University, Australia) and Srinivas Padmanabhuni (Testaing.Com, India)
A.Krishna@curtin.edu.au
Keywords:
SMOTE, ANOVA, Genetic Algorithm, Ensemble Learning, Malware Family.
Abstract:
Malware is used to attack computer systems and network infrastructure. Therefore, classifying malware is
essential for stopping hostile attacks. From money transactions to personal information, everything is shared
and stored in cyberspace. This has led to increased and more innovative malware attacks. Advanced packing
and obfuscation methods are being used by malware variants to get access to private information for profit.
There is an urgent need for better software security. In this paper, we identify the best ML techniques that can be used in combination with various ML and ensemble classifiers for malware classification. The goal of this work is to identify the ideal ML pipeline for detecting the family of a malware sample. The best tools for describing malware activity are application programming interface (API) calls. However, creating API call attributes with which classification algorithms achieve high accuracy is challenging. The experimental results demonstrate that the proposed ML pipeline can categorize malware effectively and accurately, producing state-of-the-art results.
1 INTRODUCTION
During the COVID-19 pandemic, technological advancements led to a huge influx of new internet users, and the virtual world came to occupy a larger part of everyday life. As a result, a large amount of private information about individuals and organizations is shared and stored digitally. The security of this information is continuously tested by an increasing number of malware attacks. Varieties of malware include worms, viruses, Trojan horses, ransomware, and rootkits. Malware variants can steal sensitive information, launch distributed denial-of-service (DDoS) attacks, and wreak havoc on computer systems.
Due to the varying nature of malware, it is not enough to simply detect its presence; to deal with it properly, its family must also be identified. To observe the behavior of malware when training classification models, samples are often run in virtual environments that do not affect the daily functioning of the computer. The malware files are analyzed, and static as well as dynamic features are extracted from them. As malware becomes better at disguising itself, the need for better classification techniques grows. Machine Learning (ML) methods have been shown to be reliable, but as malware complexity increases, Deep Learning (DL) methods are being explored (Zhang et al., 2019). These DL methods are still very new, however, and have not yet delivered the impressive performance gains seen in other applications. Ensemble classifiers have still been shown to outperform DL classifiers (Daeef et al., 2022).
In this work, we attempt to identify the best ML
techniques for malware classification and propose the
ideal ML pipeline for future works. The following is
a list of the research questions (RQs) that will be used
to accomplish the objectives.
RQ1: Which feature selection or dimensionality reduction technique is best suited for malware classification?
RQ2: Does using oversampling techniques like SMOTE result in better classification models than training the classifiers on imbalanced data?
RQ3: Since malware classification is a multi-class problem, does the One vs. One approach or the One vs. Rest approach work best?
RQ4: Which ML classifier gives the best results?
The following are the contributions made by this
study:
An in-depth analysis of how the One vs. One and One vs. Rest classification techniques result in varying performance.
An exploration of the impact on performance of training with a dataset balanced using oversampling techniques.
Despite the large number of works on this topic, there is no proper comparison of the numerous ML techniques available. We compare the most commonly used classifiers and various feature selection techniques.
The performance of malware classification models is assessed and analyzed using key performance indicators. In contrast to earlier research, we provide a thorough statistical analysis, using statistical testing, to support the findings of this study.
The remainder of this paper is structured as follows: Section 2 presents a literature review of previous work on malware classification. Section 3 describes the study design settings and the experimental dataset. Section 4 describes the research methodology using an architectural framework. Section 5 provides the results of the experiments, along with an analysis comparing malware classification models created using various feature selection strategies, widely used classifiers, and class-balancing algorithms. Section 6 concludes by summarizing the research results and offering suggestions for further studies.
2 RELATED WORKS
2.1 Feature Extraction and Selection
Daeef et al. (Daeef et al., 2022) highlighted the importance of selecting the right set of features for API call-based malware family classification. Malware
analysis can be conducted in various ways. The static
method does not require the malware to be run. The
features derived from this type of analysis are not suit-
able for malware variants that have frequent changes
in dangerous functions, code, structures, etc. The
benefit of static analysis is that the scope of analy-
sis covers the entire code. In contrast, reverse engi-
neering of malware is not required for dynamic anal-
ysis. The malware is run in an isolated environment
and various metrics, such as API calls, registry ac-
cess, etc., are used to record the behavior of the mal-
ware. The secure environment for such analysis is
provided by Sandbox technology. Jaccard similarity
between the API calls by different malware families,
and the frequency of API calls, among others, are key
features highlighted in this work. In their experimen-
tal results, the authors found that the Random Forest
classifier with the proposed API call features was able
to outperform RNN networks like LSTM and GRU.
2.2 Machine Learning-Based Classification
Kouliaridis et al. (Kouliaridis and Kambourakis, 2021) conducted a study of machine-learning techniques for Android malware detection. The authors focused on Android malware detection because the platform's open-source nature makes it a prime target for malware writers. The key techniques highlighted as improving performance include feature selection and dimensionality reduction, which reduce noise and bias. Ensemble models, which build upon multiple classifiers, improve the overall classification performance and can be used together with the above-mentioned techniques. The survey covers Android malware detection research from 2017 to 2021.
Türker et al. (Türker and Can, 2019) proposed AndMFC, a framework for Android malware family classification. The framework extracts API calls and requested permissions from malware samples, and various machine learning classifiers are trained on this data. The framework succeeds at detecting unknown malware, i.e., samples the model has not seen before, with high accuracy. The classifiers employed include AdaBoost, Logistic Regression, and Multi-Layer Perceptron. The framework is evaluated on three datasets. The results show that utilizing both static and dynamic analysis of the malware samples helps achieve better performance. Feature ranking was used to select the top 1000 features from the original set. The framework achieved 93.63% accuracy in recognizing unknown malware in the AMD dataset. On the Drebin and UpDroid datasets, AdaBoost and SVM achieved the highest accuracies of 96.79% and 94.66%, respectively.
2.3 Deep Learning-Based Classification
A framework based on a deep learning approach is proposed by Aslan et al. (Aslan and Yilmaz, 2021). The study's primary contribution is a novel hybrid design that optimally combines two diverse pre-trained network models. The malware is visualized as an image, which is then processed through ResNet-50 and AlexNet architectures, and the outputs of the two pre-trained models are combined. On the Malimg dataset, the framework achieves an accuracy of 97.78%. The benefits of a DL approach highlighted in the paper include:
DL reduces the need for feature engineering
DL increases accuracy
DL reduces the feature space
A recurrent network-based approach to extracting the API call patterns of malware families was attempted by Kwon et al. (Kwon and Im, 2017). LSTMs were trained on 551 representative API call patterns of 9 malware families. On a testing set of 236 samples, the LSTMs achieved an accuracy of 71%. The authors noted as a limitation that the malware samples used for analysis may not accurately match the Kaspersky malware family classification criteria. A further area of improvement: when collecting and assessing the API call patterns of malware, the authors considered only the types and order of the APIs invoked. Since the API is higher level than the computer's machine code or assembly, extracting the malware API's semantic criteria and semantic differentiation may boost performance. Furthermore, this work did not include pre-processing of the LSTM input; pre-processed inputs can also improve the results.
3 STUDY DESIGN
3.1 Dataset
The dataset used to test the various models in this pa-
per is the Malware API Call Histogram Dataset by
Paul Black published in the 13th International Cyber-
security Data Mining Competition (CDMC2022). It
contains 9 malware families that are to be predicted
by the models. The features of the malware are ex-
tracted from malware samples provided by Abuse.ch.
Dynamic analysis was performed in a Cuckoo sand-
box to extract the features. A histogram of API calls
makes up the features. The training set consists of
537 samples and the testing set contains 134 samples.
The label of each sample is stored in the first column
of the dataset.
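For concreteness, the sketch below shows how such a dataset might be loaded; this is a minimal illustration, and the file names and CSV layout are assumptions, since the competition's exact distribution format is not described here.

```python
# Hypothetical loading sketch for the CDMC2022 API call histogram dataset.
# File names are assumptions; per the text, the label sits in the first column.
import pandas as pd

train = pd.read_csv("cdmc2022_train.csv", header=None)  # assumed name; 537 samples
test = pd.read_csv("cdmc2022_test.csv", header=None)    # assumed name; 134 samples

y_train, X_train = train.iloc[:, 0], train.iloc[:, 1:]  # first column = family label
y_test, X_test = test.iloc[:, 0], test.iloc[:, 1:]      # rest = API call histogram

print(X_train.shape, y_train.nunique())  # expect 208 features and 9 families
```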
3.2 Class-Balancing Techniques
An imbalance in samples of malware families in the
training dataset has led to underperforming mod-
els. Synthetic Minority Oversampling Technique
(SMOTE) is an oversampling class balancing tech-
nique that helps get an equal number of instances
of each class in the training set. The classification
models are trained on both the balanced and the im-
balanced datasets, and the variation in performance
is compared between the two sets of models. The
k-nearest neighbors of the data are then found via
SMOTE after sampling data from the minority class.
The chosen k-nearest neighbor is then used to mix the
collected data to produce synthetic data.
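A minimal sketch of this balancing step, assuming the imbalanced-learn implementation of SMOTE with its default parameters (the paper does not state which settings were used):

```python
# Hedged sketch: oversample the minority malware families with SMOTE so that
# every family ends up with as many training samples as the majority class.
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTE

X = np.asarray(X_train, dtype=float)  # X_train, y_train from the loading sketch
y = np.asarray(y_train)

smote = SMOTE(k_neighbors=5, random_state=42)  # k=5 neighbors is the library default
X_bal, y_bal = smote.fit_resample(X, y)

print(Counter(y), Counter(y_bal))  # class counts before and after balancing
```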
3.3 Feature Selection and Dimensionality Reduction Techniques
Four sets of features are fed as input to the classifiers; several previous works obtained performance improvements with a better set of features (Kwon and Im, 2017). The four sets include the original features of the dataset, kept as a baseline to judge whether the derived features improve or regress performance. The feature selection techniques used are Analysis of Variance (ANOVA) and the Genetic Algorithm (GA); the dimensionality reduction technique used is Principal Component Analysis (PCA).
PCA reduces the dimensionality of such datasets, which both enhances interpretability and limits information loss. We keep only the top principal components to obtain lower-dimensional data. Although some information is lost, accuracy is sacrificed in favor of simplicity; sometimes the connections between the variables are so strong that redundant data is present.

ANOVA is a method that assists in separating systemic and random components from the observed aggregate variability in a dataset. The dependent variable is significantly impacted by the systemic elements but not by the random components. The ANOVA test examines the significance of the independent factors' influence on the dependent variables, either rejecting the null hypothesis in favor of the alternative or failing to reject it.

The genetic algorithm simulates natural selection: only those individuals that can adapt to changes in their environment survive, procreate, and pass their traits to the next generation. It essentially replicates "survival of the fittest" among the individuals of successive generations. For feature selection, the first step is generating a population based on subsets of the potential features. A predictive model for the intended task is used to assess the subsets in this population. The best subsets are used to create the subsequent generation, with some mutation (where features are added or removed at random) and crossover (where a selected subset is updated with features from other well-performing subsets). A sketch of all three techniques appears after this list.
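The sketch below illustrates all three techniques under stated assumptions: ANOVA and PCA use the standard scikit-learn implementations with the feature counts reported in Section 5 (163 and 179), while the genetic algorithm is a deliberately simplified toy loop over boolean feature masks, since the paper does not report its GA hyperparameters; population size, generation count, and mutation rate here are illustrative.

```python
# Hedged sketches of the three feature reduction techniques (X, y from above).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score

# ANOVA F-test: keep the features whose per-class means differ the most.
X_anova = SelectKBest(f_classif, k=163).fit_transform(X, y)

# PCA: project the 208 original features onto the leading principal components.
X_pca = PCA(n_components=179).fit_transform(X)

# Toy genetic algorithm over boolean feature masks (slow; illustrative only).
rng = np.random.default_rng(0)
n_feat, pop_size, n_gen = X.shape[1], 20, 10
pop = rng.random((pop_size, n_feat)) < 0.5  # initial population of feature subsets

def fitness(mask):
    # Assess a subset with a predictive model, as described above.
    if not mask.any():
        return 0.0
    clf = ExtraTreesClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=5).mean()

for _ in range(n_gen):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-pop_size // 2:]]  # survival of the fittest
    children = []
    for _ in range(pop_size - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_feat)
        child = np.concatenate([a[:cut], b[cut:]])            # single-point crossover
        children.append(child ^ (rng.random(n_feat) < 0.01))  # random mutation
    pop = np.vstack([parents, np.array(children)])

best = pop[np.argmax([fitness(m) for m in pop])]  # best-performing feature mask
```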
3.4 Classification Techniques
We use 14 different classifiers, namely, Multinomial
Naive-Bayes, Bernoulli Naive-Bayes, Gaussian Naive
Bayes, Complement Naive Bayes, Decision Tree, k-
Nearest Neighbors, Linear Support Vector classifier,
Polynomial kernel Support Vector classifier, Radial
Basis function kernel Support Vector classifier, Extra
Trees classifier, Random Forest, Bagging classifier,
Gradient Boosting classifier, and AdaBoost classifier.
These classifiers are some of the most commonly used
classifiers for malware classification. They contain a
mix of simpler machine-learning classifiers and more
advanced ensemble classifiers. For each of these classifiers, both the One vs One and the One vs Rest approaches are used, as sketched below. 5-fold cross-validation is used to validate the results from the classification models.
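Continuing the earlier sketches, the snippet below shows the two multi-class strategies with 5-fold cross-validation, using Extra Trees as the base estimator; any of the 14 classifiers could be substituted.

```python
# Wrap one base classifier in both multi-class strategies and cross-validate.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

base = ExtraTreesClassifier(n_estimators=100, random_state=42)
for name, model in [("One vs One", OneVsOneClassifier(base)),
                    ("One vs Rest", OneVsRestClassifier(base))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")  # 5-fold CV
    print(f"{name}: mean accuracy = {scores.mean():.4f}")
```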
4 RESEARCH METHODOLOGY
The malware API call histogram dataset, provided in CDMC 2022, is used to train the various malware family classification
models. The dataset is subjected to two feature se-
lection techniques (genetic algorithm and ANOVA),
and one dimensionality reduction technique (PCA) to
get the best set of features to input into the classifiers.
The original set of features is also preserved. SMOTE
is used to balance the classes in the training set. The
models trained on the imbalanced dataset were used
for comparison.
The resulting data was fed to 2 variants, One vs
One classifier and One vs Rest classifier, of 14 differ-
ent classifiers for malware family prediction. In total,
224 distinct models were trained [4 feature sets × 2 datasets (balanced and imbalanced) × 2 multi-class approaches × 14 classifiers], as sketched below.
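A sketch of this grid, continuing the snippets above, is given below; only two of the 14 base classifiers are listed for brevity, and the feature matrices and GA mask `best` come from the earlier sketches.

```python
# Enumerate the grid: feature set x (original | SMOTE) x strategy x classifier.
from itertools import product

from imblearn.over_sampling import SMOTE
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

feature_sets = {"AF": X, "ANOVA": X_anova, "PCA": X_pca, "GA": X[:, best]}
strategies = {"OvO": OneVsOneClassifier, "OvR": OneVsRestClassifier}
classifiers = {"EXTRC": ExtraTreesClassifier(random_state=0),
               "RFC": RandomForestClassifier(random_state=0)}  # 12 more in the study

results = {}
for fs, balanced, st, cl in product(feature_sets, (False, True), strategies, classifiers):
    Xf, yf = feature_sets[fs], y
    if balanced:  # SMOTE-balanced variant of the training data
        Xf, yf = SMOTE(random_state=42).fit_resample(Xf, yf)
    model = strategies[st](classifiers[cl])
    results[(fs, balanced, st, cl)] = cross_val_score(model, Xf, yf, cv=5).mean()
```

Note that oversampling before cross-validation, as in this simplified sketch, lets synthetic samples leak across folds; imbalanced-learn's Pipeline can instead restrict SMOTE to each training fold.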
5 EXPERIMENTAL RESULTS AND ANALYSIS
The trained models’ predictive power is evaluated us-
ing AUC values, recall, precision, and accuracy, as
shown in Table 1. The performance of the models
is generally excellent, with the highest accuracy value reaching 99.11%, but for certain models there is a stark drop in performance. Thus, it is crucial to select the right set of ML techniques.
We have utilized box plots for visual comparison of each performance parameter. Statistical analysis using the Friedman test for each ML approach is employed to validate the findings and draw conclusions; a sketch of the test appears below. The Friedman test either rejects the null hypothesis in favor of the alternative hypothesis, or it fails to reject the null hypothesis. A significance cutoff of 0.05 applies to all of the comparisons performed.
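A minimal sketch of this test using SciPy, with a placeholder AUC matrix standing in for the real results (one row per model configuration, one column per compared technique):

```python
# Friedman test over AUC values; each column is one technique being compared.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

rng = np.random.default_rng(0)
auc = rng.uniform(0.6, 1.0, size=(112, 4))  # placeholder: 112 models x 4 feature sets

stat, p = friedmanchisquare(*auc.T)  # one argument per compared technique
if p < 0.05:                         # the 0.05 significance cutoff used here
    print("reject the null hypothesis: the techniques differ significantly")

# Lower mean rank = better performance (rank 1 goes to the highest AUC per row).
mean_ranks = rankdata(-auc, axis=1).mean(axis=0)
print(mean_ranks)
```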
5.1 RQ1: Which Feature Selection or Dimensionality Reduction Technique Is Best Suited for Malware Classification?
The set of features input to the classifiers greatly im-
pacts the performance as the malware becomes better
at disguising itself. Some features may be redundant
and thus hinder the classification models. Thus, di-
mensionality reduction techniques like PCA are im-
portant. Genetic algorithm is a potent technique for
feature selection where the original set of 208 fea-
tures is reduced to 10. ANOVA is a statistical tech-
nique that reduces the same original features to 163.
PCA, on the other hand, reduces 208 features to 179
features.
The best feature selection technique is the genetic algorithm: even with only ten features, it captures the most relevant information required for classification into the different malware families, as seen in Figure 1. The smaller number of features also makes model training computationally efficient. The visual
differences between the box plots of the feature selec-
tion techniques are not easily seen. The Friedman test
helps statistically reject the null hypothesis that the
different feature selection techniques do not signifi-
cantly affect performance. The lower the mean rank
in the Friedman test, the better the performance. The
degree of freedom was taken to be three. The results
of the Friedman test, as seen in Table 2, show that the
genetic algorithm performs better than the other fea-
ture selection techniques. PCA regresses the performance of the models compared to the original set of features.
Table 1: Performance parameters per classifier: Accuracy (%), F-Measure, and AUC on the original data (ORG-DATA) and the SMOTE-balanced data (SMOTE DATA), each under the ONE-ONE and ONE-REST strategies. Rows are grouped by feature set: AF (all features), ANOVA, GA.

        ORG-DATA ONE-ONE     ORG-DATA ONE-REST    SMOTE ONE-ONE        SMOTE ONE-REST
        Acc    F-M   AUC     Acc    F-M   AUC     Acc    F-M   AUC     Acc    F-M   AUC
AF
MNBC 40.60 0.34 0.81 89.66 0.89 0.97 85.70 0.85 0.97 88.38 0.87 0.97
BNBC 93.11 0.93 0.97 95.79 0.96 0.99 93.49 0.94 0.98 95.79 0.96 0.99
GNBC 51.77 0.46 0.88 90.29 0.88 0.96 48.40 0.43 0.88 91.19 0.90 0.96
CNBC 40.60 0.34 0.81 89.66 0.89 0.97 97.06 0.97 0.98 88.51 0.87 0.97
DT 97.39 0.97 0.99 98.08 0.98 0.99 97.83 0.98 0.99 97.32 0.97 0.99
KNN 97.02 0.97 0.99 97.57 0.98 0.99 97.96 0.98 0.99 97.57 0.98 0.99
SVCL 60.34 0.58 0.88 97.06 0.97 0.99 51.47 0.55 0.77 96.68 0.97 0.99
SVCP 33.52 0.33 0.58 74.07 0.75 0.90 23.75 0.26 0.53 60.92 0.61 0.78
SVCR 45.07 0.46 0.86 79.69 0.80 0.96 43.55 0.39 0.79 68.33 0.68 0.87
EXTRC 97.95 0.98 1.00 98.60 0.99 1.00 98.98 0.99 1.00 99.11 0.99 1.00
RFC 97.58 0.98 1.00 98.34 0.98 0.99 98.85 0.99 1.00 98.60 0.99 0.99
BAGC 97.95 0.98 0.99 98.08 0.98 0.99 97.83 0.98 0.99 97.32 0.97 0.99
GRABC 97.21 0.97 0.99 98.21 0.98 0.99 97.70 0.98 0.99 97.57 0.98 0.99
ADABC 97.58 0.98 0.98 98.21 0.98 0.99 98.85 0.99 0.99 97.70 0.98 0.99
ANOVA
MNBC 40.41 0.34 0.81 89.78 0.89 0.97 85.31 0.84 0.96 89.53 0.89 0.94
BNBC 92.74 0.93 0.98 95.02 0.95 0.99 93.49 0.93 0.98 95.79 0.96 0.99
GNBC 52.51 0.46 0.88 97.96 0.98 0.98 51.72 0.46 0.88 97.57 0.98 0.98
CNBC 40.41 0.34 0.81 89.78 0.89 0.97 87.61 0.87 0.98 89.53 0.89 0.94
DT 97.58 0.98 0.99 98.34 0.98 0.99 97.32 0.97 0.99 97.45 0.97 0.99
KNN 97.02 0.97 0.99 97.57 0.98 0.99 74.84 0.72 0.90 73.31 0.70 0.90
SVCL 61.45 0.59 0.87 96.93 0.97 0.99 45.72 0.46 0.75 92.98 0.93 0.98
SVCP 34.45 0.34 0.60 73.82 0.74 0.90 33.33 0.31 0.60 50.32 0.49 0.79
SVCR 44.69 0.47 0.85 79.69 0.80 0.96 42.78 0.42 0.79 61.94 0.62 0.84
EXTRC 97.95 0.98 1.00 98.60 0.99 1.00 98.98 0.99 1.00 99.11 0.99 1.00
RFC 97.58 0.98 1.00 98.60 0.99 0.99 98.60 0.99 1.00 98.72 0.99 1.00
BAGC 97.77 0.98 0.99 98.47 0.98 0.99 98.21 0.98 0.99 97.57 0.98 0.99
GRABC 97.77 0.98 0.99 98.34 0.98 0.99 98.21 0.98 0.99 97.96 0.98 0.99
ADABC 97.58 0.98 0.99 98.08 0.98 0.99 97.96 0.98 0.99 98.34 0.98 0.99
GA
MNBC 60.15 0.54 0.96 59.78 0.54 0.95 47.25 0.52 0.69 89.53 0.89 0.94
BNBC 89.39 0.90 0.97 89.57 0.90 0.98 89.40 0.90 0.96 95.79 0.96 0.99
GNBC 93.11 0.93 0.97 94.04 0.94 0.98 74.58 0.70 0.92 97.57 0.98 0.98
CNBC 95.16 0.95 0.98 95.16 0.95 0.98 88.12 0.88 0.95 89.53 0.89 0.94
DT 96.65 0.97 0.98 96.28 0.96 0.98 95.91 0.96 0.98 97.06 0.97 0.99
KNN 96.65 0.97 0.99 96.83 0.97 1.00 96.81 0.97 1.00 73.31 0.70 0.90
SVCL 97.95 0.98 0.99 97.21 0.97 0.99 97.83 0.98 0.99 88.63 0.88 0.99
SVCP 89.20 0.89 0.95 73.93 0.74 0.90 81.23 0.82 0.92 50.32 0.49 0.79
SVCR 96.46 0.96 1.00 96.09 0.96 1.00 97.19 0.97 1.00 61.94 0.62 0.84
EXTRC 97.21 0.97 1.00 97.39 0.97 1.00 98.47 0.98 1.00 98.85 0.99 1.00
RFC 97.21 0.97 1.00 97.02 0.97 0.99 98.08 0.98 1.00 98.85 0.99 1.00
BAGC 97.39 0.97 0.99 95.90 0.96 0.99 96.93 0.97 0.99 97.06 0.97 1.00
GRABC 97.02 0.97 0.99 96.46 0.96 0.99 96.81 0.97 0.98 97.96 0.98 0.99
ADABC 97.39 0.97 0.99 96.83 0.97 0.99 98.08 0.98 0.99 98.34 0.98 0.99
The ANOVA-selected features and the original set of features give very similar results. The genetic algorithm feature set results in a mean AUC of 0.97, a minimum AUC of 0.69, and a maximum AUC of 1.
Table 2: AUC: Statistical and Friedman test results of feature selection.
AF ANOVA PCA GA Rank
AF 1.00 0.86 0.20 0.22 2.48
ANOVA 0.86 1.00 0.30 0.15 2.49
PCA 0.20 0.30 1.00 0.01 2.85
GA 0.22 0.15 0.01 1.00 2.18
5.2 RQ2: Does Using Oversampling Techniques like SMOTE Result in Better Classification Models than Training the Classifiers on Imbalanced Data?
Most datasets used to train malware classification
models are imbalanced. Multiple papers have at-
tempted to solve this problem using various tech-
niques. From the visual representation in the box
plot in Figure 2, we can deduce that SMOTE leads
to regression in performance.

Figure 1: Performance parameter boxplots (Accuracy, Precision, Recall, AUC) for the feature selection techniques (AF, ANOVA, PCA, GA).

This is verified by the
Friedman test, whose null hypothesis is “the oversampling technique used to balance the dataset does not have a significant effect on the performance.” As
seen in Table 3, the original, imbalanced dataset has a
mean rank of 1.41, whereas the dataset after SMOTE
was applied gave a mean rank of 1.59. Based on the
descriptive statistics, both sets of models are similar,
with the original dataset having a minimum AUC of
0.58, a maximum AUC of 1, and a mean AUC of 0.95.
The SMOTE dataset has a minimum AUC of 0.53, a
maximum AUC of 1, and a mean AUC of 0.93.
Figure 2: Performance parameter boxplots (Accuracy, Precision, Recall, AUC) for the original (ORGD) and SMOTE-balanced datasets.
Table 3: AUC: Statistical and Friedman test results of SMOTE.
ORG SMOTE Rank
ORG 1.00 0.57 1.41
SMOTE 0.57 1.00 1.59
5.3 RQ3: Since Malware Classification Is a Multi-Class Problem, Does the One vs. One Approach or the One vs. Rest Approach Work Best?
Classification can follow either a One vs. One or a One vs. Rest strategy. The box plots in Figure 4 show that the One vs. One classifiers far outperform the One vs. Rest classifiers, with much less variance in the former. To validate this claim, we look at the mean ranks from the Friedman test in Table 4: the One vs. Rest classifiers have a mean rank of 1.62, compared to the 1.38 mean rank of the One vs. One classifiers. The One vs. One classifiers have a minimum AUC of 0.6, a maximum AUC of 1, and a mean AUC of 0.96; the One vs. Rest classifiers have a minimum AUC of 0.53, a maximum AUC of 1, and a mean AUC of 0.93. The One vs. One classifiers have a first-quartile (Q1) AUC of 0.96, against 0.88 for the One vs. Rest classifiers. The null hypothesis of the Friedman test, carried out with one degree of freedom, is “the different methods of classification, One vs One and One vs Rest, do not have a significant effect on the performance of the models.” Thus, it is clear that the One vs. One approach should be preferred for multi-class malware family classification.
Figure 4: Performance parameter boxplots (Accuracy, Precision, Recall, AUC) for the One vs. One and One vs. Rest approaches.
Table 4: AUC: Statistical and Friedman test results of the One vs. One and One vs. Rest approaches.
ONEONE ONEREST Rank
ONEONE 1.00 0.15 1.62
ONEREST 0.15 1.00 1.38
5.4 RQ4: Which ML Classifier Gives the Best Results?
This work compares many commonly used classifiers, from simple Naive Bayes and SVC variants to more advanced ensemble classifiers; overall, 14 different classifiers are compared. As the box plot in Figure 3 shows, there is huge variation in performance even among the different variants of Naive Bayes and SVC. Ensemble classifiers provide the most reliable performance, with less variance; the Decision Tree and k-Nearest Neighbors classifiers also give robust results. To discern which classifier performed best, we look at the mean ranks from the Friedman test in Table 5. The null hypothesis is “the different classifiers do not cause a significant change in the performance of the models.” The degree of freedom is taken as 13 for the Friedman test. The Extra Trees classifier, with a mean rank of 1.97, outperforms the other classifiers by a huge margin. Variants of Naive Bayes, such as multinomial, Gaussian, and complement, have the highest (worst) mean ranks, while the Random Forest classifier has the second-lowest mean rank. The Extra Trees classifier has a minimum AUC of 0.99, a maximum AUC of 1, and a mean AUC of 1, indicating that Extra Trees is the best choice of classifier. Ensemble classifiers in general outperform the other types of classifiers.
6 CONCLUSION
Malware family classification is a much-researched
topic with multiple different ML and DL techniques
applied to keep up with the increasing complexity of
the problem. Due to the numerous techniques applied,
there is a lack of clarity about the ideal pipeline for
future research.

Figure 3: Performance parameter boxplots (Accuracy, Precision, Recall, AUC) for the 14 ML classifiers.
Table 5: AUC: Statistical and Friedman test results of the ML classifiers.
MNBC BNBC GNBC CNBC DT KNN SVCL SVCP SVCR EXTRC RFC BAGC GRABC ADABC Rank
MNBC 1.00 0.00 0.13 0.13 0.00 0.00 0.09 0.03 0.22 0.00 0.00 0.00 0.00 0.00 11.97
BNBC 0.00 1.00 0.00 0.00 0.02 0.10 0.26 0.00 0.02 0.00 0.00 0.00 0.00 0.00 7.44
GNBC 0.13 0.00 1.00 0.86 0.00 0.00 0.62 0.00 0.45 0.00 0.00 0.00 0.00 0.00 10.38
CNBC 0.13 0.00 0.86 1.00 0.00 0.00 0.48 0.00 0.95 0.00 0.00 0.00 0.00 0.00 10.72
DT 0.00 0.02 0.00 0.00 1.00 0.80 0.02 0.00 0.00 0.00 0.00 0.03 0.39 0.16 5.97
KNN 0.00 0.10 0.00 0.00 0.80 1.00 0.02 0.00 0.01 0.00 0.00 0.26 0.82 0.60 6.00
SVCL 0.09 0.26 0.62 0.48 0.02 0.02 1.00 0.00 0.82 0.00 0.00 0.00 0.00 0.00 8.91
SVCP 0.03 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 13.66
SVCR 0.22 0.02 0.45 0.95 0.00 0.01 0.82 0.00 1.00 0.00 0.00 0.00 0.00 0.00 9.50
EXTRC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 0.14 0.00 0.00 0.00 1.97
RFC 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.14 1.00 0.00 0.00 0.00 3.06
BAGC 0.00 0.00 0.00 0.00 0.03 0.26 0.00 0.00 0.00 0.00 0.00 1.00 0.09 0.18 4.78
GRABC 0.00 0.00 0.00 0.00 0.39 0.82 0.00 0.00 0.00 0.00 0.00 0.09 1.00 0.58 5.47
ADABC 0.00 0.00 0.00 0.00 0.16 0.60 0.00 0.00 0.00 0.00 0.00 0.18 0.58 1.00 5.19
In this work, we compare the various feature selection and dimensionality reduction techniques and observe what impact class-balancing techniques have on the performance of the models. Due to the many possible combinations of these techniques, we identify the best combination and use statistical testing to support the claims made. The key conclusions are:
Genetic algorithm was the best technique for feature selection, and dimensionality reduction regressed the performance of the models.
Oversampling using SMOTE also regressed the performance of the models; better results were obtained using the original imbalanced dataset.
The Extra Trees classifier was by far the best classifier out of the 14 classifiers compared in this study.
The classifiers performed better with a One vs. One approach to the multi-class problem than with the One vs. Rest approach.
The best combination of ML techniques gave very reliable results and performed extremely well on all the metrics used to measure performance.
This work can be extended to compare how DL techniques fare against ensemble techniques. A better set of features and class-balancing techniques can help boost performance for the existing DL architectures.
ACKNOWLEDGEMENTS
This research is funded by TestAIng Solutions Pvt.
Ltd.
REFERENCES
Aslan, Ö. and Yilmaz, A. A. (2021). A new malware classification framework based on deep learning algorithms. IEEE Access, 9:87936–87951.

Daeef, A. Y., Al-Naji, A., and Chahl, J. (2022). Features engineering for malware family classification based API call. Computers, 11(11).

Kouliaridis, V. and Kambourakis, G. (2021). A comprehensive survey on machine learning techniques for Android malware detection. Information, 12(5).

Kwon, I. and Im, E. G. (2017). Extracting the representative API call patterns of malware families using recurrent neural network. In Proceedings of the International Conference on Research in Adaptive and Convergent Systems, RACS ’17, pages 202–207, New York, NY, USA. Association for Computing Machinery.

Türker, S. and Can, A. B. (2019). AndMFC: Android malware family classification framework. In 2019 IEEE 30th International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC Workshops), pages 1–6.

Zhang, H., Xiao, X., Mercaldo, F., Ni, S., Martinelli, F., and Sangaiah, A. K. (2019). Classification of ransomware families with machine learning based on n-gram of opcodes. Future Generation Computer Systems, 90:211–221.