can adapt to changes in their environment can survive, procreate, and pass their genes on to the next generation. To solve a problem, they essentially replicate "survival of the fittest" among individuals of successive generations.
sive generations. Generating a population based
on subsets of the potential features is the first
step in the feature selection process. A predictive
model for the intended task is used to assess the subsets in this population. The best subsets are used to create the subsequent generation, with some mutation (where features are added or removed at random) and cross-over (where a selected subset is updated with features from other well-performing subsets).
3.4 Classification Techniques
We use 14 different classifiers, namely, Multinomial
Naive-Bayes, Bernoulli Naive-Bayes, Gaussian Naive
Bayes, Complement Naive Bayes, Decision Tree, k-
Nearest Neighbors, Linear Support Vector classifier,
Polynomial kernel Support Vector classifier, Radial
Basis function kernel Support Vector classifier, Extra
Trees classifier, Random Forest, Bagging classifier,
Gradient Boosting classifier, and AdaBoost classifier.
These classifiers are some of the most commonly used
classifiers for malware classification. They contain a
mix of simpler machine-learning classifiers and more
advanced ensemble classifiers. For each of these classifiers, both the One vs One and the One vs Rest approaches are used. 5-fold cross-validation is used to validate the results from the classification models.
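A minimal sketch of this evaluation setup in scikit-learn, with synthetic data standing in for the API-call histogram features and a decision tree as one representative base classifier:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the API-call histogram dataset.
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=10, n_classes=4, random_state=0)

base = DecisionTreeClassifier(random_state=0)
for wrapper in (OneVsOneClassifier, OneVsRestClassifier):
    # 5-fold cross-validated accuracy for each multi-class strategy.
    scores = cross_val_score(wrapper(base), X, y, cv=5)
    print(wrapper.__name__, round(scores.mean(), 3))
```

The same two wrappers can be applied unchanged to any of the other base classifiers listed above.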
4 RESEARCH METHODOLOGY
The dataset of malware API call histogram for mal-
ware classification, provided in CDMC 2022, is used
to train the various malware family classification
models. The dataset is subjected to two feature selection techniques (genetic algorithm and ANOVA) and one dimensionality reduction technique (PCA) to obtain the best set of features to input into the classifiers.
The original set of features is also preserved. SMOTE
is used to balance the classes in the training set. The
models trained on the imbalanced dataset were used
for comparison.
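SMOTE balances the training set by synthesizing new minority-class samples rather than duplicating existing ones. In practice the imbalanced-learn library's `SMOTE` class would be used; the core interpolation step can be sketched as follows (the helper name and parameters are ours, for illustration only):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE idea: synthesize minority samples by interpolating
    between a minority sample and one of its k nearest minority neighbors."""
    rng = rng or np.random.default_rng(0)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)           # idx[:, 0] is the point itself
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # a random minority sample
        j = idx[i, rng.integers(1, k + 1)]  # one of its k nearest neighbors
        lam = rng.random()                  # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.vstack(new)
```

Each synthetic point lies on the line segment between two real minority samples, so the new samples stay inside the minority class's region of feature space.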
The resulting data was fed to two variants, One vs One and One vs Rest, of 14 different classifiers for malware family prediction. In total, 224 distinct models were trained [4 feature sets × 2 datasets (balanced and imbalanced) × 2 multi-class classification approaches × 14 classifiers].
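The model count follows directly from the Cartesian product of the configuration choices; a quick sketch (the labels are illustrative placeholders):

```python
from itertools import product

feature_sets = ["original", "genetic", "anova", "pca"]
datasets = ["smote-balanced", "imbalanced"]
approaches = ["one-vs-one", "one-vs-rest"]
classifiers = [f"clf_{i}" for i in range(14)]   # 14 placeholder names

# One trained model per combination of the four choices.
configs = list(product(feature_sets, datasets, approaches, classifiers))
print(len(configs))  # 4 * 2 * 2 * 14 = 224
```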
5 EXPERIMENTAL RESULTS
AND ANALYSIS
The trained models’ predictive power is evaluated us-
ing AUC values, recall, precision, and accuracy, as
shown in Table 1. The performance of the models
is generally excellent, with the highest accuracy reaching 98%, but for certain models there is a stark
drop in performance. Thus, it is crucial to select the
right set of ML techniques.
We have utilized box plots for visual comparison
of each performance parameter. Statistical analysis
using the Friedman test for each ML approach is em-
ployed to validate the findings and draw conclusions.
The Friedman test either rejects the null hypothesis in favor of the alternative, or it fails to reject the null hypothesis. The Friedman test's 0.05 significance cutoff applies to all of the comparisons performed.
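A minimal sketch of such a comparison with SciPy's `friedmanchisquare`, using made-up accuracy scores for three techniques measured across the same five models (the numbers are illustrative, not the paper's results):

```python
from scipy.stats import friedmanchisquare

# Toy accuracies of three techniques over five matched models.
ga    = [0.98, 0.97, 0.96, 0.98, 0.97]
anova = [0.95, 0.94, 0.95, 0.93, 0.95]
pca   = [0.91, 0.92, 0.90, 0.92, 0.91]

stat, p = friedmanchisquare(ga, anova, pca)
if p < 0.05:
    print("reject H0: the techniques differ significantly")
else:
    print("fail to reject H0")
```

The test ranks the techniques within each model and checks whether the mean ranks differ more than chance would allow; with k = 3 techniques the statistic has k − 1 = 2 degrees of freedom.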
5.1 RQ1: Which Feature Selection or
Dimensionality Reduction
Technique Is Best Suited for
Malware Classification?
The set of features input to the classifiers greatly im-
pacts the performance as the malware becomes better
at disguising itself. Some features may be redundant
and thus hinder the classification models. Thus, di-
mensionality reduction techniques like PCA are im-
portant. Genetic algorithm is a potent technique for
feature selection where the original set of 208 fea-
tures is reduced to 10. ANOVA is a statistical tech-
nique that reduces the same original features to 163.
PCA, on the other hand, reduces 208 features to 179
features.
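A sketch of the PCA step with scikit-learn. The 95% explained-variance threshold and the random stand-in data are our assumptions, since the component-selection criterion that yields 179 components is not stated here:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 208))   # stand-in for the 208 original features

# Keep the smallest number of components explaining >= 95% of the variance
# (threshold assumed for illustration).
pca = PCA(n_components=0.95).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape[1], "components retained")
```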
The best feature selection technique is the genetic algorithm: even with only ten features, it captures the most relevant information required for classification into the different malware families, as seen in Figure 1. The smaller number of features also makes the models computationally cheaper to train. The visual differences between the box plots of the feature selection techniques are not easily seen. The Friedman test
helps statistically reject the null hypothesis that the
different feature selection techniques do not signifi-
cantly affect performance. The lower the mean rank
in the Friedman test, the better the performance. The
degree of freedom was taken to be three. The results
of the Friedman test, as seen in Table 2, show that the
genetic algorithm performs better than the other feature selection techniques. PCA degrades the performance of the models compared to the original set of
ENASE 2023 - 18th International Conference on Evaluation of Novel Approaches to Software Engineering