tify the minimum feature set that provides higher ac-
curacy, compare the discriminatory powers of feature
categories and analyze the results of models induced
by datasets belonging to different time-frames. For
these purposes, we applied a two-step procedure to
a dataset composed of system calls (i.e., dynamic
behaviour) and permissions (i.e., static behaviour)
extracted from malware samples and legitimate
applications. In the first step, we used statistical
hypothesis testing methods to identify the features
that may contribute significantly to the
classification. In the second step, we employed
Fisher’s Score and the Gini Index, which enabled us to
rank the selected features according to their
discriminatory power. We then induced machine learning
models with different combinations of datasets with
varying feature sets. As Android is the most used
mobile operating system worldwide, we focused on the
detection of Android malware (Statista, 2018). For
this research, we formed two malware datasets. The
“old dataset” consists of randomly selected apps
from the Drebin malware dataset, collected between 2010
and 2012 (Arp et al., 2014). The “new dataset” was formed
by randomly choosing samples from the years 2017
and 2018 from the VirusTotal Academic malware dataset
(VirusTotal, 2018). A third dataset, called the “legitimate
dataset”, is composed of benign applications.
We utilized various combinations of these datasets for
inducing learning models.
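
The two-step procedure described above can be illustrated with a minimal sketch. It assumes binary features (e.g., whether a permission is requested) and binary labels, and it uses a chi-squared test of independence as the hypothesis test; the test choice, the significance level alpha, the function names and the toy feature names are all assumptions made for illustration, not details taken from this section. The ranking step uses one common formulation of Fisher’s Score; a Gini Index function could be substituted as the sort key.

```python
import math

def chi2_pvalue_2x2(feature, labels):
    """p-value of a chi-squared independence test (1 d.o.f.) on a binary feature."""
    n = len(labels)
    obs = [[0, 0], [0, 0]]  # obs[feature value][class label]
    for x, y in zip(feature, labels):
        obs[x][y] += 1
    row = [sum(obs[0]), sum(obs[1])]
    col = [obs[0][0] + obs[1][0], obs[0][1] + obs[1][1]]
    if 0 in row or 0 in col:
        return 1.0  # constant feature or single class: no evidence of dependence
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (obs[i][j] - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 d.o.f.
    return math.erfc(math.sqrt(stat / 2.0))

def fisher_score(feature, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    n = len(labels)
    mean_all = sum(feature) / n
    between = within = 0.0
    for c in set(labels):
        vals = [x for x, y in zip(feature, labels) if y == c]
        mean_c = sum(vals) / len(vals)
        var_c = sum((v - mean_c) ** 2 for v in vals) / len(vals)
        between += len(vals) * (mean_c - mean_all) ** 2
        within += len(vals) * var_c
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def select_and_rank(features, labels, alpha=0.05):
    """Step 1: keep features with p < alpha. Step 2: rank them by Fisher's Score."""
    selected = {name: col for name, col in features.items()
                if chi2_pvalue_2x2(col, labels) < alpha}
    return sorted(selected, key=lambda name: fisher_score(selected[name], labels),
                  reverse=True)

# Toy data (hypothetical): 10 malware samples (label 1), then 10 benign (label 0).
toy_labels = [1] * 10 + [0] * 10
toy_features = {
    "feat_strong": [1] * 9 + [0] + [0] * 9 + [1],  # mostly present in malware
    "feat_constant": [1] * 20,                      # present in every sample
    "feat_noise": [1, 0] * 10,                      # unrelated to the label
}
toy_ranking = select_and_rank(toy_features, toy_labels)
```

In this toy example only features that pass the hypothesis test reach the ranking step, which mirrors the intent of filtering before ranking.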
This study shows that a feature selection and ranking
process can significantly reduce the number of
features required by a classifier that provides high
accuracy for the detection of mobile malware. We found
that the features possessing the most discriminatory
power in classification may differ as new malware types
evolve over time, indicating concept drift. Results suggest
that the behaviour of mobile malware in terms of system
calls and permissions has become more similar to that of
legitimate apps over time, although the extent of this
evolution varies between the two
feature categories.
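
One simple way to quantify how much a feature ranking changes between time-frames is the overlap of the top-k features of two rankings. The sketch below uses the Jaccard index for this purpose; this measure, the function name and the placeholder feature names are illustrative assumptions and are not the analysis performed in this paper.

```python
def top_k_jaccard(ranking_a, ranking_b, k):
    """Jaccard overlap between the top-k entries of two ranked feature lists."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0

# Hypothetical rankings induced from an older and a newer dataset.
old_ranking = ["f1", "f2", "f3", "f4"]
new_ranking = ["f3", "f5", "f1", "f6"]
overlap = top_k_jaccard(old_ranking, new_ranking, k=3)
```

A value close to 1 would indicate stable top-ranked features, whereas a value close to 0 would be consistent with a drift in the most discriminatory features.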
Our main contribution is a detailed analysis and
comparison of feature selection and ranking results
obtained for two types of feature categories. One of
the distinctive properties of the present paper is that,
in addition to optimizing the number of predictors,
we analyzed the change in the selected features that
has occurred due to the evolution of malware over
time.
This paper is organized as follows: Section 2
presents a review of related literature. The method
employed in the study is described in Section 3.
Results of our experiments are presented and discussed
in Section 4, and Section 5 concludes the study.
2 LITERATURE REVIEW
Feature selection and ranking methods have been
used in various machine learning-based malware detection
studies. Yan et al. (2013) measured the discriminatory
power of malware features, such as hexdump of binaries,
disassembly code, PE header and system calls, using three
filter methods (ReliefF, Chi-squared and F-statistics) and
two embedded, L1-regularized methods (L1-logreg and
L1-SVC). That study found that PE header and system calls
are very useful for distinguishing malware from legitimate
software, and that L1-regularized methods with 100 features
provided higher detection rates (Yan et al., 2013). Ahmadi
et al. (2016) measured and compared the discriminatory
powers of various static feature categories using the mean
decrease impurity notion and a random forest classifier
in a multi-class malware family classification.
Utilization of feature ranking methods is consid-
erably less common in those studies which provide
classifiers specifically for mobile malware detection
(Feizollah et al., 2015). Lindorfer et al. (2015) applied
Fisher’s Score to evaluate the discriminatory
power of dynamic and static feature categories. That
study found that required permissions and some
dynamic features related to SMS sending and dynamic
loading of code have higher discriminatory
power (Lindorfer et al., 2015). Cen et al. (2015)
created a classifier using Regularized Logistic
Regression with Lasso Norm for source code features
(Java package, class and function levels). Information
Gain, Chi-Square and an embedded method of logistic
regression were utilized for feature selection. It
was found that 10% of the features selected by
Information Gain or Chi-Square are sufficient for high
detection rates (Cen et al., 2015). Similarly, Shabtai et
al. (2012) applied filter methods such as Chi-Square,
Fisher’s Score and Information Gain to some
system metric features (e.g., CPU consumption, number
of running processes, battery level) in the early
days of Android.
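
As an illustration of the filter methods mentioned above, the following minimal sketch computes Information Gain for a single discrete feature: the entropy of the class labels minus the expected entropy after partitioning the samples by feature value. It is a textbook formulation and is not taken from any of the cited studies; the function names are chosen for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Entropy of the labels minus the expected entropy after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder
```

A filter method of this kind scores each feature independently of any classifier, so features can be ranked by the score and the top-ranked portion retained.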
Pehlivan et al. (2014) applied feature selection
methods such as Information Gain, ReliefF, Correla-
tion Feature Selection (CFS) and consistency-based
selection to permissions with different classification
models. A random forest classifier with 25 permission
features selected by CFS provided the best accuracy.
In a similar study by Nezhadkamali et al. (2017),
three feature selection methods, L1-based feature se-
lection, Information Gain and Gini Impurity, were
used with permissions. All three methods were tested
using different machine learning algorithms, such as
decision tree, SVM and random forest. Best results
In-depth Feature Selection and Ranking for Automated Detection of Mobile Malware