A Novel Approach for Android Malware Detection and Classification using Convolutional Neural Networks

Ahmed Lekssays (1, a), Bouchaib Falah (1, b) and Sameer Abufardeh (2, c)

(1) School of Science and Engineering, Al Akhawayn University in Ifrane, Ifrane, Morocco

(2) Math, Science, and Tech. Dept., University of Minnesota Crookston, Crookston, MN, U.S.A.

Keywords: Malware, Android, Machine Learning, Classification, Convolutional Neural Networks.

Abstract: Malicious software, or malware, has been growing exponentially in the last decades according to antivirus vendors. The growth of malware is due to advanced techniques that malware authors use to evade detection. Hence, the traditional methods that antivirus vendors deploy are insufficient in protecting people's digital lives. In this work, an attempt is made to address the problem of mobile malware detection and classification based on a new approach to Android mobile applications that uses Convolutional Neural Networks (CNN). The paper suggests a static analysis method that helps in malware detection using malware visualization. In our approach, first, we convert Android applications in APK format into grayscale images. Since malware from the same family has shared patterns, we then designed a machine learning model to classify Android applications as malware or benign based on pattern recognition. The dataset used in this research is a combination of self-made datasets that used public APIs to scan the APK files downloaded from open sources on the internet, and a research dataset provided by the University of New Brunswick, Canada. Using our proposed solution, we achieved an 84.9% accuracy in detecting mobile malware.

1 INTRODUCTION

According to Statista, there are 2.53 billion Android devices in the world, and this number is expected to reach 2.87 billion in 2020 (Statista, 2020). On the other hand, Kaspersky Lab detected more than 6 million malware packages in 2018 (Lab., 2019). These numbers show the criticality of mobile security and the challenges it has been facing in recent years. It is an open war between malware writers and antivirus (AV) vendors.

Due to the limitations of signature-based detection, AV vendors rely on heuristic methods to detect malware. These methods are based on rules defined by AV security experts, which state the restricted behaviors or operations that software may perform. The main drawback of heuristic-based methods is that they generate many false positives, since not all software that accesses certain files is malware. However, they help in detecting new malware without comparing it with a signature saved in AV vendors' databases (Michie et al., 1994).

a https://orcid.org/0000-0001-5783-8638

b https://orcid.org/0000-0001-5086-0808

c https://orcid.org/0000-0002-9893-8923

Building on these two detection techniques, AV vendors use a hybrid technique that combines signature-based and heuristic-based methods to detect malware. There are two types of analysis that are performed on any software to determine whether it is malware: static analysis and dynamic analysis (Zhao and Liu, 2007). Both are done in a sandboxed environment for security purposes. In static analysis, the malware analyst does not execute the malware. He or she tries to reverse engineer the malware to get its source code, and then checks the string signatures, byte sequences, opcodes, etc. Dynamic analysis has two main components: memory and network. Here, the malware analyst executes the malware and watches its behavior in the sandboxed environment, checking the run-time behavior and the API endpoints and calls that the malware tries to access. This gives an idea about the severity of the malware and the measures that security professionals can implement to stop it (Zhao and Liu, 2007).

Lekssays, A., Falah, B. and Abufardeh, S.
A Novel Approach for Android Malware Detection and Classification using Convolutional Neural Networks.
DOI: 10.5220/0009822906060614
In Proceedings of the 15th International Conference on Software Technologies (ICSOFT 2020), pages 606-614
ISBN: 978-989-758-443-5

Nowadays, the probability of success of machine learning in classification problems has increased thanks to three main factors: (1) the increase in commercial feeds of new malware samples, (2) computing power has become cheaper, and (3) machine learning has evolved as an independent computer science discipline, and big companies are investing in it, which helps researchers by providing the tools to innovate in the field.

Machine learning approaches have shown encouraging results toward achieving a high malware detection rate without any human interaction. As a result, AV vendors and their research and development teams began deploying machine learning classifiers such as neural networks, decision trees, and logistic regression (Krizhevsky et al., 2012).

Malware analysis, as an independent discipline in

cybersecurity, has been facing the problem of mal-

ware classiﬁcation or detection as a binary classiﬁ-

cation. So, any ﬁle shall be analyzed to detect if it is

malware or not. If it is malware, it is labeled accord-

ing to its type and family based on its behavior by

using a classification mechanism. The main purpose behind this work is to introduce a different approach for malware detection in Android by using visual characteristics of malware and deep learning for pattern recognition.

Contributions. In this work, we have:

• Combined and preprocessed a dataset containing benign and malicious Android applications;

• Developed a machine learning model based on CNN to detect and classify mobile application samples as benign or malware;

• Evaluated the suggested model by comparing it with other defined models.

Outline. The rest of the paper is organized as follows. In Section 2, different malware types are introduced. Then, in Section 3, we introduce the different malware analysis methodologies, whereas in Section 4 we discuss the related work. After that, in Section 5, we present our methodology in detail. Then, we explain the process of processing malware as an image in Section 6. We share our results in Section 7. Finally, we present our conclusions and future work in Section 8.

2 MALWARE TYPES

Malware is a blend of two words: malicious and software. Malware programs are designed and implemented to damage a system or execute malicious commands on it, which may lead to unwanted actions for the user such as gathering sensitive information, disrupting normal computer operations, gaining control over the computer system, spying on the user's daily activities by gaining access to mobile sensors, and destroying the mobile system. The word malware is the general term used to describe any malicious software. However, malware can be technically divided into the following categories depending on its goal.

• Adware: It is a type of malware that automatically

displays advertisements. It is used to gather data

about users’ interests and to get revenue from the

displayed advertisements.

• Spyware: It is a type of malware that tracks the daily activities of users without their knowledge. It depends on the data that it gets from mobile sensors and other running applications. Spyware can collect sensitive data through keystroke logging, data harvesting, and activity monitoring.

• Virus: It is a type of malware that can copy itself

and spread on the mobile system. Viruses can be

transported on any medium, including but not lim-

ited to email attachments, social media messages,

malicious links, etc.

• Worm: It is a type of malware that can spread on

the network by exploiting operating system vul-

nerabilities. The major difference between worm

and virus is that virus depends on human action

to spread while the worm can replicate itself and

spread without any human interaction.

• Trojan: It is a type of malware that makes itself appear as a normal file or application to trick users into downloading and installing it. A trojan can give unauthorized remote access to the infected mobile phone. It is usually designed in a client-server architecture where the server is installed on the attacker's machine, and the client is the trojan itself. It is used to steal private information including but not limited to logins, financial data, and cryptocurrency wallets. In addition, it is used to enable devices on the victim's mobile phone, such as the front camera, and to spy on users' activities such as keystrokes and files.

• Rootkit: It is a type of malicious software de-

signed to access other mobile phones remotely

and control them without being detected.

• Backdoor: It is software that allows access to compromised mobile phones. It gives the attacker an entry point to the mobile phone without the consent of the user.

• Ransomware: It is malicious software that restricts the user from accessing his or her files by encrypting them. The decryption happens after the attacker receives money from the victim. The payment is usually made using cryptocurrencies like Bitcoin or Ethereum.

The malware detection and classification problem has been one of the key issues that AV vendors have faced since their foundation. They have been detecting malware using signature-based methods and heuristic methods. A malware signature is a hash that uniquely identifies a specific malware. This property helped AV vendors to detect all the malware from a certain family through a generic signature.

From experiments, researchers found that all malware in a certain family share properties, behaviors, and a generic signature of that family. This idea is a fundamental concept that this work is built upon. However, when AV vendors started to detect such malware, malware authors tried to bypass the security mechanisms implemented by AV vendors. They started to write polymorphic and metamorphic malware to avoid matching generic signatures. Polymorphic malware uses a polymorphic engine to mutate while the original algorithm stays intact. One of the main techniques used in writing polymorphic malware is encryption. Metamorphic malware translates its binary code into a temporary representation, edits that representation, and then translates the edited version back to machine code. This form of mutation can be achieved by several techniques that may go down to the architectural level of the mobile phone, such as inserting NOP instructions or changing the machine instructions completely.

3 MALWARE ANALYSIS METHODOLOGIES

3.1 Static Analysis

Malware static analysis is based on the analysis of the

source code. It is called static because the analysis

is done without running the malware in a sandboxed

environment. It is also called code analysis. It is ba-

sically about examining the code without executing

the program. It helps in having an overall idea about

the code structure and auditing the code to check if it

adheres to industry standards. Automated tools help

security analysts and developers in performing static

analysis. The main advantage of static analysis is that

it can reveal bugs that do not manifest themselves.

Nevertheless, static analysis is just a ﬁrst step towards

analyzing the behavior and the effects of a certain ap-

plication (Schmidt et al., 2009).

3.2 Dynamic Analysis

Dynamic analysis is a technique used in computer

forensics and software testing in order to test and eval-

uate an application by executing it in a sandboxed en-

vironment in real-time. The malware analyst keeps

an eye on the behavior of the application in terms of

CPU, memory, and network usage. Automated tools

can help in this process by raising alerts in case of

suspicious activity by the application (Schmidt et al.,

2009).

4 RELATED WORK

Researchers have been exploring the field of malware detection in Android applications from different perspectives and angles. In this work, we explore malware detection using deep learning, specifically convolutional neural networks. Many approaches were used to detect malware based on network traffic, permissions, memory behavior, and CPU usage. However, in our research, we followed a static approach, which detects malware by converting it to images.

Ahmadi et al. worked on novel feature extraction, selection, and fusion for Windows malware family classification. Their work tried to keep up with the evolution of modern malware, which is designed with mutation characteristics such as polymorphism and metamorphism. These characteristics lead to an exponential growth in the number of variants of each malware. In their research, they developed a novel paradigm that is effective in classifying malware variants: they group malware based on feature extraction, selection, and fusion (Ahmadi et al., 2016).

Saxe and Berlin approached the problem of malware detection differently. They used deep neural networks to detect malware based on two-dimensional binary program features. They took the Windows binaries of malware and designed a system that learns from the binary features. They introduced an approach that achieves a usable detection rate at a low false-positive rate (Saxe and Berlin, 2015).

Yajamanam et al. chose a different approach. They integrated a computer vision approach by calculating GIST descriptors for image-based malware classification. GIST descriptors are image features that have recently been used widely in the field of malware classification. In their research, they implemented, tested, and analyzed a malware score based on GIST descriptors, which is a potential advantage for the field of malware classification. Their research was based on Windows malware (Yajamanam et al., 2018).

Li et al. opted for hybrid malicious code detection using deep learning. They suggested a new Android classification method called HADM, which stands for Hybrid Analysis for Detection of Malware. They start with static and dynamic information extraction and then convert it into vector-based representations. The method combines features extracted by deep learning with the original features, which resulted in an increase in detection rate (Li et al., 2015).

Tong et al. developed a hybrid approach for mobile malware detection in Android that adopts both static and dynamic analysis. They collected execution data of sample malware and benign applications using a netlink technology to generate patterns of system calls. They built a malicious pattern set and a normal pattern set in order to compare the patterns of malware and benign applications. To classify unknown applications, they followed a dynamic method to collect system call data and then compared it with the previously built pattern sets (Tong and Yan, 2017).

Narudin et al. evaluated machine learning classifiers for mobile malware detection. They used various network traffic features and grouped them into four categories: basic information, content-based, time-based, and connection-based. They used their own dataset. They conducted multiple experiments and found that k-nearest neighbor is an efficient classifier for malware detection (Narudin et al., 2016).

5 METHODOLOGY

5.1 General Overview of the Solution

The suggested solution of this paper is inspired by previous research on malware detection and classification on the Windows platform (Ahmadi et al., 2016). The idea is to convert malware to images, based on the observation that images of different malware samples from the same malware family appear similar, as shown in Fig. 1: they have common visual characteristics. The idea suggested in this paper is not common in research since it was tested only on Windows. In addition, if the malware is embedded in another application, it retains the same visual characteristics, so it produces an image similar to those of its family.

Figure 1: Visualizing Malware as Grayscale Image.

Figure 2: Charger Malware Family Visualization.

The work in (Ahmadi et al., 2016) focused on computing image-based features to characterize malware precisely. They used GIST descriptors to calculate texture features without going through the process of segmentation. This step resulted in a feature vector of 900 features for each malware. However, their work used just 320 features because, based on research, they found that 320 is the optimal number of features needed to identify malware. In addition, they suggested that 60 features is the minimum that can be used, with an error of 30%.

The feature vectors are used to train a K-nearest

neighbor classiﬁer with Euclidean distance. The con-

version process expects a malware binary ﬁle, in our

case the APK ﬁle, that is read as a vector of 8-bit un-

signed integers and structured as a 2D array. This ar-

ray is, then, visualized as a grayscale image in the

range [0, 255].
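The conversion step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `bytes_to_grayscale` and the fixed row width are hypothetical, since the text does not specify how the image width is chosen, and the toy byte string stands in for the contents of a real APK file.

```python
import numpy as np

def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Reshape raw bytes into a 2D uint8 array (one byte per pixel)."""
    buf = np.frombuffer(data, dtype=np.uint8)  # values already lie in [0, 255]
    height = len(buf) // width                 # drop the trailing partial row
    if height == 0:
        raise ValueError("not enough bytes to fill one image row")
    return buf[: height * width].reshape(height, width)

# Toy input standing in for the bytes of an APK file (assumption).
sample = bytes(range(256)) * 4
image = bytes_to_grayscale(sample, width=32)
print(image.shape, image.dtype)
```

For a real file, the bytes would be read with `open(path, "rb").read()` before being passed to the function.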

Visualizing malware as an image, as in the example shown in Fig. 2, has many benefits, especially since sequences of a binary can be identified easily. Moreover, malware authors usually tend to change small portions of the code in order to write new variants. Those small pieces of code are usually changed after the malware is caught by antivirus software. So, images are a good tool to detect the changes while keeping a global view of the structure of the malware. Hence, different malware can be grouped into families based on their visual properties, so they can be easily identified from images.

5.2 Dataset

The dataset used in this research is called the Android Malware Dataset (CICAndMal2017) (Lashkari et al., 2018). The approach used to build this dataset is to run both malware and benign applications on real Android smartphones in order to capture the exact running behavior of the applications. Research has shown that simulators often result in inconsistent behaviors of the applications, which might change the end results. In addition, some malware is smart enough to detect emulated environments. The dataset is composed of 10,854 samples (4,354 malware and 6,500 benign) from different sources. The benign applications were downloaded from Google Play in the period between 2015 and 2017. However, the dataset covers only the 5,491 applications that were actually run (426 malware and 5,065 benign). Due to storage and computational power limitations, we used in this research a sample of 852 applications (426 malware and 426 benign). This step also balances the two classes to avoid bias toward either one. The malware samples in this dataset are classified into four categories:

• Adware

• Ransomware

• Scareware

• SMS Malware

The samples span 42 unique malware families within the four categories above.
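One way to obtain a balanced subset like the 426/426 split described in this section is to keep every sample of the smaller class and draw an equal number from the larger one. The sketch below assumes a uniform random draw with a fixed seed; the text does not say how the benign subset was actually selected, and the identifiers are placeholders.

```python
import random

def balanced_sample(malware, benign, seed=0):
    """Return equally sized random subsets of the two classes."""
    rng = random.Random(seed)              # fixed seed for reproducibility
    n = min(len(malware), len(benign))     # size of the smaller class
    return rng.sample(list(malware), n), rng.sample(list(benign), n)

# Placeholder identifiers matching the counts given in the text.
malware_ids = [f"mal_{i}" for i in range(426)]
benign_ids = [f"ben_{i}" for i in range(5065)]
mal, ben = balanced_sample(malware_ids, benign_ids)
print(len(mal), len(ben))
```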

6 MALWARE IMAGES PREPROCESSING

6.1 Labeling

During this research, we have been dealing with a binary classification problem using convolutional neural networks. In order to build a balanced and unbiased dataset, we downloaded benign Android applications from the Play Store. The process of identifying whether an application is benign or malware was done using the VirusTotal API. VirusTotal is a web service that provides malware analysis based on file signatures, drawing on its large database (VirusTotal, 2020).

6.2 Data Pre-processing

In order to feed the CNN, a normalization step is needed to decrease the sizes of the images and to make them uniform. A CNN expects a set of labeled images of the same size. However, this was a problem in this research since the images have different dimensions. In order to overcome this problem, we used mean subtraction and normalization.
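Bringing images of different dimensions to one common size can be illustrated with a nearest-neighbour resize written directly in NumPy. This is a sketch under assumptions: the text does not state which resizing method was used, the function name `resize_nearest` is hypothetical, and the target of 128 pixels is taken from the input layer size given later in the architecture descriptions.

```python
import numpy as np

def resize_nearest(image: np.ndarray, size: int = 128) -> np.ndarray:
    """Nearest-neighbour resize of a 2D array to size x size."""
    # For each output row/column, pick the index of the source pixel it maps to.
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    return image[rows[:, None], cols[None, :]]

# Toy 300 x 64 grayscale image standing in for a malware image (assumption).
tall = (np.arange(300 * 64) % 256).astype(np.uint8).reshape(300, 64)
small = resize_nearest(tall)
print(small.shape, small.dtype)
```

Nearest-neighbour sampling keeps the original byte values unchanged, at the cost of discarding or repeating some rows and columns; interpolating methods would blend neighbouring bytes instead.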

6.2.1 Mean Subtraction

Mean subtraction is a widely used technique in preprocessing images for CNNs. It consists of subtracting the mean across every individual feature in the data. It can be interpreted geometrically as centering the data cloud around the origin along every dimension. In our implementation, we used NumPy arrays. We implemented this operation as X -= np.mean(X), where X is our NumPy array that holds the data, assuming that we have grayscale images.
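As a small illustration of mean subtraction, the sketch below contrasts subtracting a single scalar mean, as in the expression given in the text, with subtracting one mean per pixel position. The random batch is a stand-in for real malware images, and the cast to a floating-point type is an assumption: NumPy refuses to apply a float result in place to an unsigned 8-bit array.

```python
import numpy as np

# Hypothetical batch: 4 grayscale images of 8 x 8 pixels, values in [0, 255].
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(4, 8, 8)).astype(np.float32)  # cast before centering

X_global = X - np.mean(X)           # one scalar mean for the whole array
X_feature = X - np.mean(X, axis=0)  # one mean per pixel position (per feature)

print(abs(float(X_global.mean())) < 1e-3)
print(bool(np.allclose(X_feature.mean(axis=0), 0.0, atol=1e-3)))
```

Both variants center the data; the per-feature version matches the description of subtracting the mean across every individual feature.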

6.2.2 Normalization

Normalization refers to scaling the data dimensions so that they are approximately on the same scale. We implemented this step by dividing by the standard deviation after making the data zero-centered. The implementation using NumPy was done as follows: X /= np.std(X), where X is our NumPy array that holds the data, assuming that we have grayscale images.
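Zero-centering followed by division by the standard deviation can be combined into one helper. The following is a minimal sketch: the function name `standardize` and the small `eps` term, which guards against division by zero for a constant image, are additions for illustration and are not part of the original implementation.

```python
import numpy as np

def standardize(X: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Zero-center X, then scale it to (approximately) unit standard deviation."""
    X = X.astype(np.float64)  # work on a float copy of the pixel data
    X -= np.mean(X)           # mean subtraction
    X /= np.std(X) + eps      # normalization; eps is an added safeguard
    return X

pixels = np.array([[0, 64], [128, 255]], dtype=np.uint8)
Z = standardize(pixels)
print(Z)
```

After this step the values have a mean close to 0 and a standard deviation close to 1, regardless of the original brightness range of the image.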

6.2.3 GIST Feature Extraction

In this research, we have worked with two types of data. We used the images after the preprocessing phase and fed them to the CNN. In addition, we extracted the GIST feature vectors and used the first 320 values. This was implemented using the LearGist Python wrapper since the official Python library seems to be dead [15]. The idea is to take a preprocessed image of uniform size and compute its GIST descriptor, which results in a feature vector of 960 values, of which we used 320 because researchers concluded that 320 is enough to get optimal results. In addition, researchers stated that only 60 values will give accurate results up to 60% (Douze et al., 2009).

6.2.4 Malware Images Classiﬁcation

In this paper, we have used the k-nearest neighbor algorithm. KNN, or k-Nearest Neighbor, is a supervised learning algorithm that is widely used in machine learning and data mining. It is a classifier where the learning is based on how similar one vector is to another. It does not build an explicit model during training; instead, it performs a mathematical calculation to measure the distance between the unclassified sample and the labeled data to make the classification. The main distance calculations used in k-NN are the Euclidean distance and the Manhattan distance (Cover and Hart, 1967). In this research, we have used the Euclidean distance between two points p and q with the following formula:

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (1)$$
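The Euclidean distance between two feature vectors, together with the Manhattan distance mentioned above for comparison, can be computed as in the following sketch. The function names are illustrative, and the two-dimensional points are a toy example rather than GIST vectors.

```python
import numpy as np

def euclidean(p, q) -> float:
    """Square root of the summed squared coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((q - p) ** 2)))

def manhattan(p, q) -> float:
    """Sum of the absolute coordinate differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(np.abs(q - p)))

print(euclidean([0, 0], [3, 4]))  # 5.0
print(manhattan([0, 0], [3, 4]))  # 7.0
```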


The KNN algorithm is as follows:

1. Receive an unclassified sample;

2. Measure the distance (Euclidean, Manhattan, Minkowski, or Weighted) from the new sample to all other samples that are already classified;

3. Sort the distances;

4. Take the K smallest distances (nearest neighbors), where K is the number of neighbors that should be selected. The default value for K is 1;

5. Gather the classes of the nearest neighbors;

6. Classify the new sample with the class chosen in Step 5.
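The steps listed above can be sketched with NumPy as follows. This is an illustrative version, not the implementation used in the experiments: the function name `knn_predict`, the majority vote used to pick the class in Step 5, and the tiny two-dimensional training set are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, x, k=1):
    """Label x by majority vote among its k nearest training vectors."""
    dists = np.sqrt(((train_X - x) ** 2).sum(axis=1))  # step 2: Euclidean distances
    nearest = np.argsort(dists)[:k]                    # steps 3-4: sort, keep k smallest
    votes = Counter(train_y[i] for i in nearest)       # step 5: classes of the neighbors
    return votes.most_common(1)[0][0]                  # step 6: most frequent class

# Toy feature vectors; real inputs would be the 320-value GIST vectors.
train_X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
train_y = ["benign", "benign", "malware", "malware"]
print(knn_predict(train_X, train_y, np.array([0.95, 1.0]), k=3))
```

With K = 1 the vote reduces to copying the label of the single closest sample; larger odd values of K make the decision less sensitive to one mislabeled neighbor.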

7 RESULTS

7.1 Environment

The experiments were run on Ubuntu 18.04 LTS with built-in GPUs. We used the Deep Learning Virtual Machine provided by Microsoft Azure with SSD storage. The programming language that we used is Python because it supports many machine learning tools and libraries such as TensorFlow, Keras, Anaconda, Jupyter, NumPy, matplotlib, SciPy, Pandas, and scikit-learn. The role of each tool is as follows:

• TensorFlow is “an open-source software li-

brary for numerical computation using data ﬂow

graphs.” The graph nodes represent mathematical

operations, while the graph edges represent the

multidimensional data arrays (tensors) that ﬂow

between them (TensorFlow, 2020).

• Keras is a high-level neural networks API, written

in Python and capable of running on top of Ten-

sorFlow, CNTK, or Theano (Keras, 2020).

• Anaconda is a free and open-source distribution

of the Python and R programming languages for

scientiﬁc computing to simplify package manage-

ment and deployment (Anaconda, 2020).

• Jupyter Project exists to develop open-source soft-

ware, open-standards, and services for interactive

computing across dozens of programming lan-

guages (Jupyter, 2020).

• NumPy, or Numerical Python, is a general-purpose library for N-dimensional arrays and numerical computing (NumPy, 2020).

• Matplotlib is a ﬂexible library for creating graphs

and visualization (Matplotlib, 2020).

• Pandas is a well-known and high-performance

tool for presenting data frames (Pandas, 2020).

• scikit-learn implements a wide range of machine learning algorithms and makes it easy to plug them into actual applications (kit Learn, 2020).

Figure 3: Training and Validation Accuracy vs. Epochs

with 10 Epochs.

7.2 Architectures

In order to achieve high accuracy, we built several architectures to detect malware using convolutional neural networks. We varied parameters such as the learning rate, batch size, and number of epochs for each architecture, and we implemented different architectures with different layers and activation functions.

7.2.1 CNN A: 3c 2D

The architecture consists of:

1. Input layer NxN pixels (N = 128)

2. Convolutional Layer (32 ﬁlters of size 3x3)

3. Max Pooling layer

4. Convolutional Layer (64 ﬁlters of size 3x3)

5. Max Pooling layer

6. Convolutional Layer (128 ﬁlters of size 3x3)

7. Max Pooling layer

8. Flatten Layer

9. Densely-connected layer (64 neurons)

10. Densely-connected layer (1 neuron)

We used the ReLU activation function for all layers except the last one, where we used sigmoid. We fixed a learning rate of 0.01, a batch size of 16, 10 epochs, and binary cross-entropy as the loss function. The accuracy that we achieved with this architecture is 84.9%. Fig. 3 describes the behavior of the model in terms of accuracy. In addition, Fig. 4 shows the training vs. validation loss.
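To make the size of CNN A concrete, the sketch below walks through the layer list and tracks the feature-map shape and the number of trainable parameters. Several details are assumptions, because the text does not state them: a single-channel 128 x 128 input, convolutions without padding and with stride 1, and 2 x 2 max pooling. The helper names are hypothetical.

```python
def conv(shape, filters, k=3):
    """Output shape and parameter count of an unpadded k x k convolution."""
    h, w, c = shape
    return (h - k + 1, w - k + 1, filters), k * k * c * filters + filters

def pool(shape, p=2):
    """Output shape of non-overlapping p x p max pooling (no parameters)."""
    h, w, c = shape
    return (h // p, w // p, c)

def dense(units_in, units_out):
    """Parameter count of a fully connected layer (weights plus biases)."""
    return units_in * units_out + units_out

shape, total = (128, 128, 1), 0
for filters in (32, 64, 128):          # the three convolutional layers of CNN A
    shape, params = conv(shape, filters)
    total += params
    shape = pool(shape)
flat = shape[0] * shape[1] * shape[2]  # size of the flatten layer
total += dense(flat, 64) + dense(64, 1)
print(shape, flat, total)
```

Under these assumptions most parameters sit in the first densely-connected layer, which is typical for small CNNs that flatten a large feature map. In Keras, the layers in the list would correspond to `Conv2D`, `MaxPooling2D`, `Flatten`, and `Dense`.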


Figure 4: Training and Validation Loss vs. Epochs with 10

Epochs.

Figure 5: Training and Validation Accuracy vs. Epochs

with 100 Epochs.

Figure 6: Training and Validation Loss vs. Epochs with 100

Epochs.

We have tried to experiment with the architec-

ture by increasing the number of epochs to reach 100.

The model kept its good detection performance. The

graphs in Fig. 5 and Fig. 6 describe the accuracy and

the loss of the model in 100 epochs.

Figure 7: Training and Validation Accuracy vs. Epochs

with 10 Epochs.

Figure 8: Training and Validation Loss vs. Epochs with 10

Epochs.

7.2.2 CNN B: 2c 2D

The architecture consists of:

1. Input layer NxN pixels (N = 128)

2. Convolutional Layer (32 ﬁlters of size 3x3)

3. Max Pooling layer

4. Convolutional Layer (64 ﬁlters of size 3x3)

5. Max Pooling layer

6. Flatten Layer

7. Densely-connected layer (1 neuron)

We used the ReLU activation function for all layers except the last one, where we used softmax. We fixed a learning rate of 0.001, a batch size of 16, 10 epochs, and binary cross-entropy as the loss function. The accuracy that we achieved with this architecture is 68.1%. The graph in Fig. 7 describes the behavior of the model in terms of accuracy. In addition, the graph in Fig. 8 shows the training vs. validation loss.
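One detail may be worth checking when reproducing this configuration: softmax normalizes across the units of a layer, so applied to a single output unit it returns 1 for any input, whereas sigmoid maps each logit to a value between 0 and 1. The sketch below only illustrates this arithmetic with hand-written NumPy versions of both functions; whether it played any role in the reported accuracy is not something the text establishes.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted for numerical stability."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([[-3.0], [0.0], [2.5]])  # three samples, one output unit each
print(softmax(logits).ravel())  # identical value for every sample
print(sigmoid(logits).ravel())  # varies with the logit
```

With two output units, one per class, softmax would produce a proper probability pair; with a single unit, sigmoid is the usual companion of binary cross-entropy.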

We experimented with the architecture by increasing the number of epochs to 100. The model kept its good detection performance. The graphs in Fig. 9 and Fig. 10 describe the accuracy and the loss of the model over 100 epochs.

Figure 9: Training and Validation Accuracy vs. Epochs

with 100 Epochs.

Figure 10: Training and Validation Loss vs. Epochs with

100 Epochs.

8 CONCLUSION AND FUTURE WORK

In this research, we focused on studying the feasibility of detecting and classifying mobile malware by treating them as images. The paper presents a static analysis approach using deep learning and convolutional neural networks. It is a different approach in mobile malware detection, since existing research mostly focuses on detecting and classifying mobile malware based on their behavior. In addition, this work presented the challenges of applying such an approach to mobile applications, since they are usually large files. We used the proposed approach to detect and classify malware from the binary files of Android applications. We used different architectures to test our approach and compared the results. We achieved an accuracy of 84.9% using one architecture and 68.1% using another. So, both architectures are useful for malware detection. Our model can detect variants of malware or other unknown malware based on the training data. This is an added value that overcomes some of the problems facing signature-based detection systems.

Although we achieved 84.9% accuracy, we still have limitations and challenges. The first limitation is the dataset. Our model learned from previous malware samples, and the dataset that we used had a limited number of them. Also, because the majority of Android malware datasets, and some malware families and benign applications, are private, we had to collect our own data. The second limitation is processing power. Processing images needs high-performance GPUs that are not available at the university lab.

This work is an attempt to build an advanced malware detection system. In the future, we plan to increase the accuracy of our system and reduce the number of false positives by using new image feature extraction and computer vision techniques that can help with the problem of malware classification. In addition, we plan to introduce multi-level classification instead of binary classification to precisely determine the type and family of malware.

ACKNOWLEDGEMENTS

We would like to thank all fellow students, faculty,

and staff for supporting this research. Special thanks

to Mr. Saad Taame for his help in data pre-processing.

REFERENCES

Ahmadi, M., Ulyanov, D., Semenov, S., Troﬁmov, M., and

Giacinto, G. (2016). Novel feature extraction, selec-

tion and fusion for effective malware family classiﬁ-

cation. In Proceedings of the sixth ACM conference

on data and application security and privacy, pages

183–194.

Anaconda (2020).

Cover, T. and Hart, P. (1967). Nearest neighbor pattern clas-

siﬁcation. IEEE transactions on information theory,

13(1):21–27.

Douze, M., Jégou, H., Sandhawalia, H., Amsaleg, L., and

Schmid, C. (2009). Evaluation of gist descriptors

for web-scale image search. In Proceedings of the

ACM International Conference on Image and Video

Retrieval, pages 1–8.

Jupyter, P. (2020).

Keras (2020).

kit Learn, S. (2020).


Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-

agenet classiﬁcation with deep convolutional neural

networks. In Advances in neural information process-

ing systems, pages 1097–1105.

Lab., K. (2019). Kaspersky lab detected more than 6 million

malware package.

Lashkari, A. H., Kadir, A. F. A., Taheri, L., and Ghor-

bani, A. A. (2018). Toward developing a system-

atic approach to generate benchmark android malware

datasets and classiﬁcation. In 2018 International Car-

nahan Conference on Security Technology (ICCST),

pages 1–7. IEEE.

Li, Y., Ma, R., and Jiao, R. (2015). A hybrid malicious

code detection method based on deep learning. In-

ternational Journal of Security and Its Applications,

9(5):205–216.

Matplotlib (2020).

Michie, D., Spiegelhalter, D. J., Taylor, C., et al. (1994).

Machine learning. Neural and Statistical Classiﬁca-

tion, 13(1994):1–298.

Narudin, F. A., Feizollah, A., Anuar, N. B., and Gani,

A. (2016). Evaluation of machine learning classi-

ﬁers for mobile malware detection. Soft Computing,

20(1):343–357.

NumPy (2020).

Pandas (2020).

Saxe, J. and Berlin, K. (2015). Deep neural network based

malware detection using two dimensional binary pro-

gram features. In 2015 10th International Conference

on Malicious and Unwanted Software (MALWARE),

pages 11–20. IEEE.

Schmidt, A.-D., Bye, R., Schmidt, H.-G., Clausen, J., Ki-

raz, O., Yuksel, K. A., Camtepe, S. A., and Albayrak,

S. (2009). Static analysis of executables for collab-

orative malware detection on android. In 2009 IEEE

International Conference on Communications, pages

1–5. IEEE.

Statista (2020). Number of smartphone users worldwide

from 2014 to 2020 (in billions).

TensorFlow (2020).

Tong, F. and Yan, Z. (2017). A hybrid approach of mobile

malware detection in android. Journal of Parallel and

Distributed computing, 103:22–31.

VirusTotal (2020). Analyze suspicious files and URLs to detect types of malware, automatically share them with the security community.

Yajamanam, S., Selvin, V. R. S., Di Troia, F., and Stamp,

M. (2018). Deep learning versus gist descriptors for

image-based malware classiﬁcation. In ICISSP, pages

553–561.

Zhao, Z. and Liu, H. (2007). Spectral feature selection for

supervised and unsupervised learning. In Proceed-

ings of the 24th international conference on Machine

learning, pages 1151–1157.
