“Mirror, Mirror on the Wall, Who is the Fairest One of All?”

Machine Learning versus Model Checking: A Comparison between Two Static Techniques for Malware Family Identification

Vittoria Nardone and Corrado Aaron Visaggio

Department of Engineering, University of Sannio, Benevento, Italy

Keywords:

Malware, Android, Security, Model Checking, Testing.

Abstract:

Malware targeting Android platforms is growing in number and complexity. Huge volumes of new variants emerge every month, which creates the need to recognize specific variants promptly when they are encountered. Several approaches have been developed for malware detection, and recently the research community has been developing approaches able to detect malware variants. Among them, two approaches demonstrated high performance in detecting malware and assigning it to the family it belongs to: one based on machine learning and one on formal methods. In this paper we compare the results achieved by the two methods in terms of Precision, Recall and Accuracy, and we highlight the points of strength and weakness of each.

1 INTRODUCTION

Mobile devices are spreading at an impressive pace and, as reported by the Worldwide Quarterly Mobile Phone Tracker (http://www.idc.com/prodserv/smartphone-os-market-share.jsp), in the second quarter of 2016 Android kept the greatest market share among mobile operating systems.

This record has pushed the community of malware writers to devote effort to mobile platforms. According to the Internet Security Threat Report (https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf), the number of Android malware families added in 2015 grew by 6 percent, compared with the 20 percent growth in 2014. The volume of Android malware variants increased by 40 percent in 2015, compared with 29 percent growth in the previous year, while there were more than three times as many Android apps classified as containing malware in 2015 than in 2014, an increase of 230 percent.

Smartphones are also largely used for building botnets, as demonstrated by a recent DDoS attack in which 1 TB per second of traffic was conveyed using only remotely controlled infected smartphones (http://thehackernews.com/2016/09/ddos-attack-iot.html). These facts suggest that it is urgent to find detection techniques for malware targeting mobile platforms that can counter the evasion techniques current malware makes large use of. Different and diverse methods have been proposed

to detect mobile malware and classify variants. Machine learning based classification is one of the most investigated. Its main limit is that the effectiveness of the classifier depends on the kind (and the number) of malware samples included in the training set: malware that is not represented in the training set will not be detected. This limit is not trivial if we consider the huge number of variants and new kinds of malware that are released in the wild each month. Formal methods can assess with very high precision whether a rule is verified by a piece of code. The immediate advantage of this technique is that if the malware shows the specific behaviors represented by the rules, they will surely be recognized. The second advantage is that if a categorized behavior is implemented in the program, formal methods are able to locate it in the code. In this paper we compare the two approaches, the one based on machine learning (Canfora et al., 2016) and the one based on formal methods (Mercaldo et al., 2016a; Mercaldo et al., 2016c), in order to characterize the points of strength and weakness of the techniques. The paper proceeds as follows: section 2 discusses the related literature, section 3 provides the background for the compared approaches, while section 4 describes the approaches in detail. Section 5 presents the experimentation and the

obtained results. Finally, section 6 draws the conclusion of the work.

Nardone, V. and Visaggio, C.
“Mirror, Mirror on the Wall, Who is the Fairest One of All?” - Machine Learning versus Model Checking: A Comparison between Two Static Techniques for Malware Family Identification.
DOI: 10.5220/0006287506630672
In Proceedings of the 3rd International Conference on Information Systems Security and Privacy (ICISSP 2017), pages 663-672
ISBN: 978-989-758-209-7

2 RELATED WORK

The authors in (Alam et al., 2016) use clone detection for recognizing malware variants for Android. They applied the method to a smaller and older data-set than ours (166 Android malware samples). The main limitation of this technique is that a variant that is not a clone of the representative family members is not recognized. (Faruki et al., 2015) explores the homogeneity of the byte distribution for capturing similarity among program variants. This technique can be sensitive to obfuscation. (Zhang et al., 2014) proposes an approach that classifies Android malware via dependency graphs. The authors build program semantics with contextual APIs. They correctly detect 93% of the malware instances.

The authors in (Suarez-Tangil et al., 2014) present Dendroid, a text mining based approach. Suarez-Tangil et al. base their approach on code structures and use a real data-set of Android malware families (the Android Malware Genome Project, available at http://www.malgenomeproject.org). They measure the similarity between malware samples, and then use this similarity to automatically classify the malware into families. Their approach uses a data-set of 1260 malware samples collected in 2010, which has a smaller number of samples for each family than the data-set used by (Mercaldo et al., 2016a; Mercaldo et al., 2016c; Canfora et al., 2016). The researchers in (Feng et al., ) present a semantics-based approach (called Apposcopy) to identify Android malware. Apposcopy specifies semantic characteristics of malware families using signatures. Its signature matching algorithm uses a combination of static taint analysis and a new form of program representation called Inter-Component Call Graph to efficiently detect Android applications that have certain control- and data-flow properties. Like Dendroid (Suarez-Tangil et al., 2014), Apposcopy is evaluated on the Malgenome data-set. The authors in (Battista et al., 2016; Mercaldo et al., 2016b), using a model checking based approach, identify the malicious payload in repackaged Android applications. The logic rules define the malicious payload. As a preliminary evaluation the authors investigate only the DroidKungFu, Opfake and FakeInstaller families. Another behaviour-based approach is described in (Bose et al., 2008). Bose and his colleagues specify common malware behavior using temporal logic formulas. This approach is partially dynamic and it uses mobile viruses and worms targeting the Symbian OS. The frequencies of opcode n-grams are used in (Canfora et al., 2015) to identify Android malware families. The authors use a data-set composed of 5560 malware samples belonging to several different families. The results show an average accuracy of 97%.

3 PRELIMINARIES

In this section we introduce some preliminary con-

cepts related to the two techniques compared in this

work. In particular the ﬁrst one is a classiﬁcation real-

ized with a Machine Learning engine (Canfora et al.,

2016), while the second one is a formal technique that

uses the model checking for verifying the presence

of certain malicious behaviors in the code (Mercaldo

et al., 2016a; Mercaldo et al., 2016c). Both the meth-

ods are static.

3.1 Machine Learning

The classiﬁcation based on machine learning consists

of identifying some features to be extracted from a

source code that allow the distinction between mal-

ware and goodware. This process is made of two main

steps:

1. Training: in this phase the classiﬁer is built, by

applying algorithms of data mining. The engine

evaluates which are the features that better distin-

guish the two classes of objects, in this case mal-

ware and goodware. The learning could be su-

pervised or not supervised. The learning is super-

vised if the training data-set is labeled with the

name of the belonging class. The methodology

under analysis is based on the supervised learn-

ing, because it’s known in advance whether the

application is malicious or not.

2. Prediction: this is the phase of evaluation. The

aim of this step is to evaluate the effectiveness of

the classiﬁer constructed in the previous phase. At

this stage the capability of the classiﬁer to pre-

dict the class a data-set’s member belongs to is

assessed. When detecting malware this stage eval-

uates whether the classiﬁer can discriminate cor-

rectly a malware from a goodware.

It should be underlined that a good classification is possible only with an appropriate set of features. One of the main limits of a machine learning classifier is that its performance depends on how representative the training set is of the two classes.
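The two phases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nearest-centroid classifier stands in for the learning engine (it is not the classifier used by the methodology under analysis), and the feature vectors and labels are made up.

```python
import math


def train(samples):
    """Training phase: average the feature vectors of each labelled class.

    `samples` is a list of (feature_vector, label) pairs, i.e. a labelled
    (supervised) training set.
    """
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model, features):
    """Prediction phase: return the label whose centroid is closest."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical two-feature training set (values are invented).
training_set = [
    ([0.1, 0.2], "goodware"),
    ([0.2, 0.1], "goodware"),
    ([0.9, 0.8], "malware"),
    ([0.8, 0.9], "malware"),
]
model = train(training_set)
print(predict(model, [0.85, 0.9]))
```

If the training set contained no example resembling a new sample, the prediction would still return one of the known labels, which is the representativeness limit discussed above.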

ForSE 2017 - 1st International Workshop on FORmal methods for Security Engineering


3.2 Model Checking

Model Checking is a type of formal technique. For-

mal Methods are usually used to specify and verify

complex systems. Model checking technique requires

three steps: (i) to deﬁne the systems with a precise

notation; (ii) to specify the properties with a precise

notation; (iii) to verify the properties on the system

with a model checker tool.

Deﬁne the System

The system behavior is represented as an automaton, made of a set of nodes and a set of labeled edges. The nodes are the system states, while an edge represents a transition from a state to another state (precisely the next state). An edge labeled $a$ from $s$ to $s'$ means that the system can evolve from the state $s$ to the state $s'$ by performing the action $a$. This transition is written $s \xrightarrow{a} s'$. The initial state of the system is the root of the automaton. It is often convenient to algebraically represent

the automaton in the form of processes. Usually the

process algebras have been used as precise notation

to describe complex computer systems. The method-

ology under analysis uses Milner’s Calculus of Com-

municating Systems (CCS) (Milner, 1989) as process

algebra. In CCS process algebra the systems are rep-

resented through processes and actions, which respec-

tively correspond to states and transitions. For more

details on CCS the reader can refer to (Bruns, 1997;

Milner, 1989).
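As an illustration of the automaton described above, the following minimal Python sketch stores a finite labelled transition system as a set of states, a root, and labelled edges. The class name, the state names and the action names are hypothetical and do not come from the methodology under analysis.

```python
from collections import defaultdict


class LTS:
    """A finite labelled transition system: states, labelled edges, and a root."""

    def __init__(self, initial):
        self.initial = initial           # root of the automaton
        self.states = {initial}
        self.edges = defaultdict(set)    # state -> {(action, next_state)}

    def add_transition(self, source, action, target):
        """Record the transition source --action--> target."""
        self.edges[source].add((action, target))
        self.states.update((source, target))

    def successors(self, state, action=None):
        """Next states reachable from `state`, optionally via one given action."""
        return {
            target
            for (label, target) in self.edges.get(state, ())
            if action is None or label == action
        }


# Hypothetical three-state system: s0 --push--> s1 --invoke--> s2
lts = LTS("s0")
lts.add_transition("s0", "push", "s1")
lts.add_transition("s1", "invoke", "s2")
print(lts.successors("s0"))
```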

Specify the Properties

A property that a system should satisfy can be deﬁned

using a temporal logic. In temporal logics there are

constructs allowing to verify in a formal way that a

particular event will eventually happen or that a prop-

erty is veriﬁed in every state. The methodology under

analysis uses the logic named mu-calculus (Stirling,

1989).

Model Checker Tool

Finally, to verify the properties deﬁned in the tem-

poral logic, the Model Checker is applied to the sys-

tem (modeled as transition system). This is a tool that

takes two inputs: the system model and the property.

The output of the Model Checker is binary: it returns true if the property is verified on the model and false otherwise. The check is performed as an exhaustive state space search that is guaranteed to terminate since the model is finite. The methodology under

analysis uses as formal veriﬁcation environment the

Concurrency Workbench of New Century (CWB-NC)

(Cleaveland and Sims, 1996). While model checking

was originally developed to verify the correctness of

systems, recently it has been also proposed in other

ﬁelds such as clone detection (Santone, 2011), biol-

ogy (De Ruvo et al., 2015), secure information ﬂow

(De Francesco et al., 2003), and mobile computing

(Anastasi et al., 2001). In the last years, model check-

ing has been successfully applied also in the security

ﬁeld, as explained in the following sections.

4 THE TWO APPROACHES APPLIED

The paper compares the performance of two ap-

proaches, one based on Machine Learning, and the

other one based on Model Checking which are now

detailed. The Machine Learning based approach

is presented in (Canfora et al., 2016). The Model

Checking based approach is presented in (Mercaldo

et al., 2016a; Mercaldo et al., 2016c).

4.1 Machine Learning based Methodology (ML)

The methodology based on machine learning (Can-

fora et al., 2016) uses two different techniques to de-

tect Android malware families: the Hidden Markov

Model (HMM) (Rabiner, 1989; Annachhatre et al.,

2015) and the Structural Entropy (Baysa et al., 2013).

HMM models certain sequences of opcodes belong-

ing to the app under analysis that could characterize

the eventual malicious behavior. The structural en-

tropy evaluates the distribution of bytes in the phys-

ical ﬁle for characterizing it in terms of malware (or

goodware). These two techniques are used to extract four features ($f_1$, $f_2$, $f_3$, $f_4$). The first three capture the HMM while the last one evaluates the Structural Entropy. In detail the extracted features are: (i) $f_1$ is a HMM score with 3 hidden states; (ii) $f_2$ is a HMM score with 4 hidden states; (iii) $f_3$ is a HMM score with 5 hidden states; (iv) $f_4$ is a measure of the structural entropy.
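The text does not say how an HMM score is obtained from a trained model. One common choice, assumed here purely for illustration, is the log-likelihood of the opcode sequence under the model, computed with the scaled forward algorithm. The sketch below takes the model parameters as given (training is omitted), and all parameter values are made up.

```python
import math


def hmm_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of an observation sequence (scaled forward algorithm).

    obs   : list of symbol indices (e.g. opcode identifiers)
    start : start[i]    = probability of starting in hidden state i
    trans : trans[i][j] = probability of moving from state i to state j
    emit  : emit[i][k]  = probability that state i emits symbol k
    """
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")      # sequence impossible under this model
        log_like += math.log(scale)   # product of scales = sequence probability
        alpha = [a / scale for a in alpha]
    return log_like


# Hypothetical 2-state model over 2 symbols with uniform emissions.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [[0.5, 0.5], [0.5, 0.5]]
print(hmm_log_likelihood([0, 1, 1], start, trans, emit))
```

Dividing the result by the sequence length would give a per-opcode score that is comparable across applications of different size; whether the original work normalizes in this way is not stated in the text.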

The aim of the classification process is to establish whether the features correctly classify the malware family. In this approach six classification algorithms are used: J48, LadTree, NBTree, RandomForest, RandomTree and RepTree.

The HMM-based features are built from the sequence of instructions of the application. In particular the authors consider the sequence of opcodes in the smali code of the application (the Dalvik opcodes are listed at http://pallergabor.uw.hu/androidblog/dalvik_opcodes.html). Starting from the entry point of each application (i.e., the Main Activity), the authors reconstruct the sequence of opcodes of every called method, jumping to the instructions of the called method whenever there is an invoke instruction. This reconstruction ends when a class belonging to the Android framework is reached or when the recursion level is equal to 4. The HMM detector is trained on these sequences. The authors used a number N of hidden states equal to 3, 4 and 5, corresponding to the features $f_1$, $f_2$ and $f_3$.

Table 1: An example of logic rule.

$$
\begin{aligned}
\varphi &= \mu X.\ \langle \mathit{pushCOMMANDS} \rangle\, \varphi_1 \lor \langle -\mathit{pushCOMMANDS} \rangle\, X \\
\varphi_1 &= \mu X.\ \langle \mathit{pushCommands} \rangle\, \varphi_2 \lor \langle -\mathit{pushCommands} \rangle\, X \\
\varphi_2 &= \mu X.\ \langle \mathit{pushTgZzeroLHwIICkoa} \rangle\, \varphi_3 \lor \langle -\mathit{pushTgZzeroLHwIICkoa} \rangle\, X \\
\varphi_3 &= \mu X.\ \langle \mathit{pushACTIVATION} \rangle\, \varphi_4 \lor \langle -\mathit{pushACTIVATION} \rangle\, X \\
\varphi_4 &= \mu X.\ \langle \mathit{pushActivation} \rangle\, \varphi_5 \lor \langle -\mathit{pushActivation} \rangle\, X \\
\varphi_5 &= \mu X.\ \langle \mathit{pushTgOottoHBgYfBVoM} \rangle\, \mathit{tt} \lor \langle -\mathit{pushTgOottoHBgYfBVoM} \rangle\, X
\end{aligned}
$$

Regarding the Structural Entropy method, the authors estimate the structural entropy of the Android executable (.dex file). Starting from blocks of different size belonging to the .dex file, the method computes the Shannon entropy of each block. The wavelet transform is used to represent the segments of the file. Finally a similarity score, based on the Levenshtein distance, is computed by comparing the segments of two files. At the end of this process the feature $f_4$ is computed.
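Two building blocks of this process, the per-block Shannon entropy and the Levenshtein distance, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the wavelet-based segmentation is omitted, the block size is an arbitrary assumption, and the way the two pieces are combined into the final score is not specified here.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Shannon entropy of a block of bytes, in bits per byte (0.0 to 8.0)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def block_entropies(data: bytes, block_size: int):
    """Entropy of each consecutive block of `block_size` bytes."""
    return [
        shannon_entropy(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def levenshtein(a, b) -> int:
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        current = [i]
        for j, y in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


# Made-up byte string standing in for the content of a .dex file.
print(block_entropies(b"aaaaabab", 4))
```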

4.2 Model Checking based Methodology (MC)

The methodology based on model checking (Mer-

caldo et al., 2016a; Mercaldo et al., 2016c) is com-

posed of two main steps:

• to build the model by translating the Bytecode instructions into processes;

• to specify the properties related to malicious be-

haviours.

The formal model is written in CCS (Milner's Calculus of Communicating Systems (Milner, 1989)). The authors use a transformation function that translates every Bytecode instruction of the Android application into a CCS process. In particular, starting from the apk file of an application, the authors obtain the .class files through a reverse engineering process. Afterwards they use the Apache Commons Bytecode Engineering Library (BCEL, http://commons.apache.org/bcel/) to parse the Bytecode and translate every instruction into a CCS process. This process is automatic. At the end of the first step the formal model is built.

Following the model checking approach to formal verification, the authors specify some properties. The aim of step two is to investigate whether an application is malware and belongs to a particular Android family. To achieve this goal, the specified formulae catch a specific malicious behavior that is typical of the family and allows its characterization. The mu-calculus logic (Stirling, 1989), a branching temporal logic, is used to describe a given behavior of the app. Thus, after this second step, for every malware family there is a set of formulae able to catch a specific malicious behaviour. These temporal logic rules are obtained through a manual inspection of malware samples. Consequently, the specification of the properties is not automatic.

Table 1 shows the logic rule related to a malicious behaviour exhibited by a Plankton sample. The formula catches some commands of the Plankton botnet. This formula uses the least fixpoint $\mu Z.\phi$ of the recursive equation $Z = \phi$. $\mu Z$ binds free occurrences of $Z$ in $\phi$. An occurrence of $Z$ is free if it is not within the scope of a binder $\mu Z$.
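To make the least fixpoint concrete, the sketch below evaluates formulae of the shape used in Table 1, $\mu X.\ \langle a \rangle \varphi \lor \langle -a \rangle X$, on a small finite transition system by iterating from the empty set until nothing changes. It assumes that $\langle -a \rangle$ means "some action other than $a$"; the states and the transition list are invented for the example, and this is not the algorithm implemented by any specific model checker.

```python
def diamond(transitions, label_ok, targets):
    """States with a transition whose label passes `label_ok` into `targets`."""
    return {src for (src, label, dst) in transitions if label_ok(label) and dst in targets}


def eventually(transitions, action, phi_states):
    """Least fixpoint of X = <action> phi  or  <-action> X.

    Returns the states from which some path performs `action` and then
    reaches a state where phi holds (`phi_states`).
    """
    current = set()
    while True:
        updated = (
            diamond(transitions, lambda label: label == action, phi_states)
            | diamond(transitions, lambda label: label != action, current)
        )
        if updated == current:     # nothing new: the least fixpoint is reached
            return current
        current = updated


# Hypothetical model as (source, action, target) triples.
transitions = [
    ("s0", "load", "s1"),
    ("s1", "pushCommands", "s2"),
    ("s2", "pushActivation", "s3"),
    ("s4", "load", "s4"),
]
all_states = {"s0", "s1", "s2", "s3", "s4"}   # the states satisfying tt

# Nested rule: eventually pushCommands, and after it eventually pushActivation.
phi_2 = eventually(transitions, "pushActivation", all_states)
phi_1 = eventually(transitions, "pushCommands", phi_2)
print("s0" in phi_1, "s4" in phi_1)
```

The iteration terminates because the set can only grow and the number of states is finite. The rule holds for the model when its initial state belongs to the computed set.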

In this approach the Concurrency Workbench

of New Century (CWB-NC) (Cleaveland and Sims,

1996) is used as formal veriﬁcation environment. The

CWB-NC model checker takes as input the formal

CCS model (built in the ﬁrst step) and the tempo-

ral logic rules written in mu-calculus (speciﬁed in the

second step). The output of the model checker is binary: true if the property is verified on the model and false otherwise. The authors assume that a sample belongs to a particular family if the properties related to that family are verified on the model.

It is well-known that model checking typically suffers from the state explosion problem: it is mainly applicable to small-scale applications and does not scale up well. However, state explosion is not a real problem when verifying Android applications, since the produced CCS specifications do not have a large number of states and transitions.


Table 2: Number of samples used by the two methodologies.

Family               (Canfora et al., 2016)   (Mercaldo et al., 2016a)   (Mercaldo et al., 2016c)
FakeInstaller        925                      40                         60
DroidKungFu          667                      40                         60
Plankton             625                      625                        60
Opfake               613                      40                         60
GinMaster            339                      40                         60
BaseBridge           330                      330                        60
Kmin                 147                      40                         60
Geinimi              92                       0                          60
Adrd                 91                       0                          60
DroidDream           81                       0                          60
AnserverBot          0                        187                        0
DroidKungFu Update   0                        1                          0
Ransomware           672                      0                          683

5 THE COMPARISON

The experimentation aims at comparing the perfor-

mances of the two different methodologies. The per-

formances of classiﬁcation are measured with the

metrics recall, precision and accuracy that evaluate

the ability of the methods to correctly detect the fam-

ily a malware belongs to. The experimentation is car-

ried out on a real world data-set of Android applica-

tions.

5.1 Metrics

The performances of the methodology are evaluated

with the following metrics:

$$
PR = \frac{TP}{TP + FP}; \quad RC = \frac{TP}{TP + FN}; \quad Acc = \frac{TP + TN}{TP + FN + FP + TN} \tag{1}
$$

which are respectively the Precision (PR), the Recall (RC) and the Accuracy (Acc) formulae. The first two indicate the measures of exactness and correctness, since Precision tests the quality and Recall tests the quantity of the detection. The Accuracy evaluates the percentage of correct classifications with respect to the total number of examined samples. The variables in Equation 1 are the following: $TP$ (True Positives) is the number of malware programs that are correctly associated with the right family; $FP$ (False Positives) is the number of malware programs that are erroneously associated with a family; $FN$ (False Negatives) is the number of malware programs that are not associated with the family they belong to; and $TN$ (True Negatives) is the number of malware programs that do not belong to the considered families and that the classification does not associate with any family.

It should be underlined that the Precision value

strictly depends on the number of samples that is in-

correctly identiﬁed. A sample is not correctly identi-

ﬁed when the prediction of its family is wrong. The

Precision depends on the number of False Positives:

increasing the number of samples belonging to differ-

ent families could increase also the number of FP.

Even though the Accuracy formula includes the number of FP, it evaluates the number of correct classifications over the whole data-set. This makes Accuracy a more comparable measure between two data-sets of different size.
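The three metrics of Equation 1 can be computed directly from the four counts. The following minimal sketch uses made-up counts for a single family; returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) as defined in Equation 1."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Invented counts for one family: 90 correct assignments, 10 wrong
# assignments, 30 missed samples, 870 samples correctly left out.
precision, recall, accuracy = classification_metrics(tp=90, fp=10, fn=30, tn=870)
print(precision, recall, accuracy)
```

With these counts, adding samples of other families can only change FP and TN: Precision drops as soon as some of them are wrongly assigned to the family, while Accuracy is diluted by the larger total, which is the effect discussed above.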

5.2 Data-set

The two methodologies compared in this work use the following two data-sets in their experimentation: Drebin (Arp et al., 2014; Spreitzenbarth et al., 2013) and a collection of freely available Ransomware samples (672 from http://ransom.mobi/ and 11 from http://contagiominidump.blogspot.it/). In particular, the machine learning based methodology (Canfora et al., 2016) uses the ten most numerous families of the Drebin data-set and the collection of ransomware samples. The model checking based methodology in (Mercaldo et al., 2016a) uses the Android malware samples that implement the update attack and malware from other Drebin families. Plankton, AnserverBot, BaseBridge and DroidKungFu-Update are the families that implement the update attack. In (Mercaldo et al., 2016c) the authors use the ransomware samples and samples belonging to the ten most numerous families of the Drebin data-set. Table 2 shows the number of samples used by the two methodologies in their experimentation. As shown in Table 2, for the Plankton, BaseBridge and Ransomware families the two approaches use the same data-set. This is the reason why we compare the results achieved in recognizing these three families. In terms of correctness (Recall) the comparison is perfect. In terms of Precision and Accuracy the comparison is a bit different, since these metrics involve the number of FPs. As mentioned above, a False Positive is a sample not correctly identified, and when the number of samples belonging to different families increases, the number of FPs could increase as well. This difference between the data-sets is taken into account in our comparison.

The results used in this comparison are those achieved by the two methodologies in recognizing Plankton, BaseBridge and Ransomware samples. The families considered present the following malicious behaviours:

Plankton family: the samples belonging to this fam-

ily send sensitive data of the infected smartphone to

a remote server, like IMEI and browser history. They

use the class loading (a native functionality) to per-

form the malicious actions. Furthermore, Plankton

downloads unwanted advertisements and changes the

browser homepage or adds unwanted bookmarks to it.

BaseBridge family: the samples of this family run

an embedded payload located in an external folder.

They are able to receive premium numbers from re-

mote C&C servers and dial calls or send out SMS

messages to them, incurring fees for users.

Ransomware family: The main malicious aim of the

samples belonging to the Ransomware family is to

steal all personal data stored in the phone by encrypt-

ing all the ﬁles residing in the smartphone. Alterna-

tively the malware could lock the phone: in both the

cases the user is not able to access the smartphone, so

the ransomware asks for money in order to unlock the

phone.

5.3 Results

Table 3 shows the values of Recall obtained with the

two methodologies, where:

• Family column indicates the malware family the classification results refer to.
• Machine Learning based Approach column contains the values of Recall reached by the methodology based on the Machine Learning technique. In particular it is composed of five sub-columns:
  – Algorithm sub-column shows the algorithms used for the classification.
  – $f_1$ sub-column shows the Recall achieved by the feature $f_1$ with the six classification algorithms. The feature $f_1$ captures the HMM (Hidden Markov Model) with 3 hidden states.
  – $f_2$ sub-column shows the Recall achieved by the feature $f_2$, which captures the HMM with 4 hidden states.
  – $f_3$ sub-column shows the Recall obtained by the feature $f_3$, which captures the HMM with 5 hidden states.
  – $f_4$ sub-column shows the Recall obtained with the feature $f_4$, which measures the Structural Entropy of the byte distribution.
• Model Checking based Approach column contains the values of Recall reached by the methodology that applies the Model Checking technique.

In our comparison, the results show that the methodology based on Model Checking outperforms all the other techniques, i.e. it reaches the best values of correctness in the classification. Regarding the methodology based on Machine Learning, the best performance is produced by the feature $f_4$, which also shows the smallest variability of Recall among the six classification algorithms. This means that the feature $f_4$ is able to correctly identify the family the malware belongs to. The other three features are very sensitive to the algorithm used: their values of Recall vary widely.

The histogram in Figure 1 shows the Precision achieved by the two methodologies, grouped by malware family. For the Machine Learning based approach we report in the graph only the maximum value of Precision obtained for each of the considered features ($f_1$, $f_2$, $f_3$ and $f_4$), as this value represents the best performance that can be reached with a specific pair (feature, classification algorithm).

The results highlight that the Precision reached by the Model Checking based approach is greater than the values achieved by the other approach. Unfortunately, here the comparison between the precision of the two methods is only an indication of the real differences in exactness, as previously discussed.

With regard to Accuracy in Figure 1, Model Checking outperforms Machine Learning for all the data-sets, with the only exception of the BaseBridge family, where the performances are equal.

5.4 Discussion

Hereafter we refer to the two approaches with the acronyms ML and MC, standing respectively for the Machine Learning approach and the Model Checking approach. The experimentation allowed us to characterize the pros and cons of the two approaches.


[Figure: two histograms, Precision and Accuracy, each grouped by malware family (Plankton, BaseBridge, Ransomware); the MCK bars refer to Model Checking.]

Figure 1: Precision and Accuracy values Comparison. The values of $f_1$, $f_2$, $f_3$, and $f_4$ (Precision Histogram) are the maximum Precision values achieved by the Machine Learning based approach. The maximum is selected among the six different classification algorithms. MCK (Model Checking) indicates the value of Precision and Accuracy reached by the methodology based on Model Checking.

Strengths and Advantages

ML is a completely automatic approach. It reaches a good Recall in malware family identification, especially using the feature $f_4$ (as shown in Table 3). The authors in (Canfora et al., 2016) perform a very large experimentation in order to validate their approach. ML has a very low execution time, especially when the samples are analyzed using the feature $f_4$ (Structural Entropy): the average CPU time required is 3.85 seconds on a personal computer with the following profile: Intel Core i5 desktop with 4 gigabytes of RAM, running Linux Mint 15. The other three features $f_1$, $f_2$, and $f_3$ are effective for discriminating malware from goodware, but they do not perform well in recognizing the family a malware belongs to. In fact, their Precision and Recall in malware identification are always greater than 93% (as shown in (Canfora et al., 2016)). In light of the above, ML obtains good results in malware family classification when using the feature $f_4$, with a low execution time, while it achieves very good correctness and exactness in malware detection using the other three features.

Conversely, MC reaches high levels of correctness (i.e. Recall) in malware family identification (as shown in Table 3). Furthermore, since it is based on a formal method, it is a very rigorous approach and it is able to identify the exact location of the malicious payload in the malware code. This is made possible by the specified formulae, which describe the malicious behaviour to be found within the malware. In particular, MC points out the method where the payload is located. MC does not require a training set, but for each family a set of samples must be manually inspected to extract the formulae representing the malicious behavior. In the worst case the largest set counted 20 samples, which is small compared with the average size of the training sets used in ML. Another advantage of MC is that it also works when the malware is obfuscated, as shown in (Mercaldo et al., 2016a; Mercaldo et al., 2016c). This is possible because MC is behaviour-based, and trivial transformations of the code do not change its normal behaviour. For example, when an attacker inserts some unconditional jumps in the code (code reordering), only the form of the code changes, while its normal execution flow is preserved. MC is not pattern matching: it looks for the malicious behaviour in the form of the malicious actions performed. Thus, MC is resilient to code obfuscation. We cannot say anything about the robustness of ML to code obfuscation, since the authors in (Canfora et al., 2016) do not provide any example.

Weaknesses and Disadvantages

The ML approach produced values of correctness that are smaller than those obtained with MC, even if the $f_4$ feature showed performances that are close to those of MC. A further weakness of ML is the


Table 3: Comparison of the Recall Results.

                             Machine Learning Based Approach      Model Checking
Family       Algorithm       $f_1$   $f_2$   $f_3$   $f_4$        Based Approach
Plankton     J48             0.608   0.698   0.696   0.694
             LadTree         0.608   0.698   0.696   0.694
             NBTree          0.202   0.178   0.211   0.694
             RandomForest    0.667   0.675   0.674   0.694
             RandomTree      0.683   0.681   0.68    0.694
             RepTree         0.606   0.604   0.609   0.694
BaseBridge   J48             0.727   0.73    0.741   0.799        0.98
             LadTree         0.024   0.018   0.015   0.841
             NBTree          0.211   0.214   0.224   0.59
             RandomForest    0.769   0.775   0.793   0.841
             RandomTree      0.771   0.775   0.783   0.841
             RepTree         0.629   0.637   0.628   0.778
Ransomware   J48             0.766   0.704   0.72    0.896        0.99
             LadTree         0.545   0.655   0.654   0.879
             NBTree          0.602   0.589   0.702   0.89
             RandomForest    0.654   0.608   0.714   0.902
             RandomTree      0.712   0.711   0.743   0.935
             RepTree         0.612   0.672   0.637   0.872

Table 4: The Two Methodologies in comparison.

                             ML                           MC
Advantages & Strengths       Completely Automatic         High Correctness
                             Low Execution Time           Payload Localization
                             Exhaustive Experimentation   Very Small Training Set
Disadvantages & Weaknesses   Not High Correctness         High Execution Time
                             Big Training Set             Analyst Involvement
                             No Payload Localization      Small Experimentation

large cardinality required for the training set, which forces the malware analyst to have a considerable volume of samples to hand to the machine learning engine. In fact, in the ML experimentation the authors used a training set containing 80% of the collected samples. Furthermore, ML does not provide any information about the payload and its localization: it only classifies a malware sample into its family. The execution time to analyze a new sample using the features f, f and f is on average greater than 10 minutes, which cannot be considered convenient. Finally, these features achieve very good values of Precision and Recall only in malware detection, while in family identification their average values of exactness and correctness never exceed 75%, as shown in Figure 1.

MC is not completely automatic, since the involvement of an analyst is necessary to specify the formulae. The execution time of MC is high: on average, 60 seconds to check an application. This time was measured on a personal computer with the following computational profile: Intel Core i7 with 2 gigabytes of RAM, running Linux Ubuntu 15. The execution time is strictly dependent on the number of formulae used, the time to build the automaton and the verification time. The time to build the automaton depends on the number of states and the number of transitions. These two numbers are determined by the complexity of the application’s code, i.e. the number of bytecode instructions, if statements and cycles.
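How the code structure drives the size of the automaton can be sketched with a toy model. The snippet below is not the translation used by the MC approach: it assumes a hypothetical instruction encoding with three kinds of instruction (`op`, `if`, `goto`) and builds a simple labelled transition system with one state per instruction plus a final state.

```python
# Toy model only. Hypothetical encoding of a method body:
#   ("op", name)     ordinary instruction, falls through to the next one
#   ("if", target)   conditional branch: jump to `target` or fall through
#   ("goto", target) unconditional jump to `target`

def build_automaton(instructions):
    """Return (states, transitions) of a simple labelled transition system.

    State i means "about to execute instruction i"; the extra state
    len(instructions) means "the method has finished".
    """
    states = set(range(len(instructions) + 1))
    transitions = set()
    for i, (kind, arg) in enumerate(instructions):
        if kind == "op":
            transitions.add((i, arg, i + 1))
        elif kind == "goto":
            transitions.add((i, "goto", arg))
        elif kind == "if":
            transitions.add((i, "if_true", arg))
            transitions.add((i, "if_false", i + 1))
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return states, transitions


straight = [("op", "load"), ("op", "invoke"), ("op", "return")]
branchy = [("op", "load"), ("if", 3), ("op", "invoke"), ("op", "return")]
looping = [("op", "load"), ("if", 4), ("op", "invoke"), ("goto", 0),
           ("op", "return")]

for name, body in [("straight", straight), ("branchy", branchy),
                   ("looping", looping)]:
    states, transitions = build_automaton(body)
    print(name, len(states), len(transitions))
```

In this model every instruction adds a state, and every `if` adds an extra transition, so longer methods with more branches and cycles yield larger automata to build and verify.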

The number of formulae used is proportional to the number of different malicious behaviours that must be caught in the code. Finally, it is worth considering that the experimentation performed by the authors in (Mercaldo et al., 2016a; Mercaldo et al., 2016c) is run on a data-set that is much larger than the one used for the validation of MC. This hinders the comparison of the precision of the two approaches, and gives only an indication of how different their performance is. For this reason we also computed the Accuracy of the two approaches, which provides a more reliable comparison. The accuracy histogram in Figure 1 shows the Accuracy values achieved by ML and MC. It should be underlined that false positives (FPs) are also involved in the formula of Accuracy. Table 4 summarizes the advantages/strengths and disadvantages/weaknesses of the two methodologies, ML and MC.

To conclude, the two approaches exhibit several advantages and disadvantages, and the malware analyst can choose the right trade-off according to the demands. If the analyst wants to locate the payload exactly within the malware code, or wishes a high value of correctness in family identification, MC should be used. However, this approach requires a greater computational time than ML. Instead, if the analyst is interested in achieving a high correctness in family identification, is not looking for the payload location, and gives efficiency a higher priority than effectiveness, the choice should fall on ML with the f feature. Finally, if the analyst wants to achieve a high correctness in malware detection, ML should be employed, using the f, f or f features. Unfortunately, this will require a longer execution time.
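The remark that FPs also enter the Accuracy formula can be made concrete with a short sketch. It assumes the standard definitions of Precision, Recall and Accuracy in terms of true/false positives and negatives; the counts used below are invented for illustration and are not taken from either experimentation.

```python
# Standard metric definitions (assumed); denominators must be non-zero.

def precision(tp, fp):
    """Fraction of samples assigned to the family that really belong to it."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of the family's samples that were assigned to it."""
    return tp / (tp + fn)


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Invented counts: same true positives and false negatives,
# but a different number of false positives.
few_fp = dict(tp=90, tn=100, fp=0, fn=10)
many_fp = dict(tp=90, tn=80, fp=20, fn=10)

for name, c in [("few FPs", few_fp), ("many FPs", many_fp)]:
    print(name,
          round(precision(c["tp"], c["fp"]), 3),
          round(recall(c["tp"], c["fn"]), 3),
          round(accuracy(**c), 3))
```

With these counts Recall is identical in the two cases, while Precision and Accuracy both drop as the false positives grow, which is why Accuracy complements the Recall values of Table 3.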

6 CONCLUSIONS

Recognizing malware families (Zhou and Jiang, 2012) is a primary goal of malware analysts, and several approaches have been developed to address this issue. In this work we have compared two static approaches: the first is a Machine Learning based approach, while the second is a Model Checking based approach. We have investigated the strengths and weaknesses of the two approaches. As future work, we want to compare them with dynamic techniques in order to obtain a clearer and wider picture.

REFERENCES

Alam, S., Riley, R., Sogukpinar, I., and Carkaci, N. (2016). Droidclone: Detecting Android malware variants by exposing code clones. In 2016 Sixth International Conference on Digital Information and Communication Technology and its Applications (DICTAP), pages 79–84.

Anastasi, G., Bartoli, A., De Francesco, N., and Santone, A. (2001). Efficient verification of a multicast protocol for mobile computing. Computer Journal, 44(1):21–30.

Annachhatre, C., Austin, T. H., and Stamp, M. (2015). Hidden Markov models for malware classification. Journal of Computer Virology and Hacking Techniques, 11(2):59–73.

Arp, D., Spreitzenbarth, M., Huebner, M., Gascon, H., and Rieck, K. (2014). Drebin: Efficient and explainable detection of Android malware in your pocket. In Proceedings of 21st Annual Network and Distributed System Security Symposium (NDSS).

Battista, P., Mercaldo, F., Nardone, V., Santone, A., and Visaggio, C. A. (2016). Identification of Android malware families with model checking. In Proceedings of the 2nd International Conference on Information Systems Security and Privacy - Volume 1: ICISSP, pages 542–547.

Baysa, D., Low, R. M., and Stamp, M. (2013). Structural entropy and metamorphic malware. Journal of Computer Virology and Hacking Techniques, 9(4):179–192.

Bose, A., Hu, X., Shin, K. G., and Park, T. (2008). Behavioral detection of malware on mobile handsets. In Proceedings of the 6th International Conference on Mobile Systems, Applications, and Services, MobiSys ’08, pages 225–238, New York, NY, USA. ACM.

Bruns, G. (1997). Distributed Systems Analysis with CCS. Prentice-Hall.

Canfora, G., Lorenzo, A. D., Medvet, E., Mercaldo, F., and Visaggio, C. A. (2015). Effectiveness of opcode ngrams for detection of multi family Android malware. In Proceedings of the 2015 10th International Conference on Availability, Reliability and Security, ARES ’15, pages 333–340, Washington, DC, USA. IEEE Computer Society.

Canfora, G., Mercaldo, F., and Visaggio, C. A. (2016). An HMM and structural entropy based detector for Android malware. Comput. Secur., 61(C):1–18.

Cleaveland, R. and Sims, S. (1996). The NCSU concurrency workbench. In CAV. Springer.

De Francesco, N., Santone, A., and Tesei, L. (2003). Abstract interpretation and model checking for checking secure information flow in concurrent systems. Fundamenta Informaticae, 54(2-3):195–211.

De Ruvo, G., Nardone, V., Santone, A., Ceccarelli, M., and Cerulo, L. (2015). Infer gene regulatory networks from time series data with probabilistic model checking. pages 26–32.

Faruki, P., Laxmi, V., Bharmal, A., Gaur, M., and Ganmoor, V. (2015). Androsimilar: Robust signature for detecting variants of Android malware. Journal of Information Security and Applications, 22:66–80. Special Issue on Security of Information and Networks.

Feng, Y., Anand, S., Dillig, I., and Aiken, A. Apposcopy: Semantics-based detection of Android malware through static analysis.

Mercaldo, F., Nardone, V., Santone, A., and Visaggio, C. A. (2016a). Download malware? No, thanks: How formal methods can block update attacks. In Proceedings of the 4th FME Workshop on Formal Methods in Software Engineering, FormaliSE ’16, pages 22–28, New York, NY, USA. ACM.

Mercaldo, F., Nardone, V., Santone, A., and Visaggio, C. A. (2016b). Hey malware, I can find you! In 2016 IEEE 25th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE), pages 261–262.

Mercaldo, F., Nardone, V., Santone, A., and Visaggio, C. A. (2016c). Ransomware Steals Your Phone. Formal Methods Rescue It, pages 212–221. Springer International Publishing, Cham.

Milner, R. (1989). Communication and concurrency. PHI Series in computer science. Prentice Hall.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286.

Santone, A. (2011). Clone detection through process algebras and Java bytecode. pages 73–74.


Spreitzenbarth, M., Echtler, F., Schreck, T., Freiling, F. C., and Hoffmann, J. (2013). Mobilesandbox: Looking deeper into Android applications. In 28th International ACM Symposium on Applied Computing (SAC).

Stirling, C. (1989). An introduction to modal and temporal logics for CCS. In Concurrency: Theory, Language, And Architecture, pages 2–20.

Suarez-Tangil, G., Tapiador, J. E., Peris-Lopez, P., and Blasco, J. (2014). Dendroid: A text mining approach to analyzing and classifying code structures in Android malware families. Expert Syst. Appl., 41(4):1104–1117.

Zhang, M., Duan, Y., Yin, H., and Zhao, Z. (2014). Semantics-aware Android malware classification using weighted contextual API dependency graphs. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS ’14, pages 1105–1116, New York, NY, USA. ACM.

Zhou, Y. and Jiang, X. (2012). Dissecting Android malware: Characterization and evolution. In 2012 IEEE Symposium on Security and Privacy, pages 95–109.

ForSE 2017 - 1st International Workshop on FORmal methods for Security Engineering
