Behavior-based Malware Analysis using Proﬁle Hidden Markov Models

Saradha Ravi, N. Balakrishnan and Bharath Venkatesh

SERC, Indian Institute of Science, Bangalore, India

Keywords:

Behavior-based Malware Analysis, Proﬁle Hidden Markov Model, Multiple Sequence Alignment, Malware

Classiﬁcation, Metamorphic Viruses, Clustering.

Abstract:

In the area of malware analysis, static binary analysis techniques are becoming increasingly difﬁcult with

the code obfuscation methods and code packing employed when writing the malware. The behavior-based

analysis techniques are being used in large malware analysis systems because of this reason. In these dynamic

analysis systems, the malware samples are executed and monitored in a controlled environment using tools

such as CWSandbox(Willems et al., 2007). In previous works, a number of clustering and classiﬁcation

techniques from machine learning and data mining have been used to classify the malwares into families and

to identify even new malware families, from the behavior reports. In our work, we propose to use the Proﬁle

Hidden Markov Model to classify the malware ﬁles into families or groups based on their behavior on the host

system. PHMM has been used extensively in the area of bioinformatics to search for similar protein and DNA

sequences in a large database. We see that using this particular model will help us overcome the hurdle posed

by polymorphism that is common in malware today. We show that the classiﬁcation accuracy is high and

comparable with the state-of-art-methods, even when using very few training samples for building models.

The experiments were on a dataset with 24 families initially, and later using a larger dataset with close to 400

different families of malware. A fast clustering method to group malware with similar behaviour following

the scoring on the PHMM proﬁle database was used for the large dataset. We have presented the challenges in

the evaluation methods and metrics of clustering on large number of malware ﬁles and show the effectiveness

of using proﬁle hidden model models for known malware families.

1 INTRODUCTION

With the pervasive use of the internet, a lot of ma-

licious programs or malware that spread across it,

pose a serious threat to the security of the systems.

There are a variety of malware that are in the forms

of viruses, trojans, worms and botnets. The malware

uses the software vulnerabilities and other social en-

gineering techniques to trick the users into running

them, so that they can spread.The anti-malware com-

panies receive thousands of sample ﬁles everyday, for

analysis. These ﬁles are usually the ones that the user

ﬁnds suspicious in his/her system, and whose func-

tionality they are not sure of. It is important that we

understand the operations that a particular malicious

program does, in order to asses the seriousness of the

threat and its malware family. We also notice the fact

that, new ﬁles that are uploaded are usually the vari-

ants of some malicious program that is already exist-

ing. Though the action of the malware would be very

similar, the signatures of the ﬁles may not match, even

for very close variants. This makes the signature-

based techniques fail for the metamorphic variants of

the malware. Static malware analysis focuses on clas-

sifying malware based on features directly extracted

from malware sample whereas dynamic analysis does

it based on the behavioral patterns of the malware.

Due to the increasing need, many automated mal-

ware analysis systems such as CWSandbox(Willems

et al., 2007), Anubis (Bayer et al., ) are being used.

In these systems, the uploaded malware programs are

executed in a controlled environment. The operations

performed by the program, characterized by the sys-

tem calls and their arguments, or the registry changes

are monitored carefully and logged. From these ex-

ecution logs, the reports are generated and the ana-

lysts go through them to assess the threat. But when

there are large number of programs, analysing them

manually becomes a tough job. It would be easy

if there is an automated way of classify a malware

into already existing families or report a new behav-

ior proﬁle. A number of classiﬁcation and cluster-

ing techniques from machine learning are used on the

behavior reports(Rieck et al., 2011)(Lee and Mody,

195

Ravi S., Balakrishnan N. and Venkatesh B..

Behavior-based Malware Analysis using Proﬁle Hidden Markov Models.

DOI: 10.5220/0004528201950206

In Proceedings of the 10th International Conference on Security and Cryptography (SECRYPT-2013), pages 195-206

ISBN: 978-989-8565-73-0

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

2006)(Bayer et al., 2010)(Apel et al., 2009). Once the

malware programs are grouped, we can then see what

the common behavior of that particular family of mal-

ware is and use them later for proactively mitigating

that threat beforehand in anti-malware software. In

our work, we address the problem of malware classiﬁ-

cation and also clustering (based on similar behavior)

using proﬁle Hidden Markov Models.

2 RELATED WORK

In this section, we will look at some of the related

work done in the behavior-based malware analysis

and classiﬁcation. The dynamic analysis techniques

gained prominence because of the limitations in the

static analysis techniques (Moser et al., 2007).Moser

et al proposed a method where the normal model of

programs were modelled using sequences of six sys-

tem calls and any deviations from this was ﬂagged

as anomaly or a security threat.This was one of the

ﬁrst approachesof using behaviorto differentiate mal-

ware from benign programs. Bailey et al(Bailey et al.,

2007) tracked more abstract features like system state

changes rather than system call sequences for mal-

ware classiﬁcation.

Different distance measures are used to ﬁnd the

similarity within the ﬁles of the same malware family.

Some of them are appropriate for the given problem

whereas some are not, particularly when the order of

the activities in the behavior isn’t taken into consid-

eration. Lee et al propose a malware clustering ap-

proach where a modiﬁed Levenshtein distance is used

and a k-medoid partition clustering is performed(Lee

and Mody, 2006). The complexity of computing dis-

tances between malware in their method is quadratic

in the number of system calls and so expensive.

In more recent work(Bayer et al., 2010), Bayer et

al have employed faster approximate nearest neighbor

search using Locality Sensitive Hashing for compari-

son of the analysis reports with known behavior pro-

ﬁles that they have created (using data tainting meth-

ods to track system call dependencies). The behavior

reports are then clustered using hierarchical clustering

algorithm. Comparing the clusters to the true malware

clusters gave them 0.98 and 0.93, precision and recall

values.

The automatic classiﬁcation system given by

Rieck et al was used to identify novel families of

unseen malware using clustering and assign new in-

stances of malware to these families by classiﬁcation

using SVMs(Rieck et al., 2008). In this method proto-

types for each class of malware is generated and even-

tually used in the hierarchical clustering of the mal-

ware reports. The experiments for this work are con-

ducted on a larger dataset with close to 33000 reports

and a detailed study of resource utilization is also

done.Their malheur implementation gave F-scores,

around 0.95 for the clusters and 0.97 for the classi-

ﬁcation. In their previous work(Rieck et al., 2008),

the classiﬁcation of malware using support vector ma-

chines is elaborated and the discriminative features in

behavior reports are analysed to explain classiﬁcation

decisions. The authors also propose a new represen-

tation for the monitored behavior of malware(Trinius

et al., 2010). This representation is optimised to be ef-

ﬁcient when applying machine learning and data min-

ing techniques.

Wagener et al (Wagener et al., 2008) propose a

dynamic analysis method where they couple a se-

quence alignment method to compute the similarities

and leverage the Hellinger distances. They also show

how the use of phylogenetic tree improves their clas-

siﬁcation method. The different distance measures

used when clustering similar malware behavior are

examined in a work by Apel eta al(Apel et al., 2009).

Their ﬁnding is that the Manhattan distance or some

similarity coefﬁcient used on 3-grams of the report

contents, stored in tries or generalized sufﬁx trees,

work the best.

To detect similarity in workloads from NFS traces

for storage systems, Neeraja et al (Yadwadkar et al.,

2010) have applied the PHMM on the opcode se-

quences of the NFS traces. They also observe that

very few training sequences for a particular type of

workload, was enough for modeling. In another work

(Attaluri et al., 2009), the proﬁle HMM had been ap-

plied for x86 opcode sequences of the polymorphic

malware binaries generated by the commonly avail-

able virus kits. But they observe that the method

works for some families better than the others because

of the problems like subroutine permutation and the

code reordering.

3 PROPOSED APPROACH

3.1 Our Contribution

In this paper , it is shown that polymorphic malware

are better detected when we look at their behavior,

where we expect a certain common sequence of ac-

tions to be preserved, in spite of obfuscation in the

code. We choose PHMM mainly because it intuitively

ﬁtted the kind of sequence search problem, which we

have in classifying malware behavior. The initial ex-

periments are done on a fairly diverse dataset that has

close to 24 families of malware and we see that the re-

SECRYPT2013-InternationalConferenceonSecurityandCryptography

196

sults are quite promising. The F-scores for most of the

classes considered,(including polymorphic families)

are above 0.96. This way, we show that the method

is comparable to some of the best of the techniques

used for this problem. Later we extend the exper-

iments on a larger and more varied dataset of mal-

ware infected ﬁles, which poses more challenges to

the analysis and grouping of similar ﬁles. The chal-

lenges are explained and the results of using PHMM

models on the dataset is also presented.

The Proﬁle Hidden Markov Model is a probabilis-

tic approach that developed specially for modeling se-

quence similarity occurring in biological sequences

such as proteins and DNA(Eddy, 1998)(Durbin et al.,

1998). It is also a faster alternative to the traditional

deterministic approaches used in sequence matching

(Durbin et al., 1998). It is a modiﬁed implementa-

tion of HMM, which is basically a generative model

and constructs a probabilistic ﬁnite state machines.

For behavior-based analysis, we again assume that

there is a sequence of operations common for a virus

family and for a presented new sequence we would

like to ﬁnd the best known match from the database.

Our approach to the classiﬁcation problem employing

PHMM employs the following steps:

1. The behavior reports (which are XML ﬁles)

obtained from the dynamic analysis tool such as

CWSandbox(currently called the GFISandbox), can

be encoded using a more simpler representation such

as the MIST(Trinius et al., 2010). The MIST format

that we chose for experiments can be processed at dif-

ferent levels considering how much of system call ar-

gument information we look at. Refer to Fig. 4 for a

sample MIST encoding. We can also directly encode

every unique type of a system call to a particular al-

phabet in the range (A-T) and eventually the behavior

report looks like a protein sequence.

2. A small number of such sequences belonging to a

known malware family is givento a multiple sequence

alignment module to get an alignment ﬁle.

3. The multiple alignment ﬁle for a malware family is

used for constructing a proﬁle hidden markov model

for that family. Many such HMM proﬁles can be com-

bined to create a malware proﬁle database.

4. When a new malware ﬁle is given, it is again en-

coded as a sequence and searched for in the malware

proﬁle database. The proﬁle HMM gives a score for

the most similar malware families for that new se-

quence.The one with the highest score is taken as the

malware class prediction.

3.2 Methodology

The main reason for us to choose this approach for

solving the problem of ﬁnding malware similarity is

because the behavior of malware program has vari-

ablility, yet has a characteristic signature reﬂected in

the sequence of system calls. For example if we look

at the CWSandbox reports for two malware programs

from same family, we notice that a sequence of ma-

licious actions is preserved, interspersed with some

other actions introduced to confuse the malware de-

tection system.

A hidden markov model (HMM) is very suitable

for probabilistic modeling of such sequences, which

is evident from past works. Thus it can be used for

modeling different classes of malware. But as we

have discussed above , there might be additions, dele-

tions or changes to the system calls for different pro-

grams within same malware family. The proﬁle HMM

is exactly designed to model this kind of problem,

because it also has non-emitting states or the delete

states. We would now outline the concept of proﬁle

HMM before we proceed to show how it has been

used in our work.

3.3 Hidden Markov Models

A hidden markov model(HMM) is a statistical tool

which captures the features of one or more sequences

of observable symbols by constructing a probabilis-

tic ﬁnite state machine with some hidden states that

are emitting the observed symbols (Rabiner, 1989).

When the state machine is trained, its graph and the

transition probabilities are computed such that they

best produce the training sequences. When we test

with a new sequence, the HMM gives a score for how

best the sequence matched with the known state ma-

chine. In our case, the observed symbols are the codes

for each unique system call in the behavior report of

the malware program(MIST codes).

An HMM is speciﬁed by the following parame-

ters.

• the alphabet of symbols Σ

• the hidden state set Z

• the emission probability matrix E

|Z|x|Σ|

• the state transmission matrix A

|Z|x|Z|

• the initial state distribution π

Thus the HMM λ can be written as λ = (Σ,Z,A,E,π).

This model can thus be used to assign a probability to

an observed sequence X as follows

P(X|λ) =

∑

∏

k+1

(1)

This probability as indicated by the formula, is that of

emitting the observation sequence X after all possible

Behavior-basedMalwareAnalysisusingProfileHiddenMarkovModels

197

state transitions (i.e state transmission sequences). of

the model λ.

The model λ has to be learnt from training data

consisting of independent and identically distributed

sequences. This can be done by maximizing the

probability P(T|λ) where T is a training sequence.

There is no analytical solution to this, however this

can be done by using an iterative procedure that uses

E-M (Expectation-Maximization) algorithm(Rabiner,

1989).

Given a sequence X, the Viterbi algo-

rithm(Rabiner, 1989) can be used to compute

the hidden state Z, so as to maximise P(Z|X) i.e

determine most probable sequence of hidden states

that produced the observed sequence. Equation 1 can

then be evaluated using the likelihood and P(X) got

using the forward and backward procedures (Rabiner,

1989).

3.4 Proﬁle Hidden Markov Model

A PHMM is a speciﬁc formulation of a standard

HMM that makes explicit use of positional informa-

tion contained in the observation sequences(Attaluri

et al., 2009). PHMM is a strongly linear left-right

model while HMM is not(Eddy, 1998). A PHMM

model allows null transitions, so that it can match se-

quences that differ by point insertions and deletions

happening by chance mutations. They were speciﬁ-

cally formulated for use in bioinformatics,where such

insertions and deletions to DNA sequences were natu-

ral during evolution. Thus PHMMs can be seen effec-

tive in modeling metamorphic malware, that also go

through similar kind of evolution, both at binary level

and at a behavioral level. Furthermore, PHMM state

transition matrices are essentialy sparser than those of

HMM, allowing quicker inference.

A central concept to note here is that of sequence

alignment. In DNA sequencing, multiple gene se-

quences which are signiﬁcantly related are aligned.

The alignment can be used to ascertain if the gene

sequences where diverging out from some common

ancestor. Now, for an unknown sequence, this mul-

tiple sequence alignment of a proﬁle, can be used to

determine if the sequence is related to it or not.

A pairwise alignment of two sequences yields a

pair of sequences of equal length that captures the

difference between the two original sequences by in-

serting ‘-’ or gaps. The global alignment is an align-

ment such that the matches are maximised and the

insertions/deletions are minimised(Needleman et al.,

1970). The local alignment problem tries to locate

two longest subsequences from each sequence, such

that they are similar. This can be extended to align

Figure 1: The transition structure of a proﬁle HMM. For

example, from an insert state (diamond), we can go to the

next delete state (circle), continue in the insert state (self

loop) or go to the next match state (rectangle). Note that

while multiple sequential deletions are possible by follow-

ing the circle states, each with a different probability, mul-

tiple sequential insertions are only possible with the same

probability(Yadwadkar et al., 2010).

Figure 2: A sample MSA ﬁle for EJIK Malware Sequences.

multiple sequences. This multiple sequence align-

ment represents a family of similar sequences, where

some subsequences are conserved in all. While efﬁ-

cient dynamic programming based solutions exist to

pair alignment, multiple alignment scales as O(n

) in

both time and space. This makes it prohibitively ex-

pensive for implementation.

MUSCLE is a freely available program used com-

monly for MSA. It uses fast distance estimation using

k-mer counting, a progressive alignment using a new

proﬁle function, and reﬁnement using tree dependent

restricted partitioning method(Edgar,2004). We have

used MUSCLE for generating the MSA ﬁles in the

*.afa format. The MSA step essentially serves as a

training phase where we align sequences of selected

few malware reports in each class, in our approach

to using PHMM. The Viterbi algorithm, forward-

backward procedure and Expectation-Maximization

are naturally extended to PHMMs. In PHMM, the

emission probabilities are position dependent unlike

in standard HMM. Learning a proﬁle HMM from data

involves computing the emission probability matrix E

and the state transition probability matrix A using the

multiple sequence alignment data. These are given by

∑

(2)

∑

(3)

SECRYPT2013-InternationalConferenceonSecurityandCryptography

198

Where N

represents the number of transitions

from the state u to v and N

, the number of emis-

sions of t given a state u.(Durbin et al., 1998) After

the model λ has been learnt from the training multiple

alignment data, the problem of identifying the family

that a new sequence X belongs to, is decided by the

rule

y(X) = argmax

P(X|λ

) (4)

HMMER(Eddy, 2003) is an open source imple-

mentation of PHMM and its architecture gives ﬂex-

ibility in deciding between local and global align-

ments. It is a very powerful tool and can used to

perform operations like building HMM proﬁles from

MSA, compressing a HMM proﬁle database for efﬁ-

ciency and for searching the most matched proﬁle for

a new sequence. We have used hmmer for building

HMM proﬁles for all malware families and for search-

ing the ‘best suited’ proﬁle for new sequences, that are

essentially the malware reports in the test dataset.

4 INITIAL EXPERIMENTS

The initial experiments are conducted on the pub-

licly available dataset that comprises of behavior re-

ports generated by CWSandbox, for nearly 3130 mal-

ware binaries collected over three years from many

sources(Trinius, 2009). The malware ﬁles in this

dataset were annotated by choosing the majority of

the labels given independently by six different anti-

virus products. Each malware family has a number of

ﬁles ranging from 30 to 300. The details of the ref-

erence dataset that we have used for our experiments

is shown in Figure 3. The reports are available in the

MIST format too. For our experiments, we consider

only the MIST Level 0 in the reports.That is, we look

at only the system call type and not the argument val-

ues. This level is actually sufﬁcient for discrimination

of various classes of malware.

The MIST Level 0 reports have close to eighty ﬁve

different mist codes or system call operations, out of

which some 20 operations are very frequent. A sam-

ple MIST instruction for a CWSandox reresentation

is shown in Fig 4. Now we map every category op-

eration code with a unique alphabet in the range [A-

T]. The remaining category operation codes are also

mapped to alphabets in the accepted range.This fa-

cilitates the sequence representation to be compatible

with the protein sequence format such as the FASTA

or the STOCKHOLM formats.

Now for every family of malware, say Allaple,

we choose few (typically between 5-20) ﬁles and add

their sequence representation to FASTA (*.fa) ﬁle.

This FASTA ﬁle with the sequences is given to the

Figure 3: The different malware families and the number of

ﬁles in each, as in Malheur reference dataset(Trinius, 2009).

Figure 4: Sample of MIST representation of a portion of

CWSandbox report(Trinius et al., 2010).

multiple sequence alignment module and the output

is an aligned FASTA (*.afa), which has the multiple

alignment. The alignment ﬁle for that malware family

is now given to the hmmbuild step in hmmer, which

now creates a proﬁle HMM for the class.This is done

for all the families of malware. The malware proﬁle

database is the concatenation of all the HMM proﬁles

created for the known malware families in hand.

Presented with an unknown malware instance, we

convert the MIST encoded ﬁle to a FASTA sequence

ﬁle. The hmmsearch operation of the hmmer triggers

a search on the proﬁle database. The result of the

hmmsearch operationgives the scores for the different

malware families proﬁles, that were closely related to

the presented sequence. The score values for the over-

all sequence match and best domain matches are ob-

tained. Choosing the family which gets the maximum

score, gives the classiﬁcation result. The score differ-

ences between the families can also help us get some

insight into how close the match was, to each of it.

The hmmsearch operation takes longer time

for identifying very long sequences with more

than 50000 operations in a single report. Mul-

tiple sequence alignment and hmmsearch opera-

tions were run on a system with quad-core Intel(R)

Xeon(R)E5440 @ 2.83GHz processor with 32GB of

RAM. Some sequences in the family SALITY were

too long and we haven’t used them for testing in our

experiment. But we plan to look at how to handle

such sequences in our future work.

Behavior-basedMalwareAnalysisusingProfileHiddenMarkovModels

199

(a) Histogram of classiﬁcation accuracy (b) Histogram of F-scores

Figure 5.

5 RESULTS

5.1 On the Reference Dataset

We already saw that, around 5 to 20 ﬁles are used to

construct the proﬁle HMM for every malware family

considered. The testing set consisted of the remaining

ﬁles in the dataset(Trinius, 2009). The predictions of

the HMM for all the malware programs spanning the

24 families is given in the form of a confusion matrix

in Figure 5(c). We see that the overall accuracy for the

dataset is around 95% and the classiﬁcation accuracy

for most of the classes is close to 100%. This shows

that our approach is comparable with the state-of-the-

art approaches as the Malheur(Rieck et al., 2011), in

terms of prediction accuracy.

The overall accuracy rate over the entire dataset

is about 0.964. The accuracy of classiﬁcation for ev-

ery class of malware is shown as a histogram in the

Figure 5(a). We see that for most of the classes the

accuracy is close to 1.0. For classes such as Allaple,

which is polymorphic, all 300 instances were classi-

ﬁed correctly. It was noticed that the scores given for

the dominating proﬁle or malware class was very high

when compared to all other closely matched proﬁles.

Also, whenever there was misclassiﬁcation, the dif-

ference in the scores for the closely matched proﬁles

is small.

SECRYPT2013-InternationalConferenceonSecurityandCryptography

200

The Figure 5b) shows a plot of the F-scores ob-

tained for the classiﬁcation results. The average F-

score taken over most of the classes are more than

0.96 and there are families like Looper, Adultbrowser

etc. with values 1.0. We would like to compare this

with results from (Rieck et al., 2011)and (Bayer et al.,

2010) which give average F-scores of about 0.88 and

0.97 respectively.

The confusion matrix for the multi-class predic-

tion of malware families is presented in the Figure

5(c). By observing this matrix we see how the diago-

nal blocks are dark,owing to high prediction accuracy.

There are lighter grey blocks outside the diagonal re-

ﬂecting the proportion of ﬁles that were misclassiﬁed

for every target malware family. The confusion plot

gives us some insight on how closely related different

families are.

For some classes like Ldpinch, owing to the small

size of the reports and high variability, the accuracy is

low. Programs of the family virut and rbot are very

close in the behavior pattern which is reﬂected in the

accuracy.

5.2 Challenges of the Previous

Approaches

We wanted to extend our approach on larger col-

lection of malware to show its usefulness in real-

time malware analysis, motivated by the initial re-

sults. The publicly available malheur application

dataset(Trinius, 2009) has close to about 400 differ-

ent families of malware reports available, on which

we wanted to test our approach and do a comprehen-

sive study.

It is known that there are many challenges in the

analysis on such large and varied dataset when us-

ing the PHMM approach. To encode a broader range

of MIST instructions, a better and efﬁcient encoding

scheme was required. The encoding also had to take

into account the larger range of malware classes that

had to be analysed. Since there are many classes of

malware with just very few (1 to 3) instances available

for analysis, a pure classiﬁcation approach may not

be very suited. So we resort to a clustering approach

that would work on the PHMM scores that we obtain

for every malware report against stable malware pro-

ﬁles. If we assume the malware family name given

by an antivirus as the ground truth, then the cluster

size distribution for the labeling is still skewed. The

reference dataset that was used in our initial exper-

iments has a few shortcoming when evaluating the

performance of any methodology. The behaviour of

different classes of malware in the dataset were dis-

tinct from one another. As pointed in a work by Li

et al (Li et al., 2010) , most of the discriminative

classiﬁcation models built on such datasets give good

results and the effectiveness of one method over an-

other does not account for the intrinsic characteristics

of a malware. In the same work it is been analysed

that biased cluster size distribution in the dataset re-

duces the signiﬁcance of a high precision and recall

of the clustering results of the malware as observed

in the dataset used for a fast scalable clustering ap-

proach(Bayer et al., 2009). Also the issue of inconsis-

tency in the labels used for this evaluation across dif-

ferent anti-virus vendors renders the evaluation met-

ric not so effective. It is pointed that even a plagia-

rism detection software gave comparable results for

the dataset and metrics while still the LSH based clus-

tering technique considered in (Bayer et al., 2009) is

far more scalable. In essence, our analysis empha-

sizes on the clustering of malware mainly based on its

behaviour that we obtain and study of malware evolu-

tion on this large dataset using the PHMM approach.

6 DETAILED EXPERIMENTS

We present the details of experiments in testing the

usefulness of PHMM to create malware family pro-

ﬁles and how one can use the PHMM scores to cluster

a large set of malware instances. The malware analy-

sis using this approach involves the following steps.

1. We choose to use the MIST(Trinius et al.,

2010) approach for representing the instructions of

the CWSandbox report.

2. The MIST 0 level is what we have used for the cur-

rent experiments. In future, we will consider using the

arguments of the MIST instructions too (higher mist

levels).

3. Since the number of unique instructions in the

MIST set is more than the number of legitimate alpha-

bets in the protein encoding(20), we choose to use an

efﬁcient encoding algorithm which will be described

in the following subsection.

4. The choice of the malware instances to create the

family’s PHMM proﬁle was an important question

that arose. We have addresed the issue, by choosing

the most variable sequences in terms of the sequence

length and malware behaviour.

5. As in the previous experiment, the subsequent steps

are the MSA of the chosen sequences and building

of the HMM proﬁles for those families that we have

enough samples of.

6. The new unseen sequences of all the malware in-

stances are scored against the proﬁle database. The

resulting scores for the top scoring families(above an

inclusion threshold) are then normalised across all

Behavior-basedMalwareAnalysisusingProfileHiddenMarkovModels

201

known families in the database.

7. This normalised vector is then used for clustering

the malware into families. We have used a fast re-

peated bisection method for clustering the set of mal-

ware reports.

6.1 Encoding the Behaviour Reports

The behaviour reports of the malware dataset(Trinius,

2009) are encoded using the MIST codes as in the

paper(Trinius et al., 2010). When converting this en-

coding to that of the protein sequences we have fewer

codes to represent a larger set of MIST instructions.

We resorted to use the huffman encoding algorithm

for the same. The encoding gives shorter alphabet

code for more frequent instructions and longer codes

for the rare ones in the behavior reports. This gave

us a nice encoding that will ensure that the sequences

are not too long and hence easing the complexity in-

volved in multiple sequence alignment step.

6.2 Incremental Setting for the Detailed

Experiments

The dataset that we used for the detailed experiments

mainly focuses on the analysis of the malware fami-

lies that exhibit varied behaviour across samples. The

malheur application set has malware ﬁles spanning

over 403 families, among which, around 146 fami-

lies have more than three samples each. To see how

an incremental analysis can be done, we did a proﬁle

creation for about 130 families and the malware be-

longing to these families were scored over the proﬁle

database. This covered about 7700 ﬁles whose preci-

sion and recall was about 0.67 and 0.46 respectively.

In the incremental step, we add the PHMM pro-

ﬁles for 15 more prominent families to the database.

The total number of ﬁles in the dataset is around

18990 and the reports of all the 400 families of mal-

ware are presented for scoring and later clustering us-

ing a fast recursive bisection method. The vectors

of PHMM scores for each report is normalised and

the cosine similarity measure was used for cluster-

ing. The recursive bisection algorithm is very fast

and the clustering results for nearly 19000 malware

reports was available in less than one minute. This is

of a great advantage in malware analysis where thou-

sands of ﬁles are typically getting uploaded everyday

for analysis.

The classiﬁcation results for some of the initially

seen samples from newly added families (in the in-

cremental) can be explained with the help of phylo-

genetic analysis that will be introduced in the coming

section. It is assumed that, at some point of time the

phylogenetic analysis on the aligned MIST sequences

helps us discover a new class of malware branching

steadily from an existing family. Once that discov-

ery is done, the exemplary samples of the new family

is used for building its own proﬁle and the database

is updated. However, completely new families of

malware generally do not surface on the web so fre-

quently as the polymorphic variants or extensions of

already existing families of malware.

6.3 What is Phylogeny

The phylogenetic analysis is usually done in the ﬁeld

of evolutionary biology to ﬁnd the hierarchical rela-

tionships between the organisms belonging to differ-

ent taxonomic groups that form the leaves in the tree.

The physical or the genetic characeteristics are used

for the same. The cladograms or the phylogenetic

trees are constructed with the branch lengths repre-

senting the evolutionary distance between the organ-

isms in consideration.

The trees can be built from two forms of the ge-

nomic data. They are the distance matrices for the

genetic sequences and the molecular data compris-

ing the aligned sequences themselves. The distance

methods build the phylogenetic tree by clustering the

sequences based on their distances obtained from the

matrix, in an iterative manner. Character-based tree

building methods can use both types of data, and

search for the best hierarchical tree from a set of pos-

sible tree topologies. They are slower than the dis-

tance methods as they come with optimality guaran-

tees, but there are heuristics that speed up the process.

6.4 Malware Phylogeny from Our

Experiments

The tool ClustalW was used for phylogenetic anal-

ysis of the encoded MIST sequences of the mal-

ware. It takes in the prealigned homologous se-

quences, initially computes rough distance matrices

based on pairwise alignment scores. It then uses a

simple Neighbor-Joining clustering method to clus-

ter the leaves of the tree under branches. In biology,

the homologous sequences refer to the nucleic acid or

protein sequences that are similar because they have

a common evolutionary origin. The phylogenetic tree

for the families Virut and Kies is shown in Figure 6.

The initially seen few samples of Virut were misclas-

siﬁed as Kies and it is seen that the tree reﬂects the

common behaviour that they share. A similar tree

for the families Palevo and Buzus is shown in Fig-

ure 7. Palevo was a worm that surfaced in 2009 and

it opens a back door on the compromised computer

SECRYPT2013-InternationalConferenceonSecurityandCryptography

202

Table 1: Malware Clustering Comparision.

Features Precision Recall Number of clusters

PHMM Scores 0.7058 0.3106 400

4-gram frequencies 0.7000 0.3700 400

Figure 6: Phylogenetic tree for Virut and Kies.

and attempts to connect to the following IRC server

to receivecommands and the Buzus shared similar be-

haviour too. The length of the branches in the tree re-

ﬂects the genetic distance between aligned sequences.

This distance is usually deﬁned as the fraction of mis-

matches at aligned positions, with gaps either ignored

or counted as mismatches.

6.5 Analysis of the Results

The distribution of ﬁles over the different families of

viruses is very skewed, with popular polymorphic

viruses like Allaple, Texel, Swizzor with over 1000

samples and nearly 200 classes of viruses with just

one sample each. From a little bit of analysis into

the behaviour proﬁles, it was observed that some of

the majority classes like Basun had a stable behaviour

pattern with minimal variations and constituted a third

of the malware collection itself. With such a distribu-

tion in the data, a high precision and recall may not

help us distinguish the effectiveness of one method

over another. So for our study and comparison, we

Behavior-basedMalwareAnalysisusingProfileHiddenMarkovModels

203

Figure 7: Phylogenetic tree for Palevo and Buzus.

have used the malware families with the most vari-

able behaviour across instances, forming the majority

of the samples used in the study for comparison. The

typical example is the class of viruses called Texel. It

has many aliases used in the antivirus companies and

has a varying behaviour. It behaves like a virus be-

cause of its self replicating nature. Some instances

of Texel may frequently pop up advertising messages

to interrupt computer users, while more severely they

may destroy the data in computers. The lower re-

call and higher precision in the clustering obtained

from PHMM features is because the Texel ﬁles have a

varied behavior forming different clusters of big and

small sizes. The cluster distribution for the top ﬁf-

teen clusters are shown in the Fig.8. It is seen that

the clusters are pure and that the cluster size distribu-

tion obtained is not very skewed. The results show

that the performance of our methodology is compa-

rable with the state-of-the-art, even when using very

few sequences to actually model a behaviour proﬁle.

PHMM also gives the advantage of lesser complexity

in merging proﬁles or aligning extra sequences to an

existing proﬁle, when new variants appear or taxon-

omy changes.

7 FUTURE WORK

AND CONCLUSIONS

We have explored the approach of using proﬁle hid-

den markov model for the problem of malware clas-

SECRYPT2013-InternationalConferenceonSecurityandCryptography

204

Figure 8: Clustering Results - Top 15 clusters.

siﬁcation based on behavior reports. This approach

was considered because of the inherent similarity of

the metamorphism in malware to that of the muta-

tions in gene or protein sequences. The results are

quite assuring of a high classiﬁcation and clustering

performance, even when very few training instances

available for building models. We would like to ex-

tend this approach for identifying new unseen mal-

ware families and to distinguish benign ﬁles in our

future work that would empower us to design a com-

plete malware triage system. The methods of speed-

ing up the algorithms in the proﬁle HMMs can also

be explored,when employing in large scale malware

analysis.

REFERENCES

Apel, M., Bockermann, C., and Meier, M. (2009). Mea-

suring similarity of malware behavior. In Local Com-

puter Networks, 2009. LCN 2009. IEEE 34th Confer-

ence on, pages 891–898. IEEE.

Attaluri, S., McGhee, S., and Stamp, M. (2009). Proﬁle hid-

den markov models and metamorphic virus detection.

Journal in computer virology, 5(2):151–169.

Bailey, M., Oberheide, J., Andersen, J., Mao, Z., Jahanian,

F., and Nazario, J. (2007). Automated classiﬁcation

and analysis of internet malware. In Recent Advances

in Intrusion Detection, pages 178–197. Springer.

Bayer, U., Comparetti, P. M., Hlauschek, C., Kruegel, C.,

and Kirda, E. (2009). Scalable, behavior-based mal-

ware clustering. In Network and Distributed System

Security Symposium (NDSS). Citeseer.

Bayer, U., Kirda, E., and Kruegel, C. (2010). Improving the

efﬁciency of dynamic malware analysis. In Proceed-

ings of the 2010 ACM Symposium on Applied Com-

puting, pages 1871–1878. ACM.

Bayer, U., Kruegel, C., and Kirda, E. Anubis: Analyzing

unknown binaries.

Durbin, R., Eddy, S. R., Krogh, A., and Mitchison, G.

(1998). Biological sequence analysis: probabilistic

models of proteins and nucleic acids. Cambridge uni-

versity press.

Eddy, S. (2003). Hmmer: proﬁle hmms for protein se-

quence analysis. http://hmmer.janelia.org/.

Eddy, S. R. (1998). Proﬁle hidden markov models. Bioin-

formatics, 14(9):755–763.

Edgar, R. C. (2004). Muscle: multiple sequence align-

ment with high accuracy and high throughput. Nucleic

acids research, 32(5):1792–1797.

Lee, T. and Mody, J. J. (2006). Behavioral classiﬁcation. In

EICAR Conference.

Li, P., Liu, L., Gao, D., and Reiter, M. (2010). On

challenges in evaluating malware clustering. In Re-

cent Advances in Intrusion Detection, pages 238–255.

Springer.

Moser, A., Kruegel, C., and Kirda, E. (2007). Limits of

static analysis for malware detection. In Computer Se-

curity Applications Conference, 2007. ACSAC 2007.

Twenty-Third Annual, pages 421–430. IEEE.

Needleman, S. B., Wunsch, C. D., et al. (1970). A gen-

eral method applicable to the search for similarities in

the amino acid sequence of two proteins. Journal of

molecular biology, 48(3):443–453.

Rabiner, L. R. (1989). A tutorial on hidden markov models

and selected applications in speech recognition. Pro-

ceedings of the IEEE, 77(2):257–286.

Rieck, K., Holz, T., Willems, C., D¨ussel, P., and Laskov, P.

Behavior-basedMalwareAnalysisusingProfileHiddenMarkovModels

205

(2008). Learning and classiﬁcation of malware behav-

ior. Detection of Intrusions and Malware, and Vulner-

ability Assessment, pages 108–125.

Rieck, K., Trinius, P., Willems, C., and Holz, T. (2011). Au-

tomatic analysis of malware behavior using machine

learning. Journal of Computer Security, 19(4):639–

668.

Trinius, P. (2009). Malheur Dataset. http://pi1.informatik.

uni-mannheim.de/ malheur/ #dldata.

Trinius, P., Willems, C., Holz, T., and Rieck, K. (2010).

A malware instruction set for behavior-based analy-

sis. In Proceedings of 5th GI Conference Sicherheit,

Schutz und Zuverl assigkeit, Berlin, Germany.

Wagener, G., State, R., and Dulaunoy, A. (2008). Mal-

ware behaviour analysis. Journal in computer virol-

ogy, 4(4):279–287.

Willems, C., Holz, T., and Freiling, F. (2007). Toward au-

tomated dynamic malware analysis using cwsandbox.

Security & Privacy, IEEE, 5(2):32–39.

Yadwadkar, N. J., Bhattacharyya, C., Gopinath, K., Niran-

jan, T., and Susarla, S. (2010). Discovery of applica-

tion workloads from network ﬁle traces. In Proceed-

ings of the 8th USENIX conference on Fileand storage

technologies, pages 14–14. USENIX Association.

SECRYPT2013-InternationalConferenceonSecurityandCryptography

206