2006) (Bayer et al., 2010) (Apel et al., 2009). Once the
malware programs are grouped, we can identify the
behavior common to each malware family and use it in
anti-malware software to mitigate that threat
proactively. In our work, we address the problems of
malware classification and clustering (based on similar
behavior) using profile Hidden Markov Models.
2 RELATED WORK
In this section, we look at some of the related work on
behavior-based malware analysis and classification.
Dynamic analysis techniques gained prominence because of
the limitations of static analysis techniques (Moser et
al., 2007). Moser et al. proposed a method in which the
normal behavior of programs was modelled using sequences
of six system calls, and any deviation from this model
was flagged as an anomaly or a security threat. This was
one of the first approaches to use behavior to
differentiate malware from benign programs. Bailey et al.
(Bailey et al., 2007) tracked more abstract features,
such as system state changes, rather than system call
sequences for malware classification.
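The idea of modelling normal behavior with short system
call sequences can be illustrated with a minimal sketch.
It is not the cited authors' implementation: the function
names, the set-based profile and the mismatch-rate score
are assumptions made for illustration, and only the window
length of six comes from the text above.

```python
from typing import Iterable, List, Set, Tuple

WINDOW = 6  # window length mentioned in the text; everything else is illustrative


def windows(trace: List[str], n: int = WINDOW) -> Iterable[Tuple[str, ...]]:
    """Yield every contiguous run of n system calls in the trace."""
    for i in range(len(trace) - n + 1):
        yield tuple(trace[i:i + n])


def build_normal_profile(traces: List[List[str]], n: int = WINDOW) -> Set[Tuple[str, ...]]:
    """Collect all length-n windows observed in traces of normal runs."""
    profile: Set[Tuple[str, ...]] = set()
    for trace in traces:
        profile.update(windows(trace, n))
    return profile


def mismatch_rate(trace: List[str], profile: Set[Tuple[str, ...]], n: int = WINDOW) -> float:
    """Fraction of windows in the trace that never occurred in the normal profile."""
    seen = list(windows(trace, n))
    if not seen:
        return 0.0  # trace shorter than one window: nothing to compare
    unseen = sum(1 for w in seen if w not in profile)
    return unseen / len(seen)
```

A trace whose mismatch rate exceeds some threshold would be
flagged as anomalous; the threshold itself would have to be
chosen empirically.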
Different distance measures are used to measure the
similarity between files of the same malware family.
Some are appropriate for the given problem whereas
others are not, particularly those that do not take the
order of the activities in the behavior into
consideration. Lee and Mody propose a malware clustering
approach in which a modified Levenshtein distance is used
and k-medoid partition clustering is performed (Lee and
Mody, 2006). The cost of computing the distance between
two malware samples in their method is quadratic in the
number of system calls and is therefore expensive.
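To make the quadratic cost concrete, the following sketch
computes the standard (unmodified) Levenshtein distance
between two system call sequences. The modification used
in the cited work is not described here, so this is only
the textbook algorithm, and the function name is an
assumption.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two sequences of system call names.

    The nested loops visit len(a) * len(b) cells, which is the
    quadratic cost referred to in the text.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter sequence in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, call_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, call_b in enumerate(b, start=1):
            cost = 0 if call_a == call_b else 1
            current.append(min(
                previous[j] + 1,         # delete call_a
                current[j - 1] + 1,      # insert call_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]
```

Because every pair of samples needs one such computation,
clustering a large collection multiplies this per-pair cost
by the number of pairs compared.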
In more recent work (Bayer et al., 2010), Bayer et al.
employ faster approximate nearest-neighbor search using
Locality Sensitive Hashing to compare analysis reports
with known behavior profiles that they have created
(using data tainting methods to track system call
dependencies). The behavior reports are then clustered
using a hierarchical clustering algorithm. Comparing the
resulting clusters to the true malware clusters gave them
precision and recall values of 0.98 and 0.93,
respectively.
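Precision and recall for a clustering can be defined in
more than one way. The sketch below uses one common
definition based on the largest overlap between produced
and reference clusters; whether the cited work uses
exactly this definition is an assumption, and the function
and parameter names are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Tuple


def clustering_precision_recall(
    predicted: Dict[str, Hashable],
    reference: Dict[str, Hashable],
) -> Tuple[float, float]:
    """Score a clustering against reference labels.

    predicted maps sample id -> produced cluster id,
    reference maps sample id -> true family label.
    """
    samples = list(reference)
    if not samples:
        raise ValueError("no samples to evaluate")
    by_pred = defaultdict(Counter)  # produced cluster -> counts of true labels
    by_ref = defaultdict(Counter)   # true label -> counts of produced clusters
    for s in samples:
        by_pred[predicted[s]][reference[s]] += 1
        by_ref[reference[s]][predicted[s]] += 1
    n = len(samples)
    # Precision: how pure each produced cluster is.
    precision = sum(max(c.values()) for c in by_pred.values()) / n
    # Recall: how well each true family stays together in one cluster.
    recall = sum(max(c.values()) for c in by_ref.values()) / n
    return precision, recall
```

Under this definition, putting every sample in one cluster
maximises recall but lowers precision, while putting every
sample in its own cluster does the opposite.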
The automatic classification system of Rieck et al.
identifies novel families of unseen malware using
clustering and assigns new instances of malware to these
families by classification using SVMs (Rieck et al.,
2008). In this method, prototypes for each class of
malware are generated and eventually used in the
hierarchical clustering of the malware reports. The
experiments for this work were conducted on a larger
dataset of close to 33000 reports, and a detailed study
of resource utilization was also carried out. Their
Malheur implementation gave F-scores of around 0.95 for
clustering and 0.97 for classification. In their previous
work (Rieck et al., 2008), the classification of malware
using support vector machines is elaborated and the
discriminative features in behavior reports are analysed
to explain classification decisions. The authors also
propose a new representation for the monitored behavior
of malware (Trinius et al., 2010). This representation is
optimised to be efficient when applying machine learning
and data mining techniques.
Wagener et al. (Wagener et al., 2008) propose a dynamic
analysis method in which they use a sequence alignment
method to compute similarities and also leverage the
Hellinger distance. They also show how the use of a
phylogenetic tree improves their classification method.
The different distance measures used when clustering
similar malware behavior are examined by Apel et al.
(Apel et al., 2009). Their finding is that the Manhattan
distance, or a similarity coefficient, computed on
3-grams of the report contents stored in tries or
generalized suffix trees, works best.
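As a minimal sketch of the 3-gram comparison described
above, the code below counts the n-grams of a tokenised
report and takes the Manhattan distance between the two
count vectors. A plain Counter stands in for the tries or
generalized suffix trees mentioned in the text, and the
tokenisation and function names are assumptions.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n: int = 3) -> Counter:
    """Count every contiguous n-gram of report tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def manhattan(p: Counter, q: Counter) -> int:
    """Sum of absolute count differences over all n-grams seen in either report."""
    # A Counter returns 0 for a missing key, so absent n-grams need no special case.
    return sum(abs(p[gram] - q[gram]) for gram in p.keys() | q.keys())
```

Two reports that share most of their 3-grams get a small
distance, and the ordering of events within each 3-gram is
preserved, unlike in a plain bag-of-events comparison.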
To detect similarity in workloads from NFS traces for
storage systems, Yadwadkar et al. (Yadwadkar et al.,
2010) applied the profile HMM (PHMM) to the opcode
sequences of the NFS traces. They also observe that very
few training sequences for a particular type of workload
were enough for modeling. In another work (Attaluri et
al., 2009), the profile HMM was applied to x86 opcode
sequences of polymorphic malware binaries generated by
commonly available virus kits. However, they observe that
the method works better for some families than for others
because of problems such as subroutine permutation and
code reordering.
3 PROPOSED APPROACH
3.1 Our Contribution
In this paper, we show that polymorphic malware is better
detected by looking at its behavior, where we expect a
certain common sequence of actions to be preserved in
spite of obfuscation in the code. We chose the PHMM
mainly because it is an intuitive fit for the kind of
sequence search problem that arises in classifying
malware behavior. The initial experiments are done on a
fairly diverse dataset that has
close to 24 families of malware and we see that the re-
SECRYPT 2013 - International Conference on Security and Cryptography