Authors:
Saradha Ravi
;
N. Balakrishnan
and
Bharath Venkatesh
Affiliation:
Indian Institute of Science, India
Keyword(s):
Behavior-based Malware Analysis, Profile Hidden Markov Model, Multiple Sequence Alignment, Malware Classification, Metamorphic Viruses, Clustering.
Related
Ontology
Subjects/Areas/Topics:
Information and Systems Security
;
Information Assurance
;
Intrusion Detection & Prevention
;
Management of Computing Security
;
Security Engineering
;
Security in Information Systems
;
Software Security
Abstract:
In the area of malware analysis, static binary analysis techniques are becoming increasingly difficult with the code obfuscation methods and code packing employed when writing the malware. The behavior-based analysis techniques are being used in large malware analysis systems because of this reason. In these dynamic analysis systems, the malware samples are executed and monitored in a controlled environment using tools such as CWSandbox(Willems et al., 2007). In previous works, a number of clustering and classification techniques from machine learning and data mining have been used to classify the malwares into families and to identify even new malware families, from the behavior reports. In our work, we propose to use the Profile Hidden Markov Model to classify the malware files into families or groups based on their behavior on the host system. PHMM has been used extensively in the area of bioinformatics to search for similar protein and DNA sequences in a large database. We see th
at using this particular model will help us overcome the hurdle posed by polymorphism that is common in malware today. We show that the classification accuracy is high and comparable with the state-of-art-methods, even when using very few training samples for building models. The experiments were on a dataset with 24 families initially, and later using a larger dataset with close to 400 different families of malware. A fast clustering method to group malware with similar behaviour following the scoring on the PHMMprofile database was used for the large dataset. We have presented the challenges in the evaluation methods and metrics of clustering on large number of malware files and show the effectiveness of using profile hidden model models for known malware families.
(More)