AUTOMATIC BEHAVIOUR-BASED ANALYSIS AND
CLASSIFICATION SYSTEM FOR MALWARE DETECTION
Jaime Devesa, Igor Santos, Xabier Cantero, Yoseba K. Penya and Pablo G. Bringas
S3 Lab, Deusto Technological Foundation, Bilbao, Spain
Keywords:
Security, Malware detection, Machine learning, Data-mining.
Abstract:
Malware is any kind of program explicitly designed to harm, such as viruses, trojan horses or worms. Since
the amount of malware is growing exponentially, it already poses a serious security threat. Therefore, every
incoming code sample must be analysed in order to classify it as malware or benign software. These tests commonly
combine static and dynamic analysis techniques in order to extract as much information as possible from
distrustful files. Moreover, the increasing number of attacks makes it infeasible to manually test the thousands of
suspicious files that reach antivirus laboratories every day. Against this background, we present here an
automated system for malware behaviour analysis based on emulation and simulation techniques.
Creating a secure and reliable sandbox environment allows us to test the retrieved suspicious code without risk.
In this way, we can also generate evidence and classify the samples with several machine-learning algorithms.
We have developed the proposed solution and tested it with real malware. Finally, we have evaluated it in terms
of reliability and time performance, two of the main requirements for such a system to work.
1 INTRODUCTION
Malware is any kind of code explicitly designed with
harmful intentions, such as viruses, trojan horses or
worms. Malware represents a high-priority issue to
security researchers and poses a major threat to the
privacy of computer users and their information.
Still, the traditional approach to analysing malware
requires that a human analyst manually performs the
tests and extracts the information in order to classify
the sample (Moser et al., 2007). Unfortunately, due to
the tremendous growth of malicious code, antivirus
companies receive thousands of suspicious files every
day that have to be analysed and classified as malware
or benign software. Hence, reliable and fast automation
of the analysis and classification is crucial in order to
cope with this threat.
With this scenario in mind, there are two malware
analysis approaches: static analysis (Carrera and
Erdélyi, 2004), which is performed without actually
executing the file, only inspecting the binary for
suspicious patterns, and dynamic analysis (Christodorescu
et al., 2007; Rieck et al., 2008), which implies running
the sample in an isolated and controlled environment
while monitoring its behaviour.
Against this background, we propose here an au-
tomatic system for malware behaviour analysis based
on emulation and simulation techniques. We advance
the state of the art in three main ways. First, we pro-
pose a sandbox to monitor program executions within
a contained environment. Second, we present and de-
scribe a method for feature extraction from behaviour
reports, and we train different machine-learning clas-
sifiers in order to provide detection of unknown mal-
ware instances. Last but not least, we evaluate the
system in terms of reliability and time performance.
2 SYSTEM ARCHITECTURE
2.1 System Manager
The System Manager is the main process of the en-
tire system, and it is responsible for orchestrating the
other components.
When the System Manager starts, it launches n
Qemu virtual machines and creates a TCP server
in order to establish communications and synchro-
nise other processes. Moreover, the suspicious files
and the generated behaviour trace logs are exchanged
with the sandboxes via these communication chan-
nels. Once any of the sandboxes finishes the analysis
of a file, the behaviour trace is copied into a network share
folder. Then, the System Manager is informed and
negotiates the extraction of features with the regular
expression parser. Finally, the gathered information
is stored in a database for further classification.
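To make this orchestration concrete, the following is a minimal Python sketch of such a manager loop, not the actual implementation: the Qemu command line, the synchronisation port, the trace hand-off protocol and the database schema are all illustrative assumptions.

# Hypothetical sketch of the System Manager's orchestration loop.
import socket
import sqlite3
import subprocess

N_SANDBOXES = 4       # Qemu instances launched per analysis machine (assumption)
MANAGER_PORT = 9999   # hypothetical TCP synchronisation port

def launch_sandbox(index: int) -> subprocess.Popen:
    """Start one Qemu virtual machine hosting the Wine-based sandbox."""
    # The real disk image, snapshot and networking flags are deployment specific.
    return subprocess.Popen(["qemu-system-i386", "-snapshot", f"sandbox_{index}.img"])

def extract_features(trace_path: str) -> str:
    """Placeholder for the regular-expression parser described in Section 2.3."""
    return ""

def main() -> None:
    db = sqlite3.connect("analyses.db")
    db.execute("CREATE TABLE IF NOT EXISTS analyses(trace TEXT, features TEXT)")
    sandboxes = [launch_sandbox(i) for i in range(N_SANDBOXES)]  # keep handles alive
    with socket.create_server(("0.0.0.0", MANAGER_PORT)) as server:
        while True:
            conn, _addr = server.accept()          # a sandbox reports a finished analysis
            with conn:
                trace_path = conn.recv(4096).decode().strip()  # path on the network share
                db.execute("INSERT INTO analyses(trace, features) VALUES (?, ?)",
                           (trace_path, extract_features(trace_path)))
                db.commit()

if __name__ == "__main__":
    main()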
2.2 Behaviour Monitoring
Behaviour monitoring is a dynamic analysis tech-
nique in which the suspicious file is executed inside
a contained and secure environment, called sandbox,
in order to get a complete and detailed trace of the
actions performed in the system.
Besides, as 87.9% of users work with Microsoft-based
platforms, most of the malware is aimed at these operating
systems. Therefore, we have developed a sandbox with the
purpose of dynamically analysing Windows Portable
Executable (PE) files (Pietrek, 1994). To this end, the
suspicious files are executed inside an emulated environment,
and relevant Windows API calls are logged, revealing the
program's behaviour.
Therefore, we propose a new sandbox approach that uses
both emulation (Qemu) and simulation (Wine) techniques,
with the aim of achieving the greatest possible transparency
without interfering with the system. We now describe the
two main platforms of our sandbox solution.
First, Wine is an open-source and complete re-implementation
(simulation) of the Win-32 Application Programming Interface
(API). Hence, it allows Windows PE files to run as if natively
under Unix-based operating systems. Moreover, the main reason
for our choice is that, being open source, we can modify and
improve the code to adjust it to our needs, and therefore log
any call to the Win-32 API made by the malware.
Second, Qemu is an open-source, pure-software virtual machine
emulator that works by performing equivalent operations in
software for any given CPU instruction. Unfortunately, several
malicious executables are able to detect that they are being
executed in a contained environment by exploiting different
bugs within this virtual machine; fortunately, these bugs can
be fixed easily (Ferrie, 2006).
To this end, we have made several improvements to Wine.
First, we have modified Wine's source code to write every
call made by a process (identified by its PID) to the Windows
API (divided into families, i.e. registry, memory or files)
into a log, specifying the parameters' state before and after
the function body. Thereby, we obtain a complete and
homogeneous trace of every process's behaviour without any
interference with the system. Second, we have made several
modifications to the basic Wine installation by adding various
Windows XP system DLLs and creating additional registry keys
and folder structures. In this way, we have obtained a system
that is as transparent and as similar to a Windows operating
system as possible, in order to be able to execute the largest
possible variety of programs.
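The exact trace format produced by the modified Wine is not specified in the paper; the short Python sketch below therefore assumes a simple one-call-per-line layout (the field names and the example line are hypothetical) purely to make the feature extraction of the next section tangible.

# Hypothetical trace line, one per monitored Win-32 API call:
# PID=1234 FAMILY=registry CALL=RegSetValueExW IN=(HKLM\...\Run, "evil.exe") OUT=(0)
import re
from dataclasses import dataclass

TRACE_LINE = re.compile(
    r"PID=(?P<pid>\d+)\s+FAMILY=(?P<family>\w+)\s+CALL=(?P<call>\w+)\s+"
    r"IN=\((?P<args_in>.*?)\)\s+OUT=\((?P<args_out>.*?)\)"
)

@dataclass
class ApiCall:
    pid: int        # process identifier
    family: str     # registry, memory, files, ...
    call: str       # Win-32 API function name
    args_in: str    # parameter state before the function body
    args_out: str   # parameter state after the function body

def parse_trace(path: str) -> list:
    """Parse one behaviour trace log into a list of ApiCall records."""
    calls = []
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = TRACE_LINE.match(line)
            if match:
                calls.append(ApiCall(int(match["pid"]), match["family"], match["call"],
                                     match["args_in"], match["args_out"]))
    return calls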
2.3 Feature Extraction
Wheneverthe input data to machine learning and clas-
sification methods is too large and complex, it is nec-
essary to transform this data into a reduced represen-
tation set of carefully chosen features (feature vector),
a process known as feature extraction.
For each malware sample analysed in the sandbox, we obtain
a complete raw trace of its detailed behaviour. These reports
are not directly suitable for machine-learning applications,
which usually work with vectorial data. Hence, with the aim of
automatically extracting relevant information in vector format
from the traces, we developed regular expression rules and a
parser to identify them. Regular expressions are a powerful and
fast tool for identifying character patterns within a text
(Friedl, 2006), in our case the detailed trace log.
To this end, with the help of an expert, we have defined
specific actions taken by the program as regular expression
rules in order to form the knowledge base. In this way, the
creation of new rules becomes very simple and intuitive, and
the system is easily improvable.
Moreover, although most of the defined actions are
characteristic of malware, there are rule definitions for
both benign and malicious behaviour.
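As an illustration only, a small knowledge base of such rules could be expressed in Python as a dictionary mapping each monitored action to a regular expression over the trace log; the concrete patterns below are assumptions, since the actual expert-defined rule set is not listed in the paper.

import re

RULES = {
    # monitored action                  pattern searched for in the behaviour trace
    "writes_autorun_registry_key": re.compile(
        r"RegSetValue\w*.*\\CurrentVersion\\Run", re.IGNORECASE),
    "drops_executable_in_system32": re.compile(
        r"CreateFile\w*.*\\system32\\.*\.(exe|dll)", re.IGNORECASE),
    "enumerates_running_processes": re.compile(
        r"CreateToolhelp32Snapshot|Process32First", re.IGNORECASE),
    "opens_user_document": re.compile(   # an example of a benign-behaviour rule
        r"CreateFile\w*.*\\My Documents\\", re.IGNORECASE),
}

Under such a scheme, improving the system amounts to adding one more entry to the dictionary, which is what keeps the knowledge base simple to extend.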
Therefore, we have converted all the gathered information
into a vector by parsing the behaviour reports with the defined
regular expressions. To this end, we defined a program behaviour
as a vector made up of the aforementioned characteristics. More
specifically, we represented an executable as a vector $\vec{v}$
composed of binary characteristics $c$, where each $c$ can be
either 1 (true) or 0 (false),
$\vec{v} = (c_1, c_2, c_3, \ldots, c_{n-1}, c_n)$, and $n$ is
the number of monitored actions.
In this way, we have characterised the vector information as
binary digits, called features, each one representing the
corresponding characteristic of the behaviour. When parsing a
report, if one of the defined actions is detected by a rule,
the corresponding feature is activated. The resulting vector
for each program's trace is a finite sequence of bits, suitable
input for classifiers to effectively recognise patterns and
correlate similarities across a huge number of instances
(Lee and Mody, 2006). Likewise, both the raw
trace log and feature sequence for each analysed exe-
cutable are stored in a database for further treatment.
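Putting the rule set and the vector definition together, the conversion of one raw trace into its binary feature vector can be sketched as follows (RULES refers to the hypothetical dictionary sketched earlier).

def to_feature_vector(trace_text: str, rules: dict) -> list:
    """Return v = (c_1, ..., c_n): c_i is 1 if the i-th monitored action was observed."""
    return [1 if pattern.search(trace_text) else 0 for pattern in rules.values()]

# Example (assumed file name): one vector with n = len(RULES) features,
# stored in the database alongside the raw trace.
# vector = to_feature_vector(open("sample.trace").read(), RULES)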
3 EXPERIMENTS AND RESULTS
3.1 Datasets
For the following experiments, we have used two different
datasets for testing the system: a malware dataset and a benign
software dataset. First, we have downloaded a large malware
collection from the VxHeavens website (VX-Heavens, 2009),
composed of different kinds of malicious code such as trojan
horses, viruses or worms. Even though the samples were already
named with their family and variant names, we have scanned them
using AVG Anti-virus to guarantee the correctness of the labelling.
Hereafter, we have analysed the whole malware dataset in our
system. Table 1 details the number of files per family, the
number of files correctly analysed by our system and the
percentage of correctly analysed files. We consider an analysis
correct when at least one of the features was activated, that
is, when at least one action covered by the rules was performed.
In summary, the average success rate was 85.16%, which is a good
result taking into account that our system relies on a single
execution and that some malware executables can detect that they
are being executed in an emulated environment (Ferrie, 2006).
Further, we have removed from the dataset the 379 malware files
that were not correctly analysed, since no features were activated.
Table 1: Results for the malware analysed in the system.
Families Files Correctly analysed Rate (%)
Agobot 340 250 73.53
Bagle 127 124 97.64
MyDoom 47 43 91.49
Rbot 1161 950 81.83
Sdbot 879 808 91.92
TOTAL 2554 2175 85.16
In addition, we have collected diverse executable files from a
fresh installation of a Windows XP system without internet
connection in order to form the benign software dataset. In this
way, we have ensured that there were no infected files. For
further assurance, we have also scanned them using AVG Anti-virus.
The whole benign software dataset was correctly labelled as
legitimate executables.
Finally, we have created a balanced dataset of 1,500 files,
composed of 750 malware files randomly chosen from the 2,175
correctly analysed malware samples and 750 benign files. Such
balancing has been shown to render the evaluation of
machine-learning methods more reliable (Batista et al., 2004).
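A minimal sketch of this balancing step, assuming the binary feature vectors of both classes have already been loaded into Python lists (the names and the fixed seed are illustrative):

import random

def build_balanced_dataset(malware_vectors, benign_vectors, per_class=750, seed=0):
    rng = random.Random(seed)                 # fixed seed for reproducibility
    X = rng.sample(malware_vectors, per_class) + rng.sample(benign_vectors, per_class)
    y = [1] * per_class + [0] * per_class     # 1 = malware, 0 = benign
    return X, y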
3.2 Time Performance
In order to evaluate the system's time performance, we have
analysed a corpus of 500 files from the malware dataset using
four different configurations: one to four sandboxes executed
simultaneously on the same machine. The default analysis timeout
for each file was initially set to 20 seconds, but most of the
files finished their execution before reaching this limit.
Table 2 shows the results obtained with the four configurations.
Note that the improvement percentage has been calculated relative
to the single-sandbox configuration.
As we increase the number of sandboxes devoted to the analysis,
the overall throughput improves as well, although with diminishing
returns since the instances share the same machine's resources.
Table 2: Time performance results for the analysis of 500 files.
VMs Total time (sec) Time per file (sec/file) Improvement (%)
1 9555 19.11 -
2 4640 9.28 51.44
3 3390 6.78 64.53
4 2815 5.63 70.54
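For clarity, the improvement column is obtained relative to the single-sandbox run; the short Python check below reproduces it from the table's raw times (a 0.01 rounding difference remains for the 3-VM row).

# Total analysis time in seconds for 500 files, taken from Table 2.
total_seconds = {1: 9555, 2: 4640, 3: 3390, 4: 2815}

for vms, secs in total_seconds.items():
    per_file = secs / 500
    improvement = (1 - secs / total_seconds[1]) * 100
    print(f"{vms} VM(s): {per_file:.2f} s/file, improvement {improvement:.2f}%")
# 2 VMs -> 51.44%, 3 VMs -> 64.52%, 4 VMs -> 70.54%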
In conclusion, we obtained an excellent time performance running
4 sandbox instances on each computer dedicated to executing these
components.
3.3 Malware Detection Experiments
We have conducted an experiment focused on classifying generic
malware without taking into account which malware family each
sample belonged to. In this way, we can provide detection of
unknown malicious instances.
In particular, we have trained and tested Naive Bayes, decision
tree and Support Vector Machine classifiers in order to evaluate
them. Choosing these machine-learning classifiers was not an
arbitrary decision, since they have already been widely applied
to malware detection (Ye et al., 2008). To this end, we have
performed the following steps:
Cross Validation. Despite the small dataset, we had to use as
much of the available information as possible in order to obtain
a proper representation of the data. To this end, k-fold cross
validation is commonly used in machine-learning experiments
(Bishop, 2006).
Thereby, for each classifier we tested, we performed a k-fold
cross validation (Kohavi, 1995) with k = 10. In this way, our
dataset was split 10 times into different learning (66% of the
total dataset) and testing (34% of the total dataset) sets.
Learning and Testing the Model. We performed a learning phase
in which each classifier acquired the required knowledge from
the learning dataset. Hereafter, each algorithm classified the
rest of the dataset. We used a different configuration for each
machine-learning classifier (a code sketch with comparable
settings follows this list):
1. Naive Bayes: We employed EM parametric learning for training
the Bayesian network.
2. Random Forest: We ran the training algorithm with 100 random
trees.
3. J48: We trained this classifier with a confidence factor of
0.25 and a minimum number of objects of 2.
4. SMO: We used the Weka implementation of
SVM trained with a polynomial kernel and a
complexity factor of 1.
Evaluation of the Model. In order to evaluate each classifier's
capability, we measured the True Positive Ratio (TPR), that is,
the number of malware instances correctly detected divided by
the total number of malware files.
Moreover, we measured the False Positive Ratio (FPR), that is,
the number of benign executables misclassified as malware divided
by the total number of benign files. Note that keeping this
measure low is highly important in commercial antivirus products
for user satisfaction.
Furthermore, we measured the Total Accuracy, that is, the total
number of the classifier's hits divided by the number of instances
in the whole dataset.
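The evaluation above can be reproduced in spirit with scikit-learn analogues of the Weka classifiers listed earlier (BernoulliNB for Naive Bayes over binary features, DecisionTreeClassifier in place of J48, SVC with a polynomial kernel in place of SMO); the parameters below mirror the text where a direct counterpart exists, but this sketch is not the original Weka setup.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

CLASSIFIERS = {
    "Naive Bayes": BernoulliNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "J48 analogue": DecisionTreeClassifier(min_samples_leaf=2, random_state=0),
    "SMO analogue": SVC(kernel="poly", C=1.0),
}

def evaluate(X, y):
    """X: binary feature vectors, y: 1 = malware, 0 = benign."""
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for name, clf in CLASSIFIERS.items():
        predictions = cross_val_predict(clf, X, y, cv=cv)
        tn, fp, fn, tp = confusion_matrix(y, predictions).ravel()
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        tpr = tp / (tp + fn)   # malware correctly detected over all malware files
        fpr = fp / (fp + tn)   # benign files flagged as malware over all benign files
        print(f"{name}: accuracy={accuracy:.3f} TPR={tpr:.3f} FPR={fpr:.3f}")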
Table 3 shows the obtained results. Every classifier that we
tested achieved overall good results in terms of accuracy,
malware detection (TPR) and low false alarms (FPR).
Specifically, both of the decision tree algorithms that we
tested outperformed the rest of the classifiers in terms of TPR
and total accuracy. These two algorithms, J48 and Random Forest,
achieved very high accuracy (96% or more) and high detection
ratios (above 0.93).
Still, if we focus on the false positive ratio, Random Forest
presented the best results. Therefore, it seems to be the best
option for detecting generic malware.
Furthermore, the good results obtained render the proposed
system a very good approach for detecting unknown malware. In
this way, we think that this system can help to identify malware
variants faster for further analysis and the creation of their
signatures.

Table 3: Results per classifier.
Classifiers T. Accuracy (%) TPR FPR
Naive Bayes 91.6 0.852 0.020
Random Forest 96.2 0.937 0.013
J48 96.0 0.943 0.023
SMO 95.4 0.931 0.023
Average 94.8 0.915 0.019
4 RELATED WORK
Due to the great growth of malware in the last few
years, an extensive work to cope with this malicious
code plague has been published.
To deal with obfuscation techniques, various dynamic analysis
solutions based on sandbox environments have been developed,
using different techniques for monitoring the malware's
behaviour. TTAnalyze (Bayer et al., 2006), like our sandbox,
uses the Qemu PC emulator, but it runs a Windows operating
system combined with an API hooking technique. On the other
hand, CWSandbox (Willems et al., 2007) relies on DLL injection
to trace and monitor all relevant system calls and to generate
a machine-readable report.
Further, two different approaches to the classification of
malicious executables have been proposed (Rieck et al., 2008):
discrimination between families of malware, and discrimination
between specific malware instances and benign software.
First, recent work has focused on discriminating between
different malware families using clustering of behaviour reports
(Lee and Mody, 2006), transforming the reports into sequences
and then grouping them into clusters that finally represent
different families of malicious programs. Nevertheless, this
approach has some limitations derived from the unsupervised
nature of clustering methods and its inherent problems. Another
recent approach (Rieck et al., 2008) monitors the execution of
the suspicious file and builds a vector space model with the
frequencies of the contained strings as features, classifying
malicious behaviour into malware families with Support Vector
Machines (SVM). Second, Christodorescu et al. (2007) present an
approach to discriminate between malware and benign software
based on mining the differences between the behaviour reports
of malicious programs and benign executables.
5 CONCLUSIONS AND FUTURE
WORK
In this paper we have presented a distributed and automatic
system for malware detection based on the dynamic analysis of
suspicious files and their later classification into malware
and benign software groups. For the behaviour analysis we have
developed a new sandbox based on Qemu and Wine.
Focusing on the obtained results, we consider that the features
extracted from the behaviour logs are suitable for training
classifiers in order to detect malware. Hence, we have achieved
a high average accuracy (94.8%) and a low false positive ratio
(0.019). On the other hand, we have shown that the system
presents a great time performance. Still, our system also
presents some limitations, as follows.
First, in order to gain an advantage over antivirus researchers,
malware writers have included diverse evasion techniques
(Ferrie, 2006) based on bugs in virtual machine implementations.
Nevertheless, with the aim of reducing the impact of these
countermeasures, we can improve Qemu's source code in order to
fix these bugs and no longer be vulnerable to the
above-mentioned techniques.
Second, it is possible that some malicious actions are only
triggered under specific circumstances depending on the
environment, so relying on a single program execution may not
reveal all of the sample's behaviour. This can be addressed with
a technique called multiple execution paths (Moser et al., 2007),
which makes the system able to obtain the different behaviours
displayed by the suspicious executable.
Finally, as Wine is not a complete project, it does not
perfectly simulate a Windows operating system. However, it
evolves quickly, improving its re-implementation of the Win-32
API with each published version. Consequently, we should install
the latest stable version available in order to keep the system
up to date and to have a greater emulation capacity.
As future lines of work, we plan to expand the set of features
identified by the system, that is, to define more regular
expression rules that capture both malicious and legitimate
behaviour, since in this way the classification will give better
results. Moreover, combining our approach with static analysis
techniques may improve the obtained results.
REFERENCES
Batista, G., Prati, R., and Monard, M. (2004). A study of
the behavior of several methods for balancing machine
learning training data. ACM SIGKDD Explorations
Newsletter, 6(1):20–29.
Bayer, U., Kruegel, C., and Kirda, E. (2006). TTAnalyze:
A tool for analyzing malware. In Proceedings of the
15th Annual Conference of EICAR.
Bishop, C. (2006). Pattern recognition and machine learn-
ing. Springer New York.
Carrera, E. and Erdélyi, G. (2004). Digital genome
mapping – advanced binary malware analysis. In
Proceedings of the 14th Virus Bulletin Conference,
pages 187–197.
Christodorescu, M., Jha, S., and Kruegel, C. (2007). Mining
specifications of malicious behavior. In Proceedings
of the 6th joint meeting of the ESEC and the ACM
SIGSOFT Symposium on the Foundations of Software
Engineering, pages 5–14.
Ferrie, P. (2006). Attacks on virtual machine emulators. In
Proc. of AVAR Conference, pages 128–143.
Friedl, J. (2006). Mastering regular expressions. O’Reilly
Media, Inc.
Kohavi, R. (1995). A study of cross-validation and boot-
strap for accuracy estimation and model selection.
In Proceedings of the 14th International Joint Con-
ference on Artificial Intelligence, volume 14, pages
1137–1145.
Lee, T. and Mody, J. (2006). Behavioral classification. In
Proceedings of the 15th European Institute for Com-
puter Antivirus Research (EICAR) Conference.
Moser, A., Kruegel, C., and Kirda, E. (2007). Exploring
multiple execution paths for malware analysis. In Pro-
ceedings of the 28th IEEE Symposium on Security and
Privacy, pages 231–245.
Pietrek, M. (1994). Peering Inside the PE: A Tour of the
Win32 (R) Portable Executable File Format. Microsoft
Systems Journal, 3.
Rieck, K., Holz, T., Willems, C., Dussel, P., and Laskov,
P. (2008). Learning and Classification of Mal-
ware Behavior. Lecture Notes in Computer Science,
5137:108–125.
VX-Heavens (2009). VX heavens. Online:
http://vx.netlux.org/.
Willems, C., Holz, T., and Freiling, F. (2007). Toward au-
tomated dynamic malware analysis using CWSandbox.
IEEE Security & Privacy, 5(2):32–39.
Ye, Y., Wang, D., Li, T., Ye, D., and Jiang, Q. (2008).
An intelligent PE-malware detection system based on
association mining. Journal in Computer Virology,
4(4):323–334.