folder. Then, the System Manager is informed and
negotiates the extraction of features with the regular
expression parser. Finally, the information gathered
is stored in a database with the aim of further classifi-
cation.
2.2 Behaviour Monitoring
Behaviour monitoring is a dynamic analysis technique
in which the suspicious file is executed inside a
contained and secure environment, called a sandbox,
in order to obtain a complete and detailed trace of
the actions it performs in the system.
As 87.9 % of users work with MS-based platforms,
most malware is aimed at these operating systems.
Therefore, we have developed a sandbox for the
dynamic analysis of Windows Portable Executable (PE)
files (Pietrek, 1994).
To this end, the suspicious files are executed inside
an emulated environment and the relevant Windows API
calls are logged, revealing the program’s behaviour.
We propose a new sandbox approach that uses both
emulation (Qemu) and simulation (Wine) techniques,
with the aim of achieving the greatest possible
transparency without interfering with the system.
We now describe the two main platforms of our
sandbox solution.
First, Wine is an open-source and complete
reimplementation (simulation) of the Win32 Application
Programming Interface (API). Hence, it allows Windows
PE files to run as if natively under Unix-based
operating systems. Moreover, the main reason for our
choice is that, since Wine is open-source, we can
modify and improve the code to adjust it to our needs
and therefore log any call to the Win32 API made by
the malware.
Second, Qemu is an open-source, pure-software
virtual machine emulator that works by performing
equivalent operations in software for any given CPU
instruction. Unfortunately, several malicious
executables are able to detect that they are being
executed in a contained environment by exploiting
different bugs within this virtual machine.
Nevertheless, these bugs can be fixed easily (Ferrie,
2006).
To this end, we have made several improvements to
Wine. First, we have modified Wine’s source code to
write every call made by a process (identified by its
PID) to the Windows API (divided into families, e.g.
registry, memory or files) into a log, specifying the
state of the parameters before and after the function’s
body. Thereby, we obtain a complete and homogeneous
trace of the behaviour of all processes, without any
interference with the system. Second, we have made
several modifications to the basic Wine installation
by adding various Windows XP system DLLs and creating
additional registry keys and folder structures.
In this way, we have obtained a system that is as
transparent and as similar to a Windows OS as possible,
in order to be able to execute the largest possible
variety of programs.
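The logging idea described above, recording the parameters of each call before and after the function body together with the calling PID and the API family, can be illustrated with a minimal sketch. This is not the modified Wine code (Wine is written in C). The names `traced`, `TRACE` and `reg_set_value`, as well as the log line layout, are hypothetical and chosen only for illustration.

```python
import functools
import os

# In-memory stand-in for the sandbox log (hypothetical).
TRACE = []


def traced(family):
    """Log a call's arguments before the body runs and its result afterwards."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            pid = os.getpid()
            name = func.__name__
            # State of the parameters before the function body.
            TRACE.append(f"{pid} [{family}] {name} ENTER args={args!r}")
            result = func(*args, **kwargs)
            # State of the parameters and the result after the function body.
            TRACE.append(f"{pid} [{family}] {name} LEAVE args={args!r} result={result!r}")
            return result
        return wrapper
    return decorator


@traced("registry")
def reg_set_value(key, name, value):
    # Hypothetical stand-in for a registry API function; always succeeds.
    return 0


reg_set_value("HKCU\\Software\\Example", "Run", "sample.exe")
for line in TRACE:
    print(line)
```

Wrapping each function at a single point keeps the trace format homogeneous across API families, which is the property the later parsing stage relies on.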
2.3 Feature Extraction
Whenever the input data to machine learning and
classification methods is too large and complex, it is
necessary to transform this data into a reduced
representation set of carefully chosen features
(a feature vector), a process known as feature
extraction.
For each malware sample analysed in the sandbox, we
obtain a complete raw trace of its detailed behaviour.
These reports are not suitable for machine learning
applications, as these usually work with vectorial
data. Hence, with the aim of automatically extracting
relevant information in vector format from the traces,
we developed regular expression rules and a parser to
match them against the traces. Regular expressions are
a powerful and fast tool for identifying patterns of
characters within a text (Friedl, 2006), in our case,
the detailed trace log.
To this end, we have defined, with the help of an
expert, different specific actions taken by the program
as regular expression rules, in order to build the
knowledge base. In this way, the creation of new rules
becomes very simple and intuitive, and the system is
easily improvable.
Moreover, most of the actions defined are
characteristic of malware, but there are rule
definitions for both benign and malicious behaviour.
Therefore, we have converted all the information
obtained into a vector by parsing the behaviour reports
with the defined regular expressions.
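A rule base of this kind might be sketched as follows. The rule names, the regular expressions and the trace line layout are all assumptions made for illustration; the actual rules and log format of the system are not given in the text.

```python
import re

# Hypothetical knowledge base: action name -> regular expression
# matched against a single line of the trace log.
RULES = {
    "writes_autorun_key": re.compile(
        r"RegSetValue\w*\(.*\\CurrentVersion\\Run", re.IGNORECASE),
    "creates_exe_in_system_dir": re.compile(
        r"CreateFile\w*\(.*\\system32\\[^\\]+\.exe", re.IGNORECASE),
    "reads_file": re.compile(r"ReadFile\("),
}


def detect_actions(trace_lines):
    """Return the set of rule names that match at least one trace line."""
    detected = set()
    for line in trace_lines:
        for name, pattern in RULES.items():
            if pattern.search(line):
                detected.add(name)
    return detected


# Hypothetical trace excerpt (PID, API family, call).
trace = [
    r'1234 registry RegSetValueExA("HKLM\Software\Microsoft\Windows\CurrentVersion\Run", "svc")',
    r'1234 files ReadFile(0x44, 512)',
]
print(sorted(detect_actions(trace)))  # ['reads_file', 'writes_autorun_key']
```

Under this layout, adding a new rule amounts to adding one entry to the dictionary, which matches the claim that the knowledge base is easy to extend.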
To this end, we defined a program behaviour as a
vector made up of the aforementioned characteristics.
More specifically, we represented an executable as a
vector $\vec{v}$ composed of binary characteristics
$c$, where each $c$ can be either 1 (true) or 0
(false), $\vec{v} = (c_1, c_2, c_3, \ldots, c_{n-1}, c_n)$,
and $n$ is the number of monitored actions.
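As a minimal sketch of this representation, assume a fixed ordering of the n monitored actions and a set of actions detected in one trace. The action names below are hypothetical. Each position of the vector is set to 1 when the corresponding action was detected and to 0 otherwise.

```python
# Hypothetical, fixed ordering of the n monitored actions.
ACTIONS = [
    "writes_autorun_key",
    "creates_exe_in_system_dir",
    "reads_file",
    "opens_network_socket",
]


def to_feature_vector(detected, actions=ACTIONS):
    """Map a set of detected action names to a binary vector (c_1, ..., c_n)."""
    return [1 if action in detected else 0 for action in actions]


v = to_feature_vector({"reads_file", "writes_autorun_key"})
print(v)  # [1, 0, 1, 0]
```

Keeping the ordering fixed across all samples is what makes the vectors comparable position by position.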
In this way, we have characterised the vector
information as binary digits, called features, each one
representing the corresponding characteristic of the
behaviour. When parsing a report, if one of the defined
actions is detected by a rule, the corresponding
feature is activated. The resulting vector for each
program’s trace is a finite sequence of bits, a
suitable representation for classifiers to effectively
recognise patterns and correlate similarities across a
huge number of instances (Lee and Mody, 2006).
Likewise, both in-raw
ICEIS 2010 - 12th International Conference on Enterprise Information Systems