n-grams. For instance, in Chinese document classi-
fication (Zhou and Guan, 2002) or text classification
(Jacob and Gokhale, 2007).
An n-gram is any contiguous substring of length n taken
from a larger string. Extraction slides a window of fixed
length n over the string, one position at a time. For example,
the string “MALWARE” yields four 4-grams: “MALW”,
“ALWA”, “LWAR” and “WARE”.
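The segmentation just described can be sketched in a few lines of Python. This is only an illustration: the function name `extract_ngrams` is hypothetical and not taken from the paper, and the sketch assumes overlapping n-grams taken at every position, as the “MALWARE” example suggests.

```python
def extract_ngrams(data, n):
    """Return every contiguous substring of length n, in order of appearance.

    Works on any sliceable sequence (str or bytes). Inputs shorter than n
    yield an empty list.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return [data[i:i + n] for i in range(len(data) - n + 1)]


print(extract_ngrams("MALWARE", 4))  # ['MALW', 'ALWA', 'LWAR', 'WARE']
```

A string of length m produces m - n + 1 n-grams, which is why “MALWARE” (7 characters) yields exactly four 4-grams.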
Against this background, this paper advances the
state of the art in two main ways. First, we present
a new methodology for malware detection based
on the use of n-grams for file signature creation. Second,
we tackle the issue of false positives by introducing
a parameter, d, that controls how strictly the
method behaves when classifying an instance as malware
or benign software.
The remainder of this paper is structured as fol-
lows. Section 2 discusses related work. Section 3
introduces the method proposed in this paper. Section
4 describes the experiments performed and discusses
the obtained results. Finally, Section 5 concludes and
outlines the avenues of future work.
2 RELATED WORK
Lately, the problem of detecting unknown malicious
code has been addressed by using machine learning
and data mining (Schultz et al., 2001). Schultz et al.
proposed several data mining methods for the detection
of new malware; however, none of the proposed
techniques achieved a good balance between
false positive ratio and malware detection ratio in their
experimental results.
Closer to our approach, n-grams were first used
for malware analysis by an IBM research group
(Kephart, 1994), who proposed a method to automatically
extract malware signatures. However, their
work reported no experimental results.
Furthermore, Abou-Assaleh et al. (2004) proposed
an n-gram-based signature method to detect computer
viruses. In that approach, given a set of non-malicious
programs and a set of computer virus code, n-gram
profiles were generated for each class of software
(malicious or benign) in order to later classify any
unknown instance as benign or malicious code using
a k-nn algorithm with k = 1. That experiment achieved
very good results in terms of malware detection ratio;
still, no false positive ratio was reported. The missing
false positive figures and the absence of any technique
to control false positives make this method impractical
for commercial use.
3 METHOD DESCRIPTION
Our technique relies on a large set of training files
in order to build a representation of each file in that set.
The set is composed of a collection of malware
and benign programs. Specifically, the malware
comprises different kinds of malicious software (e.g.
computer viruses, Trojan horses and spyware).
Once the set is chosen, we extract the n-grams of
every file in it; these n-grams act as the file signature.
Thereafter, the system can classify any unknown
instance as malware or benign software. To this end,
we classify the unknown instance using the k-nearest
neighbour algorithm (Fix and Hodges, 1952),
one of the simplest machine learning algorithms
for classification problems. This algorithm identifies
the k nearest (that is, most similar) known instances
and then classifies the unknown instance according
to the classes (malware or benign) of those k
nearest instances.
The following measure function is used in order
to detect how much the unknown instance looks like
the known ones:
$$\frac{\sum_{x \in X} f(x)}{\mathrm{card}(X) + \mathrm{card}(Y)} \qquad (1)$$
where X is the set of n-grams of the unknown instance,
x is any n-gram in X, Y is the set of n-grams of an
instance whose class is known, and f(x) is
the number of occurrences of the n-gram x in Y.
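As a minimal sketch of measure (1), the Python function below computes the value for one unknown/known pair. Several details are assumptions rather than part of the paper: the names `ngrams` and `similarity` are hypothetical, X is taken as the set of distinct n-grams of the unknown instance, Y is kept as a multiset so that f(x) can count repeated occurrences, and card(·) is taken as the number of distinct n-grams.

```python
from collections import Counter


def ngrams(data, n):
    """All contiguous substrings of length n (hypothetical helper)."""
    return [data[i:i + n] for i in range(len(data) - n + 1)]


def similarity(unknown, known, n):
    """Measure (1): sum of f(x) over x in X, divided by card(X) + card(Y).

    Assumptions: X holds the distinct n-grams of `unknown`; Y maps each
    n-gram of `known` to its occurrence count, so f(x) = Y[x]; card() counts
    distinct n-grams.
    """
    X = set(ngrams(unknown, n))
    Y = Counter(ngrams(known, n))
    denominator = len(X) + len(Y)
    if denominator == 0:
        return 0.0  # neither instance is long enough to contain an n-gram
    matches = sum(Y[x] for x in X)  # Counter yields 0 for absent n-grams
    return matches / denominator


print(similarity("MALWARE", "SOFTWARE", 4))  # shares only "WARE": 1 / (4 + 5)
```

Under these assumptions, two identical files with no repeated n-grams score 0.5, since every n-gram matches once and the denominator counts both sets.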
Once every instance of the known set has been compared
with the unknown instance using this measure, we select
the k known instances with the highest values. We
consider the unknown instance to be malware only if,
among these k most similar files, the number of malware
instances minus the number of benign instances is greater
than or equal to a parameter d, as shown in the following
formula:
$$MW(K) - GW(K) \geq d \qquad (2)$$
where MW(K) is the number of malware instances among
the k nearest neighbours, GW(K) is the number of benign
instances among the k nearest neighbours, and d is the
decision threshold.
This parameter d controls how strict the system is
when classifying an unknown instance as malware
or benign software. Moreover, it is the parameter
we use to keep the false positive ratio low: with
a high value of d (which must always be less than or
equal to k), we expect the false positive ratio to
remain low.
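The decision rule in formula (2) can be illustrated with the following sketch. It assumes the similarity of the unknown instance to every known instance has already been computed; the function name `classify` and the label strings "malware" and "benign" are hypothetical choices made for this example only.

```python
def classify(scored_neighbours, k, d):
    """Apply rule (2) to a list of (similarity, label) pairs.

    Labels are assumed to be the strings "malware" or "benign". The k pairs
    with the highest similarity are the nearest neighbours.
    """
    if d > k:
        raise ValueError("d must be less than or equal to k")
    top_k = sorted(scored_neighbours, key=lambda pair: pair[0], reverse=True)[:k]
    mw = sum(1 for _, label in top_k if label == "malware")  # MW(K)
    gw = sum(1 for _, label in top_k if label == "benign")   # GW(K)
    return "malware" if mw - gw >= d else "benign"


scores = [(0.9, "malware"), (0.8, "malware"), (0.7, "benign"), (0.1, "benign")]
print(classify(scores, k=3, d=1))  # malware: 2 - 1 >= 1
print(classify(scores, k=3, d=3))  # benign: 2 - 1 < 3
```

With k = 3, raising d from 1 to 3 means all three nearest neighbours must be malware before the unknown instance is flagged, which is the stricter behaviour the parameter is meant to provide.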
ICEIS 2009 - International Conference on Enterprise Information Systems