REAL TIME DETECTION OF NOVEL ATTACKS BY MEANS OF DATA MINING TECHNIQUES

∗

Marcello Esposito, Claudio Mazzariello, Francesco Oliviero, Simon Pietro Romano, Carlo Sansone

Dipartimento di Informatica e Sistemistica – Università degli Studi di Napoli “Federico II”

Via Claudio 21, 80125 Napoli (Italy)

Keywords:

Intrusion Detection, Trafﬁc Features.

Abstract:

Rule-based Intrusion Detection Systems (IDS) rely on a set of rules to discover attacks in network trafﬁc. Such

rules are usually hand-coded by a security administrator and statically detect one or few attack types: minor

modiﬁcations of an attack may result in detection failures. For that reason, signature based classiﬁcation is

not the best technique to detect novel or slightly modiﬁed attacks. In this paper we approach this problem by

extracting a set of features from network trafﬁc and computing rules which are able to classify such trafﬁc.

Such techniques are usually employed in off line analysis, as they are very slow and resource-consuming. We

want to assess the feasibility of a detection technique which combines the use of a common signature-based

intrusion detection system and the deployment of a data mining technique. We will introduce the problem,

describe the developed architecture and show some experimental results to demonstrate the usability of such

a system.

1 INTRODUCTION

Security is one of the main concerns in the devel-

opment of new technologies and services over the

Internet. The most common and best known tools

used to ensure security of companies, campuses and,

more in general, of any network, are Firewalls and

Antiviruses. Though famous and well known, such

tools alone are not enough to protect a system from

malicious activities. Basing one’s own site’s security on the deployment of these instruments relies on the idea that intrusion prevention will suffice in efficiently assuring data availability, confidentiality and integrity. Indeed, an interesting idea about intrusions is

that they will sooner or later happen, despite the security policy a network administrator deploys. Based on this assumption, researchers started to develop

instruments able to detect successful intrusions and,

in some cases, trace back the path leading to the at-

tack source. This is a more pessimistic, though much

more realistic way to look at the problem of network

security.

∗ Research outlined in this paper is partially funded by the Ministero dell’Istruzione, dell’Università e della Ricerca (MIUR) in the framework of the FIRB Project “Middleware for advanced services over large-scale, wired-wireless distributed systems (WEB-MINDS)”

2 RELATED WORK

This work has many liaisons with both intrusion de-

tection and data mining.

As to the ﬁrst research ﬁeld, intrusion detection is

the art of detecting inappropriate, incorrect or anom-

alous activity within a system, be it a single host or

a whole network. An Intrusion Detection System

(IDS) analyzes a data source and, after preprocess-

ing the input, lets a detection engine decide, based

on a set of classiﬁcation criteria, whether the ana-

lyzed input instance is normal or anomalous, given a

suitable behavior model. Intrusion Detection Systems

can be grouped into three main categories: Network-

based Intrusion Detection Systems (N-IDS) (Vigna

and Kemmerer, 1999), Host-based Intrusion De-

tection Systems (H-IDS) (Andersson, 1995) (Tyson,

2000) and Stack-based Intrusion Detection Systems

(S-IDS) (Laing and Alderson, 2000). This classiﬁ-

cation depends on the information sources analyzed

to detect an intrusive activity. An N-IDS analyzes

packets captured directly from the network. By set-

ting network cards in promiscuous mode, an IDS can

monitor trafﬁc in order to protect all of the hosts con-

nected to a speciﬁed network segment. On the other

hand, an H-IDS focuses on a single host’s activity:

the system protects such a host by directly analyzing

the audit trails or system logs produced by the host’s

operating system. Finally, S-IDS are hybrid systems,


Esposito M., Mazzariello C., Oliviero F., Pietro Romano S. and Sansone C. (2005).

REAL TIME DETECTION OF NOVEL ATTACKS BY MEANS OF DATA MINING TECHNIQUES.

In Proceedings of the Seventh International Conference on Enterprise Information Systems, pages 120-127

DOI: 10.5220/0002536601200127

 SciTePress

which operate similarly to an N-IDS, but only analyze

packets concerning a single host of the network. They

monitor both inbound and outbound trafﬁc, follow-

ing each packet all the way up the TCP/IP protocol

stack, thus allowing the IDS to pull the packet out of

the stack even before any application or the operating system processes it. The load each S-IDS must sustain is lower than the total traffic on the network, thus keeping the analysis overhead within reasonable bounds; hypothetically, each host on the network could run an S-IDS.

Intrusion Detection Systems can be roughly clas-

siﬁed (Figure 1) as belonging to two main groups as

well, depending on the detection technique employed:

anomaly detection and misuse detection (Bace, 2000).

Both such techniques rely on the existence of a reli-

able characterization of what is normal and what is

not, in a particular networking scenario.

More precisely, anomaly detection techniques base

their evaluations on a model of what is normal, and

classify as anomalous all the events that fall outside

such a model. However, recognizing an anomalous behavior does not necessarily imply that an attack has occurred: only a few anomalies can actually be classified as attempts to compromise the security of the system. Thus, anomaly detection techniques suffer from a relatively serious problem: they generate a large number of false alarms. On the other hand, the primary advantage of anomaly detection is its intrinsic capability to discover novel attack types.

Numerous approaches exist which determine the vari-

ation of an observed behavior from a normal one. A

ﬁrst approach is based on statistical techniques. The

detector observes the activity of a subject (e.g. num-

ber of open ﬁles or TCP state transitions), and creates

a proﬁle representing its behavior. Every such proﬁle

is a set of “anomaly measures”. Statistical techniques

can then be used to extract a scalar measure represent-

ing the overall anomaly level of the current behavior.

The proﬁle measure is thus compared with a threshold

value to determine whether the examined behavior is

anomalous or not. A second approach, named predic-

tive pattern generation, is based on the assumption

that an attack is characterized by a speciﬁc sequence,

i.e. a pattern, of events. Hence, if a set of time-based

rules describing the temporal evolution of the user’s

normal activity exists, an anomalous behavior is de-

tected in case the observed sequence of events signif-

icantly differs from a normal pattern.
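The statistical approach described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a profile stores a mean and a standard deviation for each measure, that the scalar anomaly level is the mean squared z-score, and that the threshold is chosen by the operator; none of these choices, nor the names `PROFILE`, `anomaly_level` and `is_anomalous`, come from the works cited here.

```python
from math import fsum

# Hypothetical profile: per-measure (mean, standard deviation) learned from
# normal activity. The measure names and values are illustrative only.
PROFILE = {
    "open_files": (10.0, 2.0),
    "tcp_state_transitions": (50.0, 5.0),
}


def anomaly_level(observation, profile):
    """Combine per-measure deviations into one scalar (mean squared z-score)."""
    scores = []
    for name, (mean, std) in profile.items():
        z = (observation[name] - mean) / std
        scores.append(z * z)
    return fsum(scores) / len(scores)


def is_anomalous(observation, profile, threshold=9.0):
    """Flag the behavior when the scalar anomaly level exceeds the threshold."""
    return anomaly_level(observation, profile) > threshold
```

An observation equal to the profile means yields a level of zero; the further the measures drift from the profile, the higher the level, until the assumed threshold is crossed.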

Misuse detection, also known as signature detec-

tion, is performed by classifying as attacks all the

events conforming to a model of anomalous behav-

ior. This technique is based on the assumption that an

intrusive activity is characterized by a signature, i.e. a

well-known pattern. Similarly to anomaly detection,

misuse detection can use either statistical techniques

or even a neural network approach to predict intrusions. In practice, the rule-based approach is the most widely used to detect attacks (SNORT (Baker et al., 2004) and Bro (Paxson and Terney, 2004)). Intrusions are coded by means of a set of rules: as soon as the examined event matches one of the rules, an attack is detected. A drawback of this approach is that only well-known intrusive activities can be detected, so that the system is vulnerable to novel aggressions; sometimes, a few variations in an attack pattern are enough to generate an intrusion that the IDS is not able to detect.

Figure 1: Approaches to Intrusion Detection

The main problem related to both anomaly and

misuse detection techniques resides in the encoded

models, which deﬁne normal or malicious behav-

iors. Although some recent open source IDS, such

as SNORT or Bro, provide mechanisms to write new

rules that extend the detection ability of the system,

such rules are usually hand-coded by a security ad-

ministrator, representing a weakness in the deﬁni-

tion of new normal or malicious behaviors. Recently,

many research groups have focused their attention on

the deﬁnition of systems able to automatically build a

set of models. Data mining techniques are frequently

applied to audit data in order to compute speciﬁc be-

havioral models (MADAM ID (Lee and Stolfo, 2000),

ADAM (Barbara et al., 2001)).

Coming to the second related research ﬁeld, we

recall that data mining refers to the process of extracting specific models from a large amount of stored data (Fayyad et al., 1996). Machine

learning or pattern recognition processes are usually

exploited in order to realize this extraction (SLIPPER (Cohen and Singer, 1999)). These processes

may be considered as off-line processes. In fact, all

the techniques used to build intrusion detection mod-

els need a proper set of audit data. The informa-

tion must be labelled as either “normal” or “attack”

in order to deﬁne the suitable behavioral models that

http://www.snort.org

http://www.bro-ids.org

http://www-2.cs.cmu.edu/~wcohen/slipper/


represent these two different categories. Such audit

data are quite complicated to obtain. The data set

used for The Third International Knowledge Discov-

ery and Data Mining Tools Competition, the 1999

KDD data

(Lee and Stolfo, 2000)(Elkan, 2000), is

probably the most well-known example of this kind

of information, representing a processed version of

the DARPA Intrusion Detection Evaluation Program

database, collected and managed by the MIT Lincoln

Laboratory. The DARPA database contains tcpdump

data related to seven weeks of network traffic generated over an emulated military LAN. The KDD data set contains five million connection records labelled as “normal” or “attack”.

3 RATIONALE AND MOTIVATION

Strategies for non punctual intrusion detection often

do not take into account the concern of real-time

processing of network traffic. However, an effective

IDS should be able to produce the analysis results in

time to react and possibly activate countermeasures

against malicious behaviors.

The ability to detect an intrusion as soon as it oc-

curs is mandatory for an IDS. The most common

types of attacks, e.g. denial of service, can be very

dangerous if they are not detected in time. Although

some IDS store audit data for later analysis, most

of them examine such data in real-time so that the

system can perform the actions necessary in order

to avoid serious problems. Commonly used N-IDS

typically analyze packets captured from the network,

ﬁnding in the current packet the signature of an at-

tack in-progress. However, malicious activity cannot

be detected by examining just a single packet: some

types of attacks generate in a certain time interval a

great amount of packets belonging to different ses-

sions. Hence an efﬁcient detection needs statistical

parameters taking into account the temporal relation

between sessions. As stated before, Stolfo et al. (Lee

and Stolfo, 2000) have deﬁned a set of connection

features which summarize the temporal and statisti-

cal relations of the connections with reference to each

other. These features have been used to create the

connection records contained in the KDD database.

Several data mining processes use these connection

features to extract suitable behavioral models.

Trafﬁc model deﬁnition based on an off-line analy-

sis does not consider the unavoidable problems of

real-time computation of connection features. The

data mining process operates on a database, in which

data can be organized in a suitable way in order to

http://kdd.ics.uci.edu/

compute the features. In real-time intrusion detec-

tion, instead, the incoming packets do not contain all

of the information needed to compute the connection

features, but an appropriate system has to be imple-

mented in order to compute relations among the ex-

isting connections. Moreover, off-line analysis does

not consider the problem of potential packet losses in

the IDS, which has to be taken into account in the case

of real time analysis.

Our research aims to develop a framework for real-

time intrusion detection. The system we present

should be capable of effectively detecting intrusions and of operating under a variety of traffic conditions, thus

providing an exploitable solution to the issue of real-

time analysis. Anomaly detection proves to be the

most suitable solution for our purpose, even though

such technique has the well known drawback related

to the relatively high number of false alarms raised.

Our intrusion detection system can be classiﬁed as

rule-based. Unfortunately the deﬁnition of a rule for

every attack is not an efﬁcient solution. On one hand,

this approach is not able to detect novel attack pat-

terns; on the other hand, the deﬁnition of new attacks

has a negative impact both on the computation load

and on the average time required to analyze every sin-

gle packet (hence, the related packet loss problem). In

order to overcome the above mentioned drawbacks,

by using a set of parameters derived from Stolfo’s connection features, which cover a wide range of attack types, it is possible to adopt different data mining processes in order to characterize the attacks by means of different sets of rules.

Summarizing the above considerations, with this work we are interested in the analysis of real-time intrusion detection. To this purpose, we will exploit data mining techniques to design a novel intrusion detection framework. We will present an implementation of the framework and evaluate its performance in a real network scenario, focusing on two main performance figures: packet processing time and the system resources needed to compute the connection features.

4 THE REFERENCE MODEL

In this section we present our framework for real-time

intrusion detection. The overall model is composed

of two parts: the former is the data mining process,

which extracts behavioral models from pre-elaborated

network trafﬁc, and consists of a database of labelled

connection features and a data mining algorithm; the

latter is a real-time intrusion detection system which

analyzes and classiﬁes network trafﬁc based on the

models inferred (Figure 2). In particular, we execute the off-line data mining process on a data set in order to extract a set of rules; such a set is then used in a real-time classification process deployed by the IDS, which compares these pre-computed rules with the information extracted from real-time network traffic.

Figure 2: Reference Framework Model

ICEIS 2005 - INFORMATION SYSTEMS ANALYSIS AND SPECIFICATION

Data mining is part of a more complex KDD

(Knowledge Discovery in Databases) process consist-

ing of data analysis and discovery algorithms applied

to a database in order to extract high level informa-

tion — the patterns or the models — able to describe

a subset of the data. The models can be applied to un-

known data values in order to predict the right class to

which they belong. As we emphasized in the previous

section, such data mining processes operate on a set

of data which has been organized in a suitable fash-

ion (e.g. all the data are identiﬁed by a label which

explicitly speciﬁes the category they belong to).

In order to implement an efﬁcient classiﬁer, it is

important to deﬁne a suitable set of features to be ex-

tracted from the network trafﬁc contained in the data-

base. The greater the capability of the set of features

to discriminate among different categories, the better

the classiﬁer. There are three levels at which feature

sets may be deﬁned:

• The features may be referred to the single packet

captured from the network:

although this set is easy to compute, it is not able

to detect all the potential attack types.

• A set of features related to the entire session which

the packet belongs to may be deﬁned:

this is due to the fact that some intrusions may be

realized by means of a sequence of packets belong-

ing to either the same connection or different con-

nections.

• The computed set of features may perform a sta-

tistical analysis of the relation between the current

session and the other ones:

this is needed in order to capture intrusions which

affect the interrelation among different sessions.

To cope with the aforementioned requirements, we

have adopted a model descending from the one pro-

posed by Stolfo. We are interested in TCP, UDP and

ICMP trafﬁc. Therefore, a clear deﬁnition of the term

connection is necessary. For a TCP stream the con-

nection can be deﬁned, relying on the protocol speci-

ﬁcations, as the collection of messages exchanged be-

tween a client process and a server process. For UDP

and ICMP we considered each packet as a single, self-

contained connection.

The features defined by Stolfo et al. can be classified in three main groups: intrinsic features, content

features, and trafﬁc features. Intrinsic features spec-

ify general information on the current session, like the

duration in seconds of the connection, the protocol

type, the port number (i.e. the service), the number

of bytes from the source to the destination, etc. (see

Table 1).

Table 1: Intrinsic Features

Feature          Description
duration         connection duration (s)
protocol_type    type of transport protocol
service          port number on the server side
src_bytes        bytes from source to destination
dst_bytes        bytes from destination to source
flag             status of the connection
land             land attack
wrong_fragment   number of wrong fragments
urgent           number of urgent packets
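As an illustration of how some of the intrinsic features in Table 1 could be derived from the packets of one connection, the following is a minimal sketch. The `Packet` record and the `intrinsic_features` function are hypothetical and are not Snort structures; the sketch assumes that the first packet is sent by the client, and that the land feature is set when source and destination address and port coincide.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    """Hypothetical minimal packet summary (not a Snort structure)."""
    timestamp: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    payload_len: int
    from_client: bool


def intrinsic_features(packets, protocol):
    """Derive a few Table 1 features from the packets of one connection.

    Assumes `packets` is ordered in time and that the first packet is sent
    by the client, so its destination port identifies the service.
    """
    first, last = packets[0], packets[-1]
    return {
        "duration": last.timestamp - first.timestamp,
        "protocol_type": protocol,
        "service": first.dst_port,
        "src_bytes": sum(p.payload_len for p in packets if p.from_client),
        "dst_bytes": sum(p.payload_len for p in packets if not p.from_client),
        # land: 1 when source and destination address and port coincide
        "land": int(first.src_ip == first.dst_ip
                    and first.src_port == first.dst_port),
    }
```

For UDP and ICMP, where each packet is treated as a self-contained connection, the same function can be called with a single-element list.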

The content features are related to the semantic

content of connection payload: for example, they

specify the number of failed login attempts, or the

number of shell prompts (Table 2).

Table 2: Content Features

Feature          Description
hot              number of hot indicators
failed_logins    number of failed login attempts
logged_in        successfully logged in
compromised      number of compromised conditions
root_shell       root shell is obtained
su               su root command attempted
file_creations   number of file creations
shells           number of shell prompts
access_files     number of file accesses
outbound_cmds    outbound commands in ftp
hot_login        the login belongs to the hot list
guest_login      the login is a guest login

The trafﬁc features can be divided in two groups:

the same host and the same service features. The

same host features examine all the connections in the

last two seconds to the same destination host as the

one involved in the current connection, and compute either the number of such connections or the rate of connections that have a “SYN” error. Instead,

the same service features examine all the connections

in the last two seconds to the same destination service as the current one. These two feature sets are called time-based traffic features because they analyze all the events which have occurred in a time interval of two seconds (Table 3); some types of attacks, however, such as slow probing, may occur every few minutes.

Therefore these features might not be able to detect all

the attack types. To this aim a new set of trafﬁc fea-

tures, called host-based, has been deﬁned; same host

and same service trafﬁc features are also computed

over a window of one hundred connections rather than

over a time interval of two seconds. In our frame-

work we will only adopt intrinsic and trafﬁc features.

Our purpose is to implement a network-based intru-

sion detection system, and we deem the content fea-

tures more suitable for a host-based scenario. Thanks

to the access to the operating system’s audit trails or

system logs, an H-IDS is more efﬁcient in the analysis

of the execution of dangerous commands on a single

host.
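The difference between the time-based and the host-based windows can be sketched as follows. This is a minimal illustration rather than the procedure used in our plugin: the `Conn` record, its field names and the `same_host_features` function are hypothetical, and the host-based window is assumed to be the latest one hundred connections overall, filtered by destination host.

```python
from collections import namedtuple

# Hypothetical connection record; the field names are illustrative.
Conn = namedtuple("Conn", "start dst_host service syn_error")


def _stats(conns):
    """Count the connections and compute the fraction with SYN errors."""
    count = len(conns)
    serror = sum(c.syn_error for c in conns) / count if count else 0.0
    return {"count": count, "serror_rate": serror}


def same_host_features(history, current, window_s=2.0, window_n=100):
    """Return (time_based, host_based) same-host statistics for `current`.

    history: earlier connections, ordered from oldest to most recent.
    """
    same_host = [c for c in history if c.dst_host == current.dst_host]
    # Time-based window: connections started in the last `window_s` seconds.
    recent = [c for c in same_host if current.start - c.start <= window_s]
    # Host-based window: the latest `window_n` connections, whatever their age.
    last_n = [c for c in history[-window_n:] if c.dst_host == current.dst_host]
    return _stats(recent), _stats(last_n)
```

A slow probe that contacts the same host once every few minutes never has more than one connection inside the two-second window, but still accumulates inside the window of one hundred connections.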

The proposed real-time IDS architecture consists of

three components: a sniffer, a processor, and a classifier. The sniffer is the lowest component of the architecture; connected directly to the network infrastructure, this module captures all the packets on the wire.

Snifﬁng is made possible by setting the network card

in promiscuous mode. Usually the sniffer also trans-

lates raw packets into a human-readable format.

The processor component processes the packets captured by the sniffer in order to extract the needed

set of features. The main issue of the features compu-

tation process is related to the need of keeping up-

to-date information about the current connection, as

well as the other active sessions. We have to keep in

memory a representation of the current network state

in order to evaluate the statistical relations among the

active connections. Data in memory have to be prop-

erly organized in order to reduce the features compu-

tation time.

The classiﬁer is the core of the architecture; this

component analyzes the current connection features

and classiﬁes them. Based on the misuse detection ap-

proach, the process of classiﬁcation uses a set of rules

extracted by data mining algorithms. The features are

compared against all the rules in the set; when the ex-

amined vector of features matches at least one rule, an

intrusive action is detected. As with the connection data in the processor component, the rules may be organized in memory in a suitable way in order to reduce the analysis time.
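The matching step of the classifier can be sketched as follows. The rule format, the feature thresholds and the names `RULES`, `rule_matches` and `classify` are hypothetical and chosen only for illustration; rule sets produced by an actual data mining algorithm may have a richer structure (for instance, per-rule weights).

```python
import operator

# Comparison operators allowed in a rule condition.
OPS = {"<=": operator.le, ">": operator.gt, "==": operator.eq}

# Hypothetical rules: each rule is a list of (feature, operator, value)
# conditions that must all hold. The thresholds are made up for illustration.
RULES = [
    [("count", ">", 50), ("serror_rate", ">", 0.8)],
    [("protocol_type", "==", "icmp"), ("src_bytes", ">", 1000)],
]


def rule_matches(rule, features):
    """A rule matches when every one of its conditions is satisfied."""
    return all(OPS[op](features[name], value) for name, op, value in rule)


def classify(features, rules=RULES):
    """Report an attack as soon as the feature vector matches at least one rule."""
    return "attack" if any(rule_matches(r, features) for r in rules) else "normal"
```

Because `any` stops at the first matching rule, ordering the rules so that the most frequently matched ones come first is one simple way to reduce the average analysis time.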

5 REAL-TIME IDS IMPLEMENTATION ISSUES

The implemented architecture addresses the main re-

quirements of a real-time detection system: monitor-

ing the network trafﬁc in order to extract a set of fea-

tures from it, as well as behavior classiﬁcation based

on the extracted features. Monitoring, in particular,

is the most challenging issue to face from the point

of view of a real-time analysis. In our architecture,

the monitoring system can be divided into two com-

ponents: the sniffer that captures trafﬁc from the net-

work, and the processor that computes both the in-

trinsic and the trafﬁc features. While in an off-line

analysis feature computation is simpler, since all the information about connections is stored in a database, in a real-time analysis statistical measures have to be computed every time a new packet is captured from the network (DFP, 2004).

In order to extract features from the traffic, an effective processor must satisfy two requirements:

• it holds information about the state of the connec-

tion which the analyzed packet belongs to;

• it holds comprehensive information about the traf-

ﬁc ﬂows that have already been seen across the net-

work.

According to the deﬁnition proposed in the previ-

ous section, every packet can be considered as a sin-

gle unit that is inserted in a more complex structure,

namely the connection, and on which the features are

computed. While neither UDP nor ICMP traffic requires heavy computation, TCP traffic requires emulating the TCP state diagram on both the client and the server sides for every active connection. In particular, when a new packet is captured,

the system retrieves information about the connection

to which such a packet belongs and updates the con-

nection state of both the client and the server based on

the TCP protocol speciﬁcations.
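The per-connection state update described above can be sketched with a deliberately reduced transition table. This is an illustration under simplifying assumptions, not the logic of the Snort stream reassembly preprocessor: only a handful of the transitions of the TCP state diagram are modelled, flags are passed as plain strings, and the names `TRANSITIONS`, `event_name` and `ConnectionTracker` are hypothetical.

```python
# Reduced subset of the TCP state diagram; the full diagram has more states
# and transitions (e.g. simultaneous open, the closing sequence, timers).
TRANSITIONS = {
    ("CLOSED", "send", "SYN"): "SYN_SENT",
    ("LISTEN", "recv", "SYN"): "SYN_RCVD",
    ("SYN_SENT", "recv", "SYN+ACK"): "ESTABLISHED",
    ("SYN_RCVD", "recv", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "send", "FIN"): "FIN_WAIT_1",
    ("ESTABLISHED", "recv", "FIN"): "CLOSE_WAIT",
}


def event_name(flags):
    """Map the set of TCP flags of a packet to one event label."""
    flags = set(flags)
    if "RST" in flags:
        return "RST"
    if {"SYN", "ACK"} <= flags:
        return "SYN+ACK"
    if "SYN" in flags:
        return "SYN"
    if "FIN" in flags:
        return "FIN"
    return "ACK"


class ConnectionTracker:
    """Keeps a (client_state, server_state) pair for every connection key."""

    def __init__(self):
        self.states = {}

    def update(self, key, from_client, flags):
        """Update both endpoint states for one captured packet."""
        client, server = self.states.get(key, ("CLOSED", "LISTEN"))
        event = event_name(flags)
        if event == "RST":
            client, server = "CLOSED", "CLOSED"
        else:
            # The sender performs "send", the other endpoint performs "recv".
            c_act, s_act = ("send", "recv") if from_client else ("recv", "send")
            client = TRANSITIONS.get((client, c_act, event), client)
            server = TRANSITIONS.get((server, s_act, event), server)
        self.states[key] = (client, server)
        return client, server
```

Here `key` would typically be the tuple of client address, client port, server address and server port; packets that do not trigger a modelled transition leave the states unchanged.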

In order to compute the statistical relations, infor-

mation on the past TCP, UDP and ICMP ﬂows is re-

quired, including those connections which have been

closed. Traffic features, in fact, are computed by analyzing all the connections (either active or expired) that share some characteristics, such as the destination IP address and/or the destination port, with the current one. Every connection has to be kept in memory until it is no longer needed for other computations.

Our architecture is implemented by means of the

open-source N-IDS Snort; we have used this system

as the base framework on top of which we have built

our components. Snort is a lightweight network IDS

created by Marty Roesch. Its architecture is made

up of four main blocks: a sniffer, a preprocessor engine that pre-processes captured packets, a rule-based detection engine, and a set of user output tools.

Thanks to Snort’s modular design approach, it is pos-

sible to add new functionality to the system by means

of program plugins. Moreover, Snort provides an

efﬁcient preprocessor plugin that reassembles TCP

streams and can thus be used to recover the TCP connection status.

Table 3: Time-Based Traffic Features

Same Host
  count                number of connections to the same host
  serror_rate          % of connections with SYN errors
  rerror_rate          % of connections with REJ errors
  same_srv_rate        % of connections to the same service
  diff_srv_rate        % of connections to different services

Same Service
  srv_count            number of connections to the same service
  srv_serror_rate      % of connections with SYN errors
  srv_rerror_rate      % of connections with REJ errors
  srv_diff_host_rate   % of connections to different hosts

We have implemented a new preprocessor plugin

which computes the connection features. The main is-

sue we tackled has been the computation of the trafﬁc

features, which requires that a proper logical organi-

zation of the data is put into place in order to recover

information about the past network trafﬁc. Moreover,

to assure that the real-time requirement of the system

is met, a fast access to stored data is mandatory.

As to the data structures, we have adopted a binary search tree. In the worst case this structure guarantees a search time comparable to that achievable with a linked list; performance improves further when the tree is static and well balanced. Unfortunately, our structure is not a static tree because the connections are not known in advance; however, a self-adjusting binary tree can be adopted in this case in order to balance a dynamic tree.

We have used a Snort library of functions to man-

age the so-called Splay Trees. A Splay Tree is an ele-

gant self-organizing data structure created by Sleator

and Tarjan (Sleator and Tarjan, 1985): it is an ordered binary tree in which an item is moved closer to the entry point (i.e. the tree root) whenever it is accessed, by means of rotations of the item with its parent node. This makes it faster to access the most frequently used elements than the least frequently used ones, without sacrificing the efficiency of operations such as insert and search.
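The self-adjusting behavior of a splay tree can be sketched as follows. This is a textbook-style recursive formulation written for illustration; it is not the code of the Snort library used in our implementation, and the names `Node`, `splay`, `insert` and `search` are hypothetical.

```python
class Node:
    def __init__(self, key, value=None):
        self.key, self.value = key, value
        self.left = self.right = None


def rotate_right(h):
    x = h.left
    h.left, x.right = x.right, h
    return x


def rotate_left(h):
    x = h.right
    h.right, x.left = x.left, h
    return x


def splay(h, key):
    """Move the node holding `key` (or the last node on its search path) to the root."""
    if h is None:
        return None
    if key < h.key:
        if h.left is None:
            return h
        if key < h.left.key:                      # zig-zig
            h.left.left = splay(h.left.left, key)
            h = rotate_right(h)
        elif key > h.left.key:                    # zig-zag
            h.left.right = splay(h.left.right, key)
            if h.left.right is not None:
                h.left = rotate_left(h.left)
        return h if h.left is None else rotate_right(h)
    if key > h.key:
        if h.right is None:
            return h
        if key > h.right.key:                     # zig-zig
            h.right.right = splay(h.right.right, key)
            h = rotate_left(h)
        elif key < h.right.key:                   # zig-zag
            h.right.left = splay(h.right.left, key)
            if h.right.left is not None:
                h.right = rotate_right(h.right)
        return h if h.right is None else rotate_left(h)
    return h


def insert(root, key, value):
    """Insert or update `key`; the affected node becomes the new root."""
    if root is None:
        return Node(key, value)
    root = splay(root, key)
    if key == root.key:
        root.value = value
        return root
    node = Node(key, value)
    if key < root.key:
        node.left, node.right, root.left = root.left, root, None
    else:
        node.right, node.left, root.right = root.right, root, None
    return node


def search(root, key):
    """Return (new_root, value); value is None when `key` is absent."""
    root = splay(root, key)
    found = root is not None and root.key == key
    return root, (root.value if found else None)
```

Since every access changes the root, callers keep the tree returned by `insert` and `search`; keys may be any mutually comparable values, for example destination addresses or port numbers.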

With the above mentioned tree structure, we have

implemented two trees, a Same Host Tree and a Same

Service Tree to compute the same host and the same

service trafﬁc features, respectively. Every node in

the tree is identiﬁed by the destination IP address

in the ﬁrst tree, or by the destination service in the

second one. In this way, we want to store in the

same node information about all the connections that

share the same characteristics. In order to compute

both the time-based and the host-based trafﬁc fea-

tures, for every node in the tree we have implemented

two linked lists, one for each set. The linked lists con-

tain information like source IP address and/or source

port for all the connections that have been identiﬁed

and that have the same destination IP address and/or

the same destination service (Figure 3). The elements

of the list, one for every connection, are ordered in

time: the ﬁrst element is the oldest one, the last is the

most recent.

Figure 3: Same-Host Tree Structure

When a new packet is captured from the network,

our preprocessor plugin ﬁrst analyzes the protocol

of the packet in order to identify the most appropriate procedure to compute intrinsic features. If the packet belongs to either UDP or ICMP traffic, the information required to compute intrinsic features is entirely contained in the packet. In case of TCP

trafﬁc, the procedure recovers the session which the

packet belongs to in order to determine some cru-

cial information, like the duration of the connection

or the number of bytes sent along both directions of

the stream, that cannot be directly inferred from the

packet. Then, the procedure analyzes the destination

IP address and the destination port to compute traf-

ﬁc features. Search operations are performed in both

trees: if no preexisting node is found, a new one is

created, and the trafﬁc features relative to the cur-

rent connection are initialized to zero. Otherwise, if

a node is already in the tree, the procedure analyzes

the two linked lists to compute the statistics for both

time-based and host-based traffic features. Every element in the list is analyzed and the statistics are updated. During this process the elements that belong neither to a time interval of two seconds nor to a window of the latest one hundred connections are pruned off.
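The update-and-prune procedure described above can be sketched as follows. This is a simplified illustration rather than the plugin code: a dictionary stands in for the Same Host Tree, each node keeps two queues instead of linked lists, the host-based window is assumed to be kept per node, and the names `HostNode`, `SameHostTable` and the feature labels are hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 2.0        # time-based window
WINDOW_CONNECTIONS = 100    # host-based window


def _rate(entries):
    """Fraction of entries flagged with a SYN error."""
    return sum(err for _, err in entries) / len(entries) if entries else 0.0


class HostNode:
    """Per-destination record holding one list per traffic-feature set."""

    def __init__(self):
        self.time_based = deque()                           # pruned by age
        self.host_based = deque(maxlen=WINDOW_CONNECTIONS)  # pruned by size


class SameHostTable:
    """Stand-in for the Same Host Tree; a dict replaces the splay tree here."""

    def __init__(self):
        self.nodes = {}

    def update(self, dst_ip, timestamp, syn_error):
        """Return the features over earlier connections, then record this one."""
        node = self.nodes.get(dst_ip)
        if node is None:
            # No pre-existing node: create it, the features start at zero.
            node = self.nodes[dst_ip] = HostNode()
        # Entries are ordered in time, so the oldest one is on the left.
        while node.time_based and timestamp - node.time_based[0][0] > WINDOW_SECONDS:
            node.time_based.popleft()
        features = {
            "count": len(node.time_based),
            "serror_rate": _rate(node.time_based),
            "host_count": len(node.host_based),
            "host_serror_rate": _rate(node.host_based),
        }
        entry = (timestamp, syn_error)
        node.time_based.append(entry)
        node.host_based.append(entry)   # drops the oldest entry beyond 100
        return features
```

Because the lists are ordered in time, pruning only ever touches the oldest entries, so each connection is appended once and removed at most once from each list.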

6 TESTING THE APPROACH

In this section we evaluate the performance overhead

due to the operation of the IDS, pointing out the in-

crease in CPU utilization and memory consumption

with respect to the values observed while running

Snort without our plugins. Our purpose is to show

the affordability of real-time intrusion detection, by

means of techniques which are usually employed in

off-line analysis. We evaluate both CPU and mem-

ory overhead, as well as packet loss ratio. Such tests

are deployed in two scenarios: in the former case, we

build a testbed to emulate network trafﬁc in a con-

trolled environment; in the latter case, we analyze

trafﬁc ﬂowing across the local network at Genova Na-

tional Research Council (CNR). In this scenario, the

most important results concern packet loss analysis.

We show that the complexity increase due to the application of our detection techniques does not dramatically affect the percentage of lost packets. Thus we

demonstrate the affordability of intrusion detection by

means of such techniques. While working on the test-

bed, we consider the topology depicted in Figure 4.

Figure 4: Reference testbed

In order to work in a totally controlled environ-

ment, we have to emulate the depicted scenario rather

than working in a real network environment; for that

purpose, we use another topology which just emulates

the one depicted above, as drawn in Figure 5.

Figure 5: A trafﬁc emulation scenario

Furthermore, we test the IDS using it on a real and

heavily loaded network, whose topology is drawn in

Figure 6. Such a test is useful to assess the limits

of applicability of our plugin, as well as to identify

directions for future improvements.

Figure 6: CNR Network Topology

Table 4 shows the CPU overhead due to the use of Snort alone, versus Snort plus our plugins. The machine operating as IDS in the emulated traffic scenario is equipped with a 1 GHz Pentium III CPU and 256 MB of RAM, running Mandrake Linux 9.1 with kernel version 2.4.19. In this case we observe an almost imperceptible increase in memory consumption (Table 5). The doubling in CPU usage percentage, when using the modified version of Snort with respect to Snort alone, is not such a negative result, since overall CPU usage is still low and under reasonable thresholds, also considering that we are using general-purpose, not dedicated, hardware.

Table 4: Average CPU Overhead

                    Snort-2.1.0    Snort + Plugins
Emulated Traffic    0.12%          0.22%
CNR Traffic         1.16%          2.42%

The extensive test on the CNR network also shows a slightly higher CPU usage for the modified version of Snort, still within the limit of 8% overhead. The machine acting as IDS is equipped with a 2 GHz Pentium IV and 512 MB of RAM, running RedHat Linux 8.0 with kernel 2.4.18.

Table 5: Memory Overhead

                    Snort-2.1.0    Snort + Plugins
Emulated Traffic    1.69%          1.70%
CNR Traffic         4.99%          9.46%
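The figures in Tables 4 and 5 can be compared both in absolute terms (percentage points) and as a ratio between the modified and the plain version of Snort. The following is a minimal sketch of that arithmetic, using only the values reported in the two tables; the helper name `overhead` is ours and purely illustrative.

```python
# Values copied from Tables 4 and 5 (percentage of CPU / memory in use).
# Each entry maps a scenario to (plain Snort, Snort + plugins).
CPU = {"emulated": (0.12, 0.22), "cnr": (1.16, 2.42)}
MEM = {"emulated": (1.69, 1.70), "cnr": (4.99, 9.46)}


def overhead(baseline, modified):
    """Return (increase in percentage points, ratio modified / baseline)."""
    return modified - baseline, modified / baseline


for name, table in (("CPU", CPU), ("memory", MEM)):
    for scenario, (base, mod) in table.items():
        delta, ratio = overhead(base, mod)
        print(f"{name:6s} {scenario:8s} +{delta:.2f} points, x{ratio:.2f}")
```

In both scenarios the CPU ratio is close to 2, which is the doubling mentioned in the text, while the absolute increase stays small.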

ICEIS 2005 - INFORMATION SYSTEMS ANALYSIS AND SPECIFICATION


Figure 7: CPU Usage - CNR Network

Once again, it is worth pointing out that our measurements must be interpreted bearing in mind that non-dedicated hardware was used.

Of course, the most interesting indication regards the packet loss ratio. To attain the best results in intrusion detection, the main requirement is not to lose any packets, no matter how much of the system resources we use, if affordable with the available hardware. The results are reported in Table 6. In the test with emulated traffic, we notice an increase of less than 10% in packet loss with respect to the plain version of Snort, though the absolute values are higher than the ones obtained by testing the system on a real network. This may be ascribed to the hardware used in the two cases: the setup used in the latter scenario is much more suitable than the one used in the former. In both cases, anyway, we observe a very low increase in packet loss ratio, showing the feasibility of such a technique.

Table 6: Packet Loss

                    Snort-2.1.0    Snort + Plugins
Emulated Traffic    0.39%          0.42%
CNR Traffic         0.14%          0.16%
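To make the comparison in Table 6 concrete, the sketch below computes the relative increase in packet loss for each scenario from the reported percentages. It also shows one possible way to derive a loss percentage from raw counters; the definition used (dropped packets over all packets seen) is an assumption, since the text does not state how the ratio was measured, and the counter values in the last line are hypothetical.

```python
def loss_ratio(dropped, total):
    """Packet loss as a percentage of all packets seen (assumed definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * dropped / total


def relative_increase(baseline, modified):
    """Relative growth of `modified` over `baseline`, in percent."""
    return 100.0 * (modified - baseline) / baseline


# Loss percentages copied from Table 6: (plain Snort, Snort + plugins).
TABLE6 = {"emulated": (0.39, 0.42), "cnr": (0.14, 0.16)}

for scenario, (base, mod) in TABLE6.items():
    print(f"{scenario}: +{mod - base:.2f} points, "
          f"{relative_increase(base, mod):.1f}% relative increase")

# Hypothetical counters, for illustration only (not measured data).
print(f"{loss_ratio(dropped=3, total=1000):.2f}%")
```

For the emulated scenario the relative increase stays below the 10% figure quoted in the text, and in both scenarios the absolute difference is a few hundredths of a percentage point.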

7 CONCLUSIONS AND FUTURE WORKS

This paper shows how it is possible to combine real-time intrusion detection with data mining techniques, while at the same time keeping the system overhead under reasonable thresholds and containing the packet loss ratio within certain boundaries. Future development of this project will involve building rule sets and evaluating their detection capabilities. We may test rule sets computed with different algorithms which make use of various techniques.

The work has also been published on SourceForge, in the hope of receiving feedback from users and of communicating and cooperating with the Snort community.

ACKNOWLEDGEMENTS

We would like to thank Maurizio Aiello and the staff at the CNR laboratory in Genova, Italy, for their cooperation and for providing us with part of the data as well as the equipment used for the tests.

REFERENCES

(2004). Operation Experience with High-Volume Network Intrusion Detection. ACM.

Andersson, D. (1995). Detecting unusual program behavior using the statistical component of the next-generation intrusion detection expert system (NIDES). Technical report, Computer Science Laboratory.

Bace, R. G. (2000). Intrusion Detection. Macmillan Technical Publishing.

Baker, A. R., Caswell, B., and Poor, M. (2004). Snort 2.1 Intrusion Detection - Second Edition. Syngress.

Barbara, D., Couto, J., Jajodia, S., Popyack, L., and Wu, N. (2001). ADAM: Detecting intrusion by data mining. In Workshop on Information Assurance and Security, pages 11–16. IEEE.

Cohen, W. W. and Singer, Y. (1999). A simple, fast, and effective rule learner.

Elkan, C. (2000). Results of the KDD99 classifier learning. In SIGKDD Explorations, volume 1, pages 63–64. ACM.

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, pages 37–52.

Laing, B. and Alderson, J. (2000). How to guide - implementing a network based intrusion detection system. Technical report, Internet Security Systems, Sovereign House, 57/59 Vaster Road, Reading.

Lee, W. and Stolfo, S. J. (2000). A framework for constructing features and models for intrusion detection systems. ACM Transactions on Information and System Security (TISSEC), 3(4):227–261.

Paxson, V. and Terney, B. (2004). Bro reference manual.

Sleator, D. and Tarjan, R. (1985). Self Adjusting Binary Search Trees. Journal of the ACM, 32(3).

Tyson, M. (2000). DERBI: Diagnosis, explanation and recovery from computer break-ins. Technical report.

Vigna, G. and Kemmerer, R. (1999). NetSTAT: a network based intrusion detection system. Journal of Computer Security, 7(1).

http://sourceforge.net/projects/s-predator
