million connection records. Another two weeks of
test data yielded approximately two million more
connection records.
Each connection record consists of about 100
bytes. A connection is a sequence of TCP packets
starting and ending at some well-defined times,
between which data flows from a source IP address
to a target IP address under some well-defined
protocol. Each connection in the training set is
labelled either as normal or as an attack.
The training data contains twenty-four attack types,
while the test data contains an additional fourteen.
These thirty-eight attack types can each be placed
in one of the following four categories:
DOS, a denial of service attack, is an attack in
which the assailant makes some computing or memory
resource too busy or too full to handle legitimate
requests, or denies legitimate users access to a
machine. An example of a DOS attack is the Back
attack, which targets the Apache web server by
submitting requests containing many forward slashes.
As the server tries to process these slashes it
consumes excessive CPU time, slowing down and
becoming unable to process other network requests.
R2L, a remote to local attack, takes place when
an attacker who can send packets across a network
to a machine, but who has no account on it, exploits
some vulnerability to gain local access to that
machine. There are many ways to do this, varying
from a dictionary attack on a user's password to
exploiting buffer overflow vulnerabilities.
U2R, a user to root attack, occurs when a normal
user (or possibly an unauthorized user who has
already gained access to a normal user account through
social engineering or sniffing passwords) is able to
exploit some vulnerability in order to gain root access
to the system. A common means of carrying out many
U2R attacks is buffer overflow. When a programmer
writes a program to receive some input, the size of the
buffer in which it will be stored needs to be decided.
If the program does not check that the input fits,
an intruder can profit from this by filling the buffer
and then including some extra commands to be
submitted to and understood by the operating system.
PROBE, a probe attack, is a surveillance procedure
in which an attacker quickly scans a network
in order to gain information on the network and its
machines for a possible attack in the future. (Kendall,
1999)
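The four-way grouping above can be expressed as a simple lookup table. The sketch below is illustrative only: the attack names are a small subset taken from the publicly distributed KDD Cup 1999 label set (an assumption, since this section names only the Back attack), and the names `ATTACK_CATEGORY` and `categorize` are hypothetical.

```python
# Illustrative subset of attack labels mapped to the four categories.
# Label spellings follow the public KDD Cup 1999 data (an assumption;
# the text itself names only the Back attack).
ATTACK_CATEGORY = {
    "back": "DOS",
    "neptune": "DOS",
    "smurf": "DOS",
    "guess_passwd": "R2L",
    "ftp_write": "R2L",
    "buffer_overflow": "U2R",
    "rootkit": "U2R",
    "ipsweep": "PROBE",
    "portsweep": "PROBE",
}


def categorize(label: str) -> str:
    """Return NORMAL, one of the four attack categories, or UNKNOWN."""
    # Raw labels in the public data end with a full stop, e.g. "back."
    name = label.strip().rstrip(".").lower()
    if name == "normal":
        return "NORMAL"
    return ATTACK_CATEGORY.get(name, "UNKNOWN")
```

Returning `UNKNOWN` rather than raising an error is a deliberate choice here, because the test data contains attack types that never appear in the training data.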
Each connection record is made up of forty-two
features in the training data and forty-one in the test
data. These features are divided into three categories.
The first set is the basic features of an individual
TCP connection, such as the length of the connection,
the type of protocol, the network service and the
number of bytes from source to destination and vice
versa. The second set is the content features within
a connection suggested by domain knowledge,
including the number of failed login attempts,
whether the attacker gained access, whether the
attacker attempted to gain root access, whether the
attacker was successful in gaining root access, the
number of file creation operations and the number
of outbound commands in an FTP session. The final
set is the traffic features. These were computed
using a two-second time window and resulted in:
• Number of connections to the same host as the
current connection
• Percentage of connections that have "SYN" errors
• Percentage of connections that have "REJ" errors
• Percentage of connections to the same service
• Percentage of connections to different services
• Number of connections to the same service as the
current connection
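The record layout described above can be sketched as a small parsing helper. This is a minimal illustration under two assumptions that the text does not state: that records are stored as comma-separated lines, and that the extra forty-second field in the training data is the normal/attack label. The function name `split_record` is hypothetical.

```python
from typing import List, Optional, Tuple

N_FEATURES = 41  # features per record in the test data, per the text


def split_record(line: str) -> Tuple[List[str], Optional[str]]:
    """Split one comma-separated record into (features, label).

    Assumption: a 42-field record is 41 features followed by a label,
    and a 41-field record is an unlabelled test record (label is None).
    """
    fields = [f.strip() for f in line.strip().split(",")]
    if len(fields) == N_FEATURES + 1:
        return fields[:N_FEATURES], fields[-1]
    if len(fields) == N_FEATURES:
        return fields, None
    raise ValueError(f"expected 41 or 42 fields, got {len(fields)}")
```

Values are left as strings because the record mixes symbolic fields (protocol, service) with numeric ones; converting them is a separate preprocessing step.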
Several categories of higher-level features were
defined, including same host and same service features.
The same host features examine the connections in the
past two seconds that have the same destination host
as the current connection, while the same service
features examine the connections in the past two
seconds that have the same service as the current
connection. Together these are called time-based
features, and it is from these that statistics relating to
service and protocol behaviour are derived.
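The same host and same service features can be sketched as a filter over the connections seen in the last two seconds. The code below is a hedged illustration, not the procedure used to build the dataset: the `Conn` fields, the output names, the set of status flags treated as "SYN" errors, and the choice to compute each rate over the same host connections are all assumptions. Rates are returned as fractions between 0 and 1 rather than percentages.

```python
from dataclasses import dataclass
from typing import Dict, List

# Status flags counted as "SYN" errors. This set is an assumption made
# for illustration; the text does not list the flags.
SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}


@dataclass
class Conn:
    """Minimal connection summary; all field names are hypothetical."""
    time: float     # start time in seconds
    dst_host: str   # destination host
    service: str    # network service, e.g. "http"
    flag: str       # connection status, e.g. "SF", "S0" or "REJ"


def _fraction(conns: List[Conn], predicate) -> float:
    """Share of conns satisfying predicate; 0.0 for an empty list."""
    if not conns:
        return 0.0
    return sum(1 for c in conns if predicate(c)) / len(conns)


def time_based_features(history: List[Conn], current: Conn,
                        window: float = 2.0) -> Dict[str, float]:
    """Compute same host and same service statistics for `current`.

    `history` holds earlier connections only; `current` is not counted.
    """
    recent = [c for c in history if 0.0 <= current.time - c.time <= window]
    same_host = [c for c in recent if c.dst_host == current.dst_host]
    same_srv = [c for c in recent if c.service == current.service]
    return {
        "count": len(same_host),
        "srv_count": len(same_srv),
        "serror_rate": _fraction(same_host, lambda c: c.flag in SYN_ERROR_FLAGS),
        "rerror_rate": _fraction(same_host, lambda c: c.flag == "REJ"),
        "same_srv_rate": _fraction(same_host, lambda c: c.service == current.service),
        "diff_srv_rate": _fraction(same_host, lambda c: c.service != current.service),
    }
```

A list scan keeps the sketch short; a system processing live traffic would more likely keep a queue ordered by time and drop entries as they fall out of the window.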
3 ARCHITECTURE
The architecture of the system can be seen in Figure
1. It is a distributed system: a computer architecture
consisting of interconnected processors. In a
distributed system each processor has its own local
memory, and processors communicate by message
passing over the network. For any particular processor
its own resources are local, whereas the other
processors and their resources are remote. Together,
a processor and its resources are called a node.
The distributed architecture of this system allows
for a shorter response time in analysing incoming
network traffic, and for higher reliability, since the
various data mining algorithms perform differently
depending on the attack.
The system works as follows: