
 
environment for knowledge analysis) is used in a 
Java environment for training and testing the 
learning algorithms. It provides several different 
SVM implementations along with multiple kernels. I 
examine three things: the relative importance of 
features in training on the dataset, the choice of 
kernel, and the parameter selection for SVM 
classifiers. By understanding which features are the 
most relevant, the dataset can be trimmed to include 
only the most useful data. The choice of kernel 
results in different levels of errors when applied to 
the KDD Cup dataset (McHugh, 2000). The frameworks 
offer four different kernels: Sigmoid, Linear, 
Polynomial and RBF. Each kernel can be tuned and 
optimized through three parameters: gamma, cost 
and nu. 
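The four kernels named above can be sketched in a few lines. This is a minimal illustration that assumes the commonly used LibSVM-style formulas; it is not the Weka implementation used in this work. The names `coef0` and `degree` are assumptions that do not appear in the text. Only gamma enters the kernel formulas; in LibSVM-style implementations, cost and nu belong to the SVM formulation (C-SVC and nu-SVC) rather than to the kernel itself.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    # K(u, v) = u . v  (no tunable kernel parameter)
    return dot(u, v)


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    # K(u, v) = (gamma * u . v + coef0) ** degree
    # coef0 and degree are assumed extra parameters, not named in the text.
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    # K(u, v) = exp(-gamma * ||u - v||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    # K(u, v) = tanh(gamma * u . v + coef0)
    return math.tanh(gamma * dot(u, v) + coef0)
```

A larger gamma makes the RBF kernel fall off faster with distance, so each training record influences a smaller neighbourhood; this is one reason the parameter has to be tuned together with the kernel choice.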
The performance criterion has also been a subject 
of my research. Here, the best kernel should 
maximize a predictive performance criterion as well 
as a computational performance criterion. That is, I 
seek the classifiers that are both good at detecting 
unwanted behaviour and efficient to compute over 
massive datasets of network traffic. I address the 
“predictive performance” criterion, that is, what is 
meant by good, after describing the cost model for 
this domain. 
2 CHALLENGES OF 
IMPROVING CATEGORIZING 
This work proceeds in steps, with additional 
complexity being added to the model at each level. 
As a prelude to developing any models, the data 
must first be put into a usable format. I am using 
the KDD Cup 99 dataset, described earlier, which 
consists of features that are either continuous 
(numerically valued) or discrete. The discrete 
features in the provided dataset are in text format 
(e.g. tcp/udp) and must be transformed into numeric 
values. 
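One common way to carry out such a transformation is one-hot encoding. The sketch below is only an assumption about how text-valued features could be converted; the text does not state which encoding was actually used. The helper names and the sample values are hypothetical.

```python
def build_vocabulary(values):
    """Map each distinct text value to a column index, in sorted order."""
    return {value: index for index, value in enumerate(sorted(set(values)))}


def one_hot(value, vocabulary):
    """Return a numeric vector with a single 1.0 in the column for `value`."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[value]] = 1.0
    return vector


# Illustrative sample of a text-valued protocol column.
protocols = ["tcp", "udp", "tcp", "icmp"]
vocab = build_vocabulary(protocols)
encoded = [one_hot(p, vocab) for p in protocols]
```

One-hot encoding avoids imposing an artificial ordering on the values, which a plain integer code (tcp = 1, udp = 2) would introduce into distance-based kernels such as RBF.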
One of the primary challenges of intrusion 
detection is gathering applicable data for training 
and testing an algorithm. A shortcoming of the KDD 
data set is its vast number of redundant records, 
which biases the learning algorithms towards the 
frequent records and thus prevents them from 
learning rare records, which are usually more 
harmful to networks. In addition, the existence of 
these repeated records in the test set biases the 
evaluation results in favour of the methods that 
have better classification rates on the frequent 
records. 
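Removing such redundancy can be sketched as a single pass that keeps the first occurrence of every record. This is a minimal illustration under the assumption that two records are redundant only when all of their fields, including the label, are identical; the function name and sample records are hypothetical.

```python
def remove_duplicates(records):
    """Keep the first occurrence of each record, preserving input order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Illustrative records: (duration, protocol, label).
sample = [
    [0, "tcp", "normal"],
    [0, "tcp", "normal"],
    [2, "udp", "attack"],
    [0, "tcp", "normal"],
]
deduplicated = remove_duplicates(sample)
```

Applied separately to the training and the test set, this keeps frequent records from dominating both the learning and the evaluation.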
One of the disadvantages of SVM-based and 
other supervised machine learning methods is the 
requirement for a large number of labelled training 
samples (Yao, Zhao, and Fan, 2006). Furthermore, 
recognizing the traffic only after the network flow 
has been collected could be too late should security 
interventions become necessary in the early stage of 
the traffic flow. My intent is to use supervised 
machine learning methods, together with feature 
parameters obtainable in the traffic flow, for fast 
and accurate network traffic detection. 
Although the recommended data set still suffers 
from some of the problems of complex data sets and 
may not be a perfect representative of existing real 
networks, the lack of public data sets for 
network-based IDSs means that it can still be 
applied as an effective benchmark data set to help 
researchers compare different machine learning 
methods. 
3 PROBLEM DEFINITION 
AND SVM’S 
Machine learning has large implications for 
intrusion detection, because intrusions are becoming 
more complex and information systems are likewise 
becoming more intricate. By using machine learning 
techniques to analyze incoming network data, I can 
identify malicious attacks before they compromise 
an information system. Research in the field of 
intrusion detection seems to focus on a variety of 
support vector machine methods, neural networks 
and clustering algorithms. 
Support vector machines are a relatively recent 
method of machine learning based on structural risk 
minimization, and they are a powerful machine 
learning method for both regression and 
classification problems. 
In this paper, I tried an effective approach to 
solve the two issues mentioned above, resulting in 
new train and test sets, which consist of selected 
records of the complete KDD data set. The provided 
data set does not depend on a very large number of 
labelled training samples. Besides, the numbers of 
records in the train and test sets are reasonable. 
This advantage makes it affordable to run the 
experiments on the complete set without needing to 
randomly select a small portion. Consequently, 
evaluation results of different research work will 
be consistent and comparable. 
Through the use of correct kernel choice, feature 
selection and parameter selection, I have shown that 
it is possible to improve the accuracy and efficiency 
UNWANTED BEHAVIOUR DETECTION AND CLASSIFICATION IN NETWORK TRAFFIC