environment for knowledge analysis) is used in a
Java environment for training and testing the
learning algorithms. It provides several different
SVM implementations along with multiple kernels. I
examine three things: the relative importance of
features in training on the dataset, the choice of kernel
function, and the parameter selection of SVM
classifiers. By understanding which features are the
most relevant, the dataset can be trimmed to include
only the most useful data. The choice of kernel
results in different levels of error when applied to
the KDD Cup dataset (McHugh, 2000). The framework
offers four different kernels: Sigmoid, Linear,
Polynomial and RBF. Three parameters are available
for tuning and optimization: “gamma”, “cost” and “nu”.
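The kernels named above can be sketched as plain functions. This is a minimal illustration using the standard textbook definitions, not the framework's own implementation; the parameter names `coef0` and `degree` and the default values are assumptions borrowed from common SVM libraries and do not appear in this paper.

```python
import math


def dot(x, y):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    # K(x, y) = <x, y>
    return dot(x, y)


def polynomial_kernel(x, y, gamma=1.0, coef0=0.0, degree=3):
    # K(x, y) = (gamma * <x, y> + coef0) ** degree
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    # K(x, y) = exp(-gamma * ||x - y||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    # K(x, y) = tanh(gamma * <x, y> + coef0)
    return math.tanh(gamma * dot(x, y) + coef0)
```

In this sketch gamma scales the similarity inside the polynomial, RBF and sigmoid kernels, which is why it interacts with the choice of kernel during tuning.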
The performance measure has also been a subject
of my research. Here, the best kernel should
maximize a predictive performance criterion as well
as a computational performance criterion. That is, I
seek the best classifiers: those that are good at detecting
unwanted behaviour and efficient to compute over
massive datasets of network traffic. I address the
“predictive performance” criterion, that is, what is meant by
good, after describing the cost model for this
domain.
2 CHALLENGES OF
IMPROVING CATEGORIZING
This work proceeds in steps, with
additional complexity being added to the model
at each level. Before developing any
models, the data must first be put into a usable
format. I am using the KDD Cup 99 dataset,
described earlier, whose features are
either continuous (numerically valued) or discrete.
The discrete features in the provided dataset are
in text format (e.g. tcp/udp) and must be
transformed into numeric values.
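One common way to transform text-valued features such as tcp/udp is one-hot encoding. The sketch below is an illustration under that assumption and is not necessarily the transformation used in this work; the record layout and the helper names are hypothetical, and the three sample records are invented for the example.

```python
def build_vocabulary(records, field):
    """Map each distinct text value of `field` to a column index (sorted for determinism)."""
    return {value: i for i, value in enumerate(sorted({r[field] for r in records}))}


def one_hot(value, vocabulary):
    """Return a 0/1 vector with a single 1 at the position assigned to `value`."""
    vec = [0.0] * len(vocabulary)
    vec[vocabulary[value]] = 1.0
    return vec


def encode(records, symbolic_fields, numeric_fields):
    """Turn dict records into purely numeric rows: numeric fields first, then one-hot blocks."""
    vocabularies = {f: build_vocabulary(records, f) for f in symbolic_fields}
    rows = []
    for r in records:
        row = [float(r[f]) for f in numeric_fields]
        for f in symbolic_fields:
            row.extend(one_hot(r[f], vocabularies[f]))
        rows.append(row)
    return rows, vocabularies


# Invented sample records, for illustration only.
records = [
    {"duration": 0, "src_bytes": 181, "protocol_type": "tcp"},
    {"duration": 2, "src_bytes": 105, "protocol_type": "udp"},
    {"duration": 0, "src_bytes": 1032, "protocol_type": "icmp"},
]
rows, vocabularies = encode(records, ["protocol_type"], ["duration", "src_bytes"])
```

One-hot encoding avoids imposing an artificial order on the protocol values, which matters for distance-based kernels such as RBF.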
One of the primary challenges of intrusion
detection is gathering suitable data for training
and testing an algorithm. A deficiency of the KDD data
set is the vast number of redundant records, which
causes the learning algorithms to be biased towards
the frequent records and thus prevents them from
learning rare records, which are usually more
harmful to networks. In addition, the existence of
these repeated records in the test set causes the
evaluation results to be biased in favour of methods that
have better classification rates on the frequent
records.
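The effect of redundant records can be illustrated with a small sketch that drops exact duplicates and compares class counts before and after. It assumes each record is a tuple whose last field is the class label; the sample records are invented for the example and the function names are hypothetical.

```python
from collections import Counter


def remove_duplicates(records):
    """Drop exact duplicate records, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def label_counts(records):
    """Count records per class label (assumed to be the last field)."""
    return Counter(record[-1] for record in records)


# Invented sample: frequent records repeat, the rare attack appears once.
records = [
    ("tcp", "http", 181, "normal"),
    ("tcp", "http", 181, "normal"),
    ("tcp", "http", 181, "normal"),
    ("icmp", "ecr_i", 1032, "smurf"),
    ("icmp", "ecr_i", 1032, "smurf"),
    ("tcp", "telnet", 0, "buffer_overflow"),
]
unique = remove_duplicates(records)
```

In this sample, comparing `label_counts(records)` with `label_counts(unique)` shows the rare class's share of the data growing once the repeats are gone.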
One of the disadvantages of SVM-based and
other supervised machine learning methods is the
requirement for a large number of labelled training
samples (Yao, Zhao, and Fan, 2006). Furthermore,
recognizing the traffic only after the network flow has been
collected could be too late should security
interventions become necessary at an early stage of
the traffic flow. My intent is to use supervised
machine learning methods together with feature
parameters obtainable from the traffic flow for fast and
accurate network traffic detection.
Although the proposed data set still
suffers from some of the problems of complex data
sets and may not be a perfect representative of existing real
networks, the lack of public data sets for
network-based IDSs means that it can still be
applied as an effective benchmark data set to help
researchers compare different machine learning
methods.
3 PROBLEM DEFINITION
AND SVM’S
Machine learning has large implications for
intrusion detection, because intrusions are becoming
more complex and information systems are likewise
becoming more intricate. By using machine learning
techniques to analyze incoming network data, I can
identify malicious attacks before they
compromise an information system. Research in the
field of intrusion detection seems to focus on a
variety of support vector machine methods, neural
networks and clustering algorithms.
Support vector machines are a relatively
recent method of machine learning based on
structural risk minimization, and they are a powerful
machine learning method for both regression
and classification problems.
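For reference, the standard soft-margin classification problem behind the “cost” parameter can be written as below. The notation is the usual textbook one and is an assumption here, not taken from this paper: training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$, weight vector $w$, bias $b$, slack variables $\xi_i$, feature map $\phi$ and cost $C$.

```latex
\min_{w,\,b,\,\xi} \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\bigl(w^{\top}\phi(x_i) + b\bigr) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1,\dots,n.
```

A larger $C$ penalizes margin violations more heavily, while the kernel supplies the inner products $\phi(x_i)^{\top}\phi(x_j)$ without computing $\phi$ explicitly.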
In this paper, I tried an approach to
solve the two issues mentioned above, resulting in new
train and test sets, which consist of selected records
of the complete KDD data set. The resulting data set
does not require a large number of labelled
training samples. In addition, the numbers of records in
the train and test sets are reasonable. This
makes it affordable to run experiments that would
otherwise need to randomly select a small portion of the data.
Consequently, evaluation results of different research
works will be consistent and comparable.
Through the use of correct kernel choice, feature
selection and parameter selection, I have shown that
it is possible to improve the accuracy and efficiency