the rules are sorted in decreasing order of estimated accuracy.
2.2 Prediction Verification
Using this classifier, a set of rules with estimated accuracies is obtained; these estimates serve as certainty factors in the prediction of traffic data classes. At detection time, a particular rule with an associated estimated accuracy fires. At this point, it is possible to enforce a minimum estimated accuracy threshold $A_{th}$ for a prediction to be accepted, a heuristic that serves to discern traffic data that was considered at training time from data that was not. Setting an accuracy threshold therefore makes it possible to populate a data set with data that is presumed to be new to the system and which consequently needs proper classification. As a result, the estimated accuracy of the rules allows prediction verification capabilities to be integrated into the system.
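As an illustration, a minimal sketch of this verification step follows; the Rule structure, the verify_prediction helper and the 0.95 threshold value are assumptions made for the example, not taken from the system itself:

```python
from dataclasses import dataclass

@dataclass
class Rule:
    label: str       # predicted traffic class, e.g. "normal" or an attack type
    accuracy: float  # estimated accuracy of the rule (hypothetical field)

A_TH = 0.95  # hypothetical minimum estimated accuracy threshold

def verify_prediction(fired_rule: Rule, sample, discarded: list):
    """Accept the fired rule's prediction only if its estimated
    accuracy reaches A_TH; otherwise set the sample aside as
    presumably new to the system."""
    if fired_rule.accuracy >= A_TH:
        return fired_rule.label   # prediction accepted
    discarded.append(sample)      # queued for assisted supervision
    return None
```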
2.3 Classification of New Data
A sample is regarded as new if the rule-based classifier is unable to classify it properly, either as normal traffic or as any kind of attack traffic, basing the decision on the estimated accuracy threshold. As a result, this data needs manual supervision by an external agent for its classification. However, further help can be provided in this task by automatically grouping similar traffic data. Our system uses self-organizing maps to achieve this, which have proven useful in other works (Bashah and Shanmugam, 2005; Hoglund et al., 2000). Self-organizing maps (Kohonen, 1997) use a Euclidean similarity metric to achieve automatic clustering of data by defining an overlaying set of reference vectors on the feature space of the sample data set. Local-order relations are set on the reference vectors so that the value of each one depends on those of its neighbouring vectors. The self-organizing algorithm defines a non-linear regression of the reference vectors through the data points, which results in the reference vectors being scattered across the space according to the data set's probability density function. This makes it possible to classify all data samples represented by the same reference vector in one step, thus reducing the supervision effort.
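As a concrete sketch of this clustering step, the following minimal NumPy implementation of a rectangular self-organizing map illustrates the reference-vector regression; it is not the authors' implementation, and the map size, learning rate and neighbourhood radius are illustrative assumptions:

```python
import numpy as np

def train_som(data, rows=3, cols=2, iters=1000, lr0=0.5, sigma0=1.5):
    """Train a small SOM; returns reference vectors of shape
    (rows, cols, n_features). All parameters are illustrative."""
    rng = np.random.default_rng(0)
    w = rng.normal(size=(rows, cols, data.shape[1]))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        # best-matching unit under the Euclidean similarity metric
        bmu = np.unravel_index(np.argmin(((w - x) ** 2).sum(-1)),
                               (rows, cols))
        lr = lr0 * np.exp(-t / iters)          # decaying learning rate
        sigma = sigma0 * np.exp(-t / iters)    # shrinking neighbourhood
        d2 = ((grid - np.array(bmu)) ** 2).sum(-1)
        h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
        w += lr * h * (x - w)   # pull neighbourhood towards the sample
    return w
```

All samples whose best-matching unit is the same reference vector can then be labelled in a single supervision step.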
When dimensioning the self-organizing map, some problems need to be overcome, such as choosing a number of nodes that allows the map to adapt to the whole data set, or enhancing rare cases so that they are considered appropriately by the map. Applying the self-organizing map algorithm to the whole data set might result in the map being unable to adapt to values that lie too far apart in Euclidean terms, and might also fail to consider rare cases that carry little weight in the probability density function. To prevent this from happening, visual inspection of Sammon's mappings (Sammon, 1969) of different maps helps to choose a correct form of the array or to adapt the probability density function, but this is a manual task that is not desired in our system, and therefore a different approach is used. In our case, the system divides the original discarded set into several subsets, trying to obtain subsets with similar features and thus increase the self-organizing map's accuracy. Different heuristics can be used to perform this division, such as partitioning by certain fields like protocol type or type of service (Yu et al., 2007); all of these approaches aim at reducing the information entropy of the resulting subsets. The classifier's ruleset is a pruned version of a decision tree that, as described in Section 2.1, is built through information entropy reduction on the supervised training data set, and is thus a possible heuristic for reducing information entropy on the discarded samples data set.
To achieve the subdivision, samples are grouped in our system by hierarchical coincidence of the classifier's rule clauses. More precisely, each sample fires a particular rule whose left-hand side is defined by a list of clauses $(c_1, c_2, \ldots, c_i)$, ordered by classification relevance as a result of the C4.5 algorithm. This allows hierarchical grouping of similar samples by removing trailing clauses and grouping all the samples that share the same remaining clauses. A depth value needs to be set in this case: a higher value results in a higher number of subsets, while a lower value produces bigger subsets with more heterogeneous samples. The resulting sequence of clauses is extended with the protocol, type of service and flag fields to build a subset identifier for each sample.
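A sketch of this grouping step follows; the representation of clauses as strings and the exact field names are assumptions made for the example:

```python
from collections import defaultdict

def subset_id(clauses, protocol, service, flag, depth=3):
    """Keep the first `depth` clauses of the fired rule and extend
    them with the protocol, type of service and flag fields."""
    return tuple(clauses[:depth]) + (protocol, service, flag)

def group_discarded(samples, depth=3):
    """samples: tuples of (clauses, protocol, service, flag, features).
    Returns the discarded set divided into subsets by identifier."""
    groups = defaultdict(list)
    for clauses, proto, svc, flag, feats in samples:
        groups[subset_id(clauses, proto, svc, flag, depth)].append(feats)
    return groups   # each subset is then clustered by its own map
```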
Finally, the self-organizing map algorithm is applied to every subset. A 3:2 aspect ratio is used for the maps' dimensions in order to favour learning stability, with a hexagonal topology and a total number of nodes equal to 10% of the subset's cardinality, up to a dimension limit of 30×20.
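The dimensioning rule translates directly into code; the rounding strategy below is an assumption, since the text does not specify one:

```python
from math import ceil, sqrt

def map_dimensions(subset_size, max_x=30, max_y=20):
    """3:2 aspect ratio, ~10% of the subset's cardinality as nodes,
    capped at a 30x20 map."""
    nodes = max(1, subset_size // 10)   # 10% of the cardinality
    y = ceil(sqrt(nodes / 1.5))         # from x * y = nodes, x = 1.5 * y
    x = ceil(1.5 * y)
    return min(x, max_x), min(y, max_y)

print(map_dimensions(600))   # 60 nodes -> (11, 7)
```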
2.4 Retraining
The C4.5 rule-learning algorithm is batch-trained. To provide reinforced learning, the approach used in this system is to build a new data set with different proportions of samples. At this point, three types of samples are found in the system: samples discarded during prediction verification, training data set samples that are detected correctly, and training data set samples that are not detected correctly.