A MACHINE LEARNING APPROACH WITH VERIFICATION OF
PREDICTIONS AND ASSISTED SUPERVISION FOR A
RULE-BASED NETWORK INTRUSION DETECTION SYSTEM
José Ignacio Fernández-Villamor and Mercedes Garijo
Departamento de Ingeniería de Sistemas Telemáticos, Universidad Politécnica de Madrid, Spain
Keywords:
Network Intrusion Detection Systems, Rules of inference, Machine learning, Decision trees, Self-organizing
maps.
Abstract:
Network security is a branch of network management in which network intrusion detection systems provide
attack detection features by monitoring traffic data. Rule-based misuse detection systems use a set
of rules or signatures to detect attacks that exploit a particular vulnerability. These rules have to be hand-
coded by experts to properly identify vulnerabilities, which results in misuse detection systems having limited
extensibility. This paper proposes a machine learning layer on top of a rule-based misuse detection system
that provides automatic generation of detection rules, prediction verification and assisted classification of new
data. Our system offers good overall performance, while adding a heuristic and adaptive approach to
existing rule-based misuse detection systems.
1 INTRODUCTION
Network security is an important branch of network
management, in which network intrusion detection
systems (Mukherjee et al., 1994) provide attack detection
features by monitoring traffic data in demilitarized
zones. Therefore, intrusion detection systems are
a field of interest in the provision of network security,
under the assumption that security vulnerabilities
exist in hardware and software systems.
Intrusion detection systems can be classified into
anomaly detection systems (Denning, 1987), which
base detection on identification of abnormal user be-
haviour, and misuse detection systems (Deri et al.,
2003), which base detection on identification of well-
known attack patterns. An example of a misuse detection
system is Snort (Roesch, 1999), the de facto standard
open source intrusion detection system. Snort
uses a set of rules to detect attacks that exploit a
particular vulnerability. These rules have to be hand-coded
by experts to properly identify vulnerabilities.
Not only does this require expertise in the field,
but it also implies an effort in analysing traffic data,
which results in Snort having limited extensibility.
This process implies: knowing which attacks the sys-
tem was not prepared for, classifying these attacks
and, finally, building new detection rules.
This paper proposes a machine learning layer on
top of a rule-based misuse detection system such as
Snort to provide automatic generation of detection
rules, prediction verification and assisted classification
of new data. In the trade-off between expressing
precise detection rules tied to a particular exploit
and generating more general ones, our approach
favours generality by using, e.g., key indicators
(Lee et al., 1999) instead of well-defined message
patterns, so that the system is better suited for
future unknown attacks. In other words, the system is
aimed at detecting the nature of traffic data and clas-
sifying it into normal or abnormal traffic instead of
focusing on identifying exploits or particular attacks.
We think this is consistent with the idea of providing
automation and ensuring freshness in attack detection
as an added value to misuse detection systems, leav-
ing the exploit identification task for security experts.
The whole maintenance lifecycle of an intrusion de-
tection system is considered, and a training pattern la-
beller, based on estimated accuracy of rules and self-
organizing maps, is proposed to allow assisted classi-
fication of new traffic data.
Ignacio Fernández-Villamor J. and Garijo M. (2008).
A MACHINE LEARNING APPROACH WITH VERIFICATION OF PREDICTIONS AND ASSISTED SUPERVISION FOR A RULE-BASED NETWORK
INTRUSION DETECTION SYSTEM.
In Proceedings of the Fourth International Conference on Web Information Systems and Technologies, pages 143-148
DOI: 10.5220/0001524801430148
Copyright © SciTePress
Figure 1: System architecture.
2 SYSTEM DESCRIPTION
A rule-based intrusion detection system has a set of
detection rules that allows the detection of a partic-
ular set of attacks. To ensure the freshness of detection
capabilities, new rules have to be included to detect
attacks that the system was not previously prepared for.
This involves three tasks: knowing which attacks the
system was not prepared for, classifying these attacks
and, finally, building new detection rules.
Our system is aimed at automating all these tasks
as much as possible while being built on top of a rule-based
intrusion detection system. The architecture is
shown in Figure 1 and has four basic modules:
- A classifier, trained to classify samples of traffic data, performing the basic functionality of a network intrusion detection system.
- A prediction verifier, which validates the predictions made by the classifier, in order to detect new traffic types and do reinforcement learning with them.
- A labeller for the classification of new data, intended to reduce the effort of classifying new traffic data by grouping similar samples into clusters of the same class.
- A data set builder, which prepares a data set to retrain the classifier.
The system is essentially a rule generator, so that
real-time classification of data is performed by the rule-based
network intrusion detection system, whereas
our modules only perform off-line tasks. More specifically,
the rule generation process starts after a set of
traffic data is collected from the intrusion detection
system and proceeds as follows:
1. The prediction verifier estimates the validity of
previous predictions of traffic classes and builds
up a data set out of discarded samples, which re-
quire further supervision due to the system’s in-
ability to classify them properly.
2. The traffic data is supervised by a human agent by
labelling data clusters which are generated by the
labeller.
3. The data set builder constructs a data set out of
samples from previous supervised traffic data and
the newly supervised data.
4. The classifier is trained with the generated data
set, which results in the generation of a set of rules
that are used to refresh the intrusion detection sys-
tem ruleset.
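As a minimal illustration, the four steps above can be sketched as a single off-line iteration. All module interfaces below (discard_unverified, assist_labelling, etc.) are hypothetical names invented for this sketch, not part of Snort or any real API:

```python
def rule_generation_cycle(traffic, verifier, labeller, builder, classifier, ids):
    """One off-line iteration of the rule generation process (illustrative
    sketch; the module interfaces are hypothetical, not Snort's API)."""
    discarded = verifier.discard_unverified(traffic)   # step 1: low-confidence samples
    labelled = labeller.assist_labelling(discarded)    # step 2: human-supervised clusters
    data_set = builder.build(labelled)                 # step 3: mix old and new samples
    rules = classifier.train(data_set)                 # step 4: C4.5-style rule learning
    ids.refresh_ruleset(rules)                         # push rules to the rule-based IDS
    return rules
```

In this loop the intrusion detection system keeps doing all real-time classification; the four modules only run between rule refreshes.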
2.1 Machine Learning of Traffic Data
Several approaches have been considered for machine
learning of attack patterns for intrusion detection sys-
tems. Neural networks have been used to achieve
this task (Chavan et al., 2004; Kemmerer and Vigna,
2005), but have difficulty generalizing their
knowledge and therefore detecting attacks that are not
present in the training data (Bouzida and Cuppens,
2005). Other approaches have been used, such as
statistical models (Ye et al., 2001; Ye et al., 2003) or
Petri nets (Kumar and Spafford, 1994). None of them
can be used naturally to build detection rules, and they
are thus impractical for our purposes.
Decision trees and rule-based systems (Hunt,
1962) have also been used for intrusion detection (Yu
et al., 2007; Chavan et al., 2004) and offer good performance
in terms of prediction rates and generalization
to new attacks (Bouzida and Cuppens, 2005).
The proposed system uses the rule-learning algorithm
C4.5 (Quinlan, 1993), which is essentially an
extension of the ID3 algorithm aimed at avoiding
overfitting.
Given a training data set, two thirds of it are
used as a growing data set, while the remaining third
is kept as a pruning data set. The growing data set is
used to build an ID3 tree, where an entropy gain function
is used to partition a data set S with respect to an attribute A.
The attribute with maximum entropy gain is cho-
sen to partition the data set at each node, so that a
tree is built iteratively. Continuous attributes are han-
dled in an equivalent way by calculating thresholds
through interpolation of consecutive values from the
data set for each continuous attribute and choosing the
threshold with maximum entropy gain.
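The attribute-selection criterion described above can be sketched as follows; this is an illustration of the standard entropy-gain computation for discrete attributes, not the authors' actual implementation:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(samples, labels, attr):
    """Entropy gain of partitioning (samples, labels) by discrete attribute attr:
    the parent's entropy minus the size-weighted entropies of the partitions."""
    n = len(samples)
    by_value = {}
    for s, y in zip(samples, labels):
        by_value.setdefault(s[attr], []).append(y)
    return entropy(labels) - sum(
        len(part) / n * entropy(part) for part in by_value.values())
```

For continuous attributes, the same gain function would be evaluated at each candidate threshold between consecutive values, keeping the threshold with maximum gain.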
After the decision tree is built, generating rules
of inference is straightforward by scanning all possible
paths in the tree from the top node to its leaves.
The left-hand side of each rule is a combination of the
conditions on each node, while the right-hand side is
the corresponding leaf's class. An estimation of each
rule's accuracy is made by calculating its accuracy on the pruning
data set. The resulting rules are pruned by removing
trailing conditions on the left-hand side only when
the resulting estimated accuracy is not lower. Finally,
the rules are sorted in decreasing estimated accuracy
order.
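The trailing-condition pruning step can be sketched like this (a simplified illustration; accuracy_fn stands in for the accuracy estimation on the pruning data set):

```python
def prune_rule(conditions, label, accuracy_fn):
    """Greedily drop trailing conditions while estimated accuracy does not drop.

    conditions  -- list of clauses from a root-to-leaf path of the tree
    label       -- the leaf's class (the rule's right-hand side)
    accuracy_fn -- callable estimating a rule's accuracy on the pruning data set
    """
    best = list(conditions)
    best_acc = accuracy_fn(best, label)
    while best:
        candidate = best[:-1]
        acc = accuracy_fn(candidate, label)
        if acc >= best_acc:  # prune only when estimated accuracy is not lower
            best, best_acc = candidate, acc
        else:
            break
    return best, best_acc
```

The rules returned for the whole tree would then be sorted by decreasing estimated accuracy, as described above.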
2.2 Prediction Verification
By using the mentioned classifier, a set of rules with
an estimation of accuracy is obtained, which serves
as a certainty factor in the prediction of classes of traffic
data. At detection time, a particular rule with an
associated estimated accuracy will fire. At this point,
it is possible to enforce a minimum estimated accuracy
threshold A_th to accept a prediction, a heuristic
that serves to discern traffic data that was considered
at training time from data that was not. Therefore,
setting an accuracy threshold makes it possible to populate
a data set with data that is supposedly new to the
system and which therefore needs proper classification.
As a result, the estimated accuracy of rules allows
prediction verification capabilities to be integrated into the
system.
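A minimal sketch of this verification heuristic, assuming the classifier exposes the firing rule's estimated accuracy alongside its prediction:

```python
def verify_predictions(samples, classify, acc_threshold=0.98):
    """Split samples into accepted predictions and discarded ones.

    classify -- returns (predicted_class, estimated_accuracy) for a sample,
                where the accuracy is the firing rule's estimate calculated
                on the pruning data set.
    """
    accepted, discarded = [], []
    for s in samples:
        label, acc = classify(s)
        (accepted if acc >= acc_threshold else discarded).append((s, label))
    return accepted, discarded
```

The discarded list is exactly the data set that is handed to the labeller for assisted supervision.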
2.3 Classification of New Data
A sample is regarded as new if the rule-based classifier
is not able to properly classify it as normal traffic
or as any kind of attack traffic, basing the decision on
an estimated accuracy threshold. As a result, this data
needs manual supervision by an external agent for its
classification. However, further help can be provided
in this task by automatically grouping similar traffic
data. Our system uses self-organizing maps to achieve
this, which have proven useful in other works (Bashah
and Shanmugam, 2005; Hoglund et al., 2000). Self-organizing
maps (Kohonen, 1997) use a Euclidean
similarity metric to achieve automatic clustering of
data by defining an overlaying set of reference vectors
on the feature space of the sample data set. Local-order
relations are set on the reference vectors so that
their values are dependent on their neighbouring
vectors. The self-organizing algorithm defines a non-linear
regression of the reference vectors through the
data points, which results in the reference vectors being
scattered over the space according to the data
set's probability density function. This makes it possible
to classify all data samples that are represented by the same
reference vector in one step, thus reducing the
supervision effort.
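For illustration, one update step of the self-organizing algorithm might look like the following sketch. This is a textbook-style SOM update with a Gaussian neighbourhood kernel, not necessarily the exact variant used in our system:

```python
import math

def som_step(refs, grid, x, lr, radius):
    """One self-organizing map update: pull the best-matching unit and its
    grid neighbours towards sample x, using Euclidean similarity.

    refs -- list of reference vectors; grid -- list of node coordinates
    lr   -- learning rate; radius -- neighbourhood radius on the grid
    """
    def dist(a, b):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

    bmu = min(range(len(refs)), key=lambda i: dist(refs[i], x))  # best-matching unit
    for i, r in enumerate(refs):
        h = math.exp(-dist(grid[i], grid[bmu]) ** 2 / (2 * radius ** 2))
        refs[i] = [ri + lr * h * (xi - ri) for ri, xi in zip(r, x)]
    return refs
```

Iterating this step over the data set, with decaying lr and radius, scatters the reference vectors according to the data's probability density function.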
When dimensioning the self-organizing map,
some problems need to be overcome, such as choosing
a number of nodes that makes the map able to
adapt to the whole data set, or enhancing rare cases so that
they are considered appropriately by the map. Applying the
self-organizing map algorithm to the whole data set might
result in the map's inability to adapt to values that are
too separated in Euclidean distance, and might also fail to
consider rare cases that carry little weight in the probability
density function. To prevent this from happening, visual
inspection of Sammon's mappings (Sammon, 1969)
of different maps helps to choose a correct form of
the array or to adapt the probability density function,
but this is a manual task that is not desired in our system,
and therefore a different approach is used. In
our case, the system divides the original discarded set
into several subsets to try to obtain
subsets with similar features and increase the self-organizing
map's accuracy. Different heuristics can
be used to perform this division, such as partitioning
through certain fields like protocol type or type of
service (Yu et al., 2007), all of these approaches
being aimed at reducing the information entropy of the
resulting subsets. The classifier's ruleset is a pruned version
of a decision tree that, as described in section 2.1,
is built through information entropy reduction on
the supervised training data set, and is thus a possible
heuristic for reducing information entropy on the
discarded samples data set.
To achieve the subdivision, samples are grouped
in our system by hierarchical coincidence of the classifier's
rule clauses. More precisely, each sample fires
a particular rule, whose left-hand side is defined by a
list of clauses (c_1, c_2, ..., c_i), ordered by classification
relevance as a result of the C4.5 algorithm. Therefore,
this allows hierarchical grouping of similar samples
by removing trailing clauses and grouping all the
samples that share the same clauses. A depth value
needs to be set in this case, with a higher value
resulting in a higher number of subsets, and a
lower value producing bigger ones with more heterogeneous
samples. The resulting sequence of clauses is
extended with the protocol, type of service and flag fields
to build a subset identifier for each sample.
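A sketch of the subset-identifier construction, assuming KDD'99-style field names (protocol_type, service, flag) and that each sample's firing-rule clauses are available as an ordered list:

```python
from collections import defaultdict

def subset_id(sample, fired_clauses, depth):
    """Build a subset identifier: the first `depth` rule clauses (ordered by
    classification relevance) extended with protocol, service and flag fields.
    The field names are illustrative, following KDD'99 attribute naming."""
    prefix = tuple(fired_clauses[:depth])
    return prefix + (sample["protocol_type"], sample["service"], sample["flag"])

def group_discarded(samples, fired, depth):
    """Group discarded samples whose subset identifiers coincide."""
    groups = defaultdict(list)
    for s, clauses in zip(samples, fired):
        groups[subset_id(s, clauses, depth)].append(s)
    return groups
```

A higher depth keeps more clauses in the identifier and therefore yields more, smaller subsets, matching the trade-off described above.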
Finally, the self-organizing map algorithm is applied
to every subset. A 3:2 aspect ratio is used for
the maps' dimensions in order to favour learning stability,
with a hexagonal topology and a total number
of nodes equal to 10% of the subset's cardinality,
up to a dimension limit of 30x20.
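This dimensioning policy can be sketched as follows; the exact rounding of the 3:2 sides is our assumption, since the paper only fixes the ratio, the 10% node budget and the 30x20 cap:

```python
import math

def som_dimensions(subset_size, node_fraction=0.10, limit=(30, 20)):
    """Choose map width x height with a 3:2 aspect ratio so that
    width * height is roughly node_fraction * subset_size, capped at 30x20."""
    nodes = max(1, round(node_fraction * subset_size))
    # width / height = 3 / 2 and width * height = nodes
    # => height = sqrt(nodes * 2 / 3)
    height = max(1, round(math.sqrt(nodes * 2 / 3)))
    width = max(1, round(height * 3 / 2))
    return min(width, limit[0]), min(height, limit[1])
```

For example, a subset of 600 samples gets a node budget of 60 and a 9x6 map, while very large subsets saturate at the 30x20 limit.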
2.4 Retraining
C4.5 rule-learning algorithm is batch-training-based.
To provide reinforced learning, the approach
used in this system is to build a new data set
with different proportions of samples. At this point,
three types of samples are found in the system: samples
discarded during prediction verification, training
data set samples that are detected correctly, and
training data set samples that are not detected correctly.
The proportion of samples of each kind and
their total amount, in the form of the triple (n_+, n_-, n_dis),
will determine the newly built data set and
thus the classification capabilities of the new classifier.

Table 1: Classifier performance on the training data set.
Prediction / real    normal    probe     dos       u2r       r2l       Total
normal               99.95%    1.23%     0.01%    25.00%     8.66%    99.77%
probe                 0.01%   98.52%     0.00%     0.00%     0.79%    99.26%
dos                   0.02%    0.25%    99.99%     0.00%     0.00%    99.99%
u2r                   0.00%    0.00%     0.00%    75.00%     0.00%   100.00%
r2l                   0.02%    0.00%     0.00%     0.00%    90.55%    98.29%
Total                99.95%   98.52%    99.99%    75.00%    90.55%    99.94%

Table 2: Classifier performance on the testing data set.
Prediction / real    normal    probe     dos       u2r       r2l       Total
normal               99.49%   17.76%     2.76%    54.29%    90.79%    73.29%
probe                 0.26%   70.21%     0.01%     0.00%     3.16%    80.91%
dos                   0.22%   12.03%    97.22%     0.00%     0.03%    99.72%
u2r                   0.02%    0.00%     0.00%    35.71%     2.62%     5.38%
r2l                   0.01%    0.00%     0.00%    10.00%     3.40%    95.52%
Total                99.49%   70.21%    97.22%    35.71%     3.40%    92.36%
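The data set builder's sampling can be sketched as follows; this illustration assumes sampling with replacement, which the paper does not specify:

```python
import random

def build_retraining_set(correct, incorrect, discarded, proportions, total):
    """Assemble a retraining data set by sampling each pool according to the
    triple (n_plus, n_minus, n_dis), e.g. (0.3, 0.1, 0.6) as in our experiments.

    correct/incorrect -- training samples the classifier got right/wrong
    discarded         -- samples rejected by the prediction verifier
    """
    n_plus, n_minus, n_dis = proportions
    data = (random.choices(correct, k=round(n_plus * total))
            + random.choices(incorrect, k=round(n_minus * total))
            + random.choices(discarded, k=round(n_dis * total)))
    random.shuffle(data)
    return data
```

The resulting mix is what the C4.5 classifier is retrained on, so the chosen proportions directly shape the new ruleset.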
3 PERFORMANCE EVALUATION
The system has been evaluated against the KDD'99 data
set (University of California, 1999), from The Third
International Knowledge Discovery and Data Mining
Tools Competition. This data set is made up of almost
five million samples of data connections that are classified
and grouped under a set of traffic classes called
normal (ordinary, non-malicious traffic), probe (monitoring
and probing activities), dos (denial of service
attacks, which often imply flooding activities),
u2r (user to root, which refers to users trying to acquire
root privileges) and r2l (remote to local, which
refers to remote unauthorized log-in).
A total of 41 attributes define each connection,
with symbolic fields such as type of service or type
of transport connection, and continuous fields such as
average packet size or login attempts. Some of these
fields are aggregated ones, e.g., connections to the
same host in a 2-second window, and are included
due to their proven relevance in attack detection (Lee
et al., 1999). Other features of the data set are
data inconsistency, in the sense that certain samples
with the same fields belong to different classes, and
the existence of new attack types in the testing data set,
which allows evaluation of the generalization capabilities
of classifiers. Therefore, all these features
make this data set an appropriate tool to tune and evaluate
our system.
3.1 Classifier Performance
Our rule-based classifier was trained with a subset of
the KDD'99 training data set. Performance on the
training set and the testing set is shown in Tables 1 and
2, respectively. Its difficulty in detecting certain attacks
in the training data set is a consequence of its commitment
to generalising to new attacks, which is achieved
by the rule-pruning phase described in section 2.1,
with the classifier presenting the known difficulties
on this sample data (Bouzida and Cuppens, 2005).
Overall performance is comparable with other classifiers,
such as the KDDCup'99 winner (Pfahringer, 1999),
a boosting-based classifier which offers 92.71% accuracy.
As described in section 2.2, a prediction verifier is
used to discard samples whose predictions
might a priori be regarded as unacceptable. This is
achieved by using the rules' estimated accuracy, calculated
on the pruning data set, as the confidence factor.
The results of different accuracy thresholds A_th
are shown in Table 3, which, as expected, improve
the overall accuracy of the classifier. This allows our
classifier to outperform all others, at the cost of marking
conflicting samples as discarded. By observing
the results, an accuracy threshold of 0.98 seems an
appropriate value, offering an acceptable compromise
between packet discard ratio and accuracy, and
has thus been used in the rest of the experiments in this
paper.
3.2 Labeller Performance
Discarded samples are collected by the prediction verifier
for further supervision. In our system, this task
is assisted by the sample labeller. The discarded samples
obtained from the classifier are grouped into a
number of subsets according to the sample fields and
rule clauses. Afterwards, the self-organizing map
algorithm is applied to these subsets, and a number of
nodes is obtained, this number being proportional to
each subset's cardinality. Different rule depth values can
be used in the subdivision of the original discarded
samples data set. Different results were obtained, as
shown in Table 4, with higher depth values offering a
higher accuracy. Since the highest depth value
requires classification of only 15% of samples while
offering the highest accuracy, it can be considered the
optimal depth value for the labeller and has been used
in further experiments.

Table 3: Effect of accuracy threshold.
A_th      Discards    Accuracy
0.0         0.00%     92.36%
0.9         1.13%     93.11%
0.95        1.59%     93.19%
0.96        1.59%     93.19%
0.97        1.59%     93.19%
0.98        1.83%     93.21%
0.99        5.73%     94.07%
0.995       5.73%     94.07%
0.999       5.73%     94.07%
0.9999    100.00%
Table 4: Labelling performance.
Depth Subsets Nodes Accuracy
0 74 13.36% 89.03%
2 95 13.89% 90.43%
4 118 14.43% 91.41%
6 150 15.71% 93.19%
8 162 15.94% 95.33%
3.3 Overall Performance
After labelling of discarded packets has been
achieved, the classifier is retrained by building a new
training set. Different parameters are possible when
building the new training set: n_+ and n_- determine
the proportions of training samples that were correctly
and incorrectly classified, respectively, and n_dis determines
the proportion of samples that were discarded
at prediction time. The result of varying these
parameters is shown in Table 5. Accuracies A_+ and
A_- are calculated on the correct and incorrect subsets
of the training data set, while A_dis is calculated on the
correctly labelled discarded data set. Overall accuracy
A is calculated on the full testing data set.
By observing the results, the classifier shows
optimal performance with n_+ = 0.3, n_- = 0.1 and
n_dis = 0.6. It is noticeable that there is a top achievable
accuracy on the discarded data set, which is
determined by the accuracy of the labeller. Also, the
overall accuracy should apparently lie between
A_+, A_- and A_dis, although this does not happen
because the newly built training data combined
correctly and incorrectly classified samples from the
training set with discarded samples from the testing set,
and this last set is built heuristically and therefore
does not necessarily include the problematic samples
that would be useful for a second learning phase.

Table 5: Performance after retraining.
n_+    n_-    n_dis    A_+       A_-       A_dis     A
0.5    0.4    0.1      99.71%    100.0%    92.55%    92.32%
0.5    0.1    0.4      99.64%    100.0%    94.61%    93.32%
0.3    0.2    0.5      99.47%    100.0%    95.08%    93.41%
0.4    0.1    0.5      99.56%    100.0%    94.94%    93.46%
0.5    0.3    0.2      99.54%    100.0%    93.45%    93.53%
0.3    0.1    0.6      99.57%    100.0%    94.94%    93.63%
4 RELATED WORK
Rule-based systems have been widely used in previ-
ous intrusion detection systems. Their performance in
terms of accuracy and speed makes them appropriate
for intrusion detection tasks, while their internal rep-
resentation of knowledge in the form of rules favours
human interpretation.
(Wuu and Chen, 2003) uses discrete attributes and
is focused on the generation of attack signatures, and
thus does not consider prediction verification or
assistance in the classification of new data.
(Yu et al., 2007) uses a different heuristic for
prediction verification based on its boosting-based
classifier, which consists of a set of binary rule-based
classifiers. Each of their binary classifiers has an
associated confidence factor, which is combined with
those of the rest of the classifiers to estimate prediction
confidence. While their confidence factor shows good
performance, having a set of binary classifiers implies
using many rulesets. This contrasts with our approach,
which uses a single ruleset and can thus be integrated
into an existing rule-based intrusion detection
system.
5 CONCLUSIONS
Network intrusion detection systems have to deal with
continuous changes in software vulnerabilities, attacks
and exploits. Our system reuses a rule-based intrusion
detection system such as Snort to implement
an adaptive machine-learning-based layer on top of
it. Under this constraint, our system offers good overall
performance and adds features such
as prediction verification for the automatic collection of
problematic data and assisted classification of training
data, which favours the freshness of detection rules
and adds a heuristic and adaptive approach to existing
rule-based misuse intrusion detection systems.
ACKNOWLEDGEMENTS
This research is funded in part by the Spanish Government
under the R&D project IMPROVISA (TSI2005-07384-C03).
REFERENCES
Bashah, N. and Shanmugam, B. (2005). Artificial Intelli-
gence Techniques Applied to Intrusion Detection. In
IEEE Indicon Conference, Chennai, India.
Bouzida, Y. and Cuppens, F. (2005). Neural networks vs.
decision trees for intrusion detection. In Proceedings
of the 43rd annual Southeast regional conference.
Chavan, S., Shah, K., Dave, N., and Mukherjee, S.
(2004). Adaptive Neuro-Fuzzy Intrusion Detection
Systems. In Proceedings of the International Confer-
ence on Information Technology: Coding and Com-
puting (ITCC’04).
Denning, D. (1987). An Intrusion-Detection Model. In
IEEE Transactions on Software Engineering.
Deri, L., Suin, S., and Maselli, G. (2003). Design and
implementation of an anomaly detection system: An
empirical approach. In Proceedings of Terena TNC,
2003.
Hoglund, A. J., Hatonen, K., and Sorvari, A. S. (2000). A
Computer Host Based User Anomaly Detection Sys-
tem Using Self Organizing Maps. In Proceedings
of the International Joint Conference on Neural Net-
works, IEEE IJCNN 2000, Vol. 5, pp. 411-416.
Hunt, E. B. (1962). Concept learning: an information pro-
cessing problem. Wiley.
Kemmerer, R. A. and Vigna, G. (2005). Hi-DRA: Intrusion
Detection for Internet Security. In Proceedings of the
IEEE, October 2005.
Kohonen, T. (1997). Self-Organizing Maps. Springer-
Verlag New York, Inc.
Kumar, S. and Spafford, E. (1994). A Pattern Matching
Model for Misuse Intrusion Detection. In Proceedings
of the 17th National Security Conference.
Lee, W., Stolfo, S., and Mok, K. W. (1999). A Data Mining
Framework for Building Intrusion Detection Models.
In Proceedings of the 1999 IEEE Symposium on Secu-
rity and Privacy.
Mukherjee, B., Heberlein, L. T., and Levitt, K. N. (1994).
Network Intrusion Detection. In IEEE Network.
Pfahringer, B. (1999). Winning the KDD99 classification
cup: Bagged boosting. In ACM SIGKDD Explor., vol.
1, no. 2, pp 65-66.
Quinlan, R. (1993). C4.5: Programs for Machine Learning.
Morgan Kaufmann Publishers, Inc.
Roesch, M. (1999). Snort - Lightweight Intrusion Detection
for Networks. In Proceedings of the 13th USENIX
conference on System administration.
Sammon, J. W. (1969). A nonlinear mapping for data struc-
ture analysis. IEEE Transactions on Computers, C-
18(5):401-409, May 1969.
University of California (1999). The Third International
Knowledge Discovery and Data Mining Tools
Competition Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
Wuu, L. C. and Chen, S. F. (2003). Building Intrusion Pat-
tern Miner for Snort Network Intrusion Detection Sys-
tem. In IEEE Computer Society.
Ye, N., Emran, S., Li, X., and Chen, Q. (2001). Statistical
Process Control for Computer Intrusion Detection. In
Proceedings DISCEX II.
Ye, N., Vilbert, S., and Chen, Q. (2003). Computer Intru-
sion Detection through EWMA for Auto Correlated
and Uncorrelated Data. In IEEE Transactions on Re-
liability.
Yu, Z., Tsai, J. J. P., and Weigert, T. (2007). An Automatically
Tuning Intrusion Detection System. IEEE
Transactions on Systems, Man, and Cybernetics, vol.
37, no. 2, April 2007.