MISUSE DETECTION
A Neural Network vs. A Genetic Algorithm Approach
Pedro A. Diaz-Gomez and Dean F. Hougen
Robotics, Evolution, Adaptation, and Learning Laboratory (REAL Lab)
School of Computer Science, University of Oklahoma
Norman, OK, USA
Keywords:
Misuse detection, genetic algorithms, neural networks, false negative, false positive.
Abstract:
Misuse detection can be addressed as an optimization problem, where the problem is to find an array of possible intrusions x that maximizes a function f(·) subject to a constraint r imposed by a user's actions performed on a computer. This position paper presents and compares two ways of finding x in audit data: neural networks and genetic algorithms.
1 INTRODUCTION
The misuse detection problem is the problem of finding intrusions of known types. Knowing the intrusion types in advance, the problem is to find instances of them in audited data.

The misuse detection problem can be formulated as follows: given the observation vector $OV \in \mathbb{Z}^{+}$ and the Attack-Event matrix $AE \in \mathbb{Z}^{m \times n}$ of known intrusion types, find the best parameter vector $x \in \{0, 1\}$ such that $r_i(x) = (AE \cdot x)_i - OV_i \le 0$ for all $0 < i \le m$, where the $x_j$ are independent variables for all $0 < j \le n$ ($m$ is the number of event types to consider and $n$ is the number of intrusions to check). The best $x$ is the one that minimizes the length of the residual $r(x)$, i.e., we are facing a linear least squares problem, which can be solved with different methods. However, one can also look at the problem as a linear constrained optimization problem, which a Neural Network (NN) can be proposed to solve.
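As a minimal sketch of the constraint above, the residual $r(x) = AE \cdot x - OV$ and the feasibility test $r_i(x) \le 0$ can be written in a few lines. The matrix `AE`, the vector `OV`, and the function names are hypothetical and chosen only for illustration; they are not the data used in this paper.

```python
def residual(AE, x, OV):
    """Return r(x) = AE.x - OV, one entry per event type."""
    return [sum(a * xj for a, xj in zip(row, x)) - ov
            for row, ov in zip(AE, OV)]


def is_feasible(AE, x, OV):
    """True when no event type is used more often than it was observed."""
    return all(r <= 0 for r in residual(AE, x, OV))


# Hypothetical toy instance: m = 3 event types (rows), n = 2 intrusion types (columns).
AE = [[3, 0],
      [1, 2],
      [0, 4]]
OV = [3, 2, 5]

print(residual(AE, [1, 0], OV))     # [0, -1, -5]
print(is_feasible(AE, [1, 0], OV))  # True
print(residual(AE, [1, 1], OV))     # [0, 1, -1]
print(is_feasible(AE, [1, 1], OV))  # False
```

In this toy instance the hypothesis that both intrusions occurred is rejected because it would need three events of the second type while only two were observed.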
Mé (1998) addressed the misuse detection problem as the problem of finding the vector $x \in \{0,1\}$ that maximizes the function $f(x) = w^T \cdot x$, subject to the constraint $r_i(x) = (AE \cdot x)_i - OV_i \le 0$ for all $0 < i \le m$, where $w$ is a weighting vector that allows more importance to be assigned to finding some intrusions, $AE \in \mathbb{Z}^{+}$ is the matrix whose columns are known intrusion types and whose rows are the events necessary for intrusions of those types to be carried out, and $OV \in \mathbb{Z}^{+}$ is the filtered audit data to be analyzed. The linear problem, in which the coefficients of the solution $x \in \mathbb{Z}^{+}$ are in $\{0,1\}$, can be polynomially reduced to the zero-one integer programming problem, which is NP-Complete (Mé, 1993). Mé (1998) proposes the use of a Genetic Algorithm (GA) to solve it because of the capability of the GA to work on different subsets of possible solutions; however, some of those subsets could be exclusive (Mé, 1998; Diaz-Gomez and Hougen, 2006; Diaz-Gomez and Hougen, 2005b), making the problem harder to solve.
How the problem is addressed can reveal different methods to solve it. Some methods require more computation time and/or space than others, and some give better quality solutions than others. This position paper presents two approaches, an NN and a GA, to approximately solve the misuse detection problem, and compares their computational complexities.
2 NEURAL NETWORKS FOR OPTIMIZATION
Neural networks have been widely used to solve optimization problems (Ham and Kostanic, 2001) and, as was addressed in Section 1, the misuse detection problem can be seen as an optimization problem where we want to maximize $f(x) = w^T \cdot x$, subject to $r_i(x) = AE_{i1} x_1 + AE_{i2} x_2 + \ldots + AE_{in} x_n - OV_i \le 0$ for $i = 1, 2, \ldots, m$, with $x_1 \ge 0, x_2 \ge 0, \ldots, x_n \ge 0$, where the $x_i$ are independent variables and $w$ is the weighting vector.

Diaz-Gomez, P. A. and Hougen, D. F. (2007). MISUSE DETECTION - A Neural Network vs. A Genetic Algorithm Approach. In Proceedings of the Ninth International Conference on Enterprise Information Systems - AIDSS, pages 459-462. DOI: 10.5220/0002410904590462. Copyright © SciTePress

Table 1: Intrusions $x_i$ found by a neural network.
21 48 69 96 117 144 165 192 213 240 261 288 309 336 357 384 405 432 453 480 501
528 549 576 597 624 645 672 693 720 741 768 789 816 837 864 885 912 933 960 981
In order to solve this linear problem with inequality constraints, Ham and Kostanic (2001) propose the use of an NN with the recursive equation of motion

$$x_j(k+1) = \begin{cases} x_j(k) - \mu_j \left[ w_j + K \sum_{i=1}^{m} r_i(x)\, AE_{ij} \right] & \text{if } x_j(k+1) \ge 0, \\ 0 & \text{if } x_j(k+1) < 0 \end{cases} \qquad (1)$$

where $\mu_j$ is the learning rate, $K$ is a positive parameter, and $k$ is the iteration step.
We set the following parameters: $x_j(0) = 0$ for all $j$; $\mu_j = \mu_0 / \log(1 + k)$ with $\mu_0 = 0.005$ (Ham and Kostanic, 2001); $w_j = 1\ \forall j$; $K = 1$; $OV$, which is used in $r_i(x)$, as in Table 2; and $AE$ the $m \times n$ matrix whose columns are intrusions, with $m = 28$ and $n = 1{,}008$. The NN stops if $\mu_j < 0.00001$ or if the number of iterations reaches $h = 6{,}000$.
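The recursion of Equation 1 with the parameters just listed can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the experiments: the update is read as $x_j(k) - \mu_j [w_j + K \sum_i r_i(x) AE_{ij}]$ followed by clamping at zero, the counter $k$ starts at 1 so that $\log(1 + k)$ is never zero, and the small `AE` and `OV` below are hypothetical toy data.

```python
import math


def nn_solve(AE, OV, w, K=1.0, mu0=0.005, h=6000, mu_min=1e-5):
    """Iterate the equation of motion from x(0) = 0 and return the final x."""
    m, n = len(AE), len(AE[0])
    x = [0.0] * n
    # k starts at 1 (an assumption) so that log(1 + k) is never zero.
    for k in range(1, h + 1):
        mu = mu0 / math.log(1 + k)      # decaying learning rate
        if mu < mu_min:                 # first stopping criterion
            break
        # Residual r_i(x) = (AE . x)_i - OV_i for every event type i.
        r = [sum(AE[i][j] * x[j] for j in range(n)) - OV[i] for i in range(m)]
        # Take the step for every component, then clamp negative values to 0.
        x = [max(0.0, x[j] - mu * (w[j] + K * sum(r[i] * AE[i][j] for i in range(m))))
             for j in range(n)]
    return x


# Hypothetical toy data: 2 event types (rows), 3 intrusion types (columns).
AE = [[3, 8, 0],
      [0, 0, 2]]
OV = [11, 0]   # the second event type was never observed
x = nn_solve(AE, OV, w=[1, 1, 1], h=2000)
print([round(v, 3) for v in x])
```

In this toy run the third component stays at 0, because the only event it requires was not observed, and the clamp keeps every component non-negative.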
The net found 41 out of 108 possible intrusions (see Table 1) and had no false positives. Some convergence values for iterations up to 600 are shown in Figure 1. The last $\mu_j$ was $\mu_{6000} = 0.00057473$.
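The reported final learning rate can be checked directly from the schedule $\mu_k = \mu_0 / \log(1 + k)$, taking the logarithm to be the natural logarithm (an assumption that the reported value supports). The function name is hypothetical.

```python
import math

MU_0 = 0.005


def learning_rate(k):
    """Learning rate at iteration k, with the natural logarithm assumed."""
    return MU_0 / math.log(1 + k)


print(f"{learning_rate(6000):.8f}")  # 0.00057473
```

Under this schedule the threshold $\mu_j < 0.00001$ would require $\log(1 + k) > 500$, so in practice the iteration limit $h$ is the criterion that ends the run.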
[Figure 1 plot: Convergence Value (0 to 2) against Iteration Number (0 to 600), with curves labelled Intrusions and Non-Intrusion.]
Figure 1: Intrusions type 48 and 21 found by a neural network. At iteration 6,000 the convergence values were $x_{48} = 0.6426$, $x_{21} = 0.241$ and non-intrusion $x_{917} = 3.5889 \times 10^{-34}$.
Table 2 shows an example of an $OV$ vector and the result of $AE \cdot x - OV$, which shows that the neural network found an $x$ vector that does not violate the constraint. The solution $x$ is such that $x_i \ge 0,\ \forall i$. For example, looking at entries in the $AE$ matrix, intrusions $j = 21 + \mathrm{mod}(0, 48)$ (that is, $j = 21, 69, 117, \ldots$) have values $AE_{26,\,21+\mathrm{mod}(0,48)} = 3$ and there were 21 of those; intrusions $j = 48 + \mathrm{mod}(0, 48)$ (that is, $j = 48, 96, 144, \ldots$) have values $AE_{26,\,48+\mathrm{mod}(0,48)} = 8$ and there were 20 of those. Together they give a total activity for entry 26 of $3 \times 21 \times 0.241 + 8 \times 20 \times 0.6426 \approx 118$, which matches $OV_{26}$ (see Figure 1 and Table 2).
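The arithmetic for entry 26 can be reproduced from the values quoted above; the variable names are hypothetical.

```python
# Convergence values reported in Figure 1.
x_21, x_48 = 0.241, 0.6426

# 21 intrusions with AE entry 3 and 20 intrusions with AE entry 8 in row 26.
activity_26 = 3 * 21 * x_21 + 8 * 20 * x_48

print(round(activity_26, 3))  # 117.999
```

The result agrees with $OV_{26} = 118$ to within the rounding of the reported convergence values.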
It should be emphasized that if the initial conditions change, for example if $x_j(0) = 1$ for all $j$, then the algorithm converges to a second solution. That solution finds all 108 possible intrusions, but it violates the constraint and gives 399 false positives.
3 GENETIC ALGORITHMS FOR OPTIMIZATION
A GA starts with an initial population $P_0 \in \{0,1\}^{s \times l}$ of possible solutions, usually generated randomly, where $s$ is the population size and $l$ is the length of each possible solution. The algorithm iterates $g$ times (generations) through all the individuals in the population, looking for the fittest individuals $x$, which are artificially selected to mate and possibly give rise to new offspring, according to a fitness function $f(\cdot)$. We follow the guidelines of Mé (1998) but use a different fitness function that tries to avoid false alarms while finding a maximum number of intrusions, and we use an operator called the union operator (Diaz-Gomez and Hougen, 2006). While the GA iterates, the union operator stores all possible solutions (local maxima) and checks whether a new one violates the constraint $(AE \cdot x)_i \le OV_i,\ \forall i$. If a new intrusion is found that does not violate the constraint, it is added to a set $S_1$; if it violates the constraint, it is added to a set $S_2$, such that the entire set of possible intrusions is $S = S_1 \cup S_2$.
The fitness function uses the partial derivative with respect to $x$ of the Energy Function as in Ham and Kostanic (2001). This derivative is equated to zero in order to obtain a critical $x$, giving for each component $j$, $\sum_{i=1}^{m} r_i(x)\, AE_{ij} = -\frac{w_j}{K}$, which is satisfied by the condition $\sum_{i=1}^{m} r_i(x) \le 0$; this condition is used as a penalty when $r_i(x) > 0$ (Diaz-Gomez and Hougen, 2005a; Diaz-Gomez and Hougen, 2006), which gives the fitness function of Equation 2.
ICEIS 2007 - International Conference on Enterprise Information Systems
Table 2: Event type, vector of observations $OV$, and constraint comparison using solution $x$ (shown in Table 1), which does not violate the constraint.
Event #       1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28
OV            0    0    0    0    0    0    0    1    0    0    5    0    0    0    0    1    0   25    0   13    0    0    0    2    0  118  315    0
AE·x - OV     0    0    0    0    0    0    0   -1    0    0   -5    0    0    0    0   -1    0  -25    0  -13    0    0    0   -2    0    0 -315    0
Table 3: A subset of intrusions $S_2$ that violates the constraint with subset $S_1$ found by a GA.
$S_1$: 21 69 155 157 192 213 261 309 336 384 453 480 501 528 576 683 693 741 768 789 837 864 933 960 1008
$S_2$: 48 96 117 144 165 240 288 357 405 432 549 597 623 644 815 884 911 980
$$f(x) = \frac{\sum_{i=1}^{m} (AE \cdot x)_i - \sum_{i=1}^{m} \max\{0, r_i(x)\}}{\sum_{i=1}^{m} (AE \cdot x)_i} \qquad (2)$$
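A minimal sketch of this fitness function follows. It assumes that Equation 2 is the ratio of the total activity minus the penalty to the total activity, and that an empty hypothesis (total activity zero) receives fitness 0; both are assumptions made for illustration, as are the toy `AE` and `OV`.

```python
def fitness(AE, x, OV):
    """Fitness of hypothesis x, read as (total - penalty) / total."""
    activity = [sum(a * xj for a, xj in zip(row, x)) for row in AE]  # (AE.x)_i
    total = sum(activity)
    if total == 0:
        return 0.0  # assumption: the empty hypothesis gets fitness 0
    # Penalty: sum of max{0, r_i(x)} with r_i(x) = (AE.x)_i - OV_i.
    penalty = sum(max(0, a - ov) for a, ov in zip(activity, OV))
    return (total - penalty) / total


# Hypothetical toy instance: 3 event types, 2 intrusion types.
AE = [[3, 0],
      [1, 2],
      [0, 4]]
OV = [3, 2, 5]

print(fitness(AE, [1, 0], OV))  # 1.0
print(fitness(AE, [1, 1], OV))  # 0.9
```

Under this reading a hypothesis that respects the constraint scores 1, and the score drops in proportion to the number of events the hypothesis needs beyond those observed.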
The parameter settings are: a population size of $s = 1{,}000$, the fitness function as in Equation 2, one-point crossover with probability 60%, a mutation rate of 2.4% per chromosome, and a total of $g = 20{,}000$ generations (Diaz-Gomez and Hougen, 2006). The same $AE$ matrix and $OV$ vector were used as for the case of the NN.

The GA was run 30 times with the set of parameters just defined, and it found on average 69.3 intrusions (std = 19.4). The set of solutions found, $S$, was disaggregated by the GA as it was running; that is, the GA used has the capability to separate the two disjoint sets $S_1$ and $S_2$ (Diaz-Gomez and Hougen, 2005b), as shown in Table 3.
4 NEURAL NETWORK VS. GENETIC ALGORITHM APPROACH
The first topic that we address is how the algorithms presented here distinguish an intrusion from a non-intrusion; the second is the computational complexity of each algorithm.
4.1 Intrusions vs. Non-Intrusions
For the GA it is clear that a 1 in $x_i$ means a possible intrusion $i$ occurred and a 0 means non-intrusion. For the NN, if $x_i$ converges to a value $> 0$ then we consider it a possible occurrence of an intrusion. However, there is no exact threshold for the NN to distinguish an intrusion from a non-intrusion, as there is in the GA case. To reinforce this fact, we performed the tests again with the same set of parameters defined in Section 2 but with a changed vector of observations ($OV'$), as in Table 4. The solution of the NN was the same as in Section 2 (see Table 1), but the convergence values of the intrusions ($x_{48} = 0.0163$, $x_{21} = 0.0061$) and of the non-intrusion ($x_{917} = 0$) changed.
The NN was looking at each variable (intrusion) $x_i$ independently, as it is expected to do in accordance with the conditions of this paradigm: what matters most here is the convergence of $x_i \ge 0$ and that the $x_i$s are independent. By contrast, the GA looks at the possible solution $x$ with all its components $x_i$ together; that is, if a possible solution $x$ violates the constraint, then $x$ is penalized accordingly (see Equation 2). The set of $x_i$s is evaluated by the GA at the same time, and the algorithm chooses them by looking for the best of those sets.
As the NN does not have the capability to look at exclusive sets of intrusions ($S_1 \cap S_2 = \emptyset$), because the $x_j$ are independent for $0 < j \le n$, an iterative process that receives as input the output of the NN (i.e., Table 1) and analyzes violations of the constraint using $x_j \in \{0, 1\}$ can be used as a second phase. This process looks at each row of the $AE$ matrix for the columns corresponding to the positions of the NN solution $x$ where $x_j$ is considered a possible intrusion. The output is a subset of Table 1, given in Table 5 as $S_1$. This time we obtain 12 intrusions of type $21 + \mathrm{mod}(0, 48)$ (see Section 2) and 10 intrusions of type $48 + \mathrm{mod}(0, 48)$, which gives a total of $3 \times 12 \times 1 + 8 \times 10 \times 1 = 116$, which clearly does not violate the constraint (i.e., $116 \le OV_{26} = 118$). Accepting more than these 22 intrusions will begin to violate the constraint (see $S_2$ in Table 5).
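The second phase can be sketched as a single pass over the NN output. The sketch below is restricted to event 26, the only row for which the text gives $AE$ entries (3 for intrusions of type 21 and 8 for intrusions of type 48), and it accepts candidates in ascending order of $j$; the restriction to one row, the acceptance order, and all names are assumptions made for illustration.

```python
OV_26 = 118

# NN output of Table 1: 21 intrusions of type 21 and 20 intrusions of type 48.
nn_output = sorted([21 + 48 * k for k in range(21)] +
                   [48 + 48 * k for k in range(20)])


def ae_26(j):
    """AE entry in row 26 for intrusion j, as quoted in Section 2."""
    return 3 if j % 48 == 21 else 8


S1, S2, used = [], [], 0
for j in nn_output:
    if used + ae_26(j) <= OV_26:   # setting x_j = 1 keeps (AE.x)_26 <= OV_26
        S1.append(j)
        used += ae_26(j)
    else:                          # setting x_j = 1 would violate the constraint
        S2.append(j)

print(len(S1), len(S2), used)  # 22 19 116
```

Under these assumptions the pass accepts 22 intrusions (12 of type 21 and 10 of type 48) that account for 116 of the 118 observed events, in agreement with Table 5.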
4.2 Computational Complexity
The NN needs to calculate the constraint, i.e., $AE \cdot x - OV$, which has a cost of $m \times n$; it then adjusts $x$ and, as the algorithm iterates $h$ times, this gives an estimated computational complexity of $O(mnh)$. The GA needs to calculate the constraint for each individual in the population, at a cost of $m \times n$ per individual; with $s$ individuals this gives $m \times n \times s$ per generation, and as the algorithm iterates for $g$ generations, it gives a total computational complexity of $O(mnsg)$ (Diaz-Gomez and Hougen, 2007). So the GA computational complexity is higher by a factor of $O(sg/h)$.

Table 4: Vector of observations $OV'$. Same solution $x$ (shown in Table 1), which does not violate the constraint.
OV'           0    0    0    0    0    0    1   40    0    0    5    0    0    0    0    1    0   25    0   13    0    0    0    2    0    3   30    0
AE·x - OV'    0    0    0    0    0    0   -6  -40    0    0   -5    0    0    0    0   -1    0  -25    0  -13    0    0    0   -2    0    0  -30    0

Table 5: Second phase. A subset of intrusions $S_2$ that violates the constraint with subset $S_1$ found by the iterative process.
$S_1$: 21 48 69 96 117 144 165 192 213 240 261 288 309 336 357 384 405 432 453 480 501 549
$S_2$: 528 576 597 624 645 672 693 720 741 768 789 816 837 864 885 912 933 960 981
The space complexity of the NN can be considered $O(nm)$, because it needs to store the $AE$ matrix and the $OV$ and $x$ vectors. The GA, besides these structures, needs to store the population, which is of order $O(sl)$. So the GA space complexity exceeds the NN space complexity by $O(sl)$.
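Plugging in the parameter values reported in Sections 2 and 3 gives a sense of scale for these estimates. The counts below are the dominant terms only, not measured running times, and the sketch assumes $l = n$ (each individual has one bit per intrusion type).

```python
m, n = 28, 1008        # event types and intrusion types
h = 6000               # NN iterations
s, g = 1000, 20000     # GA population size and generations
l = n                  # assumption: one bit per intrusion type

nn_ops = m * n * h          # O(mnh)
ga_ops = m * n * s * g      # O(mnsg)

print(nn_ops)                      # 169344000
print(ga_ops)                      # 564480000000
print(round(ga_ops / nn_ops, 1))   # 3333.3
print(s * l)                       # 1008000
```

With these settings the ratio $sg/h$ is about 3,333, and the population adds on the order of $10^6$ bits to the storage needed for $AE$, $OV$ and $x$.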
5 CONCLUSIONS AND FUTURE WORK
Two paradigms were tested on the misuse detection problem in audit trail files. As some intrusions share the same types of events, the possible solution $x$ is such that some $x_i$ are dependent, which makes the genetic algorithm paradigm better suited to solving this problem. However, the quality of the solution obtained with the GA comes at a computational cost higher by a factor of $O(sg/h)$ (population size times the ratio of the number of generations to the number of NN iterations) and an additional space cost of $O(sl)$ (population size times the length of $x$) with respect to the NN.
The GA has the advantage of discriminating an intrusion from a non-intrusion, as the solution of the problem is encoded as 1 (intrusion) and 0 (non-intrusion). As the range of values of $x_i$ for the NN is such that $x_i \ge 0$, the values of intrusions are input dependent; they depend on the observed vector $OV$. However, at least for this test set, non-intrusions are variables $x_i$ that converge to 0 or to values $\approx 0$ when $x$ is zero in the initial conditions.
For the test set defined in this paper, there were no false positives, except if we consider the NN without the second phase (see Section 4) or if the initial conditions change (see Section 2). On the false negative side, if we look at the two sets $S_1$ and $S_2$ (see Section 3), the GA has on average (over 30 runs) 39.14% false negatives, and the NN has 60.95%. However, the set $S_2$ can contain exclusive intrusions, so the process can continue until we get a set of mutually exclusive subsets whose union is $S$ (Diaz-Gomez and Hougen, 2007).
To improve the false negative ratio of the GA, increasing the population size ($s > 1{,}000$) may decrease the ratio; however, the number of generations $g$ may need to be considered too, independently or in conjunction with the population size. For the NN, diminishing the false negative ratio is a more challenging problem: once all the $x_i$s have converged, a higher number of iterations $h$ brings no improvement in the solution $x$.
REFERENCES
Diaz-Gomez, P. A. and Hougen, D. F. (2005a). Analysis and mathematical justification of a fitness function used in an intrusion detection system. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1591–1592.
Diaz-Gomez, P. A. and Hougen, D. F. (2005b). Improved off-line intrusion detection using a genetic algorithm. In Proceedings of the 7th International Conference on Enterprise Information Systems, pages 66–73.
Diaz-Gomez, P. A. and Hougen, D. F. (2006). A genetic algorithm approach for doing misuse detection in audit trail files. In Proceedings of the CIC-2006 International Conference on Computing, pages 329–335.
Diaz-Gomez, P. A. and Hougen, D. F. (2007). Misuse detection: An iterative process vs. a genetic algorithm approach. In Proceedings of the 9th International Conference on Enterprise Information Systems.
Ham, F. M. and Kostanic, I. (2001). Principles of Neurocomputing for Science & Engineering. McGraw-Hill.
Mé, L. (1993). Security audit trail analysis using genetic algorithms. In Proceedings of the 12th International Conference on Computer Safety, Reliability, and Security, pages 329–340.
Mé, L. (1998). GASSATA, a genetic algorithm as an alternative tool for security audit trail analysis. In Proceedings of the First International Workshop on the Recent Advances in Intrusion Detection.