function could be expressed as:
SIM(A, B) = \frac{1}{p} \sum_{i=1}^{p} \left( 1 - \frac{|\mathrm{ord}(a_i) - \mathrm{ord}(b_i)|}{\mathrm{card}(O)} \right) \qquad (1)
in which p is the number of compared attributes, ord(·) returns the ordinal position of a category value, and card(O) is the cardinality of the set O of possible category values for a_i and b_i. In this first approach, the weight coefficient is the same for every property (1/p). It is important to highlight that the similarity function weights are not adjusted to the knowledge base, so the global function does not weigh each local characteristic according to its true relevance: every property bears the same importance when discriminating between cases. To address this, another strategy, inductive reasoning with decision trees, is applied to the similarity; it is discussed in the next section.
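As an illustration, Eq. (1) can be implemented as follows (a minimal Python sketch; the function name, the dictionary-based ordinal mapping, and the assumption of a single category set O shared by all attributes are choices made for this example only):

    def sim(a, b, categories):
        """Equally weighted similarity of Eq. (1) over ordinal attributes.

        a, b       -- characteristic vectors of category values
        categories -- the ordered set O of possible category values
        """
        p = len(a)                                # number of compared attributes
        card_o = len(categories)                  # card(O)
        ordinal = {v: i for i, v in enumerate(categories)}
        return sum(1 - abs(ordinal[x] - ordinal[y]) / card_o
                   for x, y in zip(a, b)) / p

For instance, with the hypothetical category set ('low', 'medium', 'high'), sim(['low', 'high'], ['medium', 'high'], ['low', 'medium', 'high']) evaluates to (1/2)((1 - 1/3) + (1 - 0)) ≈ 0.83.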
2.4 Decision Tree
A decision tree is used to classify the cases according to their common features. Each node of the tree stores one of the properties of the characteristic vector, and each arc covers one of the possible alternative values for that property (Buckinx, 2004). One of the best-known algorithms is ID3, which is the approach adopted here. To select the best attribute at each node, the entropy and information gain concepts defined in (Mitchell, 1997) are used. It is worth noting that the solution attribute is not a concept restricted to two values (positive and negative); its cardinality is 5 (see Table 1). It should be highlighted that if there is no case associated with a leaf node, the solution returned is the most frequent one at the parent node. Since such a leaf references no case, another method is needed to retrieve a suitable solution; and because the nodes at the same level keep common characteristics, it is natural to select the most common solution among them.
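The attribute selection step can be sketched as follows (an illustrative Python fragment following the definitions in (Mitchell, 1997); the representation of cases as dictionaries and all helper names are assumptions of this example):

    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy of a list of solution labels (here up to 5 values)."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(cases, labels, attribute):
        """Entropy reduction obtained by splitting the cases on one attribute."""
        n = len(labels)
        partitions = {}
        for case, label in zip(cases, labels):
            partitions.setdefault(case[attribute], []).append(label)
        remainder = sum(len(part) / n * entropy(part)
                        for part in partitions.values())
        return entropy(labels) - remainder

ID3 grows the tree by selecting, at each node, the attribute with the highest information gain over the cases that reach that node.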
3 RESULTS
Results for the two retrieval approaches are detailed and compared in this section. The measure of performance for the retrieval phase is the number of correctly retrieved cases over the total number of cases stored in the case base. Cross-validations with different numbers of partitions of the example space have been applied to check these results. In each validation run, one of the example subsets is selected to test the system and the remaining subsets are used as training data for the CBR.
Parameter K specifies the number of partitions of the example space. It should be highlighted that when K = N (N being the total number of examples), cross-validation is known as leave-one-out validation. K = 10 has been reported to be especially accurate (Kohavi, 1995).
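This validation protocol can be sketched as follows (an illustrative Python fragment; the function name, the list-based fold construction, and the fixed shuffling seed are assumptions of this example; k = len(examples) reproduces leave-one-out):

    import random

    def k_fold_splits(examples, k, seed=0):
        """Partition the example space into k folds, yielding (train, test) pairs."""
        shuffled = examples[:]
        random.Random(seed).shuffle(shuffled)
        folds = [shuffled[i::k] for i in range(k)]
        for i, test_fold in enumerate(folds):
            train = [e for j, fold in enumerate(folds) if j != i
                     for e in fold]
            yield train, test_fold

Each (train, test) pair then plays the role of one validation run: the training cases populate the CBR case base and the test cases are the queries whose retrieved solutions are checked.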
In rows "F" and "F*" (Function) of Table 2, the number of correctly retrieved cases with the similarity function (and its percentage) is displayed for a case base without and with noise, respectively. In rows "T" and "T*" (Tree), the cases correctly retrieved by the decision tree are displayed for the case base without and with noise, respectively.
Table 2: Correctly retrieved cases for data without noise (rows F and T) and with noise (rows F* and T*). NC = number of cases.

NC       10(%)    30(%)    50(%)    70(%)    100(%)
K = 5
F*       4(40)    14(46)   27(54)   52(74)   74(74)
F        6(60)    15(50)   31(62)   54(77)   80(80)
T*       2(20)    18(60)   31(62)   50(71)   82(82)
T        3(30)    18(60)   33(66)   55(79)   86(86)
K = 10
F*       2(20)    13(43)   25(50)   51(73)   72(72)
F        5(50)    14(47)   29(58)   53(76)   78(78)
T*       1(10)    16(53)   32(64)   55(79)   80(80)
T        3(30)    17(57)   35(70)   60(86)   85(85)
K = N (number of cases)
F*       2(20)    13(43)   25(50)   50(71)   72(72)
F        5(50)    14(47)   29(58)   52(74)   78(78)
T*       1(10)    13(43)   31(62)   48(69)   83(83)
T        3(30)    13(43)   35(70)   53(76)   88(88)
Figure 2a shows that the decision tree performs better for large numbers of examples. From 50 cases onwards, there is enough stored knowledge for the decision tree to classify a new case properly. For smaller example sets, the accuracy of the similarity function is close to that of the decision tree, and even exceeds it in some cases.
Regarding the results outlined in Figure 2b, one might at first expect a stable number of correct retrievals; however, the results show a slight increase with the number of cases, just as with the decision tree. The reason is that all the variables play an important role and no particular variable dominates the rest. Likewise, there is no redundancy among them. In addition, larger case bases are more diverse, which makes retrieving cases with no similarity to the queried one more difficult.