in SSTable. In order to save distance computations, the outlier scores $os(e_{S_1}), \ldots, os(e_{S_m})$ associated with a positive or negative example $e$ are computed simultaneously as follows: first, the set $U = S_1 \cup \ldots \cup S_m$ is computed and, for each $A \in U$, the values $d_A = (x_A - y_A)^2$ are obtained; then, the distances $\mathit{dist}(x_{S_j}, y_{S_j})$ are computed as $\sqrt{\sum_{A \in S_j} d_A}$.
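A minimal Python sketch of this shared-computation scheme is given below. The name compute_subspace_distances and the representation of points as attribute-indexed dictionaries are hypothetical; the sketch only illustrates how the per-attribute squared differences $d_A$ are computed once over $U$ and then reused for every subspace $S_j$.

```python
import math

def compute_subspace_distances(x, y, subspaces):
    """Distances between points x and y projected onto each subspace.

    x, y      : mappings attribute -> real value
    subspaces : list of attribute sets S_1, ..., S_m
    Returns [dist(x_{S_1}, y_{S_1}), ..., dist(x_{S_m}, y_{S_m})].
    """
    # U = S_1 ∪ ... ∪ S_m: every attribute appearing in at least one subspace.
    U = set().union(*subspaces)

    # d_A = (x_A - y_A)^2, computed once per attribute instead of once per subspace.
    d = {A: (x[A] - y[A]) ** 2 for A in U}

    # dist(x_{S_j}, y_{S_j}) = sqrt of the sum of d_A over A in S_j, reusing the cached values.
    return [math.sqrt(sum(d[A] for A in S_j)) for S_j in subspaces]

# Example: two subspaces sharing attribute 0.
# compute_subspace_distances({0: 1.0, 1: 2.0, 2: 0.5}, {0: 0.0, 1: 2.0, 2: 1.5},
#                            [{0, 1}, {0, 2}])  ->  [1.0, sqrt(2.0)]
```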
As a further optimization, the outlier scores associated with the negative examples are computed first (see steps 2(c) and 2(d)). Then, while computing the outlier scores associated with the positive examples (see step 2(e)), the outlier scores of the negative ones are immediately exploited in order to filter out subspaces which are not ρ-consistent (see step 2(e)ii), thus avoiding useless distance computations.
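The sketch below, with hypothetical names throughout (score_examples, is_rho_consistent), illustrates only the ordering of the two phases. The ρ-consistency test itself is defined earlier in the paper and is not reproduced here; abstracting it as a predicate over the negative scores alone is an assumption of this sketch.

```python
def score_examples(subspaces, pos_examples, neg_examples, score, is_rho_consistent, rho):
    """Outlier scores per subspace, negatives first (cf. steps 2(c)-2(e)).

    subspaces : iterable of hashable feature subsets (e.g. frozensets)
    score     : callable (example, subspace) -> outlier score
    """
    # Steps 2(c)-2(d): scores of the negative examples in every candidate subspace.
    neg_scores = {S: [score(e, S) for e in neg_examples] for S in subspaces}

    # Step 2(e): positive examples are scored only in subspaces that are still
    # rho-consistent given the negative scores (step 2(e)ii); the others are skipped,
    # so their distance computations are avoided altogether.
    pos_scores = {}
    for S in subspaces:
        if not is_rho_consistent(neg_scores[S], rho):
            continue
        pos_scores[S] = [score(e, S) for e in pos_examples]
    return neg_scores, pos_scores
```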
As selection, crossover, and mutation strategies we used proportional selection, one-point crossover, and mutation by inversion of a single bit, while as convergence criterion we used an a priori fixed number of iterations, also called generations (Holland, 1992).
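As a rough illustration, the sketch below implements these three standard operators for bit-string individuals encoding subspaces (a 1 in position i meaning that feature i belongs to the subspace); the function names are hypothetical and the fitness function is left abstract.

```python
import random

def proportional_selection(population, fitnesses):
    """Roulette-wheel selection: pick an individual with probability proportional to its fitness."""
    return random.choices(population, weights=fitnesses, k=1)[0]

def one_point_crossover(parent_a, parent_b):
    """Cut both parents at the same random point and swap the tails."""
    point = random.randint(1, len(parent_a) - 1)
    return parent_a[:point] + parent_b[point:], parent_b[:point] + parent_a[point:]

def single_bit_mutation(individual, mutation_prob):
    """With probability mutation_prob, invert one randomly chosen bit."""
    if random.random() < mutation_prob:
        i = random.randrange(len(individual))
        individual = individual[:i] + [1 - individual[i]] + individual[i + 1:]
    return individual
```

With the settings used in the experiments below, mutation_prob would be 0.01, and crossover would be applied with probability 0.9.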
As far as the time complexity of the algorithm is concerned, let $N$ be the number of data set objects, $N_E$ the total number of examples, $d$ the number of features in the space $F$, and $g$ the number of generations. In the worst case, for each generation, in order to determine the outlier scores the distances between all the examples and all the data set objects are computed, with a total cost $O(g \cdot N_E \cdot N \cdot d)$. After having determined the outlying subspace $S^{ss}$, in order to compute the top-$n$ outliers in that subspace, all the pairwise distances among the data set objects must be computed and the top-$n$ outliers singled out, with a total cost $O(N^2 \cdot d)$. Summarizing, the time cost of the algorithm ExampleBasedOutlierDetection is $O(g \cdot N_E \cdot N \cdot d + N^2 \cdot d)$.
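For a rough idea of the relative weight of the two terms, consider, as a purely illustrative instance based on the settings of the synthetic experiments below, $N = 1{,}000$ objects, $d = 20$ features, $N_E \approx 40$ examples, and $g = 50$ generations:
\[
g \cdot N_E \cdot N \cdot d \approx 50 \cdot 40 \cdot 1000 \cdot 20 = 4 \times 10^{7},
\qquad
N^{2} \cdot d = 1000^{2} \cdot 20 = 2 \times 10^{7},
\]
so in this setting the evolutionary search and the final top-$n$ computation have costs of the same order of magnitude.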
4 EXPERIMENTAL RESULTS
In the experiments reported in the following, unless otherwise specified, the crossover probability was set to 0.9 and the mutation probability to 0.01. Moreover, the parameter ρ, which determines the “degree” of consistency of the subspace, was set to 0.1.
First of all, we tested the ability of the algorithm to compute the optimal solution (that is, the outlying subspace). To this end, we considered a family of synthetic data sets, called Synth in the following. Each data set of the family is characterized by the size $D$ of its feature space. Each data set consists of 1,000 real-valued vectors in the $D$-dimensional Euclidean space, and is associated with about $D$ positive examples and $D$ negative examples. Examples are placed so that the outlying subspace coincides with a randomly selected subspace having dimensionality $\lceil D/5 \rceil$.
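A minimal sketch of this construction is shown below, assuming uniformly random vectors. How exactly the examples are placed so that the chosen subspace becomes the outlying one is not specified in this section, so that step is left to a caller-supplied procedure; make_synth and place_examples are hypothetical names.

```python
import math
import random

def make_synth(D, place_examples, n_objects=1000, seed=0):
    """Skeleton of a Synth data set with D features; example placement is abstracted away."""
    rng = random.Random(seed)

    # 1,000 real-valued vectors in the D-dimensional Euclidean space.
    objects = [[rng.random() for _ in range(D)] for _ in range(n_objects)]

    # Randomly selected subspace of dimensionality ceil(D / 5): the intended outlying subspace.
    outlying_subspace = set(rng.sample(range(D), math.ceil(D / 5)))

    # About D positive and D negative examples, placed by the caller-supplied procedure so
    # that outlying_subspace becomes the optimal solution (placement not detailed here).
    positives, negatives = place_examples(objects, outlying_subspace, rng)
    return objects, positives, negatives, outlying_subspace
```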
We varied the dimensionality $D$ from 10 to 20 and ran our algorithm three times on each data set. We recall that the size of the search space increases exponentially with the number of dimensions $D$. We set the population size to 50 and the number of generations to 50 in all the experiments. The parameter $K$ was set to 10.
Table 1 reports the results of these experiments. Interestingly, the algorithm always found the optimal solution in at least one of the runs. Up to 15 dimensions, it always terminated with the right outlying subspace. For higher dimensions it also reported some different subspaces, but in all cases the solution returned was a suboptimal one close to the optimum. Indeed, the second and third solutions concerning the data set Synth18D are subsets of the optimal solution, each missing only a single feature, while the second solution concerning the data set Synth20D is a superset of the optimal one with two extra features. These experiments show that the method is able to return either the optimal solution or a suboptimal one close to it.
The subsequent experiment was designed to validate the quality of the solution returned by the proposed method. In this experiment we considered the Wisconsin Diagnostic Breast Cancer data set from the UCI Machine Learning Repository. This data set is composed of 569 instances, each consisting of 30 real-valued attributes, grouped into two classes: benign (357 instances) and malignant (212 instances). The thirty attributes represent the mean, standard error, and largest value of the following ten cell nucleus features: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
We normalized the values of each attribute to the range $[0, 1]$. Then we randomly selected ten benign instances as the set of negative examples $I_{wdbc}$ and twenty malignant instances as the set of positive examples $O_{wdbc}$. Moreover, we built a data set $DS_{wdbc}$ of 357 objects by merging all the remaining benign instances (347 in total) with ten other randomly selected malignant instances, denoted $DS^{O}_{wdbc}$.
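A sketch of this preparation using scikit-learn's copy of the WDBC data is given below; the variable names mirror the notation above, but the random selections of the paper are, of course, not reproduced, and the seed is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
wdbc = load_breast_cancer()                       # 569 x 30; target 0 = malignant, 1 = benign
X = MinMaxScaler().fit_transform(wdbc.data)       # normalize each attribute to [0, 1]

benign_idx = rng.permutation(np.where(wdbc.target == 1)[0])     # 357 benign instances
malignant_idx = rng.permutation(np.where(wdbc.target == 0)[0])  # 212 malignant instances

I_wdbc = X[benign_idx[:10]]         # ten benign instances as negative examples
O_wdbc = X[malignant_idx[:20]]      # twenty malignant instances as positive examples

DS_O_wdbc = X[malignant_idx[20:30]]                     # ten further malignant instances
DS_wdbc = np.vstack([X[benign_idx[10:]], DS_O_wdbc])    # 347 remaining benign + 10 malignant = 357
```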
We set the number of neighbors K to 50, and the
number of top outliers n to 20. First of all, we com-
puted the distance-based outliers in the full feature
space. We found that among the top twenty outliers,
six of them belong to the set DS
O
wdbc
(corresponding to
the 60% of DS
O
wdbc
). Next, we run the ExampleBased-
OutlierDetection algorithm. The outlying subspace
S
ss
wdbc
found was composed of seventeen features. In