
calculation is carried out once: all individuals are compared with the gravity centre. We note that AG_P2 (random or optimised) detects the outlier point in the same attribute subset: (0, 2, 4, 8) for the Shuttle data set and (5, 6, 7, 8) for the Segmentation data set. For the Lung cancer data set, we found the same outlier set, but only one attribute (5936) is found in all cases. The number of attributes of the Lung cancer data set (12533 attributes) can explain this result. The four algorithms found the same outlier point in the same attribute subset for all data sets (except Lung, as explained above). We then evaluated the importance of the attribute subset size (D). The results are shown in Table 3 (Shuttle), Table 4 (Segmentation) and Table 5 (Lung).
Table 3: AG_P2_Opt Results (Shuttle data set)

  D    Attribute subset    Outlier
  1    4                   26711
  2    2-4                 26711
  3    0-2-4               26711
  4    0-2-4-8             26711
  9    0-…-8               26711
Table 4: AG_P2_Opt Results (Segmentation data set)

  D    Attribute subset    Outlier
  1    8                   1683
  2    6-8                 1683
  3    6-7-8               1683
  4    5-6-7-8             1683
  6    2-5-4-6-7-8         1683
  19   0-…-18              1683
Table 5: AG_P2_Opt Results (Lung Cancer data set)

  D       Attribute subset                   Outlier
  1       5936                               10
  2       5936-6038                          10
  4       5103-3476-5936-2329                10
  6       11472-3613-3086-10507-5936-1430    10
  6000    5936-…-1730                        10
  12533   0-…-12532                          10
For all the data sets, we can see that the results are the same whatever the subset dimension is. It is the particular value of a single attribute that makes the point significantly different from the other ones, so there is no need to compute the distance with all the attributes.
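This subset-based comparison with the gravity centre can be sketched as follows. This is a minimal illustration, not the authors' exact implementation: the function name `outlier_by_subset` is ours, and we assume a Euclidean distance to the mean of the data set.

```python
import numpy as np

def outlier_by_subset(data, subset):
    """Return the index of the point farthest from the gravity centre,
    measuring distance only along the given attribute subset."""
    sub = data[:, subset]                         # keep only the chosen attributes
    centre = sub.mean(axis=0)                     # gravity centre over that subset
    dists = np.linalg.norm(sub - centre, axis=1)  # distance of each individual
    return int(dists.argmax())                    # the farthest point

# Toy data: point 3 has an extreme value on attribute 0 only.
data = np.array([[0.1, 5.0],
                 [0.2, 5.1],
                 [0.0, 4.9],
                 [9.0, 5.0]])

# The same outlier is found with one attribute as with both,
# mirroring the behaviour observed in Tables 3-5.
print(outlier_by_subset(data, [0]))     # 3
print(outlier_by_subset(data, [0, 1]))  # 3
```

On a data set like Lung cancer, this means a distance over one attribute (5936) instead of 12533, which is where the computational saving comes from.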
We then visualise these results using both parallel coordinates (Inselberg, 1985) and 2D (Fig. 3a and 3b) or 3D (Fig. 1) scatter-plot matrices (Becker, 1987), to try to explain why these points are different from the other ones. The 2D scatter-plot matrices are the 2D projections of the data points according to all possible pairs of attributes, and the 3D scatter-plot matrix is a 3D projection of the n-dimensional data points.
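The pairwise construction of the 2D scatter-plot matrix can be sketched as follows; `scatterplot_pairs` is a hypothetical helper that only enumerates the attribute pairs (one per panel), not the plotting itself.

```python
from itertools import combinations

def scatterplot_pairs(n_attributes):
    """One 2D projection per unordered pair of attributes:
    n*(n-1)/2 panels for n attributes."""
    return list(combinations(range(n_attributes), 2))

pairs = scatterplot_pairs(4)
print(len(pairs))  # 6 panels for 4 attributes
print(pairs[0])    # (0, 1)
```

The quadratic growth of the number of panels is why such matrices remain readable for the reduced attribute subsets found above, but not for the full 12533 attributes of the Lung cancer data set.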
These kinds of visualisation tools allow the user to see how far the outlier is from the other points. For example, in Figure 3 we can easily see that the outlier has extreme values along three attributes (the leftmost ones, the first two being minimum values and the last one being a maximum value).
4 CONCLUSION AND FUTURE WORK
In this paper, we have presented a hybrid algorithm for outlier detection, which is especially suited for high dimensional data sets. Conventional approaches compute the distance using all the attributes and so are unable to deal with a large number of attributes because of the computational cost. Here, we only have to find the most significant attribute subset to detect the outliers efficiently. The main idea is to combine attributes in a reduced subset and find the combination in which we can detect the best outlier point, i.e. the point that is farthest from the other ones in the whole data set. Numerical tests have shown that the new algorithm is able to significantly reduce the search space in terms of dimensions without any loss of result quality. We then visualise the obtained results with scatter-plot matrices and parallel coordinates to try to explain them and to show the attributes that are relevant for making a point an outlier. These visualisation tools show that the outlier point is the farthest from the other ones. A first forthcoming improvement will be to try to qualify the outlier: is it an error, or only a point significantly different from the other ones? We will also try to extend this algorithm to the clustering task in high dimensional data, as we think it must be possible to find good clusters in a reduced-dimensional data set. Another direction will be to look for a low-cost function for evaluating attribute subset combinations, in order to improve the execution time.
REFERENCES
Aggarwal, C.C., Yu, P.S., 2001. Outlier detection for high dimensional data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, New York, NY, USA, pp 37-46.
Barnett, V., Lewis, T., 1994. Outliers in Statistical Data, John Wiley.
Becker, R., Cleveland, W., Wilks, A., 1987. Dynamic graphics for data analysis. Statistical Science, 2, pp 355-395.
Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., 1996. From Data Mining to Knowledge Discovery in Databases. AI Magazine, Vol. 17, No. 3, pp 37-54.
OUTLIER DETECTION AND VISUALISATION IN HIGH DIMENSIONAL DATA
487