Figure 4: Schematic of the sonar dataset partition. The S_i are the nested subsets, R = R_1 ∪ R_2 are the rocks and M = M_1 ∪ M_2 are the mines. The number of items contained in each subset is shown alongside the notation.
plify the process. The NC classifier applied to category 1 found that only the variables x_11, x_14, x_8, x_10, x_12, x_9, x_7, x_23, x_13, ordered by their predictive value, are needed to specify the classification rule in a single iteration, with about 6% error. The second iteration additionally involves x_2, x_5, x_6, reducing the error to 4%. The result is shown in Fig. 2; the separation achieved is striking.
Two error estimates are used: Train & Test and Cross-correlation. When the rule involves several iterations, an additional criterion is employed to avoid overfitting: the rule error is traced iteration by iteration, and the process is stopped when the error increases compared to the previous iteration. As pointed out in (Inselberg and Avidan, 2000), the rules obtained by the NC classifier were applied to 4 benchmark datasets and were the most accurate compared to those obtained by 22 other well-known classifiers.
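The overfitting guard just described — trace the rule error iteration by iteration and stop at the first increase — can be sketched as follows (a minimal illustration; the function name and list-based interface are assumptions, not part of the NC implementation):

```python
def stop_at_error_increase(iteration_errors):
    """Return how many iterations to keep: stop as soon as the error
    increases compared to the previous iteration (overfitting guard)."""
    kept = 1
    for prev, curr in zip(iteration_errors, iteration_errors[1:]):
        if curr > prev:  # error went up -> likely overfitting, stop here
            break
        kept += 1
    return kept

# Example matching the text: 6% after one iteration, 4% after two;
# a hypothetical rise to 5% on a third iteration would stop the process.
print(stop_at_error_increase([0.06, 0.04, 0.05]))  # -> 2
```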
3 PARTITIONING INTO SUB-CATEGORIES
As one might expect, things do not always work out as nicely as in the example. The sonar dataset from (UCI, 2012) has been a real classification challenge, with which we illustrate the new divide-and-conquer idea. It has 60 variables, 208 observations and 2 categories: 1 for mines, with 111 observations, and 0 for rocks, with 97 data points. Applying the NC classifier partitions the dataset into 3 nested subsets S_1, S_2, S_3, with 148, 51 and 14 items respectively. The rule obtained involves about 35 variables and an unacceptably high error of about 45%. The result, demarcating the nesting (by the rectangles in the lower row) and showing some of the variables used in the rule, is shown in Fig. 3.
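A rule built from nested subsets can be read as assigning the class by the parity of the deepest subset containing a point (the subsets alternate between a category and its complement). The sketch below is a hypothetical simplification for intuition — the membership predicates and 1-D intervals are illustrative assumptions, not the actual NC implementation:

```python
def nc_class(point, nested_subsets):
    """Evaluate a nested-subset rule: `nested_subsets` is a list of
    membership predicates, outermost (S1) first. The predicted class
    is determined by the parity of the deepest containing subset."""
    depth = 0
    for contains in nested_subsets:
        if not contains(point):
            break
        depth += 1
    # odd depth -> inside S1 - S2 + S3 - ... -> the target category
    return depth % 2 == 1

# Toy 1-D illustration with intervals standing in for S1 > S2 > S3:
S = [lambda x: 0 <= x <= 10,   # S1
     lambda x: 3 <= x <= 7,    # S2
     lambda x: 4 <= x <= 5]    # S3
print(nc_class(1, S))    # depth 1 -> True  (e.g. "mine")
print(nc_class(4.5, S))  # depth 3 -> True
print(nc_class(6, S))    # depth 2 -> False (e.g. "rock")
```

This parity reading matches the sonar decomposition in the text, where S_1 − S_2 and S_3 hold the mines while S_2 − S_3 and the complement of S_1 hold the rocks.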
The schematic in Fig. 4 clarifies the partition of the dataset into 4 disjoint sets, M_1, M_2 for the mines and R_1, R_2 for the “rocks”. These are obtained by S_3 = M_2, R_2 = S_2 − S_3, M_1 = S_1 − S_2 and R_1 = All − S_1, where All stands for the full dataset. This is a very useful insight into the structure of the dataset and motivates the idea. The bulk of the mines are in M_1, which has the higher values of the variables needed to specify the rule. By contrast, the subset M_2 = S_3 is a small “island”, having the smaller variable values, surrounded by R_2, and differs markedly from M_1.

Figure 5: This is a financial dataset where the subset corresponding to the high gold prices is selected. The classification by NC partitions this subset into two (indicated by the 2nd and 4th rectangles in the lower row), as for the sonar dataset.
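With the subset sizes reported above, the decomposition can be checked directly. A minimal sketch, using index sets as stand-ins for the nested subsets (the concrete indices are placeholders, not the actual dataset):

```python
# Sonar dataset: 208 items; the nested subsets S1 > S2 > S3
# have 148, 51 and 14 items respectively.
All = set(range(208))
S1 = set(range(148))
S2 = set(range(51))
S3 = set(range(14))

M2 = S3        # inner "island" of mines
R2 = S2 - S3   # rocks surrounding it
M1 = S1 - S2   # bulk of the mines
R1 = All - S1  # remaining rocks

# The four pieces are disjoint and recover the dataset,
# with 111 mines (97 + 14) and 97 rocks (60 + 37) in total:
assert len(M1) == 97 and len(M2) == 14
assert len(R1) == 60 and len(R2) == 37
assert M1 | M2 | R1 | R2 == All
```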
Consider R ∪ M_1 and apply the NC classifier. A rule distinguishing M_1 from R is found needing only 4 variables. Due to the small size of M_1, with either error estimate (cross-correlation or train-and-test) the number of “false negatives” was high, about 30%, though the “false positives” were about 5%, yielding a weighted average error of about 15%. For another interesting comparison, distinguishing M_1 from M_2, NC yields a rule with 5 variables and an 8% average error. It is clear that M_1 is easily distinguished both from the “rocks” and from the other class of mines, M_2.
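The weighted average error combines the two rates by class size. A minimal sketch, assuming the classes are weighted by the sizes from the decomposition (|M_1| = 97, |R| = 97); since the quoted rates are rounded, the sketch gives a ballpark value rather than exactly the figure stated above:

```python
def weighted_error(fn_rate, n_pos, fp_rate, n_neg):
    """Average the false-negative and false-positive rates,
    weighted by the sizes of the two classes."""
    return (fn_rate * n_pos + fp_rate * n_neg) / (n_pos + n_neg)

# ~30% false negatives on M1 (97 items), ~5% false positives on R (97 items):
print(round(weighted_error(0.30, 97, 0.05, 97), 3))  # -> 0.175
```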
This strongly suggests that there are two very different types of mines included in this dataset. To summarize: part of NC’s output, indicated by the rectangles in the lower row of the figure, gives the decomposition of the dataset into nested subsets. From these, one or more of the categories can be partitioned to obtain a more accurate and simpler rule. While this has been observed for some time, it was only investigated recently. Of course, the idea of partitioning is inherent in classification, which after all pertains to the division of a dataset and differentiating between the parts. While there is a lot of literature on partitions in data mining, as we already pointed out, this specific method has apparently not been proposed. Such a decomposition can clearly be automated, and the classification of the new categories can also be done in