timated through histograms. Assuming that the underlying class distribution is appropriately captured in the cluster partition, if a significant distortion of the original clusters is introduced through cluster pruning, the learned SVM models may also deviate from the expected models to a certain extent. The objective is therefore to remove potential clustering errors while preserving the shape and size of the original clusters to the greatest possible extent. In practice, pruning between 20% and 30% of the patterns in a cluster has been considered appropriate for this purpose. In addition, the selected thresholds also depend on the patterns' silhouette values: patterns with a silhouette score larger than 0.5 are deemed to be clustered with sufficiently high "confidence". Thus, the maximum silhouette threshold applied in the cluster pruning algorithm is $sil_{th} = 0.5$. In consequence, if the minimum observed silhouette score in a cluster is larger than 0.5, the cluster remains unaltered in the pruned partitions.
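To make the pruning step concrete, the following is a minimal Python sketch, not the implementation used in this work: it assumes per-pattern silhouette scores computed with scikit-learn's silhouette_samples, and the function name prune_clusters and the per-cluster thresholds mapping are illustrative.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

SIL_MAX = 0.5  # upper bound on the silhouette thresholds (sil_th <= 0.5)

def prune_clusters(X, labels, thresholds):
    """Remove from each cluster the patterns whose silhouette score
    falls below that cluster's threshold. Thresholds are capped at
    SIL_MAX, so a cluster whose minimum silhouette already exceeds
    0.5 is left unaltered."""
    sil = silhouette_samples(X, labels)       # per-pattern scores
    keep = np.ones(len(labels), dtype=bool)
    for k, th in thresholds.items():          # e.g. {0: 0.5, 1: 0.2}
        in_k = labels == k
        keep[in_k] = sil[in_k] >= min(th, SIL_MAX)
    return X[keep], labels[keep]
```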
The specific criteria used to select the silhouette thresholds can be illustrated by considering the clusters extracted from the Breast data set (all 569 data instances). The distribution of silhouette scores has been estimated using the histogram function of the R software, which also provides the vector of silhouette values found at the histogram bin limits and the counts of occurrences in each bin; the bin sizes provided by the R histogram function are estimated according to the Sturges formula (Freedman and Diaconis, 1981). The silhouette thresholds have been selected to coincide with histogram bin limits. In the Breast data set (2 classes/clusters), the vector of silhouette thresholds for the first and second clusters is [0.5, 0.2]. The value $sil_{th} = 0.5$ for the first cluster corresponds to the upper bound for the silhouette thresholds, as explained in the previous paragraph, and results in the removal of 5.2% of the cluster's patterns. For the second cluster, the threshold $sil_{th} = 0.2$ is selected. The pruned section associated with this $sil_{th}$ corresponds to the first five histogram bins, comprising 25% of the patterns in the cluster. Including the sixth histogram bin in the pruned section would raise the threshold to the next possible level, $sil_{th} = 0.3$. However, such a threshold would lead to the removal of a considerable fraction (46.28%) of the cluster's patterns, which is considered unacceptable for preserving the cluster's size and shape.
To summarise, the number of histogram bins corresponding to rejected patterns is determined according to two conditions: (1) the upper limit of the last rejected bin should not be greater than $sil_{th} = 0.5$, and (2) the amount of rejected patterns (the total number of occurrences in the rejected bins) should not exceed 30% of the total number of patterns in the cluster.
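The two conditions can be expressed as a simple bin-scanning rule. The following Python sketch is illustrative rather than the implementation used here: NumPy's 'sturges' binning stands in for the R histogram function mentioned above, and select_threshold is a hypothetical helper name.

```python
import numpy as np

def select_threshold(sil_cluster, sil_max=0.5, max_frac=0.30):
    """Return the largest histogram bin limit usable as sil_th:
    (1) the limit must not exceed sil_max, and (2) the rejected
    bins below it must hold at most max_frac of the cluster."""
    counts, edges = np.histogram(sil_cluster, bins='sturges')
    n = len(sil_cluster)
    sil_th = edges[0]        # lowest edge: nothing is pruned
    rejected = 0
    for count, upper in zip(counts, edges[1:]):
        rejected += count
        if upper > sil_max or rejected > max_frac * n:
            break            # next bin would violate (1) or (2)
        sil_th = upper
    return sil_th
```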
6.2 Evaluation of the Cluster Pruning Approach
In this section, the efficiency of the cluster pruning method in rejecting misclassification errors from the clustered data is evaluated through an analysis of the algorithm's outcomes on the Iris, Wine, Breast Cancer, Diabetes, Pendig and Seven Gaussians data sets. Note that the cluster partitions obtained in these experiments comprise all instances of the data sets, without a prior split into training and test sets.
For the purpose of evaluating the cluster pruning algorithm, the cluster labeling task has been performed using the complete set of labels for each data set. The resulting misclassification error rates, as well as the NMI results reported in Table 1, confirm the adequate behaviour of the proposed cluster pruning algorithm in removing those sections of the clusters that have a high probability of producing misclassification errors after cluster labeling. For instance, while the pruned sections comprise around 10–20% of the patterns in the data sets, the percentage of remaining misclassification errors has been substantially reduced. As an example, the error rate has dropped from 10.66% to 4.03% after pruning on the Iris data set, while error rates have been reduced from 4.09% to 0.99% for the Breast data set, and from 22.40% to 8.98% for the Wine data set. An exception to the previous observations is the Diabetes data set, for which the error rate after cluster pruning (38.16%) remains very similar to the original misclassification rate (40.10%). Note that, for 2 clusters as in the case of the Diabetes data, the worst possible error rate that can be observed is 50%; error rates larger than 50% are never observed, as they would simply produce an inversion of the cluster labels. In other words, the original error in the Diabetes data set implies a roughly uniform distribution of patterns from the two underlying classes over the extracted clusters. This fact is also evidenced by the NMI score of 0.012. In consequence, the error rate is roughly the same after cluster pruning, and the removal of patterns by means of the cluster pruning algorithm is only about as effective as removing the same number of patterns at random.
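For reference, the two evaluation quantities used above, the misclassification rate after cluster labeling and the NMI score, can be computed as in the following sketch. The Hungarian matching relies on SciPy's linear_sum_assignment and NMI on scikit-learn; labeling_error is an illustrative helper, not the code used in this work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def labeling_error(true_labels, cluster_ids):
    """Misclassification rate after assigning one class to each
    cluster with the Hungarian algorithm (maximising the number
    of correctly matched patterns)."""
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_ids)
    # contingency matrix: rows = clusters, columns = classes
    cont = np.array([[np.sum((cluster_ids == k) & (true_labels == c))
                      for c in classes] for k in clusters])
    rows, cols = linear_sum_assignment(-cont)  # maximise agreement
    return 1.0 - cont[rows, cols].sum() / len(true_labels)

# NMI between the cluster partition and the ground-truth labels:
# nmi = normalized_mutual_info_score(true_labels, cluster_ids)
```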
7 SIMULATIONS AND RESULTS
In the experimental setting, SVMs have been used as
the baseline classifier. First, each data set has been