Table 3: Micro-precision of K-means using all features and a single feature of the original data.

Data Sets     Kmeans (All Features)   Single Feature (Max)   Single Feature (Min)
DWALabSet1    0.5033                  0.7917                 0.5000
DWALabSet2    0.5033                  0.7233                 0.5000
DWALabSet3    0.5367                  0.7933                 0.5000
DWALabSet4    0.3400                  0.5642                 0.3333
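Throughout this section, micro-precision (MP) matches clusters to classes so that the total number of correctly assigned points is maximised, then divides that count by the total number of points N. A minimal sketch of this computation (our own helper in Python, not code from the paper), using the Hungarian algorithm for the matching:

import numpy as np
from scipy.optimize import linear_sum_assignment

def micro_precision(true_labels, cluster_labels):
    """MP = (correct points under the best cluster-to-class matching) / N."""
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # Contingency table: entry (i, j) counts points in cluster i and class j.
    C = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, k in enumerate(clusters):
        for j, c in enumerate(classes):
            C[i, j] = np.sum((cluster_labels == k) & (true_labels == c))
    # The Hungarian algorithm on -C finds the matching maximising agreement.
    rows, cols = linear_sum_assignment(-C)
    return C[rows, cols].sum() / len(true_labels)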
of BASE1 since BASE2 contains a certain number of “better” base clusterings. In addition, the performance of SHSEA using BASE2 is expected to be better than that of HHSEA, since base clusterings with higher MP are given larger weights in the consensus fusion step. Furthermore, recall that BASE3 (BASE4) is generated in the same way as BASE1 (BASE2), respectively, except that K^(j) (the number of clusters in each local clusterer) is set to be greater than K_0 (the expected number of clusters). Therefore, we expect the performance of SHSEA and HHSEA using BASE3 (BASE4) to be better than with BASE1 (BASE2), since the proposed semi-supervised methods are expected to perform better when the data points are divided into smaller groups.
The micro-precision of our proposed system (four unsupervised and two semi-supervised ensemble methods) using the four sets of base clusterings (BASE1 to BASE4) is illustrated in sub-figure (a) of Fig. 3 to Fig. 6. The performance of SHSEA and HHSEA is represented by the series SH(P25) and HH(P25) respectively, where P25 means that the ratio of reference points to testing points is P = 25%. Among the four groups of clustering results, the bar corresponding to the highest average MP of the unsupervised ensemble methods and the bars corresponding to the highest MP of SHSEA and HHSEA are labelled in each chart. It is clear that the performance of the proposed semi-supervised methods conforms to our expectations.
Compared to the micro-precision of the K-means algorithm (Table 3), the clustering results have been improved by both operational modes of the proposed system. The performance of the semi-supervised mode is better than that of the unsupervised mode (except on “DWALabSet1”). The winning set of base clusterings is either BASE2 or BASE4. In all the examples, the best performance is achieved by SHSEA.
To study the effect of the quantity of reference points on the semi-supervised clustering ensemble methods, we repeat the experiments in semi-supervised mode with different numbers of reference points, i.e., by varying the value of P in N_r = P · N_u. Compared to the performance of K-means (Table 3), the micro-precision of SHSEA or HHSEA increases dramatically when P is relatively small; it becomes steady, and sometimes starts to decrease, as P increases. Therefore, labelling more data points may not be beneficial for improving the performance of the semi-supervised ensemble algorithms: more reference points do not guarantee improvement, and obtaining additional labels is time-consuming and expensive.
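A hedged sketch of this sweep, reusing the micro_precision helper above; ensemble_fn is a hypothetical stand-in for the SHSEA/HHSEA fusion step (not a function from the paper), and y is the array of ground-truth class labels:

import numpy as np

rng = np.random.default_rng(0)

def sweep_reference_ratio(X, y, ensemble_fn, P_values=(0.05, 0.10, 0.25, 0.50)):
    # For each ratio P, draw N_r = P * N_u reference (labelled) points at
    # random and score the resulting semi-supervised consensus clustering.
    N_u = len(X)
    scores = {}
    for P in P_values:
        N_r = int(round(P * N_u))                     # N_r = P * N_u
        ref_idx = rng.choice(N_u, size=N_r, replace=False)
        labels = ensemble_fn(X, ref_idx, y[ref_idx])  # hypothetical fusion call
        scores[P] = micro_precision(y, labels)
    return scores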
Recall that the number of clusters in the j-th base clustering, K^(j), is randomly generated in the base clustering generators Φ_3 and Φ_4. To study the effect of randomized K^(j) on the clustering ensemble methods, we repeat the experiments by setting the number of clusters in every base clustering to the same value and varying that value of K^(j). Among these data sets, the highest MP occurs at different values of K^(j). The performance of the proposed system using randomized K^(j) is either the best over all tested values of K^(j) or very close to the best. Since we lack the knowledge of how to select the optimal K^(j), we use randomized K^(j) in the following experiments to avoid selecting K^(j) for each data set.
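A minimal sketch of how such randomized K^(j) values could be drawn; the upper bound 3 · K_0 and the use of scikit-learn's KMeans as the local clusterer are illustrative assumptions, not details taken from the paper:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def generate_base_clusterings(X, K0, n_base=10):
    # Each local clusterer j gets its own K^(j), sampled so that
    # K^(j) > K_0; the upper bound 3 * K0 is an illustrative choice.
    base = []
    for j in range(n_base):
        K_j = int(rng.integers(K0 + 1, 3 * K0 + 1))
        labels = KMeans(n_clusters=K_j, n_init=10, random_state=j).fit_predict(X)
        base.append(labels)
    return base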
3.3 Normalized Data Sets
The micro-precision of K-means using all normalized features and each normalized feature individually is shown in Table 4. Compared with Table 3, the performance of K-means using all features has been improved significantly by normalization, except for the first three data sets. As discussed earlier, the performance of distance-based clustering algorithms may be affected when the data sets to be clustered contain features measured in diverse scales; by investigating the features of each data set, we noticed that they indeed contain features measured in quite different ranges. Moreover, the performance of K-means using normalized features individually is similar to its performance using the original features individually. This result is expected, since the similarity measure for a single feature is based on a 1-dimensional distance calculation, which is invariant to feature scaling.
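The normalization scheme is not spelled out in this section; per-feature min-max scaling to [0, 1] is one common choice, sketched below under that assumption. Note that for a single feature such rescaling multiplies all pairwise distances by the same constant, so the K-means partition is unchanged, consistent with the observation above.

import numpy as np

def min_max_normalize(X):
    # Rescale each feature to [0, 1] so that no large-scale feature
    # dominates the Euclidean distances used by K-means.
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)  # guard constant features
    return (X - mins) / span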
Table 4: Micro-precision of K-means using all features and a single feature of the normalized data.

Data Sets     Kmeans (All Features, Normalized)   Single Feature (Max)   Single Feature (Min)
DWALabSet1    0.6628                              0.7920                 0.5000
DWALabSet2    0.5609                              0.7233                 0.5000
DWALabSet3    0.6120                              0.7933                 0.5000
DWALabSet4    0.5058                              0.5644                 0.3333
To study the effect of normalization on the clustering ensemble methods, we repeat the experiments previously described in Section 3.2 using the normalized data sets. The micro-precision of the proposed system is illustrated in sub-figure (b) of Fig. 3 to Fig. 6.