(Bache and Lichman, 2013), which provides hun-
dreds of data sets for the study of classification and
clustering. In the literature, many researchers evaluate
their clustering algorithms and clustering ensemble
methods using data sets from this website (Wang
et al., 2011; Likas et al., 2003; Zhou and Tang,
2006; Yan et al., 2009; Zhang and Gu, 2014). The
other type of data comes from a biomedical laboratory
and is used to study human breast cancer
cells undergoing treatment with different drugs.
To evaluate our proposed algorithm, we start by
applying different clustering algorithms (K-means,
hierarchical agglomerative clustering (HAC) and
affinity propagation (AP)) to the UCI data sets
“Ionosphere” and “Balance” and the biomedical
laboratory data sets “3ClassesTest1”, “4ClassesTest1”
and “5ClassesTest1”. The micro-precisions of these
algorithms are listed in Table 3.
For comparison, we also list the average
micro-precisions of existing clustering ensemble
methods reported in (Wang et al., 2011), showing only
the ensemble method that performs best on each
data set. For the biomedical laboratory data sets, we
apply the clustering ensemble method (MCLA)
proposed in (Strehl and Ghosh, 2003).
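The micro-precision used in Table 3 can be illustrated with a short sketch. The definition is not restated in this section, so the code assumes a common form in which each cluster is matched to its most frequent true class and the matched counts are summed and divided by the number of points. The function name `micro_precision` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def micro_precision(true_labels, cluster_labels):
    """Fraction of points that belong to the majority class of their cluster.

    Assumption: each cluster is matched to its most frequent true class;
    the paper may use a different cluster-to-class matching.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one data point is required")

    # Group the true class labels by the cluster each point was assigned to.
    members = defaultdict(list)
    for truth, cluster in zip(true_labels, cluster_labels):
        members[cluster].append(truth)

    # For every cluster, count the points of its largest true class.
    matched = sum(
        Counter(classes).most_common(1)[0][1] for classes in members.values()
    )
    return matched / len(true_labels)


# Toy example (hypothetical labels): five of the six points are matched.
truth = ["a", "a", "a", "b", "b", "c"]
found = [0, 0, 1, 1, 1, 2]
score = micro_precision(truth, found)
```

In the toy example, cluster 0 contributes two points, cluster 1 contributes two and cluster 2 contributes one, so the score is 5/6.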
Let p% denote the ratio of the number of
training points ($N_r$) to the number of testing points
($N_u$). To study how the amount of training
data affects the semi-supervised method, we vary
p over the values {3, 5, 10, 15, 20, 25, 30}. The
performance of our proposed semi-supervised method is
listed in Table 4. Our proposed algorithm outperforms
the individual clustering algorithms (K-means, HAC
and AP) on the data sets listed in Table 3, and it also
outperforms the existing ensemble methods on these
data sets. The micro-precisions increase dramatically
while p is relatively small and level off when
p > 15%. Therefore, because obtaining labels from
field experts is expensive and time-consuming, there
is little reason to enlarge the training set beyond this
point: adding training points does not always improve
the accuracy of the semi-supervised method.
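The role of p can be made concrete with a small sketch. It assumes that a data set of fixed size is split into training and testing points so that the ratio of training to testing points is approximately p%; whether the experiments split a fixed pool this way is an assumption. The function name `split_by_ratio` and the data set size of 1000 are hypothetical.

```python
import random

# Values of p (in percent) taken from the text.
P_VALUES = [3, 5, 10, 15, 20, 25, 30]


def split_by_ratio(n_points, p, seed=0):
    """Split indices 0..n_points-1 so that N_r / N_u is roughly p / 100.

    Assumption: N_r + N_u = n_points, which gives
    N_r = n_points * p / (100 + p).
    """
    if n_points < 2:
        raise ValueError("need at least two points to split")
    if p <= 0:
        raise ValueError("p must be positive")

    n_train = max(1, round(n_points * p / (100 + p)))
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[:n_train], indices[n_train:]


# Hypothetical data set size; each row shows p, N_r and N_u.
sizes = {}
for p in P_VALUES:
    train, test = split_by_ratio(1000, p)
    sizes[p] = (len(train), len(test))
    print(p, len(train), len(test))
```

For 1000 points and p = 25, this yields 200 training points and 800 testing points, so a larger p directly means more labels requested from field experts.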
The biomedical data sets are obtained from the
study of human breast cancer cells undergoing
treatment with different drugs. When a certain type of
drug is injected into cancer cells, the cells usually react
differently: a portion of the cells may react slightly
to the injected drug (for example, by enlarging
slightly); another portion may react strongly (for
example, by losing the nucleus); and the rest may not
react to the injected drug at all. For those cells that strongly react
to the injected drug, it is very likely that their statis-
tical properties vary significantly and they can form
a new cluster. Therefore, in the study of the effect
of a certain drug on cancer cells, we could apply our
proposed new cluster detection algorithm to automat-
ically detect the existence of cancer cells that strongly
react to the injected drug.
We provide numerical examples to show the
performance of the proposed new cluster detection
algorithm. The original test files contain data
observations from different classes. Each original test file has
a fixed amount of training data. To evaluate the new
cluster detection algorithm, we insert additional data
points to the original test files and vary the number of
additional data points. To evaluate the probability of
successful detection of a new cluster, we insert a mix-
ture of data points from a new class and from exist-
ing classes and vary the proportion of the data points
from a new class. For each original test file and each
number of additional points, we randomly
generate 20 versions of the additional data set $X_a$ using
one of the mixture proportions listed in Table 6. The
total numbers of successful detections of a new cluster
are provided in Table 5. As expected, the probability
of successful detection of a new cluster using the
proposed algorithm increases as the number of data
points from the new class increases.
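The evaluation protocol described above can be sketched as a small harness. The detection algorithm itself is not reproduced here: `detect` is a placeholder callable, and `toy_detect`, the pool contents and all function names are hypothetical stand-ins chosen only to make the example runnable.

```python
import random


def build_additional_set(new_pool, existing_pool, n_additional, new_fraction, rng):
    """Draw one additional data set X_a mixing new-class and existing-class points."""
    n_new = round(n_additional * new_fraction)
    n_old = n_additional - n_new
    x_a = rng.sample(new_pool, n_new) + rng.sample(existing_pool, n_old)
    rng.shuffle(x_a)
    return x_a


def count_detections(detect, new_pool, existing_pool, n_additional,
                     new_fraction, trials=20, seed=0):
    """Count how many of `trials` random versions of X_a trigger a detection."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x_a = build_additional_set(
            new_pool, existing_pool, n_additional, new_fraction, rng
        )
        if detect(x_a):
            hits += 1
    return hits


# Toy one-dimensional data: existing classes lie near 0, the new class near 10.
existing_pool = [0.01 * i for i in range(200)]
new_pool = [10.0 + 0.01 * i for i in range(50)]


def toy_detect(x_a):
    """Placeholder detector (NOT the proposed algorithm): flag five or more far points."""
    return sum(value > 5.0 for value in x_a) >= 5


many_new = count_detections(toy_detect, new_pool, existing_pool, 20, 0.5)
few_new = count_detections(toy_detect, new_pool, existing_pool, 20, 0.1)
```

With the placeholder detector, a mixture containing ten new-class points is flagged in all 20 trials, while a mixture containing two is never flagged, which mirrors the qualitative trend reported for Table 5.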
5 CONCLUSIONS
Since clustering is a more general problem in which
no categories or clusters are predefined for the
clustering algorithms, fusing multiple clusterings is
more difficult owing to the so-called correspondence
problem. In this paper, we have proposed
semi-supervised clustering ensemble algorithms that
combine multiple clusterings by relabelling the cluster
labels according to the training clusters. We presented
numerical examples to demonstrate the capability of
the proposed algorithms to improve the quality of
cluster analysis. The improvement in the accuracy
of the clustering results depends on the statistical
properties of the data set and on
the amount of available reference labels. When
additional observations become available, we need to
determine whether the training data is sufficient for the
new observations. We have therefore proposed a
new cluster detection algorithm that detects when
new observations come from a class other
than the existing training classes. We provided
numerical examples to show that the proposed algorithm is
capable of detecting a new cluster once the number of
new observations that are not from existing classes has
accumulated to a certain level.
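The relabelling idea summarised above can be illustrated with a minimal sketch. It assumes that each cluster is mapped to the most frequent training class among its labelled members and that the relabelled clusterings are then fused by a simple majority vote; both rules are illustrative assumptions rather than the exact procedure of the proposed algorithms. The names `relabel_by_training` and `fuse` are hypothetical.

```python
from collections import Counter


def relabel_by_training(cluster_labels, train_indices, train_classes):
    """Replace each cluster label with the majority training class inside that cluster."""
    votes = {}
    for idx, cls in zip(train_indices, train_classes):
        votes.setdefault(cluster_labels[idx], Counter())[cls] += 1
    mapping = {cluster: counts.most_common(1)[0][0]
               for cluster, counts in votes.items()}
    # Clusters that contain no training point cannot be named and map to None.
    return [mapping.get(label) for label in cluster_labels]


def fuse(clusterings, train_indices, train_classes):
    """Relabel every clustering, then take a per-point majority vote."""
    relabelled = [relabel_by_training(labels, train_indices, train_classes)
                  for labels in clusterings]
    fused = []
    for point_votes in zip(*relabelled):
        counts = Counter(v for v in point_votes if v is not None)
        fused.append(counts.most_common(1)[0][0] if counts else None)
    return fused


# Three clusterings of six points with inconsistent label names.
clusterings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],   # permuted labels, third point misplaced
    [2, 2, 2, 5, 5, 5],
]
# Two labelled training points: point 0 is class "A", point 5 is class "B".
result = fuse(clusterings, train_indices=[0, 5], train_classes=["A", "B"])
```

Because all clusterings are expressed in the training classes before voting, the correspondence problem disappears: the three label sets above fuse to `["A", "A", "A", "B", "B", "B"]`, and the single misplaced point is outvoted.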