3.2 Cluster Ensemble Generation
Different clustering algorithms provide different cluster labels for the same data set since they focus on different aspects of the data (Topchy et al., 2004). Due to its simplicity, the k-means algorithm is widely used to produce the individual clusterings in the generation step of a cluster ensemble algorithm. By choosing a number k considerably larger than the expected number of clusters, k-means divides the data points into k smaller groups. Such a small group of data points usually captures some details of the structure of the entire data set, while some of these smaller groups may need to be merged to form a cluster because of their common properties.
In (Fred and Jain, 2005), the cluster ensemble is generated by running the k-means algorithm multiple times with random initializations, where the number of clusters for each run is randomly selected from a set of integers (much greater than $k_0$). We use a similar generation mechanism to obtain the cluster ensemble in this paper. For the data set $X$ (reference and unknown sets together), the k-means algorithm is applied $M$ times to generate $M$ individual clusterings, which form an $N \times M$ label matrix $\Lambda$. The entry of $\Lambda$ on the $i$-th row and $j$-th column, $\Lambda_{i,j}$, is the cluster label of $x_i$ according to the $j$-th clustering. In the preceding pre-clustering step, the data set $X$ is divided into $k_0 + 1$ subsets: $k_0$ reference clusters and an unknown set (i.e., $X = \{X^r_1, X^r_2, \ldots, X^r_{k_0}, X_u\}$). Accordingly, the matrix $\Lambda$ can be segmented into $k_0 + 1$ parts: $\Lambda^r_1, \Lambda^r_2, \ldots, \Lambda^r_{k_0}, \Lambda_u$.
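As a concrete illustration of this generation step, the sketch below runs k-means $M$ times with a randomly chosen number of clusters and stacks the label vectors into the $N \times M$ matrix $\Lambda$. It assumes scikit-learn's KMeans and illustrative parameter names (k_min, k_max) that are not prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_label_matrix(X, M=50, k_min=20, k_max=40, seed=None):
    """Run k-means M times with a randomly chosen number of clusters
    (assumed here to lie in [k_min, k_max], both well above k_0) and
    stack the resulting label vectors into an N-by-M matrix Lambda."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Lambda = np.empty((N, M), dtype=int)
    for j in range(M):
        k = rng.integers(k_min, k_max + 1)       # random cluster count for this run
        km = KMeans(n_clusters=k, n_init=1,      # single random initialization per run
                    random_state=int(rng.integers(1 << 31)))
        Lambda[:, j] = km.fit_predict(X)         # j-th individual clustering
    return Lambda
```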
3.3 Consensus Fusion
Consensus fusion of multiple clusterings is the core step of the proposed algorithm. The fusion idea is stated as follows: according to an individual clustering, count the number of agreements between the label of a data point in the unknown set and the labels of the reference points in each reference cluster; assign this data point the cluster label with the highest number of agreements; repeat the procedure for all the clusterings and determine the final cluster label based on some fusion rule. The proposed algorithm is summarized in Table 1.
Suppose that for $k = 1, \ldots, k_0$, $R_k$ is the number of reference points in the $k$-th reference cluster and $R$ is the total number of reference points. Thus, $R = R_1 + R_2 + \cdots + R_{k_0}$ and the total number of undetermined points is $N - R$. For the $i$-th undetermined data point $x_i$ and the $j$-th clustering $\lambda^{(j)}$ (where $i = 1, \ldots, N - R$ and $j = 1, \ldots, M$), the association vector $a_{ij}$ contains $k_0$ entries, each of which describes the association between $x_i$ and a reference cluster.
Table 1: Semi-supervised clustering ensemble algorithm.

1. Pre-clustering
   (a) Choose p% of the data points and obtain reference labels $(1, \ldots, k_0)$.
2. Cluster Ensemble Generation
   (a) Apply clusterer $\Phi^{(j)}$ to the data set $X$ and obtain the individual clustering $\lambda^{(j)}$.
   (b) Repeat $M$ times to form the label matrix $\Lambda = \{\Lambda^r_1, \Lambda^r_2, \ldots, \Lambda^r_{k_0}, \Lambda_u\}$.
3. Consensus Fusion
   (a) Assign undetermined data points their most associated cluster ids (the highest entry in the association vector) according to the label vector $\lambda^{(j)}$. The association vector is computed by
   $$a_{ij}(k) = \frac{\text{occurrences of } \Lambda_u(i,j) \text{ in } \Lambda^r_k(:,j)}{\text{number of points in the } k\text{-th reference cluster}}$$
   (b) Repeat $M$ times to form the new sub-matrix $\Lambda'_u$.
   (c) Apply the fusion rule to obtain the consensus clustering.
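As a minimal sketch of the association computation in step 3(a), the function below returns $a_{ij}$ for one undetermined point, assuming the label matrix has already been split row-wise into the reference blocks $\Lambda^r_1, \ldots, \Lambda^r_{k_0}$ and the unknown block $\Lambda_u$; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def association_vector(label, ref_blocks, j):
    """For one undetermined point whose label in the j-th clustering is `label`,
    return a_ij: for each reference cluster k, the fraction of its reference
    points that received the same label in column j of the label matrix."""
    k0 = len(ref_blocks)
    a = np.zeros(k0)
    for k, block in enumerate(ref_blocks):       # block has shape (R_k, M)
        a[k] = np.mean(block[:, j] == label)     # occurrences / R_k
    return a
```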
Recall that for the undetermined data points the corresponding segment of the label matrix $\Lambda$ is $\Lambda_u$. A fusion rule, such as majority voting, is difficult to apply directly to $\Lambda_u$ due to the correspondence problem of cluster labels. A new matrix is therefore necessary in order to apply a fusion rule that generates the consensus labels. Based on the relabeling scheme described above, according to a clustering $\lambda^{(j)}$, undetermined data points are assigned their most associated cluster labels (the highest entry in the corresponding association vector); repeating this $M$ times forms a new matrix $\Lambda'_u$. In this new label matrix, the correspondence problem is removed by utilizing the reference labels, so any fusion rule can be applied to obtain the consensus clustering. In this paper, we use a plurality voting scheme to generate the final consensus labels.
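Under the same assumptions (NumPy arrays, illustrative names), the following sketch shows one possible reading of the complete fusion step: build $\Lambda'_u$ by relabeling every undetermined point for each of the $M$ clusterings, then take a plurality vote per point. It is a sketch of the procedure described above, not the authors' reference implementation.

```python
import numpy as np

def consensus_labels(Lambda_u, ref_blocks):
    """Relabel each undetermined point with its most associated reference
    cluster for every clustering j, then take a plurality vote per point.
    `Lambda_u` is the (N-R)-by-M unknown block; `ref_blocks` is the list
    [Lambda_r_1, ..., Lambda_r_k0] of reference blocks (illustrative names)."""
    n_unknown, M = Lambda_u.shape
    relabeled = np.empty((n_unknown, M), dtype=int)   # this plays the role of Lambda'_u
    for j in range(M):
        for i in range(n_unknown):
            # a_ij(k): fraction of the k-th reference cluster that shares
            # this point's label in the j-th clustering
            a = [np.mean(block[:, j] == Lambda_u[i, j]) for block in ref_blocks]
            relabeled[i, j] = int(np.argmax(a)) + 1   # reference labels 1..k_0
    # plurality vote over the M relabeled columns of each point
    consensus = np.empty(n_unknown, dtype=int)
    for i in range(n_unknown):
        labels, counts = np.unique(relabeled[i], return_counts=True)
        consensus[i] = labels[np.argmax(counts)]
    return consensus
```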
4 ALGORITHM EXTENSION FOR LARGE DATA SETS
Our proposed algorithm requires a sufficient number of reference labels (i.e., the ratio of the number of reference points to the size of the data set must exceed a certain percentage p%). Since expertise or resources are usually expensive and limited, the proposed algorithm is only suitable for data sets of moderate size. For a large data set, we pro-