Information Fusion for Semi-supervised Cluster Labelings
Huaying Li and Aleksandar Jeremic
Department of Electrical and Computer Engineering, McMaster University
Hamilton, Ontario, Canada
Keywords: Clustering, Information Fusion, Cluster Ensemble, Semi-supervised Learning.
Abstract:
Cluster analysis is a widely used technique for finding hidden patterns in a data set. Combining multiple clustering results into a consensus clustering (a cluster ensemble) is a popular and efficient way to improve the quality of clustering analysis. Many algorithms have been proposed in the literature, and most of them are unsupervised learning techniques. In this paper, we propose a semi-supervised cluster ensemble algorithm. It is called semi-supervised because the labels of some data points in the given data set are known or provided by experts. To evaluate the performance of the proposed algorithm, we compare it with other well-known algorithms, such as MCLA and BCE.
1 INTRODUCTION
A mutation is an accidental change in the genomic sequence of DNA (Pickett, 2006). Studies of mutation are usually carried out using fluorescence microscopy, an important tool for visualizing biochemical activity within individual cells. In the past, analysis of these images was done manually through visual inspection, which is time consuming and can lead to inaccurate conclusions. Nowadays, automated image analysis techniques such as high content analysis (HCA) have been developed. HCA is the use of automated microscopy and high-end computation to understand complex biological processes. It typically involves acquiring high-resolution images and translating them into a multi-dimensional feature space, which spans hundreds of features per fluorescence channel and is further processed to provide relevant output (Shariff et al., 2010). Cluster analysis is a widely used technique for finding the hidden patterns or structure of a data set. The objective is to divide data points into distinct clusters so that data points in the same cluster are similar to each other and data points in different clusters are dissimilar. Cluster analysis is also a popular machine learning technique for further processing the data obtained from the feature extraction step of HCA.
Although many clustering algorithms exist in the literature, such as hierarchical, centroid-based, distribution-based and graph-theory-based algorithms, no single algorithm can correctly identify the
underlying structure of all data sets in practice (Xu and Wunsch, 2008). It is usually difficult to decide
which algorithm should be applied to a given data set when prior information about the cluster shapes and sizes is not provided. Furthermore, a particular clustering algorithm usually generates different cluster labels for a given data set when different initial starting points or different parameter settings are chosen. Combining multiple clusterings into a consensus labeling is a hard problem for two reasons: (1) the number of clusters in each clustering may differ and the desired number of clusters is usually unknown; (2) cluster labels are symbolic, so there is also a correspondence problem. In (Vega-Pons and Ruiz-Shulcloper, 2011), the authors provide a detailed review of many existing algorithms: some are based on relabeling and voting; some are based on the co-association matrix; some are based on graph and hypergraph representations of clusterings; some are based on finite mixture models; and so on. All of these algorithms are unsupervised because the input data set is unlabeled and the clusters are not pre-defined. Also, most cluster ensemble algorithms consist of two major steps: cluster ensemble generation and consensus fusion.
In this paper, we propose a semi-supervised cluster ensemble method. The term semi-supervised means that the labels of some data points are available and are utilized in the fusion stage. In the next section we briefly describe the cluster ensemble problem and introduce the structure of the multiple clustering system. In the following section, we propose a semi-supervised algorithm for combining multiple clusterings. Then we provide several numerical examples to illustrate the
algorithm and evaluate its performance using different types of data sets. Finally, we give a conclusion in the last section.
2 CLUSTER ENSEMBLE PROBLEM
The choice of an appropriate clustering algorithm and suitable parameter settings depends on the characteristics of the given data set, as well as on the intended use of the clustering results. Therefore, utilizing cluster analysis techniques to obtain multiple clusterings and combining them into a consensus clustering is an efficient way to improve the quality of clustering. The objective is to obtain a consensus clustering that retains the most information from each individual clustering. In this section, we first introduce some notation to describe the cluster ensemble problem and then define our multiple clustering system.
Let X = {x_1, x_2, ..., x_N} denote a set of N data points, where each data point x_n (for n = 1, ..., N) comes from an F-dimensional feature space. A clustering technique partitions the data set into smaller sets, denoted as k clusters, such that data points in a cluster are more similar to each other than to those in different clusters. A clusterer Φ represents the function that generates a clustering result, which is stored in an N-dimensional label vector. The idea of the cluster ensemble is represented in Fig. 1. For a given data set X, a clusterer Φ^(j) is used to partition the data into k_j clusters, and the cluster labels are stored in λ^(j), where j = 1, 2, ..., M. A set of clusterings Λ = {λ^(1), λ^(2), ..., λ^(M)} is generated by alternating the clustering algorithm across the M clusterers or by choosing different parameter settings (e.g., different numbers of clusters for each clusterer, different initial cluster centers, etc.). The multiple clusterings are later fused by a consensus function Γ in order to obtain a single label vector λ, a more reliable partition of the given data.
In Fig. 1, the arrows on the left represent the generation step and the arrows on the right represent the consensus step. On the one hand, there are usually no constraints on how the multiple clusterings are generated. There are several possibilities in the generation process: using different clustering algorithms; using the same clustering algorithm with different initializations and/or different parameter settings; using subsets of the features of the data points; or using random projections onto different subspaces. On the other hand, the consensus step is the core step that obtains a consolidated single clustering result. In this paper, we propose an algorithm that is based on relabeling and voting. Unlike existing algorithms, which work with unlabeled data, our proposed algorithm deals with a data set that is partially labeled. We call it a semi-supervised cluster ensemble algorithm.

Figure 1: Multiple Clustering System.
3 PROPOSED ALGORITHM
In this section, we present our semi-supervised algorithm for the information fusion of multiple clusterings. It contains three major steps: pre-clustering, cluster ensemble generation and consensus fusion.
3.1 Pre-clustering
Usually, in every field of study, expert opinions are available but limited due to many factors. Our assumption is that, for a given data set, experts or resources are available to provide cluster labels for a certain portion of the data points. In the pre-clustering step, we randomly select p% (p is a pre-determined number) of the data points in X to be reference points. The data set X is thus divided into two subsets: the reference set and the unknown set. The reference set X_r contains the reference points, and the unknown set X_u consists of the remaining data points, called undetermined points. Experts analyze the reference set X_r and group the reference points into clusters, which we call reference clusters. The cluster labels of the reference points, called reference labels, are used later in the consensus fusion. Suppose the number of reference clusters is k_0. The reference set X_r can then be divided into k_0 subsets {X_r^1, ..., X_r^{k_0}}, where X_r^k (for k = 1, ..., k_0) contains the data points in the k-th reference cluster. Since the reference points are randomly selected, we assume that the desired number of clusters for the whole data set X is consistent with the number of reference clusters k_0.
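To make the pre-clustering step concrete, here is a minimal Python sketch of the random reference/unknown split, assuming expert labels are available for the selected points; the function name and the use of NumPy are our own illustration, not part of the original formulation.

```python
import numpy as np

def pre_cluster(X, expert_labels, p, seed=0):
    """Randomly select p% of the N points as reference points.

    Returns global index arrays for the reference set X_r and the
    unknown set X_u, plus the reference labels in {0, ..., k0-1}.
    `expert_labels` stands in for the labels an expert would supply.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    R = int(np.ceil(N * p / 100.0))        # number of reference points
    perm = rng.permutation(N)
    ref_idx, unk_idx = perm[:R], perm[R:]  # X_r / X_u split
    return ref_idx, unk_idx, expert_labels[ref_idx]
```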
InformationFusionforSemi-supervisedClusterLabelings
319
3.2 Cluster Ensemble Generation
Different clustering algorithms provide different cluster labels for the same data set, since they focus on different aspects of the data (Topchy et al., 2004). Due to its simplicity, the k-means algorithm is widely used to provide the individual clusterings in the generation step of cluster ensemble algorithms. By choosing a number k considerably larger than the expected number of clusters, the k-means algorithm divides the data points into k smaller groups. Such a small group of data points is usually able to capture some details of the structure of the entire data set, while some of these smaller groups may need to be merged to form a cluster because of their common properties.
In (Fred and Jain, 2005), the cluster ensemble is generated by running the k-means algorithm multiple times with random initializations, where the number of clusters for each run is randomly selected from a set of integers (much greater than k_0). We use a similar generation mechanism in this paper, as sketched below. For the data set X (reference and unknown sets together), the k-means algorithm is applied M times to generate M individual clusterings, which form an N-by-M label matrix Λ. The entry Λ_{i,j} in the i-th row and j-th column of Λ is the cluster label of x_i according to the j-th clustering. In the preceding pre-clustering step, the data set X was divided into k_0 + 1 subsets, namely k_0 reference clusters and an unknown set (i.e., X = {X_r^1, X_r^2, ..., X_r^{k_0}, X_u}). Accordingly, the matrix Λ can be segmented into k_0 + 1 parts: Λ_r^1, Λ_r^2, ..., Λ_r^{k_0}, Λ_u.
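As a sketch of this generation mechanism (our own rendering, using scikit-learn's KMeans and an assumed range [k_min, k_max] for the per-run number of clusters, both well above k_0):

```python
import numpy as np
from sklearn.cluster import KMeans

def generate_ensemble(X, M, k_min, k_max, seed=0):
    """Run k-means M times, each time with a randomly chosen number
    of clusters, producing the N-by-M label matrix Lambda."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    Lambda = np.empty((N, M), dtype=int)
    for j in range(M):
        k_j = int(rng.integers(k_min, k_max + 1))  # clusters for run j
        Lambda[:, j] = KMeans(n_clusters=k_j, n_init=10,
                              random_state=int(rng.integers(1 << 31))
                              ).fit_predict(X)
    return Lambda
```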
3.3 Consensus Fusion
Consensus fusion of the multiple clusterings is the core step of the proposed algorithm. The fusion idea is as follows: according to an individual clustering, count the number of agreements between the label of a data point in the unknown set and the labels of the reference points in each reference cluster; assign this data point the label of the reference cluster with the highest number of agreements; repeat the procedure for all clusterings and determine the final cluster label based on a fusion rule. The proposed algorithm is summarized in Table 1.
Suppose that, for k = 1, ..., k_0, R_k is the number of reference points in the k-th reference cluster and R is the total number of reference points, so that R = R_1 + R_2 + ... + R_{k_0} and the total number of undetermined points is N − R. For the i-th undetermined data point x_i and the j-th clustering λ^(j) (where i = 1, ..., N − R and j = 1, ..., M), the association vector a_ij contains k_0 entries, each of which describes the association between x_i and a reference cluster.
Table 1: Semi-supervised cluster ensemble algorithm.

1. Pre-clustering
   (a) Choose p% of the data points and obtain reference labels (1, ..., k_0)
2. Cluster Ensemble Generation
   (a) Apply clusterer Φ^(j) to data set X and obtain individual clustering λ^(j)
   (b) Repeat M times to form a label matrix Λ = {Λ_r^1, Λ_r^2, ..., Λ_r^{k_0}, Λ_u}
3. Consensus Fusion
   (a) Assign undetermined data points their most associated cluster ids (highest entry in the association vector) according to label vector λ^(j), where the association vector is computed by
       a_ij(k) = (occurrences of Λ_u(i, j) in Λ_r^k(:, j)) / R_k
   (b) Repeat M times to form the new sub-matrix Λ′_u
   (c) Apply the fusion rule to obtain the consensus clustering
Recall that, for the undetermined data points, the corresponding segment of the label matrix Λ is Λ_u. A fusion rule such as majority voting is difficult to apply directly to Λ_u because of the correspondence problem of cluster labels, so a new matrix is needed before a fusion rule can generate the consensus labels. Based on the relabeling scheme described above, according to a clustering λ^(j), we assign the undetermined data points their most associated cluster labels (the highest entry in the corresponding association vector) and repeat this M times to form a new matrix Λ′_u. In this new label matrix, the correspondence problem is removed by utilizing the reference labels, so any fusion rule can be applied to obtain the consensus clustering. In this paper, we use a plurality voting scheme to generate the final consensus labels, as in the sketch below.
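A minimal sketch of the fusion step of Table 1, assuming the label matrix and index arrays from the previous sketches; the loop structure and plurality vote are our reading of the algorithm, and all variable names are hypothetical.

```python
import numpy as np

def consensus_fusion(Lambda, ref_idx, unk_idx, ref_labels, k0):
    """Relabel each undetermined point with its most associated
    reference cluster per clustering, then take a plurality vote."""
    M = Lambda.shape[1]
    relabeled = np.empty((len(unk_idx), M), dtype=int)  # Lambda'_u
    for j in range(M):
        for r, i in enumerate(unk_idx):
            # a_ij(k): fraction of the k-th reference cluster sharing
            # x_i's label under the j-th clustering
            a = np.array([
                np.mean(Lambda[ref_idx[ref_labels == k], j] == Lambda[i, j])
                for k in range(k0)
            ])
            relabeled[r, j] = int(np.argmax(a))
    # plurality voting across the M relabelings gives the consensus
    return np.array([np.bincount(row, minlength=k0).argmax()
                     for row in relabeled])
```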
4 ALGORITHM EXTENSION FOR LARGE DATA SETS
Our proposed algorithm requires that the number of reference labels is sufficient (i.e., the ratio of the number of reference points to the size of the data set is at least a certain percentage p%). Since expertise and resources are usually expensive and limited, the proposed algorithm is only suitable for data sets of moderate size.
BIOSIGNALS2015-InternationalConferenceonBio-inspiredSystemsandSignalProcessing
320
Table 2: Extended version of the semi-supervised cluster ensemble algorithm.

1. Pre-clustering: choose p% of the data points and obtain reference labels (1, ..., k_0)
2. If R/N ≥ p%, do steps 2) and 3) in Table 1 to obtain λ_u
3. If R/N < p%, set Q = ⌈(p/(1−p)) · (N−R)/R⌉
   for q = 1 : Q
     X^q = {X_r, X_u^q}, where X_u^q is the q-th subset of X_u
     do steps 2) and 3) in Table 1 on X^q to obtain λ_u^q
   end
   λ_u = {λ_u^1, λ_u^2, ..., λ_u^Q}
Table 3: Data sets from UCI machine learning repository.
Data Set Data Points Features Classes
Ionosphere 351 34 2
Pima 768 8 2
Balance 625 4 3
Wine 178 13 3
Segmentation 2100 19 7
For a large data set, we propose to divide the undetermined set X_u into several smaller subsets and to apply the proposed algorithm to each subset combined with the reference set. Recall that N is the total number of data points in X and R is the number of reference points. If R/N < p%, we divide the undetermined set X_u into Q subsets, i.e., X_u = {X_u^1, X_u^2, ..., X_u^Q}, where Q = ⌈(p/(1−p)) · (N−R)/R⌉ and ⌈x⌉ denotes the ceiling function, the smallest integer not less than x.

For q = 1, ..., Q, denote the combination of the reference set and the q-th subset of X_u as a new data set X^q = {X_r, X_u^q}. Apply the semi-supervised cluster ensemble algorithm to X^q to generate the consensus clustering λ_u^q, and repeat this Q times. Combining the Q segments yields the overall consolidated clustering of the whole undetermined set X_u, namely λ_u = {λ_u^1, λ_u^2, ..., λ_u^Q}. The extended version of the proposed algorithm is summarized in Table 2.
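As a hedged sketch of this extension, reusing the pre_cluster, generate_ensemble and consensus_fusion helpers sketched in the previous sections (all hypothetical names): the undetermined set is split into Q chunks so that each chunk, together with the full reference set, keeps the reference ratio at roughly p%.

```python
import math
import numpy as np

def fuse_large(X, ref_idx, unk_idx, ref_labels, p, M, k_min, k_max):
    """Extended algorithm for R/N < p%: fuse X_u chunk by chunk."""
    R, N = len(ref_idx), X.shape[0]
    frac = p / 100.0
    # Q = ceil( p/(1-p) * (N-R)/R )
    Q = math.ceil(frac / (1.0 - frac) * (N - R) / R)
    k0 = len(np.unique(ref_labels))
    labels_u = np.empty(len(unk_idx), dtype=int)
    for chunk in np.array_split(np.arange(len(unk_idx)), Q):
        # X^q = {X_r, X_u^q}: reference set plus the q-th chunk
        Xq = np.vstack([X[ref_idx], X[unk_idx[chunk]]])
        Lam = generate_ensemble(Xq, M, k_min, k_max)
        labels_u[chunk] = consensus_fusion(
            Lam, np.arange(R), np.arange(R, Xq.shape[0]), ref_labels, k0)
    return labels_u
```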
5 EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we provide several numerical examples to show the performance of the proposed algorithm. The UCI machine learning repository provides hundreds of data sets for machine learning researchers (Bache and Lichman, 2013). We choose
Table 4: Maximum micro-precision of the proposed algorithm for selected percentages of reference points.
Maximum p = 5% p = 10% p = 15%
Ionosphere 0.8034 0.8917 0.9345
Pima 0.6523 0.6914 0.7630
Balance 0.7456 0.7648 0.7872
Wine 0.4888 0.6404 0.7022
Segmentation 0.7652 0.7881 0.8262
Maximum p = 20% p = 25% p = 30%
Ionosphere 0.9345 0.9402 0.9459
Pima 0.7617 0.7773 0.7878
Balance 0.8096 0.8032 0.8240
Wine 0.6966 0.7640 0.7865
Segmentation 0.8129 0.8214 0.8367
Table 5: Average micro-precision of the proposed algorithm for selected percentages of reference points.
Average p = 5% p = 10% p = 15%
Ionosphere 0.7929 0.8721 0.9248
Pima 0.5997 0.6837 0.7510
Balance 0.7162 0.7304 0.7496
Wine 0.4640 0.6163 0.6702
Segmentation 0.7543 0.7719 0.8078
Average p = 20% p = 25% p = 30%
Ionosphere 0.9262 0.9319 0.9239
Pima 0.7436 0.7638 0.7809
Balance 0.7778 0.7862 0.8008
Wine 0.6478 0.7433 0.7584
Segmentation 0.8057 0.8110 0.8284
five data sets from the repository to evaluate the performance of our proposed algorithm. Information about the chosen data sets is listed in Table 3.
Since the testing data sets in our experiments have true labels associated with them, we choose micro-precision (mp) as the metric for measuring the accuracy of a clustering result with respect to the true clustering (Wang et al., 2011). Suppose there are k_t classes in the ground truth for a given data set X with N data points, and let N_k be the number of data points in the k-th cluster of a clustering result that are correctly assigned to the corresponding class, where the corresponding class is the true class with the largest overlap with the k-th cluster. The micro-precision is then defined by mp = (Σ_{k=1}^{k_t} N_k) / N. As mentioned at the beginning of this paper, many cluster ensemble algorithms exist in the literature. We compare our algorithm with MCLA and CSPA, proposed in (Strehl and Ghosh, 2003), and with MM and BCE, presented in (Wang et al., 2011). The authors of (Vega-Pons and Ruiz-Shulcloper, 2011) provide a brief review of the core ideas of these algorithms. The true cluster labels of the data sets listed in Table 3 are available through the UCI website.
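For concreteness, here is a small sketch of the micro-precision computation as we read the definition above; the greedy per-cluster best match is our interpretation of (Wang et al., 2011).

```python
import numpy as np

def micro_precision(pred, truth):
    """mp = (sum_k N_k) / N, where N_k is the largest overlap between
    the k-th predicted cluster and any true class."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    total = 0
    for k in np.unique(pred):
        members = truth[pred == k]           # true classes inside cluster k
        total += np.bincount(members).max()  # N_k for the best-matching class
    return total / len(truth)
```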
InformationFusionforSemi-supervisedClusterLabelings
321
Figure 2: Micro-precision of proposed algorithm compared with other algorithms Part I.
Figure 3: Micro-precision of proposed algorithm compared with other algorithms Part II.
In our experiments, we randomly select p% of the data points and use the corresponding true labels of these points as reference labels; M = 15 clusterings are used to generate the ensemble. We start with p = 5 and increase the percentage of reference points in 5% increments. The maximum and average performance of our proposed semi-supervised cluster ensemble algorithm are listed in Table 4 and Table 5, respectively. Compared with the experimental results reported in (Wang et al., 2011), Figures 2 and 3 show that the proposed algorithm provides higher micro-precision in most of our experiments. In the figures, each column represents one data set; the upper plots show the maximum performance, while the lower plots show the average performance. The x-axis represents the percentage of reference points and the y-axis displays the micro-precision of the proposed algorithm. The more reference points used, the better the performance obtained. The horizontal line in each plot represents the corresponding best micro-precision reported in (Wang et al., 2011). For the Ionosphere, Balance and Segmentation data sets, using only 5% reference labels already generates a consensus clustering with much higher micro-precision. For the Pima and Wine data sets, increasing the number of reference points generates a more precise consensus clustering.
We also apply the proposed algorithm to a biomedical data set obtained using a Perkin Elmer high content imaging system. The data are used to study human breast cancer cells undergoing treatment with different drugs. In our experiment, the data points come from four different treatments. As a preliminary result, our proposed algorithm labels 75% of the data points correctly using 5% reference points, and 86% correctly using 20%.

ACKNOWLEDGEMENTS

This work was supported by the Natural Sciences and Engineering Research Council of Canada.
REFERENCES

Bache, K. and Lichman, M. (2013). UCI machine learning repository.

Fred, A. L. and Jain, A. K. (2005). Combining multiple clusterings using evidence accumulation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(6):835–850.

Pickett, J. P. (2006). The American Heritage Dictionary of the English Language. Houghton Mifflin.

Shariff, A., Kangas, J., Coelho, L. P., Quinn, S., and Murphy, R. F. (2010). Automated image analysis for high-content screening and analysis. Journal of Biomolecular Screening, 15(7):726–734.

Strehl, A. and Ghosh, J. (2003). Cluster ensembles — a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.

Topchy, A. P., Jain, A. K., and Punch, W. F. (2004). A mixture model for clustering ensembles. In SDM, pages 379–390. SIAM.

Vega-Pons, S. and Ruiz-Shulcloper, J. (2011). A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(03):337–372.

Wang, H., Shan, H., and Banerjee, A. (2011). Bayesian cluster ensembles. Statistical Analysis and Data Mining, 4(1):54–70.

Xu, R. and Wunsch, D. (2008). Clustering, volume 10. John Wiley & Sons.
InformationFusionforSemi-supervisedClusterLabelings
323