puted for each data point, and all the average association vectors are normalized to derive the soft consensus label matrix for the given data set. For the hard semi-supervised clustering ensemble algorithm (HSEA), the hard consensus clustering is generated by one of two approaches. The first approach assigns each data point the id of its most associated cluster based on its average association vector. This version is named the soft-to-hard semi-supervised clustering ensemble algorithm (SHSEA). The second approach relabels the set of base clusterings by assigning each data point the id of its most associated cluster according to each base clustering, and then derives the hard consensus clustering by majority voting. This version is named the hard-to-hard semi-supervised clustering ensemble algorithm (HHSEA).
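The two hard-consensus approaches can be sketched as follows. This is a minimal illustration, assuming the association vectors have already been computed and stored as NumPy arrays; the function and variable names are ours, not from the paper:

```python
import numpy as np

def soft_to_hard(avg_assoc):
    """SHSEA-style fusion: assign each data point the id of its most
    associated training cluster, based on its (N_u x K0) matrix of
    average association vectors. Cluster ids run from 1 to K0."""
    return np.argmax(avg_assoc, axis=1) + 1

def hard_to_hard(assoc_per_base):
    """HHSEA-style fusion: relabel each base clustering by the most
    associated training cluster, then fuse the relabelled clusterings
    by majority voting. assoc_per_base is a (D x N_u x K0) tensor with
    one association slice per base clustering."""
    relabelled = np.argmax(assoc_per_base, axis=2)   # (D x N_u) relabelled ids
    n_u = relabelled.shape[1]
    consensus = np.empty(n_u, dtype=int)
    for i in range(n_u):
        votes = np.bincount(relabelled[:, i])        # votes per cluster id
        consensus[i] = np.argmax(votes) + 1          # majority vote, ids 1..K0
    return consensus
```

Ties in the majority vote are broken here by taking the smallest cluster id; the paper does not specify a tie-breaking rule in this section.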
2 DISTRIBUTED CLUSTERING
In the literature, many clustering ensemble algorithms have been proposed; they can be broadly divided into categories such as relabelling-and-voting-based, co-association-based, hypergraph-based, and mixture-density-based clustering ensemble algorithms (Ghaemi et al., 2009; Vega-Pons and Ruiz-Shulcloper, 2011; Aggarwal and Reddy, 2013). Clustering ensemble methods usually consist of two major steps: base clustering generation and consensus fusion. The set of base clusterings can be generated in different ways, as discussed in the previous section. In this section, we provide a brief review of several consensus fusion methods.
2.1 Semi-supervised Clustering Ensemble
In this paper, we propose a semi-supervised algorithm that utilizes side information (data observations with known labels). The algorithm calculates the association between each data point and the training clusters (formed by the labelled data observations) and relabels the cluster labels in Φ_u according to the training clusters. In the context of this paper, since the generation of base clusterings is based on unsupervised clustering algorithms and the fusion of base clusterings is guided by the side information, we name the proposed algorithm the semi-supervised clustering ensemble algorithm (SEA). It consists of two major steps: base clustering generation and fusion. The base clustering generation step is common to the existing ensemble methods and is summarized in Table 1. For the base clustering fusion step, we propose different versions of the fusion function to produce soft and hard consensus clusterings, respectively.
2.2 Soft Semi-supervised Clustering Ensemble Algorithm
Suppose the input data set X is the combination of a training set X_r and a testing set X_u. The training set X_r contains data points {x_1, ..., x_{N_r}}, for which labels are provided in a label vector λ_r. The testing set X_u contains data points {x_{N_r+1}, ..., x_N}, the labels of which are unknown. The consensus cluster label vector (output of SEA) for the testing set X_u is denoted by λ_u. The size of the training set X_r is the number of data points in the training set and is denoted by N_r, i.e., |X_r| = N_r. Similarly, the size of the testing set X_u is the number of data points in the testing set and is denoted by N_u, i.e., |X_u| = N_u. According to the training and testing sets, the label matrix Φ can be partitioned into two block matrices Φ_r and Φ_u, which contain all the labels corresponding to the data points in the training set X_r and the testing set X_u, respectively. Suppose the training data points belong to K_0 classes and all training points from the k-th class form one cluster, denoted by C_k^r (k = 1, ..., K_0). Therefore, the training set X_r consists of a set of K_0 clusters {C_1^r, ..., C_k^r, ..., C_{K_0}^r}. If the size of cluster C_k^r is denoted by N_k^r, the total number of training points N_r is equal to ∑_{k=1}^{K_0} N_k^r. We rearrange the label matrix Φ_r to form K_0 block matrices: Φ_1^r, ..., Φ_k^r, ..., Φ_{K_0}^r. Each block matrix Φ_k^r contains the base cluster labels of the data points in the k-th training cluster C_k^r, where k = 1, ..., K_0.
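The partitioning of the label matrix into Φ_r, Φ_u, and the per-cluster blocks Φ_k^r can be illustrated with a toy example. The array values below are invented purely for illustration (N = 6 points, D = 2 base clusterings, N_r = 4 labelled points in K_0 = 2 classes):

```python
import numpy as np

# N x D label matrix Phi: one row per data point, one column per base clustering
Phi = np.array([[1, 2],
                [1, 1],
                [2, 2],
                [2, 1],
                [1, 2],
                [2, 1]])
lam_r = np.array([1, 1, 2, 2])   # class labels of the N_r training points
N_r = len(lam_r)

# Partition Phi into the training block Phi_r and the testing block Phi_u
Phi_r, Phi_u = Phi[:N_r], Phi[N_r:]

# Rearrange Phi_r into K0 block matrices, one per training cluster C_k^r
K0 = len(np.unique(lam_r))
Phi_r_blocks = {k: Phi_r[lam_r == k] for k in range(1, K0 + 1)}
```

Each block Phi_r_blocks[k] then holds the base cluster labels of the N_k^r points in training cluster C_k^r, and the block sizes sum to N_r.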
For a given set of base clusterings, the soft version of the semi-supervised clustering ensemble algorithm (SSEA) provides a soft consensus cluster label matrix. The fusion idea is stated as follows: (1) for a particular data point, count the number of agreements between its label and the labels of the training points in each training cluster, according to an individual base clustering; (2) calculate the association vector between this data point and the corresponding base clustering; (3) compute the average association vector by averaging the association vectors between this data point and all base clusterings; and (4) repeat for all data points and derive the soft consensus clustering for the testing set. A summary of SSEA is provided in Table 2.
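The four fusion steps above can be sketched as follows. This is a minimal illustration under our own assumptions: the function name is ours, and each association entry is taken to be the agreement count normalized by the size of the training cluster, with a final row normalization to produce the soft consensus label matrix:

```python
import numpy as np

def ssea_fusion(Phi_u, Phi_r_blocks):
    """Sketch of the SSEA fusion steps.
    Phi_u        : (N_u x D) base cluster labels of the testing points
    Phi_r_blocks : list of K0 arrays; Phi_r_blocks[k] holds the (N_k^r x D)
                   base labels of the points in the (k+1)-th training cluster."""
    n_u, D = Phi_u.shape
    K0 = len(Phi_r_blocks)
    soft = np.zeros((n_u, K0))
    for j in range(D):                    # one base clustering at a time
        for k, block in enumerate(Phi_r_blocks):
            # (1) count agreements between each testing point's label and the
            #     labels of training cluster k, under base clustering j
            agreements = (Phi_u[:, j][:, None] == block[:, j][None, :]).sum(axis=1)
            # (2) association with cluster k (normalized by cluster size)
            soft[:, k] += agreements / len(block)
    soft /= D                             # (3) average over all base clusterings
    # (4) normalize each row to obtain the soft consensus label matrix
    row_sums = soft.sum(axis=1, keepdims=True)
    return np.divide(soft, row_sums, out=np.zeros_like(soft), where=row_sums > 0)
```

Each row of the returned matrix is one testing point's soft membership over the K_0 training clusters; taking the argmax of each row recovers the hard SHSEA assignment described earlier.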
According to the j-th clustering λ^(j), we compute the association vector a_i^(j) for the i-th unlabelled data point x_i, where i = 1, ..., N_u and j = 1, ..., D. Since there are K_0 training clusters, the association vector a_i^(j) has K_0 entries. Each entry describes the association between data point x_i and the corresponding
and the corresponding