2.2.2 Hierarchical Clustering
Agglomerative (bottom-up) hierarchical clustering was used with the Euclidean metric and several linkages. Initially, each sample is assigned to its own cluster; clusters are then repeatedly merged according to a linkage criterion until all samples belong to a single cluster (Hastie et al., 2009). In this work, average and Ward's linkage (Ward, 1963) were tested. Average-link merges the pair of clusters with the smallest mean distance between their samples, whereas Ward's linkage merges the pair of clusters whose union yields the smallest increase in within-cluster variance. The resulting hierarchy of clusters can be visualized as a tree, called a dendrogram, in which each leaf node is an individual sample, each inner node is the union of its subclusters, and the root is the cluster containing all samples. The final partition is obtained by cutting the tree so that the number of clusters equals the number of classes, k, in the given data set.
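As an illustrative sketch (not the exact pipeline used in this work), the procedure can be reproduced with scikit-learn's AgglomerativeClustering; the data matrix X below is a random placeholder for a gene-expression data set:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder data: N samples x d features (stand-in for an expression matrix).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
k = 3  # number of classes in the data set

# Agglomerative clustering with Euclidean distances (the default metric);
# the dendrogram is cut so that the final partition has k clusters.
for link in ("average", "ward"):
    labels = AgglomerativeClustering(n_clusters=k, linkage=link).fit_predict(X)
    print(link, np.bincount(labels))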
2.2.3 Evidence Accumulation Clustering
A single run of a clustering algorithm such as k-means can yield a diversity of solutions over the same data set, depending on the initialization or on the chosen value of k. To overcome this issue, an approach known as clustering ensembles has been proposed, which exploits the diversity of solutions produced by clustering algorithms. Clustering ensembles can be generated either from different clustering algorithms or by varying the algorithmic parameters (Strehl and Ghosh, 2002; Ayad and Kamel, 2008). To combine the results of a clustering ensemble, Fred and Jain (2005) proposed Evidence Accumulation Clustering (EAC), which merges the information of different partitions, as illustrated in Figure 1.
Evidence accumulation clustering can be summarized in the following steps: (i) building the clustering ensemble P, the set of M different partitions of a data set X; (ii) combining the evidence from these partitions in a co-association matrix; and (iii) extracting the consensus partition. The co-association matrix is built by taking the co-occurrence of pairs of patterns in the same cluster as a vote for their association. The underlying hypothesis is that patterns that should be grouped together are very likely to be assigned to the same cluster in
different data partitions. Therefore, the $M$ data partitions of $N$ patterns yield an $N \times N$ co-association matrix with elements

$$C(i,j) = \frac{n_{ij}}{M}, \qquad (1)$$

where $n_{ij}$ is the number of times the pattern pair $(i,j)$ is assigned to the same cluster among the $M$ partitions. The last step of the evidence accumulation
clustering consists of extracting the consensus
partition, which is found by applying a clustering
algorithm to the co-association matrix.
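A minimal sketch of steps (ii) and (iii), assuming NumPy and SciPy (the function names are illustrative, not from the paper):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_matrix(partitions):
    # Eq. (1): C[i, j] = n_ij / M, where n_ij counts how many of the M
    # partitions assign the pattern pair (i, j) to the same cluster.
    M, N = len(partitions), len(partitions[0])
    C = np.zeros((N, N))
    for labels in partitions:
        labels = np.asarray(labels)
        C += labels[:, None] == labels[None, :]  # one vote per co-occurrence
    return C / M

def consensus_partition(C, k, method="average"):
    # Step (iii): hierarchical clustering on 1 - C as a dissimilarity,
    # cutting the tree at k clusters. Note that Ward linkage formally
    # assumes Euclidean distances; here it is applied as a heuristic.
    D = 1.0 - C
    np.fill_diagonal(D, 0.0)
    Z = linkage(squareform(D, checks=False), method=method)
    return fcluster(Z, t=k, criterion="maxclust")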
In this paper, the clustering ensemble was produced by applying k-means $M = 200$ times, with $k$ randomly chosen in the interval $[\sqrt{N}/2, \sqrt{N}]$. The consensus partition was extracted by applying two hierarchical clustering algorithms, average-link and Ward's linkage, with the final number of clusters equal to the true number of classes. The whole procedure, starting from the clustering ensemble generation, was repeated 50 times with the same parameters, and the results were averaged.
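A sketch of this setup, reusing X, k, and the helper functions above (the exact k-means configuration is an assumption, with scikit-learn's KMeans standing in for the implementation used in the paper):

from sklearn.cluster import KMeans

def kmeans_ensemble(X, M=200, seed=0):
    # Build the ensemble: M k-means partitions with k drawn uniformly
    # from [sqrt(N)/2, sqrt(N)], where N is the number of samples.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    lo, hi = max(2, int(np.sqrt(N) / 2)), max(2, int(np.sqrt(N)))
    return [
        KMeans(n_clusters=int(rng.integers(lo, hi + 1)), n_init=1,
               random_state=int(rng.integers(2**31))).fit_predict(X)
        for _ in range(M)
    ]

# One EAC run; the paper repeats this 50 times and averages the results.
C = coassociation_matrix(kmeans_ensemble(X, M=200))
for link in ("average", "ward"):
    consensus = consensus_partition(C, k, method=link)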
2.2.4 Clustering Validation Measure
The validation of each clustering algorithm in each data set is performed using the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985), which compares the partition obtained by a clustering algorithm, $C = \{C_1, C_2, \ldots, C_k\}$, against the ground-truth partition $L = \{L_1, L_2, \ldots, L_s\}$. This measure is an improved
version of the Rand Index (RI) (Rand, 1971), which quantifies the agreement between two partitions by counting the pairs of samples that are clustered together, or placed in different clusters, in both partitions, and the disagreement by counting the pairs that are clustered together in one partition but not in the other. ARI corrects RI for chance, so that the expected value for random partitions is close to 0. The maximum value of 1 is reached when the external labels and those assigned by the clustering algorithm are identical up to a permutation.
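For instance, using scikit-learn (the label vectors below are illustrative):

from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]  # ground-truth classes
pred = [1, 1, 0, 0, 2, 2]   # clustering output with permuted labels
print(adjusted_rand_score(truth, pred))  # 1.0: identical up to a permutation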
3 RESULTS AND DISCUSSION
First, we present the overall results as box plots that include the results obtained on the Affymetrix and cDNA data sets (Figures 2 and 3). The box plots reveal that the agreement between the clustering results and the true labels, corresponding to cancer types, varies widely, spanning from 0 to 1, when the results from all data sets are analyzed jointly. The median values of all methods, except HC-average, are approximately the same. Similar results can be observed in the box plots corresponding to the cDNA results.