Table 2 shows the results for the CAL cluster-
ing algorithm using the RAL process. By compar-
ing all the results that use the Silhouette index, it can
be seen that the traditional Silhouette index selected
the best data partition in 7 out of the 14 data sets; the corresponding weighting approach obtained the same number, although the selected partitions are not the same. The results obtained by learning a distance function were slightly better: both with and without score weighting, the indices picked the best partition 8 times. Doing the same analysis for the
Normalized Hubert’s statistic results, we noticed that
the traditional approach identifies the best partition in only 5 of the 14 data sets. The weighted score approach with the original distance metric chooses the best partition 6 times. With a learned distance metric the best partition is selected 8 times, and combining it with the constraint satisfaction weighting raises this to 9 out of the 14 data sets.
The results for the PCKM clustering algorithm
using the RAL process are shown in Table 3. The
worst results were again achieved by the traditional validity approach, with the best partitions being selected only 4 times by both the Silhouette index and
the Normalized Hubert’s statistic. The results for the
weighted score approach are somewhat better: both indices selected the best partition 6 times. By learning a new distance metric the results are considerably better. The Silhouette index selects the best partition 8 and 6 times with and without the score weighting, respectively, while the Normalized Hubert’s statistic identifies the best partition in 9 out of the 14 data sets both with and without the score weighting.
Table 4 shows the results for the CAL clustering
algorithm using the RAC process. The traditional Silhouette index picked the best partition 8 times, while the score weighting approach did so 10 times with both the original and the learned metrics. The simple distance learning approach selected the best partitions in 8 out of the 14 data sets. The Normalized Hubert’s statistic results were not as good, with the best partition selected only 3, 4, 8 and 6 times by the traditional, score weighting, distance learning, and distance learning plus score weighting approaches, respectively. Nonetheless, the results obtained using
constraints are better, especially when the distance
metric was learned.
The results for the PCKM algorithm using the
RAC process are presented in Table 5. The traditional
Silhouette index determined the best data partitions 8
times and the corresponding weighted score approach
identified the same best partitions plus another one.
The simple distance learning approach selected the
best partition in 9 out of the 14 data sets and com-
bining it with the weighted score approach decreases
the number of identified best partitions by one. The
traditional Normalized Hubert’s statistic selects the best partition only 5 times, and the score weighting approach 8 times. The simple metric learning approach also picks the best partition 8 times and, again, the weighted score with distance learning approach reduces the number of identified best partitions by one. We may conclude from the previous
results that the incorporation of constraints clearly in-
creases the performance of the clustering validation
process. Simply weighting a validity index score by the constraint satisfaction ratio already improves the results, and learning a new metric based on the pairwise constraints leads to even better results.
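As a concrete sketch of the score weighting idea, assuming the weighted score is simply the product of the internal index value and the fraction of satisfied constraints (the exact combination rule is not shown in this section, and the names below are illustrative):

```python
def satisfaction_ratio(labels, must_link, cannot_link):
    """Fraction of pairwise constraints a partition satisfies.

    labels: cluster assignment per point; constraints are (i, j) index pairs.
    """
    satisfied = sum(labels[i] == labels[j] for i, j in must_link)
    satisfied += sum(labels[i] != labels[j] for i, j in cannot_link)
    total = len(must_link) + len(cannot_link)
    return satisfied / total if total else 1.0

def weighted_validity(index_value, labels, must_link, cannot_link):
    # Assumed combination: scale the internal index (e.g. Silhouette)
    # by the satisfaction ratio so constraint-violating partitions
    # are penalised when selecting the best partition.
    return index_value * satisfaction_ratio(labels, must_link, cannot_link)
```

The best partition is then the one maximising the weighted score across the candidate partitions produced by the clustering algorithm.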
Table 6 indicates the number of times that each va-
lidity measure (by line) achieved better/worse/equal
results than the other validity measures (by column).
The Silhouette with the score weighting approach obtained better results than the traditional Silhouette 14 times and worse results 9 times. The Silhouette with distance learning achieved better partitions 10 times and worse ones only 6 times. The metric learning combined with the score weighting achieved 16 better results and 10 worse. Performing the same analysis for
the Normalized Hubert’s statistic, the weighted score
approach was better than the traditional one 15 times
and worse 9 times. The distance learning approach obtained
better results 21 times and 7 worse. The combination
of metric learning and score weighting obtained 22
improvements and only 12 worse results. These results show once more that constrained clustering validation outperforms the traditional validation approach, especially when distance learning is used.
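The better/worse/equal counts in Table 6 amount to pairwise comparisons of the partitions each measure selected across the experiments. A minimal sketch, assuming higher scores mean better partitions and using hypothetical names:

```python
def compare_counts(scores_a, scores_b):
    """Count experiments where measure A selected a better / worse /
    equally good partition than measure B (higher score = better)."""
    better = sum(a > b for a, b in zip(scores_a, scores_b))
    worse = sum(a < b for a, b in zip(scores_a, scores_b))
    equal = len(scores_a) - better - worse
    return better, worse, equal
```

Each cell of a table like Table 6 would then hold the triple returned for one (row measure, column measure) pair.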
Figure 6 shows the plots of the consistency index
values and the constraint satisfaction ratio obtained
for all partitions produced in our experiments versus
each internal validation index, distinguished by clus-
tering algorithm and constraint acquisition method,
and Table 7 presents the correlation between the internal validation indices and the consistency index. It
can be seen that very different consistency values may
be achieved for partitions with all constraints satisfied
(Figure 6c). This indicates that the constraint satisfaction ratio alone is not a good indicator of the quality of the partitions, which is corroborated by the low
correlation with the consistency index. We can also
conclude that the validation approaches that use dis-
tance learning have higher correlation with the con-
sistency index, which is another indication that these
are a good option for clustering validation.
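Assuming the correlations in Table 7 are standard Pearson correlations between each index's values and the consistency index over all produced partitions, they can be computed as:

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences,
    e.g. an internal index's scores and the consistency index values."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A validation index with a Pearson correlation near 1 against the consistency index ranks partitions almost the same way the external quality measure does, which is the property argued for above.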
Data Clustering Validation using Constraints