focused on providing pairwise constraints to the clustering algorithm. Wagstaff et al. (2000, 2001) define must-link and cannot-link constraints, respectively, for specifying that two instances should, or should not, be in the same cluster.
One way of dealing with these pairwise constraints is to adapt existing clustering algorithms to take them into account. Wagstaff et al. (2001) adapt K-Means to this end, treating must-link and cannot-link as hard constraints.
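To make the hard-constraint semantics concrete, the following sketch (in Python; the function name and data representation are ours, not from the original paper) shows the kind of violation check a COP-KMeans-style algorithm performs before assigning an instance to a cluster:

    def violates_constraints(x, cluster, must_link, cannot_link, assignment):
        # Check whether assigning instance x to `cluster` would violate a
        # pairwise constraint, given the partial `assignment` (a dict from
        # already-assigned instances to cluster ids) built so far.
        for a, b in must_link:
            other = b if a == x else a if b == x else None
            # Must-link violated if the partner sits in another cluster.
            if other is not None and assignment.get(other, cluster) != cluster:
                return True
        for a, b in cannot_link:
            other = b if a == x else a if b == x else None
            # Cannot-link violated if the partner sits in the same cluster.
            if other is not None and assignment.get(other) == cluster:
                return True
        return False

Under hard constraints, an instance may only be assigned to a cluster for which this check returns False; if no such cluster exists, the algorithm fails to produce a partition.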
Alternatively, one can use a standard algorithm but adapt the distance metric. Xing et al. (2002) propose to learn a Mahalanobis matrix M (Mahalanobis, 1936), which defines a corresponding Mahalanobis distance

    d_M(x, y) = \sqrt{(x − y)^T M (x − y)}    (1)

They find the M that minimizes the sum of squared distances between instances that must link, under the constraint that d_M(x, y) ≥ 1 for all x and y that cannot link. M can be restricted to be a diagonal matrix, or can be full. The idea of learning such a distance function is generally referred to as metric-based or similarity-adapting methods (Grira et al., 2004).
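As an illustration of Equation (1), here is a minimal NumPy sketch (the function name is ours); M may be any positive semi-definite matrix, for instance one learned by the method of Xing et al.:

    import numpy as np

    def mahalanobis_distance(x, y, M):
        # Equation (1): sqrt((x - y)^T M (x - y)). M must be positive
        # semi-definite; M = I recovers the Euclidean distance, a diagonal
        # M rescales individual features, and a full M also rotates the space.
        d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.sqrt(d @ M @ d))

    # Sanity check: with M = I this is the ordinary Euclidean distance.
    assert np.isclose(mahalanobis_distance([1, 2], [4, 6], np.eye(2)), 5.0)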
Combining algorithm and similarity adaptation, Bilenko et al. (2004) introduced the MPCK-Means algorithm. A first difference with Wagstaff et al. is that constraints are now handled as soft constraints, by defining costs for unsatisfied constraints. Furthermore, the k means are initialized using a seeding procedure proposed by Basu et al. (2002). For the metric-based part of MPCK-Means, a separate Mahalanobis metric can be learned for each tentative cluster in every iteration of the algorithm, allowing clusters of different shapes in the final partition.
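Purely to illustrate the soft-constraint idea, the sketch below augments the standard K-Means objective with hypothetical violation costs w_ml and w_cl; the actual MPCK-Means objective is richer, since it also incorporates the learned per-cluster metrics:

    import numpy as np

    def penalized_objective(X, centroids, assignment, must_link, cannot_link,
                            w_ml=1.0, w_cl=1.0):
        # Standard K-Means cost: squared distance of each instance to its
        # centroid; assignment[i] is the cluster index of instance i.
        cost = sum(float(np.sum((X[i] - centroids[assignment[i]]) ** 2))
                   for i in range(len(X)))
        # Add a (hypothetical) cost w_ml per violated must-link constraint
        # and w_cl per violated cannot-link constraint.
        cost += sum(w_ml for i, j in must_link if assignment[i] != assignment[j])
        cost += sum(w_cl for i, j in cannot_link if assignment[i] == assignment[j])
        return cost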
As an alternative to pairwise constraints, Bar-Hillel et al. (2005) use chunklets: groups of instances that are known to belong to the same cluster. Their Relevant Component Analysis (RCA) algorithm takes chunklets as input and learns a Mahalanobis matrix. This approach is shown to work better than Xing et al.'s on high-dimensional data. A downside is that only must-link information is taken into account. There is no information about which instances cannot link: different chunklets may belong to the same cluster, or they may not. RCA minimizes the same function as Xing et al.'s method, but under different constraints (Bar-Hillel et al., 2005).
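A minimal sketch of the RCA computation, assuming the pooled within-chunklet covariance is positive definite (variable and function names are ours):

    import numpy as np

    def rca_whitening(chunklets):
        # Center each chunklet on its own mean, so that only
        # within-chunklet (i.e., within-cluster) variability remains.
        centered = [c - c.mean(axis=0) for c in chunklets]
        n = sum(len(c) for c in chunklets)
        # Pooled within-chunklet covariance matrix.
        cov = sum(c.T @ c for c in centered) / n
        # Inverse square root via eigendecomposition (assumes cov is
        # positive definite; in practice one may regularize first).
        vals, vecs = np.linalg.eigh(cov)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T  # apply as x -> W @ x

Whitening with this matrix shrinks the directions along which chunklet members vary, which is exactly the must-link information the chunklets carry.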
Yeung and Chang (2006) have extended RCA to include cannot-link information. They treat each pairwise constraint as a chunklet, and compute a separate matrix for the must-link constraints, A_ML, and for the cannot-link constraints, A_CL. The data are then transformed by A_CL^{1/2} · A_ML^{−1/2}. This “pushes apart” cannot-link instances in the same way that must-link instances are drawn together.
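A sketch of this transformation, assuming A_ML and A_CL are symmetric positive definite and using SciPy for the matrix powers (the function name is ours):

    from scipy.linalg import fractional_matrix_power

    def yeung_chang_transform(X, A_ml, A_cl):
        # T = A_CL^{1/2} . A_ML^{-1/2}: must-link pairs are drawn together
        # by A_ML^{-1/2}, cannot-link pairs are pushed apart by A_CL^{1/2}.
        T = fractional_matrix_power(A_cl, 0.5) @ fractional_matrix_power(A_ml, -0.5)
        return X @ T.T  # each row x of X is transformed as T x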
3 CLUSTERS AS EXAMPLES
3.1 Task Definition
We define the task of semi-supervised clustering with example clusters as follows (where P(·) denotes the power set):

Given: An instance space X, a set of instances D ⊆ X, a set of disjoint example clusters E ⊆ P(D), and a quality measure Q : P(P(D)) × P(P(D)) → R.

Find: A partition C = {C_1, C_2, ..., C_k} of D that maximizes Q(C, E).
Q typically measures to what extent C is consistent with E (ideally, E ⊆ C), but may also take general clustering quality into account. Note that neither the number of clusters to be found nor the distance metric to be used is part of the input. Also, the requirement that E ⊆ C is not strict; this allows for noise in the data.
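Purely as an illustration (this particular Q is our own choice, not part of the task definition), a quality measure could simply count how many example clusters reappear intact in the partition:

    def example_recall(C, E):
        # Fraction of example clusters in E that appear unchanged as
        # clusters of the partition C; clusters are represented as
        # collections of instances. Deliberately ignores general
        # clustering quality.
        clusters = {frozenset(c) for c in C}
        return sum(1 for e in E if frozenset(e) in clusters) / len(E)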
The task just defined has many applications. We mentioned entity resolution earlier. A similar task is face recognition, in a context where all (or most) occurrences of the faces of a few persons have been labeled. This application is typically high-dimensional. Clustering in high-dimensional spaces is difficult because multiple natural clusterings may occur in different subspaces (Agrawal et al., 2005). For instance, one might want to cluster faces according to identity, pose, emotion shown, etc. An example cluster can help the system determine the most relevant subspace.
3.2 Translation to Pairwise Constraints
Example clusters can easily be translated into pairwise constraints. Let ML(x, y) denote a must-link constraint between x and y, and CL(x, y) a cannot-link constraint. Providing an example cluster C corresponds to stating ML(x, y) for all x, y ∈ C and CL(x, y) for all x ∈ C, y ∉ C. If C has n elements (call them x_1, ..., x_n), and the complete dataset has N elements (x_1 to x_N), this generates n(n − 1)/2 must-link constraints and n(N − n) cannot-link constraints². Clearly, this set of constraints can be large (O(nN)), potentially
² By applying two inference rules, ∀x, y, z : ML(x, y) ∧ ML(y, z) ⇒ ML(x, z) and ∀x, y, z : CL(x, y) ∧ ML(y, z) ⇒ CL(x, z), it suffices to list a minimal set of n − 1 must-link constraints and N − n cannot-link constraints. However, the existing metric learning methods that use pairwise constraints do not automatically apply these rules.
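The translation itself is mechanical; a minimal sketch generating the full constraint set for one example cluster (the function name is ours):

    from itertools import combinations

    def cluster_to_constraints(example_cluster, dataset):
        # All n(n-1)/2 must-link pairs within the example cluster, and
        # all n(N-n) cannot-link pairs between the cluster and the rest
        # of the dataset.
        inside = list(example_cluster)
        outside = [x for x in dataset if x not in example_cluster]
        must_link = list(combinations(inside, 2))
        cannot_link = [(x, y) for x in inside for y in outside]
        return must_link, cannot_link

For instance, with n = 3 cluster members in a dataset of N = 5 instances, this yields 3 must-link and 6 cannot-link constraints, matching the counts above; the minimal set of footnote 2 would instead link x_1 to the other members (n − 1 must-links) and to all outside instances (N − n cannot-links).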