and partitioning are restored and a new swap is
started.
Owing to its modus operandi, Random Swap naturally explores the whole data space and, with a suitable number of iterations, is capable of finding a solution close to the optimal one in practical cases (Nigro et al., 2023).
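The swap loop recalled above can be illustrated as follows (a minimal sketch, not the authors' implementation; the helper names and the budget of two Lloyd refinement iterations per trial are our assumptions):

```python
import numpy as np

def cost(X, C):
    """Sum of squared distances of each point to its nearest centroid."""
    return ((X[:, None] - C[None]) ** 2).sum(-1).min(axis=1).sum()

def lloyd(X, C, iters=2):
    """A few Lloyd refinement iterations (local optimization)."""
    C = C.copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def random_swap(X, k, trials=50, seed=0):
    """Trial-and-error swaps: keep a swap only if it lowers the cost."""
    rng = np.random.default_rng(seed)
    C = lloyd(X, X[rng.choice(len(X), k, replace=False)].astype(float))
    best = cost(X, C)
    for _ in range(trials):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]  # random swap
        trial = lloyd(X, trial)
        c = cost(X, trial)
        if c < best:
            C, best = trial, c  # accept the improving swap
        # otherwise the previous centroids/partitioning are restored
    return C, best
```

A rejected swap simply discards the trial configuration, which realizes the "restore" step recalled above.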
2.4 Recombinator K-Means
Recombinator K-Means is based on an evolutionary algorithm (Baldassi, 2022a) that maintains a population of individuals (centroid configurations), initialized from the whole dataset. New generations are then created by a recombination mechanism followed by local optimization through K-Means, as in Random Swap, until a convergence criterion is met.
More particularly, at each generation, new centroid configurations (solutions) are first created by executing K-Means a given number of times with g_kmeans++ seeding (in reality, an adapted version of g_kmeans++ with a weighting mechanism is used). Each centroid configuration, refined by a few iterations of Lloyd's K-Means, is kept along with its cost. Finally, the population is updated by retaining only the best solutions, according to their cost, among the previous and newly generated ones.
The weights, initialized as a uniform vector, mirror priorities tied to configuration costs; they are updated after each generation and drive the selection of the next centroid in g_kmeans++.
The net effect of processing a generation is to
(re)combine centroid points of different solutions, so
as to compose new solutions with a lower cost.
The approach ensures that the average cost of the population monotonically decreases as generations advance; the population eventually coalesces to a single solution, which provides a natural stopping criterion.
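A single generation of this scheme can be illustrated as follows (a simplified, illustrative sketch: the weighted seeding and the cost-based priority update shown here only approximate the actual mechanism of Baldassi, 2022a, and all helper names are ours):

```python
import numpy as np

def cost(X, C):
    """Sum of squared distances of each point to its nearest centroid."""
    return ((X[:, None] - C[None]) ** 2).sum(-1).min(axis=1).sum()

def lloyd(X, C, iters=2):
    """A few Lloyd refinement iterations (local optimization)."""
    C = C.copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def weighted_seed(P, w, k, rng):
    """k-means++-style seeding over the population P, biased by weights w."""
    p = w / w.sum()
    idx = [rng.choice(len(P), p=p)]
    for _ in range(k - 1):
        d2 = ((P[:, None] - P[idx][None]) ** 2).sum(-1).min(axis=1)
        q = w * d2
        if q.sum() == 0:  # degenerate case: all mass on chosen points
            q = np.ones(len(P))
        idx.append(rng.choice(len(P), p=q / q.sum()))
    return P[idx].astype(float)

def generation(X, P, w, k, m, rng):
    """One generation: seed m candidates from P, refine them on X,
    keep the best half, and derive new priorities from the costs."""
    cands = sorted(
        ((cost(X, C), C) for C in
         (lloyd(X, weighted_seed(P, w, k, rng)) for _ in range(m))),
        key=lambda t: t[0])
    keep = cands[: max(1, m // 2)]
    newP = np.vstack([C for _, C in keep])        # recombined centroid pool
    costs = np.repeat([c for c, _ in keep], k).astype(float)
    neww = 1.0 / (1.0 + costs - costs.min())      # lower cost -> higher priority
    return newP, neww, keep[0][0]
```

Iterating `generation` shrinks the pool toward the best configurations, which mimics the coalescing behavior described above.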
Recombinator K-Means is implemented in Julia
and has been successfully applied to both synthetic
and non-synthetic datasets.
2.5 Clustering Accuracy Indexes
Besides the internal cost, the quality of a clustering solution can often also be checked by some external measure. An important one is the Centroid Index (CI) proposed in (Fränti et al., 2014), which captures the dissimilarity between two centroid solutions. The use of the CI is effective when, e.g., a synthetic dataset provided with ground-truth information (centroids and/or partitions) is considered.
The CI expresses the degree to which an achieved solution is close to the ground truth. Formally, each centroid of one solution is mapped onto the centroid of the other solution at minimal distance from it. Dissimilarity is then computed as the number of centroids in the target solution ("orphans") onto which no centroid is mapped. In a similar way, the mapping is carried out in the opposite direction and the number of orphans is counted again. The CI value is the maximum number of orphans over the two mapping directions. Of course, a CI value of 0 is a precondition for a "correct" clustering solution, that is, one which is structurally near to the ground truth.
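The two-way orphan count just described can be sketched compactly (an illustrative sketch; the function names are ours):

```python
import numpy as np

def centroid_index(A, B):
    """CI between two centroid solutions A and B (arrays of shape (k, d)):
    the maximum number of orphan centroids over the two mapping directions."""
    def orphans(src, dst):
        # map each centroid of src onto its nearest centroid in dst
        nearest = ((src[:, None] - dst[None]) ** 2).sum(-1).argmin(axis=1)
        # dst centroids that receive no mapping are orphans
        return len(dst) - len(set(nearest.tolist()))
    return max(orphans(A, B), orphans(B, A))
```

A value of 0 means that each centroid of either solution has a distinct nearest counterpart in the other, i.e., the two solutions agree at the cluster level.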
When partitions (labels) are provided as ground
truth (Fränti & Rezaei, 2016), the Jaccard distance
(Nigro et al., 2023) between two partitions can be
used for mapping and counting the orphans.
Obviously, non-synthetic datasets normally come without ground truth. Nevertheless, a "golden" solution obtained by some sophisticated clustering algorithm may exist, thus allowing the accuracy of a solution achieved by a specific algorithm to be checked through the CI.
3 PB-K-MEANS
The design of Population-Based K-Means (PB-K-Means) proposed in this paper (see Alg. 4) was inspired by Recombinator K-Means (Baldassi, 2020, 2022a). There are, however, key differences between the two algorithms: PB-K-Means rests on the basic g_kmeans++ seeding (see Alg. 3), and its evolutionary iterations are simply realized by repeated K-Means executions, always fed by g_kmeans++.
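For reference, greedy k-means++ seeding (g_kmeans++) draws, at each step, several candidate centroids from the squared-distance (D²) distribution and keeps the one that most reduces the cost. A minimal sketch (the candidate budget s is our assumption, not a value from Alg. 3):

```python
import numpy as np

def g_kmeanspp(X, k, s=3, seed=0):
    """Greedy k-means++ seeding: per step, draw s candidates from the
    D^2 distribution and keep the one minimizing the total squared distance."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    d2 = ((X - C[0]) ** 2).sum(-1)  # squared distance to nearest chosen centroid
    for _ in range(k - 1):
        cand = rng.choice(len(X), size=s, p=d2 / d2.sum())
        # evaluate the cost each candidate would yield and keep the best
        best = min(cand, key=lambda i: np.minimum(d2, ((X - X[i]) ** 2).sum(-1)).sum())
        C.append(X[best])
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(-1))
    return np.array(C, dtype=float)
```

Compared with plain k-means++, the greedy variant trades s extra cost evaluations per step for a more reliable spread of the initial centroids.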
Alg. 4 gives a sketch of the two steps of PB-K-Means, which depend on two parameters: the number of candidate solutions required to initialize the population from the dataset X, and the number of repetitions of K-Means with g_kmeans++ toward the identification of a final solution. The notation run(K-Means, g_kmeans++, X) indicates that K-Means, with g_kmeans++ seeding, is applied to the points of the entire dataset X. In step 2, K-Means with g_kmeans++ is instead applied to the solution points in the population, while the cost is computed on the dataset points just partitioned according to cand (the candidate solution, i.e., a centroid configuration) by K-Means.
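The two steps can be sketched as follows (a minimal sketch, not Alg. 4 itself; the parameter names n_init and n_rep stand for the two unnamed parameters, and all helper implementations are our assumptions):

```python
import numpy as np

def cost(X, C):
    """Sum of squared distances of each point to its nearest centroid."""
    return ((X[:, None] - C[None]) ** 2).sum(-1).min(axis=1).sum()

def lloyd(X, C, iters=5):
    """Lloyd refinement iterations (local optimization)."""
    C = C.copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def g_kmeanspp(X, k, s=3, rng=None):
    """Greedy k-means++ seeding: keep the best of s candidates per step."""
    rng = rng if rng is not None else np.random.default_rng()
    C = [X[rng.integers(len(X))]]
    d2 = ((X - C[0]) ** 2).sum(-1)
    for _ in range(k - 1):
        cand = rng.choice(len(X), size=s, p=d2 / d2.sum())
        best = min(cand, key=lambda i: np.minimum(d2, ((X - X[i]) ** 2).sum(-1)).sum())
        C.append(X[best])
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(-1))
    return np.array(C, dtype=float)

def pb_kmeans(X, k, n_init=5, n_rep=5, seed=0):
    """Two-step PB-K-Means sketch."""
    rng = np.random.default_rng(seed)
    # Step 1: n_init runs of K-Means with g_kmeans++ on the dataset X;
    # their centroids form the population
    pop = np.vstack([lloyd(X, g_kmeanspp(X, k, rng=rng)) for _ in range(n_init)])
    # Step 2: n_rep runs of K-Means with g_kmeans++ on the population points;
    # each candidate's cost is evaluated on the full dataset X
    best, best_cost = None, np.inf
    for _ in range(n_rep):
        cand = lloyd(pop, g_kmeanspp(pop, k, rng=rng))
        c = cost(X, cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost
```

Note how the population plays the role of a condensed dataset: step 2 clusters centroids of centroids, while the final quality is always judged on the original points.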