and partitioning are restored and a new swap is
started.
Owing to its modus operandi, Random Swap naturally explores the whole data space and, with a suitable number of iterations, is capable of finding a solution close to the optimal one in practical cases (Nigro et al., 2023).
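The swap loop recalled above can be illustrated as follows (a minimal sketch, not the authors' implementation; the helper names and the budget of two Lloyd refinement iterations per trial are our assumptions):

```python
import numpy as np

def cost(X, C):
    """Sum of squared distances of each point to its nearest centroid."""
    return ((X[:, None] - C[None]) ** 2).sum(-1).min(axis=1).sum()

def lloyd(X, C, iters=2):
    """A few Lloyd refinement iterations (local optimization)."""
    C = C.copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def random_swap(X, k, trials=50, seed=0):
    """Trial-and-error swaps: keep a swap only if it lowers the cost."""
    rng = np.random.default_rng(seed)
    C = lloyd(X, X[rng.choice(len(X), k, replace=False)].astype(float))
    best = cost(X, C)
    for _ in range(trials):
        trial = C.copy()
        trial[rng.integers(k)] = X[rng.integers(len(X))]  # random swap
        trial = lloyd(X, trial)
        c = cost(X, trial)
        if c < best:
            C, best = trial, c  # accept the improving swap
        # otherwise the previous centroids/partitioning are restored
    return C, best
```

A rejected swap simply discards the trial configuration, which realizes the "restore" step recalled above.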
2.4 Recombinator K-Means
Recombinator K-Means is based on an evolutionary algorithm (Baldassi, 2022a) that maintains a population of individuals (centroid configurations), initialized from the whole dataset. New generations are then created by a recombination mechanism followed by local optimization through K-Means, as in Random Swap, until a convergence criterion is met.
More particularly, at each generation, new centroid configurations (solutions) are first created by executing K-Means a given number of times with g_kmeans++ seeding (in reality, an adapted version of g_kmeans++ with a weighting mechanism is used). Each centroid configuration, refined by a few iterations of Lloyd's K-Means, is kept along with its cost. Finally, the population is updated by retaining only the best solutions, according to their cost, among the previous and newly generated ones.
The weights, initialized as a uniform vector, mirror priorities tied to configuration costs; they are updated after each generation and drive the selection of the next centroid in g_kmeans++.
The net effect of processing a generation is to
(re)combine centroid points of different solutions, so
as to compose new solutions with a lower cost.
The approach ensures that the average cost of the population monotonically decreases as generations advance; the population eventually coalesces to a single solution, which provides a natural stopping criterion.
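A single generation of this scheme can be illustrated as follows (a simplified, illustrative sketch: the weighted seeding and the cost-based priority update shown here only approximate the actual mechanism of Baldassi, 2022a, and all helper names are ours):

```python
import numpy as np

def cost(X, C):
    """Sum of squared distances of each point to its nearest centroid."""
    return ((X[:, None] - C[None]) ** 2).sum(-1).min(axis=1).sum()

def lloyd(X, C, iters=2):
    """A few Lloyd refinement iterations (local optimization)."""
    C = C.copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def weighted_seed(P, w, k, rng):
    """k-means++-style seeding over the population P, biased by weights w."""
    p = w / w.sum()
    idx = [rng.choice(len(P), p=p)]
    for _ in range(k - 1):
        d2 = ((P[:, None] - P[idx][None]) ** 2).sum(-1).min(axis=1)
        q = w * d2
        if q.sum() == 0:  # degenerate case: all mass on chosen points
            q = np.ones(len(P))
        idx.append(rng.choice(len(P), p=q / q.sum()))
    return P[idx].astype(float)

def generation(X, P, w, k, m, rng):
    """One generation: seed m candidates from P, refine them on X,
    keep the best half, and derive new priorities from the costs."""
    cands = sorted(
        ((cost(X, C), C) for C in
         (lloyd(X, weighted_seed(P, w, k, rng)) for _ in range(m))),
        key=lambda t: t[0])
    keep = cands[: max(1, m // 2)]
    newP = np.vstack([C for _, C in keep])        # recombined centroid pool
    costs = np.repeat([c for c, _ in keep], k).astype(float)
    neww = 1.0 / (1.0 + costs - costs.min())      # lower cost -> higher priority
    return newP, neww, keep[0][0]
```

Iterating `generation` shrinks the pool toward the best configurations, which mimics the coalescing behavior described above.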
Recombinator K-Means is implemented in Julia
and has been successfully applied to both synthetic
and non-synthetic datasets.
2.5 Clustering Accuracy Indexes
Besides the internal cost, the quality of a clustering solution can often also be checked by some external measure. An important one is the Centroid Index (CI) proposed in (Fränti et al., 2014), which captures the dissimilarity between two centroid solutions. The use of the CI is effective when, e.g., a synthetic dataset provided with ground-truth information (centroids and/or partitions) is considered.
The CI expresses the degree to which an achieved solution is close to the ground truth. Formally, each centroid of one solution is mapped onto the centroid of the other solution at minimal distance from it. Dissimilarity is then computed as the number of centroids in the target solution ("orphans") onto which no centroid is mapped. In a similar way, the mapping is carried out in the opposite direction and the number of orphans is counted again. The CI value is the maximum number of orphans over the two mapping directions. Of course, a CI value of 0 is a precondition for a "correct" clustering solution, that is, one which is structurally near to the ground truth.
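The two-way orphan count just described can be sketched compactly (an illustrative sketch; the function names are ours):

```python
import numpy as np

def centroid_index(A, B):
    """CI between two centroid solutions A and B (arrays of shape (k, d)):
    the maximum number of orphan centroids over the two mapping directions."""
    def orphans(src, dst):
        # map each centroid of src onto its nearest centroid in dst
        nearest = ((src[:, None] - dst[None]) ** 2).sum(-1).argmin(axis=1)
        # dst centroids that receive no mapping are orphans
        return len(dst) - len(set(nearest.tolist()))
    return max(orphans(A, B), orphans(B, A))
```

A value of 0 means that each centroid of either solution has a distinct nearest counterpart in the other, i.e., the two solutions agree at the cluster level.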
When partitions (labels) are provided as ground
truth (Fränti & Rezaei, 2016), the Jaccard distance
(Nigro et al., 2023) between two partitions can be
used for mapping and counting the orphans.
Obviously, non-synthetic datasets normally come without ground truth. Nevertheless, a "golden" solution obtained by some sophisticated clustering algorithm may exist, thus allowing the accuracy of a solution achieved by a specific algorithm to be checked through the CI.
3 PB-K-MEANS
The design of Population-Based K-Means (PB-K-Means) proposed in this paper (see Alg. 4) was inspired by Recombinator K-Means (Baldassi, 2020, 2022a). There are, however, key differences between the two algorithms: PB-K-Means rests on the basic g_kmeans++ seeding (see Alg. 3), and its evolutionary iterations are simply realized by repeated K-Means executions, always fed by g_kmeans++.
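For reference, greedy k-means++ seeding (g_kmeans++) draws, at each step, several candidate centroids from the squared-distance (D²) distribution and keeps the one that most reduces the cost. A minimal sketch (the candidate budget s is our assumption, not a value from Alg. 3):

```python
import numpy as np

def g_kmeanspp(X, k, s=3, seed=0):
    """Greedy k-means++ seeding: per step, draw s candidates from the
    D^2 distribution and keep the one minimizing the total squared distance."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]
    d2 = ((X - C[0]) ** 2).sum(-1)  # squared distance to nearest chosen centroid
    for _ in range(k - 1):
        cand = rng.choice(len(X), size=s, p=d2 / d2.sum())
        # evaluate the cost each candidate would yield and keep the best
        best = min(cand, key=lambda i: np.minimum(d2, ((X - X[i]) ** 2).sum(-1)).sum())
        C.append(X[best])
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(-1))
    return np.array(C, dtype=float)
```

Compared with plain k-means++, the greedy variant trades s extra cost evaluations per step for a more reliable spread of the initial centroids.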
Alg. 4 gives a sketch of the two steps of PB-K-Means, which depend on two parameters: the number of candidate solutions required to initialize the population from the dataset X, and the number of repetitions of K-Means with g_kmeans++ toward the identification of a final solution. The notation run(K-Means, g_kmeans++, X) indicates that K-Means, with g_kmeans++ seeding, is applied to the points of the entire dataset X. In step 2, K-Means with g_kmeans++ is instead applied to the solution points in the population, while the cost is computed on the dataset points just partitioned according to cand (the candidate solution, i.e., a centroid configuration) by K-Means.
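The two steps can be sketched as follows (a minimal sketch, not Alg. 4 itself; the parameter names n_init and n_rep stand for the two unnamed parameters, and all helper implementations are our assumptions):

```python
import numpy as np

def cost(X, C):
    """Sum of squared distances of each point to its nearest centroid."""
    return ((X[:, None] - C[None]) ** 2).sum(-1).min(axis=1).sum()

def lloyd(X, C, iters=5):
    """Lloyd refinement iterations (local optimization)."""
    C = C.copy()
    for _ in range(iters):
        labels = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(axis=1)
        for j in range(len(C)):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(axis=0)
    return C

def g_kmeanspp(X, k, s=3, rng=None):
    """Greedy k-means++ seeding: keep the best of s candidates per step."""
    rng = rng if rng is not None else np.random.default_rng()
    C = [X[rng.integers(len(X))]]
    d2 = ((X - C[0]) ** 2).sum(-1)
    for _ in range(k - 1):
        cand = rng.choice(len(X), size=s, p=d2 / d2.sum())
        best = min(cand, key=lambda i: np.minimum(d2, ((X - X[i]) ** 2).sum(-1)).sum())
        C.append(X[best])
        d2 = np.minimum(d2, ((X - X[best]) ** 2).sum(-1))
    return np.array(C, dtype=float)

def pb_kmeans(X, k, n_init=5, n_rep=5, seed=0):
    """Two-step PB-K-Means sketch."""
    rng = np.random.default_rng(seed)
    # Step 1: n_init runs of K-Means with g_kmeans++ on the dataset X;
    # their centroids form the population
    pop = np.vstack([lloyd(X, g_kmeanspp(X, k, rng=rng)) for _ in range(n_init)])
    # Step 2: n_rep runs of K-Means with g_kmeans++ on the population points;
    # each candidate's cost is evaluated on the full dataset X
    best, best_cost = None, np.inf
    for _ in range(n_rep):
        cand = lloyd(pop, g_kmeanspp(pop, k, rng=rng))
        c = cost(X, cand)
        if c < best_cost:
            best, best_cost = cand, c
    return best, best_cost
```

Note how the population plays the role of a condensed dataset: step 2 clusters centroids of centroids, while the final quality is always judged on the original points.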