EK-Means: Towards Making Ensemble K-Means Work for Image-Based
Data Analysis Without Prior Knowledge of K
Danping Niu (1,2,3), Yuan Ping (2,3,a), Yujian Liu (2,3), Fanxi Wei (1,3) and Wenhong Wu (1)
(1) School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou, China
(2) School of Information Engineering, Xuchang University, Xuchang, China
(3) Henan Province Engineering Technology Research Center of Big Data Security and Applications, Xuchang, China
(a) Corresponding author: pingyuan@xcu.edu.cn
Keywords:
K-Means, DCAS, Clustering, Cluster Number, Malware Detection.
Abstract:
Despite its widespread application, K-means is significantly constrained by its dependence on prior knowledge of the cluster number K and by its limitations in handling irregular data patterns, which restrict its performance in practical scenarios
such as malware detection. To address these shortcomings, a novel EK-means algorithm is proposed. It in-
troduces a dynamic cluster adaptation strategy (DCAS) to leverage similarity and separation measures in the
pre-clustering phase to enable adaptive splitting and merging of clusters. The continuous refinement of cluster
compactness and centroid representativeness in this approach facilitates the discovery of clusters with arbitrary
shapes and the automatic discovery of the true number of clusters. Experimental results show that EK-means
achieves high clustering accuracy across multiple datasets, including Fashion-MNIST, Virus MNIST, BIG
2015, and Malimg. It notably excels in malware detection tasks, outperforming some existing mainstream
K-means enhancement methods.
1 INTRODUCTION
Cluster analysis groups data points to maximize intra-
cluster similarity and minimize inter-cluster similar-
ity. Traditional methods like K-means and its ex-
tensions are popular for their simplicity and effi-
ciency (Liu et al., 2023). However, they struggle with
determining the optimal number of clusters, K, and
assume spherical, evenly distributed clusters, which
limits their performance on non-spherical or irregular
data distributions (Ikotun et al., 2023).
Current methods address these limitations with
various strategies. Heuristic approaches, based on
empirical rules, are computationally efficient but of-
ten lack consistency and objectivity. Evaluation met-
rics, such as the Silhouette Coefficient (Bagirov et al.,
2023), Dunn Index (Sary et al., 2024), and Davies-
Bouldin Index (Sowan et al., 2023), assess cluster-
ing quality but are sensitive to initial conditions, often
yielding inconsistent results, especially with noisy or
complex data. Hypothesis testing methods, relying
on distributional assumptions (e.g., Gaussian), strug-
gle with data that deviates from these assumptions or
contains mixed structures (Zhao et al., 2008). Com-
mon techniques like the Silhouette Coefficient and information criteria (e.g., AIC/BIC) (Hajihosseinlou et al., 2024) perform well with spherical distributions
but fail to capture the complexities of non-spherical
or irregular data.
To address these challenges, we propose EK-
means, an ensemble K-means method that automat-
ically identifies the optimal number of clusters, en-
hancing performance on non-spherical and irregularly
distributed data. We also introduce DCAS, a strategy
that adapts the clustering process to the data’s dis-
tribution, improving flexibility and robustness. The
main contributions are as follows.
The DCAS is designed to enable the automatic
discovery of the actual number of clusters through
splitting and merging operations.
A similarity measure called the Local Compact-
ness Measure (LCM) is proposed, which is de-
signed to assess intra-cluster similarity and effec-
tively reduce computational complexity.
By conducting a series of experiments, we
demonstrate that EK-means achieves exceptional
accuracy and robustness across multiple image
datasets. Furthermore, it effectively discovers
the true number of clusters in malware detection,
achieving a high accuracy rate.
The paper is structured as follows: Section 2 re-
views advancements in K-means extensions to ad-
dress its limitations; Section 3 presents the EK-means
approach and its implementation; Section 4 provides
experimental results and compares EK-means with
other methods; Section 5 concludes with key findings.
2 RELATED WORK
K-means is a widely used unsupervised clustering
method valued for its efficiency and simplicity. How-
ever, it has several limitations: (1) the number of clus-
ters K must be predefined, which is often difficult in
real-world datasets (José-García and Gómez-Flores, 2016); (2) it assumes spherical clusters, limiting its
performance on irregular or non-spherical data (Daud
et al., 2024); (3) it is sensitive to outliers and noise,
which can distort results (Gan and Ng, 2017); and
(4) its sensitivity to initial centroid placement can
lead to local optima (Ahmed et al., 2020). To over-
come these challenges, various improvements have
been proposed to eliminate the need for predefined K
and to handle non-spherical clusters. The following
sections outline key advancements in these areas.
2.1 Addressing the Pre-Specified
Cluster Number Problem
In K-means clustering, determining the optimal num-
ber of clusters has been a key research challenge.
Early approaches used evaluation metrics like silhou-
ette scores and the elbow method to estimate cluster
counts. Teklehaymanot et al.(Teklehaymanot et al.,
2018) proposed a two-step method that estimates the
number of clusters while analyzing data structure, im-
proving model selection accuracy. However, these
methods are often limited by subjective thresholds
and specific data distributions, reducing their appli-
cability. Later, algorithms were developed to dynam-
ically adjust the number of clusters. For example, X-
means (Pelleg and Moore, 2000) uses BIC to evalu-
ate models with different K values, optimizing cluster
count. Fahim (Fahim, 2021) introduced
a DBSCAN-K-means hybrid, where DBSCAN esti-
mates cluster count and K-means refines intra-cluster
consistency. Yang et al. (Yang and Hussain, 2023) de-
veloped a K-means variant that autonomously identi-
fies the number of clusters. Rykov et al. (Rykov et al.,
2024) extended the elbow method with inertia-based
techniques for better cluster selection. However, these methods still struggle with highly mixed datasets: they often assume normality, incur high computational costs on high-dimensional data, and lack scalability, pointing to the need for further optimization.
2.2 Strategies for Addressing
Non-Spherical Clusters
In K-means clustering, the assumption of spherical
clusters with equal variance limits its performance on
non-spherical or irregular data. Several approaches
have been proposed to improve K-means’ adaptabil-
ity to complex data distributions. Early methods like
DBSCAN (Deng, 2020) use density-based clustering
to detect irregular clusters and remove noise. How-
ever, DBSCAN struggles with varying cluster den-
sities and is sensitive to parameter settings. To ad-
dress this, GriT-DBSCAN (Huang et al., 2023) intro-
duces grid-based partitioning, improving efficiency
for high-dimensional datasets. Morii et al. (Morii and
Kurahashi, 2006) enhanced K-means by splitting and
merging decision regions to improve clustering ac-
curacy. However, these methods incur high compu-
tational costs and rely on selecting appropriate ker-
nels. G-means dynamically adjusts cluster bound-
aries based on Gaussian distribution assumptions and
statistical tests, enabling effective handling of irreg-
ular shapes and automatic determination of the clus-
ter count. The K-Multiple-Means method (Nie et al.,
2019) addresses non-convex clusters by introducing
multiple centroids, but it remains computationally ex-
pensive and sensitive to parameter settings. Despite
these advancements, these methods still struggle with
high-dimensional data and parameter sensitivity, indi-
cating the need for further optimization.
To overcome these limitations, EK-means has un-
dergone several key optimizations:
EK-means automatically determines the cluster
count using an ensemble approach, removing the
need for a pre-specified K. It dynamically refines
the final number of clusters without assuming nor-
mality.
EK-means enhances clustering performance on
complex data distributions without assuming
spherical clusters. By combining similarity and
dissimilarity metrics, it adaptively optimizes the
cluster structure, achieving high accuracy and ro-
bustness on non-spherical datasets.
3 EK-MEANS
In this section, we present the EK-means algorithm.
The process begins with an initial clustering step,
where the data is partitioned into multiple clusters
based on a predefined K value. Next, each cluster
undergoes decomposition using the DCAS strategy
to evaluate whether sub-clusters should be retained.
After completing the decomposition phase, the al-
gorithm enters the merging stage, where DCAS is
again applied to determine whether clusters should
be merged. This iterative process of decomposition
and merging continues until no further changes are
needed. We first describe the adaptive strategy for
cluster handling, then provide a detailed explanation
of the decomposition and merging steps, followed by
an overview of the entire EK-means implementation.
3.1 DCAS
This section introduces the DCAS, focusing on how it
enables adaptive clustering by optimizing both intra-
cluster structure and inter-cluster relationships. As
shown in Figure 1, the algorithm evaluates the prox-
imity between clusters based on the similarity of data
points within them, determining whether to perform
splitting or merging operations. EK-means facilitates
the dynamic decomposition and merging of clusters
by incorporating inter-cluster compactness. The pri-
mary objective is to identify sparse regions within the
dataset and partition the data accordingly. This ap-
proach eliminates the reliance on initial parameters,
enabling real-time adjustments for a more accurate
representation of the data structure.
In our adaptive clustering strategy (Algorithm 1), we consider two clusters with centroids C_1 and C_2, and corresponding data point sets P_1 and P_2. To evaluate cluster compactness, we randomly sample n data points from each cluster, forming subsets S_t = {x_{t1}, x_{t2}, ..., x_{tn}}, where t ∈ {1, 2} is the cluster index. The set T = {n, n_n, n_s} represents the hyperparameters used in EK-means. Here, n is the number of randomly selected data points from a cluster, n_n is the number of nearest neighbors for each point, and n_s is the number of segments between the centroids of the two clusters.
Definition 3.1 (Euclidean Distance). Ed(·, ·) denotes the Euclidean distance. For two data points x and y, their Euclidean distance Ed(x, y) is computed as:

Ed(x, y) = \sqrt{\sum_{k=1}^{nu} (x_k - y_k)^2}   (1)

where x_k and y_k are the coordinates of x and y in the k-th dimension, and nu is the number of dimensions.
Definition 3.2 (Nearest Neighbor). NN_b(x_{tj}) denotes the b-th nearest neighbor of x_{tj}, where x_{tj} represents the j-th randomly selected sample from cluster t.
For each randomly sampled data point x_{tj} ∈ S_t in cluster C_t, we compute the average distance to its nearest neighbors, denoted as D-ANND(t, j), as defined below:
Definition 3.3 (Average Nearest Neighbor Distance). Given a data point x_{tj} and its nearest neighbors, the average nearest neighbor distance D-ANND(t, j) is:

D-ANND(t, j) = \frac{1}{n_n} \sum_{b=1}^{n_n} Ed(x_{tj}, NN_b(x_{tj}))   (2)

where x_{tj} represents the j-th randomly selected sample from cluster t.
Definition 3.4 (LCM). The LCM(·) is the mean of the average distances to the nearest neighbors of randomly sampled points within a cluster, calculated as:

LCM(t) = \frac{1}{n} \sum_{j=1}^{n} D-ANND(t, j)   (3)

where t is the cluster index, and j is the j-th randomly selected point in that cluster.
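For illustration, a minimal Python sketch of Eqs. (2) and (3) is given below; the function name lcm, the representation of a cluster as a NumPy array, and the use of scikit-learn's NearestNeighbors are our own assumptions rather than details reported by the method.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lcm(points, n_sample=5, n_neighbors=5, seed=None):
    """Local Compactness Measure of one cluster: mean of the average
    nearest-neighbor distances (D-ANND, Eq. 2) over n_sample randomly
    chosen points (Eq. 3). `points` is an (m, d) NumPy array."""
    rng = np.random.default_rng(seed)
    n_sample = min(n_sample, len(points))
    sample = points[rng.choice(len(points), size=n_sample, replace=False)]

    # Query n_neighbors + 1 neighbors because each sampled point finds
    # itself at distance zero; that column is dropped below.
    k = min(n_neighbors + 1, len(points))
    nn = NearestNeighbors(n_neighbors=k).fit(points)
    dist, _ = nn.kneighbors(sample)        # shape (n_sample, k)
    d_annd = dist[:, 1:].mean(axis=1)      # Eq. (2) per sampled point
    return float(d_annd.mean())            # Eq. (3)
```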
The LCM captures the local density of points
within a cluster, providing a better reflection of re-
gional structure compared to global metrics like the
within-cluster sum of squares (WCSS). By evaluat-
ing the neighborhood distances of randomly sampled
points, we reduce computational complexity, ensur-
ing efficiency for large datasets. The sample size is
typically set to 5 points based on the total number of
data points in the cluster. This measure is key to un-
derstanding the local structure within each cluster and
evaluating the connectivity between clusters. After
calculating the compactness measure LCM, we divide
the line segment between the centroids C_1 and C_2 into n_s equal segments.
Definition 3.5 (Position of P_a for Cluster Separation). The position P_a of a point along the line segment between the centroids C_1 and C_2 is defined as:

P_a = C_1 + a \cdot \frac{C_2 - C_1}{n_s}, \quad a = 1, 2, \ldots, n_s - 1   (4)

where a denotes the index of the point along the segment, n_s is the total number of segments, and C_2 - C_1 is the vector from C_1 to C_2, whose norm |C_1 C_2| is the distance between the centroids.
For each point P_a, a circle with radius d is drawn, where d is set to the larger of the two LCM values (see Algorithm 1 and Figure 1). If all circles contain a data point, the clusters are connected and merged; if any circle is empty, the clusters remain separate.
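The separation test itself can be sketched as follows, under the same NumPy representation; dcas_connected is a hypothetical helper name, and the radius is passed in explicitly (DCAS sets it to the larger LCM value).

```python
import numpy as np

def dcas_connected(c1, c2, points, radius, n_segments):
    """Probe-circle test from Definition 3.5: place n_segments - 1 equally
    spaced points P_a on the segment from centroid c1 to centroid c2 and
    check whether every circle of the given radius around them contains at
    least one point of `points`. True means the clusters look connected."""
    points = np.asarray(points)
    for a in range(1, n_segments):
        p_a = c1 + a * (c2 - c1) / n_segments              # Eq. (4)
        if not np.any(np.linalg.norm(points - p_a, axis=1) <= radius):
            return False                                   # empty circle: sparse gap
    return True
```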
3.2 Adaptive Decomposition and
Merging
We propose a clustering strategy based on inter-
cluster similarity and separability, designed to better
explore cluster structures and data point distribution.
This approach improves clustering accuracy and data
representation effectiveness.
Figure 1: Adaptive Cluster Decomposition and Merging ((a) cluster decomposition, (b) cluster merging): Random data points from each cluster are selected to calculate the average distance to nearest neighbors, yielding LCM(1) and LCM(2) for clusters C_1 and C_2. The larger LCM value serves as the partition threshold. A line AB is drawn between the centroids, and we check if any data points lie within a circle centered at the partition point on AB. This decides whether to merge or split the clusters.
Algorithm 1: DCAS.
    Input: C_1, C_2, P_1, P_2, T
    Output: flag_m
 1  Randomly select n points: S_t ← {x_{t1}, x_{t2}, ..., x_{tn}} from P_t, where t ∈ {1, 2}
 2  for each set S_t do
 3      LCM(t) ← (1/n) Σ_{j=1}^{n} D-ANND(t, j)
 4  LCM ← max(LCM(1), LCM(2))
 5  Calculate P_a ← C_1 + a · (C_2 − C_1)/n_s, a = 1, 2, ..., n_s − 1
 6  if Circle(P_a, LCM) is not empty for every a then
 7      flag_m ← True
 8  else
 9      flag_m ← False
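Putting the pieces together, a minimal sketch of the DCAS decision in Algorithm 1, reusing the hypothetical lcm and dcas_connected helpers above and the hyperparameter tuple T = (n, n_n, n_s); treating the circles as filled by points from either cluster is our own reading of step 6.

```python
import numpy as np

def dcas(c1, c2, p1, p2, hyper, seed=None):
    """DCAS sketch (Algorithm 1): returns flag_m, True when every probe
    circle between the two centroids is occupied (clusters connected,
    merge candidate), False when a sparse gap separates them."""
    n, n_n, n_s = hyper                                    # T = {n, n_n, n_s}
    radius = max(lcm(p1, n_sample=n, n_neighbors=n_n, seed=seed),
                 lcm(p2, n_sample=n, n_neighbors=n_n, seed=seed))
    both = np.vstack([p1, p2])     # assumption: circles may be filled by either cluster
    return dcas_connected(c1, c2, both, radius, n_segments=n_s)
```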
In contrast to SMKM (Capó et al., 2022), EK-
means allows cluster splitting when the DCAS split-
ting criteria are satisfied and merging according to
the DCAS merging criteria. Our method dynamically
adjusts the value of K until the true number of clus-
ters is discovered. In comparison, SMKM determines
whether to split a cluster based on the reduction in
error from adding centroids, splitting only the clus-
ter that results in the maximum error reduction in
each iteration. SMKM performs only one split and
merge per iteration, maintaining a constant value of
K throughout the process.
3.2.1 Adaptive Cluster Decomposition
The adaptive cluster decomposition aims to uncover
substructures by iteratively splitting clusters, improv-
ing classification accuracy. The process is shown in
Algorithm 2. In this phase, we enhance cluster com-
pactness by applying the 2-means algorithm to each
cluster P_i, dividing it into two sub-clusters P_i^1 and P_i^2. The connectivity of these sub-clusters is evaluated using an adaptive clustering strategy. If any circle (centered at a probe point between the two sub-cluster centroids) is empty, it indicates spatial separation between the sub-clusters. This suggests that decomposition improves cluster compactness and classification performance. The variable flag_EK tracks whether a split was performed during the iteration.

Algorithm 2: Adaptive Cluster Decomposition (ACD).
    Input: Set of centroids C = {c_1, ..., c_K} and its corresponding clustering P = {P_1, ..., P_K}, T
    Output: C′, P′, flag_EK
 1  flag_EK ← False
 2  for each cluster P_i in P do
 3      Apply 2-means clustering:
 4      {P_i^1, P_i^2} ← P_i
 5      {c_i^1, c_i^2} ← c_i
 6  C′ ← ∅
 7  P′ ← ∅
 8  for each cluster P_i in P do
 9      flag_m ← DCAS(c_i^1, c_i^2, P_i^1, P_i^2)
10      if flag_m is False then
11          C′ ← C′ ∪ {c_i^1, c_i^2}
12          P′ ← P′ ∪ {P_i^1, P_i^2}
13          flag_EK ← True
14      else
15          C′ ← C′ ∪ {c_i}
16          P′ ← P′ ∪ {P_i}
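Under the same assumptions, one ACD pass can be sketched as follows, with scikit-learn's KMeans standing in for the 2-means step; the minimum cluster sizes are our own guards, not thresholds from the method.

```python
from sklearn.cluster import KMeans

def adaptive_decompose(centroids, clusters, hyper):
    """One ACD pass (Algorithm 2 sketch): 2-means split of every cluster,
    kept only when DCAS reports the sub-clusters as disconnected."""
    new_c, new_p, flag_ek = [], [], False
    for c_i, p_i in zip(centroids, clusters):
        if len(p_i) >= 4:                                  # enough points to split
            km = KMeans(n_clusters=2, init="k-means++", n_init=10).fit(p_i)
            sub = [p_i[km.labels_ == t] for t in (0, 1)]
            if min(len(s) for s in sub) >= 2 and not dcas(
                    km.cluster_centers_[0], km.cluster_centers_[1],
                    sub[0], sub[1], hyper):
                new_c.extend(km.cluster_centers_)          # sparse gap: keep the split
                new_p.extend(sub)
                flag_ek = True
                continue
        new_c.append(c_i)                                  # otherwise keep the cluster
        new_p.append(p_i)
    return new_c, new_p, flag_ek
```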
3.2.2 Adaptive Cluster Merging
The cluster merging operation enhances compactness
and accuracy by identifying structural relationships
between clusters. In this phase, clusters that are close and structurally similar are merged to optimize the overall cluster shape and centroid representativeness. The merging process, based on density and spatial relationships, is detailed in Algorithm 3. The flag_EK tracks whether a merge occurred during the iteration, and |C| represents the number of centroids in the centroid set C.

Algorithm 3: Adaptive Cluster Merging (ACM).
    Input: C, P, T, flag_EK
    Output: C′, P′, flag_EK
 1  merge_found ← True
 2  Inter ← 0
 3  while merge_found = True do
 4      merge_found ← False
 5      if Inter = 1 then
 6          C ← C′, P ← P′
 7      Inter ← 1
 8      K_kn = |C|
 9      while merge_found = False do
10          (i, j) ← argmin_{1 ≤ i < j ≤ K_kn} D_ij
11          flag_m ← DCAS(c_i, c_j, P_i, P_j)
12          if flag_m and D_ij ≠ INF then
13              merge_found ← True
14              flag_EK ← True
15              c* ← (|P_i| c_i + |P_j| c_j) / (|P_i| + |P_j|)
16              C′ ← (C \ {c_i, c_j}) ∪ {c*}
17              P_i ← P_i ∪ P_j
18              P′ ← (P \ {P_j}) ∪ {P_i}
19          else
20              D_ij ← INF
Definition 3.6 (Euclidean Distance Between Clusters). The distance between two clusters i and j is the Euclidean distance between their centroids C_i and C_j:

D(i, j) = \|C_i - C_j\|   (5)

where C_i and C_j are the centroids of clusters i and j, and \|\cdot\| denotes the Euclidean norm.
Definition 3.7 (Closest Pair of Clusters). The indices (i*, j*) correspond to the pair of clusters i and j that have the smallest Euclidean distance between their centroids. These indices are defined as:

(i*, j*) = \operatorname{argmin}_{i, j} D(i, j)   (6)

where D(i, j) is the Euclidean distance between the centroids C_i and C_j of clusters i and j, respectively.
The algorithm begins by identifying the most sim-
ilar pair of clusters, minimizing the distance D(i, j)
to find the target for merging. An adaptive strategy
evaluates their connectivity: if every circle along the segment between their centroids contains data points, the clusters are considered connected and should be merged. The centroid of the
merged cluster is calculated as a weighted average of
the original centroids, with weights based on the num-
ber of data points in each cluster. The algorithm then
iterates, checking for the closest pair of clusters un-
til no further merges are possible. This merging pro-
cess prioritizes intra-cluster density and spatial rela-
tionships. By using LCM and assessing spatial split
points, the algorithm ensures merging only occurs
when sufficient connectivity exists, avoiding unnec-
essary merges. The process continues until no more
clusters can be merged, improving clustering quality,
representativeness, and compactness.
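A sketch of this merging pass under the same assumptions is given below; the banned set of pairs plays the role of setting D_ij to INF in Algorithm 3, and the weighted centroid update follows line 15.

```python
import numpy as np

def adaptive_merge(centroids, clusters, hyper):
    """ACM sketch (Algorithm 3): repeatedly take the closest centroid pair,
    merge it when DCAS says the clusters are connected, otherwise mark the
    pair as unmergeable and try the next closest, until nothing merges."""
    centroids = [np.asarray(c, dtype=float) for c in centroids]
    merged_something = True
    while merged_something:
        merged_something = False
        banned = set()
        while True:
            k = len(centroids)
            pairs = [(np.linalg.norm(centroids[i] - centroids[j]), i, j)
                     for i in range(k) for j in range(i + 1, k)
                     if (i, j) not in banned]
            if not pairs:
                break
            _, i, j = min(pairs)                           # closest remaining pair
            if dcas(centroids[i], centroids[j], clusters[i], clusters[j], hyper):
                ni, nj = len(clusters[i]), len(clusters[j])
                c_star = (ni * centroids[i] + nj * centroids[j]) / (ni + nj)
                p_star = np.vstack([clusters[i], clusters[j]])
                centroids = [c for t, c in enumerate(centroids) if t not in (i, j)] + [c_star]
                clusters = [p for t, p in enumerate(clusters) if t not in (i, j)] + [p_star]
                merged_something = True
                break                                      # restart with the merged set
            banned.add((i, j))                             # D_ij <- INF in Algorithm 3 terms
    return centroids, clusters
```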
3.3 EK-Means Implementation
The EK-means method dynamically optimizes clus-
ters through adaptive strategies, including cluster
splitting and merging operations. The overall imple-
mentation steps are outlined in Algorithm 4.
The EK-means method begins with pre-clustering
using the K-means++ algorithm (Arthur and Vassil-
vitskii, 2007) with an initial number of clusters K_start to obtain the initial centroid set C and cluster set P.
This forms the basis for further adaptive adjustments.
Next, the algorithm proceeds with cluster decompo-
sition and merging. Initially, each cluster is itera-
tively split into two sub-clusters using the 2-means
algorithm. The connectivity of these sub-clusters is
then assessed through adaptive strategies to determine
whether they should be retained. After all clusters
have been split, the merging operation begins. During
merging, the closest pair of clusters is identified, and
their connectivity is evaluated. If the merging criteria
are not met, the next closest pair is considered. When
merging conditions are satisfied, the clusters are com-
bined, and the centroid set C is updated. The pro-
cess repeats until no pairs of clusters meet the merg-
ing conditions. After merging, the splitting and merg-
ing steps continue until the stopping criteria are met,
such as when no further changes occur, or the max-
imum number of iterations is reached. Finally, the
EK-means algorithm outputs the optimized centroid
set C and cluster set P.
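For reference, this outer loop can be sketched as follows (mirroring Algorithm 4 below and reusing the hypothetical helpers sketched earlier); the stopping test is a paraphrase of the flag_EK/max_iter condition rather than a literal transcription.

```python
from sklearn.cluster import KMeans

def ek_means(X, k_start, hyper, max_iter=20):
    """EK-means sketch: K-means++ pre-clustering, then alternating ACD
    splits and ACM merges until a pass changes nothing or the iteration
    budget runs out. Returns the final centroids and clusters."""
    km = KMeans(n_clusters=k_start, init="k-means++", n_init=10).fit(X)
    centroids = list(km.cluster_centers_)
    clusters = [X[km.labels_ == t] for t in range(k_start)]
    for _ in range(max_iter):
        centroids, clusters, split_happened = adaptive_decompose(centroids, clusters, hyper)
        k_before = len(centroids)
        centroids, clusters = adaptive_merge(centroids, clusters, hyper)
        if not split_happened and len(centroids) == k_before:
            break                                          # no split and no merge: stable
    return centroids, clusters
```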
Algorithm 4: EK-means.
    Input: Dataset D, initial number of clusters K_start, flag_EK, max_iter, T
    Output: C, P
 1  Pre-clustering: C, P ← K-means++(D, K_start)
 2  i_iter ← 0
 3  Adaptive Strategy:
 4  while flag_EK or i_iter < max_iter do
 5      i_iter ← i_iter + 1
 6      C′, P′, flag_EK ← ACD(C, P, T)
 7      C, P, flag_EK ← ACM(C′, P′, T, flag_EK)

4 EXPERIMENTS
We conducted experiments on several image datasets to evaluate EK-means' capability in category discovery and clustering performance. The section begins with a description of the experimental setup and datasets, followed by an analysis of the algorithm's stability. Finally, we assess EK-means' performance in malware analysis tasks.
4.1 Experimental Setup
The experiments consist of three main assessments:
Stability Analysis: This experiment evaluates the
clustering accuracy and stability of EK-means us-
ing the Fashion MNIST dataset.
Malware Analysis: This experiment evaluates
the performance of EK-means in malware detec-
tion using the Virus MNIST (Noever and No-
ever, 2021), BIG 2015 (Ronen, 2018), and Mal-
img (Nataraj et al., 2011) datasets.
To compare with EK-means, we selected K-
means++, X-means, CDKM (Nie et al., 2022), and
SMKM. K-means++, CDKM, and SMKM require
a predefined number of clusters, while X-means
and EK-means can autonomously determine the final
cluster count based on the initial K.
Our experiments assess clustering accuracy and
the ability to identify true categories using clustering
accuracy as the evaluation metric. Clustering accu-
racy evaluates the consistency between predicted and
true labels. The formula for clustering accuracy is
given by:
\text{Accuracy} = \frac{\text{Correctly identified classes}}{\text{Total number of classes}} \times 100   (7)
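Equation (7) leaves the matching between predicted clusters and true classes implicit; a common choice, assumed here, is an optimal one-to-one matching computed with the Hungarian algorithm:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy: build the contingency table between predicted
    clusters and true classes, match them optimally (Hungarian algorithm),
    and report the matched fraction as a percentage."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    w = np.zeros((y_pred.max() + 1, y_true.max() + 1), dtype=np.int64)
    for p, t in zip(y_pred, y_true):
        w[p, t] += 1
    rows, cols = linear_sum_assignment(-w)                 # maximize matched counts
    return 100.0 * w[rows, cols].sum() / len(y_true)
```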
All experiments were performed on a machine
running Windows 11 with a 3.20 GHz CPU and 128
GB RAM.
4.2 Experimental Datasets
To evaluate the algorithm’s performance, we selected
four representative datasets: Fashion MNIST, Virus
MNIST, BIG 2015, and Malimg, covering tasks like
image clustering and malware analysis. Table 1
presents key statistics for each dataset. Feature ex-
traction is performed using the pre-trained ResNet18
model (He et al., 2015), where images are resized,
normalized, and converted into n-dimensional feature
vectors for analysis.
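The exact preprocessing is not spelled out here, so the following is only a plausible sketch of a ResNet18 feature extractor built with torchvision (recent API); the input size, normalization constants, and the further reduction to the low dimensionalities listed in Table 1 are assumptions.

```python
import torch
import torchvision.transforms as T
from torchvision.models import resnet18

# Assumed preprocessing; sizes and normalization constants are ImageNet defaults.
preprocess = T.Compose([
    T.Resize((224, 224)),
    T.Grayscale(num_output_channels=3),       # malware/Fashion-MNIST images are grayscale
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

model = resnet18(weights="IMAGENET1K_V1")
model.fc = torch.nn.Identity()                # drop the classifier: 512-d features
model.eval()

@torch.no_grad()
def extract_features(pil_images):
    """Map a list of PIL images to an (N, 512) feature matrix; a further
    reduction (e.g. PCA) would be needed to reach the dimensionalities
    reported in Table 1."""
    batch = torch.stack([preprocess(img) for img in pil_images])
    return model(batch).numpy()
```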
4.3 Stability Analysis
This section evaluates the stability of EK-means using
the Fashion MNIST dataset, focusing on two aspects:
(1) the consistency of accuracy and final cluster num-
ber K across multiple experiments, and (2) the effect
of different initial K values on the final cluster count
and accuracy.
As shown in Figure 2a, with an initial K = 16,
the final cluster number obtained by EK-means re-
mains close to the actual class count (10) across re-
peated tests, indicating its ability to capture the in-
trinsic structure of the data effectively. In contrast,
X-means exhibits more significant fluctuations in K,
with final values consistently above 25, deviating sig-
nificantly from the ground truth. Figure 2b presents
the accuracy of both algorithms under the same initial
K. EK-means maintains a high accuracy above 0.9
with minimal variance, significantly outperforming
X-means, whose accuracy remains below 0.5 with no-
ticeable instability. These results demonstrate the ro-
bustness and consistency of EK-means over multiple
trials. Furthermore, Figure 3a and 3b explore the in-
fluence of varying initial K values. In Figure 3a, EK-
means shows minimal variation in the final K, which
consistently approximates the true number of classes.
X-means produces significantly fluctuating K values,
often far exceeding the ground truth. Figure 3b plots
the clustering accuracy against the initial K. EK-
means achieves stable, high accuracy across different
initial K values, while X-means exhibits larger varia-
tions, with accuracy consistently below 0.5. In sum-
mary, EK-means demonstrates strong robustness and
adaptability, achieving consistent performance across
multiple trials and under varying initial conditions.
4.4 Malware Analysis
We conducted experiments on the Virus MNIST, BIG
2015, and Malimg datasets to evaluate the effective-
ness of EK-means in malware analysis. Figure 4 illus-
trates the clustering performance of different meth-
ods under varying initial K values, including the fi-
nal number of clusters obtained by EK-means and X-
means.
Table 1: Description of the benchmark datasets.

Datasets        Dims   Size    # of Classes   Task description
Fashion MNIST   10     70000   10             Image clustering
Virus MNIST     10     50000   10             Malware analysis
BIG 2015        10     10868   9              Malware analysis
Malimg          25     9339    25             Malware analysis
Figure 2: Comparison of EK-means and X-means clustering performance with initial K = 20. (a) Variation of final K values; (b) variation of accuracy results.
Figure 3: Analysis of clustering results with varying initial K values. (a) Final K values with varying initial K; (b) accuracy results with varying initial K.
Figure 4: Malware analysis. Panels: (a), (b) Virus-MNIST; (c), (d) BIG 2015; (e), (f) Malimg.
Virus-MNIST. In the Virus MNIST dataset, when
the initial K deviates from the actual number of
clusters, the accuracy of K-means++, CDKM, and
SMKM drops significantly, with optimal performance
only when K matches the true number of categories.
In contrast, EK-means maintains high accuracy across
various initial K values, with the final K value remain-
ing close to the actual one, demonstrating its robust-
ness. On the other hand, X-means exhibits substan-
tial fluctuations in the final K value, often exceeding
the actual number of categories. This results in a no-
ticeable drop in accuracy, indicating its limitations in
identifying actual categories in malware datasets.
BIG 2015. On the BIG 2015 dataset, EK-means
performed exceptionally well. As shown in Fig-
ure 4(d), when the initial K exceeds 12, the final K
value is close to the true number of categories, significantly improving clustering accuracy. This improve-
ment is attributed to EK-means’ strategy of only split-
ting clusters that meet certain criteria, ensuring more
effective partitioning. When the initial K is too low,
each cluster may contain multiple categories, lead-
ing to poor partitioning and increased computational
complexity. Conversely, a larger initial K value en-
ables finer partitioning from the start, improving clus-
ter purity and the accuracy of subsequent splits.
Malimg. On the Malimg dataset, as the initial K
value increases, the accuracy of EK-means improves,
similar to the findings on the BIG 2015 dataset, which
also exhibits significant class imbalance. A larger ini-
tial K value proves crucial for improving the cluster-
ing effectiveness of EK-means when there are sub-
stantial differences between categories. With smaller
initial K values, clusters often contain a mix of cate-
gories, and the disparities in class sizes make it diffi-
cult to differentiate rare categories effectively, leading
to a final K value smaller than the true number of cat-
egories. To improve EK-means’ performance on the
Malimg dataset, using a larger initial K value is rec-
ommended. Additionally, K-means++, CDKM, and
SMKM maintain relatively stable accuracy, with mis-
classification having minimal impact on overall accu-
racy even when the K value exceeds the actual number
of categories due to class imbalance.
We employed (batch) MMRS sampling (John-
son et al., 1990) to select data points for calculat-
ing intra-cluster compactness, addressing the class
imbalance issue, and ensuring the accuracy of the
compactness measure. Overall, EK-means outper-
forms other methods in malware analysis, demon-
strating its robustness and accuracy in handling com-
plex datasets. Through its adaptive clustering strat-
egy, EK-means effectively identifies actual categories
and adapts to varying data distributions, establishing
itself as a powerful tool in malware analysis.
5 CONCLUSIONS
The traditional K-means is widely used for various
clustering tasks due to its simplicity, computational
efficiency, ease of implementation, and scalability.
However, it struggles with automatically discovering
the true number of clusters and is ineffective in han-
dling non-spherical and irregularly distributed clus-
ters. To address these issues, we propose a novel
method, EK-means. By incorporating DCAS and
LCM, EK-means enables the automatic decomposi-
tion and merging of clusters, effectively overcoming
these challenges. Experimental results demonstrate
that the method discovers the true number of clusters
in irregular datasets and performs excellently in mal-
ware detection tasks. However, EK-means still ex-
hibits certain limitations when dealing with categories
with significant substructures. Future research will
aim to improve methods for handling complex cat-
egories, enhance computational efficiency, improve
adaptability to heterogeneous data, and optimize the
algorithm’s applicability and performance.
ACKNOWLEDGEMENTS
This work is supported by the National Natu-
ral Science Foundation of China under Grant no.
62162009, the Key Technologies R&D Program of
He’nan Province under Grant No. 242102211065,
Postgraduate Education Reform and Quality Im-
provement Project of Henan Province under Grant
Nos. YJS2024AL112 and YJS2024JD38, the Inno-
vation Scientists and Technicians Troop Construc-
tion Projects of Henan Province under Grant No.
CXTD2017099, and the Scientific Research Innova-
tion Team of Xuchang University under Grant No.
2022CXTD003.
REFERENCES
Ahmed, M., Seraj, R., and Islam, S. M. S. (2020). The
k-means algorithm: A comprehensive survey and per-
formance evaluation. Electronics, 9(8):1295.
Arthur, D. and Vassilvitskii, S. (2007). k-means++: the ad-
vantages of careful seeding. In ACM-SIAM Sympo-
sium on Discrete Algorithms.
Bagirov, A. M., Aliguliyev, R. M., and Sultanova, N.
(2023). Finding compact and well-separated clus-
ters: Clustering using silhouette coefficients. Pattern
Recognition, 135:109144.
Capó, M., Martínez, A. P., and Lozano, J. A. (2022). An efficient split-merge re-start for the k-means algorithm. IEEE Trans. Knowl. Data Eng., 34:1618–1627.
Daud, H. B., binti Zainuddin, N., Sokkalingam, R., Museeb,
A., Inayat, A., et al. (2024). Addressing limitations
of the k-means clustering algorithm: outliers, non-
spherical data, and optimal cluster selection. AIMS
Mathematics, 9(9):25070–25097.
Deng, D. (2020). Dbscan clustering algorithm based on
density. In 2020 7th International Forum on Electri-
cal Engineering and Automation (IFEEA), pages 949–
953.
Fahim, A. (2021). K and starting means for k-means algo-
rithm. Journal of Computational Science, 55:101445.
Gan, G. and Ng, M. K.-P. (2017). K-means clustering with
outlier removal. Pattern Recognition Letters, 90:8–14.
Hajihosseinlou, M., Maghsoudi, A., and Ghezelbash, R.
(2024). A comprehensive evaluation of optics, gmm
and k-means clustering methodologies for geochem-
ical anomaly detection connected with sample catch-
ment basins. Geochemistry, page 126094.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep resid-
ual learning for image recognition. 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Huang, X., Ma, T., Liu, C., and Liu, S. (2023). Grit-dbscan:
A spatial clustering algorithm for very large databases.
Pattern Recognition, 142:109658.
Ikotun, A. M., Ezugwu, A. E., Abualigah, L., Abuhaija,
B., and Heming, J. (2023). K-means clustering al-
gorithms: A comprehensive review, variants analysis,
and advances in the era of big data. Information Sci-
ences, 622:178–210.
Johnson, M. E., Moore, L. M., and Ylvisaker, D. (1990).
Minimax and maximin distance designs. Journal of
statistical planning and inference, 26(2):131–148.
José-García, A. and Gómez-Flores, W. (2016). Automatic clustering using nature-inspired metaheuristics: A survey. Applied Soft Computing, 41:192–213.
Liu, H., Chen, J., Dy, J., and Fu, Y. (2023). Transform-
ing complex problems into k-means solutions. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 45(7):9149–9168.
Morii, F. and Kurahashi, K. (2006). Clustering by the k-
means algorithm using a split and merge procedure.
In SCIS & ISIS SCIS & ISIS 2006, pages 1767–1770.
Japan Society for Fuzzy Theory and Intelligent Infor-
matics.
Nataraj, L., Karthikeyan, S., Jacob, G., and Manjunath,
B. S. (2011). Malware images: visualization and au-
tomatic classification. In Visualization for Computer
Security.
Nie, F., Wang, C.-L., and Li, X. (2019). K-multiple-means:
A multiple-means clustering method with specified k
clusters. In Proceedings of the 25th ACM SIGKDD
international conference on knowledge discovery &
data mining, pages 959–967.
Nie, F., Xue, J., Wu, D., Wang, R., Li, H., and Li, X.
(2022). Coordinate descent method for k-means.
IEEE Transactions on Pattern Analysis and Machine
Intelligence, 44(5):2371–2385.
Noever, D. A. and Noever, S. E. M. (2021). Virus-mnist: A
benchmark malware dataset. ArXiv, abs/2103.00602.
Pelleg, D. and Moore, A. W. (2000). X-means: Extend-
ing k-means with efficient estimation of the number
of clusters. In Proceedings of the Seventeenth In-
ternational Conference on Machine Learning, ICML
’00, page 727–734, San Francisco, CA, USA. Morgan
Kaufmann Publishers Inc.
Ronen, R. (2018). Microsoft malware classification chal-
lenge. arXiv preprint arXiv:1802.10135.
Rykov, A., de Amorim, R. C., Makarenkov, V., and Mirkin,
B. (2024). Inertia-based indices to determine the num-
ber of clusters in k-means: An experimental evalua-
tion. IEEE Access, 12:11761–11773.
Sary, R. A., Satyahadewi, N., and Andani, W. (2024). Ap-
plication of k-means++ with dunn index validation of
grouping west kalimantan region based on crime vul-
nerability. BAREKENG: Jurnal Ilmu Matematika dan
Terapan, 18(4):2283–2292.
Sowan, B., Hong, T.-P., Al-Qerem, A., Alauthman, M., and
Matar, N. (2023). Ensembling validation indices to
estimate the optimal number of clusters. Applied In-
telligence, 53(9):9933–9957.
Teklehaymanot, F. K., Muma, M., and Zoubir, A. M.
(2018). Bayesian cluster enumeration criterion for
unsupervised learning. IEEE Transactions on Signal
Processing, 66(20):5392–5406.
Yang, M.-S. and Hussain, I. (2023). Unsupervised multi-
view k-means clustering algorithm. IEEE Access,
11:13574–13593.
Zhao, Z., Guo, S., Xu, Q., and Ban, T. (2008). G-means: a
clustering algorithm for intrusion detection. In Inter-
national Conference on Neural Information Process-
ing, pages 563–570. Springer.