4 CONCLUSIONS
In this paper, we proposed CSAI, a new cluster
evaluation index for dealing with the validity and
robustness of clustering solutions. This approach is
used to assess the quality of clustering solutions and the
reproducibility of data clustering through the concept
of model stability. Motivated by limitations
identified in the literature, CSAI distinguishes itself
by fully leveraging the aggregated feature structures
of clusters rather than focusing on cluster
centroids. For evaluation, we conducted extensive
experiments on four publicly available
datasets to validate the generality and effectiveness of
the proposed index. In our experiments, we
selected widely used clustering algorithms,
including k-means, k-medoids, agglomerative
hierarchical clustering, and Gaussian mixture models.
Our experimental results demonstrate that CSAI is a
promising unified solution that can effectively evaluate
the quality of clustering solutions and their stability.
Notably, CSAI was most effective for agglomerative
hierarchical clustering, where it surpassed the other
indices, which highlights its suitability for
hierarchical clustering. It also remained effective
for the other clustering algorithms.
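As a rough illustration of the model-stability idea summarized above, the following sketch scores how consistently a clustering algorithm partitions overlapping random subsamples of the same data. It is not the CSAI computation: it swaps in the plain Rand index and a minimal k-means written for this example, and the function names (`kmeans`, `rand_index`, `stability`), the subsample fraction, and the toy data are all assumptions made for illustration.

```python
import random
from itertools import combinations


def kmeans(points, k, seed, iters=50):
    """Minimal Lloyd's k-means (illustrative helper); returns one label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; keep it if the cluster is empty.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (needs >= 2 points)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, n_rounds=10, frac=0.8, seed=0):
    """Mean Rand index between clusterings of two overlapping random subsamples."""
    rng = random.Random(seed)
    n = len(points)
    size = int(frac * n)
    scores = []
    for r in range(n_rounds):
        idx_a = sorted(rng.sample(range(n), size))
        idx_b = sorted(rng.sample(range(n), size))
        labels_a = dict(zip(idx_a, kmeans([points[i] for i in idx_a], k, seed=r)))
        labels_b = dict(zip(idx_b, kmeans([points[i] for i in idx_b], k, seed=r + 1000)))
        # Compare the two partitions only on points present in both subsamples.
        shared = sorted(set(idx_a) & set(idx_b))
        scores.append(rand_index([labels_a[i] for i in shared],
                                 [labels_b[i] for i in shared]))
    return sum(scores) / len(scores)


# Toy data (assumed for illustration): two well-separated 2-D blobs.
data_rng = random.Random(42)
blob_a = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(20)]
blob_b = [[data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)] for _ in range(20)]
print(stability(blob_a + blob_b, k=2))
```

A score close to 1 means the partitions are reproduced across subsamples, while lower scores indicate an unstable solution. Because the Rand index compares pairwise co-membership, the arbitrary numbering of clusters in each run does not affect the result.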
As a limitation of this study, CSAI’s efficacy has
been evaluated only on textual datasets, although the
proposed method could potentially be applied to other
domains and data modalities, such as images.
Future research will therefore use
such diverse data types to validate the index’s
versatility and robustness across different domains.
ACKNOWLEDGEMENTS
This research work is funded by SFI MediaFutures
Partners and the Research Council of Norway (Grant
number 309339).
REFERENCES
Bandyopadhyay, S., & Saha, S. (2008). A point
symmetry-based clustering technique for automatic
evolution of clusters. IEEE Transactions on Knowledge
and Data Engineering, 20(11), 1441–1457.
https://doi.org/10.1109/TKDE.2008.79
Clement, C. B., Bierbaum, M., O’Keeffe, K. P., & Alemi,
A. A. (2019). On the Use of ArXiv as a Dataset.
http://arxiv.org/abs/1905.00075
Davies, D. L., & Bouldin, D. W. (1979). A Cluster
Separation Measure. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-1(2), 224–227.
https://doi.org/10.1109/TPAMI.1979.4766909
Dharmarajan, A., & Velmurugan, T. (2013). Applications
of partition based clustering algorithms: A survey. 2013
IEEE International Conference on Computational
Intelligence and Computing Research, IEEE ICCIC
2013. https://doi.org/10.1109/ICCIC.2013.6724235
Duan, X., Ma, Y., Zhou, Y., Huang, H., & Wang, B. (2023).
A novel cluster validity index based on augmented
non-shared nearest neighbors. Expert Systems with
Applications, 223.
https://doi.org/10.1016/j.eswa.2023.119784
Dunn, J. C. (1974). Well-separated clusters and optimal
fuzzy partitions. Journal of Cybernetics, 4(1), 95–104.
https://doi.org/10.1080/01969727408546059
Ezugwu, A. E., Ikotun, A. M., Oyelade, O. O., Abualigah,
L., Agushaka, J. O., Eke, C. I., & Akinyelu, A. A.
(2022). A comprehensive survey of clustering
algorithms: State-of-the-art machine learning
applications, taxonomy, challenges, and future research
prospects. Engineering Applications of Artificial
Intelligence, 110.
https://doi.org/10.1016/j.engappai.2022.104743
Gruppi, M., Horne, B. D., & Adalı, S. (2020).
NELA-GT-2019: A Large Multi-Labelled News Dataset for
The Study of Misinformation in News Articles.
http://arxiv.org/abs/2003.08444
Halim, Z., Waqas, M., & Hussain, S. F. (2015). Clustering
large probabilistic graphs using multi-population
evolutionary algorithm. Information Sciences, 317,
78–95. https://doi.org/10.1016/j.ins.2015.04.043
Halkidi, M., & Vazirgiannis, M. (2001). Clustering validity
assessment: Finding the optimal partitioning of a data
set. Proceedings - IEEE International Conference on
Data Mining, ICDM.
https://doi.org/10.1109/icdm.2001.989517
Hutson, M. (2018). Artificial intelligence faces
reproducibility crisis: Unpublished code and sensitivity
to training conditions make many claims hard to verify.
Science, 359(6377), 725–726.
https://doi.org/10.1126/science.359.6377.725
Jain, A. K., & Dubes, R. C. (1988). Clustering Methods and
Algorithms. In Algorithms for Clustering Data
(pp. 55–142).
http://www.jstor.org/stable/1268876?origin=crossref
Jegatha Deborah, L., Baskaran, R., & Kannan, A. (2010). A
Survey on Internal Validity Measure for Cluster
Validation. International Journal of Computer Science
& Engineering Survey, 1(2), 85–102.
https://doi.org/10.5121/ijcses.2010.1207
Kaufman, L. (1990). Finding groups in data: an
introduction to cluster analysis - Partitioning Around
Medoids (Program PAM). Wiley, Hoboken, 68–125.
Kim, M., & Ramakrishna, R. S. (2005). New indices for
cluster validity assessment. Pattern Recognition
Letters, 26(15), 2353–2363.
https://doi.org/10.1016/j.patrec.2005.04.007