dimension is at least equal to 42000), whereas it becomes advantageous on spherical data from dimension 2000 onwards. In addition, both versions of k-means run much faster on spherical data than on homogeneous data. With identical parameters (number of classes k, dimension d and data size n), the k-means processes applied to homogeneous data compute a significantly higher number of distinct centroids than those applied to spherical data. This difference leads to more iterations on homogeneous data than on spherical data, and thus to slower convergence. Among these distinct centroids, very few are reused when k-means is applied to homogeneous data. On the contrary, they are heavily reused when k-means runs on spherical data, which explains why the extended k-means is much more favourable on these data than on homogeneous data.
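The paper does not reproduce the implementation here; the following is a minimal sketch of how reuse of recurring centroids could be exploited, assuming a cache keyed by exact centroid values (all names, the cache layout and the convergence test are illustrative assumptions, not the paper's exact design):

import numpy as np

def assign_with_cache(X, centroids, cache):
    # Assign each point to its nearest centroid, reusing a cached
    # point-to-centroid distance column whenever a centroid recurs.
    # The cache stands in for the paper's dynamic pre-aggregates.
    dist = np.empty((X.shape[0], centroids.shape[0]))
    for j, c in enumerate(centroids):
        key = c.tobytes()              # exact-value key for a recurring centroid
        if key not in cache:           # compute and store the result once
            cache[key] = np.linalg.norm(X - c, axis=1)
        dist[:, j] = cache[key]        # reuse on every later occurrence
    return dist.argmin(axis=1)

def ekm_sketch(X, k, n_iter=100, seed=0):
    # Hypothetical EKM-style loop: the cache persists across iterations,
    # so centroids that reappear cost nothing to re-evaluate.
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    cache = {}
    for _ in range(n_iter):
        labels = assign_with_cache(X, centroids, cache)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    # len(cache) counts the distinct centroids ever computed: lower means
    # more reuse, which is the behaviour observed on spherical data.
    return centroids, labels, len(cache)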
When EKM is applied several times to the same dataset with different initializations of the centroids, it can be seen that the proportions of reused centroids vary from one EKM execution to another. This observation holds for both data distributions. The initialization of the centroids, coupled with the choice of distribution, therefore influences our approach.
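This sensitivity can be probed with a small driver around the ekm_sketch function above; the dataset, seeds and metric below are illustrative stand-ins, not the paper's experimental setup:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 50))        # stand-in dataset, not the paper's data

for seed in (0, 1, 2, 3):
    _, _, distinct = ekm_sketch(X, k=10, seed=seed)
    # Fewer distinct cached centroids for the same run length
    # means a higher proportion of reuse for that initialization.
    print(f"seed={seed}: {distinct} distinct centroids cached")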
6 CONCLUSION
In this paper, we proposed an approach to accelerate the unsupervised learning algorithm k-means. This approach is based on an algorithm that pre-calculates and stores intermediate results, called dynamic pre-aggregates, to be reused in subsequent iterations. Our experiments compare our extended k-means version with the standard version on two types of data (spherical and homogeneous). We generated 2798 synthetic datasets, reaching up to 62 GB. We demonstrated that our approach is advantageous for partitioning large datasets from dimension 2000 for spherical data and from dimension 42000 for homogeneous data.
Several perspectives are planned. We are currently experimenting with our extended version of k-means on even more massive data, in order to better evaluate the cost of computing the pre-aggregates. In addition, we propose to study situations where it might be more effective to start from an archived centroid corresponding to a class that is almost similar to the newly encountered class, rather than recomputing it entirely simply because the class is not exactly identical.
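One way this perspective could be prototyped is a warm start that seeds a class with the closest archived centroid when it lies within a tolerance; this sketch is entirely hypothetical (the function, the archive structure and the tolerance are our assumptions, not a method from the paper):

import numpy as np

def warm_start_centroid(new_class_mean, archive, tol=0.1):
    # Return the closest archived centroid if it is "almost similar"
    # (within `tol` in Euclidean distance) to the new class estimate;
    # otherwise fall back to recomputing from the fresh estimate.
    if not archive:
        return new_class_mean
    archived = np.array(archive)
    d = np.linalg.norm(archived - new_class_mean, axis=1)
    j = d.argmin()
    return archived[j] if d[j] <= tol else new_class_mean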