4.2 Difficult Datasets
There are cases where points from two different
clusters start forming a single cluster during the early
iterations. For example, the dataset shown in Fig. 6
occasionally generates incorrect clusters depending
on the order of arrival.
Figure 6: Occasional incorrect clusters.
The original algorithm can merge similar clusters,
but what we also need is to separate a cluster into two
clusters when the addition of a new point starts
forming a distinct sub-cluster. This extra step comes
with additional computational complexity.
After a new point is added to a cluster, that cluster
is tested to determine whether it needs to be separated
into two different clusters. For this operation we use
the OptimalNumberofClusters algorithm developed in
(Gokcay et al., 2017) to find the number of clusters in
any dataset; the derivation of the algorithm is not
repeated here. The goal here is not to determine the
number of clusters but to test the minimum point of
the distance plot created by the algorithm against the
threshold T, and to separate the clusters if necessary.
Assuming that the points arrive one at a time, the
running complexity of this step is (
) where q is the current cluster.
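The split test described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `distance_plot` array stands in for the curve produced by the OptimalNumberofClusters algorithm (whose derivation is not repeated here), the direction of the threshold comparison is an assumption, and the largest-gap partition is a hypothetical stand-in for the actual separation step.

```python
import numpy as np

def needs_split(distance_plot, threshold_T):
    # Hypothetical test: distance_plot is assumed to be the curve produced
    # by the OptimalNumberofClusters algorithm for the current cluster q.
    # If its minimum point exceeds the threshold T, the cluster is assumed
    # to contain two sub-clusters (the comparison direction is an assumption).
    return float(np.min(distance_plot)) > threshold_T

def split_cluster(points, axis=0):
    # Stand-in separation step: partition the cluster at the largest gap
    # along one coordinate. The real algorithm would use its own distance
    # measure; this is only for illustration.
    order = np.argsort(points[:, axis])
    gaps = np.diff(points[order, axis])
    cut = int(np.argmax(gaps)) + 1
    return points[order[:cut]], points[order[cut:]]
```

In an incremental setting, `needs_split` would be invoked only on the cluster that just received the new point, which keeps the extra cost confined to the current cluster q.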
5 FUTURE WORK
As this is a position paper, some work remains to
be completed. The threshold calculation needs to be
improved, because in some cases the average
calculation may not be enough to detect the boundary
between clusters. Another improvement is an
incremental version of the data skeleton algorithm to
reduce the storage requirements. The algorithm also
needs to be tested on real datasets: although it
performs well against random arrivals with
nonlinearly separated synthetic clusters, the situation
may be different with more complicated cluster
shapes.
6 CONCLUSIONS
In this paper we have developed a one-pass stream
clustering algorithm in which the clusters are
independent of the arrival order and highly
nonconvex cluster distributions pose no problem. The
distance measure used in the algorithm can handle
nonlinearly separable clusters efficiently. This is not
the case with K-means and its derivatives, since their
distance measure is hyper-ellipsoidal. No assumption
about the possible number of clusters is needed,
whereas many algorithms require this number in
advance. Each new sample point is processed once,
and a snapshot can be taken from the algorithm at any
time since there are no separate on-line and off-line
iterations.
REFERENCES
Alazeez, A. A., Jassim, S., and Du, H., 2017, EDDS: An
Enhanced Density-Based Method for Clustering Data
Streams, in 46th International Conference on Parallel
Processing Workshops (ICPPW), Bristol, 2017, pp.
103-112.
Aggarwal, C. C., et al., 2003, A framework for clustering
evolving data streams, in Proceedings of the 29th
International Conference on Very Large Data Bases,
Volume 29, VLDB Endowment, 2003.
Chen, J., He, H., 2016, A fast density-based data stream
clustering algorithm with cluster centers self-
determined for mixed data, In Information Sciences,
Volume 345, 2016, Pages 271-293
Feng, A. et al., 2006, Density-Based Clustering over an
Evolving Data Stream with Noise, in SDM. Vol. 6.
2006.
Gokcay E., Principe J. C., 2002, Information theoretic
clustering, in IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 24, no. 2, pp. 158-171,
Feb 2002.
Gokcay, E., Karakaya M., Bostan A., 2016, A new
Skeletonization Algorithm for Data Processing in
Cloud Computing, in UBMK-2016, First International
Conference on Computer Science and Engineering,
Çorlu, Turkey, 20-23 Oct, 2016.
Gokcay, E., Karakaya M., Sengul, G., 2017, Optimal
Number of Clusters, in ISEAIA 2017, Fifth
International Symposium on Engineering, Artificial
Intelligence & Applications, Girne, North Cyprus, 1-3
Nov, 2017.
Hassani, M., Spaus, P., Cuzzocrea A., and Seidl,T., 2016,
"I-HASTREAM: Density-Based Hierarchical
Clustering of Big Data Streams and Its Application to
Big Graph Analytics Tools," 2016 16th IEEE/ACM