Figure 9: Clustering result on Aggregation data set.
set. The results for the proposed method were calculated after removing all outliers. With the exception of the FS index, our method performed best with a merging constant of 0.7. With this setting, our method outperformed the Y-means method on all indices except CWB. For the other indices, our method performed similarly to, but slightly worse than, fuzzy c-means with 5 clusters, with the exception of the PBM index, where our method performed significantly better. No values are reported for the RAC method on this data set, as it grouped the entire data set into a single cluster.
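For reference, the sketch below shows how the crisp form of the PBM index can be computed. It is a generic implementation of the published formula with our own function and variable names, not the code used in our experiments; the FS and CWB indices follow the same pattern of combining within-cluster compactness with between-cluster separation.

```python
import numpy as np

def pbm_index(X, labels):
    """Crisp PBM index: higher values indicate a better partition.

    PBM = ((1/K) * (E1/EK) * DK)**2, where E1 is the scatter of all points
    around the global centroid, EK the total within-cluster scatter, and
    DK the largest distance between two cluster centroids.
    """
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    clusters = np.unique(labels)
    K = len(clusters)
    if K < 2:
        # Undefined for a single cluster (e.g. the RAC result above).
        raise ValueError("PBM requires at least two clusters")

    E1 = np.linalg.norm(X - X.mean(axis=0), axis=1).sum()
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    EK = sum(np.linalg.norm(X[labels == c] - centroids[i], axis=1).sum()
             for i, c in enumerate(clusters))
    DK = max(np.linalg.norm(centroids[i] - centroids[j])
             for i in range(K) for j in range(i + 1, K))
    return ((E1 / EK) * DK / K) ** 2
```

The index is undefined for a partition with a single cluster, which is consistent with the absence of values for the RAC result above.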
5 CONCLUSIONS
In this paper, an automatic clustering method based on a heuristic divisive approach has been proposed and implemented. The method is based on the DIANA algorithm, interrupted by a heuristic stopping function. As this process alone generally produces too many clusters, its result is then passed to a merging method. The advantage of this two-phase approach is that, because the splitting and merging phases use different criteria to decide whether data belong in the same cluster, the merged clusters can take non-elliptical shapes. This sets our method apart from the majority of hard clustering methods, in that it handles data that are not linearly separable fairly well.
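To make the two-phase structure concrete, the following sketch outlines a divide-then-merge loop of the kind described above. It is only an illustration under simplified assumptions: the stopping heuristic (a maximum-diameter test), the splitting rule (bisection around the two most dissimilar points), and the merging rule (a single-link threshold scaled by a merging constant) are placeholders, not the exact criteria used by the proposed method.

```python
import numpy as np
from scipy.spatial.distance import cdist

def split_phase(X, min_diameter=1.0):
    """Divisive phase: keep bisecting any cluster whose diameter exceeds a
    threshold (placeholder stopping heuristic), in a DIANA-like top-down way."""
    X = np.asarray(X, dtype=float)
    queue, done = [np.arange(len(X))], []
    while queue:
        idx = queue.pop()
        D = cdist(X[idx], X[idx])
        if len(idx) < 2 or D.max() <= min_diameter:
            done.append(idx)          # stopping heuristic satisfied
            continue
        # Split around the two most mutually distant points, assigning
        # every point to the nearer of the two (simplified splinter step).
        i, j = np.unravel_index(D.argmax(), D.shape)
        near_i = D[:, i] <= D[:, j]
        queue += [idx[near_i], idx[~near_i]]
    return done

def merge_phase(X, clusters, merge_constant=0.7):
    """Merging phase: join any two clusters whose closest points lie within
    merge_constant times the mean nearest-neighbour distance of the data
    (a placeholder single-link criterion, deliberately different from the
    splitting criterion)."""
    X = np.asarray(X, dtype=float)
    D = cdist(X, X)
    np.fill_diagonal(D, np.inf)
    threshold = merge_constant * D.min(axis=1).mean()
    clusters = [list(c) for c in clusters]
    merged = True
    while merged and len(clusters) > 1:
        merged = False
        for u in range(len(clusters)):
            for v in range(u + 1, len(clusters)):
                if cdist(X[clusters[u]], X[clusters[v]]).min() <= threshold:
                    clusters[u] += clusters.pop(v)
                    merged = True
                    break
            if merged:
                break
    return clusters
```

Chaining clusters through their closest points in the merging phase, e.g. `merge_phase(X, split_phase(X))`, is what allows the final clusters to take non-elliptical shapes even though each splitting step is convex.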
Five data sets have been used to evaluate the proposed clustering method. The proposed method was also compared against an automatic hard clustering method, a fuzzy clustering method (for which the number of clusters was provided), and an automatic clustering method based on fuzzy c-means, using multiple cluster validity indices. The proposed method was shown to be roughly as effective as the methods it was compared against when clustering linearly separable data sets, and equivalent or better when clustering non-linearly separable data sets, without ever needing to be given the number of clusters.
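Such a comparison can be reproduced with standard internal validity indices. The snippet below scores two candidate clusterings of a toy non-linearly-separable data set with scikit-learn's built-in indices; the data set, the candidate methods, and these indices are illustrative stand-ins, not the benchmark data sets or the FS, CWB, and PBM indices used in our experiments.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

# A non-linearly-separable toy set standing in for the paper's benchmarks.
X, _ = make_moons(n_samples=400, noise=0.06, random_state=0)

# Candidate clusterings: a centroid-based method and a single-linkage
# hierarchical method, both given the number of clusters for simplicity.
candidates = {
    "k-means (k=2)": KMeans(n_clusters=2, n_init=10, random_state=0),
    "single-link (k=2)": AgglomerativeClustering(n_clusters=2, linkage="single"),
}

for name, model in candidates.items():
    labels = model.fit_predict(X)
    # Higher silhouette / Calinski-Harabasz and lower Davies-Bouldin are
    # conventionally "better", but all three reward compact convex clusters.
    print(f"{name:18s} "
          f"silhouette={silhouette_score(X, labels):.3f}  "
          f"CH={calinski_harabasz_score(X, labels):.1f}  "
          f"DB={davies_bouldin_score(X, labels):.3f}")
```

On data of this kind, the single-link clustering typically recovers the two moons, yet the convexity-oriented indices may still favour k-means, which illustrates the validation difficulty discussed next.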
Work remains to be done in finding more appropriate validation methods for the proposed approach, as the validity indices used fall victim to the same pitfalls as most hard clustering methods when the data set is not linearly separable. The proposed method could also be optimized further and adapted to specific applications.
In conclusion, the proposed clustering method not only identifies a suitable number of clusters, but also produces valid clustering results.
ACKNOWLEDGEMENTS
We gratefully acknowledge the support of the New Brunswick Innovation Foundation (NBIF) through grant RAI 2012-047, awarded to Dr. Nabil Belacel.
REFERENCES
Bezdek, J. C., Ehrlich, R., and Full, W. (1984). FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences, 10(2–3):191–203.
Fisher, R. A. (1936). The use of multiple measurements in
taxonomic problems. Annals of Eugenics, 7(2):179–
188.
Fukuyama, Y. and Sugeno, M. (1989). A new method
of choosing the number of clusters for the fuzzy c-
means method. In Proceedings of Fifth Fuzzy Systems
Symposium, pages 247–250.
Gan, G. (2011). Data Clustering in C++: An Object-
Oriented Approach. Chapman and Hall/CRC.
Gionis, A., Mannila, H., and Tsaparas, P. (2007). Clustering
aggregation. ACM Trans. Knowl. Discov. Data, 1(1).
Guan, Y., Ghorbani, A., and Belacel, N. (2003). Y-means: a clustering method for intrusion detection. In Canadian Conference on Electrical and Computer Engineering (CCECE 2003), volume 2, pages 1083–1086. IEEE.
Jain, A. K., Murty, M. N., and Flynn, P. J. (1999). Data
clustering: a review. ACM Comput. Surv., 31(3):264–
323.
Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
MacNaughton-Smith, P., Williams, W. T., Dale, M. B., and Mockett, L. G. (1964). Dissimilarity analysis: a new technique of hierarchical sub-division. Nature, 202:1034–1035.
Mok, P., Huang, H., Kwok, Y., and Au, J. (2012). A
robust adaptive clustering analysis method for auto-
matic identification of clusters. Pattern Recognition,
45(8):3017–3033.