Means for Finding Meaningful Levels of a Hierarchical Sequence Prior to Performing a Cluster Analysis

David Allen Olsen

Abstract

When the assumptions underlying the standard complete linkage method are unwound, the size of a hierarchical sequence reverts from n levels to n(n-1)/2 + 1 levels, and the time complexity to construct a hierarchical sequence of cluster sets becomes O(n^4). Moreover, the post hoc heuristics for cutting dendrograms are not suitable for finding meaningful cluster sets of an (n(n-1)/2 + 1)-level hierarchical sequence. To overcome these problems for small-n, large-m data sets, the project described in this paper went back more than 60 years to solve a problem that could not be solved then. This paper presents a means for finding meaningful levels of an (n(n-1)/2 + 1)-level hierarchical sequence prior to performing a cluster analysis. Because the meaningful levels are found before the cluster analysis is performed, it is possible to know which cluster sets to construct and to construct only those cluster sets. This paper also shows how increasing the dimensionality of the data points helps reveal inherent structure in noisy data. The means is theoretically validated. Empirical results from four experiments show that finding meaningful levels of a hierarchical sequence is easy and that meaningful cluster sets can have real-world meaning.
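As a rough illustration of the level counts quoted in the abstract, the sketch below compares n levels with n(n-1)/2 + 1 levels for a small data set. It assumes, and the abstract does not state this, that each of the n(n-1)/2 interpoint distances contributes one level and that the extra level is the starting one in which every point is its own cluster. The function names and sample points are hypothetical, and Euclidean distance is used only for the example.

```python
from itertools import combinations
import math


def level_counts(n):
    """Return the two sequence sizes mentioned in the abstract for n points."""
    pairs = n * (n - 1) // 2  # number of interpoint distances
    return {
        "standard_levels": n,       # size quoted for the standard method
        "full_levels": pairs + 1,   # n(n-1)/2 + 1 (assumed: one per distance, plus the start)
    }


def sorted_pair_distances(points):
    """Sorted Euclidean interpoint distances (an illustrative choice of metric)."""
    return sorted(math.dist(p, q) for p, q in combinations(points, 2))


# Hypothetical sample: four points in the plane.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
counts = level_counts(len(points))
distances = sorted_pair_distances(points)
print(counts, len(distances))
```

For four points this gives six interpoint distances, so the longer sequence has seven levels against four for the standard one. The gap grows quadratically with n, which is why knowing in advance which levels are meaningful avoids constructing most of the cluster sets.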



Paper Citation


in Harvard Style

Olsen D. (2014). Means for Finding Meaningful Levels of a Hierarchical Sequence Prior to Performing a Cluster Analysis. In Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO, ISBN 978-989-758-039-0, pages 21-33. DOI: 10.5220/0005040600210033


in Bibtex Style

@conference{icinco14,
author={David Allen Olsen},
title={Means for Finding Meaningful Levels of a Hierarchical Sequence Prior to Performing a Cluster Analysis},
booktitle={Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO},
year={2014},
pages={21-33},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0005040600210033},
isbn={978-989-758-039-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Informatics in Control, Automation and Robotics - Volume 1: ICINCO
TI - Means for Finding Meaningful Levels of a Hierarchical Sequence Prior to Performing a Cluster Analysis
SN - 978-989-758-039-0
AU - Olsen D.
PY - 2014
SP - 21
EP - 33
DO - 10.5220/0005040600210033