Authors:
Dylan Molinié and Kurosh Madani
Affiliation:
LISSI Laboratory EA 3956, Université Paris-Est Créteil, Sénart-FB Institute of Technology, Campus de Sénart, 36-37 Rue Georges Charpak, F-77567 Lieusaint, France
Keyword(s):
Unsupervised Clustering, Parameter Estimation, Cumulative Distributions, Industry 4.0, Cognitive Systems.
Abstract:
Unsupervised clustering consists of blindly gathering unknown data into compact and homogeneous groups; it is one of the very first steps of any Machine Learning approach, whether for Data Mining, Knowledge Extraction, Anomaly Detection or System Modeling. Unfortunately, unsupervised clustering suffers from a major drawback: it requires manually set parameters to perform accurately, one of which is the expected number of clusters. This parameter often determines whether the clusters will faithfully represent the system. The literature offers no universal way to estimate this value; in this paper, we address the problem through a novel approach. We rely on a single, blind clustering, then characterize the resulting clusters by their Empirical Cumulative Distributions, which we compare to one another using the Modified Hausdorff Distance, and finally regroup the clusters by Region Growing, driven by these characteristics. This allows us to rebuild the regions of the feature space: the expected number of clusters is the number of regions found. We apply this methodology to both academic and real industrial data, and show that it provides very good estimates of the number of clusters, regardless of the dataset's complexity or the clustering method used.
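The three steps named in the abstract (Empirical Cumulative Distributions, Modified Hausdorff Distance, Region Growing) can be illustrated with a minimal sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: each cluster is summarized by a one-dimensional sample (for instance point-to-centroid distances), the ECDF is sampled on a fixed `grid` and treated as a set of 2-D points, clusters are merged when their distance is at most a hypothetical threshold `tau`, and the `adjacent` matrix stating which clusters are neighbors is supplied by the caller. The function names `ecdf_curve`, `modified_hausdorff` and `count_regions` are likewise hypothetical.

```python
from bisect import bisect_right
from math import dist as euclid


def ecdf_curve(values, grid):
    """Sample the ECDF of `values` on `grid`; return a list of (x, F(x)) points."""
    ordered = sorted(values)
    n = len(ordered)
    # F(x) is the fraction of samples that are <= x.
    return [(x, bisect_right(ordered, x) / n) for x in grid]


def modified_hausdorff(a, b):
    """Modified Hausdorff Distance between two point sets.

    Each directed term is the mean (not the max) of nearest-neighbor
    distances; the result is the larger of the two directed terms.
    """
    def directed(p, q):
        return sum(min(euclid(u, v) for v in q) for u in p) / len(p)

    return max(directed(a, b), directed(b, a))


def count_regions(dist, adjacent, tau):
    """Grow regions over clusters; return (number of regions, region labels).

    Assumption: cluster j joins the region of cluster i when the two are
    adjacent and their ECDF distance is at most `tau`.
    """
    n = len(dist)
    labels = [-1] * n
    n_regions = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue  # already absorbed by an earlier region
        labels[seed] = n_regions
        stack = [seed]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and adjacent[i][j] and dist[i][j] <= tau:
                    labels[j] = n_regions
                    stack.append(j)
        n_regions += 1
    return n_regions, labels


# Toy usage with made-up samples for three clusters (illustrative data only).
grid = [i / 10 for i in range(11)]
samples = [
    [0.12, 0.21, 0.33, 0.41],
    [0.14, 0.22, 0.31, 0.44],
    [0.62, 0.71, 0.83, 0.94],
]
curves = [ecdf_curve(s, grid) for s in samples]
dist = [[modified_hausdorff(a, b) for b in curves] for a in curves]
adjacent = [[True] * 3 for _ in range(3)]  # assume every pair is neighboring
n_regions, labels = count_regions(dist, adjacent, tau=0.05)
print(n_regions, labels)
```

In this sketch the estimated number of clusters is simply `n_regions`. The threshold `tau` and the adjacency rule strongly influence the outcome, and the abstract does not state how the paper sets them.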