racy for efficiency when the inherent (hierarchical)
structure in a data set is not taxonomic. See (Lance
and Williams, 1967), (Olsen, 2014). The stan-
dard complete linkage method has the following four
weaknesses: First, when clusters are being combined
or a cluster is being subdivided, the standard com-
plete linkage method cannot resolve ties between in-
tercluster distances. Consequently, either one of the
distances is selected arbitrarily or alternative hierar-
chical sequences are constructed, and the results are
no longer deterministic. Second, because the stan-
dard complete linkage method uses intercluster dis-
tances to construct clusters, does not allow clusters
to overlap, and does not allow data points to migrate
between clusters (Lance and Williams, 1967), cluster
sets often are constructed inaccurately. Third, results
obtained from the standard complete linkage method
can depend on which end of a hierarchical sequence
is treated as the beginning. Consequently, the dendro-
grams for agglomerative hierarchical clustering and
divisive hierarchical clustering may be different, and
finding the cause(s) for the difference is both inconve-
nient and time-consuming. Fourth, the standard com-
plete linkage method does not find meaningful levels
or meaningful cluster sets of hierarchical sequences.¹
It still is necessary to construct a dendrogram and de-
termine where and how many times to cut the den-
drogram, and post hoc heuristics are computationally
expensive to run.
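
The first weakness, unresolved ties between intercluster distances, can be illustrated with a minimal sketch. It is not taken from the cited works. The helper names `complete_linkage_distance` and `tied_merges` and the three-point data set are assumptions chosen for illustration.

```python
from itertools import combinations


def complete_linkage_distance(a, b, dist):
    """Complete linkage: the largest pairwise distance between two clusters."""
    return max(dist[i][j] for i in a for j in b)


def tied_merges(clusters, dist):
    """Return the smallest intercluster distance and every pair that attains it."""
    scored = [
        (complete_linkage_distance(a, b, dist), a, b)
        for a, b in combinations(clusters, 2)
    ]
    best = min(score for score, _, _ in scored)
    return best, [(a, b) for score, a, b in scored if score == best]


# Hypothetical data: three equally spaced points on a line.
points = [0.0, 1.0, 2.0]
dist = [[abs(p - q) for q in points] for p in points]
clusters = [(0,), (1,), (2,)]

best, candidates = tied_merges(clusters, dist)
print(best, candidates)
```

Here two candidate merges share the minimum distance, so an implementation must either pick one arbitrarily or branch into alternative hierarchical sequences.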
Because of these weaknesses, it can be difficult
to interpret results obtained from the standard com-
plete linkage method. Consequently, it is underuti-
lized in automation and by intelligent control systems,
including supervisory functions such as fault detec-
tion and diagnosis and adaptation. Cf. (Isermann,
2006). When the standard complete linkage method is
used, stopping criteria often are used in place of post
hoc heuristics. Stopping criteria are predetermined.
If the model upon which they are based is inadequate
or changes, the stopping criteria lose their usefulness.
Moreover, the standard complete linkage method is an
updating method, so it uses information from previ-
ously constructed cluster sets to construct subsequent
cluster sets. It must construct the cluster set for ev-
ery level of an n-level hierarchical sequence until the
stopping criteria are met. See, e.g., (Jain and Dubes,
1988), (Johnson and Wichern, 2002). These cluster
sets must be either materially accurate or, if possible,
amendable for material inaccuracies. See, e.g., U.S.
Patent No. 8,312,395 (defect identification in
semiconductor production; operators must ensure that the
results are 80 to 90 percent accurate). As much as 90
percent of the effort that goes into implementing the
standard complete linkage method is used to develop
stopping criteria or interpret results.

¹ A “meaningful cluster set” refers to a cluster set that
can have real world meaning. Under ideal circumstances,
a “meaningful level” refers to a level of a hierarchical
sequence at which a new configuration of clusters has
finished forming. These definitions appear to be synonymous
for $\frac{n \cdot (n-1)}{2} + 1$-level hierarchical sequences.
The cluster set that is constructed for a meaningful level is
a meaningful cluster set, so these terms are used interchangeably.
Notwithstanding these weaknesses, the standard
complete linkage method is an important clustering
method. The distributions of many real world mea-
surements are bell-shaped, so the standard complete
linkage method has broad applicability. Its simplic-
ity makes it relatively easy to mathematically capture
its properties. Of the standard hierarchical cluster-
ing methods, the standard complete linkage method is
the only method that is invariant to monotonic trans-
formations of the distances between the data points,
that can cluster any kind of attribute, that is not prone
to inversions, and that produces globular or compact
clusters (Johnson and Wichern, 2002), (Everitt et al.,
2011). Moreover, more sophisticated methods show
no clear advantage for many purposes. Thus, the need
exists to bring complete linkage hierarchical cluster-
ing over from the “computational side of things ... to
the system ID/model ID kind of thinking” (Gill, 2011)
as part of closing the loop on cyber-physical systems.
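
The invariance to monotonic transformations mentioned above can be checked on a small example. The following is a minimal sketch of standard agglomerative complete linkage, not the new method of (Olsen, 2014). The function name `complete_linkage_merges` and the four-point data set are assumptions. Ties, if any, would be broken arbitrarily by `min`, and the data are chosen so that none occur.

```python
from itertools import combinations


def complete_linkage_merges(dist):
    """Agglomerate with complete linkage; return the cluster formed at each step."""
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose largest pairwise distance is smallest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: max(dist[i][j] for i in pair[0] for j in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(sorted(a | b))
    return merges


# Hypothetical data: four points on a line with distinct pairwise distances.
points = [0.0, 1.0, 3.0, 7.0]
dist = [[abs(p - q) for q in points] for p in points]
# Squaring is a monotonically increasing transformation of nonnegative distances.
squared = [[d * d for d in row] for row in dist]

original_merges = complete_linkage_merges(dist)
squared_merges = complete_linkage_merges(squared)
print(original_merges)
print(squared_merges == original_merges)
```

Because complete linkage only takes maxima and minima of the distances, any order-preserving transformation leaves the merge sequence unchanged. Only the heights at which merges occur are rescaled.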
For the first part of the project, a new, complete
linkage hierarchical clustering method was devel-
oped. See (Olsen, 2014). The new clustering method
is consonant with the model for a measured value that
scientists and engineers commonly use,² so it
substantially improves upon the accuracy of the standard
complete linkage method. Further, it can construct
cluster sets for select, possibly non-contiguous levels
of an $\frac{n \cdot (n-1)}{2} + 1$-level hierarchical sequence. The new
clustering method was designed with small-n, large-
m data sets in mind, where n is the number of data
points, m is the number of dimensions, and “large”
means thousands and upwards (Murtagh, 2009).³
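
To give a sense of scale, the level count quoted above can be compared with the n levels of the standard method. This sketch assumes that the count arises as one level per pairwise distance plus a starting level. That reading is an interpretation, since the text here gives only the count. `level_count` is a hypothetical helper name.

```python
def level_count(n):
    """Levels in an (n*(n-1)/2 + 1)-level sequence for n data points.

    Assumption: one level per pairwise distance, plus the starting level.
    """
    return n * (n - 1) // 2 + 1


for n in (4, 10, 50):
    # Standard n-level sequence versus the larger sequence.
    print(n, level_count(n))
```

For n = 4, 10 and 50 this gives 7, 46 and 1226 levels. The quadratic growth is one reason the ability to construct cluster sets for select levels only, and the small-n focus, matter.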

² The model for a measured value is measured value =
true value + bias (accuracy) + random error (statistical
uncertainty or precision) (Navidi, 2006). This model has
substantially broader applicability than the taxonomic model
that is the basis for the standard complete linkage method.
³ These data sets are used by many cyber-physical systems
and include time series. For example, a typical automobile
has about 500 sensors; a small, specialty brewery
has about 600 sensors; and a small power plant has about
1100 sensors. The new clustering method may accommodate
large-n, large-m data sets as well, and future work includes
using multicore and/or heterogeneous processors to
parallelize parts of the new clustering method, but large-n,
large-m data sets are not the focus here.
ICINCO 2014 - 11th International Conference on Informatics in Control, Automation and Robotics