and the row indices of the proximity vector are used to index the sorted ordered triples (the “rank order indices”).
Next, the ordered triples are evaluated in ascending order for linkage. As the ordered triples are evaluated, threshold distance (index) $d_0$ increases implicitly from 0 to the maximum of all the distance elements. Threshold distance $d_0 \in \mathbb{R}$ is a continuous variable that determines which pairs of data points in a data set are linked and which are not. Data points $x_i$ and $x_j$, $i, j = 1, 2, \ldots, n$, $i \neq j$, are linked if the distance between them is less than or equal to threshold distance $d_0$, i.e., $d_{i,j} \leq d_0$. From the linkage information that is stored in the state matrix and the degrees of the data points, a hierarchical sequence of cluster sets is constructible.
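The linkage pass described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `ordered_triples` and `link_up_to` are hypothetical, the state matrix is assumed to be a symmetric 0/1 matrix, and the threshold distance is taken to be the distance element of the last triple evaluated.

```python
from itertools import combinations


def ordered_triples(points, p=2.0):
    """Return (distance, i, j) triples for all pairs i < j, sorted ascending."""
    triples = []
    for i, j in combinations(range(len(points)), 2):
        # p-norm of the difference vector between points i and j
        d = sum(abs(a - b) ** p for a, b in zip(points[i], points[j])) ** (1.0 / p)
        triples.append((d, i, j))
    triples.sort()
    return triples


def link_up_to(triples, n, level):
    """Evaluate the first `level` triples for linkage.

    Assumption: the state matrix is a symmetric 0/1 matrix and the
    degree of a point counts the points it is linked to.
    """
    state = [[0] * n for _ in range(n)]
    degrees = [0] * n
    threshold = 0.0  # implicit threshold distance d_0
    for d, i, j in triples[:level]:
        state[i][j] = state[j][i] = 1
        degrees[i] += 1
        degrees[j] += 1
        threshold = d  # d_0 rises to the distance just evaluated
    return state, degrees, threshold


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
triples = ordered_triples(points)
state, degrees, d0 = link_up_to(triples, len(points), level=2)
```

With these four points, the two shortest distances link points 0 and 1 and points 2 and 3, so every point has degree 1 after two triples have been evaluated.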
Because evaluating ordered triples for linkage is decoupled from cluster set construction, the linkage information in a state matrix can be updated without constructing cluster sets. Further, cluster sets are constructed de novo. In other words, the cluster set for each level of an $\left(\frac{n \cdot (n-1)}{2} + 1\right)$-level hierarchical sequence is constructed independently of the cluster sets for the other levels. This scheme has at least two advantages. First, data points can migrate naturally as a part of cluster set construction. Second, it is possible to construct only the cluster sets that correspond to select, possibly non-contiguous levels of a hierarchical sequence. Consequently, it is possible to construct only the cluster sets for meaningful levels of a hierarchical sequence.
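The idea of constructing cluster sets only for select levels, each one independently of the others, can be illustrated as follows. Note the substitution: this sketch groups points by connected components of the linkage graph (via union-find) as a stand-in, because the construction from the state matrix and the degrees is not specified in this section, so it does not reproduce the migration of data points. The names `num_levels` and `cluster_set_at_level` and the sample triples are hypothetical.

```python
def num_levels(n):
    """Number of levels in the hierarchical sequence: n*(n-1)/2 + 1."""
    return n * (n - 1) // 2 + 1


def cluster_set_at_level(triples, n, level):
    """Build the cluster set for one level from scratch.

    Assumption: clusters are connected components of the links formed
    by the first `level` sorted triples (a stand-in technique).
    """
    parent = list(range(n))

    def find(a):
        # Follow parent pointers to the root, halving the path as we go
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for _, i, j in triples[:level]:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    clusters = {}
    for k in range(n):
        clusters.setdefault(find(k), []).append(k)
    return sorted(clusters.values())


# Sorted (distance, i, j) triples for four hypothetical data points
triples = [(1.0, 0, 1), (1.5, 2, 3), (5.0, 0, 2),
           (5.1, 1, 3), (5.2, 1, 2), (5.3, 0, 3)]

# Only three non-contiguous levels are constructed; none depends on another
selected = {level: cluster_set_at_level(triples, 4, level) for level in (0, 2, 6)}
```

Because each call starts from the sorted triples rather than from a previously built cluster set, skipping levels 1, 3, 4, and 5 costs nothing.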
3 USING DISTANCE GRAPHS TO
FIND MEANINGFUL LEVELS
To find meaningful levels of an $\left(\frac{n \cdot (n-1)}{2} + 1\right)$-level hierarchical sequence, a distance graph is constructed and visually examined. For 2-norm distance measures such as Euclidean distance, using distance graphs is motivated by the realization that as $m \to \infty$, the variance $\sigma^2_{Z_m}$ of the random variable $Z_m = \left(\sum_{k=1}^{m} Y_k^2\right)^{\frac{1}{2}}$ converges to

$$\frac{\sum_{k=1}^{m} \sigma_k^4}{2\left(\sum_{k=1}^{m} \sigma_k^2 + \sum_{k=1}^{m} \mu_k^2\right)} + \frac{\sum_{k=1}^{m} \sigma_k^2 \mu_k^2}{\sum_{k=1}^{m} \sigma_k^2 + \sum_{k=1}^{m} \mu_k^2}.$$ ⁶

$Y_k$ is a normally distributed random variable such that $Y_k \sim N(\mu_k, \sigma_k)$. Often, as the dimensionality of the data points increases and the 2-norm interclass distances become larger, the standard deviations of the 2-norm interclass distances, i.e., $\sigma_{Z_m}$, nonetheless remain relatively small or constant. See (Olsen, 2014b).
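The limiting expression for the variance of $Z_m$ can be checked numerically. It coincides with the first-order (delta method) approximation $\mathrm{Var}(\sqrt{S}) \approx \mathrm{Var}(S) / (4\,E[S])$ for $S = \sum_k Y_k^2$. The sketch below compares the expression with a Monte Carlo estimate, assuming that $\sigma_k$ denotes the standard deviation of $Y_k$ and that the $Y_k$ are independent. The function names and the parameter values are illustrative only.

```python
import math
import random
import statistics


def limiting_variance(mus, sigmas):
    """Evaluate the limiting expression for the variance of Z_m."""
    s2 = sum(s ** 2 for s in sigmas)            # sum of sigma_k^2
    m2 = sum(m ** 2 for m in mus)               # sum of mu_k^2
    s4 = sum(s ** 4 for s in sigmas)            # sum of sigma_k^4
    s2m2 = sum((s * m) ** 2 for m, s in zip(mus, sigmas))
    return s4 / (2.0 * (s2 + m2)) + s2m2 / (s2 + m2)


def simulated_variance(mus, sigmas, trials, rng):
    """Sample variance of Z_m = sqrt(sum of Y_k^2) over `trials` draws."""
    samples = []
    for _ in range(trials):
        z = math.sqrt(sum(rng.gauss(m, s) ** 2 for m, s in zip(mus, sigmas)))
        samples.append(z)
    return statistics.variance(samples)


rng = random.Random(0)      # fixed seed so the run is repeatable
m = 200                     # dimensionality (assumed value)
mus = [1.0] * m
sigmas = [0.5] * m          # standard deviations (assumed interpretation)

approx = limiting_variance(mus, sigmas)
empirical = simulated_variance(mus, sigmas, trials=4000, rng=rng)
```

For these parameters the expression evaluates to 0.225, and the simulated variance should land close to it. Increasing `m` while holding the per-dimension parameters fixed leaves the expression unchanged, which is the behavior the text relies on.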
⁶ An analog exists for 1-norm distance measures such as city block distance.
When this scenario holds, data points that belong to the same class link at about the same time even at higher dimensionalities. Classes of data points can be close together at lower dimensionalities. When they are, the magnitudes of many intraclass distances and interclass distances are about the same, so the two kinds of distances commingle. However, the classes of data points are farther apart at higher dimensionalities, so the intraclass distances and the interclass distances segregate into bands. Thus, higher dimensionalities can attenuate the effects of noise⁷ that preclude finding meaningful levels of a hierarchical sequence at lower dimensionalities, and they can make it possible to distinguish between the classes. Moreover, this pattern repeats itself as clusters become larger from including more data points.
Consequently, as the dimensionality of the data points increases, the distance graphs for a data set can exhibit identifiable features that correlate with meaningful levels of the corresponding hierarchical sequences. These levels are the levels at which multiple classes have finished linking to form new configurations of clusters. In particular, assuming that the data set has inherent structure, a distance graph takes on a shape whereby sections of the graph run nearly parallel to one of the graph axes. Where there is very little or no linking activity, the sections run nearly vertically. Where there is significant activity, i.e., where new configurations of clusters are forming, the sections run nearly horizontally. Thus, portions of the graph that come after the lower-right corners and before the upper-left corners indicate where new configurations of clusters have finished forming. A distance graph can be visually examined prior to performing a cluster analysis. Since a distance graph is used to find meaningful levels of a hierarchical sequence prior to performing a cluster analysis, it is not a summary of the results obtained from the analysis. Instead, it enables a user to selectively construct only meaningful cluster sets, i.e., cluster sets where new configurations of clusters have finished forming.
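The shape described above can be made concrete with a small sketch. It assumes that a distance graph plots the sorted distances (vertical axis) against their rank order indices (horizontal axis), so that a nearly vertical section corresponds to a large jump between consecutive sorted distances. The text calls for visual examination; picking the largest jumps automatically is a swapped-in stand-in for that step, and the names `distance_graph` and `largest_jumps` and the sample distances are hypothetical.

```python
def distance_graph(distances):
    """Return (rank order index, distance) pairs, ranks starting at 1."""
    return [(rank, d) for rank, d in enumerate(sorted(distances), start=1)]


def largest_jumps(graph, count):
    """Return the rank indices after which the sorted distance rises most.

    A large rise between consecutive points of the graph corresponds to
    a nearly vertical section, i.e., little or no linking activity.
    """
    jumps = [
        (graph[k + 1][1] - graph[k][1], graph[k][0])
        for k in range(len(graph) - 1)
    ]
    jumps.sort(reverse=True)
    return sorted(rank for _, rank in jumps[:count])


# Pairwise distances for four hypothetical points forming two tight pairs
distances = [1.0, 1.5, 5.0, 5.025, 5.1, 5.22]
graph = distance_graph(distances)
candidate_levels = largest_jumps(graph, 1)
```

Here the largest rise follows rank order index 2, i.e., after both tight pairs have linked and before any link between the pairs forms, so level 2 is the candidate meaningful level.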
Finding meaningful levels is remarkably easy: First, the differences (dissimilarities) between data points $x_i$ and $x_j$, $i, j = 1, 2, \ldots, n$, $x_i \neq x_j$, are calculated. Then, using a p-norm, $p \in [1, \infty)$, the lengths or magnitudes of the vectors that contain these differences are calculated. Next, ordered triples $(d_{i,j}, i, j)$ are constructed from these distances and the indices of the respective data points, the ordered triples are sorted into rank or ascending order according to their distance elements, and rank order indices are assigned to the sorted ordered triples. The rank order indices
and the ordered triples are used to construct a distance
graph. The rank order indices and/or the distance el-
⁷ Attenuating the effects of noise refers to reducing the effects of noise on cluster construction.
ICINCO 2014 - 11th International Conference on Informatics in Control, Automation and Robotics