are r and c respectively, and D_r and D_c are the diagonal matrices of these.
In the analysis of a cases-by-variables data matrix,
the right singular vectors of the SVD, V, are the con-
tribution coordinates of the columns (variables). A
further transformation involving a scaling factor D_q, such that

F = D_q^{-1/2} U Γ   (3)
defines the principal coordinates of the rows (cases).
The joint display of the two sets of points in F and
V can often be achieved on a common scale, thereby
avoiding the need for arbitrary independent scaling to
make the biplot legible.
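To make the construction in Equation (3) concrete, the following is a minimal NumPy sketch of the algebra only, not the full CA pipeline: the row weights q are an assumption here (in CA they would be the row masses), and the input matrix is random illustrative data.

```python
import numpy as np

# Illustrative sketch (not the paper's code): given a preprocessed data
# matrix N, compute the SVD and the principal coordinates of the rows as
# in Equation (3), F = D_q^{-1/2} U Gamma.  The weights `q` are an
# assumption; in CA they would be the row masses of N.
rng = np.random.default_rng(0)
N = rng.random((10, 4))

U, gamma, Vt = np.linalg.svd(N, full_matrices=False)
q = N.sum(axis=1) / N.sum()           # assumed row weights (CA row masses)
F = (q[:, None] ** -0.5) * U * gamma  # rows scaled by D_q^{-1/2}, cols by Gamma
```

The broadcasting on the last line is equivalent to the matrix product diag(q)^{-1/2} U diag(γ), avoiding the construction of explicit diagonal matrices.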
The appropriate normalizations and the derivation of scaling factors for the alternative methods are detailed in Table 2 and the accompanying equations of Greenacre (2013). We use CA for the ordination
of the event data, following a double log transform of
the frequency data N_0, such that

N = ln(ln(N_0 + 1) + 1) + 1   (4)
Note that the successive additions of +1 in Equation (4) are simply to avoid taking ln(0). This is a convenience that introduces an appropriate scaling to make the biplot legible, but does not otherwise alter the analysis. For the ratio-scale consumption
data we use the PCA method of Greenacre (2013), af-
ter centering and standardizing the input data by vari-
able.
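The double log transform of Equation (4) is a one-liner in NumPy; the sketch below uses made-up example counts. Note that a zero count maps exactly to 1, since ln(ln(0 + 1) + 1) + 1 = ln(1) + 1 = 1.

```python
import numpy as np

# Sketch of the double log transform of Equation (4): the inner +1 terms
# guard against ln(0) for zero counts, and the final +1 shifts the result
# to aid biplot legibility.  N0 holds example frequency counts.
N0 = np.array([0, 1, 10, 1000], dtype=float)
N = np.log(np.log(N0 + 1) + 1) + 1
```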
2.3 Calculate Distance (S3)
Both ordination techniques, whether CA or PCA, re-
sult in a matrix F of principal coordinates of the rows
(cases) as in Equation (3). This matrix has the same number of dimensions (columns) as there are variables in the raw input data; however, the information content of the data is now concentrated towards the higher-order components (i.e. towards the left-most columns of F).
This concentration is the central purpose of the dimension reduction performed by the SVD, and typically a scree plot, essentially a plot of the eigenvalues (Γ in Equation (1)), is used to inspect the degree of dimension reduction.
A decision needs to be made as to how many
components to retain, referred to as a stopping rule
(Jackson, 1993; Peres-Neto et al., 2005). A conven-
tional rule is to retain only those components with
corresponding eigenvalues > 1 (known as the Kaiser-Guttman criterion; Nunnally and Bernstein, 1994),
which is the rule we apply here, though this is a tun-
able parameter of the method and a range of values
should generally be explored. Once a stopping rule has been decided, the case-by-case distances d from the origin in Euclidean space are calculated over the k retained dimensions. This is done separately for each partition. The following code is provided for clarity.
e.g. Matlab code:

d = sum(F(:,1:k).^2, 2).^(1/2)
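An equivalent NumPy sketch, with a small hand-constructed F so the result can be checked by eye (the values of F and k are illustrative only):

```python
import numpy as np

# Euclidean distance of each case (row) from the origin, computed over
# the first k retained columns of F -- a NumPy version of the Matlab line.
F = np.array([[3.0, 4.0, 9.9],
              [6.0, 8.0, 0.1]])
k = 2
d = np.sqrt(np.sum(F[:, :k] ** 2, axis=1))
```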
2.4 Derive Joint Distance Measure (S4)
The second novel aspect of the method, after par-
titioning the data set, is to examine the joint-
distribution of distances derived from the separate or-
dinations of the partitions. Where the partitions have vastly different numbers and types of variables, and where the ordination techniques differ between partitions (as in our case: event data of circa 250 event-frequency variables, analyzed by CA, versus consumption data of seven ratio-scale variables, analyzed by PCA), the comparison should be made on the rank order of distances rather than directly on the distances themselves.
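The conversion from distances to ranks can be sketched as follows; the distance vectors and the `to_ranks` helper are hypothetical, standing in for the per-partition distances produced in S3.

```python
import numpy as np

# Sketch of the rank-order comparison: each partition's distance list is
# converted to ranks before comparing, since the raw distances come from
# ordinations of different type and dimensionality.
def to_ranks(d):
    """Rank cases by distance (0 = smallest distance)."""
    ranks = np.empty(len(d), dtype=int)
    ranks[np.argsort(d)] = np.arange(len(d))
    return ranks

d_event = np.array([0.2, 1.5, 0.9, 3.1])    # hypothetical CA distances
d_consume = np.array([0.1, 2.2, 0.8, 2.9])  # hypothetical PCA distances
r_event, r_consume = to_ranks(d_event), to_ranks(d_consume)
```

Plotting `r_event` against `r_consume` gives the rank-rank scatter described below; perfectly concordant partitions would place all points on the diagonal.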
If all the data across all the variables were gen-
erated by independent random processes, then there
would be no relationship between the rank-ordering
of cases in the two lists. If the variables are at least
partially correlated (as is usually the case for real-
world data) then we would expect a correlation be-
tween the rankings derived from the two partitions,
but we would still expect an even spread of associ-
ations. A scatter plot of the two rank ordered lists
will reveal the nature of the association. A correlation
among variables will manifest as a concentration of
points towards the diagonal, but from a unified under-
lying process we would not expect much departure
from an even spread along the diagonal. If a second, distinct process inserts cases into the data set, we would expect these to manifest as a departure from the uniform density of points, possibly forming locally high-density clusters. We plan to
develop the statistics of this phenomenon further in
subsequent publications.
2.5 Detect Anomaly (S5)
Following on from the derivation of a joint distance
measure based on the density of points in the joint
distribution of rank orders, as described in S4, we
now standardize the measure and inspect the depar-
ture from the mean density in units of standard devi-
ation. Cases at the far extremes of departure from the
mean may well be interpreted as being so divorced
from the background process generating the bulk of
the data as to be anomalies produced by a different
mechanism (Hawkins, 1980).
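The standardization step can be sketched as below; `measure` stands in for the density-based joint measure of S4, and the cut-off of 2 standard deviations is an assumption for illustration, not a threshold fixed by the method.

```python
import numpy as np

# Hedged sketch of S5: standardize a per-case joint measure and flag
# extreme departures from the mean in standard-deviation units.
# `measure` is illustrative data; the 2-sigma threshold is an assumption.
measure = np.array([0.9, 1.0, 1.1, 1.05, 0.95, 5.0])
z = (measure - measure.mean()) / measure.std()
anomalies = np.flatnonzero(np.abs(z) > 2.0)
```

Here only the last case departs from the mean by more than the assumed threshold, so it alone would be flagged as anomalous.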
DATA 2018 - 7th International Conference on Data Science, Technology and Applications