Visualisation of Heterogeneous Data with the Generalised Generative
Topographic Mapping
Michel F. Randrianandrasana, Shahzad Mumtaz and Ian T. Nabney
Nonlinearity and Complexity Research Group, Aston University, Birmingham B4 7ET, U.K.
Keywords:
Data Visualisation, GTM, LTM, Heterogeneous and Missing Data.
Abstract:
Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The prob-
abilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete
continuous data, can be extended to model heterogeneous (i.e. containing both continuous and discrete val-
ues) and missing data. This paper describes and assesses the resulting model on both synthetic and real-world
heterogeneous data with missing values.
1 INTRODUCTION
Type-specific data analysis has been well studied in
machine learning¹. In the last couple of decades, the
need to analyse mixed-type data has received some
attention from the machine learning community be-
cause of the fact that real-world processes often gen-
erate data of mixed-type. An example of such mixed-
type data could be a hospital’s patient database where
typical fields include age (continuous), gender (bi-
nary), test results (binary or continuous), height (con-
tinuous) etc. In practice a number of ad-hoc methods
are used to analyse mixed-type data. For instance,
if there is a mixture of continuous and discrete vari-
ables, then either all the discrete variables are con-
verted to some numerical scoring equivalent or, on
the other hand, all the continuous variables are discre-
tised. Alternatively, both types of variables are anal-
ysed separately and then the results are combined us-
ing some criteria. According to (Krzanowski, 1983), “All these options involve some element of subjectivity, with possible loss of information, and do not appear very satisfactory in general”. The ideal general
solution for analysing such heterogeneous data is to
specify a model that builds a joint distribution with an
appropriate noise model for each type of feature (for
example, a Bernoulli distribution for binary features,
a multinomial distribution for multi-category features
and a Gaussian distribution for continuous features)
and then fit the model to data (de Leon and Chough, 2013).

¹ http://letdataspeak.blogspot.co.uk/2012/07/mixed-type-data-analysis-i-overview.html
No standard multivariate distribution is available that can jointly model random variables of different types. How-
ever, one possible way of jointly modelling discrete
and continuous features is using a latent variable ap-
proach to model the correlation between features of
different types. For example, a dataset consisting of
continuous, binary and multi-category features can
be modelled using a conditional distribution that is a
product of Gaussian, Bernoulli and multinomial dis-
tributions. This approach has been previously dis-
cussed as a possible extension for GTM (Bishop and
Svensen, 1998; Bishop et al., 1998) and PCA (Tip-
ping, 1999) models. This idea was implemented
in (Yu and Tresp, 2004) to visualise a mixture of con-
tinuous and binary data on a single continuous latent
space by extending probabilistic principal component
analysis (PPCA) and was called generalised PPCA
(GPPCA). GPPCA is a linear probabilistic model and
uses a variational Expectation-Maximisation (EM) al-
gorithm for parameter estimation. There are other la-
tent variable models for mixed-type datasets but to
the best of our knowledge most of these are linear
models (Moustaki, 1996; Sammel et al., 1997; Dun-
son, 2000; Teixeira-Pinto and Normand, 2009) and
they either use numerical integration or a sampling
approach to handle the intractable integration for fit-
ting a latent variable model of this type. It is important
to mention that there is not much work reported in the
literature for analysing mixed-type data using a latent
variable formalism (de Leon and Chough, 2013). As
a generalisation of GTM, a latent trait model (LTM)
to handle discrete data was proposed in (Kabán and
Girolami, 2001): the model used the exponential fam-
ily of distributions. In this paper we describe and as-
sess a probabilistic non-linear latent variable model to
visualise a mixed-type dataset on a single continuous
latent space. We shall refer to this model as a gener-
alised GTM (GGTM).
The treatment of incomplete data for the standard
GTM has been explored in (Sun et al., 2002) using an
EM approach which estimates the parameters of the
mixing components of the GTM and missing values
at the same time. The same approach is used in this
paper to visualise mixed-type data containing missing
values with GGTM.
2 VISUALISATION OF
HETEROGENEOUS DATA
WITH GGTM
The main goal of a latent variable model is to find a low-dimensional manifold, H, with M dimensions (usually M = 2) for the distribution p(x) of the high-dimensional data space, D, with D dimensions. Latent variable models have traditionally been developed to handle datasets in which all the features are of the same type.
Suppose that the D-dimensional data space is defined by |R| continuous, |B| binary and |C| multi-category features respectively. The link functions for continuous, binary and multi-category features are defined in equations (1), (2) and (3) respectively:

\mu^{\mathcal{R}} = \Phi(z) W^{\mathcal{R}}. \quad (1)

\mu^{\mathcal{B}} = g^{\mathcal{B}}(\Phi(z) W^{\mathcal{B}}) = \frac{\exp(\Phi(z) W^{\mathcal{B}})}{1 + \exp(\Phi(z) W^{\mathcal{B}})}. \quad (2)

\mu^{\mathcal{C}}_{s_d} = g^{\mathcal{C}}(\Phi(z) w^{\mathcal{C}}_{s_d}) = \frac{\exp(\Phi(z) w^{\mathcal{C}}_{s_d})}{\sum_{s_d=1}^{S_d} \exp(\Phi(z) w^{\mathcal{C}}_{s_d})}. \quad (3)
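To make the link functions concrete, here is a minimal NumPy sketch (not the authors' implementation; the variable names and shapes are illustrative) that evaluates equations (1)-(3) for a single latent grid point:

```python
import numpy as np

def link_continuous(phi_z, W_R):
    # Equation (1): identity link, mu^R = Phi(z) W^R.
    return phi_z @ W_R

def link_binary(phi_z, W_B):
    # Equation (2): logistic (sigmoid) link applied element-wise.
    a = phi_z @ W_B
    return np.exp(a) / (1.0 + np.exp(a))

def link_categorical(phi_z, W_C):
    # Equation (3): softmax over the S_d categories of one feature.
    # Subtracting the max is mathematically equivalent and numerically safer.
    a = phi_z @ W_C
    e = np.exp(a - a.max())
    return e / e.sum()

# Illustrative shapes: L basis functions, |R| continuous targets,
# |B| binary targets, one categorical feature with S_d categories.
L, nR, nB, S_d = 4, 3, 9, 8
rng = np.random.default_rng(0)
phi_z = rng.normal(size=(1, L))      # Phi(z) for a single latent grid point
mu_R = link_continuous(phi_z, rng.normal(size=(L, nR)))
mu_B = link_binary(phi_z, rng.normal(size=(L, nB)))
mu_C = link_categorical(phi_z, rng.normal(size=(L, S_d)))
```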
We write each observation vector, x_n, in terms of sub-vectors x^R_n, x^B_n and x^C_n for continuous, binary and multi-category features respectively. The likelihood of each type of feature is given by
p(x_n^{\mathcal{R}} \mid z, W^{\mathcal{R}}, \beta) = p(x_n^{\mathcal{R}} \mid \mu^{\mathcal{R}}, \beta) = \left(\frac{\beta}{2\pi}\right)^{|\mathcal{R}|/2} \exp\left(-\frac{\beta}{2} \|\mu^{\mathcal{R}} - x_n^{\mathcal{R}}\|^2\right). \quad (4)

p(x_n^{\mathcal{B}} \mid z, W^{\mathcal{B}}) = p(x_n^{\mathcal{B}} \mid \mu^{\mathcal{B}}) = \prod_{d=1}^{|\mathcal{B}|} (\mu_d^{\mathcal{B}})^{x_{nd}^{\mathcal{B}}} (1 - \mu_d^{\mathcal{B}})^{(1 - x_{nd}^{\mathcal{B}})}. \quad (5)

p(x_n^{\mathcal{C}} \mid z, W^{\mathcal{C}}) = p(x_n^{\mathcal{C}} \mid \mu^{\mathcal{C}}) = \prod_{d=1}^{|\mathcal{C}|} \prod_{s_d=1}^{S_d} (\mu_{s_d}^{\mathcal{C}})^{x_{n s_d}^{\mathcal{C}}}. \quad (6)
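For illustration, the corresponding log-densities can be evaluated as in the following sketch (illustrative names only; the products are computed in log space to avoid numerical underflow):

```python
import numpy as np

def log_lik_continuous(x_R, mu_R, beta):
    # Equation (4): isotropic Gaussian with precision beta.
    d = x_R.size
    return 0.5 * d * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum((mu_R - x_R) ** 2)

def log_lik_binary(x_B, mu_B):
    # Equation (5): product of Bernoulli terms, taken in log space.
    return np.sum(x_B * np.log(mu_B) + (1 - x_B) * np.log(1 - mu_B))

def log_lik_categorical(x_C, mu_C):
    # Equation (6): multinomial with 1-of-S coded x_C, one block per feature.
    return np.sum(x_C * np.log(mu_C))
```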
Then we compute the product of the likelihoods for
the Gaussian (equation (4)), Bernoulli (equation (5))
and multinomial (equation (6)) distributions, and find
the distribution of x by integrating over the latent vari-
ables, z,
p(x \mid \Theta) = \int p(x_n^{\mathcal{R}} \mid z, W^{\mathcal{R}}, \beta)\, p(x_n^{\mathcal{B}} \mid z, W^{\mathcal{B}})\, p(x_n^{\mathcal{C}} \mid z, W^{\mathcal{C}})\, p(z)\, dz, \quad (7)

where Θ = {W^R, β, W^B, W^C} contains all the model parameters. We use as prior distribution, p(z), a sum of delta functions as for the standard GTM and LTM

p(z) = \frac{1}{K} \sum_{k=1}^{K} \delta(z - z_k). \quad (8)

The data distribution can now be derived from equations (7) and (8), where we use the same mixing coefficient for all components (i.e. π_k = 1/K),

p(x \mid \Theta) = \sum_{k=1}^{K} \pi_k\, p(x \mid z_k, \Theta). \quad (9)

The log-likelihood of the complete data takes the form

\mathcal{L}(\Theta) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k\, p(x_n \mid z_k, \Theta). \quad (10)
The choice of noise model is related to the corresponding type of data and also the link function mapping from latent to data space (Kabán and Girolami, 2001). The exponential family of distributions is used here to model mixed-type data under the latent variable framework. From here onward, to simplify the notation, we use x^M, where M can represent either R, B or C, to indicate the type of feature for a data point x.
2.1 An Expectation Maximization (EM)
Algorithm for GGTM
Our proposed model is based on a mixture of distribu-
tions where each component is a product of Gaussian,
Bernoulli and/or multinomial distributions. The pa-
rameters of the mixture model can be determined us-
ing an EM algorithm: in the E-step, we use the current parameter set, Θ, to compute the posterior probabilities (responsibilities) using Bayes' theorem,

r_{kn} = p(z_k \mid x_n, W) = \frac{\pi_k\, p(x_n \mid z_k, W)}{\sum_{k'=1}^{K} \pi_{k'}\, p(x_n \mid z_{k'}, W)}, \quad (11)

where

p(x_n \mid z_k, W) = p(x_n^{\mathcal{R}} \mid z_k, W^{\mathcal{R}}, \beta)\, p(x_n^{\mathcal{B}} \mid z_k, W^{\mathcal{B}})\, p(x_n^{\mathcal{C}} \mid z_k, W^{\mathcal{C}}). \quad (12)
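A possible sketch of this E-step, assuming the per-component log-densities of equation (12) have already been summed into an (N, K) array, is shown below; the logsumexp trick is an implementation detail, not part of the paper:

```python
import numpy as np

def e_step(log_px_given_z, log_pi):
    """log_px_given_z: (N, K) array of log p(x_n | z_k, W), i.e. the sum of the
    Gaussian, Bernoulli and multinomial log-densities (equation (12)).
    log_pi: (K,) log mixing coefficients (log(1/K) for the uniform prior).
    Returns responsibilities as an (N, K) array (the transpose of the paper's
    K x N matrix R) and the data log-likelihood of equation (10)."""
    log_joint = log_px_given_z + log_pi            # log[pi_k p(x_n | z_k, W)]
    log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
    resp = np.exp(log_joint - log_norm)            # equation (11)
    log_lik = np.sum(log_norm)                     # equation (10)
    return resp, log_lik
```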
We use the maximization of the relative likeli-
hood (Bishop, 1995), which does not require the com-
putation of the log of a sum. The relative likelihood
between the old and new set of parameters can be cal-
culated as
Q = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn} \log\left[ p(x_n \mid z_k, W)\, p(z_k) \right]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn} \Big\{ \left[ x_n^{\mathcal{R}} \theta_k^{\mathcal{R}} - G(\theta_k^{\mathcal{R}}) + \log(p_0(x_n^{\mathcal{R}})) \right]
    + \left[ x_n^{\mathcal{B}} \theta_k^{\mathcal{B}} - G(\theta_k^{\mathcal{B}}) + \log(p_0(x_n^{\mathcal{B}})) \right]
    + \left[ x_n^{\mathcal{C}} \theta_k^{\mathcal{C}} - G(\theta_k^{\mathcal{C}}) + \log(p_0(x_n^{\mathcal{C}})) \right]
    + \log(p(z_k)) \Big\} \quad (13)

where θ^M_k = Φ(z_k)W^M. In the M-step we maximize the function Q with respect to each type of weight sub-matrix W^M as

\frac{\partial Q}{\partial W^{\mathcal{M}}} = \Phi^{T} \left[ R X^{\mathcal{M}} - E\, g(\Phi W^{\mathcal{M}}) \right], \quad (14)

where Φ is a K × L matrix, R is a K × N matrix calculated using equation (11), X^M is an N × |M| data sub-matrix and the diagonal matrix E contains the values

e_{kk} = \sum_{n=1}^{N} r_{kn}. \quad (15)
In the case of an isotropic Gaussian with unit variance, the link function g(.) is the identity and by setting the derivative to zero we obtain, as in the standard GTM (Bishop and Svensen, 1998),

\widehat{W}^{\mathcal{R}} = (\Phi^{T} E \Phi)^{-1} \Phi^{T} R X^{\mathcal{R}}. \quad (16)
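Equation (16) is a weighted least-squares solution; a minimal sketch (illustrative names, solving the linear system rather than forming the inverse explicitly) might look like:

```python
import numpy as np

def m_step_continuous(Phi, R, X_R):
    """Equation (16): Phi is K x L, R is the K x N responsibility matrix,
    X_R is the N x |R| continuous data sub-matrix. Returns W^R (L x |R|)."""
    E = np.diag(R.sum(axis=1))          # e_kk = sum_n r_kn (equation (15))
    A = Phi.T @ E @ Phi                 # L x L
    B = Phi.T @ (R @ X_R)               # L x |R|
    return np.linalg.solve(A, B)        # (Phi^T E Phi)^{-1} Phi^T R X^R
```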
For other link functions, a Generalised EM (GEM) (McLachlan and Krishnan, 1997) algorithm is used, because convergence to a local maximum is guaranteed even without fully maximizing the relative likelihood at each step (Kabán and Girolami, 2001). A simple gradient-based update can be obtained for W^M from equation (14),

\Delta W^{\mathcal{M}} \propto \Phi^{T} \left[ R X^{\mathcal{M}} - E\, g(\Phi W^{\mathcal{M}}) \right], \quad (17)
where this can be used as an inner loop in the M-step. The correlations between the dimensions of φ_l responsible for preserving the neighbourhood are required for a topographic organisation given that the natural parameter θ^M is being updated under the gradient update of the weight matrix W^M (Kabán and Girolami, 2001):

\widehat{\theta}_k^{\mathcal{M}} = \phi_k W^{\mathcal{M}} + \eta \sum_{n=1}^{N} \sum_{k'=1}^{K} r_{k'n}\, \phi_k \phi_{k'}^{T} \left( x_n^{\mathcal{M}} - \mu_{k'}^{\mathcal{M}} \right).
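A hedged sketch of one such inner-loop gradient step for a discrete block, with an assumed fixed step size eta, is given below; the choice of step size and the number of inner iterations are not specified in the paper.

```python
import numpy as np

def gem_inner_step(Phi, R, X_M, W_M, link, eta=0.01):
    """One gradient ascent step of equation (17).
    Phi: K x L basis matrix, R: K x N responsibilities, X_M: N x |M| data
    sub-matrix, W_M: L x |M| weights, link: the relevant link function g(.)."""
    E = np.diag(R.sum(axis=1))
    grad = Phi.T @ (R @ X_M - E @ link(Phi @ W_M))
    return W_M + eta * grad

# Example inner loop for the binary block with the logistic link (illustrative):
# for _ in range(n_inner):
#     W_B = gem_inner_step(Phi, R, X_B, W_B, lambda a: 1.0 / (1.0 + np.exp(-a)))
```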
3 VISUALISATION OF MISSING
DATA WITH GGTM
The EM framework supports the treatment of missing
values in the GGTM model.
3.1 Continuous Data
The data points x_n are written as (x^o_n, x^m_n), where m and o represent subvectors and submatrices of the parameters matching the missing and observed components of the data (Ghahramani and Jordan, 1994). Binary indicator variables ζ_nk are introduced to specify which component of the mixture model generated the data point. Both the indicator variables ζ_nk and the missing inputs x^m_n are treated as hidden variables in the EM algorithm. The changes made to the EM algorithm for GTM are detailed in (Sun et al., 2002).
3.2 Discrete Data
The missing values are inferred in the E-step using the usual posterior means with responsibility r_kn computed on the observed data,

E[x_n^{m} \mid x_n^{o}, \mu^{\mathcal{D}}] = \sum_{k=1}^{K} r_{kn}\, \mu_k^{\mathcal{D}}, \quad (18)

where D = {B or C}. In the M-step, the weight matrix Ŵ^D is updated first using the complete training data and we then update μ̂^D_k with

\widehat{\mu}_k^{\mathcal{D}} = g^{\mathcal{D}}(\Phi(z_k)\, \widehat{W}^{\mathcal{D}}). \quad (19)
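The imputation of equation (18) and the refresh of equation (19) could be sketched as follows (names and array layouts are assumptions; responsibilities are taken to be computed from the observed entries only):

```python
import numpy as np

def impute_discrete(R_obs, Mu_D, missing_mask, X_D):
    """Equation (18): fill missing discrete entries with posterior means.
    R_obs: N x K responsibilities computed from the observed data,
    Mu_D: K x |D| component means mu^D_k, missing_mask: N x |D| boolean,
    X_D: N x |D| discrete data sub-matrix with arbitrary values where missing."""
    expected = R_obs @ Mu_D                      # E[x_n^m | x_n^o, mu^D]
    return np.where(missing_mask, expected, X_D)

def refresh_means(Phi, W_D, link_D):
    # Equation (19): mu^D_k = g^D(Phi(z_k) W^D) after W^D has been updated.
    return link_D(Phi @ W_D)
```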
4 VISUALISATION QUALITY
EVALUATION MEASURES
Algorithms based on GTM are examples of unsuper-
vised learning which always give a result when ap-
plied to a particular dataset. Thus we cannot tell a
priori what is the expected or desired outcome. This
makes it difficult to judge which method is the best
(i.e. tells us the most about a certain dataset). Here
VisualisationofHeterogeneousDatawiththeGeneralisedGenerativeTopographicMapping
235
we use metrics that measure the degree of local neigh-
bourhood similarity between data space and latent
space which can be calculated even if ‘ground truth’
is not known.
4.1 Trustworthiness, Continuity and
Mean Relative Rank Errors
(MRREs)
Two well-known visualisation quality measures based on comparing neighbourhoods in the data space x and projection space z are trustworthiness and continuity (Venna and Kaski, 2001). A mapping is trustworthy if the k-neighbourhood of a point in the visualised space matches its k-neighbourhood in the data space; conversely, if the k-neighbourhood in the data space matches that in the visualised space, the mapping preserves continuity. The higher the measure the better the visualisation, as this implies that local neighbourhoods are better preserved by the projection. We also use mean relative rank errors with respect to data and latent spaces (MRRE_x and MRRE_z), which measure the preservation of the rank of the k-nearest neighbours, in contrast to trustworthiness and continuity, which only consider matches in the k-neighbourhood (Lee and Verleysen, 2008). Note that the lower the MRRE the better the projection quality.
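As an illustration of these rank-based measures, the sketch below computes trustworthiness from precomputed pairwise distance matrices; continuity is obtained by swapping the two matrices. This is a generic implementation of the Venna and Kaski (2001) definition, not the exact code used in the experiments.

```python
import numpy as np

def trustworthiness(D_data, D_latent, k):
    """Trustworthiness T(k) from two N x N pairwise distance matrices:
    D_data in the data space, D_latent in the visualised (latent) space.
    Assumes k < N/2; higher values (maximum 1) are better."""
    n = D_data.shape[0]
    # Rank of every point j with respect to every point i in the data space
    # (rank 1 = nearest neighbour; rank 0 is the point itself).
    order_data = np.argsort(D_data, axis=1)
    ranks_data = np.empty_like(order_data)
    ranks_data[np.arange(n)[:, None], order_data] = np.arange(n)[None, :]
    # k-nearest neighbours in each space (column 0 is the point itself).
    knn_data = order_data[:, 1:k + 1]
    knn_latent = np.argsort(D_latent, axis=1)[:, 1:k + 1]
    penalty = 0.0
    for i in range(n):
        # Points inside the latent-space neighbourhood of i but outside
        # its data-space neighbourhood ('intruders').
        intruders = np.setdiff1d(knn_latent[i], knn_data[i], assume_unique=True)
        penalty += np.sum(ranks_data[i, intruders] - k)
    return 1.0 - 2.0 * penalty / (n * k * (2.0 * n - 3.0 * k - 1.0))
```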
5 EXPERIMENTAL RESULTS
The GGTM was evaluated on synthetic and real-world datasets, both complete and with missing values, and compared with the standard GTM on complete data. The weight matrix W was initialised using principal component analysis (PCA). For the metrics in Section 4,
we computed pair-wise distances using Hamming dis-
tances for the binary features and Euclidean distances
for the continuous features. For each distance matrix,
we divided each column by its standard deviation. All
experiments used 10-fold cross-validation. The vi-
sualisation quality measures were computed with a
range of neighbourhood sizes (5, 10, 15, 20) and the
mean of these measures over the different sizes and
cross-validation runs was computed.
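The distance computation described above might be sketched as follows, assuming the binary and continuous blocks are held in separate arrays; how the two standardised distance matrices are subsequently combined is not detailed here, so they are returned separately.

```python
import numpy as np
from scipy.spatial.distance import cdist

def mixed_distances(X_cont, X_bin):
    """Pairwise distances for the quality metrics: Euclidean over the
    continuous block, Hamming over the binary block (SciPy's Hamming is the
    proportion of mismatching bits), each distance matrix then scaled
    column-wise by its standard deviation as described in the text."""
    D_cont = cdist(X_cont, X_cont, metric='euclidean')
    D_bin = cdist(X_bin, X_bin, metric='hamming')
    D_cont = D_cont / D_cont.std(axis=0, keepdims=True)
    D_bin = D_bin / D_bin.std(axis=0, keepdims=True)
    return D_cont, D_bin
```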
5.1 Synthetic Dataset
The synthetic dataset was generated from an equiprobable mixture of two Gaussians, N(m_k, I) (with k = 1, 2), with means m_1 = (2.0, 3.5, 3.5)^T and m_2 = (3.5, 4.5, 4.5)^T. A dataset with 9-dimensional binary features from four classes was also generated (these classes were not used as inputs to the visualisation). Both continuous and binary data were combined to make a dataset of 12 features with 2,800 data points. The visualisation results of the complete and missing datasets (10% randomly removed) are shown in Figure 1 and the quality metrics are given in Table 1.

Table 1: GTM and GGTM visualisation quality metrics of the 12-dimensional synthetic datasets. Each figure represents the average over a 10-fold cross-validation with one standard deviation on the test sets.

                 | GTM (complete) | GGTM (complete) | GGTM (missing)
Trustworthiness  | 0.969 ± 0.003  | 0.949 ± 0.024   | 0.947 ± 0.027
Continuity       | 0.964 ± 0.003  | 0.970 ± 0.013   | 0.969 ± 0.014
MRRE_x           | 0.040 ± 0.000  | 0.043 ± 0.003   | 0.042 ± 0.003
MRRE_z           | 0.004 ± 0.000  | 0.038 ± 0.002   | 0.037 ± 0.002

[Figure 1: GTM and GGTM visualisations of the synthetic 12-dimensional datasets with 3 continuous and 9 binary features. Panels: (a) GTM (training set), (b) GTM (test set), (c) GGTM (training set), (d) GGTM (test set), (e) GGTM missing (training set), (f) GGTM missing (test set).]
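For reference, a hedged sketch of generating the continuous part of this synthetic set and masking 10% of its entries is given below; the even split between the two Gaussians and the omission of the 9 binary features are assumptions, since their construction is not detailed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 1400                       # 2,800 points in total, assuming an even split
m1 = np.array([2.0, 3.5, 3.5])
m2 = np.array([3.5, 4.5, 4.5])
X_cont = np.vstack([
    rng.normal(loc=m1, scale=1.0, size=(n_per_class, 3)),   # N(m_1, I)
    rng.normal(loc=m2, scale=1.0, size=(n_per_class, 3)),   # N(m_2, I)
])

# Remove 10% of the entries at random to create the 'missing' variant
# (in the paper the removal applies across all 12 features).
missing_mask = rng.random(X_cont.shape) < 0.10
X_missing = np.where(missing_mask, np.nan, X_cont)
```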
We also generated a dataset with two multi-
category features with 8 and 16 categories in the
first and second features respectively. We ap-
pended the multi-category features to the previous
12-dimensional dataset and used a 1-of-S encoding
scheme for the multi-category features. Labels were
IVAPP2015-InternationalConferenceonInformationVisualizationTheoryandApplications
236
based on the four classes in the binary data. The visualisation results of the 14-dimensional complete and missing datasets are shown in Figure 2 and the corresponding quality metrics are given in Table 2. The proportion of missing values was also increased to 30%, 50%, 70% and 90% without substantially degrading the visualisation quality measures.

Table 2: GTM and GGTM visualisation quality metrics of the 14-dimensional synthetic datasets.

                 | GTM (complete) | GGTM (complete) | GGTM (missing)
Trustworthiness  | 0.962 ± 0.004  | 0.977 ± 0.009   | 0.973 ± 0.014
Continuity       | 0.946 ± 0.008  | 0.980 ± 0.007   | 0.976 ± 0.013
MRRE_x           | 0.045 ± 0.001  | 0.044 ± 0.001   | 0.116 ± 0.005
MRRE_z           | 0.045 ± 0.001  | 0.041 ± 0.002   | 0.132 ± 0.005

[Figure 2: GTM and GGTM visualisations of the synthetic 14-dimensional datasets with 3 continuous, 9 binary and 2 multi-category features. Panels: (a) GTM (training set), (b) GTM (test set), (c) GGTM (training set), (d) GGTM (test set), (e) GGTM missing (training set), (f) GGTM missing (test set).]
5.2 Hypothyroid Dataset
This real-world dataset is publicly available from the
UCI data repository (Bache and Lichman, 2013). The
dataset consists of two variable types: 15 binary and
6 continuous features. It contains three classes: primary hypothyroid, compensated hypothyroid and normal. The
dataset was originally divided into a training set of
3,772 data points (93 with primary hypothyroid, 191 with compensated hypothyroid and 3,488 normal) and a test set of 3,428 data points (73 with primary hypothyroid, 177 with compensated hypothyroid and 3,178 normal). These training and test sets were merged prior to running a 10-fold cross-validation. The visualisation results of the complete and missing datasets are shown in Figure 3 and the quality metrics are given in Table 3.

Table 3: GTM and GGTM visualisation quality metrics of the hypothyroid disease datasets.

                 | GTM (complete) | GGTM (complete) | GGTM (missing)
Trustworthiness  | 0.718 ± 0.022  | 0.718 ± 0.015   | 0.716 ± 0.014
Continuity       | 0.804 ± 0.017  | 0.843 ± 0.014   | 0.835 ± 0.007
MRRE_x           | 0.018 ± 0.000  | 0.019 ± 0.000   | 0.019 ± 0.000
MRRE_z           | 0.016 ± 0.000  | 0.016 ± 0.000   | 0.016 ± 0.000

[Figure 3: GTM and GGTM visualisations of the thyroid disease datasets. The cyan circles, red plus signs and blue squares represent primary hypothyroid, compensated hypothyroid and normal respectively. Panels: (a) GTM (training set), (b) GTM (test set), (c) GGTM (training set), (d) GGTM (test set), (e) GGTM missing (training set), (f) GGTM missing (test set).]
6 CONCLUSIONS
A generalisation of the GTM to heterogeneous
and missing data has been described and as-
sessed in this paper. This involves modelling the
VisualisationofHeterogeneousDatawiththeGeneralisedGenerativeTopographicMapping
237
continuous and discrete data with Gaussian and
Bernoulli/multinomial distributions respectively.
These extensions have been suggested in (Bishop
et al., 1998) but this is the first time the mathematical
details have been worked out and an implementation
written and evaluated.
Visualisation results for synthetic data using the
GGTM have shown more compact clusters for each
class compared to the standard GTM whereas for the
real dataset no significant difference was observed.
For synthetic datasets with missing values, GGTM vi-
sualisations have greater compactness for each class.
In terms of visualisation quality evaluation metrics,
we observed that for a mix of continuous and binary
data, the trustworthiness and MRRE_x were slightly better for the standard GTM compared to GGTM, whereas the continuity and MRRE_z were better for GGTM
compared to standard GTM. However, for a mix of
continuous, binary and multi-category features, all the
quality evaluation measures were better for GGTM
compared to the standard GTM. Missing values have
caused limited deterioration in results compared to the
complete data case.
REFERENCES
Bache, K. and Lichman, M. (2013). UCI machine learning
repository.
Bishop, C. M. (1995). Neural networks for pattern recog-
nition. Oxford University Press.
Bishop, C. M. and Svensen, M. (1998). GTM: The gen-
erative topographic mapping. Neural Computation,
10(1):215–234.
Bishop, C. M., Svensen, M., and Williams, C. K. I. (1998).
Developments of the generative topographic mapping.
Neurocomputing, 21(1):203–224.
de Leon, A. R. and Chough, K. C. (2013). Analysis of
Mixed Data: Methods & Applications. Taylor &
Francis Group. Chapman and Hall/CRC.
Dunson, D. B. (2000). Bayesian latent variable models
for clustered mixed outcomes. Journal of the Royal
Statistical Society. Series B (Statistical Methodology),
62(2):355–366.
Ghahramani, Z. and Jordan, M. I. (1994). Learning from
incomplete data. Technical Report AIM-1509.
Kabán, A. and Girolami, M. (2001). A combined latent
class and trait model for the analysis and visualization
of discrete data. Pattern Analysis and Machine Intel-
ligence, IEEE Transactions on, 23(8):859–872.
Krzanowski, W. J. (1983). Distance between popula-
tions using mixed continuous and categorical vari-
ables. Biometrika, 70(1):235–243.
Lee, J. A. and Verleysen, M. (2008). Rank-based quality
assessment of nonlinear dimensionality reduction. In
ESANN, pages 49–54.
McLachlan, G. and Krishnan, T. (1997). The EM algorithm
and extensions. Wiley, New York.
Moustaki, I. (1996). A latent trait and a latent class model
for mixed observed variables. British Journal of Math-
ematical and Statistical Psychology, 49(2):313–334.
Sammel, M. D., Ryan, L. M., and Legler, J. M. (1997). La-
tent variable models for mixed discrete and continu-
ous outcomes. Journal of the Royal Statistical Society.
Series B (Methodological), 59(3):667–678.
Sun, Y., Tino, P., and Nabney, I. (2002). Visualisa-
tion of incomplete data using class information con-
straints. In Winkler, J. and Niranjan, M., editors, Un-
certainty in Geometric Computations, volume 704 of
The Springer International Series in Engineering and
Computer Science, pages 165–173. Springer US.
Teixeira-Pinto, A. and Normand, S. T. (2009). Correlated
bivariate continuous and binary outcomes: issues and
applications. Statistics in Medicine, 28(13):1753–
1773.
Tipping, M. E. (1999). Probabilistic visualisation of high-
dimensional binary data. In Proceedings of the 1998
Conference on Advances in Neural Information Pro-
cessing Systems II, pages 592–598, Cambridge, MA,
USA. MIT Press.
Venna, J. and Kaski, S. (2001). Neighborhood preserva-
tion in nonlinear projection methods: an experimen-
tal study. In Proceedings of the International Con-
ference on Artificial Neural Networks, ICANN ’01,
pages 485–491, London, UK. Springer-Verlag.
Yu, K. and Tresp, V. (2004). Heterogenous data fusion via a
probabilistic latent-variable model. In Müller-Schloer,
C., Ungerer, T., and Bauer, B., editors, ARCS, volume
2981 of Lecture Notes in Computer Science, pages
20–30. Springer.
IVAPP2015-InternationalConferenceonInformationVisualizationTheoryandApplications
238