Visualisation of Heterogeneous Data with the Generalised Generative Topographic Mapping

Michel F. Randrianandrasana, Shahzad Mumtaz and Ian T. Nabney

Nonlinearity and Complexity Research Group, Aston University, Birmingham B4 7ET, U.K.

Keywords:

Data Visualisation, GTM, LTM, Heterogeneous and Missing Data.

Abstract:

Heterogeneous and incomplete datasets are common in many real-world visualisation applications. The probabilistic nature of the Generative Topographic Mapping (GTM), which was originally developed for complete continuous data, can be extended to model heterogeneous (i.e. containing both continuous and discrete values) and missing data. This paper describes and assesses the resulting model on both synthetic and real-world heterogeneous data with missing values.

1 INTRODUCTION

Type-specific data analysis has been well studied in machine learning (see http://letdataspeak.blogspot.co.uk/2012/07/mixed-type-data-analysis-i-overview.html). In the last couple of decades, the need to analyse mixed-type data has received some attention from the machine learning community because real-world processes often generate data of mixed type. An example of such mixed-type data is a hospital’s patient database, where typical fields include age (continuous), gender (binary), test results (binary or continuous), height (continuous), etc. In practice, a number of ad hoc methods are used to analyse mixed-type data. For instance, if there is a mixture of continuous and discrete variables, then either all the discrete variables are converted to some numerical scoring equivalent or, conversely, all the continuous variables are discretised. Alternatively, both types of variables are analysed separately and the results are then combined using some criterion. According to Krzanowski (1983), “All these options involve some element of subjectivity, with possible loss of information, and do not appear very satisfactory in general”. The ideal general solution for analysing such heterogeneous data is to specify a model that builds a joint distribution with an appropriate noise model for each type of feature (for example, a Bernoulli distribution for binary features, a multinomial distribution for multi-category features and a Gaussian distribution for continuous features) and then fit the model to data (de Leon and Chough, 2013).

A single multivariate distribution that can model random variables of different types is not available. However, one possible way of jointly modelling discrete and continuous features is to use a latent variable approach to model the correlation between features of different types. For example, a dataset consisting of continuous, binary and multi-category features can be modelled using a conditional distribution that is a product of Gaussian, Bernoulli and multinomial distributions. This approach has been previously discussed as a possible extension for GTM (Bishop and Svensen, 1998; Bishop et al., 1998) and PCA (Tipping, 1999) models. The idea was implemented by Yu and Tresp (2004) to visualise a mixture of continuous and binary data on a single continuous latent space by extending probabilistic principal component analysis (PPCA); the resulting model was called generalised PPCA (GPPCA). GPPCA is a linear probabilistic model and uses a variational Expectation-Maximisation (EM) algorithm for parameter estimation. There are other latent variable models for mixed-type datasets but, to the best of our knowledge, most of these are linear models (Moustaki, 1996; Sammel et al., 1997; Dunson, 2000; Teixeira-Pinto and Normand, 2009) and they use either numerical integration or a sampling approach to handle the intractable integration required for fitting a latent variable model of this type. It is important to mention that there is not much work reported in the literature on analysing mixed-type data using a latent variable formalism (de Leon and Chough, 2013). As a generalisation of GTM, a latent trait model (LTM) to handle discrete data was proposed in (Kabán and Girolami, 2001): the model used the exponential family of distributions. In this paper we describe and assess a probabilistic non-linear latent variable model to visualise a mixed-type dataset on a single continuous latent space. We shall refer to this model as a generalised GTM (GGTM).

Randrianandrasana M. F., Mumtaz S. and Nabney I. T.

Visualisation of Heterogeneous Data with the Generalised Generative Topographic Mapping.

DOI: 10.5220/0005305002330238

In Proceedings of the 6th International Conference on Information Visualization Theory and Applications (IVAPP-2015), pages 233-238

ISBN: 978-989-758-088-8

Copyright © 2015 SCITEPRESS (Science and Technology Publications, Lda.)

The treatment of incomplete data for the standard GTM has been explored by Sun et al. (2002) using an EM approach which estimates the parameters of the mixture components of the GTM and the missing values at the same time. The same approach is used in this paper to visualise mixed-type data containing missing values with GGTM.

2 VISUALISATION OF HETEROGENEOUS DATA WITH GGTM

The main goal of a latent variable model is to find a low-dimensional manifold, $\mathcal{H}$, with $M$ dimensions (usually $M = 2$) for the distribution $p(x)$ of the high-dimensional data space, $\mathcal{D}$, with $D$ dimensions. Latent variable models have typically been developed to handle datasets where all the features are of the same type.

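Suppose that the $D$-dimensional data space is defined by $|\mathcal{R}|$ continuous, $|\mathcal{B}|$ binary and $|\mathcal{C}|$ multi-category features. The link functions for continuous, binary and multi-category features are defined in equations (1), (2) and (3) respectively:

$$\mu^{\mathcal{R}} = \Phi(z) W^{\mathcal{R}} \tag{1}$$

$$\mu^{\mathcal{B}} = g^{\mathcal{B}}(\Phi(z) W^{\mathcal{B}}) = \frac{\exp(\Phi(z) W^{\mathcal{B}})}{1 + \exp(\Phi(z) W^{\mathcal{B}})} \tag{2}$$

$$\mu^{\mathcal{C}}_{s_d} = g^{\mathcal{C}}(\Phi(z) w^{\mathcal{C}}_{s_d}) = \frac{\exp(\Phi(z) w^{\mathcal{C}}_{s_d})}{\sum_{s'_d} \exp(\Phi(z) w^{\mathcal{C}}_{s'_d})} \tag{3}$$

where $s_d$ indexes the categories of the $d$-th multi-category feature and $w^{\mathcal{C}}_{s_d}$ is the corresponding column of $W^{\mathcal{C}}$.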
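The three link functions can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names, the random stand-in for the basis matrix and the array shapes are assumptions made for the example, and the softmax is written for a single multi-category feature.

```python
import numpy as np

def link_continuous(phi, W_r):
    """Identity link, equation (1): means of the Gaussian features."""
    return phi @ W_r

def link_binary(phi, W_b):
    """Logistic sigmoid link, equation (2): Bernoulli means in (0, 1)."""
    a = phi @ W_b
    return 1.0 / (1.0 + np.exp(-a))   # equals exp(a) / (1 + exp(a))

def link_categorical(phi, W_c):
    """Softmax link, equation (3), for ONE multi-category feature.

    W_c has one column per category, so each output row sums to 1.
    """
    a = phi @ W_c
    a = a - a.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Hypothetical sizes: K latent grid points, L basis functions.
rng = np.random.default_rng(0)
K, L = 4, 3
phi = rng.normal(size=(K, L))              # stand-in for the basis matrix
mu_r = link_continuous(phi, rng.normal(size=(L, 2)))
mu_b = link_binary(phi, rng.normal(size=(L, 5)))
mu_c = link_categorical(phi, rng.normal(size=(L, 8)))
```

Each row of the outputs corresponds to one latent grid point, so the same basis matrix drives all three feature types.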

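We write each observation vector, $x$, in terms of sub-vectors $x^{\mathcal{R}}$, $x^{\mathcal{B}}$ and $x^{\mathcal{C}}$ for continuous, binary and multi-category features respectively. The likelihood of each type of feature is given by

$$p(x^{\mathcal{R}} | z, W^{\mathcal{R}}, \beta) = p(x^{\mathcal{R}} | \mu^{\mathcal{R}}, \beta) = \left(\frac{\beta}{2\pi}\right)^{|\mathcal{R}|/2} \exp\left\{-\frac{\beta}{2} \|\mu^{\mathcal{R}} - x^{\mathcal{R}}\|^2\right\}. \tag{4}$$

$$p(x^{\mathcal{B}} | z, W^{\mathcal{B}}) = p(x^{\mathcal{B}} | \mu^{\mathcal{B}}) = \prod_{d=1}^{|\mathcal{B}|} \left(\mu^{\mathcal{B}}_d\right)^{x^{\mathcal{B}}_d} \left(1 - \mu^{\mathcal{B}}_d\right)^{(1 - x^{\mathcal{B}}_d)}. \tag{5}$$

$$p(x^{\mathcal{C}} | z, W^{\mathcal{C}}) = p(x^{\mathcal{C}} | \mu^{\mathcal{C}}) = \prod_{d=1}^{|\mathcal{C}|} \prod_{s_d=1}^{S_d} \left(\mu^{\mathcal{C}}_{s_d}\right)^{x^{\mathcal{C}}_{s_d}}. \tag{6}$$

Here $S_d$ is the number of categories of the $d$-th multi-category feature.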
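A small numerical sketch of the three likelihoods in log form may help to make equations (4) to (6) concrete. The function names, the clipping constant `eps` and the toy values below are assumptions for illustration only; the multinomial term is written for a single 1-of-S encoded feature.

```python
import numpy as np

def log_gaussian(x_r, mu_r, beta):
    """Log of equation (4) for every latent point; mu_r is (K, |R|)."""
    d = x_r.shape[0]
    sq = np.sum((mu_r - x_r) ** 2, axis=1)
    return 0.5 * d * np.log(beta / (2.0 * np.pi)) - 0.5 * beta * sq

def log_bernoulli(x_b, mu_b, eps=1e-12):
    """Log of equation (5); eps (an assumed guard) avoids log(0)."""
    mu = np.clip(mu_b, eps, 1.0 - eps)
    return np.sum(x_b * np.log(mu) + (1.0 - x_b) * np.log(1.0 - mu), axis=1)

def log_multinomial(x_c, mu_c, eps=1e-12):
    """Log of equation (6) for one 1-of-S encoded feature."""
    return np.sum(x_c * np.log(np.clip(mu_c, eps, 1.0)), axis=1)

# Toy example with K = 2 latent points (values chosen for illustration).
x_r = np.array([0.0, 0.0])
x_b = np.array([1.0, 0.0])
x_c = np.array([0.0, 1.0, 0.0])
mu_r = np.array([[0.0, 0.0], [1.0, 1.0]])
mu_b = np.array([[0.9, 0.1], [0.5, 0.5]])
mu_c = np.array([[0.2, 0.7, 0.1], [1 / 3, 1 / 3, 1 / 3]])

# Log of the product of the three likelihoods, one value per latent point.
log_p = (log_gaussian(x_r, mu_r, beta=1.0)
         + log_bernoulli(x_b, mu_b)
         + log_multinomial(x_c, mu_c))
```

Working in the log domain turns the product of the three noise models into a sum, which is numerically safer when many features are involved.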

Then we compute the product of the likelihoods for the Gaussian (equation (4)), Bernoulli (equation (5)) and multinomial (equation (6)) distributions, and find the distribution of $x$ by integrating over the latent variables, $z$,

$$p(x|\Omega) = \int p(x^{\mathcal{R}} | z, W^{\mathcal{R}}, \beta)\, p(x^{\mathcal{B}} | z, W^{\mathcal{B}})\, p(x^{\mathcal{C}} | z, W^{\mathcal{C}})\, p(z)\, dz, \tag{7}$$

where $\Omega = \{W^{\mathcal{R}}, \beta, W^{\mathcal{B}}, W^{\mathcal{C}}\}$ contains all the model parameters. We use as prior distribution, $p(z)$, a sum of delta functions as for the standard GTM and LTM:

$$p(z) = \frac{1}{K} \sum_{k=1}^{K} \delta(z - z_k). \tag{8}$$

The data distribution can now be derived from equations (7) and (8), where we use the same mixing coefficient for all components (i.e. $\pi_k = 1/K$):

$$p(x|\Omega) = \frac{1}{K} \sum_{k=1}^{K} p(x | z_k, \Omega). \tag{9}$$

The log-likelihood of the complete data takes the form

$$L(\Omega) = \sum_{n=1}^{N} \log\left\{\frac{1}{K} \sum_{k=1}^{K} p(x_n | z_k, \Omega)\right\}. \tag{10}$$

The choice of noise model is related to the corresponding type of data and also the link function mapping from latent to data space (Kabán and Girolami, 2001). The exponential family of distributions is used here to model mixed-type data under the latent variable framework. From here onward, to simplify the notation, we use $x^{\mathcal{M}}$, where $\mathcal{M}$ can represent either $\mathcal{R}$, $\mathcal{B}$ or $\mathcal{C}$, to indicate the type of feature for a data point $x$.

2.1 An Expectation Maximization (EM) Algorithm for GGTM

Our proposed model is based on a mixture of distributions where each component is a product of Gaussian, Bernoulli and/or multinomial distributions. The parameters of the mixture model can be determined using an EM algorithm: in the E-step, we use the current parameter set, $\Omega$, to compute the posterior probabilities (responsibilities) using Bayes’ theorem,

$$r_{kn} = p(z_k | x_n, W) = \frac{p(x_n | z_k, W)}{\sum_{k'} p(x_n | z_{k'}, W)}, \tag{11}$$

where

$$p(x_n | z_k, W) = p(x_n^{\mathcal{R}} | z_k, W^{\mathcal{R}}, \beta)\, p(x_n^{\mathcal{B}} | z_k, W^{\mathcal{B}})\, p(x_n^{\mathcal{C}} | z_k, W^{\mathcal{C}}). \tag{12}$$
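The E-step can be sketched as below, assuming the component log-likelihoods $\log p(x_n | z_k, W)$ have already been collected in a $K \times N$ array. The function name and the log-sum-exp stabilisation are choices made for this illustration rather than details taken from the paper.

```python
import numpy as np

def responsibilities(log_p):
    """E-step of equation (11) from a (K, N) array of log p(x_n | z_k, W).

    Returns the (K, N) responsibility matrix R and the log-likelihood of
    equation (10) under the uniform prior 1/K.
    """
    K = log_p.shape[0]
    m = log_p.max(axis=0, keepdims=True)             # per-point maximum
    log_norm = m + np.log(np.exp(log_p - m).sum(axis=0, keepdims=True))
    R = np.exp(log_p - log_norm)                     # each column sums to 1
    log_lik = float(np.sum(log_norm - np.log(K)))
    return R, log_lik

# Toy example: K = 2 latent points, N = 2 data points.
R, log_lik = responsibilities(np.log(np.array([[0.2, 0.1],
                                               [0.6, 0.3]])))
```

Because the prior is uniform, the factor $1/K$ cancels in the responsibilities and only enters the log-likelihood.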

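We use the maximization of the relative likelihood (Bishop, 1995), which does not require the computation of the log of a sum. The relative likelihood between the old and new set of parameters can be calculated as

$$
\begin{aligned}
Q &= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn} \log\left\{ p(x_n | z_k, W)\, p(z_k) \right\} \\
  &= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{kn} \Big\{ \big[ x_n^{\mathcal{R}} \theta_k^{\mathcal{R}} - G(\theta_k^{\mathcal{R}}) + \log(p_0(x_n^{\mathcal{R}})) \big] \\
  &\qquad + \big[ x_n^{\mathcal{B}} \theta_k^{\mathcal{B}} - G(\theta_k^{\mathcal{B}}) + \log(p_0(x_n^{\mathcal{B}})) \big] \\
  &\qquad + \big[ x_n^{\mathcal{C}} \theta_k^{\mathcal{C}} - G(\theta_k^{\mathcal{C}}) + \log(p_0(x_n^{\mathcal{C}})) \big] + \log(p(z_k)) \Big\},
\end{aligned}
\tag{13}
$$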

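where $\theta_k^{\mathcal{M}} = \Phi(z_k) W^{\mathcal{M}}$. In the M-step we maximize the function $Q$ with respect to each type of weight sub-matrix $W^{\mathcal{M}}$:

$$\frac{\partial Q}{\partial W^{\mathcal{M}}} = \Phi^{T} \left[ R X^{\mathcal{M}} - E\, g(\Phi W^{\mathcal{M}}) \right], \tag{14}$$

where $\Phi$ is a $K \times L$ matrix, $R$ is a $K \times N$ matrix calculated using equation (11), $X^{\mathcal{M}}$ is an $N \times |\mathcal{M}|$ data sub-matrix and the diagonal matrix $E$ contains the values

$$e_{kk} = \sum_{n=1}^{N} r_{kn}. \tag{15}$$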

In the case of an isotropic Gaussian with unit variance, the link function $g(\cdot)$ is the identity and by setting the derivative to zero we obtain, as in the standard GTM (Bishop and Svensen, 1998),

$$W^{\mathcal{R}} = (\Phi^{T} E \Phi)^{-1} \Phi^{T} R X^{\mathcal{R}}. \tag{16}$$
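As a minimal sketch of this closed-form update, the weights for the continuous features can be obtained by solving the linear system $\Phi^{T} E \Phi\, W = \Phi^{T} R X$ instead of forming the inverse explicitly. The function name and the random test data are assumptions for illustration.

```python
import numpy as np

def update_w_continuous(Phi, R, X_r):
    """Solve (Phi^T E Phi) W = Phi^T R X for the continuous weights."""
    E = np.diag(R.sum(axis=1))            # diagonal matrix of equation (15)
    A = Phi.T @ E @ Phi                   # (L, L)
    B = Phi.T @ R @ X_r                   # (L, |R|)
    return np.linalg.solve(A, B)          # avoids forming the inverse

# Hypothetical sizes: K = 6 latent points, L = 3 basis functions, N = 10.
rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 3))
R = rng.random((6, 10))
R /= R.sum(axis=0, keepdims=True)         # columns sum to 1, as in (11)
X_r = rng.normal(size=(10, 2))
W_r = update_w_continuous(Phi, R, X_r)
```

At the returned weights the gradient of equation (14), with the identity link, vanishes up to numerical precision.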

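For other link functions, a Generalised EM (GEM) algorithm (McLachlan and Krishnan, 1997) is used, because convergence to a local maximum is guaranteed as long as each M-step increases the relative likelihood, without having to maximize it fully (Kabán and Girolami, 2001). A simple gradient-based update can be obtained for $W^{\mathcal{M}}$ from equation (14):

$$\Delta W^{\mathcal{M}} \propto \Phi^{T} \left[ R X^{\mathcal{M}} - E\, g(\Phi W^{\mathcal{M}}) \right], \tag{17}$$

which can be used as an inner loop in the M-step. Under this gradient update of the weight matrix $W^{\mathcal{M}}$, the natural parameter $\theta_k^{\mathcal{M}}$ is updated as shown below; the correlations between the vectors $\phi_k$ and $\phi_{k'}$ are responsible for preserving the neighbourhood and are therefore required for a topographic organisation (Kabán and Girolami, 2001):

$$\left(\theta_k^{\mathcal{M}}\right)^{\text{new}} = \phi_k W^{\mathcal{M}} + \eta \sum_{n=1}^{N} \sum_{k'} \left(\phi_k \phi_{k'}^{T}\right) r_{k'n} \left( x_n^{\mathcal{M}} - \mu_{k'}^{\mathcal{M}} \right),$$

where $\phi_k = \Phi(z_k)$ is the $k$-th row of $\Phi$ and $\eta$ is the step size.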
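The gradient-based inner loop can be illustrated for the binary features as follows. This is only a sketch under stated assumptions: the step size `eta`, the number of inner iterations `n_inner` and the helper `q_binary` (which evaluates the part of $Q$ that depends on the binary weights, using the Bernoulli cumulant $G(\theta) = \log(1 + e^{\theta})$) are all hypothetical choices, not values reported in the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def q_binary(Phi, R, X_b, W_b):
    """Part of Q that depends on the binary weights W_b."""
    theta = Phi @ W_b                                 # (K, |B|)
    e_diag = R.sum(axis=1)                            # values of equation (15)
    G = np.log1p(np.exp(theta)).sum(axis=1)           # Bernoulli cumulant
    # sum_k (sum_n r_kn x_n) . theta_k  -  sum_k e_kk G(theta_k)
    return float(np.sum((R @ X_b) * theta) - np.sum(e_diag * G))

def gem_step_binary(Phi, R, X_b, W_b, eta=0.01, n_inner=5):
    """A few gradient-ascent steps on Q (inner loop of a GEM M-step)."""
    E = np.diag(R.sum(axis=1))
    for _ in range(n_inner):
        grad = Phi.T @ (R @ X_b - E @ sigmoid(Phi @ W_b))   # equation (14)
        W_b = W_b + eta * grad                              # equation (17)
    return W_b

# Hypothetical sizes: K = 6, L = 3, N = 10, |B| = 4.
rng = np.random.default_rng(0)
Phi = rng.uniform(-1.0, 1.0, size=(6, 3))
R = rng.random((6, 10))
R /= R.sum(axis=0, keepdims=True)
X_b = (rng.random((10, 4)) > 0.5).astype(float)
W_old = np.zeros((3, 4))
W_new = gem_step_binary(Phi, R, X_b, W_old)
```

With a sufficiently small step size each inner iteration cannot decrease $Q$, which is all that a GEM M-step requires.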

3 VISUALISATION OF MISSING DATA WITH GGTM

The EM framework supports the treatment of missing

values in the GGTM model.

3.1 Continuous Data

The data points $x_n$ are written as $(x_n^{m}, x_n^{o})$, where the superscripts $m$ and $o$ denote subvectors and submatrices of the parameters matching the missing and observed components of the data (Ghahramani and Jordan, 1994). Binary indicator variables $\zeta_{nk}$ are introduced to specify which component of the mixture model generated the data point. Both the indicator variables $\zeta_{nk}$ and the missing inputs $x_n^{m}$ are treated as hidden variables in the EM algorithm. The changes made to the EM algorithm for GTM are detailed in (Sun et al., 2002).

3.2 Discrete Data

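The missing values are inferred in the E-step using the usual posterior means with responsibility $r_{kn}$ computed on the observed data,

$$E\left[ x_n^{\mathcal{D}, m} \,\middle|\, x_n^{o}, \mu^{\mathcal{D}} \right] = \sum_{k=1}^{K} r_{kn}\, \mu_k^{\mathcal{D}, m}, \tag{18}$$

where $\mathcal{D} = \{\mathcal{B} \text{ or } \mathcal{C}\}$. In the M-step, the weight matrix $W^{\mathcal{D}}$ is updated first using the complete training data and we then update $\mu^{\mathcal{D}}$ with

$$\mu_k^{\mathcal{D}} = g^{\mathcal{D}}(\Phi(z_k) W^{\mathcal{D}}). \tag{19}$$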
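The posterior-mean imputation for binary features can be sketched as below. It assumes that missing entries are marked with NaN and that the responsibilities have already been computed from the observed entries only; the function name and the toy values are hypothetical.

```python
import numpy as np

def impute_binary(X_b, R, mu_b):
    """Replace NaN entries of X_b (N, |B|) by sum_k r_kn * mu_kd.

    R is the (K, N) responsibility matrix (assumed to be computed on the
    observed data) and mu_b the (K, |B|) matrix of Bernoulli means.
    """
    expected = R.T @ mu_b                 # (N, |B|) posterior means
    missing = np.isnan(X_b)
    return np.where(missing, expected, X_b)

# Toy example: N = 2 points, |B| = 2 features, K = 2 latent points.
X_b = np.array([[1.0, np.nan],
                [np.nan, 0.0]])
R = np.array([[0.25, 1.0],
              [0.75, 0.0]])
mu_b = np.array([[0.2, 0.4],
                 [0.6, 0.8]])
X_filled = impute_binary(X_b, R, mu_b)
```

Observed entries are left untouched; only the missing ones receive the expected value, which lies in the interval [0, 1] rather than being a hard 0/1 decision.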

4 VISUALISATION QUALITY EVALUATION MEASURES

Algorithms based on GTM are examples of unsupervised learning, which always gives a result when applied to a particular dataset. Thus we cannot tell a priori what the expected or desired outcome is. This makes it difficult to judge which method is the best (i.e. tells us the most about a certain dataset). Here we use metrics that measure the degree of local neighbourhood similarity between data space and latent space, which can be calculated even if the ‘ground truth’ is not known.

4.1 Trustworthiness, Continuity and Mean Relative Rank Errors (MRREs)

Two well-known visualisation quality measures based on comparing neighbourhoods in the data space $x$ and projection space $z$ are trustworthiness and continuity (Venna and Kaski, 2001). A mapping is said to be trustworthy if the k-neighbourhood in the visualised space matches that in the data space, whereas it maintains continuity if the k-neighbourhood in the data space matches that in the visualised space. The higher the measure the better the visualisation, as this implies that local neighbourhoods are better preserved by the projection. We also use mean relative rank errors with respect to data and latent spaces (MRRE$_{\text{data}}$ and MRRE$_{\text{latent}}$), which measure the preservation of the rank of the k-nearest neighbours, in contrast to trustworthiness and continuity, which only consider matches in the k-neighbourhood (Lee and Verleysen, 2008). Note that the lower the MRRE the better the projection quality.
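Trustworthiness and continuity can be computed from two pairwise distance matrices, as in the following sketch. It uses the commonly cited rank-based definition with normaliser $2 / (N k (2N - 3k - 1))$, which is an assumption here since the paper does not reproduce the formula; the function names are hypothetical and the code assumes $k < N/2$.

```python
import numpy as np

def _ranks(D):
    """ranks[i, j] = position of j when row i of D is sorted (self = 0)."""
    D = D.copy()
    np.fill_diagonal(D, -1.0)             # force each point to rank 0 for itself
    order = np.argsort(D, axis=1)
    n = D.shape[0]
    ranks = np.empty_like(order)
    ranks[np.arange(n)[:, None], order] = np.arange(n)[None, :]
    return ranks

def trustworthiness(D_data, D_latent, k):
    """Penalise points that enter the latent k-neighbourhood by mistake."""
    n = D_data.shape[0]
    r_data, r_latent = _ranks(D_data), _ranks(D_latent)
    # intruders: inside the latent k-neighbourhood, outside the data one
    intruders = (r_latent >= 1) & (r_latent <= k) & (r_data > k)
    penalty = np.sum((r_data - k)[intruders])
    return 1.0 - 2.0 * penalty / (n * k * (2 * n - 3 * k - 1))

def continuity(D_data, D_latent, k):
    """Same measure with the roles of data and latent space swapped."""
    return trustworthiness(D_latent, D_data, k)

# Toy example: six points on a line, and a projection that swaps two of them.
x = np.arange(6.0)
y = np.array([0.0, 5.0, 2.0, 3.0, 4.0, 1.0])
D_x = np.abs(x[:, None] - x[None, :])
D_y = np.abs(y[:, None] - y[None, :])
t_perfect = trustworthiness(D_x, D_x, 2)
t_swapped = trustworthiness(D_x, D_y, 1)
```

A projection that preserves all neighbourhoods scores exactly 1, and any intruding neighbour lowers the score in proportion to how far away it was in the data space.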

5 EXPERIMENTAL RESULTS

The GGTM was evaluated on both complete and incomplete synthetic and real-world datasets and compared with the standard GTM for complete data. The weight matrix $W$ was initialised using principal component analysis (PCA). For the metrics in Section 4, we computed pair-wise distances using Hamming distances for the binary features and Euclidean distances for the continuous features. For each distance matrix, we divided each column by its standard deviation. All experiments used 10-fold cross-validation. The visualisation quality measures were computed with a range of neighbourhood sizes (5, 10, 15, 20) and the mean of these measures over the different sizes and cross-validation runs was computed.
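The distance computation for mixed-type data might be sketched as follows. The text states that each distance matrix is scaled column-wise by its standard deviation but does not say how the two matrices are then combined, so the final sum below is an assumption, as are the function name and the toy data.

```python
import numpy as np

def mixed_distance(X_r, X_b):
    """Pairwise distances for data with continuous and binary parts."""
    # Euclidean distances between the continuous sub-vectors
    diff = X_r[:, None, :] - X_r[None, :, :]
    d_euc = np.sqrt((diff ** 2).sum(axis=2))
    # Hamming distances: number of binary features that differ
    d_ham = (X_b[:, None, :] != X_b[None, :, :]).sum(axis=2).astype(float)
    # divide each column by its standard deviation, as described in the text
    d_euc = d_euc / d_euc.std(axis=0, keepdims=True)
    d_ham = d_ham / d_ham.std(axis=0, keepdims=True)
    # ASSUMPTION: the two scaled matrices are simply added together
    return d_euc + d_ham

# Toy example with three points, two continuous and three binary features.
X_r = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
X_b = np.array([[0, 0, 1], [0, 1, 1], [1, 1, 0]])
D_mixed = mixed_distance(X_r, X_b)
```

Note that column-wise scaling generally makes the result asymmetric, and a column whose distances are all equal would need special handling to avoid division by zero.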

5.1 Synthetic Dataset

The synthetic dataset was generated from an equiprobable mixture of two Gaussians, $\mathcal{N}(m_k, I)$ (with $k = 1, 2$), with means $m_1 = (2.0, 3.5)^{T}$ and $m_2 = (3.5, 4.5)^{T}$. A dataset with 9-dimensional binary features from four classes was also generated (these classes were not used as inputs to the visualisation). Both continuous and binary data were combined to make a dataset of 12 features with 2,800 data points. The visualisation results of the complete and missing datasets (10% randomly removed) are shown in Figure 1 and the quality metrics are given in Table 1.

Table 1: GTM and GGTM visualisation quality metrics of the 12-dimensional synthetic datasets. Each figure represents the average over a 10-fold cross-validation with one standard deviation on the test sets.

                   GTM complete     GGTM complete    GGTM missing
Trustworthiness    0.969 ± 0.003    0.949 ± 0.024    0.947 ± 0.027
Continuity         0.964 ± 0.003    0.970 ± 0.013    0.969 ± 0.014
MRRE (data)        0.040 ± 0.000    0.043 ± 0.003    0.042 ± 0.003
MRRE (latent)      0.004 ± 0.000    0.038 ± 0.002    0.037 ± 0.002

[Figure 1: GTM and GGTM visualisations of the synthetic 12-dimensional datasets with 3 continuous and 9 binary features. Panels: (a) GTM (training set); (b) GTM (test set); (e) GGTM missing (training set); (f) GGTM missing (test set).]

We also generated a dataset with two multi-category features with 8 and 16 categories in the first and second features respectively. We appended the multi-category features to the previous 12-dimensional dataset and used a 1-of-S encoding scheme for the multi-category features. Labels were based on the four classes in the binary data. The visualisation results of the 14-dimensional complete and missing datasets are shown in Figure 2 and the corresponding quality metrics are given in Table 2. The proportion of missing values has also been increased to 30%, 50%, 70% and 90% without substantially degrading the visualisation quality measures.

Table 2: GTM and GGTM visualisation quality metrics of the 14-dimensional synthetic datasets.

                   GTM complete     GGTM complete    GGTM missing
Trustworthiness    0.962 ± 0.004    0.977 ± 0.009    0.973 ± 0.014
Continuity         0.946 ± 0.008    0.980 ± 0.007    0.976 ± 0.013
MRRE (data)        0.045 ± 0.001    0.044 ± 0.001    0.116 ± 0.005
MRRE (latent)      0.045 ± 0.001    0.041 ± 0.002    0.132 ± 0.005

[Figure 2: GTM and GGTM visualisations of the synthetic 14-dimensional datasets with 3 continuous, 9 binary and 2 multi-category features. Panels: (a) GTM (training set); (b) GTM (test set); (e) GGTM missing (training set); (f) GGTM missing (test set).]
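The 1-of-S encoding used for the multi-category features in this section can be illustrated with a short sketch. The function name and the example labels are hypothetical; categories are assumed to be coded as integers starting at 0.

```python
import numpy as np

def one_of_s(labels, n_categories):
    """Encode integer category labels as 1-of-S indicator vectors."""
    labels = np.asarray(labels)
    out = np.zeros((labels.size, n_categories))
    out[np.arange(labels.size), labels] = 1.0
    return out

# Small example with S = 3 categories.
encoded = one_of_s([0, 2, 1], 3)

# Two features with 8 and 16 categories give 24 indicator columns in total.
both = np.hstack([one_of_s([3, 7], 8), one_of_s([0, 15], 16)])
```

Each row of an encoded feature contains exactly one 1, which is the form assumed by the multinomial likelihood of equation (6).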

5.2 Hypothyroid Dataset

This real-world dataset is publicly available from the UCI data repository (Bache and Lichman, 2013). The dataset consists of two variable types: 15 binary and 6 continuous features. It contains three classes: primary hypothyroid, compensated hypothyroid and normal. The dataset was originally divided into a training set of 3,772 data points (93 with primary hypothyroid, 191 with compensated hypothyroid and 3,488 normal) and a test set of 3,428 data points (73 with primary hypothyroid, 177 with compensated hypothyroid and 3,178 normal). These training and test sets were merged prior to running a 10-fold cross-validation. The visualisation results of the complete and missing datasets are shown in Figure 3 and the quality metrics are given in Table 3.

Table 3: GTM and GGTM visualisation quality metrics of the hypothyroid disease datasets.

                   GTM complete     GGTM complete    GGTM missing
Trustworthiness    0.718 ± 0.022    0.718 ± 0.015    0.716 ± 0.014
Continuity         0.804 ± 0.017    0.843 ± 0.014    0.835 ± 0.007
MRRE (data)        0.018 ± 0.000    0.019 ± 0.000    0.019 ± 0.000
MRRE (latent)      0.016 ± 0.000    0.016 ± 0.000    0.016 ± 0.000

[Figure 3: GTM and GGTM visualisations of the thyroid disease datasets. The cyan circles, red plus signs and blue squares represent primary hypothyroid, compensated hypothyroid and normal respectively. Panels: (a) GTM (training set); (b) GTM (test set); (e) GGTM missing (training set); (f) GGTM missing (test set).]

6 CONCLUSIONS

A generalisation of the GTM to heterogeneous and missing data has been described and assessed in this paper. This involves modelling the continuous and discrete data with Gaussian and Bernoulli/multinomial distributions respectively. These extensions have been suggested in (Bishop et al., 1998) but this is the first time the mathematical details have been worked out and an implementation written and evaluated.

Visualisation results for synthetic data using the GGTM have shown more compact clusters for each class compared to the standard GTM, whereas for the real dataset no significant difference was observed. For synthetic datasets with missing values, GGTM visualisations have greater compactness for each class. In terms of visualisation quality evaluation metrics, we observed that for a mix of continuous and binary data, the trustworthiness and MRRE$_{\text{data}}$ were slightly better for standard GTM compared to GGTM, whereas the continuity and MRRE$_{\text{latent}}$ were better for GGTM compared to standard GTM. However, for a mix of continuous, binary and multi-category features, all the quality evaluation measures were better for GGTM compared to the standard GTM. Missing values have caused limited deterioration in results compared to the complete data case.

REFERENCES

Bache, K. and Lichman, M. (2013). UCI machine learning repository.

Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press.

Bishop, C. M. and Svensen, M. (1998). GTM: The generative topographic mapping. Neural Computation, 10(1):215–234.

Bishop, C. M., Svensen, M., and Williams, C. K. I. (1998). Developments of the generative topographic mapping. Neurocomputing, 21(1):203–224.

de Leon, A. R. and Chough, K. C. (2013). Analysis of Mixed Data: Methods & Applications. Taylor & Francis Group. Chapman and Hall/CRC.

Dunson, D. B. (2000). Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 62(2):355–366.

Ghahramani, Z. and Jordan, M. I. (1994). Learning from incomplete data. Technical Report AIM-1509.

Kabán, A. and Girolami, M. (2001). A combined latent class and trait model for the analysis and visualization of discrete data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 23(8):859–872.

Krzanowski, W. J. (1983). Distance between populations using mixed continuous and categorical variables. Biometrika, 70(1):235–243.

Lee, J. A. and Verleysen, M. (2008). Rank-based quality assessment of nonlinear dimensionality reduction. In ESANN, pages 49–54.

McLachlan, G. and Krishnan, T. (1997). The EM algorithm and extensions. Wiley, New York.

Moustaki, I. (1996). A latent trait and a latent class model for mixed observed variables. British Journal of Mathematical and Statistical Psychology, 49(2):313–334.

Sammel, M. D., Ryan, L. M., and Legler, J. M. (1997). Latent variable models for mixed discrete and continuous outcomes. Journal of the Royal Statistical Society. Series B (Methodological), 59(3):667–678.

Sun, Y., Tino, P., and Nabney, I. (2002). Visualisation of incomplete data using class information constraints. In Winkler, J. and Niranjan, M., editors, Uncertainty in Geometric Computations, volume 704 of The Springer International Series in Engineering and Computer Science, pages 165–173. Springer US.

Teixeira-Pinto, A. and Normand, S. T. (2009). Correlated bivariate continuous and binary outcomes: issues and applications. Statistics in Medicine, 28(13):1753–1773.

Tipping, M. E. (1999). Probabilistic visualisation of high-dimensional binary data. In Proceedings of the 1998 Conference on Advances in Neural Information Processing Systems II, pages 592–598, Cambridge, MA, USA. MIT Press.

Venna, J. and Kaski, S. (2001). Neighborhood preservation in nonlinear projection methods: an experimental study. In Proceedings of the International Conference on Artificial Neural Networks, ICANN ’01, pages 485–491, London, UK. Springer-Verlag.

Yu, K. and Tresp, V. (2004). Heterogenous data fusion via a probabilistic latent-variable model. In Müller-Schloer, C., Ungerer, T., and Bauer, B., editors, ARCS, volume 2981 of Lecture Notes in Computer Science, pages 20–30. Springer.

IVAPP2015-InternationalConferenceonInformationVisualizationTheoryandApplications

238