An Alternative to Restricted-Boltzmann Learning for Binary Latent
Variables based on the Criterion of Maximal Mutual Information
David Edelman
University College Dublin, Ireland
Keywords:
Machine Learning, Data Compression, Information Theory, Unsupervised Learning.
Abstract:
The latent binary variable training problem used in the pre-training process for Deep Neural Networks is approached using the Principle (and related Criterion) of Maximum Mutual Information (MMI). This is presented as an alternative to the most widely-accepted 'Restricted Boltzmann Machine' (RBM) approach of Hinton. The primary contribution of the present article is to present the MMI approach as an arguably more natural and logically simpler means to the same ends. Additionally, the relative ease and effectiveness of the approach in application are demonstrated for an example case.
1 INTRODUCTION
As has become evident in recent years, the use of pre-training is crucial to the overall training of Deep Neural Networks. Historically, weight initialisation in the training of feed-forward Neural Networks was carried out by mere pseudo-random sampling, which worked satisfactorily for networks with few hidden layers. The inadequacy of this form of initialisation, however, inhibited research into networks of deeper architecture, and it was the key breakthrough of Hinton in 1999 (Hinton, 1999), introducing new methods for unsupervised 'pre-training', which first enabled the widespread use of networks of deeper architecture and in turn marked the beginning of the resurgence in Neural Network research known as Deep Learning. In essence, the notion of 'pre-training' in a feed-forward network amounts to an iterated succession of unsupervised data compressions proceeding forward through the network before supervised learning or training has begun. The method of compression proposed by Hinton is referred to as the 'Restricted Boltzmann Machine' (hereafter, RBM), a construct which owes its heuristics to an analogy with problems in thermodynamics, and which requires an intricate estimation procedure involving advanced Monte Carlo simulation, including Gibbs Sampling, in a process referred to as Contrastive Divergence. The RBM approach to pre-training has proven effective, and has indeed become one of the most widely-used heuristics for carrying out pre-training. One question worth asking, however, is whether a logically simpler, more direct approach (not involving analogies, heuristics, or intricate simulations and calculations) might be found. It is this question to which the present article addresses itself.
In what follows, a method is proposed based on a probability-based measure called Mutual Information which, it is argued, should be maximal between a pair of variables if one is to be considered an optimal compression of the other. Accordingly, a Maximum Mutual Information (hereafter, MMI) Criterion is introduced and applied in training to approach optimal compression from one network layer to the next.
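For concreteness, the quantity being maximised is the standard (Shannon) mutual information between a visible layer and its binary compression; writing $X$ for the visible layer and $Z$ for the binary latent layer (notation introduced here purely for exposition), the usual discrete form is

$$I(X;Z)=\sum_{x}\sum_{z}p(x,z)\,\log\frac{p(x,z)}{p(x)\,p(z)}=H(Z)-H(Z\mid X),$$

so that the MMI Criterion amounts to choosing the layer's weights so as to make $I(X;Z)$ as large as possible.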
It will be argued that maximising this criterion leads to a practicable algorithm serving a similar purpose to an RBM, and this algorithm is then exhibited as being effective and simple to implement, with a practical example from the Financial Markets used as a demonstration.
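As a preview of the kind of computation the criterion entails (the estimator and training procedure actually used are developed in the sequel), the following minimal sketch maximises an empirical mutual-information surrogate on toy data: the mutual information between each binary input feature and a single deterministic binary latent unit is estimated from sample counts, summed, and improved by a crude random hill-climb. All names, the toy data, and the per-feature surrogate for the full $I(X;Z)$ are illustrative assumptions, not the algorithm proposed in this article.

```python
import numpy as np

def binary_mutual_information(a, b):
    """Empirical mutual information (in nats) between two {0,1}-valued arrays."""
    a, b = np.asarray(a, dtype=int), np.asarray(b, dtype=int)
    mi = 0.0
    for va in (0, 1):
        pa = np.mean(a == va)
        for vb in (0, 1):
            pb = np.mean(b == vb)
            pab = np.mean((a == va) & (b == vb))
            if pab > 0:  # 0 log 0 = 0 by convention
                mi += pab * np.log(pab / (pa * pb))
    return mi

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 8))  # toy binary 'visible' data

def latent(X, w):
    """A single deterministic binary latent unit: 1[Xw > 0]."""
    return (X @ w > 0).astype(int)

# Crude random hill-climb on the summed per-feature mutual information,
# used here purely as an illustrative stand-in for maximising I(X; Z).
w = rng.normal(size=X.shape[1])
best = sum(binary_mutual_information(X[:, j], latent(X, w))
           for j in range(X.shape[1]))
for _ in range(200):
    w_trial = w + 0.1 * rng.normal(size=X.shape[1])
    score = sum(binary_mutual_information(X[:, j], latent(X, w_trial))
                for j in range(X.shape[1]))
    if score > best:
        w, best = w_trial, score
```

In practice the hill-climb would be replaced by a smoother, gradient-based optimiser, but the objective itself is nothing more than the empirical form of the equation displayed above.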
Before proceeding, it should be mentioned that while the methods proposed here for the 'pre-training' problem would generally be applied in place of the RBM methodology, the latter will not be reviewed here. This is because the RBM construct and the advanced techniques involved in its application do not lend themselves well to brief description and explanation, so readers unfamiliar with RBMs would not benefit from an attempt to summarise them here, even in general terms. By contrast, it is believed that a wide variety of readers will be able to follow the (arguably much simpler) approach adopted here for addressing the 'pre-training' problem, where many such readers might not readily be able to grasp and apply the RBM approach without considerable further study.