An Alternative to Restricted-Boltzmann Learning for Binary Latent
Variables based on the Criterion of Maximal Mutual Information
David Edelman
University College Dublin, Ireland
Keywords:
Machine Learning, Data Compression, Information Theory, Unsupervised Learning.
Abstract:
The latent binary variable training problem used in the pre-training process for Deep Neural Networks is approached using the Principle (and related Criterion) of Maximum Mutual Information (MMI). This is presented as an alternative to the most widely-accepted 'Restricted Boltzmann Machine' (RBM) approach of Hinton. The primary contribution of the present article is to present the MMI approach as an arguably more natural and logically simpler means to the same ends. Additionally, the relative ease and effectiveness of the approach in application are demonstrated on an example case.
1 INTRODUCTION
As has become evident in recent years, the use of pre-training is crucial to the overall training of Deep Neural Networks. Historically, weight initialisation for feed-forward Neural Networks had been carried out by mere pseudo-random sampling, which worked satisfactorily for networks with few hidden layers. The inadequacy of this form of initialisation, however, inhibited research into networks of deeper architecture, and it was the key breakthrough of Hinton in 1999 (Hinton, 1999), introducing new methods for unsupervised 'pre-training', which first enabled the widespread use of networks of deeper architecture and in turn marked the beginning of the resurgence in the research area of Neural Networks known as Deep Learning. In essence, the notion of 'pre-training' in a feed-forward network amounts to an iterated succession of unsupervised data compressions carried forward into the network before supervised learning or training has begun. The method of compression proposed by Hinton is referred to as the 'Restricted Boltzmann Machine' (hereafter, RBM), a construct which owes its heuristics to an analogy with problems in thermodynamics, and which requires an intricate estimation procedure involving advanced Monte Carlo simulation, including Gibbs Sampling, in a process referred to as Contrastive Divergence. The RBM approach to pre-training has been proven effective, and has indeed become one of the most widely-used heuristics for carrying out pre-training. One question worth asking, however, is whether a logically simpler, more direct approach (not involving analogies, heuristics, or intricate simulations and calculations) might be found. It is this question to which the present article addresses itself.
In what follows, a method is introduced based on a probability-based measure called Mutual Information, which, it is argued, should be maximal between a pair of variables if one is considered to be an optimal compression of the other. Accordingly, a Maximum Mutual Information (hereafter, MMI) Criterion is introduced and applied in training so as to approach optimal compression from one network layer to the next. It will be argued that this leads to a practicable algorithm for achieving a similar aim as an RBM, and this algorithm is then exhibited as being effective and simple to implement, using a practical example from the Financial Markets as a demonstration.
Before proceeding, it should be mentioned that while the methods proposed here for the 'pre-training' problem would generally be applied in place of the RBM methodology, the latter will not be reviewed here. This is because it is believed that the RBM construct and the advanced techniques involved in its application do not lend themselves well to brief description and explanation, and it is therefore felt that readers unfamiliar with RBMs would not benefit from an attempt to describe it all here, even in general terms. By contrast, it is believed that a wide variety of readers will be able to follow the (arguably much simpler) approach adopted here for addressing the 'pre-training' problem, where many such readers might not readily be able to grasp and apply the RBM approach without considerable further study.
2 BACKGROUND
As was mentioned earlier, unsupervised pre-training of feed-forward Neural Networks has enabled Deep Network architectures which were not possible to train previously. The RBM notwithstanding, we approach the pre-training afresh and consider how one might carry out such pre-training based on first principles. In essence, the object is to compress a set of input variables into a set of binary (or 'sigmoidally-approximated' binary) variables. We propose an 'Information-Theoretic' approach.
Consider the Shannon 'Information' (Shannon, 1949) of a random variable X with density $p_X(x)$, or merely the Entropy of X (in units of Information),
$$\mathcal{E}(X) = -\,\mathbb{E}_X\{\log p_X(X)\},$$
where the expectation is understood to be with respect
to the distribution of X.
Next, the Mutual Information (see (Cover and
Thomas, 2006) and elsewhere) between variables X
and Y is given by
$$\mathcal{H}(X;Y) = \mathbb{E}_{XY}\left\{\log \frac{p_{XY}(X,Y)}{p_X(X)\,p_Y(Y)}\right\}.$$
[Note that the above is in the form of a relative entropy (Kullback-Leibler divergence) and hence nonnegative; note also that if X and Y are independent, the ratio is identically unity and the expectation is zero.]
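To make the definition concrete, the following minimal sketch (added here for illustration; it is not part of the original text) computes a plug-in estimate of the Mutual Information between two discrete variables from paired samples, using empirical joint and marginal probabilities:

import numpy as np

def mutual_information(x, y):
    """Plug-in estimate (in nats) of the Mutual Information between two
    discrete variables, computed from paired samples."""
    x, y = np.asarray(x), np.asarray(y)
    xs, ys = np.unique(x), np.unique(y)
    # Empirical joint distribution p_XY and its marginals p_X, p_Y.
    p_xy = np.array([[np.mean((x == a) & (y == b)) for b in ys] for a in xs])
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

rng = np.random.default_rng(0)
a = rng.integers(0, 2, 10000)
b = rng.integers(0, 2, 10000)
print(mutual_information(a, b))  # independent coins: close to 0
print(mutual_information(a, a))  # identical variables: close to log 2 (about 0.693)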
The approach proposed here, then, is based on direct maximisation of the (estimated) shared Information between the probability distributions of the input and output ('compression') variables.
We proceed to specifics in the next section.
3 FORMULATION
Given observable variables $X = (X_1, \ldots, X_{m_X})$ and latent binary variables $h = (h_1, \ldots, h_{m_h})$, consider the problem usually addressed by a classic RBM in the present context, where
$$p(h|X) = \prod_{j=1}^{m_h} p(h_j|X),$$
where
$$p(h_j = 1|X) = 1/\big(1 + \exp(-b_j - X \cdot W_{\cdot j})\big),$$
and $b$ denotes a vector of biases and $W_{\cdot j}$ denotes the $j$-th column of the weight matrix $W$. For convenience, in the sequel we shall refer to this quantity as $f_j(X; b, W)$.
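As a concrete rendering of the encoder just defined, here is a minimal sketch (an illustration only; the array names and shapes are assumptions rather than anything specified in the paper) that evaluates $f_j(x_i;b,W)$ for an entire sample at once:

import numpy as np

def latent_on_probabilities(X, b, W):
    """Return F with F[i, j] = p(h_j = 1 | x_i) = sigmoid(b_j + x_i . W[:, j]).

    X : (n, m_x) array of inputs
    b : (m_h,) vector of biases
    W : (m_x, m_h) weight matrix
    """
    logits = b + X @ W
    return 1.0 / (1.0 + np.exp(-logits))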
In what follows we seek to estimate b and W by
maximising (as foreshadowed in the previous section)
the Mutual Information between h and X.
Before proceeding, it is important to note that the Mutual Information may be written in terms of Entropy ($\mathcal{E}$):
$$\mathbb{E}_{Xh}\big(\log[p(X,h)/\{p(X)\,p(h)\}]\big) = \mathcal{E}(h) - \mathbb{E}\{\mathcal{E}(h|X)\}.$$
Since the components of h are conditionally independent given X, the quantity $\mathcal{E}(h|X)$ is straightforward to calculate, as the sum of the entropies of the components of h:
$$\mathcal{E}(h|X) = -\sum_j \Big[ f_j(X;b,W)\log f_j(X;b,W) + \big(1 - f_j(X;b,W)\big)\log\big(1 - f_j(X;b,W)\big) \Big].$$
For a random sample of X, $x_1, \ldots, x_n$, if one conditions on the sample, then under the permutation distribution the probability of any particular $x_i$ is $\frac{1}{n}$. Hence, the quantity $\mathbb{E}\{\mathcal{E}(h|X)\}$ is just the average of the above expression over the sample:
$$\mathbb{E}\{\mathcal{E}(h|X)\} = -\frac{1}{n}\sum_{i=1}^{n}\sum_j \Big[ f_j(x_i;b,W)\log f_j(x_i;b,W) + \big(1 - f_j(x_i;b,W)\big)\log\big(1 - f_j(x_i;b,W)\big) \Big].$$
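In code, this sample average is only a few lines; a sketch under the same assumed conventions as the snippet above, with F the $n \times m_h$ array of values $f_j(x_i;b,W)$:

import numpy as np

def mean_conditional_entropy(F, eps=1e-12):
    """Sample average of the conditional entropy of h given X.

    F : (n, m_h) array with F[i, j] = f_j(x_i; b, W).
    """
    F = np.clip(F, eps, 1.0 - eps)  # guard the logarithms at 0 and 1
    per_unit = -(F * np.log(F) + (1.0 - F) * np.log(1.0 - F))
    return float(per_unit.sum(axis=1).mean())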
Next, in order to compute $\mathcal{E}(h)$, some special attention will prove necessary. This is because the marginal density of h conditional on the sample is given by
$$\frac{1}{n}\sum_{i=1}^{n}\prod_{j=1}^{m} f_j(x_i;b,W)^{h_j}\,\big(1 - f_j(x_i;b,W)\big)^{1-h_j},$$
which means that computation of the Entropy requires $n \cdot 2^m$ terms, which is arguably intractable for any but cases of very small m. This being the case, if one wishes to avoid the requirement of Monte Carlo sampling (which is certainly one way forward), it is helpful to make further (fairly broad) assumptions about the distribution of h.
If, for instance, one were to assume that the components of the latent variable vector h were independent (or that this were 'nearly' true, in some sense), then under the hypothesis of 'near'-product densities, the joint density would be well-approximated by
$$\prod_{j=1}^{m}\left[\frac{1}{n}\sum_{i=1}^{n} f_j(x_i;b,W)^{h_j}\,\big(1 - f_j(x_i;b,W)\big)^{1-h_j}\right],$$
or equivalently,
$$\prod_{j=1}^{m}\left(\frac{1}{n}\sum_{i=1}^{n} f_j(x_i;b,W)\right)^{h_j}\left(1 - \frac{1}{n}\sum_{i=1}^{n} f_j(x_i;b,W)\right)^{1-h_j},$$
and the Entropy would be simple to compute:
$$-\sum_{j=1}^{m}\Big[ \bar{f}_j(\cdot;b,W)\log \bar{f}_j(\cdot;b,W) + \big(1 - \bar{f}_j(\cdot;b,W)\big)\log\big(1 - \bar{f}_j(\cdot;b,W)\big) \Big],$$
where $\bar{f}_j(\cdot;b,W) = \frac{1}{n}\sum_i f_j(x_i;b,W)$ are the average proportions.
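Under this independence assumption (which, as argued immediately below, is not really tenable, and which Section 4 reports failing in practice), the entropy term reduces to a short computation over the column means of the same F array; a sketch:

import numpy as np

def independent_entropy(F, eps=1e-12):
    """Entropy of h under the (over-strong) independence assumption."""
    fbar = np.clip(F.mean(axis=0), eps, 1.0 - eps)  # average proportions
    return float(-np.sum(fbar * np.log(fbar) + (1.0 - fbar) * np.log(1.0 - fbar)))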
However, this assumption is not reasonable, but fortunately a much more tenable one allows a similar approximation, namely that higher-order dependence between components is characterised by the pairwise distributions [this would be true for correlated jointly-distributed Gaussian variables, for example]. In the present case of binary variables, there is a simple representation of a joint density:
$$p(h) = p(h_1, \ldots, h_m) = c \cdot \Big\{\prod_{i<j} p_{ij}(h_i,h_j)\Big\}^{\frac{1}{m-1}}, \qquad m > 1,$$
where $p_{ij}$ denotes the joint density of $(h_i,h_j)$ and $c$ a normalisation constant typically close to unity. [It may be easily seen that in the special case of independence, the above expression reduces to the product density of the components $h_i$, with $c = 1$.]
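As a short check of the bracketed remark (added here for completeness), note that under independence $p_{ij}(h_i,h_j) = p_i(h_i)\,p_j(h_j)$, and each marginal appears in exactly $m-1$ of the pairs, so
$$\Big\{\prod_{i<j} p_i(h_i)\,p_j(h_j)\Big\}^{\frac{1}{m-1}} = \Big\{\prod_{k=1}^{m} p_k(h_k)^{m-1}\Big\}^{\frac{1}{m-1}} = \prod_{k=1}^{m} p_k(h_k),$$
which is the product density, with $c = 1$.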
Assuming the form of the above density, the Entropy $\mathcal{E}(h) = -\mathbb{E}\log\{p(h)\}$ may be computed via
$$\mathbb{E}\log\{p(h)\} = \mathbb{E}\Big[\frac{1}{m-1}\sum_{i<j}\log\{p_{ij}(h_i,h_j)\}\Big] + \log(c)$$
$$= \frac{1}{m-1}\sum_{i<j}\mathbb{E}\big[\log\{p_{ij}(h_i,h_j)\}\big] + \log(c)$$
$$= \frac{1}{m-1}\sum_{i<j}\Big[ p_{ij}(1,1)\log\{p_{ij}(1,1)\} + p_{ij}(1,0)\log\{p_{ij}(1,0)\} + p_{ij}(0,1)\log\{p_{ij}(0,1)\} + p_{ij}(0,0)\log\{p_{ij}(0,0)\} \Big] + \log(c).$$
Given the input sample, the $p_{ij}(h_i,h_j)$ for the four cases of the argument may be computed by
$$p_{ij}(1,1) = \overline{f_i f_j},$$
$$p_{ij}(1,0) = \bar{f}_i - \overline{f_i f_j},$$
$$p_{ij}(0,1) = \bar{f}_j - \overline{f_i f_j},$$
$$p_{ij}(0,0) = 1 - \bar{f}_i - \bar{f}_j + \overline{f_i f_j},$$
where the expectations (here denoted by bars over the respective variables), conditional on the sample, are computed by averaging.
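Putting the pieces together in code, the following sketch (an illustration only, reusing mean_conditional_entropy from the earlier snippet; it is not the author's MATLAB implementation) evaluates the pairwise-approximated entropy of h and the resulting Mutual-Information objective, dropping the $\log(c)$ term on the grounds that c is typically close to unity:

import numpy as np

def pairwise_entropy(F, eps=1e-12):
    """Pairwise approximation to the entropy of h (log(c) term dropped).

    F : (n, m_h) array with F[i, j] = f_j(x_i; b, W).
    """
    n, m = F.shape
    fbar = F.mean(axis=0)        # \bar{f}_j
    ff = (F.T @ F) / n           # \overline{f_i f_j}
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            p11 = ff[i, j]
            p10 = fbar[i] - p11
            p01 = fbar[j] - p11
            p00 = 1.0 - fbar[i] - fbar[j] + p11
            probs = np.clip(np.array([p11, p10, p01, p00]), eps, 1.0)
            total += np.sum(probs * np.log(probs))
    return -total / (m - 1)

def mutual_information_objective(F):
    """Estimated Mutual Information between X and h: E(h) - E{E(h|X)}."""
    return pairwise_entropy(F) - mean_conditional_entropy(F)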
4 RESULTS OF PRELIMINARY
EXPERIMENT
As part of a larger experiment to forecast the daily return of the S&P 500 Index, a dataset was organised into a vector of output targets consisting of daily returns from 4500 days, each to be forecast from the respective 5 previous days' returns, organised into a 5-column matrix with the same number of rows as the output target. As a first step in training a Deep Network, it was decided to identify a feature array of 20 binary variables for each input vector, using the MMI method described above.
As a test, the MMI algorithm was implemented in MATLAB (MATLAB, 2011) and applied to the input data.
It was desired to encode the input information to estimate the 20 quasi-binary latent variables containing the most Information from the input, as the beginning of a stack of encoders eventually to be used to initialise training for a Deep Learning forecast of the next day's return (the output target). The MATLAB (MATLAB, 2011) implementation required encoding of the Mutual Information between the input data and predictions as a function of the connection weights, where a small (0.00001) L2 penalty was applied to the weights for stability. The Unconstrained Function Minimisation ('fminunc') routine of MATLAB (MATLAB, 2011) was applied with gradients supplied, and converged in approximately 3 minutes on a MacBook Pro laptop (2015, OS X 10.11.3, Intel 2.9 GHz Core i5, 8 GB RAM).
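For readers wishing to reproduce the experiment outside MATLAB, the following hedged sketch shows how the same criterion might be optimised with SciPy's general-purpose minimiser, reusing latent_on_probabilities and mutual_information_objective from the sketches above; unlike the fminunc run described here, no analytic gradients are supplied, so it will generally be slower:

import numpy as np
from scipy.optimize import minimize

def fit_mmi_encoder(X, m_h, l2=1e-5, seed=0):
    """Fit b and W by maximising the estimated Mutual Information (sketch)."""
    n, m_x = X.shape
    rng = np.random.default_rng(seed)
    theta0 = 0.01 * rng.standard_normal(m_h + m_x * m_h)

    def negative_objective(theta):
        b = theta[:m_h]
        W = theta[m_h:].reshape(m_x, m_h)
        F = latent_on_probabilities(X, b, W)
        # Minimise the negative MI plus a small L2 penalty for stability.
        return -mutual_information_objective(F) + l2 * np.sum(theta ** 2)

    res = minimize(negative_objective, theta0, method="L-BFGS-B")
    b, W = res.x[:m_h], res.x[m_h:].reshape(m_x, m_h)
    return b, W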
The resulting output was (to 5 significant figures)
an array of 20 columns of uncorrelated binary (0-1)
variables with column means each equal to 0.50000.
It is worth noting that the resulting 'Maximum-Entropy' character of the estimated output distribution was not anticipated, but it is arguably promising for application, and perhaps not surprising given the use of Entropy as a fitting criterion. [It is also worth noting that an earlier attempt at the same experiment, using the (over?)simplification of assuming independence of the components of h (mentioned in an earlier section of this article), failed, with the resulting output having mostly degenerate (all-zero or all-one) columns.]
5 CONCLUSION
In the preceding, it has been seen that data compression to binary (or near-binary) variables may be achieved simply using the principle of Maximum Mutual Information (MMI) with a 'one-size-fits-all' algorithm. The primary claim of the present paper is that the approach presented here is far simpler, logically and algorithmically, than the most widely-accepted method based on the Restricted Boltzmann Machine (RBM) construct. As it has not been argued here that the resulting compressions achieved by MMI are in any way superior to those achieved via RBMs given sufficient computation time, it is believed that a side-by-side practical comparison would entail carefully benchmarked studies involving specific hardware-related performance, which is not included here. It is, however, hoped that the new methods suggested here may help contribute to the logical and algorithmic simplification of the pre-training of Deep Neural Networks, and that practitioners will confirm the author's conjecture (and limited experience) that the computations involved in implementing MMI are comparable in scope to, and perhaps even more numerically efficient than, those of RBMs.
REFERENCES
Cover, T. and Thomas, J. (2006). Elements of Information Theory (2nd ed.), Ch. 2. Wiley, New York.
Hinton, G. (1999). Products of experts. In Proceedings of the Ninth International Conference on Artificial Neural Networks (ICANN), Vol. 1, pages 1-6.
MATLAB (2011). Version 7.12.0.635 (R2011a). The MathWorks Inc., Natick, Massachusetts.
Shannon, C. E. (1949). Communication in the presence of noise. Proceedings of the Institute of Radio Engineers, 37(1):10-21.