positions). Likewise, we have assumed that test vectors, which are randomly perturbed versions of training vectors, are passed through a saturation filter that activates only K nodes. The same is true of the network output, where the global threshold value, which is the same for all nodes, is adjusted to ensure that only the K nodes with the largest sums are above threshold; this is known as K winners-take-all (K-WTA). Palm has shown that, to obtain maximum memory capacity, the number of 1s and 0s in the weight matrix should be balanced, that is, $p_1 = p_0 = 0.5$, or equivalently $Mpq = -\ln p_0 \le \ln 2$. For this relationship to hold, the training vectors need to be sparsely coded with $\log n$ bits set to 1; the optimal capacity of $\ln 2 \approx 0.69$ is then reached.
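As a concrete illustration of the storage and K-WTA retrieval just described, the following minimal sketch (in Python with NumPy; the vector length, sparsity, and pattern count are illustrative choices, not values from this paper) stores sparse binary vectors with a clipped Hebbian rule, i.e. a bit-wise OR of outer products, and recalls them by thresholding the dendritic sums so that only the K largest survive.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, M = 256, 8, 20          # vector length, active bits per vector, stored patterns (illustrative)

# Sparse binary training vectors: exactly K ones each.
X = np.zeros((M, n), dtype=np.uint8)
for i in range(M):
    X[i, rng.choice(n, K, replace=False)] = 1

# Clipped Hebbian storage: bit-wise OR of the outer product of each vector with itself.
W = np.zeros((n, n), dtype=np.uint8)
for x in X:
    W |= np.outer(x, x)

def recall(x_noisy, k=K):
    """One retrieval step: dendritic sums followed by K-WTA thresholding."""
    sums = W.astype(int) @ x_noisy.astype(int)   # accumulating inner products
    thresh = np.partition(sums, -k)[-k]          # global threshold keeping the k largest sums
    return (sums >= thresh).astype(np.uint8)

# Perturb a stored vector (flip one active and one inactive bit) and recall it.
x = X[0].copy()
x[rng.choice(np.flatnonzero(x == 1))] = 0
x[rng.choice(np.flatnonzero(x == 0))] = 1
print("Bits differing from the stored vector after recall:", int(np.sum(recall(x) != X[0])))
```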
3.1 Palm Association as Bayesian Inference
Message $x_i$ is transmitted, message $x'(t) = x(t) + n(t)$ is received, and all vectors are N bits in length, so the Noisy Channel only creates substitution errors. The Hamming Distance between these two bit vectors is $HD(x_i, x')$. We assume throughout the rest of this paper that the Noisy Channel is binary symmetric with the probability of a single bit error being ε and the probability that a bit is transmitted intact being (1-ε). The error probabilities are independent and identically distributed, and ε < 0.5.
We can now prove the following result:
Theorem 1: The messages, $x_i$, are transmitted over a Binary Symmetric Channel and message $x'$ is received. With all messages $x_i$ being equally likely, the training vector with the smallest Hamming Distance from the received vector is the most likely to have been sent, in a Bayesian sense,
$$i \text{ s.t. } \forall i, \ \min HD(x_i, x') \quad \text{and} \quad j \text{ s.t. } \forall j, \ \max p(x_j \mid x'), \quad i = j$$
Proof: This is a straightforward derivation from Bayes' rule:
$$p[x_i \mid x'] = \frac{p[x' \mid x_i]\, p[x_i]}{\sum_{j=1}^{N} p[x' \mid x_j]\, p[x_j]}$$
Since the denominator is equal for all inputs, it can be ignored, and since the $x_i$ are all equally likely, the $p[x_i]$ in the numerator can also be ignored. Let $x_i(h)$ be a vector that differs from $x_i$ in $h$ bits, that is, $HD(x_i(h), x_i) = h$. The probability, $p(x(1))$, of a single error is $(1-\varepsilon)^{N-1}\varepsilon$. Note that this is not the probability of all possible 1-bit errors, only the probability of the one specific error occurring that transforms $x_i$ into $x'$. The probability of $h$ errors occurring is $p(x(h)) = (1-\varepsilon)^{N-h}\varepsilon^{h}$. It is easy to see that for ε < 0.5, $p(x(1)) > p(x(2)) > \dots > p(x(h))$, and by definition $HD(x(1)) < HD(x(2)) < \dots < HD(x(h))$. By choosing the training vector with the smallest Hamming Distance from $x'$, we maximize $p[x' \mid x_i]$ and thus maximize $p[x_i \mid x']$.
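The decision rule in Theorem 1 can be checked numerically. The short sketch below (the error rate, vector length, and candidate messages are illustrative assumptions, not values from the paper) computes the BSC likelihood $(1-\varepsilon)^{N-h}\varepsilon^{h}$ for each candidate and confirms that the candidate at the smallest Hamming distance also has the largest likelihood and hence, with equal priors, the largest posterior.

```python
import numpy as np

eps, N = 0.1, 8                                 # illustrative bit-error rate and vector length

# Illustrative stored messages and a received vector x' (one bit flipped from the first message).
X = np.array([[1,0,0,1,0,1,0,0],
              [0,1,1,0,0,0,1,0],
              [1,1,0,0,1,0,0,1]])
x_prime = np.array([1,0,0,1,0,1,0,1])

hd = np.sum(X != x_prime, axis=1)               # Hamming distances HD(x_i, x')
lik = (1 - eps) ** (N - hd) * eps ** hd         # BSC likelihoods p[x' | x_i]

# With equal priors the posterior is proportional to the likelihood,
# so argmin of HD and argmax of the likelihood pick the same message.
assert np.argmin(hd) == np.argmax(lik)
print(hd, lik)
```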
Theorem 2: When presented with a vector $x'$, the Palm association memory returns the training vector with the smallest Hamming Distance; that is, if the memory output is the vector $x_i$, then we have $\min_i HD(x_i, x')$.
The proof is not given here, but this is a
fundamental property of the Palm network as shown
by Palm (Palm 1980).
Going back to Bayes' rule, using the fact that the probability of $h$ errors is $p(x(h)) = (1-\varepsilon)^{N-h}\varepsilon^{h}$, and taking the logarithm, we get:
$$\ln[p(x_i \mid x')] = \ln[(1-\varepsilon)^{N-h}\varepsilon^{h}] + \ln[p(x_i)]$$
$$= HD(x_i, x')(\ln\varepsilon - \ln(1-\varepsilon)) + \ln p(x_i)$$
(Here the common denominator of Bayes' rule and the term $N\ln(1-\varepsilon)$, which are the same for every $x_i$, have been dropped.)
Setting $\alpha = -(\ln\varepsilon - \ln(1-\varepsilon))$ and dividing by $\alpha$, we get
$$f(x_i, x') = -HD(x_i, x') + f(x_i)$$
where $f(x_i)$ is the log of the prior probability, $\ln p(x_i)$, divided by $\alpha$.
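To see numerically that this score ranks candidates by their posterior probability, the sketch below (with assumed priors, error rate, and candidate vectors chosen only for illustration) compares $f(x_i, x') = -HD(x_i, x') + \ln p(x_i)/\alpha$ against the exact log posterior and checks that they order the candidates identically.

```python
import numpy as np

eps, N = 0.1, 8
alpha = -(np.log(eps) - np.log(1 - eps))        # alpha = ln(1-eps) - ln(eps) > 0 for eps < 0.5

# Illustrative stored messages, unequal priors, and a received vector.
X = np.array([[1,0,0,1,0,1,0,0],
              [0,1,1,0,0,0,1,0],
              [1,1,0,0,1,0,0,1]])
priors = np.array([0.2, 0.5, 0.3])
x_prime = np.array([1,0,0,1,0,1,0,1])

hd = np.sum(X != x_prime, axis=1)

# Exact posterior over the candidates (BSC likelihood times prior, normalised).
lik = (1 - eps) ** (N - hd) * eps ** hd
post = lik * priors / np.sum(lik * priors)

# Weighted-Hamming-distance score: -HD(x_i, x') + ln p(x_i) / alpha.
score = -hd + np.log(priors) / alpha

# The score and the log posterior rank the candidates identically.
assert np.argmax(score) == np.argmax(post)
print(np.argsort(score), np.argsort(np.log(post)))
```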
We can then create a modified Palm network 'with priors', where there is a new input for each training vector. This input adds to the accumulating inner product.
The weights in this modified Palm network are calculated according to a "Hebbian"-like rule together with knowledge of the prior probabilities. That is, a weight matrix is computed by taking the outer product of each training vector with itself and then doing a bit-wise OR of each training vector's Hebbian matrix, plus an inference-based learning term, where η is a learning rate.
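The precise form of the inference-based term is not spelled out above, so the sketch that follows is only one plausible reading, not the paper's definitive construction: the binary weight matrix is built exactly as in the standard Palm network, and each training vector contributes an extra prior input whose effect is to add η·f(x_i) to the accumulating sums of that vector's active nodes before K-WTA is applied. The function names and the form of the bias term are assumptions for illustration.

```python
import numpy as np

def train_palm_with_priors(X, priors, alpha, eta=1.0):
    """One plausible reading of the 'Palm with priors' construction (an assumption,
    not the paper's exact rule): clipped Hebbian matrix plus a per-node prior bias."""
    n = X.shape[1]
    W = np.zeros((n, n), dtype=np.uint8)
    bias = np.zeros(n)
    f = np.log(priors) / alpha                 # f(x_i) = ln p(x_i) / alpha
    for x, f_i in zip(X, f):
        W |= np.outer(x, x)                    # bit-wise OR of the Hebbian outer products
        bias += eta * f_i * x                  # prior input: adds eta * f(x_i) to x_i's nodes
    return W, bias

def recall_with_priors(W, bias, x_noisy, k):
    """K-WTA recall where the prior bias is added to the accumulating inner products."""
    sums = W.astype(float) @ x_noisy + bias
    thresh = np.partition(sums, -k)[-k]
    return (sums >= thresh).astype(np.uint8)
```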
Theorem 3: The message $x_i$ is transmitted over a Binary Symmetric Channel and message $x'$ is received. For messages $x_i$ with probabilities of occurrence $p(x_i)$, the training vector with the smallest Weighted Hamming Distance from the received vector, $x'$, is the most likely message sent in a Bayesian sense,
$$i \text{ s.t. } \forall i, \ \max f(x_i, x') \quad \text{and} \quad j \text{ s.t. } \forall j, \ \max p(x_j \mid x'), \quad i = j$$
Proof: The proof is similar to that of Theorem 1.