interval (−1, 1) and not of 0 and 1 as in Table 1. It is still required that the internal representation be 'binary' in the sense that the hidden layer has to produce coding values for a uniquely distinguishable mapping at the output layer. The input and output are presented to the network as shown in Table 1.
The SRN can be seen as a feedforward network with additional inputs from the context layer, and any algorithm for feedforward networks can be used to train it (Elman, 1990).
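To make this view concrete, the forward pass of the 4-2-4 SRN can be sketched as below: the two context values are simply appended to the four input values. This is only an illustrative Matlab sketch, not the original implementation; the function and weight names are ours, and the choice of tanh hidden units (matching the (−1, 1) range) and logistic output units is an assumption.

function [y, h] = srn_forward(W_ih, W_ho, x, c)
% Forward pass of the SRN seen as a feedforward network: the two context
% values c act as additional inputs next to the input vector x.
% W_ih: 2 x 6 weights into the hidden layer, W_ho: 4 x 2 weights into the output layer.
h = tanh(W_ih * [x; c]);          % hidden activations in (-1, 1)
y = 1 ./ (1 + exp(-(W_ho * h)));  % output activations (logistic units assumed)
end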
2.3 Network Training
We use the backpropagation algorithm to train the
SRN. To do so, two constraints have to be fulfilled.
1. The context units must be initialised with some
activation for the forward propagation of the first
training vector. Commonly, these initial values
are zero.
2. The activation levels of the hidden layer must be stored in the context layer after each backpropagation phase. Hence, the context layer holds the state of the hidden layer delayed by one time step.
After each input the network output is compared to the desired output and the mean square error is propagated back through the network. The weights are updated with the constant learning rate ε = 0.1. Since each output is evaluated right away, the process corresponds to online learning.
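Combining the two constraints with the online update, one training step could be sketched in Matlab as follows (again a hypothetical sketch under the same assumptions as above, with srn_forward as defined there; bias terms are omitted for brevity):

function [W_ih, W_ho, c, y] = srn_train_step(W_ih, W_ho, x, t, c, epsilon)
% One online training step: forward pass, backpropagation of the error of
% this single input, weight update with constant learning rate epsilon,
% and finally the copy of the hidden state into the context layer.
[y, h] = srn_forward(W_ih, W_ho, x, c);

delta_o = (y - t) .* y .* (1 - y);           % output deltas (logistic units)
delta_h = (W_ho' * delta_o) .* (1 - h.^2);   % hidden deltas (tanh units)

W_ho = W_ho - epsilon * delta_o * h';        % update hidden-to-output weights
W_ih = W_ih - epsilon * delta_h * [x; c]';   % update (input+context)-to-hidden weights

c = h;                                       % constraint 2: context = hidden state, delayed by one step
end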
The weights are initialised with uniformly distributed random values in the interval [−0.3, 0.3] (apart from the fixed hidden-to-context layer weights). The learning rate and the weight initialisation interval are chosen according to preliminary tests; we use the combination that yielded the best training results after 1000 training cycles.
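A matching initialisation could look as follows (a sketch; only the layer sizes, the interval [−0.3, 0.3], and the zero initial context are taken from the text, the rest is our choice):

n_in = 4; n_hid = 2; n_out = 4;                  % 4-2-4 encoder architecture
W_ih = -0.3 + 0.6 * rand(n_hid, n_in + n_hid);   % uniform in [-0.3, 0.3]
W_ho = -0.3 + 0.6 * rand(n_out, n_hid);
c    = zeros(n_hid, 1);                          % constraint 1: context units start with zero activation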
3 SIMULATIONS
The network described above was implemented in Matlab. Since we are interested in the network’s ability in terms of implicit sequence learning, we present the training input in two different ways: a sequential and a random one.
One training cycle consists of a presentation of all four input vectors, which are shown to the network one after another. In the case of a deterministic order, the first cycle is repeated throughout the whole training. This implies a strong temporal relationship between the input vectors, since each one has a fixed successor. In the case of a random order in each cycle, the temporal correlation between the input vectors is very weak.
If we denote the input vectors with the numbers from one to four, we can describe the two types of sequences as follows:

det. sequence:     ...| 1 2 3 4 | 1 2 3 4 |...
                        cycle n   cycle n+1
random sequence:   ...| 4 3 1 2 | 2 1 3 4 |...
                        cycle n   cycle n+1
The network is trained for 1000 cycles. Hence, every input vector is shown to the network 1000 times. This results in 4000 training steps and thus 4000 weight updates.
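The two presentation schemes can be generated, for example, as follows (an illustrative Matlab fragment; the variable names are ours):

n_cycles = 1000;
det_seq = repmat(1:4, 1, n_cycles);     % ...| 1 2 3 4 | 1 2 3 4 |...
rnd_seq = zeros(1, 4 * n_cycles);
for k = 1:n_cycles                      % a new random order of the four vectors in every cycle
    rnd_seq(4*k-3 : 4*k) = randperm(4);
end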
3.1 Training
3.1.1 Success of the Training
To measure the success of the network we evaluate the output according to the winner-take-all principle. The unit with the highest activation is counted as 1, the remaining units as 0. Thus, the network’s output is always mapped onto a corresponding target vector.
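In code, this evaluation could be expressed as below (a sketch; the helper name wta_success is ours):

function success = wta_success(y, t)
% Winner-take-all evaluation: the most active output unit is set to 1,
% all other units to 0, and the result is compared with the target vector t.
[~, k]   = max(y);
y_wta    = zeros(size(y));
y_wta(k) = 1;
success  = isequal(y_wta, t);
end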
We evaluate the network output in terms of the probability of success (P_S) for each training cycle. When training starts, the probability of exciting the correct output is one out of four (P_S = 0.25). At the end of training the network should have learned the coding and always deliver the target vector; therefore we expect P_S = 1.
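One plausible reading is that P_S for a single network is the fraction of the four inputs in a cycle whose winner-take-all output matches the target. The following sketch computes it that way (our interpretation, reusing the hypothetical functions above; X and T are assumed to hold the four input and target vectors of Table 1 as columns):

correct = zeros(1, 4);
for i = 1:4                              % one training cycle over the four input vectors
    [W_ih, W_ho, c, y] = srn_train_step(W_ih, W_ho, X(:, i), T(:, i), c, 0.1);
    correct(i) = wta_success(y, T(:, i));
end
P_S = mean(correct);                     % probability of success for this cycle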
Since the weights are initialised randomly, the learning curves of single networks may differ considerably. As we want to compare the general behaviour of the network, we train 100 networks on each type of input sequence. Afterwards we calculate the mean probability of success over the 100 networks (P̄_S) for the two test cases. In Figure 3, P̄_S is plotted against the number of training cycles. In general, the networks perform better on a deterministic input sequence than on a random one. In both cases, however, the expected P̄_S = 1 is not reached, which we will explain in the following.
Figure 4 shows the distribution of P_S for the 100 networks after training. We plot the number of networks n against the final probability of success P_S. Trained with a deterministic sequence, one half of the networks (n = 51) learned the encoding completely (P_S = 1 by the end of the training). In contrast, only 23 networks succeeded when trained with a random sequence.
In summary, it is more likely that a network trained with a deterministic sequence is able to learn the task