hidden-to-output weights to improve network
performance at each step. This error signal is then
estimated for each preceding layer, but the error
signal attenuates.
2.1 Extreme Learning Machine
Figure 1: Simple ELM network.
Huang et al.'s (2004) principal contribution was to
suggest that a set of random, fixed weights in the hidden
layer could provide a non-linear mapping between the
input neurons and the output neurons. With a large
enough number of neurons in the hidden layer, the
algorithm can map a small number of input neurons to
an arbitrarily large number of output neurons in a
non-linear way. Training is performed only on the
hidden-to-output weights, and performance similar to
that of multi-layer feed-forward networks trained with
back-propagation is achieved with much reduced
training time.
It is possible to train an ELM network such as the one
shown in Figure 1 by back-propagation, but since the
input-to-hidden weights are fixed, it is more efficient
to estimate the output weights using the Moore-
Penrose pseudo-inverse (Huang et al., 2004). The
resulting weight matrix is the best least-squares fit
for the output layer and, in addition, has the smallest
norm, which is important for good generalisation
performance (Bartlett, 1998).
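As an illustration, the following is a minimal sketch of this training procedure in Python/NumPy; the function names, the sigmoid activation and the random initialisation scheme are our own assumptions and are not prescribed by Huang et al. (2004).

    import numpy as np

    def train_elm(X, T, n_hidden, seed=0):
        # X: (n_samples, n_inputs) training inputs; T: (n_samples, n_outputs) targets.
        rng = np.random.default_rng(seed)
        # Random input-to-hidden weights and biases, fixed after initialisation.
        W_in = rng.standard_normal((X.shape[1], n_hidden))
        b = rng.standard_normal(n_hidden)
        H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))        # hidden layer activations
        # Output weights as the least-squares solution via the Moore-Penrose pseudo-inverse.
        W_out = np.linalg.pinv(H) @ T
        return W_in, b, W_out

    def predict_elm(X, W_in, b, W_out):
        H = 1.0 / (1.0 + np.exp(-(X @ W_in + b)))
        return H @ W_out

Only the final line of train_elm involves any fitting; everything before it is a fixed random projection, which is what makes ELM training so fast.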
2.2 Cascade Correlation
The Cascade Correlation algorithm (Cascor)
(Fahlman and Lebiere, 1990) is a very powerful
method for training artificial neural networks.
Cascor is a constructive algorithm which begins
training with a single input layer connected directly
to the output layer. Neurons are added one at a time to
the network and are connected to all previous hidden
and input neurons, producing a cascade network.
When a new neuron is to be added to the network,
all previous network weights are 'frozen'. The input
weights of the neuron which is about to be added are
then trained to maximise the correlation between
that neuron's output and the remaining network
error. The new neuron is then inserted into the
network, and all weights connected to the output
neurons are then trained to minimise the error
function.
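The candidate training criterion can be sketched as follows; this is an illustrative NumPy rendering of the covariance-based score described by Fahlman and Lebiere (1990), with array names of our own choosing.

    import numpy as np

    def candidate_score(v, E):
        # v: (n_patterns,) outputs of the candidate neuron over the training patterns.
        # E: (n_patterns, n_outputs) residual errors of the current (frozen) network.
        # Score = sum over output units of |covariance between v and the error|;
        # the candidate's input weights are trained to maximise this quantity.
        v_centred = v - v.mean()
        E_centred = E - E.mean(axis=0)
        return np.abs(v_centred @ E_centred).sum()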
Thus there are two training phases: the training
of the hidden neuron weights, and the training of the
output weights. A previous extension to the Cascor
algorithm used the RPROP algorithm (Riedmiller,
1994) to train the whole network (Treadgold and
Gedeon, 1997), with 'frozen' weights represented by
initially low learning rates. That model, Casper, was
shown to produce more compact networks, which also
generalise better than Cascor.
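For completeness, a hedged sketch of an RPROP-style update of the kind Casper relies on is shown below (the iRPROP- variant); the step-size constants are commonly quoted defaults and are assumptions on our part, as is the use of NumPy.

    import numpy as np

    def rprop_step(w, grad, prev_grad, step,
                   eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
        # Per-weight step sizes grow while the gradient keeps its sign and
        # shrink when it flips; weights move by the sign of the gradient only.
        sign_change = grad * prev_grad
        step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
        grad = np.where(sign_change < 0, 0.0, grad)   # skip the update after a sign flip
        w = w - np.sign(grad) * step
        return w, grad, step

In Casper the 'frozen' behaviour is obtained not by fixing weights but by giving previously trained weights very small initial step sizes, so they change only slowly.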
2.3 Caveats
We have said that it is generally accepted that three
layers of processing neurons are sufficient, but we
must point out that this is not always true.
For example, we know that in the field of
petroleum engineering, in order to reproduce the
fine-scale variability known to exist in core porosity/
permeability data, one neural network is used for
porosity prediction, followed by another for
permeability prediction. This produces better results
than a single combined network (Wong, Taggart and
Gedeon, 1995); the same holds for hierarchical data
(Gedeon and Kóczy, 1998).
3 CASCADE CORRELATION AND EXTREME LEARNING MACHINE
ELMs can be trained very quickly to solve
classification problems. In general, the larger the
hidden layer, the higher the learning capacity of the
network. However, the size of the hidden layer is
critical to performance: too small and the network
will not have sufficient capacity to learn; too large
and training times suffer and overfitting occurs.
Finding the ideal size for the layer is
problematic. If the number of hidden neurons is
greater than or equal to the number of training
patterns, then the network will be able to achieve
100% learning. However, this is not a useful
observation, as in most cases we would expect the
network to achieve satisfactory learning with far
fewer neurons than this.
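Assuming the train_elm and predict_elm sketches given earlier and a held-out validation set, one simple way to explore this trade-off is to sweep the hidden layer size and monitor validation error; the function below is an illustrative sketch, not part of the original work.

    import numpy as np

    def sweep_hidden_size(X_train, T_train, X_val, T_val, sizes):
        # Train one ELM per candidate hidden layer size and record the
        # mean squared error on the held-out validation set.
        results = {}
        for n in sizes:
            W_in, b, W_out = train_elm(X_train, T_train, n)
            pred = predict_elm(X_val, W_in, b, W_out)
            results[n] = float(np.mean((pred - T_val) ** 2))
        return results

Inspecting the returned dictionary shows the point at which adding further hidden neurons stops improving, or starts degrading, validation performance.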