Derivative Free Training of Recurrent Neural Networks
A Comparison of Algorithms and Architectures
Branimir Todorović 1,4, Miomir Stanković 2,4 and Claudio Moraga 3,4
1 Faculty of Natural Sciences and Mathematics, University of Niš, Niš, Serbia
2 Faculty of Occupational Safety, University of Niš, Niš, Serbia
3 European Centre for Soft Computing, 33600 Mieres, Spain
4 Technical University of Dortmund, 44221 Dortmund, Germany
Keywords: Recurrent Neural Networks, Bayesian Estimation, Nonlinear Derivative Free Estimation, Chaotic Time
Series Prediction.
Abstract: The problem of recurrent neural network training is considered here as approximate joint Bayesian
estimation of the neuron outputs and the unknown synaptic weights. We have implemented recursive estimators
using nonlinear derivative-free approximations of the neural network dynamics. The computational efficiency
and performance of the proposed training algorithms are compared for different recurrent neural network
architectures on the problem of long-term chaotic time series prediction.
1 INTRODUCTION
In this paper we consider the training of Recurrent Neural Networks (RNNs) as derivative-free approximate Bayesian estimation. RNNs form a wide class of neural networks with feedback connections among processing units (artificial neurons). Neural networks with feed-forward connections implement a static input-output mapping, while recurrent networks map both the input and the internal state (represented by the outputs of the recurrent neurons) into the future internal state.
In general, RNNs can be classified as locally recurrent, where feedback connections exist only from a processing unit to itself, and globally recurrent, where feedback connections exist among distinct processing units. The modeling capabilities of globally recurrent neural networks are much richer than those of simple locally recurrent networks.
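As a minimal illustration of this distinction (our own sketch, not the paper's notation), the update below maps the current input and internal state into the next state; a globally recurrent network uses a full recurrent weight matrix, while a locally recurrent one restricts the feedback to a diagonal matrix:

import numpy as np

def rnn_step(x, u, W_rec, W_in, b):
    """One recurrent state update: next internal state from current state x and input u."""
    return np.tanh(W_rec @ x + W_in @ u + b)

n_state, n_in = 4, 2
rng = np.random.default_rng(0)

# Globally recurrent: feedback connections among distinct neurons (full matrix).
W_global = 0.1 * rng.standard_normal((n_state, n_state))

# Locally recurrent: feedback only from each neuron to itself (diagonal matrix).
W_local = np.diag(0.1 * rng.standard_normal(n_state))

W_in = 0.1 * rng.standard_normal((n_state, n_in))
b = np.zeros(n_state)

x = np.zeros(n_state)
u = rng.standard_normal(n_in)
x_next_global = rnn_step(x, u, W_global, W_in, b)
x_next_local = rnn_step(x, u, W_local, W_in, b)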
A group of algorithms for training the synaptic weights of recurrent neural networks is based on the exact or approximate computation of the gradient of an error measure in the weight space. Well-known approaches that use exact gradient computation are back-propagation through time (BPTT) and real-time recurrent learning (RTRL) (Williams and Zipser, 1989; Williams and Zipser, 1990). Since BPTT and RTRL use only first-order derivative information, they exhibit slow convergence. To improve the speed of RNN training, a technique known as teacher forcing has been introduced (Williams and Zipser, 1989). The idea is to use the desired outputs of the neurons, rather than the outputs actually obtained, when computing the future outputs. In this way the training algorithm focuses on the current time step, under the assumption that performance was correct at all earlier time steps.
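A schematic sketch of the idea (ours, assuming every recurrent neuron is an output neuron whose target has the same dimension as the state; rnn_step is the update defined above):

import numpy as np

def rnn_step(x, u, W_rec, W_in, b):
    return np.tanh(W_rec @ x + W_in @ u + b)

def run_teacher_forced(inputs, targets, W_rec, W_in, b):
    """Roll the network forward, feeding back the desired outputs (teacher forcing)
    instead of the outputs the network actually produced."""
    x = np.zeros(W_rec.shape[0])
    outputs = []
    for u, d in zip(inputs, targets):
        x = rnn_step(x, u, W_rec, W_in, b)   # prediction for the current step
        outputs.append(x.copy())
        x = d                                # feed back the desired output, not the prediction
    return np.array(outputs)

The error at each step can then be computed between outputs and targets, and the gradient never propagates through earlier, possibly wrong, predictions.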
However, in its basic form teacher forcing is not always applicable. It clearly cannot be applied in networks whose feedback connections originate only from hidden units, for which the target outputs are not explicitly given. The second important case is training on noisy data, where the target outputs are corrupted by noise. Therefore, to apply teacher forcing in such cases, the true target outputs of the neurons have to be estimated somehow.
The well-known extended Kalman filter (Anderson and Moore, 1979), as a second-order sequential training algorithm and state estimator, offers a solution to both of the stated problems. It improves the learning rate by exploiting second-order information about the criterion function and generalizes the teacher forcing technique by estimating the true outputs of the neurons.
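As a rough sketch of how an EKF-based trainer of this kind can be set up (our own simplification of the standard parameter-estimation formulation, not the authors' exact algorithm), the weight vector is treated as the hidden state and corrected with each new target using a linearization of the network output; h and jacobian_h below are assumed to be user-supplied callables for the network output map and its Jacobian with respect to the weights:

import numpy as np

def ekf_weight_update(w, P, u, d, h, jacobian_h, R):
    """One EKF measurement update treating the synaptic weights w as the state.
    P is the weight covariance, u the current input, d the desired output,
    R the observation noise covariance."""
    y = h(w, u)                     # predicted network output for input u
    H = jacobian_h(w, u)            # dh/dw evaluated at the current weights
    S = H @ P @ H.T + R             # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    w = w + K @ (d - y)             # correct the weights with the output error
    P = P - K @ H @ P               # update the weight covariance
    return w, P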
The extended Kalman filter can be considered an approximate solution of the recursive Bayesian state estimation problem. The problem of estimating the hidden state of a dynamic system using observations which arrive sequentially in time is