Let ∆ ≡ {(x_1, y_1), . . . , (x_ℓ, y_ℓ)} be a set of independent and identically distributed training samples with unknown probability distribution function F(x, y). The problem of learning is that of choosing, from the given set of functions f(x, α_ℓ), α ∈ Λ, where Λ is a set of parameters, the one that best approximates the output y of the system. Thus the selection of the desired function must be based on the training set ∆, i.e. by applying the empirical risk minimization principle:
$$ R(\alpha_\ell) \equiv \int L(y, f(x, \alpha)) \, dF(x, y) \qquad (1) $$
$$ \simeq \frac{1}{\ell} \sum_{i=1}^{\ell} \left( y_i - f(x_i, \alpha) \right)^2 = R_{emp}(\alpha_\ell). \qquad (2) $$
where the generic loss function L(y, f(x, α)), measuring the discrepancy between the response y to a given input x and the solution f(x, α), has been replaced by the specific loss that yields the least squares method. Under certain conditions the empirical risk functional converges towards the expected risk as ℓ → ∞, hence the approximation in equation 1 holds. However, for small sample sizes convergence may fail, which gives rise to the overfitting problem (Muller et al., 2001). The way of avoiding it is to introduce a regularization term (Tikhonov and Arsenin, 1997) to limit the complexity of the loss function class, which in turn raises the problem of model selection (Muller et al., 2001).
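As a concrete illustration of equations 1 and 2 (not taken from the original formulation), the empirical risk of a candidate function can be computed directly from the training set ∆. The minimal Python sketch below assumes a polynomial function class and a synthetic data generator:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical training set Delta = {(x_i, y_i)}, i = 1..l, drawn i.i.d.
    # from an unknown distribution F(x, y); here we simply simulate one.
    l = 30
    x = rng.uniform(-1.0, 1.0, size=l)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=l)

    def empirical_risk(alpha, x, y):
        """R_emp(alpha): mean squared discrepancy between y_i and f(x_i, alpha),
        where f(x, alpha) is assumed to be a polynomial with coefficients alpha."""
        return np.mean((y - np.polyval(alpha, x)) ** 2)

    # Empirical risk minimization over a fixed polynomial class:
    # np.polyfit returns the alpha minimizing R_emp for the given degree.
    alpha_l = np.polyfit(x, y, deg=3)
    print("R_emp(alpha_l) =", empirical_risk(alpha_l, x, y))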
The theory for controlling the generalization ability of learning machines is devoted to constructing a new inductive principle that minimizes the risk functional and, at the same time, controls the complexity of the loss function class. This is a major task, as noted above, whenever a small sample of ℓ training instances is used (Vapnik, 1998). To construct learning methods we use the bounds found by Vapnik:
$$ R(\alpha_\ell^k) \leq R_{emp}(\alpha_\ell^k) + \Re\!\left( \frac{\ell}{h_k} \right). \qquad (3) $$
where R is the actual risk, R_emp is the empirical risk depending on the samples, ℜ is the confidence interval, α_ℓ^k is the set of selected parameters that defines the class of approximation functions², and h is the VC dimension³. In order to minimize the right-hand side of inequality 3 we apply the SRM principle as follows:
Let L_1 ⊂ L_2 ⊂ . . . ⊂ L_k ⊂ . . . be a nested "admissible"⁴ family of loss function classes with finite VC dimensions denoted by h_i, i = 1, . . . , k. For a given set of observations ∆ the SRM principle chooses the suitable class L_k (and the function L(x, α_ℓ^k)) minimizing the guaranteed risk (the right-hand side of inequality 3). In other words, the higher the complexity of the function class, the lower the empirical risk but the larger the confidence interval (the second term in the bound on the expected risk).

² The subindex k is related to the structure or subset of loss functions we use in the approximation.
³ Roughly speaking, the VC dimension h measures how many training points can be separated for all possible labellings using functions of the class.
⁴ In the strict sense presented in (Vapnik, 1998), that is, they are bounded functions or satisfy a certain inequality.
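The SRM selection rule can also be sketched numerically. In the Python fragment below the nested classes L_k are polynomials of increasing degree, the VC dimension h_k is taken as degree + 1 and the confidence term uses one standard form of Vapnik's bound; these concrete choices are assumptions made only to illustrate the principle, not the settings used in this paper:

    import numpy as np

    rng = np.random.default_rng(1)
    l = 30
    x = rng.uniform(-1.0, 1.0, size=l)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=l)

    def confidence_term(l, h, eta=0.05):
        """One common form of the confidence interval: it grows with the
        VC dimension h and shrinks as the sample size l increases."""
        return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

    best = None
    for k, degree in enumerate(range(0, 8), start=1):
        # Nested structure L_1 subset L_2 subset ...: polynomials of growing degree.
        alpha_k = np.polyfit(x, y, deg=degree)
        r_emp = np.mean((y - np.polyval(alpha_k, x)) ** 2)
        h_k = degree + 1                              # assumed VC dimension of class L_k
        guaranteed = r_emp + confidence_term(l, h_k)  # right-hand side of inequality (3)
        if best is None or guaranteed < best[0]:
            best = (guaranteed, k, degree)

    print("SRM picks class L_%d (degree %d), guaranteed risk %.3f" % (best[1], best[2], best[0]))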
3 SUPPORT VECTOR MACHINES AND REGULARIZATION THEORY
The SV algorithm is a nonlinear generalization of the Generalized Portrait algorithm developed in the sixties by Vapnik and Lerner (Vapnik and Lerner, 1963). The basic idea in SVM for regression and function estimation is to use a mapping function Φ from the input space X into a high dimensional feature space F and then to apply a linear regression there. Thus the standard linear regression transforms into:
$$ f(x) = \langle \omega \cdot \Phi(x) \rangle + b. \qquad (4) $$
where Φ : X → F, b is a bias or threshold and ω ∈ F is a vector defining the function class. The target is to determine ω, i.e. the set of parameters in the neural network, by minimizing the regularized risk expressed as:

$$ R_{reg}[f] = R_{emp}[f] + \lambda \lVert \omega \rVert^2. \qquad (5) $$
Thus we are enforcing "flatness" in feature space, that is, we seek a small ω. Note that equation 5 is very common in RN, with a particular choice of the second term.
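For instance, with an explicit feature map Φ and the squared loss, minimizing the regularized risk of equation 5 reduces to ridge regression in feature space. The minimal sketch below assumes a polynomial feature map and an arbitrary value of λ:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-1.0, 1.0, size=(40, 1))
    y = np.sin(np.pi * x[:, 0]) + 0.1 * rng.normal(size=40)

    def phi(x):
        """Assumed feature map Phi: X -> F (polynomial features up to degree 3)."""
        return np.hstack([x ** d for d in range(4)])  # columns 1, x, x^2, x^3

    lam = 1e-2              # regularization weight lambda in eq. (5)
    F = phi(x)              # design matrix in feature space
    # Minimizing (1/l)*||y - F w||^2 + lam*||w||^2 gives the normal equations
    # (F^T F + l*lam*I) w = F^T y; the bias b is absorbed by the constant feature.
    w = np.linalg.solve(F.T @ F + lam * len(y) * np.eye(F.shape[1]), F.T @ y)

    f_x = F @ w             # f(x) = <w, Phi(x)>
    print("regularized training MSE:", np.mean((y - f_x) ** 2))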
The SVM algorithm is a way of solving the minimization of equation 5, which can be expressed as a quadratic programming problem using the formulation stated in (Vapnik, 1998):

$$ \text{minimize} \quad \frac{1}{2} \lVert \omega \rVert^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^{*}). \qquad (6) $$
given a suitable loss function L(·)⁵, a constant C ≥ 0 and slack variables ξ_i, ξ_i^* ≥ 0. The optimization problem is solved by constructing a Lagrange function, introducing dual variables and using equation 6 together with the selected loss function.
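In practice this quadratic programme is handled by standard SVM solvers. A minimal sketch with scikit-learn's SVR follows; the RBF kernel and the values of C and ε are arbitrary illustrative choices, not taken from the paper:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(3)
    X = rng.uniform(-1.0, 1.0, size=(60, 1))          # synthetic data, for illustration
    y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.normal(size=60)

    # C weights the slack variables xi_i, xi_i^* in eq. (6); epsilon sets the
    # width of the insensitive tube of the loss function L(.) (see footnote 5).
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
    model.fit(X, y)

    print("number of support vectors:", model.support_.size)
    print("training MSE:", np.mean((model.predict(X) - y) ** 2))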
Once it is uniquely solved, we can write the vector
ω in terms of the data points as follows:
$$ \omega = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^{*}) \, \Phi(x_i). \qquad (7) $$
where α_i, α_i^* are the solutions of the mentioned quadratic problem. Once this problem, with high
⁵ For example Vapnik's ε-insensitive loss function (Vapnik, 1998):
$$ L(f(x) - y) = \begin{cases} |f(x) - y| - \varepsilon & \text{for } |f(x) - y| \geq \varepsilon \\ 0 & \text{otherwise} \end{cases} $$
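Equation 7 can be checked numerically: with a linear kernel (Φ the identity map) the vector ω is recovered from the differences α_i − α_i^* returned by the solver, which scikit-learn's SVR exposes as dual_coef_. The data and parameter values below are assumptions for illustration only:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(4)
    X = rng.uniform(-1.0, 1.0, size=(50, 2))          # synthetic, roughly linear data
    y = X @ np.array([1.5, -0.7]) + 0.05 * rng.normal(size=50)

    model = SVR(kernel="linear", C=10.0, epsilon=0.05).fit(X, y)

    # Eq. (7): w = sum_i (alpha_i - alpha_i^*) Phi(x_i); here Phi(x) = x.
    duals = model.dual_coef_.ravel()     # alpha_i - alpha_i^* for the support vectors
    w = duals @ X[model.support_]        # rebuild w from the data points
    print(np.allclose(w, model.coef_.ravel()))  # should print True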