HYBRID SOM-SVM ALGORITHM FOR REAL TIME SERIES
FORECASTING
J.M. Górriz
University of Cádiz
Avda Ramón Puyol s/n, E-11202 Algeciras, Spain

C.G. Puntonet
University of Granada
Daniel Saucedo s/n, E-18071 Granada, Spain

E.W. Lang
University of Regensburg
Universitätsstraße, D-93040 Regensburg, Germany
Keywords:
Support vector machines, structural risk minimization, kernel, on-line algorithms, matrix decompositions,
resource allocating network.
Abstract:
In this paper we present a new on-line parametric model for time series forecasting based on Vapnik-Chervonenkis (VC) theory. Using the strong connection between support vector machines (SVM) and Regularization Theory (RT), we propose a regularization operator in order to obtain a suitable expansion of radial basis functions (RBFs), together with the corresponding expressions for updating the neural parameters. This operator seeks the “flattest” function in a feature space, minimizing the risk functional. Finally, we mention some modifications and extensions that can be applied to control neural resources and to select the relevant input space.
1 INTRODUCTION
The purpose of this work is twofold. On the one hand, it introduces the foundations of SVM (Vapnik, 1998) and their connection with RT (Tikhonov and Arsenin, 1997) in order to present the new on-line algorithm for time series forecasting. On the other hand, it attempts to give an overview of Independent Component Analysis, used in this paper to introduce exogenous information into our model.
SVMs are learning algorithms based on the structural risk minimization principle (SRM) (Vapnik and Chervonenkis, 1974), characterized by the use of an expansion of SV “admissible” kernels and by the sparsity of the solution. They have been proposed as a technique for time series forecasting (Muller et al., 1999; Muller et al., 1997) and cope with the overfitting problem, present in classical neural networks, thanks to their high generalization capacity. The SVM prediction is obtained by solving a constrained quadratic programming problem; thus SV machines are nonparametric techniques, i.e. the number of basis functions is unknown beforehand. Solving this complex problem in real-time applications can be very inconvenient because of its high computational time demand.
SVMs are essentially Regularization Networks (RN) with the kernels being Green's functions of the corresponding regularization operators (Smola et al.). Using this connection, with a clever choice of regularization operator (based on the SVM philosophy), we obtain a parametric model that is very resistant to the overfitting problem. Our parametric model is a Resource Allocating Network (Platt, 1991) characterized by the control of neural resources and by the use of matrix decompositions, i.e. Singular Value Decomposition (SVD) and QR decomposition, for input selection and neural pruning (Górriz, 2003).
We organize the paper as follows. In section 2 we give a brief overview of basic VC theory. The SV algorithm and its connection to RT will be presented in section 3, and the new on-line algorithm built on them in section 4. This algorithm will be compared to a previous version of it and to the standard SVM in section 5. Finally, we state some conclusions in section 6.
2 FOUNDATIONS ON VC
THEORY
A general notion of the functional approximation problem¹ can be described as follows:

¹ Before discussing this problem, we recall the existence of an exact representation for continuous functions in terms of simpler functions, given by Kolmogorov's Theorem.
Let $\{(x_1, y_1), \ldots, (x_\ell, y_\ell)\}$ be a set of independent and identically distributed training samples with unknown probability distribution function $F(x, y)$. The problem of learning is that of choosing, from the given set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, where $\Lambda$ is a set of parameters, the one that best approximates the output $y$ of the system. Thus the selection of the desired function must be based on the training set, i.e. applying the empirical risk minimization principle:

$$R(\alpha_\ell) \equiv \int L(y, f(x, \alpha))\, dF(x, y) \qquad (1)$$

$$\simeq \frac{1}{\ell} \sum_{i=1}^{\ell} \left(y_i - f(x_i, \alpha)\right)^2 = R_{\mathrm{emp}}(\alpha_\ell). \qquad (2)$$
where we substitute the loss function $L(y, f(x, \alpha))$, measuring the discrepancy between the response $y$ to a given input $x$ and the solution $f(x, \alpha)$, by a specific loss which yields the least squares method. Under certain conditions the empirical risk functional converges towards the expected risk as $\ell \to \infty$, hence the approximation in equation 1 holds. However, for small sample sizes convergence may not take place, which gives rise to the overfitting problem (Muller et al., 2001). The way to avoid it is to introduce a regularization term (Tikhonov and Arsenin, 1997) limiting the complexity of the loss function class, which raises the problem of model selection (Muller et al., 2001).
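To make the ERM principle of equation 2 concrete, the following minimal Python sketch (a toy illustration, not part of the original method; all names are ours) evaluates the empirical risk of a candidate function on a finite sample:

```python
import numpy as np

def empirical_risk(f, X, y):
    """R_emp(alpha) = (1/l) * sum_i (y_i - f(x_i))^2 on a finite sample."""
    residuals = y - np.array([f(x) for x in X])
    return np.mean(residuals ** 2)

# Toy usage: approximate y = sin(x) with a linear candidate f(x) = 0.5 * x
rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=50)
y = np.sin(X) + 0.1 * rng.standard_normal(50)
print(empirical_risk(lambda x: 0.5 * x, X, y))
```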
The theory for controlling the generalization ability of learning machines is devoted to constructing a new inductive principle that minimizes the risk functional while, at the same time, controlling the complexity of the loss function class. This is a major task, as mentioned above, whenever a small sample of training instances $\ell$ is used (Vapnik, 1998). To construct learning methods we use the bounds found by Vapnik:

$$R(\alpha_\ell^k) \le R_{\mathrm{emp}}(\alpha_\ell^k) + \mathcal{R}\!\left(\frac{\ell}{h_k}\right). \qquad (3)$$

where $R$ is the actual risk, $R_{\mathrm{emp}}$ is the empirical risk depending on the samples, $\mathcal{R}$ is the confidence interval, $\alpha_\ell^k$ ² is the set of selected parameters that defines the class of approximation functions and $h$ is the VC dimension³. In order to minimize the right-hand side of inequality 3 we apply the SRM principle as follows:
Let $\mathcal{L}_1 \subset \mathcal{L}_2 \subset \ldots \subset \mathcal{L}_k \subset \ldots$ be a nested “admissible”⁴ family of loss function classes with finite VC dimension denoted by $h_i$, $i = 1, \ldots, k$. For a given set of observations, the SRM principle chooses the suitable class $\mathcal{L}_k$ (and the function $L(x, \alpha_\ell^k)$) minimizing the guaranteed risk (the right-hand side of inequality 3). In other words, the higher the complexity of the function class, the lower the empirical risk but the higher the confidence interval (the second term in the bound on the expected risk).

² The subindex $k$ refers to the structure or subset of loss functions used in the approximation.
³ Roughly speaking, the VC dimension $h$ measures how many training points can be separated for all possible labellings using functions of the class.
⁴ In the strict sense presented in (Vapnik, 1998), that is, they are bounded functions or satisfy a certain inequality.
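As an illustration of how the SRM principle can be applied in practice, the sketch below selects among nested classes of polynomial models by minimizing an empirical risk plus a VC-style confidence term. The concrete penalty formula and the VC-dimension assignment are illustrative assumptions, not the exact bound of inequality 3:

```python
import numpy as np

def vc_confidence(h, l, eta=0.05):
    # Illustrative VC-style confidence interval for l samples and VC dimension h.
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

def srm_select(X, y, max_degree=8):
    """Pick the polynomial degree (nested classes L_1 c L_2 c ...) minimizing
    empirical risk + confidence term, i.e. a guaranteed-risk surrogate."""
    l = len(X)
    best = None
    for d in range(1, max_degree + 1):          # VC dim of degree-d polynomials ~ d + 1
        coeffs = np.polyfit(X, y, deg=d)
        r_emp = np.mean((np.polyval(coeffs, X) - y) ** 2)
        guaranteed = r_emp + vc_confidence(d + 1, l)
        if best is None or guaranteed < best[1]:
            best = (d, guaranteed)
    return best

rng = np.random.default_rng(1)
X = np.linspace(-1, 1, 40)
y = np.sin(3 * X) + 0.2 * rng.standard_normal(40)
print(srm_select(X, y))
```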
3 SUPPORT VECTOR MACHINES
AND REGULARIZATION
THEORY
The SV algorithm is a nonlinear generalization of the generalized portrait algorithm developed in the sixties by Vapnik and Lerner (Vapnik and Lerner, 1963). The basic idea of SVM for regression and function estimation is to use a mapping function $\Phi$ from the input space $X$ into a high-dimensional feature space $F$ and then to apply a linear regression. Thus the standard linear regression transforms into:

$$f(x) = \langle \omega \cdot \Phi(x) \rangle + b. \qquad (4)$$

where $\Phi: X \to F$, $b$ is a bias or threshold and $\omega \in F$ is a vector defining the function class. The target is to determine $\omega$, i.e. the set of parameters of the neural network, minimizing the regularized risk expressed as:

$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \lambda \|\omega\|^2. \qquad (5)$$
Thus we are enforcing “flatness” in feature space, that is, we seek a small $\omega$. Note that equation 5 is very common in RN with an appropriate second term. The SVM algorithm is a way of solving the minimization of equation 5, which can be expressed as a quadratic programming problem using the formulation stated in (Vapnik, 1998):
$$\text{minimize} \quad \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^*). \qquad (6)$$

given a suitable loss function $L(\cdot)$⁵, a constant $C \ge 0$ and slack variables $\xi_i, \xi_i^* \ge 0$. The optimization problem is solved by constructing a Lagrange function, introducing dual variables and using equation 6 together with the selected loss function.

⁵ For example Vapnik's $\varepsilon$-insensitive loss function (Vapnik, 1998):
$$L(f(x) - y) = \begin{cases} |f(x) - y| - \varepsilon & \text{for } |f(x) - y| \ge \varepsilon \\ 0 & \text{otherwise.} \end{cases}$$
Once it is uniquely solved, we can write the vector $\omega$ in terms of the data points as follows:

$$\omega = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*)\, \Phi(x_i). \qquad (7)$$
where $\alpha_i$, $\alpha_i^*$ are the solutions of the mentioned quadratic problem. Once this problem, with its high computational demand⁶, is solved, we substitute equation 7 into equation 4 and obtain the solution in terms of dot products:
$$f(x) = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^*) \langle \Phi(x_i) \cdot \Phi(x) \rangle + b. \qquad (8)$$
At this point we use a trick to avoid computing the dot product in the high-dimensional feature space in equation 8, replacing it by a kernel function that satisfies Mercer's condition. Mercer's Theorem guarantees the existence of such a kernel function:

$$f(x) = \sum_{i=1}^{\ell} h_i \cdot k(x_i, x) + b. \qquad (9)$$

where $h_i \equiv \alpha_i - \alpha_i^*$ and $k(x_i, x) = \langle \Phi(x_i) \cdot \Phi(x) \rangle$.
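The expansion of equation 9 can be evaluated directly once the coefficients $h_i$, the bias and an admissible kernel are fixed. A minimal numpy sketch with a Gaussian RBF kernel follows; the support vectors and coefficients below are placeholders (in the SVM they come from the dual solution):

```python
import numpy as np

def rbf_kernel(xi, x, sigma=1.0):
    # Gaussian RBF, an admissible (Mercer) kernel: k(x_i, x) = exp(-||x_i - x||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - x) ** 2) / (2.0 * sigma ** 2))

def f(x, centers, h, b, sigma=1.0):
    """Evaluate equation 9: f(x) = sum_i h_i * k(x_i, x) + b."""
    return sum(hi * rbf_kernel(xi, x, sigma) for hi, xi in zip(h, centers)) + b

centers = np.array([[0.0], [1.0], [2.0]])   # x_i: placeholder support vectors
h = np.array([0.5, -0.2, 0.8])              # h_i = alpha_i - alpha_i*
print(f(np.array([1.5]), centers, h, b=0.1))
```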
Finally we note, regarding the sparsity of the SV expansion 8, that only the elements satisfying $|f(x_i) - y_i| \ge \varepsilon$, where $\varepsilon$ is the allowed deviation of $f(x_i)$ from $y_i$ (see the selected loss function), have nonzero Lagrange multipliers $\alpha_i$, $\alpha_i^*$. This can be proved by applying the Karush-Kuhn-Tucker (KKT) conditions (Kuhn and Tucker, 1951) to the SV dual optimization problem.
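The sparsity induced by the KKT conditions can be observed numerically with any off-the-shelf ε-insensitive SVR. The sketch below uses scikit-learn purely for illustration (it is not the implementation used in this paper); only the samples lying on or outside the ε-tube end up with nonzero dual coefficients:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 4, size=80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.05 * rng.standard_normal(80)

model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)

# Only samples with |f(x_i) - y_i| >= epsilon keep nonzero multipliers (support vectors).
print("support vectors:", len(model.support_), "out of", len(X))
print("max |dual coef|:", np.abs(model.dual_coef_).max())
```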
RT appeared among the methods for solving ill-posed problems (Tikhonov and Arsenin, 1997). In RN we minimize an expression similar to equation 5. However, the search criterion is enforcing smoothness (instead of flatness) of the function in input space (instead of feature space). Thus we get:

$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2} \|\hat{P} f\|^2. \qquad (10)$$
where $\hat{P}$ denotes a regularization operator in the sense of (Tikhonov and Arsenin, 1997), mapping from the Hilbert space $H$ of functions to a dot product space $D$ such that $\langle f, g \rangle$, $\forall f, g \in H$, is well defined. Applying Fréchet's differential⁷ to equation 10 and the concept of the Green's function of $\hat{P}^*\hat{P}$:

$$\hat{P}^*\hat{P} \cdot G(x_i, x_j) = \delta(x_i - x_j). \qquad (11)$$

(here $\delta$ denotes Dirac's $\delta$, that is $\langle f, \delta(x_i) \rangle = f(x_i)$), we get (Górriz, 2003):

$$f(x) = \lambda \sum_{i=1}^{\ell} [y_i - f(x_i)]_\varepsilon \cdot G(x, x_i). \qquad (12)$$
The correspondence between SVM and RN (equations 9 and 12) is proved if and only if the Green's function $G$ is an “admissible” kernel in the terms of Mercer's theorem, i.e. we can write $G$ as:

$$G(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle \qquad (13)$$
$$\text{with} \quad \Phi: x_i \mapsto (\hat{P} G)(x_i, \cdot). \qquad (14)$$
⁶ This calculation must be computed several times during the process.
⁷ Generalized differentiation of a function: $dR[f] = \left[\frac{d}{d\rho} R[f + \rho h]\right]$, where $h \in H$.
A similar proof of this connection can be found in (Smola et al.). Hence, given a regularization operator, we can find an admissible kernel such that the SV machine using it will enforce flatness in feature space and minimize equation 10. Moreover, given an SV kernel, we can find a regularization operator such that the SVM can be seen as an RN.
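Mercer admissibility of a candidate kernel (or Green's function) can be sanity-checked numerically: its Gram matrix on any finite sample must be positive semi-definite. The following sketch does this for the Gaussian kernel; it is a finite-sample check, not a proof:

```python
import numpy as np

def gram_matrix(X, sigma=1.0):
    # K_ij = k(x_i, x_j) for the Gaussian kernel on a finite sample X.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.default_rng(3).standard_normal((30, 2))
eigvals = np.linalg.eigvalsh(gram_matrix(X))
# All eigenvalues should be >= 0 (up to numerical error) for an admissible kernel.
print(eigvals.min())
```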
4 ON-LINE ALGORITHM USING
REGULARIZATION
OPERATORS
In this section we present a new on-line RN based on “Resource Allocating Network” (RAN) algorithms⁸ (Platt, 1991), which consist of a network using RBFs, a strategy for:

- allocating new units (RBFs), using a two-part novelty condition (Platt, 1991);
- input space selection and neural pruning, using matrix decompositions such as SVD and QR with pivoting (Górriz, 2003);

and a learning rule based on SRM, as discussed in the previous sections. Our network has one layer, as stated in equation 9. In terms of RBFs the latter equation can be expressed as:
$$f(x) = \sum_{i=1}^{N(t)} h_i \cdot \exp\!\left(-\frac{\|x(t) - x_i(t)\|^2}{2\sigma_i^2(t)}\right) + b. \qquad (15)$$
where $N(t)$ is the number of neurons, $x_i(t)$ is the center of the $i$-th neuron and $\sigma_i(t)$ its radius, at time $t$. In order to minimize equation 10 we propose a regularization operator based on the SVM philosophy. We enforce flatness in feature space, as described in section 3, using the regularization operator $\|\hat{P} f\|^2 \equiv \|\omega\|^2$, thus we get:
$$R_{\mathrm{reg}}[f] = R_{\mathrm{emp}}[f] + \frac{\lambda}{2} \sum_{i,j=1}^{N(t)} h_i h_j k(x_i, x_j). \qquad (16)$$
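Equation 16 is simply the empirical risk plus a quadratic form in the output weights. A small sketch of its evaluation for the RBF expansion of equation 15 is given below; for simplicity a single shared radius is used, whereas the network of equation 15 has one radius per neuron:

```python
import numpy as np

def regularized_risk(x, y, centers, h, sigma, b, lam):
    """R_reg of equation 16: (y - f(x))^2 + (lambda/2) * h^T K h, with K_ij = k(x_i, x_j)."""
    k = lambda a, c: np.exp(-np.linalg.norm(a - c) ** 2 / (2.0 * sigma ** 2))
    f_x = sum(hi * k(x, ci) for hi, ci in zip(h, centers)) + b
    K = np.array([[k(ci, cj) for cj in centers] for ci in centers])
    return float((y - f_x) ** 2 + 0.5 * lam * h @ K @ h)

# Placeholder usage with two neurons
centers = np.array([[0.0], [1.0]]); h = np.array([0.3, -0.1])
print(regularized_risk(np.array([0.5]), 0.2, centers, h, sigma=1.0, b=0.0, lam=0.1))
```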
Assuming $R_{\mathrm{emp}} = (y - f(x))^2$, we minimize equation 16 by adjusting the centers and radii (gradient descent method, $\Delta\chi = -\eta\, \partial R[f]/\partial\chi$, with simulated annealing):

$$\Delta x_i = 2\frac{\eta}{\sigma_i}(x - x_i)\, h_i\, (f(x) - y)\, k(x, x_i) + \|\omega\|^2. \qquad (17)$$
where $\|\omega\|^2 = \alpha \sum_{i,j=1}^{N(t)} h_i h_j k(x_i, x_j)(x_i - x_j)$ and

$$\Delta h_i = \tilde{\alpha}(t) f(x_i) - \eta (f(x) - y) k(x, x_i). \qquad (18)$$
⁸ The principal feature of these algorithms is the sequential adaptation of neural resources.
Table 1: Pseudo-code of SVM-online (t_o denotes the current iteration; k denotes the prediction horizon).

  Initialize parameters and variables
  Build input Toeplitz matrix A using (3W-1) input values
  Input space selection: determine Np relevant lags L using SVD and QR_wp [8]
  Determine input vector: x = x(t_o - k - L(1))
  while (true)
    if (n_rbfs > 0)
      Compute f(x)
      Find nearest RBF: ||x - x_dmin||
    else
      f(x) = x(t_o - k - 1)
    Calculate error: e = ||f(x) - x(t_o)||
    if (e > epsilon and ||x - x_dmin|| > delta) [7]
      Add RBF with parameters:
        x_i = x, sigma_i = kappa * ||x - c_dmin||, h = e
    else
      Execute pruning (SVD and QR_wp on neural activations) [8]
      Update parameters minimizing actual risk [15][16]
    if (e > theta * epsilon and n_inps < max_inps)
      n_inps = n_inps + 1
      Determine new lags: L = [L_1, L_2, ..., L_Np]
      Add rbf_add RBFs
    t_o = t_o + 1
  end
where $\alpha(t)$, $\tilde{\alpha}(t)$ are scalar-valued “adaptation gains”, related to a similar gain used in stochastic approximation processes; as in those methods, they should decrease in time. The second summand in equation 17 can be evaluated over several regions, inspired by the so-called “divide-and-conquer” principle used in unsupervised learning, i.e. competitive learning in self-organizing maps (Kohonen, 1990) or in SVM experts (Cao, 2003). This is necessary because of the volatile nature of time series; e.g. stock returns switch their dynamics among different regions, leading to gradual changes in the dependency between the input and output variables. Thus the upper summation index in the latter equation is redefined as:

$$N_c(t) = \{ s_i(t) : \|x(t) - x_i(t)\| \le \rho \}, \qquad (19)$$

the set of neurons close to the current input. The structure of the algorithm is shown in Table 1 as pseudo-code, including the set of initial parameters.
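A hedged Python sketch of the main on-line loop of Table 1 is given below: it computes the prediction, applies the two-part novelty condition to allocate a new RBF, and otherwise adapts nearby units by a gradient step. The input-selection and pruning stages (SVD and QR_wp), the simulated annealing of the gains and the regularization term of equations 17-18 are omitted or reduced to placeholders, so this is only an outline of the procedure, not the authors' implementation:

```python
import numpy as np

class OnlineRBF:
    """Simplified RAN-style on-line RBF network (outline of Table 1's main loop)."""
    def __init__(self, epsilon=0.05, delta=0.5, kappa=0.9, eta=0.05):
        self.centers, self.sigmas, self.h = [], [], []
        self.b = 0.0
        self.epsilon, self.delta, self.kappa, self.eta = epsilon, delta, kappa, eta

    def _k(self, c, s, x):
        return np.exp(-np.linalg.norm(x - c) ** 2 / (2.0 * s ** 2))

    def predict(self, x):
        return sum(h * self._k(c, s, x)
                   for h, c, s in zip(self.h, self.centers, self.sigmas)) + self.b

    def step(self, x, y):
        f_x = self.predict(x)
        e = y - f_x
        dists = [np.linalg.norm(x - c) for c in self.centers] or [np.inf]
        d_min = min(dists)
        if abs(e) > self.epsilon and d_min > self.delta:
            # Two-part novelty condition: allocate a new unit centered at x.
            self.centers.append(x.copy())
            self.sigmas.append(self.kappa * (d_min if np.isfinite(d_min) else 1.0))
            self.h.append(e)
        else:
            # Adapt units: gradient step on (y - f(x))^2; regularizer omitted here.
            for i, (c, s) in enumerate(zip(self.centers, self.sigmas)):
                k = self._k(c, s, x)
                self.h[i] += self.eta * e * k
                self.centers[i] = c + self.eta * e * self.h[i] * k * (x - c) / s ** 2
        return f_x

# Toy usage on a noisy sine stream
rng = np.random.default_rng(4)
net = OnlineRBF()
for t in np.linspace(0, 6 * np.pi, 500):
    x = np.array([np.sin(t), np.sin(t - 0.3)])
    net.step(x, np.sin(t + 0.3) + 0.05 * rng.standard_normal())
print("allocated RBFs:", len(net.centers))
```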
Table 2: Evolution of the NRMSE (Normalized Root Mean Square Error). Xth-S denotes the Xth-step prediction NRMSE on the test set (noisy Mackey-Glass with delay-changing operation mode).

  Method        | 1st-S  | 25th-S  | 50th-S  | 75th-S  | 100th-S
  NAPA PRED     | 1.1982 | 0.98346 | 0.97866 | 0.91567 | 0.90985
  Standard SVM  | 0.7005 | 0.7134  | 0.7106  | 0.7212  | 0.7216
  SVM_online    | 0.7182 | 0.71525 | 0.71522 | 0.72094 | 0.7127
5 EXPERIMENTS
We apply our network to the prediction of complex time series. We choose the high-dimensional chaotic system generated by the Mackey-Glass delay differential equation:

$$\frac{dx(t)}{dt} = -b \cdot x(t) + a \cdot \frac{x(t - \tau)}{1 + x^{10}(t - \tau)}. \qquad (20)$$

with $b = 0.1$, $a = 0.2$ and delay $t_d = 17$. This equation was originally presented as a model of blood regulation and has become a popular benchmark for time series modelling. We add two modifications to equation 20:
- Zero-mean Gaussian noise with standard deviation equal to 1/4 of the standard deviation of the original series.
- The dynamics change randomly in terms of the delay (every 100-300 time steps), $t_d = 17, 23, 30$ (a sketch of this modified benchmark is given below).
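Under assumptions about the integration step and the switching schedule (which the paper does not spell out), the noisy benchmark with a randomly changing delay can be reproduced by a simple Euler integration of equation 20, as sketched below:

```python
import numpy as np

def mackey_glass(n_samples=2000, dt=1.0, a=0.2, b=0.1, seed=0):
    """Euler integration of eq. 20 with the delay switching among {17, 23, 30}."""
    rng = np.random.default_rng(seed)
    delays = [17, 23, 30]
    tau = 17
    next_switch = rng.integers(100, 300)
    hist_len = max(delays) + 1
    x = np.full(hist_len, 1.2)                 # constant initial history
    series = []
    for t in range(n_samples):
        if t == next_switch:                   # random delay change every 100-300 steps
            tau = delays[rng.integers(len(delays))]
            next_switch += rng.integers(100, 300)
        x_tau = x[-tau]
        x_new = x[-1] + dt * (-b * x[-1] + a * x_tau / (1.0 + x_tau ** 10))
        x = np.append(x[1:], x_new)
        series.append(x_new)
    series = np.array(series)
    # Additive zero-mean Gaussian noise at 1/4 of the series' standard deviation.
    noise = 0.25 * series.std() * rng.standard_normal(n_samples)
    return series + noise

print(mackey_glass()[:5])
```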
We integrated the chaotic model using MatLab software on a Pentium III at 850 MHz, obtaining 2000 patterns. For our comparison we use 100 prediction results from SVM_online (presented in this paper), the standard SVM (with ε-insensitive loss) and NAPA PRED (a RAN algorithm using matrix decompositions, one of the best on-line algorithms to date (Górriz, 2003)). Clearly there is a remarkable difference between the previous on-line algorithm and the SVM philosophy. The standard SVM and SVM_online achieve similar results on this data set at the beginning of the process. In addition, there is a noticeable improvement in the last iterations because of the volatile nature of the series. The change in the time delay $t_d$ leads to gradual changes in the dependency between the input and output variables and, in general, it is hard for a single model, including SVMs, to capture such a dynamic input-output relationship inherent in the data. Focusing our attention on the on-line algorithms, we observe the better performance of the new algorithm, i.e. a lower number of neurons (“sparsity”), a lower input space dimension and better forecasting results.
6 CONCLUSIONS
Based on SRM and the “divide and conquer” principle, a new on-line algorithm is developed by combining SVM and SOM, using a resource allocating network and matrix decompositions. By minimizing the regularized risk functional, using an operator that enforces flatness in feature space, we build a hybrid model that achieves high prediction performance compared with previous on-line algorithms for time series forecasting. This performance is similar to that achieved by SVM but with a lower computational time demand, an essential feature in real-time systems. The benefit of choosing SVM for regression consists in solving a uniquely solvable quadratic optimization problem, unlike general RBF networks, which require a suitable non-linear optimization with the danger of getting stuck in local minima. Nevertheless, the RBF network used in this paper joins various techniques, obtaining high performance even under extremely volatile conditions, since the level of noise and the delay-switching operation mode applied to the chaotic dynamics were rather demanding.
REFERENCES
Cao, L. (2003). Support Vector Machines Experts for Time Series Forecasting. Neurocomputing, vol. 51, 321-339.

Górriz, J. (2003). Algoritmos Híbridos para la Modelización de Series Temporales con Técnicas AR-ICA. PhD Thesis, University of Cádiz, Departamento de Electrónica.

Kohonen, T. (1990). The Self-Organizing Map. Proceedings of the IEEE, vol. 78, num. 9, 1464-1480.

Kuhn, H. and Tucker, A. (1951). Nonlinear Programming. In 2nd Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 481-492.

Muller, K., Mika, S., Ratsch, G., Tsuda, K., and Scholkopf, B. (2001). An Introduction to Kernel-Based Learning Algorithms. IEEE Transactions on Neural Networks, vol. 12, num. 2, 181-201.

Muller, K., Smola, A., Ratsch, G., Scholkopf, B., and Kohlmorgen, J. (1999). Using Support Vector Machines for time series prediction. In B. Scholkopf, C.J.C. Burges, A.J. Smola (Eds.), Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 243-254.

Muller, K., Smola, A., Ratsch, G., Scholkopf, B., Kohlmorgen, J., and Vapnik, V. (1997). Predicting time series with Support Vector Machines. In ICANN'97: Proceedings of the Seventh International Conference on Artificial Neural Networks, Lausanne, Switzerland, 999-1004.

Platt, J. (1991). A resource-allocating network for function interpolation. Neural Computation, 3, 213-225.

Smola, A., Scholkopf, B., and Muller, K. The connection between regularization operators and support vector kernels. Neural Networks, 11, 637-649.

Tikhonov, A. and Arsenin, V. (1997). Solutions of Ill-Posed Problems. Winston, Washington D.C., U.S.A.

Vapnik, V. (1998). Statistical Learning Theory. Wiley, N.Y.

Vapnik, V. and Chervonenkis, A. (1974). Theory of Pattern Recognition [in Russian]. Nauka, Moscow.

Vapnik, V. and Lerner, A. (1963). Pattern Recognition using Generalized Portrait Method. Automation and Remote Control, vol. 24, issue 6.