Let ∆ ≡ {(x_1, y_1), . . . , (x_ℓ, y_ℓ)} be a set of independent and identically distributed training samples with unknown probability distribution function F(x, y). The problem of learning is that of choosing, from the given set of functions f(x, α_ℓ), α ∈ Λ, where Λ is a set of parameters, the one that best approximates the output y of the system. Thus the selection of the desired function must be based on the training set ∆, i.e. by applying the empirical risk minimization principle:
$$ R(\alpha_\ell) \equiv \int L(y, f(x, \alpha)) \, dF(x, y) \qquad (1) $$
$$ \simeq \frac{1}{\ell} \sum_{i=1}^{\ell} \left( y_i - f(x_i, \alpha) \right)^2 = R_{emp}(\alpha_\ell). \qquad (2) $$
where the generic loss function L(y, f(x, α)), measuring the discrepancy between the response y to a given input x and the solution f(x, α), has been replaced by the specific loss that yields the least squares method. Under certain conditions the empirical risk functional converges towards the expected risk as ℓ → ∞, hence the approximation in equation 1 holds. However, for small sample sizes convergence may fail, which gives rise to the overfitting problem (Muller et al., 2001). The way of avoiding it is to introduce a regularization term (Tikhonov and Arsenin, 1997) to limit the complexity of the loss function class, which in turn raises the problem of model selection (Muller et al., 2001).
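As a concrete illustration of equations 1 and 2 (not taken from the original formulation), the empirical risk of a candidate function can be computed directly from the training set ∆. The minimal Python sketch below assumes a polynomial function class and a synthetic data generator:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical training set Delta = {(x_i, y_i)}, i = 1..l, drawn i.i.d.
    # from an unknown distribution F(x, y); here we simply simulate one.
    l = 30
    x = rng.uniform(-1.0, 1.0, size=l)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=l)

    def empirical_risk(alpha, x, y):
        """R_emp(alpha): mean squared discrepancy between y_i and f(x_i, alpha),
        where f(x, alpha) is assumed to be a polynomial with coefficients alpha."""
        return np.mean((y - np.polyval(alpha, x)) ** 2)

    # Empirical risk minimization over a fixed polynomial class:
    # np.polyfit returns the alpha minimizing R_emp for the given degree.
    alpha_l = np.polyfit(x, y, deg=3)
    print("R_emp(alpha_l) =", empirical_risk(alpha_l, x, y))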
The theory for controlling the generalization ability of learning machines is devoted to constructing a new inductive principle that minimizes the risk functional and, at the same time, controls the complexity of the loss function class. This is a major task, as noted above, whenever a small sample of ℓ training instances is used (Vapnik, 1998). To construct learning methods we use the bounds found by Vapnik:
$$ R(\alpha_\ell^k) \leq R_{emp}(\alpha_\ell^k) + \Re\!\left( \frac{\ell}{h_k} \right). \qquad (3) $$
where R is the actual risk, R_emp is the empirical risk depending on the samples, ℜ is the confidence interval, α_ℓ^k is the set of selected parameters that defines the class of approximation functions², and h is the VC dimension³. In order to minimize the right-hand side of inequality 3 we apply the SRM principle as follows:
Let L_1 ⊂ L_2 ⊂ . . . ⊂ L_k ⊂ . . . be a nested "admissible"⁴ family of loss function classes with finite VC dimensions denoted by h_i, i = 1, . . . , k. For a given set of observations ∆ the SRM principle chooses the suitable class L_k (and the function L(x, α_ℓ^k)) minimizing the guaranteed risk (the right-hand side of inequality 3). In other words, the higher the complexity of the function class, the lower the empirical risk but the larger the confidence interval (the second term in the bound on the expected risk).

² The subindex k is related to the structure or subset of loss functions we use in the approximation.
³ Roughly speaking, the VC dimension h measures how many training points can be separated for all possible labellings using functions of the class.
⁴ In the strict sense presented in (Vapnik, 1998), that is, they are bounded functions or satisfy a certain inequality.
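The SRM selection rule can also be sketched numerically. In the Python fragment below the nested classes L_k are polynomials of increasing degree, the VC dimension h_k is taken as degree + 1 and the confidence term uses one standard form of Vapnik's bound; these concrete choices are assumptions made only to illustrate the principle, not the settings used in this paper:

    import numpy as np

    rng = np.random.default_rng(1)
    l = 30
    x = rng.uniform(-1.0, 1.0, size=l)
    y = np.sin(np.pi * x) + 0.1 * rng.normal(size=l)

    def confidence_term(l, h, eta=0.05):
        """One common form of the confidence interval: it grows with the
        VC dimension h and shrinks as the sample size l increases."""
        return np.sqrt((h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l)

    best = None
    for k, degree in enumerate(range(0, 8), start=1):
        # Nested structure L_1 subset L_2 subset ...: polynomials of growing degree.
        alpha_k = np.polyfit(x, y, deg=degree)
        r_emp = np.mean((y - np.polyval(alpha_k, x)) ** 2)
        h_k = degree + 1                              # assumed VC dimension of class L_k
        guaranteed = r_emp + confidence_term(l, h_k)  # right-hand side of inequality (3)
        if best is None or guaranteed < best[0]:
            best = (guaranteed, k, degree)

    print("SRM picks class L_%d (degree %d), guaranteed risk %.3f" % (best[1], best[2], best[0]))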
3 SUPPORT VECTOR MACHINES AND REGULARIZATION THEORY
The SV algorithm is a nonlinear generalization of the Generalized Portrait algorithm developed in the sixties by Vapnik and Lerner (Vapnik and Lerner, 1963). The basic idea in SVM for regression and function estimation is to use a mapping function Φ from the input space X into a high dimensional feature space F and then to apply a linear regression there. Thus the standard linear regression transforms into:
$$ f(x) = \langle \omega \cdot \Phi(x) \rangle + b. \qquad (4) $$
where Φ : X → F, b is a bias or threshold and ω ∈ F is a vector defining the function class. The target is to determine ω, i.e. the set of parameters in the neural network, by minimizing the regularized risk expressed as:

$$ R_{reg}[f] = R_{emp}[f] + \lambda \lVert \omega \rVert^2. \qquad (5) $$
Thus we are enforcing "flatness" in feature space, that is, we seek a small ω. Note that equation 5 is very common in RN, with a particular choice of the second term.
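For instance, with an explicit feature map Φ and the squared loss, minimizing the regularized risk of equation 5 reduces to ridge regression in feature space. The minimal sketch below assumes a polynomial feature map and an arbitrary value of λ:

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.uniform(-1.0, 1.0, size=(40, 1))
    y = np.sin(np.pi * x[:, 0]) + 0.1 * rng.normal(size=40)

    def phi(x):
        """Assumed feature map Phi: X -> F (polynomial features up to degree 3)."""
        return np.hstack([x ** d for d in range(4)])  # columns 1, x, x^2, x^3

    lam = 1e-2              # regularization weight lambda in eq. (5)
    F = phi(x)              # design matrix in feature space
    # Minimizing (1/l)*||y - F w||^2 + lam*||w||^2 gives the normal equations
    # (F^T F + l*lam*I) w = F^T y; the bias b is absorbed by the constant feature.
    w = np.linalg.solve(F.T @ F + lam * len(y) * np.eye(F.shape[1]), F.T @ y)

    f_x = F @ w             # f(x) = <w, Phi(x)>
    print("regularized training MSE:", np.mean((y - f_x) ** 2))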
The SVM algorithm is a way of solving the minimization of equation 5, which can be expressed as a quadratic programming problem using the formulation stated in (Vapnik, 1998):

$$ \text{minimize} \quad \frac{1}{2} \lVert \omega \rVert^2 + C \sum_{i=1}^{\ell} (\xi_i + \xi_i^{*}). \qquad (6) $$
given a suitable loss function L(·)⁵, a constant C ≥ 0 and slack variables ξ_i, ξ_i^* ≥ 0. The optimization problem is solved by constructing a Lagrange function, introducing dual variables and using equation 6 together with the selected loss function.
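In practice this quadratic programme is handled by standard SVM solvers. A minimal sketch with scikit-learn's SVR follows; the RBF kernel and the values of C and ε are arbitrary illustrative choices, not taken from the paper:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(3)
    X = rng.uniform(-1.0, 1.0, size=(60, 1))          # synthetic data, for illustration
    y = np.sin(np.pi * X[:, 0]) + 0.1 * rng.normal(size=60)

    # C weights the slack variables xi_i, xi_i^* in eq. (6); epsilon sets the
    # width of the insensitive tube of the loss function L(.) (see footnote 5).
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
    model.fit(X, y)

    print("number of support vectors:", model.support_.size)
    print("training MSE:", np.mean((model.predict(X) - y) ** 2))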
Once it is uniquely solved, we can write the vector
ω in terms of the data points as follows:
$$ \omega = \sum_{i=1}^{\ell} (\alpha_i - \alpha_i^{*}) \, \Phi(x_i). \qquad (7) $$
where α_i, α_i^* are the solutions of the mentioned quadratic problem. Once this problem, with high
⁵ For example Vapnik's ε-insensitive loss function (Vapnik, 1998):
$$ L(f(x) - y) = \begin{cases} |f(x) - y| - \varepsilon & \text{for } |f(x) - y| \geq \varepsilon \\ 0 & \text{otherwise} \end{cases} $$
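Equation 7 can be checked numerically: with a linear kernel (Φ the identity map) the vector ω is recovered from the differences α_i − α_i^* returned by the solver, which scikit-learn's SVR exposes as dual_coef_. The data and parameter values below are assumptions for illustration only:

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(4)
    X = rng.uniform(-1.0, 1.0, size=(50, 2))          # synthetic, roughly linear data
    y = X @ np.array([1.5, -0.7]) + 0.05 * rng.normal(size=50)

    model = SVR(kernel="linear", C=10.0, epsilon=0.05).fit(X, y)

    # Eq. (7): w = sum_i (alpha_i - alpha_i^*) Phi(x_i); here Phi(x) = x.
    duals = model.dual_coef_.ravel()     # alpha_i - alpha_i^* for the support vectors
    w = duals @ X[model.support_]        # rebuild w from the data points
    print(np.allclose(w, model.coef_.ravel()))  # should print True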