INVERSE PROBLEMS IN LEARNING FROM DATA
Věra Kůrková
Institute of Computer Science, Academy of Sciences of the Czech Republic, Pod Vodárenskou věží 2, Prague, Czech Republic
Keywords:
Learning from data, Minimization of empirical and expected error functionals, Reproducing kernel Hilbert
spaces.
Abstract:
It is shown that the application of methods from the theory of inverse problems to learning from data leads to simple
proofs of characterizations of minima of empirical and expected error functionals and of their regularized versions.
The reformulation of learning in terms of inverse problems also enables a comparison of the regularized and non-regularized
cases, showing that regularization achieves stability by merely modifying the output weights of global
minima. Methods from the theory of inverse problems lead to the choice of reproducing kernel Hilbert spaces as suitable
ambient function spaces.
1 INTRODUCTION
Supervised learning can be formally described as an
optimization problem of minimization of error func-
tionals over parameterized sets of input-output func-
tions computable by a given computational model.
Various learning algorithms iteratively modify the parameters of the model until sufficiently small values of
the error functionals are achieved and the corresponding
input-output function of the model fits the training data sufficiently well.
However, such algorithms at best achieve a good fit to the training data. It has been
proven that for typical computational units such as
sigmoidal perceptrons and Gaussian kernels, sufficiently large networks can exactly interpolate any
sample of data (Ito, 1992; Micchelli, 1986). Data
are often noisy, and networks perfectly fitting randomly chosen training samples may be overly influenced by the noise and may not perform well on
data that were not chosen for training. Thus various modifications of error functionals aiming to improve
the so-called "generalization capability" of the model have
been proposed. In the 1990s, Girosi and Poggio (Girosi
and Poggio, 1990) introduced into learning theory
the method of regularization as a means of improving generalization. They considered modifications
of error functionals based on Tikhonov regularization,
which adds an additional functional, called a stabilizer,
penalizing undesired properties of input-output
functions such as high-frequency oscillations (Girosi
et al., 1995). In practical applications, various simple
stabilizers have been successfully used, such as seminorms based on derivatives (Bishop, 1995) or sums or
squares of output weights (Fine, 1999; Kecman, 2001).
Regularization was developed in the 1970s as a
method of improving stability of solutions of certain problems from physics called inverse problems,
where unknown causes (e.g., shapes of functions,
forces or distributions) of known consequences (measured data) have to be found. These problems have
been studied in applied sciences such as acoustics,
geophysics and computerized tomography (see, e.g.,
(Hansen, 1998)). To solve such a problem, one needs
to know how the unknown causes determine the known consequences, which can often be described in terms of
an operator. In problems originating from physics,
the dependence of consequences on causes is usually described by integral operators (such as those defining the
Radon or Laplace transforms (Bertero, 1989; Engl
et al., 1999)). As some problems do not have
exact solutions or have solutions which are unstable
with respect to noise, various methods of finding approximate solutions and improving their stability have
been developed.
Also minimization of empirical and expected error
functionals with quadratic loss functions can be formulated as inverse problems. But the operators representing the problems of finding unknown input-output
functions that fit training data well are quite
different from the typical operators describing inverse
problems in applied science. In this paper, we
show that application of methods from the theory of inverse problems to learning from data leads to the choice
of reproducing kernel Hilbert spaces as suitable ambient function spaces. In these spaces, characterization of functions minimizing error functionals follows easily from
basic results on pseudosolutions and regularized solutions. These characterizations have been proven
earlier using other methods, such as Fréchet derivatives (Wahba, 1990; Cucker and Smale, 2002; Poggio
and Smale, 2003) or operators with fractional powers
(Cucker and Smale, 2002), but the reformulation of minimization of error functionals in terms of inverse problems allows much simpler and more transparent proofs. It
also provides a unifying framework showing that an
optimal regularized solution of the minimization task
differs from a non-regularized one merely in the coefficients of linear combinations of computational units.
Thus the representation of learning as an inverse problem
provides a useful tool for theoretical investigation of
properties of kernel and radial-basis networks. It
characterizes optimal input-output functions of these
computational models and enables estimation of the effects
of regularization.
The paper is organized as follows. Section 2 gives
basic concepts and notation on learning from data. In
Section 3, basic terminology and tools from the theory of
inverse problems are introduced. In Sections 4 and 5, these
tools are applied to the description of theoretically optimal input-output functions in learning from data over
networks with kernel units.
2 ERROR FUNCTIONALS WITH
QUADRATIC LOSS FUNCTIONS
In statistical learning theory, learning from data has
been modeled as a search for a function minimizing the expected error functional defined by data described by a probability measure. For $X$ a compact
subset of $\mathbb{R}^d$ and $Y$ a bounded subset of $\mathbb{R}$, let $\rho$ be
a non degenerate (no nonempty open set has measure
zero) probability measure on $Z = X \times Y$. The expected
error functional (sometimes also called expected risk
or theoretical error) determined by $\rho$ is defined for
every $f$ in the set $\mathcal{M}(X)$ of all bounded $\rho$-measurable
functions on $X$ as $E_{\rho,V}(f) = \int_Z V(f(x),y)\, d\rho$, where
$V : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$ is a loss function. The most common loss function is the quadratic loss defined as
$V(u,v) = (u - v)^2$. We shortly denote by $E_\rho$ the expected error with the quadratic loss, i.e.,
\[
E_\rho(f) = \int_Z (f(x) - y)^2 \, d\rho.
\]
Learning algorithms use a discretized version of
the expected error called the empirical error. It is
determined by a training sample $z = \{(x_i, y_i) \in \mathbb{R}^d \times \mathbb{R} \mid i = 1, \dots, m\}$ of input-output pairs of data and a
discrete probability measure $p = \{p(i) \mid i = 1, \dots, m\}$
on the set $\{1, \dots, m\}$ (i.e., $\sum_{i=1}^m p(i) = 1$). The empirical error is denoted $E_{z,p,V}$ and defined as $E_{z,p,V}(f) = \sum_{i=1}^m p(i)\, V(f(x_i), y_i)$. Similarly as in the case of the expected error, we denote by $E_{z,p}$ the empirical error
with the quadratic loss function, i.e.,
\[
E_{z,p}(f) = \sum_{i=1}^m p(i) (f(x_i) - y_i)^2.
\]
One of many advantages of the quadratic loss
function is that it makes it possible to reformulate minimization
of the expected and empirical errors as minimization
of distances from certain "optimal" functions.
It is easy to see and well-known (Cucker and
Smale, 2002) that the minimum of $E_\rho$ over the set
$\mathcal{M}(X)$ of all bounded $\rho$-measurable functions on $X$
is achieved at the regression function $f_\rho$ defined for
$x \in X$ as $f_\rho(x) = \int_Y y \, d\rho(y|x)$, where $\rho(y|x)$ is the conditional (w.r.t. $x$) probability measure on $Y$.
Let $\rho_X$ denote the marginal probability measure on $X$ defined for every $S \subseteq X$ as $\rho_X(S) = \rho(\pi_X^{-1}(S))$, where $\pi_X : X \times Y \to X$ denotes the projection, and let $L^2_{\rho_X}(X)$ denote the Lebesgue space
of all functions on $X$ satisfying $\int_X f^2 \, d\rho_X < \infty$, with
the $L^2_{\rho_X}$-norm denoted by $\|\cdot\|_{L^2}$. It can be easily
verified that $f_\rho \in L^2_{\rho_X}(X)$. So $\min_{f \in \mathcal{M}(X)} E_\rho(f) = \min_{f \in L^2_{\rho_X}(X)} E_\rho(f) = E_\rho(f_\rho) = \sigma^2_\rho$. Moreover, for
every $f \in L^2_{\rho_X}(X)$ (Cucker and Smale, 2002, p. 5)
\[
E_\rho(f) = \int_X (f(x) - f_\rho(x))^2 \, d\rho_X + \sigma^2_\rho = \|f - f_\rho\|^2_{L^2} + \sigma^2_\rho. \tag{1}
\]
So on the function space $L^2_{\rho_X}(X)$, the expected error functional $E_\rho$ with the quadratic loss can be represented as the square of the $L^2_{\rho_X}$-distance from its minimum point $f_\rho$.
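For a finite discrete measure $\rho$, the decomposition (1) can be verified directly by a short computation. The following numpy sketch is purely illustrative: the toy sample spaces, the probability table, and the values of the candidate function $f$ are made-up examples, not data considered in the paper.

```python
import numpy as np

# Toy check of decomposition (1) for a finite discrete measure rho on X x Y.
# The spaces, the probability table and the candidate f are illustrative only.
X = np.array([0.0, 1.0, 2.0])                     # input points
Y = np.array([-1.0, 1.0])                         # output values
rho = np.array([[0.1, 0.2],                       # rho[i, j] = P(x = X[i], y = Y[j])
                [0.2, 0.1],
                [0.1, 0.3]])

rho_X = rho.sum(axis=1)                           # marginal measure rho_X
f_rho = (rho * Y).sum(axis=1) / rho_X             # regression function E[y | x]
sigma2 = (rho * (Y - f_rho[:, None]) ** 2).sum()  # noise variance sigma^2_rho

def expected_error(f_vals):
    """E_rho(f) = sum_{i,j} (f(X[i]) - Y[j])^2 rho[i, j]."""
    return (rho * (f_vals[:, None] - Y) ** 2).sum()

f = np.array([0.5, -0.3, 0.2])                    # values of an arbitrary f on X
lhs = expected_error(f)
rhs = ((f - f_rho) ** 2 * rho_X).sum() + sigma2   # ||f - f_rho||^2 + sigma^2_rho
print(np.isclose(lhs, rhs))                       # True
```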
Also the empirical error functional can be represented in terms of a distance from a certain function. Let $X_z = \{x_1, \dots, x_m\}$ with all $x_i$ distinct and
$h_z : X_z \to Y$ be defined as $h_z(x_i) = y_i$. Let $\|\cdot\|_{2,m}$ denote the weighted $\ell^2$-norm on $\mathbb{R}^m$ defined as $\|x\|^2_{2,m} = \sum_{i=1}^m p(i)\, x_i^2$. Then
\[
E_{z,p}(f) = \|f_{|X_z} - h_z\|^2_{2,m}. \tag{2}
\]
So minimization of the empirical error $E_{z,p}$ is a search
for a function whose restriction to $X_z$ has the
smallest $\ell^2_p$-distance from the function $h_z$ defined by
the sample $z$.
3 INVERSE PROBLEMS IN
LEARNING
The representations (1) and (2) allow us to use in
learning theory methods and tools from the theory of inverse problems. For a linear operator $A : \mathcal{X} \to \mathcal{Y}$
between two Hilbert spaces $(\mathcal{X}, \|\cdot\|_{\mathcal{X}})$, $(\mathcal{Y}, \|\cdot\|_{\mathcal{Y}})$ (in the
finite-dimensional case, a matrix $A$), an inverse problem (see, e.g., (Bertero, 1989)) determined by $A$ is to
find for $g \in \mathcal{Y}$ (called data) some $f \in \mathcal{X}$ (called solution) such that
\[
A(f) = g.
\]
If for every $g \in \mathcal{Y}$ there exists a unique solution
$f \in \mathcal{X}$ which depends continuously on the data, then the
inverse problem is called well-posed. So for a well-posed inverse problem, there exists a unique inverse
operator $A^{-1} : \mathcal{Y} \to \mathcal{X}$. When $A$ is continuous, then
by the Banach open mapping theorem (Friedman, 1982,
p. 141) $A^{-1}$ is continuous, too. However, a continuous dependence of solutions on data may not always
guarantee robustness against noise. Stability has
been measured by the behavior of the eigenvalues of $A$ and
by the condition number, defined for a well-posed problem given by an operator $A$ as $\mathrm{cond}(A) = \|A\|\,\|A^{-1}\|$.
Problems with large condition numbers are called ill-conditioned.
When for some $g \in \mathcal{Y}$ no solution exists, at least
one can search for a pseudosolution $f^o$, for which
$A(f^o)$ is a best approximation to $g$ among elements
of the range of $A$, i.e.,
\[
\|A(f^o) - g\|_{\mathcal{Y}} = \min_{f \in \mathcal{X}} \|A(f) - g\|_{\mathcal{Y}}.
\]
The theory of inverse problems overcomes ill-posedness
by using so-called normal pseudosolutions instead
of solutions, and in addition it also overcomes ill-conditioning by using various regularized solutions.
Tikhonov's regularization (Tikhonov and Arsenin,
1977) replaces the problem of minimization of the
functional $\|A(\cdot) - g\|^2_{\mathcal{Y}}$ with minimization of
\[
\|A(\cdot) - g\|^2_{\mathcal{Y}} + \gamma \Psi,
\]
where $\Psi$ is a functional called a stabilizer and the regularization parameter $\gamma$ plays the role of a trade-off
between an emphasis on proximity to data and a penalization of undesired solutions expressed by $\Psi$. A
typical choice of stabilizer is the square of the norm
on $\mathcal{X}$, for which Tikhonov regularization minimizes
the functional
\[
\|A(\cdot) - g\|^2_{\mathcal{Y}} + \gamma \|\cdot\|^2_{\mathcal{X}}. \tag{3}
\]
Let $(\mathcal{H}, \|\cdot\|_{\mathcal{H}})$ be a Hilbert space which is a linear
subspace of $L^2_{\rho_X}(X)$, with a possibly different norm
than the one obtained by restriction of $\|\cdot\|_{L^2_{\rho_X}}$ to $\mathcal{H}$,
and let $J : (\mathcal{H}, \|\cdot\|_{\mathcal{H}}) \to (L^2_{\rho_X}(X), \|\cdot\|_{L^2_{\rho_X}})$ denote the
inclusion operator. By the representation (1), we have
\[
E_\rho(f) = \|f - f_\rho\|^2_{L^2_{\rho_X}} + \sigma^2_\rho = \|J(f) - f_\rho\|^2_{L^2_{\rho_X}} + \sigma^2_\rho. \tag{4}
\]
So the problem of minimization of $E_\rho$ over $\mathcal{H}$ is
equivalent to the inverse problem defined by the inclusion operator $J$ for the data $f_\rho$.
To reformulate minimization of $E_{z,p}$ as an inverse
problem, define for the input sample $x = (x_1, \dots, x_m)$
an evaluation operator $L_x : (\mathcal{H}, \|\cdot\|_{\mathcal{H}}) \to (\mathbb{R}^m, \|\cdot\|_{2,m})$
as
\[
L_x(f) = (f(x_1), \dots, f(x_m)).
\]
It is easy to check that for every $f : X \to \mathbb{R}$,
\[
E_{z,p}(f) = \sum_{i=1}^m p(i)(f(x_i) - y_i)^2 = \|L_x(f) - y\|^2_{2,m}. \tag{5}
\]
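As a numerical sanity check of the identity (5), the following sketch evaluates a candidate function on a small input sample and compares the empirical error with the squared weighted norm of $L_x(f) - y$. The sample, the probability vector $p$, and the candidate $f = \sin$ are illustrative placeholders.

```python
import numpy as np

# Sketch of the evaluation operator L_x and of identity (5).
# The sample (x, y), the measure p, and the candidate f are illustrative only.
x = np.array([0.0, 0.5, 1.0, 1.5])            # input sample x_1, ..., x_m
y = np.array([0.1, 0.9, 1.1, 0.2])            # output sample y_1, ..., y_m
p = np.full(len(x), 1.0 / len(x))             # discrete probability measure p

def L_x(f):
    """Evaluation operator: f -> (f(x_1), ..., f(x_m))."""
    return np.array([f(xi) for xi in x])

def sq_norm_2m(u):
    """Squared weighted l2-norm: ||u||^2_{2,m} = sum_i p(i) u_i^2."""
    return float(np.sum(p * u ** 2))

f = np.sin                                    # any candidate input-output function
emp_err = sum(p[i] * (f(x[i]) - y[i]) ** 2 for i in range(len(x)))
print(np.isclose(emp_err, sq_norm_2m(L_x(f) - y)))   # True, as stated in (5)
```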
4 PSEUDOSOLUTIONS AND
REGULARIZED SOLUTIONS
Originally, properties of pseudoinverse and regularized inverse operators were described for operators
between finite-dimensional spaces, where such operators can be represented by matrices (Moore, 1920;
Penrose, 1955). In the 1970s, the theory of pseudoinversion was extended to the infinite-dimensional case;
it was shown that properties similar to those of
Moore-Penrose pseudoinverses of matrices also hold
for pseudoinverses of continuous linear operators between Hilbert spaces (Groetsch, 1977). The reason
is that continuous operators have adjoint operators
$A^* : \mathcal{Y} \to \mathcal{X}$ satisfying the equation $\langle A(f), g\rangle_{\mathcal{Y}} = \langle f, A^*(g)\rangle_{\mathcal{X}}$. These adjoints play an important role in
the characterization of pseudosolutions and regularized
solutions. In the next section, we will see that for
proper function spaces, the adjoints of the evaluation and inclusion operators used in the representations (4) and (5)
can be easily described.
First we recall some basic results from the theory of
inverse problems from (Bertero, 1989, pp. 68-70) and
(Groetsch, 1977, pp. 74-76). For every continuous linear operator $A : (\mathcal{X}, \|\cdot\|_{\mathcal{X}}) \to (\mathcal{Y}, \|\cdot\|_{\mathcal{Y}})$ between two
Hilbert spaces there exists a unique continuous linear pseudoinverse operator $A^+ : \mathcal{Y} \to \mathcal{X}$ (when the
range $R(A)$ is closed; otherwise $A^+$ is defined only for
those $g \in \mathcal{Y}$ for which $\pi_{\mathrm{cl}\, R(A)}(g) \in R(A)$). The pseudoinverse $A^+$ satisfies, for every $g \in \mathcal{Y}$, $\|A^+(g)\|_{\mathcal{X}} = \min_{f^o \in S(g)} \|f^o\|_{\mathcal{X}}$, where $S(g) = \mathrm{argmin}(\mathcal{X}, \|A(\cdot) - g\|_{\mathcal{Y}})$, for every $g \in \mathcal{Y}$, $AA^+(g) = \pi_{\mathrm{cl}\, R(A)}(g)$, and
\[
A^+ = (A^*A)^+ A^* = A^*(AA^*)^+. \tag{6}
\]
Moreover, for every $\gamma > 0$, there exists a unique operator
\[
A^\gamma : \mathcal{Y} \to \mathcal{X}
\]
such that for every $g \in \mathcal{Y}$, $\{A^\gamma(g)\} = \mathrm{argmin}(\mathcal{X}, \|A(\cdot) - g\|^2_{\mathcal{Y}} + \gamma \|\cdot\|^2_{\mathcal{X}})$ and
\[
A^\gamma = (A^*A + \gamma I_{\mathcal{X}})^{-1} A^* = A^*(AA^* + \gamma I_{\mathcal{Y}})^{-1}, \tag{7}
\]
where $I_{\mathcal{X}}, I_{\mathcal{Y}}$ denote the identity operators. For every
$g \in \mathcal{Y}$ for which $A^+(g)$ exists, $\lim_{\gamma \to 0} A^\gamma(g) = A^+(g)$.
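In the finite-dimensional case, the operators in (6) and (7) are matrices, so both identities and the limit $\lim_{\gamma \to 0} A^\gamma(g) = A^+(g)$ can be checked numerically. The following sketch uses a randomly generated matrix $A$ and random data $g$ purely for illustration.

```python
import numpy as np

# Finite-dimensional sketch of (6) and (7): A is a matrix, A^+ its Moore-Penrose
# pseudoinverse and A^gamma the Tikhonov-regularized inverse (random A, toy data).
rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))
g = rng.standard_normal(7)

A_pinv = np.linalg.pinv(A)
# (6): A^+ = (A* A)^+ A* = A* (A A*)^+
assert np.allclose(A_pinv, np.linalg.pinv(A.T @ A) @ A.T)
assert np.allclose(A_pinv, A.T @ np.linalg.pinv(A @ A.T))

def A_gamma(gamma):
    # (7): A^gamma = (A* A + gamma I_X)^(-1) A*
    return np.linalg.solve(A.T @ A + gamma * np.eye(A.shape[1]), A.T)

# Second form of (7): A^gamma = A* (A A* + gamma I_Y)^(-1)
assert np.allclose(A_gamma(0.1), A.T @ np.linalg.inv(A @ A.T + 0.1 * np.eye(7)))

# Regularized solutions converge to the pseudosolution as gamma -> 0.
for gamma in (1.0, 1e-2, 1e-4, 1e-6):
    print(gamma, np.linalg.norm(A_gamma(gamma) @ g - A_pinv @ g))
```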
Formulas (6) and (7) combined with the representations (4) and (5) provide useful tools for the description
of optimal input-output functions in learning from
data. However, there is a restriction on computational models requiring that input-output functions
belong to suitable function spaces with an inner product where evaluation functionals are continuous. The
space $L^2_{\rho_X}(X)$ is a Hilbert space (it has an inner product), but it is easy to see that evaluation functionals on this space are not continuous (it contains sequences of functions with the same norm with diverging evaluations at zero). This space is too large,
but it contains suitable subspaces formed by functions with limited oscillations. These spaces contain
input-output functions computable by networks with
kernel and radial units, in particular Gaussian radial-basis networks with any fixed width. They have become popular due to the use of kernels in support
vector machines, but they have been studied in mathematics since 1950 (Aronszajn, 1950). Since the 1990s, they
have been considered as useful ambient function spaces in
data analysis (Wahba, 1990). These spaces are called
Reproducing Kernel Hilbert Spaces (RKHS) because
each such space is uniquely determined by a symmetric positive semidefinite kernel $K : X \times X \to \mathbb{R}$. For
their theory and applications see, e.g., (Schölkopf and
Smola, 2002). Here we just recall that the RKHS determined by $K$, denoted $\mathcal{H}_K(X)$, contains all linear combinations of functions of the form $K(\cdot, v) : X \to \mathbb{R}$,
$v \in X$, defined as $K(\cdot,v)(u) = K(u, v)$ (these functions are called representers and they are generators
of the linear space $\mathcal{H}_K(X)$, see, e.g., (Wahba, 1990;
Cucker and Smale, 2002)). The RKHS $\mathcal{H}_K(X)$ is
endowed with an inner product defined on generators as $\langle K_u, K_v \rangle_K = K(u,v)$, which induces the norm
$\|K_u\|^2_K = K(u,u)$.
A paradigmatic example of a kernel is the Gaussian kernel $K(u,v) = e^{-\|u - v\|^2}$. The RKHS defined
by the Gaussian kernel contains all linear combinations of translations of Gaussians, so it contains input-output functions of radial-basis networks with Gaussian radial functions with a fixed width.
The role of kernel norms as stabilizers in
Tikhonov's regularization (3) can be intuitively well
understood in the case of convolution kernels, i.e.,
kernels $K(x,y) = k(x - y)$ defined as translations of
a function $k : \mathbb{R}^d \to \mathbb{R}$ for which the Fourier transform $\tilde{k}$ is positive. For such kernels, the value of the
stabilizer $\|\cdot\|^2_K$ at any $f \in \mathcal{H}_K(\mathbb{R}^d)$ can be expressed
as
\[
\|f\|^2_K = \frac{1}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \frac{|\tilde{f}(\omega)|^2}{\tilde{k}(\omega)} \, d\omega.
\]
So when $\lim_{\|\omega\| \to \infty} 1/\tilde{k}(\omega) = \infty$, the stabilizer $\|\cdot\|^2_K$
plays the role of a high-frequency filter. Examples of
convolution kernels with positive Fourier transforms
are the Gaussian and the Bessel kernel. Note that for
any convolution kernel $K$ with $k(0) = 1$, all functions
$f = \sum_{i=1}^m w_i K_{u_i}$, which are computable by one-hidden-layer networks with units computing translations of
$k$, satisfy $\|f\|_K \le \sum_{i=1}^m |w_i| \|K_{u_i}\|_K = \sum_{i=1}^m |w_i|\, k(0) = \sum_{i=1}^m |w_i|$. So the output-weight regularization widely
used for its simplicity in practical applications guarantees a decrease of the stabilizer $\|\cdot\|^2_K$ and thus penalizes solutions with high-frequency oscillations.
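Since $\langle K_u, K_v\rangle_K = K(u,v)$, the stabilizer of a kernel network $f = \sum_i w_i K_{u_i}$ can be computed from the Gram matrix of its centers as $\|f\|^2_K = w^{\top} K[u]\, w$, and for the Gaussian kernel (where $K(u,u) = 1$) the bound $\|f\|_K \le \sum_i |w_i|$ above can be checked directly. The following sketch uses arbitrary illustrative centers and weights.

```python
import numpy as np

# Sketch of the output-weight bound for the Gaussian kernel: for a network
# f = sum_i w_i K_{u_i}, one has ||f||_K^2 = w^T K[u] w and, since K(u,u) = 1,
# ||f||_K <= sum_i |w_i|.  Centers and weights below are arbitrary toy values.
def gaussian_gram(u):
    d2 = np.sum((u[:, None, :] - u[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2)                        # K(u_i, u_j) = exp(-||u_i - u_j||^2)

rng = np.random.default_rng(1)
u = rng.standard_normal((10, 2))              # centers u_1, ..., u_m
w = rng.standard_normal(10)                   # output weights w_1, ..., w_m

K_u = gaussian_gram(u)
rkhs_norm = np.sqrt(w @ K_u @ w)              # ||f||_K
l1_weights = np.sum(np.abs(w))                # the simpler l1 stabilizer
print(rkhs_norm, l1_weights, rkhs_norm <= l1_weights)
```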
5 MINIMA OF ERROR
FUNCTIONALS
The reformulations (4) and (5) of the minimization of error
functionals as inverse problems, together with equations (6) and (7), allow us to describe properties of
theoretically optimal solutions of the minimization of error functionals with the quadratic loss.
For a kernel $K$, we denote by $L_K : L^2_{\rho_X}(X) \to L^2_{\rho_X}(X)$ the integral operator defined as
\[
L_K(f)(y) = \int_X f(x) K(y,x) \, d\rho_X(x).
\]
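When one can sample from the marginal measure $\rho_X$, the action of $L_K$ can be approximated by a Monte Carlo average. The following sketch is only an illustration under that assumption; the Gaussian kernel, the standard normal marginal measure, and the candidate function are placeholders.

```python
import numpy as np

# Monte Carlo sketch of the integral operator L_K: the integral over rho_X is
# approximated by an average over a sample drawn from rho_X.  The kernel, the
# marginal measure (standard normal) and the function f are illustrative only.
rng = np.random.default_rng(2)

def K(y, x):
    return np.exp(-np.sum((y - x) ** 2))      # Gaussian kernel as an example

def L_K(f, y, sample):
    """Approximates L_K(f)(y) = int_X f(x) K(y, x) d rho_X(x)."""
    return np.mean([f(x) * K(y, x) for x in sample])

sample = rng.normal(size=(5000, 1))           # draws from rho_X
f = lambda x: float(np.sin(x[0]))             # a candidate function on X
print(L_K(f, np.array([0.3]), sample))        # approximate value of L_K(f)(0.3)
```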
The next theorem describes minima of the expected
error $E_\rho$ and of its Tikhonov regularization $E_{\rho,\gamma,K} = E_\rho + \gamma \|\cdot\|^2_K$. It shows that for every $\gamma > 0$ there exists a unique function $f^\gamma$ minimizing $E_{\rho,\gamma,K}$ and that
this function is the image of the regression function $f_\rho$
under an integral operator $L_{K^\gamma}$ with a modified kernel
$K^\gamma$, defined as $K^\gamma(x,y) = \sum_{i=1}^\infty \frac{\lambda_i}{\lambda_i + \gamma}\, \phi_i(x) \phi_i(y)$, where
$\lambda_i$ and $\phi_i$ are the eigenvalues and eigenfunctions of the operator $L_K$.
Theorem 1. Let $X \subset \mathbb{R}^d$ be compact, $Y \subset \mathbb{R}$ be
bounded, $K : X \times X \to \mathbb{R}$ be a continuous symmetric
positive definite kernel, and $\rho$ be a non degenerate probability measure on $X \times Y$. Then
(i) if $K$ is degenerate, then the inclusion operator $J_K : (\mathcal{H}_K(X), \|\cdot\|_K) \to (L^2_{\rho_X}(X), \|\cdot\|_{L^2})$ has a pseudoinverse operator $J_K^+ : (L^2_{\rho_X}(X), \|\cdot\|_{L^2}) \to (\mathcal{H}_K(X), \|\cdot\|_K)$
such that for every $g \in L^2_{\rho_X}(X)$, $J_K^+(g) = \pi_K(g)$, where
$\pi_K$ is the projection on $\mathcal{H}_K(X)$,
while if $K$ is non degenerate, then the pseudoinverse
operator $J_K^+$ is only defined on $R(J_K) = \mathcal{H}_K(X)$ and
for all $g \in \mathcal{H}_K(X)$, $J_K^+(g) = g$;
(ii) for every $\gamma > 0$, there exists a unique function $f^\gamma \in \mathcal{H}_K(X)$ minimizing $E_{\rho,\gamma,K}$, $f^\gamma = L_{K^\gamma}(f_\rho)$,
$\lim_{\gamma \to 0} \|f^\gamma - f_\rho\|_{L^2} = 0$, and $\lim_{\gamma \to 0} E_\rho(f^\gamma) = E_\rho(f_\rho)$.
The essential part of the proof of this theorem is
showing that the adjoint $J_K^*$ of the inclusion
\[
J_K : (\mathcal{H}_K(X), \|\cdot\|_K) \to (L^2_{\rho_X}(X), \|\cdot\|_{L^2_{\rho_X}})
\]
is the integral operator $L_K$. This allows application of
the equations (6) and (7). The formula
\[
f^\gamma = L_{K^\gamma}(f_\rho) \tag{8}
\]
describing the regularized solution can be called the
Representer Theorem, in analogy to a similar result
describing the minimum point of the regularized empirical error functional. This theorem was derived in
(Cucker and Smale, 2002, p. 42), where it was formulated as
\[
f^\gamma = (I + \gamma L_K^{-1})^{-1} f_\rho. \tag{9}
\]
However, the formulation (9) might be misleading, as
the inverse $L_K^{-1}$ of $L_K$ is defined only on a subspace
of $L^2_{\rho_X}(X)$. Note that it cannot be defined on any
complete subspace of $L^2_{\rho_X}(X)$, because in such a case it would have to be bounded by the Banach open mapping theorem,
but for a non degenerate kernel $K$ the eigenvalues $\frac{1}{\lambda_i}$
of the inverse $L_K^{-1}$ diverge. Here we derived the
description of the regularized solution (8) easily as
a straightforward consequence of well-known properties of Tikhonov's regularization, while in (Cucker
and Smale, 2002, pp. 27-28) the formula (9) was derived using results on operators with fractional powers.
By Theorem 1, when the regression function $f_\rho$
is not in the function space $\mathcal{H}_K(X)$ defined by the
kernel $K$, then $E_\rho$ does not achieve a minimum on
$\mathcal{H}_K(X)$. However, for every regularization parameter $\gamma > 0$, the regularized functional $E_{\rho,\gamma,K}$ achieves a
unique minimum equal to $L_{K^\gamma}(f_\rho)$. Theorem 1 also
shows how regularization modifies the coefficients $\{w_i\}$
in the representation of the regression function $f_\rho = \sum_{i=1}^\infty w_i \phi_i$
in terms of the basis $\{\phi_i\}$ formed by the eigenfunctions of
the operator $L_K$. Regularization replaces these coefficients with the coefficients $\{\frac{w_i \lambda_i}{\lambda_i + \gamma}\}$. The filter factor $\alpha(i) = \frac{\lambda_i}{\lambda_i + \gamma}$ decreases to 0, so the higher-frequency coefficients are reduced more. With the regularization
parameter $\gamma$ decreasing to zero, the coefficients $\frac{w_i \lambda_i}{\lambda_i + \gamma}$
converge to $w_i$, and so the regularized solutions $f^\gamma$ converge in the $L^2_{\rho_X}$-norm to the regression function $f_\rho$.
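The effect of these filter factors is easy to inspect numerically. In the following sketch, the decreasing eigenvalue sequence is an arbitrary illustration, not the spectrum of any particular operator $L_K$.

```python
import numpy as np

# The Tikhonov filter factors lambda_i / (lambda_i + gamma) from Theorem 1,
# evaluated on an illustrative decreasing sequence of eigenvalues of L_K.
lambdas = np.array([1.0, 0.3, 0.1, 0.03, 0.01, 0.001])
for gamma in (0.1, 0.01, 0.001):
    print(gamma, np.round(lambdas / (lambdas + gamma), 3))
# Small eigenvalues (high-frequency eigenfunctions) are damped the most, and
# every factor tends to 1 as gamma -> 0, so the coefficients w_i are recovered.
```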
The next theorem describes minima of the empirical
error functional and of its regularized modification. We
use the notation
\[
E_{z,p,\gamma,K} = E_{z,p} + \gamma \|\cdot\|^2_K.
\]
By $\mathcal{K}[x]$ is denoted the Gram matrix of the kernel $K$
with respect to the vector $x$, i.e., the matrix
\[
\mathcal{K}[x]_{i,j} = K(x_i, x_j),
\]
by $\mathcal{K}_m[x]$ the matrix $\frac{1}{m}\mathcal{K}[x]$, and by $I$ the $m \times m$ identity matrix.
Theorem 2. Let $K : X \times X \to \mathbb{R}$ be a symmetric positive semidefinite kernel, $m$ be a positive
integer, $z = (x, y)$ with $x = (x_1, \dots, x_m) \in X^m$,
$y = (y_1, \dots, y_m) \in \mathbb{R}^m$, with $x_1, \dots, x_m$ distinct, and $p$
be a discrete probability measure on $\{1, \dots, m\}$. Then
(i) $f^+ = L_x^+(y)$ is a minimum point of $E_{z,p}$ on $\mathcal{H}_K(X)$,
$\|f^+\|_K \le \|f^o\|_K$ for all $f^o \in \mathrm{argmin}(\mathcal{H}_K(X), E_{z,p})$,
and $f^+ = L_x^+(y) = \sum_{i=1}^m c_i K_{x_i}$, where $c = (c_1, \dots, c_m) = \mathcal{K}[x]^+ y$;
(ii) for all $\gamma > 0$, there exists a unique $f^\gamma$ minimizing $E_{z,p,\gamma,K}$ over $\mathcal{H}_K(X)$, $f^\gamma = \sum_{i=1}^m c^\gamma_i K_{x_i}$, where
$c^\gamma = (\mathcal{K}_m[x] + \gamma I)^{-1} y$.
The essential part of the proof of this theorem
is the description of the adjoint $L_x^* : (\mathbb{R}^m, \|\cdot\|_{2,m}) \to (\mathcal{H}_K(X), \|\cdot\|_K)$ as $L_x^*(u) = \frac{1}{m} \sum_{i=1}^m u_i K_{x_i}$. This leads
to the representation of $L_x L_x^* : \mathbb{R}^m \to \mathbb{R}^m$ by the matrix
$\mathcal{K}_m[x]$, which together with equations (6) and (7) gives
the description of the functions minimizing the empirical
error and its regularization.
Theorem 2 shows that for every kernel $K$, every
sample of empirical data $z$ and every discrete probability
measure $p$, there exists a function $f^+$ minimizing the
empirical error functional $E_{z,p}$ over the whole RKHS
defined by $K$. This function is formed by a linear
combination of the representers $K_{x_1}, \dots, K_{x_m}$ of the input
data $x_i$, i.e., it has the form
\[
f^+ = \sum_{i=1}^m c_i K_{x_i}. \tag{10}
\]
Thus $f^+$ can be interpreted as an input-output function of a neural network with one hidden layer of
kernel units and a single linear output unit. The coefficients $c = (c_1, \dots, c_m)$ of the linear combination
(corresponding to network output weights) satisfy $c = \mathcal{K}[x]^+ y$, so the output weights can be obtained by
solving a system of linear equations. However, although
the operator $L_x$ has finite-dimensional range and is thus compact, its pseudoinverse can be severely ill-conditioned. So the
optimal solution of the minimization of the empirical error can be
unstable. The regularized solution
\[
f^\gamma = \sum_{i=1}^m c^\gamma_i K_{x_i}
\]
is also a linear combination of the functions $K_{x_1}, \dots, K_{x_m}$.
But the coefficients of these two linear combinations
are different: in the regularized case $c^\gamma = (\mathcal{K}_m[x] + \gamma I)^{-1} y$, while in the non-regularized one $c = \mathcal{K}[x]^+ y$.
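The difference between the two coefficient vectors can be illustrated numerically with the formulas of Theorem 2. In the following sketch, the Gaussian kernel, the toy training sample, and the perturbation level are illustrative choices; the point is only that the regularized coefficients change far less than the pseudoinverse-based ones when the outputs are slightly perturbed.

```python
import numpy as np

# Sketch comparing the two output-weight vectors of Theorem 2 on toy data:
# non-regularized c = K[x]^+ y versus regularized c_gamma = (K_m[x] + gamma I)^(-1) y.
rng = np.random.default_rng(3)
m = 40
x = np.sort(rng.uniform(-2, 2, size=m))
y = np.sin(3 * x) + 0.1 * rng.standard_normal(m)        # noisy training outputs

K = np.exp(-(x[:, None] - x[None, :]) ** 2)             # Gaussian Gram matrix K[x]
K_m = K / m
gamma = 1e-3

c = np.linalg.pinv(K) @ y                                # pseudosolution weights
c_gamma = np.linalg.solve(K_m + gamma * np.eye(m), y)    # regularized weights

# Perturb the outputs slightly and compare how much each solution changes.
y_pert = y + 1e-3 * rng.standard_normal(m)
dc = np.linalg.pinv(K) @ y_pert - c
dc_gamma = np.linalg.solve(K_m + gamma * np.eye(m), y_pert) - c_gamma
print("non-regularized change:", np.linalg.norm(dc))
print("regularized change:    ", np.linalg.norm(dc_gamma))
```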
The characterization of the regularized solution of the minimization of the empirical error in reproducing kernel
Hilbert spaces was derived in (Wahba, 1990) using Fréchet derivatives (see also (Cucker and Smale,
2002; Poggio and Smale, 2003)). Our proof, based
on the characterization of the adjoint $L_x^*$ of the evaluation
operator $L_x$, is much simpler, and it also includes the
non-regularized case and thus shows the effect of
regularization.
Theorem 2 shows that the increase of "smoothness"
of the regularized solution $f^\gamma$ is achieved by merely
changing the coefficients of the linear combination.
In the non-regularized case, the coefficients are obtained from the output data vector $y$ using the Moore-Penrose pseudoinverse of the Gram matrix $\mathcal{K}[x]$,
while in the regularized one, they are obtained using
the inverse of the modified matrix $\mathcal{K}_m[x] + \gamma I$. So the
regularization merely changes amplitudes, but it preserves the finite set of basis functions from which the
solution is composed.
In many practical applications, networks with a much smaller number $n$ of units than the
size $m$ of the training sample are used. However, characterization of the theoretically optimal solutions achievable over networks with large numbers of units (equal
to the size $m$ of the training data) can be useful in investigating how well such optimal solutions can be approximated by suboptimal ones obtainable
over smaller models (Vito et al., 2005; Kůrková and
Sanguineti, 2005a; Kůrková and Sanguineti, 2005b).
As mentioned above, for convolution kernels we
have $\|f^\gamma\|_K \le \sum_{i=1}^m |c^\gamma_i|$. Instead of calculating the $\|\cdot\|^2_K$
norm, it is easier to use as a stabilizer the $\ell^1$-norm of
the output weight vector. For linear combinations of
functions of the form $K_x$, this also leads to minimization of the $\|\cdot\|^2_K$ norm.
ACKNOWLEDGEMENTS
This work was partially supported by MŠMT grant
COST Intelli OC10047 and the Institutional Research
Plan AV0Z10300504.
REFERENCES
Aronszajn, N. (1950). Theory of reproducing kernels.
Transactions of AMS, 68:337–404.
Bertero, M. (1989). Linear inverse and ill-posed problems.
Advances in Electronics and Electron Physics, 75:1–
120.
Bishop, C. (1995). Training with noise is equivalent
to Tikhonov regularization. Neural Computation,
7(1):108–116.
Cucker, F. and Smale, S. (2002). On the mathematical foun-
dations of learning. Bulletin of AMS, 39:1–49.
Engl, E. W., Hanke, M., and Neubauer, A. (1999). Regular-
ization of Inverse Problems. Kluwer, Dordrecht.
Fine, T. L. (1999). Feedforward Neural Network Methodol-
ogy. Springer-Verlag, Berlin, Heidelberg.
Friedman, A. (1982). Modern Analysis. Dover, New York.
Girosi, F., Jones, M., and Poggio, T. (1995). Regulariza-
tion theory and neural networks architectures. Neural
Computation, 7:219–269.
Girosi, F. and Poggio, T. (1990). Regularization algorithms
for learning that are equivalent to multilayer networks.
Science, 247(4945):978–982.
Groetsch, C. W. (1977). Generalized Inverses of Linear Operators. Dekker, New York.
Hansen, P. C. (1998). Rank-Deficient and Discrete Ill-Posed
Problems. SIAM, Philadelphia.
Ito, Y. (1992). Finite mapping by neural networks and truth
functions. Mathematical Scientist, 17:69–77.
Kecman, V. (2001). Learning and Soft Computing. MIT
Press, Cambridge.
Kůrková, V. and Sanguineti, M. (2005a). Error estimates
for approximate optimization by the extended Ritz
method. SIAM Journal on Optimization, 15:461–487.
Kůrková, V. and Sanguineti, M. (2005b). Learning
with generalization capability by kernel methods with
bounded complexity. Journal of Complexity, 13:551–559.
Micchelli, C. A. (1986). Interpolation of scattered data:
Distance matrices and conditionally positive definite
functions. Constructive Approximation, 2:11–22.
Moore, E. H. (1920). Abstract. Bulletin of AMS, 26:394–
395.
Penrose, R. (1955). A generalized inverse for matri-
ces. Proceedings of Cambridge Philosophical Society,
51:406–413.
Poggio, T. and Smale, S. (2003). The mathematics of learn-
ing: dealing with data. Notices of AMS, 50:537–544.
Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge.
Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of
Ill-posed Problems. W.H. Winston, Washington, D.C.
Vito, E. D., Rosasco, L., Caponnetto, A., Giovannini, U. D.,
and Odone, F. (2005). Learning from examples as an
inverse problem. Journal of Machine Learning Re-
search, 6:883–904.
Wahba, G. (1990). Spline Models for Observational Data.
SIAM, Philadelphia.