is also a linear combination of the functions $K_{x_1}, \dots, K_{x_m}$. But the coefficients of these two linear combinations are different: in the regularized case $c^{\gamma} = (K[x] + \gamma I)^{-1} y$, while in the non-regularized one $c = K[x]^{+} y$.
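The difference between the two coefficient vectors can be illustrated numerically. The following sketch is only illustrative: it assumes a Gaussian (convolution) kernel, synthetic one-dimensional data and NumPy; the names gauss_kernel, c_gamma and c_plus are ours, not from the text.

import numpy as np

# Illustrative sketch (our notation): Gaussian convolution kernel and
# synthetic one-dimensional data; not taken from the paper.
def gauss_kernel(u, v, width=1.0):
    return np.exp(-((u - v) ** 2) / (2.0 * width ** 2))

rng = np.random.default_rng(0)
m = 20
x = rng.uniform(-1.0, 1.0, size=m)               # inputs x_1, ..., x_m
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=m)   # noisy outputs

K = gauss_kernel(x[:, None], x[None, :])         # Gram matrix K[x]
gamma = 1e-2

# Regularized coefficients: c_gamma = (K[x] + gamma I)^(-1) y
c_gamma = np.linalg.solve(K + gamma * np.eye(m), y)

# Non-regularized coefficients: c = K[x]^+ y (Moore-Penrose pseudoinverse)
c_plus = np.linalg.pinv(K) @ y

# Both solutions are linear combinations of the same functions K_{x_1}, ..., K_{x_m}:
# f(t) = sum_i c_i K(t, x_i); only the coefficient vector differs.
def f(t, c):
    return gauss_kernel(np.asarray(t)[:, None], x[None, :]) @ c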
The characterization of the regularized solution of the problem of minimization of the empirical error in reproducing kernel Hilbert spaces was derived in (Wahba, 1990) using Fréchet derivatives (see also (Cucker and Smale, 2002), (Poggio and Smale, 2003)). Our proof, based on the characterization of the adjoint $L_x^{*}$ of the evaluation operator $L_x$, is much simpler; it also includes the non-regularized case and thus shows the effect of regularization.
Theorem 2 shows that the increased “smoothness” of the regularized solution $f^{\gamma}$ is achieved merely by changing the coefficients of the linear combination. In the non-regularized case, the coefficients are obtained from the output data vector $y$ using the Moore-Penrose pseudoinverse of the Gram matrix $K[x]$, while in the regularized one they are obtained using the inverse of the modified matrix $K[x] + \gamma I$. So regularization merely changes the amplitudes, but it preserves the finite set of basis functions from which the solution is composed.
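To make the effect of regularization concrete, a small sweep over $\gamma$ can be run on the same kind of synthetic data. This is a hedged illustration (Gaussian kernel, NumPy, our own variable names), not part of the original analysis: it merely displays that the basis $\{K_{x_1}, \dots, K_{x_m}\}$ stays fixed while the coefficient amplitudes shrink as $\gamma$ grows.

import numpy as np

# Illustrative gamma sweep (our setup: Gaussian kernel, synthetic data).
rng = np.random.default_rng(0)
m = 20
x = rng.uniform(-1.0, 1.0, size=m)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=m)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)   # Gram matrix K[x]

for gamma in (0.0, 1e-3, 1e-1, 1.0):
    if gamma == 0.0:
        c = np.linalg.pinv(K) @ y                    # Moore-Penrose pseudoinverse
    else:
        c = np.linalg.solve(K + gamma * np.eye(m), y)
    # The basis {K_{x_1}, ..., K_{x_m}} never changes; only the amplitudes do,
    # and they shrink as gamma grows.
    print(f"gamma={gamma:g}  max|c_i|={np.max(np.abs(c)):.3e}  sum|c_i|={np.sum(np.abs(c)):.3e}")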
In many practical applications, networks with a much smaller number $n$ of units than the size $m$ of the training sample are used. However, the characterization of theoretically optimal solutions achievable over networks with large numbers of units (equal to the size $m$ of the training data) can be useful for investigating how well such optimal solutions can be approximated by suboptimal ones obtainable over smaller models (Vito et al., 2005; Kůrková and Sanguineti, 2005a; Kůrková and Sanguineti, 2005b).
As mentioned above, for convolution kernels we have $\|f^{\gamma}\|_K \le \sum_{i=1}^{m} |c_i^{\gamma}|$. Instead of calculating the $\|\cdot\|_K^2$ norm, it is easier to use as a stabilizer the $\ell_1$-norm of the output weight vector. For linear combinations of functions of the form $K_x$, this also leads to minimization of the $\|\cdot\|_K^2$ norm.
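A quick numerical sanity check of the bound above can be sketched, again under illustrative assumptions (Gaussian kernel with $K(x, x) = 1$, synthetic data, NumPy); it uses the identity $\|f^{\gamma}\|_K^2 = (c^{\gamma})^{T} K[x]\, c^{\gamma}$ for $f^{\gamma} = \sum_i c_i^{\gamma} K_{x_i}$.

import numpy as np

# Rough check of ||f_gamma||_K <= sum_i |c_i^gamma| for a convolution kernel
# with K(x, x) = 1 (Gaussian); purely illustrative data and names.
rng = np.random.default_rng(1)
m = 30
x = rng.uniform(-1.0, 1.0, size=m)
y = np.cos(4.0 * x) + 0.1 * rng.normal(size=m)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 2.0)

gamma = 1e-2
c = np.linalg.solve(K + gamma * np.eye(m), y)

rkhs_norm = np.sqrt(c @ K @ c)     # ||f_gamma||_K^2 = c^T K[x] c
ell1_norm = np.sum(np.abs(c))      # l1-norm of the output weight vector
print(rkhs_norm, ell1_norm)        # the first never exceeds the second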
ACKNOWLEDGEMENTS
This work was partially supported by MŠMT grant COST Intelli OC10047 and the Institutional Research Plan AV0Z10300504.
REFERENCES
Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of AMS, 68:337–404.

Bertero, M. (1989). Linear inverse and ill-posed problems. Advances in Electronics and Electron Physics, 75:1–120.

Bishop, C. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116.

Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of AMS, 39:1–49.

Engl, H. W., Hanke, M., and Neubauer, A. (1999). Regularization of Inverse Problems. Kluwer, Dordrecht.

Fine, T. L. (1999). Feedforward Neural Network Methodology. Springer-Verlag, Berlin, Heidelberg.

Friedman, A. (1982). Modern Analysis. Dover, New York.

Girosi, F., Jones, M., and Poggio, T. (1995). Regularization theory and neural networks architectures. Neural Computation, 7:219–269.

Girosi, F. and Poggio, T. (1990). Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247(4945):978–982.

Groetsch, C. W. (1977). Generalized Inverses of Linear Operators. Dekker, New York.

Hansen, P. C. (1998). Rank-Deficient and Discrete Ill-Posed Problems. SIAM, Philadelphia.

Ito, Y. (1992). Finite mapping by neural networks and truth functions. Mathematical Scientist, 17:69–77.

Kecman, V. (2001). Learning and Soft Computing. MIT Press, Cambridge.

Kůrková, V. and Sanguineti, M. (2005a). Error estimates for approximate optimization by the extended Ritz method. SIAM Journal on Optimization, 15:461–487.

Kůrková, V. and Sanguineti, M. (2005b). Learning with generalization capability by kernel methods with bounded complexity. Journal of Complexity, 13:551–559.

Micchelli, C. A. (1986). Interpolation of scattered data: Distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22.

Moore, E. H. (1920). Abstract. Bulletin of AMS, 26:394–395.

Penrose, R. (1955). A generalized inverse for matrices. Proceedings of Cambridge Philosophical Society, 51:406–413.

Poggio, T. and Smale, S. (2003). The mathematics of learning: dealing with data. Notices of AMS, 50:537–544.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels – Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, Cambridge.

Tikhonov, A. N. and Arsenin, V. Y. (1977). Solutions of Ill-posed Problems. W.H. Winston, Washington, D.C.

Vito, E. D., Rosasco, L., Caponnetto, A., Giovannini, U. D., and Odone, F. (2005). Learning from examples as an inverse problem. Journal of Machine Learning Research, 6:883–904.

Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.