when hitting boundaries). A main advantage of EAS
is that it works in high dimensionalities, for which
very few published papers report good active-learning results.
3. In some cases, GLD is far better than GLDfff,
but in one case GLD is the worst algorithm and
GLDfff is (very significantly) the best one. As the
main difference between these two algorithms is that
GLD samples the frontier more strongly, this points
out the simple fact that the frontier of the state
space can be very important: sometimes it is very
relevant (e.g. stock management, in which marginal
costs can be approximated, to first order, from the
cost-to-go at the corners) and sometimes it is pointless
and expensive (GLD puts 2^d points on the corners
among the 2^d + 1 first points!); see the illustrative
sketch below.
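The corner-sampling behaviour mentioned in point 3 can be reproduced in a few lines. The sketch below is an illustrative approximation only (a greedy farthest-point rule on [0,1]^d started from the centre), not the GLD implementation used in the experiments; the name greedy_low_dispersion and all parameters are ours. For d = 3 it selects all 2^d corners among the first 2^d + 1 points.

import itertools
import numpy as np

def greedy_low_dispersion(candidates, n_points, d):
    """Greedy farthest-point selection: start from the centre of [0,1]^d and
    repeatedly add the candidate farthest from the points already chosen."""
    chosen = [np.full(d, 0.5)]  # centre of the hypercube
    for _ in range(n_points - 1):
        # distance of every candidate to every already-chosen point
        dists = np.linalg.norm(
            candidates[:, None, :] - np.array(chosen)[None, :, :], axis=2)
        # add the candidate whose nearest chosen point is farthest away
        chosen.append(candidates[np.argmax(dists.min(axis=1))])
    return np.array(chosen)

d = 3  # the claim is checked here for d = 3
corners = np.array(list(itertools.product([0.0, 1.0], repeat=d)))
rng = np.random.default_rng(0)
candidates = np.vstack([corners, rng.random((200, d))])  # corners + interior points

pts = greedy_low_dispersion(candidates, 2 ** d + 1, d)
corner_set = set(map(tuple, corners))
n_corner = sum(tuple(p) in corner_set for p in pts)
print(f"{n_corner} of the first {len(pts)} points are corners of [0,1]^{d}")
# prints: 8 of the first 9 points are corners of [0,1]^3

This illustrates why such a dispersion-driven sampler spends an exponential number of evaluations on the corners before exploring the interior of the domain.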