A MINIMUM RELATIVE ENTROPY PRINCIPLE FOR ADAPTIVE
CONTROL IN LINEAR QUADRATIC REGULATORS
Daniel A. Braun and Pedro A. Ortega
University of Cambridge, Dept. of Engineering, CB2 1PZ Cambridge, U.K.
Keywords:
Minimum relative entropy principle, Adaptive control, Bayesian control rule, Linear quadratic regulator.
Abstract:
The design of optimal adaptive controllers is usually based on heuristics, because solving Bellman's equations
over information states is notoriously intractable. Approximate adaptive controllers often rely on the principle
of certainty-equivalence where the control process deals with parameter point estimates as if they represented
“true” parameter values. Here we present a stochastic control rule instead where controls are sampled from
a posterior distribution over a set of probabilistic input-output models and the true model is identified by
Bayesian inference. This allows reformulating the adaptive control problem as an inference and sampling
problem derived from a minimum relative entropy principle. Importantly, inference and action sampling both
work forward in time and hence such a Bayesian adaptive controller is applicable on-line. We demonstrate the
improved performance that can be achieved by such an approach for linear quadratic regulator examples.
1 INTRODUCTION
Learning how to act in an unknown environment poses the problem of adaptive control (Åström and Wittenmark, 1995). Solving adaptive control problems optimally is notoriously hard because it requires the solution of Bellman's optimality equations over large trees of information states, which quickly becomes intractable. Therefore, a number of approximate adaptive control methods have been devised in the literature (Åström and Wittenmark, 1995). Most heuristics for adaptive control are based on the certainty-equivalence principle, i.e. when the unknown plant parameters are estimated, the uncertainty of these estimates has no impact on the resulting control strategy. Instead, a point estimate of the system parameters is treated as if it represented the "true" system parameters.
It is well known in optimal control theory that the certainty-equivalence principle holds exactly for linear quadratic systems with known dynamics (Åström and Wittenmark, 1995). In the case of adaptive control, however, the certainty-equivalence principle breaks down in general and is only used as a heuristic. In
fact, previous studies have shown that even for the
linear quadratic controller correct closed-loop system
identification cannot be guaranteed under certainty-
equivalence, which has led to the proposal of cost-
biased estimators (Campi and Kumar, 1996). Non-
certainty-equivalent controllers are usually designed
as extensions of a certainty-equivalent solution, such
as cautious or dual controllers that reduce the con-
trol gain in the face of high parameter uncertainty or
actively probe the environment by random excitation
(Wittenmark, 1975). Here we propose a non-certainty-equivalent approach to adaptive control based on a
Bayesian control rule derived from a minimum rel-
ative entropy principle. We demonstrate how such an
approach can be employed to solve adaptive control
problems with linear dynamics and quadratic cost.
2 A BAYESIAN RULE FOR
ADAPTIVE CONTROL
In the following we assume that the observations of our controller are given by a state variable $x_t$ and the possible actions of our controller are $u_t$. The controller can then be defined as an input-output system that is characterized by the conditional probabilities
$$P(x_{t+1}|x_{\leq t}, u_{\leq t}) \quad \text{and} \quad P(u_{t+1}|x_{\leq t+1}, u_{\leq t}),$$
where $x_{\leq t} = x_1, x_2, \ldots, x_t$ and $u_{\leq t} = u_1, u_2, \ldots, u_t$ denote concatenations of past states and actions respectively. Analogous to the controller, the plant can be thought of as an input-output system with conditional probabilities
$$Q(u_{t+1}|x_{\leq t+1}, u_{\leq t}) \quad \text{and} \quad Q(x_{t+1}|x_{\leq t}, u_{\leq t}).$$
If the controller can perfectly predict the plant for all histories $x_{\leq t}, u_{\leq t}$, then
$$P(x_{t+1}|x_{\leq t}, u_{\leq t}) = Q(x_{t+1}|x_{\leq t}, u_{\leq t}).$$
In this case the plant equation is perfectly known and the controller $P$ can be tailored to the particular plant $Q$. In particular, the control law $P(u_{t+1}|x_{\leq t+1}, u_{\leq t})$ can be chosen in such a way that it maximizes some optimality criterion given full knowledge of the plant $Q$.
Consider now the case when the controller does not know the plant dynamics, but assume we know that the plant has dynamics $Q_m$ drawn randomly from a set $\mathcal{Q}$ of possible dynamics indexed by $m$. Assume further we have available a set of tailored controllers $P_m$, where each $P_m$ is tailor-made for one of the possible plants $Q_m$. The set of possible plant dynamics and tailored controllers can then be expressed as conditional probabilities given by the following likelihood and intervention models
$$P(x_{t+1}|m, x_{\leq t}, u_{\leq t}) \quad \text{and} \quad P(u_{t+1}|m, x_{\leq t+1}, u_{\leq t}),$$
with $m \in \mathcal{M}$ indexing the different plant dynamics $Q_m$ and the different tailored controllers $P_m$. How can we now construct a controller $P$ such that its behavior is as close as possible to the tailored controller $P_m$ under any realization of $Q_m \in \mathcal{Q}$?
A convenient measure of how much $P$ deviates from $P_m$ is given by the relative entropy. In particular, we can quantify the average deviation of a control law $P(u_{t+1}|x_{\leq t+1}, \bar{u}_{\leq t})$ from the tailored control law $P(u_{t+1}|m, x_{\leq t+1}, \bar{u}_{\leq t})$ of $P_m$ by computing
$$\Big\langle D_{\mathrm{KL}}\big( P(u_{t+1}|m, x_{\leq t+1}, \bar{u}_{\leq t}) \,\big\|\, P(u_{t+1}|x_{\leq t+1}, \bar{u}_{\leq t}) \big) \Big\rangle,$$
where the average is taken with respect to a prior $P(m)$ and all possible input-output sequences with probabilities $P(x_{\leq t+1}, \bar{u}_{\leq t}|m)$. The bar symbol in $\bar{u}_{\leq t}$ indicates that past actions have been set by the controller and therefore have to be formalized as interventions (Pearl, 2000; Ortega and Braun, 2010). One can then show that the above quantity is minimized by the following control rule.
Theorem 1 (Bayesian Control Rule).
$$P(u_{t+1}|x_{\leq t+1}, \bar{u}_{\leq t}) = \sum_m P(u_{t+1}|m, x_{\leq t+1}, u_{\leq t}) \, P(m|x_{\leq t+1}, \bar{u}_{\leq t}),$$
where $P(m|x_{\leq t+1}, \bar{u}_{\leq t})$ is given by the recursive expression
$$P(m|x_{\leq t+1}, \bar{u}_{\leq t}) = \frac{P(x_{t+1}|m, x_{\leq t}, u_{\leq t}) \, P(m|x_{\leq t}, \bar{u}_{<t})}{\sum_{m'} P(x_{t+1}|m', x_{\leq t}, u_{\leq t}) \, P(m'|x_{\leq t}, \bar{u}_{<t})}. \qquad (1)$$
The proof can be found in (Ortega and Braun, 2010). Here we apply the Bayesian control rule to adaptive control. It describes a mixture distribution over different tailored controllers indexed by $m$, each of them suggesting the next control signal $u_{t+1}$ with probability $P(u_{t+1}|m, x_{\leq t+1}, u_{\leq t})$. The mixture weights are given by the posterior probability $P(m|x_{\leq t+1}, \bar{u}_{\leq t})$. The rule resembles Bayesian inference in that it starts out with a prior distribution over input-output models indexed by $m$ and computes a posterior distribution after experiencing an interaction. Actions can then be sampled from this posterior distribution.
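To make the recursion in equation (1) concrete, the following sketch implements one step of the Bayesian control rule for a finite set of candidate models. It is a minimal illustration under stated assumptions, not the implementation used in our simulations; the arguments `likelihoods` and `sample_control` are hypothetical placeholders for the model-specific predictive densities and tailored controllers.

```python
import numpy as np

def bayesian_control_rule_step(posterior, likelihoods, sample_control, rng):
    """One step of the Bayesian control rule, cf. equation (1).

    posterior      : shape (K,), current weights P(m | past states and interventions)
    likelihoods    : shape (K,), values P(x_{t+1} | m, past states and actions) for the
                     newly observed state under each candidate model m
    sample_control : callable m -> a control u_{t+1} drawn from the tailored
                     controller of model m
    rng            : numpy random Generator
    """
    # Posterior update: weight each model by how well it predicted x_{t+1}.
    posterior = posterior * likelihoods
    posterior = posterior / posterior.sum()

    # Action selection: sample a model from the posterior, then sample the
    # control suggested by that model's tailored controller.
    m = rng.choice(len(posterior), p=posterior)
    return posterior, sample_control(m)
```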
3 LINEAR QUADRATIC
REGULATOR
A linear quadratic regulator is characterized by a lin-
ear dynamical system and a quadratic cost function.
In the following we will deal with the time-discrete
case. Formally, let $x_t \in \mathbb{R}^N$ be the state vector of the plant at time $t$, $u_t \in \mathbb{R}^M$ be the action of the controller, and $F \in \mathbb{R}^{N \times N}$ and $G \in \mathbb{R}^{N \times M}$ the time-invariant system matrices describing the dynamics of the plant such that
$$x_{t+1} = F x_t + G u_t + \xi_t,$$
where $\xi_t \in \mathbb{R}^N$ is a Gaussian random variable with known covariance matrix $\Sigma_\xi$. Furthermore, let $c_t$ be the scalar instantaneous cost
$$c_t(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t,$$
where $R \in \mathbb{R}^{M \times M}$ is positive definite and $Q \in \mathbb{R}^{N \times N}$ is positive semi-definite. Thus, the time-average cost $J$ is given by
$$J = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_t(x_t, u_t).$$
If the matrices $F$, $G$, $Q$ and $R$ are all known, the optimal controller has a well-known solution that is a simple state-feedback law
$$u_t = -L^* x_t,$$
where $L^*$ can be computed from the algebraic Riccati equation (Stengel, 1993).
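As an illustration of this step, the following sketch computes a feedback gain from the discrete algebraic Riccati equation using SciPy; the sign convention $u_t = -L^* x_t$ and the function name `lqr_gain` are our own choices for the example, not part of the derivation above.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(F, G, Q, R):
    """Optimal state-feedback gain for x_{t+1} = F x_t + G u_t with stage
    cost x^T Q x + u^T R u, using the convention u_t = -L x_t."""
    V = solve_discrete_are(F, G, Q, R)                    # stationary cost-to-go matrix
    return np.linalg.solve(R + G.T @ V @ G, G.T @ V @ F)  # L = (R + G'VG)^{-1} G'VF
```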
3.1 Indirect Adaptive Bayesian Control
In this section we will assume that we know the cost matrices $Q$ and $R$, but have to estimate $F$ and $G$ during the control process. Since we have to estimate them explicitly in order to compute the optimal policy $L^*$, this is often called model-based or indirect adaptive control. This means we have to deal with an inference problem (estimating $F$ and $G$) and an optimal control problem (generating control commands given the estimates $\hat{F}$ and $\hat{G}$).
In order to solve the estimation problem we use an Unscented Kalman Filter (UKF) in our simulation experiments because it can estimate Gaussian random variables both under linear and nonlinear circumstances (Julier and Durrant-Whyte, 1995; Haykin, 2001). The parameter vector we want to estimate is given by the vectorized system matrices $\hat{w} = \mathrm{vec}([\hat{F}; \hat{G}])$. Initially, we assume a Gaussian prior over $\hat{w}_0$. We model the evolution of the parameter estimate as a Brownian diffusion process given by
$$\hat{w}_{t+1} = \hat{w}_t + \omega_t, \qquad (2)$$
where $\omega_t \in \mathbb{R}^{N(N+M)}$ is a Gaussian random variable with covariance matrix $\Sigma_\omega$. The covariance matrix determines the step size of the adaptation process. The likelihood model needed for the inference process is provided by
$$P(x_{t+1}|\hat{w}, x_t, u_t) \propto \exp\Big(-\tfrac{1}{2}\big(x_{t+1} - \hat{F} x_t - \hat{G} u_t\big)^T \Sigma_\xi^{-1} \big(x_{t+1} - \hat{F} x_t - \hat{G} u_t\big)\Big). \qquad (3)$$
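A minimal sketch of how the likelihood (3) could be evaluated for a given parameter sample is shown below; the row-major packing of $\hat{w}$ into $[\hat{F}, \hat{G}]$ is an assumption made for the sake of the example.

```python
import numpy as np

def log_likelihood(w_hat, x_next, x, u, Sigma_xi, N, M):
    """Log of the likelihood model (3) for one transition (x, u) -> x_next.

    w_hat packs the estimated system matrices; here we assume the row-major
    convention w_hat = vec([F_hat, G_hat]) with F_hat of size N x N and
    G_hat of size N x M.
    """
    FG = w_hat.reshape(N, N + M)
    F_hat, G_hat = FG[:, :N], FG[:, N:]
    residual = x_next - F_hat @ x - G_hat @ u            # one-step prediction error
    return -0.5 * residual @ np.linalg.solve(Sigma_xi, residual)
```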
The adaptation rate $\Sigma_\omega$ can be adjusted dynamically depending on how well the current parameter estimates fit the observations. Poor predictions should lead to high variability and fast adaptation in big steps, whereas very good predictions should imply only small adaptation steps. This can be implemented using a Robbins-Monro innovation update
$$\Sigma_\omega^{(t+1)} = (1-\alpha)\,\Sigma_\omega^{(t)} + \alpha I_t, \qquad I_t = K_t^{\hat{w}} \big[x_{t+1} - \hat{x}_{t+1}\big]\big[x_{t+1} - \hat{x}_{t+1}\big]^T \big(K_t^{\hat{w}}\big)^T,$$
where $K_t^{\hat{w}}$ is the Kalman gain as used in the UKF and $\hat{x}_{t+1}$ stems from the prediction step of the UKF (Haykin, 2001).
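The innovation update above can be written in a few lines. The following sketch assumes the UKF supplies the Kalman gain and the predicted state; variable names are illustrative.

```python
import numpy as np

def adapt_process_noise(Sigma_omega, K, x_next, x_pred, alpha=0.05):
    """Robbins-Monro innovation update of the diffusion covariance in (2).

    K is the UKF Kalman gain for the parameter estimate, x_pred the UKF
    prediction of x_{t+1}; alpha is the learning rate.
    """
    innovation = (x_next - x_pred).reshape(-1, 1)
    I_t = K @ innovation @ innovation.T @ K.T   # innovation projected into parameter space
    return (1.0 - alpha) * Sigma_omega + alpha * I_t
```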
In order to solve the control problem we have to use the current estimate $\hat{w} = \mathrm{vec}([\hat{F}; \hat{G}])$ to compute the optimal control commands. A certainty-equivalent self-tuning regulator would simply take the mean estimate $E[\hat{w}]$ and use it in the algebraic Riccati equation at every point in time as if it were the true parameter vector. While this often works fine if only a few parameters of the matrix are unknown, in general it can lead to suboptimal solutions. Instead, we propose to use the Bayesian control rule as laid down in equation (1). This means we have to specify a likelihood and an intervention model. The likelihood model $P(x_{t+1}|\hat{w}, x_t, u_t)$ is given by equation (3). The intervention model is deterministic and given by
$$P(u_{t+1}|\hat{w}, x_{t+1}, u_t) \propto \delta\big(u_{t+1} + L^{\hat{w}} x_{t+1}\big).$$
It might seem that this would require taking the entire probability distribution over $\hat{w}$ and propagating it through the Riccati equation in order to sample an $L$ at each point in time to determine $u_{t+1}$. Fortunately, an explicit computation of the posterior over policies is not necessary. We can simply sample from the distribution over $\hat{w}$, propagate this sampled value through the Riccati equation and obtain a sampled policy $L$. The more precise the estimates of $\hat{w}$ become, the more precise the sampled policies $L$ will be.
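A possible realization of this sampling step is sketched below: draw $\hat{w}$ from the Gaussian posterior maintained by the UKF, unpack it into $\hat{F}$ and $\hat{G}$, and solve the Riccati equation for that sample. The packing convention and function name are assumptions, and the sketch presumes the sampled system is stabilizable so that the Riccati solver succeeds.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def sample_feedback_gain(w_mean, w_cov, Q, R, N, M, rng):
    """Sample w_hat from its Gaussian posterior and return the matching gain L,
    so that the control drawn by the Bayesian control rule is u = -L x."""
    w_sample = rng.multivariate_normal(w_mean, w_cov)  # one posterior sample of w_hat
    FG = w_sample.reshape(N, N + M)                    # same packing as in the likelihood model
    F_hat, G_hat = FG[:, :N], FG[:, N:]
    V = solve_discrete_are(F_hat, G_hat, Q, R)         # Riccati solution for this sample
    return np.linalg.solve(R + G_hat.T @ V @ G_hat, G_hat.T @ V @ F_hat)
```

As the posterior covariance over $\hat{w}$ shrinks, the sampled gains concentrate around the optimal value, which is the behavior visible in figure 1b.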
Example. In many motor control studies the hand is modeled as a point mass, where the state vector $x_t$ comprises position and velocity in the plane (Todorov and Jordan, 2002). Discretized in time, this yields the following equation:
$$x_{t+1} = \begin{pmatrix} 1 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \Delta t \\ 0 & 0 & 0 & 1 \end{pmatrix} x_t + \begin{pmatrix} 0 & 0 \\ \Delta t/m & 0 \\ 0 & 0 \\ 0 & \Delta t/m \end{pmatrix} u_t + \xi_t,$$
where we chose $\xi_t$ to be distributed according to
$$\xi_t \sim \mathcal{N}\!\left(0, \; \Delta t \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/4 \end{pmatrix}\right).$$
The noise $\xi_t$ models uncertainty in the force production when controlling the mass point. In our simulation of a reaching task with unknown system dynamics the controller had to learn to bring the mass point from the periphery to the center of the coordinate system, trying to find the optimal feedback gains. This requires estimating a 24-dimensional parameter vector $w$ and sampling a $2 \times 8$-dimensional feedback gain. We chose the following parameter settings: $\Delta t = 0.01$, $m = 1$, $R = [[0.001, 0]; [0, 0.001]]$, $Q = [[1,0,0,0]; [0,0.01,0,0]; [0,0,1,0]; [0,0,0,0.01]]$ and $\alpha = 0.05$ for the UKF. The results can be seen in
figure 1. The first entry of the parameter vector $\hat{w}$ is depicted in figure 1a, the first entry of the correspondingly sampled $L$ is depicted in figure 1b. After an initial exploration phase in which $L$ is sampled from a broad distribution, the controller settles down and only samples from a very narrow distribution centered at the optimal value. Figure 1c,d shows initial and final trajectories and speed profiles: the initially amorphous movement turns into a straight-line movement with a bell-shaped speed profile. Importantly, the Bayesian controller converges much faster to the correct feedback gain than the certainty-equivalent controller, which never fully reached the optimal value in our simulation (compare figure 1e,f). In the following table the mean absolute feedback gain error, i.e. the difference between the optimal feedback gain and the actually executed feedback gain, is shown averaged over the last 3000 time steps of 100 runs. We have also averaged over all $2 \times 8$ feedback gains.

                                    Abs. Error
Certainty Equivalent Controller     8.26 ± 0.01
Bayesian Control Rule               2.085 ± 0.002
The results show that the Bayesian control rule incurs approximately 4 times less error on average than the certainty-equivalent controller in this example. To ensure that this result does not depend on the particular system, we ran the same simulation but with all entries of the true $F$ and $G$ drawn randomly from a uniform distribution on $[0;1]$ in each run, with 100 runs in total. However, these random draws were "frozen" such that both controllers faced the same random variables and differences cannot be attributed to different random draws. Each run of this simulation had 500 time steps and we compared the feedback gain error in the last 100 time steps.

                                    Abs. Error
Certainty Equivalent Controller     0.536 ± 0.002
Bayesian Control Rule               0.111 ± 0.001
On average the Bayesian control rule incurred ap-
proximately 5 times less error than the certainty-
equivalent controller.
3.2 Direct Adaptive Bayesian Control
The adaptive linear quadratic control problem can be reformulated in a way that does not require estimating the system matrices $F$ and $G$ explicitly (Bradtke, 1993). Instead we can work directly on the policy space and assign a Q-value to each policy such that the Q-value of policy $L$ is given by
$$Q^L(x_t, u_t) = c_t(x_t, u_t) + \big(F x_t + G u_t\big)^T V^L \big(F x_t + G u_t\big),$$
where $V^L$ corresponds to the cost-to-go function. Thus, $Q^L(x_t, u_t)$ can be expressed as a quadratic form
$$\begin{pmatrix} x_t \\ u_t \end{pmatrix}^T \underbrace{\begin{pmatrix} Q + F^T V^L F & F^T V^L G \\ G^T V^L F & R + G^T V^L G \end{pmatrix}}_{= \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix}} \begin{pmatrix} x_t \\ u_t \end{pmatrix}.$$
The matrix $M \in \mathbb{R}^{(N+M) \times (N+M)}$ is positive definite and represents the Q-value of policy $L$. The relationship between $M$ and $L$ is given by
$$L = M_{22}^{-1} M_{21}, \qquad (4)$$
as can be readily seen when computing $\partial Q^L(x_t, u_t)/\partial u_t = 0$. Previous studies have applied Q-learning to solve this direct adaptive control problem by reinforcement learning methods (Bradtke, 1993). Here we want to transform it into an inference problem. To this end, we need to relate $M$ to an observable quantity in a way that is independent of the policy that is currently executed by the controller. We can achieve this by noting that Bellman's optimality equation imposes a recurrent relationship between consecutive Q-values, namely
$$Q^L(x_t, u_t) = c_t(x_t, u_t) + Q^L(x_{t+1}, -L x_{t+1}). \qquad (5)$$
Since $c_t$ is an observable quantity we can take it to one side of the equation and put all Q-quantities of equation (5) on the other side. Only the "true" Q-function can predict all $c_t$ for all data points $\{x_t, u_t, x_{t+1}\}$. Thus, we can use this relationship to do inference over $M$, where
$$\hat{c}_t = \begin{pmatrix} x_t \\ u_t \end{pmatrix}^T M \begin{pmatrix} x_t \\ u_t \end{pmatrix} - \begin{pmatrix} x_{t+1} \\ -M_{22}^{-1} M_{21} x_{t+1} \end{pmatrix}^T M \begin{pmatrix} x_{t+1} \\ -M_{22}^{-1} M_{21} x_{t+1} \end{pmatrix}.$$
Assuming Gaussian noise with known variance $\sigma^2$ for the cost observations, we obtain the following likelihood model for our Bayesian controller:
$$P(c_t|M, x_t, x_{t+1}, u_t) \propto \exp\left(-\frac{1}{2\sigma^2}\big(\hat{c}_t - c_t\big)^2\right).$$
The intervention model is again deterministic:
$$P(u_{t+1}|M, x_{t+1}, x_t, u_t) \propto \delta\big(u_{t+1} + M_{22}^{-1} M_{21} x_{t+1}\big).$$
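The following sketch evaluates the predicted cost $\hat{c}_t$ and the resulting Gaussian log-likelihood for a candidate matrix $M$; the helper names and the explicit block indexing are illustrative assumptions rather than the exact implementation used in our simulations.

```python
import numpy as np

def predicted_cost(M_mat, x, u, x_next, n):
    """Predicted cost c_hat_t implied by a candidate Q-value matrix M;
    n is the state dimension, so M_mat has shape (n + m, n + m)."""
    M22 = M_mat[n:, n:]
    M21 = M_mat[n:, :n]
    u_next = -np.linalg.solve(M22, M21 @ x_next)   # greedy action -M22^{-1} M21 x_{t+1}
    z = np.concatenate([x, u])
    z_next = np.concatenate([x_next, u_next])
    return z @ M_mat @ z - z_next @ M_mat @ z_next

def cost_log_likelihood(M_mat, x, u, x_next, c_obs, n, sigma):
    """Gaussian log-likelihood of an observed cost c_obs under M."""
    err = predicted_cost(M_mat, x, u, x_next, n) - c_obs
    return -0.5 * (err / sigma) ** 2
```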
Doing inference over $M$ is complicated by three facts: (i) the likelihood model is highly nonlinear in the parameters, (ii) $M$ must be constrained to the set of positive definite matrices, and (iii) $M$ will be ill-conditioned in many examples because the different parts of the matrix usually differ by various orders of magnitude, for example when the unknown cost matrices $Q$ and $R$ are of different orders of magnitude. Here we can only address problems (i) and (ii), i.e. the examples to demonstrate the Bayesian controller have to be well-conditioned, which is, for instance, not true for the previous simulation. With regard to (i) we found that for this inference process the UKF only works robustly when the propagated means are simply computed as an unweighted average over sigma points instead of the more common weighted average. With regard to (ii) we note that any positive definite matrix can be expressed as a product of its unique Cholesky factors, $M = m^T m$, where $m$ is upper triangular with strictly positive diagonal elements. Then we can do inference over $m$ with the simpler constraint that the diagonal elements must be positive.
Figure 1: Results. (a-d) Learning to move a mass point when the system dynamics matrices F and G are unknown. A single
run of the control process is shown. (a) Temporal evolution of the estimate of the first entry of $\hat{F}$ and the respective uncertainty as represented by the Kalman filter. (b) Sampled feedback gain; only the first entry of L is shown. The initial exploration
phase is followed by a stable performance after 400 time steps. The thin red line indicates the optimal feedback gain. (c,d)
Trajectories and speed profiles. Initially, the trajectory takes a random direction with an amorphous speed profile (blue curves).
Later movement trajectories are straight and speed profiles bell-shaped (black curves). Panels (e-f) show sampled feedback
gains over 100 runs. (e) Mean executed feedback gain. The certainty-equivalent controller (CEC) slowly converges to the
region of optimal feedback gains. The exact optimal value was not reached in this simulation. The Bayesian control rule
(BCR) converges very fast to the optimal feedback gain. (f) Variance of executed feedback gain. The Bayesian controller
that used sampled feedback gains converges much faster than the certainty-equivalent controller. (g,h) Learning to move a
mass-less point when both the system dynamics matrices F and G and the cost matrices Q and R are unknown. (g) Bayesian
Control Rule. Trajectories of the first and last 5 trials. Initially, movements are undirected but later converge to straight line
movements. The pertinent cost converges to the optimum. (h) Policy Iteration. Trajectories of the first and last 5 trials. The
trajectories are wiggly because noise has to be added to the controller for exploration. Due to this extra noise the controller
cannot converge to the optimal cost.
In our simulation we implemented this constraint by simply discarding any Kalman filter updates that would violate it. In general, such constraints can be easily implemented using particle filters.
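A minimal sketch of this parameterization is given below, assuming the free parameters are the entries of the upper-triangular factor $m$; the rejection of infeasible updates mimics the scheme described above, but the function names are hypothetical.

```python
import numpy as np

def unpack_cholesky(theta, d):
    """Reconstruct M = m^T m from the flat vector theta containing the
    d*(d+1)/2 upper-triangular entries of the Cholesky factor m."""
    m = np.zeros((d, d))
    m[np.triu_indices(d)] = theta
    if np.any(np.diag(m) <= 0.0):
        raise ValueError("diagonal of the Cholesky factor must be positive")
    return m.T @ m   # positive definite by construction

def accept_update(theta_new, d):
    """Keep a Kalman filter update only if it yields a feasible factor."""
    try:
        unpack_cholesky(theta_new, d)
        return True
    except ValueError:
        return False
```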
Example. A simple well-conditioned example is a mass-less particle that moves around in the plane. The system dynamics can be formalized as
$$x_{t+1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} x_t + \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} u_t.$$
The controller receives noisy observations of the cost,
$$c_t = x_t^T Q x_t + u_t^T R u_t + \xi_t,$$
where $\xi_t$ is a normally distributed scalar variable with variance $\sigma_{\mathrm{obs}} = 0.1$. Both $Q$ and $R$ were assumed to be identity matrices and $\alpha = 0.5$ as previously. This
is a 10-dimensional estimation problem. Figure 1g shows that the Bayesian controller managed to find the optimal control solution relying only on inference and sampling. We compared against a policy iteration algorithm for linear quadratic controllers as proposed in (Bradtke, 1993); compare Figure 1h. In the latter, exploration can only be achieved by adding extra noise to the control command. Note that the Bayesian control rule incurs this noise automatically by sampling from the posterior. We simulated 100 trials with 50 time steps each.
To ensure again that this result does not depend on the particular system, we ran another simulation where each entry of $F$ and $G$ was drawn from the uniform distribution on $[0;1]$ and $Q$ and $R$ were drawn from an inverse Wishart distribution with identity scale matrix and 2 degrees of freedom. The noise was again "frozen" for comparison between the two algorithms. We compared the absolute error between the optimal and the actually executed feedback gain over the last 20 trials. The Bayesian control rule outperformed the policy iteration algorithm roughly by a factor of 5.

                                    Abs. Error
Policy Iteration (Bradtke, 1993)    2.5 ± 0.1
Bayesian Control Rule               0.55 ± 0.01
4 CONCLUSIONS
In this paper we suggest a minimum relative entropy
formulation of adaptive control problems when the
plant dynamics are unknown but known to belong to
a pre-defined set of possible dynamics. This formu-
lation has an explicit solution given by the Bayesian
control rule, a stochastic rule for adaptive control.
We have presented two example classes that show
how adaptive linear quadratic control problems can
be tackled using this problem formulation. Usually,
adaptive controllers rely on the certainty-equivalence principle and ignore parameter uncertainty in the control process (Åström and Wittenmark, 1995). In con-
trast, a controller based on the Bayesian control rule
considers this uncertainty for balancing exploration
and exploitation in a way that minimizes the expected
relative entropy with regard to the true control law.
In particular, indirect control methods provide an
interesting perspective here, because they allow solv-
ing the adaptive control problem purely based on in-
ference and sampling methods that can be recruited
from a rich arsenal in machine learning. Both infer-
ence and action sampling work forward in time and
are therefore applicable online. Also, they do not require separate phases of policy evaluation and policy improvement, as some previous reinforcement learning methods do. Inference can be done online independently of the sampled policy. Several other stud-
ies have previously proposed to solve adaptive con-
trol problems based on inference methods (Toussaint
et al., 2006; Engel et al., 2005; Haruno et al., 2001).
Crucially, however, these studies have concentrated
on the observation part of the learning problem with
no principled solution for the action selection prob-
lem. Usually, exploration noise has to be introduced
in an ad hoc fashion in order to avoid suboptimal per-
formance. In contrast, the minimum relative entropy
cost function naturally leads to stochastic policies.
The main contribution of this study is to illustrate
how a relative entropy formulation can be applied to
solve an adaptive control problem. This is done by
deriving a stochastic controller based on the Bayesian
control rule for the LQR problem with unknown sys-
tem and cost matrices. Similar minimum relative en-
tropy formulations have recently also been proposed
to solve optimal control problems with known system
dynamics (Todorov, 2009; Kappen et al., 2009). How
these two approaches for adaptive and optimal control
relate is an interesting question for future research.
Also, the Bayesian control rule suggested here could
in principle be employed to solve more general adap-
tive control problems with possibly nonlinear dynam-
ics. However, finding optimal tailored controllers for
complex sub-environments can in general be highly
non-trivial. Therefore, finding inference and sam-
pling methods that work for more general classes of
adaptive control problems poses a future challenge.
REFERENCES
Åström, K. and Wittenmark, B. (1995). Adaptive Control. Prentice Hall, 2nd edition.
Bradtke, S. (1993). Reinforcement learning applied to lin-
ear quadratic control. Advances in Neural Information
Processing Systems 5.
Campi, M. and Kumar, P. (1996). Optimal adaptive control of an LQG system. Proc. 35th Conf. on Decision and Control, pages 349–353.
Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 201–208.
Haruno, M., Wolpert, D., and Kawato, M. (2001). MOSAIC model for sensorimotor learning and control. Neural Computation, 13:2201–2220.
Haykin, S. (2001). Kalman filtering and neural networks.
John Wiley and Sons.
Julier, S. J., Uhlmann, J. K., and Durrant-Whyte, H. F. (1995). A new approach for filtering nonlinear systems. Proc. Am. Control Conference, pages 1628–1632.
Kappen, B., Gomez, V., and Opper, M. (2009). Opti-
mal control as a graphical model inference problem.
arXiv:0901.0633.
Ortega, P. and Braun, D. (2010). A bayesian rule for adap-
tive control based on causal interventions. In Proceed-
ings of the third conference on artificial general intel-
ligence, pages 121–126. Atlantis Press.
Pearl, J. (2000). Causality: Models, Reasoning, and Infer-
ence. Cambridge University Press, Cambridge, UK.
Stengel, R. (1993). Optimal control and estimation. Dover
Publications.
Todorov, E. (2009). Efficient computation of optimal ac-
tions. Proceedings of the National Academy of Sci-
ences U.S.A., 106:11478–11483.
Todorov, E. and Jordan, M. (2002). Optimal feedback con-
trol as a theory of motor coordination. Nat. Neurosci.,
5:1226–1235.
Toussaint, M., Harmeling, S., and Storkey, A. (2006). Probabilistic inference for solving (PO)MDPs. Technical
report, EDI-INF-RR-0934, University of Edinburgh,
School of Informatics.
Wittenmark, B. (1975). Stochastic adaptive control meth-
ods: a survey. International Journal of Control,
21:705–730.