A MINIMUM RELATIVE ENTROPY PRINCIPLE FOR ADAPTIVE
CONTROL IN LINEAR QUADRATIC REGULATORS
Daniel A. Braun and Pedro A. Ortega
University of Cambridge, Dept. of Engineering, CB2 1PZ Cambridge, U.K.
Keywords:
Minimum relative entropy principle, Adaptive control, Bayesian control rule, Linear quadratic regulator.
Abstract:
The design of optimal adaptive controllers is usually based on heuristics, because solving Bellman's equations
over information states is notoriously intractable. Approximate adaptive controllers often rely on the principle
of certainty-equivalence where the control process deals with parameter point estimates as if they represented
“true” parameter values. Here we present a stochastic control rule instead where controls are sampled from
a posterior distribution over a set of probabilistic input-output models and the true model is identified by
Bayesian inference. This allows reformulating the adaptive control problem as an inference and sampling
problem derived from a minimum relative entropy principle. Importantly, inference and action sampling both
work forward in time and hence such a Bayesian adaptive controller is applicable on-line. We demonstrate the
improved performance that can be achieved by such an approach for linear quadratic regulator examples.
1 INTRODUCTION
Learning how to act in an unknown environment poses the problem of adaptive control (Åström and Wittenmark, 1995). Solving adaptive control problems optimally is notoriously hard because it requires the solution of Bellman's optimality equations over large trees of information states, which quickly becomes intractable. Therefore, a number of approximate adaptive control methods have been devised in the literature (Åström and Wittenmark, 1995). Most heuristics for adaptive control are based on the certainty-equivalence principle, i.e. when the unknown plant parameters are estimated, the uncertainty of these estimates has no impact on the resulting control strategy. Instead, a point estimate of the system parameters is treated as if it represented the "true" system parameters.
It is well known in optimal control theory that the certainty-equivalence principle holds exactly for linear quadratic systems with known dynamics (Åström and Wittenmark, 1995). In the case of adaptive control, however, the certainty-equivalence principle breaks down in general and is only used as a heuristic. In
fact, previous studies have shown that even for the
linear quadratic controller correct closed-loop system
identification cannot be guaranteed under certainty-
equivalence, which has led to the proposal of cost-
biased estimators (Campi and Kumar, 1996). Non-
certainty-equivalent controllers are usually designed
as extensions of a certainty-equivalent solution, such
as cautious or dual controllers that reduce the con-
trol gain in the face of high parameter uncertainty or
actively probe the environment by random excitation
(Wittenmark, 1975). Here we propose a non-certainty-equivalent approach to adaptive control based on a
Bayesian control rule derived from a minimum rel-
ative entropy principle. We demonstrate how such an
approach can be employed to solve adaptive control
problems with linear dynamics and quadratic cost.
2 A BAYESIAN RULE FOR
ADAPTIVE CONTROL
In the following we assume that the observations of our controller are given by a state variable $x_t$ and the possible actions of our controller are $u_t$. The controller can then be defined as an input-output system that is characterized by the conditional probabilities
$$P(x_{t+1}|x_{\leq t}, u_{\leq t}) \quad \text{and} \quad P(u_{t+1}|x_{\leq t+1}, u_{\leq t}),$$
where $x_{\leq t} = x_1, x_2, \ldots, x_t$ and $u_{\leq t} = u_1, u_2, \ldots, u_t$ denote concatenations of past states and actions respectively. Analogous to the controller, the plant can be thought of as an input-output system with conditional probabilities
$$Q(u_{t+1}|x_{\leq t+1}, u_{\leq t}) \quad \text{and} \quad Q(x_{t+1}|x_{\leq t}, u_{\leq t}).$$
If the controller can perfectly predict the plant for all histories $x_{\leq t}, u_{\leq t}$, then
$$P(x_{t+1}|x_{\leq t}, u_{\leq t}) = Q(x_{t+1}|x_{\leq t}, u_{\leq t}).$$
In this case the plant equation is perfectly known and the controller $P$ can be tailored to the particular plant $Q$. In particular, the control law $P(u_{t+1}|x_{\leq t+1}, u_{\leq t})$ can be chosen in such a way that it maximizes some optimality criterion given full knowledge of the plant $Q$.
Consider now the case when the controller does not know the plant dynamics, but assume we know that the plant has dynamics $Q_m$ drawn randomly from a set $\mathcal{Q}$ of possible dynamics indexed by $m$. Assume further we have available a set of tailored controllers $P_m$, where each $P_m$ is tailor-made for one of the possible plants $Q_m$. The set of possible plant dynamics and tailored controllers can then be expressed as conditional probabilities given by the following likelihood and intervention models
$$P(x_{t+1}|m, x_{\leq t}, u_{\leq t}) \quad \text{and} \quad P(u_{t+1}|m, x_{\leq t+1}, u_{\leq t}),$$
with $m \in \mathcal{M}$ indexing the different plant dynamics $Q_m$ and the different tailored controllers $P_m$. How can we now construct a controller $P$ such that its behavior is as close as possible to the tailored controller $P_m$ under any realization of $Q_m \in \mathcal{Q}$?
A convenient measure of how much $P$ deviates from $P_m$ is given by the relative entropy. In particular, we can quantify the average deviation of a control law $P(u_{t+1}|x_{\leq t+1}, \bar{u}_{\leq t})$ from the tailored control law $P(u_{t+1}|m, x_{\leq t+1}, \bar{u}_{\leq t})$ of $P_m$ by computing
$$\Big\langle D_{\mathrm{KL}}\big( P(u_{t+1}|m, x_{\leq t+1}, \bar{u}_{\leq t}) \,\big\|\, P(u_{t+1}|x_{\leq t+1}, \bar{u}_{\leq t}) \big) \Big\rangle,$$
where the average is taken with respect to a prior $P(m)$ and all possible input-output sequences with probabilities $P(x_{\leq t+1}, \bar{u}_{\leq t}|m)$. The bar symbol in $\bar{u}_{\leq t}$ indicates that past actions have been set by the controller and therefore have to be formalized as interventions (Pearl, 2000; Ortega and Braun, 2010). One can then show that the above quantity is minimized by the following control rule.
Theorem 1 (Bayesian Control Rule).
$$P(u_{t+1}|x_{\leq t+1}, \bar{u}_{\leq t}) = \sum_m P(u_{t+1}|m, x_{\leq t+1}, u_{\leq t}) \, P(m|x_{\leq t+1}, \bar{u}_{\leq t}),$$
where $P(m|x_{\leq t+1}, \bar{u}_{\leq t})$ is given by the recursive expression
$$P(m|x_{\leq t+1}, \bar{u}_{\leq t}) = \frac{P(x_{t+1}|m, x_{\leq t}, u_{\leq t}) \, P(m|x_{\leq t}, \bar{u}_{<t})}{\sum_{m'} P(x_{t+1}|m', x_{\leq t}, u_{\leq t}) \, P(m'|x_{\leq t}, \bar{u}_{<t})}. \qquad (1)$$
The proof can be found in (Ortega and Braun, 2010). Here we apply the Bayesian control rule to adaptive control. It describes a mixture distribution over different tailored controllers indexed by $m$, each of them suggesting the next control signal $u_{t+1}$ with probability $P(u_{t+1}|m, x_{\leq t+1}, u_{\leq t})$. The mixture weights are given by the posterior probability $P(m|x_{\leq t+1}, \bar{u}_{\leq t})$. The rule resembles Bayesian inference in that it starts out with a prior distribution over input-output models indexed by $m$ and computes a posterior distribution after experiencing an interaction. Actions can then be sampled from this posterior distribution.
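To make the recursion in equation (1) concrete, the following sketch implements one step of the Bayesian control rule for a finite set of candidate models. It is a minimal illustration under stated assumptions, not the implementation used in our simulations; the arguments `likelihoods` and `sample_control` are hypothetical placeholders for the model-specific predictive densities and tailored controllers.

```python
import numpy as np

def bayesian_control_rule_step(posterior, likelihoods, sample_control, rng):
    """One step of the Bayesian control rule, cf. equation (1).

    posterior      : shape (K,), current weights P(m | past states and interventions)
    likelihoods    : shape (K,), values P(x_{t+1} | m, past states and actions) for the
                     newly observed state under each candidate model m
    sample_control : callable m -> a control u_{t+1} drawn from the tailored
                     controller of model m
    rng            : numpy random Generator
    """
    # Posterior update: weight each model by how well it predicted x_{t+1}.
    posterior = posterior * likelihoods
    posterior = posterior / posterior.sum()

    # Action selection: sample a model from the posterior, then sample the
    # control suggested by that model's tailored controller.
    m = rng.choice(len(posterior), p=posterior)
    return posterior, sample_control(m)
```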
3 LINEAR QUADRATIC
REGULATOR
A linear quadratic regulator is characterized by a lin-
ear dynamical system and a quadratic cost function.
In the following we will deal with the time-discrete
case. Formally, let $x_t \in \mathbb{R}^N$ be the state vector of the plant at time $t$, $u_t \in \mathbb{R}^M$ be the action of the controller, and $F \in \mathbb{R}^{N \times N}$ and $G \in \mathbb{R}^{N \times M}$ the time-invariant system matrices describing the dynamics of the plant such that
$$x_{t+1} = F x_t + G u_t + \xi_t,$$
where $\xi_t \in \mathbb{R}^N$ is a Gaussian random variable with known covariance matrix $\Sigma_\xi$. Furthermore, let $c_t$ be the scalar instantaneous cost
$$c_t(x_t, u_t) = x_t^T Q x_t + u_t^T R u_t,$$
where $R \in \mathbb{R}^{M \times M}$ is positive definite and $Q \in \mathbb{R}^{N \times N}$ is positive semi-definite. Thus, the time-average cost $J$ is given by
$$J = \lim_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} c_t(x_t, u_t).$$
If the matrices $F$, $G$, $Q$ and $R$ are all known, the optimal controller has a well-known solution that is a simple state-feedback law
$$u_t = -L^* x_t,$$
where $L^*$ can be computed from the algebraic Riccati equation (Stengel, 1993).
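As an illustration of this step, the following sketch computes a feedback gain from the discrete algebraic Riccati equation using SciPy; the sign convention $u_t = -L^* x_t$ and the function name `lqr_gain` are our own choices for the example, not part of the derivation above.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(F, G, Q, R):
    """Optimal state-feedback gain for x_{t+1} = F x_t + G u_t with stage
    cost x^T Q x + u^T R u, using the convention u_t = -L x_t."""
    V = solve_discrete_are(F, G, Q, R)                    # stationary cost-to-go matrix
    return np.linalg.solve(R + G.T @ V @ G, G.T @ V @ F)  # L = (R + G'VG)^{-1} G'VF
```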
3.1 Indirect Adaptive Bayesian Control
In this section we will assume that we know the cost matrices $Q$ and $R$, but have to estimate $F$ and $G$ during the control process. Since we have to estimate them explicitly in order to compute the optimal policy $L^*$, this is often called model-based or indirect adaptive control. This means we have to deal with an inference problem (estimating $F$ and $G$) and an optimal control problem (generating control commands given the estimates $\hat{F}$ and $\hat{G}$).
In order to solve the estimation problem we use an Unscented Kalman Filter (UKF) in our simulation experiments because it can estimate Gaussian random variables both under linear and nonlinear circumstances (Julier and Durrant-Whyte, 1995; Haykin, 2001). The parameter vector we want to estimate is given by the vectorized system matrices $\hat{w} = \mathrm{vec}([\hat{F}; \hat{G}])$. Initially, we assume a Gaussian prior over $\hat{w}_0$. We model the evolution of the parameter estimate as a Brownian diffusion process given by
$$\hat{w}_{t+1} = \hat{w}_t + \omega_t, \qquad (2)$$
where $\omega_t \in \mathbb{R}^{N(N+M)}$ is a Gaussian random variable with covariance matrix $\Sigma_\omega$. The covariance matrix determines the step size of the adaptation process. The likelihood model needed for the inference process is provided by
$$P(x_{t+1}|\hat{w}, x_t, u_t) \propto \exp\Big(-\tfrac{1}{2}\big(x_{t+1} - \hat{F} x_t - \hat{G} u_t\big)^T \Sigma_\xi^{-1} \big(x_{t+1} - \hat{F} x_t - \hat{G} u_t\big)\Big). \qquad (3)$$
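A minimal sketch of how the likelihood (3) could be evaluated for a given parameter sample is shown below; the row-major packing of $\hat{w}$ into $[\hat{F}, \hat{G}]$ is an assumption made for the sake of the example.

```python
import numpy as np

def log_likelihood(w_hat, x_next, x, u, Sigma_xi, N, M):
    """Log of the likelihood model (3) for one transition (x, u) -> x_next.

    w_hat packs the estimated system matrices; here we assume the row-major
    convention w_hat = vec([F_hat, G_hat]) with F_hat of size N x N and
    G_hat of size N x M.
    """
    FG = w_hat.reshape(N, N + M)
    F_hat, G_hat = FG[:, :N], FG[:, N:]
    residual = x_next - F_hat @ x - G_hat @ u            # one-step prediction error
    return -0.5 * residual @ np.linalg.solve(Sigma_xi, residual)
```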
The adaptation rate $\Sigma_\omega$ can be adjusted dynamically depending on how well the current parameter estimates fit the observations. Poor predictions should lead to high variability and fast adaptation in big steps, whereas very good predictions should imply only small adaptation steps. This can be implemented using a Robbins-Monro innovation update
$$\Sigma_\omega^{(t+1)} = (1-\alpha)\,\Sigma_\omega^{(t)} + \alpha I_t, \qquad I_t = K_t^{\hat{w}} \big[x_{t+1} - \hat{x}_{t+1}\big]\big[x_{t+1} - \hat{x}_{t+1}\big]^T \big(K_t^{\hat{w}}\big)^T,$$
where $K_t^{\hat{w}}$ is the Kalman gain as used in the UKF and $\hat{x}_{t+1}$ stems from the prediction step of the UKF (Haykin, 2001).
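The innovation update above can be written in a few lines. The following sketch assumes the UKF supplies the Kalman gain and the predicted state; variable names are illustrative.

```python
import numpy as np

def adapt_process_noise(Sigma_omega, K, x_next, x_pred, alpha=0.05):
    """Robbins-Monro innovation update of the diffusion covariance in (2).

    K is the UKF Kalman gain for the parameter estimate, x_pred the UKF
    prediction of x_{t+1}; alpha is the learning rate.
    """
    innovation = (x_next - x_pred).reshape(-1, 1)
    I_t = K @ innovation @ innovation.T @ K.T   # innovation projected into parameter space
    return (1.0 - alpha) * Sigma_omega + alpha * I_t
```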
In order to solve the control problem we have to use the current estimate $\hat{w} = \mathrm{vec}([\hat{F}; \hat{G}])$ to compute the optimal control commands. A certainty-equivalent self-tuning regulator would simply take the mean estimate $E[\hat{w}]$ and use it in the algebraic Riccati equation at every point in time as if it were the true parameter vector. While this often works fine if only a few parameters of the matrix are unknown, in general it can lead to suboptimal solutions. Instead, we propose to use the Bayesian control rule as laid down in equation (1). This means we have to specify a likelihood and an intervention model. The likelihood model $P(x_{t+1}|\hat{w}, x_t, u_t)$ is given by equation (3). The intervention model is deterministic and given by
$$P(u_{t+1}|\hat{w}, x_{t+1}, u_t) \propto \delta\big(u_{t+1} + L^{\hat{w}} x_{t+1}\big).$$
It might seem that this would require taking the entire probability distribution over $\hat{w}$ and propagating it through the Riccati equation in order to sample an $L$ at each point in time to determine $u_{t+1}$. Fortunately, an explicit computation of the posterior over policies is not necessary. We can simply sample from the distribution over $\hat{w}$, propagate this sampled value through the Riccati equation and obtain a sampled policy $L$. The more precise the estimates of $\hat{w}$ become, the more precise the sampled policies $L$ will be.
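A possible realization of this sampling step is sketched below: draw $\hat{w}$ from the Gaussian posterior maintained by the UKF, unpack it into $\hat{F}$ and $\hat{G}$, and solve the Riccati equation for that sample. The packing convention and function name are assumptions, and the sketch presumes the sampled system is stabilizable so that the Riccati solver succeeds.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def sample_feedback_gain(w_mean, w_cov, Q, R, N, M, rng):
    """Sample w_hat from its Gaussian posterior and return the matching gain L,
    so that the control drawn by the Bayesian control rule is u = -L x."""
    w_sample = rng.multivariate_normal(w_mean, w_cov)  # one posterior sample of w_hat
    FG = w_sample.reshape(N, N + M)                    # same packing as in the likelihood model
    F_hat, G_hat = FG[:, :N], FG[:, N:]
    V = solve_discrete_are(F_hat, G_hat, Q, R)         # Riccati solution for this sample
    return np.linalg.solve(R + G_hat.T @ V @ G_hat, G_hat.T @ V @ F_hat)
```

As the posterior covariance over $\hat{w}$ shrinks, the sampled gains concentrate around the optimal value, which is the behavior visible in figure 1b.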
Example. In many motor control studies the hand is modeled as a point mass, where the state vector $x_t$ comprises position and velocity in the plane (Todorov and Jordan, 2002). Discretized in time, this yields the following equation:
$$x_{t+1} = \begin{pmatrix} 1 & \Delta t & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & \Delta t \\ 0 & 0 & 0 & 1 \end{pmatrix} x_t + \begin{pmatrix} 0 & 0 \\ \Delta t/m & 0 \\ 0 & 0 \\ 0 & \Delta t/m \end{pmatrix} u_t + \xi_t,$$
where we chose $\xi_t$ to be distributed according to
$$\xi_t \sim \mathcal{N}\!\left(0, \; \Delta t \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 1/4 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/4 \end{pmatrix}\right).$$
The noise $\xi_t$ models uncertainty in the force production when controlling the mass point. In our simulation of a reaching task with unknown system dynamics the controller had to learn to bring the mass point from the periphery to the center of the coordinate system, trying to find the optimal feedback gains. This requires estimating a 24-dimensional parameter vector $w$ and sampling a $2 \times 8$-dimensional feedback gain. We chose the following parameter settings: $\Delta t = 0.01$, $m = 1$, $R = [[0.001, 0]; [0, 0.001]]$, $Q = [[1,0,0,0]; [0,0.01,0,0]; [0,0,1,0]; [0,0,0,0.01]]$ and $\alpha = 0.05$ for the UKF. The results can be seen in
figure 1. The first entry of the parameter vector $\hat{w}$ is depicted in figure 1a, the first entry of the correspondingly sampled $L$ is depicted in figure 1b. After an initial exploration phase in which $L$ is sampled from a broad distribution, the controller settles down and only samples from a very narrow distribution centered at the optimal value. Figure 1c,d shows initial and final trajectories and speed profiles: the initially amorphous movement turns into a straight-line movement with a bell-shaped speed profile. Importantly, the Bayesian controller converges much faster to the correct feedback gain than the certainty-equivalent controller, which never fully reached the optimal value in our simulation (compare figure 1e,f). In the following table the mean absolute feedback gain error, i.e. the difference between the optimal feedback gain and the actually executed feedback gain, is shown averaged over the last 3000 time steps of 100 runs. We have also averaged over all $2 \times 8$ feedback gains.

                                    Abs. Error
Certainty Equivalent Controller     8.26 ± 0.01
Bayesian Control Rule               2.085 ± 0.002
The results show that the Bayesian control rule incurs approximately 4 times less error on average than the certainty-equivalent controller in this example. To ensure that this result does not depend on the particular system, we ran the same simulation but with all entries of the true $F$ and $G$ drawn randomly from a uniform distribution on $[0;1]$ in each run, with 100 runs in total. However, these random draws were "frozen" such that both controllers faced the same random variables and differences cannot be attributed to different random draws. Each run of this simulation had 500 time steps and we compared the feedback gain error in the last 100 time steps.

                                    Abs. Error
Certainty Equivalent Controller     0.536 ± 0.002
Bayesian Control Rule               0.111 ± 0.001
On average the Bayesian control rule incurred ap-
proximately 5 times less error than the certainty-
equivalent controller.
3.2 Direct Adaptive Bayesian Control
The adaptive linear quadratic control problem can be reformulated in a way that does not require estimating the system matrices $F$ and $G$ explicitly (Bradtke, 1993). Instead we can work directly on the policy space and assign a Q-value to each policy such that the Q-value of policy $L$ is given by
$$Q^L(x_t, u_t) = c_t(x_t, u_t) + \big(F x_t + G u_t\big)^T V^L \big(F x_t + G u_t\big),$$
where $V^L$ corresponds to the cost-to-go function. Thus, $Q^L(x_t, u_t)$ can be expressed as a quadratic form
$$\begin{pmatrix} x_t \\ u_t \end{pmatrix}^T \underbrace{\begin{pmatrix} Q + F^T V^L F & F^T V^L G \\ G^T V^L F & R + G^T V^L G \end{pmatrix}}_{= \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix}} \begin{pmatrix} x_t \\ u_t \end{pmatrix}.$$
The matrix $M \in \mathbb{R}^{(N+M) \times (N+M)}$ is positive definite and represents the Q-value of policy $L$. The relationship between $M$ and $L$ is given by
$$L = M_{22}^{-1} M_{21}, \qquad (4)$$
as can be readily seen when computing $\partial Q^L(x_t, u_t)/\partial u_t = 0$. Previous studies have applied Q-learning to solve this direct adaptive control problem by reinforcement learning methods (Bradtke, 1993). Here we want to transform it into an inference problem. To this end, we need to relate $M$ to an observable quantity in a way that is independent of the policy that is currently executed by the controller. We can achieve this by noting that Bellman's optimality equation imposes a recurrent relationship between consecutive Q-values, namely
$$Q^L(x_t, u_t) = c_t(x_t, u_t) + Q^L(x_{t+1}, -L x_{t+1}). \qquad (5)$$
Since $c_t$ is an observable quantity we can take it to one side of the equation and put all Q-quantities of equation (5) on the other side. Only the "true" Q-function can predict all $c_t$ for all data points $\{x_t, u_t, x_{t+1}\}$. Thus, we can use this relationship to do inference over $M$, where
$$\hat{c}_t = \begin{pmatrix} x_t \\ u_t \end{pmatrix}^T M \begin{pmatrix} x_t \\ u_t \end{pmatrix} - \begin{pmatrix} x_{t+1} \\ -M_{22}^{-1} M_{21} x_{t+1} \end{pmatrix}^T M \begin{pmatrix} x_{t+1} \\ -M_{22}^{-1} M_{21} x_{t+1} \end{pmatrix}.$$
Assuming Gaussian noise with known variance $\sigma^2$ for the cost observations, we obtain the following likelihood model for our Bayesian controller:
$$P(c_t|M, x_t, x_{t+1}, u_t) \propto \exp\left(-\frac{1}{2\sigma^2}\big(\hat{c}_t - c_t\big)^2\right).$$
The intervention model is again deterministic:
$$P(u_{t+1}|M, x_{t+1}, x_t, u_t) \propto \delta\big(u_{t+1} + M_{22}^{-1} M_{21} x_{t+1}\big).$$
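The following sketch evaluates the predicted cost $\hat{c}_t$ and the resulting Gaussian log-likelihood for a candidate matrix $M$; the helper names and the explicit block indexing are illustrative assumptions rather than the exact implementation used in our simulations.

```python
import numpy as np

def predicted_cost(M_mat, x, u, x_next, n):
    """Predicted cost c_hat_t implied by a candidate Q-value matrix M;
    n is the state dimension, so M_mat has shape (n + m, n + m)."""
    M22 = M_mat[n:, n:]
    M21 = M_mat[n:, :n]
    u_next = -np.linalg.solve(M22, M21 @ x_next)   # greedy action -M22^{-1} M21 x_{t+1}
    z = np.concatenate([x, u])
    z_next = np.concatenate([x_next, u_next])
    return z @ M_mat @ z - z_next @ M_mat @ z_next

def cost_log_likelihood(M_mat, x, u, x_next, c_obs, n, sigma):
    """Gaussian log-likelihood of an observed cost c_obs under M."""
    err = predicted_cost(M_mat, x, u, x_next, n) - c_obs
    return -0.5 * (err / sigma) ** 2
```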
Doing inference over $M$ is complicated by three facts: (i) the likelihood model is highly nonlinear in the parameters, (ii) $M$ must be constrained to the set of positive definite matrices, and (iii) $M$ will be ill-conditioned in many examples because the different parts of the matrix usually differ by various orders of magnitude, for example when the unknown cost matrices $Q$ and $R$ are of different orders of magnitude. Here we can only address problems (i) and (ii), i.e. the examples to demonstrate the Bayesian controller have to be well-conditioned, which is, for instance, not true for the previous simulation. With regard to (i) we found that for this inference process the UKF only works robustly when the propagated means are simply computed as an unweighted average over sigma points instead of the more common weighted average. With regard to (ii) we note that any positive definite matrix can be expressed as a product of its unique Cholesky factors, $M = m^T m$, where $m$ is upper triangular with strictly positive diagonal elements. Then we can do inference over $m$ with the simpler constraint that the diagonal elements must be positive.
Figure 1: Results. (a-d) Learning to move a mass point when the system dynamics matrices F and G are unknown. A single
run of the control process is shown. (a) Temporal evolution of the estimate of the first entry of $\hat{F}$ and the respective uncertainty as represented by the Kalman filter. (b) Sampled feedback gain; only the first entry of L is shown. The initial exploration
phase is followed by a stable performance after 400 time steps. The thin red line indicates the optimal feedback gain. (c,d)
Trajectories and speed profiles. Initially, the trajectory takes a random direction with an amorphous speed profile (blue curves).
Later movement trajectories are straight and speed profiles bell-shaped (black curves). Panels (e-f) show sampled feedback
gains over 100 runs. (e) Mean executed feedback gain. The certainty-equivalent controller (CEC) slowly converges to the
region of optimal feedback gains. The exact optimal value was not reached in this simulation. The Bayesian control rule
(BCR) converges very fast to the optimal feedback gain. (f) Variance of executed feedback gain. The Bayesian controller
that used sampled feedback gains converges much faster than the certainty-equivalent controller. (g,h) Learning to move a
mass-less point when both the system dynamics matrices F and G and the cost matrices Q and R are unknown. (g) Bayesian
Control Rule. Trajectories of the first and last 5 trials. Initially, movements are undirected but later converge to straight line
movements. The pertinent cost converges to the optimum. (h) Policy Iteration. Trajectories of the first and last 5 trials. The
trajectories are wiggly because noise has to be added to the controller for exploration. Due to this extra noise the controller
cannot converge to the optimal cost.
In our simulation we implemented this constraint by simply discarding any Kalman filter updates that would violate it. In general, such constraints can be easily implemented using particle filters.
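A minimal sketch of this parameterization is given below, assuming the free parameters are the entries of the upper-triangular factor $m$; the rejection of infeasible updates mimics the scheme described above, but the function names are hypothetical.

```python
import numpy as np

def unpack_cholesky(theta, d):
    """Reconstruct M = m^T m from the flat vector theta containing the
    d*(d+1)/2 upper-triangular entries of the Cholesky factor m."""
    m = np.zeros((d, d))
    m[np.triu_indices(d)] = theta
    if np.any(np.diag(m) <= 0.0):
        raise ValueError("diagonal of the Cholesky factor must be positive")
    return m.T @ m   # positive definite by construction

def accept_update(theta_new, d):
    """Keep a Kalman filter update only if it yields a feasible factor."""
    try:
        unpack_cholesky(theta_new, d)
        return True
    except ValueError:
        return False
```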
Example. A simple well-conditioned example is a mass-less particle that moves around in the plane. The system dynamics can be formalized as
$$x_{t+1} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} x_t + \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} u_t.$$
The controller receives noisy observations of the cost,
$$c_t = x_t^T Q x_t + u_t^T R u_t + \xi_t,$$
where $\xi_t$ is a normally distributed scalar variable with variance $\sigma_{\mathrm{obs}} = 0.1$. Both $Q$ and $R$ were assumed to be identity matrices and $\alpha = 0.5$ as previously. This
is a 10-dimensional estimation problem. Figure 1g shows that the Bayesian controller managed to find the optimal control solution relying only on inference and sampling. We compared against a policy iteration algorithm for linear quadratic controllers as proposed in (Bradtke, 1993); compare Figure 1h. In the latter, exploration can only be achieved by adding extra noise to the control command. Note that the Bayesian control rule incurs this noise automatically by sampling from the posterior. We simulated 100 trials with 50 time steps each.
To ensure again that this result does not depend on the particular system, we ran another simulation where each entry of $F$ and $G$ was drawn from the uniform distribution on $[0;1]$ and $Q$ and $R$ were drawn from an inverse Wishart distribution with identity scale matrix and 2 degrees of freedom. The noise was again "frozen" for comparison between the two algorithms. We compared the absolute error between the optimal and the actually executed feedback gain over the last 20 trials. The Bayesian control rule outperformed the policy iteration algorithm roughly by a factor of 5.

                                    Abs. Error
Policy Iteration (Bradtke, 1993)    2.5 ± 0.1
Bayesian Control Rule               0.55 ± 0.01
4 CONCLUSIONS
In this paper we suggest a minimum relative entropy
formulation of adaptive control problems when the
plant dynamics are unknown but known to belong to
a pre-defined set of possible dynamics. This formu-
lation has an explicit solution given by the Bayesian
control rule, a stochastic rule for adaptive control.
We have presented two example classes that show
how adaptive linear quadratic control problems can
be tackled using this problem formulation. Usually,
adaptive controllers rely on the certainty-equivalence principle and ignore parameter uncertainty in the control process (Åström and Wittenmark, 1995). In con-
trast, a controller based on the Bayesian control rule
considers this uncertainty for balancing exploration
and exploitation in a way that minimizes the expected
relative entropy with regard to the true control law.
In particular, indirect control methods provide an
interesting perspective here, because they allow solv-
ing the adaptive control problem purely based on in-
ference and sampling methods that can be recruited
from a rich arsenal in machine learning. Both infer-
ence and action sampling work forward in time and
are therefore applicable online. Also, they do not require separate phases of policy evaluation and policy improvement, as some previous reinforcement learning methods do. Inference can be done online independently of the sampled policy. Several other stud-
ies have previously proposed to solve adaptive con-
trol problems based on inference methods (Toussaint
et al., 2006; Engel et al., 2005; Haruno et al., 2001).
Crucially, however, these studies have concentrated
on the observation part of the learning problem with
no principled solution for the action selection prob-
lem. Usually, exploration noise has to be introduced
in an ad hoc fashion in order to avoid suboptimal per-
formance. In contrast, the minimum relative entropy
cost function naturally leads to stochastic policies.
The main contribution of this study is to illustrate
how a relative entropy formulation can be applied to
solve an adaptive control problem. This is done by
deriving a stochastic controller based on the Bayesian
control rule for the LQR problem with unknown sys-
tem and cost matrices. Similar minimum relative en-
tropy formulations have recently also been proposed
to solve optimal control problems with known system
dynamics (Todorov, 2009; Kappen et al., 2009). How
these two approaches for adaptive and optimal control
relate is an interesting question for future research.
Also, the Bayesian control rule suggested here could
in principle be employed to solve more general adap-
tive control problems with possibly nonlinear dynam-
ics. However, finding optimal tailored controllers for
complex sub-environments can in general be highly
non-trivial. Therefore, finding inference and sam-
pling methods that work for more general classes of
adaptive control problems poses a future challenge.
REFERENCES
Åström, K. and Wittenmark, B. (1995). Adaptive Control. Prentice Hall, 2nd edition.
Bradtke, S. (1993). Reinforcement learning applied to lin-
ear quadratic control. Advances in Neural Information
Processing Systems 5.
Campi, M. and Kumar, P. (1996). Optimal adaptive control of an LQG system. Proc. 35th Conf. on Decision and Control, pages 349–353.
Engel, Y., Mannor, S., and Meir, R. (2005). Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, pages 201–208.
Haruno, M., Wolpert, D., and Kawato, M. (2001). MOSAIC model for sensorimotor learning and control. Neural Computation, 13:2201–2220.
Haykin, S. (2001). Kalman filtering and neural networks.
John Wiley and Sons.
Julier, S. J., Uhlmann, J. K., and Durrant-Whyte, H. F. (1995). A new approach for filtering nonlinear systems. Proc. Am. Control Conference, pages 1628–1632.
Kappen, B., Gomez, V., and Opper, M. (2009). Opti-
mal control as a graphical model inference problem.
arXiv:0901.0633.
Ortega, P. and Braun, D. (2010). A bayesian rule for adap-
tive control based on causal interventions. In Proceed-
ings of the third conference on artificial general intel-
ligence, pages 121–126. Atlantis Press.
Pearl, J. (2000). Causality: Models, Reasoning, and Infer-
ence. Cambridge University Press, Cambridge, UK.
Stengel, R. (1993). Optimal control and estimation. Dover
Publications.
Todorov, E. (2009). Efficient computation of optimal ac-
tions. Proceedings of the National Academy of Sci-
ences U.S.A., 106:11478–11483.
Todorov, E. and Jordan, M. (2002). Optimal feedback con-
trol as a theory of motor coordination. Nat. Neurosci.,
5:1226–1235.
Toussaint, M., Harmeling, S., and Storkey, A. (2006). Probabilistic inference for solving (PO)MDPs. Technical
report, EDI-INF-RR-0934, University of Edinburgh,
School of Informatics.
Wittenmark, B. (1975). Stochastic adaptive control meth-
ods: a survey. International Journal of Control,
21:705–730.