and second, it requires significant skill and insight
in the design of the controller. As both the derivation
and calibration of the system's model and the manual
design of the controller demand a large amount of
highly qualified labor, this approach to controller
design is usually costly and difficult.
In contrast, recent advances in the field of deep
reinforcement learning (DRL) have demonstrated the
remarkable ability of general-purpose DRL algo-
rithms to solve difficult sequential decision and con-
trol problems without knowledge of the sys-
tem’s dynamics in analytical form (Lillicrap et al.,
2015). DRL algorithms usually interact directly with
the system and compute an optimal policy by means
of trial and error. A downside of such algorithms is
their excruciatingly long training times, often measured
in millions of trials (control steps) in the target
environment. For real physical systems, such long
training times on the real hardware are usually com-
pletely infeasible, so training is performed either on a
simulation model of the system, created in a simulator
such as a physics engine, or on a learned parametric
model of the system’s dynamics obtained from a lim-
ited number of interactions with the physical system.
This general approach is known as model-based RL
(MBRL) (Polydoros and Nalpantidis, 2017). Despite
multiple recent successes, the long training times of
DRL algorithms, even in simulation, and the need to
carefully tune the learning parameters, are still an
impediment to their wider application.
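To make the second option concrete, the following minimal Python sketch (illustrative only; a simple linear model class is used as a stand-in, and all names are hypothetical) fits a parametric one-step dynamics model to a limited set of recorded transitions, after which the learned predictor can replace the real system during training:

    import numpy as np

    def fit_linear_dynamics(states, controls, next_states):
        """Fit a simple parametric model x' ~ A x + B u + c from a limited set
        of recorded transitions (arrays of shape (N, n), (N, m), (N, n)),
        as a stand-in for the real system during policy training."""
        X = np.hstack([states, controls, np.ones((len(states), 1))])  # [x, u, 1]
        # Least-squares fit of next_states = X @ theta
        theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)
        n, m = states.shape[1], controls.shape[1]
        A, B, c = theta[:n].T, theta[n:n + m].T, theta[-1]
        return lambda x, u: A @ x + B @ u + c   # learned one-step predictor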
Recognizing the difficulty of obtaining a general
control policy that is valid for every state, another
class of methods aims to find solutions only for a spe-
cific starting state, after it has become known. This
class of methods, generally known as trajectory opti-
mization and stabilization algorithms, effectively au-
tomate the path planning and tracking approach de-
scribed above. Examples of this approach include the
methods of differential dynamic programming (DDP)
(Jacobson and Mayne, 1970), iterative LQR (iLQR)
(Li and Todorov, 2004), as well as direct transcription
and collocation methods for trajectory optimization
(Tedrake, 2023). These methods can be very effective,
as the decision problem they solve is much simpler than
computing an entire global control policy: instead of a
function that maps any state in the multidimensional
state space of the system to a control, they compute a
function that maps time, a one-dimensional variable, to
the control values to be applied at that time. However,
a significant disadvantage of such methods is
that trajectory computation must either be done off-
line, introducing a delay before control can start, or
on-line, in a model-predictive control (MPC) fashion
(Tassa et al., 2012). This often necessitates the use
of powerful and expensive micro-controllers and/or
limiting the prediction horizon, which could lead to
failure to reach the goal state for some systems.
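For illustration, the following Python sketch shows the general form of the local solutions produced by methods such as iLQR and DDP: a nominal state and control trajectory together with a schedule of time-varying feedback gains, indexed by time rather than by the full state (all names are illustrative, and the dynamics step is assumed to be supplied by the caller):

    import numpy as np

    def track_nominal_trajectory(x0, x_bar, u_bar, K, step_dynamics):
        """Roll out a time-indexed local controller of the form produced by
        iLQR/DDP: at step t, the nominal control u_bar[t] is corrected by a
        time-varying feedback term K[t] @ (x - x_bar[t])."""
        x = np.asarray(x0, dtype=float)
        states, controls = [x.copy()], []
        for t in range(len(u_bar)):
            # The controller is indexed by time t, not by the full state.
            u = u_bar[t] + K[t] @ (x - x_bar[t])
            x = np.asarray(step_dynamics(x, u))  # caller-supplied dynamics step
            states.append(x.copy())
            controls.append(u)
        return np.array(states), np.array(controls)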
One promising approach to avoiding the need
for either long off-line computation or intense on-
line computation associated with trajectory-based lo-
cal control is to combine multiple pre-computed lo-
cal trajectory-centric controllers into a single global
controller by means of a suitable machine learning
method. The highly influential Guided Policy Search
(GPS) method trains a deep neural network (DNN) to
emulate the operation of multiple pre-computed con-
trollers by repeatedly sampling the output of these
controllers and gradually adjusting the global policy
encoded by the DNN, thus creating a global con-
troller that can be executed relatively fast at run-time
(Levine et al., 2016). One disadvantage associated
with this method is that policy learning progresses
relatively slowly, as each modification to the control
policy is limited in magnitude in order to avoid di-
vergence of the learning process. Furthermore, the
training method uses only the trajectory computed by
the trajectory optimization solver, but not the entire
controller implied by its solution.
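As a rough illustration of this distillation principle only (not of the full GPS algorithm, which in addition bounds every policy change and alternates policy updates with re-optimization of the local controllers), the following Python sketch regresses a single network onto controls sampled from a set of hypothetical local controllers, assuming PyTorch is available and states and controls are vectors:

    import numpy as np
    import torch
    import torch.nn as nn

    def distill_global_policy(local_controllers, sample_states, epochs=200):
        """Simplified sketch of the distillation principle: regress a single
        DNN policy onto controls sampled from pre-computed local controllers."""
        # Build a supervised dataset: each local controller labels its own states.
        X, U = [], []
        for controller, states in zip(local_controllers, sample_states):
            for x in states:
                X.append(np.asarray(x, dtype=np.float32))
                U.append(np.asarray(controller(x), dtype=np.float32))
        X = torch.from_numpy(np.stack(X))
        U = torch.from_numpy(np.stack(U))

        policy = nn.Sequential(nn.Linear(X.shape[1], 64), nn.Tanh(),
                               nn.Linear(64, U.shape[1]))
        optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(policy(X), U)  # imitate the local controllers
            loss.backward()
            optimizer.step()
        return policy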
We propose a method that operates on the same
general principle – to combine multiple trajectory-
based local controllers from multiple initial states into
a single global control policy – but using a differ-
ent machine learning method for the combination and
also using more components of the computed local
solutions than just the computed trajectories. As the
chosen machine learning method belongs to the class
of memory-based learning (MBL) methods, its training
is essentially instantaneous: it consists only of storing
the pre-computed local solutions in memory. The actual
predictive model is built at run-time, when a control for
a particular state needs to be computed.
Although computation is shifted to run-time, exper-
imental results indicate that the computation time is
in fact shorter than the time needed to perform a sin-
gle forward pass through a DNN that encodes a pol-
icy computed by a DRL algorithm. Moreover, be-
cause the entire local feedback controllers are used in
the computation, including their gain schedules and
costs-to-go to the goal state, relatively few solutions
in memory are needed, resulting in savings in mem-
ory and computational time.
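As a minimal sketch of this idea (the concrete variants and selection rules actually used are described in Section 2; the class, the scoring rule, and all names below are purely illustrative), a memory of local solutions can store, for each pre-computed trajectory point, the nominal state and control, the feedback gain, and the cost-to-go, and evaluate the selected local feedback law at run-time:

    import numpy as np

    class LocalSolutionMemory:
        """Stores pre-computed local solutions: for each trajectory point, the
        nominal state x_bar, nominal control u_bar, feedback gain K, and
        cost-to-go V. 'Training' is just appending entries to the memory."""
        def __init__(self):
            self.entries = []

        def add(self, x_bar, u_bar, K, V):
            self.entries.append((np.asarray(x_bar), np.asarray(u_bar),
                                 np.asarray(K), float(V)))

        def control(self, x, distance_weight=1.0):
            """Illustrative query rule: pick the stored point minimizing a
            combination of distance to the query state and cost-to-go, then
            apply its local linear feedback law."""
            x = np.asarray(x)
            def score(entry):
                x_bar, _, _, V = entry
                return distance_weight * np.linalg.norm(x - x_bar) + V
            x_bar, u_bar, K, _ = min(self.entries, key=score)
            return u_bar + K @ (x - x_bar)

In such a scheme the memory is populated once, off-line, from the solutions returned by a trajectory optimizer; at run-time only the query over the stored entries and a single matrix-vector product are required.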
Several variants of the proposed method are described
in Section 2. Empirical verification on a number of test
problems is described in Section 3. Section 4 proposes
directions for further improvement of the algorithm and
concludes the paper.