5 CONCLUSIONS
An on-line direct policy search algorithm for AUV
control, based on Baxter and Bartlett's direct-gradient
algorithm OLPOMDP, has been proposed. The method
has been applied to a real learning control system in
which a simulated model of the AUV URIS navigates a
two-dimensional world in a target-following task. The
policy is represented by a neural network whose input is
a representation of the state, whose output is a set of
action selection probabilities, and whose weights are the
policy parameters. The objective of the agent is to learn
a stochastic policy, which assigns a probability to each
of the four possible control actions.
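As a concrete illustration of this setup, the following
minimal Python sketch implements the OLPOMDP update of
Baxter and Bartlett for a softmax policy over the four
control actions. A single linear layer stands in for the
neural network used in this work, and the feature size,
step size (ALPHA) and eligibility-trace discount (BETA)
are illustrative assumptions rather than values taken
from the experiments.

import numpy as np

# Minimal OLPOMDP sketch with a linear softmax policy. The feature
# size, ALPHA and BETA below are assumptions for illustration only.
N_FEATURES = 8   # size of the state representation (assumed)
N_ACTIONS = 4    # the four possible control actions
ALPHA = 0.001    # gradient step size (assumed)
BETA = 0.95      # eligibility-trace discount in [0, 1) (assumed)

rng = np.random.default_rng(0)
theta = np.zeros((N_FEATURES, N_ACTIONS))  # policy parameters
z = np.zeros_like(theta)                   # eligibility trace

def action_probabilities(phi):
    # Softmax over the linear scores phi @ theta.
    scores = phi @ theta
    scores -= scores.max()   # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def sample_action(phi):
    # Draw an action from the stochastic policy and accumulate the
    # log-likelihood gradient in the eligibility trace.
    global z
    p = action_probabilities(phi)
    a = rng.choice(N_ACTIONS, p=p)
    grad = -np.outer(phi, p)   # d log pi(a|phi) / d theta
    grad[:, a] += phi
    z = BETA * z + grad
    return a

def update(reward):
    # Step the parameters along the trace, scaled by the reward
    # observed after the last action.
    global theta
    theta += ALPHA * reward * z

At each control step the agent calls sample_action with the
current state features, applies the chosen action, observes
the resulting reward, and calls update; with a neural-network
policy, the same scheme applies with the gradient taken
through the network by backpropagation.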
The results obtained confirm some of the ideas
presented in Section 1. The algorithm is easier to
implement than other RL methodologies, such as
value-function algorithms, and it considerably reduces
computation time. On the other hand, the simulation
results show slow convergence towards the minimal
solution.
In order to validate the performance of the
proposed method, future experiments are centered
on obtaining empirical results: the algorithm must be
tested on the real URIS in a real environment. Previous
investigations carried out in our laboratory with RL
value-function methods on the same URIS prototype
(Carreras et al., 2003) will allow us to
compare both sets of results. At the same time, the work
is focused on developing a methodology to decrease the
convergence time of the RLDPS algorithm.
ACKNOWLEDGMENTS
This research was sponsored by the Spanish
commission MCYT (DPI2001-2311-C03-01). I
would like to give my special thanks to Mr. Douglas
Alexander Aberdeen of the Australian National
University for his help.
REFERENCES
R. Sutton and A. Barto, Reinforcement Learning: An
Introduction, MIT Press, 1998.
W.D. Smart and L.P. Kaelbling, “Practical reinforcement
learning in continuous spaces”, in Proceedings of the
International Conference on Machine Learning, 2000.
N. Hernandez and S. Mahadevan, “Hierarchical memory-
based reinforcement learning”, Fifteenth International
Conference on Neural Information Processing
Systems, Denver, USA, 2000.
D.P. Bertsekas and J.N. Tsitsiklis, Neuro-Dynamic
Programming. Athena Scientific, 1996.
R. Sutton, D. McAllester, S. Singh and Y. Mansour,
“Policy gradient methods for reinforcement learning
with function approximation” in Advances in Neural
Information Processing Systems 12, pp. 1057-1063,
MIT Press, 2000.
C. Anderson, “Approximating a policy can be easier than
approximating a value function”, Computer Science
Technical Report CS-00-101, February 10, 2000.
R. Williams, “Simple statistical gradient-following
algorithms for connectionist reinforcement learning”,
Machine Learning, vol. 8, pp. 229-256, 1992.
J. Baxter and P.L. Bartlett, “Direct gradient-based
reinforcement learning”, in Proceedings of the IEEE
International Symposium on Circuits and Systems,
May 28-31, Geneva, Switzerland, 2000.
V.R. Konda and J.N. Tsitsiklis, “On actor-critic
algorithms”, SIAM Journal on Control and
Optimization, vol. 42, no. 4, pp. 1143-1166, 2003.
S.P. Singh, T. Jaakkola and M.I. Jordan, “Learning
without state-estimation in partially observable
Markovian decision processes”, in Proceedings of the
11th International Conference on Machine Learning,
pp. 284-292, 1994.
N. Meuleau, L. Peshkin and K. Kim, “Exploration in
gradient-based reinforcement learning”, Technical
Report AI Memo 2001-003, April 3, 2001.
J. Baxter and P.L. Bartlett, “Direct gradient-based
reinforcement learning I: Gradient estimation algorithms”,
Technical Report, Australian National University, 1999.
P. Ridao, A. Tiano, A. El-Fakdi, M. Carreras and A.
Zirilli, “On the identification of non-linear models of
unmanned underwater vehicles”, Control
Engineering Practice, vol. 12, pp. 1483-1499, 2004.
D.A. Aberdeen, Policy Gradient Algorithms for Partially
Observable Markov Decision Processes, PhD Thesis,
Australian National University, 2003.
S. Haykin, Neural Networks: A Comprehensive Foundation,
Prentice Hall, Upper Saddle River, New Jersey, USA,
1999.
T.I. Fossen, Guidance and Control of Ocean Vehicles,
John Wiley and Sons, New York, USA, 1994.
J. A. Bagnell and J. G. Schneider, “Autonomous
Helicopter Control using Reinforcement Learning
Policy Search Methods”, in Proceedings of the IEEE
International Conference on Robotics and Automation
(ICRA), Seoul, Korea, 2001.
M. T. Rosenstein and A. G. Barto, “Robot Weightlifting
by Direct Policy Search”, in Proceedings of the
International Joint Conference on Artificial
Intelligence, 2001.
N. Kohl and P. Stone, “Policy Gradient Reinforcement
Learning for Fast Quadrupedal Locomotion”, in
Proceedings of the IEEE International Conference on
Robotics and Automation (ICRA), 2004.
M. Carreras, P. Ridao and A. El-Fakdi, “Semi-Online
Neural-Q-Learning for Real-Time Robot Learning”, in
Proceedings of the IEEE/RSJ International
Conference on Intelligent Robots and Systems (IROS),
Las Vegas, USA, 2003.