to face this non-stationarity by reducing the tangential speed in all the states, turning slightly right, and then gradually raising the tangential speed again. This behavior shows how this learning approach can potentially increase the robustness to mechanical faults, thus increasing the autonomy of the robot. Further studies will focus on a detailed analysis of this aspect.
5 DISCUSSION AND FUTURE WORK
The application of RL techniques to robotics could lead to the autonomous learning of optimal solutions for many different tasks. Unfortunately, most traditional RL algorithms fail when applied to real-world problems because of the time required to find the optimal solution, the noise that affects both sensors and actuators, and the difficulty of managing continuous state spaces.
In this paper, we have described and experimentally tested a novel algorithm (PWC-Q_LB-learning) designed to overcome the main issues that arise when learning is applied to real-world robotic tasks. PWC-Q_LB-learning computes a lower bound on the action-value function while following a piecewise constant policy. Unlike other min-max-based algorithms, PWC-Q_LB-learning does not require a model of the dynamics of the environment and avoids long, blind exploration phases. Furthermore, it does not learn the optimal policy for the theoretical worst case, but estimates the lower bound under the conditions actually experienced by the robot, according to its current policy and the current dynamics of the environment. Finally, the piecewise constant action selection and update guarantee a stable learning process in continuous state spaces, even when the discretization is such that the Markov property is lost.
Although preliminary, the experiments showed that PWC-Q_LB-learning succeeds in learning a nearly optimal policy by optimizing the behavior of a suboptimal controller in noisy continuous environments. Furthermore, it proved to be more stable than Q-learning, even when a coarse discretization of the state space is used. We are currently investigating the theoretical properties of the proposed algorithm and testing its performance on more complex robotic tasks, such as the “align to goal” and “kick” tasks.