different grid-worlds demonstrated that the overestimation problem always occurred when higher learning rates were used, but not with lower or decaying learning rates.
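For context, the tabular Q-learning update whose bias is studied here is

    Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr],

where the max over noisy value estimates is the source of the overestimation, and a larger step size \alpha lets this bias enter the estimates more strongly at each update.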
We also analyzed the performance of Double Q-learning with multilayer perceptrons under the same conditions. In general, this algorithm suffers from more underestimation when higher learning rates are used and, surprisingly, it can also suffer from overestimation bias when very low learning rates are used.
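For readers less familiar with the algorithm, the sketch below shows one tabular Double Q-learning step: one of two value tables selects the greedy next action while the other evaluates it. This is only a minimal illustration with hypothetical names (double_q_update, Q_a, Q_b); the experiments in this paper use multilayer perceptrons rather than tables, and terminal-state handling is omitted.

    import random
    from collections import defaultdict

    def double_q_update(Q_a, Q_b, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
        """One tabular Double Q-learning step (illustrative, not the paper's code)."""
        # With probability 0.5 update Q_a using Q_b for evaluation, otherwise the reverse.
        update, evaluate = (Q_a, Q_b) if random.random() < 0.5 else (Q_b, Q_a)
        # The table being updated selects the greedy next action ...
        a_star = max(actions, key=lambda a_next: update[(s_next, a_next)])
        # ... while the other table estimates its value, which counteracts the
        # overestimation caused by maximizing over noisy estimates.
        target = r + gamma * evaluate[(s_next, a_star)]
        update[(s, a)] += alpha * (target - update[(s, a)])

    # Example: tables keyed by (state, action) pairs, defaulting to 0.0.
    Q_a, Q_b = defaultdict(float), defaultdict(float)
    double_q_update(Q_a, Q_b, s=0, a=1, r=1.0, s_next=2, actions=[0, 1])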
We also examined the performance of both algorithms on two OpenAI Gym control problems. The results obtained on all six environments suggest that, in general, the best performance is achieved with decaying learning rates.
Our future work includes studying the connection between the learning rate and the overestimation bias in more complex environments. We also plan to investigate different methods for annealing the learning rate over time.
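Two common annealing schemes, shown below purely as illustrations (the function names and default constants are hypothetical and not prescribed by this work), are a polynomial decay based on per-state-action visit counts and a bounded exponential decay of a global learning rate.

    def polynomial_alpha(visit_count, omega=0.8):
        """Per-(state, action) step size 1 / n(s, a)**omega; visit_count must be >= 1."""
        return 1.0 / (visit_count ** omega)

    def exponential_alpha(step, alpha_0=0.5, decay=1e-4, alpha_min=0.01):
        """Global step size decayed per time step, with a lower bound."""
        return max(alpha_min, alpha_0 * (1.0 - decay) ** step)

    # Hypothetical usage inside a tabular Q-learning loop:
    #   alpha = polynomial_alpha(visits[(s, a)])
    #   td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    #   Q[(s, a)] += alpha * (td_target - Q[(s, a)])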
ACKNOWLEDGEMENTS
We would like to thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high-performance computing cluster. Yifei Chen acknowledges the China Scholarship Council (Grant number: 201806320353) for financial support.