ACKNOWLEDGEMENT
We would like to thank Moritz Lange for his valuable
contribution to the preparation of this paper.