changing distribution in the observed environment.
Since GIRM is an extension module, it can be applied to any reinforcement learning algorithm. In this work, we used Advantage Actor-Critic (A2C) as the baseline algorithm and compared agents trained with and without GIRM.
We demonstrate GIRM's exploration capability in both a no-reward and a sparse-reward setting. For the no-reward setting, we used Super Mario Bros. (Kauten, 2018) and show that GIRM supplies the intrinsic reward the agent needs to explore and complete the level, outperforming the agent trained without GIRM. The agent does so by learning the pattern of moving right in order to find unexplored states. For the sparse-reward setting, we evaluated our agent on Montezuma's Revenge, an Atari game that has recently been used as a reinforcement learning benchmark due to its difficulty. While the agent trained without GIRM is unable to escape the initial room, the agent with GIRM explores multiple rooms throughout the environment, achieving a mean score of 3954. On the one hand, this shows that GIRM provides a more efficient exploration strategy; on the other hand, we observe that our agent converges early and begins to exploit the high rewards from the environment.
We also identify another weakness of GIRM through Montezuma's Revenge. Standardizing rewards with the EMA and EMV turns small differences between an observed novel state and its regenerated counterpart into meaningful intrinsic rewards. However, as the agent begins to explore new rooms, the already large differences between regenerated and novel states grow even larger, which also inflates the distribution that the EMA and EMV represent; as a result, GIRM loses the ability to assign meaningful rewards to novel states in frequently visited rooms. We would like to address this problem in future work. A potential solution is to leave out very high or very low intrinsic rewards when updating the EMA and EMV, treating them as anomalies. Furthermore, a more efficient reward scaling method could be investigated.
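
A minimal sketch of this anomaly-filtering idea is given below, assuming a simple scalar implementation in Python; the class and parameter names (e.g., clip_sigma) are our own illustrative choices rather than part of GIRM. Intrinsic rewards are standardized with the running EMA and EMV, but rewards lying far outside the current distribution are excluded from the statistics update.

class IntrinsicRewardNormalizer:
    """Standardizes intrinsic rewards with an exponential moving average (EMA)
    and exponential moving variance (EMV); rewards far from the current
    distribution are treated as anomalies and excluded from the update.
    Illustrative sketch, not GIRM's actual implementation."""

    def __init__(self, alpha=0.01, clip_sigma=3.0):
        self.alpha = alpha            # smoothing factor for the EMA/EMV updates
        self.clip_sigma = clip_sigma  # threshold (in std. devs) beyond which a
                                      # reward is treated as an anomaly
        self.ema = 0.0
        self.emv = 1.0

    def normalize(self, raw_reward):
        std = max(self.emv, 1e-8) ** 0.5
        z = (raw_reward - self.ema) / std
        # Only update the running statistics with "typical" rewards, so a burst
        # of very large reconstruction errors in a newly discovered room does
        # not inflate the EMA/EMV and wash out rewards in familiar rooms.
        if abs(z) <= self.clip_sigma:
            delta = raw_reward - self.ema
            self.ema += self.alpha * delta
            self.emv = (1.0 - self.alpha) * (self.emv + self.alpha * delta ** 2)
        return z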
Another future direction is to use GANs to train a model that learns the dynamics of the environment instead of the distribution of observations. Since the dynamics do not change drastically across the environment, a model that learns them may generalize better throughout the environment. This idea is not a direct improvement to GIRM; rather, it is a different approach to utilizing GANs for the efficient exploration problem in reinforcement learning.
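
To make this direction more concrete, the sketch below shows one possible shape of such a transition-level GAN, assuming PyTorch; the layer sizes, dimensions, and class names are illustrative assumptions, not part of GIRM. The generator predicts the next state from the current state, action, and noise, while the discriminator judges whether a (state, action, next state) tuple comes from the real environment; the discriminator's score on an observed transition could then serve as a novelty signal.

import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, NOISE_DIM = 128, 8, 32  # illustrative sizes

class DynamicsGenerator(nn.Module):
    """Generates a predicted next-state embedding from (state, action, noise)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM + NOISE_DIM, 256), nn.ReLU(),
            nn.Linear(256, STATE_DIM),
        )

    def forward(self, state, action, noise):
        return self.net(torch.cat([state, action, noise], dim=-1))

class TransitionDiscriminator(nn.Module):
    """Scores whether a (state, action, next_state) transition is real or generated."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * STATE_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, state, action, next_state):
        return self.net(torch.cat([state, action, next_state], dim=-1))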
ACKNOWLEDGEMENTS
This work is supported by Istanbul Technical University BAP Grant No. MOA-2019-42321.
REFERENCES
Badia, A. P., Sprechmann, P., Vitvitskyi, A., Guo, Z. D.,
Piot, B., Kapturowski, S., Tieleman, O., Arjovsky,
M., Pritzel, A., Bolt, A., and Blundell, C. (2020).
Never give up: Learning directed exploration strate-
gies. CoRR, abs/2002.06038.
Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T.,
Saxton, D., and Munos, R. (2016). Unifying count-
based exploration and intrinsic motivation. In Pro-
ceedings of the 30th International Conference on Neu-
ral Information Processing Systems, NIPS’16, page
1479–1487, Red Hook, NY, USA. Curran Associates
Inc.
Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2019). Exploration by random network distillation. In 7th International Conference on Learning Representations (ICLR 2019), pages 1–17.
Choshen, L., Fox, L., and Loewenstein, Y. (2018). Dora the
explorer: Directed outreaching reinforcement action-
selection. ArXiv, abs/1804.04012.
Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K. O., and
Clune, J. (2021). First return, then explore. Nature,
590(7847):580–586.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial nets. In Ghahra-
mani, Z., Welling, M., Cortes, C., Lawrence, N., and
Weinberger, K. Q., editors, Advances in Neural Infor-
mation Processing Systems, volume 27. Curran Asso-
ciates, Inc.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. (2017). Improved training of Wasserstein GANs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
Guo, Y., Choi, J., Moczulski, M., Bengio, S., Norouzi,
M., and Lee, H. (2019). Efficient exploration with
self-imitation learning via trajectory-conditioned pol-
icy. ArXiv, abs/1907.10247.
Hong, W., Zhu, M., Liu, M., Zhang, W., Zhou, M., Yu, Y.,
and Sun, P. (2019). Generative adversarial exploration
for reinforcement learning. In Proceedings of the First
International Conference on Distributed Artificial In-
telligence, DAI ’19, New York, NY, USA. Association
for Computing Machinery.
Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck,
F., and Abbeel, P. (2016). Vime: Variational informa-
tion maximizing exploration.