
steps (Küttler et al., 2020; Izumiya and Simo-Serra, 2021). Therefore, considering the cost of implementing and training other algorithms, we did not compare our method against them in this work. In future work, we aim to further improve the efficiency and speed of training, train for 1 billion environment steps, and compare our method with other approaches. Additionally, we plan to include character settings other than “Monk-Human-Neutral-Male” to improve generalizability in NLE.
ACKNOWLEDGEMENTS
This work was supported by JSPS KAKENHI Grant
Numbers JP22K12157, JP23K28377, JP24H00714.
We acknowledge that ChatGPT (GPT-4o and GPT-4o mini) was used for proofreading; the output was further reviewed and revised by the authors.
REFERENCES
Bergdahl, J., Gordillo, C., Tollmar, K., and Gisslén, L. (2020). Augmenting automated game testing with deep reinforcement learning. In 2020 IEEE Conference on Games (CoG), pages 600–603.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. In arXiv preprint arXiv:1606.01540.
Burda, Y., Edwards, H., Storkey, A., and Klimov, O. (2019).
Exploration by random network distillation. In Inter-
national Conference on Learning Representations.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih,
V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dun-
ning, I., Legg, S., and Kavukcuoglu, K. (2018). IM-
PALA: Scalable distributed deep-RL with importance
weighted actor-learner architectures. In Proceed-
ings of the 35th International Conference on Machine
Learning, volume 80, pages 1407–1416.
Ha, D. and Schmidhuber, J. (2018). World models. In arXiv
preprint arXiv:1803.10122.
Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. (2024).
Mastering diverse domains through world models. In
arXiv preprint arXiv:2301.04104.
Hambro, E., Raileanu, R., Rothermel, D., Mella, V., Rocktäschel, T., Küttler, H., and Murray, N. (2022). Dungeons and data: A large-scale NetHack dataset. In Advances in Neural Information Processing Systems, volume 35, pages 24864–24878.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X.,
Botvinick, M., Mohamed, S., and Lerchner, A. (2017).
beta-VAE: Learning basic visual concepts with a con-
strained variational framework. In International Con-
ference on Learning Representations.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. In Neural Comput., volume 9, pages 1735–
1780. MIT Press.
Izumiya, K. and Simo-Serra, E. (2021). Inventory manage-
ment with attention-based meta actions. In 2021 IEEE
Conference on Games (CoG).
Kanagawa, Y. and Kaneko, T. (2019). Rogue-gym: A new
challenge for generalization in reinforcement learn-
ing. In arXiv preprint arXiv:1904.08129.
Kingma, D. P. and Ba, J. (2015). Adam: A method for
stochastic optimization. In International Conference
on Learning Representations.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.
Klissarov, M., D’Oro, P., Sodhani, S., Raileanu, R., Bacon,
P.-L., Vincent, P., Zhang, A., and Henaff, M. (2023).
Motif: Intrinsic motivation from artificial intelligence
feedback. In arXiv preprint arXiv:2310.00166.
Küttler, H., Nardelli, N., Miller, A. H., Raileanu, R., Selvatici, M., Grefenstette, E., and Rocktäschel, T. (2020). The NetHack learning environment. In 34th Conference on Neural Information Processing Systems.
Piterbarg, U., Pinto, L., and Fergus, R. (2023). NetHack is hard to hack. In Thirty-seventh Conference on Neural Information Processing Systems.
Samvelyan, M., Kirk, R., Kurin, V., Parker-Holder, J., Jiang, M., Hambro, E., Petroni, F., Küttler, H., Grefenstette, E., and Rocktäschel, T. (2021). MiniHack the planet: A sandbox for open-ended reinforcement learning research. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
Schaul, T., Quan, J., Antonoglou, I., and Silver, D. (2016).
Prioritized experience replay. In arXiv preprint
arXiv:1511.05952.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. In arXiv preprint arXiv:1707.06347.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 2094–2100. AAAI Press.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. In Nature, volume 518, pages 529–533.
Wang, Z., Schaul, T., Hessel, M., Van Hasselt, H., Lanctot, M., and De Freitas, N. (2016). Dueling network architectures for deep reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48, pages 1995–2003.