learning steps, but the state transition probability of the agents remains higher than that of the other methods. This is likely because the cooperative behavior acquired by EXPODE+individual differs substantially from that of the other methods; since the high probability is maintained even after 400,000 learning steps, we consider that the proposed intrinsic reward sufficiently increases the value of these state transitions.
5 CONCLUSIONS
This paper proposed two types of intrinsic rewards aimed at acquiring cooperative behavior. One utilizes the average value of the actions selected by the agents at each step, aiming to further increase the value of the actions necessary for cooperative behavior; the other aims to avoid local solutions by giving an individual reward to each agent when certain conditions are met (an illustrative sketch of both terms is given at the end of this section). In addition, we conducted experiments to verify their effectiveness in 6h_vs_8z, a StarCraft II scenario, by adding the proposed intrinsic reward design to a conventional method (EXPODE) whose intrinsic reward design promotes agents' exploration of unexplored regions. As a result, we confirmed that, by applying the proposed intrinsic reward, EXPODE can in many cases learn policies with a higher win rate for the allies than before application, and can stably acquire cooperative behavior. Furthermore, we confirmed that applying the proposed intrinsic reward increases the probability of selecting the actions necessary for cooperative behavior in 6h_vs_8z compared with the conventional method, contributing to the learning of cooperative behavior.

Future work includes experiments using other StarCraft II scenarios to verify the generality of the effect of the proposed intrinsic reward design.
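To make the two reward terms above concrete, the following minimal sketch shows one possible way to compute them at a single time step. The scaling coefficient, the bonus value, the cooperation condition, and the use of per-agent values of the chosen actions are illustrative assumptions and do not reproduce the exact formulation used in the proposed method.

    import numpy as np

    BETA_SHARED = 0.1        # hypothetical weight on the team-average action value
    INDIVIDUAL_BONUS = 0.05  # hypothetical per-agent bonus when the condition holds

    def shared_intrinsic_reward(chosen_action_values):
        # Team-level term: proportional to the average value of the actions
        # the agents actually selected at this step.
        return BETA_SHARED * float(np.mean(chosen_action_values))

    def individual_intrinsic_reward(condition_met):
        # Per-agent term: paid only to agents that satisfy the (here
        # hypothetical) cooperation condition, to help escape local solutions.
        return INDIVIDUAL_BONUS if condition_met else 0.0

    # Example step with three agents.
    chosen_q = np.array([0.82, 0.64, 0.71])   # values of the selected actions
    conditions = [True, False, True]          # which agents met the condition
    team_bonus = shared_intrinsic_reward(chosen_q)
    per_agent_bonus = [individual_intrinsic_reward(c) for c in conditions]
    print(team_bonus, per_agent_bonus)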
REFERENCES
Chen, S., Chen, B.-H., Chen, Z., and Wu, Y. (2020). Itinerary planning via deep reinforcement learning. In Proceedings of the 2020 International Conference on Multimedia Retrieval, ICMR '20, pages 286–290, New York, NY, USA. Association for Computing Machinery.
Coronato, A., Di Napoli, C., Paragliola, G., and Serino, L. (2021). Intelligent planning of onshore touristic itineraries for cruise passengers in a smart city. In 2021 17th International Conference on Intelligent Environments (IE), pages 1–7.
Ha, D., Dai, A. M., and Le, Q. V. (2016). Hypernetworks.
CoRR, abs/1609.09106.
Hasselt, H. (2010). Double Q-learning. In Lafferty, J., Williams, C., Shawe-Taylor, J., Zemel, R., and Culotta, A., editors, Advances in Neural Information Processing Systems, volume 23. Curran Associates, Inc.
Hausknecht, M. and Stone, P. (2015). Deep recurrent Q-learning for partially observable MDPs. In 2015 AAAI Fall Symposium Series.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Oliehoek, F. A., Amato, C., et al. (2016). A concise
introduction to decentralized POMDPs, volume 1.
Springer.
Oliehoek, F. A., Spaan, M. T., and Vlassis, N. (2008). Optimal and approximate Q-value functions for decentralized POMDPs. Journal of Artificial Intelligence Research, 32:289–353.
Paragliola, G., Coronato, A., Naeem, M., and De Pietro,
G. (2018). A reinforcement learning-based approach
for the risk management of e-health environments: A
case study. In 2018 14th International Conference on
Signal-Image Technology & Internet-Based Systems
(SITIS), pages 711–716.
Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., and Whiteson, S. (2018). QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4295–4304. PMLR.
Samvelyan, M., Rashid, T., de Witt, C. S., Farquhar, G., Nardelli, N., Rudner, T. G. J., Hung, C., Torr, P. H. S., Foerster, J. N., and Whiteson, S. (2019). The StarCraft multi-agent challenge. CoRR, abs/1902.04043.
Sutton, R. S. and Barto, A. G. (1999). Reinforcement learning: An introduction. Robotica, 17(2):229–235.
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J. P., Schrittwieser, J., Quan, J., Gaffney, S., Petersen, S., Simonyan, K., Schaul, T., van Hasselt, H., Silver, D., Lillicrap, T. P., Calderone, K., Keet, P., Brunasso, A., Lawrence, D., Ekermo, A., Repp, J., and Tsing, R. (2017). StarCraft II: A new challenge for reinforcement learning. CoRR, abs/1708.04782.
Zhang, Y. and Yu, C. (2023). EXPODE: Exploiting policy discrepancy for efficient exploration in multi-agent reinforcement learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '23, pages 58–66, Richland, SC. International Foundation for Autonomous Agents and Multiagent Systems.