coordinated strategy. One reason is that the states of all agents were observed and aggregated as the inputs to the central neural network. Thus, the network tries to learn a policy π that helps all agents by preventing them from gathering in one place. Agents using the centralized DQN could cooperate efficiently: because any agent could help with and receive orders from any restaurant, each agent could find a nearby restaurant and pick up a new order immediately after delivering the previous one to a customer. In contrast, although agents with decentralized DQNs could still learn policies for executing the tasks, it was hard for them to learn such a coordinated strategy through mutual cooperation, especially when they had only a small observation area. Instead, each agent focused on a few specific restaurants to receive orders; this is the main difference between the coordinated behaviors produced by the centralized and decentralized DQNs.
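To make the difference between the two input schemes concrete, the following is a minimal sketch; it is not the implementation used in our experiments, and the network sizes, state dimension, and joint-action encoding are illustrative assumptions.

    import torch
    import torch.nn as nn

    N_AGENTS = 4     # illustrative number of agents (assumption)
    STATE_DIM = 16   # illustrative per-agent state size (assumption)
    N_ACTIONS = 5    # illustrative per-agent action count (assumption)

    class CentralizedDQN(nn.Module):
        """A single network that sees the aggregated states of all agents
        and scores every joint action, enabling coordinated choices."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(N_AGENTS * STATE_DIM, 128), nn.ReLU(),
                nn.Linear(128, N_ACTIONS ** N_AGENTS))  # joint action space

        def forward(self, all_states):  # shape: (batch, N_AGENTS * STATE_DIM)
            return self.net(all_states)

    class DecentralizedDQN(nn.Module):
        """Each agent owns an independent copy of this network and sees
        only its own (possibly small) local observation."""
        def __init__(self, obs_dim=STATE_DIM):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 64), nn.ReLU(),
                nn.Linear(64, N_ACTIONS))  # individual action space

        def forward(self, local_obs):  # shape: (batch, obs_dim)
            return self.net(local_obs)

    # The centralized controller picks one joint action from the aggregated
    # state, while each decentralized agent acts on its local view alone.
    states = torch.randn(1, N_AGENTS, STATE_DIM)
    joint_q = CentralizedDQN()(states.flatten(1))  # Q-values over 5**4 joint actions
    local_q = DecentralizedDQN()(states[:, 0, :])  # agent 0's own Q-values

Note that in this naive encoding the joint action space grows exponentially with the number of agents, which is one reason fully centralized control becomes expensive as environments scale up.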
6 CONCLUSION
We investigated whether a coordination strategy could be learned by multiple agents in a dynamic takeout platform problem. Our experimental results show that agents trained with either DQN method can learn a cooperation strategy, and the centralized DQN method does so particularly efficiently. With the centralized DQN method, agents controlled by a manager that could observe the states of all agents exhibited cooperative behaviors by flexibly receiving orders from any restaurant. On the other hand, agents with decentralized DQNs could also learn strategies for picking up and delivering orders, but their behaviors were quite different: they focused on a few specific restaurants to receive orders. However, an obvious problem remained: agents could not learn well when the observation area was too small. This is a pivotal issue that we plan to focus on and solve in the future.
In future work, we also want to enlarge the simulation environment and increase the number of agents. For example, agents could be divided into a few teams in a large environment, with each team controlled by its own leader; different teams would then be expected to take responsibility for their own parts of the region through coordinated strategies.
ACKNOWLEDGEMENTS
This work was partly supported by JSPS KAKENHI Grant Number 17KT0044.