Figure 14: Spent step results (in the simple maze experiment).
7 CONCLUSION
This paper proposed a method that extends the previous non-communicative cooperative learning method (PMRL-OM), which is based on DRL (A3C), to a multi-agent system in an unstable environment with hetero-transitions, i.e., transitions that differ according to the gap between observed and recent situations. The proposed method inserts an LSTM module into the A3C neural network. The experiments compared the proposed method with plain A3C and with a variant of the proposed method without the LSTM module. The results were as follows: (1) the proposed method performs better than both the A3C algorithm and the variant without the LSTM module; in particular, it enables the agents' learning to converge; (2) the LSTM can adapt to the time dimension of the input information.
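The following is a minimal sketch, in PyTorch rather than the ChainerRL implementation used in the experiments, of how an LSTM cell can be inserted between the A3C feature extractor and its policy/value heads; the class name, layer sizes, and helper method are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class A3CLSTMNet(nn.Module):
    """Actor-critic network with an LSTM module inserted after the feature layer."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # The LSTM cell carries a recurrent state across time steps, so the
        # network can summarize how the observed situation has been changing.
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.policy = nn.Linear(hidden, n_actions)  # actor head (action logits)
        self.value = nn.Linear(hidden, 1)           # critic head (state value)

    def forward(self, obs, state):
        h, c = state
        x = self.feature(obs)
        h, c = self.lstm(x, (h, c))
        return self.policy(h), self.value(h), (h, c)

    def initial_state(self, batch_size: int = 1):
        zeros = torch.zeros(batch_size, self.lstm.hidden_size)
        return (zeros, zeros.clone())
```

In this sketch, each agent keeps its own recurrent state between decisions; removing the `nn.LSTMCell` and feeding the feature layer directly into the heads corresponds to the LSTM-free variant used as a baseline in the comparison.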
This paper also showed that, although the proposed method can adapt to the hetero-transition of the input information, it should further adapt to the hetero-transition of the output information. In particular, the hetero-transition should be modeled as a partially observable Markov decision process (POMDP), whereas the proposed method treats it as a Markov decision process. Therefore, in future work we will extend the proposed method to the POMDP setting so that it can adapt to hetero-observations.
ACKNOWLEDGEMENTS
This research was supported by JSPS Grant Number JP20K23326.
REFERENCES
Du, Y., Liu, B., Moens, V., Liu, Z., Ren, Z., Wang, J., Chen, X., and Zhang, H. (2021). Learning Correlated Communication Topology in Multi-Agent Reinforcement Learning, pages 456–464. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC.
Fujita, Y., Kataoka, T., Nagarajan, P., and Ishikawa, T. (2019). ChainerRL: A deep reinforcement learning library. In Workshop on Deep Reinforcement Learning at the 33rd Conference on Neural Information Processing Systems.
Ghosh, A., Tschiatschek, S., Mahdavi, H., and Singla, A. (2020). Towards Deployment of Robust Cooperative AI Agents: An Algorithmic Framework for Learning Adaptive Policies, pages 447–455. International Foundation for Autonomous Agents and Multiagent Systems, Richland, SC.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural Comput., 9(8):1735–1780.
Kim, D., Moon, S., Hostallero, D., Kang, W. J., Lee, T.,
Son, K., and Yi, Y. (2019). Learning to schedule com-
munication in multi-agent reinforcement learning.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap,
T. P., Harley, T., Silver, D., and Kavukcuoglu, K.
(2016). Asynchronous methods for deep reinforce-
ment learning. CoRR, abs/1602.01783.
Raileanu, R., Denton, E., Szlam, A., and Fergus, R. (2018). Modeling others using oneself in multi-agent reinforcement learning. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4257–4266, Stockholmsmässan, Stockholm, Sweden. PMLR.
Uwano, F. (2021). A cooperative learning method for multi-agent system with different input resolutions. In 4th International Symposium on Agents, Multi-Agent Systems and Robotics.
Uwano, F. and Takadama, K. (2019). Utilizing observed
information for no-communication multi-agent rein-
forcement learning toward cooperation in dynamic en-
vironment. SICE Journal of Control, Measurement,
and System Integration, 12(5):199–208.