
ACKNOWLEDGEMENTS
This research was partially funded by the ERC Advanced
Grant WhiteMech (No. 834228) and the PNRR
MUR project PE0000013-FAIR, and was also supported
by the BUBBLES Project (Grant No. 893206).
A Delay-Aware DRL-Based Environment for Cooperative Multi-UAV Systems in Multi-Purpose Scenarios