Mixed Human-UAV Reinforcement Learning: Literature Review and Open Challenges
Nicolò Brandizzi (https://orcid.org/0000-0002-3191-6623), Damiano Brunori (https://orcid.org/0000-0002-0384-3498), Francesco Frattolillo (https://orcid.org/0000-0002-2040-3355), Alessandro Trapasso (https://orcid.org/0000-0001-5431-6607) and Luca Iocchi (https://orcid.org/0000-0001-9057-8946)*
Department of Computer, Automation and Management Engineering, Sapienza University of Rome, via Ariosto 25, 00185 Roma, Italy
* Authors are alphabetically ordered
Keywords: UAVs, Reinforcement Learning, Human-in-the-Loop, Human-UAV.
Abstract: Unmanned Aerial Vehicles (UAVs) are becoming a popular solution for a plethora of tasks, ranging from supporting and extending communication to monitoring and exploring areas of interest. At the same time, Reinforcement Learning (RL) has become an excellent candidate technique for facing complex scenarios where a model of the environment is not always available. Nevertheless, fully autonomous applications can have drawbacks under certain unpredictable circumstances, and an active human element could facilitate handling such scenarios. With all this in mind, and after an in-depth literature analysis, we focus on Mixed Human-UAV reinforcement learning applications that would benefit from introducing the human-in-the-loop component, pointing out their strengths, weaknesses, and new challenges.
1 INTRODUCTION
Unmanned Aerial Vehicles (UAVs) are increasingly becoming a key technology for solving a wide range of problems due to their high mobility in three-dimensional space, easy deployment, and relatively low cost. Examples include UAVs used as aerial stations to provide services after a natural disaster or network access in remote areas, for emergency communications and rescue support, for surveillance tasks in risky and dangerous areas, for cargo delivery of food and medical goods, and in agriculture to monitor and facilitate farming activities.
Autonomous UAV. As UAV application fields have expanded in recent years, a general research trend aims at developing autonomous UAVs able to make their own real-time decisions, thus ensuring a highly responsive autonomous system in dynamic and possibly unknown scenarios.
Indeed, a high degree of automation could lead UAVs to (co)operate without the need for constant device-to-device communication and to avoid complex and slow human remote-controlled maneuvers.
An approach that has proven extremely effective in various complex problems is Reinforcement Learning (RL). In recent years, a number of problems have been successfully solved by Deep Reinforcement Learning (DRL), which integrates deep neural networks into RL. Indeed, agents trained through DRL have been shown to obtain super-human performance in extremely complex games such as the classic Atari games, Go, and Dota 2, as well as in complex robotics problems.
Multi-UAV Reinforcement Learning. UAVs are designed to operate in shared and dynamic airspace, hence cooperation is both desirable and necessary. We are thus interested in Multi-UAV systems, i.e., teams or fleets of UAVs (Ali et al., 2021; Duflo et al., 2020) cooperating to achieve a desired goal. By leveraging a team of UAVs, tasks can be executed faster and more efficiently than with a single UAV. UAVs also need to achieve their goal while avoiding conflicts and collisions with other UAVs and objects in the environment, since they are deployed in shared airspace where safety (including ground safety) is of utmost importance.
Most of the relevant RL and DRL literature targets single-agent systems. However, in this paper, we focus on the application of RL in a multi-agent system (MAS), known as Multi-Agent Reinforcement Learning (MARL) (Jafari et al., 2020), and more specifically on Multi-UAV RL techniques.
A number of reinforcement learning techniques have been applied to the multi-agent scenario. For example, Independent Learners (Tan, 1993) are used for UAV trajectory planning and time resource allocation in a multiple UAV-enabled network (Tang et al., 2020). A slightly different problem, such as trajectory planning for Multi-UAV assisted Mobile Edge Computing (MEC) (Wang et al., 2021), can be faced with the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm (Lowe et al., 2017). Moreover, an extension of the well-known Deep Q-learning algorithm (Mnih et al., 2013) to the multi-agent case is applied to autonomous forest fire fighting (Haksar and Schwager, 2018): this approach is based on an exploratory strategy led by an optimal heuristic function. Other scenarios may arise where (D)RL can be efficiently deployed, e.g., leader-follower systems (Hung and Givigi, 2017), long-term communication coverage (Liu et al., 2020a), and Mobile Crowd Sensing (MCS) (Liu et al., 2019).
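To make the Independent Learners baseline concrete, the following is a minimal sketch, assuming a toy discretized multi-UAV setting (the grid states and discrete moves are our illustrative choices, not taken from the cited works): each UAV keeps its own Q-table and treats its teammates as part of the environment.

```python
import numpy as np

# Illustrative sizes: a 5x5 grid (25 states) and 4 moves per UAV.
N_UAVS, N_STATES, N_ACTIONS = 3, 25, 4
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1

rng = np.random.default_rng(0)
# One independent Q-table per UAV (Tan, 1993): no shared parameters.
q_tables = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_UAVS)]

def select_action(q, state):
    # Epsilon-greedy exploration over the agent's own Q-table.
    if rng.random() < EPSILON:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q[state]))

def q_update(q, s, a, r, s_next):
    # One-step Q-learning backup; teammates' effects show up only
    # through the observed reward and next state.
    q[s, a] += ALPHA * (r + GAMMA * q[s_next].max() - q[s, a])
```

In a training loop, each UAV i would call select_action(q_tables[i], s_i) and q_update on its own transitions; the non-stationarity induced by concurrently learning teammates is exactly what motivates centralized-critic methods such as MADDPG.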
2 OVERVIEW OF MULTI-UAV RL METHODS
In this section, we present a brief review of the literature on Reinforcement Learning applied to Multi-UAV systems. Initially, we investigated major conferences such as AAMAS (International Conference on Autonomous Agents and Multiagent Systems), ECML (European Conference on Machine Learning), ICML (International Conference on Machine Learning), and NeurIPS (Neural Information Processing Systems), but without relevant matches. Thus, we moved to a keyword search on scientific databases such as IEEE Xplore, ScienceDirect, MDPI, and Springer. Our interest was focused on the following keywords: UAVs, Multi-UAVs, Reinforcement learning, Drones, Multi-agents, and combinations of them. As our analysis shows, and as illustrated in Figure 1, this topic is recent and rapidly growing.

Figure 1: Number of publications per year (2016 to 2021) given the research keywords.
Additionally, the European Drones Outlook Study (Undertaking, 2017) forecasts an increase in the drone marketplace of EUR 15 billion annually by 2050 and estimates at least EUR 200 million in additional Research & Development funding within 5 to 10 years from the study. Furthermore, it expects around 7 million consumer leisure drones to operate across Europe and around 400 thousand drones for government and commercial missions by 2050. These data reveal the interest in this emerging sector and therefore the need to find new solutions for its uses. (Undertaking, 2017) also points out agriculture, energy, e-commerce, delivery, and transport as the sectors that will benefit the most.
By analyzing the literature, our aim is to highlight the most popular research directions and group specific problems into macro classes such as Coverage, Trajectory generation, Computation Offloading, and Communication. In this regard, we noticed a lack of work including human supervision of autonomous UAVs in case of unexpected events. Although machine learning (ML), and in particular RL, provides excellent tools for automating the agents' behavior, humans still play an important role in supervising UAVs.
Mixed Human-Robot Teams. Fully autonomous agents are prone to errors, as they cannot cope with the amount of knowledge required to successfully perform a task in a real-world environment. Several studies investigated the level of autonomy (LOA) adopted by agents in mixed human-robot settings (Wu et al., 2018). Autonomy is most often analyzed within a dynamic framework in which the agents learn to adapt their LOA to the task at hand, e.g., mixed-initiative interaction settings (Allen et al., 1999), adaptive autonomy (Suzanne Barber et al., 2000), and collaborative control (Fong et al., 2001). MAS with autonomous agents are sensitive to a range of issues: limited performance on computer vision tasks such as object detection, imperfect knowledge of the system, and ethical issues resulting from non-explainable AI decisions.
During our literature research, we found only one work focusing on mixed human-robot teams for Multi-UAV systems (WANG et al., 2020). In it, the authors build a transparent human-UAV framework where they study the behavior of heterogeneous UAVs equipped with different capabilities. This framework is based on the requirements of Observability, Predictability, and Directability (OPD),
which must be met by each agent in the team to understand the intentions and behaviors of the other members, in contrast to previous approaches which focused on improving the level of autonomy (LOA) of team members. In this framework, agents address this by being aware of the states and actions of the other agents. An optimal policy for the leader-follower problem is learned through deep reinforcement learning (DRL), while a path planner is used to take into account the presence of enemy threats in dynamic environments. These policies help humans by providing hints or warnings if the human user, a.k.a. Manned Aerial Vehicle (MAV), approaches dangerous areas. MAV behaviors are understandable and controllable, allowing humans to observe, predict, and direct them.
3 OPEN SCIENTIFIC CHALLENGES & APPLICATIONS
In our analysis of the current state of the art on reinforcement learning for Multi-UAV systems, we found a lack of work considering the presence of one or more human agents within the environment. When considering mixed human-robot interactions, the issue of interpretability becomes crucial for a human operator to understand and guide autonomous UAVs. This problem is well addressed in the field of explainable AI (XAI), whose aim is to steer away from the classical black-box approach in favor of a more user-friendly system. We advocate that mixed human-robot teams are fundamental for solving complex tasks in a shared aerial environment, and related research activities should proceed in this direction. In the following, we present some promising research lines and industrial applications that can benefit from more in-depth studies in this field. All the presented applications can be built using a MARL framework and focus on the supervision and/or collaboration exercised by a human operator over multiple UAVs in a range of different scenarios.
3.1 Challenges
We consider here some research directions for investigating the feasibility and safety of a hybrid human-UAV RL system.
Explainable UAV Behavior. To supervise autonomous UAVs, humans need to understand the decision-making process of AI models. For this reason, we want to avoid the classic black-box approach in favor of a more understandable method (WANG et al., 2020). UAVs can hardly achieve full autonomy on difficult tasks regardless of human intervention, and this becomes even more evident when ethical considerations enter the decision process: human-in-the-loop remains crucial in this case. Therefore, UAVs need to cooperate with human users according to protocols that are as clear as possible in order to avoid misunderstandings between drone and human behaviors. As a result, research should focus on the balance between the autonomy level of UAVs and human control: in this regard, it is essential to represent the behaviors learned by the drones through DRL in a formalism that is clear and understandable to a human operator.
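One hedged route toward such a human-readable formalism, not prescribed by the cited works, is to distill a trained DRL policy into a shallow decision tree whose rules an operator can inspect. The sketch below assumes a hypothetical drl_policy stand-in and illustrative feature names:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def drl_policy(obs):
    # Stand-in for a trained DRL network mapping observations to a
    # discrete action (0 = hold position, 1 = climb); purely illustrative.
    return int(obs[0] < 0.3 or obs[2] < 0.2)

rng = np.random.default_rng(0)
# Sample the policy on synthetic observations: altitude, distance, battery.
observations = rng.random((1000, 3))
actions = np.array([drl_policy(o) for o in observations])

# A depth-limited tree keeps the extracted rules short and readable.
tree = DecisionTreeClassifier(max_depth=3).fit(observations, actions)
print(export_text(tree, feature_names=["altitude", "distance", "battery"]))
```

The printed if-then rules give the operator an approximate but inspectable summary of the learned behavior, at the cost of some fidelity to the original policy.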
Knowledge Representation. To the best of our knowledge, there is a noticeable lack of works modeling the presence of a human in the context of RL, which mainly relies on the Markov Decision Process (MDP) formalism to describe its problems. Formally, an MDP is defined by the tuple (S, A, P, R), whose elements are, respectively, the state and action spaces of the agents and the transition and reward functions of the environment. Extending an MDP to take the human-in-the-loop factor into account could be a very difficult challenge. Several works involving both humans and UAVs, such as (Gateau et al., 2016), model the former through a Mixed Observability MDP (MOMDP), an extension of the MDP in which the state is split into fully observable and partially observable components. In the literature, MOMDPs are usually associated with planning algorithms, while their application to Mixed Human-UAV settings is still unexplored. Furthermore, we also need to consider that standard agents in a generic RL problem require far more learning interactions with the environment than their human counterparts. On the other hand, a non-human agent endowed with high computational capability can predict specific information that may easily be neglected by a human agent.
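As a minimal sketch of how the mixed-observability structure could be encoded, assuming the usual MOMDP factorization into a fully observable factor (the UAV side) and a hidden factor (e.g., the human's availability state, as in (Gateau et al., 2016)); all names below are illustrative:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MOMDP:
    # Fully observable state factor, e.g., UAV positions and battery levels.
    X: Sequence
    # Hidden state factor tracked via a belief, e.g., human availability.
    Y: Sequence
    A: Sequence              # action space
    O: Sequence              # observations emitted about the hidden factor
    T: Callable              # T(x, y, a) -> distribution over (x', y')
    Z: Callable              # Z(x', a, o) -> observation probability
    R: Callable              # R(x, y, a) -> scalar reward
    gamma: float = 0.95      # discount factor
```

A planner then maintains a belief only over Y rather than over the full joint state, which is what makes MOMDPs cheaper to solve than an equivalent POMDP over (X, Y).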
Conflict Resolution. When dealing with UAVs, Air Traffic Management (ATM) becomes unmanned traffic management, i.e., UTM, and in this latter case additional and more specific safety constraints are required with respect to the ATM case. Thus, when an emergency event arises in a given UTM scenario with autonomous UAVs, a responsive human intervention can be necessary and decisive. Switching from autonomous to human control is desirable when a severe and non-deferrable fault occurs and a manually
managed system becomes strictly necessary. Under these conditions, the human user should directly control the faulty UAV in order to allow manual emergency maneuvers. In this way, motions or even landing operations can be performed in complete safety while the other members of the UAV swarm fly along new paths recomputed online by (D)RL deconfliction algorithms (Isufaj et al., 2021). This approach can be applied to any kind of conflict-prone or emergency scenario occurring at any flight phase, i.e., during the ongoing flight or in the takeoff and landing phases.
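The control-authority switch described above can be sketched as a single decision point in the control loop; the uav, rl_policy, and human interfaces below are hypothetical, not part of any cited system:

```python
def control_step(uav, rl_policy, human):
    """One control cycle with human override on severe faults (sketch)."""
    if uav.has_severe_fault():
        # Non-deferrable fault: hand full control authority to the human
        # for manual emergency maneuvers (e.g., an assisted landing).
        action = human.manual_command(uav.state)
        human.notify_takeover(uav.id)
    else:
        # Nominal operation: the learned (D)RL policy keeps flying, while
        # swarm-level deconfliction recomputes paths for the other UAVs.
        action = rl_policy(uav.state)
    uav.apply(action)
```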
3.2 Applications
Among the plethora of available applications for UAV systems, we focus on some of those that could benefit from adding a human-in-the-loop element.
UAV Relay Networks. Data gathering, edge computing, and information flow are crucial in modern scenarios and present a number of challenges tied to the hostile nature of the environment. Centralized network control is the preferred structure in state-of-the-art systems; however, it does not guarantee an adequate level of robustness when dealing with disturbing phenomena such as interference and communication jamming (Wang et al., 2020). On the other hand, a decentralized Multi-UAV system can be deployed in order to guarantee a sufficient degree of autonomous behavior. The system can leverage the reinforcement learning framework's ability to pursue multiple goals concurrently, such as maximizing the physical connectivity among UAVs and the area of coverage. Finally, the system can be supervised and controlled by a human agent on the field by transmitting only high-level actions, e.g., move forward, instead of coordinating the whole UAV team.
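As an illustration of pursuing connectivity and coverage concurrently, the following hedged reward sketch combines the two objectives with tunable weights; the ranges, radii, and weights are our assumptions, not values from the cited works:

```python
import numpy as np

def connectivity(positions, link_range=100.0):
    # Fraction of UAV pairs within communication range of each other.
    pairs = [(i, j) for i in range(len(positions))
             for j in range(i + 1, len(positions))]
    if not pairs:
        return 1.0
    linked = sum(np.linalg.norm(positions[i] - positions[j]) <= link_range
                 for i, j in pairs)
    return linked / len(pairs)

def coverage(positions, cells, sense_radius=50.0):
    # Fraction of ground cells within sensing radius of at least one UAV.
    covered = sum(any(np.linalg.norm(c - p) <= sense_radius for p in positions)
                  for c in cells)
    return covered / len(cells)

def relay_reward(positions, cells, w_link=0.5, w_cov=0.5):
    # Weighted sum lets the designer trade connectivity against coverage.
    return w_link * connectivity(positions) + w_cov * coverage(positions, cells)
```

Here positions and cells are lists of NumPy coordinate arrays; in a MARL setting, each UAV would receive this scalar as (part of) its team reward.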
Improved UAV Surveillance. A possible application of Multi-UAV systems and DRL is persistent coverage and surveillance (Liu et al., 2020b). In this context, the UAVs must monitor a certain area through sensors such as onboard RGB cameras, organizing themselves so as to offer adequate coverage while minimizing the overlap of the monitored sub-areas. In a standard application, individual UAVs send notifications in case of suspicious events in a given zone and then proceed with their workflow. In an alternative version of the same application with the human-in-the-loop component, the human operator could be asked to take control of the UAV through teleoperation following the notification. In this case, the human operator would be able to control the UAV and therefore follow and accurately track the event. In this way, it is also possible to reduce the amount of data exchanged, as the UAV is delegated only to send notifications, while real-time streaming of the monitored environment can be activated upon human intervention.
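The notify-then-teleoperate workflow can be sketched as a simple handoff loop; every interface below (detector, operator, and the uav methods) is hypothetical:

```python
def patrol_step(uav, detector, operator):
    frame = uav.capture()
    if detector.is_suspicious(frame):
        # Low-bandwidth alert only: no video is streamed at this point.
        operator.notify(uav.id, uav.position)
        if operator.requests_control(uav.id):
            uav.start_stream()               # real-time video on demand
            while operator.in_control(uav.id):
                uav.apply(operator.teleop_command(uav.id))
            uav.stop_stream()
    uav.resume_patrol()                      # back to autonomous coverage
```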
4 CONCLUSION
In this work, we describe various UAV applications and highlight evidence that the UAV sector has been growing constantly over the last few years. In particular, we focus on a literature analysis of RL approaches applied to Multi-UAV systems. As a result, we identify a shortage of works concerning the human role as a possible collaborator or supervisor in complex mixed scenarios involving unexpected and hard-to-predict events: most of the time, such conditions are challenging for UAVs to manage autonomously. Finally, we propose some alternative solutions to existing problems, indicating how the presence of a human operator can be crucial and contribute positively to finding a valid and possibly optimal solution.
ACKNOWLEDGEMENTS
This paper has been partially supported by the BUBBLES project, which has received funding from the SESAR Joint Undertaking under the European Union's Horizon 2020 research and innovation program under grant agreement No 893206. Research has been partially supported also by the ERC Advanced Grant WhiteMech (No. 834228) and by the EU ICT-48 2020 project TAILOR (No. 952215).
REFERENCES
Ali, Z. A., Han, Z., and Masood, R. J. (2021). Collective motion and self-organization of a swarm of UAVs: A cluster-based architecture. Sensors, 21(11):1–19.

Allen, J. E., Guinn, C. I., and Horvitz, E. (1999). Mixed-initiative interaction. IEEE Intelligent Systems and their Applications, 14(5):14–23.

Duflo, G., Danoy, G., Talbi, E. G., and Bouvry, P. (2020). Automating the design of efficient distributed behaviours for a swarm of UAVs. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pages 489–496.

Fong, T., Thorpe, C., and Baur, C. (2001). Collaborative control: A robot-centric model for vehicle teleoperation, volume 1. Carnegie Mellon University, The Robotics Institute, Pittsburgh.

Gateau, T., Chanel, C. P. C., Le, M.-H., and Dehais, F. (2016). Considering human's non-deterministic behavior and his availability state when designing a collaborative human-robots system. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4391–4397.

Haksar, R. N. and Schwager, M. (2018). Distributed deep reinforcement learning for fighting forest fires with a network of aerial robots. In IEEE International Conference on Intelligent Robots and Systems, pages 1067–1074.

Hung, S. M. and Givigi, S. N. (2017). A Q-learning approach to flocking with UAVs in a stochastic environment. IEEE Transactions on Cybernetics, 47(1):186–197.

Isufaj, R., Omeri, M., and Piera, M. A. (2021). Multi-UAV conflict resolution with graph convolutional reinforcement learning. CoRR, abs/2111.14598.

Jafari, M., Xu, H., and Carrillo, L. R. G. (2020). A biologically-inspired reinforcement learning based intelligent distributed flocking control for Multi-Agent Systems in presence of uncertain system and dynamic environment. IFAC Journal of Systems and Control, 13:100096.

Liu, C. H., Chen, Z., and Zhan, Y. (2019). Energy-efficient distributed mobile crowd sensing: A deep learning approach. IEEE Journal on Selected Areas in Communications, 37(6):1262–1276.

Liu, C. H., Ma, X., Gao, X., and Tang, J. (2020a). Distributed energy-efficient multi-UAV navigation for long-term communication coverage by deep reinforcement learning. IEEE Transactions on Mobile Computing, 19(6):1274–1285.

Liu, Y., Liu, H., Tian, Y., and Sun, C. (2020b). Reinforcement learning based two-level control framework of UAV swarm for cooperative persistent surveillance in an unknown urban area. Aerospace Science and Technology, 98:105671.

Lowe, R., Wu, Y., Tamar, A., Harb, J., Abbeel, P., and Mordatch, I. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. CoRR, abs/1706.02275.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. CoRR, abs/1312.5602.

SESAR Joint Undertaking (2017). European Drones Outlook Study: Unlocking the Value for Europe. Publications Office.

Suzanne Barber, K., Goel, A., and Martin, C. E. (2000). Dynamic adaptive autonomy in multi-agent systems. Journal of Experimental & Theoretical Artificial Intelligence, 12(2):129–147.

Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. In Machine Learning Proceedings 1993, pages 330–337.

Tang, J., Song, J., Ou, J., Luo, J., Zhang, X., and Wong, K. K. (2020). Minimum throughput maximization for multi-UAV enabled WPCN: A deep reinforcement learning method. IEEE Access, 8:9124–9132.

WANG, C., WU, L., YAN, C., WANG, Z., LONG, H., and YU, C. (2020). Coactive design of explainable agent-based task planning and deep reinforcement learning for human-UAVs teamwork. Chinese Journal of Aeronautics, 33(11):2930–2945.

Wang, L., Wang, K., Pan, C., Xu, W., Aslam, N., and Hanzo, L. (2021). Multi-agent deep reinforcement learning-based trajectory planning for multi-UAV assisted mobile edge computing. IEEE Transactions on Cognitive Communications and Networking, 7(1):73–84.

Wang, W., Lu, X., Liu, S., Xiao, L., and Yang, B. (2020). Energy efficient relay in UAV networks against jamming: A reinforcement learning based approach. IEEE Vehicular Technology Conference, 2020-May:0–4.

Wu, X., Wang, C., Niu, Y., Hu, X., and Fan, C. (2018). Adaptive human-in-the-loop multi-target recognition improved by learning. International Journal of Advanced Robotic Systems, 15(3):1729881418774222.