Decentralized Multi-agent Formation Control via Deep
Reinforcement Learning
Aniket Gupta¹,* and Raghava Nallanthighal²,†
¹Department of Electrical Engineering, Delhi Technological University, New Delhi, India
²Department of Electronics and Communication Engineering, Delhi Technological University, New Delhi, India
*http://www.dtu.ac.in/Web/Departments/Electrical/about/
†http://www.dtu.ac.in/Web/Departments/Electronics/about/
Keywords: Multi-agent Systems, Swarm Robotics, Formation Control, Policy Gradient Methods.
Abstract: Multi-agent formation control is a well-studied problem, and while several methods from control theory exist, they demand considerable expertise to tune properly, which is highly resource-intensive, and they often fail to adapt to even slight changes in the environment. This paper presents an end-to-end decentralized approach to multi-agent formation control, using only the information available from onboard sensors, within a deep reinforcement learning framework. The proposed method maps raw sensor readings directly to the agent's movement velocity through a deep neural network. The approach uses policy gradient methods to generalize efficiently across various simulation scenarios and is trained over a large number of agents. We validate the performance of the learned policy using numerous simulated scenarios and a comprehensive evaluation. Finally, the learned policy is demonstrated in new scenarios with non-cooperative agents that were not introduced during the training process.
1 INTRODUCTION
Multi-agent systems are rapidly gaining momentum owing to their many real-world applications in disaster relief, rescue operations, military operations, warehouse management, agriculture, and more. All of these tasks require teams of robots to cooperate autonomously to produce the desired results, much like the swarming behaviour displayed by many animals and insects in nature.
One of the major challenges for multi-agent systems is autonomous navigation while adhering to Reynolds' three flocking rules of separation, alignment, and cohesion (Reynolds, 1987). Modern control theory presents numerous solutions to this problem, supported by rigorous proofs, and demonstrates the feasibility of multi-agent formation control and obstacle avoidance.
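For concreteness, the listing below gives a minimal sketch of these three rules for a single agent. The vector representation, neighbourhood inputs, and gain values are illustrative assumptions for this sketch, not parameters from any of the cited controllers.

import numpy as np

def reynolds_velocity(pos, vel, neighbour_pos, neighbour_vel,
                      k_sep=1.5, k_ali=1.0, k_coh=1.0):
    """Combine separation, alignment, and cohesion into one velocity command.

    pos, vel: (2,) arrays for this agent.
    neighbour_pos, neighbour_vel: (N, 2) arrays for visible neighbours.
    Gains k_* are illustrative; real controllers tune them carefully.
    """
    if len(neighbour_pos) == 0:
        return vel  # no neighbours in range: keep current velocity

    offsets = neighbour_pos - pos                    # vectors to neighbours
    dists = np.linalg.norm(offsets, axis=1, keepdims=True) + 1e-6

    separation = -(offsets / dists**2).mean(axis=0)  # steer away, weighted by 1/d
    alignment = neighbour_vel.mean(axis=0) - vel     # match average heading
    cohesion = offsets.mean(axis=0)                  # steer toward local centroid

    return vel + k_sep * separation + k_ali * alignment + k_coh * cohesion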
(Marko and Stiepan, 2012), (Anuj et al., 2020), and (Egerstedt, 2007) present an artificial potential field approach and demonstrate stable autonomous navigation of a swarm of unmanned aerial vehicles (UAVs). While this approach gets the job done, it neglects the non-linearities in the system dynamics and thus fails to demonstrate optimal behaviour in
conjunction with the vehicle dynamics. Moreover, these approaches require extensive tuning to achieve the desired performance.
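As a point of reference, a textbook artificial potential field controller descends the gradient of a combined potential U = U_att + U_rep. The sketch below assumes a quadratic attractive term and the classic inverse-distance repulsive term; all gains and radii are illustrative assumptions, not values from the cited works.

import numpy as np

def potential_field_velocity(pos, goal, obstacles,
                             k_att=1.0, k_rep=0.5, d0=2.0):
    """Gradient-descent step on U = U_att + U_rep.

    U_att = 0.5 * k_att * ||goal - pos||^2 pulls the agent to the goal;
    U_rep pushes it away from obstacles closer than the influence radius d0.
    """
    # Attractive term: negative gradient of the quadratic goal potential.
    v = k_att * (goal - pos)

    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0 < d < d0:
            # Classic repulsive gradient: grows sharply as d -> 0.
            v += k_rep * (1.0 / d - 1.0 / d0) * diff / d**3
    return v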
Further, in (Hung and Givigi, 2017), a Q-learning based controller is presented for flocking a swarm of UAVs in unknown environments using a leader-follower approach. This approach does not yield an optimal solution because the state and action spaces are discretized to limit the size of the Q-table, and as the number of states increases, computing the Q-table becomes increasingly inefficient. Another disadvantage of this approach is the single point of failure introduced by the leader agent.
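For context, the tabular Q-learning update at the heart of such a controller is a single Bellman backup per transition, and the table must hold one row per discretized state, which is exactly what makes fine discretization intractable. The sizes and hyperparameters below are illustrative assumptions.

import numpy as np

# Tabular Q-learning: a coarse grid over the continuous state keeps the
# table tractable but sacrifices optimality.
n_states, n_actions = 10_000, 8       # illustrative discretization
Q = np.zeros((n_states, n_actions))   # table grows with every state bin

alpha, gamma = 0.1, 0.99              # illustrative learning rate / discount

def q_update(s, a, r, s_next):
    """One Bellman backup: Q(s,a) <- Q(s,a) + alpha * (TD error)."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])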
In (Johns and Rasmus, 2018), reinforcement learning is utilized to improve the performance of a behaviour-based control algorithm, which serves both as a baseline against which the RL algorithm is compared and as an initial policy from which training starts. This approach produces significant results but is not an end-to-end solution that can be applied directly to any kind of system: appropriate tuning of the behaviour-based control algorithm is still required for the complete controller to perform efficiently.
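One common way to realize such a combination, sketched below purely as an illustrative pattern rather than the cited authors' implementation, is to let the learned policy output a residual correction on top of the hand-tuned behaviour-based command; both callables here are hypothetical placeholders.

def combined_action(state, base_controller, policy):
    """Behaviour-based command plus a learned residual correction.

    base_controller: hand-tuned behaviour-based law (the baseline).
    policy: learned network that outputs a small corrective velocity.
    """
    u_base = base_controller(state)   # tuned behaviour-based command
    u_residual = policy(state)        # learned correction
    return u_base + u_residual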