Efficient Multi-Agent Exploration in Area Coverage Under Spatial and
Resource Constraints
Maram Hasan (https://orcid.org/0000-0001-9040-5842) and Rajdeep Niyogi (https://orcid.org/0000-0003-1664-4882)
Indian Institute of Technology Roorkee, Roorkee, 247667, India
{mhasan1, rajdeep.niyogi}@cs.iitr.ac.in
Keywords:
Multi-Agent Reinforcement Learning, Curiosity-Based Rewards, Exploration, Coverage Path Planning,
Constrained Environments.
Abstract:
Efficient exploration in multi-agent Coverage Path Planning (CPP) is challenging due to spatial, resource, and
communication constraints. Traditional reinforcement learning methods often struggle with agent coordina-
tion and effective policy learning in such constrained environments. This paper presents a novel end-to-end
multi-agent reinforcement learning (MARL) framework for area coverage tasks, leveraging the centralized
training and decentralized execution (CTDE) paradigm with enriched tensor-based observations and curiosity-
based intrinsic rewards. These rewards encourage agents to explore under-visited regions, enhancing coverage
efficiency and learning performance. Additionally, prioritized experience adaptation accelerates convergence by
focusing on the most informative experiences, improving policy robustness. By integrating these components,
the proposed framework facilitates adaptive exploration while adhering to the spatial, resource, and operational
constraints inherent in CPP tasks. Experimental results demonstrate superior performance over traditional ap-
proaches in coverage tasks under variable configurations.
1 INTRODUCTION
Multi-agent systems (MAS) are increasingly applied
in diverse domains that require coordinated opera-
tions across complex environments. These systems
enable agents to collaboratively achieve tasks de-
manding extensive spatial coverage, adaptability, and
efficient information gathering, often exceeding the
capabilities of individual agents. One prominent ap-
plication is Coverage Path Planning (CPP), where
agents develop optimal routes to ensure thorough
area coverage while minimizing gaps and overlap-
ping (Tan et al., 2021). The main goal of CPP is to
ensure that every location within an environment is
visited at least once, while adhering to optimization
constraints (Orr and Dutta, 2023). It has become in-
dispensable in fields like autonomous cleaning, pre-
cision agriculture, space exploration, and search-and-
rescue operations, where systematic coverage is es-
sential for operational efficiency and high-quality out-
comes (Yanguas-Rojas and Mojica-Nava, 2017).
Implementing efficient multi-agent coordination
in CPP poses significant challenges, particularly due
to spatial and operational constraints, resource lim-
itations, and communication challenges. These fac-
tors shape how agents navigate and interact to achieve
the goal of full area coverage while minimizing re-
source consumption and optimizing efficiency. In
structured environments, such as warehouses, spatial constraints require agents to operate within physically defined boundaries, such as walls and other structural elements, necessitating precise navigation and strategic path planning. These constraints also encompass coverage completeness: every region of the environment must be visited at least once with minimal overlap, requiring paths that avoid redundancy.
Resource efficiency is equally critical in CPP, es-
pecially in mission-critical scenarios, where agents
must minimize energy consumption and time by se-
lecting paths that achieve full coverage while avoiding
unnecessary detours or delays (Ghaddar and Merei,
2020). Furthermore, dynamic obstacles, such as hu-
man workers, other robots, or moving machinery, in-
troduce further unpredictability. Agents must con-
tinuously adapt their paths in real time to avoid col-
lisions, recalibrating routes in response to these ob-
stacles to ensure safety and operational effectiveness.
Together, spatial constraints, resource efficiency, and
dynamic obstacles create a challenging environment
for traditional path planning, requiring innovative ap-
proaches to improve adaptability and efficiency.
In recent years, reinforcement learning (RL) has
emerged as a promising solution for dynamic robotic
decision-making. It enables agents to learn behaviors
through trial-and-error interactions with the environ-
ment rather than relying on explicit manual program-
ming. Advancements in multi-agent reinforcement
learning (MARL) extend this capability, offering ro-
bust solutions to tackle diverse challenges by allow-
ing multiple agents to collaborate effectively, adapt to
environmental changes in real time, and achieve coor-
dinated coverage through learning-based approaches.
Multi-agent exploration in area coverage using re-
inforcement learning algorithms can be categorized
into end-to-end and two-stage approaches (Garaffa
et al., 2021). End-to-end methods treat exploration
as a unified process, where raw or processed sen-
sor data is input directly into an RL policy, which
generates control actions for the agent (Chen et al.,
2019b). This approach entrusts RL with all aspects
of the exploration task. In two-stage approaches, RL
is integrated with conventional methods by dividing
decision-making into distinct components. One variant
uses RL to determine target locations, with classi-
cal algorithms like Dijkstra or A* (Stentz, 1994) han-
dling path planning independently. Another applies
RL exclusively to path planning, where partitioning
algorithms such as dynamic Voronoi assign targets,
leaving RL to navigate to these destinations (Hu et al.,
2020). A third variation employs separate RL mod-
els for target selection and path planning in a layered
structure, enabling agents to address intricate explo-
ration tasks at the cost of higher computational over-
head (Jin et al., 2019).
Our work focuses on developing an advanced
MARL framework that addresses these spatial and re-
source constraints, as well as communication limita-
tions, to enhance exploration and coverage in multi-
agent systems. By integrating enriched state repre-
sentations, intrinsic motivation and prioritized expe-
rience adaptation, we aim to enhance agents’ explo-
ration and learning efficiency under these constraints.
2 RELATED WORK
2.1 Classical Optimization Methods
Classical optimization and heuristic methods have
laid the foundation for coverage path planning (CPP)
in multi-agent systems, especially given the NP-hard
nature of the problem (Chen et al., 2019a). Frontier-
based exploration (Yamauchi, 1997), a systematic
spatial exploration method, involves agents identify-
ing the boundary between explored and unexplored
areas, known as frontiers, and moving toward these re-
gions to maximize coverage. Cooperative frontier-
based strategies extend this approach by enabling
agents to share information and coordinate move-
ments, thereby reducing redundant exploration and
improving execution efficiency (Burgard et al., 2005).
Sweeping-based methods enable agents to sys-
tematically cover areas in coordinated patterns, typ-
ically moving in parallel or predefined formations to
ensure comprehensive coverage with minimal over-
lap (Tran et al., 2022). These approaches are effective
in both communication-enabled and communication-
free scenarios, maximizing coverage while minimiz-
ing redundancies (Sanghvi et al., 2024). Meanwhile,
biologically inspired swarming algorithms leverage
local interaction rules to achieve complex and sta-
ble coordinated behaviors (Gazi and Passino, 2004).
Building on this, decentralized swarm-based ap-
proaches have been developed for dynamic coverage
control, allowing agents to adapt to environmental
changes (Atınç et al., 2020). This adaptability proves
particularly valuable for tasks requiring real-time re-
allocation of coverage areas (Khamis et al., 2015).
Classical methods, while useful, are limited in dy-
namic and complex environments due to their reliance
on static rules. They struggle with redundant cov-
erage, limited adaptability to dynamic obstacles and
unexpected changes, and coordination challenges as
agent numbers increase. These limitations highlight
the need for learning-based approaches that enable
autonomous adaptation, improved coordination, and
effective handling of multi-agent coverage tasks.
2.2 Learning-Based Methods
Recently, learning methods have been increasingly
applied to coverage path planning tasks, with rein-
forcement learning enabling agents to learn and adapt
autonomously in complex environments (Zhelo et al.,
2018). Early studies focused on single-agent RL,
such as the application of Double Deep Q-Network
(DDQN) to train individual agents in a simulated
grid-world environment (Li et al., 2022), improving
navigation without explicit inter-agent coordination.
Meanwhile, centralized approaches like (Jin et al.,
2019) combine deep Q-networks (DQN) for target se-
lection with DDPG for adjusting agents’ rotations, fa-
cilitating coordinated movements through centralized
target selection. These approaches faced scalability
challenges as the number of agents increased.
Traditional reinforcement learning methods have been ex-
tended to multi-agent settings, enabling agents to
collaboratively learn strategies that enhance
coverage and adaptability in real time. End-to-end
MARL algorithms such as Multi-agent Proximal Pol-
icy Optimization (Chen et al., 2019b), directly learn
coordinated exploration strategies from raw sensory
data leveraging convolutional neural networks to pro-
cess multi-channel visual inputs. Similarly, a QMIX-
based algorithm with a modified loss function and
sequential action masking (Choi et al., 2022) has
been applied to improve coordination among Auto-
mated Guided Vehicles in cooperative path planning.
Two-stage MARL approaches divide tasks into high-
level decision-making and low-level execution lay-
ers. For instance, hierarchical cooperative explo-
ration (Hu et al., 2020) uses dynamic Voronoi parti-
tioning to assign unique exploration areas to agents,
while (Setyawan et al., 2022) employs two levels of
hierarchies with Multi-agent Deep Deterministic Pol-
icy Gradient (MADDPG) at each layer for effective
coordination in multi-agent coverage tasks.
On another note, exploration in multi-agent re-
inforcement learning is critical for efficient learning
and faster convergence, particularly in complex en-
vironments where traditional methods like epsilon-
greedy or noise-driven exploration fall short. Alternative methods have been proposed, such as employing graph neural networks (Zhang et al., 2022) for coarse-to-fine exploration, while a combination of DQNs and graph convolutional networks was utilized in (Luo et al., 2019) for sequential node exploration on
topological maps. Furthermore, curiosity-based intrinsic reward mechanisms have emerged as a promising technique for exploration; such a mechanism was integrated with the asynchronous advantage actor-critic (A3C) algorithm to enable effective mapless navigation in single-agent systems (Zhelo et al., 2018), demonstrating significant improvements over traditional exploration methods. Therefore, we are motivated to incor-
porate a curiosity-based intrinsic reward mechanism
to enhance exploration in multi-agent learning.
Our proposed end-to-end MARL architecture ex-
tends MADDPG (Lowe et al., 2017) to facilitate adap-
tive exploration that adheres to spatial and resource
constraints. By leveraging intrinsic rewards, enriched
state representations, and priority buffer adaptation,
it effectively addresses these challenges and achieves
superior performance in complex coverage tasks.
3 PROBLEM DEFINITION
We consider the task of coverage path planning where
the goal is to effectively and fully cover a given
environment while reducing resource consumption
and redundancy. The desired path should (a) visit all previously unvisited locations to ensure complete coverage, (b) effectively navigate around obstacles, and (c) minimize the total operational time.
Each agent operates autonomously under the follow-
ing constraints: limited perception range, constrained
battery capacity, and dynamic obstacle avoidance.
We consider a structured indoor environment E,
such as a warehouse or industrial facility, character-
ized by obstacles that influence agent movement. The
environment has dimensions M × N, where M and N
denote the length and width, respectively. Specific en-
vironment attributes, such as obstacle count, location,
and size, are generated based on a predefined prob-
ability distribution $P$. In the environment $E$, we define a set $I$ of $k$ agents, each with a specified diameter $d_a$. The environment is discretized into cells of size $d_a$, where each cell represents a fixed spatial unit. Figure 1 shows the discretization process of the multi-agent environment with dynamic and static obstacles.
Figure 1: A discretized indoor structured environment, featuring a team of five agents (purple circles) navigating dynamic and static obstacles (yellow and grey, respectively) to achieve efficient exploration and coverage.
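To make the discretization step concrete, the following minimal Python sketch samples such a grid; the function name generate_grid, the cell encoding, and the uniform placement used as a stand-in for the unspecified distribution $P$ are all our own assumptions, not the paper's implementation.

import numpy as np

def generate_grid(M=10, N=10, num_agents=3, num_obstacles=15, seed=0):
    # 0 = free cell, 1 = static obstacle, 2 = agent start cell (our own encoding).
    rng = np.random.default_rng(seed)
    grid = np.zeros((M, N), dtype=np.int8)
    cells = [(i, j) for i in range(M) for j in range(N)]
    picks = rng.choice(len(cells), size=num_obstacles + num_agents, replace=False)
    for idx in picks[:num_obstacles]:
        grid[cells[idx]] = 1        # place a static obstacle
    for idx in picks[num_obstacles:]:
        grid[cells[idx]] = 2        # place an agent start cell
    return grid

grid = generate_grid()
print((grid == 1).sum(), "obstacles,", (grid == 2).sum(), "agents")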
3.1 Problem Formalization
In our cooperative multi-agent coverage path plan-
ning environment, each agent operates with a pol-
icy guided solely by local observations rather than a
global state s, which remains unknown to all agents.
To capture this limited information access, we for-
malize the problem using a Decentralized Partially
Observable Markov Decision Process (Dec-POMDP)
framework. The objective is to learn a joint policy
$\pi = \{\pi_1, \pi_2, \ldots, \pi_k\}$, comprising individual policies that collectively optimize the cumulative discounted rewards over a defined planning horizon $h$.
Starting from an initial state $s_0$, at each time step $t$ the environment is in a global state $s_t \in S$, and each agent $i \in I$ selects an action based solely on its own local observation $o_i \in O_i$. Each agent $i$ interacts independently within its observation space $O_i$, choosing actions according to its policy $\pi_i$. After executing the joint action $a_t = (a_t^1, a_t^2, \ldots, a_t^k) \in A$, where $A$ denotes the joint action space, the environment transitions to the next state $s_{t+1} \in S$ according to the transition probability $T(s_{t+1} \mid s_t, a_t)$. Each agent receives a reward reflecting the quality of its action (Section 5.1). The Dec-POMDP embodies a fully decentralized structure, where agents independently execute actions and receive local rewards based on their actions and observations.
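As a sketch of this decentralized execution, the loop below has each agent act only on its own local observation and accumulate its local reward; the environment interface (reset/step) and the per-agent policies are hypothetical placeholders, not the paper's implementation.

def run_episode(env, policies, horizon):
    # Decentralized execution: each agent maps its own local observation to an
    # action; no agent sees the global state s_t. The env interface is assumed.
    observations = env.reset()                      # one local observation per agent
    returns = [0.0] * len(policies)
    for t in range(horizon):
        joint_action = [pi(o) for pi, o in zip(policies, observations)]
        observations, rewards, done = env.step(joint_action)
        returns = [g + r for g, r in zip(returns, rewards)]
        if done:
            break
    return returns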
4 METHODOLOGY
4.1 Multi-Agent Deep Deterministic
Policy Gradient (MADDPG)
We extend MADDPG (Lowe et al., 2017) to learn
efficient policies for coverage path planning. While
MADDPG inherently supports continuous action
spaces, our framework employs a posteriori dis-
cretization using a grid-based approach to facilitate
coverage tracking and reward computation. During
training, the centralized critic accesses the global state
and actions of all agents, capturing inter-agent depen-
dencies. However, decentralized execution enables
agents to act independently based on local observa-
tions, ensuring scalability and adaptability. While
centralized approaches are computationally demand-
ing and impractical for real-time execution, fully de-
centralized methods often result in suboptimal per-
formance due to limited awareness of global context.
MADDPG addresses these challenges by employing
a centralized critic during training and decentralized
actors for execution. Figure 2 shows the components
and interactions of the proposed framework.
The Actor Network $\pi_{\theta_i}$: Each agent $i$ selects an action $a_i = \pi_{\theta_i}(o_i)$ based on its local observation $o_i$, thereby facilitating decentralized decision-making. The objective of the actor network is to learn a deterministic policy $\pi_{\theta_i}$ that maximizes the expected cumulative reward by minimizing the following loss function $L_{\pi_i}$:
$$L_{\pi_i} = -\mathbb{E}_{o_i \sim D}\left[ Q_{\phi_i}\big(s, \pi_{\theta_i}(o_i), a_{-i}\big) \right],$$
where $D$ is the replay buffer storing past experiences for sampling, $Q_{\phi_i}$ is the centralized critic's estimate of the Q-value, and $a_{-i}$ denotes the actions taken by all agents except agent $i$. This loss function encourages each actor to take actions that maximize the centralized Q-value estimates, thereby aligning the individual agent's actions with the overall system's objective.
The Critic Network $Q_{\phi_i}$: It evaluates the quality of the joint actions by estimating the expected cumulative reward for a given state-action pair. It takes as input the full state $s$ (global information) and the joint action vector $a = (a_1, a_2, \ldots, a_N)$ and produces as output the estimated Q-value $Q_{\phi_i}(s, a)$. The critic minimizes the temporal difference (TD) error $L_{Q_i}$:
$$L_{Q_i} = \mathbb{E}_{(s, a, r, s') \sim D}\left[ \big( Q_{\phi_i}(s, a) - y \big)^2 \right],$$
where the target value $y$ is defined as:
$$y = r_i + \gamma\, \mathbb{E}_{a' \sim \pi'}\left[ Q_{\phi_i}(s', a') \right].$$
Here, $r_i$ is the agent's local reward, $\gamma$ is the discount factor that balances the influence of future and immediate rewards, and $s'$ represents the next state after taking the joint action. Minimizing the TD error improves Q-value estimation accuracy, providing meaningful feedback for actor training. This feedback ensures the development of strategies that enhance overall system performance.
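The sketch below illustrates one MADDPG-style update for agent i following the two losses above, written in PyTorch under our own assumptions about the batch layout, a two-argument critic signature, and the soft target update with factor tau from Section 5.2; for the discrete actions used in this work, a differentiable relaxation such as Gumbel-softmax would be needed in the actor step.

import torch
import torch.nn.functional as F

def update_agent_i(batch, actor_i, critic_i, target_actors, target_critic_i,
                   actor_opt, critic_opt, i, gamma=0.95, tau=0.01):
    # batch layout (assumed): per-agent obs/actions/rewards plus global states.
    obs, actions, rewards, next_obs, state, next_state = batch

    # Critic update: minimize the TD error L_Q_i against a target value y.
    with torch.no_grad():
        next_actions = [pi(o) for pi, o in zip(target_actors, next_obs)]
        y = rewards[i] + gamma * target_critic_i(next_state, torch.cat(next_actions, dim=-1))
    q = critic_i(state, torch.cat(actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize the centralized Q-value (i.e., minimize -Q).
    actions_i = list(actions)
    actions_i[i] = actor_i(obs[i])              # differentiable action for agent i
    actor_loss = -critic_i(state, torch.cat(actions_i, dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of agent i's target critic with factor tau.
    for p, tp in zip(critic_i.parameters(), target_critic_i.parameters()):
        tp.data.mul_(1 - tau).add_(tau * p.data)
    return critic_loss.item(), actor_loss.item()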
4.2 Curiosity-Based Exploration
The curiosity-driven mechanism further complements
this framework by incentivizing agents to explore un-
visited areas, overcoming the limitations of local ob-
servations in decentralized execution. Traditional ex-
ploration strategies, such as epsilon-greedy or noise-
driven methods, often prove insufficient for environ-
ments that require comprehensive coverage and thor-
ough exploration, as they may fail to guide agents ef-
fectively through obstacles and complex layouts.
The curiosity reward is learned using a self-supervised approach, driven by the prediction error between the agent's anticipated state features and the actual observed features. Specifically, it uses a feature extraction network $f_\psi$, parameterized by $\psi$, to process each state $s$ and generate a feature vector $\phi(s) = f_\psi(s)$, focusing on relevant aspects of the state while reducing complexity. Next, the extracted features serve as input to a forward dynamics model $\hat{f}_\theta$, which predicts the next state's feature vector from the current state features and the action: $\hat{\phi}(s') = \hat{f}_\theta(\phi(s), a)$.
The intrinsic reward $R_{\text{curiosity}}$ is then computed as the prediction error between the predicted feature vector $\hat{\phi}(s')$ and the actual observed feature vector $\phi(s')$:
$$R_{\text{curiosity}} = \left\lVert \phi(s') - \hat{\phi}(s') \right\rVert^2,$$
where $\lVert \cdot \rVert$ denotes the Euclidean norm. This mechanism encourages agents to explore novel states where the prediction error is high, effectively guiding them toward under-explored areas. By integrating this curiosity reward with the extrinsic rewards defined by the environment, the actor network learns policies that balance exploration and exploitation. The centralized critic further refines these policies by incorporating
global state information during training, ensuring the
learned strategies align with the overall objective.
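A minimal sketch of such a curiosity module is given below, with a convolutional feature extractor $f_\psi$ and a forward dynamics model whose prediction error serves as the intrinsic reward; the exact layer sizes, the one-hot action encoding, and the class interface are assumptions loosely based on Section 5.2.

import torch
import torch.nn as nn

class CuriosityModule(nn.Module):
    def __init__(self, obs_channels=3, grid=5, n_actions=5, feat_dim=64):
        super().__init__()
        self.features = nn.Sequential(                      # feature extractor f_psi
            nn.Conv2d(obs_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * grid * grid, feat_dim),
        )
        self.forward_model = nn.Sequential(                 # forward dynamics model
            nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )
        self.n_actions = n_actions

    def intrinsic_reward(self, s, a, s_next):
        phi, phi_next = self.features(s), self.features(s_next)
        a_onehot = nn.functional.one_hot(a, self.n_actions).float()
        phi_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        # Prediction error ||phi(s') - phi_hat(s')||^2, computed per sample.
        return ((phi_next - phi_pred) ** 2).sum(dim=-1)

cur = CuriosityModule()
s = torch.zeros(4, 3, 5, 5)            # batch of 4 local observation tensors
a = torch.randint(0, 5, (4,))          # one discrete action per sample
print(cur.intrinsic_reward(s, a, s).shape)   # -> torch.Size([4])

The same prediction error, averaged over a batch, would also serve as the training loss for the feature extractor and forward model, matching the joint optimization described later in Section 5.2.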
4.3 Tensor-Like State Representation
For each agent, the observation is represented as
a multi-layered tensor to enhance spatial awareness
and environmental understanding. This representa-
tion comprises three channels: an occupancy map
indicating agent positions and dynamic obstacles,
an obstacle map highlighting static obstacles, and a
visitation map recording the frequency of cell visits.
By structuring these features in an image-like format,
agents gain spatial context, allowing them to distin-
guish between frequently and infrequently visited re-
gions. The multi-layered configuration facilitates the
use of convolutional processing, enabling agents to
leverage spatial patterns more effectively, and ulti-
mately implement more efficient coverage strategies.
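A minimal sketch of assembling this three-channel observation for one agent is shown below; the zero-padding at the grid borders and the array conventions are our own assumptions.

import numpy as np

def build_observation(agent_positions, static_obstacles, visit_counts, window, center):
    # Stack occupancy, obstacle, and visitation maps for one agent's
    # window x window neighborhood centered on `center` (row, col).
    M, N = visit_counts.shape
    occupancy = np.zeros((M, N), dtype=np.float32)
    for (r, c) in agent_positions:                 # agents and dynamic obstacles
        occupancy[r, c] = 1.0
    obstacles = static_obstacles.astype(np.float32)
    half = window // 2
    obs = np.stack([occupancy, obstacles, visit_counts.astype(np.float32)])
    padded = np.pad(obs, ((0, 0), (half, half), (half, half)))  # zero-pad borders
    r, c = center
    return padded[:, r:r + window, c:c + window]   # shape: (3, window, window)

vis = np.zeros((10, 10)); statics = np.zeros((10, 10), dtype=bool); statics[4, 4] = True
print(build_observation([(2, 3)], statics, vis, window=5, center=(2, 3)).shape)  # (3, 5, 5)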
4.4 Priority Buffer Adaptation
In reinforcement learning, standard experience re-
play buffers employ random sampling assigning equal
probability to all experiences, treating them as equally
valuable for agent’s learning process. While effec-
tive in simpler tasks, this approach can delay learn-
ing in complex multi-agent environments where ex-
periences vary significantly in their impact on explo-
ration and coordination strategies. To address this,
we implement a Prioritized Experience Replay (PER)
buffer (Schaul, 2015) that assigns higher sampling
probabilities to more informative experiences, such as
successful exploration steps or collisions, over repet-
itive navigation. Experiences with larger temporal-
difference (TD) errors, where predictions diverge most from the observed outcomes, are prioritized, accel-
erating convergence in challenging tasks.
The priority $p_i$ of an experience $i$ is computed as:
$$p_i = (|\delta_i| + \varepsilon)^{\alpha},$$
where $\delta_i$ is the TD error, $\varepsilon$ is a small constant that ensures non-zero priority, and $\alpha \in [0, 1]$ controls the level of prioritization, with higher values placing more focus on high-error experiences. The TD error $\delta_i$ is defined as:
$$\delta_i = r + \gamma\, Q(s', a'; \theta') - Q(s, a; \theta),$$
where $r$ is the reward, $s$ and $a$ are the current state-action pair, $s'$ and $a'$ are the next state and action, $\gamma$ is the discount factor, and $Q$ represents the action-value function parameterized by $\theta$ (current network) and $\theta'$ (target network). The sampling probability of experience $i$ from the prioritized buffer is computed as:
$$P(i) = \frac{p_i}{\sum_j p_j}.$$

Figure 2: Our framework architecture, showing the critic and actor interactions, and the curiosity module.
To correct for the sampling bias introduced by prioritization, importance-sampling (IS) weights are used:
$$w_i = \left( \frac{1}{N \cdot P(i)} \right)^{\beta},$$
where $N$ is the total number of experiences and $\beta \in [0, 1]$ is gradually increased to 1 to reduce bias as learning progresses. These IS weights adjust the loss, ensuring unbiased gradient updates. By prioritizing experiences with greater informational value, the buffer enhances the model's ability to learn robust policies and accelerates convergence, particularly in environments where specific state-action pairs have a disproportionate impact on overall performance.
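The sketch below implements proportional prioritization, the sampling probability $P(i)$, and the IS weights $w_i$ as defined above; a sum-tree would normally replace the linear-time sampling used here, and the buffer interface is our own assumption.

import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)  # p_i

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities)
        probs = p / p.sum()                                   # P(i)
        idx = np.random.choice(len(self.data), batch_size, p=probs)
        weights = (len(self.data) * probs[idx]) ** (-beta)    # IS weights w_i
        weights /= weights.max()                              # common normalization for stability
        return [self.data[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha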
5 EXPERIMENTS
5.1 Task Description
In this coverage problem, we consider an environment
that simulates a real-world warehouse area where
agents move efficiently to ensure complete coverage.
This coverage may include inspection, surveillance,
or cleaning operations across the warehouse. The
warehouse layout is converted into a grid map, transforming the continuous space and actions of the environment into discretized equivalents, as shown in
Figure 1. Each grid cell $(i, j)$ represents a unique spatial location that requires coverage, with a fixed size equal to the agent diameter $d_a$, for example, 0.5 m × 0.5 m. Dynamic obstacles within the grid introduce complexity by changing their positions over time.
Agent Configuration. Each agent $a_k$ in the environment has the following characteristics:
Battery Capacity $B_k$: Maximum energy the agent can utilize before recharging is required.
Perception Range $P_k$: Fixed distance within which the agent observes nearby obstacles, agents, and grid cells.
Neighborhood Set $N_k(t)$: All observable entities within the agent's perception range $P_k$, including dynamic and static obstacles and other agents.
Observation Space. While the centralized critic has full observability of the environment, the actor of each agent operates under partial observability, relying only on localized information rather than complete global awareness. Each agent is limited to its perception range, observing information within a $z \times z$ perception window centered on $a_k$'s position. For example, within a 5 × 5 grid, the agent observes obstacles, other agents, and the cell coverage status within this window. Agents occupy one grid cell at a time and can move to adjacent cells if those cells are valid. Figure 3 illustrates the observations accessible to agents within their perception range across various scenarios of a discretized environment.
Figure 3: The observations and perception range of each agent in a discretized environment, illustrated for a 5 × 5 perception range, a 3 × 3 perception range, and a 3 × 3 perception range at a lower boundary point.
Action Space. Each agent has a discrete action space $A_i$, representing the set of all possible actions it can take in the environment. This includes the primitive actions of moving in the standard 2D directions, as well as the NoOp (No Operation) action, used when no other action is appropriate or necessary:
$$A_i = \{\text{Up}, \text{Down}, \text{Left}, \text{Right}, \text{NoOp}\}.$$
At each time step $t$, agent $a_k$ selects an action to navigate the grid. All agents share the same travel velocity for simplicity.
Dynamics. State transitions are influenced by the actions of all agents and the movement of dynamic obstacles. The state $s(t+1)$ depends on the joint action $\{a_k(t)\}$ of all agents and on the obstacle dynamics. Agents transition to specific cells based on their chosen actions.
Action Masking. Agents face significant challenges
in avoiding collisions, such as moving into walls or
static obstacles. To address this, we employ an action
masking mechanism that preemptively restricts infea-
sible actions, ensuring safer navigation and minimiz-
ing uninformative experiences that could hinder train-
ing. However, collisions with dynamic obstacles are
managed separately through the reward mechanism.
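A minimal sketch of this masking step is shown below; the action indexing and grid encoding are our own conventions, and in practice the mask would zero out the probabilities (or set the logits to a large negative value) of disallowed actions before sampling.

import numpy as np

MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}  # Up, Down, Left, Right, NoOp

def feasible_action_mask(grid, pos):
    # Return a boolean mask over {Up, Down, Left, Right, NoOp}; True = allowed.
    M, N = grid.shape
    mask = np.zeros(len(MOVES), dtype=bool)
    for a, (dr, dc) in MOVES.items():
        r, c = pos[0] + dr, pos[1] + dc
        inside = 0 <= r < M and 0 <= c < N
        mask[a] = inside and grid[r, c] != 1      # 1 marks a static obstacle
    return mask

grid = np.zeros((5, 5), dtype=int); grid[0, 1] = 1     # obstacle to the right of (0, 0)
print(feasible_action_mask(grid, (0, 0)))              # Up/Left blocked by the boundary, Right by the obstacle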
Reward Signal. Our environment operates under a dense reward setting, where each agent receives a reward signal at every timestep. This signal encourages efficient coverage by rewarding agents for covering new cells, penalizing repeated visits to already-covered cells, and discouraging excessive energy usage, thus incentivizing agents to prioritize uncovered cells until full coverage is achieved.
Coverage Reward $R_{\text{cover}}$: A positive reward for covering an uncovered cell $(i, j)$, i.e., a cell whose total visitation count across all agents, $V_{\text{tot}}(i, j)$, is equal to zero. This encourages agents to prioritize new areas for coverage.
$$R_{\text{cover}} = \begin{cases} +10 & \text{if } V_{\text{tot}}(i, j) = 0 \text{ (cell is uncovered)} \\ 0 & \text{otherwise} \end{cases}$$
Overlapping Penalty $R_{\text{overlap}}$: A negative penalty is applied when an agent visits a cell $(i, j)$ already covered by any agent, with $\lambda$ controlling the severity. This discourages frequent overlaps, encouraging agents to minimize revisits to previously covered areas:
$$R_{\text{overlap}} = \begin{cases} -\min\!\left(0.5 \cdot V_{\text{tot}}(i, j)^{1.5},\; 5\right) & \text{if } (i, j) \text{ is covered} \\ 0 & \text{otherwise} \end{cases}$$
Energy-Aware Penalty $R_{\text{energy}}$: To account for resource constraints, a negative penalty proportional to energy usage is applied to promote efficient path choices. The energy consumption $E_k(t)$ for agent $a_k$ is constrained by:
$$\sum_{t=0}^{T} E_k(t) \leq B_k, \quad \forall k \in I,$$
where $E_k(t)$ denotes the total energy used along the agent's path up to timestep $t$, based on movement and turns. The penalty $R_{\text{energy}}$ is defined as:
$$R_{\text{energy}} = \begin{cases} -(\lambda_1 \cdot d + \lambda_2 \cdot \theta) & \text{if } a_k \text{ moves or turns} \\ -\lambda_3 & \text{if } a_k \text{ stays (NoOp)} \end{cases}$$
Here, $d$ is the per-timestep distance traveled, $\theta$ is the turning angle, and the constants $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent the energy consumed per unit distance, per degree of turn, and for NoOp actions, respectively. A small penalty $\lambda_3$ encourages agents to move instead of getting stuck. For this work, we use $\lambda_1 = 0.11$, $\lambda_2 = 0$, and $\lambda_3 = 0.15$, as rotation actions are not included in the action space.
Collision Penalty $R_{\text{collision}}$: A penalty is applied when the distance between agent $a_k$ and any obstacle within its perception range, $obs \in O_k$, falls below a predefined margin $d_{\text{margin}}$, or if the agent's new position is occupied by other agents or obstacles. This mechanism encourages agents to avoid collisions and maintain safe navigation.
$$R_{\text{collision}} = \begin{cases} -1 & \text{if } d_{\text{Manhattan}}(a_k, obs) \leq d_{\text{margin}} \\ -10 & \text{if } (x_k(t+1), y_k(t+1)) \text{ is occupied} \\ 0 & \text{otherwise} \end{cases}$$
where $d_{\text{Manhattan}}(a_k, obs) = |x_k - x_{obs}| + |y_k - y_{obs}|$ is the Manhattan distance between the agent and any obstacle in its perception range.
The cumulative reward $R_k(t)$ for agent $a_k$ at each timestep $t$ is defined as:
$$R_k(t) = R_{\text{cover}} + R_{\text{overlap}} + R_{\text{energy}} + R_{\text{collision}} \quad (1)$$
Finally, the total reward signal incorporates the curiosity-based intrinsic reward to promote efficient exploration during training:
$$R_{\text{total}} = R_k(t) + \alpha_{\text{curiosity}} \cdot R_{\text{curiosity}} \quad (2)$$
The weight $\alpha_{\text{curiosity}} \in [0, 1]$ controls the contribution of the curiosity reward to the final reward signal.
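Putting the terms of Eq. (1) together, the sketch below computes the extrinsic reward for one agent and one timestep using the constants stated above; the boolean helper arguments (moved, near_obstacle, blocked) are our own simplifications of the environment state.

def step_reward(visit_count, moved, distance, near_obstacle, blocked):
    # Eq. (1): coverage + overlap + energy + collision terms for one agent/timestep.
    r_cover = 10.0 if visit_count == 0 else 0.0
    r_overlap = -min(0.5 * visit_count ** 1.5, 5.0) if visit_count > 0 else 0.0
    r_energy = -0.11 * distance if moved else -0.15          # lambda_2 = 0 (no turning cost)
    r_collision = -10.0 if blocked else (-1.0 if near_obstacle else 0.0)
    return r_cover + r_overlap + r_energy + r_collision

# Covering a new cell one step away, with no nearby obstacles:
print(step_reward(visit_count=0, moved=True, distance=1, near_obstacle=False, blocked=False))  # 9.89

The total training signal would then be obtained by adding $\alpha_{\text{curiosity}} \cdot R_{\text{curiosity}}$ as in Eq. (2).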
5.2 Implementation Details
In this section, we detail the implementation setup of
our experiments, focusing on the application of our
framework in a warehouse environment. The exper-
iments were conducted on two discretized environ-
ments of sizes 10 × 10 and 20 × 20, representing 5 m × 5 m and 10 m × 10 m warehouses, respectively. These environments feature three agents in the smaller grid and
five in the larger one, along with static obstacles (15
and 40, respectively). The agents’ objective was to
achieve complete coverage of the grid while minimiz-
ing redundant exploration under spatial constraints.
In our experiments, we consider two levels of obser-
vation:
Proximal Information Level (PIL) provided
agents with basic positional information about
their immediate neighborhood, denoted as N
k
(t).
Agents only perceive their immediate surrounding
cells, leading to limited situational awareness.
Enriched Tensor-based Representation (ETR)
introduced an extended observational details, in-
corporating our proposed tensor-like state repre-
sentation in addition to historical visitation fre-
quencies and coverage.
In MADDPG, each agent has actor and critic net-
works. In the Proximal Information Level, both net-
works are feedforward: the actor has three layers with
ReLU activation and a softmax output, while the critic
uses four layers with ReLU activation to estimate Q-
values for joint actions. In the Enriched Tensor-Based
Representation (ETR), convolutional layers process
tensor-like observations, capturing spatial and tem-
poral dependencies. The actor processes local ob-
servations via CNNs (16, 32 filters, 3 × 3, stride 1,
ReLU), followed by a 64-unit LSTM layer and fully
connected layers with 256, 128 units, producing ac-
tion probabilities via a softmax layer. The LSTM
addresses partial observability. The critic processes
global state information through two convolutional
layers and fully connected layers (256, 128, and 64
units), integrating a coverage progress tensor for Q-
value computation.
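A sketch of the ETR actor consistent with this description is given below (PyTorch); the padding, the handling of the single-step LSTM sequence, and the class interface are assumptions not specified in the paper.

import torch
import torch.nn as nn

class ETRActor(nn.Module):
    def __init__(self, in_channels=3, window=5, n_actions=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, stride=1, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=1, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.lstm = nn.LSTM(32 * window * window, 64, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(64, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, obs, hidden=None):
        # obs: (batch, channels, window, window) for a single timestep.
        x = self.cnn(obs).unsqueeze(1)           # add a sequence dimension of length 1
        out, hidden = self.lstm(x, hidden)
        return self.head(out.squeeze(1)), hidden

actor = ETRActor()
probs, _ = actor(torch.zeros(2, 3, 5, 5))
print(probs.shape, probs.sum(dim=-1))            # (2, 5); each row sums to 1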
The curiosity module includes a feature extraction
network with two convolutional layers, followed by
a fully connected layer producing a 64-dimensional
embedding. This embedding is used by the forward
dynamics model, which consists of two fully con-
nected layers with 64 and 32 units, followed by ReLU
activation, to predict the next state’s features. All
models are jointly optimized using Adam optimizers with learning rates of $5 \times 10^{-4}$ for the actors, $10^{-3}$ for the critics, and $10^{-3}$ for the curiosity module. Training is conducted with a replay buffer size of $10^{5}$, a batch size of 128, a discount factor $\gamma = 0.95$, and a soft update factor $\tau = 0.01$. Each episode comprises 90 or 140 steps, depending on the environment size.
The experimental evaluation covered the following
configurations:
1. MADDPG (Proximal Information Level):
Standard MADDPG using basic positional
information, without intrinsic rewards or en-
hancements. Agents received limited local
observations from a 3 × 3 and a 9 × 9 window.
Table 1: Comparison of coverage percentages between MADDPG and our approach across different observation levels and window sizes.

Configuration   | PIL 3x3  | PIL 9x9  | ETR 5x5  | ETR 17x17
MADDPG          | 56.47%   | 65.88%   | 50.88%   | 35.67%
Our Approach    | 72.94%   | 86.24%   | 85.94%   | 90.66%

(PIL: Proximal Information Level; ETR: Enriched Tensor-Based Representation.)
2. Our framework (Enriched Tensor-based Rep-
resentation): Enhanced MADDPG utilizing en-
riched observation, intrinsic curiosity-based re-
wards, and Prioritized Experience Replay (PER).
The observation window is 5 × 5 and 17 × 17.
Average cumulative rewards and coverage percent-
ages were recorded over 5000 episodes. Cover-
age was measured as the ratio of visited cells to
the total number of grid cells, excluding obstacle
cells. Agent behaviors were visualized using visita-
tion maps, which highlighted the distribution of agent
visits and identified areas with insufficient coverage.
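For reference, the coverage metric can be computed as in the short sketch below, with the obstacle mask and visitation counts following our own array conventions.

import numpy as np

def coverage_percentage(visit_counts, obstacle_mask):
    # Ratio of visited cells to all coverable (non-obstacle) cells, in percent.
    coverable = ~obstacle_mask
    visited = (visit_counts > 0) & coverable
    return 100.0 * visited.sum() / coverable.sum()

visits = np.zeros((10, 10), dtype=int); visits[:5, :] = 1      # half the grid visited
obstacles = np.zeros((10, 10), dtype=bool)
print(coverage_percentage(visits, obstacles))                   # 50.0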
6 RESULTS AND DISCUSSIONS
In our initial experiments under proximal information
level (PIL), each agent is provided with a limited ob-
servation window containing only basic positional in-
formation of agent’s neighborhood set N
k
(t) within
its perception range. This local observation offers
a minimal situational awareness, lacking any global
context or historical visitation data. Despite exten-
sive training across multiple episodes, as shown in
Figure 4b, agents under this configuration consistently fail to
achieve effective coverage of the environment, capped
at 56.47% for a 3x3 window and 65.88% for a 9x9
window, as indicated in Table 1. Their movement pat-
terns are highly repetitive, and they often become con-
fined to specific areas, resulting in significant gaps in
overall coverage, as shown in Figure 4a. This subopti-
mal behavior indicates that, even with the exploration
noise embedded in MADDPG, the agents struggle to
explore the environment effectively. The limited ob-
servational information in PIL restricts each agent’s
perspective to its immediate surroundings, making it
challenging to make informed movement decisions
that promote efficient coverage. Consequently, agents
often revisit previously explored cells rather than dis-
covering unvisited parts of the grid, impeding com-
prehensive exploration.
Furthermore, in the more challenging 20x20 envi-
ronment, MADDPG with PIL showed a similar trend
of limited coverage. Using a 5x5 window, MAD-
DPG agents achieved only 50.88% coverage, strug-
gling to adapt their strategies effectively in a larger
space as shown in Figure 5a and Figure 5b. When
the window was further expanded to 17x17, cover-
age unexpectedly dropped to 35.67%, a relative decrease of 29.89% from the performance in the 5x5 configuration. This decline could be attributed to the
sparsity of useful information in the PIL observation
space. The larger observation window introduces a
majority of zero values, representing empty space,
with only limited non-zero values for obstacles and
agent IDs. This representation lacks the necessary
variation for learning meaningful policies, effectively
overwhelming the learning process and reducing the
agents’ ability to extract relevant environmental fea-
tures. Consequently, agents failed to prioritize unex-
plored regions, leading to confined and unproductive
movement patterns.
To address these limitations, we incorporated an
enriched tensor-based representation that expands the
observation space to include additional layers, such
as historical visitation frequency in an image-like for-
mat, and used CNN to process the input efficiently.
This enriched observation, along with the curiosity-
based rewards significantly enhanced agent perfor-
mance across all scenarios by incorporating histori-
cal visitation data and additional spatial context into
the observation space. In the 10x10 environment,
ETR achieved 72.94% coverage with a 3x3 observa-
tion window, marking a 29.23% improvement over
MADDPG with PIL. Figure 4c and Figure 4d show an example of the coverage and the average rewards obtained by our framework with PIL. Expanding the observation
window to 9x9 further boosted coverage to 86.24%,
a 30.91% improvement over the PIL configuration.
This demonstrates that historical context and spatial
layering facilitate more strategic exploration, allow-
ing agents to prioritize unexplored areas systemati-
cally and make more informed decisions.
Furthermore, with a 17x17 observation window,
our proposed configuration achieved a notable cov-
erage level of 90.66% and a stable learning and in-
creased cumulative rewards over training episodes as
shown in Figure 5c and Figure 5d respectively. This
result reflects a substantial 42.63% absolute improvement in coverage under ETR over the PIL configuration with the same settings. This enhancement is at-
tributed to ETR’s spatial encoding and extended cov-
Figure 4: Performance in the 10x10 environment with three agents under 3x3 observation windows. Subfigures (a) and (b) depict area coverage and average cumulative rewards using MADDPG with PIL, while subfigures (c) and (d) show area coverage and average cumulative rewards using our framework with PIL.
Figure 5: Performance in the 20x20 environment with five agents. Subfigures (a) and (b) depict area coverage and average
cumulative rewards using MADDPG with a 5x5 observation window, while subfigures (c) and (d) show area coverage and
average cumulative rewards using our framework with ETR and a 17x17 observation window.
erage observations, which effectively address the lim-
itations associated with the sparse, zero-dominated
observations characteristic of PIL. By incorporating
meaningful spatial context and past visitation data,
ETR enables agents to avoid redundant coverage and
strategically prioritize unexplored areas. The integra-
tion of curiosity-driven intrinsic rewards further en-
hances exploration efficiency by incentivizing agents
to seek novel states, promoting balanced and thorough
exploration.
Overall, the results demonstrated an improvement
of up to 42.63% in absolute coverage compared to
baseline approaches, underscoring the effectiveness
of combining enriched observation representations
with intrinsic rewards. Our framework advances the
state-of-the-art MADDPG in the area coverage prob-
lem under spatial constraints, promoting coordinated
and comprehensive exploration in multi-agent envi-
ronments.
7 CONCLUSION
The rising demand for automated and efficient area
coverage in structured environments, such as ware-
houses and industrial facilities, highlights the need
for robust solutions to enhance exploration and coor-
dination in multi-agent environments under resource
limitations and complex spatial constraints. The pro-
posed framework integrates an enriched tensor-based
representation and prioritized experience replay with
curiosity-driven intrinsic rewards to utilize spatial en-
coding and visitation history, driving agents to ex-
plore novel and less-visited areas. Prioritized expe-
rience sampling further enhances the model's ability to learn from the most informative experiences, promoting successful navigation and exploration. To-
gether, these components foster an adaptive and ex-
ploratory learning process, particularly effective in
spatially constrained environments with limited infor-
mation where traditional strategies fall short.
Experimental results demonstrated the efficacy of
the proposed framework, achieving up to a 42.63%
improvement in grid coverage compared to the vanilla
PIL-based framework. This improvement was espe-
cially significant in larger and more complex environ-
ments, where ETR facilitated systematic navigation
and comprehensive coverage with reduced overlap.
Notably, during training, the centralized critic sup-
ports inter-agent coordination by leveraging global
state information. Post-training, agents rely solely on decentralized actor policies and local observations, ensuring scalability and responsiveness for real-world robotics with minimal computational overhead.
In conclusion, the proposed framework success-
fully advances the state of the art in MARL for area
coverage tasks by addressing spatial and resource
constraints. Its ability to integrate enriched observa-
tions, prioritize meaningful experiences, and promote
adaptive exploration makes it a promising solution
for real-world applications in structured and resource-
constrained environments.
ACKNOWLEDGMENTS
The second author was in part supported by a research
grant from Google.
REFERENCES
Atınç, G. M., Stipanović, D. M., and Voulgaris, P. G.
(2020). A swarm-based approach to dynamic cov-
erage control of multi-agent systems. IEEE Trans-
actions on Control Systems Technology, 28(5):2051–
2062.
Burgard, W., Moors, M., Stachniss, C., and Schneider, F. E.
(2005). Coordinated multi-robot exploration. IEEE
Transactions on Robotics, 21(3):376–386.
Chen, X., Tucker, T. M., Kurfess, T. R., and Vuduc, R.
(2019a). Adaptive deep path: efficient coverage of a
known environment under various configurations. In
2019 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS), pages 3549–3556.
IEEE.
Chen, Z., Subagdja, B., and Tan, A.-H. (2019b). End-to-end
deep reinforcement learning for multi-agent collabo-
rative exploration. In 2019 IEEE International Con-
ference on Agents (ICA), pages 99–102. IEEE.
Choi, H.-B., Kim, J.-B., Han, Y.-H., Oh, S.-W., and Kim,
K. (2022). Marl-based cooperative multi-agv con-
trol in warehouse systems. IEEE Access, 10:100478–
100488.
Garaffa, L. C., Basso, M., Konzen, A. A., and de Fre-
itas, E. P. (2021). Reinforcement learning for mobile
robotics exploration: A survey. IEEE Transactions on
Neural Networks and Learning Systems, 34(8):3796–
3810.
Gazi, V. and Passino, K. M. (2004). Stability analysis of
swarms. IEEE Transactions on Automatic Control,
48(4):692–697.
Ghaddar, A. and Merei, A. (2020). Eaoa: Energy-aware
grid-based 3d-obstacle avoidance in coverage path
planning for uavs. Future Internet, 12(2):29.
Hu, J., Niu, H., Carrasco, J., Lennox, B., and Arvin, F.
(2020). Voronoi-based multi-robot autonomous ex-
ploration in unknown environments via deep rein-
forcement learning. IEEE Transactions on Vehicular
Technology, 69(12):14413–14423.
Jin, Y., Zhang, Y., Yuan, J., and Zhang, X. (2019). Efficient
multi-agent cooperative navigation in unknown envi-
ronments with interlaced deep reinforcement learn-
ing. In ICASSP 2019-2019 IEEE International Con-
ference on Acoustics, Speech and Signal Processing
(ICASSP), pages 2897–2901. IEEE.
Khamis, A., Hussein, A., and Elmogy, A. (2015). Multi-
robot task allocation: A review of the state-of-the-art.
Cooperative robots and sensor networks 2015, pages
31–51.
Li, W., Zhao, T., and Dian, S. (2022). Multirobot coverage
path planning based on deep q-network in unknown
environment. Journal of Robotics, 2022(1):6825902.
Lowe, R., Wu, Y. I., Tamar, A., Harb, J., Pieter Abbeel,
O., and Mordatch, I. (2017). Multi-agent actor-critic
for mixed cooperative-competitive environments. Ad-
vances in neural information processing systems, 30.
Luo, T., Subagdja, B., Wang, D., and Tan, A.-H. (2019). Multi-agent collaborative exploration through graph-based deep reinforcement learning. In 2019 IEEE International Conference on Agents (ICA), pages 2–7. IEEE.
Orr, J. and Dutta, A. (2023). Multi-agent deep reinforce-
ment learning for multi-robot applications: A survey.
Sensors, 23(7):3625.
Sanghvi, N., Niyogi, R., and Milani, A. (2024). Sweeping-
based multi-robot exploration in an unknown environ-
ment using webots. In ICAART (1), pages 248–255.
Schaul, T. (2015). Prioritized experience replay. arXiv
preprint arXiv:1511.05952.
Setyawan, G. E., Hartono, P., and Sawada, H. (2022). Coop-
erative multi-robot hierarchical reinforcement learn-
ing. Int. J. Adv. Comput. Sci. Appl, 13:35–44.
Stentz, A. (1994). Optimal and efficient path planning for
partially-known environments. In Proceedings of the
1994 IEEE international conference on robotics and
automation, pages 3310–3317. IEEE.
Tan, C. S., Mohd-Mokhtar, R., and Arshad, M. R. (2021).
A comprehensive review of coverage path planning
in robotics using classical and heuristic algorithms.
IEEE Access, 9:119310–119342.
Tran, V. P., Garratt, M. A., Kasmarik, K., Anavatti, S. G.,
and Abpeikar, S. (2022). Frontier-led swarming: Ro-
bust multi-robot coverage of unknown environments.
Swarm and Evolutionary Computation, 75:101171.
Yamauchi, B. (1997). A frontier-based approach for au-
tonomous exploration. In Proceedings 1997 IEEE In-
ternational Symposium on Computational Intelligence
in Robotics and Automation CIRA’97.’Towards New
Computational Principles for Robotics and Automa-
tion’, pages 146–151. IEEE.
Yanguas-Rojas, D. and Mojica-Nava, E. (2017). Explo-
ration with heterogeneous robots networks for search
and rescue. IFAC-PapersOnLine, 50(1):7935–7940.
Zhang, H., Cheng, J., Zhang, L., Li, Y., and Zhang, W.
(2022). H2gnn: Hierarchical-hops graph neural net-
works for multi-robot exploration in unknown envi-
ronments. IEEE Robotics and Automation Letters,
7(2):3435–3442.
Zhelo, O., Zhang, J., Tai, L., Liu, M., and Burgard,
W. (2018). Curiosity-driven exploration for mapless
navigation with deep reinforcement learning. arXiv
preprint arXiv:1804.00456.