
ternative plans, instead of just time-separating the agents (i.e., adding wait actions). A plan with alternatives has a tree-like structure with one main branch, the original classical plan, and several other branches (alternatives) rooted in the main branch. During execution, each agent starts by following its main plan. If an agent experiences a delay big enough to cause a collision, it may be detoured to one of the alternative branches. Mainly for reasons of computational feasibility, their method only provides alternatives for the main plan, not alternatives for alternatives. Furthermore, the absence of collisions is guaranteed only for pairs of agents that both follow the main plan and for pairs in which one agent follows its main plan and the other its alternative plan. There is no guarantee for pairs of agents that both follow an alternative plan.
Another approach to combat unexpected delays is p-robustness (Atzmon et al., 2020), which aims to find a solution that will be executed successfully (i.e., without collisions) with probability at least p.
Shahar et al. (Shahar et al., 2021) worked with a different model of delays, called MAPF with Time Uncertainty (MAPF-TU). In a MAPF-TU instance, each action (edge in the graph) is assigned an interval of possible durations (in timesteps); the actual duration of the action during execution is then some number from this interval. The task of MAPF-TU is to find a safe solution, i.e., a solution that guarantees no collisions during execution for any possible combination of actual action durations.
2 PRELIMINARIES
2.1 Conflict Based Search
Conflict Based Search (CBS), introduced by (Sharon et al., 2012), is one of the most widely used algorithms for solving MAPF. CBS is a two-level algorithm: the high level resolves the coordination of agents, while the low level searches for plans (shortest paths) for each agent individually. This modularity makes CBS easy to modify for different variants of MAPF, including ours. In this section, we describe the basic variant of CBS in detail.
Firstly, let us define some basic notions. A conflict of agents a_i and a_j that, according to the current solution, should simultaneously be located in a vertex v at a timestep t is denoted by (a_i, a_j, t, v). A constraint for an agent a_i is a tuple (a_i, v, t) denoting that the agent is prohibited from being in vertex v at timestep t. A solution (set of plans) is consistent with a given set of constraints C if all plans respect the given constraints. A solution is valid if there are no conflicts among agents.
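For illustration, the following minimal Python sketch (ours; the names Conflict, Constraint and is_consistent are not taken from the CBS literature) shows one straightforward way to represent these notions and to test a single agent's plan against a set of constraints.

from collections import namedtuple

# Illustrative containers for the notions defined above (names are ours).
Conflict = namedtuple("Conflict", ["ai", "aj", "t", "v"])    # agents ai, aj meet in vertex v at timestep t
Constraint = namedtuple("Constraint", ["a", "v", "t"])       # agent a must not be in vertex v at timestep t

def is_consistent(plan, agent, constraints):
    # A plan (list of vertices indexed by timestep) is consistent with the
    # constraint set if it violates none of the constraints on this agent.
    return all(not (c.a == agent and c.t < len(plan) and plan[c.t] == c.v)
               for c in constraints)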
The idea of CBS is as follows. We maintain a set of constraints, initially empty. We find the cheapest consistent solution and check it for validity. If there is any conflict, it is resolved by adding a constraint that prevents it. We iterate this step until a valid solution is found. A conflict (a_i, a_j, t, v) can be resolved by adding one of the constraints (a_i, v, t) or (a_j, v, t). In order to obtain an optimal solution, we need to explore both options, which ultimately leads to the so-called Constraint Tree. Optimization is handled by the high level of CBS, which searches the Constraint Tree using best-first search and thereby ensures that an optimal solution is found.
The task of the low-level algorithm is to compute a consistent solution in each step. This can be done individually for each agent: it is essentially a shortest-path problem that additionally has to avoid the prohibited states.
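The following minimal Python sketch (ours, not the original CBS pseudocode) illustrates this two-level structure. It assumes a helper low_level(agent, constraints) that returns a cheapest plan respecting the given constraints (e.g., a space-time A* search), handles only vertex conflicts, and uses the sum of plan lengths as the cost.

import heapq

def cbs(agents, low_level):
    # High-level CBS: best-first search over the Constraint Tree.
    # low_level(agent, constraints) must return a cheapest plan for agent
    # (a list of vertices indexed by timestep) consistent with constraints,
    # or None if no such plan exists.
    constraints = frozenset()
    plans = {a: low_level(a, constraints) for a in agents}
    if any(p is None for p in plans.values()):
        return None
    open_list = [(cost(plans), 0, constraints, plans)]
    tie = 1                                    # unique tie-breaker for the heap
    while open_list:
        _, _, constraints, plans = heapq.heappop(open_list)
        conflict = first_conflict(plans)       # (ai, aj, t, v) or None
        if conflict is None:
            return plans                       # valid, and optimal by best-first order
        ai, aj, t, v = conflict
        for agent in (ai, aj):                 # explore both ways to resolve it
            child = constraints | {(agent, v, t)}
            new_plans = dict(plans)
            new_plans[agent] = low_level(agent, child)
            if new_plans[agent] is not None:
                heapq.heappush(open_list, (cost(new_plans), tie, child, new_plans))
                tie += 1
    return None                                # no valid solution exists

def cost(plans):
    # Sum-of-costs objective: total length of all individual plans.
    return sum(len(p) for p in plans.values())

def first_conflict(plans):
    # Earliest vertex conflict (ai, aj, t, v); edge conflicts are omitted in this sketch.
    horizon = max(len(p) for p in plans.values())
    for t in range(horizon):
        occupied = {}
        for agent, plan in plans.items():
            v = plan[min(t, len(plan) - 1)]    # agents wait at their goals
            if v in occupied:
                return (occupied[v], agent, t, v)
            occupied[v] = agent
    return None

The unique tie counter only keeps the heap comparisons well-defined; any tie-breaking rule preserves the optimality of the best-first search.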
Over recent years, CBS has received several improvements that make it a state-of-the-art algorithm for classical MAPF, such as the Conflict Avoidance Table (CAT) (Sharon et al., 2012), Prioritizing Conflicts (PC) and Conflict Bypassing (Li et al., 2019), or Corridor Reasoning (Li et al., 2021b).
2.2 Markov Decision Process
A Markov Decision Process (MDP) is a sequential decision process in a non-deterministic (stochastic) environment. Following Sutton and Barto (Sutton and Barto, 2018), an MDP is a tuple (S, A, R, p). S denotes a finite set of states, including the starting state s_0, A is a finite set of actions, R is a set of possible rewards, and the function p: S × R × S × A → [0,1] describes the dynamics of the system: the value p(s′, r | s, a) is the probability of reaching state s′ and obtaining reward r when the agent executes action a in state s. Therefore, p is required to satisfy the following property:

∑_{s′ ∈ S} ∑_{r ∈ R} p(s′, r | s, a) = 1,  for all s ∈ S, a ∈ A.

In other words, for each choice of s and a, p specifies a probability distribution over {(s′, r)}.
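As a toy illustration (our own example, not taken from Sutton and Barto), the dynamics p of a small MDP can be stored as a table mapping each pair (s, a) to a distribution over (s′, r) pairs, and the normalization property above can be checked directly:

# Toy two-state MDP: states s0, s1; actions "stay", "move"; rewards 0 and 1.
# p[(s, a)] maps (s_next, reward) -> probability.
p = {
    ("s0", "move"): {("s1", 1): 0.8, ("s0", 0): 0.2},
    ("s0", "stay"): {("s0", 0): 1.0},
    ("s1", "move"): {("s0", 0): 1.0},
    ("s1", "stay"): {("s1", 1): 1.0},
}

# Each (s, a) pair must define a probability distribution over (s', r) pairs.
for (s, a), dist in p.items():
    assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)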
A solution to the MDP is a policy π : S → A, i.e.,
a function that for each state s recommends an action
π(s) that the agent should perform in that state.
The quality of a policy π is measured using the return: the cumulative sum of rewards obtained by the agent in the environment, possibly discounted by a discount factor γ (a real number between 0 and 1). The purpose of the discount factor is to decrease the influence of temporally distant rewards on the return and to ensure that the return is a finite number even in the case of an infinite horizon.
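For instance (an illustrative computation only), with rewards r_t collected at timesteps t = 0, 1, 2, ..., the discounted return is ∑_t γ^t · r_t:

def discounted_return(rewards, gamma=0.9):
    # Cumulative discounted sum of a (finite prefix of a) reward sequence.
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# With gamma = 0.9, rewards [1, 1, 1] give 1 + 0.9 + 0.81 = 2.71.
assert abs(discounted_return([1, 1, 1]) - 2.71) < 1e-9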