Figure 1: Laboratory benchmark model for MARL evaluation. Each axis is 400 pixels long, corresponding to a physical length of 1 meter.
In the laboratory benchmark setup, spherical robots move within a predefined area. A camera mounted on the ceiling continuously tracks the movement of the robots and records their positions. The manoeuvring range of the robots is restricted to the area visible to the camera. Two goal landmarks are placed at specific locations inside this area, and the spherical robots are initially positioned at two base stations within it. The robots know their positions relative to each other as well as the positions of the landmarks and the base stations. In the benchmark scenario, the robots are requested to simultaneously move to one of the goal landmarks without colliding with each other, stay at that location for a short period of time, and then return to the base stations.
The remainder of this paper is organized as follows. Section 2 presents an evaluation and comparison of three suitable MARL algorithms based on a simulation of a self-defined environment. Section 3 introduces the detailed laboratory benchmark setup and the design of the final MAS. To validate the proposed simulation-based learning method, Section 4 presents the implementation of the MAS and the experimental results, which are compared to results from the simulated environment. The paper ends with concluding remarks and suggestions for future work.
2 MULTI-AGENT REINFORCEMENT LEARNING
In RL, an agent situated in an environment learns which action to take in a particular environmental state in order to maximize its total received reward. The agent discovers the best actions for a given state by trying them. Finite Markov decision processes (MDPs) are mathematically idealized forms of RL problems. The agent perceives its environment and, after making a decision, takes an action, which leads to a state transition in the environment and a reward for the agent. The MARL frameworks introduced here are based on the MDP. The difference from single-agent RL, however, is that the actions of other agents affect the environment as well. This leads to a non-deterministic interaction between an agent and the environment it acts in. Following the assumptions for MASs as stated in (Poole and Mackworth, 2017), existing approaches integrate developments in the areas of single-agent RL, game theory, and direct policy search techniques (Busoniu et al., 2008).
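For reference, the single-agent setting underlying these frameworks can be summarized by the standard finite MDP tuple and its action-value function; the notation below is the common textbook form and is not taken verbatim from the cited works.

```latex
% Standard finite MDP notation: states S, actions A, transition kernel P,
% reward function R and discount factor gamma (conventional symbols, not
% the specific definitions used later in this paper).
\[
  \mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma),
  \qquad P(s' \mid s, a), \qquad R(s, a),
\]
\[
  Q^{\pi}(s, a) \;=\; \mathbb{E}_{\pi}\!\left[\,\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)
  \;\middle|\; s_0 = s,\ a_0 = a \right].
\]
```

In the multi-agent case, the transition and reward additionally depend on the joint action of all agents, which is what makes the interaction non-deterministic from the viewpoint of a single agent.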
In (Matignon et al., 2007), a comparison of basic Q-learning algorithms is presented. Centralized Q-learning shows good performance, but it has a high information demand and a larger state-action space to be maintained. Decentralized Q-learning reduces the state-action space. Noticeably, an agent can be punished even if it takes a correct action, because other agents may take wrong actions and the resulting joint action then leads to punishment. This can be avoided by the distributed Q-learning method, which only allows Q-values to increase. A key issue with distributed Q-learning is that it does not guarantee convergence to the optimal joint policy in difficult coordination scenarios. For this reason, hysteretic Q-learning has been proposed (Matignon et al., 2007). This learning method is decentralized in the sense that each agent builds its own Q-table, whose size is independent of the number of agents in the environment and a linear function of its own actions. According to (Matignon et al., 2007), the performance of hysteretic Q-learning is similar to that of centralized algorithms while much smaller Q-value tables are used.
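The core of the hysteretic update from (Matignon et al., 2007) can be sketched as follows; the tabular layout and the concrete learning rates are illustrative assumptions, not the parameters used later in this paper.

```python
# Sketch of one hysteretic Q-learning step (after Matignon et al., 2007).
# The Q-table layout and the values of alpha, beta, gamma are illustrative.
import numpy as np

def hysteretic_update(Q, s, a, r, s_next, alpha=0.1, beta=0.01, gamma=0.95):
    """Update the tabular value Q[s, a] using two learning rates (beta < alpha)."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]
    # Learn quickly from positive TD errors, but only slowly from negative
    # ones, which are often caused by the exploration of the other agents.
    Q[s, a] += (alpha if delta >= 0 else beta) * delta
    return Q
```

Setting beta = 0 and alpha = 1 recovers the increment-only behaviour of distributed Q-learning, while beta = alpha recovers ordinary decentralized Q-learning.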
Apart from adapting Q-learning to multi-agent scenarios, policy-gradient-based methods have also been applied, especially the actor-critic method (Lowe et al., 2017; Li et al., 2008; Foerster et al., 2017). To ease training, a framework based on centralized training with decentralized execution is applied. The critic has access to extra information, such as the policies of other agents, while the actor uses only local observations to choose actions. In a fully cooperative environment, a single critic suffices for all actors, since all agents always receive the same reward. In a mixed cooperative-competitive environment, however, there is one critic for each actor.
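The information flow of this centralized-training, decentralized-execution scheme can be sketched as below; the class names and the linear parameterization are hypothetical and only illustrate which inputs each component may access, not the architecture of the cited methods.

```python
# Illustrative sketch of centralized training with decentralized execution.
# Class names and the linear parameterization are assumptions for illustration.
import numpy as np

class Actor:
    """Chooses actions from its own local observation only (execution time)."""
    def __init__(self, obs_dim, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(obs_dim, n_actions))

    def act(self, local_obs):
        # No information about the other agents is needed at execution time.
        return int(np.argmax(np.asarray(local_obs) @ self.w))

class CentralizedCritic:
    """Scores joint behaviour from all agents' observations and actions,
    which are available during training only."""
    def __init__(self, joint_dim):
        self.w = np.zeros(joint_dim)

    def value(self, joint_obs, joint_actions):
        x = np.concatenate([joint_obs, joint_actions])
        return float(x @ self.w)
```

In a fully cooperative task a single CentralizedCritic instance would be shared by all actors, whereas a mixed cooperative-competitive task would instantiate one critic per actor.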
In the remainder of this section, centralized Q-learning, hysteretic Q-learning, and the MAAC method with linear function approximation are introduced. Furthermore, these three methods are evaluated with respect to their applicability in the introduced laboratory benchmark setup.
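Here, linear function approximation refers to the generic form in which the action value is represented as a weighted sum of features; the symbols below are standard notation rather than the specific features defined later in the paper.

```latex
% Generic linear action-value approximation with feature map phi and
% weight vector theta (standard notation, not the paper's definition).
\[
  Q(s, a; \theta) \;\approx\; \theta^{\top} \phi(s, a).
\]
```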