information of the next state instead of the current one, which increases the amount of stored information and requires additional computation. Our previous approach yielded satisfactory results in the case of dynamic environments. However, a simpler and less costly approach can be adopted when dealing only with stationary environments: this is the objective of the present paper.
Several propositions are made in the present manuscript in order to overcome the shortcomings of the D-DCM-MultiQ method and to improve multi-agent learning in large, stationary and unknown environments. The main contributions of this paper are the following:
• To overcome the limitations of the D-DCM-MultiQ method, the action selection strategy is based not only on the learner's own information but also on the information of its neighbors, which improves the quality of the choice. In this regard, a new cooperative action selection strategy, derived from the ε-greedy policy (Coggan, 2004), is proposed and evaluated (see the sketch after this list). As with ε-greedy, no exploration-specific data needs to be memorized, which makes the strategy particularly attractive for very large or even continuous state spaces.
• Since communication between agents is local, an agent can become trapped between two states (one inside the communication range of a neighboring agent and the other outside it). To address this, a second type of agent, called the "relay agent", is designed to improve cooperation between learning agents by storing the learners' information. When choosing its next action, each learning agent should preferentially exploit the relay's backups whenever the relay is within its neighborhood.
• To ensure efficient learning regardless of the environment size and the communication range, the relay agent moves randomly during the whole learning phase in order to cover different regions of the environment over time and to resolve bottleneck situations in a timely manner.
• A particular update rule for the relay's knowledge is suggested in order to limit the size of the stored data: for each state/action pair, the relay agent stores only the most promising value among all neighbors' information related to this pair, instead of saving every learner's information. As a result, the relay retains a single Q-value per state/action pair, just like a learning agent (this update is also illustrated in the sketch after this list).
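To make these two contributions concrete, the following Python sketch outlines one possible way to realize the cooperative ε-greedy action selection and the relay's max-based knowledge update. The class structure, the dictionary keyed by (state, action) pairs, and the aggregation by maximum are illustrative assumptions; the exact update rules are specified in Section 4.

```python
import random
from collections import defaultdict

class LearningAgent:
    """Illustrative learner: keeps a single Q-value per (state, action) pair."""
    def __init__(self, actions, epsilon=0.1):
        self.Q = defaultdict(float)     # Q[(state, action)] -> estimated value
        self.actions = actions          # e.g. ["left", "right", "up", "down"]
        self.epsilon = epsilon

    def choose_action(self, state, neighbors):
        """Cooperative epsilon-greedy: exploit own and neighbors' Q-values."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)        # exploration step
        # Greedy step: for each action, use the best value known by the agent
        # itself or by any neighbor within communication range (relay included).
        def best_known(action):
            values = [self.Q[(state, action)]]
            values += [n.Q[(state, action)] for n in neighbors]
            return max(values)
        return max(self.actions, key=best_known)

class RelayAgent:
    """Illustrative relay: stores only the most promising value per pair."""
    def __init__(self):
        self.Q = defaultdict(float)

    def update_from(self, neighbors):
        # Keep a single Q-value per (state, action) pair: the maximum over
        # all neighboring learners' values, not their full tables.
        for learner in neighbors:
            for (s, a), q in learner.Q.items():
                if q > self.Q[(s, a)]:
                    self.Q[(s, a)] = q
```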
The rest of the paper is organized as follows. The problem statement is described in Section 2. In Section 3, the D-DCM-MultiQ approach is presented and briefly reviewed. Section 4 is dedicated to presenting our suggestions for improvement. Several experiments are conducted in Section 5, showing the efficiency of our proposals. Some concluding remarks and directions for future work are discussed in Section 6.
2 PROBLEM STATEMENT
The learning problem is a cooperative foraging task in a large, stationary, two-dimensional discrete environment. As in (Guo and Meng, 2010) and (Zemzem and Tagina, 2015), agents can perform four actions: "left", "right", "up" and "down", and their decisions are based only on the interaction with the environment and the interaction with local neighbors. It is assumed that all agents are initially in the nest, and that each agent can locate itself using its on-board sensors, such as encoders, and can detect the target or obstacles using sonar sensors. Each agent has a limited onboard communication range and can share its state information with the neighbors that are within this range.
The foraging task may be abstractly viewed as a sequence of two alternating tasks for each agent (a simple sketch of the phase transitions follows the list):
• Start from the nest and reach the food source (For-
aging Phase). In this case, the learning method is
applied.
• Start from the food source, laden with food, and reach the nest (Ferrying Phase). Every time an agent finds the food source, it waits for the other agents to reach the target (Waiting Phase). Once all the agents have found the food source, they start a collective transport phase. As the agents must follow the same path, they select the shortest path among all agents' foraging paths (Zemzem and Tagina, 2015).
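The alternation between these phases can be summarized as a small state machine. The following Python sketch is only an illustration; the phase names follow the description above, while the boolean conditions (at_food_source, all_agents_arrived, at_nest) are hypothetical helpers introduced for the example.

```python
from enum import Enum, auto

class Phase(Enum):
    FORAGING = auto()   # nest -> food source; the learning method is applied
    WAITING = auto()    # at the food source, waiting for the other agents
    FERRYING = auto()   # food source -> nest, collective transport

def next_phase(phase, at_food_source, all_agents_arrived, at_nest):
    """Illustrative phase transition for a single agent."""
    if phase is Phase.FORAGING and at_food_source:
        return Phase.WAITING
    if phase is Phase.WAITING and all_agents_arrived:
        return Phase.FERRYING      # all agents follow the shortest foraging path
    if phase is Phase.FERRYING and at_nest:
        return Phase.FORAGING      # start a new foraging round from the nest
    return phase
```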
To provide agents with distributed RL, we use a model close to Markov Decision Processes (MDPs). It is a tuple (n, S, A, T, R), where n is the number of the agent's neighbors; S = {s_1, ..., s_m} is the set of states, where m is the number of states, each identified by its coordinates (x, y); A = {a_1, ..., a_p} is the set of actions available to an agent; R : S × A → r is the reward function for an agent; and T is a state transition function, i.e., a probability distribution over S (Da Silva et al., 2006).
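As an illustration of this model, a minimal grid-world representation could look as follows. The class name, the deterministic transition (the paper allows T to be a probability distribution over S), and the reward values are assumptions made only for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[int, int]                       # each state is identified by (x, y)
ACTIONS: List[str] = ["left", "right", "up", "down"]

@dataclass
class ForagingMDP:
    """Minimal sketch of the (n, S, A, T, R) model on a discrete grid."""
    width: int
    height: int
    food: State                               # coordinates of the food source
    n_neighbors: int = 0                      # n: number of the agent's neighbors

    def states(self) -> List[State]:
        """S: every cell (x, y) of the two-dimensional environment."""
        return [(x, y) for x in range(self.width) for y in range(self.height)]

    def transition(self, s: State, a: str) -> State:
        """T (simplified here as deterministic): move, staying put at borders."""
        dx, dy = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}[a]
        x = min(max(s[0] + dx, 0), self.width - 1)
        y = min(max(s[1] + dy, 0), self.height - 1)
        return (x, y)

    def reward(self, s: State, a: str) -> float:
        """R: illustrative reward, positive only when the food source is reached."""
        return 1.0 if self.transition(s, a) == self.food else 0.0
```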