show that the value functions (and hence, the policies)
learned in scenarios with a few agents can be reused in
scenarios with a much larger number of agents.
The paper is organized as follows: Section 2 describes the learning process used, Section 3 explains the motivation for and use of the value function transfer, Section 4 shows the simulation results, and Section 5 presents the main conclusions.
2 CROWD NAVIGATION AS AN RL DOMAIN
As independent learners, each agent's learning process can be modeled as a single-agent MDP. An MDP (Howard, 1960) is defined by a set of states $S$, a set of actions $A$, a stochastic transition function $T : S \times A \times S \to \mathbb{R}$ and a reward function $R : S \times A \to \mathbb{R}$ that specifies the agent's task. The agent's objective is to find an optimal policy, that is, a mapping from states to actions that maximizes the expected sum of discounted rewards, $E\{\sum_{j=0}^{\infty} \gamma^{j} r_{t+j}\}$, where $r_{t+j}$ is the reward received $j$ steps into the future. The discount factor $0 < \gamma < 1$ sets the influence of future rewards. In a discrete MDP, the optimal action-value function $Q^{*}(s,a)$ stores this expected value for each state-action pair. The Bellman optimality equation gives a recursive definition of the optimal action-value function: $Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s') \max_{a' \in A} Q^{*}(s',a')$. Usually the transition function $T$ is unknown, so the optimal policy must be learned through experience. There are several model-free algorithms that find $Q^{*}(s,a)$. In the Q-learning algorithm (Watkins and Dayan, 1992) the agent starts with arbitrary values for the action-value function $Q$ and, after executing action $a$ in state $s_t$, observing the new state $s_{t+1}$ and receiving an immediate reward $r_t$, updates the corresponding entry as follows: $Q(s_t,a) = (1 - \alpha_t)\,Q(s_t,a) + \alpha_t\,(r_t + \gamma \max_{a' \in A} Q(s_{t+1},a'))$. This sequence converges to $Q^{*}(s,a)$ when all state-action pairs are visited infinitely often and the learning rate $\alpha_t$ satisfies the standard stochastic approximation conditions (its sum diverges while the sum of its squares remains bounded).
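As a minimal sketch of the tabular update just described, the following Python fragment implements one Q-learning step; the dictionary-based table and the function name are illustrative choices, not taken from the paper, while the default values of alpha and gamma match the settings reported below.

from collections import defaultdict

# Tabular action-value function Q[state][action]; the paper initialises it
# optimistically, here a default of 0.0 is used for brevity.
Q = defaultdict(lambda: defaultdict(float))

def q_update(state, action, reward, next_state, actions, alpha=0.3, gamma=0.9):
    """One Q-learning step: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))."""
    best_next = max(Q[next_state][a] for a in actions)
    Q[state][action] = (1 - alpha) * Q[state][action] + alpha * (reward + gamma * best_next)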
In this work we present an experiment in which a group of agents has to leave a region of space by reaching a door placed in the middle of a wall. This problem has two facets. Individually, each agent has to learn to avoid other agents, to avoid crashing into the borders, and to learn the differences in effectiveness among the available actions. As a group, the agents have to learn to leave through the exit in an organized way.
The features that describe the state are: a) one feature for the distance from the agent to the goal, b) eight features for the occupancy states of the eight neighbouring positions, and c) one feature for the orientation with respect to the goal. There is no reference to the position in the grid, to allow portability. We use the Chebyshev distance because it is well suited to diagonal movements. The set of actions consists of the eight possible unitary movements to the neighbouring cells of the grid plus the action "stay in place". The agent is always oriented toward the goal, and the state configuration and the actions are relative to this orientation. For instance, in Figure 1 the arrows represent the same action ("go north") for the different agents, and all the agents sense the north point at different places of their neighbourhood.

Figure 1: Multiagent learning situation example.
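To make this representation concrete, here is a hypothetical sketch of how such a state could be encoded, assuming the grid is described by a set of occupied cells; the helper names (chebyshev, encode_state) and the 8-bin orientation discretisation are ours, not the paper's.

import math

def chebyshev(a, b):
    """Chebyshev distance: suitable for grids that allow diagonal moves."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))

# Nine actions: the eight unitary moves plus (0, 0) for "stay in place".
ACTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def encode_state(agent, goal, occupied):
    """Features: Chebyshev distance to the goal, occupancy of the eight
    neighbouring cells, and the orientation toward the goal (8 compass bins)."""
    dist = chebyshev(agent, goal)
    neighbours = tuple(
        (agent[0] + dr, agent[1] + dc) in occupied
        for (dr, dc) in ACTIONS if (dr, dc) != (0, 0)
    )
    angle = math.atan2(goal[0] - agent[0], goal[1] - agent[1])
    orientation = int(round(4 * angle / math.pi)) % 8
    return (dist, neighbours, orientation)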
We use Q-learning with an ε-greedy exploratory policy and exploring starts because it is a simple and well-known model-free algorithm that converges to a stationary deterministic optimal policy of an MDP. The parameters of the learning algorithm are: a constant step-size parameter α = 0.3, an exploration parameter ε = 0.2 with an exponential decay, and a discount factor γ = 0.9. The algorithm stops when an empirically chosen maximum number of trials is reached. If the agent reaches the goal, the reward value is 1.0; if the agent crashes into a wall or into another agent, its immediate reward is −2.0; if the agent consumes the maximum allowed number of steps or crosses the grid limits, it receives a reward of 0. The immediate reward is always 0 for all other intermediate states of a trial. The action-value functions are initialized optimistically to ensure that all the actions are explored. We have designed the described learning problem with 20 agents, so there are 20 independent learning processes.
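The action-selection and reward rules above can be summarised in a short sketch; the decay constant and the function names are illustrative assumptions, while the numeric values (ε = 0.2 with exponential decay, rewards of 1.0, −2.0 and 0) follow the text, and the nested Q table is the one from the earlier sketch.

import random

EPSILON_0, DECAY = 0.2, 0.995   # initial exploration rate and an assumed decay constant

def epsilon_greedy(Q, state, actions, episode):
    """Epsilon-greedy selection with exponential decay of the exploration rate."""
    epsilon = EPSILON_0 * DECAY ** episode
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[state][a])

def immediate_reward(reached_goal, crashed):
    """Reward scheme from the text: +1.0 at the goal, -2.0 on a crash with a wall
    or another agent, and 0.0 otherwise (intermediate steps, timeout, leaving the grid)."""
    if reached_goal:
        return 1.0
    if crashed:
        return -2.0
    return 0.0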
Figure 2 shows the incremental mean of the expression $R = R_F\,\gamma^{t}$, where $R_F \in \{0,1\}$, $\gamma$ is the discount factor and $t$ is the length of the episode in steps. Besides the mean reward, the figure indicates a mean episode length in the interval $[7.0, 8.0]$, which is consistent with the dimensions of the learning grid. The length of an episode is the number of decisions taken in that episode. The other curve displays the averaged length of the episodes.
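As a concrete reading of this metric, the sketch below computes the discounted return R = R_F γ^t for each episode and its incremental (running) mean; the episode log format used in the example is an assumption for illustration.

def discounted_return(reached_goal, steps, gamma=0.9):
    """R = R_F * gamma**t, with R_F = 1 if the goal was reached and 0 otherwise."""
    return (1.0 if reached_goal else 0.0) * gamma ** steps

def incremental_mean(values):
    """Running mean m_k = m_{k-1} + (x_k - m_{k-1}) / k, as plotted in Figure 2."""
    mean, means = 0.0, []
    for k, x in enumerate(values, start=1):
        mean += (x - mean) / k
        means.append(mean)
    return means

# Example: three episodes given as (goal reached?, number of steps).
episodes = [(True, 8), (False, 25), (True, 7)]
curve = incremental_mean(discounted_return(g, t) for g, t in episodes)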