maze structure into an infrastructure environment as a grid world with a set of paths where each cell represents a state. Characteristics such as corridors, T-junctions, intersections and L-turns remain. At each time step an agent perceives a state and selects one of four actions: North, West, South or East, which brings it to the next state. Depending on the orientation in which it reaches a state, each agent has four possible actions to select from. Second, we consider the dots distributed around the environment. The agents must eat all dots, which have fixed and variable values, before a new episode begins. Finally, multiple agents that represent the ghosts are the last adaptation of the game.
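For illustration only, a minimal sketch of such a grid world is given below (assuming a Python encoding, a row/column coordinate convention and a wall set; none of these details come from the original implementation):

    MOVES = {"North": (-1, 0), "South": (1, 0), "West": (0, -1), "East": (0, 1)}

    class GridWorld:
        """Cells are states; the four compass actions move the agent to the next cell."""
        def __init__(self, width, height, walls):
            self.width, self.height = width, height
            self.walls = set(walls)                 # (row, col) cells that cannot be entered

        def step(self, state, action):
            """Return the state reached by taking `action` from `state` (stay put if blocked)."""
            dr, dc = MOVES[action]
            nxt = (state[0] + dr, state[1] + dc)
            inside = 0 <= nxt[0] < self.height and 0 <= nxt[1] < self.width
            return nxt if inside and nxt not in self.walls else state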
The tasks that each agent can perform are: Patrol, Homing and Avoiding. Patrol causes the agent to move throughout the environment following a path by deciding which actions to take. Homing causes the agent to rest in order to recharge its battery. Finally, Avoiding causes the agent to avoid obstacles or other robots. Each agent switches its internal task between Patrol and Homing according to its battery life value. Therefore, the learning task consists of finding a mapping from states to actions for cooperative patrol, where each agent learns how to select the action with the highest value. Each agent implements a set of adaptation rules. Individually, each agent has a map, whereas from a group perspective the rules are triggered by pheromone-like communication among the robots, so that at each time step they inform the others of the state they have explored (Khamis et al., 2006).
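A minimal sketch of this task switch is shown below; the threshold values and the hysteresis between them are assumptions added for illustration, since the text does not specify them:

    LOW_BATTERY = 0.2    # assumed fraction of charge below which the agent goes Homing
    FULL_BATTERY = 0.95  # assumed fraction above which the agent resumes Patrol

    def select_task(current_task, battery_level):
        """Switch between 'Patrol' and 'Homing' based on the battery life value."""
        if current_task == "Patrol" and battery_level < LOW_BATTERY:
            return "Homing"
        if current_task == "Homing" and battery_level >= FULL_BATTERY:
            return "Patrol"
        return current_task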
3 REINFORCEMENT LEARNING
A reinforcement learning model consists of a discrete set of environment states, S; a discrete set of actions, A; and a set of scalar reinforcement signals. In this
model, an agent learns a mapping from situations to
actions by trial-and-error interactions with the envi-
ronment to achieve a goal. This environment must be
at least partially observable. At each time step t ∈ T each agent receives an indication of the current state s_t ∈ S of the environment, then it chooses an action a_t ∈ A to generate an output which changes the state of the environment to s_{t+1} ∈ S, and the value of this state transition is indicated to the agent through a scalar r_t known as the reward (Sutton and Barto, 1998). A reward
defines the goal in a reinforcement learning problem.
It maps each perceived state or state-action pair to a
single numerical value that indicates the intrinsic de-
sirability of that state. An important reward property is known as the Markov property: a reward with this property must include the immediate sensation and retain all relevant information from the past (Puterman, 1994). Thus, the agent learns to perform actions that
maximize the sum of the rewards received when start-
ing from some initial state and proceeding to a ter-
minal one. The reward function must necessarily be unalterable by the agent; it only serves as a basis for altering a policy π. In this implementation, mainly dot values have been used as rewards. A policy is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(s, a) of taking action a when in state s. A stationary policy, π : S → Π(A), defines a
probability distribution over actions. A policy is the
core of the agent since it defines which action must
be performed at each state. Thus, the objective of re-
inforcement learning is to develop an agent with a behavior policy that chooses actions which tend to increase the long-run sum of reward values.
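As an illustration of a policy as a mapping from states to action probabilities, the following sketch samples an action according to π(s, a); the dictionary representation and the example probabilities are assumptions, not part of the implementation described here:

    import random

    def sample_action(policy, state):
        """Draw an action a with probability π(state, a)."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    # Hypothetical example: in state (2, 3) the agent mostly heads East.
    policy = {(2, 3): {"North": 0.1, "West": 0.1, "South": 0.1, "East": 0.7}}
    action = sample_action(policy, (2, 3))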
We have implemented an off-policy temporal-difference algorithm known as Q-learning, which learns directly from raw experience without a model of the environment and updates its estimates based in part on other learned estimates, without waiting for a
final outcome. A model consists of the state transition probability function T(s, a, s′) and the reinforcement function R(s, a). However, reinforcement learning is concerned with how to obtain an optimal policy when such a model is not known in advance (Watkins and Dayan, 1992). The objective of Q-learning is to learn the action-value function Q by applying the rule Q(s_t, a_t) ← Q(s_t, a_t) + α[r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where ⟨s_t, a_t, r, s_{t+1}⟩ is an experience tuple. If each action is executed in each state an infinite number of times over an infinite run and α is decayed appropriately, the Q-values will converge with probability 1 to their optimal values Q* (Kaelbling et al., 1996). An action-value function for policy π defines the value of taking action a in state s under policy π, denoted by Q^π(s, a), as the expected reward when starting from s, taking action a, and thereafter following policy π.
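A minimal sketch of this tabular update is given below; the learning rate, discount factor and table representation are assumed values chosen for illustration:

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.9              # assumed learning rate α and discount factor γ
    ACTIONS = ("North", "West", "South", "East")
    Q = defaultdict(float)               # Q[(state, action)], arbitrarily initialized to 0

    def q_update(s, a, r, s_next):
        """Q(s,a) <- Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]."""
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])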
The general form of the Q-learning algorithm consists of nine steps, as described below.
1 Initialize Q(s,a) arbitrarily
2 Repeat (for each episode):
3     Initialize s
4     Repeat (for each step of episode):
5         Choose a from s using policy derived from Q
6         Take action a, observe r, s'
7         Apply the Q-learning update rule
8         s <- s'
9     Until s is terminal
In this implementation an episode terminates when all dots are eaten, and each step of an episode consists of choosing an action a when in state s, updating Q(s, a) and moving to s′.
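Putting these steps together, one possible episode loop is sketched below, reusing Q, ACTIONS and q_update from the previous sketch; the ε-greedy action choice and the environment interface (env.reset, env.step, env.is_terminal) are assumptions added for illustration:

    import random

    EPSILON = 0.1   # assumed exploration rate for an ε-greedy policy derived from Q

    def choose_action(s):
        """ε-greedy: usually the highest-valued action, occasionally a random one."""
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    def run_episode(env):
        s = env.reset()                    # steps 2-3: start a new episode in some state
        while not env.is_terminal(s):      # step 9: repeat until all dots are eaten
            a = choose_action(s)           # step 5: choose a from s using the policy derived from Q
            s_next, r = env.step(s, a)     # step 6: take action a, observe r and s'
            q_update(s, a, r, s_next)      # step 7: apply the Q-learning update rule
            s = s_next                     # step 8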