active perception function as follows:
• return TRUE if another agent is less than 2 locations away (i.e., the agents could collide in the current timestep);
• return FALSE otherwise.
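A minimal sketch of such an active perception function is given below, assuming agent positions are available as grid coordinates and using Manhattan distance as the proximity measure; the metric and the names used are our assumptions, not specified above.

from typing import List, Tuple

Position = Tuple[int, int]  # (row, column) grid coordinate

def active_perception(own_pos: Position, others: List[Position]) -> bool:
    """Return TRUE if another agent is less than 2 locations away
    (i.e., the agents could collide in the current timestep)."""
    for other in others:
        # Manhattan distance on the gridworld (assumed metric).
        if abs(own_pos[0] - other[0]) + abs(own_pos[1] - other[1]) < 2:
            return True
    return False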
Related work demonstrated how generalized learning automata are capable of learning an area around an agent in which other agents have to be observed (De Hauwere et al., 2009). We used the results of this study for the implementation of the active perception step, as this gives the same results as using a predefined list of states in which coordination is necessary, as was done in the paper by Melo & Veloso.
Figure 3 shows the average number of collisions that occurred in the different environments using the different algorithms. We see that independent learners perform poorly throughout all environments, except for TunnelToGoal 3 (Environment (c)). This can be explained by the size of the state space in that environment. Since our exploration strategy is fixed, agents might make a mistake before reaching the entrance of the tunnel and thus avoid collisions by luck. In this environment both joint-state learners and joint-state-action learners also perform quite poorly. This is again due to the size of the state space: the number of states these agents have to observe is 3025, and joint-state-action learners then have to choose between 16 possible actions. These algorithms still have not learned a good policy after 2000 episodes. Throughout all environments we see that CQ-learning finds collision-free policies, and finds them faster than any other algorithm we compared against.
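These numbers are consistent with a product construction over two agents: assuming each agent has 55 local states (grid positions) and 4 primitive actions, the joint-state learners face 55 × 55 = 3025 states and the joint-state-action learners choose among 4 × 4 = 16 joint actions. The per-agent figures are our reading of the environment and are not stated explicitly here.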
Two of the five algorithms used in the experiments search for states in which observing the other agent is necessary: CQ-learning and LoC. In Figure 4 we show, per episode, the number of times these algorithms decide to observe the other agent. For CQ-learning this is the number of times an agent is in a state in which it uses global state information to select an action. For LoC this is the number of times the COORDINATE action is chosen (which triggers an active perception step). We see that in the TunnelToGoal 3 environment LoC uses many coordinate actions in the beginning, as this action initially has the same probability of being selected as the other actions. Due to the size of the environment it takes this algorithm a long time to learn to choose the best action. CQ-learning, in contrast, works bottom-up: it initially never acts on joint-state information and only expands its state space where necessary. If the agents can solve the coordination problem independently, as can be seen in Figure 4(d), they never use global state information to learn a solution.
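As an illustration of how this per-episode count arises for CQ-learning, the following sketch shows epsilon-greedy action selection that falls back on global state information only in states the algorithm has marked as requiring coordination; the data structures (augmented_states, q_local, q_global) are assumed placeholders, not the authors' implementation.

import random

def select_action_cq(agent, local_state, global_state, counts, epsilon=0.1):
    # Count a "global observation" whenever the agent acts on joint-state
    # information; summing these counts per episode gives the quantity
    # plotted for CQ-learning in Figure 4.
    if local_state in agent.augmented_states:
        counts["global"] = counts.get("global", 0) + 1
        q_values = agent.q_global[global_state]   # act on the joint state
    else:
        q_values = agent.q_local[local_state]     # act on local information only
    if random.random() < epsilon:
        return random.randrange(len(q_values))    # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit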
In Figure 5(a) we show the evolution of the size of the state space when using CQ-learning in the TunnelToGoal 3 environment. We have also plotted the line indicating the size of the state space in which independent Q-learners are learning. For joint-state and joint-state-action learners this line would be constant at 3025. The variation in the CQ-learning curve can be explained by the fixed exploration strategy that is used. This occasionally causes agents to deviate from their policy, which in turn causes new states in which collisions occur to be detected. These states, however, are quickly removed again thanks to the confidence level: they are only visited occasionally, and the other agent is only rarely at the same location as when the collision state was detected, so the confidence level of these states decreases rapidly. In Figure 5(b) we show the locations in which the agents will observe the other agent in order to avoid collisions. We used the same color codes as in Figure 5(a). The alpha level of the colors indicates the confidence each agent has in that particular joint state. The agents have correctly learned to observe other agents around the entrance of the tunnel, where collisions are most likely, and to act independently using their local state information in all other locations.
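The removal of such spuriously detected states could, for instance, follow a confidence update of the form sketched below; this is only an illustration of the mechanism described above, and the update rule, parameters, and data structures are assumptions rather than the exact rule used by CQ-learning.

def update_confidence(agent, local_state, conflict_observed,
                      increase=0.1, decay=0.05, threshold=0.2):
    # Raise the confidence when the conflict situation is observed again,
    # let it decay otherwise; states whose confidence drops below the
    # threshold fall back to purely local state information.
    c = agent.confidence.get(local_state, 1.0)
    c = min(1.0, c + increase) if conflict_observed else max(0.0, c - decay)
    agent.confidence[local_state] = c
    if c < threshold:
        agent.augmented_states.discard(local_state)
        del agent.confidence[local_state]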
5 CONCLUSIONS
This paper described an improved version of CQ-learning. This algorithm is capable of adapting its state space to incorporate knowledge about other agents in those states where acting independently does not suffice to reach a good policy. As such, this technique takes the middle ground between acting completely independently in a local state space and acting in a complete joint-state space. This is done by means of statistical tests which indicate whether a richer state representation is needed for a specific state. In these states the state information the agent uses is augmented with global information about the other agents. By means of a confidence value that indicates to what degree coordination is necessary in a given state, states can also be reduced again to containing only local state information. We have shown through experiments that our algorithm finds collision-free policies in gridworlds of various sizes and difficulties, and illustrated the set of states in which the agents use global state information. We compared our technique to commonly used RL techniques as well as to state-of-the-art algorithms in the field of sparse interactions, and illustrated that CQ-learning outperformed the other approaches.
A possible avenue for future research is to detect