it is possible to encounter dangerous situations with
multiple agents at the same time. In order to still use
the transferred knowledge from the source task, we
consider interactions between agents pairwise. Each
agent tests the locations of all other agents separately,
using the transferred classifier. When more than one
agent is identified as potentially dangerous, this is
treated as a separate danger situation. Thus to deter-
mine which Q-values to use, the agent first checks whether it is in a danger state involving all 3 agents, then whether it is in a danger state involving 2 agents, and finally falls back on its local state alone. For these new danger situations involving 3 agents, we currently do not transfer Q-values, i.e. all Q-values in these states are initialized to 0. We again give results for both
algorithms averaged over 25 trials. Parameter settings
were identical to those used above.
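As an illustration, the sketch below outlines this selection of Q-values. The classifier interface (is_dangerous), the Q-table layout and the agent attributes are hypothetical names introduced only for exposition and do not correspond to our actual implementation.

# Minimal sketch of the Q-value selection described above (illustrative only;
# names such as agent.state, agent.n_actions and classifier.is_dangerous are
# hypothetical and not taken from our implementation).

def select_q_values(agent, other_agents, q_local, q_danger_2, q_danger_3):
    # Test the location of every other agent separately with the
    # transferred classifier.
    flagged = [o for o in other_agents
               if agent.classifier.is_dangerous(agent.state, o.state)]

    if len(flagged) >= 2:
        # Danger state involving all 3 agents: no Q-values are transferred,
        # so entries are created on demand and initialized to 0.
        key = (agent.state,) + tuple(sorted(o.state for o in flagged))
        return q_danger_3.setdefault(key, [0.0] * agent.n_actions)
    if len(flagged) == 1:
        # Pairwise danger state: use the transferred Q-values.
        return q_danger_2[(agent.state, flagged[0].state)]
    # No danger detected: fall back on the local, single-agent state.
    return q_local[agent.state]

The important point is the fixed order of the checks: 3-agent danger states take precedence over pairwise ones, and the local state is only used when no other agent is flagged.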
The first problem is an extension of the Tunnel-
ToGoal environment including 3 agents, as shown in
Figure 2(d). Results on this problem are shown in Fig-
ure 5 (top row). The experimental outcome parallels
the results on the 2 agent TunnelToGoal problem. The
transfer agents outperform CQ-learning with respect
to the number of collisions, but require more steps to
reach their goals during the initial phases of learning.
The second problem we consider is again the ISR
environment from the previous section, but now in-
cluding 3 agents. Results are shown in Figure 5 (bottom row). As these graphs show, this problem is especially challenging for the CQ-learners. The agents suffer a large number of collisions and need a very large number of steps to reach their goals during the first few hundred episodes. After this poor initial period, however, their final performance does match that of the transfer agents.
6.3 The Cost of Transfer
In the previous subsections we have demonstrated
that transfer learning is an effective tool for multi-
agent coordination and can significantly improve per-
formance. The results shown above, however, do not
take into account the costs associated with transfer learning: they consider only the performance on the target task and neglect the time spent training on the source task. Table 2 therefore gives the total collisions and
steps over all 2000 episodes. For the transfer agents
these totals do not include the 50000 initial training
steps. We see that, while the transfer agents clearly
perform best in terms of collisions, they are outper-
formed on the 'total number of steps' criterion on both
TunnelToGoal environments. If we count the 50000
training steps, this difference becomes even larger and
the transfer agents also lose out on the standard ISR
environment. On the most difficult problems, CIT and ISR with 3 agents, however, the transfer learners perform much better, even when their training time is included.
From these results we conclude that our transfer mechanism is best applied to more complex problems, or to small problems where the cost of failing to coordinate is very high.
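To make this accounting explicit, the following sketch simply adds the fixed source-task training budget to the steps recorded on the target task; the numbers used are placeholders and not the values reported in Table 2.

# Illustrative cost accounting for transfer (placeholder numbers only,
# not the values from Table 2).
SOURCE_TRAINING_STEPS = 50000  # steps spent on the source task before transfer

def total_steps(target_task_steps, include_source_training=True):
    # Total learning cost attributed to a transfer agent. Non-transfer
    # agents are simply charged their target-task steps.
    extra = SOURCE_TRAINING_STEPS if include_source_training else 0
    return target_task_steps + extra

# A transfer agent that needs fewer target-task steps than a CQ-learner
# only wins on this criterion if the gap exceeds the 50000 training steps.
print(total_steps(120000, include_source_training=False))  # 120000
print(total_steps(120000, include_source_training=True))   # 170000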
Finally, we also give results for the influence of
the amount of time spent on the source task. Fig-
ure 6 shows the total number of collisions and aver-
age number of steps after different amounts of train-
ing time on the source task. We show the average number of steps to the goal required by the agent (over the first 500 episodes) and the total number of collisions during these episodes. All results were ob-
tained in the CIT environment with the experimental
settings described above and averaged over 25 trials.
We show results for a training time of 10000, 25000,
50000, 75000 and 100000 time steps before transfer.
For smaller amounts of training time we see significantly lower performance. This is mainly because the agent cannot learn a classifier that perfectly identifies the danger states (such as the one shown in Table 1), so the agents are unable to predict all conflicts. Additionally, since the agents do not identify additional conflict situations after transfer, they may be unable to resolve these situations, which also means they require more steps to reach their goals.
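For intuition on this dependence, the sketch below fits a classifier on labelled joint-location samples gathered in the source task. The use of scikit-learn's DecisionTreeClassifier and the particular feature encoding are assumptions made purely for illustration and do not describe the exact learner used in our experiments.

# Sketch of learning a danger-state classifier from source-task experience.
# Each sample pairs the two agents' locations with a binary label indicating
# whether acting on the local state alone led to a coordination penalty.
# The decision-tree learner and feature encoding are illustrative assumptions.
from sklearn.tree import DecisionTreeClassifier

def train_danger_classifier(samples):
    # samples: list of ((own_x, own_y, other_x, other_y), penalized) tuples
    X = [list(features) for features, _ in samples]
    y = [int(penalized) for _, penalized in samples]
    clf = DecisionTreeClassifier()
    clf.fit(X, y)
    return clf

# With few source-task steps the sample set is sparse and some danger states
# are missed; with enough steps the learned tree flags all conflict locations,
# matching the classifier shown in Table 1.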
7 DISCUSSION AND FUTURE WORK
We now consider some ways in which the meth-
ods developed here could be further improved. A
first question that remains is whether our method for coordination transfer can be combined with existing value function transfer mechanisms, e.g. (Taylor, 2008). Using the transfer learning mech-
anism described in this paper, agents transfer expe-
rience on agent coordination, but the target naviga-
tion task has to be learned from scratch. Ideally,
this system should be combined with previous single
agent transfer techniques, speeding up both the coor-
dination and navigation subtasks. Another important question is whether the system can be extended to deal with delayed penalties for failing to coordi-
nate. Currently, the need for coordination is detected
based on samples of the immediate reward. While this
has proven effective in the settings considered in this paper, the system may fail to resolve
situations where a failure to coordinate leads to prob-
lems only after multiple time steps. For this to work