2 BENCHMARKS FOR MULTI-AGENT SYSTEMS
Benchmarks are well-defined problems that simplify a more complex reality. The first goal of such simplification is to support fair comparisons of distinct solutions, since it allows those solutions to be submitted to the same situations.
As a science of the artificial, AI has, besides its analytic-theoretical vein, a growing empirical facet in which controlled experimentation is fundamental (Pereira, 2001). For such a science, benchmarks therefore play a role that goes beyond the comparison of competing systems: they are part of the apparatus of empirical AI.
As pointed out by Hanks, Pollack and Cohen (1993), AI systems are intended to be deployed in large, extremely complex environments. Before applying them to real problems, however, AI researchers submit their methods, algorithms and techniques to simplified, simulated versions of such environments (i.e. to benchmarks).
The experimental process thus consists of varying the features of the simulated environment, or even the benchmark task itself, in order to measure the resulting effects on system performance. With such information, the researcher should be able to discriminate uninteresting isolated phenomena from relevant general results, adequately explaining why the system behaves the way it does.
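As a rough illustration only, the following minimal Python sketch (with a hypothetical simulate function and a toy system, neither taken from any actual benchmark) varies one feature of a simulated environment and records the mean effect on performance:

    import random

    def simulate(system, environment_size, seed):
        # Stand-in for a single benchmark run; returns a performance score.
        rng = random.Random(seed)
        return system(environment_size, rng)

    def my_system(environment_size, rng):
        # Toy "system": performance degrades with environment size, plus noise.
        return 1.0 / environment_size + rng.gauss(0, 0.01)

    # Sweep one environment feature and measure the resulting mean performance.
    for environment_size in (10, 50, 100, 500):
        scores = [simulate(my_system, environment_size, seed) for seed in range(30)]
        print(environment_size, sum(scores) / len(scores))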
2.1 Definition
For AI, a benchmark is a problem that is sufficiently generic to be solved by various distinct techniques, sufficiently specific to let such techniques be compared, and sufficiently representative of a class of real applied problems (Drogoul, Landau and Muñoz, 2007).
Well-known AI benchmarks are the knapsack,
the n-queens, and the traveling salesman problems
(Drogoul, Landau and Muñoz, 2007). Such problems are sufficiently generic, since they can be solved by a wide range of techniques, and also sufficiently specific, as shown by the number of studies comparing those techniques (Martello and Toth, 1990; Russell and Norvig, 2004; Cook, 2008). They are sufficiently representative, too, since they are NP-complete problems (Garey and Johnson, 1979).
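To make the point concrete, the n-queens problem, for instance, is easily stated and can be attacked by backtracking, local search, constraint programming and many other techniques; a minimal, purely illustrative backtracking sketch in Python follows:

    def n_queens(n, cols=()):
        # Enumerate placements of n non-attacking queens, one per row.
        if len(cols) == n:
            yield cols
            return
        row = len(cols)
        for col in range(n):
            # Safe if no earlier queen shares this column or diagonal.
            if all(col != c and abs(col - c) != row - r for r, c in enumerate(cols)):
                yield from n_queens(n, cols + (col,))

    print(sum(1 for _ in n_queens(8)))  # the classic 8-queens instance has 92 solutions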
However, as pointed out by Hanks, Pollack and Cohen (1993), once AI began to focus less on component technologies and more on complete, integrated systems (especially MAS), such classic benchmarks revealed their limitations. They evaluate the performance of component technologies individually, ignoring the interactions of such components with one another and with the environment.
So, what characterizes a benchmark for MAS? Like other AI benchmarks, benchmarks for MAS must be sufficiently generic and sufficiently specific. Their particularity lies in the representativeness requirement. As explained by Stone (2002), since the complexity of MAS goes beyond what NP-completeness theory captures, their representativeness is related to AI subfields such as collaboration, coordination, reasoning, planning, learning, sensor fusion, etc.
2.2 Good Benchmarks
Besides defining what a benchmark for MAS is, an important aspect is defining what makes a benchmark a good one.
As defined by Drogoul, Landau and Muñoz (2007), a good benchmark is one that makes the representation and understanding of new methods easier, letting the researcher focus on the solution rather than on the representation of the problem.
From the point of view of Hanks, Pollack and Cohen (1993), a good benchmark must also be accompanied by a testbed. Succinctly, a testbed is a complete software environment that offers an interface to configure the parameters of a benchmark and ensures that distinct techniques are tested and evaluated in equivalent situations.
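The following minimal Python sketch (all names hypothetical, not tied to any existing testbed) illustrates this role: the testbed rebuilds the same scenario from its configured parameters, so that every technique is evaluated under identical conditions:

    import random

    class Testbed:
        # A testbed fixes the benchmark configuration so that distinct
        # techniques are evaluated in equivalent situations.
        def __init__(self, n_agents=3, n_nodes=50, seed=42):
            self.params = {"n_agents": n_agents, "n_nodes": n_nodes, "seed": seed}

        def run(self, technique):
            # Rebuild the same scenario for every technique from the seed.
            rng = random.Random(self.params["seed"])
            workload = [rng.random() for _ in range(self.params["n_nodes"])]
            return technique(workload, self.params["n_agents"])

    # Two toy "techniques" compared under identical conditions.
    def greedy(workload, n_agents):
        return sum(sorted(workload, reverse=True)[:n_agents])

    def uniform(workload, n_agents):
        return sum(workload) * n_agents / len(workload)

    testbed = Testbed()
    for technique in (greedy, uniform):
        print(technique.__name__, testbed.run(technique))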
2.3 RoboCup and TAC Competitions
As pointed out by Stone (2002), the RoboCup and TAC competitions are currently the most popular benchmarks for MAS. The reason for such popularity lies in the problems that these competitions simplify through their benchmarks. RoboCup challenges competitors to win soccer games (one of the most popular sports in the world) played by computer agents (Robocup, 2008). TAC, the Trading Agent Competition, challenges competitors to build trading agents that compete for resources in an electronic market (Tac, 2008).
Both competitions have promoted yearly international tournaments. Their benchmarks are indeed sufficiently generic, given the diversity and quantity of solutions proposed by the many competitors. Such benchmarks are also sufficiently specific, since the proposed techniques are