2 BACKGROUND
In many real-world problems, success is seldom defined by a single objective. To solve these problems, autonomous agents typically need to learn a wide range of policies that can balance trade-offs between multiple objectives in various ways. In this section we discuss several areas of study relevant to this problem, including Multi-Objective Optimization and Quality Diversity. We also discuss related work such as MOME and the use of counterfactuals in machine learning.
2.1 Multi-Objective Optimization
Problems that do not have a single optimal solution are classified as Multi-Objective Optimization (MOO) problems. These problems define success through multiple objectives, many of which conflict with one another. When the preferred balance between objectives is known in advance, the objectives can be scalarized and an optimal solution can be found with traditional single-objective searches (Das and Dennis Jr, 1996; Drugan and Nowe, 2013; Van Moffaert et al., 2013). In many cases, however, the preferred balance may not be known or may change with the situation. Most MOO strategies therefore aim to provide a set of solutions that optimize different trade-offs between objectives. The goal of MOO methods is to cover the optimal region of the objective space, known as the Pareto front, in order to present a set of solutions that offer different balances between objectives.
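To make the Pareto front concrete, the sketch below filters a set of candidate objective vectors down to its non-dominated subset under maximization. The function names and the use of NumPy are our own illustrative choices, not part of any cited method.

```python
import numpy as np

def dominates(a, b):
    """Return True if objective vector a Pareto-dominates b (maximization)."""
    return np.all(a >= b) and np.any(a > b)

def pareto_front(scores):
    """Keep only the non-dominated rows of an (n_solutions, n_objectives) array."""
    keep = []
    for i, a in enumerate(scores):
        if not any(dominates(b, a) for j, b in enumerate(scores) if j != i):
            keep.append(i)
    return scores[keep]

# Example: (2, 9) and (9, 2) trade off between the objectives; (1, 1) is dominated.
scores = np.array([[2.0, 9.0], [9.0, 2.0], [1.0, 1.0], [5.0, 5.0]])
print(pareto_front(scores))  # [[2. 9.] [9. 2.] [5. 5.]]
```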
Evolutionary search-based methods are a natural fit for this problem, as they can find, compare, and iterate over a population of policies to produce incrementally better solutions. Two of the most prominent evolutionary multi-objective methods are the Strength Pareto Evolutionary Algorithm 2 (SPEA2) (Zitzler et al., 2001) and the Nondominated Sorting Genetic Algorithm II (NSGA-II) (Deb et al., 2002). NSGA-II ranks and sorts solutions with a non-dominated sorting method, and it also uses a crowding-distance measure among solutions to force the search to diversify across the objective space. SPEA2 computes a strength value for each solution found and stores previously found non-dominated solutions in an archive that is updated after every generation. The fitness of each solution is calculated from the solutions that dominate it and the solutions it dominates.
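The crowding-distance idea behind NSGA-II can be illustrated with a small sketch: within a single front, each solution is scored by the normalized gap between its neighbors along every objective, and boundary solutions receive infinite distance so they are always retained. This is a simplified re-implementation for illustration only, not the reference NSGA-II code.

```python
import numpy as np

def crowding_distance(front):
    """Crowding distance for an (n, m) array of objective values within one front."""
    n, m = front.shape
    dist = np.zeros(n)
    for k in range(m):
        order = np.argsort(front[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf  # always keep extreme points
        span = front[order[-1], k] - front[order[0], k]
        if span == 0:
            continue
        # Normalized gap between each interior solution's neighbors along objective k.
        dist[order[1:-1]] += (front[order[2:], k] - front[order[:-2], k]) / span
    return dist
```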
2.2 Quality Diversity
Where MOO aims to find solutions that provide coverage over the objective space, Quality Diversity (QD) seeks to find diversified policies that capture a broad set of behaviors. This builds on earlier techniques in novelty search that focus on finding substantially different solutions (Eysenbach et al., 2019; Lehman and Stanley, 2011). QD aims to diversify across a pre-defined “behavior space”, a reduced-dimensional summary of the actions or behaviors of each policy. QD methods separate the behavior space into many local regions, called niches, and keep the best policy found in each niche. One notable example of a QD method is the Multi-dimensional Archive of Phenotypic Elites (MAP-Elites) (Mouret and Clune, 2015). MAP-Elites is an evolutionary method that aims to find many different solutions to a given task. Specifically, MAP-Elites seeks a diverse set of policies that do not necessarily solve the task optimally, but solve it with a variety of unique behaviors.
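A minimal sketch of a MAP-Elites-style archive is given below, assuming the behavior descriptor is discretized into a fixed grid of niches and each niche stores only its highest-fitness policy. The class and parameter names (e.g. MapElitesArchive, bins_per_dim) are illustrative, not taken from the cited implementation.

```python
import numpy as np

class MapElitesArchive:
    """Minimal MAP-Elites archive: one elite per discretized behavior niche."""

    def __init__(self, bins_per_dim, low, high):
        self.bins = np.asarray(bins_per_dim)
        self.low, self.high = np.asarray(low), np.asarray(high)
        self.elites = {}  # niche index (tuple) -> (fitness, policy)

    def niche(self, behavior):
        """Map a behavior descriptor to the index of its grid cell."""
        frac = (np.asarray(behavior) - self.low) / (self.high - self.low)
        return tuple(np.clip((frac * self.bins).astype(int), 0, self.bins - 1))

    def add(self, policy, fitness, behavior):
        """Keep the policy only if it beats the current elite of its niche."""
        key = self.niche(behavior)
        if key not in self.elites or fitness > self.elites[key][0]:
            self.elites[key] = (fitness, policy)
```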
2.2.1 Multi-Objective Map Elites (MOME)
Multi-Objective Map Elites (MOME) (Pierrot et al., 2022) incorporates the principles of both MOO and QD search by building on the MAP-Elites framework. Instead of a single policy, MOME keeps the set of policies in each niche that is locally Pareto optimal. In doing so, it provides solutions that are diverse across both the objective and behavior spaces. However, combining these two methods leads to extensive time spent searching and optimizing in unhelpful or sub-optimal regions of the behavior space. By keeping locally Pareto optimal policies, MOME retains many policies that earn no reward, since there are many niches where a score of zero on every objective is locally Pareto optimal. These policies are then used for mutation and crossover; without another mechanism to add back diversity, the search is weighed down and has a difficult time escaping these unproductive regions.
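The difference from a standard MAP-Elites archive can be sketched as follows: instead of a single elite, each niche stores every policy that no other policy in the same niche dominates. The code below is a simplified illustration that reuses the hypothetical niche indexing above; it is not the archive implementation of Pierrot et al. (2022). It also shows why zero-reward policies can persist: an all-zero objective vector inserted into an otherwise empty niche is trivially non-dominated.

```python
import numpy as np

def dominates(a, b):
    """True if objective vector a Pareto-dominates b (maximization)."""
    return np.all(a >= b) and np.any(a > b)

class MomeArchive:
    """Per-niche Pareto sets, in the spirit of a MOME-style archive (simplified)."""

    def __init__(self):
        self.cells = {}  # niche index -> list of (objectives, policy)

    def add(self, key, policy, objectives):
        """Insert a policy into its niche unless an existing policy dominates it."""
        cell = self.cells.setdefault(key, [])
        objectives = np.asarray(objectives)
        if any(dominates(o, objectives) for o, _ in cell):
            return False
        # Drop stored policies that the newcomer now dominates.
        cell[:] = [(o, p) for o, p in cell if not dominates(objectives, o)]
        cell.append((objectives, policy))
        return True
```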
2.3 Counterfactuals in Agent Learning
Counterfactual thought is a strategy people often use to analyze alternative outcomes to real events and to learn from “what-if” scenarios (Boninger et al., 1994; Byrne, 2016). This concept has also been adopted in agent-based learning paradigms, most notably in applications of reward shaping that aim to address the issue of credit assignment in multiagent learning problems. For example, Difference Evaluations compare rewards generated from counterfactual states with the actual rewards agents received to provide each agent with feedback on its individual contribution to the system.
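As a concrete illustration, the sketch below computes Difference Evaluations of the standard form D_i = G(z) - G(z_{-i}), where an agent's contribution is replaced by a counterfactual (here, the agent being absent) and the resulting change in the global reward is credited to that agent. The toy global reward and all names are illustrative assumptions, not taken from the cited works.

```python
def global_reward(actions):
    """Toy team objective: number of distinct targets covered (None = agent absent)."""
    return len({a for a in actions if a is not None})

def difference_evaluations(actions):
    """D_i = G(z) - G(z_-i): credit each agent with its marginal contribution."""
    g = global_reward(actions)
    rewards = []
    for i in range(len(actions)):
        counterfactual = list(actions)
        counterfactual[i] = None  # replace agent i's action with the counterfactual "absent"
        rewards.append(g - global_reward(counterfactual))
    return rewards

# Agents 0 and 1 cover the same target, so neither receives credit for it.
print(difference_evaluations(["A", "A", "B"]))  # [0, 0, 1]
```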