REACT: Revealing Evolutionary Action Consequence Trajectories for
Interpretable Reinforcement Learning.
Philipp Altmann, Céline Davignon, Maximilian Zorn, Fabian Ritz,
Claudia Linnhoff-Popien and Thomas Gabor
LMU Munich, Germany
Keywords:
Reinforcement Learning, Interpretability, Genetic Algorithms.
Abstract:
To enhance the interpretability of Reinforcement Learning (RL), we propose Revealing Evolutionary Action
Consequence Trajectories (REACT). In contrast to the prevalent practice of validating RL models based on
their optimal behavior learned during training, we posit that considering a range of edge-case trajectories pro-
vides a more comprehensive understanding of their inherent behavior. To induce such scenarios, we introduce
a disturbance to the initial state, optimizing it through an evolutionary algorithm to generate a diverse popula-
tion of demonstrations. To evaluate the fitness of trajectories, REACT incorporates a joint fitness function that
encourages local and global diversity in the encountered states and chosen actions. Through assessments with
policies trained for varying durations in discrete and continuous environments, we demonstrate the descriptive
power of REACT. Our results highlight its effectiveness in revealing nuanced aspects of RL models’ behavior
beyond optimal performance, with up to 400% increased fidelities, contributing to improved interpretability.
Code and videos are available at https://github.com/philippaltmann/REACT.
1 INTRODUCTION
With the increasing use of large, parameterized func-
tion approximation models, there is a growing de-
mand for interpretation methods that bridge the gap
between human understanding and computational in-
telligence. This is particularly pronounced in the con-
text of complex dynamic approaches like reinforce-
ment learning (RL), where policies are usually real-
ized with parameterized neural networks. As a run-
ning example, consider a 9 × 9 gridworld, where
the agent is perfectly trained to traverse the environ-
ment and reach the target field. However, unforeseen
circumstances (like sensor failure or domain shifts)
might cause the agent to end up in fields not along this
optimal trajectory, where an overfitted policy might
even get stuck. Yet, those scenarios are equally im-
portant to interpret the inherent behavior. This yields
several challenges: First, contrary to static supervised
learning tasks like classification, RL policies are in-
herently hard to visualize, especially given the in-
tended application to varying circumstances. Second,
demonstrating the desired behavior in a laboratory
training setup does not serve as sufficient validation
to enable the interpretability of the inherent behavior.
Third, comparative evaluation plays a central role in
comprehending, explaining, and interpreting varying
phenomena by providing additional context informa-
tion and, thus, control (Vartiainen, 2002). To tackle
these challenges, we propose to evaluate a set of di-
verse edge-case demonstrations, which we obtain by
precisely disturbing the initial state. To generate a
small yet informative set of demonstrations, we em-
ploy evolutionary optimization, which can be adapted
to yield diverse solution candidates in complex solu-
tion landscapes across various (local) optima. To har-
ness these prospects, we propose a framework to indi-
rectly optimize a population of demonstration behav-
ior generated by a given (trained) policy by altering
(disturbing) the initial state. Overall, we provide the
following contributions:
• We formalize a novel interpretability joint fitness metric to assess demonstration trajectories w.r.t. their local (inherent) and global (comparative) state diversity and action certainty.
• We propose an architecture for Revealing Evolutionary Action Consequence Trajectories (REACT), integrating the previously defined fitness to optimize a pool of diverse demonstrations to serve as a basis for interpreting the underlying policy.
• We evaluate REACT in flat and holey gridworlds and a continuous robotic control task, comparing policies of varying training stages.
2 PRELIMINARIES
Reinforcement Learning. We focus on problems
formalized as Markov decision processes (MDPs)
$M = \langle S, A, P, R, \mu, \gamma \rangle$, with a set $S$ of states $s$, a set $A$ of actions $a$, a transition probability $P(s' \mid s, a)$ of reaching $s'$ when executing $a$ in $s$, a scalar reward $r_t = R(s, a, s') \in \mathbb{R}$ at step $t$, the initial state distribution $s_0 \sim \mu$, and the discount factor $\gamma \in [0, 1)$ for calculating the discounted return $G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ (Puterman, 1990). More specifically, we consider learning in a constrained setting with a single deterministic initial state $s_0$ and evaluating with initial states drawn from µ. Furthermore, we consider the objective of reinforcement learning (RL) to find an optimal policy $\pi^*$ with action selection probability $\pi(a \mid s)$ that maximizes the expected discounted return (Sutton and Barto, 2015). Policy-based methods directly approximate the optimal policy from trajectories τ of experience tuples $\langle s, a, r, s' \rangle$
, generated by π. Proximal policy opti-
mization (PPO) extends this concept, optimizing a
surrogate loss that restricts policy updates to improve
the robustness (Schulman et al., 2017). Soft actor-
critic (SAC) bridges the gap between value-based and
policy-based approaches (Haarnoja et al., 2018). Both
algorithms have shown versatile applicability to var-
ious scenarios. Thus, we use both approaches to
train the policies we base our empirical studies on.
While enabling learning in complex high-dimensional
or continuous scenarios, using deep neural networks
to approximate the optimal policy comes at the cost of
introducing a black-box model. Therefore, even when
finding a parameterization that resembles an optimal
policy, its decision cannot be anticipated, and reasons
for action choices cannot be (readily) inferred. Yet,
RL has been proposed to provide compelling solu-
tions to various real-world decision-making problems
such as autonomous driving or robotic control (Wur-
man et al., 2022; de Lazcano et al., 2023; Rolf et al.,
2023). Such problems require transparency, e.g., to
account for safety concerns or quality control.
Explainability. This field of research not only con-
cerns providing explanations for specific decisions of
such black-box models but also extends to providing
their general interpretability. According to Li et al.
(2022), we classify interpretation algorithms regard-
ing three characteristics: Their representation, the
type of the model to be interpreted, and the relation
between the interpretation algorithm and the model.
The representation can be based on the importance
of (latent) features in relation to the final objective
(Lundberg and Lee, 2017). Alternatively, one can use
the model’s response to different inputs to identify be-
haviors. Some algorithms approximate the model us-
ing an interpretable surrogate model (Ribeiro et al.,
2016). Finally, some algorithms represent the interpretation as a sample dataset showing the impact of training data
(Koh and Liang, 2017; Pleiss et al., 2020). Regarding
the model to be interpreted, some approaches con-
sider the model as a black box (Pleiss et al., 2020;
Ribeiro et al., 2016). These algorithms are called
model-agnostic and can be applied to any model.
Other approaches require specific model characteris-
tics such as differentiability or even a particular type
of model (Koh and Liang, 2017). Closed-form algo-
rithms are applied after training, while composition
algorithms can (also) be integrated into the training
process. Further relations include dependence, where
the algorithms add operations to the model after train-
ing to output interpretable terms, and proxy, where an
interpretable proxy model is created. Our algorithm
represents the interpretation as a model response, dis-
playing the policy behavior throughout various trajec-
tories provoked by the initial state. Furthermore, we
consider the model a black box: our algorithm can interpret any model that provides an action selection probability. The type of model is thus not relevant, making our approach model-
agnostic. Furthermore, we propose a closed-form ap-
proach to be applied after training.
Evolutionary Optimization. To optimize initial
states that cause diverse demonstrations, we use a
population-based evolutionary optimization process
with populations $P = \{\tau_i\}_{0 \leq i \leq p}$ of size $p$, where the initial population $P_0$ is chosen randomly, state space $X$ with $P \in \mathbb{N}^X$, a fitness function $F : X \to \mathbb{R}$, and the evolution step function $E(P_t, F) = P_{t+1} = \sigma_p(P_t \cup \text{mutants}_{p_m}(P_t) \cup \text{children}_{p_c}(P_t))$, with a (non-deterministic) selection function $\sigma_n : \mathbb{N}^X \to \mathbb{N}^X$ that returns $n \in \mathbb{N}$ individuals and could depend on $F$, a mutation function $\text{mutants}_{p_m} = \{\text{mutation}(x) : x \in \sigma_{p \cdot p_m}\}$, and a crossover function $\text{children}_{p_c} = \{\text{crossover}(x_1, x_2) : x_1, x_2 \in \sigma_{p \cdot p_c}\}$, with mutation and crossover probabilities (rates) $p_m$ and $p_c$ (Fogel, 2006). Individuals $I \in P$ are defined by their inherent features (genotype), in which we encode an initial state $s_0$ sampled from the initial state distribution µ defined by the MDP. Their individual fitness is calculated based on their resulting appearance (phenotype), i.e., the demonstration trajectory τ generated by executing policy π in the given environment starting from $s_0$. A binary encoding of the individual state allows for implementing a simple bit-flip mutation and a single-point crossover operation to recombine two parents.
To foster parents with higher fitness, tournament se-
lection is commonly applied within function σ (Miller
et al., 1995). While evolutionary algorithms are usu-
ally used to search for one single best individual, we
are interested in the entire population of individuals
similar to (Ishibuchi et al., 2008; Neumann et al.,
2019). Deploying a fitness function that promotes di-
versity among trajectories allows us to see the differ-
ent strategies an agent follows in different situations.
Generally, all measures of the diversity of an individ-
ual I in a population P are related to the pairwise
distance between individuals in P as measured by a
suitable norm (e.g., Euclidean for real-valued repre-
sentations, Hamming for symbolic representations)
(Wineberg and Oppacher, 2003). Therefore, the indi-
vidual diversity w.r.t. the population can be estimated
by $D(I, P) = \frac{1}{p} \cdot \sum_{I' \in P} |I' - I|$ (Gabor et al., 2018).
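To make these operators concrete, the following Python sketch (our own illustration under the stated assumptions, not the authors' implementation) shows tournament selection, bit-flip mutation, single-point crossover, and the population diversity estimate D(I, P) for binary genotypes; the function names are ours.

```python
import random

def population_diversity(individual, population):
    """D(I, P): mean pairwise distance of I to all members of P, here using the
    Hamming distance for binary genotypes (Euclidean for real-valued ones)."""
    return sum(
        sum(a != b for a, b in zip(individual, other)) for other in population
    ) / len(population)

def tournament(population, fitness, k=3):
    """Tournament selection: return the fittest of k randomly drawn individuals."""
    return max(random.sample(population, k), key=fitness)

def mutate(genotype):
    """Bit-flip mutation of a single random bit of a binary genotype."""
    i = random.randrange(len(genotype))
    return genotype[:i] + [1 - genotype[i]] + genotype[i + 1:]

def crossover(parent_a, parent_b):
    """Single-point crossover recombining two parent genotypes."""
    i = random.randrange(1, len(parent_a))
    return parent_a[:i] + parent_b[i:]
```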
3 TRAJECTORY FITNESS
EVALUATION
In the following, we discuss assessing the fitness of trajectories $\tau = \langle s_0, a_0, r_1, \ldots, s_t, a_t, r_{t+1} \rangle \sim P_{\pi, s_0}$ to serve as an insightful demonstration to interpret the inherent behavior of policy π. Unlike the central objective of RL, we are not interested in optimizing for the best-performing individuals but rather in a population of diverse demonstrations following π from an initial state $s_0$
to be optimized. Therefore, we refrain
from using the reward metric supplied to learn the pol-
icy and define a joint fitness metric F in the following.
To illustrate our deliberations, we consider the 9 × 9
gridworld environment depicted in Fig. 1.
Figure 1: Joint fitness F elements: local diversity $D_l$ (light blue), global diversity $D_g$ (blue), and certainty C (orange), compared to an exemplary optimal trajectory (white).
We strive for high diversity to achieve insightful
demonstrations. Considering a single trajectory, a di-
verse path covering a larger fraction of the available
state space (e.g., the light blue path in Fig. 1) would
be more informative regarding the behavior to be an-
alyzed than the comparably direct path resulting from
policy optimization (e.g., the white path in Fig. 1).
Even though it might be considered less optimal w.r.t.
the reward of the given environment, such behavior
might depict an edge case, which is important to assess the
given policy. We refer to this measure as local diver-
sity and formalize the corresponding metric
$$D_l(\tau) = \frac{1}{|P|} \, |\{s \in \tau\}|, \qquad (1)$$

where $P = \{P_d \mid P_d \subseteq \mathbb{N}, d \in 1, \ldots, dim\}$, with $|P| = |P_1| \cdot \ldots \cdot |P_{dim}|$, is the dim-dimensional position space extracted by $\rho : S \mapsto P$ from a state $s$. In our exemplary gridworld, we consider the 2-dimensional position of the agent with $|P| = 9 \cdot 9$ distinct states. Yet, this representation might be ex-
tended by other important, moving, or task-specific
objects like obstacles or targets. In our case, higher
local diversity implies more divergence from the opti-
mal path, increasing the relevance of the trajectory.
Furthermore, this position-centric formalization al-
lows us to consider the Euclidean distance between
states $\|s - s'\|_2 = \sqrt{(\rho(s) - \rho(s'))^2}$. For use in con-
tinuous environments, we suggest applying appropri-
ate discretizations to regularize state similarities.
Considering a set of multiple trajectories T , nei-
ther solely disturbed (light blue) nor solely optimal
(white) paths accurately reflect the behavior of π. We
therefore additionally consider a global diversity $D_g$ (blue) of trajectories $\tau \in T$ formalized as

$$D_g(\tau, T) = \frac{1}{P} \min_{\tau' \in T \setminus \tau} \delta(\tau, \tau'), \qquad (2)$$

based on the maximum state distance $P = \max_{s, s' \in S} \|s - s'\|_2$ and the one-way distance δ between trajectories τ and τ' (Lin and Su, 2008):

$$\delta(\tau, \tau') = \frac{\sum_{s \in \tau} d(s, \tau') + \sum_{s' \in \tau'} d(s', \tau)}{|\tau| + |\tau'|}, \qquad (3)$$

using the state-to-trajectory distance:

$$d(s', \tau) = \min_{s \in \tau} \left( \|s - s'\|_2 \right). \qquad (4)$$
This accumulated two-way measure allows for
comparison between trajectories of different lengths.
Furthermore, using the min operator in Eq. (2) causes
equal trajectories in T to be valued at 0. Ultimately,
even if T contains only optimal yet maximally dis-
sected behavior to reach the target, presenting such
diverse demonstrations increases the overall inter-
pretability of π. Note that, even though only defined
for disturbing the agent’s position, further deviations,
such as altering layouts, are formally not precluded.
However, calculating the global diversity might re-
quire using a different distance metric, like the Lev-
enshtein distance, instead.
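For reference, Eqs. (1)-(4) can be implemented directly on trajectories given as sequences of (discretized) position vectors; the following sketch is our own reading with assumed function names and a NumPy-based state representation, not the paper's code.

```python
import numpy as np

def local_diversity(trajectory, position_space_size):
    """Eq. (1): fraction of the position space covered by the distinct states of tau."""
    return len({tuple(s) for s in trajectory}) / position_space_size

def state_distance(state, trajectory):
    """Eq. (4): minimum Euclidean distance of a state to any state of a trajectory."""
    return min(np.linalg.norm(np.asarray(state) - np.asarray(s)) for s in trajectory)

def one_way_distance(tau, tau_prime):
    """Eq. (3): length-normalized, accumulated two-way distance between trajectories."""
    forward = sum(state_distance(s, tau_prime) for s in tau)
    backward = sum(state_distance(s, tau) for s in tau_prime)
    return (forward + backward) / (len(tau) + len(tau_prime))

def global_diversity(tau, pool, max_state_distance):
    """Eq. (2): normalized distance of tau to its closest neighbor in the pool T."""
    others = [t for t in pool if t is not tau]
    return min(one_way_distance(tau, t) for t in others) / max_state_distance
```

For the 9 × 9 gridworld of Fig. 1, position_space_size would be 81 and max_state_distance the grid diagonal of roughly 8√2.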
Figure 2: REACT Architecture. An individual encodes the initial state; the policy interacts with the environment, producing states, actions, and rewards that form the demonstration trajectory, which is evaluated by the joint fitness.
Both diversity measures implicitly cover insuf-
ficiencies and uncertainties of π that may occur in
states less prevalent during training. To reflect the di-
versity of the action decision itself, we furthermore
consider the certainty, formalized as the cumulative
normalized action probability of τ given π:
$$C_\pi(\tau) = \frac{1}{|\tau|} \sum_{\langle s, a \rangle \in \tau} \pi(a \mid s). \qquad (5)$$
Counterintuitively, we are interested in trajecto-
ries with low certainties causing more diverse deci-
sions that may fail to solve the intended task, such as
the exemplary orange path in Fig. 1. We intentionally
chose the normalized sum of probabilities instead of
their product to promote trajectories with low certain-
ties throughout.
Overall, we define the joint fitness, combining global diversity $D_g$, local diversity $D_l$, and certainty $C_\pi$ of a trajectory τ in context of a set of previously evaluated trajectories T as follows:

$$F(\tau, T) = D_g(\tau, T) + F_l, \quad \text{with} \qquad (6)$$

$$F_l = \min_{t \in T} \left\| \begin{pmatrix} D_l(\tau) \\ C_\pi(\tau) \end{pmatrix} - \begin{pmatrix} D_l(t) \\ C_\pi(t) \end{pmatrix} \right\|_2. \qquad (7)$$
To reflect the τ -specific metrics of local diver-
sity and certainty in relation to the set of trajectories
T , considered for calculating the global diversity, we
consult these measures only regarding their minimum
distance between τ and T . We chose the minimum
distance of both local metrics to encourage individu-
als to maximize their local distance to the closest in-
dividual, thereby promoting diverse or uncertain be-
havior. As we defined all components of the joint
fitness to be normalized, we furthermore do not in-
troduce additional parameterizations to balance their
impact. Preliminary studies confirmed this approach.
4 REACT
To optimize a pool of demonstrations to interpret a
given policy using the previously defined fitness, we
propose revealing evolutionary action consequence
trajectories (REACT) to optimize a population of ini-
tial states causing diverse demonstrations. By show-
ing not only the optimistic optimal behavior, we aim
to increase the traceability of the learned behavior
and, ultimately, trust in the black-box policy model.
In contrast to most evolutionary approaches, we are
interested in the whole population, not just the single
best-performing individual. The overall architecture
is depicted in Fig. 2 and outlined in Alg. 1.
To form the initial population P of size p, individuals encoded by the initial state $s_0$ are generated from µ given by the MDP of the given environment. Invalid individuals that cannot generate any demonstration are disregarded. As we only introduce disturbances to the initial position of the agent, the initial state $s_0$ can be encoded by the initial position of the agent.
To account for evaluation environments comprising
different-sized 2D-discrete and 3D-continuous state
spaces, we opt for a universal multi-dimensional 6-bit
encoding with inverse normalization to ensure precise
reconstruction of the intended position. For further
details, please refer to the appendix.
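One plausible reading of this encoding (our assumption; the exact scheme is detailed in the paper's appendix) quantizes each coordinate of the position into 2^6 levels within the environment's bounds and inverts the normalization on decoding:

```python
def encode(position, low, high, bits=6):
    """Quantize each coordinate of a position into a bit string of the given length."""
    genotype = []
    for x, lo, hi in zip(position, low, high):
        level = round((x - lo) / (hi - lo) * (2 ** bits - 1))  # normalize to 0 .. 2^bits - 1
        genotype += [int(b) for b in format(level, f"0{bits}b")]
    return genotype

def decode(genotype, low, high, bits=6):
    """Inverse normalization: reconstruct the position encoded in the bit string."""
    position = []
    for d, (lo, hi) in enumerate(zip(low, high)):
        level = int("".join(str(b) for b in genotype[d * bits:(d + 1) * bits]), 2)
        position.append(lo + level / (2 ** bits - 1) * (hi - lo))
    return position
```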
To evaluate the individuals’ fitness, trajectories τ
are sampled from the environment, starting from their
individual initial state, following π. For improved
comparability, we furthermore remove duplicate con-
secutive states from τ . These demonstrations consti-
tute the individuals’ phenotype directly affecting their
fitness to serve as a viable representation of the given
model. The individual fitness is calculated accord-
ing to Eq. (6) based on the individual trajectory τ and
the set of previous demonstrations T to reflect the in-
dividual performance within the demonstration pool.
Algorithm 1: Revealing Evolutionary Action Consequence Trajectories (REACT)¹
Require: P, µ, π                               ▷ We use a policy trained with a single initial state
 1: P ← ⟨s_0 ∼ µ⟩_p; T ← ∅                     ▷ Generate initial population of size p and empty T
 2: for individual I ∈ P do
 3:     τ ∼ P_{π,s_0}                           ▷ Sample trajectory τ_I from initial state s_0
 4:     F_I(τ, T)                               ▷ Calculate fitness of I w.r.t. phenotype τ and previous demonstrations T according to Eq. (6)
 5:     T ← T ∪ τ                               ▷ Update demonstrations T
 6: end for
 7: for all generations g do
 8:     O ← mutants_{p_m}(P) ∪ children_{p_c,F}(P)   ▷ Generate offspring from mutation and crossover using p_m, p_c, the individual fitness, and tournament selection
 9:     for individual I ∈ O do
10:         τ ∼ P_{π,s_0}                       ▷ Sample trajectory τ_I from initial state s_0
11:         F_I(τ, T)                           ▷ Calculate fitness according to Eq. (6)
12:         T ← T ∪ τ                           ▷ Update demonstrations T
13:     end for
14:     P ← migration(P ∪ O, F, p)              ▷ Select the p best individuals for the next generation from the population and offspring according to their fitness
15:     T ← T \ {τ_I | I ∉ P}                   ▷ Remove extinct demonstrations
16: end for
17: return T

Note that even though we sample experiences from
the environment, we do not consider further improv-
ing the policy at hand. Nevertheless, the proposed
architecture could serve as an automated adversarial
curriculum to generate scenarios for further training.
After evaluating the first generation, the best indi-
viduals are selected via tournament selection to cre-
ate new individuals through recombination. The re-
combination operator is executed with the recombi-
nation probability $p_c \in [0, 1]$ defined beforehand. To
generate the offspring, we use single-point crossover.
The new individuals are then added to the population.
Then, a mutation operator with mutation probability
$p_m \in [0, 1]$ is applied to random individuals from the
original population. The mutation is implemented by
a single bit-flip of one random bit in the individual’s
encoding. As we are interested in the whole popula-
tion, we keep the individual before mutation and add
the mutated individual to the population to keep the
evolution elitist. After evaluating the newly generated
offspring, as described above, one after the other, the
population is reduced to the intended size p by remov-
ing the individuals with the lowest fitness value along
with their generated demonstrations. The described
procedure is repeated for a fixed number of g genera-
tions.
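A compressed sketch of this generation loop might look as follows; it is our own illustration, not the released implementation, and assumes user-supplied callables: rollout(genotype) samples a demonstration following π from the decoded initial state, fitness(τ, T) evaluates Eq. (6), random_genotype() draws an encoded initial state from µ, and mutate/crossover are the operators sketched in Section 2.

```python
import random

def react(rollout, fitness, random_genotype, mutate, crossover,
          p=10, generations=40, p_m=0.5, p_c=0.75):
    """Evolve a population of encoded initial states and return the demonstration
    pool T of the surviving individuals (cf. Alg. 1)."""
    population = [random_genotype() for _ in range(p)]
    demos = {tuple(g): rollout(g) for g in population}  # demonstration pool T

    def f(genotype):  # fitness of an individual w.r.t. the current pool
        return fitness(demos[tuple(genotype)], list(demos.values()))

    for _ in range(generations):
        mutants = [mutate(random.choice(population)) for _ in range(round(p * p_m))]
        parent = lambda: max(random.sample(population, 3), key=f)  # tournament selection
        children = [crossover(parent(), parent()) for _ in range(round(p * p_c))]
        offspring = mutants + children
        demos.update({tuple(g): rollout(g) for g in offspring})
        population = sorted(population + offspring, key=f, reverse=True)[:p]  # keep the p best
        demos = {tuple(g): demos[tuple(g)] for g in population}  # remove extinct demonstrations
    return list(demos.values())
```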
¹ All required implementations, appendices, and video renderings are available at https://github.com/philippaltmann/REACT.
Hyperparameters. The most important hyperpa-
rameter to consider is the population size p. It in-
fluences the effectiveness of the evolutionary process
and determines the number of demonstrations gener-
ated to interpret the policy. To suit human needs, p
should be comprehensibly small and sufficiently di-
verse (Behrens et al., 2023). Preliminary experiments
suggest a population size of p = 10 is a reasonable
compromise. Larger populations can be used if only
the best p individuals are considered to demonstrate
the policy’s behavior. For experimental details, please
refer to the appendix¹. Furthermore, if not stated oth-
erwise, we optimize the population of demonstrations
over 40 iterations (generations). Our central goal is
to diversify the population throughout optimization,
so we use a reasonably high crossover probability
$p_c = 0.75$ combined with a high mutation probability $p_m = 0.5$. In combination with the chosen binary
state encoding of length 6, representing the agent’s
initial position, this configuration causes the genera-
tion of offspring to start at further distances.
5 RELATED WORK
Evolutionary RL. Evolutionary approaches have
also been applied to optimize a population of policies
(Khadka and Tumer, 2018) to foster their explorative
capabilities. Both task-agnostic (Parker-Holder et al.,
2020) and task-specific (Wu et al., 2023) diversity
measures have been shown to be beneficial for im-
proving the quality of the resulting policy. Note, how-
ever, that we do not consider any policy improvement
but use evolutionary optimization to generate scenar-
ios that best describe the learned policy in a given en-
vironment. Nevertheless, this line of work highlights
the importance of considering the diversity of behav-
ior in addition to its quality. Similarly, quality diver-
sity (QD) optimization arose from considering the be-
havioral novelty of solution candidates as their opti-
mization criteria (Lehman and Stanley, 2011). Ga-
bor and Altmann (2019) proposed using surrogate-
assisted genetic algorithms for building recommender
systems. Bhatt et al. (2022) integrated QD into the au-
tomated environment generation during training via a
surrogate model to improve the robustness of the pol-
icy. We take a similar approach to improve the inter-
pretability of the learned policy, optimizing for a set
of diverse policy demonstrations. However, in con-
trast to the novelty criterion, which considers solely
the distance to the current population, we propose us-
ing a joint fitness combining both local and global cri-
teria of the trajectories to be optimized.
Robust RL. We consider a process where a pol-
icy is trained with a single deterministic initial state
and evaluated with a changing initial state to provoke the policy's edge-case behavior, making the learned behavior interpretable. Therefore, from
a different perspective, we consider the robustness
of a policy to out-of-distribution samples, i.e., ini-
tial states that were potentially not experienced dur-
ing training, also referred to as generalization capa-
bilities. If an agent is trained well, only looking at
some episodes of the agent’s interaction with the en-
vironment usually solely shows the expected behav-
ior, including often-occurring states. However, the
agent’s strategy also includes behavior in states that
have not been encountered that often. We also want
to show this behavior. The goal is to show the most
diverse behavior and generate a small but informative
overview of the agent’s strategy. To improve the gen-
eralization capabilities, using varying training config-
urations (Cobbe et al., 2020), optimized training sce-
narios (Altmann et al., 2023), or an evolving curricu-
lum (Parker-Holder et al., 2022) has shown to be a vi-
able approach. Yet, we specifically chose a different
training approach to showcase the methodical impact
of REACT for visualizing a possibly insufficient pol-
icy in edge-case scenarios. Note that this work gener-
ally does not consider any policy improvement. Nev-
ertheless, the generated representations could be fed
back into the training process as adversarial samples,
similar to Gabor et al. (2019).
Explainable RL. There are several approaches to
the interpretability and explainability of RL (XRL),
which are surveyed by Heuillet et al. (2021) and Al-
harin et al. (2020). Similar to general explainabil-
ity approaches previously introduced, RL interpre-
tation algorithms can be divided into different cate-
gories. One central aspect is their scope, reflecting
either local decisions or the global strategy. A fur-
ther distinction is drawn between post-hoc methods,
which keep the original model (Lage et al., 2019),
and intrinsic methods, replacing the original model
with a more explainable surrogate (Guo et al., 2021;
Huang et al., 2017). Combinations of both are also
possible. Furthermore, XRL algorithms can be ap-
plied before, during, or after training. Finally, XRL
algorithms can be classified according to their type
of explanation. The most common types are tex-
tual explanations, image explanations, collections of
states or state-action pairs, and explanations through
rules. We approach XRL by generating a collection
of demonstration trajectories that show diverse be-
havior based on a given policy interacting with an en-
vironment. We thereby strive for a scope that includes
the global inherent strategy. Furthermore, we op-
timize the diversity of those demonstrations using an
evolutionary process, which can be considered a post-
hoc method. Specifically, REACT does not require a
particular policy specification and, therefore, does not
need to be integrated prior to or during training. Sim-
ilarly, Amir and Amir (2018) propose creating a pol-
icy summary containing a fixed number of important
states and their surrounding states. The importance of a state is identified by the effect of the chosen action in that state.
The goal is to find states where a slight action mod-
ification would strongly influence the cumulative re-
ward. Therefore, the approach is mainly based on the
value function rather than relying on an external opti-
mization mechanism. Such states have also been re-
ferred to as critical states, where the chosen action
has a significant impact on the outcome, which can
be used to interpret policies trained using maximum
entropy-based RL (Huang et al., 2018). While RE-
ACT is similar regarding its global scope, we refrain
from integrating the reward into the fitness to be op-
timized and instead use it to validate finding a set of
diverse demonstrations. Likewise, Sequeira and Ger-
vasio (2020) consider the agents’ actions and states
in the environment, but also its policy, to compile a
summarizing video of interestingness elements. The
frequency, execution certainty, transition value, and
sequences determine interesting elements, intending
to show a maximally diverse set of highlights. In con-
trast to both, we consider demonstrations of complete
trajectories instead of patching together possibly un-
related sequences due to their impact. Therefore, we
refrain from experimental comparison to those ap-
proaches. Nevertheless, to measure the quality of
demonstrations, we use the fidelity metric proposed
by Guo et al. (2021), adapted to indicate increased fi-
delity with higher scores, which we introduce in the
following section.
RL Testing. Like REACT, Tappler et al. (2022) use
a genetic algorithm to find an interesting trace for test-
ing RL policies. Zolfagharian et al. (2023) optimize
full episodes to search for faulty behavior and train
a predictive model. REACT, on the other hand, op-
timizes the initial state that causes the trace, given
a policy to be analyzed. Pang et al. (2022) propose
a similar fuzz test framework for RL, modifying the
initial state to generate fresh sequences. In contrast
to REACT, they do so by estimating the sensitivity of
the given model to its seed instead of applying evo-
lutionary operators on the initial state. Overall, how-
ever, those approaches are primarily motivated to gen-
erate test cases, preferably where the model fails. RE-
ACT aims to generate a balanced representation of the
learned behavior, specifically including edge cases.
6 EVALUATION
Setup. To validate the proposed architecture, we
use a simple, fully observable, discrete FlatGrid11 environment with 11 × 11 fields, shown in Fig. 3a (Altmann, 2023). The goal of the policy is
to reach the target state (rewarded +50), where there
are neither holes nor obstacles that could disrupt the
agent’s path. A step cost of 1 is applied to en-
courage choosing the shortest path. Episodes are ter-
minated upon reaching the target state or after 100
steps. We use PPO (Schulman et al., 2017) with de-
fault parameters (Raffin et al., 2021) to train a policy
that we can then evaluate with REACT. To show di-
verse behavior, we intentionally terminated training
early (after 35k steps), just after the agent confidently
reached the target. Using such an imperfect policy has
a higher probability that the agent has not yet explored
the entire environment. For evaluation purposes, we
also want it to display behavior that leads the agent
not to reach its goal. Note that the policy is trained
with a single initial state shown in Fig. 3a. The fol-
lowing results are averaged over ten random seeds of the demonstration optimization, all based on a single, previously trained policy, to increase the significance of the presented experimental results.
Metrics. To provide an intuition over the resulting
demonstrations T , we summarize them in a single 3D
histogram, displaying the state-frequency of all grid
cells. Compared to showing the discrete paths (cf.
Fig. 1), this allows the visualization of results over
multiple optimization runs without diminishing the
depiction of the demonstration diversity by averaging
them. Since viewing the behavior diversity of the fi-
nal demonstrations is very subjective, we additionally
consider the cumulated demonstration fidelity:
$$S = \sum_{\tau \in T} \frac{|\tau|}{|T|} \left| \bar{R} - r_\tau \right|, \qquad (8)$$

with the absolute mean reward $\bar{R} = \frac{1}{|T|} \sum |r_\tau|$ and the total trajectory length $|T| = \sum |\tau|$, adapted from
(Guo et al., 2021). Intuitively, the fidelity of an expla-
nation measures the approximation quality w.r.t the
given model, where a higher value indicates higher
coverage (Molnar, 2020). As we consider a set of tra-
jectories to serve as an explanation, S could also be
viewed as their Shapley values, i.e., the impact of each
trajectory on the total demonstration (Shapley and
Shubik, 1954). Consequently, S also closely resem-
bles the population diversity D, defined earlier. Fur-
thermore, we consider the final return and final (tra-
jectory) length. Both metrics are crucial when train-
ing the optimal policy (maximizing the return while
minimizing the solution length) and do not influence
the optimized fitness function. However, we are not
interested in the minimum or maximum of the returns
or lengths but instead in the range of the metrics and
how uniformly the individuals are spread across dif-
ferent returns and trajectory lengths. Therefore, we
use box plots to visualize our results, where a bigger
range between the whiskers promises greater diver-
sity, and larger boxes indicate an even distribution.
We also report the deterministic policy performance
in the unaltered training environment, which is often
used to validate learned behavior and serves as a base-
line. Note that the fidelity for a single trajectory with
any contrasting behavior is always zero. This already
accurately reflects the deficiencies of considering a
single training scenario for the evaluation. Further-
more, we compare REACT to a random search ap-
proach, implemented as the initial population P
0
be-
fore applying the evolutionary process. This Random
approach could be considered most closely related to
comparable interpretability approaches, altering the
environment without optimization while maintaining
comparability to REACT.
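For reference, Eq. (8) only requires the returns and lengths of the final demonstrations; the following sketch reflects our reading of the adapted fidelity (with |T| taken as the total trajectory length, as defined above) and is not the authors' code.

```python
def fidelity(returns, lengths):
    """Eq. (8): length-weighted deviation of each demonstration's return from the
    absolute mean reward, accumulated over the demonstration pool T."""
    total_length = sum(lengths)                                   # |T| = sum of |tau|
    mean_abs_return = sum(abs(r) for r in returns) / total_length
    return sum(
        length / total_length * abs(mean_abs_return - r)
        for r, length in zip(returns, lengths)
    )
```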
Figure 3: REACT Evaluation: Fidelity (b) and Final Return (c) of Random (d) and REACT (e) demonstrations of a PPO policy trained for 35k steps in the FlatGrid11 (a) show increased diversity and even distribution of REACT-generated demonstrations over random or static initial states, with the training performance in the unaltered environment shown as a blue line in (c).

Results. Fig. 3 shows the evaluation results. The trained policy reaches a return of 34 with a trajectory length of 16 (cf. Fig. 3c). Using a random pool of
initial states increases the encountered return while
reducing the trajectory length of the resulting demon-
strations by moving the initial state closer to the tar-
get. Yet, random demonstrations still mostly yield
behavior in the upper reward region. REACT man-
ages to diversify the pool of demonstrations further,
more evenly covering a larger region of final returns.
Looking at the final fidelities in Fig. 3b, REACT is
able to double the demonstration quality compared
to the Random approach. Analyzing the resulting
demonstrations from a single population, shown in
Figs. 3d and (e), reveals two further insights: Overall,
most trajectories successfully reach the target, shown
by the highest occurrence of the target state, indi-
cating a successfully trained policy that is robust to
the introduced state disturbances. Yet, REACT pro-
duces more diverse trajectories distributed over far-
ther states. Some states even resulted in the policy
failing to navigate to the target, as indicated by out-
liers with a final return of -100.
Fitness Impact. Besides yielding diverse demon-
strations, we also want to ensure the appropriateness
of the proposed joint fitness. Fig. 4 therefore provides
an additional in-depth analysis of the impact of the fit-
ness components across the single last population of
10 individuals (a) and throughout the 40 optimization
generations (b).
Figure 4: FlatGrid Joint Fitness Analysis: (a) Population Analysis and (b) Generation Analysis of the components Global Diversity, Local Diversity, Certainty, Local Distance, and Joint Fitness.

To accurately show the influence of the local diversity (light blue) and the certainty (orange), we visualize their population distance, which is combined in the minimum local distance (yellow) to be accumulated with the global diversity (blue) (cf. Eq. (6)).
Interestingly, already with a population size of 10, in-
dividual fitness decreases throughout the population,
reaffirming the chosen population size. Individuals
evaluated with a lower global fitness (bearing higher
similarity to the overall population) show higher lo-
cal distances, i.e., dissimilarities to the population re-
garding the diversity of the behavior itself, which con-
ceptually justifies considering both diversity perspec-
tives. In addition, all fitness components are shown
to influence the whole behavior optimization, evenly
increasing throughout the 40 generations. The con-
siderably minor improvement in the last ten genera-
tions indicates convergence of the optimized demon-
strations.
Holey Gridworld
To further evaluate our approach, we use the more
complex HoleyGrid environment shown in Fig. 5a,
extending the previous FlatGrid with holes immedi-
ately terminating an episode with a reward of −50.
The holes add additional complexity to the gridworld
since the policy needs to learn to circumvent them to
reach the target successfully. The policy to be ana-
lyzed is trained with PPO for 150k steps in a static
layout, just reaching successful behavior, with a re-
turn of 36 and a trajectory length of 14 (cf. Fig. 5c).
The evaluation results in Fig. 5 reveal a smaller
range of returns than the FlatGrid results, presum-
ably caused by the additional holes. In contrast to
the unaltered training environment in which the pol-
icy navigates successfully, we are able to reveal un-
successful behavior with returns slightly below −50.
Again, REACT covers a slightly larger fraction of the
return compared to demonstrations from randomly
generated initial states. Regarding their fidelity (cf.,
Fig. 5b), the final REACT demonstrations signifi-
cantly outperform the Random demonstrations with
a mean of around 24, even though dropping slightly
below the Random baseline at around 13 in the ini-
tial generations. This is also reflected in the demonstration 3D histograms in Figs. 5d and (e).

Figure 5: HoleyGrid Evaluation: Fidelity (b) and Final Return (c) of Random (d) and REACT (e) demonstrations of a PPO policy trained for 150k steps in the HoleyGrid11 (a) indicate further edge-case demonstrations being generated using REACT over random or static initial states. The blue line in (c) displays the training performance in the unaltered environment.

REACT
demonstrations almost cover the whole state space,
where, due to the nature of the fitness, we can assume
all remaining states to yield comparable behavior that
would not increase the demonstration diversity. The
unoptimized demonstrations only cover more direct
solution paths, which is also reflected in the smaller
interquartile range of the corresponding returns. Please
refer to the appendix for an in-depth analysis of opti-
mization progress and the impact of joint fitness.
Given the high demonstration coverage of
10 highly diverse yet comprehensibly compact tra-
jectories, we argue that REACT allows a human to
properly assess the trained policy’s inherent behavior.
Concretely, the analyzed policy can be described as
robust with high certainty, given the above-zero in-
terquartile range of the demonstration returns, where
further training in some problematic edge cases could
be desirable depending on the intended application.
Overall, REACT increases the interpretability of the
policy at hand, especially compared to a single train-
ing trajectory with randomly chosen initial positions.
Continuous Robotic Control
Finally, we demonstrate the effect of REACT in a
more complex real-world application, where it could
be utilized to decide between deploying different poli-
cies. For this, we use the continuous robotic control
environment FetchReach shown in Fig. 6a.
Environment. The agent is represented by a manip-
ulator, the robotic arm, with six degrees of freedom,
and its end effector, a gripper. The task is to control
the robotic arm by applying a three-dimensional force
vector to move the gripper to reach the target state
(green point). In contrast to the previous gridworld
environments, both action- and observation-space are
real-valued. Furthermore, the task is open-ended such
that episodes continue for 50 steps regardless of suc-
cessfully reaching the target. Therefore, we use a
sparse reward function, where the agent is penalized
with −1 for every step in which it is not close to the target,
i.e., where the Euclidean distance between the effec-
tor and the target is greater than 0.05. During train-
ing, the effector’s position is always initialized at the
center, while the target is randomly positioned within
a 0.3-sized cube around the center to improve gener-
alization of the learned behavior (de Lazcano et al.,
2023).
REACT Parameters. Given the increased environ-
mental complexity, we adapted the parameterization
of REACT according to preliminary studies. Most
importantly, to remove all random factors from gen-
erating demonstrations, we include the target position
and the agent (gripper) position in the initial state to
be optimized. This results in a 6-dimensional state en-
coding, which we encode with a bit-length of 9 to re-
duce the intervals between possible states to less than
0.001. Furthermore, we replace the total number of
possible states |{s ∈ S}| for calculating the local di-
versity (cf. Eq. (1)) by the trajectory length |τ|, which
results in the static horizon H = 50 for this envi-
ronment. Also, as previously denoted, we use a dis-
cretization of states s τ such that the visited frac-
tion of the state space remains reflected as intended.
Finally, we increased the population size to 30 and
used 1000 generations. Instead of analyzing a
single moderately trained policy, we compare policies
from three stages of training.
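Under these adaptations, the local diversity of Eq. (1) might be computed roughly as sketched below (our own illustration; the 0.05 bin size used for the state discretization is an assumption, not a value reported by the paper):

```python
import numpy as np

def continuous_local_diversity(trajectory, horizon=50, bin_size=0.05):
    """Adapted Eq. (1) for FetchReach: number of distinct discretized states visited,
    normalized by the fixed horizon |tau| = H = 50 instead of the state-space size."""
    binned = {tuple(np.round(np.asarray(s) / bin_size).astype(int)) for s in trajectory}
    return len(binned) / horizon
```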
Training. Having proven beneficial in various con-
tinuous control tasks, we train the policies to be
compared using SAC (Haarnoja et al., 2018), imple-
mented with default parameterization (Raffin et al.,
2021). To demonstrate the comparative evaluation ca-
pabilities, we trained policies for 100k, 3M, and 5M
steps, which we refer to as SAC-100k, SAC-3M, and
SAC-5M respectively for the following. Therefore,
we are able to compare policies from three training
stages, ranging from early convergence to possibly
over-trained, thus overfitting the training task.
(a) FetchReach (b) SAC-100k T (c) SAC-3M T (d) SAC-5M T (e) Final Return (f) Final Length

(g) Mean final fidelities and 95% confidence intervals:

Env          Algorithm  Random          REACT
FlatGrid11   PPO-35k     8.62 ± 6.47    20.48 ± 7.25
HoleyGrid11  PPO-150k   13.22 ± 3.58    24.40 ± 4.57
FetchReach   SAC-100k    1.44 ± 0.15     1.61 ± 0.23
FetchReach   SAC-3M      1.46 ± 0.22     7.39 ± 4.92
FetchReach   SAC-5M      1.46 ± 0.24     2.83 ± 0.93

Figure 6: REACT Evaluation: Final Return (e) and Length (f) of Random (red) and REACT (green) demonstrations of SAC policies trained in the FetchReach (a) environment for 100k (b), 3M (c), and 5M (d) steps demonstrate the applicability of REACT in discerning policies from different training stages by disclosing their inherent behavior. (g) summarizes our results.
Results. The overall evaluation results are shown in
Fig. 6. The performance of all models in the unaltered
training environment shows both increasing returns of
around −1.8, −1.7, and −1.6 and decreasing trajec-
tory distances of around 0.73, 0.12, and 0.11 for SAC-
100k, SAC-3M, and SAC-5M respectively. With a
maximum target distance of 0.3, based only on these
results, the primal SAC-100k could be disregarded
due to the significantly more extensive movement,
even though reaching competitive rewards. However,
REACT reveals further important insights on which
to base model interpretations and subsequent deci-
sions. Regarding the final trajectory length, REACT
shows diverse demonstrations to be evenly distributed
around the single training experience for SAC-100k,
with both the length and the variance of the length de-
creasing upon further training. Compared to demon-
strations based on random initial states, REACT again
shows a slight increase in the overall diversity and
even distribution of demonstrations. More interest-
ing results, however, are shown for the final return,
where random configurations, similar to the training
configuration, do not reveal any insightful differences
between the models. On the other hand, REACT re-
veals the overall return variance increasing with fur-
ther training, with the median of returns even de-
creasing. It is important to note that the return is
not included in the fitness to optimize the demon-
strations. Thus, these observations emerge from di-
verse behavior generated by the policies. Given the
increasing returns observed for the training configu-
ration, this could most likely be explained as over-
fitting behavior. To give an intuition of the scope
and nature of the resulting demonstrations, we finally
consider the path of all Random (red) and REACT
(green) trajectories shown in Figs. 6b-(d). Due to
the continuous nature of the environment and the in-
creased number of individuals, we did not plot the
resulting demonstrations as cumulative distributions
(mainly because averaging would diminish any diver-
sity within the populations). Although this kind of vi-
sualization does not allow for the precise analysis of
each resulting trajectory, it perfectly conveys the over-
all nature of the generated demonstrations². Again,
REACT covers a comparably larger fraction of the
state space more evenly and even detected a policy
insufficiency of SAC-3M, causing the demonstration
of an outlier. In summary, the shortest-trained policy
reaches targets the fastest, showing the lowest penal-
ties and thus the highest returns but with the lowest
precision and, hence, the highest movement and tra-
jectory length. Longer-trained policies, on the other
hand, show higher penalties. They reach the target
slower yet more precisely, as indicated by the over-
all lower trajectory length. The assessment of those
characteristics heavily depends on the intended appli-
cation; however, REACT has revealed those critical
characteristics of the inherently learned behavior.
Finally, Table 6g summarizes the final fidelities.
Overall, REACT improves the demonstration qual-
ity compared to the Random baseline, roughly main-
taining its low score throughout the different models.
² Video renderings are available at https://github.com/philippaltmann/REACT.
Notably, REACT extracts viable characteristics, espe-
cially for more mature models, significantly outper-
forming the chosen baseline and showcasing its scal-
ability.
7 CONCLUSION
To enhance the interpretability of RL, we introduced
Revealing Evolutionary Action Consequence Trajec-
tories (REACT). REACT adds disturbances to the en-
vironment by altering the initial state, causing the pol-
icy to generate edge-case demonstrations. To assess
trajectories for demonstrating a given policy, we for-
malized a joint fitness combining the local diversity
and certainty of the trajectory itself with the global
diversity of a population of demonstrations. To op-
timize a pool of demonstrations, we apply an evolu-
tionary process to the population of individuals, en-
coded as the initial state, evaluated by the joint fitness.
To evaluate REACT, we analyzed various policies
trained in flat and holey gridworlds as well as a con-
tinuous robotic control task at different training stages.
Comparisons to the unaltered training environment
and randomly generated initial states showed that RE-
ACT reveals a set of more diverse and more evenly
distributed demonstrations to serve as a varietal ba-
sis to assess the learned (inherent) behavior. In ad-
dition to the final return, we analyzed the demonstra-
tions’ utility using an adapted fidelity metric. How-
ever, we refrain from human evaluations and leave
the subjective assessment of the appended demonstra-
tions to the reader. Furthermore, we only introduced
disturbances of the initial agent and target positions.
Thus, future work should examine extending REACT
to further variations of the environment, such as the
overall layout or the task itself. Also, the resulting
pool of demonstrations could be used either to further
improve the policy regarding revealed vulnerabilities
or to infer a global causality model to further foster
the policy’s interpretability. Overall, we believe that
REACT represents a universal policy-centric starting
point for improving the overall interpretability of the
currently mostly opaque RL models.
ACKNOWLEDGEMENTS
This work was partially funded by the Bavarian Min-
istry for Economic Affairs, Regional Development
and Energy as part of a project to support the thematic
development of the Institute for Cognitive Systems.
REFERENCES
Alharin, A., Doan, T.-N., and Sartipi, M. (2020). Reinforce-
ment learning interpretation methods: A survey. IEEE
Access, 8:171058–171077.
Altmann, P. (2023). hyphi gym. https://github.com/
philippaltmann/hyphi-gym/.
Altmann, P., Ritz, F., Feuchtinger, L., Nüßlein, J., Linnhoff-
Popien, C., and Phan, T. (2023). Crop: towards
distributional-shift robust reinforcement learning us-
ing compact reshaped observation processing. In
Proceedings of the Thirty-Second International Joint
Conference on Artificial Intelligence, IJCAI ’23.
Amir, D. and Amir, O. (2018). Highlights: Summarizing
agent behavior to people. In Adaptive Agents and
Multi-Agent Systems.
Behrens, M., Gube, M., Chaabene, H., Prieske, O., Zenon,
A., Broscheid, K.-C., Schega, L., Husmann, F., and
Weippert, M. (2023). Fatigue and human perfor-
mance: an updated framework. Sports medicine,
53(1):7–31.
Bhatt, V., Tjanaka, B., Fontaine, M., and Nikolaidis, S.
(2022). Deep surrogate assisted generation of envi-
ronments. Advances in Neural Information Process-
ing Systems, 35:37762–37777.
Cobbe, K., Hesse, C., Hilton, J., and Schulman, J. (2020).
Leveraging procedural generation to benchmark rein-
forcement learning. In International conference on
machine learning, pages 2048–2056. PMLR.
de Lazcano, R., Andreas, K., Tai, J. J., Lee, S. R., and Terry,
J. (2023). Gymnasium robotics. http://github.com/
Farama-Foundation/Gymnasium-Robotics.
Fogel, D. B. (2006). Evolutionary computation: toward a
new philosophy of machine intelligence. John Wiley
& Sons.
Gabor, T. and Altmann, P. (2019). Benchmarking surrogate-
assisted genetic recommender systems. In Pro-
ceedings of the Genetic and Evolutionary Compu-
tation Conference Companion, GECCO ’19, page
1568–1575, New York, NY, USA. Association for
Computing Machinery.
Gabor, T., Belzner, L., and Linnhoff-Popien, C. (2018).
Inheritance-based diversity measures for explicit con-
vergence control in evolutionary algorithms. In Pro-
ceedings of the Genetic and Evolutionary Computa-
tion Conference, pages 841–848.
Gabor, T., Sedlmeier, A., Kiermeier, M., Phan, T., Henrich,
M., Pichlmair, M., Kempter, B., Klein, C., Sauer, H.,
AG, R. S., et al. (2019). Scenario co-evolution for
reinforcement learning on a grid world smart factory
domain. In Proceedings of the Genetic and Evolution-
ary Computation Conference, pages 898–906.
Guo, W., Wu, X., Khan, U., and Xing, X. (2021). Edge:
Explaining deep reinforcement learning policies. Ad-
vances in Neural Information Processing Systems,
34:12222–12236.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. CoRR,
abs/1801.01290.
Heuillet, A., Couthouis, F., and Díaz-Rodríguez, N.
(2021). Explainability in deep reinforcement learning.
Knowledge-Based Systems, 214:106685.
Huang, S. H., Bhatia, K., Abbeel, P., and Dragan, A. D.
(2018). Establishing appropriate trust via critical
states. CoRR, abs/1810.08174.
Huang, S. H., Held, D., Abbeel, P., and Dragan, A. D.
(2017). Enabling robots to communicate their objec-
tives. CoRR, abs/1702.03465.
Ishibuchi, H., Tsukamoto, N., and Nojima, Y. (2008). Evo-
lutionary many-objective optimization: A short re-
view. In 2008 IEEE congress on evolutionary com-
putation (IEEE world congress on computational in-
telligence), pages 2419–2426. IEEE.
Khadka, S. and Tumer, K. (2018). Evolutionary reinforce-
ment learning. CoRR, abs/1805.07917.
Koh, P. W. and Liang, P. (2017). Understanding black-box
predictions via influence functions. In International
conference on machine learning, pages 1885–1894.
PMLR.
Lage, I., Lifschitz, D., Doshi-Velez, F., and Amir, O.
(2019). Exploring computational user models for
agent policy summarization. CoRR, abs/1905.13271.
Lehman, J. and Stanley, K. O. (2011). Abandoning objec-
tives: Evolution through the search for novelty alone.
Evolutionary computation, 19(2):189–223.
Li, X., Xiong, H., Li, X., Wu, X., Zhang, X., Liu, J.,
Bian, J., and Dou, D. (2022). Interpretable deep learn-
ing: interpretation, interpretability, trustworthiness,
and beyond. Knowledge and Information Systems.
Lin, B. and Su, J. (2008). One way distance: For shape
based similarity search of moving object trajectories.
GeoInformatica, 12:117–142.
Lundberg, S. M. and Lee, S.-I. (2017). A unified approach
to interpreting model predictions. Advances in neural
information processing systems, 30.
Miller, B. L., Goldberg, D. E., et al. (1995). Genetic algo-
rithms, tournament selection, and the effects of noise.
Complex systems, 9(3):193–212.
Molnar, C. (2020). Interpretable machine learning.
Lulu.com.
Neumann, A., Gao, W., Wagner, M., and Neumann, F.
(2019). Evolutionary diversity optimization using
multi-objective indicators. In Proceedings of the
Genetic and Evolutionary Computation Conference,
pages 837–845.
Pang, Q., Yuan, Y., and Wang, S. (2022). Mdpfuzz: test-
ing models solving markov decision processes. In
Proceedings of the 31st ACM SIGSOFT International
Symposium on Software Testing and Analysis, pages
378–390.
Parker-Holder, J., Jiang, M., Dennis, M., Samvelyan, M.,
Foerster, J., Grefenstette, E., and Rocktäschel, T.
(2022). Evolving curricula with regret-based environ-
ment design. In International Conference on Machine
Learning, pages 17473–17498. PMLR.
Parker-Holder, J., Pacchiano, A., Choromanski, K., and
Roberts, S. (2020). Effective diversity in population-
based reinforcement learning. CoRR, abs/2002.00632.
Pleiss, G., Zhang, T., Elenberg, E., and Weinberger, K. Q.
(2020). Identifying mislabeled data using the area un-
der the margin ranking. Advances in Neural Informa-
tion Processing Systems, 33:17044–17056.
Puterman, M. L. (1990). Markov decision processes. Hand-
books in operations research and management sci-
ence, 2:331–434.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus,
M., and Dormann, N. (2021). Stable-baselines3: Reli-
able reinforcement learning implementations. Journal
of Machine Learning Research, 22(268):1–8.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD
international conference on knowledge discovery and
data mining, pages 1135–1144.
Sutton, R. S. and Barto, A. G. (2015). Reinforcement Learning: An Introduction. The MIT Press, Cambridge, Massachusetts, 2nd edition.
Rolf, B., Jackson, I., Müller, M., Lang, S., Reggelin, T., and
Ivanov, D. (2023). A review on reinforcement learning
algorithms and applications in supply chain manage-
ment. International Journal of Production Research,
61(20):7151–7179.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
Sequeira, P. and Gervasio, M. (2020). Interestingness ele-
ments for explainable reinforcement learning: Under-
standing agents’ capabilities and limitations. Artificial
Intelligence, 288:103367.
Shapley, L. S. and Shubik, M. (1954). A method for evalu-
ating the distribution of power in a committee system.
American political science review, 48(3):787–792.
Tappler, M., Córdoba, F. C., Aichernig, B. K., and Könighofer, B. (2022). Search-based testing of rein-
forcement learning. arXiv preprint arXiv:2205.04887.
Vartiainen, P. (2002). On the principles of comparative eval-
uation. Evaluation, 8(3):359–371.
Wineberg, M. and Oppacher, F. (2003). The underlying
similarity of diversity measures used in evolutionary
computation. In Genetic and evolutionary computa-
tion conference, pages 1493–1504. Springer.
Wu, S., Yao, J., Fu, H., Tian, Y., Qian, C., Yang, Y.,
FU, Q., and Wei, Y. (2023). Quality-similar diversity
via population based reinforcement learning. In The
Eleventh International Conference on Learning Rep-
resentations.
Wurman, P. R., Barrett, S., Kawamoto, K., MacGlashan, J.,
Subramanian, K., Walsh, T. J., Capobianco, R., De-
vlic, A., Eckert, F., Fuchs, F., et al. (2022). Outracing
champion gran turismo drivers with deep reinforce-
ment learning. Nature, 602(7896):223–228.
Zolfagharian, A., Abdellatif, M., Briand, L. C.,
Bagherzadeh, M., and Ramesh, S. (2023). A
search-based testing approach for deep reinforcement
learning agents. IEEE Transactions on Software
Engineering, 49(7):3715–3735.