Iterative Environment Design for Deep Reinforcement Learning Based
on Goal-Oriented Specification
Simon Schwan (https://orcid.org/0009-0002-4085-1777) and Sabine Glesner (https://orcid.org/0009-0003-6946-3257)
Software and Embedded Systems Engineering, Technische Universität Berlin, Straße des 17. Juni 135, Berlin, Germany
{s.schwan, sabine.glesner}@tu-berlin.de
Keywords:
Reinforcement Learning, Requirements Specification, Iterative Development, Goals, Environment Design.
Abstract:
Deep reinforcement learning solves complex control problems but is often challenging to apply in practice for
non-experts. Goal-oriented specification allows defining abstract goals in a tree and thereby aims at lowering
the entry barriers to RL. However, finding an effective specification and translating it to an RL environment
is still difficult. We address this challenge with our idea of iterative environment design and automate the
construction of environments from goal trees. We validate our method based on four established case studies
and our results show that learning goals by iteratively refining specifications is feasible. In this way, we
counteract the common trial-and-error practice in development and accelerate the use of RL in real-world
applications.
1 INTRODUCTION
Reinforcement Learning (RL) has emerged as a
promising solution for complex control problems
such as collision avoidance (Everett et al., 2021) in
autonomous driving. RL agents learn through interac-
tions with their environment, being rewarded for de-
sirable behavior. The initial step in the development
of RL solutions involves defining an environment in
the form of a Markov decision process (MDP). Despite
the potential of RL, the definition of the environment
requires significant experience and expertise and of-
ten involves trial-and-error. This results in very high
entry barriers for developers who are experts in their application domain but not in RL. These barriers drastically limit the application of RL in practice.
Goal-oriented specification (Schwan et al., 2023)
enables specifying goals in a tree structure, which allows abstracting from the technical details of RL. How-
ever, at this point, goal-oriented specification is miss-
ing definitions of how to automatically construct en-
vironments from goal tree specifications. These def-
initions are needed to make the approach applicable
by enabling the training of RL agents from goal trees.
Moreover, it is challenging to develop effective spec-
ifications on the first try because of many interdepen-
dent design choices. These choices come not only
from defining the environment, but also from training
agents on the environment.
With this work, we close the gap and address
the challenge of constructing environments from goal
trees to enable iterative goal-oriented design. Our
key idea is to design RL environments in iterations of
three phases: specifying goals, training agents, evalu-
ating the results for future improvements. To achieve
this, our two main contributions are as follows:
1. We introduce a method for automatically con-
structing environments from goal trees. Thereby,
we enable the training of RL agents from these
specifications.
2. We instantiate our method with definitions for a
specific set of goal tree components. By care-
fully choosing these definitions, we ensure that
goal tree refinements lead to an increase of the re-
warding feedback to the agent and create the op-
portunity for iterative improvements.
Together, manually refining specifications and automatically constructing environments from them enable domain experts to train agents while focusing
on the goals rather than the technical details. At the
same time, we reduce time-consuming tasks of con-
structing the environment manually in each iteration.
This makes iterative environment design from goal-
oriented specifications practical, and we evaluate our
method through four case studies from the Farama
Gymnasium (Farama Foundation, 2024). First, we
infer goals from the original environments to enable
goal-oriented specification. Second, we define mul-
tiple goal trees for each case study. Third, we train
agents for each automatically constructed environ-
ment and analyze the results. Code and results are available at https://doi.org/10.6084/m9.figshare.26408821.v1.
The paper is structured as follows. We relate
our work to existing research (Section 2). Then, we
describe preliminaries (3) such as MDPs and goal-
oriented specification. Subsequently, we introduce
our running example (4) and define our automated
construction of environments from goal trees (5). We
evaluate our method (6) and conclude (7).
2 RELATED WORK
With the increasing use of artificial intelligence (AI)
systems, Requirements Engineering for AI (RE4AI)
(Ahmad et al., 2023) becomes relevant. The survey
identifies the Unified Modeling Language (UML) and
goal-oriented requirements engineering (GORE) to be
the prevalent modeling languages in RE4AI. Rein-
forcement learning is a methodology that addresses
problems within the context of a specific theoretical
framework: the Markov decision process. Our spec-
ification method enables requirements engineering
that is inspired by GORE (Van Lamsweerde, 2001)
but tailored to the specific framework of RL. A sur-
vey of Human-in-the-loop (HITL) RL (Retzlaff et al.,
2024) examines existing research. It identifies the
HITL paradigm to be of utmost importance and pro-
poses that humans, i.e. developers, domain experts
and users, interact with the RL system in four sequen-
tial phases: agent development, agent learning, agent
evaluation, agent deployment. In alignment with the
theoretical findings of the survey, we introduce our it-
erative design approach according to the initial three
phases. However, we do not consider the agent de-
ployment.
We base our specification language on goals,
which are used in several other RL methods (Schaul
et al., 2015; Andrychowicz et al., 2017; Florensa
et al., 2018; Jurgenson et al., 2020; Chane-Sane et al.,
2021; Okudo and Yamada, 2021; Ding et al., 2023;
Okudo and Yamada, 2023). These methods focus
on improving the training efficiency and we identify
three major directions: (1) using goals to shape and
make the reward dense (Okudo and Yamada, 2021,
2023; Ding et al., 2023); (2) the division of a ma-
jor goal into subgoals (Jurgenson et al., 2020; Chane-
Sane et al., 2021) such as following intermediate way-
points on a trajectory; (3) learning goals simultane-
ously (Schaul et al., 2015; Andrychowicz et al., 2017;
Florensa et al., 2018) to improve generalizability. In
contrast, our approach uses goals to specify require-
ments and enable iterative environment design instead
of focusing on training efficiency.
Furthermore, the idea of goal-oriented specifica-
tion is to integrate existing RL techniques into the
specification and training procedure. In this con-
text, we review existing goal-based methods. Of-
ten, goal-based methods can be automatically applied
(Andrychowicz et al., 2017; Florensa et al., 2018;
Chane-Sane et al., 2021; Jurgenson et al., 2020) to
train RL agents. Hindsight experience replay (HER)
(Andrychowicz et al., 2017) enables learning many
goals from the same episode by relabeling the target
goals of the terminal state. Thus, HER improves gen-
eralizability by learning from unsuccessful episodes.
While we find HER promising to be integrated into
our training, other approaches (Florensa et al., 2018;
Jurgenson et al., 2020; Chane-Sane et al., 2021) require specifically tailored RL algorithms. These methods stand in contrast to our approach, which enables learning with standard, model-free RL algorithms. Subgoal-based reward shaping (Okudo and
Yamada, 2021, 2023) relies on the manual specifica-
tion of ordered subgoals to guide the agent. We may
be able to integrate it into goal-oriented specification
by developing a corresponding goal-tree operator.
Finally, there are other specification languages for
RL that are related to our work. While (Hahn et al.,
2019) specifies RL objectives in an ω-regular lan-
guage, most languages (Li et al., 2017; Jothimurugan
et al., 2019, 2021; Cai et al., 2021; Hammond et al.,
2021) are based on linear temporal logic, which al-
lows them to specify temporal properties. These lan-
guages enable the design of reward with theoretical
guarantees such as providing policy invariance. How-
ever, these theoretical considerations do not guarantee that deep RL algorithms converge, because deep RL relies on statistical optimization. In contrast, our method integrates
training into the iterative environment design to coun-
teract unpredictable side effects.
3 PRELIMINARIES
This section first provides preliminaries for reinforce-
ment learning, followed by goal-oriented specifica-
tion.
Reinforcement Learning. A reinforcement learn-
ing problem is formally modeled as a Markov deci-
sion process (MDP) (Sutton and Barto, 2018) by a
tuple (S, A, P, R) with S being the space of all states satisfying the Markov property, A being the space of all actions, P(s, a, s′) = Pr[s_{t+1} = s′ | s_t = s, a_t = a] being the transition probability and R : S × A × S → ℝ being the immediate reward. The RL agent interacts with the MDP and collects samples in the form of episodes τ = (s_1, a_1, r_1, s_2, ..., s_{T−1}, a_{T−1}, r_{T−1}, s_T) with states s_i ∈ S, actions a_i ∈ A, rewards r_i = R(s_i, a_i, s_{i+1}) and the terminal state s_T at time T. The objective of RL algorithms is to find a policy π : S → A, also named agent, that maximizes the expected cumulative and discounted reward, i.e. the return R(τ) = Σ_{i=0}^{T} γ^i r_i, with discount factor γ ∈ [0, 1]. Depending on the parameters, R(s_t, a_t, s_{t+1}) denotes the reward and R(τ) the return. A popular RL algorithm is Proximal Policy Optimization (PPO) (Schulman et al., 2017), a policy gradient method that updates in small steps by a clipped surrogate objective. PPO can optimize for discrete and continuous actions.
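For illustration, the return of a single sampled episode can be computed directly from its reward sequence. The following Python sketch is our own addition, not part of the original method:

    def discounted_return(rewards, gamma=0.99):
        # R(tau) = sum_i gamma^i * r_i over the rewards r_i of one episode
        return sum((gamma ** i) * r for i, r in enumerate(rewards))

    print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81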
There are several methods to specify reward.
Sparse reward, i.e. rewarding only on success, has
the advantage of defining a single and clear objective.
However, the agent may not be able to experience this
sparse reward because it does not reach the associated
success states. In this case, it is possible to shape the
reward to a dense function or to sparsely reward in in-
termediate states. If the reward R consists of multiple
components R_i as in multi-objective RL, a common
choice is to scalarize using the weighted linear sum
of the components. This enables the use of single-
objective RL algorithms (Roijers et al., 2013).
Throughout the paper, we assume state spaces S to be feature spaces S : S_A × S_B × S_C × ... consisting of a set of features F = {A, B, C, ...} where a state s ∈ S is a tuple s = (s_A, s_B, s_C, ...).
Goal-Oriented Specification. Goal-oriented spec-
ification (Schwan et al., 2023) introduces the speci-
fication of goals for RL agents in a hierarchical tree
structure. The approach formalizes the separation of
Markov decision processes into immutable aspects,
i.e. the initial environment, and the engineered as-
pects, i.e. requirements as depicted in Figure 1. The
initial environment is a three-tuple (S, A, P) with S being the initial state space (e.g. available sensors), A being the initial action space (e.g. actuators) and P being the initial transition probabilities (e.g. the physics of the world or a simulation). These
aspects are immutable and cannot be modified dur-
ing specification. Requirements are the counterpart
to the initial environment. They include aspects of
MDPs that can be designed or manipulated by the
engineer to solve a problem with RL. Formally, requirements are a tuple (S_G, A_G, T_G, R_G) that belongs to a goal space G ⊆ S_G. The state space S_G contains feature-engineered states. The action space is defined by A_G, which contains possibly abstracted actions that may differ from the initial environment. The termination space T_G : S_G × A_G contains state-action pairs for which episodes in the MDP terminate. This allows constraining undesired transitions in P for state-action pairs (s, a) ∈ T_G. The reward R_G : S × A × S → ℝ is a single scalar reward function that implicitly inherits the objective associated with the goal G.

Figure 1: Composed environment (Markov decision process) from the initial environment and requirements.
Figure 1 shows how the initial environment and
requirements are combined to construct an environ-
ment in the form of an MDP. To do so, it is necessary to specify a mapping between the requirements and the initial environment based on two functions. The first function state : S → S_G enables the conversion of the initial state space S to the feature-engineered state space of the requirements S_G. The agent chooses its next action a_t ∈ A_G based on the converted states. It is necessary to execute this action in the initial environment, which the definition of execute : A_G → A enables. Then, the environment proceeds to its next state.
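To make this mapping concrete, the following sketch shows how the state(...) and execute(...) functions, together with a constructed reward and termination space, could wrap an initial Gymnasium environment. This is our own illustration rather than the authors' implementation; the callables state_fn, execute_fn, reward_fn and terminate_fn are hypothetical stand-ins for the constructed requirements:

    import gymnasium as gym

    class ComposedEnv(gym.Wrapper):
        def __init__(self, initial_env, state_fn, execute_fn, reward_fn, terminate_fn):
            super().__init__(initial_env)
            self.state_fn = state_fn            # state: initial state space -> S_G
            self.execute_fn = execute_fn        # execute: A_G -> initial action space
            self.reward_fn = reward_fn          # R_G(s_t, a_t, s_{t+1})
            self.terminate_fn = terminate_fn    # membership test for T_G
            # Note: observation_space and action_space would also need to reflect S_G and A_G.

        def reset(self, **kwargs):
            obs, info = self.env.reset(**kwargs)
            self.s = self.state_fn(obs)
            return self.s, info

        def step(self, action):
            obs, _, terminated, truncated, info = self.env.step(self.execute_fn(action))
            s_next = self.state_fn(obs)
            reward = self.reward_fn(self.s, action, s_next)
            terminated = terminated or self.terminate_fn(s_next, action)
            self.s = s_next
            return s_next, reward, terminated, truncated, info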
Furthermore, goal-oriented specification (Schwan
et al., 2023) introduces the ability to specify goals in
a tree. The idea is to construct a requirements tuple
from a goal-tree specification, but exact definitions
are not presented in the original work. Each node in
the tree contains its own goal space and requirements.
The construction of these requirements is defined by
the following tree components: leaf nodes, operators,
annotations. Leaf nodes are the atomic units for goals in the tree and include an associated goal space G. Goals can be hierarchically structured by a generic operator (S_G, A_G, T_G, R_G) = ⨀_{i}^{N} (S_i, A_i, T_i, R_i) that allows refining the parent goal G into N subgoal nodes. In addition, nodes can be annotated with distance annotations that guide the agent to the goal space as well as with safety constraints. Safety constraints are specified by their constraint state-action space C : S × A and the number of allowed constraint violations δ_C ∈ ℕ, where cnt_C : S × A → {0, 1} counts a constraint violation (s, a) ∈ C with 1.

Figure 2: Pendulum environment.
4 RUNNING EXAMPLE
We illustrate our work using the Pendulum case study
from Farama Gymnasium (Farama Foundation, 2024)
that is shown in Figure 2. The objective of the case
study is learning to stabilize a circular pendulum in
an upright position. The state space S ⊆ ℝ³ contains three features, i.e. the X and Y position and the angular velocity α_V. Actions a ∈ ℝ define the torque that is applied to the pendulum.
The objective of the pendulum is implicitly de-
fined by the given reward function. However, to use
goal-oriented specification, we need explicit goals as
sections of the state space. Through empirical mea-
surements on trained agents based on the original re-
ward, we examine the terminal states of the episodes.
Based on the results, we define the goal space such
that a successfully trained agent consistently reaches
the goal at the end of an episode. For the Pendulum
case study, we define the goal G_Pend with constants c_i as standing upright at (s_X, s_Y) = (1, 0) with a tolerance of 15 degrees (c_X = cos(15°) ≈ 0.966, c_Y = sin(15°) ≈ 0.259) and with a low angular velocity below the threshold c_{α_V} = 0.25:

G_Pend = {(s_X, s_Y, s_{α_V}) | s_X ≥ c_X, |s_Y| ≤ c_Y, |s_{α_V}| ≤ c_{α_V}}    (1)

In the following, we simplify the notation of goal spaces by using {s_X ≥ c_X, |s_Y| ≤ c_Y, |s_{α_V}| ≤ c_{α_V}} analogously.
5 AUTOMATED CONSTRUCTION
OF ENVIRONMENTS FROM
GOAL TREE SPECIFICATIONS
In this section, we present our two main contributions
to evolve goal-oriented specification (Schwan et al.,
2023) and make iterative design of RL agents from
goal trees practical. First, we introduce our goal tree
processing algorithm that automatically constructs a
single, composite requirements tuple at the root. Our
algorithm is generic with respect to leaf nodes, op-
erators and annotations from the goal tree. Thereby,
it allows the development of future components inte-
grating further RL methods. Second, we introduce a
specific set of definitions for leaf nodes, operators and
annotations that instantiate the generic parts of our al-
gorithm. These definitions allow the construction of
environments from which RL agents can be directly
trained and evaluated. We leverage the fact that goal-
oriented specification allows specifying the same goal
in a variety of goal trees. According to our definitions,
node refinements increase the rewarding feedback to
the agent. Therefore, each refined goal tree specifi-
cation leads to the construction of a unique environ-
ment variant. Together, our algorithm and definitions
allow us to automatically construct unique environ-
ments that can be used to train RL agents and analyze
their behavior. Finally, this enables iterative environ-
ment design from goal-oriented specification.
Next, we introduce our algorithm followed by the
definitions that instantiate the generic construction.
For clarity, we use the terminology of specifying for
manually engineered aspects of the goal tree specifi-
cation and constructing for our automated construc-
tion of requirements.
We implement a depth-first traversal to recursively
construct the composite requirements as shown in Al-
gorithm 1. The algorithm process_node(node, S, A) receives a node as input for which we construct the output requirements (S_G, A_G, T_G, R_G) and the goal space G. Additionally, it receives a state space S and an action space A as input. We create a single requirements tuple for a specification by starting the process at the root node of the tree with process_node(root, S_root, A_root). Here, S_root and A_root are the direct result of the specified state(...) and execute(...) functions as presented in Section 3. Our algorithm processes a node in three sequential steps as follows.
First, we construct a requirements tuple for the
node under construction. Nodes may be either a leaf
or an operator node. Leaf nodes have a specified goal
space G = goal(node), from which we construct the
requirements tuple according to our definitions below.
Data: node, S, A
Result: G, (S_G, A_G, T_G, R_G)
/* Step 1: Process leaf or operator */
if node is leaf then
    G = goal(node)
    (S_G, A_G, T_G, R_G) = leaf(S, A, G)
else
    r = ∅
    for c ∈ children(node) do
        (S_i, A_i, T_i, R_i), G_i = process_node(c, S, A)
        r ← r ∪ {(S_i, A_i, T_i, R_i)}
    end
    G, (S_G, A_G, T_G, R_G) = ⨀_{i}^{N} (S_i, A_i, T_i, R_i), G_i
end
/* Step 2: Process node annotations */
for a ∈ annotations(node) do
    (S_G, A_G, T_G, R_G) ← build(a, (S_G, A_G, T_G, R_G))
end
/* Step 3: Process root specifics */
if node is root then
    insert G into T_G for all actions
end
return G, (S_G, A_G, T_G, R_G)

Algorithm 1: Our algorithm process_node(node, S, A) implements a recursive depth-first traversal of a goal tree specification to construct composite requirements.
Operator nodes have children, and we recursively construct their requirements (S_i, A_i, T_i, R_i) depth-first by calling process_node(c, S, A). Subsequently, we combine these requirements using the generic operator ⨀. This generic approach enables us to extend our specification language in the future. However, we introduce the specific definitions of our ∧-operator below. Second, we adapt the requirements according to the annotations of the node. We sequentially process these annotations by updating the requirements (←) according to our definitions as introduced below. Third, we end training episodes if the agent enters the root goal space. We do so by inserting the goal space G into the termination space T_G at the root.
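A compact Python rendering of this traversal could look as follows. It reflects our reading of Algorithm 1; the helpers goal, leaf, children, annotations, build and combine are hypothetical stand-ins for the definitions referenced above and introduced below:

    def process_node(node, S, A, is_root=True):
        # Step 1: construct requirements for a leaf or an operator node.
        if not children(node):                                    # leaf node
            G = goal(node)
            S_G, A_G, T_G, R_G = leaf(S, A, G)
        else:                                                     # operator node
            results = [process_node(c, S, A, is_root=False) for c in children(node)]
            goals = [g for g, _ in results]
            reqs = [req for _, req in results]
            G, (S_G, A_G, T_G, R_G) = combine(node, goals, reqs)  # e.g. the AND-operator
        # Step 2: apply the node's annotations (distance shaping, safety constraints).
        for a in annotations(node):
            S_G, A_G, T_G, R_G = build(a, (S_G, A_G, T_G, R_G))
        # Step 3: at the root, terminate episodes once the composite goal is reached.
        if is_root:
            previous_T, goal_G = T_G, G
            T_G = lambda s, act: previous_T(s, act) or goal_G(s)  # T_G modeled as a predicate
        return G, (S_G, A_G, T_G, R_G)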
In the following, we introduce definitions for each goal tree component: leaf nodes, the ∧-operator, distance and
constraint annotations. We use these definitions to
automatically construct the requirements tuple at the
root according to Algorithm 1. The result is a unique
environment for each goal tree specifying identical
goals. We illustrate all definitions by applying them
to our running example from Section 4 using the three
goal tree specifications as shown in Figure 3.
Leaf Nodes. Leaf nodes are the atomic units of
goals in goal-oriented specification. Each leaf node
contains a specified goal space G ⊆ S_G, which is the section of the state space that the agent aims to reach. We construct the requirements (S_G, A_G, T_G, R_G) for a leaf by calling leaf(S, A, G) as shown in Algorithm 1. The observation and action spaces need to conform with the other components of the tree. For leaf nodes, we externally define these by the structure of the tree:

S_G = S,  A_G = A

S and A are given as input to the leaf function.
The reward of a leaf node needs to give feedback
to the agent when it reaches the goal space. At the
same time, it is possible to specify goal trees that com-
pose multiple leaf goals. We define the reward R_G as follows:

R_G(s_t, a_t, s_{t+1}) =   1  if s_t ∉ G and s_{t+1} ∈ G
                          −1  if s_t ∈ G and s_{t+1} ∉ G
                           0  otherwise

The positive reward for entering the goal space, s_{t+1} ∈ G, may be sufficient if the leaf is the only goal in the
tree. However, goal-oriented specification enables the
composition of leaf nodes through operators, and it
may be possible that an agent again exits the goal
space. For this reason, we neutralize the positive re-
ward with a negative reward of the same magnitude to
prevent the recurrent collection of positive rewards.
The termination space T_G defines state-action pairs at which episodes are terminated. For the same reason of composing leaf nodes, we do not terminate episodes when reaching the goal of a leaf node. Instead, we initialize the termination space of a leaf node as the empty space:

T_G = ∅
Still, we terminate episodes when the agent enters the
composite goal at the root as shown in step three of
Algorithm 1.
The simplest goal tree for our running example
is to specify a leaf node with the goal space G_Pend from Eq. 1 at the root (Figure 3.a). This specification
results in a reward function with a sparse positive
reward when the agent successfully reaches the
goal and episodes terminate. Nevertheless, this
specification may prove challenging during training.
Depending on the size and the dynamics of the
environment as well as the exploration strategy,
the agent may not be able to reach goal states and
experience reward.
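As a minimal sketch of this leaf construction, with the goal space represented as a predicate over engineered states (an encoding choice of ours, not prescribed by the paper):

    def leaf_reward(G, s_t, a_t, s_next):
        # +1 on entering the goal space, -1 on leaving it, 0 otherwise
        if not G(s_t) and G(s_next):
            return 1.0
        if G(s_t) and not G(s_next):
            return -1.0
        return 0.0

    # Pendulum example with c_X = 0.966, c_Y = 0.259, c_aV = 0.25:
    G_pend = lambda s: s[0] >= 0.966 and abs(s[1]) <= 0.259 and abs(s[2]) <= 0.25
    print(leaf_reward(G_pend, (0.5, 0.4, 1.0), None, (0.99, 0.1, 0.1)))  # 1.0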
∧-operator Node. The ∧-operator enables the specification of simultaneous subgoals. For this reason, our definitions follow the semantics of intersecting subgoal spaces G = ⋂_{i}^{N} G_i. We instantiate the generic operator ⨀ from Algorithm 1 by defining (S_G, A_G, T_G, R_G) = ⋀_{i}^{N} (S_i, A_i, T_i, R_i). Our definitions enable the construction of the composite requirements tuple of the parent goal as follows.

Figure 3: Three goal tree specifications (a-c) for the same goal G_Pend with increasing specification details (left to right) combined with the Pendulum initial environment.
To enable the composition of child nodes, we define state and action spaces of the parent and its children to be identical:

∀i. S_G = S_i,  A_G = A_i

Moreover, this definition allows the intersection of the goal spaces to construct the parent goal space according to our semantics. The reward of our ∧-operator node should compose reaching the goals of its children. Our recursive algorithm constructs the requirements tuples of the child nodes first. These requirements include reward components R_i for each child node. We use these components and define the parent reward R_G to be the cumulated weighted sum:

R_G(s_t, a_t, s_{t+1}) = Σ_{i}^{N} ω_i R_i(s_t, a_t, s_{t+1})
By default, we weight the reward components equally
with ω_i = 1/N. However, this definition introduces the challenge of weighting, which is a non-trivial and often time-consuming manual task in reward design. Nevertheless, it defines a reward shape that increases the feedback to the agent by rewarding the success of reaching intermediate subgoals. For this reason, our ∧-operator allows for refinement of a goal and enables the construction of unique requirements while preserving the goal space at the root. We define the termination space of the parent to terminate episodes whenever a child indicates termination:

T_G = ⋃_{i}^{N} T_{G_i}
For example, the ∧-operator enables us to refine the root goal G_Pend of Figure 3.a into two child goals for reaching the position G_Pos = {s_X ≥ c_X, |s_Y| ≤ c_Y, s_{α_V} ∈ S_{α_V}} and stabilizing the pendulum G_Stab = {s_X ∈ S_X, s_Y ∈ S_Y, |s_{α_V}| ≤ c_{α_V}} as shown in Figure 3.b. While the goal space at the root remains the same, G_Pend = G_Pos ∩ G_Stab, the use of the ∧-operator leads to a weighted sparse reward shape with R_Pend = ω_{Pend,0} R_Pos + ω_{Pend,1} R_Stab. We omit the reward parameters R(s_t, a_t, s_{t+1}) for illustration. In contrast to the constructed requirements from Figure 3.a, not only the overall goal of the root is rewarded but also intermediate steps when reaching a child goal space. Thus, the ∧-operator increases the feedback to the agent, and we can construct a unique environment variant from the goal tree.
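A small sketch of this composition in code (our own rendering; the child rewards are plain Python callables):

    def and_reward(child_rewards, weights=None):
        # Weighted sum of the child reward components; equal weights by default.
        n = len(child_rewards)
        weights = weights if weights is not None else [1.0 / n] * n
        def R_G(s_t, a_t, s_next):
            return sum(w * R(s_t, a_t, s_next) for w, R in zip(weights, child_rewards))
        return R_G

    # e.g. R_pend = and_reward([R_pos, R_stab]) for the two Pendulum subgoal rewards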
Annotations. Tree nodes can be annotated in goal-
oriented specification. Thereby it is possible to con-
strain undesirable behavior or increase feedback to
the agent by guiding it towards the desired goal.
Each annotation modifies the requirements tuple of its
node, which we denote by the left arrow (←) in step
2 of Algorithm 1. For each additional reward compo-
nent, we specify a weight for balancing.
In goal-oriented specification, safety constraints
are specified by their constraint action-state space
C : S × A, along with the number of permitted vio-
lations δ ∈ ℕ. However, differing state transitions for
identical states violate the Markov property. This may
be the case for terminating episodes or penalizing
the agent when the violation boundary δ is reached.
To preserve the Markov property for constraint viola-
tions, we make the violation counter transparent to the
agent. To achieve this, we construct an extended state
space S_G ← S_G × ℕ by adding a violation counter s_C with:

s_C = δ − Σ_{t}^{T} cnt_C(s_t, a_t)
The construction entails updating the state and termi-
nation spaces for other tree components with the ex-
tended state space to comply with our definitions. For
instance, the state spaces are identical for parent and
child nodes according to the definitions of the ∧-operator. For this reason, we need to update the state and termination spaces of adjacent tree components when adding the constraint violation counter. Furthermore, we integrate the additional state feature into a constraint automaton, as shown in Figure 1, that counts constraint violations. If an episode contains δ constraint violations, we end it. We do so by modifying the termination space to include states with a violation counter of s_C = 0:

T_G ← T_G ∪ {(s, a) | s_C = 0}
Finally, we add a penalty to the original reward as
follows:
R_Pen(s_t, a_t, s_{t+1}) =  −1  if (s_t, a_t) ∈ C
                             0  otherwise
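The following sketch summarizes this constraint handling in code; it is our own encoding, with the violation predicate C and the budget delta as the specified inputs:

    def make_constraint(C, delta):
        # C(s, a) -> bool indicates a violation of the constraint space.
        def init_counter():
            return delta                                # s_C starts at delta
        def update_counter(s_C, s, a):
            return s_C - (1 if C(s, a) else 0)          # cnt_C consumes the violation budget
        def penalty(s_t, a_t, s_next):
            return -1.0 if C(s_t, a_t) else 0.0         # R_Pen added to the node reward
        def terminated(s_C):
            return s_C <= 0                             # episode ends once the budget is spent
        return init_counter, update_counter, penalty, terminated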
Distance annotations of goal-oriented specification enable guiding the agent towards the goal. They allow dense reward shaping by specifying a Euclidean distance function dist : S_G → ℝ with

dist(s) = √((s − g)²)

for a goal state g ∈ G. From this, we construct a potential-based reward shape (Ng et al., 1999):

R_Dist(s_t, a_t, s_{t+1}) = dist(s_t) − dist(s_{t+1})

Finally, we add the dense reward component R_Dist to the existing reward of the requirements.
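A sketch of the constructed shaping term (our own rendering of the two formulas above):

    import math

    def distance_shaping(goal_state):
        # dist(s): Euclidean distance to a goal state g; the shaping reward is its decrease.
        def dist(s):
            return math.sqrt(sum((si - gi) ** 2 for si, gi in zip(s, goal_state)))
        def R_dist(s_t, a_t, s_next):
            return dist(s_t) - dist(s_next)
        return R_dist

    # Pendulum position shaping towards (s_X, s_Y) = (1, 0), as used in Figure 3.c:
    R_dist_pos = distance_shaping((1.0, 0.0))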
Figure 3.c illustrates a third tree specification in which the refined subgoals G_Pos and G_Stab are annotated. The sparse reward of G_Pos is shaped by a dense reward constructed from the specified distance dist_Pos(s) = √((s_X − 1)² + (s_Y − 0)²) to the top center position. To restrict exploration of the state space with high angular velocities of |s_α| ≥ 0.7, we annotate the stabilization node by specifying the safety constraint C with δ = 1. Note: we introduce the constraint C for illustration purposes only and we do not use it in our experiments.
From the annotated specification in Figure 3.c, we
construct a structured reward function at our root re-
quirements with weighted components as follows:
R_Pend = ω_{Pend,0} (ω_{Pos,0} R_Pos + ω_{Pos,1} R_Dist)        [position goal]
       + ω_{Pend,1} (ω_{Stab,0} R_α + ω_{Stab,1} R_Pen)         [stabilization goal]

Each reward component strictly belongs to one of the nodes with goal spaces G_Pos and G_Stab. The weights ω_{Pend,i} allow balancing between the position and stabilization goals, whereas the weights ω_{Pos,i} and ω_{Stab,i} balance the proportions of the inner reward. Again, we construct requirements for a goal tree specification with the same goal space at the root. However, our definitions enable us to construct a unique and trainable environment variant.
At this point, we have evolved goal-oriented spec-
ification by enabling the automated construction of
environments from goal trees. Furthermore, we con-
struct unique requirements that increase the feedback
to the agent for each goal tree refinement. In the fol-
lowing section, we use our automated construction to
train RL agents on a series of specifications in four
case studies and evaluate the results.
6 EXPERIMENTS & DISCUSSION
With experiments on four existing case studies from
Farama Gymnasium (Farama Foundation, 2024), we
examine two key questions for iterative environment
design: (1) Can we specify goal trees from which
agents are trained to achieve the specified goals? (2)
How can the refinement of tree specifications be used
in the iterative design of RL environments? To answer
these questions, we first present our experimental de-
sign and setup in Section 6.1 and discuss the results
subsequently in Section 6.2.
6.1 Experiment Design and Setup
We evaluate our specification language on four case
studies from Farama Gymnasium (Farama Founda-
tion, 2024), encompassing control problems with dis-
crete and continuous action spaces: Acrobot, Pendu-
lum, MountainCarContinuous, LunarLander. Origi-
nally, each case study represents a trainable RL envi-
ronment, which we call baseline. Each baseline in-
cludes a reward that implicitly defines the objective.
For each case study, we follow our key idea of de-
signing RL environments in iterations of specifying
goals, training agents and evaluating the results.
Initially, we focus on training the baseline and
identify goals in the state space that are necessary to
use goal-oriented specification. For this purpose, we
use Proximal Policy Optimization (PPO) (Schulman
et al., 2017) because of its versatility in handling
both discrete and continuous action spaces along with
the minimal tuning effort required. We manually tune
hyperparameters for PPO to ensure that the agents can
solve the tasks. For fairness, we use these baseline-
tuned parameters for all experiments of the case study
and reduce bias from hyperparameter tuning. Finally,
Table 1: Complete list of experiment configurations for our case studies.

Acrobot-v1
  State space: S_{Cos,θ1} × S_{Sin,θ1} × S_{Cos,θ2} × S_{Sin,θ2} × S_{αVel,θ1} × S_{αVel,θ2} × S_Height (b)
  Goals (a): G_Height: 1.0 ≤ s_Height ≤ 3.0
  Distance annotations: dist_Height(s) = √((s_Height − 1.5)²)
  Constraints: -
  PPO hyperparameters (c): -

Pendulum-v1
  State space: S_X × S_Y × S_α
  Goals (a): G_Position: 0.966 ≤ s_X ≤ 1.0, −0.259 ≤ s_Y ≤ 0.259; G_Stabilization: −0.25 ≤ s_α ≤ 0.25
  Distance annotations: dist_Position(s) = √((s_X − 1)²)
  Constraints: -
  PPO hyperparameters (c): -

MountainCarContinuous-v0
  State space: S_X × S_Vel
  Goals (a): G_Position: 0.45 ≤ s_X ≤ 1.0; G_Velocity: 0.03 ≤ s_Vel ≤ 0.07
  Distance annotations: dist_Velocity(s) = √((s_Vel − 0.07)²)
  Constraints: -
  PPO hyperparameters (c): use_sde=True

LunarLander-v2
  State space: S_X × S_Y × S_{X,Vel} × S_{Y,Vel} × S_α × S_{α,Vel} × S_{Leg1} × S_{Leg2}
  Goals (a): G_Position: −0.2 ≤ s_X ≤ 0.2, −0.05 ≤ s_Y ≤ 0.05; G_Velocity: −0.1 ≤ s_{X,Vel} ≤ 0.1, −0.1 ≤ s_{Y,Vel} ≤ 0.1; G_Stabilization: −0.1 ≤ s_α ≤ 0.1, −0.1 ≤ s_{α,Vel} ≤ 0.1; G_LegsGrounded: s_{Leg1} = 1, s_{Leg2} = 1
  Distance annotations: dist_Position(s) = √(s_X² + s_Y²); dist_Velocity(s) = √(s_{X,Vel}² + s_{Y,Vel}²); dist_Stabilization(s) = √(s_α²)
  Constraints: δ_Crash = 1, C_Crash = {(s, a) | (s_X ≤ −0.2 ∨ s_X ≥ 0.2) ∧ s_Y ≤ 0.2}
  PPO hyperparameters (c): batch_size=32, n_steps=1024, n_epochs=4, gae_lambda=0.98, gamma=0.999, ent_coef=0.01 (d)

(a) We only state the relevant features for each goal. There are no further restrictions on other state space features.
(b) We expand the state space by S_Height with height = −cos(θ_1) − cos(θ_2 + θ_1) through our state(...) function to enable the specification of a height goal.
(c) If not stated differently, we use the default PPO parameters from Stable Baselines (DLR-RM, 2024b). Most importantly, these are: gamma=0.99, learning_rate=0.0003, batch_size=64, n_steps=2048, n_epochs=10, gae_lambda=0.95, ent_coef=0.0, use_sde=False.
(d) We use optimized hyperparameters from RL Zoo (DLR-RM, 2024a), a training framework with published hyperparameters.
we extract goal states as described for our running ex-
ample in Section 4. Table 1 lists the goals and hyper-
parameters for reproducibility.
Based on the identified goals, we define up to three
goal tree specifications for each case study. With each
iteration, we increase the specification details refining
the previous tree similar to Figure 3. Our first spec-
ification consists of a single root leaf node, which
includes the goal space that is the intersection of all
identified goals. Second, we refine this root node by
the ∧-operator into subgoal nodes. Finally, we cre-
ate a third specification by annotating the leaf nodes
with distance metrics as given in Table 1. For the
Acrobot case study, we have identified only one goal
and, therefore, we do not include an ∧-operator re-
fined specification. Additionally, for the LunarLander
case study, we impose a safety constraint (see Table 1)
representing the crash penalty from the baseline envi-
ronment.
From each specification, we automatically con-
struct an environment variant. We proceed to train
agents for each variant and measure their performance
by inspecting the individual success rate for each
goal. We do so by defining success based on a goal
space G, counting how often the agent reaches the
goal at the terminal state s
T,i
over N episodes:
success
G
(π) =
1
N
N
τ
i
π
(
1 ,s
T,i
G
0 else
Finally, we manually tune the weights of the re-
ward components of those case studies, in which the
agents converge to a local maximum and are therefore
unable to learn all goals.
Figure 4: Our results show the success rate of the individual goals with one row per case study. Each column encompasses a
single specification scenario with increasing specification details from left to right.
For reproducibility, we adhere to the following
setup. Each result is averaged over 10 independent
runs with random initialization. We train our agents
with PPO from Stable Baselines (DLR-RM, 2024b)
with the tuned hyperparameters from Table 1. We normalize the reward function weights for each environment as follows: for each of the N reward components of a node, the weight is 1/N; the hard safety constraint from the LunarLander results in a penalty of −1; each distance reward shape is divided by a specified maximum distance d to the goal.
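For orientation, a single training run in this setup could look as follows with the Stable-Baselines3 API; ConstructedEnv and the timestep budget are hypothetical placeholders for an environment constructed from a goal tree and the per-case-study training length:

    from stable_baselines3 import PPO

    env = ConstructedEnv()                    # hypothetical: environment constructed from a goal tree
    model = PPO("MlpPolicy", env, verbose=0)  # default hyperparameters unless overridden per Table 1
    model.learn(total_timesteps=200_000)      # placeholder training budget
    model.save("ppo_goal_tree_variant")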
6.2 Discussion of Results
The results of our experiments across the four case
studies are depicted in Figure 4. Each row corre-
sponds to one case study, while each column repre-
sents one of the four experimental scenarios: base-
line, single goal, ∧-operator refined goal, with dis-
tance. The graphs illustrate the success metric for
each goal listed in Table 1, providing comparability
across the scenarios, even though the baseline and
single goal scenario do not encompass these goals
directly. The following paragraphs present and dis-
cuss the results for each case study individually and
we conclude with a summary of our findings.
Acrobot. We identify a single goal to reach a spe-
cific height for the Acrobot environment. Thus, we do not apply the ∧-operator. The baseline scenario
shows convergence, achieving the height goal consis-
tently over all runs. The single goal scenario reaches
about 60 % success rate at the end of training with
high variance. This high variance arises because
in only 6 out of 10 training runs, the agents can reach
the goal. The agents do not receive any reward in the
remaining four runs due to the sparsity of the reward.
Therefore, the agents cannot converge to a solution.
Incorporating the distance annotation in the last sce-
nario resolves this exploration issue effectively, and
we achieve similar performance to the baseline.
Pendulum. In the baseline of the Pendulum envi-
ronment, the goals are learned consistently. The sin-
gle goal scenario achieves success in 8 out of 10
runs. The ∧-operator refined scenario presents a
more intricate result, in which the refined sparse re-
ward facilitates reliable learning of the stabilization
goal but reaching the target position is more difficult
to achieve. Here, half of the 10 agents converge to
a local optimum where they stabilize the pendulum
without achieving the position goal. The position dis-
tance annotation alleviates this problem by provid-
ing an additional dense reward. However, only by
manually weighting the position and stabilization re-
wards, we can resolve the conflict between the goals.
With this, we achieve more precise convergence to the
goals compared to the baseline as shown in Figure 5.
Figure 5: Results with manual weights for the Pendulum (ω_position = 0.8, ω_stabilization = 0.2) and LunarLander (ω_stabilization = 50, ω_legs_grounded = 20, ω_low_velocity = 100, ω_position = 200) case studies.
MountainCarContinuous. Originally, the Moun-
tainCarContinuous case study is built to introduce an
exploration challenge. This challenge is evident in
the results of the baseline scenario, where the goal is
achieved in only 4 out of 10 runs. We observe similar
difficulties across all three of our specification scenar-
ios. In our ∧-operator refined with distance scenario,
we achieve the goal in 6 out of 10 runs. Improving
the exploration and specific tuning of PPO can resolve
the exploration challenge for the baseline (Kapoutsis
et al., 2023). We strongly believe that our method
yields comparable results with similar tuning efforts,
although this remains to be tested. Nevertheless, we
observe that learning performance improves based on
our refinements with increasing specification detail.
Furthermore, during the design of the case study,
we have experienced the performance degradation for
a different goal tree specification. Specifically, an-
notating the position goal node with the x distance
to the goal states introduces unexpected complexity
into the problem. Here, the trained agents consis-
tently learned to stand still at the bottom of the moun-
tain, thereby avoiding the negative reward that must be incurred to attempt climbing the mountain at high velocity.
LunarLander. The LunarLander case study in-
volves the learning of four goals. The baseline
scenario reaches approximately 80 % success in
achieving the overall ∧-operator goal. While the
sparse reward proves insufficient in the single goal
scenario, the ∧-operator refined scenario mitigates this sparsity by rewarding each goal individually.
Despite this, achieving all goals simultaneously
remains challenging. Our distance annotations
scenario mitigates this issue for the stabilization
goal. Nonetheless, in most runs, the agents converge
to a local optimum, stabilizing without reaching
the landing position. Manual weighting corrects
this imbalance by prioritizing the position goal as
depicted in Figure 5 and we achieve more precise
convergence compared to the baseline.
To conclude our evaluation, we identify known
limitations of our method. Subsequently, we summa-
rize our results with respect to the two key questions
introduced at the beginning of this section.
Our method entails two limitations. First, spec-
ifying complex goals in the state space requires a
state structure for which it is possible and sufficient to
handcraft these goals. While this is theoretically pos-
sible, in practice it hinders the use of our method in
high-dimensional state spaces such as learning from
raw pixels. Second, we have experienced that specific
refinements of goals can lead to undesired and unex-
pected behavior of the trained agents as described in
the results of the MountainCarContinuous case study.
However, this does not contradict our approach of it-
erative environment design but rather emphasizes the
need for iterations. Nevertheless, it is important to
recognize the possibility of degrading results after a
refinement.
Finally, we examine the results regarding our two
key questions. First, we have shown the ability to
specify and learn goal tree specifications sufficiently
for all four case studies. Our results show that we
can consistently learn the goals in three out of four
cases with manually defined weights. For the re-
maining MountainCarContinuous case study, we have
achieved similar results compared to the baseline and
we have learned to reach the goals in 6 out of 10 runs.
Second, the ability to specify goal trees and automat-
ically construct environments enables iterations by
evaluating a specification from trained agents. Our
findings indicate that single goals with sparse rewards often do not provide enough feedback for effective learning. However, the results of our ∧-operator consistently improve by providing a refined reward for the goal. Additionally, the ∧-operator enables balancing possibly conflicting goals by weighting the inner reward components, while the distance
annotations help to guide the agent towards other-
wise challenging goals. In contrast, we would like
to point out that goal tree refinements do not always
yield improvements in learning the goals. Therefore,
iterations can also include reverting or adapting prior
changes to the specification. Nevertheless, our results
show that we can iteratively improve on learning the
specified goals. In the following section, we conclude
and present future work.
7 CONCLUSION
In this work, we have introduced iterative environ-
ment design for reinforcement learning based on goal-
oriented specification (Schwan et al., 2023). We
evolve goal-oriented specification and make it practi-
cal with two contributions. First, we introduce our au-
tomated method to construct RL environments from
goal tree specifications. Thereby, we enable the train-
ing of agents from these specifications to evaluate
their behavior for future improvements. Second, we
enable iterative goal tree refinements by introducing
definitions for leaf nodes, the ∧-operator and annota-
tions. To evaluate our method, we have trained agents
in four case studies with up to three specification sce-
narios each. With manually tuned weights of the re-
ward components, we achieve goal success rates sim-
ilar to the baselines but with higher precision. Finally,
our results show that goal tree refinements can be used
to iteratively improve the learning of specified goals.
Through iterative environment design, we oppose the
common trial-and-error practice to facilitate the ap-
plication of reinforcement learning.
In future work, we plan on automating the man-
ual weighting of reward components from our ∧-
operator to further reduce time-consuming manual
tasks. Moreover, we aim at enhancing our specifica-
tion method to be practical for high-dimensional state
spaces. Finally, introducing new operators can enable
specifying and learning temporal abstractions. With
this, we follow our idea to overcome the common
trial-and-error practice and facilitate the development
of RL solutions for domain experts.
ACKNOWLEDGEMENTS
This work has been partially funded by the Fed-
eral Ministry of Education and Research as part of
the Software Campus project ZoLA - Ziel-orientiertes
Lernen von Agenten (funding code 01IS23068).
REFERENCES
Ahmad, K., Abdelrazek, M., Arora, C., Bano, M., and
Grundy, J. (2023). Requirements engineering for
artificial intelligence systems: A systematic map-
ping study. Information and Software Technology,
158:107176.
Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong,
R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P.,
and Zaremba, W. (2017). Hindsight experience re-
play. In Advances in Neural Information Processing
Systems, pages 5048–5058.
Cai, M., Xiao, S., Li, B., Li, Z., and Kan, Z. (2021). Re-
inforcement Learning Based Temporal Logic Control
with Maximum Probabilistic Satisfaction. In 2021
IEEE International Conference on Robotics and Au-
tomation (ICRA), pages 806–812.
Chane-Sane, E., Schmid, C., and Laptev, I. (2021). Goal-
Conditioned Reinforcement Learning with Imagined
Subgoals. In Proceedings of the 38th International
Conference on Machine Learning, volume 139, pages
1430–1440.
Ding, H., Tang, Y., Wu, Q., Wang, B., Chen, C.,
and Wang, Z. (2023). Magnetic Field-Based Re-
ward Shaping for Goal-Conditioned Reinforcement
Learning. IEEE/CAA Journal of Automatica Sinica,
10(12):2233–2247.
DLR-RM (2024a). RL Baselines3 Zoo: A Train-
ing Framework for Stable Baselines3 Reinforce-
ment Learning Agents. https://github.com/DLR-RM/
rl-baselines3-zoo. [Last accessed on July 17th, 2024].
DLR-RM (2024b). Stable-Baselines3. https://github.com/
DLR-RM/stable-baselines3. [Last accessed on July
17th, 2024].
Everett, M., Chen, Y. F., and How, J. P. (2021). Colli-
sion avoidance in pedestrian-rich environments with
deep reinforcement learning. IEEE Access, 9:10357–
10377.
Farama Foundation (2024). Gymnasium: An API standard
for reinforcement learning with a diverse collection of
reference environments. https://gymnasium.farama.
org/. [Last accessed on July 12th, 2024].
Florensa, C., Held, D., Geng, X., and Abbeel, P. (2018).
Automatic goal generation for reinforcement learning
agents. In International conference on machine learn-
ing, pages 1515–1528.
Hahn, E. M., Perez, M., Schewe, S., Somenzi, F., Trivedi,
A., and Wojtczak, D. (2019). Omega-Regular Objec-
tives in Model-Free Reinforcement Learning. In Tools
and Algorithms for the Construction and Analysis of
Systems, pages 395–412.
Hammond, L., Abate, A., Gutierrez, J., and Wooldridge,
M. (2021). Multi-Agent Reinforcement Learning with
Temporal Logic Specifications. In AAMAS ’21: 20th
International Conference on Autonomous Agents and
Multiagent Systems, pages 583–592. ACM.
Jothimurugan, K., Alur, R., and Bastani, O. (2019). A
composable specification language for reinforcement
learning tasks. Advances in Neural Information Pro-
cessing Systems, 32.
Jothimurugan, K., Bansal, S., Bastani, O., and Alur, R.
(2021). Compositional reinforcement learning from
logical specifications. Advances in Neural Informa-
tion Processing Systems, 34:10026–10039.
Jurgenson, T., Avner, O., Groshev, E., and Tamar, A. (2020).
Sub-Goal Trees a Framework for Goal-Based Rein-
forcement Learning. In Proceedings of the 37th In-
ternational Conference on Machine Learning, volume
119 of Proceedings of Machine Learning Research,
pages 5020–5030.
Kapoutsis, A. C., Koutras, D. I., Korkas, C. D., and
Kosmatopoulos, E. B. (2023). ACRE: Actor-Critic
with Reward-Preserving Exploration. Neural Comput.
Appl., 35(30):22563–22576.
Li, X., Vasile, C. I., and Belta, C. (2017). Reinforcement
learning with temporal logic rewards. IEEE Interna-
tional Conference on Intelligent Robots and Systems,
pages 3834–3839.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invari-
ance under reward transformations : Theory and ap-
plication to reward shaping. 16th International Con-
ference on Machine Learning, 3:278–287.
Okudo, T. and Yamada, S. (2021). Subgoal-Based Re-
ward Shaping to Improve Efficiency in Reinforcement
Learning. IEEE Access, 9:97557–97568.
Okudo, T. and Yamada, S. (2023). Learning Potential
in Subgoal-Based Reward Shaping. IEEE Access,
11:17116–17137.
Retzlaff, C. O., Das, S., Wayllace, C., Mousavi, P., Afshari,
M., Yang, T., Saranti, A., Angerschmid, A., Taylor,
M. E., and Holzinger, A. (2024). Human-in-the-Loop
Reinforcement Learning: A Survey and Position on
Requirements, Challenges, and Opportunities. J. Artif.
Intell. Res., 79:359–415.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley,
R. (2013). A survey of multi-objective sequential
decision-making. Journal of Artificial Intelligence Re-
search, 48:67–113.
Schaul, T., Horgan, D., Gregor, K., and Silver, D. (2015).
Universal value function approximators. In 32nd In-
ternational Conference on Machine Learning.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal Policy Optimization Al-
gorithms. CoRR.
Schwan, S., Klös, V., and Glesner, S. (2023). A Goal-
Oriented Specification Language for Reinforcement
Learning. In International Conference on Modeling
Decisions for Artificial Intelligence, pages 169–180.
Springer.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Van Lamsweerde, A. (2001). Goal-oriented requirements
engineering: A guided tour. Proceedings of the IEEE
International Conference on Requirements Engineer-
ing, pages 249–261.