Trajectory Generation Model: Building a Simulation Link Between
Expert Knowledge and Offline Learning
Arlena Wellßow¹,² (https://orcid.org/0009-0005-7295-000X), Torben Logemann¹ (https://orcid.org/0000-0002-2673-397X) and Eric MSP Veith¹,² (https://orcid.org/0000-0003-2487-7475)
¹Adversarial Resilience Learning, Carl von Ossietzky University Oldenburg, Oldenburg, Germany
²OFFIS – Institute for Information Technology, Oldenburg, Germany
Keywords:
Modeling, Offline Learning, Simulation, Expert Knowledge.
Abstract:
Reinforcement learning has shown its worth in multiple fields (e.g., voltage control, market participation).
However, training each agent from scratch incurs long training times and high computational costs.
Including expert knowledge in agents is beneficial: human reasoning can devise independent solutions for
unusual states of the (less complex) system, and, especially in long-established fields, many strategies are
already known and, therefore, learnable for the agent. Using this knowledge allows agents to apply these
solutions without having to encounter such situations first. Expert knowledge is usually available only
in semi-structured, non-machine-readable forms, such as (mis-)use cases. Also, data containing extreme
situations, grid states, or emergencies is usually limited. However, these situations can be described as scenarios
in these semi-structured forms. From these descriptions, a state machine representing the scenarios' steps can be built
to generate data, which can then be used for offline learning. This paper presents a model and a prototype for such
state machines. We implemented this prototype using state machines as internal policies of agents without a
learning part for four specified scenarios. Acting according to the actuator setpoints the state machines emit,
the agents' outputs control the connected simulation. We found that our implemented prototype can generate
data in the context of smart grid simulation. These data samples show the specified behavior, cover the search
space through variance, and do not include data that violates given constraints.
1 INTRODUCTION
The energy grid, a critical national infrastructure
(CNI), is evolving into a cyber-physical system called
the smart grid. This makes the adoption of new strategies
in grid operation and control significantly more important
(Hossain et al., 2023), due to the introduction of ICT-based
control systems. These components are integrated into the grid
to manage volatile generation and prosumers. They also enable
grid operators to handle grids more efficiently.
While this improves usability, it also increases the risk of
errors and the exposure to cyber attacks, and thus the overall
risk of failure. As this happens, including expert knowledge
from grid operators and Information and Communications
Technologies (ICT) specialists becomes more critical.
Until now, the risk of a failure within the energy
system was adequately hedged by the redundancy of
the physical system (N-1 rule) (De Nooij et al., 2010).
With the need for an ever-efficient operation of power
grids, which heralds an extensive integration of ICT,
this is no longer the case, as ICT integration poses
its own risks and redundancy of the whole grid is too
costly. The failure of physical systems can lead
to the failure of ICT systems. Redundancy is, there-
fore, still relevant. However, failures of ICT systems
can also lead to failures of physical systems. These
risks can no longer be solved by redundancies alone
and must be mitigated by new technologies (e.g., safe
communication protocols, encryption) (Mayer et al.,
2020; Schütz et al., 2022).
The power grid is transforming, making it highly
non-deterministic as an overall system. The high
complexity, the introduction of machine learning
(ML) systems for optimization, the proliferation
of prosumer roles, the emergence of (localized) en-
ergy markets, and the high share of distributed renew-
able energy sources (DERs) needed to achieve the necessary
efficiency goals all require testing through extensive
simulations for developing mitigations.
However, this development carries a high effort, both in
the time needed for development and financially, as this
time has to be paid for (Uslar, 2015; Uslar et al., 2019).
Therefore, previous approaches have employed
agent-based systems, often based on Deep Reinforce-
ment Learning (DRL), to generate suitable mitiga-
tion strategies for hitherto unknown scenarios (Fis-
cher et al., 2019). However, there is a lot of expert
knowledge in energy systems that AI agents do not
have by default. For example, agents that start with
random sampling (e.g., using the Soft Actor-Critic (SAC)
algorithm with random sampling) do not know any strategies,
nor do they know about conflicting actors that rewards
cannot capture. This knowledge needs to be obtained through
training in order to operate as well as or better than human
operators. But even if agents are trained for quite some time,
they might lack the knowledge for rarely occurring situations
(for example, misbehavior (Veith et al., 2024)). Sampling
efficiency is critical here. If insufficient data is given, even
offline learning algorithms cannot learn. This is especially
relevant in critical situations that do not occur often enough
to be learned by an agent using historical data for training.
Such situations can be covered by adding scenario data or
by injecting expert knowledge.
A current best practice in the domain of the elec-
tric energy system is the description of system (mis-
)behavior utilizing specially designed templates based
on the (mis-)use case methodology. This data is usu-
ally available only in semi-structured, non-machine-
readable forms.
The big picture of our research is to integrate
expert knowledge into AI-based experiments, to make
training for rare situations easier, and to reduce
computational expenses.
We have already shown that experiment genera-
tion from Misuse Cases is possible and feasible for
automation (Veith, 2023a; Wellssow et al., 2024).
In these templates, it is also possible to describe
scenarios stepwise. These steps can be converted into
a state machine. With this, we can rebuild the whole
scenario as a state machine. For now, this is done by
hand; it will be automated in the course of our ongoing
research.
This is where the approach presented here comes in: we use
the defined state machines, step through them while a smart
grid simulation is running, and collect the resulting data.
As the big picture in Figure 1 shows, this data is then used
for offline training. Like the automation of state machine
generation, the steps before and after data generation are not
part of this work and will be presented in later publications.
Our paper is structured as follows: First, we in-
troduce other research work related to this topic. Af-
terward, needed background information is given, and
the framework we used is presented. In the fourth sec-
tion, we present our model with the fundamental con-
cepts. The subsequent section contains our first pro-
totype that was implemented using this model. In the
sixth section, we conclude our findings and discuss
our results. Lastly, we give a view of future work.
2 RELATED WORK
The related work section covers parts of the big
picture concerning the research presented in this pa-
per, as well as a subsection on offline learning data
sampling in the literature.
2.1 Use Case Methodology
The IEC 62559 standard outlines the methodology
for use cases, providing a systematic procedure for
eliciting and describing them. Following this stan-
dard, IEC 62559-2 also furnishes a standardized and
structured template for use case descriptions. This
template encompasses essential information, includ-
ing the use case’s name, identifier, scope, objectives,
conditions, and narrative in natural language. Addi-
tional details, such as its relationship with other use
cases, prioritization, and related Key Performance In-
dicators (KPIs), are also included.
The second section of the use case template
presents use case diagrams. A technical information
summary is provided, listing all acting components
and offering a step-by-step analysis for each situation
associated with the use case. Processes related to the
use case, including lists of exchanged information and
requirements, are linked to these analyses. The doc-
umentation concludes with the inclusion of common
terminologies and specific details.
IEC 62559 is a series of standards widely em-
ployed in various fields, as exemplified by Gottschalk
et al. (2017).
2.2 Offline Reinforcement Learning for
Deep RL
In numerous real-world scenarios, such as large
power grids, conducting experiments in the actual en-
vironment can be unsafe or computationally expen-
sive, making collecting new data in real-time chal-
lenging or risky. This is where offline reinforcement
learning, also called batch reinforcement learning,
becomes relevant (Ernst et al., 2005; Lange et al.,
2012; Prudencio et al., 2022; Panaganti et al., 2022).
Figure 1: The proposed toolchain of the big picture this work is a part of. [Wellssow et al. (2024)].
Offline reinforcement learning enables agents to
learn from existing data without direct interaction
with the environment. By leveraging pre-existing data
that contains observations, actions, and next step ob-
servations, agents can enhance their decision-making
capabilities efficiently and safely. In this context,
safety regards the possible damages caused by faulty
operation in the environment. Offline learning is cru-
cial in critical infrastructure cases, such as optimal
voltage control in power grids, where making in-
formed decisions without additional data collection is
paramount.
Several building blocks and techniques are em-
ployed to achieve effective offline reinforcement
learning. These are presented in the following para-
graphs.
Policy constraints (Fujimoto et al., 2018; Kumar
et al., 2019; Wu et al., 2019) modify the objective of
the learned policy, ensuring adherence to the behav-
ior policy derived from the existing dataset. These
constraints enforce a divergence metric between the
learned and the behavior policy, such as minimizing
the discrepancy between their distributions, to prevent
the agent from deviating too far from the data-driven
behavior. Another approach involves estimating the
advantage function, which describes the relative im-
provement of actions, rather than directly estimating
the behavioral policy itself (Peng et al., 2019).
Importance sampling (Precup et al., 2000; Jiang
and Li, 2016) is another technique employed in offline
reinforcement learning, where weights are assigned to
samples in the dataset. However, to ensure efficiency,
minimizing the variance of these weights is crucial, as
their product grows exponentially with the length of
trajectories.
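To illustrate the idea, the following is a minimal sketch of such per-trajectory importance weights; the trajectory format and the two policy probability functions are our assumptions, not taken from the cited works:

import math

def trajectory_importance_weight(trajectory, pi, b):
    # Product of per-step likelihood ratios pi(a|s) / b(a|s); this product
    # grows or shrinks exponentially with the trajectory length, which is
    # why keeping the variance of the weights small is crucial.
    return math.prod(pi(s, a) / b(s, a) for s, a in trajectory)

# Toy usage with a binary action space and state-independent policies:
pi = lambda s, a: 0.7 if a == 1 else 0.3   # target policy
b = lambda s, a: 0.5                       # behavior policy that produced the data
print(trajectory_importance_weight([(0, 1), (1, 1), (2, 0)], pi, b))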
Regularization (Haarnoja et al., 2018) in offline
RL involves adding a regularization term to the objec-
tive function and aims to make the agent’s value func-
tion estimates more conservative, preventing over-
estimation and discouraging the selection of out-of-
distribution (OOD) actions that could be harmful or
suboptimal. An extension of this is uncertainty esti-
mation, which allows the agent to dynamically adjust
the penalty term based on estimated uncertainty, en-
abling a flexible trade-off between conservative and
naive off-policy deep reinforcement learning methods
(Levine et al., 2020).
Model-based methods (Kidambi et al., 2020; Yu
et al., 2020) tackle offline reinforcement learning by
estimating the dataset’s transition dynamics and re-
ward functions. For example, a surrogate model for
a power grid (e.g. as shown by Balduin et al. (2019))
can be learned from a collection of power flow calcu-
lations obtained during simulations. However, incor-
porating uncertainty estimation techniques becomes
essential without precise world models, as it helps
prevent the agent from transitioning to OOD states.
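As a rough sketch of this idea (not the surrogate model of Balduin et al. (2019); the feature layout, the regressor choice, and the synthetic data are assumptions for illustration only):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Placeholder for logged samples: rows of concatenated [state, action] features
# (e.g., collected from power flow results of earlier simulation runs) and the
# next-state targets they led to.
X = rng.normal(size=(1000, 6))
Y = 0.9 * X[:, :4] + rng.normal(scale=0.05, size=(1000, 4))

# Fit a simple transition model s' ~ f(s, a) on the offline dataset.
dynamics_model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, Y)

# The learned model predicts next states for unseen (s, a) pairs, which is
# where uncertainty estimation becomes important to avoid OOD states.
print(dynamics_model.predict(rng.normal(size=(1, 6))))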
One-step methods (Brandfonbrener et al., 2021;
Kostrikov et al., 2021) in offline reinforcement learn-
ing involve collecting multiple states to estimate the
action-value function, followed by a single policy im-
provement step. This differs from iterative actor-critic
methods, ensuring that the action-value estimates al-
ways represent the distribution of the given dataset,
eliminating the need to evaluate actions outside of the
data distribution.
Imitation learning (Chen et al., 2020), a class of
offline reinforcement learning algorithms, mimics the
behavior policy derived from the dataset. It involves
techniques like behavior cloning, where the learned
policy is trained to replicate the behavior policy us-
ing supervised learning. However, this approach is
subject to distributional shift, as even small mistakes
in the learned policy can lead to compounding errors.
To overcome this, imitation learning methods focus
on filtering out undesirable behaviors from the dataset
and only mimicking the good ones, leveraging value
functions and heuristics to select high-quality trajec-
tories.
Another essential technique in offline reinforce-
ment learning is trajectory optimization (Janner et al.,
2021), which trains a joint state-action model over
entire trajectories based on the behavior policy. With
a good model, an optimal action sequence can be
planned from the initial states. The use of sequence
modeling objectives reduces sensitivity to distribu-
tional shift, and the model can be optimized using
a variety of objectives, such as maximum likelihood
or a combination of maximum likelihood and value
functions.
2.3 Offline Learning Data Sampling
To effectively train offline learning agents, it is imper-
ative to have a large and diverse enough dataset. The
dataset has to represent the learned behavior of the
specified scenarios. This is mainly done by sampling
data. It can be accomplished by sorting and collect-
ing historical data or by iterative data generation with
simulations (Ghysels et al., 2004).
In the smart grid context, data sampling has been
approached in various ways (Tu et al., 2017; Kayastha et al.,
2014; Rhodes et al., 2014).
Two main approaches are random sampling and
generating data with Generative Adversarial Networks (GANs).
We structured our approach differently, since random
sampling has a low sample efficiency and suffers from the
convolution problem (Gerster et al., 2021), while GANs are
computationally expensive and cannot generate data that is
better than the data they were trained on (Matchev
et al., 2022).
3 BACKGROUND
3.1 The PalaestrAI Framework
PalaestrAI (https://gitlab.com/arl2) is a software stack that enables exper-
imentation, co-simulation, storage and analysis for
complex simulations of learning agents in cyber-phys-
ical systems (CPSs) (Veith et al., 2023). The frame-
work provides packages to implement or interface RL
agents, environments, and simulators. Its primary fo-
cus is on ensuring neat and reproducible execution of
experiment runs while coordinating the various parts
of these runs and storing their results for later analy-
sis.
One of palaestrAI’s major design paradigms is
a loose coupling between simulation execution, en-
vironment, and agents through a messaging bus.
This allows for the implementation of well-known
algorithm/rollout-worker splits as they are common
in other implementations, but also enables relatively
easy extensions of existing algorithms with concepts
such as the one presented here. A synchronization be-
tween these components happens on the basis of a Ze-
roMQ Major Domo Protocol (MDP) implementation,
which implements message passing on the basis of a
request-reply pattern. In addition, palaestrAI uses
a conductor/watchdog concept that separates module
management from algorithm logic. This results in a
simplified API for developers that facilitates the im-
plementation of new DRL algorithms. This design
decision minimizes the learning curve for developers
and allows them to focus on algorithmic innovation.
Another benefit of palaestrAI’s approach is that
co-simulation software can be transparently coupled,
needing only a minimal API adapter. Our setup uses
mosaik (Steinbrink et al., 2019) and the MIDAS sce-
nario framework (Balduin et al., 2023) to create the
power grid and to synchronize the asset models (e. g.,
the battery).
Additionally, palaestrAI strongly emphasizes the
reproducibility of experiments. The arsenAI mod-
ule, specifically designed for experiment design, en-
sures that experiments are meticulously laid out in
YAML files. These files include critical details such
as random seeds, agent specifications, sensor, actuator
configurations, and termination conditions for each
phase. This approach guarantees that experiments can
be replicated while minimizing setup requirements.
Furthermore, palaestrAI allows an unprecedented
level of flexibility in experiment design, implement-
ing an actual Design of Experiments (DoE) process.
Through the arsenAI module, scientists can define
a multitude of parameters and factors. This allows
for extensive variations in agent combinations, sen-
sor/actuator assignments, and hyperparameters. This
versatility enables a comprehensive exploration of di-
verse scenarios.
3.2 Finite Non-Deterministic State
Machines
As defined by Hopcroft et al. (2001), finite non-
deterministic state machines are 5-tuples

M = (Q, Σ, δ, s0, F)    (1)

where Q is a set of states, Σ is the alphabet the ma-
chine works over, s0 is the initial state, F is the
set of final states, and δ is defined as a relation:

δ : Q × Σ → 2^Q    (2)
In general, state machines are used in a wide area
of fields like digital event reconstruction (Gladyshev
and Patel, 2004), communication modeling (Brand
and Zafiropulo, 1983), or game development (Jag-
dale, 2021).
In the smart grid context, state machines are used
for modeling (as seen in (Kaygusuz et al., 2018)) or
for control in an autonomous, non-learning setting (as
seen in (Trigkas et al., 2018)).
For this paper, we use an empty alphabet and do
not make use of the final states as we do not want to
accept sequences but instead use the structure for data
generation in a continuous way.
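To make the notation concrete, the following minimal Python sketch represents such a machine with an empty alphabet and unused final states; the successor sets and the random choice are illustrative assumptions, not part of the formal definition above:

import random

# M = (Q, Sigma, delta, s0, F) with an empty alphabet Sigma and unused F.
Q = {"s1", "s2"}
SIGMA = set()
DELTA = {"s1": {"s1", "s2"},   # non-deterministic: several possible successors
         "s2": {"s1", "s2"}}
S0 = "s1"
F = set()

def advance(state):
    # Pick one of the allowed successor states at random (the non-determinism).
    return random.choice(sorted(DELTA[state]))

state = S0
for _ in range(5):
    state = advance(state)
    print(state)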
4 MODEL
We propose a generic state machine model adaptable
for different scenarios. While we consider scenarios
in the smart grid context, we propose that this concept
also applies to other areas.
The state machine is finite. This allows modeling
the complete state machine, as we do not consider in-
finite state sets.
The general concept is depicted in Figure 2 with
states named s_i. We use the state machine definition
as defined by Hopcroft et al. (2001) with additions to
the state and transition definitions:

M_tg = (Q, Σ, δ, q0, F)
with
(q, ({c_q} ⊆ ActuatorSetpoints, {i_q} ⊆ TimeStepIntervals)) ∈ Q
(i ∈ Q, n ∈ Q, {sc} ⊆ StepConstraints) ∈ δ    (3)
(3)
Our concept revolves around four significant traits
that can be featured in the state machines used in this
approach:
Non-Determinism. Non-Determinism plays a mas-
sive role in this concept, allowing the same state ma-
chine to generate differing data. This is because de-
veloping differing timelines with one state machine is
possible, as the nondeterministic feature allows data
generation with state changes at differing time points.
Although we consider this to be the most helpful part
of generating differing datasets and using it for each
presented example, the concept is also usable with de-
terministic state machines.
State Actuator Constraints C_q. State Actuator
Constraints operate as the control output. In these
constraints, an interval for every output of every asset
the agent controls determines the possible setpoints
in this state. This interval can be given as an array
for one controlled actuator or as a dictionary containing
the actuator names as keys and intervals of possible
setpoints as values.
The action spaces of the corresponding actuator defi-
nitions determine valid setpoints.
Figure 2: State centered depiction of the proposed state machine concept [own representation].
Time-Step Intervals I_q. Time-Step Intervals are the
second kind of interval given. This interval deter-
mines the number of time steps an agent stays in a state,
with the lower bound being the minimum number of steps
the agent has to stay in this state and the upper bound
being the maximum number of steps.
Constrained Steps Sc_{i,N}. Constrained Steps allow
limiting a step to certain conditions. This is especially
helpful whenever a step should not happen after a
certain time in the scenario but only if (or if not) a certain
condition is satisfied. Constrained steps are defined
between two states and are therefore named with the
numbers of both states.
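A minimal sketch of how these traits could be represented in code; field names, containers, and the example values are our assumptions, not the implementation described in Section 5:

from dataclasses import dataclass, field

@dataclass
class State:
    name: str
    actuator_constraints: dict    # C_q: actuator name -> [min, max] setpoint interval
    timestep_interval: tuple      # I_q: (min_steps, max_steps) to remain in this state

@dataclass
class Transition:
    source: str
    dest: str
    step_constraints: list = field(default_factory=list)   # Sc_{i,N}: predicates on observations

# Values loosely follow the battery scenarios of Section 5 (illustrative only):
states = [
    State("s1", {"battery_p_mw": [0.8, 0.8]}, (20, 200)),     # charging
    State("s2", {"battery_p_mw": [-0.8, -0.8]}, (20, 200)),   # discharging
]
transitions = [
    Transition("s1", "s2", [lambda obs: obs["soc_percent"] >= 50]),  # constrained step sc_{1,2}
    Transition("s2", "s1"),
]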
4.1 Combination with Online Learning
Approaches
To demonstrate how our concept could be connected
to online learning agents, we show an exemplary
combination with the well-known AWAC offline
learning algorithm (Nair et al., 2020) in Algorithm 1.
Algorithm 1: AWAC-Algorithm as derived from Nair et al. (2020).
    Dataset D = {(s, a, s', r)_j};
    Initialize buffer β = D;
    Initialize π_θ, Q_φ;
    for iteration i = 1, 2, . . . do
        Sample batch (s, a, s', r) ~ β;
        y = r(s, a) + γ E_{s', a'}[Q_{φ_{k-1}}(s', a')];
        φ ← argmin_φ E_D[(Q_φ(s, a) − y)²];
        θ ← argmax_θ E_{s,a~β}[log π_θ(a|s) exp((1/λ) A^{π_k}(s, a))];
        if i > num_offline_steps then
            τ_1, . . . , τ_K ~ p_{π_θ}(τ);
            β ← β ∪ {τ_1, . . . , τ_K};
        end
    end
Figure 3: Simple Battery Experiment: Battery Loading State in percent and the agent's setpoints for (dis-)charging over time. [own representation].
Figure 4: Multiple Actuators Experiment: Battery Loading State in percent and the agent's setpoints for (dis-)charging over time. [own representation].
Figure 5: Multiple Actuators Experiment: Setpoints for PV control. [own representation].
As seen in Algorithm 2, when combining the two
algorithms, our approach runs first, as it generates the
data needed for the AWAC algorithm. This means
data is generated first and then used for training with
the AWAC implementation.
We iterate through the state machine until the
maximum number of steps is completed. While doing so, we
save the states of the simulation before (s) and after
(s') each action (a). M_tg.advance() covers the process
of constraint checking and stepping in the state ma-
chine. When a constraint for a state transition is not
satisfied, this transition is not callable, and therefore,
the state connected with this transition is not reach-
able. If there are multiple possible transitions, the
transition is chosen in a non-deterministic way based
on predefined probabilities. In each step, a reward (r)
is calculated for the grid state given by the simulation
and then stored with the states and the action set.
The AWAC implementation can work with the
stored data in the same way as with every other
dataset in the original implementation.
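A condensed Python sketch of the data-generation half of Algorithm 2; the simulation, reward, state machine, and storage interfaces are assumptions about our setup, not palaestrAI API calls:

def generate_trajectories(sim, machine, reward_fn, store, max_steps):
    # Step the state machine alongside the simulation and log (s, a, s', r) tuples.
    for _ in range(max_steps):
        s = sim.state                     # current grid state
        a = machine.current_setpoints()   # setpoints drawn from the current state's constraints
        r = reward_fn(s, a)               # reward for the grid state and chosen action
        s_next = sim.step(a)              # apply the action in the co-simulated environment
        store(s, a, s_next, r)            # persist the sample for later offline training
        machine.advance()                 # check step constraints, then transition non-deterministically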
Figure 6: Constrained Step Experiment: Battery Loading State in percent and the agent's setpoints for (dis-)charging over time. [own representation].
Figure 7: Multiple Paths Experiment: Battery Loading State in percent and the agent's setpoints for (dis-)charging over time. [own representation].
Algorithm 2: AWAC-Algorithm combined with our data generation approach.
    Initialize Simulation S;
    State machine M_tg = (Q, Σ, δ, s0, F);
    maximal_steps = x;
    for j ≤ x do
        s = S.state;
        a ← {c} with (M_tg.state, {c_{M_tg.state}}) ∈ Q_{M_tg};
        r ← reward(a);
        s' = S.step(a);
        store_to_database(s, a, s', r);
        M_tg.advance();
    end
    Dataset D = {(s, a, s', r)_j} from database;
    Initialize buffer β = D;
    Initialize π_θ, Q_φ;
    for iteration i = 1, 2, . . . do
        Sample batch (s, a, s', r) ~ β;
        y = r(s, a) + γ E_{s', a'}[Q_{φ_{k-1}}(s', a')];
        φ ← argmin_φ E_D[(Q_φ(s, a) − y)²];
        θ ← argmax_θ E_{s,a~β}[log π_θ(a|s) exp((1/λ) A^{π_k}(s, a))];
        if i > num_offline_steps then
            τ_1, . . . , τ_K ~ p_{π_θ}(τ);
            β ← β ∪ {τ_1, . . . , τ_K};
        end
    end
5 IMPLEMENTED PROTOTYPE
The implemented prototype (the code can be found at https://gitlab.com/arl2/state-machine-data-generation) is constructed as a mus-
cle in palaestrAI. This means it can control given ac-
tuators based on sensor information. Normally, each
learning agent in palaestrAI contains a muscle and
a brain part in which the agent learns from previous
steps. We implement the model as a muscle only, as
we merely need the muscle's ability to propose actions.
The model works without a brain part, as
there is no learning, and the state machine provides
the strategy by which the agent works. The use of
palaestrAI ensures that the format of the trajectories
from the data generation matches the one used for
training the agents. Therefore, the data can be used
for offline learning afterward without reformatting.
As seen in Figure 8 our prototype contains two
classes: The AgentStateMachine, which contains the
actual state machine, and the StateMachineAgent,
which is wrapped around this state machine. The state
machine is built with the Python package pytransitions
(https://github.com/pytransitions/transitions), which
enables all essential state machine func-
tions like stepping, keeping the state, and returning
the state the machine is in.
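A minimal sketch of how such a machine can be set up with pytransitions; the state and trigger names follow Figure 10, but the dwell-time handling is simplified here and is not the exact AgentStateMachine implementation:

import random
from transitions import Machine

class BatteryPolicy:
    # Two-state policy: s1 charges, s2 discharges (cf. Figure 10).
    SETPOINTS = {"s1": 0.8, "s2": -0.8}   # MW setpoint emitted in each state

    def current_action(self):
        return self.SETPOINTS[self.state]

policy = BatteryPolicy()
Machine(
    model=policy,
    states=["s1", "s2"],
    transitions=[
        {"trigger": "step", "source": "s1", "dest": "s2"},
        {"trigger": "step", "source": "s2", "dest": "s1"},
        {"trigger": "keep", "source": "s1", "dest": "s1"},
        {"trigger": "keep", "source": "s2", "dest": "s2"},
    ],
    initial="s1",
)

# Non-deterministic progression: stay between 20 and 200 steps, then switch.
for _ in range(3):
    for _ in range(random.randint(20, 200)):
        policy.keep()
    policy.step()
    print(policy.state, policy.current_action())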
Figure 8: UML class diagram of our StateMachineAgent implementation [own representation].
Figure 9: Implemented loop of the StateMachineAgent in an abstract visualization [own representation].
As seen in Figure 9, the agent loops while gener-
ating data until the maximum amount of steps given
before is reached. It starts with getting actions from
the state machine for the initial simulation state. Af-
terward, the actions are submitted to the co-simulated
environment, and a new state is reported after the co-
simulation has performed those actions. Afterward,
the tuple of state, action, next state, and reward is
saved to the database and will later serve as part of the
sampled data. The StateMachineAgent then checks
the conditions (possibly) given for the transitions of
the state machine before performing a step. After-
ward, actions are provided by the state machine, and
the loop begins again.
This is used to iterate over the state machine while
the simulation is running. We ran four different exper-
iments with this prototype:
- Simple battery experiment
- Multiple actuators experiment
- Multiple paths experiment
- Constrained step experiment

Table 1: Actuator Mapping for Experiments.
Experiment                     | Actuator(s)
Simple Battery Experiment      | Battery
Multiple Actuators Experiment  | Battery, PV
Multiple Paths Experiment      | Battery
Constrained Step Experiment    | Battery
Simple Battery Experiment. This is our simplest
experiment. It contains just two states (S1 and S2) and
two kinds of transitions (step and keep) connecting
them.
M_{tg,A} = (Q, Σ, δ, s1, F), with Q and F as depicted in Figure 10    (4)
This state machine is depicted in Figure 10. We used
one battery as an actuator to control. The states differ
between loading and discharging.
The results of this experiment can be seen in Fig-
ure 3. In this chart, the non-deterministic behavior of
the state machine is observable through the changes
between charging and discharging that happen after
differing amounts of timesteps.
Multiple Actuators Experiment. In the second ex-
periment, we gave the agent a second asset to control.
States and transitions were kept from the simple bat-
tery experiment. For this experiment, we changed our
constraint implementation from an array to a dictio-
nary containing the actuator names and their allowed
setpoints as an array. Except for this, the state ma-
chine was kept the same as in the simple battery ex-
periment. For this experiment, the agent was able to
control two assets in the grid: the battery and a pho-
tovoltaic asset. Both are independent, co-simulated
models to give setpoints to but are modeled within the
same grid and, therefore, impact each other physically
in the environment.
As seen in Figure 4 and Figure 5, the values for
both assets change according to the changes in the
state machine. More states would allow for more de-
tailed control (e.g., changing the setpoints of just one
of the assets in a state transition).
Figure 10: State Machine used in the simple battery experiment and the multiple actuators experiments. [own representation]. (States s1 and s2 with i1 = i2 = [20, 200], c1 = [0.8, 0.8], c2 = [-0.8, -0.8]; transitions step() and keep().)
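The dictionary-based constraints introduced in this experiment could, for instance, look as follows; the actuator identifiers are illustrative and not the exact MIDAS/mosaik attribute names:

import random

# State actuator constraints c_q for a state that controls two assets:
c_q = {
    "battery_p_mw": [0.8, 0.8],           # fixed charging setpoint for the battery
    "photovoltaic_q_mvar": [-0.3, 0.3],   # allowed reactive-power band for the PV asset
}

# A concrete action is drawn from each actuator's interval:
action = {name: random.uniform(lo, hi) for name, (lo, hi) in c_q.items()}
print(action)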
Multiple Paths Experiment. For the third exper-
iment, we kept the infrastructure of the simple bat-
tery experiment but added another state. This allows
a non-deterministic progression of the state machine.
M_{tg,B} = (Q, Σ, δ, s1, F), with Q and F as depicted in Figure 11    (5)
We chose an additional state S3 in which the battery
is neither charged nor discharged. The adapted state
machine is depicted in Figure 11. The additional state
is reachable from the initial state S1. The discharging
state (S2) can be reached after the next time state S1
is left.
In Figure 7, the non-deterministic state changes
after the charging state are visible in the setpoints.
Note that the battery's state of charge increases overall,
as the machine changes non-deterministically between dis-
charging and keeping the charge, while charging is
scheduled in every control cycle of the state ma-
chine.
Constrained Step Experiment. In the constrained
step experiment, the infrastructure and the state ma-
chine from the simple battery example are kept. How-
ever, we added a constrained step to the state machine,
where the machine can only step to the discharging
state S2 when the battery is charged to at least 50 per-
cent. This is shown in Figure 12.
Figure 11: State Machine with additional state used in the multiple paths experiment. [own representation]. (Additional state s3 with i3 = [20, 200], c3 = [0.0, 0.0]; transitions step(), step_b(), and keep().)
Figure 12: State Machine with constrained step used in the constrained step experiment. [own representation]. (Transition s1 → s2 guarded by sc_{1,2} = {BatteryStorage >= 50%}; intervals and setpoints as in Figure 10.)
Note that in Figure 6 the battery's state of charge
rises above 50 percent in each charging cycle. This is
due to the constraint used in the state machine,
as the charging state s1 has to be kept until the con-
straint is satisfied.
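In pytransitions, such a constrained step can be expressed as a transition condition; the following sketch assumes the state of charge is pushed into the policy object from the simulation's battery sensor, which is a simplification of our actual constraint handling:

from transitions import Machine

class ConstrainedBatteryPolicy:
    def __init__(self):
        self.soc_percent = 0.0   # updated from the battery sensor each step

    def battery_half_full(self):
        # Guard for the constrained step sc_{1,2}: s1 -> s2 only at >= 50 % charge.
        return self.soc_percent >= 50.0

policy = ConstrainedBatteryPolicy()
Machine(
    model=policy,
    states=["s1", "s2"],
    transitions=[
        {"trigger": "step", "source": "s1", "dest": "s2", "conditions": "battery_half_full"},
        {"trigger": "step", "source": "s2", "dest": "s1"},
    ],
    initial="s1",
)

policy.soc_percent = 30.0
policy.step()            # condition fails: the machine stays in s1
print(policy.state)      # -> s1
policy.soc_percent = 75.0
policy.step()            # condition holds: the machine moves to s2
print(policy.state)      # -> s2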
6 CONCLUSION
Within this work, we presented a model for data gen-
eration for offline learning using state machines. We
also showed an implemented prototype and used it to
generate data for four scenarios in the smart grid con-
text. The data generated by the StateMachineAgent
shows the pattern that was represented by the state
machines. It also varies due to the non-deterministic
nature of the state machines.
Our approach allows the generation of training
data containing specific situations and scenarios. It
is worth mentioning that this approach also enables the
generation of rather large amounts of (varying)
data in a short amount of time.
Our implemented prototype shows that the non-
deterministic features and changeable intervals for
setpoint determination allow for generating varying
data. The Multiple Actuators Experiment has shown that
this concept for data generation also supports multi-
ple actuators. The possibility of generating varying data
through multiple strategies in the state machine,
executed in a non-deterministic way, is
shown by the Multiple Paths Experiment. Further, the
Constrained Step Experiment indicates that more com-
plex strategies can be encoded in state machines by
applying constraints while generating data.
7 FUTURE WORK
Currently, state machines can only be hard-coded and
cannot be read dynamically via a specification, e.g., by
providing a configuration file. A possible way to do
this is by building state machines in a structured way
with standardised formats, e.g., with YAML. These
could then be given to the StateMachineAgent, which
creates an AgentStateMachine containing the infor-
mation.
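A purely illustrative sketch of what such a specification and its read-in could look like; the format, the keys, and the mapping onto pytransitions are assumptions for future work, not an existing feature:

import yaml
from transitions import Machine

SPEC = """
initial: s1
states:
  - {name: s1, setpoints: {battery_p_mw: [0.8, 0.8]}, dwell: [20, 200]}
  - {name: s2, setpoints: {battery_p_mw: [-0.8, -0.8]}, dwell: [20, 200]}
transitions:
  - {trigger: step, source: s1, dest: s2}
  - {trigger: step, source: s2, dest: s1}
"""

spec = yaml.safe_load(SPEC)

class SpecPolicy:
    pass

policy = SpecPolicy()
Machine(
    model=policy,
    states=[s["name"] for s in spec["states"]],
    transitions=spec["transitions"],
    initial=spec["initial"],
)
# The setpoint and dwell entries would be consumed by the StateMachineAgent
# to derive actuator constraints and time-step intervals per state.
print(policy.state)   # -> s1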
Regarding the bigger goal of our research, future
work aims to generate State Machines from structured
expert knowledge automatically. An expert would en-
ter their valuable knowledge in a semi-structured for-
mat, like a misuse case. Afterward, a state machine is
generated from these scenario inputs.
Furthermore, by pairing automated state machine genera-
tion with the presented approach, adapted with
a YAML read-in, a toolchain could be established that au-
tomatically builds datasets (or even offline-trained
agents) whenever new expert knowledge is added.
Once the toolchain is in place, our future work
aims to research the limits of its usage. This refers
to determining the fields the toolchain is applicable to
and listing the requirements, the limitations, and the
needed amount of available expert knowledge.
It is also planned to include the expert knowledge
toolchain in an agent architecture (Veith, 2023b) that
involves neuro-evolutionary algorithms that then con-
tinuously revise the rules given in the state machine.
8 FUNDING
The Adversarial Resilience Learning methodology,
extended agents, the continued work on palaestrAI,
as well as the offline learning development, has been
funded by the German Federal Ministry of Education
and Research under the project grant Adversarial Re-
silience Learning (01IS22071).
This work has been funded by the German Fed-
eral Ministry for Economic Affairs and Climate Ac-
tion under the project grant RESili8 (03EI4051A).
REFERENCES
Balduin, S., Tröschel, M., and Lehnhoff, S. (2019). Towards
domain-specific surrogate models for smart grid co-
simulation. Energy Informatics, 2(1):1–19.
Balduin, S., Veith, E. M. S. P., and Lehnhoff, S. (2023). Mi-
das: An open-source framework for simulation-based
analysis of energy systems. In Wagner, G., Werner, F.,
and De Rango, F., editors, Simulation and Modeling
Methodologies, Technologies and Applications, pages
177–194, Cham. Springer International Publishing.
Brand, D. and Zafiropulo, P. (1983). On communicating
finite-state machines. Journal of the ACM (JACM),
30(2):323–342.
Brandfonbrener, D., Whitney, W., Ranganath, R., and
Bruna, J. (2021). Offline rl without off-policy eval-
uation. Proc. Adv. Neural Inf. Process. Syst., page
4933–4946.
Chen, X., Wang, Z. Z. Z., Wang, C., Wu, Y., and Ross,
K. (2020). Bail: Best-action imitation learning for
batch deep reinforcement learning. Proc. NeurIPS,
page 18353–18363.
De Nooij, M., Baarsma, B., Bloemhof, G., Slootweg, H.,
and Dijk, H. (2010). Development and application of
a cost–benefit framework for energy reliability: Using
probabilistic methods in network planning and regula-
tion to enhance social welfare: The n-1 rule. Energy
Economics, 32(6):1277–1282.
Ernst, D., Geurts, P., and Wehenkel, L. (2005). Tree-based
batch mode reinforcement learning. Journal of Ma-
chine Learning Research, 6(Apr):503–556.
Fischer, L., Memmen, J. M., Veith, E. M., and Tröschel,
M. (2019). Adversarial resilience learning—towards
systemic vulnerability analysis for large and complex
systems. In ENERGY 2019, The Ninth International
Conference on Smart Grids, Green Communications
and IT Energy-aware Technologies, number 9, pages
24–32, Athens, Greece. IARIA XPS Press.
Fujimoto, S., Hoof, H., and Meger, D. (2018). Address-
ing function approximation error in actor-critic meth-
ods. In Proceedings of the 35th International Confer-
ence on Machine Learning, pages 1587–1596. PMLR.
ISSN: 2640-3498.
Gerster, J., Lehnhoff, S., Sarstedt, M., Hofmann, L., and
Veith, E. M. (2021). Comparison of random sampling
and heuristic optimization-based methods for deter-
mining the flexibility potential at vertical system in-
terconnections. In 2021 IEEE PES Innovative Smart
Grid Technologies Europe (ISGT Europe), pages 1–6.
Ghysels, E., Santa-Clara, P., and Valkanov, R. (2004). The
midas touch: Mixed data sampling regression models.
Gladyshev, P. and Patel, A. (2004). Finite state machine
approach to digital event reconstruction. Digital In-
vestigation, 1(2):130–149.
Gottschalk, M., Uslar, M., and Delfs, C. (2017). The use
case and smart grid architecture model approach: the
IEC 62559-2 use case template and the SGAM applied
in various domains. Springer.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor.
Hopcroft, J. E., Motwani, R., and Ullman, J. D. (2001). In-
troduction to automata theory, languages, and compu-
tation. Acm Sigact News, 32(1):60–65.
Hossain, R. R., Yin, T., Du, Y., Huang, R., Tan, J., Yu,
W., Liu, Y., and Huang, Q. (2023). Efficient learn-
ing of power grid voltage control strategies via model-
based deep reinforcement learning. Machine Learn-
ing, pages 1–26.
Jagdale, D. (2021). Finite state machine in game develop-
ment. algorithms, 10(1).
Janner, M., Li, Q., and Levine, S. (2021). Offline reinforce-
ment learning as one big sequence modeling problem.
Proc. NeurIPS, page 1273–1286.
Jiang, N. and Li, L. (2016). Doubly robust off-policy value
evaluation for reinforcement learning. Proc. ICML,
page 652–661.
Kayastha, N., Niyato, D., Hossain, E., and Han, Z. (2014).
Smart grid sensor data collection, communication,
and networking: a tutorial. Wireless communications
and mobile computing, 14(11):1055–1087.
Kaygusuz, C., Babun, L., Aksu, H., and Uluagac, A. S.
(2018). Detection of compromised smart grid devices
with machine learning and convolution techniques. In
2018 IEEE International Conference on Communica-
tions (ICC), pages 1–6. IEEE.
Kidambi, R., Rajeswaran, A., Netrapalli, P., and Joachims,
T. (2020). Morel : Model-based offline reinforcement
learning. Proc. NeurIPS, page 21810–21823.
Kostrikov, I., Nair, A., and Levine, S. (2021). Offline rein-
forcement learning with implicit q-learning.
Kumar, A., Fu, J., Tucker, G., and Levine, S. (2019). Stabi-
lizing off-policy q-learning via bootstrapping error re-
duction. Proc. NeurIPS, pages 1–11.
Lange, S., Gabel, T., and Riedmiller, M. (2012). Reinforce-
ment learning, chapter Batch reinforcement learning,
page pages 45–73. Springer.
Levine, S., Kumar, A., Tucker, G., and Fu, J. (2020). Offline
reinforcement learning: Tutorial, review, and perspec-
tives on open problems.
Matchev, K. T., Roman, A., and Shyamsundar, P. (2022).
Uncertainties associated with gan-generated datasets
in high energy physics. SciPost Physics, 12(3):104.
Mayer, C., Brunekreeft, G., Blank-Babazadeh, M., Stark,
S., Buchmann, M., Dalheimer, M., et al. (2020).
Resilienz digitalisierter energiesysteme. Blackout-
Risiken Verstehen, Stromversorgung Sicher Gestalten.
Nair, A., Gupta, A., Dalal, M., and Levine, S. (2020). Awac:
Accelerating online reinforcement learning with of-
fline datasets. arXiv preprint arXiv:2006.09359.
Panaganti, K., Xu, Z., Kalathil, D., and Ghavamzadeh, M.
(2022). Robust reinforcement learning using offline
data. Advances in neural information processing sys-
tems, 35:32211–32224.
Peng, X. B., Kumar, A., Zhang, G., and Levine, S. (2019).
Advantage-weighted regression: Simple and scalable
off-policy reinforcement learning.
Precup, D., Sutton, R., and Singh, S. (2000). Eligibility
traces for off-policy policy evaluation. Proc. ICML,
page 759–766.
Prudencio, R. F., Maximo, M. R. O. A., and Colombini,
E. L. (2022). A survey on offline reinforcement learn-
ing: Taxonomy, review, and open problems.
Rhodes, J. D., Upshaw, C. R., Harris, C. B., Meehan, C. M.,
Walling, D. A., Navrátil, P. A., Beck, A. L., Naga-
sawa, K., Fares, R. L., Cole, W. J., et al. (2014). Ex-
perimental and data collection methods for a large-
scale smart grid deployment: Methods and first re-
sults. Energy, 65:462–471.
Schütz, J., Uslar, M., and Clausen, M. (2022).
Digitalisierung. Synthesebericht 3 des SINTEG
Förderprogramms, Studie im Auftrag des BMWK,
Berlin.
Steinbrink, C., Blank-Babazadeh, M., El-Ama, A., Holly,
S., Lüers, B., Nebel-Wenner, M., Ramírez Acosta,
R. P., Raub, T., Schwarz, J. S., Stark, S., Nieße, A.,
and Lehnhoff, S. (2019). Cpes testing with mosaik:
Co-simulation planning, execution and analysis. Ap-
plied Sciences, 9(5).
Trigkas, D., Ziogou, C., Voutetakis, S., and Papadopoulou,
S. (2018). Supervisory control of energy distribu-
tion at autonomous res-powered smart-grids using a
finite state machine approach. In 2018 5th Interna-
tional Conference on Control, Decision and Informa-
tion Technologies (CoDIT), pages 415–420. IEEE.
Tu, C., He, X., Shuai, Z., and Jiang, F. (2017). Big data
issues in smart grid–a review. Renewable and Sus-
tainable Energy Reviews, 79:1099–1107.
Uslar, M. (2015). Energy Informatics: Definition, State-
of-the-art and new horizons. In Kupzog, F., editor,
Proceedings der ComForEn 2015 Vienna, Wien. TU
Wien, OVE Verlag.
Uslar, M., Rohjans, S., Neureiter, C., Pröstl Andrén, F.,
Velasquez, J., Steinbrink, C., Efthymiou, V., Migli-
avacca, G., Horsmanheimo, S., Brunner, H., et al.
(2019). Applying the smart grid architecture model
for designing and validating system-of-systems in the
power and energy domain: A european perspective.
Energies, 12(2):258.
Veith, E., Balduin, S., Wenninghoff, N., Wolgast, T.,
Baumann, M., Winkler, D., Hammer, L., Salman,
A., Schulz, M., Raeiszadeh, A., Logemann, T., and
Wellßow, A. (2023). palaestrAI: A training ground
for autonomous agents. In Proceedings of the 37th an-
nual European Simulation and Modelling Conference.
nual European Simulation and Modelling Conference.
EUROSIS.
Veith, E., Logemann, T., Berezin, A., Wellßow, A., and Bal-
duin, S. (2024). Imitation game: A model-based and
imitation learning deep reinforcement learning hybrid.
arXiv preprint arXiv:2404.01794.
Veith, E. M. (2023a). An architecture for reliable learn-
ing agents in power grids. In ENERGY 2023, The
Thirteenth International Conference on Smart Grids,
Green Communications and IT Energy-aware Tech-
nologies, pages 13–16, Barcelona, Spain. IARIA XPS
Press.
Veith, E. M. (2023b). An architecture for reliable learning
agents in power grids.
Wellssow, A., Kohlisch-Posega, J., Veith, E. M. S. P., and
Uslar, M. (2024). Threat modeling for ai analysis:
Towards the usage of misuse case templates and uml
diagrams for ai experiment description and trajectory
generation. IEEA ’24, accepted to be published, New
York, NY, USA. Association for Computing Machin-
ery.
Wu, Y., Tucker, G., and Nachum, O. (2019). Behavior reg-
ularized offline reinforcement learning.
Yu, T., Thomas, G., Yu, L., Ermon, S., Zou, J., Levine,
S., Finn, C., and Ma, T. (2020). Mopo: Model-
based offline policy optimization. Proc. NeurIPS,
33:14129–14142.