Analyzing Exact Output Regions of Reinforcement Learning Policy
Neural Networks for High-Dimensional Input-Output Spaces
Torben Logemann (https://orcid.org/0000-0002-2673-397X) and Eric MSP Veith (https://orcid.org/0000-0003-2487-7475)
Carl von Ossietzky University Oldenburg,
Research Group Adversarial Resilience Learning,
Oldenburg, Germany
{torben.logemann, eric.veith}@uol.de
Keywords:
Decision Tree, Reinforcement Learning, Explainability, Neural Network.
Abstract:
Agent systems based on deep reinforcement learning have achieved remarkable success in recent years. They have also been applied to a variety of research topics in the field of power grids, as such agents promise real resilience. However, deep reinforcement learning agents cannot guarantee behavior, as the mapping of the entire input space to the output of even a simple feed-forward neural network cannot be accurately explained. For critical infrastructures, such black-box models are not acceptable. To ensure an optimized trade-off between learning performance and explainability, this paper relies on efficient, regularizable feed-forward neural networks and presents an extension of the NN2EQCDT algorithm that transforms the networks into pruned decision trees with significantly fewer nodes, which can be accurately explained. In this paper, we present a methodological approach to further analyze the decision trees for high-dimensional input-output spaces and analyze an agent for a power grid experiment.
1 INTRODUCTION
Deep Reinforcement Learning (DRL)—the notion of agents that learn from interacting with their environment—is at the core of many remarkable successes, beginning with its breakthrough in 2013 by end-to-end learning of Atari games (Mnih et al., 2013) and Double Deep Q-Learning (DDQN) (Van Hasselt et al., 2016) and culminating in developments such as AlphaGo (Zero), AlphaZero (Silver et al., 2018), and MuZero (Schrittwieser et al., 2020). Since the 2013 hallmark paper by Mnih et al., researchers have improved training performance, sensitivity to hyperparameters, exploration, and sample efficiency through algorithms such as Twin Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). The current research corpus shows that these agents are capable of handling complex tasks.
DRL agents promise true resilience by learning
to counter the unknown unknowns. However, unlike
intrinsically interpretable DRL models (Puiutta and
Veith, 2020), no guarantees can yet be made about the behavior of DRL agents learned with black-box models. Such guarantees are, however, a necessity for operators, since no responsibility can be taken for an unknown control system that cannot be validated, especially when it is used in critical or highly critical areas such as Critical National Infrastructures (CNIs).
Agents deployed in complex environments, such as complex networked systems, are potentially confronted with many different situations and learn complex behaviors to fulfill their goals. For example, in (Veith et al., 2023; Veith et al., 2024), Adversarial Resilience Learning (ARL) attack agents are trained to cause voltage band violations in a power grid. This goal is achieved by exploiting a weakness in the way voltage regulators are used in the grid.
To gain better insight into how the control strategies work, compared to the classical implications of rewards and actions on the victim busses, the NN2EQCDT algorithm was developed (Logemann and Veith, 2023), which exactly transforms Deep Neural Networks (DNNs) into pruned Decision Trees (DTs). This allows not only individual trajectories to be explained, but also the mapping of entire observation regions to functions for the actions, on the basis of all input points from the region (Veith
and Logemann, 2023; Logemann, 2023). It is embed-
ded in the ARL architecture in (Veith, 2023), which
aims to create a framework for an agent that can learn
sample-efficiently, but also provide insights into the
behavior of the agents.
For more complex control strategies with higher input/output dimensions, it becomes more difficult to give guarantees, since such spaces make it considerably harder to read the control strategies directly from the transformed DTs and to visualize them. Furthermore, it is not easily possible to provide explanations and guarantees for control agents over a longer time horizon. This is relevant, e.g., for agents for voltage maintenance, as voltage band violations must never occur.
In this paper, we present a methodological approach that can represent the properties of regions in higher-dimensional input-output spaces in relation to each other. The input space is divided completely and disjointly into polytopes by a DT generated by the NN2EQCDT algorithm. For a better understanding of the regions, the polytopes are also approximated by inner boxes, and their neighborhood relation is represented in the newly presented concept of the Neighborhood Graph (NHG), which works for arbitrarily high-dimensional regions. The edges can represent properties between these regions. In this way, (abrupt) transitions in the control planes of the agent can be identified. We illustrate our approach in a well-known Gymnasium environment, which is accessible both to visual inspection and to our approach, to show the feasibility of our method.
The remainder of this paper is structured as fol-
lows: First, related work is presented in section 2
and an overview of the whole system is given in sec-
tion 3. Then the construction of the equivalent pruned
DT from a Feed-Forward Deep Neural Network (FF-
DNN) is generally described in section 4. The exact
representation of the output region is also described
in section 5. In addition, the calculation of the in-
ner boxes in section 6 and the handling of obser-
vation space constraints in section 7 are described.
With these calculations, an example of a controller
trained in the MountainCarContinuous-v0 environ-
ment (MCC) is given in section 8. Then the NHG is
presented in section 9 and validated with the MCC example. After that, an experiment with higher input and output dimensions in the power grid domain is analyzed in section 10. Finally, a discussion in section 11 and references to future work follow.
2 RELATED WORK
2.1 Deep Reinforcement Learning
Reinforcement Learning (RL) is the process of learning through interaction. When an RL agent interacts with its environment and thereby observes the consequences of its actions in the form of rewards, it can learn to alter its own behavior in response to the received rewards. RL follows the paradigm of trial-and-error learning, is influenced by the field of optimal control, and must balance the trade-off between exploration and exploitation. It is based on the Markov Decision Process (MDP), a quintuple $(S, A, T, R, \gamma)$, which describes that an agent observes the state $s_t \in S$ of its environment at time $t$ and takes the action $a_t \in A$ in response to the discounted reward $r_t \in R$ (with $\gamma$ the discount factor). This transitions the state to the next state $s_{t+1} \in S$ with the probability $p_t \in T$. If the state is reset after each episode, then the sequence of states, actions, and rewards is a trajectory of the policy. The return of a trajectory is given by the discounted accumulation of rewards, $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$. If the system is only partially observable, a Partially Observable Markov Decision Process (POMDP) further specifies the set of observations and the conditional observation probabilities. In general, reinforcement learning aims to learn a policy, $a_t \sim \pi(\cdot \mid s_t)$. Finding the optimal policy $\pi^*$ that maximizes the expected return from all states,
$$\pi^* = \arg\max_\pi \mathbb{E}[R \mid \pi] , \qquad (1)$$
is the optimization problem of all reinforcement learning algorithms (Arulkumaran et al., 2017).
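As a brief worked illustration (our example, not from the original text): for an episode of length $T = 3$ with $\gamma = 0.9$ and rewards $r_1 = 0$, $r_2 = 0$, $r_3 = 1$,
$$R = \sum_{t=0}^{T-1} \gamma^{t} r_{t+1} = 0.9^{0} \cdot 0 + 0.9^{1} \cdot 0 + 0.9^{2} \cdot 1 = 0.81 ,$$
so a sparse reward obtained only at the end of an episode, as in the Mountain Car example of section 8, contributes to the return discounted by how late it is reached.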
2.2 Explainability for Deep
Reinforcement Learning
DNNs are inherently opaque, as the meaning of any particular set of nodes eludes a human. Therefore, DRL policies are similarly black boxes (Jaunet et al., 2020); since experts cannot readily understand why a particular action was taken, especially in the context of the overall learned policy, this lack of transparency limits trust in DRL agents (Qing et al., 2022). Recent
papers survey the landscape of the very active field
of research of eXplainable Reinforcement Learning
(XRL) (Arrieta et al., 2020; Puiutta and Veith, 2020),
to which we refer the interested reader. We will focus
on equivalent representations of policy networks as
decision trees.
2.2.1 Explaining with Decision Trees
Transforming the agent's policy into a DT is intuitively justified, as we can easily accept decisions expressed as if-then-else conditions; DTs seem to be easy to interpret (Du et al., 2019). Algorithms
such as Viper (Bastani et al., 2018) create the DT
from samples of the original policy. But when the
agents’ state/action space and its strategies become
more complex, the DT also grows in size, making its
creation and even evaluation slow. For this reason,
approaches such as distilling into Soft Decision Trees
(SDTs) (Irsoy et al., 2012; Frosst and Hinton, 2017)
have been proposed (Coppens et al., 2019), because
SDTs are more space efficient.
However, such distillation methods require the agent to produce trajectories from which the DT can be constructed (imitation learning), which has the drawback that the DT can only cover what the agent has done so far; there is no guarantee that the DT covers the agent's behavior in general.
To this end, another approach has been proposed
in which the policy network itself is converted to an
equivalent DT (Aytekin, 2022). A problem with this approach is that the resulting DT can grow very large. We have previously extended the algorithm to dynamically prune unreachable nodes from the DT during its creation and therefore minimize it (Logemann, 2023).
3 SYSTEM OVERVIEW
In fig. 1, an overview of the whole system is visualized. First, the NN2EQCDT algorithm is used to transform a given DNN exactly into a pruned DT. The regions into which the DT splits the input space form convex polytopes. These are used as matrix inequality systems for Linear Programming (LP) approaches to compute various properties and the neighborhoods, which are compacted into a NHG. General constraints, such as global outer boxes, are included both in the calculations of the NN2EQCDT algorithm and in those with LP. The NHG is a tool that can be used for further analysis of region transitions and properties.

Figure 1: System overview of the interplay between the NN2EQCDT algorithm and the NHG calculation.
4 THE NN2EQCDT ALGORITHM
In this section, we briefly describe the original NN2EQCDT algorithm (cf. algorithm 1). For further details and evaluations, please refer to the original publication (Logemann, 2023).
Data: ANN weight matrices W, bias matrices B, general constraints c_g
Result: Pruned decision tree T
begin
    Ŵ = W_0
    B̂ = B_0
    rules = CALC_RULE_TERMS(Ŵ, B̂)
    T, new_SAT_leaves = CREATE_INITIAL_SUBTREE(rules, c_g)
    ADD_SAT_PATHS(T, new_SAT_leaves)
    SET_HAT_ON_SAT_NODES(T, new_SAT_leaves, Ŵ, B̂)
    for i = 1, ..., n-1 do
        SAT_paths = POP_SAT_PATHS(T)
        for SAT_path in SAT_paths do
            a = COMPUTE_A_ALONG(SAT_path)
            SAT_leaf = LAST_ELEMENT(SAT_path)
            Ŵ, B̂ = GET_LAST_HAT_OF_LEAVE(T, SAT_leaf)
            Ŵ = (W_i ⊙ [(a^T)_{×k}]) Ŵ
            B̂ = (W_i ⊙ [(a^T)_{×k}]) B̂ + B_i
            rules = CALC_RULE_TERMS(Ŵ, B̂)
            new_SAT_leaves = ADD_SUBTREE(T, SAT_leaf, rules, c_g)
            SET_ON_NODES(T, new_SAT_leaves, Ŵ, B̂)
            ADD_SAT_PATHS(T, new_SAT_leaves)
        end
    end
    CONVERT_FINAL_RULE_TO_EXPR(T)
    PRUNE_TREE(T)
end
Algorithm 1: Main loop of the NN2EQCDT algorithm.
The weight and bias matrices $W_i$ and $B_i$ of layer $i$ from the FF-DNN model are processed layer by layer. Initial rules are calculated from the effective matrices
(function CALC_RULE_TERMS). The initial effective matrices are simply the weight and bias matrices of the first layer. A rule is the symbolic representation for checking the input data at a node of the DT. The evaluation of a rule with input data decides which further path the input data takes through the DT.
A rule can be imagined as an evaluation of the application of the ReLU function to the current input data, calculated for all possible input data with the effective matrices. The $k$-th rule of the $i$-th layer thus sums the matrix coefficients of the $k$-th row multiplied by their respective input variables:
$$(\mathrm{rule}_i)_k = \sum_j (\hat{W}_i)_{k,j}\, x_j + (\hat{B}_i)_{k,j} > 0 . \qquad (2)$$
The DT is built from top to bottom, whereby the $k$-th rule is assigned to all nodes of the (offset + $k$)-th level of the DT. The offset is the index of the last level of the already built DT. Before each node $n$ with a rule is added to the DT, a check is made to see whether the rule of node $n$, together with all the rules of the nodes on the path above it from the root node to node $n$ and the general boundary constraints, can be satisfied. This can be done either with Satisfiability Modulo Theories (SMT) solvers or, in this case, by checking with LP as described in the next sections. If the rules together are unsatisfiable, i.e., there can be no input whose evaluation by the DT takes this path, the node $n$ and thus further subtrees are not added, which keeps the size of the DT dynamically small.
All further subtrees that are calculated are only appended to the satisfying (SAT) leaf nodes of the already created DT. Therefore, SAT leaf nodes must be saved, which is why they are returned after a subtree has been created (see the return values of the functions CREATE_INITIAL_SUBTREE and ADD_SUBTREE).
The calculation of $\hat{W}$ and $\hat{B}$ depends on the slope vector $a$, which represents the "decisions" made for the input data $x$ by the rules on the path from the root node to a SAT control node. Thus, $a_k = 1$ if $\mathrm{rule}_{\mathrm{node}_k}(x)$ is true, otherwise $a_k = 0$. Since the calculation of the rules in turn depends on $\hat{W}$ and $\hat{B}$, each subtree that is generated from rules with the slope vector of a SAT leaf node $n_{\mathrm{SAT\,leaf}}$ is appended to that leaf node.
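To make the role of the effective matrices concrete, the following minimal sketch (our illustration, not the authors' implementation) shows for a toy two-layer ReLU network that, once the slope vector a of a region is fixed, the network reduces to a single affine map, which is exactly the expression stored in the corresponding DT leaf:

```python
import numpy as np

rng = np.random.default_rng(0)
W0, B0 = rng.normal(size=(4, 2)), rng.normal(size=4)  # hidden layer
W1, B1 = rng.normal(size=(1, 4)), rng.normal(size=1)  # output layer

x = np.array([0.3, -0.5])                             # an arbitrary input point

h = np.maximum(W0 @ x + B0, 0.0)                      # forward pass with ReLU
y_nn = W1 @ h + B1

a = (W0 @ x + B0 > 0).astype(float)                   # slope vector of x's region

W_hat = (W1 * a) @ W0                                 # effective weight matrix
B_hat = (W1 * a) @ B0 + B1                            # effective bias

y_dt = W_hat @ x + B_hat                              # affine leaf expression
assert np.allclose(y_nn, y_dt)                        # identical within the region
```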
So, after the initial subtree has been created from the first set of rules using the function CREATE_INITIAL_SUBTREE, the rules are not assigned to all nodes in a row of the DT, but only to the nodes within a subtree. Starting with the second layer, all further layers are iterated through, and for each, the path from the root to the SAT leaf node (function POP_SAT_PATHS) is iterated to first calculate the corresponding slope vector (function COMPUTE_A_ALONG). This is used to calculate $\hat{W}$ and $\hat{B}$ together with the current $W_i$ and $B_i$. After calculating the rules from the effective matrices, these are used together with the general constraints for the SAT check to create a subtree that is added to the SAT leaf node (function LAST_ELEMENT) of the given SAT path.
In order to access the latest $\hat{W}$ and $\hat{B}$ for the calculation of the effective weight matrices in the next layer iteration, they are stored in the new SAT leaf nodes. In order to be able to iterate over the new SAT paths, these paths are calculated and added to the tree from the new SAT leaves (function ADD_SAT_PATHS). Finally, after iterating over all layers, the rules of the last SAT leaves are converted into expressions, and the DT can be further pruned by removing nodes with unnecessary rules if they evaluate identically for all possible inputs.
5 EXACT OUTPUT REGION
REPRESENTATION
The nodes of a DT transformed with the NN2EQCDT algorithm contain $n$ rules on a direct path from the root to a leaf node with an expression (exclusive), which are linear inequalities of the form
$$\mathrm{rule}_{0 \le k \le (n-1)} = \sum_j (a_k)_j\, x_j + b_k > 0 . \qquad (3)$$
The rules represent half-space constraints, as each one subdivides the entire input space. Each node on a path of the DT from the root to a leaf node effectively further subdivides an already segmented space with a then-bounded half-space. All leaf nodes therefore lie in output regions that are described by the intersection of the half-spaces of their respective paths above. These output regions are therefore convex polytopes of the standard form (Vanderbei et al., 2014) $P = \{x \in \mathbb{R}^d : Ax \le b\}$, where
$$A = \big[-(a_k)_j\big] \quad \text{and} \quad b = \big[b_k\big] \qquad (4)$$
are the concatenation of the coefficients of a rule in the columns, for all rules in the rows. For a proof of convexity, see the appendix.
Note that the strict inequalities of the comparison against zero, which originate from the ReLU resolution, must be reformulated into the standard form. This requires a non-strict inequality, which is why the case of equality must be checked separately.
Whether linear inequalities have a common solution can be determined by solving "fake" LP problems of the form
$$\min_{x} 0 \quad \text{subject to} \quad Ax \le b . \qquad (5)$$
Solving LP problems is comparatively efficient, and this can be used for dynamic path checking instead of checking the constraints with general SMT solvers, as originally done in (Logemann and Veith, 2023).
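As a minimal sketch (assuming SciPy, not the authors' implementation), the feasibility check of eq. (5) can be phrased as a constant-objective LP:

```python
import numpy as np
from scipy.optimize import linprog

def is_satisfiable(A, b):
    """True iff the polytope {x : A x <= b} is non-empty (cf. eq. (5))."""
    d = A.shape[1]
    res = linprog(c=np.zeros(d),              # constant objective: feasibility only
                  A_ub=A, b_ub=b,
                  bounds=[(None, None)] * d)  # unbounded variables
    return res.status == 0                    # status 0: a feasible optimum was found
```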
Solving such LP problems can also be used to check whether points or other convex polytopes lie in or intersect the convex polytope in question by adding further representative constraints to the LP problem, such as
$$\min_{x} 0 \quad \text{subject to} \quad Ax \le b, \quad x = y \ \text{ for a point } y, \qquad \text{(6a)}$$
$$\text{or} \quad A'x \le b' \ \text{ for a polytope } P' . \qquad \text{(6b)}$$
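The checks of eqs. (6a) and (6b) can be sketched along the same lines (reusing numpy as np and is_satisfiable from the sketch above); note that for a single point, directly evaluating A y <= b is equivalent to adding the equality constraints of eq. (6a) to the LP:

```python
def contains_point(A, b, y, tol=1e-9):
    """Check whether the point y lies in {x : A x <= b} (cf. eq. (6a))."""
    return bool(np.all(A @ y <= b + tol))

def polytopes_intersect(A1, b1, A2, b2):
    """Check whether two polytopes share a point by stacking their constraints (cf. eq. (6b))."""
    return is_satisfiable(np.vstack([A1, A2]), np.concatenate([b1, b2]))
```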
Each leaf node $l$ of a DT represents the linear equation of the hyperplane bounded by the respective polytope $P_l$. This denotes the actual function that is applied in the region to derive the equivalent output of the FF-DNN, such as the setpoint of an agent actor from its policy network.
6 INNER BOXES
The exact representation of the polytopes as an in-
tersection of the half-spaces becomes more difficult
to imagine for higher dimensions. For a general
overview, however, smaller but definitive intervals for
all dimensions are more suitable, as in such a repre-
sentation the boundaries for each dimension are inde-
pendent of the others. This is achieved by inner boxes
with maximum volume of the polytopes.
According to Proposition 1 of (Bemporad et al., 2004), an inner box with maximum volume of a full-dimensional convex polytope $P = \{x \in \mathbb{R}^d : Ax \le b\}$ can be obtained by solving the following optimization problem:
$$\max_{x, y} \sum_{j \in D} \ln y_j \quad \text{subject to} \quad Ax + A^{+} y \le b , \qquad (7)$$
where $A^{+}$ is the positive part of $A$, i.e., $a^{+}_{ij} = \max(0, a_{ij})$.
The optimal solution $(x^{*}, y^{*})$ denotes the inner box with maximum volume, $B(x^{*}, x^{*} + y^{*})$. A box is represented as $B(l, u) = \{x \in \mathbb{R}^d : l \le x \le u\}$, where $l$ and $u$ are real $d$-vectors.
The convexity of the polytopes is ensured by construction through the half-space splitting in the DTs by the NN2EQCDT algorithm, as described in section 5.

Figure 2: Example of the volume-maximized inner box B (orange) of a convex polytope P (blue).

A polytope $P$, on the other hand, is full-dimensional if it has an interior point. An interior point of $P$ is a point $\hat{x} \in \mathbb{R}^d$ that satisfies $A\hat{x} < b$. This cannot be checked directly by LP optimization, as LP does not work with strict inequalities. But it can be checked using the following LP problem:
$$\max_{x, x_0} x_0 \quad \text{subject to} \quad Ax + \mathbf{1}\, x_0 \le b, \quad x_0 \le 1 . \qquad (8)$$
If this is solved successfully, with $x^{*}$ being an optimal solution and $x_0^{*}$ the optimal value, then $x^{*}$ is an interior point of $P$ and $P$ is full-dimensional if $x_0^{*} > 0$. In the case of $x_0^{*} < 0$, $P$ is empty, and in the case of $x_0^{*} = 0$, $P$ is neither full-dimensional nor empty (Fukuda, 2015).
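A minimal sketch of the check in eq. (8), assuming SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def full_dimensionality(A, b):
    """Optimal x0 of eq. (8): > 0 full-dimensional, < 0 empty, = 0 neither (cf. Fukuda, 2015)."""
    m, d = A.shape
    c = np.zeros(d + 1)
    c[-1] = -1.0                                  # maximize x0 == minimize -x0
    A_ub = np.hstack([A, np.ones((m, 1))])        # A x + 1 * x0 <= b
    bounds = [(None, None)] * d + [(None, 1.0)]   # x free, x0 <= 1
    res = linprog(c, A_ub=A_ub, b_ub=b, bounds=bounds)
    return res.x[-1] if res.status == 0 else None
```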
In fig. 2, an example of an inner box with maximum volume of a polytope is shown. Although the inner box has maximum volume for the interior of the polytope, the region $P \setminus B$ is significant, i.e., $B$ comprises only
$$\frac{\mathrm{vol}(B)}{\mathrm{vol}(P)} = \frac{32}{53.88} = 59.4\,\%$$
of the volume of $P$. However, the approximation may be sufficient for an overview, since the high-dimensional regions are easier to visualize with the box intervals than with the more complicated inequality constraints of the respective polytope.
7 OBSERVATION SPACE
CONSTRAINTS
Normally, the observation space has natural or artificial value limits in each dimension, so that only values within known intervals can occur. These can be used as a global constraint for the NN2EQCDT algorithm (see fig. 1) to generate smaller DTs by pruning nodes or subtrees whose input data falls entirely within unreachable regions. The calculation of the inner box can also be modified to respect these intervals by
$$\max_{x, y} \sum_{j \in D} \ln y_j \quad \text{subject to} \quad Ax + A^{+} y \le b, \;\; o_{\min} \le x \le o_{\max}, \;\; o_{\min} \le x + y \le o_{\max} , \qquad (9)$$
with $(o_{\min}, o_{\max})$ being the bounds of the input or observation space. Note that they span the outer box $B_{\mathrm{out}}(o_{\min}, o_{\max})$, in which the maximization of an inner box is performed, because $x$ and $x + y$ are both points constrained by the respective polytope and globally by these bounds. Therefore, the edges of all $i$-th inner boxes $B_i(x_i, x_i + y_i)$ can be constrained directly by the bounds of the outer box without affecting the volume maximization. Since the boundary conditions of the outer box are basically additional half-space constraints, the intersection set $B_{\mathrm{out}} \cap P = P'$ is also a convex polytope. In addition to convexity, the calculation of the inner box also requires that the polytope is full-dimensional, which in turn can be checked with the optimization of the LP problem described in section 6. Therefore, the resulting polytope must have the standard form given by
$$P' = \left\{ x \in \mathbb{R}^d : \tilde{A} = \begin{bmatrix} A \\ I \\ -I \end{bmatrix}\!, \;\; \tilde{A} x \le \begin{bmatrix} b \\ o_{\max} \\ -o_{\min} \end{bmatrix} = \tilde{b} \right\} . \qquad (10)$$
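The constraint stacking of eq. (10) is a simple concatenation; a minimal sketch assuming NumPy:

```python
import numpy as np

def add_outer_box(A, b, o_min, o_max):
    """Append the outer-box bounds o_min <= x <= o_max to A x <= b (cf. eq. (10))."""
    d = A.shape[1]
    I = np.eye(d)
    A_tilde = np.vstack([A, I, -I])                               # [A; I; -I]
    b_tilde = np.concatenate([b, np.asarray(o_max), -np.asarray(o_min)])  # [b; o_max; -o_min]
    return A_tilde, b_tilde
```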
8 MOUNTAIN CAR EXAMPLE
In fig. 3 the output regions of the same trained model
are shown for a control strategy of the MCC as used
in (Logemann and Veith, 2023). The MCC has two
inputs in the observation space, the first being the po-
sition of the vehicle along the x-axis and the second
being the velocity of the vehicle, and an actuator that
controls the directional force exerted on the vehicle.
An agent was trained that maximizes the reward
(there is only a sparse reward for reaching the moun-
tain). Its DNN was transformed into a DT using the
NN2EQCDT algorithm. In fig. 3a the output regions
of this DT are plotted according to their exact poly-
topes, while in fig. 3b the output regions are repre-
sented by the inner boxes of the polytopes. The output
regions are plotted against the linear actuator func-
tions specified in the leaf nodes of the respective DT.
Using the Gymnasium interface, which provides space types such as box intervals, the observation space can be constrained as described in section 7. In the case of the MCC, these intervals are given by $-1.2 \le x \le 0.6$ and $-0.07 \le y \le 0.07$, which is why they are used for the modified optimization problem.
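These bounds can be read directly from the Gymnasium observation space; a minimal sketch assuming the gymnasium package is installed:

```python
import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
o_min = env.observation_space.low   # array([-1.2 , -0.07], dtype=float32)
o_max = env.observation_space.high  # array([ 0.6 ,  0.07], dtype=float32)
```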
In addition to the output regions, the actual action points that the simulation runs through are displayed. Action points are defined here as the tuple of the observations (here: $x, y$), their mapping by the neural network to the actuator value (here: $(x, y) \mapsto z$), and the reward value (here: $(\mathrm{env}, s, x, y, z) \mapsto r$, with env being the environment and $s$ the state of the environment, from which the possible partial observations are obtained and to which the subsequent actuator value is applied). So here the action points are represented by $a = ((x, y), (z), r)$. In fig. 3, they are plotted as 3D points by the mapping $(x, y) \mapsto z$. The trace of action points starts on the lower plane and ends on the higher planes. The rewards are represented by the color of the points. Here, one can see that there is only one high sparse reward (yellow) at the end, which corresponds to the car reaching the mountain.
Figure 3: Output regions of the FF-DNN model trained in the MCC with the global constraints and actual action points; (a) exact output regions as polytopes, (b) output regions represented by inner boxes of the polytopes.
9 NEIGHBORHOOD GRAPH
For three dimensions, the output regions can be drawn
and observed as polytopes; for higher dimensions, the
approximating, but dimension-independent, represen-
tation of inner boxes can be used for better visualiza-
tion. Only linear functions are applied to these output
regions. In order to better understand the interaction
of output regions and linear functions of an agent in
higher-dimensional input and output spaces, the con-
cept of a NHG is introduced.
A NHG is a simple, undirected, weighted, cyclic, and finite graph $G = (V, E)$. The vertices $v \in V$ represent the regions. Two vertices $(v, u)$ are connected by an edge $e = (v, u) \in E$ iff the regions $P_v$ and $P_u$ represented by $v$ and $u$ are direct neighbors, which is given by $I = P_v \cap P_u \ne \emptyset$. If they are neighbors, the intersection contains the neighboring boundary points that both have in common, due to the relaxation of the strict inequalities as described in section 5. The intersection $I$ can again be checked for non-emptiness by solving the LP problem from eq. (6b) and checking whether it has a solution.
For two hyperplanes described by the linear functions $v \mapsto \sum_i a_i x_i + c$ and $u \mapsto \sum_i b_i y_i + d$ with the coefficient vectors $a = [(a_i), c]^{\top}$ and $b = [(b_i), d]^{\top}$, which are evaluated with input points in the polytopes, the cosine similarity is calculated by
$$\cos\alpha = \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert} . \qquad (11)$$
If we consider closed-loop systems, such as the model of section 8 as a controller for the dynamics of the MCC system, all model outputs for each input $x_i$ control the system so that the next input point $x_{i+1}$ lies in the $\varepsilon$-sphere of the previous input. This means that $d(x_i, x_{i+1}) < \varepsilon$, where the Euclidean distance is $d(x, y) = \lVert x - y \rVert$, for a global, minimal $\varepsilon$.
As can be seen in the example, $\varepsilon$ can be comparatively small, so that the points of the action point trace cross the polytopes through their neighbors. However, if two hyperplanes of neighboring polytopes are not similar, there is a sudden change in control behavior, which can cause a sudden change in the state and behavior of the system. The non-similarity of the hyperplanes of two neighboring polytopes can therefore be an important measure. For this reason, the modified cosine distance $\cos_{\mathrm{distance}} = 1 - \lvert \cos\alpha \rvert$ is indicated by the size of the edges in the NHG, as shown in the example in fig. 4. It is modified by taking the absolute value of the cosine similarity, as the direction of the normal vectors of the hyperplanes is irrelevant.
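As a minimal sketch (assuming networkx and the LP helpers sketched in section 5, not the authors' implementation), the NHG can be assembled from the leaf polytopes (A_i, b_i) of the DT and their hyperplane coefficient vectors:

```python
import numpy as np
import networkx as nx
# polytopes_intersect() as sketched in section 5

def cosine_distance(a, b):
    """Modified cosine distance 1 - |cos(alpha)| between coefficient vectors."""
    return 1.0 - abs(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_nhg(polytopes, coeffs):
    """polytopes: list of (A, b); coeffs: list of hyperplane coefficient vectors [a, c]."""
    G = nx.Graph()
    for i, (A, b) in enumerate(polytopes):
        G.add_node(i, A=A, b=b, expr=coeffs[i])
    for i in range(len(polytopes)):
        for j in range(i + 1, len(polytopes)):
            Ai, bi = polytopes[i]
            Aj, bj = polytopes[j]
            if polytopes_intersect(Ai, bi, Aj, bj):  # neighbors share boundary points
                G.add_edge(i, j, cos_distance=cosine_distance(coeffs[i], coeffs[j]))
    return G
```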
Figure 4: NHG for the FF-DNN model trained in the MCC with the global constraints; the cosine distance is shown as dark blue edges and the action point trace as magenta edges.

The magenta-colored edges in fig. 4 represent the trace of the action points of the MCC example. The weight of the edges (thickness) describes how often
an edge was crossed by two consecutive action points, relative to the other edges. It can be observed that the action point trace starts at vertex 40 and runs over vertex 2, which is represented by the lower hyperplane in fig. 3. Furthermore, the trace runs across vertex 26, which corresponds to a steep hyperplane, and finally the trace ends on the upper planes until the agent obtains the reward in vertex 19, which corresponds to a large region.
The size of the polytopes in the output region can also be significant, as the closed-loop system can remain in a state where the inputs fall into the same output region for a long time and the action points are sampled as they pass through the same hyperplane function. The size of the polytopes is measured by their volume, which is visualized by the size of the vertices in the NHG. The volume is calculated with the library of (TuLiP control, 2024). The library can also be used to calculate polytope properties such as the center and radius of the Chebyshev sphere and the bounding box. In addition to these measures, the bounds and the volume of the inner boxes of the polytopes, as described in section 6, and the expression of the hyperplane are mapped to the vertices in the NHG.
10 EXPERIMENT
In the following scenario (Logemann, 2024) the CI-
GRE Medium Voltage (MV) benchmark power grid
(Task Force C6.04.02, 2014), as shown in fig. 5, is
used.
Several PV systems are connected to the busses
as generators and various loads with time-dependent
profiles. The power fed in by the PV systems and
thus the VMs of the buses depend, over time, on the solar radiation caused by the simulated weather. The simulated VMs of the busses for the scenario without a controller are shown in fig. 6. Between time steps 2000 and 2500, the VMs initially increase and finally decrease due to violations of the VM constraints of the busses.

Figure 5: CIGRE MV power grid, adapted from (Fraunhofer IEE and University of Kassel, 2023); blue: bus with the Photovoltaic (PV) inverters for which the agent controls the setpoints for Active Power (AP) and Reactive Power (RP), and the bus with the highest weighted Voltage Magnitude (VM) for the RP setpoint; red: bus with the highest weighted VM for the AP setpoint. Feeder 2 is not further relevant in the scenario.
10.1 Classic Analysis
To keep the voltage within its constraints, a very simple controller agent is trained with DRL. The simulated VMs of this scenario with the controller are shown in fig. 7.
The objective of the agent in this scenario is to keep the VMs of the busses (observations: $vm_b$, the input to the DNN of the agent) in feeder 1 as close as possible to 1 p.u. (nominal voltage). The voltage is not only generally controlled by the AP, but also depends directly on the RP fed in. The balanced feed-in of AP and RP generated by a PV system can be set via setpoints for the inverter (Turitsyn et al., 2011). In the scenario, the agent controls such setpoints on bus 5 (actuators) using the output of the agent's DNN, where these output values are first transformed before they are actually applied as setpoints, as shown for the simulation of the scenario in fig. 8.

Figure 6: Time series of the VMs for the scenario without a controller agent. Between time steps 2000 and 2500, the voltage constraints (the VM must lie in [0.8, 1.2]) are violated and the bus is switched off, so that the VM drops to zero.

Figure 7: Time series of the VM observations of the controller agent.
To obtain the applied setpoints $y_A$, the function
$$y_A = \min\!\big(\max\!\big(a_{s,A} \tanh(o_A) + a_{b,A},\; \{p_{\min}, q_{\min}\}\big),\; \{p_{\max}, q_{\max}\}\big) \qquad (12)$$
with $A = \{\mathrm{ap}, \mathrm{rp}\}$ is first applied, with $o_A$ being the output of the DNN for the actuators, $a_{s,A}$ the action scaling factor and $a_{b,A}$ the bias, as well as the respective maximal and minimal AP setpoint ($p_{\{\min,\max\}}$) and RP setpoint ($q_{\{\min,\max\}}$) for the clipping. The $y_{p,q}$ values are then balanced to the actual setpoints that the inverter can deliver at the current time via the P-Q characteristic (Turitsyn et al., 2011).
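A minimal sketch of the transformation in eq. (12); the numeric values below are hypothetical placeholders, not the parameters of the trained agent:

```python
import numpy as np

def to_setpoint(o, a_s, a_b, lo, hi):
    """Map a raw DNN output o to a clipped setpoint (cf. eq. (12))."""
    return float(np.clip(a_s * np.tanh(o) + a_b, lo, hi))

# Hypothetical AP actuator parameters, for illustration only:
y_ap = to_setpoint(o=0.4, a_s=1.0, a_b=1.0, lo=0.0, hi=2.0)
```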
10.2 Exact Controller Analysis
The DNN of the controller agent was transformed into
an exact DT, which in turn was used to create the
NHG as described in section 3. It consists of only two
nodes, with almost all simulated data points falling
into the observation region of one node. Thus, almost
all observed VMs are evaluated by a set of two linear
expressions of the form
$$o_A = \sum_i w_{A,i}\, vm_{A,i} + c_A , \qquad (13)$$
where $w_{A,i}$ is the weight for the VM of the $i$-th bus, $vm_{A,i}$, and $c_A$ is the bias value, for the two actuators.
Figure 8: Time series of the setpoints of the controller agent for the PV system on bus 4; the AP and RP setpoints overlap all the time.
Table 1: Weights for the observed VMs for the region in which almost all test simulation data fall.

    AP actuator                 RP actuator
    Bus b_ap    Weight w_ap     Bus b_rp    Weight w_rp
    9           0.013           5           0.019
    4           0.009           2           0.018
    10          0.003           11          0.018
    3           0.002           3           0.008
    6           0.001           8           0.007
    11          0.004           7           0.001
    7           0.004           9           0.006
    5           0.005           4           0.010
    8           0.006           6           0.010
    2           0.006           10          0.013
The weights of two different buses $b_i, b_j$ for the AP and RP actuators listed in table 1 are only approximately proportional to each other ($w_{A,i} \propto w_{A,j}$), as the non-linear $y$ function and the inverter balancing, as described in section 10.1, are still applied to them. The bus with the highest weighted VM for the AP actuator is bus 9, and for the RP actuator it is bus 5. Considering only the RP actuator, the learned strategy is analogous to a simple RP controller, which also controls only one RP setpoint for a PV inverter on the same bus from which it observes the VM (Ju and Lin, 2018). Although the VM profiles of all busses in fig. 7 are very similar except for their scaling, the agent has learned properties of the topology of the network: the controller has learned to identify its own bus (5) in terms of the generally known simple RP controller strategy. Also noteworthy is the second highest weighted VM, that of bus 2, which has a fundamentally stronger influence on all VMs of the feeder's busses, as it is closer to the common coupling point (the 110/20 kV transformer at bus 1 in fig. 5).
11 DISCUSSION
Although the inner boxes are volume-maximized and based on the exact polytope representation of the output regions, and can thus be used for an exact overview, they are significantly smaller than the polytopes in terms of their volume. The representation for an overview of a polytope could be extended by the inner boxes of the sub-polytopes in $P \setminus B_{\mathrm{in}}$ of a polytope $P$ and its inner box $B_{\mathrm{in}}$. On the other hand, this would increase the complexity of the overview, because there would then be multiple inner boxes for the same polytope and thus the same output region. So it has to be further evaluated in more practical, real-world scenarios whether, and in which form, this inner box representation can be useful for an overview.
The NHG has been successfully used to visualize the relation of the hyperplanes and other measures to each other. The volume of the polytopes and the cosine distance, as well as the mapping of the other properties of the hyperplanes, can be used to investigate the higher-dimensional position of the linear functions of neighboring polytopes relative to each other. It has been shown that restricting the input space to the box spanning the action points leads to interpretable graphs in a simple real-world experiment. However, the methods need to be evaluated for larger observation spaces and more input and output dimensions.
12 CONCLUSION AND FUTURE
WORK
This paper presents a methodological approach to an-
alyze the exact output regions of policy FF-DNNs for
high-dimensional input-output spaces. Such an exact
analysis is necessary for the understanding of agents’
strategies, especially with regards to CNI operation.
The paper builds upon the exact transformation of FF-DNNs into a pruned DT. The paths from the root to the leaf nodes of such a DT make up output regions that can be represented by polytopes. For an overview of a polytope, its inner box is computed. Furthermore, the output regions are depicted by vertices in the introduced NHG. They are connected by edges if they are direct neighbors of each other. Further properties, like the polytope constraints, the inner box bounds and volume, as well as the Chebyshev ball center and radius, the bounding box, and the cosine distance of the normal vectors of the hyperplanes, are also mapped to the vertices and thus to the hyperplanes in the output regions. Thereby, the NHG especially allows us to visualize the relations between higher-dimensional output regions.
In future work, these analysis tools could be eval-
uated in real scenarios with even higher dimensions.
They could also be used to analyze the learned control
strategies of ARL agents in the power grid for larger
input domains. This would make it possible to get a
better overview of the entire possible behavior of the
agent.
ACKNOWLEDGEMENTS
This work was funded by the German Federal Min-
istry for Education and Research (BMBF) under
Grant No. 01IS22071.
REFERENCES
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R., et al. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and
Bharath, A. A. (2017). Deep reinforcement learning:
A brief survey. IEEE Signal Processing Magazine,
34(6):26–38.
Aytekin, C¸ . (2022). Neural networks are decision trees.
CoRR, abs/2210.05189:1–8. [retrieved: 05, 2023].
Bastani, O., Pu, Y., and Solar-Lezama, A. (2018). Ver-
ifiable reinforcement learning via policy extraction.
Advances in neural information processing systems,
31:2499–2509.
Bemporad, A., Filippi, C., and Torrisi, F. D. (2004). Inner
and outer approximations of polytopes using boxes.
Computational Geometry, 27(2):151–178.
Coppens, Y., Efthymiadis, K., Lenaerts, T., and Nowé, A. (2019). Distilling deep reinforcement learning policies in soft decision trees. In International Joint Conference on Artificial Intelligence.
Du, M., Liu, N., and Hu, X. (2019). Techniques for in-
terpretable machine learning. Communications of the
ACM, 63(1):68–77.
Fraunhofer IEE and University of Kassel (2023). Pan-
dapower 2.0 cigre benchmark power grid implemen-
tation. [retrieved: 06, 2023].
Frosst, N. and Hinton, G. (2017). Distilling a neural
network into a soft decision tree. arXiv preprint
arXiv:1711.09784.
Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596. PMLR.
Fukuda, K. (2015). Lecture: Polyhedral computation,
spring 2015. [retrieved: 06.2024].
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. CoRR,
abs/1801.01290. [retrieved: 05, 2023].
Irsoy, O., Yildiz, O. T., and Alpaydin, E. (2012). Soft de-
cision trees. In International Conference on Pattern
Recognition.
Jaunet, T., Vuillemot, R., and Wolf, C. (2020). Drlviz: Un-
derstanding decisions and memory in deep reinforce-
ment learning. Computer Graphics Forum, 39(3):49–
61.
Ju, P. and Lin, X. (2018). Adversarial attacks to distributed
voltage control in power distribution networks with
DERs. In Proceedings of the Ninth International Con-
ference on Future Energy Systems, pages 291–302.
ACM.
Logemann, T. (2023). Explainability of power grid at-
tack strategies learned by deep reinforcement learning
agents.
Logemann, T. (2024). Power grid experiment. https://gitlab.com/arl-experiments/explains-simple-voltage-controller. [retrieved: 09, 2024].
Logemann, T. and Veith, E. M. (2023). Nn2eqcdt: Equiv-
alent transformation of feed-forward neural networks
as drl policies into compressed decision trees. vol-
ume 15, page 94–100. IARIA, ThinkMind. [retrieved:
07, 2023].
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
(2013). Playing atari with deep reinforcement learn-
ing. CoRR, abs/1312.5602:1–9. [retrieved: 05, 2023].
Puiutta, E. and Veith, E. M. S. P. (2020). Explainable rein-
forcement learning: A survey. In Machine Learning
and Knowledge Extraction. CD-MAKE 2020, volume
12279, pages 77–95, Dublin, Ireland. Springer, Cham.
Qing, Y., Liu, S., Song, J., and Song, M. (2022). A sur-
vey on explainable reinforcement learning: Concepts,
algorithms, challenges. CoRR, abs/2211.06665:1–25.
[retrieved: 05, 2023].
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K.,
Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hass-
abis, D., Graepel, T., et al. (2020). Mastering atari,
go, chess and shogi by planning with a learned model.
Nature, 588(7839):604–609.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. CoRR, abs/1707.06347. [retrieved: 05,
2023].
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai,
M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D.,
Graepel, T., et al. (2018). A general reinforcement
learning algorithm that masters chess, shogi, and go
through self-play. Science, 362(6419):1140–1144.
Task Force C6.04.02 (2014). Benchmark systems for net-
work integration of renewable and distributed energy
resources. Elektra - CIGRE’s digital magazine, 575.
TuLiP control (2024). Polytope implementation. [retrieved: 06, 2024].
Turitsyn, K., Sulc, P., Backhaus, S., and Chertkov, M.
(2011). Options for control of reactive power by dis-
tributed photovoltaic generators. Proceedings of the
IEEE, 99(6):1063–1073.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep
reinforcement learning with double q-learning. In
Proceedings of the Thirtieth AAAI Conference on Ar-
tificial Intelligence, February 12-17, 2016, Phoenix,
Arizona, USA, volume 30, pages 2094–2100. AAAI
Press.
Vanderbei, R. J. et al. (2014). Linear programming.
Springer.
Veith, E. M. (2023). An architecture for reliable learn-
ing agents in power grids. ENERGY 2023 : The
Thirteenth International Conference on Smart Grids,
Green Communications and IT Energy-aware Tech-
nologies, pages 13–16. [retrieved: 05, 2023].
Veith, E. M. and Logemann, T. (2023). Towards explainable
attacker-defender autocurricula in critical infrastruc-
tures. CYBER 2023 : The Eighth International Con-
ference on Cyber-Technologies and Cyber-Systems //,
pages 27–31. [retrieved: 06, 2024].
Veith, E. M., Logemann, T., Wellßow, A., and Balduin, S.
(2024). Play with me: Towards explaining the benefits
of autocurriculum training of learning agents. In 2024
IEEE PES Innovative Smart Grid Technologies Eu-
rope (ISGT EUROPE), pages 1–5, Dubrovnik, Croa-
tia. IEEE.
Veith, E. M. S. P., Wellßow, A., and Uslar, M. (2023).
Learning new attack vectors from misuse cases with
deep reinforcement learning. Frontiers in Energy Re-
search, 11:01–23. [retrieved: 05, 2023].
APPENDIX
Convexity of Intersection of Half-Spaces
Theorem 1. Let $H_{1 \le i \le m} \subseteq \mathbb{R}^n$ be half-spaces defined by linear inequalities, such that $H = \{x \in \mathbb{R}^n \mid a^{\top} x \le b\}$, $a \in \mathbb{R}^n$, $b \in \mathbb{R}$. Then their intersection set $A = \bigcap_{i=1}^{m} H_i$ is convex.

Proof. First it is proved that half-spaces are convex sets, then that the intersection of any collection of convex sets is also convex.

(1.) Show that half-spaces are convex:
$$\forall x_1, x_2 \in H \subseteq \mathbb{R}^n,\ \lambda \in [0, 1] : \quad \lambda x_1 + (1 - \lambda) x_2 \in H .$$
Let $x_1, x_2 \in H$ and $\lambda \in [0, 1]$. Then $a^{\top} x_1 - b \le 0$ and $a^{\top} x_2 - b \le 0$, and since $\lambda, (1 - \lambda) \in [0, 1]$,
$$\underbrace{\lambda (a^{\top} x_1 - b)}_{\le 0} + \underbrace{(1 - \lambda)(a^{\top} x_2 - b)}_{\le 0} \le 0$$
$$\Leftrightarrow \quad \lambda a^{\top} x_1 - \lambda b + a^{\top} x_2 - b - \lambda a^{\top} x_2 + \lambda b \le 0$$
$$\Leftrightarrow \quad a^{\top} \big(\lambda x_1 + (1 - \lambda) x_2\big) - b \le 0 ,$$
from which follows $\lambda x_1 + (1 - \lambda) x_2 \in H$ for all $x_1, x_2 \in H$ and $\lambda \in [0, 1]$.

(2.) Show that the intersection set of convex subsets $C$ of a real or complex vector space $V$ is convex:
$$\mathcal{C} = \bigcap_{\substack{C \subseteq V,\\ C \text{ convex}}} C \quad \text{is convex} .$$
Let $x_1, x_2 \in \mathcal{C}$ be two elements from the intersection set; then they are in all intersected sets, $\forall C : x_1, x_2 \in C$, and because of the convexity it holds that
$$\forall \lambda \in [0, 1] : \quad \lambda x_1 + (1 - \lambda) x_2 \in C .$$
Because this holds for every $C$, and thus $\lambda x_1 + (1 - \lambda) x_2$ lies in every $C$ for all $\lambda \in [0, 1]$, these elements are also in the intersection set $\mathcal{C}$. Hence it is also convex.

(3.) From (1.) and (2.), it directly follows that the intersection set of half-spaces is convex.