Analyzing Exact Output Regions of Reinforcement Learning Policy
Neural Networks for High-Dimensional Input-Output Spaces
Torben Logemann (https://orcid.org/0000-0002-2673-397X) and Eric MSP Veith (https://orcid.org/0000-0003-2487-7475)
Carl von Ossietzky University Oldenburg,
Research Group Adversarial Resilience Learning,
Oldenburg, Germany
{torben.logemann, eric.veith}@uol.de
Keywords:
Decision Tree, Reinforcement Learning, Explainability, Neural Network.
Abstract:
Agent systems based on deep reinforcement learning have achieved remarkable success in recent years. They have also been applied to a variety of research topics in the field of power grids, as such agents promise real resilience. However, deep reinforcement learning agents cannot guarantee behavior, as the mapping of the entire input space to the output of even a simple feed-forward neural network cannot be accurately explained. For critical infrastructures, such black-box models are not acceptable. To ensure an optimized trade-off between learning performance and explainability, this paper relies on efficient, regularizable feed-forward neural networks and presents an extension of the NN2EQCDT algorithm that transforms the networks into pruned decision trees with significantly fewer nodes, which can be accurately explained. In this paper, we present a methodological approach to further analyze the decision trees for high-dimensional input-output spaces and analyze an agent for a power grid experiment.
1 INTRODUCTION
Deep Reinforcement Learning (DRL)—the notion of agents that learn from interacting with their environment—is at the core of many remarkable successes, beginning with its breakthrough in 2013 by end-to-end learning of Atari games (Mnih et al., 2013) and Double Deep Q-Learning (DDQN) (Van Hasselt et al., 2016) and culminating in developments such as AlphaGo (Zero), AlphaZero (Silver et al., 2018), and MuZero (Schrittwieser et al., 2020). Since the 2013 hallmark paper by Mnih et al., researchers have improved training performance, sensitivity to hyperparameters, exploration, and sample efficiency through algorithms such as Twin Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018), Proximal Policy Optimization (PPO) (Schulman et al., 2017), and Soft Actor-Critic (SAC) (Haarnoja et al., 2018). The current research corpus shows that these agents are capable of handling complex tasks.
DRL agents promise true resilience by learning
to counter the unknown unknowns. However, unlike
intrinsically interpretable DRL models (Puiutta and
Veith, 2020), no guarantees can yet be made about the behavior of DRL agents learned with black-box models. Such guarantees are, however, a necessity for operators, since no responsibility can be taken for an unknown control system that cannot be validated, especially when it is used in critical or highly critical areas such as Critical National Infrastructures (CNIs).
Agents deployed in complex environments, such as complex networked systems, are potentially confronted with many different situations and learn complex behaviors to fulfill their goals. For example, in (Veith et al., 2023; Veith et al., 2024), Adversarial Resilience Learning (ARL) attack agents are trained to cause voltage band violations in a power grid. This goal is achieved by exploiting a weakness in the way voltage regulators are used in the grid.
To gain better insight into how the control strategies work, compared to the classical implications of rewards and actions on the victim busses, the NN2EQCDT algorithm was developed (Logemann and Veith, 2023), which exactly transforms Deep Neural Networks (DNNs) into pruned Decision Trees (DTs). This allows not only individual trajectories to be explained, but also the mapping of entire observation regions to functions for the actions, on the basis of all input points from the region (Veith
and Logemann, 2023; Logemann, 2023). It is embed-
ded in the ARL architecture in (Veith, 2023), which
aims to create a framework for an agent that can learn
sample-efficiently, but also provide insights into the
behavior of the agents.
For more complex control strategies with higher input/output dimensions, it becomes more difficult to give guarantees, since such spaces make it considerably harder to read the control strategies directly from the transformed DTs and to visualize them. Furthermore, it is not easily possible to provide explanations and guarantees for control agents over a longer time horizon. This is relevant, e.g., for agents for voltage maintenance, as voltage band violations must never occur.
In this paper, we present a methodological approach that can represent the properties of regions in higher-dimensional input-output spaces in relation to each other. The input space is divided completely and disjointly into polytopes by a DT generated by the NN2EQCDT algorithm. For a better understanding of the regions, the polytopes are also approximated by inner boxes, and their neighborhood relation is represented in the newly presented concept of the Neighborhood Graph (NHG), which works for arbitrarily high-dimensional regions. The edges can represent properties between these regions. In this way, (abrupt) transitions in the control planes of the agent can be identified. We illustrate our approach in a well-known Gymnasium environment, which is accessible both to visual inspection and to our approach, to show the feasibility of our method.
The remainder of this paper is structured as fol-
lows: First, related work is presented in section 2
and an overview of the whole system is given in sec-
tion 3. Then the construction of the equivalent pruned
DT from a Feed-Forward Deep Neural Network (FF-
DNN) is generally described in section 4. The exact
representation of the output region is also described
in section 5. In addition, the calculation of the in-
ner boxes in section 6 and the handling of obser-
vation space constraints in section 7 are described.
With these calculations, an example of a controller
trained in the MountainCarContinuous-v0 environ-
ment (MCC) is given in section 8. Then the NHG is
presented in section 9 and validated with the MCC example. After that, an experiment with higher input and output dimensions in the power grid domain is analyzed in section 10. Finally, a discussion in section 11 and references to future work follow.
2 RELATED WORK
2.1 Deep Reinforcement Learning
Reinforcement Learning (RL) is the process of learning through interaction. When an RL agent interacts with its environment and thereby observes the consequences of its actions in the form of rewards, it can learn to alter its own behavior in response to the received rewards. RL follows the paradigm of trial-and-error learning, is influenced by the field of optimal control, and must balance the trade-off between exploration and exploitation. It is based on the Markov Decision Process (MDP), a quintuple $(S, A, T, R, \gamma)$, which describes that an agent observes the state $s_t \in S$ of its environment at time $t$ and takes the action $a_t \in A$ in response to the discounted reward $r_t \in R$ (with $\gamma$ the discount factor). This transitions the state to the next state $s_{t+1} \in S$ with the probability $p_t \in T$. If the state is reset after each episode, then the sequence of states, actions, and rewards is a trajectory of the policy. The return of a trajectory is given by the discounted accumulation of rewards, $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$. If the system is only partially observable, a Partially Observable Markov Decision Process (POMDP) further specifies the set of observations and the conditional observation probabilities. In general, reinforcement learning aims to learn a policy, $a_t \sim \pi(\cdot \mid s_t)$. Finding the optimal policy $\pi^*$ that maximizes the expected return from all states,
$$\pi^* = \arg\max_\pi \mathbb{E}[R \mid \pi] , \qquad (1)$$
is the optimization problem of all reinforcement learning algorithms (Arulkumaran et al., 2017).
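As a brief worked illustration (our example, not from the original text): for an episode of length $T = 3$ with $\gamma = 0.9$ and rewards $r_1 = 0$, $r_2 = 0$, $r_3 = 1$,
$$R = \sum_{t=0}^{T-1} \gamma^{t} r_{t+1} = 0.9^{0} \cdot 0 + 0.9^{1} \cdot 0 + 0.9^{2} \cdot 1 = 0.81 ,$$
so a sparse reward obtained only at the end of an episode, as in the Mountain Car example of section 8, contributes to the return discounted by how late it is reached.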
2.2 Explainability for Deep
Reinforcement Learning
DNNs are inherently opaque, as the meaning of any particular set of nodes eludes a human. Therefore, DRL policies are similarly black boxes (Jaunet et al., 2020); since experts cannot readily understand why a particular action was taken, especially in the context of the overall learned policy, this lack of transparency limits trust in DRL agents (Qing et al., 2022). Recent
papers survey the landscape of the very active field
of research of eXplainable Reinforcement Learning
(XRL) (Arrieta et al., 2020; Puiutta and Veith, 2020),
to which we refer the interested reader. We will focus
on equivalent representations of policy networks as
decision trees.
2.2.1 Explaining with Decision Trees
Transforming the agent's policy into a DT is intuitively justified, as we can easily accept decisions expressed as if-then-else conditions; DTs seem to be easy to interpret (Du et al., 2019). Algorithms
such as Viper (Bastani et al., 2018) create the DT
from samples of the original policy. But when the
agents’ state/action space and its strategies become
more complex, the DT also grows in size, making its
creation and even evaluation slow. For this reason,
approaches such as distilling into Soft Decision Trees
(SDTs) (Irsoy et al., 2012; Frosst and Hinton, 2017)
have been proposed (Coppens et al., 2019), because
SDTs are more space efficient.
However, such distillation methods require the agent to produce trajectories from which the DT can be constructed (imitation learning), which has the drawback that the DT can only cover what the agent has done so far; there is no guarantee that the DT covers the agent's behavior in general.
To this end, another approach has been proposed
in which the policy network itself is converted to an
equivalent DT (Aytekin, 2022). A problem with this approach is that the resulting DT can grow very large. We have previously extended the algorithm to dynamically prune unreachable nodes from the DT during its creation and therefore minimize it (Logemann, 2023).
3 SYSTEM OVERVIEW
In fig. 1, an overview of the whole system is visualized. First, the NN2EQCDT algorithm is used to transform a given DNN exactly into a pruned DT. The regions into which the DT splits the input space form convex polytopes. These are used as matrix inequality systems for Linear Programming (LP) approaches to compute various properties and the neighborhoods, which are compacted into a NHG. General constraints, such as global outer boxes, are included both in the calculations of the NN2EQCDT algorithm and in those with LP. The NHG is a tool that can be used for further analysis of region transitions and properties.

Figure 1: System overview of the interplay between the NN2EQCDT algorithm and the NHG calculation.
4 THE NN2EQCDT ALGORITHM
In this section, we briefly describe the original NN2EQCDT algorithm (cf. algorithm 1). For further details and evaluations, please refer to the original publication (Logemann, 2023).
Data: ANN weight matrices W, bias matrices B, general constraints c_g
Result: Pruned decision tree T
begin
    Ŵ = W_0
    B̂ = B_0
    rules = CALC_RULE_TERMS(Ŵ, B̂)
    T, new_SAT_leaves = CREATE_INITIAL_SUBTREE(rules, c_g)
    ADD_SAT_PATHS(T, new_SAT_leaves)
    SET_HAT_ON_SAT_NODES(T, new_SAT_leaves, Ŵ, B̂)
    for i = 1, ..., n-1 do
        SAT_paths = POP_SAT_PATHS(T)
        for SAT_path in SAT_paths do
            a = COMPUTE_A_ALONG(SAT_path)
            SAT_leaf = LAST_ELEMENT(SAT_path)
            Ŵ, B̂ = GET_LAST_HAT_OF_LEAVE(T, SAT_leaf)
            Ŵ = (W_i ⊙ [(a^T)_{×k}]) Ŵ
            B̂ = (W_i ⊙ [(a^T)_{×k}]) B̂ + B_i
            rules = CALC_RULE_TERMS(Ŵ, B̂)
            new_SAT_leaves = ADD_SUBTREE(T, SAT_leaf, rules, c_g)
            SET_ON_NODES(T, new_SAT_leaves, Ŵ, B̂)
            ADD_SAT_PATHS(T, new_SAT_leaves)
        end
    end
    CONVERT_FINAL_RULE_TO_EXPR(T)
    PRUNE_TREE(T)
end
Algorithm 1: Main loop of the NN2EQCDT algorithm.
The weight and bias matrices $W_i$ and $B_i$ of layer $i$ from the FF-DNN model are processed layer by layer. Initial rules are calculated from the effective matrices
(function CALC_RULE_TERMS). The initial effective matrices are simply the weight and bias matrices of the first layer. A rule is the symbolic representation for checking the input data at a node of the DT. The evaluation of a rule with input data decides which further path the input data takes through the DT.
A rule can be imagined as an evaluation of the application of the ReLU function to the current input data, calculated for all possible input data with the effective matrices. The $k$-th rule of the $i$-th layer thus sums the matrix coefficients of the $k$-th row multiplied by their respective input variables:
$$(\mathrm{rule}_i)_k = \sum_j (\hat{W}_i)_{k,j}\, x_j + (\hat{B}_i)_{k,j} > 0 . \qquad (2)$$
The DT is built from top to bottom, whereby the $k$-th rule is assigned to all nodes of the (offset + $k$)-th level of the DT. The offset is the index of the last level of the already built DT. Before each node $n$ with a rule is added to the DT, a check is made to see whether the rule of node $n$, together with all the rules of the nodes on the path above it from the root node to node $n$ and the general boundary constraints, can be satisfied. This can be done either with Satisfiability Modulo Theories (SMT) solvers or, in this case, by checking with LP as described in the next sections. If the rules together are unsatisfiable, i.e., there can be no input whose evaluation by the DT takes this path, the node $n$ and thus further subtrees are not added, which keeps the size of the DT dynamically small.
All further subtrees that are calculated are only appended to the satisfying (SAT) leaf nodes of the already created DT. Therefore, SAT leaf nodes must be saved, which is why they are returned after a subtree has been created (see the return values of the functions CREATE_INITIAL_SUBTREE and ADD_SUBTREE).
The calculation of $\hat{W}$ and $\hat{B}$ depends on the slope vector $a$, which represents the "decisions" made for the input data $x$ by the rules on the path from the root node to a SAT control node. Thus, $a_k = 1$ if $\mathrm{rule}_{\mathrm{node}_k}(x)$ is true, otherwise $a_k = 0$. Since the calculation of the rules in turn depends on $\hat{W}$ and $\hat{B}$, each subtree that is generated from rules with the slope vector of a SAT leaf node $n_{\mathrm{SAT\,leaf}}$ is appended to that leaf node.
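To make the role of the effective matrices concrete, the following minimal sketch (our illustration, not the authors' implementation) shows for a toy two-layer ReLU network that, once the slope vector a of a region is fixed, the network reduces to a single affine map, which is exactly the expression stored in the corresponding DT leaf:

```python
import numpy as np

rng = np.random.default_rng(0)
W0, B0 = rng.normal(size=(4, 2)), rng.normal(size=4)  # hidden layer
W1, B1 = rng.normal(size=(1, 4)), rng.normal(size=1)  # output layer

x = np.array([0.3, -0.5])                             # an arbitrary input point

h = np.maximum(W0 @ x + B0, 0.0)                      # forward pass with ReLU
y_nn = W1 @ h + B1

a = (W0 @ x + B0 > 0).astype(float)                   # slope vector of x's region

W_hat = (W1 * a) @ W0                                 # effective weight matrix
B_hat = (W1 * a) @ B0 + B1                            # effective bias

y_dt = W_hat @ x + B_hat                              # affine leaf expression
assert np.allclose(y_nn, y_dt)                        # identical within the region
```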
So, after the initial subtree has been created from the first set of rules using the function CREATE_INITIAL_SUBTREE, the rules are not assigned to all nodes in a row of the DT, but only to the nodes within a subtree. Starting with the second layer, all further layers are iterated through, and for each, the path from the root to the SAT leaf node (function POP_SAT_PATHS) is iterated to first calculate the corresponding slope vector (function COMPUTE_A_ALONG). This is used to calculate $\hat{W}$ and $\hat{B}$ together with the current $W_i$ and $B_i$. After calculating the rules from the effective matrices, these are used together with the general constraints for the SAT check to create a subtree that is added to the SAT leaf node (function LAST_ELEMENT) of the given SAT path.
In order to access the latest $\hat{W}$ and $\hat{B}$ for the calculation of the effective weight matrices in the next layer iteration, they are stored in the new SAT leaf nodes. In order to be able to iterate over the new SAT paths, these paths are calculated and added to the tree from the new SAT leaves (function ADD_SAT_PATHS). Finally, after iterating over all layers, the rules of the last SAT leaves are converted into expressions, and the DT can be further pruned by removing nodes with unnecessary rules if they evaluate identically for all possible inputs.
5 EXACT OUTPUT REGION
REPRESENTATION
The nodes of a DT transformed with the NN2EQCDT algorithm contain $n$ rules on a direct path from the root to a leaf node with an expression (exclusive), which are linear inequalities of the form
$$\mathrm{rule}_{0 \le k \le (n-1)} = \sum_j (a_k)_j\, x_j + b_k > 0 . \qquad (3)$$
The rules represent half-space constraints, as each one subdivides the entire input space. Each node on a path of the DT from the root to a leaf node effectively further subdivides an already segmented space with a then-bounded half-space. All leaf nodes therefore lie in output regions that are described by the intersection of the half-spaces of their respective paths above. These output regions are therefore convex polytopes of the standard form (Vanderbei et al., 2014) $P = \{x \in \mathbb{R}^d : Ax \le b\}$, where
$$A = \big[-(a_k)_j\big] \quad \text{and} \quad b = \big[b_k\big] \qquad (4)$$
are the concatenation of the coefficients of a rule in the columns, for all rules in the rows. For a proof of convexity, see the appendix.
Note that the strict inequalities of the comparison against zero, which originate from the ReLU resolution, must be reformulated into the standard form. This requires a non-strict inequality, which is why the case of equality must be checked separately.
Whether linear inequalities have a common solution can be determined by solving "fake" LP problems of the form
$$\min_{x} 0 \quad \text{subject to} \quad Ax \le b . \qquad (5)$$
Solving LP problems is comparatively efficient, and this can be used for dynamic path checking instead of checking the constraints with general SMT solvers, as originally done in (Logemann and Veith, 2023).
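As a minimal sketch (assuming SciPy, not the authors' implementation), the feasibility check of eq. (5) can be phrased as a constant-objective LP:

```python
import numpy as np
from scipy.optimize import linprog

def is_satisfiable(A, b):
    """True iff the polytope {x : A x <= b} is non-empty (cf. eq. (5))."""
    d = A.shape[1]
    res = linprog(c=np.zeros(d),              # constant objective: feasibility only
                  A_ub=A, b_ub=b,
                  bounds=[(None, None)] * d)  # unbounded variables
    return res.status == 0                    # status 0: a feasible optimum was found
```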
Solving such LP problems can also be used to check whether points or other convex polytopes lie in or intersect the convex polytope in question by adding further representative constraints to the LP problem, such as
$$\min_{x} 0 \quad \text{subject to} \quad Ax \le b, \quad x = y \ \text{ for a point } y, \qquad \text{(6a)}$$
$$\text{or} \quad A'x \le b' \ \text{ for a polytope } P' . \qquad \text{(6b)}$$
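The checks of eqs. (6a) and (6b) can be sketched along the same lines (reusing numpy as np and is_satisfiable from the sketch above); note that for a single point, directly evaluating A y <= b is equivalent to adding the equality constraints of eq. (6a) to the LP:

```python
def contains_point(A, b, y, tol=1e-9):
    """Check whether the point y lies in {x : A x <= b} (cf. eq. (6a))."""
    return bool(np.all(A @ y <= b + tol))

def polytopes_intersect(A1, b1, A2, b2):
    """Check whether two polytopes share a point by stacking their constraints (cf. eq. (6b))."""
    return is_satisfiable(np.vstack([A1, A2]), np.concatenate([b1, b2]))
```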
Each leaf node $l$ of a DT represents the linear equation of the hyperplane bounded by the respective polytope $P_l$. This denotes the actual function that is applied in the region to derive the equivalent output of the FF-DNN, such as the setpoint of an agent actor from its policy network.
6 INNER BOXES
The exact representation of the polytopes as an in-
tersection of the half-spaces becomes more difficult
to imagine for higher dimensions. For a general
overview, however, smaller but definitive intervals for
all dimensions are more suitable, as in such a repre-
sentation the boundaries for each dimension are inde-
pendent of the others. This is achieved by inner boxes
with maximum volume of the polytopes.
According to Proposition 1 of (Bemporad et al., 2004), an inner box with maximum volume of a full-dimensional convex polytope $P = \{x \in \mathbb{R}^d : Ax \le b\}$ can be obtained by solving the following optimization problem:
$$\max_{x, y} \sum_{j \in D} \ln y_j \quad \text{subject to} \quad Ax + A^{+} y \le b , \qquad (7)$$
where $A^{+}$ is the positive part of $A$, i.e., $a^{+}_{ij} = \max(0, a_{ij})$.
The optimal solution $(x^{*}, y^{*})$ denotes the inner box with maximum volume, $B(x^{*}, x^{*} + y^{*})$. A box is represented as $B(l, u) = \{x \in \mathbb{R}^d : l \le x \le u\}$, where $l$ and $u$ are real $d$-vectors.
The convexity of the polytopes is ensured by construction through the half-space splitting in the DTs by the NN2EQCDT algorithm, as described in section 5.

Figure 2: Example of the volume-maximized inner box B (orange) of a convex polytope P (blue).

A polytope $P$, on the other hand, is full-dimensional if it has an interior point. An interior point of $P$ is a point $\hat{x} \in \mathbb{R}^d$ that satisfies $A\hat{x} < b$. This cannot be checked directly by LP optimization, as LP does not work with strict inequalities. But it can be checked using the following LP problem:
$$\max_{x, x_0} x_0 \quad \text{subject to} \quad Ax + \mathbf{1}\, x_0 \le b, \quad x_0 \le 1 . \qquad (8)$$
If this is solved successfully, with $x^{*}$ being an optimal solution and $x_0^{*}$ the optimal value, then $x^{*}$ is an interior point of $P$ and $P$ is full-dimensional if $x_0^{*} > 0$. In the case of $x_0^{*} < 0$, $P$ is empty, and in the case of $x_0^{*} = 0$, $P$ is neither full-dimensional nor empty (Fukuda, 2015).
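A minimal sketch of the check in eq. (8), assuming SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def full_dimensionality(A, b):
    """Optimal x0 of eq. (8): > 0 full-dimensional, < 0 empty, = 0 neither (cf. Fukuda, 2015)."""
    m, d = A.shape
    c = np.zeros(d + 1)
    c[-1] = -1.0                                  # maximize x0 == minimize -x0
    A_ub = np.hstack([A, np.ones((m, 1))])        # A x + 1 * x0 <= b
    bounds = [(None, None)] * d + [(None, 1.0)]   # x free, x0 <= 1
    res = linprog(c, A_ub=A_ub, b_ub=b, bounds=bounds)
    return res.x[-1] if res.status == 0 else None
```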
In fig. 2, an example of an inner box with maximum volume of a polytope is shown. Although the inner box has maximum volume for the interior of the polytope, the region $P \setminus B$ is significant, i.e., $B$ comprises only
$$\frac{\mathrm{vol}(B)}{\mathrm{vol}(P)} = \frac{32}{53.88} = 59.4\,\%$$
of the volume of $P$. However, the approximation may be sufficient for an overview, since the high-dimensional regions are easier to visualize with the box intervals than with the more complicated inequality constraints of the respective polytope.
7 OBSERVATION SPACE
CONSTRAINTS
Normally, the observation space has natural or artificial value limits in each dimension, so that only values within known intervals can occur. These can be used as a global constraint for the NN2EQCDT algorithm (see fig. 1) to generate smaller DTs by pruning nodes or subtrees whose input data falls entirely within unreachable regions. The calculation of the inner box can also be modified to respect these intervals by
$$\max_{x, y} \sum_{j \in D} \ln y_j \quad \text{subject to} \quad Ax + A^{+} y \le b, \;\; o_{\min} \le x \le o_{\max}, \;\; o_{\min} \le x + y \le o_{\max} , \qquad (9)$$
with $(o_{\min}, o_{\max})$ being the bounds of the input or observation space. Note that they span the outer box $B_{\mathrm{out}}(o_{\min}, o_{\max})$, in which the maximization of an inner box is performed, because $x$ and $x + y$ are both points constrained by the respective polytope and globally by these bounds. Therefore, the edges of all $i$-th inner boxes $B_i(x_i, x_i + y_i)$ can be constrained directly by the bounds of the outer box without affecting the volume maximization. Since the boundary conditions of the outer box are basically additional half-space constraints, the intersection set $B_{\mathrm{out}} \cap P = P'$ is also a convex polytope. In addition to convexity, the calculation of the inner box also requires that the polytope is full-dimensional, which in turn can be checked with the optimization of the LP problem described in section 6. Therefore, the resulting polytope must have the standard form given by
$$P' = \left\{ x \in \mathbb{R}^d : \tilde{A} = \begin{bmatrix} A \\ I \\ -I \end{bmatrix}\!, \;\; \tilde{A} x \le \begin{bmatrix} b \\ o_{\max} \\ -o_{\min} \end{bmatrix} = \tilde{b} \right\} . \qquad (10)$$
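The constraint stacking of eq. (10) is a simple concatenation; a minimal sketch assuming NumPy:

```python
import numpy as np

def add_outer_box(A, b, o_min, o_max):
    """Append the outer-box bounds o_min <= x <= o_max to A x <= b (cf. eq. (10))."""
    d = A.shape[1]
    I = np.eye(d)
    A_tilde = np.vstack([A, I, -I])                               # [A; I; -I]
    b_tilde = np.concatenate([b, np.asarray(o_max), -np.asarray(o_min)])  # [b; o_max; -o_min]
    return A_tilde, b_tilde
```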
8 MOUNTAIN CAR EXAMPLE
In fig. 3 the output regions of the same trained model
are shown for a control strategy of the MCC as used
in (Logemann and Veith, 2023). The MCC has two
inputs in the observation space, the first being the po-
sition of the vehicle along the x-axis and the second
being the velocity of the vehicle, and an actuator that
controls the directional force exerted on the vehicle.
An agent was trained that maximizes the reward
(there is only a sparse reward for reaching the moun-
tain). Its DNN was transformed into a DT using the
NN2EQCDT algorithm. In fig. 3a the output regions
of this DT are plotted according to their exact poly-
topes, while in fig. 3b the output regions are repre-
sented by the inner boxes of the polytopes. The output
regions are plotted against the linear actuator func-
tions specified in the leaf nodes of the respective DT.
Using the Gymnasium interface, which provides space types such as box intervals, the observation space can be constrained as described in section 7. In the case of the MCC, these intervals are given by $-1.2 \le x \le 0.6$ and $-0.07 \le y \le 0.07$, which is why they are used for the modified optimization problem.
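These bounds can be read directly from the Gymnasium observation space; a minimal sketch assuming the gymnasium package is installed:

```python
import gymnasium as gym

env = gym.make("MountainCarContinuous-v0")
o_min = env.observation_space.low   # array([-1.2 , -0.07], dtype=float32)
o_max = env.observation_space.high  # array([ 0.6 ,  0.07], dtype=float32)
```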
In addition to the output regions, the actual action points that the simulation runs through are displayed. Action points are defined here as the tuple of the observations (here: $x, y$), their mapping by the neural network to the actuator value (here: $(x, y) \mapsto z$), and the reward value (here: $(\mathrm{env}, s, x, y, z) \mapsto r$, with env being the environment and $s$ the state of the environment, from which the possible partial observations are obtained and to which the subsequent actuator value is applied). So here the action points are represented by $a = ((x, y), (z), r)$. In fig. 3, they are plotted as 3D points by the mapping $(x, y) \mapsto z$. The trace of action points starts on the lower plane and ends on the higher planes. The rewards are represented by the color of the points. Here, one can see that there is only one high sparse reward (yellow) at the end, which corresponds to the car reaching the mountain.
Figure 3: Output regions of the FF-DNN model trained in the MCC with the global constraints and actual action points; (a) exact output regions as polytopes, (b) output regions represented by inner boxes of the polytopes.
9 NEIGHBORHOOD GRAPH
For three dimensions, the output regions can be drawn
and observed as polytopes; for higher dimensions, the
approximating, but dimension-independent, represen-
tation of inner boxes can be used for better visualiza-
tion. Only linear functions are applied to these output
regions. In order to better understand the interaction
of output regions and linear functions of an agent in
higher-dimensional input and output spaces, the con-
cept of a NHG is introduced.
A NHG is a simple, undirected, weighted, cyclic, and finite graph $G = (V, E)$. The vertices $v \in V$ represent the regions. Two vertices $(v, u)$ are connected by an edge $e = (v, u) \in E$ iff the regions $P_v$ and $P_u$ represented by $v$ and $u$ are direct neighbors, which is given by $I = P_v \cap P_u \ne \emptyset$. If they are neighbors, the intersection contains the neighboring boundary points that both have in common, due to the relaxation of the strict inequalities as described in section 5. The intersection $I$ can again be checked for non-emptiness by solving the LP problem from eq. (6b) and checking whether it has a solution.
For two hyperplanes described by the linear functions $v \mapsto \sum_i a_i x_i + c$ and $u \mapsto \sum_i b_i y_i + d$ with the coefficient vectors $a = [(a_i), c]^{\top}$ and $b = [(b_i), d]^{\top}$, which are evaluated with input points in the polytopes, the cosine similarity is calculated by
$$\cos\alpha = \frac{a \cdot b}{\lVert a \rVert\, \lVert b \rVert} . \qquad (11)$$
If we consider closed-loop systems, such as the model of section 8 as a controller for the dynamics of the MCC system, all model outputs for each input $x_i$ control the system so that the next input point $x_{i+1}$ lies in the $\varepsilon$-sphere of the previous input. This means that $d(x_i, x_{i+1}) < \varepsilon$, where the Euclidean distance is $d(x, y) = \lVert x - y \rVert$, for a global, minimal $\varepsilon$.
As can be seen in the example, $\varepsilon$ can be comparatively small, so that the points of the action point trace cross the polytopes through their neighbors. However, if two hyperplanes of neighboring polytopes are not similar, there is a sudden change in control behavior, which can cause a sudden change in the state and behavior of the system. The non-similarity of the hyperplanes of two neighboring polytopes can therefore be an important measure. For this reason, the modified cosine distance $\cos_{\mathrm{distance}} = 1 - \lvert \cos\alpha \rvert$ is indicated by the size of the edges in the NHG, as shown in the example in fig. 4. It is modified by taking the absolute value of the cosine similarity, as the direction of the normal vectors of the hyperplanes is irrelevant.
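As a minimal sketch (assuming networkx and the LP helpers sketched in section 5, not the authors' implementation), the NHG can be assembled from the leaf polytopes (A_i, b_i) of the DT and their hyperplane coefficient vectors:

```python
import numpy as np
import networkx as nx
# polytopes_intersect() as sketched in section 5

def cosine_distance(a, b):
    """Modified cosine distance 1 - |cos(alpha)| between coefficient vectors."""
    return 1.0 - abs(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def build_nhg(polytopes, coeffs):
    """polytopes: list of (A, b); coeffs: list of hyperplane coefficient vectors [a, c]."""
    G = nx.Graph()
    for i, (A, b) in enumerate(polytopes):
        G.add_node(i, A=A, b=b, expr=coeffs[i])
    for i in range(len(polytopes)):
        for j in range(i + 1, len(polytopes)):
            Ai, bi = polytopes[i]
            Aj, bj = polytopes[j]
            if polytopes_intersect(Ai, bi, Aj, bj):  # neighbors share boundary points
                G.add_edge(i, j, cos_distance=cosine_distance(coeffs[i], coeffs[j]))
    return G
```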
Figure 4: NHG for the FF-DNN model trained in the MCC with the global constraints; the cosine distance is shown as dark blue edges and the action point trace as magenta edges.

The magenta-colored edges in fig. 4 represent the trace of the action points of the MCC example. The weight of the edges (thickness) describes how often
an edge was crossed by two consecutive action points, relative to the other edges. It can be observed that the action point trace starts at vertex 40 and runs over vertex 2, which is represented by the lower hyperplane in fig. 3. Furthermore, the trace runs across vertex 26, which corresponds to a steep hyperplane, and finally the trace ends on the upper planes until the agent obtains the reward in vertex 19, which corresponds to a large region.
The size of the polytopes in the output region can also be significant, as the closed-loop system can remain in a state where the inputs fall into the same output region for a long time and the action points are sampled as they pass through the same hyperplane function. The size of the polytopes is measured by their volume, which is visualized by the size of the vertices in the NHG. The volume is calculated with the library of (TuLiP control, 2024). The library can also be used to calculate polytope properties such as the center and radius of the Chebyshev sphere and the bounding box. In addition to these measures, the bounds and the volume of the inner boxes of the polytopes, as described in section 6, and the expression of the hyperplane are mapped to the vertices in the NHG.
10 EXPERIMENT
In the following scenario (Logemann, 2024) the CI-
GRE Medium Voltage (MV) benchmark power grid
(Task Force C6.04.02, 2014), as shown in fig. 5, is
used.
Several PV systems are connected to the busses
as generators and various loads with time-dependent
profiles. The power fed in by the PV systems and
thus the VMs of the buses depend, over time, on the solar radiation caused by the simulated weather. The simulated VMs of the busses for the scenario without a controller are shown in fig. 6. Between time steps 2000 and 2500, the VMs initially increase and finally decrease due to violations of the VM constraints of the busses.

Figure 5: CIGRE MV power grid, adapted from (Fraunhofer IEE and University of Kassel, 2023); blue: bus with the Photovoltaic (PV) inverters for which the agent controls the setpoints for Active Power (AP) and Reactive Power (RP), and the bus with the highest weighted Voltage Magnitude (VM) for the RP setpoint; red: bus with the highest weighted VM for the AP setpoint. Feeder 2 is not further relevant in the scenario.
10.1 Classic Analysis
To keep the voltage within its constraints, a very simple controller agent is trained with DRL. The simulated VMs of this scenario with the controller are shown in fig. 7.
The objective of the agent in this scenario is to keep the VMs of the busses (observations: $vm_b$, the input to the DNN of the agent) in feeder 1 as close as possible to 1 p.u. (nominal voltage). The voltage is not only generally controlled by the AP, but also depends directly on the RP fed in. The balanced feed-in of AP and RP generated by a PV system can be set via setpoints for the inverter (Turitsyn et al., 2011). In the scenario, the agent controls such setpoints on bus 5 (actuators) using the output of the agent's DNN, where these output values are first transformed before they are actually applied as setpoints, as shown for the simulation of the scenario in fig. 8.

Figure 6: Time series of the VMs for the scenario without a controller agent. Between time steps 2000 and 2500, the voltage constraints (the VM must lie in [0.8, 1.2]) are violated and the bus is switched off, so that the VM drops to zero.

Figure 7: Time series of the VM observations of the controller agent.
To obtain the applied setpoints $y_A$, the function
$$y_A = \min\!\big(\max\!\big(a_{s,A} \tanh(o_A) + a_{b,A},\; \{p_{\min}, q_{\min}\}\big),\; \{p_{\max}, q_{\max}\}\big) \qquad (12)$$
with $A = \{\mathrm{ap}, \mathrm{rp}\}$ is first applied, with $o_A$ being the output of the DNN for the actuators, $a_{s,A}$ the action scaling factor and $a_{b,A}$ the bias, as well as the respective maximal and minimal AP setpoint ($p_{\{\min,\max\}}$) and RP setpoint ($q_{\{\min,\max\}}$) for the clipping. The $y_{p,q}$ values are then balanced to the actual setpoints that the inverter can deliver at the current time via the P-Q characteristic (Turitsyn et al., 2011).
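A minimal sketch of the transformation in eq. (12); the numeric values below are hypothetical placeholders, not the parameters of the trained agent:

```python
import numpy as np

def to_setpoint(o, a_s, a_b, lo, hi):
    """Map a raw DNN output o to a clipped setpoint (cf. eq. (12))."""
    return float(np.clip(a_s * np.tanh(o) + a_b, lo, hi))

# Hypothetical AP actuator parameters, for illustration only:
y_ap = to_setpoint(o=0.4, a_s=1.0, a_b=1.0, lo=0.0, hi=2.0)
```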
10.2 Exact Controller Analysis
The DNN of the controller agent was transformed into
an exact DT, which in turn was used to create the
NHG as described in section 3. It consists of only two
nodes, with almost all simulated data points falling
into the observation region of one node. Thus, almost
all observed VMs are evaluated by a set of two linear
expressions of the form
$$o_A = \sum_i w_{A,i}\, vm_{A,i} + c_A , \qquad (13)$$
where $w_{A,i}$ is the weight for the VM of the $i$-th bus, $vm_{A,i}$, and $c_A$ is the bias value, for the two actuators.
Figure 8: Time series of the setpoints of the controller agent for the PV system on bus 4; the AP and RP setpoints overlap all the time.
Table 1: Weights for the observed VMs for the region in which almost all test simulation data fall.

    AP actuator                 RP actuator
    Bus b_ap    Weight w_ap     Bus b_rp    Weight w_rp
    9           0.013           5           0.019
    4           0.009           2           0.018
    10          0.003           11          0.018
    3           0.002           3           0.008
    6           0.001           8           0.007
    11          0.004           7           0.001
    7           0.004           9           0.006
    5           0.005           4           0.010
    8           0.006           6           0.010
    2           0.006           10          0.013
The weights of two different buses $b_i, b_j$ for the AP and RP actuators listed in table 1 are only approximately proportional to each other ($w_{A,i} \propto w_{A,j}$), as the non-linear $y$ function and the inverter balancing, as described in section 10.1, are still applied to them. The bus with the highest weighted VM for the AP actuator is bus 9, and for the RP actuator it is bus 5. Considering only the RP actuator, the learned strategy is analogous to a simple RP controller, which also controls only one RP setpoint for a PV inverter on the same bus from which it observes the VM (Ju and Lin, 2018). Although the VM profiles of all busses in fig. 7 are very similar except for their scaling, the agent has learned properties of the topology of the network: the controller has learned to identify its own bus (5) in terms of the generally known simple RP controller strategy. Also noteworthy is the second highest weighted VM, that of bus 2, which has a fundamentally stronger influence on all VMs of the feeder's busses, as it is closer to the common coupling point (the 110/20 kV transformer at bus 1 in fig. 5).
11 DISCUSSION
Although the inner boxes are volume-maximized and based on the exact polytope representation of the output regions, and can thus be used for an exact overview, they are significantly smaller than the polytopes in terms of their volume. The representation for an overview of a polytope could be extended by the inner boxes of the sub-polytopes in $P \setminus B_{\mathrm{in}}$ of a polytope $P$ and its inner box $B_{\mathrm{in}}$. On the other hand, this would increase the complexity of the overview, because there would then be multiple inner boxes for the same polytope and thus the same output region. So it has to be further evaluated in more practical, real-world scenarios whether, and in which form, this inner box representation can be useful for an overview.
The NHG has been successfully used to visualize the relation of the hyperplanes and other measures to each other. The volume of the polytopes and the cosine distance, as well as the mapping of the other properties of the hyperplanes, can be used to investigate the higher-dimensional position of the linear functions of neighboring polytopes relative to each other. It has been shown that restricting the input space to the box spanning the action points leads to interpretable graphs in a simple real-world experiment. However, the methods need to be evaluated for larger observation spaces and more input and output dimensions.
12 CONCLUSION AND FUTURE
WORK
This paper presents a methodological approach to an-
alyze the exact output regions of policy FF-DNNs for
high-dimensional input-output spaces. Such an exact
analysis is necessary for the understanding of agents’
strategies, especially with regards to CNI operation.
The paper builds upon the exact transformation of FF-DNNs into a pruned DT. The paths from the root to the leaf nodes of such a DT make up output regions that can be represented by polytopes. For an overview of a polytope, its inner box is computed. Furthermore, the output regions are depicted by vertices in the introduced NHG. They are connected by edges if they are direct neighbors of each other. Further properties, like the polytope constraints, the inner box bounds and volume, as well as the Chebyshev ball center and radius, the bounding box, and the cosine distance of the normal vectors of the hyperplanes, are also mapped to the vertices and thus to the hyperplanes in the output regions. Thereby, the NHG especially allows us to visualize the relations between higher-dimensional output regions.
In future work, these analysis tools could be eval-
uated in real scenarios with even higher dimensions.
They could also be used to analyze the learned control
strategies of ARL agents in the power grid for larger
input domains. This would make it possible to get a
better overview of the entire possible behavior of the
agent.
ACKNOWLEDGEMENTS
This work was funded by the German Federal Min-
istry for Education and Research (BMBF) under
Grant No. 01IS22071.
REFERENCES
Arrieta, A. B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R., et al. (2020). Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58:82–115.
Arulkumaran, K., Deisenroth, M. P., Brundage, M., and
Bharath, A. A. (2017). Deep reinforcement learning:
A brief survey. IEEE Signal Processing Magazine,
34(6):26–38.
Aytekin, C¸ . (2022). Neural networks are decision trees.
CoRR, abs/2210.05189:1–8. [retrieved: 05, 2023].
Bastani, O., Pu, Y., and Solar-Lezama, A. (2018). Ver-
ifiable reinforcement learning via policy extraction.
Advances in neural information processing systems,
31:2499–2509.
Bemporad, A., Filippi, C., and Torrisi, F. D. (2004). Inner
and outer approximations of polytopes using boxes.
Computational Geometry, 27(2):151–178.
Coppens, Y., Efthymiadis, K., Lenaerts, T., and Nowé, A. (2019). Distilling deep reinforcement learning policies in soft decision trees. In International Joint Conference on Artificial Intelligence.
Du, M., Liu, N., and Hu, X. (2019). Techniques for in-
terpretable machine learning. Communications of the
ACM, 63(1):68–77.
Fraunhofer IEE and University of Kassel (2023). Pan-
dapower 2.0 cigre benchmark power grid implemen-
tation. [retrieved: 06, 2023].
Frosst, N. and Hinton, G. (2017). Distilling a neural
network into a soft decision tree. arXiv preprint
arXiv:1711.09784.
Fujimoto, S., Hoof, H., and Meger, D. (2018). Addressing function approximation error in actor-critic methods. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, pages 1587–1596. PMLR.
Fukuda, K. (2015). Lecture: Polyhedral computation,
spring 2015. [retrieved: 06.2024].
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. CoRR,
abs/1801.01290. [retrieved: 05, 2023].
Irsoy, O., Yildiz, O. T., and Alpaydin, E. (2012). Soft de-
cision trees. In International Conference on Pattern
Recognition.
Jaunet, T., Vuillemot, R., and Wolf, C. (2020). Drlviz: Un-
derstanding decisions and memory in deep reinforce-
ment learning. Computer Graphics Forum, 39(3):49–
61.
Ju, P. and Lin, X. (2018). Adversarial attacks to distributed
voltage control in power distribution networks with
DERs. In Proceedings of the Ninth International Con-
ference on Future Energy Systems, pages 291–302.
ACM.
Logemann, T. (2023). Explainability of power grid at-
tack strategies learned by deep reinforcement learning
agents.
Logemann, T. (2024). Power grid experiment. https://gitlab.com/arl-experiments/explains-simple-voltage-controller. [retrieved: 09, 2024].
Logemann, T. and Veith, E. M. (2023). Nn2eqcdt: Equiv-
alent transformation of feed-forward neural networks
as drl policies into compressed decision trees. vol-
ume 15, page 94–100. IARIA, ThinkMind. [retrieved:
07, 2023].
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
(2013). Playing atari with deep reinforcement learn-
ing. CoRR, abs/1312.5602:1–9. [retrieved: 05, 2023].
Puiutta, E. and Veith, E. M. S. P. (2020). Explainable rein-
forcement learning: A survey. In Machine Learning
and Knowledge Extraction. CD-MAKE 2020, volume
12279, pages 77–95, Dublin, Ireland. Springer, Cham.
Qing, Y., Liu, S., Song, J., and Song, M. (2022). A sur-
vey on explainable reinforcement learning: Concepts,
algorithms, challenges. CoRR, abs/2211.06665:1–25.
[retrieved: 05, 2023].
Schrittwieser, J., Antonoglou, I., Hubert, T., Simonyan, K.,
Sifre, L., Schmitt, S., Guez, A., Lockhart, E., Hass-
abis, D., Graepel, T., et al. (2020). Mastering atari,
go, chess and shogi by planning with a learned model.
Nature, 588(7839):604–609.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. CoRR, abs/1707.06347. [retrieved: 05,
2023].
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai,
M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D.,
Graepel, T., et al. (2018). A general reinforcement
learning algorithm that masters chess, shogi, and go
through self-play. Science, 362(6419):1140–1144.
Task Force C6.04.02 (2014). Benchmark systems for net-
work integration of renewable and distributed energy
resources. Elektra - CIGRE’s digital magazine, 575.
TuLiP control (2024). Polytope implementation. [retrieved: 06, 2024].
Turitsyn, K., Sulc, P., Backhaus, S., and Chertkov, M.
(2011). Options for control of reactive power by dis-
tributed photovoltaic generators. Proceedings of the
IEEE, 99(6):1063–1073.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep
reinforcement learning with double q-learning. In
Proceedings of the Thirtieth AAAI Conference on Ar-
tificial Intelligence, February 12-17, 2016, Phoenix,
Arizona, USA, volume 30, pages 2094–2100. AAAI
Press.
Vanderbei, R. J. et al. (2014). Linear programming.
Springer.
Veith, E. M. (2023). An architecture for reliable learn-
ing agents in power grids. ENERGY 2023 : The
Thirteenth International Conference on Smart Grids,
Green Communications and IT Energy-aware Tech-
nologies, pages 13–16. [retrieved: 05, 2023].
Veith, E. M. and Logemann, T. (2023). Towards explainable
attacker-defender autocurricula in critical infrastruc-
tures. CYBER 2023 : The Eighth International Con-
ference on Cyber-Technologies and Cyber-Systems //,
pages 27–31. [retrieved: 06, 2024].
Veith, E. M., Logemann, T., Wellßow, A., and Balduin, S.
(2024). Play with me: Towards explaining the benefits
of autocurriculum training of learning agents. In 2024
IEEE PES Innovative Smart Grid Technologies Eu-
rope (ISGT EUROPE), pages 1–5, Dubrovnik, Croa-
tia. IEEE.
Veith, E. M. S. P., Wellßow, A., and Uslar, M. (2023).
Learning new attack vectors from misuse cases with
deep reinforcement learning. Frontiers in Energy Re-
search, 11:01–23. [retrieved: 05, 2023].
APPENDIX
Convexity of Intersection of Half-Spaces
Theorem 1. Let $H_{1 \le i \le m} \subseteq \mathbb{R}^n$ be half-spaces defined by linear inequalities, such that $H = \{x \in \mathbb{R}^n \mid a^{\top} x \le b\}$, $a \in \mathbb{R}^n$, $b \in \mathbb{R}$. Then their intersection set $A = \bigcap_{i=1}^{m} H_i$ is convex.

Proof. First it is proved that half-spaces are convex sets, then that the intersection of any collection of convex sets is also convex.

(1.) Show that half-spaces are convex:
$$\forall x_1, x_2 \in H \subseteq \mathbb{R}^n,\ \lambda \in [0, 1] : \quad \lambda x_1 + (1 - \lambda) x_2 \in H .$$
Let $x_1, x_2 \in H$ and $\lambda \in [0, 1]$. Then $a^{\top} x_1 - b \le 0$ and $a^{\top} x_2 - b \le 0$, and since $\lambda, (1 - \lambda) \in [0, 1]$,
$$\underbrace{\lambda (a^{\top} x_1 - b)}_{\le 0} + \underbrace{(1 - \lambda)(a^{\top} x_2 - b)}_{\le 0} \le 0$$
$$\Leftrightarrow \quad \lambda a^{\top} x_1 - \lambda b + a^{\top} x_2 - b - \lambda a^{\top} x_2 + \lambda b \le 0$$
$$\Leftrightarrow \quad a^{\top} \big(\lambda x_1 + (1 - \lambda) x_2\big) - b \le 0 ,$$
from which follows $\lambda x_1 + (1 - \lambda) x_2 \in H$ for all $x_1, x_2 \in H$ and $\lambda \in [0, 1]$.

(2.) Show that the intersection set of convex subsets $C$ of a real or complex vector space $V$ is convex:
$$\mathcal{C} = \bigcap_{\substack{C \subseteq V,\\ C \text{ convex}}} C \quad \text{is convex} .$$
Let $x_1, x_2 \in \mathcal{C}$ be two elements from the intersection set; then they are in all intersected sets, $\forall C : x_1, x_2 \in C$, and because of the convexity it holds that
$$\forall \lambda \in [0, 1] : \quad \lambda x_1 + (1 - \lambda) x_2 \in C .$$
Because this holds for every $C$, and thus $\lambda x_1 + (1 - \lambda) x_2$ lies in every $C$ for all $\lambda \in [0, 1]$, these elements are also in the intersection set $\mathcal{C}$. Hence it is also convex.

(3.) From (1.) and (2.), it directly follows that the intersection set of half-spaces is convex.