value function and the structure of the state space is
the most powerful way to understand the control
rules of the agent. This is because we can visually
follow the progress of exploration and the evaluation
of states. In this paper, we aim to understand the
control rules of the agent by focusing on the landscape
of the value function and the structure of the state
space. When the state space is low dimensional,
visualization techniques can represent the landscape
of the value function and the structure of the state
space; however, when the state space is high
dimensional, visualization becomes difficult.
Therefore, the purpose of this study is to visualize
the landscape of the value function and the structure
of the state space when the state space is high dimen-
sional. Additionally, we show that this visualization is
useful for understanding the control rules of the agent
in reinforcement learning.
2 BACKGROUND
Reinforcement learning involves learning action selection
in a given state so as to maximize the reward (Sutton
and Barto, 1998). Many algorithms for reinforcement
learning are based on estimating value functions.
These functions are classified into two types:
state value functions and action value functions. A
state value function is represented by V(s) and evaluates
how valuable a given state is for the agent.
An action value function is represented by Q(s, a) and
evaluates how much value the agent gains by performing a
given action in a given state.
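Following the standard formulation (Sutton and Barto, 1998), these two functions can be written, for a policy π and discount rate γ with 0 ≤ γ ≤ 1, as

V^π(s) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ],
Q^π(s, a) = E_π[ Σ_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s, a_t = a ],

i.e., the expected discounted sum of future rewards when starting from state s (and, for Q, taking action a) and following π thereafter.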
2.1 Q-learning
Q-learning (Watkins and Dayan, 1992) is a value-updating
algorithm in reinforcement learning that updates the
action value function Q(s_t, a_t) using Equation (1).

Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) ]   (1)

Here, α is a parameter called the step-size, where 0 ≤
α ≤ 1. This parameter controls the rate at which the
action value function Q(s_t, a_t) is updated.
The ε-greedy method, where 0 ≤ ε ≤ 1, is often used
for action selection in Q-learning. With probability ε
the agent takes a random action, and with probability
1 − ε it takes the action with the largest action value.
In other words, the larger the value of ε, the higher
the exploration rate, and the smaller the value of ε,
the higher the exploitation rate.
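A corresponding sketch of ε-greedy selection over the same tabular Q (the function name select_action and the use of NumPy's random generator are assumptions of this sketch):

import numpy as np

def select_action(Q, state, epsilon, rng=None):
    # Explore with probability ε; otherwise act greedily with respect to Q[state, :].
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # random action (exploration)
    return int(np.argmax(Q[state]))            # greedy action (exploitation)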
Figure 1: A heat map of learning how to climb a virtual
volcano from S (the start) to G (the goal) at the summit.
2.2 Heat Map
A heat map can be used to visualize the landscape of
the value function and the structure of the state space;
an example of a heat map is shown in Figure 1. Here,
a two-dimensional state space is defined over the domain
and the state value function is represented by color
shading. The heat map in Figure 1 represents learning
how to climb a virtual volcano from S (the start) to
G (the goal) at the summit. In general, an agent follows
a policy of transitioning to a state whose state value is
higher than that of its current state. In other words, when
visualized with this heat map, the policy is to transition
toward warmer-colored states, and the agent is likely to
repeatedly move to the warmest reachable state. By analyzing
the heat map, the designer can roughly understand how the
agent repeats state transitions. Therefore, the heat map
visualizes the structure of the state space and the landscape
of the value function and is useful for understanding the
agent's control rules.
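For a state value function defined over a two-dimensional grid, such a heat map can be drawn with a few lines of matplotlib; the following is a minimal sketch in which the array V and its 20 × 20 grid are placeholders rather than the values behind Figure 1.

import numpy as np
import matplotlib.pyplot as plt

# V is assumed to be a 2-D array of state values over an (x, y) grid;
# random values are used here only as a placeholder.
V = np.random.rand(20, 20)

plt.imshow(V, cmap="hot", origin="lower")
plt.colorbar(label="state value V(s)")
plt.xlabel("x")
plt.ylabel("y")
plt.title("State value function as a heat map")
plt.show()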
However, it is rare for a heat map to be able to
visualize the structure of the state space and the
landscape of the value function. This is because a heat
map can represent only a two-dimensional state space and
is difficult to use when the state space has three or more
dimensions. Moreover, in the example of the virtual volcano
in Figure 1, it is assumed that transitions are possible
between adjacent states; in actuality, however, there are
cases where a transition to an adjacent state cannot be
performed and cases where a transition to a non-adjacent
state can be performed. A system that can visualize the
structure of the state space and the landscape of the value
function, even when the state space is high dimensional or
its structure is complex, would therefore be useful. This is
the purpose of this study.