
In the conventional framework of reinforcement learning, a one-dimensional scalar value represents the evaluation reward. It is the only reinforcement signal available for learning and developing an optimal policy. When both positive and negative rewards are present, however, using only a scalar-valued reward may result in a tradeoff between exploration and exploitation. Uchibe et al. (Uchibe 1999) proposed a method that makes the reward function multidimensional to enable simultaneous learning of several functions, and they verified that coordinated actions can be realized in a multi-agent environment. Uchibe's method seems effective when there are positive and negative rewards because it makes the reward function multidimensional and handles a reward as vector data. However, the multidimensional conversion increases the number of parameters in the reward function and the attenuation matrix, making it difficult to determine their optimum values.
Because no clear principle has been defined for reflecting a multidimensional evaluation on a one-dimensional action, it is difficult to convert the results and transfer them to another system.
Knowledge obtained from operant-conditioning experiments on rats and monkeys (Miller 1959)(Ison 1967) and from humans with brain damage (Milner 1963) indicates that distinguishing between the evaluations of successes and failures has a substantial effect on action learning (Yamakawa 1992)(Okada 1997)(Okada 1998). With this in mind, the authors propose reinforcement learning based on a two-dimensional evaluation, that is, an evaluation function with separate reward and punishment dimensions. An evaluation received immediately after an action is called a reward evaluation if its purpose is to obtain a favorable result through repeated learning of the action, and a punishment evaluation if its purpose is to suppress the action.
Reinforcement learning using the two dimensions of reward and punishment separates the conventional one-dimensional reinforcement signal into a reward signal and a punishment signal. The proposed method uses the difference between the reward evaluation and the punishment evaluation (utility) as the factor that determines the action, and their sum (interest) as a parameter that determines the ratio of exploration to exploitation. Utility and interest thus give a rough definition of the principle for reflecting a multidimensional evaluation on a one-dimensional action.
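As a concrete illustration only (a minimal sketch under our own naming, not the authors' notation), the two indices could be computed from separately learned reward and punishment forecasts as follows:

```python
def utility(q_reward: float, q_punish: float) -> float:
    """Difference of the two forecasts: the factor that determines the action."""
    return q_reward - q_punish

def interest(q_reward: float, q_punish: float) -> float:
    """Sum of the two forecasts: sets the ratio of exploration to exploitation."""
    return q_reward + q_punish
```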
Chapter 2 describes the formulation of the proposed reinforcement learning method based on the two dimensions of reward and punishment. Chapter 3 demonstrates the usefulness of the proposed system by describing the learning process of an autonomous mobile robot. Finally, Chapter 4 summarizes the study.
2 REINFORCEMENT LEARNING BASED ON TWO DIMENSIONS OF REWARD AND PUNISHMENT
2.1 Basic Idea
Two-dimensional reinforcement learning basically consists of two aspects. One is to distinguish between reward evaluation and punishment evaluation forecasts. The other is to determine an action according to the combined index of positive and negative reward forecasts.
2.1.1 Search by interest and resource allocation
The conventional reinforcement learning method uses only the difference (utility) between the reward and punishment reinforcement signals of an evaluation to determine an action. In contrast, the proposed method also computes the sum (interest) of the reward and punishment evaluation signals and treats it as a kind of criticality. Criticality can be regarded as the curiosity or motivation of living things, and it is used to determine which processing should be attended to. In other words, not only in reinforcement learning but in any other kind of trial-and-error learning, it can be used to determine the ratio of exploration (search) to exploitation (action).
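One plausible realization, sketched here purely as an assumption (the paper does not give this formula), is Boltzmann action selection whose temperature shrinks as interest grows, so that states judged critical are exploited more and uninteresting states are explored more:

```python
import numpy as np

def select_action(utilities: np.ndarray, interest_value: float,
                  t_max: float = 1.0, t_min: float = 0.1) -> int:
    """Boltzmann selection over per-action utilities; the temperature is a
    decreasing function of interest (this mapping is our own assumption)."""
    temperature = max(t_min, t_max / (1.0 + max(interest_value, 0.0)))
    prefs = utilities / temperature
    probs = np.exp(prefs - prefs.max())
    probs /= probs.sum()
    return int(np.random.choice(len(utilities), p=probs))
```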
2.1.2 Distinction of the time discount ratio of forecast reward
In reinforcement learning, a forecast reward is discounted more the further in the future it is expected to be received. The discount factor is called the time discount ratio (γ) of the forecast reward, and its value ranges from 0 to 1.0. If the value is 0, only the current reinforcement signal is taken into account and future reinforcement is disregarded. If the value is 1.0, the evaluation of an action takes even the distant future into account.
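For reference, γ enters through the standard discounted return of reinforcement learning, V(s) = E[r1 + γ·r2 + γ²·r3 + …], so a reinforcement signal expected n steps ahead is weighted by γⁿ.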
In many practical problems, a reward reinforcement signal is associated with progress toward a goal, and the forecast reward signal is used to learn the series of actions that reaches the goal. To take into account the effect of a goal that is far away, γ must therefore be set large.
Meanwhile, if a punishment reinforcement signal for avoiding a risk exerts its effect too far away from the risk, an avoidance action may be generated in many input states. This in turn narrows the search range of the operating subject and lowers its performance. Therefore, for a punishment reinforcement signal to initiate an avoidance action only in the vicinity of the risk, its γ must be set small.
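As a rough sketch of this idea (the tabular TD(0) form and the particular γ values are illustrative assumptions, not the authors' settings), the reward and punishment forecasts can be updated separately with different time discount ratios:

```python
GAMMA_REWARD = 0.9   # large: the goal's influence reaches far back along the path
GAMMA_PUNISH = 0.3   # small: the avoidance effect stays local to the risk
ALPHA = 0.1          # learning rate

def td_update(v_reward: dict, v_punish: dict, s, s_next,
              reward_signal: float, punish_signal: float) -> None:
    """One tabular TD(0) step applied to each evaluation separately.
    v_reward and v_punish map states to value estimates (assumed initialized)."""
    v_reward[s] += ALPHA * (reward_signal + GAMMA_REWARD * v_reward[s_next] - v_reward[s])
    v_punish[s] += ALPHA * (punish_signal + GAMMA_PUNISH * v_punish[s_next] - v_punish[s])
```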