Table 3: Comparison of the efficiency of 10 trials [×10³].

Grid size   Ng          Abbeel        Proposed method
5 × 5       173 (0)     189 (12.6)    92.0 (0.132)
6 × 6       1470 (0)    257 (86.7)    120 (0.172)
7 × 7       1410 (0)    1900 (581)    154 (0.166)
8 × 8       295 (0)     3500 (2490)   193 (0.178)

Values are the mean (standard deviation) over the ten estimated reward functions.
6.2 Evaluation of Learning Efficiency
As Table 2 shows, the learning efficiency of the proposed
method was high in all GridWorld settings. The reason
is that the methods of Ng and Abbeel maximize only
reproducibility, whereas the proposed method explicitly
considers learning efficiency.
On the other hand, Abbeel's method achieved slightly
higher efficiency than the proposed method in the
6×6 and 8×8 grids. The reason is that, as described
in Section 5.1, Ng's method estimates the same reward
function every time the trial is repeated, whereas
Abbeel's method and the proposed method involve
inherent randomness, so the estimated reward function
differs from trial to trial. In other words, although
Abbeel's method does not explicitly consider learning
efficiency, a reward function with high learning
efficiency can be obtained by chance over multiple trials.
To clarify this point, the learning efficiency was
calculated for each of the ten reward functions estimated
in the experiment, and the resulting means and standard
deviations are shown in Table 3.
In Table 3, the standard deviation for Ng's method is 0
because it provides the same reward function in all 10
trials. For Abbeel's method, by contrast, the standard
deviation is very large compared with the other methods,
so the learning efficiency varies widely between trials.
Since the proposed method estimates the reward function
using a GA, there is no guarantee that it converges to
the optimal solution. Nevertheless, its standard deviation
is small, showing that the proposed method provides stable
estimates of a reward function with high learning
efficiency in every trial.
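As a concrete illustration of how the entries in Table 3 can be produced, the following sketch aggregates a learning-efficiency measure over the ten reward functions estimated by one method. The names `reward_functions` and `evaluate_efficiency` are hypothetical placeholders for the reward functions estimated in the experiment and for the paper's actual efficiency measure.

```python
import statistics

def summarize_efficiency(reward_functions, evaluate_efficiency):
    """Mean and standard deviation of a learning-efficiency measure
    over the reward functions estimated in repeated trials.

    reward_functions    : hypothetical list of estimated reward functions
    evaluate_efficiency : hypothetical callable returning a scalar
                          learning-efficiency value for one reward function
    """
    scores = [evaluate_efficiency(r) for r in reward_functions]
    # Population standard deviation: identical estimates (as produced by
    # Ng's method) give exactly 0, matching the zero deviations in Table 3.
    return statistics.mean(scores), statistics.pstdev(scores)
```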
6.3 Limits of the Proposed Method
In GridWorld environments larger than 8×8, the proposed
method could not obtain a solution that satisfies all of
the constraints in Eq. (9). A likely explanation is that
the feasible region narrows as the number of constraints
and states increases, making it difficult to search with
a simple penalty method. In particular, a drawback of the
GA is that it cannot handle constraints explicitly.
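To make the role of the simple penalty method concrete, the sketch below shows one common way to fold the constraints of Eq. (9) into a GA fitness value: each violated constraint subtracts a weighted penalty from the learning-efficiency objective. The names `learning_efficiency`, `constraint_violations`, and `penalty_weight` are illustrative assumptions, not the paper's exact formulation.

```python
def penalized_fitness(reward, learning_efficiency, constraint_violations,
                      penalty_weight=1e3):
    """Penalty-method fitness of one GA individual (a candidate reward function).

    learning_efficiency   : callable giving the objective to maximize
    constraint_violations : callable returning non-negative violation amounts
                            for the constraints of Eq. (9); 0 means satisfied
    penalty_weight        : illustrative weight that must be tuned in practice
    """
    total_violation = sum(constraint_violations(reward))
    # The penalty only discourages infeasible individuals; it does not
    # exclude them, so a narrow feasible region may never be reached.
    return learning_efficiency(reward) - penalty_weight * total_violation
```

Because infeasible individuals are only penalized rather than rejected, the GA can spend the entire search outside the feasible region when that region is small, which is consistent with the behaviour observed above 8×8.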
7 CONCLUSION
In this paper, we proposed an inverse reinforcement
learning method for finding the reward function that
maximizes learning efficiency. In the proposed method,
an objective function for learning efficiency is
introduced within the inverse reinforcement learning
framework of Ng et al. Since this objective function is
nonlinear and non-convex, we find the solution using a GA.
A limitation of the proposed method is that the GA does
not always converge to an optimal solution when the number
of states is large, so future work will consider the
following: 1) relaxation of the constraints, 2) reformulation
of the method as a linear or quadratic programming problem,
and 3) application of a GA better suited to finding solutions
that satisfy the constraints (one candidate comparison rule
is sketched below).
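As one possible direction for item 3, the sketch below shows Deb's feasibility rule, a standard constraint-handling comparison for GAs that always prefers feasible individuals and ranks infeasible ones by total violation. It is offered only as an assumed alternative to the simple penalty method, not as part of the proposed method.

```python
def deb_prefers(a, b):
    """Deb's feasibility rule: return True if individual a is preferred over b.

    Each individual is a dict with
      'fitness'   : learning-efficiency objective (higher is better)
      'violation' : total constraint violation (0 means feasible)
    """
    a_feasible = a['violation'] == 0
    b_feasible = b['violation'] == 0
    if a_feasible and b_feasible:
        return a['fitness'] > b['fitness']    # both feasible: higher objective
    if a_feasible != b_feasible:
        return a_feasible                     # feasible beats infeasible
    return a['violation'] < b['violation']    # both infeasible: less violation
```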
REFERENCES
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invari-
ance under reward transformations: Theory and appli-
cation to reward shaping. In International Conference
on Machine Learning.
Ziebart, B. D., Maas, A., Bagnell, J. A., and Dey, A. K. (2008).
Maximum entropy inverse reinforcement learning. In
AAAI Conference on Artificial Intelligence.
Babes-Vroman, M., Marivate, V., Subramanian, K., and Littman, M.
(2011). Apprenticeship learning about multiple inten-
tions. In International Conference on Machine Learn-
ing.
Neu, G. and Szepesvári, C. (2007). Apprenticeship learn-
ing using inverse reinforcement learning and gradient
methods. In Conference on Uncertainty in Artificial
Intelligence.
Ng, A. Y. and Russell, S. (2000). Algorithms for inverse
reinforcement learning. In International Conference
on Machine Learning.
Abbeel, P. and Ng, A. Y. (2004). Apprenticeship learning
via inverse reinforcement learning. In International
Conference on Machine Learning.
Sutton, R. S. and Barto, A. G. (2000). Reinforcement Learn-
ing: An Introduction. MIT Press.
Syed, U., Bowling, M., and Schapire, R. E. (2008). Apprenticeship
learning using linear programming. In International
Conference on Machine Learning.