Figure 2: Average number of steps over the first 60,000 trials in the target task for R-MAX and TR-MAX using each source task's knowledge. (Horizontal axis: number of trials; vertical axis: average number of steps until the agent reaches the goal; curves: R-MAX (no knowledge), TR-MAX (Source task 1), TR-MAX (Source task 2), and TR-MAX (Source task 3).)
10 times for the target task. In each execution, the agent experiences 240,000 trials. We then calculated the average number of steps the agent takes to reach the goal in each trial. Figure 2 shows the results for the first 60,000 trials, since almost no change occurred afterwards. TR-MAX converged after about 10,000 to 20,000 trials, whereas R-MAX did not converge within 60,000 trials. Moreover, we observe that the similarity of the source task to the target task affects the convergence speed. For instance, among the three source tasks, Source task 1 in Figure 1(b) is the most similar to the target task in Figure 1(a), and "TR-MAX (Source task 1)" in Figure 2 converged fastest. On the other hand, "TR-MAX (Source task 3)" converged very slowly, because Source task 3 in Figure 1(d) is not similar to the target. These results verify that TR-MAX can effectively utilize the prior knowledge.
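For readers who want to reproduce this kind of learning-curve comparison, the following is a minimal sketch of averaging per-trial step counts over repeated executions and plotting the first 60,000 trials. The array shapes, file names, and plotting details are our own illustrative assumptions, not the authors' experimental code.

    # Sketch (not the authors' code): average steps-per-trial curves
    # over repeated executions and plot the initial portion.
    import numpy as np
    import matplotlib.pyplot as plt

    NUM_RUNS = 10          # executions per algorithm, as in the experiment
    NUM_TRIALS = 240000    # trials per execution
    PLOT_TRIALS = 60000    # only the first 60,000 trials are plotted

    def average_curve(steps_per_run):
        # steps_per_run: array of shape (NUM_RUNS, NUM_TRIALS),
        # each entry the number of steps until the goal in that trial.
        return np.asarray(steps_per_run).mean(axis=0)

    def plot_curves(curves, labels):
        for curve, label in zip(curves, labels):
            plt.plot(curve[:PLOT_TRIALS], label=label)
        plt.xlabel("Number of trials")
        plt.ylabel("Average steps until the agent reaches the goal")
        plt.legend()
        plt.show()

    # Hypothetical usage with logged step counts:
    # rmax = average_curve(np.load("rmax_steps.npy"))
    # trmax1 = average_curve(np.load("trmax_source1_steps.npy"))
    # plot_curves([rmax, trmax1],
    #             ["R-MAX (no knowledge)", "TR-MAX (Source task 1)"])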
Table 1: Sample complexities of R-MAX and TR-MAX using each source task's knowledge. |Z_min| is the minimum size of the zero-transitioned state set, and |P_ε| is the size of the useful state-action pairs for the target task.

                                 R-MAX       TR-MAX
                                             Source task 1   Source task 2   Source task 3
    Sample complexity (×10^7)    4970        3.88            4.99            5.34
    Ratio to R-MAX               100%        0.078%          0.100%          0.107%
    m                            46742738    36550           46956           50243
    |Z_min|                      –           96              95              95
    |P_ε|                        –           376             304             196
Next, we compare the sample complexities of R-MAX and TR-MAX in Table 1. As expected, the sample complexities of TR-MAX are far smaller than that of R-MAX, and they reflect the task similarities. We also note that the size |P_ε| of the useful state-action pair set P_ε depends significantly on the similarity, whereas the size |Z_min| of the minimum zero-transitioned set does not.
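As a quick sanity check of the "Ratio to R-MAX" row, the snippet below recomputes it from the sample-complexity row of Table 1; the variable names are ours, and the values are simply those reported above (all in units of 10^7 samples).

    # Recompute the "Ratio to R-MAX" row of Table 1 from the reported
    # sample complexities (values in units of 10^7 samples).
    sample_complexity = {
        "R-MAX": 4970,
        "TR-MAX (Source task 1)": 3.88,
        "TR-MAX (Source task 2)": 4.99,
        "TR-MAX (Source task 3)": 5.34,
    }

    baseline = sample_complexity["R-MAX"]
    for name, value in sample_complexity.items():
        print(f"{name}: {100 * value / baseline:.3f}%")
    # Prints roughly 100%, 0.078%, 0.100%, 0.107%, matching Table 1.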
5 CONCLUDING REMARKS
We proposed the TR-MAX algorithm, which improves the R-MAX algorithm by exploiting prior knowledge about a target task obtained from a source task. We proved that the sample complexity of TR-MAX is indeed smaller than that of R-MAX, and that TR-MAX is PAC-MDP. In computational experiments, we verified that TR-MAX can learn much more efficiently than R-MAX.
In future work, we are interested in capturing "knowledge transfer in reinforcement learning" in other ways. In this position paper, we captured it as a zero-transitioned state set and a useful state-action pair set defined by the heuristic transition and reward functions. In reality, however, these functions cannot be obtained without knowing the true transition and reward probabilities, which are never available in a real environment; addressing this limitation is left for future work.
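Purely to illustrate the two kinds of transferred knowledge named above, the following is a minimal sketch of how such sets could be formed if, hypothetically, the true model were available for comparison. The concrete criteria used here (zero probability everywhere for Z, an ε-closeness test in L1 distance for P_ε) are our simplified assumptions, not the paper's formal definitions.

    # Illustrative only: simplified constructions of the two knowledge
    # sets, assuming access to both heuristic and true models.  The
    # paper's formal definitions may differ; this is not the authors' code.
    from typing import Set, Tuple

    def zero_transitioned_states(states, actions, heuristic_p) -> Set:
        # States that the heuristic transition function says are never
        # reached from any state-action pair (assumed criterion).
        states, actions = list(states), list(actions)
        return {s2 for s2 in states
                if all(heuristic_p(s, a, s2) == 0.0
                       for s in states for a in actions)}

    def useful_pairs(states, actions, heuristic_p, true_p,
                     heuristic_r, true_r, eps: float) -> Set[Tuple]:
        # Pairs whose heuristic transition and reward estimates are
        # within eps of the target task's true values (assumed criterion).
        states, actions = list(states), list(actions)
        result = set()
        for s in states:
            for a in actions:
                l1 = sum(abs(heuristic_p(s, a, s2) - true_p(s, a, s2))
                         for s2 in states)
                if l1 <= eps and abs(heuristic_r(s, a) - true_r(s, a)) <= eps:
                    result.add((s, a))
        return result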