Table 4: Mean and standard error (SE) of the amount of points gathered over the last 100 games. The results are averaged over 20 simulations.

Method                 Mean points    SE
Max-Boltzmann                   96   1.3
Diminishing ε-Greedy            66   1.1
Interval-Q                      55   1.0
Error-Driven-ε                  32   1.5
ε-Greedy                         3   1.2
Random-Walk                   -336   0.2
Greedy                        -346   1.1
The number of gathered points is important, because the learning algorithms try to maximize the discounted sum of rewards, which relates to the number of obtained points. A high win rate does not always correspond to a high number of points, which becomes clear when comparing Greedy to Random-Walk: Greedy has a much higher win rate than Random-Walk, but gathers fewer points.
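For reference, using standard notation (γ for the discount factor and r_t for the reward at time step t, neither of which is restated in this section), the quantity each learning agent maximizes is the expected discounted return:

    G_t = \sum_{k=0}^{\infty} \gamma^k \, r_{t+k+1}

Because the reward signal relates to the points gained and lost during a game, a high expected return corresponds to a high number of gathered points, but not necessarily to a high win rate.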
In the first 60 generations the temperature of Max-Boltzmann is relatively high, which produces approximately the same behaviour as ε-Greedy. During the last 10 generations the exploration becomes more guided, resulting in a significantly increasing average number of points. Error-Driven-ε exploration outperforms all other methods in the interval between generations 10 and 70. However, this method produces unstable behaviour, which is most likely caused by the way the exploration rate is computed from the average TD-errors over generations.
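As an illustration of the mechanism discussed above, the following Python sketch shows Max-Boltzmann action selection combined with a temperature that is annealed over generations. It is a minimal sketch under our own assumptions: the function names, the linear schedule and all numerical values are illustrative and not the exact settings used in the experiments.

    import numpy as np

    rng = np.random.default_rng()

    def max_boltzmann_action(q_values, epsilon, temperature):
        # Exploit greedily with probability 1 - epsilon; otherwise sample an
        # action from a Boltzmann (softmax) distribution over the Q-values.
        q_values = np.asarray(q_values, dtype=float)
        if rng.random() >= epsilon:
            return int(np.argmax(q_values))
        prefs = (q_values - q_values.max()) / temperature  # subtract max for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return int(rng.choice(len(q_values), p=probs))

    def temperature_schedule(generation, t_high=10.0, t_low=0.5, n_generations=70):
        # Illustrative linear annealing: a high temperature (behaviour close to
        # ε-Greedy) early on, a low temperature (more guided exploration) later.
        frac = min(generation / n_generations, 1.0)
        return t_high + frac * (t_low - t_high)

A corresponding sketch of Error-Driven-ε would set ε as a function of the average TD-error of recent generations, which is exactly the part that makes the method sensitive to noisy error estimates.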
We can conclude that Max-Boltzmann performs better than the other methods. The only problem with Max-Boltzmann is that it takes a long time before it outperforms the other methods: Figures 2 and 3 show that Max-Boltzmann only starts to outperform the other methods in the last 10 generations. More careful tuning of the hyperparameters of this method may result in even better performance.
Looking at the results, it is clear that the trade-off between exploration and exploitation is important. All methods that balance exploration and exploitation perform significantly better than the methods that use only exploration or only exploitation. The Greedy algorithm learns a locally optimal policy in which it does not get destroyed easily. The Random-Walk policy performs many poor exploratory actions and is killed very quickly. Therefore, the Random-Walk method never learns to play the whole game.
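To make the two degenerate baselines concrete, a minimal Python sketch follows; the function names and the ε value are illustrative assumptions, not the implementation used in the paper.

    import numpy as np

    rng = np.random.default_rng()

    def greedy_action(q_values):
        # Pure exploitation: always take the action with the highest estimated
        # value, so the agent tends to settle into a locally optimal policy.
        return int(np.argmax(q_values))

    def random_walk_action(n_actions):
        # Pure exploration: a uniformly random action that ignores the value estimates.
        return int(rng.integers(n_actions))

    def epsilon_greedy_action(q_values, epsilon=0.1):
        # A simple balance of the two: explore with probability epsilon, exploit otherwise.
        if rng.random() < epsilon:
            return random_walk_action(len(q_values))
        return greedy_action(q_values)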
6 CONCLUSIONS
This paper examined exploration methods for connectionist reinforcement learning in Bomberman. We compared multiple exploration methods and can conclude that Max-Boltzmann outperforms the other methods on both win rate and points gathered. The only aspect in which Max-Boltzmann is outperformed is its learning speed: Error-Driven-ε learns faster, but produces unstable behaviour. Max-Boltzmann takes longer to reach a high performance than some other methods, but better temperature-annealing schemes for this method may exist. The results also demonstrated that the commonly used ε-Greedy exploration strategy is easily outperformed by other methods.
In future work, we want to examine how well the different exploration methods perform for learning to play other games. Furthermore, we want to carefully analyze why Error-Driven-ε becomes unstable and modify the method to resolve this.