Figure 13: Even pure novelty search produces more novel games only towards the end of training. The baseline parameters were likely already implicitly optimised for game diversity by the initial Bayesian hyperparameter optimisation, which aimed at efficient learning.
tree and used network-internal features as auxiliary targets. We showed that, although the effort could be decreased slightly, the benefits are mostly small.
Our future work follows two directions: we want to further analyse possible improvements to AlphaZero, e.g. based on the Connect 4 scenario, and we want to investigate the applicability to real-world control problems. For the first path, we identified growing the network more systematically as a potentially beneficial extension. Alternatively, a more sophisticated fitness function for the evolutionary self-play phase could provide a more suitable trade-off between heterogeneity and convergence; a minimal sketch of such a blended fitness function is given below. For the second path, we will investigate whether such a technique is applicable to real-world control problems (D'Angelo et al., 2019), such as self-learning traffic light controllers (Sommer et al., 2016) or smart camera networks (Rudolph et al., 2014).
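To illustrate the idea of a blended fitness function, the following Python sketch scores self-play candidates by a weighted mix of playing strength (driving convergence) and game novelty (preserving heterogeneity). This is a hypothetical example under assumed interfaces, not the implementation used in our experiments: the Candidate fields, the novelty measure, and the novelty_weight parameter are illustrative assumptions.

```python
# Hypothetical sketch: a fitness function for an evolutionary self-play
# phase that trades off convergence (win rate) against heterogeneity
# (novelty). All names and the weighting scheme are illustrative
# assumptions, not taken from the paper.
from dataclasses import dataclass

@dataclass
class Candidate:
    win_rate: float   # fraction of evaluation games won, in [0, 1]
    novelty: float    # e.g. mean distance of its games to an archive, in [0, 1]

def fitness(c: Candidate, novelty_weight: float = 0.3) -> float:
    """Blend strength and novelty. novelty_weight = 0 recovers pure
    strength-based selection; novelty_weight = 1 recovers pure novelty
    search, which Figure 13 suggests helps only late in training."""
    return (1.0 - novelty_weight) * c.win_rate + novelty_weight * c.novelty

# Example: rank a candidate population by the blended fitness.
population = [Candidate(0.55, 0.10), Candidate(0.48, 0.80), Candidate(0.60, 0.05)]
ranked = sorted(population, key=fitness, reverse=True)
```

A single fixed weight is the simplest choice; annealing novelty_weight over the course of training would be a natural refinement, since the figure indicates novelty pressure pays off mainly in later iterations.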
REFERENCES
D’Angelo, M., Gerasimou, S., Ghahremani, S., et al. (2019). On learning in collective self-adaptive systems: state of practice and a 3D framework. In Proc. of SEAMS@ICSE 2019, pages 13–24.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2–3):235–256.
Brügmann, B. (1993). Monte Carlo Go. www.ideanest.com/vegos/MonteCarloGo.pdf.
Gao, C., Mueller, M., Hayward, R., Yao, H., and Jui, S. (2020). Three-Head Neural Network Architecture for AlphaZero Learning. https://openreview.net/forum?id=BJxvH1BtDS.
Gibney, E. (2016). Google AI algorithm masters ancient
game of Go. Nature News, 529(7587):445.
Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-excitation networks. In Proc. of IEEE CVPR, pages 7132–7141.
Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer.
Lan, L.-C., Li, W., Wei, T.-H., Wu, I., et al. (2019). Multiple
Policy Value Monte Carlo Tree Search. arXiv preprint
arXiv:1905.13521.
LCZero (2020). Leela Chess Zero. http://lczero.org/.
Pons, P. (2020). https://connect4.gamesolver.org and
https://github.com/PascalPons/connect4. accessed:
2020-10-02.
Prasad, A. (2019). Lessons From Implementing AlphaZero. https://medium.com/oracledevs/lessons-from-implementing-alphazero-7e36e9054191. accessed: 2019-11-29.
Rudolph, S., Edenhofer, S., Tomforde, S., and Hähner, J. (2014). Reinforcement Learning for Coverage Optimization Through PTZ Camera Alignment in Highly Dynamic Environments. In Proc. of ICDSC’14, pages 19:1–19:6.
Silver, D., Huang, A., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Silver, D., Hubert, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144.
Silver, D., Schrittwieser, J., et al. (2017). Mastering the game of Go without human knowledge. Nature, 550(7676):354.
Smith, L. N. (2017). Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE.
Sommer, M., Tomforde, S., and Hähner, J. (2016). An Organic Computing Approach to Resilient Traffic Management. In Autonomic Road Transport Support Systems, pages 113–130. Springer.
Tilps (2019a). https://github.com/LeelaChessZero/lc0/pull/721. accessed: 2019-11-29.
Tilps (2019b). https://github.com/LeelaChessZero/lc0/pull/635. accessed: 2019-11-29.
Videodr0me (2019). https://github.com/LeelaChessZero/lc0/pull/700. accessed: 2019-11-29.
Wu, D. J. (2020). Accelerating Self-Play Learning in Go. arXiv preprint arXiv:1902.10565.
Young, A. (2019). Lessons From Implementing AlphaZero, Part 6. https://medium.com/oracledevs/lessons-from-alpha-zero-part-6-hyperparameter-tuning-b1cfcbe4ca9a. accessed: 2019-11-29.
APPENDIX
The framework and the experimental platform
for distributed job processing are available at:
https://github.com/ColaColin/MasterThesis.