Table 10: Performance score and standard errors with one roll-out and a depth of ten actions against the random opponent, without using the learned model of the opponent.
State representation Sigmoid Elu
Small vision grids 0.46 (0.007) 0.50 (0.001)
Large vision grids 0.50 (0.001) 0.51 (0.002)
Full grid 0.37 (0.010) 0.35 (0.006)
Table 11: Performance score and standard errors with one roll-out and a depth of ten actions against the deterministic opponent, without using the learned model of the opponent.
State representation Sigmoid Elu
Small vision grids 0.34 (0.003) 0.35 (0.001)
Large vision grids 0.35 (0.001) 0.36 (0.001)
Full grid 0.21 (0.007) 0.21 (0.005)
Using vision grids as state representation not only increased the learning speed, but also improved the agent's performance in most cases. Of all state representations, the large vision grids obtain the best performance: they reduce the number of possible inputs compared to the full grid, while containing more information than the small vision grids.
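To illustrate the vision-grid representation, the sketch below extracts a square grid centred on the agent's head from the full game board. The grid size, the board encoding and the helper names are illustrative assumptions, not the exact implementation used in the experiments.

import numpy as np

def vision_grid(board, head, size=5, pad_value=1):
    # Extract a size x size window centred on the agent's head.
    # board: 2D array of the full Tron grid (1 = wall or trail, 0 = free).
    # head: (row, col) position of the agent's head.
    # Cells outside the board are treated as walls (pad_value).
    half = size // 2
    padded = np.pad(board, half, mode='constant', constant_values=pad_value)
    r, c = head[0] + half, head[1] + half
    return padded[r - half:r + half + 1, c - half:c + half + 1]

# Example: a 5x5 vision grid on an empty 10x10 board, head at (2, 3).
board = np.zeros((10, 10), dtype=int)
small_grid = vision_grid(board, head=(2, 3), size=5)

In general, several such grids (for example one for walls and trails and one for the opponent) can be stacked, which keeps the network's input size independent of the board dimensions.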
This paper also confirms the benefits of the Elu activation function over the sigmoid activation function. Against the semi-deterministic opponent, the Elu activation function increased the agent's performance in eleven of the twelve conducted experiments, and against the random opponent it increased performance in eight of the twelve experiments. This suggests that the Elu activation function outperforms the sigmoid function especially when the updates are less noisy, as is the case against the more deterministic opponent.
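For reference, the exponential linear unit is defined as follows, with the scale parameter $\alpha$ commonly set to 1:

\[
  \mathrm{Elu}(x) =
  \begin{cases}
    x, & x > 0 \\
    \alpha \left( e^{x} - 1 \right), & x \le 0
  \end{cases}
\]

Unlike the sigmoid, it does not saturate for positive inputs, and its negative outputs push mean activations closer to zero, which may explain the more stable learning observed when updates are less noisy.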
Finally, the introduced opponent modelling technique allows the agent to learn and model the opponent concurrently. In combination with planning algorithms such as Monte-Carlo roll-outs, it can be used to significantly increase performance against two widely different opponents.
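A minimal sketch of this combination is given below: during each roll-out step, the opponent's action is sampled from the learned opponent model rather than chosen uniformly. The simulator interface, the number of actions and the roll-out policy are hypothetical placeholders, not the exact procedure used in the paper.

import random

def rollout_value(state, agent_action, opponent_model, simulate,
                  depth=10, n_actions=4):
    # Estimate the value of agent_action with a single roll-out of at most `depth` steps.
    # opponent_model(state) is assumed to return a probability distribution over
    # the opponent's n_actions actions (e.g. the softmax output of the learned model).
    # simulate(state, a_agent, a_opp) is assumed to return (next_state, reward, terminal).
    total, a_agent = 0.0, agent_action
    for _ in range(depth):
        probs = opponent_model(state)
        a_opp = random.choices(range(n_actions), weights=probs)[0]
        state, reward, terminal = simulate(state, a_agent, a_opp)
        total += reward
        if terminal:
            break
        a_agent = random.randrange(n_actions)  # simple random roll-out policy after the first step
    return total

The agent would then execute the action whose roll-outs yield the highest estimated value.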
An interesting possibility for future research is to test whether the use of vision grids causes the agent to form a better generalised policy. We believe that this is the case, since vision grids are less dependent on the dimensions of the environment and on the obstacles the agent might encounter, so the learned policy should generalise better to other environments. Finally, the proposed opponent modelling technique is widely applicable, and we are interested in seeing whether it also proves useful in other problems.