APPENDIX
We now discuss some technical details concerning the
GLPQ temperature.
Theorem 1. Consider a GLPQ output operator with $\mu_i = \frac{\varsigma_i - \varsigma_j}{\varsigma_i + \varsigma_j}$. Also assume that each action has $k$ corresponding prototypes. We then know that
\[
\max \frac{\sum_{i : c(\vec{w}_i) = a} \exp(\tau \mu_i)}{\sum_{j} \exp(\tau \mu_j)} = \frac{\exp(2\tau)}{\exp(2\tau) + |A| - 1} \tag{15}
\]
Proof. The output for some action $a$ is maximized if, for all prototypes for which $c(\vec{w}_i) = a$, we have that $\vec{w}_i = \vec{h}$. In that case we obtain $\mu_i = 1$ and $\mu_j = -1$ for $j \neq i$. Therefore, the resulting maximum value of the policy is
\[
\frac{k \exp(\tau)}{k \exp(\tau) + k(|A| - 1)\exp(-\tau)} = \frac{\exp(2\tau)}{\exp(2\tau) + |A| - 1} \tag{16}
\]
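As a sanity check of Eq. (16), the following minimal Python sketch (our own, not part of the paper; the function names and the values of $\tau$, $|A|$ and $k$ are illustrative assumptions) places the $k$ prototypes of one action at the maximizing configuration, i.e. $\mu = 1$ for that action's prototypes and $\mu = -1$ for all others, and compares the resulting softmax mass with the closed-form right-hand side.

import numpy as np

def max_policy_mass(tau, n_actions, k):
    # mu values at the maximizing configuration: k prototypes with mu = +1
    # for the chosen action, k * (|A| - 1) prototypes with mu = -1 otherwise.
    mu = np.concatenate([np.ones(k), -np.ones(k * (n_actions - 1))])
    weights = np.exp(tau * mu)
    return weights[:k].sum() / weights.sum()

def closed_form(tau, n_actions):
    # Right-hand side of Eq. (16).
    return np.exp(2 * tau) / (np.exp(2 * tau) + n_actions - 1)

print(max_policy_mass(tau=1.5, n_actions=4, k=3))  # ~0.8700
print(closed_form(tau=1.5, n_actions=4))           # ~0.8700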
Corollary 1. Let $p$ be the maximum value of the policy; we then have that
\[
p = \frac{\exp(2\tau)}{\exp(2\tau) + |A| - 1} \tag{17}
\]
\[
\Rightarrow\; p\exp(2\tau) + p(|A| - 1) = \exp(2\tau) \tag{18}
\]
\[
\Rightarrow\; \exp(2\tau) = -\frac{p(|A| - 1)}{p - 1} \tag{19}
\]
\[
\Rightarrow\; \tau = \frac{1}{2}\ln\!\left(-\frac{p(|A| - 1)}{p - 1}\right). \tag{20}
\]
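In practice, Eq. (20) can be used to pick the temperature that caps the policy's maximum action probability at a desired value $p$. The short Python sketch below assumes this intended use; the function name and the example values are our own, not taken from the paper.

import math

def temperature_for_max_prob(p, n_actions):
    # Eq. (20): tau = 0.5 * ln(-p(|A| - 1) / (p - 1)).
    # For 1/|A| <= p < 1 the argument of the logarithm is positive
    # (since p - 1 < 0) and tau is non-negative.
    return 0.5 * math.log(-p * (n_actions - 1) / (p - 1))

tau = temperature_for_max_prob(p=0.95, n_actions=4)
print(tau)  # ~2.02, since exp(2 * tau) = 0.95 * 3 / 0.05 = 57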