Lange, S. and Riedmiller, M. (2010). Deep auto-encoder
neural networks in reinforcement learning. In The 2010
International Joint Conference on Neural Networks
(IJCNN), pages 1–8.
Lange, S., Riedmiller, M., and Voigtländer, A. (2012). Au-
tonomous reinforcement learning on raw visual input
data in a real world application. In The 2012 Interna-
tional Joint Conference on Neural Networks (IJCNN),
pages 1–8.
Lee, A. X., Nagabandi, A., Abbeel, P., and Levine, S.
(2019). Stochastic latent actor-critic: Deep reinforce-
ment learning with a latent variable model. CoRR,
abs/1907.00953.
Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representa-
tion learning with contrastive predictive coding. arXiv
preprint arXiv:1807.03748.
Prieur, L. (2017). Deep-Q learning using simple feedforward
neural network. GitHub Gist, https://goo.gl/VpDqSw.
Raffin, A., Hill, A., Traoré, K. R., Lesort, T., Díaz-Rodríguez,
N., and Filliat, D. (2019). Decoupling feature extrac-
tion from policy learning: assessing benefits of state
representation learning in goal based robotics. In SPiRL
Workshop at ICLR.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization algo-
rithms. CoRR, abs/1707.06347.
Smith, N. A. and Eisner, J. (2005). Contrastive estimation:
Training log-linear models on unlabeled data. In Pro-
ceedings of the 43rd Annual Meeting on Association
for Computational Linguistics, pages 354–362. Asso-
ciation for Computational Linguistics.
Srinivas, A., Laskin, M., and Abbeel, P. (2020). CURL: con-
trastive unsupervised representations for reinforcement
learning. CoRR, abs/2004.04136.
Stooke, A., Lee, K., Abbeel, P., and Laskin, M. (2020). De-
coupling representation learning from reinforcement
learning. arXiv:2004.14990.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learn-
ing: An Introduction. MIT Press, Cambridge, MA,
USA, second edition.
Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S.,
and Lucic, M. (2019). On mutual information maxi-
mization for representation learning. In International
Conference on Learning Representations.
Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep
image prior. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
9446–9454.
Wahlström, N., Schön, T. B., and Deisenroth, M. P. (2015).
From pixels to torques: Policy learning with deep dy-
namical models. In ICML 2015 Workshop on Deep
Learning.
Zhang, M., Vikram, S., Smith, L., Abbeel, P., Johnson, M. J.,
and Levine, S. (2018). SOLAR: deep structured latent
representations for model-based reinforcement learn-
ing. CoRR, abs/1808.09105.
APPENDIX: HYPERPARAMETERS
RL Agent
Model Architecture.
The input to the neural network is a 96×96×4 stack
of grayscale frames. There are a total of six convolutional
layers with ReLU activation functions. The first hidden
layer convolves 8 filters of 4×4 with stride 2 over the
input image. The second hidden layer convolves 16 filters
of 3×3 with stride 2. The third hidden layer convolves
32 filters of 3×3 with stride 2. The fourth hidden layer
convolves 64 filters of 3×3 with stride 2. The fifth hidden
layer convolves 128 filters of 3×3 with stride 1. The sixth
hidden layer convolves 64 filters of 3×3 with stride 1.
On top of these convolutional layers, the critic has a final
fully-connected hidden layer with 100 rectifier units,
followed by a fully-connected linear output layer with a
single output. The actor likewise has a fully-connected
layer with 100 rectifier units after the convolutional
layers. Its final layer consists of two parallel fully-connected
layers that map these 100 units to three outputs each,
one per valid action, with a tanh activation function.
From these two heads, we sample the continuous actions
using a Beta distribution.
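For concreteness, a PyTorch-style sketch of this actor-critic network is given below. Two details are our own assumptions rather than part of the description above: all convolutions use zero padding, and the two Beta parameters are made strictly positive with a softplus(·)+1 transform instead of the tanh mentioned, since the Beta distribution requires positive parameters. Output shapes are noted in the comments.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta

class ActorCritic(nn.Module):
    """Sketch of the described network; the zero padding and the
    Beta parameterization (softplus + 1) are our assumptions."""
    def __init__(self, n_actions=3):
        super().__init__()
        self.encoder = nn.Sequential(                    # input: 4 x 96 x 96
            nn.Conv2d(4, 8, 4, stride=2), nn.ReLU(),     # -> 8 x 47 x 47
            nn.Conv2d(8, 16, 3, stride=2), nn.ReLU(),    # -> 16 x 23 x 23
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),   # -> 32 x 11 x 11
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),   # -> 64 x 5 x 5
            nn.Conv2d(64, 128, 3, stride=1), nn.ReLU(),  # -> 128 x 3 x 3
            nn.Conv2d(128, 64, 3, stride=1), nn.ReLU(),  # -> 64 x 1 x 1
            nn.Flatten(),                                # -> 64 features
        )
        self.critic = nn.Sequential(nn.Linear(64, 100), nn.ReLU(), nn.Linear(100, 1))
        self.actor = nn.Sequential(nn.Linear(64, 100), nn.ReLU())
        self.alpha_head = nn.Linear(100, n_actions)      # two parallel heads,
        self.beta_head = nn.Linear(100, n_actions)       # one output per action

    def forward(self, obs):
        h = self.encoder(obs)
        value = self.critic(h)
        a = self.actor(h)
        alpha = F.softplus(self.alpha_head(a)) + 1       # assumption: keeps Beta params > 1
        beta = F.softplus(self.beta_head(a)) + 1
        return Beta(alpha, beta), value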
PPO’s Hyperparameters.
Frameskip = 8. Discount
Factor = 0.99. Value Loss Coefficient = 2. Image Stack
= 4. Buffer Size = 5000. Adam Learning Rate = 0.001.
Batch Size = 128. Entropy Term Coefficient = 0.0001.
Clip Parameter = 0.1. PPO Epoch = 10.
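For reference, the same settings collected as a plain Python dictionary (the key names are ours and do not correspond to any particular code base):

ppo_config = dict(
    frame_skip=8,
    discount_factor=0.99,
    value_loss_coef=2.0,
    image_stack=4,
    buffer_size=5000,
    adam_learning_rate=1e-3,
    batch_size=128,
    entropy_coef=1e-4,
    clip_param=0.1,
    ppo_epochs=10,
)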
State Representation Learning
Variational Autoencoder.
The VAE is trained using
the same encoder as the PPO agent described above.
We train for 2000 epochs using a batch size of 64, a
KL-divergence weight (beta) of 1.0 that is annealed
over the first 10 epochs, and the Adam optimizer
(Kingma and Ba, 2014) with a learning rate of 0.0003.
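The corresponding training objective can be sketched as follows; the linear warm-up schedule for beta (from 0 to 1 over the first 10 epochs) and the mean-squared-error reconstruction term are our assumptions, since the text does not fix them.

import torch
import torch.nn.functional as F

def vae_loss(recon, x, mu, logvar, epoch, beta_max=1.0, anneal_epochs=10):
    # KL weight annealed (here: linearly ramped) over the first `anneal_epochs` epochs
    beta = beta_max * min(1.0, epoch / anneal_epochs)
    recon_loss = F.mse_loss(recon, x, reduction="sum") / x.size(0)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon_loss + beta * kl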
Contrastive Learning.
For contrastive learning, we
also use the same encoder as the PPO agent described
above and the Adam optimizer with a learning rate of
0.001. We train the model for 5000 epochs with a batch
size of 1024, which provides a large number of negative
samples through batch-wise permutations; large batches
have been shown to be beneficial in previous work (Oord
et al., 2018; He et al., 2019).
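A minimal sketch of such an InfoNCE-style objective with in-batch negatives, in the spirit of Oord et al. (2018), is shown below; the temperature value and the use of a cosine-similarity score are assumptions, as the text does not specify the exact scoring function.

import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.1):
    # anchors, positives: (B, D) embeddings of paired views of the same states;
    # every other element of the batch serves as a negative sample.
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)   # diagonal = positive pairs
    return F.cross_entropy(logits, labels)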