Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. (2013). The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279.
Carlson, T. A., Simmons, R. A., Kriegeskorte, N., and Slevc, L. R. (2014). The emergence of semantic meaning in the ventral temporal pathway. Journal of Cognitive Neuroscience, 26(1):120–131.
Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
Forestier, S., Mollard, Y., and Oudeyer, P.-Y. (2017).
Intrinsically motivated goal exploration processes
with automatic curriculum learning. arXiv preprint
arXiv:1708.02190.
Gordon, R. D. and Irwin, D. E. (1996). What's in an object file? Evidence from priming studies. Perception & Psychophysics, 58(8):1260–1277.
Gutmann, M. and Hyvärinen, A. (2010). Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 297–304.
Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal,
K., Bachman, P., Trischler, A., and Bengio, Y. (2019).
Learning deep representations by mutual information
estimation and maximization. International Confer-
ence on Learning Representations (ICLR).
Jonschkowski, R. and Brock, O. (2015). Learning state rep-
resentations with robotic priors. Autonomous Robots,
39(3):407–428.
Jonschkowski, R., Hafner, R., Scholz, J., and Riedmiller, M. (2017). PVEs: Position-velocity encoders for unsupervised learning of structured state representations. arXiv preprint arXiv:1705.09805.
Kalantidis, Y., Sariyildiz, M. B., Pion, N., Weinzaepfel, P.,
and Larlus, D. (2020). Hard negative mixing for con-
trastive learning. arXiv preprint arXiv:2010.01028.
Khan, A., Sohail, A., Zahoora, U., and Qureshi, A. S.
(2020). A survey of the recent architectures of deep
convolutional neural networks. Artificial Intelligence
Review, 53(8):5455–5516.
Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
Lesort, T., Díaz-Rodríguez, N., Goudou, J.-F., and Filliat, D. (2018). State representation learning for control: An overview. Neural Networks, 108:379–392.
Ma, Z. and Collins, M. (2018). Noise contrastive estima-
tion and negative sampling for conditional models:
Consistency and statistical efficiency. arXiv preprint
arXiv:1809.01812.
Marr, D. (1982). Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman, San Francisco.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and
Levine, S. (2018). Visual reinforcement learning with
imagined goals. In Advances in Neural Information
Processing Systems, pages 9191–9200.
Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871.
Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representa-
tion learning with contrastive predictive coding. arXiv
preprint arXiv:1807.03748.
Péré, A., Forestier, S., Sigaud, O., and Oudeyer, P.-Y. (2018). Unsupervised learning of goal spaces for intrinsically motivated goal exploration. arXiv preprint arXiv:1803.00781.
Poole, B., Ozair, S., Oord, A. v. d., Alemi, A. A., and
Tucker, G. (2019). On variational bounds of mutual
information. arXiv preprint arXiv:1905.06922.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
geNet Large Scale Visual Recognition Challenge. In-
ternational Journal of Computer Vision, 115(3):211–
252.
Sermanet, P., Lynch, C., Chebotar, Y., Hsu, J., Jang, E., Schaal, S., and Levine, S. (2018). Time-contrastive networks: Self-supervised learning from video. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 1134–1141.
Sohn, K. (2016). Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865.
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. (2016). Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746.
Visani, G. M., Hughes, M. C., and Hassoun, S. (2020).
Hierarchical classification of enzyme promiscuity us-
ing positive, unlabeled, and hard negative examples.
arXiv preprint arXiv:2002.07327.
Wang, W., Zhou, T., Yu, F., Dai, J., Konukoglu, E., and
Van Gool, L. (2021). Exploring cross-image pixel
contrast for semantic segmentation. arXiv preprint
arXiv:2101.11939.
Unsupervised Learning of State Representation using Balanced View Spatial Deep InfoMax: Evaluation on Atari Games