with deep convolutional nets and fully connected crfs.
CoRR, abs/1412.7062.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). ImageNet: A Large-Scale Hierarchical
Image Database. In CVPR09.
Farnebäck, G. (2000). Fast and accurate motion estimation
using orientation tensors and parametric motion mod-
els. In ICPR.
Fischer, P., Dosovitskiy, A., Ilg, E., Häusser, P., Hazırbaş,
C., Golkov, V., van der Smagt, P., Cremers, D., and
Brox, T. (2015). Flownet: Learning optical flow with
convolutional networks.
Florence, P. R., Manuelli, L., and Tedrake, R. (2018). Dense
object nets: Learning dense visual object descriptors
by and for robotic manipulation. In CoRL.
Güler, R. A., Neverova, N., and Kokkinos, I. (2018). Densepose:
Dense human pose estimation in the wild. 2018
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 7297–7306.
Hadsell, R., Chopra, S., and LeCun, Y. (2006). Dimensionality
reduction by learning an invariant mapping.
pages 1735 – 1742.
Hariharan, B., Arbelaez, P., Girshick, R., and Malik, J.
(2015). Hypercolumns for object segmentation and
fine-grained localization. pages 447–456.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition supplementary ma-
terials.
Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995).
The "wake-sleep" algorithm for unsupervised neural
networks. Science, 268(5214):1158–1161.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A.,
and Brox, T. (2017). Flownet 2.0: Evolution of opti-
cal flow estimation with deep networks. pages 1647–
1655.
Janai, J., Guney, F., Ranjan, A., Black, M., and Geiger, A.
(2018). Unsupervised Learning of Multi-Frame Op-
tical Flow with Occlusions: 15th European Confer-
ence, Munich, Germany, September 8-14, 2018, Pro-
ceedings, Part XVI, pages 713–731.
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,
J., Girshick, R. B., Guadarrama, S., and Darrell, T.
(2014). Caffe: Convolutional architecture for fast fea-
ture embedding. In ACM Multimedia.
Long, J., Shelhamer, E., and Darrell, T. (2014). Fully con-
volutional networks for semantic segmentation. Arxiv,
79.
Lowe, D. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60:91–.
Reda, F., Pottorff, R., Barker, J., and Catanzaro, B. (2017).
flownet2-pytorch: Pytorch implementation of flownet
2.0: Evolution of optical flow estimation with deep
networks.
Schmidt, T., Newcombe, R. A., and Fox, D. (2017). Self-
supervised visual descriptor learning for dense corre-
spondence. IEEE Robotics and Automation Letters,
2:420–427.
Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A.,
and Fitzgibbon, A. (2013). Scene coordinate regres-
sion forests for camera relocalization in rgb-d images.
pages 2930–2937.
Sivic, J., Russell, B., Efros, A., Zisserman, A., and Free-
man, W. (2005). Discovering objects and their lo-
cation in images. IEEE International Conference on
Computer Vision, pages 370–377.
Sudderth, E. B., Torralba, A., Freeman, W. T., and Willsky,
A. S. (2005). Describing visual scenes using trans-
formed dirichlet processes. In NIPS.
Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. (2017). Pwc-
net: Cnns for optical flow using pyramid, warping,
and cost volume.
Taylor, J., Shotton, J., Sharp, T., and Fitzgibbon, A. (2012).
The vitruvian manifold: Inferring dense correspon-
dences for one-shot human pose estimation. vol-
ume 10.
Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J.,
Philbin, J., Chen, B., and Wu, Y. (2014). Learning
fine-grained image similarity with deep ranking. Pro-
ceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition.
Wang, X. and Gupta, A. (2015). Unsupervised learning of
visual representations using videos. 2015 IEEE In-
ternational Conference on Computer Vision (ICCV),
pages 2794–2802.
Zhang, R., Lin, L., Zhang, R., Zuo, W., and Zhang, L.
(2015). Bit-scalable deep hashing with regularized
similarity learning for image retrieval and person re-
identification. IEEE Transactions on Image Process-
ing, 24:4766–4779.
Visual Descriptor Learning from Monocular Video