Visual Descriptor Learning from Monocular Video

Umashankar Deekshith

, Nishit Gajjar

, Max Schwarz

and Sven Behnke

Autonomous Intelligent Systems Group of University of Bonn, Germany

Keywords:

Dense Correspondence, Deep Learning, Pixel Descriptors.

Abstract:

Correspondence estimation is one of the most widely researched and yet only partially solved area of computer

vision with many applications in tracking, mapping, recognition of objects and environment. In this paper,

we propose a novel way to estimate dense correspondence on an RGB image where visual descriptors are

learned from video examples by training a fully convolutional network. Most deep learning methods solve

this by training the network with a large set of expensive labeled data or perform labeling through strong

3D generative models using RGB-D videos. Our method learns from RGB videos using contrastive loss,

where relative labeling is estimated from optical ﬂow. We demonstrate the functionality in a quantitative

analysis on rendered videos, where ground truth information is available. Not only does the method perform

well on test data with the same background, it also generalizes to situations with a new background. The

descriptors learned are unique and the representations determined by the network are global. We further show

the applicability of the method to real-world videos.

1 INTRODUCTION

Many of the problems in computer vision, like 3D

reconstruction, visual odometry, simultaneous local-

ization and mapping (SLAM), object recognition, de-

pend on the underlying problem of image correspon-

dence (see Fig. 1). Correspondence methods based on

sparse visual descriptors like SIFT (Lowe, 2004) have

been shown to be useful for applications like cam-

era calibration, panorama stitching and even robot

localization. However, SIFT and similar hand de-

signed features require a textured image. They per-

form poorly for images or scenes which lack sufﬁcient

texture. In such situations, dense feature extractors

are better suited than sparse keypoint-based methods.

With the advancement in deep learning in recent

years, the general trend is that neural networks can be

trained to outperform hand designed feature methods

for any function using sufﬁcient training data. How-

ever, supervised training approaches require signiﬁ-

cant effort because they require labeled training data.

Therefore, it is useful to have a way to train the model

in a self-supervised fashion, where the training labels

are created automatically.

https://orcid.org/0000-0002-0341-5768

https://orcid.org/0000-0001-7610-797X

https://orcid.org/0000-0002-9942-6604

https://orcid.org/0000-0002-5040-7525

Figure 1: Point tracking using the learned descriptors in

monocular video. Two different backgrounds are shown to

represent the network’s capability to generate global corre-

spondences. A square patch of 25 pixels are selected (left)

and the nearest neighbor for those pixels based on the pixel

representation is shown for the other images.

In their work, (Schmidt et al., 2017) and (Flo-

rence et al., 2018) have shown approaches for self-

supervised visual descriptor learning using raw RGB-

D sequence of images. Their approach shows that it

is possible to generate dense descriptors for an ob-

ject or a complete scene and that these descriptors are

consistent across videos with different backgrounds

and camera alignments. The dense descriptors are

learned using contrastive loss (Hadsell et al., 2006).

The method unlocks a huge potential for robot ma-

nipulation, navigation, and self learning of the envi-

ronment. However, for obtaining the required corre-

spondence information for training, the authors rely

444

Deekshith, U., Gajjar, N., Schwarz, M. and Behnke, S.

Visual Descriptor Learning from Monocular Video.

DOI: 10.5220/0008989304440451

In Proceedings of the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2020) - Volume 5: VISAPP, pages

444-451

ISBN: 978-989-758-402-2; ISSN: 2184-4321

Figure 2: Learned dense object descriptors. Top: Stills from a monocular video demonstrating full 6D movement of the object.

Bottom: Normalized output of the trained network where each pixel is represented uniquely, visualized as an RGB image.

The objects were captured in different places under different lighting conditions, viewpoints giving the object a geometric

transformation in a 3D space. It can be seen here that the descriptor generated is independent of these effects. This video

sequence has not been seen during training.

on 3D reconstruction from RGB-D sensors. This lim-

its the applicability of their method (i.e. shiny and

transparent objects cannot be learned).

Every day a large quantity of videos are generated

from basic point and shoot cameras. Our aim is to

learn the visual descriptors from an RGB video with-

out any additional information. We follow a similar

approach to (Schmidt et al., 2017) by implementing

a self-supervised visual descriptor learning network

which learns dense object descriptors (see Fig. 2). In-

stead of the depth map, we rely on the movement

of the object of interest. To ﬁnd an alternate way

of self learning the pixel correspondences, we turn

to available optical ﬂow methods. The traditional

dense optical ﬂow method of (Farneb

ack, 2000) and

new deep learning based optical ﬂow methods (Sun

et al., 2017; Ilg et al., 2017) provide the information

which is the basis of our approach to generate self-

supervised training data. Optical ﬂow gives us a map-

ping of pixel correspondences within a sequence of

images. Loosely speaking, our method turns relative

correspondence from optical ﬂow into absolute corre-

spondence using our learned descriptors.

To focus the descriptor learning on the object of

interest, we employ a generic foreground segmenta-

tion method (Chen et al., 2014), which provides a

foreground object mask. We use this foreground mask

to limit the learning of meaningful visual descriptors

for the foreground object, such that these descriptors

are as far apart in descriptor space as possible.

In this work, we show that it is possible to learn

visual descriptors for monocular images through self-

supervised learning by training them using contrastive

loss for images labeled from optical ﬂow information.

We further demonstrate applicability in experiments

on synthetic and real data.

2 RELATED WORK

Traditionally, dense correspondence estimation algo-

rithms were of two kinds. One with focus on learning

generative models with strong priors (Hinton et al.,

1995), (Sudderth et al., 2005). These algorithms were

designed to capture similar occurrence of features.

Another set of algorithms use hand-engineered meth-

ods. For example, SIFT or HOG performs cluster-

ing over training data to discover the feature classes

(Sivic et al., 2005), (C. Russell et al., 2006). Re-

cently, the advances in deep learning and their abil-

ity to reliably capture high dimensional features di-

rectly from the data has made a lot of progress in

correspondence estimation, outperforming the tradi-

tional methods. (Taylor et al., 2012) have proposed

a method where a regression forest is used to predict

dense correspondence between image pixels and ver-

Visual Descriptor Learning from Monocular Video

445

tices of an articulated mesh model. Similarly, (Shot-

ton et al., 2013) use a random forest given only a sin-

gle acquired image, to deduce the pose of an RGB-D

camera with respect to a known 3D scene. (Brach-

mann et al., 2014) jointly train an objective over both

3D object coordinates and object class labeling to de-

termine to address the problem of estimating the 6D

Pose of speciﬁc objects from a single RGB-D im-

age. Semantic segmentation of images as presented

by (Long et al., 2014; Hariharan et al., 2015) use neu-

ral network that produce dense correspondence of im-

ages. (G

uler et al., 2018) propose a method to estab-

lish dense correspondence between RGB image and

surface based representation of human body. All of

these methods rely on labeled data. To avoid the ex-

pensive labeling process, we use relative labels for

pixels generated before training with minimum com-

putation and no human intervention.

Relative labeling has been in use for various appli-

cations. (Wang et al., 2014) introduced a multi-scale

network with triplet sampling algorithm that learns a

ﬁne-grained image similarity model directly from im-

ages. For image retrieval using deep hashing, (Zhang

et al., 2015) trained a deep CNN where discrimina-

tive image features and hash functions are simulta-

neously optimized using max-margin loss on triplet

units. However, these methods also require labeled

data. (Wang and Gupta, 2015) propose a method

for image patch detection where they employ rela-

tive labeling for data points. Here they try to track

patches in videos where two patches connected by

a track should have similar visual representation and

have closer distance in deep feature space and hence

be able to differentiate it from a third patch. In con-

trast, our method works on pixel level and is thus able

to provide ﬁne-grained correspondence. Also, their

work does not focus on representing a speciﬁc patch

distinctively and so it is not clear if a correspondence

generated for a pixel is consistent across different sce-

narios for an object.

Our work makes use of optical ﬂow estimation

(Fischer et al., 2015; Janai et al., 2018; Sun et al.,

2017; Ilg et al., 2017), which has made signiﬁcant

progress through deep learning. While these ap-

proaches can obviously be used to estimate corre-

spondences for short time frames, our work is applica-

ble for larger geometric transformation, environment

with more light variance and occlusion. Of course, a

better optical ﬂow estimate during training improves

results in our method.

Contrastive loss for dense correspondence has

been used by (Schmidt et al., 2017) where they have

used relative labeling to train a fully convolutional

network. The idea presented by them is for the de-

scriptor to be encoded with the identity of the point

that projects onto the pixel so that it is invariant to

lighting, viewpoint, deformation and any other vari-

able other than the identity of the surface that gener-

ated the observation. (Florence et al., 2018) train the

network for single-object and multi-object descriptors

using a modiﬁed pixel-wise contrastive loss function

which (similar to (Schmidt et al., 2017)) minimizes

the feature distance between the matching pixels and

pushes that of non-matching pixels to be at least a

conﬁgurable threshold away from each other. Their

aim is to train a robotic arm to generate training data

consisting of objects of interests and then training a

FCN network to distinctively identify different parts

of the same object and multiple objects. In these

methods, an RGB-D video is used and a strong 3D

generative model is used to automatically label cor-

respondences. In absence of the depth information, a

dense optical ﬂow between subsequent frames of an

RGB video can provide the correlation between pix-

els in different frames. For image segmentation to se-

lect the dominant object in the scene, we refer to the

solution given by (Chen et al., 2014) and (Jia et al.,

2014).

3 METHOD

Our aim is to train a visual descriptor network us-

ing an RGB video to get a non-linear function which

can translate an RGB image R

W xHx3

to a descriptor

image R

W xHxD

. In the absence of geometry infor-

mation, two problems need to be solved: The ob-

ject of interest needs to be segmented for training and

pixel correspondences need to be determined. To seg-

ment the object of interest from the scene, we refer

to the off the shelf solution provided by (Chen et al.,

2014). It provides us with a pre-trained network built

on Caffe (Jia et al., 2014) from which masks can be

generated for the dominant object in the scene. To

obtain pixel correspondences we use a dense optical

ﬂow between subsequent frames of an RGB video.

The network architecture and loss function used

for training are as explained below:

3.1 Network Architecture

A fully convolution network (FCN) architecture is

used where the network has 34 layered ResNet(He

et al., 2016) as the encoder block and 6 deconvolu-

tional layers as the decoder block, with ReLU as the

activation function. ResNet used is pre-trained on Im-

ageNet (Deng et al., 2009) data-set. The input has

the size H×W ×3 and the output layer has the size

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

446

Figure 3: Training process. Images A and B are consecutive frames taken from an input video. A.1: Dominant object mask

generated for segmentation of image A. Flow: Optical ﬂow between images A and B, represented as an RGB image. This

is used for performing automatic labelling of images where pixel movement for every pixel in Image A is represented by

the calculated optical ﬂow. A.2: Training input with a randomly selected reference point (green). B.2: Horizontally ﬂipped

version of image B. Note the corresponding reference point (green), which is calculated using the optical ﬂow offset. The

reference point pair is used as a positive example for contrastive loss. Additional negative samples for the green reference

point in A.2 are marked in blue. A.3, B.3: Learned descriptors for images A and B.

H×W ×D, where D is the descriptor size. The net-

work also has skip connections to retain information

from earlier layers, and L2 normalization at the out-

put.

3.2 Loss Function

A modiﬁed version of pixelwise contrastive loss

(Hadsell et al., 2006), as has been used by (Florence

et al., 2018), is used for training. The corresponding

pixel pair determined using the optical ﬂow is consid-

ered as a positive example and the network is trained

to reduce the L2 distance between their descriptors.

The negative pixel pair samples are selected by ran-

domly sampling the pixels from the foreground ob-

ject, where the previously generated mask helps in

limiting the sampling domain to just the object of in-

terest. The negative samples are expected to be at

least m distance away. We thus deﬁne the loss func-

tion

L(A, B, u

, u

, O

A→B

) =







, u

)

if O

A→B

) = u

max(0, M − D

, u

))

otherwise.

(1)

where O

A→B

is the optical ﬂow mapping of pixels

from image A to B, M is a margin and D

, u

)

is the distance metric between the descriptor of image

A at pixel u

and the descriptor of image B at pixel u

as deﬁned by

, u

) = || f

) − f

)||

(2)

with the computed descriptor images f

and f

3.3 Training

For training the descriptor space for a particular ob-

ject, an RGB video of the object is used. Masks are

generated for each image by the pre-trained DeepLab

network (Chen et al., 2014) to segment the im-

age into the foreground and background. Two sub-

sequent frames are extracted from the video, and

dense optical ﬂow is calculated using Farneback’s

method (Farneb

ack, 2000) after applying the mask

(see Fig. 3). We also used the improvements in op-

tical ﬂow estimation using Flownet2 (Ilg et al., 2017)

where we generated the optical ﬂow through pre-

trained networks (Reda et al., 2017) for our data-set

and observed that an improvement in optical ﬂow de-

termination does indeed help us in a better correspon-

dence estimation.

Visual Descriptor Learning from Monocular Video

447

Figure 4: Synthetic images used for experiments. Top row:

Hulk ﬁgure, Bottom row: Drill. In both cases, we render

videos with two different background images.

The pair of frames are passed through the FCN

and their outputs along with the optical ﬂow map

are used to calculate the pixel-wise contrastive loss.

Since there are millions of pixels in an image, ’pixel

pair’ sampling needs to be done to increase the speed

of training. For this, only the pixels on the object

of interest are taken into consideration. To select

the pixel pair, the pixel movement is measured us-

ing the optical ﬂow information. As shown in ﬁgure

7, once the pixels are sampled from image A, corre-

sponding pixels in image B are found using the op-

tical ﬂow map and these become the pixel matches.

For each sampled pixel from image A, a number of

non-match pixels are sampled uniformly at random

from the object in image B. These create the pixel

non-matches. We experimented with the number of

positive pixel matches in the range of 1000 to 10000

matches and observed that for us, 2000-3000 posi-

tive pixel matches worked the best for us and this

was used for the results. For each match 100-150

non-match pairs are considered. The loss function

is applied and averaged over the number of matches

and non-matches and then the loss is backpropagated

through the network. We performed a grid search to

ﬁnd the parameters for the training. We noticed that

the method is not very sensitive to the used hyperpa-

rameters. We also noticed that the network trains in

about 10 epochs, with loss values stagnating after that

point. Data augmentation techniques such as ﬂipping

one image in the pair across its horizontal and vertical

axis are also applied.

4 EXPERIMENTS

For quantitative analysis, we created a rendered video

where a solid object is moved - both rotation and

translation - in a 3-dimensional space from which we

extracted the images to be used for training and evalu-

ation. For all the experiments a drill object is used un-

less specifed. Different environments under different

lighting conditions were created for video rendering.

From these videos, frames extracted from the ﬁrst half

of every video is used as train data set and the rest as

test data set.

For evaluation, we chose a pixel in one image and

compared the pixel representations, which is the out-

put of the network, against the representations of all

the pixels in the second image creating an L2 distance

matrix. The percentile of total pixels in the matrix

that have a distance lesser than the ground truth pixel

is determined.

With these intermediate results, we performed the

below tests to demonstrate the performance of our

method under different conditions:

1. We compared percentile of pixels whose feature

representation is closer than the representation of

the ground truth pixel whose representation is ex-

pected to be the closest. This is averaged over an

image between consecutive images in a test video.

This shows the consistency of the pixel represen-

tations of the object for small geometric transfor-

mation. For this particular experiment, we used

both a drill object dataset and a hulk object dataset

(see Fig. 4). The hulk dataset was trained with the

same hyperparameters that were found to be best

for drill.

2. We selected random images from all the combina-

tions of training and test images with the same and

different backgrounds. We performed test sim-

ilar to 1 and results for the train test combina-

tion are averaged together. This shows the con-

sistency of the pixel representations of the object

for large geometric transformation across various

backgrounds.

3. We present how the model performs between a

pair of images pixel-wise where we plot the per-

centile of pixels at different error percentiles.

4. We show the accuracy of pixel-representation by

the presented method for pixels selected through

SIFT. This provides a direct comparison of our

method with SIFT where we compare the perfor-

mance of our method on points that are considered

ideal for SIFT.

All these methods are compared against dense

SIFT and are plotted together. The method used

by (Schmidt et al., 2017) where they use the 3D

generative model for relative labeling is the closest

to our method. However, we could not provide an

analysis of the results from both the methods as we

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

448

Figure 5: Cumulative histogram of false positive pixels av-

eraged over a frame between consecutive frames in a video.

Figure 6: Correspondence error histogram for keypoints se-

lected through SIFT. We show the pixel distance of the esti-

mated corresponding point to the ground truth correspon-

dence for images of size 640*480. The presented result

considers a different combination of input images based on

train and test images with same and different backgrounds.

couldn’t make the same setup for quantitative anal-

ysis. Also, their method was applicable for RGB-D

videos whereas our method as far as we know is new

and work on RGB videos and so is different and can-

not be compared. We also show visual results from

the video recorded and used for training as ground

truth is not available for the same. For test 4, we cal-

culated the signiﬁcant points for both the images in

comparison and determined the signiﬁcant points for

both. We determined SIFT representations for these

points and representation for the corresponding pix-

els were taken from the network output. We then

compare the accuracy of the two methods against the

ground truth information to check for accuracy.

5 EVALUATION

The result for test 1 is shown in Table 2. This demon-

strates the overall capability of tracking a pixel across

multiple frames in a video and shows how close the

representations are in that. As can be seen, our

method far outperforms the SIFT method by a large

margin when the same test is conducted for the drill

Figure 7: Pixel-wise comparison between the presented

method and SIFT. The plot shows the fraction of pixels

against percentile of false positives for each pixel, for each

of the two approaches.

object. Figure 5 shows the frame-wise comparison for

the same. As can be seen, between most frames, the

presented method performs under an error of 1 per-

centile whereas SIFT results are distributed over the

range. We also evaluated the results for the rendered

hulk object. The results show a signiﬁcant difference

between drill and hulk object. Our hypothesis for this

deviation is that the hulk object has a much better tex-

ture than drill and hence SIFT performs much better.

We note that training with optimized hyperparameters

for the hulk object would likely improve the result.

Test 2 results are presented in Table 1. This test com-

pares the ability of the network to ﬁnd global repre-

sentation and representation for pixels not seen be-

fore. As can be seen, even upon selecting the previ-

ously unseen images and while having different back-

grounds between the images, the presented method is

able to identify the pixels with similar accuracy prov-

ing that the pixel representation found through this

method is global. It was observed during the test that

the accuracy of prediction depended on the transfor-

mation and change in images and not on whether an

image and environment has been previously seen by

the network proving that the method determines sta-

ble features. Test 3 is performed to show the pixel-

wise performance of the presented method and the re-

sult for the same is shown in Fig. 7. This shows the

overall spread of accuracy of representation, the pix-

els that are represented well and the outliers. The er-

ror percentile for our method performs well for most

pixels. As can be seen, for over two-third of the

pixels, the error percentile is under 2 percent with

close to 15% pixels under 1 percent error and outper-

forms the compared method. This is shown in Fig. 7,

it shows the performance of our method when pit-

ted against SIFT. This shows that the representation

uniquely identiﬁes the high contrast pixels more eas-

ily, better than SIFT. Test 4 is performed to show the

Visual Descriptor Learning from Monocular Video

449

Table 1: Quantitative evaluation. We show the percentile of false correspondences in each image, averaged over the dataset.

Train refers to images where they have been exposed to the network during training. Test refers to images which are previously

unseen by the network. Train and Test notations are with respect to our network. The results are averaged over images with

same and different backgrounds selected at equal proportions.

Method Image Train-Train Train-Test Test-Test

SIFT Same Background 61.2 56.9 45.8

Our Method Same Background 3.9 8.0 3.2

SIFT Different Background 61.6 46.0 39.2

Ours Method Different Background 4.2 1.1 1.2

Table 2: Image-wise comparison of correspondence results.

Here, the error percentile for corresponding images in a

video is calculated and is averaged for the whole image and

this is repeated for subsequent 100 frame pairs in a video.

Images were sampled at 5 frames per second.

Method Object Mean Error Std Deviation

SIFT Drill 37.0 16.4

Ours Drill 0.5 0.5

SIFT Hulk 9.5 4.8

Ours Hulk 1.6 1.9

capability of the presented method for signiﬁcant

points determined for SIFT algorithm, the results for

which are available in Fig. 6. We expected the SIFT

to perform well for these signiﬁcant points since SIFT

was developed for this problem. But surprisingly,

even for these points, our method performed better

than SIFT in most cases as can be seen from the ﬁg-

ure.

6 CONCLUSIONS

We introduce a method where images from a monoc-

ular camera can be used to recognize visual features

in an environment which is very important for robots.

We present an approach where optical ﬂow calcu-

lated using a traditional method such as (Farneb

ack,

2000) or using pre-trained networks as given by (Ilg

et al., 2017), (Sun et al., 2017) is used to label the

training data collected by the robot enabling it to per-

form dense correspondence of the surrounding with-

out manual intervention. For recognizing the domi-

nant object, we use a pre-trained network (Chen et al.,

2014) which is integrated into this method and can

be used without supervision. We present evidence to

the global nature of the representation generated by

the trained network where the same representation is

generated for the object present in different environ-

ments and/or at different viewpoints, transformation,

and lighting conditions. We also showed that even

in cases where a particular image and environment

is previously unseen by the network, our network is

able to generate stable features with results similar

to training images which will allow the robot to ex-

plore new environments. We quantitatively compared

our approach to a hard-engineered approach scale-

invariant feature transform (SIFT) both considering

the percentile of false positives pixels averaged over

the image and comparison at a pixel level. We show

that in both evaluations, our approach far outperforms

the existing approach. We also showed our approach’s

performance with real-world scenarios where the net-

work was trained with a video recorded using a hand-

held camera and we were able to show that the net-

work was able to generate visually stable features for

the object of interest both in previously seen and un-

seen environment thereby showing the extension of

our method to real-world scenario.

Using optical ﬂow for labeling the positive and

negative exampled needed for contrastive loss-based

training provides us with the ﬂexibility of replacing

the optical ﬂow generation method with an improved

method when it becomes available and any improve-

ment in optical ﬂow calculation will improve our re-

sults further. Similarly, for generating the mask, a

more accurate dominant object segmentation will im-

prove the presented results further. The extracted fea-

tures ﬁnds application in various ﬁelds of robotics and

computer vision such as 3D registration, grasp gen-

eration, simultaneous localization and mapping with

loop closure detection and following a human being

to a destination among others.

REFERENCES

Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shot-

ton, J., and Rother, C. (2014). Learning 6D Object

Pose Estimation Using 3D Object Coordinates, vol-

ume 8690.

C. Russell, B., T. Freeman, W., A. Efros, A., Sivic, J., and

Zisserman, A. (2006). Using multiple segmentations

to discover objects and their extent in image collec-

tions. volume 2, pages 1605 – 1614.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and

Yuille, A. L. (2014). Semantic image segmentation

VISAPP 2020 - 15th International Conference on Computer Vision Theory and Applications

450

with deep convolutional nets and fully connected crfs.

CoRR, abs/1412.7062.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-

Fei, L. (2009). ImageNet: A Large-Scale Hierarchical

Image Database. In CVPR09.

Farneb

ack, G. (2000). Fast and accurate motion estimation

using orientation tensors and parametric motion mod-

els. In ICPR.

Fischer, P., Dosovitskiy, A., Ilg, E., H

ausser, P., Hazırbas¸,

C., Golkov, V., van der Smagt, P., Cremers, D., and

Brox, T. (2015). Flownet: Learning optical ﬂow with

convolutional networks.

Florence, P. R., Manuelli, L., and Tedrake, R. (2018). Dense

object nets: Learning dense visual object descriptors

by and for robotic manipulation. In CoRL.

uler, R. A., Neverova, N., and Kokkinos, I. (2018). Dense-

pose: Dense human pose estimation in the wild. 2018

IEEE/CVF Conference on Computer Vision and Pat-

tern Recognition, pages 7297–7306.

Hadsell, R., Chopra, S., and Lecun, Y. (2006). Dimen-

sionality reduction by learning an invariant mapping.

pages 1735 – 1742.

Hariharan, B., Arbelaez, P., Girshick, R., and Malik, J.

(2015). Hypercolumns for object segmentation and

ﬁne-grained localization. pages 447–456.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition supplementary ma-

terials.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995).

The ”wake-sleep” algorithm for unsupervised neural

networks. Science, 268 5214:1158–61.

Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A.,

and Brox, T. (2017). Flownet 2.0: Evolution of opti-

cal ﬂow estimation with deep networks. pages 1647–

1655.

Janai, J., Guney, F., Ranjan, A., Black, M., and Geiger, A.

(2018). Unsupervised Learning of Multi-Frame Op-

tical Flow with Occlusions: 15th European Confer-

ence, Munich, Germany, September 8-14, 2018, Pro-

ceedings, Part XVI, pages 713–731.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long,

J., Girshick, R. B., Guadarrama, S., and Darrell, T.

(2014). Caffe: Convolutional architecture for fast fea-

ture embedding. In ACM Multimedia.

Long, J., Shelhamer, E., and Darrell, T. (2014). Fully con-

volutional networks for semantic segmentation. Arxiv,

79.

Lowe, D. (2004). Distinctive image features from scale-

invariant keypoints. International Journal of Com-

puter Vision, 60:91–.

Reda, F., Pottorff, R., Barker, J., and Catanzaro, B. (2017).

ﬂownet2-pytorch: Pytorch implementation of ﬂownet

2.0: Evolution of optical ﬂow estimation with deep

networks.

Schmidt, T., Newcombe, R. A., and Fox, D. (2017). Self-

supervised visual descriptor learning for dense corre-

spondence. IEEE Robotics and Automation Letters,

2:420–427.

Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A.,

and Fitzgibbon, A. (2013). Scene coordinate regres-

sion forests for camera relocalization in rgb-d images.

pages 2930–2937.

Sivic, J., Russell, B., Efros, A., Zisserman, A., and Free-

man, W. (2005). Discovering objects and their lo-

cation in images. IEEE International Conference on

Computer Vision, pages 370–377.

Sudderth, E. B., Torralba, A., Freeman, W. T., and Willsky,

A. S. (2005). Describing visual scenes using trans-

formed dirichlet processes. In NIPS.

Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. (2017). Pwc-

net: Cnns for optical ﬂow using pyramid, warping,

and cost volume.

Taylor, J., Shotton, J., Sharp, T., and Fitzgibbon, A. (2012).

The vitruvian manifold: Inferring dense correspon-

dences for one-shot human pose estimation. vol-

ume 10.

Wang, J., song, Y., Leung, T., Rosenberg, C., Wang, J.,

Philbin, J., Chen, B., and Wu, Y. (2014). Learning

ﬁne-grained image similarity with deep ranking. Pro-

ceedings of the IEEE Computer Society Conference on

Computer Vision and Pattern Recognition.

Wang, X. and Gupta, A. (2015). Unsupervised learning of

visual representations using videos. 2015 IEEE In-

ternational Conference on Computer Vision (ICCV),

pages 2794–2802.

Zhang, R., Lin, L., Zhang, R., Zuo, W., and Zhang, L.

(2015). Bit-scalable deep hashing with regularized

similarity learning for image retrieval and person re-

identiﬁcation. IEEE Transactions on Image Process-

ing, 24:4766–4779.

Visual Descriptor Learning from Monocular Video

451