Multi-stream CNN based Video Semantic Segmentation for Automated Driving
Ganesh Sistu¹, Sumanth Chennupati² and Senthil Yogamani¹
¹Valeo Vision Systems, Ireland
²Valeo, Troy, U.S.A.
Keywords: Semantic Segmentation, Visual Perception, Automated Driving.
Abstract: The majority of semantic segmentation algorithms operate on a single frame even in the case of videos. In this work, the goal is to exploit temporal information within the algorithm model to leverage motion cues and temporal consistency. We propose two simple high-level architectures based on Recurrent FCN (RFCN) and Multi-Stream FCN (MSFCN) networks. In the case of RFCN, a recurrent network, namely an LSTM, is inserted between the encoder and decoder. MSFCN combines the encoders of different frames into a fused encoder via 1x1 channel-wise convolution. We use a ResNet50 network as the baseline encoder and construct three networks, namely MSFCN of order 2 & 3 and RFCN of order 2. MSFCN-3 produces the best results, with accuracy improvements of 9% and 15% for the Highway and New York-like City scenarios in the SYNTHIA-CVPR'16 dataset using the mean IoU metric. MSFCN-3 also produced improvements of 11% and 6% over the baseline FCN network on the SegTrack V2 and DAVIS datasets. We also designed efficient versions of MSFCN-2 and RFCN-2 using weight sharing between the two encoders. The efficient MSFCN-2 provided improvements of 11% and 5% for KITTI and SYNTHIA with a negligible increase in computational complexity compared to the baseline version.
1 INTRODUCTION
Semantic segmentation provides complete semantic scene understanding wherein each pixel in an image is assigned a class label. It has applications in various fields including automated driving (Horgan et al., 2015) (Heimberger et al., 2017), augmented reality and medical image processing. Our work focuses on semantic segmentation applied to automated driving, which is discussed in detail in the survey paper (Siam et al., 2017a). Due to advancements in deep learning, this algorithm has recently matured to an accuracy sufficient for commercial deployment. Most of the standard architectures make use of a single frame even when the algorithm is run on a video sequence. Efficient real-time semantic segmentation architectures are an important aspect of automated driving (Siam et al., 2018). Automated driving videos exhibit strong temporal continuity and constant ego-motion of the camera, which can be exploited within the semantic segmentation model. This inspired us to explore temporal video semantic segmentation. This paper is an extension of our previous work on RFCN (Siam et al., 2017c).
In this paper, we propose two types of architectures, namely Recurrent FCN (RFCN) and Multi-Stream FCN (MSFCN), inspired by FCN and Long Short-Term Memory (LSTM) networks. Multi-stream architectures were first introduced in (Simonyan and Zisserman, 2014), in which a two-stream CNN was proposed for action recognition. They have also been used successfully for other applications like optical flow (Ilg et al., 2016), moving object detection (Siam et al., 2017b) and depth estimation (Ummenhofer et al., 2016). However, to the best of our knowledge, this has not been explored for semantic segmentation using consecutive video frames. The main motivation is to leverage temporal continuity in video streams. In RFCN, the FCN encoder features are processed temporally using an LSTM network. In the MSFCN architecture, we combine the encoders of the current and previous frames to produce a new fused encoder with the same feature map dimensions, which enables keeping the same decoder.
The list of contributions includes:
• Design of RFCN & MSFCN architectures that extend semantic segmentation models to videos.
• Exploration of weight sharing among encoders for computational efficiency.
• Implementation of an end-to-end training method for spatio-temporal video segmentation.
• Detailed experimental analysis of video semantic segmentation on the automated driving datasets KITTI & SYNTHIA and of binary video segmentation on the DAVIS & SegTrack V2 datasets.
Figure 1: Comparison of different approaches to extend semantic segmentation to videos - a) Frame-level output b) Detect and track c) Temporal post processing d) Recurrent encoder model and e) Fused multi-stream encoder model.
The rest of the paper is structured as follows. Section 2 discusses different approaches for extending semantic segmentation to videos. Section 3 explains the different multi-stream architectures designed in this work. Experimental setup and results are discussed in Section 4. Finally, Section 5 provides the conclusion and future work.
2 EXTENDING SEMANTIC
SEGMENTATION TO VIDEOS
In this section, we provide motivation for incorporating temporal models in automated driving and explain different high-level methods to accomplish this. Motion is a dominant cue in automated driving due to the persistent motion of the vehicle on which the camera is mounted. The objects of interest in automotive scenes are split into static infrastructure, such as road and traffic signs, and interacting dynamic objects, such as vehicles and pedestrians. The main challenges are posed by the uncertain behavior of dynamic objects. Dense optical flow is commonly used to detect moving objects purely based on motion cues. Recently, HD maps have become a commonly used cue, enabling detection of static infrastructure that has been previously mapped and encoded. In this work, we explore the use of temporal continuity to improve accuracy by implicitly learning motion cues and tracking. Fig 1 illustrates the various types of temporal models, i.e. the different ways to extend an image-based segmentation algorithm to videos.
Single Frame Baseline: Fig 1 (a) illustrates the typical approach in which the detector is run on every frame independently. This is the reference baseline for measuring the accuracy improvements of the other methods.
Detect and Track Approach: The premise of this approach is to leverage the previously obtained semantic segmentation estimate, as the next frame has only changed incrementally. This can reduce the computational complexity significantly, as a lighter model can be employed to refine the previous semantic segmentation output for the current frame. The high-level block diagram is illustrated in Fig 1 (b). This approach has been used successfully for bounding-box object detection, where tracking can even help when the detector fails in certain frames. However, it is difficult to model for semantic segmentation, as the output representation is quite complex and it is challenging to handle the appearance of new regions in the next frame.
Temporal Post Processing: The third approach is to use a post-processing filter on the output estimates to smooth out the noise. Probabilistic Graphical Models (PGM) like Conditional Random Fields (CRF) are commonly used to accomplish this. The block diagram of this method is shown in Fig 1 (c), where recurrence is built on the output. This step is computationally complex because the recurrence operates on the image dimensions, which are large.
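As a toy illustration of this idea (and not of the heavier CRF-based models typically used here), the sketch below exponentially smooths the per-pixel class probabilities over time; the smoothing factor is a hypothetical choice.

```python
import numpy as np

def temporal_smooth(prob_t, prob_prev_smoothed, alpha=0.7):
    """Exponentially smooth per-pixel class probabilities over time.

    prob_t:             (H, W, C) softmax output for the current frame.
    prob_prev_smoothed: (H, W, C) smoothed output of the previous frame,
                        or None for the first frame of the sequence.
    alpha:              weight on the current frame (illustrative value).
    """
    if prob_prev_smoothed is None:
        return prob_t
    smoothed = alpha * prob_t + (1.0 - alpha) * prob_prev_smoothed
    # Guard against numerical drift by renormalising per pixel.
    return smoothed / smoothed.sum(axis=-1, keepdims=True)

# labels_t = temporal_smooth(prob_t, prob_prev_smoothed).argmax(axis=-1)
```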
Recurrent Encoder Model: In this approach, the intermediate feature maps from the encoders are fed into a recurrent unit. The recurrent unit in the network can be an RNN, an LSTM or a GRU. The resulting features are then fed to a decoder which outputs semantic labels. In Fig 2d, the ResNet50 conv5 features from consecutive image streams are passed as temporal features to an LSTM network. While the conv4 and conv3 features could also be processed by the LSTM layer, to keep the architecture simple and memory efficient, the conv4 and conv3 features from the two streams are instead concatenated and followed by a convolution layer.
Fused Multi-stream Encoder Model: This method can, in some sense, be seen as a special case of the recurrent model. However, the multi-stream encoder perspective enables the design of new architectures. As this is the main contribution of this work, we describe it in more detail in the next section.
3 PROPOSED CNN
ARCHITECTURES
In this section, we discuss the details of the proposed multi-stream networks shown in Fig 2b, 2c & 2d. The multi-stream fused architectures (MSFCN-2 & MSFCN-3) concatenate the outputs of the individual encoders and fuse them via 1x1 channel-wise convolutions to obtain a fused encoder, which is then fed to the decoder. The recurrent architecture (RFCN) uses an LSTM unit to feed the decoder.
Single Stream Architecture: A fully convolutional network (FCN), shown in Fig 2a and inspired by (Long et al., 2015), is used as the baseline architecture. We use ResNet50 (He et al., 2015) as the encoder and conventional up-sampling with skip connections to predict pixel-wise labels. Initializing the model with pre-trained ResNet50 weights alleviates over-fitting, as these weights are the result of training on a much larger dataset, namely ImageNet.
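A minimal tf.keras sketch of such a baseline is shown below. The ResNet50 stage layer names follow the current tf.keras implementation and may differ across Keras versions, and the decoder filter counts, kernel sizes and class count are illustrative choices rather than the exact configuration used in the paper.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

def build_fcn(input_shape=(224, 384, 3), num_classes=12):
    """FCN-style baseline: ResNet50 encoder with a skip-connection decoder."""
    encoder = ResNet50(include_top=False, weights='imagenet',
                       input_shape=input_shape)
    # Assumed stage names (tf.keras ResNet50): 1/8, 1/16 and 1/32 resolution.
    conv3 = encoder.get_layer('conv3_block4_out').output
    conv4 = encoder.get_layer('conv4_block6_out').output
    conv5 = encoder.get_layer('conv5_block3_out').output

    # Decoder: up-sample step by step and merge the skip connections.
    x = layers.Conv2DTranspose(256, 4, strides=2, padding='same')(conv5)
    x = layers.Concatenate()([x, layers.Conv2D(256, 1)(conv4)])
    x = layers.Conv2DTranspose(128, 4, strides=2, padding='same')(x)
    x = layers.Concatenate()([x, layers.Conv2D(128, 1)(conv3)])
    x = layers.Conv2DTranspose(num_classes, 16, strides=8, padding='same')(x)
    return Model(encoder.input, layers.Softmax()(x), name='fcn_baseline')
```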
Multi-stream Fused Architectures: The Multi-Stream FCN architecture is illustrated in Fig 2b & 2c. We use multiple ResNet50 encoders to construct the multi-stream architectures. Consecutive input frames are processed independently by the ResNet50 encoders. The intermediate feature maps obtained at 3 different stages (conv3, conv4 and conv5) of each encoder are concatenated and added to the up-sampling layers of the decoder. MSFCN-2 is constructed using 2 encoders while MSFCN-3 uses 3 encoders. A channel-wise 1x1 convolution is applied to fuse the multiple encoder streams into a single one of the same dimensions. This enables the use of the same decoder.
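As a concrete illustration, the tf.keras sketch below shows the fusion step for two streams, assuming the per-frame encoder feature maps (conv3/conv4/conv5 outputs, as in the baseline sketch above) are already available, one encoder per stream in the generic MSFCN; the activation choice and layer naming are our own and not taken from the paper.

```python
from tensorflow.keras import layers

def fuse_streams(stream_features, name):
    """Fuse same-stage feature maps from consecutive frames into one map.

    stream_features: list of tensors, e.g. [conv5_frame_t, conv5_frame_t-1],
    all of identical shape (H, W, C). The concatenation is mapped back to
    C channels by a 1x1 convolution, so the fused encoder has the same
    dimensions as a single stream and the baseline decoder can be reused.
    """
    channels = int(stream_features[0].shape[-1])
    x = layers.Concatenate(name=name + '_concat')(stream_features)
    return layers.Conv2D(channels, 1, activation='relu',
                         name=name + '_fuse_1x1')(x)

# MSFCN-2 / MSFCN-3 fuse each skip stage before decoding, for example:
# conv5_fused = fuse_streams([conv5_t, conv5_tm1], 'conv5')
# conv4_fused = fuse_streams([conv4_t, conv4_tm1], 'conv4')
# conv3_fused = fuse_streams([conv3_t, conv3_tm1], 'conv3')
```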
Multi-stream Recurrent Architecture: A recurrent fully convolutional network (RFCN) is designed to incorporate a recurrent network into a convolutional encoder-decoder architecture. It is illustrated in Fig 2d. We use the generic LSTM recurrent unit, which can be specialized to simpler RNNs and GRUs. The LSTM operates over the encoders of the previous N frames and produces a filtered encoder of the same dimensions, which is then fed to the decoder.
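A minimal tf.keras sketch of this recurrent fusion is given below, under the assumption that a convolutional LSTM (ConvLSTM2D) stands in for the LSTM unit so that the spatial layout of the conv5 feature maps is preserved; the kernel sizes and the treatment of the conv3/conv4 skips follow the description in Section 2, but all hyper-parameters are illustrative choices.

```python
import tensorflow as tf
from tensorflow.keras import layers

def recurrent_fuse_conv5(conv5_prev, conv5_cur):
    """Filter the conv5 features of consecutive frames with a ConvLSTM.

    Both inputs have shape (H/32, W/32, C); the output keeps that shape,
    so the baseline decoder can be reused unchanged. Matching the channel
    count C is a heavy choice; a smaller recurrent state plus a 1x1
    convolution would also be possible.
    """
    channels = int(conv5_cur.shape[-1])
    # Stack the two frames along a new time axis: (batch, 2, H/32, W/32, C).
    seq = layers.Lambda(lambda t: tf.stack(t, axis=1))([conv5_prev, conv5_cur])
    return layers.ConvLSTM2D(channels, 3, padding='same',
                             return_sequences=False)(seq)

def fuse_skip(skip_prev, skip_cur):
    """conv3/conv4 skips from the two streams: concatenate + convolution,
    kept simple to limit memory (as described in Section 2)."""
    channels = int(skip_cur.shape[-1])
    x = layers.Concatenate()([skip_prev, skip_cur])
    return layers.Conv2D(channels, 3, padding='same', activation='relu')(x)
```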
Figure 2: Four types of architectures constructed and tested in the paper: (a) FCN: single stream baseline, (b) MSFCN-2: two stream fusion architecture, (c) MSFCN-3: three stream fusion architecture and (d) RFCN-2: two stream LSTM architecture.
Weight Sharing Across Encoders: The generic form of the multi-stream architectures has different weights for the different encoders. In Fig 1 (e), the three encoders can be different and have to be recomputed every frame. Thus the computational complexity of the encoder increases by a factor of three. However, if the weights are shared between the encoders, there is no need to recompute them for each frame.
Table 1: Semantic Segmentation Results on SYNTHIA Sequences. We split the test sequences into two parts: Highway for high speeds and City for medium speeds.
Dataset  Architecture  Mean IoU  Sky   Building  Road  Sidewalk  Fence  Vegetation  Pole  Car   Lane
Highway  FCN           85.42     0.91  0.67      0.89  0.02      0.71   0.79        0.01  0.81  0.72
Highway  MSFCN-2       93.44     0.92  0.66      0.94  0.28      0.85   0.78        0.11  0.82  0.71
Highway  RFCN-2        94.17     0.93  0.71      0.95  0.31      0.82   0.83        0.13  0.87  0.70
Highway  MSFCN-3       94.38     0.93  0.69      0.96  0.31      0.87   0.81        0.12  0.87  0.72
City     FCN           73.88     0.94  0.94      0.72  0.78      0.34   0.54        0.00  0.69  0.56
City     MSFCN-2       87.77     0.87  0.94      0.84  0.83      0.68   0.64        0.00  0.80  0.80
City     RFCN-2        88.24     0.91  0.92      0.87  0.78      0.56   0.67        0.00  0.80  0.74
City     MSFCN-3       88.89     0.88  0.89      0.86  0.74      0.64   0.53        0.00  0.71  0.72
Table 2: Semantic Segmentation Results on KITTI Video Sequence.
Architecture NumParams Mean IoU Sky Building Road Sidewalk Fence Vegetation Car Sign
FCN 23,668,680 74.00 46.18 86.50 80.60 69.10 37.25 81.94 74.35 35.11
MSFCN-2 (shared weights) 23,715,272 85.31 47.89 91.08 97.58 88.02 62.60 92.01 90.26 58.11
RFCN-2 (shared weights) 31,847,828 84.19 50.20 93.74 94.90 88.17 59.73 87.73 87.66 55.55
MSFCN-2 47,302,984 85.47 48.72 92.29 96.36 90.21 59.60 92.43 89.27 70.47
RFCN-2 55,435,540 83.38 44.80 92.84 91.77 91.67 58.53 86.01 87.25 52.87
Table 3: Semantic Segmentation Results on SYNTHIA Video Sequence.
Architecture Mean IoU Sky Building Road Sidewalk Fence Vegetation Pole Car Sign Pedestrian Cyclist Lane
FCN 84.08 97.2 92.97 87.74 81.58 34.44 62 1.87 72.75 0.21 0.01 0.33 93.08
MSFCN-2 (shared) 88.88 97.08 93.14 93.58 86.81 47.47 75.11 46.78 88.22 0.27 32.12 2.27 95.26
RFCN-2 (shared) 88.16 96.85 91.07 94.17 85.62 28.29 83.2 47.28 87.6 19.12 16.89 3.01 93.97
MSFCN-2 90.01 97.34 95.97 93.14 86.76 73.52 73.63 35.02 87.86 3.62 27.57 1.11 95.35
RFCN-2 89.48 97.15 94.01 93.76 85.88 76.26 70.35 39.86 87.5 8.16 28.05 1.28 94.67
One encoder feature extraction per frame suffices, and the fused encoder is computed by combining the previously computed encoder outputs. This weight-sharing approach drastically reduces the complexity, with negligible additional computation relative to the single-stream encoder. We demonstrate experimentally that the weight-shared encoder still provides a significant improvement in accuracy.
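A minimal sketch of the shared-weight variant in tf.keras is shown below: a single encoder model is applied to both frames, so the weights are shared, and at inference time the previous frame's features can simply be cached. The stage layer names follow the current tf.keras ResNet50 implementation and remain an assumption.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50

def build_shared_encoder(input_shape=(224, 384, 3)):
    """One ResNet50-based feature extractor reused for every frame."""
    base = ResNet50(include_top=False, weights='imagenet',
                    input_shape=input_shape)
    outputs = [base.get_layer(name).output for name in
               ('conv3_block4_out', 'conv4_block6_out', 'conv5_block3_out')]
    return Model(base.input, outputs, name='shared_encoder')

encoder = build_shared_encoder()
frame_t = layers.Input(shape=(224, 384, 3), name='frame_t')
frame_tm1 = layers.Input(shape=(224, 384, 3), name='frame_t_minus_1')
feats_t = encoder(frame_t)      # same weights ...
feats_tm1 = encoder(frame_tm1)  # ... applied to the previous frame

# feats_t / feats_tm1 are fused exactly as in the MSFCN-2 sketch above;
# at inference, feats_tm1 is simply the cached feats_t of the previous step.
```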
4 EXPERIMENTS
In this section, we explain the experimental setting, including the datasets used and the training details, and discuss the results.
4.1 Experimental Setup
In most datasets, the frames of a video sequence are sparsely sampled in time to obtain better diversity of objects. Thus consecutive video frames are not provided for training our multi-stream algorithm. Synthetic datasets have no annotation cost and ground truth is available for all consecutive frames. Hence we made use of the synthetic autonomous driving dataset SYNTHIA (Ros et al., 2016) for our experiments. We also made use of DAVIS 2017 (Pont-Tuset et al., 2017) and SegTrack V2 (Li et al., 2013), which provide consecutive frames; they are not automotive datasets but contain realistic imagery.
We implemented the different proposed multi-stream architectures using Keras (Chollet et al., 2015). We used the ADAM optimizer as it provided faster convergence. The maximum order (number of consecutive frames) used in training is three (MSFCN-3) because of the memory required for training. Categorical cross-entropy is used as the loss function for the optimizer. The maximum number of training epochs is set to 30 and early stopping with a patience of 10 epochs, monitoring the gains, is added. Mean class IoU and per-class IoU are used as accuracy metrics. All input images are resized to 224x384 because of the memory required by multiple streams.
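A minimal training-configuration sketch matching the setup described above (ADAM, categorical cross-entropy, at most 30 epochs, early stopping with a patience of 10 epochs, 224x384 inputs) is given below; the monitored quantity, batch size and data tensors are placeholders, and `build_fcn` refers to the baseline sketch in Section 3.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Any of the sketched models can be used here; build_fcn() is the baseline
# sketch from Section 3 and serves only as an example.
model = build_fcn(input_shape=(224, 384, 3))
model.compile(optimizer='adam',                 # ADAM optimizer
              loss='categorical_crossentropy',  # expects one-hot encoded masks
              metrics=['accuracy'])

# Early stopping on the validation loss is an assumption; the paper only
# states a patience of 10 epochs while monitoring the gains.
early_stop = EarlyStopping(monitor='val_loss', patience=10,
                           restore_best_weights=True)

# model.fit(train_images, train_masks,
#           validation_data=(val_images, val_masks),
#           epochs=30, batch_size=4, callbacks=[early_stop])
```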
4.2 Experimental Results and
Discussion
We performed four sets of experiments, summarized in four tables. Qualitative results are provided in Figure 4 for KITTI, Figure 5 for DAVIS and Figure 6 for SYNTHIA. We also provide a video sequence demonstrating qualitative results for a larger set of frames.
Table 1: Firstly, we wanted to evaluate different multi-stream orders and understand their impact. We also wanted to understand the impact on high-speed and medium-speed scenarios. The SYNTHIA dataset was used for this experiment as it separates sequences by speed and is also relatively large. The two-stream networks provided a considerable increase in accuracy compared to the baseline. MSFCN-2 provided an accuracy improvement of 8% for the Highway sequence and 14% for the City sequence. RFCN-2 provided slightly better accuracy than MSFCN-2. MSFCN-3 provided only a marginal improvement over MSFCN-2, so we did not explore higher orders.
Table 2: KITTI is a popular automotive dataset, so we used it to run experiments on real-life automated driving data. We restricted these experiments to MSFCN-2 and RFCN-2 but added shared-weight versions of both. MSFCN-2 provided an accuracy improvement of 11% and the shared-weight version lagged behind only slightly.
Table 3: We repeated the experiments with the same networks used in Table 2 on a larger SYNTHIA sequence. MSFCN-2 provided an accuracy improvement of 6% in mean IoU. MSFCN-2 with shared weights lagged by 1%. The RFCN-2 versions had slightly lower accuracy than their MSFCN-2 counterparts, both with and without weight sharing.
Table 4: As most automotive semantic segmentation datasets do not provide consecutive frames for temporal models, we tested on real non-automotive datasets, namely SegTrack and DAVIS. MSFCN-3 provided an accuracy improvement of 11% on SegTrack and 6% on DAVIS. This demonstrates that the constructed networks provide consistent improvements across various datasets.
Figure 3: Accuracy over epochs for SYNTHIA dataset.
Table 4: Comparison of the multi-stream network with its baseline counterpart on DAVIS and SegTrack.
Dataset      Architecture                Mean IoU
SegTrack V2  FCN                         83.82
SegTrack V2  MSFCN-3                     94.61
DAVIS        FCN                         77.64
DAVIS        MSFCN-3                     83.42
DAVIS        BVS (Märki et al., 2016)    66.52
DAVIS        FCP (Perazzi et al., 2015)  63.14
We have chosen a moderately sized base encoder, namely ResNet50, and we will experiment with other sizes like ResNet10 and ResNet101 in future work. In general, the multi-stream approach provides a significant improvement in accuracy for this moderately sized encoder. The improvements might be larger for smaller networks, which are less accurate. With a shared-weights encoder, the increase in computational complexity is minimal. However, it increases memory usage and memory bandwidth quite significantly due to the maintenance of additional encoder feature maps. It also increases the output latency by 33 ms for a 30 fps video sequence. From visual inspection, the improvements are seen mainly in refining boundaries and detecting smaller regions. This is likely due to temporal aggregation of the feature maps of each pixel from past frames.
MSFCN vs FCN: The single-frame FCN struggles to segment weaker classes like poles and objects at further distances. Table 3 shows that the IoU metrics for weaker classes like Pole, Fence and Sidewalk improve significantly for the multi-stream networks compared to the single-stream FCN. Fig 4 visually demonstrates that the temporal encoder modules help preserve small structures and boundaries in the segmentation.
MSFCN-2 vs MSFCN-3: The increase in temporal information clearly improves segmentation performance, but it adds extra latency for real-time applications.
MSFCN-2 vs RFCN: For a multi-stream network, recurrent fusion of the encoder features shows a decent improvement over the feature concatenation technique. It is also observed that the recurrent networks help preserve the boundaries of weaker classes like poles and lane markings. However, RFCN requires more parameters and takes longer to converge during training, as shown in Fig 3.
Weight Sharing: In most of the experiments, MSFCN-2 with shared weights provided a good improvement over the baseline, and its performance deficit relative to the generic MSFCN-2 is usually small, around 1%. However, the shared-weights version drastically reduces computational complexity, as shown by the number of parameters in Table 2. Shared-weights MSFCN-2 adds a negligible number of parameters and negligible computational complexity, whereas the generic MSFCN-2 doubles the number of parameters. Thus it is important to make use of weight sharing.
5 CONCLUSIONS
In this paper, we designed and evaluated two video semantic segmentation architectures, namely Recurrent FCN (RFCN) and Multi-Stream FCN (MSFCN) networks, to exploit temporal information. We implemented three architectures, namely RFCN-2, MSFCN-2 and MSFCN-3, using ResNet50 as the base encoder and evaluated them on SYNTHIA sequences. We obtained promising improvements of 9% and 15% for the Highway and New York-like city scenarios over the baseline network. We also tested MSFCN-3 on real datasets, namely SegTrack V2 and DAVIS, where accuracy improvements of 11% and 6% were achieved, respectively.
We also explored weight sharing among encoders for better efficiency, which produced improvements of 11% and 5% for KITTI and SYNTHIA using MSFCN-2 with roughly the same complexity as the baseline encoder. In future work, we plan to explore more sophisticated encoder fusion techniques.
Figure 4: Results on the KITTI dataset.
Figure 5: Results on the DAVIS dataset. Left to right: RGB image, Ground Truth, Single encoder (FCN), Two stream encoder (MSFCN-2), Two stream encoder + LSTM (RFCN-2), Three stream encoder (MSFCN-3).
Figure 6: Qualitative results of experiments with the SYNTHIA dataset. Left to right: RGB image, Single encoder (FCN), Two stream encoder (MSFCN-2), Ground Truth, Two stream encoder + LSTM (RFCN-2) and Three stream encoder (MSFCN-3).
REFERENCES
Chollet, F. et al. (2015). Keras. https://keras.io.
Heimberger, M., Horgan, J., Hughes, C., McDonald, J., and Yogamani, S. (2017). Computer vision in automated parking systems: Design, implementation and challenges. Image and Vision Computing, 68:88–101.
Horgan, J., Hughes, C., McDonald, J., and Yogamani, S. (2015). Vision-based driver assistance systems: Survey, taxonomy and advances. In Intelligent Transportation Systems (ITSC), 2015 IEEE 18th International Conference on, pages 2032–2039. IEEE.
Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., and Brox, T. (2016). FlowNet 2.0: Evolution of optical flow estimation with deep networks. arXiv preprint arXiv:1612.01925.
He, K., Zhang, X., Ren, S., and Sun, J. (2015). Deep residual learning for image recognition. CoRR, abs/1512.03385.
Li, F., Kim, T., Humayun, A., Tsai, D., and Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199.
Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440.
Märki, N., Perazzi, F., Wang, O., and Sorkine-Hornung, A. (2016). Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 743–751.
Perazzi, F., Wang, O., Gross, M., and Sorkine-Hornung, A. (2015). Fully connected object proposals for video segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3227–3234.
Pont-Tuset, J., Perazzi, F., Caelles, S., Arbeláez, P., Sorkine-Hornung, A., and Van Gool, L. (2017). The 2017 DAVIS challenge on video object segmentation. arXiv:1704.00675.
Ros, G., Sellart, L., Materzynska, J., Vazquez, D., and Lopez, A. M. (2016). The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Siam, M., Elkerdawy, S., Jagersand, M., and Yogamani, S. (2017a). Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In International Conference on Intelligent Transportation Systems (ITSC), pages 29–36. IEEE.
Siam, M., Gamal, M., Abdel-Razek, M., Yogamani, S., and Jagersand, M. (2018). RTSeg: Real-time semantic segmentation comparative study. arXiv preprint arXiv:1803.02758.
Siam, M., Mahgoub, H., Zahran, M., Yogamani, S., Jagersand, M., and El-Sallab, A. (2017b). MODNet: Moving object detection network with motion and appearance for autonomous driving. arXiv preprint arXiv:1709.04821.
Siam, M., Valipour, S., Jagersand, M., Ray, N., and Yogamani, S. (2017c). Convolutional gated recurrent networks for video semantic segmentation in automated driving. In International Conference on Intelligent Transportation Systems (ITSC), pages 29–36. IEEE.
Simonyan, K. and Zisserman, A. (2014). Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576.
Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Dosovitskiy, A., and Brox, T. (2016). DeMoN: Depth and motion network for learning monocular stereo. arXiv preprint arXiv:1612.02401.