Keywords: Single-image Depth Prediction, CNN.
Abstract: With the recent surge of deep neural networks, depth prediction from a single image has seen substantial progress. Deep regression networks are typically learned from large data without many constraints on the scene structure, often leading to uncertainties at discontinuous regions. In this paper, we propose a structure-aware depth prediction method based on two observations: depth is relatively smooth within the same object, and it is usually easier to model relative depth than to model absolute depth from scratch. Our network first predicts an initial depth map and then takes an object saliency map as an additional input to guide depth refinement. Specifically, a stable anchor depth is first estimated from the detected salient objects, and the learning objective penalizes the difference between the predicted and correct relative depth with respect to this anchor. We show that such a saliency-guided relative depth constraint unveils helpful scene structures, leading to significant gains on the RGB-D saliency dataset NLPR and the depth prediction dataset NYU V2. Furthermore, our method is appealing in that it can be plugged into any depth network, is trained end-to-end, and incurs no extra time overhead during testing.
1 INTRODUCTION
Depth prediction plays an essential role in
understanding the 3D geometry of a scene. It has
been shown that depth information can largely
facilitate other vision tasks such as reconstruction
(Silberman et al. 2012), recognition (Fox 2012) and
semantic segmentation (Cheng et al. 2017). Stereo
images (Kong and Black 2016) or image sequences (Suwajanakorn, Hernandez, and Seitz 2015) usually suffice for accurate depth prediction. Single-view depth prediction, in contrast, is an ill-posed problem due to the lack of geometric information, and ambiguities or uncertainties often arise at discontinuous regions between objects.
In this paper, we propose a structure-aware
depth prediction method based on two observations:
1) depth is relatively smooth within the same object when compared to the full image; and 2) it is often easier to model relative depth than absolute depth.
We incorporate these observations by learning to refine the depth map under the guidance of object saliency (we use the saliency detector of Tong et al. (2015)). Generally, an object saliency map is a simple way to reveal the scene structure in terms of objects. It is also class-agnostic and thus covers a broad range of objects, each with spatially smooth depth values within its contour. As a result, we can first estimate an anchor depth value for the whole scene from an initial depth map reweighted by object saliency. Since almost all object regions have small variance in depth values, this anchor depth estimate can act as a reliable reference. We then use the anchor depth to refine the depth of the entire image in a relative way.
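As a concrete illustration, one plausible instantiation of the anchor depth (the notation here is ours, and the exact weighting is a design choice) is the saliency-weighted mean of the initial depth map:

\bar{d} = \frac{\sum_i s_i \hat{d}_i}{\sum_i s_i},

where \hat{d}_i is the initial depth prediction at pixel i and s_i \in [0,1] is its object saliency. High-saliency (object) pixels, whose depth is relatively smooth, thus dominate the estimate.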
Chen et al. (2016) previously explored relative depth estimation, but both their depth ground truth and predictions are merely ordinal, not real depth values. Here we instead design a relative depth loss that lets the network learn the genuine relative depth of every pixel with respect to the estimated anchor depth, penalizing any deviation from the correct relative depth.
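As a sketch of such a loss (one plausible form under our notation; the choice of penalty function is an assumption), let d_i and d_i^* denote the predicted and ground-truth depth at pixel i, and \bar{d} and \bar{d}^* the corresponding anchor depths; the loss then penalizes

L_{rel} = \sum_i \rho\big((d_i - \bar{d}) - (d_i^* - \bar{d}^*)\big),

where \rho(\cdot) is a robust penalty such as the L1 norm.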
Fig. 1 illustrates our overall learning framework. During training, we propose two formulations of relative depth constraints to supervise the depth refinement process. At test time, the fine-tuned network is applied directly to depth prediction without any overhead. Our method can therefore be plugged into any depth network and easily trained end-to-end (a minimal training sketch is given at the end of this section). We show that our saliency-guided depth model does learn scene structures, leading to significant gains on the RGB-D saliency dataset NLPR and the depth prediction dataset NYU V2.
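To make the pluggability concrete, the following is a minimal PyTorch-style sketch of how a saliency-guided relative depth loss could be attached to an arbitrary depth network. The names (depth_net, saliency, lam) and the L1 penalty are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def anchor_depth(depth, saliency, eps=1e-6):
    """Saliency-weighted mean depth over the image (one plausible form)."""
    # depth, saliency: (B, 1, H, W) tensors, saliency values in [0, 1]
    d = depth.flatten(1)
    s = saliency.flatten(1)
    return (s * d).sum(dim=1) / (s.sum(dim=1) + eps)  # shape (B,)

def relative_depth_loss(pred, gt, saliency):
    """Penalize deviation of the predicted relative depth (w.r.t. the
    anchor) from the ground-truth relative depth; L1 is illustrative."""
    a_pred = anchor_depth(pred, saliency).view(-1, 1, 1, 1)
    a_gt = anchor_depth(gt, saliency).view(-1, 1, 1, 1)
    return F.l1_loss(pred - a_pred, gt - a_gt)

# Hypothetical training step for any depth network `depth_net`:
# pred = depth_net(image)  # initial depth map, shape (B, 1, H, W)
# loss = F.l1_loss(pred, gt) + lam * relative_depth_loss(pred, gt, saliency)
# loss.backward(); optimizer.step()
```

Because the loss only adds a term to the training objective, the base network and its test-time inference path are left unchanged, which is what makes the approach overhead-free at test time.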