scenes are promising for use in practical applications.
2 RELATED WORK
Here we review two major groups of prior studies relevant to ours: single-photo depth estimation and 3D mesh reconstruction from line drawings.
Single-photo Depth Estimation. Previous depth-
estimation studies have mostly focused on photos,
i.e., RGB images. Because both photos and our targets, i.e., line drawings, represent a scene rather than a single object, the literature on single-photo depth estimation is directly relevant to our study.
Most of the modern methods use DL because,
in contrast to traditional methods, they can auto-
matically extract appropriate features and are thus
more robust. Roy and Todorovic (Roy and Todor-
ovic, 2016) introduced the neural regression forest
for single-image depth estimation. Liu et al. proposed using additional modules to classify images into planar and non-planar regions and to regress plane equations (Liu et al., 2018; Liu et al., 2019).
Ramamonjisoa and Lepetit (Ramamonjisoa and Lep-
etit, 2019) used a classic network architecture (Ron-
neberger et al., 2015) and improved the depth estima-
tion quality by applying a novel edge-preserving loss
function. However, when naively applied to line drawings, these methods suffer from the severe lack of visual information in such inputs, as explained in Section 1.
3D Object Reconstruction from Line Drawings.
There exist several methods for reconstructing 3D
meshes from single line drawings. Due to the in-
herent ambiguity of 3D shape in line drawings, some
methods require different types of user annotations to
specify 3D shapes, e.g., (Li et al., 2017). Our method
learns to work with grayscale line drawings without
any additional user input.
Recent methods adopt CNNs. Lun et al. (Lun
et al., 2017) proposed a method to reconstruct a 3D
model from line drawings in two object views. How-
ever, the network requires an object class as an addi-
tional input, which constrains the number of possible
object classes and drastically limits free-form mod-
eling. To address free-form modeling, Li et al. (Li
et al., 2018) proposed smoothing ground-truth (GT)
3D meshes, thus making the CNN independent of shape features specific to exact 3D models. How-
ever, these approaches require contours to be explic-
itly specified. Zheng et al. (Zheng et al., 2020) pro-
posed a shading GAN which implicitly infers 3D in-
formation, but such information cannot be used di-
rectly and requires further processing. Our approach does not require specifying object contours or classes and infers final depth maps of whole scenes.
3 OUR METHOD
Our preliminary experiment revealed that our base-
line method (Liu et al., 2018) failed to estimate depths
solely from line drawings. This might be caused by
the lack of information, which leads to our key idea:
data enrichment. To enrich the input line drawings,
our method integrates three streams of networks for
coloring, initial depth, and normal estimation. Next,
our method obtains the final depth map by refining
the intermediate data. Figure 1 shows our depth estimation pipeline, which requires various data for training. Line drawings are required as the input to all modules. The data enrichment modules require the original RGB images, depth maps, and normal maps as ground truth. The refinement module requires ground-truth depth maps, planar segmentations, and plane equations.
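As a rough illustration of the required supervision, one training sample could be organized as follows; the field names, tensor shapes, and the cap of 10 planes per image are illustrative assumptions, not details of our implementation:

```python
import torch

def make_training_sample(h=192, w=256, max_planes=10):
    """Illustrative training sample; all values are random placeholders."""
    return {
        # Input line drawing: consumed by every module.
        "line": torch.rand(1, h, w),
        # Ground truth for the data enrichment modules.
        "rgb": torch.rand(3, h, w),      # colorization target
        "depth": torch.rand(1, h, w),    # initial-depth target (also used by refinement)
        "normals": torch.rand(3, h, w),  # normal-estimation target
        # Additional ground truth for the refinement module.
        "plane_seg": torch.randint(0, max_planes + 1, (h, w)),  # planar segmentation
        "plane_params": torch.rand(max_planes, 3),               # plane equations as 3D vectors d*n
    }
```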
3.1 Pix2pix Modules for Data
Enrichment
To tackle the shortage of detail in line drawings, we integrate three conditional GAN branches for colorization, initial depth estimation, and normal estimation. Specifically, we adopt the pix2pix architecture (Isola et al., 2017) for all of them. We train the first branch, edge2pix, to hallucinate the original RGB image. The second and third branches, edge2depth and edge2norm, are trained to estimate rough depth and normal maps, respectively.
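A minimal sketch of how the three branches could be instantiated, assuming a generic pix2pix-style generator constructor; the stand-in below is only a placeholder so the snippet runs, whereas the actual generators follow Isola et al. (2017) and are trained adversarially:

```python
import torch.nn as nn

def pix2pix_generator(in_ch, out_ch):
    # Placeholder for a U-Net generator (Isola et al., 2017); only the
    # input/output channel counts matter for this sketch.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

edge2pix   = pix2pix_generator(in_ch=1, out_ch=3)  # line drawing -> hallucinated RGB
edge2depth = pix2pix_generator(in_ch=1, out_ch=1)  # line drawing -> rough depth map
edge2norm  = pix2pix_generator(in_ch=1, out_ch=3)  # line drawing -> rough normal map
```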
After processing the input line drawing with the pix2pix modules, we concatenate the input line drawing, the hallucinated RGB image, and the initially estimated depth and normal maps. Next, we feed the concatenated result to the PlaneNet module.
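Continuing the sketch above, the concatenation and hand-off to the refinement module could look as follows; the refiner argument stands in for the PlaneNet module of Section 3.2, and the 8-channel count assumes a grayscale drawing, a 3-channel RGB image, a 1-channel depth map, and a 3-channel normal map:

```python
import torch

def enrich_and_refine(line, edge2pix, edge2depth, edge2norm, refiner):
    """Concatenate the line drawing with the hallucinated modalities and refine."""
    rgb = edge2pix(line)        # (B, 3, H, W) hallucinated RGB image
    depth0 = edge2depth(line)   # (B, 1, H, W) initial depth map
    normals = edge2norm(line)   # (B, 3, H, W) normal map
    # Channel-wise concatenation: 1 + 3 + 1 + 3 = 8 channels fed to the refiner.
    enriched = torch.cat([line, rgb, depth0, normals], dim=1)
    return refiner(enriched)    # final depth map (plus plane outputs)
```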
3.2 PlaneNet Module
To obtain the final depth maps from the initial depth maps and the intermediate data, we use the PlaneNet module (Liu et al., 2018). This module is based on a dilated version of ResNet (He et al., 2016; Yu et al., 2017) followed by three branches. The first branch regresses plane parameters represented as three-dimensional vectors dn, where d is the plane offset and n is the unit normal vector, which together define the plane equation n · x = d for points x on the plane. The second branch segments the image into planar regions and a single non-planar mask. The third branch