scenes are promising for use in practical applications.
2 RELATED WORK
Here we review two major groups of prior studies relevant to ours: single-photo depth estimation and 3D mesh reconstruction from line drawings.
Single-photo Depth Estimation. Previous depth-
estimation studies have mostly focused on photos,
i.e., RGB images. Because both photos and our targets, i.e., line drawings, represent a scene rather than a single object, the literature on single-photo depth estimation is directly relevant to our study.
Most of the modern methods use DL because,
in contrast to traditional methods, they can auto-
matically extract appropriate features and are thus
more robust. Roy and Todorovic (Roy and Todor-
ovic, 2016) introduced the neural regression forest
for single-image depth estimation. Liu et al. proposed using additional modules to classify images into planar and non-planar regions and to regress plane equations (Liu et al., 2018; Liu et al., 2019).
Ramamonjisoa and Lepetit (Ramamonjisoa and Lep-
etit, 2019) used a classic network architecture (Ron-
neberger et al., 2015) and improved the depth estima-
tion quality by applying a novel edge-preserving loss
function. However, when naively applied to line drawings, these methods suffer from the severe lack of visual information in such inputs, as explained in Section 1.
3D Object Reconstruction from Line Drawings.
There exist several methods for reconstructing 3D
meshes from single line drawings. Due to the in-
herent ambiguity of 3D shape in line drawings, some
methods require different types of user annotations to
specify 3D shapes, e.g., (Li et al., 2017). Our method
learns to work with grayscale line drawings without
any additional user input.
Recent methods adopt CNNs. Lun et al. (Lun
et al., 2017) proposed a method to reconstruct a 3D
model from line drawings in two object views. How-
ever, the network requires an object class as an addi-
tional input, which constrains the number of possible
object classes and drastically limits free-form mod-
eling. To address free-form modeling, Li et al. (Li
et al., 2018) proposed smoothing ground-truth (GT)
3D meshes, thus making the CNN independent of shape features specific to exact 3D models. How-
ever, these approaches require contours to be explic-
itly specified. Zheng et al. (Zheng et al., 2020) pro-
posed a shading GAN which implicitly infers 3D in-
formation, but such information cannot be used di-
rectly and requires further processing. Our approach does not require specifying object contours or classes and infers final depth maps of whole scenes.
3 OUR METHOD
Our preliminary experiment revealed that our base-
line method (Liu et al., 2018) failed to estimate depths
solely from line drawings. This might be caused by
the lack of information, which leads to our key idea:
data enrichment. To enrich the input line drawings,
our method integrates three streams of networks for
coloring, initial depth, and normal estimation. Next,
our method obtains the final depth map by refining
the intermediate data. Figure 1 shows our depth estimation pipeline, which requires various data for training. Line drawings are required as the input to all modules. The data enrichment modules require the original RGB images, depth maps, and normal maps as ground truth. The refinement module requires ground-truth depth maps, planar segmentations, and plane equations.
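As a rough illustration of the required supervision, one training sample could be organized as follows; the field names, tensor shapes, and the cap of 10 planes per image are illustrative assumptions, not details of our implementation:

```python
import torch

def make_training_sample(h=192, w=256, max_planes=10):
    """Illustrative training sample; all values are random placeholders."""
    return {
        # Input line drawing: consumed by every module.
        "line": torch.rand(1, h, w),
        # Ground truth for the data enrichment modules.
        "rgb": torch.rand(3, h, w),      # colorization target
        "depth": torch.rand(1, h, w),    # initial-depth target (also used by refinement)
        "normals": torch.rand(3, h, w),  # normal-estimation target
        # Additional ground truth for the refinement module.
        "plane_seg": torch.randint(0, max_planes + 1, (h, w)),  # planar segmentation
        "plane_params": torch.rand(max_planes, 3),               # plane equations as 3D vectors d*n
    }
```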
3.1 Pix2pix Modules for Data
Enrichment
To tackle the shortage of detail in line drawings, we integrate three conditional GAN branches for colorization, initial depth estimation, and normal estimation. Specifically, we adopt the pix2pix architecture (Isola et al., 2017) for all of them. We train the first branch, edge2pix, to hallucinate the original RGB image. The second and third branches, edge2depth and edge2norm, are trained to estimate rough depth and normal maps, respectively.
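A minimal sketch of how the three branches could be instantiated, assuming a generic pix2pix-style generator constructor; the stand-in below is only a placeholder so the snippet runs, whereas the actual generators follow Isola et al. (2017) and are trained adversarially:

```python
import torch.nn as nn

def pix2pix_generator(in_ch, out_ch):
    # Placeholder for a U-Net generator (Isola et al., 2017); only the
    # input/output channel counts matter for this sketch.
    return nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

edge2pix   = pix2pix_generator(in_ch=1, out_ch=3)  # line drawing -> hallucinated RGB
edge2depth = pix2pix_generator(in_ch=1, out_ch=1)  # line drawing -> rough depth map
edge2norm  = pix2pix_generator(in_ch=1, out_ch=3)  # line drawing -> rough normal map
```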
After processing the input line drawing with the pix2pix modules, we concatenate the input line drawing, the hallucinated RGB image, and the initially estimated depth and normal maps. Next, we feed the concatenated result to the PlaneNet module.
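Continuing the sketch above, the concatenation and hand-off to the refinement module could look as follows; the refiner argument stands in for the PlaneNet module of Section 3.2, and the 8-channel count assumes a grayscale drawing, a 3-channel RGB image, a 1-channel depth map, and a 3-channel normal map:

```python
import torch

def enrich_and_refine(line, edge2pix, edge2depth, edge2norm, refiner):
    """Concatenate the line drawing with the hallucinated modalities and refine."""
    rgb = edge2pix(line)        # (B, 3, H, W) hallucinated RGB image
    depth0 = edge2depth(line)   # (B, 1, H, W) initial depth map
    normals = edge2norm(line)   # (B, 3, H, W) normal map
    # Channel-wise concatenation: 1 + 3 + 1 + 3 = 8 channels fed to the refiner.
    enriched = torch.cat([line, rgb, depth0, normals], dim=1)
    return refiner(enriched)    # final depth map (plus plane outputs)
```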
3.2 PlaneNet Module
To obtain the final depth maps from the initial depth maps and the intermediate data, we use the PlaneNet module (Liu et al., 2018). This module is based on a dilated version of ResNet (He et al., 2016; Yu et al., 2017) followed by three branches. The first branch regresses plane parameters represented as three-dimensional vectors dn, where d is the plane offset and n is the unit normal vector, which together define the plane equation n · x = d for points x on the plane. The second branch segments the image into planar regions and a single non-planar mask. The third branch