roads and vehicles, roads and buildings, and intersec-
tions and pedestrian crossings. Thus, if we can train
the network to learn such regularities, the network
may generate more accurate complemented images.
In this research, we force the network to learn scene
structure recovery as well as image inpainting, so that
it generates accurate complemented images.
2 RELATED WORK
A common method of viewpoint transfer is to place
multiple cameras around the 3D scene and use these
camera images to generate images from arbitrary
viewpoints by interpolation (Kanade et al., 1997;
Ishikawa, 2008; Lipski et al., 2010; Chari et al.,
2012; Sankoh et al., 2018). These methods are used
to generate free viewpoint images, such as in sports
broadcasting. However, these methods require a large
number of cameras densely placed around the scene
and thus cannot be used in road environments.
In order to solve this problem, we consider view-
point transfer from images obtained by a single in-
vehicle camera. Some methods have been proposed
for generating new views from a single viewpoint
image by using geometric transformations, such as
projective transformations (Ichikawa and Sato,
2008). By estimating the scene depth information,
we can further improve the geometric viewpoint
transfer. Eigen et al. proposed a deep learning-based
method for estimating scene depth from monocular
images (Eigen et al., 2014). Unsupervised depth
estimation methods have also been proposed that
exploit a pair of stereo cameras during network
training (Garg et al., 2016; Godard et al., 2017).
Although depth information enables us to transfer
image points from one view to another, the geometric
transformation alone can only reproduce image
content that already exists in the original image;
it cannot generate content that is absent from the
original image, such as the view obtained by looking
sideways.
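The point transfer that depth makes possible can be sketched in a few lines: lift a pixel to 3D with its depth, move it into the new camera frame, and reproject. The intrinsics and camera motion below are illustrative values of our own choosing, not parameters from the paper, and we keep only a translation (a full method would also apply a rotation).

```python
# Sketch of depth-based viewpoint transfer for a single pixel.
# Intrinsics and motion are made-up illustrative values.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def transfer_pixel(u, v, depth, intrinsics, translation):
    """Warp one pixel into a new view related by a pure translation."""
    fx, fy, cx, cy = intrinsics
    x, y, z = backproject(u, v, depth, fx, fy, cx, cy)
    tx, ty, tz = translation
    return project((x + tx, y + ty, z + tz), fx, fy, cx, cy)

# Moving the camera 1 m to the right shifts points by (-1, 0, 0)
# in the new camera frame; a point at 10 m depth moves left in the
# image by fx * 1 / 10 = 50 pixels.
intr = (500.0, 500.0, 320.0, 240.0)
u2, v2 = transfer_pixel(320.0, 240.0, 10.0, intr, (-1.0, 0.0, 0.0))
```

Note that such forward warping only relocates pixels that were visible in the source view; target pixels with no source correspondence remain empty, which is exactly how the large missing regions discussed above arise.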
In recent years, image inpainting has been devel-
oped as a technique for filling in the missing image
information, and it has been shown that images can be
restored accurately even when many defective pixels
are scattered throughout the image (Bertalmio et al.,
2000; Liao et al., 2021). However, since these image
inpainting methods use the similarity and regularity
of only 2D image features, they require non-defective
pixels to be scattered throughout the image. There-
fore, the existing image inpainting methods do not
work properly when we have large missing regions
like the image in Fig. 1 (b), which are caused by the
difference in viewing direction before and after the
viewpoint transformation.
In order to solve this problem, in this paper we use
multi-task learning to learn two tasks simultaneously:
inpainting the in-vehicle images and inferring the
structural information of the hidden scene. Multi-task
learning is a method that improves
the performance of each task by learning multiple
related tasks at the same time. Examples of multi-
task learning include Faster R-CNN (Ren et al., 2016)
and YOLO (Redmon and Farhadi, 2018), which si-
multaneously perform object class recognition and
object location estimation, and Mask R-CNN (He
et al., 2017), which extends Faster R-CNN to
additionally perform instance segmentation. The net-
work structure for multi-task learning can take various
forms depending on the number and types of tasks,
but the basic structure consists of a task-sharing layer
that learns features common to each task and a task-
specific layer that learns features specific to each task.
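The shared/task-specific structure described above (often called hard parameter sharing) can be sketched as follows. The scalar "weights" are toy stand-ins of our own, not a real network; the point is only the wiring: one trunk whose features feed two heads.

```python
# Minimal sketch of hard parameter sharing for multi-task learning:
# a task-sharing trunk feeds two task-specific heads.

class SharedTrunk:
    """Task-sharing layer: learns features common to both tasks."""
    def __init__(self, w=2.0):
        self.w = w
    def forward(self, x):
        return self.w * x

class TaskHead:
    """Task-specific layer: one per task."""
    def __init__(self, w):
        self.w = w
    def forward(self, feat):
        return self.w * feat

trunk = SharedTrunk()
inpaint_head = TaskHead(0.5)    # e.g. reconstructs missing pixels
structure_head = TaskHead(3.0)  # e.g. infers scene structure

feat = trunk.forward(1.0)                      # shared features
out_inpaint = inpaint_head.forward(feat)       # 1.0 -> 2.0 -> 1.0
out_structure = structure_head.forward(feat)   # 1.0 -> 2.0 -> 6.0
```

Because both task losses backpropagate through the same trunk in a real network, regularities useful to one task (here, scene structure) can improve the features used by the other (inpainting).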
In this paper, we propose a method that learns
image inpainting and structural inference simultane-
ously using multi-task learning, and performs
viewpoint transfer based on the inferred structure of
the invisible scene.
3 GENERATING PEDESTRIAN
VIEWPOINT IMAGE
In this research, viewpoint transfer images with miss-
ing regions are complemented using a network based
on conditional GAN (cGAN). However, transfer to
a significantly different viewpoint can produce very
large missing regions, which simple image inpainting
methods cannot complete with high quality. In this
research, we propose a method to generate high-
quality pedestrian viewpoint images by recovering the
scene structure unique to road scenes while perform-
ing image completion. For this objective, we propose
two methods: a method based on multi-task learning
(Method 1) and a method using Semantic Loss with
multi-task learning (Method 2).
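One way to read the difference between the two methods is as a change in the training objective. The following sketch is our own illustration, not the paper's formulation: the loss names and weight values (lambda_rec, lambda_sem) are assumptions, and Method 2 is modeled simply as adding a weighted Semantic Loss term.

```python
# Hedged sketch of a combined multi-task objective.
# Loss-term names and weights are illustrative assumptions.

def total_loss(l_adv, l_rec, l_sem=None,
               lambda_rec=10.0, lambda_sem=1.0):
    """Weighted objective; l_sem is None for Method 1."""
    loss = l_adv + lambda_rec * l_rec   # adversarial + reconstruction
    if l_sem is not None:               # Method 2 adds the Semantic Loss
        loss += lambda_sem * l_sem
    return loss

m1 = total_loss(0.5, 0.1)             # Method 1: 0.5 + 10 * 0.1
m2 = total_loss(0.5, 0.1, l_sem=0.2)  # Method 2: Method 1 + 1 * 0.2
```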
3.1 Generating Pedestrian Viewpoint
Images Using Multi-Task Learning
In a road scene, there are objects unique to road
scenes, such as roads, cars, and buildings, and each
category shares a similar general shape. In addition, road
VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications