roads and vehicles, roads and buildings, and intersec-
tions and pedestrian crossings. Thus, if we can train
the network to learn such regularities, the network
may generate more accurate complemented images.
In this research, we force the network to learn scene
structure recovery as well as image inpainting, so that
it generates accurate complemented images.
2 RELATED WORK
A common method of viewpoint transfer is to place
multiple cameras around the 3D scene and use these
camera images to generate images from arbitrary
viewpoints by interpolation (Kanade et al., 1997;
Ishikawa, 2008; Lipski et al., 2010; Chari et al.,
2012; Sankoh et al., 2018). These methods are used
to generate free viewpoint images, such as in sports
broadcasting. However, these methods require a large
number of cameras densely placed around the scene
and thus cannot be used in road environments.
In order to solve this problem, we consider view-
point transfer from images obtained by a single in-
vehicle camera. Some methods have been proposed
for generating new views from a single viewpoint
image by using geometric transformations, such as
projective transformations (Ichikawa and Sato,
2008). By estimating the scene depth information,
we can further improve the geometric viewpoint
transfer. Eigen et al. proposed a deep learning-based
method for estimating scene depth from monocular
images (Eigen et al., 2014). Unsupervised depth
estimation methods have also been proposed that
exploit a pair of stereo cameras during network
training (Garg et al., 2016; Godard et al., 2017).
Although depth information enables us to transfer
image points from one view to another, the geometric
transformation alone can only reproduce image
content that already exists in the original image;
it cannot generate content that is absent from the
original image, such as the view obtained by looking
sideways.
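The point transfer that depth makes possible can be sketched in a few lines: lift a pixel to 3D with its depth, move it into the new camera frame, and reproject. The intrinsics and camera motion below are illustrative values of our own choosing, not parameters from the paper, and we keep only a translation (a full method would also apply a rotation).

```python
# Sketch of depth-based viewpoint transfer for a single pixel.
# Intrinsics and motion are made-up illustrative values.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def transfer_pixel(u, v, depth, intrinsics, translation):
    """Warp one pixel into a new view related by a pure translation."""
    fx, fy, cx, cy = intrinsics
    x, y, z = backproject(u, v, depth, fx, fy, cx, cy)
    tx, ty, tz = translation
    return project((x + tx, y + ty, z + tz), fx, fy, cx, cy)

# Moving the camera 1 m to the right shifts points by (-1, 0, 0)
# in the new camera frame; a point at 10 m depth moves left in the
# image by fx * 1 / 10 = 50 pixels.
intr = (500.0, 500.0, 320.0, 240.0)
u2, v2 = transfer_pixel(320.0, 240.0, 10.0, intr, (-1.0, 0.0, 0.0))
```

Note that such forward warping only relocates pixels that were visible in the source view; target pixels with no source correspondence remain empty, which is exactly how the large missing regions discussed above arise.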
In recent years, image inpainting has been devel-
oped as a technique for filling in the missing image
information, and it has been shown that images can be
restored accurately even when many defective pixels
are scattered throughout the image (Bertalmio et al.,
2000; Liao et al., 2021). However, since these image
inpainting methods use the similarity and regularity
of only 2D image features, they require non-defective
pixels to be scattered throughout the image. There-
fore, the existing image inpainting methods do not
work properly when we have large missing regions
like the image in Fig. 1 (b), which are caused by the
difference in viewing direction before and after the
viewpoint transformation.
In order to solve this problem, in this paper we use
multi-task learning to learn two tasks simultaneously:
inpainting the in-vehicle images and inferring the
structural information of the hidden scene. Multi-task
learning is a method that improves
the performance of each task by learning multiple
related tasks at the same time. Examples of multi-
task learning include Faster R-CNN (Ren et al., 2016)
and YOLO (Redmon and Farhadi, 2018), which si-
multaneously perform object class recognition and
object location estimation, and Mask R-CNN (He
et al., 2017), which extends Faster R-CNN to
additionally perform instance segmentation. The net-
work structure for multi-task learning can take various
forms depending on the number and types of tasks,
but the basic structure consists of a task-sharing layer
that learns features common to each task and a task-
specific layer that learns features specific to each task.
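The shared/task-specific structure described above (often called hard parameter sharing) can be sketched as follows. The scalar "weights" are toy stand-ins of our own, not a real network; the point is only the wiring: one trunk whose features feed two heads.

```python
# Minimal sketch of hard parameter sharing for multi-task learning:
# a task-sharing trunk feeds two task-specific heads.

class SharedTrunk:
    """Task-sharing layer: learns features common to both tasks."""
    def __init__(self, w=2.0):
        self.w = w
    def forward(self, x):
        return self.w * x

class TaskHead:
    """Task-specific layer: one per task."""
    def __init__(self, w):
        self.w = w
    def forward(self, feat):
        return self.w * feat

trunk = SharedTrunk()
inpaint_head = TaskHead(0.5)    # e.g. reconstructs missing pixels
structure_head = TaskHead(3.0)  # e.g. infers scene structure

feat = trunk.forward(1.0)                      # shared features
out_inpaint = inpaint_head.forward(feat)       # 1.0 -> 2.0 -> 1.0
out_structure = structure_head.forward(feat)   # 1.0 -> 2.0 -> 6.0
```

Because both task losses backpropagate through the same trunk in a real network, regularities useful to one task (here, scene structure) can improve the features used by the other (inpainting).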
In this paper, we propose a method that learns
image inpainting and structural inference simultane-
ously using multi-task learning, and performs
viewpoint transfer based on the inferred structure of
the invisible scene.
3 GENERATING PEDESTRIAN
VIEWPOINT IMAGE
In this research, viewpoint transfer images with miss-
ing regions are complemented using a network based
on conditional GAN (cGAN). However, transfer to
a significantly different viewpoint can produce very
large missing regions, which simple image inpainting
methods cannot complete with high quality. In this
research, we propose a method to generate high-
quality pedestrian viewpoint images by recovering the
scene structure unique to road scenes while perform-
ing image completion. For this objective, we propose
two methods: a method based on multi-task learning
(Method 1) and a method using Semantic Loss with
multi-task learning (Method 2).
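One way to read the difference between the two methods is as a change in the training objective. The following sketch is our own illustration, not the paper's formulation: the loss names and weight values (lambda_rec, lambda_sem) are assumptions, and Method 2 is modeled simply as adding a weighted Semantic Loss term.

```python
# Hedged sketch of a combined multi-task objective.
# Loss-term names and weights are illustrative assumptions.

def total_loss(l_adv, l_rec, l_sem=None,
               lambda_rec=10.0, lambda_sem=1.0):
    """Weighted objective; l_sem is None for Method 1."""
    loss = l_adv + lambda_rec * l_rec   # adversarial + reconstruction
    if l_sem is not None:               # Method 2 adds the Semantic Loss
        loss += lambda_sem * l_sem
    return loss

m1 = total_loss(0.5, 0.1)             # Method 1: 0.5 + 10 * 0.1
m2 = total_loss(0.5, 0.1, l_sem=0.2)  # Method 2: Method 1 + 1 * 0.2
```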
3.1 Generating Pedestrian Viewpoint
Images Using Multi-Task Learning
In a road scene, there are objects unique to road
scenes, such as roads, cars, and buildings, and each
category shares a similar general shape. In addition, road
VISAPP 2023 - 18th International Conference on Computer Vision Theory and Applications