An Assessment of Shadow Generation by GAN with Depth Images on
Non-Planar Backgrounds
Kaito Toyama and Maki Sugimoto
Graduate School of Open Environmental Science, Keio University,
3-14-1 Hiyoshi Kohoku-ku Yokohama, Kanagawa, Japan
{kaito.t-0504, maki.sugimoto}@keio.jp
Keywords: Mixed Reality, Shadow Generation, Virtual Object.
Abstract:
We propose the use of a Generative Adversarial Network (GAN) with depth images to generate shadows for
virtual objects in mixed reality environments. This approach improves the accuracy of the shadow generation
process by aligning shadows with non-planar geometries. While traditional methods require detailed lighting
and geometry data, recent research generates shadows by learning from the image itself, even when such
conditions are not fully known. However, these studies are limited to projecting shadows only onto the
ground, a planar geometry. Our dataset used for training the GAN includes depth images, which allows natural
shadow generation in complex environments.
1 INTRODUCTION
Shadow generation for virtual objects in Mixed Re-
ality involves synthesizing shadows that align with
real-world light sources and geometry when combin-
ing virtual objects with real-world images (Schrater
and Kersten, 2000). Shadows are crucial for depth
perception, which enhances the realism of virtual ex-
periences (Hoffman et al., 1998; Chrysanthakopoulou
and Moustakas, 2024).
Typically, shadow generation algorithms require
information about real-world lighting, geometry, and
the relative position and shape of virtual objects.
However, obtaining this information is often chal-
lenging due to factors like occluded light sources and
complex geometries.
To address this, machine learning approaches have
been proposed. Liu et al. developed ARShadowGAN
(Liu et al., 2020), a method that uses a Generative
Adversarial Network (GAN) to generate virtual shad-
ows that resemble real-world shadows by learning
from the real-world objects and their shadows. This
method successfully generates realistic shadows us-
ing only two inputs: a composite image without shad-
ows and a mask of the virtual object. Additionally, Meng
et al. introduced HAU-Net and IFNet to account for
lighting conditions and shadow shape, improving the
naturalness of generated shadows (Meng et al., 2023).
However, these methods are limited by their assumption that shadows are projected onto a flat surface, making them less effective for complex geometries. In this study, we utilize depth images to generate natural shadows on non-planar geometries. Depth images provide information about the distance between the camera and the target geometry, which is a key cue for shadow generation on non-planar geometries. By incorporating depth images as an extension of the existing GAN framework, this paper investigates whether shadows that take depth information into account can be generated on non-planar geometries.

Figure 1: The left images show the inputs; each consists of a shadowless composite of a virtual object together with a depth image. The right images show the shadows generated for the virtual objects on non-planar backgrounds.
Figure 2: The structure of the dataset: (a) No Shadow image, (b) Virtual Object Mask, (c) Depth image, (d) Real Object Mask, (e) Real Shadow Mask, and (f) Ground Truth image.
2 RELATED WORK
Shadow generation for virtual objects in MR environ-
ments can be achieved using algorithms when lighting
conditions and real-world geometry are known. How-
ever, when these conditions are not fully available,
shadow generation becomes challenging. To address
this issue, machine learning-based methods have been
developed.
Liu et al. proposed a method that integrates an At-
tention Block and a Shadow Generator Block into a
GAN. This approach takes a real-world image with a
synthesized virtual object and its corresponding mask
as input. The Attention Block estimates the positions
of real-world objects and their shadows, while the
Shadow Generator Block uses this information along
with the input images to generate shadows for the vir-
tual object. The generated shadow is then combined
with the real-world image to create a composite im-
age, which is fed into the Discriminator. Both the
Attention Block and Shadow Generator Block utilize
U-Net (Ronneberger et al., 2015) for enhanced seg-
mentation accuracy.
Meng et al. addressed the issue of unrealistic
shadows in complex lighting environments by em-
ploying HAU-Net and IFNet. HAU-Net captures
the interactions between foreground objects and the
background in both spatial and channel dimensions,
predicting realistic shadow shapes while considering
background lighting conditions. IFNet uses Exposure
Fusion to generate and merge multiple exposure im-
ages, which helps enhance the realism of shadows un-
der varying lighting conditions. This process adapts
to changes in background lighting, adjusting shadow
intensity and shape, resulting in more natural and re-
alistic shadows in the final composite image.
Sheng et al. introduced PixHt-Lab (Sheng et al.,
2023), a system that leverages a pixel height-based
representation to generate realistic lighting effects for
image compositing, such as soft shadows and reflec-
tions. Unlike methods that assume shadows are pro-
jected onto a ground plane, PixHt-Lab maps the 2.5D
pixel height representation to a 3D space, enabling
the reconstruction of both foreground and background
geometries. This approach significantly enhances
soft shadow quality on general shadow receivers like
walls and curved surfaces by incorporating 3D-aware
buffer channels. Their neural renderer, SSG++, uti-
lizes these buffer channels to guide soft shadow gen-
eration, addressing limitations in shadow realism and
providing more control over lighting effects. While
PixHt-Lab focuses on 2D image compositing, its use
of geometry-aware data structures to guide lighting
effects aligns closely with our approach to generat-
ing shadows for virtual objects on non-planar back-
grounds.
To further improve the realism and quality of
generated images, incorporating depth images into
the generation process has proven effective. Depth
images provide essential information about the dis-
tance from each pixel to the object surface, offer-
ing valuable insights into the shape of objects in
RGB images. Qi et al. proposed a method that
integrates depth images into the image generation
process, significantly enhancing the naturalness and
overall quality of the generated images. Their ap-
proach combines semantic labels with depth maps
using a multi-conditional semantic image genera-
tion method. The quality improvement is achieved
through the Multi-scale Feature Extraction and Infor-
mation Fusion Module (MEIF) and the Multi-scale
Channel Attention Module (MCA). MEIF leverages
both depth information and semantic labels, using a
pyramid-shaped feature extraction mechanism to cap-
ture both global and local details from the depth im-
age, thus enhancing the feature maps derived from
depth information. MCA aligns features across differ-
ent scales by learning the correlation between feature
map channels at varying scales, ensuring consistency
and coherence in the generated images.
3 DATASET
We constructed a dataset consisting of no-shadow images, real object masks, real shadow masks, virtual object masks, depth images, and ground-truth shadow images to train the network model. The image dataset used in this study (Figure 2) was rendered using Unity 2021.3.16f1, with the objects in the images sourced from ShapeNet (Chang et al., 2015) and SceneCity (Couturier, 2023). Depth images were obtained by computing the depth from the camera coordinates using the Z-buffer method. To generate mask images for the background and virtual objects, we made non-background objects invisible and applied the Z-buffer method, so that only the object portions had pixel values greater than 0. These images were then binarized to create the mask images. The shadow mask images were generated by turning off the Cast Shadow setting for the objects, taking the difference from the original image, and then binarizing the result. We generated 3,600 images by combining 20 background types, 3 viewpoints, and 60 virtual objects. The dataset will be published at https://github.com/YaMaKaTsu5004/DepthShadowGAN_dataset.
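For illustration, the binarization steps above can be reproduced offline roughly as follows. This is a minimal Python sketch assuming the renders are exported as image files; the file names, the use of OpenCV, and the thresholds are illustrative assumptions, since the actual masks were produced from Unity renders as described.

```python
import cv2
import numpy as np

def object_mask_from_zbuffer(single_object_depth_png, threshold=0):
    """Binarize a Z-buffer render in which only the target object has values > 0.
    The file name is a hypothetical export of the per-object depth render."""
    depth = cv2.imread(single_object_depth_png, cv2.IMREAD_GRAYSCALE)
    return (depth > threshold).astype(np.uint8) * 255

def shadow_mask_from_difference(with_shadow_png, no_shadow_png, diff_threshold=10):
    """Binarize the difference between a render with Cast Shadow enabled and one
    with it disabled to obtain the real-shadow mask. The threshold is illustrative."""
    a = cv2.imread(with_shadow_png, cv2.IMREAD_GRAYSCALE).astype(np.int16)
    b = cv2.imread(no_shadow_png, cv2.IMREAD_GRAYSCALE).astype(np.int16)
    return (np.abs(a - b) > diff_threshold).astype(np.uint8) * 255
```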
4 METHODOLOGY
Figure 3 shows the architecture of the network model
used in this study. We extended the network model
of ARShadowGAN (Liu et al., 2020) to generate
shadows with depth information. Broadly, it con-
sists of three main components: the Attention Block,
the Shadow Generator Block, and the Discriminator
Block.
4.1 Attention Block
The Attention Block segments real-world objects and
their shadows, which serve as input to the Shadow
Generator Block. The input
images consist of a composite image without shad-
ows, a mask image of the virtual object, and a real-
world depth image. The encoder uses ResNet (Resid-
ual Neural Network) (He et al., 2016), and U-Net
(Ronneberger et al., 2015) is employed to learn the
segmentation of real-world objects and their shadows.
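The following PyTorch-style sketch summarizes the input and output contract of the Attention Block as described above: a five-channel input (shadowless composite, virtual object mask, depth) and two predicted masks. The placeholder trunk and channel widths are assumptions; the actual model uses a ResNet encoder with a U-Net decoder.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Sketch of the Attention Block interface; not the exact ResNet/U-Net model."""
    def __init__(self, backbone=None):
        super().__init__()
        # Placeholder trunk; the actual encoder-decoder is ResNet + U-Net.
        self.backbone = backbone or nn.Sequential(
            nn.Conv2d(5, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.head = nn.Conv2d(64, 2, kernel_size=1)  # channel 0: object mask, channel 1: shadow mask

    def forward(self, shadowless, vobj_mask, depth):
        # shadowless: (B,3,H,W); vobj_mask, depth: (B,1,H,W)
        x = torch.cat([shadowless, vobj_mask, depth], dim=1)   # 5 input channels
        masks = torch.sigmoid(self.head(self.backbone(x)))
        return masks[:, 0:1], masks[:, 1:2]                    # A_obj, A_shadow
```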
4.2 Shadow Generator Block
The Shadow Generator Block is responsible for gen-
erating shadows for virtual objects. The input im-
ages consist of a composite image without shadows,
a mask image of the virtual object, a real-world depth
image, and the mask images of real-world objects
and their shadows generated by the Attention Block.
ResNet18 (He et al., 2016) is used as the encoder, and
U-Net (Ronneberger et al., 2015) is employed to gen-
erate a rough shadow, which is then passed through a
refinement stage to make it more natural.
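The two-stage structure of this block can be sketched as below. The placeholder sub-networks are assumptions standing in for the ResNet18/U-Net generator and the refinement network; only the data flow is meant to be representative.

```python
import torch
import torch.nn as nn

class ShadowGeneratorBlock(nn.Module):
    """Sketch of the coarse-then-refine data flow; placeholder convolutions only."""
    def __init__(self, coarse_net=None, refine_net=None):
        super().__init__()
        self.coarse_net = coarse_net or nn.Conv2d(7, 3, kernel_size=3, padding=1)
        self.refine_net = refine_net or nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, shadowless, vobj_mask, depth, a_obj, a_shadow):
        # 3 + 1 + 1 + 1 + 1 = 7 input channels in this sketch
        x = torch.cat([shadowless, vobj_mask, depth, a_obj, a_shadow], dim=1)
        coarse = self.coarse_net(x)        # rough shadow from the generator
        refined = self.refine_net(coarse)  # shadow after the refinement stage
        y_bar = shadowless + coarse        # composite before refinement (used by the L2 loss)
        y_hat = shadowless + refined       # composite after refinement
        return coarse, refined, y_bar, y_hat
```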
4.3 Discriminator
The Discriminator distinguishes whether the gener-
ated virtual shadow is plausible, thereby advancing
the training of the generator. In this study, the Dis-
criminator model adopts the PatchGAN method (Isola
et al., 2017). PatchGAN consists of three consecutive
convolutional layers, each followed by 2D batch
normalization and a Leaky ReLU activation. Next, a convolution gener-
ates the final feature map, which is activated using a
sigmoid function. The final output of the Discrimina-
tor is the global average pooling of the activated final
feature map. In this study, the Discriminator takes as
input the concatenation of the virtual object shadow,
the virtual object mask, and the image containing the
shadow of the virtual object.
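A minimal PyTorch sketch of this Discriminator is shown below. The kernel sizes, strides, and channel widths are assumptions; only the layer pattern (three convolutions with 2D batch normalization and Leaky ReLU, a final convolution, sigmoid activation, and global average pooling) and the concatenated input follow the description above.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """PatchGAN-style Discriminator sketch; hyperparameters are illustrative."""
    def __init__(self, in_channels=5):  # shadow (1) + virtual object mask (1) + shadowed image (3), assumed
        super().__init__()
        chs = [in_channels, 64, 128, 256]
        layers = []
        for c_in, c_out in zip(chs[:-1], chs[1:]):
            layers += [
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.2, inplace=True),
            ]
        layers.append(nn.Conv2d(chs[-1], 1, kernel_size=4, stride=1, padding=1))  # final feature map
        self.net = nn.Sequential(*layers)

    def forward(self, shadow, vobj_mask, shadowed_image):
        x = torch.cat([shadow, vobj_mask, shadowed_image], dim=1)
        patch_scores = torch.sigmoid(self.net(x))      # per-patch realness
        return patch_scores.mean(dim=(2, 3))           # global average pooling
```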
Figure 3: The architecture used in this study. The Attention Block focuses on real objects and their shadows, while the Virtual
Shadow Generator generates shadows for virtual objects. The input images consist of a Virtual Object Mask, a Depth image,
and No Shadow images. During training, the generated images are evaluated for authenticity by the Discriminator.
4.4 Loss Function
The loss function $L_{attn}$ used in the Attention Block is defined using a squared loss as follows:

$L_{attn} = \| A_{obj}(x, m, d) - M^{r}_{obj} \|_2^2 + \| A_{shadow}(x, m, d) - M^{r}_{shadow} \|_2^2$   (1)

where $A_{obj}$ is the predicted mask image of the background object, and $M^{r}_{obj}$ is the ground truth mask image of the background object. Similarly, $A_{shadow}$ is the predicted mask image of the background object's shadow, and $M^{r}_{shadow}$ is the ground truth mask image of the background object's shadow. The inputs $(x, m, d)$ represent the background image, the mask image of the virtual object, and the depth image, respectively.
The loss function $L_{gen}$ for the Shadow Generator Block is defined as the weighted sum of three different terms:

$L_{gen} = \beta_1 L_2 + \beta_2 L_{per} + \beta_3 L_{adv}$   (2)
Here, $L_2$ represents the squared loss, which calculates the squared error between the predicted and true values. This is used to measure the error by treating the real composite image and the generated composite image as a regression problem. In the model of this study, refinement is performed to adjust the coarse shadows into more natural shadows. Therefore, the loss function calculates the squared loss both before and after the refinement, and their sum is defined as $L_2$. The output before refinement is

$\bar{y} = x + G(x, m, A_{obj}, A_{shadow})$   (3)

and the output after refinement is

$\hat{y} = x + R(G(x, m, A_{obj}, A_{shadow}))$   (4)

The final $L_2$ loss is defined as follows:

$L_2 = \| y - \bar{y} \|_2^2 + \| y - \hat{y} \|_2^2$   (5)
$L_{per}$ is the perceptual loss (Johnson et al., 2016), which emphasizes high-level feature and structural similarity of the images. In this study, we use VGG16 (Simonyan and Zisserman, 2015) pre-trained on ImageNet (Deng et al., 2009) to extract features. This function is defined as follows:

$L_{per} = \mathrm{MSE}(V_{y}, V_{\bar{y}}) + \mathrm{MSE}(V_{y}, V_{\hat{y}})$   (6)

where MSE is the mean squared error, and $V_{y}$ represents the feature map extracted by the pre-trained VGG16. This calculation compares the feature maps at the intermediate layers to compute the loss.

The loss function for the Discriminator is defined as follows:

$L_{adv} = \log(D(x, m, y)) + \log(1 - D(x, m, \hat{y}))$   (7)

Here, $D$ represents the probability that the image is real. During GAN training, the Discriminator tries to maximize $L_{adv}$, while the Generator tries to minimize it.
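The loss terms above can be implemented along the following lines. This is a hedged sketch: the reduction over pixels, the VGG16 layer used for feature extraction, and the beta weights are assumptions that are not specified here.

```python
import torch
import torch.nn.functional as F

def attention_loss(a_obj, a_shadow, m_obj, m_shadow):
    """Eq. (1): squared error between predicted and ground-truth real-object
    and real-shadow masks."""
    return ((a_obj - m_obj) ** 2).sum() + ((a_shadow - m_shadow) ** 2).sum()

def l2_loss(y, y_bar, y_hat):
    """Eq. (5): squared error of the composites before and after refinement."""
    return ((y - y_bar) ** 2).sum() + ((y - y_hat) ** 2).sum()

def perceptual_loss(vgg_features, y, y_bar, y_hat):
    """Eq. (6): MSE between feature maps; `vgg_features` is assumed to be a
    callable returning an intermediate feature map of a pre-trained VGG16."""
    v_y = vgg_features(y)
    return F.mse_loss(v_y, vgg_features(y_bar)) + F.mse_loss(v_y, vgg_features(y_hat))

def adversarial_loss(d_real, d_fake, eps=1e-8):
    """Eq. (7): maximized by the Discriminator, minimized by the Generator.
    `d_real` = D(x, m, y) and `d_fake` = D(x, m, y_hat)."""
    return torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean()

def generator_loss(y, y_bar, y_hat, vgg_features, d_real, d_fake,
                   betas=(10.0, 1.0, 0.1)):  # illustrative weights, not from the paper
    """Eq. (2): weighted sum of the three terms."""
    b1, b2, b3 = betas
    return (b1 * l2_loss(y, y_bar, y_hat)
            + b2 * perceptual_loss(vgg_features, y, y_bar, y_hat)
            + b3 * adversarial_loss(d_real, d_fake))
```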
Figure 4: A comparison of our method and the ground truth.
Figure 5: A comparison of our method and ARShadowGAN (Liu et al., 2020).
Figure 6: Box plots of PSNR, SSIM, and LPIPS summarizing the results for each case in the experiment.
5 EXPERIMENT
For the learning process, 3,600 images in the dataset
were split into training, validation, and test sets in a
ratio of 2880:360:360, respectively.
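The split can be reproduced as follows; the shuffling and the random seed are assumptions, since the assignment of samples to each set is not specified.

```python
import random

def split_dataset(num_images=3600, seed=0):
    """Hypothetical 2880:360:360 train/validation/test split of image indices."""
    indices = list(range(num_images))
    random.Random(seed).shuffle(indices)
    return indices[:2880], indices[2880:3240], indices[3240:3600]

train_ids, val_ids, test_ids = split_dataset()
```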
Figure 4 shows examples of the shadow generation results alongside the ground truth images. Figure 5 shows comparative images with ARShadowGAN (Liu et al., 2020). The quantitative evaluation against ARShadowGAN using PSNR, SSIM, and LPIPS (Zhang et al., 2018) is displayed as box plots in Figure 6, and the numerical comparison is presented in Table 1.
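For reference, the three metrics can be computed per image pair roughly as follows. The library choices (scikit-image for PSNR and SSIM, the lpips package with an AlexNet backbone for LPIPS) and the value ranges are assumptions about one possible setup, not necessarily the exact configuration used here.

```python
import torch
import lpips  # reference implementation of LPIPS (Zhang et al., 2018)
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net='alex')  # backbone choice is an assumption

def evaluate_pair(pred, gt):
    """`pred` and `gt` are HxWx3 uint8 numpy arrays of the generated and
    ground-truth composites."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float() / 127.5 - 1.0
    dist = lpips_fn(to_tensor(pred).unsqueeze(0), to_tensor(gt).unsqueeze(0)).item()
    return psnr, ssim, dist
```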
6 ABLATION STUDY
An ablation study was conducted to examine the impact of each loss term by removing $L_{per}$ and $L_{adv}$, respectively. Figure 7 shows the ablation study results.
Figure 7: A comparison between the regular results in the ablation study, the results with the perceptual loss removed, the results with the GAN loss removed, and the ground truth images.
Figure 8: Box plots of PSNR, SSIM, and LPIPS summarizing the results for each case in the ablation study.
Table 1: Average ratings in the experiment (mean ± standard deviation).

                PSNR [dB]       SSIM               LPIPS
  ARShadowGAN   28.98 ± 2.29    0.9021 ± 0.0436    0.05544 ± 0.0210
  Ours          29.57 ± 2.53    0.9176 ± 0.0295    0.1149 ± 0.0268

Table 2: Average ratings in the ablation study (mean ± standard deviation).

                PSNR [dB]       SSIM               LPIPS
  Ours          29.57 ± 2.53    0.9176 ± 0.0295    0.1149 ± 0.0268
  -per loss     30.96 ± 2.34    0.9115 ± 0.0343    0.1358 ± 0.0261
  -gan loss     30.33 ± 2.08    0.9199 ± 0.0286    0.1138 ± 0.0289
Figure 8 shows the comparison of PSNR, SSIM, and
LPIPS (Zhang et al., 2018) box plots. Table 2 shows
the average and standard deviation in each condition.
7 DISCUSSION
Figure 4 shows shadow generation results by our
method. It was observed that for the virtual object of a chair, even the legs of the chair were projected well as a realistic shadow, indicating that the method was able to handle relatively detailed objects. Figure 5 shows a comparison between our method and ARShadowGAN. The figure indicates that our method is capable of generating shadows on non-planar backgrounds. However, there are also points to be improved. In the output images of our method, there
were issues such as the failure to generate shadows
for the convex parts of objects, excessive noise when
projecting shadows onto walls, and shadow overlap-
ping on concave parts of objects, which tended to
cause noise. Specifically, when projecting shadows
onto walls, noise was more prevalent for areas with
lower brightness. This is likely because the color of
the wall and the projected shadow were similar, lead-
ing to a weak learning signal from the perceptual loss. Addition-
ally, the issue of noise in shadows within the object is
thought to be caused by the absence of the virtual ob-
ject’s image in the depth map. Since the mask image
of the virtual object was provided, the mask region in
the depth map became unnecessary information, pos-
sibly leading to poor learning in those regions.
For the qualitative assessment of the ablation study, comparing the images in Figure 7 confirmed that without the perceptual loss, the shadows around the object's outline were not generated as well as with our full method, resulting in
unnatural images. Without the GAN loss, the shad-
ows projected onto the wall were shallower in angle
and smaller in size. For the quantitative assessment of the results in Table 2 and Figure 8, the data did not follow a normal distribution, as confirmed by the Shapiro-Wilk test. When performing the Mann-
Whitney U test between Ours and the -per loss condi-
tion, a significant difference was found in PSNR and
SSIM at p = 0.05. This suggests that the perceptual
loss term contributes to noise reduction and structural
similarity in the images. Furthermore, a significant
difference was found only in PSNR at p = 0.05 be-
tween Ours and -gan loss, indicating that the Discrim-
inator loss term likely contributes to noise reduction.
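The statistical procedure used here can be reproduced with SciPy roughly as follows; the two-sided alternative hypothesis and the per-metric grouping of scores are assumptions about how the tests were run.

```python
from scipy.stats import shapiro, mannwhitneyu

def compare_conditions(scores_a, scores_b, alpha=0.05):
    """Normality check (Shapiro-Wilk) followed by the Mann-Whitney U test for
    one metric (e.g. per-image PSNR) under two conditions."""
    _, p_norm_a = shapiro(scores_a)
    _, p_norm_b = shapiro(scores_b)
    _, p_value = mannwhitneyu(scores_a, scores_b, alternative='two-sided')
    return {
        'a_is_normal': p_norm_a >= alpha,
        'b_is_normal': p_norm_b >= alpha,
        'significant_difference': p_value < alpha,
    }
```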
8 LIMITATION
One limitation of this study is that the dataset only
includes vertical walls, so accuracy may decrease de-
pending on the complexity of the backgrounds. Addi-
tionally, since the shape information beyond the out-
line of the virtual objects is not included, generating
shadows for complex virtual objects remains difficult.
To address this challenge, it will be necessary to in-
corporate shape information of virtual objects in the
learning process.
Moreover, the acquisition of depth images from
real-world environments is still imprecise, which
means that the model used in this study may not
achieve sufficient accuracy when applied in real-
world scenarios. As a potential solution for applying
this model in the real world, segmentation and label-
ing of the ground and walls could be used as an alter-
native input in place of depth images.
9 CONCLUSION
In this study, we constructed an MR dataset that in-
cludes depth images and generated shadows for vir-
tual objects on non-planar background geometries.
For the dataset construction, a new method for in-
corporating depth images as input was established.
By utilizing the depth images, this study demonstrated that it is possible to cast shadows of virtual objects onto surfaces beyond a flat background. The results of this study suggest that shadow generation that takes depth information into account can be applied to complex background geometries.
REFERENCES
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P.,
Huang, Q., Li, Z., Savarese, S., Savva, M., Song,
S., Su, H., Xiao, J., Yi, L., and Yu, F. (2015).
Shapenet: An information-rich 3d model repository.
arXiv preprint arXiv:1512.03012.
Chrysanthakopoulou, A. and Moustakas, K. (2024). Real-
time shader-based shadow and occlusion rendering in
ar. In 2024 IEEE Conference on Virtual Reality and
3D User Interfaces Abstracts and Workshops (VRW),
pages 969–970.
Couturier, A. (2023). Scenecity: 3d city generator addon
for blender. https://www.cgchan.com/store/scenecity.
5.5.2024.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. (2009). Imagenet: A large-scale hierarchical
image database. In 2009 IEEE Conference on Com-
puter Vision and Pattern Recognition, pages 248–255.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In 2016 IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Hoffman, H., Hollander, A., Schroder, K., et al. (1998).
Physically touching and tasting virtual objects en-
hances the realism of virtual experiences. Virtual Re-
ality, 3:226–234.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017).
Image-to-image translation with conditional adversar-
ial networks. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 5967–
5976.
Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual
losses for real-time style transfer and super-resolution.
In Leibe, B., Matas, J., Sebe, N., and Welling, M.,
editors, Computer Vision – ECCV 2016, pages 694–
711, Cham. Springer International Publishing.
Liu, D., Long, C., Zhang, H., Yu, H., Dong, X., and Xiao, C.
(2020). Arshadowgan: Shadow generative adversarial
network for augmented reality in single light scenes.
In 2020 IEEE/CVF Conference on Computer Vision
and Pattern Recognition (CVPR), pages 8136–8145.
Meng, Q., Zhang, S., Li, Z., Wang, C., Zhang, W., and
Huang, Q. (2023). Automatic shadow generation via
exposure fusion. IEEE Transactions on Multimedia,
25:9044–9056.
Ronneberger, O., Fischer, P., and Brox, T. (2015). U-net:
Convolutional networks for biomedical image seg-
mentation. In Navab, N., Hornegger, J., Wells, W. M.,
and Frangi, A. F., editors, Medical Image Computing
and Computer-Assisted Intervention – MICCAI 2015,
volume 9351 of Lecture Notes in Computer Science,
pages 234–241, Cham. Springer.
Schrater, P. and Kersten, D. (2000). How optimal depth cue
integration depends on the task. International Journal
of Computer Vision, 40:71–89.
Sheng, Y., Zhang, J., Philip, J., Hold-Geoffroy, Y., Sun, X.,
Zhang, H., Ling, L., and Benes, B. (2023). Pixht-lab:
Pixel height based light effect generation for image
compositing. In 2023 IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), pages
16643–16653.
Simonyan, K. and Zisserman, A. (2015). Very deep con-
volutional networks for large-scale image recognition.
In International Conference on Learning Representa-
tions.
Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang,
O. (2018). The unreasonable effectiveness of deep
features as a perceptual metric. In 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 586–595.