visual example of such an improvement on KITTI is
displayed in figure 11.
Figure 11: Contribution of adversarial learning. From left
to right: input image, prediction without adversarial learning,
prediction with adversarial learning. The approach with
adversarial learning fills in the occluded regions (highlighted
in red) more realistically.
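The behaviour shown in the figure comes from adding an adversarial term to the reconstruction objective, so that the generator is rewarded for producing occlusion fills the discriminator cannot distinguish from real views. As a minimal illustrative sketch (not the paper's exact loss), a generator objective combining a pixel-wise L1 term with the non-saturating adversarial term of Goodfellow et al. (2014) could be written as follows; the weight `lam` is a hypothetical value, not one taken from the paper:

```python
import numpy as np

def generator_loss(pred, target, disc_scores, lam=0.01):
    """Generator objective: L1 reconstruction plus an adversarial term.

    pred, target : predicted and ground-truth views (same shape).
    disc_scores  : discriminator outputs D(G(x)) in (0, 1] for the
                   predicted views.
    lam          : adversarial weight (hypothetical value).
    """
    rec = np.mean(np.abs(pred - target))        # pixel-wise L1 term
    adv = -np.mean(np.log(disc_scores + 1e-8))  # pushes D(G(x)) toward 1
    return rec + lam * adv
```

With a perfect reconstruction and a fooled discriminator (scores near 1) the loss approaches zero; lower discriminator scores increase the adversarial term, which is what drives the generator toward more realistic occlusion fills rather than blurred averages.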
5 CONCLUSION
In this article, we have described a method for synthesizing
light fields, trained jointly on light field datasets and
stereo datasets. The proposed method generates high-quality
light fields from a single input image, for images of diverse
content and semantics. Leveraging stereo data during training
allows the network to handle a wider variety of scenes and
semantics than light field datasets alone would permit.
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,
G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kud-
lur, M., Levenberg, J., Mané, D., Monga, R., Moore,
S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Van-
houcke, V., Vasudevan, V., Viégas, F., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and
Zheng, X. (2015). TensorFlow: Large-scale machine
learning on heterogeneous systems. Software avail-
able from tensorflow.org.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasser-
stein GANs. ICML.
Chollet, F. et al. (2015). Keras. https://keras.io.
Clark, A., Donahue, J., and Simonyan, K. (2019). Adversarial
video generation on complex datasets.
Deng, J., Dong, W., Socher, R., Li, L., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. CVPR.
Evain, S. and Guillemot, C. (2020). A lightweight neural
network for monocular view generation with occlu-
sion handling. PAMI.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. CVPR.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B.,
Warde-Farley, D., Ozair, S., Courville, A., and Ben-
gio, Y. (2014). Generative adversarial networks.
NIPS.
Horry, Y., Anjyo, K., and Arai, K. (1997). Tour into the
picture: using a spidery mesh interface to make ani-
mation from a single image. Proceedings of the 24th
annual conference on Computer graphics and inter-
active techniques, pages 225–232.
Ivan, A., Williem, and Park, I. K. (2019). Synthesizing a
4d spatio-angular consistent light field from a single
image.
Kalantari, N., Wang, T., and Ramamoorthi, R. (2016).
Learning-based view synthesis for light field cameras.
SIGGRAPH Asia.
Kingma, D. and Ba, J. (2015). Adam: a method for
stochastic optimization. ICLR.
Lippmann, G. (1908). La photographie intégrale. Comptes-
Rendus de l'Académie des Sciences.
Mildenhall, B., Srinivasan, P., Ortiz-Cayon, R., Kalantari,
N., Ramamoorthi, R., Ng, R., and Kar, A. (2019). Lo-
cal light field fusion: Practical view synthesis with
prescriptive sampling guidelines. ACM Transactions
on Graphics.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T.,
Ramamoorthi, R., and Ng, R. (2020). Nerf: Repre-
senting scenes as neural radiance fields for view syn-
thesis.
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y.
(2018). Spectral normalization for generative adver-
sarial networks. ICLR.
Ng, R. (2006). Digital light field photography.
Park, E., Yang, J., Yumer, E., Ceylan, D., and Berg, A.
(2017). Transformation-grounded image generation
network for novel 3d view synthesis. CVPR.
Pathak, D., Krähenbühl, P., Donahue, J., Darrell, T., and
Efros, A. (2016). Context encoders: Feature learning
by inpainting. CVPR.
Ruan, L., Chen, B., and Lam, M. L. (2018). Light field
synthesis from a single image using improved wasser-
stein generative adversarial network. In Jain, E. and
Kosinka, J., editors, EG 2018 - Posters. The Euro-
graphics Association.
Shih, M.-L., Su, S.-Y., Kopf, J., and Huang, J.-B. (2020).
3d photography using context-aware layered depth in-
painting. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR).
Srinivasan, P., Wang, T., Sreelal, A., Ramamoorthi, R., and
Ng, R. (2017). Learning to synthesize a 4d rgbd light
field from a single image. ICCV.
A Neural Network with Adversarial Loss for Light Field Synthesis from a Single Image