Editing Scene Illumination and Material Appearance
of Light-Field Images
Jaemin Cho¹, Dongyoung Choi¹, Dahyun Kang¹, Gun Bang² and Min H. Kim¹
¹School of Computing, KAIST, Daejeon, South Korea
²Research Department of Media Coding, ETRI, Daejeon, South Korea
{jmcho, dychoi}@vclab.kaist.ac.kr, 313usually@gmail.com, gbang@etri.re.kr, minhkim@kaist.ac.kr
Keywords:
Light-Field Image Decomposition, Scene Illumination, Neural Inverse Rendering.
Abstract:
In this paper, we propose a method for editing the scene appearance of light-field images. Our method enables
users to manipulate the illumination and material properties of scenes captured in the light-field format, offering versatile control over image appearance, including dynamic relighting and material appearance modification. These capabilities leverage our inverse rendering framework specially designed for light-field images. By effectively
separating light fields into appearance parameters, such as diffuse albedo, normal, specular intensity, and
roughness, within a multi-plane image domain, we overcome the traditional challenges of light-field image
decomposition. These challenges include handling front-parallel views and a limited image count, which
have previously hindered neural inverse rendering networks when applied to light-field image data.
Our method also approximates environmental illumination using spherical Gaussians, significantly enhancing
the realism of scene reflectance. Furthermore, by differentiating scene illumination into far-bound and near-
bound light environments, our method enables highly realistic editing of scene appearance and illumination,
especially for local illumination effects. This differentiation allows for efficient, real-time relighting rendering
and integrates seamlessly with existing layered light-field rendering frameworks. We demonstrate these relighting and material-editing capabilities on casually captured light-field images.
1 INTRODUCTION
Light-field cameras and devices have revolutionized
image capture and processing in computer vision and
graphics, enabling various applications in novel view
synthesis (Gortler et al., 1996; Choi et al., 2019;
Riegler and Koltun, 2020; Riegler and Koltun, 2021),
scene editing (Jarabo et al., 2014; Mihara et al., 2016),
and augmented reality (Holynski and Kopf, 2018). By
capturing multiple sub-aperture images, they provide
a depth of scene information far beyond what tradi-
tional cameras can offer.
In this paper, we propose a method that allows
for scene editing through our novel decomposition of
light-field images. We target key appearance parame-
ters—diffuse albedo, normals, specular intensity, and
roughness—within the multi-plane image (MPI) domain (Zhou et al., 2018).
Figure 1: We present a method to decompose light-field images into illumination and intrinsic appearance parameters, enabling realistic relighting and material editing. Refer to our supplemental video for additional results.
This technique not only captures but also enhances the realism of environmental illumination, thereby enabling effective material appearance editing and scene relighting in casually captured light-field images, as shown in Figure 1.
Despite the richness of the light-field image data,
processing light-field images still presents unique
challenges. Their sub-aperture images typically exhibit front-parallel view directions and a limited number of views, complicating the creation of comprehensive 3D models and the extraction of material appearance parameters. Furthermore, conventional light-field imaging methods often fall short in depicting specular reflections accurately because each pixel maintains a constant value regardless of the viewing direction (Flynn et al., 2019; Mildenhall et al., 2019; Zhou et al., 2018).
Addressing these challenges, we introduce a neu-
ral inverse rendering method especially tailored for
scene-scale light-field inputs within the MPI domain.
Our method employs spherical Gaussian approxima-
tions to segment the light environment into far-bound
and near-bound regions, capturing both global and lo-
cal illumination variations essential for realistic scene
appearance manipulation. The overall process of our
method is visualized in Figure 2. From a given light-
field image, our method decomposes sub-aperture im-
ages into appearance parameters (diffuse albedo, nor-
mal, specular intensity, and roughness) in multi-plane
image space, along with global/local scene illumi-
nation represented as spherical Gaussian functions.
The MPI has nine channels in total; the number next to each appearance parameter in Figure 2 indicates how many channels represent that parameter. These decom-
posed elements enable the manipulation of scene ap-
pearance and illumination at a scene-wide scale, not
just for individual objects. This approach not only
boosts rendering efficiency but also ensures seamless
integration with existing light-field rendering archi-
tectures (Flynn et al., 2019; Mildenhall et al., 2019).
By demonstrating practical applications ranging
from relighting to material appearance changes, our
work showcases the versatility and effectiveness of
our method in enhancing the realism and applicability
of light-field imaging for scene editing. Our contribu-
tions are summarized as follows:
- Introducing a scene editing method through a novel neural inverse rendering for scene-scale light-field inputs in the MPI domain.
- Enhancing scene realism through global/local illumination estimation using spherical Gaussians.
- Demonstrating practical scene editing applications, highlighting our method's versatility in relighting and material appearance modifications.
2 RELATED WORK
Light Fields. Light fields enable various applica-
tions, including dense depth map capture (Tao et al.,
2013), novel-view image creation (Pozo et al., 2019),
depth of field refocusing (Veeraraghavan et al.,
2007), and 3D content capturing for holographic dis-
plays (Jones et al., 2007). However, the geome-
try information they provide is typically sparser than
that from conventional multiview setups in structure-
from-motion (SfM) or neural rendering. Tradition-
ally, light-field studies have used additional 3D scan-
ning for accurate geometry (Wood et al., 2000;
Lensch et al., 2003). Recent advances include stereo-
imaging methods for dense depth maps (Hedman
et al., 2017; Hedman and Kopf, 2018; Pozo et al.,
2019; Kang et al., 2021) and the use of neural net-
works for novel view synthesis (Srinivasan et al.,
2019; Flynn et al., 2019; Penner and Zhang, 2017;
Choi et al., 2019). Despite these advancements, the
accuracy and view angle variation remain insufficient
for complete inverse rendering. Our work diverges by
focusing on inverse rendering of light fields in MPI
space, decomposing them into scene illumination and
appearance parameters without additional geometry
input.
2.1 Multi-Plane Image
Multi-plane Images (MPI) are used for novel view
synthesis by mapping target image information onto
multiple planes in the reference image’s coordinate
frame via inverse homography (Zhou et al., 2018).
MPIs represent perspective geometry with parallel
planes along the reference camera’s view frustum,
where each plane has RGB and alpha values. In an
MPI with $D$ planes, the transmittance $T_d$ of the $d$-th plane and the rendered image color $C$ are defined as $T_d = \alpha_d \prod_{i=1}^{d-1} (1 - \alpha_i)$ and $C = \sum_{d=1}^{D} c_d T_d$.
Light-Field Image Decomposition. While light-
field view synthesis has been widely studied (Flynn
et al., 2016; Zhou et al., 2018; Srinivasan et al.,
2019; Flynn et al., 2019; Penner and Zhang, 2017;
Choi et al., 2019; Mildenhall et al., 2019; Broxton
et al., 2020; Wang et al., 2018; Wu et al., 2017;
Wizadwongsa et al., 2021), light-field image decom-
position has seen less progress due to the insuffi-
cient geometry information provided by light fields
alone. Recent advancements have involved mak-
ing specific assumptions, like using dielectric mate-
rials for specular reflection (Tao et al., 2015; Kang
et al., 2021), or formulating joint optimization problems based on view changes, albedo reflectances, and material count (Wang et al., 2016; Li et al., 2017; Ngo et al., 2019).

Figure 2: Our method decomposes light-field images into appearance parameters and scene illumination in multi-plane image space, enabling scene-wide manipulation of appearance and lighting. The MPI channels are diffuse albedo (3), normal (3), specular intensity (1), roughness (1), and transparency (1).

However, these often involve un-
realistic assumptions and focus mainly on depths and
normals, neglecting view-dependent properties. Sub-
sequent methods have adopted physics-based (Kang
et al., 2021), classical graphics (Beigpour et al.,
2018), or perception-based reflectance models (Sulc
et al., 2018) for inverse rendering. Yet, these are typi-
cally formulated on the image plane (Beigpour et al.,
2018) or in the 4D light-field domain (Sulc et al.,
2018), limiting their suitability for efficient and in-
teractive rendering. To address these limitations, our
work proposes a decomposition method in the multi-
plane image space, aiming for more efficient and in-
teractive rendering capabilities.
Neural Inverse Rendering. Neural inverse render-
ing methods such as IRON (Zhang et al., 2022a) and
PS-NeRF (Yang et al., 2022) utilize images captured
under varying lighting conditions for each viewpoint,
with IRON focusing on edge-aware, physics-based
surface rendering and PS-NeRF addressing self-
occlusion in unknown lighting. These approaches
yield high-quality results. Other methods, including
PhySG (Zhang et al., 2021a), NerFactor (Zhang et al.,
2021b), NeRV (Srinivasan et al., 2021), NeILF (Yao
et al., 2022), InvRender (Zhang et al., 2022b), and
TensoIR (Jin et al., 2023), operate under fixed lighting
conditions. While these techniques offer various im-
provements, such as handling self-occlusion and in-
direct illumination, they do not estimate a local illu-
mination profile. VMINER (Fei et al., 2024) mod-
els local illumination as discrete point light sources,
which works well for small, distinct sources. How-
ever, for larger or more complex light sources, this
method becomes computationally expensive. Our
work diverges by estimating both global and local il-
lumination as spherical Gaussians in light fields, fa-
cilitating real-time relighting and natural scene illu-
mination adjustments and maintaining computational
efficiency even with complex lighting setups. Although current state-of-the-art neural inverse rendering networks are highly effective on single objects, for example, the synthetic datasets used in NeRF (Mildenhall et al., 2020), they struggle with the unique characteristics of scene-scale light-field images, such as the limited number of views and their near-front-parallel viewpoint arrangement. Our method addresses these
challenges, enabling effective use in realistic scene
appearance editing and relighting.
3 LIGHT-FIELD
DECOMPOSITION ON MPI
3.1 Preliminaries
A multi-plane image maps target image information
onto multiple planes using inverse homography from
the target to the reference image (Zhou et al., 2018).
MPIs effectively visualize perspective geometry from
forward-facing images. They consist of multiple par-
allel planes within the reference camera’s view frus-
tum, where each plane stores RGB color and alpha
transparency values per pixel. In an MPI with D
planes, the RGB color and alpha transparency of the $i$-th plane are denoted as $c_i$ and $\alpha_i$, respectively, with the planes ordered from nearest to farthest from the viewpoint. The transmittance $T_d$ of the $d$-th plane is given by $T_d = \alpha_d \prod_{i=1}^{d-1} (1 - \alpha_i)$, and the rendered image color $C$ is calculated as $C = \sum_{d=1}^{D} c_d T_d$.
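For concreteness, this per-pixel compositing can be written as a short sketch; the NumPy array layout and function name below are illustrative assumptions rather than the paper's implementation.

import numpy as np

def composite_mpi(colors, alphas):
    # colors: (D, 3) RGB per plane, ordered nearest to farthest from the viewpoint.
    # alphas: (D,)   alpha transparency per plane.
    # Transmittance T_d = alpha_d * prod_{i<d} (1 - alpha_i); output C = sum_d c_d * T_d.
    occlusion = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    transmittance = alphas * occlusion
    return (colors * transmittance[:, None]).sum(axis=0)

# Example: a half-transparent red plane in front of an opaque blue plane.
c = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
a = np.array([0.5, 1.0])
print(composite_mpi(c, a))  # [0.5, 0.0, 0.5]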
3.2 Geometry Estimation
Our approach begins with geometry estimation, fol-
lowed by material and environment mapping for
physics-based rendering. Given the complexity of
these factors in the final rendered image, we train these parameters in stages rather than simultaneously.

Figure 3: Multiview disparity calibration. (a) Uncalibrated disparity maps and their corresponding pixels to $x_0$ by optical flow; red boxes indicate the corresponding pixels of $x_0$. (b) Calibration result of one disparity map. (c) Calibrated disparity maps of other views.
Initially, we establish a reference geometry that guides the MPI during training. This involves match-
ing pixel correspondences between the target and
sub-camera images using optical flow predicted by
RAFT (Teed and Deng, 2020). We then calculate
the disparity map by minimizing the error in unpro-
jecting matched samples from all camera views into
world space, following Kang et al. (Kang et al., 2021).
To ensure comprehensive geometry training, we com-
pute disparity maps for all image viewpoints, not just
one reference viewpoint. This is crucial as relying on
a disparity map from only one viewpoint may lead
to inadequate training for areas not visible from that
viewpoint.
As shown in Figure 3(a), disparity maps vary with
the viewpoint, leading to blurriness during training.
To correct the value $x_0$ (the red box in Figure 3(b)), we find its corresponding pixels in other disparity maps (the red boxes in Figure 3(a)), and apply the correction using the following equation: $y_0 = \frac{1}{\beta - \alpha} \sum_{i=\alpha}^{\beta} x_i$, where $\alpha$ and $\beta$ define the range to eliminate outliers, with $\alpha = [N \times 0.10]$ and $\beta = [N \times 0.25]$. This range was determined through experiments.
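A minimal sketch of this correction step is shown below, assuming the corresponding disparity values are sorted in ascending order before trimming (the sorting and the integer rounding of α and β are our assumptions, and N is taken to be large enough that the trimmed range is non-empty).

import numpy as np

def calibrate_disparity(corresponding_disparities, lo=0.10, hi=0.25):
    # corresponding_disparities: disparities of the pixels matched to x_0 by
    # optical flow in the other (uncalibrated) disparity maps.
    x = np.sort(np.asarray(corresponding_disparities))  # sort so trimming removes outliers
    n = len(x)
    a, b = int(n * lo), int(n * hi)                      # alpha = [0.10 N], beta = [0.25 N]
    return x[a:b].mean()                                 # y_0 = (1 / (beta - alpha)) * sum_i x_i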
Finally, we adjust these disparity maps to align with the reference viewpoint using the extrinsic camera transformation.
As illustrated in Figure 4, we introduce a loss term
to evaluate the difference between the alpha trans-
parency of the MPI and the reference geometry. In
an MPI with D planes, where the planes are ordered
from nearest to farthest from the viewpoint, we de-
note the $i$-th plane as $p_i$. The alpha transparency of $p_i$ is given as $\alpha_i$, and its distance from the reference viewpoint is $d_i$. For a depth $d$ obtained from the previous
disparity map, we define a 1D Gaussian function $f(x) = \exp\left(-\frac{(x-d)^2}{2\sigma^2}\right)$, where $\sigma$ is the difference of disparity between adjacent planes, and the MPI's planes are equidistant in inverse depth space.

Figure 4: Geometry estimation. (a) Utilize the training viewpoint with corresponding MPI warping and sample $\alpha$ from the warped MPI. (b) Generate a 1D Gaussian function of depth, based on the initial disparity map from the training viewpoint. (c) Illustrate the difference between $\alpha_i$ and $f(x)$ along a single ray.
Considering that the disparity maps reveal depth
for visible parts but not hidden geometry, our loss
calculation omits differences between alpha trans-
parency and f (x) for non-visible areas. Thus, the loss
between the disparity map and alpha transparency is
defined as:
$$L_{\text{geo}} = \frac{1}{n} \sum_{i=1}^{n} \left( \alpha_i - f(d_i) \right)^2, \qquad (1)$$
where $n$ is the largest integer fulfilling either $\alpha_n > f(d_n)$ or $f(d_n) > 0.7$. The geometry of the MPI was learned best with the threshold value set to 0.7.
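The loss can be sketched per ray as follows; the array conventions and the visibility cutoff below reflect our reading of the thresholding rule above and are not taken verbatim from the authors' code.

import numpy as np

def geometry_loss_ray(alphas, depths, d_ref, sigma, tau=0.7):
    # alphas: (D,) alpha transparency sampled from the warped MPI along one ray.
    # depths: (D,) plane depths along the ray, nearest first.
    # d_ref:  depth from the initial disparity map at this pixel.
    # sigma:  disparity difference between adjacent MPI planes.
    f = np.exp(-(depths - d_ref) ** 2 / (2.0 * sigma ** 2))   # 1D Gaussian target
    # Supervise only up to the last plane n where alpha_n > f(d_n) or f(d_n) > tau,
    # so hidden geometry behind the visible surface is not penalized.
    hits = np.nonzero((alphas > f) | (f > tau))[0]
    n = hits[-1] + 1 if hits.size else len(alphas)
    return np.mean((alphas[:n] - f[:n]) ** 2)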
Disparity maps, generated for all training view-
points, account for geometry invisible from the refer-
ence viewpoint but visible from others. The loss cal-
culation uses the current training image viewpoint, its
disparity map, and the warped MPI for that viewpoint.
However, the original disparity estimation method as-
sumes view-independent imaging, meaning surface
colors do not change with viewing direction. There-
fore, the initially estimated disparity map deviates from the actual disparity, so it cannot be used directly as known geometry for rendering, as other methods do. To address this, we progressively decrease the weight of the loss comparing the alpha transparency with the disparity maps during training. This approach allows the model to quickly learn approximate geometry at first
and gradually adjust to correct geometry, particularly
in scenes where the initial disparity map is inaccu-
rate due to view-dependent effects like specular high-
lights.
3.3 Spatially-Varying Reflectance
Estimation
Our model employs physics-based rendering, differ-
ing from traditional MPI models (Mildenhall et al.,
2019; Wizadwongsa et al., 2021) for view synthesis.
It computes the rendering equation using geometry,
material, and light environment. Traditional render-
ing equations involve integrating over hemispheres to
account for all incident light directions. This integra-
tion, often estimated by sampling methods like Monte
Carlo, is computationally expensive and undermines
the primary advantage of traditional MPI, i.e., real-
time rendering.
To avoid this, our approach utilizes a spherical
Gaussian (SG) approximation. This assumes that
the light environment comprises a mixture of SGs,
and the bidirectional reflectance distribution function
(BRDF) is approximated by SGs. An $n$-dimensional spherical Gaussian is expressed as $G(\omega; \varepsilon, \lambda, \mu) = \mu\, e^{\lambda(\omega \cdot \varepsilon - 1)}$, where $\omega \in \mathbb{S}^2$ represents the input, $\varepsilon \in \mathbb{S}^2$ is the lobe axis, $\lambda \in \mathbb{R}^+$ denotes the lobe sharpness, and $\mu \in \mathbb{R}^n_+$ is the lobe amplitude.
The significant advantage of using SG approxima-
tion in the rendering equation is the elimination of the
sampling process. Since both the integral of an SG
and the inner product of two SGs can be calculated in
closed form, the rendering process becomes substan-
tially more efficient and cost-effective.
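The two closed-form identities we rely on can be written compactly; the sketch below uses the standard SG formulas for the spherical integral and for the product of two lobes, which are textbook identities rather than code from this paper.

import numpy as np

def sg_eval(w, axis, sharpness, amplitude):
    # G(w; eps, lambda, mu) = mu * exp(lambda * (w . eps - 1))
    return amplitude * np.exp(sharpness * (np.dot(w, axis) - 1.0))

def sg_integral(sharpness, amplitude):
    # Integral of an SG over the sphere: 2*pi*mu/lambda * (1 - exp(-2*lambda)).
    return 2.0 * np.pi * amplitude / sharpness * (1.0 - np.exp(-2.0 * sharpness))

def sg_inner_product(axis1, sharp1, amp1, axis2, sharp2, amp2):
    # The product of two SGs is again an SG, so the integral of their product
    # (the inner product used in the rendering equation) has a closed form.
    u = sharp1 * np.asarray(axis1) + sharp2 * np.asarray(axis2)
    sharp_m = np.linalg.norm(u)                        # sharpness of the product lobe
    amp_m = amp1 * amp2 * np.exp(sharp_m - sharp1 - sharp2)
    return sg_integral(sharp_m, amp_m)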
In our model, we specifically employ the Cook-
Torrance BRDF for our rendering process. Given
the assumption that the scene comprises dielectric
objects and the training images were captured with
a white balance algorithm, we use a 1-dimensional Fresnel coefficient $F_0 \in \mathbb{R}^+$, representing specular intensity. Additionally, we utilize 1-dimensional SGs with mono-color lobe amplitude $\mu \in \mathbb{R}^+$.
The Fresnel function, based on the Schlick model and approximated by SGs, is formulated as $F(\omega_i, h) = F_0 + (1 - F_0)\left(1 - \cos\theta_d\right)^5$, where $h$ is the half vector between the lighting direction $\omega_i$ and the viewing direction $\omega_o$, $\omega_i$ is the vector that reflects $\omega_o$ about the surface normal $n$, and $\theta_d$ is the angle between $\omega_i$ and $h$. Notably, when $\theta_d$ is less than 60 degrees, the Fresnel function $F$ closely resembles the Fresnel coefficient $F_0$. Therefore, we further approximate the Fresnel function as $F(\omega_i, h) \approx F_0$.
3.4 Global and Local Illumination
Estimation
Our model is designed to estimate geometry, material,
and illumination for scene-scale inputs. This requires
the capability to process both distant light sources,
such as the sun, and nearby light sources. Tradi-
tional methods, which typically use a single environ-
ment map per scene, fall short in representing close,
position-dependent light sources. To address this, we
estimate two distinct types of illumination maps: a
global illumination map for far-bound region and a
local illumination map for near-bound region. The
far-bound region refers to a region where the illumi-
nation is so far from the geometry of the MPI that
it comes in at a constant direction, no matter where
the incident point is located. On the other hand, the
near-bound region is where the illumination is in-situ
light, and its direction changes depending on the posi-
tion of the incident point. Both illumination maps are
formulated as a mixture of spherical Gaussian func-
tions, allowing for a comprehensive representation of
diverse lighting conditions.
Global Illumination. Our global illumination map, denoted as $L_{\text{far}}$, accounts for light sources that are infinitely distant. Such light sources provide consistent direction and intensity regardless of a point's location within the scene. We represent this map as a combination of twelve 1-D spherical Gaussians:
$$L_{\text{far}} = \sum_{k=1}^{12} G(\omega_i; \varepsilon_k, \lambda_k, \mu_k), \qquad (2)$$
where $\varepsilon_k \in \mathbb{S}^2$ represents the lobe axis, $\lambda_k \in \mathbb{R}^+$ is the lobe sharpness, and $\mu_k \in \mathbb{R}^+$ indicates the lobe amplitude.
Local Illumination. In contrast to global illumination, local illumination in our model is represented using virtual light sources, which vary in direction, intensity, and sharpness depending on the rendering location within the scene.

Figure 5: Global and local illumination estimation. (a) depicts virtual light sources as green circles and the spherical Gaussians of the global illumination map as blue circles. (b) shows the SGs of both global and local illumination maps for points $p_1$ and $p_2$. Note that in this figure, both blue and green circles represent SGs. (c) illustrates the combined illumination maps for points $p_1$ and $p_2$.

Each light source is char-
acterized by its position $x_k \in \mathbb{R}^3$, lobe sharpness $\lambda_k \in \mathbb{R}^+$, and amplitude $\mu_k \in \mathbb{R}^+$. These parameters determine the amplitude and sharpness when projected as spherical Gaussians in the local illumination map. The map adjusts lobe sharpness and amplitude relative to each point's distance from the light sources. Utilizing 24 virtual light sources per scene, the distance $d_k$ from the $k$-th light source to a point $x \in \mathbb{R}^3$ is $d_k = \|x - x_k\|$. The local illumination map $L_{\text{near}}$ at this point is represented as:
$$L_{\text{near}} = \sum_{k=1}^{N} G\!\left(\omega_i;\ \varepsilon_k,\ \frac{\lambda_k}{d_k^2},\ \frac{\mu_k}{d_k^2}\right), \qquad (3)$$
where $\varepsilon_k$ is the normalized vector of $x - x_k$, and $N$ is the number of virtual light sources. Thus, $L_{\text{near}}$ comprises a mixture of $N$ SGs. These two local and global illuminations are integrated into one illumination map, represented by SGs for each pixel, as shown in Figure 5.
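A minimal sketch of how a per-pixel SG light environment could be assembled from Equations (2) and (3) is shown below; the plain-array data layout and function names are our own assumptions.

import numpy as np

def local_sgs(point, light_pos, light_sharpness, light_amplitude):
    # Project the virtual light sources into SG lobes at a 3D point x (Eq. 3):
    # axis eps_k = normalize(x - x_k), sharpness lambda_k / d_k^2, amplitude mu_k / d_k^2.
    diff = point - light_pos                          # (N, 3), x - x_k
    d2 = np.sum(diff ** 2, axis=1)                    # squared distances d_k^2
    axes = diff / np.sqrt(d2)[:, None]
    return axes, light_sharpness / d2, light_amplitude / d2

def combined_illumination(point, global_axes, global_sharp, global_amp,
                          light_pos, light_sharpness, light_amplitude):
    # Concatenate the 12 global SGs (Eq. 2) with the per-point local SGs (Figure 5).
    l_axes, l_sharp, l_amp = local_sgs(point, light_pos, light_sharpness, light_amplitude)
    return (np.vstack([global_axes, l_axes]),
            np.concatenate([global_sharp, l_sharp]),
            np.concatenate([global_amp, l_amp]))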
Previous works like NeILF (Yao et al., 2022) and
InvRender (Zhang et al., 2022b) have used point-
specific environment maps, yielding high-quality re-
sults. In contrast to InvRender, which models indirect illumination, our method uses virtual light sources to represent point lights in the near-bound region. NeILF, in turn, represents the effects of near-bound point lights and indirect illumination with an illumination map learned for each point of the object; however, it cannot distinguish global illumination, indirect illumination, and point lights within the learned map. Our
method is specially designed for appearance editing
of light-field images.
This feature is beneficial for relighting tasks, such
as altering the color of a specific light source or re-
moving it entirely. Modifying lights in the light
source group leads to consistent changes in the cor-
responding SGs within the local illumination maps.
Therefore, our model incorporates both a global illu-
mination map and a light source group, projecting the
latter onto the local illumination map for each pixel
in the MPI during rendering and combining it with
the global illumination map for comprehensive scene
illumination.
3.5 Optimization
Our model’s optimization process involves several
loss functions, including the geometric consistency
loss $L_{\text{geo}}$ from Equation (1). We employ a total variation of diffuse albedo to promote its local smoothness. The reconstruction loss is formulated as follows:
$$L_{\text{rec}} = \alpha \left\| I_{\text{gt}} - I \right\|_2^2 + \beta \left\| I_{\text{gt}} - I \right\|_1, \qquad (4)$$
where $I$ represents the image synthesized by our model, and $I_{\text{gt}}$ is the ground-truth image.
The transparency loss originates from the regularization term of Neural Volumes (Lombardi et al., 2019), which causes the transparency values of the MPI to converge to 0 or 1, and is defined as follows:
$$L_{\text{tran}} = \sigma (1 - \sigma), \qquad (5)$$
where $\sigma$ is the transparency.
The total loss function, combining these elements,
is defined as:
$$L_{\text{total}} = \kappa \left( L_{\text{rec}} + \gamma\, TV(\rho) \right) + \omega\, L_{\text{geo}} + 2\, L_{\text{tran}}, \qquad (6)$$
where $\rho$ is the diffuse albedo, and $\alpha$, $\beta$, and $\gamma$ are weights balancing the loss terms, set to $\alpha = 2$, $\beta = 0.5$, and $\gamma = 0.03$. The weights $\omega$ and $\kappa$ also play a role, but they are adjusted over epochs. Specifically, $\omega$ decreases and $\kappa$ increases with the number of epochs, reducing the influence of $L_{\text{geo}}$ and enhancing the impact of the other loss terms as learning progresses.

Table 1: Quantitative comparison of average scores for inverse rendering results on the rendering dataset (McGuire, 2017), compared with IBL-NeRF (Choi et al., 2023). Green highlights indicate best accuracy.

                        Conference                          Sponza
                        PSNR    SSIM    LPIPS   RMSE        PSNR    SSIM    LPIPS   RMSE
  Diffuse    IBL-NeRF   11.97   0.7143  0.3577  0.0635      13.44   0.6151  0.4187  0.0453
             Ours       13.87   0.6930  0.3409  0.0410      15.43   0.5645  0.3210  0.0287
  Normal     IBL-NeRF    8.36   0.1599  0.9236  0.1459       8.62   0.1571  0.7835  0.1375
             Ours       12.47   0.2961  0.7576  0.0567      11.77   0.2897  0.6645  0.0665
  Rendering  IBL-NeRF   38.06   0.9871  0.1771  0.00016     29.67   0.9155  0.2830  0.00109
             Ours       38.36   0.9747  0.1711  0.00015     32.81   0.9353  0.1032  0.00057

The values for $\omega$ and $\kappa$ are determined as
follows: $\omega = \min\!\left(0.5,\ 0.05 + 1.1^{-\frac{e-900}{50}}\right)$ and $\kappa = \min\!\left(1,\ 0.01 + 1.1^{\frac{e-801}{10}}\right)$, where $e$ denotes the number of epochs.
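As a sketch, the total objective of Equation (6) with epoch-dependent weights could be assembled as below; the schedule functions follow the qualitative behavior described above (ω decaying toward 0.05 with a cap of 0.5, κ growing from 0.01 toward 1), and their constants are our reconstruction of the published formulas, so treat them as approximate.

def loss_weights(epoch):
    # Illustrative schedules: omega (weight of L_geo) decreases over training,
    # kappa increases; constants reconstructed from the paper and approximate.
    omega = min(0.5, 0.05 + 1.1 ** (-(epoch - 900) / 50.0))
    kappa = min(1.0, 0.01 + 1.1 ** ((epoch - 801) / 10.0))
    return omega, kappa

def total_loss(l_rec, tv_albedo, l_geo, l_tran, epoch, gamma=0.03):
    # L_total = kappa * (L_rec + gamma * TV(rho)) + omega * L_geo + 2 * L_tran  (Eq. 6)
    omega, kappa = loss_weights(epoch)
    return kappa * (l_rec + gamma * tv_albedo) + omega * l_geo + 2.0 * l_tran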
4 EXPERIMENTAL RESULTS
Implementation Details. Our method utilizes a
multilayer perceptron (MLP) that regresses MPI values from pixel coordinates, a strategy chosen to minimize noise in the MPI. This MLP takes the pixel coordinates (x, y) on the d-th plane as input and predicts the alpha transparency (α) and material information for each pixel. Key
to our approach is the explicit learning of spherical
Gaussians (SGs) and light sources for light fields,
which represent the light environment. This includes
learning parameters like lobe, amplitude, and sharp-
ness. For MPI implementation, we use 192 multi-
plane images, uniformly spaced in disparity space (in-
verse depth). The disparity range for the closest and
furthest MPI from the viewpoint is determined dur-
ing image calibration using structure-from-motion (SfM) (Schönberger and Frahm, 2016). In our MLP configuration, the
pixel position (x, y, d) is crucial for predicting pixel-
specific parameters (α, material). Positional encod-
ing (Mildenhall et al., 2020) is applied to the pixel
locations. This results in x and y being encoded into
20 dimensions and d into 16 dimensions, enhancing
the model’s ability to capture fine details in the MPI.
Rather than being predicted by the MLP, the albedo, global illumination (SGs), and local illumination (virtual light sources) are learned explicitly. A vanilla MLP with 6 layers
takes 56 dimensions as input and produces 3 dimen-
sions of output.
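A small sketch of the positional encoding and input assembly described above; the number of frequency bands (10 each for x and y, 8 for d, giving 20 + 20 + 16 = 56 input dimensions) matches the stated dimensions, while the NeRF-style frequency spacing is our assumption.

import numpy as np

def positional_encoding(v, num_bands):
    # NeRF-style encoding: [sin(2^j * pi * v), cos(2^j * pi * v)] for j = 0 .. num_bands - 1.
    freqs = (2.0 ** np.arange(num_bands)) * np.pi
    return np.concatenate([np.sin(freqs * v), np.cos(freqs * v)])

def encode_pixel(x, y, d):
    # Encode a pixel position (x, y) on the d-th plane into the 56-D MLP input.
    return np.concatenate([
        positional_encoding(x, 10),   # 20 dims
        positional_encoding(y, 10),   # 20 dims
        positional_encoding(d, 8),    # 16 dims
    ])

print(encode_pixel(0.25, -0.5, 0.1).shape)   # (56,)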
4.1 Quantitative Evaluation
Existing neural inverse rendering methods typically
require a large number of inward-looking pho-
tographs for a target object and often struggle with a
limited set of front-parallel light-field images. Among
these methods, IBL-NeRF (Choi et al., 2023) stands
out as it tackles scene-scale inverse rendering prob-
lems akin to our approach.
We use several light-field datasets, which are cap-
tured by large-baseline camera arrays in a structured
and unstructured manner. We then use COLMAP
to estimate the camera parameters. For the real im-
age dataset, we use a real forward-facing light field
dataset (Mildenhall et al., 2019), which consists of
images captured by handheld smartphones. Each
scene has 20 to 62 images, with a resolution of 3982×
2986, which we reduce to a resolution of 1080 × 720
in our method. The synthetic validation rendering
dataset uses the CONFERENCE and SPONZA model-
ing files from a 3D graphics model website (McGuire,
2017) and is rendered as 512 × 512 resolution images
from 49 viewpoints. To assess the effectiveness of our
method, we conducted a comparative analysis with
IBL-NeRF using this rendering dataset.
As shown in Figure 6, the estimated normals of
IBL-NeRF show that the boundaries of objects in the scene are unclear and blended together, likely due to the limited number of light-field images available as input.
In contrast, our method demonstrates the capabil-
ity to efficiently decompose light fields into diffuse
albedo, normals, and roughness for each pixel, using
an equivalent number of input images, thus achieving
more accurate and detailed results. Moreover, our ap-
proach excels in learning scene geometry from light
field datasets and consistently estimates uniform ma-
terial properties for each object within the scene. The
comparative results, as presented in Table 1, show
that our method surpasses IBL-NeRF in several key
aspects, including diffuse albedo, normal, roughness,
and overall rendering quality.
4.2 Scene Editing Results
Our inverse rendering method was applied to a real
forward-facing dataset (Mildenhall et al., 2019), con-
sisting of scenes captured from 20 to 30 nearly identical directions.

Figure 6: Compared to IBL-NeRF (Choi et al., 2023), which struggles with normal estimation due to limited input light-field images, our method effectively decomposes light fields into normal and material properties, yielding better results. Refer to Table 1 for quantitative evaluation.

Figure 7: Decomposition results by our method with real light-field photographs (Mildenhall et al., 2019). Our method decomposes input light fields into normal, disparity, and material appearance parameters and the global/local illumination profiles of the scene as well. This allows us to edit these parameters naturally (as shown in Figure 8). Scenes shown: Flower, Fortress, Horns, Leaves, and Orchids.

Figure 7 presents the normals, dis-
parity, diffuse albedo, specular intensity, roughness,
and both global and local illumination as estimated
by our method. Despite the limited number of in-
put sub-aperture images, our approach successfully
decomposes the scene’s geometry, normals, and ap-
pearance parameters. The accurate estimation of sur-
face normals contributes to a clear separation of diffuse albedo from specular intensity, resulting in locally smooth, shading-free appearances of intrinsic albedo.

Figure 8: Our method enables real-time scene editing. This allows dynamic changes to lighting and object appearance, including color and roughness. Refer to the supplemental video for real-time relighting demonstrations.

The illumination profiles effectively cap-
ture the smooth/sharp and directional qualities of the
scene’s lighting.
Figures 8 and 9 show scene editing results,
demonstrating how changes in material properties and
scene illumination can be effectively implemented.
Our method enables the rendering of realistic images,
even when the albedo and roughness of particular ob-
jects are modified. Furthermore, by substituting the
estimated global and local illuminations with a single
virtual light source, we achieve renderings that accu-
rately represent the altered illumination conditions in
relation to the scene’s surfaces. This not only val-
idates the potential for detailed illumination editing
but also underscores our method’s proficiency in iden-
tifying and integrating internal light sources as part of
the local illumination.
4.3 Ablation Study
Geometry Loss and Local Illumination. Geome-
try loss and local illumination are crucial elements
of our inverse rendering method. We typically use
12 SGs for global illumination and 24 virtual light
sources for local illumination. To assess the impact of
these components, we conduct a comparative study
where we omit local illumination and instead utilize
36 SGs solely for global illumination. As shown in
Figure 10, omitting geometry loss led to inaccura-
cies in complex scenes. Specifically, some objects
that were challenging to model geometrically were
rendered as translucent, erroneously revealing objects
behind them. Additionally, relying solely on global
illumination proved insufficient for accurately repre-
senting scene illumination, resulting in incorrect in-
verse rendering outcomes. These findings underscore
the importance of both geometry loss and local illu-
mination in achieving effective inverse rendering in
complex scenes.
Impact of the Number of SGs. The detail level
in our illumination map is directly influenced by the
number of spherical Gaussians: more SGs lead to a
more detailed representation. However, this bene-
fit comes at the cost of increased computational de-
mands. To find the ideal balance between detail and
efficiency, we conducted training on a real forward-
facing dataset (Mildenhall et al., 2019). The global
illumination does not need high-frequency patterns,
so it can be represented with 12 SGs. Thus, we exper-
imented with increasing the number of virtual light
sources. Note that the number of virtual light sources
influencing local illumination maps corresponds to
the number of SGs in these maps. The view synthesis
quality in our experiments is gauged by images syn-
thesized by the MLP, and the rendering time is mea-
sured from the moment MLP outputs are converted to
MPI format.
Table 2 details the rendering speed performance
across various counts of virtual light sources. We
compare the quality and speed of view synthesis
while increasing the number of virtual light sources
from 12 to 48 with a fixed number of SGs representing the global illumination. Using real forward-facing dataset scenes at a resolution of 1080×720, our model achieves a rendering speed of 29 frames per second on an NVIDIA GeForce RTX 3090.

Figure 9: Additional scene editing results. (a) Input images for inverse rendering. (b) Rendered images with edited illumination. (c) Object masks of the input images. (d) Albedo images with the albedo of specific objects edited. (e) Rendered images with edited albedo.

Figure 10: Ablation study. (a) Impact of the geometry loss in terms of image quality. (b) Impact of geometry loss and local illumination estimation for inverse rendering results.

This performance,
while not matching the state-of-the-art method NeX,
is notable considering the additional complexity of
our model, which includes physics-based rendering
and BRDF calculations. As shown in Table 2, quality
scores improve as the number of virtual light sources
increases, yet the improvement plateaus beyond 24
sources. Beyond this number, rendering speed de-
creases significantly. This trend is likely due to the
relatively simple lighting conditions in the real scenes
of our dataset, as opposed to more complex virtual
illumination settings. Therefore, to optimize the bal-
ance between rendering quality and speed efficiency,
we set the number of virtual light sources in our
model to 24 and the spherical Gaussians in the global
illumination map to 12.
Table 2: Ablation study. Average score and rendering speed
comparison according to the number of SGs for the real
forward-facing dataset (Mildenhall et al., 2019). The bold
numbers indicate our choice for experiments.
# of SGs (global+virtual)   PSNR    SSIM    LPIPS   FPS
24 (12+12)                  23.92   0.848   0.270   45
36 (12+24)                  24.14   0.857   0.258   29
48 (12+36)                  24.19   0.856   0.254   21
60 (12+48)                  24.21   0.858   0.255   16
5 DISCUSSION AND
CONCLUSIONS
We have presented a neural inverse rendering method
specifically designed for editing the scene-scale ap-
pearance of light-field images, incorporating elements
of physics-based rendering. This approach effec-
tively learns scene material information and lighting
environments, enabling diverse scene editing tasks
like relighting and altering material appearances with
high fidelity. We represent geometric information as
an MPI, training material data for each scene point,
which is suitable for real-time neural rendering ap-
plications and ensures compatibility with traditional
MPI architectures. Additionally, our method uniquely
categorizes the lighting environment into far-bound
and near-bound regions, adeptly handling both global
and local illumination of real scenes.
As limitations, our method primarily focuses on manipulating light intensity, orientation, or hue, and does not alter the relative locations of point light sources. While employing a limited number of spheri-
cal Gaussians effectively captures specular highlights,
it can occasionally encounter challenges in accu-
rately representing more complex specular phenom-
ena, such as mirror reflections.
Also, we acknowledge the presence of checker-
board artifacts in the normal maps and black dots in
the relighting results, which were noted in the sup-
plementary video and figures. These artifacts are a
consequence of the discrete structure of the MPI rep-
resentation at a given spatial resolution. Using a higher spatial resolution for the MPI representation can mitigate these issues.
ACKNOWLEDGEMENTS
Min H. Kim acknowledges the Samsung Research
Funding & Incubation Center (SRFC-IT2402-02),
the Korea NRF grant (RS-2024-00357548), the
MSIT/IITP of Korea (RS-2022-00155620, RS-2024-
00398830, 2022-0-00058, and 2017-0-00072), Mi-
crosoft Research Asia, LIG, and Samsung Electron-
ics.
REFERENCES
Beigpour, S., Shekhar, S., Mansouryar, M., Myszkowski,
K., and Seidel, H.-P. (2018). Light-Field appearance
editing based on intrinsic decomposition. Journal of
Perceptual Imaging, 1(1):010502–1–010502–1.
Broxton, M., Flynn, J., Overbeck, R., Erickson, D., Hed-
man, P., Duvall, M., Dourgarian, J., Busch, J.,
Whalen, M., and Debevec, P. (2020). Immersive light
field video with a layered mesh representation. ACM
Transactions on Graphics (TOG), 39(4):86–1.
Choi, C., Kim, J., and Kim, Y. M. (2023). IBL-NeRF:
Image-based lighting formulation of neural radiance
fields. In CGF, volume 42, page e14929. Wiley On-
line Library.
Choi, I., Gallo, O., Troccoli, A., Kim, M. H., and Kautz,
J. (2019). Extreme view synthesis. In Proc. the
IEEE/CVF ICCV, pages 7781–7790.
Fei, F., Tang, J., Tan, P., and Shi, B. (2024). VMINer:
Versatile multi-view inverse rendering with near-and
far-field light sources. In Proc. the IEEE/CVF CVPR,
pages 11800–11809.
Flynn, J., Broxton, M., Debevec, P., DuVall, M., Fyffe,
G., Overbeck, R., Snavely, N., and Tucker, R. (2019).
Deepview: View synthesis with learned gradient de-
scent. In Proc. the IEEE/CVF CVPR, pages 2367–
2376.
Flynn, J., Neulander, I., Philbin, J., and Snavely, N. (2016).
Deepstereo: Learning to predict new views from the
world’s imagery. In Proc. the IEEE/CVF CVPR, pages
5515–5524.
Gortler, S. J., Grzeszczuk, R., Szeliski, R., and Cohen, M. F.
(1996). The Lumigraph. In Proc. the 23rd annual con-
ference on Computer graphics and interactive tech-
niques, pages 43–54.
Hedman, P., Alsisan, S., Szeliski, R., and Kopf, J. (2017).
Casual 3d photography. ACM Transactions on Graph-
ics (TOG), 36(6):1–15.
Hedman, P. and Kopf, J. (2018). Instant 3d photography.
ACM Transactions on Graphics (TOG), 37(4):1–12.
Holynski, A. and Kopf, J. (2018). Fast depth densification
for occlusion-aware augmented reality. ACM Trans-
actions on Graphics (ToG), 37(6):1–11.
Jarabo, A., Masia, B., Bousseau, A., Pellacini, F., and
Gutierrez, D. (2014). How do people edit light fields?
ACM Transactions on Graphics (SIGGRAPH), 33(4).
Jin, H., Liu, I., Xu, P., Zhang, X., Han, S., Bi, S., Zhou, X.,
Xu, Z., and Su, H. (2023). TensoIR: Tensorial inverse
rendering. In Proc. the IEEE/CVF CVPR.
Jones, A., McDowall, I., Yamada, H., Bolas, M., and De-
bevec, P. (2007). Rendering for an interactive 360°
light field display. In ACM SIGGRAPH 2007 papers,
pages 40–es.
Kang, D., Jeon, D. S., Kim, H., Jang, H., and Kim, M. H.
(2021). View-dependent scene appearance synthesis
using inverse rendering from light fields. In Proc. the
IEEE ICCP, pages 1–12. IEEE.
Lensch, H. P., Kautz, J., Goesele, M., Heidrich, W., and
Seidel, H.-P. (2003). Image-based reconstruction of
spatial appearance and geometric detail. ACM Trans-
actions on Graphics (TOG), 22(2):234–257.
Li, Z., Xu, Z., Ramamoorthi, R., and Chandraker, M.
(2017). Robust energy minimization for brdf-invariant
shape from light fields. In Proc. the IEEE/CVF CVPR,
pages 5571–5579.
Lombardi, S., Simon, T., Saragih, J., Schwartz, G.,
Lehrmann, A., and Sheikh, Y. (2019). Neural vol-
umes: Learning dynamic renderable volumes from
images. ACM Transactions on Graphics (TOG),
38(4):65:1–65:14.
McGuire, M. (2017). Computer graphics archive. https://casual-effects.com/data.
Mihara, H., Funatomi, T., Tanaka, K., Kubo, H.,
Mukaigawa, Y., and Nagahara, H. (2016). 4d light
field segmentation with spatial and angular consisten-
cies. In Proc. the ICCP.
Mildenhall, B., Srinivasan, P. P., Ortiz-Cayon, R., Kalantari,
N. K., Ramamoorthi, R., Ng, R., and Kar, A. (2019).
Local light field fusion: Practical view synthesis with
prescriptive sampling guidelines. ACM Transactions
on Graphics (TOG), 38(4):1–14.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T.,
Ramamoorthi, R., and Ng, R. (2020). NeRF: Repre-
senting scenes as neural radiance fields for view syn-
thesis. In Proc. the ECCV.
Ngo, T.-T., Nagahara, H., Nishino, K., Taniguchi, R.-i., and
Yagi, Y. (2019). Reflectance and shape estimation
with a light field camera under natural illumination.
IJCV, 127(11-12):1707–1722.
Penner, E. and Zhang, L. (2017). Soft 3D reconstruction
for view synthesis. ACM Transactions on Graphics
(TOG), 36(6):1–11.
Pozo, A. P., Toksvig, M., Schrager, T. F., Hsu, J., Mathur,
U., Sorkine-Hornung, A., Szeliski, R., and Cabral, B.
(2019). An integrated 6DoF video camera and sys-
tem design. ACM Transactions on Graphics (TOG),
38(6):1–16.
Riegler, G. and Koltun, V. (2020). Free view synthesis. In
Proc. the ECCV, pages 623–640. Springer.
Riegler, G. and Koltun, V. (2021). Stable view synthesis. In
Proc. the IEEE/CVF CVPR, pages 12216–12225.
Schönberger, J. L. and Frahm, J.-M. (2016). Structure-
from-motion revisited. In Proc. the IEEE/CVF CVPR.
Srinivasan, P. P., Deng, B., Zhang, X., Tancik, M., Milden-
hall, B., and Barron, J. T. (2021). NeRV: Neural re-
flectance and visibility fields for relighting and view
synthesis. In Proc. the IEEE/CVF CVPR, pages 7495–
7504.
Srinivasan, P. P., Tucker, R., Barron, J. T., Ramamoorthi, R.,
Ng, R., and Snavely, N. (2019). Pushing the bound-
aries of view extrapolation with multiplane images. In
Proc. the IEEE/CVF CVPR, pages 175–184.
Sulc, A., Johannsen, O., and Goldluecke, B. (2018). In-
verse lightfield rendering for shape, reflection and nat-
ural illumination. In Energy Minimization Methods
in CVPR: 11th International Conference, EMMCVPR
2017, Venice, Italy, October 30–November 1, 2017,
Revised Selected Papers 11, pages 372–388. Springer.
Tao, M. W., Hadap, S., Malik, J., and Ramamoorthi, R.
(2013). Depth from combining defocus and corre-
spondence using light-field cameras. In Proc. the
IEEE/CVF CVPR, pages 673–680.
Tao, M. W., Su, J.-C., Wang, T.-C., Malik, J., and Ra-
mamoorthi, R. (2015). Depth estimation and specu-
lar removal for glossy surfaces using point and line
consistency with light-field cameras. IEEE transac-
tions on pattern analysis and machine intelligence,
38(6):1155–1169.
Teed, Z. and Deng, J. (2020). RAFT: Recurrent all-pairs
field transforms for optical flow. In Proc. the ECCV.
Veeraraghavan, A., Raskar, R., Agrawal, A., Mohan, A.,
and Tumblin, J. (2007). Dappled photography: Mask
enhanced cameras for heterodyned light fields and
coded aperture refocusing. ACM Transactions on
Graphics (TOG), 26(3):69.
Wang, T.-C., Chandraker, M., Efros, A. A., and Ramamoor-
thi, R. (2016). SVBRDF-invariant shape and re-
flectance estimation from light-field cameras. In Proc.
the IEEE/CVF CVPR, pages 5451–5459.
Wang, Y., Liu, F., Wang, Z., Hou, G., Sun, Z., and Tan,
T. (2018). End-to-end view synthesis for light field
imaging with pseudo 4DCNN. In Proc. the ECCV,
pages 333–348.
Wizadwongsa, S., Phongthawee, P., Yenphraphai, J., and
Suwajanakorn, S. (2021). NeX: Real-time view syn-
thesis with neural basis expansion. In Proc. the
IEEE/CVF CVPR, pages 8534–8543.
Wood, D. N., Azuma, D. I., Aldinger, K., Curless, B.,
Duchamp, T., Salesin, D. H., and Stuetzle, W. (2000).
Surface light fields for 3D photography. In SIG-
GRAPH, pages 287–296.
Wu, G., Zhao, M., Wang, L., Dai, Q., Chai, T., and Liu, Y.
(2017). Light field reconstruction using deep convolu-
tional network on EPI. In Proc. the IEEE/CVF CVPR,
pages 6319–6327.
Yang, W., Chen, G., Chen, C., Chen, Z., and Wong, K.-
Y. K. (2022). PS-NeRF: Neural inverse rendering for
multi-view photometric stereo. In Proc. the ECCV.
Yao, Y., Zhang, J., Liu, J., Qu, Y., Fang, T., McKinnon, D.,
Tsin, Y., and Quan, L. (2022). Neilf: Neural incident
light field for physically-based material estimation. In
Proc. the ECCV, pages 700–716. Springer.
Zhang, K., Luan, F., Li, Z., and Snavely, N. (2022a). IRON:
Inverse rendering by optimizing neural sdfs and mate-
rials from photometric images. In Proc. the IEEE/CVF
CVPR, pages 5565–5574.
Zhang, K., Luan, F., Wang, Q., Bala, K., and Snavely,
N. (2021a). PhySG: Inverse rendering with spherical
gaussians for physics-based material editing and re-
lighting. In Proc. the IEEE/CVF CVPR, pages 5453–
5462.
Zhang, X., Srinivasan, P. P., Deng, B., Debevec, P., Free-
man, W. T., and Barron, J. T. (2021b). NeRFactor:
Neural factorization of shape and reflectance under an
unknown illumination. ACM Transactions on Graph-
ics (TOG), 40(6):1–18.
Zhang, Y., Sun, J., He, X., Fu, H., Jia, R., and Zhou, X.
(2022b). Modeling indirect illumination for inverse
rendering. In Proc. the IEEE/CVF CVPR.
Zhou, T., Tucker, R., Flynn, J., Fyffe, G., and Snavely, N.
(2018). Stereo magnification: Learning view synthe-
sis using multiplane images. In ACM Transactions on
Graphics (SIGGRAPH).