MuSt-NeRF: A Multi-Stage NeRF Pipeline to Enhance Novel View
Synthesis
Sudarshan Raghavan Iyengar (https://orcid.org/0009-0003-4046-4969), Subash Sharma (https://orcid.org/0009-0004-1025-460X) and Patrick Vandewalle (https://orcid.org/0000-0002-7106-8024)
EAVISE, ESAT-PSI, KU Leuven, Belgium
{sudarshan.iyengar, patrick.vandewalle}@kuleuven.be
Keywords: Neural Radiance Fields, Novel-View Synthesis, 3D Reconstruction, Depth Priors, View-Dependent Effects.
Abstract:
Neural Radiance Fields (NeRFs) have emerged as a powerful technique for novel view synthesis, but ac-
curately capturing both intricate geometry and complex view-dependent effects, especially in challenging
real-world scenes, remains a limitation of existing methods. This work presents MuSt-NeRF, a novel multi-
stage pipeline designed to enhance the fidelity and robustness of NeRF-based reconstructions. The approach
strategically chains complementary NeRF architectures, organized into two stages: a depth-guided stage that
establishes a robust geometric foundation, followed by a refinement stage that enhances details and accu-
rately renders view-dependent effects. Crucially, MuSt-NeRF allows flexible stage ordering, enabling either
geometry-first or photometry-first reconstruction based on scene characteristics and desired outcomes. Exper-
iments on diverse datasets, including synthetic scenes and complex indoor environments from the ScanNet
dataset, demonstrate that MuSt-NeRF consistently outperforms single-stage NeRF and 3D Gaussian Splat-
ting methods, achieving higher scores on established metrics like PSNR, SSIM, and LPIPS, while producing
visually superior reconstructions. MuSt-NeRF’s flexibility and robust performance make it a promising ap-
proach for high-fidelity novel view synthesis in complex, real-world scenes. The code is made available at
https://github.com/sudarshan-iyengar/MuSt-NeRF.
1 INTRODUCTION
Reconstructing 3D scenes from a set of images and
synthesizing photorealistic novel views is a long-
standing challenge in computer vision, with applica-
tions in virtual/augmented reality, robotics, and med-
ical imaging (Manni et al., 2021; Lee et al., 2022;
Yang et al., 2024).
Neural Radiance Fields (NeRFs) (Mildenhall
et al., 2020) have significantly advanced novel view
synthesis by representing scenes using a multi-layer
perceptron (MLP) that maps 3D locations and view-
ing directions to color and density, enabling the ren-
dering of photorealistic novel views via differentiable
volume rendering.
Despite these advances, NeRFs still suffer from several drawbacks. Reconstructing accurate geometry from sparse views often produces artifacts such as fogginess and floaters, because traditional NeRF training relies primarily on photometric consistency, which is
insufficient to disambiguate between different geome-
tries yielding the same image. Furthermore, repre-
senting glossy surfaces and specular reflections ac-
curately is difficult due to the high degree of varia-
tion in appearance even with small changes in view-
point. Existing methods (Wei et al., 2021; Shafiei
et al., 2021; Tancik et al., 2022; Barron et al., 2021;
Martin-Brualla et al., 2021) often address these chal-
lenges individually, creating a need for more holistic
solutions.
To address these limitations, we introduce MuSt-
NeRF, a multi-stage NeRF pipeline that combines
the strengths of complementary NeRF architectures
within a flexible, multi-stage framework. By combin-
ing a depth-guided geometric foundation stage with a
photometric refinement stage, MuSt-NeRF achieves
robust performance even with sparse inputs while
capturing high-fidelity reflections and handling un-
bounded scenes. This two-stage approach, with its
adaptable geometry-first and photometry-first work-
flows, allows for optimized performance based on
scene characteristics. Our key contributions through
this paper are threefold:
1. A novel two-stage NeRF architecture capable of
handling various scene types, including those with
unbounded elements, complex lighting, and a mix
of diffuse and specular materials.
2. A flexible pipeline design that supports both
geometry-first and photometry-first execution,
adapting to varying scene characteristics.
3. A new composite score metric that com-
bines PSNR, SSIM, and LPIPS to provide a
perceptually-aligned evaluation of rendering qual-
ity. This composite score is used to refine regions
of the scene requiring additional photometric or
geometric detail.
2 RELATED WORK
2.1 Neural Radiance Fields
Neural Radiance Fields use neural networks to rep-
resent a 3D scene as a continuous function. This
function, parameterized by a multi-layer perceptron
(MLP), takes as input the 3D location x = (x, y, z)
and 2D viewing direction (θ, φ) and outputs the radi-
ance c = (R, G, B) and volume density σ at that point
and viewing direction (Mildenhall et al., 2020). The
NeRF MLP learns to map 5D input coordinates to
the corresponding radiance and volume density val-
ues. This can be represented as follows:
F_Θ : (x, d) → (c, σ)    (1)
More precisely, positional encoding is applied to the MLP inputs, mapping them to higher-frequency Fourier features so that the network can represent high-frequency details more accurately. This encoding is omitted from the equation for simplicity.
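To make the encoding concrete, the following is a minimal sketch of such a Fourier-feature mapping; the number of frequency bands (10 here) follows the original NeRF formulation and is an illustrative assumption rather than the exact MuSt-NeRF configuration.

```python
import numpy as np

def positional_encoding(p: np.ndarray, num_bands: int = 10) -> np.ndarray:
    """Map inputs p (shape [..., D]) to sin/cos Fourier features (illustrative sketch).

    Returns an array of shape [..., 2 * num_bands * D].
    """
    freqs = 2.0 ** np.arange(num_bands) * np.pi           # 2^k * pi for k = 0..L-1
    scaled = p[..., None, :] * freqs[:, None]              # [..., num_bands, D]
    features = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    return features.reshape(*p.shape[:-1], -1)

# Example: encode a single 3D point.
x = np.array([0.1, -0.4, 0.7])
print(positional_encoding(x).shape)  # (60,) for 10 bands on a 3D input
```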
To render a novel view, NeRF casts a ray r(t) =
o + td (where o is the camera origin, d is the ray di-
rection, and t is the distance along the ray) through
each pixel into the scene. The color c(r) and volume
density σ(r) are then computed at 3D points r sam-
pled along each ray. These sampled values are inte-
grated within the near and far bounds [t_n, t_f] using the volume rendering equations:

T(t) = exp( −∫_{t_n}^{t} σ(r(s)) ds )    (2)

C(r) = ∫_{t_n}^{t_f} T(t) · σ(r(t)) · c(r(t), d) dt    (3)
In practice, the integral is computed in its dis-
cretized form using stratified and hierarchical sam-
pling.
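As an illustration of this discretization, the following numpy sketch composites per-sample densities and colors along a single ray with the standard quadrature (alpha-compositing) rule; the sample counts and values are arbitrary placeholders.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Discretized volume rendering along one ray (quadrature form of Eqs. 2-3); illustrative sketch.

    sigmas: [N] per-sample densities, colors: [N, 3] per-sample RGB, t_vals: [N] sample distances.
    """
    deltas = np.diff(t_vals, append=1e10)                            # distances between samples
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # accumulated transmittance T
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # composited RGB

# Tiny example with 4 samples along a ray.
rgb = render_ray(np.array([0.0, 0.5, 2.0, 0.1]),
                 np.random.rand(4, 3),
                 np.linspace(2.0, 6.0, 4))
print(rgb)
```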
This rendering process, while capable of generat-
ing high-quality images, presents several limitations.
Specifically, NeRF requires a large number of input
views for accurate reconstructions, struggles with un-
bounded scenes, and can exhibit difficulties capturing
view-dependent effects, particularly specular reflec-
tions.
2.2 Enhancing NeRF - Addressing Its
Core Limitations
Several extensions to the original NeRF model have
been proposed to address these limitations (Rabby
and Zhang, 2023; Gao et al., 2023; Dellaert and Yen-
Chen, 2021). A key aspect for improving geometric
robustness and enabling training with sparse views is
the incorporation of depth information. Dense Depth
Priors NeRF (Roessle et al., 2022) leverages read-
ily available depth information from Structure-from-
Motion (SfM) pipelines, using a depth completion
network based on ResNet-18 to generate dense depth
and uncertainty maps from sparse point clouds. These
maps then guide ray sampling during training and are
incorporated into a depth loss term, improving geo-
metric accuracy and reducing reliance on dense input
views. MuSt-NeRF employs a similar depth-guided strategy in its initial stage to build a strong geometric foundation.
Handling unbounded scenes is another significant
challenge as it implies that content in the scene can
lie at arbitrarily far distances (theoretically tending to
infinity). The key challenges in applying NeRF-like
models to unbounded scenes are finding an effective
way to parameterize the 3D space and finding effi-
cient ray sampling strategies. Mip-NeRF 360 (Bar-
ron et al., 2022) addresses these in two ways: Firstly,
they introduce a non-linear scene parameterization
that maps 3D coordinates onto a bounded sphere, con-
tracting objects that are farther away towards the cen-
ter, leaving the objects closer and near the center rela-
tively unchanged. Secondly, rather than a single MLP
being trained, MipNeRF-360 uses two MLPs: a Pro-
posal MLP which predicts solely volume density, and
a NeRF MLP which predicts both volume density and
radiance. The Proposal MLP is a smaller and faster
MLP that is evaluated multiple times in order to gen-
erate a coarse representation of the scene’s density
distribution and is then used to guide the NeRF MLP’s
sampling process by informing it of the regions that
are likely to contain surfaces. MuSt-NeRF incorpo-
rates these strategies within its photometric refine-
ment stage, enabling the representation of unbounded
scenes. However, unlike Mip-NeRF 360, MuSt-NeRF
also explicitly addresses view-dependent effects, fur-
ther enhancing realism.
View-dependent effects, particularly specular and
glossy reflections, significantly impact the realism of
novel views. Ref-NeRF (Verbin et al., 2021) tackles
this challenge by explicitly modeling reflected radi-
ance using a combination of viewing directions and
surface normals. It utilizes an Integrated Directional
Encoder (IDE) to capture a distribution of reflection
directions and disentangles diffuse and specular com-
ponents. While Ref-NeRF demonstrates improve-
ments in reflection rendering, it can still be computa-
tionally demanding, particularly when combined with
techniques for unbounded scenes. MuSt-NeRF in-
tegrates the reflection modeling capabilities of Ref-
NeRF into its refinement stage. Crucially, by building
on the geometrically robust foundation established in
the initial stage, MuSt-NeRF mitigates the computa-
tional burden and data requirements typically associ-
ated with high-fidelity reflection rendering. This com-
bination enables efficient and realistic novel view syn-
thesis even with sparse views and complex lighting.
3 METHOD
Our goal is to synthesize photorealistic novel views
of a real-world scene given N input images and their
corresponding camera poses, obtained via COLMAP
(Schönberger and Frahm, 2016). MuSt-NeRF
achieves this through two chained NeRF stages, in-
tegrating the strengths of recent NeRF advancements.
A depth-guided initial stage builds a strong geomet-
ric foundation, which is then refined by a photomet-
ric refinement stage. This two-stage approach, with
flexible stage ordering (geometry-first or photometry-
first), enhances the quality of novel view synthesis.
3.1 Stage 1: Geometric Foundation
This stage focuses on establishing a robust geometric
representation of the scene, even from a sparse set of
input images. For this, we leverage depth information
to guide the NeRF optimization.
First, we obtain camera poses and construct a
sparse point cloud using SfM via COLMAP. In the
absence of ground truth depth maps, we utilize the
sparse point cloud generated by COLMAP and em-
ploy a ResNet18-based depth completion network
(Roessle et al., 2022) to generate dense depth maps
z(r) and corresponding uncertainty maps s(r). The
depth maps guide ray sampling during training, con-
centrating samples in regions with higher object like-
lihood, as indicated by the depth map and uncertainty
estimate. This depth-guided sampling enhances geo-
metric accuracy, especially with sparse view inputs.
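One simple way to realize such depth-guided sampling is to mix uniformly drawn samples with samples drawn from a Gaussian centred on the completed depth; the sketch below illustrates this idea. The exact sampling scheme of Dense Depth Priors NeRF differs in detail, so the function and its parameters are illustrative assumptions.

```python
import numpy as np

def depth_guided_samples(z, s, t_near, t_far, n_uniform=32, n_depth=32, rng=None):
    """Mix uniform ray samples with samples concentrated around the completed depth (illustrative sketch).

    z, s: completed depth and its uncertainty for this ray; [t_near, t_far]: ray bounds.
    """
    rng = rng or np.random.default_rng()
    t_uniform = rng.uniform(t_near, t_far, n_uniform)
    t_depth = rng.normal(loc=z, scale=s, size=n_depth)      # concentrate samples near the likely surface
    t_all = np.clip(np.concatenate([t_uniform, t_depth]), t_near, t_far)
    return np.sort(t_all)

print(depth_guided_samples(z=2.4, s=0.05, t_near=0.5, t_far=6.0)[:5])
```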
The Stage 1 NeRF MLP is trained using a com-
bination of loss functions designed to leverage both
photometric information from the RGB images and
geometric constraints from the dense depth maps.
The total loss function is proposed as a weighted lin-
ear combination of the color and depth losses, where
λ is a hyperparameter that can be tuned to determine
the weight given to the depth loss:
L = L_color(r) + λ · L_depth(r)    (4)

L_color = ‖Ĉ(r) − C(r)‖₂²    (5)

L_depth = { log ŝ(r)² + (ẑ(r) − z(r))² / ŝ(r)²   if α,   0 otherwise }    (6)

where α = ( |ẑ(r) − z(r)| > s(r) ) ∨ ( ŝ(r) > s(r) ).
L
color
is a standard loss based on the difference
between the rendered and ground truth RGB colors.
L
depth
uses the Gaussian Negative Log-Likelihood
(GNLL) loss to penalize the model in the following
conditions: if the predicted depth ˆz(r) differs from
the target depth z(r) by more than the target’s uncer-
tainty/standard deviation s(r), or if the predicted un-
certainty ˆs(r) exceeds s(r). In such cases, the GNLL
loss is used; otherwise, the loss is zero.
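A compact PyTorch sketch of this Stage 1 objective is given below; the tensor shapes, the example batch of rays, and the depth-loss weight lam are illustrative assumptions rather than the exact training configuration.

```python
import torch

def stage1_loss(c_pred, c_gt, z_pred, z_gt, s_pred, s_gt, lam=0.1):
    """Color loss plus the conditional Gaussian NLL depth loss of Eqs. (4)-(6); illustrative sketch."""
    loss_color = ((c_pred - c_gt) ** 2).sum(dim=-1)                   # per-ray squared RGB error
    gnll = torch.log(s_pred ** 2) + (z_pred - z_gt) ** 2 / (s_pred ** 2)
    active = (torch.abs(z_pred - z_gt) > s_gt) | (s_pred > s_gt)      # condition alpha from Eq. (6)
    loss_depth = torch.where(active, gnll, torch.zeros_like(gnll))
    return (loss_color + lam * loss_depth).mean()

# Example on a batch of 4 rays with placeholder values.
loss = stage1_loss(torch.rand(4, 3), torch.rand(4, 3),
                   torch.rand(4) + 1.0, torch.rand(4) + 1.0,
                   torch.rand(4) * 0.1 + 0.05, torch.full((4,), 0.1))
print(loss.item())
```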
We evaluate the rendered test views using a com-
posite score (range 0-1) based on three metrics,
PSNR, SSIM, and LPIPS:
C = w_PSNR · (PSNR / PSNR_max) + w_SSIM · SSIM + w_LPIPS · (1 − LPIPS)    (7)
Test images with a composite score below a
threshold (see Section 4.1), along with their neighbor-
ing views, are passed to Stage 2 for refinement. This
selection strategy ensures that challenging regions, in-
cluding those near poorly reconstructed areas, receive
additional attention in the refinement stage. The ar-
chitecture can be seen in Figure 1.
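The composite score and the refinement decision can be expressed compactly; the weights and the 0.7 threshold below follow Section 4.1, while the helper names themselves are illustrative.

```python
def composite_score(psnr, ssim, lpips,
                    w_psnr=0.20, w_ssim=0.35, w_lpips=0.45, psnr_max=35.0):
    """Composite quality score of Eq. (7); weights follow Section 4.1 (illustrative sketch)."""
    return w_psnr * (psnr / psnr_max) + w_ssim * ssim + w_lpips * (1.0 - lpips)

def needs_refinement(psnr, ssim, lpips, threshold=0.7):
    """Flag a test view for Stage 2 refinement when its score falls below the threshold."""
    return composite_score(psnr, ssim, lpips) < threshold

# Example: image 12 of the Room scene (Table 3) falls below the 0.7 threshold.
print(composite_score(24.792, 0.720, 0.432))   # ~0.65 -> refined in Stage 2
print(needs_refinement(30.609, 0.930, 0.071))  # False
```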
3.2 Stage 2: Photometric Refinement
Building upon the geometric foundation established
in Stage 1, this stage refines the scene representa-
tion by focusing on high-fidelity rendering of view-
dependent effects, particularly specular reflections,
Figure 1: The proposed MuSt-NeRF architecture with Stage 1 on top and Stage 2 below. Stage 1 performs a depth-guided
reconstruction using either available ground truth depth maps or depth-completed maps from COLMAP sparse point clouds.
Stage 2 performs a photometric-driven reconstruction with a focus on unbounded elements and view-dependent effects.
and handling unbounded scenes. We achieve this by
integrating and extending the principles of Mip-NeRF
360 (Barron et al., 2022) and Ref-NeRF (Verbin et al.,
2021).
To effectively handle unbounded scenes, we incor-
porate core elements of Mip-NeRF 360. Specifically,
we use a non-linear scene parameterization, which
maps scene coordinates onto a unit sphere. This pa-
rameterization improves the representation of distant
objects by contracting them towards the origin while
preserving the relative positions of nearby points. We
also employ online distillation by splitting the MLP
into a proposal MLP and a NeRF MLP. The proposal
MLP’s density predictions guide the hierarchical sam-
pling of the NeRF MLP, i.e., more samples are taken
in regions where the proposal MLP predicts higher
volume density. This concentrates rendering effort
on regions likely to contain surfaces, improving ef-
ficiency, especially in unbounded scenes.
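For reference, one common instantiation of such a contraction, taken from the Mip-NeRF 360 paper, maps unbounded coordinates into a ball of radius 2; the sketch below is illustrative, and the exact parameterization used in MuSt-NeRF may differ in its constants.

```python
import numpy as np

def contract(x: np.ndarray) -> np.ndarray:
    """Scene contraction in the style of Mip-NeRF 360 (illustrative sketch).

    Points with norm <= 1 are unchanged; distant points are pulled towards the boundary of radius 2.
    """
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    contracted = (2.0 - 1.0 / norm) * (x / norm)
    return np.where(norm <= 1.0, x, contracted)

print(contract(np.array([0.3, 0.2, 0.1])))    # unchanged (inside the unit ball)
print(contract(np.array([100.0, 0.0, 0.0])))  # ~[1.99, 0, 0], near the boundary
```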
The NeRF MLP, however, also incorporates key elements to more faithfully represent specular reflections. In addition to the view direction used by traditional NeRF models, we feed predicted surface normals n̂ to the NeRF MLP. The combination of surface normals and view direction enables the explicit modeling of reflected radiance, which is pivotal for obtaining realistic specular reflections. This is achieved using an Integrated Directional Encoder (Verbin et al., 2021), which encodes a distribution of reflection directions that accounts for surface roughness, allowing the model to represent a wide range of material properties, from diffuse to highly specular. The IDE processes n̂, x, and d, outputting a specular color c_s. This specular color is then combined with the diffuse color from the proposal MLP using a weighted linear combination, which acts as a tone mapper, to obtain the final color ĉ.
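For intuition, the reflected view direction that such a directional encoding is built around can be computed directly from the predicted normal; the sketch below shows this reflection step only, while the IDE's roughness-dependent encoding of the reflection distribution is omitted.

```python
import numpy as np

def reflect(d: np.ndarray, n: np.ndarray) -> np.ndarray:
    """Reflected radiance direction about the unit normal n (illustrative sketch).

    d is the unit ray direction from the camera into the scene, so the outgoing
    direction towards the camera is -d and the reflection is 2((-d).n)n + d.
    """
    return 2.0 * np.dot(-d, n) * n + d

# Ray travelling downwards onto a surface whose normal points up.
d = np.array([0.0, -np.sqrt(0.5), -np.sqrt(0.5)])
n = np.array([0.0, 0.0, 1.0])
print(reflect(d, n))  # mirrored about the normal
```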
The Stage 2 architecture, comprising the proposal
and NeRF MLPs described above, is trained using a
photometric loss:
L_color = ‖Ĉ(r) − C(r)‖₂²    (8)
We also include a regularizer on normals, as done
in Ref-NeRF, that penalizes normals oriented away
from the camera for samples along the ray that con-
tribute to the final color. This regularizer term en-
forces volume density to concentrate around surfaces,
helping to resolve “foggy” artifacts often seen near
reflective surfaces in NeRF outputs.
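A sketch of such an orientation penalty, in the spirit of Ref-NeRF, is shown below; the per-sample weights are the rendering weights from the volume rendering quadrature, and the exact weighting used in MuSt-NeRF is an assumption.

```python
import torch

def orientation_penalty(weights, normals, view_dir):
    """Penalize predicted normals that face away from the camera (illustrative sketch).

    weights: [N] rendering weights along the ray, normals: [N, 3] predicted unit normals,
    view_dir: [3] unit direction from the camera into the scene.
    """
    # n . d > 0 means the normal points away from the camera (back-facing).
    facing_away = torch.clamp((normals * view_dir).sum(dim=-1), min=0.0)
    return (weights * facing_away ** 2).sum()

penalty = orientation_penalty(torch.rand(8),
                              torch.nn.functional.normalize(torch.randn(8, 3), dim=-1),
                              torch.tensor([0.0, 0.0, -1.0]))
print(penalty.item())
```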
3.3 Integrating Multi-Stage Outputs
MuSt-NeRF combines the outputs of its two stages
to generate the final novel view renderings, includ-
ing a dynamic walkthrough. Because the two stages operate in different coordinate systems, we present a method to align them, ensuring proper integration of the stages.
Walkthrough Generation. We generate walk-
throughs by rendering novel views along smooth cam-
era trajectories created using B-spline interpolation.
Key images from the input set define the desired path,
and intermediate poses are generated via B-spline in-
terpolation between these key images. Finally, we
render novel views from each of these interpolated
poses, creating a sequence of images that forms the
walkthrough. We empirically observed that a B-spline
of degree 1 with 5 to 10 interpolated poses between
each pair of key images provides a balance between
fine-grained control and efficient computation.
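The following sketch illustrates a degree-1 (piecewise-linear) pose interpolation of this kind, using spherical linear interpolation for rotations and linear interpolation for translations; this scipy-based realization is one plausible implementation, not the exact one used in MuSt-NeRF.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp

def interpolate_poses(key_rots, key_trans, n_between=8):
    """Piecewise-linear interpolation between key camera poses (illustrative sketch).

    key_rots: list of 3x3 rotation matrices, key_trans: list of 3-vectors.
    Returns a list of (R, t) tuples forming a walkthrough trajectory.
    """
    key_times = np.arange(len(key_rots), dtype=float)
    slerp = Slerp(key_times, Rotation.from_matrix(np.stack(key_rots)))
    query = np.linspace(0.0, key_times[-1], (len(key_rots) - 1) * n_between + 1)
    rots = slerp(query).as_matrix()
    trans = np.stack([np.interp(query, key_times, np.array(key_trans)[:, k]) for k in range(3)], axis=-1)
    return list(zip(rots, trans))

# Example: three key poses along a straight line.
poses = interpolate_poses([np.eye(3)] * 3, [np.zeros(3), np.ones(3), 2 * np.ones(3)], n_between=5)
print(len(poses))  # 11 interpolated poses
```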
Coordinate System Alignment. After generating a
walkthrough with Stage 1 poses, some novel views,
particularly those spatially close to regions requiring
refinement, may still exhibit artifacts. To address this,
we then re-render these views using the trained Stage
2 model.
Since Stage 1 and Stage 2 use different subsets
of the original image set as input to COLMAP, their
respective coordinate systems are misaligned. There-
fore, to obtain an image from an identical pose, we
must first align the coordinate systems before render-
ing.
Our alignment method uses a shared image,
present in the input sets of both stages (I_shared), as a common reference. This image is typically chosen as the Stage 1 test image that did not meet the quality threshold.
Figure 2: Coordinate System Alignment. The diagram illustrates the relationship between the Stage 1 and Stage 2 coordinate systems.

Let P_stage1,shared and P_stage2,shared be the camera poses of I_shared in the Stage 1 and Stage 2 coordinate systems, respectively. The transformation that aligns the Stage 1 coordinate system to the Stage 2 coordinate system can be computed as:

P_stage2,1 = P_stage2,shared · (P_stage1,shared)⁻¹    (9)

This transformation is then used to map a novel view pose P_stage1,Y from the Stage 1 coordinate system to its Stage 2 equivalent as follows:

P_stage2,Y = P_stage2,1 · P_stage1,Y    (10)
3.4 Bi-Directionality of MuSt-NeRF
A key feature of MuSt-NeRF is its flexibility in ex-
ecution order. The pipeline can operate in either a
standard geometry-first or a reversed photometry-first
mode, adapting to different scene characteristics.
Standard Pipeline (Geometry-First). In this mode,
Stage 1 (Geometric Foundation) is executed first, fol-
lowed by Stage 2 (Photometric Refinement). This
approach prioritizes establishing a strong geometric
base before refining details and view-dependent ef-
fects. It is particularly well-suited for scenes where
accurate geometry is paramount, or when working
with sparse input views where a robust initial recon-
struction is essential. It also improves the rendering
of challenging lighting conditions since the geometry
is robust.
Reversed Pipeline (Photometry-First). In this
mode, Stage 2 (Photometric Refinement) precedes
Stage 1 (Geometric Foundation). This approach pri-
oritizes the accurate capture of lighting, reflections,
and view-dependent effects, even at the potential cost of initial geometric inaccuracies, which Stage 1 can subsequently rectify. This mode is beneficial for scenes where fine details and accurate rendering of complex lighting are of primary importance.
The optimal choice between the standard and re-
versed pipelines depends on the specific scene prop-
erties and rendering priorities, as demonstrated in our
experiments.
4 EXPERIMENTS
4.1 Implementation Details
Experimental Configurations. To ensure meaning-
ful comparisons, the configurations for each experi-
ment were kept consistent. All experiments were per-
formed on a system equipped with an NVIDIA RTX
2080 Super GPU with 8 GB of VRAM. An overview
of the most important configurations can be found in
Table 1.
Evaluation Metrics. We evaluate the quality of
novel view renderings using three established metrics
commonly employed in the NeRF literature: PSNR,
SSIM, and LPIPS. In addition to these individual met-
rics, we also use a composite score, defined in Equa-
tion (7), with values for w
PSNR
, w
SSIM
, w
LPIPS
, and
PSNR
max
set to 0.20, 0.35, 0.45, and 35 dB, respec-
tively. These weights prioritize perceptually aligned
metrics (LPIPS and SSIM) over purely pixel-based
comparisons (PSNR), reflecting human visual percep-
tion of scene similarity. A threshold on this compos-
ite score is used to determine if an image from Stage
1 needs to be refined in Stage 2. This threshold was
experimentally set to 0.7, based on qualitative assess-
ment of rendered image quality: images with com-
posite scores below 0.7 exhibited noticeable artifacts
and were deemed unsatisfactory.
4.2 Results and Analysis
We evaluate MuSt-NeRF on increasingly complex
scenes. Preliminary experiments validate the ef-
fectiveness of our Stage 2 photometric refinement,
while also highlighting the need for a multi-stage approach. Subsequent experiments on challenging ScanNet scenes then demonstrate the performance of the full MuSt-NeRF pipeline, comparing the standard and reversed configurations.

Table 1: Overview of the fundamental configurations used in MuSt-NeRF experiments.

Parameter             | Stage 1 | Stage 2
Number of epochs      | 200k    | 400k
Batch size            | 1024    | 512
MLP layers            | 8       | Prop: 4, NeRF: 8
Neurons per layer     | 256     | Prop: 256, NeRF: 512
Image resolution (px) | 624x468 | 624x468
4.2.1 Preliminary Experiments
These experiments isolate and evaluate the Stage 2
architecture, demonstrating its ability to handle both
unbounded scenes and specular reflections, while also
motivating the need for a multi-stage approach. We
use the following datasets:
Mip-NeRF 360 (Materials, Vasedeck): The Ma-
terials scene features round balls of diverse material
properties under controlled lighting, enabling assess-
ment of specular reflection capture. The Vasedeck
scene is a real-world capture of flowers, primarily
exhibiting diffuse reflections, allowing us to evaluate
performance on real-world data with simpler lighting.
Custom Dataset (Plant on Table, Room): The
Plant on Table scene combines diffuse and specu-
lar reflections with unbounded elements (background
visible through glass). The Room scene (real-world,
smartphone, inside-out) provides a more challenging
test with complex geometry and lighting.
Table 2 presents the quantitative results of these
experiments, comparing MuSt-NeRF Stage 2 with
Mip-NeRF 360.
The preliminary experiments evaluate MuSt-
NeRF Stage 2 on increasingly complex outside-in
scenes. Beginning with the synthetic Materials scene,
we observe that MuSt-NeRF Stage 2 accurately ren-
ders specular highlights, achieving an average com-
posite score of over 0.9 (Table 2, Figure 3). Through
the Vasedeck scene, we see that MuSt-NeRF Stage 2
is able to handle real-world scenes, performing com-
parably to the Mip-NeRF 360 implementation. The
subsequent experiment on the Plant on Table scene
further confirms MuSt-NeRF Stage 2’s ability to han-
dle unbounded elements as well as specular and dif-
fuse reflections. It is important to note here that each
of the test images scored higher than the threshold of
0.7 in these three experiments.
Based on these results, we evaluate Stage 2 on
the Room scene, which is an inside-out scenario. We
observe that here too, MuSt-NeRF Stage 2 performs
better than Mip-NeRF 360 on average (Table 2), and
is able to capture reflections effectively (Figure 4).
However, we also observe some geometric inaccura-
cies, especially in regions with high depth variation
and with limited overlapping viewpoints between im-
ages. It is in these regions that the composite score
of MuSt-NeRF is lower than the threshold (Table 3).
These limitations, arising from Stage 2’s purely photometric nature, emphasize the need for a geometrically robust foundation, i.e., Stage 1 of MuSt-NeRF, and motivate the subsequent evaluation of our full pipeline.

Table 2: Quantitative Results of the Preliminary Experiments - Performance Comparison between Mip-NeRF 360 and MuSt-NeRF Stage 2 (SSIM / PSNR / LPIPS / Composite Score).

Scene     | Mip-NeRF 360                   | MuSt-NeRF Stage 2
Materials | 0.897 / 26.905 / 0.093 / 0.876 | 0.961 / 27.301 / 0.029 / 0.929
Vasedeck  | 0.793 / 24.592 / 0.189 / 0.783 | 0.723 / 24.409 / 0.205 / 0.750
Plant     | 0.742 / 25.108 / 0.255 / 0.738 | 0.757 / 24.905 / 0.249 / 0.746
Room      | 0.738 / 25.674 / 0.234 / 0.750 | 0.830 / 26.102 / 0.210 / 0.795

Figure 3: Preliminary Experiments - (a) Materials, (b) Vasedeck, (c) Plant; each pair shows the ground truth (left) and synthesized (right) views. Zoom in for a clearer view.
4.2.2 ScanNet Experiments
Building upon the insights from our preliminary ex-
periments, we now evaluate the full MuSt-NeRF
pipeline (Stages 1 and 2) on five diverse scenes from
the ScanNet dataset (Dai et al., 2017). The provided
RGB-D images of real-world indoor environments
enable us to assess performance on complex, real-
istic data while leveraging ground truth depth. Fur-
thermore, we investigate the necessity of ground truth
depth by comparing performance using both ground
truth and depth-completed maps (derived from sparse
point clouds, as shown in Stage 1 of Figure 1).
The selected scenes exhibit varied characteris-
tics, including challenging lighting conditions, com-
plex geometry, and varying object density: Scene
708 (dimly lit), Scene 710 (small, densely clut-
tered), Scene 738 (hotel room with unbounded el-
ements), Scene 758 (medium-sized room with simple lighting), and Scene 781 (large room with complex lighting and specular reflections). We compare the performance of both the standard (geometry-first) and reversed (photometry-first) MuSt-NeRF configurations with Mip-NeRF 360 and 3D Gaussian Splatting (Kerbl et al., 2023). The results of the ScanNet experiments are shown in Table 4, Table 5 and Figure 6, and highlight four main findings of the MuSt-NeRF architecture.

Table 3: Quantitative Evaluation Results of the Inside-Out Room scene.

Image   | SSIM  | PSNR   | LPIPS | Composite Score
1       | 0.781 | 21.012 | 0.278 | 0.718
2       | 0.882 | 23.372 | 0.111 | 0.843
3       | 0.866 | 26.441 | 0.165 | 0.830
4       | 0.927 | 27.690 | 0.089 | 0.893
5       | 0.807 | 26.320 | 0.247 | 0.772
6       | 0.861 | 26.390 | 0.135 | 0.842
7       | 0.930 | 30.609 | 0.071 | 0.918
8       | 0.857 | 24.451 | 0.186 | 0.806
9       | 0.924 | 30.509 | 0.081 | 0.911
10      | 0.878 | 27.618 | 0.159 | 0.844
11      | 0.870 | 27.682 | 0.146 | 0.847
12      | 0.720 | 24.792 | 0.432 | 0.650
13      | 0.854 | 28.526 | 0.181 | 0.830
14      | 0.757 | 18.791 | 0.260 | 0.705
15      | 0.784 | 27.045 | 0.286 | 0.750
16      | 0.585 | 26.389 | 0.536 | 0.564
Average | 0.830 | 26.102 | 0.210 | 0.795

Figure 4: Preliminary Experiments - Room Scene: (a) rendered image with a high composite score, (b) rendered image scoring below the threshold; each pair shows the ground truth (left) and synthesized (right) views. Zoom in for a clearer view.

Table 4: Quantitative Results of the ScanNet Experiments - Performance Comparison among Mip-NeRF 360, 3D Gaussian Splatting, and MuSt-NeRF (SSIM / PSNR / LPIPS / Composite Score).

Scene     | Mip-NeRF 360                   | 3D Gaussian Splatting          | MuSt-NeRF
Scene 708 | 0.752 / 25.101 / 0.274 / 0.733 | 0.746 / 21.022 / 0.315 / 0.689 | 0.846 / 27.208 / 0.180 / 0.821
Scene 710 | 0.750 / 22.704 / 0.214 / 0.746 | 0.727 / 18.700 / 0.436 / 0.615 | 0.788 / 24.336 / 0.185 / 0.782
Scene 758 | 0.845 / 26.403 / 0.119 / 0.843 | 0.816 / 23.373 / 0.364 / 0.705 | 0.857 / 26.698 / 0.128 / 0.845
Scene 781 | 0.803 / 25.487 / 0.252 / 0.763 | 0.706 / 18.406 / 0.311 / 0.662 | 0.826 / 26.872 / 0.159 / 0.821

Table 5: Quantitative Results on ScanNet Scenes - Comparison of the Standard and Reversed Pipelines of MuSt-NeRF (SSIM / PSNR / LPIPS / Composite Score); the higher composite score per scene indicates the preferred configuration.

Scene     | Standard                                       | Reversed
Scene 708 | 0.846 / 27.208 / 0.180 / 0.821                 | 0.796 / 25.617 / 0.238 / 0.768
Scene 710 | 0.783 / 20.497 / 0.207 / 0.748                 | 0.788 / 24.336 / 0.185 / 0.782
Scene 738 | 0.706 / 19.427 / 0.192 / 0.722 (no refinement) | - / - / - / -
Scene 758 | 0.808 / 21.962 / 0.160 / 0.786                 | 0.857 / 26.698 / 0.128 / 0.845
Scene 781 | 0.776 / 22.829 / 0.175 / 0.773                 | 0.826 / 26.872 / 0.159 / 0.821
Benefits of Multi-Stage Refinement. The most
compelling finding from our ScanNet experiments is
the consistent and significant improvement achieved
through multi-stage refinement. In both the standard
and reversed pipelines, the refinement stage effec-
tively leverages the strengths of one stage to miti-
gate the weaknesses of the other, leading to clearly
enhanced performance, as seen through Table 4 and
Table 5.
In the standard pipeline, we see that the NeRF
model is not able to capture details of the piano (Scene
710), books (Scene 758) or model reflections from
multiple light sources (Scene 781) accurately. The re-
finement stage of the standard pipeline excels at mod-
eling specifically these, leading to an improvement in
the composite score as well as in visual comparisons
(Figure 6).
Similarly, in the reverse pipeline, we see that the
photometric reconstruction incorrectly infers the ge-
ometry of the open blinds (Scene 710). The depth-
guided training in the refinement stage corrects this
error, resulting in a more accurate and visually con-
vincing representation. Again, the composite scores
and visual comparisons highlight the enhanced per-
formance achieved through multi-stage refinement.
When comparing Mip-NeRF 360 and 3D Gaussian Splatting to MuSt-NeRF, we see that MuSt-NeRF consistently outperforms both baselines across all four scenes, with the most prominent differences in low-lighting scenes (Scene 708) and large scenes (Scene 781), where the benefits of the geometry-guided stage are most strongly observed. It is essen-
tial to note that each test image for each of the ex-
periments exceeded the composite score threshold af-
ter the entire MuSt-NeRF pipeline, a benchmark that
Mip-NeRF 360 and, especially, 3D Gaussian Splat-
ting failed to meet consistently. The underperfor-
mance of 3D Gaussian Splatting can likely be at-
tributed to the limited number of input images and
the resulting sparsity of the point clouds generated
for these scenes. This sparsity leads to insufficient
overlap between Gaussians, hindering their ability to
blend smoothly and produce high-quality reconstruc-
tions.
Influence of Lighting and Scene Character-
istics. Lighting conditions and scene complex-
ity significantly impact the relative performance of
the standard and reversed pipelines. The standard
(geometry-first) pipeline demonstrates greater robust-
ness in challenging lighting, particularly in the dimly
lit Scene 708 (Table 5). The reversed (photometry-
first) pipeline excels in scenes with complex light-
ing and specular reflections, such as Scene 781 (Ta-
ble 5). Scenes with simpler lighting and geometry,
like Scene 758, show very similar performance with
both configurations. These observations demonstrate
MuSt-NeRF’s adaptability: the choice of pipeline can
be tailored to the scene’s specific characteristics and
rendering priorities. The depth guidance in Stage 1
of the standard pipeline provides a strong geometric
prior, which is beneficial in dimly lit scenes, whereas
prioritizing photometric refinement by performing the
reversed pipeline allows for more accurate capture of
complex lighting.
Influence of Image Quality. We also analyze
how variations in image quality affect MuSt-NeRF.
Scene 738 of the ScanNet dataset contains several
blurry images (see Figure 5). The lack of sharpness
hampers COLMAP’s ability to extract features and
estimate poses accurately, leading to incomplete or
erroneous pose information. In Scene 738, COLMAP
failed to find sufficient poses for images in regions
rendered poorly during Stage 1, preventing subse-
quent Stage 2 refinement. In the reversed pipeline for
the same scene, COLMAP was only able to extract 4
poses from a set of 67 images, which is inadequate for
reliable training of the Stage 2 architecture (Table 5).
Figure 5: Scene 738 - (a) example blurry images from the dataset; (b) rendered images with the highest and lowest composite scores, respectively.
Poor image quality presents an additional issue. Even if COLMAP successfully extracts
poses from these images, the NeRF model’s learning
process is negatively impacted. The model will learn
the inherent blurriness present in the training images,
leading to suboptimal novel view synthesis. The ren-
dered outputs will inherit this blurriness, even in re-
gions where sharper details could potentially be re-
covered with higher-quality input images. This high-
lights the importance of high-quality input images for
optimal performance and the data-driven nature of
NeRF-based approaches.
Sufficiency of Depth-Completed Maps. To val-
idate the influence of depth map quality on MuSt-
NeRF, we conducted supplementary experiments
comparing performance with ground truth depth maps
against performance with depth-completed maps.
Specifically, we evaluated ScanNet scenes 738 and
758 using the standard (geometry-first) MuSt-NeRF
configuration with ground truth depth obtained di-
rectly from the ScanNet dataset. These results com-
plement our primary ScanNet experiments (Tables 4
and 5), which utilized depth-completed maps derived
from sparse point clouds. The comparative findings
are presented below.
Table 6: Comparison of Composite Scores using Ground
Truth and Depth-Completed Maps.
Scene True Depth Map Depth-Completed Map
738 0.724 0.722
758 0.780 0.786
These experiments reveal comparable perfor-
mance between ground truth depth maps and depth-
completed maps in Stage 1 of MuSt-NeRF (Table 6).
This suggests that the depth completion network pro-
duces depth estimates that are sufficiently precise to
guide the geometric reconstruction effectively. Con-
sequently, it also points to the practical versatility of
MuSt-NeRF, as it is not dependent on the availability
of RGB-D data for optimal performance.
5 CONCLUSION
In this work, we presented MuSt-NeRF, a novel two-
stage NeRF pipeline that enhances novel view synthe-
sis by addressing the challenges of unbounded scenes,
complex lighting, and view-dependent effects, par-
ticularly specular reflections. MuSt-NeRF combines
a depth-guided geometric foundation stage with a
photometric refinement stage, integrating and extend-
ing principles from Mip-NeRF 360 and Ref-NeRF.
Our approach provides flexibility by supporting both
geometry-first and photometry-first execution modes,
allowing users to adapt the pipeline to different scene
characteristics and rendering priorities.
Our experiments on a variety of scenes, includ-
ing synthetic data, real-world captures, and challeng-
ing indoor environments from the ScanNet dataset,
demonstrated the effectiveness of MuSt-NeRF. We
showed that our two-stage approach consistently out-
performs the single-stage Mip-NeRF 360 baseline.
The preliminary experiments validated the photomet-
ric refinement stage’s capabilities in capturing com-
plex lighting, reflections and handling unbounded el-
ements, showcasing the strengths of our combined
Figure 6: ScanNet experiments on (a) Scene 708, (b) Scene 710, (c) Scene 758, and (d) Scene 781. The top pair per scene shows the standard pipeline before and after refinement, while the bottom pair shows the reversed pipeline.
Mip-NeRF 360 and Ref-NeRF architecture, while the
ScanNet experiments highlighted the benefits of our
full pipeline, including the importance of the depth-
guided stage for geometric robustness, the flexibil-
ity of both the standard and reverse pipeline con-
figurations, and MuSt-NeRF’s robustness to varia-
tions in lighting conditions. Furthermore, the re-
sults indicated that the choice between the standard
and reversed pipelines depends on scene properties:
the standard pipeline excels in low-light scenarios
and in scenes with limited views, while the reversed
pipeline is better suited for scenes with complex re-
flections and fine details where photometric accuracy
is paramount. Potential future work includes evalu-
ating MuSt-NeRF on a wider variety of scene types,
such as outdoor environments or scenes with trans-
parent objects and benchmarking against newer NeRF
and Gaussian Splatting variants.
REFERENCES
Barron, J. T., Mildenhall, B., Tancik, M., Hedman, P.,
Martin-Brualla, R., and Srinivasan, P. P. (2021). Mip-
NeRF: A Multiscale Representation for Anti-Aliasing
Neural Radiance Fields. arXiv:2103.13415.
Barron, J. T., Mildenhall, B., Verbin, D., Srinivasan, P. P.,
and Hedman, P. (2022). Mip-nerf 360: Unbounded
anti-aliased neural radiance fields.
Dai, A., Chang, A. X., Savva, M., Halber, M., Funkhouser,
T., and Nießner, M. (2017). Scannet: Richly-
annotated 3d reconstructions of indoor scenes.
Dellaert, F. and Yen-Chen, L. (2021). Neural volume ren-
dering: Nerf and beyond.
Gao, K., Gao, Y., He, H., Lu, D., Xu, L., and Li, J. (2023).
Nerf: Neural radiance field in 3d vision, a comprehen-
sive review.
Kerbl, B., Kopanas, G., Leimkühler, T., and Drettakis,
G. (2023). 3d gaussian splatting for real-time radi-
ance field rendering. ACM Transactions on Graphics,
42(4).
Lee, S., Chen, L., Wang, J., Liniger, A., Kumar, S., and
Yu, F. (2022). Uncertainty Guided Policy for Ac-
tive Robotic 3D Reconstruction Using Neural Radi-
ance Fields. IEEE Robotics and Automation Let-
ters, 7(4):12070–12077. Conference Name: IEEE
Robotics and Automation Letters.
Manni, A., Oriti, D., Sanna, A., De Pace, F., and Manuri,
F. (2021). Snap2cad: 3D indoor environment recon-
struction for AR/VR applications using a smartphone
device. Computers & Graphics, 100:116–124.
Martin-Brualla, R., Radwan, N., Sajjadi, M. S. M., Barron,
J. T., Dosovitskiy, A., and Duckworth, D. (2021). Nerf
in the wild: Neural radiance fields for unconstrained
photo collections.
Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T.,
Ramamoorthi, R., and Ng, R. (2020). NeRF: Rep-
resenting Scenes as Neural Radiance Fields for View
Synthesis. arXiv:2003.08934 [cs].
Rabby, A. S. A. and Zhang, C. (2023). Beyondpixels: A
comprehensive review of the evolution of neural radi-
ance fields. arXiv e-prints, pages arXiv–2306.
Roessle, B., Barron, J. T., Mildenhall, B., Srinivasan, P. P.,
and Nießner, M. (2022). Dense depth priors for neural
radiance fields from sparse input views.
Schönberger, J. L. and Frahm, J.-M. (2016). Structure-
from-motion revisited. In Conference on Computer
Vision and Pattern Recognition (CVPR).
Shafiei, M., Bi, S., Li, Z., Liaudanskas, A., Ortiz-Cayon,
R., and Ramamoorthi, R. (2021). Learning Neural
Transmittance for Efficient Rendering of Reflectance
Fields. arXiv:2110.13272.
Tancik, M., Casser, V., Yan, X., Pradhan, S., Mildenhall,
B., Srinivasan, P. P., Barron, J. T., and Kretzschmar,
H. (2022). Block-NeRF: Scalable Large Scene Neural
View Synthesis. arXiv:2202.05263 version: 1.
Verbin, D., Hedman, P., Mildenhall, B., Zickler, T., Barron,
J. T., and Srinivasan, P. P. (2021). Ref-nerf: Structured
view-dependent appearance for neural radiance fields.
Wei, Y., Liu, S., Rao, Y., Zhao, W., Lu, J., and Zhou, J.
(2021). NerfingMVS: Guided Optimization of Neu-
ral Radiance Fields for Indoor Multi-view Stereo.
arXiv:2109.01129.
Yang, Z., Dai, J., and Pan, J. (2024). 3D reconstruction from
endoscopy images: A survey. Computers in Biology
and Medicine, 175:108546.