
5 CONCLUSION
We introduced SPNeRF, a zero-shot 3D segmentation approach that enhances Neural Radiance Fields (NeRF) through the integration of geometric primitives and vision-language features. Without training on any ground-truth labels, our model can semantically segment unseen, complex 3D scenes. By embedding superpoint-based geometric structures and applying a primitive consistency loss, SPNeRF overcomes the limitations of CLIP's image-based embeddings, mitigating ambiguities in point-wise embeddings and achieving higher spatial consistency and segmentation quality in 3D environments. SPNeRF outperforms LERF and performs competitively with OpenNeRF, while avoiding the additional 2D segmentation models that OpenNeRF requires. Despite this competitive performance, SPNeRF still inherits limitations from CLIP's 2D image-based embeddings, which lead to occasional ambiguities in fine details. Future work could explore more efficient alternatives to NeRF, such as Gaussian splatting (Kerbl et al., 2023), or efficiently incorporate 2D foundation models such as the Segment Anything Model (SAM) (Kirillov et al., 2023) to enable instance-level segmentation.
ACKNOWLEDGEMENTS
This work has partly been funded by the German
Federal Ministry for Digital and Transport (project
EConoM under grant number 19OI22009C).
REFERENCES
Barron, J. T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., and Srinivasan, P. P. (2021). Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In ICCV.

Cen, J., Fang, J., Zhou, Z., Yang, C., Xie, L., Zhang, X., Shen, W., and Tian, Q. (2024). Segment anything in 3D with radiance fields.

Engelmann, F., Manhardt, F., Niemeyer, M., Tateno, K., Pollefeys, M., and Tombari, F. (2024). OpenNeRF: Open Set 3D Neural Scene Segmentation with Pixel-Wise Features and Rendered Novel Views. In International Conference on Learning Representations.

Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59:167–181.

Ghiasi, G., Gu, X., Cui, Y., and Lin, T. (2021). Open-vocabulary image segmentation. CoRR, abs/2112.12143.

Kerbl, B., Kopanas, G., Leimkühler, T., and Drettakis, G. (2023). 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. on Graphics, 42(4).

Kerr, J., Kim, C. M., Goldberg, K., Kanazawa, A., and Tancik, M. (2023). LERF: Language embedded radiance fields. In ICCV.

Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. (2023). Segment anything. arXiv:2304.02643.

Lan, M., Chen, C., Ke, Y., Wang, X., Feng, L., and Zhang, W. (2024). ClearCLIP: Decomposing CLIP representations for dense vision-language inference.

Li, B., Weinberger, K. Q., Belongie, S. J., Koltun, V., and Ranftl, R. (2022). Language-driven semantic segmentation. CoRR, abs/2201.03546.

Luo, H., Bao, J., Wu, Y., He, X., and Li, T. (2023). SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation.

Max, N. (1995). Optical models for direct volume rendering. IEEE Transactions on Visualization and Computer Graphics, 1(2):99–108.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. (2020). NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV.

Peng, S., Genova, K., Jiang, C. M., Tagliasacchi, A., Pollefeys, M., and Funkhouser, T. (2023). OpenScene: 3D scene understanding with open vocabularies.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. (2021). Learning transferable visual models from natural language supervision.

Siddiqui, Y., Porzi, L., Bulò, S. R., Müller, N., Nießner, M., Dai, A., and Kontschieder, P. (2022). Panoptic lifting for 3D scene understanding with neural fields.

Sun, S., Li, R., Torr, P., Gu, X., and Li, S. (2024). CLIP as RNN: Segment countless visual concepts without training endeavor.

Takmaz, A., Fedele, E., Sumner, R. W., Pollefeys, M., Tombari, F., and Engelmann, F. (2023). OpenMask3D: Open-Vocabulary 3D Instance Segmentation. In NeurIPS.

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. (2022). OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework.

Xu, J., Mello, S. D., Liu, S., Byeon, W., Breuel, T., Kautz, J., and Wang, X. (2022). GroupViT: Semantic segmentation emerges from text supervision.

Yang, J., Ding, R., Deng, W., Wang, Z., and Qi, X. (2024). RegionPLC: Regional point-language contrastive learning for open-world 3D scene understanding.

Yang, Y., Wu, X., He, T., Zhao, H., and Liu, X. (2023). SAM3D: Segment anything in 3D scenes.

Yin, Y., Liu, Y., Xiao, Y., Cohen-Or, D., Huang, J., and Chen, B. (2024). SAI3D: Segment any instance in 3D scenes.