Lightweight Transformer Occupancy Networks for 3D Virtual Object
Reconstruction
Claudia Melis Tonti (https://orcid.org/0009-0002-0574-612X) and Irene Amerini (https://orcid.org/0000-0002-6461-1391)
Sapienza University of Rome, Italy
{melistonti, amerini}@diag.uniroma1.it
Keywords: Mesh Generation, Point Cloud, Computer Vision, 3D.
Abstract:
The increasing demand for edge devices highlights the necessity for modern technologies to be adaptable to
general-purpose hardware. Specifically, in fields like augmented reality, virtual reality, and computer graphics,
reconstructing 3D objects from sparse point clouds is highly computationally intensive, presenting challenges
for execution on embedded devices. In previous works, the speed of 3D mesh generation has been traded off
against preserving a high level of detail. Our focus in this work is to enhance inference speed
in order to get closer to real-time mesh generation.
1 INTRODUCTION
In recent years, the growing importance of computer
vision in technologies like autonomous navigation
(Urmson et al., 2008), robotic mapping (Zhu et al.,
2017), augmented reality (AR) (Alhaija et al., 2017),
and gaming has increased the demand for efficient
processing of multidimensional data such as images
and videos.
Traditional methods for 3D reconstruction, such
as Signed Distance Functions (SDFs), require heavy
computations and slow processes for mesh genera-
tion. Newer approaches like ConvONet (Peng et al.,
2020) achieve faster and more efficient 3D recon-
struction compared to earlier methods like DeepSDF
(Park et al., 2019). However, real-time reconstruc-
tion from point clouds remains underexplored, despite
its importance for applications on low-performance
hardware, such as AR or immersive reality.
Point clouds, captured with LiDAR sensors or esti-
mated via Monocular Depth Estimation (MDE) (Papa
et al., 2023), act as lightweight compressed represen-
tations of 3D meshes. Their sparsity can be adjusted to balance detail
retention and computational efficiency, allowing on-
demand reconstruction of 3D objects within specific
areas, such as those observed in AR viewers.
Our work focuses on efficient 3D object recon-
struction for AR applications, with use cases includ-
ing interior design, game development, and virtual artifact reconstruction (Patil et al., 2023; Virtanen et al.,
2020). We leverage lightweight implicit representa-
tions based on Convolutional Occupancy Networks
(ConvONet) (Tonti et al., 2024), extracting meshes
from point clouds to prioritize speed over accuracy.
This ensures smooth interaction in AR environments,
avoiding delays that could disrupt the user experience.
To enhance ConvONet, we integrate an efficient
transformer, capitalizing on its ability to process com-
plex spatial relationships effectively. Our method is
validated on the ShapeNet benchmark dataset.
The paper is structured as follows: Section 2 re-
views related work, Section 3 introduces the proposed
ConvONet with a Dynamic Vision Transformer, Sec-
tion 4 provides implementation details, Section 5
presents experimental results, and Section 6 con-
cludes with future directions.
2 RELATED WORKS
This section reviews key methods for representing
and reconstructing 3D shapes, focusing on various
representation techniques. Early approaches, such
as Signed Distance Functions (SDF) and Deep SDF
(Mescheder et al., 2019; Park et al., 2019), describe
surfaces as a continuous function that assigns a signed
distance value to every point in 3D space. While ef-
fective for defining geometry, SDFs require signifi-
cant memory and cannot infer surfaces in unobserved
regions, leading to gaps in the reconstructed 3D rep-
Figure 1: Graphical representation of the proposed ConvONet pipeline. The pipeline begins with an input point cloud containing N points. In the encoder, the ResNet PointNet (Qi et al., 2017) encodes these points individually, while a plane predictor network globally encodes the entire point cloud to learn L dynamic planes and their specific features. These plane features are combined with the per-point features and projected onto the learned dynamic planes. The dynamic planes are summed and overlapped with each other, then processed by the Dynamic Vision Transformer. In the decoder, for a given query point $p \in \mathbb{R}^3$, the feature vector $\psi(x, p)$, where $x$ is the input, is derived from bilinear interpolation of the processed feature planes. Finally, a fully-connected network predicts the occupancy probability $f_\theta(p, \psi(p, x))$ at point $p$.
resentation (Curless and Levoy, 1996; Choy et al.,
2016).
Deep Local Shapes (Chabra et al., 2020) address
these limitations by using neural networks to ap-
proximate SDFs in small regions, improving infer-
ence efficiency by breaking down scenes into inde-
pendent local features. Implicit function-based meth-
ods (Chibane et al., 2020) further advance this idea by
representing 3D objects with multi-scale 3D vectors
that encode both local and global structures, enabling
better handling of complex geometries.
Genova et al. (2020) introduced Local Deep Implicit Functions (LDIF), a hybrid repre-
sentation that combines local implicit functions with
structured templates to decompose space efficiently.
LDIF uses a latent vector representation for finer de-
tail and leverages PointNet (Qi et al., 2017) to encode
both local and global features for 3D reconstruction
tasks. These representations focus on breaking down
large-scale geometries into manageable components
while retaining critical surface details.
ConvONet (Peng et al., 2020) proposed a contin-
uous implicit grid representation that combines con-
volutional layers and occupancy networks to estimate
the probability of a point belonging to the surface of
the object. This hierarchical encoding of 3D features
enables accurate reconstructions and inspired meth-
ods like DPConvONet (Lionar et al., 2021a) and Neu-
ralblox (Lionar et al., 2021b). DPConvONet projects
point cloud features onto dynamic planes, capturing
surface normals for unoriented inputs, while Neural-
blox supports nearly real-time 3D scene generation
using occupancy grids from point clouds.
Despite these advancements, achieving real-time
reconstruction remains challenging. Neuralblox of-
fers faster mesh generation but sacrifices detail, and
Lightweight ConvONet (Tonti et al., 2024) attempts
to optimize efficiency through architectural compres-
sion, yet its inference time is still not real-time.
To address these challenges, we propose integrating
Dynamic Vision Transformers (Dynamic ViT) (Rao
et al., 2023) into Lightweight ConvONet. This inte-
gration replaces the bottleneck U-Nets, enabling more
efficient processing of point cloud projections while
leveraging the transformer’s ability to handle complex
spatial relationships. This approach aims to improve
both speed and accuracy, moving closer to real-time
performance.
3 PROPOSED METHOD
Our research aims to increase inference speed to gen-
erate fast and high-quality 3D meshes. In this work,
we analyze the effects of integrating an efficient trans-
former architecture into the ConvONet pipeline in or-
der to decrease the generation time.
The rest of this section is organized as follows:
Section 3.1 introduces the structure of the network,
and Section 3.1.3 presents the architecture of the Dy-
namic Vision Transformer.
3.1 The Proposed Network
The architecture of the proposed network, shown in
Figure 1, has an encoder-decoder structure.
The pipeline, as shown in the figure, begins with
a sparse point cloud input. The encoder, detailed in
Section 3.1.1, processes this input to extract per-point
and global features, which are then used to estimate
the dynamic planes. These planes simplify the envi-
ronment by transforming it from 3D to 2D, reducing
computational complexity during encoding.
Once the points are projected onto these planes,
the overlapping projections are processed by the Dy-
namic ViT, as described in Section 3.1.3. In the de-
coder, outlined in Section 3.1.2, the planar features
are utilized to perform bilinear interpolation for each
point in 3D space, calculating its occupancy probabil-
ity. This approach balances efficiency and precision
in 3D reconstruction.
3.1.1 Encoder
The encoder-decoder network is based on the ground-
work of Lionar et al. The network processes a point
cloud input with a ResNet-based PointNet. The
encoder extracts per-point features using a ResNet
PointNet and learns a set of dynamic planes with a
plane predictor network. The ResNet PointNet con-
sists of five ResNet blocks, a PointNet, and a fi-
nal ResNet block, encoding the point cloud point-
wise. The plane predictor network uses a PointNet for
global max pooling to predict the set of planes. These
features are converted to plane parameters, defining
the plane by its normal vector (a, b, c).
The per-point and planar features are summed and
projected onto the predicted dynamic planes using a
basis change and orthographic projection, followed
by normalization as proposed by Lionar et al. The projected planes are then summed (overlapped) with each other and processed by the Dynamic ViT (DyViT), presented in Section 3.1.3. The outputs of the transformer are the planar features that are further processed in the decoder.
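As an illustration of this projection step, the following minimal sketch (our own simplification, not the authors' code; tensor shapes, helper names, and the mean-pooling choice are assumptions) performs a basis change, drops the component along the plane normal, and scatters the per-point features into a 2D feature grid.

```python
# Minimal sketch (not the authors' code) of projecting per-point features onto a
# learned plane via a basis change and orthographic projection, then scattering
# them into a 2D feature grid. Shapes and helper names are assumptions.
import torch

def project_to_plane(points, feats, basis, res=64):
    """points: (B, N, 3); feats: (B, N, C); basis: (B, 3, 3) rows = plane axes (u, v, n)."""
    B, N, C = feats.shape
    local = torch.einsum('bij,bnj->bni', basis, points)     # change of basis
    uv = local[..., :2]                                      # orthographic projection: drop the normal axis
    uv = uv - uv.amin(dim=1, keepdim=True)                   # normalize to [0, 1] per sample
    uv = uv / (uv.amax(dim=1, keepdim=True) + 1e-8)
    idx = (uv * (res - 1)).long()                            # grid coordinates
    flat = idx[..., 1] * res + idx[..., 0]                   # (B, N) flattened cell index
    plane = feats.new_zeros(B, C, res * res)
    count = feats.new_zeros(B, 1, res * res)
    plane.scatter_add_(2, flat.unsqueeze(1).expand(-1, C, -1), feats.transpose(1, 2))
    count.scatter_add_(2, flat.unsqueeze(1), torch.ones_like(flat, dtype=feats.dtype).unsqueeze(1))
    plane = plane / count.clamp(min=1.0)                     # mean-pool features per cell
    return plane.view(B, C, res, res)
```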
3.1.2 Decoder
Given the estimated planar features, the decoder predicts the occupancy probability of each point in 3D space. This probability is derived
from the sum of the bilinear interpolations of the es-
timated planar features. The network is based on a
fully connected network of five ResNet blocks, with
input features summed to the feature vector in each
block. After estimating the occupancy probability of
each point, the point cloud is turned into a mesh.
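A minimal sketch of the decoder query is given below, assuming the planar feature maps and the projected 2D coordinates of the query points are already available; the `mlp` head stands in for the paper's five-block ResNet decoder, and all names are placeholders.

```python
# Minimal sketch (assumed shapes/names, not the authors' code) of the decoder query:
# bilinearly interpolate each planar feature map at the query's projected 2D
# coordinates, sum across planes, and predict an occupancy probability.
import torch
import torch.nn.functional as F

def query_occupancy(planes, uv_coords, mlp):
    """planes: list of (B, C, H, W); uv_coords: list of (B, M, 2) in [-1, 1]; mlp: maps C -> 1."""
    feat = 0.0
    for plane, uv in zip(planes, uv_coords):
        grid = uv.unsqueeze(1)                                   # (B, 1, M, 2) for grid_sample
        sampled = F.grid_sample(plane, grid, mode='bilinear',
                                align_corners=True)              # (B, C, 1, M)
        feat = feat + sampled.squeeze(2).transpose(1, 2)         # accumulate (B, M, C)
    logits = mlp(feat)                                           # (B, M, 1)
    return torch.sigmoid(logits)                                 # occupancy probability per query point
```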
3.1.3 Dynamic Vision Transformer
Compared to Convolutional Neural Networks
(CNNs), Vision Transformers (ViTs) (Touvron et al.,
2022) are better at capturing long-range relationships
between pixels, enabling them to process images with
more comprehensive detail. The main components of a ViT are listed below; a minimal code sketch follows the list:
Patch Embedding: The input image is divided into
smaller patches, which are flattened and projected
into a fixed-dimensional space. Positional embed-
dings are added to retain spatial information for
each patch.
Multi-head Self-Attention (MSA): This mecha-
nism allows the model to understand global rela-
tionships by enabling interactions between differ-
ent patches (tokens). It computes query (Q), key
(K), and value (V) matrices and turns the result-
ing attention scores into probabilities with soft-
max, forming an attention matrix that aggregates
dependencies across the whole sequence. Parallel
attention heads enhance the model's ability to cap-
ture complex features.
Feed-Forward Neural Network (FNN): After the
self-attention step, the output is normalized and
processed by a feed-forward network to refine the
features further.
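The sketch below illustrates these components with generic PyTorch modules; dimensions and module names are placeholders rather than the configuration used in our network.

```python
# Illustrative (generic) ViT building blocks matching the components listed above;
# dimensions are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, in_ch=3, dim=128, patch=4, grid=64):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)   # split into patches + linear projection
        n_tokens = (grid // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n_tokens, dim))               # positional embeddings

    def forward(self, x):                                                    # x: (B, in_ch, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)                     # (B, N, dim)
        return tokens + self.pos

class TransformerBlock(nn.Module):
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)      # multi-head self-attention (Q, K, V + softmax)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        a, _ = self.attn(self.norm1(x), self.norm1(x), self.norm1(x))        # tokens attend to each other
        x = x + a
        return x + self.ffn(self.norm2(x))                                   # feed-forward refinement
```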
Rao et al. (2023) proposed DyViT, which dynamically removes less important tokens from transformer blocks based on their contribution to the output. During training, the threshold for selecting important tokens is adjusted automatically, and only the most relevant tokens are retained during inference. This reduces computational costs while maintaining accuracy. To handle the performance drop caused by token pruning, DyViT uses knowledge distillation (KD): a teacher model with all tokens intact guides the pruned student model, helping it learn rich representations and retain performance.
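A simplified sketch of this token-sparsification idea is shown below; the scoring module, the straight-through Gumbel-softmax relaxation, and all shapes are assumptions made for illustration and do not reproduce the official DynamicViT implementation.

```python
# Simplified sketch of DyViT-style token sparsification (not the official implementation):
# a small predictor scores each token and a keep/drop decision mask is produced.
# Straight-through Gumbel-softmax is one common way to keep this differentiable.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenPredictor(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 2))     # logits for (drop, keep)

    def forward(self, tokens, prev_mask=None):
        logits = self.score(tokens)                                          # (B, N, 2)
        if self.training:
            mask = F.gumbel_softmax(logits, hard=True)[..., 1]               # differentiable keep decision
        else:
            mask = (logits[..., 1] > logits[..., 0]).float()                 # hard decision at inference
        if prev_mask is not None:
            mask = mask * prev_mask                                          # a dropped token stays dropped
        return mask                                                          # (B, N) decision mask
```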
The structure of DyViT adapted to our task is presented in Figure 2. The first part of the student and teacher networks is almost identical: the input, which consists of the 3 overlapped projection planes, is divided into patches, processed by the patch embedder, and then passed to the self-attention blocks. Differently from the teacher, the student model interleaves a prediction module between the transformer blocks; based on the features produced by the previous layer, this module discards the less informative tokens, so that fewer tokens are processed in the following layers. This technique is called sparsification. The remaining sparse tokens are processed by the last self-attention block, which returns the output of the model, i.e., the resulting planar features.
Figure 2: Graphical representation of the student-teacher architecture of the Dynamic Vision Transformer with an input of 3 overlapped projection planes.

To train DyViT effectively, the total loss is a combination of three components:

Cross-Entropy Loss: Rao et al. use the standard cross-entropy to measure the distance between the ground truth and the output of DyViT. In our case there is no ground truth for the projection planes, so we instead measure the distance between the DyViT output and the plane projections given as input. We take this decision because we hypothesize that the points nearest to the input point cloud have the highest probability of lying on the surface of the mesh, so this distance should be minimized. The loss is defined as:

$$\mathcal{L}_{ce} = \text{CrossEntropy}(\hat{y}, y) \quad (1)$$

where $y$ is the input patch of the overlapped projection planes and $\hat{y}$ is the prediction of DyViT.
Self-Distillation Loss: this loss is used to minimize the impact of token sparsification on performance. To do so, the final remaining tokens of DyViT should be as close as possible to the corresponding tokens of the teacher model. This minimization is expressed by the following equation:

$$\mathcal{L}_{distill} = \frac{1}{\sum_{b=1}^{B}\sum_{i=1}^{N} \hat{D}^{b,S}_{i}} \sum_{b=1}^{B}\sum_{i=1}^{N} \hat{D}^{b,S}_{i} \left( t_i - t'_i \right)^2 \quad (2)$$

where $t_i$ and $t'_i$ are the resulting tokens of DyViT and of the teacher model, respectively, and $\hat{D}^{b,S}_{i}$ is the decision mask of the $b$-th sample at the $S$-th (final) sparsification stage.
KL Divergence: this loss minimizes the difference between the predictions of DyViT and those of the teacher model using the KL divergence:

$$\mathcal{L}_{KL} = \text{KL}(\hat{y} \,\|\, y') \quad (3)$$

where $y'$ is the prediction of the teacher model.
The total loss is computed as:

$$\mathcal{L} = \mathcal{L}_{ce} + \lambda_{KL}\,\mathcal{L}_{KL} + \lambda_{distill}\,\mathcal{L}_{distill} \quad (4)$$

where $\lambda_{KL}$ and $\lambda_{distill}$ are both set to 0.5. This approach enables DyViT to achieve efficiency by dynamically pruning tokens while retaining the performance of a full-token transformer.
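The following sketch assembles the three terms of Eqs. (1)-(4) under assumed tensor shapes; note that the cross-entropy against the continuous input planes is replaced here by a plain MSE distance as a stand-in, so this is an illustration of the objective, not the exact training code.

```python
# Sketch of the combined training objective of Eq. (4) under assumed shapes.
# The paper's "cross-entropy" against the continuous input planes is approximated
# here with an MSE distance (a stand-in, not the authors' exact term).
import torch
import torch.nn.functional as F

def dyvit_loss(student_out, teacher_out, input_planes,
               student_tokens, teacher_tokens, keep_mask,
               lambda_kl=0.5, lambda_distill=0.5):
    # (1) distance between the DyViT output and the input plane projections
    l_ce = F.mse_loss(student_out, input_planes)

    # (2) self-distillation on the tokens kept by the decision mask (Eq. 2)
    diff = (student_tokens - teacher_tokens).pow(2).mean(dim=-1)      # (B, N)
    l_distill = (keep_mask * diff).sum() / keep_mask.sum().clamp(min=1.0)

    # (3) KL divergence between student and teacher predictions (Eq. 3)
    l_kl = F.kl_div(F.log_softmax(student_out.flatten(1), dim=-1),
                    F.softmax(teacher_out.flatten(1), dim=-1),
                    reduction='batchmean')

    # (4) total loss
    return l_ce + lambda_kl * l_kl + lambda_distill * l_distill
```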
4 IMPLEMENTATION DETAILS
The evaluation of the performance has been carried out through object-level reconstruction, using a subset of the ShapeNet dataset (Chang et al., 2015) similar to the one adopted by Choy et al. (2016). The subset consists of 50,000 meshes along with their corresponding point clouds, covering 13 different object classes. For each class, 1/5 of the samples are used for testing, 2/5 for validation, and the rest for training. We used an NVIDIA GeForce GTX 1080 Ti to train the model.
The hidden feature dimension is 32 for both the encoder and the decoder, the resolution of the x, y, and z planes is set to 64, the batch size is 64, and validation is performed every 200 iterations. The learning rate of the Adam optimizer is set to $10^{-4}$, and the beta values are set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$. Moreover, we set the depth of the DyViT to 2, the number of attention heads to 8, and the patch size to 4. With this configuration, the model reaches convergence after around 12k iterations.
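For reference, the hyperparameters listed above can be collected into a single (hypothetical) configuration; `make_optimizer` simply wraps the standard PyTorch Adam call with the reported values.

```python
# Hyperparameters reported in the text, gathered into one (hypothetical) setup.
import torch

config = {
    "hidden_dim": 32,        # encoder/decoder feature dimension
    "plane_resolution": 64,  # resolution of the x, y, z planes
    "batch_size": 64,
    "validate_every": 200,   # iterations
    "dyvit_depth": 2,
    "attention_heads": 8,
    "patch_size": 4,
}

def make_optimizer(model: torch.nn.Module) -> torch.optim.Adam:
    # Adam with the learning rate and betas reported above
    return torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```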
The model employs sinusoidal activations from Sinusoidal Representation Networks (SIREN), introduced by Sitzmann et al. (2020). SIREN uses sine functions as activations, which naturally encode high-frequency details, making it particularly effective for processing inputs like images or 3D implicit representations. This periodic activation function ensures efficient training by carefully controlling the activation values and spatial frequencies.
values and spatial frequencies. The authors of SIREN
demonstrated its effectiveness across domains such as
images, videos, sound, and 3D data. Their method
outperforms traditional architectures, like those using
ReLU-based networks, in terms of accuracy, speed,
and the ability to capture fine details.
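A minimal SIREN-style layer is sketched below, following the sine activation and frequency-aware initialization of Sitzmann et al. (2020); the frequency w0 = 30 is the value suggested in that paper and is an assumption here, not a setting restated in our text.

```python
# Minimal SIREN-style layer: a linear map followed by a sine activation, with the
# frequency-aware initialization proposed by Sitzmann et al. (2020).
import math
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_features, out_features, w0=30.0, is_first=False):
        super().__init__()
        self.w0 = w0
        self.linear = nn.Linear(in_features, out_features)
        bound = 1.0 / in_features if is_first else math.sqrt(6.0 / in_features) / w0
        nn.init.uniform_(self.linear.weight, -bound, bound)       # keeps activations well-distributed

    def forward(self, x):
        return torch.sin(self.w0 * self.linear(x))                # periodic activation
```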
During inference, we apply Multiresolution Iso-
Surface Extraction (MISE) to extract meshes given
the occupancy values as inputs (Mescheder et al.,
2019). Because the objective of this work is to de-
crease the time of mesh generation, we are interested
in the inference time, which is expressed in seconds
per mesh.
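MISE extracts the surface hierarchically at increasing resolutions; as a simplified stand-in, the sketch below extracts a mesh from a dense occupancy grid with plain marching cubes (scikit-image), which illustrates only the final extraction step rather than the full MISE procedure.

```python
# Simplified stand-in for the mesh extraction step: threshold a dense occupancy
# grid and run marching cubes. `occupancy` is a (R, R, R) array of probabilities.
import numpy as np
from skimage.measure import marching_cubes

def extract_mesh(occupancy: np.ndarray, threshold: float = 0.5):
    verts, faces, normals, _ = marching_cubes(occupancy, level=threshold)
    return verts, faces, normals
```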
Other metrics used for evaluation are essential to determine the precision of the results and to assess the consistency of the model. These metrics are presented in the following; a small computational sketch is given after the definitions:
Accuracy: calculated using the mean absolute
distance (MAD) between the dense points of the pre-
dicted mesh and the nearest surface of the ground
truth mesh. A lower MAD indicates better accuracy.
$$\text{MAD}_{\text{accuracy}} = \frac{\sum_{i=1}^{n} \lVert p_i - g_i \rVert}{n}$$
where $p_i$ is the point of index $i$ of the predicted mesh, and $g_i$ is the point of index $i$ of the ground truth.
Completeness: mean absolute distance between
the dense points of the ground truth and the surface of
the predicted mesh. The lower, the better.
$$\text{MAD}_{\text{completeness}} = \frac{\sum_{i=1}^{n} \lVert g_i - p_i \rVert}{n}$$
Volumetric Intersection over Union: the ratio
of the volume of the intersection to the volume of the
union of the ground truth and predicted mesh. The
higher, the better.
$$\text{IoU} = \frac{\text{Volume of Overlap}}{\text{Volume of Union}}$$
Chamfer-L1: the average of the accuracy and
completeness metrics. The lower, the better.
$$\text{Chamfer-}L_1 = \frac{\text{MAD}_{\text{accuracy}} + \text{MAD}_{\text{completeness}}}{2}$$
F-score: the average of precision and recall be-
tween the predicted mesh and the ground truth. The
higher, the better.
$$\text{F-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
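As referenced above, a small computational sketch of these metrics on two sampled point sets follows; the nearest-neighbour queries use a KD-tree, and the F-score distance threshold `tau` is an assumption, as the text does not restate the value used.

```python
# Sketch of the evaluation metrics on two sampled point sets (predicted vs. ground truth).
import numpy as np
from scipy.spatial import cKDTree

def mesh_metrics(pred_pts: np.ndarray, gt_pts: np.ndarray, tau: float = 0.01):
    d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]    # accuracy distances
    d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]    # completeness distances
    accuracy = d_pred_to_gt.mean()
    completeness = d_gt_to_pred.mean()
    chamfer_l1 = 0.5 * (accuracy + completeness)
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 2 * precision * recall / (precision + recall + 1e-8)
    return accuracy, completeness, chamfer_l1, fscore
```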
5 EXPERIMENTS
In this section, we evaluate the performance of our
model compared to state-of-the-art approaches in
Section 5.1 and perform an ablation study in Section
5.2 to examine how changes in the network architec-
ture impact its results.
5.1 Comparison with Baseline Models
We compared our model with existing methods, re-
ferred to as baselines, on the ShapeNet dataset. As
shown in Table 1, our model generates a 3D mesh in an average of 0.40 seconds, making it the fastest among the models tested. As explained in Section 4, the purpose of this work is to generate a plausible, roughly estimated mesh in as little time as possible, to be refined in a subsequent stage; the accuracy metrics are therefore less critical than the inference time. In Figure 3 we demonstrate the consistency of our generated meshes through a visual comparison with the results of the DPConvONet model.
Table 1: Performance comparison between our model and state-of-the-art models. Despite non-optimal evaluation metrics, our model is the fastest with respect to previous works.

Model                   | Accuracy | Completeness | Chamfer-L1 | F-score | IoU    | Inference Time
(Lionar et al., 2021a)  | 0.0047   | 0.0039       | 0.0043     | 0.9433  | 0.8879 | 0.86 s/mesh
(Lionar et al., 2021b)  | 0.0986   | 0.0999       | 0.0992     | 0.1851  | 0.7901 | 0.65 s/mesh
(Tang et al., 2021)     | 0.0097   | 0.0082       | 0.0090     | 0.7045  | 0.7429 | 0.45 s/mesh
(Tonti et al., 2024)    | 0.0058   | 0.0048       | 0.0053     | 0.9037  | 0.8499 | 0.48 s/mesh
Ours                    | 0.0342   | 0.0263       | 0.0302     | 0.3297  | 0.4924 | 0.40 s/mesh
Table 1 presents the evaluation metric values ob-
tained by the compared models. These values were
computed using the pre-trained weights provided in
the GitHub repositories of DPConvONet, Neuralblox,
SA-ConvONet, and LConvONet.
Figure 3: Visual comparison of DPConvONet, our model, and the ground truth (shown from left to right). This comparison
demonstrates that our model produces consistent results, comparable to the meshes generated by the state-of-the-art method,
DPConvONet.
The obtained results show a speed-up of 53% (-0.46 s) with respect to DPConvONet, the model with the best evaluation metrics; of 11% (-0.05 s) with respect to SA-ConvONet, the baseline with the lowest inference time; and of 16% (-0.08 s) with respect to LConvONet, which generates high-quality details while keeping the inference time low. Furthermore, compared to Neuralblox, our model demonstrates a significant improvement with a 38% speed-up (-0.25 s) and achieves higher evaluation metrics, indicating more precise mesh generation.
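These speed-ups follow directly from the inference times in Table 1; the short snippet below reproduces them (values rounded to one decimal place).

```python
# Reproducing the reported speed-ups from the inference times in Table 1.
baseline_times = {"DPConvONet": 0.86, "Neuralblox": 0.65, "SA-ConvONet": 0.45, "LConvONet": 0.48}
ours = 0.40
for name, t in baseline_times.items():
    print(f"{name}: -{t - ours:.2f} s ({100 * (t - ours) / t:.1f}% faster)")
# e.g. DPConvONet: -0.46 s (53.5% faster), Neuralblox: -0.25 s (38.5% faster)
```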
5.2 Ablation Study
In Table 2 we compare different network configurations. First, we replace the SIREN activation function with ReLU to analyze the performance difference between the two architectures. In this configuration, the evaluation metrics generally worsen and the inference time increases. This confirms the results in (Sitzmann et al., 2020), which demonstrate SIREN's superior efficiency and accuracy compared to standard activation functions like ReLU, Tanh, or GELU.
Table 2: Performance comparison between different configurations of the network.

Model                  | Accuracy | Completeness | Chamfer-L1 | F-score | IoU    | Inference Time
Ours                   | 0.0342   | 0.0263       | 0.0302     | 0.3297  | 0.4924 | 0.40 s/mesh
Network without SIREN  | 0.0539   | 0.0312       | 0.0426     | 0.2293  | 0.3904 | 0.4521 s/mesh
Network with 3 DyViTs  | 0.0361   | 0.026        | 0.031      | 0.3436  | 0.5124 | 0.4585 s/mesh
In the third row of the table, we report the results obtained when the model processes the 3 planes separately with 3 different DyViTs. In this case, the inference time increases while the accuracy of mesh generation is not significantly affected. The reason lies in the ability of the transformer to extract both local and global features, which allows DyViT to correctly interpret the projections onto the 3 planes even when they are overlapped.
These findings highlight that the use of SIREN
and overlapping projection planes contributes signifi-
cantly to the efficiency and quality of our model.
6 CONCLUSIONS
This study explores the integration of an efficient vi-
sion transformer, such as DyViT, into a Convolutional
Occupancy Network, replacing three U-Nets used for
processing point cloud projections. The transformer’s
ability to handle both local and global features allows
for overlapping projection planes, significantly allevi-
ating bottlenecks between the encoder and decoder.
We also highlight the importance of the SIREN
activation function in preserving a high level of detail
within the model’s output.
Looking ahead, we aim to further reduce inference
time by refining the mesh representation process. In
computer graphics, 3D artists often create a coarse ob-
ject shape and enhance it with bump, normal, and par-
allax mapping to maintain detail. Inspired by this, our
future objective is to represent the point cloud using a
single projection plane and use our model to densify
it. This approach would enable us to generate paral-
lax mapping for a coarse 3D mesh, maintaining visual
detail while optimizing computational efficiency.
REFERENCES
Alhaija, H. A., Mustikovela, S. K., Mescheder, L., Geiger,
A., and Rother, C. (2017). Augmented reality meets
deep learning for car instance segmentation in urban
scenes. In British machine vision conference, vol-
ume 1.
Chabra, R., Lenssen, J. E., Ilg, E., Schmidt, T., Straub,
J., Lovegrove, S., and Newcombe, R. (2020). Deep
local shapes: Learning local sdf priors for detailed
3d reconstruction. In Computer Vision–ECCV 2020:
16th European Conference, Glasgow, UK, August 23–
28, 2020, Proceedings, Part XXIX 16, pages 608–625.
Springer.
Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan,
P., Huang, Q., Li, Z., Savarese, S., Savva, M.,
Song, S., Su, H., et al. (2015). Shapenet: An
information-rich 3d model repository. arXiv preprint
arXiv:1512.03012.
Chibane, J., Alldieck, T., and Pons-Moll, G. (2020). Im-
plicit functions in feature space for 3d shape recon-
struction and completion. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 6970–6981.
Choy, C. B., Xu, D., Gwak, J., Chen, K., and Savarese,
S. (2016). 3d-r2n2: A unified approach for single
and multi-view 3d object reconstruction. In Computer
Vision–ECCV 2016: 14th European Conference, Am-
sterdam, The Netherlands, October 11-14, 2016, Pro-
ceedings, Part VIII 14, pages 628–644. Springer.
Curless, B. and Levoy, M. (1996). A volumetric method for
building complex models from range images. In Pro-
ceedings of the 23rd annual conference on Computer
graphics and interactive techniques, pages 303–312.
Genova, K., Cole, F., Sud, A., Sarna, A., and Funkhouser,
T. (2020). Local deep implicit functions for 3d shape.
In Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 4857–
4866.
Lionar, S., Emtsev, D., Svilarkovic, D., and Peng, S.
(2021a). Dynamic plane convolutional occupancy net-
works. In Proceedings of the IEEE/CVF Winter Con-
ference on Applications of Computer Vision, pages
1829–1838.
Lionar, S., Schmid, L., Cadena, C., Siegwart, R., and Cra-
mariuc, A. (2021b). Neuralblox: Real-time neural
representation fusion for robust volumetric mapping.
In 2021 International Conference on 3D Vision (3DV),
pages 1279–1289. IEEE.
Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S.,
and Geiger, A. (2019). Occupancy networks: Learn-
ing 3d reconstruction in function space. In Proceed-
ings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 4460–4470.
Papa, L., Russo, P., and Amerini, I. (2023). Meter: a mobile
vision transformer architecture for monocular depth
estimation. IEEE Transactions on Circuits and Sys-
tems for Video Technology, pages 1–1.
Park, J. J., Florence, P., Straub, J., Newcombe, R., and
Lovegrove, S. (2019). Deepsdf: Learning continu-
ous signed distance functions for shape representa-
tion. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 165–
174.
Patil, A. G., Qian, Y., Yang, S., Jackson, B., Bennett, E.,
and Zhang, H. (2023). Rosi: Recovering 3d shape
interiors from few articulation images. arXiv preprint
arXiv:2304.06342.
Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M.,
and Geiger, A. (2020). Convolutional occupancy net-
works. In Computer Vision–ECCV 2020: 16th Euro-
pean Conference, Glasgow, UK, August 23–28, 2020,
Proceedings, Part III 16, pages 523–540. Springer.
Qi, C. R., Su, H., Mo, K., and Guibas, L. J. (2017). Point-
net: Deep learning on point sets for 3d classification
and segmentation. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition,
pages 652–660.
Rao, Y., Liu, Z., Zhao, W., Zhou, J., and Lu, J. (2023). Dy-
namic spatial sparsification for efficient vision trans-
formers and convolutional neural networks. IEEE
Transactions on Pattern Analysis and Machine Intel-
ligence, 45(9):10883–10897.
Sitzmann, V., Martel, J. N. P., Bergman, A. W., Lindell,
D. B., and Wetzstein, G. (2020). Implicit neural rep-
resentations with periodic activation functions.
Tang, J., Lei, J., Xu, D., Ma, F., Jia, K., and Zhang, L.
(2021). Sa-convonet: Sign-agnostic optimization of
convolutional occupancy networks. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision, pages 6504–6513.
Tonti, C. M., Papa, L., and Amerini, I. (2024). Lightweight 3-D convolutional occupancy networks for virtual object reconstruction. IEEE Computer Graphics and Applications, 44(02):23–36.
Touvron, H., Cord, M., El-Nouby, A., Verbeek, J., and Jégou, H. (2022). Three things everyone should know about vision transformers. In European Conference on Computer Vision, pages 497–515. Springer.
Urmson, C., Anhalt, J., Bagnell, D., Baker, C., Bittner, R.,
Clark, M., Dolan, J., Duggins, D., Galatali, T., Geyer,
C., et al. (2008). Autonomous driving in urban envi-
ronments: Boss and the urban challenge. Journal of
field Robotics, 25(8):425–466.
Virtanen, J.-P., Daniel, S., Turppa, T., Zhu, L., Julin, A., Hyyppä, H., and Hyyppä, J. (2020). Interactive dense point clouds in a game engine. ISPRS Journal of Photogrammetry and Remote Sensing, 163:375–389.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-
Fei, L., and Farhadi, A. (2017). Target-driven visual
navigation in indoor scenes using deep reinforcement
learning. In 2017 IEEE international conference on
robotics and automation (ICRA), pages 3357–3364.
IEEE.