
pipeline for a standalone XR device, we use a Jetson
Xavier NX 8 GB for benchmarking. This system-on-
module device comes in a small form factor with GPU
support and is widely used as an AI edge device, due
to its cloud-native support and hardware acceleration
made possible with the NVIDIA software stack. We
consider two different scenarios:
Scenario 1. The synthetic dataset setup:
YOLOv8x, 8 images, 22474 points, with 36
objects.
Scenario 2. The setup used in this section:
YOLOv8n, 3 images, 1028 points, with 3 objects.
To make the most of the hardware, the device is
configured to run in its maximum-performance power
mode (20 W), and the object detection models are con-
verted to quantized, TensorRT-optimized model files
using INT8 precision. Preprocessing is done on the
GPU and the set-based instance assignment algorithm
is implemented on the CPU, while post-processing
uses Open3D compiled with CUDA support.
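The set-based instance assignment step can be illustrated with a minimal sketch. This is an illustrative toy only: the function name, the data layout (each mask as a set of 3D point indices), and the IoU threshold are assumptions for exposition, not the paper's actual implementation. The idea is to merge each new view's mask into the existing instance with the highest set overlap, or start a new instance otherwise.

```python
def assign_instances(views, iou_threshold=0.5):
    """Aggregate per-view masks (sets of 3D point indices) into instances.

    views: list of views, each a list of masks (sets of point indices).
    A mask is merged into the existing instance with the highest set IoU
    if that IoU reaches the threshold; otherwise it starts a new instance.
    (Illustrative sketch; threshold and interfaces are assumed.)
    """
    instances = []
    for masks in views:
        for mask in masks:
            best, best_iou = None, 0.0
            for inst in instances:
                union = len(inst | mask)
                iou = len(inst & mask) / union if union else 0.0
                if iou > best_iou:
                    best, best_iou = inst, iou
            if best is not None and best_iou >= iou_threshold:
                best |= mask  # grow the matched instance in place
            else:
                instances.append(set(mask))  # unmatched mask: new instance
    return instances
```

With two views of the same object, overlapping masks such as `{1, 2, 3}` and `{2, 3, 4}` are merged into one instance, while disjoint masks remain separate.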
The results are shown in Figure 6. The total execu-
tion time per frame is 190.63 ms in Scenario 1 and
22.53 ms in Scenario 2. Note that the localization
part of the pipeline is not taken into account; however,
it can be executed in parallel with the object detector.
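For reference, the reported per-frame latencies translate into throughput as follows (a simple conversion using the numbers above; the helper name is ours):

```python
def fps(ms_per_frame):
    """Convert per-frame latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_frame

print(round(fps(190.63), 1))  # Scenario 1: ~5.2 FPS
print(round(fps(22.53), 1))   # Scenario 2: ~44.4 FPS
```

Scenario 2 thus comfortably exceeds typical real-time thresholds, while the heavier Scenario 1 configuration does not.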
5 DISCUSSION
Our approach shows notable strengths in identifying
smaller objects and in scenarios where comprehen-
sive detection is crucial. However, there are limita-
tions in detecting the full structure of larger objects,
such as tables, which affect precision in specific con-
texts. This limitation is not problematic for applica-
tions in XR environments, where interaction often fo-
cuses on object surfaces such as tabletops. Overall, the findings
highlight the potential of our method in diverse ap-
plications, balancing between detailed detection and
practical constraints in real-world scenarios.
6 CONCLUSIONS
In this work, we introduced a fast incremental algo-
rithm (SMVLift) for lifting 2D semantics to 3D on
constrained hardware. By working with sparse point
clouds, on-device performance is made possible. For
robustness, we aggregated the semantic masks from
multiple views, by using a novel set-based instance
segmentation algorithm. Our method was compared
to a state-of-the-art algorithm and showed compara-
ble or superior results despite being significantly more
lightweight. In addition, we showed that our method
can be incorporated into a real XR application by
positioning an avatar on a chair using a Varjo XR-3
headset. Finally, we showed that the method is capa-
ble of real-time performance on a Jetson Xavier NX
and argued that, given the mechanical form factor
of such devices, future generations of XR devices are
likely to have the computational capacity to run al-
gorithms such as the one proposed in this paper.
REFERENCES
Armeni, I., Sax, A., Zamir, A. R., and Savarese, S. (2017).
Joint 2D-3D-Semantic Data for Indoor Scene Under-
standing. https://arxiv.org/abs/1702.01105.
Hau, J., Bultmann, S., and Behnke, S. (2022). Object-level
3D semantic mapping using a network of smart edge
sensors. In IEEE International Conference on Robotic
Computing (IRC), pages 198–206. IEEE.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017).
Mask R-CNN. In IEEE International Conference on
Computer Vision (ICCV), pages 2980–2988.
Heo, J., Bhardwaj, K., and Gavrilovska, A. (2023). FleXR:
A system enabling flexibly distributed extended real-
ity. In Proceedings of the Conference on ACM Multi-
media Systems, pages 1–13.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics
YOLOv8. https://github.com/ultralytics/ultralytics.
Mascaro, R., Teixeira, L., and Chli, M. (2021). Dif-
fuser: Multi-view 2D-to-3D label diffusion for seman-
tic scene segmentation. In IEEE International Con-
ference on Robotics and Automation (ICRA), pages
13589–13595.
Ngo, T. D., Hua, B.-S., and Nguyen, K. (2023). ISBNet:
A 3D point cloud instance segmentation network with
instance-aware sampling and box-aware dynamic con-
volution. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 13550–13559.
Wang, B. H., Chao, W.-L., Wang, Y., Hariharan, B., Wein-
berger, K. Q., and Campbell, M. (2019). LDLS: 3-D
object segmentation through label diffusion from 2-
D images. IEEE Robotics and Automation Letters,
4(3):2902–2909.
Wang, Y., Shi, T., Yun, P., Tai, L., and Liu, M.
(2018). PointSeg: Real-time semantic seg-
mentation based on 3D LiDAR point cloud.
https://arxiv.org/abs/1807.06288.
Wu, Z., Zhao, T., and Nguyen, C. (2020). 3D reconstruction
and object detection for HoloLens. In Digital Image
Computing: Techniques and Applications (DICTA),
pages 1–2. IEEE.
Zhang, H., Han, B., Ip, C. Y., and Mohapatra, P. (2020).
Slimmer: Accelerating 3D semantic segmentation for
mobile augmented reality. In IEEE International
Conference on Mobile Ad Hoc and Sensor Systems
(MASS), pages 603–612.
ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods