regions near the LiDAR sensor where the point cloud
is denser and, consequently, more detailed.