
performance during the feature fusion stage and the optimal configuration for sensor fusion. Additionally, we integrate deformable cross-attention to improve the extraction of robust camera features by leveraging complementary information from the LiDAR and radar modalities. Because moving object labels in nuScenes are currently restricted to the vehicle class, our experimental validation focuses solely on this category. Performance could be improved further, and the motion detection task extended to additional classes such as bicyclists and pedestrians, once the corresponding labels become available. We leave this to future work.
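To make the cross-modal refinement step concrete, the following is a minimal sketch (not the paper's implementation) of deformable cross-attention in BEV: camera BEV queries predict sampling offsets, gather LiDAR/radar BEV features at those offset locations via bilinear sampling, and combine them with learned weights. The module name, the single-head/single-scale simplification, and parameters such as num_points and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableCrossAttentionBEV(nn.Module):
    """Single-head, single-scale deformable cross-attention sketch.

    query: camera BEV features (B, C, H, W)
    value: fused LiDAR/radar BEV features (B, C, H, W)
    """

    def __init__(self, channels: int = 128, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        # Each query location predicts (dx, dy) offsets and a weight per sampling point.
        self.offset_proj = nn.Conv2d(channels, num_points * 2, kernel_size=1)
        self.weight_proj = nn.Conv2d(channels, num_points, kernel_size=1)
        self.value_proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.output_proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, query: torch.Tensor, value: torch.Tensor) -> torch.Tensor:
        b, c, h, w = query.shape
        value = self.value_proj(value)

        # Normalized reference grid in [-1, 1], (x, y) order as expected by grid_sample.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=query.device),
            torch.linspace(-1, 1, w, device=query.device),
            indexing="ij",
        )
        ref = torch.stack((xs, ys), dim=-1)  # (H, W, 2)

        offsets = self.offset_proj(query).view(b, self.num_points, 2, h, w)
        weights = self.weight_proj(query).softmax(dim=1)  # (B, P, H, W)

        sampled = query.new_zeros(b, c, h, w)
        for p in range(self.num_points):
            # Keep offsets small relative to the BEV grid resolution.
            loc = ref + offsets[:, p].permute(0, 2, 3, 1) * (2.0 / max(h, w))
            feat = F.grid_sample(value, loc, align_corners=True)  # (B, C, H, W)
            sampled = sampled + weights[:, p : p + 1] * feat

        # Residual connection preserves the original camera BEV features.
        return query + self.output_proj(sampled)


# Usage: refine camera BEV features with LiDAR/radar context.
cam_bev = torch.randn(2, 128, 100, 100)
lidar_radar_bev = torch.randn(2, 128, 100, 100)
refined = DeformableCrossAttentionBEV(128)(cam_bev, lidar_radar_bev)
```

The residual design lets the module fall back to the unmodified camera features when the LiDAR/radar BEV context is uninformative; a multi-head, multi-scale variant would follow the same pattern with additional projections.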