sensible network design and control of computational cost are essential. Striking a balance between detection accuracy and real-time performance is a key consideration for future perception models.
4.2.5 Knowledge Distillation
The purpose of knowledge distillation is to transfer the knowledge learned by a large model, or an ensemble of models, to a lightweight model; it is essentially a form of model compression. The technique can substantially reduce model size without degrading the detection accuracy of the original model, which makes the model commercially deployable. Specifically, knowledge is transferred from a trained teacher model to a lightweight student model, reducing the model size while preserving detection performance.
Recently, BEVDistill (Chen et al. 2022) performed two kinds of distillation between the teacher and student models: dense feature distillation for feature-alignment optimization, and sparse distillation for knowledge transfer at the instance-prediction level. The model achieved strong results on the nuScenes dataset and demonstrated that knowledge distillation is a powerful tool for addressing practical deployment challenges.
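The teacher-student idea above can be sketched as a two-part training objective: a sparse, prediction-level term (KL divergence between temperature-softened teacher and student class distributions) plus a dense, feature-level term (mean-squared error between aligned feature maps). This is a generic illustration of distillation, not BEVDistill's exact loss; the function names and the `T` and `alpha` hyperparameters are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits, student_feat, teacher_feat,
                      T=2.0, alpha=0.5):
    """Generic two-part distillation loss (illustrative sketch).

    - Sparse / prediction level: KL(teacher || student) on class
      distributions softened by temperature T, scaled by T^2 as in
      standard soft-target distillation.
    - Dense / feature level: MSE between aligned feature maps
      (e.g., BEV features of teacher and student).
    """
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)),
                axis=-1).mean() * T ** 2
    mse = np.mean((student_feat - teacher_feat) ** 2)
    return alpha * kl + (1 - alpha) * mse
```

In practice the feature term usually requires a small adapter layer so the student's feature channels match the teacher's; the weighting `alpha` and temperature `T` are tuned per task.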
5 CONCLUSION
Given the growing importance of accurate and robust perception in applications such as autonomous driving, this paper explores multimodal-fusion 3D object detection algorithms based on BEV technology, focusing on the fusion of LIDAR and camera perception. First, the paper summarizes the advantages of BEV technology in current perception systems, including reduced occlusion, end-to-end design, and reduced error accumulation.
By analyzing the advantages and disadvantages of BEV perception algorithms and sensors in the image-only, LIDAR-only, and LC-fusion settings, this study finds that LC fusion offers better detection accuracy and robustness and is among the most promising perception approaches for the future. In addition, the advantages and limitations of the related algorithmic models are discussed at three fusion granularities: point-level, feature-level, and voxel-level fusion. Among these, algorithms based on virtual points, algorithms based on a unified BEV-space representation, and knowledge-distillation methods are of pioneering significance. For these models, this paper puts forward suggestions such as lightweight network design. However, multimodal fusion still faces three challenges: difficulty in aligning fused features, strong dependence on complete modal inputs, and inaccurate depth information and semantic ambiguity caused by depth estimation or spatial compression. Approaches such as unified spatial representation and decoupling of perception channels are suggested to address these issues. In the future,
multimodal fusion perception methods can evolve toward end-to-end design, multi-task learning, temporal fusion, low latency, and knowledge distillation. This paper explores the characteristics of multimodal fusion techniques and identifies potential development directions, providing a reference and a summary perspective for future research.
REFERENCES
C. R. Qi, H. Su, K. Mo, et al., PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation; in Proceedings of the 30th IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul 21-26 (2017).
C. Wang, C. Ma, M. Zhu, et al., PointAugmenting: Cross-Modal Augmentation for 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), virtual event, Jun 19-25 (2021).
C. Ge, J. Chen, E. Xie, et al., arXiv:2304.09801 (2023).
H. Wu, C. Wen, S. Shi, et al., Virtual Sparse Convolution for Multimodal 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, Jun 17-24 (2023).
J. Huang, et al., BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection; arXiv:2203.17054 (2022).
S. Mohapatra, S. Yogamani, H. Gotzig, et al., 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), 2809-2815 (2021).
S. Vora, A. H. Lang, B. Helou, et al., PointPainting: Sequential Fusion for 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), virtual event, Jun 14-19 (2020).
S. Wang, H. Caesar, L. Nan, et al., arXiv:2309.14516 (2023).
T. Yin, X. Zhou, P. Krähenbühl, Multimodal Virtual Point 3D Detection; in Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), virtual event, Dec 06-14 (2021).
T. Liang, H. Xie, K. Yu, et al., arXiv (2022).