BEV-Based 3D Detection for Automatic Driving Using Lidar-Camera
Fusion
Jihua Jiang
College of Energy and Power Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing, 211800, China
Keywords: Bird's Eye View, 3D Detection, Lidar-Camera Fusion, Automatic Driving.
Abstract: Deep learning and autonomous-driving research have matured rapidly, and perception has become a core component of the autonomous driving system. This paper reviews BEV-based 3D detection built on Lidar-Camera fusion (LC Fusion), organized by fusion mechanism. It concludes that BEV-based LC Fusion is currently the most promising perception approach and is likely to become the dominant form of future perception systems, offering advantages such as high detection accuracy and strong robustness. Starting from fusion granularity, the paper summarizes the high accuracy and low latency of current LC Fusion algorithms as well as their limitations, such as network complexity, and proposes improvements including attention-based fusion mechanisms and network lightweighting. It further suggests solutions such as unified spatial representation and decoupled perception streams, and identifies development directions including end-to-end design, multi-task learning, and knowledge distillation. The paper is intended as a reference and a summarizing perspective for subsequent researchers working on this perception technology.
1 INTRODUCTION
In recent years, neural networks and deep learning technologies have developed rapidly. Perception, as the foundation of the whole autonomous driving system, plays a key role in subsequent decision-making and control. Owing to their excellent detection accuracy and scalability, multi-sensor fusion perception schemes have gradually become mainstream, and among them the Lidar-Camera fusion scheme has attracted wide attention from academia and industry for its highly complementary sensors and excellent perception performance.
At this stage, Lidar-Camera Fusion (LC Fusion) can be categorized into pre-fusion, cascade fusion, and post-fusion according to the stage at which fusion occurs. Each of the three has its own characteristics and remains an active research topic.
Bird's-eye-view (BEV) perception has also received widespread attention. BEV perception converts sensory information into features in a BEV space; because this space provides a unified representation, it simplifies the fusion and processing of heterogeneous modalities as well as multi-task learning. Moreover, object overlap is reduced in the BEV perspective, yielding a clearer and more accurate picture of the environment. BEV perception can also be optimized end-to-end by neural networks, without serial perception pipelines, which avoids error accumulation, reduces the impact of hand-crafted algorithmic logic, and improves the accuracy of perception and prediction. BEV perception therefore provides a strong foundation for downstream decision-making and control, which is of great significance for eventually realizing high-level automated driving.
The introduction of BEV spatial representation therefore provides new solutions for LC Fusion. Unlike traditional perception schemes, BEV-based schemes do not stack sub-task modules in a serial pipeline but convert camera and LiDAR observations into a unified bird's-eye view for the relevant perception tasks, which makes cross-camera and multi-modal fusion easier and also enables temporal fusion.
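To make the idea of a unified BEV representation concrete, the following minimal sketch (Python with NumPy; the grid extent and resolution are illustrative assumptions, not values from any cited model) rasterizes a LiDAR point cloud into a fixed bird's-eye-view grid that downstream heads could consume.

```python
import numpy as np

def lidar_to_bev_occupancy(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                           resolution=0.5):
    """Rasterize LiDAR points (N, >=3) into a binary BEV occupancy grid.

    points: array with columns x, y, z (metres) in the ego frame.
    Grid extents and resolution are illustrative choices.
    """
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((ny, nx), dtype=np.float32)

    # Keep only points inside the grid extents.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    # Convert metric coordinates to integer cell indices and mark cells occupied.
    ix = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int32)
    iy = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int32)
    bev[iy, ix] = 1.0
    return bev
```

Richer variants replace the binary occupancy with per-cell statistics (height, intensity, learned features), but the unified grid layout is what makes multi-modal and multi-task processing straightforward.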
This paper focuses on the technical characteristics and open problems of existing BEV-based Lidar-Camera fusion schemes, in order to identify where current fusion technology is heading and to provide a summarizing perspective and research directions for later researchers. The paper first introduces the concept of BEV technology and its advantages, and surveys BEV-based perception methods organized by sensor modality, comparing the strengths and weaknesses of the sensors. It then gives a systematic review of BEV-based Lidar-Camera fusion algorithms, detailing representative algorithms at three fusion granularities (point-level, feature-level, and voxel-level fusion) and summarizing the common challenges faced by multimodal fusion algorithms and their solutions. Finally, it discusses future research directions for multimodal fusion perception.
2 THE CLASSIFICATION OF BEV
Depending on the input modality, BEV perception algorithms can be categorized into pure LiDAR-based, pure vision-based, and multimodal fusion-based approaches. This section briefly reviews each category.
2.1 BEV Perception Algorithms Based
on Pure LiDAR
LiDAR measures the distance to surrounding objects from the time difference between emitting a laser pulse and receiving its reflection, enabling accurate sensing of geometric information such as object pose, shape, and motion, which is important for intelligent vehicles to understand their environment in real time. However, the sparsity of the LiDAR point cloud means it carries little semantic information, and LiDAR still suffers from limitations such as degraded accuracy at long range and high cost. In recent years, LiDAR-based perception algorithms have nevertheless advanced significantly, reaching a level of geometric accuracy that other single sensors cannot match.
Pure LiDAR algorithms can be categorized into point-based, voxel-based, and feature-projection-based methods; classic examples include PointNet and VoxelNet (Qi et al. 2017, Zhou & Tuzel 2018).
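As a rough illustration of the voxel-based branch, the sketch below (PyTorch; the layer size and per-voxel point budget are illustrative assumptions, not the actual VoxelNet configuration) groups padded per-voxel points and summarizes each voxel with a PointNet-style shared MLP followed by max pooling.

```python
import torch
import torch.nn as nn

class SimpleVoxelEncoder(nn.Module):
    """Toy voxel feature encoder: per-point MLP + max pooling within each voxel."""

    def __init__(self, in_dim=4, out_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())

    def forward(self, voxel_points, voxel_mask):
        # voxel_points: (V, T, in_dim) padded points per voxel (x, y, z, intensity)
        # voxel_mask:   (V, T) boolean, True where a real point is stored.
        # Assumes every voxel contains at least one real point.
        feats = self.mlp(voxel_points)                       # (V, T, out_dim)
        feats = feats.masked_fill(~voxel_mask.unsqueeze(-1), float("-inf"))
        voxel_feats, _ = feats.max(dim=1)                    # (V, out_dim)
        return voxel_feats

# Usage with illustrative sizes: 100 voxels, up to 32 points each.
enc = SimpleVoxelEncoder()
pts = torch.randn(100, 32, 4)
mask = torch.rand(100, 32) > 0.3
print(enc(pts, mask).shape)  # torch.Size([100, 64])
```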
2.2 Pure Vision-based BEV Perception
Algorithms
Camera images provide dense observations with shape and texture attributes and rich semantic information. However, depth is lost in the imaging process, so accurate geometric information such as distance is hard to recover. Although research on image depth estimation networks has progressed rapidly, accurate depth estimation remains difficult, which severely limits detection accuracy. According to how the perspective view (PV) is transformed into the bird's-eye view (BEV), purely visual BEV perception algorithms can be categorized into MLP-based, Transformer-based, and depth-based methods (Ma et al. 2022). The Transformer-based and depth-based families have been studied most intensively and include high-performing representatives such as BEVFormer and BEVDet4D (Li et al. 2022, Huang et al. 2022).
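The depth-based PV-to-BEV transformation can be sketched as follows. This simplified example (PyTorch) collapses the per-pixel depth distribution to its expected value rather than performing a full lift-splat outer product, and assumes the camera frame coincides with the ego frame; both are simplifying assumptions for illustration only.

```python
import torch

def lift_to_bev(img_feats, depth_probs, K, depth_bins,
                bev_hw=(128, 128), x_range=(-50.0, 50.0), z_range=(0.0, 100.0)):
    """Lift image features into a BEV grid via per-pixel depth estimates.

    img_feats:   (C, H, W) image feature map.
    depth_probs: (D, H, W) per-pixel distribution over D candidate depths.
    K:           (3, 3) camera intrinsics; camera frame treated as the ego frame
                 (z forward, x right) for simplicity.
    depth_bins:  (D,) candidate metric depths.
    """
    C, H, W = img_feats.shape
    # Expected depth per pixel (a soft depth estimate).
    depth = (depth_probs * depth_bins.view(-1, 1, 1)).sum(dim=0)      # (H, W)

    # Back-project every pixel to a 3D point with the pinhole model.
    u, v = torch.meshgrid(torch.arange(W), torch.arange(H), indexing="xy")
    x = (u - K[0, 2]) / K[0, 0] * depth
    z = depth

    # Convert (x, z) to BEV cell indices and scatter features (last write wins).
    nz, nx = bev_hw
    res_x = (x_range[1] - x_range[0]) / nx
    res_z = (z_range[1] - z_range[0]) / nz
    ix = ((x - x_range[0]) / res_x).long().clamp(0, nx - 1)
    iz = ((z - z_range[0]) / res_z).long().clamp(0, nz - 1)

    bev = torch.zeros(C, nz, nx)
    bev[:, iz.reshape(-1), ix.reshape(-1)] = img_feats.reshape(C, -1)
    return bev
```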
2.3 BEV Perception Algorithms Based on LC Fusion
Because LiDAR and cameras are highly complementary, combining the two has become a mainstream research direction. LiDAR provides physical information such as object contours and distances, compensating for the camera's difficulty in predicting accurate depth, while the camera's RGB images offer rich semantic information, compensating for LiDAR's lack of semantics and its difficulty in recognizing distant objects. Their fusion therefore enhances detection accuracy and robustness. All three fusion stages have been studied thoroughly, and many research teams have proposed models with strong performance, such as TransFusion and BEVFusion (Bai et al. 2022, Liang et al. 2022).
3 BEV-BASED LC FUSION
This section categorizes and describes BEV-based LC Fusion algorithms, which can be divided into point-level fusion, feature-level fusion, and voxel-level fusion.
3.1 Point-Level Fusion
Point-level fusion combines LiDAR points and camera images at the data input stage, preserving the raw data more completely and losing less information. Typically, point-level fusion first establishes the association between LiDAR points and image pixels via the calibration matrix, and then augments the LiDAR features point by point with segmentation scores or CNN features from the associated pixels. For example, PointAugmenting projects LiDAR points onto the image plane and decorates them with semantics from the camera image, making long-range and occluded targets easier to recognize (Wang et al. 2021). This model is a classic example of point-level fusion and truly aligns LiDAR points with image data at the semantic level, although considerable room for improvement remains.
The equally classic PointPainting also uses point-by-point concatenation to realize fusion (Vora et al. 2020). It exploits the complementary nature of point cloud and image data by concatenating image semantic segmentation results with the LiDAR points, decorating each LiDAR point with its segmentation scores; a LiDAR-based object detector can then be applied to the decorated point cloud to obtain 3D detections.
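The decoration step common to PointPainting-style methods can be sketched as follows (NumPy; the matrix names and score layout are assumptions for illustration, not the exact PointPainting interface): each LiDAR point is projected into the image with the calibration matrices and concatenated with the segmentation scores of the pixel it lands on.

```python
import numpy as np

def paint_points(points, seg_scores, lidar_to_cam, cam_intrinsics):
    """Append per-pixel semantic scores to LiDAR points (PointPainting-style sketch).

    points:         (N, 4) LiDAR points (x, y, z, intensity) in the LiDAR frame.
    seg_scores:     (H, W, K) per-pixel class scores from an image segmentation net.
    lidar_to_cam:   (4, 4) extrinsic calibration matrix (LiDAR -> camera).
    cam_intrinsics: (3, 3) camera intrinsic matrix.
    """
    H, W, K = seg_scores.shape
    xyz1 = np.concatenate([points[:, :3], np.ones((len(points), 1))], axis=1)
    cam_pts = (lidar_to_cam @ xyz1.T)[:3]            # (3, N) points in camera frame

    # Keep points in front of the camera and project with the pinhole model.
    valid = cam_pts[2] > 0.1
    uv = cam_intrinsics @ cam_pts
    uv = (uv[:2] / np.maximum(uv[2], 1e-6)).T        # (N, 2) pixel coordinates
    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    valid &= in_img

    # Decorate each valid point with the scores of the pixel it lands on.
    painted = np.zeros((len(points), K), dtype=np.float32)
    u = uv[valid, 0].astype(int)
    v = uv[valid, 1].astype(int)
    painted[valid] = seg_scores[v, u]
    return np.concatenate([points, painted], axis=1)  # (N, 4 + K)
```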
Despite these improvements, point-level fusion methods still suffer from two main problems. First, they fuse LiDAR and image features only by point-wise summation or concatenation, which makes them strongly dependent on image quality. Second, establishing hard correspondences between sparse LiDAR points and dense image pixels not only wastes many image features carrying rich semantic information, but also relies heavily on high-quality alignment between the two sensors, which the inherent spatio-temporal mismatch between them makes difficult to achieve.
In addition, recent work converts image data into dense pseudo-point clouds and fuses them with the LiDAR point cloud inside a LiDAR 3D detection backbone. In 2021, Yin et al. proposed MVP, which introduced virtual points for fusing with LiDAR data (Yin et al. 2021). In 2023, Wu et al. proposed VirConvNet, which uses the VirConv module for virtual-point-based 3D object detection and addresses the density and noise introduced by depth completion (Wu et al. 2023). This image-to-pseudo-point-cloud fusion strategy has received widespread attention; it benefits from the increasing accuracy of 2D depth completion networks and offers good utility and generalization, providing a new direction for future fusion perception schemes.
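A simplified reading of the virtual-point idea is sketched below (NumPy; the sampling strategy and nearest-neighbor depth borrowing are assumptions for illustration and do not reproduce MVP's exact procedure): pixels sampled inside a 2D instance mask borrow the depth of the nearest projected real LiDAR point and are unprojected into 3D to densify the point cloud.

```python
import numpy as np

def make_virtual_points(mask_pixels, lidar_uv, lidar_depth, cam_intrinsics, n_samples=50):
    """Create virtual 3D points inside a 2D instance mask (simplified sketch).

    mask_pixels: (M, 2) pixel coordinates (u, v) belonging to one detected instance.
    lidar_uv:    (P, 2) pixel coordinates of real LiDAR points projected into the image.
    lidar_depth: (P,)   depths of those LiDAR points in the camera frame.
    """
    # Randomly sample pixels from the mask.
    idx = np.random.choice(len(mask_pixels), size=min(n_samples, len(mask_pixels)),
                           replace=False)
    samples = mask_pixels[idx].astype(np.float64)

    # Borrow the depth of the nearest projected real point for each sample.
    d2 = ((samples[:, None, :] - lidar_uv[None, :, :]) ** 2).sum(-1)   # (S, P)
    depth = lidar_depth[d2.argmin(axis=1)]                             # (S,)

    # Unproject sampled pixels to 3D camera coordinates.
    fx, fy = cam_intrinsics[0, 0], cam_intrinsics[1, 1]
    cx, cy = cam_intrinsics[0, 2], cam_intrinsics[1, 2]
    x = (samples[:, 0] - cx) / fx * depth
    y = (samples[:, 1] - cy) / fy * depth
    return np.stack([x, y, depth], axis=1)   # (S, 3) virtual points
```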
3.2 Feature-Level Fusion
Feature-level fusion refers to fusing the abstract features of different modalities after each has been processed by its own neural network, producing fused features for subsequent processing and downstream perception tasks.
A well-known feature-level fusion method is TransFusion, proposed in 2021, which consists of a convolutional backbone and a detection head based on a Transformer decoder (Bai et al. 2022). The model uses a sparse set of object queries to predict initial bounding boxes from the LiDAR point cloud under the guidance of image features, while an attention mechanism adaptively selects important features and fuses them. In contrast to the hard association of point-level fusion, where LiDAR points are projected into the image, TransFusion achieves a soft association at the feature level, which makes it far less sensitive to image quality degradation and sensor misalignment and therefore more robust under perceptual degradation.
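The query-based fusion idea can be sketched with standard cross-attention (PyTorch; the dimensions, single attention layer, and residual structure are illustrative assumptions rather than the actual TransFusion head): object queries derived from the LiDAR branch attend to flattened image features and are updated with the attended information.

```python
import torch
import torch.nn as nn

class QueryImageFusion(nn.Module):
    """Toy sketch: object queries from the LiDAR branch attend to image features."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries, img_feats):
        # queries:   (B, Q, dim)   object queries initialized from LiDAR proposals
        # img_feats: (B, H*W, dim) flattened image feature map
        fused, _ = self.attn(query=queries, key=img_feats, value=img_feats)
        return self.norm(queries + fused)   # residual connection on the queries

# Usage with illustrative sizes.
fusion = QueryImageFusion()
q = torch.randn(2, 200, 128)
f = torch.randn(2, 40 * 100, 128)
print(fusion(q, f).shape)  # torch.Size([2, 200, 128])
```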
In 2022, Chen et al. proposed AutoAlign (Chen et al. 2023). Instead of using camera projection matrices and geometric transformations to establish deterministic correspondences, the model uses a learnable alignment map to model the mapping between the image and the point cloud. AutoAlign achieves positional and semantic consistency at both the pixel level and the instance level, which ensures the accuracy of feature alignment at different granularities. However, the cross-attention feature alignment module in AutoAlign adopts a global attention mechanism and therefore inevitably incurs high computational cost; the model can hardly afford to query high-resolution images and is limited to lower-resolution, lower-quality inputs, which reduces its performance.
The same team proposed AutoAlignV2 to address this problem, introducing the cross-modal DeformCAFA module (Chen et al. 2024). It uses sparse learnable sampling points for cross-modal relational modeling, which improves tolerance to calibration error, accelerates feature aggregation between the two streams, and improves computational efficiency. The model also proposes the GT-AUG data augmentation strategy, which handles target occlusion in the image domain and generates smoother images for fusion simply and efficiently. Considering that images may be missing from some views in real-world multi-view perception, the team further proposed an image-dropping training strategy; processing fewer images per batch speeds up training and improves the overall performance and robustness of the model. Nevertheless, lightweight network design remains a key issue for its future practical deployment.
3.3 Voxel-Level Fusion
Voxel-level fusion has developed rapidly in recent years. Its main feature is to decouple the LiDAR and camera perception streams: the two modalities undergo independent feature extraction, and their features are finally fused in a shared voxel or BEV space. This approach is simple and efficient, reduces the model's dependence on complete modal inputs, and improves the robustness of the perception system.
In 2022, BEVFusion, proposed by Liang et al., drew attention to this fusion strategy (Liang et al. 2022). The model truly decouples the two perception streams and removes the strong dependence of multimodal methods on complete inputs: the camera and the LiDAR form two independent streams whose raw data are converted into BEV features by their own encoders and finally fused by a simple module. This simple design allows mature single-modality models to be reused directly for the relevant perception tasks, greatly enhancing its generalization ability.
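A minimal sketch of this decoupled design is shown below (PyTorch; the channel counts and convolutional fuser are illustrative assumptions, not the exact BEVFusion module): the camera and LiDAR streams each produce a BEV feature map, and fusion reduces to concatenation followed by a small convolution.

```python
import torch
import torch.nn as nn

class SimpleBEVFuser(nn.Module):
    """Minimal sketch of fusing decoupled camera and LiDAR BEV features."""

    def __init__(self, cam_ch=80, lidar_ch=256, out_ch=256):
        super().__init__()
        # Concatenate along channels, then mix with a small convolution.
        self.fuse = nn.Sequential(
            nn.Conv2d(cam_ch + lidar_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev, lidar_bev):
        # cam_bev:   (B, cam_ch,   H, W) BEV features from the camera stream
        # lidar_bev: (B, lidar_ch, H, W) BEV features from the LiDAR stream
        # If one stream is missing, a zero tensor can be substituted, which is
        # one simple way to keep the network running under sensor failure.
        return self.fuse(torch.cat([cam_bev, lidar_bev], dim=1))
```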
To achieve more robust and accurate perception, Ge et al. proposed MetaBEV in 2023, an algorithm with strong capabilities for handling feature alignment and sensor failure (Ge et al. 2023). Building on BEVFusion's independent perception streams, the model proposes a cross-modal BEV-Evolving decoder that uses cross-modal deformable attention to aggregate learnable queries with the camera BEV and LiDAR BEV features, producing fused features to which several task-specific heads are applied for different 3D perception tasks. A robust fusion module with a new M2oE-FFN layer is introduced, mainly to mitigate gradient conflicts between 3D detection and semantic segmentation and to obtain more balanced and robust performance. MetaBEV thus achieves multi-task perception within multimodal fusion. Although MetaBEV improves on BEVFusion, it still uses two modality-specific BEV encoders and therefore does not completely solve the feature misalignment problem.
In 2023, Wang et al. proposed UniBEV (Wang et al. 2023). Unlike previous multimodal detectors, all sensor modalities in UniBEV use a unified, deformable-attention-based BEV encoder to extract BEV features from their original coordinate systems, which properly resolves the discrepancies caused by inconsistent encoding of the different modalities. More importantly, UniBEV designs a channel normalized weights (CNW) module to fuse the two perception streams: the available modal BEV feature maps are combined with normalized channel-wise weights, which balances the contributions of the more and less reliable sensors and alleviates conflicts between their information. This fusion mechanism makes the model more robust to sensor faults or failures.
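One plausible reading of such channel-wise weighted fusion is sketched below (PyTorch; this is an assumption-laden illustration of the general idea, not the exact CNW formulation in UniBEV): each modality receives a learnable per-channel weight, normalized so the weights of the two modalities sum to one in every channel.

```python
import torch
import torch.nn as nn

class ChannelWeightedFusion(nn.Module):
    """Hedged sketch of channel-wise weighted fusion of two BEV feature maps."""

    def __init__(self, channels=256):
        super().__init__()
        # One learnable weight per channel and per modality.
        self.logits = nn.Parameter(torch.zeros(2, channels))

    def forward(self, cam_bev, lidar_bev):
        # cam_bev, lidar_bev: (B, C, H, W) BEV features from the two streams
        w = torch.softmax(self.logits, dim=0)        # (2, C), sums to 1 per channel
        w_cam = w[0].view(1, -1, 1, 1)
        w_lidar = w[1].view(1, -1, 1, 1)
        return w_cam * cam_bev + w_lidar * lidar_bev
```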
In addition, knowledge distillation techniques have received increasing attention in recent years for their advantages in fusing heterogeneous, non-uniform features across 2D and 3D spaces. Chen et al. proposed BEVDistill (Chen et al. 2022). The model unifies image and LiDAR features in BEV space and transfers knowledge adaptively through dense and sparse feature distillation between non-homogeneous representations in a teacher-student paradigm; notably, its distillation scheme explicitly accounts for the differences between modalities. Knowledge distillation differs from the conventional use of attention mechanisms to align and fuse two modalities, and its advantages in aligning heterogeneous, non-uniform data make it an important development direction for future multimodal fusion.
There are also voxel-level knowledge distillation methods such as UVTR, which transfer knowledge between the student and teacher models in voxel space without heavy compression, thereby avoiding problems such as the semantic ambiguity caused by compression (Li et al. 2022).
4 CHALLENGES AND FUTURE DIRECTIONS
This section discusses three common problems faced by BEV-based multimodal fusion detection algorithms and their possible solutions, as well as the directions in which future development is likely to concentrate.
4.1 Challenges and Possible Solutions
Models performing multimodal fusion detection in BEV space commonly face three problems: difficulty in aligning the fused features, high dependence on complete modal inputs, and inaccurate depth information or semantic ambiguity caused by depth estimation of image data or spatial compression of point cloud data. The solutions offered by different algorithms are discussed below, organized by problem.
4.1.1 Difficulty in Fusion Feature Alignment
Since multi-sensor fusion combines sensory information obtained with different extraction methods and in different data forms, aligning the features of different modalities is inherently difficult, and misalignment introduces errors that degrade perception accuracy.
First, the approach represented by DeepFusion proposes two techniques, InverseAug and LearnableAlign, to achieve robust alignment and fusion (Meng et al. 2022). Both techniques add little computational cost while providing large gains in recognizing long-range targets, significantly improving the model's recognition and localization capability.
The second is the approach represented by UniBEV. Its authors designed a unified, deformable-attention-based BEV encoder that aligns heterogeneous data features with high precision in the BEV domain. The model improves on BEVFusion and MetaBEV by applying the same deformable-attention encoding to both the LiDAR and camera streams, facilitating information interaction between them, and finally fusing the two adaptively through the channel normalized weights (CNW) module. Approaches based on a unified representation thus have a natural advantage in solving the feature alignment problem.
The third is the approach represented by AutoAlign (Chen et al. 2023). AutoAlign is designed with a cross-attention feature alignment module (CAFA) and a self-supervised cross-modal feature interaction module (SCFI), through which heterogeneous data features can be aligned in a dynamic, data-driven manner. Overall, current adaptive feature alignment methods based on attention mechanisms, as well as those based on unified spatial representations, are accurate and efficient, and this direction holds large research potential. However, lightweight and low-latency network design remain issues to be addressed in this area.
4.1.2 High Dependence on Complete Modal
Inputs
Many current multimodal fusion algorithms rely heavily on accurate LiDAR sensing, with a LiDAR point cloud detection network as the backbone. As a result, perception degrades severely when the LiDAR sensor fails or malfunctions, which can even threaten the safe operation of self-driving cars. This is an urgent problem for the development of autonomous driving, and in recent years many research teams have offered suitable solutions.
The first is the approach represented by BEVFusion. BEVFusion decouples the LiDAR and camera perception streams, which reduces the heavy dependence on LiDAR in fusion and guards against sensor failure: when one sensor channel fails, the task can continue through the other perception channel. Research teams later proposed MetaBEV and UniBEV on the basis of BEVFusion, further improving performance, although some perception degradation when a sensor fails remains unavoidable.
Next is the approach represented by Policy Fusion (Huang et al. 2023). This is a decision-level fusion algorithm that uses a value critique network (VCNet) in the decision-making phase to score the decisions obtained end-to-end, yielding a primary and a secondary decision, which are then fused by a Primary and Secondary Strategy Fusion (PSF) module. In the case of a severe sensor failure, the unaffected independent decision replaces the fused decision, so that sensor faults and failures can be handled properly.
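The fallback behavior described above can be illustrated with a trivial sketch (Python; the decision objects, health flags, and selection rule are hypothetical simplifications and do not reproduce the VCNet scoring or the PSF module).

```python
def select_decision(fused, cam_only, lidar_only, cam_ok, lidar_ok):
    """Hedged sketch of a decision-level fallback: use the fused decision when
    both sensors are healthy, otherwise fall back to the unaffected
    single-modality decision. Each argument named *_only or fused is assumed
    to be a decision produced by its own perception/planning pipeline."""
    if cam_ok and lidar_ok:
        return fused
    if lidar_ok:
        return lidar_only
    if cam_ok:
        return cam_only
    raise RuntimeError("no healthy perception channel available")
```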
4.1.3 Inaccurate Depth Information and
Semantic Ambiguity Due to Depth
Estimation or Spatial Compression of
Image Data or Point Cloud Data
When image data are lifted from 2D to 3D via depth estimation, the limited accuracy of the depth estimation network leads to inaccurate depth information. When LiDAR point cloud data are compressed into BEV space, the heavy compression at each location aggregates features from different objects, which can cause semantic ambiguity.
To address this, methods represented by UVTR uniformly encode the inputs from different sensors and represent the different modalities in voxel space without further compression (Li et al. 2022). However, such methods still overlook the heterogeneity of the two representations and fuse them by forcing knowledge distillation between them, which is the direction in which they can be further improved.
4.2 Future Research Direction
This section discusses several future directions for multimodal fusion perception, including end-to-end design, multi-task learning, temporal fusion, low latency, and knowledge distillation.
4.2.1 End-to-End
End-to-end perception feeds raw sensory data through a neural network that directly outputs prediction results or other task outputs, without hand-designed pre-processing for feature extraction. Instead of manually designed features, a deep network adaptively extracts the features that are effective for the perception task, giving the model more room to adjust to the data and improving robustness. End-to-end design also reduces manual intervention and post-processing steps, and thus error accumulation.
4.2.2 Multi-Task Learning
Multi-task learning (MTL) refers to performing multiple task predictions with one set of trained weights; it receives widespread attention for its practical value, such as complementary performance across tasks and lower computational cost. However, multi-task models must learn to balance the objectives of the individual tasks and avoid task conflicts, which is challenging. MetaBEV proposed a robust fusion module with a new M2oE-FFN layer whose main role is to mitigate conflicts between multiple tasks (Ge et al. 2023); experiments show that it balances the tasks well. Nevertheless, further improvements are still needed to realize more concise and efficient multi-task learning networks.
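A minimal multi-task setup over a shared BEV feature map might look like the following sketch (PyTorch; the head structure, class counts, and loss weighting are illustrative assumptions rather than any cited model's design).

```python
import torch
import torch.nn as nn

class MultiTaskBEVHeads(nn.Module):
    """Sketch: one shared BEV feature map feeds a detection head and a
    semantic segmentation head (all sizes are illustrative)."""

    def __init__(self, in_ch=256, num_classes=10, num_seg_classes=6):
        super().__init__()
        # Detection head outputs class scores plus a few box parameters per cell.
        self.det_head = nn.Conv2d(in_ch, num_classes + 7, kernel_size=1)
        self.seg_head = nn.Conv2d(in_ch, num_seg_classes, kernel_size=1)

    def forward(self, bev_feats):
        # bev_feats: (B, in_ch, H, W) shared BEV features
        return {"detection": self.det_head(bev_feats),
                "segmentation": self.seg_head(bev_feats)}

# A common recipe weights the task losses to keep training balanced:
# loss = w_det * det_loss + w_seg * seg_loss
```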
4.2.3 Temporal Fusion
Temporal fusion is key to improving the accuracy and continuity of perception: it mitigates inter-frame jumps and target occlusion in detection, allows target velocity to be estimated more accurately, and plays an important role in prediction and tracking. It has therefore become a focus of current perception modeling. In 2022, BEVDet4D was proposed by Huang et al. (Huang et al. 2022). It first aligns the BEV features of the previous frame with those of the current frame in the temporal and spatial dimensions and then concatenates them along the channel dimension. Through this temporal fusion, the method substantially improves velocity prediction at only a small increase in computational cost.
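The alignment-then-concatenation step can be sketched as follows (PyTorch; the use of a precomputed ego-motion sampling grid is an assumption for illustration and simplifies how BEVDet4D actually performs spatial alignment).

```python
import torch
import torch.nn.functional as F

def temporal_concat(curr_bev, prev_bev, ego_motion_grid):
    """Sketch of temporal BEV fusion: warp the previous frame's BEV features
    into the current frame with an ego-motion sampling grid, then concatenate
    along the channel dimension.

    curr_bev:        (B, C, H, W) BEV features of the current frame
    prev_bev:        (B, C, H, W) BEV features of the previous frame
    ego_motion_grid: (B, H, W, 2) normalized sampling grid encoding ego motion
    """
    aligned_prev = F.grid_sample(prev_bev, ego_motion_grid, align_corners=False)
    return torch.cat([curr_bev, aligned_prev], dim=1)   # (B, 2C, H, W)
```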
4.2.4 Low Latency
Most current multimodal fusion algorithms are too computationally heavy to be deployed on vehicle ECUs for reliable real-time sensing in commercial scenarios, so low latency has become an urgent problem for the practical deployment of perception models. In 2021, a research team proposed BEVDetNet, a LiDAR-based real-time detection model (Mohapatra et al. 2021). With a concise and efficient design, the model truly achieves low-latency detection, reporting a latency of 4 ms on the embedded Nvidia Xavier platform, a level suitable for commercial deployment. Real-time performance requires reasonable network design and control of the amount of computation, and balancing detection accuracy against real-time performance is a key consideration for future perception models.
4.2.5 Knowledge Distillation
The purpose of knowledge distillation is to transfer the knowledge learned by a large model, or an ensemble of models, to a lightweight model; it is essentially a form of model compression. The method can significantly reduce model size without reducing the detection accuracy of the original model, making the model commercially deployable. Concretely, knowledge is distilled from the trained teacher model into the lightweight student model, preserving detection effectiveness at a fraction of the size. Recently, BEVDistill (Chen et al. 2022) performed both dense and sparse feature distillation between teacher and student models for feature alignment optimization and knowledge transfer at the instance prediction level. The model performed strongly on the nuScenes dataset, showing that knowledge distillation is a powerful tool for addressing practical deployment challenges.
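The dense part of such feature distillation can be sketched as a simple regression of student BEV features onto the teacher's (PyTorch; the loss form and optional foreground weighting are illustrative assumptions, not the exact BEVDistill objective).

```python
import torch
import torch.nn.functional as F

def dense_bev_distill_loss(student_bev, teacher_bev, fg_mask=None):
    """Hedged sketch of dense feature distillation in BEV space: the student's
    BEV features are pushed toward the (detached) teacher's BEV features,
    optionally weighted toward foreground cells.

    student_bev, teacher_bev: (B, C, H, W) BEV feature maps of the two models
    fg_mask:                  (B, 1, H, W) optional foreground weighting
    """
    diff = F.mse_loss(student_bev, teacher_bev.detach(), reduction="none")
    if fg_mask is not None:
        diff = diff * fg_mask
    return diff.mean()
```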
5 CONCLUSION
Given the growing importance of accurate and robust perception in applications such as autonomous driving, this paper has surveyed BEV-based multimodal fusion 3D detection algorithms, focusing on the fusion of LiDAR and camera vision. It first summarized the advantages of BEV technology in current perception systems, including reduced occlusion, end-to-end design, and reduced error accumulation. By analyzing the strengths and weaknesses of camera-only, LiDAR-only, and LC fusion BEV perception algorithms and their sensors, the study finds that LC fusion offers better detection accuracy and robustness and is among the most promising perception approaches for the future. The advantages and limitations of representative models were then discussed at three fusion granularities (point-level, feature-level, and voxel-level fusion), among which algorithms based on virtual points, on a unified BEV representation, and on knowledge distillation are of pioneering significance; for these models, suggestions such as lightweight network design were put forward. Multimodal fusion still faces three challenges: difficulty in fusion feature alignment, high dependence on complete modal inputs, and inaccurate depth information or semantic ambiguity caused by depth estimation or spatial compression. Approaches such as unified spatial representation and decoupled perception streams are suggested to address these issues. In the future, multimodal fusion perception can evolve toward end-to-end design, multi-task learning, temporal fusion, low latency, and knowledge distillation. This paper has explored the characteristics of multimodal fusion techniques and their potential development directions, providing a reference and summarizing perspective for future research.
REFERENCES
C. R. Qi, H. Su, K. Mo, et al., PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, Jul 21-26, 2017.
C. Wang, C. Ma, M. Zhu, et al., PointAugmenting: Cross-Modal Augmentation for 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 19-25, 2021.
C. Ge, J. Chen, E. Xie, et al., arXiv:2304.09801, 2023.
H. Wu, C. Wen, S. Shi, et al., Virtual Sparse Convolution for Multimodal 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, Canada, Jun 17-24, 2023.
J. Huang, et al., BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection; arXiv:2203.17054, 2022.
S. Mohapatra, S. Yogamani, H. Gotzig, et al., in 2021 IEEE International Intelligent Transportation Systems Conference (ITSC), pp. 2809-2815, 2021.
S. Vora, A. H. Lang, B. Helou, et al., PointPainting: Sequential Fusion for 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun 14-19, 2020.
S. Wang, H. Caesar, L. Nan, et al., arXiv:2309.14516, 2023.
T. Yin, X. Zhou, P. Krähenbühl, Multimodal Virtual Point 3D Detection; in Proceedings of the 35th Conference on Neural Information Processing Systems (NeurIPS), Dec 6-14, 2021.
T. Liang, H. Xie, K. Yu, et al., arXiv, 2022.
X. Bai, Z. Hu, X. Zhu, et al., TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, Jun 18-24, 2022.
Y. Li, A. W. Yu, T. Meng, et al., DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, Jun 18-24, 2022.
Y. Li, Y. Chen, et al., arXiv, 2022.
Y. Ma, T. Wang, X. Bai, H. Yang, arXiv:2208.02797, 2022.
Y. Zhou, O. Tuzel, VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection; in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, Jun 18-23, 2018.
Z. Huang, S. Sun, J. Zhao, et al., Information Fusion, 98: 11, 2023.
Z. Li, et al., BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers; in Proceedings of the European Conference on Computer Vision (ECCV), 2022.
Z. Chen, et al., arXiv:2207.10316. Accessed 3 Feb. 2024.
Z. Chen, et al., AutoAlign: Pixel-Instance Feature Aggregation for Multi-Modal 3D Object Detection; in Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI), July 2022, ijcai.2022/116. Accessed 3 Oct. 2023.
Z. Chen, Z. Li, S. Zhang, et al., arXiv, 2022.