
2 RELATED WORK
The fusion of camera and LiDAR data is a widely researched topic in multimodal perception, with applications in object detection and segmentation. Various techniques have been proposed over the years to solve these problems; (Cui et al., 2022) proposed the following categorization: signal-level, feature-level, result-level, and multi-level fusion. Signal-level fusion operates on raw sensor data; while it is suitable for depth completion (Cheng et al., 2019) (Lin et al., 2022) and landmark detection (Lee and Park, 2021) (Caltagirone et al., 2018), it suffers from the loss of texture information. Feature-level fusion represents LiDAR data as feature maps through voxel grids or 2D projections; for instance, VoxelNet (Zhou and Tuzel, 2017) voxelizes raw point clouds before fusing the LiDAR data with camera pixels. Result-level fusion increases accuracy by merging the prediction results of different model outputs (Jaritz et al., 2020) (Gu et al., 2018). A review of the literature shows that the recent trend is a shift towards multi-level fusion, which combines the other fusion strategies. The computational complexity resulting from 3D LiDAR data is tackled by reducing its dimensionality to a two-dimensional image so that existing image-processing methods can be exploited. Our work uses a transformer-based network that integrates camera and LiDAR data through a cross-fusion strategy in the decoder layers.
The attention mechanism introduced with the transformer architecture (Vaswani et al., 2023) has had a tremendous impact on various fields, especially natural language processing (Xiao and Zhu, 2023)
and computer vision. One notable variant is the vi-
sion transformer (ViT) (Dosovitskiy et al., 2021),
which excels in autonomous driving tasks by han-
dling global contexts and long-range dependencies.
Perceiving the surrounding area in a two-dimensional
plane primarily involves extracting information from
camera images, with notable works such as the bird's-eye-view transformer for road surface segmentation presented in (Zhu et al., 2024). Other recent approaches include
lightweight transformers for lane shape prediction
and combined semantic and instance segmentation
(Lai-Dang, 2024). Three-dimensional autonomous
driving perception is an extensively researched topic
focusing on object detection and segmentation. In DETR3D (Wang et al., 2021), the authors present a multi-camera object detection method; unlike approaches that rely on monocular images, it extracts 2D features from the images and uses 3D object queries to link those features to 3D positions via camera transformation matrices. FUTR3D (Chen et al., 2023) employs a query-
based Modality-Agnostic Feature Sampler (MAFS),
together with a transformer decoder and a set-to-set loss for 3D detection, thus avoiding late-fusion heuristics and post-processing tricks. BEVFormer
(Li et al., 2022) improves object detection and map segmentation with spatial and temporal attention layers in its spatiotemporal transformer.
Recent works emphasize the fusion of camera and
LiDAR data for enhanced perception. CLFT models,
for instance, process LiDAR point clouds as image
views to achieve 2D semantic segmentation, bridging
gaps in multi-modal semantic object segmentation.
3 METHODOLOGY
In this section, we elaborate on the detailed structure of the CLFT network in the sequential order of data processing, aiming to provide clear insight into how the sensory data flows through the network and thus to benefit the understanding and reproducibility of our work.
The CLFT network achieves camera-LiDAR fusion by first progressively assembling features from each modality and then conducting the cross-fusion at the end. Conceptually, the CLFT network processes the input camera and LiDAR data along two parallel branches; the integration of the two modalities happens at the ‘fusion’ stage in the network’s decoder block. In general, there are three steps in the entire process. The first step is pre-processing the input, which embeds the image-like data into learnable transformer tokens; the second step closely follows the ViT (Dosovitskiy et al., 2021) encoder protocol to encode the embedded tokens; the last step is post-processing, which progressively assembles and fuses the feature representations to acquire the segmentation predictions. The details of these three steps are described in the following three subsections.
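To make this data flow concrete, the following is a minimal PyTorch-style sketch of the three steps, assuming standard patch embedding, two independent transformer encoders, and a single cross-attention layer as a stand-in for the progressive fusion blocks; all module names, layer counts, and dimensions are illustrative placeholders rather than the exact CLFT implementation.

```python
# Minimal sketch of the three-step data flow described above (token
# embedding, ViT-style encoding, cross-fusion decoding). Module names,
# layer counts, and dimensions are illustrative placeholders, not the
# authors' exact CLFT implementation.
import torch
import torch.nn as nn


class PatchEmbed(nn.Module):
    """Step 1: embed image-like input into learnable transformer tokens."""
    def __init__(self, in_ch, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, C, H, W)
        tokens = self.proj(x).flatten(2)        # (B, dim, N)
        return tokens.transpose(1, 2)           # (B, N, dim)


class CLFTSketch(nn.Module):
    def __init__(self, dim=768, num_classes=3):
        super().__init__()
        # Two parallel branches: camera image and LiDAR projection image
        # (assumed here to have three channels each).
        self.cam_embed = PatchEmbed(3, dim)
        self.lidar_embed = PatchEmbed(3, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        # Step 2: independent ViT-style encoders (TransformerEncoder
        # deep-copies the layer, so the branches do not share weights).
        self.cam_encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.lidar_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Step 3: a single cross-attention layer stands in for the
        # progressive assembly and cross-fusion blocks of the decoder.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, cam, lidar):
        cam_tok = self.cam_encoder(self.cam_embed(cam))
        lidar_tok = self.lidar_encoder(self.lidar_embed(lidar))
        # Camera tokens attend to LiDAR tokens; the fused tokens are then
        # classified per token to obtain segmentation logits.
        fused, _ = self.cross_attn(cam_tok, lidar_tok, lidar_tok)
        return self.head(fused)


if __name__ == "__main__":
    model = CLFTSketch()
    cam = torch.randn(1, 3, 384, 384)
    lidar = torch.randn(1, 3, 384, 384)
    print(model(cam, lidar).shape)              # torch.Size([1, 576, 3])
```

With 384x384 inputs and a 16-pixel patch size, each branch in this sketch yields 576 tokens, which are fused token-wise before the per-token segmentation head.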
3.1 Embedding
The camera and LiDAR input data are pre-processed independently and in parallel. As mentioned in Section 1, we adopt the LiDAR processing strategy of projecting the point cloud data onto the camera plane, thus obtaining LiDAR projection images. For deep multi-modal sensor fusion, converting the different inputs into a unified modality simplifies the network structure and minimizes fusion errors.
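As an illustration of this pre-processing step, the snippet below sketches how a LiDAR point cloud can be projected onto the camera plane to obtain an image-like representation; the calibration inputs (an extrinsic matrix T_cam_lidar and an intrinsic matrix K) and the single depth channel are assumptions made for illustration, not the exact CLFT projection code.

```python
# Illustrative sketch of the point-cloud-to-camera-plane projection that
# yields an image-like LiDAR representation. The calibration inputs
# (extrinsic matrix T_cam_lidar, intrinsic matrix K) and the single depth
# channel are assumptions for illustration, not the exact CLFT code.
import numpy as np


def lidar_to_projection_image(points, T_cam_lidar, K, height, width):
    """points: (N, 3) LiDAR XYZ; T_cam_lidar: (4, 4); K: (3, 3)."""
    # Transform homogeneous LiDAR points into the camera coordinate frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam_pts = (T_cam_lidar @ pts_h.T).T[:, :3]
    cam_pts = cam_pts[cam_pts[:, 2] > 0]   # keep points in front of camera
    # Perspective projection onto the image plane.
    uv = (K @ cam_pts.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    # Fill a single-channel depth image; further channels (e.g. X and Y
    # coordinates or intensity) can be added analogously.
    img = np.zeros((height, width), dtype=np.float32)
    img[v[valid], u[valid]] = cam_pts[valid, 2]
    return img
```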
As shown in Fig. 1, there are a total of four steps
in the embedding module. The first step is resiz-
ing the camera and LiDAR matrices to r = 384 and