3D Object Detection for Autonomous Driving: A Practical Survey
Álvaro Ramajo-Ballester a, Arturo de la Escalera Hueso b and José María Armingol Moreno c
Intelligent Systems Lab, University Carlos III de Madrid, Spain
a https://orcid.org/0000-0001-9425-9408
b https://orcid.org/0000-0002-2618-857X
c https://orcid.org/0000-0002-3353-9956
Keywords: 3D Object Detection, Autonomous Driving, Deep Learning.
Abstract: Autonomous driving has been one of the most promising research lines of the last decade. Although still far from the sought-after Level 5, the research community has made great advances in one of the most challenging tasks: 3D perception. The rapid progress of related fields such as deep learning is one of the reasons behind this success, as it enables and improves the processing algorithms for the input data provided by LiDAR, cameras, radars and other devices used for environment perception. With this growing body of knowledge, reviewing and structuring the state-of-the-art solutions becomes a necessity in order to correctly address future research directions. This paper provides a comprehensive survey of the progress of 3D object detection in terms of sensor data, available datasets, top-performing architectures and the most notable frameworks that serve as a baseline for current and upcoming works.
1 INTRODUCTION
The field of autonomous driving, which aspires to
provide the ability for vehicles to move safely with
little to no human intervention, has advanced quickly
in recent years. With the potential to reduce human
error and improve road safety, autonomous driving
techniques have been widely used in a variety of situ-
ations, such as self-driving vehicles in controlled sce-
narios (Marin-Plaza et al., 2021), delivery robots, etc.
Environment perception, a crucial element of this kind of system, aids autonomous vehicles in understanding their surroundings through sensory input. Perception systems typically use multi-modal data as input (RGB images from cameras, point clouds from LiDAR, etc.) to predict the geometry and semantic details of the significant elements on the road. As deep learning techniques in computer vision have advanced (Ramajo-Ballester et al., 2022), 3D object detection algorithms have evolved quickly. The proposed solutions follow very diverse methodologies, data distributions and evaluation metrics.
To overcome this lack of common ground, this work aims to gather the most notable approaches to offer a broad view of the current state of the art. To achieve that, this survey builds on top of already existing ones (Qian et al., 2022; Mao et al., 2022), updating their results with the latest and best performing models across several benchmarks.
In recent years, a new research approach has been emerging in this context: detecting these 3D objects from an infrastructure point of view. With sensors mounted several meters above the ground, the field of view increases significantly while occlusions between the elements on the road and the environment are reduced. This, of course, comes with its own challenges, such as vehicle-to-infrastructure communication, protocol standardization across different devices and other related tasks that have to be taken into account.
2 BACKGROUND
2.1 Foundations / Basic Concepts
3D object detection aims to predict the attributes of
3D objects in driving scenarios from sensory inputs.
In that context, let X denote the input data (LiDAR point clouds or RGB images, for instance) and F denote a detector parameterized by θ. A general formula for 3D object detection can be represented as follows:

B = F(X; θ)   (1)

where B = {B_1, B_2, ..., B_n} is a set of n 3D objects in a scene. In this task, representing a 3D object is critical
since it affects what data should be provided for the subsequent prediction and planning processes. Typically, a 3D object is represented as a 3D cuboid that encloses it. This enclosure can be described by its 8 corners (Chen et al., 2017), by 4 corners and heights (Ku et al., 2018) or, most commonly, by the 7 parameters of an oriented bounding box (Lang et al., 2019; Weng and Kitani, 2019), as in eq. 2:

B = {x, y, z, l, w, h, θ, c}   (2)

where (x, y, z) are the center coordinates of the cuboid; (l, w, h) are its length, width and height, respectively; θ is the yaw angle in the ground plane and c is the corresponding class (car, pedestrian, etc.).
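To make this parameterization concrete, the following minimal Python sketch converts the 7 geometric parameters into the 8-corner representation mentioned above. The function name and the convention that (x, y, z) marks the cuboid center are our own assumptions, since some datasets anchor boxes at the bottom face instead.

```python
import numpy as np

def box_to_corners(x, y, z, l, w, h, theta):
    """Return the 8 corners of a 7-parameter oriented 3D box (eq. 2).

    (x, y, z) is the box center, (l, w, h) its size and theta the yaw
    angle around the vertical (z) axis.
    """
    # Corner offsets in the box's local frame (x: length, y: width, z: height).
    dx = np.array([ 1,  1, -1, -1,  1,  1, -1, -1]) * l / 2
    dy = np.array([ 1, -1, -1,  1,  1, -1, -1,  1]) * w / 2
    dz = np.array([-1, -1, -1, -1,  1,  1,  1,  1]) * h / 2
    # Rotate around z by the yaw angle and translate to the box center.
    rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                    [np.sin(theta),  np.cos(theta), 0],
                    [0,              0,             1]])
    corners = rot @ np.vstack((dx, dy, dz))
    return (corners + np.array([[x], [y], [z]])).T  # shape (8, 3)
```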
2.2 Sensors
Raw data for 3D object detection can be provided by a
wide variety of sensors. The two sensor types that are
most frequently used are cameras and LiDAR (Light
Detection And Ranging) sensors.
Cameras are economical, accessible and capable of capturing scene details from a specific angle. They produce images X_cam ∈ R^(W×H×3), where W and H are the image width and height and each pixel has 3 RGB channels. Although cheap, cameras have inherent limitations when it comes to 3D object detection. First of all, they only capture visual data, so they cannot directly obtain the three-dimensional structure of a scene. To overcome this problem, stereo cameras use matching algorithms to align correspondences between the left and right images for depth recovery (Chang and Chen, 2018).
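For rectified stereo pairs, the depth recovery mentioned above reduces to triangulation from disparity; the short sketch below is a simplified illustration of that conversion (function and variable names are ours, not taken from any particular library).

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, eps=1e-6):
    """Recover metric depth from a rectified stereo disparity map.

    Classic triangulation: depth = f * B / d, where f is the focal
    length in pixels, B the stereo baseline in meters and d the
    per-pixel disparity in pixels (eps avoids division by zero).
    """
    return focal_px * baseline_m / (np.asarray(disparity_px) + eps)
```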
On the other hand, LiDAR sensors (Velodyne, Ouster, RoboSense) can be used to obtain the fine-grained 3D structure of a scene by emitting a large number of laser beams and measuring their reflections. A range image X_lid ∈ R^(m×n×3) can be produced by a LiDAR sensor using m beams and n readings in one scan cycle. Each pixel has 3 channels, corresponding to range, azimuth and inclination in the spherical coordinate system. Through the conversion of spherical coordinates into Cartesian coordinates, range images can be further transformed into point clouds.
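A minimal sketch of this spherical-to-Cartesian conversion follows. It assumes the three channels are ordered (range, azimuth, inclination) and that inclination is measured from the horizontal plane, conventions that vary between sensor vendors.

```python
import numpy as np

def range_image_to_points(range_img):
    """Convert a LiDAR range image into a Cartesian point cloud.

    range_img has shape (m, n, 3) with channels (range r, azimuth phi,
    inclination theta) in spherical coordinates, as described above.
    """
    r, phi, theta = range_img[..., 0], range_img[..., 1], range_img[..., 2]
    # Standard spherical-to-Cartesian conversion; theta is taken as the
    # elevation above the horizontal plane (an assumption).
    x = r * np.cos(theta) * np.cos(phi)
    y = r * np.cos(theta) * np.sin(phi)
    z = r * np.sin(theta)
    return np.stack((x, y, z), axis=-1).reshape(-1, 3)
```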
2.3 Datasets
As the data-driven era progresses, the accessibility of
large-scale datasets has been continuously fostering
the community. Some of the most notable and pub-
licly accessible datasets related to autonomous driv-
ing have been included. KITTI (Geiger et al., 2012;
Geiger et al., 2013), Waymo Open (Sun et al., 2020)
and nuScenes (Caesar et al., 2020) datasets are the
typical examples and stand out as the most popular
ones, among others. An example of the last two is
shown in fig. 1 and a more detailed comparison is pre-
sented in table 1.
Figure 1: Waymo (above) and nuScenes (below) dataset examples.
KITTI Dataset. The pioneer dataset for 3D object detection, released in 2012. It contains LiDAR and visual information for 15k frames, with more than 200k 3D annotations over 8 classes (car, van, truck, pedestrian, person, cyclist, tram, misc), although only the car, pedestrian and cyclist labels are considered for the online leaderboard evaluation. Three difficulty levels (Easy, Moderate and Hard) are introduced depending on the height of the 2D bounding boxes and the level of occlusion and truncation.
Waymo Open. Waymo annotates 12M 3D bounding boxes across more than 200k frames. It comprises 1150 sequences (or scenes) over 4 classes, out of which only 3 correspond to KITTI classes.
nuScenes. It manually labels 1.4M boxes across 40k frames. It collects 1000 sequences over 23 classes, but only 10 of them are considered for detection.
Only the training and validation labels are avail-
able, as none of them offer the testing ones. Re-
searchers are required to submit their predictions to
the online leaderboard server for assessing the test
set. It is important to note that, in contrast to KITTI, which only collects data on sunny days, both nuScenes and Waymo Open acquire their data under a variety of weather conditions (such as rain, fog and snow) and lighting conditions (daytime and nighttime).
In recent years, many datasets with infrastructure-mounted sensors have been published, such as Rope3D (Ye et al., 2022b), A9 Dataset (Creß et al., 2022), IPS300+ (Wang et al., 2022) and DAIR-V2X (Yu et al., 2022). The latter is one of the most complete, since it was released in three versions: the Vehicle Dataset (DAIR-V2X-V), with 22325 LiDAR and image frames; the Infrastructure Dataset (DAIR-V2X-I), including 10084 point cloud and image frames; and the Cooperative Dataset (DAIR-V2X-C), with 18330 data frames from infrastructure and 20515 from vehicles for Vehicle-Infrastructure Cooperative (VIC) 3D object detection.
2.4 Evaluation Metrics
The primary performance metric for 3D object detection is Average Precision (AP), which shares the same philosophy as its 2D counterpart (Everingham et al., 2010). Before discussing the subtle connections and differences between the dataset-specific AP variants adopted by the widely used benchmarks, the vanilla form of the AP metric is defined as follows:
AP = ∫_0^1 max_{r' ≥ r} P(r') dr   (3)

where P(r) is the precision-recall curve, r' each possible recall value and r the recall variable over which
the integral is calculated. When calculating precision and recall, the main distinction from the 2D AP metric is the matching criterion between predictions and ground truths. In that regard, KITTI proposes two commonly used AP metrics: AP_3D and AP_BEV. AP_3D matches a predicted object to its respective ground truth if the 3D Intersection over Union (3D IoU) of the two cuboids is above a certain threshold, whereas AP_BEV is based on the IoU of the two cuboids from the bird's-eye view (BEV IoU).
However, accurately calculating this area is not trivial, so PASCAL VOC (Everingham et al., 2010) proposed an alternative metric: the interpolated AP|_{R_N}. It is formulated as the mean precision computed over a subset R of N evenly spaced recall levels, from r_0 = 0 to r_1 = 1:
AP|_{R_N} = (1/N) Σ_{r ∈ R} P(r)   (4)

where R = [r_0, r_0 + (r_1 − r_0)/(N − 1), r_0 + 2(r_1 − r_0)/(N − 1), ..., r_1] and P(r) = max_{r' : r' ≥ r} P(r').
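As a concrete reading of eq. 4, the sketch below computes the interpolated AP over N evenly spaced recall levels from sampled precision-recall points. It is a simplified, self-contained illustration, not the official KITTI or Waymo evaluation code.

```python
import numpy as np

def interpolated_ap(recalls, precisions, n_levels=40, r0=0.0, r1=1.0):
    """Interpolated AP|R_N over n_levels evenly spaced recall levels (eq. 4).

    recalls and precisions are matched 1-D arrays sampled from the
    precision-recall curve of a detector.
    """
    recalls = np.asarray(recalls)
    precisions = np.asarray(precisions)
    levels = np.linspace(r0, r1, n_levels)
    interp = []
    for r in levels:
        mask = recalls >= r
        # Interpolated precision: the maximum precision at any recall >= r,
        # or 0 if the detector never reaches that recall level.
        interp.append(precisions[mask].max() if mask.any() else 0.0)
    return float(np.mean(interp))

# Example with a toy precision-recall curve:
# interpolated_ap([0.1, 0.4, 0.8], [0.9, 0.7, 0.5], n_levels=11)
```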
KITTI Benchmark (Geiger et al., 2012). KITTI used the interpolated AP|_{R_11} before switching to AP|_{R_40}, as suggested in (Simonelli et al., 2019), which resulted in a fairer comparison of the scores. It holds two separate leaderboards, for 3D object detection and for BEV detection. As previously mentioned, it distinguishes three levels of difficulty (easy, moderate and hard) according to the occlusion and height of the bounding boxes.
Waymo Benchmark (Sun et al., 2020). Similarly, it proposes the interpolated AP|_{R_21} and the Average Precision weighted by heading (APH). To calculate them, Waymo evaluates on 21 evenly spaced recall levels (r_0 = 0, r_1 = 1, N = 21), with IoU thresholds of 0.7 for vehicles and 0.5 for pedestrians and cyclists. For APH, the true positives are weighted by the heading accuracy:

w_h = min(|θ̂ − θ|, 2π − |θ̂ − θ|) / π   (5)

where θ̂ and θ are the predicted and ground truth azimuth, respectively. Two levels of difficulty are included in the benchmark: L1 for bounding boxes with more than five LiDAR points and L2 for those with between one and five points.
nuScenes Benchmark (Caesar et al., 2020). It uses a custom metric called the NuScenes Detection Score (NDS). In order to compute it, a set of error metrics are defined for measuring translation (ATE), scale (ASE), orientation (AOE), velocity (AVE) and attribute (AAE) errors. All of them use a 2 m center distance matching threshold. The errors (ε) are converted into scores as shown in eq. 6:

s_i = max(1 − ε_i, 0)   (6)

These scores are then weighted to calculate the NDS:

NDS = (1/10) [5·mAP + Σ_i s_i]   (7)

where mAP is calculated using BEV center distance thresholds of 0.5 m, 1 m, 2 m and 4 m (Qian et al., 2022).
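A minimal sketch of eqs. 6 and 7 is given below. It is illustrative only; the official nuScenes devkit should be used for actual evaluation.

```python
def nuscenes_detection_score(map_value, tp_errors):
    """Compute the NDS from the mAP and the five TP error metrics.

    tp_errors holds the mean ATE, ASE, AOE, AVE and AAE values; each
    error is converted into a score with eq. 6 and combined with the
    mAP as in eq. 7.
    """
    scores = [max(1.0 - err, 0.0) for err in tp_errors]   # eq. 6
    return (5.0 * map_value + sum(scores)) / 10.0         # eq. 7

# Example: nuscenes_detection_score(0.60, [0.3, 0.25, 0.4, 0.35, 0.2])
```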
3 REVIEW AND ANALYSIS
The core section of this work brings together some of
the most notable contributions for the task of 3D ob-
ject detection in autonomous driving context. It starts
with camera-based methods, which only make use of
visual information from the surroundings, either with
a single or multiple views. After that, LiDAR-based
approaches are reviewed, showing a clearly superior
VEHITS 2023 - 9th International Conference on Vehicle Technology and Intelligent Transport Systems
66
Table 1: Comparison between publicly available datasets for 3D object detection, sorted by year.
Dataset LiDAR Images 3D annot. Classes Night/Rain View
KITTI (Geiger et al., 2012) 15k 15k 200k 8 No/No Onboard
Ko-FAS (Strigel et al., 2014) 39k 19.4k - - -/- Infrastructure
KAIST (Choi et al., 2018) 8.9k 8.9k - 3 Yes/No Onboard
ApolloScape (Huang et al., 2019) 20k 144k 475k 6 -/- Onboard
H3D (Patil et al., 2019) 27k 83k 1.1M 8 No/No Onboard
Lyft L5 (Kesten et al., 2019) 46k 323k 1.3M 9 No/No Onboard
Argoverse (Chai et al., 2021) 44k 490k 993k 15 Yes/Yes Onboard
A*3D (Pham et al., 2020) 39k 39k 230k 7 Yes/Yes Onboard
A2D2 (Geyer et al., 2020) 12.5k 41.3k - 14 -/- Onboard
nuScenes (Caesar et al., 2020) 400k 1.4M 1.4M 23 Yes/Yes Onboard
Waymo Open (Sun et al., 2020) 230k 1M 12M 4 Yes/Yes Onboard
AIODrive (Weng et al., 2020) 250k 250k 26M - Yes/Yes Virtual onboard
BAAI-VANJEE (Yongqiang et al., 2021) 2.5k 5k 74k 12 Yes/Yes Infrastructure
PandaSet (Xiao et al., 2021) 8.2k 49k 1.3M 28 Yes/Yes Onboard
KITTI-360 (Liao et al., 2022) 80k 300k 68k 37 -/- Onboard
Argoverse 2 Sensor (Wilson et al., 2021) 150k 1M - 30 Yes/Yes Onboard
ONCE (Mao et al., 2021) 1M 7M 417k 5 Yes/Yes Onboard
Cirrus (Wang et al., 2021) 6.2k 6.2k - 8 -/- Onboard
Rope3D (Ye et al., 2022b) - 50k 1.5M 12 Yes/Yes Infrastructure
A9 Dataset (Creß et al., 2022) 1.7k 5.4k 215k 8 Yes/Yes Infrastructure
IPS300+ (Wang et al., 2022) 28k 57k 4.5M 7 Yes/- Infrastructure
DAIR-V2X (Yu et al., 2022) 71k 71k 1.2M 10 -/- Onboard / Infrast.
These show a clearly superior performance compared to their image-only counterparts. Fusing both sensors, multi-modal techniques offer the best results. These metrics are finally compared in section 3.4.
3.1 Image-Based Methods
A simple method for monocular 3D object detection, inspired by 2D detection techniques, is to directly regress the 3D box parameters from images using a convolutional neural network. These regression techniques can be trained end-to-end and normally incorporate designs from 2D detection network architectures. These methods range from anchor-based approaches, where pre-defined 3D bounding boxes are predicted and then filtered and refined (Luo et al., 2021; Kumar et al., 2021), to anchor-free models, which predict the 3D object attributes by passing the images through multiple separate heads (Reading et al., 2021; Zhou et al., 2021), stereo-based techniques, where depth is estimated from pixel disparity across images (Chang and Chen, 2018; Hartley and Zisserman, 2003), or temporal and multi-view approaches, where temporal information is leveraged to improve 3D object detection (Liu et al., 2022a; Liu et al., 2022b; Rukhovich et al., 2022).
3.2 LiDAR-Based Methods
This section covers the models that use point clouds or range images for detection. They differ mainly in the data representation: point, voxel or pillar.
3.2.1 Point-Based Methods
Point cloud sampling and feature learning are the two fundamental elements of a point-based 3D object detector. These methods typically inherit the success of deep learning techniques on point clouds (Mao et al., 2019; Chen et al., 2022b; Chen et al., 2022a) and propose a variety of architectures to detect 3D objects directly from raw points. Point clouds are first forwarded through a point-based backbone network, where they are gradually downsampled and features are learned by point cloud operators. Then, 3D bounding boxes are predicted from the points and features of the downsampled point cloud. For this kind of representation, (Zhang et al., 2022) proposes an uncertainty estimator that is taken into account in the optimization process.
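The gradual sampling step mentioned above is typically implemented with farthest point sampling; the following NumPy sketch is our own simplified version, assuming Euclidean distance and a fixed seed point.

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Greedy farthest point sampling, a downsampling step commonly
    used in point-based backbones (illustrative sketch).

    points: (N, 3) array of xyz coordinates; returns the indices of
    the n_samples selected points.
    """
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    dists = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary seed point
    for i in range(1, n_samples):
        # Update each point's squared distance to the closest selected point.
        diff = points - points[selected[i - 1]]
        dists = np.minimum(dists, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(dists))
    return selected
```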
3.2.2 Voxel-Based Methods
Voxels are 3D cubes that contain the points falling inside their cells. Most voxel cells in 3D space are empty due to the sparse distribution of the point cloud, so practical applications only store and use the non-empty voxels for feature extraction
Table 2: Comparison of the state-of-the-art 3D detection mAP on KITTI test set, sorted by vehicle moderate difficulty.
Vehicle Pedestrian Cyclist
Model Easy Mod Hard Easy Mod Hard Easy Mod Hard
SFD (Wu et al., 2022b) 91.73 84.76 77.92 - - - - - -
CasA++ (Wu et al., 2022a) 90.68 84.04 79.69 56.33 49.29 46.70 87.76 73.79 66.94
GraR-VoI (Yang et al., 2022a) 91.89 83.27 77.78 - - - - - -
GLENet-VR (Zhang et al., 2022) 91.67 83.23 78.43 - - - - - -
VPFNet (Zhu et al., 2022) 91.02 83.21 78.20 - - - - - -
BtcDet (Xu et al., 2022) 90.64 82.86 78.09 - - - 82.81 68.68 61.81
DVF-V (Mahmoud et al., 2022) 89.40 82.45 77.56 - - - - - -
RDIoU (Sheng et al., 2022) 90.65 82.30 77.26 - - - - - -
Focals Conv-F (Chen et al., 2022c) 90.55 82.28 77.59 - - - - - -
SASA (Chen et al., 2022a) 88.76 82.16 77.16 - - - - - -
VoxSeT (He et al., 2022) 88.53 82.06 77.46 - - - - - -
Table 3: Comparison of the state-of-the-art 3D detection methods on Waymo test set, sorted by vehicle L2 mAPH.
Vehicle Pedestrian Cyclist
L1 L2 L1 L2 L1 L2
Model mAP mAPH mAP mAPH mAP mAPH mAP mAPH mAP mAPH mAP mAPH
MPPNetEns-MMLab (Chen et al., 2022b) 87.77 87.37 81.33 80.93 87.92 85.15 82.86 80.14 80.74 79.90 78.54 77.73
BEVFusion-TTA (Liu et al., 2022c) 87.96 87.58 81.29 80.92 87.64 85.04 82.19 79.65 82.53 81.67 80.17 79.33
LidarMultiNet-TTA (Ye et al., 2022a) 87.64 87.26 80.73 80.39 87.75 85.07 82.48 79.86 82.77 81.84 80.50 79.59
DeepFusion-Ens (Li et al., 2022b) 86.45 86.09 79.43 79.09 86.14 83.77 80.88 78.57 80.53 79.80 78.29 77.58
AFDetV2-Ens (Hu et al., 2022) 85.80 85.41 78.71 78.34 85.22 82.16 79.71 76.75 81.20 80.30 78.70 77.83
CenterFormer (Zhou et al., 2022) 85.36 84.94 78.68 78.28 85.22 82.48 80.09 77.42 76.21 75.32 74.04 73.17
VISTA (Deng et al., 2022) 81.70 81.30 74.40 74.00 81.40 78.30 75.50 72.50 74.90 73.90 72.50 71.60
Graph-CE (Yang et al., 2022a) 80.77 80.28 72.55 72.10 82.35 76.64 74.44 69.02 75.28 74.21 72.52 71.49
BtcDet (Xu et al., 2022) 78.58 78.06 70.10 69.61 - - - - - - - -
RDIoU (Sheng et al., 2022) 78.40 78.00 69.50 69.10 - - - - - - - -
GLENet-VR (Zhang et al., 2022) 77.32 76.85 69.68 68.97 - - - - - - - -
TransFusion (Bai et al., 2022) - - - 65.10 - - - 64.00 - - - 67.40
DVF-V (PV-RCNN) (Mahmoud et al., 2022) 67.62 67.09 62.66 62.17 - - - - - - - -
Table 4: Comparison of the state-of-the-art 3D detection methods on nuScenes test set. CV stands for construction vehicle
and TC, traffic cone.
Model mAP NDS Car Truck Bus Trailer CV Ped Motor Bicycle TC Barrier
DeepInteraction-e (Yang et al., 2022b) 75.6 76.3 88.3 64.3 74.2 66.0 44.7 92.5 85.4 66.4 90.9 83.5
BEVFusion-e (Liu et al., 2022c) 75.0 76.1 90.5 65.8 74.2 67.4 42.6 91.8 84.4 62.9 89.4 81.1
Focals Conv-F (Chen et al., 2022c) 70.1 73.6 87.5 60.0 69.9 64.0 32.6 89.0 81.1 59.2 85.5 71.8
BEVFusion (Liang et al., 2022) 71.3 73.3 88.5 63.1 72.0 64.7 38.1 90.0 75.2 56.5 86.5 78.3
MSMDFusion-T (Jiao et al., 2022) 70.8 73.2 87.9 61.6 70.0 64.4 38.1 89.7 73.9 56.6 87.1 79.0
MDRNet-L (Huang et al., 2022) 68.4 72.8 87.9 58.5 67.3 64.1 30.2 89.0 77.0 50.7 85.0 74.7
AutoAlignV2 (Chen et al., 2022d) 68.4 72.4 87.0 59.0 69.3 59.3 33.1 87.6 72.9 52.1 85.1 78.0
TransFusion (Bai et al., 2022) 68.9 71.7 87.1 60.0 68.3 60.8 33.1 88.4 73.6 52.9 86.7 78.1
LidarMultiNet (Ye et al., 2022a) 67.0 71.6 86.9 57.4 64.7 61.0 31.5 87.2 75.3 47.6 85.1 73.5
UVTR-Multi. (Li et al., 2022a) 67.1 71.1 87.5 56.0 67.5 59.5 33.8 86.3 73.4 54.8 79.6 73.0
VISTA (Deng et al., 2022) 63.7 70.4 84.7 54.2 64.0 55.0 29.1 83.6 71.0 45.2 78.6 71.8
AFDetV2 (Hu et al., 2022) 62.4 68.5 86.3 54.2 62.5 58.9 26.7 85.8 63.8 34.3 80.1 71.0
SASA (Chen et al., 2022a) 45.0 61.0 76.8 45.0 66.2 36.5 16.1 69.1 39.6 16.9 29.9 53.6
(Rukhovich et al., 2022; Wu et al., 2022a; Xu et al.,
2022; Zhou et al., 2022; Huang et al., 2022). Some of
the most recent contributions use an anchor-free ap-
proach (Hu et al., 2022), a transformer-based back-
bone (He et al., 2022), global context pooling (Ye
et al., 2022a) or offer a plug-in module for rotation-
decoupled IoU optimization (Sheng et al., 2022).
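A minimal sketch of the sparse voxelization described above is shown below; the voxel size and the dictionary-based sparse storage are illustrative choices, not tied to any particular detector.

```python
import numpy as np
from collections import defaultdict

def voxelize(points, voxel_size=(0.1, 0.1, 0.2)):
    """Group points into non-empty voxels (sparse representation).

    points: (N, 3) xyz array. Returns a dict mapping integer voxel
    coordinates to the list of point indices they contain; empty
    voxels are simply never created, mirroring the sparse storage
    described above.
    """
    coords = np.floor(points / np.asarray(voxel_size)).astype(np.int64)
    voxels = defaultdict(list)
    for idx, key in enumerate(map(tuple, coords)):
        voxels[key].append(idx)
    return voxels
```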
3.2.3 Pillar-Based Methods
Pillars are a special kind of voxel whose size is unbounded in the vertical direction. Through the use of a PointNet, pillar features can be aggregated from the points and then scattered back to create a 2D BEV image for feature extraction. The pillar representation was first introduced in PointPillars (Lang et al., 2019).
Table 5: Development frameworks used in all the reviewed works. PC stands for point cloud; I, for monocular image and ST,
for stereo image.
Model Year Mode Code
SFD 2022 PC+I OpenPCDet
CasA++ 2022 PC OpenPCDet
GraR-VoI 2022 PC+I MMDetection3D + OpenPCDet
GLENet-VR 2022 PC -
VPFNet 2022 PC+ST OpenPCDet
BtcDet 2022 PC OpenPCDet
DVF-V 2022 PC+I -
RDIoU 2022 PC OpenPCDet
Focals Conv-F 2022 PC+I OpenPCDet + CenterPoint
SASA 2022 PC OpenPCDet
VoxSeT 2022 PC OpenPCDet
DeepInteraction-e 2022 PC+I MMDetection3D
BEVFusion 2022 PC+I MMDetection3D
BEVFusion (2) 2022 PC+I MMDetection3D + CenterPoint
MSMDFusion-T 2022 PC+I MMDetection3D + TransFusion
MDRNet-L 2022 PC -
AutoAlignV2 2022 PC+I MMDetection3D
TransFusion 2022 PC+I MMDetection3D + CenterPoint
LidarMultiNet 2022 PC -
TransFusion-L 2022 PC -
UVTR-Multimodality 2022 PC+I MMDetection3D + Det3D
VISTA 2022 PC Det3D + OpenPCDet + CenterPoint
AFDetV2-Ens 2022 PC -
MPPNet 2022 PC OpenPCDet
DeepFusion-Ens 2022 PC+I Lingvo
CenterFormer 2022 PC CenterPoint
It was then developed further in (Fan et al., 2022), as shown in (Mao et al., 2022).
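To illustrate the scatter step described above, the following PyTorch sketch places per-pillar feature vectors back onto a dense BEV canvas. The tensor shapes follow the usual PointPillars convention, but the function and variable names are our own.

```python
import torch

def scatter_to_bev(pillar_features, pillar_coords, grid_h, grid_w):
    """Scatter per-pillar feature vectors onto a dense 2D BEV canvas.

    pillar_features: (P, C) tensor of features, one per non-empty pillar.
    pillar_coords:   (P, 2) integer tensor with (row, col) BEV indices.
    Returns a (C, grid_h, grid_w) pseudo-image for a 2D CNN backbone.
    """
    num_pillars, channels = pillar_features.shape
    canvas = pillar_features.new_zeros(channels, grid_h * grid_w)
    flat_idx = pillar_coords[:, 0] * grid_w + pillar_coords[:, 1]
    canvas[:, flat_idx] = pillar_features.t()
    return canvas.view(channels, grid_h, grid_w)
```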
3.3 Multi-Modal Methods
For 3D object detection, cameras and LiDAR are two complementary sensor types. Compared to a LiDAR sensor, which specializes in 3D localization and provides rich information about 3D structure, a camera provides color information from which rich semantic features can be extracted. That is why the fusion of data from these two sensors is the most common approach in the state of the art and why many attempts have been made to accurately detect 3D objects using a combination of camera and LiDAR data. These approaches are primarily built on LiDAR-based 3D object detectors and try to incorporate image information into various stages of the detection pipeline, given that LiDAR-based detection methods significantly outperform camera-based ones. Combining the two modalities inevitably results in additional computational overhead and inference latency due to the complexity of both detection systems (Mao et al., 2022).
This strategy has been explored extensively (Wu et al., 2022b; Bai et al., 2022). Some variations include intermediate-stage fusion (Yang et al., 2022b; Chen et al., 2022d; Li et al., 2022a; Li et al., 2022b), multi-stage fusion (Yang et al., 2022a; Jiao et al., 2022), BEV representation fusion (Liu et al., 2022c; Liang et al., 2022), sequential fusion (Mahmoud et al., 2022), LiDAR and stereo image fusion (Zhu et al., 2022) or focal convolutions that replace regular sparse convolutions (Chen et al., 2022c).
3.4 Performance Comparison
Following the guideline of this survey and adhering to the principle of building on previous works, we compare the most recent and best performing solutions across the three main benchmarks: KITTI (Geiger et al., 2012; Geiger et al., 2013), as shown in table 2; Waymo (Sun et al., 2020), in table 3; and finally nuScenes (Caesar et al., 2020), displayed in table 4.
In addition, a comparison of the different methods and the frameworks that underpin the reviewed works is shown in table 5. MMDetection3D (OpenMMLab, 2020a) and OpenPCDet (OpenMMLab, 2020b) remain the most popular baselines for LiDAR- and image-based 3D object detection projects. Both belong to OpenMMLab and, although developed by different teams, each accumulates more than 3k stars and almost a thousand forks on GitHub. MMDetection3D is useful for both LiDAR and image detection and even provides well-documented tutorials on how to use it with custom datasets. On the other hand, OpenPCDet focuses on LiDAR-based detection, with custom CUDA code for faster processing.
Apart from those, CenterPoint (Yin et al., 2021) is another widely used framework, which builds on the former ones and on Det3D (Zhu et al., 2019) and serves as the basis for TransFusion (Bai et al., 2022). Whereas all of them are based on PyTorch (Paszke et al., 2019), Lingvo (Li et al., 2022b) is written natively in TensorFlow (Abadi et al., 2015).
4 CONCLUSIONS
This paper covers a wide variety of methods that tackle the 3D object detection task and performs an extensive review of the current technological vanguard. Starting from basic concepts and going through the details of the available datasets, the evaluation metrics and the latest architectures for image, point cloud and sensor fusion processing, it provides a clear and broad view of this research field. As can be extracted from the results, fusion and LiDAR-based techniques offer a more robust solution, clearly surpassing the performance of their monocular image counterparts. Nonetheless, sensor cost is an important factor that could make the latter a viable alternative in certain environments.
ACKNOWLEDGEMENTS
Grants PID2019-104793RB-C31 and PDC2021-121517-C31 funded by MCIN/AEI/10.13039/501100011033 and by the European Union "NextGenerationEU/PRTR" and the Comunidad de Madrid through SEGVAUTO-4.0-CM (P2018/EMT-4362).
REFERENCES
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin,
M., Ghemawat, S., Goodfellow, I., Harp, A., Irving,
G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kud-
lur, M., Levenberg, J., Mané, D., Monga, R., Moore,
S., Murray, D., Olah, C., Schuster, M., Shlens, J.,
Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Van-
houcke, V., Vasudevan, V., Viégas, F., Vinyals, O.,
Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and
Zheng, X. (2015). TensorFlow: Large-scale machine
learning on heterogeneous systems. Software avail-
able from tensorflow.org.
Bai, X., Hu, Z., Zhu, X., Huang, Q., Chen, Y., Fu, H., and
Tai, C.-L. (2022). Transfusion: Robust lidar-camera
fusion for 3d object detection with transformers. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 1090–
1099.
Caesar, H., Bankiti, V., Lang, A. H., Vora, S., Liong, V. E.,
Xu, Q., Krishnan, A., Pan, Y., Baldan, G., and Bei-
jbom, O. (2020). nuscenes: A multimodal dataset for
autonomous driving. In Proceedings of the IEEE/CVF
conference on computer vision and pattern recogni-
tion, pages 11621–11631.
Chai, Y., Sun, P., Ngiam, J., Wang, W., Caine, B., Vasude-
van, V., Zhang, X., and Anguelov, D. (2021). To the
point: Efficient 3d object detection in the range image
with graph convolution kernels. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 16000–16009.
Chang, J.-R. and Chen, Y.-S. (2018). Pyramid stereo match-
ing network. In Proceedings of the IEEE conference
on computer vision and pattern recognition, pages
5410–5418.
Chen, C., Chen, Z., Zhang, J., and Tao, D. (2022a). Sasa:
Semantics-augmented set abstraction for point-based
3d object detection. In AAAI Conference on Artificial
Intelligence, volume 1.
Chen, X., Ma, H., Wan, J., Li, B., and Xia, T. (2017). Multi-
view 3d object detection network for autonomous
driving. In Proceedings of the IEEE conference
on Computer Vision and Pattern Recognition, pages
1907–1915.
Chen, X., Shi, S., Zhu, B., Cheung, K. C., Xu, H., and Li,
H. (2022b). Mppnet: Multi-frame feature intertwin-
ing with proxy points for 3d temporal object detection.
arXiv preprint arXiv:2205.05979.
Chen, Y., Li, Y., Zhang, X., Sun, J., and Jia, J. (2022c).
Focal sparse convolutional networks for 3d object de-
tection. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
5428–5437.
Chen, Z., Li, Z., Zhang, S., Fang, L., Jiang, Q., and Zhao,
F. (2022d). Autoalignv2: Deformable feature aggre-
gation for dynamic multi-modal 3d object detection.
arXiv preprint arXiv:2207.10316.
Choi, Y., Kim, N., Hwang, S., Park, K., Yoon, J. S., An,
K., and Kweon, I. S. (2018). Kaist multi-spectral
day/night data set for autonomous and assisted driv-
ing. IEEE Transactions on Intelligent Transportation
Systems, 19(3):934–948.
Creß, C., Zimmer, W., Strand, L., Fortkord, M., Dai, S.,
Lakshminarasimhan, V., and Knoll, A. (2022). A9-
dataset: Multi-sensor infrastructure-based dataset for
mobility research. In 2022 IEEE Intelligent Vehicles
Symposium (IV), pages 965–970. IEEE.
Deng, S., Liang, Z., Sun, L., and Jia, K. (2022). Vista:
Boosting 3d object detection via dual cross-view spa-
tial attention. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition,
pages 8448–8457.
Everingham, M., Van Gool, L., Williams, C. K., Winn, J.,
and Zisserman, A. (2010). The pascal visual object
classes (voc) challenge. International journal of com-
puter vision, 88(2):303–338.
Fan, L., Pang, Z., Zhang, T., Wang, Y.-X., Zhao, H., Wang,
F., Wang, N., and Zhang, Z. (2022). Embracing single
stride 3d object detector with sparse transformer. In
Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition, pages 8458–
8468.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013).
Vision meets robotics: The kitti dataset. The Inter-
national Journal of Robotics Research, 32(11):1231–
1237.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In 2012 IEEE conference on computer vision
and pattern recognition, pages 3354–3361. IEEE.
Geyer, J., Kassahun, Y., Mahmudi, M., Ricou, X., Durgesh,
R., Chung, A. S., Hauswald, L., Pham, V. H.,
Mühlegg, M., Dorn, S., et al. (2020). A2d2:
Audi autonomous driving dataset. arXiv preprint
arXiv:2004.06320.
Hartley, R. and Zisserman, A. (2003). Multiple view geom-
etry in computer vision. Cambridge university press.
He, C., Li, R., Li, S., and Zhang, L. (2022). Voxel set trans-
former: A set-to-set approach to 3d object detection
from point clouds. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 8417–8427.
Hu, Y., Ding, Z., Ge, R., Shao, W., Huang, L., Li, K., and
Liu, Q. (2022). Afdetv2: Rethinking the necessity
of the second stage for object detection from point
clouds. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 1, pages 969–979.
Huang, D., Chen, Y., Ding, Y., Liao, J., Liu, J., Wu, K., Nie,
Q., Liu, Y., and Wang, C. (2022). Rethinking dimen-
sionality reduction in grid-based 3d object detection.
arXiv preprint arXiv:2209.09464.
Huang, X., Wang, P., Cheng, X., Zhou, D., Geng, Q., and
Yang, R. (2019). The apolloscape open dataset for
autonomous driving and its application. IEEE trans-
actions on pattern analysis and machine intelligence,
42(10):2702–2719.
Jiao, Y., Jie, Z., Chen, S., Chen, J., Wei, X., Ma, L., and
Jiang, Y.-G. (2022). Msmdfusion: Fusing lidar and
camera at multiple scales with multi-depth seeds for
3d object detection. arXiv preprint arXiv:2209.03102.
Kesten, R., Usman, M., Houston, J., Pandya, T., Nad-
hamuni, K., Ferreira, A., Yuan, M., Low, B., Jain,
A., Ondruska, P., Omari, S., Shah, S., Kulkarni, A.,
Kazakova, A., Tao, C., Platinsky, L., Jiang, W., and
Shet, V. (2019). Level 5 perception dataset 2020.
https://level-5.global/level5/data/.
Ku, J., Mozifian, M., Lee, J., Harakeh, A., and Waslander,
S. L. (2018). Joint 3d proposal generation and object
detection from view aggregation. In 2018 IEEE/RSJ
International Conference on Intelligent Robots and
Systems (IROS), pages 1–8. IEEE.
Kumar, A., Brazil, G., and Liu, X. (2021). Groomed-
nms: Grouped mathematically differentiable nms for
monocular 3d object detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 8973–8983.
Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., and
Beijbom, O. (2019). Pointpillars: Fast encoders for
object detection from point clouds. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, pages 12697–12705.
Li, Y., Chen, Y., Qi, X., Li, Z., Sun, J., and Jia, J.
(2022a). Unifying voxel-based representation with
transformer for 3d object detection. arXiv preprint
arXiv:2206.00630.
Li, Y., Yu, A. W., Meng, T., Caine, B., Ngiam, J., Peng,
D., Shen, J., Lu, Y., Zhou, D., Le, Q. V., et al.
(2022b). Deepfusion: Lidar-camera deep fusion for
multi-modal 3d object detection. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 17182–17191.
Liang, T., Xie, H., Yu, K., Xia, Z., Lin, Z., Wang, Y., Tang,
T., Wang, B., and Tang, Z. (2022). Bevfusion: A sim-
ple and robust lidar-camera fusion framework. arXiv
preprint arXiv:2205.13790.
Liao, Y., Xie, J., and Geiger, A. (2022). Kitti-360: A novel
dataset and benchmarks for urban scene understand-
ing in 2d and 3d. IEEE Transactions on Pattern Anal-
ysis and Machine Intelligence.
Liu, Y., Wang, T., Zhang, X., and Sun, J. (2022a). Petr:
Position embedding transformation for multi-view 3d
object detection. arXiv preprint arXiv:2203.05625.
Liu, Y., Yan, J., Jia, F., Li, S., Gao, Q., Wang, T., Zhang,
X., and Sun, J. (2022b). Petrv2: A unified framework
for 3d perception from multi-camera images. arXiv
preprint arXiv:2206.01256.
Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D.,
and Han, S. (2022c). Bevfusion: Multi-task multi-
sensor fusion with unified bird’s-eye view representa-
tion. arXiv preprint arXiv:2205.13542.
Luo, S., Dai, H., Shao, L., and Ding, Y. (2021). M3dssd:
Monocular 3d single stage object detector. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 6145–6154.
Mahmoud, A., Hu, J. S., and Waslander, S. L. (2022). Dense
voxel fusion for 3d object detection. arXiv preprint
arXiv:2203.00871.
Mao, J., Niu, M., Jiang, C., Liang, H., Chen, J., Liang, X.,
Li, Y., Ye, C., Zhang, W., Li, Z., et al. (2021). One
million scenes for autonomous driving: Once dataset.
arXiv preprint arXiv:2106.11037.
Mao, J., Shi, S., Wang, X., and Li, H. (2022). 3d object
detection for autonomous driving: A review and new
outlooks. arXiv preprint arXiv:2206.09474.
Mao, J., Wang, X., and Li, H. (2019). Interpolated con-
volutional networks for 3d point cloud understanding.
In Proceedings of the IEEE/CVF international confer-
ence on computer vision, pages 1578–1587.
Marin-Plaza, P., Yagüe, D., Royo, F., de Miguel, M. Á.,
Moreno, F. M., Ruiz-de-la Cuadra, A., Viadero-
Monasterio, F., Garcia, J., San Roman, J. L., and
Armingol, J. M. (2021). Project ares: Driverless trans-
portation system. challenges and approaches in an un-
structured road. Electronics, 10(15):1753.
OpenMMLab (2020a). MMDetection3D: OpenMMLab
next-generation platform for general 3D object detec-
tion. https://github.com/open-mmlab/mmdetection3d.
OpenMMLab (2020b). Openpcdet: An open-source tool-
box for 3d object detection from point clouds. https:
//github.com/open-mmlab/OpenPCDet.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
Chanan, G., Killeen, T., Lin, Z., Gimelshein, N.,
Antiga, L., et al. (2019). Pytorch: An imperative style,
high-performance deep learning library. Advances in
neural information processing systems, 32.
Patil, A., Malla, S., Gang, H., and Chen, Y.-T. (2019). The
h3d dataset for full-surround 3d multi-object detection
and tracking in crowded urban scenes. In 2019 In-
ternational Conference on Robotics and Automation
(ICRA), pages 9552–9557. IEEE.
Pham, Q.-H., Sevestre, P., Pahwa, R. S., Zhan, H., Pang,
C. H., Chen, Y., Mustafa, A., Chandrasekhar, V., and
Lin, J. (2020). A* 3d dataset: Towards autonomous
driving in challenging environments. In 2020 IEEE
International Conference on Robotics and Automation
(ICRA), pages 2267–2273. IEEE.
Qian, R., Lai, X., and Li, X. (2022). 3d object detection for
autonomous driving: a survey. Pattern Recognition,
page 108796.
Ramajo-Ballester, Á., González Cepeda, J., and Armin-
gol Moreno, J. M. (2022). Vehicle re-identification
in road environments using deep learning techniques.
In ITS European Congress 2022, pages 363–374.
Reading, C., Harakeh, A., Chae, J., and Waslander, S. L.
(2021). Categorical depth distribution network for
monocular 3d object detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 8555–8564.
Rukhovich, D., Vorontsova, A., and Konushin, A. (2022).
Imvoxelnet: Image to voxels projection for monocu-
lar and multi-view general-purpose 3d object detec-
tion. In Proceedings of the IEEE/CVF Winter Con-
ference on Applications of Computer Vision, pages
2397–2406.
Sheng, H., Cai, S., Zhao, N., Deng, B., Huang, J., Hua,
X.-S., Zhao, M.-J., and Lee, G. H. (2022). Rethink-
ing iou-based optimization for single-stage 3d object
detection. arXiv preprint arXiv:2207.09332.
Simonelli, A., Bulo, S. R., Porzi, L., López-Antequera, M.,
and Kontschieder, P. (2019). Disentangling monocular
3d object detection. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
1991–1999.
Strigel, E., Meissner, D., Seeliger, F., Wilking, B., and Di-
etmayer, K. (2014). The ko-per intersection laser-
scanner and video dataset. In 17th International
IEEE Conference on Intelligent Transportation Sys-
tems (ITSC), pages 1900–1901. IEEE.
Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Pat-
naik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine,
B., et al. (2020). Scalability in perception for au-
tonomous driving: Waymo open dataset. In Proceed-
ings of the IEEE/CVF conference on computer vision
and pattern recognition, pages 2446–2454.
Wang, H., Zhang, X., Li, Z., Li, J., Wang, K., Lei, Z., and
Haibing, R. (2022). Ips300+: a challenging multi-
modal data sets for intersection perception system. In
2022 International Conference on Robotics and Au-
tomation (ICRA), pages 2539–2545. IEEE.
Wang, Z., Ding, S., Li, Y., Fenn, J., Roychowdhury, S.,
Wallin, A., Martin, L., Ryvola, S., Sapiro, G., and
Qiu, Q. (2021). Cirrus: A long-range bi-pattern lidar
dataset. In 2021 IEEE International Conference on
Robotics and Automation (ICRA), pages 5744–5750.
IEEE.
Weng, X. and Kitani, K. (2019). Monocular 3d object detec-
tion with pseudo-lidar point cloud. In Proceedings of
the IEEE/CVF International Conference on Computer
Vision Workshops, pages 0–0.
Weng, X., Man, Y., Cheng, D., Park, J., O’Toole, M., Ki-
tani, K., Wang, J., and Held, D. (2020). All-in-one
drive: A large-scale comprehensive perception dataset
with high-density long-range point clouds. arXiv.
Wilson, B., Qi, W., Agarwal, T., Lambert, J., Singh, J.,
Khandelwal, S., Pan, B., Kumar, R., Hartnett, A.,
Pontes, J. K., et al. (2021). Argoverse 2: Next genera-
tion datasets for self-driving perception and forecast-
ing. In Thirty-fifth Conference on Neural Information
Processing Systems Datasets and Benchmarks Track
(Round 2).
Wu, H., Deng, J., Wen, C., Li, X., Wang, C., and Li, J.
(2022a). Casa: A cascade attention network for 3-d
object detection from lidar point clouds. IEEE Trans-
actions on Geoscience and Remote Sensing, 60:1–11.
Wu, X., Peng, L., Yang, H., Xie, L., Huang, C., Deng, C.,
Liu, H., and Cai, D. (2022b). Sparse fuse dense: To-
wards high quality 3d detection with depth comple-
tion. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
5418–5427.
Xiao, P., Shao, Z., Hao, S., Zhang, Z., Chai, X., Jiao, J.,
Li, Z., Wu, J., Sun, K., Jiang, K., et al. (2021). Pan-
daset: Advanced sensor suite dataset for autonomous
driving. In 2021 IEEE International Intelligent Trans-
portation Systems Conference (ITSC), pages 3095–
3101. IEEE.
Xu, Q., Zhong, Y., and Neumann, U. (2022). Behind the
curtain: Learning occluded shapes for 3d object de-
tection. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 3, pages 2893–2901.
Yang, H., Liu, Z., Wu, X., Wang, W., Qian, W., He, X., and
Cai, D. (2022a). Graph r-cnn: Towards accurate 3d
object detection with semantic-decorated local graph.
arXiv preprint arXiv:2208.03624.
Yang, Z., Chen, J., Miao, Z., Li, W., Zhu, X., and
Zhang, L. (2022b). Deepinteraction: 3d object
detection via modality interaction. arXiv preprint
arXiv:2208.11112.
Ye, D., Zhou, Z., Chen, W., Xie, Y., Wang, Y., Wang, P.,
and Foroosh, H. (2022a). Lidarmultinet: Towards a
unified multi-task network for lidar perception. arXiv
preprint arXiv:2209.09385.
Ye, X., Shu, M., Li, H., Shi, Y., Li, Y., Wang, G., Tan, X.,
and Ding, E. (2022b). Rope3d: The roadside percep-
tion dataset for autonomous driving and monocular 3d
object detection task. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recogni-
tion, pages 21341–21350.
Yin, T., Zhou, X., and Krähenbühl, P. (2021). Center-based
3d object detection and tracking. CVPR.
Yongqiang, D., Dengjiang, W., Gang, C., Bing, M., Xijia,
G., Yajun, W., Jianchao, L., Yanming, F., and Juan-
juan, L. (2021). Baai-vanjee roadside dataset: To-
wards the connected automated vehicle highway tech-
nologies in challenging environments of china. arXiv
preprint arXiv:2105.14370.
Yu, H., Luo, Y., Shu, M., Huo, Y., Yang, Z., Shi, Y., Guo,
Z., Li, H., Hu, X., Yuan, J., et al. (2022). Dair-v2x:
A large-scale dataset for vehicle-infrastructure coop-
erative 3d object detection. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 21361–21370.
Zhang, Y., Zhang, Q., Zhu, Z., Hou, J., and Yuan, Y.
(2022). Glenet: Boosting 3d object detectors with
generative label uncertainty estimation. arXiv preprint
arXiv:2207.02466.
Zhou, Y., He, Y., Zhu, H., Wang, C., Li, H., and Jiang, Q.
(2021). Monocular 3d object detection: An extrin-
sic parameter free approach. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 7556–7566.
Zhou, Z., Zhao, X., Wang, Y., Wang, P., and Foroosh, H.
(2022). Centerformer: Center-based transformer for
3d object detection. In European Conference on Com-
puter Vision, pages 496–513. Springer.
Zhu, B., Jiang, Z., Zhou, X., Li, Z., and Yu, G. (2019).
Class-balanced grouping and sampling for point cloud
3d object detection. arXiv preprint arXiv:1908.09492.
Zhu, H., Deng, J., Zhang, Y., Ji, J., Mao, Q., Li, H., and
Zhang, Y. (2022). Vpfnet: Improving 3d object de-
tection with virtual point based lidar and stereo data
fusion. IEEE Transactions on Multimedia.