has a well-known success story in camera-based computer vision. In this context, the literature tackles the problem of real-time performance with single-shot detectors such as YOLO (Redmon, J. et al., 2016) and SSD (Liu, W. et al., 2016).
Methods for 3D object detection using LiDAR point clouds are mainly divided into two types: 3D voxel grids and 2D projections. A 3D voxel grid transforms the point cloud into a regularly spaced 3D grid of cells called voxels; from each voxel we can compute statistics and apply 3D convolutions to extract high-order representations from the grid (M. Engelcke et al., 2017). However, because point clouds are sparse by nature, the resulting voxel grids are also sparse and less compact, and processing them requires heavy computation. As a result, typical systems (M. Engelcke et al., 2017; B. Li et al., 2016) only run at 1-2 FPS.
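To make the voxel-grid representation concrete, the following is a minimal NumPy sketch of the general idea, not the implementation of any cited system; the voxel size and range values are illustrative assumptions.

```python
import numpy as np

def voxelize(points, voxel_size=(0.2, 0.2, 0.2),
             extent=((0.0, 70.0), (-40.0, 40.0), (-2.5, 1.5))):
    """Map an (N, 3) point cloud into a regular 3D voxel grid.

    Each cell stores a simple statistic (here, the point count). Because
    most cells are empty, dense 3D convolutions over such grids waste
    computation, which is why these systems run at only 1-2 FPS.
    """
    extent = np.asarray(extent, dtype=np.float32)      # (3, 2) min/max per axis
    voxel_size = np.asarray(voxel_size, dtype=np.float32)
    dims = np.ceil((extent[:, 1] - extent[:, 0]) / voxel_size).astype(int)

    # Keep only points inside the region of interest.
    mask = np.all((points >= extent[:, 0]) & (points < extent[:, 1]), axis=1)
    idx = ((points[mask] - extent[:, 0]) / voxel_size).astype(int)

    grid = np.zeros(dims, dtype=np.float32)
    np.add.at(grid, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)  # per-voxel point count
    return grid
```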
On the other hand, 2D projection-based techniques project the point cloud onto a plane, which is then discretized into a 2D image representation on which 2D convolutions are applied. These 2D projection-based representations are more compact, but they incur information loss during projection and discretization. Beyond computational efficiency, the BEV representation has a further advantage: it eases the object detection problem, since objects do not overlap with each other in BEV, so the network can exploit priors about the physical dimensions of objects.
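A minimal sketch of such a BEV discretization follows, assuming an occupancy channel and a max-height channel; the exact channels, resolution, and ranges used by BEV detectors vary, so these values are illustrative only.

```python
import numpy as np

def point_cloud_to_bev(points, resolution=0.1,
                       x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                       z_range=(-2.5, 1.5)):
    """Discretize an (N, 3) point cloud into a 2-channel BEV image.

    Channel 0: occupancy; channel 1: max height per cell. Collapsing the
    z axis is the information loss mentioned above; in exchange, the
    output is a compact 2D image that ordinary 2D convolutions consume,
    and each cell covers a fixed metric area of `resolution` meters.
    """
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]) &
            (points[:, 2] >= z_range[0]) & (points[:, 2] < z_range[1]))
    pts = points[mask]

    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    u = ((pts[:, 0] - x_range[0]) / resolution).astype(int)
    v = ((pts[:, 1] - y_range[0]) / resolution).astype(int)

    bev = np.zeros((2, h, w), dtype=np.float32)
    bev[0, u, v] = 1.0                                      # occupancy
    np.maximum.at(bev[1], (u, v), pts[:, 2] - z_range[0])   # max height above z_min
    return bev
```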
In our framework, we use a single-shot detection architecture to detect objects in the LiDAR BEV representation.
1.1 Related Work
Most works on 3D object detection using the LiDAR BEV representation rely on a single point cloud, e.g. PIXOR (Yang, B. et al., 2018), Complex-YOLO (M. Simon et al., 2018), and YOLO3D (Ali, W. et al., 2018). These methods do not take advantage of temporal information to produce more accurate 3D bounding boxes. Recently, Fast and Furious (Luo, W. et al., 2018), IntentNet (S. Casas et al., 2018), and the Neural Motion Planner (Wenyuan Zeng et al., 2019) have incorporated time into 3D voxels using 2D and 3D convolutions and adopted multi-task learning across tracking, motion forecasting, and motion planning. To our knowledge, YOLO4D (El Sallab et al., 2018) is the only technique that incorporates temporal information from successive point clouds.
In this paper, we exploit temporal information from successive point clouds using a spatial-temporal model to augment context information for BEV-based 3D object detection.
1.2 Contribution
In this paper, we propose a spatial-temporal context-based 3D object detector that operates on a sequence of 3D point clouds. Our approach is a single-stage 3D object detector that exploits the 2D BEV representation efficiently: it is computationally cheaper than 3D voxel grids while preserving the metric space, which allows our model to exploit priors about the size and shape of the object categories. Our detector outputs accurate oriented bounding boxes in real-world dimensions in bird's-eye view. The main contributions of this paper are:
a) Non-local context network (NLCN), a novel approach that augments the CNN backbone features for BEV object detection with a context representation computed from non-local relations between feature maps, capturing global appearance and motion information; a sketch of a non-local block in this spirit is shown after this list. This approach yields significant improvements of 4.4 mAP over a single-frame BEV object detector and 1.1 mAP over the YOLO4D BEV object detector on the Argoverse dataset (M. Chang et al., 2019).
b) Spatial-temporal context network (STCN), a novel approach that generates a context representation for the BEV object detector by applying 2D convolutions to a stack of BEV images (a super image) to capture local spatial-temporal information, and 3D convolutions to the local spatial-temporal feature maps to capture global temporal interactions (long-range temporal dynamics); see the second sketch after this list. This approach yields significant improvements of 6.9 mAP over a single-frame 3D object detector and 3.5 mAP over the YOLO4D 3D object detector on the Argoverse dataset (M. Chang et al., 2019).
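As a forward-referenced illustration of contribution a), the following is a minimal PyTorch sketch of a non-local block over a temporal feature stack, in the spirit of NLCN; the layer widths and the (B, T, C, H, W) interface are our illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class NonLocalContext(nn.Module):
    """Non-local context over backbone features from T successive BEV frames.

    Every spatial-temporal position attends to every other one, so the
    resulting context carries global appearance and motion relations.
    """
    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or max(channels // 2, 1)
        self.theta = nn.Conv3d(channels, reduced, kernel_size=1)  # query
        self.phi = nn.Conv3d(channels, reduced, kernel_size=1)    # key
        self.g = nn.Conv3d(channels, reduced, kernel_size=1)      # value
        self.out = nn.Conv3d(reduced, channels, kernel_size=1)

    def forward(self, x):                         # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4)              # (B, C, T, H, W)
        q = self.theta(x).flatten(2)              # (B, C', T*H*W)
        k = self.phi(x).flatten(2)
        v = self.g(x).flatten(2)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)   # all-pairs relations
        ctx = (v @ attn.transpose(1, 2)).view(b, -1, t, h, w)
        y = x + self.out(ctx)                     # residual context augmentation
        return y.permute(0, 2, 1, 3, 4)           # back to (B, T, C, H, W)
```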
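For contribution b), this second sketch illustrates the two-stage idea of STCN under the same caveats: 2D convolutions over a channel-stacked super image for local spatial-temporal features, followed by 3D convolutions for long-range temporal dynamics; channel counts are illustrative.

```python
import torch
import torch.nn as nn

class SpatialTemporalContext(nn.Module):
    """Two-stage spatial-temporal context over a sequence of BEV frames."""

    def __init__(self, in_channels, frames, feat=64):
        super().__init__()
        self.frames = frames
        # Stage 1: 2D convs over the super image (frames stacked as channels)
        # mix nearby pixels and frames -> local spatial-temporal features.
        self.local2d = nn.Sequential(
            nn.Conv2d(in_channels * frames, feat * frames, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat * frames, feat * frames, 3, padding=1, groups=frames),
            nn.ReLU(inplace=True),
        )
        # Stage 2: 3D convs over the temporal volume let distant frames
        # interact -> global temporal context.
        self.global3d = nn.Sequential(
            nn.Conv3d(feat, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, bev_seq):                        # (B, T, C, H, W)
        b, t, c, h, w = bev_seq.shape
        super_image = bev_seq.reshape(b, t * c, h, w)  # channel-stacked frames
        local = self.local2d(super_image)              # (B, feat*T, H, W)
        volume = local.view(b, t, -1, h, w).transpose(1, 2)  # (B, feat, T, H, W)
        context = self.global3d(volume)                # global temporal mixing
        return context.mean(dim=2)                     # (B, feat, H, W) context map

# Example: context for five 2-channel BEV frames of size 400x400.
# ctx = SpatialTemporalContext(in_channels=2, frames=5)(torch.randn(1, 5, 2, 400, 400))
```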
The rest of the paper is organized as follows: first, we discuss single-frame 3D object detection, followed by the spatial-temporal approaches used to encode context information from temporal point cloud sequences. Finally, we present our experimental results and evaluate the different techniques on the Argoverse dataset (M. Chang et al., 2019).
2 SPATIO-TEMPORAL 3D
OBJECT DETECTION
In this section, we describe our approach for spatial-temporal BEV object detection. The main motivation behind our work is to exploit not only the spatial but also the temporal information in the input LiDAR