sharing a single object-detection module would enhance efficiency. With this in mind, a tracking-by-detection approach was adopted for this application. It was also decided that the most suitable next step would be to build the baseline system using a single (unimodal) detector that takes a 3D point cloud sensor as its main input. Figure 1 provides an overview of the pipeline used to develop the baseline system, highlighting the data flow through its components.
The components of the baseline system were selected from preexisting methods suited to this use-case scenario. Their suitability was evaluated in terms of accuracy, efficiency, and simplicity. The accuracy of each method has a direct impact on the overall system: in this application scenario, accuracy is tied to safety and reliable navigation, and poor accuracy can lead to serious injury to people and damage to equipment, eroding trust in and adoption of the system. For real-time systems such as this one, efficiency and speed are crucial for fast decision-making, an essential trait when navigating dynamic environments.
3.1 Object Detection Module
The purpose of the object detector is to classify and localize the objects of interest in 3D space. Therefore, the search was concentrated on 3D object detectors that take point clouds as input. After some consideration, PointPillars (Lang et al., 2019) was chosen. On the KITTI benchmarks, PointPillars ranked among the most accurate methods at the time of its publication, and although it has since been surpassed, its accuracy still holds up fairly well. Its key strength is efficiency: by avoiding computationally expensive 3D convolutions, it remains one of the fastest methods, with a runtime of just 16 ms. The availability of a public implementation repository (ZhuLifa, 2022) also played a significant role in the decision, as it facilitated implementation and customization to the project's requirements and reduced development time.
3.1.1 Implementation Details
In this work, a reference implementation (ZhuLifa, 2022) inspired by the PointPillars method was used. Although it differs from the original in some implementation details, it upholds the core principles of PointPillars (Lang et al., 2019). PointPillars is a deep learning model for 3D object detection using point cloud data. It employs a Pillar Feature Net to extract features from the point cloud by converting the 3D data into a 2D pseudo-image. These features are then processed through a 2D Convolutional Neural Network (CNN) backbone. Finally, a detection head predicts object bounding boxes, classes, and additional relevant attributes.
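To make the pillarization step more concrete, the following is a minimal, simplified sketch (not the reference implementation): points are grouped into vertical pillars on a ground-plane grid and scattered into a 2D pseudo-image. The grid ranges, pillar size, and the per-pillar statistic used here are illustrative assumptions; in the actual network, each pillar is encoded by a small learned PointNet before its feature vector is scattered back to its grid cell.

```python
import numpy as np

def pillarize(points, x_range=(0.0, 69.12), y_range=(-39.68, 39.68),
              pillar_size=0.16, max_points_per_pillar=32):
    """Group LiDAR points (N, 4: x, y, z, intensity) into vertical pillars
    on a 2D grid -- the first step of the Pillar Feature Net (simplified)."""
    grid_w = int(round((x_range[1] - x_range[0]) / pillar_size))
    grid_h = int(round((y_range[1] - y_range[0]) / pillar_size))

    # Keep only points inside the detection range.
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    points = points[mask]

    # Compute the (row, col) pillar index of each remaining point.
    cols = ((points[:, 0] - x_range[0]) / pillar_size).astype(int)
    rows = ((points[:, 1] - y_range[0]) / pillar_size).astype(int)

    pillars = {}
    for point, r, c in zip(points, rows, cols):
        bucket = pillars.setdefault((r, c), [])
        if len(bucket) < max_points_per_pillar:
            bucket.append(point)

    # The real network scatters a learned per-pillar feature vector; here we
    # scatter a simple statistic (mean intensity) to illustrate the pseudo-image.
    pseudo_image = np.zeros((grid_h, grid_w), dtype=np.float32)
    for (r, c), pts in pillars.items():
        pseudo_image[r, c] = np.mean([p[3] for p in pts])
    return pseudo_image
```

The resulting pseudo-image is what allows the rest of the network to operate with efficient 2D convolutions instead of 3D ones.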
The reference implementation contains the PointPillars network, comprising the pillar encoder, the backbone, the neck, and the head. It also includes three main scripts for training, evaluating, and testing the network. The training script and its supporting functions, such as data augmentation, were left unaltered, since they were developed for the KITTI dataset, which was also used to evaluate the final pipeline; this script was used to train the PointPillars network. Likewise, the evaluation script was used without modifications to obtain the network's performance metrics. The testing script, which displays the network's bounding-box detections for a single frame, was not used directly but served as a reference for creating the detection module.
The detection module was implemented as a reusable Python function that can be integrated into a larger application and called successively to process frames sequentially, enabling real-time processing. As shown in Figure 1, it receives as input a LiDAR 3D point cloud, a detection threshold value, and the camera and LiDAR calibration configurations. It starts by filtering the point cloud to remove invalid points (those not captured by the camera), thereby improving inference speed. It then runs inference on the filtered points to obtain the predictions. Finally, the predictions are filtered by the confidence threshold so that only the most reliable detections are passed on for tracking. Each detection is represented by:
D_{3D} = (f, c, x_1, y_1, x_2, y_2, s, h, w, l, x, y, z, θ, α)

where f denotes the frame number, c the class label, and (x_1, y_1) and (x_2, y_2) the coordinates of the top-left and bottom-right corners of the projected 2D bounding box, respectively. s represents the detection score, (h, w, l) denotes the height, width, and length of the 3D bounding box, (x, y, z) the center coordinates of the 3D bounding box, θ the object's angle of rotation around its Y-axis, and α the observation angle.
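As a rough illustration of this interface, the sketch below mirrors the D_{3D} representation as a named tuple and shows the filter-infer-threshold flow of the detection function. It is a minimal sketch under assumed names, not the actual code of this work: the `fov_filter` callable, the `model.predict()` method, and the dictionary keys it returns are hypothetical stand-ins for the calibration-based point filtering and the PointPillars wrapper.

```python
from typing import Callable, List, NamedTuple

import numpy as np


class Detection3D(NamedTuple):
    """One detection, matching D_3D = (f, c, x1, y1, x2, y2, s, h, w, l, x, y, z, theta, alpha)."""
    frame: int        # f: frame number
    label: str        # c: class label
    x1: float         # projected 2D box, top-left corner
    y1: float
    x2: float         # projected 2D box, bottom-right corner
    y2: float
    score: float      # s: detection confidence
    h: float          # 3D box height
    w: float          # 3D box width
    l: float          # 3D box length
    x: float          # 3D box center
    y: float
    z: float
    theta: float      # rotation around the Y-axis
    alpha: float      # observation angle


def detect_objects(frame_id: int,
                   point_cloud: np.ndarray,
                   score_threshold: float,
                   fov_filter: Callable[[np.ndarray], np.ndarray],
                   model) -> List[Detection3D]:
    """Run 3D detection on one LiDAR frame and keep only confident detections.

    point_cloud:  (N, 4) array of x, y, z, intensity.
    fov_filter:   callable built from the camera/LiDAR calibration that removes
                  points not captured by the camera (speeds up inference).
    model:        trained PointPillars network wrapper exposing a predict()
                  method (hypothetical interface, not the reference repo's API).
    """
    visible = fov_filter(point_cloud)      # 1) drop points outside the camera FOV
    raw = model.predict(visible)           # 2) network inference on the filtered points
    detections: List[Detection3D] = []
    for d in raw:                          # 3) keep only high-confidence predictions
        if d["score"] >= score_threshold:
            detections.append(Detection3D(
                frame=frame_id, label=d["label"],
                x1=d["box2d"][0], y1=d["box2d"][1],
                x2=d["box2d"][2], y2=d["box2d"][3],
                score=d["score"],
                h=d["dims"][0], w=d["dims"][1], l=d["dims"][2],
                x=d["center"][0], y=d["center"][1], z=d["center"][2],
                theta=d["theta"], alpha=d["alpha"]))
    return detections
```

In use, such a function would be called once per incoming LiDAR frame, and its output would be fed directly to the tracking module described next.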
3.2 Multiple Object Tracking Module
The multiple-object tracking module will receive predictions from the PointPillars detector (see Figure 1) in the form of 3D bounding boxes. Therefore, the selected tracker must be capable of handling 3D detections. Among the available methods, meeting this