A Modular Multimodal Multi-Object Tracking-by-Detection Approach,
with Applications in Outdoor and Indoor Environments
Eduardo Borges
a
, Lu
´
ıs Garrote
b
and Urbano J. Nunes
c
University of Coimbra, Institute of Systems and Robotics, Department of Electrical and Computer Engineering, Portugal
{eduardo.borges, garrote, urbano}@isr.uc.pt
Keywords:
Tracking-by-Detection, AMRs, Point Cloud and RGB, Multimodal and Multi-Object.
Abstract:
Object detection and tracking are integral components of numerous modern robotics systems, playing an es-
sential role in applications like autonomous driving and industrial Autonomous Mobile Robots (AMRs). In
this paper, we propose a modular multimodal multi-object detection and tracking system tailored for AMRs
in complex industrial environments. The proposed system employs a tracking-by-detection approach, utiliz-
ing both 3D point cloud and RGB data to detect and track multiple objects simultaneously. To develop it,
a baseline unimodal framework was created using a PointPillars detector and the AB3DMOT tracker, op-
erating exclusively on point cloud data. To enhance detection and tracking accuracy, a 2D object detector
(YOLOv8) was integrated, enabling multimodal detection. The system’s performance was evaluated on the
KITTI dataset, demonstrating notable improvements in detection accuracy and tracking consistency. This
enhancement strengthens the system’s robustness and reliability, which are critical factors for real-time per-
ception in AMRs.
1 INTRODUCTION
As industries embrace the era of automation, cat-
alyzed by the principles of Industry 4.0, the demand
for AMRs that can seamlessly navigate intricate and
ever-changing environments while avoiding obstacles
has intensified greatly. Central to their functionality is
the ability to skillfully perceive and interact with the
surrounding environment. The avoidance of dynamic
objects relies on their continuous monitoring through
the detection and tracking of their position, enabling
the estimation of their future trajectories. Traditional
approaches to object detection and tracking are being
eclipsed by the advancements in Deep Learning (DL)
techniques. These techniques provide the ground-
work for AMRs to operate with unprecedented accu-
racy, efficiency, and adaptability.
The objective of this work was to develop a real-
time system for multi-object detection and tracking,
designed for AMRs operating in complex and dy-
namic industrial environments. The outputs of this
system, specifically the object trajectories, will be
used to assess the collision risk with objects outside
the AMR’s security laser field of view. The AMRs
will operate in industrial environments performing
various tasks depending on their specific types.
An AMR is composed of four main functional
components: perception, localization, cognition, and
motion control. The perception module is responsi-
ble for converting raw sensor data into interpretable
information, building an environmental model, and
identifying the locations of objects or targets. The
system proposed in this paper, designed for real-time
multiple object detection and tracking, will be an im-
portant part of an AMR's perception module. Local-
ization uses this data to create local maps and deter-
mine the precise position of the AMR within its envi-
ronment. Cognition, often referred to as the “brain”
of the AMR, utilizes the robot’s position, local map,
external commands, and additional information (e.g.,
object trajectories) from perception to execute essen-
tial functions like path planning and collision avoid-
ance. Finally, motion control handles the naviga-
tion decisions from cognition and translates them into
commands for the actuators, enabling the AMR to
perform its required tasks.
To develop the proposed system, a comprehen-
sive study was conducted on diverse methods of ob-
ject detection and tracking, with a particular focus
on those utilizing Deep Learning techniques. Based
on application-specific criteria, one object detection
method and one object tracking method were then se-
lected to build a baseline system capable of meeting
the objectives with satisfactory performance. To im-
prove the performance of this baseline system and en-
sure its effectiveness and robustness, several modifi-
cations were implemented and evaluated. Each mod-
ification was developed to improve the performance
of detection, tracking, or both while minimizing the
impact on computational requirements and process-
ing speed, which are critical factors in real-time sys-
tems. This work proposes a DL-based multimodal
object detection and tracking system for industrial
AMRs. To achieve this, a tracking-by-detection ap-
proach was employed, where the detection and track-
ing tasks are performed in sequence. Two systems
were developed. The first is a unimodal baseline
that uses only point cloud data as input, serving as a
reference for comparison. It employs the PointPillars
framework (Lang et al., 2019) for 3D object detection
and AB3DMOT (Weng et al., 2020a) for Multiple Ob-
ject Tracking (MOT). The second system expanded
upon this baseline by incorporating a 2D object de-
tector, YOLOv8 (Jocher et al., 2023), into the tracking
module.
The validation of the proposed approaches was
carried out on the KITTI dataset (Geiger et al., 2012).
The proposed approaches achieved relevant results, in
particular when 2D detections were included in the
tracking decision.
The main contributions of this work are:
- A baseline modular system using PointPillars and AB3DMOT for 3D multi-object detection and tracking.
- A multimodal tracking system that integrates 3D point cloud data from LiDAR sensors with 2D image data from RGB cameras, using YOLOv8 for 2D object detection.
- Experiments using the KITTI dataset, demonstrating that the integration of 2D RGB detections into the system results in improved performance.
2 RELATED WORK
Multiple Object Tracking (MOT) is an important
computer vision task that tracks the trajectories of
multiple objects in a sequence of captured data. In the
context of autonomous navigation, either indoor or
outdoor, it is used as a safety tool to prevent collisions
with dynamic entities like people, animals, robots,
cars, etc.
MOT approaches can be categorized into two
main groups: tracking-by-detection and joint detec-
tion and tracking. Tracking-by-detection is a mod-
ular approach where the tracking process is decou-
pled from the detection process. In tracking-by-
detection, an object detector localizes the objects in
each frame independently and provides them to the
tracker. The tracker performs data association and
manages trajectories, outputting the active trajecto-
ries. On the other hand, joint detection and tracking
methods perform detection and tracking in a single
network. The related work presented will be more
focused on tracking-by-detection works, as they are
closer to the approach followed in this work. Table 1
summarizes the methods discussed in this section.
Focused on real-time applications, Bewley et
al. proposed Simple Online and Realtime Track-
ing (SORT) using the tracking-by-detection ap-
proach (Bewley et al., 2016). SORT employs the
Faster Region-Based Convolutional Neural Network
(R-CNN) (Ren et al., 2016) as a 2D object detector
and a Kalman filter (Kalman, 1960) with a constant
velocity model to predict the future state of detec-
tions. During the data association stage, a cost ma-
trix is calculated by computing the Intersection over
Union (IoU) distance between each detection and all
predicted bounding boxes, with optimal assignments
made using the Hungarian algorithm (Kuhn, 1955).
The track management system employed is simple yet
effective: unique identities are assigned to objects as
they enter or leave the sensor’s Field Of View (FOV).
Detections not associated with tracks are converted
into temporary tracks and monitored until the system
gains sufficient confidence to avoid tracking false pos-
itives. Tracks are terminated if the system fails to
detect them for a specified number of frames.
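To make this association step concrete, the sketch below builds an IoU cost matrix between Kalman-predicted track boxes and current-frame detections and solves the assignment with the Hungarian algorithm; the box format, helper names, and threshold value are illustrative assumptions rather than details taken from the SORT implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predictions, detections, iou_threshold=0.3):
    """Match predicted track boxes to detections; return index pairs."""
    cost = np.zeros((len(predictions), len(detections)))
    for i, p in enumerate(predictions):
        for j, d in enumerate(detections):
            cost[i, j] = -iou_2d(p, d)        # negate: Hungarian minimizes
    rows, cols = linear_sum_assignment(cost)
    # Keep only matches with sufficient overlap.
    return [(int(i), int(j)) for i, j in zip(rows, cols)
            if -cost[i, j] >= iou_threshold]
```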
Wojke et al. introduced Deep SORT (Wojke et al.,
2017) as an extension of the SORT approach that in-
corporates appearance information, reducing identity
switches and enabling tracking during longer occlu-
sion periods. This information is captured by em-
ploying a pre-trained CNN to extract descriptors from
each detection. The main distinction between the two
methods lies in the computation of the cost matrix.
Deep SORT builds this matrix by combining two met-
rics in a weighted sum: the motion metric and the ap-
pearance metric. The motion metric is calculated us-
ing the squared Mahalanobis distance (Mahalanobis,
1936) between the predicted Kalman states and the
measured states. The appearance metric is calculated
as the cosine distance between the stored appearance
descriptors from the tracks and the appearance de-
scriptors of new detections.
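The resulting cost matrix can be formed as in the minimal sketch below, where the weighting factor and the example distance matrices are placeholders for illustration, not the exact values used by Deep SORT.

```python
import numpy as np

def combined_cost(motion_dist, appearance_dist, lambda_weight=0.5):
    """Weighted sum of the squared Mahalanobis (motion) and cosine
    (appearance) distance matrices used for data association."""
    return lambda_weight * motion_dist + (1.0 - lambda_weight) * appearance_dist

# Example: 3 tracks x 2 detections (illustrative values)
motion = np.array([[1.2, 5.0], [4.8, 0.9], [6.1, 7.3]])      # squared Mahalanobis
appearance = np.array([[0.1, 0.7], [0.6, 0.2], [0.9, 0.8]])  # cosine distance
cost_matrix = combined_cost(motion, appearance)
```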
While SORT and Deep SORT have shown re-
markable results in 2D object tracking scenarios, au-
Table 1: Summary of Tracking-by-Detection MOT Methods.
Method | Year | Type | Advantages | Limitations
SORT (Bewley et al., 2016) | 2016 | 2D | Simple and effective; low computational cost; real-time tracking. | Limited to 2D tracking; identity switches due to occlusion.
Deep SORT (Wojke et al., 2017) | 2017 | 2D | Includes appearance information; reduces identity switches. | Limited to 2D tracking; higher computational cost.
AB3DMOT (Weng et al., 2020a) | 2019 | 3D | Extends SORT to 3D; low computational cost; high processing speed. | Only uses motion information.
mmMOT (Zhang et al., 2019) | 2019 | 3D | Jointly learns 2D and 3D features; robust. | Relies solely on appearance features; potential feature dominance issues.
GNN3DMOT (Weng et al., 2020b) | 2020 | 3D | Uses both 2D and 3D features; mitigates feature dominance; feature information sharing. | Higher computational complexity; requires training a GNN.
Figure 1: Overview of the object detection and tracking baseline pipeline. The object detector receives LiDAR 3D point
cloud data as input to perform object classification and predict 3D bounding boxes. A multiple-object tracker then takes the
detector's predictions, assigns a Unique Identifier (UID) to each object, and updates their trajectories over time.
tonomous navigation requires an understanding of the
three-dimensional position and movement of objects.
Given these considerations, Weng et al. proposed A
Baseline for 3D Multi-Object Tracking (AB3DMOT)
(Weng et al., 2020a), an approach that extends the
principles of SORT to the 3D space. Because of
this expansion, the 2D detector was replaced by a 3D
detector, specifically a pre-trained PointRCNN (Shi
et al., 2019). The Kalman filter’s state vector was
extended to incorporate additional three-dimensional
parameters. Consequently, the cost function was ad-
justed to use the 3D IoU metric. Due to its simplic-
ity and low computational cost, AB3DMOT is one of
the fastest methods among 3D MOT systems. The
multi-modality Multiple Object Tracking (mmMOT)
(Zhang et al., 2019) framework, proposed by Zhang
et al., uses a PointPillars detector and solves the asso-
ciation problem using an integer linear programming
approach. The cost matrix is obtained by employing
a deep adjacency estimation module on 2D and 3D
detection features.
The aforementioned methods achieve MOT with
varying degrees of success, but some challenges re-
main. AB3DMOT, similarly to SORT, exclusively
uses motion information to build its cost matrix.
While this approach might be effective for specific
scenarios, including appearance information can en-
hance discrimination between tracked objects, lead-
ing to reduced identity switches and improved accu-
racy. Deep SORT includes both motion and appear-
ance information; however, it exclusively focuses on
2D or 3D space. Because 2D and 3D information are
complementary, using features extracted from both
spaces can improve robustness. mmMOT does learn
2D and 3D features jointly; however, it relies solely
on appearance-based features and does not address
the problem of one type dominating over the other.
Furthermore, in all mentioned methods, features are
extracted independently for each detection. The asso-
ciation process might improve if a feature from one
detected object is informed by the features of the
other objects. GNN3DMOT (Weng et al., 2020b) was
proposed by Weng et al. to tackle these challenges. It
extracts motion and appearance features from both 2D
and 3D spaces, utilizing the Dropout technique (Sri-
vastava et al., 2014) to mitigate feature dominance
during training. Additionally, it employs a Graph
Neural Network (GNN) to build a feature interaction
module that shares feature information between every
detection.
3 METHODOLOGY
The system developed in this work will be integrated
into a broader system operating on an AMR. This
larger system will require object detection for mul-
tiple purposes, such as target detection. Therefore,
sharing a single object-detection module would en-
hance efficiency. With this in mind, a tracking-by-
detection approach was adopted for this application.
It was also decided that the most suitable next step
would be to build the baseline system, using a single
detector (unimodal) based on a 3D point cloud sensor
as its main input. Figure 1 provides an overview
of the pipeline used to develop the baseline system,
highlighting the data flow through its components.
The components of the baseline system were se-
lected from preexisting methods that proved suitable
for this use case scenario. The suitability of these
methods was evaluated considering their accuracy, ef-
ficiency, and simplicity. The accuracy of each method
has a direct impact on the system. In this application
scenario, accuracy is associated with security and ac-
curate navigation. Poor accuracy can lead to serious
injury to people and damage to equipment, resulting
in a decrease in trust and adoption of the system. For
real-time systems like this, efficiency and speed are
crucial to ensure fast decision-making, which is an
essential trait to have when navigating dynamic envi-
ronments.
3.1 Object Detection Module
The purpose of the object detector is to classify and
localize the objects of interest in 3D space. There-
fore, the search was concentrated on 3D object detec-
tors using point clouds as input data. After some con-
sideration, the chosen 3D object detector was Point-
Pillars (Lang et al., 2019). Considering the KITTI
benchmarks, PointPillars ranked among the top per-
formers in terms of accuracy at the time of its pub-
lication. Today, its accuracy, while further from the
top, still holds up fairly well. Its key strength lies in
efficiency, as it avoids the need for computationally
expensive 3D convolutions. PointPillars remains one
of the fastest methods, achieving a runtime of just 16
ms. Additionally, the accessibility of a publicly avail-
able implementation code repository (ZhuLifa, 2022)
played a significant role in the decision. This facili-
tated the implementation and customization to fit this
project’s requirements and reduced the development
time.
3.1.1 Implementation Details
In this work, a reference implementation (ZhuLifa,
2022) inspired by the PointPillars method was used.
Although it differs from the original in some imple-
mentation details, it upholds the core principles of
PointPillars (Lang et al., 2019). PointPillars is a deep learning model for
3D object detection using point cloud data. It em-
ploys a Pillar Feature Net to extract features from the
point cloud by converting 3D data into a 2D pseudo-
image. These features are then processed through a
2D Convolutional Neural Network (CNN) backbone.
Finally, a detection head predicts object bounding
boxes, classes, and additional relevant attributes.
The reference implementation contains the Point-
Pillars network, comprising the pillar encoder, the
backbone, the neck, and the head. It also includes
three main scripts for training, evaluating, and testing
the PointPillars network. The training script and sup-
porting functions, such as data augmentation, were
not altered, since they were developed for use on the
KITTI dataset, which was used to evaluate the final
pipeline. This script was utilized to train the Point-
Pillars Network. Likewise, the evaluating script was
used without modifications to obtain the network per-
formance metrics. The testing script was developed
to display the network bounding box detections for a
single frame. Rather than using it directly, the script
served as a reference for creating the detection mod-
ule.
The detection module was implemented as a
reusable Python function that can be integrated into
a larger application. This function can be called suc-
cessively to process each frame sequentially, enabling
real-time processing. As shown in Figure 1, it re-
ceives a LiDAR 3D point cloud, a detection threshold
value, and the camera and LiDAR calibration config-
urations as input. It starts by filtering the point cloud
to remove all invalid points (not captured by the cam-
era), thereby improving the inference speed. Then,
it runs inference on these filtered points to obtain
the predictions. These predictions are finally filtered
by confidence, to remove low-confidence predictions,
ensuring only the most reliable detections are passed
on for tracking. Each detection is represented by:
D_3D = (f, c, x_1, y_1, x_2, y_2, s, h, w, l, x, y, z, θ, α)

where f denotes the frame number, c the class label, (x_1, y_1) and (x_2, y_2) the coordinates of the top left and bottom right corners of the projected 2D bounding box, respectively. s represents the detection score, (h, w, l) denotes the height, width, and length of the 3D bounding box, (x, y, z) are the center coordinates of the 3D bounding box, θ is the object's angle of rotation around its Y-axis, and α is the observation angle.
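A minimal sketch of the final filtering and packing step described above is given below; the ordering of the raw network outputs passed to the function is an assumption made for illustration, not the layout used by the reference implementation.

```python
def build_d3d(frame_id, classes, scores, boxes_2d, boxes_3d, alphas,
              conf_threshold=0.3):
    """Filter raw PointPillars outputs by confidence and pack them into
    D_3D = (f, c, x1, y1, x2, y2, s, h, w, l, x, y, z, theta, alpha) tuples."""
    detections = []
    for c, s, bb2d, bb3d, alpha in zip(classes, scores, boxes_2d, boxes_3d, alphas):
        if s < conf_threshold:             # drop low-confidence predictions
            continue
        x1, y1, x2, y2 = bb2d              # projected 2D box corners
        x, y, z, h, w, l, theta = bb3d     # 3D box centre, size and yaw (assumed order)
        detections.append((frame_id, c, x1, y1, x2, y2, s,
                           h, w, l, x, y, z, theta, alpha))
    return detections
```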
3.2 Multiple Object Tracking Module
The multiple-object tracking module will receive pre-
dictions from the PointPillars detector (see Figure 1),
in the form of 3D bounding boxes. Therefore, the
selected tracker must be capable of handling 3D de-
tections. Among the available methods, meeting this
requirement, AB3DMOT (Weng et al., 2020a) was
selected. AB3DMOT is a simple and effective real-
time 3D MOT system designed for applications such
as autonomous navigation. It focuses on achieving
high accuracy while maintaining a low computational
cost, making it ideal for real-time applications where
high-speed processing is critical. The main factors
that contributed to this selection were its simplicity
and speed. It employs a 3D Kalman filter for state
estimation and the Hungarian algorithm for data as-
sociation. This straightforward approach results in
a lower computational cost and higher processing
speed. AB3DMOT is one of the fastest methods on
the KITTI tracking benchmark (Geiger et al., 2012).
It first obtains current-frame 3D detections from point
clouds using an “off-the-shelf” 3D object detector.
Next, it employs a 3D Kalman filter to predict the
state of associated trajectories to the current frame
using a constant velocity model. The Hungarian al-
gorithm is then used to match these predicted trajec-
tories to the obtained 3D detections. Afterward, the
3D Kalman filter updates the state of matched tra-
jectories based on the corresponding matched detec-
tions. Additionally, a birth and death memory man-
ages the associated trajectories, adding new ones for
newly detected objects and removing those of disap-
peared objects (Weng et al., 2020a). Similar to Point-
Pillars, its accuracy, when used with a PointRCNN
detector (Shi et al., 2019), was among the top per-
formers at the time of publication, continuing to fare
well when compared to more recent methods. Fur-
thermore, the availability of the official Python imple-
mentation (Weng, 2020) was also a significant factor
in the decision, for the same reasons considered in the
detector selection.
3.2.1 Implementation Details
The official AB3DMOT repository contains a library
with the tracker model and several supporting func-
tions. It also includes a main function to process and
visualize object tracking. Initially, this main func-
tion receives pre-saved PointRCNN detections, from
the KITTI dataset, as input. These detection files are
grouped by sequence, sequence type (training or test-
ing), and class. Then, the function executes a loop for
every sequence, and inside it a loop for each class,
tracking the detected objects. This processing ap-
proach, while effective for offline analysis, cannot be
employed for real-time applications, as it introduces
latency. The developed system needs to handle each
frame as it is received to ensure immediate and accu-
rate tracking.
The tracking module was developed in a manner
similar to the detection module. It was implemented
as a Python function which can be called successively
after the detection module to process its detections.
The tracking function receives the detections from the
current frame along with the frame number. Unlike
the previously discussed approach, these detections
are not grouped by class; instead, each detection includes a
value representing its class. These detections are then
organized into a dictionary-like data structure, to en-
sure compatibility with the tracking model. After-
ward, these newly formatted detections are provided
to the tracking model to perform tracking and obtain
the active trajectories T_3D. Each trajectory T_3D is represented as a modified version of a detection D_3D. The tracking module introduces a UID, id, rearranges the order of the existing variables, and omits the frame number f and the observation angle α. The tracking data structure T_3D is defined as:

T_3D = (h, w, l, x, y, z, θ, c, id, s, x_1, y_1, x_2, y_2)    (1)
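The wrapper described above can be sketched as follows; the tracker's update interface and the dictionary keys are simplified stand-ins, not the actual AB3DMOT API.

```python
def track(tracker, detections_3d, frame_id):
    """Feed current-frame D_3D detections to the tracker and return the
    active trajectories T_3D (tracker interface is a simplified stand-in)."""
    dets = {
        # 3D box parameters used by the Kalman filter: h, w, l, x, y, z, theta
        "dets": [list(d[7:14]) for d in detections_3d],
        # auxiliary information carried along: class, score, projected 2D box
        "info": [[d[1], d[6], d[2], d[3], d[4], d[5]] for d in detections_3d],
    }
    # The tracker associates detections with existing trajectories, updates
    # their Kalman states, and handles trajectory birth and death.
    tracks = tracker.update(dets, frame_id)
    # Each returned track is assumed to already follow the T_3D layout:
    # (h, w, l, x, y, z, theta, c, id, s, x1, y1, x2, y2)
    return tracks
```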
3.3 System Integration
In the integration phase, a main function was devel-
oped to simulate real-time object detection and track-
ing. This behavior was achieved with a loop that reads
the current frame input data from memory (in the real
system, this data would be delivered directly by the
robot's sensors). This data contains one image, one
point cloud, and the current
robot pose. The camera and LiDAR calibration pa-
rameters are static and do not need to be read. The
main function then calls the detection function to pro-
cess the point cloud and obtain the detections. The
predictions from the detection module are then passed
to the tracking module to generate the active trajec-
tories. This process is repeated for each subsequent
frame.
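A condensed sketch of this simulated real-time loop is shown below; the detector and tracker callables stand in for the modules described in Sections 3.1 and 3.2, and the per-frame data layout is an assumption for illustration.

```python
def run_pipeline(frames, detector, tracker, conf_threshold, calib):
    """Simulated real-time loop: detect, then track, one frame at a time."""
    for frame_id, frame in enumerate(frames):
        # Each frame provides an image, a point cloud, and the robot pose.
        image, point_cloud, robot_pose = frame

        # 3D detection on the current point cloud.
        detections_3d = detector(frame_id, point_cloud, conf_threshold, calib)

        # Tracking consumes the detections and returns active trajectories.
        trajectories = tracker(detections_3d, frame_id)

        yield trajectories
```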
The predictions from the detection function (us-
ing the PointPillars network) adhere to the same 3D
world coordinate system convention as the KITTI
dataset (forward, leftward, upward). However, the
AB3DMOT model, used in the tracking module, was
developed using PointRCNN, which follows a differ-
ent coordinate system convention (rightward, down-
ward, forward). These compatibility issues were ad-
dressed and both modules were integrated to create
the pipeline depicted in Figure 1.
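As an illustration of the kind of conversion involved, the sketch below maps a point from a (forward, leftward, upward) frame to a (rightward, downward, forward) frame; it is a simplified axis permutation that ignores the full camera-LiDAR calibration extrinsics actually used.

```python
import numpy as np

def lidar_to_camera_axes(xyz_lidar):
    """Map a point from a (forward, leftward, upward) LiDAR-style frame to a
    (rightward, downward, forward) camera-style frame (no extrinsics applied)."""
    x_fwd, y_left, z_up = xyz_lidar
    return np.array([-y_left,   # rightward = -leftward
                     -z_up,     # downward  = -upward
                      x_fwd])   # forward stays forward
```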
3.4 Integration of a 2D Object Detector
As is, the system uses an adjustable confidence score
threshold to categorize detections into two groups:
low-confidence detections D^L_3D, and high-confidence detections D^H_3D. High-confidence detections enter the
tracking module, while low-confidence detections are
discarded. The tracking module initializes a trajectory
only after it has been consistently matched with a de-
tection for a minimum number of consecutive frames min_hits. This strategy helps prevent false positives, as matching false detections over successive frames is unlikely. While effective, this strategy can introduce false negatives, as min_hits consecutive matches are required to initiate a trajectory, potentially delaying the recognition of true detections. To address this
issue, a 2D object detector was integrated into the de-
tection module, running concurrently with the 3D ob-
ject detector, to independently detect objects of inter-
est in RGB images. The selected 2D object detector
was YOLOv8 due to its state-of-the-art accuracy and
speed, essential for real-time detection. Additionally,
its availability as a Python package made it easy to
integrate into the existing system.
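For reference, a minimal example of calling the detector through the Ultralytics Python package is shown below; the weight file, input file name, and confidence value are illustrative choices, not the configuration used in this work.

```python
from ultralytics import YOLO

model = YOLO("yolov8n.pt")               # pretrained weights (illustrative choice)
results = model("frame.png", conf=0.5)   # run 2D detection on one RGB frame

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # BB_2D corners
    cls_2d = int(box.cls[0])               # class index
    conf_2d = float(box.conf[0])           # confidence score
```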
Some 3D detections have corresponding 2D de-
tections originating from the same object. Due to their
independence, distinct data sources (RGB and point
clouds), and differing detector architectures, these de-
tections are more likely to be classified as true posi-
tives. As such, they were grouped into a new cat-
egory: very high confidence detections D^VH_3D. Detections from this group do not get filtered out and do not need to adhere to the same rules as high-confidence detections. Instead, they are immediately initialized into trajectories by the tracking module if max(conf_2D, conf_3D) > conf_th and cls_2D = cls_3D, where conf_2D and conf_3D represent the confidence scores from the 2D and 3D detectors, respectively, conf_th denotes the adjustable confidence threshold, and cls_2D and cls_3D represent the object class of the 2D and 3D detections, respectively.
This modification aims to improve detection accuracy by reducing false negatives: detections with very high confidence bypass the min_hits requirement, reducing the delay in trajectory initialization. Nevertheless, this requirement remains important for more ambiguous situations, such as those where an object is only detected by the 3D detector (D^H_3D).
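The promotion rule can be expressed directly as in the short sketch below (the function name and argument layout are illustrative):

```python
def is_very_high_confidence(conf_2d, conf_3d, cls_2d, cls_3d, conf_th):
    """A matched 2D/3D detection pair bypasses the min_hits requirement when
    the classes agree and the best confidence score exceeds the threshold."""
    return max(conf_2d, conf_3d) > conf_th and cls_2d == cls_3d
```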
3.4.1 Implementation Details
The YOLOv8 detector provides 2D bounding boxes
in the format BB_2D = (x_1, y_1, x_2, y_2), where (x_1, y_1) represents the coordinates of the top left corner, and (x_2, y_2) represents the coordinates of the bottom right corner of the bounding box. Similarly, the PointPillars detector generates 2D bounding boxes, representing a projection of the predicted 3D bounding boxes to the image plane. These 2D bounding boxes follow the same format: BB^3D_2D = (x_1, y_1, x_2, y_2). By di-
rectly utilizing these bounding boxes, the step of pro-
jection that requires camera and LiDAR calibration
matrices can be bypassed, thus saving computational
resources. Figure 2a provides an overview of the in-
tegration process for the 2D and 3D object detectors.
Additionally, it illustrates how both types of detec-
tions are visualized on the image plane. To iden-
tify which 3D detection has a matching 2D detection
(D^VH_3D), a straightforward IoU is calculated for all pairs of bounding boxes (each consisting of one BB_2D and one BB^3D_2D). The resulting values are organized into a table, commonly referred to as an affinity matrix, representing the overlap between each pair of bounding boxes.
To find the optimal pairing, where each 3D de-
tection is matched with only one 2D detection, the
Hungarian algorithm is applied to this table. While
the Hungarian algorithm was designed for minimiza-
tion problems, this task requires maximizing the IoU
value to find the most suitable matches. To make it
compatible with the algorithm, the IoU values from
the table are negated, transforming the problem into a
minimization task. Low-overlap matches are then fil-
tered out by applying an IoU threshold to the resulting
pairs. This process is illustrated in Figure 2b.
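A compact sketch of this matching step is given below, using an illustrative affinity matrix; the helper name and threshold value are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_2d_3d(iou_table, iou_threshold=0.5):
    """One-to-one matching between YOLOv8 boxes (rows) and projected
    PointPillars boxes (columns); low-overlap pairs are discarded."""
    affinity = np.asarray(iou_table)
    rows, cols = linear_sum_assignment(-affinity)   # negate: maximize total IoU
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if affinity[r, c] >= iou_threshold]

# Illustrative affinity matrix; the matched pairs form the D^VH_3D group.
iou_table = [[0.90, 0.00, 0.00],
             [0.00, 0.85, 0.00],
             [0.00, 0.05, 0.80],
             [0.00, 0.00, 0.00]]
print(match_2d_3d(iou_table))   # -> [(0, 0), (1, 1), (2, 2)]
```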
4 EXPERIMENTAL VALIDATION
The KITTI dataset (Geiger et al., 2012) was selected
for the system evaluation. KITTI is well-established
and has become a standard benchmark for evaluating
the performance of object detection and tracking al-
gorithms. Furthermore, due to its extensive use in re-
search, it allows for the comparison of the system’s
performance against a range of existing methods.
The Higher Order Tracking Accuracy (HOTA)
benchmark (Luiten et al., 2020) was chosen as the
evaluation metric for its ability to address limita-
tions of previous metrics like Multiple Object Track-
ing Accuracy (MOTA). HOTA offers a compre-
hensive assessment by evaluating detection, asso-
ciation, and localization through a family of sub-
metrics, ensuring a balanced evaluation across mul-
tiple localization thresholds. It also includes Local-
isation Accuracy (LocA), measuring how accurately
predicted bounding boxes match the ground truth,
enhancing object localization assessment. The de-
composition into components like Detection Accu-
racy (DetA), Association Accuracy (AssA), Detec-
tion Recall (DetRe), Detection Precision (DetPr), As-
sociation Recall (AssRe), and Association Precision
(AssPr) provides detailed insights into the system’s
tracking performance.
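For reference, the metric can be summarized as follows; this is the standard decomposition from Luiten et al. (2020), written here in simplified form:

```latex
% HOTA at a single localization threshold \alpha, and the final score
% obtained by averaging over the standard set of thresholds.
\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha}\,\mathrm{AssA}_{\alpha}},
\qquad
\mathrm{HOTA} = \frac{1}{19}\sum_{\alpha \in \{0.05,\,0.10,\,\ldots,\,0.95\}} \mathrm{HOTA}_{\alpha}
```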
(a) Overview of the integration of 2D and 3D object detectors into the detection module and visualization of both types of
detections on the image plane.
(Diagram: an example IoU affinity table is fed to the Hungarian algorithm to obtain matches, followed by low-overlap match filtering.)
(b) Illustration of the IoU-based data association process. The matched detections belong to the very high confidence detection class (D^VH_3D).
Figure 2: Integration of the YOLOv8 object detector to create a new class of very high confidence detection D^VH_3D.
4.1 Evaluation Protocol
The systems were evaluated by systematically alter-
ing their parameters to identify the combination that
produces the best performance. The selection strat-
egy involved adjusting each parameter individually
while keeping the others constant to avoid influenc-
ing the outcome. After evaluating the performance
for each adjustment, the parameter value that pro-
duced the best performance was fixed. This perfor-
mance was determined by averaging the results be-
tween all classes. This procedure was repeated for all
parameters until the best-performing values for each
were identified. The evaluation was conducted using
the TrackEval code repository (Luiten, 2020), start-
ing with the determination of the best parameters for
the baseline system. Once selected, these parameters
were fixed, and the modification was applied individ-
ually to create a modified version of the baseline sys-
tem.
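The one-parameter-at-a-time search described above can be sketched as below; the evaluation callback and the grid contents are placeholders for illustration, with values mirroring the baseline parameters reported in Table 2.

```python
def tune_parameters(param_grid, evaluate, initial):
    """Greedy one-at-a-time search: sweep each parameter while keeping the
    others fixed, then freeze the value that gives the best average HOTA."""
    best = dict(initial)
    for name, values in param_grid.items():
        scores = {v: evaluate({**best, name: v}) for v in values}  # avg HOTA over classes
        best[name] = max(scores, key=scores.get)
    return best

# Example grids (illustrative evaluation callback `run_trackeval`):
# tune_parameters({"m_th": [0, -0.2, -0.4], "min_hits": [3, 4, 5, 6],
#                  "max_age": [3, 4, 5]},
#                 evaluate=run_trackeval,
#                 initial={"m_th": 0, "min_hits": 3, "max_age": 4})
```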
4.2 Baseline System
The baseline system contains three tunable parame-
ters: m_th, which sets the affinity threshold for a valid match during data association; min_hits, which defines the minimum number of matched detections required for a trajectory to transition from tentative to active; and max_age, which determines the maximum number of consecutive frames a trajectory can remain unmatched before being terminated.
The results were generated using the training se-
quences, as Ground Truth (GT) information for the
test sequences is not publicly available. Although this
evaluation approach does not guarantee performance
on the test set, the results remain valuable for assess-
ing performance changes resulting from system mod-
ifications. Table 2 presents the obtained performance
metrics for the “Car” and “Pedestrian” classes using
the HOTA benchmark.
From the analysis of the results presented in Ta-
ble 2, the parameter combination yielding the highest
HOTA score for the “Car” class is BASE 07, while for
the “Pedestrian” class it is BASE 08. However, when
considering the cumulative score, the BASE 02 pa-
rameter combination produced the best results (with
an average of 55.33%). As such, these parameters
were selected as the baseline parameters.
4.3 Multimodal Tracking by Detection
Pipeline
With the addition of a 2D object detector, the resulting
system contains two additional tunable parameters:
iou_th, which defines the threshold for matching 2D
Table 2: Baseline system evaluation parameters and results for the “Car” and “Pedestrian” classes in the HOTA benchmark.
ID | m_th | min_hits | max_age | HOTA | DetA | AssA | DetRe | DetPr | AssRe | AssPr | LocA
Car
BASE 01 | 0 | 3 | 4 | 68.12% | 66.11% | 70.41% | 73.42% | 80.81% | 76.29% | 85.26% | 87.64%
BASE 02 | -0.2 | 3 | 4 | 68.22% | 66.17% | 70.54% | 73.53% | 80.76% | 76.43% | 85.26% | 87.63%
BASE 03 | -0.4 | 3 | 4 | 68.11% | 66.15% | 70.35% | 73.61% | 80.62% | 76.42% | 84.95% | 87.33%
BASE 04 | -0.2 | 4 | 4 | 67.33% | 65.02% | 69.94% | 71.83% | 81.21% | 75.53% | 85.44% | 87.75%
BASE 05 | -0.2 | 5 | 4 | 66.31% | 63.70% | 69.23% | 70.06% | 81.51% | 74.55% | 85.55% | 87.47%
BASE 06 | -0.2 | 6 | 4 | 65.18% | 62.30% | 68.29% | 68.27% | 81.72% | 73.44% | 85.67% | 87.94%
BASE 07 | -0.2 | 3 | 3 | 68.89% | 67.19% | 70.86% | 72.97% | 83.03% | 75.48% | 87.03% | 87.72%
BASE 08 | -0.2 | 3 | 5 | 67.42% | 64.99% | 70.15% | 73.90% | 78.57% | 77.21% | 83.69% | 87.58%
Pedestrian
BASE 01 | 0 | 3 | 4 | 42.33% | 42.05% | 42.80% | 50.22% | 56.06% | 46.51% | 66.47% | 72.97%
BASE 02 | -0.2 | 3 | 4 | 42.44% | 42.12% | 42.96% | 50.36% | 56.00% | 47.01% | 65.29% | 72.96%
BASE 03 | -0.4 | 3 | 4 | 42.43% | 42.08% | 42.98% | 50.34% | 55.95% | 47.16% | 64.89% | 72.95%
BASE 04 | -0.2 | 4 | 4 | 42.38% | 42.04% | 42.93% | 49.10% | 57.56% | 46.91% | 65.29% | 73.03%
BASE 05 | -0.2 | 5 | 4 | 42.14% | 41.68% | 42.80% | 47.92% | 58.62% | 46.68% | 65.52% | 73.08%
BASE 06 | -0.2 | 6 | 4 | 41.81% | 41.19% | 42.63% | 46.73% | 59.52% | 46.45% | 65.62% | 73.12%
BASE 07 | -0.2 | 3 | 3 | 40.93% | 41.83% | 40.25% | 48.49% | 58.03% | 43.43% | 67.19% | 73.05%
BASE 08 | -0.2 | 3 | 5 | 43.20% | 41.75% | 44.88% | 51.19% | 54.37% | 49.60% | 64.72% | 72.93%
bounding boxes from the 2D and 3D detectors, and
conf_th, which sets the minimum confidence score required for very high confidence detections, D^VH_3D, to
be directly initialized into an active trajectory by the
tracking module. Table 3 presents the parameter com-
binations used to evaluate the system and the obtained
performance metrics for the “Car” and “Pedestrian”
classes using the HOTA benchmark.
An analysis of the results presented in Table 3
shows that all tested parameter combinations resulted
in a significant boost in performance. This modifi-
cation increased HOTA scores by at least 1.85 and
1.92 percentage points for the “Car” and “Pedestrian”
classes, respectively. The best-performing parame-
ter combination (MOD1 03) produced improvements
of 2.64 and 3.00 percentage points for the respective
classes.
For the “Car” class, the only components of the
HOTA metric that showed a decrease in performance
were DetPr, AssPr, and LocA. LocA experienced a
very slight decrease (from 87.63% to 87.14%), while
DetPr and AssPr saw more significant reductions
(DetPr dropping from 80.76% to 79.14%, and AssPr
from 76.43% to 74.84%). Despite these decreases,
both DetA and AssA showed notable improvements,
with DetA increasing from 66.17% to 69.67%, and
AssA from 70.54% to 72.85%. For the “Pedestrian”
class, none of the HOTA components showed a signif-
icant decrease in performance. Similarly to the “Car”
class, DetA and AssA both increased, with DetA ris-
ing from 42.12% to 45.89%, and AssA from 42.96%
to 45.23%, reflecting a general improvement in the
tracking and detection accuracy. This analysis sug-
gests that while some metric scores slightly declined,
the system overall benefited from better detection and
association accuracy.
5 CONCLUSIONS
In this paper, we propose a modular multimodal
multi-object detection and tracking system designed
for operating in indoor or outdoor environments. The
system integrates both 3D point cloud data and RGB
images, utilizing a tracking-by-detection approach
with PointPillars for 3D detection and YOLOv8 for
2D detection. Our experimental results, validated on
the KITTI dataset, showed significant improvements
in detection accuracy and tracking consistency, partic-
ularly due to the integration of 2D and 3D detections,
which enhanced robustness. The proposed system
showed improvements in both the tracking of vehi-
cles and pedestrians, offering a more reliable solution
for real-time perception in dynamic environments.
Several routes for future work can be explored.
First, extending the system to incorporate additional
sensor modalities, such as thermal or RGB-D data,
could further improve detection accuracy in challeng-
ing indoor conditions, such as poor lighting. Second,
optimizing the computational efficiency of the system
for deployment could be another necessary improve-
ment. Third, additional refinement of the object asso-
ciation process, particularly for occluded objects, may
help reduce identity switches and improve tracking
performance. Lastly, a key direction for future work
involves adapting the system specifically for AMRs
in industrial environments.
Table 3: Results for the “Car” and “Pedestrian” classes with the multimodal pipeline, considering the 2D detector.
ID | iou_th | conf_th | HOTA | DetA | AssA | DetRe | DetPr | AssRe | AssPr | LocA
Car
BASE 02 | - | - | 68.22% | 66.17% | 70.54% | 73.53% | 80.76% | 76.43% | 85.26% | 87.63%
MOD1 01 | 0.5 | 0.5 | 70.66% | 69.32% | 72.28% | 78.21% | 79.63% | 78.99% | 84.63% | 87.23%
MOD1 02 | 0.75 | 0.5 | 70.07% | 68.62% | 71.78% | 76.85% | 80.27% | 78.17% | 84.95% | 87.45%
MOD1 03 | 0.5 | 0.3 | 70.86% | 69.65% | 72.35% | 79.04% | 79.16% | 79.30% | 84.81% | 87.14%
MOD1 04 | 0.5 | 0.1 | 70.86% | 69.67% | 72.34% | 79.08% | 79.14% | 79.30% | 84.41% | 87.13%
Pedestrian
BASE 02 | - | - | 42.44% | 42.12% | 42.96% | 50.36% | 56.00% | 47.01% | 65.29% | 73.09%
MOD1 01 | 0.5 | 0.5 | 44.36% | 44.64% | 44.30% | 54.10% | 56.03% | 48.66% | 65.36% | 73.08%
MOD1 02 | 0.75 | 0.5 | 42.96% | 42.98% | 43.12% | 51.69% | 56.08% | 47.25% | 65.43% | 73.09%
MOD1 03 | 0.5 | 0.3 | 45.44% | 45.86% | 45.23% | 56.17% | 55.72% | 50.00% | 65.07% | 73.07%
MOD1 04 | 0.5 | 0.1 | 45.41% | 45.89% | 45.14% | 56.27% | 55.65% | 49.91% | 64.99% | 73.06%
ACKNOWLEDGMENTS
This work has been supported by the Portuguese
Foundation for Science and Technology (FCT)
through grant UIDB/00048/2020 and by Agenda
“GreenAuto: Green innovation for the Automo-
tive Industry”, with reference PRR-C644867037-
00000013.
REFERENCES
Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B.
(2016). Simple online and realtime tracking. In 2016
IEEE international conference on image processing
(ICIP), pages 3464–3468. IEEE.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we ready
for autonomous driving? the kitti vision benchmark
suite. In Conference on Computer Vision and Pattern
Recognition (CVPR).
Jocher, G., Chaurasia, A., and Qiu, J. (2023). Ultralytics yolov8. https://github.com/ultralytics/ultralytics. Accessed: 2024-06-05.
Kalman, R. E. (1960). A New Approach to Linear Filtering
and Prediction Problems. Journal of Basic Engineer-
ing, 82(1):35–45.
Kuhn, H. W. (1955). The hungarian method for the assign-
ment problem. Naval Research Logistics Quarterly,
2(1-2):83–97.
Lang, A. H., Vora, S., Caesar, H., Zhou, L., Yang, J., and
Beijbom, O. (2019). Pointpillars: Fast encoders for
object detection from point clouds. In Proceedings
of the IEEE/CVF conference on computer vision and
pattern recognition, pages 12697–12705.
Luiten, J. (2020). Trackeval: Hota (and other) evaluation metrics for multi-object tracking (mot). https://github.com/JonathonLuiten/TrackEval. Accessed: 2024-05-02.
Luiten, J., Osep, A., Dendorfer, P., Torr, P., Geiger, A., Leal-
Taixe, L., and Leibe, B. (2020). Hota: A higher order
metric for evaluating multi-object tracking. Interna-
tional Journal of Computer Vision, 129(2):548–578.
Mahalanobis, P. C. (1936). On the generalized distance in
statistics. Proceedings of the National Institute of Sci-
ences of India, 2(1):49–55.
Ren, S., He, K., Girshick, R., and Sun, J. (2016). Faster
r-cnn: Towards real-time object detection with re-
gion proposal networks. IEEE transactions on pattern
analysis and machine intelligence, 39(6):1137–1149.
Shi, S., Wang, X., and Li, H. (2019). Pointrcnn: 3d object
proposal generation and detection from point cloud.
In Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition, pages 770–779.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: A simple way
to prevent neural networks from overfitting. Journal
of Machine Learning Research, 15(56):1929–1958.
Weng, X. (2020). Ab3dmot: 3d multi-object tracking: A baseline and new evaluation metrics. https://github.com/xinshuoweng/AB3DMOT. Accessed: 2024-02-20.
Weng, X., Wang, J., Held, D., and Kitani, K. (2020a). 3d
multi-object tracking: A baseline and new evaluation
metrics. In 2020 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS).
Weng, X., Wang, Y., Man, Y., and Kitani, K. M. (2020b).
Gnn3dmot: Graph neural network for 3d multi-object
tracking with 2d-3d multi-feature learning. In Pro-
ceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, pages 6499–6508.
Wojke, N., Bewley, A., and Paulus, D. (2017). Simple on-
line and realtime tracking with a deep association met-
ric. In 2017 IEEE international conference on image
processing (ICIP). IEEE.
Zhang, W., Zhou, H., Sun, S., Wang, Z., Shi, J., and Loy,
C. C. (2019). Robust multi-modality multi-object
tracking. In Proceedings of the IEEE/CVF interna-
tional conference on computer vision, pages 2365–
2374.
ZhuLifa (2022). Pointpillars: Fast encoders for object detection from point clouds. https://github.com/zhulf0804/PointPillars. Accessed: 2024-02-20.