Towards Multi-object Detection and Tracking in Urban Scenario under Uncertainties

Achim Kampker, Mohsen Sefati 1,∗,†, Arya S. Abdul Rachman 2,∗, Kai Kreisköther and Pascual Campoy

Chair of Production Engineering of E-Mobility Components, RWTH Aachen University, Aachen, Germany

Delft Center for Systems and Control, Delft University of Technology, Delft, The Netherlands

Computer Vision and Aerial Robotics Group, Centre of Automatics and Robotics, Universidad Politécnica de Madrid, Madrid, Spain

K.Kreiskoether@pem.rwth-aachen.de, Pascual.Campoy@upm.es

Keywords: Multi Object Tracking, Perception, 3D LIDAR, Autonomous Driving, Probabilistic Filtering.

Abstract: Automated vehicles in urban scenarios require a reliable perception technology to tackle the high amount of uncertainties. In this paper, a real-time framework for multi-object detection and manoeuvre-aware tracking is presented, and the application of 3D LIDAR in a cluttered urban environment is demonstrated. Our approach combines a sensor occlusion-aware detection method with computationally efficient rule-based filtering and adaptive probabilistic tracking to handle uncertainties arising from the sensing limitations of 3D LIDAR and the complexity of the targets' movement. The evaluation results using real-world pre-recorded data and a comparison with the state of the art show that the presented framework is capable of achieving promising tracking performance in urban scenarios.

1 INTRODUCTION

Advanced Driver Assistance Systems (ADAS) and Automated Vehicles (AVs) have been the focus of many research activities for several decades. Achieving higher automation levels for AVs also imposes higher requirements on environment perception. By advancing from highways to urban and intercity scenarios, further challenges have to be met, especially with respect to the tasks of detection and tracking. In a typical urban scene, the AV is surrounded by multiple traffic objects of different types (pedestrians, cyclists, cars, trucks, etc.) with different skills and movement patterns. The AV should be able to detect these objects, associate them with the corresponding context information from the modelled scene and predict their future behaviour for subsequent tasks such as decision making and trajectory planning. The basis for this is the ability to distinguish between dynamic and static objects and to keep tracking the dynamic objects continuously. Consequently, multi-object detection and tracking become essential for AV perception.

∗ These authors contributed equally to the paper.

† Corresponding author.

The recently introduced compact 3D LIDAR scanner (Velodyne, 2007) is especially suitable for the multi-object detection and tracking task, since it enables far-reaching, high-fidelity acquisition of surrounding spatial information, which is not possible with conventional sensing technologies. LIDAR-based perception for autonomous vehicles is a widely discussed topic. Among others, (Luo et al., 2016) suggest real-time capable LIDAR detection and tracking, (Chen et al., 2015) introduce model-based detection of surrounding vehicles, and (Himmelsbach and Wuensche, 2012) propose a combined top-down and bottom-up approach to enhance the detection and tracking result while simultaneously performing classification. Nevertheless, comparably few works address the holistic integration of LIDAR perception tasks aimed at practical use in urban situations. (Zhang et al., 2011), (Wojke and Haselich, 2012), and (Choi et al., 2013) notably propose a complete scheme of Multi Object Tracking (MOT). However, these implementations do not specifically target the use case of urban driving, and the limitations of vehicle embedded computers are not necessarily taken into account.


Kampker, A., Sefati, M., Abdul Rachman, A., Kreisköther, K. and Campoy, P.

Towards Multi-Object Detection and Tracking in Urban Scenario under Uncertainties.

DOI: 10.5220/0006706101560167

In Proceedings of the 4th International Conference on Vehicle Technology and Intelligent Transport Systems (VEHITS 2018), pages 156-167

ISBN: 978-989-758-293-6

[Figure 1 diagram. Input: data acquisition (e.g. LIDAR) delivering raw point clouds. Detection: Ground Extraction; Clustering and Segmentation (Connected Components); Bounding Box Fitting (MAR and L-Shape Fitting); Rule-based Filtering. Multi-Object Tracking: Data Association (JPDA); Tracking Filter (IMM-UKF); Track Management (Maturity Check and Dynamic Classification); Box Correction (Dimension and Perspective Correction, Over- and Under-Segmentation Handling). Signals between the blocks: elevated measurements, organized measurements, bounding boxes, object hypotheses (measurements), associations, predictions, new tracks, updated tracks, mature tracks, track list.]

Figure 1: Structure of the proposed multi-object detection and tracking.

This paper presents a complete framework and pipeline for multi-target object detection and tracking in urban scenarios based on a hybrid approach, in which grid-based and object-based techniques are combined. The proposed framework is designed to cope with multiple targets, cluttered environments (i.e. objects that are close to each other), occlusions and uncertainties by applying a set of computationally efficient strategies. The main advantage of this framework is its robustness against common uncertainties in urban scenarios, which is achieved by applying probabilistic approaches for data association and tracking filters. In addition, promising tracking reliability with dynamic classification is achieved. The input of the framework is raw 3D LIDAR data in the form of a point cloud, while the output is a track list of associated objects with their corresponding dynamic and geometric properties together with association probabilities. In this work, the framework is demonstrated with data acquired from a state-of-the-art Velodyne HDL-64E LIDAR sensor, because of its advantages regarding accuracy, field of view, and sampling rate of three-dimensional environmental measurements. However, the framework is also applicable to other sensor technologies, since it mainly relies on generic grid-based and object-based approaches.

The rest of the paper is structured as follows: an overview of the structure and main functions of the framework is given in Section 2. Section 3 describes the detection part, where the non-ground measurements are extracted and object hypotheses are generated. The multi-object tracking with its main components is presented in Section 4, and the further post-processing functions for dimension correction are presented in Section 5. Finally, the framework is evaluated in Section 6 using raw data from the KITTI data set (Geiger et al., 2013) and the MOT16 (Milan et al., 2016) evaluation metrics.

2 SYSTEM ARCHITECTURE

The framework can be divided into two main function categories: detection and tracking. The input of the detection part is a 3D point cloud, which has to be divided into ground and non-ground (elevated) measurements. This is accomplished by a slope-based ground removal approach and a subsequent filtering process. In a further step, object hypotheses for the tracking targets are generated by clustering. The objects of interest are then extracted by means of feature-based bounding box fitting and rule-based filtering.

The tracking is based on centroid tracking of the generated bounding boxes and has four main steps: data association, tracking filters, track management and bounding-box correction. In the association step, the set of object hypotheses is determined that corresponds to the predicted measurements of the already established tracks. If an association is possible, the track is updated with the associated measurement; otherwise, a new track is created. The prediction and update steps are carried out by the tracking filters. The track management maintains all tracks, labels their maturity and filters out the non-feasible and old ones. Finally, the bounding-box correction assigns valid bounding box dimensions to the mature tracks and uses the track history to update this information with new measurements. Figure 1 illustrates the framework structure with its main components. The algorithmic implementation of the framework is discussed in Sections 3 to 5.

3 DETECTION

3.1 Ground Extraction

Ground extraction is an important pre-processing step, in which all incoming 3D points are given a binary label: ground or non-ground. The term ground denotes the navigable and reachable area that surrounds the ego-vehicle. Urban scenarios may contain different types of terrain. Therefore, the ground extraction must be able to handle non-flat, sloped and uneven surfaces. This module is the first component of the whole framework and has to deal with the entire data coming from the sensor. Thus, its computational performance is an important aspect. To this end, a combination of channel-based and scan-based approaches is used in this work. In the proposed approach, a slope-based, channel-wise classification is performed on a polar grid by means of a modified version of the technique from (Himmelsbach and Wuensche, 2012). After an initial estimate of the ground surface has been obtained, the interrelationship between channels and the consistency of the estimated ground are checked by comparing the heights of neighbouring cells in the polar grid. The estimated ground surface is then smoothed, and missing spatial information is filled in, by applying a median filter. Figure 2 shows the result of the ground extraction and the effect of the consistency check and the median filter.
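The slope-based, channel-wise labelling can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function name `label_ground_channel`, the thresholds `max_slope_deg` and `max_start_height`, and the reduction of each polar channel to one lowest point per range bin are hypothetical choices made only for illustration. The consistency check and the median filter are not covered.

```python
import math

def label_ground_channel(bins, max_slope_deg=8.0, max_start_height=0.3):
    """Label the range bins of one polar channel as ground (True) or not (False).

    bins: list of (range_m, min_height_m) tuples ordered by increasing range,
          heights relative to the expected ground level at the sensor.
    A bin is ground if the slope from the last accepted ground bin stays
    below max_slope_deg. Both thresholds are illustrative assumptions.
    """
    max_slope = math.tan(math.radians(max_slope_deg))
    labels = []
    prev_r, prev_h = 0.0, 0.0  # start at the ego-vehicle footprint
    for r, h in bins:
        dr = r - prev_r
        if dr <= 0.0:
            labels.append(False)
            continue
        is_ground = abs(h - prev_h) / dr <= max_slope
        # Until a first ground bin is accepted, also require a plausible height
        if prev_r == 0.0 and abs(h) > max_start_height:
            is_ground = False
        labels.append(is_ground)
        if is_ground:
            prev_r, prev_h = r, h
    return labels
```

In this sketch, a bin that fails the slope test (for example the lowest point of a car) does not update the reference, so the ground can be picked up again behind the obstacle.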

3.2 Clustering

[Figure 2 panels: 1) Raw data; 2) Slope-based classification: some false classification; 3) Consistency check: false classification reduced; 4) Median filter: false classification eliminated.]

Figure 2: Main steps of the ground extraction: 1) raw data, 2) result of the ground estimation, 3) result of consistency check and 4) result of median filter.

The first step towards hypothesis generation is to divide the unorganised non-ground point cloud into smaller parts. This step is called clustering and can be done in 3D, 2.5D or 2D. Since the computational cost of 3D clustering is usually high, the clustering problem is treated as a 2D problem by mapping all elevated 3D points to a 2D grid, as proposed by (Levinson et al., 2011) and (Himmelsbach et al., 2010). The clustering in this work is accomplished by applying Connected Component Clustering (Pfaltz, 1966) in a Cartesian grid representation. This approach has its origin in computer vision, where it is used for clustering 2D binary images. However, it has also been used for 3D LIDAR point clouds (cf. (Rubio et al., 2013)). The Connected Component Clustering is applied to the point cloud row by row. It makes two passes: 1) assigning temporary labels for the "connectedness" of cells and storing equivalences, and 2) resolving the relation between the equivalence classes and replacing the temporary labels.

Initially, the whole grid is checked for occupancy, and each cell is assigned one of two initial states: empty (0) or occupied (-1). Each cell in the grid is then examined for connectivity by checking the occupied neighbour cells within a spatial kernel of a chosen size. If the target cell belongs to the same region as its neighbour cells, the same cluster ID is assigned to it. Otherwise, a new ID is created by incrementing the last ID by one. If the connected neighbour cells are already assigned different cluster IDs, the minimum ID is chosen for the target cell. After all occupied cells have been assigned a cluster ID, the second pass uses a union-find data structure to replace each cell label with its equivalence class and thereby avoid multiple labels for a single connected region. Since the size of the kernel defines the maximum spatial distance between two connected cells within the same cluster, it is responsible for over- and under-segmentation errors, which are taken into account later in Section 5.3. Figure 3 shows three partial snapshots of the same scene, where ground classification, ground removal and clustering have been applied successively. Different colours in the left part of the figure refer to the different clusters in the scene.
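The two passes described above can be sketched in Python as follows. This is a generic two-pass labelling with union-find, not the code used in this work. The function name `label_clusters` is hypothetical, the kernel is assumed to be 3x3 (8-connectivity), and any non-zero cell is treated as occupied.

```python
def label_clusters(grid):
    """Two-pass connected-component labelling of a 2D occupancy grid.

    grid: list of rows; 0 marks an empty cell, any other value an occupied one.
    Returns a grid of the same shape with 0 for empty cells and cluster IDs
    1, 2, ... for occupied cells (8-connectivity, i.e. a 3x3 kernel).
    """
    rows, cols = len(grid), len(grid[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find forest; index 0 is unused

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the minimum ID as root

    # First pass: temporary labels and recorded equivalences
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0:
                continue
            # Neighbours already visited in a row-by-row scan
            candidates = ((r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1))
            neighbours = [
                labels[rr][cc]
                for rr, cc in candidates
                if 0 <= rr and 0 <= cc < cols and labels[rr][cc] > 0
            ]
            if not neighbours:
                parent.append(len(parent))  # new temporary ID
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(neighbours)
                for n in neighbours:
                    union(labels[r][c], n)

    # Second pass: replace each temporary label by its equivalence class
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] > 0:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels
```

A larger kernel would only change the list of candidate neighbours, which is why the kernel size directly controls how far apart two cells of the same cluster may lie.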

[Figure 3 panels, from left to right: Clustered, Ground Removal, Raw Data.]

Figure 3: Measurement pre-processing: ground removal and clustering.

3.3 Bounding Box Fitting

The clusters of 3D points that are recognised as objects provide only limited information about the pose of the objects. Moreover, some objects might be seen only partially by the LIDAR. Thus, a further process is required in order to formulate a better hypothesis about each object. To this end, a 3D bounding box representation is chosen, which gives better information about the dimensions and orientation of the detected objects. Generally, there are two groups of approaches for bounding box fitting: feature-based and model-based approaches (Chen et al., 2015). The model-based approaches offer more accurate results than the feature-based approaches, due to the use of rectangle or cuboid models together with optimization or sampling techniques. However, they suffer from high computational cost and are therefore not suitable for urban scenarios with a high number of detected objects. Thus, a feature-based method is proposed in this work, whose result is improved continuously by feeding the tracking results back to the detection.

First, the Minimum Area Rectangle (MAR) (Freeman and Shapira, 1975) is applied to the 2D cluster in order to create the initial bounding box. The height information of each cluster is retained by taking the difference between its highest and lowest point. This information can be used to form a 3D oriented bounding box (cf. Figure 4). The MAR approach is sufficient for most well-defined clusters. However, it might fail for occluded objects, for which not enough measurement points are available, and then leads to an erroneous heading angle. To tackle this issue, a feature-based L-shape fitting approach is applied, which corrects the box orientation. Similar to (Ye et al., 2016), the L-shape fitting is done by extracting the outer contour of the cluster. As shown in Figure 5, the two farthest outlier points are selected, which lie on opposite sides of the object as seen from the LIDAR sensor. A line is drawn between these two points, and an orthogonal line is obtained with the maximum distance (labelled max in Figure 5) and an angle close to 90 degrees by applying the Iterative End Point Fit (IEPF) algorithm. The corner point can be found near this orthogonal line; together with the two outlier points it forms an L-shape polygon. The heading is then described by the longest line of the L-shape, which is a valid assumption for most traffic objects such as vehicles and cyclists.
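The MAR step can be sketched as follows. The sketch relies on the known property that one side of the minimum-area rectangle is collinear with an edge of the convex hull, so only the hull-edge directions are tested. The function names `convex_hull` and `min_area_rect` are hypothetical, and the L-shape correction is not part of this example.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def min_area_rect(points):
    """Return (area, angle_rad, length, width) of the minimum-area rectangle.

    angle_rad is the direction of the hull edge the rectangle is aligned with;
    length and width are the longer and shorter side, respectively.
    """
    hull = convex_hull(points)
    best = None
    for i in range(len(hull)):
        (x0, y0), (x1, y1) = hull[i], hull[(i + 1) % len(hull)]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(theta), math.sin(theta)
        # Hull coordinates in a frame aligned with this edge
        us = [x * c + y * s for x, y in hull]
        vs = [-x * s + y * c for x, y in hull]
        du, dv = max(us) - min(us), max(vs) - min(vs)
        if best is None or du * dv < best[0]:
            best = (du * dv, theta, max(du, dv), min(du, dv))
    return best
```

For a 3D box, the cluster height (highest minus lowest point) would simply be attached to the returned 2D rectangle.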

[Figure 4 panels: 1 to 3.]

Figure 4: MAR box fitting with embedded height.

[Figure 5 labels: width, length, max; panels 1 to 4.]

Figure 5: L-shape fitting for a more accurate bounding box fitting.

4 MULTI-OBJECT TRACKING

4.1 Tracking Algorithm

Object tracking refers to the problem of determining the number of objects of interest, their identities and their states based on sensor measurements. In this work, the states are position, velocity, yaw angle and yaw rate. The result of a tracking algorithm depends mainly on two parts: the data association and the tracking filter. Three important aspects have to be taken into account when selecting a data association method for urban scenarios: 1) handling multiple objects with different movement patterns, 2) handling cluttered environments and 3) computational efficiency.

Based on these requirements, the Joint Probabilistic Data Association (JPDA) (Bar-Shalom and Li, 1995) is chosen for this work. For the filtering part, typical approaches are based on Bayesian filtering, such as Kalman and Particle Filters, which use a single motion model to predict and update the object states. Even in the presence of a perfect motion model representing the object trajectory, there is no guarantee that the object always follows this model. The objects in urban traffic may have different movement patterns and switch between different manoeuvres described by different models. Therefore, a manoeuvre-aware tracking approach that is capable of dealing with multiple motion models has to be applied. Among different manoeuvre-aware target tracking algorithms, the Interacting Multiple Model (IMM) (Genovese, 2001) based on an optimal Kalman Filter shows promising performance. Besides an improvement in the filtering process, an additional advantage of IMM is that it enables dynamic classification.

Further advantages can be achieved by applying non-linear models, which in turn require non-linear estimation filters such as the Extended Kalman Filter (EKF), the Unscented Kalman Filter (UKF) or the Particle Filter (PF). (Gao et al., 2012) and (Djouadi et al., 2005) have shown that IMM-UKF performs better than IMM-EKF. Thus, a tracking algorithm based on the coupled filter JPDA-IMM-UKF is proposed in this work with three motion models: Constant Velocity, Constant Turn Rate, and Random Motion. It can deal with the tracking of multiple manoeuvring objects in a cluttered environment. It should be noted that there are already similar implementations of coupled filters, such as IMM-UK-PDA (Schreier et al., 2016a), IMM-UK-MHT (Blackman, 2004) and IMM-PF (Wang et al., 2015). However, JPDA-IMM-UKF has not yet been applied to LIDAR in urban scenarios, which is the contribution of this work.

The JPDA-IMM-UKF algorithm consists of four main steps: 1) Interaction Step, 2) Prediction-and-Measurement-Validation Step, 3) Data-Association-and-Model-Specific-Filtering Step and 4) Mode-Probability-Update-and-Combination Step. Compared to the closely related existing implementation PDA-IMM-UKF applied to RADAR (see (Schreier et al., 2016b)), JPDAF is used instead of the conventional PDAF, since we are performing multi-object tracking and considering the presence of clutter. The association probability $\beta_{jt}$ between each track $t$ and measurement $j$, considering all feasible joint association events $\theta$ across all measurements $Z^k$, is given by (Bar-Shalom and Li, 1995):

$$\beta_{jt} \triangleq P\{\theta_{jt} \mid Z^k\} = \sum_{\theta:\,\theta_{jt} \in \theta} P\{\theta \mid Z^k\} \qquad (1)$$

With the computed Kalman gain $K_k$ and the innovation term $z_{m,k} - \hat{z}_{k|k-1}$ of each of the $M_k$ validated measurements, the updated system state $\hat{x}_{k|k}$ becomes:

$$\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \nu_k \quad \text{with} \quad \nu_k = \sum_{m=1}^{M_k} \beta_{m,k} \left( z_{m,k} - \hat{z}_{k|k-1} \right) \qquad (2)$$
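The state update with a probability-weighted innovation can be illustrated numerically. The following NumPy sketch assumes that the Kalman gain, the predicted measurement and the association probabilities of one track have already been computed. The function name `pda_update` and all argument names are hypothetical, and the covariance update is omitted.

```python
import numpy as np

def pda_update(x_pred, z_pred, K, measurements, betas):
    """Update one track with the probability-weighted (combined) innovation.

    x_pred:       predicted state, shape (n,)
    z_pred:       predicted measurement, shape (d,)
    K:            Kalman gain, shape (n, d)
    measurements: validated measurements, shape (m, d)
    betas:        association probabilities of the m measurements, shape (m,)
                  (they may sum to less than 1; the rest is the miss probability)
    """
    z_pred = np.asarray(z_pred, dtype=float)
    Z = np.asarray(measurements, dtype=float).reshape(-1, z_pred.size)
    residuals = Z - z_pred                                 # one innovation per row
    combined = np.asarray(betas, dtype=float) @ residuals  # weighted sum of innovations
    return np.asarray(x_pred, dtype=float) + np.asarray(K, dtype=float) @ combined
```

With no validated measurement the combined innovation is zero, so the updated state equals the prediction.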

4.2 Track Management

There are two main objectives considered for track management in this work: 1) maturity check and pruning, and 2) dynamic classification.

4.2.1 Maturity Check and Pruning

Each track is regarded as "immature" once it has been initialised by the first association with a measurement. Its status is changed to "mature" after it has been seen for more than 3 consecutive time frames. As long as the track is not initialised, or is associated with a wrong measurement outside the valid range, its status is "invalid" and its state counter is set to zero. Once the track is initialised, its status is set to "initialising" and the state counter is incremented by one. After the track has been seen in multiple consecutive frames, it is regarded as mature: its status is set to "tracking" and the counter is incremented further. The track enters the "drifting" status, with a further increment of the counter, as soon as its measurement is lost at the next time step. If a feasible measurement is found in the following frame, the status is changed back to "tracking" and the counter is decremented. Otherwise, the counter is incremented for up to 3 frames, after which the status is reset to "immature".
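The status transitions can be summarised as a small state machine. The sketch below is one possible reading of the description, not the implementation of this work: it uses two explicit counters (`hits`, `misses`) instead of a single state counter, the class name `TrackStatus` is hypothetical, and it assumes that a drifting track is reset to "initialising" (immature) on its third consecutive miss.

```python
from dataclasses import dataclass

MATURE_AFTER = 3   # a track matures after more than 3 consecutive associated frames
MAX_DRIFT = 3      # assumed: the reset happens on the 3rd consecutive missed frame

@dataclass
class TrackStatus:
    status: str = "invalid"   # invalid -> initialising -> tracking <-> drifting
    hits: int = 0             # consecutive frames with a valid association
    misses: int = 0           # consecutive frames without one while mature

    def step(self, associated: bool) -> str:
        """Advance the status by one frame and return the new status."""
        if associated:
            self.hits += 1
            self.misses = 0
            if self.status in ("tracking", "drifting") or self.hits > MATURE_AFTER:
                self.status = "tracking"
            else:
                self.status = "initialising"
        else:
            self.hits = 0
            if self.status in ("tracking", "drifting"):
                self.misses += 1
                if self.misses < MAX_DRIFT:
                    self.status = "drifting"
                else:
                    # Lost for too long: back to immature
                    self.status, self.misses = "initialising", 0
            else:
                self.status = "invalid"
        return self.status
```

Usage: call `step(True)` for every frame in which the track receives a feasible measurement and `step(False)` otherwise.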

One of the undesirable traits of the JPDA filter is its tendency to coalesce tracks when neighbouring tracks share the same measurement. In order to prevent duplicate tracks associated with the same measurement, a hybrid pruning approach is developed based on track history and Euclidean distance. A track is considered a duplicate if the cumulative sum of the standard deviation is less than a predefined threshold called the history gating level. In this case, the track with the shorter lifetime is deleted. Furthermore, the Euclidean distance between each track pair is calculated and checked against the physically possible distance in urban scenes. If the distance is less than a threshold, the newer track is deleted.


4.2.2 Dynamic Classification

Classifying dynamic objects is a non-trivial task due to measurement and detection noise and occlusion, which cause the object frames to jump. Thus, velocity thresholds alone are not sufficient for dynamic classification, and further information has to be taken into account. Similar to (Schreier, 2017), the classification is done by incorporating both velocity thresholds and IMM probabilities. An object is classified as static when it has a zero or close-to-zero velocity together with a higher probability for the Random Motion model. Since the estimated velocity is not necessarily smooth, the average velocity over the previous n = 3 frames is used for the classification.

Figure 6 shows an intersection in an urban scene with several traffic objects waiting at a red traffic light, one traffic object turning right and a further traffic object crossing the intersection. It can be seen that the waiting traffic objects are classified as static, while the turning object is assigned a "dynamic" state. It can also be seen that the crossing vehicle is in a "drifting" state, since no measurement is available for association in this frame.
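The static/dynamic decision can be sketched as a simple rule. The function name `classify_dynamic`, the mode keys and the speed threshold of 0.5 m/s are assumptions for illustration; "higher probability" is interpreted here as the Random Motion mode being at least as probable as each of the other two modes.

```python
def classify_dynamic(speeds, mode_probs, speed_threshold=0.5, n=3):
    """Return "static" or "dynamic" for one mature track.

    speeds:     non-empty list of estimated speeds in m/s, most recent last
    mode_probs: dict of IMM mode probabilities with keys "cv", "ctr", "rm"
                (Constant Velocity, Constant Turn Rate, Random Motion)
    speed_threshold is an illustrative assumption, not a value from this work.
    """
    recent = speeds[-n:]
    mean_speed = sum(recent) / len(recent)  # average over the last n frames
    rm_dominant = mode_probs["rm"] >= max(mode_probs["cv"], mode_probs["ctr"])
    return "static" if mean_speed < speed_threshold and rm_dominant else "dynamic"
```

Averaging over the last frames keeps a single noisy velocity estimate from flipping the label of a waiting vehicle.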

[Figure 6 legend: Dynamic Track, Initializing Track, Static Track, Drifting Track.]

Figure 6: Track management: colour-coded classification of track maturity and dynamic classification. A "dynamic" track (green) indicates that the object is moving, a "static" track (blue) indicates that the object has stopped together with the ego-vehicle, an "initializing" track (yellow) indicates that the track is not yet mature since the object has just entered the sensor frame, and a "drifting" track (red) indicates that the track is about to be lost because it is entering the blind spot area.

5 BOUNDING BOX CORRECTION

The JPDA-IMM-UKF algorithm is designed to track the centre of the fitted bounding box, i.e. it is technically a position tracker. Since the bounding box dimensions are not among the filtered states, a further step is required in order to associate the correct geometric features with the box. The LIDAR sensor is not able to see the whole object in each frame due to occlusion caused by the target object itself (i.e. self-occlusion) or by a nearby blocking object. This may lead to over- or under-segmentation as well as to dimension changes over time. In order to tackle this problem, the result of the tracking algorithm is used to improve the bounding box fitting in three steps, which are explained in the following subsections.

5.1 Dimension Updating

The dimensions of bounding boxes may change due to object occlusions or changes in the observation position of the ego vehicle, as shown in Figure 7. A dimension history is therefore kept to monitor the dimension changes over time and to allow an update for mature tracks with "tracking" status under two main assumptions: 1) the bounding box is not allowed to shrink, i.e. to reduce its width and length, and 2) the bounding box is not allowed to have sudden changes in heading angle or moving direction. If more than one bounding box is associated with a single track, the one with the higher association probability is taken. In the case of equal probabilities, the one at the smallest Euclidean distance is chosen. Furthermore, it is checked whether the track and the associated measurement contain approximately the same number of points. For each mature track, the dimension information is stored until the next update.
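The two assumptions can be expressed as a small update rule. In the sketch below, the function name `update_box`, the dictionary layout and the heading gate of 30 degrees are hypothetical; the selection among several associated boxes and the point-count check are left out.

```python
import math

def update_box(track_box, meas_box, max_heading_jump_deg=30.0):
    """Merge a newly fitted box into the stored box of a mature track.

    Boxes are dicts with "length", "width" (metres) and "heading" (radians).
    The stored box never shrinks, and the heading is only replaced when the
    change stays below max_heading_jump_deg (an illustrative threshold).
    """
    updated = dict(track_box)
    updated["length"] = max(track_box["length"], meas_box["length"])
    updated["width"] = max(track_box["width"], meas_box["width"])
    # Signed heading difference wrapped to [-pi, pi)
    diff = (meas_box["heading"] - track_box["heading"] + math.pi) % (2 * math.pi) - math.pi
    if abs(diff) <= math.radians(max_heading_jump_deg):
        updated["heading"] = meas_box["heading"]
    return updated
```

A partially occluded measurement therefore cannot reduce the stored length or width, and a badly oriented fit cannot flip the stored heading.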

5.2 Perspective Correction

In addition to the dimensions of the bounding box, its position has to be updated based on the observation viewing angle. To this end, the new bounding box is shifted with respect to the nine reference points proposed by (Schueler et al., 2012), which are illustrated by the green dots in Figure 8. The reference points describe the best seen corner or edge and distinguish between the front, rear and side of the target. The shifting process is done under the assumption that the reference points of the old and the new bounding box overlap significantly. An instance of perspective correction is shown in Figure 9.

[Figure 7 labels: Ego Vehicle; Objects 1 to 3 with time-step labels; Occluding Object; Bounding Box; Center Point.]

Figure 7: Effect of occlusion on bounding box fitting.

[Figure 8 legend: Ego-vehicle, Reference point, Laser point.]

Figure 8: Changes in the dimension of a fitted bounding box of the target with respect to viewing angle. The dark points indicate the LIDAR points, the green points indicate the reference points and the red boxes show the fitted bounding boxes.

5.3 Over- and Under-segmentation Handling

Occluding objects in traffic scenes, especially in urban scenarios, may cause over- and under-segmentation. This error may also be caused by the clustering process if the kernel size is not well adjusted to the current scene. In order to solve this problem, the top-down approach from (Himmelsbach and Wuensche, 2012) is applied, in which the tracking information is re-used in clustering and box fitting. An over-segmentation can be identified by checking whether the newly found clusters overlap with the predicted position of the bounding box. In the case of significant overlaps, the clusters have to be merged (cf. Figure 10). An under-segmentation occurs when several predicted tracks fall within one clustered region. In this case, the kernel size for the region has to be reduced iteratively until the expected number of clusters is obtained.
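The merging of over-segmented clusters can be sketched as follows. For brevity the sketch uses axis-aligned boxes `(xmin, ymin, xmax, ymax)` instead of oriented boxes, and it treats a cluster as "significantly overlapping" when at least half of its area lies inside the predicted box. Both simplifications, the threshold `min_overlap` and the function names are assumptions for illustration.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the area of box a; boxes are (xmin, ymin, xmax, ymax)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    return (w * h) / area_a if w > 0 and h > 0 and area_a > 0 else 0.0

def merge_oversegmented(predicted_box, cluster_boxes, min_overlap=0.5):
    """Merge all cluster boxes that lie mostly inside the predicted track box.

    Returns (merged_box_or_None, remaining_boxes). Nothing is merged unless at
    least two clusters overlap the prediction.
    """
    inside = [b for b in cluster_boxes if overlap_ratio(b, predicted_box) >= min_overlap]
    if len(inside) < 2:
        return None, list(cluster_boxes)
    rest = [b for b in cluster_boxes if b not in inside]
    merged = (
        min(b[0] for b in inside), min(b[1] for b in inside),
        max(b[2] for b in inside), max(b[3] for b in inside),
    )
    return merged, rest
```

The under-segmentation case would go the other way: the affected region is re-clustered with a smaller kernel until the number of clusters matches the number of predicted tracks.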

[Figure 9 annotations: frame 21 shows the full-dimension bounding box; front-part occlusion then makes the detector box shrink, and the tracker box is shifted to the bottom edge.]

Figure 9: Instance of perspective correction. The bounding box of a self-occluded van located in front of the ego-vehicle is shifted downward so that the van dimension is retained.

6 EVALUATION

6.1 RAW Data and Ground Truth: KITTI Dataset

The proposed multi-object detection and tracking algorithm can be evaluated in real-world scenarios by using non-synthetic data. For this purpose, the KITTI datasets (Geiger et al., 2013) are used. This public benchmark provides recordings of a Velodyne HDL-64E sensor, among other sensors, in different urban scenarios in the city of Karlsruhe, Germany. It also includes real-world traffic situations ranging from highways over rural areas to inner-city scenes, with high-quality hand-labelled annotations.

In order to evaluate the relevant urban scenarios, the KITTI datasets within the category "City" are used. The collection consists of 10 different driving scenarios with a cumulative total of 2111 frames and 189 unique traffic objects. The composition of each dataset is given in Table 1 and Figure 11. Note that the datasets from the "City" category contain many vulnerable road users compared to other sets and are thus more representative of urban scenarios.

6.2 Benchmark Results and Discussion

The tracker performance is evaluated using the MOT16 benchmark method (Milan et al., 2016), which combines the CLEAR quantitative metrics (Bernardin and Stiefelhagen, 2008) with a set of Track Quality Measures (Wu and Nevatia, 2007). It is important to note that the datasets used are not uniform: the driving scenario, the object composition and the movement types may vary significantly from dataset to dataset. This is intentional, as tracking methods can be heavily overfitted to one particular dataset and thereby introduce evaluation bias (Milan et al., 2016). Therefore, the individual dataset evaluation results are a more representative indicator of the framework performance. Nevertheless, it is still useful to view the averaged scores shown in Table 2, which inform the reader about the overall tracker performance.

[Figure 10 labels: Occluding object, Over-segmented boxes; Occluding object, Merged Box.]

Figure 10: Example of over-segmentation handled by box merging.

Table 1: Evaluation Datasets.

Dataset   Frame count   Unique obj.   No. of boxes
0001              106            11            142
0002               75             2             45
0005              152            14            473
0009              441            82           1413
0013              142             4            101
0017              112             5             84
0018              268            12            196
0048               20             7             81
0051              436            40            381
0057              359            12            471
Sum              2111           189           3387

[Figure 11 chart: Object Class Distribution, stacked shares from 25% to 100% for the classes Car, Van, Pedestrian and Cyclist. Object counts per dataset:]

Class        D01  D02  D05  D09  D13  D17  D18  D48  D51  D57
Cyclist        1    2    1    0    0    0    0    0    2    0
Pedestrian     0    0    2    3    0    0    0    0    3    0
Van            0    0    3    0    1    0    2    1   11    1
Car           10    0    8   77    2    4    9    6   23    9

Figure 11: Distribution of object classes across all evaluation datasets.

The Multi Object Tracking Accuracy (MOTA) score shows that the tracker has a reasonably high degree of accuracy, with an overall score of 86%. The score is lowered mainly by the number of False Negatives (FN), since the numbers of False Positives (FP) and ID Switches (IDSW) are comparatively low. The Multi Object Tracking Precision (MOTP) score is limited to 91%, which is an expected result: even with perfect tracking, only partial dimensional information can be derived when an object enters the sensor frame from a far distance, so the tracking precision is always low for the first few time frames.
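As a worked example, the CLEAR definition MOTA = 1 - (FN + FP + IDSW) / GT can be applied to the summed counts of Table 2. The function name `mota` is hypothetical. Note that this pooled value weights every object instance equally and therefore need not coincide with the reported 86.12%, which appears to be the average of the per-dataset scores.

```python
def mota(fn, fp, idsw, gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / number of ground-truth instances."""
    return 1.0 - (fn + fp + idsw) / gt

# Summed counts reported in Table 2
pooled = mota(fn=406, fp=65, idsw=75, gt=3387)
print(f"pooled MOTA = {pooled:.2%}")  # prints "pooled MOTA = 83.88%"
```

The false negatives alone account for 406 of the 546 errors, which matches the observation that the score is lowered mainly by missed objects.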

Table 2: Overall evaluation result.

(a) CLEAR Metrics

Metric                 Value
MOTA                   86.12% ± 6.00
MOTP                   91.01% ± 5.03
FP (sum)               65
FN (sum)               406
IDSW (sum)             75
Total Obj. Instances   3387
Total Frames           2111

(b) Track Quality Measures

Metric           Value
Mostly Tracked   70.64% ± 17.47
Mostly Lost      9.33% ± 8.10
Recall rate      88.92% ± 10.18
Precision rate   98.43% ± 2.73
Fragmentation    211

A significant deviation across the datasets can be seen, because the datasets contain varying urban scenarios. Nevertheless, the average results highlight that the tracker yields a higher rate of Mostly Tracked (MT) than Mostly Lost (ML) objects. The recall rate (i.e. sensitivity) and the precision (i.e. positive predictive value) indicate that the tracker hypotheses possess a high degree of relevance to the actual objects, where the lower recall rate is consistent with the total number of FNs. Finally, the Fragmentation count is a subset of the FNs; here we see that more than half of the FNs (211 of 406) are caused by lost tracks. Note that lost tracks are recoverable (i.e. not all FNs are the result of a complete detection failure across all frames). The tracking performance (MOTP and MOTA) for the individual datasets is shown in Figure 12, the Quality Measures are shown in Figure 13, and the base metric scores are listed in Table

Some datasets are discussed in detail to give the reader a physical interpretation of the results. In Dataset 0005, 78% of the tracks are considered MT, while a relatively large number of ML tracks is present. Here the ego vehicle is moving along a curved urban road, which causes a constant change of the sensor reference frame. The combination of self-occlusion (cf. Figure 9 for a visual depiction) and a relatively fast turning rate of the target objects increases the uncertainty of the targets' spatial positions. Therefore, reduced tracking accuracy and fragmentation errors can be seen in this scenario. Dataset 0009 represents a typical complex detection and tracking scenario: it has a large number of frames and a great number of unique objects compared to the other datasets. In this dataset, the ego vehicle makes a 90-degree turn (i.e. a sudden change of sensor frame) and stops at a 4-way junction with a persistently occluding object. The situation can be observed in Figure 10.

Nevertheless, handling uncertainties is one of the main contributions of this work: the MOTA score reflects that the use of the JPDA filter enables the tracker to form hypotheses with sufficient accuracy. The results are obtained despite the cluttered environment and manoeuvring targets. A performance reduction is found in certain scenarios (Dataset 0017 and 0051). However, since more than 80% of the tracker hypotheses are considered MT with only 5% considered ML, adequate robustness against persistent occlusion and self-occlusion of the target objects can be found, regardless of sensor frame changes, turning cars and other occluding objects.

[Figure 12: bar chart "Multi Object Tracking Accuracy and Precision Across All Datasets" (0% to 100%), with the data labels per dataset:]

Dataset  MOTA    MOTP
D01      91.61%  94.25%
D02      93.33%  85.10%
D05      89.73%  91.87%
D09      82.59%  89.64%
D13      96.04%  97.05%
D17      80.95%  93.86%
D18      79.08%  95.22%
D48      84.34%  93.83%
D51      81.36%  88.38%
D57      82.17%  80.90%

Figure 12: Per dataset MOTA and MOTP scores.

[Figure 13: bar chart "Multi Object Tracking Quality Measure" (0% to 100%) with the series MT, ML and PT for datasets D01, D02, D05, D09, D13, D17, D18, D48, D51 and D57.]

MT data labels (D01 to D57): 90.91%, 100.00%, 78.57%, 80.49%, 50.00%, 40.00%, 66.67%, 71.43%, 70.00%, 63.64%.
ML data labels (labelled bars only): 9.09%, 14.29%, 4.88%, 25.00%, 8.33%, 15.00%, 18.18%.

Figure 13: Per-dataset Track Quality Measures. "PT" refers to Partially Tracked, i.e. tracks classified as neither MT nor ML.
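As a cross-check between the per-dataset scores plotted in Figure 12 and the averages in Table 2, the following sketch recomputes the mean and the spread of the ten MOTA and MOTP data labels. Treating the "±" value as the sample standard deviation is an assumption on our side; the helper name `summarise` is illustrative.

```python
from statistics import mean, stdev

# Data labels of Figure 12, in the order in which the datasets are listed.
mota_scores = [91.61, 93.33, 89.73, 82.59, 96.04, 80.95, 79.08, 84.34, 81.36, 82.17]
motp_scores = [94.25, 85.10, 91.87, 89.64, 97.05, 93.86, 95.22, 93.83, 88.38, 80.90]


def summarise(scores):
    """Return the mean and the sample standard deviation (n - 1 in the denominator)."""
    return mean(scores), stdev(scores)


for name, scores in (("MOTA", mota_scores), ("MOTP", motp_scores)):
    avg, spread = summarise(scores)
    print(f"{name}: {avg:.2f}% ± {spread:.2f}")
# MOTA: 86.12% ± 6.00
# MOTP: 91.01% ± 5.03
```

Both lines agree with Table 2(a), which supports reading the table entries as per-dataset averages with a sample standard deviation.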

VEHITS 2018 - 4th International Conference on Vehicle Technology and Intelligent Transport Systems


Table 3: CLEAR comparison of state-of-art 3D LIDAR trackers.

Method                                        MOTA    MOTP       FN      FP
Proposed Framework                            86.12%  n/a in m   11.89%  1.92%
Tracking Circle (Ye et al., 2016) (averaged)  86.5%   < 0.2 m    3.5%    8.0%
Energy-based (Xiao et al., 2016)              84.2%   < 0.12 m   5.8%    2.77%
BUTD (Xiao et al., 2016)                      89.1%   < 0.16 m   2.6%    7.6%
Generative (Kaestner et al., 2012)            77.7%   < 0.14 m   8.5%    10.1%

In summary, the benchmarking process yields a better understanding of the tracker performance in a large variation of urban scenarios with different classes of traffic objects. Cars, vans and pedestrians are tracked reliably by the proposed framework, with an average MOTA above 86%. The Quality Measures support the scores of the CLEAR metrics: MT tracks outnumber ML tracks by a significant margin in all datasets, including datasets with complex scenarios. These scenarios contain constant sensor frame changes, persistently occluding objects and actively manoeuvring targets. Note that the tracking accuracy and performance may decrease as the number of objects increases. However, in this situation the Quality Measures indicate that the majority of objects are still covered adequately by the tracker.
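How a single track ends up in the MT, ML or PT category can be sketched as follows. The 80% and 20% coverage thresholds are the convention commonly used with the MOT benchmarks and are assumed here, since the thresholds are not restated in this section; the function name and the example tracks are hypothetical.

```python
from collections import Counter


def classify_track(tracked_frames: int, lifetime_frames: int,
                   mt_threshold: float = 0.8, ml_threshold: float = 0.2) -> str:
    """Label a ground-truth track by the share of its lifetime that was tracked.

    MT: covered for at least mt_threshold of its frames (assumed 80%).
    ML: covered for less than ml_threshold of its frames (assumed 20%).
    PT: everything in between (Partially Tracked).
    """
    if lifetime_frames <= 0:
        raise ValueError("lifetime_frames must be positive")
    coverage = tracked_frames / lifetime_frames
    if coverage >= mt_threshold:
        return "MT"
    if coverage < ml_threshold:
        return "ML"
    return "PT"


# Hypothetical tracks: (frames with a matched hypothesis, frames in ground truth).
tracks = [(95, 100), (80, 100), (40, 100), (10, 100)]
labels = Counter(classify_track(t, n) for t, n in tracks)
shares = {k: labels[k] / len(tracks) for k in ("MT", "PT", "ML")}
print(shares)  # {'MT': 0.5, 'PT': 0.25, 'ML': 0.25}
```

Because the label depends only on coverage, a track that is lost and later recovered can still count as MT, while each interruption adds to the fragmentation count.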

7 COMPARISON TO STATE-OF-THE-ART

The use of both established MOT metrics and public datasets also enables an objective comparison with the performance of state-of-the-art trackers. The utilised metrics, namely MOTP, MOTA, MT, ML, FN and FP, are common measures of tracking performance. Publicly ranked benchmarks (see the KITTI Object Tracking Evaluation 2012 (Geiger et al., 2013) and the 2017 MOT challenge (Leal-Taixé et al., 2017)) use these metrics, as do a number of MOT-related publications such as (Zheng et al., 2012; Bernardin and Stiefelhagen, 2008; Piao et al., 2016; Wen et al., 2015).

Compared to camera-based tracking, notably fewer LIDAR publications place significant emphasis on evaluation with established metrics. Some notable publications that use both a Velodyne sensor and the CLEAR metrics are (Ye et al., 2016), which uses a geometry-based tracking circle method, (Xiao et al., 2016), which uses a point assignment task based on an energy function, (Spinello et al., 2011), which uses a Bottom-Up Top-Down Detector (BUTD), and (Kaestner et al., 2012), which uses Generative Object Detection and Tracking. The comparison results can be seen in Table 3. These works use different criteria to compute the MOTP. Our approach takes into account the position and dimensional integrity of the tracked objects, thus the bounding box overlap ratio is used. The other works consider only the precision of the centre point of a detected object, so their MOTP is based on the Euclidean distance error instead. In addition, only the work of (Ye et al., 2016) deals with a sensor mounted on a moving car; the other three use the dataset recorded at the ETH Zurich Polyterrasse, which has a static reference frame in a university canteen setting and is populated mainly with pedestrians. While the results of (Ye et al., 2016) would be the best control comparison to this work, that work uses only 2 datasets with unspecified ground truth details.
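The difference between the two MOTP criteria can be made concrete with a small sketch. It uses axis-aligned 2D boxes in metres as a simplifying assumption (the overlap computation used in the evaluation may involve oriented or 3D boxes), and all function names and box values are hypothetical.

```python
import math

Box = tuple  # (x_min, y_min, x_max, y_max) in metres, axis-aligned


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def centre_distance(a: Box, b: Box) -> float:
    """Euclidean distance between the box centres."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


ground_truth = (0.0, 0.0, 4.0, 2.0)   # a 4 m x 2 m car
half_seen = (2.0, 0.0, 4.0, 2.0)      # only the rear half is observed
slightly_off = (0.1, 0.0, 4.1, 2.0)   # full extent, shifted by 0.1 m

for name, box in (("half_seen", half_seen), ("slightly_off", slightly_off)):
    print(name, round(overlap_ratio(ground_truth, box), 3),
          round(centre_distance(ground_truth, box), 3))
```

An overlap-based MOTP averages `overlap_ratio` over all matched pairs, so a partially observed object lowers the score even when its position is well estimated. A distance-based MOTP averages `centre_distance` instead and ignores the object dimensions, which is why the two columns of scores cannot be compared directly.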

A general overview indicates that our proposed approach has an accuracy comparable to the state of the art (3% differences), but with a considerably larger percentage of FN (11.89% vs 2.6% for BUTD). In the previous section, it was found that a large share of the FNs is contributed by the datasets with complex scenarios (mainly Dataset 0017). Nevertheless, if we inspect the other datasets individually, the FN rate is on par (2-7%) with the other approaches. Therefore, a comparison on standardised datasets is needed to show whether the compared state-of-the-art works exhibit a similar performance reduction in significantly complex urban situations.

8 CONCLUSION

An integrated multi-object detection and tracking framework has been introduced in this paper. The framework is designed especially for environment perception in urban scenarios, with the associated uncertainties of 3D LIDAR sensor measurements. However, the framework can be used with other sensor technologies as well.

The detector is able to cope with occlusion and to handle under- and over-segmentation by receiving additional information from the tracker. The tracking algorithm itself employs probabilistic data association and filtering based on a coupled IMM-UKF-JPDA filter, which allows manoeuvre-aware multi-object tracking under uncertainties in a cluttered environment. Moreover, the geometric properties of the tracks are updated in a post-processing step by means of computationally inexpensive rule-based filtering and the use of the box frame history.

Finally, the framework is evaluated with established MOT16 metrics, which show that the tracking performance is favourable in a variety of pre-recorded real-world urban scenarios. Since the framework is designed for, and found to run in, real time (under 100 ms), we expect it to be applicable to autonomous vehicles. However, the performance of this framework can be increased in future work by further code optimisation, by applying parallel programming techniques and by a further fitting algorithm for V- and U-shaped traffic objects.

REFERENCES

Bar-Shalom, Y. and Li, X. R. (1995). Multitarget-

multisensor Tracking: Principles and Techniques.

Yaakov Bar-Shalom.

Bernardin, K. and Stiefelhagen, R. (2008). Evaluating mul-

tiple object tracking performance: The CLEAR MOT

metrics. Eurasip J. Image Video Process., 2008.

Blackman, S. (2004). Multiple hypothesis tracking for multi-

ple target tracking. IEEE Aerosp. Electron. Syst. Mag.,

19(1):5–18.

Chen, T., Dai, B., Liu, D., Fu, H., Song, J., and Wei, C. (2015).

Likelihood-Field-Model-Based Vehicle Pose Estima-

tion with Velodyne. IEEE Conf. Intell. Transp. Syst.

Proceedings, ITSC, 2015-Octob:296–302.

Choi, J., Ulbrich, S., Lichte, B., and Maurer, M. (2013).

Multi-Target Tracking using a 3D-Lidar sensor for au-

tonomous vehicles. IEEE Conf. Intell. Transp. Syst.

Proceedings, ITSC, (Itsc):881–886.

Djouadi, M., Sebbagh, A., and Berkani, D. (2005). IMM-

UKF algorithm and IMM-EKF algorithm for track-

ing highly maneuverable target: a comparison. Proc.

7th WSEAS Int. Conf. Autom. Control. Model. Simul.,

pages 283–288.

Freeman, H. and Shapira, R. (1975). Determining the

minimum-area encasing rectangle for an arbitrary

closed curve. Commun. ACM, 18(7):409–413.

Gao, L., Xing, J., Ma, Z., Sha, J., and Meng, X. (2012). Im-

proved IMM algorithm for nonlinear maneuvering tar-

get tracking. Procedia Eng., 29:4117–4123.

Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. Int. J. Rob. Res., 32(11):1231–1237.

Genovese, A. F. (2001). The interacting multiple model algorithm for accurate state estimation of maneuvering targets. Johns Hopkins APL Tech. Dig. (Applied Phys. Lab.), 22(4):614–623.

Himmelsbach, M., von Hundelshausen, F., and Wuensche,

H. (2010). Fast segmentation of 3D point clouds for

ground vehicles. Iv, pages 560–565.

Himmelsbach, M. and Wuensche, H. J. (2012). Tracking and

classiﬁcation of arbitrary objects with bottom-up/top-

down detection. IEEE Intell. Veh. Symp. Proc., pages

577–582.

Kaestner, R., Maye, J., Pilat, Y., and Siegwart, R. (2012).

Generative object detection and tracking in 3D range

data. 2012 IEEE Int. Conf. Robot. Autom., pages 3075–

3081.

Leal-Taixé, L., Milan, A., Reid, I., Roth, S., and Schindler, K. (2017). MOT17 Results.

Levinson et al. (2011). Towards fully autonomous driving:

Systems and algorithms. IEEE Intell. Veh. Symp. Proc.,

(Iv):163–168.

Luo, Z., Habibi, S., and Mohrenschildt, M. (2016). Li-

DAR Based Real Time Multiple Vehicle Detection and

Tracking. 10(6):1083–1090.

Milan, A., Leal-Taixe, L., Reid, I., Roth, S., and Schindler,

K. (2016). MOT16: A Benchmark for Multi-Object

Tracking. pages 1–12.

Pfaltz, J. L. (1966). Sequential Operations in Digital Picture

Processing. J. ACM, 13(4):471–494.

Piao, S., Sutjaritvorakul, T., and Berns, K. (2016). Compact

Data Association in Multiple Object Tracking: Pedes-

trian Tracking on Mobile Vehicle as Case Study. 9th

IFAC Symp. Intell. Auton. Veh., 49(15):175–180.

Rubio, D. O., Lenskiy, A., and Ryu, J. H. (2013). Connected

components for a fast and robust 2D lidar data segmen-

tation. Proc. - Asia Model. Symp. 2013 7th Asia Int.

Conf. Math. Model. Comput. Simulation, AMS 2013,

(September 2015):160–165.

Schreier, M. (2017). Bayesian environment representation,

prediction, and criticality assessment for driver assis-

tance systems. PhD thesis.

Schreier, M., Willert, V., and Adamy, J. (2016a). Compact

Representation of Dynamic Driving Environments for

ADAS by Parametric Free Space and Dynamic Object

Maps. IEEE Trans. Intell. Transp. Syst., 17(2):367–

384.

Schreier, M., Willert, V., and Adamy, J. (2016b). Compact

Representation of Dynamic Driving Environments for

ADAS by Parametric Free Space and Dynamic Object

Maps. IEEE Trans. Intell. Transp. Syst., 17(2):367–

384.

Schueler, K., Weiherer, T., Bouzouraa, M. E., and Hofmann,

U. (2012). 360 Degree multi sensor fusion for static

and dynamic obstacles. 2012 IEEE Intell. Veh. Symp.,

pages 692–697.

Spinello, L., Luber, M., and Arras, K. O. (2011). Tracking

people in 3D using a bottom-up top-down detector.

Proc. - IEEE Int. Conf. Robot. Autom., pages 1304–

1310.

Velodyne (2007). Velodyne’s HDL-64E: A High Deﬁnition

Lidar Sensor for 3-D Applications. 2007. White Pap.,

page 7.

Wang, D. Z., Posner, I., and Newman, P. (2015). Model-

free detection and tracking of dynamic objects with

2D lidar. Int. J. Rob. Res., 34(7):1039–1063.


Wen, L., Du, D., Cai, Z., Lei, Z., Chang, M.-C., Qi, H., Lim,

J., Yang, M.-H., and Lyu, S. (2015). UA-DETRAC: A

New Benchmark and Protocol for Multi-Object Detec-

tion and Tracking.

Wojke, N. and Haselich, M. (2012). Moving Vehicle Detec-

tion and Tracking in Unstructured Environments. 2012

IEEE Int. Conf. Robot. Autom., pages 3082–3087.

Wu, B. and Nevatia, R. (2007). Detection and tracking of

multiple, partially occluded humans by Bayesian com-

bination of edgelet based part detectors. Int. J. Comput.

Vis., 75(2):247–266.

Xiao, W., Vallet, B., Schindler, K., and Paparoditis, N. (2016).

Simultaneous Detection and Tracking of Pedestrian

From Panoramic Laser Scanning Data. ISPRS

Ann. Photogramm. Remote Sens. Spat. Inf. Sci., III-

3(July):295–302.

Ye, Y., Fu, L., and Li, B. (2016). Object Detection and Track-

ing Using Multi-layer Laser for Autonomous Urban

Driving.

Zhang, L., Li, Q., Li, M., Mao, Q., and Nüchter, A. (2011). Multiple Vehicle-like Target Tracking Based on the Velodyne LiDAR. (2005).

Zheng, W., Thangali, A., Sclaroff, S., and Betke, M. (2012).

Coupling detection and data association for multi-

ple object tracking. Comput. Vis. Pattern Recognit.

(CVPR), 2012 IEEE Conf., pages 1948–1955.
