typical distribution of moving objects in real scenes.
The proposed system extends the feature tracking
type of visual odometry system (Klein and Murray,
2007; Badino et al., 2013; Mur-Artal and Tardós,
2017; Persson et al., 2015; Cvišić et al., 2017) to
detect and estimate multiple trajectories. The
implementation is based on CV4X (Persson et al., 2015),
which achieved state-of-the-art results on the KITTI
egomotion benchmark and can thus provide reliable
egomotion estimates. We now extend this system to also
estimate trajectories of independently moving objects.
The proposed system uses geometric consistency over
time to detect objects. This takes the form of repro-
jection error minimization to assign tracks and esti-
mate object states over time. As the world is also de-
tected as a rigidly moving object, a useful by-product
of the system is the ego-motion trajectory. We call the
system described in Section 3 Sequential Hierarchi-
cal Ransac Estimation (Shire). Shire performs well
in practice, in particular for nearby and fast moving
objects. These objects are exactly those that are im-
portant for dynamic obstacle avoidance. Shire is real-
time (30 Hz) on a standard desktop CPU.
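The reprojection-error-based assignment described above can be sketched as follows. This is a minimal illustration under an assumed pinhole model and hypothetical names, not the actual Shire implementation:

```python
import numpy as np

# Sketch: each track's 3D point is transformed by every candidate
# rigid-motion hypothesis, and the track is assigned to the hypothesis
# with the smallest reprojection error, or left unassigned if none
# explains it well. The world itself is one of the hypotheses, so the
# egomotion trajectory falls out as a by-product.

def project(K, X):
    """Pinhole projection of 3D points X (N,3) with intrinsics K (3,3)."""
    x = (K @ X.T).T                 # (N, 3) homogeneous image points
    return x[:, :2] / x[:, 2:3]     # (N, 2) pixel coordinates

def assign_tracks(K, X_prev, x_curr, hypotheses, thresh_px=2.0):
    """X_prev: (N,3) 3D points; x_curr: (N,2) observed pixels;
    hypotheses: list of (R, t) rigid motions. Returns (N,) labels,
    -1 for tracks no hypothesis explains within thresh_px."""
    N = X_prev.shape[0]
    errs = np.full((N, len(hypotheses)), np.inf)
    for j, (R, t) in enumerate(hypotheses):
        X_pred = X_prev @ R.T + t          # transform by hypothesis j
        errs[:, j] = np.linalg.norm(project(K, X_pred) - x_curr, axis=1)
    labels = errs.argmin(axis=1)
    labels[errs.min(axis=1) > thresh_px] = -1   # outliers / new objects
    return labels

K = np.array([[700.0, 0, 320], [0, 700.0, 240], [0, 0, 1.0]])
X = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 5.0]])
static = (np.eye(3), np.zeros(3))                  # stationary world
moving = (np.eye(3), np.array([0.5, 0.0, 0.0]))    # translating object
x_obs = project(K, X + np.array([0.5, 0.0, 0.0]))  # both points moved
print(assign_tracks(K, X, x_obs, [static, moving]))  # -> [1 1]
```

Here both tracks are explained by the moving hypothesis, so they would be grouped into one rigidly moving object.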
The system is evaluated using a novel dataset,
which we make available (Persson, 2020). Ideally
we would have liked to evaluate our method by com-
paring the trajectories of the estimated IMOs to their
ground truth. To the best of our knowledge no suitable
dataset is available for this, nor can we generate such
ground truth for our dataset. The dataset is collected
from our experimental vehicle, when driving in real
traffic. It covers inner city, country road, and highway.
We evaluate using bounding box instance segmenta-
tion of moving objects. This acts as a proxy for de-
tection, ID persistence, and estimation. This dataset
has been preprocessed and we provide rectified im-
ages, estimated disparity and semantic segmentation
as well as manually annotated IMOs. At the dataset
link, you can also find evaluation software, the Shire
code, and an example video.
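As a concrete illustration of the bounding-box proxy evaluation, the sketch below greedily matches detected boxes to annotated boxes by intersection-over-union (IoU). The threshold and the greedy strategy are assumptions for illustration, not the exact protocol shipped with the dataset:

```python
# Greedy one-to-one matching of detected IMO boxes to annotated boxes
# by IoU; matched pairs count as detections, and keeping the same pair
# across frames would indicate ID persistence.

def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def match_boxes(detections, annotations, thresh=0.5):
    """Greedy one-to-one matching; returns list of (det_idx, ann_idx)."""
    pairs = sorted(((iou(d, a), i, j)
                    for i, d in enumerate(detections)
                    for j, a in enumerate(annotations)), reverse=True)
    used_d, used_a, matches = set(), set(), []
    for score, i, j in pairs:
        if score < thresh:
            break
        if i not in used_d and j not in used_a:
            used_d.add(i); used_a.add(j)
            matches.append((i, j))
    return matches

dets = [(10, 10, 50, 50), (100, 100, 140, 140)]
anns = [(12, 12, 52, 52), (300, 300, 340, 340)]
print(match_boxes(dets, anns))  # [(0, 0)] -- only the first box overlaps
```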
2 RELATED WORK
IMO detection and trajectory estimation could be ap-
proached by scene-flow methods. Scene-flow is the
observed 3D motion per pixel. These methods im-
plicitly segment the flow into rigid objects. Scene-
flow approaches can be categorized as classic or deep
learning based methods.
Classic methods are well represented by Piece-
wise Rigid Scene Flow (Vogel and Roth, 2015). The
method uses classic flow, classic stereo and (clas-
sic) superpixel-segmentation. These are used to form
a regularized sceneflow cost function. This cost is
then optimized over a single stereo image pair using
Gauss-Newton (GN). The recent Deep Rigid Scene-
flow (Ma et al., 2019) method is similar in many ways.
The method uses a flow network, a stereo network,
and a semantic-segmentation network. The networks
are used to form a similar cost function. The cost
is then optimized over a single stereo image pair us-
ing GN, unrolled for differentiability. Replacing each
component with its deep learning based variants re-
quires supervised training. This implies the need for
large datasets with ground truth for both semantics
and 3D correspondences. The cost of such data is the
main disadvantage of the modern method. However,
comparison on the KITTI sceneflow benchmark shows
that the latter method significantly outperforms the
former. While the deep learning approach is faster,
at 746 ms compared to 3 minutes on 0.5 Mpixel
images, both methods are still far from real time. Both
methods could hypothetically be extended to use mul-
tiple images. However, it is unclear how to do so with-
out further increasing the computational cost. The
methods are potentially useful as input to the pro-
posed system. We conclude that they are currently
too slow for our target application.
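Both methods above minimize their regularized cost with Gauss-Newton. The sketch below shows a generic GN iteration on a toy nonlinear least-squares problem (fitting an exponential); the real scene-flow costs couple flow, stereo, and segmentation terms and are far larger, so this is only meant to convey the update rule:

```python
import numpy as np

# Generic Gauss-Newton: linearize the residual r(x) around the current
# estimate and solve the normal equations for the step dx.

def gauss_newton(residual, jacobian, x0, iters=25):
    x = x0.copy()
    for _ in range(iters):
        r, J = residual(x), jacobian(x)
        # Solve J^T J dx = -J^T r for the GN step dx.
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# Toy problem: fit y = a * exp(-b * t) to noiseless data from a=3, b=1.5.
t = np.linspace(0.0, 2.0, 20)
y = 3.0 * np.exp(-1.5 * t)
res = lambda x: x[0] * np.exp(-x[1] * t) - y
jac = lambda x: np.stack([np.exp(-x[1] * t),
                          -x[0] * t * np.exp(-x[1] * t)], axis=1)
x = gauss_newton(res, jac, np.array([1.0, 1.0]))
print(np.round(x, 3))   # converges to [3.0, 1.5]
```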
Another deep learning approach which targets a
similar problem is MOTSFusion (Luiten et al., 2019).
This method aims to identify and separate cars, both
still and moving. The method uses deep optical
flow, bounding-box detections, semantic segmentation,
and deep stereo. It also performs ego-
motion compensation using ORBSLAM. Next, they
use a per-track, geometrically aware bootstrap
tracking method to associate tracks over time.
This results in strong ID-propagation
performance for the object detections. The method
also achieves good results on the MOTS benchmark.
However, the purpose here is tracking rather than 6D
trajectory estimation, though the method could be
extended or used as input. As with Deep Rigid
Scene-flow, the issues are general moving objects
and computational cost. MOTSFusion operating at
0.5MPixel takes 440ms per frame, and while this may
be applicable in some cases, it is a long time in colli-
sion avoidance.
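A geometry-aware ID-association step in the spirit of the bootstrap tracking above can be sketched as follows. The constant-velocity prediction, the gating distance, and all names are illustrative assumptions, not MOTSFusion's actual method:

```python
import math
import itertools

next_id = itertools.count(1)   # source of fresh track IDs (illustrative)

def propagate_ids(tracks, detections, gate=2.0):
    """tracks: {id: (pos, vel)} with 3D tuples; detections: list of 3D
    positions. Each track predicts its position with a constant-velocity
    model; detections are greedily claimed by the nearest prediction
    within the gate, and unmatched detections start new tracks."""
    preds = {tid: tuple(p + v for p, v in zip(pos, vel))
             for tid, (pos, vel) in tracks.items()}
    pairs = sorted((math.dist(pred, det), tid, i)
                   for tid, pred in preds.items()
                   for i, det in enumerate(detections))
    updated, used = {}, set()
    for d, tid, i in pairs:
        if d > gate:
            break
        if tid not in updated and i not in used:
            old_pos = tracks[tid][0]
            vel = tuple(n - o for n, o in zip(detections[i], old_pos))
            updated[tid] = (detections[i], vel)
            used.add(i)
    for i, det in enumerate(detections):       # unmatched -> new track
        if i not in used:
            updated[next(next_id)] = (det, (0.0, 0.0, 0.0))
    return updated

tracks = {0: ((0.0, 0.0, 10.0), (1.0, 0.0, 0.0))}   # one track, moving right
dets = [(1.1, 0.0, 10.0), (5.0, 0.0, 20.0)]          # old object + new object
print(propagate_ids(tracks, dets))
```

Track 0 keeps its ID via the nearby detection, while the far detection starts a fresh track.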
The good performance of the deep learning ap-
proaches comes at a price. Relying on bounding box
and/or segmentation limits the systems to the classes
for which data is available. For new cameras or scene
content, the deep learning methods require finetuning,
which requires ground truth. Even if only to adapt to
the minor differences in resolution, scene and image
characteristics, this requires ground truth data we do
not have. By contrast, the classic methods typically
Independently Moving Object Trajectories from Sequential Hierarchical Ransac
723