pairs instead of associating 2D features with depth information from a depth map. Following their approach, we utilize features with and without depth to maximize the information gain. However, for features without depth information we propose a slightly modified version of their cost function, which returns a metric error and is essentially based on the epipolar constraint (Hartley and Sturm, 1997). This enables our algorithm to be extended by tightly coupled sensor fusion, since it allows a normalization of the error terms. In contrast to the algorithm of Zhang et al., we do not implement windowed bundle adjustment (BA) and focus on pure frame-to-frame motion estimation, which can be considered a pairwise BA algorithm. This keeps processing times low and enables us to run our algorithm on board a robot with low computational power, in our case an unmanned aerial vehicle. BA is a refinement step that improves the results of the initial step of computing VO from two consecutive stereo image pairs. For this reason, we focus in this paper on this initial step, which can of course be further improved by applying BA on top of the improvements we present here. We evaluate the proposed algorithm extensively on two different datasets and present evaluation results that are similar to those of Zhang et al.
The rest of this paper is structured as follows. In Section 2 we present the related work. After that, we review important basic knowledge in Section 3 before we give an overview of our algorithm in Section 4. In Section 5 we introduce our algorithm in more detail, followed by the evaluation of our approach on two different datasets in Section 6. Finally, our work is summarized in Section 7.
2 RELATED WORK
A lot of work on VO has been done by the Robotics and Perception Group at ETH Zürich, led by Davide Scaramuzza. His two-part tutorial gives an overview of common algorithms used in VO (Scaramuzza and Fraundorfer, 2011; Fraundorfer and Scaramuzza, 2012). These tutorials also explain visual SLAM techniques, which are related to VO.
Cvišić and Petrović recently achieved very accurate results with their method on the KITTI odometry benchmark (Cvišić and Petrović, 2015). They focus on careful feature selection by making use of two different keypoint detection algorithms. This allows them to use a high feature acceptance threshold while still keeping enough features for motion estimation. In contrast to our work, their algorithm estimates rotation and translation separately: first the rotation matrix is extracted from an estimated essential matrix, and then the translation vector is estimated by iteratively minimizing the reprojection error with the rotation matrix held fixed.
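To make this decoupling concrete, the following is a minimal sketch of how a rotation can be recovered from 2D-to-2D correspondences via the essential matrix using OpenCV. It illustrates the general technique, not the authors' implementation; the matched point arrays and the intrinsic matrix K are assumed inputs.

```python
import cv2
import numpy as np

def recover_rotation(pts_prev, pts_curr, K):
    """Illustrative sketch: recover only the rotation between two frames
    from matched 2D feature locations via the essential matrix.
    pts_prev, pts_curr: Nx2 float arrays of pixel coordinates (assumed given).
    K: 3x3 camera intrinsic matrix."""
    # Estimate the essential matrix with RANSAC to suppress outliers.
    E, inlier_mask = cv2.findEssentialMat(pts_prev, pts_curr, K,
                                          method=cv2.RANSAC, threshold=1.0)
    # Decompose E and pick the (R, t) pair that places the points in front
    # of both cameras (cheirality check).
    _, R, t, _ = cv2.recoverPose(E, pts_prev, pts_curr, K, mask=inlier_mask)
    # Only R is used here; the translation would subsequently be refined by
    # minimizing the reprojection error with R held fixed.
    return R
```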
Buczko et al. also focused on outlier rejection in order to achieve high accuracy during motion estimation. The authors presented a new iterative outlier rejection scheme (Buczko and Willert, 2016b) and a new flow-decoupled normalized reprojection error (Buczko and Willert, 2016a). Their method likewise achieves outstanding translation and rotation accuracy on the KITTI odometry benchmark.
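As a rough illustration of the general idea behind reprojection-based outlier rejection (a generic sketch only, not the specific scheme of Buczko and Willert; estimate_pose and project are assumed caller-supplied helpers):

```python
import numpy as np

def iterative_outlier_rejection(pts3d, pts2d, estimate_pose, project,
                                threshold_px=1.0, max_iter=5):
    """Generic sketch of iterative outlier rejection on reprojection errors.
    Assumed helpers: estimate_pose(pts3d, pts2d) -> pose,
                     project(pose, pts3d) -> Nx2 predicted pixel positions."""
    inliers = np.ones(len(pts3d), dtype=bool)
    for _ in range(max_iter):
        # Re-estimate the motion from the current inlier set ...
        pose = estimate_pose(pts3d[inliers], pts2d[inliers])
        # ... then score all features by their reprojection error.
        errors = np.linalg.norm(project(pose, pts3d) - pts2d, axis=1)
        new_inliers = errors < threshold_px
        if np.array_equal(new_inliers, inliers):
            break  # the inlier set is stable, stop iterating
        inliers = new_inliers
    return pose, inliers
```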
The work in this paper is based on the idea, developed by Zhang et al., of combining information from image features with and without known depth. They build an odometry system that combines a monocular camera with a depth sensor. Their algorithm assigns depth information from the depth sensor to the tracked features in the camera images whenever such information is available. During motion estimation they utilize features with and without depth in order to achieve a maximum information gain when recovering the motion. The use of features without depth makes it possible to compute a relatively accurate pose even if only a few features with depth information are available. They evaluated their method on their own datasets with an Asus Xtion Pro Live RGB-D camera and with a sensor system consisting of a camera in combination with a Hokuyo UTM-30LX laser scanner. Additionally, they evaluated on the public KITTI dataset, which provides depth information from a Velodyne HDL-64E laser scanner. Our method differs from theirs in that it obtains metric depth information from a stereo camera and does not use BA. As mentioned before, we also apply a small adaptation of their cost function for 2D-to-2D point correspondences.
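For reference, cost terms for 2D-to-2D correspondences of this kind are typically built on the epipolar constraint (Hartley and Sturm, 1997): for a correspondence \(\mathbf{x} \leftrightarrow \mathbf{x}'\) in normalized image coordinates under a relative motion \((R, \mathbf{t})\),
\[
\mathbf{x}'^{\top} E \, \mathbf{x} = 0, \qquad E = [\mathbf{t}]_{\times} R ,
\]
where \([\mathbf{t}]_{\times}\) is the skew-symmetric cross-product matrix of \(\mathbf{t}\). This is only the underlying constraint, not the adapted cost function itself.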
Fu et al. also compute the depth information for Zhang's algorithm using a stereo camera (Fu et al., 2015). However, they do this with a block matching algorithm. Furthermore, their algorithm uses keyframes, which can reduce the drift in situations where the camera is not moving. The rest of their motion estimation is similar to Zhang's approach, since they apply Zhang's motion estimation approach, including BA, to their camera system.
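For illustration, a minimal sketch of obtaining depth from a rectified stereo pair with standard block matching (a generic example, not Fu et al.'s implementation; the focal length f in pixels and baseline b in meters are assumed calibration inputs):

```python
import cv2
import numpy as np

def depth_from_stereo(left_gray, right_gray, f, b):
    """Illustrative sketch: dense depth from a rectified grayscale stereo
    pair via block matching. f: focal length in pixels, b: baseline in m."""
    # Standard OpenCV block matcher; numDisparities must be a multiple of 16.
    matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
    # compute() returns fixed-point disparities scaled by a factor of 16.
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
    # Triangulate: Z = f * b / d; invalid or zero disparities become NaN.
    with np.errstate(divide='ignore', invalid='ignore'):
        depth = np.where(disparity > 0, f * b / disparity, np.nan)
    return depth
```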
3 THEORETICAL BACKGROUND
The VO approach described in this paper solves a BA
problem in order to estimate the most likely motion
given two stereo image pairs from consecutive points