RELIABLE LOCALIZATION AND MAP BUILDING BASED ON
VISUAL ODOMETRY AND EGO MOTION MODEL IN
DYNAMIC ENVIRONMENT
Pangyu Jeong and Sergiu Nedevschi
Technical University of Cluj-Napoca, Baritu 28, Cluj-Napoca, Romania
Keywords: Visual Odometry (VO), Ego Motion Model, Ego Motion Control, Localization, Map Building, Temporal
Filter.
Abstract: This paper presents a robust method for localization and map building in dynamic environments. The proposed localization and map building are general approaches that can be used both in indoor and outdoor environments. The proposed localization is based on a relative global position accumulated from the initial departure position. In order to provide reliable positioning information, Visual Odometry (VO) is used instead of the ego robot's encoder. Unlike general VO-based localization, the proposed VO does not rely on iterative refinement to select inliers. Instead, it uses an ego motion model derived from the motion control: the rotation and translation values of tracked features are bounded by the rotation and translation estimated from motion control. In other words, the estimated motion provides upper and lower limits on the motion variation allowed in VO, and this boundary is used to reject outliers among the tracked features. The rejected outliers correspond to tracked features of objects moving fast or slowly relative to the ego robot. The map is built along the ego robot's path. In order to obtain rich 3D points in each frame, an accumulated-dense-map-based temporal filter is adopted.
1 INTRODUCTION
Stable and accurate position estimation is a key issue in robot/vehicle navigation. Researchers have proposed absolute and relative global position estimation, typically using GPS and INS (Inertial Navigation System) to estimate the robot/vehicle position. Recently, new vision-based approaches, called Visual Odometry (VO), have been introduced for localization. VO provides more stable position and rotation estimates than raw encoder data.
Nistér (2004) gives one of the first solutions for Visual Odometry. He presents implementation results using a single camera and a stereo camera in order to estimate the vehicle pose. He uses a feature tracking method and the 5-point algorithm with RANSAC to select the motion model in both the single-camera and stereo-camera cases. The Visual Odometry result is compared with that of an INS/DGPS system.
Howard (2008) presents a different approach to Visual Odometry. Features tracked only in two consecutive frames are triangulated to obtain 3D points that are used in Visual Odometry. For computing the rotation and translation parameters, a back-projection equation connecting the 3D tracked points from the previous frame with the corresponding 2D image points from the current frame is optimized. The algorithm is verified outdoors on many different robot platforms.
Agrawal (2007) presents Visual Odometry in a rough terrain environment. They track features in multiple frames rather than just in two consecutive frames. A two-step camera motion computation process is used on a bundle of frames, considering each camera pose a different camera observing the same points in space. Re-projecting the same 3D points on each camera in turn tests for convergence. The Visual Odometry pose is corrected using the gravity normal from an IMU (Inertial Measurement Unit) and the yaw from GPS to maintain global pose consistency.
Konolige (2006) presents outdoor mapping and navigation based on stereo vision. They use Visual Odometry integrated with IMU/GPS for robustness in difficult lighting or motion situations and for overall global consistency. The ground surface
model is determined with RANSAC and used in
obstacle detection. They use an offline learning
method for finding the paths and the extended road
surface.
The related works focus mainly on VO-based localization. Their methods have problems with inlier selection in dynamic environments and with visual information loss in fast-changing situations.
In this paper, for localization, an ego-motion-model-based data filter is proposed that selects as inliers only the static features of the dynamic environment and overcomes visual information loss in fast-changing situations.
To build the 3D map, the 3D images are registered along the path using the position estimated by VO and the ego-motion model. The 3D map also contains the 3D road surface. Road surface extraction in particular is a difficult task for vision.
There are two major problems in road surface extraction:
1) Road surface extraction is strongly influenced by non-uniform illumination, varying road surface textures, and road surface conditions. Many researchers have proposed methods for road surface determination (Guo, 2006; Sofman, 2006; Dahlkamp, 2006), but there are still open issues in terms of accuracy and computational cost.
2) The road surface often has a homogeneous texture, which means that there are few or no 3D reconstructed points on the road surface.
In this paper, a simple X-Y (front-view) projection method is proposed for road surface extraction, and an accumulated dense map (temporal filter) is used for obtaining the 3D positions of the road surface. Accumulating the dense map along the robot path provides rich 3D information even when the road surface is untextured.
This paper consists of three main sections. The localization is presented in Section 2. The map building is presented in Section 3. The experiments and their results are presented in Section 4. The conclusions highlight the achievements of the work.
2 PROPOSED LOCALIZATION
2.1 Pre-processing for Selecting Good
Tracking Features
The VO is based on the 3D points obtained by triangulation of features determined by a feature tracking method (Shi, 1994). In this paper we do not use our own triangulation method; we directly use the 3D points provided by the Tyzx (Tyzx.com) dense stereo engine. Unfortunately, the 3D points are affected by noise from different sources, which causes inaccurate translation/rotation estimation in VO. Therefore, a 3D noise elimination procedure is required before determining inliers for the translation/rotation estimation. The 3D noise is filtered by the following image projection procedure. The coordinate system is configured as X (west-east), Y (north-south), and Z (current position to ahead). There are three ways in which the projection can be achieved: X-Y projection (front-view), Y-Z projection (side-view), and X-Z projection (top-view). The X-Y projection method is adopted in this paper because noise points and object points are easily separated. The scoring map (500 x 312) in the X-Y projection is computed by the following equation:
$$c = \frac{X_i - X_{min}}{\Delta X}, \qquad r = \frac{Y_i - Y_{min}}{\Delta Y} \tag{1}$$
$$\Delta X = (X_{max} - X_{min})/500, \qquad \Delta Y = (Y_{max} - Y_{min})/312$$

where $c$ and $r$ are the column and row of the 500 x 312 image, $X_i$ and $Y_i$ are the 3D point coordinates, and $X_{min}$ and $Y_{min}$ are the minimum values of $X_i$ and $Y_i$ respectively.
Each cell of the scoring map is filled with the number of 3D points projected into it according to Equation (1). The cells whose score is smaller than a threshold $N_{threshold}$ are eliminated. $N_{threshold}$ is determined by the following off-line recursive iteration, performed only at the beginning: $N_{threshold}$ starts at 0 and is incremented until the average Y value of the X-Y projected points becomes a positive floating point value. Due to the downward orientation of the Y axis and the positioning of the origin of the world coordinate system on the road surface, the initial average values are negative. In each iteration, the cells whose score is smaller than the threshold are eliminated. After the iteration is finished, the remaining 3D points are recovered from the scoring map. These 3D points are used for VO.
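A minimal sketch of this pre-processing step is given below. It is illustrative only (not the authors' implementation) and assumes the 3D points are available as an N x 3 NumPy array in camera coordinates; all function names are hypothetical.

```python
import numpy as np

MAP_COLS, MAP_ROWS = 500, 312   # size of the X-Y (front-view) scoring map

def build_scoring_map(points):
    """Project 3D points onto the X-Y plane and count hits per cell (Equation 1)."""
    x, y = points[:, 0], points[:, 1]
    dx = max((x.max() - x.min()) / MAP_COLS, 1e-6)
    dy = max((y.max() - y.min()) / MAP_ROWS, 1e-6)
    c = np.clip(((x - x.min()) / dx).astype(int), 0, MAP_COLS - 1)
    r = np.clip(((y - y.min()) / dy).astype(int), 0, MAP_ROWS - 1)
    score = np.zeros((MAP_ROWS, MAP_COLS), dtype=int)
    np.add.at(score, (r, c), 1)          # each cell holds the number of projected points
    return score, r, c

def filter_points(points, n_threshold):
    """Keep only the points whose scoring-map cell reaches the threshold."""
    score, r, c = build_scoring_map(points)
    return points[score[r, c] >= n_threshold]

def calibrate_threshold(points):
    """Off-line recursive search for N_threshold: increase it until the mean Y of the
    surviving points becomes positive (Y points down, origin on the road surface)."""
    n_threshold = 0
    kept = filter_points(points, n_threshold)
    while kept.size and kept[:, 1].mean() <= 0.0:
        n_threshold += 1
        kept = filter_points(points, n_threshold)
    return n_threshold
```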
2.2 Inliers Selection for Visual
Odometry
In dynamic environments, the selection of inliers in Visual Odometry is very important. Unlike other Visual Odometry approaches (Nistér, Howard, Agrawal, Konolige), the proposed Visual Odometry avoids the influence of non-static objects. We select static features to compute the translation/rotation of the robot in terms of X, Z, and yaw angle (in the X-Z plane).
1) In the first step we extract good features in the intensity image with the well-known Shi-Tomasi (1994) method. These features are tracked in two consecutive images using pyramidal Lucas-Kanade optical-flow-based tracking (Lucas, 1981). As a result, we obtain between 300 and 400 pairs of features.
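This first step uses standard building blocks; a minimal sketch of it with OpenCV follows (the choice of OpenCV and all parameter values are assumptions, since the paper does not state which implementation is used).

```python
import cv2

def track_features(prev_gray, curr_gray, max_corners=400):
    """Shi-Tomasi corner extraction followed by pyramidal Lucas-Kanade tracking.
    Returns the matched feature pairs (previous frame, current frame)."""
    prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                       qualityLevel=0.01, minDistance=7)
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                      prev_pts, None,
                                                      winSize=(21, 21), maxLevel=3)
    ok = status.ravel() == 1
    return prev_pts[ok].reshape(-1, 2), curr_pts[ok].reshape(-1, 2)
```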
2) The ego motion model is used to determine the maximum and minimum variation boundaries of the translation/rotation.
The ego motion model is based on the kinematics and dynamics model of a differential-drive robot. The kinematics and dynamics model is derived from point-to-point movement, as described in Figure 1.
Figure 1: Movement from point $B(t_A)$ to point $B(t_B)$.
During point-to-point movement, the translation ($\Delta x$, $\Delta y$) and the rotation ($\alpha$ and/or $\theta$) are obtained. In addition, the translation velocity $v$ and the rotational velocity $\dot{\theta}$ are also obtained.
The rotation estimate is derived from the ego robot's rotational velocity and the motion control equation between two consecutive image frames. The rotational angle follows:

$$\Delta\theta = \pm\frac{|\bar{\theta}| \times 180 \times \Delta t \times E_r}{4 \times E_t}, \qquad \begin{cases} +: \theta > 0 \\ -: \theta < 0 \end{cases} \tag{2}$$
where $\Delta t$ is the elapsed time between two consecutive frames, and $E_r$ and $E_t$ are the rotation and translation error parameters, which are determined experimentally. $\bar{\theta} = \frac{1}{n}\sum_{i=1}^{n}|\theta_i|$, where $n$ is the number of rotation velocity samples during $\Delta t$.
$\Delta x$ and $\Delta y$ are derived from the forward kinematics:

$$\Delta x = \frac{v_r + v_l}{2} \times \Delta t \times \sin(\Delta\theta), \qquad \Delta y = \frac{v_r + v_l}{2} \times \Delta t \times \cos(\Delta\theta) \tag{3}$$
The $x_0$ and $y_0$ axes of Figure 1 correspond to the X direction and the Z direction of the camera coordinate system, respectively.
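A sketch of this ego-motion prediction, following Equations (2) and (3) as reconstructed above, is shown below. The wheel velocities, the sampled rotational velocities, and the unit conventions are assumptions; $E_r$ and $E_t$ default to the experimentally calibrated values from Section 4.

```python
import math

def predict_ego_motion(theta_samples, v_l, v_r, dt, e_r=0.08, e_t=1000.0):
    """Predict the per-frame rotation and translation of the robot from its motion
    control, following Equations (2) and (3) as reconstructed above.
    theta_samples: rotational velocity samples recorded during dt (format assumed)
    v_l, v_r:      left/right wheel translation velocities (mm/s, assumed)
    dt:            elapsed time between the two frames (s)
    e_r, e_t:      experimentally determined rotation/translation error parameters
    """
    theta_mean = sum(abs(t) for t in theta_samples) / len(theta_samples)
    sign = 1.0 if sum(theta_samples) >= 0.0 else -1.0
    d_theta = sign * (theta_mean * 180.0 * dt * e_r) / (4.0 * e_t)   # Eq. (2), degrees
    d_x = 0.5 * (v_r + v_l) * dt * math.sin(math.radians(d_theta))   # Eq. (3)
    d_y = 0.5 * (v_r + v_l) * dt * math.cos(math.radians(d_theta))
    return d_theta, d_x, d_y
```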
From Equations (2) and (3), the physical movement of the ego robot is determined in terms of translation and rotation. These translation and rotation values are used to filter the VO features: the features whose rotation angle or translation exceeds the bounds given by Equations (2) and (3) are excluded from VO. The excluded features correspond to objects moving fast relative to the ego robot. The features remaining after excluding the fast-moving ones are used to compute the mean ($\mu_T$, $\mu_R$) and the Absolute Standard Deviation (ASD: $\sigma_T$, $\sigma_R$) of the translation and rotation. Computing the ASD yields two bounds ($\pm$ASD: $\pm\sigma_T$, $\pm\sigma_R$). The features whose translation is larger than $\mu_T - \sigma_T$ and whose rotation angle is larger than $\mu_R - \sigma_R$ are kept as inliers for VO. Finally, we compute a pair of mean inlier positions, one corresponding to the previous frame and the other to the current frame, $(X_{t-1}, Z_{t-1})$ and $(X_t, Z_t)$. They are used for computing the translation and rotation of the robot as part of VO.
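A sketch of the two-stage inlier selection described above is given below. Per-feature translation and rotation values are assumed to have been computed already from the tracked 3D points, and ASD is interpreted here as the mean absolute deviation; names are illustrative.

```python
import numpy as np

def select_inliers(feat_trans, feat_rot, max_trans, max_rot):
    """Two-stage inlier selection for VO.
    feat_trans, feat_rot: per-feature translation and rotation between two frames
    max_trans, max_rot:   upper bounds predicted by the ego motion model (Eqs. 2-3)
    """
    feat_trans = np.asarray(feat_trans, dtype=float)
    feat_rot = np.asarray(feat_rot, dtype=float)

    # Stage 1: reject features that move faster than the ego motion model allows
    keep = (np.abs(feat_trans) <= max_trans) & (np.abs(feat_rot) <= max_rot)

    # Stage 2: mean and absolute standard deviation of the remaining features
    mu_t, mu_r = feat_trans[keep].mean(), feat_rot[keep].mean()
    asd_t = np.abs(feat_trans[keep] - mu_t).mean()
    asd_r = np.abs(feat_rot[keep] - mu_r).mean()

    # Keep features above (mean - ASD), i.e. also reject slow-moving outliers
    return keep & (feat_trans >= mu_t - asd_t) & (feat_rot >= mu_r - asd_r)
```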
2.3 Hybrid Filter for Localization
The localization in this paper is relative global: the robot's position is accumulated from the starting position along the robot's path. This relative global position is obtained by a hybrid filter. The hybrid filter combines Visual Odometry (VO)-based motion estimation and ego-motion-model-based motion estimation, and switches from VO estimation to ego motion estimation according to the current situation. VO is used for the robot's localization most of the time; the ego motion estimate is used when visual information is lost and no tracked features exist in the scene. Loss of visual information typically appears in cluttered and compact indoor environments when the scene is too close. The case of no available tracked features appears in poorly illuminated or untextured image scenes.
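The switching logic of the hybrid filter can be sketched as follows; the minimum feature count and the two callables are illustrative assumptions, not values from the paper.

```python
def hybrid_position_update(inlier_pairs, vo_update, ego_update, min_features=10):
    """Switch between VO-based and ego-motion-based position estimation.
    inlier_pairs: (previous, current) feature pairs surviving inlier selection
    vo_update:    callable computing (dX, dZ, dYaw) from the inlier pairs
    ego_update:   callable computing (dX, dZ, dYaw) from the motion model (Eqs. 2-3)
    min_features: illustrative minimum number of features required to trust VO
    """
    if len(inlier_pairs) >= min_features:
        return vo_update(inlier_pairs)    # normal case: visual odometry
    return ego_update()                   # visual information lost: fall back to ego motion
```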
The proposed hybrid-filter-based localization can be used in both indoor and outdoor environments. The relative global localization is computed as follows. We choose the X, Z coordinates of the 3D feature points from the previous and current frames, $(X_{t-1}, Z_{t-1})$ and $(X_t, Z_t)$. The position estimate of the robot in terms of X, Z is:
$$X_{curr} = d \sin(\Delta\theta), \qquad Z_{curr} = d \cos(\Delta\theta)$$
$$X_{accum} = X_{accum} + X_{curr}, \qquad Z_{accum} = Z_{accum} + Z_{curr} \tag{4}$$

where $d = \sqrt{(X_t - X_{t-1})^2 + (Z_t - Z_{t-1})^2}$,
$\Delta\theta = \arccos\left(\frac{A \cdot B}{|A|\,|B|}\right) \times 180/\pi$, $\Delta\theta \in [-\pi, \pi]$ is the yaw angle,
$A \cdot B = X_{t-1}X_t + Z_{t-1}Z_t$, $|A| = \sqrt{X_{t-1}^2 + Z_{t-1}^2}$, $|B| = \sqrt{X_t^2 + Z_t^2}$,
and $X_{curr}$, $Z_{curr}$ are the X-axis and Z-axis components of $d$ respectively.
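A sketch of the position update of Equation (4) is given below. The sign of the yaw change is left out, as in the equation itself, and all names are illustrative.

```python
import math

def update_global_position(prev_xz, curr_xz, accum):
    """Relative global position update (Equation 4).
    prev_xz, curr_xz: mean inlier (X, Z) of the previous and current frames
    accum:            dict carrying the accumulated X, Z position
    """
    x0, z0 = prev_xz
    x1, z1 = curr_xz
    d = math.hypot(x1 - x0, z1 - z0)                        # displacement between frames
    dot = x0 * x1 + z0 * z1
    norm = math.hypot(x0, z0) * math.hypot(x1, z1)
    d_theta = math.acos(max(-1.0, min(1.0, dot / norm)))    # yaw change, radians
    accum["X"] += d * math.sin(d_theta)
    accum["Z"] += d * math.cos(d_theta)
    accum["yaw"] = math.degrees(d_theta)                    # per-frame relative yaw (Eq. 5)
    return accum
```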
3 PROPOSED MAP BUILDING
The 3D map is registered along the robot's path. Tyzx's stereo camera is used for the proposed 3D map building. The proposed map-building method generates the complete road surface as well as the 3D environment. Two map-building algorithms are introduced to solve the problems mentioned above. The 3D road pixel determination is based on the X-Y projection. The 3D environment is built through temporal fusion using a dense accumulation-based temporal filter.
For the 3D road pixel determination, the row with the maximum number of projected points in the filtered X-Y projection map is found by searching. Let us call it $r_{road\_index}$.
The 3D points that fall into the cells of row $r_{road\_index}$ of the scoring map are assumed to be 3D points of the road surface. From experiments, the 3D points of the road surface lie between rows $r_{road\_index} - 1$ and $r_{road\_index} + 1$. This approach proved to be very simple and very robust in structured environments.
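A sketch of this road row selection follows. Here "the row with the maximum pixel number" is interpreted as the row with the largest total count in the filtered scoring map, which is an assumption.

```python
import numpy as np

def extract_road_points(points, score, r):
    """Road surface extraction from the filtered X-Y (front-view) scoring map.
    points: (N, 3) 3D points surviving the pre-processing of Section 2.1
    score:  filtered scoring map; r: per-point row indices from Equation (1)
    """
    road_index = np.argmax(score.sum(axis=1))    # row with the most projected points
    on_road = np.abs(r - road_index) <= 1        # rows road_index - 1 .. road_index + 1
    return points[on_road]
```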
For the 3D environment building, the successive depth images are fused using a temporal filter. The temporal filter exploits two pieces of information: the position height and its time stamp. At each step, the stored height of a position is replaced if a new height is provided for that position, and heights whose time stamp is older than the current one by a pre-defined threshold are discarded.
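A sketch of the temporal filter is shown below; the grid representation and the age threshold value are assumptions, since the paper does not specify them.

```python
def temporal_filter_update(grid, observations, t, max_age=50):
    """Dense accumulation-based temporal filter over a 2D height grid.
    grid:         dict mapping a discretised (x, z) cell to (height, time stamp)
    observations: iterable of (x, z, height) triples from the current depth image
    max_age:      illustrative pre-defined threshold on the time-stamp age
    """
    # Replace the stored height whenever a new height is provided for a cell
    for x, z, h in observations:
        grid[(int(x), int(z))] = (h, t)
    # Discard heights whose time stamp is older than the current one by the threshold
    stale = [cell for cell, (_h, ts) in grid.items() if t - ts > max_age]
    for cell in stale:
        del grid[cell]
    return grid
```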
The pixels of the road surface extracted in the first step, as well as the pixels corresponding to the other elements of the environment, are accumulated by the temporal filter. Consequently, the accumulated 3D environment surface is generated. This relative 3D environment surface is registered into the 3D world coordinates by a 2D translation and rotation.
The obtained local 3D road surface is registered based on the following equation:

$$\begin{bmatrix} X_i^t \\ Y_i^t \\ Z_i^t \end{bmatrix} = \begin{bmatrix} \cos\theta_r & 0 & \sin\theta_r \\ 0 & 1 & 0 \\ -\sin\theta_r & 0 & \cos\theta_r \end{bmatrix} \begin{bmatrix} X_i^{env} \\ Y_i^{env} \\ Z_i^{env} \end{bmatrix} + \begin{bmatrix} X_t \\ Y_t \\ Z_t \end{bmatrix} \tag{5}$$
where $(X_t, Y_t, Z_t)$ is the global position of the robot at time $t$, $\theta_r$ is the relative yaw at time $t$, $(X_i^{env}, Y_i^{env}, Z_i^{env})$ is the relative environment pixel position at time $t$ ($i$ goes from 0 to the total number of 3D pixels), and $(X_i^t, Y_i^t, Z_i^t)$ is the new position of the pixel relative to the global coordinate system. The global position $(X_t, Z_t)$ and relative yaw $\theta_r$ come from $\{X_{accum}, Z_{accum}, \Delta\theta\}$ of Equation (4).
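A sketch of the registration of Equation (5), using the rotation about the Y axis as reconstructed above, is given below; names and units are illustrative.

```python
import numpy as np

def register_points(env_points, robot_xyz, yaw_deg):
    """Register relative environment points into the global frame (Equation 5).
    env_points: (N, 3) relative 3D points (X, Y, Z) observed at time t
    robot_xyz:  global robot position (X_t, Y_t, Z_t)
    yaw_deg:    relative yaw theta_r at time t, in degrees
    """
    th = np.radians(yaw_deg)
    rot_y = np.array([[np.cos(th),  0.0, np.sin(th)],
                      [0.0,         1.0, 0.0       ],
                      [-np.sin(th), 0.0, np.cos(th)]])
    return env_points @ rot_y.T + np.asarray(robot_xyz)
```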
The proposed hybrid-filter-based localization and map building can run in real-time conditions and in varied environments (indoor and outdoor).
4 EXPERIMENTS
As mentioned in Section 2.2, the ego robot is modeled as a differential-drive robot, and the empirical parameters for the translation error and rotational angle error must be determined. These errors mainly come from the robot's weight load. In order to determine the translation error and rotational error, the robot navigates a pre-defined path at a constant commanded velocity (100 mm/sec). The pre-defined path is a rectangle (1500 mm x 1500 mm).
As the empirical parameters ($E_r$ and $E_t$) vary, the actual robot trajectory varies too. The parameters are chosen as those for which the robot trajectory has the smallest error against the pre-defined path. Due to the differential-drive modeling, the robot's navigation behavior resembles that of a car. The results corresponding to different sets of translation and rotation error parameters are presented in Figure 2.
(a) In case of $E_r$ = 0.08 and $E_t$ = 940; (b) in case of $E_r$ = 0.08 and $E_t$ = 1010.
Figure 2: Ego motion calibration (blue: encoder data; red: estimated trajectories according to translation and rotation velocity).
Figure 3: 3D data acquisition in variant environments.
The parameter of the rotation error is fixed to 0.08, and only the parameter of the translation error changes from 940 to 1020. The blue line (rectangle) represents the raw encoder data obtained at each of the four corners of the pre-defined rectangular path. The red line represents the robot's trajectory computed from the translation values and rotation angles along with their time stamps. The minimum mismatch (10 cm between the departure position and the returned position) is obtained when the translation error parameter is about 1000. This case can be seen in Figure 2(b). These error parameters are used in Equation (2) in order to provide maximum accuracy of VO.
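A sketch of how this parameter sweep could be carried out is shown below, reusing the hypothetical predict_ego_motion function from Section 2.2; the control log format and the candidate range are assumptions.

```python
import math

def calibrate_translation_error(control_log, e_r=0.08, e_t_candidates=range(940, 1021, 10)):
    """Sweep the translation error parameter and keep the value whose predicted
    trajectory best closes the pre-defined rectangular path.
    control_log: list of (theta_samples, v_l, v_r, dt) records collected while the
                 robot drives the 1500 mm x 1500 mm rectangle (format assumed)."""
    best_e_t, best_gap = None, float("inf")
    for e_t in e_t_candidates:
        x = y = yaw = 0.0
        for theta_samples, v_l, v_r, dt in control_log:
            d_theta, d_x, d_y = predict_ego_motion(theta_samples, v_l, v_r, dt, e_r, e_t)
            h = math.radians(yaw)
            # rotate the local displacement into the starting frame before accumulating
            x += d_x * math.cos(h) - d_y * math.sin(h)
            y += d_x * math.sin(h) + d_y * math.cos(h)
            yaw += d_theta
        gap = math.hypot(x, y)   # distance between departure and return positions (mm)
        if gap < best_gap:
            best_e_t, best_gap = e_t, gap
    return best_e_t
```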
The road surface extraction by X-Y projection and temporal filtering is presented in Figure 3. The road surface pixels extracted by X-Y projection are shown in blue on the original image (e.g. (a)). Their corresponding 3D positions are displayed in (b), (c), and (d). The green spheres in (b), (c), and (d) represent the 3D points before temporal filtering. The gray points represent the 3D points added by fusing 10,000 images using temporal filtering.
Fusing successive images increases both the density of the 3D representation and its accuracy.
To evaluate the accuracy of the localization, the robot passes through 3 known points (2 meters, 6 meters, and 10 meters) in 5 runs. The results are presented in Table 1. We obtained an average error of 1.5875 percent.
As a final experiment, the robot navigates a long indoor environment. During this navigation, the robot encounters many different environmental variations (non-uniform illumination, Figure 4 (1-3); narrow paths, Figure 4 (3-6); and non-textured road surface, Figure 4 (6-7)). In spite of these conditions, the 3D road surface and 3D environment are well registered along the robot path.
5 CONCLUSIONS
This paper proposed a new robust localization and map building method based on VO and an ego motion model. The robot navigates in a dynamic environment that contains moving objects, non-uniform illumination, and non-textured road surfaces. The achieved position estimation accuracy is an average error of 1.5875 percent over 10 meters.
Figure 3 panels: (a) indoor scene; (b) 3D data (front-view), fusion with temporal filter: 3,475,509 points (gray and green), no fusion: 7,192 points (green); (c) 3D data (side-view) after 1000 frames of accumulation, fusion with temporal filter: 99,571 points (gray and green), no fusion: 353 points (green); (d) 3D road data (side-view) after 1000 frames of accumulation.
Figure 4: Localization and map building results for the long-distance navigation (about 110 meters). (a) Raw accumulated images (1)-(7) and the reconstructed 3D road surface and 3D environment; (b) distance variation along the navigation path (X vs. Z, in mm); (c) angle variation along the navigation path (angle in degrees vs. sampled data).
Table 1: Accuracy comparison at each known position (X, Z in mm; Yaw in degrees; columns grouped by the 2 m, 6 m, and 10 m checkpoints).

Nr. |  2 m: X      Z       Yaw   |  6 m: X      Z        Yaw   |  10 m: X       Z        Yaw    | Error (%)
 1  |  35.65   1847.50   -1.82   |  259.46   5625.01   -3.78   |   789.82    9703.77   -10.57   |  2.617
 2  |  80.29   2066.14   -4.23   |  416.25   5981.49   -5.46   |   900.35   10118.09    -6.98   |  0.766
 3  |  39.90   2062.20   -2.52   |  329.59   6095.20   -9.20   |  1298.73   10270.26   -14.44   |  1.729
 4  |  95.14   1901.13   -4.98   |  541.43   5704.66   -6.18   |  1442.69    9646.03   -15.59   |  2.042
 5  |  62.85   1975.55   -2.97   |  305.59   6018.14   -3.99   |  1063.45   10130.03   -16.25   |  0.783
REFERENCES

D. Nistér, O. Naroditsky and J. Bergen, 2004, "Visual odometry", Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), Vol. 1, pp. 652-659.
A. Howard, 2008, "Real-Time Stereo Visual Odometry for Autonomous Ground Vehicles", IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Sep.
M. Agrawal and K. Konolige, 2007, "Rough terrain visual odometry", Proc. International Conference on Advanced Robotics (ICAR), Aug.
K. Konolige, M. Agrawal, R. C. Bolles, C. Cowan, M. Fischler, and B. P. Gerkey, 2008, "Outdoor mapping and navigation using stereo vision", Intl. Symp. on Experimental Robotics, pp. 179-190.
J. Shi and C. Tomasi, 1994, "Good Features to Track", IEEE Conference on Computer Vision and Pattern Recognition (CVPR94), pp. 593-600.
B. D. Lucas and T. Kanade, 1981, "An iterative image registration technique with an application to stereo vision", Proceedings of the 1981 DARPA Image Understanding Workshop, pp. 121-130.
Y. Guo, V. Gerasimov, G. Poulton, 2006, "Vision-Based Drivable Surface Detection in Autonomous Ground Vehicles", IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3273-3278.
B. Sofman, E. Lin, J. Bagnell, N. Vandapel, and A. Stentz, 2006, "Improving robot navigation through self-supervised online learning", Robotics: Science and Systems (RSS), Cambridge.
H. Dahlkamp, A. Kaehler, D. Stavens, S. Thrun, G. R. Bradski, 2006, "Self-supervised Monocular Road Detection in Desert Terrain", Robotics: Science and Systems II.
http://www.tyzx.com