in parallel with a Single Instruction Multiple
Threads (SIMT) approach.
Following the Early Vision model proposed by Adelson & Bergen (Adelson and Bergen, 1991), many steps are necessary to gain a deep understanding of the scene structure, ranging from disparity estimation and 3D reconstruction to image segmentation and blob detection. Many solutions have been devised over the years (Chen et al., 2011), each based on different assumptions and with different strengths, weaknesses and execution times; in each case, the main goal is to measure a specific feature of the scene.
When approaching the problem of robot navigation, computer vision is not the only discipline involved: the robot needs to be modelled, and all of its sensors have to be analysed in order to characterise both its proprioception (position and orientation estimation) and its interaction with the external world (e.g. cameras for object detection, tracking and/or avoidance). Finally, every module has to be integrated with the others.
The main contributions this paper provides are:
• Integration of features belonging to the Early Vi-
sion model (e.g. disparity and edges) in a modular
framework which combines them in order to solve
complex visual tasks (e.g. 3D reconstruction in a
vergent system, blob extraction, navigation).
• Development of a SIMT approach towards depth
estimation, starting from disparity computed
through a bio-inspired algorithm.
• Evaluation of the performance (execution times and depth estimation errors) of such a platform when equipped with the aforementioned bio-inspired algorithm.
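For reference, under the simplifying assumption of a rectified, parallel-axis stereo rig (the actual system uses a vergent rig, whose geometry is more involved), depth follows Z = f·B/d for focal length f (in pixels), baseline B and disparity d. A minimal vectorised sketch, with hypothetical names:

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline_m, eps=1e-6):
    """Depth map (metres) from a disparity map (pixels).

    Hypothetical sketch assuming a rectified, parallel-axis stereo rig;
    the vergent setup described in the paper needs the full epipolar
    geometry instead.
    """
    d = np.asarray(disparity, dtype=np.float64)
    depth = np.full(d.shape, np.inf)     # zero disparity -> point at infinity
    valid = d > eps
    depth[valid] = focal_px * baseline_m / d[valid]
    return depth
```

This per-pixel independence is also what makes the computation a natural fit for a SIMT device, where one thread per pixel evaluates the same expression.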
The remainder of this paper describes the proposed
system, the flow of data, the experiments and the ob-
tained results.
2 PROPOSED SYSTEM
The proposed system is represented in the block diagram in Fig. 1. It comprises several modules: image acquisition, which obtains the images from the cameras and applies undistortion operations; disparity estimation and 3D reconstruction; segmentation and blob extraction; and the Navigator module, which combines depth and blob information with the robot state (RobotModel module) to detect nearby objects and avoid them during navigation. Finally, a GUI module shows a 3D reconstruction of the scene in real-world coordinates, allows commands to be issued to the robot, and lets the user set the various parameters which characterise the system.
Great attention has been paid in designing this framework to keep it as modular as possible: many pieces of software run at the same time and share data, while the system remains easy to extend with new modules and features.
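As an illustration of this design goal (a hypothetical sketch, not the actual implementation), concurrently running modules that share data can be wired as threads connected by queues, so that a new stage can be chained in without touching the others:

```python
from queue import Queue
from threading import Thread

# Hypothetical sketch: each module runs in its own thread and passes
# results downstream through a queue; a None sentinel shuts the chain down.
def run_module(process, inbox, outbox):
    while True:
        item = inbox.get()
        if item is None:          # propagate shutdown to the next stage
            outbox.put(None)
            break
        outbox.put(process(item))

# Example: two toy stages standing in for acquisition and segmentation.
q_in, q_mid, q_out = Queue(), Queue(), Queue()
stages = [
    Thread(target=run_module, args=(lambda x: x * 2, q_in, q_mid)),
    Thread(target=run_module, args=(lambda x: x + 1, q_mid, q_out)),
]
for t in stages:
    t.start()
q_in.put(10)
q_in.put(None)
for t in stages:
    t.join()
```

Adding a module then amounts to inserting one more thread and queue between two existing stages.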
Image Acquisition and Segmentation. According
to the pinhole camera model (Forsyth and Ponce,
2002), the acquisition module encapsulates all the in-
trinsic parameters of the cameras and the functions
to obtain undistorted stereo images. The system does
not rectify the images, and thus it does not need the
extrinsic parameters of the stereo rig.
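By way of example, undistortion under a simple two-coefficient radial model (a hypothetical sketch; the actual camera model and coefficients used by the system are not given here) can be obtained by fixed-point iteration on normalised image coordinates:

```python
def undistort_normalised(x_d, y_d, k1, k2, iters=5):
    """Invert a radial distortion of the form
    x_d = x * (1 + k1*r^2 + k2*r^4), with r^2 = x^2 + y^2,
    by fixed-point iteration on normalised (pinhole) coordinates.

    Hypothetical two-coefficient model for illustration only.
    """
    x, y = x_d, y_d                       # start from the distorted point
    for _ in range(iters):
        r2 = x * x + y * y
        scale = 1.0 + k1 * r2 + k2 * r2 * r2
        x, y = x_d / scale, y_d / scale   # refine the undistorted estimate
    return x, y
```

For the small distortion coefficients typical of calibrated cameras, a handful of iterations suffices for sub-pixel accuracy.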
In order to implement an efficient colour segmen-
tation module, colour edge detection was applied as a
preliminary step to detect the boundaries between the
observed surfaces. Similarly to other approaches (see
(Chen and Chen, 2010; Dutta and Chaudhuri, 2009))
our implementation performed the following opera-
tions:
• Median filtering, to lessen the effect of noise while preserving edges.
• Image gradient computation for every channel, through a derivative kernel (e.g. [−1,0,1]) or other operators (e.g. Sobel).
• Sum of the norms of the gradients (6 components, X and Y for every channel) for every pixel. For faster execution, we chose the || · ||1 norm, summing the absolute values of the components.
• Binary thresholding (see Fig. 2, second row), in order to set the strong edges to 0 and the inner regions to 1. The resulting binary map can then easily be labelled with blob detection algorithms.
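The operations above can be sketched as follows (a minimal NumPy illustration; the median pre-filter is omitted for brevity, and the function name and threshold parameter are our own):

```python
import numpy as np

def colour_edge_map(img, threshold):
    """Binary map: 0 on strong colour edges, 1 inside regions.

    img: float array of shape (H, W, 3); threshold: cutoff on the
    L1 sum of the 6 gradient components. Hypothetical sketch.
    """
    # Central-difference gradients per channel ([-1, 0, 1] kernel).
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    # || . ||1 over the 6 components (X and Y for every channel).
    grad_sum = np.abs(gx).sum(axis=2) + np.abs(gy).sum(axis=2)
    # Strong edges -> 0, inner regions -> 1.
    return (grad_sum < threshold).astype(np.uint8)
```

On a synthetic two-region image, the map is 0 on the columns straddling the colour boundary and 1 inside each region, ready for blob labelling.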
Blob Extraction. Based on the algorithm proposed by F. Chang in (Chang et al., 2004), component labelling was applied through contour tracing: we developed our own implementation of the algorithm and a library which has been released under the LGPL license (OpenCVBlobsLib¹, originally based on cvblobslib). The original project was enhanced both in terms of performance (implementing a multi-thread algorithm) and functionality: for example, a blob joining capability was added, allowing many separate regions to be linked into one entity. Regarding the multi-thread implementation, the approach can be described as follows:
• Horizontal splitting of the image into as many regions as there are threads.
¹ v1.0, https://code.google.com/p/opencvblobslib/
GRAPP 2014 - International Conference on Computer Graphics Theory and Applications