Evaluating Two Ways for Mobile Robot Obstacle Avoidance with Stereo

Cameras: Stereo View Algorithms and End-to-End Trained

Disparity-sensitive Networks

Alexander K. Seewald

Seewald Solutions, L

archenstraße 1, A-4616 Weißkirchen a.d. Traun, Austria

Keywords:

Deep Learning, Obstacle Avoidance, Autonomous Robotics, Mobile Robotics, View Disparity, Stereo View

Algorithms.

Abstract:

Obstacle avoidance is an essential feature for autonomous robots. Recently, stereo view algorithm challenges

have started to focus on fast algorithms with low computational expense, which may enable obstacle avoidance

on mobile robots using only stereo cameras. Therefore we have evaluated classical and state-of-the-art stereo

salgorithms qualitatively using internal datasets from Seewald (2020), showing that – although improvements

are discernable – current algorithms still fail at this task. As it is known (e.g. from Muller et al. (2004))

that deep learning networks trained on stereo views do not rely on view disparity – conﬁrmed by the fact that

networks perform almost equally well when trained with only one camera image – we present an alternative

network which is end-to-end trained on a simple layer of biologically plausible disparity-sensitive cells and

show that it performs equally well as systems trained on raw image data, but must by design rely on view

disparity alone.

1 INTRODUCTION

Obstacle avoidance is an essential feature for au-

tonomous robots, the lack of which can make the es-

timation of the robot’s intelligence drop dramatically.

As benchmark task it has been known from the be-

ginning of autonomous robotics and is usually solved

with specialized sensors and Simultaneous Localiza-

tion and Mapping algorithms (SLAM, Cadena et al.

(2016)). However, different sensors have different er-

ror modes and multiple sensor classes must be com-

bined to obtain acceptable results; and SLAM algo-

rithms rely on complex map-building, are not robust

enough for a wide variety of environments, and are

very tricky to set up, adjust and make sufﬁciently ro-

bust for even one given environment.

Recently stereo algorithm challenges such as the

Robust Vision Challenge

have focussed on robust vi-

sion of the kind that may be useful to mobile robots

as an alternative for near obstacle avoidance that does

not rely on combining specialized sensors or map-

building.

Here, we have qualitatively evaluated classical as

well as state-of-the-art stereo algorithms on internal

https://robustvision.net

datasets from our ToyCollect robot platform (Seewald

(2020)). These datasets only include stereo views and

thus cannot be quantitatively evaluated but are repre-

sentative of the type of stereo views commonly ob-

tained with mobile robots. Our qualitative evaluation

indicates that despite the excellent results of state-of-

the-art stereo algorithms on high-quality benchmark

stereo views, these results transfer only to a very small

extent to real-life stereo views.

We speculate that this is due to the algorithms

determining distance by occlusion, linear perspec-

tive, size perspective, distribution of shadows and il-

lumination, and familiar size (Kandel et al. (2000),

p. 558f) – all of which are likely to differ between

benchmark datasets and our internal datasets – and

rely only to a limited extent – if at all – on view dis-

parity. This is surprising as the human visual system

relies very strongly on view disparity for near obstacle

detection. Further support for this comes from end-

to-end deep-learning models for obstacle avoidance

trained directly on stereo views (Muller et al. (2004)),

where it is known that models trained using informa-

tion from just one camera perform almost as well as

models trained using information from both cameras.

Seewald, A.

Evaluating Two Ways for Mobile Robot Obstacle Avoidance with Stereo Cameras: Stereo View Algorithms and End-to-End Trained Disparity-sensitive Networks.

DOI: 10.5220/0010878500003116

In Proceedings of the 14th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2022) - Volume 3, pages 663-672

ISBN: 978-989-758-547-0; ISSN: 2184-433X

663

Table 1: Robot Hardware Overview.

Type Robot Base Motors

Motor

controller

Camera Controller Chassis Power LocalDL

v1.3

(K3D)

Pololu Zumo

(#1418)

2x Pololu 100:1

brushed motors

(#1101)

Pololu Qik

2s9v1 Dual

Serial (#1110)

1x RPi Camera Rv1.3

w/ K

ula3D Bebe

Smartphone 3D lens

1x RPi 3B+

Modiﬁed

transparent

chassis

4x 3.7V 14500 LiIon

Yes

8fps

v1.21

(R2X)

Pololu Zumo

(#1418)

2x Pololu 100:1

brushed motors

(#1101)

Pololu Qik

2s9v1 Dual

Serial (#1110)

2x RPi Camera Rv2.1 2x RPi ZeroW

Three-part

3D-printed

chassis

4x 3.7V 14500 LiIon or

4x 1.5V AA Alkaline or

4x 1.2V AA NiMH

<1fps

v2.2

(OUT)

Dagu Robotics

Wild Thumper

4WD (#RS010)

4x 75:1 brushed

motors (included)

Pololu Qik

2s12v10 Dual

Serial (#1112)

2x RPi Camera Rv2.1

1x RPi CM3

on ofﬁcial

eval board

None

(open

case)

1x 7.2V 2S 5000mAh LiPo

Yes

8fps

As the lack of depth maps in our internal datasets

prevents us from retraining state-of-the-art stereo al-

gorithms, we have instead created an end-to-end

trained deep learning network which is forced to learn

from view disparity by mapping the network input

from raw image data to a simple approximation of

disparity-sensitive cells known to exist in monkey vi-

sual cortex (Kandel et al. (2000), p. 562 and Fig. 28-

14). We then proceed to show that such a network

performs as well at the obstacle avoidance task as the

original end-to-end-trained model using raw stereo

view images presented in Seewald (2020). Limita-

tions in the used hardware and software platforms pre-

vent us from testing the new model directly on our

robots for now, so we leave that part for future work.

2 RELATED RESEARCH

As far as we know, no previous qualitative evaluation

of stereo view algorithms in a mobile robotics setting

for obstacle avoidance has been done. Neither are we

aware of any attempt to explicitly force end-to-end

trained deep learning networks to rely on view dispar-

ity for obstacle avoidance. So we will instead describe

earlier attempts at obstacle avoidance here.

Muller et al. (2006) describe a purely vision-based

obstacle avoidance system for off-road mobile robots

that is trained via end-to-end deep learning. It uses

a six-layer convolutional neural network that directly

processes raw YUV images from a stereo camera

pair. They claim their system shows the applicabil-

ity of end-to-end learning methods to off-road obsta-

cle avoidance as it reliably predicts the bearing of

traversible areas in the visual ﬁeld and is robust to

the extreme diversity of situations in off-road envi-

ronments. Their system does not need any manual

calibration, adjustments or parameter tuning, nor se-

lection of feature detectors, nor designing robust and

A double periscope that divides the camera optical path

into two halves and moves them apart using mirrors, effec-

tively creating a stereo camera pair from one camera.

Sadly, no longer available

Can be plotted as one part for perfect alignment.

Left two and right two are each connected.

fast stereo algorithms. They note some important

points w.r.t. training data collection which we have

also noted and applied for our data collection:

• High diversity of training grounds.

• Various lighting conditions.

• Collect sequences where robot starts moving

straight and then moves left/right around obstacle.

• Avoid turns when no obstacles are present.

• Include straight runs with no obstacles nor turns.

• Turn at approx. same distance from an obstacle.

The earlier published technical report, Muller et al.

(2004), extends the previously mentioned paper with

additional experiments, a much more detailed de-

scription of the hardware setup, and a slightly more

detailed description of the deep learning network. A

training and test environment is described. An even

more detailed description of how training data was

collected is given. They found that using informa-

tion from just one camera performs almost as well as

using information from both cameras, which is sur-

prising. Modiﬁed deep learning networks which try

to control throttle as well as steering angle and also

utilize additional sensors performed disappointingly.

Bojarski et al. (2016) describe a system similar to

Muller et al. (2006) that is trained to drive a real car

using 72 hours of training data from human drivers.

They note that the distance between crashes of the

original DAVE system was about 20 meters. They add

artiﬁcial shifts and rotations to the training data. They

report an autonomy values of 98%, corresponding to

one human intervention every 5 minutes. However,

their focus is on lane following and not on obstacle

avoidance. They used three cameras – left, center,

right – and a complex ten layer convolutional neural

network.

Pfeiffer et al. (2016) describe an end-to-end mo-

tion planning system for autonomous indoor robots.

It goes beyond our approach in also requiring a tar-

get position to move to, but uses only local informa-

tion which is similar to our approach. However, their

approach uses a 270

◦

laser range ﬁnder and cannot

be directly applied to stereo cameras. Their model

is trained using simulated training data and as such

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

664

Figure 1: Robots K3D, R2X, OUT (left to right, ruler units: cm).

Figure 2: FOV comparison between robot R2X and K3D.

has some problems navigating realistic ofﬁce environ-

ments.

Wang et al. (2019) describe a convolutional neu-

ral network that learns to predict steering output from

raw pixel values. Contrary to our approach, they

use a car driving simulator instead of real camera

recordings, and they use three simulated cameras in-

stead of our two real cameras.They explicitly address

overﬁtting and vanishing gradient which may reduce

the achievable performance also in our case.

They

note several papers on end-to-end-learning for au-

tonomous driving including Muller et al. (2006) –

however it should be noted that most of the mentioned

papers are concerned with car driving and lane fol-

lowing, and not with obstacle avoidance, which are

overlapping but different problems.

Khan and Parker (2019) describe a deep learn-

ing neural network that learns obstacle avoidance in

a class room setting from human drivers, somewhat

similar to our system. As starting point they use a

deep learning network that has been trained on an im-

age classiﬁcation task and reuse some of the hidden

layers for incremental training. However, their ap-

proach uses only one camera and cannot be directly

applied to stereo cameras. Still, their results seem

Especially for the smaller OUT Train dataset.

promising and will be considered for future experi-

ments.

Luo et al. (2019) describe a combined robot sys-

tem of ad-hoc code with a pretrained deep learning

network that successfully tracks and follows a second

robot in two settings. They mainly focus on tracking

and not on obstacle avoidance as we do here. Track-

ing uses a modiﬁed YOLO deep learning network for

initial slow object detection and a kernel correlation

ﬁlter for continuous fast tracking. Obstacle avoidance

relies exclusively on an active sensing depth camera

– which we also plan to integrate into our platform in

the near future – and is not separately evaluated.

Becker and Ebner (2019) describe a robot system

to detect obstacles a posteriori by acceleration sensor

data. They propose a logistic regression model trained

on known collisions as detected by a human observer,

and show that it can detect new collisions in 13 out of

14 cases. However, as we try to avoid obstacles their

approach cannot be directly applied. It would still be

useful for autonomous collection of training data and

online retraining without relying on more commonly

used bumper sensors.

Evaluating Two Ways for Mobile Robot Obstacle Avoidance with Stereo Cameras: Stereo View Algorithms and End-to-End Trained

Disparity-sensitive Networks

665

3 ToyCollect PLATFORM

All data was collected with our ToyCollect robotics

open source hardware/software platform ( https://tc.

seewald.at ). A hardware overview of the three uti-

lized robots can be found in Table 1 while Fig.1 shows

images for each robot with rulers for scale.

While robots OUT and K3D need only a sin-

gle main controller and have sufﬁcient computational

power to run a small deep-learning model at interac-

tive frame rates (around 8 frames-per-second) – thus

enabling onboard processing – the need for two cam-

eras on R2X necessitates the usage of two controllers,

each connected to one camera, and each streaming the

frames independently to a processing server.

OUT and R2X each use two cameras with a ﬁeld-

of-view of 62.2

◦

horizontal and 48.8

◦

vertical, how-

ever for K3D we used the older camera which has

only 54

◦

horizontal and 41

◦

vertical ﬁeld-of-view and

the horizontal viewing angle is approximately halved

again by the 3D smartphone lens. Fig.2 shows the

difference in ﬁeld-of-view between K3D and R2X.

OUT also includes a depth camera, GPS module,

a 10-DOF inertial measurement unit including an ac-

celerometer, gyroscope, magnetometer and a barome-

ter for attitude measurement as well as a thermometer,

and four ultrasound sensors. However these were not

used for our experiments.

R2X includes two high-power white LEDs to al-

low operation in total darkness and whose brightness

can be controlled in 127 steps. However these also

were not used for our experiments.

Unfortunately it is not possible to connect two

cameras to a RPi controller since the necessary con-

nections are only available on the chip and not on

the PCB. Only the ComputeModule allows to directly

connect two cameras; however the ofﬁcial evaluation

boards for the RPi compute module are too large to

ﬁt on the small robot. So for robot R2X we integrated

two RPi Zero Ws into a single robot chassis and con-

nected each to a separate camera.

However since the

RPi Zero W is based on the original RPi 1, it is much

too slow for online deep learning model processing

and achieves less than 1 frames per second. So for

this robot we must stream the video frames to a sec-

ond more powerful plattform.

In the meantime other options have become available,

e.g. StereoPi, which we are currently evaluating.

4 DATA COLLECTION

We collected two different types of data (consisting

of frames and speed/direction control input

) for in-

door and outdoor obstacle avoidance. In each case we

aimed for a consistent avoidance behaviour roughly

at the same distance from each obstacle similar to

that described in Muller et al. (2006) (see also Sec.2).

However, instead of collecting many short sequences,

we collected large continuous sequences and after-

wards ﬁltered the frames with a semi-automated ap-

proach. All data was collected by students during

summer internships in 2017, 2018 and 2019. The

students were made aware of the conditions for data

collection, and were supervised for about one ﬁfth of

recording time to ensure compliance.

The collected datasets can be found in Table 2.

R2X Train and OUT Train were those used for train-

ing the correspondingly named robots in Seewald

(2020).

For indoor obstacle avoidance (robots R2X, K3D)

we initially collected 557,640 frames in a variety of

indoor and outdoor settings with a small set of small

mobile robots. For training we restricted them to

267,617 frames recorded by the latest robot in 2019 to

ensure all recordings were done with the same cam-

eras which were later also used for evaluation. Stereo

views were collected directly onboard R2X robots

in uncompressed YUV 4:2:2 format on SD cards in

640x368 resolution at 10fps. Control of the robot was

via paired Bluetooth controller. We ﬁrst inspected

the recorded sequences manually and removed those

with technical issues (e.g. no movement, cameras

not synchronized, test runs). Because of synchro-

nization issues, the ﬁrst minute of each sequence (up

to the point when each MASTER and SLAVE syn-

chronize with an external time server

) had to be re-

moved. Additionally, sequences with very slow speed

and with backward movement (negative speed) were

removed along with 50ms of context. Since both cam-

eras were independently recorded, we also removed

all frames without a partner frame that is at most

50ms

apart. We also removed frames where move-

ment information is not available within ±25ms of the

average timestamp for the image frames. Lastly, we

had to remove 80% of the frames with straight for-

ward movement as otherwise these would have domi-

nated the training set. After all these ﬁlters, 70,745

Only direction (steering) is used for training.

The RPi platform does not offer a realtime clock and

thus suffers from signiﬁcant clock drift. It would have

been quite simple to synchronize local clock during MAS-

TER/SLAVE synchronization but we failed to consider it.

Half of 100ms (i.e. 10fps recording frequency)

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

666

Table 2: Datasets.

Name

Coll.by

robot

Size

Train?

Qual.

Eval?

Comments

Mixed R2X 557,640 No Yes Indoors and outdoors near houses

Outside OUT 51,735 No Yes Outdoors, ﬁelds, forests, small streets, walkways and trails

R2X Train R2X 70,745 Yes No Curated subset of Mixed and recorded on a single robot, in-

cludes only indoor scenes.

OUT Train OUT 27,368 Yes No Curated subset of Outside

K3D Train R2X 70,745 Yes No R2X Train dataset restricted to ﬁeld-of-view of K3D robot (see

Fig. 2)

frames remained which we distributed into 13,791

(20%) frames for testing and 56,954 (80%) for train-

ing.

For outdoor obstacle avoidance (robot OUT) we

collected 66,057 frames in a variety of outdoor set-

tings. These were collected on a mobile phone con-

nected via Wi-ﬁ to an OUT robot. The phone also

translated the phone-paired Bluetooth controller com-

mands to Wi-ﬁ and sent them to the robot. These steps

were necessary since at that time no Compute Mod-

ule with signiﬁcant onboard memory was available.

The frames were collected in 1280x720 resolution in

raw H264 format at 15fps. A preliminary curating

step removed those sequences where the robot stood

still, was being repaired or when movement com-

mands were not accepted by the robot. This yielded

the 51,735 frames for Outside in Table 2. We then

used the same additional ﬁltering as above and ob-

tained 27,368 frames which we distributed into 5,351

(20%) for testing and 22,017 (80%) for training.

In all cases except K3D the frames were down-

scaled to half recording resolution via linear interpo-

lation while preserving aspect ratio. For K3D, where

only a part of the frame was used for training and test-

ing due to its much smaller horizontal ﬁeld-of-view

(see Fig. 2), the stereo views were not scaled down

but used at the original resolution.

For the qualitative evaluation of stereo algorithms

we used the stereo images as-is (except for linear up-

and downscaling). For the disparity-sensitive deep

learning model (NCC-Disp) experiments, we instead

computed the normalized correlation coefﬁcient (see

Eq. 1) between patches of the Left and Right stereo

views using 8x8 pixel-sized patches and disparities of

0, 1, 2, 4, 8, 16 and 32 pixels. Patches were moved

in half-patch-size (4 pixel) steps. We obtained seven

channels with a resolution of 78x44 pixels for OUT

and R2X, but 62x48 pixels for K3D due to the differ-

ent aspect ratio and resolution.

The old RPi ComputeModule evaluation board offers

neither an SD card slot nor Bluetooth and the only available

USB slot was needed for a Wi-ﬁ USB stick.

5 QUALITATIVE EVALUATION

OF STEREO ALGORITHMS

We evaluated the following stereo algorithms, using

the stereo views described in the previous section. We

ran the algorithms on our datasets from Table 2. In

each case, we chose a random sample of about 50

frames and evaluated them w.r.t. the usefulness for

obstacle avoidance into three categories. We also in-

spected extensive simulated runs of each algorithm

where frames were processed in the same order as

they would be processed on the robot.

• Good: Good stereo depth reconstruction with at

most small errors.

• Mediocre: Stereo depth reconstruction with ma-

jor errors but still somewhat useful for obstacle

avoidance.

• Bad: Incorrect stereo depth reconstruction with

at most small patches with approximately correct

depth; useless for obstacle avoidance purposes.

Since we were interested in comprehensive perfor-

mance, we chose the largest, most diverse dataset

Mixed, and for promising algorithms additionally the

dataset Outside to enable some comparison w.r.t. nat-

ural scene performance.

1. GraphCut (GC, from Kim et al. (2003) as KZ1)

2. Block Matcher (BM, Konolige (1998))

3. Semi-Global Block Matcher (SGBM,

Hirschmuller (2008))

4. HITNet (Tankovich et al. (2021))

Reﬂections on the wooden parquet ﬂoor in dataset

Mixed sometimes made the system merge objects

with their reﬂections, making them seemingly appear

below the ﬂoor. As this must be addressed separately,

we have disregarded such errors in our evaluation.

We used the OpenCV (Bradski (2000)) implemen-

tation of all algorithms except HITNet, which was

available as a GIThub project with pretrained Ten-

Evaluating Two Ways for Mobile Robot Obstacle Avoidance with Stereo Cameras: Stereo View Algorithms and End-to-End Trained

Disparity-sensitive Networks

667

NCC(i, j, d) =

∑

∀i

, j

∈[0,7]

L(4i + i

, 4 j + j

)R(4i + i

− d, 4 j + j

)

∑

∀i

, j

∈[0,7]

L(4i + i

, 4 j + j

)

∑

∀i

, j

∈[0,7]

R(4i + i

− d, 4 j + j

)

(1)

Table 3: Results of qualitative evaluation. Mixed: 56, Out-

side: 52 stereo views were randomly selected and manually

analyzed by visual inspection. For deﬁnition of category

labels see text. Last column: proportion of non-bad stereo

views.

Stereo

Alg.

Dataset bad med. good

Prop.

¬bad

GC Mixed 48 8 0 14.28%

BM Mixed 51 5 0 8.92%

SGBM Mixed 46 10 0 17.85%

SGBM Outside 39 13 0 25.00%

HITnet Mixed 25 29 2 55.35%

HITnet Outside 19 32 1 63.46%

sorﬂow models.

Results can be found in Table 3,

however we will also comment on the strengths and

weaknesses of the tested algorithms in the following

subsections. The latter is based on the simulated runs

mentioned earlier.

5.1 GraphCut

GraphCut was one of the ﬁrst stereo algorithms, how-

ever it is also one of the slowest. It is described in Kim

et al. (2003) (as KZ1). With the default maximum

disparity of 16 it is about 100 times slower than the

fastest algorithm. Visual inspection shows many large

dark areas and black dots where no depth informa-

tion is reconstructed. During robot turns it becomes

completely white indicating minimum distance. A

maximum disparity of 40 with two-fold downscaling

made it slightly slower but the depth map became very

rough and pixelated. Again blobs were often found

with other depths that were overlaid over walls or ﬂat

objects. Walls were sometimes well detected but only

shortly. It does seem as if GC only detected axis-

parallel walls and does not deal with gradients. We

only ever saw blocks in one color (depth). In a way

the best scenes remind of HITNet – except the restric-

tion on axis-parallel walls – but processing speed is of

course several orders of magnitudes slower.

5.2 Block-Matching (BM)

Block-Matching (BM, Konolige (1998)) determines

disparity by matching blocks between both stereo

views. We’ve again used two-fold linear downscal-

ing for this algorithm. The depth map is very noisy

and seems to work only on ﬂat surfaces. Convex sur-

faces have good disparity values only at the border.

https://github.com/google-research/google-research/

tree/master/hitnet

The ﬂoor shows very little depth information even

when structured, e.g. parquet or ceramic tiles. Walls

are very hard to discern, and many ﬂoor tile seams

have completely wrong depth values. In many cases

only corners and edges have depth data and the rest

of the image is black. When turning, worse depth

maps are obtained, and motion blur is clarly discern-

able – when driving straight ahead for half a second

the depth maps stabilizes again and gives better but

still not useful results.

5.3 Semi-Global Block Matching

(SGBM)

Semi-Global Block Matching (SGBM, Hirschmuller

(2008)) is a much more intelligent version of Block-

Matching. With linear two-fold downscaling it runs

quite fast. When the robot is moving straight ahead

walls are often visible relatively well, unlike the ﬂoor.

When the robot turns, depth information vanishes ex-

cept for thick edges, but reestablish after a few sec-

onds. Parquet ﬂoor and the furniture of the same color

is not clearly differentiated, perhaps due its similarity

in color and texture. In complex scenes many differ-

ent patches with different distances are visible. Con-

cluding, SGBM is not perfect but still quite an im-

provement over BM.

A window size of 9x9 seems to work best and

gives reasonable results with simple scenes. For Out-

side, disparities do not change from frame to frame,

perhaps because the Compute Module platform had

lower latency since both cameras were attached to one

module (rather than two modules as for Mixed). Trees

and skies are sometimes easily discernable but trees

are not well separated. Objects are sometimes visible

but very noisy and blurry. A smaller size increases

noise in estimated depth, a larger size makes dispari-

ties change from frame to frame, probably due to the

less exact alignment, and makes object contours much

more blurry. Robot turns still create artefacts for all

tested window sizes of 3x3, 9x9 and 21x21.

5.4 HITNet

The main reason to use HITNet (Tankovich et al.

(2021)) was that it was a relatively fast current Deep

Learning network that performed well on benchmark

datasets, and was also the highest-ranking model on

Middlebury (at that time of running the experiments)

for which pretrained models were readily available.

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

668

Table 4: Sample images for the three categories bad, mediocre and good.

Cat. Dataset Stereo views GC BM SGBM HITnet

good

Mixed

good

Outside

mediocre

Mixed

mediocre

Outside

bad

Mixed

bad

Outside

We evaluated two different pretrained versions of

HITNet – eth3D and middlebury – on a random sam-

ple of 105 frames from Mixed using the pooled ver-

sion eth3D with two-fold linear downsampling, and

middlebury with two-fold linear upsampling.

48 cases, the depth map was clearly wrong for both

versions and we were unable to even approximately

match the shown depth maps to the stereo view by

visual inspection; in 41 cases, the eth3D version per-

formed better; and in 11 cases, the middlebury ver-

sion performed better. In 5 cases there was no signif-

icant difference. This indicates – not surprisingly –

that the eth3D version which was pooled over KITTI,

eth3D and middlebury datasets performs much better

than the version trained just on Middlebury. About 5

frames looked really well and would have been very

useful for obstacle avoidance could this behaviour be

obtained everywhere. We therefore chose the pooled

eth3D pretrained version for the qualitative evalua-

tion.

HITNet showed small ﬂuctuations even with

stereo views that change very little (i.e. when the

robot is not moving at all), and strong ﬂuctuations

during movement (perhaps due to the random latency

of Mixed dataset). Every movement – including ro-

tation – shows artefacts. Still HITNet was a signif-

icant improvement even over SGBM and sometimes

showed good depth maps which even allowed to de-

tect small objects (about every 20th frame). It is also

the only algorithm for which we actually found stereo

views of category good.

According to pers.comm. by V. Tankovich.

Using the Outside dataset also indicated that

around every second frame was somewhat useful, but

very noisy, and also that only very near objects are

somewhat reliably detected every second frame, but

sadly not in consecutive order – in most cases se-

quences of bad depth maps alternate with sequences

of mediocre depth maps. However in some cases HIT-

Net also hallucinated objects – some of these were

due to reﬂections and were disregarded but others

were clear mistakes. The Outside dataset does seem

to work slightly better than Mixed, however this could

be due to the better temporal alignment of the stereo

views for Outside.

It is clear from the results in Table 3 that HITnet

is about two to three times better than the best clas-

sical algorithm, SGBM, and yields useful (i.e. not

bad) depth maps about every other frame. It is also

clear that stereo views from natural scenes perform

better than those from indoor scenes by about 50%.

This could be explained by the more complex textures

in outdoor scenes or perhaps just the better temporal

alignment of stereo views for Outside.

As one of the cameras used to record Mixed has

failed before we could compute rectiﬁed stereo, we

had to use non-rectiﬁed stereo views. To make sure

the non-rectiﬁed stereo did not inﬂuence our results,

we also compared HITNet on a random sample of

47 frames from the Outside dataset with rectiﬁed ver-

sus non-rectiﬁed stereo views. In 29 cases, the non-

rectiﬁed version looked better; in 3 cases, the rectiﬁed

version looked better and in 15 cases there was no

signiﬁcant difference, so in effect the rectiﬁed stereo

Evaluating Two Ways for Mobile Robot Obstacle Avoidance with Stereo Cameras: Stereo View Algorithms and End-to-End Trained

Disparity-sensitive Networks

669

views performed worse. So we may tentatively con-

clude that the missing rectiﬁcation is not responsible

for the observed poor performance.

6 DISPARITY-SENSITIVE DEEP

LEARNING NETWORK

(NCC-DISP)

We designed a series of 25 consecutive networks

with different layers and architecture towards the ﬁ-

nal model presented here. The ﬁrst one was a straight-

forward ten-layer model with alternating Conv2D and

AvgPool2D layers. The basic principles which in our

observation correlate with good performance over this

sequence of networks were:

1. Segmentation: It proved important to process the

slices corresponding to different disparity values

independently and only merge them at the last

layer.

2. Dropout: At a higher level of parameters (more

than 500,000) the small dataset led to overﬁtting.

Adding a dropout layer just before the last layer

proved to resolve this – and also speeded up con-

vergence – in most cases.

3. Alternate Convolutional and Averaging Lay-

ers: Alternating convolutional and averaging lay-

ers worked reasonably well, and removing aver-

aging layers was harmful to performance.

4. Use Appropriate Dimensions: We started out

with a 2D representation using seven channels for

the different disparity values. However this lim-

ited our options for convolutional layers. Switch-

ing to a 3D representation of the same data using

one channel but seven z-layers enabled using the

3D functions of Conv and AvgPool, which imme-

diately improved the performance.

5. Use Activation Relu and MaxPool Layers:

Switching from sigmoid to relu activation and re-

placing AvgPool with MaxPool layers added ad-

ditional nonlinearity which beneﬁtted overall per-

formance.

The monkey visual cortex and most likely the human

visual cortex as well actually splits the different modalities

all the way – there is no point where the different levels are

connected directly and summed up – so clearly the brain

uses at least one other method to merge information without

physical connection. One good candidate is clearly neural

assembly synchronization, which does not seem to be avail-

able in Tensorﬂow yet.

Figure 3: Final disparity-sensitive model NCC-DISP for

R2X and OUT. The K3D model was slightly smaller since

its input was only [62,48,7,1] but was otherwise the same.

The number of layers was limited by the low res-

olution of the input dataset, so we could not in-

crease depth signiﬁcantly without using kernels of

very small sizes or removing averaging layers, both

of which proved to hamper performance. However

as the ﬁnal model (see Fig. 3) proved similar in

evaluation set accuracy to the original model (except

for OUT

Train), real-life performance is likely to be

comparable as well.

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

670

Table 5: Results of learning experiments

Dataset #Samp. Model #Param.

#Steps to

best model

Corr.

coeff.

Eval.Acc.

Acc. w/

center ±48

Acc. w/

center ±32

K3D Train

70,745 DAVE-like 71,867 278.7k 0.4032 59.58% 59.27% 59.08%

K3D Train 70,745 NCC-Disp 179,567 285.0k 0.3531 58.65% 58.17% 57.97%

R2X Train 70,745 DAVE-like 71,867 305.5k 0.3945 59.93% 59.57% 59.05%

R2X Train 70,745 NCC-Disp 186,623 347.7k 0.4088 59.69% 59.03% 58.80%

OUT Train 27,368 DAVE-like 71,867 69.5k 0.1807 58.10% 50.27% 40.87%

OUT Train 27,368 NCC-Disp 186,623 81.8k 0.1349 54.59% 46.01% 39.92%

For training, we used TensorFlow with AdamOp-

timizer and a learning rate of 10

−4

. Training data

was prepared as described in Sec. 4. We computed

normalized correlation coefﬁcient values on regular

8x8 pixel size patches taken from left and right im-

ages at seven different horizontal pixel disparity val-

ues, yielding a 3D image with a z-Depth seven. Note

that due to this transformation no actual image data

is fed into the network, essentially forcing it to use

disparity information only.

For steering outputs, we formulated steering as

a classiﬁcation problem, representing left, right and

straight forward as distinct classes. Left and right

were determined from a steering output of ±65 which

corresponds to half the maximum steering output of

±127. This variant corresponds to Cl3 of the original

paper Seewald (2020) which was the best-performing

output-model there.

Table 5 shows the results of the different mod-

els DAVE-like (as described in Seewald (2020))

and NCC-Disp (see Fig. 3) on the three datasets

K3D Train, R2X Train and OUT Train. As can be

seen, performance according to evaluation set accu-

racy and correlation between original and estimated

steering is similar between DAVE-like and NCC-Disp

models – except for OUT Train where it is noticably

worse. The number of training steps to the best model

is also rather similar. The number of parameters for

the NCC-Disp models is roughly 2.5 times the origi-

nal DAVE-like model but this has been somewhat off-

set by a dropout layer with 0.8, where 80% of weights

are randomly set to zero in each training step. Disre-

garding the normalized correlation coefﬁcient compu-

tation, the model is still well within the size where it

can be run on RPi 3B+ or higher.

We were surprised by the performance loss on

OUT Train. Initially we assumed that the disparity

levels may need to be adapted for the larger robot

used to collect this dataset, however experiments with

After submitting the ﬁnal paper version for Seewald

(2020), we noticed that by extending the training steps it

was possible to enable good performance for this robot

as well despite the highly distorted ﬁeld-of-view. We re-

trained, took it to the conference and demonstrated it after

the session, where it performed reasonably well.

two other disparity sets did not improve performance

(data not shown). However, there is one other signiﬁ-

cant difference: The OUT Train dataset was collected

in H264-compressed format while all other datasets

were collected in uncompressed YUV 4:2:2, which

adds an additional layer of noise on the image data.

This could well account for a noisier output from

the NCC preprocessing step and therefore for the ob-

served performance loss.

7 DISCUSSION

One limitation of our approach to represent dispar-

ity is obviously the simplistic normalized correlation

coefﬁcient function to determine patch similarity. It

would be better to use a patch similarity computation

layer as the ﬁrst layer of the network and optimize

it as a part of the training process. Such a network

could adapt receptive ﬁeld size and disparity sensitiv-

ity to actually needed values similar as evolution has

done for compound eyes in insects.

One method we have yet not considered is obsta-

cle detection by movement. For a moving robot, sim-

ple video analytics such as optical ﬂow can determine

objects on collision course by looming. However this

method is only applicable once the robot is moving

and cannot be used when the robot is standing still. It

would perhaps be feasible to extend the deep learning

model with recurrent loops to store past frames and

integrate such additional information as well. As our

data has been collected in long sequences it would be

well suited for such learning experiments.

8 CONCLUSION

We have shown that – in the absence of quantitative

data – a qualitative evaluation, despite its inevitable

subjectivity, is helpful in determining the overall use-

fulness of algorithms and models for a speciﬁc task.

In our case of stereo view algorithms we obtained a

negative result, however it is clear that deep learning

models are improving at a rapid rate and we may ex-

pect more progress in the future.

Evaluating Two Ways for Mobile Robot Obstacle Avoidance with Stereo Cameras: Stereo View Algorithms and End-to-End Trained

Disparity-sensitive Networks

671

As a second way towards obstacle avoidance with

stereo cameras, we have – after noting that deep learn-

ing algorithms such as Muller et al. (2004) do not rely

on view disparity – proposed the disparity-sensitive

deep learning network NCC-DISP which has only dis-

parity data as input, and show that it performs as well

as an earlier system using raw stereo camera inputs.

The next generation of ToyCollect robots will ad-

ditionally feature an active sensing depth camera to

generate approximate depth maps which would al-

low a qualitative evaluation of stereo camera algo-

rithms. Adding bumper and acceleration sensors will

furthermore allow to automate the generation of large

amounts of obstacle detection training data, which is

currently still a manual process.

ACKNOWLEDGEMENTS

This project was partially funded by the Austrian Re-

search Promotion Agency (FFG) and by the Austrian

Federal Ministry for Transport, Innovation and Tech-

nology (BMVIT) within the Talente internship re-

search program 2014-2019. We would like to thank

Lukas D.-B. for all 3D chassis designs of robot R2X;

Vladimir T. for answering questions and explaining

his model; and all interns which have worked on this

project, notably Georg W., Julian F. and Miriam T.

REFERENCES

Becker, F. and Ebner, M. (2019). Collision detection for

a mobile robot using logistic regression. In Proceed-

ings of the 16th International Conference on Informat-

ics in Control, Automation and Robotics - Volume 2:

ICINCO,, pages 167–173. INSTICC, SciTePress.

Bojarski, M., Del Testa, D., Dworakowski, D., et al. (2016).

End to end learning for self-driving cars. Technical

Report 1604.07316, Cornell University.

Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Jour-

nal of Software Tools.

Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza,

D., Neira, J., Reid, I., and Leonard, J. (2016). Past,

present, and future of simultaneous localization and

mapping: Towards the robust-perception age. IEEE

Transactions on Robotics, 32(6):1309–1332.

Hirschmuller, H. (2008). Stereo processing by semiglobal

matching and mutual information. PAMI, 30(2):328–

341.

Kandel, E. R., Schwartz, J. H., Jessell, T. M., of Biochem-

istry, D., Jessell, M. B. T., Siegelbaum, S., and Hud-

speth, A. (2000). Principles of neural science, 4th

Edition. McGraw-Hill New York.

Khan, M. and Parker, G. (2019). Vision based indoor obsta-

cle avoidance using a deep convolutional neural net-

work. In Proceedings of the 11th International Joint

Conference on Computational Intelligence - NCTA,

(IJCCI 2019), pages 403–411. INSTICC, SciTePress.

Kim, J., Kolmogorov, V., and Zabih, R. (2003). Visual cor-

respondence using energy minimization and mutual

information. In ICCV ’03: Proceedings of the Ninth

IEEE International Conference on Computer Vision,

page 1033, Washington, DC, USA. IEEE Computer

Society.

Konolige, K. (1998). Small vision systems: Hardware and

implementation. In Robotics research, pages 203–

212. Springer.

Luo, W., Xiao, Z., Ebel, H., and Eberhard, P. (2019). Stereo

vision-based autonomous target detection and track-

ing on an omnidirectional mobile robot. In Proceed-

ings of the 16th International Conference on Informat-

ics in Control, Automation and Robotics - Volume 2:

ICINCO,, pages 268–275. INSTICC, SciTePress.

Muller, U., Ben, J., Cosatto, E., Fleep, B., and LeCun, Y.

(2004). Autonomous off-road vehicle control using

end-to-end learning. Technical report, DARPA-IPTO,

Arlington, Virginia, USA. ARPA Order Q458, Pro-

gram 3D10, DARPA/CMO Contract #MDA972-03-

C-0111, V1.2, 2004/07/30.

Muller, U., Ben, J., Cosatto, E., Fleep, B., and LeCun, Y.

(2006). Off-road obstacle avoidance through end-to-

end learning. In Advances in neural information pro-

cessing systems, pages 739–746.

Pfeiffer, M., Schaeuble, M., Nieto, J. I., Siegwart, R.,

and Cadena, C. (2016). From perception to de-

cision: A data-driven approach to end-to-end mo-

tion planning for autonomous ground robots. CoRR,

abs/1609.07910.

Seewald, A. K. (2020). Revisiting End-to-end Deep Learn-

ing for Obstacle Avoidance: Replication and Open Is-

sues. In Proceedings of the 12th International Con-

ference on Agents and Artiﬁcial Intelligence (ICAART

2020), Valetta, Malta., pages 652–659.

Tankovich, V., Hane, C., Zhang, Y., Kowdle, A., Fanello, S.,

and Bouaziz, S. (2021). Hitnet: Hierarchical iterative

tile reﬁnement network for real-time stereo matching.

In Proceedings of the IEEE/CVF Conference on Com-

puter Vision and Pattern Recognition, pages 14362–

14372.

Wang, Y., Dongfang, L., Jeon, H., Chu, Z., and Matson, ET.

(2019). End-to-end learning approach for autonomous

driving: A convolutional neural network model. In

Rocha, A., Steels, L., and van den Herik, J., editors,

Proc. of ICAART 2019, volume 2, pages 833–839.

ICAART 2022 - 14th International Conference on Agents and Artiﬁcial Intelligence

672