Environment-aware Sensor Fusion using Deep Learning

Caio Fischer Silva 1,2 (https://orcid.org/0000-0001-7948-5036), Paulo V. K. Borges 1 (https://orcid.org/0000-0001-8137-7245) and José E. C. Castanho 2 (https://orcid.org/0000-0003-1762-7478)

1 Robotics and Autonomous Systems Group, CSIRO, Australia
2 School of Engineering, São Paulo State University - UNESP, Bauru, SP, Brazil

Keywords: Environment-aware Sensor Fusion using Deep Learning.
Abstract:
A reliable perception pipeline is crucial to the operation of a safe and efficient autonomous vehicle. Fusing
information from multiple sensors has become a common practice to increase robustness, given that different
types of sensors have distinct sensing characteristics. Further, sensors can present diverse performance according to the operating environment. Most systems rely on a rigid sensor fusion strategy which considers only the sensors' input (e.g., signals and corresponding covariances), without incorporating the influence of the environment, which often causes poor performance in mixed scenarios. In our approach, we have adjusted
the sensor fusion strategy according to a classification of the scene around the vehicle. A convolutional
neural network was employed to classify the environment, and this classification is used to select the best
sensor configuration accordingly. We present experiments with a full-size autonomous vehicle operating in
a heterogeneous environment. The results illustrate the applicability of the method with enhanced odometry
estimation when compared to a rigid sensor fusion scheme.
1 INTRODUCTION
With the recent advances in robotics, autonomous mo-
bile robots are now operating in a broad range of do-
mains. Some well-known examples include industrial
plants (Borges et al., 2013), urban traffic (Bojarski
et al., 2016) and agriculture (Lottes et al., 2018). The
navigation system required to perform efficiently in
such scenarios needs to be robust to several opera-
tional challenges such as obstacle avoidance and re-
liable localization. When the same vehicle/platform
navigates through significantly different environments,
the navigation can be even more challenging. The
site shown in Figure 1 is representative of this situ-
ation. The highlighted paths represent sections on
which an autonomous vehicle should navigate to per-
form a given task. The image shows that different
regions of the tracks present distinct structural char-
acteristics. In some segments the vehicle must travel
through a densely built-up area, with large structures
and metallic sheds. In contrast, in other sections, the
path is unstructured and mostly surrounded by vegeta-
tion, including off-road terrain.
Figure 1: Satellite view illustrating a heterogeneous operation space. The red path was used to train the CNN to classify the environment, while the white path was used to validate its performance. Image from Google Maps.

Sensors are required in robotic navigation to obtain information about the robot's surroundings. Since
each sensor has advantages and drawbacks, a single
sensor is often not sufficient to reliably represent the
world, and hence fusing data from multiple sensors has
become a common practice. Probabilistic techniques,
such as the Kalman filter (Maybeck, 1990a) and the
Particle filter (Maybeck, 1990b), enable sensor fusion
by explicitly modeling the uncertainty of each sensor.
Figure 2: Architecture overview of the environment-aware sensor fusion applied to odometry.
These are well-known approaches with optimal
performance when the navigation takes place in quite
homogeneous environments. However, in challenging
and mixed scenarios, as described above, employing
rigid statistical models of sensor noise may provide
a suboptimal solution. An environment-aware sensor fusion, which dynamically adapts to each environment, can provide better performance.
Previous work using the teach and repeat approach illustrated the effectiveness of such an adaptive scheme (Rechy Romero et al., 2016), but it is limited to a previously defined path. To overcome this constraint, we propose applying convolutional neural networks to camera images to recognize typical navigation environments (such as indoor, off-road, industrial, urban) and to associate that information with the best sensor fusion strategy. This approach is the gist of the method proposed in this paper.
The proposed method was implemented and evaluated on a full-size autonomous utility vehicle developed by the Robotics and Autonomous Systems Laboratory at CSIRO (see Figure 4).
To validate the performance of the adaptive sensor fusion scheme, we applied it to odometry estimation. In the presence of a ground truth, provided by a reliable localization system, estimating the odometry error becomes trivial, which makes the performance evaluation of the proposed method accurate and straightforward.
Figure 2 shows a block diagram of the proposed
method. Experimental results have shown a reduction
in the estimation errors in comparison to using a rigid
combination of the sensors.
1.1 Related Work
Dividing the navigation map to consider different domains has been used before to enhance the performance of mobile robots (Lowry et al., 2016; Churchill and Newman, 2012; McManus et al., 2012; Rechy Romero et al., 2016). In this paradigm, usually known as teach and repeat, the navigation map is visited in an initial phase in which the environment is learned. Then, it is divided into sub-maps which are later used to adjust the behavior of the robot. This paradigm was employed by Furgale and Barfoot (2010) to enhance long-range autonomous navigation of a mobile robot in outdoor, GPS-denied, and unstructured environments. However, sub-maps can only be used to change the system response in locations previously visited. That is, the behavior of the system is not defined in unknown places, even if they are similar to those visited before.
In Rechy Romero et al. (2016), an adaptive sen-
sor fusion technique is applied for obstacle detection.
It performs better than using each sensor alone or a
covariance-based weighted combination of them. The
authors have driven an automated vehicle through
a heterogeneous operation environment and quanti-
fied the performance of each obstacle-detecting sensor
along the trajectory. This information was used by
an environment-aware sensor fusion (EASF) strategy
that provides different confidence levels to each sensor
based on its location along the path. The method uses
a look-up table that relates the vehicle’s position with
the best sensor configuration.
Suger et al. (2016) also proposed an adaptive approach to obstacle detection for mobile robots. A random forest classifier was trained to identify each environment using local geometrical descriptors from a point cloud, so it can classify places not visited during the training. The work presents a classification metric but does not elaborate on how the adaptive strategy improved obstacle detection.
An adaptive scheme for robot localization was used by Guilherme et al. (2016). The robot is equipped with a short-range laser scanner and a Global Positioning System (GPS) module. A histogram of the distances between occupied cells on an occupancy grid built from the laser scanner was used to classify the environment as outdoor or indoor using the k-nearest neighbor algorithm. In outdoor environments, the localization system relies only on the GPS module, while indoors it uses the laser scanner and a previously built map.
In 2016, NVIDIA's researchers (Bojarski et al., 2016) trained a convolutional neural network to map the raw pixels from a front-facing camera to steering angles in order to control a self-driving car on roads and highways (with or without lane markings). According to the authors, end-to-end learning leads to better performance than optimizing human-selected intermediate criteria, such as lane detection. This work has shown the potential of convolutional neural networks for the highly challenging task of autonomous driving.
Zhou et al. (2014) have demonstrated that a properly trained convolutional neural network can identify different environments based only on visual information, using the so-called “deep features”.
These related works have shown that adding information about the environment can lead to more robust systems, able to operate in mixed scenarios. We combine the teach and repeat paradigm with deep learning based visual scene classification to optimize the sensor fusion performance. As a result, our system assigns the optimal sensor configuration to each scene based only on visual information.
1.2 Visual Pose Estimation
Visual Odometry (VO) is the process of estimating the
movement of a robot given a sequence of images from
a camera attached to it. The idea was first introduced in
1980 for planetary rovers operating on Mars (Moravec,
1980).
However, it is still an active research topic, as can be seen from the fact that the work on visual odometry by Forster et al. (2017) won the IEEE RAS Publication Award in 2018.

The classical approach to VO relies on extracting and tracking visual features, and then combining the relative motion of these features across a sequence of images to estimate the camera's movement. The process of simultaneously localizing the robot and mapping the environment using visual information is called Visual SLAM.
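For illustration, one frame-to-frame step of this classical pipeline can be written in a few OpenCV calls. The sketch below is our own minimal example (not the visual odometry used in this work, which is based on ORB SLAM2, see Section 3.2): it matches ORB features between two consecutive frames and recovers the relative rotation and translation direction from the essential matrix, given the camera intrinsics K.

```python
import cv2
import numpy as np

def relative_pose(img_prev, img_curr, K):
    """Estimate the relative camera motion between two consecutive
    grayscale frames: detect ORB features, match them, and recover
    the pose from the essential matrix (translation up to scale)."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)

    # Brute-force Hamming matching suits binary ORB descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # RANSAC rejects outlier correspondences when fitting E.
    E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
    return R, t  # rotation matrix and unit-norm translation direction
```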
The output of a VO system is usually fused with
other sensor modalities to improve its accuracy. A
popular choice for ground vehicles is fusing it with
wheel odometry (Bischoff et al., 2012), but inertial
sensors are broadly used as well (Corke et al., 2007).
2 ENVIRONMENT-AWARE
SENSOR FUSION
In this section we describe the framework for adaptive sensor fusion, exploiting visual information from the environment. We also present the performance metric used to compare the odometry methods employed and to define the best sensor configuration for each environment.
2.1 Performance Metric
We chose the metric proposed by Burgard et al. (2009) to compare the employed odometry methods. This metric was proposed as an objective benchmark for the comparison of SLAM algorithms. Since it uses only relative geometric relations between poses along the robot's trajectory, one can use it to compare odometry methods without loss of generality.

Figure 3: Example of why the absolute difference is suboptimal (Burgard et al., 2009).
In the presence of a ground truth trajectory, it is usual to compute the odometry error as the absolute difference between the estimated poses and the ground truth. Burgard et al. (2009) claim that this metric is suboptimal because an error in the estimation of a single transition between poses increases the error of all subsequent poses. To illustrate this behavior, suppose a robot moving in a straight line with a perfect pose estimation system, except for a single rotation error somewhere, let us say in the middle of the trajectory, as shown in Figure 3.

Using the absolute difference would assign a zero error to all the poses in submap 1, as expected for an error-free pose estimator. But it would assign a non-zero error to all the poses in submap 2, even though the error is present only in the transition between two particular poses, shown as a bold arrow in the figure.
The proposed metric is based on the relative displacement between the poses. Given two poses $x_i$ and $x_j$ in a trajectory, $\delta_{i,j}$ is defined as the relative transformation that moves from pose $x_i$ to pose $x_j$. Given $x_{1:T}$, the set of estimated poses, and $x^*_{1:T}$, the ground truth ones, the relative error is defined in (1) as the squared difference between the estimated and the ground truth relative transformations, respectively $\delta$ and $\delta^*$. In the example from Figure 3, the relative error is non-zero only for the transformation represented by the bold arrow.

$$\varepsilon(\delta) = \frac{1}{N} \sum_{i,j} \left( \delta_{i,j} \ominus \delta^*_{i,j} \right)^2 \qquad (1)$$
By selecting the relative displacement $\delta_{i,j}$, one can highlight specific properties. For instance, by computing the relative displacement between nearby poses, the local consistency is highlighted. In contrast, the relative displacement between far away poses enforces the overall geometry of the trajectory. In the experiments, we used a mid-range displacement, large enough to include some large-scale geometry information while still highlighting local consistency.
We used a 10-second time interval to compute the relative transformations, which resulted in an average distance of 25 meters between poses when the vehicle was moving in a straight line. The ground truth trajectory was provided by a 3D LIDAR localization system, based on the SLAM algorithm proposed by Bosse and Zlot (2009), operating in a previously mapped area.
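For reference, the metric of Equation (1) can be computed directly from two synchronized pose sequences. The sketch below is our own simplified illustration for planar poses (x, y, theta): it uses a fixed index offset in place of the 10-second interval and a componentwise difference of the relative transforms instead of the full pose composition operator.

```python
import numpy as np

def relative_transform(pose_a, pose_b):
    """Relative transformation delta_{a,b} that moves pose_a = (x, y, theta)
    onto pose_b, expressed in the frame of pose_a."""
    dx, dy = pose_b[:2] - pose_a[:2]
    c, s = np.cos(pose_a[2]), np.sin(pose_a[2])
    dtheta = np.arctan2(np.sin(pose_b[2] - pose_a[2]),
                        np.cos(pose_b[2] - pose_a[2]))
    return np.array([c * dx + s * dy, -s * dx + c * dy, dtheta])

def relative_error(estimated, ground_truth, step):
    """Mean squared difference between estimated and ground-truth
    relative transforms taken 'step' poses apart."""
    errors = []
    for i in range(len(estimated) - step):
        d_est = relative_transform(estimated[i], estimated[i + step])
        d_gt = relative_transform(ground_truth[i], ground_truth[i + step])
        diff = d_est - d_gt
        # Wrap the angular component of the difference to [-pi, pi].
        diff[2] = np.arctan2(np.sin(diff[2]), np.cos(diff[2]))
        errors.append(diff @ diff)
    return np.mean(errors)
```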
2.2 Informed Sensor Fusion Strategy
We define a sensor configuration $\lambda$ as the combination of weights describing the reliability of each sensor. Considering a system equipped with $n$ different sensors, the sensor configuration is a vector of $n$ elements as follows:

$$\lambda = [\alpha_1, \alpha_2, \ldots, \alpha_n]^T \in \mathbb{R}^n, \quad \text{with } \alpha_i \in [0, 1] \qquad (2)$$

where $\alpha_i$ represents the reliability of sensor $i$. If it is equal to zero, the sensor will not be used in the fusion; if its value is one, the sensor will be fused using the provided error model. Intermediate values should proportionally increase the uncertainty of the sensor.
Appropriately changing the sensor configuration can prevent the hazardous situation where the system is very confident about a wrong estimate, or the suboptimal situation where the system assigns an unrealistically high uncertainty to a sensor in all scenarios to compensate for its high error in some domains.
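One simple way to realize such intermediate values, sketched below purely as an illustration (the covariance-inflation rule is our assumption, not the scheme evaluated in this paper), is to scale each sensor's measurement covariance by $1/\alpha_i$ before fusion, so that $\alpha_i = 1$ keeps the provided error model and $\alpha_i$ close to zero effectively disables the sensor.

```python
import numpy as np

def apply_sensor_configuration(covariances, weights, eps=1e-6):
    """Scale each sensor's measurement covariance according to its
    reliability weight alpha_i in [0, 1].

    covariances: list of square covariance matrices, one per sensor.
    weights:     sensor configuration lambda = [alpha_1, ..., alpha_n].
    alpha_i = 0 yields an (almost) infinite covariance, which makes the
    filter effectively ignore that sensor."""
    adjusted = []
    for cov, alpha in zip(covariances, weights):
        alpha = max(float(alpha), eps)      # avoid division by zero
        adjusted.append(np.asarray(cov) / alpha)
    return adjusted

# Example: wheel odometry fully trusted, visual odometry half trusted.
wheel_cov = np.diag([0.02, 0.02, 0.01])
vo_cov = np.diag([0.05, 0.05, 0.02])
adjusted = apply_sensor_configuration([wheel_cov, vo_cov], [1.0, 0.5])
```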
As described in Section 1.1, the teach and repeat
paradigm can be used to select a suitable sensor config-
uration, but in this work, we propose the use of visual
scene classification.
3 EXPERIMENTAL SET-UP
In this section, we describe the experimental set-up
and some implemented methods.
3.1 Vehicle Description
The robot is built upon a John Deere Gator, an electric
medium-size utility vehicle (see Figure 4). The vehi-
cle has been fully automated in the Robotics and Au-
tonomous System Laboratory at CSIRO (Egger et al.,
2018; Pfrunder et al., 2017).
The vehicle is equipped with a Velodyne VLP-16 Puck LIDAR, which provides a 360-degree 3D point cloud with a typical range accuracy of ±3 cm and is used for localization and obstacle avoidance. Besides that, the vehicle has four safety 2D LIDARs (one on each corner). Whenever an object is detected by these lasers inside a safety zone, an emergency stop signal is triggered.
As usual in wheeled robots, the Gator has a wheel odometer, made of a metal disc pressed onto the brake drum and an inductive sensor. In addition, a visual odometry algorithm was implemented using as input the images from an Intel RealSense D435 (Keselman et al., 2017) mounted front-facing on the vehicle; details are provided in Section 3.2.

Figure 4: The robot used, a John Deere Gator holding multiple sensors.
The vehicle holds two computers, one used for the low-level hardware control and the other for high-level tasks, such as localization and path planning. The integration between the computers and the sensors is done using the Robot Operating System (ROS) (Quigley et al., 2009).
3.2 Visual Odometry
A popular Visual SLAM implementation is ORB SLAM2 (Mur-Artal and Tardós, 2017), which uses the ORB (Oriented FAST and Rotated BRIEF) feature detector. ORB SLAM2 is an open-source library for monocular, stereo and RGB-D cameras that includes loop closure and relocalization capabilities. We disabled the loop closure and relocalization threads to get a pure visual odometry behavior.
ORB SLAM2 classifies the detected features into close and far key points by applying a distance threshold. The close key points can be safely triangulated between consecutive frames, providing a reliable translation inference. On the other hand, the far points tend to give a more accurate rotation inference, since multiple views support them.
Although the authors of ORB SLAM2 provide a ROS node implementation, it is only able to read images from a ROS topic and output the final path as a text file. We modified the library to provide a ROS-friendly interface, publishing the estimated pose and covariance in real time, both on a ROS topic and on the TF tree.
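As an illustration of such an interface, the sketch below shows how a pose estimate could be republished as a ROS odometry message and a TF transform. The topic and frame names and the publish_pose() entry point are hypothetical; this is not the authors' actual patch to ORB SLAM2.

```python
import rospy
import tf2_ros
from nav_msgs.msg import Odometry
from geometry_msgs.msg import TransformStamped

rospy.init_node("orbslam2_bridge")
odom_pub = rospy.Publisher("vo/odometry", Odometry, queue_size=10)
tf_broadcaster = tf2_ros.TransformBroadcaster()

def publish_pose(position, orientation, covariance):
    """position: (x, y, z); orientation: quaternion (x, y, z, w);
    covariance: 36-element row-major 6x6 pose covariance."""
    now = rospy.Time.now()

    # Odometry message consumed by downstream fusion nodes.
    odom = Odometry()
    odom.header.stamp = now
    odom.header.frame_id = "odom"
    odom.child_frame_id = "base_link"
    (odom.pose.pose.position.x, odom.pose.pose.position.y,
     odom.pose.pose.position.z) = position
    (odom.pose.pose.orientation.x, odom.pose.pose.orientation.y,
     odom.pose.pose.orientation.z, odom.pose.pose.orientation.w) = orientation
    odom.pose.covariance = list(covariance)
    odom_pub.publish(odom)

    # Mirror the same estimate on the TF tree.
    t = TransformStamped()
    t.header.stamp = now
    t.header.frame_id = "odom"
    t.child_frame_id = "base_link"
    (t.transform.translation.x, t.transform.translation.y,
     t.transform.translation.z) = position
    (t.transform.rotation.x, t.transform.rotation.y,
     t.transform.rotation.z, t.transform.rotation.w) = orientation
    tf_broadcaster.sendTransform(t)
```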
In addition to a standard RGB sensor, the Intel RealSense D435 provides a stereo pair of infrared (IR) cameras and an IR pattern projector used for RGB-D imaging. The pair of IR images was used for the visual odometry, since it performed better than the RGB-D sensor while outdoors.

Figure 5: Sample image after the adaptive histogram equalization; the green dots stand for the detected ORB features.
The images were equalized before the feature extraction. Histogram equalization is a widespread technique in image processing, used to enhance the image's contrast. It often performs poorly when the image has a bi-modal histogram, i.e., images that have both dark and bright areas. This effect was minimized using the Contrast Limited Adaptive Histogram Equalization (CLAHE) algorithm (Pizer et al., 1987).
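The preprocessing step can be reproduced with standard OpenCV calls. The sketch below is a minimal example (the clip limit and tile size are our own illustrative values, not necessarily those used in the paper): it equalizes a single-channel IR frame with CLAHE and then detects ORB features, as in Figure 5.

```python
import cv2

def preprocess_and_detect(ir_image):
    """Apply CLAHE to a single-channel IR frame and detect ORB features.
    Returns the equalized image and the detected key points."""
    clahe = cv2.createCLAHE(clipLimit=3.0, tileGridSize=(8, 8))
    equalized = clahe.apply(ir_image)

    orb = cv2.ORB_create(nfeatures=2000)
    keypoints = orb.detect(equalized, None)
    return equalized, keypoints

# Usage: read a grayscale frame and visualize the detections in green.
frame = cv2.imread("ir_frame.png", cv2.IMREAD_GRAYSCALE)
equalized, keypoints = preprocess_and_detect(frame)
debug = cv2.drawKeypoints(equalized, keypoints, None, color=(0, 255, 0))
```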
Enhancing the contrast made it easier to find the visual features, making the system more robust to the challenging light conditions inherent to outdoor operation. Figure 5 shows an image after the CLAHE, with green dots indicating the extracted ORB features. The features are spread over the image, with some key points close to the camera, enhancing the translation estimation.
A demo video of the visual odometry running on the Gator vehicle is available at https://youtu.be/I2bq0zsCuME.
3.3 ROS robot_localization Package
The Extended Kalman Filter (EKF) (Julier and
Uhlmann, 2004) is probably the most popular algo-
rithm for sensor fusion in robotics. Fusing wheel and
image sensors is a classic combination for odometry
(Bischoff et al., 2012), but there are other options,
such as Inertial Measurement Unit (IMU), LIDAR,
RADAR, and Global Positioning System (GPS).
The ROS package robot_localization (Moore and Stouch, 2014) provides an implementation of a non-linear pose estimator (EKF) for robots moving in 3D space. The package can fuse an arbitrary number of sensors. It receives as a parameter a binary vector indicating which sensors should be fused and which should be ignored. This vector can be seen as a particular case of the sensor configuration defined in Section 2.2.
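As a concrete illustration of this special case, a sensor configuration $\lambda$ can be collapsed into such a binary fuse/ignore vector with a simple threshold. This is a sketch under our own assumptions; the threshold value is arbitrary and the actual parameter handling inside robot_localization is not shown.

```python
def binary_configuration(weights, threshold=0.5):
    """Collapse a sensor configuration lambda = [alpha_1, ..., alpha_n]
    into a binary fuse/ignore vector for the EKF front end."""
    return [1 if alpha >= threshold else 0 for alpha in weights]

# Example: full trust in wheel odometry, low trust in visual odometry.
# [1.0, 0.3] -> [1, 0]: fuse sensor 1, ignore sensor 2.
assert binary_configuration([1.0, 0.3]) == [1, 0]
```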
4 VISUAL ENVIRONMENT
CLASSIFICATION
The environment classification was treated as a classical supervised image classification problem. The operation area, shown in Figure 1, was divided into three classes, named industrial, parking lot and off-road.
In Section 2.2 we defined the sensor configuration for informed sensor fusion, in which the reliability of each sensor is a value between zero and one. Without loss of generality, in our implementation we used the values $\alpha_i$ in Equation 2 as either zero or one, i.e., a binary representation.
In the industrial and parking lot environments, the surface is even, made of asphalt or concrete. In this scenario the wheel slippage is low, and the wheel odometry presents a low error. Even if the ground does not present many visual features, the visual odometry performs properly by relying on the far key points. Hence, both sensors are fused to estimate the odometry.

On the other hand, in the off-road environment, the flatness assumption of human-made environments does not hold, which, combined with the increase in wheel slippage, results in poor performance of the wheel odometry. In contrast, the visual odometry can benefit from the feature richness of the uneven terrain. So, only the VO is used in this scenario.
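A possible implementation of this per-class switching is sketched below under our own assumptions: the topic names, the use of a String message for the predicted class, and the strategy of gating messages by republishing are illustrative choices, not the exact pipeline of Figure 2. The node forwards each odometry source to the EKF input only when the current environment class enables it.

```python
import rospy
from nav_msgs.msg import Odometry
from std_msgs.msg import String

# Binary sensor configuration lambda per environment class: (wheel, visual).
SENSOR_CONFIG = {
    "industrial":  (1, 1),
    "parking_lot": (1, 1),
    "off_road":    (0, 1),   # wheel odometry is unreliable off-road
}

class EnvironmentAwareGate:
    """Republish each odometry source towards the EKF input topics only
    when the current environment class enables that sensor."""

    def __init__(self):
        self.config = SENSOR_CONFIG["industrial"]
        self.wheel_pub = rospy.Publisher("ekf/wheel_odom", Odometry, queue_size=10)
        self.visual_pub = rospy.Publisher("ekf/visual_odom", Odometry, queue_size=10)
        rospy.Subscriber("wheel_odom", Odometry, self.on_wheel)
        rospy.Subscriber("visual_odom", Odometry, self.on_visual)
        rospy.Subscriber("environment_class", String, self.on_class)

    def on_class(self, msg):
        # Fall back to fusing everything if the class is unknown.
        self.config = SENSOR_CONFIG.get(msg.data, (1, 1))

    def on_wheel(self, msg):
        if self.config[0]:
            self.wheel_pub.publish(msg)

    def on_visual(self, msg):
        if self.config[1]:
            self.visual_pub.publish(msg)

if __name__ == "__main__":
    rospy.init_node("environment_aware_gate")
    EnvironmentAwareGate()
    rospy.spin()
```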
The industrial and parking lot classes could be treated as a single class, since both of them produce the same sensor fusion behavior. We decided to keep both classes instead of merging them, since other parameters could be changed according to these environments in future work, such as allowing reverse driving in the parking lot.
We divided the campus into two closed loops: the first was used to collect the images to train the CNN and the second to validate the neural network's classification performance, respectively the red and white paths in Figure 1. Both paths contain segments of the three classes, but the second path was not exposed to the CNN during the training phase.
Figure 6 shows samples of the images used in the training and testing sets. One can see the challenging lighting conditions, inherent to outdoor operation.
A pre-trained implementation of the VGG16 (Simonyan and Zisserman, 2014) was used to classify the images. The VGG16 is a 16-layer network used by the VGG team in the ILSVRC-2014 competition. The network was initially trained using 224 × 224 images assigned to one of the thousand labels present in the ImageNet dataset (Deng et al., 2009). Through transfer learning, we froze the convolutional layers and trained our classifier on a custom-built dataset of the three classes described above.
Figure 6: The first row shows images used to train the CNN and the second row images used to validate its performance.
Transfer learning relies on the assumption that the features learned to solve a particular computer vision problem might be useful for solving similar ones. Its main advantages are the reduction in training time and in data requirements.
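A minimal sketch of this transfer-learning setup is shown below, assuming a Keras VGG16 backbone; the size of the custom classification head and the training hyperparameters are our own illustrative choices, not the ones used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 3  # industrial, parking lot, off-road

# Convolutional base pre-trained on ImageNet, with its weights frozen.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False

# Custom classification head trained on the three environment classes.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# train_images: (N, 224, 224, 3) preprocessed frames; train_labels: one-hot.
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```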
We used ten thousand images of each label. Since the camera generates around thirty images per second, collecting them is not a demanding task. The images were collected on different days and at different times of the day, increasing the statistical significance of the dataset.
At the beginning of the training, the network struggled with the transitions between scenarios and with some segments of the off-road environment. By inspecting the classification errors, looking for hard negatives, we found that these were mostly pictures from the off-road environment that showed buildings, which the neural network misclassified as an industrial area. After collecting more data in these circumstances, the neural network was able to generalize well.
The neural network classification ran at 15 Hz on the same computer used to run the visual odometry. Assuming that the environment does not change at a high frequency, real-time execution is not a requirement of the classifier. Thus, the prediction could be made less often to reduce the computational burden.
5 EXPERIMENTS
We have defined two closed loops in the testing field. Figure 1 shows path one in red and path two in white. Path one was visited during the data collection to train the classifier, as described in Section 4. We collected six and four samples from paths one and two, respectively.
The raw measurements of all the sensors were saved during the data collection. After that, we estimated offline the vehicle's trajectory using each sensor alone, a rigid fusion of the wheel and visual odometry, and the environment-aware sensor fusion strategy described before. These estimated trajectories were compared with the ground truth poses to obtain the relative error as described in Section 2.1.

Figure 7: Relative error in the second path.
6 RESULTS
6.1 Scene Classification Accuracy
After the data collection and the training described in Section 4, the network achieved 98.7% classification accuracy on the training set (red path) and 97.2% on the validation set (white path). This high accuracy might be seen as overfitting, since both the training and validation sets were collected on the same campus. The accuracy in an extremely different landscape would probably be much lower. However, that is also a limitation of the teach and repeat approach.

By using convolutional neural networks, we provide the system with the ability to operate in places never visited before. It should be noted that the white path, shown in Figure 1, was never visited during the training phase.

Considering the high accuracy achieved in the network validation, we expect a near-optimal classification and, as a consequence, the same sensor fusion behavior on both paths.
6.2 Odometry Accuracy
Figure 7 shows the relative error for each odometry method on a particular sample from the second path. The wheel odometry error is not shown in the plot, to improve the visualization, since it is an order of magnitude larger. As expected, the error of the environment-aware approach follows the trend of the approach with the smallest error in each time interval.
Tables 1 and 2 present the average relative error in paths one and two, respectively. Using only the wheel odometry is the worst option in both.

Table 1: Mean relative error in the training path.

Sensor                                        Mean Relative Error (m)
Wheel Odometry                                3.01 (±0.57)
Visual Odometry (ORB SLAM2)                   0.54 (±0.09)
Wheel Odometry + Visual Odometry (EKF)        0.34 (±0.10)
Environment-aware Sensor Fusion (EASF)        0.35 (±0.10)

Table 2: Mean relative error in the testing path.

Sensor                                        Mean Relative Error (m)
Wheel Odometry                                2.72 (±0.37)
Visual Odometry (ORB SLAM2)                   0.49 (±0.09)
Wheel Odometry + Visual Odometry (EKF)        0.65 (±0.25)
Environment-aware Sensor Fusion (EASF)        0.36 (±0.10)
In the red path, the EKF (rigid sensor fusion) improved the odometry estimation by 57.2% when compared with the visual odometry, while the EASF approach improved it by only 55.4%. Thus, the EASF was 1.8% less accurate than the rigid sensor fusion scheme.
However, on the white path, the rigid fusion resulted in a larger error than using only the VO, probably due to the poor performance of the wheel odometry in this scenario. This noise does not affect the EASF scheme, which reduced the error by 31.1% with respect to the VO. Thus, the error of the EASF is more than 50% smaller than the error of the EKF.
This difference in the average performance might be caused by the low presence of the off-road scenario in the first path: it is just a small section in a big loop. On the other hand, the second path has nearly equally distributed sections of off-road and on-road domains.

Since the covariance of the wheel odometry was measured on asphalt and concrete, it is not a good representation of the error while driving off-road. This overconfidence leads the rigid fusion to poor estimates.
The results in the second path showed that using visual information to switch between odometry sources according to the environment can lead to better performance than always fusing all available sensors. In more challenging operational spaces, for instance, paths including mud and gravel, our approach might perform even better.
7 CONCLUSIONS
We have presented a new approach to dynamically
adapt a sensor fusion strategy for autonomous robot
navigation based on the surrounding environment.
Convolutional neural networks were trained to recognize images of the environment in which the robot navigates, and based on this information the system adapts its sensor fusion strategy.
The proposed method can be seen as an extension
to the teach and repeat paradigm using convolutional
neural networks.
To validate the concepts, we also presented a prac-
tical implementation of the system on a full-size au-
tonomous vehicle. It is shown that, in environments
where the sensor behavior changes, it is possible to se-
lect a more suitable sensor configuration using visual
information to improve the odometry capabilities of
the system. Experimental results have shown an im-
provement in performance when compared to a rigid
sensor fusion approach.
In this project, the sensor configuration only includes binary values for the sensors' reliability. We expect that linearly changing the confidence level of each sensor according to the environment would lead to even better performance than our approach. One way to vary the confidence level is by changing the measurement covariances accordingly.
The results presented here only consider the use of two sensors, so future work could add more sensors to the current framework. Another improvement would be the implementation of an automatic labeling scheme. It could be built using the performance metric proposed in Section 2.1 to label each image with the best-performing sensor configuration. Further, the methodology can also be extended to localization and mapping, creating environment-aware mapping strategies for long-term localization.
ACKNOWLEDGEMENTS
The authors would like to thank Russell Buchanan,
Jiadong Guo and the rest of the CSIRO team for their
assistance during this work. This work was partially funded by grant #2018/02122-0, São Paulo Research Foundation (FAPESP).
REFERENCES
Bischoff, B., Nguyen-Tuong, D., Streichert, F., Ewert, M.,
and Knoll, A. (2012). Fusing vision and odometry for
accurate indoor robot localization. In 2012 12th Inter-
national Conference on Control Automation Robotics
Vision (ICARCV), pages 347–352.
Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B.,
Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Muller,
U., Zhang, J., Zhang, X., Zhao, J., and Zieba, K.
(2016). End to end learning for self-driving cars. arXiv
preprint arXiv:1604.07316.
Borges, P. V. K., Zlot, R., and Tews, A. (2013). Integrating
off-board cameras and vehicle on-board localization
for pedestrian safety. IEEE Transactions on Intelligent
Transportation Systems, 14(2):720–730.
Bosse, M. and Zlot, R. (2009). Continuous 3d scan-matching
with a spinning 2d laser. In 2009 IEEE International
Conference on Robotics and Automation, pages 4312–
4319.
Burgard, W., Stachniss, C., Grisetti, G., Steder, B., Kümmerle, R., Dornhege, C., Ruhnke, M., Kleiner, A., and Tardós, J. D. (2009). Trajectory-based comparison of SLAM algorithms. In Proc. of the IEEE/RSJ Int. Conf. on Intelligent Robots & Systems (IROS).
Churchill, W. and Newman, P. (2012). Practice makes per-
fect? managing and leveraging visual experiences for
lifelong navigation. In Robotics and Automation
(ICRA), 2012 IEEE International Conference on, pages
4525–4532. IEEE.
Corke, P., Lobo, J., and Dias, J. (2007). An introduction to
inertial and visual sensing. The International Journal
of Robotics Research, 26(6):519–535.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei,
L. (2009). Imagenet: A large-scale hierarchical image
database. In CVPR09.
Egger, P., Borges, P. V., Catt, G., Pfrunder, A., Siegwart, R., and Dubé, R. (2018). Posemap: Lifelong, multi-environment 3d lidar localization. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3430–3437. IEEE.
Forster, C., Carlone, L., Dellaert, F., and Scaramuzza,
D. (2017). On-manifold preintegration for real-time
visual–inertial odometry. IEEE Transactions on
Robotics, 33(1):1–21.
Furgale, P. and Barfoot, T. D. (2010). Visual teach and
repeat for long-range rover autonomy. Journal of Field
Robotics, 27(5):534–560.
Guilherme, R., Marques, F., Lourenco, A., Mendonca, R.,
Santana, P., and Barata, J. (2016). Context-aware
switching between localisation methods for robust
robot navigation: A self-supervised learning approach.
In 2016 IEEE International Conference on Systems,
Man, and Cybernetics (SMC), pages 004356–004361.
IEEE.
Julier, S. J. and Uhlmann, J. K. (2004). Unscented filtering
and nonlinear estimation. Proceedings of the IEEE,
92(3):401–422.
Keselman, L., Iselin Woodfill, J., Grunnet-Jepsen, A., and
Bhowmik, A. (2017). Intel RealSense Stereoscopic
Depth Cameras. ArXiv e-prints.
Lottes, P., Behley, J., Milioto, A., and Stachniss, C. (2018).
Fully convolutional networks with sequential informa-
tion for robust crop and weed detection in precision
farming. IEEE Robotics and Automation Letters (RA-
L), 3:3097–3104.
Lowry, S., Sünderhauf, N., Newman, P., Leonard, J. J., Cox, D., Corke, P., and Milford, M. J. (2016). Visual place recognition: A survey. IEEE Transactions on Robotics, 32(1):1–19.
Maybeck, P. S. (1990a). The Kalman Filter: An Introduction
to Concepts. In Autonomous Robot Vehicles, pages
194–204. Springer New York, New York, NY.
Maybeck, P. S. (1990b). The Kalman Filter: An Introduction
to Concepts. In Autonomous Robot Vehicles, pages
194–204. Springer New York, New York, NY.
McManus, C., Furgale, P., Stenning, B., and Barfoot, T. D.
(2012). Visual teach and repeat using appearance-
based lidar. In 2012 IEEE International Conference
on Robotics and Automation, pages 389–396.
Moore, T. and Stouch, D. (2014). A generalized extended
kalman filter implementation for the robot operating
system. In Proceedings of the 13th International Con-
ference on Intelligent Autonomous Systems (IAS-13).
Springer.
Moravec, H. P. (1980). Obstacle Avoidance and Navi-
gation in the Real World by a Seeing Robot Rover.
PhD thesis, Stanford University, Stanford, CA, USA.
AAI8024717.
Mur-Artal, R. and Tardós, J. D. (2017). ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262.
Pfrunder, A., Borges, P. V., Romero, A. R., Catt, G., and
Elfes, A. (2017). Real-time autonomous ground ve-
hicle navigation in heterogeneous environments using
a 3d lidar. In 2017 IEEE/RSJ International Confer-
ence on Intelligent Robots and Systems (IROS), pages
2601–2608. IEEE.
Pizer, S. M., Amburn, E. P., Austin, J. D., Cromartie, R.,
Geselowitz, A., Greer, T., ter Haar Romeny, B., Zim-
merman, J. B., and Zuiderveld, K. (1987). Adaptive
histogram equalization and its variations. Computer vi-
sion, graphics, and image processing, 39(3):355–368.
Quigley, M., Conley, K., Gerkey, B. P., Faust, J., Foote, T.,
Leibs, J., Wheeler, R., and Ng, A. Y. (2009). Ros:
an open-source robot operating system. In ICRA
Workshop on Open Source Software.
Rechy Romero, A., Koerich Borges, P. V., Elfes, A., and
Pfrunder, A. (2016). Environment-aware sensor fusion
for obstacle detection. In 2016 IEEE International
Conference on Multisensor Fusion and Integration for
Intelligent Systems (MFI), pages 114–121. IEEE.
Simonyan, K. and Zisserman, A. (2014). Very Deep Convo-
lutional Networks for Large-Scale Image Recognition.
arXiv e-prints, page arXiv:1409.1556.
Suger, B., Steder, B., and Burgard, W. (2016). Terrain-adaptive obstacle detection. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3608–3613.
Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., and Oliva, A. (2014). Learning Deep Features for Scene Recognition using Places Database. In Advances in Neural Information Processing Systems (NIPS).