temporal features learning from Sentinel-2 data using recurrent-convolutional neural network (R-CNN). Applied Sciences, 10(1):238.
Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A. J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al. (2016). Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529.
Mocanu, E., Mocanu, D. C., Nguyen, P. H., Liotta, A., Webber, M. E., Gibescu, M., and Slootweg, J. G. (2018). On-line building energy optimization using deep reinforcement learning. IEEE Transactions on Smart Grid.
Mohanan, M. and Salgoankar, A. (2018). A survey of robotic motion planning in dynamic environments. Robotics and Autonomous Systems, 100:171–185.
Rigelsford, J. (2004). Introduction to autonomous mobile robots. Industrial Robot: An International Journal.
Salvetti, F., Mazzia, V., Khaliq, A., and Chiaberge, M. (2020). Multi-image super resolution of remotely sensed images using residual attention deep neural networks. Remote Sensing, 12(14):2207.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484.
Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), Volume 32, pages I-387–I-395. JMLR.org.
Tamar, A., Wu, Y., Thomas, G., Levine, S., and Abbeel, P. (2016). Value iteration networks. In Advances in Neural Information Processing Systems, pages 2154–2162.
Yan, J., Tiberius, C., Bellusci, G., and Janssen, G. (2008). Feasibility of Gauss-Newton method for indoor positioning. In 2008 IEEE/ION Position, Location and Navigation Symposium, pages 660–670.
Zafari, F., Gkelias, A., and Leung, K. K. (2019). A survey of indoor localization systems and technologies. IEEE Communications Surveys & Tutorials.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., and Farhadi, A. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3357–3364. IEEE.