Exploring Applications of Deep Reinforcement Learning
for Real-world Autonomous Driving Systems
Victor Talpaert 1,2, Ibrahim Sobh 3, B. Ravi Kiran 2, Patrick Mannion 4,
Senthil Yogamani 5, Ahmad El-Sallab 3 and Patrick Perez 6
1 U2IS, ENSTA ParisTech, Palaiseau, France
2 AKKA Technologies, Guyancourt, France
3 Valeo Egypt, Cairo
4 Galway-Mayo Institute of Technology, Ireland
5 Valeo Vision Systems, Ireland
6 Valeo.ai, France
Keywords:
Autonomous Driving, Deep Reinforcement Learning, Visual Perception.
Abstract:
Deep Reinforcement Learning (DRL) has become increasingly powerful in recent years, with notable achievements such as DeepMind's AlphaGo. It has been successfully deployed in commercial vehicles such as Mobileye's path planning system. However, the vast majority of work on DRL is focused on toy examples in controlled synthetic car simulator environments such as TORCS and CARLA. In general, DRL is still in its infancy in terms of usability in real-world applications. Our goal in this paper is to encourage real-world deployment of DRL in various autonomous driving (AD) applications. We first provide an overview of the tasks in autonomous driving systems, reinforcement learning algorithms and applications of DRL to AD systems. We then discuss the challenges which must be addressed to enable further progress towards real-world deployment.
1 INTRODUCTION
Autonomous driving (AD) is a challenging application domain for machine learning (ML). Since the task of driving “well” is already open to subjective definitions, it is not easy to specify the correct behavioural outputs for an autonomous driving system, nor is it simple to choose the right input features to learn with. Correct driving behaviour is only loosely defined, as different responses to similar situations may be equally acceptable. ML-based control policies have to overcome the lack of a dense metric evaluating the driving quality over time, and the lack of a strong signal for expert imitation. Supervised learning methods do not learn the dynamics of the environment nor that of the agent (Sutton and Barto, 2018), while reinforcement learning (RL) is formulated to handle sequential decision processes, making it a natural approach for learning AD control policies.
In this article, we aim to outline the underlying principles of DRL for applications in AD. Passive perception which simply feeds into a control system does not scale to handle complex situations; a DRL setting would enable active perception optimized for the specific control task. The rest of the paper is structured as follows. Section 2 provides background on the various modules of AD, an overview of reinforcement learning and a summary of applications of DRL to AD. Section 3 discusses challenges and open problems in applying DRL to AD. Finally, Section 4 summarizes the paper and provides key future directions.
2 BACKGROUND
The software architecture for autonomous driving systems comprises the following high-level tasks: Sensing, Perception, Planning and Control. While some approaches (Bojarski et al., 2016) achieve all tasks in one unique module, others follow the logic of one task, one module (Paden et al., 2016). Behind this separation rests the idea of information transmission from the sensors to the final stage of actuators and motors, as illustrated in Figure 1. While an end-to-end approach would enable one generic encapsulated solution, the modular approach provides granularity for this multi-discipline problem, as separating the tasks into modules is the divide-and-conquer engineering approach.
Figure 1: Fixed modules in a modern autonomous driving system. The sensor infrastructure notably includes cameras, radars, LiDARs (the laser equivalent of radars) and GPS-IMUs (GPS and Inertial Measurement Units, which provide an instantaneous position); their raw data is considered low level. This dynamic data is often processed into higher-level descriptions as part of the Perception module. Perception estimates positions of the location in lane, cars, pedestrians, traffic lights and other semantic objects, among other descriptors. It also provides road occupation over a larger spatial and temporal scale by combining maps. Mapping: localised high-definition maps (HD maps) provide a centimetric reconstruction of buildings, static obstacles and dynamic objects, and are frequently crowdsourced. Scene understanding provides the high-level scene comprehension that includes detection, classification and localisation tasks, feeding the driving policy/planning module. Path Planning predicts the future actor trajectories and manoeuvres; a static shortest path from point A to point B, constrained by dynamic traffic information, is employed to calculate the path. Vehicle control carries out the high-level orders for motion planning using simple closed-loop systems based on sensors from the Perception task.
2.1 Review of Reinforcement Learning
Reinforcement Learning (RL) (Sutton and Barto, 2018) is a family of algorithms which allow agents to learn how to act in different situations, in other words, how to establish a map, or policy, from situations (states) to actions which maximize a numerical reward signal. RL has been successfully applied to many different fields such as helicopter control (Naik and Mammone, 1992), traffic signal control (Mannion et al., 2016a), electricity generator scheduling (Mannion et al., 2016b), water resource management (Mason et al., 2016), playing relatively simple Atari games (Mnih et al., 2015), mastering the much more complex game of Go (Silver et al., 2016), simulated continuous control problems (Lillicrap et al., 2015), (Schulman et al., 2015), and controlling robots in real environments (Levine et al., 2016).
RL algorithms may learn estimates of state values, environment models or policies. In real-world applications, simple tabular representations of such estimates are not scalable: each additional feature tracked in the state leads to an exponential growth in the number of estimates that must be stored (Sutton and Barto, 2018). For instance, a state described by 10 features, each discretised into 10 bins, already requires 10^10 table entries per action.
Deep Neural Networks (DNNs) have recently been applied as function approximators for RL agents, allowing agents to generalise knowledge to new unseen situations, along with new algorithms for problems with continuous state and action spaces. Deep RL agents can be value-based: DQN (Mnih et al., 2013), Double DQN (Van Hasselt et al., 2016); policy-based: TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017); or actor-critic: DPG (Silver et al., 2014), DDPG (Lillicrap et al., 2015), A3C (Mnih et al., 2016). Model-based RL agents attempt to build an environment model, e.g. Dyna-Q (Sutton, 1990), Dyna-2 (Silver et al., 2008). Inverse reinforcement learning (IRL) aims to estimate the reward function given examples of an agent's actions, the sensory input to the agent (state), and the model of the environment (Abbeel and Ng, 2004a), (Ziebart et al., 2008), (Finn et al., 2016), (Ho and Ermon, 2016).
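To make the value-based case above concrete, the following sketch performs a single DQN-style temporal-difference update using a frozen target network. It is a minimal illustration on random tensors standing in for replay-buffer samples; the network sizes, batch shape and hyperparameters are assumptions for the example, not details taken from the cited works.

```python
# Minimal sketch: one DQN-style update on synthetic transitions (illustrative only).
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Random tensors stand in for a mini-batch sampled from a replay buffer.
s = torch.randn(32, state_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, state_dim)
done = torch.zeros(32)

with torch.no_grad():
    # TD target bootstrapped from a periodically-synchronised target network.
    td_target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
loss = nn.functional.smooth_l1_loss(q_sa, td_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```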
2.2 DRL for Autonomous Driving
In this section, we visit the different modules of the
autonomous driving system as shown in Figure 1 and
describe how they are achieved using classical RL and
Deep RL methods. A list of datasets and simulators
for AD tasks is presented in Table 1.
Table 1: A collection of datasets and simulators to evaluate AD algorithms.
Dataset | Description
Berkeley Driving Dataset (Xu et al., 2017) | Learn driving policy from demonstrations
Baidu's ApolloScape | Multiple sensors & driver behaviour profiles
Honda Driving Dataset (Ramanishka et al., 2018) | 4-level annotation (stimulus, action, cause, and attention objects) for driver behaviour profiling
Simulator | Description
CARLA (Dosovitskiy et al., 2017) | Urban driving simulator with camera, LiDAR, depth & semantic segmentation
Racing Simulator TORCS (Wymann et al., 2000) | Testing control policies for vehicles
AIRSIM (Shah et al., 2018) | Resembling CARLA, with support for drones
GAZEBO (Koenig and Howard, 2004) | Multi-robot simulator for planning & control
Vehicle Control: Vehicle control has classically been achieved with predictive control approaches such as Model Predictive Control (MPC) (Paden et al., 2016). A recent review on the motion planning and control task can be found in (Schwarting et al., 2018). Classical RL methods are also used to perform optimal control in stochastic settings: the Linear Quadratic Regulator (LQR) is utilized in linear regimes and iterative LQR (iLQR) in non-linear regimes. More recently, it has been shown that random search over the parameters of a policy network can perform as well as LQR (Mania et al., 2018).
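As an illustration of the LQR controllers mentioned above, the sketch below computes a discrete-time LQR gain by iterating the Riccati recursion on a toy longitudinal double-integrator model. The system matrices, costs and horizon are illustrative assumptions, not parameters from the cited works.

```python
# Minimal sketch: discrete-time LQR via the Riccati recursion on a toy model.
import numpy as np

dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])          # state x = [position error, velocity]
B = np.array([[0.0],
              [dt]])                # control u = acceleration command
Q = np.diag([1.0, 0.1])             # state cost
R = np.array([[0.01]])              # control effort cost

# Iterate the discrete Riccati equation to (approximate) convergence.
P = Q.copy()
for _ in range(500):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # feedback gain
    P = Q + A.T @ P @ (A - B @ K)

# Roll out the resulting linear state-feedback law u = -K x.
x = np.array([5.0, 0.0])            # start 5 m away from the target point
for _ in range(100):
    u = -K @ x
    x = A @ x + B @ u
print("final state:", x)
```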
One can note recent work in which DQN is used for simulated autonomous vehicle control (Yu et al., 2016), where different reward functions are examined to produce specific driving behaviour; the agent successfully learned turning actions and navigation without crashing. In (Sallab et al., 2016) a DRL system for lane keeping assist is introduced for discrete actions (DQN) and continuous actions (DDAC), using the TORCS car simulator. The authors conclude that, as expected, continuous actions provide smoother trajectories, and that the more restrictive the termination conditions, the slower the convergence during learning. Wayve, a recent startup, has demonstrated an application of DRL (DDPG) for AD using a full-sized autonomous vehicle (Kendall et al., 2018). The system was first trained in simulation, before being trained in real time using onboard computers, and was able to learn to follow a lane, successfully completing a real-world trial on a 250 metre section of road.
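The reward functions examined in such lane-keeping work typically trade off forward progress against heading and lateral errors. The snippet below is an illustrative sketch of a reward of this kind for a TORCS-style simulator; the exact terms and weights used in (Sallab et al., 2016) may differ.

```python
# Illustrative lane-keeping reward: reward forward progress, penalise drift
# and off-centre driving. Weights and terms are assumptions for this example.
import math

def lane_keep_reward(speed_mps, heading_error_rad, lateral_offset_m, collided):
    if collided:
        return -100.0                                      # strong terminal penalty
    progress = speed_mps * math.cos(heading_error_rad)     # speed along the track axis
    drift = speed_mps * abs(math.sin(heading_error_rad))   # speed across the track axis
    centering = abs(lateral_offset_m)                      # distance from the lane centre
    return progress - drift - 0.5 * centering

# Example: driving at 20 m/s, slightly misaligned, 0.4 m off centre.
print(lane_keep_reward(20.0, 0.05, 0.4, collided=False))
```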
DQN for Ramp Merging: The AD problem of ramp merging is tackled in (Wang and Chan, 2017), where DRL is applied to find an optimal driving policy, using an LSTM to produce an internal state containing historical driving information and a DQN for Q-function approximation.
Q-function for Lane Change: A reinforcement learning approach is proposed in (Wang et al., 2018) to train a vehicle agent to perform automated lane changes in a smooth and efficient manner, where the coefficients of the Q-function are learned by neural networks.
IRL for Driving Styles: Learning individual perceptions of comfort from demonstration is proposed in (Kuderer et al., 2015), where individual driving styles are modeled in terms of a cost function and feature-based inverse reinforcement learning is used to compute trajectories in autonomous driving mode. Using Deep Q-Networks as the refinement step in IRL to extract the rewards is proposed in (Sharifzadeh et al., 2016). While evaluated only in a simulated driving environment, the agent is shown to perform human-like lane change behaviour.
Multiple-goal RL for Overtaking: In (Ngai and Yung, 2011) a multiple-goal reinforcement learning (MGRL) framework is used to solve the vehicle overtaking problem. The learned agent is found to take correct actions for overtaking while avoiding collisions and maintaining an almost steady speed.
Hierarchical Reinforcement Learning (HRL): Contrary to conventional or “flat” RL, HRL refers to the decomposition of complex agent behaviour using temporal abstraction, such as the options framework (Barto and Mahadevan, 2003). The problem of sequential decision making for autonomous driving with distinct behaviours is tackled in (Chen et al., 2018). A hierarchical neural network policy is proposed where the network is trained over the Semi-Markov Decision Process (SMDP) through the proposed hierarchical policy gradient method. The method is applied to a traffic light passing scenario, and it is shown that the method is able to select correct decisions, providing better performance compared to
a non-hierarchical reinforcement learning approach.
An RL-based hierarchical framework for autonomous
multi-lane cruising is proposed in (Nosrati et al.,
2018) and it is shown that the hierarchical design
enables significantly better learning performance
than a flat design for both DQN and PPO methods.
Frameworks: A framework for an end-to-end deep reinforcement learning pipeline for autonomous driving is proposed in (Sallab et al., 2017), where the inputs are the states of the environment and their aggregations over time, and the output is the driving actions. The framework integrates RNNs and an attention glimpse network, and is tested on a lane keeping assist algorithm. In this section we briefly reviewed the applications of DRL and classical RL methods to different AD tasks.
2.3 Predictive Perception
In this section, we review some examples of applications of IOC and IRL to predictive perception tasks. Given a certain instant where the autonomous agent is driving in the scene, the goal of predictive perception algorithms is to predict the trajectories or intention of movement of other actors in the environment. The authors of (Djuric et al., 2018) trained a deep convolutional neural network (CNN) to predict short-term vehicle trajectories, while accounting for the inherent uncertainty of vehicle motion in road traffic. Deep Stochastic IOC (inverse optimal control) RNN Encoder-Decoder (DESIRE) (Lee et al., 2017) is a framework used to estimate a distribution over, and not just a single prediction of, an agent's future positions, based on the context (intersection, relative position of other agents).
Pedestrian Intention: Examples include whether a pedestrian intends to cross the road or board another vehicle, or whether the driver of a parked car is going to open the door (Erran and Scheider, 2017). The authors of (Ziebart et al., 2009) perform maximum entropy inverse optimal control to learn a generic cost function for a robot to avoid pedestrians, while (Kitani et al., 2012) use inverse optimal control to predict pedestrian paths by considering scene semantics.
Traffic Negotiation: In traffic scenarios involving multiple agents, learned policies require the agents to negotiate movement in densely populated areas with continuously moving actors. Mobileye demonstrated the use of the options framework in this setting (Shalev-Shwartz et al., 2016).
3 PRACTICAL CHALLENGES
Deep reinforcement learning is a rapidly growing field. We summarize the frequent challenges, such as sample complexity and reward formulation, that could be encountered in the design of such methods.
3.1 Bootstrapping RL with Imitation
Learning by imitation is used by humans to teach other humans new skills. Demonstrations usually focus on the essential areas of the state space from the expert's point of view. Learning from Demonstrations (LfD) is significant especially in domains where rewards are sparse. In imitation learning, an agent learns to perform a task from demonstrations, without any feedback rewards, by treating the policy as a supervised learning problem over state-action pairs. However, high quality demonstrations are hard to collect, leading to sub-optimal policies (Atkeson and Schaal, 1997). Accordingly, LfD can be used to initialize the learning agent with a policy inspired by the performance of an expert; RL can then be conducted to discover a better policy by interacting with the environment. Learning a model of the environment that combines LfD and RL is presented in (Abbeel and Ng, 2005). Measuring the divergence between the current policy and the expert for policy optimization is proposed in (Kang et al., 2018). DQfD (Hester et al., 2017) pre-trains the agent on demonstrations by adding them into the replay buffer of the DQN and giving them additional priority. More recently, a training framework that combines LfD and RL for fast learning of asynchronous agents is proposed in (Sobh and Darwish, 2018).
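A simplified sketch of seeding a replay buffer with demonstrations, loosely in the spirit of DQfD, is given below. Real DQfD keeps demonstrations permanently, uses prioritised replay and auxiliary supervised losses; the expert sampling bonus here is a crude stand-in for those mechanisms, and the class and parameter names are assumptions for this example.

```python
# Sketch: a replay buffer seeded with expert demonstrations that are sampled
# more often than agent transitions (a crude proxy for demonstration priority).
import random
from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state done expert")

class DemoSeededBuffer:
    def __init__(self, capacity=50_000, expert_bonus=3.0):
        self.capacity = capacity
        self.expert_bonus = expert_bonus
        self.demos = []       # demonstration transitions, never evicted
        self.agent = []       # transitions collected by the learning agent

    def add_demo(self, state, action, reward, next_state, done):
        self.demos.append(Transition(state, action, reward, next_state, done, True))

    def add(self, state, action, reward, next_state, done):
        self.agent.append(Transition(state, action, reward, next_state, done, False))
        if len(self.agent) > self.capacity:
            self.agent.pop(0)

    def sample(self, batch_size):
        pool = self.demos + self.agent
        weights = [self.expert_bonus if t.expert else 1.0 for t in pool]
        return random.choices(pool, weights=weights, k=batch_size)
```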
3.2 Exploration Issues with Imitation
In some cases, demonstrations from experts are not available, or do not cover enough of the state space, leading to a poorly learned policy. One solution consists in using the Data Aggregation (DAgger) (Ross and Bagnell, 2010) family of methods, where the learned end-to-end policy is run, the extracted observation-action pairs are labelled again by the expert, and they are aggregated with the original expert observation-action dataset. Thus, iteratively collecting training examples from both reference and trained policies explores more states and addresses this lack of exploration. Following work on Search-based Structured Prediction (SEARN), Stochastic Mixing Iterative Learning (SMILe) (Ross and Bagnell, 2010) trains a stochastic stationary policy over several iterations and then makes use of a geometric stochastic mixing of the policies trained. In a
standard imitation learning scenario, the demonstrator is required to cover sufficient states so as to avoid unseen states during testing. This constraint is costly and requires frequent human intervention. Hierarchical imitation learning methods reduce the sample complexity of standard imitation learning by performing data aggregation while organizing the action space in a hierarchy (Le et al., 2018).
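The DAgger procedure described above reduces to a short loop; the sketch below assumes placeholder interfaces `rollout`, `expert_label` and `fit_policy` for environment interaction, expert relabelling and supervised fitting respectively, rather than any real API.

```python
# Sketch of the DAgger loop: run the current policy, let the expert relabel the
# visited observations, aggregate, and refit the policy on the growing dataset.
def dagger(rollout, expert_label, fit_policy, n_iterations=10):
    dataset = []          # aggregated (observation, expert action) pairs
    policy = None         # first iteration can roll out the expert / a cloned policy
    for _ in range(n_iterations):
        observations = rollout(policy)                 # states visited by the current policy
        dataset += [(obs, expert_label(obs)) for obs in observations]
        policy = fit_policy(dataset)                   # supervised fit on aggregated data
    return policy
```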
3.3 Intrinsic Reward Functions
In controlled simulated environments such as games, an explicit reward signal is given to the agent along with its sensor stream. In real-world robotics and autonomous driving, however, designing a good reward function is essential so that the desired behaviour may be learned. The most common solution has been reward shaping (Ng et al., 1999), which consists in supplying additional rewards to the agent along with those provided by the underlying MDP. As already noted earlier in the paper, rewards could also be estimated by inverse RL (IRL) (Abbeel and Ng, 2004b), which depends on expert demonstrations. In the absence of explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation (Chentanez et al., 2005) to evaluate whether their actions were good. The authors of (Pathak et al., 2017) define curiosity as the error in an agent's ability to predict the consequences of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. In (Burda et al., 2018) the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. This enables the agent to determine what could be a useful behaviour even without extrinsic rewards.
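A minimal sketch of such a prediction-error intrinsic reward is given below: a learned forward model predicts the next observation and its error is paid out as a curiosity bonus. The dimensions, network and the absence of a learned feature encoder are simplifying assumptions, not the architectures of (Pathak et al., 2017) or (Burda et al., 2018).

```python
# Sketch: prediction error of a learned forward model used as an intrinsic reward.
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 2
forward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, obs_dim))
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward(obs, action, next_obs):
    pred = forward_model(torch.cat([obs, action], dim=-1))
    error = ((pred - next_obs) ** 2).mean(dim=-1)      # per-transition prediction error
    optimizer.zero_grad()
    error.mean().backward()                            # keep improving the predictor
    optimizer.step()
    return error.detach()                              # the detached error is the bonus

# Random tensors stand in for a batch of real transitions.
bonus = intrinsic_reward(torch.randn(32, obs_dim), torch.randn(32, act_dim),
                         torch.randn(32, obs_dim))
```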
3.4 Bridging the Simulator-reality Gap
Training deep networks requires collecting and annotating a lot of data, which is usually costly in terms of time and effort. Using simulation environments enables the collection of large training datasets. However, simulated data do not have the same distribution as real data. Accordingly, models trained on simulated environments often fail to generalise to real environments. Domain adaptation allows a machine learning model trained on samples from a source domain to generalise to a target domain. Feature-level domain adaptation focuses on learning domain-invariant features. In the work of (Ganin et al., 2016), the decisions made by deep neural networks are based on features that are both discriminative and invariant to the change of domains. Pixel-level domain adaptation focuses on stylizing images from the source domain to make them similar to images of the target domain, based on image-conditioned generative adversarial networks (GANs). In (Bousmalis et al., 2017b), the model learns a transformation in the pixel space from one domain to the other in an unsupervised way; a GAN is used to adapt simulated images to look as if drawn from the real domain. Feature-level and pixel-level domain adaptation are combined in (Bousmalis et al., 2017a), where the results indicate that including simulated data can improve a vision-based grasping system, achieving comparable performance with 50 times fewer real-world samples. Another, relatively simpler, method is introduced in (Peng et al., 2017): by randomizing the dynamics of the simulator during training, policies become capable of generalising to different dynamics without any training on the real system.
RL with Sim2Real: A model trained in a virtual environment is shown to be workable in a real environment (Pan et al., 2017). Virtual images rendered by a simulator are first segmented into a scene parsing representation and then translated into synthetic realistic images by the proposed image translation network. Accordingly, the driving policy trained by reinforcement learning can be adapted more easily to the real environment.
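A minimal sketch of per-episode dynamics randomization in the spirit of (Peng et al., 2017) is shown below; the physics parameters, their ranges and the `make_env`/`run_episode` interfaces are illustrative assumptions rather than details of the cited work.

```python
# Sketch: resample simulator physics for every training episode so the policy
# cannot overfit to a single simulator configuration.
import random

def sample_dynamics():
    return {
        "tire_friction": random.uniform(0.6, 1.2),
        "vehicle_mass_kg": random.uniform(1200.0, 1800.0),
        "actuation_delay_s": random.uniform(0.0, 0.15),
        "sensor_noise_std": random.uniform(0.0, 0.05),
    }

def train_with_randomization(make_env, run_episode, n_episodes=1000):
    for _ in range(n_episodes):
        env = make_env(**sample_dynamics())   # a fresh simulator with new physics
        run_episode(env)                      # any RL update performed in that env
```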
World Models, proposed in (Ha and Schmidhuber, 2018b; Ha and Schmidhuber, 2018a), are trained quickly in an unsupervised way, via a Variational AutoEncoder (VAE), to learn a compressed spatial and temporal representation of the environment, leading to a compact policy. Moreover, the agent can train inside its own dream and transfer the policy back into the actual environment.
3.5 Data Efficient and Fast Adapting RL
Depending on the task being solved, RL requires a lot of observations to cover the state space. Efficiency is usually achieved with imitation learning, reward shaping and transfer learning. Readers are directed to the survey on transfer learning in RL in (Taylor and Stone, 2009). The primary motivation for transfer learning in RL is to reuse previously trained policies or models for a source task, so as to reduce the training time for the current target task. The authors of (Liaw et al., 2017) study the policy composition problem: given previously learned basis policies, e.g. for driving in different conditions or over different terrain types, the goal is to be able to reuse them for
a novel task that is a mixture of previously seen dynamics, by learning a meta-policy that maximises the reward. Their results show that learning a new policy for a new task from scratch takes longer than learning a meta-policy over the basis policies.
Meta-learning, or learning to learn, is an important approach towards versatile agents that can adapt quickly to new tasks. The idea of having one neural network interact with another one for meta-learning has been applied in (Duan et al., 2016) and (Wang et al., 2016). More recently, Model-Agnostic Meta-Learning (MAML) was proposed in (Finn et al., 2017), where the meta-learner seeks an initialization for the parameters of a neural network that can be adapted quickly to a new task using only a few examples. Continuous adaptation in dynamically changing and adversarial scenarios is presented in (Al-Shedivat et al., 2017) via a simple gradient-based meta-learning algorithm. Additionally, Reptile (Nichol et al., 2018) is mathematically similar to first-order MAML, making it consume less computation and memory than MAML.
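The sketch below shows one first-order Reptile meta-update under the assumption of generic `sample_task` and `task_loss` interfaces: a copy of the model is adapted to a sampled task with a few gradient steps, and the meta-parameters are then moved a fraction of the way towards the adapted weights.

```python
# Sketch of a single Reptile meta-update (first-order meta-learning).
import copy
import torch

def reptile_step(model, sample_task, task_loss,
                 inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    task = sample_task()
    adapted = copy.deepcopy(model)                           # inner-loop copy
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                             # adapt to the sampled task
        loss = task_loss(adapted, task)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    with torch.no_grad():                                    # meta-update: interpolate weights
        for meta_p, task_p in zip(model.parameters(), adapted.parameters()):
            meta_p += meta_lr * (task_p - meta_p)
```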
3.6 Incorporating Safety in DRL for AD
Deploying an autonomous vehicle directly after training could be dangerous. We review different approaches to incorporate safety into DRL algorithms.
SafeDAgger (Zhang and Cho, 2017) introduces a safety policy that learns to predict the error made by a primary policy, trained initially with a supervised learning approach, without querying a reference policy. The additional safe policy takes both the partial observation of a state and the primary policy as inputs, and returns a binary label indicating whether the primary policy is likely to deviate from a reference policy, without querying the latter.
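A sketch of a SafeDAgger-style safety gate, loosely following the description above, is given below: a small classifier estimates whether the primary policy's action will deviate from the reference controller, which is queried only when needed. The network shapes, threshold and `reference_policy` interface are assumptions for this illustration.

```python
# Sketch: a learned safety classifier gates the primary policy and falls back
# to the reference controller when a deviation is predicted.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 2
primary_policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
safety_policy = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())   # P(deviation)

def act(obs, reference_policy, threshold=0.5):
    action = primary_policy(obs)
    p_deviate = safety_policy(torch.cat([obs, action], dim=-1))
    if p_deviate.item() > threshold:
        return reference_policy(obs)     # fall back to the trusted controller
    return action

# Example with a dummy reference controller that outputs a zero (braking) action.
dummy_reference = lambda obs: torch.zeros(act_dim)
print(act(torch.randn(obs_dim), dummy_reference))
```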
Multi-agent RL for Comfort Driving and Safety: In (Shalev-Shwartz et al., 2016), autonomous driving is addressed as a multi-agent setting where the host vehicle applies negotiation in different scenarios, maintaining a balance between handling the unexpected behaviour of other drivers and not being too defensive. The problem is decomposed into a policy over learned desires to enable comfortable driving, and trajectory planning with hard constraints for driving safety.
DDPG and Safety-based Control: Deep reinforcement learning (DDPG) and safety-based control are combined in (Xiong et al., 2016), including an artificial potential field method that is widely used for robot path planning. Using the TORCS environment, DDPG is first used to learn a driving policy in a stable and familiar environment; the policy network and safety-based control are then combined to avoid collisions. It was found that the combination of DRL and safety-based control performs well in most scenarios.
Negative-Avoidance for Safety: In order to enable DRL to escape local optima, speed up the training process and avoid dangerous conditions or accidents, the Survival-Oriented Reinforcement Learning (SORL) model is proposed in (Ye et al., 2017), where survival is favoured over maximizing total reward, by modelling the autonomous driving problem as a constrained MDP and introducing a Negative-Avoidance Function to learn from previous failures. The SORL model is found not to be sensitive to the reward function, and can use different DRL algorithms such as DDPG.
3.7 Other Challenges
Multimodal Sensor Policies. Modern autonomous driving systems make use of multiple modalities (Sobh et al., 2018), for example camera RGB, depth, LiDAR and other sensors. The authors of (Liu et al., 2017) propose end-to-end learning of policies that leverages sensor fusion to reduce performance drops in noisy environments, and even in the face of partial sensor failure, by using Sensor Dropout to reduce sensitivity to any sensor subset.
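A simplified sketch of Sensor Dropout is given below: during training, entire sensor modalities are randomly zeroed before fusion so that the policy does not become overly dependent on any single sensor. The feature shapes are illustrative, and the original method additionally rescales features and guarantees at least one active modality.

```python
# Sketch: randomly drop whole sensor modalities before feature fusion.
import torch

def sensor_dropout(modalities, drop_prob=0.3):
    """modalities: dict of name -> feature tensor of shape (batch, feature_dim)."""
    dropped = []
    for name, features in modalities.items():
        keep_mask = (torch.rand(features.shape[0], 1) > drop_prob).float()
        dropped.append(features * keep_mask)    # zero the modality for masked samples
    return torch.cat(dropped, dim=-1)           # fused feature vector fed to the policy

fused = sensor_dropout({"camera": torch.randn(8, 256),
                        "lidar": torch.randn(8, 128),
                        "radar": torch.randn(8, 64)})
print(fused.shape)   # torch.Size([8, 448])
```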
Reproducibility. State-of-the-art deep RL methods are seldom reproducible. Non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret; the authors of (Henderson et al., 2018) discuss common issues and challenges. Therefore, in the future it would be helpful to develop standardized benchmarks for evaluating autonomous vehicle control algorithms, similar to benchmarks such as KITTI (Geiger et al., 2012) which are already available for AD perception tasks.
4 CONCLUSION
AD systems present a challenging environment for tasks such as perception, prediction and control. DRL is a promising candidate for the future development of AD systems, potentially allowing the required behaviours to be learned first in simulation and further refined on real datasets, instead of being explicitly programmed. In this article, we provide an overview of
AD system components, DRL algorithms and applications of DRL for AD. We discuss the main challenges which must be addressed to enable practical and widespread use of DRL in AD applications. Although most of the work surveyed in this paper was conducted in simulated environments, it is encouraging that applications on real vehicles are beginning to appear, e.g. (Kendall et al., 2018). Constructing a complete real-world system requires resolving key challenges such as safety in RL, improving data efficiency, and enabling transfer learning using simulated environments. In addition, environment dynamics are better modelled by considering the predictive perception of other actors. We hope that this work inspires future research and development on DRL for AD, leading to increased real-world deployments in AD systems.
REFERENCES
Abbeel, P. and Ng, A. Y. (2004a). Apprenticeship learning
via inverse reinforcement learning. In Proceedings of
the twenty-first international conference on Machine
learning, page 1. ACM.
Abbeel, P. and Ng, A. Y. (2004b). Apprenticeship learning
via inverse reinforcement learning. In Proceedings of
the Twenty-first International Conference on Machine
Learning, ICML ’04, pages 1–, New York, NY, USA.
ACM.
Abbeel, P. and Ng, A. Y. (2005). Exploration and appren-
ticeship learning in reinforcement learning. In Pro-
ceedings of the 22nd International Conference on Ma-
chine Learning, pages 1–8, Bonn, Germany.
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mor-
datch, I., and Abbeel, P. (2017). Continuous adapta-
tion via meta-learning in nonstationary and competi-
tive environments. arXiv preprint arXiv:1710.03641.
Atkeson, C. G. and Schaal, S. (1997). Robot learning from
demonstration. In Proceedings of the 14th Interna-
tional Conference on Machine Learning, volume 97,
pages 12–20, Nashville, USA.
Barto, A. G. and Mahadevan, S. (2003). Recent advances
in hierarchical reinforcement learning. Discrete Event
Dynamic Systems, 13(4):341–379.
Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B.,
Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Mul-
ler, U., Zhang, J., Zhang, X., Zhao, J., and Zieba,
K. (2016). End to end learning for self-driving cars.
CoRR, abs/1604.07316.
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M.,
Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Ko-
nolige, K., et al. (2017a). Using simulation and dom-
ain adaptation to improve efficiency of deep robotic
grasping. arXiv preprint arXiv:1709.07857.
Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., and
Krishnan, D. (2017b). Unsupervised pixel-level dom-
ain adaptation with generative adversarial networks.
In The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), volume 1, page 7.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Dar-
rell, T., and Efros, A. A. (2018). Large-scale
study of curiosity-driven learning. arXiv preprint
arXiv:1808.04355.
Chen, J., Wang, Z., and Tomizuka, M. (2018). Deep hierar-
chical reinforcement learning for autonomous driving
with distinct behaviors. In 2018 IEEE Intelligent Vehi-
cles Symposium (IV), pages 1239–1244. IEEE.
Chentanez, N., Barto, A. G., and Singh, S. P. (2005). Intrin-
sically motivated reinforcement learning. In Advan-
ces in neural information processing systems, pages
1281–1288.
Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou,
F.-C., Lin, T.-H., and Schneider, J. (2018). Motion
prediction of traffic actors for autonomous driving
using deep convolutional networks. arXiv preprint
arXiv:1808.05819.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Kol-
tun, V. (2017). CARLA: An open urban driving simu-
lator. In Proceedings of the 1st Annual Conference on
Robot Learning, pages 1–16.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutske-
ver, I., and Abbeel, P. (2016). Fast reinforcement lear-
ning via slow reinforcement learning. arXiv preprint
arXiv:1611.02779.
Erran, L. and Scheider, J. (2017). Machine learning for autonomous vehicles. In ICML Workshop on Machine Learning for Autonomous Vehicles 2017.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic
meta-learning for fast adaptation of deep networks.
arXiv preprint arXiv:1703.03400.
Finn, C., Levine, S., and Abbeel, P. (2016). Guided cost
learning: Deep inverse optimal control via policy op-
timization. In International Conference on Machine
Learning, pages 49–58.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Laro-
chelle, H., Laviolette, F., Marchand, M., and Lem-
pitsky, V. (2016). Domain-adversarial training of neu-
ral networks. The Journal of Machine Learning Rese-
arch, 17(1):2096–2030.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we re-
ady for autonomous driving? the kitti vision bench-
mark suite. In Computer Vision and Pattern Recogni-
tion (CVPR), 2012 IEEE Conference on, pages 3354–
3361. IEEE.
Ha, D. and Schmidhuber, J. (2018a). Recurrent world
models facilitate policy evolution. arXiv preprint
arXiv:1809.01999. Accepted at NIPS 2018.
Ha, D. and Schmidhuber, J. (2018b). World models. arXiv
preprint arXiv:1803.10122.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup,
D., and Meger, D. (2018). Deep reinforcement lear-
ning that matters. In McIlraith, S. A. and Weinberger,
K. Q., editors, (AAAI-18), the 30th innovative Appli-
cations of Artificial Intelligence, New Orleans, Lou-
isiana, USA, February 2-7, 2018, pages 3207–3214.
AAAI Press.
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul,
T., Piot, B., Horgan, D., Quan, J., Sendonaris, A.,
Dulac-Arnold, G., et al. (2017). Deep q-learning from
demonstrations. arXiv preprint arXiv:1704.03732.
Ho, J. and Ermon, S. (2016). Generative adversarial imi-
tation learning. In Advances in Neural Information
Processing Systems, pages 4565–4573.
Kang, B., Jie, Z., and Feng, J. (2018). Policy optimiza-
tion with demonstrations. In Proceedings of the 35th
International Conference on Machine Learning, vo-
lume 80, pages 2469–2478, Stockholm, Sweden.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D.,
Allen, J.-M., Lam, V.-D., Bewley, A., and Shah, A.
(2018). Learning to drive in a day. arXiv preprint
arXiv:1807.00412.
Kitani, K. M., Ziebart, B. D., Bagnell, J. A., and Hebert,
M. (2012). Activity forecasting. In Fitzgibbon, A.,
Lazebnik, S., Perona, P., Sato, Y., and Schmid, C.,
editors, Computer Vision ECCV 2012, pages 201–
214, Berlin, Heidelberg. Springer Berlin Heidelberg.
Koenig, N. and Howard, A. (2004). Design and use para-
digms for gazebo, an open-source multi-robot simula-
tor. In In IEEE/RSJ International Conference on In-
telligent Robots and Systems, pages 2149–2154.
Kuderer, M., Gulati, S., and Burgard, W. (2015). Learning
driving styles for autonomous vehicles from demon-
stration. In Robotics and Automation (ICRA), 2015
IEEE International Conference on, pages 2641–2646.
IEEE.
Le, H. M., Jiang, N., Agarwal, A., Dudík, M., Yue, Y., and Daumé III, H. (2018). Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590.
Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H.,
and Chandraker, M. (2017). Desire: Distant future
prediction in dynamic scenes with interacting agents.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 336–345.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-
to-end training of deep visuomotor policies. The Jour-
nal of Machine Learning Research, 17(1):1334–1373.
Liaw, R., Krishnan, S., Garg, A., Crankshaw, D., Gon-
zalez, J. E., and Goldberg, K. (2017). Composing
meta-policies for autonomous driving using hierar-
chical deep reinforcement learning. arXiv preprint
arXiv:1711.01503.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D., and Wierstra, D. (2015). Conti-
nuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971.
Liu, G.-H., Siravuru, A., Prabhakar, S., Veloso, M., and
Kantor, G. (2017). Learning end-to-end multimodal
sensor policies for autonomous navigation. In Levine,
S., Vanhoucke, V., and Goldberg, K., editors, Procee-
dings of the 1st Annual Conference on Robot Lear-
ning, volume 78 of Proceedings of Machine Learning
Research, pages 249–261. PMLR.
Mania, H., Guy, A., and Recht, B. (2018). Simple random
search provides a competitive approach to reinforce-
ment learning. arXiv preprint arXiv:1803.07055.
Mannion, P., Duggan, J., and Howley, E. (2016a). An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In McCluskey, L. T., Kotsialos, A., Müller, P. J., Klügl, F., Rana, O., and Schumann, R., editors, Autonomic Road Transport Support Systems, pages 47–66. Springer International Publishing.
Mannion, P., Mason, K., Devlin, S., Duggan, J., and Ho-
wley, E. (2016b). Multi-objective dynamic dispa-
tch optimisation using multi-agent reinforcement lear-
ning. In Proceedings of the 15th International Confe-
rence on Autonomous Agents and Multiagent Systems
(AAMAS), pages 1345–1346.
Mason, K., Mannion, P., Duggan, J., and Howley, E. (2016).
Applying multi-agent reinforcement learning to wa-
tershed management. In Proceedings of the Adaptive
and Learning Agents workshop (at AAMAS 2016).
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.,
Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement lear-
ning. In International Conference on Machine Lear-
ning, pages 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Anto-
noglou, I., Wierstra, D., and Riedmiller, M. (2013).
Playing atari with deep reinforcement learning. arXiv
preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., et al. (2015).
Human-level control through deep reinforcement le-
arning. Nature, 518(7540):529.
Naik, D. K. and Mammone, R. (1992). Meta-neural net-
works that learn by learning. In Neural Networks,
1992. IJCNN., International Joint Conference on, vo-
lume 1, pages 437–442. IEEE.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invari-
ance under reward transformations: Theory and appli-
cation to reward shaping. In ICML, volume 99, pages
278–287.
Ngai, D. C. K. and Yung, N. H. C. (2011). A multiple-
goal reinforcement learning method for complex vehi-
cle overtaking maneuvers. IEEE Transactions on In-
telligent Transportation Systems, 12(2):509–522.
Nichol, A., Achiam, J., and Schulman, J. (2018).
On first-order meta-learning algorithms. CoRR,
abs/1803.02999.
Nosrati, M. S., Abolfathi, E. A., Elmahgiubi, M., Yadmel-
lat, P., Luo, J., Zhang, Y., Yao, H., Zhang, H., and Ja-
mil, A. (2018). Towards practical hierarchical reinfor-
cement learning for multi-lane autonomous driving.
Paden, B., Čáp, M., Yong, S. Z., Yershov, D. S., and Frazzoli, E. (2016). A survey of motion planning and control techniques for self-driving urban vehicles. CoRR, abs/1604.07446.
Pan, X., You, Y., Wang, Z., and Lu, C. (2017). Virtual to
real reinforcement learning for autonomous driving.
arXiv preprint arXiv:1704.03952.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017).
Curiosity-driven exploration by self-supervised pre-
diction. In International Conference on Machine Le-
arning (ICML), volume 2017.
Peng, X. B., Andrychowicz, M., Zaremba, W., and Ab-
beel, P. (2017). Sim-to-real transfer of robotic con-
trol with dynamics randomization. arXiv preprint
arXiv:1710.06537.
Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K.
(2018). Toward driving scene understanding: A data-
set for learning driver behavior and causal reasoning.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 7699–7707.
Ross, S. and Bagnell, D. (2010). Efficient reductions for
imitation learning. In Proceedings of the thirteenth
international conference on artificial intelligence and
statistics, pages 661–668.
Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S.
(2016). End-to-end deep reinforcement learning for
lane keeping assist. arXiv preprint arXiv:1612.04340.
Sallab, A. E., Abdou, M., Perot, E., and Yogamani,
S. (2017). Deep reinforcement learning frame-
work for autonomous driving. Electronic Imaging,
2017(19):70–76.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Mo-
ritz, P. (2015). Trust region policy optimization. In
International Conference on Machine Learning, pa-
ges 1889–1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Schwarting, W., Alonso-Mora, J., and Rus, D. (2018). Plan-
ning and decision-making for autonomous vehicles.
Annual Review of Control, Robotics, and Autonomous
Systems.
Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). Air-
sim: High-fidelity visual and physical simulation for
autonomous vehicles. In Field and Service Robotics,
pages 621–635. Springer.
Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016).
Safe, multi-agent, reinforcement learning for autono-
mous driving. arXiv preprint arXiv:1610.03295.
Sharifzadeh, S., Chiotellis, I., Triebel, R., and Cremers,
D. (2016). Learning to drive using inverse reinfor-
cement learning and deep q-networks. arXiv preprint
arXiv:1612.03653.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre,
L., Van Den Driessche, G., Schrittwieser, J., Antonog-
lou, I., Panneershelvam, V., Lanctot, M., et al. (2016).
Mastering the game of go with deep neural networks
and tree search. nature, 529(7587):484–489.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and
Riedmiller, M. (2014). Deterministic policy gradient
algorithms. In ICML.
Silver, D., Sutton, R. S., and Müller, M. (2008). Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pages 968–975. ACM.
Sobh, I., Amin, L., Abdelkarim, S., Elmadawy, K., Saeed,
M., Abdeltawab, O., Gamal, M., and El Sallab, A.
(2018). End-to-end multi-modal sensors fusion sy-
stem for urban automated driving. In NIPS Works-
hop on Machine Learning for Intelligent Transporta-
tion Systems.
Sobh, I. and Darwish, N. (2018). End-to-end framework for
fast learning asynchronous agents. In Proceedings of
the 32nd Conference on Neural Information Proces-
sing Systems, Imitation Learning and its Challenges
in Robotics workshop.
Sutton, R. S. (1990). Integrated architectures for learning,
planning, and reacting based on approximating dyna-
mic programming. In Machine Learning Proceedings
1990, pages 216–224. Elsevier.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement lear-
ning: An introduction. MIT press.
Taylor, M. E. and Stone, P. (2009). Transfer learning for
reinforcement learning domains: A survey. Journal of
Machine Learning Research, 10(Jul):1633–1685.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep rein-
forcement learning with double q-learning. In AAAI,
volume 16, pages 2094–2100.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D.,
and Botvinick, M. (2016). Learning to reinforcement
learn. arXiv preprint arXiv:1611.05763.
Wang, P. and Chan, C.-Y. (2017). Formulation of deep rein-
forcement learning architecture toward autonomous
driving for on-ramp merge. In Intelligent Transpor-
tation Systems (ITSC), 2017 IEEE 20th International
Conference on, pages 1–6. IEEE.
Wang, P., Chan, C.-Y., and de La Fortelle, A. (2018). A rein-
forcement learning based approach for automated lane
change maneuvers. arXiv preprint arXiv:1804.07871.
Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. (2000). TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net.
Xiong, X., Wang, J., Zhang, F., and Li, K. (2016). Com-
bining deep reinforcement learning and safety ba-
sed control for autonomous driving. arXiv preprint
arXiv:1612.00147.
Xu, H., Gao, Y., Yu, F., and Darrell, T. (2017). End-to-end
learning of driving models from large-scale video da-
tasets. In 2017 IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2017, Honolulu, HI,
USA, July 21-26, 2017, pages 3530–3538.
Ye, C., Ma, H., Zhang, X., Zhang, K., and You, S. (2017).
Survival-oriented reinforcement learning model: An
efficient and robust deep reinforcement learning algo-
rithm for autonomous driving problem. In Internati-
onal Conference on Image and Graphics, pages 417–
429. Springer.
Yu, A., Palefsky-Smith, R., and Bedi, R. (2016). Deep rein-
forcement learning for simulated autonomous vehi-
cle control. Course Project Reports: Winter 2016
(CS231n: Convolutional Neural Networks for Visual
Recognition), pages 1–7.
Zhang, J. and Cho, K. (2017). Query-efficient imitation le-
arning for end-to-end simulated driving. In Procee-
dings of the Thirty-First AAAI Conference on Artificial
Intelligence, San Francisco, California, USA., pages
2891–2897.
Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K.
(2008). Maximum entropy inverse reinforcement lear-
ning. In AAAI, volume 8, pages 1433–1438. Chicago,
IL, USA.
Ziebart, B. D., Ratliff, N., Gallagher, G., Mertz, C., Peter-
son, K., Bagnell, J. A., Hebert, M., Dey, A. K., and
Srinivasa, S. (2009). Planning-based prediction for
pedestrians. In Proc. IROS.