Exploring Applications of Deep Reinforcement Learning
for Real-world Autonomous Driving Systems
Victor Talpaert 1,2, Ibrahim Sobh 3, B. Ravi Kiran 2, Patrick Mannion 4,
Senthil Yogamani 5, Ahmad El-Sallab 3 and Patrick Perez 6
1 U2IS, ENSTA ParisTech, Palaiseau, France
2 AKKA Technologies, Guyancourt, France
3 Valeo Egypt, Cairo
4 Galway-Mayo Institute of Technology, Ireland
5 Valeo Vision Systems, Ireland
6 Valeo.ai, France
Keywords:
Autonomous Driving, Deep Reinforcement Learning, Visual Perception.
Abstract:
Deep Reinforcement Learning (DRL) has become increasingly powerful in recent years, with notable achievements such as DeepMind's AlphaGo. It has been successfully deployed in commercial vehicles such as Mobileye's path planning system. However, the vast majority of work on DRL is focused on toy examples in controlled synthetic car simulator environments such as TORCS and CARLA. In general, DRL is still in its infancy in terms of usability in real-world applications. Our goal in this paper is to encourage real-world deployment of DRL in various autonomous driving (AD) applications. We first provide an overview of the tasks in autonomous driving systems, reinforcement learning algorithms and applications of DRL to AD systems. We then discuss the challenges which must be addressed to enable further progress towards real-world deployment.
1 INTRODUCTION
Autonomous driving (AD) is a challenging application domain for machine learning (ML). Since the task of driving “well” is already open to subjective definitions, it is not easy to specify the correct behavioural outputs for an autonomous driving system, nor is it simple to choose the right input features to learn with. Correct driving behaviour is only loosely defined, as different responses to similar situations may be equally acceptable. ML-based control policies have to overcome the lack of a dense metric evaluating the driving quality over time, and the lack of a strong signal for expert imitation. Supervised learning methods do not learn the dynamics of the environment nor that of the agent (Sutton and Barto, 2018), while reinforcement learning (RL) is formulated to handle sequential decision processes, making it a natural approach for learning AD control policies.
In this article, we aim to outline the underlying principles of DRL for applications in AD. Passive perception which simply feeds into a control system does not scale to handle complex situations; a DRL setting would enable active perception optimized for the specific control task. The rest of the paper is structured as follows. Section 2 provides background on the various modules of AD, an overview of reinforcement learning and a summary of applications of DRL to AD. Section 3 discusses challenges and open problems in applying DRL to AD. Finally, Section 4 summarizes the paper and provides key future directions.
2 BACKGROUND
The software architecture for autonomous driving systems comprises the following high-level tasks: Sensing, Perception, Planning and Control. While some approaches (Bojarski et al., 2016) achieve all tasks in one unique module, others follow the logic of one task, one module (Paden et al., 2016). Behind this separation rests the idea of information transmission from the sensors to the final stage of actuators and motors, as illustrated in Figure 1. While an end-to-end approach would enable one generic encapsulated solution, the modular approach provides granularity for this multi-discipline problem, as separating the tasks into modules is the divide-and-conquer engineering approach.
Figure 1: Fixed modules in a modern autonomous driving system. The sensor infrastructure notably includes cameras, radars, LiDARs (the laser equivalent of radars) and GPS-IMUs (GPS and Inertial Measurement Units, which provide an instantaneous position); their raw data is considered low level. This dynamic data is often processed into higher-level descriptions as part of the Perception module. Perception estimates positions of the location in lane, cars, pedestrians, traffic lights and other semantic objects, among other descriptors. It also provides road occupation over a larger spatial and temporal scale by combining maps. Mapping: localised high-definition maps (HD maps) provide a centimetric reconstruction of buildings, static obstacles and dynamic objects, and are frequently crowdsourced. Scene understanding provides the high-level scene comprehension that includes detection, classification and localisation tasks, feeding the driving policy/planning module. Path Planning predicts the future actor trajectories and manoeuvres; a static shortest path from point A to point B, constrained by dynamic traffic information, is employed to calculate the path. Vehicle control carries out the high-level orders for motion planning using simple closed-loop systems based on sensors from the Perception task.
2.1 Review of Reinforcement Learning
Reinforcement Learning (RL) (Sutton and Barto, 2018) is a family of algorithms which allow agents to learn how to act in different situations, in other words, how to establish a map, or policy, from situations (states) to actions which maximize a numerical reward signal. RL has been successfully applied to many different fields such as helicopter control (Naik and Mammone, 1992), traffic signal control (Mannion et al., 2016a), electricity generator scheduling (Mannion et al., 2016b), water resource management (Mason et al., 2016), playing relatively simple Atari games (Mnih et al., 2015), mastering the much more complex game of Go (Silver et al., 2016), simulated continuous control problems (Lillicrap et al., 2015), (Schulman et al., 2015), and controlling robots in real environments (Levine et al., 2016).
RL algorithms may learn estimates of state values, environment models or policies. In real-world applications, simple tabular representations of such estimates are not scalable: each additional feature tracked in the state leads to an exponential growth in the number of estimates that must be stored (Sutton and Barto, 2018). For instance, a state described by 10 features, each discretised into 10 bins, already requires 10^10 table entries per action.
Deep Neural Networks (DNNs) have recently been applied as function approximators for RL agents, allowing agents to generalise knowledge to new unseen situations, along with new algorithms for problems with continuous state and action spaces. Deep RL agents can be value-based: DQN (Mnih et al., 2013), Double DQN (Van Hasselt et al., 2016); policy-based: TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017); or actor-critic: DPG (Silver et al., 2014), DDPG (Lillicrap et al., 2015), A3C (Mnih et al., 2016). Model-based RL agents attempt to build an environment model, e.g. Dyna-Q (Sutton, 1990), Dyna-2 (Silver et al., 2008). Inverse reinforcement learning (IRL) aims to estimate the reward function given examples of an agent's actions, the sensory input to the agent (state), and the model of the environment (Abbeel and Ng, 2004a), (Ziebart et al., 2008), (Finn et al., 2016), (Ho and Ermon, 2016).
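To make the value-based case above concrete, the following sketch performs a single DQN-style temporal-difference update using a frozen target network. It is a minimal illustration on random tensors standing in for replay-buffer samples; the network sizes, batch shape and hyperparameters are assumptions for the example, not details taken from the cited works.

```python
# Minimal sketch: one DQN-style update on synthetic transitions (illustrative only).
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# Random tensors stand in for a mini-batch sampled from a replay buffer.
s = torch.randn(32, state_dim)
a = torch.randint(0, n_actions, (32,))
r = torch.randn(32)
s_next = torch.randn(32, state_dim)
done = torch.zeros(32)

with torch.no_grad():
    # TD target bootstrapped from a periodically-synchronised target network.
    td_target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values

q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a) for the taken actions
loss = nn.functional.smooth_l1_loss(q_sa, td_target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```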
2.2 DRL for Autonomous Driving
In this section, we visit the different modules of the
autonomous driving system as shown in Figure 1 and
describe how they are achieved using classical RL and
Deep RL methods. A list of datasets and simulators
for AD tasks is presented in Table 1.
Table 1: A collection of datasets and simulators to evaluate AD algorithms.
Dataset | Description
Berkeley Driving Dataset (Xu et al., 2017) | Learn driving policy from demonstrations
Baidu's ApolloScape | Multiple sensors & driver behaviour profiles
Honda Driving Dataset (Ramanishka et al., 2018) | 4-level annotation (stimulus, action, cause, and attention objects) for driver behaviour profiling
Simulator | Description
CARLA (Dosovitskiy et al., 2017) | Urban driving simulator with camera, LiDAR, depth & semantic segmentation
Racing Simulator TORCS (Wymann et al., 2000) | Testing control policies for vehicles
AIRSIM (Shah et al., 2018) | Resembling CARLA, with support for drones
GAZEBO (Koenig and Howard, 2004) | Multi-robot simulator for planning & control
Vehicle Control: Vehicle control has classically been achieved with predictive control approaches such as Model Predictive Control (MPC) (Paden et al., 2016). A recent review on the motion planning and control task can be found in (Schwarting et al., 2018). Classical RL methods are also used to perform optimal control in stochastic settings: the Linear Quadratic Regulator (LQR) is utilized in linear regimes and iterative LQR (iLQR) in non-linear regimes. More recently, it has been shown that random search over the parameters of a policy network can perform as well as LQR (Mania et al., 2018).
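As an illustration of the LQR controllers mentioned above, the sketch below computes a discrete-time LQR gain by iterating the Riccati recursion on a toy longitudinal double-integrator model. The system matrices, costs and horizon are illustrative assumptions, not parameters from the cited works.

```python
# Minimal sketch: discrete-time LQR via the Riccati recursion on a toy model.
import numpy as np

dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])          # state x = [position error, velocity]
B = np.array([[0.0],
              [dt]])                # control u = acceleration command
Q = np.diag([1.0, 0.1])             # state cost
R = np.array([[0.01]])              # control effort cost

# Iterate the discrete Riccati equation to (approximate) convergence.
P = Q.copy()
for _ in range(500):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # feedback gain
    P = Q + A.T @ P @ (A - B @ K)

# Roll out the resulting linear state-feedback law u = -K x.
x = np.array([5.0, 0.0])            # start 5 m away from the target point
for _ in range(100):
    u = -K @ x
    x = A @ x + B @ u
print("final state:", x)
```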
One can note recent work in which DQN is used for simulated autonomous vehicle control (Yu et al., 2016), where different reward functions are examined to produce specific driving behaviour; the agent successfully learned turning actions and navigation without crashing. In (Sallab et al., 2016) a DRL system for lane keeping assist is introduced for discrete actions (DQN) and continuous actions (DDAC), using the TORCS car simulator. The authors conclude that, as expected, continuous actions provide smoother trajectories, and that the more restrictive the termination conditions, the slower the convergence during learning. Wayve, a recent startup, has demonstrated an application of DRL (DDPG) for AD using a full-sized autonomous vehicle (Kendall et al., 2018). The system was first trained in simulation, before being trained in real time using onboard computers, and was able to learn to follow a lane, successfully completing a real-world trial on a 250 metre section of road.
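The reward functions examined in such lane-keeping work typically trade off forward progress against heading and lateral errors. The snippet below is an illustrative sketch of a reward of this kind for a TORCS-style simulator; the exact terms and weights used in (Sallab et al., 2016) may differ.

```python
# Illustrative lane-keeping reward: reward forward progress, penalise drift
# and off-centre driving. Weights and terms are assumptions for this example.
import math

def lane_keep_reward(speed_mps, heading_error_rad, lateral_offset_m, collided):
    if collided:
        return -100.0                                      # strong terminal penalty
    progress = speed_mps * math.cos(heading_error_rad)     # speed along the track axis
    drift = speed_mps * abs(math.sin(heading_error_rad))   # speed across the track axis
    centering = abs(lateral_offset_m)                      # distance from the lane centre
    return progress - drift - 0.5 * centering

# Example: driving at 20 m/s, slightly misaligned, 0.4 m off centre.
print(lane_keep_reward(20.0, 0.05, 0.4, collided=False))
```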
DQN for Ramp Merging: The AD problem of ramp merging is tackled in (Wang and Chan, 2017), where DRL is applied to find an optimal driving policy, using an LSTM to produce an internal state containing historical driving information and a DQN for Q-function approximation.
Q-function for Lane Change: A reinforcement learning approach is proposed in (Wang et al., 2018) to train a vehicle agent to perform automated lane changes in a smooth and efficient manner, where the coefficients of the Q-function are learned by neural networks.
IRL for Driving Styles: Learning individual perceptions of comfort from demonstration is proposed in (Kuderer et al., 2015), where individual driving styles are modeled in terms of a cost function and feature-based inverse reinforcement learning is used to compute trajectories in autonomous driving mode. Using Deep Q-Networks as the refinement step in IRL to extract the rewards is proposed in (Sharifzadeh et al., 2016). While evaluated only in a simulated driving environment, the agent is shown to perform human-like lane change behaviour.
Multiple-goal RL for Overtaking: In (Ngai and Yung, 2011) a multiple-goal reinforcement learning (MGRL) framework is used to solve the vehicle overtaking problem. The learned agent is found to take correct actions for overtaking while avoiding collisions and maintaining an almost steady speed.
Hierarchical Reinforcement Learning (HRL): Contrary to conventional or “flat” RL, HRL refers to the decomposition of complex agent behaviour using temporal abstraction, such as the options framework (Barto and Mahadevan, 2003). The problem of sequential decision making for autonomous driving with distinct behaviours is tackled in (Chen et al., 2018). A hierarchical neural network policy is proposed where the network is trained over the Semi-Markov Decision Process (SMDP) through the proposed hierarchical policy gradient method. The method is applied to a traffic light passing scenario, and it is shown that the method is able to select correct decisions, providing better performance compared to
a non-hierarchical reinforcement learning approach.
An RL-based hierarchical framework for autonomous
multi-lane cruising is proposed in (Nosrati et al.,
2018) and it is shown that the hierarchical design
enables significantly better learning performance
than a flat design for both DQN and PPO methods.
Frameworks: A framework for an end-to-end deep reinforcement learning pipeline for autonomous driving is proposed in (Sallab et al., 2017), where the inputs are the states of the environment and their aggregations over time, and the output is the driving actions. The framework integrates RNNs and an attention glimpse network, and is tested on a lane keeping assist algorithm. In this section we briefly reviewed the applications of DRL and classical RL methods to different AD tasks.
2.3 Predictive Perception
In this section, we review some examples of applications of IOC and IRL to predictive perception tasks. Given a certain instant where the autonomous agent is driving in the scene, the goal of predictive perception algorithms is to predict the trajectories or intention of movement of other actors in the environment. The authors of (Djuric et al., 2018) trained a deep convolutional neural network (CNN) to predict short-term vehicle trajectories, while accounting for the inherent uncertainty of vehicle motion in road traffic. Deep Stochastic IOC (inverse optimal control) RNN Encoder-Decoder (DESIRE) (Lee et al., 2017) is a framework used to estimate a distribution over, and not just a single prediction of, an agent's future positions, based on the context (intersection, relative position of other agents).
Pedestrian Intention: Examples include whether a pedestrian intends to cross the road or board another vehicle, or whether the driver of a parked car is going to open the door (Erran and Scheider, 2017). The authors of (Ziebart et al., 2009) perform maximum entropy inverse optimal control to learn a generic cost function for a robot to avoid pedestrians, while (Kitani et al., 2012) use inverse optimal control to predict pedestrian paths by considering scene semantics.
Traffic Negotiation: In traffic scenarios involving multiple agents, learned policies require the agents to negotiate movement in densely populated areas with continuously moving actors. Mobileye demonstrated the use of the options framework in this setting (Shalev-Shwartz et al., 2016).
3 PRACTICAL CHALLENGES
Deep reinforcement learning is a rapidly growing field. We summarize the frequent challenges, such as sample complexity and reward formulation, that could be encountered in the design of such methods.
3.1 Bootstrapping RL with Imitation
Learning by imitation is used by humans to teach other humans new skills. Demonstrations usually focus on the essential areas of the state space from the expert's point of view. Learning from Demonstrations (LfD) is significant especially in domains where rewards are sparse. In imitation learning, an agent learns to perform a task from demonstrations, without any feedback rewards, by treating the policy as a supervised learning problem over state-action pairs. However, high quality demonstrations are hard to collect, leading to sub-optimal policies (Atkeson and Schaal, 1997). Accordingly, LfD can be used to initialize the learning agent with a policy inspired by the performance of an expert; RL can then be conducted to discover a better policy by interacting with the environment. Learning a model of the environment that combines LfD and RL is presented in (Abbeel and Ng, 2005). Measuring the divergence between the current policy and the expert for policy optimization is proposed in (Kang et al., 2018). DQfD (Hester et al., 2017) pre-trains the agent on demonstrations by adding them into the replay buffer of the DQN and giving them additional priority. More recently, a training framework that combines LfD and RL for fast learning of asynchronous agents is proposed in (Sobh and Darwish, 2018).
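A simplified sketch of seeding a replay buffer with demonstrations, loosely in the spirit of DQfD, is given below. Real DQfD keeps demonstrations permanently, uses prioritised replay and auxiliary supervised losses; the expert sampling bonus here is a crude stand-in for those mechanisms, and the class and parameter names are assumptions for this example.

```python
# Sketch: a replay buffer seeded with expert demonstrations that are sampled
# more often than agent transitions (a crude proxy for demonstration priority).
import random
from collections import namedtuple

Transition = namedtuple("Transition", "state action reward next_state done expert")

class DemoSeededBuffer:
    def __init__(self, capacity=50_000, expert_bonus=3.0):
        self.capacity = capacity
        self.expert_bonus = expert_bonus
        self.demos = []       # demonstration transitions, never evicted
        self.agent = []       # transitions collected by the learning agent

    def add_demo(self, state, action, reward, next_state, done):
        self.demos.append(Transition(state, action, reward, next_state, done, True))

    def add(self, state, action, reward, next_state, done):
        self.agent.append(Transition(state, action, reward, next_state, done, False))
        if len(self.agent) > self.capacity:
            self.agent.pop(0)

    def sample(self, batch_size):
        pool = self.demos + self.agent
        weights = [self.expert_bonus if t.expert else 1.0 for t in pool]
        return random.choices(pool, weights=weights, k=batch_size)
```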
3.2 Exploration Issues with Imitation
In some cases, demonstrations from experts are not available, or do not cover enough of the state space, leading to a poorly learned policy. One solution consists in using the Data Aggregation (DAgger) (Ross and Bagnell, 2010) family of methods, where the learned end-to-end policy is run, the extracted observation-action pairs are labelled again by the expert, and they are aggregated with the original expert observation-action dataset. Thus, iteratively collecting training examples from both reference and trained policies explores more states and addresses this lack of exploration. Following work on Search-based Structured Prediction (SEARN), Stochastic Mixing Iterative Learning (SMILe) (Ross and Bagnell, 2010) trains a stochastic stationary policy over several iterations and then makes use of a geometric stochastic mixing of the policies trained. In a
standard imitation learning scenario, the demonstrator is required to cover sufficient states so as to avoid unseen states during testing. This constraint is costly and requires frequent human intervention. Hierarchical imitation learning methods reduce the sample complexity of standard imitation learning by performing data aggregation while organizing the action space in a hierarchy (Le et al., 2018).
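The DAgger procedure described above reduces to a short loop; the sketch below assumes placeholder interfaces `rollout`, `expert_label` and `fit_policy` for environment interaction, expert relabelling and supervised fitting respectively, rather than any real API.

```python
# Sketch of the DAgger loop: run the current policy, let the expert relabel the
# visited observations, aggregate, and refit the policy on the growing dataset.
def dagger(rollout, expert_label, fit_policy, n_iterations=10):
    dataset = []          # aggregated (observation, expert action) pairs
    policy = None         # first iteration can roll out the expert / a cloned policy
    for _ in range(n_iterations):
        observations = rollout(policy)                 # states visited by the current policy
        dataset += [(obs, expert_label(obs)) for obs in observations]
        policy = fit_policy(dataset)                   # supervised fit on aggregated data
    return policy
```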
3.3 Intrinsic Reward Functions
In controlled simulated environments such as games, an explicit reward signal is given to the agent along with its sensor stream. In real-world robotics and autonomous driving, however, designing a good reward function is essential so that the desired behaviour may be learned. The most common solution has been reward shaping (Ng et al., 1999), which consists in supplying additional rewards to the agent along with those provided by the underlying MDP. As already noted earlier in the paper, rewards could also be estimated by inverse RL (IRL) (Abbeel and Ng, 2004b), which depends on expert demonstrations. In the absence of explicit reward shaping and expert demonstrations, agents can use intrinsic rewards or intrinsic motivation (Chentanez et al., 2005) to evaluate whether their actions were good. The authors of (Pathak et al., 2017) define curiosity as the error in an agent's ability to predict the consequences of its own actions in a visual feature space learned by a self-supervised inverse dynamics model. In (Burda et al., 2018) the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. This enables the agent to determine what could be a useful behaviour even without extrinsic rewards.
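A minimal sketch of such a prediction-error intrinsic reward is given below: a learned forward model predicts the next observation and its error is paid out as a curiosity bonus. The dimensions, network and the absence of a learned feature encoder are simplifying assumptions, not the architectures of (Pathak et al., 2017) or (Burda et al., 2018).

```python
# Sketch: prediction error of a learned forward model used as an intrinsic reward.
import torch
import torch.nn as nn

obs_dim, act_dim = 16, 2
forward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, obs_dim))
optimizer = torch.optim.Adam(forward_model.parameters(), lr=1e-3)

def intrinsic_reward(obs, action, next_obs):
    pred = forward_model(torch.cat([obs, action], dim=-1))
    error = ((pred - next_obs) ** 2).mean(dim=-1)      # per-transition prediction error
    optimizer.zero_grad()
    error.mean().backward()                            # keep improving the predictor
    optimizer.step()
    return error.detach()                              # the detached error is the bonus

# Random tensors stand in for a batch of real transitions.
bonus = intrinsic_reward(torch.randn(32, obs_dim), torch.randn(32, act_dim),
                         torch.randn(32, obs_dim))
```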
3.4 Bridging the Simulator-reality Gap
Training deep networks requires collecting and annotating a lot of data, which is usually costly in terms of time and effort. Using simulation environments enables the collection of large training datasets. However, simulated data do not have the same distribution as real data. Accordingly, models trained on simulated environments often fail to generalise to real environments. Domain adaptation allows a machine learning model trained on samples from a source domain to generalise to a target domain. Feature-level domain adaptation focuses on learning domain-invariant features. In the work of (Ganin et al., 2016), the decisions made by deep neural networks are based on features that are both discriminative and invariant to the change of domains. Pixel-level domain adaptation focuses on stylizing images from the source domain to make them similar to images of the target domain, based on image-conditioned generative adversarial networks (GANs). In (Bousmalis et al., 2017b), the model learns a transformation in the pixel space from one domain to the other in an unsupervised way; a GAN is used to adapt simulated images to look as if drawn from the real domain. Feature-level and pixel-level domain adaptation are combined in (Bousmalis et al., 2017a), where the results indicate that including simulated data can improve a vision-based grasping system, achieving comparable performance with 50 times fewer real-world samples. Another, relatively simpler, method is introduced in (Peng et al., 2017): by randomizing the dynamics of the simulator during training, policies become capable of generalising to different dynamics without any training on the real system.
RL with Sim2Real: A model trained in a virtual environment is shown to be workable in a real environment (Pan et al., 2017). Virtual images rendered by a simulator are first segmented into a scene parsing representation and then translated into synthetic realistic images by the proposed image translation network. Accordingly, the driving policy trained by reinforcement learning can be adapted more easily to the real environment.
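A minimal sketch of per-episode dynamics randomization in the spirit of (Peng et al., 2017) is shown below; the physics parameters, their ranges and the `make_env`/`run_episode` interfaces are illustrative assumptions rather than details of the cited work.

```python
# Sketch: resample simulator physics for every training episode so the policy
# cannot overfit to a single simulator configuration.
import random

def sample_dynamics():
    return {
        "tire_friction": random.uniform(0.6, 1.2),
        "vehicle_mass_kg": random.uniform(1200.0, 1800.0),
        "actuation_delay_s": random.uniform(0.0, 0.15),
        "sensor_noise_std": random.uniform(0.0, 0.05),
    }

def train_with_randomization(make_env, run_episode, n_episodes=1000):
    for _ in range(n_episodes):
        env = make_env(**sample_dynamics())   # a fresh simulator with new physics
        run_episode(env)                      # any RL update performed in that env
```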
World Models, proposed in (Ha and Schmidhuber, 2018b; Ha and Schmidhuber, 2018a), are trained quickly in an unsupervised way, via a Variational AutoEncoder (VAE), to learn a compressed spatial and temporal representation of the environment, leading to a compact policy. Moreover, the agent can train inside its own dream and transfer the policy back into the actual environment.
3.5 Data Efficient and Fast Adapting RL
Depending on the task being solved, RL requires a lot of observations to cover the state space. Efficiency is usually achieved with imitation learning, reward shaping and transfer learning. Readers are directed to the survey on transfer learning in RL in (Taylor and Stone, 2009). The primary motivation for transfer learning in RL is to reuse previously trained policies or models for a source task, so as to reduce the training time for the current target task. The authors of (Liaw et al., 2017) study the policy composition problem: given previously learned basis policies, e.g. for driving in different conditions or over different terrain types, the goal is to be able to reuse them for
a novel task that is a mixture of previously seen dynamics, by learning a meta-policy that maximises the reward. Their results show that learning a new policy for a new task from scratch takes longer than learning a meta-policy over the basis policies.
Meta-learning, or learning to learn, is an important approach towards versatile agents that can adapt quickly to new tasks. The idea of having one neural network interact with another one for meta-learning has been applied in (Duan et al., 2016) and (Wang et al., 2016). More recently, Model-Agnostic Meta-Learning (MAML) was proposed in (Finn et al., 2017), where the meta-learner seeks an initialization for the parameters of a neural network that can be adapted quickly to a new task using only a few examples. Continuous adaptation in dynamically changing and adversarial scenarios is presented in (Al-Shedivat et al., 2017) via a simple gradient-based meta-learning algorithm. Additionally, Reptile (Nichol et al., 2018) is mathematically similar to first-order MAML, making it consume less computation and memory than MAML.
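The sketch below shows one first-order Reptile meta-update under the assumption of generic `sample_task` and `task_loss` interfaces: a copy of the model is adapted to a sampled task with a few gradient steps, and the meta-parameters are then moved a fraction of the way towards the adapted weights.

```python
# Sketch of a single Reptile meta-update (first-order meta-learning).
import copy
import torch

def reptile_step(model, sample_task, task_loss,
                 inner_steps=5, inner_lr=1e-2, meta_lr=0.1):
    task = sample_task()
    adapted = copy.deepcopy(model)                           # inner-loop copy
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                             # adapt to the sampled task
        loss = task_loss(adapted, task)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    with torch.no_grad():                                    # meta-update: interpolate weights
        for meta_p, task_p in zip(model.parameters(), adapted.parameters()):
            meta_p += meta_lr * (task_p - meta_p)
```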
3.6 Incorporating Safety in DRL for AD
Deploying an autonomous vehicle directly after training could be dangerous. We review different approaches to incorporate safety into DRL algorithms.
SafeDAgger (Zhang and Cho, 2017) introduces a safety policy that learns to predict the error made by a primary policy, trained initially with a supervised learning approach, without querying a reference policy. The additional safe policy takes both the partial observation of a state and the primary policy as inputs, and returns a binary label indicating whether the primary policy is likely to deviate from a reference policy, without querying the latter.
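A sketch of a SafeDAgger-style safety gate, loosely following the description above, is given below: a small classifier estimates whether the primary policy's action will deviate from the reference controller, which is queried only when needed. The network shapes, threshold and `reference_policy` interface are assumptions for this illustration.

```python
# Sketch: a learned safety classifier gates the primary policy and falls back
# to the reference controller when a deviation is predicted.
import torch
import torch.nn as nn

obs_dim, act_dim = 32, 2
primary_policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
safety_policy = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, 1), nn.Sigmoid())   # P(deviation)

def act(obs, reference_policy, threshold=0.5):
    action = primary_policy(obs)
    p_deviate = safety_policy(torch.cat([obs, action], dim=-1))
    if p_deviate.item() > threshold:
        return reference_policy(obs)     # fall back to the trusted controller
    return action

# Example with a dummy reference controller that outputs a zero (braking) action.
dummy_reference = lambda obs: torch.zeros(act_dim)
print(act(torch.randn(obs_dim), dummy_reference))
```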
Multi-agent RL for Comfort Driving and Safety: In (Shalev-Shwartz et al., 2016), autonomous driving is addressed as a multi-agent setting where the host vehicle applies negotiation in different scenarios, maintaining a balance between handling the unexpected behaviour of other drivers and not being too defensive. The problem is decomposed into a policy over learned desires to enable comfortable driving, and trajectory planning with hard constraints for driving safety.
DDPG and Safety-based Control: Deep reinforcement learning (DDPG) and safety-based control are combined in (Xiong et al., 2016), including an artificial potential field method that is widely used for robot path planning. Using the TORCS environment, DDPG is first used to learn a driving policy in a stable and familiar environment; the policy network and safety-based control are then combined to avoid collisions. It was found that the combination of DRL and safety-based control performs well in most scenarios.
Negative-Avoidance for Safety: In order to enable DRL to escape local optima, speed up the training process and avoid dangerous conditions or accidents, the Survival-Oriented Reinforcement Learning (SORL) model is proposed in (Ye et al., 2017), where survival is favoured over maximizing total reward, by modelling the autonomous driving problem as a constrained MDP and introducing a Negative-Avoidance Function to learn from previous failures. The SORL model is found not to be sensitive to the reward function, and can use different DRL algorithms such as DDPG.
3.7 Other Challenges
Multimodal Sensor Policies. Modern autonomous driving systems make use of multiple modalities (Sobh et al., 2018), for example camera RGB, depth, LiDAR and other sensors. The authors of (Liu et al., 2017) propose end-to-end learning of policies that leverages sensor fusion to reduce performance drops in noisy environments, and even in the face of partial sensor failure, by using Sensor Dropout to reduce sensitivity to any sensor subset.
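A simplified sketch of Sensor Dropout is given below: during training, entire sensor modalities are randomly zeroed before fusion so that the policy does not become overly dependent on any single sensor. The feature shapes are illustrative, and the original method additionally rescales features and guarantees at least one active modality.

```python
# Sketch: randomly drop whole sensor modalities before feature fusion.
import torch

def sensor_dropout(modalities, drop_prob=0.3):
    """modalities: dict of name -> feature tensor of shape (batch, feature_dim)."""
    dropped = []
    for name, features in modalities.items():
        keep_mask = (torch.rand(features.shape[0], 1) > drop_prob).float()
        dropped.append(features * keep_mask)    # zero the modality for masked samples
    return torch.cat(dropped, dim=-1)           # fused feature vector fed to the policy

fused = sensor_dropout({"camera": torch.randn(8, 256),
                        "lidar": torch.randn(8, 128),
                        "radar": torch.randn(8, 64)})
print(fused.shape)   # torch.Size([8, 448])
```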
Reproducibility. State-of-the-art deep RL methods are seldom reproducible. Non-determinism in standard benchmark environments, combined with variance intrinsic to the methods, can make reported results tough to interpret; the authors of (Henderson et al., 2018) discuss common issues and challenges. Therefore, in the future it would be helpful to develop standardized benchmarks for evaluating autonomous vehicle control algorithms, similar to benchmarks such as KITTI (Geiger et al., 2012) which are already available for AD perception tasks.
4 CONCLUSION
AD systems present a challenging environment for tasks such as perception, prediction and control. DRL is a promising candidate for the future development of AD systems, potentially allowing the required behaviours to be learned first in simulation and further refined on real datasets, instead of being explicitly programmed. In this article, we provide an overview of
AD system components, DRL algorithms and applications of DRL for AD. We discuss the main challenges which must be addressed to enable practical and widespread use of DRL in AD applications. Although most of the work surveyed in this paper was conducted in simulated environments, it is encouraging that applications on real vehicles are beginning to appear, e.g. (Kendall et al., 2018). Constructing a complete real-world system requires resolving key challenges such as safety in RL, improving data efficiency, and enabling transfer learning using simulated environments. In addition, environment dynamics are better modelled by considering the predictive perception of other actors. We hope that this work inspires future research and development on DRL for AD, leading to increased real-world deployments in AD systems.
REFERENCES
Abbeel, P. and Ng, A. Y. (2004a). Apprenticeship learning
via inverse reinforcement learning. In Proceedings of
the twenty-first international conference on Machine
learning, page 1. ACM.
Abbeel, P. and Ng, A. Y. (2004b). Apprenticeship learning
via inverse reinforcement learning. In Proceedings of
the Twenty-first International Conference on Machine
Learning, ICML ’04, pages 1–, New York, NY, USA.
ACM.
Abbeel, P. and Ng, A. Y. (2005). Exploration and appren-
ticeship learning in reinforcement learning. In Pro-
ceedings of the 22nd International Conference on Ma-
chine Learning, pages 1–8, Bonn, Germany.
Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mor-
datch, I., and Abbeel, P. (2017). Continuous adapta-
tion via meta-learning in nonstationary and competi-
tive environments. arXiv preprint arXiv:1710.03641.
Atkeson, C. G. and Schaal, S. (1997). Robot learning from
demonstration. In Proceedings of the 14th Interna-
tional Conference on Machine Learning, volume 97,
pages 12–20, Nashville, USA.
Barto, A. G. and Mahadevan, S. (2003). Recent advances
in hierarchical reinforcement learning. Discrete Event
Dynamic Systems, 13(4):341–379.
Bojarski, M., Testa, D. D., Dworakowski, D., Firner, B.,
Flepp, B., Goyal, P., Jackel, L. D., Monfort, M., Mul-
ler, U., Zhang, J., Zhang, X., Zhao, J., and Zieba,
K. (2016). End to end learning for self-driving cars.
CoRR, abs/1604.07316.
Bousmalis, K., Irpan, A., Wohlhart, P., Bai, Y., Kelcey, M.,
Kalakrishnan, M., Downs, L., Ibarz, J., Pastor, P., Ko-
nolige, K., et al. (2017a). Using simulation and dom-
ain adaptation to improve efficiency of deep robotic
grasping. arXiv preprint arXiv:1709.07857.
Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., and
Krishnan, D. (2017b). Unsupervised pixel-level dom-
ain adaptation with generative adversarial networks.
In The IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), volume 1, page 7.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Dar-
rell, T., and Efros, A. A. (2018). Large-scale
study of curiosity-driven learning. arXiv preprint
arXiv:1808.04355.
Chen, J., Wang, Z., and Tomizuka, M. (2018). Deep hierar-
chical reinforcement learning for autonomous driving
with distinct behaviors. In 2018 IEEE Intelligent Vehi-
cles Symposium (IV), pages 1239–1244. IEEE.
Chentanez, N., Barto, A. G., and Singh, S. P. (2005). Intrin-
sically motivated reinforcement learning. In Advan-
ces in neural information processing systems, pages
1281–1288.
Djuric, N., Radosavljevic, V., Cui, H., Nguyen, T., Chou,
F.-C., Lin, T.-H., and Schneider, J. (2018). Motion
prediction of traffic actors for autonomous driving
using deep convolutional networks. arXiv preprint
arXiv:1808.05819.
Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., and Kol-
tun, V. (2017). CARLA: An open urban driving simu-
lator. In Proceedings of the 1st Annual Conference on
Robot Learning, pages 1–16.
Duan, Y., Schulman, J., Chen, X., Bartlett, P. L., Sutske-
ver, I., and Abbeel, P. (2016). Fast reinforcement lear-
ning via slow reinforcement learning. arXiv preprint
arXiv:1611.02779.
Erran, L. and Scheider, J. (2017). Machine learning for autonomous vehicles. In ICML Workshop on Machine Learning for Autonomous Vehicles 2017.
Finn, C., Abbeel, P., and Levine, S. (2017). Model-agnostic
meta-learning for fast adaptation of deep networks.
arXiv preprint arXiv:1703.03400.
Finn, C., Levine, S., and Abbeel, P. (2016). Guided cost
learning: Deep inverse optimal control via policy op-
timization. In International Conference on Machine
Learning, pages 49–58.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Laro-
chelle, H., Laviolette, F., Marchand, M., and Lem-
pitsky, V. (2016). Domain-adversarial training of neu-
ral networks. The Journal of Machine Learning Rese-
arch, 17(1):2096–2030.
Geiger, A., Lenz, P., and Urtasun, R. (2012). Are we re-
ady for autonomous driving? the kitti vision bench-
mark suite. In Computer Vision and Pattern Recogni-
tion (CVPR), 2012 IEEE Conference on, pages 3354–
3361. IEEE.
Ha, D. and Schmidhuber, J. (2018a). Recurrent world
models facilitate policy evolution. arXiv preprint
arXiv:1809.01999. Accepted at NIPS 2018.
Ha, D. and Schmidhuber, J. (2018b). World models. arXiv
preprint arXiv:1803.10122.
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup,
D., and Meger, D. (2018). Deep reinforcement lear-
ning that matters. In McIlraith, S. A. and Weinberger,
K. Q., editors, (AAAI-18), the 30th innovative Appli-
cations of Artificial Intelligence, New Orleans, Lou-
isiana, USA, February 2-7, 2018, pages 3207–3214.
AAAI Press.
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul,
T., Piot, B., Horgan, D., Quan, J., Sendonaris, A.,
Dulac-Arnold, G., et al. (2017). Deep q-learning from
demonstrations. arXiv preprint arXiv:1704.03732.
Ho, J. and Ermon, S. (2016). Generative adversarial imi-
tation learning. In Advances in Neural Information
Processing Systems, pages 4565–4573.
Kang, B., Jie, Z., and Feng, J. (2018). Policy optimiza-
tion with demonstrations. In Proceedings of the 35th
International Conference on Machine Learning, vo-
lume 80, pages 2469–2478, Stockholm, Sweden.
Kendall, A., Hawke, J., Janz, D., Mazur, P., Reda, D.,
Allen, J.-M., Lam, V.-D., Bewley, A., and Shah, A.
(2018). Learning to drive in a day. arXiv preprint
arXiv:1807.00412.
Kitani, K. M., Ziebart, B. D., Bagnell, J. A., and Hebert,
M. (2012). Activity forecasting. In Fitzgibbon, A.,
Lazebnik, S., Perona, P., Sato, Y., and Schmid, C.,
editors, Computer Vision ECCV 2012, pages 201–
214, Berlin, Heidelberg. Springer Berlin Heidelberg.
Koenig, N. and Howard, A. (2004). Design and use para-
digms for gazebo, an open-source multi-robot simula-
tor. In In IEEE/RSJ International Conference on In-
telligent Robots and Systems, pages 2149–2154.
Kuderer, M., Gulati, S., and Burgard, W. (2015). Learning
driving styles for autonomous vehicles from demon-
stration. In Robotics and Automation (ICRA), 2015
IEEE International Conference on, pages 2641–2646.
IEEE.
Le, H. M., Jiang, N., Agarwal, A., Dudík, M., Yue, Y., and Daumé III, H. (2018). Hierarchical imitation and reinforcement learning. arXiv preprint arXiv:1803.00590.
Lee, N., Choi, W., Vernaza, P., Choy, C. B., Torr, P. H.,
and Chandraker, M. (2017). Desire: Distant future
prediction in dynamic scenes with interacting agents.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 336–345.
Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-
to-end training of deep visuomotor policies. The Jour-
nal of Machine Learning Research, 17(1):1334–1373.
Liaw, R., Krishnan, S., Garg, A., Crankshaw, D., Gon-
zalez, J. E., and Goldberg, K. (2017). Composing
meta-policies for autonomous driving using hierar-
chical deep reinforcement learning. arXiv preprint
arXiv:1711.01503.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D., and Wierstra, D. (2015). Conti-
nuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971.
Liu, G.-H., Siravuru, A., Prabhakar, S., Veloso, M., and
Kantor, G. (2017). Learning end-to-end multimodal
sensor policies for autonomous navigation. In Levine,
S., Vanhoucke, V., and Goldberg, K., editors, Procee-
dings of the 1st Annual Conference on Robot Lear-
ning, volume 78 of Proceedings of Machine Learning
Research, pages 249–261. PMLR.
Mania, H., Guy, A., and Recht, B. (2018). Simple random
search provides a competitive approach to reinforce-
ment learning. arXiv preprint arXiv:1803.07055.
Mannion, P., Duggan, J., and Howley, E. (2016a). An experimental review of reinforcement learning algorithms for adaptive traffic signal control. In McCluskey, L. T., Kotsialos, A., Müller, P. J., Klügl, F., Rana, O., and Schumann, R., editors, Autonomic Road Transport Support Systems, pages 47–66. Springer International Publishing.
Mannion, P., Mason, K., Devlin, S., Duggan, J., and Ho-
wley, E. (2016b). Multi-objective dynamic dispa-
tch optimisation using multi-agent reinforcement lear-
ning. In Proceedings of the 15th International Confe-
rence on Autonomous Agents and Multiagent Systems
(AAMAS), pages 1345–1346.
Mason, K., Mannion, P., Duggan, J., and Howley, E. (2016).
Applying multi-agent reinforcement learning to wa-
tershed management. In Proceedings of the Adaptive
and Learning Agents workshop (at AAMAS 2016).
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T.,
Harley, T., Silver, D., and Kavukcuoglu, K. (2016).
Asynchronous methods for deep reinforcement lear-
ning. In International Conference on Machine Lear-
ning, pages 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Anto-
noglou, I., Wierstra, D., and Riedmiller, M. (2013).
Playing atari with deep reinforcement learning. arXiv
preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., et al. (2015).
Human-level control through deep reinforcement le-
arning. Nature, 518(7540):529.
Naik, D. K. and Mammone, R. (1992). Meta-neural net-
works that learn by learning. In Neural Networks,
1992. IJCNN., International Joint Conference on, vo-
lume 1, pages 437–442. IEEE.
Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invari-
ance under reward transformations: Theory and appli-
cation to reward shaping. In ICML, volume 99, pages
278–287.
Ngai, D. C. K. and Yung, N. H. C. (2011). A multiple-
goal reinforcement learning method for complex vehi-
cle overtaking maneuvers. IEEE Transactions on In-
telligent Transportation Systems, 12(2):509–522.
Nichol, A., Achiam, J., and Schulman, J. (2018).
On first-order meta-learning algorithms. CoRR,
abs/1803.02999.
Nosrati, M. S., Abolfathi, E. A., Elmahgiubi, M., Yadmel-
lat, P., Luo, J., Zhang, Y., Yao, H., Zhang, H., and Ja-
mil, A. (2018). Towards practical hierarchical reinfor-
cement learning for multi-lane autonomous driving.
Paden, B., Čáp, M., Yong, S. Z., Yershov, D. S., and Frazzoli, E. (2016). A survey of motion planning and control techniques for self-driving urban vehicles. CoRR, abs/1604.07446.
Pan, X., You, Y., Wang, Z., and Lu, C. (2017). Virtual to
real reinforcement learning for autonomous driving.
arXiv preprint arXiv:1704.03952.
Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. (2017).
Curiosity-driven exploration by self-supervised pre-
diction. In International Conference on Machine Le-
arning (ICML), volume 2017.
Peng, X. B., Andrychowicz, M., Zaremba, W., and Ab-
beel, P. (2017). Sim-to-real transfer of robotic con-
trol with dynamics randomization. arXiv preprint
arXiv:1710.06537.
Ramanishka, V., Chen, Y.-T., Misu, T., and Saenko, K.
(2018). Toward driving scene understanding: A data-
set for learning driver behavior and causal reasoning.
In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 7699–7707.
Ross, S. and Bagnell, D. (2010). Efficient reductions for
imitation learning. In Proceedings of the thirteenth
international conference on artificial intelligence and
statistics, pages 661–668.
Sallab, A. E., Abdou, M., Perot, E., and Yogamani, S.
(2016). End-to-end deep reinforcement learning for
lane keeping assist. arXiv preprint arXiv:1612.04340.
Sallab, A. E., Abdou, M., Perot, E., and Yogamani,
S. (2017). Deep reinforcement learning frame-
work for autonomous driving. Electronic Imaging,
2017(19):70–76.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Mo-
ritz, P. (2015). Trust region policy optimization. In
International Conference on Machine Learning, pa-
ges 1889–1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Schwarting, W., Alonso-Mora, J., and Rus, D. (2018). Plan-
ning and decision-making for autonomous vehicles.
Annual Review of Control, Robotics, and Autonomous
Systems.
Shah, S., Dey, D., Lovett, C., and Kapoor, A. (2018). Air-
sim: High-fidelity visual and physical simulation for
autonomous vehicles. In Field and Service Robotics,
pages 621–635. Springer.
Shalev-Shwartz, S., Shammah, S., and Shashua, A. (2016).
Safe, multi-agent, reinforcement learning for autono-
mous driving. arXiv preprint arXiv:1610.03295.
Sharifzadeh, S., Chiotellis, I., Triebel, R., and Cremers,
D. (2016). Learning to drive using inverse reinfor-
cement learning and deep q-networks. arXiv preprint
arXiv:1612.03653.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre,
L., Van Den Driessche, G., Schrittwieser, J., Antonog-
lou, I., Panneershelvam, V., Lanctot, M., et al. (2016).
Mastering the game of go with deep neural networks
and tree search. nature, 529(7587):484–489.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and
Riedmiller, M. (2014). Deterministic policy gradient
algorithms. In ICML.
Silver, D., Sutton, R. S., and Müller, M. (2008). Sample-based learning and search with permanent and transient memories. In Proceedings of the 25th International Conference on Machine Learning, pages 968–975. ACM.
Sobh, I., Amin, L., Abdelkarim, S., Elmadawy, K., Saeed,
M., Abdeltawab, O., Gamal, M., and El Sallab, A.
(2018). End-to-end multi-modal sensors fusion sy-
stem for urban automated driving. In NIPS Works-
hop on Machine Learning for Intelligent Transporta-
tion Systems.
Sobh, I. and Darwish, N. (2018). End-to-end framework for
fast learning asynchronous agents. In Proceedings of
the 32nd Conference on Neural Information Proces-
sing Systems, Imitation Learning and its Challenges
in Robotics workshop.
Sutton, R. S. (1990). Integrated architectures for learning,
planning, and reacting based on approximating dyna-
mic programming. In Machine Learning Proceedings
1990, pages 216–224. Elsevier.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement lear-
ning: An introduction. MIT press.
Taylor, M. E. and Stone, P. (2009). Transfer learning for
reinforcement learning domains: A survey. Journal of
Machine Learning Research, 10(Jul):1633–1685.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep rein-
forcement learning with double q-learning. In AAAI,
volume 16, pages 2094–2100.
Wang, J. X., Kurth-Nelson, Z., Tirumala, D., Soyer, H.,
Leibo, J. Z., Munos, R., Blundell, C., Kumaran, D.,
and Botvinick, M. (2016). Learning to reinforcement
learn. arXiv preprint arXiv:1611.05763.
Wang, P. and Chan, C.-Y. (2017). Formulation of deep rein-
forcement learning architecture toward autonomous
driving for on-ramp merge. In Intelligent Transpor-
tation Systems (ITSC), 2017 IEEE 20th International
Conference on, pages 1–6. IEEE.
Wang, P., Chan, C.-Y., and de La Fortelle, A. (2018). A rein-
forcement learning based approach for automated lane
change maneuvers. arXiv preprint arXiv:1804.07871.
Wymann, B., Espié, E., Guionneau, C., Dimitrakakis, C., Coulom, R., and Sumner, A. (2000). TORCS, the open racing car simulator. Software available at http://torcs.sourceforge.net.
Xiong, X., Wang, J., Zhang, F., and Li, K. (2016). Com-
bining deep reinforcement learning and safety ba-
sed control for autonomous driving. arXiv preprint
arXiv:1612.00147.
Xu, H., Gao, Y., Yu, F., and Darrell, T. (2017). End-to-end
learning of driving models from large-scale video da-
tasets. In 2017 IEEE Conference on Computer Vision
and Pattern Recognition, CVPR 2017, Honolulu, HI,
USA, July 21-26, 2017, pages 3530–3538.
Ye, C., Ma, H., Zhang, X., Zhang, K., and You, S. (2017).
Survival-oriented reinforcement learning model: An
efficient and robust deep reinforcement learning algo-
rithm for autonomous driving problem. In Internati-
onal Conference on Image and Graphics, pages 417–
429. Springer.
Yu, A., Palefsky-Smith, R., and Bedi, R. (2016). Deep rein-
forcement learning for simulated autonomous vehi-
cle control. Course Project Reports: Winter 2016
(CS231n: Convolutional Neural Networks for Visual
Recognition), pages 1–7.
Zhang, J. and Cho, K. (2017). Query-efficient imitation le-
arning for end-to-end simulated driving. In Procee-
dings of the Thirty-First AAAI Conference on Artificial
Intelligence, San Francisco, California, USA., pages
2891–2897.
Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K.
(2008). Maximum entropy inverse reinforcement lear-
ning. In AAAI, volume 8, pages 1433–1438. Chicago,
IL, USA.
Ziebart, B. D., Ratliff, N., Gallagher, G., Mertz, C., Peter-
son, K., Bagnell, J. A., Hebert, M., Dey, A. K., and
Srinivasa, S. (2009). Planning-based prediction for
pedestrians. In Proc. IROS.