Zeroth-Order Optimization Attacks on Deep Reinforcement
Learning-Based Lane Changing Algorithms for Autonomous Vehicles
Dayu Zhang¹, Nasser Lashgarian Azad¹, Sebastian Fischmeister² and Stefan Marksteiner³

¹ Systems Design Engineering, University of Waterloo, 200 University Ave. W, Waterloo, Ontario, Canada
² Electrical and Computer Engineering, University of Waterloo, 200 University Ave. W, Waterloo, Ontario, Canada
³ AVL List GmbH, Hans-List-Platz 1, 8020 Graz, Austria
Keywords:
Deep Reinforcement Learning, Adversarial Training, Zeroth-Order Optimization, Autonomous Vehicles.
Abstract:
As Autonomous Vehicles (AVs) become prevalent, their reinforcement learning-based decision-making algo-
rithms, especially those governing highway lane changes, are potentially vulnerable to adversarial attacks.
This study investigates the vulnerability of Deep Q-Network (DQN) and Deep Deterministic Policy Gradient
(DDPG) reinforcement learning algorithms to black-box attacks. We utilize zeroth-order optimization methods such as ZO-SignSGD, which enable effective attacks without gradient information and reveal vulnerabilities in existing systems. Our results demonstrate that these attacks can significantly degrade the performance of the AV, reducing its reward by 60 percent or more. We also explore adversarial training as a defensive measure, which enhances the robustness of the DRL algorithms at the expense of overall performance.
Our findings underline the necessity of developing robust and secure reinforcement learning algorithms for
AVs, urging further research into comprehensive defense strategies. The work is the first to apply zeroth-order
optimization attacks on reinforcement learning in AVs, highlighting the imperative for balancing robustness
and accuracy in AV algorithms.
1 INTRODUCTION
The escalating pace of Autonomous Vehicle (AV)
development and deployment emphasizes the urgent
need for secure and robust decision-making algo-
rithms. However, these algorithms, often proposed to be based on Deep Reinforcement Learning (DRL), present a potential vulnerability that adversarial attacks could exploit.
Since the discovery of adversarial examples in
2013, adversarial attacks and defenses have been
well-studied in the field of deep learning (Szegedy
et al., 2014). These attacks introduce minute per-
turbations, compelling machine learning algorithms
to produce erroneous or attacker-desired predictions.
Such perturbations are often imperceptible to both hu-
man observers and conventional detection techniques.
Many attacks and defenses have been demonstrated
on AVs, primarily targeting perception modules, such
as cameras and lidars (Boloor et al., 2020; Cao et al.,
2019). However, a significant gap remains in adversarial attacks and defenses for DRL, especially in the application of DRL to AVs, where DRL shows promise as the future of control. Notably,
black-box attacks represent a significant threat due
to their capacity to manipulate system output with-
out the attacker having deep knowledge of the sys-
tem’s inner workings, in this case, the gradient in-
formation of the underlying model. As AVs increas-
ingly share our roads, understanding the vulnerability
of DRL-based lane-changing algorithms to such at-
tacks becomes crucial for ensuring the safety, trust,
and widespread acceptance of these emerging tech-
nologies.
In this study, we comprehensively examine the vulnerability of highway lane-changing algorithms, a critical and fundamental component of AV systems, to black-box attacks. Our focus is
on the widely implemented DRL policies: Deep Q-
Network (DQN) and Deep Deterministic Policy Gra-
dient (DDPG), representing discrete and continuous
action spaces, respectively (Mnih et al., 2013; Lillicrap et al., 2019). Leveraging zeroth-order optimization methods often applied in deep learning, such as ZO-SignSGD, we demonstrate their utility in executing black-box attacks on DRL agents, uniquely without the need for gradient information (Liu et al., 2019). Such an attack is plausible if the attacker ac-
cesses vehicle sensor values. However, to maximize the damage without being detected, the attack must produce a perturbation small enough to avoid detection yet still capable of wreaking havoc. Our experiments re-
veal that these techniques can effectively undermine
the decision-making processes of AVs, highlighting a
significant vulnerability in current systems. Further,
we investigate the efficacy of adversarial training as
a mitigation strategy within the context of DRL, pro-
viding insights into its potential to enhance the robust-
ness of lane-changing algorithms against adversarial
attacks. Our research underscores the importance of
considering security in developing and deploying re-
inforcement learning algorithms in AVs.
As a result, the contributions of this paper are as
follows:
• The first paper to highlight and apply zeroth-order optimization attacks on DRL in general, and the first use of such attacks in the context of AVs.
• A demonstration of the effect of targeted adversarial attacks and how they can force agents into a specific action, leading to hazardous conditions and collisions. We show that such attacks converge in a surprisingly short time with minimal perturbation.
• Results on hardening DRL through perturbation training, providing guidance to future work against zeroth-order optimization attacks.
The remainder of the paper is structured as fol-
lows: Section 2 outlines the related work. Section 3
provides the problem definition and describes the en-
vironment. Section 4 explains the DRL policies used
in this work. Section 5 highlights the attack model
and the new zeroth order optimization attack. Sec-
tion 6 outlines our approach and discusses the results.
Finally, Section 7 draws conclusions from the work
and describes future initiatives.
2 RELATED WORK
As Reinforcement Learning (RL) continues to prove
its potency in complex decision-making tasks, there
has been a surge in academic interest in exploring its
intersection with Adversarial Machine Learning and
vulnerabilities to such attacks.
2.1 Reinforcement Learning
RL is a machine learning algorithm where an agent
learns to interact with an environment to maximize
the reward. Given a state, the agent produces an ac-
tion, and based on this action, the environment pro-
vides a corresponding reward. The agent then updates
its policy based on the reward. The agent is trained by
interacting with the environment for several episodes.
Along with the development of deep learning,
DRL has been shown to be effective in several ap-
plications such as Atari games (Mnih et al., 2013),
robotics (Kalashnikov et al., 2021; Chebotar et al.,
2021), and AVs (Isele et al., 2018; Mnih et al., 2015).
DQN, one of the first DRL algorithms, demonstrated
its potential by outperforming humans in Atari games
(Mnih et al., 2013). It employs a neural network to ap-
proximate the Q function, representing the expected
reward for taking an action in a given state. Crucial to DQN's operation are experience replay, which de-correlates the training data, and a target network, which stabilizes the training process. On the other
hand, DDPG, an actor-critic algorithm, effectively
manages continuous action spaces (Lillicrap et al.,
2019). Like DQN, it employs a neural network to
approximate the Q function and the policy. How-
ever, it distinguishes itself through the use of a re-
play buffer for de-correlating training data, the target
network for training stabilization, and its proficiency
in solving numerous continuous control tasks in the
OpenAI Gym (Brockman et al., 2016).
2.2 Adversarial Machine Learning
Adversarial machine learning was first introduced by
Szegedy et al. in 2013 (Szegedy et al., 2014), where
it was used to craft specific adversarial examples.
Adversarial examples are input data manipulated to
cause a machine learning model to misclassify it.
While the perturbations are usually indiscernible to
the human eye, they lead the model to drastically in-
correct outputs. The discovery of adversarial exam-
ples highlighted the vulnerability of machine learn-
ing models, even when they achieve high accuracy on
test data. Since its discovery, attack and defense have
been popular topics in machine learning. Notable at-
tacks on deep neural networks include the Fast Gradi-
ent Sign Method (FGSM) (Goodfellow et al., 2015),
the DeepFool (Moosavi-Dezfooli et al., 2016), and
the Carlini and Wagner attack (Carlini and Wagner,
2017). Defenses include adversarial training (Car-
lini and Wagner, 2017), adversarial examples detec-
tion (Papernot et al., 2016), and adversarial robustness
certification (Sinha et al., 2020). Though these methods were first shown to be effective in supervised deep learning, the same techniques can be applied to reinforcement learning. Small
perturbations in the observation can lead to an unde-
sired action, significantly affecting the agent’s perfor-
mance. In 2017, Huang et al. showed that the DQN
agent is vulnerable to the same attack applied to neu-
ral networks, such as the FGSM attack (Huang et al.,
2017). Similar defenses, such as adversarial training,
are also shown to be effective in the field of rein-
forcement learning (Pattanaik et al., 2017). Lately,
more research has been done to increase the robust-
ness of reinforcement learning agents in the context
of AV (He et al., 2023; Buddareddygari et al., 2022).
2.3 Black-Box Attacks
The exploration of adversarial attacks has, for the
most part, been rooted in first-order optimization
methods. These methods, while powerful, often
necessitate the availability of gradient information,
making them impractical for real-world scenarios
where such information may not always be acces-
sible. The quest for gradient-free alternatives pre-
dates the recent strides in deep learning and adver-
sarial machine learning. Traditional methods such as
COBYLA and various Bayesian optimization tech-
niques have been investigated extensively. How-
ever, these methods have demonstrated scalability
limitations in dealing with modern, complex models
that exhibit an ever-increasing dimensionality (Pow-
ell, 1994; Shahriari et al., 2016).
Zeroth-order (ZO) optimization methods have
emerged as a promising alternative, offering effi-
ciency in computational resources while maintaining
a competitive convergence rate (Liu et al., 2020). A
surge of interest in recent years has led to the devel-
opment of several ZO optimization techniques, in-
cluding but not limited to Zeroth Order Stochastic
Gradient Descent (ZOSGD), ZO-SignSGD, and ZO-
ADMM (Liu et al., 2020). The appeal of these tech-
niques lies in their ability to operate without explicit
gradient information, thus bridging the gap between
the theoretical world of optimization and the prag-
matic constraints of real-world applications.
In this paper, we focus mainly on the ZO-
SignSGD method. Unlike other methods that use ex-
act estimated gradient values, ZO-SignSGD utilizes
the sign of the gradient to update the model parame-
ters. This feature provides both computational advan-
tages and practical feasibility, allowing the perturba-
tion to converge in a relatively small number of iterations (Liu et al., 2019). We venture into an underex-
plored area by employing ZO-SignSGD as a tool to
study adversarial attacks on DRL algorithms in AVs,
potentially expanding the understanding and applica-
tion of black-box attacks in real-world scenarios.
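To make the sign-only update concrete, the following sketch (our own illustration, not code from the cited works) contrasts the standard zeroth-order gradient estimate with the ZO-SignSGD step on a toy quadratic objective; the smoothing radius mu, the number of random directions q, and the learning rate are illustrative choices.

import numpy as np

def zo_grad_estimate(f, x, mu=0.01, q=20, rng=np.random.default_rng(0)):
    # Standard zeroth-order gradient estimator: average q two-point
    # finite-difference estimates along random unit directions.
    d = x.size
    g_hat = np.zeros(d)
    for _ in range(q):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)
        g_hat += (d * (f(x + mu * u) - f(x)) / mu) * u
    return g_hat / q

def zo_signsgd_step(f, x, lr=0.05, **kwargs):
    # ZO-SignSGD: move against the *sign* of the estimate, not its value.
    return x - lr * np.sign(zo_grad_estimate(f, x, **kwargs))

# Toy example: minimize ||x - 1||^2 using only function evaluations.
f = lambda x: np.sum((x - 1.0) ** 2)
x = np.zeros(4)
for _ in range(100):
    x = zo_signsgd_step(f, x)
print(x)  # hovers within one step size of [1, 1, 1, 1]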
Figure 1: An example render of the highway lane changing
environment.
3 PROBLEM
This paper focuses on applying, attacking, and de-
fending a highway lane-changing deep Q network
(DQN) agent and a Deep Deterministic Policy Gradi-
ent (DDPG) agent. In this paper, we assume that the
vehicle has been compromised without detection, al-
lowing the adversary to access and manipulate sensor
data, thereby altering the states perceived by the DRL
agent. Given the increasing adoption of detection al-
gorithms for common attacks, adversarial machine-
learning strategies are employed to maximize damage
to DRL agents. These strategies introduce minimal
perturbations to maintain stealth and reduce the like-
lihood of detection during the attack. It’s worth not-
ing that this paper does not delve into the specifics of
the vehicle’s attack surface or penetration methods.
To ensure the performance of the unattacked agent,
the reinforcement learning algorithms are based on
Stable Baselines 3, an online RL library written in
Python (Raffin et al., 2021). The training environment
is based on the ’highway-env’ library to allow faster
deployment, hyperparameter tuning, and debugging,
as seen in Figure 1 (Leurent, 2018).
3.1 Environment
The environment is a lightweight highway lane-
changing environment compatible with the OpenAI
gym interface (Leurent, 2018). To expedite the train-
ing process, the environment is configured with fewer than 30 cars.
The environment’s observation space tracks the
vehicle’s kinematics on the highway. That includes
the position and velocity of the ego vehicle. It also
records the relative position and velocity of other ve-
hicles on the highway. The observation space is nor-
malized relative to the ego vehicle. The position is normalized with the bound of [−100, 100], and the velocity is normalized with the bound of [0, 20].
During initialization, all vehicles, including the
ego vehicle, are randomly positioned on the highway,
ensuring a minimum separation between them. Vehi-
cles, excluding the ego vehicle, adhere to a randomly
initialized Intelligent Driver Model and the Minimiz-
ing Overall Braking Induced by Lane change (MO-
BIL) model (Leurent, 2018).
For DQN, the environment’s action space is a
high-level discrete action space with five actions. The
actions are defined as:
Action 0: change lane to the left
Action 1: idle (do nothing)
Action 2: change lane to the right
Action 3: accelerate
Action 4: decelerate
Simple proportional controllers handle the lower-level quantities such as heading, velocity, and acceleration when each action is chosen, so the rein-
forcement learning agent only needs to make a high-
level decision on which action to take.
For DDPG, the action is a continuous action space
for the kinematics of the ego vehicle with a dimension
of 2. The first dimension is the acceleration of the
ego vehicle, and the second dimension is the steering
angle of the ego vehicle. Both actions are normalized to [−1, 1].
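For illustration, the sketch below shows how such an environment might be set up with the highway-env library; the configuration keys follow highway-env's config dictionary, but the specific values (and the exact registration/configuration API, which varies across library versions) are assumptions rather than the settings used in this work.

import gymnasium as gym
import highway_env  # noqa: F401  (registers the highway-v0 environment)

config = {
    "observation": {
        "type": "Kinematics",                    # positions/velocities of nearby vehicles
        "features": ["presence", "x", "y", "vx", "vy"],
        "normalize": True,                       # features scaled relative to the ego vehicle
    },
    "action": {"type": "DiscreteMetaAction"},    # switch to "ContinuousAction" for DDPG
    "lanes_count": 4,
    "vehicles_count": 25,                        # fewer than 30 cars to speed up training
    "duration": 30,                              # episode length in simulation seconds
}

env = gym.make("highway-v0")
env.unwrapped.configure(config)                  # older versions: env.configure(config)
obs, info = env.reset()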
4 POLICIES
The reward function rewards the agent for staying in
the right lane at a faster speed while penalizing the
agent for collision. The reward function is defined as:
$$ R(s, a) = \mathrm{RightLaneReward} + 0.4 \cdot \frac{v - v_{\min}}{v_{\max} - v_{\min}} + \mathrm{collision} \tag{1} $$
The collision reward is set to −1 so that the agent seeks to move faster while avoiding collisions. RightLaneReward is set to 0.1 when the agent is traveling in the right-most lane.
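As a plain restatement of Equation 1, the helper below (our own illustrative code, not the library's reward implementation) computes the per-step reward; the speed bounds are placeholders.

def step_reward(speed, v_min, v_max, in_right_lane, crashed):
    # Equation 1: right-lane bonus + scaled speed term + collision penalty.
    right_lane_reward = 0.1 if in_right_lane else 0.0
    speed_reward = 0.4 * (speed - v_min) / (v_max - v_min)
    collision_reward = -1.0 if crashed else 0.0
    return right_lane_reward + speed_reward + collision_reward

# Example: 28 m/s in the right lane with bounds [20, 30] m/s and no crash:
# 0.1 + 0.4 * 0.8 + 0.0 = 0.42
print(step_reward(28, 20, 30, True, False))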
4.1 DQN Policy
The agent is trained with the DQN algorithm (Mnih
et al., 2013). Similar to Q-learning, the underlying structure of the model is a Markov Decision Process, and the value update follows Equation 2.
$$ Q^{\mathrm{new}}(s_t, a_t) = Q(s_t, a_t) + \alpha \left( r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) \tag{2} $$

Q(s_t, a_t): Q value of the current state and action
α: learning rate
r_t: reward of the current state and action
γ: discount factor
s: state
a: action
However, for deep Q learning, the Q function is
approximated by a neural network. The neural net-
work is trained with the DQN algorithm. This algo-
rithm uses a replay buffer to store the experience of
the agent. The replay buffer samples a batch of ex-
periences to train the neural network. The DQN loss
function is defined as:
$$ L(\theta) = \mathbb{E}_{(s, a, r, s') \sim U(D)} \left[ \left( r + \gamma \cdot \max_{a'} Q(s', a'; \theta') - Q(s, a; \theta) \right)^2 \right] \tag{3} $$

θ: the parameters of the neural network
θ′: the parameters of the target network
D: the replay buffer
Like Q-learning, the loss is formed from the reward plus the discounted maximum Q value of the next state minus the current Q value, but in this case gradient descent minimizes the loss function. The network is trained with the Adam optimizer (Kingma and Ba, 2017). For a more straightforward implementation, Stable Baselines 3 is used to train the model (Raffin et al., 2021).
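A minimal Stable Baselines 3 training sketch is shown below; apart from the 20,000-step budget reported in Section 6.2, the hyperparameters are left at library defaults, and env is assumed to be the configured highway environment with the discrete action space.

from stable_baselines3 import DQN

# Train a DQN agent on the configured highway environment (SB3 defaults).
model = DQN("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=20_000)
model.save("dqn_highway")

# Greedy rollout of the trained policy.
obs, info = env.reset()
done = truncated = False
while not (done or truncated):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, truncated, info = env.step(action)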
4.2 DDPG Policy
The agent is trained with the Stable Baselines 3 implementation of the DDPG algorithm (Lillicrap et al., 2019; Raffin et al., 2021). In contrast to DQN,
which only deals with discrete action spaces, DDPG
allows the handling of continuous action spaces, mak-
ing it particularly suitable for AVs, where actions are
often continuous, like acceleration and steering an-
gle. Another main difference is that DDPG combines
the actor-critic approach with insights from Deep Q-
Networks (DQN). The actor in this setup is responsi-
ble for determining the best action given the current
state, while the critic evaluates the chosen action’s
quality. As seen in Equation 4, the actor updates in
the direction that maximizes the Q value of the cur-
rent state and action. On the other hand, the critic is
updated based on the Temporal Difference (TD) er-
ror, which is the difference between the critic’s cur-
rent estimate of the Q-value and the improved esti-
mate yielded by the latest action from the actor. This
process, similar to Q-learning, involves the use of a
learning rate to balance the weight between the old
and new estimates:
$$ \nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \Big|_{s = s_i,\, a = \mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \Big|_{s_i} \tag{4} $$
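For completeness, a corresponding Stable Baselines 3 sketch for DDPG is shown below, assuming the environment has been reconfigured with the continuous action space of Section 3.1; the Gaussian exploration noise mirrors the noisy exploration mentioned in Section 6.2, and its scale is an assumption.

import numpy as np
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

# Assumes env uses {"action": {"type": "ContinuousAction"}}, i.e. a 2-D
# action [acceleration, steering] normalized to [-1, 1].
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions),
                                 sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=120_000)   # longer budget, as reported in Section 6.2
model.save("ddpg_highway")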
5 ZEROTH ORDER ATTACK
This work uses zeroth-order optimization as an adversarial attack.
5.1 Attacker Model
The attacker model defines the attacker’s opportunity,
intent, and capability to place the work in context.
The attacker has the opportunity to influence the sys-
tem during operation at the level of sensor outputs.
The attacker’s short-term goal is to disrupt the agent
to the point that the agent will cause a collision on
the road. The capabilities of the attacker consist of
the following: (1) the attacker can influence a sensor
value up to a 10% deviation of its actual value, (2)
an attacker can influence sensor values immediately
(i.e., from the start of the vehicle until a shutdown),
continuously (i.e., no cool down periods), and indefi-
nitely (i.e., for as long as the attacker wants), and (3)
the attacker can influence any sensor, and all sensors at the same time (i.e., there is no assumption that any single sensor in the set provides its actual value).
5.2 Zeroth Order SignSGD
The ZO-SignSGD method is a gradient-free (zeroth
order) optimization method that uses the sign of the
gradient to update the model (Liu et al., 2019).
Algorithm 1 shows our implementation of the ZO-SignSGD attack for lane changing. The input vari-
ables, such as learning rate, initial value, and num-
ber of iterations, are tweaked to ensure a fast conver-
gence while minimizing the perturbation size. As a
black box optimization algorithm, the first step is to
estimate the gradient. A gradient of a function can
be estimated by adding a small perturbation to the in-
put data. For a high-dimension function such as the
neural network used in the DRL, the gradient must
be computed by summing all estimated gradients over
perturbations of random directions. To achieve a fast
convergence of the algorithm, similar to the Fast Gra-
dient Sign Method, only the sign of the gradient is
used. This also avoids the error introduced by the
numerical value of the estimated gradient. The per-
turbation is then calculated by multiplying the sign
of the gradient with the learning rate to minimize the
objective function seen in Equation 8. This process
is repeated until it reaches the maximum number of
iterations. The perturbed observation is then used to
get the action from the policy, thus, continuing into
the next step.
Data: ZO-SignSGD
Input: learning rate {δ_k}, initial value x_0, and number of iterations T

def GradEstimate(x, µ, q, d):
    ĝ = 0
    for k = 1, 2, ..., q do
        u = normalized(random vector);
        ĝ = ĝ + (d · (f(x + µu) − f(x)) / µ) · u;
    end
    return ĝ

def Optimization(x):
    for k = 1, 2, ..., T do
        ĝ_k = GradEstimate(x_k);
        x_{k+1} = x_k − δ_k · sign(ĝ_k);
    end
    return x_T

def Main:
    for i in range of timesteps do
        while not done do
            action = model.predict(observation);
            env.step(action);
            perturbed_obs = Optimization(observation);
            observation = perturbed_obs;
        end
    end

Algorithm 1: Implementation of ZO-SignSGD for Lane Keeping. An adversarial observation is calculated for each step.
Using this algorithm, the specific objective for this
attack is crafted with two losses in mind. The first loss
is the distance between the target action and the origi-
nal action, which can be seen in Equation 5 for DQN and Equation 6 for DDPG. In both, "a" is a constant that controls the weight of the loss.
$$ L1_{\mathrm{DQN}} = a \cdot \mathrm{norm}\big( Q(x + \delta, y_{\mathrm{target}}; \theta) - Q(x + \delta, y; \theta) \big) \tag{5} $$

$$ L1_{\mathrm{DDPG}} = a \cdot \mathrm{norm}\big( \mathrm{action}(x + \delta; \theta) - \mathrm{action}(x; \theta) \big) \tag{6} $$
The second loss is the distortion caused by the per-
turbation, as seen in Equation 7. This calculates the
distance between the original observation and the per-
turbed observation.
$$ L2 = \mathrm{norm}\big( (\mathrm{perturbed\_obs} - \mathrm{original\_obs})^2 \big) \tag{7} $$

$$ \mathrm{Objective:} \quad \min(L1 + L2) \tag{8} $$
When crafting the perturbation, both losses are
added together to minimize both during the optimiza-
tion. For a successful attack, both losses will converge
and be minimized. Figure 2 shows an example of this
convergence. Since Zeroth Order SignSGD is not a
constrained optimization method, the perturbation is
not guaranteed to be small. At the same time, since it
is a gradient-free stochastic optimization method, the
attack can fail to converge within the given number of
iterations.
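To make Algorithm 1 and the objective of Equation 8 concrete, the sketch below implements one attack step against a continuous-action (DDPG-style) policy in plain NumPy, using Loss1 from Equation 6 and Loss2 from Equation 7; the loss weight a, smoothing radius, query budget, learning rate, and the way the policy is queried are illustrative assumptions rather than the exact settings used in our experiments.

import numpy as np

def attack_step(policy_action, obs, target_action,
                a=1.0, lr=0.02, mu=0.01, q=20, iters=100, rng=None):
    # Craft a perturbed observation that pushes the policy toward
    # target_action (Eq. 6) while keeping the distortion small (Eq. 7).
    # policy_action maps an observation to the policy's action, e.g.
    #   lambda o: model.predict(o, deterministic=True)[0]  for an SB3 model.
    rng = rng or np.random.default_rng()
    d = obs.size
    target_action = np.asarray(target_action)

    def objective(x_flat):
        perturbed = x_flat.reshape(obs.shape)
        loss1 = a * np.linalg.norm(policy_action(perturbed) - target_action)
        loss2 = np.linalg.norm((perturbed - obs) ** 2)
        return loss1 + loss2                                   # Eq. 8

    x = obs.flatten().astype(np.float64)
    for _ in range(iters):
        g_hat = np.zeros(d)
        f_x = objective(x)
        for _ in range(q):                                     # GradEstimate
            u = rng.standard_normal(d)
            u /= np.linalg.norm(u)
            g_hat += (d * (objective(x + mu * u) - f_x) / mu) * u
        x = x - lr * np.sign(g_hat)                            # sign-only update
    return x.reshape(obs.shape)

A rollout under attack then replaces each observation with attack_step(...) before querying the policy, mirroring the Main loop of Algorithm 1.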
6 APPROACH & EVALUATION
6.1 Adversarial Training
As seen in many adversarial machine learning papers,
adversarial training is a method to train a model to
be robust to adversarial attacks (Carlini and Wagner,
2017; Pattanaik et al., 2017). A general procedure can be seen in Algorithm 2.
Data: Adversarial-Training
for i = 1, 2, ..., timestep do
    attack the observation Q(obs, a, θ);
    obs′ = ZO-SignSGD(Q(obs, a, θ));
    a′ = Q(obs′, a, θ);
    new_obs, reward = env(a′, s);
    Train policy as per DQN or DDPG algorithm;
end

Algorithm 2: Adversarial training of the policy.
The algorithm is based on the methods discussed
in (Pattanaik et al., 2017). Though it may appear simple, this algorithm has proven successful against crafted perturbations for both deep learning and DRL models.
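One way to realize Algorithm 2 with Stable Baselines 3 is to wrap the environment so that every observation returned to the learner is perturbed by the attack first; the sketch below assumes the attack_step helper sketched at the end of Section 5, the env and model handles from the earlier sketches, and an illustrative target action, with the extra step budget following Section 6.4.

import gymnasium as gym
import numpy as np

class AdversarialObservationWrapper(gym.ObservationWrapper):
    # Perturb every observation with the ZO-SignSGD attack before the agent
    # sees it, so continued training (Algorithm 2) uses adversarial states.
    def __init__(self, env, policy_action, target_action, **attack_kwargs):
        super().__init__(env)
        self.policy_action = policy_action
        self.target_action = np.asarray(target_action, dtype=np.float32)
        self.attack_kwargs = attack_kwargs

    def observation(self, obs):
        perturbed = attack_step(self.policy_action,
                                np.asarray(obs, dtype=np.float32),
                                self.target_action, **self.attack_kwargs)
        return perturbed.astype(np.float32)

# Continue training the attacked DDPG policy on perturbed observations,
# e.g. for the extra 10,000 steps used in Section 6.4.
adv_env = AdversarialObservationWrapper(
    env, lambda o: model.predict(o, deterministic=True)[0],
    target_action=[1.0, 0.5])
model.set_env(adv_env)
model.learn(total_timesteps=10_000)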
6.2 Initial Training
The DQN agent is first trained for 20,000 time steps with the default hyperparameters included in Stable Baselines 3. The model learns the environment and achieves a mean reward of 310.58 per episode. For most episodes, the policy can navigate the highway for the entire episode without fail.
The DDPG model is trained for 120,000 time steps; the larger budget allows the actor and critic to converge on this lane-changing task. Due to the continuous action-space control by the DDPG algorithm, the reward is not as stable as that of DQN. However,
the model can still achieve a mean reward of 232.78
per episode. Note that DDPG is trained with noise
added to the action space as part of the exploration
strategy.
Figure 2: Loss convergence for successful perturbation for both Loss1 and Loss2.

The maximum reward obtainable by an agent per episode would be 450, assuming it never crashes, is
always on the right-most lane, and always travels at
high speed. However, such theoretical maximum re-
ward is unobtainable as the agent must slow down to
change lanes and avoid crashes. At the same time, as long as other cars are in the right-most lane, the agent is unlikely to obtain the full reward for the step. Get-
ting a higher reward also requires the agent to have
a constant heading going forward. This has proven
to be difficult to maintain for a DDPG agent where
the policy has control of the steering. Therefore, both
agents performed relatively well in this task.
6.3 Attacks
The idea of the attack is to mimic a real-world sce-
nario where the attacker has access to the vehicle’s
sensors, enabling them to craft perturbations to the
observation space. The perturbation is generated at every step with its size taken into consideration. The maximum number of iterations per step for perturba-
tion crafting is set to 100. However, most success-
ful perturbations are created within 50 iterations. An
example of the loss for a successful perturbation is
shown in Figure 2. The perturbation converges under
100 iterations. As the iterations go on, the distortion
becomes the focus of the optimization program and,
as a result, shrinks with iterations. The targeted ac-
tion for the DDPG policy is set to be [1, 0.5], meaning
full throttle and turning right. The attack successfully
causes the model to turn right most of the time, caus-
ing the mean reward per episode to plunge. For DQN,
since the action space is high level, the targeted attack
is chosen to be “accelerate” (Action 3) to prevent the
ego vehicle from changing lanes at all.
Figure 3: Result of the perturbation. The ego vehicle hits another car.

The attack is unconstrained with the loss function defined in Equation 5. However, the size of the perturbation is directly correlated with the parameters of ZO-SignSGD. The larger the step each iteration of the gradi-
ent takes, the bigger the perturbation. Therefore, the parameters are carefully tuned so that the attack converges within 100 iterations while keeping the distortion within 0.2, representing the 10% deviation limit from the original value. Small perturbations created by adversarial
machine learning like this may help the attack avoid
possible detection. Since ZO SignSGD can minimize
both Loss1 and Loss2 as defined in Equations 5 and 7, as the iterations grow, the distortion can be minimized if the parameters are tuned appropriately. This trend can already be seen in Figure 2.
6.4 Defenses
Employing the approach outlined in Algorithm 2, we
further trained the DQN and DDPG policies that were
initially compromised by the adversarial attack for an
extra 5,000 and 10,000 time steps, respectively, this
time incorporating adversarial observations into the
learning process. The progression of the reward per
episode during this adversarial training phase can be
visualized in Figure 4 and Figure 5.
Despite the initial attack, both policies exhibited
a marginal increase in the reward per episode fol-
lowing adversarial training, suggesting some degree
of learned resilience against adversarial manipula-
tion. However, it is noteworthy that this increase was
rather insubstantial, particularly in the case of DQN.
Moreover, despite maintaining identical training pa-
rameters, the mean reward per episode for both poli-
cies during adversarial training was lower than that
achieved during the initial training phase.
This decline in performance is likely to be at-
tributed to overfitting to the perturbed observations.
The model’s parameters have essentially learned to
respond specifically to the adversarial patterns in
the observations, thereby diminishing its performance
under normal conditions. This is particularly evident
in Figure 5, where the reward per episode shows a de-
clining trend, a classic indicator of overfitting.
Figure 4: Reward/Episode for adversarial training of DDPG.

Figure 5: Reward/Episode for adversarial training of DQN.

The trade-off between robustness and performance in adversarial settings is a well-documented challenge in machine learning literature. A seminal 2019 paper elucidated this issue by demonstrating worsened generalization performance of deep learn-
ing networks under adversarial training (Raghunathan
et al., 2019). More recently, in 2022, potential ex-
planations for this trade-off were proposed, such as
the lower utility of robust features for generaliza-
tion tasks or the insufficiency of datasets for adver-
sarial training (Clarysse et al., 2022). This dilemma
is manifested in our experiment, wherein the adver-
sarially trained policies underperformed compared to
their non-adversarially trained counterparts. This un-
derscores the complexity of designing reinforcement
learning policies that are both robust to adversarial at-
tacks and proficient at their designated tasks.
6.5 Results
A table of mean rewards for each model is shown in
Table 1. The highest obtainable reward per episode
is 450, with one reward per step if the agent reaches
high speed, stays in the right lane, and does not crash.
The environment has completely random vehicle lay-
outs every single time. Therefore, the agent would not be able to obtain the full 450 reward, as it is often required to slow down or change lanes out of the right lane to avoid a collision or to maintain the high-speed reward.
Table 1: Mean reward of each model with the maximum theoretical reward of 450.

Scenario                    | DQN Reward [% of theoretic max] | DDPG Reward [% of theoretic max]
Initial Training            | 310.58 [69.01%]                 | 232.78 [51.73%]
Under Attack                | 108.22 [24.89%]                 | 45.44 [10.09%]
Adversarial Training        | 111.08 [22.78%]                 | 62.40 [13.87%]
After Training (No Attack)  | 149.33 [33.18%]                 | 91.22 [20.27%]
The reward is calculated by finding the mean of re-
wards for 100 episodes. The environment calculates
the reward, as shown in Equation 1. For Initial Train-
ing and After Training, no attacks are performed on
the observation. The rewards are collected to show the
generalized performance of the model. For Under At-
tack and Adversarial Training, the attack is performed
on the observation to force the policy into performing
only one action, if the perturbation converges within
the given number of iterations.
As seen in Table 1 in the scenario Initial Training,
the agents performed relatively well, setting up a good
baseline performance of the policies. However, both
policies fail to perform as the attacks take place in
Under Attack. Some marginal performance gain is
seen after the policies undergo Adversarial Training, but overall generalized performance suffers, as seen in After Training.
7 CONCLUSION AND FUTURE
WORK
In this work, we harnessed the ZO-SignSGD method
to craft perturbations capable of triggering the fail-
ure of trained reinforcement learning models. Re-
markably, these attacks were successfully carried out
on both DQN and DDPG models by introducing per-
turbations to the observation space, even without ac-
cess to the actual gradient information of the mod-
els. While the untouched models achieved high rewards (approximately 310 and 230 per episode, respectively), the targeted attacks significantly disrupted the performance of the ego vehicle, forcing it to follow the attacker's actions and causing the reward to plummet. This underscores the vulnerability of reinforcement learning models to adversarial attacks even when the attacker lacks detailed model information.
In response to these successful attacks, we trained
the models using these adversarial examples to en-
hance their robustness. Both models demonstrated an
increased resilience, improving their rewards in the
face of adversarial observations. However, it’s impor-
tant to note that adversarial training proved to be a
time-intensive process, and the resulting models un-
derperformed their original versions. This trade-off,
where adversarial training dampens a model’s gener-
alization performance, mirrors findings observed in
other machine learning applications (Clarysse et al.,
2022).
Our adversarial attacks, while effective, are not
yet optimized. Future work could draw inspiration
from the adversarial attack strategies in the broader
machine learning field, potentially leading to stronger
and more efficient attacks. This could involve, for in-
stance, targeting keyframes during the vehicles’ oper-
ation. Moreover, testing the transferability of adver-
sarial examples across different models could provide
critical insights into the vulnerability of autonomous
vehicles, particularly since deep reinforcement learn-
ing models often perform identical tasks. To prevent
fast and catastrophic perturbations by attackers, it will
be crucial to test these examples in real-world scenar-
ios.
As demonstrated in this paper, adversarial train-
ing is not a panacea for these adversarial threats. It
may cause an unexpected loss of rewards if the model
adapts too much to the adversarial observations. Given the requirement for autonomous vehicles to function flawlessly under all circumstances, further investigation into other defensive measures is imperative to build more robust and secure systems. These could in-
clude intrusion detection systems, model distillation,
and model verification. Each of these could poten-
tially contribute to a more comprehensive solution,
mitigating the risks of adversarial attacks.
REFERENCES
Boloor, A., Garimella, K., He, X., Gill, C., Vorobeychik, Y.,
and Zhang, X. (2020). Attacking vision-based percep-
tion in end-to-end autonomous driving models. Jour-
nal of systems architecture, 110:101766–.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J.,
Schulman, J., Tang, J., and Zaremba, W. (2016). Ope-
nAI Gym. arXiv 1606.01540.
Buddareddygari, P., Zhang, T., Yang, Y., and Ren, Y.
(2022). Targeted Attack on Deep RL-based Au-
tonomous Driving with Learned Visual Patterns. In
2022 International Conference on Robotics and Au-
tomation (ICRA), pages 10571–10577.
Cao, Y., Xiao, C., Cyr, B., Zhou, Y., Park, W., Rampazzi,
S., Chen, Q. A., Fu, K., and Mao, Z. M. (2019). Ad-
versarial Sensor Attack on LiDAR-Based Perception
in Autonomous Driving. In Proceedings of the 2019
ACM SIGSAC Conference on Computer and Commu-
nications Security, CCS ’19, page 2267–2281, New
York, NY, USA. Association for Computing Machin-
ery.
Carlini, N. and Wagner, D. (2017). Towards Evaluating the
Robustness of Neural Networks. In 2017 IEEE Sym-
posium on Security and Privacy (SP), pages 39–57.
Chebotar, Y., Hausman, K., Lu, Y., Xiao, T., Kalashnikov,
D., Varley, J., Irpan, A., Eysenbach, B., Julian, R.,
Finn, C., and Levine, S. (2021). Actionable Mod-
els: Unsupervised Offline Reinforcement Learning of
Robotic Skills. arXiv 2104.07749.
Clarysse, J., Hörrmann, J., and Yang, F. (2022). Why
adversarial training can hurt robust accuracy. arXiv
2203.02006.
Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Ex-
plaining and Harnessing Adversarial Examples. arXiv
1412.6572.
He, X., Yang, H., Hu, Z., and Lv, C. (2023). Robust Lane
Change Decision Making for Autonomous Vehicles:
An Observation Adversarial Reinforcement Learning
Approach. IEEE Transactions on Intelligent Vehicles,
8(1):184–193.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and
Abbeel, P. (2017). Adversarial Attacks on Neural Net-
work Policies. arXiv 1702.02284.
Isele, D., Nakhaei, A., and Fujimura, K. (2018). Safe Re-
inforcement Learning on Autonomous Vehicles. In
2018 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS), pages 1–6.
Kalashnikov, D., Varley, J., Chebotar, Y., Swanson, B., Jon-
schkowski, R., Finn, C., Levine, S., and Hausman, K.
(2021). MT-Opt: Continuous Multi-Task Robotic Re-
inforcement Learning at Scale. arXiv 2104.08212.
Kingma, D. P. and Ba, J. (2017). Adam: A Method for
Stochastic Optimization. arXiv 1412.6980.
Leurent, E. (2018). An Environment for Autonomous Driv-
ing Decision-Making. GitHub.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D., and Wierstra, D. (2019). Contin-
uous control with deep reinforcement learning. arXiv
1509.02971.
Liu, S., Chen, P.-Y., Chen, X., and Hong, M. (2019).
signSGD via Zeroth-Order Oracle. In International
Conference on Learning Representations.
Liu, S., Chen, P.-Y., Kailkhura, B., Zhang, G., Hero III,
A. O., and Varshney, P. K. (2020). A Primer on
Zeroth-Order Optimization in Signal Processing and
Machine Learning: Principals, Recent Advances, and
Applications. IEEE Signal Processing Magazine,
37(5):43–54.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing Atari with Deep Reinforcement
Learning. arXiv 1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., Petersen, S.,
Beattie, C., Sadik, A., Antonoglou, I., King, H., Ku-
maran, D., Wierstra, D., Legg, S., and Hassabis, D.
(2015). Human-level control through deep reinforce-
ment learning. Nature, 518(7540):529–533.
Moosavi-Dezfooli, S.-M., Fawzi, A., and Frossard, P.
(2016). Deepfool: a simple and accurate method to
fool deep neural networks. In Proceedings of the IEEE
conference on computer vision and pattern recogni-
tion, pages 2574–2582.
Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami,
A. (2016). Distillation as a Defense to Adversarial
Perturbations against Deep Neural Networks. arXiv
1511.04508.
Pattanaik, A., Tang, Z., Liu, S., Bommannan, G., and
Chowdhary, G. (2017). Robust Deep Reinforce-
ment Learning with Adversarial Attacks. arXiv
1712.03632.
Powell, M. J. D. (1994). A Direct Search Optimiza-
tion Method That Models the Objective and Con-
straint Functions by Linear Interpolation, pages 51–
67. Springer Netherlands, Dordrecht.
Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus,
M., and Dormann, N. (2021). Stable-Baselines3: Reli-
able Reinforcement Learning Implementations. Jour-
nal of Machine Learning Research, 22(268):1–8.
Raghunathan, A., Xie, S. M., Yang, F., Duchi, J. C., and
Liang, P. (2019). Adversarial Training Can Hurt Gen-
eralization. arXiv 1906.06032.
Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and
de Freitas, N. (2016). Taking the Human Out of the
Loop: A Review of Bayesian Optimization. Proceed-
ings of the IEEE, 104(1):148–175.
Sinha, A., Namkoong, H., Volpi, R., and Duchi, J. (2020).
Certifying Some Distributional Robustness with Prin-
cipled Adversarial Training. arXiv 1710.10571.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,
D., Goodfellow, I., and Fergus, R. (2014). Intriguing
properties of neural networks. arXiv 1312.6199.