Multi-Environment Training Against Reward Poisoning Attacks on Deep
Reinforcement Learning
Myria Bouhaddi and Kamel Adi
Computer Security Research Laboratory, University of Quebec in Outaouais, Gatineau, Quebec, Canada
Keywords:
Deep Reinforcement Learning, Adversarial Attacks, Reward Poisoning Attacks, Optimal Defense Policy,
Multi-Environment Training.
Abstract:
Our research tackles the critical challenge of defending against poisoning attacks in deep reinforcement learn-
ing, which have significant cybersecurity implications. These attacks involve subtle manipulation of rewards,
leading the attacker’s policy to appear optimal under the poisoned rewards, thus compromising the integrity
and reliability of such systems. Our goal is to develop robust agents resistant to manipulations. We propose
an optimization framework with a multi-environment setting, which enhances resilience and generalization.
By exposing agents to diverse environments, we mitigate the impact of poisoning attacks. Additionally, we
employ a variance-based method to detect reward manipulation effectively. Leveraging this information, our
optimization framework derives a defense policy that fortifies agents against attacks, bolstering their resistance
to reward manipulation.
1 INTRODUCTION
Reinforcement Learning (RL) has garnered signif-
icant attention in recent years due to its remark-
able ability to solve complex decision-making prob-
lems through continuous agent-environment interac-
tion, leading to the development of optimal action se-
lection policies (Sutton and Barto, 2018). Deep Rein-
forcement Learning (DRL), an amalgamation of rein-
forcement learning and deep learning, has emerged as
a powerful tool for handling high-dimensional state
spaces and complex task selection policies. Sev-
eral DRL algorithms, including Deep Q-Networks
(DQN) (Mnih et al., 2015), Trust Region Policy Opti-
mization (TRPO) (Schulman et al., 2015), and Asyn-
chronous Advantage Actor-Critic (A3C) (Greydanus
et al., 2018), have been developed to efficiently tackle
challenging real-world problems.
DRL has made significant contributions in diverse
fields, including robotics, healthcare, and finance.
In robotics, DRL enables the development of au-
tonomous robots capable of learning tasks like grasp-
ing, walking, and manipulation. In healthcare, it opti-
mizes treatment plans for patients with chronic con-
ditions by leveraging patient data, improving treat-
ment outcomes. In finance, DRL aids in designing
automated trading systems that make intelligent real-
time decisions based on market data. These exam-
ples highlight the wide-ranging applicability of DRL
in addressing complex real-world problems.
However, the security of DRL systems has be-
come a critical concern, as they are vulnerable to ad-
versarial attacks (Kiran et al., 2021; Behzadan and
Munir, 2017a). Even small perturbations can sig-
nificantly impact performance (Zhang et al., 2021b),
and attacks on one policy can be transferred to oth-
ers (Huang et al., 2017). Poisoning attacks, specif-
ically, manipulate reward signals during the learning
process, thereby influencing the behavior of the agent.
Compromised DRL systems pose risks such as eco-
nomic losses, injuries, and even potential loss of life,
especially in critical domains like autonomous cars
and drones.
While security challenges in supervised and un-
supervised learning have been extensively studied
(Akhtar and Mian, 2018), the security implications of
DRL demand significant attention. Ensuring robust-
ness against attacks is crucial for the safe deployment
of DRL systems in critical applications. Addressing
these security challenges is paramount for successful
real-world implementation.
This paper focuses on the problem of reward
poisoning in DRL, where attackers alter rewards to
manipulate the agent’s policy, necessitating the de-
velopment of defense mechanisms. We propose
a robust RL algorithm that can detect and defend
against reward tampering. Our approach involves
training agents in diverse environments to minimize
the impact of poisoning, employing variance-based
techniques for detection, and enhancing resilience
through adversarial training. Our main contributions
include a rigorous formulation of the poisoning at-
tack as an optimization problem, providing insights
into the attacker’s objectives and enabling exploration
of strategies for detection and mitigation. Addition-
ally, we propose a novel approach that mitigates re-
ward manipulation attacks, leveraging multiple envi-
ronments, variance-based detection, and adversarial
training. Our experimental results demonstrate the
effectiveness of our approach in enhancing the ro-
bustness of agent-based systems against adversarial
attacks.
2 RELATED WORK
Attacks Against Reinforcement Learning. Rein-
forcement learning (RL) is susceptible to various
types of attacks, with evasion attacks and model poi-
soning attacks being two prominent categories exten-
sively studied in deep RL (Huang et al., 2017; Kos
and Song, 2017; Lin et al., 2017). Evasion attacks aim
to induce undesirable behavior in trained policies by
finding adversarial examples, while model poisoning
attacks manipulate the reward signal during RL train-
ing to induce sub-optimal policies. These attacks have
significant implications in real-world applications, in-
cluding the manipulation of pre-trained RL models
downloaded by agents.
Previous research has investigated reward poison-
ing in both batch and online RL settings. In batch RL,
attackers can easily modify pre-collected rewards,
while online RL poses a greater challenge as rewards
need to be modified on-the-fly. Although reward poi-
soning in online RL has been studied using multi-
armed bandits, our focus is on black-box attacks that
can target any efficient RL algorithm.
Furthermore, studies have explored reward poi-
soning in the white-box setting, where attackers have
complete knowledge of the underlying Markov deci-
sion process (MDP) or learning algorithm. These at-
tacks involve manipulating the reward function using
adversarial rewards based on the state and action, in-
dependent of the learning process. Notably, (Zhang
et al., 2020) developed an adaptive attack that lever-
ages the victim’s Q-table, significantly accelerating
the attack process.
In contrast to observation perturbation attacks
that alter the agent’s environment observation during
training without changing the actual state or reward
(Behzadan and Munir, 2017b; Inkawhich et al., 2019),
our poisoning attacks directly modify the actual re-
ward or state of the environment. This differentiation
highlights the distinct nature and potential impact of
reward manipulation attacks in RL.
Defenses Against Poisoning Attacks. In order to
ensure the security of DRL policy training, defense
mechanisms are employed to protect against poison-
ing attacks. The importance of robustness cannot
be overstated, as it guarantees the functionality of
the system even in the presence of disturbances (Be-
hzadan and Munir, 2017b). Defenses against poison-
ing attacks can generally be classified into two cate-
gories: (1) studies that provide theoretical guarantees
for learning under perturbations (Banihashem et al.,
2021; Lykouris et al., 2021; Chen et al., 2021; Wei
et al., 2022; Zhang et al., 2021a; Wu et al., 2022),
and (2) empirical approaches that evaluate the ro-
bustness of the system through practical experiments
(Behzadan and Munir, 2017b; Behzadan and Munir,
2018; Wang et al., 2020).
However, it is important to note that designing ro-
bust DRL algorithms often comes at a cost, as it may
compromise the overall performance of the learned
policies. Achieving complete robustness is challeng-
ing, especially considering the evolving strategies em-
ployed by attackers. Hence, relying solely on robust-
ness measures may prove inadequate in ensuring the
secure learning of DRL policies. It is essential to
explore additional measures and techniques that can
enhance the security and reliability of DRL systems
against poisoning attacks.
In this context, our work focuses on addressing
the issue of data poisoning in reinforcement learning,
particularly the manipulation of reward signals to in-
fluence policy. We aim to propose innovative solu-
tions that effectively protect DRL policies from such
attacks. While robustness is crucial, we acknowledge
its potential impact on policy performance. There-
fore, we propose a lightweight approach that en-
hances protection without compromising the overall
performance of the system. By complementing ro-
bustness measures, we aim to strengthen the security
of DRL learning and enhance the reliability of the
learned policies.
3 PRELIMINARY
In this section, we will outline the essentials of deep
reinforcement learning, including its key components
and underlying principles.
Deep Reinforcement Learning. In reinforcement
learning (RL), an agent learns an optimal behavior by
sequentially interacting with an environment, modeled as a Markov Decision Process (MDP), to achieve its objectives through trial and error. The MDP is defined as a tuple $M = (S, A, P, R, \gamma, \sigma)$, where $S$ and $A$ are the state and action spaces, respectively, $P$ is the transition dynamics determining the probability distribution of the next state given the current state and action, $R$ is the reward function mapping state-action pairs to scalar rewards, $\gamma$ is a discount factor that weighs immediate against future rewards, and $\sigma$ is the initial distribution over states. The training process consists of multiple episodes, each initialized with a state sampled from $\sigma$. The agent interacts with the environment at each timestep until the episode ends. We assume every episode comprises $T$ distinct timesteps and that $S$ and $A$ are finite, discrete sets.
The agent interacts with the environment sequentially, starting from an initial state $s_0$ drawn from the distribution $\sigma$ and selecting actions according to a policy $\pi$. Policies can be stochastic, denoted $\pi(a|s)$, mapping states to action probabilities, or deterministic, denoted $\pi(s)$. The set of all policies is $\Pi$, and the set of deterministic policies is $\Pi_{det}$.
At each timestep the agent transitions to a new state $s_{t+1}$ according to $P$ and receives a reward $r_{(s_t,a_t)}$ that reflects the quality of its decision; this interaction generates a trajectory $\mathcal{T}$ of state-action-reward triplets. At each time step, the agent also updates its Q-table, which stores the estimated values of state-action pairs.
In reinforcement learning, the cumulative reward or return is the total reward an agent receives over time. It is computed as the sum of discounted rewards at each timestep, using a discount factor $\gamma \in [0,1]$ that balances the importance of immediate and future rewards:
$$CR = \sum_{t=0}^{T} \gamma^{t} R(s_t, a_t).$$
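For concreteness, the following minimal Python sketch computes this discounted return for a single episode (the function name and example values are ours and purely illustrative):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of per-step rewards discounted by gamma^t over one episode."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# toy example: three timesteps with rewards 1.0, 0.0, 2.0 and gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```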
We define the state value $V^{\pi}(s)$ of a policy $\pi$ as the expected total return $CR$ from state $s$ under policy $\pi$. It is a function $V^{\pi}: S \to \mathbb{R}$ given by $V^{\pi}(s) = \mathbb{E}[CR \mid s_t = s]$, where the expectation accounts for the stochastic environment transitions.
The state-action value function $Q^{\pi}(s,a)$, also known as the Q-function, extends the state-value function $V^{\pi}(s)$ to state-action pairs. It represents the expected return $CR$ obtained by starting in state $s$, taking action $a$, and thereafter following policy $\pi$. The agent aims to find an optimal policy $\pi^{*}$ that maximizes the expected return from all states, given by $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s,a)$.
The policy score $\rho^{\pi}$ quantifies the overall quality of a policy $\pi$ based on the expected rewards obtained by following the policy over an extended horizon. It is calculated by considering all possible actions from each state using the Q-values, and is expressed as
$$\rho^{\pi} = \mathbb{E}\Big[(1-\gamma)\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, \pi, \sigma\Big],$$
where the initial state $s_0$ is sampled from the initial state distribution $\sigma$ and subsequent states $s_t$ are obtained by executing policy $\pi$ in the MDP. The score reflects the expected total discounted return, normalized by the factor $1-\gamma$.
Deep Reinforcement Learning (DRL) combines
deep learning and RL to tackle challenges in learn-
ing control policies from high-dimensional raw input
data and large state and action spaces. The policy π in
DRL is represented by a deep neural network with pa-
rameters Θ. Various DRL algorithms, including Deep
Q-Network (DQN), Trust Region Policy Optimiza-
tion (TRPO), and Asynchronous Advantage Actor-
Critic (A3C), aim to optimize the policy network by
maximizing the expected return.
In Deep Q-learning, the Q-values for actions are
approximated based on states, enabling the agent to
select the action with the highest Q-value to maxi-
mize its reward. This approach has shown success
in domains like Go and Atari games.
The policy gradient algorithm directly parameterizes the policy as $\pi_{\theta}(s,a)$, which takes the state as input and outputs a distribution over actions. By maximizing the expected total discounted reward, represented by the objective function $J(\theta)$, the optimal parameters $\theta$ and the corresponding policy are obtained. The gradient of this objective can be expressed as the expected product of the gradient of the log-policy and the action-value function of the Markov Decision Process:
$$\nabla_{\theta} J(\theta) = \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi_{\theta}}(s,a)\big].$$
Since the true action-value function $Q^{\pi_{\theta}}$ is unknown, the policy gradient algorithm approximates it with $Q_{\omega}(s,a)$, a deep neural network learned alongside the policy network. This allows the Policy Gradient Theorem to be applied, facilitating the computation of the policy gradient.
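As an illustration of the gradient above, the sketch below performs one tabular, actor-critic-style update, ascending $\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q_{\omega}(s,a)$ for a softmax policy over discrete states and actions. It is a simplification for exposition only: in DRL both $\pi_{\theta}$ and $Q_{\omega}$ are deep networks, and the names and toy values here are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(theta, q_omega, s, a, lr=0.01):
    """One update: ascend grad log pi_theta(a|s) * Q_omega(s, a).

    theta:   (n_states, n_actions) logits of a tabular softmax policy
    q_omega: (n_states, n_actions) critic estimate of Q(s, a)
    """
    probs = softmax(theta[s])                  # pi_theta(.|s)
    grad_log = -probs                          # gradient of log pi(a|s) w.r.t. theta[s, :]
    grad_log[a] += 1.0                         # equals one_hot(a) - probs
    theta[s] += lr * grad_log * q_omega[s, a]  # gradient ascent on the sampled estimate
    return theta

# toy usage: 2 states, 3 actions
theta = np.zeros((2, 3))
q_omega = np.array([[0.0, 1.0, 0.5], [0.2, 0.0, 0.9]])
theta = policy_gradient_step(theta, q_omega, s=0, a=1)
```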
4 REWARD POISONING
ATTACKS AGAINST DRL
The goals of attacks against machine learning models
in agent-based reinforcement learning often involve
manipulating the policies of the agents to align them
with a specific target policy. This is accomplished
by strategically modifying the agent’s reward func-
tion. In our study, we adopt an attack formulation in-
spired by previous works such as (Ma et al., 2019) and
(Rakhsha et al., 2020). By leveraging these existing
approaches, we aim to develop a comprehensive un-
derstanding of attack strategies and their implications
in the context of agent-based reinforcement learning.
4.1 Threat Model
Our paper tackles the problem of reward poisoning
in a white-box setting, where the attacker has com-
plete knowledge of the agent’s Markov Decision Pro-
cess (MDP) environment and Deep Q-Learning algo-
rithm, except for its future randomness. The attacker sits between the environment and the agent, which allows them to carry out white-box attacks with ease. The motivation for considering a white-box scenario is to develop a defense mechanism that can counter the most challenging and sophisticated attacks. By assuming maximum knowledge for the attacker, our goal is a defense approach that remains resilient even in the worst case, where the attacker strategically manipulates reward signals to deceive and redirect the agent's policy (Bouhaddi et al., 2018).
At each time step $t$, the attacker observes the agent's policy network $\pi_{\theta}(s,a)$, the current state $s_t$, the agent's action $a_t$, the resulting new state $s_{t+1}$, and the received reward $r_t$. The attacker's profile is defined as $\xi = (\pi^{\dagger}, \upsilon, \Delta_t)$, where $\pi^{\dagger}$ is the target policy, $\upsilon$ limits the number of reward manipulations within an episode of length $L$, and $\Delta_t$ constrains the perturbation added to the reward at step $t$. These limitations ensure attack effectiveness, avoid detection, and prevent excessive disruption of the agent's learning process. By carefully controlling the perturbations, the attacker can target specific rewards, achieving their objectives while maintaining a low profile.
The target policy is defined as a function from the state space to the set of action subsets, $\pi^{\dagger}: S \to 2^{|A|}$, $\pi^{\dagger}(s) \subseteq A$, which specifies the set of actions desired by the attacker in state $s$. The attacker can focus on certain states more than others, since these states trigger the particular actions the attacker desires. Thus the set of target states $S^{\dagger}$ can be defined as $S^{\dagger} = \{s \in S : \pi^{\dagger}(s) \neq \pi^{*}(s)\}$, with $\pi^{*}(s)$ the optimal policy of the agent, i.e., the actions it would have chosen in the absence of reward perturbations. Given a target state set $S^{\dagger} \subseteq S$, the target policy is denoted as:
$$\pi^{\dagger}_{\theta}(s) = \begin{cases} a^{\dagger} & \text{if } s \in S^{\dagger} \\ \pi_{\theta_i}(s) & \text{otherwise,} \end{cases}$$
where $a^{\dagger}$ is the target action desired by the attacker, $\pi_{\theta_i}(s)$ is the victim's actual policy, and $\pi^{\dagger}_{\theta}(s)$ is the partial target policy, which is better suited to large-scale state spaces, whether discrete or continuous, than a complete target policy that defines desired actions in all states.
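A minimal sketch of this partial target policy (the identifiers are ours and purely illustrative):

```python
def target_policy(s, target_states, a_target, victim_policy):
    """Partial target policy: force the target action in target states,
    defer to the victim's own policy everywhere else."""
    return a_target if s in target_states else victim_policy(s)
```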
The attacker can introduce a perturbation $\delta_t \in \mathbb{R}$ to the reward associated with the current state-action pair, $r(s_t, a_t)$. For simplicity, we write $r_t$ instead of $r_{(s_t,a_t)}$. As a result, the reward perceived by the agent at time step $t$ is $r_t + \delta_t$. We assume that the attack is limited in the infinity norm, which we refer to as a limited per-step perturbation: $|\delta_t| \leq \Delta_t$ for every time step $t$. In other words, we impose two constraints on the added perturbation: it must be neither too large nor too frequent.
The attacker’s objective is to discover the best se-
quence of perturbations to incite the agent to adopt
the target policy while reducing the number of rounds
during which the agent becomes aware of the attack.
This involves minimizing the agent’s disagreement
with the target policy, so as not to raise suspicions
and maintain the success of the attack.
Therefore, the attacker's problem is modeled as the following optimization problem:
$$\min_{\delta_t} \; d(r_t,\, r_t + \delta_t) \qquad (1)$$
$$\text{s.t.} \quad \rho^{\pi^{\dagger}} \geq \rho^{\pi}, \quad \forall \pi \in \Pi_{det} \setminus \{\pi^{\dagger}\} \qquad (2)$$
$$|\delta_t| \leq \Delta_t, \quad \forall t \qquad (3)$$
$$\sum_{t} \mathbb{1}\big[Q_t \notin Q^{\dagger}\big] \leq \upsilon \qquad (4)$$
where $d$ is the Euclidean distance between the true reward and the altered reward. The attacker's problem thus reduces to finding the minimum perturbation that leads the agent to adopt the target policy while avoiding detection, by limiting the number of interventions, the amount of perturbation added per time step, and the number of times the agent deviates from $Q^{\dagger}$.
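The sketch below illustrates how such a constrained attacker might apply one reward perturbation, clipping it to the per-step bound of constraint (3) and tracking an intervention budget. Note that constraint (4) in our formulation bounds deviations from $Q^{\dagger}$; for readability the sketch simplifies this to counting non-zero perturbations, and all names are illustrative.

```python
import numpy as np

def poison_reward(r_t, delta_t, step_bound, interventions_used, budget):
    """Apply one bounded reward perturbation under simplified attack constraints.

    |delta_t| <= step_bound           (per-step limit, constraint (3))
    number of interventions <= budget (simplified stand-in for constraint (4))
    """
    if interventions_used >= budget:
        return r_t, interventions_used              # budget exhausted: leave the reward untouched
    delta_t = float(np.clip(delta_t, -step_bound, step_bound))
    if delta_t != 0.0:
        interventions_used += 1
    return r_t + delta_t, interventions_used

# toy usage: perturb a reward of 1.0 by -2.5 under a per-step bound of 1.0
poisoned, used = poison_reward(1.0, -2.5, step_bound=1.0, interventions_used=0, budget=5)
print(poisoned, used)  # -> 0.0 1
```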
5 PROPOSED DEFENSE
MECHANISM USING
MULTI-ENVIRONMENT
TRAINING
We propose a defense approach to mitigate poison-
ing attacks in deep reinforcement learning (DRL) set-
tings. Our approach utilizes multi-environment train-
ing, where an agent interacts randomly with multiple
environments, each with different transition probabil-
ities. This enables the agent to learn a more robust
policy that is less affected by perturbations in the re-
ward signal introduced by the attacker.
Our approach offers significant advantages over
existing methods. By exposing the agent to diverse
environments, it enhances the agent’s experience and
leads to a more robust policy that can handle new
and unseen situations. Furthermore, our approach is
computationally efficient, requiring minimal modifi-
cations to the existing RL training pipeline.
In our proposed defense approach, we use a generalized environment $G = (E, \mu)$, consisting of a set of environments $E$ and a distribution $\mu$ over these environments. Each environment $e_i \in E$ is modeled as a Markov Decision Process (MDP). This allows us to evaluate the agent's performance across different environments sampled from $E$ according to $\mu$. Training the agent in multiple environments with different reward structures increases its robustness to reward poisoning attacks.
We adopt the multi-environment training tech-
nique, where the agent interacts with a randomly sam-
pled environment from E according to the probability
distribution µ. The attacker, aware of the training en-
vironment, may attempt to poison the rewards. How-
ever, they must balance their actions to avoid detec-
tion. This limits the attacker’s ability to perturb all
rewards in all environments, reducing the overall im-
pact of the poisoning.
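One possible realization of this training scheme, assuming Gymnasium-style environments with discrete state and action spaces, is sketched below; it keeps one Q-table per environment so that the per-environment policies can later be combined as in Eqs. (5)-(7). The function and parameter names are ours.

```python
import numpy as np

def multi_env_q_learning(envs, mu, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Train one Q-table per environment, sampling environments according to mu.

    envs: list of Gymnasium-style environments with discrete observation/action spaces
    mu:   sampling probabilities over envs (the distribution of the generalized env G)
    """
    n_s = envs[0].observation_space.n
    n_a = envs[0].action_space.n
    q_tables = [np.zeros((n_s, n_a)) for _ in envs]

    for _ in range(episodes):
        i = np.random.choice(len(envs), p=mu)    # sample an environment e_i ~ mu
        env, Q = envs[i], q_tables[i]
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # standard Q-learning update inside the sampled environment
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not terminated) - Q[s, a])
            s = s2
    return q_tables
```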
To compute the agent's policy, we introduce the weighted policy $\pi_w$, which combines the policies learned in each environment $e_i$ according to their probabilities $\mu_i$:
$$\pi_w = \sum_{i} \mu_i \cdot \pi_i \qquad (5)$$
Similarly, we compute the weighted Q-values $Q_w(s,a)$ by combining the Q-values $Q_i(s,a)$ obtained in each environment:
$$Q_w(s,a) = \sum_{i} \mu_i \cdot Q_i(s,a) \qquad (6)$$
Finally, the average policy $\pi_{avg}(s)$ is obtained by taking the argmax over the weighted Q-values:
$$\pi_{avg}(s) = \arg\max_{a} Q_w(s,a) \qquad (7)$$
This computation allows the agent to select ac-
tions that are robust to reward poisoning attacks by
considering the variability of rewards across environ-
ments.
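The aggregation of Eqs. (6) and (7) can be implemented directly on the per-environment Q-tables, as in the following sketch (identifiers and toy values are illustrative):

```python
import numpy as np

def weighted_policy(q_tables, mu):
    """Combine per-environment Q-tables as in Eqs. (6)-(7).

    Q_w(s, a) = sum_i mu_i * Q_i(s, a);  pi_avg(s) = argmax_a Q_w(s, a)
    """
    mu = np.asarray(mu, dtype=float)
    q_w = sum(m * q for m, q in zip(mu, q_tables))   # weighted Q-values, Eq. (6)
    pi_avg = np.argmax(q_w, axis=1)                  # greedy average policy, Eq. (7)
    return q_w, pi_avg

# toy usage with two 3-state, 2-action Q-tables
q1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
q2 = np.array([[0.0, 2.0], [1.0, 0.0], [0.5, 0.0]])
q_w, pi_avg = weighted_policy([q1, q2], mu=[0.5, 0.5])
print(pi_avg)  # [1 0 0]
```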
Our defense approach using multi-environment
training provides a more robust training environment
that mitigates the impact of poisoning attacks. It en-
courages the agent to learn a generalized policy ef-
fective across multiple environments. Additionally, it
can be combined with other defense mechanisms to
further enhance the agent’s resilience against poison-
ing attacks.
To detect reward poisoning, we employ a
variance-based technique. We compare the observed
rewards to an expected value under an unpoisoned re-
ward signal. By computing the variance of observed
rewards across multiple environments, we can deter-
mine if the agent has been subject to reward poison-
ing.
The variance-based technique assumes that the attacker cannot poison all rewards in all environments, so the variance is higher under a poisoned reward signal than under an unpoisoned one. We calculate the reward variance $Var(R)$ from the observed rewards $R_i$ and the mean reward $\bar{R}$ across all environments:
$$Var(R) = \frac{1}{n-1} \sum_{i=1}^{n} (R_i - \bar{R})^2 \qquad (8)$$
where $n$ is the number of environments.
To compare the observed rewards to their expected unpoisoned values, we calculate the variance of the unpoisoned rewards across environments:
$$Var(U) = \frac{1}{n-1} \sum_{i=1}^{n} (U_i - \bar{U})^2 \qquad (9)$$
where $U_i$ is the expected unpoisoned reward in the $i$-th environment, $\bar{U}$ is the mean expected unpoisoned reward across all environments, and $n$ is the number of training environments.
If the ratio of the observed reward variance to the expected unpoisoned reward variance exceeds a threshold $h$, we conclude that the agent has been subjected to reward poisoning:
$$\frac{Var(R)}{Var(U)} > h \qquad (10)$$
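A compact sketch of this detection test, using sample variances as in Eqs. (8) and (9); the threshold value in the example is arbitrary and purely illustrative:

```python
import numpy as np

def poisoning_detected(observed_rewards, expected_rewards, h=2.0):
    """Flag reward poisoning when Var(R)/Var(U) exceeds the threshold h (Eq. (10)).

    observed_rewards: observed rewards per environment, R_1..R_n
    expected_rewards: expected unpoisoned rewards per environment, U_1..U_n
    """
    var_r = np.var(observed_rewards, ddof=1)   # Eq. (8), sample variance
    var_u = np.var(expected_rewards, ddof=1)   # Eq. (9)
    return bool(var_r / var_u > h)

# toy usage over n = 4 environments
print(poisoning_detected([1.0, 5.0, 0.5, 6.0], [2.0, 2.5, 1.8, 2.2], h=2.0))  # True
```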
Our defense mechanism involves the environments $e_i \in E$, the agent, and the attacker. We propose Algorithm 1 to implement our approach, which uses multi-environment training to defend against reward poisoning attacks. By following this algorithm, the agent learns a generalized policy across multiple environments, enhancing its resistance to reward poisoning attacks. The algorithm also imposes constraints on the attacker, making it more difficult for them to poison all rewards of all actions in all environments without detection.
6 CONCLUSION
This work addresses reward poisoning attacks in deep
reinforcement learning by introducing a novel de-
fense mechanism. We propose training agents in a
multi-environment setting, randomly selecting envi-
ronments for agent interaction. By averaging rewards
across multiple environments, our approach effec-
tively mitigates the impact of poisoning and enhances
agent robustness. Our method ensures the preserva-
tion of true reward performance while providing prov-
able guarantees for defense policy effectiveness, en-
suring safety and reliability in critical applications.
This contribution represents a significant advance-
ment in the development of robust and secure deep
reinforcement learning systems for real-world scenar-
ios. Future goals include conducting experiments to
compare our approach with existing defenses, vali-
dating its effectiveness and practicality, and leverag-
ing the insights gained to further enhance our defense
mechanism.
REFERENCES
Akhtar, N. and Mian, A. (2018). Threat of adversarial at-
tacks on deep learning in computer vision: A survey.
IEEE Access, 6:14410–14430.
Banihashem, K., Singla, A., and Radanovic, G. (2021). De-
fense against reward poisoning attacks in reinforce-
ment learning. arXiv preprint arXiv:2102.05776.
Behzadan, V. and Munir, A. (2017a). Vulnerability of deep
reinforcement learning to policy induction attacks. In
Machine Learning and Data Mining in Pattern Recog-
nition: 13th International Conference, MLDM 2017,
New York, NY, USA, July 15-20, 2017, Proceedings
13, pages 262–275. Springer.
Behzadan, V. and Munir, A. (2017b). Whatever does not kill
deep reinforcement learning, makes it stronger. arXiv
preprint arXiv:1712.09344.
Behzadan, V. and Munir, A. (2018). Mitigation of pol-
icy manipulation attacks on deep q-networks with
parameter-space noise. In Computer Safety, Reliabil-
ity, and Security: SAFECOMP 2018, Västerås, Sweden, September 18, 2018, Proceedings 37, pages 406–417. Springer.
Bouhaddi, M., Radjef, M. S., and Adi, K. (2018). An effi-
cient intrusion detection in resource-constrained mo-
bile ad-hoc networks. Computers & Security, 76:156–
177.
Chen, Y., Du, S., and Jamieson, K. (2021). Improved cor-
ruption robust algorithms for episodic reinforcement
learning. In International Conference on Machine
Learning, pages 1561–1570. PMLR.
Greydanus, S., Koul, A., Dodge, J., and Fern, A. (2018). Vi-
sualizing and understanding atari agents. In Interna-
tional conference on machine learning, pages 1792–
1801. PMLR.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and
Abbeel, P. (2017). Adversarial attacks on neural net-
work policies. arXiv preprint arXiv:1702.02284.
Inkawhich, M., Chen, Y., and Li, H. (2019). Snooping at-
tacks on deep reinforcement learning. arXiv preprint
arXiv:1905.11832.
Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab,
A. A., Yogamani, S., and Pérez, P. (2021). Deep rein-
forcement learning for autonomous driving: A survey.
IEEE Transactions on Intelligent Transportation Sys-
tems, 23(6):4909–4926.
Kos, J. and Song, D. (2017). Delving into adver-
sarial attacks on deep policies. arXiv preprint
arXiv:1705.06452.
Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-
Y., and Sun, M. (2017). Tactics of adversarial attack
on deep reinforcement learning agents. arXiv preprint
arXiv:1703.06748.
Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W.
(2021). Corruption-robust exploration in episodic re-
inforcement learning. In Conference on Learning The-
ory, pages 3242–3245. PMLR.
Ma, Y., Zhang, X., Sun, W., and Zhu, J. (2019). Policy
poisoning in batch reinforcement learning and control.
Advances in Neural Information Processing Systems,
32.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,
Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-
level control through deep reinforcement learning. Nature, 518(7540):529–533.
Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and
Singla, A. (2020). Policy teaching via environment
poisoning: Training-time adversarial attacks against
reinforcement learning. In International Conference
on Machine Learning, pages 7974–7984. PMLR.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and
Moritz, P. (2015). Trust region policy optimization. In
International conference on machine learning, pages
1889–1897. PMLR.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Wang, J., Liu, Y., and Li, B. (2020). Reinforcement learning
with perturbed rewards. In Proceedings of the AAAI
conference on artificial intelligence, volume 34, pages
6202–6209.
Wei, C.-Y., Dann, C., and Zimmert, J. (2022). A model se-
lection approach for corruption robust reinforcement
learning. In International Conference on Algorithmic
Learning Theory, pages 1043–1096. PMLR.
Wu, F., Li, L., Xu, C., Zhang, H., Kailkhura, B., Ken-
thapadi, K., Zhao, D., and Li, B. (2022). Copa:
Certifying robust policies for offline reinforcement
learning against poisoning attacks. arXiv preprint
arXiv:2203.08398.
Zhang, H., Chen, H., Boning, D., and Hsieh, C.-J. (2021a).
Robust reinforcement learning on state observations
with learned optimal adversary. arXiv preprint
arXiv:2101.08452.
Zhang, X., Ma, Y., Singla, A., and Zhu, X. (2020). Adaptive
reward-poisoning attacks against reinforcement learn-
ing. In International Conference on Machine Learn-
ing, pages 11225–11234. PMLR.
Zhang, Z., Lim, B., and Zohren, S. (2021b). Deep learn-
ing for market by order data. Applied Mathematical
Finance, 28(1):79–95.