Multi-Environment Training Against Reward Poisoning Attacks on Deep
Reinforcement Learning
Myria Bouhaddi and Kamel Adi
Computer Security Research Laboratory, University of Quebec in Outaouais, Gatineau, Quebec, Canada
Keywords:
Deep Reinforcement Learning, Adversarial Attacks, Reward Poisoning Attacks, Optimal Defense Policy,
Multi-Environment Training.
Abstract:
Our research tackles the critical challenge of defending against poisoning attacks in deep reinforcement learn-
ing, which have significant cybersecurity implications. These attacks involve subtle manipulation of rewards,
leading the attacker’s policy to appear optimal under the poisoned rewards, thus compromising the integrity
and reliability of such systems. Our goal is to develop robust agents resistant to manipulations. We propose
an optimization framework with a multi-environment setting, which enhances resilience and generalization.
By exposing agents to diverse environments, we mitigate the impact of poisoning attacks. Additionally, we
employ a variance-based method to detect reward manipulation effectively. Leveraging this information, our
optimization framework derives a defense policy that fortifies agents against attacks, bolstering their resistance
to reward manipulation.
1 INTRODUCTION
Reinforcement Learning (RL) has garnered signif-
icant attention in recent years due to its remark-
able ability to solve complex decision-making prob-
lems through continuous agent-environment interac-
tion, leading to the development of optimal action se-
lection policies (Sutton and Barto, 2018). Deep Rein-
forcement Learning (DRL), an amalgamation of rein-
forcement learning and deep learning, has emerged as
a powerful tool for handling high-dimensional state
spaces and complex task selection policies. Sev-
eral DRL algorithms, including Deep Q-Networks
(DQN) (Mnih et al., 2015), Trust Region Policy Opti-
mization (TRPO) (Schulman et al., 2015), and Asyn-
chronous Advantage Actor-Critic (A3C) (Greydanus
et al., 2018), have been developed to efficiently tackle
challenging real-world problems.
DRL has made significant contributions in diverse
fields, including robotics, healthcare, and finance.
In robotics, DRL enables the development of au-
tonomous robots capable of learning tasks like grasp-
ing, walking, and manipulation. In healthcare, it opti-
mizes treatment plans for patients with chronic con-
ditions by leveraging patient data, improving treat-
ment outcomes. In finance, DRL aids in designing
automated trading systems that make intelligent real-
time decisions based on market data. These exam-
ples highlight the wide-ranging applicability of DRL
in addressing complex real-world problems.
However, the security of DRL systems has be-
come a critical concern, as they are vulnerable to ad-
versarial attacks (Kiran et al., 2021; Behzadan and
Munir, 2017a). Even small perturbations can sig-
nificantly impact performance (Zhang et al., 2021b),
and attacks on one policy can be transferred to oth-
ers (Huang et al., 2017). Poisoning attacks, specif-
ically, manipulate reward signals during the learning
process, thereby influencing the behavior of the agent.
Compromised DRL systems pose risks such as eco-
nomic losses, injuries, and even potential loss of life,
especially in critical domains like autonomous cars
and drones.
While security challenges in supervised and un-
supervised learning have been extensively studied
(Akhtar and Mian, 2018), the security implications of
DRL demand significant attention. Ensuring robust-
ness against attacks is crucial for the safe deployment
of DRL systems in critical applications. Addressing
these security challenges is paramount for successful
real-world implementation.
This paper focuses on the problem of reward
poisoning in DRL, where attackers alter rewards to
manipulate the agent’s policy, necessitating the de-
velopment of defense mechanisms. We propose
a robust RL algorithm that can detect and defend
against reward tampering. Our approach involves
training agents in diverse environments to minimize
the impact of poisoning, employing variance-based
techniques for detection, and enhancing resilience
through adversarial training. Our main contributions
include a rigorous formulation of the poisoning at-
tack as an optimization problem, providing insights
into the attacker’s objectives and enabling exploration
of strategies for detection and mitigation. Addition-
ally, we propose a novel approach that mitigates re-
ward manipulation attacks, leveraging multiple envi-
ronments, variance-based detection, and adversarial
training. Our experimental results demonstrate the
effectiveness of our approach in enhancing the ro-
bustness of agent-based systems against adversarial
attacks.
2 RELATED WORK
Attacks Against Reinforcement Learning. Rein-
forcement learning (RL) is susceptible to various
types of attacks, with evasion attacks and model poi-
soning attacks being two prominent categories exten-
sively studied in deep RL (Huang et al., 2017; Kos
and Song, 2017; Lin et al., 2017). Evasion attacks aim
to induce undesirable behavior in trained policies by
finding adversarial examples, while model poisoning
attacks manipulate the reward signal during RL train-
ing to induce sub-optimal policies. These attacks have
significant implications in real-world applications, in-
cluding the manipulation of pre-trained RL models
downloaded by agents.
Previous research has investigated reward poison-
ing in both batch and online RL settings. In batch RL,
attackers can easily modify pre-collected rewards,
while online RL poses a greater challenge as rewards
need to be modified on-the-fly. Although reward poi-
soning in online RL has been studied using multi-
armed bandits, our focus is on black-box attacks that
can target any efficient RL algorithm.
Furthermore, studies have explored reward poi-
soning in the white-box setting, where attackers have
complete knowledge of the underlying Markov deci-
sion process (MDP) or learning algorithm. These at-
tacks involve manipulating the reward function using
adversarial rewards based on the state and action, in-
dependent of the learning process. Notably, (Zhang
et al., 2020) developed an adaptive attack that lever-
ages the victim’s Q-table, significantly accelerating
the attack process.
In contrast to observation perturbation attacks
that alter the agent’s environment observation during
training without changing the actual state or reward
(Behzadan and Munir, 2017b; Inkawhich et al., 2019),
our poisoning attacks directly modify the actual re-
ward or state of the environment. This differentiation
highlights the distinct nature and potential impact of
reward manipulation attacks in RL.
Defenses Against Poisoning Attacks. In order to
ensure the security of DRL policy training, defense
mechanisms are employed to protect against poison-
ing attacks. The importance of robustness cannot
be overstated, as it guarantees the functionality of
the system even in the presence of disturbances (Be-
hzadan and Munir, 2017b). Defenses against poison-
ing attacks can generally be classified into two cate-
gories: (1) studies that provide theoretical guarantees
for learning under perturbations (Banihashem et al.,
2021; Lykouris et al., 2021; Chen et al., 2021; Wei
et al., 2022; Zhang et al., 2021a; Wu et al., 2022),
and (2) empirical approaches that evaluate the ro-
bustness of the system through practical experiments
(Behzadan and Munir, 2017b; Behzadan and Munir,
2018; Wang et al., 2020).
However, it is important to note that designing ro-
bust DRL algorithms often comes at a cost, as it may
compromise the overall performance of the learned
policies. Achieving complete robustness is challeng-
ing, especially considering the evolving strategies em-
ployed by attackers. Hence, relying solely on robust-
ness measures may prove inadequate in ensuring the
secure learning of DRL policies. It is essential to
explore additional measures and techniques that can
enhance the security and reliability of DRL systems
against poisoning attacks.
In this context, our work focuses on addressing
the issue of data poisoning in reinforcement learning,
particularly the manipulation of reward signals to in-
fluence policy. We aim to propose innovative solu-
tions that effectively protect DRL policies from such
attacks. While robustness is crucial, we acknowledge
its potential impact on policy performance. There-
fore, we propose a lightweight approach that en-
hances protection without compromising the overall
performance of the system. By complementing ro-
bustness measures, we aim to strengthen the security
of DRL learning and enhance the reliability of the
learned policies.
3 PRELIMINARY
In this section, we will outline the essentials of deep
reinforcement learning, including its key components
and underlying principles.
Deep Reinforcement Learning. In reinforcement
learning (RL), an agent learns an optimal behavior by
sequentially interacting with an environment, modeled as a Markov Decision Process (MDP), to achieve its objectives through trial and error. The MDP is defined as a tuple $M = (S, A, P, R, \gamma, \sigma)$, where $S$ and $A$ are the state and action spaces, respectively, $P$ is the transition dynamics determining the probability distribution of the next state given the current state and action, $R$ is the reward function mapping state-action pairs to scalar rewards, $\gamma$ is a discount factor that weighs immediate against future rewards, and $\sigma$ is the initial distribution over states. The training process consists of multiple episodes, each initialized with a state sampled from $\sigma$. The agent interacts with the environment at each timestep until the episode ends. We assume every episode comprises $T$ distinct timesteps and that $S$ and $A$ are finite, discrete sets.
The agent interacts with the environment sequentially, starting from an initial state $s_0$ drawn from the distribution $\sigma$ and selecting actions according to a policy $\pi$. Policies can be stochastic, denoted $\pi(a|s)$, mapping states to action probabilities, or deterministic, denoted $\pi(s)$. The set of all policies is $\Pi$, and the set of deterministic policies is $\Pi_{det}$.
At each timestep the agent transitions to a new state $s_{t+1}$ according to $P$ and receives a reward $r_{(s_t,a_t)}$ that reflects the quality of its decision; this interaction generates a trajectory $\mathcal{T}$ of state-action-reward triplets. At each time step, the agent also updates its Q-table, which stores the estimated values of state-action pairs.
In reinforcement learning, the cumulative reward or return is the total reward an agent receives over time. It is computed as the sum of discounted rewards at each timestep, using a discount factor $\gamma \in [0,1]$ that balances the importance of immediate and future rewards:
$$CR = \sum_{t=0}^{T} \gamma^{t} R(s_t, a_t).$$
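For concreteness, the following minimal Python sketch computes this discounted return for a single episode (the function name and example values are ours and purely illustrative):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Sum of per-step rewards discounted by gamma^t over one episode."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

# toy example: three timesteps with rewards 1.0, 0.0, 2.0 and gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```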
We define the state value $V^{\pi}(s)$ of a policy $\pi$ as the expected total return $CR$ from state $s$ under policy $\pi$. It is a function $V^{\pi}: S \to \mathbb{R}$ given by $V^{\pi}(s) = \mathbb{E}[CR \mid s_t = s]$, where the expectation accounts for the stochastic environment transitions.
The state-action value function $Q^{\pi}(s,a)$, also known as the Q-function, extends the state-value function $V^{\pi}(s)$ to state-action pairs. It represents the expected return $CR$ obtained by starting in state $s$, taking action $a$, and thereafter following policy $\pi$. The agent aims to find an optimal policy $\pi^{*}$ that maximizes the expected return from all states, given by $\pi^{*} = \arg\max_{\pi} Q^{\pi}(s,a)$.
The policy score $\rho^{\pi}$ quantifies the overall quality of a policy $\pi$ based on the expected rewards obtained by following the policy over an extended horizon. It is calculated by considering all possible actions from each state using the Q-values, and is expressed as
$$\rho^{\pi} = \mathbb{E}\Big[(1-\gamma)\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, \pi, \sigma\Big],$$
where the initial state $s_0$ is sampled from the initial state distribution $\sigma$ and subsequent states $s_t$ are obtained by executing policy $\pi$ in the MDP. The score reflects the expected total discounted return, normalized by the factor $1-\gamma$.
Deep Reinforcement Learning (DRL) combines
deep learning and RL to tackle challenges in learn-
ing control policies from high-dimensional raw input
data and large state and action spaces. The policy π in
DRL is represented by a deep neural network with pa-
rameters Θ. Various DRL algorithms, including Deep
Q-Network (DQN), Trust Region Policy Optimiza-
tion (TRPO), and Asynchronous Advantage Actor-
Critic (A3C), aim to optimize the policy network by
maximizing the expected return.
In Deep Q-learning, the Q-values for actions are
approximated based on states, enabling the agent to
select the action with the highest Q-value to maxi-
mize its reward. This approach has shown success
in domains like Go and Atari games.
The policy gradient algorithm directly parameterizes the policy as $\pi_{\theta}(s,a)$, which takes the state as input and outputs a distribution over actions. By maximizing the expected total discounted reward, represented by the objective function $J(\theta)$, the optimal parameters $\theta$ and the corresponding policy are obtained. The gradient of this objective can be expressed as the expected product of the gradient of the log-policy and the action-value function of the Markov Decision Process:
$$\nabla_{\theta} J(\theta) = \mathbb{E}\big[\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q^{\pi_{\theta}}(s,a)\big].$$
Since the true action-value function $Q^{\pi_{\theta}}$ is unknown, the policy gradient algorithm approximates it with $Q_{\omega}(s,a)$, a deep neural network learned alongside the policy network. This allows the Policy Gradient Theorem to be applied, facilitating the computation of the policy gradient.
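As an illustration of the gradient above, the sketch below performs one tabular, actor-critic-style update, ascending $\nabla_{\theta} \log \pi_{\theta}(a|s)\, Q_{\omega}(s,a)$ for a softmax policy over discrete states and actions. It is a simplification for exposition only: in DRL both $\pi_{\theta}$ and $Q_{\omega}$ are deep networks, and the names and toy values here are ours.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_step(theta, q_omega, s, a, lr=0.01):
    """One update: ascend grad log pi_theta(a|s) * Q_omega(s, a).

    theta:   (n_states, n_actions) logits of a tabular softmax policy
    q_omega: (n_states, n_actions) critic estimate of Q(s, a)
    """
    probs = softmax(theta[s])                  # pi_theta(.|s)
    grad_log = -probs                          # gradient of log pi(a|s) w.r.t. theta[s, :]
    grad_log[a] += 1.0                         # equals one_hot(a) - probs
    theta[s] += lr * grad_log * q_omega[s, a]  # gradient ascent on the sampled estimate
    return theta

# toy usage: 2 states, 3 actions
theta = np.zeros((2, 3))
q_omega = np.array([[0.0, 1.0, 0.5], [0.2, 0.0, 0.9]])
theta = policy_gradient_step(theta, q_omega, s=0, a=1)
```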
4 REWARD POISONING
ATTACKS AGAINST DRL
The goals of attacks against machine learning models
in agent-based reinforcement learning often involve
manipulating the policies of the agents to align them
with a specific target policy. This is accomplished
by strategically modifying the agent’s reward func-
tion. In our study, we adopt an attack formulation in-
spired by previous works such as (Ma et al., 2019) and
(Rakhsha et al., 2020). By leveraging these existing
approaches, we aim to develop a comprehensive un-
derstanding of attack strategies and their implications
in the context of agent-based reinforcement learning.
4.1 Threat Model
Our paper tackles the problem of reward poisoning
in a white-box setting, where the attacker has com-
plete knowledge of the agent’s Markov Decision Pro-
cess (MDP) environment and Deep Q-Learning algo-
rithm, except for its future randomness. The attacker sits between the environment and the agent, which allows them to carry out white-box attacks with ease. The motivation for considering a white-box scenario is to develop a defense mechanism that can counter the most challenging and sophisticated attacks. By assuming maximum knowledge for the attacker, our goal is a defense approach that remains resilient even in the worst case, where the attacker strategically manipulates reward signals to deceive and redirect the agent's policy (Bouhaddi et al., 2018).
At each time step $t$, the attacker observes the agent's policy network $\pi_{\theta}(s,a)$, the current state $s_t$, the agent's action $a_t$, the resulting new state $s_{t+1}$, and the received reward $r_t$. The attacker's profile is defined as $\xi = (\pi^{\dagger}, \upsilon, \Delta_t)$, where $\pi^{\dagger}$ is the target policy, $\upsilon$ limits the number of reward manipulations within an episode of length $L$, and $\Delta_t$ constrains the perturbation added to the reward at step $t$. These limitations ensure attack effectiveness, avoid detection, and prevent excessive disruption of the agent's learning process. By carefully controlling the perturbations, the attacker can target specific rewards, achieving their objectives while maintaining a low profile.
The target policy is defined as a function from the state space to the set of action subsets, $\pi^{\dagger}: S \to 2^{|A|}$, $\pi^{\dagger}(s) \subseteq A$, which specifies the set of actions desired by the attacker in state $s$. The attacker can focus on certain states more than others, since these states trigger the particular actions the attacker desires. Thus the set of target states $S^{\dagger}$ can be defined as $S^{\dagger} = \{s \in S : \pi^{\dagger}(s) \neq \pi^{*}(s)\}$, with $\pi^{*}(s)$ the optimal policy of the agent, i.e., the actions it would have chosen in the absence of reward perturbations. Given a target state set $S^{\dagger} \subseteq S$, the target policy is denoted as:
$$\pi^{\dagger}_{\theta}(s) = \begin{cases} a^{\dagger} & \text{if } s \in S^{\dagger} \\ \pi_{\theta_i}(s) & \text{otherwise,} \end{cases}$$
where $a^{\dagger}$ is the target action desired by the attacker, $\pi_{\theta_i}(s)$ is the victim's actual policy, and $\pi^{\dagger}_{\theta}(s)$ is the partial target policy, which is better suited to large-scale state spaces, whether discrete or continuous, than a complete target policy that defines desired actions in all states.
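A minimal sketch of this partial target policy (the identifiers are ours and purely illustrative):

```python
def target_policy(s, target_states, a_target, victim_policy):
    """Partial target policy: force the target action in target states,
    defer to the victim's own policy everywhere else."""
    return a_target if s in target_states else victim_policy(s)
```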
The attacker can introduce a perturbation $\delta_t \in \mathbb{R}$ to the reward associated with the current state-action pair, $r(s_t, a_t)$. For simplicity, we write $r_t$ instead of $r_{(s_t,a_t)}$. As a result, the reward perceived by the agent at time step $t$ is $r_t + \delta_t$. We assume that the attack is limited in the infinity norm, which we refer to as a limited per-step perturbation: $|\delta_t| \leq \Delta_t$ for every time step $t$. In other words, we impose two constraints on the added perturbation: it must be neither too large nor too frequent.
The attacker’s objective is to discover the best se-
quence of perturbations to incite the agent to adopt
the target policy while reducing the number of rounds
during which the agent becomes aware of the attack.
This involves minimizing the agent’s disagreement
with the target policy, so as not to raise suspicions
and maintain the success of the attack.
Therefore, the attacker's problem is modeled as the following optimization problem:
$$\min_{\delta_t} \; d(r_t,\, r_t + \delta_t) \qquad (1)$$
$$\text{s.t.} \quad \rho^{\pi^{\dagger}} \geq \rho^{\pi}, \quad \forall \pi \in \Pi_{det} \setminus \{\pi^{\dagger}\} \qquad (2)$$
$$|\delta_t| \leq \Delta_t, \quad \forall t \qquad (3)$$
$$\sum_{t} \mathbb{1}\big[Q_t \notin Q^{\dagger}\big] \leq \upsilon \qquad (4)$$
where $d$ is the Euclidean distance between the true reward and the altered reward. The attacker's problem thus reduces to finding the minimum perturbation that leads the agent to adopt the target policy while avoiding detection, by limiting the number of interventions, the amount of perturbation added per time step, and the number of times the agent deviates from $Q^{\dagger}$.
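The sketch below illustrates how such a constrained attacker might apply one reward perturbation, clipping it to the per-step bound of constraint (3) and tracking an intervention budget. Note that constraint (4) in our formulation bounds deviations from $Q^{\dagger}$; for readability the sketch simplifies this to counting non-zero perturbations, and all names are illustrative.

```python
import numpy as np

def poison_reward(r_t, delta_t, step_bound, interventions_used, budget):
    """Apply one bounded reward perturbation under simplified attack constraints.

    |delta_t| <= step_bound           (per-step limit, constraint (3))
    number of interventions <= budget (simplified stand-in for constraint (4))
    """
    if interventions_used >= budget:
        return r_t, interventions_used              # budget exhausted: leave the reward untouched
    delta_t = float(np.clip(delta_t, -step_bound, step_bound))
    if delta_t != 0.0:
        interventions_used += 1
    return r_t + delta_t, interventions_used

# toy usage: perturb a reward of 1.0 by -2.5 under a per-step bound of 1.0
poisoned, used = poison_reward(1.0, -2.5, step_bound=1.0, interventions_used=0, budget=5)
print(poisoned, used)  # -> 0.0 1
```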
5 PROPOSED DEFENSE
MECHANISM USING
MULTI-ENVIRONMENT
TRAINING
We propose a defense approach to mitigate poison-
ing attacks in deep reinforcement learning (DRL) set-
tings. Our approach utilizes multi-environment train-
ing, where an agent interacts randomly with multiple
environments, each with different transition probabil-
ities. This enables the agent to learn a more robust
policy that is less affected by perturbations in the re-
ward signal introduced by the attacker.
Our approach offers significant advantages over
existing methods. By exposing the agent to diverse
environments, it enhances the agent’s experience and
leads to a more robust policy that can handle new
and unseen situations. Furthermore, our approach is
computationally efficient, requiring minimal modifi-
cations to the existing RL training pipeline.
In our proposed defense approach, we use a generalized environment $G = (E, \mu)$, consisting of a set of environments $E$ and a distribution $\mu$ over these environments. Each environment $e_i \in E$ is modeled as a Markov Decision Process (MDP). This allows us to evaluate the agent's performance across different environments sampled from $E$ according to $\mu$. Training the agent in multiple environments with different reward structures increases its robustness to reward poisoning attacks.
We adopt the multi-environment training tech-
nique, where the agent interacts with a randomly sam-
pled environment from E according to the probability
distribution µ. The attacker, aware of the training en-
vironment, may attempt to poison the rewards. How-
ever, they must balance their actions to avoid detec-
tion. This limits the attacker’s ability to perturb all
rewards in all environments, reducing the overall im-
pact of the poisoning.
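One possible realization of this training scheme, assuming Gymnasium-style environments with discrete state and action spaces, is sketched below; it keeps one Q-table per environment so that the per-environment policies can later be combined as in Eqs. (5)-(7). The function and parameter names are ours.

```python
import numpy as np

def multi_env_q_learning(envs, mu, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    """Train one Q-table per environment, sampling environments according to mu.

    envs: list of Gymnasium-style environments with discrete observation/action spaces
    mu:   sampling probabilities over envs (the distribution of the generalized env G)
    """
    n_s = envs[0].observation_space.n
    n_a = envs[0].action_space.n
    q_tables = [np.zeros((n_s, n_a)) for _ in envs]

    for _ in range(episodes):
        i = np.random.choice(len(envs), p=mu)    # sample an environment e_i ~ mu
        env, Q = envs[i], q_tables[i]
        s, _ = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # standard Q-learning update inside the sampled environment
            Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) * (not terminated) - Q[s, a])
            s = s2
    return q_tables
```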
To compute the agent's policy, we introduce the weighted policy $\pi_w$, which combines the policies learned in each environment $e_i$ according to their probabilities $\mu_i$:
$$\pi_w = \sum_{i} \mu_i \cdot \pi_i \qquad (5)$$
Similarly, we compute the weighted Q-values $Q_w(s,a)$ by combining the Q-values $Q_i(s,a)$ obtained in each environment:
$$Q_w(s,a) = \sum_{i} \mu_i \cdot Q_i(s,a) \qquad (6)$$
Finally, the average policy $\pi_{avg}(s)$ is obtained by taking the argmax over the weighted Q-values:
$$\pi_{avg}(s) = \arg\max_{a} Q_w(s,a) \qquad (7)$$
This computation allows the agent to select ac-
tions that are robust to reward poisoning attacks by
considering the variability of rewards across environ-
ments.
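The aggregation of Eqs. (6) and (7) can be implemented directly on the per-environment Q-tables, as in the following sketch (identifiers and toy values are illustrative):

```python
import numpy as np

def weighted_policy(q_tables, mu):
    """Combine per-environment Q-tables as in Eqs. (6)-(7).

    Q_w(s, a) = sum_i mu_i * Q_i(s, a);  pi_avg(s) = argmax_a Q_w(s, a)
    """
    mu = np.asarray(mu, dtype=float)
    q_w = sum(m * q for m, q in zip(mu, q_tables))   # weighted Q-values, Eq. (6)
    pi_avg = np.argmax(q_w, axis=1)                  # greedy average policy, Eq. (7)
    return q_w, pi_avg

# toy usage with two 3-state, 2-action Q-tables
q1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
q2 = np.array([[0.0, 2.0], [1.0, 0.0], [0.5, 0.0]])
q_w, pi_avg = weighted_policy([q1, q2], mu=[0.5, 0.5])
print(pi_avg)  # [1 0 0]
```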
Our defense approach using multi-environment
training provides a more robust training environment
that mitigates the impact of poisoning attacks. It en-
courages the agent to learn a generalized policy ef-
fective across multiple environments. Additionally, it
can be combined with other defense mechanisms to
further enhance the agent’s resilience against poison-
ing attacks.
To detect reward poisoning, we employ a
variance-based technique. We compare the observed
rewards to an expected value under an unpoisoned re-
ward signal. By computing the variance of observed
rewards across multiple environments, we can deter-
mine if the agent has been subject to reward poison-
ing.
The variance-based technique assumes that the attacker cannot poison all rewards in all environments, so the variance is higher under a poisoned reward signal than under an unpoisoned one. We calculate the reward variance $Var(R)$ from the observed rewards $R_i$ and the mean reward $\bar{R}$ across all environments:
$$Var(R) = \frac{1}{n-1} \sum_{i=1}^{n} (R_i - \bar{R})^2 \qquad (8)$$
where $n$ is the number of environments.
To compare the observed rewards to their expected unpoisoned values, we calculate the variance of the unpoisoned rewards across environments:
$$Var(U) = \frac{1}{n-1} \sum_{i=1}^{n} (U_i - \bar{U})^2 \qquad (9)$$
where $U_i$ is the expected unpoisoned reward in the $i$-th environment, $\bar{U}$ is the mean expected unpoisoned reward across all environments, and $n$ is the number of training environments.
If the ratio of the observed reward variance to the expected unpoisoned reward variance exceeds a threshold $h$, we conclude that the agent has been subjected to reward poisoning:
$$\frac{Var(R)}{Var(U)} > h \qquad (10)$$
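A compact sketch of this detection test, using sample variances as in Eqs. (8) and (9); the threshold value in the example is arbitrary and purely illustrative:

```python
import numpy as np

def poisoning_detected(observed_rewards, expected_rewards, h=2.0):
    """Flag reward poisoning when Var(R)/Var(U) exceeds the threshold h (Eq. (10)).

    observed_rewards: observed rewards per environment, R_1..R_n
    expected_rewards: expected unpoisoned rewards per environment, U_1..U_n
    """
    var_r = np.var(observed_rewards, ddof=1)   # Eq. (8), sample variance
    var_u = np.var(expected_rewards, ddof=1)   # Eq. (9)
    return bool(var_r / var_u > h)

# toy usage over n = 4 environments
print(poisoning_detected([1.0, 5.0, 0.5, 6.0], [2.0, 2.5, 1.8, 2.2], h=2.0))  # True
```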
Our defense mechanism involves the environments $e_i \in E$, the agent, and the attacker. We propose Algorithm 1 to implement our approach, which uses multi-environment training to defend against reward poisoning attacks. By following this algorithm, the agent learns a generalized policy across multiple environments, enhancing its resistance to reward poisoning attacks. The algorithm also imposes constraints on the attacker, making it more difficult for them to poison all rewards of all actions in all environments without detection.
6 CONCLUSION
This work addresses reward poisoning attacks in deep
reinforcement learning by introducing a novel de-
fense mechanism. We propose training agents in a
multi-environment setting, randomly selecting envi-
ronments for agent interaction. By averaging rewards
across multiple environments, our approach effec-
tively mitigates the impact of poisoning and enhances
agent robustness. Our method ensures the preserva-
tion of true reward performance while providing prov-
able guarantees for defense policy effectiveness, en-
suring safety and reliability in critical applications.
This contribution represents a significant advance-
ment in the development of robust and secure deep
reinforcement learning systems for real-world scenar-
ios. Future goals include conducting experiments to
compare our approach with existing defenses, vali-
dating its effectiveness and practicality, and leverag-
ing the insights gained to further enhance our defense
mechanism.
REFERENCES
Akhtar, N. and Mian, A. (2018). Threat of adversarial at-
tacks on deep learning in computer vision: A survey.
IEEE Access, 6:14410–14430.
Banihashem, K., Singla, A., and Radanovic, G. (2021). De-
fense against reward poisoning attacks in reinforce-
ment learning. arXiv preprint arXiv:2102.05776.
Behzadan, V. and Munir, A. (2017a). Vulnerability of deep
reinforcement learning to policy induction attacks. In
Machine Learning and Data Mining in Pattern Recog-
nition: 13th International Conference, MLDM 2017,
New York, NY, USA, July 15-20, 2017, Proceedings
13, pages 262–275. Springer.
Behzadan, V. and Munir, A. (2017b). Whatever does not kill
deep reinforcement learning, makes it stronger. arXiv
preprint arXiv:1712.09344.
Behzadan, V. and Munir, A. (2018). Mitigation of pol-
icy manipulation attacks on deep q-networks with
parameter-space noise. In Computer Safety, Reliabil-
ity, and Security: SAFECOMP 2018, Västerås, Sweden, September 18, 2018, Proceedings 37, pages 406–417. Springer.
Bouhaddi, M., Radjef, M. S., and Adi, K. (2018). An effi-
cient intrusion detection in resource-constrained mo-
bile ad-hoc networks. Computers & Security, 76:156–
177.
Chen, Y., Du, S., and Jamieson, K. (2021). Improved cor-
ruption robust algorithms for episodic reinforcement
learning. In International Conference on Machine
Learning, pages 1561–1570. PMLR.
Greydanus, S., Koul, A., Dodge, J., and Fern, A. (2018). Vi-
sualizing and understanding atari agents. In Interna-
tional conference on machine learning, pages 1792–
1801. PMLR.
Huang, S., Papernot, N., Goodfellow, I., Duan, Y., and
Abbeel, P. (2017). Adversarial attacks on neural net-
work policies. arXiv preprint arXiv:1702.02284.
Inkawhich, M., Chen, Y., and Li, H. (2019). Snooping at-
tacks on deep reinforcement learning. arXiv preprint
arXiv:1905.11832.
Kiran, B. R., Sobh, I., Talpaert, V., Mannion, P., Al Sallab,
A. A., Yogamani, S., and Pérez, P. (2021). Deep rein-
forcement learning for autonomous driving: A survey.
IEEE Transactions on Intelligent Transportation Sys-
tems, 23(6):4909–4926.
Kos, J. and Song, D. (2017). Delving into adver-
sarial attacks on deep policies. arXiv preprint
arXiv:1705.06452.
Lin, Y.-C., Hong, Z.-W., Liao, Y.-H., Shih, M.-L., Liu, M.-
Y., and Sun, M. (2017). Tactics of adversarial attack
on deep reinforcement learning agents. arXiv preprint
arXiv:1703.06748.
Lykouris, T., Simchowitz, M., Slivkins, A., and Sun, W.
(2021). Corruption-robust exploration in episodic re-
inforcement learning. In Conference on Learning The-
ory, pages 3242–3245. PMLR.
Ma, Y., Zhang, X., Sun, W., and Zhu, J. (2019). Policy
poisoning in batch reinforcement learning and control.
Advances in Neural Information Processing Systems,
32.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller, M.,
Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-
level control through deep reinforcement learning. Nature, 518(7540):529–533.
Rakhsha, A., Radanovic, G., Devidze, R., Zhu, X., and
Singla, A. (2020). Policy teaching via environment
poisoning: Training-time adversarial attacks against
reinforcement learning. In International Conference
on Machine Learning, pages 7974–7984. PMLR.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and
Moritz, P. (2015). Trust region policy optimization. In
International conference on machine learning, pages
1889–1897. PMLR.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Wang, J., Liu, Y., and Li, B. (2020). Reinforcement learning
with perturbed rewards. In Proceedings of the AAAI
conference on artificial intelligence, volume 34, pages
6202–6209.
Wei, C.-Y., Dann, C., and Zimmert, J. (2022). A model se-
lection approach for corruption robust reinforcement
learning. In International Conference on Algorithmic
Learning Theory, pages 1043–1096. PMLR.
Wu, F., Li, L., Xu, C., Zhang, H., Kailkhura, B., Ken-
thapadi, K., Zhao, D., and Li, B. (2022). Copa:
Certifying robust policies for offline reinforcement
learning against poisoning attacks. arXiv preprint
arXiv:2203.08398.
Zhang, H., Chen, H., Boning, D., and Hsieh, C.-J. (2021a).
Robust reinforcement learning on state observations
with learned optimal adversary. arXiv preprint
arXiv:2101.08452.
Zhang, X., Ma, Y., Singla, A., and Zhu, X. (2020). Adaptive
reward-poisoning attacks against reinforcement learn-
ing. In International Conference on Machine Learn-
ing, pages 11225–11234. PMLR.
Zhang, Z., Lim, B., and Zohren, S. (2021b). Deep learn-
ing for market by order data. Applied Mathematical
Finance, 28(1):79–95.