an average of a number of past action-value estimates
in the update target. Consequently, the overestimation
bias and estimation variance of the algorithm are lower
than those of Q-learning. The problem remains that
the overestimation bias is never reduced to zero be-
cause the average operator is applied to a finite num-
ber of approximate action-value functions. Moreover,
this algorithm cannot control its estimation bias.
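A minimal sketch of an averaged update target of this kind is shown below; the tabular representation, the number of stored snapshots, and the function name are assumptions made purely for illustration.

```python
import numpy as np

def averaged_target(reward, next_state, past_q_tables, gamma=0.99):
    """Update target built from an average of past action-value estimates.

    past_q_tables: list of K earlier snapshots of the Q-table, each an
    array of shape (n_states, n_actions) (hypothetical setup).
    """
    # Average the K stored estimates for the next state, action by action.
    avg_q_next = np.mean([q[next_state] for q in past_q_tables], axis=0)
    # Bootstrap from the maximum of the averaged estimates.
    return reward + gamma * np.max(avg_q_next)
```

Because the average is taken over only finitely many noisy estimates, the maximum in the last step can still be positively biased, which is the limitation noted above.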
Weighted Double Q-learning (Zhang et al., 2017)
uses a weighted combination of the Q-learning and Double Q-learning estimates to compute the maximum action value of the next state in the update target. Although this algorithm can control its estimation bias, it cannot underestimate more than Double Q-learning or overestimate more than Q-learning.
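A minimal sketch of such a weighted target is given below; β is treated as a plain parameter here, whereas the original algorithm specifies how the weight is chosen, so this is only meant to convey the idea.

```python
import numpy as np

def weighted_double_target(reward, next_state, q_a, q_b, beta, gamma=0.99):
    """Blend the Q-learning and Double Q-learning estimates of the maximum
    action value of the next state (illustrative sketch).

    q_a, q_b: two Q-tables of shape (n_states, n_actions); beta in [0, 1].
    """
    a_star = np.argmax(q_a[next_state])   # greedy action under the first table
    single = q_a[next_state, a_star]      # Q-learning-style estimate
    double = q_b[next_state, a_star]      # Double Q-learning-style estimate
    return reward + gamma * (beta * single + (1.0 - beta) * double)
```

With β = 1 the target coincides with the Q-learning target and with β = 0 with the Double Q-learning target, which is why the estimation bias stays between those two extremes.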
A very recent method is Maxmin Q-learning (Lan
et al., 2020), which uses an ensemble of agents to
learn the optimal action values. In this algorithm,
a number of past sampled experiences are stored in
a replay buffer. In each step a minibatch of experi-
ences is randomly sampled from the replay buffer and
is used to update the action-value estimates of one or
more agents. For each experience in the minibatch, the action-wise minimum of the agents' action-value estimates for the next state is taken, and the maximum of these minima over the actions is used in the update target. The authors proposed this method because they identified that, depending on the reinforcement learning problem, underestimation bias may be preferable to overestimation bias or vice versa, and they showed that the estimation bias of this algorithm can be controlled by adjusting the number of agents. Although this algorithm can underestimate more than Double Q-learning, there is a limit to its underestimation and it cannot overestimate more than Q-learning.
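The core of the Maxmin target can be sketched as follows; the ensemble is represented by N tabular estimates for brevity, and the replay-buffer machinery described above is omitted.

```python
import numpy as np

def maxmin_target(reward, next_state, q_tables, gamma=0.99):
    """Maxmin-style update target (illustrative sketch).

    q_tables: list of N Q-tables of shape (n_states, n_actions),
    one per agent in the ensemble.
    """
    # Action-wise minimum over the ensemble for the next state.
    q_min = np.min([q[next_state] for q in q_tables], axis=0)
    # Bootstrap from the best action under these minimum estimates.
    return reward + gamma * np.max(q_min)
```

Increasing the number of tables makes the minimum smaller in expectation, which is how the number of agents controls the estimation bias.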
Contributions. In this paper, we propose Variation-
resistant Q-learning to control and utilize estimation
bias for better performance. We present the tabular
version of the algorithm and mathematically prove its
convergence. Furthermore, the proposed algorithm is
combined with a multilayer perceptron as function ap-
proximator and compared to Q-learning and Double
Q-learning. The empirical results on three problems with different forms of stochasticity indicate that the new method behaves as expected in practice.
Paper Outline. This paper is structured as fol-
lows. In Section 2, we present the theoretical background. In Section 3, we explain Variation-resistant Q-learning. Section 4 describes the experimental
setup and presents the results. Section 5 concludes
this paper and provides suggestions for future work.
2 THEORETICAL BACKGROUND
2.1 Reinforcement Learning
In reinforcement learning, we consider an agent that
interacts with an environment. At each point in time
the environment is in a state that the agent observes.
Every time the agent acts on the environment, the en-
vironment changes its state and provides a reward sig-
nal to the agent. The goal of the agent is to act opti-
mally in order to maximize its total reward.
A major challenge in reinforcement learning is
the exploration-exploitation dilemma. On the one
hand, the agent should exploit known actions in order
to maximize its total reward. On the other hand, the
agent should explore unknown actions in order to dis-
cover actions that are more rewarding than the ones it
already knows. To perform well, the agent must find
a balance between exploration and exploitation.
A widely used method to achieve this balance is
the ε-greedy method. When using this exploration
strategy, the agent takes a random action in a state
with probability ε. Otherwise, it takes the greedy (i.e., the most highly valued) action. The amount of exploration can be adjusted by changing the value of ε.
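As a concrete illustration, ε-greedy action selection over a tabular action-value estimate can be written as in the sketch below; the Q-table layout and the function name are assumptions for illustration.

```python
import numpy as np

def epsilon_greedy_action(q_table, state, epsilon):
    """Select an action ε-greedily from a Q-table of shape (n_states, n_actions)."""
    if np.random.random() < epsilon:
        # Explore: take a uniformly random action.
        return np.random.randint(q_table.shape[1])
    # Exploit: take the greedy (most highly valued) action.
    return int(np.argmax(q_table[state]))
```

Annealing ε from a large to a small value over the course of training is a common way to shift gradually from exploration to exploitation.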
2.2 Finite Markov Decision Processes
Many reinforcement learning problems can be math-
ematically formalized as finite Markov decision pro-
cesses. Formally, a finite Markov decision process is a
tuple $(S, A, R, p, \gamma, t)$ where $S = \{s_1, s_2, \ldots, s_n\}$ is a finite set of states, $A = \{a_1, a_2, \ldots, a_m\}$ is a finite set of actions, $R = \{r_1, r_2, \ldots, r_\kappa\}$ is a finite set of rewards, $p : S \times R \times S \times A \mapsto [0, 1]$ is the dynamics function, $\gamma \in [0, 1]$ is the discount factor, and $t = 0, 1, 2, 3, \ldots$ is the time counter.
At each time step $t$ the environment is in a state $S_t \in S$. The agent observes $S_t$ and takes an action $A_t \in A$. The environment reacts to $A_t$ by transitioning to a next state $S_{t+1} \in S$ and providing a reward $R_{t+1} \in R \subset \mathbb{R}$ to the agent. The dynamics function determines the probability of the next state and reward given the current state and action.
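For illustration, the dynamics function of a small, entirely hypothetical two-state MDP can be tabulated as in the sketch below, where each entry stores the probability of a next state and reward given the current state and action.

```python
# Hypothetical two-state, two-action MDP.
# Keys are (s_next, r, s, a); values are the probabilities p(s_next, r | s, a).
p = {
    ("s2", 1.0, "s1", "a1"): 0.8,  # a1 in s1 usually reaches s2 with reward 1
    ("s1", 0.0, "s1", "a1"): 0.2,  # ...but sometimes stays in s1 with reward 0
    ("s1", 0.0, "s1", "a2"): 1.0,  # a2 in s1 always stays in s1
    ("s2", 0.0, "s2", "a1"): 1.0,  # s2 is absorbing under both actions
    ("s2", 0.0, "s2", "a2"): 1.0,
}

# For every (state, action) pair the probabilities sum to one.
total = sum(v for (s_next, r, s, a), v in p.items() if (s, a) == ("s1", "a1"))
assert abs(total - 1.0) < 1e-12
```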
We consider episodic problems, in which the agent begins an episode in a starting state $S_0 \in S$ and there exists a terminal state $S_T \in S^+ = S \cup \{S_T\}$. If the agent reaches $S_T$, the episode ends, the environment is reset to $S_0$, and a new episode begins. During an episode, the agent tries to maximize the total expected discounted return. The discounted return at time step $t$ is defined as $G_t = \sum_{k=t+1}^{T} \gamma^{k-t-1} R_k$. The discount factor determines the importance of future rewards.
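As a small worked example, suppose that after time step $t = 0$ the agent receives the (hypothetical) rewards $R_1 = 1$, $R_2 = 0$, and $R_3 = 2$ with $\gamma = 0.9$; then $G_0 = 1 + 0.9 \cdot 0 + 0.9^2 \cdot 2 = 2.62$. The sketch below computes the same quantity.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = sum_{k=t+1}^{T} gamma**(k-t-1) * R_k for t = 0,
    given rewards = [R_1, R_2, ..., R_T]."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# Hypothetical episode: R_1 = 1, R_2 = 0, R_3 = 2 and gamma = 0.9.
print(discounted_return([1.0, 0.0, 2.0], 0.9))  # ~2.62, up to rounding
```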