Pre-Training Deep Q-Networks Eliminates the Need for Target
Networks: An Empirical Study
Alexander Lindström, Arunselvan Ramaswamy and Karl-Johan Grinnemo
Department of Mathematics and Computer Science, Karlstad University, Universitetsgatan 2, 65188 Karlstad, Sweden
{alexander.lindstrom, arunselvan.ramaswamy, karl-johan.grinnemo}@kau.se
Keywords:
Deep Q-Network, Deep Q-Learning, Stability, Pre-Training, Variance Reduction.
Abstract:
Deep Q-Learning is an important algorithm in the field of Reinforcement Learning for automated sequential
decision-making problems. It trains a neural network, called the Deep Q-Network (DQN), to find an optimal
policy. Training is highly unstable and exhibits high variance. A target network is used to mitigate these problems, but it
leads to longer training times, higher training data requirements and very large memory requirements. In this paper, we
present a two-phase pre-trained online training procedure that eliminates the need for a target network. In the
first, offline, phase the DQN is trained using expert actions. Unlike previous literature that tries to maximize
the probability of picking the expert actions, we train to minimize the usual squared Bellman loss. Then,
in the second, online, phase it continues to train while interacting with an environment (simulator). We
show, empirically, that the target network is eliminated; training variance is reduced; training is more stable;
when the duration of pre-training is carefully chosen, the rate of convergence (to an optimal policy) during the
online training phase is faster; and the quality of the final policy found is at least as good as the ones found using
traditional methods.
1 INTRODUCTION
Reinforcement Learning (RL) is a paradigm in AI
used to solve complex sequential decision-making
problems, which can be formalized using Markov
Decision Processes (MDPs). Traditional RL approaches
require full knowledge of the sequential
decision-making problem at hand, making them inappropriate
for large problems with many complex scenarios (Mnih et al., 2015). To
overcome these limitations, a Neural Network (NN) is
combined with traditional RL ideas to obtain Deep
Q-Learning (DQL), which has since become
the most popular modern RL algorithm (Mnih et al.,
2015). In DQL, a neural network called the DQN is
trained to minimize the squared Bellman loss function.
Minimizing this loss allows the DQN to find the
optimal policy, i.e., the sequence of decisions that solves
the original problem at hand.
DQL represents a significant advancement in AI.
However, integrating NNs with RL presents
numerous challenges. For example, NNs typically require large
amounts of data for training, and numerical instability
during NN training is a particular issue in RL, where the
underlying decision-making problem is already complex. To address such issues, DQL introduced
two key mechanisms: “the replay buffer” and “the
target network”. The replay buffer facilitates continuous
learning from past and current experiences in
equal measure. The target network,
as claimed by the authors, reduces instability during
learning. Essentially, the target network is a copy
of the main network that is updated
less frequently, by periodically copying the weights of the
main network (Ramaswamy et al., 2023), (Mnih et al.,
2013).
While these mechanisms are important for successful
learning, they result in large overheads and high
memory demands, which ultimately slow down learning
(Yang et al., 2021). Despite its success, the use of
a target network in DQL has not been thoroughly investigated
across different scenarios. In this
study, we investigate whether the reliance on
the target network can be reduced or even eliminated
under certain conditions, thereby reducing the memory
usage and maintenance overheads of DQL without
compromising its performance.
A DQN is trained using data generated through repeated
interactions with the environment in which the
decision-making problem is being solved. Each interaction
yields feedback that is used to improve the DQN's
performance. Since the DQN weights are randomly initialized,
its interactions are suboptimal, or even dangerous,
at the beginning. To overcome this issue,
a DQN can be pre-trained using data associated with expert
interactions with the environment. As this stage
occurs before the DQN interacts with the environment,
the previously mentioned dangers are averted.
The pre-training phase is also called the offline learning
phase. The training phase where the DQN interacts
with the environment (and continues to train)
is called the online phase. Training DQNs in two
phases, an offline phase followed by an online phase, is a popular
and practical training paradigm. This is the focus
of this paper.
1.1 Problem Description and Our
Contribution
While a target network has been demonstrated to be
useful in stabilizing learning for DQL, its contribution
to finding better policies remains under-explored. In
(Mnih et al., 2015), the significance of a target net-
work is highlighted in mitigating the moving target
problem and in enhancing stability. However, there is
a lack of empirical evidence for its necessity through-
out the entire training process (Mnih et al., 2015). Recent
mathematical proofs, e.g., in (Ramaswamy and
Hüllermeier, 2022), suggest that the target network
could be removed at certain stages of training. These
proofs consider both the online and offline phases
of learning. Recall that online learning refers to the
learning paradigm wherein the learning agent directly influences,
in an ongoing manner, the data used for training.
In offline learning, the agent is trained using
data that was collected in the past (pre-training),
and often the agent has no influence on this training
data (Hester et al., 2017). As mentioned before,
an offline learning phase (pre-training) typically precedes
an online learning phase. This kind of training
is called pre-trained online learning. Our main contributions are as follows:
1. When an agent is trained in two phases, an offline
phase followed by an online phase, the target network
can be completely omitted.
2. In order to find the best policy, we observed that
there is an optimal amount of pre-training. Too
little affects stability during the online phase; too
much affects optimality.
3. Compared to training with a target network,
training without a target network has lower variance.
Hence, learning is faster.
1.2 Related Work
There have been several other works that have ques-
tioned the superfluity of a target network. For ex-
ample, in (Ramaswamy et al., 2023), it is suggested
that the target network could be removed in online
learning by replacing the activation function used in
the DQN with their newly developed Truncated Gaus-
sian Error Linear Unit (TGeLU) activation function
(Ramaswamy et al., 2023). This paper seeks to ad-
dress the practical implications of these theoretical
insights. The central question guiding this research is:
“How does the removal of the target network at vari-
ous scenarios and stages of training impact the stabil-
ity and effectiveness of DQL?”
In DQL, a target network is employed alongside
the main network during training to calculate
the Mean Square Error (MSE) loss. The primary
purpose of integrating a target network in DQL training
is to address the moving target problem, which
arises from high variance in the Q-value estimates and leads
to unstable learning. In addition to the moving target
problem, NNs using non-linear activation functions also
suffer from numerical instability, or exploding gradients,
caused by high variance: during
backpropagation, the gradients can grow exponentially,
and if the loss becomes too large, the weights become
too large to handle (Philipp et al., 2018).
A target network is updated less frequently than
the main network. This means that the target Q-values change
more slowly, which is suggested to contribute to a
smoother training process and more reliable
convergence to an optimal policy, but at the same
time may slow down learning (Mnih et al.,
2015). Using a target network also increases
memory consumption, since a copy of the network must be
kept during training, roughly doubling the memory demand
compared to using only the main network. While a target network may be useful at
the beginning of training to obtain stable learning,
there is no proof that using one results in better policies.
Removing the target network to reduce the overestimation
of Q-values and decrease memory demand is
not a novel idea per se, but our approach to removing
the target network and some of our findings
are new. In this paper, we show that an agent can achieve a policy that is just
as good as, or better than, that of an agent trained with a
target network throughout the entire process, without any major
modifications to the original algorithm. The following
three papers present modified DQL
algorithms, all of which focus on online learning and
aim to remove or reduce the usage of a target network
in order to stabilize learning. These are followed
by two papers suggesting different methods
to optimize training with a combination of offline
and online learning.
The paper (Van Hasselt et al., 2016) aims to contribute
to stable learning by lowering the overestimation
of Q-values. When Q-values are overestimated,
the agent believes certain actions to be more valuable
than they actually are. This can lead the agent to make
suboptimal decisions, resulting in higher variance
in the Q-value estimates. While Double Q-learning
(DDQL) still uses a target network, its role is reduced
in contrast to DQL. In DQL, the target network both
selects the next action and provides the numerical value used to
compute the current Q-value target. In DDQL, the main
network selects the next action, while the target network
only provides the numerical value for that action.
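To make this distinction concrete, the two bootstrap targets can be written as follows; this is the standard formulation of the two algorithms rather than text taken from (Van Hasselt et al., 2016). Here θ denotes the main network parameters and θ⁻ the target network parameters:

Y_DQL  = r + γ · max_{a′} Q(s′, a′; θ⁻)
Y_DDQL = r + γ · Q(s′, argmax_{a′} Q(s′, a′; θ); θ⁻)

In DDQL, the main network θ chooses the next action, while the target network θ⁻ only supplies its value, which is what reduces the overestimation.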
In the paper (Kim et al., 2019), the authors argue
that removing the target network in online
learning achieves both faster and more stable learning.
Their approach is to replace the max operator
used in the Bellman equation with Mellowmax,
because the max operator suffers from overestimation.
Mellowmax also suffers from overestimation,
but a temperature parameter ω is introduced to decrease the
overestimation. Although DeepMellow demonstrated higher cumulative
rewards than DQL in most compared cases,
tuning ω remains a limitation: for larger problems, this tuning
process may be very exhaustive (Kim et al., 2019).
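For reference, the Mellowmax operator over n action values x_1, ..., x_n is commonly defined as follows; this is a standard formulation included here for context, not text taken from (Kim et al., 2019):

mm_ω(x) = (1/ω) · log( (1/n) · Σ_{i=1..n} exp(ω · x_i) )

As ω → ∞ the operator approaches the hard max, while smaller values of ω yield a softer aggregate that is less prone to overestimation, which is why tuning ω matters.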
In the paper (Yang et al., 2021), the aim is likewise
to remove the target network, in this case by replacing the
MSE loss function with Random Hybrid
Optimization (RHO). RHO is based on Hybrid Optimization
(HO), a technique that trades off
the gradients between two different optimization
objectives, the Mean Squared Value Error (MSVE)
and the Mean Squared Bellman Error (MSBE). The difference
between RHO and HO is that instead
of using both objectives simultaneously
and computing a gradient that lies in between them,
RHO uses one objective at
a time for every update, with the objective selected
at random (Yang et al., 2021).
In this paper, we also aim to demonstrate that the
target network can be removed if an agent is first
trained offline before continuing to train online. However,
eliminating the target network in offline learning
is not a well-explored area. The related
work that most closely resembles ours comes from
the paper (Cruz Jr et al., 2019), which focuses on comparing
different configurations and setups for
pre-training in combination with DQL (Cruz Jr et al.,
2019), and the paper (Hester et al., 2017), which presents
a new algorithm combining pre-training and online
learning (Hester et al., 2017). While these papers
do not suggest or focus on reducing or removing
the target network, their methods are similar to ours,
except that we do not use a target network when we
combine offline and online learning.
2 NEURAL NETWORK
ARCHITECTURE AND MODEL
TRAINING
This empirical study aims to evaluate whether the highly memory-demanding target network
can be removed without compromising learning
stability. The reason to remove the target network is
not only that it is highly memory-demanding, but
also to challenge the claim that using a target
network results in a policy closer to optimality.
The study investigates pre-trained online learning (an offline
phase followed by an online phase), although
we believe that our conclusions can be replicated in
other RL algorithms as well. We begin by introducing
the training and evaluation environments for the
agents. Subsequently, we present the architecture of
the DQN and provide a step-by-step explanation of
how the training process is carried out, both online
and offline.
2.1 Environment
Figure 1: Screenshot of the “CartPole-v1” environment.
We use the “CartPole-v1” environment from OpenAI
Gym. OpenAI Gym is a Python API standard that
provides a collection of environments for training and
testing RL algorithms. The “CartPole-v1” environ-
ment simulates an inverted pendulum, as illustrated
in Figure 1. The primary objective in this environ-
ment is to prevent a pole pivoted at a point on the cart
from falling by pushing the cart to the left or right on
a track. The “CartPole-v1” environment comes with
an action space, an observation space, range limitations
for the observation space, and a reward function. In
our case, we use the standard configuration of
the current environment version, except for the reward
function.
2.1.1 Interaction with the Environment
Interacting with the environment begins with apply-
ing an action as a parameter, which returns parame-
ters: observation, reward, termination flag, truncation
flag, and additional information. In our case, we only
use the observation, termination, and truncation pa-
rameters. Additional information is primarily used
for debugging purposes, and for reward calculation,
we use our own constructed reward function.
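As an illustration, a minimal interaction loop with the environment could look as follows; this is a sketch assuming the gymnasium package and a random placeholder policy, and our own reward function (described in Section 2.1.3) is only indicated in a comment:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
observation, info = env.reset(seed=0)

for step in range(500):
    action = env.action_space.sample()  # placeholder for the agent's action
    observation, _, terminated, truncated, info = env.step(action)
    # The built-in reward is ignored here; our own reward function
    # (described in Section 2.1.3) is computed from the observation
    # and the termination/truncation flags instead.
    if terminated or truncated:
        break

env.close()
```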
2.1.2 Action and Observation Space
For the “CartPole-v1” environment, the action space
is discrete with two available actions:
0: Push the cart to the left on the track
1: Push the cart to the right on the track
The observation space consists of four observa-
tions: cart position (x), cart velocity, pole angle (φ),
and pole angular velocity. Table 1 presents the details
of the observation space. If the cart is centered in the
middle of the screen and the pole stands straight up,
the position and angle are x = 0 and φ = 0, respec-
tively.
Table 1: Observation space for the “CartPole-v1” environment (angles are given in radians).

Observation              Min          Max
Cart position (x)        -4.8         4.8
Cart velocity            -∞           ∞
Pole angle (φ)           -0.418 rad   0.418 rad
Pole angular velocity    -∞           ∞
2.1.3 Termination Conditions and Reward
Function
Some limitations may trigger termination or truncation
conditions, which determine whether the environment
was solved. If the termination flag is triggered, the
environment is “not solved”; if the truncation flag is
triggered, the environment is “solved”.
Termination Condition: the termination flag
is triggered if x ∉ [−2.4, 2.4], φ ∉ [−0.2095, 0.2095], or both.
Truncation Condition: The truncation flag is trig-
gered if 500 steps have been taken, which is the
maximum number of steps for this environment.
For this study, we utilize the following reward
function:
reward =  { +10                      if truncated
          { −10                      if terminated
          { (1 − |φ|) + (5 − |x|)    otherwise        (1)
During the online training phase, where we interact
directly with the environment, each episode consists
of at most 500 steps. A reward is received for each step
of a training episode. If the environment is truncated,
meaning that the maximum number of steps has been reached,
the agent is rewarded with +10. If the
environment terminates before 500 steps, the agent is
rewarded with −10. For all other steps, the agent receives
a higher reward the closer the cart is to the middle
and the closer the pole angle is to zero. The reward
function's design encourages the agent to trigger the
truncation flag while keeping the pole steady and balanced.
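A minimal sketch of this reward function is given below; the use of the absolute position and angle is our reading of the description above and is an assumption:

```python
def custom_reward(x: float, phi: float, terminated: bool, truncated: bool) -> float:
    """Reward of Equation 1: +10 on truncation, -10 on termination, otherwise
    higher the closer the cart is to the centre (x = 0) and the closer the
    pole is to upright (phi = 0)."""
    if truncated:
        return 10.0
    if terminated:
        return -10.0
    return (1.0 - abs(phi)) + (5.0 - abs(x))
```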
2.2 Agent Setup and Training Process
We use the environment to train agents in
both online and offline learning modes. Online learning
is the most common training process in RL. In
online learning, we start with an empty replay buffer.
Throughout training and interaction with the environment,
we populate the replay buffer with new experiences
obtained from the interaction and update the
weights of the DQN with mini-batches sampled from
the replay buffer. For offline learning, we begin with
a pre-filled replay buffer. The weights of the DQN
are updated using mini-batches sampled from the pre-filled
replay buffer without interacting with the environment.
While offline learning is not as common as
online learning, there are cases where it is beneficial to
pre-train an agent before continuing the training online.
Therefore, we evaluate the concept of
removing the target network in both scenarios: when
an agent trains only online, and when it first pre-trains
before continuing to train online.
2.3 DQN Architecture
Our DQN consists of four layers: one input layer, two
hidden layers, and one output layer. The input and
ICPRAM 2025 - 14th International Conference on Pattern Recognition Applications and Methods
440
output layers are adjusted to match the environment’s
observation and action spaces, respectively. In the
case of the “CartPole-v1” environment, the observa-
tion space is a 4-tuple, and the action space contains
two possible actions. Therefore, the input layer has
four neurons, and the output layer has two neurons.
The two hidden layers contain 64 and 128 neurons,
respectively, and use Sigmoid activations unless otherwise
specified.
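A minimal sketch of this architecture, assuming PyTorch (the paper does not state the framework used), is shown below:

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Four observations in, two Q-values out, hidden layers of 64 and 128 units."""
    def __init__(self, obs_dim: int = 4, n_actions: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.Sigmoid(),
            nn.Linear(64, 128),
            nn.Sigmoid(),
            nn.Linear(128, n_actions),  # one Q-value per action
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```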
2.4 Learning Rate and Exploration
Probability Decay
Next, we discuss the decay process during online
learning, where we use polynomial decay for both
the learning rate and exploration probability. During
training, when the learning loss is below 0.5 for 100
consecutive episodes, the learning rate is decayed us-
ing:
α = (α_initial − α_min) · (1 − n / N_max)^α_p + α_min    (2)

where:
n = current episode
N_max = maximum number of episodes (in this case set to 10000)
α_initial = initial learning rate (set to 0.02)
α_min = target learning rate (set to 0.0001)
α_p = polynomial factor (higher values result in faster decay; set to 2 in this case)
Next, we explain the decay of the exploration rate
ε. The exploration rate decays every 100 episodes, regardless
of how the agent performs, using the following
equation:
ε = (ε_initial − ε_min) · (1 − n / N_max)^ε_p + ε_min    (3)

ε_initial = 1
ε_min = 0.01
ε_p = 7
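A small sketch of these polynomial decay schedules (Equations 2 and 3), using the constants listed above, could be implemented as follows:

```python
def polynomial_decay(initial: float, minimum: float, episode: int,
                     max_episodes: int, power: float) -> float:
    """Polynomial decay used for both the learning rate and the exploration rate."""
    fraction = 1.0 - episode / max_episodes
    return (initial - minimum) * (fraction ** power) + minimum

# Learning rate (Equation 2): decayed once the loss stays below 0.5
# for 100 consecutive episodes.
alpha = polynomial_decay(0.02, 0.0001, episode=2000, max_episodes=10000, power=2)

# Exploration rate (Equation 3): decayed every 100 episodes.
epsilon = polynomial_decay(1.0, 0.01, episode=2000, max_episodes=10000, power=7)
```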
2.5 Step-by-Step Description of
Pre-Trained Online Learning
The pre-training is done as an offline learning phase,
where an agent is trained without interaction with
an environment. Instead, we use a pre-filled replay
buffer B to train the agent. The replay buffer contains
experiences collected from previous interactions with
the environment or from expert demonstrations.
Initialization
Create a New DQN: Randomly initialize the NN
weights using a uniform distribution.
Training
1. Train for some number of iterations:
2. Sample Mini-Batch: Sample a mini-batch b of
experiences from the replay buffer B, denoted as
b ∼ U(B).
3. Update Network Weights: Update the network
weights by backpropagation using Stochastic Gra-
dient Descent (SGD) with the mini-batch b and
the loss function in Equation 4.
4. Repeat: Unless training is terminated, repeat
steps 2 and 3.
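A minimal sketch of this offline update loop is given below; it assumes the PyTorch DQN sketched earlier, a hypothetical replay_buffer object whose sample() method returns tensors, and the squared Bellman loss of Equation 4 with no target network:

```python
import torch

dqn = DQN()                                    # network sketched in Section 2.3
optimizer = torch.optim.SGD(dqn.parameters(), lr=0.02)
gamma = 0.99                                   # assumed discount factor

for iteration in range(200_000):               # number of pre-training iterations
    # replay_buffer is pre-filled with expert experiences; sample() is assumed to
    # return (states, actions, rewards, next_states, dones) as tensors.
    states, actions, rewards, next_states, dones = replay_buffer.sample(64)
    q_values = dqn(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrap target computed with the SAME network: no target network.
        next_q = dqn(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = torch.nn.functional.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```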
This process enables the agent to learn from previously
collected experiences stored in the replay buffer
without interacting with the environment. It is
followed by the online learning phase, which does
not use a target network, described below.
Initialization
1. Create a DQN: Initialize the NN to the pre-
trained DQN.
2. Initialize Replay Buffer: Create a replay buffer
B with a maximum capacity C to store experi-
ences.
Training
1. Train for some number of episodes:
2. Action Selection: With probability ε, select a_t ∼ U(A_t); otherwise, a_t := argmax_a Q(s_t, a; θ).
3. Store Experience: Store the experience exp_t = (s, a, r, s′) in the replay buffer B. Once the maximum capacity C is reached, experiences are removed from the replay buffer in First In First Out (FIFO) order.
4. Sample Mini-Batch: Sample a mini-batch b of experiences from the replay buffer B, denoted as b ∼ U(B). In the beginning, steps 4, 5, and 6 are skipped until the number of experiences stored in the replay buffer is at least the specified mini-batch size.
5. Update DQN Weights:
Calculate the loss function L_i(θ_i) using the mini-batch b:

L_i(θ_i) = E_{(r,s,a,s′)∼U(B)} [ ( r(s,a,s′) + γ max_{a′} Q(s′, a′; θ_i) − Q(s, a; θ_i) )² ]    (4)
Perform backpropagation using the SGD opti-
mizer.
6. Repeat: Unless training is terminated, repeat the
above steps.
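A minimal sketch of this online loop without a target network is shown below; it assumes the gymnasium environment, the DQN, custom_reward and polynomial_decay functions sketched earlier, and a hypothetical ReplayBuffer class with push(), sample() and __len__() methods:

```python
import random
import torch
import gymnasium as gym

env = gym.make("CartPole-v1")
dqn = DQN()                               # or the pre-trained DQN from the offline phase
optimizer = torch.optim.SGD(dqn.parameters(), lr=0.02)
buffer = ReplayBuffer(capacity=100_000)   # hypothetical buffer with FIFO eviction
gamma, batch_size, epsilon = 0.99, 64, 1.0

for episode in range(10_000):
    state, _ = env.reset()
    done = False
    while not done:
        # Step 2: epsilon-greedy action selection.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                action = int(dqn(torch.as_tensor(state, dtype=torch.float32)).argmax())
        next_state, _, terminated, truncated, _ = env.step(action)
        reward = custom_reward(next_state[0], next_state[2], terminated, truncated)
        # Step 3: store the experience.
        buffer.push(state, action, reward, next_state, terminated or truncated)
        state, done = next_state, terminated or truncated
        # Steps 4 and 5: skipped until the buffer holds one mini-batch of experiences.
        if len(buffer) >= batch_size:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)
            q = dqn(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                # Same network for the bootstrap target: no target network.
                target = rewards + gamma * dqn(next_states).max(dim=1).values * (1.0 - dones)
            loss = torch.nn.functional.mse_loss(q, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    if episode % 100 == 0:                # Equation 3: decay the exploration rate.
        epsilon = polynomial_decay(1.0, 0.01, episode, 10_000, 7)
```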
3 EMPIRICAL RESULTS AND
ANALYSIS
During the offline phase, the agent is trained without
interacting with an environment. Instead, we use a
pre-filled experience replay buffer containing expert
experience and update the weights of the DQN for
n iterations using mini-batches uniformly
drawn from this buffer. After pre-training, we continue
to train the agent online, starting with the pre-trained
policy learnt during the offline phase. The old replay
buffer is discarded and the online learning
process starts with an empty one. We show through experiments
that a target network is redundant if traditional
online training is preceded by an offline phase
(pre-training). Our experiments show that, even when
the target network is discarded completely, the agent
can find a policy that is as good as, or better than, the
one found by an agent that trains with a target network
throughout the training process.
To show the superfluity of target networks when
pre-training is used, we compare the policies obtained
by DQNs trained with and without a target network.
However, instead of training two
agents in parallel, we train only one agent with pre-trained
online learning without a target network and
compare it with a DQN trained with a target network
in online learning.
3.1 Impact of Pre-Training and
Learning Stability
Recall that a policy is deemed acceptable if it is
able to balance the cart-pole for all of the 500 steps
within an episode, consistently, for 1000 consecutive
episodes. We discovered that the number of pre-training
steps significantly impacts the ability to find
an acceptable policy for the “CartPole-v1” environment.
Figure 2 illustrates this effect. It shows that the
agent pre-trained for 150000 iterations cannot balance
the cart-pole for 500 steps per episode over 1000
consecutive episodes, which is the requirement for an
acceptable policy. This agent was trained for 10000
episodes before stopping, without finding an acceptable
policy. In contrast, the agents pre-trained for
200000 and 250000 iterations both find an acceptable
policy.
Figure 2: Along the x-axis, the episode countdown to the stoppage
of the online training is plotted. Along the y-axis, the
number of steps in an episode for which the cart-pole remains
balanced is plotted. For the red curve, which corresponds to
the experiment where the target network is used throughout
training, the countdown starts at the 3021st episode of
the online training phase. The other curves correspond to
different numbers of pre-training steps before switching to the
online training phase. The orange curve, with 200000 pre-training
steps, reaches an acceptable policy faster, at the
1159th episode, even without a target network.
The main goal of this analysis was to demonstrate that an agent can find an acceptable policy
without using a target network if it is first pre-trained
before continuing to train online.
We found such policies for both cases, where
we pre-trained for 200000 and 250000 steps, respectively.
Additionally, we observed that very long pre-training
has an adverse effect on the rate of convergence
within the online phase. For example, the best
performance was achieved when the number of pre-training
steps was limited to 200000 as opposed
to 250000. In fact, the former has the best rate of convergence,
finding an acceptable policy in 1159 episodes,
bettering the rate of convergence of traditional DQN
with a target network, which needed 3021 episodes.
3.2 Comparing the Quality of the
Acceptable Policies Found
One method to compare the quality of the acceptable
policy found is to compare the rewards accumulated
within an episode. This is done in Figure 3. It com-
pares the traditional online training with three pre-
trained online training modes - 250000, 300000 and
350000 pre-training steps. While describing Figure 2,
we said that 200000 is the optimal number of pre-
training steps with respect to learning rate - it finds
the acceptable policy first. For this set of experiments,
we decided to pre-train for a longer time, since we
wanted to check the critical nature of stopping the of-
fline phase, particularly with respect to the quality of
the acceptable policy found. For example, suppose
we are conservative and have a very long pre-training
phase, can we find a good policy? Figure 3 shows that
good policies can be learnt in a stable manner, with-
Figure 3: As in the previous figure, along the x-axis, the
episode countdown to the stoppage of the online training
is plotted. Along the y-axis, we plot the cumulative reward
per episode. The experiments corresponding to pre-trained
online learning were repeated for 250000, 300000
and 350000 pre-training steps, beyond the optimal limit, as
we wanted to study how critical the choice of pre-training
duration is. The red curve corresponds to the traditional
online training with a target network. The orange
curve, corresponding to 300000 pre-training steps, has a cumulative
reward that is comparable to traditional training.
It is noteworthy that the cumulative rewards exhibit
very low variance, as compared to traditional training, which
exhibits very high variance throughout training.
Figure 3 shows that good policies can be learnt in a stable manner,
without a target network, provided the DQN is pre-trained
for a certain number of steps. Pre-training has the additional
advantage of reducing variance during training.
Since low variance is often associated with
low data consumption, we believe that eliminating
target networks also has a positive impact on
the amount of data needed to train a DQN.
Another way to compare the quality of the acceptable
policies is by comparing the average loss per
episode. This is done in Figure 4. The first thing to notice is
that the variance is lower for the agents trained without
a target network. Much like the plot with the rewards,
the pre-trained online training is able to effectively
minimize the squared Bellman loss, the principal
aim in RL. From our various experiments we
conclude that the target network is redundant if we
pre-train an agent before the online learning phase.
We can discard the target network, but we need
to determine the length of pre-training, which for now
is determined through trial and error.
Figure 4: This figure tells the same story as Figure 3,
from the perspective of loss instead of reward. Along the x-axis,
the episode countdown to the stoppage of the online
training is plotted. Along the y-axis, we plot the average loss
per episode.
4 CONCLUSIONS
We described a simple two-phase training procedure for DQN.
The aim of this study is to show that it eliminates
the need for a target network, thereby speeding up
training, in addition to greatly reducing variance. In
the first stage, we train the DQN in an offline manner
using a history of expert actions. In particular,
an experience replay buffer is filled with these samples.
Then, for a certain number of pre-training steps,
a mini-batch is sampled from this buffer, and a loss
gradient is calculated to update the DQN. The usual
squared Bellman loss is minimized. This step differs
from the traditional use of expert actions: in
the literature, during pre-training, the DQN is trained
to maximize the probability of picking the expert actions.
In the second, online, phase the DQN is trained to minimize
the Bellman loss while interacting with an environment
(simulator), as in the traditional method. However,
we did not use a target network at any stage of training.
We found that our pre-trained online training of
DQN greatly enhanced learning stability without the
need for a target network. Variance during training
was reduced, which reduced the training duration and
the amount of training data needed. The policy found
was as good as, and in some cases better than, the one
found by training with a target network. The only
caveat is the pre-training duration. We observed
that the number of pre-training steps influences
the rate of convergence during the online phase.
This duration, we believe, is problem-dependent and
can be found through trial and error. While a short
pre-training phase leads to unstable learning, longer-than-optimal
pre-training still finds a good policy; the
optimality is only with respect to the rate of convergence
to this policy.
ACKNOWLEDGEMENTS
The research presented in this paper was conducted
when Lindström was working on his Master's thesis
under the supervision of Ramaswamy (Lindström, 2024).
The work was carried out within the Data-driven
Latency-sensitive Mobile Services for a Digitalized
Society (DRIVE) project, which is partly funded by
the Knowledge Foundation of Sweden. Ramaswamy was
partially supported by the Knowledge Foundation (grant no. 20200164).
REFERENCES
Cruz Jr, G. V. d. l., Du, Y., and Taylor, M. E.
(2019). Pre-training Neural Networks with Human
Demonstrations for Deep Reinforcement Learning.
arXiv:1709.04083 [cs].
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul,
T., Piot, B., Horgan, D., Quan, J., Sendonaris, A.,
Dulac-Arnold, G., Osband, I., Agapiou, J., Leibo,
J. Z., and Gruslys, A. (2017). Deep Q-learning from
Demonstrations. arXiv:1704.03732 [cs].
Kim, S., Asadi, K., Littman, M., and Konidaris, G. (2019).
DeepMellow: Removing the Need for a Target Net-
work in Deep Q-Learning. In Proceedings of the
Twenty-Eighth International Joint Conference on Ar-
tificial Intelligence, pages 2733–2739, Macao, China.
International Joint Conferences on Artificial Intelli-
gence Organization.
Lindström, A. (2024). An empirical study of stability and
variance reduction in Deep Reinforcement Learning.
Dept. of Computer Science, Karlstad University.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M.
(2013). Playing Atari with Deep Reinforcement
Learning. arXiv:1312.5602 [cs].
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., Petersen, S.,
Beattie, C., Sadik, A., Antonoglou, I., King, H., Ku-
maran, D., Wierstra, D., Legg, S., and Hassabis, D.
(2015). Human-level control through deep reinforce-
ment learning. Nature, 518(7540):529–533.
Philipp, G., Song, D., and Carbonell, J. G. (2018). Gradients
explode - Deep networks are shallow - ResNet
explained.
Ramaswamy, A., Bhatnagar, S., and Saxena, N. (2023).
A Framework for Provably Stable and Consis-
tent Training of Deep Feedforward Networks.
arXiv:2305.12125 [cs].
Ramaswamy, A. and Hüllermeier, E. (2022). Deep Q-learning:
Theoretical insights from an asymptotic
analysis. IEEE Transactions on Artificial Intelligence,
3(2):139–151.
Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep Re-
inforcement Learning with Double Q-Learning. Pro-
ceedings of the AAAI Conference on Artificial Intelli-
gence, 30(1).
Yang, G., Li, Y., Fei, D., Huang, T., Li, Q., and Chen, X.
(2021). DHQN: a Stable Approach to Remove Tar-
get Network from Deep Q-learning Network. In 2021
IEEE 33rd International Conference on Tools with Ar-
tificial Intelligence (ICTAI), pages 1474–1479, Wash-
ington, DC, USA. IEEE.