Bootstrapping a DQN Replay Memory with Synthetic Experiences

Wenzel Baron Pilar von Pilchau¹ (https://orcid.org/0000-0001-9307-855X), Anthony Stein² and Jörg Hähner¹

¹Organic Computing Group, University of Augsburg, Eichleitnerstr. 30, Augsburg, Germany
²Artificial Intelligence in Agricultural Engineering, University of Hohenheim, Garbenstraße 9, Hohenheim, Germany
Keywords:
Experience Replay, Deep Q-Network, Deep Reinforcement Learning, Interpolation, Machine Learning.
Abstract:
An important component of many Deep Reinforcement Learning algorithms is the Experience Replay, which serves as a storage mechanism or memory of experienced transitions. These experiences are used for training and help the agent to find the optimal trajectory through the problem space in a stable manner. The classic Experience Replay, however, makes use only of the experiences it actually made, yet the stored transitions bear great potential in the form of knowledge about the problem that can be extracted. The gathered knowledge contains state transitions and received rewards that can be utilized to approximate a model of the environment. We present an algorithm that creates synthetic experiences in a nondeterministic discrete environment to assist the learner with augmented training data. The Interpolated Experience Replay is evaluated on the FrozenLake environment, and we show that it can achieve a 17% higher mean reward compared to the classic version.
1 INTRODUCTION
The concept known as Experience Replay (ER) started as an extension to Q-Learning and AHC-Learning (Lin, 1992) and developed into a standard component of many Deep Reinforcement Learning (RL) algorithms (Schaul et al., 2015; Mnih et al., 2015; Andrychowicz et al., 2017). One major advantage is its ability to increase sample efficiency. Another important aspect is that algorithms like the Deep Q-Network (DQN) are not even able to learn in a stable manner without this extension (Tsitsiklis and Van Roy, 1997). This effect is caused by correlations in the observation sequence and the fact that small updates may significantly change the policy and in turn alter the distribution of the data. By uniformly sampling over the stored transitions, ER is able to remove these correlations as well as to smooth over changes in the data distribution (Mnih et al., 2015).
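The following minimal sketch illustrates this mechanism; class and parameter names are purely illustrative and not taken from any of the cited implementations. Transitions are appended in the order they occur, and training batches are drawn uniformly at random, which decorrelates the samples from the temporal order of the observation sequence.

import random
from collections import deque

class ExperienceReplay:
    """Minimal vanilla Experience Replay buffer (illustrative sketch)."""

    def __init__(self, capacity=50_000):
        # Bounded FIFO buffer: the oldest transitions are discarded
        # once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        # Store the real transition exactly as experienced.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks up correlations in the observation
        # sequence and smooths over changes in the data distribution.
        return random.sample(self.buffer, batch_size)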
Most versions of ER store the real, actually experienced transitions. For instance, the authors of (Mnih et al., 2015) used vanilla ER in their combination with DQN, and (Schaul et al., 2015) extended vanilla ER to Prioritized Experience Replay, which is able to favour experiences from which the learner can benefit most. But there are also approaches that fill the replay memory with some kind of synthetic experiences to support the learning process.
One example is Hindsight Experience Replay (Andrychowicz et al., 2017), which takes a trajectory of states and actions associated with a goal and replaces the goal with the last state of the trajectory to create a synthetic experience. Both the actually experienced trajectory and the synthetic one are then stored in the ER. This helps the learner to understand how it is able to reach different goals. The approach was implemented in a multi-objective problem space, and after reaching some synthetic goals the agent is able to learn how to reach the intended one.
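As an illustration of this relabelling idea, the hedged sketch below rewrites a finished trajectory so that the state actually reached at the end becomes the goal; the Transition structure and the sparse reward scheme are assumptions for illustration, not the original implementation.

from collections import namedtuple

# Illustrative transition structure; field names are assumptions.
Transition = namedtuple("Transition", "state action reward next_state goal")

def hindsight_relabel(trajectory):
    # Treat the last state that was actually reached as the new goal.
    achieved_goal = trajectory[-1].next_state
    relabelled = []
    for t in trajectory:
        # Recompute a sparse reward with respect to the substituted goal.
        reward = 1.0 if t.next_state == achieved_goal else 0.0
        relabelled.append(t._replace(goal=achieved_goal, reward=reward))
    return relabelled

Both the original and the relabelled trajectory would then be stored in the ER.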
Our contribution is an algorithm targeted at improving (Deep) RL algorithms that make use of an ER, such as DQN, DDPG or classic Q-Learning (Zhang and Sutton, 2017), in nondeterministic and discrete environments by creating synthetic experiences from stored real transitions. We can increase sample efficiency, as stored transitions are reused to generate more, and even better, experiences. To this end, the algorithm computes an average of the rewards received in a situation and combines this value with observed follow-up states to create so-called interpolated experiences that assist the learner in its exploration phase.
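A minimal sketch of this averaging step is given below, assuming a discrete, nondeterministic environment such as FrozenLake; the function name and the exact transition format are assumptions for illustration and not the authors' implementation.

from collections import defaultdict

def interpolate_experiences(real_transitions):
    """Create synthetic (interpolated) transitions from stored real ones:
    average the rewards observed for each (state, action) pair and combine
    that average with every follow-up state seen for this pair."""
    rewards = defaultdict(list)
    next_states = defaultdict(set)
    for state, action, reward, next_state, done in real_transitions:
        rewards[(state, action)].append(reward)
        next_states[(state, action)].add((next_state, done))

    synthetic = []
    for (state, action), observed_rewards in rewards.items():
        avg_reward = sum(observed_rewards) / len(observed_rewards)
        for next_state, done in next_states[(state, action)]:
            synthetic.append((state, action, avg_reward, next_state, done))
    return synthetic

The resulting interpolated experiences could then be stored in the replay memory alongside the real transitions and sampled in the usual way.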
The evaluation is performed on the FrozenLake
environment from the OpenAI Gym (Brockman et al.,
2016).
This work investigates only discrete and nondeterministic environments, and the averaging is a rather simple method as well, but the intention is