Tangled Program Graphs with Indexed Memory in Control Tasks with
Short Time Dependencies
Tanya Djavaherpour, Ali Naqvi and Stephen Kelly
Department of Computing and Software, McMaster University, Hamilton, ON, Canada
{djavahet, naqvia18, spkelly}@mcmaster.ca
ORCID: Tanya Djavaherpour https://orcid.org/0009-0002-3585-1262, Ali Naqvi https://orcid.org/0009-0009-5735-4313, Stephen Kelly https://orcid.org/0000-0002-6071-4705
Keywords:
Evolutionary Reinforcement Learning, Genetic Programming, Partially Observable Environments.
Abstract:
This paper addresses the challenges of shared temporal memory for evolutionary reinforcement learning agents
in partially observable control tasks with short time dependencies. Tangled Program Graphs (TPG) is a genetic
programming framework which has been widely studied in memory intensive tasks from video games, time
series forecasting, and predictive control domains. In this study, we aim to improve external indexed memory
usage in TPG by minimizing the impact of destructive agents during cultural transmission. We test various
memory resetting strategies—per agent, per episode, and a no-memory control group—and evaluate their
effectiveness in mitigating destructive effects while maintaining performance. Results from Acrobot, Pen-
dulum, and CartPole tasks show that resetting memory more often can significantly boost TPG performance
while preserving computational efficiency. These findings highlight the importance of memory management
in Reinforcement Learning (RL) and suggest opportunities for further optimization for more complex visual
RL environments, including adaptive memory resetting and evolved probabilistic memory operations.
1 INTRODUCTION
Reinforcement Learning (RL) agents learn through
trial-and-error interaction with their environment
(Sutton and Barto, 2018). Deep Reinforcement
Learning (DRL), with its capacity to decompose sen-
sor inputs and build hierarchical representations of
sensor data, has significantly expanded the capabili-
ties of autonomous agents operating within complex
environments (Mnih et al., 2015). Despite these ad-
vancements, DRL agents often encounter formidable
obstacles in tasks necessitating robust memory func-
tionalities (Pleines et al., 2023). This paper inves-
tigates these challenges and proposes simple strate-
gies to enhance temporal memory capabilities in the
recently-proposed genetic programming framework
known as Tangled Program Graphs (TPG) (Kelly and
Heywood, 2018).
Effective memory management is crucial for en-
suring that agents can retain and utilize relevant in-
formation over time, particularly in environments that
are only partially observable or which require long
term planning. We explore various strategies for en-
hancing the efficiency of indexed memory in TPG,
with the goal of minimizing the negative impact of
destructive agents and improving overall system per-
formance. Through a series of experiments, we eval-
uate different memory management approaches, in-
cluding probabilistic methods for writing into mem-
ory shared among a population of agents, and inves-
tigate their impact on the performance of TPG agents
in partially observable benchmark RL environments
with short time dependencies. This study focuses
on comparing memory management strategies within
TPG, using the original version of PyTPG (Amaral,
2019) as the baseline. Our results demonstrate that
clearing shared temporal memory before each evalua-
tion episode improves agent performance by reducing
the negative impact of destructive agents and lowering
decision-making complexity.
2 BACKGROUND
Genetic Programming (GP) is an Evolutionary
Computation paradigm that evolves computer pro-
grams using evolutionary algorithms (Brameier and
Banzhaf, 2007). RL agents evolved with GP can
model their environment over time through the use of
temporal memory. In Linear Genetic Programming
(LGP) (Brameier and Banzhaf, 2007), programs are represented as a sequence of instructions that read from and write to memory registers. LGP supports a simple form of recurrent temporal memory by allowing registers to maintain state between sequential
program executions. More generally, GP can support
indexed memory by augmenting agents with a linear
memory array and adding specialized read and write
operations to the GP function set (Teller, 1994). If
indexed memory is shared among agents in a pop-
ulation, it can also support the transmission of in-
formation between individuals by non-genetic means.
Spector's "Culture" (Spector and Luke, 1996b) allows all individuals to share the same memory, similar to societal interactions in which each individual is affected by others in a shared environment, but it risks "pollution" of the memory matrix by agents that perform badly.
In Visual RL, observable states are high-
dimensional matrices such as video frames. TPG can
directly process high-dimensional video inputs and
has been tested in various gaming scenarios, outper-
forming traditional deep neural network RL methods
in multi-task learning (Kelly and Heywood, 2018).
These TPG agents were also more computationally
efficient, requiring fewer calculations per action than
other approaches. Their efficiency is primarily due to:
1) the hierarchical complexity of each entity evolving
based on its interaction with the problem domain, un-
like the fixed complexity in conventional Deep Learn-
ing (Mnih et al., 2015); and 2) within a TPG entity,
subsystems often focus on different segments of the
visual input, meaning only certain components are ac-
tive at any specific moment (Kelly et al., 2020).
Despite visual RL providing high-resolution in-
put, individual frames often lack the complete infor-
mation required to select the best action. This par-
tial observability significantly limits the agent’s per-
ception of the environment and implies that tempo-
ral memory must be available for the agent to build a
mental model of its environment. TPG has success-
fully used emergent modularity combined with reg-
ister memory and indexed memory to evolve prob-
lem solvers for memory-intensive tasks (Kelly et al.,
2021). In short, TPG agents are composed of teams
of linear genetic programs which share a single mem-
ory data structure and cooperatively manage a model
of the environment which enables operation in par-
tially observable RL tasks. In this work, we aim to en-
hance the effectiveness of indexed memory usage by
minimizing the effects of destructive individuals dur-
ing the cultural transmission of information through
shared memory, advancing our understanding of the
“culture” of digital organisms.
3 METHODOLOGY
3.1 Environments
The environments used in this work are partially-
observable versions of the widely-studied RL bench-
marks Acrobot, Pendulum, and Cartpole (Sutton and
Barto, 2018), shown in Figure 1. These tasks are selected for
their high level of challenge, extensive comparative
results available in the literature, and computational
simplicity resulting in fast experiments.
3.1.1 Acrobot
The Acrobot task is a dynamical system involving a
double pendulum with 6 observation variables, indi-
cated in Table 1, and 500 time steps. The control task
involves swinging up the lower link of the double pen-
dulum to reach a specified target height. As shown in
Figure 1, the state of the Acrobot at every time step is
given by the cosine and sine of the angles of the two
links in radians (θ_1, θ_2) and their angular velocities.
The action space is discrete and consists of three ac-
tions: applying +1 torque, -1 torque, or no torque (0)
to the second joint.
Table 1: Acrobot observation space.
Num  Observation              Min   Max
0    cos(θ_1)                 -1    1
1    sin(θ_1)                 -1    1
2    cos(θ_2)                 -1    1
3    sin(θ_2)                 -1    1
4    θ_1 Angular Velocity     -4π   4π
5    θ_2 Angular Velocity     -9π   9π
An episode ends at t_end, which is reached when the free end reaches the target height (-cos(θ_1) - cos(θ_2 + θ_1) > 1.0) or when the episode exceeds 500 steps. The goal is to reach the target in as few steps as possible: each step incurs a reward of -1, and reaching the target ends the episode with a reward of 0.
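For illustration, this termination test can be written in a few lines of Python; the check below assumes the standard Gymnasium formulation of Acrobot, and the helper name is ours rather than part of the environment API.

import math

def acrobot_target_reached(theta1: float, theta2: float) -> bool:
    # True once the free end of the double pendulum rises above the target height.
    return -math.cos(theta1) - math.cos(theta2 + theta1) > 1.0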
3.1.2 Pendulum Task
The Pendulum task, shown in Figure 1, is a control
problem with 3 observation variables and 200 time
steps. This task involves swinging up a pendulum to
an upright position and keeping it balanced. The ac-
tion space consists of a single continuous control vari-
able, representing the torque applied to the joint. The
observation space consists of three elements which
are indicated in Table 2.
The reward function is as follows:
R = - Σ_{t=1}^{t_max} ( φ(θ)^2 + 0.1 × θ̇^2 + 0.001 × torque^2 )    (1)
Figure 1: Problem environments used in this work: (a) Acrobot, (b) Pendulum, (c) Cartpole. See (Brockman et al., 2016) for a detailed description of the control tasks.
Table 2: Pendulum observation space.
Num  Observation              Min    Max
0    x = cos(θ)               -1.0   1.0
1    y = sin(θ)               -1.0   1.0
2    θ̇ = Angular Velocity     -0.8   0.8
In this reward function, φ(θ) is the difference between the current angle θ and the upright position angle, and torque is the control input applied to the pendulum. The term φ(θ)^2 penalizes deviation from the upright position, 0.1 × θ̇^2 penalizes high angular velocities to encourage smoother movements, and 0.001 × torque^2 penalizes large control inputs to promote energy efficiency.
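To make Equation 1 concrete, the short Python sketch below computes the per-step cost; interpreting φ(θ) as the angle wrapped to [-π, π] relative to upright is our assumption, and the environment returns the negative of this value as the reward at each step.

import math

def pendulum_step_cost(theta: float, theta_dot: float, torque: float) -> float:
    # phi(theta): angular deviation from the upright position, wrapped to [-pi, pi].
    phi = ((theta + math.pi) % (2.0 * math.pi)) - math.pi
    # One term of Equation 1; the caller sums these over the episode.
    return phi ** 2 + 0.1 * theta_dot ** 2 + 0.001 * torque ** 2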
3.1.3 Cartpole Task
The Cartpole task involves balancing a pole on a cart
by applying force to the cart to keep the pole upright.
This task has 4 observation variables, shown in Figure 1 and listed in Table 3: cart position (x), cart velocity (ẋ), pole angle in radians (θ), and pole angular velocity (θ̇). The action space is discrete and consists of two actions, which represent pushing the cart to the left or right.
Table 3: Cartpole observation space.
Num  Observation              Min         Max
0    Cart Position            -4.8        4.8
1    Cart Velocity            -Inf        Inf
2    Pole Angle (θ)           -0.418 rad  0.418 rad
3    Pole Angular Velocity    -Inf        Inf
The reward is +1 for each time step the pole remains upright, accumulated until t_end, which is reached when the pole falls, the cart moves out of bounds, or the maximum number of steps is reached. The goal is to maximize the number of time steps the pole stays upright.
In all tasks, agent training fitness is its mean re-
ward over 20 episodes, where each episode begins
with random initial conditions and ends with success,
failure, or reaching a time constraint. Post-evolution,
the single training champion is reloaded and evalu-
ated in 100 test episodes with initial conditions not
seen during training.
Velocity state variables describe how the system
is changing over time. To make these environments
partially observable, we remove velocity state vari-
ables from the observation space. In order to con-
trol the systems without this information, agents must
use temporal memory to store sequential observations
over time and integrate this data to predict the velocity
of the system. Note that predicting system velocities
only requires short-term memory.
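For illustration, this kind of partial observability can be obtained with a small observation wrapper that drops the velocity entries. The sketch below uses the Gymnasium API and the velocity indices of the standard tasks; it is our own illustration rather than the implementation used in our experiments.

import numpy as np
import gymnasium as gym

class DropVelocities(gym.ObservationWrapper):
    # Remove velocity entries so the agent must infer them from temporal memory.

    # Indices of the non-velocity observation variables for each task.
    KEEP = {"Acrobot-v1": [0, 1, 2, 3], "Pendulum-v1": [0, 1], "CartPole-v1": [0, 2]}

    def __init__(self, env: gym.Env, keep_indices):
        super().__init__(env)
        self.keep = np.array(keep_indices)
        box = env.observation_space
        self.observation_space = gym.spaces.Box(
            low=box.low[self.keep], high=box.high[self.keep], dtype=box.dtype)

    def observation(self, obs):
        return obs[self.keep]

# Example: a partially observable CartPole exposing only position and angle.
env = DropVelocities(gym.make("CartPole-v1"), DropVelocities.KEEP["CartPole-v1"])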
3.2 Tangled Program Graphs
Tangled Program Graphs (TPG) is a hierarchical al-
gorithm for evolving teams of programs. The basic
building block in TPG is a team of programs (see Fig-
ure 2). Each team represents a stand-alone decision-
making entity (agent) in this framework. Each pro-
gram is a linear structure consisting of registers and
instructions that operate on observation inputs and in-
ternal memory registers. Programs return two val-
ues: a bid value and an action value. Teams follow
a first-price sealed-bid auction method in which the
highest bidding program at each timestep wins the
right to decide the action. This action could be a dis-
crete value (directional forces in Figure 2), continuous
value (contents of scalar register s[1] in Figure 2), or
a pointer to another team. If the action is atomic (i.e.
discrete or continuous) it is returned to the task envi-
ronment as the control output for the current timestep.
If the action is a team pointer, then decision-making is
delegated and the bidding process repeats at this team
for the same timestep and observation. The process
repeats recursively until an atomic action is reached.
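The following self-contained Python sketch illustrates this winner-takes-all, recursive action selection; the Team and Program structures and their fields are illustrative stand-ins rather than the PyTPG API, and we assume, as TPG guarantees, that every team retains at least one program with an atomic action.

from dataclasses import dataclass, field
from typing import Callable, List, Optional, Union

@dataclass
class Program:
    bid: Callable[[list], float]        # returns this program's bid for an observation
    action: Union[int, float, "Team"]   # atomic action or pointer to another team

@dataclass
class Team:
    programs: List[Program] = field(default_factory=list)

def select_action(team: Team, obs: list, visited: Optional[set] = None):
    # Follow winning bids through the graph until an atomic action is reached.
    visited = visited or set()
    visited.add(id(team))
    # Ignore programs whose action points back into an already-visited team,
    # so that cycles in the graph cannot cause infinite recursion.
    candidates = [p for p in team.programs
                  if not isinstance(p.action, Team) or id(p.action) not in visited]
    winner = max(candidates, key=lambda p: p.bid(obs))     # highest bid wins the auction
    if isinstance(winner.action, Team):
        return select_action(winner.action, obs, visited)  # delegate the decision
    return winner.action                                   # discrete or continuous output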
3.3 Memory
The TPG model introduced in (Smith and Heywood,
2019) features an external shared memory accessible
to all agents. Each agent consists of several teams and
programs, and each program has its own private registers, which are inaccessible to other programs. Program registers are stateful, and thus provide a simple form of recurrent temporal memory. Furthermore, all programs have access to the shared external memory for reading and writing operations. This memory is not reset between training episodes or the evaluation of different agents, ensuring continuity and allowing for cumulative knowledge building.
Figure 2: Tangled Program Graphs' hierarchical decision-making structure in which teams of programs predict discrete and continuous actions.
Indexed memory operations are handled proba-
bilistically to manage both short-term and long-term
retention. The write operations distribute the con-
tent of a program’s registers across the external mem-
ory in a probabilistic manner, with locations in the
middle of the memory being updated more frequently
(short-term memory) and those towards the ends be-
ing updated less frequently (long-term memory). This
study uses the following probability definition, which
is shown in Figure 3, and where i corresponds to the
index:
P_write(i) = 0.25 / (0.5π (i^2 + 0.25))    (2)
This function provides a heavy-tailed distribution, allowing writing across a wide range of memory locations: the probability is sharply peaked at the center and decreases rapidly as the offset from the center increases.
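As a rough illustration of Equation 2, the sketch below treats i as a normalized offset from the centre of the memory and writes each row with the corresponding probability; the normalization and the way register contents are copied are our assumptions and may differ from the scheme of (Smith and Heywood, 2019).

import math
import random

def p_write(offset: float) -> float:
    # Equation 2: heavy-tailed, Cauchy-shaped write probability peaked at offset 0.
    return 0.25 / (0.5 * math.pi * (offset ** 2 + 0.25))

def probabilistic_write(memory, registers):
    # Scatter register contents across memory rows; central rows are written most often.
    centre = len(memory) // 2
    for row in range(len(memory)):
        offset = (row - centre) / max(centre, 1)       # normalized distance from the centre
        if random.random() < p_write(offset):
            width = min(len(registers), len(memory[row]))
            memory[row][:width] = registers[:width]    # copy register contents into this row

# Example: an 8-row, 4-column memory written from a 4-register program.
memory = [[0.0] * 4 for _ in range(8)]
probabilistic_write(memory, [1.0, 2.0, 3.0, 4.0])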
Read operations use indexing, allowing programs
to locate regions of external memory characterized by
specific temporal properties. This approach allows
programs to interact during each generation or across
different generations, facilitating more sophisticated
decision-making strategies.
4 EXPERIMENTS
The experiments detailed in this section are designed
to evaluate our TPG shared temporal memory im-
plementations in mitigating the negative impact of
destructive agents while maintaining system perfor-
mance and efficiency. We used TPG as implemented
Figure 3: Probability function for memory write operations.
in (Amaral, 2019). The culture method discussed in
(Spector and Luke, 1996a) highlights the negative im-
pact of destructive agents, noting that while positive
ideas from individuals can be preserved for collective
benefit, negative actions by a single agent can destroy
valuable information. To reduce this negative impact,
we study the effect of clearing memory at different
stages and compare the results with the original ver-
sion of shared memory in TPG.
We assess three strategies: resetting memory for
each agent (Section 4.1), resetting memory for each
episode (Section 4.2), and a no-memory condition
(Section 4.3). Algorithm 1 details the implementation of these strategies. The following terms are used in the pseudocode: execute_frames() executes a set of frames, where in each frame the agent takes an action based on its observations and receives feedback from the environment. Lock pooling and release pooling manage parallelism, with lock pooling preventing other agents from interacting with memory and release pooling restoring parallelism after the agent completes its interactions. execute_episodes_with_frames() runs multiple episodes.
4.1 Reset Memory for Each Agent
In this case, the external memory and registers are
cleared and set to zero at the beginning of evaluating
each agent in each generation. This method ensures
that each agent can independently build its own mem-
ory model at run time and removes the possibility of
negative impact from other agents. Each agent essen-
tially has its own indexed memory which is shared
among its programs, resembling a smaller society. In
this case, the agent’s memory maintains state over all
training episodes, during which time the agent is free
to gradually develop its mental model of the environ-
ment. While each agent interacts with memory, it is
essential to restrict others’ access to it. In our current
implementation, this requires blocking the paralleliz-
ing system, which increases experiment run time.
Algorithm 1: Agent execution with memory resetting conditions.

for generation in generations do
    run agent with pooling
    if original version then
        execute_episodes_with_frames()
    else if reset for each agent then
        lock pooling
        reset external memory
        reset agent's registers
        execute_episodes_with_frames()
        release pooling
    else if reset for each episode then
        for episode in episodes do
            lock pooling
            reset external memory
            reset agent's registers
            execute_frames()
            release pooling
        end
    else  (no memory)
        for episode in episodes do
            for frame in frames do
                reset agent's registers
                act and get feedback
            end
        end
    end
end

Function execute_frames():
    for frame in frames do
        act and get feedback
    end

Function execute_episodes_with_frames():
    for episode in episodes do
        execute_frames()
    end
4.2 Reset Memory for Each Episode
This approach also removes potential negative impact
of other agents. In this version, we reset the external
memory and all the agent’s registers at the beginning
of each episode. This tests the agents’ ability to build
their memory quickly during a single episode. Again,
when one agent interacts with memory, it is essential
to restrict others’ access to it.
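To illustrate how this strategy interacts with the locking described in Section 4.1, the sketch below serializes access to the shared memory with a lock and clears it before every episode; the agent interface (reset_registers(), act()) and the NumPy memory array are assumptions for illustration, not the PyTPG API.

import threading
import numpy as np

memory_lock = threading.Lock()            # guards the shared external memory

def evaluate_reset_per_episode(agent, env, shared_memory: np.ndarray, n_episodes: int = 20):
    # Evaluate one agent, clearing shared memory and registers before every episode.
    scores = []
    for _ in range(n_episodes):
        with memory_lock:                 # "lock pooling": other agents must wait
            shared_memory.fill(0.0)       # reset external indexed memory
            agent.reset_registers()       # reset the agent's program registers (assumed API)
            obs, _ = env.reset()
            done, total = False, 0.0
            while not done:
                action = agent.act(obs, shared_memory)            # assumed API
                obs, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                total += reward
            scores.append(total)
        # Lock released here: "release pooling".
    return float(np.mean(scores))         # training fitness: mean reward over episodes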
4.3 No Memory
In this version, we do not use any external indexed
memory and we clear all the agent’s registers to zero
at the beginning of each time step, implying the
agent’s behaviour is entirely stateless. This is a con-
trol experiment to confirm that all partially observable
task configurations absolutely require stateful agents
with temporal memory capabilities.
4.4 Experimental Parameters
Evolutionary hyper-parameters follow previous TPG
work in RL tasks (Smith and Heywood, 2019). The
initial root team population is set at 360 and remains
static throughout evolution. We utilize the "Cauchy Half" distribution (Equation 2) to determine memory write locations in scenarios involving memory. The operation set includes "ADD", "SUB", "MULT", "DIV", "NEG", "COS", "LOG", "EXP", "MEM_READ", and "MEM_WRITE", allowing complex interactions without any task-specific functions. To constrain model complexity and the computational cost of decision-making, we set the probability of an atomic action to 1.0, meaning no programs point to another team.
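For concreteness, a hypothetical realisation of this function set in Python is sketched below; the protected forms of DIV, LOG, and EXP are our assumptions about how invalid inputs might be guarded rather than PyTPG's exact implementation, and the memory operations refer to the shared indexed memory of Section 3.3.

import math

OPERATIONS = {
    "ADD":  lambda a, b: a + b,
    "SUB":  lambda a, b: a - b,
    "MULT": lambda a, b: a * b,
    "DIV":  lambda a, b: a / b if abs(b) > 1e-9 else a,               # protected division
    "NEG":  lambda a, b: -a,                                          # unary: second operand ignored
    "COS":  lambda a, b: math.cos(a),
    "LOG":  lambda a, b: math.log(abs(a)) if abs(a) > 1e-9 else 0.0,  # protected logarithm
    "EXP":  lambda a, b: math.exp(min(a, 32.0)),                      # clamped to avoid overflow
    # "MEM_READ" and "MEM_WRITE" additionally take the shared indexed memory as input.
}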
4.5 Results
Experiments reveal that the reset memory for each
episode strategy (Section 4.2) improves the score and
performance of TPG agents across all the control
problems mentioned in Section 3.1, as shown in Fig-
ure 5. This memory configuration also results in the
lowest solution complexity, as indicated in Figure 6.
We conduct experiments for all the cases detailed
in Section 4 as well as the original version of PyTPG
(Amaral, 2019), using the Cauchy Half distribution
for memory writing probability. We ran 10 repeats
with unique random seeds for the Pendulum and CartPole tasks, and 8 repeats for the Acrobot task. Each
experiment was run using multiple cores to manage
the computational load efficiently: 30 hours with 30
cores for Acrobot, 48 hours with 10 cores for Pendu-
lum, and 72 hours with 20 cores for CartPole. The
results were compared based on the score achieved over the same running period (Figure 4), the score reached over the same number of generations (Figure 5), and solution complexity (Figure 6). Complexity is characterized by the average number of instructions executed per action decision.
To plot Figures 4 and 5, we determined the min-
imum number of generations across all experiments.
According to Figure 5, for all three environments, the
approach of resetting memory for each episode has
the best average score after the 5th generation.
In Figure 6, the complexity over the minimum
number of generations across all experiments is re-
ported. This figure demonstrates that resetting mem-
ory and registers for each episode reduces complexity.
Interestingly, in all three environments, although the
no memory version has the worst score over genera-
tions, it exhibits the highest complexity. This indi-
cates that agents are struggling to improve by making
more complex decisions. On the other hand, the ver-
sion with resetting memory for each episode, which
has the highest score, also exhibits less computational
complexity than the original PyTPG.
Execution speed varied across tasks: the no mem-
ory version consistently ran the most generations, in-
dicating the fastest execution speed. The original ver-
sion performed at an intermediate speed, while both
the reset memory for each agent and reset memory
for each episode versions were the slowest, running
significantly fewer generations across all tasks due to
the blocking of parallelism as discussed in Section 4.
The blocking mechanism is further illustrated in Al-
gorithm 1. This difference in running time explains the original version's superior results over the same amount of wall-clock time, as indicated in Figure 4: because it completes more generations than the reset memory for each episode version in the same amount of time, it achieves a better score. Still, based on Figure 5, it would perform worse if it were limited to the same number of generations as the reset memory for each episode case.
These results support our hypothesis that, for tasks
without long term state dependencies, resetting mem-
ory before each episode can reduce the effect of nega-
tive agents and improve results over the same number
of generations. As expected, the no memory version
cannot solve these partially observable tasks.
After training, we reloaded and tested the cham-
pion from the last common generation for each case
across all seeds. We applied the Mann-Whitney U test
to compare each case with the reset memory for each
episode case, confirming the results in Figure 5 with
p-values less than 0.05. In Acrobot, the reset memory
for each agent and reset memory for each episode ver-
sions showed no significant differences due to simi-
lar scores. However, both versions showed significant
differences (p-value<0.05) compared to the original
and no-memory versions. Readers interested in fur-
ther details about TPG and visualizations of evolved
graphs of teams are referred to (Djavaherpour et al.,
2024), (Smith and Heywood, 2024), (Kelly et al.,
2021).
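For reference, such a comparison can be run with SciPy as sketched below; the two score lists are placeholders standing in for per-seed champion test scores, not values obtained in our experiments.

from scipy.stats import mannwhitneyu

# One mean test score per random seed for two memory strategies (placeholder values).
scores_reset_episode = [-120.4, -118.9, -131.2, -125.0, -119.7]
scores_original      = [-160.3, -155.8, -171.4, -149.9, -166.0]

stat, p_value = mannwhitneyu(scores_reset_episode, scores_original, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")   # p < 0.05 indicates a significant difference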
Figure 4: Scores achieved in different memory strategy experiments over 48 hours: (a) Cartpole, (b) Pendulum, (c) Acrobot.
5 CONCLUSION
This study explored the effectiveness of different
memory management strategies in enhancing the per-
formance of Tangled Program Graphs in partially ob-
servable Reinforcement Learning environments. We
experimented with TPG’s original shared indexed
memory formulation, resetting memory for each
agent, resetting memory for each episode, and a no-
memory condition across three benchmark tasks: Ac-
robot, Pendulum, and CartPole.
The results show that resetting memory for each episode improves the performance of TPG agents across all tasks. This strategy led to the highest average scores after the initial few generations and reduced the complexity of decision-making processes. In contrast, the no-memory version, although capable of running more generations, struggled to solve the partially observable tasks effectively, exhibiting the highest complexity and lowest performance.
Figure 5: Scores achieved in different memory strategy experiments over 48 hours, based on the minimum number of generations run: (a) Cartpole, (b) Pendulum, (c) Acrobot.
Interestingly, while the reset memory for each
agent and reset memory for each episode strategies
showed similar performance, both were significantly
better than the original and no-memory versions in
terms of robustness and reliability, demonstrating
consistency of the agents’ performance across differ-
ent runs with a tighter distribution of scores over the
repeats. In contrast, the reset memory for each generation case failed to perform better than the original version only in CartPole. The Mann-Whitney U test confirmed these findings, with p-values less than 0.05, indicating significant differences.
Figure 6: Complexity in different memory strategy experiments over 48 hours, based on the minimum number of generations run: (a) Cartpole, (b) Pendulum, (c) Acrobot.
These findings suggest that shared memory and
“culture” can have a negative impact on the perfor-
mance of TPG agents in partially observable tasks
with no long term temporal dependencies. Resetting
memory before each episode can mitigate these neg-
ative effects, improving agent performance and re-
ducing decision-making complexity. However, the
primary drawback of the memory reset strategies is
the increased runtime due to the blocking of paral-
lelism. Implementing a dedicated memory for each
agent could potentially mitigate this issue, allowing
parallel execution without interference and maintain-
ing computational efficiency.
Overall, effective memory management strate-
gies are crucial in reinforcement learning tasks. By
carefully selecting and optimizing memory resetting
strategies, significant improvements can be achieved
in the efficiency and effectiveness of TPG in challeng-
ing control environments.
6 FUTURE WORK
Future work will scale these experiments to more
complex environments, such as Memory Gym
(Pleines et al., 2023), in order to validate the meth-
ods’ robustness and explore their adaptability to tasks
with long and short time dependencies. The current
memory strategies help agents quickly build mental
models without directly sharing information. How-
ever, this may not be suitable in complex tasks where
global memory is beneficial (e.g. (Smith and Hey-
wood, 2019)). For such cases, we envision that a dynamic
method, such as resetting memory based on real-time
performance metrics (e.g., wiping memory if median
score drops below that of the previous generation),
could provide a more adaptive approach. Addition-
ally, investigating other probabilistic memory func-
tions and their combinations could provide further in-
sights into optimizing agents' memory use. For exam-
ple, rather than manually resetting memory, it might
be possible to evolve customized memory manage-
ment rules for each agent which automatically min-
imize negative effects on shared memory. Finally,
integrating advanced parallelization techniques could
mitigate the runtime overhead caused by memory re-
sets, improving their practicality in real-world appli-
cations. Since the experiments in this paper incurred significant wall-clock run time, faster TPG frameworks, such as those
from (Djavaherpour et al., 2024), will be considered
for use in future work.
Overall, studying the long-term evolutionary im-
pacts of different memory strategies could provide
deeper insights into the development of more sophis-
ticated and adaptive agents in partially observable en-
vironments.
REFERENCES
Amaral, R. (2019). Pytpg: Tangled program graphs
in python. https://github.com/Ryan-Amaral/PyTPG/
tree/7295f90ececbfc34fdbc1d73e032a9c2407a182c.
Brameier, M. and Banzhaf, W. (2007). Linear Genetic Pro-
gramming. Springer.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J.,
Schulman, J., Tang, J., and Zaremba, W. (2016). Ope-
nAI Gym. arXiv, 1606.01540.
Djavaherpour, T., Naqvi, A., Zhuang, E., and Kelly, S.
(2024). Evolving Many-Model Agents with Vector
and Matrix Operations in Tangled Program Graphs.
In Genetic Programming Theory and Practice XXI.
Springer (AD).
Kelly, S. and Heywood, M. I. (2018). Emergent Solutions
to High-Dimensional Multitask Reinforcement Learn-
ing. Evolutionary Computation, 26(3):347–380.
Kelly, S., Newsted, J., Banzhaf, W., and Gondro, C. (2020).
A modular memory framework for time series predic-
tion. In Proceedings of the 2020 Genetic and Evolu-
tionary Computation Conference, GECCO ’20, page
949–957, New York, NY, USA. Association for Com-
puting Machinery.
Kelly, S., Smith, R. J., Heywood, M. I., and Banzhaf, W.
(2021). Emergent tangled program graphs in partially
observable recursive forecasting and vizdoom naviga-
tion tasks. ACM Trans. Evol. Learn. Optim., 1(3).
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M. A., Fidjeland, A. K., Ostrovski, G., Petersen, S.,
Beattie, C., Sadik, A., Antonoglou, I., King, H., Ku-
maran, D., Wierstra, D., Legg, S., and Hassabis, D.
(2015). Human-level control through deep reinforce-
ment learning. Nature, 518:529–533.
Pleines, M., Pallasch, M., Zimmer, F., and Preuss, M.
(2023). Memory gym: Partially observable challenges
to memory-based agents. In The Eleventh Interna-
tional Conference on Learning Representations.
Smith, R. J. and Heywood, M. I. (2019). A model of
external memory for navigation in partially observ-
able visual reinforcement learning tasks. In Genetic
Programming: 22nd European Conference, EuroGP
2019, Held as Part of EvoStar 2019, Leipzig, Ger-
many, April 24–26, 2019, Proceedings, page 162–177,
Berlin, Heidelberg. Springer-Verlag.
Smith, R. J. and Heywood, M. I. (2024). Interpreting tan-
gled program graphs under partially observable dota 2
invoker tasks. IEEE Transactions on Artificial Intelli-
gence, 5(4):1511–1524.
Spector, L. and Luke, S. (1996a). Cultural transmission of
information in genetic programming. In Proceedings
of the 1st Annual Conference on Genetic Program-
ming, page 209–214, Cambridge, MA, USA. MIT
Press.
Spector, L. and Luke, S. (1996b). Culture enhances the
evolvability of cognition. In Cottrell, G., editor, Cog-
nitive Science (CogSci) 1996 Conference Proceed-
ings, pages 672–677, Mahwah, NJ, USA. Lawrence
Erlbaum Associates.
Sutton, R. and Barto, A. (2018). Reinforcement Learning:
An Introduction. The MIT Press, Cambridge, MA,
2nd edition.
Teller, A. (1994). The evolution of mental models, page
199–217. MIT Press, Cambridge, MA, USA.