Hybrid Genetic Programming and Deep Reinforcement Learning for
Low-Complexity Robot Arm Trajectory Planning
Quentin Vacher^a, Nicolas Beuve^b, Paul Allaire^c, Thibaut Marty^d, Mickaël Dardaillon^e and Karol Desnos^f
Univ Rennes, INSA Rennes, CNRS, IETR – UMR 6164, F-35000 Rennes, France
{firstname.lastname}@insa-rennes.fr
^a https://orcid.org/0009-0001-9568-7196, ^b https://orcid.org/0000-0002-1371-4016, ^c https://orcid.org/0009-0005-9325-0643, ^d https://orcid.org/0000-0001-7035-8727, ^e https://orcid.org/0000-0001-6862-2090, ^f https://orcid.org/0000-0003-1527-9668
Keywords:
Genetic Programming, Reinforcement Learning, Inverse Kinematics.
Abstract:
Robot arm control is a technological challenge where an algorithm needs to learn a deep understanding of
spatial navigation. In particular, spatial navigation requires learning the relationship between the motor joint
angular positions and the Cartesian coordinates of the robot. Trajectory planning is an even more complex
challenge, where the algorithm must create a trajectory between two coordinates that does not cause a collision.
State-of-the-art algorithms capable of solving trajectory planning are based on deep Reinforcement Learning
(RL). These algorithms achieve high accuracy but suffer from high computational complexity. This paper
proposes to use a genetic RL algorithm, the Tangled Program Graphs (TPGs), to solve trajectory planning.
Using a genetic process, the TPGs generate a graph of programs with low inference complexity. On a first
trajectory planning problem, the algorithm used achieves performance close to the state-of-the-art, but with a
100× shorter execution time and a 20× smaller model size. On a second and more difficult problem, the TPGs are
not able to learn efficiently. We propose a hybrid solution that mixes the TPGs and a state-of-the-art deep
RL algorithm, the Soft Actor-Critic (SAC). This solution achieves better performance than the state-of-the-art
for both problems, with 6 to 20 times faster execution times.
1 INTRODUCTION
Robot control presents a formidable technological
challenge in today’s landscape. At its core, this chal-
lenge entails the development of algorithms that pos-
sess a deep understanding of spatial relationships.
This understanding is crucial for orchestrating the co-
ordinated movements of the various motors of the
robot. As robots become increasingly ubiquitous
across industries, there is a pressing need to optimize
the computational costs associated with controlling
them. This optimization ensures efficiency without
compromising performance, a balancing act essential
for widespread adoption and practical implementa-
tion.
Some recent techniques, such as Deep Learning,
can efficiently control robot models. Deep Learning
solutions (Giorelli et al., 2015) effectively predict the
motor joint values of a robot to reach a desired posi-
tion. However, these solutions struggle to avoid colli-
sions during trajectory planning.
Deep Reinforcement Learning (RL) methods can
efficiently learn how to avoid collisions (Zhong et al.,
2021). This class of methods is well-suited for trajec-
tory planning, but suffers from a high computational
complexity.
We propose to use a Genetic Programming (GP)
algorithm, the Tangled Program Graph (TPG), to perform
trajectory planning. This algorithm uses GP evolution
to generate a graph designed for a specific task
with very low complexity. With only some tuning of
the training hyperparameters, we show that the TPGs
achieve state-of-the-art RL performance on an Inverse
Kinematics (IK) task, while being 100 times less com-
putationally expensive and having a model size 20
times smaller. However, on a more difficult trajectory
planning task with more diversity, the TPGs alone are
not able to reach equivalent accuracy.
To take advantage of both the low complexity of
the TPGs and the high accuracy of a deep RL algo-
rithm, the Soft Actor-Critic (SAC), a hybrid solution
is proposed. This hybrid solution achieves better re-
sults than the state-of-the-art deep RL on trajectory
planning with collisions, while being 6 to 20 times
faster in inference. Because the hybrid solution combines
both algorithms, the TPGs alone remain 4 to 17 times
faster than the hybrid solution at inference. The model size
of the hybrid solution is the combination of both the deep RL
and the TPG models, so it is slightly larger than that of the
state-of-the-art.
The various algorithms and principles used in this
paper are described in Section 2. The experimental
setup, which describes the environment of the robot
and the implementation of the TPGs used, is pre-
sented in Section 3. Section 4 presents the perfor-
mance of the TPGs and the used deep RL algorithm,
the SAC (Haarnoja et al., 2018). Section 5 describes
the hybrid solution that combines the TPGs and the
SAC and compares it with the state-of-the-art. Fi-
nally, Section 6 discusses the results, and Section 7
concludes this paper.
2 RELATED WORKS AND
BACKGROUND
This section introduces the Inverse Kinematics (IK)
problem. Section 2.1 presents the problem and the
most commonly used existing solutions. Then Sec-
tion 2.2 describes the branch of IK that is the main fo-
cus of this paper: trajectory planning. Section 2.3
introduces the RL field and the reason why deep RL
solutions are used to solve trajectory planning prob-
lems. Finally, Section 2.4 introduces the TPGs, de-
scribing the principles and applications of this GP al-
gorithm.
2.1 Inverse Kinematics
Inverse Kinematics (IK) is a fundamental problem in
robotics that deals with determining joint configura-
tions for a robot manipulator. The goal is to place the
end-effector in a given position and orientation, where
the end-effector is the hand of the robot. The spatial
coordinates describing the position and orientation of
the end-effector are defined as x ∈ ℝ^n, where n is the
number of dimensions. The motor angular coordinates
are defined as q ∈ ℝ^m, where m is the number of
joints, as illustrated in Figure 1a. The forward kine-
matics problem is equivalent to finding the function
f (·) that satisfies the following equation:
x = f (q) (1)
The function f (·) is based on trigonometric func-
tions, has a complexity of O(n) and a unique solu-
tion. The IK problem consists in finding the function
f⁻¹(·) that calculates the joint coordinates of the
robot q from the position and orientation of the end-
effector x. As illustrated in Figure 1b, this function
can have multiple solutions (Ho and King, 2022) if
the number of joint dimensions m is greater than the
number of space dimensions n. Depending on the im-
plemented robot control strategy, the goal of IK may
be to find all solutions or the one that best satisfies cer-
tain criteria. Criteria can be to find the position that
requires the minimum motor movement or the posi-
tion that maximizes the stability of the robot (Phan-
iteja et al., 2017). There exist several solutions to
solve the IK problems, which can be divided into
three categories: analytical, numerical and learning
solutions.
Analytical solutions (Sciavicco and Siciliano,
2012) address the kinematic equations directly.
They are effective for problems with few Degrees
of Freedom (DOF) and have a low computational
cost while predicting the exact solution. However,
for larger problems with many joints to control,
analytical methods may never find a solution.
Numerical solutions, such as the Jacobian trans-
pose (Wolovich and Elliott, 1984) or the damped
least squares method (Wampler, 1986), solve the
IK problem by iteratively computing an approximation
to a solution of the function f⁻¹(·). Be-
cause of this iterative nature, they have a higher
computational cost than analytic solutions, and
the solution may depend on the first approxima-
tion used to initialize the iterations.
The emergence of deep learning in recent years
has brought new solutions to the IK problem.
Deep learning is known to be very efficient at
solving high-dimensional problems, so it is nat-
urally well suited to the IK problem. A stan-
dard feedforward neural network outperforms
the accuracy of the Jacobian transpose meth-
ods (Giorelli et al., 2015) while computing the re-
sult 100 times faster.
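To make Equation (1) concrete, the sketch below implements f(·) for a simplified two-link planar arm. This toy example is ours: it is not the 4-DOF arm of Section 3 and its link lengths are arbitrary. It illustrates that forward kinematics is a cheap sum of trigonometric terms with a single solution, while two different joint configurations can already reach the same end-effector position, which is what makes f⁻¹(·) ambiguous.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

// Forward kinematics f(q) for a two-link planar arm (toy example, not the
// 4-DOF arm of Section 3). q holds joint angles in radians, l1/l2 are link
// lengths in millimeters.
std::array<double, 2> forwardKinematics(const std::array<double, 2>& q,
                                        double l1, double l2) {
    // Each Cartesian coordinate is a sum of trigonometric terms:
    // O(n) cost and a unique result for a given q.
    const double x = l1 * std::cos(q[0]) + l2 * std::cos(q[0] + q[1]);
    const double y = l1 * std::sin(q[0]) + l2 * std::sin(q[0] + q[1]);
    return {x, y};
}

int main() {
    const double l1 = 142.0, l2 = 155.0;  // arbitrary lengths for the example
    const std::array<double, 2> qA = {0.4, 0.8};
    // Second joint configuration reaching the same point: the elbow angle is
    // negated and the shoulder angle compensated, illustrating why f^-1(x) is
    // ambiguous when several joint vectors map to one Cartesian position.
    const double psi = std::atan2(l2 * std::sin(qA[1]), l1 + l2 * std::cos(qA[1]));
    const std::array<double, 2> qB = {qA[0] + 2.0 * psi, -qA[1]};
    const auto pA = forwardKinematics(qA, l1, l2);
    const auto pB = forwardKinematics(qB, l1, l2);
    std::printf("A: (%.2f, %.2f)  B: (%.2f, %.2f)\n", pA[0], pA[1], pB[0], pB[1]);
    return 0;
}
```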
2.2 Trajectory Planning
Trajectory planning is a branch of IK where the goal
is to move the end-effector from a coordinate x_src to
a target coordinate x_dst and to control the path taken
along the way to avoid collisions or unavailable posi-
tions.
Figure 1: The 4-DOF Robot arm used in the experiment. (a) The arm with the motor joints (black dots) and the end-effector coordinates (red dot). (b) A case where two joint coordinates reach the same end-effector coordinates.

Deep learning or numerical methods usually compute
trajectories by defining intermediate coordinates,
creating a trajectory, and computing the position of
the motor joints coordinate by coordinate (Duka,
2014). This results in weak collision avoidance and
leaves the possibility that two intermediate coordi-
nates will result in completely different motor posi-
tions. If the method can compute all the mappings of
the function f
1
(·) (Ho and King, 2022), then taking
the closest mapping to the previous coordinates can
solve the motor correlation. However, collision avoid-
ance in trajectory planning is much harder to solve
with these methods.
2.3 Reinforcement Learning Solutions
for Trajectory Planning
RL is a branch of machine learning in which an agent
interacts with an environment by taking actions that
yield rewards (Sutton and Barto, 2018). The objective
of the agent is to learn a policy, that is, a mapping
from states to actions, that maximizes the cumulative
reward over time.
RL is a good solution for trajectory planning
because the agent can create optimized trajectories
based on a reward defined by the learning environ-
ment. The reward is usually the distance between the
end-effector and the target, but a penalty can be added
to encourage the agent to avoid collisions (Phaniteja
et al., 2017).
Some RL algorithms using deep learning have
shown great efficiency in robot control (Ibarz et al.,
2021), such as the Soft Actor-Critic (SAC) (Haarnoja
et al., 2018). However, these solutions use deep learn-
ing networks that require too much computation for
a deployment on tightly resource-constrained embed-
ded systems. The SAC model from (Haarnoja et al.,
2019) uses 2 layers of 256 units each, leading to at
least 65k parameters, resulting in models of about
300 kB.

Figure 2: Semantics of the TPGs. (a) A TPG example. (b) TPG semantics: Team (vertex), Action (vertex), Program (edge).
2.4 Tangled Program Graphs
Today, Deep RL methods are the most widely used
algorithms for solving trajectory planning in the IK
problem. However, other RL techniques such as the
TPG could be used to solve this problem. TPG is an
algorithm introduced by Kelly and Heywood (Kelly
and Heywood, 2017) based on GP principles that has
been shown to be efficient for RL problems at very
low computational cost.
2.4.1 The TPG Model
A TPG, depicted in Figure 2, consists of three main
entities that form a directed graph: the programs, the
teams, and the actions. The actions and teams are
the vertices of the graph, while the programs are
the edges between a source and a target team or ac-
tion. Teams are internal vertices, while actions are the
leaves of the graph. Each team has several outgoing
edges.
2.4.2 The Program
A program is a list of instructions constructed during
the genetic evolution training process. A set of in-
structions is defined by the programmer prior to the
training process. A program applies its instructions
to the state vector to produce a bid. During the
inference process, a team evaluates all its outgoing
programs, follows the edge of the program with the
highest bid, and reaches the corresponding vertex.
The instructions are crucial for the
selection of actions and can be adapted for each type
of problem (Desnos et al., 2021).
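To make the bidding mechanism concrete, the following sketch shows one inference step of a generic TPG. The types and the program representation are simplified placeholders of our own; they are not the API of the GEGELATI library used later in this paper.

```cpp
#include <functional>
#include <limits>
#include <variant>
#include <vector>

// Simplified placeholder types, not the GEGELATI API.
struct Team;                                 // internal vertex
using Action = int;                          // leaf vertex: a discrete action id
using Vertex = std::variant<Team*, Action>;  // a program edge points to one of them

struct Program {
    // A program is a list of instructions; here it is abstracted as a
    // function that maps the state vector to a bid.
    std::function<double(const std::vector<double>&)> bid;
    Vertex destination;
};

struct Team {
    std::vector<Program> outgoingEdges;  // each team has several outgoing edges
};

// Inference: starting from a root team, repeatedly follow the edge whose
// program produces the highest bid, until an action leaf is reached.
// (A real implementation also prevents revisiting the same team twice.)
Action infer(const Team* root, const std::vector<double>& state) {
    const Team* current = root;
    while (true) {
        double bestBid = -std::numeric_limits<double>::infinity();
        const Program* bestProgram = nullptr;
        for (const Program& p : current->outgoingEdges) {
            const double b = p.bid(state);
            if (b > bestBid) { bestBid = b; bestProgram = &p; }
        }
        if (std::holds_alternative<Action>(bestProgram->destination)) {
            return std::get<Action>(bestProgram->destination);
        }
        current = std::get<Team*>(bestProgram->destination);
    }
}
```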
2.4.3 The Evolution Process
The structure of the graph evolves dynamically. A
fixed number of roots are created in each generation
by duplicating and mutating roots that have survived
the natural selection process of the previous gener-
ation. The mutation of a root can change either the
destination of the edges of the root, or the instructions
of the programs on the edges.
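A minimal sketch of one generation of this evolutionary process is given below, under the assumption that a root can be evaluated and duplicated-then-mutated. The Root type and the evaluate and mutateCopy placeholders are ours; the actual training loop is provided by the GEGELATI library, whose API differs.

```cpp
#include <algorithm>
#include <random>
#include <vector>

// Hypothetical root handle and helpers of our own, standing in for GEGELATI.
struct Root { double score = 0.0; /* graph data omitted */ };

static std::mt19937 rng{42};

// Placeholder: would run the training episodes and return the mean score.
double evaluate(Root&) { return std::uniform_real_distribution<>(-500.0, 0.0)(rng); }

// Placeholder: would duplicate the root, then mutate either an edge
// destination or the instructions of a program carried by an edge.
Root mutateCopy(const Root& src) { return src; }

void oneGeneration(std::vector<Root>& roots, std::size_t nbSurvivors) {
    // 1. Evaluate every root of the current generation.
    for (Root& r : roots) r.score = evaluate(r);
    // 2. Natural selection: keep only the best-scoring roots.
    std::sort(roots.begin(), roots.end(),
              [](const Root& a, const Root& b) { return a.score > b.score; });
    const std::size_t populationSize = roots.size();
    const std::size_t survivors = std::min(nbSurvivors, populationSize);
    roots.resize(survivors);
    // 3. Refill the population by duplicating and mutating the survivors.
    for (std::size_t i = 0; roots.size() < populationSize; ++i) {
        roots.push_back(mutateCopy(roots[i % survivors]));
    }
}
```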
The TPGs are able to solve RL problems, such
as the Atari games (Kelly and Heywood, 2017), with
low inference complexity. The evolutionary process
used to train the TPGs produces RL agents with an
adaptive computational complexity that depends on
the difficulty of the problem. Unlike the TPGs, deep
RL methods require a specific model size defined by
a data scientist, which is about 100 times larger than
a TPG and takes about 100 to 1000 times longer to
run than a TPG (Kelly, 2018) for equivalent accuracy.
These characteristics make the TPGs a good candi-
date for the trajectory planning problem.
3 EXPERIMENTAL SETUP
This section presents the experimental setup used to
evaluate the TPGs, the SAC, and the hybrid solution
on trajectory planning tasks. Section 3.1 describes the
robot arm model used. Section 3.2 describes the
environment used by the TPGs and its different
components, then Section 3.3 describes the differences
between the environment used by the SAC and the
TPGs.
To ensure reproducibility of the results presented,
all library code, pre-trained TPGs, execution traces
and analysis code are made available as open-source
artifacts (Vacher et al., 2024).
3.1 4-DOF Robot Arm
The KIT-WIDOWX-ARM-COMP robot arm used in
the experiment has 4-DOF as represented on Fig-
ure 1a. It is a three-link arm placed on a base l_1
with rotational joints. The joints are limited to the range
[-π/2, π/2], except for the base, which can do a full
rotation but cannot wrap around from -π to π. The RL
algorithms are trained on a simulator built with the
Orocos library, and a model characterizing the valid
positions was built using the physical constraints of
the actual arm.
The arm has six servomotors, but two of them control
the hand, which is not considered in this study. The
lengths of the base and of the three links, shown in
Figure 1a, are:

l_1 = 125 mm; l_2 = 142 mm; l_3 = 142 mm; l_4 = 155 mm; d = 49 mm
Motor angular possibilities are limited within the following
ranges (note that q_1 cannot go from -π to π):

q_1 ∈ [-π, π]; q_2 ∈ [-π/2, π/2]; q_3 ∈ [-π/2, π/2]; q_4 ∈ [-π/2, π/2]
3.2 The Environment for the TPGs
The Environment State. The state of the environ-
ment as given to the RL agent is composed of 13 val-
ues divided into four vectors:
1. The four motor positions;
2. The position of the hand P_h = (x_h, y_h, z_h);
3. The position of the target P_t = (x_t, y_t, z_t);
4. The difference between the hand and the target P_d = (x_h - x_t, y_h - y_t, z_h - z_t).
The fourth vector may seem redundant because it is the
difference between vectors two and three, but it has been
empirically found to significantly speed up the train-
ing process for both the TPG and SAC algorithms.
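As an illustration, the 13-value state could be assembled as follows; the struct and function names are ours, not those of the released artifacts.

```cpp
#include <array>
#include <vector>

// Hypothetical container; the released artifacts may organize this differently.
struct ArmObservation {
    std::array<double, 4> motorPositions;  // q1..q4, in radians
    std::array<double, 3> handPosition;    // P_h = (x_h, y_h, z_h), in mm
    std::array<double, 3> targetPosition;  // P_t = (x_t, y_t, z_t), in mm
};

// Flatten the observation into the 13-value state vector given to the agent.
std::vector<double> buildState(const ArmObservation& obs) {
    std::vector<double> state;
    state.reserve(13);
    state.insert(state.end(), obs.motorPositions.begin(), obs.motorPositions.end());
    state.insert(state.end(), obs.handPosition.begin(), obs.handPosition.end());
    state.insert(state.end(), obs.targetPosition.begin(), obs.targetPosition.end());
    // Fourth vector: hand-to-target difference P_d, redundant but it
    // empirically speeds up training.
    for (int i = 0; i < 3; ++i) {
        state.push_back(obs.handPosition[i] - obs.targetPosition[i]);
    }
    return state;
}
```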
The Action Space. TPGs can only perform a sin-
gle discrete action per inference. The action rotates
one of the four motors either clockwise or counter-
clockwise by 3 motor steps, where a step is approx-
imately 0.0046 rad or 0.26°. A smaller motor step size
can give more accurate results, but makes learning
harder because more actions are needed to reach a
target. Hence, there are eight available actions, i.e.,
2 actions per motor, plus a ninth action, a no-action,
that causes the agent to terminate the episode itself.
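A possible decoding of these nine discrete actions is sketched below; the 3-step rotation per action comes from the text above, while the function and constant names are ours.

```cpp
#include <array>

constexpr double kMotorStepRad = 0.0046;                // one motor step, approx. 0.26 deg
constexpr double kActionDeltaRad = 3 * kMotorStepRad;   // each action moves 3 steps

// Actions 0..7: rotate motor (action / 2) clockwise or counterclockwise.
// Action 8: no-action, interpreted as self-termination of the episode.
// Returns false when the agent asked to stop.
bool applyAction(int action, std::array<double, 4>& q) {
    if (action == 8) {
        return false;  // no-action: the TPG terminates the episode itself
    }
    const int motor = action / 2;
    const double direction = (action % 2 == 0) ? +1.0 : -1.0;
    q[motor] += direction * kActionDeltaRad;
    // Joint limits, collisions, and invalid positions are handled by the
    // simulator and are omitted here; an invalid move would be aborted and
    // the environment state kept unchanged.
    return true;
}
```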
Objective of the Learning. The robot arm
must move from coordinates (x_src, y_src, z_src) to
(x_dst, y_dst, z_dst). A position is considered reached
when the distance between the end-effector and the
target is less than three millimeters. This value of
three millimeters is arbitrarily chosen. Two different
trajectory planning problems are considered, the first
being simpler than the second:
Fixed Start (FS). Go from a fixed starting posi-
tion to a random destination.
Random Start (RS). Go from a random starting
position to a random destination.
For the Fixed Start (FS) problem, all targets can be
reached in a maximum of 1707 actions. The Ran-
dom Start (RS) problem doubles this to 3414. These
values are used as the maximum number of actions
allowed per episode.
At each generation, the roots are tested on 100 tar-
gets, making 100 episodes per generation. A valida-
tion is performed at each generation, evaluating the five best
roots of the training on 100 validation targets. Two
unique sets of validation targets, one for each prob-
lem, are used for all the results presented in this paper.
The targets are sampled uniformly in the reachable 3D
space.
The Score. Contrary to classic Reinforcement
Learning (RL) algorithms, the TPGs do not use an
instant reward and the Bellman equation (Bellman,
1957) to learn, but only a score at the end of each
evaluation. The score S, as specified in Equation (2),
is the distance in millimeters to the target multiplied
by -1 if the distance d between the target and the
hand is above 3 millimeters. If the distance is lower,
then the objective is reached and the score becomes
positive to encourage taking fewer actions to reach
the target.

S = -1 × distance                        if d ≥ 3
S = (nbMaxActions - nbActions) × 0.01    if d < 3        (2)
Since 100 episodes are evaluated for each root per
generation, the score of a root is the average score of
the 100 episodes.
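A direct transcription of Equation (2) is sketched below; the function name and signature are ours, and the score of a root is then averaged over its 100 episodes as described above.

```cpp
// Score of one episode, following Equation (2).
// distanceMm: final distance between the hand and the target, in millimeters.
// nbActions / nbMaxActions: actions taken and the per-problem action budget.
double episodeScore(double distanceMm, int nbActions, int nbMaxActions) {
    if (distanceMm < 3.0) {
        // Target reached: positive score, larger when fewer actions were used.
        return (nbMaxActions - nbActions) * 0.01;
    }
    // Target missed: negative score equal to minus the remaining distance.
    return -1.0 * distanceMm;
}
```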
Invalid positions occur when any part of the robot
is below the bottom of the arm, assuming the robot
is attached to a table. If a collision or an invalid posi-
tion is detected, the action is aborted, and the environ-
ment state remains unchanged. For the TPGs, a root
will always choose the same action with the same en-
vironment state. Consequently, the same action will
be selected repeatedly, causing the arm to stop. As a
result, the score will reflect the current negative dis-
tance, which is unfavorable compared to a successful
outcome. The selection will naturally avoid collisions
and invalid positions.
3.3 The Environment Variations for the
SAC
The environment used by the SAC is mainly the same
as the environment for the TPGs described in the pre-
vious section.
Using the same reward as the TPGs would not be
fair for either the SAC or the TPGs, because both al-
gorithms learn differently. The TPGs learn with a fi-
nal score at the end of a generation while the SAC
learns with a reward at each action step. The reward r
used is described by Equation (3), where d is the
distance in millimeters between the hand and the
target.

r = -0.1 × d         if d ≥ 3 and no collision
r = -0.1 × d - 10    if d ≥ 3 and collision
r = 100              if d < 3        (3)
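For reference, Equation (3) can be transcribed as the following function; the collision flag and the names are ours.

```cpp
// Per-step reward of the SAC, following Equation (3).
// distanceMm: distance between the hand and the target after the action, in mm.
double sacReward(double distanceMm, bool collision) {
    if (distanceMm < 3.0) {
        return 100.0;                      // target reached
    }
    double reward = -0.1 * distanceMm;     // dense penalty proportional to the distance
    if (collision) {
        reward -= 10.0;                    // extra penalty when the action collides
    }
    return reward;
}
```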
Unlike the TPGs, the SAC can perform multiple
actions at once, thus the SAC has four possible ac-
tions, each of which rotates one of the four motors ei-
ther clockwise or counterclockwise by 3 motor steps.
Because the SAC uses multiple actions, it is limited
to 1000 actions per episode instead of 1707. The limit
could be higher, since the reward of the SAC does
not depend on the maximum number of actions, but
earlier tests showed that increasing it does not affect
the learning of the SAC.
For the TPGs, an episode ends when a collision
occurs, as explained in the description of the score
in section 3.2, but this also corresponds to the goal
of learning: to avoid collisions. The training of the
SAC was initially done with the same behavior, where
a collision would end the episode. The problem is
that this creates episodes with different numbers of
actions. Since the goal of RL is to maximize the sum
of rewards over an episode, this created instability in
the training.
To avoid this problem, during the training of the
SAC, a collision does not end the episode but accu-
mulates negative rewards more quickly. However, for
comparison with the TPGs in inference, a collision
ends the current episode. This configuration is referred
to as SAC No Collision on inference (NC_i).
4 RESULTS OF TPGs AND SACs
Section 4.1 presents the results of the TPGs, then the
results of the SAC are shown in Section 4.2. Finally,
in Section 4.3, a comparison of both algorithms is
done.
Figure 3: Results of the TPGs for the FS problem, with fixed starting positions, and the RS problem, with random starting positions. The curves show the mean of the ten seeds ± the standard deviation in the shaded areas. (a) Distance (mm) per generation. (b) Success rate per generation.
4.1 Results of the TPGs
The first experiments with the TPGs on trajectory
planning are done for the FS problem, where the ini-
tial position of the arm is fixed between each episode.
Out of the box, training the TPGs with the training hy-
perparameters proposed in Kelly’s work only achieves
a mean distance to the target of 94.5 millimeters over
5 seeds after 150 generations.
An empirical study has been done to improve the
results by tuning the hyperparameters. The parame-
ters studied are the number of roots at each genera-
tion, the number of surviving roots at each generation
and the number of initial roots. The number of roots
did not change, but the number of surviving roots and
initial roots have changed from 180 and 180 to 20 and
7200, respectively. With these parameters, the TPGs
achieve an average distance to target of 38.8 millime-
ters over 10 seeds after 150 generations. The training
was then extended to 300 generations to improve the
results. The TPGs achieve an average distance to tar-
get of 28.61 mm and a success rate of 57% for the FS
problem.
A training with the same configuration was done
for the RS problem, where the initial position is
random. The TPGs achieve a distance to target of
133.69 mm and a success rate of 22.6% for the RS
problem. Figure 3 shows the training results of the
TPGs for both objectives.
4.2 Results of the SAC
The SAC is trained on 5 seeds using the parameters in
Table 1. One hundred episodes are done for the train-
ing before a validation step of 100 episodes. To eval-
uate a low complexity SAC, a training is done with
16 hidden units per layer instead of 256. Figure 4
presents the results of the different configurations.
Table 1: Parameters for training the SAC.
Parameters       Values      Parameters       Values
gamma            0.95        batch size       256
learning rate    5 × 10⁻⁴    buffer size      10⁵
reward scale     2           gradient steps   16
hidden units     256         hidden layers    2
target coef.     0.005
The task is learned for the FS problem with a
mean distance of 7.29 mm for the classic SAC. How-
ever, when testing the SAC that reaches 7.29 mm
on the same validation set of targets, but ending the
episode when a collision occurs, the mean distance
degrades to 25.2 mm. On the second objective with a
Random Start (RS), the SAC achieves an average distance
of 44.1 mm, which degrades to 93.5 mm when adding the
collision termination condition.
The alternative SAC configuration with No Col-
lision in training (NC_t) reaches 114.16 mm and the
SAC with 16 hidden units reaches 96.14 mm. Since
these alternative SAC configurations perform much worse
than the classic SAC, they are not kept for further tests.
4.3 Comparison Between the TPGs and
the SAC
In this section, a comparison between the results of the
SAC and the TPGs is made. A comparison of the accuracy
of the algorithms is presented in Section 4.3.1
and their computational complexity is studied in
Section 4.3.2.
4.3.1 Performance Comparison
Table 2 shows the distance, success rate, and num-
ber of collisions achieved by both algorithms for both
problems. Both the SAC with allowed collisions and the
SAC with No Collision on inference (NC_i) are measured.

Figure 4: Results of the SACs on the first objective, with fixed starting positions, and the second objective, with random starting positions. The curves show the mean of the five seeds ± the standard deviation in the shaded areas. The number 16 indicates 16 hidden units per layer instead of 256. NC_t indicates that the training is done with no collision allowed. (a) Distance (mm) per generation. (b) Success rate per generation.

For the dis-
tance achieved, the results are similar for the FS prob-
lem, where the SAC NC_i and the TPGs reach 25.2 mm
and 28.6 mm, respectively. However, the SAC NC_i
is still better than the TPGs for the success rate: it has
a success rate of 82%, while the TPGs have 56.8%. For
the RS problem, the results are better for the SAC NC_i
for both metrics, with a distance of 93.5 mm compared
to 133.7 mm for the TPGs and a success rate
of 37.4% compared to 22.2%. The only metric where
the SAC and the TPGs are almost equal for both problems
is the collision rate. The SAC has 11.2%
for the FS problem and 36.8% for the RS problem.
The TPGs are really close, with 15.5% for the FS problem
and 37.2% for the RS problem.
Figures 5a and 5b show the distance reached by
each algorithm and each seed for the FS and RS prob-
lems, respectively. The SAC NC_i has cases where the
distance is very high due to collision termination.
For the FS problem, although both algorithms
achieve a nearly equal mean distance, the distribution
of the distances is not the same. For the SAC,
most of the distances are close to 3 mm, but some are
very poor, reaching 500 mm, while for the TPGs, the
distances are more spread out between 3 and 200 mm.
4.3.2 Inference Complexity Study
To measure the inference execution time, both algo-
rithms are run on a single-CPU 13th Gen Intel(R)
Core(TM) i7-13700H. SAC uses the PyTorch C++
API and the TPGs use the C code generation pro-
posed by GEGELATI. The code is compiled using
GCC 13.3 on a Windows 10 operating system. Fig-
ures 5c and 5d show the time per action on each tar-
get for each seed for both problems. Both algorithms
have similar execution times between each seed and
each target for both problems. The TPGs are on average
96 times faster than the SAC for both problems. The
SAC takes 61.53 µs per action, while the TPGs take
0.64 µs.

Table 2: Accuracy performance between the SAC over five
seeds, the TPGs over 10 seeds and the hybrid solution over
the 50 combined seeds. The metrics used are the distance,
the success rate and the collision rate. SAC NC_i refers to the
SAC with no collision on inference, meaning that a collision
ends the episode. The results are the mean of the seeds ± the
standard deviation.

Metrics      Distance (mm)   Success       Collision
Fixed Start Problem
SAC          7.3 (±03)       90.7% (±2)    11.2% (±02)
SAC NC_i     25.2 (±08)      82.0% (±3)    11.2% (±02)
TPGs         28.6 (±09)      56.8% (±9)    15.5% (±10)
Hybrid       11.3 (±06)      86.9% (±8)    12.2% (±08)
Random Start Problem
SAC          44.1 (±03)      57.8% (±4)    36.8% (±04)
SAC NC_i     93.5 (±04)      37.4% (±5)    36.8% (±04)
TPGs         133.7 (±17)     22.2% (±8)    37.2% (±14)
Hybrid       81.2 (±22)      49.3% (±9)    36.3% (±11)
This complexity is also reflected in the model
sizes. For the SAC, all models have the same size
of 292 kB, while the models of the TPGs have a size
between 11 kB and 18 kB. The models of the TPGs
are 20 times smaller than the model of the SAC.
The TPGs can achieve similar performance to the
SAC on this problem, while being a hundred times
faster and with a model size 20 times smaller. The
results of the TPGs are very promising for the tra-
jectory planning problem. However, the TPGs have some
drawbacks. The TPGs, like many GP algorithms, need
much more training time than Deep RL. The TPGs also
have a larger standard deviation during training
compared to the SAC.

Figure 5: Results of each seed for both the SAC and the TPGs. (a) Distance reached on each target for the FS problem. (b) Distance reached on each target for the RS problem. (c) Time per action on each target for the FS problem. (d) Time per action on each target for the RS problem.
5 HYBRID SOLUTION
In this section, we propose a hybrid version that com-
bines the SAC and the TPG to produce a solution that
surpasses the accuracy of the state-of-the-art while
having a lower execution time. Section 5.1 describes
the proposed algorithm, then Section 5.2 presents an
experimental study to find the best solutions. Finally,
Section 5.3 compares the results of the hybrid solu-
tions with the SAC and the TPGs.
5.1 Hybrid Solution Description
As seen in the previous section, the TPGs and the
SAC reach almost the same mean distance but the dis-
tribution of the distances is not the same. This implies
that the TPGs and the SAC learn different behaviors
to reach a target. The TPGs are on average 95 times
less complex than the SAC. These two features moti-
vate the hybrid solution: we want to benefit from both
behaviors and from the low complexity of the TPGs.
The hybrid solution, described in Algorithm 1, re-
lies on a TPG and a SAC that have been trained from
scratch independently. For inference, the hybrid solu-
tion starts the trajectory planning with the TPG for a
first approach to the target, then switches to the SAC
for the finer approach. There are three conditions that
cause the selection of actions to switch to the SAC:
1. If the actions of the TPG lead to a significant in-
crease in the distance of the arm from its target.
A threshold defined by the sepDistance parameter
is used to trigger this condition. Whenever a se-
ries of TPG actions increases the distance of the
end-effector of the arm beyond this threshold, the
condition is triggered and the SAC takes over the
control of the robot arm.
2. If the TPG chooses the no-action option, it will
terminate itself.
3. If the TPG exhibits cyclic behavior. The inference
of a TPG is a deterministic and stateless process,
which means that given the input state, it will al-
ways result in selecting the same action. As a con-
sequence, a TPG can enter a cycle of actions.
If a collision occurs, regardless of whether the SAC or the
TPG took the action, the inference ends immediately.
After switching to the SAC algorithm, the hybrid
algorithm will automatically return to TPG control
after a fixed number of actions defined by the
nbItGoBackTPG integer parameter of the algorithm. This
parameter allows tuning the ratio of actions taken by
the TPG when running the hybrid algorithm.
5.2 Experimental Study for Hybrid
Solution
The described hybrid solution depends on two param-
eters: nbItGoBackTPG, the number of SAC iterations
after which the TPG takes over, and sepDistance, the
increase in distance to the target that causes the SAC
to take over. The goal of the hybrid solution is to find a com-
promise between accuracy and complexity. With this
in mind, an experimental study is carried out on these
two parameters.
Algorithm 1: Inference with the TPG and the Deep RL Algorithm.
Data: Environment, TPG agent, SAC agent, sepDistance, nbItGoBackTPG
Initialize environment;
Start episode;
bestDistance ← currentDistanceToTarget;
TPGRunning ← true;
counter ← 0;
while episode not done do
    if TPGRunning then
        The TPG takes an action;
        if currentDistanceToTarget > bestDistance + sepDistance then
            TPGRunning ← false;
        else
            if TPG stops moving or is cycling then
                TPGRunning ← false;
            end
        end
    else
        The SAC takes an action;
        counter ← counter + 1;
        if counter = nbItGoBackTPG then
            counter ← 0;
            TPGRunning ← true;
        end
    end
    if bestDistance > currentDistanceToTarget then
        bestDistance ← currentDistanceToTarget;
    end
end
End episode;
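A sketch of Algorithm 1 in C++ is given below; the environment and agent interfaces are hypothetical placeholders standing in for the released TPG, SAC, and simulator code.

```cpp
#include <cstddef>

// Hypothetical interfaces standing in for the released TPG/SAC/simulator code.
struct Env {
    bool done() const;                   // episode finished (target, budget, collision)
    double distanceToTarget() const;     // current hand-to-target distance, in mm
    void applyTPGAction();               // one discrete TPG action
    void applySACAction();               // one SAC action
    bool tpgStoppedOrCycling() const;    // no-action chosen or repeated state detected
};

// Hybrid inference following Algorithm 1: the TPG controls the arm until its
// behavior degrades, then the SAC takes over for nbItGoBackTPG actions.
void hybridEpisode(Env& env, double sepDistance, std::size_t nbItGoBackTPG) {
    double bestDistance = env.distanceToTarget();
    bool tpgRunning = true;
    std::size_t counter = 0;

    while (!env.done()) {
        if (tpgRunning) {
            env.applyTPGAction();
            // Switch to the SAC when the distance increased beyond the
            // threshold, or when the TPG stops moving or enters a cycle.
            if (env.distanceToTarget() > bestDistance + sepDistance ||
                env.tpgStoppedOrCycling()) {
                tpgRunning = false;
            }
        } else {
            env.applySACAction();
            if (++counter == nbItGoBackTPG) {  // give control back to the TPG
                counter = 0;
                tpgRunning = true;
            }
        }
        if (env.distanceToTarget() < bestDistance) {
            bestDistance = env.distanceToTarget();
        }
    }
}
```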
36 combinations of the two parameters are
tested:
nbItGoBackTPG: [1, 5, 10, 50, 100, 1000];
sepDistance (mm): [1, 3, 10, 20, 100, ∞].
Infinity as the separation distance means that the SAC
can only take over if the TPG stops moving or starts
cycling.
Five seeds are trained for the SAC and ten for the
TPGs for each configuration. Thus, 50 pairs of the
SAC and the TPG can be generated for the hybrid so-
lution. Each of the 36 parameter combinations is tested with the
50 SAC-TPG pairs to ensure the viability of the results.
As shown in Section 4.3.2, the inference execu-
tion times of the TPG and the SAC have very limited
variations. Instead of the execution time, the ratio of
the actions performed by the TPG to the total num-
ber of actions performed in this episode is measured.
Then, an estimated time per action is calculated with
the mean time of the TPG and the SAC previously
measured on the Intel i7-13700H core. This allows us
to distribute this experimental study on different types
of processors where the execution times may vary de-
pending on the hardware used, while keeping a simple
metric.
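Concretely, denoting by r_TPG the measured ratio of actions taken by the TPG in an episode, the estimated time per action used in this study can be written as (our notation): t_est = r_TPG × t_TPG + (1 − r_TPG) × t_SAC, with t_TPG = 0.64 µs and t_SAC = 61.53 µs as measured in Section 4.3.2.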
Figure 6 presents the results of the 36 configura-
tions for both problems. For the FS problem, the cho-
sen configuration is found with the nbItGoBackTPG
set to 20 iterations and the sepDistance set to 100 mm.
This configuration achieves an average distance of
11.3 mm with a time per action of 16.07 µs. For the
RS problem, the chosen configuration has a sepDistance
of 100 mm and the nbItGoBackTPG set to 10
iterations. This configuration reaches a mean distance
of 85.5 mm with a time per action of 57.96 µs. The
two configurations are colored in green and orange,
respectively, in Figure 6 to highlight that they both
reached top scores within the set of configurations for
both problems.
5.3 Comparison with State-of-the-Art
The comparison between the state-of-the-art and the hybrid
solution is done in the same way as in Section 4.3,
with a performance comparison in Section 5.3.1 and
a complexity comparison in Section 5.3.2.
5.3.1 Performance Comparison
The hybrid solution achieves better results than the
SAC and the TPGs for both problems. The results
are described in Table 2. For the FS problem, the
mean distance of 11.3 mm is far below the mean dis-
tance of the SAC and TPG, which reached 25.2 mm
and 28.6 mm, respectively. With a success rate of
86.9% the hybrid algorithm is slightly better than the
SAC, with 82%. The number of collisions increased
slightly from 11.2% to 12.2%, but not significantly
considering the standard deviations of 2% and 8% for
the SAC and the hybrid solution, respectively.
For the RS problem, the mean distance is better
than the SAC by 13%: the SAC is at 93.5 mm while
the hybrid solution is at 81.2 mm. However, the hy-
brid solution on this harder problem has a high stan-
dard deviation of 22 mm, while the SAC is only at
4 mm.
Figures 7a and 7b show the distance reached by
the hybrid solution on different seeds compared to
the SAC. For the FS problem, the hybrid solution has
far fewer cases where an episode ends far from the
target compared to the SAC. For the RS problem, the
distribution of the distances is better for most of the
seeds for the hybrid solution than for the SAC.

Figure 6: Pareto fronts between the distance and the approximate time per action of each of the 36 configurations of the hybrid solution for both problems. Each point is the average of the 50 combinations. The selected configuration of the FS problem is colored in green and the selected configuration of the RS problem is colored in orange. (a) Results for the FS problem. (b) Results for the RS problem.
To conclude this comparison, the best seeds of
each solution are compared. For the FS problem, the
SAC achieves a mean distance of 13.7 mm and a suc-
cess rate of 87%. The hybrid solution outperforms the
SAC with a success rate of 94% and a mean distance
of 5.1 mm. For the RS problem, the SAC achieves a
mean distance of 86.9 mm and a success rate of 33%.
The hybrid solution is again much better with a suc-
cess rate of 64% and a mean distance of 53.4 mm.
5.3.2 Complexity Study
Figures 7c and 7d show the time per action of the hy-
brid solutions on different SAC and TPG pairs com-
pared to the SAC. For both problems, the hybrid so-
lution has a better time per action with an average
time of 2.77 µs and 10.92 µs on the FS and RS problems,
respectively, while the SAC has a time per action of
61.5 µs for both problems. The hybrid solution is
thus twenty times faster than the SAC for the FS
problem and 6 times faster for the RS problem.
detailed at the end of the previous section have a time
per action of 3.05 µs for the FS problem and 11.55 µs
for the RS problem. Because the hybrid solution is a
combination of the TPGs and the SAC, the hybrid so-
lution is slower than the TPGs. The TPGs are 4 times
faster than the hybrid solution on the FS problem and
17 times faster on the RS problem.
For the SAC, all models have the same size of
292 kB, while the models of the TPGs have a size be-
tween 11 kB and 18 kB. Thus, the hybrid solution has
a model size of about 305 kB.
6 DISCUSSION
Beyond the performance of the hybrid solution, the
contribution of this work lies in the significant advan-
tages obtained by combining the TPG and the deep
RL algorithm. We show that the synergistic effects of
their complementary strengths lead to improved re-
sults in robot control, in accuracy, and especially in
complexity.
This combined approach opens up new possibil-
ities for solving robot control problems. With the
evolution of deep RL algorithms, harder robot con-
trol tasks are being learned every day (Haarnoja et al.,
2024), with larger model sizes, leading to complex
solutions. Recent research explores reducing the
computational cost of the deep RL agent (Gu et al., 2023)
and is complementary to our approach. A hybrid so-
lution using a TPG combined with a low complexity
deep RL algorithm could give even better results.
The task studied in this paper, trajectory planning,
is well suited for the hybrid solution because it is
easy to detect when a TPG has bad behavior. Bet-
ter ways to detect these bad behaviors should be in-
vestigated in future work to adapt this hybrid solu-
tion to a wider variety of RL tasks where the TPGs
have recently been applied (Kelly et al., 2021; Smith
and Heywood, 2023). The TPGs have also been ap-
plied to classification tasks (Smith et al., 2021; Chillet
et al., 2023). Studying the design of a hybrid solution
for classification tasks using the TPG and some deep
learning techniques could also lead to interesting re-
sults.
Figure 7: Results of 10 combined seeds of the best hybrid solution for both problems. Plotting the 50 combined seeds would have been unreadable; instead, 10 combinations are shown: the first two seeds of the TPGs with the first seed of the SAC, seeds 2 and 3 of the TPGs with the second seed of the SAC, and so on. (a) Distance reached on each target for the FS problem. (b) Distance reached on each target for the RS problem. (c) Time per action on each target for the FS problem. (d) Time per action on each target for the RS problem.
7 CONCLUSION
This paper presents a low-complexity solution for trajectory
planning in IK using a genetic RL algorithm.
The adaptability of the TPGs allows learning the
complex environment. An empirical study of the
parameters of the TPG makes it possible to achieve
almost state-of-the-art performance on a first problem,
while being 100 times faster and having a model size 20
times smaller. However, the tuning is not sufficient to
learn a harder problem, where even the state-of-the-art
struggles. In order to propose a solution with low
complexity and to increase the accuracy on this problem,
a hybrid solution combining the SAC, a state-of-the-art
algorithm, and the TPGs is designed. It
achieves a better accuracy than both the SAC and the
TPGs for both problems, while being 6 to 20 times
faster than the SAC depending on the problem. This
hybrid solution does not use any additional training:
it uses pre-trained models of the SAC and
the TPGs. Designing a training procedure for the hybrid
solution could improve the results and reduce the running
time.
For other IK applications, RL algorithms use
memory, continuous action, multi-action, and visual
tracking, four capabilities that the implementation of
the TPGs used in this paper does not have. An exten-
sion of the TPGs using memory and continuous ac-
tion has already been proposed (Kelly and Banzhaf,
2020). Future work will consider these extensions, as
well as multiple actions, to improve performance for
the IK problem.
ACKNOWLEDGEMENTS
This research was funded, in whole or in part, by
the Agence Nationale de la Recherche (ANR), grant
ANR-22-CE25-0005-01. A CC BY license is applied
to the AAM resulting from this submission, in accor-
dance with the open access conditions of the grant.
REFERENCES
Bellman, R. (1957). A markovian decision process. Journal
of Mathematics and Mechanics, 6(5):679–684.
Chillet, A., Boyer, B., Gerzaguet, R., Desnos, K., and Gau-
tier, M. (2023). Tangled program graph for radio-
frequency fingerprint identification. In 2023 IEEE
34th Annual International Symposium on Personal,
Indoor and Mobile Radio Communications (PIMRC),
pages 1–7. IEEE.
Desnos, K., Sourbier, N., Raumer, P.-Y., Gesny, O., and
Pelcat, M. (2021). Gegelati: Lightweight artificial in-
telligence through generic and evolvable tangled pro-
gram graphs. In Workshop on Design and Architec-
tures for Signal and Image Processing (14th edition),
pages 35–43.
Duka, A.-V. (2014). Neural network based inverse kine-
matics solution for trajectory tracking of a robotic
arm. Procedia Technology, 12:20–27. The 7th Interna-
tional Conference Interdisciplinarity in Engineering,
INTER-ENG 2013, 10-11 October 2013, Petru Maior
University of Tirgu Mures, Romania.
Giorelli, M., Renda, F., Calisti, M., Arienti, A., Ferri, G.,
and Laschi, C. (2015). Neural network and jacobian
method for solving the inverse statics of a cable-driven
soft arm with nonconstant curvature. IEEE Transac-
tions on Robotics, 31(4):823–834.
Gu, S., Kuba, J. G., Chen, Y., Du, Y., Yang, L., Knoll,
A., and Yang, Y. (2023). Safe multi-agent reinforce-
ment learning for multi-robot control. Artificial Intel-
ligence, 319:103905.
Haarnoja, T., Ha, S., Zhou, A., Tan, J., Tucker, G., and
Levine, S. (2019). Learning to walk via deep rein-
forcement learning.
Haarnoja, T., Moran, B., Lever, G., Huang, S. H., Tirumala,
D., Humplik, J., Wulfmeier, M., Tunyasuvunakool, S.,
Siegel, N. Y., Hafner, R., et al. (2024). Learning agile
soccer skills for a bipedal robot with deep reinforce-
ment learning. Science Robotics, 9(89):eadi8022.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor.
Ho, C.-K. and King, C.-T. (2022). Automating the learning
of inverse kinematics for robotic arms with redundant
dofs.
Ibarz, J., Tan, J., Finn, C., Kalakrishnan, M., Pastor, P.,
and Levine, S. (2021). How to train your robot
with deep reinforcement learning: lessons we have
learned. The International Journal of Robotics Re-
search, 40(4–5):698–721.
Kelly, S. (2018). Scaling genetic programming to challeng-
ing reinforcement tasks through emergent modularity.
PhD thesis, Dalhousie University, Halifax, Nova Sco-
tia, Canada.
Kelly, S. and Banzhaf, W. (2020). Temporal memory shar-
ing in visual reinforcement learning. Genetic Pro-
gramming Theory and Practice XVII, pages 101–119.
Kelly, S. and Heywood, M. I. (2017). Emergent tangled
graph representations for atari game playing agents.
In Genetic Programming: 20th European Conference,
EuroGP 2017, Amsterdam, The Netherlands, April
19-21, 2017, Proceedings 20, pages 64–79. Springer.
Kelly, S., Voegerl, T., Banzhaf, W., and Gondro, C. (2021).
Evolving hierarchical memory-prediction machines in
multi-task reinforcement learning. Genetic Program-
ming and Evolvable Machines, 22:573–605.
Phaniteja, S., Dewangan, P., Guhan, P., Sarkar, A., and Kr-
ishna, K. M. (2017). A deep reinforcement learning
approach for dynamically stable inverse kinematics of
humanoid robots. In 2017 IEEE international con-
ference on robotics and biomimetics (ROBIO), pages
1818–1823. IEEE.
Sciavicco, L. and Siciliano, B. (2012). Modelling and con-
trol of robot manipulators. Springer Science & Busi-
ness Media.
Smith, R. J., Amaral, R., and Heywood, M. I. (2021).
Evolving simple solutions to the cifar-10 bench-
mark using tangled program graphs. In 2021 IEEE
Congress on Evolutionary Computation (CEC), pages
2061–2068. IEEE.
Smith, R. J. and Heywood, M. I. (2023). Interpreting tan-
gled program graphs under partially observable dota 2
invoker tasks. IEEE Transactions on Artificial Intelli-
gence, 5(4):1511–1524.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement learn-
ing: An introduction. MIT press.
Vacher, Q., Beuve, N., Allaire, P., Marty, T., Dardaillon, M.,
and Desnos, K. (2024). Ecta24 artifacts.
Wampler, C. (1986). Manipulator inverse kinematic solu-
tions based on vector formulations and damped least-
squares methods. IEEE Transactions on Systems, Man,
and Cybernetics, 16:93–101.
Wolovich, W. A. and Elliott, H. (1984). A computational
technique for inverse kinematics. The 23rd IEEE Con-
ference on Decision and Control, pages 1359–1363.
Zhong, J., Wang, T., and Cheng, L. (2021). Collision-free
path planning for welding manipulator via hybrid al-
gorithm of deep reinforcement learning and inverse
kinematics. Complex & Intelligent Systems, pages 1–
14.