MDRL-BT keeps the hierarchical spirit of behavior trees by decomposing a main problem into several simpler sub-problems, thereby providing a rational alternative solution. As the results show, MDRL-BT does not need elaborate reward design to guarantee training convergence, in contrast to general RL algorithms. In the design of mixed models, it is a better choice to use PPO action nodes with a shared brain together with SAC composite nodes, and to pre-train nodes where possible. For practical applications, general RL algorithms or plain BTs can handle simple tasks, while MDRL-BT is a suitable candidate for complex problems.
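To illustrate this mixed design, the following sketch shows how several PPO action nodes can share a single brain while an SAC-style composite node selects which child to tick. It is only a structural outline under assumed names (SharedPPOBrain, PPOActionNode, and SACCompositeNode are hypothetical), with dummy policies standing in for trained PPO and SAC networks; it is not the implementation used in the experiments.

import random
from typing import List

class SharedPPOBrain:
    """Stand-in for one PPO policy shared by several action nodes."""
    def act(self, observation):
        # A trained PPO policy would map the observation to a primitive action.
        return random.choice(["move", "turn", "wait"])

class PPOActionNode:
    """Leaf node whose behaviour is produced by the shared PPO brain."""
    def __init__(self, name, brain):
        self.name, self.brain = name, brain
    def tick(self, observation):
        action = self.brain.act(observation)
        print(f"{self.name}: executing {action}")
        return "SUCCESS"

class SACCompositeNode:
    """Composite node: an SAC-style high-level policy picks which child to tick."""
    def __init__(self, children: List):
        self.children = children
    def tick(self, observation):
        # A trained SAC policy would score the children from the observation;
        # here a random choice stands in for that selection.
        child = random.choice(self.children)
        return child.tick(observation)

brain = SharedPPOBrain()                                  # one brain, several leaves
root = SACCompositeNode([PPOActionNode("chase", brain),
                         PPOActionNode("evade", brain)])
print(root.tick({"enemy_distance": 3.0}))

Because all PPO leaves query the same brain, experience gathered by any leaf can improve the shared policy, which is one motivation for the shared-brain choice discussed above.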
MDRL-BT is also extensible, owing to its recursive BT framework and its RL foundations. Extending MDRL-BT with additional mechanisms, such as curiosity under sparse reward distributions (as sketched below), is an exciting avenue, although the work involved is substantial and the benefit is not yet obvious. In future work, the theory behind such additional algorithms and the scenarios in which they apply can be investigated for better performance.
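As a hedged sketch of the curiosity idea mentioned above (in the spirit of Burda et al., 2018), the snippet below adds an intrinsic bonus, equal to the prediction error of a simple forward model, to the extrinsic reward so that a learning signal remains available when extrinsic rewards are sparse. The linear forward model and the class name ForwardModelCuriosity are illustrative assumptions, not part of MDRL-BT.

import numpy as np

class ForwardModelCuriosity:
    """Intrinsic reward = prediction error of a (here linear) forward model."""
    def __init__(self, obs_dim, lr=0.01, scale=0.1):
        self.W = np.zeros((obs_dim, obs_dim))   # predicts next_obs from obs
        self.lr, self.scale = lr, scale
    def bonus(self, obs, next_obs):
        pred = self.W @ obs
        error = next_obs - pred
        self.W += self.lr * np.outer(error, obs)  # one gradient step on the model
        return self.scale * float(np.mean(error ** 2))

curiosity = ForwardModelCuriosity(obs_dim=4)
obs, next_obs, extrinsic = np.ones(4), np.full(4, 1.5), 0.0   # sparse extrinsic reward
total_reward = extrinsic + curiosity.bonus(obs, next_obs)     # bonus keeps learning going
print(total_reward)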
REFERENCES
Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-
critic architecture. In Thirty-First AAAI Conference
on Artificial Intelligence.
Burda, Y., Edwards, H., Pathak, D., Storkey, A., Dar-
rell, T., and Efros, A. A. (2018). Large-scale
study of curiosity-driven learning. arXiv preprint
arXiv:1808.04355.
de Pontes Pereira, R. and Engel, P. M. (2015). A framework
for constrained and adaptive behavior-based agents.
arXiv preprint arXiv:1506.02312.
Dey, R. and Child, C. (2013). Ql-bt: Enhancing behaviour
tree design and implementation with q-learning. In
2013 IEEE Conference on Computational Intelligence
in Games (CIG), pages 1–8.
Dromey, R. G. (2003). From requirements to design: for-
malizing the key steps. In International Conference
on Software Engineering and Formal Methods.
Florez-Puga, G., Gomez-Martin, M., Gomez-Martin, P.,
Diaz-Agudo, B., and Gonzalez-Calero, P. (2009).
Query-enabled behavior trees. IEEE Transactions
on Computational Intelligence and AI in Games,
1(4):298–308.
Fu, Y., Qin, L., and Yin, Q. (2016). A reinforcement learn-
ing behavior tree framework for game ai. In 2016 In-
ternational Conference on Economics, Social Science,
Arts, Education and Management Engineering, pages
573–579.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018).
Soft actor-critic: Off-policy maximum entropy deep
reinforcement learning with a stochastic actor. In
ICML 2018: International Conference on Machine Learning 2018.
Isla, D. (2005). Gdc 2005 proceeding: Handling complexity
in the halo 2 ai. Retrieved October 21, 2009.
Juliani, A., Berges, V., Vckay, E., Gao, Y., Henry, H., Mat-
tar, M., and Lange, D. (2018). Unity: A general plat-
form for intelligent agents. arXiv preprint arXiv:1809.02627.
Kartasev, M. (2019). Integrating reinforcement learning
into behavior trees by hierarchical composition.
Liessner, R., Schmitt, J., Dietermann, A., and Bäker, B.
(2019). Hyperparameter optimization for deep re-
inforcement learning in vehicle energy management.
In Proceedings of the 11th International Conference
on Agents and Artificial Intelligence - Volume 2:
ICAART, pages 134–144. INSTICC, SciTePress.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T.,
Tassa, Y., Silver, D., and Wierstra, D. (2015). Contin-
uous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971.
Mateas, M. and Stern, A. (2002). A behavior language for
story-based believable agents. IEEE Intelligent Sys-
tems, 17(4):39–47.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A.,
Antonoglou, I., Wierstra, D., and Riedmiller, M. A.
(2013). Playing atari with deep reinforcement learn-
ing. arXiv preprint arXiv:1312.5602.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Ve-
ness, J., Bellemare, M. G., Graves, A., Riedmiller,
M., Fidjeland, A. K., Ostrovski, G., Petersen, S.,
Beattie, C., Sadik, A., Antonoglou, I., King, H., Ku-
maran, D., Wierstra, D., Legg, S., and Hassabis, D.
(2015). Human-level control through deep reinforce-
ment learning. Nature, 518(7540):529–533.
Noblega, A., Paes, A., and Clua, E. (2019). Towards adap-
tive deep reinforcement game balancing. In Proceed-
ings of the 11th International Conference on Agents
and Artificial Intelligence - Volume 2: ICAART, pages
693–700. INSTICC, SciTePress.
Sakr, F. and Abdennadher, S. (2016). Harnessing super-
vised learning techniques for the task planning of am-
bulance rescue agents. In Proceedings of the 8th In-
ternational Conference on Agents and Artificial In-
telligence - Volume 1: ICAART, pages 157–164. IN-
STICC, SciTePress.
Schulman, J., Levine, S., Moritz, P., Jordan, M. I., and
Abbeel, P. (2015). Trust region policy optimization.
arXiv preprint arXiv:1502.05477.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and
Klimov, O. (2017). Proximal policy optimization al-
gorithms. arXiv preprint arXiv:1707.06347.
Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L.,
Den Driessche, G. V., Schrittwieser, J., Antonoglou,
I., Panneershelvam, V., Lanctot, M., et al. (2016).
Mastering the game of go with deep neural networks
and tree search. Nature, 529(7587):484–489.
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D.,
and Riedmiller, M. (2014). Deterministic policy gra-
dient algorithms. In Proceedings of the 31st In-
ternational Conference on Machine Learning.