COOPERATIVE LEARNING OF BDI ELEVATOR AGENTS
Yuya Takata, Yuki Mikura, Hiroaki Ueda and Kenichi Takahashi
Graduate School of Information Sciences, Hiroshima City University, Hiroshima 731-3194, Japan
Keywords:
Multiagent systems, BDI architecture, Reinforcement learning.
Abstract:
We propose a framework of cooperative learning of BDI agents. Our framework uses several kinds of agents,
including a task management agent (TMA) and rational agents. TMA is designed as a learning agent. It
manages the assignment of tasks to rational agents. When a task is created, TMA evaluates the most useful strategy
on the basis of reinforcement learning. Rational agents also evaluate the value of having the task assigned to them
according to the strategy, and they give the value as their intention to TMA. Then, TMA optimally assigns the
task to a rational agent by using both the value and the rough strategies, and the rational agent processes the
task. In this article, we apply the proposed method to an elevator group control problem. Experimental results
show that the proposed method finds better task assignments than methods without cooperative learning.
1 INTRODUCTION
In recent years, agent systems such as the RoboCup
soccer simulator (The RoboCup Federation, 2009)
and the multi-car elevator system (The IEICE Engi-
neering Sciences Society, Concurrent System Tech-
nology, 2009) have emerged as an active subfield of
artificial intelligence. In the agent circumstances, au-
tonomous agents are required to learn optimal rules
for their decision making or to infer their optimal be-
havior. Thus, many kinds of methods such as re-
inforcement learning (Sutton and Barto, 1998), evo-
lutionary computation (Katagiri et al., 2000) and
the BDI reasoning engine (Rao and Georgeff, 1991;
Doniec et al., 2006) have been proposed.
Reinforcement learning is one of the traditional meth-
ods to learn optimal rules for agent behavior. It can
obtain optimal agent behavior efficiently for simple
agent circumstances in which agents sense only a few
percepts or cooperative behavior of agents is not re-
quired. For complex agent circumstances, however,
reinforcement learning requires quite a lot of trial and
error processes to learn the optimal agent behavior,
since the number of states increases exponentially as
the numbers of percepts and agents increase.
In Ref. (Excelente-Toledo and Jennings, 2003), it
is reported that anticipating other agents' intentions is
important for acquiring cooperative behavior. Ratio-
nal agents following the BDI (Belief Desire Intention)
model can anticipate other agents' intentions and infer
their optimal behavior. In order to infer the optimal
behavior, the designer of agent circumstances needs
to give a set of inference rules to the agents. For com-
plex agent circumstances, however, it is difficult for
the designer to design a set of rules.
In this article, we propose a framework of cooper-
ative learning of BDI agents. In a multiagent circum-
stance, it is difficult for the designer to give a set of in-
ference rules for optimizing cooperative behavior. On
the other hand, it is not difficult to give inference rules
optimizing behavior of a single agent and the designer
often has some rough strategies for optimizing coop-
erative behavior. Thus, we think that the pseudo op-
timal cooperative behavior is acquired by using both
inference rules for a single agent and the rough strate-
gies for cooperative behavior. In our framework, we
use two types of agents, a task management agent
(TMA) and rational agents. TMA manages assign-
ment of tasks to rational agents. For efficient task
assignment, TMA uses rough strategies given by the
designer. In order to learn which strategy is useful for
a current situation, TMA learns the policy to switch
strategies on the basis of reinforcement learning. Ra-
tional agents receive a task and the strategy selected
by TMA. They evaluate the value of the strategy as
their intention. Then, TMA optimally assigns a task
to a rational agent by using both the value and the
rough strategies, and the rational agent processes the
task. The framework is implemented and applied to
an elevator group control problem.
The rest of this paper is organized as follows. In
the next section, we briefly explain the elevator group
control problem and the implementation of the prob-
lem by using BDI agents. In Section 3, we present a
framework of cooperative learning of BDI agents. In
Section 4, we specify our elevator group control prob-
lem. Some experimental results are shown in Section
5. And finally, Section 6 concludes the paper.
2 IMPLEMENTATION OF BDI
ELEVATOR AGENTS
2.1 The Elevator Group Control
Problem
High-rise buildings have several shafts and cars (also
called cages) in order to transport passengers effi-
ciently. It has recently been reported that a multi-car
elevator system and elevator group control are effective for
transportation (Valdivielso et al., 2008; Ikeda et al.,
2008). In such a system, a building has several shafts
and there are two or more cars in each shaft. Passen-
gers register their destination floor with the controller,
and the controller schedules which car transports them. Then, the
controller guides the passengers to the scheduled car.
The objective of the elevator group control problem is
to develop the controller that transports passengers as
efficiently as possible.
2.2 Implementing the Problem as a BDI
Multiagent System
In this article, we implement the elevator group con-
trol problem as a BDI multiagent system in Jadex
(Braubach and Pokahr, 2009). Our system consists
of four kinds of agents, a problem management agent
(PMA), an environment agent (EA), a task manage-
ment agent (TMA), and car agents (CAs). Passen-
gers, shafts and cars are instantiated as objects. PMA
defines an elevator group control problem such as the
number of shafts and the number of cars in each shaft.
Then PMA sets up an agent circumstance by calling
EA, TMA, and CAs. EA controls the agent circum-
stance. It creates passengers and informs the situa-
tion of both passengers and cars to TMA. When any
event occurs, such as passengers getting off a car,
EA updates the circumstance. TMA receives situa-
tions of passengers and cars from EA. When there are
any passengers for whom the car in charge has not
been decided, TMA schedules which car transports
them.
Figure 1: Architecture of the Elevator Agent System.
Each CA receives the information of passen-
gers that it should serve. Then, CA moves its car in
order to transport its passengers.
2.3 The Environment Agent
The environment agent (EA) controls the agent cir-
cumstance, i.e., it creates passengers as objects,
moves cars according to decision making of CAs and
discards passengers transported to their destination
floors. Each passenger i is treated as an object with
a tuple <t_i, s_i, d_i>. t_i is the time when i is created.
s_i and d_i are the source and destination floors of i, re-
spectively. When i is transported to d_i at time T, the
service time for i is defined as ST(i) = T − t_i. Then,
the average efficiency E defined by Eq. (1) is calcu-
lated. IST(i) is the ideal service time for i and it is
defined as the duration that a car moves from s_i to d_i
without stopping on the way. N is the total number of
passengers. In this article, we define the objective of
the elevator group control problem as minimizing E.

E = (Σ_{i=1}^{N} ST(i)) / (Σ_{i=1}^{N} IST(i))    (1)
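As a minimal illustration of Eq. (1), the following Python sketch (our own notation, not part of the authors' system) computes E from lists of recorded service times.

def average_efficiency(service_times, ideal_service_times):
    """Average efficiency E of Eq. (1): total actual service time divided by
    total ideal service time. E >= 1, and lower values mean better service."""
    return sum(service_times) / sum(ideal_service_times)

# Hypothetical example with three passengers: ST(i) and IST(i) in seconds.
print(average_efficiency([40.0, 65.0, 30.0], [20.0, 25.0, 15.0]))  # 2.25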
EA informs the situation of both passengers and
cars to TMA and CAs. According to the informa-
tion, TMA decides which car transports each passen-
ger and each CA selects an action such as up, stop
and down in order to transport their passengers. EA
receives actions selected by CAs and moves cars.
2.4 The Task Management Agent
The task management agent (TMA) decides which car
transports each passenger. Since the policy of TMA
greatly influences E, the designer of an agent system
should design the policy carefully.
One of the traditional methods to design the optimal
policy is reinforcement learning. We can construct
the percept space as the combination of situations of
cars and passengers. However, it is difficult to acquire
the optimal policy by reinforcement learning, since
the percept space becomes huge. Another method to
design the optimal policy is to use a BDI reasoning
engine. However, it is difficult for the designer to de-
fine inference rules for efficient transportation. That
is, it is difficult to design the optimal policy when we
use either reinforcement learning or a BDI reasoning
engine.
Fortunately, a designer often has some rough
strategies for efficient transportation, such as “the
nearest car from a passenger should transport him.”
In our implementation, we collect a couple of rough
strategies. Then, TMA is designed as a learning
agent, where TMA learns which strategy is useful for
a current situation. The details of the learning mechanism
are discussed in Section 3.
2.5 The Car Agents
We designed car agents (CAs) as rational agents,
where each CA can control one car. Five kinds of
plans are implemented for CAs. The list of plans is
shown below.
Go Plan. CA moves its car to a source or destination
floor of passengers.
Call Plan. CA evaluates the nearest source floor of
passengers waiting in an elevator hall.
Board Plan. Passengers take the car.
Transport Plan. CA evaluates the nearest destina-
tion floor of passengers being on board.
Get off Plan. Passengers get off the car.
CAs control their cars by switching these plans.
The Jadex BDI reasoning engine is used for selecting
a plan. Here, we define four inference rules (R1)–
(R4). i, j, k and m indicate passengers. call(i) is a
predicate indicating that the source floor of the pas-
senger i is the nearest from the car. transport(i) in-
dicates that the destination floor of i is the nearest
from the car. board(i) indicates that the car is stop-
ping at the source floor of i and i can board the car.
get_off(i) indicates that the car is stopping at the des-
tination floor of i. BEL(X) indicates that CA believes
X is true. GOAL(X) indicates that CA has a goal to
make X true. U is the tense operator “until”.

(R1) BEL(call(i)) ⇒ GOAL(call(i)) U (GOAL(transport(j)) ∨ GOAL(board(k)) ∨ GOAL(get_off(m)))

(R2) BEL(transport(i)) ⇒ GOAL(transport(i)) U (GOAL(board(j)) ∨ GOAL(get_off(k)))

(R3) BEL(board(i)) ⇒ GOAL(board(i)) U GOAL(get_off(j))

(R4) BEL(get_off(i)) ⇒ GOAL(get_off(i))
These inference rules give the first priority to pas-
sengers who get off the car, the second priority to pas-
sengers who board the car, and the third priority to
passengers who are on board and waiting to be transported by the
car. The passengers waiting in an elevator hall are
given the lowest priority. When the source floors of some
passengers are on the car's way, however, the car
exceptionally stops at those floors for them.
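The priority ordering encoded by (R1)–(R4) can be sketched as a simple dispatch function. The Python fragment below only illustrates that ordering; the actual system relies on the Jadex BDI reasoning engine, and the car predicates used here (can_get_off, can_board, and so on) are hypothetical helpers, not part of the implementation.

def select_plan(car):
    """Illustrative plan dispatch following the priority of (R1)-(R4):
    Get off > Board > Transport > Call (exceptional on-the-way stops omitted)."""
    if car.can_get_off():          # a passenger's destination is the current floor
        return "Get off Plan"
    if car.can_board():            # the car is stopping at a waiting passenger's source floor
        return "Board Plan"
    if car.has_passengers():       # head for the nearest destination of on-board passengers
        return "Transport Plan"
    if car.has_assigned_calls():   # head for the nearest source floor of waiting passengers
        return "Call Plan"
    return "Go Plan"               # otherwise the Go Plan moves the car toward its target floor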
3 COOPERATIVE
REINFORCEMENT LEARNING
When a passenger registers his destination floor in the
controller, TMA selects the CA that transports the pas-
senger. Then, the CA infers a schedule to transport the
passenger. That is, the elevator group control prob-
lem is considered as the problem of finding the optimal
policy to assign cars to passengers for minimizing E.
In order to find the optimal policy, we should con-
sider a huge number of situations. When we try to obtain
the policy by reinforcement learning only by TMA, it
might be difficult to find the optimal policy efficiently.
When we use a BDI reasoning engine for TMA, it is
difficult for the designer to give inference rules induc-
ing the optimal policy. Thus, we introduce coopera-
tive learning of TMA and CAs.
3.1 Framework of Cooperative
Learning
Figure 2 shows the framework of cooperative learn-
ing. The designer of the agent circumstance often has
some rough strategies for efficient transportation. In
our framework, TMA learns which strategy is useful
for a current situation. Reinforcement learning is used
for acquiring the policy of TMA.
When a passenger is created by EA, TMA evalu-
ates the value of each strategy. Each CA also evalu-
ates the value of assigning itself to the passenger on
the basis of reinforcement learning. By using values
evaluated by TMA and CAs, TMA selects CA transporting the passenger.
Figure 2: Framework of Cooperative Learning.
3.2 Learning the Policy of TMA
TMA learns values of strategies for a current situa-
tion. In this article, we gave two strategies to TMA.
S_1: The nearest car from a passenger should transport him.
S_2: The car whose load is the lowest should transport a passenger.
V(P_t, S_k), which is the value of the strategy S_k for
a current situation P_t, is evaluated by reinforcement
learning. P_t = {p_{1,t}, ..., p_{M,t}} is a current percept vec-
tor of TMA. p_{j,t} is a value of the j-th percept variable
at time t. Now, we assume that TMA assigns CA to
the passenger i at time t on the basis of the strategy S_k.
We also assume that the car transports the passenger
to his destination floor at time T (> t). Then, V(P_t, S_k)
is updated by Eq. (2).

V(P_t, S_k) ← V(P_t, S_k) + α (r_i − V(P_t, S_k))    (2)

r_i is a reward defined by IST(i)/ST(i). ST(i) is the
service time for i defined by ST(i) = T − t. IST(i) is
the ideal service time for i. α is a learning ratio.
When TMA assigns CA x to the passenger i, only
the car controlled by x can transport i. When the car
is full, i cannot board the car. In this case, TMA re-
ceives a negative reward r_i and V(P_t, S_k) is updated.
Then, TMA assigns CA to i again. During reinforce-
ment learning, TMA selects a strategy by the epsilon
greedy strategy. When the strategy S_k is selected,
TMA assigns CA to a passenger only by using S_k.
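A minimal tabular sketch of the update in Eq. (2), together with epsilon greedy strategy selection, is given below. The state encoding as a tuple of categorized percepts and the class interface are assumptions made for illustration, not the authors' implementation.

import random
from collections import defaultdict

class TMAStrategyLearner:
    """Tabular value learning for strategy selection, Eq. (2).
    A state P_t is a tuple of categorized percepts, e.g. (p_1, p_2)."""

    def __init__(self, num_strategies=2, alpha=0.8, epsilon=0.2):
        self.V = defaultdict(float)          # V[(P_t, k)] for strategy S_k
        self.num_strategies = num_strategies
        self.alpha = alpha                   # learning ratio
        self.epsilon = epsilon               # exploration rate

    def select_strategy(self, P_t):
        # epsilon greedy over the current strategy values
        if random.random() < self.epsilon:
            return random.randrange(self.num_strategies)
        return max(range(self.num_strategies), key=lambda k: self.V[(P_t, k)])

    def update(self, P_t, k, reward):
        # V(P_t, S_k) <- V(P_t, S_k) + alpha * (r_i - V(P_t, S_k))
        self.V[(P_t, k)] += self.alpha * (reward - self.V[(P_t, k)])

The reward passed to update would be IST(i)/ST(i) when passenger i is delivered, or the negative reward when the assignment fails because the car is full. The per-CA update of Eq. (3) in the next subsection has exactly the same form, keyed by (O_t, S_k) for each CA x.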
3.3 Learning Values by CAs
Each CA learns the value of assigning itself to a pas-
senger, v_x(O_t, S_k). O_t = {o_{1,t}, ..., o_{m,t}} is a current per-
cept vector of CA x at time t. S_k is the strategy se-
lected by TMA at t. Now, we assume that TMA as-
signs CA x to the passenger i at time t on the basis of
the strategy S_k. When the car controlled by x trans-
ports the passenger to his destination floor, a positive
reward, r_i = IST(i)/ST(i), is given to x. Otherwise,
x receives a negative reward. Then, v_x(O_t, S_k) is up-
dated by Eq. (3).

v_x(O_t, S_k) ← v_x(O_t, S_k) + α (r_i − v_x(O_t, S_k))    (3)
3.4 The Value of a Task Assignment
After the trial and error processes (the learning phase)
have been completed, TMA knows which strategy is use-
ful for a current situation and each CA also has the
value of assigning a passenger to it. Then, we can de-
fine the criterion to select a car for a passenger. Equa-
tion (4) defines the value of assigning x to a passenger
at time t. CA x maximizing C(x) is selected as the
most suitable CA for the current situation. In Eq. (4),
K is the number of strategies given by the designer.
C(x) = Σ_{k=1}^{K} V(P_t, S_k) · v_x(O_t, S_k)    (4)
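Assuming the tabular values above are stored as dictionaries, the assignment rule of Eq. (4) could be sketched as follows; the data layout is our assumption for illustration.

def assign_passenger(tma_V, P_t, car_values, car_percepts, num_strategies=2):
    """Select the CA x maximizing C(x) = sum_k V(P_t, S_k) * v_x(O_t, S_k).

    tma_V        -- dict mapping (P_t, k) to V(P_t, S_k)
    car_values   -- dict mapping CA id x to its table v_x, keyed by (O_t, k)
    car_percepts -- dict mapping CA id x to its current percept tuple O_t
    """
    def C(x):
        return sum(tma_V.get((P_t, k), 0.0)
                   * car_values[x].get((car_percepts[x], k), 0.0)
                   for k in range(num_strategies))
    return max(car_values, key=C)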
4 SPECIFICATION OF THE
PROBLEM
4.1 Specification
Table 1 shows the specification of our elevator group
control problem. The values of most parameters are
equal to the values defined in Ref. (The IEICE En-
gineering Sciences Society, Concurrent System Tech-
nology, 2009). The number of cars in a shaft is two in
the reference. When there are multiple cars in a shaft,
it is difficult for us to give inference rules for con-
trolling cars optimally. Thus, we assume that there
is a single car in a shaft. When we can give infer-
ence rules for controlling multiple cars in a shaft, the
proposed framework is applicable to the multi-car el-
evator system. When EA creates a passenger, his ac-
companying persons might be created. In this article,
we call the passenger and his accompanying persons
the passenger group. The number of persons in a pas-
senger group y is decided on the basis of the Poisson
distribution f (y = n). Equation (5) is the probability
that y = n. λ is 4. The source and destination floors
are decided on the basis of the uniform distribution.
f(y = n) = e^{−λ} λ^n / n!    (5)
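A hedged sketch of how a passenger group size could be drawn from the distribution of Eq. (5) with λ = 4 is shown below; whether a drawn size of 0 should be rejected and redrawn is not stated in the paper, so the sketch simply returns the sampled value.

import math
import random

def group_size(lam=4.0, rng=random):
    """Sample a passenger group size from Poisson(lam), Eq. (5), by CDF inversion."""
    u = rng.random()
    n, p, cdf = 0, math.exp(-lam), 0.0   # p = e^-lam * lam^n / n! for n = 0
    while cdf + p < u and p > 0.0:
        cdf += p
        n += 1
        p *= lam / n                     # recurrence for the next Poisson term
    return n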
Table 1: Specification of the Elevator Group Control Problem.
The number of floors: 30
The number of shafts: 4
The number of cars in a shaft: 1
The distance between adjacent floors: 4 [m]
The maximum speed of a car: 6 [m/sec]
The acceleration of a car: ±2 [m/sec^2]
The capacity of a car: 20 [persons]
The duration of the simulation: 20,000 [secs]
The number of passengers per minute (before 10,000 secs): 30
The number of passengers per minute (after 10,000 secs): 34
4.2 Percept Vectors
We assume that TMA can sense two kinds of percepts,
the number of passenger groups towards upper floors,
up, and the number of passenger groups towards lower
floors, down. For efficient reinforcement learning, the
percept values are categorized. Here, we translate the
percept variables up and down into the categorized
percept variables p_1 and p_2, respectively. When up ≤
2, p_1 is set to 1. When 2 < up ≤ 5, p_1 is set to 2.
Otherwise, p_1 is set to 3. down is also translated to
p_2 (∈ {1, 2, 3}) in the same manner as up.

CA x can sense three kinds of percepts, the re-
maining capacity of the car, remainder_x, the distance
between the car and the current passenger, distance_x,
and the load of the car, load_x.

remainder_x is defined by Eq. (6).

remainder_x = capacity − (#PW + #PB_x)    (6)

capacity is the capacity of a car. #PW is the number
of persons of the passenger group i that TMA cur-
rently tries to assign to CA. #PB_x is the number of
passengers in the car controlled by x. Here, we trans-
late the percept variable remainder_x into the catego-
rized variable o_1. When remainder_x ≤ 8, o_1 is set to 1.
When 8 < remainder_x ≤ 20, o_1 is set to 2. Otherwise,
o_1 is 3.
distance_x, defined by Eq. (7), is the pseudo dis-
tance between the car controlled by x and the passen-
ger group i.

distance_x = dist + β · #ST_x    (7)

dist is evaluated by using the situations of both the
car and i. Now, we define the situation of the car as
<CF_x, DF_x>. CF_x is the floor that the car is in. DF_x is
the farthest destination floor of passengers that are in
the car. The situation of i is <t_i, s_i, d_i>. When s_i is
on the way of the car, i.e., (d_i − s_i)(DF_x − CF_x) > 0
and (s_i − CF_x)(DF_x − CF_x) > 0, then dist is evalu-
ated as |s_i − CF_x|. Otherwise, dist is evaluated as
|DF_x − CF_x| + |DF_x − s_i|, since the car should trans-
port all passengers in x before transporting i. #ST_x is
the predicted number of stops before arriving at the
source floor of i. β is a parameter indicating the cost
of a stop, which is defined as 2. The percept vari-
able distance_x is translated into the categorized vari-
able o_2. When distance_x ≤ 1, o_2 is set to 1. In the
cases of 1 < distance_x ≤ 10 and 10 < distance_x ≤ 25,
o_2 is set to 2 and 3, respectively. In the cases of
25 < distance_x ≤ 40 and 40 < distance_x ≤ 70, o_2 is
set to 4 and 5, respectively. Otherwise, o_2 is 6.
The third percept variable, load_x, indicates the load
of x defined by Eq. (8). N_t is the number of passen-
ger groups in the circumstance at time t and n_x is the
number of passenger groups assigned to x. load_x is
also translated into the categorized variable o_3. When
load_x = 0, o_3 is set to 1. When 0 < load_x ≤ 1/6, o_3
is set to 2. When 1/6 < load_x ≤ 1/3, o_3 is set to 3.
Otherwise, o_3 is 4.

load_x = n_x / N_t    (8)
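The categorizations of this section can be summarized as a small threshold lookup, sketched below. The upper bounds are read as inclusive, matching the text above; the helper names are ours, not part of the authors' implementation.

def bucket(value, uppers):
    """Return the 1-based index of the first upper bound that value does not
    exceed, or len(uppers) + 1 if it exceeds them all."""
    for idx, upper in enumerate(uppers, start=1):
        if value <= upper:
            return idx
    return len(uppers) + 1

def categorize_percepts(up, down, remainder_x, distance_x, load_x):
    """Map raw percepts to the categorized variables p_1, p_2, o_1, o_2, o_3."""
    p_1 = bucket(up, [2, 5])                        # 1..3
    p_2 = bucket(down, [2, 5])                      # 1..3
    o_1 = bucket(remainder_x, [8, 20])              # 1..3
    o_2 = bucket(distance_x, [1, 10, 25, 40, 70])   # 1..6
    o_3 = 1 if load_x == 0 else bucket(load_x, [1 / 6, 1 / 3]) + 1  # 1..4
    return p_1, p_2, o_1, o_2, o_3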
5 EXPERIMENTAL RESULTS
5.1 Experimental Setup
Here, we compare four kinds of methods shown be-
low.
S_1) The nearest car from a passenger should transport him.
S_2) The car whose load is the lowest should transport a passenger.
Learning) TMA switches the strategy according to the values of strategies S_1 and S_2.
Cooperative) The proposed method. TMA assigns passengers to CA on the basis of Eq. (4).
In Learning, TMA does not evaluate C(x). That is,
Learning does not use v_x(O_t, S_k) for car assignment.
α in Eqs. (2) and (3) is 0.8. ε for the epsilon
greedy strategy is 0.2. A negative reward is defined
as −0.1. These parameter values are decided by per-
forming preliminary experiments.
5.2 Results
Figure 3 shows the changes of the objective func-
tion E during the learning phase. Since Learning and
Cooperative are equivalent in the learning phase,
results for Cooperative are omitted. In Fig. 3, E for
Learning is the lowest. Thus, the policy for switching
strategies has been acquired well.
Figure 3: Changes of E during the learning phase.
Figure 4: Results of cooperative assignment.
Figure 4 shows the results of cooperative assign-
ment. Here, we use a passenger sequence that is dif-
ferent from the sequence used in the previous experi-
ment. Cooperative assigns a passenger to CA by us-
ing Eq. (4). In Learning, the optimal strategy is se-
lected according to V(P_t, S_k). When a passenger se-
quence is changed, Learning is more efficient than
S_1 and S_2. E for Cooperative is the lowest of all. That
is, V(P_t, S_k) and v_x(O_t, S_k) are acquired adequately by
reinforcement learning, and C(x) is a good criterion
for efficient assignment of a car.
6 CONCLUSIONS
We have proposed a cooperative reinforcement learn-
ing method for rational agents, and the method has
been applied to the elevator group control problem.
Experimental results show that the proposed method ac-
quires better rules than the methods without coopera-
tive learning.
In this article, however, we have not applied our
method to a multi-car elevator system, and percept
variables such as remainder_x are discretized manually.
Improving our method to overcome these problems
remains as future work.
ACKNOWLEDGEMENTS
The authors would like to thank the anonymous re-
viewers for their valuable comments.
REFERENCES
Braubach, L. and Pokahr, A. (accessed 11 August, 2009).
Jadex BDI agent system – overview.
http://vsis-www.informatik.uni-
hamburg.de/projects/jadex/.
Doniec, A., Espié, S., Mandiau, R., and Piechowiak, S.
(2006). Non-normative behaviour in multi-agent sys-
tem: Some experiments in traffic simulation. In Pro-
ceedings of IAT2006, pages 30–36.
Excelente-Toledo, C. B. and Jennings, N. R. (2003). Learn-
ing when and how to coordinate. Web Intelligence and
Agent Systems, IOS Press, 1(3-4):203–218.
Ikeda, K., Suzuki, H., Kita, H., and Markin, S. (2008).
Exemplar-based control of multi-car elevators and
its multi-objective optimization using genetic algo-
rithm. In The 23rd International Technical Confer-
ence on Circuits/Systems, Computers and Communi-
cations, pages 701–704.
Katagiri, H., Hirakawa, K., and Hu, J. (2000). Genetic net-
work programming - application to intelligent agents
-. In Proceedings of SMC2000, pages 3829–3834.
Rao, A. S. and Georgeff, M. P. (1991). Modeling rational
agents within a BDI-architecture. In Proceedings of
KR’91, pages 473–484.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learn-
ing. The MIT Press.
The IEICE Engineering Sciences Society, Concurrent Sys-
tem Technology (accessed 11 August, 2009). CST so-
lution competition 2008.
http://www.ieice.org/ cst/compe08/. (in Japanese).
The RoboCup Federation (accessed 11 August, 2009).
http://www.robocup.org/.
Valdivielso, A., Miyamoto, T., and Kumagai, S. (2008).
Multi-car elevator group control: Schedule comple-
tion time optimization algorithm with synchronized
schedule direction and service zone coverage oriented
parking strategies. In The 23rd International Techni-
cal Conference on Circuits/Systems, Computers and
Communications, pages 689 – 692.