maze structure into an infrastructure environment as a grid world with a set of paths where each cell represents a state. Characteristics such as corridors, T-junctions, intersections and L-turns remain. At each time step an agent perceives a state and selects one of four actions: North, West, South or East, which brings it to the next state. Depending on the orientation in which it reaches a state, each agent has four possible actions to select from. Second, we consider the dots distributed around the environment. The agents must eat all dots, which have fixed and variable values, before a new episode begins. Finally, multiple agents that represent the ghosts are the last adaptation of the game.
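For illustration only, a minimal sketch of such a grid world is given below (assuming a Python encoding, a row/column coordinate convention and a wall set; none of these details come from the original implementation):

    MOVES = {"North": (-1, 0), "South": (1, 0), "West": (0, -1), "East": (0, 1)}

    class GridWorld:
        """Cells are states; the four compass actions move the agent to the next cell."""
        def __init__(self, width, height, walls):
            self.width, self.height = width, height
            self.walls = set(walls)                 # (row, col) cells that cannot be entered

        def step(self, state, action):
            """Return the state reached by taking `action` from `state` (stay put if blocked)."""
            dr, dc = MOVES[action]
            nxt = (state[0] + dr, state[1] + dc)
            inside = 0 <= nxt[0] < self.height and 0 <= nxt[1] < self.width
            return nxt if inside and nxt not in self.walls else state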
The tasks that each agent can perform are: Patrol, Homing and Avoiding. Patrol causes the agent to move throughout the environment following a path by deciding which actions to take. Homing causes the agent to rest in order to recharge its battery. Finally, Avoiding causes the agent to avoid obstacles or other robots. Each agent switches its internal task between Patrol and Homing according to its battery life value. Therefore, the learning task consists of finding a mapping from states to actions for cooperative patrol, where each agent learns how to select the action with the highest value. Each agent implements a set of adaptation rules. Individually, each agent has a map, whereas from a group perspective the rules are triggered by pheromone-like communication among the robots, so that at each time step they inform the others of the state they have explored (Khamis et al., 2006).
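A minimal sketch of this task switch is shown below; the threshold values and the hysteresis between them are assumptions added for illustration, since the text does not specify them:

    LOW_BATTERY = 0.2    # assumed fraction of charge below which the agent goes Homing
    FULL_BATTERY = 0.95  # assumed fraction above which the agent resumes Patrol

    def select_task(current_task, battery_level):
        """Switch between 'Patrol' and 'Homing' based on the battery life value."""
        if current_task == "Patrol" and battery_level < LOW_BATTERY:
            return "Homing"
        if current_task == "Homing" and battery_level >= FULL_BATTERY:
            return "Patrol"
        return current_task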
3 REINFORCEMENT LEARNING
A reinforcement learning model consists of a discrete set of environment states, S; a discrete set of actions, A; and a set of scalar reinforcement signals. In this
model, an agent learns a mapping from situations to
actions by trial-and-error interactions with the envi-
ronment to achieve a goal. This environment must be
at least partially observable. At each time step t ∈ T each agent receives an indication of the current state s_t ∈ S of the environment, then it chooses an action a_t ∈ A to generate an output which changes the state of the environment to s_{t+1} ∈ S, and the value of this state transition is indicated to the agent through a scalar r_t known as the reward (Sutton and Barto, 1998). A reward
defines the goal in a reinforcement learning problem.
It maps each perceived state or state-action pair to a
single numerical value that indicates the intrinsic de-
sirability of that state. An important reward property is known as the Markov property: a reward with this property must include the immediate sensation and retain all relevant information from the past (Puterman, 1994). Thus, the agent learns to perform actions that
maximize the sum of the rewards received when start-
ing from some initial state and proceeding to a ter-
minal one. The reward function must necessarily be unalterable by the agent; it only serves as a basis for altering a policy π. In this implementation, mainly dot values have been used as rewards. A policy is a mapping from each state, s ∈ S, and action, a ∈ A(s), to the probability π(s, a) of taking action a when in state s. A stationary policy, π : S → Π(A), defines a
probability distribution over actions. A policy is the
core of the agent since it defines which action must
be performed at each state. Thus, the objective of re-
inforcement learning is to develop an agent with a behavior policy that chooses actions which tend to increase the long-run sum of reward values.
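As an illustration of a policy as a mapping from states to action probabilities, the following sketch samples an action according to π(s, a); the dictionary representation and the example probabilities are assumptions, not part of the implementation described here:

    import random

    def sample_action(policy, state):
        """Draw an action a with probability π(state, a)."""
        actions, probs = zip(*policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]

    # Hypothetical example: in state (2, 3) the agent mostly heads East.
    policy = {(2, 3): {"North": 0.1, "West": 0.1, "South": 0.1, "East": 0.7}}
    action = sample_action(policy, (2, 3))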
We have implemented an off-policy temporal-difference algorithm known as Q-learning, which learns directly from raw experience without a model of the environment and updates its estimates based in part on other learned estimates, without waiting for a
final outcome. A model consists of the state transition probability function T(s, a, s′) and the reinforcement function R(s, a). However, reinforcement learning is concerned with how to obtain an optimal policy when such a model is not known in advance (Watkins and Dayan, 1992). The objective of Q-learning is to learn the action-value function Q by applying the rule Q(s_t, a_t) ← Q(s_t, a_t) + α[r + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where ⟨s_t, a_t, r, s_{t+1}⟩ is an experience tuple. If each action is executed in each state an infinite number of times over an infinite run and α is decayed appropriately, the Q-values will converge with probability 1 to their optimal values Q* (Kaelbling et al., 1996). An action-value function for policy π defines the value of taking action a in state s under policy π, denoted by Q^π(s, a), as the expected reward when starting from s, taking action a, and thereafter following policy π.
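A minimal sketch of this tabular update is given below; the learning rate, discount factor and table representation are assumed values chosen for illustration:

    from collections import defaultdict

    ALPHA, GAMMA = 0.1, 0.9              # assumed learning rate α and discount factor γ
    ACTIONS = ("North", "West", "South", "East")
    Q = defaultdict(float)               # Q[(state, action)], arbitrarily initialized to 0

    def q_update(s, a, r, s_next):
        """Q(s,a) <- Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]."""
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])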
The general form of the Q-learning algorithm consists of nine steps, as described below.
1 Initialize Q(s,a) arbitrarily
2 Repeat (for each episode):
3     Initialize s
4     Repeat (for each step of episode):
5         Choose a from s using policy derived from Q
6         Take action a, observe r, s'
7         Apply the Q-learning update rule
8         s <- s'
9     Until s is terminal
In this implementation an episode terminates when all dots are eaten, and each step of an episode consists of choosing an action a when in state s, updating Q(s, a) and moving to s′.
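Putting these steps together, one possible episode loop is sketched below, reusing Q, ACTIONS and q_update from the previous sketch; the ε-greedy action choice and the environment interface (env.reset, env.step, env.is_terminal) are assumptions added for illustration:

    import random

    EPSILON = 0.1   # assumed exploration rate for an ε-greedy policy derived from Q

    def choose_action(s):
        """ε-greedy: usually the highest-valued action, occasionally a random one."""
        if random.random() < EPSILON:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])

    def run_episode(env):
        s = env.reset()                    # steps 2-3: start a new episode in some state
        while not env.is_terminal(s):      # step 9: repeat until all dots are eaten
            a = choose_action(s)           # step 5: choose a from s using the policy derived from Q
            s_next, r = env.step(s, a)     # step 6: take action a, observe r and s'
            q_update(s, a, r, s_next)      # step 7: apply the Q-learning update rule
            s = s_next                     # step 8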