Disruption Recovery within Agent Organisations in Distributed Systems
Asia Al-karkhi and Maria Fasli
School of Computer Science and Electronic Engineering, Essex University,
Wivenhoe Park, Colchester CO4 3SQ, U.K.
Keywords:
Disruption, Multi-agent System, Organisation, Head, Henchman Recovery Protocol, Service Provider.
Abstract:
One of the challenging problems in distributed systems is dealing with agent failure. In this paper, we present
an approach for task recovery in a distributed system where the agents self-organise themselves in organisa-
tions in order to execute tasks more efficiently. However, within this setting, unpredictable events can happen
and agents can fail leading to task failure and weaker overall system performance. We present the Hench-
man recovery protocol which enables the agents within an organisation to maintain task execution and recover
tasks in the event of agent failure. We show how the protocol helps to maintain the efficiency of the cre-
ated organisations through a series of experiments in a simulated distributed task execution system which has
been implemented in Repast Simphony. The experimental results demonstrate the robustness of the proposed
solution in a number of settings.
1 INTRODUCTION
In distributed systems, it is very often the case that
agents may not be able to carry out tasks because it
may be busy or not active, hence delegation is one
of the available options. Commonly, no agent can be
designed with full knowledge about other agents in
the vicinity of a network of agents. In complex agent
systems, agents may form groups, coalitions or organ-
isations under certain conditions in order to improve
the execution of tasks or the utilisation of resources or
improve the agent connections, (Corkill et al., 2015;
Horling and Lesser, 2004; Dignum, 2009).
This work is mainly targeting data centres in the
cloud to supply a more efficient service level to the
customers’ requests by providing both a theoretical
framework and a simulated model. In our proposed
environment we will recruit a multi-agent to encap-
sulate heterogeneous types of resources to simulate
data centres that can be accessed on demand by many
customers for any amount of time. Therefore, all the
explanation in this paper is to demonstrate how the
agents recruitment process will be carried out. Fur-
thermore, to simulate the customer side, we have cre-
ated a customer agent that asks for services by send-
ing customers’ tasks requests to the agents network
using web connections.
However, one of the main challenges in open and dis-
tributed systems is the occurrence of unpredictable
events which can affect the individual agent perfor-
mance and in consequence the overall system perfor-
mance. For instance, agents can fail and this may
mean that any action(s) or task(s) that an agent has
taken on would also fail to be executed. Agent fail-
ure within an organisation also creates problems as
agents will fail to execute tasks and other agents may
still be delegating tasks to them if they are unaware
that they are no longer in operation. Hence, appro-
priate mechanisms are needed to handle agent failure
within organisations so that their functionality and ef-
fectiveness can be maintained.
The main emphasis of this paper is on devising mech-
anisms that can be utilized in the context of organisa-
tions within distributed systems that would make the
organisation of autonomous agents more efficient in
handling failures and hence maintain their efficiency
and effectiveness. The specific scenario that we are
exploring is based on a scaled free network (Barabsi
and Bonabeau, 2003) which is a common network
topology, often found in social networks, and infras-
tructure networks (e.g. electricity grids, communica-
tion networks and the Internet. In the implemented
network model, the agents may decide to create or-
ganisations to improve the number of executed tasks
and increase their utility and make better utilisation
of their resources. Hence, we show that the agent or-
ganisations can be more resilient to unexpected dis-
ruption that could affect the system performance. We
Al-karkhi, A. and Fasli, M.
Disruption Recovery within Agent Organisations in Distributed Systems.
DOI: 10.5220/0006574203710378
In Proceedings of the 10th International Conference on Agents and Artificial Intelligence (ICAART 2018) - Volume 2, pages 371-378
ISBN: 978-989-758-275-2
Copyright © 2018 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
371
present a protocol for recovering from agent failure,
called “Henchman Recovery Protocol (HRP)”, whose
task is to maintain the functionality of the created or-
ganisations.
The rest of the paper is organised as follows. Section
2 discusses the related work in the literature. Sec-
tion 3 presents the system scenario which includes
descriptions of the organisations, Members and cus-
tomer agent with task message formats. Section 4
presents the HRP, and also introduces the protocol
that is used by each Henchman inside the created or-
ganisation to monitor the Head. Sections 5 presents
the experimental work. Finally, the paper ends in sec-
tion 6 with the conclusion and proposals for future
work.
2 RELATED WORK
The main aim for designing peer to peer networks is
to apply fair and sensible allocation of the spare re-
sources using different resource allocation methods.
However, the existence of disruption and other obsta-
cles that are sometimes associated with data transmis-
sion such as machine being offline or high progression
incited by user activities, may lead to minimise the
correct utilisation of the available resources (Botev
et al., 2015).
In grid distributed systems, wireless networks and
cloud computing, failure events such as improper sys-
tem service or physical failure is still under research.
Hence, this work is overlapped with grid networks
and cloud data centres due to the great similarity be-
tween these systems in term of concepts and hetero-
geneous types of resources as have been explained in
(Foster et al., 2008).
Researchers (Tesauro et al., 2004) have emphasized
that distributed systems should provide self-healing,
self-configuration and self-protection. In their pro-
posed solution, they provided a software solution
called “Unity” that managed a data centre using an
autonomous multi-agent system. They have used
the agents to self-heal and self-organise the com-
putational resources by creating 17 clusters of self-
management agents in the data centre, so many cus-
tomers can access the resources at the same time.
Their work has been implemented only on a small
size data centre and to extend their work for indus-
try purposes, a bigger size of data centre is essential.
In our created models the agent’s network can create
a second layer (organisations) depending on the size
of the available network. (Gutierrez-Garcia and Sim,
2010), also focus on employing agents to improve the
service level in clouds. They have suggested cus-
tomer agents, broker agents, service provider agents
and resource agents. These agents are using self-
organisation technique to provide composition ser-
vices from the cloud to the customers and support
achieving a service level agreement. However, in their
work, agents have been implemented without deci-
sion making capability, while in our work, adding
such capability to the agents has enhanced the service
level provided to the customers.
Multi-agent systems have been borrowing recovery
protocols from other domains. For example, client
server, world wide web, peer to peer network and mo-
bile ad-hoc network. Disruption can happen when the
resources are limited with limited access to certain
type of resources or when there is a more likely in-
direct access to the resources.
Rollback recovery protocols have been recruited in
different distributed environments, as in (Elnozahy
et al., 2002). The authors have presented a survey to
distinguish between different types of rollback recov-
ery protocols and compared their performance. The
first one is the checkpoint based protocol, which is
based on choosing a checkpoint to restore the system
to that point. The second one is the log based proto-
col, which is a combination between the checkpoint
protocol and log in information protocol. These pro-
tocols deal with nodes in a network as groups of pro-
cesses that communicate between each other. These
interactive processes are accessing a storage appli-
ance periodically to save recovery information which
could have at least a checkpoint state for those pro-
cesses during passing messages to each other. They
could then be used when they would return to active
after processing the disruption. However, the pro-
cesses could be working on old information after re-
turning to their original state, something which might
be no longer required by the system.
(Miyashita et al., 2015a; Miyashita et al., 2015b) have
claimed their work is to solve resource conflict. We
observe that their work is based on the Team For-
mation (T F) game to create teamwork as a solution.
Their system will not produce teams unless the sys-
tem is very busy or there is a demand to create them.
However, the TF game is not able to work with a large
numbers of agents, i.e. a large network. Hence, they
have not indicated the maximum network size and the
number of created organisations. A team formaliza-
tion process depends only on the reciprocal agents,
i.e. agents which have previous knowledge about
each other during the network construction time. In
our work, the experimental work shows the maximum
number of agents could be 5000. In addition, the busy
agents are making use of other non-busy agents in the
organisation creation process to execute more tasks.
ICAART 2018 - 10th International Conference on Agents and Artificial Intelligence
372
3 MODEL SCENARIO
The system consists of a set of agents
A=
{
a
1
, a
2
, ..., a
m
}
; these agents will enter the
environment one after another. The number of con-
nections between the agents will be created within
a specific limit X, i.e. each agent should maintain
its connections to be no more than the value of X
to avoid creating a centralized network which is
quite similar to the scale free network (Barabsi and
Bonabeau, 2003). The network is then constructed by
exchanging messages between the agents, these mes-
sages consist of (Agent
ID
, Resources
0
in f ormation) to
be sent over each connection. An agent’s connections
are not fixed; later on, the agents will change their
connections for better performance and utilisation.
Consequently, the created network is a heterogeneous
environment, i.e. agents have different types of
resources to accomplish as many types of tasks
as possible. During the connection process, each
agent will share its contact details with its randomly
selected neighbour (s), So that each agent will have
known contacts in its contact list.
In addition, to simulate the semantic type of resources
which already exists in the internet, the resources
have been represented as a vector of integer values,
where agent resources are AR =< r 1, r 2, r 3 >,
r i=(0 4). To this end we have explained the created
network of agents.
Within the simulation model there is a customer
agent which is an entity that is in charge of sending
customers’ tasks to the network of agents. Each
customer will send tasks to the network through each
cycle of the simulation time. The tasks will be sent
as a message called customer message CM which
contains the following parameters: CM ={CID, T ID,
RV , T T L, RA, T D}, where:
CID: The Customer ID is a unique identifier
which is used to identify customers and en-
able communication between them and service
providers regarding the status of tasks (in task
queue, executed, or failed).
T ID: Task ID is a unique identification number
given to each task, where each customer can send
a different number of tasks in each cycle.
RV : Resource Vector represents a sequence of
simulated resources, RV = < r1, r2, r3 >, which
may be different from task to task.
T T L: Time to Live is number of hops for the cus-
tomer message to traverse through the network of
agents.
RA: Required Accuracy represents the required
accuracy for matching the customer resources
with the agent’s resources. Its accuracy values be-
tween (0 12) are pre-agreed between the cus-
tomer and the agents.
T D: The Task Deadline represents the deadline
by which the customer would expect to have the
result of the task execution back.
When the customer agent sends a customer task to an
agent in the network, this task will only be accepted
and executed by the receiving agent in the network if
it meets certain conditions that are checked by each
agent. The first such condition is a matching process
between the customer resource vector and the receiv-
ing agent resources using the well-known Manhattan
Distance. The resulting matching value should be met
with the task required accuracy, which is a specific
value between (0 12).
For example, if a customer RV is <0, 0, 0> with a RA
equal to 6, and the recipient agent RV is <2, 2, 2>,
then when applying the Manhattan Distance equation,
the match has then occurred. The second condition
checked is the T T L, if T T L = 0, the task will be con-
sidered failed, otherwise, if T T L > 0 the receiving
agent will check the T D of the task; if it is sufficient
then it will be executed, but if its not sufficient (ei-
ther because the received agent is currently executing
a task or its queue of tasks has number of tasks) the
agent will then delegate the task to a neighbour agent
in the network. Hence, if an agent cannot satisfy at
least one of the conditions mentioned above, the task
will be either failed or delegated to another agent in
the network and so on.
In this network environment, the agents can fail (con-
trolled by a probability) to simulate the situation when
agents can be offline and are unable to accept mes-
sages and execute tasks for a period of time. The po-
tential of an agent failing is controlled by a probability
based on a uniform distribution to simulate real-life
network failure occurrences. When a customer agent
sends tasks to the network, a feedback from the re-
ceiving agent should be issued in return that must be
one of the following possibilities:
“Task has been executed”: If an agent has ac-
cepted the task.
“Task in AT Q”: If the receiving agent is able to
execute the task but is currently executing another
task and the deadline of the new task is within the
time consideration of the receiving agent, then it
will be added to the agent Accepted Task Queue
AT Q.
“Task has been failed”: If any one of the condi-
tions has not been met, the T D and T T L =0.
Disruption Recovery within Agent Organisations in Distributed Systems
373
3.1 Organisation Creation Process
In a network of agents, an agent is called Busy if it
accepts a task and is currently Busy executing it for a
period of time. We have defined another status called
“Busy agent and AT Q > K which means an agent
that is currently Busy, it has just received another task
from the customer and its ATQ contains K number
of accepted tasks, K 6 W , where W is the maximum
size of AT Q. An agent with these conditions is one
that can start the organisation creation process and by
default it is the Head of the organisation. The newly
created organisations emerge based on the gossip al-
gorithm, (Serugendo et al., 2011).
In earlier work that is presented in (Al-karkhi and
Fasli, 2017), we have used a gossip algorithm where
the Head is sending a multicast message to randomly
selected neighbours asking the agents to join its or-
ganisation and be its service providers.
But here in Algorithm 1, we have introduced a better
usage based on the gossip algorithm where the Head
sends the multicast message to all directly connected
neighbours to create the organisation. in both works,
the decision of accepting to join or not, depends on
the status of the agents of being busy or less busy at
that time. The less busy agents are the most likely
ones which would accept to join the organisation in
order to improve their resources utilisation. An “ac-
cept to join” message will be sent to the Head with
its contact details. If the Head’s message encounters
a busy agent, that agent will only convey the message
to other agents in the network. The size of an organi-
sation may vary based on the status of the agents that
receiving the Head’s message. The Head’s message
will continue to be transmitted over the network until
it returns back to the Head or it’s T T L has expired.
To this end, the system has added a virtual layer “an
organisation” over the network of agents. We believe
that the creation of organisations will lead to better
utilisation of the system resources.
The grid and cloud computing are focusing on provid-
ing a powerful range of resource sharing in their dis-
tributed environment. But, it has been found that in an
organisation that is part of a grid computing system, a
computational node can be underutilised and not meet
its full power. i.e it can be only utilised or be busy less
than 5% of the time (Haider and Nazir, 2016). We are
aiming to increase the agent’s resources utilisation by
allowing an agent to join more than one organisation.
This lets us execute more tasks and makes the system
more resistant to the failure. For example, if a Mem-
ber agent receives a task and cannot execute it, then it
will send the task initially to the Head of the first or-
ganisation that the agent had joined (since it can join
more than one organisation). The Head will send the
task to its Members if one of them can not accept the
task, the Head can send the task to another Member
that can satisfy the required resources and accuracies
and so on for all other joined organisations. If the
task cannot be executed inside any joined organisa-
tions, then the Member will work as if there is no
organisation i.e the task will be delegated to one of
the Member’s neighbours with the TTL and deadline
constraints.
However, there is still a number of constraints on the
resulting self-organised system and on the resulting
task execution values such as the existing network
size, the number of tasks being issued in each cy-
cle and the simulation running time. So, in our work
we have invented a mechanism named the HRP as an
added solution for the creation process of agent organ-
isations. The first part of this protocol has been im-
plemented during runtime of the gossip algorithm by
calling “BeMyHenchman” which will be described in
section 4.1 in this paper. The second part is explained
in section 4.2.
Algorithm 1 : Gossip Protocol.
Select an agent a
i
while true do
cycle i
a
i
.TransmitTo ([N] targets (agents)), [N] is the
local membership contact list of agent a
i
.
if [N] are non-Busy agents then
a
i
.Infect (gossip message)
They have the option to join or not to the cre-
ated organisation.
call “BeMyHenchman” to select an agent to
be Henchman.
else
The Busy agents will be used only to transmit
the message to their peers.
end if
cycle i=cycle i+1
The receiving agent in the last period broadcasts
the gossip message to its peers.
end while
3.2 Model Description
This section presents the description for the second
layer (organisations) above the existing agent’s net-
work.
Every created organisation in the system has the fol-
lowing format:Org= {Org
ID
, Org
H
,
a
1
, ..., a
j
, HM,
Z},where:Org
ID
: a unique identifier for each created
organisation. Org
H
: is the name of the Head for an or-
ICAART 2018 - 10th International Conference on Agents and Artificial Intelligence
374
ganisation.
a
1
, ......, a
j
: the Members in the organi-
sation at a specific time. HM: is the Henchman name,
the Head’s follower in each created organisation. Z: is
the maximum number of accepted Members to join in
each organisation which is a setting parameter for all
the organisations but the number of Member joined in
each organisation may vary from one organisation to
another as mentioned in section 3.1.
3.3 Model Implementation
The following are the principles for the implemented
model.
a
i
is an agent that can join f number of organisa-
tions as a maximum but it can decide to join only
a random number g of f , where g 6 f .
The role an agent a
i
in an organisation (Org) is a
service provider, where: Org 6=
/
0 and 1 6i6j.
The roles for a
i
within an organisation Org can be
{Head, Service Provider, Henchman}.
A Head in each organisation is responsible for for-
warding the message to its Members bearing the
customers task message. The function of a Head
in an organisation is to coordinate the work in its
organisation. The Head may also act as Service
Provider to maximise its resources utilisation.
4 HENCHMAN RECOVERY
PROTOCOL (HRP)
After implementing the organisation’s model in the
previous section, we have seen that the number of ex-
ecuted tasks has been increased, and we have demon-
strated that with different network sizes in the de-
picted Figures in section 5 below. The black line in
the figures indicates the organisation’s model perfor-
mance. However, the environment is dynamic, so that
we need more powerful management to be imposed
on the structure of the organisations to provide an ef-
ficient service to the customers, mainly to detect the
Head failing.
Our method focuses on providing the Head of each
organisation with a follower called Henchman.
The Head is the one which is responsible for dis-
tributing the tasks to its Members in the organisa-
tion, so that when the Head has failed, the Hench-
man will act as a substitute to the failed Head provid-
ing self-organised capability to the system to maintain
the functionality of the created organisation. Since
the customer is sending tasks to the network, there is
a risk from sending the tasks to a failed Head, and
here the Henchman’s role in the recovery protocol is
to prevent losing the tasks.
4.1 Henchman Recruitment
The Head agent sends a message to the first agent
a
i
, where 1 6i6j that joins its organisation asking
whether it can be its Henchman (HM) and sends
the same message to other agents that join subse-
quently until it assigns a Henchman for its organisa-
tion. The Head has the responsibility to synchronize
its database (DB) with the Henchman which contains
the contact details of the Members of the organisa-
tion, and also for every newly joined agent. This
means that the Henchman DB will always be identical
with the Head’s database. When the Head is offline,
the Henchman will immediately replace that Head to
maintain the functionality of the organisation.
The Henchman is a Member as well as it is responsi-
ble for checking the availability of the Head by send-
ing a heartbeat message “are you alive” to the Head
in random cycles. This will be explained in more
detail in the Algorithm 2 called the “Heartbeat”. If
the Henchman does not receive any acknowledgement
back from the Head, the Henchman will declare to all
other Members and the customer agent that the Head
has failed and the new Head is the Henchman in order
to redirect the traffic to itself instead. When the Head
recovers, i.e. the Henchman receives “I’m alive” mes-
sages again from the Head, the Henchman will in-
stantly inform the organisation Members as well as
the customer that the Head is now alive and the or-
ganisation should be back to its normal condition.
However, there is a small chance that the disruption
may effect both the Head and its Henchman at the
same time. In this case if the customer agent sends
tasks during that time, tasks will be considered failed,
until the Head and/or the Henchman return(s) back to
active after being offline for a random number of cy-
cles.
4.2 Henchman Heartbeat Algorithm
After assigning the desired Henchman, in Algorithm
2, the Henchman will check the availability of the
Head after a random number of cycles. At this point
the task messages received by the Head are auto-
matically seen by the Henchman in order to keep
the Henchman continuously updated about the cus-
tomer’s CIDs.
Disruption Recovery within Agent Organisations in Distributed Systems
375
Algorithm 2 : Henchman Heartbeat Algorithm
while (true) do
Henchman.SendMessage (Head, Are you
Alive.”)
Head.SendMessage (“I’m Alive”, Henchman)
if (Henchman.Receive (Head, “no message”)
then
Henchman.SendMessage(Customer, Mem-
bers) // To redirect them to the Henchman.
else
The Head is online again, redirect everything
back to the Head
Henchman.SendMessage (Customer, “Head is
alive”)
Henchman.SendMessage (Members, “Head is
alive”)
in case they would like to send tasks or receive
tasks to/from the Head.
end if
keeps on checking the Head in random cycles
cycle i=cycle i+Random(1, N)
end while
5 EXPERIMENTAL WORK
The network has been implemented in Repast Sim-
phony which is an Agent Based Modeling (ABM)
simulator using Java programming (Macal and North,
2014). Table1 shows the setting parameters which
have been used to set up the experiments, where (ND
= Normal Distribution, (T and M) are given values of
the mean and the standard deviation, H = Head, HM
= Henchman, SP = ServiceProvider).
Table 1: Henchman Recovery Protocol.
Agents Tasks Prob.offline(1) Prob.offline(2) Offline cycles
5000 ND*T+M H=0.9, HM=0.6 H=0.6, HM=0.2 H(2-5), HM+SP=(10-11)
2500 ND*T+M H=0.9, HM=0.6 H=0.6, HM=0.2 H(2-5), HM+SP=(10-11)
500 ND*T+M H=0.9, HM=0.6 H=0.6, HM=0.2 H(2-5), HM+SP=(10-11)
The three models (HRP, With organisation and With-
out organisation) as shown in the figures below, have
been run for 5000 simulation cycles with the setting
parameters shown in Table 1. The simulation has
been run for 5000 cycles to produce the result for the
model with HRP and network size=5000 agents, and
re-executed for 10 times with the setting parameter re-
vealed in Table 1, and the same process has been ap-
plied for the other two models. The experiments have
shown the operation of all the three models to differ-
ent network size (5000, 2500, 500) and with different
offline probability values (0.9, 0.6) for each network
size. These experiments measure the effect of the sim-
ulated models on the system performance using Equa-
tion 1. The equation has been used to calculate the
average number of successfully executed tasks during
the simulation cycles (ANSET), the absolute value is
the Manhattan Distance, used to produce the value of
the matching process between the requested resources
issued by the customer for each task and the matched
resources from the agents for all the tasks which are
executed successfully.
ANSET =
1
R
R
i=1
j=2
j=0
RV
j
AR
j
(1)
Where:
R: number of runs = 10.
RV
j
: the requested resources vector for the task j.
AR
j
: the matched agent resources that performs task
j successfully.
j = 0 to 2 : the index of the tuple that represent the
requested resource vector.
The Figures below will demonstrate the differences
in the average number of successfully executed tasks
within cycles for the models (HRP, With organisation
and Without organisation) for different network sizes.
With network size of 5000 agents, Our tests revealed
that the HRP in comparing with other two models
has positively executed higher percentage of tasks
with different offline probability values (0.9, 0.6), as
shown in Figures 1a and 1b. Also similar results have
been achieved with network size of 2500 agents, Fig-
ures 2a,2b. The same performance from the HRP has
also achieved with network size of 500 agents in com-
parison with the other two models. However, with
network size of 500 it was very difficult for the agents
in both (With organisation and Without organisation)
models to cope with the failing probability with the
high number of tasks issued from the customers in
each cycle, as shown in Figures 3a and 3b.
The most noteworthy output in the depicted Figures
is from the HRP. Clearly, the Henchman has added
advantages when it has been recruited in each organ-
isation due to its integrated roles: temporary Head,
follower, and a Member. In addition, the fact that the
existence of heterogeneous Members in the organisa-
tions can operate in conjunction with the HRP to cope
with organisation utilisation when some Members be-
ing offline. This demonstrates the idea that generating
heterogeneous organisations structure and imposing
roles for the agents in the system will improve the
system performance as more tasks can be recovered.
Moreover, in the models (the HRP and With organi-
sation) even if no Members were able to execute the
task, a task still have the chance to be executed by
delegating it to one of the neighbour agents. So that,
ICAART 2018 - 10th International Conference on Agents and Artificial Intelligence
376
With organisation model shows in between perfor-
mance and that implies even with the existence of the
organisations, the model is still prone to loose tasks
due to the high number of tasks issued from the cus-
tomer agent and the probability of agent failing. How-
ever, With organisation model, the system shows bet-
ter performance than the system Without organisation
for all presented network sizes and offline probabili-
ties. This is because the Without organisation model
has no organisation and depends only on task dele-
gation if the receiving agent cannot accept the given
tasks. Hence, this is not an adequate solution due to
presence of agent failure.
To compare and evaluate our work against other work
in the literature, mainly the work that deals with fault
tolerance in the networks, the master/standby server
(MAS/SBS) has been investigated (Chen, 2007). Ini-
tially, for the configuration and setting purposes, both
of the servers should be available from the begin-
ning of the network creation process, and the con-
nection link should be initiated by the MAS to send
acknowledgement messages to the SBS showing its
availability. Therefore, to create that scenario in our
work, in each organisation we have specified the Head
to act as MAS called Master Head (MAH) and also
specified another agent in the organisation to act as a
SBS called Standby Head (SBH). The SBH will only
respond to the customer messages when the MAH
of the organisation is down. In our work, it is the
Henchman’s responsibility to send the heartbeat mes-
sages to the Head to check its availability. In ad-
dition, the organisation structure (Head, Henchman,
Members) with its roles leads to achievement of self-
organised groups that can overcome the failure issue
as explained in subsection 3.1 above, unlike with the
MAS/SBS method where the SBS can only switch
to active mode when the MAS has failed. This ex-
periment has been implemented with the same set-
ting parameters for the same task distribution that has
been presented in Table 1. However, the network
size has been tested with 500 agents as a testbed to
verify our system, Figures 4 a and 4 b with differ-
ent offline probabilities (0.9, 0.6). Our system has
demonstrated on average a better performance in each
cycle compared to the MAS/SB server. This is be-
cause, in our scenario, the Henchman has a role in the
self-organisation process where it acts as a Member
that can receive and execute tasks as well as a fol-
lower to the Head of the organisation. In contrast in
the MAS/SBS approach, the SBS is only a backup
server and only turned to active mode when the MAS
is failed.
(a) (b)
Figure 1: Average Number of Successfully Executed Tasks
for 5000 agents and offline probability 0.9 and 0.6.
(a) (b)
Figure 2: Average Number of Successfully Executed Tasks
for 2500 agents and offline probability 0.9 and 0.6.
(a) (b)
Figure 3: Average Number of Successfully Executed Tasks
for 500 agents and offline probability 0.9 and 0.6.
(a) (b)
Figure 4: Comparison Between Henchman Recovery Pro-
tocol and The Master/standby Server, probability 0.6 and
0.9.
6 CONCLUSION
This paper has provided a framework for proposed
data centres, mainly focusing on how to recruit agents
Disruption Recovery within Agent Organisations in Distributed Systems
377
as part of the process of creating organisations. We
highlighted the importance of solving the disruption
problem using a self-organised multi-agent system as
well as providing a solution in which the organisa-
tions of agents emerge depending on how busy agents
get and this requires no central control. We have man-
aged to demonstrate the HRP as a remedy for the
disruption problem. It is employed during the self-
organisation process when the Head of each organisa-
tion decides to have one of its Members as a Hench-
man. The purpose of the Henchman agent is to main-
tain the functionality of the organisation and its ef-
fectiveness in case of agent failure. A heartbeat al-
gorithm has been used by each Henchmen to watch
each organisation’s Head. Experimental work has
demonstrated the HRP as a reliable solution for a self-
organised system. Our future work is to extend our
system and implement one of the leadership compe-
tition protocol between the Head and the Henchman.
And we can employ another Henchman inside the or-
ganization to maintain the organisation functionality
in case the Head and the Henchman are both being
offline.
REFERENCES
Al-karkhi, A. and Fasli, M. (2017). Deploying self-
organisation to improve task execution in a multi-
agent systems.
Barabsi, A.-L. and Bonabeau, E. (2003). Scale-free net-
works. Scientific American, 288(5):50–59.
Botev, J., Rothkugel, S., and Klein, J. (2015). Socio-
inspired design approaches for self-adaptive and self-
organizing collaborative systems. In Self-Adaptive
and Self-Organizing Systems Workshops (SASOW),
2015 IEEE International Conference on, pages 19–24.
IEEE.
Chen, C.-W. (2007). Dual redundant server system for
transmitting packets via linking line and method
thereof.
Corkill, D. D., Garant, D., and Lesser, V. R. (2015). Ex-
ploring the effectiveness of agent organizations. In
International Workshop on Coordination, Organiza-
tions, Institutions, and Norms in Agent Systems, pages
78–97. Springer.
Dignum, V. (2009). The role of organization in agent sys-
tems. Handbook of Research on Multi-Agent Systems:
Semantics and Dynamics of Organizational Models,
pages 1–16.
Elnozahy, E. N., Alvisi, L., Wang, Y.-M., and Johnson,
D. B. (2002). A survey of rollback-recovery proto-
cols in message-passing systems. ACM Computing
Surveys (CSUR), 34(3):375–408.
Foster, I., Zhao, Y., Raicu, I., and Lu, S. (2008). Cloud
computing and grid computing 360-degree compared.
In Grid Computing Environments Workshop, 2008.
GCE’08, pages 1–10. Ieee.
Gutierrez-Garcia, J. O. and Sim, K.-M. (2010). Self-
organizing agents for service composition in cloud
computing. In Cloud Computing Technology and Sci-
ence (CloudCom), 2010 IEEE Second International
Conference on, pages 59–66. IEEE.
Haider, S. and Nazir, B. (2016). Fault tolerance in com-
putational grids: perspectives, challenges, and issues.
SpringerPlus, 5(1):1991.
Horling, B. and Lesser, V. (2004). A survey of multi-agent
organizational paradigms. The Knowledge Engineer-
ing Review, 19(4):281–316.
Macal, C. and North, M. (2014). Introductory tutorial:
Agent-based modeling and simulation. In Proceed-
ings of the 2014 Winter Simulation Conference, pages
6–20. IEEE Press.
Miyashita, Y., Hayano, M., and Sugawara, T. (2015a). For-
mation of association structures based on reciprocity
and their performance in allocation problems. In Inter-
national Workshop on Coordination, Organizations,
Institutions, and Norms in Agent Systems, pages 262–
281. Springer.
Miyashita, Y., Hayano, M., and Sugawara, T. (2015b). Self-
organizational reciprocal agents for conflict avoidance
in allocation problems. In Self-Adaptive and Self-
Organizing Systems (SASO), 2015 IEEE 9th Interna-
tional Conference on, pages 150–155. IEEE.
Serugendo, G. D. M., Gleizes, M.-P., and Karageorgos, A.
(2011). Self-organising software: From natural to ar-
tificial adaptation. Springer Science & Business Me-
dia.
Tesauro, G., Chess, D. M., Walsh, W. E., Das, R., Se-
gal, A., Whalley, I., Kephart, J. O., and White, S. R.
(2004). A multi-agent systems approach to autonomic
computing. In Proceedings of the Third International
Joint Conference on Autonomous Agents and Multia-
gent Systems-Volume 1, pages 464–471. IEEE Com-
puter Society.
ICAART 2018 - 10th International Conference on Agents and Artificial Intelligence
378