2 USER’S PERSPECTIVE
Following the approach introduced in (Jovanović et al., 2006), a grid service must enable a user to do
the following two tasks:
1. Starting a new search: As a user is given full (read)
access to the raw dataset, he or she should be able to
specify a program that searches the local chunk of
a distributed dataset and extracts the relevant data.
2. Checking the list of performed searches: A user
must be able to check whether any other user has
already performed a search equal to or similar to the
one he or she is about to start.
As the user’s program extracts data from the local
chunk only, it is the responsibility of the grid service
to transfer the user’s program to each dataset server,
to run the program, and to collect the extracted data
from all servers.
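This scatter-gather responsibility of the grid service can be sketched as follows. The sketch is a simplification under stated assumptions: the server names, the in-process `CHUNKS` table standing in for remote dataset servers, and the function names are all hypothetical, and thread-based fan-out stands in for actual remote execution.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local chunks held by three dataset servers.
CHUNKS = {
    "server-a": [3, 14, 15, 92, 65],
    "server-b": [35, 89, 79, 32],
    "server-c": [38, 46, 26],
}

def run_on_server(server, extract):
    """Run the user's extraction program against one local chunk."""
    return [x for x in CHUNKS[server] if extract(x)]

def distributed_search(extract):
    """Scatter the program to every server, then gather the results."""
    with ThreadPoolExecutor() as pool:
        futures = {s: pool.submit(run_on_server, s, extract) for s in CHUNKS}
        return sorted(x for f in futures.values() for x in f.result())

# Extract all even values from the whole distributed dataset.
print(distributed_search(lambda x: x % 2 == 0))  # [14, 26, 32, 38, 46, 92]
```

Each server applies the user's program only to its own chunk; the service merges the partial results into one answer.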
The first reason for keeping a list of performed
searches is to keep the overall system load low (a) by
not repeating the same search all over again and (b)
by using results from a similar (possibly more gen-
eral) search instead of starting a new one. The second
reason is that the list of performed searches acts
as a virtual blackboard showing what data or relations
among data in the dataset seem interesting to other
users. This might be especially valuable in collabo-
rative scientific research. It follows that the results of
the searches performed in the past should be stored in
one form or another somewhere in a data grid.
Hence, the typical method for performing a search
is best described by the (informal) Algorithm 1. Note
(1) that starting a new search in line 6 does not block
the execution, as the new search is performed on re-
mote servers; and (2) that the search and its results will
eventually appear on the list of performed searches
(even if it yields no results), and thus the loop in lines
7–13 does terminate.
Algorithm 1 Performing a search.
1: check the list of performed searches
2: if the (similar) search has been found then
3: return its results and stop
4: end if
5: provide the extraction program
6: start a new search
7: loop
8: check the list of performed searches
9: if the search has been found then
10: return its results and stop
11: end if
12: sleep for a specified amount of time
13: end loop
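Algorithm 1 can be sketched in code as follows. This is a minimal, single-process illustration: the `performed` dictionary stands in for the distributed index of performed searches, and `start_search` (a hypothetical name) completes instantly against a toy dataset instead of dispatching to remote servers.

```python
import time

# Hypothetical distributed index of performed searches:
# maps a search description to its results.
performed = {}

def perform_search(description, program, poll_interval=0.01):
    """Informal Algorithm 1: reuse an earlier search when possible,
    otherwise start a new one and poll until its results appear."""
    if description in performed:          # lines 1-4: reuse earlier results
        return performed[description]
    start_search(description, program)    # lines 5-6: non-blocking start
    while True:                           # lines 7-13: poll the index
        if description in performed:
            return performed[description]
        time.sleep(poll_interval)

def start_search(description, program):
    # Stand-in for launching the search on remote servers; here the
    # "remote" search completes immediately against a toy dataset.
    performed[description] = [x for x in range(10) if program(x)]
```

A second call with the same description returns the stored results without running the extraction program again, which is exactly the load-reduction behaviour motivated above.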
This method allows different implementations of a
distributed index of performed searches and different
implementations of searching. At present, the
research is focused on an efficient peer-to-peer im-
plementation of a single search (line 6). The task
of maintaining a list of performed searches in a dis-
tributed index, and especially a method for comparing
descriptions of different searches, are left for fu-
ture work.
3 ANT COLONY OPTIMIZATION
FOR SEARCHING RAW
DATASETS
Ant colony optimization is a biologically-inspired op-
timization method (Dorigo and Stützle, 2004). The
basic idea is to use a large number of simple agents
called ants: each ant performs a relatively simple task
but combined together they are able to produce so-
phisticated and effective optimization. Further im-
provements of ACO are based usually on a combina-
tion of the ACO algorithm with other local optimiza-
tion techniques (Dorigo and Stützle, 2004). Ant-based
clustering and sorting have already been studied in the
past (Deneubourg et al., 1991; Handl et al., 2006).
There are two main differences between ant-based
clustering and peer-to-peer searching as described in
this paper. First, a node in the system can possess
a pile of data, not just one datum. Second, other
ant-based clustering algorithms are suited for meshes
or similar regular planar topologies; thus, the dis-
tance function here is modified in order to be efficient
for an arbitrary topology of a distributed environment.
Extraction Ants. Using user-specified programs,
extraction ants extract data from local chunks of
raw datasets. Each extraction program is uniquely
identified. Extraction ants mark their paths with
pheromones associated with the id of the program
they are carrying. In order for extraction ants to dis-
cover as much data as possible they avoid the trails
and prefer the clean paths.
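The trail-avoidance behaviour can be sketched as a weighted next-hop choice. The weighting 1/(1 + τ) below is an illustrative assumption (the paper does not specify the exact function), and the function and parameter names are hypothetical; the point is only that links carrying little pheromone for this program id are preferred.

```python
import random

def choose_next_hop(neighbours, pheromone, program_id):
    """Pick the next server to visit, favouring links that carry little
    pheromone for this program id (i.e. paths not yet explored)."""
    # Illustrative weighting: weight 1/(1 + tau) for pheromone level tau,
    # so clean links (tau = 0) are the most attractive.
    weights = [1.0 / (1.0 + pheromone.get((n, program_id), 0.0))
               for n in neighbours]
    return random.choices(neighbours, weights=weights, k=1)[0]

def deposit(pheromone, link, program_id, amount=1.0):
    """Mark the traversed link with pheromone tied to the program id."""
    key = (link, program_id)
    pheromone[key] = pheromone.get(key, 0.0) + amount
```

Because pheromone is keyed by program id, ants carrying different extraction programs do not repel one another; each program's ants spread out over paths their own kind has not yet visited.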
Aggregation Ants. The data extracted by extraction
ants is collected into piles by aggregation ants. Each
aggregation ant is based on Algorithm 2. According to
the pick-up function 1/(1 + x/n)^k, small piles of data
are picked up with a very high probability, while large
piles are most unlikely to be picked up. The distinc-
tion between small and large piles is predefined and
based on the characteristics of the environment, such
as the average connection bandwidth and the expected
amount of data. Note that the distinction between
small and large piles can be simply regulated by
changing n (a measure for the size of a pile) and k
ICSOFT 2006 - INTERNATIONAL CONFERENCE ON SOFTWARE AND DATA TECHNOLOGIES