BUILDING SCALABLE DATA MINING GRID APPLICATIONS

An Application Description Schema and Associated Grid Services

Vlado Stankovski

Faculty of Civil and Geodetic Engineering, University of Ljubljana, Jamova cesta 2, Ljubljana, Slovenia

Dennis Wegener

Fraunhofer Institute for Intelligent Analysis and Information Systems, Sankt Augustin, Germany

Keywords: Grid, distributed applications, data mining, middleware.

Abstract: Grid-enabling existing stand-alone data mining programs, data and other resources, such as computational

servers, is motivated by the possibility for their sharing via local and wide area networks. Expected benefits

are improved effectiveness, efficiency, wider access and better use of existing resources. In this paper, the

problem of how to grid-enable a variety of existing data mining programs is investigated. The presented

solution is a simple procedure, which was developed under the DataMiningGrid project. The actual data

mining program, which is a batch-style executable, is uploaded on a grid server and an XML document that

describes the program is prepared and registered with the underlying grid information services. The XML

document conforms to an Application Description Schema, and is used to facilitate discovery and execution

of the program in the grid environment. Over 20 stand-alone data mining programs have already been grid

enabled by using the DataMiningGrid system. By using Triana, a workflow editor and manager which

represents the end-user interface to the grid infrastructure, it is possible to combine grid enabled data mining

programs and data into complex data mining applications. Grid-enabled resource sharing may facilitate

novel, scalable, distributed data mining applications, which have not been possible before.

1 INTRODUCTION

Data mining in grid computing environments is

motivated by resource sharing via local and wide

area networks (Stankovski et al., 2008a, 2008b).

Increased performance, scalability, access and

resource exploitation are the expected key benefits.

Furthermore, novel distributed data mining

applications may facilitate the automated extraction

of potentially useful information from increasingly

large, geographically distributed data volumes.

However, grid-enabling large-scale data mining

applications is difficult to achieve due to a number

of factors. Grid computing itself is a novel field of

research and relevant standards and technologies are

still evolving (Foster, Kesselman, Tuecke, 2001;

Plaszczak and Wellner, 2006; Sotomayor and

Childers, 2006; Antonioletti et al., 2005). Moreover,

there exists a plethora of data mining technologies

and a staggering number of largely varying data

mining application scenarios (Kumar, Kantardzic

and Madden, 2006; Guedes, Meira and Ferreira,

2006; Stankovski and Dubitzky, 2007; Congiusta,

Talia and Trunfio, 2007). Finally, data mining users

range from highly domain-oriented end users to

technology-aware specialists. For the former user

group, transparency and ease of use are paramount,

whereas the latter group needs to be in control of

detailed data mining and grid technology aspects.

In the DataMiningGrid project, we have aimed to

address the requirements of modern data mining

application scenarios, in particular those which

involve sophisticated resource sharing. A detailed

technical account of the DataMiningGrid system is

presented elsewhere (Stankovski et al., 2008a,

2008b) as well as the actual applications (e.g.

Trnkoczy and Stankovski, 2008). We have designed

and implemented a workflow-oriented, scalable,

high performance computing system that supports

emerging grid interoperability standards and

technology. The system itself is freely available

under the Apache Open Source License V2.0 via

Stankovski V. and Wegener D. (2008).

BUILDING SCALABLE DATA MINING GRID APPLICATIONS - An Application Description Schema and Associated Grid Services.

In Proceedings of the Third International Conference on Software and Data Technologies - PL/DPS/KE, pages 221-228

DOI: 10.5220/0001891302210228

 SciTePress

SourceForge.net, including all supporting

documentation.

In the present study, we investigate the problem

of how existing, stand-alone data mining

applications can be grid-enabled and subsequently

executed on a grid service infrastructure.

Our final goal is to enable users from various

disciplines to build and utilize complex, scalable

data mining applications. To that end, an effective

mechanism, which was developed under the

DataMiningGrid project, is presented in this paper.

2 GRID RESOURCES

In a grid environment, it is possible to exploit

numerous, potentially unlimited resources, such as

data, data mining applications, CPUs, storage,

networks and clusters.

Given the nature of data mining applications, a

variety of computational resources that may be

shared were identified:

• Data. The data to be mined, which may exist in the form of relational databases, data files (documents in various formats) and directories consisting of collections of documents;

• Programs. Data mining programs providing the implementation of data mining algorithms used to mine data. Seen from the grid resource viewpoint, a data mining program, application or algorithm is an executable with associated input data, output data and parameter settings. The executable can be, for example, a Java, C, Python or Bash shell program;

• Computational Machines. Computational machines providing raw computing power to run the data mining program and process the data. Important parameters of computational machines are speed, occupancy, memory which can be used during processing, architecture and so on;

• Storage. Data storage devices to physically store the input and output data of data mining applications. For storage devices it is mainly important to have the ability to reserve space in advance, and to have safe and fast mechanisms for storing and retrieving data. Should storage be used for storing relational databases, then the necessary server-side software must also be installed on the actual site;

• Streaming Devices. Sensors and other devices streaming data in a network are a special kind of resource. These, however, are currently not (directly) supported by the DataMiningGrid technology; and

• Networks. Optimization of network parameters should also be considered for time-critical applications.

It is important to realize that certain resources

from the list above can easily be moved in the

network (notably data, programs and associated

libraries) and existing transfer protocols could be

used for that purpose (e.g. ftp, GridFTP or RFT),

while other resources cannot be moved in the

network (e.g., computational machines). All

resources, however, have different parameters that

should be considered when developing a distributed,

grid-based data mining application.

We have investigated several large-scale data

mining scenarios in which it is impractical or

impossible to move the data in the network for a

variety of reasons, e.g., the large amount of data or

security restrictions. In such cases, it must be made

possible to move the data mining programs to the

data rather than the data to the data mining

programs.

3 GRID SERVICES

Grid middleware services represent virtualization of

grid resources. In other words, grid resources can

only be accessed by using grid services, while local

resources have to be grid-enabled before they are

actually shared.

3.1 Virtualization

In the past years, considerable research and

development effort has been put towards the

development of middleware services and tools

targeting some of the access and composition

obstacles to large-scale resource sharing and

exploitation. The key benefit of grid services is that

they provide an effective and popular way of

abstracting the complexity of distributed data and

computational resources and also represent a variety

of utility services. DataMiningGrid represents a

platform that enables the sharing of grid resources

between applications which facilitates reuse,

embedding, modification or extension of an

application’s content to enable a far more rapid

development cycle than what was previously

possible with conventional programming

methodologies.

In particular, users are not concerned either with

the technical details associated with constructing

ICSOFT 2008 - International Conference on Software and Data Technologies

grid services, or the technical details of the

underlying grid and Web service infrastructures, but

are interested in exploring their data. To that end,

DataMiningGrid provides workflow-editing

software components that are integrated with the

existing Triana workflow editor and manager. By

using the workflow editing components it is possible

to develop applications which are tailored to the

specific needs of the end-users. Moreover,

interoperability between services within a distributed

system is enabled by using the OASIS WSRF

standard, so that users can access the grid

services regardless of their own systems.

3.2 Workflows

Workflows are, in effect, declarative definitions of

the coordinated “plumbing” between services, which are executed

as steps. Triana is an advanced system for editing

and managing workflows, which is used as a user

interface to Web services (Churches et al., 2005) and

recently also as a user interface to grid services

(Stankovski et al., 2008a). It may be used to link

data resources, analytical tools and computing processes

together efficiently and effectively in the form of

various domain-specific graphs representing

information flows. It is complete and self-defined,

and represents a far more effective means of sharing

knowledge, processing, communication, storage or

content than the more elementary building blocks

that represent the individual services.
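The idea of a workflow as declaratively defined connections between steps can be sketched as a small dependency graph. This is only an illustration, not the Triana data model: the step names are hypothetical, and the standard-library `graphlib` module stands in for a workflow engine.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose output it consumes.
# Step names are hypothetical and chosen only for illustration.
workflow = {
    "select_data": set(),
    "select_program": set(),
    "set_parameters": {"select_program"},
    "execute": {"select_data", "set_parameters"},
    "view_results": {"execute"},
}

# A valid execution order runs every step after all of its predecessors.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The declaration only states which step feeds which; the execution order is derived from it, which is what makes such a description easy to share and reuse.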

As more data is produced by users every day and

more utility services are provided online, coupled

with ‘time-to-market’ pressures, the complexity and

dynamicity of information flows in the past years

have been increasing dramatically. It is therefore

necessary in many scientific and business

communities, and will be even more so in the near

future, to discover, extract and share the knowledge

contained in complex workflows.

3.3 Grid Middleware and Services

Different grid middleware solutions exist and

have continued to be developed over the past years. Here,

we provide a short overview of ready-to-use grid,

WSRF-compliant services, which were used to build

a grid service infrastructure (a test-bed). These

include the Globus Toolkit 4, Condor and

DataMiningGrid high-level services.

3.3.1 Globus Toolkit 4 Middleware and

Services

One of the first grid middleware toolkits

implementing the Web Services Resource

Framework (WSRF) v. 1.2 specification, a

specification promoted by the Organization for the

Advancement of Structured Information Standards

(OASIS), is the Globus Toolkit 4 (GT4) (Foster,

Kesselman and Tuecke, 2001). GT4 provides a

range of grid services that can be directly used to

build a distributed grid environment. These include

data management, job execution management,

community authorization services etc. All these

services can be used to build custom grid

applications, and are elaborated in detail elsewhere

(Sotomayor and Childers, 2006). Besides these

ready-to-use services, the GT4 provides an

Application Programming Interface (API) that

allows for development of proprietary WSRF-compliant

services. For the reasons listed above,

GT4 was used as the grid middleware in this study.

Following is a short overview of relevant ready-to-use grid services from the GT4 toolkit.

• The Web Service - Grid Resource Allocation and Management (WS-GRAM) service provides all basic mechanisms required for execution management, i.e., initiation, monitoring, management, scheduling, and coordination of remote computations.

• Data Management Services, such as GridFTP and Reliable File Transfer (RFT). These data services are mainly used for transfer and management of distributed, file-based data, including program executables and their software libraries. GridFTP is used, e.g., to transfer executables and required libraries to the selected computational server in the grid.

• Information Services are used to discover, characterize and monitor resources, services and computation. GT4’s Monitoring and Discovery System 4 (MDS4) provides information about the available grid resources and their status. It has the ability to collect and store information from multiple, distributed information sources. This information is used to monitor (e.g., to track usage) and discover (e.g., to assign computing jobs and other tasks) the current state of services and resources in a grid system. The DataMiningGrid high-level services (in particular the Resource Broker and the Information Integrator) use the MDS4 service.

In our test bed, the following GT4 services are

used extensively: WS-GRAM, GridFTP and MDS4.

3.3.2 Condor

Scheduling of grid jobs in local computing clusters

is achieved by using the Condor middleware.

Condor is specialized workload management

software for submitting compute-intensive jobs to

local computational clusters, which has been

described in detail elsewhere (Thain, Tannenbaum

and Livny, 2005). In our application, the GT4

submits a subset of parallel jobs to appropriate

Condor clusters, and it is up to the Condor software

to place them into a local queue, choose when and

where in the local cluster to run the jobs, monitor the

progress of the jobs, and ultimately inform GT4

services upon their completion.

3.3.3 DataMiningGrid High-Level Services

In addition to the core grid services provided by

GT4, other high-level WSRF compliant, ready-to-

use grid services have recently been developed

under the DataMiningGrid project. Here, we provide

a brief overview of the Resource Broker and the

Information Integrator Service. These services fully

support the parallel execution of a variety of data

mining (batch-style) programs in the grid

environment.

• The Resource Broker Service is responsible for the execution of data mining programs anywhere in the grid environment. It matches the request for data mining program execution, which is also called a job in grid terminology, with the available computational and data resources in the grid. It takes as input the computational requirements of the job (CPU power, memory, disk space etc.) and the data requirements of the job (data size, data transfer speed, data location etc.) and selects the most appropriate execution machine for that particular job. The job is passed on to the WS-GRAM service and executed either on an underlying Condor cluster or by using GT4’s Fork mechanism. The Resource Broker service is capable of job delegation to resources spanning multiple administrative domains. The execution machines are automatically selected so that the inherent complexity of the underlying infrastructure is hidden from the users. The Resource Broker service orchestrates automatic data and application transfers between the grid nodes, using the GridFTP component of GT4 for the transfers. The Resource Broker is designed to execute multi-jobs. Multi-jobs are collections of single jobs that are bound for parallel execution. In DataMiningGrid, a multi-job usually consists of a single data mining program, which is instantiated with different input parameters and/or different input data sets. The individual jobs are then executed in parallel on various computational servers in the grid environment. Each job, therefore, represents one execution of a data mining program (i.e., an executable) with specific input parameters and data inputs. The Resource Broker makes extensive use of the associated Information Integrator service.

• The Information Integrator Service provided by DataMiningGrid operates in connection with the MDS4 service provided by GT4. The Information Integrator service is designed to feed into other grid components and services, including services for discovery, replication, scheduling, troubleshooting, application adaptation, and so on. Its key role is to create and maintain a register of grid-enabled data mining programs. By doing so, it facilitates the discovery of grid-enabled programs on the grid, and their later use through the Resource Broker service.
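The matching step performed by a resource broker can be illustrated with a minimal sketch. This is not the DataMiningGrid implementation: the class names, field names, machine names and the ranking rule (prefer the machine with the most free CPUs) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_memory_mb: int
    free_disk_mb: int
    free_cpus: int


@dataclass
class JobRequirements:
    min_memory_mb: int
    min_disk_mb: int


def select_machine(machines, requirements):
    """Return the eligible machine with the most free CPUs, or None if none qualifies."""
    eligible = [
        m for m in machines
        if m.free_memory_mb >= requirements.min_memory_mb
        and m.free_disk_mb >= requirements.min_disk_mb
    ]
    # Ranking by free CPUs is an assumption made for this sketch only.
    return max(eligible, key=lambda m: m.free_cpus, default=None)


# Hypothetical machines; in a real grid this information would come from
# an information service rather than a hard-coded list.
machines = [
    Machine("node-a", free_memory_mb=512, free_disk_mb=1_000, free_cpus=2),
    Machine("cluster-b", free_memory_mb=4_096, free_disk_mb=50_000, free_cpus=16),
    Machine("cluster-c", free_memory_mb=8_192, free_disk_mb=50_000, free_cpus=8),
]
chosen = select_machine(machines, JobRequirements(min_memory_mb=1_024, min_disk_mb=2_000))
print(chosen.name)  # cluster-b
```

A real broker would also weigh data location and transfer speed, which this sketch leaves out.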

4 AN APPLICATION

DESCRIPTION SCHEMA

A system whose main function is to facilitate the sharing of grid resources within a grid environment and to support the development of distributed, scalable data mining applications has to take into account the unique constraints and requirements of data mining programs with respect to data management, execution requirements, and so on.

In order to cope with the complexity of the

dynamically changing grid environment, an

Application Description Schema (ADS), which is a

novel metadata model in the form of an XML schema,

was developed. The ADS defines properties

necessary to describe data mining programs and

other grid resources that may be shared in

distributed grid applications in a uniform way. For

example, the XML schema provides properties to

describe the data mining program and all its input

data, output data, parameter settings etc., so it can be

used to describe any program (i.e., executable) in

general.

The ADS consists of two parts, which are:

• A common part, which can be used to describe any program, i.e., not necessarily data mining programs; and

• A data mining (domain-specific) part containing additional information relating to data mining programs, e.g., the program’s application domain(s), the name of the atomic algorithm, the problem solving technique etc.

The common part of the ADS is subdivided into:

• A general part, which contains definitions of properties of the program, such as a unique ID, the program’s name, vendor etc. This information is used to build a grid-wide registry of available programs;

• An execution part, which contains definitions of properties related to the execution of the program. This includes the application type (Java, C, Bash shell or Python), the location of the executable in the grid, the list of libraries required for execution etc.; and

• An application part, which provides definitions of properties needed to use the program. This includes options, data inputs and outputs, parameter lists and loops, requirements, environment variables etc.
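To make the structure described above more concrete, the following sketch parses a small ADS-like XML instance with Python's standard library. The element and attribute names, the host name and all values are hypothetical; the actual ADS element names are not given in this paper, so the XML below only mirrors the general, execution, application and data mining parts.

```python
import xml.etree.ElementTree as ET

# Hypothetical ADS-like instance; element and attribute names are illustrative only.
ADS_SKETCH = """
<application>
  <general id="example-001" name="ExampleMiner" vendor="ExampleVendor"/>
  <execution type="Java" executable="gsiftp://host.example.org/repo/miner.jar">
    <library>lib/util.jar</library>
  </execution>
  <applicationInfo>
    <option flag="-number" default="77"/>
    <dataInput flag="-input"/>
    <dataOutput flag="-output"/>
    <requirement name="minMemoryMB" value="512"/>
  </applicationInfo>
  <dataMining domain="text mining" algorithm="ExampleAlgorithm" technique="classification"/>
</application>
"""

root = ET.fromstring(ADS_SKETCH.strip())

# The three common parts come first, followed by the domain-specific part.
parts = [child.tag for child in root]
print(parts)

# Properties from the execution part, as a broker might read them.
execution = root.find("execution")
print(execution.get("type"), [lib.text for lib in execution.findall("library")])
```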

The ADS supports a wide range of functionality in the system. It is used throughout the whole execution process in the grid environment, from the registration of data mining and other programs, through program discovery, selection, and parameter and input data specification, to the actual execution of the program on the grid.

5 GRID ENABLING DATA

MINING PROGRAMS

5.1 Batch-Style Programs

A data mining program is grid enabled when it is

made available in the grid environment so that the

grid users can actually share it and make use of it;

hence, the program may be considered a grid

resource. To cope with interoperability and other

high-level aspects it is necessary to provide an

extensive description of each data mining program

by means of the ADS.

From an infrastructural viewpoint, all

executables and their associated libraries represent

files. In DataMiningGrid, any data mining program

that can be invoked from command-line (and is

implemented to run without a graphical user

interface), can be grid enabled. A wide

variety of general and domain-specific programs can

be adapted to be invoked from a command-line. For

example, input data, output data, and parameter

settings can be presented to a program via the

command-line, using a specific format, e.g. a flag

followed by the associated value (‘flag <space>

value’, e.g., ‘-number 77’). Additionally, the

program may have some system requirements like

minimum free disk space, minimum memory or

required operating system, which have to be

specified in order to later run the program on

appropriate computational machine(s) in the grid.

Execution of data mining algorithms in the grid

is always based on a valid description of the

prerequisites the algorithm needs. The ADS was

developed to describe the algorithms including all

their parameters, input/output data, necessary

environment prerequisites etc.

5.2 Grid-Enabler (Web) Application

By using the ADS, developers can create a very detailed description of their program, which on the one hand guarantees that it will run successfully and on the other hand provides information for its discovery on the grid. Providing the description is always the responsibility of the developer of the grid application, who may not be acquainted with the underlying grid system. The presented solution is therefore a very simple procedure: the actual data mining program (i.e., executable) is uploaded on a grid server and an ADS instance that describes the program is prepared and registered with the underlying Information Integrator service.
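The register-then-discover idea behind this procedure can be sketched with a small in-memory registry. This is only a stand-in for illustration, not the Information Integrator interface: the class, its methods and the description keys are hypothetical.

```python
class ProgramRegistry:
    """In-memory stand-in for a registry of grid-enabled programs (illustrative only)."""

    def __init__(self):
        self._entries = {}

    def register(self, program_id, description):
        """Store a program description (a plain dict here) under a unique ID."""
        if program_id in self._entries:
            raise ValueError(f"program ID already registered: {program_id}")
        self._entries[program_id] = dict(description)

    def discover(self, **criteria):
        """Return the sorted IDs of programs matching every given key/value pair."""
        return sorted(
            pid for pid, desc in self._entries.items()
            if all(desc.get(key) == value for key, value in criteria.items())
        )


registry = ProgramRegistry()
registry.register("p1", {"name": "ExampleMiner", "type": "Java", "domain": "text mining"})
registry.register("p2", {"name": "OtherMiner", "type": "C", "domain": "bioinformatics"})
print(registry.discover(domain="text mining"))  # ['p1']
```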

To speed up the grid-enabling process, the DataMiningGrid project developed a Web application which consists of several form-based JSP Web pages, leading users through the whole process of creating and uploading their data mining program. The following Web pages are provided:

• General information

• Execution information

• Input data specification

• Output data specification

• Requirements specification

• Executable and libraries upload

5.3 Life-Cycle of the ADS Instance

The ADS instance file contains all invariant

properties of the respective data mining program

(e.g., system architecture, location of the executable

and libraries, programming language). These

attributes cannot be altered by users of the system,

but are typically specified by the developer of the

program during the process of publishing the

program on the grid. The ADS instance also includes

default values for all options, but the exact values

for a particular execution are not yet set.

When querying for a data mining program, the

client side components (implemented as Triana

units) use the ADS instance in order to dynamically

create a GUI, which conforms to the description of

that particular data mining program. For example,

for each option a form field is generated, where the

user can specify the values for that option. At this

stage (i.e., at runtime) the user provides the exact

application parameter values, data inputs and

additional requirements of the program.
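The step from a program description to a generated form can be sketched as follows. This is not the Triana unit code: the function names and the option dictionary layout are assumptions; only the idea of one field per option, pre-filled with its default value, is taken from the text.

```python
def form_fields(options):
    """Create one form-field description per option, pre-filled with its default."""
    return [
        {"label": opt["name"], "flag": opt["flag"], "value": opt.get("default", "")}
        for opt in options
    ]


def apply_user_input(fields, user_values):
    """Map each flag to the user's entry when given, otherwise to the default."""
    return {f["flag"]: user_values.get(f["label"], f["value"]) for f in fields}


# Hypothetical option list, as it might be read from a program description.
options = [
    {"name": "number", "flag": "-number", "default": "77"},
    {"name": "input", "flag": "-input"},
]
fields = form_fields(options)
values = apply_user_input(fields, {"input": "a.txt"})
print(values)  # {'-number': '77', '-input': 'a.txt'}
```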

A fully specified ADS instance represents a

multi-job description and is submitted to the

Resource Broker for parallel execution in the grid.

The Resource Broker uses the information

contained in the ADS instance to aggregate

appropriate resources. The following information is particularly useful:

• Static resource requirements regarding system architecture and operating system. Applications implemented in a hardware-dependent language (e.g., C) typically run only on the system architecture and operating system they have been compiled for (e.g., PowerPC or Intel Itanium running Linux). For this reason, the Resource Broker has to select execution machines that offer the same system architecture and operating system as required by the application.

• Modifiable resource requirements regarding memory and disk space. While data mining applications may require a minimal amount of memory and disk space at start-up time, memory and disk space demands typically rise with the amount of data being processed and with the solution space being explored. Therefore, end users are allowed to specify these requirements in accordance with the data volume to be processed and their knowledge of the application’s behaviour. The Resource Broker will take into account these user-defined requirements and match them to those machines and resources that meet them.

• Modifiable requirements regarding the identity of machines. In some cases end users may wish to limit the list of possible execution machines based on personal preferences, for instance, when processing sensitive data. To support this requirement, it is possible for the user to specify the IPs of such machines in the job description. Such a list causes the Resource Broker to match only those resources and machines listed and to ignore all other machines independent of their capabilities.

• The total number of jobs. Instead of specifying single values for each option and data input that the selected application requires, it is also possible to declare a list of distinct values (e.g., true, false) or a loop (e.g., from 0.50 to 10.00 with step 0.25). These represent rules for variable instantiations, which are translated into a number of jobs with different parameters by the Resource Broker. This is referred to as a multi-job. As a result, the Broker will prefer computational resources that are capable of executing the whole list of jobs at once in order to minimize data transfer. Typically, such resources are either clusters or high-performance machines offering many distinct processors. As an example, if the user specifies two input files (a.txt, b.txt) for the same data input and two loops running from 1 to 10 with step 1 as parameters for two options, the Resource Broker will translate this into 200 (2 x 10 x 10) distinct jobs. If no single resource capable of executing them at once is available, the Broker will distribute these jobs over those resources that provide the highest capability.
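The translation of value lists and loops into individual jobs amounts to a Cartesian product, which can be sketched as follows. The function and variable names are hypothetical and this is not the Resource Broker code; the file names and loop bounds are those of the examples above, and loops are assumed to include their end value.

```python
from decimal import Decimal
from itertools import product


def loop_values(start, stop, step):
    """Expand an inclusive loop (start, stop, step) into a list of Decimal values."""
    start, stop, step = Decimal(start), Decimal(stop), Decimal(step)
    values = []
    current = start
    while current <= stop:
        values.append(current)
        current += step
    return values


def expand_multi_job(variables):
    """Return one dict per job: the Cartesian product of all variable value lists."""
    names = list(variables)
    return [
        dict(zip(names, combination))
        for combination in product(*(variables[name] for name in names))
    ]


# Two input files and two loops from 1 to 10 with step 1, as in the example above.
jobs = expand_multi_job({
    "input": ["a.txt", "b.txt"],
    "option_1": loop_values("1", "10", "1"),
    "option_2": loop_values("1", "10", "1"),
})
print(len(jobs))  # 200

# The loop from 0.50 to 10.00 with step 0.25 yields 39 values.
print(len(loop_values("0.50", "10.00", "0.25")))  # 39
```

Decimal arithmetic is used so that fractional steps such as 0.25 accumulate exactly and the end value is not missed through floating-point rounding.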

In addition, the Resource Broker evaluates

further information from the job description that

becomes important at the multi-job submission

stage. This information is briefly described below:

• Instructions on where the program executables are stored, including all required libraries, and how to start the selected program. These are required for transferring executables and associated libraries to execution machines across the grid, which is part of the stage-in process. By staging-in programs together with the input data dynamically at run-time, the system is capable of executing these applications on any suitable machine in the grid without prior installation of the respective data mining program.

• All data inputs and data outputs that have to be transferred prior to the execution.

• All option values (data mining program parameters) that have to be passed to the program at start-up. As the Resource Broker is capable of scheduling executables that are started in batch-mode from a command line, it passes all option values as flag-value pairs. Here, each flag is fixed and represents a single option. The values, however, may change for each call if a multi-job is specified.
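Passing option values as flag-value pairs can be sketched as below. The executable name `./miner` and the helper function are hypothetical; the `-number 77` pair follows the flag format used as an example earlier in the paper.

```python
import shlex


def build_command(executable, options):
    """Build an argument list in which each option becomes a '-flag value' pair."""
    args = [executable]
    for flag, value in options.items():
        args.extend([f"-{flag}", str(value)])
    return args


# One call per job of a multi-job: the flags stay fixed, the values change.
calls = [build_command("./miner", {"number": n, "input": "a.txt"}) for n in (76, 77)]
for call in calls:
    print(shlex.join(call))
```

Keeping the arguments as a list rather than a single string avoids quoting problems when a value contains spaces.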

6 GRID ENVIRONMENT

From this point forward the data mining program is

ready to be used in the grid execution environment.

6.1 Test Bed

The DataMiningGrid test bed was developed on the

basis of the grid middleware and ready-to-use grid

services discussed in the previous sections. It is a

grid service infrastructure, with services running at

various sites across different administrative domains

in three European countries (Ireland, Slovenia and

Germany).

The test bed provides a number of capabilities,

the most important being the following:

• The ability to execute a variety of batch-style programs at any appropriate computational server in the grid. Over 20 grid-enabled programs are currently stored in executable repositories on various grid servers in the test bed. These programs may be combined in complex workflows containing several multi-jobs executed sequentially;

• Meta-scheduling, i.e., dynamic and automatic allocation of optimal computational servers in the grid environment, is achieved through the use of the Resource Broker, the Information Integrator service and MDS4;

• Program and data movement across different administrative domains is achieved through the use of the GridFTP and RFT services.

In addition to these, grid environments based on

GT4 and DataMiningGrid high-level services have a

number of other capabilities, such as a Grid Security

Infrastructure, which is described in Stankovski et

al. (2008a, 2008b). Over 20 grid-enabled data

mining programs have already been combined into

large-scale data mining applications. To demonstrate the

flexibility and extensibility of the developed

software, we have developed applications for

various research and industrial sectors, such as

bioinformatics, industry (text-mining), medicine,

open publishing and digital libraries, and civil

engineering. For example, grid-enabled Federated

Digital Libraries have been described by Trnkoczy,

Turk, and Stankovski (2006), and Trnkoczy and

Stankovski (2008).

7 CONCLUSIONS

It seems obvious that emerging large-scale data

mining applications will rely increasingly on

distributed computing environments.

Many data mining programs require a repeated

execution of the same process with different

parameters that usually control the behaviour of the

implemented data mining algorithm or different

input data sets. This is typically required in

optimization or sensitivity analysis tasks. Hence,

properties like performance, scalability, usability

and security are critical for this kind of application.

The DataMiningGrid system was developed to

address these requirements. More details about the

results of this project can be found at its web site

(DataMiningGrid). The developed software is

distinguished by its flexibility, ease of use,

conceptual simplicity, compliance with emerging

grid and data mining standards, and the use of

mainstream grid and open technology.

The DataMiningGrid project developed a

coherent framework, which offers data miners, who

are usually not grid experts, the ability to easily grid

enable existing, stand-alone data mining programs,

construct complex data mining tasks, which are

represented by complex Triana workflows and

execute these workflows in a grid environment. A

case study on the extensibility of the

DataMiningGrid platform is given in the literature

(Wegener and May, 2007).

The developed DataMiningGrid software is

freely available under the Apache Open Source

License V2.0 via SourceForge.net, including all

supporting documentation.

Despite its promise, however, there are still many

issues to be resolved before grid technology is

commonly applied to large-scale data mining tasks

(Stankovski and Dubitzky, 2007).

ACKNOWLEDGEMENTS

This work was supported by the European

Commission FP6 grant DataMiningGrid,

http://www.datamininggrid.org, contract no. 4475.

The collaboration of all project partners is

acknowledged. They have jointly participated in

developing the system.

REFERENCES

Antonioletti, M., et al., 2005. “The design and

implementation of Grid database services in OGSA-DAI,”

Concurrency and Computation: Practice and

Experience, vol. 17, no. 2-4, pp. 357--376.

Churches, G., et al., 2005. “Programming scientific and

distributed workflow with Triana services”,

Concurrency and Computation: Practice and

Experience, vol. 18, no. 10, pp. 1021--1037.

Congiusta, D., Talia, D., and Trunfio, P. 2007.

“Distributed data mining services leveraging WSRF,”

Future Generation Computer Systems, vol. 23, no. 1,

pp. 34--41.

DataMiningGrid. 2006. Data Mining in Grid Computing

Environments, EU contract no. 4475,

http://www.datamininggrid.org

Foster, I., Kesselman, C. and Tuecke, S., 2001. “The

Anatomy of the Grid: Enabling Scalable Virtual

Organizations,” International Journal of High

Performance Computing Applications, vol. 15, no. 3,

pp. 200--222.

Guedes, D., Meira, W.Jr., and Ferreira, R., 2006.

“Anteater: A Service-Oriented Architecture for High-

Performance Data Mining”, IEEE Internet Computing,

pp. 36--43.

Kumar, M., Kantardzic, M. and Madden, S., 2006. “Guest

Editors' Introduction: Distributed Data Mining--

Framework and Implementations,” IEEE Internet

Computing, vol. 10, no. 4, pp. 15--17.

Nabrzyski, J., Schopf, M., and Węglarz, J., 2004. (editors),

“Grid Resource Management: State of the Art and

Future Trends,” Kluwer Academic Publishers, Boston.

Plaszczak, P. and Wellner, Jr. R., 2006. “Grid Computing:

The Savvy Manager’s Guide,” Morgan Kaufmann,

Amsterdam.

Sotomayor, B., and Childers, L., 2006. “Globus Toolkit 4:

Programming Java Services,” Morgan Kaufmann,

Amsterdam.

Stankovski et al., 2008a. “Grid-enabling data mining applications

with DataMiningGrid: An architectural perspective”,

Future Generation Computer Systems, vol. 24, no. 4,

pp. 259--279.

Stankovski et al., 2008b. “Digging Deep in the Data Mine

with DataMiningGrid”, IEEE Internet Computing, in

press.

Stankovski, V. and Dubitzky, W. 2007. “Special Section:

Data Mining in Grid Computing Environments”,

Future Generation Computer Systems, vol. 23, no. 1,

pp.

Thain, D., Tannenbaum, T. and Livny, M., 2005.

“Distributed computing in practice: The Condor

experience,” Concurrency and Computation: Practice and Experience,

vol. 17, pp. 323--356.

Trnkoczy, J. and Stankovski, V. 2008. “Improving the

performance of Federated Digital Library services”

Future Generation Computer Systems, in press,

doi:10.1016/j.future.2008.04.007.

Trnkoczy, J., Turk, Ž. and Stankovski, V., 2006. “A Grid-

based Architecture for Personalized Federation of

Digital Libraries,” Library Collections, Acquisitions,

and Technical Services, vol. 30, pp. 139--153.

Venugopal, S., Buyya, R., and Winton, L., 2006. “A Grid

Service Broker for Scheduling e-Science Applications

on Global Data Grids,” Concurrency and

Computation: Practice and Experience, vol.18, no 6,

pp. 685-69.

Wegener, D. and May, M. 2007. “Extensibility of Grid-

Enabled Data Mining Platforms: A Case Study” In

Proc. of the 5th International Workshop on Data

Mining Standards, Services and Platforms, pp 13-22,

San Jose, California, USA, August, 2007. ISBN 978-

1-59593-838-1.
