JCL: A High Performance Computing Java Middleware

Andr

e Lu

ıs Barroso Almeida

1,2

, Saul Emanuel Delabrida Silva

, Antonio C. Nazare Jr.

and Joubert de Castro Lima

DECOM, Universidade Federal de Ouro Preto, Ouro Preto, Minas Gerais, Brazil

CODAAUT, Instituto Federal de Minas Gerais, Ouro Preto, Minas Gerais, Brazil

DCC, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brazil

Keywords:

Big Data, Internet of Things, Middleware, Reﬂective Computing, High Performance Computing, Distributed

Shared Memory, Remote Method Invocation.

Abstract:

Java C

a&L

a or just JCL is a distributed shared memory reﬂective lightweight middleware for Java developers

whose main goals are: i) provide a simple deployment strategy, where automatic code registration occurs,

ii) support a collaborative multi-developer cluster environment where applications can interact without ex-

plicit dependencies, iii) execute existing sequential Java code over both multi-core machines and cluster of

multi-core machines without refactorings, enabling the separation of business logic from distribution issues in

the development process, iv) provide a multi-core/multi-computer portable code. This paper describes JCL’s

features and architecture; compares and contrasts JCL to other Java based middleware systems, and reports

performance measurements of JCL applications.

1 INTRODUCTION

We live in a world where large amounts of data

are stored and processed every day (Han et al.,

2011). Despite the signiﬁcant increase in the per-

formance of today’s computers, there are still big

problems that are intractable by sequential comput-

ing approaches (Kaminsky, 2015). Big data (Bryn-

jolfsson, 2012), Internet of Things (IoT) (Perera et al.,

2014) and elastic cloud services (Zhang et al., 2010)

are results of this new decentralized, dynamic and

communication-intensive society. Many fundamen-

tal services and sectors such as electric power supply,

scientiﬁc/technological research, security, entertain-

ment, ﬁnancial, telecommunications, weather fore-

cast and many others, use solutions that require high

processing power. For instance, “Walmart handles

more than a million customer transactions each hour

and it estimated that the transaction database contains

more than 2.5 petabytes of data” (Troester, 2012).

Thus, these solutions are executed over parallel and

distributed computer architectures. In this context, a

new demand for both computer architectures and ap-

plications to handle such big problems has emerged.

High Performance Computing (HPC) is based

on the concurrence principle, so high speedups are

achievable, but the development process becomes

complex when concurrence is introduced. Therefore,

middleware systems and frameworks are designed to

help reducing the complexity of such development.

The use of middleware as a software layer on top

of an operating system became usual in last years in

order to organize a computer cluster or grid (Tanen-

baum and Van Steen, 2007). The challenging issue

is how to provide sufﬁcient support and general high-

level mechanisms using middleware for rapid devel-

opment of distributed and parallel applications. Fur-

thermore, the middleware systems found in the liter-

ature have no information of their use on different

computational platforms. For instance, there is no

evidence that a set of embedded devices are able to

work with a cloud platform using the same middle-

ware. To developers, the system integration is not

transparent and depends of different skills. In Perera

et al. (2014), the authors mention the existence of six

machine classes in IoT and they advocate that mid-

dleware systems should run over all of them (Perera

et al., 2014).

Middleware systems can be adopted for gen-

eral purposes, such as Message Passing Interface

(MPI) (Forum, 1994), Java Remote Method Invoca-

tion (RMI) (Pitt and McNiff, 2001), Hazelcast (Veen-

tjer, 2013), JBoss (Watson et al., 2005) and many

others, but they can also be designed for a speciﬁc

Almeida, A., Silva, S., Jr., A. and Lima, J.

JCL: A High Performance Computing Java Middleware.

In Proceedings of the 18th International Conference on Enterprise Information Systems (ICEIS 2016) - Volume 1, pages 379-390

ISBN: 978-989-758-187-8

379

purpose, like gaming, mobile computing and real-

time computing, for instance (Murphy et al., 2006;

Gokhale et al., 2008; Tariq et al., 2014). They support

a programming model based on shared memory, mes-

sage passing or event based (Ghosh, 2014). Among

various programing languages for middleware sys-

tems, the interest in Java for HPC is rising (Taboada

et al., 2013). This interest is based on many fea-

tures, such built-in networking, multithreading sup-

port, platform independence, portability, type safety,

security, extensive API and wide community of de-

velopers (Taboada et al., 2009).

Besides the computational performance, the is-

sues presented below must be considered in the design

and development of a modern middleware.

Refactorings: Usually, middleware systems intro-

duce some dependencies to HPC applications, thus,

end-users need to follow some standards speciﬁed

by them, since methods and global variables must

be distributed over a data network. Consequently,

an ordinary Java or C++ object

must implement

several middleware classes or statements to become

distributed. There are many middleware examples

with such dependencies, including standards and mar-

ket leaders like Java RMI (Pitt and McNiff, 2001),

JBoss (Watson et al., 2005) and Mapreduce based so-

lutions (Hindman et al., 2011; Zaharia et al., 2010).

As a consequence of these dependencies, two prob-

lems emerge: i) the end-user cannot separate business

logic from distribution issues during the development

process and; ii) existing and well tested sequential ap-

plications cannot be executed over HPC architectures

without refactorings. A zero-dependency middleware

is unrealistic, but a middleware with few adaptations

is fundamental to achieve low coupling with the ex-

isting code.

Deployment: Deployment can be a time consum-

ing task in large clusters, i.e. any live update of an

application module or class often interrupts the execu-

tion of all services running in the cluster. Some mid-

dleware systems adopt third-party solutions to dis-

tribute and update modules in a cluster (Henning and

Spruiell, 2006; Nester et al., 1999; Veentjer, 2013;

Pitt and McNiff, 2001), but sometimes updating dur-

ing application runtime and without stoppings is a re-

quirement. This way, middleware systems capable of

deploying a distributed application transparently, as

well as updating its modules during runtime and pro-

grammatically, are very useful to reduce maintenance

costs caused by several unnecessary interruptions.

Collaboration: Cloud computing introduces op-

portunities, since it allows collaborative development

“An object is a self-contained entity consisting of data

and procedures to manipulate data” (Egan, 2005)

or development as a service in cloud stack. A middle-

ware providing a multi-developer environment where

applications can access methods and user typed ob-

jects from each other without explicit references is

fundamental to introduce development as a service or

just to transform a cluster into a collaborative devel-

opment environment.

Portable Code: Portable multi-core/multi-

computer code is an important aspect to consider

during development process, since in many insti-

tutions, such as research ones, there can be huge

multi-core machines and several clusters of ordinary

PCs to solve a couple of problems. This way, code

portability is very useful to test algorithms and data

structures in different computer architectures without

refactorings. A second justiﬁcation for offering at

least two releases in a middleware is that clusters are

nowadays multi-core, so middleware systems must

implement shared memory architectural designs in

conjunction with distributed ones.

The main goal of this work is to introduce a new

middleware that ﬁll the issues previously shown, pre-

cisely:i) a simple deployment strategy and capacity to

update internal modules during runtime; ii) a collab-

orative multi-developer environment; iii) a service to

execute existing sequential Java code over both multi-

core machines and cluster of multi-core machines

without refactorings; iv) multi-core/multi-computer

portable code. This middleware, called Java C

a&L

or just JCL, is a tool for develop HPC applications.

In this paper, the three main components of our archi-

tecture are presented and evaluated. Is not the focus

of this work to be a “how to” guide, although some

source codes are shown in order to clarify the reader’s

understanding. The main contributions of JCL are:

i) A middleware that gathers several features pre-

sented separately in the last decades of middle-

ware literature, enabling building distributed ap-

plications with few portable instructions to clus-

ters made from different platforms;

ii) A comparative study of market leaders and well

established middleware standards for Java com-

munity. This paper emphasizes the importance of

several features and how JCL and its counterparts

fulﬁll them;

iii) A scalable middleware over multi-core and multi-

computer architectures;

iv) A feasible middleware alternative to fast proto-

type portable Java HPC applications.

This work is organized as follows. Section 2 dis-

cusses works that are similar to the proposed middle-

Java C

a&L

a is available for download at http://

www.joubertlima.com.br

ICEIS 2016 - 18th International Conference on Enterprise Information Systems

380

ware, pointing out their beneﬁts and limitations. Sec-

tion 3 details the JCL middleware, presenting its ar-

chitecture and features. Section 4 describes a user

case application. Section 5 presents our experimen-

tal evaluation and discusses the results. Finally, in

Section 6, we conclude our work and point out future

improvements of JCL.

2 RELATED WORK

In this section, we describe the most promising mid-

dleware systems in various stages of development.

We evaluated each work in terms of: i) requirement

for low/medium/high refactorings, ii) automatic or

manual deployment, iii) support for collaborative de-

velopment, iv) implementation of both multi-core and

multi-computer portable code. Other analyses were

made to verify if the middleware is discontinued, and

if it is fault tolerant in terms of storage and processing.

Academic and commercial solutions are put together

and their limitations/improvements are highlighted in

Table 1. Middleware systems that present high simi-

larities with JCL are described in detail in this section.

The remaining related work is described just in Table

Inﬁnispan by JBoss/RedHat (Team, 2015) is a

popular open source distributed in-memory key/value

data store (Di Sanzo et al., 2014) which enables two

ways to access the cluster: i) the ﬁrst way enables an

API avaliable in a Java library ; ii) the second way en-

ables several protocols, such as HotRod, REST, Mem-

cached and WebSockets, making Inﬁnispan a lan-

guage independent solution. Besides storage services,

the middleware can execute tasks remotely and asyn-

chronously, but end-users must implement Runnable

or Callable interfaces. Furthermore, it is necessary

to register these tasks at Java virtual machine (JVM)

classpath of each cluster node, so Inﬁnispan does not

have the dynamic class loading feature.

Java Parallel Processing Framework JPPF is an

open source grid computing framework based on pure

Java language (Xiong et al., 2010) which simpliﬁes

the process of parallelizing applications that demand

high processing, allowing end-users to focus on their

core software development (Cohen, 2015). It imple-

ments the dynamic class loading feature in cluster

nodes, but it does not support collaborative develop-

ment, i.e. methods cannot be shared among different

JPPF applications, producing many services over a

cloud infrastructure, for instance. JPPF does not im-

plement shared memory services just execute meth-

ods.

Hazelcast (Veentjer, 2013) is a promising middle-

ware in the industry. It offers the concept of func-

tions, locks and semaphores. Hazelcast provides a

distributed lock implementation and makes it possible

to create a critical section within a cluster of JVM;

so only a single thread from one of the JVM’s in

the cluster is allowed to acquire that lock. (Veentjer,

2013). Besides an Application Programming Inter-

face (API) for asynchronous remote method invoca-

tions, Hazelcast has a simple API to store objects in a

computer grid. JCL separates business logic from dis-

tribution issues and, in Hazelcast, both requirements

are put together, so ﬂexibility and dynamism are re-

duced during execution time. Hazelcast cannot in-

stantiate a global variable remotely like JCL, i.e., it al-

ways maintains double copies of each variable at each

remote instantiation. Hazelcast has manual schedul-

ing for global variables and executions, so the end-

user can control the cluster machine to store or run an

algorithm. Hazelcast does not implement automatic

deployment, so it is necessary to manually add each

end-user class to the JVM classpath before starting

each Hazelcast node.

Oracle Coherence is an in-memory data grid

commercial middleware that offers database caching,

HTTP session management, grid agent invocation and

distributed queries (Seovic et al., 2010). Coherence

provides an API for all services, including cache ser-

vices and others. It enables an agent deployment

mechanism, so there is the dynamic class loading fea-

ture in cluster nodes, but such agents must implement

the EntryProcessor interface, thus refactorings are

necessary.

RAFDA (Walker et al., 2003) is a reﬂective mid-

dleware. It permits arbitrary objects in an applica-

tion to be dynamically exposed for remote access, al-

lowing applications to be written without concern for

distribution (Walker et al., 2003). RAFDA objects

are exposed as Web services without requiring reengi-

neering to provide distributed access to ordinary Java

classes. Applications access RAFDA functionalities

by calling methods on infrastructure objects named

RAFDA runtime (RRT). Each RRT provides two in-

terfaces to application programmers: one for local

RRT accesses and the other for remote RRT accesses.

RRT has peer-to-peer communication, so it is possible

to execute a task in a speciﬁc cluster node, but if the

end-user needs to submit several tasks to more than

one remote RRT, a scheduler must be implemented

from the scratch. RAFDA has no portable multi-core

and multi-computer versions.

In the beginning of 2000’s, an interesting middle-

ware, named FlexRMI, was proposed by Taveira et

al. (2003) to enable asynchronous remote method

invocation using the standard Java Remote Method

JCL: A High Performance Computing Java Middleware

381

Table 1: JCL and its counterparts’ features.

Tool

Feature

Fault

Tolerant

Refactoring

required

Simple

Deploy

Collaborative

Portable

Code

Support

Available

JCL No No Yes Yes Yes Yes

Inﬁnispan Yes Low No Yes No Yes

JPPF Yes No Yes No No Yes

Hazelcast Yes Low No Yes No Yes

Oracle

Coherence

Yes Medium NF

Yes No Yes

RAFDA No No Yes Yes No Yes

PJ No NF

No Yes Yes

FlexRMI No Medium No No No No

RMI No Medium No No No Yes

Gridgain Yes Low No Yes No Yes

ICE Yes High No No No Yes

MPJ

Express

No Medium No No Yes Yes

Jessica NF

No Yes No Yes Yes

1 - NF: Not found

Invocation (RMI) API. FlexRMI is a hybrid model

allowing both asynchronous or synchronous remote

methods invocations. There are no restrictions in

the ways a method is invoked in a program. The

same method can be called asynchronously at one

point and synchronously at another point in the

same application. It is the programmer’s respon-

sibility the decision on how the method call is to

be made. (Taveira et al., 2003) FlexRMI changes

Java RMI stub and skeleton compilers to achieve

high transparency. FlexRMI is the RMI asyn-

chronous, so there is no multi-core version. Fur-

thermore, it requires at least “java.rmi.Remote”

and “java.rmi.server.UnicastRemoteObject” ex-

tensions to produce a RMI application. Since it does

not implement the dynamic class loading feature, all

classes and interfaces must be stored in nodes before

a RMI (and also FlexRMI) application starts, making

deployment a time-consuming effort.

Gelibert et al. (2011) proposed a new middleware

using Distributed Shared Memory (DSM) principles

to efﬁciently simplify the clustering of dynamic ser-

vices. The proposed approach consists in transpar-

ently integrating DSM into Open Services Gateway

Initiative (OSGi) (OSGi, 2010) service model using

containers and annotations. The authors use Terra-

cotta framework (Terracotta Inc., 2008) as a kernel of

the entire solution. Gelibert et al. (2011) point out

limitations of using static types in the code, since in-

strumentation is done at runtime, thus the compiler

cannot perform static veriﬁcation on the application

code. This creates complicated debugging scenarios

when problems, especially transient ones, occur.

Programming Graphical Processing Unit (GPU)

clusters with a distributed shared memory abstraction

offered by a middleware layer is a promising solu-

tion for some speciﬁc problems, i.e. Single Instruc-

tion Multiple Data (SIMD) solutions. In (Karantasis

and Polychronopoulos, 2011), an extension of Pleiad

middleware (Karantasis and Polychronopoulos, 2009)

is implemented enabling Java developers to work with

a local GPU abstraction over several machines with

one-four GPU devices each.

3 JCL ARCHITECTURE

This section details the architecture of the proposed

middleware. The reﬂective capability of several pro-

gramming languages, including Java, is an elegant

way for middleware systems to introduce low cou-

pling between distribution and business logic, as well

as simplify the deployment process and introduce

cloud multi-developer environments. Thus, reﬂection

is the basis for many JCL features.

JCL has two versions: multi-computer and multi-

core. The multi-computer version, named “Pacu”,

stores objects and executes tasks over a cluster where

all communications are done via TCP/UDP proto-

col. On the other hand, the multi-core version, named

ICEIS 2016 - 18th International Conference on Enterprise Information Systems

382

“Lambari”, turns the User component into a local host

component without the overhead of TCP/UDP com-

munications. All objects and tasks are respectively

stored and executed locally on the end-user machine.

The architecture of JCL is composed of three main

components: User, Server, and Host. While User

is designed to expose the middleware services in a

unique API to be adopted by developer, Server is re-

sponsible to manage the JCL cluster. Finally, the Host

component is where the objects are stored and the reg-

istered methods are invoked. The next sections de-

scribe the design choices of JCL to provide the previ-

ously cited features.

3.1 User Component

To achieve portability, a single access point to JCL

cluster is mandatory and the User component is re-

sponsible for that. It represents a uniﬁed API for both

versions (multi-computer and multi-core) and it is

where asynchronous remote method invocations and

object storage take place. The user selects which ver-

sion to start with according to a property ﬁle conﬁg-

ured by the end-user. In the multi-core version, User

avoids network protocols, performing shared memory

communications with Host. In the multi-computer

version, UDP and TCP/IP protocols are adopted, thus

marshalling/unmarshalling, location, naming and sev-

eral other components are introduced. These compo-

nents are fundamental to distributed systems and are

explained in details in Coulouris et al. (2007).

During the application execution, User component

follows a pipeline composed of six steps: 1) receive

end-user application calls; 2) generate unique identi-

ﬁers for these calls; 3) return the identiﬁers to the end-

user application; 4) schedule them; 5) submit them to

Hosts; 6) and ﬁnally, store results of submitted calls

locally. Since JCL is by default asynchronous, end-

user application calls receive a ticket for each submit-

ted task. After a complete execution of the pipeline

above, the result is ready to be obtained using the

identiﬁer provided by User. Step 5 can be optimized

in the multi-computer version if successive submis-

sions occur, i.e. successive calls are buffered and sub-

mitted in batch to a Host.

This component adopts different strategies to

schedule processing and storage calls in the multi-

computer version. It allocates Hosts to handle pro-

cesses according to the number of cores in the en-

tire cluster. For instance, in a cluster with ten quad-

core machines, we have forty cores, so User submits

chunks of processing calls to the ﬁrst machine, where

each chunk size must be multiple of four, since it is

a quadcore processor. Internally, a Host allocates a

pool of threads, also with size multiple of four, to

consume such processing calls. After the ﬁrst chunk,

User sends the second, the third and so on. After ten

submissions, User starts submitting to the ﬁrst ma-

chine again. The circular list behavior continues as

long as there are processing calls to be executed. Het-

erogeneous clusters are possible, since JCL automati-

cally allocates a number of chunks proportional to the

number of cores of each machine.

There is a scheduling strategy for storage calls, so

the User component calculates a function F to deter-

mine in which host the global variable will be stored

(Equation 1), where hash(vn) is the global variable

name hashcode, nh is the number of JCL hosts and

F is the node position. Experiments with incremen-

tal global variable names like “p i j” or “p i”, where i

and j are incremented for each variable and p is any

preﬁx, showed that F achieves an almost uniform dis-

tribution for object storage over a cluster in several

scenarios with different variable name combinations,

however there is no guarantee of a uniform distribu-

tion for all scenarios. For this reason, User intro-

duces a delta (d) property that normally ranges from

0 − 10% of nh. The delta property relaxes function F

result enabling two or more Hosts as alternatives to

store a global variable.

F =

|hash(vn)|

(1)

In general, d relaxes a ﬁxed JCL Host selection

without introducing overhead in F. A drawback in-

troduced by d is that JCL must check (2 ∗ d) + 1 ma-

chines to search for an object, i.e., if d is equal to 2,

JCL must check ﬁve machines (two machines before

and two after the machine identiﬁed by function F in

the logical ring). JCL checks all ﬁve alternatives in

parallel, so the drawback is very small, as our experi-

ments demonstrate. JCL with delta equals 2, 1 and 0

has similar execution time in clusters with 5, 10 and

15 machines.

3.2 Server Component

This component was designed to manage the cluster

and is responsible for receiving the information from

each Host and distributing it to all registered User

components, enabling them to directly communicate

with each Host. The Server also implements the pos-

sibility for the end-user to assign the placement of

objects stored in the cluster, thereby disrespecting the

Host selection obtained from the function F presented

in Equation 1.

The Server fulﬁlls the function of centralizing

JCL: A High Performance Computing Java Middleware

383

component, receiving the features of the computer

where each Host is installed. Before adding a new

Host, the Server notiﬁes its presence to all regis-

tered Hosts that, after receiving the new member

registration notiﬁcation, recalculate the function F

and change the Host objects, if necessary. After all

changes have taken place, the Server is notiﬁed, ful-

ﬁlling the registration of the new Host.

When there is an end-user application running, at

least one Host registration needs to be completed suc-

cessfully, so that such application can receive the clus-

ter map, enabling direct communication with the reg-

istered Hosts and eliminating the necessity to search

for a Host in the Server at every new demand.

One of JCL’s advantages is the possibility of stor-

ing Java objects in a speciﬁc Host. In this case, the

end-user can specify Hosts that are different from

those calculated by function F. To guarantee that

all running applications in a speciﬁc cluster have ac-

cess to all the instantiated objects, the locations as-

signed manually by the end-user are centralized in the

Server, this way they do not depend on the function

F. It is possible to note that the increase of manually

assigned variables concentrates the workload on the

Server, thus variables with high amount of accesses

can cause bottlenecks in the Server component. The

end-user can also choose the Host to execute their

methods, therefore JCL scales or allows its developers

to scale their demands.

3.3 Host Component

JCL Host has two basic functions: to store the ob-

jects sent by User or by another Host and invoke pre-

viously registered class methods. Its architecture al-

lows the Host component to dynamically determine

the number of threads (workers) running the end-user

class methods. By default, the application is con-

ﬁgured to use the total number of cores available on

the Host, that is, if the computer has four cores, four

workers are created. Nevertheless, the user can create

as many workers as necessary by simply setting the

property that assigns the number of threads to be cre-

ated. The justiﬁcation for such feature is the possibil-

ity to combine CPU-bound methods with I/O bound

method executions, requiring the operating system to

schedule them, what increases the number of context

switches, however such extra workload usually pays

off, since there is a possibility of prefetching CPU

bound method executions while waiting for I/O re-

sults.

Before the Host component publishes its services

to User components, it starts a JCL pipeline composed

of three steps: 1) the Host notiﬁes the Server its in-

tention to join the cluster; 2) the Server propagates

the existence of a new Host; 3) the Server allows the

Host to join JCL cluster.

A property that differentiates JCL from most of

its counterparts is the class registration process that

simpliﬁes deployment. This process, invoked by User

and performed by Host, adds the necessary classes

to JVM classpath at runtime, which enables the end-

user to store objects and remotely execute methods

without manually registering the class in each JVM

of each cluster Host.

To perform object storage, two different alterna-

tives are adopted. In the ﬁrst one, the end-user creates

the object and sends it to be stored in a Host, which

may or may not be chosen by him. In the second one,

the end-user deﬁnes which object should be created,

passing its arguments for the constructor and, there-

fore, enabling the object instantiation directly at the

Host. It is possible to note that the second storage

option allows the creation of massive objects at Host

without transferring them through the data network.

As described previously in this section, the mid-

dleware is based on Java reﬂection. This way, there is

no need to adapt any classes so that they can be ex-

ecuted at Hosts. Once registered, the target classes,

as well as their dependencies, are sent to Host where

methods are mapped and available for execution. This

feature enables JCL applications to separate business

logic code from distribution code, as well as simpli-

ﬁes deployment and enables distributed objects stor-

age.

4 USE CASE

This section aims to evaluate JCL in terms of fun-

damental computer science algorithms development,

such as sorting. The JCL BIG sorting application was

implemented, since it represents a solution with inten-

sive communication, processing and I/O.

The distributed BIG sorting application is a sort-

ing solution where data are partitioned and also

sorted, i.e. there is no centralized sorting mecha-

nism. Data are generated and stored in a binary ﬁle

by each Host thread, performing parallel I/O on each

Host component. Data are integers between −10

+10

. The ﬁnal sorting contains one million different

numbers and their frequencies distributed over a clus-

ter, but the original input data were generated from

two billion possibilities.

The sorting application is a simple and elegant

sorting solution based on items frequencies. The fre-

quency of each number of each input data partition

is obtained locally by each Host thread and a chunk

ICEIS 2016 - 18th International Conference on Enterprise Information Systems

384

24 J CL f a c a d e j c l = J CL Fac a d e I m p l . g e t I n s t a n c e ( ) ;

26 i n t n u m J C L c l u s t e r C o r e s = j c l . g e t C l u s t e r C o r e s ( ) ;

27 / / r e g i s t e r i n g

28 j c l . r e g i s t e r ( Random Number . c l a s s , ” Random Number ” ) ;

30 / / b u i l d s t h e i n p u t d a t a , p a r t i t i o n e d o v e r JCL c l u s t e r

31 O b j e c t [ ] [ ] a r g s = new O b j e c t [ n u m J C L c l u s t e r C o r e s ] [ ] ;

32 f o r ( i n t i =0; i<n u m J C L c l u s t e r C o r e s ; i ++) {

33 O b j e c t [ ] oneArg = { se m e n t e s , ” o u t p u t ”+ i } ;

34 a r g s [ i ]= oneArg ;

35 }

36 L i s t <S t r i n g > t i c k e t s = j c l . e x e c u t e A l l C o r e s ( ” Random Number ” , ” Crea te1GB ” , ←-

a r g s ) ;

37 j c l . g e t A l l R e s u l t B l o c k i n g ( t i c k e t s ) ;

38 f o r ( S t r i n g a T i c k e t : t i c k e t s ) j c l . r e m o v e R e s u l t ( a T i c k e t ) ;

39 t i c k e t s . c l e a r ( ) ;

40 t i c k e t s = n u l l ;

41 S ystem . e r r . p r i n t l n ( ” Time t o c r e a t e i n p u t ( s e c ) : ” + ←-

( Syst em . nanoTime ( )−t i m e ) /1 0 0 0 0 0 0 0 0 0 ) ;

Figure 1: Main class - how to generate pseudo-random numbers in the JCL cluster.

strategy builds a local data partition for the entire

JCL cluster, i.e. each thread knows how many JCL

threads are alive, so all number frequencies (n f ) di-

vided by number of cluster threads (nct) create a con-

stant C, i.e. C = n f /nct. Each different number in

an input data partition is retrieved and its frequency

is aggregated in a global frequency GF. When GF

reaches C value, a new chunk is created, so C is fun-

damental to produce chunks with similar number fre-

quencies without storing the same number multiple

times. When JCL avoids equal number values it also

reduces communication costs, since numbers of one

Host thread must be sent to other threads in the JCL

cluster to perform a fair distributed sorting solution.

The sorting is composed of three phases, besides

the data generation and a validation phase to guar-

antee that all numbers from all input data partitions

are retrieved and checked against JCL sorting dis-

tributed structure. The sorting has approximately 350

lines of code, three classes and only the main class

must be a JCL class, i.e. inherit JCL behavior. The

pseudo-aleatory number generation phase illustrates

how JCL executes existing sequential Java classes on

each Host thread with few instructions (Figure 1).

Lines 24, 26 and 28 of the main class illustrate how to

instantiate JCL, obtain JCL cluster number of cores

and register a class named “Random Number” in

JCL, respectively. Lines 31-35 represent all argu-

ments of all “Create1GB” method calls, so in our

example we have “numJCLClusterCores” method

arguments and each of them is a string labeled

“output suﬃx”, where the sufﬁx varies from 0 to

“numJCLClusterCores” variable value.

Line 36 represents a list of tickets, adopted

to store all JCL identiﬁers for all method calls,

since JCL is by default asynchronous. The

JCL method “executeAllCores” executes the same

method “Create1GB” in all Host threads with unique

arguments on each method call. Line 37 is a synchro-

nization barrier, where big sorting main class waits

until some tasks, identiﬁed by “tickets” variable, have

ﬁnished. From lines 38-40 objects are destroyed lo-

cally and remotely (line 38), and ﬁnally in line 41

there is the time elapsed to generate pseudo-random

numbers over a cluster of multi-core machines and in

parallel. The “Random Number” class is a sequen-

tial Java class and method “Create1GB” adopts Java

Random math class to generate 1GB numbers on each

input data partition binary ﬁle.

Phases one, two and three are similar to Figure 1,

i.e. they are inside the main class and they behave ba-

sically splitting method calls over the cluster threads

and then waiting all computations to end. Precisely,

at phase one JCL reads the input and produces the

set of chunks, as well as each chunk frequency or the

frequencies of its numbers. C is calculated locally in

phase one, i.e. for a single input data partition, so in

C equation n f represents how many numbers an input

data partition contains and nct represents the number

of JCL Host threads. Phase one ﬁnishes its execution

after storing all number frequencies locally in a JCL

Host to avoid a second ﬁle scan. It is possible to note

that phase one does not split the numbers across the

local chunks, since the algorithm must ensure a global

chunk decision for that.

After phase one, the main class constructs a global

JCL: A High Performance Computing Java Middleware

385

169 l o n g l o a d = 0 ; i n t b ; S t r i n g r e s u l t = ” ” ;

170 f o r ( I n t e g e r a c : s o r t e d ) {

171 l o a d +=map . g e t ( ac ) ;

172 i f ( lo a d >( t o t a l F / ( numOfJCLThreads ) ) ) {

173 b= a c ;

174 r e s u l t +=b+ ” : ” ;

175 l o a d = 0 ;

176 }

177 }

Figure 2: Main class - how to mount the global chunk schema to partition the cluster workload.

97 f o r ( i n t r = 0 ; r<numJCLThreads ; r ++) {

98 JCLMap<I n t e g e r , Map<I n t e g e r , Long>> h = new JCLHashMap<I n t e g e r , ←-

Map<I n t e g e r , Long>>( S t r i n g . va l u e O f ( r ) ) ;

99 h . p u t ( id , f i n a l [ r ] ) ;

Figure 3: Sorting class - how to deliver chunks to other Host threads.

sorting schema with fair workload. Figure 2 illus-

trates how the main class produces chunks with simi-

lar number frequencies. Each result of phase one con-

tains a schema to partition the cluster workload, so a

global schema decision must consider all numbers in-

side all chunks of phase one.

The main class calculates the total frequency of

the entire cluster, since each thread in phase one also

returns the chunk frequency. Variable “totalF” rep-

resents such a value. Lines 169 to 177 represent how

JCL sorting produces similar chunks with a constant

C as a threshold. The global schema is submitted to

JCL Host threads and phase two starts.

Phase two starts JCL Host threads and each thread

can obtain the map of numbers and their frequencies,

generated and stored at phase one. The algorithm

just scans all numbers and inserts them into speciﬁc

chunks according to the global schema received pre-

viously. Phase two ends after inserting all numbers

and their frequencies into JCL cluster to enable any

JCL Host thread to access them transparently. Fig-

ure 3 illustrates JCL global variable concept, where

Java objects lifecycles are transparently managed by

JCL over a cluster. The sorting class obtains a global

JCL map labelled “h” (Figure 3). Each JCL map

ranges from 0 to number of JCL threads in the clus-

ter (line 97), so each thread manages a map with its

numbers and frequencies, where each map entry is a

chunk of other JCL Host thread, i.e. each JCL Host

thread has several chunks created from the remain-

ing threads. Line 99 of Figure 3 represents a sin-

gle entry in a global map “h”, where “id” represents

the current JCL Host thread identiﬁcation and “ﬁnal”

variable represents the numbers/frequencies of such a

chunk. Phase three of sorting application just merges

all chunks into a unique chunk per JCL Host thread.

This way, JCL guarantees that all numbers are sorted,

but not centralized in a Server or Host component, for

instance.

Our sorting experiments were conducted with JCL

multi-computer version. The ﬁrst set of experiments

evaluated JCL in a desktop cluster composed of 15

machines, where 5 machines were equipped with In-

tel I7-3770 3.4GHz processors (4 physical cores and 8

cores with hyper-threading technology) and 16GB of

of RAM DDR 1333Mhz, and the other 10 machines

were equipped with Intel I3-2120 3.3GHz processors

(2 physical cores and 4 cores with hyper-threading

technology) and 8GB of RAM DDR 1333Mhz. The

Operating System was a Ubuntu 14.04.1 LTS 64 bits

kernel 3.13.0-39-generic and all experiments could

ﬁt in RAM memory. Each experiment was repeated

ﬁve times and both higher and lower runtimes were

removed. An average time was calculated from the

three remaining runtime values. JCL distributed BIG

Sorting Application version sorted 1 TB in 2015 sec-

onds and the OpenMPI version took 2121 seconds,

being JCL 106 seconds faster. Both distributed BIG

sorting applications (JCL and MPI) implement the ex-

plained idea and are available at JCL website.

The second experiments evaluated JCL in an em-

bedded cluster composed of two raspberry pi devices,

each one with an Arm ARM1176JZF-S processor,

512MB of RAM and 8GB of external memory, and

one raspberry pi 2 with a quadcore processor oper-

ating at 900MHz, 1GB of RAM and 8GB of exter-

nal memory. The Operating System was Raspbian

Wheezy and all experiments could ﬁt in RAM mem-

ory. Each experiment was repeated ﬁve times and

both higher and lower runtimes were removed. An

ICEIS 2016 - 18th International Conference on Enterprise Information Systems

386

average time was calculated from the three remaining

runtime values.

The JCL distributed BIG sorting was modiﬁed to

enable devices with low disk capacity to also sort a

big amount of data. Basically, the new sorting version

does not store the pseudo-random numbers in external

memory. It gathers the number generation phase with

the phase where number frequencies are calculated.

Differently from other IoT middleware systems (Per-

era et al., 2014), where small devices such as rasp-

berry pi are adopted only for sensing, JCL introduces

the possibility to implement general purpose applica-

tions and not only sensing ones. Furthermore, JCL

sorting can run on large or small clusters, as well as

massive muti-core machines with a unique portable

code. The small raspberry pi cluster sorted 60GB of

data in 2,7 hours.

5 EXPERIMENTAL EVALUATION

Experiments were conducted with JCL multi-

computer and multi-core versions. Initially, the JCL

middleware was evaluated in a desktop cluster com-

posed of 15 machines, the same cluster used to test

JCL distributed BIG sorting application. The middle-

ware was evaluated in terms of throughput, i.e., the

number of processed JCL operations per second. The

goal of these experiments is to stress JCL measuring

how many executions it supports per second and also

how uniform function F, presented in Equation 1, can

be when both incremented global variable names and

random names are adopted.

In the ﬁrst set of experiments, we tested JCL asyn-

chronous remote method invocation (Figure 4). For

each test we ﬁxed the number of remote method in-

vocations to 100 thousand executions. JCL Protocol

Buffer Algorithm (PBA) algorithm was adopted, so

JCL differentiates the sizes of both machines of the

cluster to conﬁgure the workload. The experiments

were composed of two different methods: the ﬁrst one

is a void method with a book as argument, where a

book is a user type class composed of authors, editors,

edition, pages and year attributes (Figure 4 A); and the

second method is composed of an array of string and

two integer values as arguments which are adopted by

algorithms for calculating Levenshtein distance, Fi-

bonacci series and prime numbers (Figure 4 B). We

measured the throughput of each cluster conﬁguration

(5, 10 and 15 machines).

The results demonstrated that JCL’s throughput

rises when cluster size increases as the task becomes

more CPU bound. There is a throughput decrease

when the cluster increases from 5 to 10 machines and

7500

8000

8500

125

150

175

200

Method_Book

Method_Cpu

5 10 15

Cluster size

Throughput(s)

Figure 4: Method invocation.

1070

1080

1090

1100

1105

1110

1115

1120

Book

String

5 10 15

Cluster size

Throughput(s)

Figure 5: Global variable experiments.

the justiﬁcation is that the new ﬁve Hosts are smaller

than the ﬁrst ﬁve ones, so the throughput does not in-

crease at the same rate (Figure 4 B). The second test

represents non CPU bound scenarios, so it is clear that

network overhead is greater than task processing (Fig-

ure 4 A).

In the second set of experiments (Figure 5), we

ﬁxed the number of instantiated global variables to 40

thousand instantiations. We tested the book class in-

stantiation explained previously (Figure 5 A) and also

a smaller object like a string with 10 characters (Fig-

ure 5 B). We tested the ﬁve best machines ﬁrst and

then added the ten worst machines, thus this strategy

may inﬂuence the throughput results negatively. As

the cluster enlarges, the number of connections and

other issues also become time-consuming, thus a re-

JCL: A High Performance Computing Java Middleware

387

HostID

Frequency

0.00 0.02 0.04 0.06 0.08 0.10

Δ = 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

HostID

Frequency

0.00 0.02 0.04 0.06 0.08 0.10

Δ = 1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

HostID

Frequency

0.00 0.02 0.04 0.06 0.08 0.10

Δ = 2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

HostID

Frequency

0.00 0.02 0.04 0.06 0.08 0.10

Δ = 3

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Figure 6: Variable names with autoincrement.

HostID

Frequency

0.00 0.04 0.08 0.12

Δ = 0

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

HostID

Frequency

0.00 0.04 0.08 0.12

Δ = 1

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

HostID

Frequency

0.00 0.04 0.08 0.12

Δ = 2

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

HostID

Frequency

0.00 0.04 0.08 0.12

Δ = 3

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Figure 7: Bag of words.

duction in throughput should be expected. Another

important observation is the synchronous behavior of

JCL shared memory services, which is another bot-

tleneck when the cluster becomes bigger. The results

demonstrate that as the global variable becomes more

complex, the data network overhead increases. Pre-

cisely, the throughput of String variable reduces 2,3%

when JCL cluster increases from 5 to 15 Hosts, but

when book variable is adopted, the throughput re-

duces 3,3% in the same cluster conﬁgurations. The

positive aspect of such a scenario is that JCL over big-

ger clusters stores more data.

In the third set of experiments, we evaluated the

uniformity of function F presented in Equation 1 plus

a delta d and instantiated 40 thousand variable names.

JCL cluster size was set to 15 machines, and then we

tested different preﬁx variable names, and also autoin-

crement sufﬁxes, i.e., variable names like “p i” and

ICEIS 2016 - 18th International Conference on Enterprise Information Systems

388

“p i j”, where p is a preﬁx and i and j are autoin-

crement values. We also tested F + d distribution for

an arbitrary bag of words and chose the Holy Bible’s

words to verify how JCL data partition performs. The

results are illustrated in Figure 6 and 7, where ∆ is

delta size. Usually, JCL achieves an almost uniform

distribution using delta between zero and two.

The result of the bag of arbitrary words becomes

more uniform as delta increases, so even when the

end-user decides to adopt arbitrary variable names in

the code, JCL can achieve an almost fair data parti-

tion. We also tested the JCL overhead when a variable

content is retrieved using delta zero, one and two, and

there are almost no overhead varying deltas, but data

partition uniformity is reduced as delta tends to zero,

what can be seen in Figure 7. The justiﬁcation is that

network communication times for data checkings are

irrelevant when compared with remote instantiation

runtimes.

Finally, we also evaluated JCL multi-core ver-

sion against a Java thread implementation provided

by Oracle. An Intel I7-3770 3.4GHz processor with

8 cores, including hyper-threading technology, and

16GB of RAM was used in the experiment. We

implemented a sequential version for a CPU bound

task composed of existing Java sequential algorithms

for calculating Levenshtein distance, Fibonacci series

and prime numbers. We calculated JCL and Java

threads speedups, and the results demonstrated sim-

ilar speedups, i.e. in a machine with four physical

cores and four virtual cores, JCL achieved speedup of

5.61 and Oracle Java threads the speedup of 5.77.

6 CONCLUSION

In this paper, we present a novel reﬂective middleware

that is able to invoke remote methods asynchronously

and also manage Java objects lifecycle over a clus-

ter of JVMs. JCL is designed for multi-core, multi-

computer and hybrid computer architectures. End-

users write portable JCL applications, where global

variables are also multi-developer, so different appli-

cations can transparently share resources without ex-

plicit references over a computer cluster. JCL can ex-

ecute existing Java code or JCL code, this way JCL

can build complex applications. Reﬂection capabili-

ties enable JCL to separate distribution from business

logic, enabling both existing sequential code execu-

tions over many high performance computer archi-

tectures with zero changes and multiple distribution

strategies for a single sequential algorithm according

to a hardware speciﬁcation. Deployment in JCL is not

time consuming, i.e. a JCL cluster without end-user

code is sufﬁcient to run any Java application in JCL.

No other middleware solution puts all these features

together in a unique solution.

Experiments demonstrate that JCL is a promis-

ing solution, although many improvements must be

done. JCL must implement security methods. End-

users should be able to lock/unlock global variables

and execute tasks from private groups over a single

JCL cluster, enabling different collaboration levels or

proﬁles. JCL must be fault tolerant in storage and

processing. Future systems should be able to recover

on their own. Self-stabilization, self-healing, self-

reconﬁguration and recovery-oriented computing im-

plement several algorithms/protocols that can be in-

corporated into JCL. JCL must implement the con-

cept of multi-server, therefore a JCL server can man-

age, for instance, a cluster of JCL hosts with invalid

IPs and communicate with other JCL servers, pro-

viding a multi-cluster JCL solution. GPU execution

abstractions, where location and copies are transpar-

ent to end-users, are fundamental to JCL. A heuristic

based scheduler, where cloud requirements are con-

sidered, is also an important improvement to JCL. An

API for sensing is fundamental to JCL for IoT. Cross-

platform Host component, including platforms with-

out JVM, with JVMs that are not compatible with JSR

901 (Java Language Speciﬁcation) or platforms with-

out operating system, are mandatory to IoT. Built-in

modules for monitoring and administration should be

added to JCL.

ACKNOWLEDGEMENTS

We thank Jos

e Estev

ao Eug

enio de Resende and Gus-

tavo Silva Paiva for helping with experiments, Univer-

sidade Federal de Ouro Preto (UFOP) and Instituto

Federal de Minas Gerais (IFMG) for the infrastruc-

ture, and Fundac¸

ao de Amparo

a Pesquisa do Estado

de Minas Gerais (FAPEMIG) for ﬁnancial support.

REFERENCES

Brynjolfsson, E. M. (2012). Big data: The management

revolution. Harvard Business Review, 90(10):6066.

Cohen, L. (2015). Java Parallel Processing Frame-

work. Available from: hhttp://www.jppf.org/i.[15

Dezember 2015].

Coulouris, G., Dollimore, J., and Kindberg, T. (2007). Sis-

temas Distribu

ıdos - 4ed: Conceitos e Projeto. Book-

man Companhia.

Di Sanzo, P., Quaglia, F., Ciciani, B., Pellegrini, A., Di-

dona, D., Romano, P., Palmieri, R., and Peluso, S.

JCL: A High Performance Computing Java Middleware

389

(2014). A ﬂexible framework for accurate simula-

tion of cloud in-memory data stores. arXiv preprint

arXiv:1411.7910.

Egan, S. (2005). Open Source Messaging Application De-

velopment: Building and Extending Gaim. Apress.

Forum, M. P. (1994). Mpi: A message-passing interface

standard. Technical report, Knoxville, TN, USA.

Gelibert, A., Rudametkin, W., Donsez, D., and Jean, S.

(2011). Clustering osgi applications using distributed

shared memory. In Proceedings of International Con-

ference on New Technologies of Distributed Systems

(NOTERE 2011), pages 1–8.

Ghosh, S. (2014). Distributed systems: an algorithmic ap-

proach. CRC press.

Gokhale, A., Balasubramanian, K., Krishna, A. S., Bal-

asubramanian, J., Edwards, G., Deng, G., Turkay,

E., Parsons, J., and Schmidt, D. C. (2008). Model

driven middleware: A new paradigm for developing

distributed real-time and embedded systems. Science

of Computer programming, 73(1):39–58.

Han, J., Kamber, M., and Pei, J. (2011). Data Mining: Con-

cepts and Techniques. Morgan Kaufmann Publishers

Inc., San Francisco, CA, USA, 3rd edition.

Henning, M. and Spruiell, M. (2006). Distributed program-

ming with ice reading.

Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,

Joseph, A. D., Katz, R., Shenker, S., and Stoica, I.

(2011). Mesos: A platform for ﬁne-grained resource

sharing in the data center. In Proceedings of the

USENIX Conference on Networked Systems Design

and Implementation (NSDI 2011), pages 295–308.

Kaminsky, A. (2015). Big CPU, Big Data: Solving the

World’s Toughest Computational Problems with Par-

allel Computing. Unpublished manuscript. Retrieved

from http://www.cs.rit.edu/˜ark/bcbd.

Karantasis, K. and Polychronopoulos, E. (2011). Pro-

gramming gpu clusters with shared memory abstrac-

tion in software. In Proceedings of Euromicro In-

ternational Conference on Parallel, Distributed and

Network-Based Processing (PDP 2011), pages 223–

230.

Karantasis, K. I. and Polychronopoulos, E. D. (2009).

Pleiad: A cross-environment middleware providing

efﬁcient multithreading on clusters. In Proceedings of

ACM Conference on Computing Frontiers (CF 2009),

pages 109–116.

Murphy, A. L., Picco, G. P., and Roman, G.-C. (2006).

Lime: A coordination model and middleware support-

ing mobility of hosts and agents. ACM Trans. Softw.

Eng. Methodol., 15(3):279–328.

Nester, C., Philippsen, M., and Haumacher, B. (1999). A

more efﬁcient rmi for java. In Proceedings of the

ACM 1999 conference on Java Grande, pages 152–

159. ACM.

OSGi (2010). Osgi speciﬁcation release 4.2.

Perera, C., Liu, C. H., Jayawardena, S., and Chen, M.

(2014). A survey on internet of things from industrial

market perspective. Access, IEEE, 2:1660–1679.

Pitt, E. and McNiff, K. (2001). Java.Rmi: The Remote

Method Invocation Guide. Addison-Wesley Longman

Publishing Co., Inc.

Seovic, A., Falco, M., and Peralta, P. (2010). Oracle Co-

herence 3.5. Packt Publishing Ltd.

Taboada, G. L., Ramos, S., Exp

osito, R. R., Touri

no, J.,

and Doallo, R. (2013). Java in the high performance

computing arena: Research, practice and experience.

Science of Computer Programming, 78(5):425–444.

Taboada, G. L., Touri

no, J., and Doallo, R. (2009). Java

for high performance computing: assessment of cur-

rent research and practice. In Proceedings of the 7th

International Conference on Principles and Practice

of Programming in Java, pages 30–39. ACM.

Tanenbaum, A. S. and Van Steen, M. (2007). Distributed

systems. Prentice-Hall.

Tariq, M. A., Koldehofe, B., Bhowmik, S., and Rother-

mel, K. (2014). Pleroma: a sdn-based high perfor-

mance publish/subscribe middleware. In Proceed-

ings of the 15th International Middleware Conference,

pages 217–228. ACM.

Taveira, W. F., de Oliveira Valente, M. T., da Silva Bigonha,

M. A., and da Silva Bigonha, R. (2003). Asyn-

chronous remote method invocation in java. Journal

of Universal Computer Science, 9(8):761–775.

Team, I. (2015). Inﬁnispan 8.1 Documentation. Available

from: hhttp://inﬁnispan.org/docs/8.1.x/index.

htmli.[15 Dezember 2015].

Terracotta Inc. (2008). The Deﬁnitive Guide to Terracotta:

Cluster the JVM for Spring, Hibernate and POJO

Scalability. Springer Science & Business.

Troester, M. (2012). Big data meets big data analytics.

Cary, NC: SAS Institute Inc.

Veentjer, P. (2013). Mastering Hazelcast. Hazelcast.

Walker, S. M., Dearle, A., Norcross, S. J., Kirby, G. N. C.,

and McCarthy, A. (2003). Rafda: A policy-aware

middleware supporting the ﬂexible separation of ap-

plication logic from distribution. Technical report,

University of St Andrews. Technical Report CS/06/2.

Watson, R. T., Wynn, D., and Boudreau, M.-C. (2005).

Jboss: The evolution of professional open source soft-

ware. MIS Quarterly Executive, 4(3):329–341.

Xiong, J., Wang, J., and Xu, J. (2010). Research of dis-

tributed parallel information retrieval based on jppf.

In 2010 International Conference of Information Sci-

ence and Management Engineering, pages 109–111.

IEEE.

Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,

and Stoica, I. (2010). Spark: cluster computing with

working sets. In Proceedings of the 2nd USENIX con-

ference on Hot topics in cloud computing, pages 10–

10.

Zhang, Q., Cheng, L., and Boutaba, R. (2010). Cloud com-

puting: state-of-the-art and research challenges. Jour-

nal of internet services and applications, 1(1):7–18.

ICEIS 2016 - 18th International Conference on Enterprise Information Systems

390