• ability to block a thread when there are no data objects to process (and to wake it up when new data objects become available); and
• ability to block a thread when local buffers are full
(and to wake it up when space becomes available).
The RDMA middleware mainly uses one-sided RDMA write operations to transfer data objects directly between Java memory buffers. Additionally, it uses send/receive operations to exchange control data.
When initializing the application, communication end-points are created on each process, i.e., an RDMA server connection is created and bound to the machine's IP address. The next task is to connect the processes over the network. Briefly, this comprises the following steps:
• allocation and registration of memory buffers;
• allocation of queue pairs, completion channel, and
completion queue;
• start of a new thread (the network thread), which
handles the completion events;
• establishment of RDMA connections with all other processes;
• exchange of memory keys between processes (required to allow the execution of one-sided RDMA write operations), using send/receive operations;
• pre-allocation and initialization of objects needed to execute the network requests.
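The paper gives no code for this setup; the sketch below only illustrates the order of the steps above, using a hypothetical `Verbs` stub interface of our own (none of these type or method names come from the middleware or from a specific RDMA library):

```java
// Hypothetical outline of the connection setup. The Verbs interface is a
// stand-in for the underlying RDMA verbs objects; all names are ours.
interface Verbs {
    long registerBuffer(byte[] buf);  // pin + register memory, returns its key
    void createQueuePair();           // queue pair + completion channel/queue
    void connect(String peerAddress); // establish an RDMA connection
    void send(byte[] controlData);    // two-sided send (control path)
}

final class Setup {
    static void initialize(Verbs verbs, String[] peers, byte[][] buffers) {
        long[] keys = new long[buffers.length];
        for (int i = 0; i < buffers.length; i++)
            keys[i] = verbs.registerBuffer(buffers[i]);      // 1. allocate + register buffers
        verbs.createQueuePair();                             // 2. QPs, channel, completion queue
        Thread network = new Thread(Setup::pollCompletions); // 3. start the network thread
        network.setDaemon(true);
        network.start();
        for (String peer : peers) verbs.connect(peer);       // 4. connect to all other processes
        for (long key : keys)                                // 5. exchange memory keys
            verbs.send(longToBytes(key));
        // 6. pre-allocation of request objects (omitted)
    }

    static void pollCompletions() { /* handle completion events */ }

    static byte[] longToBytes(long v) {
        return java.nio.ByteBuffer.allocate(8).putLong(v).array();
    }
}
```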
When shuffling data, threads send and receive data objects asynchronously, using incoming and outgoing buffers to serialize data objects and to temporarily store them until they are transferred/pulled. These buffers are implemented as circular buffers: each has a head and a tail (new data is written at the head position; the data available in the buffer is stored between the tail and the head).
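The head/tail bookkeeping can be sketched as follows (a simplified, single-threaded illustration; the class and method names are ours, not the middleware's):

```java
// Simplified sketch of the circular (ring) buffer used for shuffling.
// New data is written at the head; the bytes between tail and head are
// those not yet transferred/pulled. Names are illustrative only.
class RingBuffer {
    private final byte[] data;
    private long head; // next write position (monotonically increasing)
    private long tail; // oldest byte still in use

    RingBuffer(int capacity) { this.data = new byte[capacity]; }

    /** Bytes currently stored between tail and head. */
    int used() { return (int) (head - tail); }

    /** Bytes that can still be written before the buffer is full. */
    int free() { return data.length - used(); }

    /** Serialize (copy) a chunk at the head, if it fits. */
    boolean put(byte[] chunk) {
        if (chunk.length > free()) return false; // caller must block/retry
        for (byte b : chunk) data[(int) (head++ % data.length)] = b;
        return true;
    }

    /** Release space after data was transferred (advance the tail). */
    void release(int n) { tail += n; }
}
```

Using monotonically increasing head/tail counters (rather than wrapped indices) avoids the classic ambiguity between a full and an empty buffer.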
When sending a data object to a remote thread, the object is serialized into the outgoing buffer, and an RDMA write request is posted (queued for execution) to transfer a segment of data to the appropriate remote incoming buffer. The thread only queues the RDMA write request, i.e., it does not have to wait for the request to actually execute. As there is no intervention of the receiving side, send/receive requests are used to notify the remote process that a data object was written into its buffers. This is done by the network thread, after it receives an event confirming that the RDMA write request completed successfully. Moreover, the network thread also updates the tail of the outgoing buffer from which the data was transferred, as the space occupied by the sent data can now be reused.
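The division of labor between the worker and the network thread can be modeled as below (a toy, in-memory simulation: queues stand in for the RDMA send queue and the notification channel, and all names are ours):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the asynchronous send path: the worker only *posts* the
// RDMA write (modeled as enqueuing a completion event) and returns
// immediately; the network thread later consumes the completion event,
// notifies the receiver, and advances the outgoing tail so the space
// can be reused. Illustrative names, not the middleware's API.
class SendPath {
    final AtomicLong outgoingTail = new AtomicLong();
    final ConcurrentLinkedQueue<Integer> completions = new ConcurrentLinkedQueue<>();
    final ConcurrentLinkedQueue<Integer> notificationsToPeer = new ConcurrentLinkedQueue<>();

    /** Worker thread: post the write and continue; no waiting. */
    void postWrite(int length) {
        completions.add(length); // stands in for posting to the send queue
    }

    /** Network thread: handle one completion event, if any. */
    void handleCompletion() {
        Integer length = completions.poll();
        if (length == null) return;      // no completion event yet
        notificationsToPeer.add(length); // send/receive notification to peer
        outgoingTail.addAndGet(length);  // sent bytes can now be reused
    }
}
```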
Before posting the RDMA write request, the thread needs to determine whether there is free space available in the remote buffer. This is determined by the tail position of the remote buffer, which is tracked on the sending side (notifications are also used to update this information). If there is no space available in the remote buffer, the thread continues its operation, and the network thread will post the RDMA write request when it receives a notification updating the tail of the remote buffer.
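The sender-side free-space check reduces to a comparison between the remote head (which the sender knows exactly, since it performed the writes) and the last tail value reported by notifications. A minimal sketch, with names of our own choosing:

```java
// Sender-side view of a remote incoming buffer: the sender tracks the
// remote head locally and a possibly stale remote tail (refreshed by
// notifications). A write of n bytes may be posted only if it fits.
class RemoteView {
    long remoteHead;          // next remote write position (sender-local)
    volatile long remoteTail; // last known remote tail (from notifications)
    final int capacity;

    RemoteView(int capacity) { this.capacity = capacity; }

    /** True if an n-byte write fits in the remote buffer as far as we know. */
    boolean canPost(int n) {
        return (remoteHead - remoteTail) + n <= capacity;
    }
}
```

Because the tracked tail can only lag behind the real one, this check is conservative: it may delay a write unnecessarily, but it never overruns the remote buffer.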
The local outgoing buffers may also become full.
When this happens, the thread blocks, as it cannot se-
rialize its current object and proceed to the next one.
When the network thread is notified that an RDMA write request completed and that space was released in the desired buffer, it wakes up the blocked thread.
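This block/wake interaction maps naturally onto a Java monitor: the worker waits while the buffer is full, and the network thread signals when a completion frees space. A minimal sketch under that assumption (hypothetical names, not the middleware's API):

```java
// Sketch of a worker blocking when its outgoing buffer is full and
// being woken by the network thread once a completion frees space.
class OutgoingBuffer {
    private final int capacity;
    private int used;

    OutgoingBuffer(int capacity) { this.capacity = capacity; }

    /** Worker: block until n bytes fit, then reserve them. */
    synchronized void reserve(int n) throws InterruptedException {
        while (used + n > capacity) wait(); // buffer full: block
        used += n;
    }

    /** Network thread: a write completed, n bytes were released. */
    synchronized void release(int n) {
        used -= n;
        notifyAll(); // wake any worker blocked in reserve()
    }
}
```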
The transferred data objects will eventually be pulled by the receiving thread. Threads do not know when data is transferred into their buffers. To overcome this limitation, network threads exchange notifications when data is transferred. The network thread maintains, for each thread, a queue of buffers with available data, which spares the threads from actively polling all incoming buffers. If a thread has no data objects to process, it blocks; the network thread wakes it up when additional data arrives. That is, the shuffle queues act as blocking queues. This design enables the overlap of communication and computation, as long as data is produced at a rate similar to that at which it is consumed.
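In Java, such per-thread shuffle queues map directly onto `java.util.concurrent.BlockingQueue`; the sketch below assumes buffers are identified by integer ids (our convention, not the paper's):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// The per-thread "queue of buffers with available data" behaves like a
// blocking queue: the network thread enqueues buffer ids as notifications
// arrive, and the worker blocks in take() while the queue is empty.
class ShuffleQueue {
    private final BlockingQueue<Integer> readyBuffers = new LinkedBlockingQueue<>();

    /** Network thread: a notification says buffer `bufferId` has data. */
    void onNotification(int bufferId) {
        readyBuffers.add(bufferId);
    }

    /** Worker thread: blocks while no incoming buffer has data. */
    int nextBufferWithData() throws InterruptedException {
        return readyBuffers.take();
    }
}
```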
A prototype of the proposed middleware design was implemented and compared with a previously used sockets-based middleware, to provide a preliminary evaluation of the benefits of using RDMA. The sockets middleware used a similar push-based approach with circular buffers, but relied on non-blocking Java sockets between each pair of buffers to transfer data from outgoing to incoming buffers. The preliminary RDMA middleware implementation reduced communication costs by around 75% when shuffling data among 32 threads on 8 machines, using a software implementation of the RDMA protocol (SoftiWARP).
4 CONCLUDING REMARKS
In this paper we proposed the design of an RDMA-
based communication middleware to support push-
based asynchronous shuffling.
Preliminary results, based on a prototype imple-
mentation of the RDMA-based middleware, show that
DataDiversityConvergence 2016 - Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP
and OLAP