• ability to block a thread when there are no data objects to process (and to wake it up when new data objects become available); and
• ability to block a thread when local buffers are full
(and to wake it up when space becomes available).
The RDMA middleware mainly uses one-sided RDMA write operations to transfer data objects directly between Java memory buffers. Additionally, it uses send/receive operations to exchange control data.
When initializing the application, communication end-points are created on each process, i.e., an RDMA server connection is created and bound to the machine's IP address. The next task is to connect the processes over the network. Briefly, this comprises the following steps:
• allocation and registration of memory buffers;
• allocation of queue pairs, completion channel, and
completion queue;
• start of a new thread (the network thread), which
handles the completion events;
• establishment of RDMA connections with all other processes;
• exchange of memory keys between processes (required to allow the execution of one-sided RDMA write operations), using send/receive operations;
• pre-allocation and initialization of objects needed to execute the network requests.
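The paper gives no code for this setup; the sketch below only illustrates the order of the steps above, using a hypothetical `Verbs` stub interface of our own (none of these type or method names come from the middleware or from a specific RDMA library):

```java
// Hypothetical outline of the connection setup. The Verbs interface is a
// stand-in for the underlying RDMA verbs objects; all names are ours.
interface Verbs {
    long registerBuffer(byte[] buf);  // pin + register memory, returns its key
    void createQueuePair();           // queue pair + completion channel/queue
    void connect(String peerAddress); // establish an RDMA connection
    void send(byte[] controlData);    // two-sided send (control path)
}

final class Setup {
    static void initialize(Verbs verbs, String[] peers, byte[][] buffers) {
        long[] keys = new long[buffers.length];
        for (int i = 0; i < buffers.length; i++)
            keys[i] = verbs.registerBuffer(buffers[i]);      // 1. allocate + register buffers
        verbs.createQueuePair();                             // 2. QPs, channel, completion queue
        Thread network = new Thread(Setup::pollCompletions); // 3. start the network thread
        network.setDaemon(true);
        network.start();
        for (String peer : peers) verbs.connect(peer);       // 4. connect to all other processes
        for (long key : keys)                                // 5. exchange memory keys
            verbs.send(longToBytes(key));
        // 6. pre-allocation of request objects (omitted)
    }

    static void pollCompletions() { /* handle completion events */ }

    static byte[] longToBytes(long v) {
        return java.nio.ByteBuffer.allocate(8).putLong(v).array();
    }
}
```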
When shuffling data, threads send and receive data objects asynchronously, using incoming and outgoing buffers to serialize data objects and to temporarily store them until they are transferred/pulled. These buffers are implemented as circular buffers: each has a head and a tail (new data is written at the head position; the data available in the buffer is stored between the tail and the head).
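The head/tail bookkeeping can be sketched as follows (a simplified, single-threaded illustration; the class and method names are ours, not the middleware's):

```java
// Simplified sketch of the circular (ring) buffer used for shuffling.
// New data is written at the head; the bytes between tail and head are
// those not yet transferred/pulled. Names are illustrative only.
class RingBuffer {
    private final byte[] data;
    private long head; // next write position (monotonically increasing)
    private long tail; // oldest byte still in use

    RingBuffer(int capacity) { this.data = new byte[capacity]; }

    /** Bytes currently stored between tail and head. */
    int used() { return (int) (head - tail); }

    /** Bytes that can still be written before the buffer is full. */
    int free() { return data.length - used(); }

    /** Serialize (copy) a chunk at the head, if it fits. */
    boolean put(byte[] chunk) {
        if (chunk.length > free()) return false; // caller must block/retry
        for (byte b : chunk) data[(int) (head++ % data.length)] = b;
        return true;
    }

    /** Release space after data was transferred (advance the tail). */
    void release(int n) { tail += n; }
}
```

Using monotonically increasing head/tail counters (rather than wrapped indices) avoids the classic ambiguity between a full and an empty buffer.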
When sending a data object to a remote thread, the object is serialized into the outgoing buffer, and an RDMA write request is posted (queued for execution) to transfer a segment of data to the appropriate remote incoming buffer. The thread only queues the RDMA write request, i.e., it does not have to wait for the request to actually execute. As there is no intervention of the receiving side, send/receive requests are used to notify the remote process that a data object was written into its buffers. This is done by the network thread, after it receives an event confirming that the RDMA write request completed successfully. Moreover, the network thread also updates the tail of the outgoing buffer from which the data was transferred, as the space occupied by the sent data can now be reused.
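The division of labor between the worker and the network thread can be modeled as below (a toy, in-memory simulation: queues stand in for the RDMA send queue and the notification channel, and all names are ours):

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the asynchronous send path: the worker only *posts* the
// RDMA write (modeled as enqueuing a completion event) and returns
// immediately; the network thread later consumes the completion event,
// notifies the receiver, and advances the outgoing tail so the space
// can be reused. Illustrative names, not the middleware's API.
class SendPath {
    final AtomicLong outgoingTail = new AtomicLong();
    final ConcurrentLinkedQueue<Integer> completions = new ConcurrentLinkedQueue<>();
    final ConcurrentLinkedQueue<Integer> notificationsToPeer = new ConcurrentLinkedQueue<>();

    /** Worker thread: post the write and continue; no waiting. */
    void postWrite(int length) {
        completions.add(length); // stands in for posting to the send queue
    }

    /** Network thread: handle one completion event, if any. */
    void handleCompletion() {
        Integer length = completions.poll();
        if (length == null) return;      // no completion event yet
        notificationsToPeer.add(length); // send/receive notification to peer
        outgoingTail.addAndGet(length);  // sent bytes can now be reused
    }
}
```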
Before posting the RDMA write request, the thread needs to determine whether there is free space available in the remote buffer. This is determined by the tail position of the remote buffer, which is tracked on the sending side (notifications are also used to update this information). If there is no space available in the remote buffer, the thread continues its operation, and the network thread will post the RDMA write request when it receives a notification updating the tail of the remote buffer.
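The sender-side free-space check reduces to a comparison between the remote head (which the sender knows exactly, since it performed the writes) and the last tail value reported by notifications. A minimal sketch, with names of our own choosing:

```java
// Sender-side view of a remote incoming buffer: the sender tracks the
// remote head locally and a possibly stale remote tail (refreshed by
// notifications). A write of n bytes may be posted only if it fits.
class RemoteView {
    long remoteHead;          // next remote write position (sender-local)
    volatile long remoteTail; // last known remote tail (from notifications)
    final int capacity;

    RemoteView(int capacity) { this.capacity = capacity; }

    /** True if an n-byte write fits in the remote buffer as far as we know. */
    boolean canPost(int n) {
        return (remoteHead - remoteTail) + n <= capacity;
    }
}
```

Because the tracked tail can only lag behind the real one, this check is conservative: it may delay a write unnecessarily, but it never overruns the remote buffer.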
The local outgoing buffers may also become full.
When this happens, the thread blocks, as it cannot se-
rialize its current object and proceed to the next one.
When the network thread is notified that an RDMA write request completed and that space was released in the desired buffer, it wakes up the blocked thread.
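This block/wake interaction maps naturally onto a Java monitor: the worker waits while the buffer is full, and the network thread signals when a completion frees space. A minimal sketch under that assumption (hypothetical names, not the middleware's API):

```java
// Sketch of a worker blocking when its outgoing buffer is full and
// being woken by the network thread once a completion frees space.
class OutgoingBuffer {
    private final int capacity;
    private int used;

    OutgoingBuffer(int capacity) { this.capacity = capacity; }

    /** Worker: block until n bytes fit, then reserve them. */
    synchronized void reserve(int n) throws InterruptedException {
        while (used + n > capacity) wait(); // buffer full: block
        used += n;
    }

    /** Network thread: a write completed, n bytes were released. */
    synchronized void release(int n) {
        used -= n;
        notifyAll(); // wake any worker blocked in reserve()
    }
}
```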
The transferred data objects will eventually be pulled by the receiving thread. Threads do not know when data is transferred into their buffers. To overcome this limitation, network threads exchange notifications when data is transferred. The network thread maintains, for each thread, a queue of buffers with available data, which spares the threads from actively polling all incoming buffers. If a thread has no data objects to process, it blocks; the network thread wakes it up when additional data arrives. That is, the shuffle queues act as blocking queues. This design enables the overlap of communication and computation, as long as data is produced at a rate similar to that at which it is consumed.
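In Java, such per-thread shuffle queues map directly onto `java.util.concurrent.BlockingQueue`; the sketch below assumes buffers are identified by integer ids (our convention, not the paper's):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// The per-thread "queue of buffers with available data" behaves like a
// blocking queue: the network thread enqueues buffer ids as notifications
// arrive, and the worker blocks in take() while the queue is empty.
class ShuffleQueue {
    private final BlockingQueue<Integer> readyBuffers = new LinkedBlockingQueue<>();

    /** Network thread: a notification says buffer `bufferId` has data. */
    void onNotification(int bufferId) {
        readyBuffers.add(bufferId);
    }

    /** Worker thread: blocks while no incoming buffer has data. */
    int nextBufferWithData() throws InterruptedException {
        return readyBuffers.take();
    }
}
```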
A prototype of the proposed middleware design was implemented and compared with a previously used sockets-based middleware, to provide a preliminary evaluation of the benefits of using RDMA. The sockets middleware used a similar push-based approach with circular buffers, but relied on non-blocking Java sockets between each pair of buffers to transfer data from outgoing to incoming buffers. The preliminary RDMA middleware implementation reduced communication costs by around 75% when shuffling data among 32 threads on 8 machines, using a software implementation of the RDMA protocol (SoftiWARP).
4 CONCLUDING REMARKS
In this paper we proposed the design of an RDMA-
based communication middleware to support push-
based asynchronous shuffling.
Preliminary results, based on a prototype imple-
mentation of the RDMA-based middleware, show that
DataDiversityConvergence 2016 - Workshop on Towards Convergence of Big Data, SQL, NoSQL, NewSQL, Data streaming/CEP, OLTP
and OLAP