memory, which forms what is termed a NUMA node.
Each NUMA node is linked to other NUMA nodes by
UPI links, in such a way that all nodes are transitively
linked, so that the whole server memory can be
consistently accessed from any node. The
BullSequana S800 has eight NUMA nodes numbered
from 0 to 7, which are linked so that there are at most
two hops between nodes as shown in
Figure
1.
Each node has three neighbours at one hop and four nodes at two hops. Naturally, the latency to access a portion of memory from a given node varies with the distance between the accessing node and the accessed node: with one hop, the latency is about twice that of an access to local memory; with two hops, it is about three times the latency of a local access, as shown in Figure 2. This
may dramatically affect the performance of the
application.
Figure 2: Distances between NUMA nodes.
NUMA awareness aims at maximizing access to
local memory by the threads that run on every core.
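As an illustration, the following sketch (assuming a Linux host, where the kernel exposes the topology under /sys/devices/system/node) prints the NUMA distance matrix summarized in Figure 2; the values are those also reported by numactl --hardware (10 for local memory, larger values for one- and two-hop accesses).

```scala
import java.io.File
import java.nio.file.Files

// Minimal sketch: print the NUMA distance matrix exposed by the Linux kernel.
object NumaDistances {
  def main(args: Array[String]): Unit = {
    val nodes = new File("/sys/devices/system/node")
      .listFiles()
      .filter(_.getName.matches("node\\d+"))
      .sortBy(_.getName.drop(4).toInt)

    for (node <- nodes) {
      // Each node directory exposes one row of the matrix, e.g. "10 21 21 31 ..."
      val row = new String(Files.readAllBytes(new File(node, "distance").toPath)).trim
      println(s"${node.getName}: $row")
    }
  }
}
```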
3 NUMA AWARE SPARK
A Spark application comprises:
A driver process, which controls the whole processing of the application. The application is modelled as a directed acyclic graph (DAG). The driver interprets this model and, when operating on a given dataset, splits the processing into stages and individual tasks that it schedules and distributes to the executors running on the cluster nodes.
Executor processes, which perform the actual data processing as instructed by the driver. The executor processes hold data parts in their memory: a task applies to a data part, and each executor receives (and runs) the tasks related to the data parts it holds. Executor processes run on cluster nodes, the so-called worker nodes, and there may be one or more executors per worker node. Data processing may involve many executor processes spread across many worker nodes (a minimal sketch of this division of work follows).
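As an illustration, here is a minimal example against the Spark Scala API (local or standalone deployment assumed; names and sizes are arbitrary): the driver builds a small DAG, and the action at the end triggers a stage whose tasks, one per partition, are scheduled onto the executors holding the corresponding data parts.

```scala
import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-sketch").getOrCreate()
    val sc = spark.sparkContext

    // 16 partitions -> 16 tasks per stage, distributed over the executors.
    val numbers = sc.parallelize(1 to 1000000, numSlices = 16)
    val squares = numbers.map(x => x.toLong * x) // narrow transformation, same stage
    val total   = squares.reduce(_ + _)          // action: the driver schedules the tasks

    println(s"partitions = ${numbers.getNumPartitions}, sum of squares = $total")
    spark.stop()
  }
}
```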
A Spark application consumes (large) data from
various sources, processes it, then outputs (smaller)
results to various sinks. Processing usually transforms the input data to build a dataset in the desired form, caches it in memory, then applies algorithms that repeatedly operate on the cached data. Unlike Hadoop, which stores intermediate results on storage (HDFS), Spark retains intermediate results in memory as much as possible, and the more memory it gets, the less pressure it puts on the input/output system.
This makes Spark less sensitive to the input/output
system than Hadoop.
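The following sketch illustrates that pattern with the Spark Scala API (the input path and the transformations are hypothetical): the intermediate dataset is persisted in executor memory, and the repeated passes then operate on the cached copy instead of re-reading the source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("cache-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input; the transformations shrink the raw data to the working set.
    val working = sc.textFile("hdfs:///data/events/*.log")
      .filter(_.contains("ERROR"))
      .map(line => line.split("\\s+").length)
      .persist(StorageLevel.MEMORY_ONLY) // keep the intermediate result in memory

    // Repeated passes hit the cached partitions, not the input files.
    for (i <- 1 to 5) {
      println(s"pass $i: sum = ${working.sum()}")
    }
    spark.stop()
  }
}
```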
Spark will try to perform initial processing close to the input data location so as to consume it at the highest rate, but further computing depends more on the way it (re)partitions and retains data in the memory cache of the executors. As the processing time is usually much higher than the cumulated input and output times, the distribution of
data among the executor processes (in their memory)
is an important factor to consider.
Spark still writes data to file systems during shuffling operations, when data is rebalanced between executors, e.g. on sort operations or data repartitioning. Shuffling may thus put high pressure on both the I/O system and the network, the latter also incurring CPU consumption as data has to be serialized before being transferred over the network. One way to lower this pressure is to use large-memory executors: data analytics often reduces a very large input to a far smaller output. The more of the size-reducing filtering and transformations can be applied within a single contiguous memory space, the fewer and the smaller the shuffling operations. Large-memory and many-core platforms enable a few huge executors as well as the more common approach of several executors, for a single application as well as for several ones.
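As an illustration (the memory and core figures below are placeholders, not settings measured in this work), a standalone-mode application can request a few large executors through the standard spark.executor.memory and spark.executor.cores properties; operations such as reduceByKey and repartition are the ones that shuffle data once the size-reducing transformations have run.

```scala
import org.apache.spark.sql.SparkSession

object ExecutorSizingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("executor-sizing-sketch")
      .config("spark.executor.memory", "700g") // placeholder: one huge executor
      .config("spark.executor.cores", "28")    // placeholder core count
      .getOrCreate()
    val sc = spark.sparkContext

    val counts = sc.textFile("hdfs:///data/input/*") // hypothetical input
      .filter(_.nonEmpty)                  // shrink the data before any shuffle
      .map(line => (line.take(8), 1L))
      .reduceByKey(_ + _)                  // first shuffle: aggregation by key
      .repartition(64)                     // second shuffle: rebalancing between executors

    println(counts.count())
    spark.stop()
  }
}
```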
Spark has been designed for scale-out and is not natively NUMA aware. We have made extensions that bring NUMA awareness to the worker processes of a Spark cluster deployed in standalone mode. A worker process manages the executor processes on a server of a Spark cluster: the placement of executors within the server is a local concern that the worker is responsible for. When a worker launches a Spark executor process, it binds it to a NUMA node, so that the threads running the tasks within the executor process access local memory, where the data parts reside; or, if the executor does not fit into a single NUMA node, it binds it to a set of NUMA nodes that are close to each other, as sketched below.
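The sketch below conveys the idea only and is not the actual implementation of our worker extension: before the executor's JVM is started, its command line is prefixed with numactl so that both its threads and its memory allocations are confined to the chosen NUMA node (the executor command shown is deliberately simplified).

```scala
import scala.sys.process._

object NumaBoundLaunch {
  /** Launch `command` with its CPUs and memory allocations bound to one NUMA node. */
  def launchOnNode(node: Int, command: Seq[String]): Process = {
    val bound = Seq("numactl", s"--cpunodebind=$node", s"--membind=$node") ++ command
    Process(bound).run()
  }

  def main(args: Array[String]): Unit = {
    // Simplified executor command; a real worker builds the full argument list itself.
    val executorCmd = Seq("java", "-Xmx64g", "-cp", "/opt/spark/jars/*",
      "org.apache.spark.executor.CoarseGrainedExecutorBackend")
    val proc = launchOnNode(node = 3, command = executorCmd)
    println(s"executor exited with status ${proc.exitValue()}")
  }
}
```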
The worker process manages a table of