2.1 PySpark Execution Model
Spark runs Python programs differently from Scala programs: unlike for Scala programs, Spark executors do not execute Python code directly on the data they hold, but delegate it to Python workers, which are separate processes. This architecture has drawbacks, especially for data movement between Spark executors and Python workers, because data must be serialized and deserialized between those processes. Converting a Spark DataFrame to Pandas with the DataFrame.toPandas() method is quite inefficient: rows are collected from the Spark executors, serialized into Python's pickle format, moved to the Python worker, deserialized from pickle into a list of tuples, and finally transformed into a Pandas data frame. The reverse operations must be performed to return results. This overhead yields performance far below that of an equivalent Scala program.
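A minimal PySpark sketch of the conversion path described above (the column names and sizes are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("topandas-demo").getOrCreate()

# A small Spark DataFrame; at realistic sizes the pickle-based
# transfer described above becomes the dominant cost.
df = spark.createDataFrame([(i, i * 0.5) for i in range(1000)],
                           schema=["id", "value"])

# Rows are collected from the executors, pickled, shipped to the
# Python process, unpickled into tuples, and assembled into Pandas.
pdf = df.toPandas()
print(pdf.head())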
Developers commonly work around this problem by defining their UDFs (user-defined functions) in Scala/Java and calling them from PySpark. Experiments have shown that the time spent serializing and deserializing data often exceeds the compute time (Kraus and Patterson, 2018); under such conditions, targeting GPU acceleration would not be a good option, because most of the time would be lost in serializing, deserializing, and copying data.
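As an illustration of this workaround, a UDF compiled in Scala/Java can be registered and invoked from PySpark so that it runs entirely inside the JVM executors (the class name com.example.udf.ScaleUDF below is hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# Register a JVM-side UDF (hypothetical Scala class implementing UDF1);
# calls to it never cross into a Python worker.
spark.udf.registerJavaFunction("jvm_scale",
                               "com.example.udf.ScaleUDF",
                               DoubleType())

df = spark.range(1000)
df.selectExpr("jvm_scale(id) AS scaled").show(5)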
2.2 Apache ARROW in Spark
Things have changed, however: starting from version 2.3, Spark can leverage the Apache Arrow technology. Apache Arrow (Kraus and Patterson, 2018) is
a cross-language development platform for in-
memory data. It specifies a standardized language-
independent columnar memory format for flat and
hierarchical data, organized for efficient analytic
operations on modern hardware. It also provides
computational libraries and zero-copy streaming
messaging and interprocess communication. It
enables execution engines to take advantage of the
latest SIMD (single instruction, multiple data) operations included in modern processors (CPUs, GPUs) for native vectorized optimization of analytical data processing. The columnar layout is optimized for data locality, yielding better performance on CPUs and GPUs. The Arrow memory format
supports zero-copy reads for lightning-fast data
access without serialization overhead.
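For instance, with the pyarrow library, a Pandas data frame can be placed in the Arrow columnar format (a minimal sketch; column names are illustrative):

import pandas as pd
import pyarrow as pa

pdf = pd.DataFrame({"id": range(5), "value": [0.5 * i for i in range(5)]})

# Lay the data out column by column in the language-independent
# Arrow format, readable by other Arrow-aware engines without a
# serialization step.
table = pa.Table.from_pandas(pdf)
print(table.schema)

# Back to Pandas; for compatible types this can avoid data copies.
pdf2 = table.to_pandas()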
As of Spark 2.3, Apache Arrow is introduced as a supported dependency to offer increased performance for columnar data transfer. Once the data is in the Arrow memory format, it can transit (possibly without moving) along the processing pipeline from one framework to the next without multiple serializations/deserializations, e.g. from the Spark executor (a Java process) to the GPU through the Python worker (a Python process).
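In practice, this transfer path is enabled by a Spark SQL configuration flag (spark.sql.execution.arrow.enabled in Spark 2.3), after which DataFrame.toPandas() exchanges Arrow record batches instead of pickled rows:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# Enable Arrow-based columnar data transfer (Spark 2.3 key).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1000000).selectExpr("id", "id * 0.5 AS value")

# Columnar Arrow batches now flow to the Python process,
# avoiding the per-row pickle round trip.
pdf = df.toPandas()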
2.3 RAPIDS AI
Figure 1: RAPIDS AI components (RAPIDS AI, 2019).
RAPIDS AI is a collection of open source software
libraries and APIs recently launched by NVIDIA to
execute end-to-end data science analytics pipelines
entirely on GPUs. It relies on NVIDIA CUDA
primitives for low-level compute optimization, but
exposes GPU parallelism and high-bandwidth
memory speed through user-friendly Python
interfaces. RAPIDS AI also focuses on common data
preparation tasks for analytics and data science. This
includes a familiar DataFrame API that integrates with a variety of machine learning algorithms for end-to-end pipeline acceleration without paying the typical serialization costs. RAPIDS AI also includes support for multi-node, multi-GPU deployments, enabling vastly accelerated processing and training on much larger datasets.
The RAPIDS AI cuDF API is a DataFrame
manipulation library based on Apache Arrow that
accelerates loading, filtering, and manipulation of
data for model training and data preparation. The
Python bindings of the CUDA-accelerated core DataFrame manipulation primitives mirror the Pandas interface for seamless onboarding of Pandas users. Previous efforts were made by the GoAI (GPU Open Analytics Initiative) project, which initiated PyGDF (Python GPU DataFrame library): PyGDF is based on the Apache Arrow data format, converts Pandas DataFrames to GPU DataFrames, and interfaces with CUDA using Numba, a compiler for Python arrays and numerical functions that speeds up Python programs with high-performance kernels. PyGDF has since been merged into cuDF, a more elaborate and complete library.
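A minimal cuDF sketch of this Pandas-like interface (requires a CUDA-capable GPU and the cudf package; column names are illustrative):

import pandas as pd
import cudf

pdf = pd.DataFrame({"id": range(6), "value": [0.5 * i for i in range(6)]})

# Move the data frame into GPU memory (Arrow-based columnar layout).
gdf = cudf.DataFrame.from_pandas(pdf)

# Pandas-like operations run on the GPU.
filtered = gdf[gdf["value"] > 1.0]
print(filtered["value"].mean())

# Bring the result back to the host as a Pandas data frame.
result = filtered.to_pandas()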