In Spark performance tuning, memory-related tuning should be a high priority. As an in-memory computing engine, Spark holds most of its data sets in memory rather than on disk, which greatly reduces file access time. When free memory becomes insufficient, data sets are spilled to disk, an operation that incurs long latencies. Garbage Collection (GC) may also be triggered to reclaim Java Virtual Machine (JVM) heap space, adding significant GC latency. Besides memory hardware configuration parameters such as capacity and bandwidth, Spark provides a wide range of parameters to control memory behaviour. All of these parameters and memory-related operations have a significant performance impact. The parameters exist in software at four different layers: the Spark execution engine, the cluster resource manager (YARN, Mesos, Standalone, etc.), the JVM and the Operating System (OS). Since complex interactions exist between these parameters, it is very difficult to find an optimized parameter configuration that maximizes Spark cluster performance.
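To make the four-layer structure concrete, the snippet below sets a few memory-related knobs at each layer for a hypothetical job. The values are placeholders and the exact parameter names vary across Spark, YARN and JVM versions; the example is illustrative only and is not part of the proposed framework.

```scala
import org.apache.spark.SparkConf

// Illustrative only: representative memory knobs at each software layer.
val conf = new SparkConf()
  .setAppName("memory-tuning-example")
  // Spark execution engine layer
  .set("spark.executor.memory", "4g")          // executor JVM heap size
  .set("spark.memory.fraction", "0.6")         // heap share for execution + storage (Spark >= 1.6)
  .set("spark.memory.storageFraction", "0.5")  // storage region protected from eviction
  // JVM layer: GC policy and tuning flags forwarded to executor JVMs
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35")
// Cluster resource manager layer (YARN): container memory limits, e.g.
//   yarn.nodemanager.resource.memory-mb, yarn.scheduler.maximum-allocation-mb
// OS layer: kernel memory settings, e.g. vm.swappiness, transparent huge pages
```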
Traditional cluster design and deployment decisions are based on experience or measurement, neither of which suits Spark cluster deployment well. Because Spark is still relatively new, few users can make sound and accurate decisions based on experience alone. Measurement-based optimization, on the other hand, requires an available cluster, is extremely time consuming, and can easily be disturbed by random environmental factors such as disk or network interface card (NIC) failures.
Simulation-based cluster analysis is, in general, a much more reliable approach for obtaining systematic optimization solutions. Among the various simulation methods proposed (Kolberga et al., 2013), (Wang et al., 2011), (Kennedy and Gopal, 2013), (Verma et al., 2011), CSMethod (Bian et al., 2014) is a fast and accurate cluster simulation method that employs a layered, configurable architecture to simulate Big Data clusters on standard client computers (desktops or laptops).
The Spark workflow, and especially its directed acyclic graph (DAG) abstraction, is very different from the Hadoop MapReduce workflow. In addition, the memory subsystem of the current CSMethod-based MapReduce model is too coarse to meet the accuracy requirements of Spark simulation. To fill these gaps, this paper proposes a new simulation framework that is based on and extends CSMethod. All performance-critical Spark parameters and the complete workflow are modeled for fast and accurate performance prediction, backed by a fine-grained multi-layer memory subsystem.
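To illustrate the kind of workflow the simulator must model, the sketch below shows a simple Spark job: the narrow transformations (flatMap, map) are pipelined within one stage, while the wide reduceByKey dependency introduces a shuffle boundary and a new stage in the DAG. The input and output paths are purely illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountDag {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount-dag"))
    val counts = sc.textFile("hdfs:///tmp/input")  // RDD input fetch (HDFS access)
      .flatMap(_.split("\\s+"))                    // narrow: pipelined in stage 0
      .map(word => (word, 1))                      // narrow: still stage 0
      .reduceByKey(_ + _)                          // wide: shuffle write/read, starts stage 1
    counts.saveAsTextFile("hdfs:///tmp/output")    // action triggers DAG scheduling
    sc.stop()
  }
}
```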
The whole Spark cluster software stack is abstracted and simulated at the functional level, including computation, communication and dataset access. Software functions are dynamically mapped onto hardware components. The timing of hardware components (storage, network, memory and CPU) is modeled according to the payload and activity perceived by the software. A low-overhead discrete-event simulation engine enables fast simulation and good scalability. The Spark simulator accepts Spark applications together with input dataset information and cluster configurations, and then simulates the performance behaviour of the application. The cluster configuration covers both the software stack configuration and the hardware component configuration.
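As a rough illustration of this style of modeling, the sketch below shows a minimal discrete-event core of the kind such a simulator can be built on; the names and structure are our own illustration, not the framework's actual implementation. Component models schedule timed completion events, and the engine advances a virtual clock directly to the next event instead of stepping in fixed increments.

```scala
import scala.collection.mutable

// Hypothetical sketch of a low-overhead discrete-event core.
final case class Event(time: Double, action: () => Unit)

class DiscreteEventEngine {
  // Min-heap on event time: the earliest pending event is dequeued first.
  private val queue =
    mutable.PriorityQueue.empty[Event](Ordering.by[Event, Double](_.time).reverse)
  private var now = 0.0

  def clock: Double = now

  // Schedule an action to fire `delay` simulated seconds from now,
  // e.g. a storage model posting "block read complete".
  def schedule(delay: Double)(action: => Unit): Unit =
    queue.enqueue(Event(now + delay, () => action))

  def run(): Unit = while (queue.nonEmpty) {
    val ev = queue.dequeue()
    now = ev.time   // jump the virtual clock straight to the next event
    ev.action()
  }
}
```

In such a scheme a hypothetical disk model could, for example, call schedule(bytes / bandwidth) { ... } for each simulated read, so that simulated time is consumed only where hardware activity actually occurs.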
The following key contributions are presented in
this paper:
• We propose a new framework that simulates the entire performance-critical Spark workflow, including: DAG generation; RDD input fetch, transfer, shuffle and block management; and spill and HDFS access.
• We describe a fine-grained multi-layer memory performance model that simulates the memory behaviour of the Spark, JVM, OS and hardware layers with high accuracy.
• We implement and validate the Spark simulation framework using a range of micro-benchmarks and a real-world IoT (Internet of Things) workload. The average error rate is within 7% and simulation speed is very high: running on a commodity desktop, the simulation time is close to the native execution time on a five-node Intel Xeon E5 high-end server cluster.
• We demonstrate a simulation-based Spark parameter tuning approach that supports Big Data cluster deployment planning, evaluation and optimization.
The rest of this paper is organized as follows. Section 2 presents the proposed Spark simulator in detail. The experimental environment set-up and the workloads are introduced in Section 3. Section 4 presents the evaluation results and their analysis. A memory-related Spark performance tuning case study is then presented in detail in Section 5. Section 6 reviews related work. A summary and thoughts on future work are given in the final section.
2 SPARK SIMULATION FRAMEWORK ARCHITECTURE
In this section, we introduce the proposed Spark simulation framework in detail.