Reconfigurable and Adaptive Spark Applications
Bilal Abi Farraj, Wael Al Rahal Al Orabi, Chadi Helwe, Mohamad Jaber,
Mohamad Omar Kayali and Mohamed Nassar
American University of Beirut, Computer Science Department, Beirut, Lebanon
Keywords:
Big Data, Spark, Hadoop, Quality of Service, Cloud.
Abstract:
The contribution of this paper is two-fold. First, we propose a Domain Specific Language (DSL) to easily
reconfigure and compose Spark applications. For each Spark application we define its input and output inter-
faces. Then, given a set of connections that map outputs of some Spark applications to free inputs of other
Spark applications, we automatically embed Spark applications with the required synchronization and commu-
nication to properly run them according to the user-defined mapping. Second, we present an adaptive quality
management/selection method for Spark applications. The method takes as input a pipeline of parameterized
Spark applications, where the execution time of each Spark application is an unknown increasing function of
quality level parameters. The method builds a controller that automatically computes adequate quality for each
Spark application to meet a user-defined deadline. Consequently, users can submit a pipeline of Spark appli-
cations together with a deadline, and our method automatically runs all the Spark applications with the maximum quality
while respecting the deadline specified by the user. We present experimental results showing the effectiveness
of our method.
1 INTRODUCTION
Big data has been a major trend since 2003, when
Google introduced the Google File System (GFS)
(Ghemawat et al., 2003), and it has since become a
necessity for most new technologies. Processing big
data requires implicit distribution of computation and
new programming paradigms. To this end, the
MapReduce programming model was proposed
in (Dean and Ghemawat, 2004; Dean and Ghemawat,
2010) to efficiently process data on a cluster.
Some of the drawbacks of MapReduce are: (1) lack
of expressiveness: it does not support iterative jobs
and offers only two functions, map and reduce; (2)
lack of interactivity: it is only used for batch
applications; (3) inefficiency: to implement iterative
applications, intermediate results must be written and
replicated on disk (i.e., through the Hadoop Distributed
File System, HDFS). For instance, machine learning
algorithms cannot efficiently benefit from MapReduce
since they require iterative jobs. For this reason, Spark
was proposed in (Zaharia et al., 2010) to allow
interactive and iterative jobs. Spark uses resilient
distributed datasets (RDDs), which are read-only
collections of objects stored in RAM. RDDs are large
parallel collections distributed across the cluster.
They are created through parallel transformations
(e.g., map, group-by, filter, join, create from an HDFS
block). Moreover, RDDs can be cached, which keeps
data sets of interest in memory across operations and
thus contributes to a substantial speedup. After
creating RDDs, it is possible to run analytics on them
through actions such as count, reduce, collect and
save. Note that all operations/transformations are lazy
until an action is run; the Spark execution engine then
pipelines the operations and finds an execution plan.
In this paper, we consider Spark applications.
The contribution of this paper is two-fold.
Reconfigurable Component-based Spark. We
define a high-level specification language to develop
Spark applications independently and compose them
automatically. For each application, we export its
input and output interfaces; we then define a
configuration file to compose those interfaces.
Applications are automatically augmented with the
proper code to wait for data/notifications from other
applications or to notify/send data to other
applications. This simplifies the development of
complex systems consisting of several dependent
Spark applications. Moreover, for the same set of
Spark applications, several systems may be built
depending on the input configuration file. For
instance, in the case of bioinformatics
applications, several orderings of the same pipeline of
applications may be defined (e.g., indexing, alignment,
etc.).
Adaptive Execution of Spark Applications.
Several companies provide services to deploy and
run Spark applications on the cloud (e.g., Ama-
zon Web Services, Microsoft Azure). The pricing
generally depends on the power of the allocated
nodes and the execution time needed to run the
applications. However, execution times may vary
considerably over time, as they depend on the
application. Furthermore, the non-predictability of the
underlying platform and operating systems is an
additional factor of uncertainty. For this, we propose a
method to run a sequence of parameterized (Quality
of Service) Spark applications within a specified time.
Parameterized applications are those that can be
augmented with an extra parameter denoting the
quality level. Machine learning and graph analytics
are good examples of parameterized applications,
since their quality depends on the number of rounds
(the more rounds, the better the quality). Consequently, using
our method, cloud services can provide users with
the ability to specify a deadline/price and a se-
quence of Spark applications to be executed us-
ing the best quality levels possible. A controller
iteratively selects the best quality for each Spark
application depending on the remaining deadline
and time already used.
The remainder of this paper is structured as
follows. Section 2 defines a language to easily
compose Spark applications. Section 3 defines a
method to adapt the quality levels used to execute
Spark applications while respecting a given deadline.
Section 4 discusses related work. Finally, Sections 5
and 6 draw some conclusions and perspectives.
2 RECONFIGURATION OF
SPARK APPLICATIONS
Given a set of dependent Spark applications, we
define a method to automatically compose them. Each
Spark application takes a set of inputs (e.g., files) and
produces a set of outputs. An input can be either free
or direct. A free input should be mapped to an output
of a different Spark application. A direct input is
mapped to a direct path.
Formally, a Spark application is defined as follows:
Definition 1 (Spark Application). A Spark application SA is defined as a tuple (ins, outs), where:
- ins = freeIns ∪ directIns is the set of inputs;
- outs is the set of outputs.

Figure 1: Representation of dependencies between sub-jobs.
Given a user-specified configuration, Spark applications are composed by mapping free inputs of Spark applications to outputs of other applications. Formally, a configuration is defined as follows:
Definition 2 (Configuration). Given a set of Spark applications {SA_i}_{i∈I}, a configuration C is a function C : FreeInput → Output, where:
- FreeInput = ∪_{i∈I} SA_i.freeIns;
- Output = ∪_{i∈I} SA_i.outs.
Example 1. Figure 1 shows an example of composing Spark applications. The system consists of five Spark applications SA_1, ..., SA_5. SA_1 has one direct input i_1 and two outputs o_1 and o_2. SA_2 has one free input i_1 and one output o_1. The first output o_1 of SA_1 is mapped to the free input i_1 of SA_2.
Given a set of Spark applications {SA_i}_{i∈I}, a configuration C must be valid. A configuration is valid iff:
- Free inputs are mapped to outputs of different applications. That is, if C(SA_i.fi_1) = SA_j.o_1, then i ≠ j, where fi_1 is a free input of SA_i and o_1 is an output of SA_j.
- The directed graph G obtained from {SA_i}_{i∈I} and C does not contain cycles, where the vertices of G represent the Spark applications and the edges represent the mappings between them. Formally, G = ({SA_i}_{i∈I}, E), where E = {(SA_i, SA_j) | ∃ freeIn ∈ SA_i.freeIns, ∃ out ∈ SA_j.outs : C(freeIn) = out}.
Example 2. The system defined in Figure 1 is valid
since (1) all free inputs are mapped to outputs of dif-
ferent applications; and (2) the graph obtained by
connecting outputs to free inputs is acyclic.
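To illustrate, the validity check is mechanical. The following Python sketch (not part of our implementation; the dictionary representation and helper names are assumptions made for illustration) encodes a configuration as a mapping from (application, free input) pairs to (application, output) pairs, rejects self-mappings, and detects cycles by depth-first search over the induced graph, with edges oriented from consumer to producer as in the definition above.

import sys

def is_valid(config):
    """config maps (consumer app, free input) -> (producer app, output)."""
    # Condition 1: a free input must be mapped to an output of a *different* application.
    for (consumer, _free_in), (producer, _out) in config.items():
        if consumer == producer:
            return False

    # Condition 2: the induced directed graph (consumer -> producer) must be acyclic.
    edges = {}
    for (consumer, _), (producer, _) in config.items():
        edges.setdefault(consumer, set()).add(producer)

    WHITE, GREY, BLACK = 0, 1, 2
    color = {}

    def has_cycle(node):
        color[node] = GREY
        for succ in edges.get(node, ()):   # depth-first search
            if color.get(succ, WHITE) == GREY:
                return True
            if color.get(succ, WHITE) == WHITE and has_cycle(succ):
                return True
        color[node] = BLACK
        return False

    nodes = {c for (c, _) in config} | {p for (p, _) in config.values()}
    return not any(color.get(n, WHITE) == WHITE and has_cycle(n) for n in nodes)

# The system of Figure 1: SA2 and SA3 consume the outputs of SA1, SA4 consumes
# the outputs of SA2 and SA3, and SA5 consumes the output of SA4.
figure1 = {
    ("SA2", "i1"): ("SA1", "o1"),
    ("SA3", "i1"): ("SA1", "o2"),
    ("SA4", "i1"): ("SA2", "o1"),
    ("SA4", "i2"): ("SA3", "o1"),
    ("SA5", "i1"): ("SA4", "o1"),
}
assert is_valid(figure1)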
2.1 Semantics and Code Generation
Given a set of Spark applications and a configuration
C, we first check the validity of configuration C, then
we automatically synthesize the glue of each Spark
application that corresponds to configuration C. The
glue surrounds each Spark application with the
corresponding synchronization and communication
primitives. The code generation is depicted in
Listing 1. It mainly consists of the following five steps:
Listing 1: Automatic Glue Generation of a Spark Application.

Step 1: Fill Inputs
for (input ∈ SAi.ins) {
    if (isFreeInput(input)) {
        source = input.from;
        index = input.index;
        inputs.add(source.getOutput(index));
    } else {
        inputs.add(input.value);
    }
}
Step 2: Wait Signals
for (input ∈ SAi.freeIns) {
    waitSignal(input.from);
}
Step 3: Fill Outputs
for (output ∈ SAi.outs) {
    outputs.add(output);
}
Step 4: Run Spark Application
run("path", inputs, outputs);
Step 5: Send Signals
for (output ∈ SAi.outs) {
    freeInputs = C⁻¹(output);
    for (freeInput ∈ freeInputs) {
        sendSignal(freeInput.id);
    }
}
1. Fill inputs: inputs are mapped to their corresponding paths. In the case of a free input, the corresponding output of a different Spark application is filled in. In the case of a direct input, a direct path is filled in. (A possible runtime realization of these steps is sketched after this list.)
2. Wait signals: Spark applications with free inputs
f reeIns wait for signals from other Spark appli-
cations with outputs mapped to f reeIns. This
phase may require a setup phase to set the con-
nections between processes (e.g., sockets, shared
semaphores, signals).
3. Fill outputs: outputs are mapped to their corre-
sponding paths.
4. Run: Run Spark application.
5. Send signals: Upon completion, Spark applica-
tions, with outputs mapped to free inputs f reeIns,
send signals to the Spark applications correspond-
ing to f reeIns.
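For illustration, one possible realization of this glue is sketched below in Python, using marker files on a shared file system for signaling and spark-submit to launch the wrapped program. The helper names, the path layout and the file-based signaling are assumptions made for the sketch; they are not the code our generator actually emits.

import os
import subprocess
import time

def output_path(app_id, out_name):
    return f"/shared/{app_id}/{out_name}"                  # assumed layout

def signal_path(producer, consumer):
    return f"/shared/signals/{producer}-{consumer}.done"   # assumed layout

def run_wrapped(app, config):
    """app: {"id", "path", "no", "inputs": [{"name", "free", "value"}]};
    config maps (consumer app, free input name) -> (producer app, output name)."""
    # Step 1: fill inputs (a free input takes the mapped producer's output path).
    inputs = []
    for inp in app["inputs"]:
        if inp["free"]:
            producer, out = config[(app["id"], inp["name"])]
            inputs.append(output_path(producer, out))
        else:
            inputs.append(inp["value"])

    # Step 2: wait for a signal from the producer of every free input.
    for inp in app["inputs"]:
        if inp["free"]:
            producer, _ = config[(app["id"], inp["name"])]
            while not os.path.exists(signal_path(producer, app["id"])):
                time.sleep(1)

    # Step 3: fill outputs.  Step 4: run the Spark application.
    outputs = [output_path(app["id"], f"o{k + 1}") for k in range(app["no"])]
    subprocess.run(["spark-submit", app["path"], *inputs, *outputs], check=True)

    # Step 5: upon completion, signal every application that consumes one of our outputs.
    for (consumer, _), (producer, _) in config.items():
        if producer == app["id"]:
            open(signal_path(app["id"], consumer), "w").close()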
2.2 DSL Implementation
We define a Domain Specific Language (DSL) using a
JSON representation that defines Spark applications
and a configuration to connect them. Each Spark
application is identified by an identifier (id), the path
where the Spark program resides (path), a number of
inputs (ni), and a number of outputs (no). The
configuration then maps inputs either to direct
locations or to outputs of other Spark applications (in
the case of free inputs). Listing 2
depicts the general structure to specify a set of Spark
applications and a configuration. It mainly consists of
two parts:
The first part defines the set of Spark applications
with their corresponding interfaces (i.e., an iden-
tifier, location, number of inputs and number of
outputs).
The second part defines the configuration, which connects the Spark applications (i.e., maps outputs to free inputs).
Listing 2: General shape of a configuration file.

{ "spark-applications": [
    { "id": "id", "path": "path", "ni": "n", "no": "n" },
    { "id": "id", "path": "path", "ni": "n", "no": "n" },
    ...
    { "id": "id", "path": "path", "ni": "n", "no": "n" }
]}
{ "configuration": [
    { "id": "id",
      "i": [ "i1", "i2", ... ],
      "o": [ "o1", "o2", ... ]
    },
    { "id": "id",
      "i": [ "i1", "i2", ... ],
      "o": [ "o1", "o2", ... ]
    },
    ...
    { "id": "id",
      "i": [ "i1", "i2", ... ],
      "o": [ "o1", "o2", ... ]
    }
]}
Listing 3 shows an example of a system that corresponds to Figure 1. It consists of two parts: (1) Spark applications, and (2) configuration. Spark application SA_1 has one input, which is a direct input, and two outputs. Spark application SA_2 has one input and one output. Its input is mapped to the first output of Spark application SA_1 (i.e., #SA1.o1).
Listing 3: Configuration File Corresponding to Figure 1.

{ "spark-applications": [
    { "id": "SA1", "path": "path1", "ni": "1", "no": "2" },
    { "id": "SA2", "path": "path2", "ni": "1", "no": "1" },
    { "id": "SA3", "path": "path3", "ni": "1", "no": "1" },
    { "id": "SA4", "path": "path4", "ni": "2", "no": "1" },
    { "id": "SA5", "path": "path5", "ni": "1", "no": "1" }
]}
{ "configuration": [
    { "id": "SA1",
      "i": [ "path in1" ],
      "o": [ "path o1", "path o2" ]
    },
    { "id": "SA2",
      "i": [ "#SA1.o1" ],
      "o": [ "path o1" ]
    },
    { "id": "SA3",
      "i": [ "#SA1.o2" ],
      "o": [ "path o1" ]
    },
    { "id": "SA4",
      "i": [ "#SA2.o1", "#SA3.o1" ],
      "o": [ "path o1" ]
    },
    { "id": "SA5",
      "i": [ "#SA4.o1" ],
      "o": [ "path o1" ]
    }
]}
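Reading such a file reduces to ordinary JSON parsing. The Python sketch below is illustrative only; it assumes the two parts are wrapped in a single JSON object, and it extracts, for each application, its direct input paths and the free-input dependencies written with the # notation.

import json

def parse(path):
    """Parse a configuration file shaped like Listing 3, assuming both parts
    are wrapped in one JSON object: {"spark-applications": [...], "configuration": [...]}."""
    with open(path) as f:
        doc = json.load(f)

    apps = {a["id"]: a for a in doc["spark-applications"]}

    dependencies = {}   # (consumer app, input position) -> (producer app, output name)
    direct_inputs = {}  # (app, input position) -> direct path
    for entry in doc["configuration"]:
        app_id = entry["id"]
        for pos, value in enumerate(entry["i"]):
            if value.startswith("#"):                 # free input: "#<producer>.<output>"
                producer, output = value[1:].split(".", 1)
                dependencies[(app_id, pos)] = (producer, output)
            else:                                     # direct input: a plain path
                direct_inputs[(app_id, pos)] = value
    return apps, dependencies, direct_inputs

# For the file of Listing 3, parse(...) would yield, among others,
# dependencies[("SA2", 0)] == ("SA1", "o1").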
3 REAL-TIME ADAPTIVE QoS OF
SPARK APPLICATION
The execution time of a set of Spark applications is
difficult to predict and can vary between two different
runs. Moreover, deploying a set of Spark applications
on the cloud can be very expensive, and the cost
varies on demand, i.e., depending on the time needed.
In this section, we propose a novel technique that
allows users to submit a set of Spark applications
together with a deadline by which to execute all of
them. For this, we consider quality-based Spark
applications, i.e., each Spark application can be
executed using different quality levels. We build a
controller that adaptively selects the maximum quality
level for each Spark application while respecting the
given deadline. This approach is a good fit for Quality
of Service applications. For the sake of simplicity, we
consider a subset of the component-based Spark
systems of Section 2 where the Spark applications
define a pipeline (i.e., the output of Spark application
i is connected to the input of Spark application i + 1).
Quality of Service (QoS) applications, such as
machine learning and graph analytics, iteratively
process the data: multiple rounds of computation are
performed on the same data. Therefore, the number of
rounds can easily be tuned to achieve a better quality
of the end result. In general, the quality is proportional
to the number of rounds, and the execution times of
applications are unknown increasing functions of the
quality level parameters (i.e., increasing the quality
implies increasing the execution time). For big data,
these applications can be easily and efficiently
implemented using Spark, which provides: (1)
scalability; (2) implicit distribution; (3) fault tolerance.
We introduce an adaptive method for QoS
management of a sequence of Spark applications.
Given a sequence of parameterized Spark applications
and a deadline to be met, the method adapts the
behavior of each application by adequately setting its
quality level parameter while respecting the given
deadline. The objective is to not miss the deadline
while selecting the maximum quality levels.
We consider Spark applications for which it is
possible, using timing analysis and profiling
techniques, to compute estimates of the worst-case
and average execution times for the different quality
levels. Worst-case and average execution times are
used to estimate the execution times of Spark
applications, and hence to select the best quality while
not exceeding the deadline.
We implement a controller that iteratively takes as
input the current Spark application SA_i to be executed
and the actual time¹ taken by the previously executed
Spark applications, and it selects the quality level q
with which to execute SA_i so that the remaining
applications can still be executed without exceeding
the deadline. Then, the new deadline is computed with
respect to the actual execution time of SA_i(q). The
controller must maximize the overall quality levels.
For this, it may follow several strategies. For instance,
it may always select the maximum quality level of the
current application; however, this ends up giving high
priority to the first applications (i.e., selecting
maximum quality) and low priority to the last
applications (i.e., selecting minimum quality). Another
option is to make the selected quality levels smooth
(i.e., low deviation of the quality levels with respect
to the average quality). Consequently, the controller
supports three different policies: (1) safe; (2) simple;
(3) mixed.
¹ Note that the actual execution times are only known after executing the application; they are used to adapt the remaining deadline.
3.1 Controller - Proposed Policies
We consider parameterized Spark applications, i.e.,
their execution times depend on the input quality.
Moreover, non-predictability of the underlying cluster
is another factor of uncertainty. For this, we consider
that the execution times are unknown but bounded. A
parameterized application can be defined as follows.
Definition 3 (Parameterized Spark System). A parameterized Spark system PSS is a tuple PSS = (SA, t_av, t_wc, D), where:
- SA is a sequence of Spark applications SA = [SA_1, ..., SA_n].
- t_av : SA × Q → R⁺ is a function that returns the average execution time t_av(SA_i, q) for a given Spark application SA_i and quality level q. Q = [q_min; q_max] is a finite interval of integers denoting the quality levels. We assume that, for all SA_i ∈ SA, q ↦ t_av(SA_i, q) is a non-decreasing function.
- t_wc : SA × Q → R⁺ is a function that returns the worst-case execution time t_wc(SA_i, q) for a given Spark application SA_i and quality level q. We assume that, for all SA_i ∈ SA, q ↦ t_wc(SA_i, q) is a non-decreasing function.
- D ∈ R⁺ is the global deadline to be met.
Let t_{i-1} be the actual time taken to execute SA_1, ..., SA_{i-1}. The time available to execute the remaining Spark applications (i.e., SA_i, ..., SA_n) is therefore D − t_{i-1}. The controller must select the maximum quality levels for the remaining Spark applications while respecting this new deadline. We define t_e^x as an estimation of the execution time of a sequence of Spark applications for a given quality, t_e^x([SA_i, ..., SA_n], q). Since our goal is not to exceed the deadline, this estimation should be an over-approximation (i.e., based on worst-case execution times) of the actual execution time. We distinguish several policies to estimate the execution times of the remaining applications.
Safe Policy (t_e^sf): this defines an obvious policy that ensures safety (the deadline is met) but gives more priority/time (i.e., higher quality) to the first applications. Given a sequence of Spark applications and a quality level q, t_e^sf is defined as follows: t_e^sf([SA_i, ..., SA_n], q) = t_wc(SA_i, q) + Σ_{k=i+1}^{n} t_wc(SA_k, q_min).
Simple Policy (t_e^sp): this defines another policy that ensures safety and improves smoothness (quality distributed over all the remaining applications) by combining worst-case and average-case execution times. t_e^sp is defined as follows: t_e^sp([SA_i, ..., SA_n], q) = Max( t_e^sf([SA_i, ..., SA_n], q), t_e^av([SA_i, ..., SA_n], q) ), where t_e^av([SA_i, ..., SA_n], q) = Σ_{k=i}^{n} t_av(SA_k, q).
Mixed Policy (t_e^mx): the mixed policy defines a generalization of the simple policy that further improves smoothness by taking into account all possible splits of the remaining sequence. t_e^mx is defined as follows: t_e^mx([SA_i, ..., SA_n], q) = Max_{k=i-1}^{n} { t_e^av([SA_i, ..., SA_k], q) + t_e^sf([SA_{k+1}, ..., SA_n], q) }, with the convention that the estimate of an empty sequence is 0.
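The three estimators translate directly into code. The Python sketch below is a plain transcription of the formulas above, with t_wc and t_av supplied as lookup functions (e.g., taken from profiling tables); the function names are illustrative, not our implementation.

# remaining is the list [SA_i, ..., SA_n]; t_wc(app, q) and t_av(app, q) return the
# worst-case and average execution-time estimates; q_min is the lowest quality level.

def t_safe(remaining, q, t_wc, q_min):
    head, tail = remaining[0], remaining[1:]
    return t_wc(head, q) + sum(t_wc(app, q_min) for app in tail)

def t_average(remaining, q, t_av):
    return sum(t_av(app, q) for app in remaining)

def t_simple(remaining, q, t_wc, t_av, q_min):
    return max(t_safe(remaining, q, t_wc, q_min), t_average(remaining, q, t_av))

def t_mixed(remaining, q, t_wc, t_av, q_min):
    # k ranges over all split points: the prefix is estimated with averages, the
    # suffix with the safe policy (k = 0 gives the pure safe estimate, k = len(remaining)
    # gives the pure average estimate).
    candidates = []
    for k in range(len(remaining) + 1):
        prefix, suffix = remaining[:k], remaining[k:]
        estimate = t_average(prefix, q, t_av)
        if suffix:
            estimate += t_safe(suffix, q, t_wc, q_min)
        candidates.append(estimate)
    return max(candidates)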
Theorem 1. Given a pipeline of Spark applications [SA_1, ..., SA_n] and a deadline D, the controller will not exceed D under any of the defined policies (i.e., Safe, Simple and Mixed), provided that the worst-case execution time is always greater than the actual execution time.
Proof. For a given quality q, the mixed policy yields a higher over-estimation of the execution time than the simple policy; moreover, the latter yields a higher over-estimation than the safe policy. Formally, the policies satisfy the following property: t_e^mx([SA_i, ..., SA_n], q) ≥ t_e^sp([SA_i, ..., SA_n], q) ≥ t_e^sf([SA_i, ..., SA_n], q). Moreover, as the safe policy is based on worst-case execution times, it clearly satisfies the safety requirement (i.e., the deadline is not exceeded at any iteration). Consequently, all the policies satisfy the safety requirement.
3.2 Controller - Quality Manager
Algorithm
The controller, as depicted in Figure 2, must iteratively select maximum quality levels (optimality) while guaranteeing safety (the deadline must be met). For this, at each iteration i, it selects (depending on the policy used) the quality level of the next Spark application according to:

Max { q | t_e^x([SA_i, ..., SA_n], q) + t_{i-1} ≤ D }

where t_{i-1} is the actual execution time after executing SA_1, ..., SA_{i-1}, SA_i is the current application to be executed, and t_e^x is an estimation of the time needed to execute the remaining applications, i.e., SA_i, ..., SA_n. This estimation is computed using one of the three policies defined in Subsection 3.1 (i.e., safe, simple, mixed).
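Putting the pieces together, the quality-manager loop can be sketched as follows (Python, illustrative only). Here estimate stands for any of the policy estimators above with the profiling functions bound in (e.g., estimate = lambda rem, q: t_simple(rem, q, t_wc, t_av, q_min)), and run_and_measure is an assumed helper that runs an application at quality q and returns its actual execution time.

def run_pipeline(apps, qualities, deadline, estimate, run_and_measure):
    """apps: [SA_1, ..., SA_n]; qualities: ordered list [q_min, ..., q_max]."""
    elapsed = 0.0
    selected = []
    for i, app in enumerate(apps):
        remaining = apps[i:]
        # Maximum q whose estimated remaining time still fits in the budget.
        feasible = [q for q in qualities
                    if estimate(remaining, q) + elapsed <= deadline]
        q = max(feasible)   # assumes q_min is always feasible (cf. the assumption in Section 3.3)
        elapsed += run_and_measure(app, q)   # adapt to the *actual* execution time
        selected.append(q)
    return selected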
Example 3. Consider a sequence of three Spark applications SA_1, SA_2, SA_3 with a deadline equal to 9, where the average (avet), worst-case (wcet) and actual (acet) execution times, identical for all three applications, are given in the following table.

quality  wcet  avet  acet
1        1     1     1
2        5     2     2
3        6     3     3
4        7     4     4
Figure 2: Adaptive Spark. The controller observes the actual execution time t_{i-1} after executing SA_1, ..., SA_{i-1} and selects the next quality level q = Max { q | t_e^x([SA_i, ..., SA_n], q) + t_{i-1} ≤ D } with which to execute SA_i of the pipeline SA_1, ..., SA_n.
With the safe policy, the quality levels 4, 1, 1 will be selected for applications SA_1, SA_2, SA_3, respectively.
With the simple policy, the quality levels 3, 2, 1 will be selected for applications SA_1, SA_2, SA_3, respectively.
With the mixed policy, the quality levels 2, 2, 2 will be selected for applications SA_1, SA_2, SA_3, respectively.
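To make the selection concrete, the safe-policy trace implied by the table is as follows: for SA_1 the controller picks q = 4, since t_e^sf([SA_1, SA_2, SA_3], 4) = wcet(4) + wcet(1) + wcet(1) = 7 + 1 + 1 = 9 ≤ 9; SA_1 then actually runs in 4 time units, leaving a budget of 5, so for SA_2 the largest q with wcet(q) + wcet(1) ≤ 5 is q = 1; after SA_2 runs in 1 time unit, 4 units remain and wcet(2) = 5 > 4, so SA_3 is also run at q = 1. The simple and mixed selections follow from the same reasoning applied to their respective estimators.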
3.3 Experimental Results
We consider three built-in Spark applications using
MLlib (Meng et al., 2015; Spark MLlib, 2016): (1)
K-means clustering; (2) Logistic Regression; and (3)
Support Vector Machine (SVM). We adapt these
applications to run at five different quality levels,
from 1 to 5 (highest quality). Each quality level
corresponds to a specific number of iterations (the
higher the quality, the more iterations). For example,
quality levels 4 and 5 for SVM correspond to 50 and
100 iterations, respectively.
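To indicate how the quality parameter is exposed, the sketch below maps quality levels to iteration counts using the RDD-based pyspark.mllib API. Only the quality-4 and quality-5 SVM values (50 and 100 iterations) come from the description above; the other level-to-iteration mappings are illustrative placeholders.

from pyspark import SparkContext
from pyspark.mllib.classification import SVMWithSGD, LogisticRegressionWithLBFGS
from pyspark.mllib.clustering import KMeans

# Only the SVM values for quality 4 and 5 (50 and 100 iterations) are taken from the
# text above; the remaining values are placeholders chosen for the sketch.
SVM_ITERATIONS = {1: 5, 2: 10, 3: 25, 4: 50, 5: 100}

def run_svm(sc: SparkContext, labeled_points, quality: int):
    data = sc.parallelize(labeled_points)   # RDD of LabeledPoint
    return SVMWithSGD.train(data, iterations=SVM_ITERATIONS[quality])

def run_kmeans(sc: SparkContext, vectors, quality: int, k: int = 2):
    data = sc.parallelize(vectors)          # RDD of feature vectors
    return KMeans.train(data, k, maxIterations=10 * quality)

def run_logistic(sc: SparkContext, labeled_points, quality: int):
    data = sc.parallelize(labeled_points)   # RDD of LabeledPoint
    return LogisticRegressionWithLBFGS.train(data, iterations=10 * quality)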
We compute the average and worst-case execution
times by running several benchmarks. Tables 1 and 2
depict the worst-case and average execution times for
a specific number of instances and features. Note that
these execution times depend on the parameters of
each algorithm (e.g., number of features, size of the
training set). In the general case, the average and
worst-case execution times are therefore functions of
the input parameters; for the sake of simplicity, we
only consider fixed values of these parameters.
Moreover, we assume that the deadline is greater than
the sum of the worst-case execution times at the
lowest quality level.
We test our implementation on a sequence of seven Spark applications: SA_1 (SVM), SA_2 (Logistic Regression), SA_3 (K-means), SA_4 (Logistic Regression), SA_5 (K-means), SA_6 (SVM) and SA_7 (SVM). The deadline is 1000 seconds. Figure 3 depicts the quality levels selected by the controller using the safe, simple and mixed policies, respectively.

Figure 3: Selected Quality by the Controller using Safe, Simple and Mixed Policies (quality level per Spark application 1-7).
Clearly, the quality levels selected using the safe policy give more priority to the first applications. The simple policy provides only a small degree of smoothness of the quality levels. The mixed policy provides a fair distribution (good smoothness) of quality levels across all the applications.
4 RELATED WORK
4.1 Data-Parallel Pipeline
Cascading (Cascading, 2016) and FlumeJava (Cham-
bers et al., 2010) are Java libraries that simplify the
development of data-parallel pipelines. Unlike our
method, (1) these frameworks/libraries are not
compatible with Spark applications, and (2) they
allow pipeline/graph modeling within an application
rather than between applications. Note that our
method can be combined with those frameworks to
provide more composition expressiveness.
4.2 Large Scale Graph Processing
Pregel (Malewicz et al., 2010) and Giraph (Giraph,
2016) are systems for large scale graph process-
Table 1: Worst case running time.
Algorithms Q1 Q2 Q3 Q4 Q5
K-means 115s 125s 135s 140s 145s
Logistic Regression 120s 135s 140s 150s 155s
SVM 130s 170s 190s 235s 345s
Table 2: Average case running time.
Algorithms Q1 Q2 Q3 Q4 Q5
K-means 95s 115s 125s 130s 135s
Logistic Regression 110s 125s 130s 145s 150s
SVM 90s 140s 160s 205s 315s
ing. The two systems are inspired by the Bulk
Synchronous Parallel (BSP) model (Valiant, 2011).
Such systems take as input a directed graph and
define programs that express a sequence of iterations
(receive messages, perform local computation, send
messages) depending on the vertex identifier. Our
approach differs from this work in the following
aspects: first, these systems are not applicable to
Spark; second, they are specific to graph processing.
Note also that our approach can be applied on top of
these systems.
4.3 High-Level Specific Big Data
Library
Several high-level domain-specific libraries have
been proposed to process big data. For example,
MLlib (Meng et al., 2015; Spark MLlib, 2016)
provides a machine learning library on top of Spark.
Apache Mahout (Mahout, 2016) provides an
environment for quickly creating machine learning
applications. Our DSL allows composing different
applications written using different high-level
libraries. Thus, developers can easily build a complex
system consisting of several applications that use
several libraries.
4.4 Adaptive QoS
Solutions were proposed in (Wust et al., 2004;
Buttazzo et al., 1998; Davis et al., 1993) to achieve
soft real-time execution of multimedia applications
based on techniques such as Markov decision
processes, reinforcement learning and Earliest
Deadline First (EDF).
In (Combaz et al., 2005b), a method was proposed
in which the quality levels of multimedia applications
can be set adequately so that the following QoS
requirements are respected: (1) safety - no deadline is
missed; (2) optimality - the available time is used as
much as possible; and (3) smoothness of the quality
levels. Our method is a variation of the method
presented in (Combaz et al., 2008), which we apply to
a sequence of Spark applications. To the best of our
knowledge, there is no major prior work that allows
the cloud to be parameterized so as to dynamically
allocate maximum quality levels to a sequence of
submitted jobs while respecting a user-specified
deadline (i.e., price/cost).
4.5 Dynamic Partitioning
In (Gounaris et al., 2017), an algorithm is introduced
for the dynamic partitioning of RDDs with a view to
minimizing resource consumption. The algorithm
uses analytical models that express the running time
as a function of the number of machines employed.
Unlike our method, it allows quality levels to be
dynamically selected for a whole Spark application.
Note that the two methods may be combined to tune
the granularity of the dynamic adaptation.
5 CONCLUSION
We have presented a new approach for linking Spark
applications to build a complex one based on a
user-defined configuration file. Given a set of Spark
applications, a configuration file defines a directed
graph between the applications, where each edge is a
dependency between two applications. A dependency
denotes both an order of execution and a data
dependency (e.g., the input of an application requires
the output of a different application). We have
implemented a Domain Specific Language (DSL) to
easily define interfaces as well as the connections
between them. Our framework automatically builds
the final system with respect to the input configuration.
Furthermore, we have defined a method that allows
clusters to be augmented with controllers in order to
automatically manage the quality levels of a sequence
of Spark applications. Thus, our method allows users
to run a set of applications with the best
possible quality without violating time constraints
(e.g., a deadline). We discussed several policies that
either give more priority to the early applications or
achieve a fair quality across all the applications.
6 FUTURE WORK
Future work goes in several directions. First, we plan
to define a unified DSL that combines several
frameworks, not only Spark. Second, we will
introduce interface placeholders within a Spark
application to allow more expressive composition.
Third, we will support several clusters, where
efficiently transferring data at synchronization points
becomes a challenging task. Another direction is to
extend the controller to take as input a graph of Spark
applications (not only a sequence). For this, the
controller should be augmented with a scheduler to
select the best next Spark application in addition to
the next quality level.
REFERENCES
Buttazzo, G. C., Lipari, G., and Abeni, L. (1998). Elastic
task model for adaptive rate control. RTSS, pages 286–
295.
Cascading (2016). Cascading. http://www.cascading.org.
Chambers, C., Raniwala, A., Perry, F., Adams, S., Henry,
R. R., Bradshaw, R., and Weizenbaum, N. (2010).
Flumejava: easy, efficient data-parallel pipelines. In
Proceedings of the 2010 ACM SIGPLAN Conference
on Programming Language Design and Implementa-
tion, PLDI 2010, Toronto, Ontario, Canada, June 5-
10, 2010, pages 363–375.
Combaz, J., Fernandez, J., Sifakis, J., and Strus, L. (2008).
Symbolic quality control for multimedia applications.
Real-Time Systems, 40(1):1–43.
Combaz, J., Fernandez, J.-C., Lepley, T., and Sifakis, J.
(2005b). Qos control for optimality and safety. Pro-
ceedings of the 5th Conference on Embedded Soft-
ware.
Dean, J. and Ghemawat, S. (2004). Mapreduce: Simplified
data processing on large clusters. In 6th Symposium on
Operating System Design and Implementation (OSDI
2004), San Francisco, California, USA, December 6-
8, 2004, pages 137–150.
Dean, J. and Ghemawat, S. (2010). Mapreduce: a flexible
data processing tool. Commun. ACM, 53(1):72–77.
Ghemawat, S., Gobioff, H., and Leung, S. (2003). The
google file system. In Proceedings of the 19th ACM
Symposium on Operating Systems Principles 2003,
SOSP 2003, Bolton Landing, NY, USA, October 19-
22, 2003, pages 29–43.
Giraph (2016). Apache giraph. http://giraph.apache.org.
Gounaris, A., Kougka, G., Tous, R., Tripiana, C., and Tor-
res, J. (2017). Dynamic configuration of partitioning
in spark applications. IEEE Transactions on Parallel
and Distributed Systems, PP(99):1–1.
Mahout (2016). Apache mahout. http://mahout.apache.org.
Malewicz, G., Austern, M. H., Bik, A. J. C., Dehnert,
J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010).
Pregel: a system for large-scale graph processing. In
Proceedings of the ACM SIGMOD International Con-
ference on Management of Data, SIGMOD 2010, In-
dianapolis, Indiana, USA, June 6-10, 2010, pages
135–146.
Meng, X., Bradley, J. K., Yavuz, B., Sparks, E. R.,
Venkataraman, S., Liu, D., Freeman, J., Tsai, D. B.,
Amde, M., Owen, S., Xin, D., Xin, R., Franklin, M. J.,
Zadeh, R., Zaharia, M., and Talwalkar, A. (2015).
Mllib: Machine learning in apache spark. CoRR,
abs/1505.06807.
Davis, R. I., Tindell, K. W., and Burns, A. (1993). Scheduling
slack time in fixed priority pre-emptive systems. In
Proceedings of the IEEE Real-Time Systems Symposium,
pages 222–231.
Spark MLlib (2016). Spark mllib.
http://spark.apache.org/mllib/.
Valiant, L. G. (2011). A bridging model for multi-core com-
puting. J. Comput. Syst. Sci., 77(1):154–166.
Wust, C. C., Steffens, L., Verhaegh, W. F., Bril, R. J., and
Hentschel, C. (2004). Qos control strategies for high-
quality video processing. Euromicro Conference on
Real-Time Systems, pages 3–12.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,
and Stoica, I. (2010). Spark: Cluster computing with
working sets. In 2nd USENIX Workshop on Hot Topics
in Cloud Computing, HotCloud’10, Boston, MA, USA,
June 22, 2010.