Adaptive Resource Management for Distributed Data Analytics based on
Container-level Cluster Monitoring
Thomas Renner, Lauritz Thamsen and Odej Kao
Complex and Distributed IT Systems, Technische Universität Berlin, Berlin, Germany
Keywords:
Cluster Resource Management, Distributed Data Analytics, Distributed Dataflow Systems, Adaptive Resource
Management.
Abstract:
Many distributed data analysis jobs are executed repeatedly in production clusters. Examples include daily ex-
ecuted batch jobs and iterative programs. These jobs present an opportunity to learn workload characteristics
through continuous fine-grained cluster monitoring. Therefore, based on detailed profiles of resource utiliza-
tion, data placement, and job runtimes, resource management can in fact adapt to actual workloads. In this
paper, we present a system architecture that contains four mechanisms for adaptive resource management,
encompassing data placement, resource allocation, and container as well as job scheduling. In particular, we
extended Apache Hadoop’s scheduling and data placement to improve resource utilization and job runtimes
for recurring analytics jobs. Furthermore, we developed a Hadoop submission tool that allows users to re-
serve resources for specific target runtimes and which uses historical data available from cluster monitoring
for predictions.
1 INTRODUCTION
Modern distributed data-analytic frameworks such as
Spark (Zaharia et al., 2010b) and Flink (Carbone
et al., 2015) allow processing large datasets in par-
allel using a large number of computers. Often, ap-
plications of these frameworks (i.e. jobs) run within
virtualized containers on top of cluster resource man-
agement systems like YARN (Vavilapalli et al., 2013),
which provide jobs a specific amount of resources
for their execution. In addition, a co-located dis-
tributed file system such as HDFS (Shvachko et al.,
2010) stores the data that is to be analyzed by the
data-analytic frameworks. This design allows sharing
cluster resources effectively by running data analytic
jobs side-by-side on a single cluster infrastructure
composed of hundreds or more nodes and accessing
shared datasets. One key challenge, then, is
to realize an effective resource management of these
large cluster infrastructures in order to run data ana-
lytics in an economically viable way.
Resource management for data-analytic clusters
is a challenging task as analytic jobs often tend to
be long running, resource-intensive, and consequently
expensive. For instance, in most resource manage-
ment systems the user needs to specify upfront how
many resources a job allocates for its execution in
terms of number of containers as well as cores and
memory per container. Furthermore, users often tend
to allocate too many resources for their jobs. In terms
of numbers, a production cluster at Google achieved
aggregate CPU utilization between 25% and 30% and
memory utilization of 40%, while resource reserva-
tions exceeded 75% and 60% of the available CPU
and memory capacities (Verma et al., 2015). Further-
more, another challenge for effective resource man-
agement is that data-analytic jobs often have different
workload characteristics and, thus, stress different re-
sources. For instance, some frameworks provide li-
braries for executing scalable machine learning jobs
such as Spark MLlib (Meng et al., 2016), which of-
ten tend to be more CPU-intensive. On the contrary,
jobs containing queries for data aggregation tend to
be more I/O-intensive (Jia et al., 2014). Knowing the
dominant resource of a job in advance makes it possible
to schedule it alongside other jobs for better node
utilization and less interference between jobs. For data-
intensive jobs, achieving a high degree of data locality
can reduce the network demand due to the availabil-
ity of input data on local disks. In this case, data can
be ingested faster and the computation can start im-
mediately (Zaharia et al., 2010a). However, the set
of containers in which a job and its tasks run
is chosen by the resource manager, which often sched-
ules containers without taking the location of data into
account. Thus, the task scheduling of data-analytic
frameworks is restricted to the set of nodes the con-
tainers are running on.
Besides identifiable workload characteristics, such
as jobs being more I/O- or more CPU-intensive,
workloads typically also contain many analyt-
ics jobs that are executed repeatedly. This is the case
for iterative computations, but also for recurring batch
analytics, for which the same job is executed on a
weekly or even daily schedule. Such recurring batch
jobs can account for up to 40% of all jobs in production
clusters (Agarwal et al., 2012; Chaiken et al., 2008).
In summary, on the one hand cluster workloads
can vary significantly in terms of which resources are
most dominant and thus which optimizations are most
effective; on the other hand, workloads often have defining charac-
teristics and consist to a large degree of repeatedly executed jobs.
Therefore, we argue that resource management for
data-analytic clusters should automatically adapt to
the workloads running on the resources. Furthermore,
detailed knowledge of a job’s resource usage and
scale-out behavior also allows executing jobs more
efficiently as well as with respect to user performance
requirements. The basis for this knowledge on work-
load characteristics and particular jobs is fine-grained
cluster monitoring, in the case of resource-managed clus-
ters on the container level. Moreover, user performance
requirements can be collected to not only adapt to
job characteristics but also make informed decisions
based on user needs.
In this paper we report our experience of applying
adaptive resource management for distributed data
analytics based on such fine-grained container-level
cluster monitoring. In particular, we present a system
based on Hadoop (Apache Hadoop, http://www.hadoop.apache.org/,
accessed 2017-03-20) that allows managing different as-
pects of resource management and which we used to
improve job execution time, increase resource utiliza-
tion, and meet performance demands of users. The
system measures job runtimes and the resource con-
sumption of every single job based on the utilizations
of all its execution containers. In addition, we record
information on the input data to improve data local-
ity of the frameworks and colocate related files on
the same set of nodes. We further schedule combina-
tions of jobs to run together on the cluster for which
we observed high overall resource utilization and lit-
tle interference when executed co-located. The system is
also capable of taking user performance constraints in
terms of runtime targets into account and automatically allocating
resources for these targets, after modeling the scale-
out performance of a job based on previous runs. In
addition, the system dynamically adapts resource al-
locations at runtime towards user-defined constraints
such as resource utilization targets at barriers between
dataflow stages.
Outline. The remainder of the paper is structured
as follows. Section 2 provides background on sys-
tems used for data analytics. Section 3 presents our
idea of adaptive resource management. Section 4 ex-
plains our system architecture for adaptive resource
management. Section 5 presents four different appli-
cations of our idea of adaptive resource management.
Section 6 discusses related work. Section 7 concludes
this paper.
2 BACKGROUND
This section describes distributed dataflow systems,
distributed file systems, and resource management
systems. Together these systems often form the setup
of shared nothing compute clusters used for data ana-
lytics.
2.1 Distributed Dataflow Systems
In distributed dataflow systems tasks receive inputs
and produce outputs. These tasks are data-parallel,
so each parallel task instance receives only parts of
the input data and produces only parts of the output
data of the entire dataflow. The tasks usually are user-
defined versions of a set of predefined operators such as
Map, Reduce, and Join. A Map, for example, is con-
figured with a user-defined function (UDF). A Join,
in contrast, is configured by defining the join crite-
ria, i.e. which attributes have to match for elements
to be joined. Therefore, operators like Joins, but also,
for example, group-wise reductions require elements
with matching keys to be available at the same task
instances. If this is not already given from the pre-
ceding tasks, the input data is re-partitioned for these
operations.
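To make this programming model concrete, the following framework-agnostic sketch emulates a Map operator configured with a UDF, a re-partitioning by key, and a group-wise reduction. All names and the single-process execution are illustrative assumptions and do not correspond to the API of any particular system.

```python
# Framework-agnostic sketch of the operator model described above: Map with a
# user-defined function (UDF), re-partitioning by key, and a group-wise
# reduction. Purely illustrative, not the API of any particular framework.
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple


def map_operator(partition: Iterable, udf: Callable) -> List:
    # A Map task instance applies the UDF to every element of its partition.
    return [udf(element) for element in partition]


def repartition_by_key(partitions: List[List[Tuple]], num_tasks: int) -> List[dict]:
    # Shuffle: elements with the same key are sent to the same task instance.
    targets = [defaultdict(list) for _ in range(num_tasks)]
    for partition in partitions:
        for key, value in partition:
            targets[hash(key) % num_tasks][key].append(value)
    return targets


def reduce_operator(grouped: dict, udf: Callable) -> List[Tuple]:
    # A group-wise reduction applies the UDF to all values sharing a key.
    return [(key, udf(values)) for key, values in grouped.items()]


# Example: counting words across two input partitions.
partitions = [["a", "b", "a"], ["b", "c"]]
mapped = [map_operator(p, lambda word: (word, 1)) for p in partitions]
grouped = repartition_by_key(mapped, num_tasks=2)
print([reduce_operator(g, sum) for g in grouped])
```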
In general the tasks form a directed graph, of-
ten supporting joins of multiple separate dataflows or
even forks of a single one. Some frameworks also
support iterative programs with cyclic directed graphs
or features like result caching. Besides data paral-
lelism, task parallelism is often realized in the form of
pipeline parallelism. Typically, the computation is
distributed and orchestrated by a master across many
workers. Workers expose their compute capabilities
via slots, which host either single task instances or
chains of subsequent task instances.
Key features of such distributed dataflow systems
include effective task distribution, fault tolerance,
scalability, processing speed, and usability. More-
over, program code typically does not contain details
on the execution environment, which instead is speci-
fied in separate configuration files, allowing the same
program to be used in different environments. Con-
sequently, developers can concentrate on application
logic, while cluster operation teams focus on system
specifics such as resource allocation and configuration.
2.2 Distributed File Systems
In distributed file systems large files are typically
fragmented into blocks, which are then stored across
multiple nodes. Each node can store many blocks of
multiple different files and the blocks of a large file
can be stored across many nodes.
Such systems typically also follow the mas-
ter/worker pattern. The master stores metadata, such
as which block of which file is stored on a particular
node. This metadata is usually structured as file direc-
tories, allowing users to access files by specifying the
path to a file. When a client requests a file from the
master, the master returns the exact block locations,
which the client consequently uses to read the actual
file blocks directly from the workers.
Key features of distributed file systems include
fault tolerance, scalability, as well as usability. Fault
tolerance is usually realized using replication, so that
each block is often replicated three times, each one
stored on different nodes or even different racks. The
systems then implement fault detection and recovery
to continuously re-replicate blocks in case of failures.
Regarding usability, from a user’s perspective the dis-
tributed nature of the file system is abstracted. That
is, access to the distributed files is usually just as easy
as accessing local files, except perhaps for read/write
performance. To increase access performance, a com-
monly applied technique is selecting compute nodes
for tasks with a maximum of input data locally avail-
able.
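The following minimal sketch illustrates this locality heuristic under stated assumptions: the block-to-node mapping and the node names are hypothetical, and real schedulers combine such a ranking with further considerations such as fairness (see delay scheduling, Zaharia et al., 2010a).

```python
# Illustrative sketch (not an HDFS API): given block locations reported by the
# file system master, rank the compute nodes that hold the most input blocks
# of a file locally, which is the locality heuristic described above.
from collections import Counter
from typing import Dict, List


def nodes_by_local_blocks(block_locations: Dict[str, List[str]]) -> List[str]:
    """block_locations maps a block id to the nodes storing a replica of it.
    Returns candidate nodes ordered by how many of the file's blocks they hold."""
    counts = Counter(node for replicas in block_locations.values() for node in replicas)
    return [node for node, _ in counts.most_common()]


# Example: a file with three blocks and a replication factor of two
# (hypothetical layout).
locations = {
    "blk_1": ["node-1", "node-2"],
    "blk_2": ["node-1", "node-3"],
    "blk_3": ["node-2", "node-3"],
}
print(nodes_by_local_blocks(locations))  # node-1, node-2, node-3 (each holds 2 blocks)
```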
2.3 Resource Management Systems
Resource management systems allow sharing cluster
resources among multiple users and applications. Of-
ten, these systems also support multiple different an-
alytics solutions such as different distributed dataflow
systems. Thus, users can select per application which
framework is best for their task. Then, users re-
serve parts of the cluster for a time through contain-
ers. These containers represent compute capabili-
ties. Often, a container is a reservation for a num-
ber of cores and an amount of main memory. Nodes
can host multiple such containers, as many as fit
their resources, and multiple containers on a number
of hosts are typically reserved for a single application.
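As a simple illustration of such container reservations, consider the sketch below; the capacities and container sizes are made up and not tied to YARN's actual configuration or scheduling logic.

```python
# A minimal sketch of container reservations as described above: a container
# reserves a number of cores and an amount of memory, and a node can host as
# many containers as fit its capacity. Values and names are illustrative.
from dataclasses import dataclass
from typing import List


@dataclass
class Container:
    cores: int
    memory_mb: int


@dataclass
class Node:
    cores: int
    memory_mb: int

    def fits(self, containers: List[Container]) -> bool:
        # Check whether the requested containers fit this node's capacity.
        return (sum(c.cores for c in containers) <= self.cores
                and sum(c.memory_mb for c in containers) <= self.memory_mb)


node = Node(cores=16, memory_mb=64 * 1024)
request = [Container(cores=2, memory_mb=8 * 1024)] * 6   # 12 cores, 48 GB in total
print(node.fits(request))  # True: the node can host all six containers
```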
Resource management systems also usually fol-
low the master/worker pattern, with a central master
responsible for scheduling and orchestration, while
workers host application containers. More advanced
system designs apply master replication for fault tol-
erance or decentralized scheduling for scalability. When
multiple schedulers are used for scalability, optimistic
concurrency control schemes are usually employed.
There is also a middle ground between centralized
and decentralized resource managers: hybrid so-
lutions with, for example, hierarchical designs.
3 ADAPTIVE RESOURCE
MANAGEMENT
The goal of adaptive resource management is to con-
tinuously improve the resource utilization of a data-
analytic cluster. The overall idea is to learn about
the cluster workloads and automatically adapt data
and container placement as well as job scheduling
based on this information. Workloads often have
predictable characteristics since up to 40 % of jobs
in production clusters are recurring (Agarwal et al.,
2012; Chaiken et al., 2008). For these jobs, the exe-
cution logic stays the same for every execution, but
the input data may change. Typical triggers for
recurring jobs are the arrival of new data or fixed
schedules such as hourly or nightly batch reports.
Figure 1: Adaptive resource management cycle (user and performance requirements, job and cluster monitoring, and adjusted resource management).
Figure 1 illustrates the process of our adaptive re-
source management approach. The distinct steps in
this cycle are:
User and Performance Requirements: Users spec-
ify performance requirements regarding runtime
targets. For instance, a recurring job should be
finished in an hour. Based on previous runs of that
job, the necessary set of resources is estimated and
allocated.
Job and Cluster Monitoring: Jobs that are exe-
cuted on the cluster are continuously monitored
on a fine-grained container level. Based on this
historical workload data, job repositories are gen-
erated. Repository data includes execution time,
data placement, data access patterns, resource uti-
lization, and resource allocation. These detailed
profiles allow learning for future executions of jobs
and, thereby, improving resource utilization.
Adjusted Resource Management: Adaptive re-
source management is achieved by data block
placement, container placement, and job schedul-
ing. Data blocks are placed on as many nodes as
used by a job for optimal data locality. In addi-
tion, related files are colocated on the same set of
nodes. Containers of jobs that use these data as
input are placed on these nodes. Job scheduling is
used to run jobs with complementary resource us-
age and low interference together. As in many re-
source management systems the resource utilization
of containers is not strictly limited to the amount
of resources requested for scheduling, it is beneficial to
schedule an I/O-intensive job together with a CPU-inten-
sive job.
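The following deliberately simple sketch illustrates two of these adaptation decisions in isolation: choosing an allocation from the history of a recurring job and checking whether a queued job complements the jobs already running. The function names and the toy decision logic are placeholders for illustration; the actual mechanisms are described in Sections 4 and 5.

```python
# Schematic, runnable sketch of two adaptation decisions from the cycle above.
# The heuristics are deliberately naive and only serve as illustration.

def required_containers(runtime_target_min, previous_runs, default=4):
    # Step 1 (requirements): reuse the smallest past allocation that met the target.
    met = [r["containers"] for r in previous_runs if r["runtime_min"] <= runtime_target_min]
    return min(met) if met else default


def complementary(queued_job, running_jobs):
    # Step 3 (scheduling): prefer pairing, e.g., I/O-intensive with CPU-intensive jobs.
    return all(j["dominant_resource"] != queued_job["dominant_resource"] for j in running_jobs)


history = [{"containers": 8, "runtime_min": 41}, {"containers": 4, "runtime_min": 73}]
print(required_containers(runtime_target_min=60, previous_runs=history))   # 8

running = [{"dominant_resource": "cpu"}]
print(complementary({"dominant_resource": "io"}, running))                 # True
```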
4 SYSTEM ARCHITECTURE
This section describes our system architecture for im-
plementing adaptive resource management for dis-
tributed data analytics. The system consists of a
monitoring component called Freamon that provides
container-level job monitoring for Hadoop deploy-
ments. It monitors the resource utilization of individual data
analytic jobs by monitoring all of their execution contain-
ers that run on different nodes. A data analytic job
can be any application that supports execution on
Hadoop YARN. In addition, Freamon measures the
total node utilization to analyze interference of dif-
ferent container placements as well as storage utiliza-
tion and data block placements. Different adaptive
resource management applications use Freamon's his-
torical job data for adaptive container placement, data
block placement, and container scheduling decisions.
Figure 2 shows how Freamon is integrated with
existing systems and its components for providing
fine-grained monitoring which then enables adaptive
resource management applications. Freamon relies
on Hadoop YARN and HDFS.
Figure 2: System overview for applying adaptive resource management (compute layer: YARN NodeManagers hosting the distributed containers of a job; data layer: HDFS DataNodes storing the input data blocks of a job; monitoring: Freamon with one FAgent per node; on top: the Allocation, Data Colocation, Scaling, and Job Colocation Assistants, which influence job scheduling, container placement, and data block placement).
The Compute Layer
is represented by YARN's slave nodes (i.e. NodeM-
anagers), which provide data analytic jobs with clus-
ter compute resources through containers. The Data
Layer includes all HDFS slave nodes (i.e. DataN-
odes), which are responsible for storing datasets in se-
ries of blocks distributed across all available DataN-
odes. Both layers are running on all cluster nodes.
An important reason for co-locating both systems is
to improve data locality, the attempt to execute tasks
on nodes where the input data is stored. Data locality
can reduce the network demand due to the availability
of the input data on local disks and, thus, ingestion is
faster, allowing computation to start earlier.
The FAgent runs on every node and collects differ-
ent node-related system metrics. It collects resource
utilization of the whole node such as CPU, disk, net-
work, and memory data. In addition, Freamon tracks
resource utilization on a more fine-grained container-
level, in which the data analytic jobs are executed.
It is worth mentioning that the resources of contain-
ers are not strictly limited: a container can use
more resources if they are available. Thus, containers execut-
ing jobs compete for the available node resources.
With Freamon, we know exactly how much
a job was actually utilizing on a node and can use this infor-
mation later for resource savings for recurring jobs.
The agent uses the proc filesystem and monitors the
PID of the container process executing the job. It also acts as a
frontend to other Unix monitoring tools if available.
Records are sent periodically to the Freamon moni-
toring master.
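As an illustration of what such container-level sampling via the proc filesystem can look like, consider the following sketch for Linux. It is an assumption-laden simplification, not the actual FAgent code; the field positions follow proc(5) and error handling is omitted.

```python
# Illustrative sketch of sampling the CPU and memory usage of a container's
# process via the Linux proc filesystem; not the actual FAgent implementation.
import time


def read_cpu_ticks(pid: int) -> int:
    """Total user + system CPU ticks consumed by the process (fields 14 and 15
    of /proc/<pid>/stat)."""
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()
    utime, stime = int(fields[11]), int(fields[12])
    return utime + stime


def read_rss_kib(pid: int) -> int:
    """Resident set size of the process in KiB, from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


def sample(pid: int, interval_s: float = 1.0) -> dict:
    """One monitoring record for a container process, as an agent might
    periodically send to the monitoring master."""
    ticks_before = read_cpu_ticks(pid)
    time.sleep(interval_s)
    ticks_after = read_cpu_ticks(pid)
    return {
        "pid": pid,
        "timestamp": time.time(),
        "cpu_ticks_delta": ticks_after - ticks_before,
        "rss_kib": read_rss_kib(pid),
    }
```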
Freamon is the master monitoring component, in
which all information about resource utilizations col-
lected by the distributed FAgents comes together. In
addition, it communicates with the YARN master
node (i.e. ResourceManager) to gather job-specific
data of finished jobs like: runtime, name, application
framework, and scheduling profile including num-
ber of containers, cores as well as memory used for
scheduling. Freamon also collects information about
datasets that are processed by jobs. This informa-
tion contains the size of input datasets as well as data
placements, e.g. on which nodes the data blocks are
stored and how often the data is accessed, determined by analyz-
ing the HDFS audit log. All job profiles are stored in
a repository of historical workload data. The job and
data profiles can be accessed by different adaptive re-
source management applications.
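The following sketch merely shows the kind of record such a repository could hold per job execution; the exact schema, field names, and example values are assumptions for illustration and may differ from Freamon's actual data model.

```python
# Illustrative sketch of a job profile record in the workload repository.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class JobProfile:
    name: str                       # job name reported by the resource manager
    framework: str                  # e.g. "Spark" or "Flink"
    runtime_s: float                # measured execution time
    num_containers: int             # scheduling profile
    cores_per_container: int
    memory_per_container_mb: int
    input_size_gb: float            # size of the input dataset
    block_locations: Dict[str, List[str]] = field(default_factory=dict)  # block id -> nodes
    avg_utilization: Dict[str, float] = field(default_factory=dict)      # e.g. {"cpu": 0.62}


# Example entry for a recurring nightly job (values are made up).
profile = JobProfile(
    name="nightly-aggregation", framework="Flink", runtime_s=3120.0,
    num_containers=8, cores_per_container=2, memory_per_container_mb=8192,
    input_size_gb=120.0, avg_utilization={"cpu": 0.35, "disk": 0.7, "network": 0.4},
)
print(profile.name, profile.runtime_s)
```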
Adaptive Resource Management Applications
Adaptive resource management applications use
Freamon’s fine-grained cluster monitoring and work-
load repository for different purposes. For example,
the Allocation Assistant uses job runtimes for auto-
matic resource allocation according to users’ perfor-
mance targets, whereas the two Colocation Assistants
for container as well as data placement aim for im-
proved data locality, local data exchange, resource
utilization, and throughput. Different applications of
adaptive resource management are discussed in Sec-
tion 5 in more detail.
5 APPLICATIONS OF ADAPTIVE
RESOURCE MANAGEMENT
This section summarizes four specific adaptive re-
source management techniques for distributed data
analytics that we implemented. The contribution of
this section is to show the applicability of adaptive re-
source management for distributed data analytics us-
ing the envisioned container-level cluster monitoring
system in more detail.
5.1 Data Colocation Assistant
The performance of data-intensive jobs often de-
pends on the time it takes to read the input data. Fur-
thermore, many jobs are recurring and, for example,
are triggered at a discrete time for daily or nightly
execution or when new data becomes available. For
these recurring data-intensive jobs, it is possible to de-
cide where to store the input data in the distributed
file system and where to place the containers before the job execution
starts. The Data Colocation Assistant (Renner et al.,
2016) uses these characteristics and consists of a data
block and container placement phase. First, it allows
to mark sets of files as related, which, for instance,
often are processed jointly.
Figure 3: Data Colocation Assistant - data and container colocation for data-intensive applications (containers and input data of Job A and Job B are colocated).
These data blocks of the
related files are automatically placed on the same set
or subset of nodes during the data placement phase.
Afterwards during the container placement phase, the
job and its containers are scheduled on these nodes,
where the data was placed before. The main ad-
vantage of CoLoc is a reduction of network trans-
fers due to higher data locality and locally
performed operators like grouping or joining. Thus,
especially data-intensive workloads benefit from this adap-
tive resource management application. Figure 3 illus-
trates the process of the adaptive data and container
colocation application.
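A highly simplified sketch of the two phases is given below; the node selection, the round-robin block assignment, and the single-replica placement are illustrative simplifications and not CoLoc's actual placement policy.

```python
# Illustrative sketch of the two phases described above: place the blocks of
# files marked as related on the same subset of nodes, then request the job's
# containers on exactly those nodes. Not the CoLoc implementation.
import random
from typing import Dict, List


def choose_colocation_nodes(cluster_nodes: List[str], num_nodes: int) -> List[str]:
    # Pick the subset of nodes that will hold all blocks of the related files.
    return random.sample(cluster_nodes, num_nodes)


def place_blocks(files: Dict[str, int], nodes: List[str]) -> Dict[str, List[str]]:
    # Assign every block of every related file round-robin to the chosen nodes.
    placement = {}
    i = 0
    for file_name, num_blocks in files.items():
        for block in range(num_blocks):
            placement[f"{file_name}/blk_{block}"] = [nodes[i % len(nodes)]]
            i += 1
    return placement


cluster = [f"node-{n}" for n in range(10)]
colocation_set = choose_colocation_nodes(cluster, num_nodes=3)
blocks = place_blocks({"clicks.csv": 4, "users.csv": 2}, colocation_set)

# Container placement phase: the job's containers are requested on the same nodes.
container_hosts = colocation_set
print(colocation_set, blocks, container_hosts, sep="\n")
```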
5.2 Resource Allocation Assistant
Figure 4: Resource Allocation Assistant - resource selection based on modeled scale-out behavior of a recurring job (runtime in minutes over the scale-out in number of workers, showing the prediction function, the min/max scale bounds, and the runtime target).
Given a runtime target, the Resource Allocation
Assistant (Thamsen et al., 2016b) automatically se-
lects the number of containers to allocate for a job.
For this, the Resource Allocation Assistant models
the scale-out behavior of a job based on previous runs.
Thus, we do not require dedicated isolated pro-
filing of jobs. Instead, the Resource Allocation As-
sistant retrieves runtime information on the previous
executions of a job from Freamon’s workload repos-
itory and then uses linear regression to find a func-
tion for the job’s scale-out performance. In particu-
lar, we use both a generic parametric model for dis-
tributed computation and nonparametric regression.
The parametric model is used for extrapolation and
when not enough data is available for effective mod-
eling with the nonparametric approach. The nonpara-
metric approach allows interpolating arbitrary scal-
ing behavior with high accuracy when enough sam-
ples are available. The Resource Allocation Assis-
tant automatically selects between both models using
cross-validation. When the scale-out model is deter-
mined, this model is used to allocate resources for the
user’s runtime target. In particular, the Resource Al-
location Assistant uses the lowest scale-out within
the user-defined bounds that satisfies the target constraint.
This process is shown in Figure 4.
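The sketch below illustrates this step under simplifying assumptions: it fits only a simple parametric model of the form t(n) = a/n + b to previous runs (the assistant itself combines a parametric model with nonparametric regression and cross-validation) and then picks the smallest scale-out within the bounds that meets the runtime target.

```python
# Minimal sketch of scale-out modeling and resource selection; the model
# choice and the data are illustrative assumptions.
import numpy as np


def fit_scaleout_model(scaleouts, runtimes):
    # Least-squares fit of t(n) = a / n + b to previous runs.
    X = np.column_stack([1.0 / np.asarray(scaleouts, dtype=float),
                         np.ones(len(scaleouts))])
    (a, b), *_ = np.linalg.lstsq(X, np.asarray(runtimes, dtype=float), rcond=None)
    return lambda n: a / n + b


def choose_scaleout(predict, runtime_target, min_scale, max_scale):
    # Smallest number of workers within the bounds whose predicted runtime
    # meets the target; None if no scale-out in the range satisfies it.
    for n in range(min_scale, max_scale + 1):
        if predict(n) <= runtime_target:
            return n
    return None


# Previous runs of a recurring job: (workers, runtime in minutes), made up.
workers = [2, 4, 8]
minutes = [9.0, 5.0, 3.0]
model = fit_scaleout_model(workers, minutes)
print(choose_scaleout(model, runtime_target=4.0, min_scale=1, max_scale=11))  # 6
```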
5.3 Job Colocation Assistant
Figure 5: Job Colocation Assistant - job scheduling based on resource usage and interference (jobs are selected from the queue to run on the cluster resources based on their CPU and I/O utilization).
The Job Colocation Assistant (Thamsen et al.,
2017) schedules jobs based on their resource usage
and interference with other jobs. In particular, it se-
lects jobs to run on the cluster based on the jobs al-
ready running on the cluster, as shown in Figure 5.
For this, it uses a reinforcement learning algorithm to
learn which jobs utilize the different resources best,
while interfering with each other least. Consequently,
this scheduling scheme can make more informed de-
cisions as time goes by and jobs are executed repeat-
edly on the cluster, yet does not require dedicated iso-
lated profiling of jobs. To measure the goodness of
colocation we take CPU, disk, and network usage as
well as I/O wait into account. We rate high CPU, disk,
and network utilization positively, while we use
I/O wait as an indicator of interference. That is, lower
I/O wait is better. The computed colocation prefer-
ences are then used to select the next job from the
queue of scheduled jobs.
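A minimal sketch of such a colocation signal follows; it assumes a simple weighted score rather than the actual reinforcement learning formulation, and the weight and the example measurements are made up.

```python
# Illustrative colocation "goodness" score: high CPU, disk, and network
# utilization is rewarded, while I/O wait is penalized as an indicator of
# interference. Not the actual scheduler; weights are assumptions.

def colocation_reward(cpu: float, disk: float, network: float, io_wait: float,
                      interference_weight: float = 2.0) -> float:
    """All inputs are average node utilizations in [0, 1] measured while the
    candidate job combination was running together."""
    return cpu + disk + network - interference_weight * io_wait


# A CPU-bound job paired with an I/O-bound job (made-up measurements) ...
print(colocation_reward(cpu=0.8, disk=0.7, network=0.5, io_wait=0.05))  # 1.9
# ... scores higher than two I/O-bound jobs that interfere on disk.
print(colocation_reward(cpu=0.3, disk=0.9, network=0.4, io_wait=0.4))   # 0.8
```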
5.4 Dynamic Scaling Assistant
Figure 6: Dynamic Scaling Assistant - resource allocation adjustment for dataflow jobs at synchronization barriers (the allocation of an iterative task i is scaled at the synchronization barrier before iteration i+1).
The Dynamic Scaling Assistant (Thamsen et al.,
2016a) adjusts resource allocations at runtime to meet
user-defined execution constraints such as resource
utilization targets. For this dynamic scaling, we
use the barriers between dataflow stages. In partic-
ular, there are barriers between the iterations of iter-
ative dataflow programs. Scaling at these barriers al-
lows to adjust the resource allocation without having
to migrate task state. Moreover, scaling can be done
on the basis of runtime statistics that reflect all
elements of the dataflow. That is, at a barrier all
elements of the previous stage have been processed
and, thus, are reflected in, for example, the average
resource utilization of the nodes. This is important
as data characteristics such as key value distributions
can have a significant influence on the performance of
dataflows. This is, for example, the case when the par-
titions are uneven and worker nodes with little work,
therefore, have to wait on worker nodes with lots of
work at barriers.
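As a schematic example of adjusting the allocation at such a barrier, consider the sketch below; the proportional scaling rule, the utilization metric, and the bounds are assumptions for illustration and not the assistant's actual policy.

```python
# Schematic sketch of adjusting the allocation at a synchronization barrier:
# after iteration i completes, the average utilization over the stage is
# compared with a user-defined utilization target and the number of workers
# is scaled accordingly before iteration i+1 starts. Assumed logic only.
import math


def workers_for_next_iteration(current_workers: int, avg_cpu_utilization: float,
                               target_utilization: float,
                               min_workers: int = 1, max_workers: int = 64) -> int:
    # Proportional adjustment at a barrier; no task state has to be migrated
    # because the adjustment happens between dataflow stages.
    desired = math.ceil(current_workers * avg_cpu_utilization / target_utilization)
    return max(min_workers, min(max_workers, desired))


# Iteration i ran on 10 workers at 35 % average CPU with a 70 % target,
# so the next iteration is started with fewer workers.
print(workers_for_next_iteration(10, avg_cpu_utilization=0.35, target_utilization=0.70))  # 5
```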
6 RELATED WORK
This section describes related work in four categories:
distributed dataflow systems, distributed file systems,
resource management systems, as well as work on us-
ing analytical cluster resources adaptively.
6.1 Distributed Dataflow Systems
MapReduce (Dean and Ghemawat, 2004) offers a
programming and an execution model for scalable
data analytics with distributed dataflows on shared
nothing clusters. The programming model is based
on the two higher order functions Map and Reduce,
which are both supplied with UDFs. The execution
model comprises the data-parallel execution of task
instances of these two operations, where each Map
phase is followed by a Reduce phase. In-between both
phases the intermediate results are written to disk and
shuffled: elements with the same key are read by the
same Reduce task instance. Fault tolerance is given
because a distributed file system with replication is
used for the intermediate results.
Systems like Dryad (Isard et al., 2007) and
Nephele (Warneke and Kao, 2009) added the possi-
bility to develop arbitrary directed acyclic task graphs
instead of just combinations of subsequent Map
and Reduce tasks. SCOPE (Chaiken et al., 2008),
Nephele/PACTs (Battré et al., 2010) and Spark (Za-
haria et al., 2010b) then added more operators such
as Joins. Spark also introduced an alternative ap-
proach to fault tolerance: instead of saving interme-
diate results in a fault-tolerant distributed file system,
Spark’s Resilient Distributed Datasets (RDDs) (Za-
haria et al., 2012) store enough lineage information
to be able to re-compute specific partitions in case of
failures. Spark also enables users to explicitly cache
intermediate results for future usage. This allows, for
example, speeding up iterative computations that re-
quire an input repeatedly.
Flink (Carbone et al., 2015; Alexandrov et al.,
2014) and Google’s Dataflow (Akidau et al., 2015)
add many features regarding continuous inputs and
scalable stream processing. Flink in fact uses a stream
processing engine for both batch and stream process-
ing. Furthermore, they add functionality to define
windows for, for example, joins and aggregations.
They also provide mechanisms to deal with elements
that arrive late. Spark also provides a system for
stream processing (Zaharia et al., 2013). However,
Spark uses micro-batches rather than a true streaming en-
gine for this.
6.2 Distributed File Systems
Distributed File Systems focusing on data analytics
typically store a file in a series of blocks and stripe
the data blocks across multiple servers. In addition,
most add redundancy to the stored data by replicat-
ing the data blocks to provide fault tolerance against
node failures. HDFS (Shvachko et al., 2010) is part
of Hadoop and currently the de-facto storage system
for storing large datasets for data analytic tasks. Other
well-known storage systems are Ceph (Weil et al., 2006),
GlusterFS (Davies and Orsaria, 2013), and XtreemFS
(Hupfeld et al., 2008).
Alluxio, formerly known as Tachyon (Li et al.,
2014), is a memory-centric distributed storage system
enabling data sharing at memory speed and supporting
many Hadoop-compatible frameworks such as Spark,
Flink, or MapReduce. It stores data off-heap and can
run on top of different storage systems such as HDFS
as a persistent layer.
6.3 Resource Management Systems
Different resource management systems with a fo-
cus on data analytics like YARN (Vavilapalli
et al., 2013), Mesos (Hindman et al., 2011), or
Omega (Schwarzkopf et al., 2013) have been devel-
oped. In these systems, users can allocate cluster re-
sources by usually specifying the number of contain-
ers and their size in terms of cores and available mem-
ory. After these containers are deployed on the cluster
infrastructure, distributed dataflow systems are exe-
cuted within them. YARN is part of Hadoop and emerged
from the MapReduce framework, and thus mainly fo-
cuses on batch jobs. One significant difference be-
tween YARN and Mesos is that YARN uses a mono-
lithic scheduler, whereas Mesos uses a two-level sched-
uler. Other known cluster resource management sys-
tems include Microsoft's Apollo (Boutin et al., 2014)
and Google's Borg (Verma et al., 2015).
6.4 Adaptive Resource Management
The most closely related architectures are Quasar (Delimitrou
and Kozyrakis, 2014) and Jockey (Ferguson et al.,
2012).
Quasar (Delimitrou and Kozyrakis, 2014), which
is a successor system to Paragon (Delimitrou and
Kozyrakis, 2013), automatically performs resource
allocation and job placement based on previously ob-
served and dedicated sample runs. Quasar also adjusts
allocations at runtime after monitoring. In contrast,
our architecture also encompasses data placement and
selecting the next job from the queue of scheduled
jobs.
Jockey is a resource manager for SCOPE, which
performs initial resource allocation and further adapts
the allocation at runtime towards a user-given runtime
target. For this, Jockey uses detailed knowledge of
the execution model and instrumentation to gain com-
prehensive insights into the jobs. Jockey, however,
does not take care of data placement and also does
not schedule jobs.
Further related systems with regard to automatic
resource allocation include, for example, Elasti-
sizer (Herodotou et al., 2011a), which is part of
Starfish (Herodotou et al., 2011b) and also selects re-
sources for a user’s runtime target.
Related systems with regard to dynamic scaling
and other runtime adjustments include many works
in the area of large-scale stream processing. Works
such as StreamCloud (Gulisano et al., 2012) and QoS-
based Dynamic Scaling for Nephele (Lohrmann et al.,
2015) have investigated adaptive dynamic scaling for
large-scale distributed stream processing. There is
also work on adaptively placing and migrating tasks
at runtime based on execution statistics. An ex-
ample is placing tasks that exchange comparably
large amounts of data onto the same hosts in
Storm (Aniello et al., 2013).
With regard to adaptive data placement, the closest
related works are Scarlett (Ananthanarayanan et al.,
2011) and ERMS (Elastic Replica Management sys-
tem) (Cheng et al., 2012). Both systems mine access
patterns of applications and use this information to
increase or decrease the data replication factor of the files
that are accessed most, in order to improve throughput.
7 CONCLUSION
In this paper, we presented an architecture of a set
of applications that make resource management for
distributed dataflows more adaptive based on clus-
ter monitoring. Our solution monitors job runtimes,
container-based resource utilization, and data access.
Based on this information, we adapt different as-
pects of resource management to the actual workload
and, in particular, repeatedly executed dataflow jobs
such as recurring batch jobs. Specifically, we adapt
the following aspects of resource management: data
placement, resource allocation, job scheduling, and
container placement. The four adaptive resource man-
agement applications we implemented are:
1. First, we place data on as many nodes as used by
a job for optimal data locality.
2. Second, we allocate as many resources for a job as
a user’s runtime target requires after modeling the
scale-out performance of a job using its previous
runs.
3. Third, we schedule jobs to run together that ex-
hibit a high overall resource utilization yet little
interference when executed co-located.
4. Fourth, we adjust resource allocations at run-
time towards user-defined constraints such as uti-
lization targets at barriers between subsequent
dataflow stages.
While we have presented these four different tech-
niques before, this paper presents them as applica-
tions in an overall architecture. In the future, we want
to evaluate how well these techniques work together
and also implement further ideas for adaptive resource
management based on fine-grained cluster monitor-
ing.
ACKNOWLEDGEMENTS
This work has been supported through grants by the
German Science Foundation (DFG) as FOR 1306
Stratosphere and by the German Ministry for Educa-
tion and Research (BMBF) as Berlin Big Data Center
BBDC (funding mark 01IS14013A).
REFERENCES
Agarwal, S., Kandula, S., Bruno, N., Wu, M.-C., Stoica,
I., and Zhou, J. (2012). Reoptimizing data parallel
computing. In Presented as part of the 9th USENIX
Symposium on Networked Systems Design and Imple-
mentation (NSDI 12), pages 281–294.
Akidau, T., Bradshaw, R., Chambers, C., Chernyak, S.,
Fernández-Moctezuma, R. J., Lax, R., McVeety, S.,
Mills, D., Perry, F., Schmidt, E., and Whittle, S.
(2015). The Dataflow Model: A Practical Ap-
proach to Balancing Correctness, Latency, and Cost
in Massive-scale, Unbounded, Out-of-order Data Pro-
cessing. Proc. VLDB Endow., 8(12):1792–1803.
Alexandrov, A., Bergmann, R., Ewen, S., Freytag, J.-C.,
Hueske, F., Heise, A., Kao, O., Leich, M., Leser, U.,
Markl, V., Naumann, F., Peters, M., Rheinländer, A.,
Sax, M. J., Schelter, S., Höger, M., Tzoumas, K., and
Warneke, D. (2014). The Stratosphere Platform for
Big Data Analytics. The VLDB Journal, 23(6):939–
964.
Ananthanarayanan, G., Agarwal, S., Kandula, S., Green-
berg, A., Stoica, I., Harlan, D., and Harris, E. (2011).
Scarlett: coping with skewed content popularity in
mapreduce clusters. In Proceedings of the sixth con-
ference on Computer systems, pages 287–300. ACM.
Aniello, L., Baldoni, R., and Querzoni, L. (2013). Adaptive
Online Scheduling in Storm. In Proceedings of the 7th
ACM International Conference on Distributed Event-
based Systems, DEBS ’13, pages 207–218. ACM.
Battré, D., Ewen, S., Hueske, F., Kao, O., Markl, V., and
Warneke, D. (2010). Nephele/PACTs: A Program-
ming Model and Execution Framework for Web-scale
Analytical Processing. In Proceedings of the 1st ACM
Symposium on Cloud Computing, SoCC ’10, pages
119–130. ACM.
Boutin, E., Ekanayake, J., Lin, W., Shi, B., Zhou, J., Qian,
Z., Wu, M., and Zhou, L. (2014). Apollo: Scalable
and Coordinated Scheduling for Cloud-scale Comput-
ing. In Proceedings of the 11th USENIX Conference
on Operating Systems Design and Implementation,
OSDI’14, pages 285–300. USENIX Association.
Carbone, P., Katsifodimos, A., Ewen, S., Markl, V., Haridi,
S., and Tzoumas, K. (2015). Apache Flink: Stream
and Batch Processing in a Single Engine. IEEE Data
Engineering Bulletin, 38(4):28–38.
Chaiken, R., Jenkins, B., Larson, P.-A., Ramsey, B., Shakib,
D., Weaver, S., and Zhou, J. (2008). SCOPE: Easy and
Efficient Parallel Processing of Massive Data Sets.
Proc. VLDB Endow., 1(2):1265–1276.
Cheng, Z., Luan, Z., Meng, Y., Xu, Y., Qian, D., Roy, A.,
Zhang, N., and Guan, G. (2012). ERMS: An Elastic
Replication Management System for HDFS. In 2012
IEEE International Conference on Cluster Computing
Workshops, pages 32–40. IEEE.
Davies, A. and Orsaria, A. (2013). Scale out with Glus-
terFS. Linux Journal, 2013(235).
Dean, J. and Ghemawat, S. (2004). Mapreduce: Simplified
data processing on large clusters. In Proceedings of
the 6th Conference on Symposium on Operating Sys-
tems Design & Implementation, OSDI’04, pages 10–
10. USENIX Association.
Delimitrou, C. and Kozyrakis, C. (2013). Paragon: QoS-
aware Scheduling for Heterogeneous Datacenters. In
Proceedings of the Eighteenth International Confer-
ence on Architectural Support for Programming Lan-
guages and Operating Systems, ASPLOS ’13, pages
77–88. ACM.
Delimitrou, C. and Kozyrakis, C. (2014). Quasar:
Resource-efficient and QoS-aware Cluster Manage-
ment. In Proceedings of the 19th International Con-
ference on Architectural Support for Programming
Languages and Operating Systems, ASPLOS ’14,
pages 127–144. ACM.
Ferguson, A. D., Bodik, P., Kandula, S., Boutin, E., and
Fonseca, R. (2012). Jockey: Guaranteed Job La-
tency in Data Parallel Clusters. In Proceedings of the
7th ACM European Conference on Computer Systems,
EuroSys ’12, pages 99–112. ACM.
Gulisano, V., Jimenez-Peris, R., Patino-Martinez, M., Sori-
ente, C., and Valduriez, P. (2012). StreamCloud: An
Elastic and Scalable Data Streaming System. IEEE
Trans. Parallel Distrib. Syst., 23(12):2351–2365.
Herodotou, H., Dong, F., and Babu, S. (2011a). No One
(Cluster) Size Fits All: Automatic Cluster Sizing for
Data-intensive Analytics. In Proceedings of the 2Nd
ACM Symposium on Cloud Computing, SOCC ’11,
pages 18:1–18:14. ACM.
Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L.,
Cetin, F. B., and Babu, S. (2011b). Starfish: A Self-
tuning System for Big Data Analytics. In Proceedings
of the the 5th Conference on Innovative Data Systems
Research, CIDR ’11. CIDR 2011.
Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A.,
Joseph, A. D., Katz, R., Shenker, S., and Stoica, I.
(2011). Mesos: A Platform for Fine-grained Resource
Sharing in the Data Center. In Proceedings of the
8th USENIX Conference on Networked Systems De-
sign and Implementation, NSDI’11, pages 295–308.
USENIX Association.
Hupfeld, F., Cortes, T., Kolbeck, B., Stender, J., Focht,
E., Hess, M., Malo, J., Marti, J., and Cesario, E.
(2008). The XtreemFS architecture - a case for object-
based file systems in grids. Concurrency and compu-
tation: Practice and experience, 20(17):2049–2060.
Isard, M., Budiu, M., Yu, Y., Birrell, A., and Fetterly, D.
(2007). Dryad: Distributed Data-parallel Programs
from Sequential Building Blocks. In Proceedings of
the 2Nd ACM SIGOPS/EuroSys European Conference
on Computer Systems 2007, EuroSys ’07, pages 59–
72. ACM.
Jia, Z., Zhan, J., Wang, L., Han, R., McKee, S. A., Yang,
Q., Luo, C., and Li, J. (2014). Characterizing and sub-
setting big data workloads. In Workload Characteri-
zation (IISWC), 2014 IEEE International Symposium
on, pages 191–201. IEEE.
Li, H., Ghodsi, A., Zaharia, M., Shenker, S., and Stoica, I.
(2014). Tachyon: Reliable, memory speed storage for
cluster computing frameworks. In Proceedings of the
ACM Symposium on Cloud Computing, pages 1–15.
ACM.
Lohrmann, B., Janacik, P., and Kao, O. (2015). Elastic
Stream Processing with Latency Guarantees. In Pro-
ceedings of the 35th IEEE International Conference
on Distributed Computing Systems, ICDCS’15, pages
399–410.
Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkatara-
man, S., Liu, D., Freeman, J., Tsai, D., Amde, M.,
Owen, S., et al. (2016). Mllib: Machine learning in
apache spark. Journal of Machine Learning Research,
17(34):1–7.
Renner, T., Thamsen, L., and Kao, O. (2016). Coloc:
Distributed data and container colocation for data-
intensive applications. In Big Data (Big Data), 2016
IEEE International Conference on, pages 1–6. IEEE.
Schwarzkopf, M., Konwinski, A., Abd-El-Malek, M., and
Wilkes, J. (2013). Omega: Flexible, Scalable Sched-
ulers for Large Compute Clusters. In Proceedings of
the 8th ACM European Conference on Computer Sys-
tems, EuroSys ’13, pages 351–364. ACM.
Shvachko, K., Kuang, H., Radia, S., and Chansler, R.
(2010). The hadoop distributed file system. In
Mass Storage Systems and Technologies (MSST),
2010 IEEE 26th Symposium on, pages 1–10.
Thamsen, L., Rabier, B., Schmidt, F., Renner, T., and
Kao, O. (2017). Scheduling Recurring Distributed
Dataflow Jobs Based on Resource Utilization and In-
terference. In 2017 IEEE International Congress on
Big Data (BigData Congress), page to appear. IEEE.
Thamsen, L., Renner, T., and Kao, O. (2016a). Contin-
uously improving the resource utilization of iterative
parallel dataflows. In Distributed Computing Systems
Workshops (ICDCSW), 2016 IEEE 36th International
Conference on, pages 1–6. IEEE.
Thamsen, L., Verbitskiy, I., Schmidt, F., Renner, T., and
Kao, O. (2016b). Selecting resources for distributed
dataflow systems according to runtime targets. In In-
ternational Performance Computing and Communica-
tions Conference (IPCCC), 2016 IEEE 35th Interna-
tional Conference on, pages 1–6. IEEE.
Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S.,
Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H.,
Seth, S., Saha, B., Curino, C., O’Malley, O., Radia,
S., Reed, B., and Baldeschwieler, E. (2013). Apache
Hadoop YARN: Yet Another Resource Negotiator. In
Proceedings of the 4th Annual Symposium on Cloud
Computing, SOCC ’13, pages 5:1–5:16. ACM.
Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D.,
Tune, E., and Wilkes, J. (2015). Large-scale Cluster
Management at Google with Borg. In Proceedings of
the Tenth European Conference on Computer Systems,
EuroSys ’15, pages 18:1–18:17. ACM.
Warneke, D. and Kao, O. (2009). Nephele: Efficient Par-
allel Data Processing in the Cloud. In Proceedings
of the 2Nd Workshop on Many-Task Computing on
Grids and Supercomputers, MTAGS ’09, pages 8:1–
8:10. ACM.
Weil, S. A., Brandt, S. A., Miller, E. L., Long, D. D. E.,
and Maltzahn, C. (2006). Ceph: A Scalable, High-
performance Distributed File System. In Proceed-
ings of the 7th Symposium on Operating Systems De-
sign and Implementation, OSDI ’06, pages 307–320.
USENIX.
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K.,
Shenker, S., and Stoica, I. (2010a). Delay scheduling:
a simple technique for achieving locality and fairness
in cluster scheduling. In Proceedings of the 5th Eu-
ropean conference on Computer systems, pages 265–
278. ACM.
Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J.,
McCauley, M., Franklin, M. J., Shenker, S., and Sto-
ica, I. (2012). Resilient Distributed Datasets: A Fault-
tolerant Abstraction for In-memory Cluster Comput-
ing. In Proceedings of the 9th USENIX Conference
on Networked Systems Design and Implementation,
NSDI’12, pages 2–2. USENIX Association.
Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S.,
and Stoica, I. (2010b). Spark: Cluster Computing with
Working Sets. In Proceedings of the 2Nd USENIX
Conference on Hot Topics in Cloud Computing, Hot-
Cloud’10, pages 10–10. USENIX Association.
Zaharia, M., Das, T., Li, H., Hunter, T., Shenker, S., and
Stoica, I. (2013). Discretized Streams: Fault-tolerant
Streaming Computation at Scale. In Proceedings
of the Twenty-Fourth ACM Symposium on Operating
Systems Principles, SOSP ’13, pages 423–438. ACM.