DDHCS: Distributed Denial-of-service Threat to YARN Clusters based on Health Check Service

Wenting Li, Qingni Shen, Chuntao Dong, Yahui Yang and Zhonghai Wu

School of Software and Microelectronics & MoE Key Lab of Network and Software Assurance,

Peking University, Beijing, China

Keywords: DDoS, Hadoop, YARN, Attack Broadness, Attack Strength, Security.

Abstract: Distributed denial-of-service (DDoS) attacks continue to grow as a threat to organizations worldwide. Such attacks consume the resources of the target machine and prevent legitimate users from accessing them. This paper studies the vulnerabilities of the Health Check Service in Hadoop/YARN and the threat of denial-of-service to a multi-tenant YARN cluster. We use theoretical analysis and numerical simulations to demonstrate the effectiveness of this DDoS attack based on health check service (DDHCS). Our experiments show that DDHCS can significantly affect the performance of a YARN cluster, with high attack broadness (85.6% on average), high attack strength (more than 80%) and marked degradation in resource utilization. In addition, we propose several schemes to prevent DDHCS attacks by improving YARN security.

1 INTRODUCTION

Hadoop is open source software built for scalability and reliability. It can be used to process vast amounts of data in parallel on large clusters. Apache Hadoop has since matured into a data platform that, with the advent of YARN, goes beyond processing huge amounts of data in batch. It now supports many diverse workloads, such as interactive queries over large data with Hive on Tez, real-time data processing with Apache Storm, in-memory data processing with Spark, and more.

For Hadoop’s initial purpose, it was always

assumed that clusters would consist of cooperating,

trusted machines used by trusted users in a trusted

environment. Initially, there was no security model –

Hadoop didn’t authenticate users or services, and

there was no data privacy (O’Malley et al., 2009);

(Kholidy and Baiardi, 2012). Moving Hadoop to a public cloud challenges these original security mechanisms. However, research on MapReduce and Hadoop has mainly focused on system performance, and security issues seemingly have not received sufficient attention.

A distributed denial-of-service (DDoS) attack is one whose source is more than one (and often thousands of) unique IP addresses. It is an attempt to make a machine or network resource unavailable to its intended users, for example by temporarily or indefinitely interrupting or suspending the services of a host connected to the Internet. The first DDoS attack incident (Criscuolo, 2000) was reported in 1999 by the Computer Incident Advisory Capability (CIAC). Since then, DDoS attacks have continued to grow in frequency, sophistication and bandwidth (Hameed and Ali, 2015); (Kholidy et al., 2015).

Previous work has demonstrated the threat and stealthiness of DDoS attacks in cloud environments (Sabahi, 2011); (Durcekova et al., 2012); (Ficco and Rak, 2015). As a solution, (Alarifi and Wolthusen, 2014); (Karthik and Shah, 2014); (Mizukoshi and Munetomo, 2015) have demonstrated how to mitigate DDoS attacks with cloud techniques. There have also been numerous suggestions on how to detect DDoS attacks, for example using MapReduce for DDoS forensics (Khattak et al., 2011) or a hybrid statistical model (Girma et al., 2015); (Lee, 2011). Compared with general cloud environments, DDoS attacks in Big Data environments based on Hadoop/YARN are more aggressive and destructive, but they seem to have received little research attention.

One problem with the Hadoop/YARN system is that, because tasks are assigned to many nodes, a malicious user who submits an attack program can affect the entire cluster. In this paper, we study the vulnerabilities of the Health Check Service in YARN. These vulnerabilities motivate a new type of DDoS attack, which we call DDoS attack based on health check service (DDHCS). Our work exposes the health check service in YARN as a possible vulnerability to adversarial attacks, and hence opens a new avenue for improving the security of YARN.

Li, W., Shen, Q., Dong, C., Yang, Y. and Wu, Z.
DDHCS: Distributed Denial-of-service Threat to YARN Clusters based on Health Check Service.
DOI: 10.5220/0005741801460156
In Proceedings of the 2nd International Conference on Information Systems Security and Privacy (ICISSP 2016), pages 146-156
ISBN: 978-989-758-167-0

In summary, this paper makes the following contributions.

• We present three vulnerabilities of the Health Check Service in YARN: i) the ResourceManager (RM) lacks job validation; ii) it is easy for a user to make a job fail, which drives the node into the unhealthy state; iii) the RM adds unhealthy nodes to the exclude list, which reduces the number of service nodes in the cluster.

• We design a DDHCS attack model and use theoretical analysis and numerical simulations to demonstrate the effectiveness of this attack in different scenarios. Moreover, we empirically show that DDHCS can significantly affect the performance of a YARN cluster, with high attack broadness (85.6% on average), high attack strength (more than 80%) and marked degradation in resource utilization.

• We propose three countermeasures against DDHCS: a user blacklist mechanism, parameter checking and map-tracing.

The rest of this paper is organized as follows.

Section 2 discusses the background. Section 3

describes the vulnerabilities we found in YARN.

Section 4 presents the DDHCS attack model. Section 5 demonstrates the implementation of our attack model and evaluates the attack effect using MapReduce jobs. Section 6 contains our suggestions to strengthen the security of YARN. Section 7 concludes the paper and discusses some future work.

2 BACKGROUND

Health Check Service is a YARN service-level health test that checks the health of the node it is executing on. The ResourceManager (RM) uses the health check service to manage NodeManagers (NM). If any health check fails, the NM marks the node as unhealthy and communicates this to the RM, which then stops assigning containers (the resource representation) to the node. Before introducing the health check service, we describe the RM components, the NM states and their triggering conditions.

2.1 YARN-ResourceManager

Hadoop has evolved into a new generation, Hadoop 2, in which the classic MapReduce module is upgraded into a new computing platform called YARN (or MRv2) (Vavilapalli et al., 2013).

YARN uses the RM to replace the classic JobTracker, and uses the ApplicationMaster (AM) to replace the classic TaskTracker (Lee, 2011). The RM runs as a daemon on a dedicated machine, and acts as the central authority arbitrating resources among the various competing applications in the cluster. The AM is the “head” of a job, managing all life-cycle aspects including dynamically increasing and decreasing resource consumption, managing the flow of execution, handling faults and computation skew, and performing other local optimizations.

The NM is YARN’s per-node agent, and takes care of the individual compute nodes in a Hadoop cluster. This includes keeping up to date with the RM, overseeing containers’ life-cycle management (Huang et al., 2014), monitoring the resource usage of individual containers, tracking node health, log management and auxiliary services which may be exploited by different YARN applications.

There are three components connecting RM to

NM, which co-manage the life-cycle of NM, as

shown in Figure 1. They are NMLivelinessMonitor,

NodesListManager and ResourceTrackerService.

We discuss the three services as follows.

Figure 1: ResourceManager architecture.

1) NMLivelinessMonitor: This component keeps track of each NM’s last heartbeat time. Any DataNode that does not send a heartbeat within a configured interval of time, by default 10 minutes, is deemed dead and is expired by the RM. All the containers currently running on an expired DataNode are marked as dead and no new containers are scheduled on it.

2) NodesListManager: This component manages a

collection of included and excluded DataNodes.

It is responsible for reading the host

configuration files to seed the initial list of

DataNodes. The files are specified as

“yarn.resourcemanager.nodes.include-path” and

“yarn.resourcemanager.nodes.exclude-path”. It

also keeps track of DataNodes that are

decommissioned as time progresses.

3) ResourceTrackerService: This component responds to RPCs from all the DataNodes. It is responsible for registering new DataNodes, rejecting requests from any invalid/decommissioned DataNodes, and obtaining node heartbeats and forwarding them to the YARN scheduler.

2.2 Node States

In YARN, an object is abstracted as a state machine

when it is composed of several states and events

triggering transfer of these states. There are four

types of state machines inside RM—RMApp,

RMAppAttempt, RMContainer and RMNode. We

focus on RMNode state machine.

RMNode state machine is the data structure used

to maintain a node lifecycle in the RM, and its

implementation is RMNodeImpl class. The class

maintains a node state machine, and records the

possible node states and events that may lead to the

state transform (Huseyin et al., 2015).

As shown in Figure 2 and Table 1, each node has six basic states (NodeState) and eight kinds of events that trigger transitions between them (RMNodeEventType). The role of RMNodeImpl is to wait for events of type RMNodeEventType from other objects, move the current state to another state, and trigger the associated behavior at the same time. In the following sections, we focus on the unhealthy state and the decommissioned state:

Figure 2: Node state machine.

UNHEALTHY: The administrator configures a health monitoring script on each NM, and the NM has a dedicated thread that executes the script periodically to determine whether the NM is in a healthy state. The NM communicates the “unhealthy” state to the RM via heartbeats. After that, the RM does not assign new tasks to the node until it returns to the healthy state.

DECOMMISSIONED: If a node is added to the exclude list, the corresponding NM is set to the decommissioned state, and the NM is then no longer able to communicate with the RM.
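The node life-cycle described in this section can be pictured as a small transition table. The following is a minimal sketch under stated assumptions, not the actual RMNodeImpl code: the class name NodeStateSketch, the way a STATUS_UPDATE event carries a health flag, and the restriction to five of the eight events are all choices made here for illustration.

```java
// Hypothetical, simplified model of the RMNode state machine described in the text.
// It is not the real RMNodeImpl; only the main transitions are modelled.
class NodeStateSketch {
    enum State { NEW, RUNNING, UNHEALTHY, DECOMMISSIONED, LOST, REBOOTED }
    enum Event { STARTED, STATUS_UPDATE, DECOMMISSION, EXPIRE, REBOOTING }

    private State state = State.NEW;

    State getState() {
        return state;
    }

    // The healthy flag is only consulted for STATUS_UPDATE events (an assumption of this sketch).
    State handle(Event event, boolean healthy) {
        switch (event) {
            case STARTED:
                if (state == State.NEW) state = State.RUNNING;
                break;
            case STATUS_UPDATE:
                if (state == State.RUNNING && !healthy) state = State.UNHEALTHY;
                else if (state == State.UNHEALTHY && healthy) state = State.RUNNING;
                break;
            case DECOMMISSION:
                if (isActive()) state = State.DECOMMISSIONED;
                break;
            case EXPIRE:
                if (isActive()) state = State.LOST;
                break;
            case REBOOTING:
                if (isActive()) state = State.REBOOTED;
                break;
        }
        return state;
    }

    // A node can only leave the cluster from the RUNNING or UNHEALTHY state.
    private boolean isActive() {
        return state == State.RUNNING || state == State.UNHEALTHY;
    }

    public static void main(String[] args) {
        NodeStateSketch node = new NodeStateSketch();
        node.handle(Event.STARTED, true);
        node.handle(Event.STATUS_UPDATE, false); // failed health report
        System.out.println(node.getState());
        node.handle(Event.STATUS_UPDATE, true);  // node recovers
        System.out.println(node.getState());
    }
}
```

In this sketch a node that reports a failed health check moves from RUNNING to UNHEALTHY and returns to RUNNING once it reports healthy again, while DECOMMISSIONED, LOST and REBOOTED are treated as terminal.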

2.3 Health Check Service

The NM runs health check service to determine the

health of the node it is executing on, in intervals of

10 minutes. If any health check fails, the NM marks

the node as unhealthy and communicates this to the

RM, which then stops assigning containers to the

node. Communication of the node status is done as

part of the heartbeat between the NM and the RM.

This service determines the health status of a node through two strategies. The first is the Health Script: administrators may specify their own health check script, which is invoked by the health check service. If the script exits with a non-zero exit code, times out or results in an exception being thrown, the node is marked as unhealthy. The second is the Disk Checker, which checks the state of the disks that the NM is configured to use. The

checks include permissions and free disk space. It

also checks that the file system isn’t in a read-only

state. If a disk fails the check, the NM stops using

that particular disk but still reports the node status as

healthy. However, if a number of disks fail the check

(25% by default), then the node is reported as

unhealthy to the RM and new containers will not be

assigned to the node.
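The two strategies above amount to a simple decision rule, which can be sketched as follows. This is an illustration only, not NodeManager code: the class and method names are hypothetical, and the sketch reads the 25% default literally as "the node is unhealthy once at least this fraction of disks has failed", which is an assumption about how the threshold is compared.

```java
// Hypothetical sketch of the health decision rules described in the text.
// This is not NodeManager code; names and the threshold comparison are assumptions.
class HealthCheckSketch {
    // Fraction of failed disks at which the node is reported unhealthy (25% in the text).
    static final double FAILED_DISK_FRACTION = 0.25;

    // Health Script rule: a non-zero exit code, a timeout or an exception marks the node unhealthy.
    static boolean scriptHealthy(int exitCode, boolean timedOut, boolean threwException) {
        return exitCode == 0 && !timedOut && !threwException;
    }

    // Disk Checker rule: single disk failures are tolerated until the threshold is reached.
    static boolean disksHealthy(int failedDisks, int totalDisks) {
        if (totalDisks <= 0) {
            return false; // no usable disks configured
        }
        return (double) failedDisks / totalDisks < FAILED_DISK_FRACTION;
    }

    // The node is healthy only if both checks pass.
    static boolean nodeHealthy(int exitCode, boolean timedOut, boolean threwException,
                               int failedDisks, int totalDisks) {
        return scriptHealthy(exitCode, timedOut, threwException)
                && disksHealthy(failedDisks, totalDisks);
    }

    public static void main(String[] args) {
        System.out.println(nodeHealthy(0, false, false, 0, 4)); // script ok, all disks ok
        System.out.println(nodeHealthy(1, false, false, 0, 4)); // script failed
        System.out.println(nodeHealthy(0, false, false, 1, 8)); // one of eight disks failed
    }
}
```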

Table 1: Basic states and basic events of node.

| State | Description | Trigger event |
|---|---|---|
| NEW | The initial state of the state machine | |
| RUNNING | The NM registers with the RM | STARTED |
| DECOMMISSIONED | A DataNode is added to the exclude list | DECOMMISSION |
| UNHEALTHY | The Health Check Service determines that the NM is unhealthy | STATUS_UPDATE |
| LOST | The NM does not heartbeat within 10 minutes and is deemed dead | EXPIRE |
| REBOOTED | The RM finds that the NM’s heartbeat ID does not agree with its own record and requires the NM to restart | REBOOTING |

We focus on the Health Script. Note that if the script cannot be executed, for example because of permissions or an incorrect path, this counts as a failure and the node is reported as unhealthy. The NM communicates this “unhealthy” state to the RM, which then adds the node to the exclude list. The NM keeps running the Health Script; once the state returns to “healthy”, the RM removes the node from the exclude list and resumes assigning containers to it. The administrator can modify the relevant configuration parameters in yarn-site.xml.

3 VULNERABILITY ANALYSIS

3.1 Lack of Job Validation

The fundamental idea of MRv2 is to split up the two

major functionalities of the JobTracker into separate

daemons. The idea is to have a global RM and per-

application AM. An application is a single job in the

classical sense of Map-Reduce jobs.

Jobs are submitted to the RM via a public

submission protocol and go through an admission

control phase during which security credentials are

validated and various operational and administrative

checks are performed.

Figure 3: YARN rejects a MapReduce job.

RMApp is the data structure used to maintain a job life-cycle in the RM, and its implementation is the RMAppImpl class. RMAppImpl holds the basic information about the job (i.e. job ID, job name, queue name, start time) and the instance attempts. We found that only the following situations lead to an APP_REJECTED event (an event of the RMApp state machine), as shown in Figure 3:

1) The client submits a job to the RM via the RPC function ApplicationClientProtocol#submitApplication, which may throw an exception; this happens when the ResourceRequest falls below the minimum or exceeds the maximum of the resources;

2) The scheduler discovers that the job is illegal (e.g. the user submits to a nonexistent queue, or the queue has reached its upper limit of jobs) and refuses to accept it.

The RM validates resource access permissions, but performs no validation of whether the job can finish. The only event that causes a job to enter the FINISHED state is the normal exit of the AM container. We can therefore submit a job that is bound to fail: the RM allocates resources for it and it runs on the corresponding NMs, but the RM never checks whether the job can complete successfully.

3.2 Easy to Make a Job Fail

MapReduce enforces a strict structure: the computation task is split into map and reduce

operations. Each instance of a map or reduce, called

a computation unit, takes a list of key-value tuples.

A MapReduce task consists of sequential phases of

map and reduce operations. Once the map step is

finished, the intermediate tuples are grouped by their

key-components. This process of grouping is known

as shuffling. All tuples belonging to one group are

processed by a reduce instance which expects to

receive tuples sorted by their key-component (Wu et

al., 2013). Outputs of the reduce step can be used as

inputs for the map step in the next phase, creating a

chained MapReduce task.

Each Map/Reduce Task is just a concrete description of a computing task; the real work is done by a TaskAttempt. The MRAppMaster executes each Mapper/Reducer task as a child process in a separate JVM, and it can start multiple instances in sequence. If the first running instance fails, it starts another instance to recompute, until the data processing is completed or the number of attempts reaches the upper limit. By default, the maximum number of attempts is 4. Users can configure this limit in the job via mapreduce.map.maxattempts and mapreduce.reduce.maxattempts. The MRAppMaster may also start multiple instances simultaneously to complete the data processing. In the MRAppMaster, the life-cycles of the TaskAttempt, Task and Job are each described by a finite state machine, as shown in Figure 4, where the TaskAttempt performs the actual computation and the other two components are only responsible for monitoring and management.

To the best of our knowledge, in some cases a task never completes successfully even after multiple attempts, and it is easy for a job to fail, for instance because of hardware failure, software bugs, process crashes or OOM (Out Of Memory). If there is no response from an NM within a certain amount of time, the MRAppMaster marks the task as failed. We summarize the five conditions that result in task failure as follows:

1) Map Task or Reduce Task fails. The failure is caused by the MapReduce program itself, for example by errors in the user code.

2) Time out. The task may exceed its time limit because network delay slows the reading of data, or because the task itself takes longer than expected. In this case, long-running tasks take up system resources and reduce the performance of the cluster over time.

3) The bottleneck of reading files. If a job runs a very large number of tasks, the common input file may become a bottleneck.

4) Shuffle error. If the map tasks complete quickly and all the data is ready to be copied for the shuffle, the threads and buffer memory used by the shuffle process become overloaded, which causes a shortage of memory.

5) The child process JVM quits suddenly. This may be caused by a JVM bug, which makes the MapReduce code fail.

We can easily make a job fail using one of these conditions. For instance, we can write a program with an infinite loop, or specify a timeout of 10 seconds and submit a long-running job that needs at least 2 minutes.

Figure 4: The job/task state transition.

3.3 Weak Exclude List Mechanism

As discussed in Section 2.3, the NM runs the health check service to determine the health of the node it is executing on. If tasks fail more than 3 times on a node, the node is regarded as unhealthy. When a DataNode is in the unhealthy state, all the containers currently running on it are marked as dead and no new containers are scheduled on it. The default number of failures is set explicitly in the RMContainerRequestor class as follows:

maxTaskFailuresPerNode = conf.getInt(MRJobConfig.MAX_TASK_FAILURES_PER_TRACKER, 3);

NodesListManager maintains the exclude list, a file that resides on the RM and contains the IP addresses of the DataNodes to be excluded. When an NM reports its unhealthy state to the RM via heartbeat, the RM does not check why or how the node became unhealthy, but adds it to the exclude list directly.

Before doing so, the RM calculates the proportion of nodes already in the exclude list, using a parameter obtained from the MRJobConfig interface. If the number of nodes in the exclude list is below a certain percentage of the cluster (33% by default), the RM adds the node to the exclude list; otherwise the unhealthy node is not added.
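The interaction between the per-node failure threshold and the exclude-list cap can be sketched as follows. This is a minimal illustration, not YARN code: the class ExcludeListSketch and its methods are hypothetical, and the sketch assumes that the 33% cap is checked against the size the list would have after the insertion, a reading chosen because it reproduces the limit of 6 excluded nodes out of 20 reported in Table 2.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the exclude-list behaviour described in the text; not YARN code.
class ExcludeListSketch {
    static final int MAX_TASK_FAILURES_PER_NODE = 3; // default quoted in the text
    static final int MAX_EXCLUDED_PERCENT = 33;      // default quoted in the text

    private final int totalNodes;
    private final Map<String, Integer> failures = new HashMap<>();
    private final Set<String> unhealthy = new LinkedHashSet<>();
    private final Set<String> excluded = new LinkedHashSet<>();

    ExcludeListSketch(int totalNodes) {
        this.totalNodes = totalNodes;
    }

    // Record one failed task attempt on the given node.
    void recordTaskFailure(String node) {
        int count = failures.merge(node, 1, Integer::sum);
        // "More than 3" failures marks the node unhealthy; unhealthy.add is true only the first time.
        if (count > MAX_TASK_FAILURES_PER_NODE && unhealthy.add(node)) {
            // Assumption: the cap applies to the list size after insertion.
            if ((excluded.size() + 1) * 100 <= MAX_EXCLUDED_PERCENT * totalNodes) {
                excluded.add(node);
            }
        }
    }

    int unhealthyCount() {
        return unhealthy.size();
    }

    int excludedCount() {
        return excluded.size();
    }

    public static void main(String[] args) {
        ExcludeListSketch rm = new ExcludeListSketch(20);
        // Failing jobs hit 18 of the 20 nodes, four failed attempts each.
        for (int n = 0; n < 18; n++) {
            for (int f = 0; f < 4; f++) {
                rm.recordTaskFailure("node" + n);
            }
        }
        System.out.println("unhealthy=" + rm.unhealthyCount() + ", excluded=" + rm.excludedCount());
        // prints "unhealthy=18, excluded=6"
    }
}
```

Note that nothing in this flow asks why the tasks failed, which is the weakness the section describes: any user who can make tasks fail repeatedly can drive nodes into the unhealthy set.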

Finally, the failure handling of the containers

themselves is completely left to the framework. The

RM collects all container exit events from the NMs

and propagates those to the corresponding AMs in a

heartbeat response. AM already listens to these

notifications and retries map or reduce tasks by

requesting new containers from the RM.

4 DDHCS THREAT MODELS

The adversary is a malicious insider in the cloud who aims to subvert the availability of the cluster. As discussed in Section 3, we discovered three vulnerabilities in the YARN platform. By submitting jobs that fail easily, an attacker can exploit the health check service to have DataNodes added to the exclude list, which causes service degradation and reduces the number of active DataNodes.

Consider the scenario in Figure 5, in which both normal users and malicious users can submit jobs to the YARN cluster. The jobs submitted by normal users can run to completion, while the jobs submitted by malicious users are failing jobs that never complete. We use the running process of an application to analyze the attack process. The steps are detailed as follows:

1) Distributed attackers and normal users submit

applications to the RM via a public submission

protocol and go through an admission control

phase during which security credentials are

validated and various operational and

administrative checks are performed.

2) Accepted applications are passed to the scheduler

to run. Once the scheduler has enough resources,

the application is moved from accepted to

running state. Aside from internal bookkeeping,

this involves allocating a container for the AM

and spawning it on a node in the cluster.

Figure 5: DDHCS: DDoS attack based on health check

service.

3) When RM starts the AM, it should register with

the RM and periodically advertise its liveness

and requirements via heartbeat. To obtain

containers, AM issues resource requests to the

RM.

4) Once the RM allocates a container, AM can

construct a container launch context (CLC) to

launch the container on the corresponding NM.

Monitoring the progress of work done inside the

container is strictly the AM’s responsibility.

5) To launch the container, the NM copies all the

necessary dependencies to local storage. Map

tasks process each block of input (typically

128MB) and produce intermediate results, which

are key-value pairs. These are saved to disk.

Reduce tasks fetch the list of intermediate results

associated with each key and run it through the

user’s reduce function, which produces output.

6) If the task fails to complete, it is retried a number of times, say 3 times. If all tries fail, the task is treated as a failure, and the AM contacts the RM to set up another container (possibly on another node) for this task, until the task is completed or the MapReduce job is terminated.

7) For each DataNode that executes the failed task, the health check service adds one to its total number of failures. If the DataNode has failed more than 3 times, the node is marked as unhealthy. The NM reports this unhealthy state to the RM, which then adds the node to the exclude list.

8) Once the AM is done with its work, it should

unregister from the RM and exit cleanly.

Attackers repeat the procedure until the exclude list holds 33% of the nodes in the cluster, aiming to reduce service availability and performance by exhausting the resources of the cluster (including memory, processing resources, and network bandwidth).

5 EVALUATION

5.1 Experiment Setup

We set up a Hadoop cluster with 20 nodes. Each node has an Intel Core i7 processor running at 3.4 GHz and 4096 MB of RAM, and runs a DataNode and a NodeManager on Hadoop 2.6.0, a distributed, scalable, and portable system. All experiments use the default Hadoop configuration for HDFS and MapReduce unless otherwise noted (e.g., the HDFS block size is 128MB and the maximum Java heap size is 2GB).

A. Attack Programs

Attack Setting. We consider a setting in which attackers and normal users concurrently use the same YARN platform. It is well known that YARN in public clouds makes extensive use of multi-tenancy. We design three attack programs as follows:

WordCount_A: We use the WordCount benchmark in Hadoop as our main intrusion program because it is widely used and represents many kinds of data-intensive jobs. We specify the timeout parameter as 10 milliseconds and name this variant WordCount_A. Since the input file is the full English Wikipedia archive, with a total data size of 31GB, the program cannot finish within the time limit.

BeerAndDiaper: We write an infinite loop in this program and specify the timeout parameter as 10 milliseconds, so it fails to complete within the time limit.

WordCount_N: We use an unmodified, executable program, but as a normal user we modify the client configuration file mapred-site.xml, changing the value of mapreduce.task.timeout from 1000 (ms) to 10 (ms). We use the “hadoop dfsadmin -refreshNodes” command to reload the configuration file. We then submit the executable WordCount program (named WordCount_N) with a large input file; since it cannot finish in 10 milliseconds, it is marked as failed.

B. Evaluation Index

First, we introduce the variables to be used. $N$ denotes the total number of living nodes that the Hadoop cluster currently has; $m$ denotes the number of unhealthy nodes after a DDHCS attack. For simplicity, we assume that all of the nodes in the cluster are identical. $T_{start}$ denotes the start time of a job and $T_{finish}$ its end time, so the total completion time under normal circumstances is $T = T_{finish} - T_{start}$. We repeat each job $n = 20$ times, recording the start time and finish time of each run, and obtain the average time under normal circumstances as:

$$\bar{T} = \frac{1}{n} \sum_{i=1}^{n} T_i \tag{1}$$

Similarly, we calculate the average time under DDoS attack as:

$$\bar{T}' = \frac{1}{n} \sum_{i=1}^{n} T'_i \tag{2}$$

where $T'$ denotes the total completion time under DDoS attack, calculated by

$$T' = T'_{finish} - T'_{start} \tag{3}$$

We characterize the scale of the DDHCS attacks in three dimensions: (i) attack broadness, defined as $b = m/N$; (ii) attack strength, denoted $s$, which is the portion of resources occupied by the DDHCS attack on an infected node. For example, given attack broadness $b = 83.2\%$ and attack strength $s = 80\%$, a task takes $1/(1 - s)$ (here 5) times as long as usual to complete, with probability $b$. From this relation, $\bar{T}' = \bar{T}/(1 - s)$, the attack strength follows as:

$$s = \frac{\bar{T}' - \bar{T}}{\bar{T}'} \tag{4}$$

(iii) resource degradation: we compare the CPU occupancy, memory occupancy and network bandwidth usage with and without DDHCS attacks, which can be read from the job logs.
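As a worked illustration of these indices, the sketch below computes attack broadness and attack strength from measured values. The class AttackMetrics and its method names are hypothetical, and the sketch reads Equation (4) as s = (average time under attack − average normal time) / average time under attack, an interpretation that agrees with the values reported in Table 3 for most entries.

```java
// Hypothetical helper for the evaluation indices in this section; names are illustrative.
class AttackMetrics {
    // Attack broadness b = m / N.
    static double broadness(int unhealthyNodes, int totalNodes) {
        return (double) unhealthyNodes / totalNodes;
    }

    // Average completion time over n runs, as in Equations (1) and (2).
    static double average(double[] completionTimes) {
        double sum = 0.0;
        for (double t : completionTimes) {
            sum += t;
        }
        return sum / completionTimes.length;
    }

    // Attack strength s = (avgAttack - avgNormal) / avgAttack, our reading of Equation (4).
    static double strength(double avgNormal, double avgAttack) {
        return (avgAttack - avgNormal) / avgAttack;
    }

    // Slowdown factor 1 / (1 - s) mentioned in the text.
    static double slowdown(double s) {
        return 1.0 / (1.0 - s);
    }

    public static void main(String[] args) {
        // 18 unhealthy nodes out of 20, as in the first row of Table 2.
        System.out.println("b = " + broadness(18, 20));
        // Grep under WordCount_A in Table 3: 112.4 s normally, 726.7 s under attack.
        System.out.println("s = " + strength(112.4, 726.7));
        System.out.println("slowdown at s = 0.8: " + slowdown(0.8));
    }
}
```

With the Grep figures from Table 3, the strength evaluates to roughly 0.845, matching the 84.5% reported for WordCount_A, and a strength of 0.8 corresponds to the fivefold slowdown used in the example above.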

5.2 Evaluations

To verify the attack effectiveness of our approach,

we test three programs mentioned above for

evaluating attack broadness, attack strength and

resource degradation. In the following section, we

describe the details of the experimental records.

A. Attack Broadness

As discussed in Section 5.1, N denotes the total number of living nodes that the Hadoop cluster currently has, and m denotes the number of unhealthy nodes after a DDHCS attack. We use b = m/N to describe the attack broadness. We investigate a range of DDHCS intensities by running each of the three programs, WordCount_A, BeerAndDiaper and WordCount_N, 100 times, 80 times and 60 times. We can check the unhealthy nodes and decommissioned nodes in the cluster using the web page http://master:8088/cluster/apps. We record the unhealthy nodes and decommissioned nodes after each DDHCS attack, as shown in Table 2.

As Table 2 shows, the experimental results are consistent with our analysis. The decommissioned nodes are the nodes added to the exclude list, and they account for less than 33% of the total nodes. The average attack broadness of the three programs is 86.7%, 83.3% and 86.7% respectively, which shows that the cluster becomes unable to provide services to its legitimate users.

B. Attack Strength

In this experiment, we run 4 benchmark applications to cover a wide range of data-intensive tasks: compute intensive (Grep), shuffle intensive (Index), database queries (Join) and iterative (Randomwriter). We first run the 4 benchmark applications 20 times each before the DDHCS attack to calculate the average running time, then we run the three attack programs 100 times each as three separate attack scenarios. After each attack scenario we run each benchmark 20 times again to calculate the average running time after the DDHCS attack.

Grep. Grep is a popular application for large scale

data processing. It searches some regular

expressions through input text files and outputs the

lines which contain the matched expressions.

Inverted Index. Inverted index is widely used in

search area. We implement a job in Hadoop that

builds an inverted index from given documents and

generates a compressed bit vector posting list for

each word.

Join. Join is one of the most common applications

that experience the data skew problem.

Randomwriter. Randomwriter writes 10GB of data to each node randomly. It is memory intensive and CPU intensive, and has high I/O consumption.

First, we run each benchmark 20 times with no DDHCS attack to obtain the average running time $\bar{T}$. Then we run each of the three attack programs, WordCount_A, BeerAndDiaper and WordCount_N, 100 times, and record the running time of each legitimate benchmark application after each attack program. We summarize the average running time without attack, $\bar{T}$, and the average running time $\bar{T}'$ under each of the three attacks in Table 3, and analyze the attack strength. The results show that under each type of DDHCS attack the attack strength is more than 80 percent, and the cluster performance is severely degraded.

Figure 6 shows the average running time of the 4 benchmark applications as the number of DDHCS attacks increases. As the attack programs are run more times, the average running time of each benchmark application grows significantly, which means that the cluster is unable to provide normal service and the average time to serve a user request is higher than normal.

Table 2: Summary of DDHCS Attack broadness.

| Job type | Times | Total nodes | Unhealthy nodes | Decommission nodes | Exclude list nodes rate | Attack broadness |
|---|---|---|---|---|---|---|
| WordCount_A | 100 | 20 | 18 | 6 | 30% | 90% |
| WordCount_A | 80 | 20 | 18 | 6 | 30% | 90% |
| WordCount_A | 60 | 20 | 16 | 5 | 25% | 80% |
| BeerAndDiaper | 100 | 20 | 17 | 6 | 30% | 85% |
| BeerAndDiaper | 80 | 20 | 17 | 6 | 30% | 85% |
| BeerAndDiaper | 60 | 20 | 16 | 5 | 25% | 80% |
| WordCount_N | 100 | 20 | 18 | 6 | 30% | 90% |
| WordCount_N | 80 | 20 | 17 | 5 | 25% | 85% |
| WordCount_N | 60 | 20 | 17 | 5 | 25% | 85% |

Table 3: Summary of the attack strength of 4 benchmark applications.

| Scenario | Grep (avg. running time) | Inverted Index (avg. running time) | Join (avg. running time) | Randomwriter (avg. running time) | Grep (attack strength) | Inverted Index (attack strength) | Join (attack strength) | Randomwriter (attack strength) |
|---|---|---|---|---|---|---|---|---|
| Normal | 112.4s | 86.4s | 113.6s | 71.3s | 0% | 0% | 0% | 0% |
| DDoS: WordCount_A | 726.7s | 444.6s | 745.2s | 563.7s | 84.5% | 80.6% | 84.8% | 87.4% |
| DDoS: BeerAndDiaper | 737.8s | 435.1s | 751.3s | 579.2s | 86.1% | 80.1% | 84.9% | 87.8% |
| DDoS: WordCount_N | 733.1s | 453.3s | 749.5s | 553.8s | 84.7% | 80.9% | 84.8% | 87.1% |

(a) under WordCount_A DDHCS attack (b) under BeerAndDiaper DDHCS attack (c) under WordCount_N DDHCS attack

Figure 6: Job running time under 3 attack scenarios.

C. Resource Degradation

To further demonstrate these results, we run additional experiments to compare resource degradation. We simulate a scenario with the BeerAndDiaper DDHCS attack and run the attack program at a range of intensities: 20 times, 40 times, 60 times, 80 times and 100 times. The CPU usage, memory usage and network bandwidth usage before and after the BeerAndDiaper DDHCS attack are illustrated in Figure 7 and Figure 8.

In this scenario, most of the nodes are infected and resource consumption rises significantly. Hence the performance of the YARN cluster deteriorates greatly, and YARN becomes unable to provide services to its legitimate users.

6 SUGGESTION AGAINST DDHCS

Recent work has proposed many methods to detect or prevent traditional DDoS attacks, but these techniques are not suitable for Big Data platforms (Gu et al., 2014); (Kiciman and Fox, 2005); (Specht and Lee, 2004). In the vulnerabilities we studied, attacks against YARN are launched by legitimate users who submit malicious programs, so we cannot defend against them by predicting user behavior. An important way to prevent DDoS attacks against YARN is to harden the cluster. This requires a heightened awareness of security issues and prevention techniques from all YARN users.

Figure 7: Summary of the CPU, memory occupancy rate and network bandwidth usage before DDHCS attack.

Figure 8: Summary of the CPU, memory occupancy rate and network bandwidth usage after BeerAndDiaper DDHCS attack.

Since the root of this problem is that Hadoop/YARN lacks a job inspection mechanism, the most straightforward remedy is to verify whether a job succeeds within its time limit. We propose three methods to strengthen YARN security, as follows:

User Blacklist Mechanism. Just like the node exclude list mechanism, we could construct a user blacklist, as shown in Figure 9. When a user's submitted jobs fail more than 3 times, the user is added to the user blacklist. Every entry in the user blacklist includes the user ID, the IP address of the blacklisted user, and a list of submitted jobs associated with this user. A user who matches an entry in the blacklist is confined to the isolated nodes, where the user's jobs run as tests, and cannot distribute jobs to the other nodes until the user is proven clean.

A user should not be blacklisted forever. A blacklisted user should be allowed to regain his or her rights if it can be verified that the user's jobs no longer fail. This is realized as follows. Each user in the blacklist is associated with a time-to-live value. Periodically, each job submitted by the user is run as a test on the isolated nodes: if it still fails, the user's time-to-live value is increased by one; if it finishes successfully, the value is reduced by one. The user is removed from the user blacklist when the time-to-live value reaches 0.
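The blacklist and its time-to-live rule can be sketched in Python as follows. This is a minimal in-memory model under stated assumptions, not an implementation inside YARN: the class and method names are hypothetical, and the starting time-to-live value (`INITIAL_TTL`) is an assumption because the text does not specify one.

```python
from dataclasses import dataclass, field
from typing import Dict, List

FAILURE_THRESHOLD = 3  # from the text: blacklisted after more than 3 failed jobs
INITIAL_TTL = 3        # assumption: the text does not give a starting value


@dataclass
class BlacklistEntry:
    user_id: str
    ip_address: str
    jobs: List[str] = field(default_factory=list)
    ttl: int = INITIAL_TTL


class UserBlacklist:
    """Hypothetical in-memory model of the proposed user blacklist."""

    def __init__(self) -> None:
        self._failed_jobs: Dict[str, List[str]] = {}
        self._entries: Dict[str, BlacklistEntry] = {}

    def record_job_failure(self, user_id: str, ip_address: str, job_id: str) -> None:
        """Record a failed job; blacklist the user once failures exceed the threshold."""
        failed = self._failed_jobs.setdefault(user_id, [])
        failed.append(job_id)
        if len(failed) > FAILURE_THRESHOLD and user_id not in self._entries:
            # The entry shares the list, so later failures also show up in it.
            self._entries[user_id] = BlacklistEntry(user_id, ip_address, jobs=failed)

    def is_blacklisted(self, user_id: str) -> bool:
        return user_id in self._entries

    def record_retest(self, user_id: str, succeeded: bool) -> None:
        """Apply the result of one periodic test run on the isolated nodes."""
        entry = self._entries.get(user_id)
        if entry is None:
            return
        entry.ttl += -1 if succeeded else 1
        if entry.ttl <= 0:
            # The user has proven clean: drop the entry and the failure history.
            del self._entries[user_id]
            self._failed_jobs.pop(user_id, None)
```

In this sketch a scheduler would call `is_blacklisted` before placing a user's containers and restrict blacklisted users to the isolated nodes.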

Figure 9: User blacklist mechanism.

Parameter Check. A MapReduce program has a fixed structure. Consider the problem of counting words (WordCount) in a large collection of documents; the user would write code similar to the following pseudocode.

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

In addition, the user writes code to fill in a MapReduce specification object with the names of the input and output files and optional tuning parameters. We can therefore check these parameters before the program executes. If some of the parameters are too high or too low compared with their normal values, the MapReduce program is not allowed to execute. For instance, the default execution time is 10 minutes; if the user specifies it as 10 milliseconds, the job will be rejected.
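A pre-submission parameter check of this kind might look like the Python sketch below. It assumes that the "execution time" in the text corresponds to the Hadoop task timeout property `mapreduce.task.timeout` (in milliseconds); the allowed ranges, the choice of which properties to check, and the function names are illustrative assumptions rather than values taken from the paper.

```python
from typing import Dict, List, Tuple

# Allowed range per parameter, in the parameter's own unit.
# The bounds are illustrative assumptions, not values from the paper or from Hadoop.
ALLOWED_RANGES: Dict[str, Tuple[int, int]] = {
    # Task timeout in milliseconds; 10 minutes is 600000 ms.
    "mapreduce.task.timeout": (60_000, 3_600_000),
    # Maximum attempts per map task.
    "mapreduce.map.maxattempts": (1, 8),
}


def check_parameters(job_conf: Dict[str, int]) -> List[str]:
    """Return one message per parameter that falls outside its allowed range."""
    violations = []
    for name, value in job_conf.items():
        bounds = ALLOWED_RANGES.get(name)
        if bounds is None:
            continue  # parameters without a configured range are not checked
        low, high = bounds
        if not low <= value <= high:
            violations.append(f"{name}={value} is outside [{low}, {high}]")
    return violations


def admit_job(job_conf: Dict[str, int]) -> bool:
    """A job is admitted only if none of its checked parameters is out of range."""
    return not check_parameters(job_conf)


if __name__ == "__main__":
    print(admit_job({"mapreduce.task.timeout": 600_000}))  # within range
    print(check_parameters({"mapreduce.task.timeout": 10}))  # 10 ms is rejected
```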

ICISSP 2016 - 2nd International Conference on Information Systems Security and Privacy

Map-tracing. Novel visualizations and statistical views of the behavior of MapReduce programs enable users to trace a MapReduce program's behavior through its stages. Most previous tracing techniques have extracted distributed execution traces at the programming language level, e.g. using instrumented middleware or libraries to track requests (Chen et al., 2002; Barham et al., 2004; Koskinen and Jannotti, 2008). We can learn from them and generate views at the higher-level MapReduce abstraction. Figure 10 shows the overall flow of a MapReduce operation; we mainly focus on the Map phase.

Figure 10: MapReduce execution overview.

We observed that, for each phase, the logs faithfully reproduce the observed distributions of task completion times, data read by each task, size and location of inputs, probability of failures and recomputations, and fairness-based evictions. We can therefore trace the first 10% of the maps of each job; if these maps have problems, for example if they cannot finish successfully, the cluster will not assign resources to the remaining tasks.
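The gating decision described above can be sketched in Python as follows. This is a simplified model under stated assumptions: the function names are hypothetical, task outcomes are assumed to be available as booleans in launch order, and the default tolerance of zero failed traced maps (`max_failures`) is an assumption, since the text does not say how many failures are acceptable.

```python
from typing import Sequence

TRACE_PERCENT = 10  # from the text: trace the first 10% of the maps of each job


def traced_map_count(total_maps: int, percent: int = TRACE_PERCENT) -> int:
    """Number of leading map tasks to trace, rounded up, at least one per job."""
    if total_maps <= 0:
        return 0
    # Integer ceiling of total_maps * percent / 100, avoiding float rounding.
    return max(1, (total_maps * percent + 99) // 100)


def should_schedule_remaining(map_succeeded: Sequence[bool], total_maps: int,
                              max_failures: int = 0) -> bool:
    """Decide whether the remaining tasks of a job should receive resources.

    map_succeeded holds the outcomes of finished map tasks in launch order.
    """
    probe = traced_map_count(total_maps)
    if len(map_succeeded) < probe:
        raise ValueError("the traced maps have not all finished yet")
    failures = sum(1 for ok in map_succeeded[:probe] if not ok)
    return failures <= max_failures


if __name__ == "__main__":
    outcomes = [True] * 19 + [False]
    print(traced_map_count(200))
    print(should_schedule_remaining(outcomes, total_maps=200))
```

A stricter or looser policy only changes `max_failures`; the point of the sketch is that the decision is taken after a small prefix of the job rather than after the whole job has consumed cluster resources.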

Finally, DDoS attacks exist in multi-tenant environments, so it is important for a user to learn the security and resource usage patterns of the other users sharing the cluster. It is also necessary to plan rationally the number of nodes that each user can use.

7 CONCLUSIONS

In this paper, we studied the vulnerabilities of YARN and proposed a DDoS attack based on the health check service (DDHCS). We summarized three vulnerabilities and designed three attack programs to demonstrate how many nodes in a YARN cluster can be invaded by malicious users. We evaluated the attack effectiveness in a YARN cluster under DDHCS attacks. Our study shows that these vulnerabilities can easily be used by malicious users to launch DDHCS attacks and can cause significant impact on the performance of a YARN cluster. Up to 90% of the nodes denied service, and the attack strength was more than 80%. Given this, we proposed three methods to enhance YARN. In future research, we will work on strengthening the security of YARN by implementing our three suggestions to provide effective filtering and defense. We will also extend our trust calculus for estimating and optimizing the trustworthiness of cloud workflows for handling big data.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge the support of

the National High Technology Research and

Development Program (“863” Program) of China

under Grant No. 2015AA016009, the National

Natural Science Foundation of China under Grant

No. 61232005, and the Science and Technology

Program of Shen Zhen, China under Grant No.

JSGG20140516162852628. Special thanks to

Ziyao Zhu and Wenjun Qian for the support of

experiments.

REFERENCES

Alarifi, S., & Wolthusen, S. D. (2014, April). Mitigation

of Cloud-Internal Denial of Service Attacks. In Service

Oriented System Engineering (SOSE), 2014 IEEE 8th

International Symposium on (pp. 478-483). IEEE.

Barham, P., Donnelly, A., Isaacs, R., & Mortier, R. (2004,

December). Using Magpie for Request Extraction and

Workload Modelling. In OSDI (Vol. 4, pp. 18-18).

Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer,

E. (2002). Pinpoint: Problem determination in large,

dynamic internet services. In Dependable Systems and

Networks, 2002. DSN 2002. Proceedings. International

Conference on (pp. 595-604). IEEE.

Criscuolo, P. J. (2000). Distributed Denial of Service:

Trin00, Tribe Flood Network, Tribe Flood Network

2000, and Stacheldraht CIAC-2319 (No. CIAC-2319).

CALIFORNIA UNIV LIVERMORE RADIATION

LAB.

Durcekova, V., Schwartz, L., & Shahmehri, N. (2012,

May). Sophisticated denial of service attacks aimed at

application layer. In ELEKTRO, 2012 (pp. 55-60).

IEEE.

Ficco, M., & Rak, M. (2015). Stealthy denial of service

strategy in cloud computing. Cloud Computing, IEEE

Transactions on, 3(1), 80-94.

Girma, A., Garuba, M., Li, J., & Liu, C. (2015, April).

Analysis of DDoS Attacks and an Introduction of a

Hybrid Statistical Model to Detect DDoS Attacks on

Cloud Computing Environment. In Information

Technology-New Generations (ITNG), 2015 12th

International Conference on (pp. 212-217). IEEE.

Gu, Z., Pei, K., Wang, Q., Si, L., Zhang, X., & Xu, D.

LEAPS: Detecting Camouflaged Attacks with

Statistical Learning Guided by Program Analysis.

Hameed, S., & Ali, U. (2015). On the Efficacy of Live

DDoS Detection with Hadoop. arXiv preprint

arXiv:1506.08953.

Huang, J., Nicol, D. M., & Campbell, R. H. (2014, June).

Denial-of-Service Threat to Hadoop/YARN Clusters

with Multi-Tenancy. In Big Data (BigData Congress),

2014 IEEE International Congress on (pp. 48-55).

IEEE.

Ulusoy, H., Colombo, P., Ferrari, E., Kantarcioglu, M., &

Pattuk, E. (2015, April). GuardMR: Fine-grained

Security Policy Enforcement for MapReduce System.

In ASIA CCS’15.

Karthik, S., & Shah, J. J. (2014, February). Analysis of

simulation of DDOS attack in cloud. In Information

Communication and Embedded Systems (ICICES),

2014 International Conference on (pp. 1-5). IEEE.

Khattak, R., Bano, S., Hussain, S., & Anwar, Z. (2011,

December). DOFUR: DDoS Forensics Using

MapReduce. In Frontiers of Information Technology

(FIT), 2011 (pp. 117-120). IEEE.

Kholidy, H., & Baiardi, F. (2012, April). CIDS: a

framework for intrusion detection in cloud systems. In

Information Technology: New Generations (ITNG),

2012 Ninth International Conference on (pp. 379-

385). IEEE.

Kholidy, H., Baiardi, F., & Hariri, S. (2015). DDSGA: A

Data-Driven Semi-Global Alignment Approach for

Detecting Masquerade Attacks. Dependable and

Secure Computing, IEEE Transactions on, 12(2), 164-

178.

Kiciman, E., & Fox, A. (2005). Detecting application-

level failures in component-based internet services.

Neural Networks, IEEE Transactions on, 16(5), 1027-

1041.

Koskinen, E., & Jannotti, J. (2008, April). Borderpatrol:

isolating events for black-box tracing. In ACM

SIGOPS Operating Systems Review (Vol. 42, No. 4,

pp. 191-203). ACM.

Lee, Y., Kang, W., & Lee, Y. (2011). A hadoop-based

packet trace processing tool (pp. 51-63). Springer

Berlin Heidelberg.

Lee, Y., & Lee, Y. (2011, December). Detecting ddos

attacks with hadoop. In Proceedings of The ACM

CoNEXT Student Workshop (p. 7). ACM.

Mizukoshi, M., & Munetomo, M. (2015, May).

Distributed denial of services attack protection system

with genetic algorithms on Hadoop cluster computing

framework. In Evolutionary Computation (CEC), 2015

IEEE Congress on (pp. 1575-1580). IEEE.

O’Malley, O., Zhang, K., Radia, S., Marti, R., & Harrell,

C. (2009). Hadoop security design. Yahoo, Inc., Tech.

Rep.

Sabahi, F. (2011, May). Cloud computing security threats

and responses. In Communication Software and

Networks (ICCSN), 2011 IEEE 3rd International

Conference on (pp. 245-249). IEEE.

Specht, S. M., & Lee, R. B. (2004, September).

Distributed Denial of Service: Taxonomies of Attacks,

Tools, and Countermeasures. In ISCA PDCS (pp. 543-

550).

Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal,

S., Konar, M., Evans, R., & Baldeschwieler, E. (2013,

October). Apache hadoop yarn: Yet another resource

negotiator. In Proceedings of the 4th annual

Symposium on Cloud Computing (p. 5). ACM.

Wu, H., Tantawi, A. N., & Yu, T. (2013, June). A self-

optimizing workload management solution for cloud

applications. In Web Services (ICWS), 2013 IEEE 20th

International Conference on (pp. 483-490). IEEE.
