Dependability Testing of MapReduce Systems
João Eugenio Marynowski¹, Andrey Ricardo Pimentel¹, Taisy Silva Weber² and Antonio Junior Mattos¹
¹Department of Informatics, Federal University of Paraná, UFPR, Curitiba, Brazil
²Department of Informatics, Federal University of Rio Grande do Sul, UFRGS, Porto Alegre, Brazil
Keywords:
MapReduce, Hadoop, Dependability, Test, Fault Injection, Fault Tolerance.
Abstract:
MapReduce systems have been widely used by several applications, from search engines to financial and com-
mercial systems. There is considerable enthusiasm around MapReduce systems due to their simplicity and
scalability. However, they lack a testing approach and framework ensuring their dependability. In this work,
we propose a complete dependability testing solution for MapReduce systems. This solution is a model-based
approach to generate representative fault cases, and a testing framework to automate their execution. More-
over, we introduce a new way to model distributed components using Petri Nets, and we show the promising
results of the proposed testing framework, HadoopTest, on identifying faulty systems in real deployment sce-
narios.
1 INTRODUCTION
The amount of data stored by various applications, such as social networks, commercial applications, and research, has grown beyond petabytes. There are many frameworks to facilitate the analysis of such large data sets; MapReduce is one of them, with broad adoption. It abstracts parallel and distributed issues such as data partitioning, replication, distributed processing, and fault tolerance (Dean and Ghemawat, 2004).
Although a considerable number of MapReduce applications may tolerate partial results, such as large-scale web indexing and pattern-based searching, several applications must deliver complete results, such as applications in the business, financial, and research domains. To make use of MapReduce in such domains, it is essential to test its dependability (Abouzeid et al., 2009; Teradata Corporation, 2012; Hadoop, 2012).
Dependability testing aims at validating the behavior of fault-tolerant systems, i.e., at finding errors in the implementation or specification of fault tolerance mechanisms (Avizienis et al., 2004; Ammann and Offutt, 2008). For this purpose, the system is executed in a controlled testing environment while artificial faults are injected. Two main issues arise in this approach: generating representative elements from the potentially infinite and partially unknown set of fault cases, and automating their execution.
Testing the dependability of MapReduce systems requires executing fault cases capable of stimulating all the faults it is supposed to tolerate, which in turn requires explicit control over its processing steps.
In this work, we present a solution for dependability testing of MapReduce systems through the generation and execution of representative fault cases. We use a Petri Net model of the fault tolerance mechanism to generate these fault cases, and a framework to automate their execution in real deployment scenarios. Additionally, we introduce a new approach to interpret the model components: the MapReduce components are modeled as dynamic items that are independent of their actions and states.
This paper is organized as follows. The next section introduces the basic concepts, presenting a description of MapReduce and defining fault cases.
Section 3 presents our approach to model MapRe-
duce fault tolerance mechanism. Section 4 shows
how we generate the representative fault cases. Sec-
tion 5 presents our framework for dependability test-
ing. Section 6 describes the initial results through im-
plementation and experimentation. Section 7 surveys
related work. Section 8 concludes the paper.
2 BASIC CONCEPTS
2.1 MapReduce
MapReduce is a simplified programming model and an associated implementation for processing and analyzing large-scale data. It offers a programming environment based on two high-level functions, map and reduce, and a runtime environment to execute them on a cluster. The MapReduce architecture includes several worker components and one master that schedules map and reduce tasks to run on the workers.
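To make the programming model concrete, the following is a minimal, framework-independent Python sketch of a word-count job expressed as map and reduce functions. It is not Hadoop's API; the run_job driver is a hypothetical, sequential stand-in for the runtime environment that schedules tasks and shuffles intermediate pairs.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum all partial counts for a word."""
    yield word, sum(counts)

def run_job(lines):
    """Hypothetical sequential driver standing in for the MapReduce runtime:
    apply map_fn to each record, group intermediate pairs by key (the
    shuffle), then apply reduce_fn per key."""
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        for k, v in map_fn(i, line):
            groups[k].append(v)
    return dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))

print(run_job(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```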
Figure 1 shows a MapReduce execution instance. The master receives a job and coordinates five worker components, identified by {worker0, ..., worker4}. It assigns the map function to {worker0, worker1, worker2, worker3}; each of these reads the input data from the files split in a Distributed File System (DFS), applies the user-defined map function on each split, and creates several outputs locally. The master assigns the reduce function to {worker4, worker1, worker0}; each of these reads the map outputs locally or remotely, applies the user-defined reduce function, and writes the results to the DFS.
Figure 1: A MapReduce execution overview.
The MapReduce fault tolerance mechanism identifies faulty workers by timeout and reschedules their tasks onto healthy workers. Fault handling differs according to the task and its processing step: e.g., if a worker fails while executing a map task, the master only reschedules that task to another worker; but if a component fails after completing a map task, the master reschedules the task to another worker and informs all workers executing reduce tasks that they must read the map result from the new worker.
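The timeout-and-reschedule policy just described can be summarized by the sketch below. It is an illustrative model of the behavior, not Hadoop's actual code; the Worker and Task structures and the HEARTBEAT_TIMEOUT value are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field

HEARTBEAT_TIMEOUT = 60.0  # assumed value: seconds of silence before a worker is declared faulty

@dataclass
class Task:
    id: str
    kind: str  # "map" or "reduce"

@dataclass
class Worker:
    id: str
    last_heartbeat: float
    running_tasks: list = field(default_factory=list)
    completed_map_tasks: list = field(default_factory=list)
    map_output_location: dict = field(default_factory=dict)  # task id -> worker id, used by reducers

def handle_failures(workers, now=None):
    """Master-side check: workers whose heartbeat timed out are declared faulty;
    their running tasks and completed map tasks are rescheduled on healthy
    workers, and workers running reduce tasks are told where to re-read the
    rescheduled map outputs (which were stored locally on the faulty worker)."""
    now = time.time() if now is None else now
    healthy = [w for w in workers if now - w.last_heartbeat <= HEARTBEAT_TIMEOUT]
    if not healthy:
        raise RuntimeError("no healthy worker left to reschedule onto")
    for w in workers:
        if w in healthy:
            continue
        for task in w.running_tasks + w.completed_map_tasks:
            target = min(healthy, key=lambda h: len(h.running_tasks))  # naive load balancing
            target.running_tasks.append(task)
            if task.kind == "map" and task in w.completed_map_tasks:
                for r in healthy:
                    if any(t.kind == "reduce" for t in r.running_tasks):
                        r.map_output_location[task.id] = target.id
    return healthy
```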
2.2 Fault Case
A fault case is an extension of a distributed test case, involving the components required for a complete execution and validation of a system under test while faults are injected (Echtle and Leu, 1994; Ambrosio et al., 2005; de Almeida et al., 2010b).
Definition 2.1 (Fault Case). A fault case is a 4-tuple F = (C_F, A_F, R_F, O) where:
- C_F = {c_0, c_1, ..., c_n} is a finite set of system components;
- A_F = {a_0, a_1, ..., a_m} is a finite set of actions that can involve fault injections;
- R_F = {r_{a_0}, ..., r_{a_m}} is a finite set of action results;
- O is an oracle.
The oracle is a mechanism responsible for verifying the system behavior during a fault case execution and associating a result with it, i.e., a verdict of pass, fail, or inconclusive. Each action a_i gets a result r_{a_i}: success, failure, or timeout (no response within a time limit). If all action results in R_F are success, the verdict of F is pass. If any action result is failure, the verdict of F is fail. But if at least one action execution gets a timeout, the verdict of F is inconclusive, since the test is not accurate enough to support either of the previous verdicts, and the fault case must be rerun.
Definition 2.2 (Action). A fault case action is a 7-tuple a_i = (h, n, C', I, W, D, t) where:
- h ∈ N, h ≤ |A_F|, is the hierarchical order in which action a_i must execute; actions with the same h execute in parallel;
- n ∈ N, n ≤ |C'|, is the number of successful executions required for a_i to result in success;
- C' ⊆ C_F is the set of components that execute a_i;
- I is a set of instructions or commands executed by the components;
- W is an optional instruction or command that acts as a trigger required to execute a_i;
- D ⊆ A_F, with ∀a_j ∈ D: j < i and r_{a_j} = SUCCESS, is the set of actions that must be successfully executed before a_i; otherwise the action result r_{a_i} is failure;
- t is the time limit to execute a_i.
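A direct transcription of Definitions 2.1 and 2.2 into Python (a sketch for illustration only; the field names spell out the paper's symbols) also makes the oracle's verdict rule explicit.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Result(Enum):
    SUCCESS = "success"
    FAILURE = "failure"
    TIMEOUT = "timeout"   # no response within the time limit t

@dataclass
class Action:
    """A fault case action a_i = (h, n, C', I, W, D, t), as in Definition 2.2."""
    h: int                   # hierarchical order; actions with the same h run in parallel
    n: int                   # number of successful executions required for success
    components: frozenset    # C' ⊆ C_F, the components that execute the action
    instructions: list       # I, instructions or commands (possibly fault injections)
    trigger: Optional[str]   # W, optional trigger that must be observed before executing
    depends_on: frozenset    # D, indices of actions that must have succeeded beforehand
    timeout: float           # t, time limit to execute the action

@dataclass
class FaultCase:
    """A fault case F = (C_F, A_F, R_F, O), as in Definition 2.1."""
    components: frozenset                        # C_F
    actions: list                                # A_F
    results: dict = field(default_factory=dict)  # R_F: action index -> Result

    def verdict(self) -> str:
        """Oracle O: inconclusive if any action timed out (the fault case must
        be rerun), pass if every action succeeded, fail otherwise."""
        values = [self.results.get(i) for i in range(len(self.actions))]
        if any(v is Result.TIMEOUT for v in values):
            return "inconclusive"
        if values and all(v is Result.SUCCESS for v in values):
            return "pass"
        return "fail"
```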
3 MODELING MapReduce FAULT
TOLERANCE MECHANISM
Modeling the MapReduce fault tolerance mechanism
demands a formal model of the concurrent and dis-
tributed behavior of its components. The model must
represent the components as dynamic items, enabling
ICEIS2013-15thInternationalConferenceonEnterpriseInformationSystems
166
them to be easily removed or inserted, without sub-
stantial model changes. Moreover, the model should
represent the components without specifying their ac-
tions, allowing an action to be performed by any en-
abled component. This feature is essential to model
the rescheduling process of faulty map and reduce
tasks.
Finite State Machines (FSM) (Bernardi et al., 2012) and Petri Nets (PN) (Callou et al., 2012) are the main approaches to model distributed systems together with their dependability properties. An FSM abstracts the details of the system behavior and maps directly onto the MapReduce processing steps. Although there are extensions to represent other features (e.g., timing specifications), FSM restricts the modeling of several components that have parallel and distinct behaviors: each component needs a specific set of states and an alphabet to model its behavior.
PN modeling enables us to take a new approach to interpret its components, modeling the MapReduce components as dynamic items that can be easily inserted or removed. Moreover, it allows modeling the independence of these components from their actions and states, i.e., an action can be executed by any enabled component.
Figure 2 shows a Petri Net that models the part of the MapReduce fault tolerance mechanism that handles faults while running map and reduce functions. Labeled transitions represent the fault case actions, tokens represent MapReduce components, and places represent their states or processing steps. When the transition "master.sendJOB" fires, it consumes one token from "online master" and one from "online workers", and produces a new one in "worker.runningMap". Now two transitions can fire, "nothing" and "worker.runningMap-FAIL". If "nothing" fires, it consumes one token from "worker.runningMap" and produces one in "worker.runningReduce". If "worker.runningMap-FAIL" fires, it consumes one token from "worker.runningMap" and one from "online workers", and produces one again in "worker.runningMap". The behavior is similar when "worker.runningReduce-FAIL" fires while there is a token in "worker.runningReduce", but it also requires a token in "online workers" to fire.
A PN allows a comprehensive modeling of the MapReduce fault tolerance mechanism. Moreover, it allows extensions to model other behaviors implicitly specified in the MapReduce fault tolerance mechanism, such as temporal fault identification and the interruption of the process when it is impossible to complete a job.
Figure 2: A Petri Net modeling example.
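The fragment of Figure 2 can be encoded as a small place/transition structure in which tokens are anonymous, so an action can be taken by any enabled component, exactly the property argued for above. This is a minimal sketch with our own data structure rather than a Petri Net library; the initial marking, the "master.assertRESULT" transition and its "job done" output place are assumptions based on Figures 2 and 3.

```python
# Each place holds an integer marking; a transition fires by consuming one
# token from every input place and producing one token in every output place.
places = {"online master": 1, "online workers": 2,
          "worker.runningMap": 0, "worker.runningReduce": 0, "job done": 0}

transitions = {
    "master.sendJOB":            (["online master", "online workers"], ["worker.runningMap"]),
    "nothing":                   (["worker.runningMap"], ["worker.runningReduce"]),
    "worker.runningMap-FAIL":    (["worker.runningMap", "online workers"], ["worker.runningMap"]),
    "worker.runningReduce-FAIL": (["worker.runningReduce", "online workers"], ["worker.runningReduce"]),
    "master.assertRESULT":       (["worker.runningReduce"], ["job done"]),
}

def enabled(marking, t):
    """A transition is enabled when every input place holds at least one token."""
    inputs, _ = transitions[t]
    return all(marking[p] >= 1 for p in inputs)

def fire(marking, t):
    """Return the new marking reached by firing transition t (assumed enabled)."""
    inputs, outputs = transitions[t]
    m = dict(marking)
    for p in inputs:
        m[p] -= 1
    for p in outputs:
        m[p] += 1
    return m

m = fire(places, "master.sendJOB")            # a worker is now running a map task
print(enabled(m, "worker.runningMap-FAIL"))   # True: a spare online worker can take over
```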
4 GENERATING
REPRESENTATIVE FAULT
CASES
The representativeness of a fault case expresses how important the fault case is for identifying defects in a system under test (Arlat et al., 2003; Natella et al., 2012). We consider the fault cases that are representative for the dependability testing of MapReduce systems to be those generated from an abstraction of its fault tolerance mechanism. This approach has been used successfully to test other systems (Echtle and Leu, 1994; Ambrosio et al., 2005; Bernardi et al., 2012). It guides the generation toward a finite set of fault cases that should be tolerated and that must be tested to ensure the system's dependability.
We generate representative fault cases from the reachability graph of the Petri Net that models the MapReduce fault tolerance mechanism. A reachability graph consists of all possible sequences of transition firings of a Petri Net. Each possible path starting from the root vertex of the graph composes one fault case. This approach is applicable only to a pure Petri Net, i.e., one without loops.
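A depth-first search over markings enumerates all maximal firing sequences; each one corresponds to a root-to-leaf path of the reachability graph and hence to one fault case. The sketch below reuses the places, transitions, enabled and fire helpers from the Petri Net sketch above and assumes, as required, that the reachability graph is finite.

```python
def firing_sequences(marking, prefix=()):
    """Enumerate every maximal firing sequence (root-to-leaf path of the
    reachability graph) by depth-first search over reachable markings."""
    fired_any = False
    for t in transitions:              # from the previous sketch
        if enabled(marking, t):
            fired_any = True
            yield from firing_sequences(fire(marking, t), prefix + (t,))
    if not fired_any:
        yield prefix                   # a leaf: one complete fault case

for seq in firing_sequences(places):
    print(" -> ".join(seq))
# For the example net this prints three sequences, matching the three fault
# cases of Figure 3; the "nothing" firings are dropped when a sequence is
# turned into fault case actions.
```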
Figure 3 shows the reachability graph generated from the Petri Net example of Figure 2. There are three possible fault cases: (1) without faults, executing "master.sendJOB" and "master.assertRESULT"; (2) with one fault, adding "worker.runningMap-FAIL"; and (3) with the other possible fault, "worker.runningReduce-FAIL".
Table 1 shows the set of fault case actions of fault case (2). The goal is to validate the MapReduce execution while one component fails when executing a map task. This fault case involves three components, C_F = {c_0, c_1, c_2}, obtained from the Petri Net tokens, and seven actions, A_F = {a_0, ..., a_6}, obtained from the reachability graph (Figure 3) plus the start and stop actions.
DependabilityTestingofMapReduceSystems
167
Figure 3: A reachability graph example.
The component c_0 executes action a_0 to start the master. If action a_0 succeeds, the components {c_1, c_2} execute action a_1 to start the workers; otherwise, action a_1 finishes with the failure result. This happens, recursively, to every action that has a dependency relation with a failed action. Without failed actions, the process continues and the next execution is a_2 by component c_0, which submits a job. During the job execution, only the first component (n_{a_3} = 1) of {c_1, c_2} fails when it executes the map task (W_{a_3} = runningMap()). At action a_4, c_0 validates the job result, comparing the expected result with the obtained one. The remaining actions stop the MapReduce execution.
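Written out with the hypothetical Action and FaultCase dataclasses sketched after Definition 2.2, fault case (2) of Table 1 becomes plain data (action indices stand in for the a_j dependencies):

```python
c0, c1, c2 = "c0", "c1", "c2"   # the master's component and the two worker components

fault_case_2_actions = [
    Action(h=1, n=1, components=frozenset({c0}), instructions=["startMaster()"],
           trigger=None, depends_on=frozenset(), timeout=100),
    Action(h=2, n=2, components=frozenset({c1, c2}), instructions=["startWorker()"],
           trigger=None, depends_on=frozenset({0}), timeout=1000),
    Action(h=3, n=1, components=frozenset({c0}), instructions=["sendJOB()"],
           trigger=None, depends_on=frozenset({1}), timeout=1000000),
    Action(h=3, n=1, components=frozenset({c1, c2}), instructions=["FAIL()"],
           trigger="runningMap()", depends_on=frozenset({1}), timeout=1000),
    Action(h=4, n=1, components=frozenset({c0}), instructions=["assertRESULT()"],
           trigger=None, depends_on=frozenset({2}), timeout=10000),
    Action(h=5, n=1, components=frozenset({c1, c2}), instructions=["stopWorker()"],
           trigger=None, depends_on=frozenset({1}), timeout=1000),
    Action(h=6, n=1, components=frozenset({c0}), instructions=["stopMaster()"],
           trigger=None, depends_on=frozenset({0}), timeout=1000),
]
fault_case_2 = FaultCase(components=frozenset({c0, c1, c2}), actions=fault_case_2_actions)
```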
5 FRAMEWORK FOR TESTING
MAPREDUCE SYSTEMS
HadoopTest is a test framework that automatically executes fault cases. It extends the PeerUnit testing framework (de Almeida et al., 2010a), adding the control and monitoring of all MapReduce components, the injection of faults according to their processing steps, and the validation of their behavior.
The HadoopTest architecture consists of one co-
ordinator and several testers. The coordinator con-
trols the execution of distributed testers, coordinates
the actions of fault cases, and generates the verdict
from tester results. Each tester receives coordination
messages, executes fault case actions in the MapRe-
duce components, and returns their results.
Figure 4 shows the application of HadoopTest to the MapReduce instance presented in Figure 1, with a fault injection while a worker executes a reduce function. The coordinator individually controls the execution of six testers, identified by t0..t5, following the fault case. Tester t0 controls the master component, and each other tester, t1..t5, controls a worker instance. This architecture enables the deployment of fault cases applying low-level service functions on the testers. For instance, tester t2 injects a fault on worker1, removing it from the system while it executes the reduce function. This makes it possible to put MapReduce components in any state (i.e., running, idle, or stopped) and to monitor their activity at any time.
Figure 4: Testing a MapReduce instance with fault injection.
The fault case execution consists of coordinating and controlling testers to execute actions in a distributed, parallel and synchronized way. Algorithm 1 shows the main steps to coordinate testers executing a fault case F. For each hierarchical level h present in A_F, the coordinator sends messages to the testers to execute their actions in parallel, receives the local results, and processes them to set the action results R_F. After executing all actions, the oracle O analyzes R_F and assigns the fault case verdict.
Algorithm 1: Coordination Algorithm.
Input: F, a fault case; M, a map function between A_F and the hierarchical orders of its actions
Data: R_t, a set of local tester results
Output: A verdict
foreach h ∈ M(A_F) do
    SendMessages(M^{-1}(h), R_F)
    R_t ← ReceiveResults(M^{-1}(h))
    R_F ← ProcessResults(R_t, M^{-1}(h))
return O(R_F)
Algorithm 2 shows the steps taken by a tester to execute a fault case action. The tester receives the coordination message to execute a_i. If the trigger W_{a_i} is defined, it waits for its execution. After that, or if W_{a_i} is not defined, the tester verifies whether the number of successful action executions n_{a_i} is greater than zero; if so, it executes the set of instructions I_{a_i} and returns the execution result. Otherwise, it returns failure, informing the coordinator that it cannot execute a_i.
6 EXPERIMENTAL VALIDATION
ICEIS2013-15thInternationalConferenceonEnterpriseInformationSystems
168
Table 1: A set of fault case actions.

        h   n   C'           I                W              D       t
  a_0   1   1   {c_0}        startMaster()                   ∅       100
  a_1   2   2   {c_1, c_2}   startWorker()                   {a_0}   1000
  a_2   3   1   {c_0}        sendJOB()                       {a_1}   1000000
  a_3   3   1   {c_1, c_2}   FAIL()           runningMap()   {a_1}   1000
  a_4   4   1   {c_0}        assertRESULT()                  {a_2}   10000
  a_5   5   1   {c_1, c_2}   stopWorker()                    {a_1}   1000
  a_6   6   1   {c_0}        stopMaster()                    {a_0}   1000
Algorithm 2: Action Execution Algorithm.
Data: a_i, a fault case action
Output: An action result
a_i ← ReceiveAction()
if W_{a_i} ≠ NULL then
    Wait W_{a_i}
if n_{a_i} > 0 then
    return Run I_{a_i}
return FAILURE
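A corresponding tester-side sketch of Algorithm 2 follows; receive_action, wait_for and run are hypothetical hooks into the tester runtime.

```python
def execute_action(receive_action, wait_for, run):
    """Sketch of Algorithm 2: the tester receives an action a_i, waits for its
    trigger W if one is defined, and runs its instructions I only if the
    required number of successful executions n is still positive; otherwise it
    reports failure back to the coordinator."""
    action = receive_action()
    if action.trigger is not None:       # W defined
        wait_for(action.trigger)         # e.g. block until runningMap() is observed
    if action.n > 0:
        return run(action.instructions)  # SUCCESS, FAILURE or TIMEOUT
    return "FAILURE"
```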
This section presents an evaluation of our proposed solution through the automatic and manual execution of fault cases for testing Hadoop (Hadoop, 2012), an open-source MapReduce implementation. First, we present the results obtained by the manual execution of the representative fault cases. Second, we evaluate the overhead produced by HadoopTest to coordinate the execution of fault cases. Finally, we validate the effectiveness of HadoopTest at identifying faulty systems by testing the PiEstimator, an application bundled with Hadoop.
6.1 Manual Execution of Representative
Fault Cases
We manually executed some fault cases generated from the model of the MapReduce fault tolerance mechanism to confirm their representativeness for defect identification. One fault case consisted of four components executing WordCount, while two components failed by crashing when they executed the map task. Hadoop interrupted the execution when the second component failed, although the data remained in another active component. The correct behavior would be to schedule the tasks to that active component, but Hadoop did not do so because of a corrupted control file.
In addition, we executed fault cases involving temporal parameters. We identified that Hadoop does not consider timeout parameters to detect faulty components, and that it does not interrupt the execution when the data are no longer available, i.e., when all components that stored the data have failed: Hadoop continued running, without any progress, for two hours.
6.2 The HadoopTest Overhead
We evaluated the HadoopTest overhead by executing PiEstimator in two ways. In the first, Hadoop is executed alone, to measure the raw execution time. In the second, Hadoop is executed along with HadoopTest, to measure the overhead produced during testing. We used 10, 50, 100 and 200 machine nodes of the Grid'5000 platform to perform this experiment. Figure 5 shows the average execution time of PiEstimator running on Hadoop and on HadoopTest, varying the number of map instances in each execution.
[Figure 5 plot: execution time (sec) versus number of maps (200 to 2000), for Hadoop alone and for Hadoop with HadoopTest.]
Figure 5: Execution time variance of the PiEstimator.
HadoopTest introduces only minimal overhead while controlling Hadoop during fault case execution. This characteristic enables testing MapReduce systems in large-scale failure scenarios.
6.3 Identifying Faulty Systems
We used Mutation Testing (Offutt, 1994) to evaluate whether HadoopTest is able to identify faulty systems. We created a set of faulty versions (i.e., mutants) of the PiEstimator. Mutations are changes of arithmetic and logic operators in the original source code that lead to incorrect results. The goal is to identify the largest possible number of these incorrect results.
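For illustration only (hypothetical code, not the actual PiEstimator source), a mutant is the same program with a single operator changed so that it produces an incorrect result, which a passing test suite must detect:

```python
def estimate_pi(inside, total):
    """Original: four times the fraction of sample points that fall inside the
    quarter circle approximates pi (a simplified stand-in for PiEstimator)."""
    return 4.0 * inside / total

def estimate_pi_mutant(inside, total):
    """Mutant: the arithmetic operator '*' was changed to '/'."""
    return 4.0 / inside / total

print(estimate_pi(7854, 10000))         # ~3.1416, the expected value
print(estimate_pi_mutant(7854, 10000))  # a clearly wrong value
```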
We generated 13 mutants of the PiEstimator class
DependabilityTestingofMapReduceSystems
169
and Table 2 shows the results. The expected π value returned by the original application was 3.1416, and only the mutants M1, M6, M7, M9 and M12 returned this value; these mutants received the pass verdict in the test case execution. The mutants M4, M5, M10 and M11 received a fail verdict, since the parameters of the π computation were modified, resulting in a value different from the expected one. In the case of the mutants M0, M2, M3 and M8, the modifications were in the execution parameters, which interfered with their correct execution; hence, they returned NULL as their results.
Table 2: Results and verdicts generated by 13 PiEstimator mutants.

  Mutant   Result   Pass   Fail
  M0       NULL            X
  M1       3.1416   X
  M2       NULL            X
  M3       NULL            X
  M4       3.0776          X
  M5       3.1312          X
  M6       3.1416   X
  M7       3.1416   X
  M8       NULL            X
  M9       3.1416   X
  M10      3.1408          X
  M11      3.1408          X
  M12      3.1416   X
We evaluated the effectiveness of HadoopTest by the number of detected mutants. When mutation analysis is applied to a system's code, some of the generated mutants are equivalent to the original source code, for different reasons, e.g., the modified part is never executed, or the modified operator yields the same result. We considered as equivalent mutants those that produced the same output as the original system. The initial implementation of HadoopTest showed promising results by identifying all the non-equivalent mutants of PiEstimator.
7 RELATED WORK
Fault case generation is commonly done randomly or manually by a test engineer (Benso et al., 2007; Chandra et al., 2007; Henry, 2009; Bernardi et al., 2012; Jacques-Silva et al., 2006; Lefever et al., 2004). These approaches are inadequate for the dependability testing of MapReduce systems because they disregard the internals of the fault tolerance mechanism, i.e., they ignore the behavior of the fault recovery protocols with respect to the different processing steps; e.g., they inject faults in some machines (failing 3 of 10) for some period (from 30 to 40 seconds). They can evaluate the system behavior, but they cannot test system dependability. Other approaches evaluate dependability by systematically generating fault cases from source code. Such approaches are costly, even after applying pruning techniques, and they limit the fault case generation to few concurrent cases (Joshi et al., 2011; Fu et al., 2004; Marinescu et al., 2010).
Some testing frameworks provide solutions to control distributed components and to validate the system behavior, but they neither inject faults nor consider the components' processing steps (Pan et al., 2010; Dragan et al., 2006; Zhou et al., 2006; de Almeida et al., 2010a). Related fault injection frameworks enable the injection of multiple and various faults, but they do not control the system dynamically to inject faults according to the processing steps (Jacques-Silva et al., 2006; Pham et al., 2011; Stott et al., 2000; Lefever et al., 2004; Hoarau et al., 2007). Moreover, none of the cited frameworks presents results on MapReduce dependability.
MapReduce-related testing frameworks are not applicable to dependability testing. Herriot (Boudnik et al., 2010) provides a set of interfaces that validates small system parts, e.g., a method or a function. Csallner et al. (Csallner et al., 2011) systematically search for ill-defined map and reduce functions, which may be exposed by component faults. Others evaluate the MapReduce execution by log analysis to detect MapReduce performance problems (Tan et al., 2008; Pan et al., 2009; Tan et al., 2009; Huang et al., 2010). Although these approaches evaluate MapReduce functionality and performance, they do not automatically execute fault cases and validate the system's dependability.
8 CONCLUSIONS
We exposed and analyzed the issue of testing MapRe-
duce system dependability. We presented a solution
based on the generation and execution of represen-
tative fault cases. We generated fault cases from
a formal model of the MapReduce fault tolerance
mechanism. We evaluated two modeling approaches
and adopted the Petri Net because its adequacy. We
presented a new way to model distributed compo-
nents using Petri Nets. We modeled the MapRe-
duce components as dynamic items and the indepen-
dence of them with their actions and states. Moreover,
we showed the HadoopTest framework, that executes
ICEIS2013-15thInternationalConferenceonEnterpriseInformationSystems
170
fault cases in real deployment scenarios with minimal overhead and identifies faulty systems.
We identified some bugs in Hadoop with the manual execution of representative fault cases, and we now intend to execute them automatically with HadoopTest. We also plan to automatically generate representative fault cases from the Petri Net model and to test other MapReduce systems, such as HadoopDB and Hive.
ACKNOWLEDGEMENTS
This work is partially sponsored by CAPES and the
experiments presented were carried out using the
Grid’5000 experimental testbed.
REFERENCES
Abouzeid, A., Bajda-Pawlikowski, K., Abadi, D., Silber-
schatz, A., and Rasin, A. (2009). HadoopDB: An
architectural hybrid of MapReduce and DBMS tech-
nologies for analytical workloads. In VLDB - Inter-
national Conference on Very Large Data Bases, pages
922–933. VLDB Endowment.
Ambrosio, A. M., Mattiello-Francisco, F., Vijaykumar,
N. L., de Carvalho, S. V., Santiago, V., and Martins, E.
(2005). A methodology for designing fault injection
experiments as an addition to communication systems
conformance testing. In DSN-W - International Con-
ference on Dependable Systems and Networks Work-
shops, Yokohama, Japan.
Ammann, P. and Offutt, J. (2008). Introduction to Software
Testing. Cambridge University Press.
Arlat, J., Crouzet, Y., Karlsson, J., Folkesson, P., Fuchs, E.,
and Leber, G. (2003). Comparison of Physical and
Software-Implemented Fault Injection Techniques.
IEEE Transactions on Computers, 52(9):1115–1133.
Avizienis, A., Laprie, J.-C., Randell, B., and Landwehr,
C. E. (2004). Basic Concepts and Taxonomy of De-
pendable and Secure Computing. IEEE Transactions
on Dependable and Secure Computing, 1(1):11–33.
Benso, A., Bosio, A., Carlo, S. D., and Mariani, R. (2007).
A Functional Verification based Fault Injection Envi-
ronment. In DFT - International Symposium on Defect
and Fault-Tolerance in VLSI Systems, pages 114–122.
IEEE.
Bernardi, S., Merseguer, J., and Petriu, D. C. (2012). De-
pendability Modeling and Assessment in UML-Based
Software Development. The Scientific World Journal,
2012:1–11.
Boudnik, K., Rajagopalan, B., and Murthy, A. C.
(2010). Herriot. https://issues.apache.org/jira/browse/
HADOOP-6332.
Callou, G., Maciel, P., Tutsch, D., and Araújo, J. (2012).
A Petri Net-Based Approach to the Quantification of
Data Center Dependability. In Pawlewski, P., editor,
Petri Nets - Manufacturing and Computer Science,
page 492. InTech.
Chandra, T. D., Griesemer, R., and Redstone, J. (2007).
Paxos Made Live: An Engineering Perspective. In
PODC - Symposium on Principles of Distributed
Computing, pages 398–407, New York, New York,
USA. ACM Press.
Csallner, C., Fegaras, L., and Li, C. (2011). New
Ideas Track: Testing MapReduce-Style Programs. In
ESEC/FSE’11, Szeged, Hungary.
de Almeida, E. C., Marynowski, J. E., Sunyé, G., and Val-
duriez, P. (2010a). PeerUnit: a framework for testing
peer-to-peer systems. In ASE - International Confer-
ence on Automated Software Engineering, pages 169–
170, New York, USA. ACM.
de Almeida, E. C., Sunyé, G., Traon, Y. L., and Valduriez,
P. (2010b). Testing peer-to-peer systems. ESE - Em-
pirical Software Engineering, 15(4):346–379.
Dean, J. and Ghemawat, S. (2004). MapReduce: Sim-
plified Data Processing on Large Clusters. In OSDI
- USENIX Symposium on Operating Systems Design
and Implementation, pages 137–149, San Francisco,
California. ACM Press.
Dragan, F., Butnaru, B., Manolescu, I., Gardarin, G.,
Preda, N., Nguyen, B., Pop, R., and Yeh, L. (2006).
P2PTester: a tool for measuring P2P platform perfor-
mance. In BDA conference.
Echtle, K. and Leu, M. (1994). Test of fault tolerant dis-
tributed systems by fault injection. In FTPDS - Work-
shop on Fault-Tolerant Parallel and Distributed Sys-
tems, pages 244–251. IEEE.
Fu, C., Ryder, B. G., Milanova, A., and Wonnacott, D.
(2004). Testing of java web services for robustness.
In ISSTA - International Symposium on Software Test-
ing and Analysis, pages 23–33.
Hadoop (2012). The Apache Hadoop. http://hadoop.
apache.org/.
Henry, A. (2009). Cloud Storage FUD: Failure, Uncertainty
and Durability. In FAST - USENIX Symposium on File
and Storage Technologies, San Francisco, California.
Hoarau, W., Tixeuil, S., and Vauchelles, F. (2007). FAIL-
FCI: Versatile fault injection. Future Generation
Computer Systems, 23(7):913–919.
Huang, S., Huang, J., Dai, J., Xie, T., and Huang, B. (2010).
The HiBench benchmark suite: Characterization of
the MapReduce-based data analysis. In ICDEW - In-
ternational Conference on Data Engineering Work-
shops, pages 41–51. IEEE.
Jacques-Silva, G., Drebes, R., Gerchman, J., F. Trindade, J.,
Weber, T., and Jansch-Porto, I. (2006). A Network-
Level Distributed Fault Injector for Experimental
Validation of Dependable Distributed Systems. In
COMPSAC - International Computer Software and
Applications Conference, pages 421–428. IEEE.
Joshi, P., Gunawi, H. S., and Kou (2011). PREFAIL: A
Programmable Tool for Multiple-Failure Injection. In
OOPSLA - Conference on Object-Oriented Program-
ming, Portland, Oregon, USA.
Lefever, R., Joshi, K., Cukier, M., and Sanders, W. (2004).
A global-state-triggered fault injector for distributed
system evaluation. IEEE Transactions on Parallel and
Distributed Systems, 15(7):593–605.
DependabilityTestingofMapReduceSystems
171
Marinescu, P. D., Banabic, R., and Candea, G. (2010).
An extensible technique for high-precision testing
of recovery code. In USENIXATC - Conference
on USENIX Annual Technical Conference, page 23.
USENIX.
Natella, R., Cotroneo, D., Duraes, J. A., and Madeira, H. S.
(2012). On Fault Representativeness of Software Fault
Injection. TSE - IEEE Transactions on Software En-
gineering.
Offutt, A. J. (1994). A Practical System for Mutation Test-
ing: Help for the Common Programmer. In ITC - In-
ternational Test Conference, pages 824–830. IEEE.
Pan, X., Tan, J., Kalvulya, S., Gandhi, R., and Narasimhan,
P. (2009). Blind Men and the Elephant: Piecing To-
gether Hadoop for Diagnosis. In ISSRE - International
Symposium on Software Reliability Engineering.
Pan, X., Tan, J., Kavulya, S., Gandhi, R., and Narasimhan,
P. (2010). Ganesha: blackBox diagnosis of MapRe-
duce systems. In SIGMETRICS Performance Evalua-
tion Review, page 8. ACM Press.
Pham, C., Chen, D., Kalbarczyk, Z., and Iyer, R. K. (2011).
CloudVal: A framework for validation of virtualiza-
tion environment in cloud infrastructure. In DSN -
International Conference on Dependable Systems and
Networks, pages 189–196. IEEE.
Stott, D., Floering, B., Burke, D., Kalbarczpk, Z., and
Iyer, R. (2000). NFTAPE: a framework for assessing
dependability in distributed systems with lightweight
fault injectors. In IPDS - International Computer Per-
formance and Dependability Symposium, pages 91–
100. IEEE Comput. Soc.
Tan, J., Pan, X., Kavulya, S., Gandhi, R., and Narasimhan,
P. (2008). SALSA: analyzing logs as state machines.
In WASL - Conference on Analysis of System Logs,
page 6, CA, USA. USENIX.
Tan, J., Pan, X., Kavulya, S., Gandhi, R., and Narasimhan,
P. (2009). Mochi: visual log-analysis based tools for
debugging hadoop. In Proceedings of the 2009 confer-
ence on Hot topics in cloud computing, pages 18–18.
USENIX Association.
Teradata Corporation (2012). MapReduce, SQL-
MapReduce Resources and Hadoop Integration
Aster Data. http://www.asterdata.com/resources/
mapreduce.php, 09/02/12.
Zhou, Z., Wang, H., Zhou, J., Tang, L., and Li., K. (2006).
Pigeon: A Framework for Testing Peer-to-Peer Mas-
sively Multiplayer Online Games over Heterogeneous
Network. In CCNC - Consumer Communications and
Networking Conference. IEEE.
ICEIS2013-15thInternationalConferenceonEnterpriseInformationSystems
172