Table 1: Test cases, sizes, and their interestingness condition.
Test Case           Size                          Condition
Chromium A (html)   31,626 bytes / 1,058 lines    ASSERTION FAILED: i < size()
Chromium B (html)   34,323 bytes / 3,769 lines    ASSERTION FAILED: static_cast<unsigned>(offsetInNode) <= layoutTextFragment->start() + layoutTextFragment->fragmentLength()
Chromium C (html)   48,503 bytes / 1,706 lines    ASSERTION FAILED: !current.value()->isInheritedValue()
WebKit A (html)     23,364 bytes / 959 lines      ASSERTION FAILED: newLogicalTop >= logicalTop
WebKit B (html)     30,417 bytes / 1,374 lines    ASSERTION FAILED: willBeComposited == needsToBeComposited(layer)
WebKit C (html)     36,051 bytes / 3,791 lines    ASSERTION FAILED: !needsStyleRecalc() || !document().childNeedsStyleRecalc()
Example A (array)   8 elements                    {5, 8} ⊆ c ∧ (2 ∈ c ∨ 7 ∉ c)
Example B (array)   8 elements                    {1, 2, 3, 4, 5, 6, 7, 8} ⊆ c
Example C (array)   8 elements                    {1, 2, 3, 4, 6, 8} ⊆ c ∧ {5, 7} ⊈ c
Example D (array)   100 elements                  {2x | 0 ≤ x < 50} ⊆ c
its running time arbitrarily).
For each of the two browser engine targets, we
have selected 3 fuzzer-generated test cases – HTML
with a mixture of SVG, CSS, and JavaScript – that
triggered various assertion failures in the code. The
average size of the tests was around 2000 lines, with
the shortest test case being 959 lines long and the
longest one growing up to 3769 lines. For the mock
SUT, we have used 4 manually crafted test cases, each consisting of an array of numbers and a condition that decides the interestingness of a given subset. Three of them were small tests imported from Zeller’s reference examples, while the fourth, larger one (with 100 elements) was crafted specifically for a corner case where reduction becomes possible only when the granularity is increased to its maximum. Table 1 details the sizes and the interestingness conditions of all test cases.
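To make the mock SUT concrete, the snippet below sketches how such interestingness conditions can be expressed as simple predicates over a candidate subset; the function names and the exact encoding are our own illustration, not the harness used in the experiments.

    # A minimal sketch of mock interestingness conditions, assuming the reducer
    # passes candidate configurations as sequences of array elements.
    # Function names and the encoding below are illustrative only.

    def interesting_example_a(config):
        """Example A: {5, 8} subset of c and (2 in c or 7 not in c)."""
        c = set(config)
        return {5, 8} <= c and (2 in c or 7 not in c)

    def interesting_example_d(config):
        """Example D: every even number below 100 must be kept."""
        c = set(config)
        return {2 * x for x in range(50)} <= c

    # The full 8-element input of Example A is interesting, ...
    assert interesting_example_a([1, 2, 3, 4, 5, 6, 7, 8])
    # ... while a subset missing 5 and 8 is not.
    assert not interesting_example_a([1, 2, 3, 4])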
In the evaluation of the algorithm variants that we
have proposed in the previous sections, our first step
was to check the sizes of the minimized test cases.
The reasons for the investigation were two-fold: first,
as the manually crafted test cases have exactly one 1-
minimal subset, they acted as a sanity check. Second,
as 1-minimal test cases are not necessarily unique in
general, we wanted to see how the algorithm variants
affect the size of the result in practice, i.e., on the real
browser test cases.
We have examined the effect of the reduce step
variations (“subsets first”, “complements first”, and
“complements only”; as described in Section 3.3) on
the algorithm and the effect of the two parallelization
approaches (“parallel” and “combined”; as described
in Sections 3.1 and 3.2) independently. It turned out
that all reduce variants of the sequential algorithm
gave exactly the same result not only for the example tests but for the real browser cases as well. Detailed
investigations have revealed that “reduce to subset” is
a very rare step in practice (it happened in the first
iteration only, with n = 2, when subsets and comple-
ments are equal anyway) and because of the sequen-
tial nature of the algorithm, the “reduce to comple-
ment” steps were taken in the same order by all algo-
rithm variants. Parallel variants – executed with the
loops limited to 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48,
and 64 parallel bodies – did not have a big impact on the result size either, although differences of 0–3 lines
appeared. (Clearly, there were multiple failing com-
plements in some iterations and because of the paral-
lelism, the algorithms did not always choose the same
step as the sequential variant.) Since those 0–3 lines
mean less than 1% deviation in relative terms, we can
conclude that none of the algorithm variants deterio-
rate the results significantly. As a last step of this first
experiment, we have evaluated all combinations of the algorithm variants as well (i.e., parallel algorithms with reduce step vari-
ations), and their results are summarized in Table 2.
The numbers show that not only did the results not get significantly worse, but for WebKit B the parallel variants even found a smaller 1-minimal test case than the
sequential algorithm.
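For reference, the sketch below illustrates what the three reduce step orderings mean within a single iteration; it is a simplified, hypothetical rendering of ddmin's reduce step based on the descriptions in Section 3.3, not the implementation evaluated here.

    # Simplified sketch of one reduce step of ddmin over a configuration split
    # into n chunks. Only the candidate ordering differs between the variants;
    # `test` is assumed to return "FAIL" for interesting (failure-inducing) candidates.

    def split(config, n):
        """Split config into n roughly equal consecutive chunks."""
        size, rem = divmod(len(config), n)
        chunks, start = [], 0
        for i in range(n):
            end = start + size + (1 if i < rem else 0)
            chunks.append(config[start:end])
            start = end
        return chunks

    def reduce_step(config, n, test, variant):
        subsets = split(config, n)
        complements = [[x for chunk in subsets[:i] + subsets[i + 1:] for x in chunk]
                       for i in range(n)]
        if variant == "subsets first":        # original ddmin ordering
            candidates = subsets + complements
        elif variant == "complements first":  # try the larger complements first
            candidates = complements + subsets
        else:                                 # "complements only": skip subset tests
            candidates = complements
        for candidate in candidates:
            if test(candidate) == "FAIL":     # still interesting: reduce to it
                return candidate
        return None                           # no progress; caller increases granularity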
In our second experiment, we have investigated
the effect of the algorithm variants on the number
of iterations and test evaluations needed to reach 1-
minimality. In Table 3(a), we can observe the effects
of the variation of the reduce steps in the sequential al-
gorithm on all test cases. Although the number of iter-
ations never changed, the number of test evaluations
dropped considerably everywhere. Even if we used
caching⁷ of test results, the "complements first" vari-
ant saved 12–17% of test evaluations on real browser
test cases, while “complements only” saved 35–40%.
The test cases of the mock SUT show a bit more
scattered results with the “complements first” variant
(0–23% reduction in test evaluations), but “comple-
ments only” achieves a similar 36–41%. As test re-
sults are re-used heavily by the original “subsets first”
approach, if we also interpreted those cache hits as test evaluations, then that would mean a dramatic
90% reduction in some cases. Table 3(b) contains the
results of the parallelization variants – yet again with 1, 2, 3, 4,
6, 8, 12, 16, 24, 32, 48, and 64 parallel loop bodies
allowed. These results might look disappointing or
⁷ In the context of Delta Debugging, caching is the memoization and reuse of the outcomes of the testing function on already investigated test cases. The caching mechanism is not incorporated in the definition of the algorithm, since it can be left to the test function to realize.
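As an illustration of such a cache (a sketch under our own assumptions, not the memoization code used for the measurements), the testing function can be wrapped so that repeated candidates do not trigger new test evaluations:

    # A minimal memoizing wrapper around a testing function, assuming test
    # outcomes are deterministic and candidates can be turned into hashable keys.

    def cached(test):
        outcomes = {}
        def wrapper(config):
            key = tuple(config)
            if key not in outcomes:
                outcomes[key] = test(config)  # real, possibly expensive evaluation
            return outcomes[key]              # repeated candidate: cache hit
        return wrapper

    # Usage: wrap the interestingness check before handing it to the reducer,
    # e.g. test = cached(interesting_example_a)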