Table 1: Test cases, sizes, and their interestingness condition.
Test Case           Size                          Condition
Chromium A (html)   31,626 bytes / 1,058 lines    ASSERTION FAILED: i < size()
Chromium B (html)   34,323 bytes / 3,769 lines    ASSERTION FAILED: static_cast<unsigned>(offsetInNode) <= layoutTextFragment->start() + layoutTextFragment->fragmentLength()
Chromium C (html)   48,503 bytes / 1,706 lines    ASSERTION FAILED: !current.value()->isInheritedValue()
WebKit A (html)     23,364 bytes / 959 lines      ASSERTION FAILED: newLogicalTop >= logicalTop
WebKit B (html)     30,417 bytes / 1,374 lines    ASSERTION FAILED: willBeComposited == needsToBeComposited(layer)
WebKit C (html)     36,051 bytes / 3,791 lines    ASSERTION FAILED: !needsStyleRecalc() || !document().childNeedsStyleRecalc()
Example A (array)   8 elements                    {5, 8} ⊆ c ∧ (2 ∈ c ∨ 7 ∉ c)
Example B (array)   8 elements                    {1, 2, 3, 4, 5, 6, 7, 8} ⊆ c
Example C (array)   8 elements                    {1, 2, 3, 4, 6, 8} ⊆ c ∧ {5, 7} ⊈ c
Example D (array)   100 elements                  {2x | 0 ≤ x < 50} ⊆ c
its running time arbitrarily).
For each of the two browser engine targets, we
have selected 3 fuzzer-generated test cases – HTML
with a mixture of SVG, CSS, and JavaScript – that
triggered various assertion failures in the code. The
average size of the tests was around 2000 lines, with
the shortest test case being 959 lines long and the
longest one growing up to 3769 lines. For the mock
SUT, we have used 4 manually crafted test cases, each consisting of an array of numbers and a condition that decides the interestingness of a given subset. Three of them were small tests imported from Zeller’s reference examples, while the fourth, larger one (with 100 elements) was crafted specifically for a corner case where reduction becomes possible only when the granularity is increased to its maximum. Table 1 details the sizes and the interestingness conditions of all test cases.
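To make the mock SUT concrete, the snippet below sketches how such interestingness conditions can be expressed as simple predicates over a candidate subset; the function names and the exact encoding are our own illustration, not the harness used in the experiments.

    # A minimal sketch of mock interestingness conditions, assuming the reducer
    # passes candidate configurations as sequences of array elements.
    # Function names and the encoding below are illustrative only.

    def interesting_example_a(config):
        """Example A: {5, 8} subset of c and (2 in c or 7 not in c)."""
        c = set(config)
        return {5, 8} <= c and (2 in c or 7 not in c)

    def interesting_example_d(config):
        """Example D: every even number below 100 must be kept."""
        c = set(config)
        return {2 * x for x in range(50)} <= c

    # The full 8-element input of Example A is interesting, ...
    assert interesting_example_a([1, 2, 3, 4, 5, 6, 7, 8])
    # ... while a subset missing 5 and 8 is not.
    assert not interesting_example_a([1, 2, 3, 4])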
In the evaluation of the algorithm variants that we
have proposed in the previous sections, our first step
was to check the sizes of the minimized test cases.
The reasons for the investigation were two-fold: first,
as the manually crafted test cases have exactly one 1-
minimal subset, they acted as a sanity check. Second,
as 1-minimal test cases are not necessarily unique in
general, we wanted to see how the algorithm variants
affect the size of the result in practice, i.e., on the real
browser test cases.
We have examined the effect of the reduce step
variations (“subsets first”, “complements first”, and
“complements only”; as described in Section 3.3) on
the algorithm and the effect of the two parallelization
approaches (“parallel” and “combined”; as described
in Sections 3.1 and 3.2) independently. It turned out
that all reduce variants of the sequential algorithm
gave exactly the same result not only for the example tests but for the real browser cases as well. Detailed
investigations have revealed that “reduce to subset” is
a very rare step in practice (it happened in the first
iteration only, with n = 2, when subsets and comple-
ments are equal anyway) and because of the sequen-
tial nature of the algorithm, the “reduce to comple-
ment” steps were taken in the same order by all algo-
rithm variants. Parallel variants – executed with the
loops limited to 1, 2, 3, 4, 6, 8, 12, 16, 24, 32, 48,
and 64 parallel bodies – did not have a big impact on the result size either, although differences of 0–3 lines
appeared. (Clearly, there were multiple failing com-
plements in some iterations and because of the paral-
lelism, the algorithms did not always choose the same
step as the sequential variant.) Since those 0–3 lines
mean less than 1% deviation in relative terms, we can
conclude that none of the algorithm variants deterio-
rate the results significantly. As a last step of this first
experiment, we have evaluated all combinations of the algorithm variants as well (i.e., parallel algorithms with reduce step vari-
ations), and their results are summarized in Table 2.
The numbers show that not only did the results not get significantly worse, but for WebKit B the parallel variants even found a smaller 1-minimal test case than the
sequential algorithm.
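For reference, the sketch below illustrates what the three reduce step orderings mean within a single iteration; it is a simplified, hypothetical rendering of ddmin's reduce step based on the descriptions in Section 3.3, not the implementation evaluated here.

    # Simplified sketch of one reduce step of ddmin over a configuration split
    # into n chunks. Only the candidate ordering differs between the variants;
    # `test` is assumed to return "FAIL" for interesting (failure-inducing) candidates.

    def split(config, n):
        """Split config into n roughly equal consecutive chunks."""
        size, rem = divmod(len(config), n)
        chunks, start = [], 0
        for i in range(n):
            end = start + size + (1 if i < rem else 0)
            chunks.append(config[start:end])
            start = end
        return chunks

    def reduce_step(config, n, test, variant):
        subsets = split(config, n)
        complements = [[x for chunk in subsets[:i] + subsets[i + 1:] for x in chunk]
                       for i in range(n)]
        if variant == "subsets first":        # original ddmin ordering
            candidates = subsets + complements
        elif variant == "complements first":  # try the larger complements first
            candidates = complements + subsets
        else:                                 # "complements only": skip subset tests
            candidates = complements
        for candidate in candidates:
            if test(candidate) == "FAIL":     # still interesting: reduce to it
                return candidate
        return None                           # no progress; caller increases granularity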
In our second experiment, we have investigated
the effect of the algorithm variants on the number
of iterations and test evaluations needed to reach 1-
minimality. In Table 3(a), we can observe the effects
of the variation of the reduce steps in the sequential al-
gorithm on all test cases. Although the number of iter-
ations never changed, the number of test evaluations
dropped considerably everywhere. Even if we used
caching⁷ of test results, the "complements first" vari-
ant saved 12–17% of test evaluations on real browser
test cases, while “complements only” saved 35–40%.
The test cases of the mock SUT show a bit more
scattered results with the “complements first” variant
(0–23% reduction in test evaluations), but “comple-
ments only” achieves a similar 36–41%. As test re-
sults are re-used heavily by the original “subsets first”
approach, if we also interpreted those cache hits as test evaluations, then that would mean a dramatic
90% reduction in some cases. Table 3(b) contains the
results of the parallelization variants – yet again with 1, 2, 3, 4,
6, 8, 12, 16, 24, 32, 48, and 64 parallel loop bodies
allowed. These results might look disappointing or
⁷ In the context of Delta Debugging, caching is the memoization and reuse of the outcomes of the testing function on already investigated test cases. The caching mechanism is not incorporated in the definition of the algorithm, since it can be left to the test function to realize.
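As an illustration of such a cache (a sketch under our own assumptions, not the memoization code used for the measurements), the testing function can be wrapped so that repeated candidates do not trigger new test evaluations:

    # A minimal memoizing wrapper around a testing function, assuming test
    # outcomes are deterministic and candidates can be turned into hashable keys.

    def cached(test):
        outcomes = {}
        def wrapper(config):
            key = tuple(config)
            if key not in outcomes:
                outcomes[key] = test(config)  # real, possibly expensive evaluation
            return outcomes[key]              # repeated candidate: cache hit
        return wrapper

    # Usage: wrap the interestingness check before handing it to the reducer,
    # e.g. test = cached(interesting_example_a)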