Application of LittleDarwin to jTerminal yields 94 mutants, numbered m1 to m94; testing these mutants against the original program with the selected test suite kills 48 of them. Some of these mutants are equivalent to each other, i.e. they produce the same output for all 37 elements of T; when we partition these 48 mutants by this equivalence, we find 31 equivalence classes, and we select one mutant from each class; we let µ be this set. Orthogonally, we consider the set T and select twenty subsets thereof, derived as follows (a sketch of the derivation appears after the list):
• T1, T2, T3, T4, T5: Five distinct test suites obtained from T by removing 5 elements at random.
• T6, T7, T8, T9, T10: Five distinct test suites obtained from T by removing 10 elements.
• T11, T12, T13, T14, T15: Five distinct test suites obtained from T by removing 15 elements.
• T16, T17, T18, T19, T20: Five distinct test suites obtained from T by removing one element.
Whereas mutation coverage is usually quantified by the mutation score (the fraction of killed mutants), in this paper we represent it by the mutation tally, i.e. the set of killed mutants; we compare test suites by means of inclusion relations between their mutation tallies, which, like semantic coverage, defines a partial ordering. We use two mutant generators, hence we obtain two ordering relations between test suites. To compute the semantic coverage of these test suites, we consider two standards of correctness (partial, total) and two specifications: we choose (the functions of) two mutants, M25 and M50, as specifications.
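As a minimal illustration (the tallies below are made-up placeholders, not data from the experiment), comparing test suites by mutation tally amounts to checking set inclusion between their kill sets; since some pairs of tallies are incomparable, the resulting ordering is only partial.

tally = {
    "T1": {"m3", "m7", "m25"},   # hypothetical kill sets
    "T2": {"m3", "m7"},
    "T3": {"m3", "m50"},
}

def at_least_as_effective(a, b):
    # a is at least as effective as b if a's tally includes b's tally
    return tally[b] <= tally[a]

print(at_least_as_effective("T1", "T2"))   # True
print(at_least_as_effective("T1", "T3"))   # False: T1 and T3 are incomparable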
Combining the two mutation-based orderings with the four semantic-coverage orderings (two standards of correctness times two specifications), we obtain six graphs on nodes T1, ..., T20, representing six ordering relations of test suite effectiveness. Due to space limitations, we do not show these graphs, but we show in Table 3 the similarity matrix between these six graphs; the similarity index between two graphs is the ratio of the number of common arcs to the total number of arcs.
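The following sketch of the similarity index is ours; each graph is represented as a set of arcs (ordered pairs of test suites), the arc sets are placeholders, and reading "the total number of arcs" as the number of distinct arcs in the union of the two graphs is our assumption.

def similarity(arcs_a, arcs_b):
    # ratio of common arcs to the total number of distinct arcs
    union = arcs_a | arcs_b
    return len(arcs_a & arcs_b) / len(union) if union else 1.0

g1 = {("T2", "T1"), ("T3", "T1"), ("T4", "T2")}
g2 = {("T2", "T1"), ("T4", "T2"), ("T5", "T1")}
print(similarity(g1, g2))   # 0.5: two common arcs, four distinct arcs in all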
6 CONCLUSION
6.1 Summary
In this paper, we define detector sets for partial correctness and total correctness of a program with respect to a specification, and we use them to define absolute (partial and total) correctness as well as relative (partial and total) correctness. We also use detector sets to define the semantic coverage of a test suite, a measure of effectiveness that reflects the extent to which a test suite can expose the failure of an incorrect program or, equivalently, the level of confidence it gives us in the correctness of a correct program. We illustrate the derivation of the semantic coverage of sample test suites on a benchmark example.
6.2 Assessment
We do not validate our measure of effectiveness empirically, as we do not know what ground truth to validate it against; instead, we prove that it has a number of important properties: monotonicity with respect to the standard of correctness; monotonicity with respect to the refinement of the specification against which the program is tested; and monotonicity with respect to the relative correctness of the program.
Semantic coverage has other noteworthy attributes: it is based on failures rather than faults, hence it is defined formally in terms of objectively observable effects rather than hypothesized causes. Also, semantic coverage defines a partial ordering between test suites, reflecting the fact that test suite effectiveness is itself a partially ordered attribute.
6.3 Threats to Validity
The main limitation of the proposed coverage metric is that it assumes the availability of a specification, and that its derivation requires a detailed semantic analysis of the program. Yet, as a formal measure of test suite effectiveness, semantic coverage can be used to reason analytically about test suites, or to compare test suites even when their semantic coverage cannot be computed; for example, we may be able to
compare Γ_[R,P](T) and Γ_[R,P](T′) for inclusion without necessarily computing them, but by analyzing T, T′, dom(P), dom(R), and dom(R ∩ P).
6.4 Related Work
Coverage metrics of test suites have been the focus of much research over the years, and it is impossible to do justice to all the relevant work in this area (Hemmati, 2015; Gligoric et al., 2015; Andrews et al., 2006); as a first approximation, it is possible to distinguish between code coverage, which focuses on measuring the extent to which a test suite exercises various features of the code, and specification coverage, which focuses on measuring the extent to which a test suite exercises various clauses or use cases of the requirements specification. This distinction can be tied to the orthogonal approaches to test data generation, which use, respectively, structural criteria and functional criteria. Mutation coverage falls somewhat outside this dichotomy, in