Application of LittleDarwin to jTerminal yields 94 mutants, numbered m1 to m94; testing these mutants against the original program with the selected test suite kills 48 of them. Some of these mutants are equivalent to each other, i.e. they produce the same output for all 37 elements of T; when we partition these 48 mutants by this equivalence, we find 31 equivalence classes, and we select one mutant from each class; we let µ be this set. Orthogonally, we consider the set T and select twenty subsets thereof, derived as follows (a sketch of the derivation appears after the list):
• T1, T2, T3, T4, T5: Five distinct test suites obtained from T by removing 5 elements at random.
• T6, T7, T8, T9, T10: Five distinct test suites obtained from T by removing 10 elements.
• T11, T12, T13, T14, T15: Five distinct test suites obtained from T by removing 15 elements.
• T16, T17, T18, T19, T20: Five distinct test suites obtained from T by removing one element.
Whereas mutation coverage is usually quantified by the mutation score (the fraction of killed mutants), in this paper we represent it by the mutation tally, i.e. the set of killed mutants; we compare test suites by means of inclusion relations between their mutation tallies, which, like semantic coverage, defines a partial ordering. We use two mutant generators, hence we obtain two ordering relations between test suites. To compute the semantic coverage of these test suites, we consider two standards of correctness (partial, total) and two specifications: we choose (the functions of) two mutants, M25 and M50, as specifications.
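As a minimal illustration (the tallies below are made-up placeholders, not data from the experiment), comparing test suites by mutation tally amounts to checking set inclusion between their kill sets; since some pairs of tallies are incomparable, the resulting ordering is only partial.

tally = {
    "T1": {"m3", "m7", "m25"},   # hypothetical kill sets
    "T2": {"m3", "m7"},
    "T3": {"m3", "m50"},
}

def at_least_as_effective(a, b):
    # a is at least as effective as b if a's tally includes b's tally
    return tally[b] <= tally[a]

print(at_least_as_effective("T1", "T2"))   # True
print(at_least_as_effective("T1", "T3"))   # False: T1 and T3 are incomparable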
Combining the two mutation-based orderings with the four semantic-coverage orderings (two standards of correctness times two specifications), we obtain six graphs on nodes T1, ..., T20, representing six ordering relations of test suite effectiveness. Due to space limitations, we do not show these graphs, but we show in Table 3 the similarity matrix between these six graphs; the similarity index between two graphs is the ratio of the number of common arcs to the total number of arcs.
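The following sketch of the similarity index is ours; each graph is represented as a set of arcs (ordered pairs of test suites), the arc sets are placeholders, and reading "the total number of arcs" as the number of distinct arcs in the union of the two graphs is our assumption.

def similarity(arcs_a, arcs_b):
    # ratio of common arcs to the total number of distinct arcs
    union = arcs_a | arcs_b
    return len(arcs_a & arcs_b) / len(union) if union else 1.0

g1 = {("T2", "T1"), ("T3", "T1"), ("T4", "T2")}
g2 = {("T2", "T1"), ("T4", "T2"), ("T5", "T1")}
print(similarity(g1, g2))   # 0.5: two common arcs, four distinct arcs in all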
6 CONCLUSION
6.1 Summary
In this paper, we define detector sets for partial correctness and total correctness of a program with respect to a specification, and we use them to define absolute (partial and total) correctness as well as relative (partial and total) correctness. We also use detector sets to define the semantic coverage of a test suite, a measure of effectiveness that reflects the extent to which a test suite can expose the failure of an incorrect program or, equivalently, the level of confidence it gives us in the correctness of a correct program. We illustrate the derivation of the semantic coverage of sample test suites on a benchmark example.
6.2 Assessment
We do not validate our measure of effectiveness empirically, as we do not know what ground truth to validate it against; instead, we prove that it has a number of important properties: monotonicity with respect to the standard of correctness; monotonicity with respect to the refinement of the specification against which the program is tested; and monotonicity with respect to the relative correctness of the program.
Semantic coverage has other noteworthy attributes: it is based on failures rather than faults, hence it is defined formally in terms of objectively observable effects rather than hypothesized causes. Also, semantic coverage defines a partial ordering between test suites, reflecting the fact that test suite effectiveness is itself a partially ordered attribute.
6.3 Threats to Validity
The main limitation of the proposed coverage metric is that it assumes the availability of a specification, and that its derivation requires a detailed semantic analysis of the program. Yet, as a formal measure of test suite effectiveness, semantic coverage can be used to reason analytically about test suites, or to compare test suites even when their semantic coverage cannot be computed; for example, we may be able to
compare Γ_[R,P](T) and Γ_[R,P](T′) for inclusion without necessarily computing them, but by analyzing T, T′, dom(P), dom(R), and dom(R ∩ P).
6.4 Related Work
Coverage metrics of test suites have been the focus of much research over the years, and it is impossible to do justice to all the relevant work in this area (Hemmati, 2015; Gligoric et al., 2015; Andrews et al., 2006); as a first approximation, it is possible to distinguish between code coverage, which focuses on measuring the extent to which a test suite exercises various features of the code, and specification coverage, which focuses on measuring the extent to which a test suite exercises various clauses or use cases of the requirements specification. This distinction can be tied to the orthogonal approaches to test data generation, which use, respectively, structural criteria and functional criteria. Mutation coverage falls somewhat outside this dichotomy, in