Review Time as Predictor for the Quality of Model Inspections
Marian Daun¹ (https://orcid.org/0000-0002-9156-9731), Meenakshi Manjunath¹ (https://orcid.org/0009-0005-6421-1450) and Jennifer Brings² (https://orcid.org/0000-0002-2918-5008)
¹ Center of Robotics, Technical University of Applied Sciences Würzburg-Schweinfurt, Schweinfurt, Germany
² University of Duisburg-Essen, Essen, Germany
Keywords:
Validation, Software Inspection, Model Inspection, Ad Hoc Review, Review Time, Controlled Experiment.
Abstract:
Software inspections play an important part in ensuring the quality of software development. With the emergence of model-based development approaches, there is also a need for model inspections to ensure the correctness of model-based artifacts. In practice, ad hoc inspections are regularly conducted, often by new and rather inexperienced colleagues who are asked spontaneously to review an artifact of interest. The use of novices, such as trainees or student assistants, allows shorter review cycles at reduced costs. The quality of these ad hoc inspections is commonly attributed to different factors, often related to the reviewer. Increasing review time can be seen as an indicator that the reviewer takes the review seriously. Furthermore, with more time spent, it can be assumed that more defects will be found. In this paper, we report the results of an experiment on ad hoc model inspections. Our results show that, contrary to these assumptions and to empirical findings from inspections of textual documents, the review time a reviewer decides to spend on a review has no significant influence on the effectiveness of ad hoc model inspections.
1 INTRODUCTION
During software engineering processes, manual qual-
ity assurance is regularly mandated and conducted
at different stages and with different intensities (ISO
26262-1, 2011; ISO/IEC 25030, 2007). While formal inspections are completed only from time to time, brief visual inspections by co-workers or the developers themselves are done regularly. These visual reviews of the requirements
(Miller et al., 1998), the code (de Almeida et al.,
2003), or other development artifacts (Laitenberger
et al., 2000) aim at improving the overall quality of
the software product to be developed.
In the past, research has been conducted on soft-
ware inspections and other formal validation tech-
niques. Different validation techniques have regularly been compared to ad hoc reviews, often showing that these techniques are more effective and efficient than ad hoc reviews. However, most validation techniques require more resources than ad hoc reviews (e.g., multiple reviewers are expected to be present, and multiple intense reviewing days are scheduled), which leads to increased costs. Hence, such validation techniques are used at
distinct points but not commonly throughout a devel-
opment project. Therefore, there is still a need for ad
hoc reviews.
Ad hoc reviews are often conducted with inexperienced reviewers; these are typically available student assistants working in the company unit, or newer colleagues or trainees that have not yet been assigned to specific duties. As inexperienced reviewers tend to be more error-prone than pricey and rarely available experts in the field, ad hoc reviews are often distributed among a larger set of reviewers. However, there is a need to assess the quality of the inspection results, as the developers typically want to focus on the more reliable reviews that provide fewer false positives. Established research identifies three major factors influencing the quality of an inspection:
1. The reviewer, i.e. personal factors such as years
of experience, degrees achieved, etc.
2. The review subject, i.e. the way the requirements,
code, models, other artifacts under review have
been prepared and are structured.
3. The review process, i.e. the technique applied, the
number of resources and time spent.
A common misconception about reviews is that
the time spent for reviewing (i.e. the review time)
influences the quality of a review. Particularly, it
is assumed that ad hoc reviews suffer from too little time being spent when a co-worker's code, text, model, etc. is reviewed on the fly. This means that reviews conducted in less time are assumed to be less effective and efficient than reviews conducted in more time. If that is the case, we could prioritize reviews on which more review time was spent over reviews on which less time was spent.
In this paper, we investigate this effect for visual
inspections of models (i.e. ad hoc model inspections),
guided by the research goal: Does review time influence ad hoc model inspections?
At this point, we briefly lay out our definition of
review time. Review time refers to the duration taken
by a reviewer to conduct a review task. However, we do not refer to predefined review times (i.e., a reviewer being asked to spend two hours on a review). Instead, we refer to the actual time used without given restrictions. This means that we want to find out how the actual time a reviewer deems sufficient for a reviewing task, or is willing to invest in a review, influences the quality of the review. This approach makes it challenging to offer straightforward guidelines on the recommended duration for a review of uncompromised quality. Instead, we want to contribute to the question whether the time taken is an indicator for finding good reviewers or for estimating whether a review is likely to be of good quality.
To answer the research goal, an experiment on the
influence of review time on ad hoc model inspections
was conducted and is reported in this paper. In total, 200 participants conducted ad hoc model inspections, each performing multiple inspection tasks drawn from eight different tasks, which yielded 520 data sets. After filtering, 497 data sets were used for analyzing the influence of review time on effectiveness, reviewers' confidence, and efficiency of ad hoc model inspections. The results show that there is no discernible effect on effectiveness, while confidence and efficiency are influenced by review time. However, the results also show that neither increasing nor decreasing review time leads to increased efficiency and confidence. Moderate review time leads to significantly higher efficiency and confidence compared to very short and very long review times.
This paper is structured as follows. Section 2 in-
troduces background information and related work on
ad hoc model inspections and related studies. Sub-
sequently, Section 3 introduces the study design and
Section 4 the study results. The major findings and
threats to validity are discussed in Section 5. Finally,
Section 6 concludes the paper.
2 RELATED WORK
2.1 Ad Hoc Model Inspections
Different inspection techniques have been proposed
to support validation of various software develop-
ment artifacts. Among others, formal inspection ((Fa-
gan, 1976; Fagan, 1986), often referred to as Fagan-
Inspection), walkthroughs (Boehm, 1987), N-fold in-
spection (Martin and Tsai, 1990), checklist-based in-
spection (Thelin et al., 2003), perspective-based read-
ing (Shull et al., 2000) and scenario-based reading
(Regnell et al., 2000) have gained much attention
and are regularly investigated for their effectiveness
and efficiency (e.g., (Miller et al., 1998; Basili et al.,
1996)). This is commonly done by comparing these techniques with each other, and frequently by comparison with ad hoc inspections (or ad hoc reviews). Ad hoc inspections are typically defined as
inspections that are conducted without any guidance
for the reviewer and without a prescribed process. Ba-
sically, the reviewer is just given the review artifact
and the task to validate its correctness (Porter et al.,
1995; O.Oladele and O. Adedayo, 2014).
The majority of existing studies is interested in in-
specting requirements artifacts or code artifacts. The
inspection of requirement artifacts is, for instance, in-
vestigated by Miller et al. (Miller et al., 1998). In
a controlled experiment, trained student participants
conduct inspections for error detection in natural lan-
guage requirements specifications. Basili et al. (Basili et al., 1996) report on a controlled experiment with professional software developers comparing different inspection techniques for requirements documents. They found that perspective-based reading is significantly more effective than other inspection techniques for requirements documents. Other ex-
amples for requirements inspection studies were con-
ducted by He and Carver (He and Carver, 2006), Mal-
donado et al. (Maldonado et al., 2006), Laitenberger
et al. (Laitenberger et al., 2001), Berling and Rune-
son (Berling and Runeson, 2003), and Sabaliauskaite
et al. (Sabaliauskaite et al., 2004), which often come
to comparable findings. Code inspections are, among
others, studied by Porter et al. (Porter et al., 1997),
Laitenberger (Laitenberger, 1998), Almeida et al. (de
Almeida et al., 2003), or Dunsmore et al. (Dunsmore
et al., 2003).
While requirements inspections and code inspections have been investigated most heavily, model inspections have also been the focus of various stud-
ies. For instance, de Mello et al. (d. Mello et al.,
2012) investigate the inspection of feature models.
Conradi et al. (Conradi et al., 2003) and Laitenberger
et al. (Laitenberger et al., 2000) report experiments
on inspections of UML models.
In previous work, we proposed dedicated review
models to improve model inspections (Daun et al.,
2014). Results showed that Message Sequence Charts
are a favorable modeling language for conducting
reviews. Particularly, we conducted experiments comparing the use of Message Sequence Charts with functional specification languages (Daun et al., 2019b), investigating review models merging multiple specifications (Daun et al., 2019a), the representation format for inconsistencies shown in the review model (Daun et al., 2017), and the use of instance- vs. type-level specifications (Daun et al., 2020). By analyzing the data gathered from all these experiments, we tried to identify predictors for a reviewer's performance, but concluded so far that commonly suggested predictors like experience and confidence are not reliable predictors of the quality of model inspections (Daun et al., 2021).
In summary, ad hoc inspections are regularly used
as comparison for more advanced inspection tech-
niques under investigation. Although other inspec-
tion techniques typically win the comparison with
ad hoc inspections, controlled experiments exist that found that ad hoc inspections do not perform worse than systematic inspection techniques (Lanubile and Visaggio, 2000) or at least not worse than all other systematic inspection techniques (Porter et al., 1995; Porter and Votta, 1998). In addition, ad hoc model inspections benefit from low resource consumption. This allows conducting ad hoc
inspections frequently whenever validation is needed.
Therefore, ad hoc inspections are regularly used in in-
dustry (although typically not as the only inspection
technique used during the entire project).
2.2 Influence Factors
Beside experiments comparing effectiveness and effi-
ciency of different inspection techniques, studies ex-
ist aiming at investigating other influence factors for
inspections. For instance, a variety of studies inves-
tigates the influence the used notation has. Particular
emphasis is typically given to the set of symbols used.
Figl et al. (Figl et al., 2013a) investigate the influence
of the symbol sets used in modeling languages. In a
study with 136 participants, it is shown that perceptual distinctiveness and semiotic clarity of the used symbols affect model comprehension. Particularly, the correctness of model understanding, the cognitive load imposed, and the time needed for understanding the models vary significantly. In (Figl et al.,
2013b), Figl et al. report another study with 155 stu-
dent participants, showing that aesthetic design of the
used notational elements can improve the model un-
derstanding of process models. Nugroho conducted
an experiment with graduate students, finding out that
for UML diagrams, the level of detail has a signifi-
cant influence (Nugroho, 2009). Lucia et al. report
in (Lucia et al., 2008) results of two controlled ex-
periments with Bachelor and Master students show-
ing that UML class diagrams are significantly easier
to comprehend than ER diagrams. Bavota et al. con-
ducted a study to compare UML class diagrams and
ER diagrams regarding their impact on model com-
prehension (Bavota et al., 2011). They also showed
that UML class diagrams are in general easier to com-
prehend than ER diagrams.
Further studies look more broadly at influence factors other than modeling-language-related issues. Mendling et al. report in (Mendling et al., 2012)
a study investigating the influence of model and per-
sonal factors on the comprehension of process mod-
els. Major findings are that comprehension is hin-
dered by the annotation of additional semantic in-
formation, and that theoretical knowledge as well as
modeling experience support model comprehension.
In (Zimoch et al., 2017), Zimoch et al. report a study
investigating the influence of process modeling expe-
rience on model comprehension. They conclude that
experience in general has a positive impact on model
comprehension. However, when the complexity of the models under investigation is considerably increased, the impact of experience increasingly vanishes.
In conclusion, the literature identifies three major factors influencing effectiveness and efficiency of model
inspections: (a) the inspection technique applied, (b)
syntax related issues of the inspected model, and (c)
experience as personal factor.
Although time has not yet been widely investi-
gated, some existing studies on these three factors also report an influence of the time needed. As this was in no case the major point of investigation, findings are typically only briefly summarized. From these findings, it can be concluded that time influences the model inspection in such a way that the more time needed, the better the inspection result. This is often interwoven with the investigation of the inspection technique (e.g., (Lanubile and Visaggio, 2000)). This means that in many cases it has been
found out that inspection techniques taking more time
are advantageous. For instance, perspective-based
reading is more time-consuming than checklist-based
inspections and also found to be more effective (e.g.,
(Basili et al., 1996)). However, review time is per-
ceived as something costly that should be minimized in industrial practice (cf. e.g., (Doolan, 1992)).
3 STUDY DESIGN
For experiment reporting, we keep to established best
practices (Wohlin et al., 2000; Jedlitschka et al.,
2008), which helps, among other things, to increase comprehensibility and comparability with other experiments.
3.1 Goal and Research Questions
As stated in Section 1, the overall goal of this study is to investigate whether review time influences
ad hoc model inspections. To achieve this goal, we
investigate the effects review time has on effective-
ness, confidence, and efficiency of ad hoc model
inspections. Therefore, we define three research
questions:
RQ1: Does review time influence the effective-
ness of ad hoc model inspections?
RQ2: Does review time influence the reviewers’
confidence in ad hoc model inspections?
RQ3: Does review time influence the efficiency of ad
hoc model inspections?
3.2 Variables
Review Time: is measured in seconds and is defined
as the time used for ad hoc model inspection of one
model. As review time is defined on an open-end ratio
scale, we also define review time intervals to allow for
better comparison of means. Therefore, the ratio scale
is transferred into an ordinal scale. (Note that while we typically use five-minute intervals, the resulting scale is not an interval scale, as we use a catch-all group for all review times greater than thirty-five minutes.)
Effectiveness: is measured as the ratio of correct re-
view decisions made compared to all review decisions
made. Hence, effectiveness is measured on a ratio
scale from 0 (i.e. 0% correct decisions made) to 1
(i.e. 100% correct decisions made).
Confidence: is defined as the average confidence the
reviewer claims for the review decisions made. Con-
fidence is measured on a 5-point semantic differential scale, where 1 means very unconfident and 5 very confident. However, as confidence is calculated as the mean over all review decisions made for a model, con-
fidence is defined on a ratio scale from 1 to 5.
Efficiency: is measured as the average time used for a
correct decision made. Efficiency is measured in sec-
onds and defined on an open-end ratio scale. Note that
efficiency is not independent of review time. Nev-
ertheless, we are interested in efficiency, as indus-
try typically aims at efficient reviews (cf. (Doolan,
1992)). Thus, it is of interest to determine whether
a high efficiency is bound to certain ranges of review
time.
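For readability, the three response variables can also be written as formulas; the notation below is ours and merely restates the definitions given above.

\[
  \mathrm{Effectiveness} = \frac{\#\,\text{correct review decisions}}{\#\,\text{review decisions made}}, \qquad
  \mathrm{Confidence} = \frac{1}{n} \sum_{i=1}^{n} c_i, \qquad
  \mathrm{Efficiency} = \frac{\text{review time in seconds}}{\#\,\text{correct review decisions}}
\]

where c_i is the 5-point confidence rating given for the i-th of the n review decisions made for one model; for efficiency, lower values (fewer seconds per correct decision) thus correspond to more efficient reviews.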
3.3 Hypotheses
Based on the research questions, we define the
following null and alternative hypotheses:
H1_0: There is no effect of review time on effectiveness.
H1_A1: Increasing review time leads to increased effectiveness.
H1_A2: Increasing review time leads to decreased effectiveness.
H1_A3: Review time influences effectiveness, but the effect is not linear.
H2_0: There is no effect of review time on confidence.
H2_A1: Increasing review time leads to increased confidence.
H2_A2: Increasing review time leads to decreased confidence.
H2_A3: Review time influences confidence, but the effect is not linear.
H3_0: There is no effect of review time on efficiency.
H3_A1: Increasing review time leads to increased efficiency.
H3_A2: Increasing review time leads to decreased efficiency.
H3_A3: Review time influences efficiency, but the effect is not linear.
3.4 Participants
The experiment was conducted with student partici-
pants. The students are mostly graduate students en-
rolled in degree programs for applied computer sci-
ence and business information systems. Participants
were recruited in software engineering courses. Due
to the courses' syllabi, it was ensured that the participants have sufficient knowledge of validation activities and the modeling languages investigated; in addition, they were trained to conduct ad hoc model inspections for these kinds of models. In total, 200
students participated in the experiment. As each par-
ticipant conducted multiple ad hoc model inspections,
a total of 520 data sets were collected.
3.5 Experiment Material
As experiment material, excerpts from industrial
specifications have been used. These have been re-
vised to match the intended size and complexity, to remove intellectual property, and to avoid the need for in-depth domain expertise. Models were
chosen to fit approximately one page. As modeling
languages, Message Sequence Charts (International
Telecommunication Union, 2016), automata (de Al-
faro and Henzinger, 2001), and functional architec-
ture models (Albers et al., 2016) have been used.
3.6 Experiment Design and Procedure
The experiment was conducted online and was de-
signed to last about 30–40 minutes, in which the par-
ticipants conducted multiple ad hoc model inspec-
tions back to back. In addition, a post hoc question-
naire was used to collect demographic data.
3.7 Analysis Procedure
The data sets were filtered to exclude those indicating non-serious participation (i.e., 23 data sets were removed). The remaining data sets were analyzed by calculating common descriptive statistics.
To estimate the overall influence of review time on effectiveness, confidence, and efficiency, Pearson's correlation and simple regression were conducted. For this, the original review time in seconds was used.
As mentioned above, we sorted review time into intervals (1–5 minutes, 5–10 minutes, 10–15 minutes, 15–20 minutes, 20–25 minutes, 25–30 minutes, 30–35 minutes, and >35 minutes). These review time intervals were used for conducting one-way independent analysis of variance (ANOVA) to compare effectiveness, confidence, and efficiency for different review times. In addition, Levene's test was used to determine differences in variances. In case of heterogeneous variances, the Welch-Test was conducted. Post hoc analyses for significant ANOVA results included the Bonferroni-Test and the Games-Howell-Test.
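To illustrate, a minimal Python sketch of this analysis pipeline follows. It assumes the per-inspection data is available in a CSV file named inspections.csv with columns review_time (in seconds), effectiveness, confidence, and efficiency; file and column names are our assumption, not part of the original study.

import pandas as pd
from scipy import stats

# Load the (hypothetical) per-inspection data and drop data sets with less than
# one minute of review time (non-serious participation, cf. Section 4.1).
df = pd.read_csv("inspections.csv")
df = df[df["review_time"] >= 60]

# Overall influence: Pearson correlation and simple linear regression on raw seconds.
r, p = stats.pearsonr(df["review_time"], df["effectiveness"])
reg = stats.linregress(df["review_time"], df["effectiveness"])
print(f"r = {r:.3f} (p = {p:.3f}), R^2 = {reg.rvalue ** 2:.3f}")

# Group review time into five-minute intervals with a catch-all group above 35 minutes.
bins = [60, 300, 600, 900, 1200, 1500, 1800, 2100, float("inf")]
labels = ["1-5", "5-10", "10-15", "15-20", "20-25", "25-30", "30-35", ">35"]
df["interval"] = pd.cut(df["review_time"], bins=bins, labels=labels, include_lowest=True)

# One-way independent ANOVA over the intervals, plus Levene's test for homogeneity of variances.
groups = [g["effectiveness"].values for _, g in df.groupby("interval", observed=True)]
f_stat, p_anova = stats.f_oneway(*groups)
lev_stat, p_lev = stats.levene(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_anova:.3f}; Levene: p = {p_lev:.3f}")

# The Welch-Test and the Games-Howell/Bonferroni post hoc tests are not part of scipy;
# a dedicated statistics package (e.g., pingouin) or R would be used for those steps.

The same grouping and tests are applied analogously to confidence and efficiency.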
4 RESULTS
4.1 Descriptive Statistics
The descriptive statistics for explanatory and response
variables are given in Table 1. In addition, Fig. 1
shows the distribution of the data. We excluded 23
data sets from the investigation. In these 23 cases, the
Table 1: Descriptive Statistics.
                 Effectiveness  Confidence  Efficiency   Review Time
N Valid          497            496         497          497
N Missing        23             24          23           23
Mean             0.561          3.260       47.605       591.21
Std. Err.        0.009          0.046       5.257        25.264
Median           0.571          3.417       4.511        463
Std. Dev.        0.21           1.029       117.202      563.218
Variance         0.044          1.059       13736.368    317214.028
Min.             0              1           0            61
Max.             1              5           1310         5973
Percentile 25    0.417          2.333       1.849        257.5
Percentile 50    0.571          3.416       4.511        463
Percentile 75    0.714          4           51.3         756
ad hoc model inspection was conducted in less than
one minute. Investigation of the results substantiated
the assumption that this means that participants did
not partake seriously. As can be seen for effective-
ness and confidence, values are distributed across the
entire defined scale. On average, an ad hoc model inspection resulted in 56% correct review decisions and a confidence of 3.26, which is above the expected value of 3. On average, participants needed 591 seconds (i.e. almost 10 minutes) for an entire review and 48 seconds per correct decision.
Table 2 shows descriptive statistics for the effec-
tiveness, confidence, and efficiency in relation to the
defined review time intervals. As can be seen, most ad
hoc model inspections took 5–10 minutes, while the
absolute majority of reviews were conducted within
1–15 minutes. The distribution for the review time
intervals is visualized using box plots in Fig. 2.
4.2 Hypotheses Tests
4.2.1 Effectiveness
For analyzing the influence of review time on the
effectiveness of ad hoc model inspections, we con-
ducted Pearson correlation, simple regression, and
analysis of variance.
Pearson correlation shows no correlation between
effectiveness and review time (r = .033, p = .461).
A simple regression shows no significant regression
equation (F(1, 495) = .545, p = .461), with R² = .001. Thus, review time cannot be used to explain effectiveness.
To investigate the effect of review time on effec-
tiveness, we also conducted a one-way independent
ANOVA. As outlined in Section 4.1, we grouped re-
view time into intervals of five minutes to investigate intergroup effects. There was no significant effect of review time on effectiveness (F(7, 489) = .973, p = .450). Levene's test indicated equal variances (F = .997, p = .437), thus we consider the ANOVA reliable. Consequently, we cannot reject H1_0.
Table 2: Descriptive Statistics for Effectiveness, Confidence, and Efficiency Depending on Review Time in Minutes.
Review Time  N  Mean  Std. Dev.  Std. Err.  95% CI Lower  95% CI Upper  Min.  Max.
Effectiveness 1-5 151 0.576 0.214 0.017 0.542 0.610 0.000 0.917
5-10 159 0.553 0.213 0.017 0.519 0.586 0.083 1.000
10-15 108 0.562 0.211 0.020 0.522 0.603 0.083 1.000
15-20 44 0.518 0.173 0.026 0.465 0.570 0.000 0.917
20-25 13 0.521 0.234 0.065 0.380 0.662 0.167 0.857
25-30 9 0.598 0.217 0.072 0.431 0.765 0.333 0.917
30-35 6 0.706 0.230 0.094 0.465 0.948 0.333 1.000
>35 7 0.605 0.164 0.062 0.453 0.756 0.375 0.750
Total 497 0.561 0.210 0.009 0.543 0.580 0.000 1.000
Confidence 1-5 150 2.693 0.960 0.078 2.539 2.848 1.000 5.000
5-10 159 3.271 1.047 0.083 3.107 3.435 1.167 5.000
10-15 108 3.749 0.806 0.078 3.595 3.902 1.429 5.000
15-20 44 3.775 0.583 0.088 3.598 3.952 2.583 5.000
20-25 13 3.449 0.841 0.233 2.941 3.957 1.917 5.000
25-30 9 4.163 0.667 0.222 3.650 4.675 3.167 5.000
30-35 6 4.169 0.579 0.236 3.561 4.777 3.143 4.750
>35 7 2.104 1.165 0.440 1.027 3.181 1.250 3.833
Total 496 3.260 1.029 0.046 3.169 3.351 1.000 5.000
Efficiency 1-5 151 34.687 24.744 2.014 30.708 38.665 0.000 144.500
5-10 159 44.364 60.634 4.809 34.867 53.862 0.539 359.000
10-15 108 27.425 65.453 6.298 14.939 39.910 0.932 366.000
15-20 44 29.345 77.977 11.755 5.638 53.052 0.000 357.667
20-25 13 20.147 55.942 15.516 -13.659 53.953 2.017 206.167
25-30 9 41.493 111.426 37.142 -44.157 127.142 2.308 338.600
30-35 6 167.394 181.572 74.127 -23.155 357.942 3.783 392.000
>35 7 782.188 439.677 166.182 375.554 1188.821 9.014 1310.000
Total 497 47.605 117.202 5.257 37.276 57.934 0.000 1310.000
4.2.2 Confidence
As for effectiveness, we conducted Pearson correla-
tion, simple regression, and analysis of variance.
Confidence is positively related to review time.
Pearson correlation shows a small effect of r = .169
that is highly significant at p < .001. A simple re-
gression was calculated to predict confidence based
on review time. A significant regression equation was
found (F(1, 494) = 14.486, p < .001), with a small R² = .028.
A one-way independent ANOVA shows a signifi-
cant effect of review time on confidence (F(7, 488) =
18.039, p < .001). As Levene’s test indicated unequal
variances (F = 6.007, p < .001), we conducted the
Welch-Test, which confirmed the findings from the
ANOVA (F(7, 35.834) = 19.716, p < .001). Hence,
we can reject H2_0 and accept H2_A. We used post hoc tests to investigate the differences between the groups; as the different tests yielded comparable results, we focus on the results of the Games-Howell-Test, which meets the preconditions best (see Figure 3).
First, the two groups with the least time con-
sumption significantly differ from groups with mod-
erately more time consumption: Confidence for a re-
view time of 1–5 minutes (M = 2.693, SD = 0.96)
significantly differs from a review time of 5–10 min-
utes (M = 3.271, SD = 1.047), 10–15 minutes (M =
3.749, SD = 0.806), 15–20 minutes (M = 3.775, SD =
0.583), 25–30 minutes (M = 4.163, SD = 0.667), and
30–35 minutes (M = 4.169, SD = 0.579). In addition,
confidence for a review time of 5–10 minutes (M =
3.271, SD = 1.047) also significantly differs from a
review time of 10–15 minutes (M = 3.749, SD =
0.806), 15–20 minutes (M = 3.775, SD = 0.583), and
25–30 minutes (M = 4.163,SD = 0.667).
Second, confidence for a review time of more
than 35 minutes (M = 2.104, SD = 1.165) signifi-
cantly differs from a review time of 25–30 minutes
(M = 4.163, SD = 0.667) and 30–35 minutes (M =
4.169, SD = 0.579). Hence, confidence does not increase monotonically with review time: while it increases for shorter review times, it decreases for longer review times. Therefore, we can accept neither H2_A1 nor H2_A2, but accept H2_A3.
4.2.3 Efficiency
Again, we conducted Pearson correlation, simple re-
gression, and analysis of variance.
Efficiency is positively related to review time by a
large effect of r = .602. The effect is highly signifi-
cant at p < .001. A simple regression found a signif-
icant regression equation (F(1, 495) = 280.887, p <
.001) with R² = .362. Participants' predicted effi-
ciency is equal to 26.418 + .125(ReviewTime) sec-
onds/correct answer when review time is measured in
seconds.
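Written as a display equation (our formatting of the regression model reported above):

\[
  \widehat{\mathrm{Efficiency}} = 26.418 + 0.125 \cdot \mathrm{ReviewTime}
\]

with efficiency in seconds per correct answer and review time in seconds.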
A one-way independent ANOVA shows a signifi-
cant effect of review time on efficiency (F(7,489) =
96.898, p < .001). However, Levene’s test indicated
unequal variances (F = 38.88, p < .001). Therefore,
we conducted the Welch-Test, which confirmed the
findings from the ANOVA (F(7, 34.57) = 3.819, p = .004). Thus, we can reject H3_0 and accept H3_A.
Figure 1: Scatterplots.
We used post hoc tests to investigate the differences between the groups; as the different tests yielded comparable results, we only report results of the Games-Howell-Test. Significant differences in efficiency exist only between large review times (i.e. >35 min) and lower review times. Namely: Effi-
ciency for a review time of more than 35 minutes
(M = 782s, SD = 440s) significantly differs from a re-
view time of 1–5 minutes (M = 35s, SD = 25s), 5–10
minutes (M = 44s, SD = 61s), 10–15 minutes (M =
27s, SD = 65s), 15–20 minutes (M = 29s, SD = 78s),
20–25 minutes (M = 20s, SD = 56s), and 25–30 min-
utes (M = 41s, SD = 111s). (Note that other tests, such as the Bonferroni-Test, also find significant differences between more than 35 minutes and 30–35 minutes, and between 30–35 minutes and all other groups; however, due to the heterogeneity of variances, we keep to the interpretation of the Games-Howell-Test, although the differences of means are indeed large for these group comparisons.)
Thus, efficiency does not increase monotonically with review time: while it increases for lower review times, for large review times efficiency decreases. Therefore, we reject H3_A1. Considering the significant results, we accept H3_A2. Taking the increasing means for lower review times into account, we also accept H3_A3.
5 DISCUSSION
5.1 Major Findings
With respect to the three investigated research questions, we can state three major findings regarding the influence of review time on ad hoc model inspections:
Regarding RQ1, we found that review time does not influence the effectiveness of ad hoc model inspections. This is based on the absence of significant correlations or regressions, and on the fact that the one-way independent ANOVA did not yield significant results.
Regarding RQ2, we can state that review time does have an influence on the reviewers' confidence in ad hoc model inspections. Very small effects are found using correlation and regression analyses. In addition, analysis of variance showed a significant difference between groups. We found that while in principle confidence increases with increasing review time, a very long review time results in the lowest confidence. Regarding significance, it can be stated that review times of about 10–35 minutes for one model investigated during ad hoc model inspection relate to a higher confidence than review times of 1–10 minutes and above 35 minutes.
Regarding RQ3, we found that review time does influence the efficiency of ad hoc model inspections.
Correlation and regression analysis found large sta-
tistically significant effects, which is not surprising
considering the inherent relationship between review
time and efficiency. However, analysis of variance
shows a more fine-grained view. Large review times
of more than 35 minutes (considering the Bonferroni-
Test, maybe also of 30–35 minutes) result in signifi-
cantly less efficient ad hoc model inspections.
Figure 2: Box Plots.
In summary, we can confirm that review time has
an influence on ad hoc model inspections. However,
contrary to other investigations of factors influencing inspections, we could not find any influence of
review time on the effectiveness of the review. Re-
garding the influence on the reviewer’s confidence
and the efficiency of the review, we also found ev-
idence that contradicts assumptions from the related
work. Increasing review time does not necessarily
lead to better reviews (i.e. reviews, where the re-
viewer is more confident in decision-making and that
are conducted with a higher efficiency). There seems
to be a point where more review time leads to worse
reviews (in terms of confidence and efficiency). Nev-
ertheless, for small and moderate review times of up
to 30 minutes for inspecting one model, we can sub-
stantiate claims that increasing review time leads to
better reviews (although only in terms of confidence
and efficiency but not for effectiveness).
5.2 Threats to Validity
5.2.1 Threats to Internal Validity
In online experiments, a threat of participants losing
interest, dropping out of the experiment, or corrupting
the measurements by idling must always be consid-
ered. To lower this threat, we designed the experiment
such that two to three ad hoc model inspections could
be conducted within a time frame of 30–40 minutes.
We assumed this time frame to be sufficiently short for participants not to lose interest. As the results show, this was the case for the majority of participants. However, some participants took far more time, as can also be seen. Taking the respective participants' results for effectiveness into account, we assume that this does not result from idling (which would have made exclusion of the data sets necessary), but from participants trying to show their best performance in the study. While this would be a threat when comparing different inspection techniques, it is not in our case. Particularly, this provides a good simulation of increased effort spent compared to moderate and minimal effort.
Since volunteers may bias the results because
they are generally more motivated than the average
student, we decided to conduct the experiments as
a mandatory part of our requirements engineering
courses and explicitly decided to give no bonuses
or credits as motivation. Therefore, the experiments
were designed to also serve as teaching material,
achieving a learning effect on model perception. This
was supported by extensive debriefings in class. The
experimental setup was carefully adapted to comply with national laws as well as the university's ethics regulations on student participation in software engineering experiments.
(a) Results of Games-Howell-Test for Confidence
MD SE Sig. 95% Conf. Interval
1-5min
5-10min -0.577 0.114 0.000 -0.926 -0.229
10-15min -1.055 0.110 0.000 -1.392 -0.718
15-20min -1.082 0.118 0.000 -1.445 -0.718
20-25min -0.755 0.246 0.106 -1.616 0.105
25-30min -1.469 0.236 0.002 -2.352 -0.587
30-35min -1.475 0.249 0.011 -2.544 -0.407
>35min 0.590 0.447 0.865 -1.304 2.483
5-10min
1-5min 0.577 0.114 0.000 0.229 0.926
10-15min -0.478 0.114 0.001 -0.825 -0.131
15-20min -0.504 0.121 0.001 -0.877 -0.132
20-25min -0.178 0.247 0.995 -1.041 0.685
25-30min -0.892 0.237 0.047 -1.775 -0.009
30-35min -0.898 0.251 0.103 -1.964 0.168
>35min 1.167 0.448 0.296 -0.726 3.059
10-15min
1-5min 1.055 0.110 0.000 0.718 1.392
5-10min 0.478 0.114 0.001 0.131 0.825
15-20min -0.026 0.117 1.000 -0.388 0.336
20-25min 0.300 0.246 0.913 -0.560 1.160
25-30min -0.414 0.235 0.657 -1.296 0.468
30-35min -0.420 0.249 0.697 -1.489 0.649
>35min 1.645 0.447 0.091 -0.249 3.539
15-20min
1-5min 1.082 0.118 0.000 0.718 1.445
5-10min 0.504 0.121 0.001 0.132 0.877
10-15min 0.026 0.117 1.000 -0.336 0.388
20-25min 0.326 0.249 0.882 -0.540 1.192
25-30min -0.388 0.239 0.730 -1.273 0.497
30-35min -0.394 0.252 0.759 -1.458 0.671
>35min 1.671 0.449 0.085 -0.220 3.562
20-25min
1-5min 0.755 0.246 0.106 -0.105 1.616
5-10min 0.178 0.247 0.995 -0.685 1.041
10-15min -0.300 0.246 0.913 -1.160 0.560
15-20min -0.326 0.249 0.882 -1.192 0.540
25-30min -0.714 0.322 0.384 -1.803 0.375
30-35min -0.720 0.332 0.423 -1.892 0.453
>35min 1.345 0.498 0.230 -0.546 3.236
25-30min
1-5min 1.469 0.236 0.002 0.587 2.352
5-10min 0.892 0.237 0.047 0.009 1.775
10-15min 0.414 0.235 0.657 -0.468 1.296
15-20min 0.388 0.239 0.730 -0.497 1.273
20-25min 0.714 0.322 0.384 -0.375 1.803
30-35min -0.006 0.325 1.000 -1.182 1.170
>35min 2.059 0.493 0.032 0.166 3.952
30-35min
1-5min 1.475 0.249 0.011 0.407 2.544
5-10min 0.898 0.251 0.103 -0.168 1.964
10-15min 0.420 0.249 0.697 -0.649 1.489
15-20min 0.394 0.252 0.759 -0.671 1.458
20-25min 0.720 0.332 0.423 -0.453 1.892
25-30min 0.006 0.325 1.000 -1.170 1.182
>35min 2.065 0.500 0.033 0.149 3.981
>35min
1-5min -0.590 0.447 0.865 -2.483 1.304
5-10min -1.167 0.448 0.296 -3.059 0.726
10-15min -1.645 0.447 0.091 -3.539 0.249
15-20min -1.671 0.449 0.085 -3.562 0.220
20-25min -1.345 0.498 0.230 -3.236 0.546
25-30min -2.059 0.493 0.032 -3.952 -0.166
30-35min -2.065 0.500 0.033 -3.981 -0.149
(b) Results of Games-Howell-Test for Efficiency
MD SE Sig. 95% Conf. Interval
1-5 min
5-10min -9.678 5.213 0.582 -25.638 6.283
10-15min 7.262 6.612 0.956 -13.113 27.636
15-20min 5.342 11.927 1.000 -32.536 43.219
20-25min 14.540 15.646 0.977 -41.759 70.839
25-30min -6.806 37.197 1.000 -153.765 140.153
30-35min -132.707 74.154 0.650 -477.626 212.212
>35min -747.501 166.194 0.042 -1466.889 -28.113
5-10 min
1-5min 9.678 5.213 0.582 -6.283 25.638
10-15min 16.940 7.924 0.394 -7.313 41.192
15-20min 15.019 12.701 0.934 -24.911 54.949
20-25min 24.217 16.244 0.801 -32.856 81.291
25-30min 2.871 37.452 1.000 -144.035 149.778
30-35min -123.029 74.282 0.713 -467.509 221.450
>35min -737.823 166.252 0.045 -1457.102 -18.545
10-15 min
1-5min -7.262 6.612 0.956 -27.636 13.113
5-10min -16.940 7.924 0.394 -41.192 7.313
15-20min -1.920 13.336 1.000 -43.608 39.768
20-25min 7.278 16.745 1.000 -50.587 65.142
25-30min -14.068 37.672 1.000 -160.955 132.818
30-35min -139.969 74.394 0.607 -484.075 204.137
>35min -754.763 166.302 0.040 -1473.946 -35.579
15-20 min
1-5min -5.342 11.927 1.000 -43.219 32.536
5-10min -15.019 12.701 0.934 -54.949 24.911
10-15min 1.920 13.336 1.000 -39.768 43.608
20-25min 9.198 19.466 1.000 -54.598 72.994
25-30min -12.148 38.958 1.000 -159.353 135.058
30-35min -138.049 75.053 0.625 -480.081 203.983
>35min -752.843 166.598 0.041 -1471.472 -34.213
20-25 min
1-5min -14.540 15.646 0.977 -70.839 41.759
5-10min -24.217 16.244 0.801 -81.291 32.856
10-15min -7.278 16.745 1.000 -65.142 50.587
15-20min -9.198 19.466 1.000 -72.994 54.598
25-30min -21.346 40.252 0.999 -169.903 127.211
30-35min -147.247 75.733 0.575 -487.449 192.956
>35min -762.041 166.905 0.038 -1480.118 -43.963
25-30 min
1-5min 6.806 37.197 1.000 -140.153 153.765
5-10min -2.871 37.452 1.000 -149.778 144.035
10-15min 14.068 37.672 1.000 -132.818 160.955
15-20min 12.148 38.958 1.000 -135.058 159.353
20-25min 21.346 40.252 0.999 -127.211 169.903
30-35min -125.901 82.911 0.781 -459.546 207.745
>35min -740.695 170.282 0.042 -1454.028 -27.362
30-35 min
1-5min 132.707 74.154 0.650 -212.212 477.626
5-10min 123.029 74.282 0.713 -221.450 467.509
10-15min 139.969 74.394 0.607 -204.137 484.075
15-20min 138.049 75.053 0.625 -203.983 480.081
20-25min 147.247 75.733 0.575 -192.956 487.449
25-30min 125.901 82.911 0.781 -207.745 459.546
>35min -614.794 181.965 0.102 -1329.378 99.790
>35 min
1-5min 747.501 166.194 0.042 28.113 1466.889
5-10min 737.823 166.252 0.045 18.545 1457.102
10-15min 754.763 166.302 0.040 35.579 1473.946
15-20min 752.843 166.598 0.041 34.213 1471.472
20-25min 762.041 166.905 0.038 43.963 1480.118
25-30min 740.695 170.282 0.042 27.362 1454.028
30-35min 614.794 181.965 0.102 -99.790 1329.378
Figure 3: Results of Games-Howell-Test.
5.2.2 Threats to Construct Validity
The experiment setup was inspired by published and well-received experiments from the re-
lated work on the investigation of different inspection
techniques. The experiment material was created in
close collaboration with domain experts from indus-
try and academia. In addition, pretest groups have
been used to ensure that experiment material is com-
prehensible and adequate for the given tasks.
5.2.3 Threats to External Validity
External validity in software engineering experiments
is mainly concerned with the question of generaliz-
ability to industrial application (H
¨
ost et al., 2000).
Therefore, the use of student participants is often seen
as problematic (e.g., (Runeson, 2003)). However,
other studies have found that student results are
generalizable (e.g., (Tichy, 2000)). In addition, ad
hoc inspections are in industrial practice often con-
ducted by newer employees, for which generalizabil-
ity from students often holds (Salman et al., 2015).
For the experiment material, we ensured generaliz-
ability in close collaboration with industry profes-
sionals by adapting excerpts from industry specifications.
To improve generalizability, we used models in
three different modeling languages. While we, thus,
do not need to limit our results to one single mod-
eling language, there is a risk that review time, effectiveness, confidence, and efficiency significantly differ between the different modeling languages. While
we did not recognize such effects, we ensured that
each participant conducted ad hoc model inspection
tasks for models in different modeling languages. To
avoid crossover effects having an impact, we used
randomization to distribute the order of inspection
tasks equally over all participants.
5.2.4 Threats to Conclusion Validity
The major threat regarding conclusion validity is typ-
ically the use of too small sample sizes, which hinder
reaching statistical significance. The use of 497 in-
cluded data sets is to be considered large compared to
other investigations from the related work. Another
threat to conclusion validity lies in transforming ratio
scale data into ordinal scale data, as has been done
for review time to allow for conducting analysis of
variance. There is the risk, that using another interval
leads to different results. Therefore, we conducted a
second investigation, grouping review time into inter-
vals of one minute. Results do not considerably differ.
Particularly, no new significant differences could be
found (e.g., differences between two medium review
time intervals). Therefore, we assume our grouping
adequate.
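As an illustration of this robustness check, a short sketch under the same assumptions as the sketch in Section 3.7 (hypothetical file and column names) could look like this:

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical input, same assumed layout as in the Section 3.7 sketch.
df = pd.read_csv("inspections.csv")
df = df[df["review_time"] >= 60]

# Re-bin review time into one-minute intervals instead of five-minute intervals.
edges = np.arange(60, df["review_time"].max() + 60, 60)
df["interval_1min"] = pd.cut(df["review_time"], bins=edges, include_lowest=True)

# Re-run the one-way ANOVA on the finer grouping and compare with the five-minute result.
groups = [g["effectiveness"].values
          for _, g in df.groupby("interval_1min", observed=True)
          if len(g) > 1]
print(stats.f_oneway(*groups))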
5.3 Inferences
Most important is the insight that no empirical evi-
dence could be found that review time has any influ-
ence on the effectiveness of ad hoc model inspections.
As effectiveness best relates to the number of defects
found during an inspection, we can conclude that ad
hoc model inspections can be conducted in short time
and increasing review time does not lead to any con-
siderable advantage regarding the number of defects
found. However, it must be noted that our finding is limited to the inspection of models that are reduced in size and complexity and limited to approximately one page. Hence, we cannot state that this finding will also hold for the review of specifications with multiple models or complex models consisting of a multitude of diagrams. Therefore, further investigations would be needed. However, for our definition of review time (i.e., the time the reviewer chooses to spend), we assume there might be no significant influence either; if there were, then also in our experiment the reviewers who were more thorough and spent more review time should have achieved better results, no matter the size of the materials.
Regarding the influence of review time on con-
fidence and efficiency, we found large review times
leading to less confident decisions and low efficiency.
Therefore, it seems, counterintuitively, better to re-
strict review time to a moderate amount of less than
30 minutes. In addition, review times of less than
10 minutes also lead to low confidence, although effi-
ciency is not influenced.
In summary, review time cannot be used to esti-
mate the quality of a review. As no significant differences between review times of 10–30 minutes are discernible, the ad hoc model inspection can be
kept brief. We assume that this is transferable to larger
inspections as well: keeping the review time short but
not too short maximizes the result of an ad hoc model
inspection. As review time does not influence the
effectiveness of the ad hoc model inspection, it can
be assumed that brief visual inspections conducted in
very short time are helpful in the development pro-
cess and, thus, should be made use of whenever pos-
sible. In particular, the insight regarding the missing
influence of review time on effectiveness allows for
a multitude of potential application scenarios for ad
hoc model inspections during a software development
project.
6 CONCLUSION
In this paper, we reported an experiment to investigate
the influence of review time on ad hoc model inspec-
tions. In the experiment we analyzed the influence
of review time on effectiveness, confidence, and effi-
ciency. The experiment was conducted with 200 par-
ticipants that conducted a total of 520 ad hoc model
inspections. Most important, analysis of the data sets
showed that review time does not have a significant
influence on the effectiveness of ad hoc model inspec-
tions. For confidence, we found a small influencing
effect, for efficiency a high effect.
Analysis of variance (ANOVA) showed that re-
view time leads to significantly different confidence
and efficiency. Post hoc tests showed that a short
review time of up to ten minutes negatively influ-
ences the confidence the reviewer has in the decisions
made. In contrast to assumptions made in the related
work, we found that large review times also have
a negative influence. For a review time greater than
thirty minutes, confidence and efficiency are significantly lower than for moderate review time.
REFERENCES
Albers, K., Beck, S., Büker, M., Daun, M., MacGregor, J.,
Salmon, A., Weber, R., and Weyer, T. (2016). System
function networks. In Advanced Model-Based Engi-
neering of Embedded Systems, Extensions of the SPES
2020 Methodology, pages 119–144. Springer.
Basili, V. R., Green, S., Laitenberger, O., Lanubile, F.,
Shull, F., Sørumgård, S., and Zelkowitz, M. V.
(1996). The Empirical Investigation of Perspective-
Based Reading. Empirical Software Engineering,
1(2):133–164.
Bavota, G., Gravino, C., Oliveto, R., De Lucia, A., Tor-
tora, G., Genero, M., and Cruz-Lemus, J. A. (2011).
Identifying the Weaknesses of UML Class Diagrams
during Data Model Comprehension. In Model Driven
Engineering Languages and Systems, pages 168–182,
Berlin, Heidelberg. Springer.
Berling, T. and Runeson, P. (2003). Evaluation of a per-
spective based review method applied in an industrial
setting. IEE Proceedings - Software, 150(3):177–184.
Boehm, B. W. (1987). Industrial software metrics top 10
list. IEEE Software, 4(5):84–85.
Conradi, R., Mohagheghi, P., Arif, T., Hegde, L. C., Bunde,
G. A., and Pedersen, A. (2003). Object-Oriented
Reading Techniques for Inspection of UML Models –
An Industrial Experiment. In ECOOP 2003 – Object-
Oriented Programming, pages 483–500, Berlin, Hei-
delberg. Springer.
d. Mello, R. M., Teixeira, E. N., Schots, M., Werner, C.
M. L., and Travassos, G. H. (2012). Checklist-based
inspection technique for feature models review. In
2012 Sixth Brazilian Symposium on Software Compo-
nents, Architectures and Reuse, pages 140–149.
Daun, M., Brings, J., Krajinski, L., and Weyer, T. (2019a).
On the benefits of using dedicated models in val-
idation processes for behavioral specifications. In
IEEE/ACM Int. Conf. on Software and System Pro-
cesses, pages 44–53. IEEE.
Daun, M., Brings, J., Obe, P. A., and Stenkova, V. (2021).
Reliability of self-rated experience and confidence as
predictors for students’ performance in software engi-
neering. Empirical Software Engineering, 26(4):80.
Daun, M., Brings, J., and Weyer, T. (2017). On the impact
of the model-based representation of inconsistencies
to manual reviews: Results from a controlled experi-
ment. In Conceptual Modeling: 36th Int. Conf., pages
466–473. Springer.
Daun, M., Brings, J., and Weyer, T. (2020). Do instance-
level review diagrams support validation processes of
cyber-physical system specifications: results from a
controlled experiment. In Int. Conf. on Software and
System Processes, pages 11–20.
Daun, M., Weyer, T., and Pohl, K. (2014). Validating the
functional design of embedded systems against stake-
holder intentions. In Int. Conf. on Model-Driven En-
gineering and Software Development, pages 333–339.
IEEE.
Daun, M., Weyer, T., and Pohl, K. (2019b). Improving man-
ual reviews in function-centered engineering of em-
bedded systems using a dedicated review model. Soft-
ware and Systems Modeling, 18(6):3421–3459.
de Alfaro, L. and Henzinger, T. A. (2001). Interface au-
tomata. SIGSOFT Softw. Eng. Notes, 26(5):109–120.
de Almeida, J. R., Camargo, J. B., Basseto, B. A., and Paz,
S. M. (2003). Best practices in code inspection for
safety-critical software. IEEE Software, 20(3):56–63.
Doolan, E. P. (1992). Experience with fagan’s inspec-
tion method. Software: Practice and Experience,
22(2):173–182.
Dunsmore, A., Roper, M., and Wood, M. (2003). Practi-
cal code inspection techniques for object-oriented sys-
tems: an experimental comparison. IEEE Software,
20(4):21–29.
Fagan, M. E. (1976). Design and Code Inspections to Re-
duce Errors in Program Development. IBM Systems
Journal, 15(3):182–211.
Fagan, M. E. (1986). Advances in Software Inspections.
IEEE Trans. Software Eng., 12(7):744–751.
Figl, K., Mendling, J., and Strembeck, M. (2013a). The In-
fluence of Notational Deficiencies on Process Model
Comprehension. Journal of the Association for Infor-
mation Systems, 14(6).
Figl, K., Recker, J., and Mendling, J. (2013b). A study
on the effects of routing symbol design on process
model comprehension. Decision Support Systems,
54(2):1104–1118.
He, L. and Carver, J. C. (2006). PBR vs. checklist: a repli-
cation in the n-fold inspection context. In Travassos,
G. H., Maldonado, J. C., and Wohlin, C., editors, 2006
International Symposium on Empirical Software En-
gineering, pages 95–104. ACM.
Höst, M., Regnell, B., and Wohlin, C. (2000). Using Stu-
dents as Subjects-A Comparative Study of Students
and Professionals in Lead-Time Impact Assessment.
Empirical Software Engineering, 5(3):201–214.
International Telecommunication Union (2016). Recom-
mendation Z.120: Message Sequence Chart (MSC).
ISO 26262-1 (2011). Road vehicles – Functional safety – Part 1: Vocabulary.
ISO/IEC 25030 (2007). Software engineering – Software product Quality Requirements and Evaluation (SQuaRE) – Quality requirements.
Jedlitschka, A., Ciolkowski, M., and Pfahl, D. (2008). Re-
porting experiments in software engineering. In Shull,
F., Singer, J., and Sjøberg, D. I. K., editors, Guide
to Advanced Empirical Software Engineering, pages
201–228. Springer London.
Laitenberger, O. (1998). Studying the effects of code in-
spection and structural testing on software quality. In
International Symposium on Software Reliability En-
gineering, pages 237–246.
Laitenberger, O., Atkinson, C., Schlich, M., and El Emam,
K. (2000). An experimental comparison of reading
techniques for defect detection in UML design docu-
ments. Journal of Systems and Software, 53(2):183–
204.
Laitenberger, O., Emam, K. E., and Harbich, T. G. (2001).
An Internally Replicated Quasi-Experimental Com-
parison of Checklist and Perspective-Based Reading
of Code Documents. IEEE Trans. Software Eng.,
27(5):387–421.
Lanubile, F. and Visaggio, G. (2000). Evaluating Defect
Detection Techniques for Software Requirements In-
spections. page 25.
Lucia, A. D., Gravino, C., Oliveto, R., and Tortora, G.
(2008). Data model comprehension: An empirical
comparison of er and uml class diagrams. In 2008
16th IEEE Int. Conf. on Program Comprehension,
pages 93–102.
Maldonado, J. C., Carver, J., Shull, F., Fabbri, S. C. P. F., Dória, E., Martimiano, L. A. F., Mendonça, M. G., and Basili, V. R. (2006).
Perspective-Based Reading: A Replicated Experiment
Focused on Individual Reviewer Effectiveness. Em-
pirical Software Engineering, 11(1):119–142.
Martin, J. and Tsai, W. T. (1990). N-fold inspection:
A requirements analysis technique. Commun. ACM,
33(2):225–232.
Mendling, J., Strembeck, M., and Recker, J. (2012). Fac-
tors of process model comprehension—Findings from
a series of experiments. Decision Support Systems,
53(1):195–206.
Miller, J., Wood, M., and Roper, M. (1998). Further Ex-
periences with Scenarios and Checklists. Empirical
Software Engineering, 3(1):37–64.
Nugroho, A. (2009). Level of detail in UML models and
its impact on model comprehension: A controlled
experiment. Information and Software Technology,
51(12):1670–1685.
Oladele, R. O. and Adedayo, H. O. (2014). On Empirical
Comparison of Checklist-based Reading and Adhoc
Reading for Code Inspection. International Journal
of Computer Applications, 87(1):35–39.
Porter, A. A., Siy, H. P., Toman, C. A., and Votta, L. G.
(1997). An experiment to assess the cost-benefits
of code inspections in large scale software develop-
ment. IEEE Transactions on Software Engineering,
23(6):329–346.
Porter, A. A. and Votta, L. G. (1998). Comparing Detection
Methods For Software Requirements Inspections: A
Replication Using Professional Subjects. Empirical
Software Engineering, 3(4):355–379.
Porter, A. A., Votta, L. G., and Basili, V. R. (1995). Com-
paring Detection Methods for Software Requirements
Inspections: A Replicated Experiment. IEEE Trans.
Software Eng., 21(6):563–575.
Regnell, B., Runeson, P., and Thelin, T. (2000). Are the
Perspectives Really Different? Further Experimen-
tation on Scenario-Based Reading of Requirements.
Empirical Software Engineering, 5(4):331–356.
Runeson, P. (2003). Using Students as Experiment Sub-
jects – An Analysis on Graduate and Freshmen Stu-
dent Data. In International Conference on Empiri-
cal Assessment & Evaluation in Software Engineer-
ing, pages 95–102.
Sabaliauskaite, G., Kusumoto, S., and Inoue, K. (2004).
Assessing defect detection performance of interacting
teams in object-oriented design inspection. Informa-
tion & Software Technology, 46(13):875–886.
Salman, I., Misirli, A. T., and Juristo, N. (2015). Are stu-
dents representatives of professionals in software en-
gineering experiments? In IEEE/ACM International
Conference on Software Engineering, volume 1, pages
666–676.
Shull, F., Rus, I., and Basili, V. R. (2000). How Perspective-
Based Reading Can Improve Requirements Inspec-
tions. IEEE Computer, 33(7):73–79.
Thelin, T., Runeson, P., and Wohlin, C. (2003). An ex-
perimental comparison of usage-based and checklist-
based reading. IEEE Transactions on Software Engi-
neering, 29(8):687–704.
Tichy, W. F. (2000). Hints for Reviewing Empirical Work in
Software Engineering. Empirical Software Engineer-
ing, 5(4):309–312.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M., Regnell, B., and Wesslén, A. (2000). Experimentation in
software engineering: An introduction, volume 6 of
Kluwer international series in software engineering.
Kluwer Academic, Boston, Mass.
Zimoch, M., Pryss, R., Probst, T., Schlee, W., and Reichert,
M. (2017). Cognitive Insights into Business Process
Model Comprehension: Preliminary Results for Expe-
rienced and Inexperienced Individuals. In Enterprise,
Business-Process and Information Systems Modeling,
pages 137–152, Cham. Springer.