Metrics to Estimate Model Comprehension: Towards a Reliable
Quantification Framework
Bastian Tenbergen¹ (https://orcid.org/0000-0002-0145-4800) and Marian Daun² (https://orcid.org/0000-0002-9156-9731)
¹Department of Computer Science, State University of New York at Oswego, Oswego, U.S.A.
²Center of Robotics, Technical University of Applied Sciences Würzburg-Schweinfurt, Schweinfurt, Germany
Keywords:
Model-Based Software Engineering, Graphical Representations, Model Comprehension, Model Quality,
Empirical Study.
Abstract:
Model-driven development has established itself as one of the core practices in software engineering. Increased quality demands, shorter times to market, and the growing mission-criticality of software systems have led software engineering practitioners to use not only formal but also semi-formal models, particularly graphical diagrams, to express the system under development in ways that facilitate collaboration, validation and verification, as well as configuration and runtime monitoring. However, what does and does not constitute a “good” model, i.e., a model that is fit for a practical purpose? While some model quality frameworks exist, most lack the ability to concretely quantify and thereby objectively differentiate a “good” model from a “poor” one, i.e., to distinguish models that can be easily understood by the model reader. Without being able to reliably produce easily comprehensible models, training new team members during onboarding or educating software engineering students is dramatically hindered. In this paper, we report on a research trajectory towards reliably measuring the comprehensibility of graphical diagrams.
1 INTRODUCTION
Model-based engineering has been taught at universities for decades, and in the past decade it has spread widely into industrial use and become an established technique, particularly in safety-critical domains. While graphical models also serve a documentation purpose (i.e., a permanent role in the system’s specification), their most essential characteristic is to facilitate stakeholder communication (i.e., a transient role in knowledge exchange). Hence, the key to successful communication is to create models that are understood by all users in the same way. No ambiguity may exist among the stakeholders regarding what the model expresses. This comprises both the need for the model to adequately reflect the semantics of the universe of discourse and the need for it to be free from syntactic errors. It must be noted that in many cases, errors with regard to both semantic adequacy and syntactic correctness can in practice be forgiven, as it is often desirable to create a ”quick and dirty”
model in order to move the discussion with the team forward. However, when the models are meant to become part of the specification and when not all team members have the same understanding of the ”quick and dirty” model, the development process can be slowed down dramatically. Moreover, inadequacies or defects in models can remain covert, thereby hiding mistakes from developers until their rectification becomes much more costly in later phases of development (Boehm, 1984) or, in the worst case, until loss of life or severe security breaches occur during operation.
It is therefore essential to sensitize established
practitioners as well as educate software engineer-
ing graduates to the quality of conceptual models
through dedicated training. While educators, coaches, and practitioners of course employ certain pedagogical strategies to foster model quality (Muller, 2015; Daun et al., 2017), an objective, repeatable way to ascertain semantic and syntactic comprehensibility is still lacking; in practice, such assessments largely depend on the domain or the involved individuals. While several model quality frameworks to assess comprehensibility (among other factors) have been proposed in the past (e.g., (Krogstie et al.,
2006; Nelson et al., 2012)), these often remain abstract and suffer from a lack of practical applicability.
In this paper, we propose a framework of con-
crete metrics for model comprehension that may en-
able practical applicability. We furthermore propose
an empirical experimental design to investigate which
of these concrete metrics are more suitable than others and how they relate to one another. The goal of this is to
find the best way to objectively quantify the compre-
hensibility of a given model. In future work, we shall
experimentally validate our proposed measurements
and generalize our framework.
This paper is structured as follows. Section 2
discusses the background on good and poor models
and summarizes the related work on model compre-
hension metrics. Section 3 outlines our proposed
framework using the Goal Question Metric frame-
work (Basili and Rombach, 1994). Section 4 pro-
poses an experimental design to validate the frame-
work. Section 5 concludes this manuscript.
2 RELATED WORK
2.1 Aspects of Model Comprehension
The term ”model comprehension” commonly refers
to stakeholders’ ability to understand a graphical dia-
gram in a way that facilitates a shared understanding
among all stakeholders regarding a system under develop-
ment (Hermann et al., 2022). Who these stakeholders
are may differ between development projects and in
consequence, the choice of modeling languages will
differ as well. For example, highly formalized Petri nets may not be adequate to discuss the needs of end
users, but may be exactly the right tool to discuss tim-
ing concerns with safety auditors. Similarly, semantic
and syntactic aspects related to adequacy or correct-
ness may impair understanding if, e.g., the modeler
does not understand the meaning of the elements in the modeling language or makes syntactic mistakes when
creating a diagram. Yet, while guidance is available
regarding the choice of modeling language (Brambilla
et al., 2017), and while problem catalogs can help ad-
dress issues in semantic and syntactic understanding
through directed training efforts (Reuter et al., 2020),
practical factors and visual properties of a diagram
may impair its comprehension. Take for example the
following two diagrams shown in Fig. 1 and Fig. 2,
created as part of an undergraduate course in safety
software engineering. Both figures represent UML
state machine diagrams and both describe the same
system under development: a robot that traverses a
labyrinth while avoiding obstacles.
There is quite obviously a subjective quality dif-
ference between both models. Both diagrams contain
states labeled with ”discretized” gerund forms remi-
niscent of actions (e.g., ”entering path waypoints” or
”driving forward”). On the one hand, Fig. 1 is simplistic to the point of being superficial, yet process execution and triggering events are quickly understandable. Events are represented using transitions depicted as Bezier-type arrows. Implementation details are not depicted, but could perhaps be part of lower-level diagrams detailing the inner workings of each state. Unfortunately, the student modelers neglected to provide these with their solution, hence decreasing model completeness. However, there are
no syntactic errors.
On the other hand, Fig. 2 shows a model with a much higher level of detail, to the point of seeming complete. The modelers availed themselves of some advanced features of state machine diagrams, such as hierarchical and parallel substates as well as event guards. Transitions are depicted using straight arrows with rounded corners; however, some of them look ”crooked” (i.e., are not perfectly horizontal or vertical). In addition, there are syntactic errors: the state ”store and location” has no exit transition and is placed in an otherwise unexplained parallel substate ”waiting for end”, which the student modelers neglected to further explain. Overall, the diagram has a much higher infor-
mation density.
Which of these two diagrams is better? Should
a modeler sacrifice completeness for ease of under-
standability or be as complete as possible in a dia-
gram? Do layout and connector spacing aid understanding or impair it? Are the labels such that they
assist implementation or raise questions instead? All
of these factors and many more have been proposed in
the past as factors influencing model comprehension,
which we will explore in the next section.
2.2 Factors Influencing Model
Comprehension
Research on model comprehension differentiates be-
tween factors concerning the model interpreter, the
process in which the model is used (e.g., the software
development process), and the model itself. In this
paper, we focus on factors pertaining to the model
itself. In previous work, we investigated metrics to
determine the quality of model comprehension by the
model interpreter. For instance, in (Daun et al., 2021),
we have shown that self-rated experience and confi-
dence are bad predictors for the quality of model com-
prehension. In other work, we compared the effect
different modeling languages have on the user. We
Figure 1: Example of a UML State Machine Diagram with ”poor” apparent quality.
Figure 2: Example of a UML State Machine Diagram with ”high” apparent quality.
have shown that the same situation depicted in models of different modeling languages is perceived differently. For instance, the quality of model comprehen-
sion is higher when interpreting sequence charts com-
pared to functional specifications (Daun et al., 2019).
Other findings include that the generation of dedi-
cated views can improve understanding but the need
for investigating multiple orthogonal views can de-
crease understanding of the overall specification (Ten-
bergen et al., 2018; Bandyszak et al., 2018).
In (Hermann et al., 2022), we conducted a system-
atic literature survey of 109 included primary studies
from 1995 to 2018 to find research dealing with met-
rics that explain model comprehension. Our analy-
sis revealed that the most commonly studied factors
influencing model comprehension are visual features
and layout (45 out of 109 primary studies), mostly
in UML class (24 primary studies) and BPMN dia-
grams (27 primary studies). From more than 79 pri-
mary studies dedicated to experimental evaluation of
model comprehension, we elicited 30 factors influ-
encing model comprehension (for details, see (Her-
mann et al., 2022)).
Investigations so far suggest an impact of the
model size on the model’s quality for model compre-
hension. For instance, the number of classes and their
relationships has an impact for class diagrams (Gen-
ero et al., 2001; Genero et al., 2000). Other studies for
class diagrams suggest that the number of aggrega-
tions and hierarchy relations used has an impact (Figl,
2012), or relationships in general (Lange et al., 2005;
Esperanza Manso et al., 2009; Yusuf et al., 2007) or
the overall size in terms of number of classes, rela-
tionships, and aggregations (Lange et al., 2005).
Other factors are the naming and semantics of a
model. For instance, the use of guidelines for label-
ing elements can improve communication and un-
derstanding (Huang, 2008). In particular, the use
of semiotically clear language has a positive impact
on model comprehension (Sharif and Maletic, 2009)
as well as the simplicity of symbols used (Di Cerbo
et al., 2011). Other metrics target the complexity of
a model, for instance, the amount of information con-
tained within a model (Marriott et al., 2012) or the
nestedness of a diagram (Cruz-Lemus et al., 2005).
A further category of approaches deals with the layout of a model, such as the proximity of model elements to each other (Abrahão et al., 2013), the number of crossing edges (Genero et al., 2002) or the connect-
edness of a model (Anda et al., 2001).
In the next section, we present our framework for measuring the quality of model comprehension, which we created using the Goal Question Metric (GQM) approach to structure the metrics presented above.
3 A FRAMEWORK FOR
MEASURING THE QUALITY
OF MODEL COMPREHENSION
3.1 Goal Question Metric
As we saw in Section 2, some of the metrics identified
in our literature survey (Hermann et al., 2022) are not
operationalized. This means that there are no con-
crete instructions to quantify a diagram (e.g., “degree
of semiotic clarity” in (Krogstie et al., 2006)). To al-
leviate this issue, we employ the Goal Question Met-
ric (GQM) approach. GQM is a systematic method
to arrive at concrete ways to measure properties in
software engineering purposefully (Basili and Rom-
bach, 1994). The underlying premise is that concrete
metrics can only be defined when considering their
purpose with regard to a common goal. Intermedi-
ate questions help refine the goal from the perspec-
tive of a specific stakeholder involved in a task. To
this end, the GQM approach defines a top-level con-
ceptual goal, intermediate level operationalizations of
the goal (“questions”), and a quantitative level with
specific data that can be measured (“metric”). Goals
are usually defined for an object under study, such
as a software engineering process or artifact. Ques-
tions refine the goal, sometimes in several subordinate
levels, and are usually posed to characterize proper-
ties about the object under study, which in turn are
concretely measured through metrics. In the follow-
ing section, we will make use of the GQM approach
to identify concretely measurable metrics for non-
operationalized metrics from the literature.
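To make this decomposition concrete, the following minimal sketch (in Python, with illustrative goal, question, and metric names of our own choosing, not part of GQM itself) shows one way such a hierarchy for model comprehension could be captured:

# Minimal sketch of a GQM decomposition for model comprehension.
# All goal, question, and metric names below are illustrative.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Metric:
    name: str   # e.g., "number of crossing edges"
    unit: str   # e.g., "count", "elements per square inch", "points on a 1-4 scale"

@dataclass
class Question:
    text: str
    metrics: List[Metric] = field(default_factory=list)

@dataclass
class Goal:
    purpose: str
    questions: List[Question] = field(default_factory=list)

comprehension_goal = Goal(
    purpose="Characterize the comprehensibility of a graphical diagram "
            "from the viewpoint of the model interpreter",
    questions=[
        Question("How large is the diagram?",
                 [Metric("number of nodes/entities", "count"),
                  Metric("number of edges/relationships", "count")]),
        Question("How clear is the layout?",
                 [Metric("number of crossing edges", "count"),
                  Metric("element density", "elements per square inch")]),
        Question("How much explanation does the interpreter need?",
                 [Metric("additional information needed", "points on a 1-4 scale")]),
    ],
)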
3.2 Metric Candidates for a Model
Comprehension Framework
To refine the set of candidate metrics from Section 2
into less vague and concretely measurable metrics, we
pursued two goals: (1) to combine equivalent metrics
from the literature, and (2) to systematically quantify
non-countable metrics. In doing so, we strictly followed Berander and Jönsson's approach to arrive at an efficient measurement framework (Berander and Jönsson, 2006), shown in Figure 3.
As can be seen, the framework differentiates between metrics that quantify diagram size, specifically the number of nodes/entities and the number of edges/relationships. This is because most of the metrics in
Section 2 indicate aspects pertaining to the size of
the model as one of the dominant characteristics in-
fluencing comprehension. In fact, the literature suggests an inverse relationship between size and comprehension: the larger the model, the harder
to comprehend (Hermann et al., 2022). The size
of the model is typically understood as the number
of nodes or entities therein. Nodes and entities in-
clude UML classes, lifelines, states or partitions, but
also model elements typically associated with rela-
tionships between elements, such as decision/merge
nodes or fork/joins.
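As a minimal sketch, counting such size metrics is straightforward once a diagram is available in a machine-readable form; the dictionary-based representation below is a simplifying assumption of ours, not a standard model exchange format:

# Hypothetical in-memory representation of a diagram; real tooling would
# parse a model file (e.g., an XMI export) instead.
diagram = {
    "nodes": ["Idle", "Driving forward", "Turning", "fork_1", "join_1"],
    "edges": [("Idle", "Driving forward"), ("Driving forward", "Turning"),
              ("Turning", "Driving forward"), ("Driving forward", "Idle")],
}

def size_metrics(diagram):
    # Nodes/entities include states or classes as well as pseudo-nodes
    # such as decision/merge nodes and forks/joins.
    return {"node_count": len(diagram["nodes"]),
            "edge_count": len(diagram["edges"])}

print(size_metrics(diagram))  # {'node_count': 5, 'edge_count': 4}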
The framework further recognizes metrics pertain-
ing to layout. While at first glance, these metrics seem
easily quantifiable, the problem with such metrics is
that their quantification is not transferable between
two diagrams, regardless of model type. Even if two
diagrams document the same subject and are of the
same model type (e.g., activity diagrams document-
ing the procedure of a user logging in to an online
account), differences in layout may decrease visual
clarity and thereby comprehensibility. It is therefore
essential to quantify these metrics relative to the over-
all physical size of the diagram as measured, e.g., in
pixels, inches, or centimeters.
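A hedged sketch of how such layout metrics could be computed and normalized is shown below; it assumes element coordinates are available (e.g., in pixels), treats edges as straight line segments, and deliberately ignores degenerate cases such as shared endpoints or collinear edges:

def _ccw(a, b, c):
    # True if the points a, b, c are in counter-clockwise order.
    return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    # True if segment p1-p2 strictly crosses segment p3-p4.
    return (_ccw(p1, p3, p4) != _ccw(p2, p3, p4)
            and _ccw(p1, p2, p3) != _ccw(p1, p2, p4))

def layout_metrics(edges, diagram_width, diagram_height, element_count):
    # edges: list of ((x1, y1), (x2, y2)) straight-line segments.
    crossings = sum(segments_cross(*edges[i], *edges[j])
                    for i in range(len(edges))
                    for j in range(i + 1, len(edges)))
    area = diagram_width * diagram_height  # e.g., in square pixels
    return {"crossing_edges": crossings,
            "element_density": element_count / area}  # relative to drawing size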
Finally, the framework contains metrics that must be assessed cognitively. These metrics not only dif-
fer in objectivity between two diagrams of the same
type and content, but also between the persons who
read the diagram. Generally, less required explana-
tion, complexity, and simpler names make a diagram
easier to understand. Yet, to what degree the dia-
gram is understandable cannot be counted, as it re-
lates to an internal cognitive process within the model
Figure 3: GQM Measurement Framework of Model Comprehension Metrics from the Literature (Hermann et al., 2022).
interpreter that may even depend on the person's current mood and level of attention. It is there-
fore necessary to adopt a quantification approach that
introduces comparability between two model inter-
preters’ assessments. Commonly used are five-point
scales ranging from “strongly agree” to “strongly dis-
agree”. Yet, while such scales allow quantification,
they lack comparability. For example, both a software
engineering researcher and a person who served on
OMG’s standardization body drafting the UML may
strongly agree with the notion of being an expert in
UML modeling, yet the two may differ considerably in their knowledge of the details of the standard or in the pragmatics of UML modeling. We therefore pro-
pose using quantifiable statements with points associated with categories of experience. For the metric named
above, this could for example be:
4: No additional information is needed.
3: I needed additional information to understand
one or two model elements or had to ask someone
for help.
2: I needed additional information to understand
several model elements or had to ask several peo-
ple for help.
1: I needed additional information to understand
all or nearly all model elements, or was not able
to understand the model despite receiving help.
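A small sketch of how such responses could be turned into comparable numbers follows; the mapping mirrors the scale above, while averaging over several interpreters is merely one possible aggregation and not prescribed by the framework:

# Mapping of questionnaire responses to the proposed 1-4 scale.
EXPLANATION_SCALE = {
    "no additional information needed": 4,
    "one or two elements unclear": 3,
    "several elements unclear": 2,
    "all or nearly all elements unclear": 1,
}

def explanation_score(responses):
    # responses: one answer per model interpreter; averaging is one
    # possible way to aggregate across interpreters.
    points = [EXPLANATION_SCALE[r] for r in responses]
    return sum(points) / len(points)

print(explanation_score(["no additional information needed",
                         "one or two elements unclear"]))  # 3.5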
4 STUDY DESIGN AND FUTURE
WORK
In this section, we propose an experimental design to
investigate the suitability of the metrics identified in
Section 3. We are following the structure of reporting
empirical studies proposed by (Wohlin et al., 2012),
as reflected in the following subsections. The goal of
this empirical design is not to elicit a definitive set of
metrics that are valid for any model, but instead to
compare metrics beyond what their original authors
proposed.
4.1 Research Questions & Hypotheses
To experimentally validate our framework from Sec-
tion 3, we define the following research questions:
1. RQ: Which metrics are an adequate measurement
of the perceived comprehensibility of a model,
given the model type?
2. RQ: Which metric or set of metrics is an adequate predictor of model quality, given the model
type?
3. RQ: Are there any metrics that predict model
quality regardless of model type?
To answer these questions, we quantify student-created models using each of the metrics outlined in Section
3 and compare the quantified metric to a score as-
signed by instructors as well as peer reviews (see Sec-
tion 4.2 for details on the model stimuli). We there-
fore define the following hypotheses:
1. HP: There exists a correlation between the quan-
tified value of the metric and the score assigned to
a model.
2. HP: There exists a correlation between the quanti-
fied value of the metric and at least one other met-
ric from different quality properties, for which we
failed to reject HP1.
3. HP: There exists a correlation between the quanti-
fied value of the metric and the score assigned for
at least one more model of a different type.
Correlations may be negative for metrics that min-
imize a value (e.g., fewer line crossings are better)
or positive for metrics that maximize the value (e.g.,
more nodes are better). Hypothesis HP1 directly an-
swers RQ1. Hypothesis HP2 will be accepted if any two metrics correlate with the model score (HP1) but are from different model quality properties (e.g.,
model size: number of relationships and layout: num-
ber of crossing edges, see Section 3). The result of
HP2 is therefore a set of metrics that together ade-
quately quantify model properties and therefore pre-
dict model quality in the sense of RQ2. Research
question RQ3 is directly addressed by HP3. We ac-
cept evidence in favor of HP3 for any metric for which
we failed to reject HP1 for at least two models, as long
as the models are of a different type (e.g., number of
crossing edges in class diagrams and number of cross-
ing edges in activity diagrams).
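As an illustration of how HP1 could be tested per metric and model type, the sketch below uses Spearman's rank correlation with a conventional significance level; neither choice is prescribed by the study design, and the data shown are hypothetical:

# Sketch: testing HP1 for one metric. The choice of Spearman's rho and
# alpha = 0.05 is our assumption, not fixed by the design.
from scipy.stats import spearmanr

def test_hp1(metric_values, scores, alpha=0.05):
    # metric_values and scores are paired per model (same order).
    rho, p_value = spearmanr(metric_values, scores)
    return {"rho": rho, "p": p_value, "correlation_found": p_value < alpha}

# Hypothetical data: number of crossing edges vs. assigned score (out of 15).
crossings = [0, 2, 5, 1, 7, 3]
scores = [15, 13, 9, 14, 7, 11]
print(test_hp1(crossings, scores))  # expect a negative rho for this metric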
4.2 Experimental Stimuli
We will use student-created models from a Require-
ments Engineering course (Tenbergen and Daun,
2019). In this course, students create goal models, use
case diagrams, sequence diagrams, class diagrams,
activity diagrams, and state machine diagrams in as-
signment sheets. The assignment sheets take the form
of providing the students with a natural language de-
scription of a universe of discourse and the task to cre-
ate a specific diagram based on the content of the de-
scription. Scored out of 15 points, the purpose of the
assignment sheets is to practice the creation of graph-
ical models for the purpose of elicitation and docu-
mentation of requirements. The course has been of-
fered annually since 2017 and typically comprises six
assignment sheets, prepared in teams of two students,
evaluated by the instructor. Since 2021, calibrated
peer reviews have been adopted in the course (Ten-
bergen and Daun, 2022), in which student teams eval-
uate and score at least three solutions from their peers
in addition to creating their own solution. Conversely,
each team is evaluated by at least three other teams;
the total score of their solution is calculated from the
average of peer-assigned scores. For peer evaluations,
grading rubrics are created by the instructor and pro-
vided to the evaluating teams. Each year, an average
of 20 students enroll in the course, paired in 10 teams
and creating 8 models across all assignment sheets,
yielding approximately 80 student-created models per
year. This means that overall, more than 500 candi-
date models and their scores can be subjected to each
of the metrics from Section 3. We believe this to be
a robust data set from which to calculate correlations.
4.3 Data Preparation & Analysis
To prepare the data for analysis and hypothesis test-
ing, the following steps will be undertaken:
1. All models from all assignment sheets, students, and years will be collected, stripped of information identifying students, and instead assigned a
unique identifier.
2. Afterward, all models will be reviewed to assess if
they are suitable for inclusion in this study. While
all models for this course would by default be
included, exclusion criteria may include models
containing profanity, models that were plagiarized
from another team or year, models that are accidental duplicates, or models that semantically do not fit the
task at hand (e.g., the assignment sheet calls for
creating a class diagram, but the students created
an activity diagram instead). A definitive list of
inclusion and exclusion criteria will be created
once data preparation is under way, but shall be
strictly enforced for all models.
3. For all remaining included models, their ID, their
model type, and score as well as point deductions
(including value of deductions) will be recorded
in a spreadsheet, one row per model.
4. In the spreadsheet, a column will be created for
each metric from Section 3.
5. For each model, a quantified value for each metric
will be calculated and recorded.
Once these calculations are complete, the resulting
data set can be filtered by model type and correla-
tions can easily be computed for each metric and
the models’ scores. Since this mode of analysis is a purely quantitative approach, it is likely that certain qualitative observations may emerge. For example, an open question remains whether peer reviewers gravitate
towards naturally applying similar metrics when re-
viewing model quality. Any such observation shall be
recorded by the experimenters.
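The sketch below illustrates one possible implementation of this analysis step; the column names, the CSV format, and the use of pandas and SciPy are our assumptions rather than part of the study design:

# Sketch: analysis pipeline over the prepared spreadsheet.
import pandas as pd
from scipy.stats import spearmanr

# Illustrative metric column names; the actual columns follow Section 3.
METRIC_COLUMNS = ["node_count", "edge_count", "crossing_edges", "explanation_score"]

def correlations_by_model_type(csv_path):
    # One row per model: identifier, model_type, score, and one column per metric.
    df = pd.read_csv(csv_path)
    results = []
    for model_type, group in df.groupby("model_type"):
        for metric in METRIC_COLUMNS:
            rho, p = spearmanr(group[metric], group["score"])
            results.append({"model_type": model_type, "metric": metric,
                            "rho": rho, "p": p})
    return pd.DataFrame(results)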
4.4 Threats to Validity
As in any study, certain threats to validity that stem
from the experimental design remain despite our best
efforts. We have created this experimental design
using pre-existing stimuli out of convenience and
to reduce the time needed for this preliminary re-
search thrust. Therefore, the main remaining threat
to validity concerns generalization and reproducibil-
ity. Since the experimental stimuli are conveniently
selected and the experiment relies on systematic ap-
plication of measurements instead of participant per-
formance, results may not allow conclusively reject-
ing or accepting hypotheses. Because the stimuli were not designed with this experiment in mind, their inherent properties might influence the experimental outcome; the results therefore require secondary validation through follow-up experimentation. Moreover, the experimental stimuli mainly feature UML diagrams, so the results will validate the framework only with regard to this notation.
Hence, follow-up experimentation with a dedicated
experiment measuring comprehension of participants
in modeling and model reading tasks must be con-
ducted in order to conclusively validate the frame-
work. Furthermore, since we seek to validate a frame-
work of our own making, experimenter bias impairing
conclusion validity is a concern. However, the
framework is based on metrics proposed in the litera-
ture and not on our work. On the contrary, we seek to re-
ject flawed aspects of this framework early and rigor-
ously, so we can improve it. We are therefore reason-
ably certain that the experimental design we chose is
adequate to validate the framework and generate pre-
liminary insights that can be addressed in follow-up
studies.
5 SUMMARY
In this paper, we have made a case in favor of provid-
ing a generalizable, quantifiable framework to mea-
sure the comprehensibility of graphical models. Such
a framework will assist in reducing the effort needed
in onboarding developers to a project, and assist edu-
cators in teaching aspiring model-driven software engineers what makes a “good” model. We have
briefly summarized results from our detailed litera-
ture review regarding factors influencing model com-
prehension and used the GQM approach to propose
a framework with concrete quantifiable metrics. We
furthermore presented the design of a preliminary em-
pirical study using readily available stimuli to validate
and improve the framework. Although the anticipated conclusions using this design will be limited because the stimuli were not designed specifically
to measure comprehension, we are confident that our
design will generate initial, reliable insights and allow
us to design a more rigorous follow-up study. This
shall be the topic of future work.
REFERENCES
Abrahão, S., Gravino, C., Insfran, E., Scanniello, G., and
Tortora, G. (2013). Assessing the effectiveness of dy-
namic modeling in the comprehension of software re-
quirements: results from a family of five experiments.
IEEE Trans. Softw. Eng, 39(3):327–342.
Anda, B., Sjøberg, D., and Jørgensen, M. (2001). Qual-
ity and understandability of use case models. In
ECOOP 2001—Object-Oriented Programming: 15th
European Conference Budapest, Hungary, June 18–
22, 2001 Proceedings 15, pages 402–428. Springer.
Bandyszak, T., Daun, M., Tenbergen, B., and Weyer, T.
(2018). Model-based documentation of context un-
certainty for cyber-physical systems. In 2018 IEEE
14th International Conference on Automation Science
and Engineering (CASE), pages 1087–1092. IEEE.
Basili, V. R. and Rombach, H. D. (1994). The goal question
metric approach. Encyclopedia of software engineer-
ing, pages 528–532.
Berander, P. and Jönsson, P. (2006). A goal question metric
based approach for efficient measurement framework
definition. In Proceedings of the 2006 ACM/IEEE Int
Symp on Empirical Software Engineering, ISESE ’06,
page 316–325, New York, NY, USA. Association for
Computing Machinery.
Boehm, B. W. (1984). Software engineering economics.
IEEE transactions on Software Engineering, (1):4–21.
Brambilla, M., Cabot, J., and Wimmer, M. (2017). Model-
driven software engineering in practice. Morgan &
Claypool Publishers.
Cruz-Lemus, J. A., Genero, M., Manso, M. E., and Piat-
tini, M. (2005). Evaluating the effect of composite
states on the understandability of UML statechart dia-
grams. In Proceedings of the 8th Int Conf MoDELS
2005, pages 113–125. Springer.
Daun, M., Brings, J., Aluko Obe, P., and Stenkova, V.
(2021). Reliability of self-rated experience and confi-
dence as predictors for students’ performance in soft-
ware engineering: Results from multiple controlled
experiments on model comprehension with graduate
and undergraduate students. Empirical Software En-
gineering, 26(4):80.
Daun, M., Brings, J., Obe, P. A., Pohl, K., Moser, S., Schu-
macher, H., and Rieß, M. (2017). Teaching conceptual
modeling in online courses: Coping with the need for
individual feedback to modeling exercises. In 2017
IEEE 30th Conf on Software Engineering Education
and Training (CSEE&T), pages 134–143. IEEE.
Daun, M., Weyer, T., and Pohl, K. (2019). Improving man-
ual reviews in function-centered engineering of em-
bedded systems using a dedicated review model. Soft-
ware and Systems Modeling, 18(6):3421–3459.
Di Cerbo, F., Dodero, G., Reggio, G., Ricca, F., and
Scanniello, G. (2011). Precise vs. ultra-light activ-
ity diagrams-an experimental assessment in the con-
text of business process modelling. In Product-
Focused Software Process Improvement: 12th Inter-
national Conference, PROFES 2011, Torre Canne,
Italy, June 20-22, 2011. Proceedings 12, pages 291–
305. Springer.
Esperanza Manso, M., Cruz-Lemus, J. A., Genero, M.,
and Piattini, M. (2009). Empirical validation of
measures for UML class diagrams: A meta-analysis
study. In Models in Software Engineering: Work-
shops and Symposia at MODELS 2008, pages 303–
313. Springer.
Figl, K. (2012). Symbol choice and memory of visual mod-
els. In 2012 IEEE Symp on Visual Languages and
Human-Centric Computing, pages 97–100. IEEE.
Genero, M., Miranda, D., and Piattini, M. (2002). Defin-
ing and validating metrics for UML statechart diagrams.
Proceedings of QAOOSE, 2002.
Genero, M., Olivas, J. A., Piattini, M., Romero, F. P., and
de Calatrava, R. (2001). A controlled experiment for
corroborating the usefulness of class diagram metrics
at the early phases of OO developments. In ADIS.
Genero, M., Piattini, M., and Calero, C. (2000). Early mea-
sures for UML class diagrams. L’objet, 6(4):489–505.
Hermann, J., Tenbergen, B., and Daun, M. (2022). Metrics
to estimate model comprehension quality: Insights
from a systematic literature review. Complex Systems
Informatics and Modeling Quarterly, (31):1–17.
Huang, W. (2008). An eye tracking study into the effects of
graph layout. arXiv preprint arXiv:0810.4431.
Krogstie, J., Sindre, G., and Jørgensen, H. (2006). Process
models representing knowledge for action: a revised
quality framework. European Journal of Information
Systems, 15:91–102.
Lange, C. F., DuBois, B., Chaudron, M. R., and Demeyer,
S. (2005). Experimentally investigating the effective-
ness and effort of modeling conventions for the UML.
Lecture Notes in Computer Science, 4364:91–100.
Marriott, K., Purchase, H., Wybrow, M., and Goncu, C.
(2012). Memorability of visual features in network
diagrams. IEEE Transactions on Visualization and
Computer Graphics, 18(12):2477–2485.
Muller, G. (2015). Challenges in teaching conceptual mod-
eling for systems architecting. In Advances in Con-
ceptual Modeling: Proceedings of ER 2015 Work-
shops, pages 317–326. Springer.
Nelson, H. J., Poels, G., Genero, M., and Piattini, M.
(2012). A conceptual modeling quality framework.
Software Quality Journal, 20:201–228.
Reuter, R., Stark, T., Sedelmaier, Y., Landes, D., Mottok, J.,
and Wolff, C. (2020). Insights in students’ problems
during UML modeling. In 2020 IEEE Global Engineer-
ing Education Conference, pages 592–600.
Sharif, B. and Maletic, J. I. (2009). The effect of layout
on the comprehension of UML class diagrams: A con-
trolled experiment. In 2009 5th IEEE International
Workshop on Visualizing Software for Understanding
and Analysis, pages 11–18. IEEE.
Tenbergen, B. and Daun, M. (2019). Industry projects in
requirements engineering education: application in a
university course in the us and comparison with ger-
many. In Proceedings of the 52nd Hawaii Int Conf on
System Sciences.
Tenbergen, B. and Daun, M. (2022). Calibrated peer re-
views in requirements engineering instruction: Appli-
cation and experiences. In Proceedings of the 55th
Hawaii Int Conf on System Sciences.
Tenbergen, B., Daun, M., Obe, P. A., et al. (2018). View-
centric context modeling to foster the engineering
of cyber-physical system networks. In 2018 IEEE
Int Conf on Software Architecture, pages 206–20609.
IEEE.
Wohlin, C., Runeson, P., Höst, M., Ohlsson, M. C., Regnell, B., and Wesslén, A. (2012). Experimentation in software engineering. Springer Science & Business Media.
Yusuf, S., Kagdi, H., and Maletic, J. I. (2007). Assess-
ing the comprehension of UML class diagrams via eye
tracking. In 15th IEEE Int Conf on Program Compre-
hension, pages 113–122. IEEE.