A Methodology Based on Quality Gates for Certiﬁable AI in Medicine:

Towards a Reliable Application of Metrics in Machine Learning

Miriam Elia

and Bernhard Bauer

Faculty of Applied Computer Science, University of Augsburg, Germany

(miriam.elia, bernhard.bauer)@informatik.uni-augsburg.de

Keywords:

Certiﬁable AI, Quality Management, Machine Learning, Healthcare, Metrics, Deep Learning, Performance

Evaluation, Algorithm Auditing.

Abstract:

As of now, intelligent technologies experience a rapid growth. For a reliable adoption of those new and

powerful systems into day-to-day life, especially with respect to high-risk settings such as medicine, technical

means to realize legal requirements correctly, are indispensible. Our proposed methodology comprises an

approach to translate such partly more abstract concepts into concrete instructions - it is based on Quality

Gates along the intelligent system’s complete life cycle, which are composed of use-case adapted Criteria

that need to be addressed with respect to certiﬁcation. Also, the underlying philosophy regarding stakeholder

inclusion, domain embedding and risk analysis is illustrated. In the present paper, the Quality Gate Metrics is

outlined for the application of machine learning performance metrics focused on binary classiﬁcation.

1 INTRODUCTION

Thanks to astonishing results, the adoption of AI in

medicine is moving more and more into the center of

attention. Many requirements for a conscious integra-

tion of the new technology, especially regarding high-

risk contexts, have been published. Recently, the EU

released its AI Act that stands as a legislative guide-

line (European Commission, 2021). However, tech-

nical means to realize these requirements in medicine

are yet to be developed and standardized. In addi-

tion, ”[t]he healthcare application ﬁeld introduces re-

quirements and potential pitfalls that are not imme-

diately obvious from the ’general data science’ view-

point” (Jussi, 2021, 1). Challenges regarding the de-

sired adoption of this complex technology into clini-

cal day-to-day life are partly based on the necessity of

comprehensive Machine Learning (ML) knowledge

to accurately evaluate the system. The present work

is part of our approach towards a generic and cus-

tomizable methodology - introduced in this paper and

based on Quality Gates (QG) - that comprises exist-

ing research on the development of ML models for

Certiﬁable AI in Medicine into guidelines for devel-

opers and auditing ofﬁces, while paying special atten-

tion to end-user perspectives, the inclusion of domain

https://orcid.org/0000-0001-6253-230X

https://orcid.org/0000-0002-7931-1105

knowledge and risk analysis. The focus lies on deﬁn-

ing general guidelines towards metrics selection for a

comprehensive evaluation of the ML model, adapted

to the respective medical context. Section 2 explains

the current legal situation regarding intelligent medi-

cal devices with respect to software quality manage-

ment and metrics for ML in healthcare. In section

3, our methodology’s basic concepts are introduced,

while section 4 specializes on the QG Metrics and

presents guidelines for a reliable selection, adapted

to the medical context. Finally, section 5 summarizes

the present work and derives open research questions.

2 RELATED WORK

Functional & Safety Standards: Since 2021, an

updated version of the Medical Device Regulation

(MDR) is in place that guarantees the Conformit

e Eu-

rop

eenne (CE), i.e. conformity with ”[...] EU safety,

health and environmental protection requirements, as

well as with norms set by the International Orga-

nization for Standardization (ISO)” (Ben-Menahem,

2020, 1). ISO does not perform certiﬁcation activities

itself, but provides internationally accepted norms, as

DIN EN ISO 9001:2015-11 for process-oriented qual-

ity management systems, or ISO 13485 for medical

devices, for instance. Another important concept is

safety, i.e. protecting the user from potentially harm-

486

Elia, M. and Bauer, B.

A Methodology Based on Quality Gates for Certiﬁable AI in Medicine: Towards a Reliable Application of Metrics in Machine Learning.

DOI: 10.5220/0012121300003538

In Proceedings of the 18th International Conference on Software Technologies (ICSOFT 2023), pages 486-493

ISBN: 978-989-758-665-1; ISSN: 2184-2833

 2023 by SCITEPRESS – Science and Technology Publications, Lda. Under CC license (CC BY-NC-ND 4.0)

ful behavior of the software. Functional requirements

are summarized under IEC 61508, while DIN EN

IEC 60601 and DIN EN IEC 62304 speciﬁcally fo-

cus on medical devices. Moreover, Safety Integrity

Levels 0 − 4 (SIL), i.e. ”[...] classiﬁcation levels

indicating safety requirements in safety-critical sys-

tems”(Papadopoulos, 2010, 1) are assigned.

Certiﬁcation & Medical AI: As of now, the EU

AI Act is on everyone’s mind, aiming to form the

”[...] legislation for a coordinated European approach

on the human and ethical implications of AI” (Eu-

ropean Commission, 2021, 2). This document de-

ﬁnes the foundation of AI-based devices in the EU, its

philosophy is summarized in (European Commission,

2020), and discussed in further detail with respect to

medicine in (Schneeberger, 2020). Currently, the cer-

tiﬁcation process for high-risk medical devices is con-

ducted by an independent authority, i.e. notiﬁed bod-

ies. (Ben-Menahem, 2020, 1-3) However, currently,

they are not equipped to implement all incoming de-

mands, which could lead to a scarcity of medical de-

vices in the EU (European Commission, 2023, 2-4).

For a comprehensive impact analysis regarding the

new MDR regulations for risk classes, clinical eval-

uation, post-market surveillance and notiﬁed bodies,

refer to (Niemiec, 2022). Current challenges for AI

in healthcare are mainly centered around black box

models that are able to perform complex tasks, but

whose inner workings are incomprehensible for hu-

man stakeholders. This could lead to an incorrect ap-

plication of developed models in the clinical context,

”[...] due to methodological ﬂaws and/or underlying

biases” (Roberts, 2021, 1), for instance. In (Muller,

2021) generally applicable principles regarding AI in

medicine that could form a solid baseline for technical

design decisions, are summarized.

Quality Gates & Metrics: A QG is a concept derived

from software quality management, and could be de-

ﬁned as ”[...] an objective quality assurance gate,

that is, a veriﬁcation procedure, performed either

by independent reviewers or by automated scripts”

(Paula F., 2006, 34). Their most basic function-

ing consists of summarizing important criteria regard-

ing speciﬁc outcomes that are generated at differ-

ent points during the software development life cycle

(Flohr, 2008, 245). A means of deﬁning criteria for

virtual QGs for manufacturing use cases is presented

in (Filz, 2020, 8ff), but could be adapted to medi-

cal contents, since they are based on the inclusion

of domain knowledge. A thorough and comprehen-

sive understanding of the respectively conveyed infor-

mation is indispensable for ML performance metrics

interpretation, especially in medicine, but not nec-

essarily guaranteed (Hicks, 2022, 1). For instance,

a very common metric for classiﬁcation tasks is the

Receiver Operating Characteristic Area under the

Curve (ROC AUC). It is used as primary evaluation

metric in popular bench marking tools hosted e.g. on

Grand-challenge.org, like the STOIC

challenge for

3D computer tomography classiﬁcation of COVID-19

infected lungs (Boulogne, 2023), for instance. Their

metrics selection is based on (Reinke, 2021), accord-

ing to which ROC AUC and its prominent opposition

Precision-Recall AUC (PR AUC) both reﬂect data im-

balance (Reinke, 2021, 43ff.). However, there is an

ongoing discussion whether or not ROC AUC reﬂects

imbalanced data sets, which is a very common case

in medicine (Davis, 2006; Saito, 2015). Also, pub-

lished paper and bench-marking tools tend to display

disagreement regarding the consistent application of

both metrics for an empirical analysis (Ribeiro, 2020;

Strodthoff, 2020). This inconsistency enforces the ne-

cessity to standardize valid approaches.

3 METHODOLOGY BASED ON

QUALITY GATES

Our proposed methodology’s main objective is to

”make auditing simple”, and thus provide concrete in-

structions for the domain-adapted realization of spe-

ciﬁc legislative requirements in the context of Certiﬁ-

able AI in medicine, while respecting different stake-

holder’s needs and speciﬁc design decisions’ risks. In

the long term, such ﬁndings could be adapted in a

(partially) automated manner to the complete appli-

cation’s life cycle through adapted frameworks and

templates for a comprehensive documentation of de-

sign decisions. In general, the conceptual foundation

is based on the deﬁnition of scientiﬁcally substanti-

ated Criteria for QGs along the complete life cycle of

the intelligent software. To the best of our knowledge,

a similar adaptation of QGs and ML-certiﬁcation in

healthcare has not yet been published. Attributed

to the variety of different ML methods for different

medicinal use cases that compose of different data

types and tuning objectives, the concrete realization

of Criteria should be adapted respectively. Struc-

tural similarities from a technical viewpoint between

use cases should sufﬁce to generalize applied meth-

ods, as in (Strodthoff, 2020, 3) where metrics from

multi-label protein discovery were adapted to ECG-

classiﬁcation.

General Structure of Quality Gates: In ﬁgure 1 the

high-level QG’s hierarchy adapted to ML-processes

is depicted: QG Data ensures a clean and informa-

https://stoic2021.grand-challenge.org/

A Methodology Based on Quality Gates for Certiﬁable AI in Medicine: Towards a Reliable Application of Metrics in Machine Learning

487

tive data set that is ready for model training, QG Soft-

ware guarantees overall compliance with software en-

gineering requirements, QG Model delivers a trans-

parent algorithm that has been thoroughly assessed,

QG Deployment assures a seamless rollout, while

QG Maintenance ensures regular monitoring, which

could include physician training in the medical sec-

tor Only in combination, the whole QG4Application

is evaluated.

Figure 1: High-Level Quality Gates.

The process-steps depicted in ﬁgure 1 represent

basic ML-development and should be audited for all

levels of risk regarding intelligent applications, refer

to (Koshiyama, 2021, 3), where ﬁve similar stages of

development are deﬁned for ML algorithm auditing in

general, or (Oala, 2021, 2) where the authors present

a concept for healthcare-speciﬁc algorithm auditing.

3.1 Basic Concepts

The following deﬁnitions comprise our presented

methodology’s conceptual foundation, and are partly

derived from traditional software quality assessment.

In (Koshiyama, 2021), the authors propose Explain-

ability, Robustness, Fairness, and Privacy as audit-

ing verticals, which form an important component of

Algorithm Audit research (Koshiyama, 2021, 2-3).

Our ”auditing verticals” are intended as guidelines of

thought when deﬁning QG Criteria that should have

been analyzed by the responsible party for the respec-

tive process step(s). Within our context, fundamen-

tal requirements for trustworthy, and thus certiﬁable

AI, such as fairness, privacy, and robustness (Euro-

pean Commission, 2020), are addressed via a pro-

found Risk Analysis.

Quality Gate: Signiﬁcant milestone or decision point

during the creation of a ML-based software that, in

a body, serve as a quality guideline to assess the

software‘s compliance with EU-legislation regarding

Certiﬁable AI in medicine. Project-speciﬁc Criteria

are evaluated against pre-deﬁned desired Criteria for

the particular use case. Based on the degree of their

fulﬁlment, Gatekeepers decide the project’s level of

compliance, which might lead to re-working some

Our proposed methodology focuses on the ML part of

the complete medical device, thus, Software Engineering-

speciﬁc information is only mentioned marginally.

QGs. QGs might be optional or weighted differently

regarding their impact. (Flohr, 2008)

Scope: Each QG has access to speciﬁc project-based

resources or outcomes that are measured by the re-

spective Gatekeeper. Its Scope includes other QG’s

outcomes. QGs are arranged in a tree-structure, with

growing Scope from more project-speciﬁc leaf-QGs

to more abstract root-QGs. The highest level of Scope

covers QGs for Data, Software and Model Develop-

ment, as well as Application Deployment and Mainte-

nance.

Criteria: Basis for QG-evaluation by the Gatekeeper.

Concrete and use case speciﬁc requirements with

growing level of abstractness following the Scope

from leaf to root. Should be adapted to the speciﬁc

use case if necessary. (Flohr, 2008)

Gatekeeper: Measures the fulﬁllment of each QG re-

garding its decision Scope depending on the point in

time of the application life cycle. It compares pre-

deﬁned, desired Criteria with the actual project’s out-

comes and decides to what extend the system is in

compliance. (Flohr, 2008)

Scoring System: It comprises multiple indexes, with

the Compliance Index as central part, i.e. the ”main

index” that comprises the complete application’s eval-

uation in a single number. Its calculation comprises

other indexes and the Gatekeeper’s results. For in-

stance, QG Data could be evaluated as stand-alone or

embedded within the application’s assessment.

Explainability: Since XAI has become a very popu-

lar ﬁeld of research, its inclusion during the ML-based

application’s life cycle is addressed separately: we

follow a similar philosophy, as in (European Commis-

sion, 2020), where XAI contributes to the requirement

Transparency, through providing a pool of methods,

that help to ”[...] explain both the technical processes

of the AI system and the reasoning behind the deci-

sions or predictions that the AI system makes” (Eu-

ropean Commission, 2020, 14). Thus, depending on

the respective process step, the application of XAI

can have various forms and objectives in support of

realizing other Criteria, rather than being the cen-

ter of an evaluation: a developer might use LIME to

assess the model’s performance regarding learnt fea-

tures (Ribeiro, 2016) to achieve robustness, while a

physician requires a humanly readable explanation.

3.2 Quality Gates in the Application’s

Life Cycle

Following ISO guidelines, the aforementioned high-

level QGs are illustrated as processes during the soft-

ware’s life cycle in ﬁgure 2: Data Management and

all its sub processes, followed by Model Develop-

ICSOFT 2023 - 18th International Conference on Software Technologies

488

ment and Software Engineering, and is concluded

with the Application’s Deployment, and Maintenance

in the real world. Our suggested methodology aims

to assist development teams in form of the inclu-

sion of Domain Knowledge, or communication of

QG Inter-Dependencies based on previous/following

QG outcomes during the application’s life cycle with

the objective to design the application in a way that

Stakeholder’s needs are fulﬁlled and possible Risks

mitigated. Also, the auditor will ﬁnd scientiﬁcally

grounded guidelines for ML quality assessment tai-

lored to speciﬁc groups of medical use cases thanks

to our methodology’s customizability.

Figure 2: Quality Gates during the ML Life Cycle.

Regarding our proposed methodology’s applica-

tion, it should be considered, that ”[a]lthough these

stages appear static and self-containing, in practice

they interact in a dynamic fashion, not following a lin-

ear progression but a series of loops [...]”(Koshiyama,

2021, 3). This may include multiple iterations of the

presented processes, starting during development and

before certiﬁcation, while continuing afterwards. For

auditing, only a static version of a particular model

with its respective data can be assessed, and each

modiﬁcation or re-training automatically requires a

new audit

3.2.1 Overall Guidelines

Generally, we deﬁned four guidelines as necessary

lines of thought for the deﬁnition of QG-Criteria,

with the objective to unify common ML-concepts in

support of different stakeholders involved during the

ML-based software’s life cycle. Thus, regarding ML

for medicine, special focus is placed on the inclusion

of Stakeholders, while paying special attention to a

thorough impact Risk Analysis in the real world. Also

beneﬁts of the inclusion of Domain Knowledge are

evaluated, and Inter-Dependencies regarding differ-

ent QGs’ outcomes considered.

Stakeholder: Considering all Stakeholder needs

from a technical point of view enhances the imple-

mentation of a successful application that fulﬁlls its

For instance, mirroring the Digital Twin concept from

the industry, a static Real-World Twin could be deployed in

the application and dynamically updated, while its twin is

continuously optimized, including new data.

intended purpose. The most obvious include Devel-

oper and Domain Experts from a medical background

who participate in the software’s development, the

Auditor who is responsible to ensure product compli-

ance with legislation, and ﬁnally the User, i.e. medic-

inal personnel and patients, who will one day work

with the system. Thus, interdisciplinary teams are ad-

vised to be considered standard during the complete

development process.

Risk Analysis: The application-wide Risk Analysis

is realized by the mapping of conducted analysis with

differing and speciﬁc aims into indexes. Examples in-

clude uncertainty estimation, fairness, privacy, trans-

parency, robustness and sustainability.

Inter-Dependency: This guideline is methodology-

speciﬁc and refers to communicating result-based rec-

ommendations between QGs. An example for an

Inter-QG-Dependency is the recommendation of ad-

equate metrics based on the QG Data’s analysis and

clinical objective of the project, or the effects of QG

Pesudo- /Anonymization on included meta data as ad-

ditional features.

Domain Knowledge: Especially in medicine, the in-

clusion of domain- and use case-speciﬁc knowledge

is indispensable for efﬁciently training and accurately

evaluating the ML-model, since AI-based software

for healthcare is primarily designed to enhance clini-

cal treatment and patient care. Thus, the inclusion of

use-case speciﬁc domain knowledge should be con-

sidered, when designing Criteria for leaf-QGs.

3.2.2 Data

The proposed QG Data comprises the processes

Source Selection, General Preparation, and ML

Preparation, and is illustrated in ﬁgure 3. When

deﬁning the data set composition, a high distribution

in data sources is desired, in favor of the algorithm’s

quality. Further, raw data needs to be analyzed, e.g.

with respect to data type, and possibly collected meta

data, as well as cleaned from missing values or er-

rors. Then, especially in healthcare, data pseudo- or

anonymization is likely to be required. In a ﬁnal step,

since the data is intended to serve as basis for training

a ML algorithm, samples need to be annotated cor-

rectly, as well as the resulting label and feature distri-

butions analyzed.

Figure 3: Quality Gates for Data Set Generation.

Bias: Biased data sets propagate their inherent distri-

butions through model training into the application’s

A Methodology Based on Quality Gates for Certiﬁable AI in Medicine: Towards a Reliable Application of Metrics in Machine Learning

489

real world context, which could lead to incorrect and

unfair predictions. Thus, the implementation of mea-

surements to detect and reduce bias is important for

the assessment of the Fairness requirement (Euro-

pean Commission, 2020, 6). Bias analysis is relevant

for multiple process steps, especially for QG Data and

QG Model, but also in retrospect during QG Deploy-

ment and QG Maintenance. Within our methodology,

such methods are summarized in form of a Bias Index

for Risk Analysis. Other interesting, and promising ar-

eas of research that are potentially interesting for ML

data set preparation in medicine include synthetic data

and multi-modal approaches (MacEachern, 2021).

3.2.3 Model

The presented QG Model is divided into four sub

processes, Data Quality and Pre-processing, Model

Training, Evaluation, and Validation, as depicted in

ﬁgure 4. Depending on design decisions regarding the

model’s architecture or training objectives, different

pre-processing steps should be considered and eval-

uated against domain-speciﬁc requirements. During

model training, optimized hyper-parameters are cal-

culated, and different architectures compared on train

and validation set, while considering data set speciﬁc

information from the previous QG Data, like class

imbalance, for instance. The ﬁnal test set is applied

for evaluating the algorithm’s generalization perfor-

mance by means of domain-embedded and meaning-

fully interpreted metrics.

Figure 4: Quality Gates during Model Development.

The extent to which this mostly development-

speciﬁc information is relevant to present a compre-

hensive view of the model’s behavior for auditing is

yet to be deﬁned. However, information included for

QG Validation like XAI methods that could help to

evaluate the model from a different viewpoint, by un-

covering wrong patterns learnt or to test the model’s

robustness through adversarial analysis, for instance

(Ribeiro, 2016), are important information to assess

the system’s overall performance.

3.2.4 Deployment

Other important steps during the application’s life cy-

cle, are its deployment and maintenance in the real

world. Regarding healthcare-speciﬁc requirements

however, in our proposed methodology those two

inter-related processes are regarded separately. QG

Deployment is divided into three sub processes On-

boarding, Reporting and Feedback, as illustrated in

ﬁgure 5. Especially in a clinical setting, a close co-

operation between the human user, and the intelli-

gent system is evident, which requires a thorough

on-boarding phase to support a conscious utilization

of the ML-based device. For instance, an appropri-

ate XAI-method could be integrated to analyze model

predictions from an additional perspective, but whose

interpretation might need further explanations to be

humanly interpretable. Refer to (Henry, 2022) for an

analysis of physician and intelligent system coopera-

tion.

Figure 5: Quality Gates for Deployment.

A valid approach towards achieving a conscious

application is educating medical personnel about AI’s

beneﬁts and risks, as well as necessary basic ML

knowledge, depending on their respective degree of

interest, i.e. only application or also development of

such systems. Other important considerations for this

phase include approaches on monitoring the model’s

behavior in the real world, as well as concepts to

transmit and integrate user feedback.

3.2.5 Maintenance

The outlined QG Maintenance is divided into four sub

processes, Support, Monitoring, Optimization, and

Decommissioning or New Data, as depicted in ﬁgure

6. This phase is closely related to the previously de-

ﬁned QG Deployment, and partly continues relevant

processes. Besides offering user support and train-

ing as necessary, as well as repeatedly monitoring the

model’s real-world performance, algorithm optimiza-

tion is another important process that should be pur-

sued in parallel, while optionally including new data.

Figure 6: Quality Gates during Maintenance.

These steps are repeated until the application’s de-

commissioning, which should be organized in detail

to assure a smooth continuation in the respective real-

world setting. The intended tendency is to closely ob-

serve the model after its ﬁrst deployment, and contin-

uously re-assess, but with an increasing distance be-

tween iterations, in alignment with MDR regulations

(Ben-Menahem, 2020, 3).

ICSOFT 2023 - 18th International Conference on Software Technologies

490

4 QUALITY GATE: METRICS

For a reliable model training and performance eval-

uation within its application context, adequate met-

rics need to be selected, since their interpretation is

not comparable for different tasks with varying ob-

jectives and data distributions (Strodthoff, 2020, 6).

QG Metrics is also relevant after deployment, for

maintenance, and to measure clinical success. These

multiple points of reference for QG Metrics require

a thorough documentation of all relevant decisions,

their interpretation should be domain-embedded and

translated for multiple stakeholders, at best from early

development on. However, its results are to be in-

terpreted in combination with Indexes relevant for

Risk Analysis. This section ﬁrst introduces a gen-

eral overview of machine learning metrics for the

most common binary classiﬁcation problems, refer

to (M

uller, 2022) for metrics in image segmentation.

In a next step, healthcare-speciﬁc challenges that af-

fect metrics interpretation are highlighted. Finally,

solid guidelines for a reliable deﬁnition of Criteria

for QG Metrics in the broader picture of certiﬁable AI

in medicine are outlined, while addressing domain-

speciﬁc challenges.

4.1 Metrics for Healthcare

In general, a combination of metrics is necessary for

a comprehensive view on the model’s performance

- no single metric reﬂects on all desired capabili-

ties (Kelly, 2019, 3). Thus, preliminary material dis-

playing the model’s prediction versus true labels for

instance, could provide necessary insights to better

evaluate the model, for reasonable prediction thresh-

olds, and to enhance the global understanding of ap-

plied performance metrics. Regarding metrics as a

body, two foundational perspectives could be deﬁned

to evaluate model performance depending on the ac-

cessible artifacts. Another perspective on ML evalu-

ation that is not part of this research comprises statis-

tical analysis of the model’s architecture and its com-

ponents, as in (Martin, 2021).

Classiﬁcation: Following this strategy, the model’s

performance on different classes is measured based

on the confusion matrix (Jussi, 2021, 5). Further, this

approach is based on Thresholding to categorize pre-

dictions in either true or false. Their applicability,

including the most widely used ones such as Accu-

racy

, Recall, Speciﬁcity, Precision or F1-score, as

well as common pitfalls regarding incorrect perfor-

Accuracy is a poor measure for imbalanced data sets,

and should be replaced by Balanced Accuracy (Jussi, 2021,

5-7).

mance measuring in current research, has been thor-

oughly studied in (Hicks, 2022).

Ranking: The Ranking-perspective is based on the

real valued function learned by the model that returns

its conﬁdence, thus accessing the model is necessary.

Further, this approach is a threshold-independent per-

formance measurement, includes metrics like ROC

AUC (Zhang, 2014, 1822), and could also be referred

to for threshold optimization.

4.2 Guidelines for Metrics Application

As a general rule, each selected metric should be ac-

companied by Additional Material that exactly docu-

ments its interpretation within the real-world medical

context, since measuring clinical efﬁcacy is not trivial

(Kelly, 2019, 3). For this purpose, a comprehensive

understanding of the underlying data is indispensable,

as these represent the link to reality. However, clinical

data usually is distributed, heterogeneous and high-

dimensional, and multiple sources might ﬁrst need to

be fused following some medical reasoning to rep-

resent meaningful input for the ML model (Muller,

2021, 120), which can be an elaborate process. Thus,

”[t]he development of quality recommendations and

standards for training data sets has to be a community-

driven effort of many diverse stakeholders” (Muller,

2021, 120), since high-quality data sets play a crucial

factor regarding model performance.

Standards: Some metrics are not symmetric, i.e. the

deﬁnition which class is positive 1 or negative 0 im-

pacts their outcome and is not interchangeable (Hicks,

2022, 3). A standardized deﬁnition marking the dis-

ease as the positive class, while healthy samples are

deﬁned as the negative class is reasonable for bi-

nary classiﬁcation in healthcare, as proposed in (Jussi,

2021, 9) for instance. Also, the inconsistency regard-

ing metrics selection in general should be addressed

by deﬁning a standardized metrics collection for au-

diting different ML use cases, like e.g. image clas-

siﬁcation, in addition to ”[p]eer-reviewed randomised

controlled trials as an evidence gold standard” (Kelly,

2019, 2f.) that accurately measure possible risks and

clinical success. The need for further standardization

is becoming more prevalent with respect to auditing,

and could be realized by the standardized inclusion of

a certain metrics combination within popular imple-

mentation frameworks.

Bench-Marking: For classiﬁer comparison, bench-

marking trained models within different areas of

medicine are important to establish a generally ac-

cepted performance base line. However, careful con-

sideration is necessary, since some metrics behave

differently, depending on the data collection process

A Methodology Based on Quality Gates for Certiﬁable AI in Medicine: Towards a Reliable Application of Metrics in Machine Learning

491

and its resulting diversity (Jussi, 2021, 5). For of-

ﬁcial bench-marking, either independent real-world

test sets that are publicly unavailable should be cre-

ated (Kelly, 2019, 3), or, another approach are simula-

tion studies based on synthetic data (Friedrich, 2022,

3). Additionally, platforms that provide the necessary

infrastructure are required.

Imbalanced Data: As mentioned, imbalanced data

is a very common case for medical data. Thus, stake-

holders are expected to be aware of this situation and

select and interpret metrics accordingly. In contrast to

sensitivity and speciﬁcity, positive and negative pre-

dictive value (PPV/NPV) are ”[...] inﬂuenced by the

ratio of disease and healthy cases that happen to be in

the test set” (Jussi, 2021, 5), for instance.

Metrics Calculation: Careful considerations are

obligatory while designing training and validation

versus test data sets, since they are required to be in-

dependent for a bias-reduced evaluation and stratiﬁed

for class-imbalance. Other popular methods for the

data pipeline setup during the development process,

like cross validation and bootstrapping, are discussed

in (Jussi, 2021, 9-12) in great detail, referencing im-

portant settings that might need to be audited differ-

ently for certiﬁcation.

Performance Optimization: To select optimal hy-

perparameters, it is crucial to optimize with respect

to the same error measure for comparison (Jussi,

2021, 13), which should be embedded and understood

within its medical application area. From a developer

perspective, the metric that will be monitored during

training for methods like early stopping or learning

rate reduction should be deﬁned carefully.

Domain Embedding: Domain embedded evaluation

approaches are expected to be the most resourceful

approaches and should be considered as standard for

medicine, since the intelligent application’s real per-

formance is to be measured and understood regard-

ing its real-world impact (Kelly, 2019, 3). Luckily,

”[m]any ﬁelds of biomedicine have published their

own guidelines on how to evaluate machine learning

algorithms [...]” (Jussi, 2021, 9).

Generalizability: Due a high variation in clinical

data, achieving a reliable generalizability is challeng-

ing but important. A possible solution could include

on-site model training to sharpen a pre-trained model

towards its speciﬁc application context. Further, clin-

ical assessment requires independent and diverse test

sets that are capable to measure such abstracts con-

cepts, see Risk Analysis. (Kelly, 2019, 4)

PPV is equal to precision for binary classiﬁcation

(Jussi, 2021, 6).

5 CONCLUSION

The present work outlines an approach to translate

legislation regarding medical AI applications into

concrete technical guidelines illustrated for metrics in

healthcare. First, the basic concept comprising our

proposed methodology is explained in detail, as well

as the current situation regarding certiﬁcation and

software quality management for medical AI. Like-

wise, the philosophy underlying our methodology is

outlined while paying special attention to all stake-

holders from the beginning, is highlighted. Also, au-

diting should include all stakeholders’ perspectives:

the ML-developers’, health experts’ and/or patients’

view of the intelligent application. Further, current

ambiguities regarding metrics selection that demand

for auditing in medicine to create/retrace commonly

accepted concepts to their origins for repeated (re-

)evaluation, are addressed. Finally, guidelines for Cri-

teria deﬁnition(s) that comprise QG Metrics are pro-

posed. As of now, we are working on a project in the

ECG domain for multi-label classiﬁcation that will be

published as a use case for the proposed approach to-

wards a reliable metrics application.

Our suggested methodology is one possible ap-

proach to realize algorithm auditing, and current re-

search should continue to develop standardized com-

pilations for speciﬁc ML use cases in favor of the au-

diting process. Thanks to the multitude and diversity

of such use cases, this is not a trivial approach, and the

present paper ventures a ﬁrst attempt to design a com-

prehensive methodology, presented in more detail for

a reasonable selection process for ML performance

metrics. To address all existing medical use cases, ex-

tensive further research is required, possibly follow-

ing a mixture of newly proposed technical guidelines,

as in (Oala, 2021; Jussi, 2021).

Another principal question that should be further

analyzed is to what extend open-sourcing should be

made obligatory, since a monopoly on such power-

ful technologies is questionable. An important part of

the outlined methodology includes indexes for Risk

Analysis that are designed to evaluate more abstract

but indispensible concepts such as transparency or

robustness. Further research should consider addi-

tional indexes that contribute to a more sound vue

d’ensemble of the whole model’s performance and

compliance with legislation, as well as develop tech-

nical realizations for relevant points during the soft-

ware life cycle. While developing such concepts, it

might be future-oriented to consider their generaliz-

ability towards bench-marking different artifacts like

data sets or models, which the presented Scoring Sys-

tem might be suitable for. Another crucial aspect, that

ICSOFT 2023 - 18th International Conference on Software Technologies

492

could be included in standard auditing of intelligent

medical devices, is measuring metrics or other com-

ponents via statistical tools such as standard devia-

tion and conﬁdence intervals, refer to (Jussi, 2021)

for more information.

ACKNOWLEDGEMENTS

This work was partially funded by the German Fed-

eral Ministry of Education and Research (BMBF) un-

der reference number 031L9196B.

REFERENCES

Ben-Menahem, S. M., e. a. (2020). How the new european

regulation on medical devices will affect innovation.

Nature Biomedical Engineering, 4(6):585–590.

Boulogne, L. H., e. a. (2023). The stoic2021 covid-19 ai

challenge: Applying reusable training methodologies

to private data. Manuscript submitted for publication.

Davis, J., e. a. (2006). The relationship between precision-

recall and roc curves. In Proceedings of the 23rd In-

ternational Conference on Machine Learning, ICML

’06, page 233–240, New York, NY, USA. Association

for Computing Machinery.

European Commission, D.-G. f. C. N. C. . T. (2020). The

Assessment List for Trustworthy Artiﬁcial Intelligence

(ALTAI) for self assessment. Publications Ofﬁce.

European Commission, D.-G. f. C. N. C. . T. (2021). Reg-

ulation of the european parliament and of the council

laying down harmonised rules on artiﬁcial intelligence

(artiﬁcial intelligence act) and amending certain union

legislative acts.

European Commission, D.-G. f. H. . F. S. (2023). Proposal

for a regulation of the european parliament and of the

council amending regulations (eu) 2017/745 and (eu)

2017/746 as regards the transitional provisions for cer-

tain medical devices and in vitro diagnostic medical

devices (text with eea relevance.).

Filz, M., e. a. (2020). Virtual quality gates in manufacturing

systems: Framework, implementation and potential.

Journal of Manufacturing and Materials Processing,

Flohr, T. (2008). Deﬁning suitable criteria for quality gates.

In Software Process and Product Measurement, pages

245–256, Berlin, Heidelberg. Springer Berlin Heidel-

berg.

Friedrich, S., e. a. (2022). On the role of benchmarking data

sets and simulations in method comparison studies.

Henry, K. E., e. a. (2022). Human-machine teaming is

key to AI adoption: clinicians’ experiences with a

deployed machine learning system. NPJ Digit Med,

5(1):97.

Hicks, S. A., e. a. (2022). On evaluation metrics for medi-

cal applications of artiﬁcial intelligence. Scientiﬁc Re-

ports, 12(1):5979.

Jussi, T., e. a. (2021). Evaluation of machine learning algo-

rithms for health and wellness applications: A tutorial.

Computers in Biology and Medicine, 132:104324.

Kelly, C. J., e. a. (2019). Key challenges for delivering clini-

cal impact with artiﬁcial intelligence. BMC Medicine,

17(1):195.

Koshiyama, A., e. a. (2021). Towards algorithm auditing:

A survey on managing legal, ethical and technologi-

cal risks of ai, ml and associated algorithms. SSRN

Electronic Journal.

MacEachern, S. J., e. a. (2021). Machine learning for pre-

cision medicine. Genome, 64(4):416–425.

Martin, C. H, e. a. (2021). Predicting trends in the qual-

ity of state-of-the-art neural networks without access

to training or testing data. Nature Communications,

12(1):4122.

uller, D., e. a. (2022). Towards a guideline for evalu-

ation metrics in medical image segmentation. BMC

Research Notes, 15(1):210.

Muller, H., e. a. (2021). The ten commandments of ethical

medical ai. Computer, 54(07):119–123.

Niemiec, E. (2022). Will the EU medical device regulation

help to improve the safety and performance of medical

AI devices? Digit Health, 8:20552076221089079.

Oala, L., e. a. (2021). Machine learning for health: Al-

gorithm auditing & quality control. J Med Syst,

45(12):105.

Papadopoulos, Y., e. a. (2010). Automatic allocation of

safety integrity levels. pages 7–10.

Paula F., W. P. (2006). Quality gates in use-case driven

development. In Proceedings of the 2006 Interna-

tional Workshop on Software Quality, WoSQ ’06,

page 33–38, New York, NY, USA. Association for

Computing Machinery.

Reinke, A., e. a. (2021). Common limitations of image pro-

cessing metrics: A picture story.

Ribeiro, A. H., e. a. (2020). Automatic diagnosis of the

12-lead ECG using a deep neural network. Nature

Communications, 11(1).

Ribeiro, M. T., e. a. (2016). ”why should i trust you?”:

Explaining the predictions of any classiﬁer.

Roberts, M., e. a. (2021). Common pitfalls and recommen-

dations for using machine learning to detect and prog-

nosticate for covid-19 using chest radiographs and ct

scans. Nature Machine Intelligence, 3(3):199–217.

Saito, T., e. a. (2015). The precision-recall plot is more

informative than the roc plot when evaluating bi-

nary classiﬁers on imbalanced datasets. PLOS ONE,

10(3):1–21.

Schneeberger, D., e. a. (2020). The european legal frame-

work for medical ai. In Machine Learning and Knowl-

edge Extraction, pages 209–226, Cham. Springer In-

ternational Publishing.

Strodthoff, N., e. a. (2020). Deep learning for ecg analysis:

Benchmarks and insights from ptb-xl.

Zhang, M., e. a. (2014). A review on multi-label learning al-

gorithms. IEEE Transactions on Knowledge and Data

Engineering, 26(8):1819–1837.

A Methodology Based on Quality Gates for Certiﬁable AI in Medicine: Towards a Reliable Application of Metrics in Machine Learning

493