ful behavior of the software. Functional requirements
are summarized under IEC 61508, while DIN EN
IEC 60601 and DIN EN IEC 62304 focus specifically
on medical devices. Moreover, Safety Integrity
Levels (SIL) 0 to 4, i.e. "[...] classification levels
indicating safety requirements in safety-critical
systems" (Papadopoulos, 2010, 1), are assigned.
Certification & Medical AI: The EU AI Act currently
dominates the discussion. It aims to form the
"[...] legislation for a coordinated European approach
on the human and ethical implications of AI"
(European Commission, 2021, 2). This document lays
the foundation for AI-based devices in the EU. Its
philosophy is summarized in (European Commission,
2020) and discussed in further detail with respect to
medicine in (Schneeberger, 2020). At present, the
certification process for high-risk medical devices is
conducted by independent authorities, the so-called
notified bodies (Ben-Menahem, 2020, 1-3). However,
these bodies are not yet equipped to handle all
incoming demands, which could lead to a scarcity of
medical devices in the EU (European Commission, 2023, 2-4).
For a comprehensive impact analysis of the new MDR
regulations on risk classes, clinical evaluation,
post-market surveillance and notified bodies,
refer to (Niemiec, 2022). Current challenges for AI
in healthcare mainly center on black-box models
that can perform complex tasks but whose inner
workings are incomprehensible to human stakeholders.
This could lead to an incorrect application of
developed models in the clinical context, for instance
"[...] due to methodological flaws and/or underlying
biases" (Roberts, 2021, 1). Generally applicable
principles regarding AI in medicine that could form a
solid baseline for technical design decisions are
summarized in (Muller, 2021).
Quality Gates & Metrics: A QG is a concept derived
from software quality management and can be defined
as "[...] an objective quality assurance gate,
that is, a verification procedure, performed either
by independent reviewers or by automated scripts"
(Paula F., 2006, 34). At its most basic, a QG
summarizes important criteria for specific outcomes
that are generated at different points during the
software development life cycle (Flohr, 2008, 245).
A means of defining criteria for virtual QGs for
manufacturing use cases is presented in
(Filz, 2020, 8ff). Since these criteria are based on
the inclusion of domain knowledge, they could be
adapted to medical contexts. Interpreting ML
performance metrics requires a thorough understanding
of the information each metric conveys, especially in
medicine, but this understanding is not necessarily
guaranteed (Hicks, 2022, 1). For instance, a very
common metric for classification tasks is the
Receiver Operating Characteristic Area under the
Curve (ROC AUC). It is used as the primary evaluation
metric in popular benchmarking tools hosted e.g. on
Grand-challenge.org, such as the STOIC¹ challenge for
3D computed tomography classification of COVID-19
infected lungs (Boulogne, 2023). Their metrics
selection is based on (Reinke, 2021), according to
which ROC AUC and its prominent counterpart,
Precision-Recall AUC (PR AUC), both reflect data
imbalance (Reinke, 2021, 43ff.). However, there is an
ongoing discussion as to whether ROC AUC reflects
imbalanced data sets, which are very common in
medicine (Davis, 2006; Saito, 2015). Also, published
papers and benchmarking tools disagree on the
consistent application of both metrics for empirical
analysis (Ribeiro, 2020; Strodthoff, 2020). This
inconsistency reinforces the need to standardize
valid approaches.
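The different behavior of the two metrics under class imbalance can be illustrated with a small, self-contained sketch. It is not taken from any of the cited works: the toy scores, the helper names `roc_auc`, `average_precision` and `make_dataset`, and the use of average precision as a step-wise estimate of PR AUC are all assumptions made for illustration. The per-class score distributions are held fixed while the negatives are duplicated, so only the class ratio changes.

```python
from collections import Counter


def roc_auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise PR AUC estimate: precision weighted by the recall gained at each threshold."""
    pos_at = Counter(s for y, s in zip(labels, scores) if y == 1)
    neg_at = Counter(s for y, s in zip(labels, scores) if y == 0)
    n_pos = sum(pos_at.values())
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp += pos_at[threshold]
        fp += neg_at[threshold]
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / (tp + fp))
        prev_recall = recall
    return ap


def make_dataset(neg_copies):
    """Hypothetical toy scores; negatives are duplicated to increase the imbalance."""
    pos_scores = [0.9, 0.7, 0.5, 0.3]
    neg_scores = [0.8, 0.6, 0.4, 0.2, 0.1] * neg_copies
    labels = [1] * len(pos_scores) + [0] * len(neg_scores)
    return labels, pos_scores + neg_scores


for copies in (1, 20):
    y, s = make_dataset(copies)
    print(f"negatives x{copies}: ROC AUC = {roc_auc(y, s):.3f}, "
          f"AP = {average_precision(y, s):.3f}")
```

In this toy setting ROC AUC is identical for both class ratios, whereas average precision drops as negatives are added. The sketch only illustrates why the choice of metric matters for imbalanced medical data; it does not settle the discussion referenced above.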
3 METHODOLOGY BASED ON QUALITY GATES
Our proposed methodology's main objective is to
"make auditing simple", and thus to provide concrete
instructions for the domain-adapted realization of
specific legislative requirements in the context of
Certifiable AI in medicine, while respecting the needs
of different stakeholders and the risks of specific
design decisions. In the long term, such findings could
be applied in a (partially) automated manner to the
complete application life cycle through adapted
frameworks and templates for a comprehensive
documentation of design decisions. The conceptual
foundation is the definition of scientifically
substantiated Criteria for QGs along the complete life
cycle of the intelligent software. To the best of our
knowledge, a similar adaptation of QGs and ML
certification in healthcare has not yet been published.
Owing to the variety of ML methods for different
medical use cases, which comprise different data
types and tuning objectives, the concrete realization
of Criteria should be adapted accordingly. Structural
similarities between use cases from a technical
viewpoint should suffice to generalize applied
methods, as in (Strodthoff, 2020, 3), where metrics from
multi-label protein discovery were adapted to ECG
classification.
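The idea of a QG as a set of documented Criteria that an automated script can evaluate may be sketched as follows. This is a minimal illustration, not the authors' implementation: the class names `Criterion` and `QualityGate`, the helper `build_data_gate`, the evidence keys and both thresholds are hypothetical and would have to be substantiated per use case.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One auditable check together with the reason it exists."""
    name: str
    check: Callable[[Dict[str, float]], bool]
    rationale: str


@dataclass
class QualityGate:
    """A named collection of criteria that must all hold for the gate to pass."""
    name: str
    criteria: List[Criterion]

    def evaluate(self, evidence: Dict[str, float]) -> dict:
        # Evaluate every criterion so the report documents all outcomes,
        # not only the first failure.
        results = {c.name: bool(c.check(evidence)) for c in self.criteria}
        return {"gate": self.name, "passed": all(results.values()), "results": results}


def build_data_gate() -> QualityGate:
    """Hypothetical data gate; the thresholds are placeholders, not recommendations."""
    return QualityGate(
        name="QG Data",
        criteria=[
            Criterion(
                name="missing_values",
                check=lambda e: e["missing_fraction"] <= 0.05,
                rationale="Too many missing values undermine data quality.",
            ),
            Criterion(
                name="class_balance",
                check=lambda e: e["minority_fraction"] >= 0.10,
                rationale="Severe imbalance affects metric interpretation.",
            ),
        ],
    )


report = build_data_gate().evaluate(
    {"missing_fraction": 0.01, "minority_fraction": 0.20}
)
print(report)
```

Keeping the rationale next to each check is one possible way to tie a design decision to its documentation, which is the kind of auditable record the methodology aims for.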
General Structure of Quality Gates: Figure 1 depicts
the high-level QG hierarchy adapted to ML processes:
QG Data ensures a clean and informa-
¹ https://stoic2021.grand-challenge.org/