
making process compared to the baseline, unassisted
human decision. This can empower managers and designers either to modify the system so as to discourage overreliance or to promote trust, or to act on the company or institutional culture, for example by offering professionals training that fosters a balanced level of trust in the machine's advice.
8 CONCLUSIONS
The tool presented in this paper allows for a multi-
dimensional evaluation of the quality of DSS, taking
into account their robustness (Section 2), data simi-
larity (Section 3), calibration (Section 4), utility (Sec-
tion 5), reliability (Section 6) and human interaction
(Section 7). More generally, this work aims to contribute to the “beyond accuracy” discourse: starting from the recognition that the traditional metric of accuracy, albeit vital, is just one piece of the puzzle, we highlighted the importance of less prevalent but equally important metrics for DSS quality assessment. This twofold intent (contributing to decision support system evaluation and to the beyond-accuracy discourse) motivated us to make the DSS Quality Assessment tool available to the whole community of interested scholars and practitioners. Designed with versatility in mind, the tool caters to a diverse range of needs and can serve researchers, practitioners, and organizations alike.
We recognize the challenges of gathering, in practice, all the data needed for each evaluation step. The DSS Quality Assessment tool is designed to
be modular, with each step capable of independent ex-
ecution depending on available data. This flexibility
allows users to tailor the evaluation to their specific
goals and available resources.
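As a purely illustrative sketch of this design choice (the names and structure below are hypothetical and do not reflect the tool's actual implementation), a modular evaluation pipeline can let each step declare the data it requires and simply be skipped when that data is unavailable:

```python
# Hypothetical sketch of a modular evaluation pipeline; not the tool's actual API.
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class EvalStep:
    name: str                                  # e.g. "calibration", "human interaction"
    required_keys: List[str]                   # data the step needs in order to run
    run: Callable[[Dict[str, Any]], Any]       # the step's evaluation routine


def run_available_steps(steps: List[EvalStep], data: Dict[str, Any]) -> Dict[str, Any]:
    """Execute only the steps whose required data are present; skip the rest."""
    results: Dict[str, Any] = {}
    for step in steps:
        if all(key in data for key in step.required_keys):
            results[step.name] = step.run(data)
        else:
            results[step.name] = "skipped (missing required data)"
    return results


# Example: only the calibration step runs, because human-study data are absent.
steps = [
    EvalStep("calibration", ["probabilities", "labels"], lambda d: "Brier score / ECE"),
    EvalStep("human interaction", ["aided_decisions", "unaided_decisions"], lambda d: "benefit analysis"),
]
print(run_available_steps(steps, {"probabilities": [0.7, 0.2], "labels": [1, 0]}))
```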
In promoting a multidimensional assessment of DSS, we conclude by emphasizing the imperative of technovigilance (Cabitza and Zeitoun, 2019).
Beyond mere evaluation, there is a need for continu-
ous oversight and reflection on the deployment, use,
and implications of these systems, especially as new
challenges arise in Medical AI. The modular design
of the DSS Quality Assessment tool, for example, al-
lows for the adaptation and inclusion of additional as-
sessments as needs arise. With the increasing threat of adversarial attacks on ML systems, which pose significant risks in the medical domain, evaluations of robustness against such attacks are set to become ever more relevant in the near future (Li et al., 2021). This extensibility ensures that the tool remains relevant and useful in the face of such evolving threats.
A genuinely effective Decision Support System
(DSS) must be integrated within a culture that pri-
oritizes technology assessment, vigilantly monitors
outcomes, and is consistently attentive to the effects
observed. As the field of Artificial Intelligence (AI)
evolves, so does our comprehension of how to eval-
uate it. It has become clear that concentrating solely
on accuracy is inadequate. Employing a broad, mul-
tifaceted approach is not merely advantageous – it is
imperative. Our tool, which is readily available online
at no cost, represents a modest yet significant con-
tribution towards realizing this research agenda and
methodology, and it is open for use and validation by
all practitioners and researchers who are aligned with
these principles.
REFERENCES
Araujo, T., Helberger, N., Kruikemeier, S., and De Vreese, C. H. (2020). In AI we trust? Perceptions about automated decision-making by artificial intelligence. AI & Society, 35:611–623.
Assale, M., Bordogna, S., and Cabitza, F. (2020). Vague
visualizations to reduce quantification bias in shared
medical decision making. In VISIGRAPP (3: IVAPP),
pages 209–216.
Birhane, A., Kalluri, P., Card, D., Agnew, W., Dotan, R.,
and Bao, M. (2022). The values encoded in machine
learning research. In Proceedings of the 2022 ACM
Conference on Fairness, Accountability, and Trans-
parency, pages 173–184.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.
Cabitza, F. and Campagner, A. (2021). The need to sepa-
rate the wheat from the chaff in medical informatics:
Introducing a comprehensive checklist for the (self)-
assessment of medical AI studies. International Jour-
nal of Medical Informatics, 153.
Cabitza, F., Campagner, A., Albano, D., Aliprandi, A.,
Bruno, A., Chianca, V., Corazza, A., Di Pietto, F.,
Gambino, A., Gitto, S., et al. (2020a). The elephant in
the machine: Proposing a new metric of data reliabil-
ity and its application to a medical case to assess clas-
sification reliability. Applied Sciences, 10(11):4014.
Cabitza, F., Campagner, A., Ronzio, L., Cameli, M., Mandoli, G. E., Pastore, M. C., Sconfienza, L. M., Folgado, D., Barandas, M., and Gamboa, H. (2023a). Rams, hounds and white boxes: Investigating human–AI collaboration protocols in medical diagnosis. Artificial Intelligence in Medicine, 138:102506.
Cabitza, F., Campagner, A., and Sconfienza, L. M. (2020b). As if sand were stone. New concepts and metrics to probe the ground on which to build trustable AI. BMC Medical Informatics and Decision Making, 20(1):1–21.
Cabitza, F., Campagner, A., Soares, F., et al. (2021). The importance of being external. Methodological insights