Letting Go of the Numbers: Measuring AI Trustworthiness
Carol J. Smith
Carnegie Mellon University, U.S.A.
Abstract: AI systems need to be designed to work with, and for, people. A person’s willingness to trust a particular
system is based on their expectations of the system’s behavior. Their trust is complex, transient, and
personal – it cannot easily be measured. However, an AI system’s trustworthiness can be measured. A
trustworthy AI system demonstrates that it will fulfill its promise by providing evidence that it is dependable
in the context of use and that end users have awareness of its capabilities during use. We can measure reliability
and instrument systems to monitor usage (or lack thereof) quantitatively. However, AI’s potential is bound to
perceptions of its trustworthiness, which requires qualitative measures to fully ascertain. Doing AI well
requires a reset – letting go of (some of) the numbers and learning new methods that provide a more complete
assessment of the system.
EXTENDED ABSTRACT
AI systems need to be designed to work with, and for,
people. Despite this obvious requirement, the
evaluation of AI systems has focused primarily on
numeric (quantitative) measures such as accuracy and
F1 scores. These measures are important, but consider
what an accuracy score can tell us about the system's
fit with user needs (not much), or about the system's
trustworthiness to end users (even less).
Quantitative metrics alone cannot provide a holistic
view of a system's design, performance, and usage. If
end users have an issue, quantitative data typically
cannot provide enough information to fully understand
the issue, nor enough guidance to address it.
Unaddressed issues add up and will eventually affect
system use. Despite this, only minimal effort is
typically made to develop measures that determine
whether people using AI systems find them helpful and
trustworthy. Prioritizing a good user experience, both
during development and while the system is deployed,
will help support a successful AI system.
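To make concrete how little these scores reveal on
their own, consider how they are computed. The
following is a minimal sketch in Python using
scikit-learn; the labels are invented for illustration
and stand in for any classifier's predictions on a
held-out test set.

    # Minimal sketch: computing the quantitative metrics discussed above.
    # The labels below are invented for illustration only.
    from sklearn.metrics import accuracy_score, f1_score

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # ground-truth labels
    y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # model predictions

    print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.80, fraction correct
    print(f"F1 score: {f1_score(y_true, y_pred):.2f}")        # 0.83, precision/recall balance

    # Both numbers summarize agreement with labels. Neither reveals whether
    # end users understand the system's limits, find it helpful in their
    # context, or trust it appropriately.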
An AI system’s potential is bound to stakeholders’
perceptions of its trustworthiness, which can be fully
ascertained only with qualitative measures. I use the
term stakeholders to include those developing,
acquiring, using, and being affected by AI systems. I
will focus primarily on the people using AI systems
and those affected by them, and I will present
alternatives that will support you in conducting more
complete assessments of AI systems.
Trustworthiness is a property of a system that
demonstrates it will fulfill its promise by providing
evidence that it is dependable in the context of use
and that end users have awareness of its capabilities
during use (Gardner et al., 2023). Users gain an
understanding (or misunderstanding) of an AI
system’s capabilities and limits as they work with it,
within their context. Their awareness of those
capabilities may be informed by training, their direct
experience, and their colleagues’ experiences, and
they will use this information to develop a justified
level of confidence, or calibrated trust, in the
system.
When humans develop calibrated trust in a
system – a psychological state of adjusted confidence
that is aligned to end users’ real-time perceptions of
trustworthiness (Gardner et al., 2023) – they can be
productive with the system and use it appropriately.
Calibrated trust is neither over-trust nor under-trust;
it is a true understanding of the system’s capabilities
and limitations. When people over-trust a system,
they are likely to use it for tasks it was not designed
to complete. For example, a generative AI system will
excel at tasks where creativity is desirable and is less
likely to be successful at tasks requiring retrieval of
specific wording. My colleague Robin may nevertheless
choose to use a generative AI system for such a
retrieval task due to positive experiences using it in
other contexts. Robin is likely to find the system
ineffective in this new activity and may distrust the
system as a result of the poor experience. Robin may
then be less likely to use it, even in situations where
it could be helpful to them.