
tion. For instance, our neural network could undergo further iterations to incorporate diverse data sources, thereby optimizing the weighting of agents in determining the score. Additionally, future work should involve reconfiguring the architecture of the prototype and subjecting it to empirical evaluation.
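To make this weighting idea concrete, the following minimal sketch shows one possible form of such a network: a single learnable weight per agent, combined via a softmax so that the agent scores are mixed convexly. This is an illustration only, not the prototype's implementation; the name AgentWeightingNet and all other identifiers are ours.

    # Minimal sketch (illustrative, not the prototype's code): a network
    # that learns to weight per-agent scores into one trustworthiness score.
    import torch
    import torch.nn as nn

    class AgentWeightingNet(nn.Module):
        def __init__(self, n_agents: int):
            super().__init__()
            # One learnable weight per agent; softmax keeps the mix convex.
            self.logits = nn.Parameter(torch.zeros(n_agents))

        def forward(self, agent_scores: torch.Tensor) -> torch.Tensor:
            weights = torch.softmax(self.logits, dim=-1)
            return (agent_scores * weights).sum(dim=-1)

    # Toy usage: three agents each score one LLM output in [0, 1].
    net = AgentWeightingNet(n_agents=3)
    scores = torch.tensor([[0.9, 0.4, 0.7]])
    print(net(scores))  # untrained: the plain mean of the agent scores

Untrained, the softmax over zero-initialized weights yields the plain average of the agent scores; training on labelled examples would shift weight towards the more reliable agents.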
Within the scope of our work, we have created a prototypical agent-based framework for assessing the trustworthiness of LLMs. The prototype establishes a robust foundation and points to promising directions for future work. With this implementation, we contribute to the development of a validation framework for LLM outputs, a vital step towards its future application.