
6 CONCLUSION
In this paper, we have outlined a methodology for testing how well LLMs interpret workflow architectures. Stemming from the conjecture that an LLM should correctly answer low-abstraction-level questions in order to reliably respond to those at a higher level of abstraction, the methodology introduces a set of test patterns intended to generate a series of low-abstraction queries that test the reliability of LLM answers. Although the presented list of test patterns is not exhaustive, the initial results indicate that the approach is viable.
The main lessons learned from the results were as
follows.
• LLMs can reasonably interpret workflow architectures to answer questions about their structure, behavior, and basic functionality.
• The answers of LLMs are subject to aleatoric uncertainty: the LLM can give different answers to the same question. However, taking the majority vote (of 5 repetitions in our case) gives a correct answer for almost all of our test instances (22–24 correct out of 25 instances, depending on the LLM and WADL variant); a minimal sketch of this voting step is given after this list.
• The “problematic” test instances differ among the LLMs and WADL variants. The answers tend to be worse when the workflow is semantically incorrect (Sect. 4.4.2) and in the case of the Import WADL variant (only in the agent mode, e.g., the List of tasks pattern in Table 2).
• It is necessary to formulate the questions as clearly and accurately as possible (Sect. 4.4.1).
• Instructing the LLM to reason about the question before answering it (Kojima et al., 2023) improves the results.
• The ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020) metrics are not good enough to evaluate answers to open-ended questions.
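As an illustration of the majority-vote step mentioned above, the following minimal Python sketch repeats a question and returns the most frequent answer together with its agreement ratio. This is not our evaluation harness; the ask_llm callable (a wrapper around the model invocation that is assumed to return answers normalised to comparable strings) is a placeholder.

from collections import Counter

def majority_vote(ask_llm, question, n_repetitions=5):
    """Repeat the same question and return the most frequent answer."""
    # ask_llm is a placeholder for the model call; answers are assumed to be
    # normalised to comparable strings (e.g., a task name or yes/no).
    answers = [ask_llm(question) for _ in range(n_repetitions)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n_repetitions  # answer plus its agreement ratio

With five repetitions, an agreement ratio of 3/5 or lower can serve as a simple flag for test instances whose answers are unstable and deserve manual inspection.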
In the future, we plan to extend the methodology by instantiating more test patterns and by identifying a better evaluation metric for the Basic functionality category, and to apply it to questions at a higher abstraction level, such as recommending a task that fits into a given workflow architecture.
ACKNOWLEDGEMENTS
This work was partially supported by the EU project ExtremeXP (grant agreement 101093164), partially by the INTER-EUREKA project LUE231027, partially by Charles University institutional funding 260698, and partially by the Charles University Grant Agency project 269723.
REFERENCES
Ahmad, A., Waseem, M., Liang, P., Fahmideh, M., Aktar,
M. S., and Mikkonen, T. (2023). Towards Human-Bot
Collaborative Software Architecting with ChatGPT.
In Proceedings of EASE 2023, Oulu, Finland, pages
279–285. ACM.
Dhar, R., Vaidhyanathan, K., and Varma, V. (2024). Can
LLMs generate architectural design decisions? - An
exploratory empirical study. In Proceedings of ICSA
2024, Hyderabad, India, pages 79–89. IEEE CS.
Fatemi, B., Halcrow, J., and Perozzi, B. (2023). Talk like a
graph: Encoding graphs for large language models.
Guo, J., Du, L., Liu, H., Zhou, M., He, X., and Han, S. (2023). GPT4Graph: Can large language models understand graph structured data? An empirical evaluation and benchmarking.
Ip, J. (2024). LLM evaluation metrics: Everything you need
for LLM evaluation.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y.
(2023). Large language models are zero-shot reasoners.
Li, B., Wu, W., Tang, Z., Shi, L., Yang, J., Li, J., Yao, S.,
Qian, C., Hui, B., Zhang, Q., Yu, Z., Du, H., Yang, P.,
Lin, D., Peng, C., and Chen, K. (2024). DevBench: A
comprehensive benchmark for software development.
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., ..., and Koreeda, Y. (2023). Holistic evaluation of language models.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C.
(2023). G-eval: NLG evaluation using GPT-4 with
better human alignment. In Proceedings of EMNLP
2023, Singapore.
Sutawika, L., Schoelkopf, H., Gao, L., Abbasi, B., Biderman, S., Tow, J., ben fattori, Lovering, C., farzanehnakhaee70, Phang, J., Thite, A., Fazz, Wang, T., Muennighoff, N., Aflah, sdtblck, nopperl, gakada, tttyuntian, ..., and AndyZwei (2024). EleutherAI/lm-evaluation-harness: v0.4.2.
Wang, H., Feng, S., He, T., Tan, Z., Han, X., and Tsvetkov,
Y. (2024). Can language models solve graph problems
in natural language?
Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi,
Y. (2020). BERTScore: Evaluating text generation
with BERT.
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z.,
Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang,
H., Gonzalez, J. E., and Stoica, I. (2023). Judging
LLM-as-a-judge with MT-Bench and Chatbot Arena.