
6 CONCLUSION
In this paper, we examined how well Large Language Models (LLMs) can generate knowledge work documents that adhere to a specified topical domain and to descriptions comprising multiple parameters. Our studies were conducted on documents generated by our knowledge work dataset generator KnoWoGen with the Mistral-7B-Instruct LLM. Overall, the experiments show that the generated documents were perceived as natural and as fitting their intended domain and other parameters, which makes these parameters reliable ground truth data.
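To make the generation setup concrete, below is a minimal sketch, assuming a Hugging Face transformers setup, of how a document can be conditioned on a domain and a multi-parameter description with Mistral-7B-Instruct; the parameter names and the prompt template are illustrative assumptions, not KnoWoGen's actual schema.

```python
# Minimal sketch (not KnoWoGen's implementation) of parameter-conditioned
# document generation with Mistral-7B-Instruct via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # assumed model revision
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Hypothetical generation parameters; the real parameter schema may differ.
params = {
    "domain": "biotechnology",
    "doc_type": "project status report",
    "author_role": "research engineer",
    "topic": "lab automation pipeline",
}
prompt = (
    "Write a {doc_type} from the {domain} domain, authored by a "
    "{author_role}, about {topic}. Use a realistic workplace tone."
).format(**params)

# Mistral-Instruct expects the [INST] chat format; the tokenizer's chat
# template applies it for us.
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], return_tensors="pt"
).to(model.device)
output = model.generate(
    input_ids, max_new_tokens=512, do_sample=True, temperature=0.7
)
# Decode only the newly generated tokens, i.e., the synthetic document.
document = tokenizer.decode(
    output[0][input_ids.shape[-1]:], skip_special_tokens=True
)
print(document)
```

Sampling with a moderate temperature is one way to trade off adherence to the specified parameters against lexical diversity across generated documents.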
In future experiments, it would be meaningful to also examine multiple related documents and to inspect whether the generated documents are coherent with respect to their common task description and their contents, as this information can also serve as relevant ground truth. Regarding the topics of generated documents, it would also be interesting to assess the content-related variability within a larger set of documents targeting the same topic or domain (a simple similarity-based measure is sketched below). Moreover, since our experiments showed that the parameters are well respected, in follow-up work it is now also meaningful to examine whether synthesized documents are valuable as training data for improving the performance of machine learning models on downstream tasks.
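As one hypothetical way to operationalize the variability assessment mentioned above, the following sketch scores a set of same-topic documents by their mean pairwise TF-IDF cosine similarity; a lower mean indicates higher content-related variability. The sample documents are placeholders, not generator output, and an embedding-based similarity could be substituted without changing the structure.

```python
# Sketch of a variability score for documents generated for the same topic:
# mean cosine similarity over all unordered document pairs (lower = more varied).
from itertools import combinations

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def mean_pairwise_similarity(documents):
    """Average pairwise cosine similarity of TF-IDF document vectors."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
    sims = cosine_similarity(tfidf)
    pairs = list(combinations(range(len(documents)), 2))
    return sum(sims[i, j] for i, j in pairs) / len(pairs)

# Placeholder documents standing in for a set of same-topic generations.
docs = [
    "Status report: the sequencing pipeline met its Q3 throughput goal.",
    "Meeting minutes: discussed delays in the sequencing pipeline rollout.",
    "Email: please review the attached sequencing pipeline benchmarks.",
]
print(f"Mean pairwise similarity: {mean_pairwise_similarity(docs):.3f}")
```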
ACKNOWLEDGEMENTS
This work was funded by the German Federal Min-
istry of Education and Research (BMBF) in the
project SensAI (grant no. 01IW20007).
REFERENCES
Chang, Y., Wang, X., Wang, J., Wu, Y., Yang, L., Zhu,
K., Chen, H., Yi, X., Wang, C., Wang, Y., Ye, W.,
Zhang, Y., Chang, Y., Yu, P. S., Yang, Q., and Xie,
X. (2024). A survey on evaluation of large language
models. ACM Trans. Intell. Syst. Technol., 15(3):39:1–
39:45.
Fourrier, C., Habib, N., Lozovskaya, A., Szafer, K., and Wolf, T. (2024). Open LLM leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.
Gonçalves, D. (2011). Pseudo-desktop collections and PIM: The missing link. In ECIR 2011 Workshop on Evaluating Personal Search, pages 3–4.
Heim, D., Jilek, C., Ulges, A., and Dengel, A. (2024). Using large language models to generate authentic multi-agent knowledge work datasets. In INFORMATIK 2024, pages 1347–1357. Gesellschaft für Informatik e.V., Bonn.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). Mistral 7B. arXiv, 2310.06825.
Jones, W. (2008). Keeping Found Things Found: The Study
and Practice of Personal Information Management.
Morgan Kaufmann.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology.
Long, L., Wang, R., Xiao, R., Zhao, J., Ding, X., Chen, G., and Wang, H. (2024). On LLMs-driven synthetic data generation, curation, and evaluation: A survey. In Findings of the ACL, ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 11065–11082. ACL.
Reinhardt, W., Schmidt, B., Sloep, P. B., and Drachsler, H. (2011). Knowledge worker roles and actions: Results of two empirical studies. Knowledge and Process Management, 18:150–174.
Sordi, J. O. D., de Azevedo, M. C., Bianchi, E. M. P. G., and Carandina, T. (2020). Defining the term knowledge worker: Toward improved ontology and operationalization. Academy of Management Proc.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,
Bhosale, S., et al. (2023). Llama 2: Open foundation
and fine-tuned chat models. arXiv, 2307.09288.
Tung, V. T., Jacucci, G., and Ruotsalo, T. (2017). Watching inside the screen: Digital activity monitoring for task recognition and proactive information retrieval. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol., 1(3):109:1–109:23.
Xu, R., Cui, H., Yu, Y., Kan, X., Shi, W., Zhuang,
Y., Wang, M. D., Jin, W., Ho, J., and Yang, C.
(2024). Knowledge-infused prompting: Assessing
and advancing clinical text data generation with large
language models. In Findings of the ACL, ACL
2024, Bangkok, Thailand, August 11-16, 2024, pages
15496–15523. ACL.
Ye, J., Gao, J., Li, Q., Xu, H., Feng, J., Wu, Z., Yu, T., and Kong, L. (2022). ZeroGen: Efficient zero-shot learning via dataset generation. In Proc. of the 2022 Conf. on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022, pages 11653–11669. ACL.
Yu, Y., Zhuang, Y., Zhang, J., Meng, Y., Ratner, A. J., Krishna, R., Shen, J., and Zhang, C. (2023). Large language model as attributed training data generator: A tale of diversity and bias. In Advances in Neural Information Processing Systems 36: Annual Conf. on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou,
Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al.
(2023). A survey of large language models. arXiv,
2303.18223.