A Data Annotation Approach Using Large Language Models
Carlos Rocha, Jonatas Grosman, Fernando Correia, Venicius Rego, Hélio Lopes
2025
Abstract
Documents are central to economic and academic systems, yet extracting information from them can be complex and time-consuming. Visual Question Answering (VQA) models address this challenge by using natural language prompts to extract information. However, their development depends on annotated datasets, which are costly to produce. To reduce this cost, we propose a four-step process that combines computer vision models and Large Language Models (LLMs) for VQA data annotation in financial reports. The method starts with Document Layout Analysis and Table Structure Extraction to identify document structures. It then uses two distinct LLMs to generate and evaluate question-answer pairs, automating the construction and selection of the best pairs for the final dataset. We found that Mixtral-8x22B and GPT-4o mini offer the best cost-benefit trade-off for generating pairs, while Claude 3.5 Sonnet performed best for evaluation, aligning closely with human assessments.
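The four steps named in the abstract can be sketched as a small pipeline. This is only an illustrative outline, not the authors' implementation: the names `Region`, `QAPair`, and `annotate_page`, the `min_score` threshold, and the stub functions are all hypothetical, and the real layout, table, and LLM components are replaced by plain Python callables so the sketch runs without any model or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Region:
    kind: str     # hypothetical label, e.g. "text" or "table"
    content: str  # region text, or a serialized table


@dataclass
class QAPair:
    question: str
    answer: str
    score: float = 0.0


def annotate_page(
    page: str,
    detect_layout: Callable[[str], List[Region]],
    extract_table: Callable[[Region], Region],
    generate: Callable[[Region], List[QAPair]],
    evaluate: Callable[[Region, QAPair], float],
    min_score: float = 0.5,  # assumed selection rule, not from the paper
) -> List[QAPair]:
    # Step 1: Document Layout Analysis splits the page into regions.
    regions = detect_layout(page)
    # Step 2: Table Structure Extraction is applied to table regions only.
    regions = [extract_table(r) if r.kind == "table" else r for r in regions]
    # Step 3: a generator LLM proposes question-answer pairs per region.
    candidates = [(r, qa) for r in regions for qa in generate(r)]
    # Step 4: a second, distinct LLM scores each pair against its region.
    scored = [QAPair(qa.question, qa.answer, evaluate(r, qa)) for r, qa in candidates]
    # Keep the pairs that pass the threshold, best first.
    kept = [qa for qa in scored if qa.score >= min_score]
    return sorted(kept, key=lambda qa: qa.score, reverse=True)


# Stand-in components so the sketch is runnable; real ones would call models.
def stub_layout(page: str) -> List[Region]:
    return [Region("text", page)]


def stub_table(region: Region) -> Region:
    return region


def stub_generate(region: Region) -> List[QAPair]:
    return [QAPair("What is reported?", region.content), QAPair("Unrelated?", "n/a")]


def stub_evaluate(region: Region, qa: QAPair) -> float:
    # Toy judge: a pair is good only if its answer appears in the region.
    return 1.0 if qa.answer in region.content else 0.0


selected = annotate_page(
    "Revenue was 10M", stub_layout, stub_table, stub_generate, stub_evaluate
)
```

Passing the generator and evaluator in as separate callables mirrors the abstract's use of two distinct LLMs, and makes it easy to swap either model when comparing cost against agreement with human assessments.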
Paper Citation
in Harvard Style
Rocha C., Grosman J., Correia F., Rego V. and Lopes H. (2025). A Data Annotation Approach Using Large Language Models. In Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS; ISBN 978-989-758-749-8, SciTePress, pages 748-755. DOI: 10.5220/0013280100003929
in Bibtex Style
@conference{iceis25,
author={Carlos Rocha and Jonatas Grosman and Fernando Correia and Venicius Rego and Hélio Lopes},
title={A Data Annotation Approach Using Large Language Models},
booktitle={Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2025},
pages={748--755},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013280100003929},
isbn={978-989-758-749-8},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - A Data Annotation Approach Using Large Language Models
SN - 978-989-758-749-8
AU - Rocha C.
AU - Grosman J.
AU - Correia F.
AU - Rego V.
AU - Lopes H.
PY - 2025
SP - 748
EP - 755
DO - 10.5220/0013280100003929
PB - SciTePress