A Data Annotation Approach Using Large Language Models

Carlos Rocha, Jonatas Grosman, Fernando Correia, Venicius Rego, Hélio Lopes

2025

Abstract

Documents are central to economic and academic systems, yet extracting information from them can be complex and time-consuming. Visual Question Answering (VQA) models address this challenge by using natural language prompts to extract information. However, their development depends on annotated datasets, which are costly to produce. To reduce this cost, we propose a four-step process that combines Computer Vision models and Large Language Models (LLMs) for VQA data annotation in financial reports. The method starts with Document Layout Analysis and Table Structure Extraction to identify document structures. It then uses two distinct LLMs, one to generate question-answer pairs and one to evaluate them, automating both the construction of candidate pairs and the selection of the best ones for the final dataset. We found that Mixtral-8x22B and GPT-4o mini offer the best cost-benefit trade-off for generating pairs, while Claude 3.5 Sonnet performed best for evaluation, aligning closely with human assessments.
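As a rough illustration of the generate-then-evaluate stage described in the abstract, the following minimal sketch ranks candidate question-answer pairs for one document region and keeps the highest-scoring ones. It is not the authors' implementation: `QAPair`, `annotate_region`, `fake_generator`, and `fake_evaluator` are hypothetical names, the two callables stand in for the generator and evaluator LLMs, and the scoring rule is a toy assumption chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str


def annotate_region(
    region_text: str,
    generate: Callable[[str], List[QAPair]],
    evaluate: Callable[[str, QAPair], float],
    top_k: int = 1,
) -> List[QAPair]:
    """Generate candidate pairs for one region and keep the top_k by score."""
    candidates = generate(region_text)
    # Rank candidates from highest to lowest evaluator score.
    ranked = sorted(
        candidates,
        key=lambda pair: evaluate(region_text, pair),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical stand-in for the generator LLM (assumption, not a real API).
def fake_generator(text: str) -> List[QAPair]:
    return [
        QAPair("What is shown?", "A table"),
        QAPair("What was the net revenue in 2023?", "1,200"),
    ]


# Hypothetical stand-in for the evaluator LLM (assumption, toy scoring):
# reward answers that appear verbatim in the region text.
def fake_evaluator(text: str, pair: QAPair) -> float:
    return 1.0 if pair.answer in text else 0.0


region = "Net revenue 2023: 1,200"
best = annotate_region(region, fake_generator, fake_evaluator)
```

Passing the generator and evaluator as plain callables keeps the selection logic independent of any particular model, which mirrors the abstract's use of two distinct LLMs for the two roles.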

Paper Citation


in Harvard Style

Rocha C., Grosman J., Correia F., Rego V. and Lopes H. (2025). A Data Annotation Approach Using Large Language Models. In Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS; ISBN 978-989-758-749-8, SciTePress, pages 748-755. DOI: 10.5220/0013280100003929


in Bibtex Style

@conference{iceis25,
author={Carlos Rocha and Jonatas Grosman and Fernando Correia and Venicius Rego and Hélio Lopes},
title={A Data Annotation Approach Using Large Language Models},
booktitle={Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2025},
pages={748-755},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013280100003929},
isbn={978-989-758-749-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 27th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - A Data Annotation Approach Using Large Language Models
SN - 978-989-758-749-8
AU - Rocha C.
AU - Grosman J.
AU - Correia F.
AU - Rego V.
AU - Lopes H.
PY - 2025
SP - 748
EP - 755
DO - 10.5220/0013280100003929
PB - SciTePress