A Data Annotation Approach Using Large Language Models

Carlos Rocha (https://orcid.org/0009-0004-9696-330X), Jonatas Grosman (https://orcid.org/0000-0002-1152-5828), Fernando A. Correia (https://orcid.org/0000-0003-0394-056X), Venicius Rego (https://orcid.org/0000-0002-7101-4608) and Hélio Lopes (https://orcid.org/0000-0003-4584-1455)

Department of Informatics, PUC-Rio, Marquês de São Vicente, 225 RDC, 4th floor - Gávea, Rio de Janeiro, Brazil
{cmrocha, jgrosman, fjunior, vrego, lopes}@inf.puc-rio.br
Keywords:
Data Annotation, Large Language Model, Visual Question-Answering, Documents, Machine Learning.
Abstract:
Documents are crucial for the economic and academic systems, yet extracting information from them can be
complex and time-consuming. Visual Question Answering (VQA) models address this challenge using natural
language prompts to extract information. However, their development depends on annotated datasets, which
are costly to produce. To face this challenge, we propose a four-step process that combines Computer Vision
Models and Large Language Models (LLMs) for VQA data annotation in financial reports. This method starts
with Document Layout Analysis and Table Structure Extraction to identify document structures. Then, it uses
two distinct LLMs for the generation and evaluation of question and answer pairs, automating the construction
and selection of the best pairs for the final dataset. As a result, we found Mixtral-8x22B and GPT-4o mini to
be the most cost-benefit for generating pairs, while Claude 3.5 Sonnet performed best for evaluation, aligning
closely with human assessments.
1 INTRODUCTION
Data annotation is essential for developing supervised
machine learning models, especially in Natural Lan-
guage Processing (NLP) and Computer Vision (CV).
The quality of these models relies on large volumes
of labeled data for tasks like document comprehen-
sion and interpretation. Document Visual Question
Answering (DocVQA) exemplifies this by combining
NLP and CV to enable models to interpret document
images, leveraging textual and visual features to allow
interaction with document information through natu-
ral language queries.
Training DocVQA models requires thousands of
carefully reviewed and annotated document pages
(Mathew et al., 2021). This annotation and re-
view process is time-consuming and costly, primar-
ily due to the significant human resources needed. To
meet this demand, the literature often employs non-
expert annotators through crowd-sourcing platforms
like Amazon Mechanical Turk (Mathew et al., 2021;
Mathew et al., 2022). However, using such plat-
forms presents significant challenges and limitations,
including inconsistent annotation quality and the ad-
ditional effort required for review and validation, in-
creasing both cost and time (Kittur et al., 2008).
DocVQA models are used in domains like scien-
tific research, insurance, and finance for tasks such
as entity relationship identification, policy compre-
hension, fraud detection, and financial data extrac-
tion. Financial reports, such as those from the Brazilian Stock Market (Brasil Bolsa Balcão - B3, https://www.b3.com.br/), are a valuable data source. In B3, 2,706 companies are mandated by the Brazilian Securities and Exchange Commission, or Comissão de Valores Mobiliários (CVM, https://www.gov.br/cvm/en), to publish annual balance sheets, quarterly reports, and other information, reflecting frameworks similar to those of the Securities and Exchange Commission (SEC, https://www.sec.gov/) in the U.S. and the European Securities and Markets Authority (ESMA, https://www.esma.europa.eu/) in the EU.
Although these financial reports are public, they
lack structured annotated data, requiring solutions to
the data annotation challenge. One approach is us-
ing Large Language Models (LLMs) as annotators
to generate labels in a zero-shot or few-shot manner.
While promising, this method introduces noise, par-
ticularly in complex, domain-specific tasks (Agrawal
et al., 2022). Given their remarkable capability in text
annotation, LLMs can automate the creation of anno-
tated question-answer datasets for DocVQA. Lever-
aging this generalist capability can significantly re-
duce time and cost, leading to our main research ques-
tion: MRQ: How can we use Large Language Models in data annotation for the Document Visual Question Answering task?
To address MRQ, the approach must handle di-
verse document layouts, ensuring readable text repre-
sentation and reliable, high-quality question-answer
pairs for effective model training. This leads to two
sub-questions: RQ1: How can we combine computer
vision models and Large Language Models to gen-
erate questions and answers from documents?, and
RQ2: How can we assess the quality of the questions
and answers generated?
We propose a three-stage process: (i) Tran-
scription, (ii) Question-Answer Generation, and (iii)
Question-Answer Judgment. First, we transcribe the
document page by recognizing the characters and the
layout’s main points and converting the tables into
a markup language. The transcriptions are then in-
putted into an LLM to generate QA pairs, followed by
a final stage where another LLM evaluates the quality
of these pairs for inclusion in the final dataset.
Our work’s main contributions are the following:
We present a new approach for generating and
evaluating question-answer pairs in documents
that contain both text and tables.
We contribute by evaluating different models for
Document Layout Analysis and Table Structure
Recognition to extract information from B3’s fi-
nancial reports.
The remainder of this paper is organized as fol-
lows: Section 2 reviews key works on DocVQA
dataset construction and LLM data annotation. Sec-
tion 3 outlines our method and its stages. Section 4
details the experiments validating the models and ap-
proach. Finally, Section 5 highlights findings, limita-
tions, and future work.
2 RELATED WORK
DocVQA requires a labeled dataset for effective
model training as in any supervised machine learning
task. High-quality data helps models interpret various
textual, structural, and graphical elements in a docu-
ment. (Mathew et al., 2021) introduced a dataset with
complex documents containing 50,000 questions over
12,767 images, curated from the UCSF Industry Doc-
ument Library and annotated through a three-stage
crowdsourced process of question generation, valida-
tion, and review. Our approach is similar to (Mathew
et al., 2021) but with fewer questions per page, ad-
justed validation, and using an LLM to judge the QA
pairs generated.
LLMs have been explored for efficient annota-
tion in NLP. (Wang et al., 2021) showed GPT-3 sig-
nificantly reduced costs in question generation while
maintaining human-like quality. Our work aligns with
theirs, focusing on financial reports in QA annotation
with LLMs. For complex table data, (Nguyen et al.,
2023) converted tables to HTML and dataframes for
QA input. Given LLMs' tendency to hallucinate, ques-
tion validity is crucial. (Bai et al., 2024) introduced a
benchmarking framework where LLMs both generate
and evaluate questions, reducing biases via decentral-
ized peer evaluation. Larger models like LLaMA-65B
and GPT-4 showed enhanced accuracy in this setup.
We adopt this evaluation framework using LLMs to
assess annotations, filtering errors and improving au-
tomation without human intervention.
3 METHODOLOGY
As depicted in Figure 1, the process consists of three
stages: (i) Transcription, (ii) Question-Answer Gen-
eration, and (iii) Question-Answer Judgment. The
transcription stage generates a textual representation
of document files and feeds an LLM to generate QA
tuples in the second stage. Finally, the same transcrip-
tion and the generated QA tuples feed another LLM
to select the valid tuples to compose the dataset.
3.1 Transcription
The process begins with the Transcription Stage,
where a document file is received and the desired
pages are extracted as images for subsequent steps.
An OCR tool is then applied to transcribe the textual
content of these images.
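As a rough illustration of this extraction step, the sketch below renders the selected pages to image files. PyMuPDF is an assumption here, since the paper does not name the rendering library, and the OCR call itself is omitted.

```python
# Minimal sketch of the page-extraction step, assuming PyMuPDF (fitz) is used;
# the paper does not name a specific library, so this is only illustrative.
import fitz  # PyMuPDF


def extract_pages_as_images(pdf_path: str, page_numbers: list[int], dpi: int = 200) -> list[str]:
    """Render the selected pages of a PDF to PNG files and return their paths."""
    doc = fitz.open(pdf_path)
    image_paths = []
    for page_number in page_numbers:
        page = doc[page_number]
        pix = page.get_pixmap(dpi=dpi)  # rasterize the page at the chosen resolution
        out_path = f"page_{page_number:04d}.png"
        pix.save(out_path)
        image_paths.append(out_path)
    doc.close()
    return image_paths
```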
Given the importance of OCR and the lack of an-
notated data for training and evaluating open-source
tools, we selected Microsoft OCR V4 (https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence). This choice
was based on the work of (Santos et al., 2023),
whose evaluation of various OCR tools on diverse
Portuguese text images showed that the Microsoft
model outperformed all other evaluated models, de-
livering solid results even with different backgrounds
and text rotation levels.
The Document Layout Analysis (DLA) step uses
an object detection model for structure recognition.
Object detection models can generally identify which
Figure 1: Proposed process overview.
objects from a known set are present in the image
and provide information about their position. In our
case, the objects are document segments like tables,
images, and section headers, with the DLA model re-
turning the bounding boxes of the segments.
After segmentation, the identified regions are stored in a data structure containing their bounding boxes, the detected class, positional information, and the textual content detected by the OCR tool. The intersection of the text and region bounding boxes is calculated to determine whether a given text element lies within a specific region. Given the variety of available DLA mod-
els, we selected models from the literature and per-
formed experiments to choose the best one, which we
describe in more detail in Subsection 4.1.
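A minimal sketch of this intersection test is shown below; the data structure and the 0.5 coverage threshold are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch: assign OCR words to DLA regions by bounding-box
# intersection. Field names and the threshold are assumptions.
from dataclasses import dataclass, field


@dataclass
class Region:
    label: str                                   # e.g. "Text", "Table", "Section Header"
    bbox: tuple[float, float, float, float]      # (x0, y0, x1, y1)
    words: list[str] = field(default_factory=list)


def intersection_ratio(word_box, region_box) -> float:
    """Fraction of the word box covered by the region box."""
    x0 = max(word_box[0], region_box[0])
    y0 = max(word_box[1], region_box[1])
    x1 = min(word_box[2], region_box[2])
    y1 = min(word_box[3], region_box[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    word_area = (word_box[2] - word_box[0]) * (word_box[3] - word_box[1])
    return inter / word_area if word_area > 0 else 0.0


def assign_words(ocr_words, regions, threshold: float = 0.5):
    """ocr_words: list of (text, bbox) pairs returned by the OCR tool."""
    for text, box in ocr_words:
        best = max(regions, key=lambda r: intersection_ratio(box, r.bbox), default=None)
        if best is not None and intersection_ratio(box, best.bbox) >= threshold:
            best.words.append(text)
    return regions
```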
The Table Structure Recognition (TSR) step uses a
second object detection model to segment table struc-
tures, detecting rows and columns as bounding boxes
and defining cells based on the intersections. Next,
a separate algorithm converts these bounding boxes
into HTML. As with the DLA step, multiple models
are available. So, we reviewed models from the lit-
erature and conducted experiments to select the best
one, detailed in Subsection 4.2.
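The conversion of row and column boxes into HTML can be sketched as follows; treating each cell as the intersection of one row box with one column box follows the description above, while cell_text_lookup is a hypothetical helper that returns the OCR text inside a cell box.

```python
# Sketch of turning detected row and column boxes into an HTML table.
# Cells are modeled as row/column intersections; cell_text_lookup is a
# hypothetical helper, not part of the authors' published code.


def boxes_to_html(row_boxes, col_boxes, cell_text_lookup) -> str:
    """row_boxes / col_boxes: (x0, y0, x1, y1) tuples sorted top-to-bottom and
    left-to-right; cell_text_lookup(cell_box) returns the OCR text inside it."""
    html_rows = []
    for row in row_boxes:
        cells = []
        for col in col_boxes:
            cell_box = (col[0], row[1], col[2], row[3])  # column x-range, row y-range
            cells.append(f"<td>{cell_text_lookup(cell_box)}</td>")
        html_rows.append("<tr>" + "".join(cells) + "</tr>")
    return "<table>" + "".join(html_rows) + "</table>"
```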
At the end of the Transcription stage, the docu-
ment image transcription is modified to include posi-
tional markers. These markers identify the transcript
regions used to generate a QA pair. For text regions,
we add the paragraph number; for table regions, we
add the table number and the number of each row.
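The marker format itself is not specified in the paper, so the sketch below only illustrates the idea with made-up [PARAGRAPH n] and [TABLE t | ROW r] tags.

```python
# Illustrative sketch of adding positional markers to the transcription;
# the tag strings are assumptions, not the authors' exact format.


def add_markers(regions: list[dict]) -> str:
    """regions: reading-ordered dicts such as {"label": "Text", "text": ...}
    or {"label": "Table", "rows": ["<tr>...</tr>", ...]}."""
    lines, paragraph_id, table_id = [], 0, 0
    for region in regions:
        if region["label"] == "Table":
            table_id += 1
            for row_id, row_html in enumerate(region["rows"], start=1):
                lines.append(f"[TABLE {table_id} | ROW {row_id}] {row_html}")
        else:
            paragraph_id += 1
            lines.append(f"[PARAGRAPH {paragraph_id}] {region['text']}")
    return "\n".join(lines)
```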
3.2 Question-Answer Generation
The modified document transcription from the pre-
vious step is input into the selected LLM for the
Question-Answer Generation stage, along with gen-
eration instructions (available on GitHub: https://github.com/carlos-vinicios/DocVQA-Data-Annotator). These instructions include the coherence standards the model should follow, rules
for using relative pronouns, the number of desired tu-
ples, the focus area, and the key elements to include
in the output. The prompt also provides positive and
negative examples to guide the generation process.
Finally, the expected output of this stage is a textual
format divided into three parts: QUESTION - AN-
SWER - TEXT REGION. The first two form the QA
tuple, while the third is a positional marker indicating
the text region used for generation.
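A small parser for this output format might look like the sketch below; the exact delimiters are an assumption based on the QUESTION - ANSWER - TEXT REGION layout described above, not the prompt's literal specification.

```python
# Sketch of parsing the generation output into (question, answer, region)
# tuples; the label strings are assumed from the format described in the text.
import re

TUPLE_PATTERN = re.compile(
    r"QUESTION:\s*(?P<question>.+?)\s*"
    r"ANSWER:\s*(?P<answer>.+?)\s*"
    r"TEXT REGION:\s*(?P<region>.+?)\s*(?=QUESTION:|$)",
    re.DOTALL,
)


def parse_generation(output: str) -> list[tuple[str, str, str]]:
    return [
        (m.group("question").strip(), m.group("answer").strip(), m.group("region").strip())
        for m in TUPLE_PATTERN.finditer(output)
    ]
```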
3.3 Question-Answer Judgment
Finally, the objective of the Question-Answer Judg-
ment stage is to ensure that the generated questions
and answers are coherent and correct, filtering out
possible hallucinations from the generating model
and increasing the quality of the final dataset. This
stage also receives as input the transcript of the doc-
ument, without the positional markers, together with
the evaluation instructions (available on GitHub: https://github.com/carlos-vinicios/DocVQA-Data-Annotator), specifying the criteria
for evaluating the questions and answers.
Each QA tuple judgment uses two calls to the
LLM. The first call evaluates the coherence of the
question, which is considered coherent if it complies with the grammatical rules of the Portuguese language and has only one answer. The second call checks whether the answer actually answers the question. Since each QA tuple is evalu-
ated individually, the process described is carried out
in 3 iterations per document page. The expected out-
put for each iteration is binary, confirming or denying
the coherence of the question and the validity of the
provided answer.
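A minimal sketch of this two-call judgment loop is given below. It assumes LangChain's ChatOpenAI wrapper, since Section 4.4 notes that LangChain was used with temperature zero; the prompt strings are placeholders, not the published evaluation instructions.

```python
# Minimal sketch of the two-call judgment loop; prompts are placeholders and
# the specific model is only one of the providers evaluated in the paper.
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o", temperature=0)

COHERENCE_PROMPT = (
    "Given the transcription below, answer 'yes' or 'no': is the question coherent?\n"
    "{transcript}\nQuestion: {question}"
)
ANSWER_PROMPT = (
    "Given the transcription below, answer 'yes' or 'no': does the answer correctly "
    "answer the question?\n{transcript}\nQuestion: {question}\nAnswer: {answer}"
)


def judge_tuple(transcript: str, question: str, answer: str) -> bool:
    """A QA tuple is kept only if the question is coherent AND the answer is valid."""
    coherent = judge.invoke(
        COHERENCE_PROMPT.format(transcript=transcript, question=question)
    ).content.strip().lower().startswith("yes")
    if not coherent:
        return False  # skip the second call when the question is already rejected
    return judge.invoke(
        ANSWER_PROMPT.format(transcript=transcript, question=question, answer=answer)
    ).content.strip().lower().startswith("yes")
```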
4 EXPERIMENTS
This section describes the experiments conducted fol-
lowing the proposed process. We begin with the ex-
perimentation of the transcription stage, where we
evaluated available DLA models from the literature
using a dataset created during this study. Next, we as-
sessed open-source TSR models, testing them on three datasets to measure their generalization capabilities.
After the transcription stage, we proceeded to evalu-
ate various LLMs for the QA generation stage. Fi-
nally, we assessed the most robust LLMs for the QA
tuple judgment stage.
4.1 Document Layout Analysis (DLA)
To evaluate the DLA models, we built an annotated
dataset using a process similar to the creation of the
DocLayNet dataset (Pfitzmann et al., 2022). The val-
idation dataset consisted of 990 randomly selected
pages from financial statements, focusing on central
pages that typically contain text, tables, and images.
Two annotators independently annotated each
page, following DocLayNet’s criteria, and reviewed
each other’s work to improve consistency. Model evaluation was based on the Mean Average Precision (mAP), a standard metric for object detection and segmentation. The Average Precision (AP) was calcu-
lated for each class as the area under the precision-
recall curve. The model selection followed two main
criteria: it needed to be trained on the DocLayNet
dataset and be open-source, providing access to its
weights for evaluation and modification.
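For reproducibility, the per-class AP/mAP computation can be obtained from an off-the-shelf implementation; the sketch below assumes torchmetrics, which the paper does not mandate, and the class id mapping is illustrative.

```python
# Hedged sketch of the per-class mAP computation using torchmetrics; the paper
# does not state which mAP implementation was used.
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(class_metrics=True)  # report AP per layout class

# One image: predicted boxes (x0, y0, x1, y1), confidence scores, and class ids
preds = [{
    "boxes": torch.tensor([[40.0, 60.0, 500.0, 120.0]]),
    "scores": torch.tensor([0.91]),
    "labels": torch.tensor([3]),     # e.g. 3 = "Text" in an assumed class mapping
}]
target = [{
    "boxes": torch.tensor([[38.0, 58.0, 505.0, 125.0]]),
    "labels": torch.tensor([3]),
}]

metric.update(preds, target)
results = metric.compute()
print(results["map"], results["map_per_class"])
```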
Table 1 shows the mAP performance for each
class. RoDLA achieved the highest overall perfor-
mance, excelling in handling diverse document types
and perturbations. Malaysia-AI performed well in
most classes, especially in detecting tables and text,
benefiting from advanced data augmentation tech-
niques despite using the same architecture as Maik
Thiele. SwinDocSegmenter, although computation-
ally demanding, showed lower performance than
Malaysia-AI, which can run efficiently on a CPU.
In conclusion, RoDLA is the best choice for the
DLA task. While it has a longer execution time, its
superior ability to identify diverse document regions
ensures more accurate layout segmentation.
4.2 Table Structure Recognition (TSR)
Building a specific dataset for TSR model evaluation
is unfeasible due to the high time cost of data anno-
tation and refinement, which involves marking column
and row bounding boxes, transcribing table text, and
creating the HTML structure. Three datasets were se-
lected for TSR evaluation to address this. The first
is PubTabNet, with around 500,000 table images and
their HTML equivalents for the structural recogni-
tion of tables (Zhong et al., 2020). FinTabNet, which
focuses on financial documents, provides annotated
data for table extraction and conversion to HTML
(Zheng et al., 2021). Finally, the ICDAR 2013 Table
Competition dataset serves as a benchmark for this
task (Göbel et al., 2013).
We used TEDS-Struct, a variant of Tree-Edit-
Distance-based Similarity (TEDS) (Zhong et al.,
2020), to evaluate table extraction model perfor-
mance. Unlike the original TEDS, which assesses
structure and text alignment, TEDS-Struct focuses
solely on the HTML structure, ignoring transcription
errors (Qiao et al., 2021), making it suitable for this
work. For TSR model selection, we identified open-
source models capable of table-to-HTML conversion
from table images. Four models were chosen: Ta-
ble Transformer (TATR) (Smock et al., 2022); MLT-
TabNet (Ly and Takasu, 2023); Table Master (Ye
et al., 2021), and Local and Global Pyramid Mask
Alignment (LGPMA) (Qiao et al., 2021).
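For reference, TEDS as defined in (Zhong et al., 2020) is one minus the normalized tree edit distance between the two table trees, and TEDS-Struct applies the same formula after discarding cell content:

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```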
Table 2 shows the results for each tested model.
Although no model excelled across all datasets, TATR
All and MLT-TabNet demonstrated the best general-
ization abilities, with TATR All achieving the best
performance on the ICDAR 2013 benchmark dataset.
The findings suggest models trained on the FinTab-
Net dataset exhibit superior generalization, likely due
to the dataset’s complexity and variety of examples
featuring different degrees of customization. In con-
trast, PubTabNet is limited to tables from scientific
publications, offering less diverse training data.
Given its extraction performance, TATR All is the
best model for this work. It showed strong gener-
alization abilities, performing well across all tested
datasets, was the fastest model, and the only one ca-
pable of maintaining a low execution time on CPU.
4.3 Question-Answer Generation
We selected models based on cost-benefit, consider-
ing the token usage required to complete the task.
As this stage generates the most output tokens, it is
the main cost driver. On average, document tran-
scriptions and task instructions use 3,471 (±1,080) input tokens, while the question, answer, and positional marker produce 170 (±36) output tokens.
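These averages translate into a simple per-page cost estimate; the sketch below assumes one generation call per page returning three tuples, and the prices used are placeholders rather than the August 2024 figures behind Figure 2.

```python
# Back-of-the-envelope cost estimate per annotated page for the generation
# stage, using the average token counts reported above; prices are placeholders.
AVG_INPUT_TOKENS = 3471   # transcription + task instructions
AVG_OUTPUT_TOKENS = 170   # one question, answer and positional marker


def generation_cost_per_page(price_in_per_m: float, price_out_per_m: float,
                             tuples_per_page: int = 3) -> float:
    """Assumes one generation call per page returning `tuples_per_page` QA tuples
    (an assumption about the call layout, not stated explicitly in the paper)."""
    return (AVG_INPUT_TOKENS * price_in_per_m
            + tuples_per_page * AVG_OUTPUT_TOKENS * price_out_per_m) / 1_000_000


# e.g. a model priced at $0.15 / $0.60 per million input/output tokens (hypothetical)
print(f"{generation_cost_per_page(0.15, 0.60):.6f} USD per page")
```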
Figure 2 shows the quality and cost (USD per mil-
lion tokens) of the main available models, assuming a
3:1 ratio of input to output tokens. The price format
and quality were based on (Analysis, 2024). In the
Table 1: mAP performance for each class and model on a constructed dataset. The best performance per class is marked with an asterisk (*).

Class          | RoDLA (Chen et al., 2024) | Malaysia-AI (Malaysia-AI, 2024) | SwinDocSegmenter (Banerjee et al., 2023) | Maik Thiele (Maik Thiele, 2024)
Caption        | 0.302* | 0.015  | 0.067 | 0.127
List Item      | 0.860* | 0.741  | 0.676 | 0.642
Picture        | 0.749* | 0.413  | 0.569 | 0.524
Text           | 0.979* | 0.910  | 0.776 | 0.901
Footnote       | 0.735* | 0.486  | 0.345 | 0.254
Page Footer    | 0.911* | 0.446  | 0.621 | 0.435
Section Header | 0.776* | 0.490  | 0.391 | 0.346
Title          | 0.007  | 0.015* | 0.003 | 0.003
Formula        | -      | -      | -     | -
Page Header    | 0.870* | 0.476  | 0.485 | 0.447
Table          | 0.975* | 0.955  | 0.899 | 0.906
All            | 0.716* | 0.495  | 0.483 | 0.459
Table 2: TEDS-Struct performance by model on each dataset. The best performance per dataset is marked with an asterisk (*).

Model       | FinTabNet | PubTabNet | ICDAR 2013
TATR All    | 0.92      | 0.93      | 0.94*
TATR Fin    | 0.92      | 0.88      | 0.91
MLT-TabNet  | 0.97*     | 0.90      | 0.91
TableMaster | 0.80      | 0.97*     | 0.90
LGPMA       | 0.41      | 0.96      | 0.91
Figure 2: Quality vs Price for the main LLM models avail-
able via a pay-per-use API. The price of each model was
collected in August 2024.
Figure, models are grouped into four quadrants: high
quality and high cost, high quality and low cost, rea-
sonable quality and low cost, and lower-performing
models with varying costs.
Based on the average token requirements and the
price-quality balance in Figure 2, we selected models
from the second and third quadrants: GPT-4o mini,
LLaMA 3 70B, Mixtral 8x22B, Gemini 1.5 Flash, and
Claude 3 Haiku. LLaMA 3.1 405B, the best but most
expensive open-source model, was excluded from this
stage and reserved for the third stage of our process.
For the question generation experiments using
LLM, we created a dataset of 300 randomly selected
non-consecutive pages, which feature only text, only
tables or a mixture of both. The tables have different
structures and sizes, so the complexity of interpreting
each one varies, with larger tables being more com-
plex. The information on a page may also start on a previous page or continue onto the next one.
The first experiment assessed the distribution of
questions based on their initial 3-grams. The goal
was to evaluate the models’ ability to generate di-
verse questions. Figure 3 shows the distribution of
the generated questions, where each level represents
the most frequent words in the natural reading order.
Thus, the first level corresponds to the first word of
the question, the second level to the second word, and
so on. For example, Mixtral 8x22B frequently starts
questions with “Qual foi” (“What was”) or “Qual é o” (“What is the”), while empty regions at the second level indicate infrequent words.
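The prefix statistics behind Figure 3 can be computed with a few lines of counting; the sketch below is a straightforward reconstruction of that analysis, not the authors' exact script.

```python
# Sketch of the starting n-gram analysis used for Figure 3: count how often
# each question begins with a given 1-, 2- and 3-word prefix.
from collections import Counter


def starting_ngram_counts(questions: list[str], max_n: int = 3) -> dict[int, Counter]:
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for question in questions:
        words = question.split()
        for n in range(1, max_n + 1):
            if len(words) >= n:
                counts[n][" ".join(words[:n])] += 1
    return counts


counts = starting_ngram_counts(["Qual foi o lucro líquido em 2023?",
                                "Qual é o valor total do ativo?"])
print(counts[2].most_common(5))   # most frequent two-word question openings
```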
Most models used at least five different interrogative pronouns, but Claude 3 Haiku and Gemini 1.5
Flash showed less variety, favoring a more conser-
vative question phrasing approach. Also, most mod-
els favored using “Qual” (“What”) as the interroga-
tive pronoun, likely due to its adaptability in various
contexts. This pattern aligns with human-generated
questions, as demonstrated by (Mathew et al., 2021).
“Quanto” (“how much” or “how many”) was the
second most frequently used pronoun, linked to the
prevalence of tables in the selected pages. Lastly,
“Quando” (“when”) and “Quem” (“Who”), though
not explored equally by all models, were employed
by GPT-4o mini and Mixtral-8x22B.
Based on the variation, GPT-4o mini and Mixtral-
8x22B showed the most promise for question gen-
eration. Considering the cost-performance trade-off,
Mixtral-8x22B is the optimal choice, being open-
Figure 3: Distribution of questions by their starting 3-
grams.
source and offering strong performance at a lower
cost. To evaluate the quality of the questions and an-
swers generated by the models, we conducted a hu-
man evaluation with 28 volunteer annotators. They
assessed each QA tuple using a web tool developed
for this research. Annotators evaluated the coherence
of the questions, ensuring they were understandable,
adhered to Portuguese language norms, and were free
of ambiguity. If the question was coherent, they vali-
dated whether the answers were correct based on the
document information. The validation process was
binary: annotators responded “yes” or “no” for both
criteria. A question was considered valid if it was co-
herent and the answer was correct.
The evaluation was conducted in batches of 10
pages, with each page containing three QA tuples.
Annotators reviewed each page individually. Some
annotators were responsible for more than one batch
due to the number of annotators and available files.
To maintain consistency, each batch was reviewed by
at least three distinct annotators and inter-annotator
agreement was measured.
The average percent inter-annotator agreement is
0.84 (±0.29) for question coherence and 0.84 (±0.28)
for response accuracy. Based on this high level of
agreement and the number of annotators, we retained
all annotations and used them to validate the models.
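A plausible way to compute this percent agreement is the average pairwise agreement sketched below; the paper reports only the resulting mean and standard deviation, so this aggregation is an assumption.

```python
# Sketch of average pairwise percent agreement on binary labels; the toy
# values below are illustrative, not data from the study.
from itertools import combinations
from statistics import mean


def percent_agreement(labels_by_annotator: list[list[bool]]) -> float:
    """labels_by_annotator: one label list per annotator, aligned by QA tuple."""
    pair_scores = []
    for a, b in combinations(labels_by_annotator, 2):
        matches = sum(la == lb for la, lb in zip(a, b))
        pair_scores.append(matches / len(a))
    return mean(pair_scores)


# Three annotators judging the same four questions for coherence
print(percent_agreement([[True, True, False, True],
                         [True, True, False, False],
                         [True, True, True, True]]))
```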
Table 3: Question coherence, answer accuracy, and proportion of valid QA tuples for each LLM.

Model            | Question Coherence | Answer Accuracy | Valid QA Tuples
LLaMA 3 70B      | 0.89               | 0.97            | 0.87
Mixtral 8x22B    | 0.89               | 0.95            | 0.85
Gemini 1.5 Flash | 0.93               | 0.91            | 0.84
GPT-4o mini      | 0.83               | 0.97            | 0.81
Claude 3 Haiku   | 0.85               | 0.94            | 0.79
Using this dataset, we evaluated the model per-
formance based on the proportion of coherent ques-
tions and correct answers in the QA tuples. Table 3
shows, for example, that LLaMA 3 70B produced a 0.89 proportion of coherent questions, of which 0.97 had correct answers, yielding 0.87 valid tuples. For all models, the final dataset
included only tuples with coherent questions and ac-
curate answers.
The results showed that LLaMA 3 70B excelled in
coherence and accuracy, while Mixtral-8x22B main-
tained strong performance with higher QA tuple gen-
eration. Gemini 1.5 Flash was less consistent, par-
ticularly in answer accuracy, and Claude 3 Haiku dis-
played the highest variation, indicating randomness in
its generation process. GPT-4o mini was stable with
increased tuple output. Overall, LLaMA 3 70B and
Mixtral-8x22B were the most effective models.
4.4 Question-Answer Judgment
For the QA judgment stage, we prioritized high-
quality models, as this step is crucial for filtering out
invalid questions automatically. To assess the generated QA tuples accurately, we needed models with strong general task performance and the ability to understand document transcriptions, so that their evaluations are reliable and the validated tuples can be trusted.
At this stage, cost primarily stems from input to-
kens, as the instruction prompt is resent for each
QA tuple evaluation. The average input size was
2,261.33 (±819.75) tokens, while the output averaged 2.67 (±1.72) tokens. We selected three high-quality
models from the first quadrant: GPT-4o, Claude
3.5 Sonnet, and Gemini 1.5 Pro. Additionally, we
included LLaMA 3.1 405B, a leading open-source
model, to compare commercial and open-source op-
tions.
To evaluate this stage, we used the dataset cre-
ated after a human evaluation during the QA gener-
ation stage, described in Subsection 4.3. We used this
dataset to assess the judgment capabilities of the pro-
Figure 4: Confusion matrix between LLM and human for
the validity of a QA tuple. The human annotation is consid-
ered the true label.
posed approach by comparing the performance of an
LLM and a human annotator. As in the second stage
(Question-Answer Generation), we used Langchain
to normalize the API calls, setting the temperature to
zero and all the other parameters to default. The LLM
judgment of QA tuples followed the same criteria as
the human evaluation, assessing question coherence,
adherence to Portuguese language norms, and clarity.
Models also validated answer accuracy based on the
document transcriptions. The judgment process was
binary, with the LLM responding “yes” or “no”. A
question was deemed valid if it was coherent and the
answer accurate.
After testing all models, we compared their F1
scores and alignment with human judgments for final
dataset use. All four models performed well, with F1
scores above 0.8. The best performance came from
Claude 3.5 Sonnet, which achieved the highest score
at 0.93, followed by Gemini 1.5 Pro and LLaMA 3.1
405B, both at 0.92, while GPT-4o scored 0.90.
The confusion matrix in Figure 4 reveals that all
models struggled to identify invalid tuples, likely due to the imbalance between valid and invalid samples in the dataset, which makes the invalid class harder to detect. To improve judgment performance, we
combined the models into an ensemble. In this ap-
proach, a QA tuple was considered valid only if all
models confirmed both the coherence of the question
and the accuracy of the answer. Table 4 shows the F1
obtained by each ensemble, where we can see that the
combination of models did not improve the metric,
showing a similar performance for all combinations.
Table 4: F1 score between an ensemble of LLMs and humans for the validity of a QA tuple. Each ensemble combines a subset of Gemini 1.5 Pro, GPT-4o, LLaMA 3.1 405B, and Claude 3.5 Sonnet.

Ensemble | F1
1        | 0.82
2        | 0.82
3        | 0.83
4        | 0.82
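The unanimous-vote ensemble and its comparison against the human labels can be sketched as follows; the verdicts below are toy values, and scikit-learn's f1_score is assumed for the metric.

```python
# Sketch of the unanimous-vote ensemble compared against human labels;
# all verdicts and labels here are illustrative toy values.
from sklearn.metrics import f1_score

# model_verdicts[model][i] == True means that model judged QA tuple i as valid
model_verdicts = {
    "gpt-4o":            [True, True, False, True],
    "claude-3.5-sonnet": [True, True, False, False],
    "gemini-1.5-pro":    [True, True, True,  True],
    "llama-3.1-405b":    [True, False, False, True],
}
human_labels = [True, True, False, True]

# A tuple is accepted only if every model in the ensemble judged it valid
ensemble = [all(votes) for votes in zip(*model_verdicts.values())]
print(f1_score(human_labels, ensemble))
```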
5 CONCLUSION
This paper addresses the research questions by de-
veloping a data annotation process for the DocVQA
task using LLMs. To tackle the first part of RQ1,
we designed a three-step process employing CV mod-
els to extract textual representations from documents.
For the second part, we evaluated five cost-effective
LLMs for generating QA pairs based on document
transcriptions and predefined rules to ensure well-
formed questions and enable human evaluation. To
answer RQ2, we conducted a human evaluation to as-
sess the validity of QA pairs generated by each LLM,
measuring the validity-to-invalidity ratio. Addition-
ally, we evaluated four robust LLMs for automat-
ing the validation process, benchmarking their perfor-
mance against human assessments.
Our findings demonstrate that LLMs can effec-
tively generate QA pairs with a favorable validity-
to-invalidity ratio. Mixtral-8x22B and GPT-4o
mini achieved the best results, generating diverse,
instruction-compliant QA pairs with minimal in-
valid outputs, with GPT-4o mini excelling in cost-
effectiveness. For automatic QA pair validation,
Claude 3.5 Sonnet achieved the highest F1 score
(0.93) but requires further improvements to reduce
false negatives. Ensemble models were tested as an
enhancement but underperformed compared to indi-
vidual models, achieving a maximum F1 score of 0.83
with significantly higher costs, making them imprac-
tical despite a slight reduction in false positives.
5.1 Future Works and Limitations
As discussed earlier, the process occasionally fails in the validity judgment of QA pairs. To improve this judgment, we plan to introduce a review score for each generated pair in future work.
This score will help identify questions requiring hu-
man review, minimizing false positives and negatives
while enhancing dataset robustness. Additionally, we
intend to conduct a qualitative assessment of question
relevance and structure, expanding our focus beyond
validity analysis to ensure alignment with the target
domain.
Given the process’s reliance on document tran-
scription, we also aim to investigate the impact of
transcription errors on question generation. This anal-
ysis will provide insights into model performance un-
der such conditions and inform strategies to mitigate
these effects. Ultimately, we aim to produce a fully
annotated dataset using the proposed process, estab-
lish baselines with state-of-the-art DocVQA models,
and evaluate the process’s strengths and limitations
for further refinement.
ACKNOWLEDGMENTS
This study was supported by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.
REFERENCES
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y., and
Sontag, D. (2022). Large language models are few-
shot clinical information extractors. In Goldberg, Y.,
Kozareva, Z., and Zhang, Y., editors, Proceedings of
the 2022 Conference on Empirical Methods in Natural
Language Processing, pages 1998–2022, Abu Dhabi,
United Arab Emirates. Association for Computational
Linguistics.
Analysis, A. (2024). Independent analysis of ai models and
api providers. https://artificialanalysis.ai/. Accessed:
2024-08-20.
Bai, Y., Ying, J., Cao, Y., Lv, X., He, Y., Wang, X., Yu,
J., Zeng, K., Xiao, Y., Lyu, H., et al. (2024). Bench-
marking foundation models with language-model-as-
an-examiner. Advances in Neural Information Pro-
cessing Systems, 36.
Banerjee, A., Biswas, S., Lladós, J., and Pal, U. (2023).
Swindocsegmenter: An end-to-end unified domain
adaptive transformer for document instance segmenta-
tion. In International Conference on Document Anal-
ysis and Recognition, pages 307–325. Springer.
Chen, Y., Zhang, J., Peng, K., Zheng, J., Liu, R., Torr, P.,
and Stiefelhagen, R. (2024). Rodla: Benchmarking
the robustness of document layout analysis models.
arXiv preprint arXiv:2403.14442.
Göbel, M., Hassan, T., Oro, E., and Orsi, G. (2013). Icdar
2013 table competition. In 2013 12th International
Conference on Document Analysis and Recognition,
pages 1449–1453. IEEE.
Kittur, A., Chi, E. H., and Suh, B. (2008). Crowdsourcing
user studies with mechanical turk. In Proceedings of
the SIGCHI conference on human factors in comput-
ing systems, pages 453–456.
Ly, N. T. and Takasu, A. (2023). An end-to-end multi-
task learning model for image-based table recogni-
tion. pages 626–634.
Maik Thiele (2024). documentlayoutsegmentation yolov8 ondoclaynet (revision 25486d5).
Malaysia-AI (2024). Yolov8x-doclaynet-full-1024-42.
Mathew, M., Bagal, V., Tito, R., Karatzas, D., Valveny, E.,
and Jawahar, C. (2022). Infographicvqa. In Proceed-
ings of the IEEE/CVF Winter Conference on Applica-
tions of Computer Vision, pages 1697–1706.
Mathew, M., Kondreddi, V. K., Biten, A. F., Mafla, A.,
Matas, J., Jawahar, C. V., Valveny, E., and Karatzas,
D. (2021). Docvqa: A dataset for vqa on document
images. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR),
pages 2200–2209.
Nguyen, P., Ly, N. T., Takeda, H., and Takasu, A. (2023).
Tabiqa: Table questions answering on business docu-
ment images. arXiv preprint arXiv:2303.14935.
Pfitzmann, B., Auer, C., Dolfi, M., Nassar, A. S., and Staar,
P. (2022). Doclaynet: a large human-annotated dataset
for document-layout segmentation. In Proceedings
of the 28th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining, pages 3743–3751.
Qiao, L., Li, Z., Cheng, Z., Zhang, P., Pu, S., Niu, Y., Ren,
W., Tan, W., and Wu, F. (2021). Lgpma: compli-
cated table structure recognition with local and global
pyramid mask alignment. In International conference
on document analysis and recognition, pages 99–114.
Springer.
Santos, Y., Silva, M., and Reis, J. C. S. (2023). Evaluation
of optical character recognition (ocr) systems dealing
with misinformation in portuguese. In 2023 36th SIB-
GRAPI Conference on Graphics, Patterns and Images
(SIBGRAPI), pages 223–228.
Smock, B., Pesala, R., and Abraham, R. (2022). Pubtables-
1m: Towards comprehensive table extraction from
unstructured documents. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pat-
tern Recognition, pages 4634–4642.
Wang, S., Liu, Y., Xu, Y., Zhu, C., and Zeng, M. (2021).
Want to reduce labeling cost? GPT-3 can help. In
Findings of the Association for Computational Lin-
guistics: EMNLP 2021, pages 4195–4205, Punta
Cana, Dominican Republic. Association for Compu-
tational Linguistics.
Ye, J., Qi, X., He, Y., Chen, Y., Gu, D., Gao, P., and
Xiao, R. (2021). Pingan-vcgroup’s solution for ic-
dar 2021 competition on scientific literature parsing
task b: table recognition to html. arXiv preprint
arXiv:2105.01848.
Zheng, X., Burdick, D., Popa, L., Zhong, X., and Wang, N.
X. R. (2021). Global table extractor (gte): A frame-
work for joint table identification and cell structure
recognition using visual context. In Proceedings of
the IEEE/CVF winter conference on applications of
computer vision, pages 697–706.
Zhong, X., ShafieiBavani, E., and Jimeno Yepes, A. (2020).
Image-based table recognition: data, model, and eval-
uation. In European conference on computer vision,
pages 564–580. Springer.