
of building dictionaries from the dataset itself has the
drawback of limiting variability in the augmented in-
stances. As previously discussed, some LLMs may
lack the specialized domain knowledge required for
certain documents and may therefore be an unreliable
way of generating text. Alternative generators can be
built from text dictionaries found online or from other
random-generation methods (such as generating random
numbers to compose dates), but these may need to be
defined on a case-by-case basis, as was the case for NBID.
Finally, as presented, both methods have limited
scalability. There are only so many ways to rewrite a
sentence and add meaningful variations to the dataset,
and the template generations are tied to the number of
templates and the texts available for filling them. As
described in Section 3.2, the template method works
well for domains with simple templates, especially
when these templates are fully known. This was the
case for NBID, where most of the templates in the
training section also appeared in testing. However,
these techniques are not suited for endless augmentation,
and can only take model performance so far in
more complex domains, such as EPHOIE.
Nonetheless, there is still room for improvement, as
shown by our results.
5 CONCLUSIONS
In this work, we presented two new data augmentation
strategies for documents, aimed at both complex
and simple domains. We discussed their
strengths and weaknesses in relation to other methods.
Finally, we showed that these methods improve
the baseline model's performance. In future
work, we aim to apply these same methods to other
datasets, demonstrating their applicability in other domains.
ACKNOWLEDGEMENTS
The authors would like to thank UNICO for all the
support in the making of this research project and also
NVIDIA Corporation for the generous donation of the
Quadro RTX 8000 GPU that made our experiments
possible. The authors also thank PROEX CAPES
for the funding, and David Menotti thanks CNPq
(#315409/2023-1).
REFERENCES
Biswas, S., Riba, P., Lladós, J., and Pal, U. (2021). Doc-
synth: A layout guided approach for controllable doc-
ument image synthesis. In Int. Conf. on Document
Analysis and Recognition (ICDAR).
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D.,
Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., Agarwal, S., Herbert-Voss, A., Krueger,
G., Henighan, T., Child, R., Ramesh, A., Ziegler, D.,
Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner,
C., McCandlish, S., Radford, A., Sutskever, I., and
Amodei, D. (2020). Language models are few-shot
learners. In Larochelle, H., Ranzato, M., Hadsell, R.,
Balcan, M., and Lin, H., editors, Advances in Neu-
ral Information Processing Systems, volume 33, pages
1877–1901. Curran Associates, Inc.
Chi, Z., Dong, L., Wei, F., Yang, N., Singhal, S., Wang,
W., Song, X., Mao, X.-L., Huang, H., and Zhou, M.
(2021). InfoXLM: An information-theoretic frame-
work for cross-lingual language model pre-training.
In Human Language Technologies, pages 3576–3588.
Association for Computational Linguistics.
Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., and Hu, G.
(2020). Revisiting pre-trained models for Chinese nat-
ural language processing. In Cohn, T., He, Y., and
Liu, Y., editors, Findings of the Association for Com-
putational Linguistics: EMNLP 2020, pages 657–668,
Online. Association for Computational Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K.
(2019). BERT: Pre-training of deep bidirectional
transformers for language understanding. In Conf. of
the North American Chapter of the Association for
Computational Linguistics, pages 4171–4186.
Dhariwal, P. and Nichol, A. (2021). Diffusion models beat
gans on image synthesis. In Ranzato, M., Beygelz-
imer, A., Dauphin, Y., Liang, P., and Vaughan, J. W.,
editors, Advances in Neural Information Processing
Systems, volume 34, pages 8780–8794. Curran Asso-
ciates, Inc.
Jaume, G., Ekenel, H. K., and Thiran, J.-P. (2019).
Funsd: A dataset for form understanding in noisy
scanned documents. In Accepted to ICDAR-OST.
Guo, Q., Qiu, X., Liu, P., Shao, Y., Xue, X., and Zhang,
Z. (2019). Star-transformer. In Conf. of the North
American Chapter of the Association for Computa-
tional Linguistics.
Guo, Z., Wang, P., Wang, Y., and Yu, S. (2023). Improving
small language models on pubmedqa via generative
data augmentation.
Huang, Y., Lv, T., Cui, L., Lu, Y., and Wei, F.
(2022). Layoutlmv3: Pre-training for document ai
with unified text and image masking. arXiv preprint
arXiv:2204.08387.
Kim, G., Hong, T., Yim, M., Nam, J., Park, J., Yim, J.,
Hwang, W., Yun, S., Han, D., and Park, S. (2022).
Ocr-free document understanding transformer. In Eu-
ropean Conf. on Computer Vision (ECCV).
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
ageNet classification with deep convolutional neural
networks. In Int. Conf. on Neural Information Pro-
cessing Systems (NeurIPS), pages 1097–1105.
Li, M., Xu, Y., Cui, L., Huang, S., Wei, F., Li, Z., and Zhou,