
and WildReceipt experiments. Among the three
loss functions explored (MSE, cross-entropy, and
focal loss), focal loss consistently delivered the
best performance across datasets and experiments.
This can be largely attributed to the sparsity of the
ground-truth graphs: most candidate node pairs are not
linked, and focal loss handles the resulting label imbalance
better than MSE or cross-entropy.
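As a minimal sketch of this point, assuming a PyTorch pipeline in which the model emits one logit per candidate edge (the helper name focal_loss and the alpha/gamma values below are illustrative, not the settings tuned in the paper):

import torch
import torch.nn.functional as F

def focal_loss(edge_logits, edge_labels, alpha=0.25, gamma=2.0):
    # edge_logits: one raw score per candidate node pair, shape [E]
    # edge_labels: 1.0 where the pair is linked in the ground-truth graph, else 0.0
    bce = F.binary_cross_entropy_with_logits(edge_logits, edge_labels, reduction="none")
    p = torch.sigmoid(edge_logits)
    # probability the model assigns to the true class of each candidate edge
    p_t = p * edge_labels + (1.0 - p) * (1.0 - edge_labels)
    alpha_t = alpha * edge_labels + (1.0 - alpha) * (1.0 - edge_labels)
    # (1 - p_t)^gamma down-weights the many easy negatives of a sparse graph
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

# toy usage: one true link among six candidate pairs
logits = torch.randn(6)
labels = torch.tensor([1., 0., 0., 0., 0., 0.])
loss = focal_loss(logits, labels)

The modulating factor (1 - p_t)^gamma is what suppresses the contribution of the abundant, easily classified non-edges, which is why this loss copes well when the ground-truth graph is sparse.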
5 CONCLUSIONS
In this paper, we present a spatial structure-guided framework that addresses the challenges of KIE by leveraging ground-truth graphs and optimizing graph similarity through various loss functions. Moreover, enhancing the text feature extraction process with word-level embeddings from a fine-tuned BERT allows our models to outperform the baseline SDMG-R model. Experimental results on the FUNSD and WildReceipt datasets confirm the robustness of our approach.
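A minimal sketch of the word-level embedding step summarized above, assuming the HuggingFace transformers API and simple mean-pooling of sub-word states; the helper word_embeddings and the bert-base-uncased checkpoint are illustrative, not the fine-tuned model used in the paper:

import torch
from transformers import BertTokenizerFast, BertModel

def word_embeddings(words, tokenizer, model):
    # Tokenize a list of words; is_split_into_words preserves word boundaries.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # [num_subword_tokens, 768]
    word_ids = enc.word_ids()                        # maps each sub-word token to its word
    vectors = []
    for idx in range(len(words)):
        token_positions = [i for i, w in enumerate(word_ids) if w == idx]
        vectors.append(hidden[token_positions].mean(dim=0))  # mean-pool sub-word states
    return torch.stack(vectors)                      # [num_words, 768]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()
print(word_embeddings(["Invoice", "No.", "12345"], tokenizer, model).shape)  # torch.Size([3, 768])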
REFERENCES
Chen, Y.-M., Hou, X.-T., Lou, D.-F., Liao, Z.-L., and Liu,
C.-L. (2023). Damgcn: Entity linking in visually rich
documents with dependency-aware multimodal graph
convolutional network. In International Conference
on Document Analysis and Recognition, pages 33–47.
Springer.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Du, Y., Chen, Z., Jia, C., Yin, X., Zheng, T., Li, C., Du,
Y., and Jiang, Y.-G. (2022). Svtr: Scene text recog-
nition with a single visual model. arXiv preprint
arXiv:2205.00159.
Fang, S., Xie, H., Wang, Y., Mao, Z., and Zhang, Y. (2021).
Read like humans: Autonomous, bidirectional and
iterative language modeling for scene text recogni-
tion. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition, pages 7098–
7107.
Gbada, H., Kalti, K., and Mahjoub, M. A. (2024a). Deep
learning approaches for information extraction from
visually rich documents: datasets, challenges and
methods. International Journal on Document Anal-
ysis and Recognition (IJDAR), pages 1–22.
Gbada, H., Kalti, K., and Mahjoub, M. A. (2024b). Multi-
modal weighted graph representation for information
extraction from visually rich documents. Neurocom-
puting, 573:127223.
Hong, T., Kim, D., Ji, M., Hwang, W., Nam, D., and
Park, S. (2022). Bros: A pre-trained language model
focusing on text and layout for better key informa-
tion extraction from documents. In Proceedings of
the AAAI Conference on Artificial Intelligence, vol-
ume 36, pages 10767–10775.
Hwang, W., Yim, J., Park, S., Yang, S., and Seo,
M. (2020). Spatial dependency parsing for semi-
structured document information extraction. arXiv
preprint arXiv:2005.00642.
Jaume, G., Ekenel, H. K., and Thiran, J.-P. (2019). Funsd:
A dataset for form understanding in noisy scanned
documents. In 2019 International Conference on
Document Analysis and Recognition Workshops (IC-
DARW), volume 2, pages 1–6. IEEE.
Katti, A. R., Reisswig, C., Guder, C., Brarda, S., Bickel, S., Höhne, J., and Faddoul, J. B. (2018). Chargrid: Towards understanding 2d documents. arXiv preprint arXiv:1809.08799.
Krieger, F., Drews, P., Funk, B., and Wobbe, T. (2021).
Information extraction from invoices: a graph neural
network approach for datasets with high layout vari-
ety. In Innovation Through Information Systems: Vol-
ume II: A Collection of Latest Research on Technology
Issues, pages 5–20. Springer.
Lee, C.-Y., Li, C.-L., Wang, C., Wang, R., Fujii, Y., Qin,
S., Popat, A., and Pfister, T. (2021). Rope: Reading
order equivariant positional encoding for graph-based
document information extraction. arXiv preprint
arXiv:2106.10786.
Li, H., Wang, P., Shen, C., and Zhang, G. (2019). Show,
attend and read: A simple and strong baseline for ir-
regular text recognition. In Proceedings of the AAAI
conference on artificial intelligence, volume 33, pages
8610–8617.
Liao, M., Wan, Z., Yao, C., Chen, K., and Bai, X. (2020).
Real-time scene text detection with differentiable bi-
narization. In Proceedings of the AAAI conference
on artificial intelligence, volume 34, pages 11474–
11481.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.
Liu, X., Gao, F., Zhang, Q., and Zhao, H. (2019).
Graph convolution for multimodal information extrac-
tion from visually rich documents. arXiv preprint
arXiv:1903.11279.
Long, S., Ruan, J., Zhang, W., He, X., Wu, W., and Yao,
C. (2018). Textsnake: A flexible representation for
detecting text of arbitrary shapes. In Proceedings of
the European conference on computer vision (ECCV),
pages 20–36.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Pennington, J., Socher, R., and Manning, C. D. (2014).
Glove: Global vectors for word representation. In
Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP),
pages 1532–1543.