
Table 7: Hyperparameter configurations of the best-performing fusion models: learning rate, number of steps of the learning rate scheduler, whether bounding-box features are used, size of the text vector extracted from BERT, and size of the image vector extracted from ViT/Swin Transformer V2.

Configuration  Learning rate  Scheduler steps  BBox features  Text vector size  Image vector size
Fusion-1       1 × 10⁻⁵       None             True           128               128
Fusion-2       1 × 10⁻⁵       None             True           128               256
Fusion-7       1 × 10⁻⁵       1000             True           256               128
Fusion-26      1 × 10⁻⁵       None             False          128               256
Fusion-34      1 × 10⁻⁵       1500             False          128               256
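For illustration, the Table 7 configurations could be expressed in code as follows; this is a minimal sketch, and the field names are assumptions rather than identifiers from the released codebase.

```python
# Sketch of the Table 7 hyperparameter configurations.
# Field names are illustrative assumptions, not taken from the paper's code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class FusionConfig:
    learning_rate: float = 1e-5
    scheduler_steps: Optional[int] = None   # None = no scheduler / constant rate
    use_bbox_features: bool = True
    text_vector_size: int = 128             # size of the vector extracted from BERT
    image_vector_size: int = 128            # size of the vector from ViT/Swin V2

# The five rows of Table 7:
CONFIGS = {
    "Fusion-1":  FusionConfig(),
    "Fusion-2":  FusionConfig(image_vector_size=256),
    "Fusion-7":  FusionConfig(scheduler_steps=1000, text_vector_size=256),
    "Fusion-26": FusionConfig(use_bbox_features=False, image_vector_size=256),
    "Fusion-34": FusionConfig(scheduler_steps=1500, use_bbox_features=False,
                              image_vector_size=256),
}
```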
dataset is dedicated to the evaluation of methods for
layout analysis, with a focus on multi-modality. The
dataset is freely available for research purposes.
We then presented baseline results for instance
segmentation and multi-modal element classification.
For instance segmentation, we employed three state-
of-the-art models, namely Mask R-CNN, YOLOv8,
and LayoutLMv3. For multi-modal classification, we
proposed a fusion-based model that combines BERT
with various vision Transformers.
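As an illustration, a late-fusion classifier of this kind could be sketched as follows. This is only a sketch: the pretrained checkpoints, the use of pooled outputs, and the plain concatenation before the classifier are assumptions for clarity, not the exact architecture used in the paper.

```python
# Minimal late-fusion sketch: BERT encodes the element text, ViT encodes the
# element image, both are projected to the vector sizes from Table 7 and
# concatenated before classification. Checkpoint names are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, ViTModel

class FusionClassifier(nn.Module):
    def __init__(self, num_classes: int,
                 text_vector_size: int = 128, image_vector_size: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-multilingual-cased")
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text_proj = nn.Linear(self.bert.config.hidden_size, text_vector_size)
        self.image_proj = nn.Linear(self.vit.config.hidden_size, image_vector_size)
        self.classifier = nn.Linear(text_vector_size + image_vector_size, num_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        # Pooled text representation projected to the configured text vector size.
        text_vec = self.text_proj(
            self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).pooler_output)
        # Pooled image representation projected to the configured image vector size.
        image_vec = self.image_proj(
            self.vit(pixel_values=pixel_values).pooler_output)
        # Late fusion by concatenation, followed by a linear classifier.
        return self.classifier(torch.cat([text_vec, image_vec], dim=-1))
```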
We showed experimentally that the best bounding-box
segmentation was obtained with YOLOv8 using an
input image size of 1280 pixels, while the best
segmentation masks were produced by LayoutLMv3
initialized with the PubLayNet weights.
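The reported YOLOv8 setting corresponds to training with a 1280-pixel input size; a hypothetical call reproducing it with the Ultralytics API is shown below. The model variant and dataset configuration file are assumptions, not values taken from the paper.

```python
# Hypothetical reproduction of the 1280-pixel YOLOv8 segmentation setting.
from ultralytics import YOLO

model = YOLO("yolov8m-seg.pt")        # pretrained segmentation checkpoint (assumed variant)
model.train(data="heimatkunde.yaml",  # dataset config file (hypothetical name)
            imgsz=1280,               # input image size reported as best
            epochs=100)               # assumed training length
```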
We further demonstrated that the best multi-modal
classification results were obtained with BERT for
the textual modality and ViT for the image modality.
Based on these experimental results, the proposed
models will be integrated into the Porta fontium
portal to facilitate information extraction from
historical data.
ACKNOWLEDGEMENTS
This work has been partly supported by Grant No.
SGS-2022-016 "Advanced methods of data processing
and analysis".
REFERENCES
Bossard, L., Guillaumin, M., and Van Gool, L. (2014). Food-101 – Mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer.
CVAT.ai Corporation (2022). Computer Vision Annotation
Tool (CVAT). https://github.com/opencv/cvat.
Dauphinee, T., Patel, N., and Rashidi, M. (2019). Modular
multimodal architecture for document classification.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding.
Ferrando, J., Domínguez, J. L., Torres, J., García, R., García, D., Garrido, D., Cortada, J., and Valero, M. (2020). Improving accuracy and speeding up document image classification through parallel systems. In Lecture Notes in Computer Science, pages 387–400. Springer International Publishing.
Gallo, I., Ria, G., Landro, N., and Grassa, R. L. (2020). Image and text fusion for UPMC Food-101 using BERT and CNNs. In 2020 35th International Conference on Image and Vision Computing New Zealand (IVCNZ), pages 1–6.
Hossin, M. and Sulaiman, M. N. (2015). A review on evalu-
ation metrics for data classification evaluations. Inter-
national Journal of Data Mining & Knowledge Man-
agement Process, 5:01–11.
Huang, J., Tao, J., Liu, B., Lian, Z., and Niu, M. (2020).
Multimodal transformer fusion for continuous emo-
tion recognition. In ICASSP 2020-2020 IEEE Inter-
national Conference on Acoustics, Speech and Signal
Processing (ICASSP), pages 3507–3511. IEEE.
Jocher, G., Chaurasia, A., and Qiu, J. (2023). YOLO by
Ultralytics. https://github.com/ultralytics/ultralytics.
Martínek, J., Lenc, L., Král, P., Nicolaou, A., and Christlein, V. (2019). Hybrid training data for historical text OCR. In 15th International Conference on Document Analysis and Recognition (ICDAR 2019), pages 565–570, Sydney, Australia.
Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C.,
and Sun, C. (2021). Attention bottlenecks for multi-
modal fusion. Advances in Neural Information Pro-
cessing Systems, 34:14200–14213.
Prakash, A., Chitta, K., and Geiger, A. (2021). Multi-
modal fusion transformer for end-to-end autonomous
driving. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
7077–7087.
Simonyan, K. and Zisserman, A. (2014). Very deep convo-
lutional networks for large-scale image recognition.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna,
Z. (2015). Rethinking the inception architecture for
computer vision.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. (2019). Detectron2. https://github.com/facebookresearch/detectron2.
Zhong, X., Tang, J., and Yepes, A. J. (2019). PubLayNet: Largest dataset ever for document layout analysis. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 1015–1022. IEEE.