
optimizing computational efficiency, expanding multilingual capabilities, and exploring applications to more complex document structures. With its demonstrated efficacy and innovative approach, TokenOCR represents a significant advancement in OCR technology, setting a new standard for text digitization and analysis.