A Refined Multilingual Scene Text Detector Based on YOLOv7

Houssem Turki, Mohamed Elleuch, Monji Kherallah

2025

Abstract

In recent years, deep learning has driven significant advances in the recognition of text in natural scene images. Despite this progress, the effectiveness of deep learning for detecting multilingual text in natural scene images is often limited by the lack of comprehensive datasets that cover a variety of scripts. A further obstacle is the absence of a robust detection system that can overcome most of the challenges posed by natural scenes while also accounting for the characteristics of each language's script. YOLO (You Only Look Once) is a widely used deep neural network that has become popular for its adaptability to a range of machine learning tasks. YOLOv7, an enhanced iteration of the YOLO series, has proven effective on complex image-related problems. This is thanks to the evolution of its 'Backbone', the block responsible for extracting image features, which helps the model cope with the challenges of natural environments. These strengths lead us to adapt YOLOv7 to our text detection context. Our first contribution addresses environmental variations through a dedicated data augmentation scheme. This scheme combines improved basic techniques with a mixed transformation method and is applied to the “RRC-MLT” and “SYPHAX” multilingual datasets, both of which contain Arabic script. The second contribution is a refinement of the 'Backbone' block of the YOLOv7 architecture to better extract small text details, such as the punctuation marks that stand out in Arabic script. The article also outlines future research directions aimed at a generic and efficient multilingual text detection system for the wild that also handles Arabic script. This additional challenge justifies the choice of the two datasets.
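The abstract does not detail the augmentation pipeline. One constraint shared by any geometric augmentation for detection, however, is that the text boxes must be transformed together with the image. The sketch below illustrates this for a random crop, assuming YOLO-style normalized labels `(class_id, x_center, y_center, width, height)`. The function name `crop_boxes` and the `min_visible` threshold are hypothetical illustrations, not the authors' method.

```python
from typing import List, Tuple

# Hypothetical YOLO-style label: (class_id, x_center, y_center, width, height),
# with all coordinates normalized to [0, 1] relative to the image size.
Box = Tuple[int, float, float, float, float]


def crop_boxes(boxes: List[Box],
               crop: Tuple[float, float, float, float],
               min_visible: float = 0.5) -> List[Box]:
    """Remap normalized boxes into a normalized crop window (x0, y0, x1, y1).

    Each box is clipped to the window. It is dropped when less than
    `min_visible` of its original area survives the crop (an assumed rule).
    """
    cx0, cy0, cx1, cy1 = crop
    cw, ch = cx1 - cx0, cy1 - cy0
    out: List[Box] = []
    for cls, xc, yc, w, h in boxes:
        if w <= 0 or h <= 0:
            continue  # skip degenerate labels
        # Convert center format to corner format.
        bx0, by0 = xc - w / 2, yc - h / 2
        bx1, by1 = xc + w / 2, yc + h / 2
        # Intersect the box with the crop window.
        ix0, iy0 = max(bx0, cx0), max(by0, cy0)
        ix1, iy1 = min(bx1, cx1), min(by1, cy1)
        if ix1 <= ix0 or iy1 <= iy0:
            continue  # box lies entirely outside the crop
        visible = (ix1 - ix0) * (iy1 - iy0) / (w * h)
        if visible < min_visible:
            continue  # too little of the text remains visible
        # Re-express the clipped corners relative to the crop window.
        nx0, ny0 = (ix0 - cx0) / cw, (iy0 - cy0) / ch
        nx1, ny1 = (ix1 - cx0) / cw, (iy1 - cy0) / ch
        out.append((cls, (nx0 + nx1) / 2, (ny0 + ny1) / 2, nx1 - nx0, ny1 - ny0))
    return out
```

For example, take the central crop `(0.25, 0.25, 0.75, 0.75)`. A box fully inside it is rescaled to the crop's frame. A box in the image corner is discarded. Dropping heavily truncated boxes is one design choice. It avoids training on text fragments too small to be legible, though small marks such as Arabic punctuation may call for a lower threshold.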



Paper Citation


in Harvard Style

Turki H., Elleuch M. and Kherallah M. (2025). A Refined Multilingual Scene Text Detector Based on YOLOv7. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 512-519. DOI: 10.5220/0013157100003890


in Bibtex Style

@conference{icaart25,
author={Houssem Turki and Mohamed Elleuch and Monji Kherallah},
title={A Refined Multilingual Scene Text Detector Based on YOLOv7},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={512-519},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013157100003890},
isbn={978-989-758-737-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - A Refined Multilingual Scene Text Detector Based on YOLOv7
SN - 978-989-758-737-5
AU - Turki H.
AU - Elleuch M.
AU - Kherallah M.
PY - 2025
SP - 512
EP - 519
DO - 10.5220/0013157100003890
PB - SciTePress