SIFT-ResNet Synergy for Accurate Scene Word Detection in Complex Scenarios

Riadh Harizi [1, a], Rim Walha [1, 2, b] and Fadoua Drira [1, c]

[1] REGIM-Lab, ENIS, University of Sfax, Tunisia

[2] Higher Institute of Computer Science and Multimedia of Sfax, University of Sfax, Tunisia

Keywords:

Scene Text Detection, Deep Learning, SIFT Keypoints, Bounding Box Regressor.

Abstract:

Scene text detection is of growing importance due to its various applications. Deep learning-based systems

have proven effective in detecting horizontal text in natural scene images. However, they encounter difﬁculties

when confronted with oriented and curved text. To tackle this issue, our study introduces a hybrid scene

text detector that combines selective search with SIFT-based keypoint density analysis and a deep learning framework. More precisely, we investigated SIFT keypoints to identify important areas in an image for precise word localization. Then, we fine-tuned these areas with a deep learning-powered bounding box regressor. This combination ensured accurate word boundary alignment and enhanced word

detection efﬁciency. We evaluated our method on benchmark datasets, including ICDAR2013, ICDAR2015,

and SVT, comparing it with established state-of-the-art scene text detectors. The results underscore the strong

performance of our scene text detector when dealing with complex scenarios.

1 INTRODUCTION

Scene text detection is of growing importance due to

its various applications in a wide range of ﬁelds. It is

often described as the process of localizing the spe-

ciﬁc regions of text within images captured from nat-

ural scenes. Indeed, scene text detection can involve

multiple candidate regions, including text, word, and

character levels, which are then candidates for fur-

ther processing. At each level of candidate regions,

there are distinct advantages and disadvantages based

on their utility for specific applications. On the one hand, text-level candidate regions, which treat the entire scene text as a single candidate region, are particularly useful for recognizing text blocks or headings within images. Their simplicity makes them the preferred choice when text recognition is unnecessary. However, they do not provide fine-grained information about individual words or characters, which can be a drawback when detailed text analysis is required.

Word-level candidate regions, on the other hand, focus on providing finer details by localizing individual words within a scene. These regions are highly valuable for applications requiring word-level analysis (Harizi et al., 2022b). However, they may face challenges when dealing with densely packed words or when character-level recognition becomes necessary. Character-level candidate regions offer the finest granularity for character-level analysis but demand more complex processing than word-level or text-level regions. These regions can be particularly challenging for handwritten or cursive text. In summary, the choice of the candidate level depends on the application's requirements, striking a balance between the need for detail and the complexity of processing.

a https://orcid.org/0000-0003-4096-8959

b https://orcid.org/0000-0002-0483-6329

c https://orcid.org/0000-0001-6706-4218

Several text detection algorithms have been employed to tackle the text localization task accurately.

Inspired by the rapid advancements in deep learn-

ing and the availability of annotated datasets, numer-

ous text detection techniques have emerged. Deep

learning-based systems have proven effective in de-

tecting horizontally aligned text in natural scene im-

ages. However, they encounter difﬁculties when con-

fronted with more challenging and complex scenar-

ios, speciﬁcally those involving oriented or curved

text. Therefore, many investigations continue to ex-

plore innovative architectures and solutions to ad-

vance the state of the art in text detection. Their

primary focus is on addressing some notable lim-

itations of existing techniques, such as improving

the effectiveness of handling images containing text


Harizi, R., Walha, R. and Drira, F.

SIFT-ResNet Synergy for Accurate Scene Word Detection in Complex Scenarios.

DOI: 10.5220/0012426200003636

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 16th International Conference on Agents and Artiﬁcial Intelligence (ICAART 2024) - Volume 3, pages 980-987

ISBN: 978-989-758-680-4; ISSN: 2184-433X

with varying angles or vertical orientations. Indeed,

the main text detection methods adapted to text of varying shapes and orientations can be categorized into

two primary groups: region-based and texture-based

(Naiemi et al., 2021). In this study, we will primarily

concentrate on process-driven categories of region-

based methods, which can be deﬁned by three main

groups: pixel-based, model-based and hybrid text detector approaches. This choice is guided by the nature of our proposition. Pixel-based text detector approaches aim to detect and precisely locate text regions by conducting an in-depth examination of the

individual pixels. These approaches utilize a variety

of image processing techniques and pixel-level char-

acteristics, including but not limited to color, bright-

ness, texture, pixel connectivity, keypoint density and

corners. These approaches are suitable for detecting

text in scenarios where text regions are well-deﬁned

and exhibit distinct pixel-level features. However, in

real-world scenarios, this is often not the case. There-

fore, most approaches dealing with this challenging

task predominantly fall into two categories: model-

based and hybrid-based text detector approaches. On

the one hand, model-based text detector approaches

rely on pre-trained models or speciﬁc neural network

architectures to detect text. These approaches utilize

features learned from training data to recognize and

localize text regions in images. On the other hand, hy-

brid approaches combine the strengths of both pixel-

based and model-based approaches for text detection.

In this context, we introduce a hybrid approach

to address the challenges of text detection in un-

structured, real-world scenarios. This approach com-

bines the strengths of Convolutional Neural Networks

(CNN) and Scale-Invariant Feature Transform (SIFT)

techniques to enhance word detection accuracy and

robustness. While ResNet excels in feature extrac-

tion, SIFT is well-known for its resilience to scale

and orientation variations. Our method utilizes word-

level candidate regions as they strike a balance be-

tween the simplicity of text-level detection and the

detail of character-level detection. Word-level candi-

date regions are considered the optimal choice when

priorities include readability, reduced complexity, and

efﬁcient processing.

The remainder of this study is structured as fol-

lows. Section 2 provides an overview of well-known

model-based and hybrid scene text detection ap-

proaches in unstructured, real-world scenarios. Sec-

tion 3 details our proposed SIFT-ResNet hybrid text

detection method. Section 4 presents the experimen-

tal study to highlight the effectiveness of our ap-

proach. Section 5 concludes the study and highlights

open issues for future research.

2 RELATED WORKS

In this section, we present an in-depth exploration

of related work in the ﬁeld of model-based versus

hybrid-based text detector approaches, with a partic-

ular emphasis on deep learning-driven methods. Our

focus on this area stems from the growing importance

of text detection in various applications, where the

choice between model-based and hybrid approaches

plays a pivotal role in achieving accurate and efﬁcient

results, particularly in addressing the challenges en-

countered in complex real-world scenarios.

Regarding model-based text detector approaches,

common models used include CNNs and other deep

learning architectures. These approaches are suitable

for detecting text in scenarios where text can vary in

terms of orientation, font, and placement (Zhou et al.,

2017; Liao et al., 2018a; Long and Yao, 2020; Zhu

et al., 2021; Yu et al., 2023). For example, EAST (Ef-

ﬁcient and Accurate Scene Text Detector) employs a

CNN to predict text regions and their corresponding

quadrilateral bounding boxes in a single forward pass

(Zhou et al., 2017). Another model-based text detec-

tion approach is TextBoxes, which predicts both text

regions and their corresponding bounding boxes by

incorporating multiple aspect ratios and orientations

in the output layer of the network (Liao et al., 2017).

Additionally, YOLO (You Only Look Once), origi-

nally designed for object detection, can be adapted

for text detection tasks by training it on text-speciﬁc

datasets, making it applicable for text detection in di-

verse scenarios (Redmon and Farhadi, 2017). Stroke

Width Transform (SWT) is another model-based text

detector. It operates in a single pass through the

image, identifying and grouping regions with similar stroke widths to identify potential text regions (Epshtein et al., 2010; Piriyothinkul et al., 2019).

In (Mallek et al., 2017), the authors explored the in-

tegration of a sparse prior into a model-based scene

text detection approach. Speciﬁcally, the features of

the convolutional PCANet network are enhanced by

sparse coding principle (Walha et al., 2013), repre-

senting each feature map through interconnected dic-

tionaries and hence facilitating the transition from one

resolution level to a suitable lower-resolution level.

Liu et al. proposed a unified end-to-end network which operates in a single pass through the network to detect and recognize text in images (Liu et al., 2018a). The introduction of the RoIRotate operator is a part of their single-stage architecture, which aims to handle oriented text and gain axis-aligned feature maps efficiently. Another study developed an

instance segmentation-based method that employed

a deep neural network to simultaneously predict text


regions and their interconnecting relationships (Deng

et al., 2018).

Concerning hybrid text detector approaches, they

may use pixel-level analysis to identify potential text

regions and then utilize models for further reﬁnement

and classiﬁcation. Indeed, the classiﬁcation process

is designed to distinguish text from non-text areas,

while the reﬁnement process is focused on enhanc-

ing the precision of text region detection. This ﬁne-

tuning may entail adjustments to the position, size, or

shape of the bounding boxes to achieve more accurate

text region localization. For instance, Faster R-CNN

can be used to initially generate region proposals that

are likely to contain text and then reﬁne these propos-

als for accurate text localization (Ren et al., 2015).

Moreover, Mask R-CNN extends Faster R-CNN by

incorporating a segmentation mask branch (He et al.,

2017). Another example is CTPN (Connectionist Text

Proposal Network). It generates text proposals in the

ﬁrst stage and then reﬁnes them using a recurrent neu-

ral network (RNN) in the second stage (Tian et al.,

2016). Liu et al. (2018b) presented an

approach that involves linking individual text compo-

nents to create complete text lines. In (Long et al.,

2018), the authors proposed a method representing

curved text with straight lines in a two-stage process:

ﬁrst identifying potential text regions, and then reﬁn-

ing and classifying these regions for improved accu-

racy in text detection. In their multi-oriented scene

text detection approach (Dai et al., 2018), the authors

used a region proposal network for text detection and

segmentation, followed by non-maximum suppres-

sion to handle overlapping instances. Furthermore,

hybrid methods may include a rectiﬁcation stage to

address geometric distortions in text, like perspective

distortion or skew. This step improves text legibility

and streamlines the recognition process. An exam-

ple of this method is ASTER (Attentional Scene Text

Recognizer) (Shi et al., 2019).

In summary, hybrid text detection efﬁciency stems

from using advanced deep learning frameworks for

scene text localization and integrating textual fea-

tures pixel-wise. This integration of textual features

can lead to real-time solutions. Motivated by these

considerations, we introduce our hybrid text detector

framework in the following section.

3 PROPOSED SCENE WORD DETECTOR

In this section, we present the proposed deep learning-based scene text detection method. It relies on a hybrid detection approach that harnesses the advantages of both convolutional deep networks and keypoint-based techniques in order to accurately localize multi-oriented words in a given real-world scene image. An overview of the

proposed method is depicted in Figure 1. As exhibited

in this ﬁgure, our proposition consists of three main

stages: multiscale SIFT-based RoIs detection, RoIs

ﬁltering and grouping, and word bounding box re-

gression. Details concerning each stage of our propo-

sition are provided in the following.

3.1 Multiscale SIFT for RoI Detection

Real-world scene text differs visually from document

text. Rather than employing a preliminary segmenta-

tion or handling the entire input image content and in-

vestigating its overlapping regions, we suggest a more

focused approach which conducts a precise selective

search. This is achieved by detecting keypoints and

exploring multi-scale spatial grids applied to the input scene image, thus facilitating the identification

of pertinent local regions. Speciﬁcally, our detection

process initiates with the localization of SIFT key-

points, serving as a means to guide the selection of

candidate regions that are likely to encompass text ar-

eas, thereby eliminating the need for exhaustive pro-

cessing of all image regions. Following this, a reﬁne-

ment process partitions the image into cells of varying

sizes using multi-scale grids. Each cell corresponds

to a local patch area having a dimension of n × n pixels, with n being selected from the set {8, 12, 16, 32}.

This secondary step aids in systematically identify-

ing regions of interest (RoIs). The method focuses

on pertinent local areas, inspecting multi-scale grids

within the image, especially those with SIFT key-

points. Bounding boxes for these chosen cells, re-

ferred to as SIFT-RoIs, are created by computing Eu-

clidean distances between SIFT keypoints and cell

centroids.
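The grid-based selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sift_roi_cells`, the `min_keypoints` threshold, and the choice to summarize each cell by the mean keypoint-to-centroid distance are all hypothetical. The keypoint coordinates are taken as plain (x, y) pairs; in practice they could come from a SIFT detector such as OpenCV's `cv2.SIFT_create().detect(gray, None)`, reading the `pt` attribute of each keypoint.

```python
import numpy as np


def sift_roi_cells(keypoints_xy, image_shape, cell_sizes=(8, 12, 16, 32), min_keypoints=1):
    """Return candidate cells as (x0, y0, x1, y1, n, mean_dist) tuples.

    keypoints_xy : iterable of (x, y) keypoint positions (assumed input).
    image_shape  : (height, width) of the scene image.
    cell_sizes   : grid scales n, each cell covering n x n pixels.
    min_keypoints: hypothetical threshold on keypoints per selected cell.
    """
    height, width = image_shape
    pts = np.asarray(keypoints_xy, dtype=float).reshape(-1, 2)
    rois = []
    for n in cell_sizes:
        # Grid cell (row, col) that each keypoint falls into at this scale.
        cols = (pts[:, 0] // n).astype(int)
        rows = (pts[:, 1] // n).astype(int)
        for row, col in sorted(set(zip(rows.tolist(), cols.tolist()))):
            inside = pts[(rows == row) & (cols == col)]
            if len(inside) < min_keypoints:
                continue
            x0, y0 = col * n, row * n
            x1, y1 = min(x0 + n, width), min(y0 + n, height)
            centroid = np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])
            # Euclidean distance between each keypoint and the cell centroid.
            mean_dist = float(np.linalg.norm(inside - centroid, axis=1).mean())
            rois.append((x0, y0, x1, y1, n, mean_dist))
    return rois


if __name__ == "__main__":
    demo = sift_roi_cells([(3, 3), (5, 5), (40, 40)], (64, 64), cell_sizes=(8,))
    for roi in demo:
        print(roi)
```

Only cells that actually contain keypoints are visited, which mirrors the idea of avoiding exhaustive processing of all image regions.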

3.2 RoIs Filtering and Grouping

Throughout the training phase, our main emphasis

lies in assessing the precision of the chosen SIFT-RoIs

bounding boxes. In this regard, we utilize the Inter-

section over Union (IoU) evaluation measure which

represents a commonly employed measure in object

detection applications. In fact, this measure is a valu-

able tool for evaluating the alignment between the

predicted bounding boxes, speciﬁcally those associ-

ated with SIFT-RoIs, and the ground-truth bounding

boxes. These latter offer precise annotations that de-

ﬁne the true positions of text patterns within the train-

ing images. Speciﬁcally, the IoU measure quantiﬁes


Table 1: Overview of recent methods in deep learning-based scene text detection.

Reference Model Category Backbone Network Candidate Region Text Shape

MD HD MOT CT H

(Zhang et al., 2015) STLD X CNN C,T X

(Mallek et al., 2017) DLSP X PCANet W,C X

(Zhou et al., 2017) EAST X FCN W,T X X

(Liao et al., 2017) TextBoxes X VGG-16+CRNN W X

(Liao et al., 2018a) TextBoxes++ X SSD+CRNN W X X

(Deng et al., 2018) PixelLink X FCN W X X

(Baek et al., 2019) CRAFT X VGG-16 C,W X X X

(Tian et al., 2016) CTPN X VGG-16 T,W X

(Wang et al., 2019a) PSENET X ResNet-18 T X X X

(Liu et al., 2022) ABCNet X ResNet-18 T X X X

(Long et al., 2018) TextSnake X U-Net W X X X

(Liao et al., 2021) Mask TextSpotter X FPN+CNN T X X X

(Harizi et al., 2022a) CNN X CNN W,C X

(Busta et al., 2017) Deep TextSpotter X CNN W X

(Liu et al., 2018a) FOTS X ResNet-50 W X X

(He et al., 2018) TextSpotter X PVAnet T,W,C X X

(Shi et al., 2019) ASTER X SSD+LSTM W X X

(Xing et al., 2019) charNet X ResNet-50 C,W X X X

(Wang et al., 2019b) PAN X ResNet-18 T,W X X X

(Long and Yao, 2020) UnrealText X FCN W X X

(Liao et al., 2020) SynthText3D X ResNet-50 W X X

(Zhu et al., 2021) FCENet X ResNet-50+FPN T,W X X X

(Yu et al., 2023) TCM X ResNet-50 T,W X X X

Proposed method SIFT-ResNet X ResNet-19 W X X X

Note: W: Word-based, T: Text-based, C: Character-based, MD: Model-based text Detector approach, HD: Hybrid-based text Detector approach, MOT: Multi-Oriented Text, CT: Curved Text, H: Horizontal Text.

the proportion of the overlapping area between the

predicted and actual bounding boxes in relation to the

combined area of both boxes. Especially, it is obtained for the j-th ground-truth ($G_j$) and i-th detection bounding box ($D_i$) as follows:

$$\mathrm{IoU} = \frac{\mathrm{Area}(G_j \cap D_i)}{\mathrm{Area}(G_j \cup D_i)} \qquad (1)$$
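As an illustration of Equation (1), the following sketch computes the IoU of two axis-aligned boxes. The box encoding `(x0, y0, x1, y1)` and the function name `iou` are assumptions made for this example; the paper does not specify a box format.

```python
def iou(box_g, box_d):
    """Intersection over Union of two axis-aligned boxes (x0, y0, x1, y1)."""
    # Corners of the intersection rectangle (may be empty).
    ix0, iy0 = max(box_g[0], box_d[0]), max(box_g[1], box_d[1])
    ix1, iy1 = min(box_g[2], box_d[2]), min(box_g[3], box_d[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    area_d = (box_d[2] - box_d[0]) * (box_d[3] - box_d[1])
    union = area_g + area_d - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 10 x 10 boxes overlapping on a 5 x 10 strip: 50 / 150.
    print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

For the two boxes in the usage example, the intersection area is 50 and the union area is 150, so the IoU is one third.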

This assessment crucially reﬁnes our scene text

detection model’s precision, providing valuable in-

sights into its proﬁciency in identifying text regions

within images. The IoU values guide the selection of

SIFT-RoIs, and from these, a Bag of Word Patterns

(BoW) model is constructed to determine the pres-

ence of text patterns in speciﬁc image regions. After

BoW-based ﬁltering, we move to a region grouping

stage to enhance coverage of dense text patterns. In

this phase, we randomly merge nearest ﬁltered SIFT-

RoIs, usually selecting the two or three closest cen-

troids. This step aims to reduce redundancy, con-

solidate overlapping regions, and potentially improve

precision in the subsequent word bounding box detec-

tion stage.
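The grouping step can be sketched as below, under several assumptions: each filtered RoI is merged with its nearest neighbours (by centroid distance) into one enclosing box, `k` plays the role of the "two or three closest centroids", and `max_dist` is a hypothetical cut-off that keeps isolated RoIs from being merged with distant ones. The random selection mentioned in the text is omitted so that the example stays deterministic.

```python
import numpy as np


def group_nearest_rois(boxes, k=2, max_dist=float("inf")):
    """Merge each RoI with its nearest neighbours into an enclosing box.

    boxes    : iterable of (x0, y0, x1, y1) filtered RoIs.
    k        : group size, counting the RoI itself (assumed 2 or 3).
    max_dist : hypothetical upper bound on centroid distance for merging.
    """
    boxes = np.asarray(boxes, dtype=float).reshape(-1, 4)
    centroids = np.column_stack(((boxes[:, 0] + boxes[:, 2]) / 2.0,
                                 (boxes[:, 1] + boxes[:, 3]) / 2.0))
    merged = set()
    for i in range(len(boxes)):
        dist = np.linalg.norm(centroids - centroids[i], axis=1)
        # The k closest centroids; the RoI itself is at distance 0.
        nearest = [j for j in np.argsort(dist, kind="stable")[:k] if dist[j] <= max_dist]
        group = boxes[nearest]
        merged.add((float(group[:, 0].min()), float(group[:, 1].min()),
                    float(group[:, 2].max()), float(group[:, 3].max())))
    # Using a set removes duplicate groups, which reduces redundancy.
    return sorted(merged)


if __name__ == "__main__":
    rois = [(0, 0, 8, 8), (8, 0, 16, 8), (100, 100, 108, 108)]
    print(group_nearest_rois(rois, k=2, max_dist=20))
```

In the usage example, the two adjacent cells collapse into a single 16 x 8 box, while the distant cell is kept on its own.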

3.3 Word Bounding Box Regression

As depicted in Figures 1 and 3, the grouped SIFT-

RoIs bounding boxes cover text regions compre-

hensively but lack precision in localizing individ-

ual words. To address this, we introduce a Word

Bounding Box Regressor (WBBR), inspired by object

detectors like YOLO (Redmon and Farhadi, 2017)

and Faster R-CNN (Ren et al., 2015). WBBR en-

hances word region detection for accurate delineation

of word bounding boxes. Our approach adapts bounding box regression for precise localization of arbitrarily oriented words in real-world scene images. Specifically, the SIFT-based RoIs serve as proposals for WBBR, which is crucial for accurately localizing each word region.

The proposed WBBR model in this study is based

on the fully-convolutional structure of ResNet-19 (He

et al., 2016), keeping convolution and pooling layers.

However, we modify it by replacing the ﬁnal fully-

connected layers with four dense layers, as shown in

Figure 2. These layers progressively reduce neurons,

with the last layer having four neurons, serving as the


Figure 1: Illustration of the proposed scene text detection method.

Figure 2: Illustration of the word bounding box regression architecture.

Figure 3: Illustration of intermediate results generated through the proposed scene words detection process.

detection head for predicting word region positions.

Refer to Figures 1 and 3 for a visual overview of the

proposed word detection process and results on vari-

ous scene image samples.
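To make the shape of the regression head concrete, the sketch below runs a forward pass through four dense layers whose widths shrink down to four outputs. It only illustrates tensor shapes: the layer widths (256, 64, 16, 4), the ReLU activations, the 512-dimensional input feature, and the random untrained weights are all assumptions, since the paper states only that the layers progressively reduce neurons and that the last layer has four.

```python
import numpy as np


def regression_head(features, layer_sizes=(256, 64, 16, 4), seed=0):
    """Forward pass through four dense layers ending in 4 box values.

    features    : array of shape (batch, feature_dim), e.g. pooled backbone output.
    layer_sizes : hypothetical widths; only the final width of 4 comes from the text.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(features, dtype=float)
    for idx, width in enumerate(layer_sizes):
        # Randomly initialised weights stand in for trained parameters.
        weights = rng.normal(0.0, 0.01, size=(x.shape[-1], width))
        bias = np.zeros(width)
        x = x @ weights + bias
        if idx < len(layer_sizes) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only (assumed)
    return x


if __name__ == "__main__":
    pooled = np.ones((2, 512))  # two RoI feature vectors of assumed size 512
    print(regression_head(pooled).shape)
```

In a trained model, the four outputs would be fitted against the ground-truth word box of each proposal.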


Table 2: Features of the datasets used in the study.

Dataset                                  | Language | Images: Train | Images: Test | Text shape: H | CT | MO | Text features  | Annotation: Char | Word
ICDAR2013 (Karatzas et al., 2013)        | EN       | 229           | 233          | X             | –  | –  | Large          | –                | X
ICDAR2015 (Antonacopoulos et al., 2015)  | EN       | 1000          | 500          | X             | X  | X  | Small, Blur    | –                | X
SVT (Wang and Belongie, 2010)            | EN       | 100           | 250          | X             | X  | X  | Low-resolution | –                | X

Note: ML - Multi-Lingual, EN - English, H - Horizontal, CT - Curved Text, MO - Multi-Oriented.

Table 3: Quantitative comparison among some of the recent scene text detection methods evaluated on the ICDAR 2013, ICDAR 2015, and SVT datasets. The best values are in bold.

                                        | ICDAR2013        | ICDAR2015        | SVT
Reference                 Model         | R     P     F    | R     P     F    | R     P     F
(Zhang et al., 2015)      STLD          | 0.74  0.88  -    | -     -     -    | -     -     -
(Liao et al., 2017)       TextBoxes     | 0.83  0.89  0.86 | -     -     -    | -     -     -
(Liao et al., 2018a)      TextBoxes++   | 0.84  0.91  0.88 | 0.78  0.87  0.82 | -     -     -
(Zhou et al., 2017)       EAST          | -     -     -    | 0.78  0.83  0.81 | -     -     -
(Liu et al., 2018a)       FOTS          | -     -     0.87 | 0.82  0.89  0.85 | -     -     -
(He et al., 2018)         TextSpotter   | 0.87  0.88  0.88 | 0.83  0.84  0.83 | -     -     -
(Shi et al., 2019)        ASTER         | -     -     -    | 0.69  0.86  0.76 | -     -     -
(Tian et al., 2016)       CTPN          | 0.83  0.93  0.88 | 0.52  0.74  0.61 | 0.65  0.68  0.66
(Deng et al., 2018)       PixelLink     | 0.88  0.89  0.88 | 0.82  0.86  0.84 | -     -     -
(Baek et al., 2019)       CRAFT         | 0.93  0.97  0.95 | 0.84  0.89  0.86 | -     -     -
(Wang et al., 2019a)      PSENET        | -     -     -    | 0.84  0.86  0.85 | -     -     -
(Metzenthin et al., 2022) WSRL          | 0.70  0.84  0.77 | -     -     -    | -     -     -
(Wang et al., 2019b)      PAN           | -     -     -    | 0.82  0.84  0.83 | -     -     -
(Long and Yao, 2020)      UnrealText    | 0.74  0.88  0.81 | 0.81  0.86  0.83 | -     -     -
(Liao et al., 2020)       SynthText3D   | 0.76  0.71  0.73 | 0.80  0.87  0.83 | -     -     -
(Zhu et al., 2021)        FCENet        | -     -     -    | 0.83  0.90  0.86 | -     -     -
(Harizi et al., 2022a)    CNN           | 0.92  0.94  0.93 | 0.74  0.79  0.76 | 0.72  0.78  0.75
(Yu et al., 2023)         TCM           | -     -     0.79 | -     -     0.87 | -     -     -
Proposed method           SIFT-ResNet   | 0.93  0.97  0.95 | 0.85  0.90  0.87 | 0.74  0.79  0.76

Note: R - Recall, P - Precision, F - F-score.

4 EXPERIMENTS AND EVALUATION

4.1 Datasets

This work utilizes three popular datasets: ICDAR

2013 (Karatzas et al., 2013), ICDAR 2015 (Antona-

copoulos et al., 2015), and Street View Text (SVT)

(Wang and Belongie, 2010). Table 2 summarizes key

features and details of the datasets. We note that the

proposed WBBR architecture undergoes training us-

ing these datasets. During the training phase, we di-

vide them into three sets: 80% for training images,

10% for validation, and 10% for testing.

4.2 Evaluation Metrics

The evaluation in this study uses recall, precision, and

F-score metrics. Recall (R) measures the proportion

of true positives to the total positives in ground truth

annotations, while precision (P) is the ratio of true

positives to the total detected text examples. They are

defined as follows:

$$R = \frac{TP}{TP + FN}, \qquad P = \frac{TP}{TP + FP}$$

Here, TP, TN, FP, and FN represent true positives, true negatives, false positives, and false negatives. The F-score (F) is computed by combining recall and precision values through the formula:

$$F = \frac{2 \times R \times P}{R + P}$$
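The three metrics can be computed directly from detection counts, as in the short sketch below. The function names are chosen for illustration only. As a consistency check, combining the recall and precision reported in Table 3 for the proposed method on ICDAR 2013 (0.93 and 0.97) reproduces the reported F-score of 0.95 after rounding.

```python
def f_score(recall, precision):
    """Harmonic mean of recall and precision: F = 2RP / (R + P)."""
    total = recall + precision
    return 2.0 * recall * precision / total if total > 0 else 0.0


def detection_scores(tp, fp, fn):
    """Return (recall, precision, f_score) from detection counts."""
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return recall, precision, f_score(recall, precision)


if __name__ == "__main__":
    # Recall and precision of the proposed method on ICDAR 2013 (Table 3).
    print(round(f_score(0.93, 0.97), 2))
    # Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
    print(detection_scores(tp=8, fp=2, fn=2))
```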


Figure 4: Some visual results generated by the proposed scene text detection method, which succeeds in localizing curved and oriented scene words with different orientations, curvatures, styles, sizes, illuminations, and spatial resolutions.

4.3 Performance Evaluation Study

Table 3 depicts the results of our scene text detec-

tion method compared to other state-of-the-art meth-

ods on ICDAR 2013, ICDAR 2015, and SVT datasets.

Results, evaluated by precision, recall, and F-score, show that our method matches or exceeds the compared methods on all three datasets. Specifically, on the ICDAR 2013 dataset, our method outperforms the WSRL text detection method (Metzenthin et al., 2022), increasing the F-score from 0.77 to 0.95, with gains of 13 percentage points in precision (0.84 to 0.97) and 23 percentage points in recall (0.70 to 0.93). Compared

to the efﬁcient CRAFT detector (Baek et al., 2019),

our method performs similarly on the ICDAR 2013

dataset but outperforms CRAFT on the more chal-

lenging ICDAR 2015 dataset. The improvements in

precision, recall, and F-measure, particularly on IC-

DAR 2015 and SVT datasets, highlight the robust-

ness of our approach in detecting challenging scene

text, including small-scale and multi-oriented exam-

ples that challenge human perception.

Figure 4 displays results from our SIFT-ResNet

scene text detector. The outputs demonstrate success-

ful localization in complex scenarios with varying ori-

entations, curvatures, styles, sizes, illuminations, and

spatial resolutions. Notably, the method accurately

localizes highly curved words, as shown in Figure 4. In general, evaluating scene text localization considers both accuracy and efficiency. Our research

combines traditional (SIFT) and modern (ResNet-

based bounding box regression) techniques, result-

ing in a highly accurate scene word detector. The

detector demonstrates notable improvements in pre-

cision, recall, and F-score measures. The proposed

method excels in performance due to the effective

use of multi-scale SIFT keypoints for character pat-

tern extraction and precise localization with a selec-

tive search-based word bounding box regressor in a

deep learning framework. By combining the local

feature capturing strength of SIFT keypoints with the

semantic understanding of deep neural networks, our

approach achieves a more precise scene word detec-

tor. This collaboration highlights the synergy between

traditional computer vision and modern deep learn-

ing techniques. The efﬁciency of our text detection

method is notably boosted by the signiﬁcant contri-

bution of the SIFT technique. It streamlines the pro-

cess by identifying key regions with a high likelihood

of containing text, allowing the detector to focus on

these areas rather than the entire image. This concen-

tration improves the overall speed of our text detector.

Finally, the achieved scene text detection results could

enhance the functionality and performance of diverse

applications like text super-resolution and recognition

(Walha et al., 2015).

5 CONCLUSION

In summary, our study concentrated on detecting text

in real-world scene images. We introduced a hybrid

text detection method that combines SIFT-based key-

points localization, BoW-based character patterns ﬁl-

tering, and ResNet-19 based word bounding box re-

gression. Experimental results afﬁrmed the method’s

efﬁciency, particularly in handling multi-oriented and

curved scene texts. Performance evaluations were

conducted on three challenging datasets, comparing

favorably with various state-of-the-art text detection

methods. As future work, we aim to extend this research to address multi-script text detection and recognition tasks.

REFERENCES

Antonacopoulos, A., Clausner, C., Papadopoulos, C., and

Pletschacher, S. (2015). ICDAR2015 competition on

recognition of documents with complex layouts. In

ICDAR 2015, pages 1151–1155.

Baek, Y., Lee, B., Han, D., Yun, S., and Lee, H. (2019).

Character region awareness for text detection. In

CVPR 2019, pages 9365–9374.

Busta, M., Neumann, L., and Matas, J. (2017). Deep

textspotter: An end-to-end trainable scene text local-

ization and recognition framework. ICCV 2017, pages

2223–2231.

Dai, Y., Huang, Z., Gao, Y., Xu, Y., Chen, K., Guo, J., and

Qiu, W. (2018). Fused text segmentation networks


for multi-oriented scene text detection. In ICPR 2018,

pages 3604–3609.

Deng, D., Liu, H., Li, X., and Cai, D. (2018). Pixellink: De-

tecting scene text via instance segmentation. In AAAI

2018, pages 6773–6780.

Epshtein, B., Ofek, E., and Wexler, Y. (2010). Detecting

text in natural scenes with stroke width transform. In

CVPR 2010, pages 2963–2970.

Harizi, R., Walha, R., and Drira, F. (2022a). Deep-learning

based end-to-end system for text reading in the wild.

Multim. Tools Appl., 81(17):24691–24719.

Harizi, R., Walha, R., Drira, F., and Zaied, M. (2022b).

Convolutional neural network with joint stepwise

character/word modeling based system for scene text

recognition. Multim. Tools Appl., 81(3):3091–3106.

He, K., Gkioxari, G., Dollár, P., and Girshick, R. B. (2017).

Mask R-CNN. In ICCV 2017, pages 2980–2988.

He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-

ual learning for image recognition. In CVPR 2016,

pages 770–778.

He, T., Tian, Z., Huang, W., Shen, C., Qiao, Y., and Sun, C.

(2018). An end-to-end textspotter with explicit align-

ment and attention. CVPR 2018, pages 5020–5029.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., i Bigorda, L. G., Mestre, S. R., Mas, J., Mota, D., Almazán, J., and de las Heras, L. (2013). ICDAR 2013 robust

reading competition. In ICDAR, pages 1484–1493.

Liao, M., Lyu, P., He, M., Yao, C., Wu, W., and Bai, X.

(2021). Mask textspotter: An end-to-end trainable

neural network for spotting text with arbitrary shapes.

IEEE Trans. Pattern Anal. Mach. Intell., 43(2):532–

548.

Liao, M., Shi, B., and Bai, X. (2018a). Textboxes++: A

single-shot oriented scene text detector. IEEE Trans-

actions on Image Processing, 27:3676–3690.

Liao, M., Shi, B., Bai, X., Wang, X., and Liu, W. (2017).

Textboxes: A fast text detector with a single deep neu-

ral network. In AAAI 2017, pages 4161–4167.

Liao, M., Song, B., Long, S., He, M., Yao, C., and Bai,

X. (2020). Synthtext3d: synthesizing scene text im-

ages from 3d virtual worlds. Sci. China Inf. Sci.,

63(2):120105.

Liao, M., Zhu, Z., Shi, B., Xia, G., and Bai, X. (2018b).

Rotation-sensitive regression for oriented scene text

detection. In CVPR 2018, pages 5909–5918.

Liu, X., Liang, D., Yan, S., Chen, D., Qiao, Y., and Yan,

J. (2018a). Fots: Fast oriented text spotting with a

uniﬁed network. CVPR 2018, pages 5676–5685.

Liu, Y., Shen, C., Jin, L., He, T., Chen, P., Liu, C., and

Chen, H. (2022). Abcnet v2: Adaptive bezier-curve

network for real-time end-to-end text spotting. IEEE

Trans. Pattern Anal. Mach. Intell., 44(11):8048–8064.

Liu, Z., Shen, Q., and Wang, C. (2018b). Text detection in

natural scene images with text line construction. In

ICICSP 2018, pages 59–63.

Long, S., Ruan, J., Zhang, W., He, X., Wu, W., and Yao,

C. (2018). Textsnake: A ﬂexible representation for

detecting text of arbitrary shapes. In ECCV 2018, Part

II, pages 19–35.

Long, S. and Yao, C. (2020). Unrealtext: Synthesizing real-

istic scene text images from the unreal world. CoRR,

abs/2003.10608.

Mallek, A., Drira, F., Walha, R., Alimi, A. M., and Lebour-

geois, F. (2017). Deep learning with sparse prior - ap-

plication to text detection in the wild. In VISIGRAPP

- Volume 5: VISAPP 2017, pages 243–250.

Metzenthin, E., Bartz, C., and Meinel, C. (2022). Weakly

supervised scene text detection using deep reinforce-

ment learning. CoRR, abs/2201.04866.

Naiemi, F., Ghods, V., and Khalesi, H. (2021). A novel

pipeline framework for multi oriented scene text im-

age detection and recognition. Expert Syst. Appl.,

170:114549.

Piriyothinkul, B., Pasupa, K., and Sugimoto, M. (2019).

Detecting text in manga using stroke width transform.

In KST 2019, pages 142–147.

Redmon, J. and Farhadi, A. (2017). YOLO9000: better,

faster, stronger. In CVPR 2017, pages 6517–6525.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster

r-cnn: Towards real-time object detection with region

proposal networks. In Advances in Neural Informa-

tion Processing Systems, volume 28.

Shi, B., Yang, M., Wang, X., Lyu, P., Yao, C., and Bai, X.

(2019). Aster: An attentional scene text recognizer

with ﬂexible rectiﬁcation. PAMI, 41:2035–2048.

Tian, Z., Huang, W., He, T., He, P., and Qiao, Y. (2016).

Detecting text in natural image with connectionist text

proposal network. In ECCV, Part VIII, pages 56–72.

Walha, R., Drira, F., Lebourgeois, F., Garcia, C., and Alimi,

A. M. (2013). Single textual image super-resolution

using multiple learned dictionaries based sparse cod-

ing. In ICIAP 2013, Part II, volume 8157, pages 439–

448.

Walha, R., Drira, F., Lebourgeois, F., Garcia, C., and Alimi,

A. M. (2015). Joint denoising and magniﬁcation of

noisy low-resolution textual images. In ICDAR 2015,

pages 871–875.

Wang, K. and Belongie, S. (2010). Word spotting in the

wild. In ECCV 2010, pages 591–604.

Wang, W., Xie, E., Li, X., Hou, W., Lu, T., Yu, G., and

Shao, S. (2019a). Shape robust text detection with

progressive scale expansion network. In CVPR 2019,

pages 9336–9345.

Wang, W., Xie, E., Song, X., Zang, Y., Wang, W., Lu, T.,

Yu, G., and Shen, C. (2019b). Efﬁcient and accurate

arbitrary-shaped text detection with pixel aggregation

network. In ICCV 2019, pages 8439–8448.

Xing, L., Tian, Z., Huang, W., and Scott, M. (2019). Con-

volutional character networks. In ICCV, pages 9125–

9135.

Yu, W., Liu, Y., Hua, W., Jiang, D., Ren, B., and Bai, X.

(2023). Turning a CLIP model into a scene text detec-

tor. In CVPR 2023, pages 6978–6988.

Zhang, Z., Shen, W., Yao, C., and Bai, X. (2015).

Symmetry-based text line detection in natural scenes.

In CVPR 2015, pages 2558–2567.

Zhou, X., Yao, C., Wen, H., Wang, Y., Zhou, S., He, W.,

and Liang, J. (2017). East: An efﬁcient and accurate

scene text detector. In CVPR 2017, pages 2642–2651.

Zhu, Y., Chen, J., Liang, L., Kuang, Z., Jin, L., and Zhang,

W. (2021). Fourier contour embedding for arbitrary-

shaped text detection. In CVPR 2021, pages 3123–

3131.
