[Figure 4: Performance comparison between FC-Attention and ConvAttention. (a) Error rate [%] vs. epoch on the validation (dashed) and test (solid) sets; (b) negative log-likelihood vs. epoch on the training (solid) and validation (dashed) sets. Curves are shown for FC-Att, ConvAtt1 3, and ConvAtt3 2.]
words that contain non-Chinese characters, have more than 10 characters, or have a vertical style, and finally obtain 21,781 cropped images as our test set.
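To make the filtering rule concrete, the following is a minimal Python sketch of the three discard criteria; the record fields, the CJK-only regular expression, and the aspect-ratio test for "vertical style" are our assumptions, not details taken from the dataset's annotation format.

    import re

    # CJK Unified Ideographs; under this sketch's assumption, a transcript
    # must consist solely of these characters to count as "Chinese".
    CJK_ONLY = re.compile(r'^[\u4e00-\u9fff]+$')

    def keep_word(transcript, box_w, box_h):
        """Return True if a cropped word survives all three filters."""
        if not CJK_ONLY.match(transcript):   # contains non-Chinese characters
            return False
        if len(transcript) > 10:             # more than 10 characters
            return False
        if box_h > box_w:                    # taller than wide: "vertical style"
            return False
        return True

    # A short horizontal Chinese word passes; a Latin word does not.
    assert keep_word("北京站", box_w=120, box_h=40)
    assert not keep_word("Hello", box_w=120, box_h=40)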
Figure 4 compares the performance of FC-Attention and ConvAttention. In subfigure (a), both ConvAtt1 3 and ConvAtt3 2 achieve lower error rates than FC-Attention on both the validation and test sets. To further explore the effectiveness of ConvAttention, we depict the training process in subfigure (b) and find that ConvAttention has a stronger fitting capacity than FC-Attention. We also note that the error rate of FC-Attention on the validation set is lower than on the test set, because the synthetic dataset has higher complexity than RCTW-17. These results demonstrate that ConvAttention can take on more challenging character recognition tasks than FC-Attention.
5.4 Discussion: Influence of Sliding Window
In most existing literature (Shi et al., 2016a; Shi et al., 2016b; Cheng et al., 2017), 1/4 down-sampling is used, whereas we perform 1/8 down-sampling with respect to the width of the input image in both ConvAttention and FC-Attention to lower the computational cost, which may have resulted in suboptimal accuracy in Table 2.
Therefore, we change the step size of the sliding window from 2 to 1, which increases the length of the resulting spatiotemporal sequence from 29 to 57; we find that the accuracy of ConvAtt1 3 can be further improved by 0.97% on average. For a fair comparison, we also change the stride of the fourth pooling layer in FC-Attention from 2 to 1, which increases the length of the temporal sequence from 33 to 65; the accuracy of FC-Att can then be further improved by 0.65% on average. Therefore, ConvAttention outperforms FC-Attention regardless of the down-sampling strategy used.
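Both length changes follow the standard output-size formula for a 1-D window without padding, L_out = floor((L_in - k) / s) + 1. Below is a minimal sketch; the kernel sizes and input lengths are back-solved so that the formula reproduces the sequence lengths reported above, not values taken from the actual network definition.

    def out_len(in_len, kernel, stride):
        """Output length of a 1-D sliding window / pooling op (no padding)."""
        return (in_len - kernel) // stride + 1

    # Sliding window over the feature-map width: any (in_len, kernel) pair
    # with in_len - kernel == 56 reproduces the reported lengths.
    assert out_len(in_len=60, kernel=4, stride=2) == 29  # step size 2 -> 29
    assert out_len(in_len=60, kernel=4, stride=1) == 57  # step size 1 -> 57

    # Fourth pooling layer in FC-Attention (2-wide pooling assumed).
    assert out_len(in_len=66, kernel=2, stride=2) == 33  # stride 2 -> 33
    assert out_len(in_len=66, kernel=2, stride=1) == 65  # stride 1 -> 65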
6 CONCLUSIONS
In this paper, we have presented a novel spatiotemporal deep learning framework with a convolutional attention mechanism (ConvAttention) for retaining more information about spatial structures. ConvAttention not only preserves the advantages of FC-Attention but is also suitable for spatiotemporal data due to its inherent convolutional structure. We have successfully applied ConvAttention to the challenging problem of scene text recognition. By incorporating ConvAttention into text reading, we build an end-to-end trainable deep network for character recognition. Extensive experiments on public benchmarks demonstrate that our method achieves state-of-the-art results. As future work, we will investigate how to apply ConvAttention to image/video captioning.
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural
Machine Translation by Jointly Learning to Align and
Translate. In ICLR.
Bissacco, A., Cummins, M., Netzer, Y., and Neven, H.
(2013). PhotoOCR: Reading Text in Uncontrolled
Conditions. In ICCV, pages 785–792.
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., and Zhou, S.
(2017). Focusing Attention: Towards Accurate Text
Recognition in Natural Images. In ICCV, pages 5076–
5084.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pages 369–376. ACM.
Gupta, A., Vedaldi, A., and Zisserman, A. (2016). Syn-
thetic Data for Text Localisation in Natural Images.
In CVPR, pages 2315–2324.
He, P., Huang, W., Qiao, Y., Loy, C. C., and Tang, X.
(2016). Reading Scene Text in Deep Convolutional
Sequences. In AAAI, pages 3501–3508.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2014). Synthetic Data and Artificial Neural Net-
works for Natural Scene Text Recognition. arXiv
preprint arXiv:1406.2227.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2015). Deep Structured Output Learning for Un-
constrained Text Recognition. In ICLR.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2016). Reading Text in the Wild with Convolu-
tional Neural Networks. IJCV, 116(1):1–20.