[Figure 4: Performance comparison between FC-Attention and ConvAttention. (a) Error rate [%] vs. epoch on the validation (dashed) and test (solid) sets; (b) negative log-likelihood vs. epoch on the training (solid) and validation (dashed) sets. Curves are shown for FC-Att, ConvAtt1 3, and ConvAtt3 2.]
words that contain non-Chinese characters, have more than 10 characters, or have a vertical style, and finally obtain 21,781 cropped images as our test set.
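To make the filtering rule concrete, the following is a minimal Python sketch of the three discard criteria; the record fields, the CJK-only regular expression, and the aspect-ratio test for "vertical style" are our assumptions, not details taken from the dataset's annotation format.

    import re

    # CJK Unified Ideographs; under this sketch's assumption, a transcript
    # must consist solely of these characters to count as "Chinese".
    CJK_ONLY = re.compile(r'^[\u4e00-\u9fff]+$')

    def keep_word(transcript, box_w, box_h):
        """Return True if a cropped word survives all three filters."""
        if not CJK_ONLY.match(transcript):   # contains non-Chinese characters
            return False
        if len(transcript) > 10:             # more than 10 characters
            return False
        if box_h > box_w:                    # taller than wide: "vertical style"
            return False
        return True

    # A short horizontal Chinese word passes; a Latin word does not.
    assert keep_word("北京站", box_w=120, box_h=40)
    assert not keep_word("Hello", box_w=120, box_h=40)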
Figure 4 compares the performance of FC-Attention and ConvAttention. In subfigure (a), both ConvAtt1 3 and ConvAtt3 2 achieve lower error rates than FC-Attention on both the validation and test sets. To further explore the effectiveness of ConvAttention, we depict the training process in subfigure (b) and find that ConvAttention has a stronger fitting capacity than FC-Attention. We also note that the error rate of FC-Attention on the validation set is lower than on the test set, because the synthetic dataset has higher complexity than RCTW-17. These results demonstrate that ConvAttention can take on more challenging character recognition tasks than FC-Attention.
5.4 Discussion: Influence of Sliding Window
In most existing literature (Shi et al., 2016a; Shi et al., 2016b; Cheng et al., 2017), 1/4 down-sampling is used, whereas we perform 1/8 down-sampling with respect to the width of the input image in both ConvAttention and FC-Attention to lower the computational cost, which may have resulted in suboptimal accuracy in Table 2.
Therefore, we change the step size of the sliding window from 2 to 1, which increases the length of the resulting spatiotemporal sequence from 29 to 57; we find that the accuracy of ConvAtt1 3 can be further improved by 0.97% on average. For a fair comparison, we also change the stride of the fourth pooling layer in FC-Attention from 2 to 1, which increases the length of the temporal sequence from 33 to 65; the accuracy of FC-Att can then be further improved by 0.65% on average. Therefore, ConvAttention outperforms FC-Attention regardless of the down-sampling strategy used.
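Both length changes follow the standard output-size formula for a 1-D window without padding, L_out = floor((L_in - k) / s) + 1. Below is a minimal sketch; the kernel sizes and input lengths are back-solved so that the formula reproduces the sequence lengths reported above, not values taken from the actual network definition.

    def out_len(in_len, kernel, stride):
        """Output length of a 1-D sliding window / pooling op (no padding)."""
        return (in_len - kernel) // stride + 1

    # Sliding window over the feature-map width: any (in_len, kernel) pair
    # with in_len - kernel == 56 reproduces the reported lengths.
    assert out_len(in_len=60, kernel=4, stride=2) == 29  # step size 2 -> 29
    assert out_len(in_len=60, kernel=4, stride=1) == 57  # step size 1 -> 57

    # Fourth pooling layer in FC-Attention (2-wide pooling assumed).
    assert out_len(in_len=66, kernel=2, stride=2) == 33  # stride 2 -> 33
    assert out_len(in_len=66, kernel=2, stride=1) == 65  # stride 1 -> 65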
6 CONCLUSIONS
In this paper, we have presented a novel spatiotemporal deep learning framework with a convolutional attention mechanism (ConvAttention) for retaining more information about spatial structures. ConvAttention not only preserves the advantages of FC-Attention but is also suitable for spatiotemporal data due to its inherent convolutional structure. We have successfully applied ConvAttention to the challenging problem of scene text recognition. By incorporating ConvAttention into text reading, we build an end-to-end trainable deep network for character recognition. Extensive experiments on public benchmarks demonstrate that our method achieves state-of-the-art results. As future work, we will investigate how to apply ConvAttention to image/video captioning.
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural
Machine Translation by Jointly Learning to Align and
Translate. In ICLR.
Bissacco, A., Cummins, M., Netzer, Y., and Neven, H.
(2013). PhotoOCR: Reading Text in Uncontrolled
Conditions. In ICCV, pages 785–792.
Cheng, Z., Bai, F., Xu, Y., Zheng, G., Pu, S., and Zhou, S.
(2017). Focusing Attention: Towards Accurate Text
Recognition in Natural Images. In ICCV, pages 5076–
5084.
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML, pages 369–376. ACM.
Gupta, A., Vedaldi, A., and Zisserman, A. (2016). Syn-
thetic Data for Text Localisation in Natural Images.
In CVPR, pages 2315–2324.
He, P., Huang, W., Qiao, Y., Loy, C. C., and Tang, X.
(2016). Reading Scene Text in Deep Convolutional
Sequences. In AAAI, pages 3501–3508.
Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735–1780.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2014). Synthetic Data and Artificial Neural Net-
works for Natural Scene Text Recognition. arXiv
preprint arXiv:1406.2227.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2015). Deep Structured Output Learning for Un-
constrained Text Recognition. In ICLR.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman,
A. (2016). Reading Text in the Wild with Convolu-
tional Neural Networks. IJCV, 116(1):1–20.