Figure 4: Performance comparison between FC-Attention
and ConvAttention. (a) Error rates on the validation set
(dashed) and the test set (solid); (b) loss curves for the
training (solid) and validation (dashed) sets.
words that contain non-Chinese characters, have more
than 10 characters or have a vertical style, and finally
obtain 21781 cropped images as our test set.
Figure 4 compares the performance of FC-Attention
and ConvAttention. In subfigure (a), both ConvAtt1 3
and ConvAtt3 2 achieve lower error rates than
FC-Attention on the validation set and the test set.
To further explore the effectiveness of ConvAttention,
we depict the training process in subfigure (b) and
find that ConvAttention has a stronger fitting capacity
than FC-Attention. We also note that the error rate of
FC-Attention on the validation set is lower than that
on the test set because the synthetic dataset has higher
complexity than RCTW-17. The results here demonstrate
that ConvAttention can take on more challenging
character recognition tasks than FC-Attention.
5.4 Discussion: Influence of Sliding
In most existing literature (Shi et al., 2016a; Shi et al.,
2016b; Cheng et al., 2017), a milder down-sampling is
used, whereas we down-sample more aggressively with
respect to the width of the input image in both
ConvAttention and FC-Attention to lower computational
cost, which may have resulted in suboptimal accuracy
in Table 2.
Therefore, we change the step size of the sliding window
from 2 to 1, which increases the length of the resulting
spatiotemporal sequence from 29 to 57; the average
accuracy of ConvAtt1 3 improves by a further 0.97%.
For a fair comparison, we also change the stride of the
fourth pooling layer in FC-Attention from 2 to 1, which
increases the length of the temporal sequence from 33
to 65; the average accuracy of FC-Att improves by a
further 0.65%. Thus, ConvAttention outperforms
FC-Attention regardless of the down-sampling
strategy used.
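
The sequence lengths quoted above are consistent with the standard
output-size formula for an unpadded sliding window,
L = floor((W - k) / s) + 1, where W is the extent along the width,
k the window size and s the stride. The sketch below is illustrative
only: the function name `sequence_length` and the extents and window
sizes are hypothetical values chosen so that the arithmetic reproduces
the reported lengths; they are not the actual feature-map dimensions
of either network.

```python
def sequence_length(extent: int, window: int, stride: int) -> int:
    """Count the positions of an unpadded sliding window along one axis."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    if extent < window:
        return 0
    return (extent - window) // stride + 1


# Hypothetical sizes (assumptions, not taken from the paper): only the
# difference extent - window matters for the resulting length.
CONV_EXTENT, CONV_WINDOW = 57, 1
FC_EXTENT, FC_WINDOW = 65, 1

for name, extent, window in [("ConvAttention", CONV_EXTENT, CONV_WINDOW),
                             ("FC-Attention", FC_EXTENT, FC_WINDOW)]:
    before = sequence_length(extent, window, stride=2)
    after = sequence_length(extent, window, stride=1)
    print(f"{name}: stride 2 -> {before} steps, stride 1 -> {after} steps")
```

Under these assumptions, halving the stride takes a sequence of
length L to 2L - 1 (29 to 57 and 33 to 65), so both models process
roughly twice as many steps as before.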
In this paper, we have presented a novel spatiotem-
poral deep learning framework with a convolutional
attention mechanism (ConvAttention) for retaining
more information about spatial structures. ConvA-
ttention not only preserves the advantages of FC-
Attention but is also suitable for spatiotemporal data
due to its inherent convolutional structure. We have
successfully applied ConvAttention to the challeng-
ing problem of scene text recognition. By incorporat-
ing ConvAttention into text reading, we build an end-
to-end trainable deep network for character recogni-
tion. Extensive experiments on public benchmarks
demonstrate that our method achieves state-of-the-art
results. As future work, we will investigate how to
apply ConvAttention to image/video captioning.
