Figure 2: Image examples (a) with text, (b) without or
badly-cut text, and (c) gathered during bootstrapping.
In addition to the positive examples of text images, we created a set
of negative examples from random parts of scenery images. To these we
added a number of images containing multiline or badly-cut text in
order to help the network localize text precisely when it is used to
scan an image (section 4). A total of 64,760 negative examples was
thereby produced; some of them are depicted in Fig. 2.b. Finally, in
order to check the generalization of the network during training, a
validation set containing 10,640 positive and an equal number of
negative examples was created with the same method as the training set.
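
As an illustration only, the following sketch shows one way such negative examples could be produced by cropping random windows from scenery images; the input size, the crop-size range, and the number of crops per image are assumptions, not values from the paper.

```python
# Hypothetical sketch of negative-example generation: random crops taken from
# text-free scenery images and resized to a fixed network input size.
# INPUT_W, INPUT_H and crops_per_image are illustrative assumptions.
import random
from PIL import Image

INPUT_W, INPUT_H = 48, 24   # assumed fixed input size of the network

def random_negative_crops(scenery_paths, crops_per_image=50):
    crops = []
    for path in scenery_paths:
        img = Image.open(path).convert("L")          # grey-level image assumed
        w, h = img.size                              # assumes the image exceeds the crop size
        for _ in range(crops_per_image):
            cw = random.randint(INPUT_W, min(w, 4 * INPUT_W))
            ch = random.randint(INPUT_H, min(h, 4 * INPUT_H))
            x = random.randint(0, w - cw)
            y = random.randint(0, h - ch)
            crop = img.crop((x, y, x + cw, y + ch)).resize((INPUT_W, INPUT_H))
            crops.append(crop)
    return crops
```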
The network was trained with the standard back-propagation algorithm.
Target responses were fixed to -1 for negative and +1 for positive
examples. The rejection ability of the network was boosted with a
bootstrapping procedure: after some training cycles, false alarms are
gathered by running the network on a set of scenery images. These
false alarms are added to the set of negative examples, and training
then continues until convergence is observed. Some of the gathered
false alarms are shown in Fig. 2.c. Due to bootstrapping, the set of
negative examples was augmented to a total of 114,407 examples. The
network was finally able to correctly classify 96.45% of the positive
examples and 97.01% of the negative ones on the (augmented) training
set. The figures on the validation set are similar (95.84% and 97.45%,
respectively).
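
A minimal sketch of this bootstrapping loop is given below; `train`, `scan`, and the number of rounds are placeholders, and only the overall control flow follows the description above.

```python
# Sketch of the bootstrapping procedure: after each training pass, the network
# is run on text-free scenery images and every positive response (a false alarm)
# is added to the negative set before training resumes.
# `train` and `scan` are placeholders for the actual training and scanning code.
def bootstrap(network, positives, negatives, scenery_images, rounds=3):
    for _ in range(rounds):
        train(network, positives, negatives)         # targets: +1 positive, -1 negative
        false_alarms = [window
                        for image in scenery_images
                        for window, score in scan(network, image)
                        if score > 0.0]              # positive output on a text-free image
        if not false_alarms:
            break                                    # no new false alarms to learn from
        negatives.extend(false_alarms)
    train(network, positives, negatives)             # final training until convergence
    return network
```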
4 IMAGE SCANNING
We describe in this section how the trained convolu-
tional network is used to scan an entire input image in
order to detect horizontal text lines of any height that
may appear at any possible image location.
In order to detect text of varying height, the input image is
repeatedly subsampled by a factor of 1.2 to construct a pyramid of
different scales. The network is applied to each slice (scale) of this
pyramid individually. Since the neural network uses convolutional
kernels in its first layers, instead of feeding the network each
possible image location separately, we can apply the network filters
to an entire pyramid slice at once, saving a great deal of computation
time. This filtering procedure provides the network responses as if
the network were applied at each image position with a step of 4
pixels in both directions, since two subsampling operations take place
(in layers S1 and S2).
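
The sketch below illustrates this multi-scale scan; the `net.apply_filters` call, the stopping size, and the coordinate bookkeeping are assumptions standing in for the actual implementation, while the 1.2 scale factor and the effective stride of 4 pixels come from the description above.

```python
# Illustrative multi-scale scan over an image pyramid built by repeated
# subsampling with factor 1.2. `net.apply_filters` is a placeholder that
# returns the map of network responses for a whole slice (effective stride 4).
import numpy as np
from PIL import Image

SCALE_STEP = 1.2
MIN_SIDE = 24           # assumed lower bound on slice size, not from the paper

def scan_pyramid(net, image):
    """Collect (x, y, scale, score) tuples for positive responses at every scale."""
    detections = []
    scale = 1.0
    slice_ = image
    while min(slice_.size) >= MIN_SIDE:
        responses = net.apply_filters(np.asarray(slice_, dtype=np.float32))
        ys, xs = np.nonzero(responses > 0.0)          # keep positive activations only
        for x, y in zip(xs, ys):
            # a response at (x, y) corresponds to a step of 4 pixels in the slice
            detections.append((x * 4 * scale, y * 4 * scale, scale, responses[y, x]))
        scale *= SCALE_STEP
        w, h = image.size
        slice_ = image.resize((int(w / scale), int(h / scale)), Image.BILINEAR)
    return detections
```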
In the next step of the scanning procedure, the responses collected at
each scale are grouped according to their proximity in scale and space
to form a list of candidate targets. The horizontal extent of a group
is determined by its left and right extremes, while its scale is the
average over its members. Cases of multiline text are easily discarded
in favour of the individual text lines that constitute them, because
the network is trained to reject multiline text.
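
A simple greedy version of this grouping is sketched below; the distance and scale-ratio thresholds are illustrative assumptions rather than the values actually used.

```python
# Greedy grouping of responses that are close in space and scale. Each group
# keeps its left/right extremes and the average scale of its members.
# The thresholds max_dx, max_dy and max_scale_ratio are assumptions.
def group_responses(detections, max_dx=8.0, max_dy=8.0, max_scale_ratio=1.2):
    groups = []
    for x, y, scale, _score in sorted(detections, key=lambda d: d[0]):
        for g in groups:
            close_in_space = abs(y - g["y"]) <= max_dy and x - g["right"] <= max_dx
            close_in_scale = (max(scale, g["scale"]) /
                              min(scale, g["scale"])) <= max_scale_ratio
            if close_in_space and close_in_scale:
                g["right"] = max(g["right"], x)                      # extend horizontally
                g["scales"].append(scale)
                g["scale"] = sum(g["scales"]) / len(g["scales"])     # average scale
                break
        else:
            groups.append({"left": x, "right": x, "y": y,
                           "scale": scale, "scales": [scale]})
    return groups
```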
Finally, the rectangles of the candidates are inspected individually
by forming local image pyramids around them and applying the network
to each slice with a step of 1 in both directions, in order to measure
the density of the positive activations more effectively. Candidates
with a low average activation are considered false alarms and are
rejected.
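
The rejection step could look roughly like the following; `local_pyramid` and `net.apply_dense` are placeholders, and the 0.12 threshold is the value reported in section 5.

```python
# Verification sketch: the network is re-applied with step 1 inside a local
# pyramid built around each candidate rectangle, and candidates whose mean
# activation falls below a threshold are rejected as false alarms.
# `local_pyramid` and `net.apply_dense` are placeholders for the real routines.
def verify_candidates(net, image, candidates, threshold=0.12):
    accepted = []
    for cand in candidates:
        activations = []
        for slice_ in local_pyramid(image, cand):     # scales around the candidate
            responses = net.apply_dense(slice_)       # step 1 in both directions
            activations.extend(responses.ravel())
        if activations and sum(activations) / len(activations) >= threshold:
            accepted.append(cand)
    return accepted
```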
5 EXPERIMENTAL RESULTS
We tested our approach on the ICDAR'03 robust reading competition
corpus (Lucas et al., 2005). It comes with a training set of 258
images and a test set of 251 images, all manually annotated. The
corpus contains exclusively scene text, i.e., text that naturally
appears in an image. Although our method uses synthetic examples that
simulate superimposed text, we provide experimental results on the
ICDAR'03 test set in order to test the generalization capabilities of
the network and to compare with other methods.
Based on the performance on the training set of the corpus, we fixed
the threshold on the average activation for false alarm rejection to
0.12 and confined our search to text between 36 and 480 pixels high.
Our method detects entire text lines, while the ground truth of the
corpus annotates every word separately (even if it consists of a
single letter); reporting detection rates directly would therefore be
unfair. Thus, we followed the performance metric of Wolf (Lucas et
al., 2005) to compare the rectangles of the text detected by our
method to the rectangles of the ground truth. This metric takes into
account one-to-many correspondences between the detected rectangles
and the annotated ones.
Results are reported in table 1, in terms of preci-
sion and recall rates. The third column gives the stan-
dard f-measure, which combines precision and recall
rates in one measurement (see (Lucas et al., 2005)