Figure 9: An example of the binarization results obtained
when applying the method to a document image captured
with a web camera. Upper panel: the original image; lower
panel: the binarized image produced by the proposed
method.
a PSNR of 14.49, compared to 14.22 for Sauvola’s
method.
Since the binarizer contains a fairly large number
of histograms, one might be concerned about running
times. However, for the intended application, i.e. as
part of an intelligent agent able to read text in docu-
ment images, time is less of a concern than, say, for a
system that must operate in real time at a rate of 10
or more frames per second. Of course, when reading
text in a document image out loud, the actual read-
ing takes many seconds, so that a small delay before
reading starts is not very important.
In any case, in its current configuration (which
has not been optimized for speed), the binarizer takes
around 0.36 ms per tile using a computer with an In-
tel Core i7-2600 CPU (3.40 GHz), meaning that the
training and test images used here take around 58 ms
to binarize, with a tile size of 24 × 24 pixels. An image
of size 640 × 480 pixels would take around 195
ms to binarize. Note, however, that since the tiles are
processed independently of each other, there is ample
opportunity for a speed-up through parallel processing.
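The timing figures above can be checked with a little arithmetic. The following is a minimal sketch, assuming that partial tiles at the image border are counted as full tiles; the function names tile_count and estimated_ms are illustrative and not part of the method.

```python
import math

MS_PER_TILE = 0.36  # per-tile time quoted in the text (unoptimized code)


def tile_count(width, height, tile=24):
    """Number of tiles needed to cover the image.

    Assumption: a partial tile at the right or bottom border
    costs as much as a full tile.
    """
    return math.ceil(width / tile) * math.ceil(height / tile)


def estimated_ms(width, height, tile=24, ms_per_tile=MS_PER_TILE):
    """Estimated sequential binarization time in milliseconds."""
    return tile_count(width, height, tile) * ms_per_tile


n_tiles = tile_count(640, 480)      # 27 columns x 20 rows = 540 tiles
total_ms = estimated_ms(640, 480)   # about 194.4 ms
```

With these assumptions, a 640 × 480 image gives 540 tiles and roughly 194 ms, in line with the figure of around 195 ms. Since the tiles are independent, the ideal time with k parallel workers would be about 1/k of this value, ignoring scheduling overhead.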
The main drawback of the proposed method, in
its current state at least, is that the user must specify
the number of tiles (or, rather, the tile size). The
corresponding parameters (N and M) are not considered
to be part of the binarizer since, once a binarizer has
been trained, it should work well with other tile sizes
as well, at least within a certain range. The tiles must
be able to generate a reasonably accurate histogram,
implying that they cannot be made too small; at least
a few hundred pixels are needed. On the other hand,
the tiles must be small enough so that the brightness
does not vary too much over a tile. However, even
with some brightness variation, a tile can normally be
binarized successfully, making it quite easy to set a
suitable tile size for a given class of images (say, let-
ters with standard font size, held at a distance of 0.5 m
from a camera). In fact, the tile size used here (24 × 24
pixels) typically works very well, except in extreme
cases (e.g. images with huge characters, of a kind not
usually found in letters).
Still, in future work, an effort will be made to
automate the tile size selection. This can be done by
starting with large tiles, and then further subdividing
those tiles for which the estimated brightness varia-
tion is above a certain threshold. In addition, rather
than setting the seven parameters manually, one may
consider applying some form of optimization algo-
rithm, e.g. particle swarm optimization (Kennedy and
Eberhart, 1995). However, one should note that the
parameter ranges are quite narrow (see Table 1) so
that the values can generally be set by hand.
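The subdivision idea mentioned above can be sketched as a recursive split into quadrants. This is only an illustration under stated assumptions: the brightness variation is estimated here as the spread of a crude background level (the 90th-percentile intensity) over the four quadrants of a tile, and the threshold of 15 gray levels and the minimum side of 24 pixels are hypothetical values, chosen so that every tile retains a few hundred pixels.

```python
def background_level(img, x, y, w, h):
    """Crude background estimate: 90th-percentile intensity of the region."""
    values = sorted(img[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return values[int(0.9 * (len(values) - 1))]


def quadrants(x, y, w, h):
    """Split a rectangle (x, y, w, h) into four sub-rectangles."""
    hw, hh = w // 2, h // 2
    return [(x, y, hw, hh), (x + hw, y, w - hw, hh),
            (x, y + hh, hw, h - hh), (x + hw, y + hh, w - hw, h - hh)]


def brightness_variation(img, x, y, w, h):
    """Spread of the background level across the four quadrants (assumed measure)."""
    levels = [background_level(img, *q) for q in quadrants(x, y, w, h)]
    return max(levels) - min(levels)


def subdivide(img, x, y, w, h, threshold=15, min_size=24):
    """Return a list of tiles (x, y, w, h) covering the given region.

    A tile is split further only if its estimated brightness variation
    exceeds the threshold and the resulting quadrants would still be
    at least min_size pixels on each side.
    """
    too_small = w // 2 < min_size or h // 2 < min_size
    if too_small or brightness_variation(img, x, y, w, h) <= threshold:
        return [(x, y, w, h)]
    tiles = []
    for q in quadrants(x, y, w, h):
        tiles.extend(subdivide(img, *q, threshold, min_size))
    return tiles


# Evenly lit image: kept as a single large tile.
flat = [[200] * 96 for _ in range(96)]
flat_tiles = subdivide(flat, 0, 0, 96, 96)

# Strong horizontal brightness gradient: split down to the minimum tile size.
grad = [[50 + 2 * c for c in range(96)] for _ in range(96)]
grad_tiles = subdivide(grad, 0, 0, 96, 96)
```

Any other estimate of the brightness variation could be substituted in brightness_variation without changing the recursion.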
One can also note that the training method can be
used accretively: If, at some point, the binarizer's
performance is deemed inadequate, perhaps because it
lacks some crucial histograms due to insufficient
training, one can add histograms simply by running
the training method on a few more training images,
without having to start from an empty set of
histograms. Thus, if one happens to end the training
procedure prematurely, so that the binarizer does not
contain a sufficient number of histograms for reliable
binarization, the problem can easily be rectified.
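The accretive use of training can be illustrated with a small sketch. It is not the training method itself: the class HistogramStore, the L1 distance between normalized histograms and the max_distance value are all hypothetical, introduced only to show how a second training run can extend an existing set of histograms instead of starting from an empty one.

```python
class HistogramStore:
    """Hypothetical container for the histograms held by a trained binarizer."""

    def __init__(self, max_distance=0.1):
        self.histograms = []
        self.max_distance = max_distance  # assumed matching tolerance

    @staticmethod
    def distance(a, b):
        """L1 distance between two normalized histograms (assumed metric)."""
        return sum(abs(p - q) for p, q in zip(a, b))

    def train(self, tile_histograms):
        """Add each histogram that no stored histogram already matches.

        Returns the number of histograms added. Calling train() again
        with new data extends the existing set.
        """
        added = 0
        for hist in tile_histograms:
            if all(self.distance(hist, s) > self.max_distance
                   for s in self.histograms):
                self.histograms.append(hist)
                added += 1
        return added


store = HistogramStore()
# First (short) training run: both histograms are new.
first_added = store.train([[0.5, 0.5], [0.9, 0.1]])
# Later run on further images: only the unseen histogram is added.
second_added = store.train([[0.5, 0.5], [0.1, 0.9]])
```

In this sketch the first run adds two histograms and the second adds one, so that the store ends up with three histograms without any retraining from scratch.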
Regarding training, it can be carried out very
quickly with a training set of the kind used here,
i.e. one that consists of pairs of artificially generated
images, one noisy and one clean, such that the latter
can be used as the ground truth. On the other hand, to
obtain the best possible performance on actual camera
images (which might also be bent, stained, etc.), it
would probably be better to train the binarizer on such
images. The problem is that, in such a case, one would
not have an exact ground truth image to compare with
(at least not without considerable effort). However,
one could use a subjective method for binarizing the
tiles of the training images, focusing on perceived
readability of the
letters. It should also be noted that the PSNR mea-
ICAART 2014 - International Conference on Agents and Artificial Intelligence