mentation, which includes text and non-text segmentation,
main-body and side-notes segregation, and text-line
extraction. More specifically, we propose a novel
main-body and side-notes segregation technique and an
improved version of the OCRopus (OCRopus, ) text-line
extraction technique.
In addition, we apply the percentile based binarization
method (Afzal et al., 2014) of OCRopus and the text and
non-text segmentation technique of Bukhari et al.
(Bukhari et al., 2011) to historical documents. For
completeness, besides explaining our novel main-body
and side-notes segregation technique and the improved
version of the OCRopus (OCRopus, ) text-line extraction
method, we also briefly describe the percentile based
binarization method and the multiresolution morphology
based text and non-text segmentation technique
(Bukhari et al., 2011).
A brief description of these steps is provided here:
2.1 The Percentile based Binarization
Method (Afzal et al., 2014)
In general, an image can be thresholded either by
determining a single threshold for the entire page
(global binarization) or by using statistics obtained
from a local window centered on the pixel being
thresholded (local binarization). The percentile based
binarization method (Afzal et al., 2014) of OCRopus
takes into consideration the background statistics,
which are estimated with percentile filters. In this
method, text and non-text regions are treated equally
when determining the threshold used for local
binarization. It works well on both focused and
defocused images, including monocular images with
defocused parts. The method starts by estimating the
background at each location in the image, i.e., a new
image containing only the background is created using
a percentile filter. The threshold is then adapted to
the background properties of the image. The original
image takes gray-level values in [0, 255], whereas the
background image estimated with percentile filters at
every location has only two levels, i.e., 0 and 255.
Thresholding is done as follows: if the pixel value in
the original image is less than 't' times the pixel
value in the estimated background image, the
corresponding pixel in the output image is labeled
one; otherwise it is labeled zero. Here 't' is the
parameter that decides whether a pixel is foreground
or background, depending on how similar the pixel is
to the background estimated by the percentile filter.
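The thresholding rule above can be illustrated with a minimal sketch in plain Python. It is not the OCRopus implementation: the function names, the window radius, the percentile value and the default 't' are assumptions chosen for illustration, and the background estimate is kept here as a gray-level image.

```python
def percentile_filter(image, percentile, radius):
    """Estimate the background of a gray-level image (list of rows).

    For every pixel, take the given percentile of the gray values inside a
    (2 * radius + 1) square window, clipped at the image borders.
    """
    height, width = len(image), len(image[0])
    background = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = sorted(
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            )
            rank = int(round(percentile / 100 * (len(window) - 1)))
            background[y][x] = window[rank]
    return background


def binarize(image, t=0.5, percentile=80, radius=2):
    """Label a pixel 1 if it is darker than t times the local background."""
    background = percentile_filter(image, percentile, radius)
    return [
        [1 if pixel < t * bg else 0 for pixel, bg in zip(row, bg_row)]
        for row, bg_row in zip(image, background)
    ]
```

Because the threshold is a fraction of the locally estimated background rather than a fixed gray value, a dark stroke on a bright patch and a dark stroke on a dimmer patch are both compared against their own surroundings.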
2.2 The Multiresolution Morphology
based Text and Non-text
Segmentation (Bukhari et al., 2011)
Bloomberg (Bloomberg, 1991) presented a multiresolution
morphology based text and non-text segmentation method
that is simple and script independent. It performs
well for halftone mask segmentation, for which it was
designed, but in most cases fails to accurately
segment drawing-type non-text elements such as line
art and maps. Bukhari et al. (Bukhari et al., 2011)
presented an improved multiresolution morphology based
text and non-text segmentation algorithm that can
handle halftones as well as drawing-type non-text
elements. A sample document image and its text and
non-text segmentation results for the original and the
improved multiresolution morphology based methods are
shown in Figure 3.
2.3 The Improved Text-Line Extraction
Method
The text-line extraction technique proposed here is a
modified version of the OCRopus text-line extraction
method, which is called "ocropus-gpageseg". The
OCRopus text-line extraction technique is briefly
explained here:
It first estimates the "scale" of the text by finding
the connected components of individual letters in the
binary image and calculating the median of their
dimensions. It then removes components that are too
big or too small relative to this scale, since they
are unlikely to be letters.
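The scale estimation step can be sketched as follows. This is an illustrative sketch rather than the ocropus-gpageseg code: measuring a component by the square root of its bounding-box area, the 8-connectivity, and the factors 0.5 and 3.0 are all assumptions made for the example.

```python
from math import sqrt
from statistics import median


def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of the 8-connected
    foreground (value 1) components of a binary image given as list of rows."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            stack = [(y, x)]
            top, left, bottom, right = y, x, y, x
            while stack:  # flood fill from the seed pixel
                cy, cx = stack.pop()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny in range(max(0, cy - 1), min(height, cy + 2)):
                    for nx in range(max(0, cx - 1), min(width, cx + 2)):
                        if binary[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def box_size(box):
    """Size of a component: square root of its bounding-box area (assumed)."""
    top, left, bottom, right = box
    return sqrt((bottom - top + 1) * (right - left + 1))


def estimate_scale(boxes):
    """Median component size, used as the text scale."""
    return median(box_size(box) for box in boxes)


def filter_by_scale(boxes, scale, min_factor=0.5, max_factor=3.0):
    """Drop components that are too small or too big relative to the scale."""
    return [
        box for box in boxes
        if min_factor * scale <= box_size(box) <= max_factor * scale
    ]
```

Using the median makes the estimate robust: a few specks or one large illustration barely move it, so they can afterwards be rejected against it.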
In the baseline ocropus-gpageseg method, column
separators in the binary image are found using
convolution and thresholding. First, vertical white
spaces in the binary image are found, and the
remaining region is filled by filtering to form a
smooth text region. Then, using Gaussian and uniform
filtering, the column edges (gradients) are found in
the binary image by applying a threshold set in
accordance with the scale of the image. The smoothed
text region and the column edges are then combined to
obtain the column separators. In the next step, out of
all column separators, only a selected number of
separators with dimensions greater than a minimum
value are retained.
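As a rough illustration of the whitespace and size-selection steps only, the sketch below looks for tall vertical runs of white pixels and keeps the bands that are wide enough. It deliberately omits the Gaussian and uniform filtering and the edge detection described above, so it is a simplified stand-in and not the ocropus-gpageseg procedure; the function names and the parameters min_height and min_width are hypothetical.

```python
def tall_white_columns(binary, min_height):
    """Return the x positions whose longest vertical run of white (0)
    pixels is at least min_height pixels tall."""
    height, width = len(binary), len(binary[0])
    columns = []
    for x in range(width):
        longest = run = 0
        for y in range(height):
            run = run + 1 if binary[y][x] == 0 else 0
            longest = max(longest, run)
        if longest >= min_height:
            columns.append(x)
    return columns


def group_into_bands(columns, min_width):
    """Merge consecutive x positions into (first_x, last_x) bands and keep
    only the bands that are at least min_width pixels wide."""
    bands = []
    for x in columns:
        if bands and x == bands[-1][1] + 1:
            bands[-1][1] = x  # extend the current band
        else:
            bands.append([x, x])  # start a new band
    return [(a, b) for a, b in bands if b - a + 1 >= min_width]
```

In practice both thresholds would be tied to the estimated text scale, so that inter-word gaps are too short and too narrow to qualify while the gutter between two columns does.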
In order to find text lines, a box-map (bounding box)
is first computed using two thresholds. If the area of
a slice in the slice list lies between the two
threshold areas, that slice is labeled one; otherwise
it is labeled zero, which helps in removing noise.
Then a clean
ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods