High Performance Layout Analysis of Medieval European Document Images

Syed Saqib Bukhari¹, Ashutosh Gupta¹,², Anil Kumar Tiwari² and Andreas Dengel¹,³

¹ German Research Center for Artificial Intelligence, Kaiserslautern, Germany

² IITJ-Indian Institute of Technology Jodhpur, India

³ Technical University Kaiserslautern, Germany

Keywords:

Document Analysis, Historical Document Analysis, Layout Analysis, Document Image Segmentation.

Abstract:

Layout analysis, mainly comprising binarization and page segmentation, is one of the most important performance-determining steps of an OCR system for complex medieval document images, which contain noise, distortions and irregular layouts. In this paper, we present high performance page segmentation techniques for medieval European document images, which include a novel main-body and side-notes segregation and an improved version of the OCRopus (OCRopus) based text-line extraction. To complete the high performance layout analysis pipeline, we also apply the percentile based binarization (Afzal et al., 2014) and the multiresolution morphology based text and non-text segmentation (Bukhari et al., 2011) methods to historical document images. The presented layout analysis techniques are applied to a collection of 15th century Latin document images, and achieve more than 90% accuracy for each of the segmentation techniques.

1 INTRODUCTION

This paper addresses the problem of layout analysis

of historical European document images. Most lan-

guages of Europe belong to the Indo-European lan-

guage family. This family is divided into a number

of branches, including Romance, Germanic, Baltic,

Slavic, Albanian, Celtic, Armenian and Hellenic

(Greek). The Uralic languages, which include Hun-

garian, Finnish, and Estonian, also have a signiﬁcant

presence in Europe. Examples of Latin European document images are shown in Figure 1.

Layout analysis, which comprises text and non-text segmentation, main-body and side-notes segregation, and text-line extraction, is a major performance-limiting step in large-scale document digitization projects. Over the last two decades, several layout analysis algorithms have been proposed in the literature (Cattoni et al., 1998; Nagy, 2000) that work for different layouts and scripts and are quite robust to the presence of noise

in documents. Here, we brieﬂy discuss some state-of-

the-art document image layout analysis approaches in

connection to European documents. Text and non-

text segmentation is an important layout analysis step,

which may directly affect the performance of further

layout processing tasks such as text-line extraction,

and/or character recognition. The performance of classification-based text and non-text segmentation approaches (Bukhari et al., 2010) depends heavily on training samples, and they cannot be directly applied to different scripts. On the other hand, smearing (Wong et al., 1982) and multiresolution morphology (Bloomberg, 1991; Bukhari et al., 2011) based approaches work on the assumption that non-text elements are bigger than text elements; however, these approaches are script independent and can be directly used for European script document images.

Figure 1: 15th century medieval European documents from the Kallimachos project (Kallimachos): (a), on the left, contains both text and non-text regions; (b), on the right, contains only text regions.

Figure 2: The percentile based binarization methodology (Afzal et al., 2014). The input scanned document (a) is binarized using percentile filtering to give the binary output document (b).

Bukhari, S., Gupta, A., Tiwari, A. and Dengel, A.

High Performance Layout Analysis of Medieval European Document Images.

DOI: 10.5220/0006574603240331

In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 324-331

ISBN: 978-989-758-276-9

Text-line extraction is the backbone of a layout analysis system. Kumar et al. (2007) evaluated the performance of six page segmentation algorithms on Nastaliq script: x-y cut (Nagy et al., 1992), smearing (Wong et al., 1982), whitespace analysis (Baird, 1994), constrained text-line finding (Baird, 2002), Docstrum (O'Gorman, 1993), and the Voronoi-diagram based approach (Kise et al., 1998). These algorithms work very well in segmenting documents in Latin script, as shown in (Shafait et al., 2008). However, none of them achieved an accuracy of more than 70% on the test data of Kumar et al., which had simple book layouts. More sophisticated approaches for text-line extraction have been presented for segmenting handwritten European documents. However, the key problem addressed in these approaches is the handling of local non-linearity of text-lines.

In this paper, we present a high performance layout analysis system for a wide variety of historical European document images that belong to a diverse collection of layout structures such as books, magazines, and newspapers. Our layout analysis system is a suitable combination of robust and well-established text and non-text segmentation, main-body and side-notes segregation, and text-line extraction techniques. First, it performs text and non-text segmentation using the multiresolution morphology based method (Bukhari et al., 2011). Then, it segregates main-body and side-notes based on vertical white space calculation and filtering for a variety of single and multi-column layouts. Finally, it extracts text-lines based on the y-derivative of a Gaussian kernel. In this way, our system extends the OCRopus (OCRopus) based layout analysis system (text-line extraction) by incorporating text and non-text segmentation, a novel main-body and side-notes segregation and an improved text-line extraction method. To evaluate the performance of the presented layout analysis system on real-world documents, a dataset of scanned European documents is prepared. This paper focuses on an extensive experimental evaluation of the presented layout analysis system and its comparison with state-of-the-art techniques. The rest of this paper is organized as follows. Our layout analysis system for historical European document images is described in Section 2. Performance evaluation and experimental results are discussed in Section 3, followed by a conclusion in Section 4.

2 HIGH PERFORMANCE LAYOUT ANALYSIS OF HISTORICAL EUROPEAN DOCUMENT IMAGES

The high performance layout analysis of historical European document images in this paper comprises the following main steps: binarization, and page segmentation, which includes text and non-text segmentation, main-body and side-notes segregation, and text-line extraction. More specifically, we propose a novel main-body and side-notes segregation technique and improve the OCRopus (OCRopus) based text-line extraction technique. In addition, we apply the percentile based binarization method (Afzal et al., 2014) of OCRopus and the text and non-text segmentation technique of Bukhari et al. (2011) to historical documents. For completeness, besides explaining our novel main-body and side-notes segregation technique and the improved version of the OCRopus based text-line extraction method, we also briefly describe the percentile based binarization method and the multiresolution morphology based text and non-text segmentation method (Bukhari et al., 2011). A brief description of these steps is provided here:

2.1 The Percentile based Binarization Method (Afzal et al., 2014)

In general, an image can be thresholded by determining a global threshold for the entire page (known as global binarization) or by using the statistics obtained from a local window centered on the pixel being thresholded (local binarization). The percentile based binarization method (Afzal et al., 2014) of OCRopus takes into consideration the background statistics based on percentile filters. In this method, text and non-text regions are treated equally for determining the threshold used for local binarization. It works well on both focused and defocused images, including monocular images with defocused parts. The method starts by estimating the background at each location in the image, i.e., a whole new image containing only the background is created using percentile filters. The threshold is thus adapted in accordance with the background properties of the image. The original image takes gray-level values in [0, 255], and a background value is estimated at every location, whereas the binarized output has only two levels. If the pixel value in the original image is less than t times the pixel value in the estimated background image, the corresponding pixel in the output image is labeled one; otherwise it is labeled zero. Here, t is the parameter that decides whether a pixel is foreground or background, depending on how similar the pixel is to the background estimated by the percentile filter.
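The thresholding rule described above can be illustrated with a minimal NumPy sketch. It is not the OCRopus implementation: the window size, the percentile, and the value of t are illustrative assumptions, and the function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def estimate_background(gray, window=15, percentile=80):
    """Local percentile filter: a bright percentile approximates the page background.

    `window` (odd) and `percentile` are assumed values, not taken from the paper.
    """
    pad = window // 2
    padded = np.pad(gray.astype(float), pad, mode="edge")
    windows = sliding_window_view(padded, (window, window))
    return np.percentile(windows, percentile, axis=(-2, -1))


def percentile_binarize(gray, window=15, percentile=80, t=0.8):
    """Label a pixel one (foreground) if it is darker than t times the local background."""
    background = estimate_background(gray, window, percentile)
    return (gray < t * background).astype(np.uint8)


# Synthetic page: illumination fades from left to right, with one dark stroke.
page = np.tile(np.linspace(250, 150, 40), (40, 1)).astype(np.uint8)
page[10:12, 5:35] = 40
binary = percentile_binarize(page)
```

Because the threshold follows the local background estimate, the stroke is labeled foreground on both the bright and the dim side of the page, which a single global threshold would not guarantee in general.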

2.2 The Multiresolution Morphology based Text and Non-text Segmentation (Bukhari et al., 2011)

Bloomberg (Bloomberg, 1991) presented a multires-

olution morphology based text and non-text segmen-

tation method. It is a simple and script independent

text and non-text segmentation method. It performs

well for halftone mask segmentation, for which it

was designed, but most of the time fails to accurately

segment drawing type non-text elements such as line

art, maps etc. Bukhari et al. (2011) presented an improved multiresolution morphology based text and non-text segmentation algorithm that

can handle halftones as well as drawing type non-

text elements. A sample document image and its text

and non-text segmentation results for the original and

the improved version of multiresolution morphology

based methods are shown in Figure 3.

2.3 The Improved Text-Line Extraction Method

The text-line extraction technique proposed here is a modified version of the text-line extraction method of OCRopus, which is called "ocropus-gpageseg". The OCRopus technique is briefly explained here.

It first estimates the "scale" of the text by finding the connected components of individual letters in the binary image and calculating the median of their dimensions. It then removes components that are too big or too small relative to this scale, since they are unlikely to be letters.
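As a rough illustration of the scale estimation step, the sketch below works on bounding boxes that are assumed to have been extracted already. Measuring a component by the square root of its box area, and the two size factors, are assumptions made for this example; they are not stated in the text above.

```python
from statistics import median


def estimate_scale(boxes):
    """Median component size; each box is (x, y, width, height).

    The size of a component is taken as sqrt(width * height), an assumed choice.
    """
    return median((w * h) ** 0.5 for (_, _, w, h) in boxes)


def keep_letter_sized(boxes, scale, min_factor=0.5, max_factor=4.0):
    """Drop components that are too small or too big relative to the scale.

    min_factor and max_factor are hypothetical values.
    """
    return [
        box for box in boxes
        if min_factor * scale <= max(box[2], box[3]) <= max_factor * scale
    ]


# Four letter-sized components, one speck of noise, and one large illustration.
boxes = [
    (10, 10, 10, 12), (25, 10, 9, 12), (40, 10, 11, 12), (55, 12, 10, 10),
    (70, 30, 1, 1), (100, 100, 200, 150),
]
scale = estimate_scale(boxes)
letters = keep_letter_sized(boxes, scale)
```

Using the median keeps the estimate stable even when a few very large or very small components are present, which is why the noise speck and the illustration do not distort the scale.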

In the baseline ocropus-gpageseg method, column separators in the binary image are found using convolution and thresholding. First, vertical white spaces in the binary image are found, and the remaining region is filled by filtering in order to form a smooth text region. Then, using Gaussian and uniform filtering, the column edges (gradients) are found in the binary image by setting a threshold in accordance with the scale of the image. The smoothed text region and the column edges are then combined to obtain column separators. In the next step, out of all column separators, only a limited number of separators whose dimensions exceed a minimum value are selected.

In order to find text lines, a box-map (bounding boxes) is first computed using two area thresholds. If the area of a slice in the slice list lies between the two thresholds, that slice is labeled one; otherwise it is labeled zero, which helps in removing noise. Then a clean image is obtained by multiplying the box-map and the given binary image element-wise, keeping only the desired text. On this cleaned image, the y-derivative of a Gaussian kernel is applied to detect the top and bottom edges of the remaining features. The response is then blurred horizontally to blend the tops of letters on the same line together; the same is done with the bottoms of the letters. The areas between top and bottom edges are blurred, treated as text-line regions, and termed line seeds.

Figure 3: Text and non-text segmentation methodology of Bukhari et al. (2011). The input image, on the left, is segmented into two images: one containing only text regions and the other containing only non-text regions.

Then column separators and line seeds are combined and used to segment the binary image. Basically, column separators restrict line seeds, i.e., they separate two lines horizontally.
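The edge detection step with the y-derivative of a Gaussian kernel, followed by horizontal blurring, can be sketched as follows. This is a simplified NumPy illustration and not the ocropus-gpageseg code; the values of sigma and blur_width and all function names are assumptions.

```python
import numpy as np


def gaussian_derivative_kernel(sigma):
    """Sampled first derivative of a Gaussian (1-D)."""
    radius = int(3 * sigma)
    y = np.arange(-radius, radius + 1)
    gauss = np.exp(-(y ** 2) / (2.0 * sigma ** 2))
    gauss /= gauss.sum()
    return -y / sigma ** 2 * gauss


def convolve_along(image, kernel, axis):
    """Convolve every row or column of the image with a 1-D kernel."""
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="same"), axis, image
    )


def edge_response(binary, sigma=2.0, blur_width=9):
    """Vertical derivative of the smoothed image, then a horizontal box blur.

    Positive values mark top edges of text, negative values mark bottom edges
    (rows are counted downwards, text pixels are 1).
    """
    grad = convolve_along(binary.astype(float), gaussian_derivative_kernel(sigma), axis=0)
    box = np.ones(blur_width) / blur_width
    return convolve_along(grad, box, axis=1)


# One synthetic text line in rows 10..19, with gaps between the "letters".
page = np.zeros((30, 50))
page[10:20, :] = 1
page[10:20, ::3] = 0
response = edge_response(page)
profile = response.mean(axis=1)
top_row, bottom_row = int(profile.argmax()), int(profile.argmin())
```

The horizontal blur gives a non-zero response even in the gap columns between letters, which is what allows the tops (and bottoms) of letters on the same line to merge into one continuous edge.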

In our modified version of the ocropus-gpageseg method, column separators in the binary image are found more accurately, using convolution and thresholding with tuned parameters and additional post-processing steps. These steps are the removal of two column separators that are too close to each other on the same horizontal line, and the extension of column separators to the first and last rows of the image, on the condition that no character is crossed on the extended path. As in the baseline, vertical white spaces in the binary image are found first, and the remaining region is labeled by filtering in order to form a smooth text region. Then, using Gaussian and uniform filtering, the column edges (gradients) are found in the binary image by setting a threshold in accordance with the scale of the image. The smoothed text region and the column edges are combined to obtain column separators, and out of all column separators only a limited number whose dimension exceeds a minimum value are selected. The column separators finally obtained are then combined with the initially obtained text region (from the white space method) in order to find more precise text-only regions in the binary image. All gaps and holes within the text regions are filled, which gives the final text-only regions. In the improved OCRopus text-line method, we focus on extracting the precise text-only regions which form a sentence, separating them from other text regions, such as side-notes, that lie too close to the main-body text regions, and then extracting text lines using the y-derivative of a Gaussian kernel and filtering. The result of the improved OCRopus text-line extraction method is shown in Figure 4.
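The separator extension step can be sketched as below. This is one possible reading of the condition that no character is crossed on the extended path: a separator is extended all the way in a direction only if the whole path in that direction is free of foreground pixels. The representation of a separator as a column index with a row range, and the function name, are hypothetical.

```python
import numpy as np


def extend_separator(binary, x, y_top, y_bottom):
    """Extend a vertical separator at column x towards the first and last rows.

    The extension in each direction is applied only if the path in that
    direction contains no foreground (text) pixels. This all-or-nothing rule
    is an assumption made for this sketch.
    """
    height = binary.shape[0]
    if not binary[:y_top, x].any():
        y_top = 0
    if not binary[y_bottom + 1:, x].any():
        y_bottom = height - 1
    return y_top, y_bottom


# A heading spans the top of the page, so a separator at column 10 cannot be
# extended upwards through it, while a separator at column 2 is unobstructed.
page = np.zeros((20, 20), dtype=np.uint8)
page[0:4, 5:16] = 1
blocked = extend_separator(page, x=10, y_top=8, y_bottom=12)
free = extend_separator(page, x=2, y_top=8, y_bottom=12)
```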

2.4 The Novel Main Body and Side Notes Segregation Technique

In this segregation technique, the main objective is the classification of text-only regions after segmenting the binary image of a European document into text and non-text. After removing the non-text regions and major noise content from the binary image, the image is smoothed in order to label the text regions and form a blob over each of them, by finding vertical white spaces and applying Gaussian and uniform filtering. The blobs are formed over the text regions in such a way that text-lines which are not part of the same sentence but appear too close to each other are also separated. Among all the blobs formed over text regions, those narrower than an adaptive threshold width are classified as side notes and the rest are labeled as main-body text regions. Blobs that are too small, i.e., below an adaptive threshold defined with respect to the median height of the character connected components, are considered noise and removed. The result of the main-body and side-notes segregation method is shown in Figure 4.

Figure 4: (a) Binarized European document; (b) main-body and side-notes segregated document; (c) improved OCRopus based text-line segmented document.
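The classification rule for blobs can be sketched in a few lines. How the adaptive thresholds are computed is not specified above, so this example assumes that the width threshold is a fraction of the widest blob and that the noise threshold is a multiple of the median character height; both factors and the function name are hypothetical.

```python
def segregate_blobs(blobs, median_char_height, width_fraction=0.5, noise_factor=1.0):
    """Split text blobs into main-body and side-note blobs.

    Each blob is (x, y, width, height). Blobs lower than
    noise_factor * median_char_height are discarded as noise; the rest are
    side notes if narrower than width_fraction * (widest blob), else main body.
    """
    kept = [b for b in blobs if b[3] >= noise_factor * median_char_height]
    if not kept:
        return [], []
    width_threshold = width_fraction * max(b[2] for b in kept)
    main_body = [b for b in kept if b[2] >= width_threshold]
    side_notes = [b for b in kept if b[2] < width_threshold]
    return main_body, side_notes


# One wide text column, one narrow marginal note, and one tiny noise blob.
blobs = [(100, 50, 600, 900), (720, 120, 150, 200), (300, 980, 4, 3)]
main_body, side_notes = segregate_blobs(blobs, median_char_height=12)
```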

3 PERFORMANCE EVALUATION

The 15th century novel "Narrenschiff" is part of the German government funded project Kallimachos (Kallimachos). For the performance evaluation of the proposed layout analysis techniques, we have selected a subset of 50 images from one of the Latin novels in the Kallimachos project. Sample document images are shown in Figure 1. These images contain both text and non-text regions, as well as main body and side notes within text regions. For this dataset, ground truths for text and non-text segmentation, main-body and side-notes segregation, and text-line extraction, containing both text and non-text regions, are prepared in color-coded pixel form, as shown in Figures 6, 7 and 8. The images have a variety of both single and multi-column layouts, and hence they can be used to evaluate the performance of layout analysis algorithms for European document images. Below, the performance evaluation of the presented layout analysis techniques is done in three parts. The first part evaluates the performance of text and non-text segmentation, the second part analyzes the errors made in main-body and side-notes segregation, and the third part evaluates the overall accuracy of the text-line extraction technique.

Figure 5: Complete Methodology.

3.1 Text and Non-text Segmentation

As stated above, our dataset contains 50 historical

document images. We test the performance of our ap-

proach using images with different writing styles and

layout structures which were not used for training.

Figure 6: Text & Non-Text Ground Truth Generation.

Pixel-level ground truth has been generated by manually assigning text in the documents of the testing set to one of two classes, main-body or side-notes text. Several methods to measure segmentation accuracy have been reported in the literature. We evaluate the segmentation accuracy by adopting the F-measure metric, which combines precision and recall values into a single scalar. It is high only when both values are high (conservative), in contrast to the average (tolerant), which does not have this property. For example, when precision and recall both equal one, the average and the F-measure are both one; but if the precision is one and the recall is zero, the average is 0.5 while the F-measure is zero. Therefore, this measure has been adopted, as it reliably measures the segmentation accuracy. Precision and recall are estimated according to Eq. 1 and Eq. 2, respectively.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{1}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{2}$$

where True-Positive (TP), False-Positive (FP) and False-Negative (FN), with respect to side-notes, are defined as follows:

• TP: side-notes text classified as side-notes text.

• FP: main-body text classified as side-notes text.

• FN: side-notes text classified as main-body text.

Likewise, these metrics can also be deﬁned with

respect to main-body text. Once we have the precision

and recall counts, F-measure is calculated according

to Eq. 3.

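$$F\text{-}\mathrm{Measure} = \frac{(1+\beta^{2}) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^{2} \cdot \mathrm{Recall} + \mathrm{Precision}} \tag{3}$$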

Assigning β = 1 induces equal emphasis of pre-

cision and recall on F-measure estimation. The F-

Measure accuracies are shown in Table 1.
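The contrast between the F-measure and the plain average, discussed above, can be checked with a short worked example for β = 1. The function names are hypothetical, and the counts in the last call are made-up illustration values, not results from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from pixel counts (Eq. 1 and Eq. 2)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_measure(precision, recall):
    """F-measure with beta = 1, i.e. the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision one, recall zero: the average is tolerant, the F-measure is not.
average = (1.0 + 0.0) / 2
harmonic = f_measure(1.0, 0.0)

# Illustrative counts only: 8 true positives, 2 false positives, 2 false negatives.
p, r = precision_recall(tp=8, fp=2, fn=2)
f1 = f_measure(p, r)
```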

F-measure for both main-body and side-notes text

with different postprocessing window sizes is shown

in Table 1. Note that the optimal window size is 100.

Table 1: Performance of the text and non-text extraction method, measured by F-measure for text and non-text regions.

Text and Non-Text Segmentation
Main-Body F-Measure (%)     99.433
Side-Notes F-Measure (%)    99.6683

3.2 Main Body and Side Note Segregation

The performance evaluation metrics for main-body and side-notes segregation accuracy are based on the F-measure calculation described in the previous section. The F-Measure accuracies are shown in Table 2.

Table 2: Performance of the main-body and side-notes segregation method, measured by F-measure for main-body and side-notes text regions.

Main-Body and Side-Notes Segregation
Main-Body F-Measure (%)     99.7646
Side-Notes F-Measure (%)    80.4962

3.3 Text-line Extraction

The ground-truth images for evaluating the text-line extraction technique are created manually by pixel coloring.

Figure 7: Ground-Truth Generation for Main & Side Body Segregation.

Figure 8: Ground-Truth Generation for Text-Line Extraction.

The performance evaluation metrics for text-line detection accuracy are defined in (Shafait et al., 2008), where a text-line is said to be correctly detected if it does not fall into any of the following categories of errors: over-segmentation, under-segmentation, missed text-lines, and false alarms. Let $N_g$ be the number of ground-truth text-lines, $N_s$ the number of segmented text-lines, and $N_{o2o}$ the number of one-to-one correctly detected text-lines. The one-to-one text-line detection accuracy is given by Eq. 4.

$$P_{o2o}\,\% = \frac{N_{o2o}}{N_g} \tag{4}$$
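The one-to-one accuracy can be illustrated with a simplified sketch. It assumes that the significant overlaps between ground-truth lines and segmented lines have already been determined, and it is a stand-in for, not an implementation of, the full evaluation procedure of Shafait et al. (2008). The input format and the function name are hypothetical.

```python
def one_to_one_accuracy(gt_to_segments):
    """Percentage of ground-truth lines detected one-to-one.

    gt_to_segments maps each ground-truth line id to the list of segmented
    line ids that overlap it significantly (assumed to be precomputed).
    A ground-truth line counts as correct if it overlaps exactly one segment
    and that segment overlaps no other ground-truth line.
    """
    segment_to_gt = {}
    for gt, segments in gt_to_segments.items():
        for seg in segments:
            segment_to_gt.setdefault(seg, []).append(gt)
    n_o2o = sum(
        1
        for segments in gt_to_segments.values()
        if len(segments) == 1 and len(segment_to_gt[segments[0]]) == 1
    )
    return 100.0 * n_o2o / len(gt_to_segments)


overlaps = {
    "g1": ["s1"],        # one-to-one match
    "g2": ["s2", "s3"],  # over-segmentation
    "g3": ["s4"],        # under-segmentation: s4 also covers g4
    "g4": ["s4"],
    "g5": [],            # missed text-line
}
accuracy = one_to_one_accuracy(overlaps)
```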

For the modified text-line extraction method on the European dataset, we achieved a performance gain from 72.18% to 94.53% after text and non-text segmentation, as shown in Table 3.

Table 3: Performance of the improved text-line extraction method, based on the performance evaluation metrics for text-line detection accuracy defined in (Shafait et al., 2008).

Technique                     Accuracy (%)
OCRopus-gpageseg              72.177338
Improved OCRopus-gpageseg     94.530014

4 CONCLUSION

In this paper, we have presented a high performance layout analysis system for historical European document images, which are composed of a variety of single and multi-column layouts. The presented system is a suitable combination of well-established and robust text and non-text segmentation, novel main-body and side-notes segregation, and text-line extraction methods. We have evaluated it on a dataset of 50 document images from a 15th century Latin script historical novel from the Kallimachos project (Kallimachos), which cover different layout structures containing both text and non-text regions, as shown in Figure 1. For text and non-text segmentation, the multiresolution morphology based method (Bukhari et al., 2011) is used, with which we achieved above 99% text and non-text segmentation accuracy on the dataset. For main-body and side-notes segregation, the method explained in Section 2.4 achieved 99% main-body segregation accuracy and above 80% side-notes accuracy on the dataset. For text-line extraction, the improved version of the OCRopus based text-line extraction method described in Section 2.3 is used; it achieved above 94% text-line extraction accuracy on the dataset, which is better than the accuracy of above 72% of the OCRopus-gpageseg method. Altogether, the presented layout analysis system showed good performance for text and non-text segmentation, main-body and side-notes segregation, and text-line extraction on a variety of European document images, and it can be used for large-scale digitization of European documents.

ACKNOWLEDGEMENTS

This work was partially funded by the BMBF (Ger-

man Federal Ministry of Education and Research),

project Kallimachos (01UG1415C).

REFERENCES

Afzal, M. Z., Krämer, M., Bukhari, S. S., Yousefi, M. R., Shafait, F., and Breuel, T. M. (2014). Robust binarization of stereo and monocular document images using percentile filter. In Revised Selected Papers of the International Workshop on Camera-Based Document Analysis and Recognition - Volume 8357, pages 139–149, New York, NY, USA.

Baird, H. S. (1994). Background structure in document images. In Document Image Analysis, H. Bunke, P. Wang, and H. S. Baird, Eds. World Scientific, Singapore.

Baird, H. S. (2002). Two geometric algorithms for layout analysis. In Proc. Workshop on Document Analysis Systems.

Bloomberg, D. S. (1991). Multiresolution morphological approach to document image analysis. In Proc. International Conference on Document Analysis and Recognition, France.

Bukhari, S. S., Shafait, F., and Breuel, T. (2010). Document image segmentation using discriminative learning over connected components. In Proc. Workshop on Document Analysis Systems, Boston, US.

Bukhari, S. S., Shafait, F., and Breuel, T. (2011). Improved document image segmentation algorithm using multiresolution morphology. In SPIE, Document Recognition and Retrieval XVIII.

Cattoni, R., Coianiz, T., Messelodi, S., and Modena, C. M. (1998). Geometric layout analysis techniques for document image understanding: a review. In IRST, Trento, Italy, Tech. Rep.

Kallimachos. www.kallimachos.de.

Kise, K., Sato, A., and Iwata, M. (1998). Segmentation of page images using the area Voronoi diagram. In Computer Vision and Image Understanding.

Kumar, K. S., Kumar, S., and Jawahar, C. (2007). On segmentation of documents in complex scripts. In Proc. International Conference on Document Analysis and Recognition.

Nagy, G. (2000). Twenty years of document image analysis in PAMI. In IEEE Trans. on Pattern Analysis and Machine Intelligence.

Nagy, G., Seth, S., and Viswanathan, M. (1992). A prototype document image analysis system for technical journals. In Computer, vol. 7, no. 2.

OCRopus. https://github.com/tmbdev/ocropy.

O'Gorman, L. (1993). The document spectrum for page layout analysis. In IEEE Trans. on Pattern Analysis and Machine Intelligence.

Shafait, F., Keysers, D., and Breuel, T. M. (2008). Performance evaluation and benchmarking of six page segmentation algorithms. In IEEE Trans. on Pattern Analysis and Machine Intelligence.

Wong, K. Y., Casey, R. G., and Wahl, F. M. (1982). Document analysis system. In IBM Journal of Research and Development.
