the page segmentation task can be successfully solved
even with a small amount of data. We also present an
approach to automatically create artificial pages that
can be used for data augmentation.
2 RELATED WORK
The issue of locating text in document images has
a long history dating back to the late 1970s, when
OCR systems required individual characters to be
extracted. "In order to let character recognition
work, it is mandatory to apply layout analysis including
page segmentation" (Kise, 2014). Today, there is
also a need for extracting images from pages. The
extracted images can then be processed further, for
example to enable image search.
This section first summarizes recent methods
for page segmentation and then it provides a short
overview of available datasets.
2.1 Methods
Numerous methods exist for page segmentation.
They can be divided into top-down and bottom-up
categories. Historically, the segmentation problem
was usually solved by conservative approaches based
on simple image operations and on connected
component analysis. The recent trend is to use
neural networks for this task.
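As a minimal illustration of the bottom-up, connected-component style of segmentation mentioned above, the following sketch labels 4-connected foreground regions of an already binarized page and applies a crude area threshold to separate text from graphics. The 4-connectivity, the threshold, and the toy input are illustrative assumptions, not details taken from any cited method.

```python
# Sketch of bottom-up segmentation via connected components.
# Assumption: the page is already binarized into a 2D grid of 0/1 values.

from collections import deque

def connected_components(grid):
    """Label 4-connected foreground (value 1) regions; return a list of pixel lists."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if grid[y][x] == 1 and not seen[y][x]:
                comp, queue = [], deque([(y, x)])
                seen[y][x] = True
                while queue:  # breadth-first flood fill of one component
                    cy, cx = queue.popleft()
                    comp.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(comp)
    return components

def classify(components, area_threshold=6):
    """Crude size-based labeling: small blobs -> 'text', large blobs -> 'graphics'."""
    return ["text" if len(c) < area_threshold else "graphics" for c in components]
```

On a toy binary page with a two-pixel blob and a 3x3 block, `connected_components` finds the two regions and `classify` labels them `"text"` and `"graphics"` respectively; real methods of course use far richer component features than raw area.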
A bottom-up method for segmenting pages using
connected components is presented in (Drivas and
Amin, 1995). The method includes digitization,
rotation correction, segmentation, and classification
into text and graphics classes. Another approach,
based on background thinning and independent of
page rotation, is presented in (Kise et al., 1996).
These conservative methods usually fail on
handwritten document images because their degraded
quality makes the pages hard to binarize. It is also
hard to extract individual characters since they are
usually connected. These problems are successfully
addressed by approaches based on convolutional
neural networks (CNNs), which have brought
significant improvements in many visual tasks. An
example of a CNN for page segmentation of historical
document images is presented in (Chen et al., 2017).
Briefly, superpixels (groups of pixels with similar
characteristics) are found in the image and classified
by a network that takes a 28x28 pixel input; the
resulting class is then assigned to the whole superpixel.
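The superpixel-wise labeling described above can be sketched as follows. Here `classify_patch` is a hypothetical stand-in for the 28x28-input CNN (a simple mean-intensity threshold), and the patch size, threshold, and class names are illustrative assumptions rather than details from (Chen et al., 2017).

```python
# Sketch of superpixel-wise classification: one patch is classified per
# superpixel, and the predicted class is assigned to all of its pixels.

def crop_patch(image, cy, cx, size=28):
    """Crop a size x size patch centered at (cy, cx), clamped to the image."""
    h, w = len(image), len(image[0])
    half = size // 2
    y0, y1 = max(0, cy - half), min(h, cy + half)
    x0, x1 = max(0, cx - half), min(w, cx + half)
    return [row[x0:x1] for row in image[y0:y1]]

def classify_patch(patch):
    """Hypothetical stand-in for the CNN: dark patches -> 'text'."""
    pixels = [v for row in patch for v in row]
    return "text" if sum(pixels) / len(pixels) < 128 else "background"

def label_superpixels(image, superpixels, patch_size=28):
    """Classify one patch per superpixel (at its centroid) and spread the
    predicted class to every pixel of that superpixel."""
    labels = {}
    for sp in superpixels:  # each superpixel is a list of (y, x) coordinates
        cy = sum(y for y, _ in sp) // len(sp)
        cx = sum(x for _, x in sp) // len(sp)
        cls = classify_patch(crop_patch(image, cy, cx, patch_size))
        for pixel in sp:
            labels[pixel] = cls
    return labels
```

The design point this illustrates is the cost model: only one network forward pass is needed per superpixel instead of one per pixel, at the price of a blocky, superpixel-resolution output.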
Alternatively, every pixel could be classified
separately in a sliding-window manner. The problem
is computational inefficiency: a large amount of
computation is repeated as the window moves pixel
by pixel. This problem is solved by fully convolutional
networks (FCNs) such as the well-known U-Net
(Ronneberger et al., 2015). U-Net was initially
designed for biomedical image segmentation but can
be applied to many other segmentation tasks,
including page segmentation. Another FCN
architecture, proposed for page segmentation of
historical document images, is presented in (Wick
and Puppe, 2018). In contrast with U-Net, it does not
use skip connections, and it uses a transposed
convolutional layer instead of an upsampling layer
followed by a convolutional layer. Its speed
improvement is achieved mainly thanks to the small
input of 260x390 pixels. There are also networks
designed to achieve the best results in competitions,
such as (Xu et al., 2017); this network operates at the
original image resolution and provides much more
detail in its outputs.
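The efficiency argument above can be made concrete with a small sketch: scoring every k x k window with the same linear filter, as a naive sliding-window classifier would, yields exactly the score map of one "valid" convolution, which an FCN computes in a single shared pass over the image. The filter and input values below are illustrative, not taken from any cited architecture.

```python
# Sliding-window scoring vs. one dense convolution pass.
# score_patch plays the role of a per-window classifier; conv2d_valid
# produces the identical score map for all windows at once.

def score_patch(image, y, x, kernel):
    """Dot product of one k x k window with the filter (per-window score)."""
    k = len(kernel)
    return sum(image[y + i][x + j] * kernel[i][j]
               for i in range(k) for j in range(k))

def conv2d_valid(image, kernel):
    """One dense pass: the score of every valid window position."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    return [[score_patch(image, y, x, kernel)
             for x in range(w - k + 1)] for y in range(h - k + 1)]
```

In a real FCN the savings come from sharing the intermediate feature maps of many stacked layers across overlapping windows, rather than recomputing them once per pixel position as the sliding-window approach does.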
2.2 Datasets
Many architectures solve the segmentation problem
very well. The main challenge is obtaining suitable
training data, since appropriate data are crucial for
approaches based on neural networks. Several
datasets exist for a wide range of tasks; unfortunately,
a significant number of them are either inappropriate
for our task or not publicly available.
DIVA-HisDB (Simistira et al., 2016) is a publicly
available dataset with detailed ground truth for text,
comments, and decorations. It consists of three
manuscripts with 50 high-resolution pages each. The
manuscripts have similar layout features. The first
two come from the 11th century and are written in
Latin using the Carolingian minuscule script. The
third manuscript is from the 14th century and shows
a chancery script; its languages are Italian and Latin.
Unfortunately, the pages contain no images.
Handwritten historical manuscript images are
available in the IAM-HistDB repository (Fischer
et al., 2010) together with ground truth for
handwriting recognition systems. It currently
includes three datasets: the Saint Gall Database, the
Parzival Database, and the Washington Database.
The Saint Gall Database (Fischer et al., 2011)
contains 60 page images of a handwritten historical
manuscript from the 9th century, written in Latin
using Carolingian script. 47 page images of a
handwritten historical manuscript from the 13th
century are available in the Parzival Database
(Fischer et al., 2012). This manuscript is written in
Medieval German using Gothic script. The
Washington Database (Fischer et al., 2012) is created
from the George Washington
ChronSeg: Novel Dataset for Segmentation of Handwritten Historical Chronicles