Extraction of Homogeneous Regions in Historical Document Images
Maroua Mehri 1,2, Pierre Héroux 2, Nabil Sliti 3, Petra Gomez-Krämer 1, Najoua Essoukri Ben Amara 3 and Rémy Mullot 2
1 L3i, University of La Rochelle, Avenue Michel Crépeau, 17042, La Rochelle, France
2 LITIS, University of Rouen, Avenue de l'Université, 76800, Saint-Etienne-du-Rouvray, France
3 SAGE, University of Sousse, École Nationale d'Ingénieurs de Sousse, 4023, Sousse, Tunisia
Keywords: Historical Document Images, Segmentation, SLIC Superpixels, Gabor Filters, Multi-Scale Analysis, ARLSA.
Abstract: To ensure the indexing and retrieval of digitized resources and to offer structured access to large sets of cultural heritage documents, a rising interest in historical document image segmentation has emerged. Indeed, there is a real need for automatic algorithms that identify homogeneous regions, i.e. groups of pixels sharing some visual characteristics, in historical documents (distinguishing graphic types, segmenting graphical regions from textual ones, and discriminating text in a variety of fonts and scales). Determining graphic regions can help to segment and analyze the graphical part of historical heritage, while finding text zones can be used as a pre-processing stage for character recognition, text line extraction, handwriting recognition, etc. Thus, we propose in this article an automatic segmentation method for historical document images based on the extraction of homogeneous or similar content regions. The proposed algorithm relies on simple linear iterative clustering (SLIC) superpixels, Gabor filters, multi-scale analysis, a majority voting technique, connected component analysis, color layer separation, and an adaptive run-length smoothing algorithm (ARLSA). It has been evaluated on 1000 pages of historical documents and achieved promising results.
1 INTRODUCTION
The idea of conducting digitization programs for cultural heritage documents has emerged since the early 1960s. The primary goals of these digitization programs, which were related to the tremendous growth and spread of Internet technologies, were not clearly identified (e.g. providing digital copies of historical document images (HDIs), sharing databases of document images between many libraries, designing computer-assisted tools for textual data handling, etc.). Nevertheless, the rapid growth of digital libraries has become a serious hindrance to the efficient and effective management of these cultural heritage resources (i.e. quick and relevant access to the information contained therein), due to the huge amount of high-quality digital reproductions of fragile books and digital copies of rare collections. To meet the need to reinforce the enrichment and exploitation of heritage documents, in addition to making them electronically available via the Internet, many research projects have been set up with the support of public funding provided by European and American governments. The main goals of these projects are to provide computer-based access to and analysis of cultural heritage documents, searchable and browsable HDI databases, and automatic semantic-based indexing, linking and retrieval systems for HDIs (Coustaty et al., 2011).
In this work, we are interested in historical document image layout analysis (HDILA). HDILA starts by segmenting a document in order to find and classify homogeneous regions or zones, such as graphic and textual regions (Okun and Pietikäinen, 1999). Finding graphic regions can be used to analyze the graphical part of historical heritage, while determining text zones can be used as a pre-processing stage for handwriting recognition, etc. Our goal consists of identifying homogeneous regions or similar groups of pixels sharing some visual characteristics by labeling and grouping pixels from HDIs. We aim to characterize each digitized page of a historical book with a set of homogeneous or similar content regions and
their topological relationships, in order to describe the document layout and content by defining one or more signatures for each digitized page. Therefore, an automatic segmentation method for HDIs is proposed in this article. The proposed algorithm is based on simple linear iterative clustering (SLIC) superpixels (Liu et al., 2011), Gabor filters (Gabor, 1946), k-means clustering (MacQueen, 1967), multi-scale analysis (Li et al., 2000), a majority voting technique (Lam and Suen, 1997), connected component (CC) analysis (Rosenfeld and Pfaltz, 1966), color layer separation, and an adaptive run-length smoothing algorithm (ARLSA) for the extraction of homogeneous or similar content regions in HDIs.
The remainder of this article is organized as follows: Section 2 details the proposed segmentation algorithm for HDIs based on the extraction of homogeneous or similar content regions. Section 3 describes the experimental protocol by presenting the experimental corpus and the defined ground truth. To evaluate the performance of the proposed algorithm and to validate the choice of the techniques used at each step, a set of experiments on a large variety of HDIs is detailed in Section 4. An assessment of the different steps of the algorithm is then presented and an analysis of the obtained results is discussed. Qualitative results are also given to demonstrate the segmentation quality. Our conclusions and future work are presented in Section 5.
2 PROPOSED METHOD
To ensure a relevant segmentation of graphical regions from textual ones, an efficient discrimination of text in a variety of fonts and scales, and a robust distinction between different graphic types in HDIs, an automatic segmentation method for HDIs based on the extraction of homogeneous or similar content regions is proposed in this article. The proposed algorithm segments the content of digitized HDIs. In particular, it discriminates between the different classes of the foreground layers of a digitized document based on textural and topological descriptors. First, an HDI is fed as input and read as a gray-scale image. The extraction of texture information is processed on gray-scale document images without introducing a binarization step, because binarization causes a loss of information, specifically textural information. Then, to obtain an enhanced background for noisy HDIs and to reduce the complexity of a pixel-based segmentation method, a foreground-background segmentation task based on SLIC superpixel segmentation and the k-means clustering algorithm is performed. Afterwards, texture descriptors are automatically extracted from the foreground superpixels using a multi-scale approach. The extracted textural features are then used in an unsupervised clustering approach, and pixels are labeled with respect to the results of the superpixel clustering phase. Then, to refine the pixel labeling results, a first post-processing step, "Post-processing 1", is introduced by taking into consideration the topological relationships between pixels and integrating a spatial multi-scale analysis of majority votes. Finally, a second post-processing step, "Post-processing 2", is added, based on the multi-scale analysis, the majority voting technique, CC analysis, color layer separation, and the ARLSA, to identify the homogeneous or similar content regions that are characterized by similar properties.
The proposed algorithm does not require a priori knowledge of the document structure/layout, the typographical parameters, or the graphical properties of the document image. In addition, it is fast, since the SLIC superpixel technique is used instead of a rigid pixel grid for feature extraction and pixel-level processing for segmentation, localization, and classification. The superpixel approach has the advantage of being faster and more memory efficient, and computing image features at each superpixel center is more efficient than at each image pixel (Achanta et al., 2012). Figure 1 illustrates the detailed schematic block representation of the proposed algorithm.
Figure 1: Flowchart of the proposed algorithm for extraction of homogeneous regions in HDIs: gray-scale conversion, foreground/background segmentation via superpixel clustering, background superpixel processing yielding an enhanced document image, border replication and Gabor filter application, Gabor feature extraction, pixel labeling, Post-processing 1, binarization, Post-processing 2, and extraction and labeling of homogeneous regions.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
48
2.1 Foreground-Background Segmentation
In the proposed algorithm, the SLIC superpixel approach groups pixels sharing similar characteristics or properties (e.g. texture cues, contour, color, etc.) into significant polygon-shaped regions (Achanta et al., 2012). By setting the number of superpixels k_s equal to 0.01% of the image pixels, an over-segmented image representing a compact content map is generated. Afterwards, the background and foreground superpixels are classified by computing the mean gray-level value of each superpixel, determined by averaging over all the gray-level pixels belonging to the superpixel region, and by applying the k-means algorithm. To segment the image into two layers (i.e. foreground and background), the k-means algorithm is performed on the computed mean gray-level values of the superpixels, without taking the image spatial coordinates into account, with the number of clusters k_c set to 2. One cluster represents the background (cf. Figure 3(b)) and the other represents the foreground (e.g. noise, text fields, drawings, etc.) (cf. Figure 3(c)).
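A minimal sketch of this foreground/background separation is given below, assuming scikit-image (version 0.19 or later) and scikit-learn are available; the helper name, the file name and the "brightest cluster is the background" rule are illustrative assumptions, not the authors' implementation.

import numpy as np
from scipy import ndimage as ndi
from skimage import io, img_as_float
from skimage.segmentation import slic
from sklearn.cluster import KMeans

def split_foreground_background(gray, superpixel_ratio=1e-4):
    """Return the SLIC label map and a boolean flag per superpixel (True = foreground)."""
    n_segments = max(2, int(gray.size * superpixel_ratio))   # k_s = 0.01% of the pixels
    segments = slic(gray, n_segments=n_segments, compactness=10,
                    start_label=0, channel_axis=None)
    # Mean gray level of every superpixel (no spatial coordinates are used).
    means = ndi.mean(gray, labels=segments, index=np.arange(segments.max() + 1))
    # Two-class k-means (k_c = 2) on the mean gray levels only.
    clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
        means.reshape(-1, 1))
    background_cluster = clusters[np.argmax(means)]  # brightest superpixel -> background
    return segments, clusters != background_cluster

gray = img_as_float(io.imread("page.tiff", as_gray=True))    # hypothetical input page
segments, is_foreground = split_foreground_background(gray)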
2.2 Document Enhancement
Once the foreground-background segmentation step has been carried out, only the background superpixels of the original gray-level image are processed: a white value (i.e. a gray level of 255) is assigned to their centers and to the pixels belonging to them. The gray-level values of the foreground superpixels and of their pixels remain unchanged. Thus, an enhanced, noise-free background is obtained. Figure 3(d) illustrates an example of an image enhanced by the superpixel technique, with a clean background.
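A sketch of this enhancement step, reusing the hypothetical outputs of the previous snippet:

# Background superpixels (and all of their pixels) are set to white,
# foreground superpixels are left untouched.
enhanced = gray.copy()
enhanced[~is_foreground[segments]] = 1.0   # white (255 for an 8-bit image)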
2.3 Gabor Feature Extraction
In our previous work, some well-known texture-based approaches (the auto-correlation function, the gray-level co-occurrence matrix, and Gabor filters) were compared for ancient document image segmentation (Mehri et al., 2013). We concluded that Gabor features perform better than auto-correlation and co-occurrence features for font segmentation and for distinguishing textual regions from graphical ones. Thus, once the document has been enhanced, Gabor filters are applied to it with the same default parameters proposed in (Mehri et al., 2013). A quick and easy way to extract Gabor features over the whole Gabor-transformed image is to introduce a border replication step before the Gabor feature extraction task. Using rectangular overlapping processing windows, Gabor descriptors are extracted only from the selected foreground superpixels of the transformed image, at four different sliding-window sizes so as to adopt a multi-scale approach. For each analyzed sliding window, a feature vector of dimension 48 (representing 24 Gabor filters) is produced from the mean and standard deviation of the magnitude response of the transformed image. A 192-dimensional feature vector (48 Gabor indices × 4 sliding-window sizes) is subsequently formed from the four sliding-window sizes.
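The following sketch outlines this multi-scale Gabor descriptor, assuming scikit-image's gabor filter; the frequencies, orientations and window sizes are illustrative placeholders for the 24-filter bank and the four window sizes of (Mehri et al., 2013), not the actual parameters.

import numpy as np
from skimage.filters import gabor

FREQUENCIES = (0.05, 0.10, 0.20, 0.40)            # assumed: 4 frequencies
THETAS = tuple(k * np.pi / 6 for k in range(6))   # assumed: 6 orientations -> 24 filters
WINDOWS = (16, 32, 64, 128)                       # assumed: 4 sliding-window sizes

def gabor_magnitude_bank(image):
    """Magnitude responses of the whole bank, computed on a border-replicated image."""
    pad = max(WINDOWS) // 2                       # border replication before filtering
    padded = np.pad(image, pad, mode="edge")
    mags = [np.hypot(*gabor(padded, frequency=f, theta=t))
            for f in FREQUENCIES for t in THETAS]
    return mags, pad

def gabor_descriptor(mags, pad, row, col):
    """192-dim vector: mean and std of each magnitude in each of the 4 windows."""
    r, c = row + pad, col + pad
    feats = []
    for w in WINDOWS:
        h = w // 2
        for m in mags:
            patch = m[r - h:r + h, c - h:c + h]
            feats.extend((patch.mean(), patch.std()))
    return np.asarray(feats)                      # 24 filters x 2 stats x 4 windows = 192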
2.4 Foreground Superpixel Clustering and Foreground Pixel Labeling
A foreground superpixel clustering task is performed by partitioning the Gabor-based feature sets into compact and well-separated clusters in the feature space, in order to segment graphical regions from textual ones, discriminate text in a variety of fonts and scales, and distinguish between different graphic types in HDIs. The foreground superpixel clustering task does not include spatial information and is performed using the k-means algorithm. Then, the gray-level foreground superpixels, and the gray-level pixels belonging to each superpixel in the enhanced document image, are labeled with respect to the results of the superpixel clustering phase. Once the clustering and labeling phases of the proposed algorithm have been performed, a pixel-labeled document image is obtained (cf. Figure 3(f)).
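A sketch of this clustering and labeling phase, under the same assumptions as the previous snippets: `features` is supposed to hold one 192-dimensional Gabor descriptor per foreground superpixel (in the order of np.flatnonzero(is_foreground)), and n_types is the number of content types.

import numpy as np
from sklearn.cluster import KMeans

def cluster_and_label(segments, is_foreground, features, n_types):
    fg_ids = np.flatnonzero(is_foreground)
    kmeans = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit(features)
    # Superpixel labels: -1 for background, cluster index for the foreground.
    sp_labels = np.full(segments.max() + 1, -1, dtype=int)
    sp_labels[fg_ids] = kmeans.labels_
    # Every pixel inherits the label of the superpixel it belongs to.
    return sp_labels[segments], sp_labels, kmeans.cluster_centers_

pixel_labels, sp_labels, centroids = cluster_and_label(
    segments, is_foreground, features, n_types=2)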
2.5 Post-processing 1
To refine the pixel labeling results, many researchers have introduced the spatial relationships between pixels, which are not considered when the texture features are analyzed. This has the advantage of dealing with the non-smooth boundaries caused by extracting texture features from small pre-defined windows (Chang and Kuo, 1992). In our algorithm, a first post-processing step, "Post-processing 1", is introduced by taking into consideration the topological relationships between the selected foreground superpixels and integrating a spatial multi-scale analysis of majority votes. First, the Euclidean distance
ExtractionofHomogeneousRegionsinHistoricalDocumentImages
49
between each foreground superpixel and the centroid of the cluster it belongs to is computed. Then, the foreground superpixels are sorted in descending order of this distance, so that the first processed foreground superpixel is the one with the highest Euclidean distance. The higher the computed Euclidean distance, the higher the probability that the foreground superpixel is improperly labeled, since it is far from the centroid of its cluster. Thus, the foreground superpixels with the highest Euclidean distances are processed first by a multi-scale majority voting technique. By performing a multi-scale approach in the majority voting technique, small isolated groups of superpixels are removed. Indeed, a local decision on the label of each selected foreground superpixel is taken using the majority of the superpixel and pixel labels belonging to it, computed at the same four pre-defined sliding-window sizes as in the Gabor feature extraction step. If the processed foreground superpixel receives a new label, the pixels belonging to it receive the same new label. Afterwards, the next processed foreground superpixel is the one with the next smaller Euclidean distance. The labels of the foreground superpixels and of the pixels belonging to them are updated at each run of the multi-scale majority voting technique to ensure a relevant refinement of the pixel labeling results. Once the first post-processing step of the proposed algorithm, "Post-processing 1", has been performed, a post-processed 1 pixel-labeled document image is obtained (cf. Figure 3(g)).
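A simplified sketch of "Post-processing 1", reusing the hypothetical variables of the earlier snippets: superpixels farthest from their cluster centroid are revisited first and relabeled by a majority vote of the foreground pixel labels inside windows of the four analysis sizes. The real method also votes over neighboring superpixel labels, which is omitted here.

import numpy as np

def post_process_1(pixel_labels, sp_labels, segments, sp_centers,
                   features, centroids, windows=(16, 32, 64, 128)):
    labels = pixel_labels.copy()
    fg_ids = np.flatnonzero(sp_labels >= 0)
    # Euclidean distance of each foreground superpixel to its cluster centroid.
    dists = np.linalg.norm(features - centroids[sp_labels[fg_ids]], axis=1)
    for sp in fg_ids[np.argsort(dists)[::-1]]:           # largest distance first
        r, c = map(int, sp_centers[sp])
        votes = []
        for w in windows:                                 # multi-scale majority vote
            h = w // 2
            patch = labels[max(r - h, 0):r + h, max(c - h, 0):c + h]
            votes.append(patch[patch >= 0].ravel())
        votes = np.concatenate(votes)
        if votes.size:
            new_label = int(np.bincount(votes).argmax())
            labels[segments == sp] = new_label            # update the pixels as well
            sp_labels[sp] = new_label
    return labels

# sp_centers: integer (row, col) center of each superpixel, e.g. rounded
# centroids obtained from skimage.measure.regionprops(segments + 1).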
2.6 Post-processing 2
As shown in the proposed algorithm (cf. Figure 1), our goal is to find homogeneous regions defined by common characteristics or similar texture features as easily, quickly, and automatically as possible. After the first post-processing step, "Post-processing 1", the output is a post-processed 1 pixel-labeled document image. Nevertheless, at this stage we still need to identify groups of pixels sharing common characteristics or similar textural properties in order to extract homogeneous regions (i.e. to partition text into columns, paragraphs, lines or words, and to identify the graphical regions). Therefore, in the second post-processing step, "Post-processing 2", we aim to fill automatically the space between pixels in order to determine the largest CCs, which represent similar content regions, by replacing sequences of background pixels with foreground ones and then grouping pixels that share common characteristics or similar textural properties in the post-processed 1 pixel-labeled document image (cf. Section 2.5, Figure 3(g)).
First, a binarization step is performed on the enhanced document image (cf. Section 2.2, Figure 3(d)) using a standard parameter-free binarization method, Otsu's method, to obtain a binarized enhanced document image (cf. Figure 3(e)) and subsequently to retrieve the CCs (Otsu, 1979). Then, the majority voting technique is applied to each CC extracted from the binarized enhanced document image by computing the majority of the pixel labels belonging to the localized CC in the post-processed 1 pixel-labeled document image (cf. Section 2.5, Figure 3(g)). Thus, using the majority voting technique, the CCs extracted from the binarized enhanced document image are labeled according to the post-processed 1 pixel-labeled document image. The resulting image of labeling the extracted CCs is illustrated in Figure 3(h).
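A sketch of this first task of "Post-processing 2": the enhanced image is binarized with Otsu's method and every connected component receives the majority label of the post-processed pixel labels it covers. Variable names follow the earlier hypothetical snippets.

import numpy as np
from skimage.filters import threshold_otsu
from skimage.measure import label as cc_label

def label_connected_components(enhanced, pixel_labels):
    binary = enhanced < threshold_otsu(enhanced)     # ink is darker than the background
    components = cc_label(binary, connectivity=2)
    cc_labels = {}
    for cc_id in range(1, components.max() + 1):
        votes = pixel_labels[(components == cc_id) & (pixel_labels >= 0)]
        cc_labels[cc_id] = int(np.bincount(votes).argmax()) if votes.size else -1
    return binary, components, cc_labels

binary, components, cc_labels = label_connected_components(enhanced, pixel_labels)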
Once the CCs extracted from the binarized enhanced document image have been labeled, a color layer separation task is performed to split the CCs according to their labels. A document image containing only single-color CCs is therefore generated for each color layer. For instance, in the example illustrated in Figure 3, there are two colors representing separately the graphical (blue) and textual (green) CCs in Figures 3(i) and 3(m), respectively. The color layer separation task ensures the segmentation of the extracted CCs according to their label (i.e. content type). By separating the extracted CCs according to their label, the issues caused by the complex, dense, and overlapping document layouts of HDIs are overcome. The identification of homogeneous regions is based on finding the largest CCs. By replacing sequences of background pixels with foreground ones and then grouping pixels that share common characteristics or similar textural properties in a pixel-labeled document image, the extraction of homogeneous regions becomes more accurate and relevant. Indeed, the idea is to fill automatically the space within each component in order to partition text into columns, paragraphs, lines or words on the one hand, and to identify the graphical regions on the other hand.
Therefore, an adaptive RLSA is proposed in this work, which is a modified version of the classical RLSA (Wahl et al., 1982). The RLSA studies the spaces between black pixels in order to link neighboring black areas by applying run-length smearing both horizontally and vertically. It operates by replacing a horizontal (vertical, respectively) sequence of background pixels with foreground ones if the number of background pixels in the horizontal (vertical, respectively) sequence is smaller than or equal to a pre-defined horizontal (vertical, respectively) threshold. The proposed ARLSA determines the horizontal and vertical thresholds, which correspond to the run-length smoothing values, automatically. To obtain proper values for the horizontal and vertical thresholds, the two histograms of the widths and heights of the extracted CCs are examined, respectively. These two histograms give the distributions of the widths and heights of the extracted CCs in the analyzed HDI. The estimation of the horizontal (vertical, respectively) threshold is based on the determination of the global maximum of the histogram of the widths (heights, respectively) of the extracted CCs. This global maximum mainly gives information about the mean character width (height, respectively).
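A sketch of the threshold estimation and of the run-length smearing is given below, assuming skimage.measure.regionprops for the CC statistics; the simple per-row fill is one possible reading of the smearing, and the logical AND of the two smeared images anticipates the combination described in the next paragraph.

import numpy as np
from skimage.measure import regionprops

def arlsa_thresholds(components):
    """Modes of the CC width and height histograms (run-length smoothing values)."""
    props = regionprops(components)
    widths = np.array([p.bbox[3] - p.bbox[1] for p in props])
    heights = np.array([p.bbox[2] - p.bbox[0] for p in props])
    return (max(int(np.bincount(widths).argmax()), 1),
            max(int(np.bincount(heights).argmax()), 1))

def smear_rows(binary, threshold):
    """Fill horizontal background runs of length <= threshold with foreground."""
    out = binary.copy()
    for row in out:
        start = None
        for j, is_ink in enumerate(row):
            if not is_ink and start is None:
                start = j
            elif is_ink and start is not None:
                if j - start <= threshold:
                    row[start:j] = True
                start = None
    return out

def arlsa(binary, h_thr, v_thr):
    horizontal = smear_rows(binary, h_thr)
    vertical = smear_rows(binary.T, v_thr).T
    return horizontal & vertical            # logical AND of both smeared images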
Once the horizontal and vertical run-length smoothing values have been estimated automatically from the analyzed HDI content (i.e. particularly from the distributions of the widths and heights of the CCs extracted from the binarized enhanced document image), the proposed ARLSA is applied to each resulting image of the color layer separation task, after a binarization step using Otsu's algorithm. It operates by taking the logical AND of the horizontally (cf. Figure 3(j) (3(n), respectively)) and vertically (cf. Figure 3(k) (3(o), respectively)) smeared images of each resulting image of the color layer separation task to generate Figure 3(l) (3(p), respectively). After applying the ARLSA to each resulting image of the color layer separation task, the logical NOT is applied to each result, and the different resulting images of the ARLSA task (cf. Figures 3(l) and 3(p)) are merged with the logical OR. Once this merge has been performed, a post-processed binarized document image is generated (cf. Figure 3(q)), in which the neighboring black areas are linked by the run-length smearing applied both horizontally and vertically. Then, the post-processed 2 pixel-labeled document image (cf. Figure 3(r)) is obtained by labeling the CCs extracted from Figure 3(q) according to the labels deduced from Figure 3(h), using the majority voting technique.
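A sketch of the layer-wise application and merge, reusing the hypothetical helpers defined above: ARLSA is run on the binarized image of each color layer and the results are combined with a logical OR.

import numpy as np

h_thr, v_thr = arlsa_thresholds(components)
merged = np.zeros_like(binary)
for content_label in set(cc_labels.values()) - {-1}:
    layer_ccs = [cc for cc, lab in cc_labels.items() if lab == content_label]
    layer = np.isin(components, layer_ccs)           # binarized single-label layer
    merged |= arlsa(layer, h_thr, v_thr)              # OR-merge of the smeared layers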
Finally, the homogeneous or similar content regions are extracted and labeled from the resulting image of the "Post-processing 2" task by identifying groups of pixels sharing common characteristics or similar textural properties (cf. Figure 3(s)). To define an extracted region, a bounding box covering all the pixels belonging to the extracted CC is used (i.e. a contour tracking of the shape of the extracted CC is carried out to identify the bounding box of each component). Then, the external contours of the defined bounding boxes are drawn in the colors of the labels deduced from the result of the majority voting technique (cf. Figure 3(h)).
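A sketch of this final region extraction: a bounding box is taken around every connected component of the merged image (regionprops' bbox is used here in place of an explicit contour tracking), and, for brevity, each box takes the majority label of the pixel-label map rather than of the CC labels of Figure 3(h).

import numpy as np
from skimage.measure import label as cc_label, regionprops

regions = []
final_components = cc_label(merged, connectivity=2)
for prop in regionprops(final_components):
    votes = pixel_labels[(final_components == prop.label) & (pixel_labels >= 0)]
    region_label = int(np.bincount(votes).argmax()) if votes.size else -1
    regions.append((prop.bbox, region_label))   # bbox = (min_row, min_col, max_row, max_col)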
3 CORPUS AND PREPARATION
OF GROUND-TRUTH
Our experimental corpus contains 1000 ground-truthed one-page document images collected from Gallica, spanning six centuries (1200-1900) of French history. The HDIs of our corpus have been selected from several printed monographs and manuscripts across a variety of disciplines (e.g. novels, law texts, educational books (history, geography, nature), xylographic booklets, etc.) to provide a broad range of document contents. They are gray-scale/color documents that have been digitized at 300/400 dpi and saved in the TIFF format, which preserves the high resolution of the digitized images.
Our dataset has been structured into four categories of real scanned HDIs differentiated by their content (cf. Figure 2), reflecting the challenges of our work: determining which texture features are most adequate for segmenting graphical regions from textual ones on the one hand, and for discriminating text in a variety of fonts and scales on the other hand. Our experimental corpus includes, for each category, a sufficient number of HDIs with both simple and complex layouts, which have been ground-truthed to ensure a better understanding of the behavior of the evaluated texture feature sets. It is composed of:
- 250 pages containing only two fonts (cf. Figure 2(a));
- 250 pages containing only three fonts (cf. Figure 2(b));
- 250 pages containing graphics and one text font (cf. Figure 2(c));
- 250 pages containing graphics and text in two different fonts (cf. Figure 2(d)).
Figure 2: Illustration of our experimental corpus: (a) only two fonts, (b) only three fonts, (c) graphics and one text font, (d) graphics and two text fonts.
ExtractionofHomogeneousRegionsinHistoricalDocumentImages
51
The ground truth for the document images has been outlined manually using rectangular regions drawn around each selected zone. The zones have been ground-truthed by manually delimiting each content type (i.e. each rectangular region has been classified as text or graphics). Different labels for regions with different fonts have also been defined, in order to evaluate the ability of the texture features to separate various text fonts.
4 EVALUATION AND RESULTS
In this work, due to a possible bias produced by estimating the number of clusters, the maximum number of homogeneous and similar content region types is set equal to the one defined in the ground truth. A first aspect of future work is to introduce a clustering methodology that estimates the correct number of clusters automatically (e.g. by analyzing changes in the average silhouette width values computed from clusters built with the k-means algorithm), as sketched below.
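A hedged sketch of this future-work idea: pick the number of clusters that maximizes the average silhouette width of k-means partitions (scikit-learn's silhouette_score is assumed to be available).

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def estimate_n_clusters(features, candidates=range(2, 7)):
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        scores[k] = silhouette_score(features, labels)
    return max(scores, key=scores.get)          # k with the highest mean silhouette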
The proposed algorithm provides very satisfying results, particularly in distinguishing textual regions from graphical ones (cf. Figure 3(s)). This highlights a much greater discriminant power for separating text and graphic regions than for distinguishing two or more different text fonts (normal, uppercase, and italic fonts). Nevertheless, this way of assessing the effectiveness of a segmentation method is inherently subjective, and we need to evaluate robustness using appropriate quantitative metrics. In the tables there are two "Overall" values. The "Overall" value is obtained by averaging all the respective column values except the value of "Two fonts and graphics∗∗". The "Overall∗∗" value is obtained by averaging all the respective column values except the value of "Two fonts and graphics". "Two fonts and graphics" represents the case where every font in the text has a different label in the ground truth, and clustering is performed with the number of content region types set to 3 (graphics and two different text fonts). "Two fonts and graphics∗∗" represents the case where all fonts in the text have the same label in the ground truth, and clustering is performed with the number of content region types set to 2 (graphics and text). This partition of the dataset indicates whether the proposed algorithm is, firstly, adequate for segmenting graphical regions from textual ones, and secondly, able to discriminate between texts in a variety of fonts and scales. The results of the accuracy metrics are presented in Table 1.
First, the F-measure (F) is computed to evaluate the different post-processing steps of the proposed algorithm for the extraction of homogeneous regions in HDIs. The overall pixel labeling results are reasonably promising: we obtain overall F-measure rates of 67% and 70%, without taking into consideration the topological relationships of pixels and their labels, for "Overall" and "Overall∗∗", respectively. The best performance is obtained for documents containing graphics and a single text font (81%). The lowest value of F is obtained for documents containing only three fonts (53%). Therefore, the pixel labeling results show a much greater discriminating power for separating text (single font) and graphic regions than for distinguishing documents containing graphics and two or more text fonts. Furthermore, we note that the first post-processing phase is not strictly necessary, since there is no significant performance difference between the cases without and with the multi-scale analysis of majority voting (i.e. overall F gains of 0.1% and 0.004% for "Overall" and "Overall∗∗", respectively). This shows the robustness of the proposed pixel labeling technique based on Gabor feature analysis with SLIC superpixels and a multi-scale approach, even without introducing the topological relationships. However, by adding a post-processing step to identify groups of pixels sharing common characteristics or similar textural properties and to extract homogeneous regions, overall F gains of 0.6% and 1% are obtained for "Overall" and "Overall∗∗", respectively.
Then, three accuracy metrics are computed for evaluating the extracted homogeneous regions: the precision (P_AR), recall (R_AR), and Jaccard index (J_AR) (Brunessaux et al., 2014). The results of the homogeneous region extraction assessment (P_AR, R_AR, and J_AR) are obtained by calculating the mean value of each accuracy metric relative to the number of rectangular regions defined in the ground truth; they are reported in Table 1. The computed accuracy metric values are quite encouraging, since we obtain 88% (P_AR), 90% (R_AR), and 86% (J_AR) overall. For documents containing text (single font) and graphic regions, 93% (P_AR), 93% (R_AR), and 91% (J_AR) are noted. In conclusion, by computing several accuracy metrics, we show that the proposed algorithm has a much greater discriminating power for separating text (single font) and graphic regions than for distinguishing documents containing graphics and two or more text fonts. The results also confirm that it is more difficult to separate two or three text fonts in HDIs.
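A sketch of region-level precision, recall and Jaccard index computed from the overlap of a detected bounding box with a ground-truth rectangle is given below; the area-based formulation and the box format are assumptions of this sketch, not necessarily the exact protocol of (Brunessaux et al., 2014).

def region_metrics(pred, gt):
    """Boxes are (min_row, min_col, max_row, max_col); returns (P, R, J)."""
    r0, c0 = max(pred[0], gt[0]), max(pred[1], gt[1])
    r1, c1 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0, r1 - r0) * max(0, c1 - c0)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    precision = inter / area_pred if area_pred else 0.0
    recall = inter / area_gt if area_gt else 0.0
    jaccard = inter / union if union else 0.0
    return precision, recall, jaccard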
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
52
Figure 3: Illustration of the intermediate results of the different tasks of the proposed algorithm. (a) An example of an HDI (input of the proposed algorithm). (b) and (c) The background and foreground SLIC superpixels, respectively; the colors assigned to the background (foreground, respectively) superpixels in (b) ((c), respectively) are randomly generated. (d) The result of the enhancement step. (e) The result of applying a binarization step to the enhanced document image. (f) The pixel-labeled image obtained from the analysis of the extracted Gabor features (graphic regions in blue, textual regions in green). (g) The result of the first post-processing step, "Post-processing 1". (h) The result of labeling the CCs extracted from the binarized enhanced document image according to the pixel labeling results of the first post-processing step, using the majority voting technique. (i) and (m) The two binarized images resulting from the color layer separation task, illustrating separately the graphical (blue) and textual (green) CCs, respectively. (j) and (k) The results of applying the run-length smearing horizontally and vertically, respectively, to the binarized image representing the graphical regions (i). (n) and (o) The results of applying the run-length smearing horizontally and vertically, respectively, to the binarized image representing the textual regions (m). (l) ((p), respectively) The result of merging, with the logical AND, the horizontally and vertically smeared images of each binarized image of the color layer separation task ((j) and (k), and (n) and (o), respectively). (q) The result of merging, with the logical OR, the two images obtained by applying the ARLSA to each binarized image of the color layer separation task ((l) and (p)). (r) The result of the second post-processing step, "Post-processing 2", obtained by labeling the CCs extracted from (q) while taking into account the labels of the CCs in (h). (s) The output of the proposed algorithm for the extraction of homogeneous regions in HDIs.
ExtractionofHomogeneousRegionsinHistoricalDocumentImages
53
Table 1: Evaluation of the different steps of the proposed algorithm for extraction of homogeneous regions in HDIs by computing the F-measure (F), the precision (P_AR), the recall (R_AR), and the Jaccard index (J_AR). µ(.) represents the mean value; the higher the mean values, the better the results. †, ‡, and o denote the evaluation of the analysis of the extracted texture features without post-processing (the pixel labeling step only), with the first post-processing step, and with the second post-processing step, respectively. µ‡−†(.) and µo−‡(.) represent the mean difference values between ‡ and †, and between o and ‡, respectively.

                            µ†(F)   µ‡(F)   µo(F)   µ‡−†(F)  µo−‡(F)  µ(P_AR)  µ(R_AR)  µ(J_AR)
One font and graphics       0.8159  0.8172  0.7761  0.0013   −0.0557  0.9264   0.9352   0.9158
Two fonts and graphics      0.6011  0.6019  0.6313  0.0008   0.0294   0.9158   0.9282   0.9071
Two fonts and graphics∗∗    0.7140  0.7125  0.7553  −0.0015  0.0428   0.8223   0.8851   0.8429
Only two fonts              0.7315  0.7322  0.7403  0.0007   0.0081   0.8227   0.8782   0.8373
Only three fonts            0.5342  0.5356  0.5816  0.0014   0.0460   0.8628   0.8564   0.7970
Overall                     0.6706  0.6717  0.6786  0.0010   0.0069   0.8818   0.9012   0.8657
Overall∗∗                   0.6989  0.6993  0.7096  0.0004   0.0103   0.8819   0.8995   0.8643
5 CONCLUSION AND
PERSPECTIVES
This article presents a novel automatic segmentation method for HDIs based on the extraction of homogeneous or similar content regions. The proposed algorithm is based on simple linear iterative clustering (SLIC) superpixels, Gabor filters, multi-scale analysis, a majority voting technique, CC analysis, color layer separation, and the ARLSA. The proposed algorithm is parameter-free and adapted to all kinds of HDIs; it is designed to identify homogeneous regions without formulating any hypothesis or assumption concerning the document model/layout or content. It was evaluated on 1000 pages of HDIs with promising results.
Homogeneous region extraction in HDIs is a first step towards automatic historical book understanding. Our future work will build on the results of the extraction of homogeneous or similar content regions to characterize HDI content with intermediate-level meta-data, between document content and layout. By characterizing each digitized page of a historical book with a set of homogeneous or similar content regions and their topological relationships, a signature can be designed for each book page. The obtained signatures will help to deduce similarities of book page structure or layout and/or content.
REFERENCES
Achanta, R., Shaji, A., Lucchi, A., Fua, P., and Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. PAMI, pages 2274–2282.
Brunessaux, S., Giroux, P., Grilheres, B., Manta, M., Bodin,
M., Choukri, K., Galibert, O., and Kahn, J. (2014).
The Maurdor project: Improving automatic process-
ing of digital documents. In DAS, pages 349–354.
IEEE.
Chang, T. and Kuo, C. C. J. (1992). Texture segmenta-
tion with tree-structured wavelet transform. In TFTSA,
pages 543–546.
Coustaty, M., Raveaux, R., and Ogier, J. M. (2011). Histor-
ical document analysis: A review of French projects
and open issues. In EUSIPCO, pages 1445–1449.
Gabor, D. (1946). Theory of communication. Part 1: The
analysis of information. Journal of the Institution of
Electrical Engineers - Part III: Radio and Communi-
cation Engineering, pages 429–441.
Lam, L. and Suen, C. Y. (1997). Application of majority
voting to pattern recognition: an analysis of its behav-
ior and performance. SMC, pages 553–568.
Li, J., Wang, J. Z., and Wiederhold, G. (2000). Classifica-
tion of textured and non-textured images using region
segmentation. IP, pages 754–757.
Liu, M. Y., Tuzel, O., Ramalingam, S., and Chellappa, R.
(2011). Entropy rate superpixel segmentation. In
CVPR, pages 2097–2104.
MacQueen, J. B. (1967). Some methods for classification
and analysis of multivariate observations. In Berkeley
Symposium on Mathematical Statistics and Probabil-
ity, pages 281–297. University of California Press.
Mehri, M., Gomez-Krämer, P., Héroux, P., Boucher, A., and Mullot, R. (2013). Texture feature evaluation for segmentation of historical document images. In HIP, pages 102–109.
Okun, O. and Pietikäinen, M. (1999). A survey of texture-based methods for document layout analysis. In WTAMV, pages 137–148.
Otsu, N. (1979). A threshold selection method from gray-
level histograms. SMC, pages 62–66.
Rosenfeld, A. and Pfaltz, J. L. (1966). Sequential oper-
ations in digital picture processing. Journal of the
ACM, pages 471–494.
Wahl, F. M., Wong, K. Y., and Casey, R. G. (1982). Block
segmentation and text extraction in mixed text/image
documents. CGIP, pages 375–390.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
54