The GIDOC Prototype

N. Serrano, L. Tarazón, D. Pérez, O. Ramos Terrades and A. Juan

DSIC/ITI, Universitat Politècnica de València, Camí de Vera, s/n, 46022 València, Spain

Abstract. Transcription of handwritten text in (old) documents is an important, time-consuming task for digital libraries. In this paper, an efficient interactive-predictive transcription prototype called GIDOC (Gimp-based Interactive transcription of old text DOCuments) is presented. GIDOC is a first attempt to provide integrated support for interactive-predictive page layout analysis, text line detection and handwritten text transcription. It is based on GIMP and uses advanced techniques and tools for language and handwritten text modelling. Results are given on a real transcription task on a 764-page Spanish manuscript from 1891.

1 Introduction

Transcription of handwritten text in (old) documents is an important, time-consuming task for digital libraries. It might be carried out by first processing all document images off-line, and then manually supervising system transcriptions to edit incorrect parts. However, current techniques for automatic page layout analysis, text line detection and handwriting recognition are still far from perfect [10,2,1], and thus post-editing system output is not clearly better than simply ignoring it.

A more effective approach to transcribing old text documents is to follow an interactive-predictive paradigm in which the system is guided by the user and, in turn, the user is assisted by the system to complete the transcription task as efficiently as possible. Following this approach, a system prototype called GIDOC (Gimp-based Interactive transcription of old text DOCuments) has been developed to provide user-friendly, integrated support for interactive-predictive layout analysis, line detection and handwriting transcription [4,7].

GIDOC is designed to work with (large) collections of homogeneous documents, that is, of similar structure and writing styles. They are annotated sequentially, by (partially) supervising hypotheses drawn from statistical models that are constantly updated with an increasing number of available annotated documents. This is done at different annotation levels. For instance, at the level of page layout analysis, GIDOC uses a novel text block detection method in which conventional, memoryless techniques are improved with a "history" model of text block positions [4]. Similarly, at the level of text line image transcription, GIDOC includes a handwriting recognizer which is steadily improved with a growing number of partially supervised transcriptions [7]. Also at this level, the user is allowed to decide on a maximum tolerance threshold for the recognition error (in non-supervised parts), and the system adjusts the required supervision effort on the basis of an estimate for this error [8].

Serrano N., Tarazón L., Pérez D., Ramos Terrades O. and Juan A.
The GIDOC Prototype.
DOI: 10.5220/0003028300820089
In Proceedings of the 10th International Workshop on Pattern Recognition in Information Systems (ICEIS 2010), page
ISBN: 978-989-8425-14-0
© 2010 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved

This paper presents a comprehensive description of the GIDOC prototype, with special emphasis on parts not previously described [4,7,8]. After an overview of GIDOC in Section 2, its main functions are described in Sections 3 (block and line detection), 4 (HTK training) and 5 (transcription). Experiments are reported in Section 6, and conclusions are discussed in Section 7.

2 System Overview

As indicated by its name, GIDOC has been implemented on top of the well-known GNU Image Manipulation Program (GIMP). Like GIMP, GIDOC is licensed under the GNU General Public License, and it can be downloaded from [6]. To run GIDOC, we must first run GIMP and open a document image. GIMP will come up with its high-end user interface, which is often configured to only show the main toolbox (with docked dialogs) and an image window. GIDOC can be accessed from the menubar of the image window (see Fig. 1).

As shown in Fig. 1, the GIDOC menu includes six entries: Advanced options, 0: Preferences, 1: Block Detection, 2: Line Detection, 3: HTK Training, and 4: Transcription. Advanced options is a second-level menu where experimental features are grouped. Preferences opens a dialog to configure global options, as well as more specific options for preprocessing, training and recognition. Some of them are discussed below together with the menu entries that follow Preferences.

3 Block and Line Detection

During its development, GIDOC has been mainly tested on an old book in which most pages only contain nearly calligraphed text written on ruled sheets of well-separated lines, as in the example shown in Fig. 1. As said in the introduction, GIDOC is designed to work with such homogeneous documents and, indeed, it takes advantage of their homogeneity. In particular, the Block Detection entry in the GIDOC menu uses a novel text block detection method in which conventional, memoryless techniques are improved with a "history" model of text block positions. Please see [4] for more information.

Given a textual block, the Line Detection entry in the GIDOC menu detects all its text baselines, which are marked as straight paths. The result can be clearly observed in the example of Fig. 1. Although each baseline has handles to graphically correct its position, it is worth noting that the baseline detection method implemented works quite well, at least in pages like that of the example. It is a rather standard projection-based method [2]. First, horizontally-averaged pixel values or black/white transitions are projected vertically. Then, the resulting vertical histogram is smoothed and analyzed so as to locate baselines accurately. Two preprocessing options are included in Preferences: first, to decide on the histogram type (pixel values or black/white transitions), and second, to define the maximum number of baselines to be found. Concretely, this number is used to help the projection-based method in locating (nearly) blank lines.
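The projection-based procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the code shipped with GIDOC: the function name `detect_baselines`, its parameters (`smooth`, `min_gap`, `min_drop`) and the heuristic of placing a baseline where the smoothed profile falls most sharply are all hypothetical choices made for the example.

```python
import numpy as np


def detect_baselines(gray, max_lines, smooth=3, min_gap=5, min_drop=1.0,
                     use_transitions=False):
    """Return sorted row indices of candidate baselines in a text block.

    gray: 2-D array with 0 = black ink and 255 = white paper (assumed).
    All parameter names are hypothetical, chosen for this sketch only.
    """
    img = np.asarray(gray, dtype=float)
    if use_transitions:
        # Histogram type 2: black/white transitions counted along each row.
        ink = img < 128
        profile = np.abs(np.diff(ink.astype(int), axis=1)).sum(axis=1)
        profile = profile.astype(float)
    else:
        # Histogram type 1: average ink darkness along each row.
        profile = (255.0 - img).mean(axis=1)

    # Smooth the vertical histogram with a moving average.
    kernel = np.ones(smooth) / smooth
    profile = np.convolve(profile, kernel, mode="same")

    # Assumed heuristic: a baseline lies where the profile drops sharply,
    # that is, at the lower edge of the main body of a text line.
    drop = profile[:-1] - profile[1:]

    picked = []
    for row in np.argsort(-drop, kind="stable"):
        if len(picked) == max_lines or drop[row] < min_drop:
            break
        # Keep a candidate only if it is far enough from accepted ones.
        if all(abs(int(row) - p) >= min_gap for p in picked):
            picked.append(int(row))
    return sorted(picked)
```

Here `max_lines` simply caps the number of accepted candidates, so that weak drops in (nearly) blank regions are not reported as baselines; how GIDOC actually exploits this number is not detailed in the text.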

Fig. 1. Image window showing GIDOC menu.

4 HTK Training

GIDOC is based on standard techniques and tools for handwritten text preprocessing and feature extraction, HMM-based image modeling, and language modeling [10]. Handwritten text preprocessing applies image denoising, deslanting and vertical size normalization to a given text (line) image. An illustrative example is given in Fig. 2. It can be configured through preprocessing options in Preferences. There is an option to use a customized procedure instead, and two options to define (bounds for) the locations of the upper and lower lines with respect to the baseline.

Fig. 2. Preprocessing and feature extraction of a text line image. From top to bottom: original image, denoising, deslanting, vertical size normalisation and feature extraction.

Feature extraction for HMM modeling consists in transforming the preprocessed image into a sequence of (fixed-dimension) feature vectors. There are two well-known feature extraction methods available in GIDOC. The default method first divides the preprocessed image into a grid of square cells whose size is a small fraction of the image height (e.g. 1/20). Then, each cell is characterized by its normalized gray level and, optionally, by its vertical and horizontal gray-level derivatives. See Fig. 2 for an example and [10] for more details. The alternative method moves a single-column window left-to-right over the image, and extracts 9 geometrical features at each position [1].
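As an illustration of the default, grid-based method, the following sketch computes one feature vector per grid column. It is a simplified approximation written for this description: the function name `grid_features` and its parameters are hypothetical, cells are plain non-overlapping squares, and derivatives are taken as finite differences between neighbouring cells, which may differ from the exact procedure in [10].

```python
import numpy as np


def grid_features(line_img, rows=20, with_derivatives=True):
    """Turn a preprocessed text line image into a sequence of vectors.

    line_img: 2-D gray-level array with values in [0, 255] (assumed).
    Returns an array of shape (n_columns, n_features). The image is
    assumed to span at least two cells along each axis.
    """
    img = np.asarray(line_img, dtype=float) / 255.0  # normalized gray level
    height, width = img.shape

    # Square cells whose side is a small fraction (1/rows) of the height.
    cell = max(1, height // rows)
    n_rows, n_cols = height // cell, width // cell
    img = img[: n_rows * cell, : n_cols * cell]  # crop to whole cells

    # Average gray level of each cell.
    cells = img.reshape(n_rows, cell, n_cols, cell).mean(axis=(1, 3))

    feats = [cells]
    if with_derivatives:
        # Finite-difference approximation of vertical and horizontal
        # gray-level derivatives (an assumption of this sketch).
        d_vertical, d_horizontal = np.gradient(cells)
        feats += [d_horizontal, d_vertical]

    # One feature vector per grid column, read from left to right.
    return np.concatenate(feats, axis=0).T
```

With `rows=20` and derivatives enabled, each vector has 60 components (20 gray levels, 20 horizontal and 20 vertical derivatives).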

HMM image modeling is carried out with the well-known and freely available Hidden Markov Model Toolkit (HTK) [11]. Similarly, language modeling is implemented through the open source SRI Language Modeling Toolkit (SRILM) [9].

HTK Training reads the directory of task document images and, for each image, it extracts all its transcribed text lines, if any, together with their corresponding line images. Transcriptions are first preprocessed to isolate special characters (mainly punctuation signs) and expand abbreviations (e.g. S.M. is expanded to Su Magestad). Then, an n-gram language model is built from preprocessed transcriptions using an SRILM command which, by default, generates a bigram language model with Kneser-Ney discounting. In parallel, extracted line images are preprocessed and transformed into sequences of feature vectors so as to train, using their corresponding transcriptions and HTK, continuous density (Gaussian) left-to-right HMMs at character level.
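The transcription preprocessing step can be illustrated with the short sketch below. It is an assumption-laden example rather than the GIDOC implementation: the function `preprocess_transcription`, the abbreviation table and the set of punctuation signs are hypothetical, with only the S.M. example taken from the text.

```python
import re

# Hypothetical abbreviation table; only this entry comes from the text.
ABBREVIATIONS = {"S.M.": "Su Magestad"}

# Assumed set of punctuation signs to be isolated as separate tokens.
PUNCTUATION = re.compile(r"([.,;:!?¿¡()\"])")


def preprocess_transcription(text, abbreviations=ABBREVIATIONS):
    """Expand abbreviations and isolate punctuation; return a token list."""
    # Abbreviations are expanded first so that their periods are not
    # mistaken for sentence punctuation.
    for short, full in abbreviations.items():
        text = text.replace(short, full)
    # Surround each punctuation sign with spaces, then split on whitespace.
    return PUNCTUATION.sub(r" \1 ", text).split()
```

For instance, `preprocess_transcription("Viva S.M., el Rey.")` yields the tokens `Viva Su Magestad , el Rey .`. Token sequences of this kind, written one line per text line, are the usual input for n-gram estimation tools such as SRILM; the exact command and options used by GIDOC are not given here.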

5 Transcription

The Transcription entry in the GIDOC menu opens the GIDOC interactive transcription dialog (see Fig. 3). It consists of two main sections: the image section, in the upper part, and the transcription section, in the bottom part. A number of text line images are displayed in the image section together with their transcriptions, if available, in separate editable text boxes within the transcription section. The current line to be transcribed or simply supervised is selected by placing the edit cursor in the appropriate editable box. Its corresponding baseline is emphasized (in blue color) and, whenever possible, GIDOC shifts line images and their transcriptions so as to display the current line in the central part of both the image and transcription sections. It is assumed that the user transcribes or supervises text lines from top to bottom, by entering text and moving the edit cursor with the arrow keys or the mouse. However, it is possible for the user to choose any order desired.

Fig. 3. Interactive transcription dialog.

As can be seen in Fig. 3, each editable text box in the transcription section has a button attached to its left. This button is labeled with the corresponding line number. By clicking on it, its associated line image is extracted, preprocessed, transformed into a sequence of feature vectors, and Viterbi-decoded using HTK and the models trained with HTK Training. In this way, the user need not enter the complete transcription of the current line, but hopefully only minor corrections to the decoded output. Clearly, this is only possible if, first, text lines are correctly detected and, second, the HMM and language models are adequately trained from a sufficiently large amount of training data. Therefore, it is assumed that transcription is carried out manually in early stages of a transcription task, and then is assisted as described here.
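For readers unfamiliar with Viterbi decoding, a generic log-domain version is sketched below. It is not HTK's decoder and ignores the language model and the composition of character HMMs into words. It only shows how the most likely state sequence is recovered for a single HMM, given per-frame log emission scores. The function name `viterbi` and its argument layout are assumptions of this sketch.

```python
import numpy as np


def viterbi(log_init, log_trans, log_emit):
    """Most likely state path of an HMM, computed in the log domain.

    log_init:  (N,)   log initial state probabilities
    log_trans: (N, N) log transition probabilities, from row to column
    log_emit:  (T, N) log emission score of each frame in each state
    Returns (path, log_score), with path a list of T state indices.
    """
    n_frames, n_states = log_emit.shape
    score = log_init + log_emit[0]
    back = np.zeros((n_frames, n_states), dtype=int)

    for t in range(1, n_frames):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]

    # Trace back from the best final state.
    path = [int(score.argmax())]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())
```

In a left-to-right topology such as the one used for character HMMs, `log_trans` only allows a state to loop on itself or move forward, so forbidden transitions carry a log probability of minus infinity.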

6 Experiments

During its development, GIDOC has been used by a paleography expert to annotate blocks, text lines and transcriptions on a new dataset called GERMANA [3]. GERMANA is the result of digitizing and annotating a 764-page Spanish manuscript from 1891, in which most pages only contain nearly calligraphed text written on ruled sheets of well-separated lines. The example shown in Fig. 1 corresponds to page 144. GERMANA is solely written in Spanish up to page 180; then, the manuscript includes many parts that are written in languages different from Spanish, namely Catalan, French and Latin.

Due to its sequential book structure, the very basic task on GERMANA is to transcribe it from the beginning to the end, though here we only consider its transcription up to page 180. Starting from page 3, we divided GERMANA into 9 consecutive blocks of 20 pages each (18 in block 9), with, on average, 417 lines and 4687 running words per block. Then, from block 2 (pages 23–42) to block 9 (pages 163–180), each block was automatically transcribed by GIDOC trained with all preceding blocks. The results are plotted in Fig. 4, in terms of transcription Word Error Rate (WER). To avoid fluctuations due to varying test set complexity, the WER was also computed for a fixed block (block 9) after each GIDOC re-training, and the resulting WER curve has been added to Fig. 4. Also shown is the part of the WER due to the occurrence of out-of-vocabulary (OOV) words.
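The experimental protocol and the error measure can be made concrete with the sketch below. The helper names (`page_blocks`, `training_schedule`, `word_error_rate`) are hypothetical, and the WER is computed in the usual way, as the word-level edit distance between reference and hypothesis divided by the number of reference words; the text does not state which scoring tool was actually used.

```python
def page_blocks(first=3, last=180, size=20):
    """Split pages first..last into consecutive blocks of `size` pages."""
    return [(start, min(start + size - 1, last))
            for start in range(first, last + 1, size)]


def training_schedule(blocks):
    """Pair each block (from the second on) with all pages preceding it."""
    first_page = blocks[0][0]
    return [((first_page, blocks[k - 1][1]), blocks[k])
            for k in range(1, len(blocks))]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    The reference is assumed to contain at least one word.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

With the default arguments, `page_blocks()` returns 9 blocks, from (3, 22) to (163, 180), and `training_schedule` yields the 8 training/test pairs evaluated in the experiment, the first one being training on pages 3–22 and testing on pages 23–42.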

[Figure 4: line plot. Horizontal axis: training pages (3-22, 3-42, 3-62, 3-82, 3-102, 3-122, 3-142, 3-162). Vertical axis: WER. Curves: WER on pages 163-180, WER on next 20 pages, OOV WER on pages 163-180, OOV WER on next 20 pages.]

Fig. 4. Transcription Word Error Rate (WER) on GERMANA as a function of the pages already supervised and thus available for training (training pages). The WER is computed for both the next 20 pages to supervise (solid line with black circles) and a fixed set comprising pages 163-180 (solid line with white circles). Also shown is the part of the WER due to the occurrence of out-of-vocabulary (OOV) words (dashed lines).

As expected, the WER decreases as the amount of training data increases. In particular, GIDOC achieves a WER of around 34% for the last two blocks, a level at which its output can be successfully used in computer-assisted transcription. The WER curve for block 9 does not differ significantly from that for the next block, though it appears that block 9 is a bit more complicated than all but one (block 7) of its preceding blocks. Regarding the OOV curves, it becomes clear that a considerable fraction of transcription errors is due to the occurrence of unseen words. More precisely, unseen words account for approximately 50% of transcription errors.

It is worth noting that preliminary WER results (only for the next block) have already been reported in [3] to accompany the description of GERMANA. In contrast to them, the WER and OOV curves reported here are slightly better on average (5.4% and 6.4%, respectively). This is mainly due to better modeling of word abbreviations and punctuation signs. Also, we have used an updated version of the GERMANA baselines, which are more accurately adjusted.

As with GERMANA, GIDOC has been used to annotate blocks, text lines and transcriptions on a more recent dataset called RODRIGO [5]. Although comparable in size to GERMANA, RODRIGO comes from a much older manuscript, from 1545, where the typical difficult characteristics of historical documents are more evident. In [5], experiments and results similar to those discussed here are reported.

7 Conclusions

A computer-assisted transcription prototype called GIDOC has been presented for handwritten text in old documents. GIDOC is a first attempt to provide integrated support for interactive-predictive page layout analysis, text line detection and handwritten text transcription. It is built on top of GIMP, and uses standard techniques and tools for handwritten text preprocessing and feature extraction, HMM-based image modeling, and language modeling. Like GIMP, GIDOC is licensed under the GNU General Public License, and it can be freely downloaded from the Internet. The effectiveness of GIDOC has been empirically demonstrated on the GERMANA database, which is also publicly available on the Internet.

Acknowledgements

Work supported by the EC (FEDER, FSE), the Spanish Government (MEC, MICINN, "Plan E", under grants MIPRCV "Consolider Ingenio 2010" CSD2007-00018, iTransDoc TIN2006-15694-CO2-01, MITTRAL TIN2009-14633-C03-01 and FPU AP2007-02867) and the Generalitat Valenciana (grant Prometeo/2009/014).

References

1. R. Bertolami and H. Bunke. Hidden Markov model-based ensemble methods for offline handwritten text line recognition. Pattern Recognition, 41:3452–3460, 2008.
2. L. Likforman-Sulem, A. Zahour, and B. Taconet. Text line segmentation of historical documents: a survey. International Journal on Document Analysis and Recognition, 9, 2007.
3. Daniel Pérez, Lionel Tarazón, Nicolás Serrano, Francisco-Manuel Castro, Oriol Ramos-Terrades, and Alfons Juan. The GERMANA database. In Proc. of the 10th Int. Conf. on Document Analysis and Recognition (ICDAR 2009), pages 301–305, Barcelona (Spain), July 2009.
4. O. Ramos-Terrades, N. Serrano, A. Gordó, E. Valveny, and A. Juan. Interactive-predictive detection of handwritten text blocks. In Document Recognition and Retrieval XVII (Proc. of SPIE-IS&T Electronic Imaging), pages 75340Q–(1–10), San Jose, CA (USA), January 2010.
5. N. Serrano, F. Castro, and A. Juan. The RODRIGO database. In Proc. of the 8th Language Resources and Evaluation Conf. (LREC 2010), Valletta (Malta), May 2010.
6. Nicolás Serrano, Alfons Juan, et al. The GIDOC prototype. http://prhlt.iti.es/gidoc.php, 2009.
7. Nicolás Serrano, Daniel Pérez, Albert Sanchis, and Alfons Juan. Adaptation from Partially Supervised Handwritten Text Transcriptions. In Proc. of the 11th Int. Conf. on Multimodal Interfaces and the 6th Workshop on Machine Learning for Multimodal Interaction (ICMI-MLMI 2009), pages 289–292, Cambridge, MA (USA), November 2009.
8. Nicolás Serrano, Albert Sanchis, and Alfons Juan. Balancing error and supervision effort in interactive-predictive handwriting recognition. In Proc. of the 14th Int. Conf. on Intelligent User Interfaces (IUI 2010), Hong Kong (China), February 2010.
9. A. Stolcke. SRILM - An Extensible Language Modeling Toolkit. In Proc. of the Int. Conf. on Spoken Language Processing, pages 901–904, Denver, CO (USA), 2002.
10. A. H. Toselli, A. Juan, D. Keysers, J. González, I. Salvador, H. Ney, E. Vidal, and F. Casacuberta. Integrated Handwriting Recognition and Interpretation using Finite-State Models. International Journal of Pattern Recognition and Artificial Intelligence, 18(4):519–539, 2004.
11. S. Young et al. The HTK Book. Cambridge University Engineering Department, 1995.