2 PROPOSED SYSTEM
We designed an automated system capable of
transcribing documents written in German Kurrent
Schrift. We claim that the system can easily be
adapted to support simpler scripts (such as Latin or
Greek). The solution assumes that complex
restoration filters enabling proper text isolation are
not necessary (i.e. the image quality of the
documents is fair). However, light imperfections
(such as faint paper folds, material aging and
isolated ink droplets) are handled by the system (an
example document is presented in Figure 1).
Figure 1: Excerpt from a historical Kurrent Schrift
document.
2.1 Conceptual Architecture
We have identified three major stages: Document
Processing, Word Processing and Word Selection,
connected in a linear pipeline (as seen in Figure 2).
First, the document image (i.e. the scanned
version of the historical handwritten document) is
processed by the Document Processor. After
separating the text from the background through a
two-step procedure and removing spurious noisy
areas, the document is de-skewed to improve the
correctness of subsequent processing steps. It is then
successively partitioned into lines of text and
individual words.
The Word Processor, the core component of the
system, finalizes the preprocessing of the input word
images by performing slant correction and character
splitting. The shape of the binary character objects is
then captured using a skeletonization filter, and
important features that discriminate the characters
are extracted. A classifier identifies each character,
and word variants are constructed.
The words are validated by the Word Selector
using a local dictionary database and a Knowledge
Base, generating transcription variants with attached
probabilities. Inappropriate matches are pruned and
the words reordered so as to generate the final
transcription, the output of the system.
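As an illustration, the pipeline can be summarized by the following structural sketch in Python; the three stage functions are hypothetical placeholders for the components described above, with their bodies elided.

```python
# Structural sketch of the linear pipeline; the stage functions are
# hypothetical placeholders for the components described above.

def document_processor(page_image):
    """Binarize, de-noise and de-skew the page; split it into lines
    of text and individual word images."""
    raise NotImplementedError

def word_processor(word_image):
    """Slant-correct, split into characters, skeletonize, extract
    features, classify; return candidate word variants."""
    raise NotImplementedError

def word_selector(word_variants):
    """Validate variants against the dictionary and Knowledge Base;
    prune weak matches and assemble the final transcription."""
    raise NotImplementedError

def transcribe(page_image):
    word_images = document_processor(page_image)
    word_variants = [word_processor(w) for w in word_images]
    return word_selector(word_variants)
```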
2.2 Component Description
The Document Processor extracts the text from the
background using a binary conversion of the image
in a two-step process: greyscaling followed by
binarization. Global thresholding (Otsu, 1979) offers
the best trade-off between performance and
computational complexity. Noise reduction is
ensured by a blob-based labelling technique,
removing objects whose area is smaller than a
threshold value (dependent on the image size).
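A minimal sketch of this step using OpenCV is given below; the relative area threshold is an assumed heuristic, not a value prescribed by the system.

```python
import cv2

def binarize_and_denoise(image_bgr, area_fraction=1e-5):
    # Step 1: greyscaling.
    grey = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Step 2: Otsu global thresholding; text becomes white (255).
    _, binary = cv2.threshold(grey, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Blob-based labelling: drop connected components whose area is
    # below a threshold derived from the image size (assumed heuristic).
    min_area = area_fraction * binary.size
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary,
                                                           connectivity=8)
    for label in range(1, n):  # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] < min_area:
            binary[labels == label] = 0
    return binary
```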
Because human writers rarely produce text on
perfectly horizontal lines, the text is written at a
slight angle and individual characters are distorted.
This problem, commonly known as document skew,
is minimized through a projection-based correction
(Zeeuw, 2006) that considers multiple candidate
skew angles.
The actual correction is performed by a vertical
shearing in the opposite direction of the skew (Sun,
1997).
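The following sketch illustrates the approach: each candidate angle is scored by the variance of the horizontal projection of the correspondingly sheared image (sharp line peaks indicate correct alignment). The angle range and step are assumptions, and np.roll wraps at the border, which a production implementation would replace with padding.

```python
import numpy as np

def shear_vertically(img, angle_deg):
    # Shift each column vertically in proportion to its x position.
    shift = np.tan(np.radians(angle_deg))
    out = np.zeros_like(img)
    for x in range(img.shape[1]):
        out[:, x] = np.roll(img[:, x], int(round(x * shift)))
    return out

def deskew(binary, angles=np.linspace(-5, 5, 41)):
    # Score a candidate correction by how "peaky" the horizontal
    # projection becomes: well-aligned lines give a high variance.
    def score(angle):
        profile = shear_vertically(binary, -angle).sum(axis=1)
        return profile.var()
    best = max(angles, key=score)
    return shear_vertically(binary, -best)
```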
Line splitting is performed by smoothing the
horizontal projection with a Gaussian filter and
detecting areas of low density. Split points are
identified as local minima inside a centred
rectangular window and are separated from
neighbouring split points by a peak. Analogously,
lines of text are separated into words based on their
individual vertical projections.
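A simplified sketch of line splitting is shown below; the smoothing strength and the window size are assumed parameters, and word splitting proceeds analogously on the vertical projection of each line.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def split_lines(binary, sigma=3.0, window=15):
    # Gaussian-smoothed horizontal projection of the page.
    profile = gaussian_filter1d(binary.sum(axis=1).astype(float), sigma)
    splits = []
    for y in range(window, len(profile) - window):
        local = profile[y - window:y + window + 1]
        # A split point is the minimum of its centred window...
        if profile[y] == local.min() and local.max() > profile[y]:
            # ...and is separated from the previous split by a peak.
            if not splits or profile[splits[-1]:y].max() > profile[y]:
                splits.append(y)
    bounds = [0] + splits + [binary.shape[0]]
    return [binary[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
```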
Because word orientation may be irregular within
a line of text, the Word Processor performs slant
correction, analogous to skew correction but
achieved through horizontal shearing. Words are
then vertically cut into individual characters
(Zeeuw, 2006).
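Slant correction thus reduces to a horizontal shear, mirroring the vertical shear used for skew; a sketch follows, assuming the slant angle has already been estimated (e.g. by the same projection-based search).

```python
import numpy as np

def shear_horizontally(word_img, slant_deg):
    # Shift each row horizontally in proportion to its distance from
    # the bottom of the image, so the baseline stays in place.
    h = word_img.shape[0]
    shift = np.tan(np.radians(slant_deg))
    out = np.zeros_like(word_img)
    for y in range(h):
        out[y] = np.roll(word_img[y], int(round((h - 1 - y) * shift)))
    return out
```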
The shape of the binary characters is extracted by
thinning (a skeletonization filter). K3M thinning
(Saeed, 2010) is employed, which generates a
one-pixel-wide, connected skeleton. Pruning of
spurious branches ensures a stable skeleton structure
(Bai, 2005), unaffected by small shape variations.
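As an illustration, the sketch below substitutes scikit-image's general-purpose skeletonize for K3M and a crude endpoint-erosion prune for the stable pruning of (Bai, 2005); both substitutions are ours and only approximate the described behaviour.

```python
import numpy as np
from scipy.ndimage import convolve
from skimage.morphology import skeletonize

def character_skeleton(char_binary, prune_iterations=5):
    # One-pixel-wide connected skeleton (stand-in for K3M thinning).
    skel = skeletonize(char_binary > 0)
    # Crude pruning: repeatedly delete endpoints (skeleton pixels with
    # exactly one 8-neighbour). This removes short spurious branches
    # but also shortens genuine strokes.
    kernel = np.array([[1, 1, 1],
                       [1, 0, 1],
                       [1, 1, 1]])
    for _ in range(prune_iterations):
        neighbours = convolve(skel.astype(int), kernel, mode="constant")
        skel = skel & ~(skel & (neighbours == 1))
    return skel
```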
Significant numerical features are extracted from
the resulting shape in order to discriminate the
characters, chosen for both strong inter-class
variation and weak mutual correlation (avoiding
redundancy). The following mix of features is
considered (Vamvakas, 2007): projections, profiles,
transitions and zone densities. Because histogram-
based features depend on the character image
resolution (width, height), we propose histogram
compression based on the Discrete Cosine
Transform, which captures the overall shape of the
histogram.
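A sketch of this compression applied to the projection features is given below; keeping only the first few DCT coefficients preserves the coarse shape of the histogram regardless of the character image's width and height. The number of retained coefficients is an assumed parameter.

```python
import numpy as np
from scipy.fftpack import dct

def projection_features(char_binary, n_coeffs=10):
    # Horizontal and vertical projection histograms of the character.
    h_proj = char_binary.sum(axis=1).astype(float)
    v_proj = char_binary.sum(axis=0).astype(float)
    feats = []
    for profile in (h_proj, v_proj):
        # Low-order DCT coefficients encode the coarse histogram shape;
        # the same compression applies to the profile features.
        coeffs = dct(profile, norm="ortho")[:n_coeffs]
        feats.append(np.pad(coeffs, (0, n_coeffs - len(coeffs))))
    return np.concatenate(feats)
```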