Impact of Training LSTM-RNN with Fuzzy Ground Truth

Martin Jenckel¹,², Sourabh Sarvotham Parkala, Syed Saqib Bukhari and Andreas Dengel¹,²

¹ German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany

² TU Kaiserslautern, Kaiserslautern, Germany

Keywords:

Document Analysis, OCR, LSTM, Fuzzy Ground Truth.

Abstract:

Most machine learning algorithms follow the supervised learning approach and therefore require annotated training data. The large amount of training data required to train state-of-the-art deep neural networks has changed how these annotations are acquired. User annotations or completely synthetic annotations are becoming more and more prevalent, replacing careful manual annotation by experts. In the field of OCR, recent work has shown that synthetic ground truth acquired through clustering with minimal manual annotation yields good results when combined with bidirectional LSTM-RNNs. Similarly, we propose a change to standard LSTM training to handle imperfect manual annotation. When annotating historical documents or low-quality scans, deciding on the correct annotation is difficult, especially for non-experts. Providing all possible annotations in such cases, instead of just one, is what we call fuzzy ground truth. Finally, we show that training an LSTM-RNN on fuzzy ground truth achieves performance similar to training on expert ground truth.

1 INTRODUCTION

With the rise of neural networks and deep learning, collecting large amounts of training data and annotating them became one of the major challenges in machine learning. Instead of handcrafting annotations for data sets with a couple of thousand data points, the data is often collected from social media sources like Twitter, Flickr or YouTube (Wang et al., 2011; Althoff et al., 2013; You et al., 2015), and instead of handcrafted annotations, user-generated tags, keywords or comments are used. This allows for data sets with millions of data points at comparably low effort. While this works well for images and videos, it is rather difficult to employ for other data types like historical documents.

For images and videos most people can describe what they see fairly well, but for historical documents it can be difficult to read characters like the examples shown in Figure 1 and provide accurate annotations. This is even more so if the language of the document is old, no longer in use or has changed significantly compared to its modern form. Additionally, historical documents often show various degradations, making it even harder to identify the characters without the right expertise. In practice this makes annotating historical documents time consuming and therefore also expensive, with a single page often costing 100 USD or more to transcribe.

Figure 1: Examples of historical documents with degraded or otherwise hard to identify characters.

In this paper we propose a new way of annotating historical documents for the purpose of LSTM-RNN based Optical Character Recognition (OCR) systems. Ambiguous characters, whether caused by high levels of degradation or by ambiguity in the characters themselves, can often be annotated by identifying the correct word, the surrounding words or the overall context of the sentence and text. Without this linguistic expertise it becomes difficult to make the correct decision. We propose that, instead of removing useful information from the annotation by possibly giving the wrong annotation, the transcription should contain all possibilities. LSTM-RNN based OCR systems can use "fuzzy ground truth" without a significant loss in overall accuracy. This allows for transcribing historical documents without extensive knowledge of the language and requires only basic knowledge of the character shapes. At the same time it reduces the overall annotation time because ambiguous characters do not have to be resolved using additional tools and time.

Figure 2: In combination with LSTM-RNN the CTC loss function considers all possible alignments between network outputs and ground truth. Compared to the classic case (left), fuzzy ground truth (right) effectively increases the number of possible alignments (red).

Jenckel, M., Parkala, S., Bukhari, S. and Dengel, A.

Impact of Training LSTM-RNN with Fuzzy Ground Truth.

DOI: 10.5220/0006592703880393

In Proceedings of the 7th International Conference on Pattern Recognition Applications and Methods (ICPRAM 2018), pages 388-393

ISBN: 978-989-758-276-9

Section 2 explains how LSTM-RNNs can handle fuzzy annotation compared to normal annotation. Section 3 describes the experimental setup and introduces the data set. Section 4 reports the results. Lastly, in Section 5 we summarize the findings and give an outlook on further research topics.

2 LONG SHORT-TERM MEMORY RECURRENT NEURAL NETWORKS

A Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) is an RNN consisting of LSTM-cells. They were introduced to solve the problem of vanishing gradients during gradient descent training of classical RNNs (Werbos, 1990; Hochreiter and Schmidhuber, 1997). LSTM-RNNs excel at sequence learning tasks and can align unsegmented data with the corresponding ground truth when combined with the Connectionist Temporal Classification (CTC) loss (Graves et al., 2006). These properties make them especially useful for OCR, where they produce state-of-the-art results (Ul-Hasan et al.; Karayil et al., 2015; Simistira et al., 2015).

An LSTM-cell has three gating mechanisms that regulate the impact of the input via the input gate, the previous cell state via the forget gate and the output via the output gate. The forget gate and the cell's internal state in particular allow it to "remember" information through multiple time steps and similarly to "remember" errors backwards through time during gradient descent optimization.

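The state equations for each LSTM-cell are as follows:

$$f_t = \sigma(w_{xf} \cdot x_t + w_{hf} \cdot h_{t-1} + b_f) \tag{1}$$

$$i_t = \sigma(w_{xi} \cdot x_t + w_{hi} \cdot h_{t-1} + b_i) \tag{2}$$

$$v_t = \tanh(w_{xv} \cdot x_t + w_{hv} \cdot h_{t-1} + b_v) \tag{3}$$

$$o_t = \sigma(w_{xo} \cdot x_t + w_{ho} \cdot h_{t-1} + b_o) \tag{4}$$

$$C_t = f_t \cdot C_{t-1} + i_t \cdot v_t \tag{5}$$

$$y_t = o_t \cdot \tanh(C_t) \tag{6}$$

where

• f, i and o are the forget, input and output gate and v is the squashed input,

• C is the cell state and y the cell output, while t iterates over all time steps; $h_{t-1}$ is the cell output fed back from the previous time step,

• $b_j$, j ∈ {f, i, o, v} are the bias units for the forget, input and output gate and the input squashing,

• $w_{ij}$ are the weight connections between i and j,

• σ is the logistic sigmoid function given as: $\sigma(x) = \frac{1}{1+\exp(-x)}$

• and tanh is the hyperbolic tangent function $\tanh(x) = \frac{e^{2x}-1}{e^{2x}+1}$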

Storing information through multiple time steps al-

lows the LSTM to learn temporal relations in the data.


In terms of OCR this means it inherently builds a sta-

tistical language model based on temporal relations of

characters and words.
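The state equations (1) to (6) can be illustrated with a minimal sketch of a single scalar LSTM-cell. The function name `lstm_step`, the dictionary layout of the weights and the weight values are assumptions made only for this illustration; they are not taken from OCRopus or any other implementation.

```python
import math


def sigmoid(x):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One time step of a single scalar LSTM-cell following Eqs. (1)-(6).

    w maps names such as "xf" or "hf" to weights, b maps gate names to biases.
    h_prev is the cell output of the previous time step.
    """
    f_t = sigmoid(w["xf"] * x_t + w["hf"] * h_prev + b["f"])    # forget gate, Eq. (1)
    i_t = sigmoid(w["xi"] * x_t + w["hi"] * h_prev + b["i"])    # input gate, Eq. (2)
    v_t = math.tanh(w["xv"] * x_t + w["hv"] * h_prev + b["v"])  # input squashing, Eq. (3)
    o_t = sigmoid(w["xo"] * x_t + w["ho"] * h_prev + b["o"])    # output gate, Eq. (4)
    c_t = f_t * c_prev + i_t * v_t                              # cell state, Eq. (5)
    y_t = o_t * math.tanh(c_t)                                  # cell output, Eq. (6)
    return y_t, c_t


# Hypothetical weights and biases, chosen only for illustration.
weights = {k: 0.5 for k in ("xf", "hf", "xi", "hi", "xv", "hv", "xo", "ho")}
biases = {k: 0.0 for k in ("f", "i", "v", "o")}

y, c = 0.0, 0.0
for x in [1.0, 0.0, -1.0]:
    y, c = lstm_step(x, y, c, weights, biases)
    print(f"x={x:+.1f}  C={c:+.4f}  y={y:+.4f}")
```

The cell state is carried from one call to the next, which is how information persists across time steps in this sketch.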

The most common architecture for LSTM-RNNs is the bidirectional LSTM (BILSTM). Two LSTM-RNNs, each consisting of an input layer and a hidden layer, go through the sequence from both sides simultaneously and feed into a single Softmax layer. Since the number of LSTM-cells in the input layer has to be fixed, all input sequences need the same input size at every time step.

2.1 Connectionist Temporal

Classiﬁcation Loss

The Connectionist Temporal Classification (CTC) loss allows RNNs to be trained on unsegmented data (Graves et al., 2006). Similar to other loss functions, the idea is to maximize the likelihood that the output of the network aligns with the given ground truth. For this, all possible alignments between network output and ground truth have to be considered. This is done by aligning each ground truth character with a sequence of network outputs, while keeping the temporal order consistent. This can be understood as extending the ground truth to the same length as the input sequence by adding "blank" characters that refer to no output, or by duplicating the same character multiple times, which is equivalent to multiple consecutive inputs referring to the same character in the transcription. For example "...aa.b." and ".aabb...", with the "blank" label ".", are two possible extensions of the transcription "ab" to a sequence length of 8. The likelihood of a labeling l is therefore the sum over the probabilities of all possible extensions T(l) to the length of the network output, given the network output x.
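The notion of an extension can be made concrete with a small sketch that maps an extended sequence back to its transcription and enumerates all extensions by brute force. The function names `collapse` and `extensions` are hypothetical and the enumeration is only feasible for very short sequences.

```python
from itertools import product

BLANK = "."


def collapse(path):
    """Map an extended sequence back to its labeling: merge repeats, drop blanks."""
    out = []
    prev = None
    for symbol in path:
        if symbol != prev and symbol != BLANK:
            out.append(symbol)
        prev = symbol
    return "".join(out)


def extensions(label, length, alphabet):
    """Enumerate all extensions T(l) of `label` with the given length (brute force)."""
    symbols = sorted(set(alphabet)) + [BLANK]
    return ["".join(p) for p in product(symbols, repeat=length) if collapse(p) == label]


print(collapse(".aabb..."))        # the example from the text collapses to "ab"
paths = extensions("ab", 4, "ab")  # all extensions of "ab" to length 4
print(len(paths), paths)
```

Note that a blank between two identical characters keeps them apart, so "a.a" collapses to "aa" and not to "a".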

$$p(l|x) = \sum_{\pi \in T(l)} p(\pi|x) \tag{7}$$

By using a forward-backward algorithm one can then calculate p(l|x) as:

$$p(l|x) = \sum_{n} \alpha_t(l_n)\,\beta_t(l_n) \tag{8}$$

with $\alpha_t(l_n)$ and $\beta_t(l_n)$ being the forward and backward variables for the n-th symbol $l_n$ in the labeling l at network output t. The LSTM-RNN can then be trained by maximizing the log-likelihood over all possible extensions. A more detailed mathematical discourse can be found in (Graves et al., 2006) and (Graves, 2012).
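As a sanity check of the dynamic programming idea, the following sketch computes p(l|x) with the forward recursion only and compares it with the brute-force sum over all extensions. The recursion follows the formulation in (Graves, 2012) with blanks inserted around every character; the function names and the network outputs in `probs` are hypothetical values chosen for illustration.

```python
from itertools import product

BLANK = "."


def collapse(path):
    """Merge repeated symbols and drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)


def ctc_forward(label, probs):
    """p(l|x) via the forward recursion; probs[t][c] is the output for class c at step t."""
    ext = [BLANK]
    for ch in label:
        ext += [ch, BLANK]  # labeling with blanks around every character
    T, N = len(probs), len(ext)
    alpha = [[0.0] * N for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if N > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for n in range(N):
            # blanks and repeated characters may not skip the preceding position
            if ext[n] == BLANK or (n >= 2 and ext[n] == ext[n - 2]):
                start = n - 1
            else:
                start = n - 2
            total = sum(alpha[t - 1][i] for i in range(max(start, 0), n + 1))
            alpha[t][n] = probs[t][ext[n]] * total
    # valid paths end in the last character or the trailing blank
    return alpha[T - 1][N - 1] + (alpha[T - 1][N - 2] if N > 1 else 0.0)


def ctc_brute_force(label, probs):
    """p(l|x) as the explicit sum over all extensions, as in Eq. (7)."""
    symbols = list(probs[0].keys())
    total = 0.0
    for path in product(symbols, repeat=len(probs)):
        if collapse(path) == label:
            p = 1.0
            for t, s in enumerate(path):
                p *= probs[t][s]
            total += p
    return total


# Hypothetical network outputs for T = 3 time steps over the classes a, b and blank.
probs = [
    {"a": 0.6, "b": 0.1, BLANK: 0.3},
    {"a": 0.3, "b": 0.3, BLANK: 0.4},
    {"a": 0.1, "b": 0.7, BLANK: 0.2},
]
print(ctc_forward("ab", probs), ctc_brute_force("ab", probs))
```

The brute-force sum grows exponentially with the sequence length, while the recursion only needs one pass over time steps and label positions.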

2.2 Training with Fuzzy Ground Truth

We propose using fuzzy ground truth for ambiguous characters rather than absolute ground truth. This can be especially useful when people without language expertise transcribe historical documents with degraded or otherwise ambiguous characters. Fuzzy ground truth means providing the CTC loss function with multiple options $l_n^1$ and $l_n^2$ for a specific element $l_n$ of the transcription. Training on a wrong annotation has two negative effects: it does not reinforce the correct label, and it reinforces the wrong label instead. This means network weights would be updated to produce the wrong output. By providing fuzzy ground truth we can prevent these wrong weight updates. The LSTM-RNN will not be able to learn which of the options is the correct label from the fuzzy annotation alone, since both will be correct in terms of CTC-loss. The intrinsic statistical language modeling of LSTM-RNNs, however, will generate a bias towards the more likely outcome.

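For the CTC-loss, multiple ground truth options represent additional possible extensions of the labeling (Figure 2). The forward and backward variables are defined as:

$$\alpha_t(l_n) = y^t_{l_n} \begin{cases} \sum_{i=n-1}^{n} \alpha_{t-1}(l_i) & l_n \in \{b, l_{n-2}\} \\ \sum_{i=n-2}^{n} \alpha_{t-1}(l_i) & \text{else} \end{cases} \tag{9}$$

$$\beta_t(l_n) = \begin{cases} \sum_{i=n}^{n+1} \beta_{t+1}(l_i)\, y^{t+1}_{l_i} & l_n \in \{b, l_{n+2}\} \\ \sum_{i=n}^{n+2} \beta_{t+1}(l_i)\, y^{t+1}_{l_i} & \text{else} \end{cases} \tag{10}$$

where b denotes the blank label. From Equation 9 it becomes clear that the forward variables $\alpha_t(l_n^1)$ and $\alpha_t(l_n^2)$ only differ in the network output $y^t_{l_n}$. The following forward variables $\alpha_{t+1}$ therefore get the same input from $\alpha_t(l_n^1)$ and $\alpha_t(l_n^2)$ up to the factor $y^t_{l_n}$. This means we can instead write

$$\alpha_t(l_n^{1/2}) = y^t_{l_n^{1/2}} \begin{cases} \sum_{i=n-1}^{n} \alpha_{t-1}(l_i) & l_n \in \{b, l_{n-2}\} \\ \sum_{i=n-2}^{n} \alpha_{t-1}(l_i) & \text{else} \end{cases} \tag{11}$$

with $y^t_{l_n^{1/2}} = y^t_{l_n^1} + y^t_{l_n^2}$. $\beta_t(l_n^1)$ and $\beta_t(l_n^2)$ do not depend on $y^t_{l_n}$ and are therefore the same. For any backward variable at an earlier time step we can therefore also replace every $y^t_{l_n}$ with the sum $(y^t_{l_n^1} + y^t_{l_n^2})$.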
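The substitution of the network output by the sum over the options can be sketched as a forward recursion in which every label position is a set of allowed characters. This is a simplified illustration and not the OCRopus implementation: the function name `fuzzy_ctc_forward` and the example outputs are hypothetical, and the rule for repeated characters is only applied when two option sets are identical, which is an assumption of this sketch.

```python
BLANK = "."


def fuzzy_ctc_forward(fuzzy_label, probs):
    """Forward recursion where each label position is a set of allowed characters.

    probs[t][c] is the network output for class c at time step t.
    """
    blank_set = frozenset([BLANK])
    ext = [blank_set]
    for options in fuzzy_label:
        ext += [frozenset(options), blank_set]
    T, N = len(probs), len(ext)

    def y(t, n):
        # summed network output over all options at position n (cf. Eq. 11)
        return sum(probs[t][c] for c in ext[n])

    alpha = [[0.0] * N for _ in range(T)]
    alpha[0][0] = y(0, 0)
    if N > 1:
        alpha[0][1] = y(0, 1)
    for t in range(1, T):
        for n in range(N):
            is_blank = ext[n] == blank_set
            repeated = n >= 2 and ext[n] == ext[n - 2]
            start = n - 1 if (is_blank or repeated) else n - 2
            total = sum(alpha[t - 1][i] for i in range(max(start, 0), n + 1))
            alpha[t][n] = y(t, n) * total
    return alpha[T - 1][N - 1] + (alpha[T - 1][N - 2] if N > 1 else 0.0)


# Hypothetical network outputs for an ambiguous first character ("n" or "u").
probs = [
    {"n": 0.4, "u": 0.3, "a": 0.1, BLANK: 0.2},
    {"n": 0.1, "u": 0.1, "a": 0.6, BLANK: 0.2},
    {"n": 0.1, "u": 0.1, "a": 0.3, BLANK: 0.5},
]
p_fuzzy = fuzzy_ctc_forward([{"n", "u"}, {"a"}], probs)
p_n = fuzzy_ctc_forward([{"n"}, {"a"}], probs)
p_u = fuzzy_ctc_forward([{"u"}, {"a"}], probs)
print(p_fuzzy, p_n + p_u)
```

With single-element sets the function reduces to the standard forward recursion. In this sketch the fuzzy likelihood is at least as large as the sum of the two single-option likelihoods, because consecutive time steps aligned to the same position may mix both options.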

ICPRAM 2018 - 7th International Conference on Pattern Recognition Applications and Methods


Table 1: Common confusions for historical documents in

Latin script. All confusions are both ways.

p l l l l l l t a

P t 1 r i / r

n n 1 c a a b f

m u i e o u h

3 EXPERIMENTAL SETUP

Due to the high cost of manual transcriptions, rather

than reannotating our data, we decided to syntheti-

cally generate fuzzy ground truth in two different sce-

narios based on existing annotations.

3.1 Scenario I

In the first scenario we consider annotations by an expert and compare them with synthetic non-expert and fuzzy ground truth annotations. In this experiment the expert's annotation has an accuracy of 100% and serves as a baseline. The non-expert's annotation has a 50% chance of correctly choosing between two options for randomly selected characters. For the fuzzy ground truth both the wrong and the correct transcription options are given. The fuzzy ground truth will therefore have exactly twice as many fuzzy characters as the non-expert's ground truth has errors in the worst case scenario. While the ambiguous characters are selected randomly, the possible confusions are selected to be realistic. For a full list of confusions see Table 1. All confusions are 5% and symmetric.
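The generation of the synthetic annotations can be sketched as follows. This is only one possible reading of the setup: the confusion pairs are a hypothetical subset loosely based on Table 1, the 5% figure is interpreted as the probability that a confusable character is treated as ambiguous, and the function name `make_annotations` is an assumption.

```python
import random

# Hypothetical subset of symmetric confusion pairs, loosely based on Table 1.
CONFUSIONS = {"n": "u", "u": "n", "c": "e", "e": "c", "b": "h", "h": "b"}


def make_annotations(text, rate=0.05, seed=0):
    """Return (non_expert, fuzzy) annotations for an expert transcription.

    non_expert is a string in which each ambiguous character is correct with
    probability 0.5; fuzzy is a list of option sets, one per character.
    """
    rng = random.Random(seed)
    non_expert, fuzzy = [], []
    for ch in text:
        # assumption: a confusable character becomes ambiguous with probability `rate`
        ambiguous = ch in CONFUSIONS and rng.random() < rate
        if ambiguous:
            alt = CONFUSIONS[ch]
            non_expert.append(rng.choice([ch, alt]))  # 50% chance of being correct
            fuzzy.append({ch, alt})                   # keep both options
        else:
            non_expert.append(ch)
            fuzzy.append({ch})
    return "".join(non_expert), fuzzy


non_expert, fuzzy = make_annotations("nunc bene", rate=1.0, seed=1)
print(non_expert)
print(fuzzy)
```

By construction the fuzzy annotation always contains the expert label, whereas the non-expert annotation loses it whenever the wrong option is chosen.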

3.2 Scenario II

A second scenario for fuzzy ground truth arises from

the work of (Jenckel et al., 2016). In their setup

they combine clustering with BILSTM and manage

to improve the ground truth generated via clustering

through LSTM training.

K-Means clustering results in hard bounded clusters, assigning each point to the closest cluster center. To generate fuzzy ground truth, Gaussian mixture model (GMM) clustering is used instead. It is a probabilistic approach and well established within the machine learning community (Reynolds, 2015). GMM clustering assigns a probability $p(y = k_i | x)$ that a data point x is in cluster $k_i$. This allows us to naturally generate fuzzy ground truth based on the posterior probabilities. The GMM algorithm optimizes the sum of Gaussians $g(x|\mu_i, \sigma_i)$ with mean $\mu_i$ and covariance $\sigma_i$, weighted by mixture weights $w_i$:

$$p(x|\lambda) = \sum_{i} w_i\, g(x|\mu_i, \sigma_i) \tag{12}$$

using an Expectation-Maximization (EM) algorithm. The algorithm is initialized with the results of a k-Means clustering.
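How the posterior probabilities follow from the mixture density can be shown with a small one-dimensional sketch. The setup in this paper uses multi-dimensional features, so the scalar Gaussians, the function names and the parameter values below are assumptions for illustration only; `sigmas` holds standard deviations here rather than covariances.

```python
import math


def gaussian(x, mu, sigma):
    """Density of a one-dimensional Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def posteriors(x, weights, mus, sigmas):
    """Posterior probability of every mixture component for x via Bayes' rule."""
    joint = [w * gaussian(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
    total = sum(joint)  # the mixture density p(x | lambda) of Eq. (12)
    return [j / total for j in joint]


# Hypothetical mixture with three components.
weights = [0.5, 0.3, 0.2]
mus = [0.0, 1.0, 4.0]
sigmas = [1.0, 1.0, 1.0]
print(posteriors(0.6, weights, mus, sigmas))
```

A point between two nearby components receives a noticeable posterior from both, which is the situation in which more than one label would be kept.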

Before clustering, the text lines were segmented into individual characters using a connected-component based algorithm and then resized to a uniform size of 32x32 pixels while keeping the aspect ratio constant. In both cases the algorithm follows the description given in (Jenckel et al., 2016). To reduce the high dimensionality of the raw features for the GMM, a PCA has been added to the pipeline.

3.3 OCRopus

As a base for the LSTM implementation we used OCRopus¹, a complete OCR software package. Its implementation has been shown to produce state-of-the-art results while being easy to use (Breuel et al., 2013). Besides the LSTM framework it provides us with a text line normalization method based on Gaussian filtering and affine transformation as described in (Yousefi et al., 2015).

3.4 Data

The data set consists of 100 binarized pages and the corresponding transcription of a Latin version of the 15th century novel "Narrenschiff"². One of the main challenges when working with historical documents is the comparably small size of the data sets. In order to train on as many samples as possible, only 2 pages were used as the test data set, leaving 3263 text lines with a total of 83780 characters for training. The test set consists of 103 lines with 2876 characters.

3.5 Parameters

In both scenarios some free parameters had to be set. For the clustering, the number of components used in PCA as well as the number of cluster centers in the GMM had to be chosen. For the PCA the top 10 components were used. Setting the number of cluster centers for the GMM clustering to k = 200 follows the same strategy as described in (Jenckel et al., 2016). The idea is to largely overestimate the number of clusters during clustering. During the following annotation of the clusters, similar clusters provided with the same label are indirectly merged.

¹ https://github.com/tmbdev/ocropy

² http://kallimachos.de/kallimachos/index.php

Table 2: Results for Scenario I (Section 3.1). Comparison of the Character Error Rates (CER, in %) for the LSTM-RNN training with perfect ground truth from an expert, worst case ground truth from a non-expert and fuzzy ground truth containing both transcription options.

             T0    T1    T2
Expert       0     2.9   2.9
Non-Expert   3.3   4.2   3.8
Fuzzy        6.7   2.7   3.0

For the LSTM implementation in OCRopus some parameters have to be set as well. OCRopus only supports a single hidden layer, whose size is set to 100. Another parameter is the size of the input layer, which is equivalent to the height of the text lines. As reported in (Jenckel et al., 2016), 48 is a reasonable value. Learning rate and momentum were kept at the default values of 1e-4 and 0.9.

4 RESULT

For the evaluation of the two proposed scenarios from Sections 3.1 and 3.2, we report the Character Error Rate (CER) of the model with the lowest CER on seen and unseen data. All CERs are calculated using the Levenshtein distance. For seen data we use a 3-page subset of the training data called T1. The best model on T1 is then evaluated on our 2-page test set T2. The CER on the training data set T0 describes the error introduced to the ground truth by wrong annotations. The expert's annotation serves as a baseline and is considered to be perfect (0% error). For fuzzy ground truth we instead give the percentage of characters with multiple annotations.
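A CER based on the Levenshtein distance can be computed as in the following sketch. The normalization by the total number of reference characters is a common convention and an assumption here, since the text does not state it; the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def cer(predictions, references):
    """Character Error Rate in percent over a list of text lines."""
    errors = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return 100.0 * errors / chars


print(cer(["nunc bcne"], ["nunc bene"]))  # one substitution in nine characters
```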

The results for the first scenario are shown in Table 2. While the ground truth from an expert leads to state-of-the-art CER for this data set, the non-expert's ground truth with 3.3% total error leads to 4.2% CER after training and 3.8% on the test set. This is in line with the results from (Jenckel et al., 2016). Even though the error in the ground truth was increased by 3.3 percentage points, the resulting error compared to perfect ground truth only rises by about 1 percentage point. The LSTM trained with fuzzy ground truth, however, achieves almost the same accuracy as the one trained with the 100% accurate ground truth from the expert, even though 6.5% of the data was fuzzy. The differences between the model trained with the expert's transcription and the one trained with the fuzzy transcription are within the normal variance.

In the second scenario we used the pipeline described in (Jenckel et al., 2016) but replaced the hard bounded k-Means clustering with the probabilistic GMM clustering. After segmentation, clustering and manual annotation our accuracy was rather low with a CER of 27.4%. For the fuzzy ground truth we assigned the label of every cluster with a posterior probability higher than 0.05. For comparison we also trained LSTM models on ground truth generated by assigning to every character the label of the cluster with the highest probability, the label of the second most probable cluster and the label of the third most probable cluster (if available). We evaluated on the same data sets T1 and T2. The results are shown in Table 3. Due to the high CER after clustering, the CERs are still high. However, LSTM training increased the overall performance significantly, cutting the CER by one third. Choosing the second most probable cluster is the better choice for some characters, slightly improving the ground truth. However, it is also worse for other data points and leads to a worse overall performance. Training on fuzzy ground truth performed on the same level as taking the most probable cluster for each data point, while outperforming the two alternative annotations. Similar to the previous scenario, the rate of fuzzy characters in the ground truth was 6.5%.

Table 3: Results for Scenario II (Section 3.2). Comparison of the Character Error Rates (CER, in %) for LSTM-RNN training after GMM clustering. In the case of fuzzy ground truth the CER on T0 is the rate of characters with multiple non-identical transcriptions.

         T0     T1     T2
1st      32.9   24.5   22.1
2nd      32.4   26.6   27.3
3rd      33.5   27.7   28.9
Fuzzy    6.5    24.3   23.1
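Deriving the fuzzy and the ranked annotations from the cluster posteriors can be sketched as below. The 0.05 threshold is taken from the text; the function names, the representation of cluster labels as a list and the choice to return `None` when no cluster of the requested rank exists are assumptions of this sketch.

```python
def fuzzy_labels(posterior, cluster_labels, threshold=0.05):
    """Labels of all clusters whose posterior probability exceeds the threshold.

    Clusters that were annotated with the same label collapse into one option.
    """
    return {cluster_labels[k] for k, p in enumerate(posterior) if p > threshold}


def ranked_label(posterior, cluster_labels, rank=0):
    """Label of the cluster with the rank-th highest posterior (0 = most probable)."""
    order = sorted(range(len(posterior)), key=lambda k: posterior[k], reverse=True)
    if rank >= len(order):
        return None  # assumption: no annotation if no such cluster is available
    return cluster_labels[order[rank]]


# Hypothetical posteriors of one character image over three annotated clusters.
posterior = [0.90, 0.07, 0.03]
labels = ["n", "u", "m"]
print(fuzzy_labels(posterior, labels))     # clusters above 0.05: "n" and "u"
print(ranked_label(posterior, labels, 1))  # second most probable cluster: "u"
```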

5 CONCLUSION AND OUTLOOK

In this paper we have proposed a time-saving change in annotating historical documents using fuzzy ground truth. We showed that, in theory, non-experts' annotations can lead to results similar to those of experts. We also explained how LSTM-RNNs with CTC can handle fuzzy ground truth and showed that training an LSTM with fuzzy ground truth, where at least one of the transcription options is the correct ground truth, does not reduce the performance significantly. We conclude that the LSTM-RNN with CTC can successfully select the correct option from the provided possibilities through statistical language modeling.

With regard to the poor results when applying GMM clustering to the segmented data, we conclude that GMM is not a good choice for this type of clustering. However, it provided us with a natural way of generating fuzzy ground truth in the context of the "anyOCR" pipeline proposed in (Jenckel et al., 2016).

The scope of this paper only covered the feasibility of non-expert annotations, which was shown through the use of synthetic data. Therefore we plan further evaluation with real non-expert annotations. While we have shown that training on fuzzy ground truth can be beneficial in the area of historical documents, further analysis on different scripts and document types is needed as well.

In the future we plan to further explore the possibilities of using fuzzy ground truth when training LSTM networks, such as ground truth options with different segmentations like "m" and "in". Another future focus will be on automatically generating fuzzy ground truth. While the proposed method reduces the need for language experts, it is still costly to annotate the data by hand.

ACKNOWLEDGEMENTS

This work was partially funded by the BMBF (Ger-

man Federal Ministry of Education and Research),

project Kallimachos (01UG1415C).

REFERENCES

Althoff, T., Borth, D., Hees, J., and Dengel, A. (2013).

Analysis and forecasting of trending topics in online

media streams. In Proceedings of the 21st ACM inter-

national conference on Multimedia, pages 907–916.

ACM.

Graves, A. (2012). Supervised sequence labelling. In Super-

vised Sequence Labelling with Recurrent Neural Net-

works, pages 5–13. Springer.

Graves, A., Fernández, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. In ICML'06.

Hochreiter, S. and Schmidhuber, J. (1997). Long short-term

memory. In Neural Computation, pages 1735–1780.

Jenckel, M., Bukhari, S. S., and Dengel, A. (2016). anyOCR: A sequence learning based OCR system for unlabeled historical documents. In 23rd International Conference on Pattern Recognition (ICPR'16), Mexico.

Karayil, T., Ul-Hasan, A., and Breuel, T. M. (2015). A

Segmentation-Free Approach for Printed Devanagari

Script Recognition. In ICDAR, Tunisia.

Reynolds, D. (2015). Gaussian mixture models. In Ency-

clopedia of biometrics, pages 827–832. Springer.

Simistira, F., Ul-Hasan, A., Papavassiliou, V., Gatos, B.,

Katsouros, V., and Liwicki, M. (2015). Recognition of

Historical Greek Polytonic Scripts Using LSTM Net-

works. In ICDAR, Tunisia.

Breuel, T. M., Ul-Hasan, A., Al Azawi, M., and Shafait, F. (2013). High Performance OCR for Printed English and Fraktur using LSTM Networks. In ICDAR, Washington D.C. USA.

Ul-Hasan, A., Ahmed, S. B., Rashid, S. F., Shafait, F., and Breuel, T. M. Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks. In ICDAR'13, USA.

Wang, H., Klaser, A., Schmid, C., and Liu, C.-L. (2011).

Action recognition by dense trajectories. pages 3169–

3176. IEEE.

Werbos, P. (1990). Backpropagation through time: what

does it do and how to do it. In Proceedings of IEEE,

volume 78.

You, Q., Luo, J., Jin, H., and Yang, J. (2015). Robust im-

age sentiment analysis using progressively trained and

domain transferred deep networks. In CoRR, volume

abs/1509.06041.

Yousefi, M. R., Soheili, M. R., Breuel, T. M., and Stricker, D. (2015). A Comparison of 1D and 2D LSTM Architectures for Recognition of Handwritten Arabic. In DRR-XXI, USA.
