CODE AND DATA AVAILABILITY
All data are openly provided by the International Conference on Document Analysis and Recognition (ICDAR) 2017 (Chiron et al., 2017) and 2019 (Rigaud et al., 2019) competitions on post-OCR text correction.
Our code base is publicly available and documented at https://doi.org/10.5281/zenodo.5799211 (Todorov and Colavizza, 2021).
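As a convenience, the following minimal Python sketch shows one way to list the files of the deposited archive programmatically. It is not part of the released code base; it assumes that the Zenodo record ID matches the DOI suffix (5799211) and that the public Zenodo REST API returns the usual metadata and files fields.

    import requests

    RECORD_ID = "5799211"  # assumed to match the suffix of DOI 10.5281/zenodo.5799211

    # Query the public Zenodo REST API for the record's metadata and file listing.
    response = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
    response.raise_for_status()
    record = response.json()

    # Print the deposit title and the name, size, and download link of each archived file
    # (field names may differ slightly across Zenodo API versions, hence the .get calls).
    print(record.get("metadata", {}).get("title"))
    for entry in record.get("files", []):
        print(entry.get("key"), entry.get("size"), entry.get("links", {}).get("self"))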
REFERENCES
Banar, N., Lasaracina, K., Daelemans, W., and Kestemont, M. (2020). Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence, pages 522–529, Valletta, Malta. SCITEPRESS - Science and Technology Publications.
Boros, E., Hamdi, A., Linhares Pontes, E., Cabrera-Diego, L. A., Moreno, J. G., Sidere, N., and Doucet, A. (2020). Alleviating Digitization Errors in Named Entity Recognition for Historical Documents. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 431–441, Online. Association for Computational Linguistics.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language Models are Few-Shot Learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Chiron, G., Doucet, A., Coustaty, M., and Moreux, J.-P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1423–1428. IEEE.
Church, K. and Hanks, P. (1989). Word Association Norms, Mutual Information, and Lexicography. In 27th Annual Meeting of the Association for Computational Linguistics, pages 76–83, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
Clark, J. H., Garrette, D., Turc, I., and Wieting, J. (2021). Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation.
Coll Ardanuy, M., Nanni, F., Beelen, K., Hosseini, K., Ahnert, R., Lawrence, J., McDonough, K., Tolfo, G., Wilson, D. C., and McGillivray, B. (2020). Living Machines: A study of atypical animacy. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4534–4545, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ehrmann, M., Colavizza, G., Rochat, Y., and Kaplan, F.
(2016). Diachronic Evaluation of NER Systems on
Old Newspapers.
Ehrmann, M., Romanello, M., Flückiger, A., and Clematide, S. (2020). Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., and Ferro, N., editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume 12260, pages 288–310. Springer International Publishing, Cham. Series Title: Lecture Notes in Computer Science.
Gerz, D., Vulić, I., Ponti, E. M., Reichart, R., and Korhonen, A. (2018). On the Relation between Linguistic Typology and (Limitations of) Multilingual Language Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.
Gonen, H., Jawahar, G., Seddah, D., and Goldberg, Y. (2020). Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 538–555. Association for Computational Linguistics.
Hämäläinen, M. and Hengchen, S. (2019). From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436, Varna, Bulgaria. INCOMA Ltd.
Hamdi, A., Jean-Caurant, A., Sidere, N., Coustaty, M., and Doucet, A. (2019). An Analysis of the Performance of Named Entity Recognition over OCRed Documents. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 333–334, Champaign, IL, USA. IEEE.
Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., and Klakow, D. (2021). A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. arXiv:2010.12309 [cs].
Heinzerling, B. and Strube, M. (2018). BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Hill, M. J. and Hengchen, S. (2019). Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34(4):825–843.