CODE AND DATA AVAILABILITY
All data are openly provided by the International Conference on Document Analysis and Recognition (ICDAR) 2017 (Chiron et al., 2017) and 2019 (Rigaud et al., 2019) competitions on post-OCR text correction.
Our code base is publicly available and documented at https://doi.org/10.5281/zenodo.5799211 (Todorov and Colavizza, 2021).
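As a convenience, the following minimal Python sketch shows one way to list the files of the deposited archive programmatically. It is not part of the released code base; it assumes that the Zenodo record ID matches the DOI suffix (5799211) and that the public Zenodo REST API returns the usual metadata and files fields.

    import requests

    RECORD_ID = "5799211"  # assumed to match the suffix of DOI 10.5281/zenodo.5799211

    # Query the public Zenodo REST API for the record's metadata and file listing.
    response = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
    response.raise_for_status()
    record = response.json()

    # Print the deposit title and the name, size, and download link of each archived file
    # (field names may differ slightly across Zenodo API versions, hence the .get calls).
    print(record.get("metadata", {}).get("title"))
    for entry in record.get("files", []):
        print(entry.get("key"), entry.get("size"), entry.get("links", {}).get("self"))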
REFERENCES
Banar, N., Lasaracina, K., Daelemans, W., and Kestemont, M. (2020). Transfer Learning for Digital Heritage Collections: Comparing Neural Machine Translation at the Subword-level and Character-level. In Proceedings of the 12th International Conference on Agents and Artificial Intelligence, pages 522–529, Valletta, Malta. SCITEPRESS - Science and Technology Publications.
Boros, E., Hamdi, A., Linhares Pontes, E., Cabrera-Diego, L. A., Moreno, J. G., Sidere, N., and Doucet, A. (2020). Alleviating Digitization Errors in Named Entity Recognition for Historical Documents. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 431–441, Online. Association for Computational Linguistics.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. (2020). Language Models are Few-Shot Learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
Chiron, G., Doucet, A., Coustaty, M., and Moreux, J.-P. (2017). ICDAR 2017 Competition on Post-OCR Text Correction. In 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, pages 1423–1428. IEEE.
Church, K. and Hanks, P. (1989). Word Association Norms, Mutual Information, and Lexicography. In 27th Annual Meeting of the Association for Computational Linguistics, pages 76–83, Vancouver, British Columbia, Canada. Association for Computational Linguistics.
Clark, J. H., Garrette, D., Turc, I., and Wieting, J. (2021). Canine: Pre-training an Efficient Tokenization-Free Encoder for Language Representation.
Coll Ardanuy, M., Nanni, F., Beelen, K., Hosseini, K., Ahnert, R., Lawrence, J., McDonough, K., Tolfo, G., Wilson, D. C., and McGillivray, B. (2020). Living Machines: A study of atypical animacy. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4534–4545, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Ehrmann, M., Colavizza, G., Rochat, Y., and Kaplan, F.
(2016). Diachronic Evaluation of NER Systems on
Old Newspapers.
Ehrmann, M., Romanello, M., Flückiger, A., and Clematide, S. (2020). Overview of CLEF HIPE 2020: Named Entity Recognition and Linking on Historical Newspapers. In Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., and Ferro, N., editors, Experimental IR Meets Multilinguality, Multimodality, and Interaction, volume 12260, pages 288–310. Springer International Publishing, Cham. Series Title: Lecture Notes in Computer Science.
Gerz, D., Vulić, I., Ponti, E. M., Reichart, R., and Korhonen, A. (2018). On the Relation between Linguistic Typology and (Limitations of) Multilingual Language Modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 316–327, Brussels, Belgium. Association for Computational Linguistics.
Gonen, H., Jawahar, G., Seddah, D., and Goldberg, Y. (2020). Simple, Interpretable and Stable Method for Detecting Words with Usage Change across Corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 538–555. Association for Computational Linguistics.
Hämäläinen, M. and Hengchen, S. (2019). From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436, Varna, Bulgaria. INCOMA Ltd.
Hamdi, A., Jean-Caurant, A., Sidere, N., Coustaty, M., and Doucet, A. (2019). An Analysis of the Performance of Named Entity Recognition over OCRed Documents. In 2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 333–334, Champaign, IL, USA. IEEE.
Hedderich, M. A., Lange, L., Adel, H., Strötgen, J., and Klakow, D. (2021). A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. arXiv:2010.12309 [cs].
Heinzerling, B. and Strube, M. (2018). BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Hill, M. J. and Hengchen, S. (2019). Quantifying the impact of dirty OCR on historical text analysis: Eighteenth Century Collections Online as a case study. Digital Scholarship in the Humanities, 34(4):825–843.