An Assessment of the Impact of OCR Noise on Language Models

Konstantin Todorov, Giovanni Colavizza

2022

Abstract

Neural language models are the backbone of modern natural language processing applications. Their use on textual heritage collections that have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We assess the impact of OCR noise on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality decreases. In the presence of small corpora, simpler models including PPMI and Word2Vec consistently outperform transformer-based models in this respect.


Paper Citation


in Harvard Style

Todorov K. and Colavizza G. (2022). An Assessment of the Impact of OCR Noise on Language Models. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-547-0, pages 674-683. DOI: 10.5220/0010945100003116


in Bibtex Style

@conference{icaart22,
author={Konstantin Todorov and Giovanni Colavizza},
title={An Assessment of the Impact of OCR Noise on Language Models},
booktitle={Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2022},
pages={674-683},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010945100003116},
isbn={978-989-758-547-0},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - An Assessment of the Impact of OCR Noise on Language Models
SN - 978-989-758-547-0
AU - Todorov K.
AU - Colavizza G.
PY - 2022
SP - 674
EP - 683
DO - 10.5220/0010945100003116