An Assessment of the Impact of OCR Noise on Language Models
Konstantin Todorov, Giovanni Colavizza
2022
Abstract
Neural language models are the backbone of modern-day natural language processing applications. Their use on textual heritage collections which have undergone Optical Character Recognition (OCR) is therefore also increasing. Nevertheless, our understanding of the impact OCR noise could have on language models is still limited. We perform an assessment of the impact OCR noise has on a variety of language models, using data in Dutch, English, French and German. We find that OCR noise poses a significant obstacle to language modelling, with language models increasingly diverging from their noiseless targets as OCR quality lowers. In the presence of small corpora, simpler models including PPMI and Word2Vec consistently outperform transformer-based models in this respect.
Paper Citation
in Harvard Style
Todorov K. and Colavizza G. (2022). An Assessment of the Impact of OCR Noise on Language Models. In Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART, ISBN 978-989-758-547-0, pages 674-683. DOI: 10.5220/0010945100003116
in Bibtex Style
@conference{icaart22,
author={Konstantin Todorov and Giovanni Colavizza},
title={An Assessment of the Impact of OCR Noise on Language Models},
booktitle={Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},
year={2022},
pages={674-683},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010945100003116},
isbn={978-989-758-547-0},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 14th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - An Assessment of the Impact of OCR Noise on Language Models
SN - 978-989-758-547-0
AU - Todorov K.
AU - Colavizza G.
PY - 2022
SP - 674
EP - 683
DO - 10.5220/0010945100003116