Boosting Language Models for Real-Word Error Detection

Corina Masanti, Hans-Friedrich Witschel, Kaspar Riesen

2025

Abstract

The introduction of transformer-based language models has significantly advanced research on error detection in text documents. However, important challenges remain. In this paper, we address the specific challenge of detecting real-word errors, i.e., words that are syntactically correct but semantically incorrect given the sentence context. In particular, we study three categories of frequent real-word errors in German: verb conjugation errors, case errors, and capitalization errors. To address the scarcity of training data, especially for languages other than English, we propose to systematically incorporate synthetic data into the training process. To this end, we employ ensemble learning methods for language models; specifically, we adapt the boosting technique to language model learning. Our experimental evaluation shows that incorporating synthetic data in a non-systematic way increases recall but lowers precision. In contrast, the proposed boosting approach improves the recall of the language model while maintaining its high precision.
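The abstract does not spell out how boosting is adapted to language models. For orientation only, the sketch below shows one round of the classic AdaBoost example reweighting, in which examples the current model gets wrong receive more weight in the next round. This is the textbook procedure, not necessarily the authors' variant; the function name `adaboost_round` and the toy inputs are assumptions made for illustration.

```python
import math


def adaboost_round(weights, correct):
    """One round of classic AdaBoost reweighting (illustrative, hypothetical helper).

    weights: current example weights, assumed to sum to 1.
    correct: list of bools, True where the current model was right.
    Returns (alpha, new_weights).
    """
    # Weighted error of the current model.
    err = sum(w for w, c in zip(weights, correct) if not c)
    # Clamp to keep the logarithm finite for perfect or useless models.
    eps = 1e-10
    err = min(max(err, eps), 1 - eps)
    # Model weight: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Up-weight misclassified examples, down-weight correct ones.
    updated = [w * math.exp(-alpha if c else alpha)
               for w, c in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


# Toy input (assumed): four equally weighted examples, the last one misclassified.
weights = [0.25] * 4
correct = [True, True, True, False]
alpha, new_weights = adaboost_round(weights, correct)
print(alpha, new_weights)
```

After this update the misclassified example carries half of the total weight, so the next model in the ensemble is pushed to focus on it.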


Paper Citation


in Harvard Style

Masanti C., Witschel H. and Riesen K. (2025). Boosting Language Models for Real-Word Error Detection. In Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM; ISBN 978-989-758-730-6, SciTePress, pages 318-325. DOI: 10.5220/0013251500003905


in Bibtex Style

@conference{icpram25,
author={Corina Masanti and Hans-Friedrich Witschel and Kaspar Riesen},
title={Boosting Language Models for Real-Word Error Detection},
booktitle={Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2025},
pages={318-325},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013251500003905},
isbn={978-989-758-730-6},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 14th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Boosting Language Models for Real-Word Error Detection
SN - 978-989-758-730-6
AU - Masanti C.
AU - Witschel H.
AU - Riesen K.
PY - 2025
SP - 318
EP - 325
DO - 10.5220/0013251500003905
PB - SciTePress