Authors:
Corina Masanti (1, 2), Hans-Friedrich Witschel (2) and Kaspar Riesen (1)
Affiliations:
(1) Institute of Computer Science, University of Bern, 3012 Bern, Switzerland
(2) Institute for Information Systems, University of Applied Sciences and Arts Northwestern Switzerland, 4600 Olten, Switzerland
Keyword(s):
Boosting Techniques, Language Models, Synthetic Data, Real-Word Errors.
Abstract:
With the introduction of transformer-based language models, research on error detection in text documents has advanced significantly. However, important research challenges remain. In the present paper, we address the specific challenge of detecting real-word errors, i.e., words that are syntactically correct but semantically incorrect given the sentence context. In particular, we study three categories of frequent real-word errors in German, viz. verb conjugation errors, case errors, and capitalization errors. To address the scarcity of training data, especially for languages other than English, we propose to systematically incorporate synthetic data into the training process. To this end, we employ ensemble learning methods for language models. In particular, we propose to adapt the boosting technique to language model learning. Our experimental evaluation reveals that incorporating synthetic data in a non-systematic way enhances recall but lowers precision. In contrast, the proposed boosting approach improves the recall of the language model while maintaining its high precision.
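The abstract does not spell out how the synthetic errors are generated or how boosting is adapted to language models, so the following is only a minimal sketch under stated assumptions. It shows (a) one possible way to create a synthetic German capitalization error by flipping the case of a token's first letter, and (b) the classic AdaBoost re-weighting step, in which misclassified training examples receive more weight in the next round. The function names `flip_capitalization` and `adaboost_reweight` are hypothetical and are not taken from the paper; the authors' actual procedure may differ.

```python
import math


def flip_capitalization(tokens, index):
    """Return a copy of tokens with the first letter of tokens[index] case-flipped.

    Hypothetical helper: in German, "Essen" (noun) vs. "essen" (verb) are both
    valid words, so the flip can yield a real-word error.
    """
    word = tokens[index]
    first = word[0].lower() if word[0].isupper() else word[0].upper()
    return tokens[:index] + [first + word[1:]] + tokens[index + 1:]


def adaboost_reweight(weights, correct):
    """One classic AdaBoost round (an assumption, not the paper's variant).

    weights: current example weights; correct: booleans, True where the
    current model classified the example correctly.
    Returns (alpha, new_weights) with new_weights normalized to sum to 1.
    """
    total = sum(weights)
    err = sum(w for w, c in zip(weights, correct) if not c) / total
    # Clamp to avoid division by zero or log(0) for perfect/failed models.
    err = min(max(err, 1e-10), 1 - 1e-10)
    alpha = 0.5 * math.log((1 - err) / err)
    # Down-weight correctly classified examples, up-weight misclassified ones.
    scaled = [w * math.exp(-alpha if c else alpha) for w, c in zip(weights, correct)]
    norm = sum(scaled)
    return alpha, [w / norm for w in scaled]


synthetic = flip_capitalization(["Das", "Essen", "schmeckt"], 1)
alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
```

In this sketch, the single misclassified example ends up carrying half of the total weight after one round, which is the standard AdaBoost behavior: the next model in the ensemble is pushed to focus on the examples the previous one got wrong.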