Stealing Brains: From English to Czech Language Model

Petr Hyner, Petr Marek, David Adamczyk, Jan Hůla, Jan Šedivý

2024

Abstract

We present a simple approach for efficiently adapting pre-trained English language models to generate text in a lower-resource language, specifically Czech. We propose a vocabulary swap method that leverages parallel corpora to map tokens between languages, allowing the model to retain much of its learned capabilities. Experiments conducted on a Czech translation of the TinyStories dataset demonstrate that our approach significantly outperforms baseline methods, especially when only small amounts of training data are used. With only 10% of the data, our method achieves a perplexity of 17.89, compared to 34.19 for the next best baseline. We aim to contribute to work on cross-lingual transfer in natural language processing by proposing a simple-to-implement, computationally efficient method tested in a controlled environment.
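
The abstract does not spell out how the vocabulary swap is carried out, so the following is only a minimal sketch of one plausible reading: each Czech token inherits the embedding of the English token it is mapped to, and unmapped tokens fall back to the mean English embedding. The function name `swap_vocabulary`, the `cs_to_en` mapping, the toy vectors, and the mean fallback are all assumptions made for illustration, not details taken from the paper.

```python
from statistics import fmean


def swap_vocabulary(en_embeddings, en_vocab, cs_vocab, cs_to_en):
    """Build a Czech embedding table from an English one (illustrative only).

    en_embeddings: list of vectors (lists of floats), one per English token id
    en_vocab:      dict mapping English token string -> row index
    cs_vocab:      list of Czech token strings; list index is the new token id
    cs_to_en:      hypothetical Czech -> English token mapping, assumed to
                   have been derived from a parallel corpus beforehand
    """
    dim = len(en_embeddings[0])
    # Assumed fallback for unmapped tokens: the mean English embedding.
    mean_vec = [fmean(row[d] for row in en_embeddings) for d in range(dim)]

    cs_embeddings = []
    for token in cs_vocab:
        en_token = cs_to_en.get(token)
        if en_token in en_vocab:
            # Copy the row of the aligned English token.
            cs_embeddings.append(list(en_embeddings[en_vocab[en_token]]))
        else:
            cs_embeddings.append(list(mean_vec))
    return cs_embeddings


# Toy usage with two-dimensional embeddings.
en_embeddings = [[1.0, 0.0], [0.0, 1.0]]
en_vocab = {"dog": 0, "cat": 1}
cs_vocab = ["pes", "kočka", "xyz"]
cs_to_en = {"pes": "dog", "kočka": "cat"}
cs_embeddings = swap_vocabulary(en_embeddings, en_vocab, cs_vocab, cs_to_en)
```

In this sketch only the embedding table changes; the rest of the English model would be kept as is and then fine-tuned on Czech text, which is in line with the abstract's claim that the model retains much of what it has learned.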


Paper Citation


in Harvard Style

Hyner P., Marek P., Adamczyk D., Hůla J. and Šedivý J. (2024). Stealing Brains: From English to Czech Language Model. In Proceedings of the 16th International Joint Conference on Computational Intelligence - Volume 1: NCTA; ISBN 978-989-758-721-4, SciTePress, pages 606-612. DOI: 10.5220/0013064500003837


in Bibtex Style

@conference{ncta24,
author={Petr Hyner and Petr Marek and David Adamczyk and Jan Hůla and Jan Šedivý},
title={Stealing Brains: From English to Czech Language Model},
booktitle={Proceedings of the 16th International Joint Conference on Computational Intelligence - Volume 1: NCTA},
year={2024},
pages={606-612},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013064500003837},
isbn={978-989-758-721-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 16th International Joint Conference on Computational Intelligence - Volume 1: NCTA
TI - Stealing Brains: From English to Czech Language Model
SN - 978-989-758-721-4
AU - Hyner P.
AU - Marek P.
AU - Adamczyk D.
AU - Hůla J.
AU - Šedivý J.
PY - 2024
SP - 606
EP - 612
DO - 10.5220/0013064500003837
PB - SciTePress