perplexity between models. Comparing models that use different tokenizers is non-trivial, since the perplexity metric cannot be used as is. In our setting, this effectively ruled out a direct comparison between the proposed method and any method that relies on a different tokenizer.
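To make the dependence on tokenization explicit, the following minimal sketch (in Python, with hypothetical numbers) shows that two models assigning the same probability to a sentence still report different perplexities whenever their tokenizers split the sentence into different numbers of tokens:

import math

# Hypothetical numbers: the same sentence receives the same total negative
# log-likelihood from two models, but tokenizer A splits it into 12 tokens
# and tokenizer B into 20. The per-token averages, and hence the
# perplexities, differ even though the assigned probability is identical.
total_nll = 30.0                             # -log p(sentence), identical for both models
ppl_tokenizer_a = math.exp(total_nll / 12)   # approx. 12.2
ppl_tokenizer_b = math.exp(total_nll / 20)   # approx. 4.5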
Amount of Training. With a larger amount of training, the baselines close the gap to the proposed method, which works best when the amount of training is limited. Training for multiple epochs shows diminishing returns for all baselines considered, as well as for the proposed method.
7 CONCLUSION
In this study, we have demonstrated the potential of
leveraging pre-trained English language models to ef-
fectively adapt and generate coherent text in Czech,
a lower-resource language. Our approach, which relies on a vocabulary swap, significantly reduces the computational costs associated with training language-specific models from scratch. Through
ing language-specific models from scratch. Through
our experiments, we have shown that even with a
small parallel corpus, the adapted model can outper-
form traditional training methods, highlighting the ef-
ficiency of transfer learning in natural language pro-
cessing. Our method is trivial to implement: it only requires finding a partial mapping between the Czech and English tokenizers and then initializing the embeddings of Czech tokens with those of the corresponding English tokens. Future experiments will test this method on more realistic datasets.
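As an illustration, the sketch below shows one possible implementation of this initialization using the Hugging Face transformers API. The model and tokenizer identifiers are placeholders, the partial mapping is taken here to be exact token-string matches, and tied input/output embeddings are left to the library; this is a minimal sketch rather than the exact code used in our experiments.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifiers; substitute the actual English model and Czech tokenizer.
model = AutoModelForCausalLM.from_pretrained("english-base-model")
en_tok = AutoTokenizer.from_pretrained("english-base-model")
cs_tok = AutoTokenizer.from_pretrained("czech-tokenizer")

en_vocab = en_tok.get_vocab()  # token string -> English vocabulary id

with torch.no_grad():
    # Keep a copy of the English embeddings before resizing.
    old_emb = model.get_input_embeddings().weight.detach().clone()

    # Resize the embedding matrix to the Czech vocabulary.
    model.resize_token_embeddings(len(cs_tok))
    new_emb = model.get_input_embeddings().weight

    # Re-initialize all rows, then copy the English embedding for every
    # Czech token whose string also appears in the English vocabulary
    # (the partial mapping); unmatched Czech tokens keep random vectors.
    new_emb.normal_(mean=0.0, std=old_emb.std().item())
    for token, cs_id in cs_tok.get_vocab().items():
        en_id = en_vocab.get(token)
        if en_id is not None:
            new_emb[cs_id] = old_emb[en_id]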
ACKNOWLEDGEMENTS
This article is part of the project "Research of Excellence on Digital Technologies and Wellbeing CZ.02.01.01/00/22_008/0004583", which is co-financed by the European Union. The translation of the parallel corpus, the model training, and the model evaluation were supported by the Ministry of Education, Youth and Sports of the Czech Republic through the
e-INFRA CZ (ID:90254). This article has been pro-
duced with the financial support of the European
Union under the REFRESH – Research Excellence
For REgion Sustainability and High-tech Industries
project number CZ.10.03.01/00/22_003/0000048 via
the Operational Programme Just Transition.