Stealing Brains: From English to Czech Language Model
Petr Hyner¹,², Petr Marek³, David Adamczyk¹,², Jan Hůla¹,³ and Jan Šedivý³
¹Institute for Research and Applications of Fuzzy Modeling, University of Ostrava, Ostrava, Czech Republic
²Department of Informatics and Computers, Faculty of Science, University of Ostrava, Ostrava, Czech Republic
³Czech Technical University in Prague, Prague, Czech Republic
Keywords:
Language Models, Neural Networks, Transfer Learning, Vocabulary Swap.
Abstract:
We present a simple approach for efficiently adapting pre-trained English language models to generate text in a lower-resource language, specifically Czech. We propose a vocabulary swap method that leverages parallel corpora to map tokens between languages, allowing the model to retain much of its learned capabilities.
Experiments conducted on a Czech translation of the TinyStories dataset demonstrate that our approach sig-
nificantly outperforms baseline methods, especially when using small amounts of training data. With only
10% of the data, our method achieves a perplexity of 17.89, compared to 34.19 for the next best baseline. We
aim to contribute to work on cross-lingual transfer in natural language processing by proposing a simple-to-implement, computationally efficient method tested in a controlled environment.
1 INTRODUCTION
State-of-the-art large language models (LLMs)
demonstrate proficiency in English text generation
(OpenAI et al., 2024; Dubey et al., 2024; Jiang et al.,
2023; Anil et al., 2024), while exhibiting compara-
tively limited capabilities in lower-resource languages
(Jin et al., 2023; Wendler et al., 2024). This imbal-
ance can be attributed to a higher representation of
English in training corpora (mostly data scraped from
the Internet) relative to other languages.
There are currently two primary approaches to developing language-specific large language models: training from scratch and fine-tuning existing models. However, the computational resources required to train large language models capable of generating high-quality text from scratch are prohibitively expensive for most academic institutions, research centers, and other organizations. Recent estimates
suggest that training costs for state-of-the-art LLMs
can range from tens to over a hundred million dollars,
primarily due to hardware requirements and energy
consumption.
This contribution proposes an approach that lever-
ages a pre-trained English language model to develop
an adapted model capable of generating coherent text
in Czech. Our aim is not to develop a Czech lan-
guage model that would be competitive with other
Czech models, but to explore the efficiency of different methods in small-sample regimes, where we can conduct experiments quickly and with a small computational budget. Specifically, to demonstrate the efficacy of the presented method, we adapt a GPT Neo model trained on an English corpus to a parallel corpus translated into Czech.
Our research focuses on various techniques for
model adaptation that avoid the need for training from
scratch or full fine-tuning. In doing so, we aim to sig-
nificantly reduce the computational costs associated
with developing language-specific models.
We show that if the target corpus is very small
compared to the original English corpus, the pre-
sented method leads to drastic improvements com-
pared to training from scratch and other baselines.
The text is structured as follows. Section 2 describes the method for fine-tuning a language model using a vocabulary swap. Section 3 explains how we obtained a parallel corpus by translating the original corpus and why having a parallel corpus is useful for training a language model. Section 4 describes the specific models and data with which we worked, along with the baselines against which we tested, and presents experimental results comparing the perplexity of the trained models on a
validation set across models and data splits. In Section 5 we discuss related work. Section 6 discusses the limitations of our approach, and we conclude in Section 7.
2 THE PROPOSED METHOD
In this section, we describe the method that produced
the best results in our experiments. We use the name
source model for the model trained on English texts
and target model for the model adapted to the Czech language.
2.1 Motivation
The motivation for the method is based on the following observation: if the two languages were used to speak about the same topics and shared the same grammar, then the main obstacle to understanding text in the second language would be the different vocabulary. If, furthermore, the tokenizer tokenized sentences into whole words and we had a mapping from English to Czech words, then we could simply replace the Czech words with their English counterparts, use the source model to generate text in English, and finally replace the English words in the generated text with their Czech counterparts.
The assumption that texts in Czech and English are very similar in meaning is probably benign, and therefore we conduct our experiments on a Czech translation of the English corpus on which the source model was trained. Unfortunately, Czech and English
do not have the same grammar and word order, and
therefore the source model would need to be fine-
tuned for the new grammar. It could be expected that
during fine-tuning the model would just need to adapt
to a new grammar, and all its other capabilities (rea-
soning, world knowledge, etc.) would transfer from
the source model, and therefore not many training
samples would be needed for fine-tuning.
The final obstacle to overcome is the fact that the tokenizer of the source model does not tokenize sentences into whole words and is often also not suitable for tokenizing the target language (Czech in our case). Therefore, in practice, we have two different tokenizers, each of which splits words into subwords in a different way. This makes remapping tokens between the two languages nontrivial.
The naive solution would be to let the source
model adapt to a new vocabulary together with the
new grammar. We could hope that the model would
still transfer many capabilities from the source lan-
guage. This would mean that the subcircuits which
are not dealing with grammar and word meaning
would be left intact and the model would just adapt
the low-level circuits dealing with word embeddings
and grammar (this is of course a simplified rationale,
as sometimes high-level reasoning is needed to cor-
rectly parse a sentence and understand the meaning
of a given word). Nevertheless, it is possible that the
model would be too “confused” with the new vocab-
ulary and grammar and would start to reorganize all
its subcircuits during fine-tuning, and therefore would
need a lot of training data to rediscover the capabili-
ties such as reasoning. It is even possible that such fine-tuning would work worse than training the model from scratch (as can be seen in our experiments).
We therefore want to keep as much of the form
of the original English sentences as possible so that
the model does not start to fully reorganize itself and
the old subcircuits useful for the new task are kept in-
tact. We attempt to achieve this by identifying Czech words which are tokenized into a single token by the Czech tokenizer and whose English translation is also tokenized into a single token by the English tokenizer. Such words can be swapped, and if there are enough of them, the model could have an easier time "reorienting" itself to the new language.
2.2 High-Level Summary
At a high level, the presented method is straightforward. We assume that we have a well-trained model for the English language and a tokenizer optimized for the Czech training corpus. Our goal is to fine-tune the
original model with the new tokenizer, but we also
want to make the transition to the new language as
smooth as possible. Therefore, we try to remap the
tokens in the Czech language to the tokens in the En-
glish language as closely as possible and then fine-
tune the original model without any additional tricks.
The efficiency of this method will depend on the
semantic overlap of the content in the training data for
the source and target model. If the training data for
the source model contain texts about biology and the
training data for the target model contain texts about
mathematics, there is probably not much space for the
transfer of knowledge and capabilities. In order to
avoid uncertainties about the semantic similarity of
texts in Czech and English training data, we conduct
our experiments on a Czech corpus obtained by translating the original English training corpus into Czech. We also leverage the fact that we have such a parallel corpus to find the mapping between Czech and English tokens, which we describe in the next subsection.
Figure 1: High-level diagram of the proposed method. Starting from a model pre-trained on English and a tokenizer trained on Czech, the diagram shows the remapping of source-language tokens to their target-language counterparts and the replacement of the source embedding layer with a target embedding layer, in which the indices of Czech tokens correspond to the semantically equivalent tokens of the source language. Finally, after applying these steps, we fine-tune the model with the new Czech embedding layer.
2.3 Vocabulary Swap
In this section, we use the term embedding matrix for
the layer that maps the indices of the tokens to their
corresponding token embeddings (columns of the ma-
trix). This matrix is reused in the final layer, where
its transpose multiplies the final token embeddings to
obtain the logits for individual tokens that should be
predicted in the next step. We use the term model
body for all other layers.
As mentioned, our aim is to find a partial map-
ping, which will map tokens of the Czech tokenizer to
tokens of the English tokenizer. Once we have such
mapping, we can initialize the columns of the embed-
ding matrix for the Czech tokens with the correspond-
ing columns of the embedding matrix for the English
tokens. This is depicted in Figure 1. If we manage
to remap a large portion of the whole vocabulary, we
can expect that the inputs to the model body will be
very similar for both languages and therefore the sub-
circuits for the high-level capabilities and knowledge
would be retained during fine-tuning.
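To make this step concrete, the following sketch illustrates the embedding initialization in PyTorch with Hugging Face Transformers, assuming a mapping cz_to_en from Czech token IDs to English token IDs obtained as described below. The checkpoint name is illustrative rather than a description of our exact setup, and unmapped Czech tokens can be initialized as in the embedding-reset baseline of Section 4.2.

```python
import torch
from transformers import AutoModelForCausalLM

CZ_VOCAB_SIZE = 30_000
cz_to_en = {5: 9}  # placeholder; the real mapping covers 14,272 token pairs

# Source model trained on English TinyStories (illustrative checkpoint name).
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")

# English embedding matrix; Hugging Face stores it as [vocab_size, dim], so the
# "columns" referred to in the text correspond to rows of this tensor.
en_embeddings = model.get_input_embeddings().weight.detach().clone()

# Resize to the Czech vocabulary. Because the embedding matrix is reused
# (transposed) in the output layer, resizing also adapts the output projection.
model.resize_token_embeddings(CZ_VOCAB_SIZE)
cz_embeddings = model.get_input_embeddings().weight

# Copy the embedding of each mapped English token to the position of its Czech
# counterpart, e.g. the pair (5, 9) copies English row 9 into Czech row 5.
with torch.no_grad():
    for cz_id, en_id in cz_to_en.items():
        cz_embeddings[cz_id] = en_embeddings[en_id]
```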
Mapping Between Czech and English Tokens.
The mapping between Czech and English tokens can
be obtained in multiple ways. The most basic ap-
proach would be to translate the Czech tokens which
correspond to whole words to English and if the trans-
lation of the Czech token corresponds to a token from
the English tokenizer, then this pair would be added
to the mapping. This naive approach could miss a lot
of pairs due to the ambiguity in translation.
To have a high-quality mapping, we leverage the
fact that we have a parallel corpus in which we have a
corresponding English sentence for each Czech sen-
tence. To create the mapping, we iterate over all
pairs of corresponding sentences and update the pair
counter, which counts how many times a given Czech
token was mapped to a given English token. Con-
cretely, this is achieved with the following steps:
1. Obtain word embeddings for each Czech token in the Czech sentence (we first translate the Czech tokens to English and then use FastText word embeddings of the English words in both sentences).
2. Obtain word embeddings for each English token
in the English sentence.
3. For each Czech token, find the best matching En-
glish token by using the cosine similarity between
the corresponding embeddings.
4. For each mapped pair, update the pair counter.
Once we iterate over all pairs of sentences, we se-
lect the pairs for which the count exceeded a given
threshold (20). We chose this threshold arbitrarily; it is introduced to limit noise. Finally, we use each selected pair to initialize
the Czech embedding matrix, i.e., if there is a pair
with Czech token with ID 5 and English token with
ID 9, then we copy the ninth column of the English
embedding matrix to the fifth column of the Czech
embedding matrix. Using this procedure, we man-
aged to map 14,272 tokens, which is approximately
47.57% of the Czech vocabulary.
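The following sketch illustrates this pair-counting procedure, assuming we have the parallel (Czech, English) sentence pairs, the two tokenizers, English FastText vectors, and a function that translates individual Czech tokens to English; all helper names, the FastText file name, and the prefix handling are illustrative rather than our exact code.

```python
from collections import Counter

import fasttext  # FastText vectors, e.g. the cc.en.300.bin model from fasttext.cc
import numpy as np

ft = fasttext.load_model("cc.en.300.bin")

def clean(token: str) -> str:
    # Strip subword-prefix characters that byte-level/BPE tokenizers may add.
    return token.lstrip("Ġ▁")

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def build_mapping(parallel_pairs, cz_tokenizer, en_tokenizer,
                  translate_cz_token, threshold: int = 20) -> dict:
    """Count Czech-to-English token alignments and keep frequent pairs."""
    counts = Counter()
    for cz_sentence, en_sentence in parallel_pairs:
        cz_tokens = cz_tokenizer.tokenize(cz_sentence)
        en_tokens = en_tokenizer.tokenize(en_sentence)
        if not cz_tokens or not en_tokens:
            continue
        # Embed the English translation of each Czech token and every English
        # token of the corresponding English sentence.
        cz_vecs = [ft.get_word_vector(translate_cz_token(clean(t))) for t in cz_tokens]
        en_vecs = [ft.get_word_vector(clean(t)) for t in en_tokens]
        for cz_tok, cz_vec in zip(cz_tokens, cz_vecs):
            # Best-matching English token by cosine similarity of embeddings.
            best = max(range(len(en_tokens)), key=lambda j: cosine(cz_vec, en_vecs[j]))
            counts[(cz_tok, en_tokens[best])] += 1
    # Keep only pairs seen at least `threshold` times to suppress noisy alignments.
    return {cz: en for (cz, en), c in counts.items() if c >= threshold}
```

The resulting token-string pairs can then be converted to token IDs with each tokenizer (e.g., via convert_tokens_to_ids) and used to initialize the Czech embedding matrix as in the sketch above.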
3 CREATION OF THE PARALLEL
CORPUS
To translate the English TinyStories dataset (Eldan and Li, 2023) into Czech, we used the WMT 21 En-X model (Tran et al., 2021), available on Hugging Face as facebook/wmt21-dense-24-wide-en-x. This model, part of the WMT 21 En-X suite, is a state-of-the-art multilingual encoder-decoder (sequence-to-sequence) model with 4.7 billion parameters, trained to perform one-to-many multilingual translation, including into Czech.
To manage the size of the TinyStories dataset and to address the memory limitations of the GPUs we employed (NVIDIA Tesla V100 accelerators with 16 GB of memory), we used the sentence tokenizer from the spaCy library (Honnibal et al., 2020), specifically the en_core_web_sm model, to split the dataset into individual sentences. Each sentence was then translated separately using the WMT 21 En-X model. This sentence-level approach not only mitigated memory constraints but also allowed for efficient parallel processing of the dataset.
Once translated, the sentences were meticulously
recombined to recreate complete records that mir-
rored the original dataset structure and narrative co-
herence. This method ensured that each translated
story retained its narrative flow and logical consis-
tency, making the Czech dataset structurally equiva-
lent to the original English version.
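A minimal sketch of this translation pipeline is shown below, assuming the facebook/wmt21-dense-24-wide-en-x checkpoint and the spaCy en_core_web_sm model mentioned above; batching and generation settings are illustrative.

```python
import spacy
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

nlp = spacy.load("en_core_web_sm")
mt_tokenizer = AutoTokenizer.from_pretrained("facebook/wmt21-dense-24-wide-en-x")
mt_model = AutoModelForSeq2SeqLM.from_pretrained("facebook/wmt21-dense-24-wide-en-x")
mt_model.eval()

def translate_story(story: str) -> str:
    """Split an English story into sentences, translate each into Czech, rejoin."""
    sentences = [s.text.strip() for s in nlp(story).sents if s.text.strip()]
    czech_sentences = []
    for sentence in sentences:
        inputs = mt_tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            generated = mt_model.generate(
                **inputs,
                # Force Czech as the target language of the one-to-many model.
                forced_bos_token_id=mt_tokenizer.get_lang_id("cs"),
            )
        czech_sentences.append(
            mt_tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
        )
    # Recombine the translated sentences into a complete record.
    return " ".join(czech_sentences)
```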
4 EXPERIMENTS
We use the TinyStories dataset (Eldan and Li, 2023) along with a pre-trained TinyStories GPT Neo model with 33 million parameters for fine-tuning. For training from scratch, we use the same model architecture without any pre-training. Since we are using
a custom tokenizer, we change the vocabulary size of
the model to 30,000 tokens, which corresponds to the
number of tokens our tokenizer learned.
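As an illustration of this setup, the following sketch trains a 30,000-token tokenizer on the Czech corpus with the Hugging Face tokenizers library; the choice of byte-level BPE, the corpus file name, and the special token are assumptions for the sketch rather than a description of our exact configuration.

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical path to the translated Czech training corpus (one story per line).
CORPUS_FILE = "tinystories_cs_train.txt"

czech_tokenizer = ByteLevelBPETokenizer()
czech_tokenizer.train(
    files=[CORPUS_FILE],
    vocab_size=30_000,                 # matches the model vocabulary size above
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # GPT Neo style end-of-text token
)
czech_tokenizer.save_model("czech_tokenizer")
```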
4.1 Data
The TinyStories dataset contains simple stories writ-
ten in simple English. The authors describe the avail-
able vocabulary from which these stories are gener-
ated as containing words that a 3–4-year-old child
would typically understand or use; therefore, the vo-
cabulary only contains 1500 unique, simple words.
These data were generated using GPT-3.5 or GPT-4, and there are approximately 2.1 million samples in the dataset. It is important to mention that the authors have also published models of various sizes, so for our initial experiments we used the smallest, 1 million parameter model to avoid wasting resources.
As discussed in Section 3, we translated the orig-
inal TinyStories corpus to obtain a parallel corpus in
Czech. We trained or fine-tuned all models on data
splits, the same for each model. We evaluated the
models on a fixed test set, which is the same for each
model.
We tested the proposed method and the three baselines on five different partitions of the data: 10%, 25%, 50%, 75%, and 100%. Our goal is to show the viability of the proposed method when only a small amount of data is used, obtaining significantly better results than those produced by the baselines.
4.2 Baselines
TinyStories models use the GPT Neo architecture,
which utilizes a decoder-only transformer structure.
The training process proceeds as follows: we use the default GPT Neo hyperparameters, set the batch size to 8 samples, and set gradient accumulation to 128 steps. We then train for 1 epoch on
the target corpus on a single A100 GPU. We used
the same hyperparameters for each baseline and data
split, as well as for the proposed method. We also
experimented with training the models for multiple
epochs; see Figure 4.
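A minimal sketch of this training configuration using the Hugging Face Trainer is given below; the dataset and tokenizer variables are placeholders, and the checkpoint name and output directory are illustrative.

```python
from transformers import (AutoModelForCausalLM, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Illustrative source checkpoint; for the from-scratch baseline, the same
# architecture is instantiated with random weights instead.
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
model.resize_token_embeddings(30_000)  # match the custom Czech tokenizer

args = TrainingArguments(
    output_dir="gpt-neo-czech",        # illustrative output path
    per_device_train_batch_size=8,     # batch size of 8 samples
    gradient_accumulation_steps=128,   # gradient accumulation of 128 steps
    num_train_epochs=1,                # a single epoch over the target corpus
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_czech_train,  # placeholder: the tokenized data split
    data_collator=DataCollatorForLanguageModeling(czech_tokenizer, mlm=False),
)
trainer.train()
```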
We have evaluated a model trained using the pro-
posed method of Vocabulary Swap against three
baselines: a model trained from scratch, a model
trained using naive fine-tuning, and a model trained
using fine-tuning with embedding reset.
Training from Scratch. For this baseline, we train
a 33 million parameter GPT Neo model from ran-
domly initialized weights. As mentioned above, we
use the same hyperparameters for all training regimes.
Naive Fine-Tuning. Here, we begin with the 33
million parameter GPT Neo source model pre-trained
on the source language. We use the aforementioned
tokenizer and target corpus. One can observe that this naive approach has a major flaw: token IDs of the target tokenizer will not correspond to the token IDs of the source tokenizer, and therefore a word such as "pes" (Czech for dog) could start with the embedding of a completely different word (e.g., "sun"). This could cause the model body to start changing in order to adapt to the new embeddings, and the model could thereby lose many of the abilities it previously learned on the source language.
Fine-Tuning with Embedding Reset. With this ap-
proach, we apply the same procedure as with naive fine-tuning, with the only difference being that we reset the token embeddings by sampling them from a Gaus-
sian distribution with the mean and standard devia-
tion calculated from the original embeddings for the
source language.
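To make the baseline concrete, the following sketch re-initializes the embedding matrix from a Gaussian whose mean and standard deviation are computed from the source model's original embeddings; the checkpoint name is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")
source_weights = model.get_input_embeddings().weight.detach()

# Statistics of the original (source-language) embedding matrix.
mean = source_weights.mean().item()
std = source_weights.std().item()

# Resize to the Czech vocabulary and re-sample every embedding from N(mean, std).
CZ_VOCAB_SIZE = 30_000
model.resize_token_embeddings(CZ_VOCAB_SIZE)
with torch.no_grad():
    model.get_input_embeddings().weight.normal_(mean=mean, std=std)
```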
4.3 Results
We use perplexity (PPL) as our evaluation metric, as
it is a common metric to measure the quality of gen-
erative models. We evaluated the trained models on a
test set and calculated the PPL for each sentence using
the following Formula 1.
$$\mathrm{PPL}(X) = \exp\!\left(-\frac{1}{t}\sum_{i=1}^{t}\log p_{\theta}(x_i \mid x_{<i})\right) \tag{1}$$

where $\log p_{\theta}(x_i \mid x_{<i})$ is the log-likelihood of the $i$-th token conditioned on the preceding tokens $x_{<i}$ according to our model.
Formula 1 corresponds to an exponentiated nega-
tive log-likelihood of a sequence.
We can compute the perplexity for the entire test
set as the mean of the perplexities for all individual
sentences, using Formula 2. The test set should have
a high probability according to the evaluated model.
$$\mathrm{PPL} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{PPL}(X_i) \tag{2}$$

where $\mathrm{PPL}$ is the average perplexity over the entire test set and $N$ is the total number of sentences.
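A minimal sketch of this evaluation is shown below: with labels equal to the input IDs, the Hugging Face causal-LM loss is the mean negative token log-likelihood, so its exponential gives Formula 1, and averaging over sentences gives Formula 2. The model and tokenizer arguments are placeholders for the trained Czech models.

```python
import math

import torch

def sentence_perplexity(model, tokenizer, sentence: str) -> float:
    """Formula 1: exponential of the mean negative token log-likelihood."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids, the causal-LM loss is the mean negative
        # log-likelihood over the predicted tokens.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def test_set_perplexity(model, tokenizer, sentences) -> float:
    """Formula 2: mean of the per-sentence perplexities."""
    ppls = [sentence_perplexity(model, tokenizer, s) for s in sentences]
    return sum(ppls) / len(ppls)
```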
Figure 2 shows the perplexity of each baseline,
as well as the models trained using the proposed
method. Each model was trained on four data splits
that were sampled once, as well as on all data. We
can see that, using only 10% of the data, the proposed method shows a significant improvement over the baselines; the evaluation results are shown in Table 1.
[Figure 2 chart: bar chart of mean perplexity (y-axis) against data split (x-axis: 10%, 25%, 50%, 75%, 100%) for Naive Fine-tuning, Training From Scratch, Embedding Reset, and Vocabulary Swap.]
Figure 2: Perplexity comparison for different training ap-
proaches across various data splits. This serves as a visu-
alization of Table 1, where each bar represents the mean
perplexity of a given set of models. Lower is better.
Influence of the Number of Swapped Tokens. We also ran an experiment to observe what happens if we use only part of the translated tokens for the swap. The vocabulary size is 30,000 tokens, and the full vocabulary swap method maps around half of them (47.57%, or 14,272 out of 30,000 tokens). We wanted to see whether using fewer tokens still yields a reasonable improvement. Figure 3 shows perplexity versus the percentage of swapped tokens. Keep in mind that not using any vocabulary swap is equivalent to the embedding reset method, while using all tokens is equivalent to the full vocabulary swap, meaning that we swap 47.57% of tokens with respect to the vocabulary size, i.e., all the tokens that we successfully mapped. As shown in Figure 3, the proportion of swapped tokens clearly impacts the quality of the model, although model quality appears to stop improving once more than 19.03% of tokens with respect to the vocabulary size are swapped.
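As a sketch of this ablation, one can swap only a subset of the mapped pairs (the cz_to_en mapping from Section 2.3) and leave the remaining Czech embeddings at their embedding-reset initialization; how the subset was selected is not specified above, so the random subsampling below is an assumption.

```python
import random

def subsample_mapping(cz_to_en: dict, fraction: float, seed: int = 0) -> dict:
    """Keep only `fraction` of the mapped (Czech ID -> English ID) pairs."""
    rng = random.Random(seed)
    pairs = list(cz_to_en.items())
    kept = rng.sample(pairs, k=int(len(pairs) * fraction))
    return dict(kept)

# Example: swapping ~40% of the 14,272 mapped pairs corresponds to ~19% of the
# 30,000-token vocabulary.
partial_mapping = subsample_mapping(cz_to_en, fraction=0.4)
```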
[Figure 3 chart: perplexity (y-axis) against the percentage of swapped tokens (x-axis, from 0% to 47.57%).]
Figure 3: The figure shows perplexity values for different
percentages of swapped tokens. The x-axis shows the per-
centages w.r.t. the total number of tokens (30,000). Lower
is better.
Figure 4 then shows what happens if we train for more than one epoch. We trained the four types of models on the 10%, 25%, and 50% splits for three epochs, with otherwise the same hyperparameters as previously described. Although the proposed method still shows the best results, the gap between it and the other methods shrinks as the amount of training increases. It is important to mention that even a marginal difference in perplexity is significant.
5 RELATED WORK
Several contributions have discussed transfer learn-
ing in natural language processing. (Hedderich et al.,
2020) investigated the application of transfer learning
Table 1: Perplexity values for different training approaches using various data splits. We used the mean value of the perplex-
ities evaluated from 5 independently trained models on 10% and 25% splits. For perplexity results on 50%, 75%, and 100%
of data, we trained only 3 models for each split to save computational resources. Lower is better.
Method                 | 10%          | 25%          | 50%          | 75%         | 100%
Naive Fine-Tuning      | 66.02 ± 0.00 | 17.50 ± 0.00 | 10.69 ± 0.00 | 9.13 ± 0.00 | 8.13 ± 0.00
Training From Scratch  | 52.27 ± 1.10 | 18.18 ± 0.22 | 11.01 ± 0.10 | 9.25 ± 0.18 | 8.18 ± 0.02
Embedding Reset        | 34.19 ± 0.40 | 13.19 ± 0.19 |  9.41 ± 0.02 | 8.15 ± 0.01 | 7.61 ± 0.03
Vocabulary Swap        | 17.89 ± 0.54 |  9.60 ± 0.03 |  7.74 ± 0.04 | 7.10 ± 0.03 | 6.76 ± 0.01
[Figure 4 charts: three panels (10%, 25%, and 50% data splits) showing perplexity (y-axis) across training epochs 1–3 (x-axis) for Training From Scratch, Naive Fine-tuning, Embedding Reset, and Vocabulary Swap.]
Figure 4: Perplexity across epochs for different data splits and approaches.
and distant supervision techniques to fine-tune multi-
lingual transformer models based on BERT for low-
resource African languages.
(Lauscher et al., 2020) show empirically that the linguistic distance between the source and target languages plays a role in language transfer, obtaining better evaluation results on downstream tasks when the tested languages are linguistically close. The work shows
that, while multilingual models can capture some uni-
versal linguistic properties, they are affected by fac-
tors such as language similarity, corpus size, and spe-
cific model design choices.
(Csaki et al., 2023) proposed novel methods for
efficiently adapting pre-trained language models to
new languages. A key contribution was their tech-
nique for improving tokenizer efficiency by selec-
tively replacing infrequently used tokens from the
original vocabulary with tokens from the target lan-
guage. Importantly, they concluded that adapting an
existing pre-trained model was more effective and
resource-efficient than training a new model from
scratch for low-resource languages.
(Li et al., 2024) demonstrate the advantages of using a parallel corpus, employing transfer learning to adapt a Llama 2 model from English to Chinese. Their approach yields superior results on
downstream tasks when compared to source models
evaluated on standard English and Chinese bench-
marks. A significant finding was that their approach
of leveraging both corpora showed impressive effi-
ciency, since the previous state-of-the-art model used
more than twice the amount of data.
6 LIMITATIONS
There are several limitations known to us, as we developed this method under favorable conditions that are not usually present in real-world data.
Controlled Environment. TinyStories is a syn-
thetic dataset which contains simple language. Lan-
guage in the real world is much more complex, and different fields vary in their use of terminology; therefore, a single word can have many different semantic meanings. Since we know how the
data was created and what it contains, it is much sim-
pler for us to devise functional experiments with sim-
ple evaluation on the testing set.
Having a Parallel Corpus. Since the data are syn-
thetic and have a simple language, we could have eas-
ily obtained a high-quality translation, thus obtaining
a parallel corpus. In the real world, we would not
have this benefit. Our method of mapping tokens is
directly tied to the availability of a parallel corpus.
Evaluation Metric. Since we use the same tok-
enizer for each of our models, we can compare the
perplexity between models. Comparing models that use different tokenizers is non-trivial, since we would not be able to use the perplexity metric as is. For us, this essentially ruled out any way
to compare the proposed method with a method that
would use a different tokenizer.
Amount of Training. With an increased amount of training, the baselines close the gap to the proposed method. Our method works best
with a limited amount of training. Training for multi-
ple epochs shows diminishing returns for all baselines
considered, as well as the proposed method.
7 CONCLUSION
In this study, we have demonstrated the potential of
leveraging pre-trained English language models to ef-
fectively adapt and generate coherent text in Czech,
a lower-resource language. Our approach, which uti-
lizes a method of vocabulary swap, significantly re-
duces the computational costs associated with train-
ing language-specific models from scratch. Through
our experiments, we have shown that even with a
small parallel corpus, the adapted model can outper-
form traditional training methods, highlighting the ef-
ficiency of transfer learning in natural language pro-
cessing. Our method is trivial to implement: it only requires finding a partial mapping between the Czech and English tokenizers and then initializing the embeddings of Czech tokens with the corresponding embeddings of English tokens. Future experiments will
test this method on more realistic datasets.
ACKNOWLEDGEMENTS
This article is from the project "Research of
Excellence on Digital Technologies and Wellbe-
ing CZ.02.01.01/00/22_008/0004583" which is co-
financed by the European Union. The translation of
the parallel corpus, model training and model eval-
uation was supported by the Ministry of Education,
Youth and Sports of the Czech Republic through the
e-INFRA CZ (ID:90254). This article has been pro-
duced with the financial support of the European
Union under the REFRESH Research Excellence
For REgion Sustainability and High-tech Industries
project number CZ.10.03.01/00/22_003/0000048 via
the Operational Programme Just Transition.
REFERENCES
Anil, R., Borgeaud, S., Alayrac, J.-B., Yu, J., Soricut, R.,
et al. (2024). Gemini: A family of highly capable
multimodal models.
Csaki, Z., Pawakapan, P., Thakker, U., and Xu, Q. (2023).
Efficiently adapting pretrained language models to
new languages.
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., et al. (2024). The Llama 3 herd of models.
Eldan, R. and Li, Y. (2023). TinyStories: How small can language models be and still speak coherent English?
Hedderich, M. A., Adelani, D., Zhu, D., Alabi, J., Markus,
U., and Klakow, D. (2020). Transfer learning and dis-
tant supervision for multilingual transformer models:
A study on African languages.
Honnibal, M., Montani, I., Van Landeghem, S., and Boyd, A. (2020). spaCy: Industrial-strength natural language processing in Python.
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C.,
Chaplot, D. S., et al. (2023). Mistral 7B.
Jin, Y., Chandra, M., Verma, G., Hu, Y., De Choudhury,
M., and Kumar, S. (2023). Better to ask in English:
Cross-lingual evaluation of large language models for
healthcare queries.
Lauscher, A., Ravishankar, V., Vulić, I., and Glavaš, G. (2020). From zero to hero: On the limitations of zero-shot language transfer with multilingual transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4483–4499. Association for Computational Linguistics.
Li, Y., Feng, Y., Zhou, W., Zhao, Z., Shen, L., Hou, C.,
and Hou, X. (2024). Dynamic data sampler for cross-
language transfer learning in large language models.
OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L.,
et al. (2024). GPT-4 technical report.
Tran, C., Bhosale, S., Cross, J., Koehn, P., Edunov, S., and Fan, A. (2021). Facebook AI WMT21 news translation task submission. arXiv preprint arXiv:2108.03265.
Wendler, C., Veselovsky, V., Monea, G., and West, R. (2024). Do llamas work in English? On the latent language of multilingual transformers.