eter sets, these fine-tuned models must be identical or almost identical to the
pre-trained models. This makes testing different architectures difficult. One
possibility is to use a large pre-trained model as a teacher and a medium-sized
model as a student that mimics the teacher's performance. This procedure,
referred to as knowledge distillation, was proposed by Hinton et al. (2015) and
used, e.g., by Sun et al. (2019). These will be important focuses of future work.
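As an illustration only (not part of the experiments reported here), the distillation objective of Hinton et al. (2015) can be sketched as follows; the temperature T and the weighting factor alpha are hypothetical hyperparameter choices, not values taken from either paper.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target distillation loss in the style of Hinton et al. (2015).

    The student matches the teacher's temperature-softened output
    distribution (KL term) while still fitting the hard labels (CE term).
    T and alpha are illustrative settings only.
    """
    # KL divergence between softened teacher and student distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Standard cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard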
REFERENCES
Bahdanau, D., Cho, K., and Bengio, Y. (2016). Neural
Machine Translation by Jointly Learning to Align and
Translate. ICLR.
Cholesky, A.-L. (1924). Note Sur Une Méthode de Résolution des Équations
Normales Provenant de L'Application de la Méthode des Moindres Carrés à un
Système D'Équations Linéaires en Nombre Inférieur à Celui des Inconnues. —
Application de la Méthode à la Résolution D'un Système Défini D'Équations
Linéaires. Bulletin Géodésique, 2(1):67–77.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn,
D., Zhai, X., Unterthiner, T., Dehghani, M., Min-
derer, M., Heigold, G., Gelly, S., Uszkoreit, J., and
Houlsby, N. (2021). An image is worth 16x16 words:
Transformers for image recognition at scale. In In-
ternational Conference on Learning Representations,
page 21, Vienna, Austria.
Fletcher, R. and Reeves, C. M. (1964). Function minimiza-
tion by conjugate gradients. The Computer Journal,
7(2):149–154.
He, B. and Hofmann, T. (2024). Simplifying Transformer
Blocks. arXiv:2311.01906 [cs].
Hendrycks, D. and Gimpel, K. (2023). Gaussian Error Lin-
ear Units (GELUs). arXiv:1606.08415 [cs].
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the
Knowledge in a Neural Network. arXiv:1503.02531
[cs, stat].
Hrycej, T., Bermeitinger, B., Cetto, M., and Handschuh,
S. (2023). Mathematical Foundations of Data Sci-
ence. Texts in Computer Science. Springer Interna-
tional Publishing, Cham.
Krizhevsky, A. (2009). Learning Multiple Layers of Fea-
tures from Tiny Images. Dataset, University of
Toronto.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998).
Gradient-based learning applied to document recogni-
tion. Proceedings of the IEEE, 86(11):2278–2324.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flan-
nery, B. P. (1992). Numerical recipes in C (2nd ed.):
the art of scientific computing. Cambridge University
Press, USA.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., Berg, A. C., and Fei-Fei, L. (2015). Ima-
geNet Large Scale Visual Recognition Challenge. In-
ternational Journal of Computer Vision, 115(3):211–
252.
Sun, S., Cheng, Y., Gan, Z., and Liu, J. (2019). Patient
Knowledge Distillation for BERT Model Compres-
sion. arXiv:1908.09355 [cs].
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L., and Polosukhin, I.
(2017). Attention is all you need. In Proceedings of
the 31st International Conference on Neural Informa-
tion Processing Systems, NIPS’17, pages 6000–6010,
Red Hook, NY, USA. Curran Associates Inc.