4 CONCLUSIONS
This paper introduces a novel approach to compressing transformer-based vision-language neural networks using Tensor-Train decomposition. By applying this method to popular models such as ViT and BERT, we achieve accuracy improvements of up to 5% without any post-training adjustments. Iteratively compressing model layers and retraining after each step allows us to preserve model accuracy while removing up to 53% of the model's parameters. In future work, we would like to generalize our approach to other dual-encoder models and evaluate it on additional multimodal tasks such as visual question answering and caption generation. This work represents a valuable advancement in multimodal language processing and contributes to our broader goal of making transformer-based models more efficient and practical for real-world applications.
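For readers unfamiliar with the factorization itself, the sketch below gives a minimal NumPy implementation of the standard TT-SVD procedure that underlies this kind of layer compression. The weight shape, its reshaping into a 4-way tensor, and the rank choices are illustrative assumptions for exposition only; they are not the exact settings used in our experiments.

```python
import numpy as np

def tt_svd(tensor, ranks):
    """Factor a d-way tensor into tensor-train (TT) cores via repeated truncated SVD.

    `ranks` has length d+1 with ranks[0] == ranks[-1] == 1.
    Returns a list of cores, where core k has shape (ranks[k], dims[k], ranks[k+1]).
    """
    dims = tensor.shape
    d = len(dims)
    ranks = list(ranks)
    cores = []
    unfolding = tensor.reshape(ranks[0] * dims[0], -1)  # first mode unfolding
    for k in range(d - 1):
        U, S, Vt = np.linalg.svd(unfolding, full_matrices=False)
        ranks[k + 1] = min(ranks[k + 1], S.size)        # truncate to the target rank
        r = ranks[k + 1]
        cores.append(U[:, :r].reshape(ranks[k], dims[k], r))
        # Carry the remaining factor forward and fold in the next mode.
        unfolding = (S[:r, None] * Vt[:r]).reshape(r * dims[k + 1], -1)
    cores.append(unfolding.reshape(ranks[-2], dims[-1], 1))
    return cores

# Illustrative example (assumed shapes and ranks): a 768x3072 feed-forward weight
# reshaped into a 4-way tensor and compressed with TT-ranks of 8.
weight = np.random.randn(768, 3072)
cores = tt_svd(weight.reshape(12, 64, 48, 64), ranks=[1, 8, 8, 8, 1])
print(sum(c.size for c in cores), "TT parameters vs", weight.size, "dense parameters")
```

In our method, each compressed layer obtained this way is retrained before the next layer is compressed, which is what lets the accuracy recover after truncation.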
ACKNOWLEDGEMENTS
The first author would like to extend her gratitude to
Dr. Lina Kim and the Research Mentorship Program
Cohort for their support throughout the research pro-
cess.
APPENDIX
ViT Fine-Tuning Hyperparameters
Learning Rate: 5
Train/Evaluation Batch Size: 32
Optimizer: Adam with a linear learning-rate schedule
Epoch(s): 1
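As a concrete illustration, the following minimal PyTorch sketch wires the hyperparameters above into a fine-tuning loop. The stand-in model, the dummy data, and the 5e-5 learning-rate value are assumptions made only to keep the example self-contained and runnable; this is not our exact training script.

```python
import torch
from torch import nn, optim

# Stand-in module; in our experiments this would be the (partially tensorized)
# ViT classifier. A tiny linear head keeps the sketch self-contained.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 10))

# Hyperparameters from the appendix. NOTE: the learning-rate value is an
# assumption (5e-5); the appendix lists only "5", so the exponent is unclear.
learning_rate = 5e-5
batch_size = 32
num_epochs = 1

# Dummy data so the loop runs end-to-end; replace with the real image dataset.
images = torch.randn(128, 3, 224, 224)
labels = torch.randint(0, 10, (128,))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(images, labels), batch_size=batch_size)

optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# Linear decay of the learning rate over all training steps.
scheduler = optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.0,
    total_iters=num_epochs * len(loader))
criterion = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()
```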
Individual Compression
Figure 3: ViT Individual Compression.