A Collaborative Approach to Multimodal Machine Translation: VLM and LLM
Amulya Ratna Dash, Yashvardhan Sharma
2025
Abstract
With advancements in Large Language Models (LLMs) and Vision Language Pretrained Models (VLMs), there is a growing need to evaluate their capabilities and to develop methods for using them together on vision-language tasks. This study focuses on using a VLM and an LLM collaboratively for Multimodal Machine Translation (MMT). We fine-tune LLaMA-3 to use image captions provided by VLMs to disambiguate the source text and generate accurate translations for MMT tasks. We evaluate our approach on German, French, and Hindi, and observe consistent improvements in translation quality. The final model improves by +3 BLEU over the baseline and +4 BLEU over the state-of-the-art model.
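The abstract describes a two-stage pipeline: a VLM captions the image, and the fine-tuned LLM translates the source sentence with the caption as disambiguating context. Below is a minimal sketch of that flow, assuming BLIP as the captioning VLM, a fine-tuned LLaMA-3 checkpoint loaded via Hugging Face transformers (the path is a placeholder), and a simple prompt format; the paper's actual models and prompts may differ.

```python
# Sketch of the VLM + LLM collaboration for MMT (assumptions noted above).
from PIL import Image
from transformers import (
    BlipProcessor,
    BlipForConditionalGeneration,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# Stage 1: caption the image with a VLM (BLIP assumed for illustration).
vlm_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
vlm = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_image(path: str) -> str:
    image = Image.open(path).convert("RGB")
    inputs = vlm_processor(images=image, return_tensors="pt")
    out = vlm.generate(**inputs, max_new_tokens=30)
    return vlm_processor.decode(out[0], skip_special_tokens=True)

# Stage 2: translate with the (fine-tuned) LLM, passing the caption as context.
# "path/to/finetuned-llama-3" is a hypothetical placeholder checkpoint.
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-llama-3")
llm = AutoModelForCausalLM.from_pretrained("path/to/finetuned-llama-3")

def translate(sentence: str, caption: str, target_lang: str = "German") -> str:
    prompt = (
        f"Image caption: {caption}\n"
        f"Translate the following English sentence into {target_lang}:\n"
        f"{sentence}\nTranslation:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = llm.generate(**inputs, max_new_tokens=64)
    # Return only the newly generated tokens after the prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Example: the caption resolves whether "bank" is a riverbank or a financial bank.
caption = caption_image("example.jpg")
print(translate("She stood by the bank.", caption))
```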
Paper Citation
in Harvard Style
Dash A. and Sharma Y. (2025). A Collaborative Approach to Multimodal Machine Translation: VLM and LLM. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 1412-1418. DOI: 10.5220/0013379900003890
in Bibtex Style
@conference{icaart25,
author={Amulya Ratna Dash and Yashvardhan Sharma},
title={A Collaborative Approach to Multimodal Machine Translation: VLM and LLM},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={1412-1418},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013379900003890},
isbn={978-989-758-737-5},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - A Collaborative Approach to Multimodal Machine Translation: VLM and LLM
SN - 978-989-758-737-5
AU - Dash A.
AU - Sharma Y.
PY - 2025
SP - 1412
EP - 1418
DO - 10.5220/0013379900003890
PB - SciTePress