A Collaborative Approach to Multimodal Machine Translation: VLM and LLM

Amulya Ratna Dash, Yashvardhan Sharma

2025

Abstract

With advancements in Large Language Models (LLMs) and Vision Language Pretrained Models (VLMs), there is a growing need to evaluate their capabilities and to develop methods for using them together on vision-language tasks. This study uses a VLM and an LLM collaboratively for Multimodal Machine Translation (MMT). We fine-tune LLaMA-3 to use image captions provided by VLMs to disambiguate the source text and generate accurate translations for MMT tasks. We evaluate our novel approach on German, French, and Hindi, and observe consistent translation quality improvements. The final model improves by +3 BLEU over the baseline and +4 BLEU over the state-of-the-art model.
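The collaborative pipeline described above (a VLM supplies an image caption, which the fine-tuned LLM consumes as visual context during translation) can be sketched as a prompt-construction step. This is a minimal illustrative sketch: the template wording, function name, and example sentences are assumptions, not the authors' exact format.

```python
def build_mmt_prompt(source_sentence: str, caption: str, target_lang: str) -> str:
    """Combine the source sentence with a VLM-generated image caption so the
    fine-tuned LLM can use the visual context to resolve lexical ambiguity."""
    return (
        f"Image caption: {caption}\n"
        f"Translate the following English sentence into {target_lang}, "
        f"using the image caption to resolve any ambiguity.\n"
        f"English: {source_sentence}\n"
        f"{target_lang}:"
    )

# Example: "bank" is ambiguous in English; the caption disambiguates it
# toward the riverbank sense before the LLM translates.
prompt = build_mmt_prompt(
    source_sentence="She stood by the bank.",
    caption="A woman standing at the edge of a river.",
    target_lang="German",
)
print(prompt)
```

The resulting prompt would then be passed to the fine-tuned LLaMA-3 model for generation; the key design point is that the VLM's caption, not the raw image, is what the LLM conditions on.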

Paper Citation


in Harvard Style

Dash A. and Sharma Y. (2025). A Collaborative Approach to Multimodal Machine Translation: VLM and LLM. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 1412-1418. DOI: 10.5220/0013379900003890


in Bibtex Style

@conference{icaart25,
author={Amulya Ratna Dash and Yashvardhan Sharma},
title={A Collaborative Approach to Multimodal Machine Translation: VLM and LLM},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART},
year={2025},
pages={1412-1418},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013379900003890},
isbn={978-989-758-737-5},
}


in EndNote Style

TY - CONF

JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 3: ICAART
TI - A Collaborative Approach to Multimodal Machine Translation: VLM and LLM
SN - 978-989-758-737-5
AU - Dash A.
AU - Sharma Y.
PY - 2025
SP - 1412
EP - 1418
DO - 10.5220/0013379900003890
PB - SciTePress