Learning Cross-modal Representations with Multi-relations for Image Captioning

Peng Cheng, Tung Le, Teeradaj Racharak, Cao Yiming, Kong Weikun, Minh Nguyen

2022

Abstract

Image captioning is a cross-domain task that generates a descriptive sentence for a given image. Recently, Li et al. (2020b) showed that concatenating sentences, object tags, and region features into a unified representation outperforms state-of-the-art work on several vision-and-language tasks. These results inspired us to investigate and propose two new learning methods that exploit relation representations in the model and improve its generation results. To the best of our knowledge, we are the first to exploit relations extracted from both text and images for image captioning. Our idea is motivated by the observation that humans who look at the same image can correct other people’s descriptions by knowing the relationships between the objects in it. We conduct experiments on the MS COCO dataset (Lin et al., 2014) and show that our method yields a higher SPICE score than the baseline.


Paper Citation


in Harvard Style

Cheng P., Le T., Racharak T., Yiming C., Weikun K. and Nguyen M. (2022). Learning Cross-modal Representations with Multi-relations for Image Captioning. In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-549-4, pages 346-353. DOI: 10.5220/0010915100003122


in Bibtex Style

@conference{icpram22,
author={Peng Cheng and Tung Le and Teeradaj Racharak and Cao Yiming and Kong Weikun and Minh Nguyen},
title={Learning Cross-modal Representations with Multi-relations for Image Captioning},
booktitle={Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2022},
pages={346-353},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010915100003122},
isbn={978-989-758-549-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Learning Cross-modal Representations with Multi-relations for Image Captioning
SN - 978-989-758-549-4
AU - Cheng P.
AU - Le T.
AU - Racharak T.
AU - Yiming C.
AU - Weikun K.
AU - Nguyen M.
PY - 2022
SP - 346
EP - 353
DO - 10.5220/0010915100003122