Learning Cross-modal Representations with Multi-relations for Image Captioning
Peng Cheng, Tung Le, Teeradaj Racharak, Cao Yiming, Kong Weikun, Minh Nguyen
2022
Abstract
Image captioning is a cross-domain task that generates descriptive sentences for a given image. Recently, Li et al. (2020b) showed that concatenating sentences, object tags, and region features into a unified representation outperforms state-of-the-art work on several vision-and-language tasks. These results inspired us to investigate and propose, in this paper, two new learning methods that exploit relation representations in the model and improve its generated captions. To the best of our knowledge, we are the first to exploit relations extracted from both text and images for image captioning. Our idea is motivated by the observation that people can correct others' descriptions of an image when they observe the same image and know the relationships between the objects in it. We conduct experiments on the MS COCO dataset (Lin et al., 2014) and show that our method yields a higher SPICE score than the baseline.
Paper Citation
in Harvard Style
Cheng P., Le T., Racharak T., Yiming C., Weikun K. and Nguyen M. (2022). Learning Cross-modal Representations with Multi-relations for Image Captioning. In Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM, ISBN 978-989-758-549-4, pages 346-353. DOI: 10.5220/0010915100003122
in Bibtex Style
@conference{icpram22,
author={Peng Cheng and Tung Le and Teeradaj Racharak and Cao Yiming and Kong Weikun and Minh Nguyen},
title={Learning Cross-modal Representations with Multi-relations for Image Captioning},
booktitle={Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM},
year={2022},
pages={346-353},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0010915100003122},
isbn={978-989-758-549-4},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods - Volume 1: ICPRAM
TI - Learning Cross-modal Representations with Multi-relations for Image Captioning
SN - 978-989-758-549-4
AU - Cheng P.
AU - Le T.
AU - Racharak T.
AU - Yiming C.
AU - Weikun K.
AU - Nguyen M.
PY - 2022
SP - 346
EP - 353
DO - 10.5220/0010915100003122