stitching. In Proceedings of the IEEE CVPR, pages
2634–2641.
Ding, S., Qu, S., Xi, Y., Sangaiah, A. K., and Wan, S.
(2019). Image caption generation with high-level im-
age features. Pattern Recognition Letters, 123:89–95.
do Carmo Nogueira, T., Vinhal, C. D. N., da Cruz Júnior,
G., and Ullmann, M. R. D. (2020). Reference-based
model using multimodal gated recurrent units for im-
age captioning. Multimedia Tools and Applications,
pages 1–21.
Donahue, J., Anne Hendricks, L., Guadarrama, S.,
Rohrbach, M., Venugopalan, S., Saenko, K., and Dar-
rell, T. (2015). Long-term recurrent convolutional net-
works for visual recognition and description. In Pro-
ceedings of the IEEE CVPR, pages 2625–2634.
Feng, Y., Ma, L., Liu, W., and Luo, J. (2019). Unsupervised
image captioning. In Proceedings of the IEEE CVPR.
Gan, Z., Gan, C., He, X., Pu, Y., Tran, K., Gao, J., Carin,
L., and Deng, L. (2017). Semantic compositional net-
works for visual captioning. In Proceedings of the
IEEE CVPR, pages 5630–5639.
Hao, W., Zhang, Z., and Guan, H. (2018). Integrating both
visual and audio cues for enhanced video caption. In
Proceedings of the AAAI.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term
memory. Neural computation, 9(8):1735–1780.
Hsu, C.-C., Chen, Y.-H., Chen, Z.-Y., Lin, H.-Y., Huang,
T.-H., and Ku, L.-W. (2019). Dixit: Interactive visual
storytelling via term manipulation. In The World Wide
Web Conference, pages 3531–3535.
Hu, J., Cheng, Y., Gan, Z., Liu, J., Gao, J., and Neubig, G.
(2020). What makes a good story? Designing com-
posite rewards for visual storytelling. In AAAI, pages
7969–7976.
Huang, L., Wang, W., Chen, J., and Wei, X.-Y. (2019a). At-
tention on attention for image captioning. In Proceed-
ings of the IEEE International Conference on Com-
puter Vision, pages 4634–4643.
Huang, Q., Gan, Z., Celikyilmaz, A., Wu, D., Wang, J., and
He, X. (2019b). Hierarchically structured reinforce-
ment learning for topically coherent visual story gen-
eration. In Proceedings of the AAAI Conference on
Artificial Intelligence, volume 33, pages 8465–8472.
Huang, T.-H., Ferraro, F., Mostafazadeh, N., Misra, I.,
Agrawal, A., Devlin, J., Girshick, R., He, X., Kohli,
P., Batra, D., et al. (2016). Visual storytelling. In Pro-
ceedings of the 2016 Conference of the North Amer-
ican Chapter of the Association for Computational
Linguistics: Human Language Technologies, pages
1233–1239.
Kim, G., Moon, S., and Sigal, L. (2015). Ranking and re-
trieval of image sequences from multiple paragraph
queries. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages
1993–2001.
Kim, T., Heo, M.-O., Son, S., Park, K.-W., and Zhang, B.-T.
(2018). GLAC Net: Glocal attention cascading networks
for multi-image cued story generation. arXiv preprint
arXiv:1805.10973.
Kojima, A., Tamura, T., and Fukunaga, K. (2002). Natural
language description of human activities from video
images based on concept hierarchy of actions. IJCV,
50(2):171–184.
Krishnamoorthy, N., Malkarnenkar, G., Mooney, R. J.,
Saenko, K., and Guadarrama, S. (2013). Generating
natural-language video descriptions using text-mined
knowledge. In Proceedings of the AAAI, volume 1,
page 2.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012).
ImageNet classification with deep convolutional neural
networks. In Advances in neural information process-
ing systems, pages 1097–1105.
Li, G., Zhu, L., Liu, P., and Yang, Y. (2019a). Entangled
transformer for image captioning. In Proceedings of
the IEEE International Conference on Computer Vi-
sion, pages 8928–8937.
Li, J., Shi, H., Tang, S., Wu, F., and Zhuang, Y. (2019b). In-
formative visual storytelling with cross-modal rules.
In Proceedings of the 27th ACM International Con-
ference on Multimedia, pages 2314–2322.
Lin, C.-Y. (2004). ROUGE: A package for automatic evalu-
ation of summaries. In Text summarization branches
out, pages 74–81.
Liu, S., Ren, Z., and Yuan, J. (2020). SibNet: Sibling con-
volutional encoder for video captioning. IEEE Trans.
on Pattern Analysis and Machine Intelligence, pages
1–1.
Melis, G., Kočiský, T., and Blunsom, P. (2020). Mogrifier
LSTM. In Proceedings of the ICLR.
Mogadala, A., Shen, X., and Klakow, D. (2020). Integrat-
ing image captioning with rule-based entity masking.
arXiv preprint arXiv:2007.11690.
Aafaq, N., Mian, A., Liu, W., Gilani, S. Z., and Shah,
M. (2019). Video description: A survey of meth-
ods, datasets, and evaluation metrics. ACM Comput-
ing Surveys (CSUR), 52(6):1–37.
Pan, B., Cai, H., Huang, D.-A., Lee, K.-H., Gaidon, A.,
Adeli, E., and Niebles, J. C. (2020). Spatio-temporal
graph for video captioning with knowledge distilla-
tion. In IEEE CVPR.
Pan, Y., Yao, T., Li, H., and Mei, T. (2017). Video cap-
tioning with transferred semantic attributes. In IEEE
CVPR.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002).
BLEU: a method for automatic evaluation of machine
translation. In Proceedings of the 40th annual meet-
ing of the Association for Computational Linguistics,
pages 311–318.
Park, C. C., Kim, Y., and Kim, G. (2017). Retrieval of sen-
tence sequences for an image stream via coherence re-
current convolutional networks. IEEE transactions on
pattern analysis and machine intelligence, 40(4):945–
957.
Pei, W., Zhang, J., Wang, X., Ke, L., Shen, X., and Tai, Y.-
W. (2019). Memory-attended recurrent network for
video captioning. In Proceedings of the IEEE CVPR.