Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 11336–11344.
Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., and Hoi, S. C. H. (2021). Align before fuse: Vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705.
Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al. (2020b). Oscar: Object-semantics aligned pre-training for vision-language tasks. In European Conference on Computer Vision, pages 121–137. Springer.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo, B. (2021). Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022.
Ordonez, V., Kulkarni, G., and Berg, T. (2011). Im2Text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24.
Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., and Shlens, J. (2019). Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28.
Rennie, S. J., Marcheret, E., Mroueh, Y., Ross, J., and Goel, V. (2017). Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7008–7024.
Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., and Sun, J. (2019). Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8430–8439.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. (2018). Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
Shaw, P., Uszkoreit, J., and Vaswani, A. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.
Vedantam, R., Lawrence Zitnick, C., and Parikh, D. (2015). CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575.
Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. (2022). OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang, D., Lu, T., Luo, P., and Shao, L. (2021). Pyramid Vision Transformer: A versatile backbone for dense prediction without convolutions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 568–578.
Wu, K., Peng, H., Chen, M., Fu, J., and Chao, H. (2021). Rethinking and improving relative position encoding for vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10033–10041.
Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H., and Wang, H. (2021). ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3208–3216.
Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L., Choi, Y., and Gao, J. (2021). VinVL: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5579–5588.
Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., and Gao, J. (2020). Unified vision-language pre-training for image captioning and VQA. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 13041–13049.
Zhu, X., Su, W., Lu, L., Li, B., Wang, X., and Dai, J. (2020). Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159.