
Title: Applying Positional Encoding to Enhance Vision-Language Transformers

Authors: Xuehao Liu; Sarah Delany and Susan McKeever

Affiliation: School of Computer Science, Technological University Dublin, Ireland

Keyword(s): Image Captioning, Positional Encoding, Vision-Language Transformer.

Abstract: Positional encoding is used in both natural language and computer vision transformers. It provides information on the sequence order and relative position of input tokens (such as words in a sentence) for higher performance. Unlike pure language and vision transformers, vision-language transformers do not currently exploit positional encoding schemes to enrich input information. We show that capturing the location information of visual features can help vision-language transformers improve their performance. We take Oscar, one of the state-of-the-art (SOTA) vision-language transformers, as an example transformer for implanting positional encoding. We use image captioning as a downstream task to test performance. We added two types of positional encoding into Oscar: DETR, as an absolute positional encoding approach, and iRPE, for relative positional encoding. With the same training protocol and data, both positional encodings improved the image captioning performance of Oscar by between 6.8% and 24.1% across the five image captioning evaluation criteria used.
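
This page does not include source code, so the sketch below is only an illustration of the idea described in the abstract, assuming a PyTorch setting. It shows how a DETR-style 2D sinusoidal (absolute) positional encoding could be evaluated at the centre of each detected region's bounding box and added to the region features before they are fed into a vision-language transformer such as Oscar. The function name box_sine_embedding, the feature dimension of 256 and the 36-region example are assumptions for illustration, not the authors' implementation.

import math
import torch

def box_sine_embedding(boxes, dim=256, temperature=10000.0):
    # DETR-style 2D sine/cosine encoding evaluated at box centres (illustrative sketch).
    # boxes: (N, 4) tensor of [x1, y1, x2, y2], normalised to [0, 1].
    # Returns an (N, dim) tensor that can be added to the region features.
    cx = (boxes[:, 0] + boxes[:, 2]) / 2          # x centre of each box
    cy = (boxes[:, 1] + boxes[:, 3]) / 2          # y centre of each box

    half = dim // 2                               # half the dims encode y, half encode x
    freqs = temperature ** (2 * torch.arange(half // 2, dtype=torch.float32) / half)

    def encode(coord):
        # One coordinate -> (N, half) concatenated sin/cos features.
        angles = coord[:, None] * 2 * math.pi / freqs
        return torch.cat([angles.sin(), angles.cos()], dim=1)

    return torch.cat([encode(cy), encode(cx)], dim=1)

# Hypothetical usage: region features from an object detector,
# as in Oscar's visual input pipeline.
region_feats = torch.randn(36, 256)               # 36 detected regions
boxes = torch.rand(36, 4).sort(dim=1).values      # dummy normalised boxes
region_feats = region_feats + box_sine_embedding(boxes)

The relative-positional alternative mentioned in the abstract (iRPE) works differently: it injects relative position terms into the self-attention computation rather than into the input embeddings, so it would hook into the transformer's attention layers instead of this input stage.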

CC BY-NC-ND 4.0

Paper citation in several formats:
Liu, X.; Delany, S. and McKeever, S. (2023). Applying Positional Encoding to Enhance Vision-Language Transformers. In Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP; ISBN 978-989-758-634-7; ISSN 2184-4321, SciTePress, pages 838-845. DOI: 10.5220/0011796100003417

@conference{visapp23,
author={Xuehao Liu and Sarah Delany and Susan McKeever},
title={Applying Positional Encoding to Enhance Vision-Language Transformers},
booktitle={Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP},
year={2023},
pages={838-845},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011796100003417},
isbn={978-989-758-634-7},
issn={2184-4321},
}

TY - CONF
JO - Proceedings of the 18th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2023) - Volume 5: VISAPP
TI - Applying Positional Encoding to Enhance Vision-Language Transformers
SN - 978-989-758-634-7
IS - 2184-4321
AU - Liu, X.
AU - Delany, S.
AU - McKeever, S.
PY - 2023
SP - 838
EP - 845
DO - 10.5220/0011796100003417
PB - SciTePress
ER -