Vision-Perceptual Transformer Network for Semantic Scene Understanding

Mohamad Alansari, Hamad AlRemeithi, Bilal Hassan, Sara Alansari, Jorge Dias, Majid Khonji, Naoufel Werghi, Sajid Javed

2024

Abstract

Semantic segmentation, a core task in computer vision, involves labeling each image pixel with its semantic class. Transformer-based models, recognized for their exceptional performance, have been pivotal in advancing this field. Our contribution, the Vision-Perceptual Transformer Network (VPTN), combines transformer encoders with a feature pyramid-based decoder to deliver precise segmentation maps with minimal computational burden. VPTN's strength lies in its integration of the pyramiding technique, which improves the handling of multi-scale variations. In direct comparisons with Vision Transformer-based networks and their variants, VPTN consistently excels. On average, it achieves 4.2%, 3.41%, and 6.24% higher mean Intersection over Union (mIoU) than the Dense Prediction Transformer (DPT), Data-efficient image Transformer (DeiT), and Swin Transformer networks, while demanding only 15.63%, 3.18%, and 10.05% of their Giga Floating-Point Operations (GFLOPs). Our validation spans five diverse datasets: Cityscapes, BDD100K, Mapillary Vistas, CamVid, and ADE20K. VPTN sets the state of the art (SOTA) on BDD100K and CamVid and consistently outperforms existing deep learning models on the other datasets, with mIoU scores of 82.6%, 67.29%, 61.2%, 86.3%, and 55.3%, respectively, while requiring an average computational complexity of just 11.44% of SOTA models. VPTN represents a significant advancement in semantic segmentation, balancing efficiency and performance, and shows promising potential for autonomous driving and for computer vision applications in natural settings.
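
For readers who want a concrete reference for the headline metric, the sketch below computes per-class IoU and mIoU from integer-valued predicted and ground-truth label maps. It is a minimal illustration of the standard metric, not code from the paper; the function name mean_iou, the array shapes, the 19-class example, and the ignore_index value are assumptions made for the example.

import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Compute per-class IoU and mIoU for integer label maps.

    pred, target: arrays of shape (H, W) holding class indices.
    Pixels whose ground-truth label equals ignore_index are skipped.
    """
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]

    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        intersection = np.logical_and(pred_c, target_c).sum()
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:  # class absent in both maps: skip it
            continue
        ious.append(intersection / union)
    return ious, float(np.mean(ious))

# Example with random 19-class maps (Cityscapes uses 19 evaluation classes).
rng = np.random.default_rng(0)
pred = rng.integers(0, 19, size=(512, 1024))
target = rng.integers(0, 19, size=(512, 1024))
per_class, miou = mean_iou(pred, target, num_classes=19)
print(f"mIoU: {miou:.4f}")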

Paper Citation


in Harvard Style

Alansari M., AlRemeithi H., Hassan B., Alansari S., Dias J., Khonji M., Werghi N. and Javed S. (2024). Vision-Perceptual Transformer Network for Semantic Scene Understanding. In Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP; ISBN 978-989-758-679-8, SciTePress, pages 325-332. DOI: 10.5220/0012313800003660


in Bibtex Style

@conference{visapp24,
author={Mohamad Alansari and Hamad AlRemeithi and Bilal Hassan and Sara Alansari and Jorge Dias and Majid Khonji and Naoufel Werghi and Sajid Javed},
title={Vision-Perceptual Transformer Network for Semantic Scene Understanding},
booktitle={Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP},
year={2024},
pages={325-332},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0012313800003660},
isbn={978-989-758-679-8},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP
TI - Vision-Perceptual Transformer Network for Semantic Scene Understanding
SN - 978-989-758-679-8
AU - Alansari M.
AU - AlRemeithi H.
AU - Hassan B.
AU - Alansari S.
AU - Dias J.
AU - Khonji M.
AU - Werghi N.
AU - Javed S.
PY - 2024
SP - 325
EP - 332
DO - 10.5220/0012313800003660
PB - SciTePress