STEP: SuperToken and Early-Pruning for Efficient Semantic Segmentation

Mathilde Proust, Martyna Poreba, Michal Szczepanski, Karim Haroun

2025

Abstract

Vision Transformers (ViTs) achieve state-of-the-art accuracy in numerous vision tasks, but their heavy computational and memory requirements pose significant challenges. Minimizing token-related computations is critical to alleviating this burden. This paper introduces a novel SuperToken and Early-Pruning (STEP) approach that combines patch merging with an early-pruning mechanism to optimize token handling in ViTs for semantic segmentation. The improved patch merging method is designed to handle the varying complexity of image content. It features a dynamic and adaptive system, dCTS, which employs a CNN-based policy network to determine the number and size of patch groups that share the same supertoken during inference. Its flexible merging strategy handles superpatches of varying sizes: 2×2, 4×4, 8×8, and 16×16. Early in the network, tokens predicted with high confidence are finalized and excluded from subsequent processing stages. This hybrid approach reduces both computational and memory requirements without significantly compromising segmentation accuracy. Experimental results show that, on average, 40% of tokens can be predicted from the 16th layer onwards when using ViT-Large as the backbone. Additionally, a reduction of up to 3× in computational complexity is achieved, with a maximum drop in accuracy of 2.5%.
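
The early-pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how token confidence is measured, so the sketch assumes it is the maximum softmax probability over per-token class logits from a hypothetical auxiliary head, and the function name `early_prune` and the threshold of 0.95 are illustrative choices only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_prune(token_logits, threshold=0.95):
    """Split tokens into finalized and still-active sets.

    token_logits: one list of class logits per token, as produced by a
    hypothetical auxiliary classification head at an intermediate layer.
    threshold: assumed confidence cut-off (not taken from the paper).

    Returns (finalized, active): `finalized` maps token index to its
    predicted class; `active` lists token indices that would continue
    through the remaining layers.
    """
    finalized = {}
    active = []
    for idx, logits in enumerate(token_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            finalized[idx] = best   # prediction kept, token dropped
        else:
            active.append(idx)      # token processed by later layers
    return finalized, active


# Toy input: four tokens, three classes.
logits = [
    [8.0, 0.0, 0.0],  # strongly class 0
    [0.2, 0.1, 0.0],  # ambiguous
    [0.0, 0.0, 9.0],  # strongly class 2
    [1.0, 1.2, 0.9],  # ambiguous
]
finalized, active = early_prune(logits, threshold=0.95)
print(finalized, active)
```

In this toy input, the two confident tokens are finalized and only the two ambiguous ones remain active, so later layers would operate on half of the original sequence.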


Paper Citation


in Harvard Style

Proust M., Poreba M., Szczepanski M. and Haroun K. (2025). STEP: SuperToken and Early-Pruning for Efficient Semantic Segmentation. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP; ISBN 978-989-758-728-3, SciTePress, pages 50-61. DOI: 10.5220/0013132800003912


in Bibtex Style

@conference{visapp25,
author={Mathilde Proust and Martyna Poreba and Michal Szczepanski and Karim Haroun},
title={STEP: SuperToken and Early-Pruning for Efficient Semantic Segmentation},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP},
year={2025},
pages={50-61},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013132800003912},
isbn={978-989-758-728-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: VISAPP
TI - STEP: SuperToken and Early-Pruning for Efficient Semantic Segmentation
SN - 978-989-758-728-3
AU - Proust M.
AU - Poreba M.
AU - Szczepanski M.
AU - Haroun K.
PY - 2025
SP - 50
EP - 61
DO - 10.5220/0013132800003912
PB - SciTePress