Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models

Yushan Wang, Shuhei Tarashima, Norio Tagawa

2025

Abstract

Fully-transformer frameworks have gradually replaced traditional convolutional neural networks (CNNs) in recent 3D human pose and shape estimation tasks, largely because their attention mechanism can capture long-range and complex relationships between input tokens, surpassing the representational capabilities of CNNs. Recent attention designs have reduced the computational complexity of transformers in core computer vision tasks such as classification and segmentation, achieving extraordinarily strong results. However, their potential for more complex, higher-level tasks remains unexplored. For the first time, we propose to integrate the group-mix attention mechanism into the 3D human pose and shape estimation task. We combine token-to-token, token-to-group, and group-to-group correlations, enabling a broader capture of relationships between human body parts and making the approach promising for challenging scenarios such as occlusion and blur. We believe this mix of tokens and groups is well suited to our task, where the relevant parts of the human body are often not individual tokens but larger regions. We quantitatively and qualitatively validate that our method reduces the parameter count by 97.3% (from 620M to 17M) and the FLOPs by 96.1% (from 242.1G to 9.5G), with a performance gap of less than 3%.
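The abstract describes group-mix attention as combining token-to-token, token-to-group, and group-to-group correlations. The following is a minimal, hypothetical PyTorch sketch of such a block, assuming a GroupMixFormer-style design in which some attention branches operate on individual tokens while others aggregate keys and values into group proxies via depth-wise convolutions of increasing kernel size. It is an illustration only, not the authors' implementation; the class name, branch count, and kernel sizes are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupMixAttention(nn.Module):
    """Sketch of an attention block mixing token-level and group-level branches."""

    def __init__(self, dim: int, num_branches: int = 4, group_kernels=(1, 3, 5, 7)):
        super().__init__()
        assert dim % num_branches == 0 and len(group_kernels) == num_branches
        self.branch_dim = dim // num_branches
        self.qkv = nn.Linear(dim, dim * 3)
        # Kernel size 1 keeps plain token-to-token attention; larger kernels
        # aggregate neighbouring tokens into overlapping group proxies.
        self.group_agg = nn.ModuleList(
            nn.Conv1d(self.branch_dim, self.branch_dim, k,
                      padding=k // 2, groups=self.branch_dim)
            for k in group_kernels
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        outputs = []
        for i, agg in enumerate(self.group_agg):
            s, e = i * self.branch_dim, (i + 1) * self.branch_dim
            qi, ki, vi = q[..., s:e], k[..., s:e], v[..., s:e]
            # Form group proxies for keys/values (pure token branch when kernel=1).
            ki = agg(ki.transpose(1, 2)).transpose(1, 2)
            vi = agg(vi.transpose(1, 2)).transpose(1, 2)
            attn = F.softmax(qi @ ki.transpose(1, 2) / self.branch_dim ** 0.5, dim=-1)
            outputs.append(attn @ vi)
        return self.proj(torch.cat(outputs, dim=-1))

In a pose-and-shape encoder, a block like this could replace standard self-attention over body-part tokens: the kernel-1 branch retains per-token correlations, while the larger kernels let queries attend to aggregated groups of neighbouring tokens, loosely matching the token/group mixing the paper motivates.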


Paper Citation


in Harvard Style

Wang Y., Tarashima S. and Tagawa N. (2025). Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-728-3, SciTePress, pages 735-742. DOI: 10.5220/0013306400003912


in Bibtex Style

@conference{visapp25,
author={Yushan Wang and Shuhei Tarashima and Norio Tagawa},
title={Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2025},
pages={735-742},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013306400003912},
isbn={978-989-758-728-3},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - Efficient 3D Human Pose and Shape Estimation Using Group-Mix Attention in Transformer Models
SN - 978-989-758-728-3
AU - Wang Y.
AU - Tarashima S.
AU - Tagawa N.
PY - 2025
SP - 735
EP - 742
DO - 10.5220/0013306400003912
PB - SciTePress