Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study
Zejian Zhang, Cristina Palmero, Sergio Escalera
2025
Abstract
Deep learning models need to encode both local and global temporal dependencies for accurate temporal action localization (TAL). Recent approaches have relied on Transformer blocks, which have quadratic complexity in the sequence length. By contrast, Mamba blocks have been adapted for TAL due to their comparable performance and lower complexity. However, various factors can influence the choice between these models, and a thorough analysis of them can provide valuable insights into the selection process. In this work, we analyze the Transformer block, Mamba block, and their combinations as temporal feature encoders for TAL, measuring their overall performance, efficiency, and sensitivity across different contexts. Our analysis suggests that Mamba blocks should be preferred due to their performance and efficiency. Hybrid encoders can serve as an alternative choice when sufficient computational resources are available.
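To make the complexity contrast in the abstract concrete, the following is a minimal back-of-the-envelope sketch, not the paper's measurement protocol. It assumes a self-attention layer costs roughly 2·T²·d multiply-accumulates (score computation plus value mixing) and a state-space scan costs roughly T·d·N, where T is the number of temporal steps, d the feature dimension, and N the state size. The function names, d = 512, and N = 16 are illustrative assumptions; projections, constants, and hardware effects are ignored.

```python
def attention_cost(seq_len: int, dim: int) -> int:
    """Rough multiply-accumulate count for self-attention: 2 * T^2 * d.

    Assumption: counts only the T x T score matrix and the value mixing,
    ignoring the linear projections and the softmax.
    """
    return 2 * seq_len * seq_len * dim


def ssm_scan_cost(seq_len: int, dim: int, state_size: int = 16) -> int:
    """Rough multiply-accumulate count for a linear-time scan: T * d * N.

    Assumption: state_size=16 is an illustrative value, not taken from the paper.
    """
    return seq_len * dim * state_size


if __name__ == "__main__":
    dim = 512  # hypothetical feature dimension
    for seq_len in (256, 1024, 4096):
        attn = attention_cost(seq_len, dim)
        scan = ssm_scan_cost(seq_len, dim)
        # The ratio grows linearly with seq_len under these assumptions.
        print(f"T={seq_len}: attention={attn}, scan={scan}, ratio={attn / scan:.1f}")
```

Under these assumptions, doubling the sequence length quadruples the attention estimate but only doubles the scan estimate, which is the scaling gap the abstract refers to. Actual runtime and memory depend on the implementation and are what the study measures empirically.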
Paper Citation
in Harvard Style
Zhang Z., Palmero C. and Escalera S. (2025). Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study. In Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP; ISBN 978-989-758-728-3, SciTePress, pages 150-162. DOI: 10.5220/0013173000003912
in Bibtex Style
@conference{visapp25,
author={Zejian Zhang and Cristina Palmero and Sergio Escalera},
title={Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study},
booktitle={Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP},
year={2025},
pages={150-162},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0013173000003912},
isbn={978-989-758-728-3},
}
in EndNote Style
TY - CONF
JO - Proceedings of the 20th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP
TI - Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study
SN - 978-989-758-728-3
AU - Zhang Z.
AU - Palmero C.
AU - Escalera S.
PY - 2025
SP - 150
EP - 162
DO - 10.5220/0013173000003912
PB - SciTePress