
block-based models should be the preferred choice for
developing TAL applications, especially when performance
requirements are demanding and high-performance hardware
resources are limited. Additionally, we found that learning
temporal dependencies in sequences from both directions, as
the ViM and DBM blocks do through their backward scanning
process, can further enhance the model's performance for TAL.
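To make the backward-scanning idea concrete, the sketch below shows one way a ViM/DBM-style bidirectional block can be arranged: the snippet-feature sequence is scanned once in the forward direction and once on a time-reversed copy, and the two outputs are fused with a residual connection. The class names, the GRU used as a stand-in for an actual Mamba selective-scan block, and the additive fusion are illustrative assumptions, not the exact design of the evaluated models.

import torch
import torch.nn as nn

class CausalScan(nn.Module):
    # Placeholder for a causal sequence block mapping (B, T, C) -> (B, T, C);
    # a unidirectional GRU stands in for a Mamba selective-scan block here.
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out

class BidirectionalScanBlock(nn.Module):
    # ViM/DBM-style wrapper: scan the sequence forward and backward with
    # separate branches, then fuse the two outputs.
    def __init__(self, dim: int):
        super().__init__()
        self.fwd = CausalScan(dim)    # forward-scanning branch
        self.bwd = CausalScan(dim)    # backward-scanning branch
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):             # x: (B, T, C) snippet features
        y_fwd = self.fwd(x)
        # Reverse time, scan, then reverse back so outputs align per time step.
        y_bwd = torch.flip(self.bwd(torch.flip(x, dims=[1])), dims=[1])
        return self.norm(x + y_fwd + y_bwd)   # residual fusion

# Example: 2 videos, 128 snippets, 256-d features -> same shape out.
feats = torch.randn(2, 128, 256)
out = BidirectionalScanBlock(256)(feats)      # torch.Size([2, 128, 256])

Any learned fusion (e.g., gating or concatenation followed by a projection) could replace the simple residual addition used in this sketch.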
We focused our analysis on a limited set of hybrid models.
However, there are several other approaches to building hybrid
models, such as those proposed in (Hatamizadeh and Kautz, 2024)
and (Behrouz et al., 2024), and exploring their adaptation to
TAL is one potential direction for future work. Another avenue
would be developing models that combine the simplicity and
performance of Mamba blocks with their dual scanning capability.
ACKNOWLEDGEMENTS
This work has been partially supported by the Span-
ish project PID2022-136436NB-I00 and by ICREA
under the ICREA Academia programme.
REFERENCES
Alwassel, H., Caba Heilbron, F., Escorcia, V., and Ghanem,
B. (2018). Diagnosing error in temporal action detec-
tors. In ECCV.
Alwassel, H., Giancola, S., and Ghanem, B. (2021). TSP:
Temporally-sensitive pretraining of video encoders
for localization tasks. In ICCV, pages 3173–3183.
Behrouz, A., Santacatterina, M., and Zabih, R. (2024).
MambaMixer: Efficient selective state space mod-
els with dual token and channel selection. arXiv
2403.19888.
Bisong, E. (2019). Recurrent neural networks (RNNs). In
Building Machine Learning and Deep Learning Models
on Google Cloud Platform: A Comprehensive Guide for
Beginners, pages 443–473.
Bodla, N., Singh, B., Chellappa, R., and Davis, L. S. (2017).
Soft-NMS: Improving object detection with one line
of code. In 2017 ICCV, pages 5562–5570.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset. In
2017 CVPR, pages 4724–4733.
Chen, G., Huang, Y., Xu, J., Pei, B., Chen, Z., Li, Z., Wang,
J., Li, K., Lu, T., and Wang, L. (2024). Video mamba
suite: State space model as a versatile alternative for
video understanding. arXiv 2403.09626.
Chen, G., Zheng, Y., Wang, L., and Lu, T. (2022). DCAN:
Improving temporal action detection via dual context
aggregation. In AAAI 2022, pages 248–257.
Damen, D., Doughty, H., Farinella, G. M., Fidler, S.,
Furnari, A., Kazakos, E., Moltisanti, D., Munro, J.,
Perrett, T., Price, W., and Wray, M. (2018). Scal-
ing egocentric vision: The epic-kitchens dataset. In
ECCV.
Elharrouss, O., Almaadeed, N., and Al-Maadeed, S. (2021).
A review of video surveillance systems. J Vis Com-
mun Image Represent, 77:103116.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
Slowfast networks for video recognition. In 2019
ICCV, pages 6201–6210, Los Alamitos, CA, USA.
Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A.,
and Ré, C. (2023). Hungry Hungry Hippos: Towards
language modeling with state space models. In Inter-
national Conference on Learning Representations.
Ghosh, I., Ramasamy Ramamurthy, S., Chakma, A., and
Roy, N. (2023). Sports analytics review: Artificial
intelligence applications, emerging technologies, and
algorithmic perspective. WIREs Data Min. Knowl.
Discov., 13(5):e1496.
Gong, G., Zheng, L., and Mu, Y. (2020). Scale matters:
Temporal scale aggregation network for precise action
localization in untrimmed videos. In 2020 ICME.
Gu, A. and Dao, T. (2023). Mamba: Linear-time se-
quence modeling with selective state spaces. arXiv
2312.00752.
Gu, A., Goel, K., and Ré, C. (2022). Efficiently modeling
long sequences with structured state spaces. In The In-
ternational Conference on Learning Representations.
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra,
A., and Ré, C. (2021). Combining recurrent, convo-
lutional, and continuous-time models with linear state
space layers. In NeurIPS, 34:572–585.
Hatamizadeh, A. and Kautz, J. (2024). MambaVision: A
hybrid Mamba-Transformer vision backbone. arXiv
2407.08083.
Heilbron, F. C., Escorcia, V., Ghanem, B., and Niebles, J. C.
(2015). ActivityNet: A large-scale video benchmark
for human activity understanding. In 2015 CVPR.
Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev,
I., Sukthankar, R., and Shah, M. (2017). The THU-
MOS challenge on action recognition for videos “in
the wild”. Comput. Vis. Image Underst., 155:1–23.
Kalman, R. E. (1960). A New Approach to Linear Filtering
and Prediction Problems. J. Basic Eng., 82(1):35–45.
Kang, T.-K., Lee, G.-H., Jin, K.-M., and Lee, S.-W. (2023).
Action-aware masking network with group-based at-
tention for temporal action localization. In 2023
WACV, pages 6047–6056.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C.,
Vijayanarasimhan, S., Viola, F., Green, T., Back, T.,
Natsev, P., Suleyman, M., and Zisserman, A. (2017).
The Kinetics human action video dataset. arXiv 1705.06950.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization. In 3rd ICLR Proceedings.
LeCun, Y., Boser, B., Denker, J., Henderson, D., Howard,
R., Hubbard, W., and Jackel, L. (1990). Handwritten
digit recognition with a back-propagation network. In NIPS.