Transformer or Mamba for Temporal Action Localization? Insights from
a Comprehensive Experimental Comparison Study
Zejian Zhang¹, Cristina Palmero² and Sergio Escalera¹
¹Universitat de Barcelona and Computer Vision Center, Barcelona, Spain
²Department of Engineering, King’s College London, London, U.K.
ORCID: Zejian Zhang, https://orcid.org/0000-0001-9810-3726; Cristina Palmero, https://orcid.org/0000-0002-6085-6527; Sergio Escalera, https://orcid.org/0000-0003-0617-8873
Keywords:
Temporal Action Localization (TAL), Temporal Action Detection (TAD), Transformer, Self-Attention, State Space Models (SSMs), Mamba, Multi-Scale.
Abstract:
Deep learning models need to encode both local and global temporal dependencies for accurate temporal
action localization (TAL). Recent approaches have relied on Transformer blocks, which have quadratic complexity. By contrast, Mamba blocks have been adapted for TAL due to their comparable performance and
lower complexity. However, various factors can influence the choice between these models, and a thorough
analysis of them can provide valuable insights into the selection process. In this work, we analyze the Trans-
former block, Mamba block, and their combinations as temporal feature encoders for TAL, measuring their
overall performance, efficiency, and sensitivity across different contexts. Our analysis suggests that Mamba
blocks should be preferred due to their performance and efficiency. Hybrid encoders can serve as an alternative
choice when sufficient computational resources are available.
1 INTRODUCTION
Temporal action localization (TAL) is a challenging
yet critical task for video analysis, owing to its wide
range of real-world applications such as video surveil-
lance, sports analytics, and human activity under-
standing (Elharrouss et al., 2021) (Ghosh et al., 2023)
(Saleem et al., 2023). TAL involves identifying pre-
cise start and end timestamps of actions and assigning
their action labels in untrimmed videos. Due to the complexity of video content, actions in videos often exhibit ambiguous boundaries and varied durations, making it difficult for traditional convolutional
neural network (CNN)-based models (Gong et al.,
2020) (Zhu et al., 2021) to learn necessary temporal
dependencies to output precise action boundaries.
Recent approaches (Zhang et al., 2022) (Shao
et al., 2023) (Zhang et al., 2024) in TAL have adapted
Transformers (Vaswani et al., 2017) to encode video
frames into a multi-scale representation (see Fig.
1A), leveraging the advantages of both self-attention
(Vaswani et al., 2017) and multi-resolution action de-
tection. However, the self-attention operation has a
quadratic complexity, representing one of the biggest hurdles for processing long sequences. Researchers have been exploring new possibilities to break this quadratic computational cost (Gu et al., 2021).
Figure 1: Illustration of the multi-scale TAL architecture. (A) The model involves extracting clip features from video frames, applying a feature embedding for dimensional reduction, and encoding the input into an L-level multi-scale representation by temporal feature encoders, followed by an action localization head. (B) The three types of temporal feature encoder we compare and study in this work.
Structured State Space Models (SSMs) (Gu et al.,
2021) (Gu et al., 2022), particularly the Mamba block
(Gu and Dao, 2023), have been proposed and have
demonstrated promising performance in various long
sequence modeling tasks (Gu and Dao, 2023). Com-
pared to the state-of-the-art Transformer-based meth-
ods, Mamba has fewer parameters and scales nearly
linearly with sequence length, making it effective
for processing untrimmed video sequences for TAL.
Early adaptations in this direction include Action-
Mamba (Chen et al., 2024) and CausalTAD (Liu et al.,
2024b). Despite these initial attempts to explore
Mamba’s potential for TAL, there is a lack of com-
parison among these encoders, especially when com-
bining them (the Transformer and Mamba block) into
a single encoder used to learn temporal dependencies.
Such a comparison could be useful for providing in-
sights into further explorations of their capacities in
TAL and other video understanding tasks.
In this work, we aim to conduct a comprehen-
sive experimental comparison of the well-established
Transformer, the Mamba block, and their integration
into a hybrid architecture. We focus on their perfor-
mance, complexity, and efficiency in handling video
sequences for the TAL task. Specifically, we analyze
four groups of models equipped with different tempo-
ral feature encoders, resulting in a total of 12 distinct
models. Our contributions are as follows:
1. We present a comprehensive experimental com-
parison of the Transformer and Mamba block-based
models for TAL. These models are categorized into:
Transformer-based models utilizing either global or
sliding window attention; pure Mamba block-based
models, including the original Mamba block, bidirec-
tional Mamba (ViM) (Zhu et al., 2024), and decomposed bidirectional Mamba (DBM) (Chen et al., 2024) blocks; a sequential architecture that integrates the Transformer block with Mamba blocks; and a parallel hybrid model that utilizes the Transformer and Mamba blocks concurrently.
2. We adapt the MambaFormer block (Park et al.,
2024) into a multi-scale architecture, marking, to the
best of our knowledge, its first application for TAL
tasks. We extend this configuration to include six dif-
ferent temporal feature encoders.
3. We conduct experiments on three commonly
used TAL datasets, providing insights for selecting
different temporal feature encoders based on datasets
and requirements. Our evaluation uses various met-
rics, including the standard performance measures,
efficiency, and sensitivity. Additionally, we analyze
the scaling capability of the original Mamba block.
Our results suggest that, in most cases, pure Mamba
block-based models should be preferred due to their
superior performance and lower complexity. Hybrid
counterparts may serve as an alternative choice for
scenarios with higher performance demands.
2 RELATED WORK
Temporal Action Localization. Research in TAL
can be divided into two categories: two-stage and
one-stage methods. Two-stage methods (Zhu et al.,
2021) (Sridhar et al., 2021) (Zhao et al., 2021) (Chen
et al., 2022) involve generating coarse proposals fol-
lowed by refining these proposals to produce the final
outputs. By contrast, one-stage methods (Lin et al.,
2021) (Liu et al., 2022) (Zhang et al., 2022) (Kang
et al., 2023) (Zhao et al., 2023) (Shi et al., 2023)
tackle both sub-tasks simultaneously within a unified
framework. An effective TAL model relies on en-
coding relevant long and short temporal dependen-
cies in the input data. Various deep learning methods
have been explored for this purpose, including CNN
(Lecun et al., 1990), graph neural networks (GNN)
(Scarselli et al., 2009), and attention-based (Vaswani
et al., 2017) methods. Recent advancements (Zhang
et al., 2022) (Shao et al., 2023) (Zhang et al., 2024)
(Chen et al., 2024) (Liu et al., 2024b) have partic-
ularly focused on one-stage methods, which adapt
temporal encoders such as the Transformer block
(Vaswani et al., 2017) and Mamba block (Gu and
Dao, 2023) to transform input videos into a multi-
scale representation, where the temporal resolution
is downsampled at higher levels, thereby enabling
the localization of actions across different resolutions.
These methods have demonstrated superior perfor-
mance compared to previous CNN- or GNN-based
approaches thanks to their multi-scale design and the
ability to process long sequence data. In this work, we
focus on analyzing the well-established Transformer
block, the emerging Mamba block, and the combi-
nation of both into a hybrid architecture (Park et al.,
2024) (Liu et al., 2024b), exploring their performance
and computational efficiency as temporal feature en-
coders in the multi-scale architecture for TAL task on
different benchmarks.
State Space Models. Inspired by the traditional
SSMs in control theory (Kalman, 1960), structured
SSMs (Gu et al., 2021) (Gu et al., 2022) (Fu et al.,
2023) were proposed for modeling long-range se-
quences such as in natural language processing, over-
coming the performance issues in CNNs (Lecun et al.,
1990) and recurrent neural networks (RNNs) (Bisong
and Bisong, 2019), as well as the quadratic scaling
issue in the Transformer (Vaswani et al., 2017) en-
coders. Mamba, in particular, combines an SSM layer
(Fu et al., 2023) with gated MLP (multi-layer per-
ceptron) (Liu et al., 2024a), achieving on par per-
formance in sequence data modeling, such as lan-
guage and audio (Gu and Dao, 2023), while demon-
strating linear scaling with sequence length. Mamba
has been explored for visual applications, such as Vi-
sion Mamba (Zhu et al., 2024) for images and Video
Mamba Suite (Chen et al., 2024) for video under-
standing. We study and analyze the original Mamba
block (Gu and Dao, 2023) and its variants, namely,
ViM (Zhu et al., 2024) and DBM (Chen et al., 2024),
as the sole encoder in the multi-scale architecture for
TAL due to their strong capability to process long se-
quence data.
Hybrid Transformer and Mamba. While
Mamba has shown promising performance in certain
tasks, it has been found that a pure Mamba-equipped
encoder does not perform as well as Transformer-
based counterparts, e.g., in in-context learning tasks
that require non-standard retrieval capabilities (Park
et al., 2024). Various studies have explored inte-
grating the Transformer block with Mamba blocks
into a hybrid architecture to capitalize on their com-
plementary strengths for capturing long-range depen-
dencies. For instance, MambaVision (Hatamizadeh
and Kautz, 2024) augments the Mamba block with
Transformer layers to enhance global context recov-
ery. MambaFormer (Park et al., 2024) combines
Mamba with self-attention blocks. Instead of a se-
quential combination of Transformer and Mamba
blocks, CausalTAD (Liu et al., 2024b) introduces a
dual architecture for TAL, where the input is pro-
cessed simultaneously by a Transformer block and a
Mamba block. Despite the existence of other possi-
ble combinations, this study concentrates on two ar-
chitectures: MambaFormer (a sequential model) and
CausalTAD (a parallel model). Our goals are thus to: 1) adapt the MambaFormer architecture for TAL tasks, and 2) evaluate the performance of both architectures as temporal feature encoders for TAL applications.
3 METHODOLOGY
3.1 TAL Problem Definition
Given an input video, it is represented by a set of pre-extracted features $X = \{x_1, x_2, \ldots, x_T\}$, where $t \in \{1, 2, \ldots, T\}$ indexes the frames and $T$ varies across videos. The goal is to predict all the action instances $Y$ based on $X$, where $Y = \{y_1, y_2, \ldots, y_N\}$ are the ground-truth action instances. Each action instance is defined as $y_i = (s_i, e_i, c_i)$, where $i \in \{1, 2, \ldots, N\}$ indexes the action instances, $s_i, e_i \in [1, T]$ are the start and end timestamps ($s_i < e_i$), and $c_i \in \{1, 2, \ldots, C\}$ is the corresponding action label.
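For illustration, a minimal sketch of how such a sample might be laid out in code (the variable names and sizes are our own and not tied to any particular dataset toolkit):

```python
import numpy as np

# Hypothetical layout of one TAL training sample: T clip features of dimension D,
# plus N ground-truth instances y_i = (s_i, e_i, c_i) with s_i < e_i.
T, D = 2304, 2048                                # illustrative sequence/feature sizes
X = np.random.randn(T, D).astype(np.float32)     # pre-extracted clip features
Y = [(120, 180, 3), (540, 562, 3)]               # (start, end, class id)
for s, e, c in Y:
    assert 0 <= s < e <= T                       # boundaries lie within the sequence
```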
3.2 Model Overview
We employ a multi-scale architecture, as illustrated
in Fig. 1, to construct the model used for evaluation,
due to its straightforward design and superior perfor-
mance demonstrated in recent one-stage TAL meth-
ods (Zhang et al., 2022) (Shi et al., 2023) (Zhang
et al., 2024). The model comprises three parts: fea-
ture extraction and embedding, multi-scale temporal
feature encoding, and action localization. The in-
put videos are processed by pre-trained CNN mod-
els, such as I3D (Carreira and Zisserman, 2017) on
Kinetics (Kay et al., 2017), to extract clip features.
The extracted features are then embedded into a la-
tent space that matches the dimensional requirements
of the temporal feature encoder. Action localization
is performed by a two-branch head, one for classify-
ing action labels and the other for regressing action
boundaries. The head is shared across all levels of the
multi-scale feature representation. Since our primary
objective is to analyze the performance of different
temporal feature encoders, we fix the architecture de-
sign of the feature embedding and localization head
while varying the temporal feature encoders.
In the following section, we present the temporal
feature encoders utilized for encoding the input video
into a multi-scale representation.
3.3 Temporal Feature Encoders
The evaluated temporal feature encoders are catego-
rized into three types: pure Transformer blocks, pure
Mamba blocks, and hybrid architectures that combine
self-attention layers with Mamba blocks.
3.3.1 Preliminaries
In this section, we provide a brief review of the self-
attention mechanism and the SSMs, which serve as
the core operation for the Transformer block and
Mamba block, respectively.
Self-Attention. The self-attention operation takes
an arbitrary length of token embedding (X) as input
and aggregates relevant information by measuring the
similarity among tokens. The attention score is com-
puted by $\mathrm{Attn}(Q, K, V) = \mathrm{Softmax}\left(QK^{\top}/\sqrt{d_k}\right)V$, where
Q, K, and V are the queries, keys and values projected
from the input embedding. To encode richer and more
robust temporal relationships, the multi-head atten-
tion is commonly used, with each attention head fo-
cusing on different aspects simultaneously. Subse-
quently, the outputs from these heads are concate-
nated to form the final attention scores. In order to mitigate the quadratic complexity, sliding window attention is often used instead, restricting the attention calculation to a local window of fixed size.
Figure 2: Illustration of temporal feature encoders. (A) The Transformer (Vaswani et al., 2017) block, which utilizes (sliding window) multi-head self-attention. (B) The original Mamba block (Gu and Dao, 2023), which utilizes SSMs. (C) The ViM block (Zhu et al., 2024), which adds a backward scanning to the original Mamba block. (D) The DBM block (Chen et al., 2024), which separates the input projections and adds a backward scanning with shared weights.
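To make the windowed variant concrete, the sketch below (our own simplified illustration, not the ActionFormer implementation) restricts PyTorch's standard multi-head attention to a fixed local window via a band-shaped boolean mask:

```python
import torch
import torch.nn as nn

def local_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask that is True where attention is NOT allowed, so each query
    only attends to keys within +/- window // 2 positions."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() > window // 2

T, C = 256, 128                        # illustrative sequence length and embedding size
x = torch.randn(1, T, C)
mha = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
out, _ = mha(x, x, x, attn_mask=local_window_mask(T, window=19))
print(out.shape)                       # torch.Size([1, 256, 128])
```

Note that masking a dense attention like this only illustrates the restricted receptive field; efficient sliding-window implementations avoid computing the masked entries altogether.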
State Space Models. State space models were in-
troduced in control theory for modeling continuous
systems using state variables (Kalman, 1960). These
models map an input to an output through a hidden
state. Inspired by this approach, the structured SSMs,
such as S4 (Gu et al., 2022) and Mamba (Gu and
Dao, 2023), have been proposed for processing lan-
guage and vision data. The continuous parameters
are discretized to $\bar{A}$ and $\bar{B}$ through a transformation method, e.g., zero-order hold (ZOH). Subsequently, the SSM for processing a discrete input at time $t$ is defined as $h_t = \bar{A} h_{t-1} + \bar{B} x_t$, $y_t = C h_t$. Finally, for an input $X$, the computation of the output $Y$ involves a global convolution, denoted as $Y = X * \bar{K}$, where $\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{M-1}\bar{B})$ is a structured convolutional kernel and $M$ is the length of $X$.
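As a minimal illustration of this recurrence, the toy scan below uses fixed (non-selective) parameters and a plain Python loop, which is not how Mamba's hardware-aware kernels are implemented:

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """Toy discrete SSM: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.
    x: (T, D_in); A_bar: (N, N); B_bar: (N, D_in); C: (D_out, N)."""
    h = torch.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                              # sequential recurrence over time steps
        h = A_bar @ h + B_bar @ x_t
        ys.append(C @ h)
    return torch.stack(ys)                     # (T, D_out)

x = torch.randn(16, 1)
A_bar, B_bar, C = 0.9 * torch.eye(4), torch.randn(4, 1), torch.randn(1, 4)
print(ssm_scan(x, A_bar, B_bar, C).shape)      # torch.Size([16, 1])
```

Because the parameters here do not depend on the input, the same output could equivalently be computed with the global convolution $Y = X * \bar{K}$ described above.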
3.3.2 Transformer Block
A Transformer block (see Fig. 2A) consists of a (slid-
ing window) multi-head self-attention (MHA) layer
followed by an MLP block. To facilitate stable and ef-
ficient training, Layer Normalization (LN) is applied
before the MHA and MLP layers, and residual con-
nections are added after each. Given an input $X_i$, the output of the Transformer block $X_o$ is obtained by $X' = X_i + \mathrm{MHA}(\mathrm{LN}(X_i))$ and $X_o = X' + \mathrm{MLP}(\mathrm{LN}(X'))$.
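A compact pre-norm sketch of this block (our simplified version; window masking, dropout, and the downsampling used in the multi-scale encoder are omitted):

```python
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """X' = X + MHA(LN(X)); X_o = X' + MLP(LN(X'))."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                  # x: (B, T, C)
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]      # residual around (windowed) self-attention
        return x + self.mlp(self.ln2(x))   # residual around the MLP
```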
3.3.3 Mamba Blocks
The Mamba blocks used for temporal feature encod-
ing are the original Mamba block (Gu and Dao, 2023),
incorporating a forward SSM, and two variants (i.e.,
ViM (Zhu et al., 2024) and DBM (Chen et al., 2024))
which, in addition to the forward SSM, include a
backward SSM.
Mamba Block. A Mamba block (see Fig. 2B)
combines the SSM layer proposed in (Fu et al., 2023)
with gated MLP (Liu et al., 2024a). For an input $X_i$, it is mapped to $x$ and $z$ with linear projections. $x$ undergoes transformation via a Conv1D layer followed by an SSM layer. The transformed $x$ is then gated by $z$, after which a projection layer is applied to produce the final output $X_o$. The process is defined as follows:
$$
\begin{aligned}
x &= \mathrm{Proj}_x(X_i), \quad z = \mathrm{Proj}_z(X_i),\\
x' &= \mathrm{SSM}(\mathrm{SiLU}(\mathrm{Conv1D}(x))),\\
X_o &= \mathrm{Proj}_{out}(\mathrm{SiLU}(z) \odot x').
\end{aligned}
\tag{1}
$$
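A schematic sketch of this gated structure (illustrative only; `nn.Identity` stands in for the selective SSM layer rather than the fused kernel of the official Mamba implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMambaBlock(nn.Module):
    """Eq. (1): Conv1D -> SiLU -> SSM on one branch, gated by SiLU(z), then projected."""
    def __init__(self, dim: int, inner: int = 256):
        super().__init__()
        self.proj_x, self.proj_z = nn.Linear(dim, inner), nn.Linear(dim, inner)
        self.conv = nn.Conv1d(inner, inner, kernel_size=4, padding=3, groups=inner)
        self.ssm = nn.Identity()                       # placeholder for the SSM layer
        self.proj_out = nn.Linear(inner, dim)

    def forward(self, x_in):                           # x_in: (B, T, C)
        x, z = self.proj_x(x_in), self.proj_z(x_in)
        x = self.conv(x.transpose(1, 2))[..., : x_in.shape[1]].transpose(1, 2)
        x = self.ssm(F.silu(x))
        return self.proj_out(F.silu(z) * x)            # gate and project back to dim

print(ToyMambaBlock(dim=128)(torch.randn(2, 100, 128)).shape)  # torch.Size([2, 100, 128])
```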
ViM Block. In comparison to the Mamba block, the
ViM block (Zhu et al., 2024) (as shown in Fig. 2C)
features a forward branch and a backward branch that
process the projected input simultaneously in both di-
rections. The ViM block transforms an input $X_i$ into an output $X_o$ through the following process:
$$
\begin{aligned}
x &= \mathrm{Proj}_x(X_i), \quad z = \mathrm{Proj}_z(X_i),\\
x_f &= \mathrm{SSM}_f(\mathrm{SiLU}(\mathrm{Conv1D}_f(x))),\\
x_b &= \mathrm{SSM}_b(\mathrm{SiLU}(\mathrm{Conv1D}_b(x))),\\
X_o &= \mathrm{Proj}_{out}(\mathrm{SiLU}(z) \odot (x_f + x_b)).
\end{aligned}
\tag{2}
$$
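In practice, the backward branch can be realized by flipping the sequence before and after a causal scan; a hedged sketch of that idea with a toy scan in place of the real SSM:

```python
import torch

def bidirectional_scan(x, scan):
    """x: (B, T, C); `scan` is any causal sequence operator (e.g., an SSM branch).
    The backward branch runs the same kind of scan on the reversed sequence and
    flips the result back, mirroring x_f + x_b in Eq. (2)."""
    x_f = scan(x)                                                  # forward branch
    x_b = torch.flip(scan(torch.flip(x, dims=[1])), dims=[1])      # backward branch
    return x_f + x_b

out = bidirectional_scan(torch.randn(2, 64, 32), scan=lambda t: t.cumsum(dim=1))
print(out.shape)                                                   # torch.Size([2, 64, 32])
```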
DBM Block. A DBM block (see Fig. 2D) shares
a similar idea with the ViM block, i.e., introducing a
backward branch. In contrast, DBM blocks separate
the projection processes to create distinct inputs for
the Conv1D and SSM transformations. The parame-
ters of the Conv1D and SSM operations are shared.
The outputs from both branches are separately gated,
which are then concatenated and projected to gener-
ate the final output. The compressed features, X
o
, for
an input, X
i
, are defined as follows:
x
f
= Proj
x f
(X
i
), z
f
= Proj
xz
(X
i
),
x
b
= Proj
xb
(X
i
), z
b
= Proj
zb
(X
i
),
x
f
= SSM(SiLU(Conv1D(x
f
))),
x
b
= SSM(SiLU(Conv1D(x
b
))),
X
o
= Proj
out
(SiLU(z
f
) (x
f
) SiLU(z
b
) x
b
).
(3)
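A hedged sketch of the weight sharing in Eq. (3), again with a toy cumulative-sum scan standing in for the shared Conv1D/SSM pair:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDBMBlock(nn.Module):
    """Separate projections per direction, shared Conv1D/'SSM' weights, concat + project."""
    def __init__(self, dim: int, inner: int = 128):
        super().__init__()
        self.proj_xf, self.proj_zf = nn.Linear(dim, inner), nn.Linear(dim, inner)
        self.proj_xb, self.proj_zb = nn.Linear(dim, inner), nn.Linear(dim, inner)
        self.conv = nn.Conv1d(inner, inner, kernel_size=4, padding=3, groups=inner)  # shared
        self.proj_out = nn.Linear(2 * inner, dim)

    def branch(self, x, reverse: bool):                # shared transform for both directions
        if reverse:
            x = torch.flip(x, dims=[1])
        x = self.conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        x = F.silu(x).cumsum(dim=1)                    # stand-in for the SiLU -> SSM scan
        return torch.flip(x, dims=[1]) if reverse else x

    def forward(self, x_in):                           # x_in: (B, T, C)
        f = F.silu(self.proj_zf(x_in)) * self.branch(self.proj_xf(x_in), reverse=False)
        b = F.silu(self.proj_zb(x_in)) * self.branch(self.proj_xb(x_in), reverse=True)
        return self.proj_out(torch.cat([f, b], dim=-1))
```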
3.3.4 Hybrid Architecture
A hybrid architecture combines the Transformer
block (or self-attention layer) with the Mamba block,
enabling the model to encode diverse dynamics. We
focus on the MambaFormer architecture (Park et al.,
2024) and the concurrent application of self-attention
and the Mamba block, e.g., CausalTAD (Liu et al.,
2024b), as described below.
MambaFormer. The MambaFormer (see Fig.
3A) combines Mamba with attention blocks. It re-
places the positional encoding and MLP block in the
standard Transformer block with two Mamba blocks.
The skip connections after each block, as used in the
Transformer block, are retained.
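A minimal sketch of that ordering (the spirit of our adaptation rather than the exact multi-scale implementation): the first Mamba block takes over the role of positional encoding and the second replaces the Transformer's MLP, with a residual connection after each block. Any Mamba-style module with a matching interface could be plugged in, e.g., the ToyMambaBlock sketched above.

```python
import torch.nn as nn

class ToyMambaFormerBlock(nn.Module):
    """Mamba -> (LN + self-attention) -> Mamba, each wrapped in a residual connection."""
    def __init__(self, dim: int, mamba_block_cls, heads: int = 4):
        super().__init__()
        self.mamba_in = mamba_block_cls(dim)     # replaces positional encoding
        self.ln = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mamba_out = mamba_block_cls(dim)    # replaces the MLP block

    def forward(self, x):                        # x: (B, T, C)
        x = x + self.mamba_in(x)
        h = self.ln(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mamba_out(x)

# Example: block = ToyMambaFormerBlock(128, mamba_block_cls=ToyMambaBlock)
```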
CausalTAD. CausalTAD (see Fig. 3B) extends
the dual design in the DBM block (Chen et al., 2024)
to the multi-head self-attention operation, resulting in
a causal attention block (CAB). Specifically, CAB re-
stricts the context of attention to only past or future
tokens. A CAB and a DBM-like block are utilized
to process the input simultaneously, aiming to capture
both long-range temporal relationships and causality
information.
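A hedged sketch of the parallel combination (branch internals omitted; how CausalTAD fuses the two outputs may differ in detail from the linear fusion assumed here):

```python
import torch
import torch.nn as nn

class ToyParallelHybrid(nn.Module):
    """Run an attention branch and a Mamba branch on the same input, then fuse."""
    def __init__(self, dim: int, attn_branch: nn.Module, mamba_branch: nn.Module):
        super().__init__()
        self.attn_branch = attn_branch           # e.g., a causal attention block (CAB)
        self.mamba_branch = mamba_branch         # e.g., a DBM-like block
        self.fuse = nn.Linear(2 * dim, dim)      # project the concatenated branch outputs

    def forward(self, x):                        # x: (B, T, C)
        return self.fuse(torch.cat([self.attn_branch(x), self.mamba_branch(x)], dim=-1))
```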
Figure 3: Illustration of the hybrid architectures studied.
(A) The MambaFormer. (B) The concurrent application of
self-attention and Mamba block (top). The Attention and
Mamba Branches share a similar architecture (bottom), ex-
cept that in forward and backward operations, the Atten-
tion Branch uses multi-head self-attention while the Mamba
Branch uses Conv1D followed by an SSM layer. The parameters of both operations are shared inside the Attention and Mamba Branches, respectively.
3.4 Training and Inference
For training, we employ a focal loss (Tian et al.,
2019) and distance Intersection over Union (IoU) loss
(Zheng et al., 2020) to optimize the action classifica-
tion and boundary distance regression, respectively.
For inference, we select a subset of the predictions
whose classification score is higher than a pre-defined
threshold. Finally, we use Soft-NMS (Bodla et al.,
2017) to remove duplicates.
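For reference, a simple sketch of the post-processing step (our own simplified Gaussian Soft-NMS over 1D segments, not the exact implementation used in our experiments):

```python
import numpy as np

def soft_nms_1d(segments, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS for temporal segments [(start, end), ...]:
    repeatedly keep the highest-scoring segment and decay the scores of
    overlapping segments according to their temporal IoU."""
    segments = np.asarray(segments, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    keep, idx = [], list(range(len(scores)))
    while idx:
        best = max(idx, key=lambda i: scores[i])
        keep.append(best)
        idx.remove(best)
        for i in idx:
            inter = max(0.0, min(segments[best, 1], segments[i, 1]) -
                             max(segments[best, 0], segments[i, 0]))
            union = (segments[best, 1] - segments[best, 0]) + \
                    (segments[i, 1] - segments[i, 0]) - inter
            tiou = inter / union if union > 0 else 0.0
            scores[i] *= np.exp(-(tiou ** 2) / sigma)     # soft score decay
        idx = [i for i in idx if scores[i] > score_thresh]
    return keep

print(soft_nms_1d([(10, 20), (11, 21), (50, 60)], [0.9, 0.8, 0.7]))  # [0, 2, 1]
```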
4 EXPERIMENTS
In this section, we conduct experiments on different
TAL datasets to evaluate the performance of the stud-
ied temporal feature encoders, as discussed in Section
3.3. Since the rest of the architecture is fixed, the only aspect that varies among the models is the applied temporal feature encoder; the performance of the trained models therefore directly reflects the capabilities of the corresponding encoders.
4.1 Datasets
We choose THUMOS14 (Idrees et al., 2017),
ActivityNet-1.3 (Heilbron et al., 2015), and EPIC-
KITCHENS-100 (Damen et al., 2018) for our exper-
iments. THUMOS14 and ActivityNet-1.3 are third-
person view human activity datasets, differing in size
and action durations. EPIC-Kitchens-100 is a first-person view dataset with fine-grained indoor actions. These differing properties allow TAL models to be assessed from multiple perspectives. Specifically, THUMOS14 consists of 413
videos with 20 action classes, which we divided into a
training set of 213 videos and a test set of 200 videos,
following previous work (Zhang et al., 2022). The
models were trained on the training set and evaluated
on the test set. ActivityNet-1.3 (Heilbron et al., 2015),
with nearly 20,000 videos across 200 action classes,
was split into a training set of 10,024 videos, a valida-
tion set of 4,926 videos, and a test set of 5,044 videos.
Following previous practice (Zhang et al., 2022), we
trained the models on the training set and evaluated
them on the validation set. EPIC-KITCHENS-100,
which contains 100 hours of egocentric view activ-
ities, was split into a training and a validation set.
Similar to previous approach (Zhang et al., 2022), we
trained separate models for Verb and Noun actions on
the training set and assessed them on the validation
set.
4.2 Performance Metrics
The mean average precision (mAP) was used as a
metric to evaluate the overall performance of the
trained models. Following previous practice (Zhang
et al., 2022) (Zhang et al., 2024), the mAP calculated
at different temporal IoU (tIoU) thresholds and the av-
eraged mAP across all these thresholds are reported.
Specifically, the thresholds were set to [0.3:0.1:0.7]
for THUMOS14, [0.5:0.05:0.95] for ActivityNet-1.3,
and [0.1:0.1:0.5] for EPIC-KITCHENS-100. In addition, the number of trainable parameters and FLOPs (floating-point operations) are reported to evaluate model efficiency.
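For clarity, the core quantity behind these thresholds, the temporal IoU between a predicted and a ground-truth segment, reduces to a one-line ratio (a simplified illustration, not the full mAP evaluation protocol):

```python
def tiou(pred, gt):
    """Temporal IoU of two (start, end) segments; a prediction counts as correct
    at threshold t only if its tIoU with a ground-truth instance is at least t."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

print(round(tiou((22.2, 27.2), (22.3, 27.2)), 2))   # 0.98
```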
4.3 Implementations
Feature Extraction. Following previous work
(Zhang et al., 2022) (Zhang et al., 2024), we ex-
tract I3D (Carreira and Zisserman, 2017) features
for THUMOS14 and ActivityNet-1.3, and Slow-
Fast (Feichtenhofer et al., 2019) features for EPIC-
KITCHENS-100. Note that while TSP features (Al-
wassel et al., 2021) have demonstrated superior per-
formance on ActivityNet-1.3 (Zhang et al., 2022) (Shi
et al., 2023) compared to other feature extractors such
as I3D, we prioritize investigating temporal feature
encoders over using different features to enhance the
model’s performance. Using a consistent feature ex-
tractor across datasets enables a more insightful com-
parison of the generalizability of these encoders.
Experimental Setup. We re-implemented the
temporal feature encoders using PyTorch 2.1.2 with
CUDA version 12.1. The configuration of the feature
embedding layers and the action localization head
remained consistent with the methods described in
(Zhang et al., 2022) and (Chen et al., 2024). The
preprocessing step for handling variable-length inputs
(fixing the maximum input sequence length through
either padding or cropping) followed a similar ap-
proach to that in (Zhang et al., 2022). Each model was
trained from scratch on all datasets. The Adam opti-
mizer (Kingma and Ba, 2017), along with a warm-up
strategy (Liu et al., 2020), was employed for training.
During training, we optimized the number of epochs,
learning rate, batch size, window size (when using
sliding window attention), and the balance coefficient
for the classification and distance regression losses.
The training and testing processes were conducted on
a single NVIDIA GeForce RTX 3090 GPU.
4.4 Overall Performance
The overall performance, measured by mAP, is pre-
sented in Table 1 for THUMOS14 and ActivityNet-
1.3, and in Table 2 for EPIC-Kitchens-100.
We selected the Transformer-based model with sliding window attention, i.e., ActionFormer (Zhang et al., 2022), as the reference point for analyzing the performance of the trained models, specifically the pure Mamba block-based methods and the hybrid architectures that combine the Transformer block and Mamba block.
It is worth noting that the retrained ActionFormer
models achieve different overall performance levels
compared to the reported results in (Zhang et al.,
2022). Similar differences are also observed for the
CausalTAD (Liu et al., 2024b) architecture. These
discrepancies may be attributed to variations in hard-
ware and software setups. Additionally, we used the
implementation from (Zhang et al., 2022) to handle
the training and testing data, which differs from the
implementation in CausalTAD (Liu et al., 2024b).
This may be another factor that has led to the observed
differences in performance for CausalTAD.
Performance on THUMOS14. On THUMOS14,
the original Mamba block-based model surpasses the
baseline at all tIoU thresholds, achieving an average
mAP of 68.5%, which is slightly higher than the base-
line of 67.9%. Furthermore, as can be observed, this
method achieves the best average mAP and outper-
forms all other studied methods at every tIoU thresh-
old, except at the threshold of 0.6, where its perfor-
mance is slightly lower than that of the CausalTAD
method (60.8% vs. 61.1%).
However, the other temporal feature encoders do
not demonstrate similar superior performance, even
for the ViM and DBM blocks, which share the most
similar architecture. This suggests that the backward
scanning implemented in these models may not en-
hance the representativeness of the learned features.
Additionally, the backward design may also influence the hybrid models, as evidenced by the MambaFormer architectures, where combining the original Mamba block with attention blocks (whether global or sliding window attention) outperforms configurations such as ViM or DBM in combination with attention blocks. By contrast, the CausalTAD method, which
achieves the same average mAP as the baseline and
outperforms other hybrid variants, indicates that the
concurrent application of the Mamba block with at-
tention layers, along with the causal information en-
coded by the causal attention, may enhance the dis-
criminative power of the learned representations.
Despite these differences, we observed that the
gaps in average mAP among these models are not sig-
nificant. Additionally, we noticed that the original
Mamba block or DBM block-based methods gener-
ally perform better than others at higher tIoU thresh-
olds, such as at 0.6 and 0.7.
Performance on ActivityNet-1.3. As shown in
Table 1, both the pure Mamba block-based methods
and hybrid approaches outperform the pure Trans-
former block-based methods, with the CausalTAD
method achieving the best performance, surpassing
the baseline by an average margin of 0.9%. Addi-
tionally, the CausalTAD method demonstrates supe-
rior performance at higher tIoU thresholds, particularly at 0.95. Similar to THUMOS14, while these methods exhibit differences in performance, they are not significant.
Table 1: Overall performance measured by mAP on THUMOS14 and ActivityNet-1.3. Reported are the mAP values at different tIoU thresholds and the averaged mAP, i.e., [0.3:0.1:0.7] for THUMOS14 and [0.5:0.05:0.95] for ActivityNet-1.3. TFE: temporal feature encoder, Attn: attention, Win: window.
Method | TFE | Block | THUMOS14 tIoU (%) 0.3 / 0.4 / 0.5 / 0.6 / 0.7 / Avg. | ActivityNet-1.3 tIoU (%) 0.5 / 0.75 / 0.95 / Avg.
ActionFormer | Transformer | Global Attn | 83.3 / 78.8 / 71.7 / 59.8 / 44.5 / 67.6 | 54.4 / 36.4 / 7.3 / 35.7
ActionFormer | Transformer | Win Attn* | 83.3 / 79.5 / 71.9 / 60.2 / 45.0 / 67.9 | 54.4 / 36.8 / 8.4 / 36.0
ActionMamba | Mamba | Mamba | 83.3 / 79.7 / 72.3 / 60.8 / 46.1 / 68.5 | 54.5 / 37.5 / 7.4 / 36.3
ActionMamba | Mamba | ViM | 82.6 / 79.1 / 71.5 / 59.6 / 44.6 / 67.5 | 55.4 / 37.8 / 8.0 / 36.8
ActionMamba | Mamba | DBM | 82.9 / 78.8 / 71.3 / 60.2 / 45.6 / 67.8 | 55.2 / 37.5 / 7.8 / 36.7
MambaFormer | Mamba + Attn | Mamba + Global Attn | 82.3 / 79.1 / 70.9 / 60.3 / 45.7 / 67.7 | 55.0 / 37.3 / 7.4 / 36.5
MambaFormer | Mamba + Attn | Mamba + Win Attn | 82.4 / 78.8 / 70.8 / 60.2 / 44.4 / 67.3 | 54.7 / 37.1 / 7.8 / 36.3
MambaFormer | Mamba + Attn | ViM + Global Attn | 81.5 / 78.4 / 70.3 / 59.2 / 45.5 / 67.0 | 55.1 / 37.4 / 8.1 / 36.6
MambaFormer | Mamba + Attn | ViM + Win Attn | 82.0 / 78.1 / 70.6 / 59.4 / 44.8 / 67.0 | 55.0 / 37.3 / 7.6 / 36.6
MambaFormer | Mamba + Attn | DBM + Global Attn | 82.6 / 78.2 / 71.0 / 59.6 / 44.6 / 67.2 | 54.6 / 37.6 / 8.3 / 36.6
MambaFormer | Mamba + Attn | DBM + Win Attn | 81.4 / 77.2 / 70.2 / 58.7 / 44.1 / 66.3 | 54.9 / 37.4 / 7.6 / 36.5
CausalTAD | Mamba + Attn | DBM + Global Causal Attn* | 82.4 / 79.0 / 71.9 / 61.1 / 44.9 / 67.9 | 55.3 / 37.8 / 8.9 / 36.9
* Retrained models. Bold and underlined numbers indicate the best and second-best performances in each column, respectively.
Another interesting observation is that causal encoding appears to be an important factor for boosting the model’s performance, as can be seen from the models that integrate the ViM or DBM blocks.
Performance on EPIC-Kitchens-100. As shown in Table 2, the best-performing methods on EPIC-Kitchens-100 are the pure DBM block-based model for Verb actions and the sequential integration of the original Mamba block with sliding window attention for Noun actions. The highest average
mAP surpasses the baseline by 1.1% for both Verb
and Noun actions. Additionally, pure Mamba block-
based models outperform others for Verb actions, par-
ticularly the ViM and DBM blocks, likely due to
their backward scanning design. For Noun actions,
the MambaFormer architecture appears to be the pre-
ferred choice, as the models in this category achieve
both the best and second-best overall performances.
Discussion. Based on the performance results,
it is clear that no single method consistently out-
performs the others across all the datasets. This suggests that a given dataset may require a tailored model. For instance, the pure original Mamba block-
based method may be suitable for datasets like THU-
MOS14, where action instances with short durations
account for the majority (Wu et al., 2021). A more
complex model, such as CausalTAD, should be cho-
sen for ActivityNet-1.3. For EPIC-Kitchens-100 Verb
actions, the pure DBM-based model is recommended,
while for Noun actions, the MambaFormer model—
specifically, the combination of the original Mamba
block with sliding window attention or the ViM block
with global attention—should be utilized. Alterna-
tively, the pure Mamba block-based models or their
hybrid counterparts could be selected for better gen-
eralization across all the datasets.
4.5 Efficiency Analysis
Model efficiency is a crucial factor to consider when
selecting a deep learning model for processing real-
world, large-scale data, especially when hardware re-
sources are limited. We present the number of train-
able parameters and FLOPs as efficiency metrics to
measure the model’s resource consumption. Specifi-
cally, a higher number of parameters indicates a larger
model that requires more memory, while FLOPs re-
flect the computational complexity.
We use the Python library fvcore to count the number of parameters and FLOPs. For calculating FLOPs,
we follow the method described in (Zhang et al.,
2022), using a fixed-size tensor as input to the model.
Specifically, we use a tensor size of [1, 2408, 2304],
i.e., an input of 2304 time steps, for THUMOS14
models, [1, 2304, 160] for ActivityNet-1.3 models,
and [1, 2304, 2304] for EPIC-Kitchens-100 models.
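For reference, a minimal sketch of how such counts can be obtained with fvcore (the model below is a placeholder; only the input shape follows the setup described above):

```python
import torch
from fvcore.nn import FlopCountAnalysis, parameter_count

model = torch.nn.Conv1d(2304, 512, kernel_size=3, padding=1)   # placeholder model
dummy = torch.randn(1, 2304, 2304)         # e.g., the EPIC-Kitchens-100 input size

params = parameter_count(model)[""]        # total parameter count of the model
flops = FlopCountAnalysis(model, dummy).total()
print(f"{params / 1e6:.1f}M parameters, {flops / 1e9:.1f} GFLOPs")
```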
As shown in Table 3, pure Mamba block-based
models exhibit the fewest parameters and consume
the least FLOPs. In contrast, CausalTAD has the
highest number of parameters and consumes the most
FLOPs. Compared to the Transformer models, Mam-
baFormer models generally consume a similar num-
ber of FLOPs while having more parameters due to
the integration of the Mamba blocks.
Discussion. Based on the efficiency performance,
pure Mamba block-based models are preferable when
computational resources are limited. Hybrid models
can serve as an alternative choice for improved per-
formance across different datasets when resources are
more readily available.
Table 2: Overall performance on EPIC-Kitchens-100. The presented results include mAP values at different tIoU thresholds, ranging from [0.1:0.1:0.5], along with the averaged ones for both Verb and Noun actions. TFE: temporal feature encoder, Attn: attention, Win: window.
Method | TFE | Block | Verb tIoU (%) 0.1 / 0.2 / 0.3 / 0.4 / 0.5 / Avg. | Noun tIoU (%) 0.1 / 0.2 / 0.3 / 0.4 / 0.5 / Avg.
ActionFormer | Transformer | Global Attn | 26.9 / 25.9 / 24.5 / 22.5 / 18.9 / 23.7 | 25.2 / 24.1 / 22.4 / 20.0 / 16.2 / 21.6
ActionFormer | Transformer | Win Attn* | 26.5 / 25.4 / 24.3 / 22.4 / 19.3 / 23.6 | 24.7 / 23.6 / 22.0 / 19.6 / 16.0 / 21.2
ActionMamba | Mamba | Mamba | 26.6 / 25.6 / 23.9 / 22.2 / 19.6 / 23.6 | 25.5 / 24.2 / 22.6 / 20.4 / 16.8 / 21.9
ActionMamba | Mamba | ViM | 27.5 / 26.3 / 25.1 / 22.6 / 19.0 / 24.1 | 25.2 / 24.2 / 22.4 / 20.0 / 16.8 / 21.7
ActionMamba | Mamba | DBM | 28.0 / 27.2 / 25.7 / 23.2 / 19.4 / 24.7 | 24.8 / 23.7 / 21.9 / 19.8 / 15.3 / 21.1
MambaFormer | Mamba + Attn | Mamba + Global Attn | 27.0 / 25.9 / 24.5 / 22.8 / 19.0 / 23.9 | 25.5 / 24.2 / 22.6 / 20.4 / 16.8 / 21.9
MambaFormer | Mamba + Attn | Mamba + Win Attn | 26.8 / 25.8 / 23.9 / 22.2 / 19.3 / 23.6 | 26.0 / 24.8 / 23.2 / 20.8 / 17.0 / 22.3
MambaFormer | Mamba + Attn | ViM + Global Attn | 26.3 / 25.6 / 23.9 / 21.9 / 18.7 / 23.3 | 25.6 / 24.4 / 22.7 / 20.6 / 17.5 / 22.2
MambaFormer | Mamba + Attn | ViM + Win Attn | 26.2 / 25.6 / 24.1 / 21.6 / 17.6 / 23.0 | 25.0 / 23.9 / 22.2 / 19.9 / 16.6 / 21.5
MambaFormer | Mamba + Attn | DBM + Global Attn | 27.1 / 26.4 / 24.3 / 22.3 / 19.2 / 23.8 | 25.4 / 24.5 / 22.8 / 20.3 / 16.3 / 21.8
MambaFormer | Mamba + Attn | DBM + Win Attn | 26.8 / 25.9 / 24.4 / 22.3 / 19.3 / 23.7 | 25.1 / 24.1 / 22.3 / 20.0 / 16.6 / 21.6
CausalTAD | Mamba + Attn | DBM + Global Causal Attn* | 26.9 / 26.2 / 24.6 / 22.4 / 18.8 / 23.8 | 25.6 / 24.5 / 22.7 / 20.2 / 16.3 / 21.9
* Retrained models. Bold and underlined numbers indicate the best and second-best performances in each column, respectively.
Table 3: Efficiency measurements include the number of trainable parameters and FLOPs consumed by different models. The number of parameters was calculated separately for the THUMOS14, ActivityNet-1.3, and EPIC-Kitchens-100 models using the Python library fvcore. FLOPs were calculated for processing tensors of sizes [1, 2408, 2304] for THUMOS14 models, [1, 2304, 160] for ActivityNet-1.3 models, and [1, 2304, 2304] for EPIC-Kitchens-100 models.
Method | Block | THUMOS14 # Params / FLOPs | ActivityNet-1.3 # Params / FLOPs | EPIC - Verb # Params / FLOPs | EPIC - Noun # Params / FLOPs
ActionFormer | Global Attn | 29.3M / 115.4 | 8.1M / 18.1 | 29.8M / 118.4 | 30.1M / 121.5
ActionFormer | Win Attn | 29.3M / 45.3 | 8.1M / 17.4 | 29.8M / 93.2 | 30.1M / 96.4
ActionMamba | Mamba | 19.0M / 33.0 | 5.6M / 13.4 | 19.5M / 69.0 | 19.8M / 72.4
ActionMamba | ViM | 19.9M / 37.8 | 5.9M / 15.0 | 20.4M / 78.5 | 20.7M / 81.6
ActionMamba | DBM | 18.6M / 37.8 | 5.5M / 15.0 | 19.1M / 78.7 | 19.4M / 82.6
MambaFormer | Mamba + Global Attn | 42.8M / 106.0 | 11.6M / 16.5 | 43.3M / 109.0 | 43.6M / 111.9
MambaFormer | Mamba + Win Attn | 42.8M / 40.5 | 11.6M / 15.9 | 43.3M / 83.8 | 43.6M / 86.8
MambaFormer | ViM + Global Attn | 43.8M / 115.5 | 12.0M / 18.1 | 44.3M / 118.5 | 44.6M / 121.4
MambaFormer | ViM + Win Attn | 43.8M / 45.3 | 12.0M / 17.5 | 44.3M / 93.3 | 44.6M / 96.3
MambaFormer | DBM + Global Attn | 42.3M / 115.5 | 11.5M / 18.1 | 42.8M / 118.5 | 43.2M / 121.8
MambaFormer | DBM + Win Attn | 42.3M / 45.3 | 11.5M / 17.5 | 42.8M / 93.4 | 43.2M / 96.7
CausalTAD | DBM + Global Causal Attn | 52.1M / 161.5 | 13.9M / 29.4 | 52.6M / 164.6 | 52.9M / 167.6
Bold and underlined numbers indicate the highest and lowest number of parameters and FLOPs in each column.
4.6 Sensitivity Analysis
Actions in videos often vary in length, ranging from
seconds to minutes. In addition, the number of action instances is not uniformly distributed, and the types of actions may also vary considerably. These charac-
teristics make TAL a challenging task for deep learn-
ing models, which should effectively identify all ac-
tions. In this section, we present a sensitivity analy-
sis of the models evaluated on the THUMOS14 and
ActivityNet-1.3 datasets, offering an additional met-
ric for understanding the temporal feature encoders.
We use the tool presented in (Alwassel et al.,
2018) for analysis. To perform the analysis, we de-
fine coverage, length, and the number of instances.
Coverage refers to the relative length of the actions
compared to the entire video, while length denotes
the absolute duration of actions in seconds. Both
coverage and length are categorized into Extra Small
(XS), Small (S), Medium (M), Large (L), and Ex-
tra Large (XL). The number of instances refers to
the total count of instances from the same class in a
video, categorized into Extra Small (XS), Small (S),
Medium (M), and Large (L). The defined thresholds
for Coverage on both datasets are XS: (0,0.02], S:
(0.02,0.04], M: (0.04,0.06], L: (0.06,0.08], and XL:
(0.08,1.0]. For Length, the thresholds are XS: (0,3],
S: (3,6], M: (6,12], L: (12,18], and XL: >18 for THU-
MOS14, and XS: (0,30], S: (30,60], M: (60,120], L:
(120,180], and XL: >180 for ActivityNet-1.3. For the
number of instances, XS: 1, S: [2,40], M: [40,80], L:
>80 and XS: 1, S: [2,4], M: [4,8], L: >8 are used for
THUMOS14 and ActivityNet-1.3, respectively.
For each dataset, we selected the models with the
highest overall performance (see Table 1) from each
category to generate predictions for analysis.
On THUMOS14, the models perform well at pre-
dicting actions with short to large durations (see Fig.
4), particularly for small, medium, and large-duration
actions. However, performance drops significantly for
extra-large duration actions. The models also become
more sensitive to videos with a large number of action
instances. This behavior can be attributed to the char-
acteristics of the dataset: 1) action durations in THU-
MOS14 are relatively short, and 2) each video typ-
ically contains only one action class. Regarding 1),
the lack of sufficient long-duration actions makes the
models more adept at encoding context for shorter ac-
tions. For 2), videos with many action instances pro-
vide the model with more data, improving its generalization capabilities on such samples. Despite
these commonalities, pure Transformer block-based
models seem more effective at handling videos with a
higher number of action instances, and they also per-
form well with respect to absolute duration. Thus,
for videos featuring long durations and numerous ac-
tion instances, pure Transformer block-based models
may be a suitable choice. In other cases, the Mamba
block-based models might be more appropriate. Another viable option is the CausalTAD method, although it demands greater computational resources.
For ActivityNet-1.3, the selected models demon-
strate stronger regression capabilities (see Fig. 5) as
action durations increase (likely because more con-
tinuous frames provide additional context about the
actions), but they perform less effectively on videos
with a large number of action instances. This de-
cline is particularly noticeable in pure Transformer
block-based models. Interestingly, however, these
pure Transformer block-based models excel at han-
dling extra-large duration, as the attention mechanism
allows the model to effectively encode necessary con-
text from the entire input. Therefore, they are more
suitable for videos with long-duration actions but
fewer action instances. For videos with short to rela-
tively long-duration actions, where Transformer mod-
els show weaker performance, Mamba block-enabled
models, especially the Mamba ViM block, are a better
choice. These models offer comparable performance
while being less complex than the alternatives that de-
liver superior performance but require more hardware
resources due to the global attention computation.
Discussion. Overall, for THUMOS14, the orig-
inal Mamba block-based model or the CausalTAD
model should be chosen based on performance re-
quirements. In the case of ActivityNet-1.3, the
CausalTAD should be prioritized for performance
needs, while the Mamba ViM-based method may be
more suitable if hardware resources are a concern.
However, it is important to note that the selection is
highly dataset-dependent. Pure Transformer block-
based methods can still be an excellent choice for de-
veloping real-world TAL applications, as they outper-
form other models in specific scenarios (as discussed
previously) while maintaining lower complexity than
their hybrid counterparts.
Figure 4: Sensitivity analysis on THUMOS14. The predictions were produced by the models with the highest overall performance from each category. These include the Transformer block with sliding window attention, the original Mamba block, the MambaFormer architecture featuring the original Mamba block with global attention, and the CausalTAD architecture. The dashed line represents the performance of the method across all instances in the analyzed dataset.
4.7 Class Level Analysis
In this section, we analyze the class-level perfor-
mance of the selected models. Fig. 6 presents the per-
class performance on THUMOS14. For ActivityNet-
1.3 and EPIC-Kitchens-100, we select a subset of 20
classes for illustration, based on the number of action
instances in each class. The selected classes range
from small to large in the number of instances. Fig.
7 shows the results for ActivityNet-1.3, while Fig. 8
and Fig. 9 present the results for the Verb and Noun
actions in EPIC-Kitchens-100, respectively.
For THUMOS14, the pure Mamba block-based
methods perform well on many classes. However,
they underperform on certain classes, such as “Soccer
Penalty”, “Tennis Swing”, and “Throw Discus”, com-
pared to hybrid models. This suggests that for classes
where background scenes dominate the frames, a hybrid encoder may be preferable for capturing stronger contextual information.
Figure 5: Sensitivity analysis on ActivityNet-1.3. The best-performing models in each category were selected for generating the outputs, specifically the Transformer block with sliding window attention, the Mamba ViM block, the MambaFormer architecture equipped with the Mamba DBM block and global attention, and the CausalTAD architecture. The dashed line represents the performance of the method across all instances in the analyzed dataset.
On ActivityNet-1.3, it is surprising that none of
the models can handle certain activities, such as “Ski-
ing” and “Shot Put”, despite these activities having a
relatively large number of action instances (over 210)
and total durations (over 10,000 seconds). The wide
variation in durations (ranging from 0.01 to over 210
seconds) may be a contributing factor. Nevertheless,
pure Mamba block-based and hybrid models outper-
form pure Transformer-based models.
For EPIC-Kitchens-100, we observe a similar is-
sue where none of the models perform well on certain
actions, such as “brush”, “choose”, and “break” in the
verb category, and “bacon”, “floor”, and “sausage” in
the noun category. These actions exhibit wide vari-
ations in duration and a small number of instances,
which may limit the models’ effectiveness. However, in most cases, pure Mamba block-based and hybrid models outperform pure Transformer-based models and are generally the better choice.
Figure 6: Per-class performance on THUMOS14, reported as averaged mAPs across tIoU thresholds [0.3:0.1:0.7].
Figure 7: Per-class performance on ActivityNet-1.3 with a 20-class subset, reported as averaged mAPs across tIoU thresholds [0.5:0.05:0.95].
Figure 8: Per-class performance on EPIC-Kitchens-100 Verb actions with a 20-class subset, reported as averaged mAPs across tIoU thresholds [0.1:0.1:0.5].
Finally, we visualize some predicted outputs from the THUMOS14 test set. As shown in Fig. 10, the pure
Mamba block-based method aligns better for both
very short and relatively long-duration actions, in line
with previous analysis.
Discussion. When handling actions with varied
durations, which may lack sufficient data for training
or cause confusion due to background scenes, Mamba
block-enabled models should be preferred.
Figure 9: Per-class performance on EPIC-Kitchens-100 Noun actions with a 20-class subset, reported as averaged mAPs across tIoU thresholds [0.1:0.1:0.5].
Figure 10: Predictions from the THUMOS14 test set. (A) Action: High Jump, duration 4.9 s: all methods are capable of handling relatively long actions. (B) Action: Baseball Pitch, duration 0.9 s: for shorter actions, the pure Mamba block-based methods demonstrate better performance.
4.8 Scalability Analysis
Scaling up a deep learning model by increasing the number of parameters may improve its performance.
In this section, we perform an ablation analysis by
adding an additional original Mamba block to each
level of the encoder, aiming to match the parameter
count of the Transformer-based method. The model’s
performance after scaling is evaluated on the THU-
MOS14 dataset. The resulting mAPs, the number of
parameters, and FLOPs are presented in Table 4.
Table 4: Scalability Analysis for the original Mamba block.
Encoder 0.3 0.4 0.5 0.6 0.7 Avg # Params FLOPs
Win Attn 83.3 79.5 71.9 60.2 45.0 67.9 29.3M 115.4
Mamba 83.3 79.7 72.3 60.1 46.1 68.5 19.0M 33.0
Scaled 83.5 79.0 71.9 60.2 46.8 68.3 30.9M 40.2
The number of parameters after scaling is 30.9M,
which is comparable to that of the Transformer block
with sliding window attention. As observed, the over-
all performance has slightly decreased compared to
the original Mamba block prior to scaling. It is possi-
ble that, at this stage, the model is beginning to expe-
rience overfitting.
Discussion. The results indicate that scaling up
the model does not improve performance on THU-
MOS14, suggesting that simpler models are better
suited for smaller datasets to prevent overfitting dur-
ing training. For larger datasets like ActivityNet-1.3,
more complex models may be necessary to meet per-
formance requirements. For example, the CausalTAD
method achieves the best results (see Table 1) on
ActivityNet-1.3 with the highest number of parame-
ters (see Table 3). Additionally, data augmentation
techniques can be used for smaller datasets to enrich
training samples to exploit the learning capacity of
complex models.
4.9 Summary and Suggestion
Finally, we summarize our suggestions (in Table 5)
based on the analysis, including performance (P),
model efficiency (E), handling short (S) and long (L)
duration actions, dealing with varied durations and
limited data in each class (C), and data efficiency (D).
Table 5: Suggested feature encoders on each dataset. Smaller numbers indicate better methods. The best method for each dataset, considering all performance requirements, is given in parentheses after the dataset name.
THUMOS14 (best: Mamba Original)
Criterion | Win Attn | Mamba Original | Mamba + Global Attn | CausalTAD
P | 2 | 1 | 4 | 3
E | 2 | 1 | 3 | 4
S | 4 | 1 | 3 | 2
L | 4 | 2 | 3 | 1
C | 4 | 1 | 2 | 3
D | 3 | 1 | 4 | 2
ActivityNet-1.3 (best: Mamba ViM)
Criterion | Win Attn | Mamba ViM | DBM + Global Attn | CausalTAD
P | 4 | 2 | 3 | 1
E | 2 | 1 | 3 | 4
S | 4 | 1 | 3 | 2
L | 4 | 2 | 3 | 1
C | 4 | 1 | 2 | 3
D | 4 | 3 | 2 | 1
EPIC-Kitchens-100 Verb (best: Mamba DBM)
Criterion | Global Attn | Mamba DBM | Mamba + Global Attn | CausalTAD
P | 4 | 1 | 2 | 3
E | 2 | 1 | 2 | 4
S | 3 | 4 | 2 | 1
L | 1 | 2 | 4 | 3
C | 3 | 4 | 1 | 2
D | 1 | 3 | 4 | 2
EPIC-Kitchens-100 Noun (best: Mamba + Win Attn)
Criterion | Global Attn | Mamba | Mamba + Win Attn | CausalTAD
P | 4 | 2 | 1 | 3
E | 2 | 1 | 3 | 4
S | 1 | 4 | 3 | 2
L | 4 | 2 | 3 | 1
C | 4 | 3 | 2 | 1
D | 4 | 2 | 1 | 3
5 CONCLUSIONS
In this work, we conducted a comparative analysis
of the Transformer encoder, Mamba block, and their
combinations as temporal feature encoders for TAL
tasks across three commonly used benchmarks of var-
ious human actions. This comparison addresses the
question of how to choose between the Transformer
encoder and the Mamba block. Our findings suggest
that both pure Mamba block-based models and hy-
brid models generally outperform those based on the
Transformer encoder. However, the improved perfor-
mance of the hybrid models comes at the expense
of increased complexity. Therefore, pure Mamba
block-based models should be the preferred choice for
developing TAL applications, especially given per-
formance requirements and potential limitations in
high-performance hardware resources. Additionally,
we found that learning temporal dependencies in se-
quences from both directions—specifically through
the ViM and DBM blocks, which incorporate a back-
ward scanning process—can enhance the model’s per-
formance for TAL.
We focused our analysis on a limited set of hybrid
models. However, there are several other approaches
to building hybrid models, such as those proposed in
(Hatamizadeh and Kautz, 2024) and (Behrouz et al.,
2024). One potential direction for future work could
be exploring the adaptation of these architectures for
TAL. Another avenue for future research would be de-
veloping models that leverage both the simplicity and
performance of Mamba blocks, as well as their dual
scanning capability.
ACKNOWLEDGEMENTS
This work has been partially supported by the Span-
ish project PID2022-136436NB-I00 and by ICREA
under the ICREA Academia programme.
REFERENCES
Alwassel, H., Caba Heilbron, F., Escorcia, V., and Ghanem,
B. (2018). Diagnosing error in temporal action detec-
tors. In ECCV.
Alwassel, H., Giancola, S., and Ghanem, B. (2021). Tsp:
Temporally-sensitive pretraining of video encoders
for localization tasks. In ICCV, pages 3173–3183.
Behrouz, A., Santacatterina, M., and Zabih, R. (2024).
Mambamixer: Efficient selective state space mod-
els with dual token and channel selection. arXiv
2403.19888.
Bisong, E. and Bisong, E. (2019). Recurrent neural net-
works (rnns). Building Machine Learning and Deep
Learning Models on Google Cloud Platform: A Com-
prehensive Guide for Beginners, pages 443–473.
Bodla, N., Singh, B., Chellappa, R., and Davis, L. S. (2017).
Soft-nms — improving object detection with one line
of code. In 2017 ICCV, pages 5562–5570.
Carreira, J. and Zisserman, A. (2017). Quo vadis, action
recognition? a new model and the kinetics dataset. In
2017 CVPR, pages 4724–4733.
Chen, G., Huang, Y., Xu, J., Pei, B., Chen, Z., Li, Z., Wang,
J., Li, K., Lu, T., and Wang, L. (2024). Video mamba
suite: State space model as a versatile alternative for
video understanding. arXiv 2403.09626.
Chen, G., Zheng, Y., Wang, L., and Lu, T. (2022). DCAN:
improving temporal action detection via dual context
aggregation. In AAAI 2022, pages 248–257.
Damen, D., Doughty, H., Farinella, G. M., Fidler, S.,
Furnari, A., Kazakos, E., Moltisanti, D., Munro, J.,
Perrett, T., Price, W., and Wray, M. (2018). Scal-
ing egocentric vision: The epic-kitchens dataset. In
ECCV.
Elharrouss, O., Almaadeed, N., and Al-Maadeed, S. (2021).
A review of video surveillance systems. J Vis Com-
mun Image Represent, 77:103116.
Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019).
Slowfast networks for video recognition. In 2019
ICCV, pages 6201–6210, Los Alamitos, CA, USA.
Fu, D. Y., Dao, T., Saab, K. K., Thomas, A. W., Rudra, A.,
and Ré, C. (2023). Hungry Hungry Hippos: Towards language modeling with state space models. In International Conference on Learning Representations.
Ghosh, I., Ramasamy Ramamurthy, S., Chakma, A., and
Roy, N. (2023). Sports analytics review: Artificial
intelligence applications, emerging technologies, and
algorithmic perspective. WIREs Data Min. Knowl. Discov., 13(5):e1496.
Gong, G., Zheng, L., and Mu, Y. (2020). Scale matters:
Temporal scale aggregation network for precise action
localization in untrimmed videos. In 2020 ICME.
Gu, A. and Dao, T. (2023). Mamba: Linear-time se-
quence modeling with selective state spaces. arXiv
2312.00752.
Gu, A., Goel, K., and Ré, C. (2022). Efficiently modeling long sequences with structured state spaces. In The International Conference on Learning Representations.
Gu, A., Johnson, I., Goel, K., Saab, K., Dao, T., Rudra,
A., and Ré, C. (2021). Combining recurrent, convolutional, and continuous-time models with linear state space layers. NIPS’21, 34:572–585.
Hatamizadeh, A. and Kautz, J. (2024). Mambavision: A
hybrid mamba-transformer vision backbone. arXiv
2407.08083.
Heilbron, F. C., Escorcia, V., Ghanem, B., and Niebles, J. C.
(2015). Activitynet: A large-scale video benchmark
for human activity understanding. In 2015 CVPR.
Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev,
I., Sukthankar, R., and Shah, M. (2017). The THU-
MOS challenge on action recognition for videos “in
the wild”. Comput. Vis. Image Underst., 155:1–23.
Kalman, R. E. (1960). A New Approach to Linear Filtering
and Prediction Problems. J. Basic Eng., 82(1):35–45.
Kang, T.-K., Lee, G.-H., Jin, K.-M., and Lee, S.-W. (2023).
Action-aware masking network with group-based at-
tention for temporal action localization. In 2023
WACV, pages 6047–6056.
Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C.,
Vijayanarasimhan, S., Viola, F., Green, T., Back, T.,
Natsev, P., Suleyman, M., and Zisserman, A. (2017).
The kinetics human action video dataset.
Kingma, D. P. and Ba, J. (2017). Adam: A method for
stochastic optimization. In 3rd ICLR Proceedings.
Lecun, Y., Boser, B., Denker, J., Henderson, D., Howard,
R., Hubbard, W., and Jackel, L. (1990). Handwritten
digit recognition with a back-propagation network. In
Touretzky, D., editor, NIPS 1989, volume 2.
Lin, C., Xu, C., Luo, D., Wang, Y., Tai, Y., Wang, C., Li, J.,
Huang, F., and Fu, Y. (2021). Learning salient bound-
ary feature for anchor-free temporal action localiza-
tion. In CVPR, pages 3320–3329.
Liu, H., Dai, Z., So, D. R., and Le, Q. V. (2024a). Pay
attention to mlps. In NIPS’21, Red Hook, NY, USA.
Liu, L., Liu, X., Gao, J., Chen, W., and Han, J. (2020).
Understanding the difficulty of training transformers.
In Proceedings of EMNLP 2020.
Liu, S., Sui, L., Zhang, C.-L., Mu, F., Zhao, C., and
Ghanem, B. (2024b). Harnessing temporal causal-
ity for advanced temporal action detection. arXiv
2407.17792.
Liu, X., Wang, Q., Hu, Y., Tang, X., Zhang, S., Bai, S., and
Bai, X. (2022). End-to-end temporal action detection
with transformer. IEEE Trans Image Process.
Park, J., Park, J., Xiong, Z., Lee, N., Cho, J., Oymak,
S., Lee, K., and Papailiopoulos, D. (2024). Can
mamba learn how to learn? A comparative study on
in-context learning tasks. In Salakhutdinov, R., Kolter,
Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and
Berkenkamp, F., editors, Proceedings of the 41st In-
ternational Conference on Machine Learning, pages
39793–39812.
Saleem, G., Bajwa, U. I., and Raza, R. H. (2023). Toward
human activity recognition: a survey. Neural Comput-
ing and Applications, 35(5):4145–4182.
Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and
Monfardini, G. (2009). The graph neural network
model. IEEE Trans Neural Networks, 20(1):61–80.
Shao, J., Wang, X., Quan, R., Zheng, J., Yang, J., and Yang,
Y. (2023). Action sensitivity learning for temporal
action localization. In 2023 IEEE ICCV Proceeding,
pages 13411–13423, Los Alamitos, CA, USA.
Shi, D., Zhong, Y., Cao, Q., Ma, L., Li, J., and Tao,
D. (2023). Tridet: Temporal action detection with
relative boundary modeling. In 2023 CVPR, pages
18857–18866, Los Alamitos, CA, USA.
Sridhar, D., Quader, N., Muralidharan, S., Li, Y., Dai, P.,
and Lu, J. (2021). Class semantics-based attention for
action detection. In 2021 ICCV, pages 13719–13728,
Los Alamitos, CA, USA.
Tian, Z., Shen, C., Chen, H., and He, T. (2019). Fcos:
Fully convolutional one-stage object detection. In
2019 ICCV, pages 9626–9635.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, L. u., and Polosukhin, I.
(2017). Attention is all you need. In Advances in
Neural Information Processing Systems, volume 30.
Wu, J., Sun, P., Chen, S., Yang, J., Qi, Z., Ma, L., and Luo,
P. (2021). Towards high-quality temporal action de-
tection with sparse proposals. arXiv 2109.08847.
Zhang, C.-L., Wu, J., and Li, Y. (2022). Actionformer:
Localizing moments of actions with transformers. In
ECCV 2022, volume 13664, pages 492–510.
Zhang, Z., Palmero, C., and Escalera, S. (2024). Dualh: A
dual hierarchical model for temporal action localiza-
tion. In 2024 FG Conference, pages 1–10.
Zhao, C., Liu, S., Mangalam, K., and Ghanem, B. (2023).
Re2tal: Rewiring pretrained video backbones for re-
versible temporal action localization. In 2023 CVPR,
pages 10637–10647, Los Alamitos, CA, USA.
Zhao, C., Thabet, A., and Ghanem, B. (2021). Video self-
stitching graph network for temporal action localiza-
tion. In 2021 ICCV, pages 13638–13647.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D.
(2020). Distance-iou loss: Faster and better learning
for bounding box regression. In The AAAI Conference
on Artificial Intelligence (AAAI), pages 12993–13000.
Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., and Wang,
X. (2024). Vision mamba: Efficient visual represen-
tation learning with bidirectional state space model.
arXiv 2401.09417.
Zhu, Z., Tang, W., Wang, L., Zheng, N., and Hua, G. (2021).
Enriching local and global contexts for temporal ac-
tion localization. In 2021 ICCV, pages 13496–13505,
Los Alamitos, CA, USA.