
transactions on pattern analysis and machine intelligence,
29(12):2247–2253.
Grauman, K., Westbury, A., Byrne, E., Chavis, Z., Furnari,
A., Girdhar, R., Hamburger, J., Jiang, H., Liu, M.,
Liu, X., et al. (2022). Ego4D: Around the world in
3,000 hours of egocentric video. In Proceedings of
the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 18995–19012.
Gu, A. and Dao, T. (2024). Mamba: Linear-time sequence
modeling with selective state spaces.
Gu, A., Goel, K., and Ré, C. (2021). Efficiently modeling
long sequences with structured state spaces. ArXiv,
abs/2111.00396.
He, B., Yang, X., Kang, L., Cheng, Z., Zhou, X., and
Shrivastava, A. (2022). ASM-Loc: Action-aware segment
modeling for weakly-supervised temporal action
localization. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition,
pages 13925–13935.
Idrees, H., Zamir, A. R., Jiang, Y.-G., Gorban, A., Laptev,
I., Sukthankar, R., and Shah, M. (2017). The THUMOS
challenge on action recognition for videos “in the
wild”. Computer Vision and Image Understanding,
155:1–23.
Jiang, W. and He, G. (2021). Study on the effect of shoulder
training on the mechanics of tennis serve speed
through video analysis. Molecular & Cellular Biomechanics,
18(4):221.
Kipf, T. N. and Welling, M. (2016). Semi-supervised classification
with graph convolutional networks. arXiv
preprint arXiv:1609.02907.
Li, J., Liu, X., Zong, Z., Zhao, W., Zhang, M., and Song,
J. (2020). Graph attention based proposal 3D ConvNets
for action detection. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 34, pages
4626–4633.
Li, L., Kong, T., Sun, F., and Liu, H. (2019). Deep point-wise
prediction for action temporal proposal. In Neural
Information Processing: 26th International Conference,
ICONIP 2019, Sydney, NSW, Australia, December
12–15, 2019, Proceedings, Part III 26, pages
475–487. Springer.
Lin, C., Xu, C., Luo, D., Wang, Y., Tai, Y., Wang, C., Li, J.,
Huang, F., and Fu, Y. (2021). Learning salient boundary
feature for anchor-free temporal action localization.
In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 3320–3329.
Lin, T., Liu, X., Li, X., Ding, E., and Wen, S. (2019). BMN:
Boundary-matching network for temporal action proposal
generation. In 2019 IEEE/CVF International
Conference on Computer Vision (ICCV), pages 3888–3897.
Lin, T., Zhao, X., Su, H., Wang, C., and Yang, M. (2018).
BSN: Boundary sensitive network for temporal action
proposal generation. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 3–19.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P.
(2017). Focal loss for dense object detection. In 2017
IEEE International Conference on Computer Vision
(ICCV), pages 2999–3007.
Liu, X., Hu, Y., Bai, S., Ding, F., Bai, X., and Torr, P. H.
(2021a). Multi-shot temporal event localization: a
benchmark. In 2021 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages
12591–12601.
Liu, Y., Wang, L., Wang, Y., Ma, X., and Qiao, Y. (2022).
FineAction: A fine-grained video dataset for temporal
action localization. IEEE Transactions on Image Processing,
31:6937–6950.
Liu, Z., Wang, L., Zhang, Q., Tang, W., Yuan, J., Zheng,
N., and Hua, G. (2021b). ACSNet: Action-context separation
network for weakly supervised temporal action
localization. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 35, pages
2233–2241.
Loshchilov, I. and Hutter, F. (2017). Decoupled weight
decay regularization. In International Conference on
Learning Representations.
Myung, W., Su, N., Xue, J.-H., and Wang, G. (2024).
DeGCN: Deformable graph convolutional networks for
skeleton-based action recognition. IEEE Transactions
on Image Processing, 33:2477–2490.
Nguyen, P. X., Ramanan, D., and Fowlkes, C. C. (2019).
Weakly-supervised action localization with background
modeling. In Proceedings of the IEEE/CVF
International Conference on Computer Vision, pages
5502–5511.
Paul, S., Roy, S., and Roy-Chowdhury, A. K. (2018). W-TALC:
Weakly-supervised temporal activity localization
and classification. In Proceedings of the European
Conference on Computer Vision (ECCV), pages 563–579.
Peng, X. and Tang, L. (2022). Biomechanics analysis of
real-time tennis batting images using Internet of Things
and deep learning. The Journal of Supercomputing,
78(4):5883–5902.
Piergiovanni, A. and Ryoo, M. (2019). Temporal Gaussian
mixture layer for videos. In International Conference
on Machine Learning, pages 5152–5161. PMLR.
Rajasegaran, J., Pavlakos, G., Kanazawa, A., Feichtenhofer,
C., and Malik, J. (2023). On the benefits of 3D pose
and tracking for human action recognition. In 2023
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pages 640–649.
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I.,
and Savarese, S. (2019). Generalized intersection over
union: A metric and a loss for bounding box regression.
In 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 658–666.
Rizve, M. N., Mittal, G., Yu, Y., Hall, M., Sajeev, S., Shah,
M., and Chen, M. (2023). PivoTAL: Prior-driven supervision
for weakly-supervised temporal action localization.
In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
22992–23002.
Shi, B., Dai, Q., Mu, Y., and Wang, J. (2020). Weakly-supervised
action localization by generative attention
modeling. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition,
pages 1009–1019.
TAL4Tennis: Temporal Action Localization in Tennis Videos Using State Space Models