ASPERA: Exploring Multimodal Action Recognition in Football Through Video, Audio, and Commentary

Takane Kumakura, Ryohei Orihara, Yasuyuki Tahara, Akihiko Ohsuga, Yuichi Sei



This study proposes ASPERA (Action SPotting thrEe-modal Recognition Architecture), a multimodal football action recognition method that builds on the ASTRA architecture and incorporates video, audio, and commentary text. ASPERA achieved higher accuracy than models using only video and audio, except on actions that are not visible in the video, which demonstrates the advantage of the multimodal approach. We additionally propose three extended models: ASPERAsrnd, which incorporates surrounding commentary text within a ±20-second range; ASPERAcln, which removes irrelevant background information; and ASPERAMC, which applies a Markov head to provide prior knowledge of the flow of football actions. ASPERAsrnd and ASPERAcln, which refine the text embedding, improved the ability to identify the timing of actions accurately. Notably, ASPERAMC achieved the highest accuracy for actions that are not visible in the video. Together, ASPERAsrnd and ASPERAcln demonstrate the utility of text information in football action spotting and highlight two factors that enhance it: incorporating surrounding commentary text and removing background information. Finally, ASPERAMC shows the effectiveness of combining Transformer models with Markov chains for recognizing actions in invisible scenes.
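The ±20-second commentary window used by ASPERAsrnd can be illustrated with a minimal sketch. The abstract does not say how the window is implemented, so the code below is only one plausible reading: it assumes each commentary line carries a timestamp in seconds and simply joins every line that falls within the window around a candidate action time. The names `Comment`, `surrounding_commentary`, and `window_s` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Comment:
    """One commentary line (hypothetical structure, not from the paper)."""
    time_s: float  # timestamp of the line, in seconds
    text: str


def surrounding_commentary(
    comments: List[Comment], t: float, window_s: float = 20.0
) -> str:
    """Join all commentary lines whose timestamp lies within t +/- window_s.

    Lines are returned in chronological order; both window boundaries
    are inclusive. An empty string is returned if no line qualifies.
    """
    ordered = sorted(comments, key=lambda c: c.time_s)
    selected = [c for c in ordered if abs(c.time_s - t) <= window_s]
    return " ".join(c.text for c in selected)


if __name__ == "__main__":
    # Made-up commentary lines, for illustration only.
    demo = [
        Comment(5.0, "Kick-off."),
        Comment(30.0, "A long ball forward."),
        Comment(45.0, "Corner kick awarded."),
        Comment(80.0, "The keeper collects."),
    ]
    print(surrounding_commentary(demo, t=40.0))
```

In a full pipeline the joined string would presumably be passed to a text encoder to produce the embedding that is fused with the video and audio features; that step is outside the scope of this sketch.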


Paper Citation

in Harvard Style

Kumakura T., Orihara R., Tahara Y., Ohsuga A. and Sei Y. (2025). ASPERA: Exploring Multimodal Action Recognition in Football Through Video, Audio, and Commentary. In Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART; ISBN 978-989-758-737-5, SciTePress, pages 646-657. DOI: 10.5220/0013300700003890

in Bibtex Style

author={Takane Kumakura and Ryohei Orihara and Yasuyuki Tahara and Akihiko Ohsuga and Yuichi Sei},
title={ASPERA: Exploring Multimodal Action Recognition in Football Through Video, Audio, and Commentary},
booktitle={Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART},

in EndNote Style


TY - CONF
JO - Proceedings of the 17th International Conference on Agents and Artificial Intelligence - Volume 2: ICAART
TI - ASPERA: Exploring Multimodal Action Recognition in Football Through Video, Audio, and Commentary
SN - 978-989-758-737-5
AU - Kumakura T.
AU - Orihara R.
AU - Tahara Y.
AU - Ohsuga A.
AU - Sei Y.
PY - 2025
SP - 646
EP - 657
DO - 10.5220/0013300700003890
PB - SciTePress
ER -