Authors:
Takane Kumakura; Ryohei Orihara; Yasuyuki Tahara; Akihiko Ohsuga and Yuichi Sei
Affiliation:
The University of Electro-Communications, Graduate School of Informatics and Engineering, Department of Informatics, 1-5-1 Chofugaoka, Chofu, Japan
Keyword(s):
Action Spotting, Multimodal Learning, Transformer, Markov Chain, Soccer, Football, Live Broadcasting, Deep Learning, Machine Learning, Artificial Intelligence.
Abstract:
This study proposes ASPERA (Action SPotting thrEe-modal Recognition Architecture), a multimodal football action-spotting method based on the ASTRA architecture that incorporates video, audio, and commentary-text information. When actions invisible in the video are excluded, ASPERA achieved higher accuracy than models that use only video and audio, demonstrating the advantage of this multimodal approach. In addition, we propose three extended models: ASPERAsrnd, which incorporates surrounding commentary text within a ±20-second window; ASPERAcln, which removes irrelevant background information; and ASPERAMC, which applies a Markov head to provide prior knowledge of the flow of football actions. ASPERAsrnd and ASPERAcln, which refine the text embedding, improved the ability to identify the timing of actions accurately. Notably, ASPERAMC, with its Markov head, achieved the highest accuracy for actions that are invisible in the video. ASPERAsrnd and ASPERAcln not only demonstrate the utility of text information for football action spotting but also highlight key factors that strengthen this effect, such as incorporating surrounding commentary text and removing background information. Finally, ASPERAMC shows the effectiveness of combining Transformer models with Markov chains for recognizing actions in invisible scenes.
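To make the Markov-head idea concrete, the sketch below shows one plausible way a Markov-chain prior over football action transitions could be blended with the per-frame class scores of a Transformer spotting head. The abstract does not specify the exact mechanism, so the action vocabulary, the blending weight `alpha`, and the function names here are hypothetical illustrations rather than the authors' implementation.

```python
import numpy as np

# Hypothetical action vocabulary; the real label set would follow the
# SoccerNet-style action-spotting classes used by ASTRA/ASPERA.
ACTIONS = ["kick-off", "pass", "shot", "goal", "throw-in"]

def apply_markov_head(frame_logits, prev_action_idx, transition_matrix, alpha=0.5):
    """Blend Transformer class probabilities with a Markov-chain transition prior.

    frame_logits: (num_classes,) raw scores from the spotting head.
    prev_action_idx: index of the most recently spotted action.
    alpha: weight given to the prior (hypothetical hyperparameter).
    """
    # Softmax over the Transformer scores for this frame.
    probs = np.exp(frame_logits - frame_logits.max())
    probs /= probs.sum()
    # Prior over the next action, read from the learned transition matrix.
    prior = transition_matrix[prev_action_idx]
    # Convex combination of visual/audio/text evidence and the action-flow prior.
    blended = (1 - alpha) * probs + alpha * prior
    return blended / blended.sum()

# Toy transition matrix estimated from action sequences (each row sums to 1).
T = np.array([
    [0.05, 0.70, 0.15, 0.00, 0.10],   # after kick-off
    [0.00, 0.60, 0.25, 0.05, 0.10],   # after pass
    [0.00, 0.30, 0.10, 0.40, 0.20],   # after shot
    [0.90, 0.05, 0.05, 0.00, 0.00],   # after goal, play restarts with kick-off
    [0.00, 0.75, 0.10, 0.00, 0.15],   # after throw-in
])

logits = np.array([0.1, 1.2, 2.0, 0.3, 0.2])   # Transformer scores for one frame
print(apply_markov_head(logits, prev_action_idx=2, transition_matrix=T))
```

Because the prior depends only on the previously spotted action, such a head can still favour a plausible label (e.g. a goal following a shot) even when the action itself is not visible in the frame, which is the intuition behind ASPERAMC's advantage on invisible actions.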