Authors:
Jérémie Ochin 1,2; Guillaume Devineau 2; Bogdan Stanciulescu 1 and Sotiris Manitsaris 1
Affiliations:
1 Centre for Robotics, MINES Paris - PSL, France; 2 Footovision, France
Keyword(s):
Spatio-Temporal Action Detection, Video Action Recognition, Sport Video Understanding, 3D Convolutional Neural Networks, Graph Neural Networks, Soccer, Game Structure, Sports Analytics, Soccer Analytics.
Abstract:
Soccer analytics rely on two data sources: player positions on the pitch and the sequence of events they perform. With around 2000 ball events per game, their precise and exhaustive annotation from a monocular video stream remains a tedious and costly manual task. While state-of-the-art spatio-temporal action detection methods show promise for automating this task, they lack contextual understanding of the game. Assuming professional players’ behaviours are interdependent, we hypothesize that incorporating information about surrounding players, such as position, velocity and team membership, can enhance purely visual predictions. We propose a spatio-temporal action detection approach that combines visual and game state information via Graph Neural Networks trained end-to-end with state-of-the-art 3D CNNs, demonstrating improved metrics through game state integration.
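The abstract's core idea, enriching each player's visual prediction with game-state features of surrounding players, can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the radius-based neighbourhood rule, the mean aggregation, and all feature layouts below are illustrative assumptions standing in for one learned message-passing layer of a Graph Neural Network.

```python
# Hypothetical sketch: one round of mean-aggregation message passing over a
# graph of players. Each node carries game-state features (position, velocity,
# team flag) plus a per-player "visual" score; neighbours' features are
# averaged and concatenated to give context-aware node features.
import math

def player_graph_edges(positions, radius=20.0):
    """Connect players closer than `radius` metres (assumed neighbourhood rule)."""
    edges = []
    for i, pi in enumerate(positions):
        for j, pj in enumerate(positions):
            if i != j and math.dist(pi, pj) < radius:
                edges.append((i, j))
    return edges

def message_passing(node_feats, edges):
    """Concatenate each node's features with the mean of its neighbours'."""
    n, d = len(node_feats), len(node_feats[0])
    agg = [[0.0] * d for _ in range(n)]
    deg = [0] * n
    for i, j in edges:
        deg[i] += 1
        for k in range(d):
            agg[i][k] += node_feats[j][k]
    out = []
    for i in range(n):
        mean = [a / deg[i] if deg[i] else 0.0 for a in agg[i]]
        out.append(node_feats[i] + mean)  # self features ++ neighbourhood context
    return out

# Toy game state: [x, y, vx, vy, team, visual_score] per player (made-up values).
positions = [(0.0, 0.0), (5.0, 3.0), (40.0, 40.0)]
feats = [[0.0, 0.0, 0.1, 0.0, 1.0, 0.9],
         [5.0, 3.0, -0.2, 0.1, 1.0, 0.4],
         [40.0, 40.0, 0.0, 0.0, -1.0, 0.7]]
context_feats = message_passing(feats, player_graph_edges(positions))
```

In an end-to-end model such as the one described, the visual score would be a 3D-CNN embedding and the aggregation would use learned weights, but the data flow is the same: nearby players' state modulates each player's action prediction.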