2 RELATED WORK
In this section, we describe methods that detect general events (“group activities”) in soccer videos, and we discuss approaches for the simultaneous recognition of actions and group activities, outside the soccer domain.
2.1 Event Detection in Soccer
The aim of event detection is to detect the temporal boundaries of a match event or camera shot, and to classify the isolated samples accordingly (Tavassolipour et al., 2013). Goals (and goal attempts), corners, cards, shots, penalties, fouls and offsides have been detected in television broadcast videos. Recent methods use a 3D-Convolutional Neural Network (CNN) (Khan, 2018) or combine a CNN and a Recurrent Neural Network (RNN) (Jiang et al., 2016) consecutively. Often, these methods rely on the detection of cinematic features, based on the typical ways in which television production teams record soccer events on camera (Ekin et al., 2003). For example, a goal attempt is often followed by a slow-motion shot of the event. We consider these dependencies undesirable, as they limit the applicability of a model to broadcast videos only. Performances range between 82.0% (Tavassolipour et al., 2013) and 95.5% (Vanderplaetse and Dupont, 2020) multi-class accuracy (MCA), for the recognition of seven and four activity classes respectively.
Others combine recordings of twelve (Zhang, 2019) or fourteen (Tsunoda, 2017) static cameras positioned around the field. The latter approach reaches 70.2% MCA for the recognition of three classes. We argue that multiple-camera setups are expensive to purchase and require large computational resources. Our method is designed for event recognition in videos from one static panoramic camera, which are more accessible for non-professional clubs.
Soccer videos contain a majority of background pixels due to the size of the field. Nevertheless, most of the methods mentioned above classify events directly from video frames. Zhang et al. (2019) propose to detect events from latent player embeddings, created by a U-encoder on pixels in player bounding boxes. Our method also creates latent player embeddings, but from normalised player snippets instead of bounding boxes.
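As an illustration of this pre-processing idea, the following is a minimal sketch of how a normalised player snippet could be obtained; the helper name, box format, target resolution and the use of OpenCV are assumptions for illustration, not the pre-processing used in this paper.

import cv2
import numpy as np

def extract_player_snippet(frame: np.ndarray, box, size=(128, 256)):
    """Crop a player region from a frame and resize the crop to a fixed
    resolution, so all snippets share the same dimensions regardless of
    the player's distance to the camera."""
    x1, y1, x2, y2 = [int(v) for v in box]  # hypothetical (x1, y1, x2, y2) box
    crop = frame[y1:y2, x1:x2]
    return cv2.resize(crop, size)           # OpenCV expects (width, height)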
2.2 Action and Group Activity Recognition
Three types of deep learning networks can be distinguished in state-of-the-art action and group activity recognition methods: spatio-temporal, multiple-stream and hybrid networks. Spatio-temporal networks, such as a 3D-CNN (Ji, 2012), search for volumetric patterns at different scales of the input videos. The I3D CNN (Carreira and Zisserman, 2017) is a multiple-stream network that recognises actions from RGB and optical flow videos. The network appears to give better results in group activity recognition than a standard CNN (Azar, 2019). In a hybrid network, two networks are combined consecutively (Kong and Fu, 2018). The approach is popular for group activity recognition. First, a CNN extracts individual features and creates a latent embedding per group member. We will refer to this phase as feature extraction. Second, a different network explores inter-human relations to update the embeddings accordingly. We will refer to this phase as feature contextualisation. RNNs (Tsunoda, 2017) and Graph Convolutional Networks (Ibrahim and Mori, 2018) are often used for the latter phase.
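To make this two-phase pattern concrete, the following is a minimal PyTorch sketch of a hybrid network; the module choices, dimensions and the GRU-based contextualisation are illustrative assumptions rather than the implementation of any method cited above.

import torch
import torch.nn as nn

class HybridGroupActivityNet(nn.Module):
    def __init__(self, feat_dim=512, num_actions=9, num_activities=4):
        super().__init__()
        # Phase 1: feature extraction, a CNN backbone that maps each
        # actor crop to a latent embedding (placeholder conv stack).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        # Phase 2: feature contextualisation, here a GRU over the actors
        # as a simplified stand-in for the relational models cited above.
        self.context = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.action_head = nn.Linear(feat_dim, num_actions)
        self.activity_head = nn.Linear(feat_dim, num_activities)

    def forward(self, actor_crops):
        # actor_crops: (num_actors, 3, H, W) crops from one frame.
        feats = self.backbone(actor_crops)              # (N, feat_dim)
        ctx, _ = self.context(feats.unsqueeze(0))       # (1, N, feat_dim)
        ctx = ctx.squeeze(0)
        actions = self.action_head(ctx)                 # per-actor logits
        activity = self.activity_head(ctx.mean(dim=0))  # group-level logits
        return actions, activity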
Our method uses a hybrid network with I3D for feature extraction and graph attention networks (GATs) (Veličković, 2017) for feature contextualisation. We have not yet seen these networks being applied to event recognition in the soccer domain.
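For intuition, the following is a minimal single-head graph attention layer in the spirit of Veličković et al. (2017), operating on a fully connected player graph; it is a simplified sketch, not the exact layer used in our method or in the cited work.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer
        self.leaky_relu = nn.LeakyReLU(0.2)

    def forward(self, h):
        # h: (N, in_dim) player embeddings; every player attends to every player.
        z = self.W(h)                                      # (N, out_dim)
        n = z.size(0)
        # Pairwise concatenation [z_i || z_j] for all pairs (i, j).
        pairs = torch.cat([z.unsqueeze(1).expand(n, n, -1),
                           z.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = self.leaky_relu(self.a(pairs)).squeeze(-1)     # (N, N) raw scores
        alpha = F.softmax(e, dim=-1)                       # attention weights
        return alpha @ z                                   # contextualised embeddings

Stacking such layers, typically with multiple attention heads, lets each player embedding be updated with information from all other players on the field.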
2.3 Actor Relation Graph as Baseline
It is difficult to compare our method with the state of the art in soccer event detection, because each method is evaluated on a different dataset. The datasets vary in event types, number of classes and input videos, while none of the methods recognise individual actions.
The Actor Relation Graph (ARG) is a hybrid network that uses an Inception-V3 CNN (Szegedy, 2016) for feature extraction and uses GATs with self-attention (Vaswani, 2017) for feature contextualisation. The method reaches state-of-the-art performance in action and group activity recognition on Volleyball Dataset videos. Because the domain is related to soccer, and an open-source implementation is available, the ARG is selected as baseline.
3 PROPOSED METHOD
We start this section with an overview of the architecture of the proposed method, and note where it differs from the baseline approach. Thereafter, the architecture is explained along the four phases of the data pipeline: data pre-processing, feature extraction, feature contextualisation and the generation of predictions. Lastly, we provide implementation details.