detection algorithm tries to resolve them. Based on
the pose and the type of shot, a decision is made between face recognition, jersey recognition, and/or number recognition, and available sensor data are used to further filter or verify the set of candidate riders, as explained in Section 3.
2.2 Cycling Sensor Data
Several third parties (such as Velon and Gracenote)
provide structured and very detailed live data of
sports events at a high frequency. Velon, for example,
provided location, heart rate, power, and speed of each cyclist during several stages of Tirreno-Adriatico 2019, and Gracenote provided the exact location of each group of riders during the Tour of Flanders 2019. If such sensor data are available for a race, they are the most accurate and computationally most efficient solution for geo-localization and event detection. When there are multiple groups of riders,
however, an additional method (such as team or
cyclist recognition) is needed to know which
particular group is shown in the live video stream (as
is further discussed in Section 3).
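To make this concrete, the following is a minimal sketch of how such per-rider sensor samples could be represented and used to shortlist the riders near a known reference point (e.g., the geo-localized position of the group on screen). The field names and the distance helper are our own illustrative assumptions, not the actual Velon or Gracenote schema.

from dataclasses import dataclass
from math import radians, sin, cos, asin, sqrt

@dataclass
class RiderSample:
    # Hypothetical record layout; real provider feeds differ.
    rider_id: str
    timestamp: float   # seconds since start of the broadcast
    lat: float         # WGS84 latitude
    lon: float         # WGS84 longitude
    heart_rate: int    # bpm
    power: int         # watts
    speed: float       # km/h

def haversine_m(lat1, lon1, lat2, lon2):
    # Great-circle distance between two WGS84 points, in metres.
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def candidates_near(samples, lat, lon, radius_m=200.0):
    # Shortlist riders whose sample lies within radius_m of the
    # reference point, e.g. the geo-localized camera position.
    return [s.rider_id for s in samples
            if haversine_m(s.lat, s.lon, lat, lon) <= radius_m]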
If detailed sensor data are available, several events can be detected in them, such as breakaways, crashes, or difficult sectors (e.g., barriers and sandpits in cyclocross or gravel segments in road cycling). For the latter type of events, the approach of Langer et al. (2020) for difficulty classification of mountain bike downhill trails can, for example, be tailored to cyclocross and road cycling segment classification. The work of Verstockt (2014) also shows that this is feasible. For breakaway/crash detection, experiments revealed that a simple spatio-temporal analysis across all riders already provides satisfactory results.
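As an illustration of such a spatio-temporal analysis, the sketch below orders riders by their distance along the course and starts a new group wherever the gap to the next rider exceeds a threshold. The 50 m threshold and the function names are our own illustrative choices, not the exact experimental set-up.

def split_into_groups(positions, gap_threshold_m=50.0):
    # positions: dict mapping rider_id -> distance along the course (m).
    # Returns a list of groups, head of the race first. A leading group
    # that detaches (a breakaway) shows up as a separate list, and a
    # rider who suddenly falls behind a group (e.g. after a crash)
    # drops into a group of their own.
    if not positions:
        return []
    ordered = sorted(positions.items(), key=lambda kv: kv[1], reverse=True)
    groups, current = [], [ordered[0][0]]
    for (_, prev_d), (rider, d) in zip(ordered, ordered[1:]):
        if prev_d - d > gap_threshold_m:
            groups.append(current)
            current = []
        current.append(rider)
    groups.append(current)
    return groups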
3 TEAM & RIDER DETECTION
3.1 Skeleton and Pose Detection
The proposed team and rider detection methods both start from the output of a skeleton recognition algorithm, such as OpenPose (https://github.com/CMU-Perceptual-Computing-Lab/openpose), tf-pose (https://github.com/ildoonet/tf-pose-estimation), or AlphaPose (https://github.com/MVIG-SJTU/AlphaPose). Figure 3 shows an example of the
skeleton detection (front and side view) of these
algorithms – tested in our lab set-up. In order to
measure the accuracy of each of the available pose
estimation libraries, tests were performed in which
ground truth annotations of the rider joints are
compared to the algorithms’ output. As can be seen in
the results shown in Figure 4, none of these skeleton trackers outperforms the others in all situations, but AlphaPose and OpenPose clearly outperform tf-pose. An evaluation of OpenPose on a dataset of Tour de France footage also gave satisfactory results for the typical filming angles of live cycling broadcasts.
Figure 3: Rider skeleton detection (lab set-up).
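The paper does not name the exact accuracy measure; a PCK-style score (Percentage of Correct Keypoints), as sketched below, is one common way to compare predicted joints against ground truth annotations.

import numpy as np

def pck(predicted, ground_truth, ref_size_px, alpha=0.5):
    # predicted, ground_truth: (num_joints, 2) arrays of pixel coordinates.
    # A joint counts as correct when its distance to the annotation is
    # below alpha times a reference length (e.g. the torso height).
    dists = np.linalg.norm(predicted - ground_truth, axis=1)
    return float(np.mean(dists <= alpha * ref_size_px))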
The skeleton detection provides the keypoints, i.e., the main joints of the rider's body. From the keypoint locations (i.e., pixel coordinates) we can derive the pose and orientation of the rider. The relative image positions of the shoulder keypoints reveal the orientation: seen from the front, the rider's left shoulder appears to the right of the right shoulder in the image, whereas in a shot from a rear perspective it appears to the left. Based on this
information, different techniques can be selected for
further identification. For instance, if we see a rider
from the back, face detection will not work, but a
combination of number and team recognition will
make it possible to detect the rider from that side. If
we have a frontal view of the rider, the number will
of course not be visible, but now the face recognition
algorithm can take over. If available, sensor data can
help to limit the number of candidate riders that can
be expected in a particular shot or frame.
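A minimal sketch of this orientation test and the resulting dispatch, assuming keypoints with anatomical left/right labels as produced by detectors such as OpenPose (function and joint names are illustrative):

def rider_orientation(keypoints):
    # keypoints: dict mapping joint name -> (x, y) pixel coordinates,
    # with anatomical labels (the rider's own left/right).
    lx, _ = keypoints["left_shoulder"]
    rx, _ = keypoints["right_shoulder"]
    # Facing the camera, the rider's left shoulder is mirrored to the
    # right-hand side of the image; seen from behind it stays left.
    return "front" if lx > rx else "rear"

def identification_methods(orientation):
    # Pick the techniques that can work for this view.
    if orientation == "front":
        return ["face_recognition", "team_recognition"]
    return ["number_recognition", "team_recognition"]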
In addition to detecting the orientation of the rider, skeleton detection can also be used for shot type classification. Based on the number of detected skeletons and their size/location in the video footage, a close-up shot can easily be distinguished from a long shot or landscape view, as is shown in Figure 5. Furthermore, scene changes can also be detected by analysing how the skeleton sizes/locations change over time. As a result, we know exactly when it is safe to start and stop tracking (if needed), and we can also easily cut the video further into logical story units.
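A simple heuristic along these lines is sketched below; the thresholds and class names are our own illustrative choices, not the classifier used in the experiments.

def classify_shot(skeleton_heights_px, frame_height_px):
    # skeleton_heights_px: pixel height of each detected skeleton
    # (e.g. the vertical extent of its keypoints).
    if not skeleton_heights_px:
        return "landscape"                   # no riders visible
    largest = max(skeleton_heights_px) / frame_height_px
    if largest > 0.5 and len(skeleton_heights_px) <= 2:
        return "close-up"                    # few, large skeletons
    return "long shot"                       # many or small skeletons

def scene_change(prev_heights, cur_heights, jump=0.3):
    # Flag a cut when the dominant skeleton size jumps abruptly
    # between consecutive frames.
    if not prev_heights or not cur_heights:
        return bool(prev_heights) != bool(cur_heights)
    return abs(max(cur_heights) - max(prev_heights)) / max(prev_heights) > jump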
Finally, we also use the skeleton output to crop out
the faces and upper body regions of the riders – results
of this step are shown in Figure 6. In this way, the subsequent face and team recognition steps only have to process the relevant image regions.
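A sketch of such keypoint-based cropping is given below; the margin and joint names are illustrative, and the image is assumed to be an H x W x 3 array (e.g. a NumPy array).

def crop_face_and_torso(image, keypoints, margin=0.3):
    # Face crop: a square around the nose, sized relative to the
    # shoulder width; torso crop: shoulders down to the hips.
    (lx, ly), (rx, ry) = keypoints["left_shoulder"], keypoints["right_shoulder"]
    nx, ny = keypoints["nose"]
    half = int((0.5 + margin) * abs(lx - rx))
    face = image[max(0, int(ny) - half):int(ny) + half,
                 max(0, int(nx) - half):int(nx) + half]
    hip_y = int(max(keypoints["left_hip"][1], keypoints["right_hip"][1]))
    x0, x1 = int(min(lx, rx)), int(max(lx, rx))
    torso = image[int(min(ly, ry)):hip_y, max(0, x0):x1]
    return face, torso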