However, these conventional tracking approaches face challenges related to their cost, player safety, and players’ willingness to wear them during a game (Edgecomb, 2006). These challenges have driven the development of cost-effective, contactless alternatives for acquiring tracking data during football games.
Filming football games with high-resolution cameras paved the way for computer-vision-based approaches that track players without extra wearables (Thomas, 2017). Vision-based acquisition of tracking data from the football game feed has shown promising results; however, certain in-game scenarios still hinder the practical use of this approach. Occlusion is one such scenario, occurring frequently during a game whenever players contest a duel or move across each other (Sabirin, 2015). Detecting occlusion events and correcting the trajectories of occluded players will improve the performance of vision-based tracking systems and yield more usable tracking data.
Many multiple object tracking (MOT) paradigms have been introduced over the last two decades, such as the tracking-by-detection paradigm (Andriluka, 2008), which starts with object (e.g., player) detection in a specific video frame and constructs a feature vector describing each detected object. Object detection is the input stage of any MOT pipeline, and choosing a detector with the best accuracy and highest speed is key to efficient detection (Luo, 2021). Recently, deep neural network architectures have shown significant performance in learning features and detecting objects, and they now form the basis of every state-of-the-art object detector (Zhao, 2019). Detection is repeated for all objects in each frame; frame-to-frame features are then compared using similarity computation algorithms, and objects with highly similar features are linked under a unique ID across frames. Associating objects that share the same ID over multiple frames yields the object’s trajectory (Sun, 2019).
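To make this association step concrete, the sketch below links detections between two consecutive frames by cosine similarity of their appearance feature vectors and the Hungarian algorithm; it is a generic illustration under our own naming, not the method of any cited work, and the similarity threshold is an arbitrary placeholder.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_feats, curr_feats, sim_threshold=0.6):
    """Link detections across two frames by appearance similarity.

    prev_feats, curr_feats: (N, D) and (M, D) L2-normalized feature
    vectors, one row per detected player. Returns (prev_idx, curr_idx)
    pairs of matched detections.
    """
    sim = prev_feats @ curr_feats.T              # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)     # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= sim_threshold]
```

Matched detections inherit the previous frame’s ID; detections left unmatched are treated as newly appearing objects and start new tracks.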
Effective tracking algorithms rely on accurate modeling of the objects detected in each frame. There are two main approaches to modeling: appearance modeling and motion modeling (Luo, 2021). Appearance modeling is an object-descriptive approach that models objects efficiently within local regions; however, when visually similar objects appear in the same local region, its performance degrades. Motion modeling, on the other hand, adopts probabilistic approaches to model the dynamic behavior of the player (object) and predicts the player’s future locations according to a statistical model. Both appearance and motion modeling lack the ability to detect and/or correct the trajectories of occluded objects (Luo, 2021).
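As a concrete illustration of motion modeling, the sketch below implements a generic constant-velocity Kalman filter that predicts a player’s next position; the noise magnitudes are arbitrary placeholders, and this is not the specific statistical model of the works cited above.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter over pitch coordinates.

    State x = [px, py, vx, vy]; observations are (px, py) detections.
    """

    def __init__(self, dt=1 / 25):                # 25 fps video
        self.F = np.eye(4)                        # state transition
        self.F[0, 2] = self.F[1, 3] = dt
        self.H = np.eye(2, 4)                     # observe position only
        self.Q = np.eye(4) * 1e-2                 # process noise (placeholder)
        self.R = np.eye(2) * 1e-1                 # measurement noise (placeholder)
        self.x = np.zeros(4)
        self.P = np.eye(4)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                         # predicted position

    def update(self, z):
        y = np.asarray(z) - self.H @ self.x       # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
```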
Figure 1: FOOTBALLTrace Workflow.
In this paper, we adopt the appearance modeling approach and propose an end-to-end AI-based system (Figure 1), called FOOTBALLTrace, for acquiring low-cost, contactless tracking data for football analytics. Our system films the game with three fixed cameras at 25 frames per second, stitches the three camera views into a panoramic view of the game, and uses a state-of-the-art real-time object detector, YOLOv7, together with a deep affinity estimation network to assign unique IDs to the detected players throughout the entire game. Additionally, we introduce a novel algorithm for occlusion detection and trajectory correction that enhances the practicality of our system.
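The sketch below outlines how these stages could be chained per frame; detect, embed, and associate are hypothetical callables standing in for the YOLOv7 detector, the affinity estimation network, and the ID assignment step, not our released interfaces.

```python
from dataclasses import dataclass, field

@dataclass
class Track:
    track_id: int
    boxes: list = field(default_factory=list)    # one bounding box per frame

def run_pipeline(panorama_frames, detect, embed, associate):
    """Per-frame loop sketch: detect -> embed -> assign IDs.

    `detect`, `embed`, and `associate` are hypothetical placeholders for
    the components described in the text; `associate` is assumed to
    create new Track objects for unmatched detections.
    """
    tracks = []
    for frame in panorama_frames:                # stitched panoramic frames
        boxes = detect(frame)                    # YOLOv7 player boxes
        feats = embed(frame, boxes)              # appearance embeddings
        for track, box in associate(tracks, boxes, feats):
            track.boxes.append(box)              # extend the trajectory
    return tracks
```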
2 DATASET DESCRIPTION
Figure 2: Camera setup for panoramic view creation.
We used video data of a football game recorded with three 4K-resolution GoPro cameras with a wide field of view and 30% overlap between adjacent camera views, as shown in Figure 2. The three views are then stitched together using an image registration algorithm (Szeliski, 2007) to create a panoramic view of the pitch (depicted in Figure 3). The game video recording and panoramic view creation were carried out by KoraStats, Egypt (KORASTATS, 2023). The game was filmed at the Air Defense Stadium in Cairo, Egypt. The tracking coordinates are mapped to a pitch top view using a homography transformation between the pitch coordinates in the camera view and the corresponding coordinates in the pitch top view.
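As a sketch of this mapping using standard OpenCV calls (the landmark pixel coordinates below are placeholders, and a 105 x 68 m pitch is assumed for illustration):

```python
import cv2
import numpy as np

# Pixel coordinates of known pitch landmarks (e.g., corner flags) in the
# panoramic camera view -- placeholder values for illustration only.
camera_pts = np.array([[410, 880], [3650, 860], [3120, 1540], [930, 1560]],
                      dtype=np.float32)
# The same landmarks in the top-view pitch model, in metres
# (105 x 68 m pitch assumed).
pitch_pts = np.array([[0, 0], [105, 0], [105, 68], [0, 68]],
                     dtype=np.float32)

# Estimate the 3x3 homography mapping camera pixels to pitch coordinates.
H, _ = cv2.findHomography(camera_pts, pitch_pts, cv2.RANSAC)

def to_pitch(points_px):
    """Map (N, 2) player positions from camera pixels to pitch metres."""
    pts = np.asarray(points_px, dtype=np.float32).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)
```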