3.1 Datasets
The datasets used for training and testing the
detectors can be divided into two parts.
The first, custom part of the dataset was acquired from indoor and outdoor footage of handball practice and competition. The recordings were made by the authors of this paper during one week at a handball school, without additional scene preparation or instructions to the players, in order to preserve real-world conditions. The subjects in the images are mainly young players with accompanying coaches, handling multiple sports balls. The recordings were made with GoPro cameras positioned either at a height of 1.5 m at the border of the field, or from the spectators' viewpoint, approximately 3.5 m high and 10 m away from the field limit. Artificial lighting was present during indoor activities, with some sunlight coming through the windows. Outdoor scenes were taken during the daytime under a clear or almost cloudless sky. From the 751 videos, recorded at 1920x1080 (full HD) resolution and 30 frames per second, 394 training and 27 validation images were selected for training the models. The ball objects come in a variety of colors, as does the players' clothing, which is mainly everyday sportswear.
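The exact frame selection procedure is not the focus here, but as a minimal sketch, candidate frames can be sampled from such recordings with OpenCV (the file name and sampling step below are hypothetical, not the authors' exact procedure):

```python
import cv2

# Hypothetical illustration: sample every N-th frame from a full HD
# recording so candidate images can be selected for annotation.
VIDEO_PATH = "practice_session_01.mp4"  # hypothetical file name
STEP = 30                               # one frame per second at 30 fps

cap = cv2.VideoCapture(VIDEO_PATH)
frame_idx, saved = 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_idx % STEP == 0:
        cv2.imwrite(f"frame_{saved:05d}.png", frame)  # 1920x1080 frame
        saved += 1
    frame_idx += 1
cap.release()
```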
The second, public part of the dataset was used to avoid overfitting and to prepare the model for detection in other sports. It consists of 1445 training and 13 test images of variable sizes, from 174 x 174 up to 5184 x 3456 pixels, with one or more ball occurrences in each. This part was gathered partly using an internet search engine and partly from the publicly available COCO dataset (Lin et al., 2014). Here, the balls are not exclusive to handball and come in different sizes and colors. The persons in the images also take different poses and are dressed differently.
The complete dataset has 1837 images with over
3500 ball objects.
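As an illustration of how the COCO portion of such a dataset can be collected, the pycocotools API allows selecting only the images that contain the sports ball category; the annotation file path below is an assumption:

```python
from pycocotools.coco import COCO

# Sketch of pulling images containing the "sports ball" category from
# the public COCO annotations (Lin et al., 2014).
coco = COCO("annotations/instances_train2014.json")  # hypothetical path
cat_ids = coco.getCatIds(catNms=["sports ball"])
img_ids = coco.getImgIds(catIds=cat_ids)
images = coco.loadImgs(img_ids)
print(f"{len(images)} COCO images contain at least one sports ball")
```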
3.2 Models
The description of the tested models is given below. Since the goal of detecting sports balls requires discerning small and distant objects in images, the input resolution was increased to 1024x1024 pixels in all models except the first two, where the original input size of 608x608 pixels was used. This was done because the ball objects in a large number of the source full HD images take up just a few pixels, and resizing the images to 608x608 resolution can make such objects invisible or nearly so.
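To make the effect concrete, a back-of-the-envelope computation (the ball widths below are illustrative assumptions) shows how a small ball shrinks at each input resolution:

```python
# Why small balls vanish at 608x608 input: a ball w pixels wide in a
# 1920-pixel-wide frame shrinks by a factor of 608/1920 (or 1024/1920).
for ball_px in (6, 10, 20):
    at_608 = ball_px * 608 / 1920
    at_1024 = ball_px * 1024 / 1920
    print(f"{ball_px:2d} px ball -> {at_608:4.1f} px at 608, "
          f"{at_1024:4.1f} px at 1024")
# A 6 px ball ends up under 2 px at 608x608, i.e. practically invisible,
# but keeps roughly 3 px at 1024x1024.
```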
An additional change was made to the set of classes the models need to detect. Since object detection in sports actions does not require classes such as teddy bear or fire hydrant, the models were trained solely on the ball and person classes; all other classes were not considered in this experiment. This affects the Mean Average Precision (mAP), which will be used as the metric of model performance.
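As a sketch of this restriction, assuming Darknet-style label files (one "class x y w h" line per object) and an illustrative class-ID mapping that is not necessarily the authors' exact setup, the annotations can be filtered as follows:

```python
from pathlib import Path

# Keep only the two classes of interest and remap their IDs.
# Hypothetical mapping: COCO "person" (0) -> 0, "sports ball" (32) -> 1.
KEEP = {0: 0, 32: 1}

for label_file in Path("labels").glob("*.txt"):
    kept_lines = []
    for line in label_file.read_text().splitlines():
        cls, *coords = line.split()
        if int(cls) in KEEP:
            kept_lines.append(" ".join([str(KEEP[int(cls)])] + coords))
    label_file.write_text("\n".join(kept_lines))
```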
Our reference model, further marked as Y, is the pre-trained YOLOv2 model with a 608 x 608 input image size, with weights pre-trained on the COCO dataset and no additional training by the authors of this paper. The pre-trained model includes the person and sports ball classes, among others from the COCO dataset.
The model Y+ is the pre-trained model Y fine-tuned on the 394 images from the custom part of the dataset. The training of this model proved to be unstable, so only a limited number of epochs was performed. Since the same dataset, consisting of large images (up to 5184 x 3456 pixels) with small annotated objects (a few pixels in height and width), was later applied without problems to the subsequent models with a higher input resolution, it can be concluded that the already small objects become too small when the images are resized to the 608 x 608 input, which causes the error.
The third model (YB) was trained using transfer learning on both the public and custom parts of the dataset, 1837 images in total, training only for the ball class. In this and the subsequent models in this experiment, the input image resolution was increased from the original model's 608 x 608 to 1024 x 1024 pixels.
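For illustration, such a resolution change can be made by editing the width and height entries in the [net] section of a Darknet-style configuration file; the file names in the following sketch are assumptions:

```python
import re

# Sketch of raising the network input resolution in a Darknet-style
# .cfg file; file names are hypothetical.
cfg = open("yolov2.cfg").read()
cfg = re.sub(r"(?m)^width=\d+$", "width=1024", cfg)
cfg = re.sub(r"(?m)^height=\d+$", "height=1024", cfg)
open("yolov2-1024.cfg", "w").write(cfg)
```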
The YBP model included both the ball and person classes and was trained using transfer learning. It was trained for approximately 80 epochs on both the custom and public parts of the dataset.
In the YBP+ model, the custom dataset was doubled using flipped images and correspondingly flipped annotations. To achieve this result, the YBP model was fine-tuned for ten epochs.
For the sixth model (YBPF), it was decided to include all images used so far: the public, custom, and flipped custom datasets. Since the flipped custom images were mirrored around the y-axis, it was also decided to try flipping the images around the x-axis, which results in an unnatural sky-ground orientation and upside-down human poses.
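As an illustration of how both flip augmentations can be produced together with the corresponding annotations, the following sketch assumes YOLO-style normalized bounding boxes (class, x-center, y-center, width, height in [0, 1]); it is not the authors' exact code:

```python
import cv2

def flip_sample(image, boxes, axis):
    """Mirror an image and its YOLO-format boxes.

    axis="y" mirrors left-right (cv2 flipCode=1), axis="x" mirrors
    top-bottom (cv2 flipCode=0). Box widths and heights are unchanged;
    only the mirrored center coordinate is reflected.
    """
    flipped = cv2.flip(image, 1 if axis == "y" else 0)
    new_boxes = []
    for cls, x, y, w, h in boxes:
        if axis == "y":
            x = 1.0 - x  # mirror the box center horizontally
        else:
            y = 1.0 - y  # mirror the box center vertically
        new_boxes.append((cls, x, y, w, h))
    return flipped, new_boxes
```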
All of the models were trained and then tested in the same environment: a PC equipped with a 12-core E5-2680 v3 CPU and one GeForce GTX TITAN X GPU with 12 GB of memory, running the Debian Linux operating system. Additional programming was done in the Python programming language.