DeepBall: Deep Neural-Network Ball Detector

Jacek Komorowski (1,2), Grzegorz Kurzejamski (1,2) and Grzegorz Sarwas (1,2)

(1) Warsaw University of Technology, Warsaw, Poland

(2) Sport Algorithmics and Gaming Sp. z o.o., Warsaw, Poland

Keywords: Ball Detection, Neural Network Based Object Detection, Single Stage Detector.

Abstract: The paper describes a deep network based object detector specialized for ball detection in long shot videos. Due to its fully convolutional design, the method operates on images of any size and produces a ball confidence map encoding the position of the detected ball. The network uses the hypercolumn concept, where feature maps from different hierarchy levels of the deep convolutional network are combined and jointly fed to the convolutional classification layer. This boosts the detection accuracy, as a larger visual context around the object of interest is taken into account. The method achieves state-of-the-art results when tested on the publicly available ISSIA-CNR Soccer Dataset.

1 INTRODUCTION

An ability to accurately detect and track the ball in a video sequence is a core capability of any system aiming to automate the analysis of football matches or players' progress. Our method aims to solve the problem of fast and accurate ball detection. It is developed as a part of a computer system for football clubs and academies to track and analyze player performance during both training sessions and regular games. The system is intended to help professional football analysts evaluate the players' performance by allowing automatic indexing and retrieval of interesting events.

Detecting the ball from long-shot video footage of a football game is not trivial to automate. The object of interest (the ball) is very small compared to other objects visible in the observed scene. Due to the perspective projection, its size varies depending on its position on the play field. The shape is not always circular: when a ball is kicked and moves at high velocity, its image becomes blurry and elliptical. The perceived colour of the ball changes due to shadows and lighting variation. The colour is usually similar to the colour of the white lines on the pitch and sometimes to the players' jerseys. Other objects with an appearance similar to the ball can be visible, such as small regions near the pitch lines and parts of players' bodies such as a head. Situations when the ball is in a player's possession or partially occluded are especially difficult. Figure 1 shows exemplary image patches illustrating the high variance in the ball appearance and the difficulty of the ball detection task.

Traditional ball detection methods, e.g. based on variants of the circular Hough transform, deal well with situations where the ball is visible as a single object, separated from the player's body. They have problems detecting the ball when it is possessed or partially occluded by a player. But for player performance analysis, the most informative frames are those showing players in close contact with the ball. In this paper we present a ball detection method expanding upon state-of-the-art deep convolutional object detection networks. The method operates on a single video frame and is intended as the first stage in a ball tracking pipeline. Our method does not have the limitations associated with earlier methods based on a circular Hough transform. It can deal with situations where the perceived ball shape is not circular due to motion blur. It detects the ball when it is in close contact with, or partially occluded by, a player's body. It can detect multiple balls, located relatively close to each other, in the same image. Another benefit of the proposed method is its flexibility. Due to the fully convolutional design, it can operate on images of any size and produces a ball confidence map of a size proportional to the input image. The detection network is designed with performance in mind. The evaluation in Section 4 shows that our method can efficiently process high definition video input in real time.

Komorowski, J., Kurzejamski, G. and Sarwas, G.

DeepBall: Deep Neural-Network Ball Detector.

DOI: 10.5220/0007348902970304

In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2019), pages 297-304

ISBN: 978-989-758-354-4


Figure 1: Exemplary patches illustrating high variance in ball appearance and difficulty of the ball detection task.

2 RELATED WORK

The first step in traditional ball detection methods is usually background subtraction. It prevents ball detection algorithms from producing false detections on static parts of the image, such as stadium advertisements. The most commonly used background subtraction approaches are based on chromatic features (Gong et al., 1995; Ali et al., 2012; Kia, 2016) or motion detection (D'Orazio et al., 2002; D'Orazio et al., 2004; Leo et al., 2008; Mazzeo et al., 2012). Segmentation methods based on chromatic features use domain knowledge about the visible scene: the football pitch is mostly green and the ball mostly white. The colour of the pitch is usually modelled using a Gaussian Mixture Model and either hardcoded in the system or learned. When the video comes from a static camera, motion-based segmentation is often used. For computational performance reasons, a simple approach is usually applied, based on an absolute difference between consecutive frames or the difference between the current frame and the mean or median image obtained from a few previously processed frames (Higham et al., 2016).

After the background segmentation, heuristic criteria based on chromatic or morphological features are applied to the resulting blobs to locate the ball. These criteria include blob size, colour and shape (circularity, eccentricity) (Gong et al., 1995). Variants of the Circle Hough Transform (Yuen et al., 1990), modified to detect spherical rather than circular objects, may be used to verify if a blob contains the ball (D'Orazio et al., 2002; D'Orazio et al., 2004; Leo et al., 2008; Poppe et al., 2010; Halbinger and Metzler, 2015). A two-stage approach may be employed to achieve real-time performance and high detection accuracy (D'Orazio et al., 2002; Leo et al., 2008; Mazzeo et al., 2012). In this scenario, the regions that probably contain the ball are found first (ball candidate extraction). Then, the candidates are validated (ball candidate validation).

In (Ali et al., 2012), straight lines are detected using a kernel-based Hough transform and removed from the foreground image to overcome the problem of the ball blending with the white lines on the pitch. A very similar method is proposed in (Rao and Pati, 2015). (Gong et al., 1995; Pallavi et al., 2008; Halbinger and Metzler, 2015) use multiple successive frames to improve the detection accuracy. In (Gong et al., 1995), detection is confirmed by searching the neighbourhood of each ball candidate in the successive frame. If a white area with similar size and circularity is found in the next frame, the ball candidate is validated. In (Pallavi et al., 2008), the authors extract ball candidate positions using morphological features (shape and size of the ball). Then, a directed weighted graph is constructed from ball candidates in successive frames. The vertices of the graph correspond to candidate ball positions and the edges link candidates found in consecutive frames. The longest path in the graph is computed to give the ball trajectory.

Ball detection methods that use morphological features to analyze the shape of blobs produced by background segmentation fail if the ball is touching a player. See the bottom row of Fig. 1 for exemplary images where these methods are likely to fail. (Halbinger and Metzler, 2015) addresses this limitation by using a two-stage approach. First, the ball is detected in non-occluded situations, where it appears as a single object. This is done by applying background subtraction to filter out the temporally static part of the image. Then, foreground blobs are filtered by size and shape to produce ball candidates. Ball candidates are verified by examining a few successive frames and detecting robust partial ball trajectories (tracklets). When the first-stage detector is not able to locate the ball, the second-stage detector, specialized for partially occluded situations, is used. Ball candidates are found using a Hough circle detector. Foreground object contours are extracted and their Freeman chain code is examined. If a ball candidate corresponds to a 'bump' in the foreground object silhouette, it is retained as a true match.

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications

In recent years, significant progress has been made in the area of neural-network based object detection. The deep neural-network based YOLO detector (Redmon et al., 2016) achieves 63.4 mean Average Precision (mAP) on the PASCAL VOC 2007 dataset, whereas the traditional Deformable Parts Models (DPM) detector (Felzenszwalb et al., 2010) scores only 30.4. Current state-of-the-art object detectors can be categorized as one-stage or two-stage. In two-stage detectors, such as Fast R-CNN (Girshick, 2015) or Faster R-CNN (Ren et al., 2015), the first stage generates a sparse set of candidate object locations (region proposals). The second stage uses a deep convolutional neural network to classify each candidate location as one of the foreground classes or as background. One-stage detectors, such as RetinaNet (Lin et al., 2017), SSD (Liu et al., 2016) or YOLO (Redmon et al., 2016), do not include a separate region-proposal generation step. A single detector based on a deep convolutional neural network is applied instead.

(Speck et al., 2017) uses convolutional neural networks (CNN) to localize the ball under varying environmental conditions. The first part of the network consists of multiple convolution and max-pooling layers which are trained on the standard object classification task. The output of this part is processed by fully connected layers regressing the ball location as a probability distribution along the x- and y-axis. The network is trained on a large dataset of images with annotated ground truth ball positions. It is reported to have 87% detection accuracy on a custom-made dataset. The limitation of this method is that it fails if more than one ball, or an object very similar to the ball, is present in the image. Our method does not have this limitation.

(Reno et al., 2018) presents a deep neural network classifier consisting of convolutional feature extraction layers followed by a fully connected classification layer. It is trained to classify small, rectangular image patches as ball or no-ball. The classifier is used in a sliding window manner to generate a probability map of the ball occurrence. The method has two drawbacks. First, the set of negative training examples (patches without the ball) must be carefully chosen to include sufficiently hard examples. Second, the rectangular patch size must be manually selected to take into account all the possible ways the ball appears in the scene: big or small due to the perspective, sharp or blurred due to its speed. The method is also not optimal from the performance perspective. Each rectangular image patch is separately processed by the neural network using a sliding-window approach. Then, individual results are combined to produce a final ball probability map. Our method, in contrast, requires only a single pass of an entire image through the fully convolutional detection network.

3 DEEP NETWORK-BASED BALL DETECTION METHOD

The method presented in this paper, called DeepBall, is inspired by recent advances in single-pass deep neural network based object detection methods, such as SSD (Liu et al., 2016) or YOLO (Redmon et al., 2016). A typical architecture of a neural network-based one-stage object detector is modified to make it more appropriate for the ball detection task. The modifications aim at increasing the accuracy of locating small objects and reducing the processing time. The network is designed to take a larger visual context into consideration in order to correctly classify fragments of the scene containing objects similar to the ball. This is achieved by using the hypercolumn concept introduced in (Hariharan et al., 2015). To improve performance, we removed components that are typical for a single-stage neural network object detector but unnecessary here. Multiple anchor boxes, with different sizes and aspect ratios, are not needed, as we detect objects from a single class (the ball) with limited shape and size variance. A localization module, predicting the centre and size of object bounding boxes relative to a grid cell, is unnecessary, as the proposed method produces a dense confidence map predicting the ball location on a pixel level.

Figure 2: Part of the exemplary input frame from the test sequence with highlighted ball position (left) and corresponding ball confidence map (right).

The method takes a video frame of any resolution as an input and produces a scaled-down ball confidence map encoding the probability of ball presence at each location. The size of the output ball confidence map is $h' \times w'$, where $h'$ and $w'$ are equal to the original image height and width divided by the scaling factor k (k = 4 in our case). The position in the ball confidence map with coordinates $(x', y')$ corresponds to the position $(\lfloor k(x' - 0.5) \rfloor, \lfloor k(y' - 0.5) \rfloor)$ in the input image. See Fig. 2 for an exemplary input image and the corresponding ball confidence map computed by the trained network. The actual ball position is retrieved from the confidence map using the following approach. First, the location with the highest confidence is found in the ball confidence map. If the confidence is lower than a threshold θ, no balls are detected. Otherwise, the location with the highest confidence is returned. In 'training game mode', where more than one ball can be present in the image, more balls are detected. This is done by zeroing out confidence map values at the previously found maximum and its close neighbourhood (non-maximum suppression) and searching for the next global maximum. The process is repeated until no new maximum with confidence above the threshold θ can be found. Pixel coordinates of the ball $(x, y)$ in the input frame are calculated using the following formula: $(x, y) = (\lfloor k(x' - 0.5) \rfloor, \lfloor k(y' - 0.5) \rfloor)$, where $(x', y')$ are the coordinates in the ball confidence map with the maximum confidence and k = 4 is the scaling factor. The threshold θ is set experimentally, as the value maximizing detection accuracy on the validation set.
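The peak-extraction procedure described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `detect_balls`, the suppression `radius` and the `max_balls` switch are hypothetical, and the confidence map coordinates are assumed to be 1-based so that the formula with the 0.5 offset never yields negative pixel coordinates.

```python
from math import floor


def detect_balls(conf_map, theta, k=4, radius=1, max_balls=None):
    """Extract ball positions from a confidence map given as a list of rows.

    Returns a list of (x, y, confidence) tuples in input-frame pixel coordinates.
    """
    if theta <= 0:
        raise ValueError("theta must be positive, otherwise zeroed cells would be re-detected")
    work = [row[:] for row in conf_map]  # copy, so the caller's map is not modified
    height, width = len(work), len(work[0])
    detections = []
    while max_balls is None or len(detections) < max_balls:
        # Global maximum of the (partially suppressed) confidence map.
        best, row, col = max(
            (work[r][c], r, c) for r in range(height) for c in range(width)
        )
        if best < theta:
            break
        # Assumption: map coordinates (x', y') are 1-based, so x' = col + 1.
        x = floor(k * (col + 1 - 0.5))
        y = floor(k * (row + 1 - 0.5))
        detections.append((x, y, best))
        # Non-maximum suppression: zero out the peak and its close neighbourhood.
        for r in range(max(0, row - radius), min(height, row + radius + 1)):
            for c in range(max(0, col - radius), min(width, col + radius + 1)):
                work[r][c] = 0.0
    return detections


conf_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.2, 0.9, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.8],
]
print(detect_balls(conf_map, theta=0.5))               # multi-ball ('training game') mode
print(detect_balls(conf_map, theta=0.5, max_balls=1))  # regular game: at most one ball
```

In the toy map, the 0.7 response next to the 0.9 peak is suppressed together with it, so only two balls are reported in the multi-ball mode.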

Table 1: Details of the DeepBall network architecture. Output size is specified in the format: (number of channels, height, width). Each convolutional layer is followed by a BatchNorm layer and a ReLU non-linearity (not shown for brevity). All convolutions use 'same' padding and stride one (except for the first one).

Block     Layers                            Output size
Conv1     Conv: 8 7x7 filters, stride 2
          Conv: 8 3x3 filters
          Max pool: 2x2 filter              (8, 268, 480)
Conv2     Conv: 16 3x3 filters
          Conv: 16 3x3 filters
          Max pool: 2x2 filter              (16, 134, 240)
Conv3     Conv: 32 3x3 filters
          Conv: 32 3x3 filters
          Max pool: 2x2 filter              (32, 67, 120)
Conv4     Conv: 56 3x3 filters
          Conv: 2 3x3 filters               (2, 268, 480)
Softmax   Softmax                           (2, 268, 480)

Network Architecture. The diagram in Fig. 3 shows the components of our ball detection network and the size of the outputs of each block. Note that the output size depends on the size of the input image, as the network is fully convolutional and can operate on an image of any size. The input image is processed by three convolutional blocks (Conv1, Conv2 and Conv3) producing convolutional feature maps with decreasing spatial resolution and increasing number of channels. In contrast to a typical convolutional network design, the outputs of all convolutional blocks are concatenated and jointly fed into the final classification layer. Feature maps produced by convolutional blocks Conv2 and Conv3 are first upsampled to the same spatial resolution as the feature map produced by the first convolutional block (Conv1). Then, the feature map produced by Conv1 and the upsampled feature maps from Conv2 and Conv3 are concatenated along the channel dimension to form a hypercolumn. The concatenated feature map is fed to the final fully convolutional classification block (Conv4). The classification block consists of two convolutional layers followed by a softmax layer. It outputs a two-channel ball confidence map. One channel is interpreted as the probability of the location being background and the other as the probability of the ball. For the ball detection task, one output channel, interpreted as the ball probability, would be sufficient. But the proposed design is extensible and can be easily adapted to accommodate detection of additional object categories, such as players. The detailed architecture of each block is given in Table 1.

Concatenation of multiple convolutional feature maps from different levels of the network allows using both low-level features from the first convolutional layers and high-level features computed by higher convolutional layers. Information from the first convolutional layers is necessary for precise spatial localization of the object of interest. Further convolutional layers operate on feature maps with lower spatial resolution, thus they cannot provide an exact spatial location. But they have bigger receptive fields and their output can provide additional context to improve classification accuracy. This design is inspired by the hypercolumn concept (Hariharan et al., 2015), where outputs from intermediate convolutional layers are upsampled and concatenated in order to allow fine-grained object localization.
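The upsample-and-concatenate step can be illustrated on tiny feature maps stored as nested lists in (channels, height, width) order. This is only a sketch: nearest-neighbour upsampling is an assumption (the upsampling method is not specified in the text), all function names are hypothetical, and the upsampling factors 2 and 4 follow from the block output sizes listed in Table 1.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a (channels, height, width) nested list."""
    return [
        [
            [row[x // factor] for x in range(len(row) * factor)]
            for row in channel
            for _ in range(factor)  # repeat every row `factor` times
        ]
        for channel in fmap
    ]


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


def hypercolumn(conv1, conv2, conv3):
    """Concatenate Conv1 with upsampled Conv2 and Conv3 along the channel axis."""
    up2 = upsample_nearest(conv2, 2)  # Conv2 has half the resolution of Conv1
    up3 = upsample_nearest(conv3, 4)  # Conv3 has a quarter of the resolution of Conv1
    if not (shape(up2)[1:] == shape(conv1)[1:] == shape(up3)[1:]):
        raise ValueError("feature maps do not share a spatial resolution")
    return conv1 + up2 + up3  # list concatenation acts on the channel axis


def constant_map(channels, height, width, value):
    return [[[value] * width for _ in range(height)] for _ in range(channels)]


# Toy spatial sizes; the channel counts 8, 16 and 32 are those of Conv1, Conv2 and Conv3.
conv1 = constant_map(8, 4, 8, 1.0)
conv2 = constant_map(16, 2, 4, 2.0)
conv3 = constant_map(32, 1, 2, 3.0)
stacked = hypercolumn(conv1, conv2, conv3)
print(shape(stacked))  # (56, 4, 8)
```

The 56 channels of the result match the input expected by the first layer of the Conv4 block.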

The network architecture described above was chosen experimentally by evaluating a number of alternative designs. See Section 4 for information on the examined variants and their performance.

Loss Function. The loss is a modified version of the loss used in the SSD (Liu et al., 2016) detector. The proposed network does not regress the position and size of the object's bounding box. The ball position is determined by the maxima of the confidence map computed by the network. Hence only the classification component of the original SSD loss function is used. The loss L optimized during the training is the cross-entropy loss over the ball and background class confidences:

$$L(c) = -\sum_{(i,j) \in Pos} \log c^{ball}_{ij} - \sum_{(i,j) \in Neg} \log c^{bg}_{ij} \qquad (1)$$

where $c^{bg}_{ij}$ is the value of the channel of the ball confidence map corresponding to the background probability at the spatial location (i, j) and $c^{ball}_{ij}$ is the value of the channel of the ball confidence map corresponding to the ball probability at the spatial location (i, j). Pos is the set of positive examples, that is the set of spatial locations on the ball confidence map corresponding to the ground truth ball location. Neg is the set of negative examples, that is the set of spatial locations on the ball confidence map corresponding to the ground truth background.

[Figure 3 diagram: Input image (3, 1080, 1920) -> Conv1 (8, 268, 480) -> Conv2 (16, 134, 240) -> Conv3 (32, 67, 120); the Conv2 and Conv3 outputs are upsampled to (16, 268, 480) and (32, 268, 480); Concatenate (56, 268, 480) -> Conv4 -> Softmax -> ball confidence map (2, 268, 480).]

Figure 3: High-level architecture of the DeepBall network. The input image is processed by three convolutional blocks (Conv1, Conv2 and Conv3) producing convolutional feature maps with decreasing spatial resolution and increasing number of channels. Feature maps are upsampled to the same spatial resolution and concatenated along the channel dimension. The concatenated feature map is fed to the final fully convolutional classification block (Conv4) followed by Softmax. The output is a two-channel ball confidence map.

The set of positive examples Pos is constructed as follows. If (x, y) is the true ball position in the image I, then the corresponding confidence map location $(i, j) = (\lfloor x/4 \rfloor, \lfloor y/4 \rfloor)$ and all its nearest neighbours are added to Pos.
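A minimal sketch of this construction is given below. It assumes that "nearest neighbours" means the 8-connected neighbourhood of the centre location and that locations falling outside the confidence map are dropped; both are assumptions, as are the function and parameter names.

```python
from math import floor


def positive_locations(ball_xy, map_size, k=4):
    """Return the set Pos of confidence-map locations (i, j) for one ball.

    ball_xy:  true ball position (x, y) in input-image pixels.
    map_size: (number of valid i values, number of valid j values).
    """
    x, y = ball_xy
    i0, j0 = floor(x / k), floor(y / k)  # centre location, as in the text
    limit_i, limit_j = map_size
    return {
        (i, j)
        for i in range(i0 - 1, i0 + 2)  # assumed 8-connected neighbourhood
        for j in range(j0 - 1, j0 + 2)
        if 0 <= i < limit_i and 0 <= j < limit_j
    }


pos = positive_locations((100, 50), map_size=(480, 268))
print(sorted(pos))
```

A ball in the interior of the frame thus contributes nine positive locations, while a ball in the image corner contributes four.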

Negative examples (locations without the ball) correspond to locations on the confidence map where the ball, according to the ground truth data, is not present. The number of negative examples is orders of magnitude higher than the number of positive examples (locations with the ball), which would create a highly imbalanced training set. To mitigate this, we employ a hard negative mining strategy as in (Liu et al., 2016). We choose a limited number of negative examples with the highest confidence loss, so that the ratio of negative to positive examples is at most 3:1.
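The loss with hard negative mining can be sketched as follows for a single confidence map. The sketch assumes that the background probability equals one minus the ball probability (which holds for a two-channel softmax) and represents the map as a dictionary keyed by location; the function name, the `eps` clamp and the data layout are illustrative choices, not taken from the paper.

```python
from math import log


def deepball_loss(ball_prob, pos, max_neg_ratio=3, eps=1e-12):
    """Cross-entropy over Pos plus the hardest negatives (at most 3 per positive).

    ball_prob: dict mapping a location (i, j) to the predicted ball probability.
    pos:       set of locations labelled as ball; all other locations are background.
    """
    pos_losses = [-log(max(ball_prob[loc], eps)) for loc in pos]
    # Loss of every background location, sorted so the hardest negatives come first.
    neg_losses = sorted(
        (-log(max(1.0 - p, eps)) for loc, p in ball_prob.items() if loc not in pos),
        reverse=True,
    )
    hard_negatives = neg_losses[: max_neg_ratio * len(pos)]
    return sum(pos_losses) + sum(hard_negatives)


ball_prob = {
    (0, 0): 0.9, (0, 1): 0.8,
    (1, 0): 0.5, (1, 1): 0.1,
    (2, 0): 0.2, (2, 1): 0.05,
}
print(deepball_loss(ball_prob, pos={(0, 0)}))
```

With one positive location, only the three background locations with the highest loss (here those with ball probabilities 0.8, 0.5 and 0.2) enter the sum; the two easiest negatives are ignored.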

Figure 4: Exemplary frame from the training dataset.

Training Dataset. The DeepBall network is trained using the publicly available ISSIA-CNR Soccer Dataset (D'Orazio et al., 2009). The dataset contains six synchronized, long shot views of the football pitch acquired by six Full-HD DALSA 25-2M30 cameras. Three cameras are designated for each side of the playing field, recording at 25 fps. Videos were acquired during matches of the Italian 'Serie A'. There are 20,000 manually annotated frames in the dataset, out of which 7,000 contain the ball and 13,000 either do not contain it or show it occluded by players. The ball radius varies from 8 to 16 pixels. Sequences 1, 2, 3 and 4, covering one penalty area and the centre of the football pitch, are used for training. Sequences 5 and 6, covering the side of the football pitch not visible in the training sequences, are set aside for evaluation. Fig. 4 shows exemplary frames from sequences 1 and 3. As the training dataset is relatively small, we use data augmentation to increase the variety of training examples and decrease the risk of overfitting. The following transformations are randomly applied to the training images: random colour jitter (random change in brightness, contrast, saturation or hue), horizontal flip, random cropping and random scaling (with scale factor between 0.5 and 1.1). The ground truth (ball position) is modified accordingly to align with the transformed image.
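Keeping the ground truth aligned with the geometric transformations amounts to applying the same mapping to the ball coordinates. The helpers below are a hypothetical sketch: their names are not from the paper, and they assume integer pixel indices starting at 0 with the origin in the top-left corner. Colour jitter does not move the ball, so it needs no coordinate update.

```python
def hflip_point(x, y, image_width):
    """Ball position after a horizontal flip of an image `image_width` pixels wide."""
    return image_width - 1 - x, y


def scale_point(x, y, scale):
    """Ball position after resizing the image by `scale` (0.5 to 1.1 in the text)."""
    return x * scale, y * scale


def crop_point(x, y, left, top, crop_width, crop_height):
    """Ball position inside a crop, or None if the ball falls outside of it."""
    nx, ny = x - left, y - top
    if 0 <= nx < crop_width and 0 <= ny < crop_height:
        return nx, ny
    return None


print(hflip_point(100, 50, image_width=1920))
print(scale_point(100, 50, scale=0.5))
print(crop_point(100, 50, left=80, top=40, crop_width=64, crop_height=64))
```

A crop that excludes the ball turns the frame into a negative example, which is why `crop_point` returns `None` in that case.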


The network is trained using a standard gradient descent approach with the Adam (Kingma and Ba, 2014) optimizer. The initial learning rate is set to 0.001 and decreased by a factor of 10 after 50 epochs. The training runs for 75 epochs in total. The batch size is set to 16.
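The learning rate schedule can be written as a small step function. The sketch assumes zero-based epoch numbering, so that epochs 0 to 49 use the initial rate and epochs 50 to 74 use the reduced one; the function and constant names are illustrative.

```python
TOTAL_EPOCHS = 75
BATCH_SIZE = 16


def learning_rate(epoch, base_lr=1e-3, drop_epoch=50, factor=10.0):
    """Step schedule: the base rate, divided by `factor` from `drop_epoch` onwards."""
    return base_lr if epoch < drop_epoch else base_lr / factor


schedule = [learning_rate(epoch) for epoch in range(TOTAL_EPOCHS)]
print(schedule[0], schedule[49], schedule[50], schedule[-1])
```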

4 EXPERIMENTAL RESULTS

Evaluation Dataset. Evaluation is performed on two datasets. The first consists of sequences 5 and 6 from the ISSIA-CNR Soccer Dataset. These sequences cover the part of the football pitch not seen in the training sequences (sequences 1, 2, 3 and 4). The ISSIA-CNR dataset is quite demanding because the video has moderate quality and there is noticeable blur. One of the teams wears white jerseys, which makes it difficult to distinguish the ball when it is close to a player.

Evaluation Metrics. We evaluate Average Precision (AP), a standard metric used in the assessment of object detection methods. We follow the Average Precision definition from the Pascal 2007 VOC Challenge (Everingham et al., 2010). The precision/recall curve is computed from a method's ranked output. Recall is defined as the proportion of all positive examples ranked above a given threshold to all positive examples in the ground truth. Precision is the proportion of all positive examples above that threshold to all examples above that threshold. The AP summarizes the shape of the precision/recall curve, and is defined as the mean precision at a set of eleven equally spaced recall levels:

$$AP = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p(r), \qquad (2)$$

where p(r) is the precision at recall level r.
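The eleven-point Average Precision can be computed from a ranked list of detections as sketched below. Following the Pascal VOC 2007 definition, the precision at recall level r is taken here as the maximum precision observed at any recall greater than or equal to r (interpolated precision). The input format, a list of (confidence, is-true-positive) pairs, and the function name are assumptions made for the example.

```python
def average_precision_11pt(scored, num_positives):
    """Eleven-point interpolated AP.

    scored:        list of (confidence, is_true_positive) pairs, one per detection.
    num_positives: number of positive examples in the ground truth.
    """
    ranked = sorted(scored, key=lambda item: item[0], reverse=True)
    precisions, recalls = [], []
    true_positives = 0
    for rank, (_, is_true_positive) in enumerate(ranked, start=1):
        true_positives += 1 if is_true_positive else 0
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_positives)
    total = 0.0
    for step in range(11):
        r = step / 10
        # Interpolated precision: best precision at any recall >= r (0 if none).
        reachable = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(reachable, default=0.0)
    return total / 11


detections = [(0.9, True), (0.8, False), (0.7, True)]
print(average_precision_11pt(detections, num_positives=2))
```

In this example the precision is 1 for the six recall levels up to 0.5 and 2/3 for the remaining five, so the result is (6 + 10/3) / 11.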

The ball detection method usually operates under the additional constraint that no more than one object of interest (the ball) is present in the image. Under this constraint, for each image the detector returns the location of the highest response in the ball confidence map as the ball position, provided that this response is greater than the threshold θ. If no value in the ball confidence map is greater than θ, no ball is detected. In this scenario, an image with the ball is classified correctly if the ball is detected at the correct location. An image without the ball is classified correctly if no ball is detected. Ball detection accuracy is defined as the proportion of correctly classified images to all processed images. θ is chosen experimentally, as the value maximizing the accuracy on the validation set.
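The accuracy metric can be sketched as follows. The text does not state how close a detection must be to the ground truth to count as being at the correct location, so the pixel `tolerance` below is a purely hypothetical parameter, as are the function names and the use of `None` for "no ball".

```python
from math import hypot


def frame_correct(detection, truth, tolerance):
    """True if a single frame is classified correctly."""
    if truth is None:          # frame without the ball
        return detection is None
    if detection is None:      # ball present but missed
        return False
    return hypot(detection[0] - truth[0], detection[1] - truth[1]) <= tolerance


def detection_accuracy(detections, truths, tolerance=5.0):
    """Proportion of correctly classified frames among all processed frames."""
    correct = sum(frame_correct(d, t, tolerance) for d, t in zip(detections, truths))
    return correct / len(truths)


detections = [(100, 50), None, (10, 10), None]
truths = [(102, 51), None, (200, 200), (5, 5)]
print(detection_accuracy(detections, truths))  # 0.5
```

The first two frames are correct (a close detection and a correct rejection), while the third is a detection at the wrong location and the fourth a missed ball.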

Evaluation Results. Evaluation results are summarized in Table 2. The results contain the Average Precision and Accuracy of the evaluated methods, as defined in the previous section. The table also lists the number of trainable parameters in each evaluated model and the frame rate, expressed in frames per second, achievable when detecting the ball in a Full HD (1920x1080 resolution) video. Frame rates given in the table take into account the time needed to feed a frame through the detection network and infer the ball position from the resultant feature map. They do not include the time needed to load the frame from an input file, convert it to a tensor and load it into the GPU. All methods are implemented in PyTorch (Paszke et al., 2017) and run on an nVidia Titan X GPU platform.

Our method yields the best results on the test set (sequences 5 and 6 from the ISSIA-CNR Soccer Dataset). It achieves 0.877 Average Precision and 0.951 ball detection accuracy. For comparison, we evaluate two recent ball detection methods, (Speck et al., 2017) and (Reno et al., 2018), using the same training and test sets and the same data augmentation approach as in our method.

(Speck et al., 2017) uses a neural network with three convolutional layers followed by two two-layer fully connected heads estimating the ball x and y coordinates. For evaluation we implemented the best performing model proposed in the paper: Model 1 soft-sign. The model performs poorly on the test dataset, achieving only 0.220 Average Precision. This can be attributed to the fact that the original model is intended to detect the ball in videos from RoboCup Soccer matches taken from a closer distance. There, the ball image is larger and there are no visible distractors such as advertisement stands around the pitch. The method regresses only one ball position in the input image. If there are multiple objects with ball-like appearance, it likely gets confused and fails to produce a meaningful result. Our method computes a dense confidence map indicating probable ball positions. It is more robust against the presence of objects with an appearance similar to the ball.

(Reno et al., 2018) uses a network consisting of four convolutional layers followed by a fully connected classification layer. This method scores 0.834 Average Precision and 0.917 accuracy. In contrast to the original method, we enhanced the training set construction process. Negative examples (no-ball patches) do not need to be manually selected. They are mined online during the network training, as regions of the image not containing the ball but incorrectly classified with the highest confidence (hard negative mining). Even with this improvement, the method yields worse Average Precision and detection accuracy than our method.

Table 2: Ball detection method evaluation results.

Method                                Average Precision   Accuracy   No. of trainable parameters   FPS
DeepBall                              0.877               0.951      48 658                        190
DeepBall (no data augmentation)       0.792               0.899      48 658                        190
DeepBall (no hypercolumns/context)    0.833               0.911      29 146                        270
(Speck et al., 2017)                  0.220               0.220      332 365 744                   22
(Reno et al., 2018)                   0.834               0.917      313 922                       32

Figure 5: Visualization of incorrect detection results. The top row shows image patches where the ball is not detected (false negatives). The bottom row shows patches with an incorrectly detected ball (false positives).

It must be noted that, while our method outperforms the two other neural network based ball detection methods in terms of Average Precision and detection accuracy, it has a significantly lower number of trainable parameters and a much higher video processing rate (FPS).

Due to the relatively small size of the training set, data augmentation proved to be key to the generalization of the trained network and good performance on the test set. Without data augmentation, Average Precision drops from 0.877 to 0.792.

Implementing the hypercolumn concept by combining convolutional feature maps from different levels of the hierarchy has a positive impact on the method's performance. Using a network with a simpler architecture, which bases classification on the output from the last convolutional layer without combining multiple feature maps, produces worse results. Such an architecture scored only 0.833 Average Precision.

Fig. 5 shows examples of incorrect detections. The two top rows show image patches where our method fails to detect the ball (false negatives). It can be noticed that misclassification is caused by severe occlusion, where only a small part of the ball is visible, or by blending of the ball image with white parts of the players' clothing or white background objects outside the play field, such as stadium advertisements. The bottom row shows examples of patches where a ball is incorrectly detected (false positives). The detector is sometimes confused by players' white socks or by the background clutter outside the play field.

5 CONCLUSIONS

The article describes an efficient and effective deep neural network based ball detection method. The proposed network has a fully convolutional architecture that processes the entire image at once, in a single pass through the network. This is much more computationally efficient than the sliding window approach proposed in (Reno et al., 2018). Additionally, the network can operate on images of any size, which can differ from the size of the images used during training. It outputs a scaled-down ball confidence map indicating the estimated ball location. The method performs very well on the challenging ISSIA-CNR Soccer Dataset (D'Orazio et al., 2009), achieving 0.877 Average Precision and 0.951 accuracy. It outperforms two other recently proposed neural network-based ball detection methods, (Speck et al., 2017) and (Reno et al., 2018), while having a lower number of trainable parameters and a significantly higher frame rate.

In the future we plan to use temporal information to improve the system accuracy. Combining convolutional feature maps from a few subsequent frames gives additional information that may help to discriminate static, ball-like objects (e.g. parts of stadium advertisements or spare balls located outside the play field) from the moving ball.

ACKNOWLEDGEMENTS

This work was co-financed by the European Union within the European Regional Development Fund.


REFERENCES

Ali, M. N., Abdullah-Al-Wadud, M., and Lee, S.-L. (2012).

An efﬁcient algorithm for detection of soccer ball and

players. Proc. 16th ASTL Control and Networking, 16.

D’Orazio, T., Ancona, N., Cicirelli, G., and Nitti, M. (2002). A ball detection algorithm for real soccer image sequences. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 1, pages 210–213. IEEE.

D’Orazio, T., Guaragnella, C., Leo, M., and Distante, A. (2004). A new algorithm for ball recognition using circle hough transform and neural classifier. Pattern Recognition, 37(3):393–408.

D’Orazio, T., Leo, M., Mosca, N., Spagnolo, P., and Mazzeo, P. (2009). A semi-automatic system for ground truth generation of soccer video sequences. In 2009 Advanced Video and Signal Based Surveillance, pages 559–564. IEEE.

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338.

Felzenszwalb, P. F., Girshick, R. B., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645.

Girshick, R. (2015). Fast r-cnn. In Proceedings of the IEEE

international conference on computer vision, pages

1440–1448.

Gong, Y., Sin, L. T., Chuan, C. H., Zhang, H., and Sakauchi, M. (1995). Automatic parsing of tv soccer programs. In Proceedings of the International Conference on Multimedia Computing and Systems, pages 167–174.

Halbinger, J. and Metzler, J. (2015). Video-based soccer ball detection in difficult situations. In Cabri, J., Pezarat Correia, P., and Barreiros, J., editors, Sports Science Research and Technology Support, pages 17–24, Cham. Springer International Publishing.

Hariharan, B., Arbeláez, P., Girshick, R., and Malik, J. (2015). Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456.

Higham, D., Kelley, J., Hudson, C., and Goodwill, S. R. (2016). Finding the optimal background subtraction algorithm for eurohockey 2015 video. Procedia Engineering, 147:637–642. The Engineering of SPORT 11.

Kia, M. (2016). Ball automatic detection and tracking in

long shot views. International Journal of Computer

Science and Network Security (IJCSNS), 16(6):1.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Leo, M., D’Orazio, T., Spagnolo, P., Mazzeo, P. L., and Distante, A. (2008). Sift based ball recognition in soccer images. In International Conference on Image and Signal Processing, pages 263–272. Springer.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer.

Mazzeo, P. L., Leo, M., Spagnolo, P., and Nitti, M. (2012). Soccer ball detection by comparing different feature extraction methodologies. Advances in Artificial Intelligence, 2012:6.

Pallavi, V., Mukherjee, J., Majumdar, A. K., and Sural, S. (2008). Ball detection from broadcast soccer videos using static and dynamic features. Journal of Visual Communication and Image Representation, 19(7):426–436.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E.,

DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and

Lerer, A. (2017). Automatic differentiation in pytorch.

Poppe, C., De Bruyne, S., Verstockt, S., and Van de Walle, R. (2010). Multi-camera analysis of soccer sequences. In Advanced Video and Signal Based Surveillance (AVSS), 2010 Seventh IEEE International Conference on, pages 26–31. IEEE.

Rao, U. and Pati, U. C. (2015). A novel algorithm for detection of soccer ball and player. In Communications and Signal Processing (ICCSP), 2015 International Conference on, pages 0344–0348. IEEE.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster

r-cnn: Towards real-time object detection with region

proposal networks. In Advances in neural information

processing systems, pages 91–99.

Reno, V., Mosca, N., Marani, R., Nitti, M., D’Orazio, T., and Stella, E. (2018). Convolutional neural networks based ball detection in tennis games. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 1758–1764.

Speck, D., Barros, P., Weber, C., and Wermter, S. (2017). Ball localization for robocup soccer using convolutional neural networks. In Behnke, S., Sheh, R., Sarıel, S., and Lee, D. D., editors, RoboCup 2016: Robot World Cup XX, pages 19–30, Cham. Springer International Publishing.

Yuen, H., Princen, J., Illingworth, J., and Kittler, J. (1990). Comparative study of hough transform methods for circle finding. Image and vision computing, 8(1):71–77.

VISAPP 2019 - 14th International Conference on Computer Vision Theory and Applications
