ing (e.g. (Cannavò et al., 2006)), object detection, tracking and recognition (e.g. (Spampinato, 2009) and (Spampinato et al., 2010)).
One aspect to deal with when analyzing marine ecosystems is fish tracking, whose importance goes beyond simple population counting. In fact, behavior understanding and the analysis of fish interactions, which are interesting perspectives for marine biologists to study, rely strictly on trajectories extracted using tracking approaches. However, tracking presents a few major difficulties, which become greater in underwater environments, where objects have multiple degrees of freedom or where the scene conditions cannot be controlled.
Many different approaches to the visual tracking problem have been studied in the literature, such as Kalman filter-based tracking (Doucet et al., 2001), particle filter tracking (Gordon et al., 1979), point feature tracking, and mean-shift tracking (Comaniciu and Meer, 2002). However, to the best of our knowledge, only a variation of mean-shift, CAMSHIFT (Bradski, 1998), has been applied to underwater environments (Spampinato et al., 2008), achieving an average tracking performance (estimated as correct counting rate) of about 85%. CAMSHIFT, however, shows a major drawback when dealing with fish-fish and fish-background occlusions, mainly because it exploits only color information. In this paper we propose a tracking algorithm where fish are modeled as covariance matrices (Tuzel et al., 2006) of features built from each pixel belonging to the fish's region. This representation embodies both the spatial and the statistical properties of non-rigid objects, unlike histogram representations (which disregard the structural arrangement of pixels) and appearance models (which ignore statistical properties). As shown in the experimental results section, the performance of the proposed approach is very encouraging and better than that achieved with CAMSHIFT, indicating that our covariance-based approach also performs very well under extreme conditions.
The remainder of the paper is organized as follows: Section 2 describes the details of the proposed covariance-based fish tracking algorithm; Section 3 presents the tracking results achieved against hand-labeled ground truth data. Finally, Section 4 points out the concluding remarks.
2 COVARIANCE BASED TRACKING ALGORITHM
In the following description, we use “tracked object” to indicate an entity that represents a unique fish and contains information about the fish's appearance history and its current covariance model, and “detected object” to indicate a moving object that has not yet been associated with any tracked object. For each detected object, the corresponding covariance matrix is computed by building a feature vector for each pixel, made up of the pixel coordinates, the RGB and hue values, and the mean and standard deviation of the histogram of a 5×5 window with the target pixel as centre. The covariance matrix, which models the object, is then computed from these feature vectors and associated with the detected object.
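As an illustration, a minimal sketch of this descriptor in Python, assuming NumPy and OpenCV (all function and variable names are ours, not prescribed by the paper):

```python
import numpy as np
import cv2

def covariance_model(bgr_patch):
    """Sketch of the covariance descriptor described above: for every pixel
    of a detected object's region, build the feature vector
    [x, y, R, G, B, hue, mean5x5, std5x5] and return the 8x8 covariance
    matrix of these vectors."""
    h, w = bgr_patch.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    hue = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)[:, :, 0].astype(np.float64)
    gray = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2GRAY).astype(np.float64)

    # Mean and standard deviation over a 5x5 window centred on each pixel
    # (the mean/std of the window's intensity distribution, i.e. of its
    # histogram, as described in the text).
    mean5 = cv2.blur(gray, (5, 5))
    sq_mean5 = cv2.blur(gray * gray, (5, 5))
    std5 = np.sqrt(np.maximum(sq_mean5 - mean5 ** 2, 0.0))

    b, g, r = [bgr_patch[:, :, c].astype(np.float64) for c in range(3)]
    features = np.stack([xs, ys, r, g, b, hue, mean5, std5], axis=-1).reshape(-1, 8)
    # Covariance of the per-pixel feature vectors (rows are observations).
    return np.cov(features, rowvar=False)
```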
Afterwards, this matrix is used to compare the object with the currently tracked objects, in order to decide which one it resembles the most. The main issue in comparing covariance matrices is that they do not lie in a Euclidean space; for example, the covariance space is not closed under multiplication by negative scalars. For this reason, as suggested in (Porikli et al., 2005), we use Förstner's distance (Förstner and Moonen, 1999), which is based on generalized eigenvalues, to compute the similarity between two covariance matrices:
\[
\rho(C_i, C_j) = \sqrt{\sum_{k=1}^{d} \ln^2 \lambda_k(C_i, C_j)} \tag{1}
\]
where $d$ is the order of the matrices and $\lambda_k(C_i, C_j)$ are the generalized eigenvalues of the covariance matrices $C_i$ and $C_j$, computed from
\[
\lambda_k C_i x_k - C_j x_k = 0, \qquad k = 1, \ldots, d \tag{2}
\]
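A minimal sketch of Eq. (1) in Python, assuming SciPy. Note that `scipy.linalg.eigvalsh(Ci, Cj)` solves $C_i x = \lambda C_j x$, whose eigenvalues are the reciprocals of those defined by Eq. (2); this leaves the distance unchanged, since $\ln^2(1/\lambda) = \ln^2 \lambda$, and also makes it symmetric in $C_i$ and $C_j$:

```python
import numpy as np
from scipy.linalg import eigvalsh

def forstner_distance(Ci, Cj):
    """Forstner's generalized-eigenvalue distance between two covariance
    matrices, Eq. (1). Assumes both matrices are symmetric positive
    definite (in practice a small ridge can be added to guarantee this)."""
    lams = eigvalsh(Ci, Cj)  # generalized eigenvalues of the (Ci, Cj) pencil
    return np.sqrt(np.sum(np.log(lams) ** 2))
```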
The model of each tracked object is then computed as a mean, based on Lie algebra (Porikli et al., 2005), of the covariance matrices corresponding to the most recent detections of that object.
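As a sketch, one common way to realize such a mean is the log-Euclidean average, i.e., averaging the matrix logarithms and mapping back; the paper's exact update rule, following (Porikli et al., 2005), may differ:

```python
import numpy as np
from scipy.linalg import expm, logm

def covariance_mean(matrices):
    """Log-Euclidean average of SPD matrices: map each matrix to the
    tangent space via the matrix logarithm, average there, and map back
    with the matrix exponential. A common stand-in for the Lie-algebra
    mean; not necessarily the paper's exact update."""
    logs = [logm(C) for C in matrices]
    mean_log = sum(logs) / len(logs)
    # logm may introduce tiny imaginary parts numerically; drop them.
    return np.real(expm(mean_log))
```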
To deal with occlusions, the algorithm handles the temporary loss of tracked objects by keeping, for each of them, a counter (TTL) of how many frames it has been missing; when this counter reaches a user-defined value (for 5-fps videos, the best value, obtained empirically, was 6), the object is considered lost and discarded.
To decide whether a detected object is a feasible candidate as the new appearance of a currently tracked object, we check whether the detected object's region overlaps, at least partially, with the tracked object's search area, which by default is equal to the bounding box of that object's latest appearance.
To manage the temporary loss of an object, and the fact that while the object has not been detected it might have moved away from its previous location, we modify the search area, expanding it proportionally to the number of frames for which the object has been missing.
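A minimal sketch of this gating logic; the per-missed-frame expansion factor `GROWTH` is a hypothetical placeholder, since the paper does not state the proportionality constant:

```python
from dataclasses import dataclass

MAX_TTL = 6    # empirically best for 5-fps videos (see text)
GROWTH = 0.2   # hypothetical per-missed-frame expansion factor

@dataclass
class Track:
    bbox: tuple   # (x, y, w, h) of the latest appearance
    ttl: int = 0  # frames since the last detection

def search_area(track):
    """Bounding box of the latest appearance, expanded proportionally to
    the number of frames the object has been missing."""
    x, y, w, h = track.bbox
    pad_w, pad_h = GROWTH * track.ttl * w, GROWTH * track.ttl * h
    return (x - pad_w, y - pad_h, w + 2 * pad_w, h + 2 * pad_h)

def overlaps(a, b):
    """True if boxes a and b intersect at least partially."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

# A detection is a feasible candidate for a track when its region overlaps
# the track's search area; a track whose ttl exceeds MAX_TTL is discarded.
```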