Uncertainty Fusion based Object Recognition and Tracking in Maritime
Scenes using Spatiotemporal Active Contours
Ikhlef Bechar¹, Frederic Bouchara¹, Thibault Lelore¹, Vincente Guis¹ and Michel Grimaldi²
¹LSIS Laboratory, Toulon University, Toulon, France
²PROTEE Laboratory, Toulon University, Toulon, France
Keywords:
Airborne Video System, Maritime Surveillance, Vessel Recognition, Dynamic Background, Chromatic
Uncertainty, Dynamic Texture Uncertainty, MAP Estimation, Energy Minimization, Spatiotemporal Active
Contours.
Abstract:
This article addresses the problem of near real time video analysis of a maritime scene using a (moving) airborne RGB video camera, with the goal of detecting and eventually recognizing a target maritime vessel. This is a very challenging problem, mainly due to the high level of uncertainty of a maritime scene, including a dynamic and noisy background, camera and target motion, and the broad variability of background versus target appearance. We propose an approach which combines several types of spatiotemporal uncertainty in a single probabilistic framework. This yields a likelihood ratio for any possible spatiotemporal configuration of the 2D+T video volume. Using the MAP estimation criterion, the problem can then be recast as an energy minimization problem that we solve efficiently using a spatiotemporal active contour approach. We demonstrate the feasibility of the proposed approach using real maritime videos.
1 INTRODUCTION
Maritime surveillance is an important customs application aiming at efficient monitoring of maritime traffic, and at securing sea coasts and harbors against fraudulent activities such as smuggling, theft, piracy, intrusion, and human trafficking (Bloisi and Iocchi, 2009; Pires et al., 2010). Traditionally, it consists of a workflow of laborious tasks performed by human operators (e.g., coast guards). Recently, semi-automated and automated airborne video-surveillance systems have gained increasing popularity in maritime surveillance. The latter generate huge volumes of video data that thus need to be analyzed automatically and in near real time, with the goal of recognizing maritime targets and ranking their activity (e.g., usual versus fraudulent activity) (see Fig. 1).
In this paper, we describe the video processing system that we have developed for automatic maritime object (e.g., vessel) recognition using an airborne visible light (i.e., RGB) video camera. The hardware architecture chosen for the project allows continuous acquisition of video streams of a maritime scene in the visible light spectrum (400-700 nm), involving a single target at once. Each movie of a target is stored on a local (airborne) computer and analyzed in quasi real time on board for recognition purposes (see (Bechar et al., 2013) for more details).
1.1 Motivations and Related Work
Most existing video based maritime surveillance systems rely on static (i.e., grounded) RGB cameras. From an algorithmic point of view, they have borrowed video-surveillance techniques originally developed for rather "gentle" environments, such as background subtraction (Stauffer and Grimson, 1999), optical flow (Lucas and Kanade, 1981), and statistical learning (Bloisi et al., 2012), and thus they might not be well suited to highly dynamic scenes such as maritime environments. Furthermore, there exist other surveillance systems based on different (though more expensive) data acquisition modalities, such as infrared imagery (Smith and Teal, 1999), multi-sensor information fusion such as radar/AIS (Rhodes et al., 2007), and RGB/thermal infrared imagery (Bechar et al., 2013), in order either to cover larger areas of the sea or to compensate for an RGB system's unreliability.
Figure 1: Four maritime RGB video shots showing: (a) a cargo; (b) a medium-sized vessel; (c) a yacht; (d) a sailboat.

In this work, we consider a non-static RGB video system, and our goal is to devise an automatic technique which combines several types of RGB video information in order to yield, for any spatiotemporal trajectory in a video, a likelihood ratio that it corresponds to a maritime target. Eventually, we would like to exploit such a result in order to detect, as accurately as possible (ideally, delineate), the target in a video with an overwhelming probability.
Let us emphasize that, in contrast to static video systems in which objects are only filmed when they enter the field of view of the camera (in which case a prior model of the background without objects, learned offline and combined with a proper background subtraction technique, may be envisaged for object detection), our airborne (and thus non-static) video system acquires a specific maritime scene of a target directly, based on a radar notification. Therefore, video processing must determine on its own the spatiotemporal location of a target in the whole video.
To this end, we advocate an approach which fuses several types of uncertainty regarding main chromaticity (main color) and dynamic texture (i.e., stochastic deviations around the principal color) in a common probabilistic framework. This yields a probability ratio for every possible spatiotemporal configuration (i.e., both the spatial location and the temporal trajectory) of a target in a video. Then, using the maximum a posteriori (MAP) estimation criterion (in log-likelihood form), we recast the original problem as an energy minimization problem which is solved using an active contour approach (Mumford and Shah, 1989; Vese and Chan, 2002).
2 MATHEMATICAL MODEL
Given video data $X_{1,T} := \{X_t,\ t = 1,\dots,T\}$, where for all $t = 1,\dots,T$, $X_t$ stands for the $t$-th video frame, the goal is to recognize and to track a maritime object throughout the whole video. Let us then denote by $\Omega$ the image domain, which is identical for all video frames. We adopt a probabilistic approach, and our goal is to assign to any spatiotemporal configuration $O_{1,T} \subseteq \Omega \times [1,T]$ a probability that it corresponds to the spatiotemporal trajectory of the target object, given the observed video data. Eventually, we extract the spatiotemporal trajectory $O_{1,T} \subseteq \Omega \times [1,T]$ that is most likely to correspond to the actual target's one.
Such a problem can indeed be formulated using the maximum a posteriori (MAP) principle, as described in the sequel. It should be mentioned, however, that our MAP approach differs from another well-known energy based approach built on naive Bayes classification, in the sense that the latter requires an explicit simultaneous modeling of the foreground probability $p(x_t \mid \mathrm{target})$ and the background probability $p(x_t \mid \mathrm{background})$, whereas our approach focuses solely on the foreground model $p(x_t \mid \mathrm{target})$. This has the advantage of alleviating the burden of estimating a background model (which is not a trivial task when dynamic backgrounds are considered) via a MAP estimation framework (which can also be seen as hypothesis testing against the hypothesis $H_0$ that the data follow the foreground model).
Having said this, by using the classical Bayes rule, one can write
$$P(O_{1,T} \mid X_{1,T}) = \frac{P(O_{1,T})\,P(X_{1,T} \mid O_{1,T})}{P(X_{1,T})} \qquad (1)$$
where $P(O_{1,T} \mid X_{1,T})$ stands for the a posteriori likelihood of the spatiotemporal region $O_{1,T} \subseteq \Omega \times [1,T]$, $P(X_{1,T} \mid O_{1,T})$ stands for the likelihood of the video data given $O_{1,T}$, $P(O_{1,T})$ stands for the a priori probability, and finally $P(X_{1,T})$ stands for the likelihood of the video data. The goal is to find the spatiotemporal configuration which maximizes formula (1). Note that the latter formula is often rewritten in log-likelihood (or energy) form (which turns out to be more intuitive and easier to optimize in practice) as follows
$$-\log\left[P(O_{1,T} \mid X_{1,T})\right] = -\log\left[P(X_{1,T} \mid O_{1,T})\right] - \log\left[P(O_{1,T})\right] + \log\left[P(X_{1,T})\right] \qquad (2)$$
UncertaintyFusionbasedObjectRecognitionandTrackinginMaritimeScenesusingSpatiotemporalActiveContours
683
where now:
- $E(O_{1,T}) := -\log\left[P(O_{1,T} \mid X_{1,T})\right]$ stands for the total energy of the configuration $O_{1,T}$, i.e., the cost of assuming that the object is located at $O_{1,T}$;
- $E_{fid}(O_{1,T}) := -\log\left[P(X_{1,T} \mid O_{1,T})\right]$ stands for the data fidelity term of the total energy of $O_{1,T}$;
- $E_{reg}(O_{1,T}) := -\log\left[P(O_{1,T})\right]$ stands for the spatiotemporal regularization term.
Our main task in the remainder thus consists in modeling the two factors of the a posteriori likelihood $P(O_{1,T} \mid X_{1,T})$, namely the a priori probability $P(O_{1,T})$ and the likelihood $P(X_{1,T} \mid O_{1,T})$, since $P(X_{1,T})$ is a constant that we may ignore in the remainder.
2.1 Modeling $P(O_{1,T})$
One first notes that $P(O_{1,T})$ models purely spatiotemporal geometric information (or the spatiotemporal trajectory) of a target in a video (of course, up to camera motion). For instance, if one knows in advance that the target is a rigid object moving according to a similarity motion (rotation, translation and scale), then it could be very useful to incorporate this information in $P(O_{1,T})$ in a way which penalizes spatiotemporal trajectories $O_{1,T} \subseteq \Omega \times [1,T]$ that do not fit the aforementioned prior geometric knowledge. Nevertheless, this requires a proper parametrization of the total energy (using, for instance, a similarity matrix), which is known to be computationally intractable. Therefore, it is often replaced with a computationally efficient approximate model (e.g., a Markov random field model) which may nonetheless yield quite satisfactory practical performance.
Therefore, in this paper we propose to model $E_{reg}(O_{1,T}) := -\log\left[P(O_{1,T})\right]$ directly as the sum of a multiplicative factor of the total 2D surface of the spatiotemporal volume $O_{1,T}$ and a multiplicative factor of its total 3D volume. This gives rise to a regularization term composed of a traditional total variation (TV) term (Cremers et al., 2011) and a linear term in a continuous (relaxed) form of the total energy (2) (see section 3 for more details). While the former has a nice interpretation as a discontinuity preserving smoothing term of the spatiotemporal trajectory of a target, the latter penalizes the total volume of that trajectory. Let us note that, because of some (mainly computational) considerations that will be motivated in subsection 2.3, we assume that all video frames are aligned with each other prior to the model's optimization. Therefore, such a regularization term operates directly on the aligned video.
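To make this concrete, the following is a minimal numerical sketch (our own illustration, not the paper's implementation) of such a regularization term on a discretized binary indicator of $O_{1,T}$: the surface term is approximated by the total variation of the indicator and the volume term by a voxel count. The function and variable names are ours.

```python
import numpy as np

def regularization_energy(mask, lam, beta):
    """E_reg sketch for a binary spatiotemporal mask O of shape (T, H, W):
    lam * (total 2D surface, approximated by the TV of the indicator)
    + beta * (total 3D volume, i.e. the voxel count)."""
    grads = np.gradient(mask.astype(float))          # finite differences along t, y, x
    surface = np.sqrt(sum(g ** 2 for g in grads)).sum()
    volume = mask.sum()
    return lam * surface + beta * volume

# Example: a small blob translating across a 10-frame, 32x32 video volume.
O = np.zeros((10, 32, 32), dtype=bool)
for t in range(10):
    O[t, 10 + t:16 + t, 12:18] = True
print(regularization_energy(O, lam=1000.0, beta=0.5))
```

A smoother, more compact trajectory lowers the surface term, while the volume term discourages trivially large configurations, which is exactly the trade-off the two multiplicative factors encode.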
2.2 Modeling $P(X_{1,T} \mid O_{1,T})$
In this subsection, we describe our approach for modeling $P(X_{1,T} \mid O_{1,T})$ in equation (1). As mentioned earlier in this section, the latter models the likelihood of the video data $X_{1,T}$ if a target is located at the spatiotemporal location $O_{1,T}$ of the video volume $\Omega \times [1,T]$. It is clear that, in the absence of a closed-form expression of a theoretical data observation model, one may only resort to an approximate formula. This is a difficult task because of the large variability of the appearance (i.e., intensity) models of both the dynamic maritime background (i.e., the sea) and the target (i.e., a vessel).
Indeed, depending on the weather conditions and on a target's speed and size, the background's dynamics may exhibit drastic differences from one video to another (varying from a quasi-static blue sea to a rough, wavy and dark sea background). On the other hand, a vessel's appearance (in terms of main colors and their spatiotemporal variations) cannot be anticipated, because of color variability and the important illumination changes inherent to a maritime scene. Therefore, it makes sense to seek an expert system which can trade off (merge) different types of RGB video uncertainty in order to generally achieve a good detection of a vessel. Having said this, the starting idea for achieving such a goal is that both chromaticity (i.e., the principal pixel color) and spatiotemporal (dynamic) texture generally turn out to be good features of both sea background and target. Before going into more detail about this idea, with the goal of achieving a good approximate model of $P(X_{1,T} \mid O_{1,T})$, let us first consider the following (additive) video data model:
$$X_{1,T} = C_{1,T} + K_{1,T} + n_{1,T} \qquad (3)$$
where $C_{1,T}$ stands for the principal color (or chromaticity) of the video pixels, $K_{1,T}$ stands for a (statistical) spatiotemporal (dynamic) texture (Derpanis and Wildes, 2012) which is superimposed onto the principal color of the video pixels, and $n_{1,T}$ models system plus environment noise. Furthermore, we view $(C_{1,T}, K_{1,T})$ as a couple of random vector variables having some probability distribution $P(C_{1,T}, K_{1,T} \mid \mathrm{target})$ with respect to a target object, and some other probability distribution $P(C_{1,T}, K_{1,T} \mid \mathrm{background})$ with respect to the background. An elaborate method would attempt to estimate simultaneously, in a common (parametric) framework, $(C_{1,T}, K_{1,T})$ and the spatiotemporal location of a target. However, this might be too costly computationally for near real time video processing. Therefore, if we assume that one may determine deterministically, using video preprocessing (e.g., color clustering and texture filtering), the couple $(C_{1,T}, K_{1,T})$ from $X_{1,T}$ according to decomposition (3), then one may be able to exploit both principal color and spatiotemporal texture information in order to characterize the target's likelihood in a video. This can be seen (after ignoring the noise component $n_{1,T}$ in formula (3)) by replacing $P(X_{1,T} \mid O_{1,T})$ with
$$P(C_{1,T}, K_{1,T} \mid O_{1,T}) := f(C_{1,T}, K_{1,T}) \qquad (4)$$
where $f(\cdot)$ is some function which models an expert's knowledge regarding both chromaticity and (dynamic) texture at the target's spatiotemporal locations.
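As an illustration of how the couple $(C_{1,T}, K_{1,T})$ might be obtained deterministically by preprocessing, the sketch below clusters the RGB values of each frame and takes the residual as the texture component. This is only one possible realization, assuming a per-frame k-means color clustering; the paper requires only some form of color clustering and texture filtering, and the names used here are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def decompose_frame(frame, n_colors=4):
    """Preprocessing sketch for decomposition (3): estimate the
    principal-color image C by clustering RGB values, and take the
    residual as the (dynamic) texture K (the noise n is not separated
    from K here)."""
    h, w, _ = frame.shape
    pixels = frame.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)
    C = km.cluster_centers_[km.labels_].reshape(h, w, 3)  # principal colors
    K = frame.astype(float) - C                           # texture residual
    return C, K
```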
In this work, we model such an expert function $f(\cdot)$ based on the following remarks:
- The brighter (whiter) a pixel is, the less likely it is to belong to the foreground, as brightness is generally (but not always) characteristic of the foam;
- The more blue a pixel is, the more likely it is to belong to the background (i.e., to the sea);
- The more scattered in the image plane a pixel's color is, the less likely it is to belong to a target, as the sea generally occupies a spatially scattered region of a video frame;
- The target's dynamic texture is normally zero, up to imaging artifacts (such as illumination changes).
Therefore, we propose to take $f(c,k) := g(W(c), B(c), S(c), k)$, where $g$ stands for some positive real valued function trading off $W(c)$, $B(c)$, $S(c)$ and $k$, which stand respectively for measures of how bright (white) a color $c$ is ($W(c)$), how blue it is ($B(c)$), how scattered it is in the image plane ($S(c)$), and the (dynamic) texture ($k$). As aforementioned, we have assumed that one may extract each of the components $c$ and $k$ at any spatiotemporal video location using video preprocessing. Obviously, there is no single method for choosing the functions $W(c)$, $B(c)$ and $S(c)$; we therefore propose to model them based on RGB video information as follows. For computational convenience, we first assume statistical independence between pixels, and we propose to take
$$W(c) \propto \exp\left(-\frac{\|c - c_w\|^2}{2\sigma_w^2}\right)\mathbf{1}(W)$$
where $c_w$ stands for the mean intensity of the brightest color class in a video frame (if any, hence the use of the indicator $\mathbf{1}(W)$), and $\sigma_w$ stands for its standard deviation.
$$B(c) \propto \exp\left(-\frac{\|c - c_b\|^2}{2\sigma_b^2}\right)\mathbf{1}(B)$$
where $c_b$ stands for the mean intensity of the blue color class in a video frame (if any) and $\sigma_b^2$ for its variance. Such brightest and blue color classes are identified in each video frame using color recognition techniques based respectively on the sum of the three RGB components, $c_r + c_g + c_b$, and (roughly) on the ratio $\frac{c_b + c_g}{2c_r}$, together with a clustering technique (such as the Otsu algorithm). $S(c)$ is modeled as some decreasing function of the standard deviation $s(c)$ of the position in a video frame of color $c$. We take
$$S(c) \propto \exp\left(-\frac{s^2(c)}{2 s^2(c_S)}\right)\mathbf{1}(S)$$
where $c_S$ stands for the principal color with the biggest positional standard deviation. Finally, we estimate the dynamic texture $k$ at each video pixel by analyzing the video signals in its neighborhood of size $N$, using the white noise assumption of the dynamic texture at the target. We take $k$ as some increasing function of the minimum absolute value of the correlation $\rho$ between any pair of neighboring signals (assuming prior alignment of the video frames). Indeed, under the white noise assumption, $\rho$ behaves like the correlation of normalized random variables with mean 0, and therefore one expects $\rho$ to be generally much smaller for target pixels than for dynamic sea pixels (whose dynamic texture does not satisfy the white noise assumption). Thus we take
$$k \propto \exp\left(-\frac{\rho^2}{2}\right)$$
Finally, we take
$$f(z) \propto 1 - \alpha\, W(c)\, B(c)\, S(c)\, k \qquad (5)$$
where $\alpha$ stands for some positive constant in $]0,1]$. It should be mentioned, however, that we do not claim that such a choice of $f(z)$ is the most pertinent one; nevertheless, this simplification may be seen as the price to pay for speeding up computations in order to achieve quasi real time system performance.
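Putting the pieces together, the following is a hedged per-pixel sketch of the fused score (5), assuming the class statistics $(c_w, \sigma_w)$ and $(c_b, \sigma_b)$, the scatter statistics and the neighborhood correlation $\rho$ have already been extracted by the preprocessing described above; the indicator functions $\mathbf{1}(W)$ and $\mathbf{1}(B)$ are folded into boolean flags, and all names are ours.

```python
import numpy as np

def fused_score(c, c_w, sig_w, has_white, c_b, sig_b, has_blue,
                s, s_max, rho, alpha=0.9):
    """Sketch of the expert fusion (5): f = 1 - alpha * W * B * S * k.
    c: RGB color of the pixel; s: positional std of its color class;
    rho: min abs correlation with the neighboring temporal signals."""
    W = np.exp(-np.sum((c - c_w) ** 2) / (2 * sig_w ** 2)) if has_white else 0.0
    B = np.exp(-np.sum((c - c_b) ** 2) / (2 * sig_b ** 2)) if has_blue else 0.0
    S = np.exp(-s ** 2 / (2 * s_max ** 2))
    k = np.exp(-rho ** 2 / 2)
    return 1.0 - alpha * W * B * S * k

# Example: a dark pixel far from both the white and the blue classes,
# with weakly correlated neighborhood signals (target-like).
f = fused_score(np.array([40., 35., 30.]),
                np.array([240., 240., 240.]), 20.0, True,
                np.array([30., 80., 160.]), 25.0, True,
                s=10.0, s_max=50.0, rho=0.05)
print(f)  # close to 1, i.e. low energy F = -log f inside the target
```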
2.3 Video Frame Alignment
As mentioned above, in order to be able to estimate the dynamic texture of a spatiotemporal configuration and to reduce the problem's combinatorial complexity, prior to video processing we align all video frames in such a way that the imaged points of a maritime scene correspond to the same pixel location across the video. This is achieved using pixel tracking
UncertaintyFusionbasedObjectRecognitionandTrackinginMaritimeScenesusingSpatiotemporalActiveContours
685
via optical flow (Lucas and Kanade, 1981). However, it is important to account for illumination changes between video frames in the optical flow model. Therefore, we first transform each RGB video frame $X_t$ linearly into a positive gray image $I_t$ in a way which maximizes the frame's contrast. Then, assuming a linear illumination model, we compute the optical flow $(u,v)$ on the current log-transformed gray video frame $J_t := \log[I_t]$ using the following optical flow equation:
$$\frac{\partial J_t}{\partial x}u + \frac{\partial J_t}{\partial y}v + \frac{\partial J_t}{\partial t} + \nu = n \qquad (6)$$
where $n$ stands for white noise and $\nu$ models the illumination variation between consecutive frames. Such a model is estimated as in (Lucas and Kanade, 1981), using a least squares criterion in a pixel neighborhood of size $N$. Note that bicubic interpolation and image subsampling to a lower resolution video frame are used prior to optical flow estimation. This reduces the overall video noise and flattens highly textured sea regions (such as foam), in order to yield a good estimate of the spatiotemporal video gradient, and thereby a reliable optical flow. The latter is then recomputed for the original video and later used to align the video frames with each other.
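For concreteness, here is a minimal dense sketch of the illumination-augmented Lucas-Kanade step: per pixel, it solves the least squares system implied by equation (6) over an N x N window, with the extra unknown ν absorbing the frame-to-frame illumination change. This is our own naive (unoptimized) rendition; the coarse-to-fine resampling and the final alignment step are omitted.

```python
import numpy as np

def lk_flow_illum(Jx, Jy, Jt, N=7):
    """Illumination-robust Lucas-Kanade sketch: at each pixel, solve
    min || Jx*u + Jy*v + nu + Jt ||^2 over an N x N window for
    (u, v, nu), where nu models the illumination variation in (6)."""
    H, W = Jt.shape
    r = N // 2
    u = np.zeros((H, W))
    v = np.zeros((H, W))
    for y in range(r, H - r):
        for x in range(r, W - r):
            win = (slice(y - r, y + r + 1), slice(x - r, x + r + 1))
            # Columns: spatial gradients and a constant column for nu.
            A = np.stack([Jx[win].ravel(), Jy[win].ravel(),
                          np.ones(N * N)], axis=1)
            b = -Jt[win].ravel()
            sol, *_ = np.linalg.lstsq(A, b, rcond=None)
            u[y, x], v[y, x] = sol[0], sol[1]
    return u, v
```

Without the constant column, a global brightness change between frames would be wrongly explained as apparent motion; the ν unknown soaks it up instead.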
3 ACTIVE CONTOUR
IMPLEMENTATION
The goal is to minimize the following energy model with respect to all possible spatiotemporal configurations $O$:
$$\min_O \left\{ \int_O F(z) + \lambda\,\mathrm{Perim}(O) + \beta A(O) \right\} \qquad (7)$$
where $F(z) := -\log[f(z)]$ with $f(z)$ given by formula (5), $\lambda$ and $\beta$ stand for some positive constants, and $\mathrm{Perim}(O)$ and $A(O)$ stand respectively for the total surface and the total volume of the spatiotemporal configuration $O$. However, while the model constant $\lambda$ (which enforces the target's spatiotemporal smoothness) is generally easy to tune, as a wide range of values is suitable for the task, the choice of $\beta$ is not easy, though crucial for obtaining a reasonable segmentation of a target. Furthermore, we would ideally like the system to choose the best value of $\beta$ adaptively, provided that we can inform it accordingly about what good values of $\beta$ correspond to. Nevertheless, since we are dealing with gray videos $F(z)$ (minus log of pixel-wise probabilities), we may detect a target as the most contrasted spatiotemporal configuration in the video, in a sense which we specify hereafter. We show indeed in this paper that model (7) is equivalent to a traditional Mumford-Shah model, namely the Chan-Vese model (11) (Vese and Chan, 2002). The proof of our claim, along with the transformation of model (7) into an equivalent Chan-Vese model, is detailed in the appendix. Therefore, the energy that we minimize is written as:
mize is written as:
(
Z
O
( f (z) c
1
)
2
+
Z
O
( f (z) c
2
)
2
+λPerim
O
)
(8)
where c
1
and c
2
correspond to some constants that are
estimated adaptively using video data. Such a model 8
is firstly relaxed (M. Nikolova and Chan, 2006; Pock
et al., 2008; Cremers et al., 2011; Chambolle et al.,
2010) using variables u(z;t) [0,1] as
(
Z
( f (z) c
1
)
2
( f (z) c
2
)
2
)u(z)+λTV (u)
)
(9)
where λTV (u) stands for the total variation of u(z;t).
For known c
1
and c
2
, such a model 9 is convex and
thus can be solved for exactly using the following it-
erative scheme, starting from an initial solution u
0
:
u
j+1
= u
j
λdiv
u
j
|u
j
|
where stands for the spa-
tiotemporal gradient sign and div stands for the spa-
tiotemporal divergence operator. The constants c
1
and
c
2
are updated simultaneously with u
j
as the mean in-
tensities inside and outside current target respectively.
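A compact numerical sketch of this relaxed minimization is given below. It is our own rendition: besides the curvature step stated above, it also applies the data term $r(z) = (f(z)-c_1)^2 - (f(z)-c_2)^2$ of (9), an explicit step size $\tau$, and the clipping of $u$ to $[0,1]$, all of which any complete descent scheme for (9) needs.

```python
import numpy as np

def relaxed_chan_vese(f, lam=0.1, tau=0.2, n_iter=200):
    """Projected gradient descent sketch for (9) on a spatiotemporal
    volume f of shape (T, H, W); c1, c2 are re-estimated at each step
    as the means inside/outside the current target."""
    u = (f - f.min()) / (np.ptp(f) + 1e-8)   # data-driven initialization
    eps = 1e-8
    for _ in range(n_iter):
        inside = u > 0.5
        c1 = f[inside].mean() if inside.any() else f.mean()
        c2 = f[~inside].mean() if (~inside).any() else f.mean()
        r = (f - c1) ** 2 - (f - c2) ** 2            # data term of (9)
        g = np.gradient(u)
        norm = np.sqrt(sum(gi ** 2 for gi in g)) + eps
        # div(grad u / |grad u|): curvature of the level sets of u
        div = sum(np.gradient(gi / norm)[i] for i, gi in enumerate(g))
        u = np.clip(u + tau * (lam * div - r), 0.0, 1.0)
    return u > 0.5

# Example: bright moving square on a dark noisy background.
rng = np.random.default_rng(1)
f = rng.normal(0.2, 0.05, (8, 40, 40))
for t in range(8):
    f[t, 12 + t:20 + t, 15:23] += 0.6
seg = relaxed_chan_vese(f)
```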
4 RESULTS
The current version of the method is implemented in C++ and runs in quasi real time on a standard 1.7 GHz PC, for videos at 30 frames per second with a frame size of about 1 megabyte. Fig. 2 and Fig. 3 show results of the proposed method on two real maritime videos, with $\lambda := 1000$.

We have validated the proposed method using a dozen realistic video sequences under different weather conditions and target appearances. The method has been shown to generally perform very well at detecting a vessel target, except in some situations where the target is mainly characterized by a whitish color and the airborne camera lies so far from the target that the sea foam does not exhibit enough texture to distinguish it from the target.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
686
Figure 2: Uncertainty based recognition and tracking of a maritime target (a cargo) using a spatiotemporal active contour approach. Upper row: a posteriori log-likelihood of the target (normalized between 0 and 255) at video frames 1, 100, 200, 300, and 400 (resp.). Middle row: the converged spatiotemporal active contour. Lower row: target recognition.
Figure 3: Uncertainty based recognition and tracking of a maritime target (a yacht) using a spatiotemporal active contour approach. Upper row: a posteriori log-likelihood of the target (normalized between 0 and 255) at video frames 1, 100, 200, 300, and 400 (resp.). Middle row: the converged spatiotemporal active contour. Lower row: target recognition.
5 CONCLUSIONS
We have described a novel method for detecting vessels against a dynamic maritime background using an uncertainty fusion based approach, and we have solved the problem efficiently using a spatiotemporal active contour approach. Our tests on a dozen real video sequences of several minutes' duration have shown that our method outperforms some state of the art object tracking techniques such as meanshift/camshift. In the current version of our system, all of the method's parameters have been hard-coded based on experience. Nevertheless, it makes sense to devise an automatic parameter selection technique in a future version of the system. Also, as future work, we plan to consider other types of video information, such as more sophisticated texture models and geometric prior knowledge (such as the object's rigidity, height and shape), in order to yield as reliable a vessel detection algorithm as possible.
ACKNOWLEDGEMENTS
Thanks to the French customs for funding.
REFERENCES
Rhodes, B. J., et al. (2007). SeeCoast: persistent surveillance and automated scene understanding for ports and coastal areas. In Proc. SPIE, vol. 6578, p. 65781M.
Bechar, I., Lelore, T., Bouchara, F., Guis, V., and Grimaldi, M. (2013). Toward an airborne system for near real-time maritime video-surveillance based on synchronous visible light and thermal infrared video information fusion: an active contour approach. In Proc. OCOSS'2013, Nice, France.
Bloisi, D. and Iocchi, L. (2009). ARGOS - a video surveillance system for boat traffic monitoring in Venice. IJPRAI, 23(7):1477-1502.
Bloisi, D., Iocchi, L., Fiorini, M., and Graziano, G. (2012). Camera based recognition for marine awareness, Great Lakes and St. Lawrence Seaway border regions. In Int. Conf. on Information Fusion, pp. 1982-1987.
Chambolle, A., Caselles, V., Novaga, M., Cremers, D., and Pock, T. (2010). An introduction to total variation for image analysis. In Theoretical Foundations and Numerical Methods for Sparse Recovery. De Gruyter.
Cremers, D., Pock, T., Kolev, K., and Chambolle, A. (2011). Convex relaxation techniques for segmentation, stereo and multiview reconstruction. In Markov Random Fields for Vision and Image Processing. MIT Press.
Derpanis, K. and Wildes, R. (2012). Spacetime texture representation and recognition based on a spatiotemporal orientation analysis. PAMI, 34(6):1193-1205.
Lucas, B. and Kanade, T. (1981). An iterative image registration technique with an application to stereo vision. In Proceedings of the International Joint Conference on Artificial Intelligence, pp. 674-679.
Nikolova, M., Esedoglu, S., and Chan, T. (2006). Algorithms for finding global minimizers of image segmentation and denoising models. SIAM Journal of Applied Mathematics, 66:1632-1648.
Mumford, D. and Shah, J. (1989). Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math., 42:577-685.
Pires, N., Guinet, J., and Dusch, E. (2010). ASV: an innovative automatic system for maritime surveillance. Navigation, 58(232):1-20.
Pock, T., Schoenemann, T., Graber, G., Bischof, H., and Cremers, D. (2008). A convex formulation of continuous multi-label problems. In ECCV'08.
Smith, A. and Teal, M. (1999). Identification and tracking of marine objects in near-infrared image sequences for collision avoidance. In 7th Int. Conf. on Image Processing and its Applications, pp. 250-254.
Stauffer, C. and Grimson, W. E. L. (1999). Adaptive background mixture models for real-time tracking. In CVPR'99, pp. 2246-2252.
Vese, L. and Chan, T. (2002). A new multiphase level set framework for image segmentation via the Mumford and Shah model. IJCV, 50:271-293.
APPENDIX
Let us prove the claim made in section 3. For simplicity's sake, and without loss of generality, we consider the following MAP based image segmentation problem:
$$\min_O \left\{ \lambda\,\mathrm{Perim}(O) + \beta A(O) + \int_O g(z) \right\} \qquad (10)$$
where $g(z)$ is a positive function, as it corresponds to minus the log of a probability. Now, let us consider the Chan and Vese image segmentation model
$$\lambda\,\mathrm{Perim}(O) + \int_O \frac{(f(z) - c_1)^2}{\sigma_1^2} + \int_{\Omega\setminus O} \frac{(f(z) - c_2)^2}{\sigma_2^2} \qquad (11)$$
One may rewrite model (11) equivalently (after throwing away the constant term $\int_\Omega \frac{(f(z)-c_2)^2}{\sigma_2^2}$) as follows:
$$\lambda\,\mathrm{Perim}(O) + \int_O \left[\frac{(f(z) - c_1)^2}{\sigma_1^2} - \frac{(f(z) - c_2)^2}{\sigma_2^2}\right]$$
$$= \lambda\,\mathrm{Perim}(O) + \int_O \left[f^2(z)\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) - 2f(z)\left(\frac{c_1}{\sigma_1^2} - \frac{c_2}{\sigma_2^2}\right) + K\right] + \left(\frac{c_1^2}{\sigma_1^2} - \frac{c_2^2}{\sigma_2^2} - K\right) A(O)$$
where $K$ is the smallest positive constant (perhaps 0) which makes the integrand
$$f^2(z)\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) - 2f(z)\left(\frac{c_1}{\sigma_1^2} - \frac{c_2}{\sigma_2^2}\right) + K$$
always positive, whatever $z$.

Now, putting
$$f^2(z)\left(\frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}\right) - 2f(z)\left(\frac{c_1}{\sigma_1^2} - \frac{c_2}{\sigma_2^2}\right) + K = g(z)$$
$$\frac{c_1^2}{\sigma_1^2} - \frac{c_2^2}{\sigma_2^2} - K = \beta$$
which is always possible via an appropriate tuning of $c_1$ and $c_2$ such that $\frac{c_1^2}{\sigma_1^2} - \frac{c_2^2}{\sigma_2^2} - K$ is positive and equal to $\beta$, and then solving the second degree equation with respect to $f(z)$ in order to express $f(z)$ as a function of $g(z)$, completes the proof.
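As a quick sanity check of this algebra (our own verification script, not part of the paper), one can confirm numerically that the Chan-Vese integrand equals $g(z) + \beta$ pointwise for any choice of the constants:

```python
import numpy as np

# Verify pointwise: (f-c1)^2/s1^2 - (f-c2)^2/s2^2 == g(f) + beta, with
# g(f) = f^2 (1/s1^2 - 1/s2^2) - 2 f (c1/s1^2 - c2/s2^2) + K and
# beta = c1^2/s1^2 - c2^2/s2^2 - K (constants chosen arbitrarily here).
rng = np.random.default_rng(0)
f = rng.uniform(0.0, 1.0, 1000)
c1, c2, s1, s2, K = 0.8, 0.2, 0.5, 0.7, 1.0
cv = (f - c1) ** 2 / s1 ** 2 - (f - c2) ** 2 / s2 ** 2
g = f ** 2 * (1 / s1 ** 2 - 1 / s2 ** 2) - 2 * f * (c1 / s1 ** 2 - c2 / s2 ** 2) + K
beta = c1 ** 2 / s1 ** 2 - c2 ** 2 / s2 ** 2 - K
assert np.allclose(cv, g + beta)  # integrand identity behind (10)-(11)
print("identity holds")
```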
UncertaintyFusionbasedObjectRecognitionandTrackinginMaritimeScenesusingSpatiotemporalActiveContours
689