variant) detectors (and descriptors) as in (Lowe, 2004; Tuytelaars, 2000). We define square blocks $B_1$ on image $I_i$ around the detected feature points. These blocks are matched with blocks $B_2$ from image $I_{i+1}$ using the weighted zero-mean normalized cross-correlation (CC):
$$CC = \frac{\sum_i w_i \,(B_{1,i} - \bar{B}_1)(B_{2,i} - \bar{B}_2)}{\sqrt{\sum_i w_i \,(B_{1,i} - \bar{B}_1)^2 \,\sum_i w_i \,(B_{2,i} - \bar{B}_2)^2}} \quad (3)$$
where $\bar{B}_1$ and $\bar{B}_2$ denote the mean values of blocks $B_1$ and $B_2$, respectively. This correlation measure is illumination invariant, i.e. blocks with a biased illumination change yield the same correlation as blocks without such a bias. The weights $w_i$ are chosen to favour the central part of the window (for example with a Gaussian function).
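As an illustration, Eq. (3) with Gaussian weights can be sketched as follows; the block size and the Gaussian $\sigma$ are arbitrary choices for illustration, not values from the text:

```python
import numpy as np

def weighted_zncc(b1, b2, sigma=2.0):
    """Weighted zero-mean normalized cross-correlation (Eq. 3) of two
    equally sized square blocks, with Gaussian weights favouring the
    centre of the window."""
    n = b1.shape[0]
    y, x = np.mgrid[0:n, 0:n] - (n - 1) / 2.0
    w = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))  # weights w_i
    w /= w.sum()

    d1 = b1 - np.average(b1, weights=w)  # B_{1,i} - mean of B_1
    d2 = b2 - np.average(b2, weights=w)  # B_{2,i} - mean of B_2
    num = np.sum(w * d1 * d2)
    den = np.sqrt(np.sum(w * d1**2) * np.sum(w * d2**2))
    return num / den if den > 0 else 0.0
```

Note that a block with a gain and offset change in its intensities correlates perfectly with the original, which illustrates the illumination invariance claimed above.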
Higher (subpixel) accuracy is obtained by fitting the neighbourhood of the highest correlation coefficient to a second-degree polynomial model.
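The refinement step can be illustrated in one dimension: fitting a parabola (a second-degree polynomial) through the peak correlation value and its two neighbours gives a closed-form subpixel offset. This is a standard formula, assumed here as one possible instance of the fitting described above:

```python
def subpixel_peak_1d(c_m1, c_0, c_p1):
    """Refine a correlation maximum along one axis by fitting a
    parabola through the peak sample c_0 and its neighbours at
    offsets -1 and +1; returns the subpixel offset of the vertex."""
    denom = c_m1 - 2.0 * c_0 + c_p1
    if denom == 0.0:
        return 0.0  # flat neighbourhood, no refinement possible
    return 0.5 * (c_m1 - c_p1) / denom
```

Applying the formula separately along rows and columns of the correlation surface yields a 2D subpixel position.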
In the next step, we have to estimate the parame-
ters of our transformation model m using the matched
pairs. The influence of the worst matches (outliers)
should be minimized. A robust estimate of these
parameters can be achieved with Hough transforms,
RANSAC, LMeds, M-estimation, bootstrap methods,
etc. (Rousseeuw and Leroy, 1987). Based on the generalized maximum likelihood and least squares formulation, we will use M-estimators. In particular, the M-estimate of $a$ is
$$\hat{a} = \arg\min_{a} \sum_i \rho(r_{i,a}) \quad (4)$$
where $\rho$ is a robust loss function and $r_{i,a}$ is the scale-normalized residual. A good (robust) initialization is crucial for the success of M-estimation; otherwise it would yield poor results due to its low breakdown point (Stewart, 1999). A robust initialization is achieved using a coarse-to-fine multiresolution framework. At the coarsest level, we can use temporal information from the registration between $I_{i-1}$ and $I_i$, which additionally reduces the computation
time. Using Kalman or particle filtering could result
in a better prediction (Doucet et al., 2000). But in this
case, we keep it simple: we use the previous estima-
tion as the new prediction. Solving this robust regres-
sion problem leads to W-estimators and the iterative
reweighted least squares (IRLS) algorithm (Stewart,
1999). In each iteration, the weights of each pair are
adapted in function of their residuals and a weighted
least squares (WLS) algorithm is applied until convergence is reached. To recover numerically stable
parameters, singular value decomposition is used to
solve the linear system in the WLS algorithm. We
initialize the weights of the IRLS algorithm with CC
information: if a matched pair has a high correlation
(hence is more reliable), then it should have more in-
fluence on the parameter estimation. After applying
IRLS, we not only have an estimate of the parameters $\theta_{i+1}$, but also the final output weights, which represent the importance of each contributing pair. With this information we can exclude badly registered regions (typically caused by moving objects) at all levels of the hierarchical framework.
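The IRLS loop described above might look as follows. The Huber weight function and the MAD scale estimate are common choices assumed here for illustration (the text does not name a specific $\rho$), and `np.linalg.lstsq` solves each WLS step via singular value decomposition:

```python
import numpy as np

def irls(A, b, w0, n_iter=20, k=1.345):
    """Iteratively reweighted least squares, A x ~ b being the
    linearized registration system. w0 are the initial weights taken
    from the correlation scores (higher CC -> more influence).
    Returns the parameter estimate and the final per-pair weights."""
    w = w0.copy()
    x = None
    for _ in range(n_iter):
        sw = np.sqrt(w)
        # weighted least squares step, solved via SVD for stability
        x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
        r = b - A @ x                                # residuals
        s = 1.4826 * np.median(np.abs(r)) + 1e-12    # robust MAD scale
        u = np.abs(r) / s                            # scale-normalized residuals
        w = w0 * np.where(u <= k, 1.0, k / u)        # Huber weight function
    return x, w
```

A pair drawn to a moving object produces a large residual, so its weight collapses towards zero while the clean pairs keep driving the fit.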
The combination of the transformation parameters $\theta_{i+1}$, obtained from the registration between subsequent images, and the parameters $\theta_i$, obtained between the previous image $I_i$ and the panorama $P$, forms a good initial estimate of the parameters $\theta_{i+1}$ for the registration between $I_{i+1}$ and $P$. We correct the parameters $\theta_{i+1}$ using the same algorithm described above and update the provisional mosaic with image $I_{i+1}$. Since the next image $I_{i+2}$ shares the most similar features with image $I_{i+1}$ (taking the spotlight into account), more weight is assigned to the last image when blending it into the provisional mosaic using an averaging scheme. The whole process is then repeated for image $I_{i+2}$.
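Assuming the transformation model can be expressed as a homogeneous 3x3 matrix, the initial estimate is simply the composition of the two known transforms; the helper below is hypothetical and only sketches this idea:

```python
import numpy as np

def predict_panorama_transform(T_i_to_P, T_ip1_to_i):
    """Initial estimate for registering image I_{i+1} to the panorama P:
    compose the transform of I_i to P with the inter-image transform
    of I_{i+1} to I_i (both as homogeneous 3x3 matrices)."""
    return T_i_to_P @ T_ip1_to_i
```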
3.2 Robust Image Fusion
After transformation and resampling of the images $I_i$ (using the 8-point windowed Blackman-Harris sinc function), we have a vector of candidates for each pixel of the panorama $P$. Simple averaging would create severe artifacts due to non-uniform illumination conditions, moving objects and possible misregistration. We tackle the illumination problem by assigning weights to each candidate pixel proportional to the weights $W(x_i)$. Since we are interested in a panorama under good lighting conditions, the weights $W(x_i)$ for dark regions tend to zero.
Moving objects can be modeled as a non-zero-mean Gaussian distribution, and we classify misregistration as part of the noise $N$. With these considerations, each candidate vector is observed as a weighted mixture of Gaussians. Since we are only interested in the single Gaussian density which represents the background, we want to suppress the influence of the other densities by lowering the weights of the candidates which are part of the moving objects. Similar to background subtraction techniques (Radke et al., 2005), we calculate the (weighted) average of all candidates. Afterwards we compare this average to all candidates: if the absolute difference exceeds a certain threshold (typically a number of standard deviations from the mean background model), then the candidate most likely belongs to an object and its weight is set to zero.
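The fusion rule can be sketched for a single panorama pixel as follows; the 2.5-sigma threshold is an illustrative choice, not a value from the text:

```python
import numpy as np

def fuse_candidates(c, w, n_sigma=2.5):
    """Robust fusion of the candidate values c for one panorama pixel,
    starting from the illumination weights w (i.e. W(x_i)). Candidates
    deviating from the weighted mean by more than n_sigma standard
    deviations are treated as moving objects and get zero weight."""
    c = np.asarray(c, dtype=float)
    w = np.asarray(w, dtype=float)
    mu = np.average(c, weights=w)                 # weighted background mean
    var = np.average((c - mu) ** 2, weights=w)    # weighted variance
    keep = np.abs(c - mu) <= n_sigma * np.sqrt(var) + 1e-12
    w = np.where(keep, w, 0.0)                    # suppress object candidates
    return np.average(c, weights=w)               # new weighted average
```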
The new weighted average is a good first estimate,
VISAPP 2006 - IMAGE ANALYSIS