2002) and archeology (Cornelis et al., 2001; Pollefeys
et al., 2004). When virtual objects are placed in background scenes, an observer is highly sensitive to the accuracy of the camera parameters; accurate estimation of camera motion is therefore crucial in AR-based applications.
The contributions of this paper improve upon previous approaches in several respects. First, our approach imposes no constraints on the scene, which makes it inherently suitable for general scenes. Second, our approach eliminates the high correlation between camera rotation and translation by treating them separately, and the estimation involves only five unknown parameters, without any redundancy. In addition, the direction of camera translation, regarded as a key but intractable problem in camera motion estimation, can be estimated correctly with ease. Consequently, no special optimizer is needed to compute precise camera motion parameters.
The remainder of this paper is organized as follows. After a brief review of related work in Section 2, a conceptual overview is given in Section 3. In Section 4 we present the proposed algorithm for two consecutive frames. Optimization over continuous frames of a video sequence is introduced in Section 5. Experimental results and discussions are presented in Section 6. Finally, we conclude the paper in Section 7.
2 RELATED WORK
Much effort has been devoted to robust camera motion estimation. Traditional methods recover the camera motion by calculating the fundamental matrix or the essential matrix, which usually involves seven degrees of freedom, whereas camera motion actually has only five. Researchers have focused on the robust determination of epipolar geometry (Zhang, 1998; Zhang and Loop, 2001) by minimizing the epipolar errors. The epipolar errors of correspondences can be made much smaller than one pixel (Zhang and Loop, 2001). However, this does not guarantee a small 3D projection error or accurate camera motion (Chen et al., 2003). Wang and Tsui (Wang and Tsui, 2000) report that the resulting rotation matrix and translation vector can be quite unstable.
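As a concrete illustration, the epipolar error minimized by these methods is measured per correspondence as the distance between each point and the epipolar line induced by its match. The sketch below is our own illustration (not code from the paper), assuming homogeneous pixel coordinates and a known fundamental matrix:

```python
import numpy as np

def epipolar_error(F, u1, u2):
    """Symmetric epipolar distance (in pixels) for one correspondence.

    F  : 3x3 fundamental matrix
    u1 : homogeneous pixel (x, y, 1) in image 1
    u2 : homogeneous pixel (x, y, 1) in image 2
    """
    l2 = F @ u1            # epipolar line of u1 in image 2 (u2 should lie on it)
    l1 = F.T @ u2          # epipolar line of u2 in image 1
    algebraic = abs(u2 @ F @ u1)
    # distance from a point to the line (a, b, c): |ax + by + c| / sqrt(a^2 + b^2)
    d2 = algebraic / np.hypot(l2[0], l2[1])
    d1 = algebraic / np.hypot(l1[0], l1[1])
    return 0.5 * (d1 + d2)
```

Averaging this quantity over all correspondences gives the objective that the cited methods drive below one pixel.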
Structure from motion focuses on recovering the 3D models contained in the scene. Pollefeys et al. (Pollefeys et al., 2004) propose an elegant approach to recover structure and motion simultaneously. Two key frames exhibiting obvious motion are chosen to compute the camera motion, and initial 3D models of the targets are constructed. Subsequently, the relative camera motion at any frame between these two key frames is obtained, and additional refinements of both structure and motion are performed for each frame. This method has difficulty with general scenes because it relies on an affine model based on two assumptions: frames can be divided into multiple subregions in which all points are coplanar, and these subregions do not change order in the video sequence. Obviously, these assumptions no longer hold for scenes containing inter-occlusions and intersecting objects.
Some researchers calculate the camera translation separately (Jepson and Heeger, 1991; MacLean, 1999). One such technique, the "subspace method", generates constraints perpendicular to the translation vector of the camera motion and is well suited to recovering the translation vector. Recently, Nistér (Nistér, 2004) pointed out that epipolar-based methods exploiting seven or eight pairs of matched points may yield inaccurate camera parameters. He instead proposes to compute the essential matrix with only five pairs of point correspondences, achieving minimal redundancy. From the computed essential matrix, the camera motion can be estimated using the SVD algorithm. This indirect approach differs from ours, which evaluates camera motion directly.
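For reference, the SVD-based recovery mentioned above follows the standard factorization E = U diag(1, 1, 0) Vᵀ, which yields four (R, t) candidates; the true pose is normally selected by a cheirality (points-in-front-of-both-cameras) check. A minimal sketch of this textbook decomposition, not the paper's own method:

```python
import numpy as np

def decompose_essential(E):
    """Return the four (R, t) candidates encoded by an essential matrix E."""
    U, _, Vt = np.linalg.svd(E)
    # enforce proper rotations (det = +1) for U and Vt
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    R1 = U @ W @ Vt       # first rotation candidate
    R2 = U @ W.T @ Vt     # "twisted pair" rotation
    t = U[:, 2]           # translation direction, up to sign and scale
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

Note that t is recovered only up to scale, which is one reason the translation direction is singled out as the critical quantity in the introduction.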
Typically, an efficient optimization process is required to achieve more stable results over the video sequence. This kind of refinement is often referred to as bundle adjustment (Wong and Chang, 2004). It has been shown that bundle adjustment can also be applied to drift removal (Cornelis et al., 2004).
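The objective that such a bundle adjustment minimizes is the total reprojection error over all frames and feature points. A minimal sketch of the residual vector, using hypothetical data structures of our own choosing:

```python
import numpy as np

def reprojection_residuals(K, poses, points, observations):
    """Stacked reprojection errors that bundle adjustment minimizes.

    K            : 3x3 intrinsic matrix
    poses        : list of 3x4 extrinsic matrices [R | T], one per frame i
    points       : (N, 4) array of homogeneous 3D points, one per feature j
    observations : dict mapping (i, j) -> observed pixel (x, y)
    """
    res = []
    for (i, j), uv in observations.items():
        p = K @ poses[i] @ points[j]      # lambda * (x, y, 1)
        res.append(p[:2] / p[2] - uv)     # reprojection error in pixels
    return np.concatenate(res)
```

In practice this residual is handed to a sparse nonlinear least-squares solver that updates all poses and 3D points jointly, which is what makes the refinement stable across the whole sequence.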
3 CONCEPTUAL OVERVIEW
For each video sequence, we assume that the intrinsic parameters of the camera are fixed and have been calibrated in advance. Camera motion estimation can therefore be reduced to computing the extrinsic camera parameters of each frame, which consist of a rotation matrix R and a translation vector T.
As an overview, we first introduce the camera model briefly in conventional notation. We denote a 3D point and its projective depth by the homogeneous coordinates X = (X, Y, Z, 1)^T and λ. The homogeneous coordinates u = (x, y, 1)^T specify its projection in a 2D image. The 3 × 3 rotation matrix and the translation vector are defined as R = {r_k, k = 0, ..., 8} and T = (t_0, t_1, t_2)^T, respectively. Throughout this paper we use the subscript i to denote the frame number, the subscript j to specify the index of a feature point, E for the essential matrix, and I for
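With this notation, the projective depth and the projection are related by λu = K[R | T]X. A small numeric sketch illustrates the relation; the intrinsic matrix K below is hypothetical, since the paper only assumes K has been calibrated in advance:

```python
import numpy as np

# Hypothetical calibrated intrinsics (focal length 800 px, principal point 320, 240)
K = np.array([[800.,   0., 320.],
              [  0., 800., 240.],
              [  0.,   0.,   1.]])
R = np.eye(3)                  # extrinsic rotation for this frame
T = np.array([0., 0., 2.])     # extrinsic translation

X = np.array([0.5, -0.25, 3.0, 1.0])   # homogeneous 3D point (X, Y, Z, 1)
P = K @ np.hstack([R, T[:, None]])     # 3x4 projection matrix K [R | T]
lam_u = P @ X                          # = lambda * u
lam = lam_u[2]                         # projective depth lambda (here Z + t_2 = 5)
u = lam_u / lam                        # homogeneous image point (x, y, 1)
```

Dividing out λ recovers the pixel coordinates, which is the step every reprojection-based error measure in this paper relies on.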