view, may result in tracking failure. Once the sys-
tem loses track, a re-initialization is required to con-
tinue tracking. In our particular case, namely indus-
trial applications of Augmented Reality, where the
user wears a head-mounted camera, this problem becomes very challenging. Due to the fast head movements of users working in a collaborative industrial environment, frequent and fast initialization is required. Manual initialization and the use of hybrid configurations, for instance instrumenting the camera with inertial sensors or gyroscopes, are not a practical solution and in many cases are not accepted by the end-users.
We therefore propose a purely vision-based
method that can detect the target object and compute
its three-dimensional pose from a single image. How-
ever, wide baseline matching tends to be both less ac-
curate and more computationally intensive than the
short baseline variety. Therefore, in order to over-
come the above problems, we combine the object de-
tection system with the tracking system using a man-
agement framework. This allows us to obtain robustness from object detection and, at the same time, accuracy from recursive tracking.
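As an illustration of this management framework, the hand-off between single-image detection and recursive tracking can be sketched as follows (a minimal sketch only; the function names detect_pose and track_pose are hypothetical placeholders, not the actual components described later in this paper):

    def run_tracking_loop(capture, detect_pose, track_pose):
        # capture              -- cv2.VideoCapture-like source with a read() method (assumption)
        # detect_pose(frame)   -- wide-baseline detection: prior-free, slower, returns a pose or None
        # track_pose(frame, p) -- short-baseline tracking: needs the previous pose p, fast, returns a pose or None
        pose = None
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if pose is None:
                # (re-)initialization from a single image
                pose = detect_pose(frame)
            else:
                # frame-to-frame tracking from the previous estimate;
                # None means the track was lost and detection takes over on the next frame
                pose = track_pose(frame, pose)
            yield frame, pose

Whenever the recursive tracker reports a failure, the loop falls back to detection on the next frame, so no manual intervention is needed.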
2 RELATED WORK
Several papers have addressed the problem of
template-based tracking in order to obtain the direct
estimation of 3D camera displacement parameters.
The authors of (Cobzas and Jagersand, 2004) avoid
the explicit computation of the Jacobian that relates
the variation of the 3D pose parameters to the appear-
ance in the image by using implicit function deriva-
tives. In that case, the minimization of the image er-
ror is done using the inverse compositional algorithm
(Baker and Matthews, 2004). However, since the
parametric models are the 3D camera displacement
parameters, this method is valid only for small dis-
placements around the reference positions, i.e., around
the position of the keyframes (Baker et al., 2004). The
authors of (Buenaposada and Baumela, 2002) extend
the method proposed in (Hager and Belhumeur, 1998)
to homographic warpings and make the assumption
that the true camera pose can be approximated by the
current estimated pose (i.e. the camera displacement
is sufficiently small). In addition, the Euclidean con-
straints are not directly imposed during the tracking,
but once a homography has been estimated, the ro-
tation and the translation of the camera are then ex-
tracted. The authors of (Sepp and Hirzinger, 2003)
go one step further and extend the method by includ-
ing the constraint that a set of control points on a
three-dimensional surface undergo the same camera
displacement.
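As a side note on the homography-to-pose step mentioned above (a minimal sketch under the assumption of known camera intrinsics, not the procedure used in the cited works), OpenCV's decomposeHomographyMat returns the candidate rotation/translation/plane-normal triples, from which the physically valid one still has to be selected:

    import numpy as np
    import cv2

    def pose_candidates_from_homography(H, K):
        # H: 3x3 plane-induced homography, K: 3x3 camera intrinsics.
        # Up to four (R, t, n) candidates are returned; t is recovered only up
        # to the scale of the plane distance, and the valid candidate must be
        # chosen, e.g. by checking that reference points lie in front of the camera.
        H = np.asarray(H, dtype=np.float64)
        K = np.asarray(K, dtype=np.float64)
        num, rotations, translations, normals = cv2.decomposeHomographyMat(H, K)
        return list(zip(rotations, translations, normals))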
These methods work well for small interframe displacements. Their major drawback is their restricted convergence radius, since they are all based on a very local minimization where they assume that the image pixel intensities vary linearly with respect to the estimated motion parameters. Consequently, the tracking inevitably fails when the relative motion between the camera and the tracked objects is fast.
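To make this linearity assumption explicit (the notation below follows the general template-tracking literature, e.g. (Baker and Matthews, 2004), and is not reproduced from the cited papers): given a template $T$, the current image $I$, a warp $w(\mathbf{x};\mathbf{p})$ and a parameter increment $\Delta\mathbf{p}$, these methods minimize

$$\sum_{\mathbf{x}} \left[ I\big(w(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})\big) - T(\mathbf{x}) \right]^2 \approx \sum_{\mathbf{x}} \left[ I\big(w(\mathbf{x};\mathbf{p})\big) + \nabla I \, \frac{\partial w}{\partial \mathbf{p}} \, \Delta\mathbf{p} - T(\mathbf{x}) \right]^2,$$

and the expansion on the right-hand side is accurate only for small $\Delta\mathbf{p}$, which is precisely what limits the convergence radius.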
Recently, in (Benhimane and Malis, 2006), the
authors propose a 2nd-order optimization algorithm
that considerably increases the convergence domain
and rate of standard tracking algorithms while having
an equivalent computational complexity. This algo-
rithm works well for planar and simple piecewise pla-
nar objects. However, it fails when the tracking needs
to take new regions into account or when the appear-
ance of the object changes during the camera motion.
In this paper, we adopt an extension of this algorithm and we propose a template management algorithm that allows an unsupervised selection of the regions of the object that newly become visible in the image, together with their automatic update. This greatly improves the results of the tracking in real industrial applications and makes it much more scalable and applicable to complex objects and severe illumination conditions.
For the initialization of the tracking system, a desirable approach would be to use purely vision-based
methods that can detect the target object and compute
its three-dimensional pose from a single image. If this
can be done fast enough, it can then be used to ini-
tialize and re-initialize the system as often as needed.
This problem of automated initialization is difficult because, unlike in short baseline tracking, a crucial source of information is not available: a strong prior on the pose and, with it, the spatio-temporal adjacency of features.
The computer vision literature includes many object detection approaches based on representing objects of interest by a set of local 2D features such as corners or edges. Combinations of such features provide robustness against partial occlusion and cluttered backgrounds. However, the appearance of the features can be distorted by large geometric and significant illumination changes. Recently, there have been some
exciting breakthroughs in addressing the problem of
wide-baseline feature matching. For instance, feature
detectors and descriptors such as SIFT (Lowe, 2004)
and SURF (Bay et al., 2006) and affine covariant fea-
ture detectors (Matas et al., 2002; Mikolajczyk and
Schmid, 2004; Tuytelaars and Van Gool, 2004) have shown their maturity in real applications. (Mikolajczyk et al., 2005) showed that a characterization of