cues such as lines, feature-points and texture (Vacchetti
et al., 2004; Pressigout and Marchand, 2006;
Ladikos et al., 2007).
In particular, the combination of template-based tracking
and feature-based tracking gives robust results.
Template-based tracking works well for small inter-
frame displacements, image blur, oblique viewing an-
gles and linear illumination changes, while feature-
based tracking works well with occlusions and large
interframe displacements. Therefore, our algorithm
combines template-based tracking and feature-based
tracking. For template-based tracking, we use the ESM
algorithm of Benhimane and Malis (2004) because of its
high convergence rate and accuracy, while the
feature-based tracking relies on Harris corner points.
Ladikos et al. (2007) assume that a textured 3D model
of the object is available and that the camera is
calibrated. In our application we do not have this
information, since the tracker is intended for
interactive tracking in unknown scenes with unknown
cameras. Our contributions therefore focus on mak-
ing this tracking approach work under the constraints
imposed by our application. To accommodate tracking
in unknown scenes, we parameterized the camera motion
using a homography instead of a pose in Cartesian
space. This is sufficient for our application, since
we only want to superimpose labels over the objects.
The homography also simplifies the Jacobian, so that we obtain a
significant speed increase for the tracking. We also
included an online illumination compensation by es-
timating the gain and the bias for the tracked pattern.
To accommodate interactivity we use a learning-free
SIFT-based initialization instead of a learning-based
Randomized Tree initialization.
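The two ideas in the paragraph above can be illustrated with a minimal sketch. It assumes a 3x3 homography stored as nested lists and image patches flattened into lists of intensities. The function names are hypothetical. The closed-form least-squares fit of gain and bias is an illustrative assumption; the text does not state how the two parameters are actually estimated.

```python
def apply_homography(H, x, y):
    """Map the image point (x, y) through a 3x3 homography H (nested lists)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under H")
    # Divide by the homogeneous coordinate to get pixel coordinates.
    return u / w, v / w


def estimate_gain_bias(reference, current):
    """Least-squares fit of current ~= gain * reference + bias.

    Both arguments are equally long sequences of pixel intensities.
    This closed-form fit is an assumption made for illustration only.
    """
    n = len(reference)
    if n == 0 or n != len(current):
        raise ValueError("patches must be non-empty and of equal size")
    mean_r = sum(reference) / n
    mean_c = sum(current) / n
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_r == 0:
        # Flat reference patch: the gain is unobservable, so keep it at 1.
        return 1.0, mean_c - mean_r
    cov = sum((r - mean_r) * (c - mean_c) for r, c in zip(reference, current))
    gain = cov / var_r
    bias = mean_c - gain * mean_r
    return gain, bias
```

A label anchored at a point of the reference image would be drawn at `apply_homography(H, x, y)` in the current image, which is all the application requires and is why no 3D pose needs to be recovered.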
Augmented Reality-based remote support has re-
ceived much attention in the past. However, most ex-
isting work has focused on the interaction (Barakonyi
et al., 2004) between the remote user and the local
user and very little work (Lee and Höllerer, 2006)
has been devoted to markerless tracking. Therefore,
most systems make use of fiducial markers or explicit
scene knowledge, which precludes ad-hoc use in
unknown environments. Our contribution is to use
a reference template for markerless tracking to avoid
drift and combine both template-based and feature-
based tracking to overcome problems with jittering,
fast object motion and partial occlusions.
3 THE TRACKING SYSTEM
Our tracking system combines template-based and
feature-based tracking. This design is based on sev-
eral simulations and experiments that were conducted
in order to determine the properties of both tracking
methods. Some of these experiments are presented
in the experimental section of this paper. The results
suggest that no single method can deal well with all
tracking situations occurring in practice. Rather, the
two tracking approaches are complementary, so combining
them efficiently yields robust and accurate tracking results.
In the remainder of this section, we will first discuss
the design of the tracking system and then go on to
describe each component in detail.
3.1 System Design
Once the initialization module, which uses SIFT de-
scriptors, has determined the pose of the object in
the current image, our algorithm adaptively switches
between template-based tracking and feature-based
tracking depending on the value of the normalized
cross-correlation (NCC) computed between a reference
image of the object and the object’s appearance
in the current image. During template-based track-
ing, a low NCC score usually means that the object
is being occluded or that the interframe displacement
is outside the convergence radius of the minimization
algorithm. In both cases, feature-based approaches
give higher quality results. Therefore, the proposed
algorithm automatically switches over to the feature-
based tracking. As soon as the NCC score gets high
enough, it is safe to go back to template-based track-
ing. If the score remains low, the feature-based track-
ing is invoked as long as there are enough inlier points
for a stable pose estimation. If this is not the case, the
algorithm invokes the global initialization.
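The switching rule described above can be sketched as a small state machine driven by the NCC score. This is only an illustration: the threshold values, the mode names and the function names are hypothetical and are not taken from the paper, and the patches are assumed to be flattened into lists of intensities.

```python
import math

NCC_THRESHOLD = 0.8  # hypothetical value, not given in the text
MIN_INLIERS = 10     # hypothetical value, not given in the text


def ncc(reference, current):
    """Normalized cross-correlation of two equally sized intensity lists."""
    n = len(reference)
    if n == 0 or n != len(current):
        raise ValueError("patches must be non-empty and of equal size")
    mean_r = sum(reference) / n
    mean_c = sum(current) / n
    num = sum((r - mean_r) * (c - mean_c) for r, c in zip(reference, current))
    den = math.sqrt(sum((r - mean_r) ** 2 for r in reference)
                    * sum((c - mean_c) ** 2 for c in current))
    # A flat patch has no texture to correlate; report zero similarity.
    return num / den if den > 0 else 0.0


def next_mode(mode, ncc_score, inlier_count=0):
    """Choose the module for the next frame: 'template', 'feature' or 'init'.

    mode is the module that produced the current estimate. inlier_count is
    only meaningful when mode == 'feature'.
    """
    if ncc_score >= NCC_THRESHOLD:
        # The reference and the current appearance agree: template tracking is safe.
        return "template"
    if mode == "feature" and inlier_count < MIN_INLIERS:
        # Too few inliers for a stable estimate: fall back to global initialization.
        return "init"
    # Low score (occlusion or large displacement): use feature-based tracking.
    return "feature"
```

Because the score is computed after subtracting the means and dividing by the norms, it is unchanged by a positive gain and a bias applied to one of the patches, so a low value points to occlusion or misalignment rather than to a linear change in illumination.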
3.2 Initialization
The initialization is performed using SIFT descrip-
tors. The choice of SIFT descriptors over other ap-
proaches is based on the fact that they do not need
a learning phase and that they have been shown to
be very robust with respect to different image
transformations. SIFT does not run in real time, but
since it is only used for initialization, this is not
a critical issue.
3.3 Template-based Tracking
We use the ESM algorithm to perform template-based
tracking because it enables us to achieve second-order
convergence at the cost of a first-order method. Given
the current image I and the reference image I∗ of
a planar target, we are looking for the homography
VISAPP 2008 - International Conference on Computer Vision Theory and Applications