detection (Dollár et al., 2012), making this type of analysis relatively straightforward. Techniques for human hand detection, on the other hand, are far more complex. Most existing accurate hand detection algorithms rely on aids that facilitate detection, such as colored gloves or additional motion sensors. The use of such aids, however, may affect the naturalness of the recorded data, for both production and reception. We therefore cannot allow items that may attract the visual attention of the participants in the experiment. The use of 3D cameras, which provide depth information and thereby facilitate hand detection, is not applicable either, since most egocentric cameras are 2D cameras.
In this paper, we present a semi-automatic hand detection algorithm based on an efficient combination of several techniques. We developed a system that automatically detects hands but asks for manual intervention when the confidence of a detection falls below a certain threshold. This approach reduces the manual analysis significantly while guaranteeing high accuracy. Since our approach relies on algorithms that are not computationally demanding, it is substantially faster than methods based on complex models. Driven by the wide applicability of our semi-automatic annotation tool, we have made it publicly available¹ for other users.
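As an illustration, the confidence-based fallback described above can be sketched as follows. The threshold value and the `detect`/`ask_human` interfaces are our own assumptions for this sketch, not the actual implementation of the tool:

```python
# Sketch of a semi-automatic annotation loop: automatic detections with
# sufficient confidence are accepted; the rest are queued for a human.
# The threshold of 0.8 is an assumed value for illustration only.
CONFIDENCE_THRESHOLD = 0.8

def annotate(frames, detect, ask_human, threshold=CONFIDENCE_THRESHOLD):
    """detect(frame) -> (bbox, confidence); ask_human(frame) -> bbox."""
    annotations = []
    manual_count = 0
    for frame in frames:
        bbox, confidence = detect(frame)
        if confidence >= threshold:
            annotations.append(bbox)              # trust the detector
        else:
            annotations.append(ask_human(frame))  # fall back to manual input
            manual_count += 1
    return annotations, manual_count
```

The fraction of frames routed to `ask_human` is what determines how much manual effort such a scheme actually saves.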
The remainder of this paper is organized as follows: in section 2, we present a thorough comparison of existing hand detection algorithms. In section 3, the integration of the manual intervention is explained, while in section 4 we discuss our hand detection algorithm in detail. In section 5, the results are discussed, and in section 6 a final summary is given.
2 RELATED WORK
Traditionally, one can divide hand detection techniques into four categories. In this section, we give an overview of existing techniques and explain the limitations of these approaches.
A well-known method for hand detection is the use of colored gloves, which serve as markers that can be easily detected in images. Recent work (Wang and Popović, 2009) uses a multi-colored glove, enabling the detection of various hand orientations and poses. Since we focus on hand detection in natural and unconstrained scenes, we cannot afford the use of colored gloves, as they disturb the visual attention during a conversation.
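To illustrate why a colored glove makes detection easy, a single color threshold already localizes the marker. The sketch below is our own, with an assumed HSV range for the glove color (not taken from the cited work):

```python
import numpy as np

def glove_bbox(hsv, lo=(100, 80, 80), hi=(130, 255, 255)):
    """Return (row_min, row_max, col_min, col_max) of the pixels whose
    HSV values fall inside the assumed glove-color range, or None when
    no pixel matches. `hsv` is an (H, W, 3) array."""
    lo, hi = np.array(lo), np.array(hi)
    mask = np.all((hsv >= lo) & (hsv <= hi), axis=-1)  # per-pixel color test
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return rows.min(), rows.max(), cols.min(), cols.max()
```

A natural hand, by contrast, has no such compact, distinctive color range, which is why marker-free detection is so much harder.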
¹ http://www.eavise.be/insightout/Downloads

A second approach to hand detection makes use of motion sensors (Stiefmeier et al., 2006). Typically, multiple sensors, such as ultrasonic transmitters and inertial sensor modules, are placed on the user. For the same reason as mentioned above, placing additional sensors on the participants is not an option, as they may interfere with natural behavior.
The increasing popularity and public availability of 3D cameras paved the way for a third type of hand detection (Ren et al., 2013). These cameras provide useful depth information about a scene. Depth information facilitates hand detection and even enables the detection of small items such as fingertips (Raheja et al., 2011). Although this is a promising approach, it is not applicable in our application, since most egocentric cameras, such as mobile eye-trackers, are not equipped with 3D sensors.
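As a minimal sketch of how depth simplifies hand segmentation (our own illustration, with an assumed depth margin, not the cited method): when the hand is the object closest to the camera, thresholding the depth map near its minimum already yields a hand mask:

```python
import numpy as np

def segment_nearest(depth, margin=100):
    """Keep pixels within `margin` depth units of the nearest valid point.
    Zeros are treated as invalid (missing) depth readings."""
    nearest = depth[depth > 0].min()          # closest valid measurement
    return (depth > 0) & (depth <= nearest + margin)
```

No comparable one-line cue exists in a plain 2D image, which is the limitation the rest of this section deals with.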
A final approach to hand detection is based on image processing in 2D images, without the need for additional markers or sensors placed on the body. In (Kolsch and Turk, 2004), a hand tracking approach is described based on KLT features in combination with color cues. Such an approach yields good results as long as the hand is sufficiently visible (large enough) to calculate an adequate number of features. This approach is not applicable in our type of experiments, where the hands represent only a small part of the image, as can be seen in Figure 1. In (Shan et al., 2007), real-time hand tracking is presented using a mean-shift embedded particle filter. Their system is very fast (only 28 ms per frame is needed); however, the resolution of their test images is only 240×180 pixels. In their experiments they only detect and track a single hand, whereas in our application we need to track both hands with respect to the human pose. (Bo et al.,
2007) presents a hand detection technique based on a combination of Haar-like features and skin segmentation. This approach is sufficiently accurate in controlled scenes, e.g. against a clean white background; in less constrained scenes, however, it suffers from high false positive rates. In
the work of (Eichner et al., 2012), a technique for estimating the spatial layout of humans in still images is presented, using a combination of upper body detection and the detection of individual body parts. This method performs well on larger body parts (such as arms or heads), whereas smaller parts (e.g. hands) are much more challenging. The accuracy of this technique largely depends on the upper body detection: detection at the wrong scale results in deviating limb detections. This approach is far from real-time: on average, 25 seconds are needed to process a single 1280×720 frame. A similar approach was proposed by (Yang and Ramanan, 2011). This