detection (Dollár et al., 2012), making this type of analysis relatively straightforward. Techniques for human hand detection, on the other hand, are far more complex. Most existing accurate hand detection algorithms rely on aids that facilitate detection, such as colored gloves or additional motion sensors. The use of such aids, however, may affect the naturalness of the recorded data, for both production and reception. We therefore cannot allow items that may attract the visual attention of the participants in the experiment. The use of 3D cameras, which provide depth information and thereby facilitate hand detection, is not applicable either, since most egocentric cameras are 2D cameras.
In this paper, we present a semi-automatic hand detection algorithm based on an efficient combination of several techniques. We developed a system that automatically detects hands but asks for manual intervention when the confidence of a detection falls below a certain threshold. This approach reduces the manual analysis significantly while guaranteeing high accuracy. Since our approach relies on algorithms that are not computationally demanding, it is substantially faster than methods based on complex models. Driven by the wide applicability of our semi-automatic annotation tool, we have made it publicly available¹ for other users.
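As an illustration, the confidence-based fallback described above can be sketched as follows. The threshold value and the `detect`/`ask_human` interfaces are our own assumptions for this sketch, not the actual implementation of the tool:

```python
# Sketch of a semi-automatic annotation loop: automatic detections with
# sufficient confidence are accepted; the rest are queued for a human.
# The threshold of 0.8 is an assumed value for illustration only.
CONFIDENCE_THRESHOLD = 0.8

def annotate(frames, detect, ask_human, threshold=CONFIDENCE_THRESHOLD):
    """detect(frame) -> (bbox, confidence); ask_human(frame) -> bbox."""
    annotations = []
    manual_count = 0
    for frame in frames:
        bbox, confidence = detect(frame)
        if confidence >= threshold:
            annotations.append(bbox)              # trust the detector
        else:
            annotations.append(ask_human(frame))  # fall back to manual input
            manual_count += 1
    return annotations, manual_count
```

The fraction of frames routed to `ask_human` is what determines how much manual effort such a scheme actually saves.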
The remainder of this paper is organized as follows: in section 2, we present a thorough comparison of existing hand detection algorithms. In section 3, the integration of the manual intervention is explained, while in section 4 we discuss our hand detection algorithm in detail. In section 5, the results are discussed, and in section 6 a final summary is given.
2 RELATED WORK
Traditionally, one can divide hand detection techniques into four categories. In this section, we give an overview of existing techniques and explain the limitations of these approaches.
A well-known method for hand detection is the use of colored gloves, which serve as markers that can be easily detected in images. Recent work (Wang and Popović, 2009) uses a multi-colored glove, enabling the detection of various hand orientations and poses. Since we focus on hand detection in natural and unconstrained scenes, we cannot afford the use of colored gloves, as they disturb the visual attention during a conversation.
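To illustrate why a colored glove makes detection easy, a single color threshold already localizes the marker. The sketch below is our own, with an assumed HSV range for the glove color (not taken from the cited work):

```python
import numpy as np

def glove_bbox(hsv, lo=(100, 80, 80), hi=(130, 255, 255)):
    """Return (row_min, row_max, col_min, col_max) of the pixels whose
    HSV values fall inside the assumed glove-color range, or None when
    no pixel matches. `hsv` is an (H, W, 3) array."""
    lo, hi = np.array(lo), np.array(hi)
    mask = np.all((hsv >= lo) & (hsv <= hi), axis=-1)  # per-pixel color test
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return None
    return rows.min(), rows.max(), cols.min(), cols.max()
```

A natural hand, by contrast, has no such compact, distinctive color range, which is why marker-free detection is so much harder.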
¹ http://www.eavise.be/insightout/Downloads

A second approach to hand detection makes use of motion sensors (Stiefmeier et al., 2006). Typically, multiple sensors, such as ultrasonic transmitters and inertial sensor modules, are placed on the user. For the same reason as mentioned above, placing additional sensors on the participants is not an option, as they may interfere with natural behavior.
The increasing popularity and public availability of 3D cameras paved the way for a third type of hand detection (Ren et al., 2013). These cameras provide useful depth information about a scene. Depth information facilitates hand detection and even enables the detection of small items such as fingertips (Raheja et al., 2011). Although this is a promising approach, it is not applicable in our application, since most egocentric cameras, such as mobile eye-trackers, are not equipped with 3D sensors.
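As a minimal sketch of how depth simplifies hand segmentation (our own illustration, with an assumed depth margin, not the cited method): when the hand is the object closest to the camera, thresholding the depth map near its minimum already yields a hand mask:

```python
import numpy as np

def segment_nearest(depth, margin=100):
    """Keep pixels within `margin` depth units of the nearest valid point.
    Zeros are treated as invalid (missing) depth readings."""
    nearest = depth[depth > 0].min()          # closest valid measurement
    return (depth > 0) & (depth <= nearest + margin)
```

No comparable one-line cue exists in a plain 2D image, which is the limitation the rest of this section deals with.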
A final approach to hand detection is based on image processing in 2D images, without the need for additional markers or sensors placed on the body. In (Kolsch and Turk, 2004), a hand tracking approach is described based on KLT features in combination with color cues. Such an approach yields good results as long as the hand is sufficiently visible (large enough) to calculate an adequate number of features. This approach is not applicable in our type of experiments, where the hands represent only a small part of the image, as can be seen in Figure 1. In (Shan et al., 2007), real-time hand tracking is presented using a mean-shift embedded particle filter. Their system is very fast (only 28 ms per frame is needed); however, the resolution of their test images is only 240×180 pixels. In their experiments they only detect and track a single hand, whereas in our application we need to track both hands with respect to the human pose. (Bo et al.,
2007) presents a hand detection technique based on a combination of Haar-like features and skin segmentation. This approach is sufficiently accurate in controlled scenes, e.g. against a clean white background; in less constrained scenes, however, it suffers from high false positive rates. In
the work of (Eichner et al., 2012), a technique for estimating the spatial layout of humans in still images is presented, using a combination of upper body detection and the detection of individual body parts. This method performs well on larger body parts (such as arms or heads), whereas smaller parts (e.g. hands) are much more challenging. The accuracy of this technique largely depends on the upper body detection: detection at the wrong scale results in deviating limb detections. This approach is far from real-time: on average, 25 seconds are needed to process a single 1280×720 frame. A similar approach was proposed by (Yang and Ramanan, 2011). This