Rule-based Hand Posture Recognition using Qualitative Finger
Configurations Acquired with the Kinect
Lieven Billiet^1, Jose Oramas^1, McElory Hoffmann^1,2, Wannes Meert^1 and Laura Antanas^1
^1 Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium
^2 Department of Mathematical Sciences, Stellenbosch University, Stellenbosch, South Africa
Keywords: Hand Posture Recognition, 3D Data, Model-based Recognition, Rule-based Model.
Abstract:
Gesture recognition systems exhibit failures when faced with large hand posture vocabularies or relatively new
hand poses. One main reason is that 2D and 3D appearance-based approaches require significant amounts of
training data. We address this problem by introducing a new 2D model-based approach to recognize hand
postures. The model captures a high-level rule-based representation of the hand expressed in terms of finger
poses and their qualitative configuration. The available 3D information is used to first segment the hand.
We evaluate our approach on a Kinect dataset and report superior results while using less training data,
compared to the state-of-the-art 3D SURF descriptor.
1 INTRODUCTION
A variety of consumer devices using gestures as
means of communication have been released in the
recent years (e.g., the Microsoft Kinect). A factor that still negatively influences the user experience
with such gesture-based devices is the gesture recognition accuracy. To become part of every-
day life, these systems need to have high accuracy
and to adapt quickly to large gesture vocabularies. In
this paper we focus on hand gestures and we intro-
duce a new model-based approach to hand posture
recognition. In contrast to purely appearance-based
techniques, which use no structural information about
the hands and thus require a significant amount of
training data, our simple rule-based and user-defined
model can reliably recognize hand postures by estimating only a few parameters from a small number of training images.
A hand posture recognition system involves (1)
segmenting the hand and (2) extracting the hand
pose description, two challenging problems for which
many methods exist (Barczak and Dadgostar, 2005).
Similar to (Ren et al., 2011a; Ren et al., 2011b; Izadi
et al., 2011), in this work we use the Kinect and 3D
depth information to solve the first problem. How-
ever, unlike these approaches, we do not rely on wrist
belts, external media or color-markers to assist the
hand segmentation step. Furthermore, we find the lo-
cation of the hand independently of its posture and
our approach is able to discriminate between the cases
when none, one or two hands are used for gestures.
We address the second problem by employing a
qualitative hand model. In contrast to well-established
appearance-based techniques using either 2D (Li-
wicki and Everingham, 2009; Van den Bergh and
Van Gool, 2011; Pugeault and Bowden, 2011; Al-
tun and Albayrak, 2011) or 3D (Suryanarayan et al.,
2010; Darom and Keller, 2012; Knopp et al., 2010)
information, our approach is model-based and uses
the 2D image data to encode a rule-based hand
description. This has two main advantages over
purely appearance-based approaches: the computational demand is lower, and any potential loss of
discriminative power due to limited amounts of training data is countered by the use of structural
information about the hand. Our set of rules is based on finger poses
and properties and can robustly capture the hand pos-
ture configuration. Each rule uses finger poses such as
stretched or closed and qualitative relations between
them. Our model has only 9 degrees of freedom,
which makes it easily applicable in practice. This
is an important advantage compared to non-structural
methods, for which data acquisition often comes at a high
cost. Moreover, it offers an elegant alternative to
a more demanding full kinematic hand model (Erol
et al., 2007), which implies a more difficult recovery
of all its parameters from a single video stream.
Figure 1: Hand segmentation with additional refinement. From left to right: body segmentation, original hand segmentation, hand segmentation after refinement, palm and wrist positions (black and yellow circles, respectively).
Related work (Keskin et al., 2011) proposes an approach to hand posture recognition that fits a 3D
skeleton to the hand. Although semantically it follows the
same direction as our work, it is still a low-level rep-
resentation of the hand. Closely related are also the
approaches of (Mo and Neumann, 2006; Holt et al.,
2011; Ren et al., 2011a; Ren et al., 2011b). They
use depth cues to build a more qualitative representa-
tion of the hand model based on hand parts and their
configuration. Yet, unlike these approaches, we use depth
information only for hand segmentation and employ a
rule-based model to recognize hand postures.
Although hand posture recognition is a popular
problem, recognition results presented in the litera-
ture are often based on self-gathered datasets which
are not available. Exceptions are the recently intro-
duced gesture recognition benchmarks (Guyon and
Athitsos, 2011; Guyon et al., 2012). However, they
mainly focus on hand motion or involve all body
parts. In contrast, our current work focuses on the
hand posture recognition problem. Thus, we collected
a dataset to evaluate our rule-based approach and,
as a second contribution, we make this dataset pub-
licly available at http://people.cs.kuleuven.be/~laura.antanas/Kinectdata.zip. We com-
pare experimentally against an appearance-based
model which employs the recently introduced 3D
SURF (Knopp et al., 2010) descriptor. We show that
starting from the same hand segmentations, our rule-
based system achieves better results than 3D SURF.
Our model-based approach demonstrates the advan-
tage of using rules over more expensive 3D descriptors when little training data is available.
2 HAND SEGMENTATION
We use the Kinect and depth information to localize
and segment the hand by detecting the point closest
to the visual sensor and thresholding the depth im-
age from this point. This is similar to (Mo and Neu-
mann, 2006), however, we extend this work to deal
with none, one, or two hands.
Figure 2: Hand postures and visualization of their models.
Our procedure includes several steps performed on the depth image: front-most body point detection, body segmentation by thresholding the depth,
hand segmentation by re-thresholding the depth and
finally, refined hand segmentation. The body depth
threshold is estimated as the median depth of the detected object's area when an initial depth threshold
from the front-most body point is considered. As
potential hands have to obey certain size criteria, we
consider as hands (at most) two objects lying at least T_hands cm before the body in the direction of
the visual sensor. The thresholds were empirically estab-
lished based on training data. As our experiments on
new data show, the chosen thresholds allow for gener-
alization. Figure 1 illustrates the two segmentations.
Because hands may extend quite far away from the body, part of the arm might be included in the hand
segmentation; we therefore add a segmentation re-estimation step. We use the closeness to the wrist
and palm positions to filter out elongated wrist regions and refine the segmentation of the palm.
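For illustration, a minimal sketch of this pipeline in Python could look as follows; the slab width, the T_hands value and the size criteria below are placeholders, not the thresholds estimated on our training data.

import numpy as np
from scipy import ndimage

def segment_hands(depth, t_hands=10.0, init_slab=60.0, min_area=500, max_area=6000):
    # Sketch of the depth-based hand segmentation (illustrative thresholds).
    # depth: 2D array of depths in cm, 0 where the sensor gives no reading.
    valid = depth > 0
    front_most = depth[valid].min()                  # front-most body point

    # Initial body segmentation: a depth slab starting at the front-most point.
    body = valid & (depth < front_most + init_slab)
    body_depth = np.median(depth[body])              # body depth threshold

    # Hand candidates lie at least t_hands cm in front of the body.
    hands = valid & (depth < body_depth - t_hands)

    # Keep at most the two largest connected components obeying the size criteria.
    labels, n = ndimage.label(hands)
    sizes = ndimage.sum(hands, labels, index=range(1, n + 1))
    candidates = [i + 1 for i, s in enumerate(sizes) if min_area <= s <= max_area]
    candidates.sort(key=lambda i: sizes[i - 1], reverse=True)
    return [labels == i for i in candidates[:2]]     # zero, one or two hand masks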
3 RULE-BASED RECOGNITION
To recognize hand postures we propose a model-
based approach which we represent using rules. It is
based on a fixed number of hand components. Each
component is a finger group with its associated finger
pose. Thus, each hand posture corresponds to a rule which cap-
tures the hand configuration. Because the rules are
user-defined, training involves only the estimation of
few parameters that are general finger properties and
hold across all hand postures. Our hand model is in-
spired by (Mo and Neumann, 2006); unlike that work, however, we consider only two possible finger
poses, stretched or closed. Also, the relations between fingers can
be either joined or separated. Figure 2 shows possible
configurations of finger poses and relations between
them. The black dots are markers defining the global
position of the hand. Red lines represent stretched
fingers, while yellow lines represent closed fingers.
This model can be extended, yet even in this sim-
ple form, it is able to distinguish between 2^9 = 512 poses
without explicit training. Since the model needs only
9 degrees of freedom it is feasible to learn the param-
eters from a single video stream. Our representation
allows the introduction of a second hand, extra finger
orientations or even finger depth. This makes our ap-
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
540
proach general enough with respect to posture types.
Figure 3: Phases of the finger group extraction process. From left to right: full contour, reduced contour, convexity analysis, finger groups, alignment.
From Hand Contour to Finger Groups. Finger
groups are obtained from a convexity analysis of the
hand contour as illustrated in Figure 3. In a first step,
the contour is simplified to a polycontour. This removes many spurious small convexity defects while retaining the
contour’s characteristic form. Next, groups of fingers
are found as parts of the contour, in-between subse-
quent convexity defects (marked as yellow circles).
We use the width of a group to determine the number of fingers it contains; the thresholds distinguishing
group sizes are estimated empirically from the training data. Additionally, we impose that exactly 5 fingers
are found across the groups. A final step aligns the
bases of all fingers, except for the thumb, at the palm
level. We estimate each finger group's pose from its length: based on the length, fingers are classified
as stretched or closed. Whether fingers are joined or separated is decided based on the minimum of the
distances from their joining point to the tips.
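The following sketch illustrates the contour simplification and convexity analysis using OpenCV; the approximation accuracy and the defect-depth cut-off are placeholders rather than the values estimated on our training data.

import cv2
import numpy as np

def extract_finger_groups(hand_mask, eps_frac=0.01, min_defect_depth=10.0):
    # Sketch: simplify the hand contour and cut it at deep convexity defects.
    # Returns (contour segment, arc width) pairs, one per candidate finger group.
    mask = hand_mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)

    # Polygonal simplification removes small spurious convexity defects
    # while retaining the contour's characteristic form.
    eps = eps_frac * cv2.arcLength(contour, True)
    poly = cv2.approxPolyDP(contour, eps, True)

    hull = cv2.convexHull(poly, returnPoints=False)
    defects = cv2.convexityDefects(poly, hull)
    if defects is None:
        return []

    # Deep defect points (between fingers) are the cut points; the contour
    # stretch between two subsequent cuts is one candidate finger group.
    cuts = sorted(int(d[0][2]) for d in defects if d[0][3] / 256.0 > min_defect_depth)
    groups = []
    for a, b in zip(cuts, cuts[1:] + cuts[:1]):
        seg = poly[a:b] if a < b else np.vstack([poly[a:], poly[:b]])
        if len(seg) >= 2:
            groups.append((seg, cv2.arcLength(seg, False)))
    return groups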
Rule-based Representation of the Hand Model.
The hand model encodes the finger configuration that
characterizes a specific hand posture. A different rule
is associated with each hand posture category, such that
the model is represented by the set of rules for the en-
tire posture vocabulary. For example, the third posture
in Figure 2 is modeled using the following rule:
if orient = vertical, thumb = 0, f_1 = 1, f_2 = 1, f_3 = 0, f_4 = 0,
(f_4, f_3) = l, (f_3, f_2) = l, (f_2, f_1) = v, (f_1, thumb) = l
then posture third,
where the two closed fingers f_3, f_4 and the closed thumb are expressed as '0' and the stretched
fingers f_1, f_2 as '1'. Joined pairs of fingers (f_i, f_j) are indicated by 'l' and the separated pair of fingers (f_2, f_1)
by 'v'. This rule, except for the orientation, which is encoded as a separate rule test, is represented
in our system as the string '0l1v1l0l0'. As an-
other example, the rule for the second hand posture is
encoded as ‘0l1l1l0l0’. We overcome the restriction
in the original approach of (Mo and Neumann, 2006)
that the palm must always face the camera by includ-
ing a second global orientation. The global hand ori-
entation is treated separately and it extends the space
of possible configurations. This is encoded as an extra
test in our rule-based model.
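As a small illustration (the helper function and posture names below are ours, not part of the system), such an encoding can be built by interleaving the five finger poses with the four pairwise relations:

def encode(fingers, relations):
    # fingers: five pose characters for thumb, f1..f4 ('1' stretched, '0' closed);
    # relations: four relation characters between neighboring fingers ('l' or 'v').
    assert len(fingers) == 5 and len(relations) == 4
    out = [fingers[0]]
    for rel, fin in zip(relations, fingers[1:]):
        out += [rel, fin]
    return "".join(out)

# The third posture: thumb, f3, f4 closed; f1, f2 stretched and separated.
assert encode(list("01100"), list("lvll")) == "0l1v1l0l0"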
Hand Posture Recognition. Starting from the de-
tected finger groups, we could learn the hand posture
models using, for example, decision trees (Breiman
et al., 1984). However, the goal of this work is to
show the benefits of a rule-based sys-
tem. Therefore, we assume a user-defined rule-based
model and we use it to recognize hand postures by
directly comparing the encoding of a newly extracted
posture s_1 from a test image with the rule encoding of each posture s_2 in the vocabulary. We quantify
the quality of a match as the number of characters that match. Because a finger configuration has nine
characters, the similarity is a score in the interval [0, 9], given by the formula 9 - dist_h(s_1, s_2),
where dist_h is the Hamming distance between the two strings.
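A minimal sketch of this matching step; the vocabulary dictionary and the posture names are illustrative:

def similarity(s1, s2):
    # Similarity in [0, 9]: nine characters minus the Hamming distance.
    assert len(s1) == len(s2) == 9
    return 9 - sum(a != b for a, b in zip(s1, s2))

def classify(encoding, vocabulary):
    # Return the vocabulary posture whose rule encoding matches best.
    return max(vocabulary, key=lambda name: similarity(encoding, vocabulary[name]))

vocabulary = {"posture2": "0l1l1l0l0", "posture3": "0l1v1l0l0", "posture4": "0l0l0l0l0"}
print(classify("0l1v1l0l0", vocabulary))   # -> posture3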
The number of degrees of freedom makes our pro-
posed model a sparse representation when the number
of hand poses to be recognized is small with respect
to all possible encodings. As a result, we also propose a nearest neighbor approximation, in which ev-
ery hand posture reference rule encoding is replaced
by its nearest neighbor encoding found in the training
data. As an alternative, the sparseness problem can be
solved by estimating the rules directly from the data,
such that only meaningful posture models are learned.
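A sketch of this approximation, assuming training_encodings is a (hypothetical) list of encoding strings extracted from the training frames:

def nn_approximation(vocabulary, training_encodings):
    # Replace each reference rule encoding by the training encoding that
    # agrees with it on the largest number of characters.
    return {
        name: max(training_encodings,
                  key=lambda s: sum(a == b for a, b in zip(ref, s)))
        for name, ref in vocabulary.items()
    }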
4 EXPERIMENTS
We evaluated our approach on a real-world dataset
which contains 8 different hand postures and was ob-
tained from a Kinect device. The postures are illus-
trated in Table 1. The dataset was collected from 8
different persons. A first subset was used for param-
eter estimation and validation, in both segmentation
and recognition steps; the remaining part is the test
data. The training and validation set contains 1100
frames, while the test set contains 400 frames covering all postures.
Evaluation. We evaluate the recognition perfor-
mance of individual hand postures using the nearest
neighbor approximation. We report results in terms
of recall (R), precision (P) and accuracy (Acc). The
confusion matrix obtained is shown in Table 1. Per-
formance results are shown in Table 2.
Table 1: Confusion matrix for the nearest neighbor approximation (rows: correct posture, columns: predicted posture; postures numbered 1-8).
           1     2     3     4     5     6     7     8
   1     90%   10%    0%    0%    0%    0%    0%    0%
   2      4%   96%    0%    0%    0%    0%    0%    0%
   3      2%    4%   56%   38%    0%    0%    0%    0%
   4      0%    6%    0%   94%    0%    0%    0%    0%
   5      0%    0%    2%    0%   96%    0%    2%    0%
   6      0%    0%    0%    0%    0%   70%   30%    0%
   7      0%    0%    0%    0%    0%    2%   98%    0%
   8      0%    0%    0%    0%    4%   42%    2%   52%
Table 2: Performance results for the nearest neighbor approximation.
           1     2     3     4     5     6     7     8
   R     90%   96%   56%   94%   96%   70%   98%   52%
   P     94%   83%   97%   71%   96%   61%   74%  100%
   Acc   98%   97%   94%   95%   99%   91%   96%   94%
As the confusion matrix shows, Posture 3 is often confused with Posture 4, Posture 8 with Posture 7,
and Posture 6 with Posture 7. The false positives are explained by our chosen model, which does not
consider depth information. For example, Posture 3 (encoded '1v1v1l1l1') is confused with Posture 4
(encoded '0l0l0l0l0') because of inaccurate finger length estimation. If the fingers in Posture 3 are
considered folded instead of stretched, the encoding of Posture 3 becomes
‘1v0v0l0l0’. This is, indeed, much closer to Posture
4. Also, the training data showed that distinguishing folded from stretched fingers is not always possible
using 2D information alone. This could be improved by also considering depth information.
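As a sanity check, the values in Table 2 can be approximately recovered from the row-normalized confusion matrix in Table 1, under the assumption that the test set contains an equal number of frames per posture:

import numpy as np

# Confusion matrix from Table 1 (rows: correct posture, columns: predicted).
C = np.array([
    [90, 10,  0,  0,  0,  0,  0,  0],
    [ 4, 96,  0,  0,  0,  0,  0,  0],
    [ 2,  4, 56, 38,  0,  0,  0,  0],
    [ 0,  6,  0, 94,  0,  0,  0,  0],
    [ 0,  0,  2,  0, 96,  0,  2,  0],
    [ 0,  0,  0,  0,  0, 70, 30,  0],
    [ 0,  0,  0,  0,  0,  2, 98,  0],
    [ 0,  0,  0,  0,  4, 42,  2, 52],
]) / 100.0

tp = np.diag(C)
recall = tp / C.sum(axis=1)                 # rows already sum to 1
precision = tp / C.sum(axis=0)              # valid only with equal class sizes
accuracy = 1 - (C.sum(axis=0) + C.sum(axis=1) - 2 * tp) / C.shape[0]
print(np.round(100 * recall), np.round(100 * precision), np.round(100 * accuracy))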
Comparison to Related Work. We compare against the recently introduced 3D SURF descriptor (Knopp et al.,
2010), which has not been investigated yet for hand
gesture recognition. Our aim is to show that, although
depth information is essential for robust hand seg-
mentation and may provide benefits for the posture
recognition on its own, 3D descriptors, in their cur-
rent state, do not pay off for certain problems. We
show that using our approach we obtain better results
than using a more expensive 3D descriptor, which re-
quires both more data and computational time to train
a model. We consider a bag of words approach to-
gether with a multi-class SVM classifier, as in (Knopp et al., 2010). We obtain the follow-
ing results for 3D SURF: R = 64.5%, P = 71.0% and
Acc = 87.88%, as opposed to R = 81.5%, P = 84.5%
and Acc = 95.5% for our rule-based approach.
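For completeness, the baseline pipeline can be sketched as follows, assuming the per-frame 3D SURF descriptors are computed elsewhere; the codebook size and the kernel are illustrative choices rather than the exact settings of (Knopp et al., 2010).

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_histograms(descriptor_sets, kmeans):
    # Quantize each frame's descriptors against the codebook and build a
    # normalized visual-word histogram per frame.
    hists = []
    for d in descriptor_sets:
        words = kmeans.predict(d)
        h, _ = np.histogram(words, bins=np.arange(kmeans.n_clusters + 1))
        hists.append(h / max(h.sum(), 1))
    return np.array(hists)

def train_bow_svm(train_descriptors, train_labels, n_words=100):
    # Codebook from all training descriptors, then a multi-class SVM on histograms.
    kmeans = KMeans(n_clusters=n_words).fit(np.vstack(train_descriptors))
    clf = SVC(kernel="linear").fit(bow_histograms(train_descriptors, kmeans), train_labels)
    return kmeans, clf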
REFERENCES
Altun, O. and Albayrak, S. (2011). Turkish fingerspelling
recognition system using generalized hough trans-
form, interest regions, and local descriptors. Pattern
Recognition Letters, 32(13):1626–1632.
Barczak, A. L. C. and Dadgostar, F. (2005). Real-time hand
tracking using a set of cooperative classifiers based on
haar-like features. In Research Letters in the Informa-
tion and Mathematical Sciences, pages 29–42.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone,
C. J. (1984). Classification and Regression Trees.
Wadsworth.
Darom, T. and Keller, Y. (2012). Scale-invariant features
for 3-d mesh models. IEEE Transactions on Image
Processing, 21(5):2758–2769.
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., and
Twombly, X. (2007). Vision-based hand pose estima-
tion: A review. CVIU, 108(1-2):52–73.
Guyon, I. and Athitsos, V. (2011). Demonstrations and live
evaluation for the gesture recognition challenge. In
ICCV Workshops, pages 461–462.
Guyon, I., Athitsos, V., Jangyodsuk, P., Hamner, B., and Es-
calante, H. J. (2012). Chalearn gesture challenge: De-
sign and first results. In CVPR Workshop on Gesture
Recognition and Kinect Demonstration Competition.
Holt, B., Ong, E.-J., Cooper, H., and Bowden, R. (2011).
Putting the pieces together: Connected Poselets for
Human Pose Estimation. In Workshop on Consumer
Depth Cameras for Computer Vision.
Izadi, S., Kim, D., Hilliges, O., Molyneaux, D., Newcombe,
R., Kohli, P., Shotton, J., Hodges, S., Freeman, D.,
Davison, A., and Fitzgibbon, A. (2011). Kinectfu-
sion: real-time 3d reconstruction and interaction using
a moving depth camera. In ACM symposium on User
interface software and technology, UIST.
Keskin, C., Kirac, F., Kara, Y. E., and Akarun, L. (2011).
Real time hand pose estimation using depth sensors.
In ICCV Workshops, pages 1228–1234.
Knopp, J., Prasad, M., Willems, G., Timofte, R., and
Van Gool, L. (2010). Hough transform and 3d surf
for robust three dimensional classification. In ECCV.
Liwicki, S. and Everingham, M. (2009). Automatic recog-
nition of fingerspelled words in British sign language.
In IEEE Workshop for Human Communicative Behav-
ior Analysis, pages 50–57.
Mo, Z. and Neumann, U. (2006). Real-time Hand Pose
Recognition Using Low-Resolution Depth Images.
IEEE Computer Society Conference on Computer Vi-
sion and Pattern Recognition, 2(c):1499–1505.
Pugeault, N. and Bowden, R. (2011). Spelling It Out: Real–
Time ASL Fingerspelling Recognition. In Workshop
on Consumer Depth Cameras for Computer Vision.
Ren, Z., Meng, J., Yuan, J., and Zhang, Z. (2011a). Robust
hand gesture recognition with kinect sensor. In ACM MM, pages 759–760, New York, NY, USA. ACM.
Ren, Z., Yuan, J., and Zhang, Z. (2011b). Robust hand
gesture recognition based on finger-earth mover’s dis-
tance with a commodity depth camera. In ACM MM,
pages 1093–1096, New York, NY, USA. ACM.
Suryanarayan, P., Subramanian, A., and Mandalapu, D.
(2010). Dynamic hand pose recognition using depth
data. In ICPR, pages 3105–3108.
Van den Bergh, M. and Van Gool, L. J. (2011). Combin-
ing rgb and tof cameras for real-time 3d hand gesture
interaction. In WACV, pages 66–72.
ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods
542