Self-scaling Kinematic Hand Skeleton for Real-time 3D Hand-finger Pose Estimation
Kristian Ehlers and Jan Helge Klüssendorff
Institute of Computer Engineering, Universität zu Lübeck, Lübeck, Germany
Keywords:
Hand-finger Pose Estimation, Self-scaling Kinematic Skeleton, Hand Skeleton Tracking, Gesture Detection
Abstract:
Since low-cost RGB-D sensors have become available, gesture detection has gained more and more interest in
the field of human-computer and human-robot interaction. It is possible to navigate through interactive menus
by waving one's hand and to confirm menu items by pointing at them. Such applications require real-time
body or hand-finger pose estimation algorithms. This paper presents a kinematic approach to estimate the
full pose of the hand including the angles of the finger joints. A self-scaling kinematic hand skeleton model
is presented and fitted into the 3D data of the hand in real-time on standard hardware with up to 30 frames
per second without using a GPU. The approach is based on least-squares minimization and an intelligent
choice of the error function. The tracking accuracy is evaluated on the basis of a recorded dataset as well as
simulated data. Qualitative results are presented to emphasize the tracking ability under hard conditions like
full hand turning and self-occlusion.
1 INTRODUCTION
Gestures are part of our daily lives, e.g., some peo-
ple gesticulate while they are talking to emphasize
their intention and young children point at things they
would like to have. People greet each other over larger
distances by simply waving their hand and policemen
control the traffic with certain arm movements.
As gestures are a common, intuitive, and simple
form of communication, human pose estimation and
gesture detection are gaining more and more inter-
est in the field of human computer interaction (Han
et al., 2013). They can be used for the touchless con-
trol of applications or to play games on consoles like
the Xbox 360 in combination with the low-cost RGB-
D sensor Microsoft Kinect, which can generate a 3D
point cloud of the scene in front of the camera. In such
use cases, gestures are defined on the basis of move-
ments or the arrangement of body parts like swiping
the hand or jumping.
Regarding applications with limited space in front
of the camera, full body movements are unfavorable.
Therefore, specific poses of the hand and, in more de-
tail, the fingers are defined as hand or hand-finger ges-
tures. For instance, the thumbs-up gesture is used to
indicate that everything is fine and even sign language
is based on hand and hand-finger gestures.
The detection of both body gestures and hand-
finger gestures requires efficient algorithms to esti-
mate and classify full body poses or hand-finger poses
in real-time.
One of the major issues in estimating the pose of a
human skeleton is the large state space. For example,
the hand is a highly articulated object and it is often
modeled using at least 26 degrees of freedom (Erol
et al., 2007; ElKoura and Singh, 2003; Oikonomidis
et al., 2010).
However, this paper proposes an approach for full hand-finger pose estimation based on a self-scaling
kinematic hand skeleton model: the positions of the fingers' joints and other features like the fingertips
as well as the joints' angles are estimated by using inverse kinematics and simple model-data distances
to solve the formulated optimization problem. Further, the model determines the wrist's joint position
describing the relation between hand and arm. Since the size of the human hand varies and a standard kinematic
skeleton has a fixed size, the presented kinematic skeleton is scaled during the optimization process.
This approach uses the data of a single depth camera and is not limited to human hands but can be
extended to arbitrary skeletons.
A support vector machine (SVM) is trained with
different hand-finger poses and used for the finger
gesture classification.
The pose estimation approach is evaluated in de-
tail with respect to fingertip tracking accuracy and
performance as well as the correct determination of
the hand’s configuration, i.e., the joints’ angles. De-
termining the wrist’s joint position requires the 3D
points of a small part of the arm. To the authors' knowledge, the available datasets of annotated hand
skeletons and depth images do not contain this information and are therefore not usable here
(Oikonomidis et al., 2010; Schröder et al., 2013; Sridhar et al., 2013; Qian et al., 2014). Therefore, ground
truth data is generated by recording robot-based movements of a gypsum hand model, and additional hand
movements performed by a test subject are recorded and annotated manually. In addition, the self-scaling ability is
evaluated and joint movements are simulated to de-
termine the correctness of the joint position estima-
tion. Moreover, the results of qualitative evaluation
and performance tests are presented.
Since the hand pose estimation is based on the 3D data of the hand, an approach for the initial detection
of the hand points is presented as well.
This paper is organized as follows: Section 2 de-
scribes related work on hand-finger pose estimation
and finger gesture detection and presents some of the
main differences to the developed method. An ap-
proach to detect the hand inside a 3D scene, the kine-
matic hand skeleton model including the finger pose
estimation approach, and a gesture classifier based on
the joints’ angles are presented in Section 3. Experi-
ments and detailed evaluations are given in Section 4
followed by the conclusion presented in Section 5.
2 RELATED WORK
There are several approaches to estimate the pose of
the human hand using RGB or RGB-D image data.
Some of them are based on large datasets, calculated
beforehand, or a specific hand model. Sometimes
markers have to be worn whereas other approaches
are completely markerless.
Ren et al. determine the hand pose in the form
of finger gestures by introducing the Fingers-Earth
Mover’s Distance as a distance metric for hand dis-
similarities (Ren and Yuan, 2011; Ren et al., 2011).
The fingers are matched on the basis of the hand shape
given by the depth image of the Kinect. The cur-
rent gesture is determined by template matching on
a dataset, recorded beforehand.
Athitsos et al. determine plausible 3D hand con-
figurations by formulating image database indexing
problems (Athitsos and Sclaroff, 2003). For this pur-
pose, they compute a large database consisting of syn-
thetic hand images by rendering 26 basic shapes of
an articulated hand model at different orientations.
Given the current hand image, the algorithm matches
the corresponding edge image with the database edge
images by determining image distances like the cham-
fer distance. The closest match directly represents the
current hand configuration.
Horaud et al. estimate a hand model based on 3D
data by matching an articulated hand shape model us-
ing robust point registrations with expectation condi-
tional maximization (Horaud et al., 2011).
De La Gorce et al. recover the 3D hand pose from
monocular images through the minimization of an ob-
jective function using a quasi Newton approach based
on a parametric hand model (Gorce et al., 2011). The
objective function includes texture and shading infor-
mation to handle self-occlusions and the well-defined
image formation process determines a synthetic hand
image as similar as possible to the current image
based on the articulated hand model and the RGB im-
age.
Oikonomidis et al. determine the hand skeleton
based on a particular model and Particle Swarm Op-
timization (PSO) (Oikonomidis et al., 2010; Oikono-
midis et al., 2011). The pose of the hand is determined
by the minimization of an error function using PSO,
whereby it is based on skin and edge feature maps of
the current hand image and hypothesized poses.
Besides the aforementioned methods, there are ap-
proaches that use markers to estimate the pose of the
hands. Thus, the approach of Wang and Popović is based on a glove, colored with a specific pattern,
and a recorded dataset containing rasterized glove images of natural hand poses (Wang and Popović, 2009).
The current hand pose is then estimated by finding the best matching image in the database with respect
to a robust distance metric. Schröder et al. extend this approach to control an anthropomorphic robot
hand (Schröder et al., 2012). They save the pose-specific parameters of a kinematic hand model and
use them to render the corresponding glove images needed for the pose estimation.
In (Keskin et al., 2011), the authors present a
randomized decision forest (RDF) approach for hand
pose estimation. They adapt the well-known body
pose estimation approach of (Shotton et al., 2011) and
train a decision forest with hand pose data. They ex-
tend their approach to a multi-layered RDF network
in (Keskin et al., 2012) by developing a hand shape
RDF. For each hand shape, a special RDF is trained
to determine the parts of the hand.
The aforementioned approaches are based on
large datasets recorded beforehand or calculated data
such as decision forests. They often need to be pro-
cessed by the GPU to achieve acceptable frame rates.
This paper presents a real-time approach running with
up to 30 frames per second (FPS) on standard hard-
ware without GPU acceleration.
Aristidou and Lasenby propose an algorithm to it-
eratively adapt the joints of a kinematic hand model
using inverse kinematics and a marker-based opti-
cal motion capturing system (Aristidou and Lasenby,
2010).
In (Schröder et al., 2013), the authors use inverse kinematics to fit a virtual hand model into the 3D
kinematics to fit a virtual hand model into the 3D
data of the hand. The model consists of a triangu-
lated mesh with a kinematic hand skeleton. A least-
squares optimization is used to adapt the model to the
3D point cloud. Therefore, the distance between the
model and the current data is determined by the point-
to-triangle distance.
In (Liang et al., 2012), the authors present a three-
step Iterative Closest Point (ICP) based approach for
hand pose estimation based on an articulated 3D mesh
model. The global motion estimation is performed by
overlaying the current hand data with the model data
using the ICP method on the current hand data and the
corresponding model’s visible mesh points. Further,
a fingertip estimation and inverse kinematics result in
a first aligned model. The final step is the articulated
ICP to align the mesh model and the real data.
Ballan et al. also fit meshed hand models with un-
derlying skeletons into 3D data and extend the appli-
cation to interactions of both hands with objects, e.g.,
holding a small ball or folding the hands. They use
multiple camera views to handle mutual occlusions
(Ballan et al., 2012). There are other approaches cov-
ering human-object interaction tasks like grasping,
hand manipulation tasks, or hand tracking for com-
puter aided design (Wang et al., 2011; Wang et al.,
2013; Zhao et al., 2013).
Nevertheless, this paper concentrates on the task of hand-finger pose estimation and introduces a
kinematic model-based approach that differs from the aforementioned ones in all aspects except the use
of a kinematic model, which here is self-designed and extended. The kinematic skeleton is directly fitted
into the visible hand data corresponding to the hand's surface. The offset between the hand surface
and the bones is ignored during the non-linear least-squares optimization process, which leads to a
simpler data-model distance than the one presented in (Schröder et al., 2013). This simplification speeds
up the whole pose estimation. Further, it allows motion-dependent self-occlusions to be handled and
only one optimization process is needed. In addition, the kinematic model
is extended by an arm stub representing the pose in-
formation between hand and arm. One of the main
improvements with respect to other kinematic model
approaches like (Qian et al., 2014) is the model’s
self-scaling ability. The skeleton is scaled during the
whole estimation process and can handle every hand
size automatically.
To the authors' knowledge, the approach of Qian et al. presented in (Qian et al., 2014) is the only
one that achieves a robust hand-finger pose estimation with a speed of up to 25 FPS. Qian et al. use a
kinematic hand motion model with 26 degrees of freedom and combine ICP and PSO optimization to realize
fully articulated hand tracking based on the data of Intel's Creative Interactive Gesture Camera. They use a
sphere-based hand model and sample the hand's point cloud down to 256 points to achieve real-time
performance.
Despite the high number of 29 degrees of freedom of the kinematic skeleton presented here, the
pose estimation runs in real-time with 30 FPS, limited by the camera, without any use of the GPU. In addition,
the data of the hand is not downsampled. The real-time performance of 30 FPS is achieved using the
nearly 16,000 points of the hand at a distance of 50 cm to the camera as well as the 5,500 points
at a distance of 100 cm. The efficiency and the fact that there is no need for previously calculated large
datasets allow for real-time performance and make this approach well suited for embedded applications.
For some approaches, multiple steps are necessary to estimate the hand-finger pose, e.g., determination
of the hand shape or the fingertips followed by the determination of the full hand pose (Liang et al., 2012;
Keskin et al., 2012). The presented approach needs only one optimization process based on the simple
model-data distance. Furthermore, the approach is not limited to hand pose estimation and can deal with
arbitrary kinematic skeleton models.
3 FINGER POSE ESTIMATION
The task of hand-finger pose estimation can be subdi-
vided into two major problems: finding the hand in-
side the 3D scene and determining the pose. This sec-
tion presents solutions for both problems. In the fol-
lowing, the full hand pose and the hand-finger poses,
including the joints’ angles of the fingers, are used
synonymously.
3.1 Initial Hand Detection
One of the most challenging tasks, apart from esti-
mating the full hand pose, is the initial localization of
the hand inside a given 3D scene. Often many restric-
tions are made to the scene and hands’ positions, e.g.,
cropping the whole scene or assuming the hand is the
closest object to the camera (Ren et al., 2011).
This paper presents a simple and more flexible ap-
proach assuming a static camera and only one human
being inside the scene facing the camera. The first
depth image is stored as reference. For each frame,
the difference image with respect to the reference im-
age is calculated. Afterwards, a blob detection is per-
formed and the largest blob is assumed to represent
the human being. In each step, the reference image is
adapted to the current scene by updating a depth value
if the current value is larger than the corresponding
one inside the reference image. After a while, the ref-
erence image represents the background. For a usable
pose estimation, the hands should be situated in front
of the human being. Thus, all points outside the blob
as well as all points with a larger depth value than
the mean depth value of the blob representing the hu-
man body reduced by a threshold of about 20 cm are
removed. It is assumed that the two largest remain-
ing blobs represent the hands. It is hard to distinguish
the hands on the basis of their positions only, e.g., the
arms can cross each other and a left hand facing the
camera with the back of the hand looks like a right
hand facing the camera with the palm. Hence, for initialization, it is assumed that the palms of the hands
face the camera.
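The paper does not specify how this background subtraction is implemented; the following sketch (using OpenCV and NumPy, with example thresholds that are assumptions) illustrates the reference image update, the difference image, and the blob selection described above.

import cv2
import numpy as np

def segment_person_and_hands(depth, reference, body_offset_m=0.2):
    # Sketch of the described segmentation; thresholds and structure are assumptions.
    # Update the reference image: keep the larger (farther) depth value per pixel so
    # that the reference converges towards the static background over time.
    np.maximum(reference, depth, out=reference)

    # Difference image with respect to the reference and binarization.
    foreground = ((reference - depth) > 0.05).astype(np.uint8)   # 5 cm (assumed)

    # The largest blob is assumed to represent the human being.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(foreground)
    if n < 2:
        return [], reference
    person = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    person_mask = labels == person

    # Keep only points clearly in front of the body (mean depth minus ~20 cm).
    body_depth = np.mean(depth[person_mask])
    hand_mask = person_mask & (depth < body_depth - body_offset_m)

    # The two largest remaining blobs are assumed to represent the hands.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(hand_mask.astype(np.uint8))
    largest = 1 + np.argsort(stats[1:, cv2.CC_STAT_AREA])[::-1][:2]
    return [labels == i for i in largest], reference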
The blob of a hand is examined in a similar way as presented in (Raheja et al., 2011). The estimation of
the center of the palm is based on the distance transform, and the palm-corresponding pixels are removed.
The remaining sub-blobs are assumed to be the fingers and allow their tips to be determined. Since it is
assumed that the open hand is presented for the initial hand detection, there have to be five finger blobs.
The hand's orientation can be calculated as the mean of the vectors defined by the palm's center and the
fingertips. This information is used to initialize the kinematic skeleton's position and its orientation
around the z-axis of the camera.
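A minimal sketch of this initialization step is given below; the distance transform radius factor and the use of blob centroids as tip estimates are assumptions, since the paper only outlines the procedure.

import cv2
import numpy as np

def init_hand(hand_mask):
    mask = hand_mask.astype(np.uint8)

    # Palm center: the pixel with maximal distance to the hand contour.
    dist = cv2.distanceTransform(mask, cv2.DIST_L2, 5)
    palm_radius = float(dist.max())
    cy, cx = np.unravel_index(np.argmax(dist), dist.shape)

    # Remove the palm-corresponding pixels; the remaining blobs are the fingers.
    fingers = mask.copy()
    cv2.circle(fingers, (int(cx), int(cy)), int(1.3 * palm_radius), 0, -1)  # factor assumed
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(fingers)
    order = np.argsort(stats[1:, cv2.CC_STAT_AREA])[::-1][:5]
    tips = centroids[1:][order]                      # blob centroid used as tip estimate

    # Orientation around the camera's z-axis: mean of the palm-to-tip vectors.
    mean_vec = (tips - np.array([cx, cy])).mean(axis=0)
    yaw = np.arctan2(mean_vec[1], mean_vec[0])
    return (int(cx), int(cy)), tips, yaw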
3.2 Kinematic Hand Skeleton
In this paper, hand-finger pose estimation is not reduced to determining the positions of specific
hand features like the fingertips or the center of the palm. Rather, the joints' angles are estimated
using a self-scaling kinematic model approach which also determines the joints' positions.
The kinematic hand model is designed in accor-
dance with the human hand skeleton illustrated in Fig-
ure 1. It consists of nodes and edges corresponding to
the skeleton’s joints and bones. The carpal bones and
the wrist joint are modeled by one central node (CEN)
forming the anchor of the whole model. The basic coordinate system (CS_CEN) of the model is situated in
the CEN, where the z-axis faces the back of the hand, the y-axis faces the thumb, and the x-axis completes
the right-handed coordinate system facing the middle finger. All fingers and the so-called arm stub are
connected to the CEN. The arm stub represents the data inside the hand's 3D data corresponding to the lower
arm's distal part. It consists of two nodes connected to each other and to the CEN via edges. The wrist
can perform movements in two directions, i.e., flexion and extension as well as abduction and adduction.
Therefore, the arm stub can be rotated around the z- and the y-axis of the basic coordinate system.
Each finger is modeled by four edges correspond-
ing to the bones: the distal, medial, proximal pha-
langeal bones, and the metacarpal bone. The edges
connect the nodes representing the distal (DIP) and
proximal interphalangeal joints (PIP), the metacar-
pophalangeal joint (MCP), and the fingertip (TIP).
The edge representing the metacarpal bone is directly
connected to the central node. Each finger is allowed
to perform extension and flexion in all three joints as
well as abduction and adduction in the MCP.
To keep the model simple, the thumb is treated as a normal finger, although it has no medial
phalangeal bone. Hence, it is modeled in a way
that the model’s distal phalangeal edge corresponds
to the thumb’s phalangeal bone, the model’s medial
phalangeal edge corresponds to the thumb’s proxi-
mal phalangeal bone, and the model’s proximal pha-
langeal edge corresponds to the thumb’s metacarpal
bone. The model’s metacarpal edge bridges the dis-
tance between the thumb’s carpometacarpal joint and
the CEN. Consequently, in contrast to the fingers’
carpometacarpal joints, the thumb’s carpometacarpal
joint is modeled to be flexible in two directions.
The kinematic hand model is described with the aid of the Denavit-Hartenberg (DH) transformations.
If the joints' coordinate systems are situated in accordance with the DH conventions, the transformation
from a coordinate system CS_i to another one CS_{i+1} can be described with the following four parameters:
d - offset between both coordinate systems' origins along the z_i-axis, θ - rotation angle around the
z_i-axis to overlay x_i and x_{i+1}, a - offset between both coordinate systems' origins along the
x_{i+1}-axis, and α - rotation angle around the x_{i+1}-axis to overlay z_i and z_{i+1}. The rotation
inside a joint is given by the angle θ around the z-axis of the joint's coordinate system.
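The resulting transformation matrix is not printed in the paper; under the standard DH convention with the four parameters defined above, it takes the well-known form

T_i^(i+1)(θ, d, a, α) =
| cos θ   -sin θ · cos α    sin θ · sin α    a · cos θ |
| sin θ    cos θ · cos α   -cos θ · sin α    a · sin θ |
|   0          sin α            cos α            d     |
|   0            0                0               1    |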
Each finger is modeled by a kinematic chain consisting of five coordinate systems, one for each joint's
movement axis and one for the fingertip. Figure 1(b) illustrates the way the model is placed inside the hand
based on the back view of the right hand. The first coordinate system (CS_MCP1) is situated at the MCP node
realizing the abduction and adduction. The origin of the second coordinate system (CS_MCP2) overlays
CS_MCP1's origin and realizes the flexion and extension inside the metacarpophalangeal joint. The same
movements inside the interphalangeal joints PIP and DIP are modeled by CS_PIP and CS_DIP. To determine
the fingertip's position, the coordinate system CS_TIP is situated in TIP. Thus, the kinematic chain of a finger
is given by CS_MCP1 - CS_MCP2 - CS_PIP - CS_DIP - CS_TIP and the corresponding DH parameters listed in Table 1.

Figure 1: Back view of the kinematic hand skeleton model of the right hand. (a) The nodes are situated at the joints' positions and connected via edges to rebuild the human hand skeleton. All fingers and the thumb are modeled in the same way. Since the thumb has no middle phalangeal bone, the model's joint corresponding to the metacarpophalangeal joint is situated inside the carpometacarpal joint. (b) Each finger is modeled as a kinematic chain of the finger joints and the fingertip at the end. All five chains are connected to the central node situated at the central point of the wrist bones. The hand's pose is given as the pose of its basic coordinate system CS_CEN situated in the basic hand node CEN.

Table 1: Kinematic model's DH parameters from CS_MCP1 to CS_TIP. The rotation inside a joint is given by the corresponding θ. The lengths of the edges representing the phalangeal bones are given by a.

Finger joint        | d | θ       | a         | α
CS_MCP1 - CS_MCP2   | 0 | θ_MCP1  | 0         | π/2
CS_MCP2 - CS_PIP    | 0 | θ_MCP2  | a_MCPPIP  | 0
CS_PIP - CS_DIP     | 0 | θ_PIP   | a_PIPDIP  | 0
CS_DIP - CS_TIP     | 0 | θ_DIP   | a_DIPTIP  | 0
As CS_CEN is the anchor of the whole model, the pose of CS_MCP1 is determined by the translations
along the x-, y-, and z-axes without any rotations, except for the thumb: the thumb's CS_MCP1 is rotated
by about 45° around the basic coordinate system's x-axis. The translation values and the lengths of the edges
representing the phalangeal bones define the size of the model and are determined by a manual measurement
of an X-ray image of a hand.
The model has 28 degrees of freedom (DOFs), i.e., 20 rotations inside the finger joints, two rotations of
the arm stub with respect to the CEN, and the pose of CS_CEN inside the camera coordinate system described
by three translations as well as the yaw, pitch, and roll angles. The forward kinematics allows the positions
of the hand's features inside the camera coordinate system to be estimated based on given joint and pose
parameters corresponding to the described DOFs as well as the bone lengths. For instance, the position of
a fingertip is given by
TIP = T_CS_CEN · T_CS_MCP1 · DH_CS_MCP2 · DH_CS_PIP · DH_CS_DIP · DH_CS_TIP · (0, 0, 0, 1)^T    (1)
with the pose matrix T_CS_CEN describing CS_CEN's pose inside the camera coordinate system and
T_CS_MCP1 describing CS_MCP1's pose inside CS_CEN. Furthermore, DH_CS_MCP2 to DH_CS_TIP
describe the poses of the corresponding coordinate systems with respect to the previous coordinate system
of the finger's kinematic chain and are given by the DH parameters (see Table 1) and the bone lengths.
The poses of the other features can be determined by removing the unnecessary pose matrices from Equation 1,
e.g., the position of CS_MCP2 is given by
MCP2 = T_CS_CEN · T_CS_MCP1 · DH_CS_MCP2 · (0, 0, 0, 1)^T.    (2)
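As a concrete illustration of Equations 1 and 2, the following sketch evaluates the forward kinematics of one finger chain; the function and variable names are illustrative, T_cen and T_mcp1 are assumed to be given as 4x4 homogeneous matrices, and the global scaling factor applied to the bone lengths anticipates the self-scaling introduced in the next paragraph.

import numpy as np

def dh(theta, d, a, alpha):
    # Homogeneous Denavit-Hartenberg transformation (standard convention).
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.,       sa,       ca,      d],
                     [0.,       0.,       0.,     1.]])

def fingertip(T_cen, T_mcp1, thetas, bone_lengths, scale=1.0):
    # T_cen: pose of CS_CEN in the camera frame, T_mcp1: pose of CS_MCP1 in CS_CEN.
    # thetas: (theta_MCP1, theta_MCP2, theta_PIP, theta_DIP) joint angles.
    # bone_lengths: (a_MCPPIP, a_PIPDIP, a_DIPTIP), scaled by the global factor.
    th1, th2, th3, th4 = thetas
    a1, a2, a3 = [scale * l for l in bone_lengths]
    T = (T_cen @ T_mcp1
         @ dh(th1, 0.0, 0.0, np.pi / 2)   # CS_MCP1 -> CS_MCP2
         @ dh(th2, 0.0, a1, 0.0)          # CS_MCP2 -> CS_PIP
         @ dh(th3, 0.0, a2, 0.0)          # CS_PIP  -> CS_DIP
         @ dh(th4, 0.0, a3, 0.0))         # CS_DIP  -> CS_TIP  (Equation 1)
    return (T @ np.array([0.0, 0.0, 0.0, 1.0]))[:3]

Dropping the trailing DH factors of the product yields the positions of the intermediate joints, as in Equation 2.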
Since the bone lengths have to be defined in advance, the kinematic skeleton has a fixed size. To
overcome this problem, the measured bone lengths as well as the values describing the position of
CS_MCP1 with respect to CS_CEN are multiplied by a global scaling factor α. This factor represents the 29th
degree of freedom and is one of the main differences to kinematic models known from other approaches
as presented in (Sridhar et al., 2013; Schröder et al., 2013; Ballan et al., 2012). Other differences are the
additional arm stub nodes, which are used to model the correspondence between hand and arm via the angles
of the wrist joint. Further, the MCP joint of the thumb is simplified and initially rotated.
As suggested by other approaches, the joint angles
are limited to prevent impossible hand-finger poses
(Lee and Kunii, 1995).
3.3 Pose Estimation
Estimating the hand-finger pose means determining the 29 model parameters in such a way that the kinematic
model represents the hand's 3D data, e.g., the fingertips should be situated at the center of all 3D
points corresponding to the real fingertip. The same should be realized for all nodes of the model.
Assuming that the target positions of the model's nodes inside the current data are known, the model
can be moved into the center of the data and the parameters can be estimated using inverse kinematics
with a non-linear least-squares optimization approach, defining f(p) as the positions of the model's
nodes described in Section 3.2 depending on the 29 model parameters p. Thus, f(p) represents a vector
containing the x-, y-, and z-coordinates of each node of the model. The optimization of the parameter
vector p is done iteratively using a version of the Levenberg-Marquardt approach whereby the parameter
vector p_{i+1} in the next iteration step is given by
p_{i+1} = p_i - (H + diag[H])^{-1} · J(p_i)^T · e(p_i).    (3)

In this equation, p_i represents the model parameters at iteration step i, J and H are the Jacobian and Hessian
matrices of f, and e(p_i) is the error function depending on p_i.
The choice of the error function is essential for the optimization of the parameter vector. Here, t_i is
defined as the vector containing the nodes' target positions for the optimization corresponding to f(p_i) at
iteration i, and the error function e(p_i) is given as the difference vector

e(p_i) = t_i - f(p_i).    (4)
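A minimal sketch of one such iteration is given below; the Jacobian is obtained by finite differences here (the paper does not state how it is computed), and it is taken with respect to the residual e(p), so the minus sign of Equation 3 indeed reduces the squared error.

import numpy as np

def numerical_jacobian(f, p, eps=1e-6):
    # Finite-difference Jacobian of f at p (illustrative choice).
    f0 = f(p)
    J = np.zeros((f0.size, p.size))
    for k in range(p.size):
        dp = np.zeros_like(p)
        dp[k] = eps
        J[:, k] = (f(p + dp) - f0) / eps
    return J

def lm_step(f, p, targets):
    # One update in the form of Equation 3 with the error of Equation 4,
    # where f(p) stacks the x-, y-, and z-coordinates of all model nodes.
    e = targets - f(p)                   # Equation 4
    J = -numerical_jacobian(f, p)        # Jacobian of the residual e(p) = t - f(p)
    H = J.T @ J                          # Gauss-Newton approximation of the Hessian
    A = H + np.diag(np.diag(H))          # damping term diag[H] as in Equation 3
    return p - np.linalg.solve(A, J.T @ e)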
Given the tracked kinematic skeleton of the previ-
ous frame and the current hand data as shown in Fig-
ure 2(a), the target position of a node is determined
by the mean of all its assigned 3D points of the hand:
all points with a smaller distance to this node than to
all the other nodes of the model. The point assign-
ment is illustrated in Figure 2(b). The skeleton deter-
mined on the basis of the previous frame is adapted to the target node positions, as illustrated in Figure 2(c).
Determining the target points as described above corresponds to fitting the model to the hand's surface and
is one of the simplifications made to accelerate the whole optimization process. In some applications or
approaches, this distance between the surface and the real skeleton position is handled explicitly (Shotton
et al., 2011; Schröder et al., 2013). Nevertheless, the coordinate systems of all joints are known and it is
easy to determine the joint positions inside the hand by adding a simple offset along the joint's z-axis.
The optimization of the parameter vector is performed for a predefined number of iterations and the
assignment is renewed several times per point cloud, e.g., 20 iterations with a renewal every four
iterations are sufficient.
Since the initial hand pose is unknown, it is expected
that the initial pose is the open hand, i.e., all fingers
should be extended. Furthermore, the hand detection
approach described in Section 3.1 delivers the hand’s
orientation as the rotation angle around the camera’s
z-axis and the initial model parameters are set to zero
except the hand’s basic coordinate system’s yaw an-
gle and its position which is determined by the mean
of the hand’s 3D point cloud. In the following, the ini-
tial parameter vector used for the current point cloud
is given by the resulting parameter vector of the pre-
vious image. The point cloud itself is determined by
all points within a sphere with a predefined diameter
around the model’s current position.
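The nearest-node assignment and the resulting target positions can be sketched as follows (a brute-force version with assumed names; the paper does not describe the implementation):

import numpy as np

def node_targets(points, node_positions):
    # points: (N, 3) hand point cloud, node_positions: (M, 3) current node positions.
    # Each point is assigned to its closest node; the target of a node is the mean
    # of its assigned points, or None if no points were assigned (e.g., the node is
    # occluded), in which case it is ignored during the optimization.
    d = np.linalg.norm(points[:, None, :] - node_positions[None, :, :], axis=2)
    nearest = np.argmin(d, axis=1)
    targets = []
    for k in range(node_positions.shape[0]):
        assigned = points[nearest == k]
        targets.append(assigned.mean(axis=0) if len(assigned) else None)
    return targets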
The scaling is performed by including the global scaling factor α in the parameter vector. A major problem
with hand movements is self-occlusion, e.g., during a vertical rotation of the hand in front of the camera,
the thumb and the index finger can cover the remaining fingers so that there is no 3D information
corresponding to these fingers available. In such cases, no points are assigned to the corresponding model
nodes and these nodes are ignored during the optimization process. This simple assumption allows
self-occlusions of single parts of the hand and even full rotations of the hand to be handled.

Figure 2: Determination of the target node positions. (a) Kinematic skeleton of the previous frame inside the current hand data. (b) The points assigned to the nodes and the target positions given by the corresponding mean values are highlighted as dark dots. (c) Adapted skeleton inside the data.
3.4 Finger Gesture Classification
Finger gestures are defined as specific finger poses,
e.g., configurations of extended and not extended fin-
gers. A well-known finger gesture is the OK sign
where the thumb’s and the index finger’s tips touch
each other and the remaining fingers are extended.
The developed hand-finger pose estimation approach delivers the full hand configuration and is independent
of the hand's size. Hence, it is well suited for finger-angle-based gesture classification. Therefore, a
support vector machine is trained with recorded hand-finger poses given by vectors containing the joint
angles. For each gesture, only a few poses are recorded. The gesture names are based on their extended
fingers from the thumb to the little finger, e.g., the OK sign is named MRL and the open hand leads to
TIMRL. The fist and the so-called vulcanian greeting are defined as additional gestures. The resulting
gesture catalog contains 29 finger gestures.
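A minimal sketch of such a joint-angle-based classifier is shown below using scikit-learn; the paper does not state the SVM implementation, kernel, or parameters, so these as well as the file names are assumptions.

import numpy as np
from sklearn import svm

# One row per recorded pose containing the estimated joint angles (e.g., the 20
# finger joint angles); labels are gesture names such as "MRL" or "TIMRL".
angle_vectors = np.load("gesture_angles.npy")     # hypothetical training data
labels = np.load("gesture_labels.npy")

classifier = svm.SVC(kernel="rbf", C=1.0, gamma="scale")   # kernel and parameters assumed
classifier.fit(angle_vectors, labels)

# At runtime, the joint angles delivered by the pose estimation are classified:
current_angles = angle_vectors[:1]                # placeholder for a live estimate
print(classifier.predict(current_angles))         # e.g., ['MRL']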
The qualitative evaluation given in Section 4
presents skeleton model configurations inside the
hand’s 3D data for some of the defined gestures.
4 EVALUATION
The evaluation of the presented approach is split into
five parts: pose estimation accuracy of the fingertips,
correctness of the joint angle estimation, self-scaling
ability, performance, and qualitative evaluation. For
all tests, the Asus Xtion Pro RGB-D sensor is used.
To determine the pose estimation accuracy, a test
subject performs four kinds of hand movements: ab-
duction and adduction of the fingers (ABAD), flex-
ion and extension of the fingers (FLEXEX), horizon-
tal and vertical rotations of the hand of about 180°
(ROT), and different movements like presenting a ges-
ture (DIFF). In each frame, the visible fingertips are
manually annotated and the tracking accuracy is de-
termined as the root mean squared error of the finger-
tip features. The recorded evaluation sequences con-
sist of 1385 frames.
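For clarity, the fingertip accuracy used throughout the evaluation is the RMSE of the Euclidean distances between annotated and estimated fingertip positions; a small sketch with assumed variable names:

import numpy as np

def fingertip_rmse(estimated, annotated):
    # Both arrays have shape (num_frames, 3); the RMSE is taken over the
    # per-frame Euclidean distances between estimated and annotated tips.
    d = np.linalg.norm(estimated - annotated, axis=1)
    return float(np.sqrt(np.mean(d ** 2)))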
The annotation of the data is based on the depth
image and, consequently, the noise of the 3D data de-
livered by the camera is completely ignored. There-
fore, an additional motion sequence (ROB) contain-
ing the real-world motion is recorded without cam-
era noise. To achieve this, a self-made gypsum hand model on the basis of an impression of a test subject's
hand is mounted on an Adept Viper industrial robot (www.adept.com) and moved within a volume of
80 cm × 40 cm × 70 cm
performing arbitrary translations as well as rotations.
Furthermore, the hand’s features are annotated man-
ually: the fingertips, all finger joints and the cen-
ter of the hand’s root. To estimate the positions of
the features for a given pose of the robot’s effector,
the hand model is measured using an NDI Polaris Spectra tracking system (www.ndigital.com) and a
previously calibrated
transformation matrix between the robot’s and track-
ing system’s coordinate systems is used to estimate
the transformation between the features and the ef-
fector. The robot’s manually controlled movements
with the gypsum hand are recorded with a Kinect at a
distance of 170 cm in front of the robot while the ef-
fector’s poses are logged and time synchronized. The
features’ mean values and a RANSAC-based SVD ap-
proach are used to determine the transformation be-
tween the robot’s and the camera’s coordinate sys-
tems. The evaluation sequence is about 2.2 min long and 1832 frames are left after the synchronization due
to the logging speed of the effector's poses being limited to 13 FPS.

Table 2: The root mean squared errors of the fingertip nodes' distances from the corresponding ground truth data for all evaluation sequences.

Motion  | Root Mean Squared Error [mm]
        | Little | Ring | Middle | Index | Thumb
ABAD    |  12.2  | 12.0 |  10.9  | 14.0  | 17.3
DIFF    |  10.4  | 12.6 |  12.6  | 13.6  | 15.9
FLEXEX  |  11.6  | 15.6 |  11.4  | 15.8  | 15.7
ROT     |  15.0  | 12.4 |  14.2  | 15.6  | 15.1
ROB     |  12.3  | 14.0 |  12.4  | 12.3  | 16.4
MEAN    |  12.3  | 13.3 |  12.3  | 14.3  | 16.1
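The robot-to-camera calibration described above aligns corresponding 3D points; a minimal sketch of the SVD-based rigid alignment that such a RANSAC loop repeatedly evaluates is given below (the RANSAC loop itself and the names are assumptions).

import numpy as np

def rigid_transform(src, dst):
    # Least-squares rotation R and translation t with dst ≈ R @ src + t,
    # estimated from corresponding 3D points of shape (N, 3) via SVD.
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t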
Table 2 illustrates the pose estimation accuracy of
the fingertips for all sequences given as the root mean
squared error (RMSE) of the distances between the
ground truth and the determined poses. The mean
tracking accuracy for the fingertips of the little fin-
ger and the middle finger is 12 mm, for the ring finger
about 13 mm, for the index finger about 14 mm, and
for the thumb about 16 mm. Further, the scaling fac-
tor is about 0.89 to 0.91 during all test sequences. The
results show that the pose estimation works correctly. There are several reasons for the remaining
distances. They can be caused by the differences between the proportions of the model and those of the
test subjects or the gypsum hand: the approach permits the scaling of the model, but the measured
proportions of the fingers are fixed. The differences of the ROB sequence can additionally be caused by the
camera's noise. Finally, the manual annotation process can cause some differences. Because the target
joint positions are chosen as the mean value of all assigned points, a joint should not reach the margin
of the hand data, which was annotated beforehand.
Since just evaluating the tracking accuracy of the
fingertips does not guarantee that the hand config-
uration is estimated correctly, additional evaluation
is needed. There is no markerless system available
which allows the determination of the joints’ angles of
a test subject’s hand while it is recorded with a depth
sensor. The evaluation of the joint angle estimation’s
correctness is thus based on a simulated hand model.
Therefore, the hand of the virtual human being delivered by the MakeHuman framework
(www.makehuman.org) was animated on the basis of hand and finger motions tracked beforehand. The
determined model parameters are used as ground truth data and directly fed into the virtual hand. A virtual
camera facing the hand's palm and the known camera model are used to emulate the RGB-D sensor's 3D data.
The motion sequence consists of sequential flexions and extensions of each finger from the little finger to
the thumb followed by the presentation of the IMR, IML, IRL, and TIL gestures. Table 3 presents the results
in the form of the RMSE of the recorded and tracked joints' angles for the full evaluation sequence
consisting of 6080 frames. To obtain this sequence, the data simulation is repeated several times with a
frame rate of 30 FPS. The mean RMSE is about 12.1°, which indicates that the basic hand configuration is
estimated correctly and should allow a gesture classification based on the joints' angles. The main reason
for the differences between the angles is the animated virtual hand model, e.g., the flexion of the little
finger as well as most motions of the thumb result in abnormal distortions of the hand surface.

Table 3: The root mean squared errors of the joint angles for the tracking of the simulated hand model sequentially performing flexion and extension of each finger and presenting the gestures IMR, IML, IRL and TIL.

Finger  | Root Mean Squared Error [°]
        | θ_MCP1 | θ_MCP2 | θ_PIP | θ_DIP
Little  |  17.5  |  18.3  | 10.6  |  5.5
Ring    |   9.2  |  13.3  |  6.8  |  9.9
Middle  |   9.3  |  14.2  |  6.1  |  8.4
Index   |  17.6  |   8.9  | 13.6  |  7.0
Thumb   |  24.8  |  16.6  | 13.4  | 10.0
To evaluate the self-scaling ability, a test subject
presents his open hand and the scaling factor is de-
termined manually as 0.9. Afterwards, the pose estimation approach is restarted with the initial scaling
factors 0.5, 0.8, 0.9, 1.0, and 1.5, corresponding to scalings that are too small or too large with respect
to the manually determined value. In all cases, the approach converges to scaling factors of 0.87 to 0.89,
indicating a correct scaling. Even for the extreme initial values of 0.5 and 1.5, the correct factors are
reached after 1 to 3 frames.
Figure 3: (a)-(c) Tracking failures caused by fast finger motions. (d) Model shift occurred during fast waving of the hand.
The performance is tested on two standard sys-
tems. The tracking algorithm is integrated in the
ROS framework (www.ros.org) and the delivered OpenNI driver is
used for the RGB-D data acquisition. Vertical rota-
tions of the open hand of a test subject at three dif-
ferent distances between the hand and the camera are
qualitatively observed and the number of points be-
longing to the hand data are recorded. There are about
16,000 points inside the hand’s point cloud at a dis-
tance of 0.5 m, 5,500 points at a distance of 1.0 m,
and about 1,000 points at 2.0 m. The tracking is cor-
rect for all distances and the achieved performance
represented by the frame rate of the hand poses is the
same. On a Notebook with an Intel Core i7-2640M
@ 2.80 GHz CPU, a frame rate of 24 FPS is achieved
whereby the limitation is given by the data acquisition
and not by the tracking algorithm. On a desktop com-
puter with an Intel Core i7-3770 @ 3.40 GHz CPU,
the real-time frame rate of 30 FPS is reached.
The tests show that, if necessary, a downsampling
of the data is possible. This would reduce the com-
putational complexity without affecting the quality of
the pose estimation.
There is no annotated dataset publicly available which contains all the needed information, in this
case, the additional data of the arm. Hence, a direct comparison with approaches like the ones presented
in (Oikonomidis et al., 2010; Schröder et al., 2013; Sridhar et al., 2013; Qian et al., 2014) is not possible.
However, the evaluation results show that the approach presented here is accurate in terms of pose
estimation and joint angle determination and much faster than the named approaches. Further, no large
datasets recorded in advance are needed, and the performance is obtained on standard hardware without
using the GPU, which makes this approach well suited for embedded applications.
The qualitative evaluation is based on hand-finger
motions in front of a static depth camera. Very fast
finger movements can result in the loss of some model
fingers, as exemplarily shown in Figure 3(a). Sometimes the fast extension of the middle and ring finger
cannot be tracked due to a wrong point assignment caused by large differences between consecutive point
clouds. Another problem with finger tracking is the displacement of neighbouring fingers, as illustrated in
Figure 3(b): the tip of the middle finger is situated inside the real data of the tip of the index finger,
which was lost beforehand. A rarely occurring tracking failure, in which the index and middle fingers are
crossed, is shown in Figure 3(c). The reason for this is fast flexions and extensions of both fingers while
performing small vertical rotations of the hand. Even the fast waving of the open hand can result in a
slight tracking shift, illustrated in Figure 3(d), which is compensated after reducing the speed.
Some determined hand poses, while turning the
open hand, are given by the first row of Figure 4. The
approach allows to handle vertical rotations, whereas
some incorrect tracking results occur while rotating
the hand back (see Figure 4(d)). They are compen-
sated when the initial pose, where the palm faces the
camera, is reached. Figures 4(f)-(i) emphasize that partial self-occlusions, caused by configurations with
sharp angles between the palm of the hand and the camera, can be handled. Even poses where the back of the
hand faces the camera or the hand is rotated in an arbitrary direction are estimated correctly, as
illustrated in Figures 4(j)-(n). Furthermore, the approach allows the model to be adapted to gesture-specific
hand configurations, as shown in Figures 4(o)-(q). The
model is also able to track gestures where some fin-
gers are mostly parallel like the IML or the IMR ges-
tures (see Figure 4(r) and Figure 4(s)). Qualitative
tests show that the implemented gesture classifier is
able to distinguish the 29 trained gestures. Even sim-
ilar finger configurations like the open hand and the
so called vulcanian greeting shown in Figure 4(e) and
Figure 4(t) can be classified.
Figure 4: Determined hand skeleton model configurations with corresponding hand data. The first row shows some steps during the vertical rotation of the open hand of 180° and back. (f)-(i): The determined hand model in situations where there is a sharp angle between the camera and the hand's palm. (j)-(n): The pose can be tracked during arbitrary rotations and even when the back of the hand faces the camera. (o)-(t) show the model while presenting the gestures I, TL, IMRL, IML, IMR, and the so-called vulcanian greeting. The previously trained SVM is able to classify these gestures correctly.
5 CONCLUSION
In this paper, a real-time approach for full hand-finger pose estimation based on a self-scaling kinematic
hand skeleton model, which determines the hand pose as well as the finger joints' angles, was presented.
The model was iteratively adapted to the 3D data of the hand delivered by a depth camera using a
least-squares optimization approach. The data-model distance was simplified, allowing the whole pose
estimation to be done in a single optimization process without prior steps. Further, the model was
equipped with a self-scaling ability to handle different hand sizes automatically.
A detailed evaluation of the approach was given
including quantitative and qualitative results. It was shown that the approach allows the hand's skeleton
to be tracked under hard conditions such as turning the hand and presenting complex finger gestures. Fur-
thermore, the tracking performance on standard hard-
ware without using the GPU is up to 30 FPS, limited
by the camera’s speed. In addition, there are no train-
ing data or prior calculations required. Thus, the pre-
sented method is more efficient than most of the other
known hand-finger tracking approaches.
Future work will focus on the remaining problems
like handling very fast hand or finger movements and
adaptable hand proportions. Some improvements on
the support vector machine based gesture classifier
and a quantitative evaluation are planned. In addition,
an extension of the presented approach to estimating
the arm pose or even the full body pose is intended.
This would enable human-robot interaction applications like controlling an industrial robot or a robotic
hand and could be used for a simple teach-in proce-
dure. Even simultaneous tracking of both hands and
the body is planned.
REFERENCES
Aristidou, A. and Lasenby, J. (2010). Motion Capture
with Constrained Inverse Kinematics for Real-Time
Hand Tracking. In International Symposium on Communications, Control and Signal Processing, pages 3–5.
Athitsos, V. and Sclaroff, S. (2003). Estimating 3D hand
pose from a cluttered image. In Proceedings of the 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages II-432–9.
Ballan, L., Taneja, A., Gall, J., Gool, L. V., and Pollefeys,
M. (2012). Motion Capture of Hands in Action Us-
ing Discriminative Salient Points. In Fitzgibbon, A.,
Lazebnik, S., Perona, P., Sato, Y., and Schmid, C., ed-
itors, Computer Vision – ECCV 2012, volume 7577 of
Lecture Notes in Computer Science, pages 640–653.
Springer.
ElKoura, G. and Singh, K. (2003). Handrix: animating the
human hand. Eurographics symposium on Computer
animation.
Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., and
Twombly, X. (2007). Vision-based hand pose estima-
tion: A review. Computer Vision and Image Under-
standing, 108(1-2):52–73.
Gorce, M. D. L., Fleet, D. J., and Paragios, N. (2011).
Model-Based 3D Hand Pose Estimation from Monocular Video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33:1793–1805.
Han, J., Shao, L., Xu, D., and Shotton, J. (2013). Enhanced
computer vision with Microsoft Kinect sensor: a re-
view. IEEE transactions on cybernetics, 43(5):1318–
34.
Horaud, R., Forbes, F., Yguel, M., Dewaele, G., and Zhang,
J. (2011). Rigid and articulated point registration with
expectation conditional maximization. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence,
33(3):587–602.
Keskin, C., Kirac, F., Kara, Y. E., and Akarun, L. (2011).
Real time hand pose estimation using depth sensors.
In Computer Vision Workshops (ICCV Workshops),
2011 IEEE International Conference on, pages 1228–
1234.
Keskin, C., Kirac, F., Kara, Y. E., and Akarun, L. (2012).
Hand Pose Estimation and Hand Shape Classification
Using Multi-layered Randomized Decision Forests,
volume 7577. Springer Berlin Heidelberg.
Lee, J. and Kunii, T. (1995). Model-based analysis of hand
posture. Computer Graphics and Applications, IEEE,
15(5):77–86.
Liang, H., Yuan, J., and Thalmann, D. (2012). Hand Pose
Estimation by Combining Fingertip Tracking and Ar-
ticulated ICP. In Proceedings of the 11th ACM SIG-
GRAPH International Conference on Virtual-Reality
Continuum and Its Applications in Industry, VRCAI
’12, pages 87–90, New York, NY, USA. ACM.
Oikonomidis, I., Kyriazis, N., and Argyros, A. (2011). Ef-
ficient model-based 3D tracking of hand articulations
using Kinect. In Proceedings of the British Machine Vision Conference, pages 101.1–101.11.
Oikonomidis, I., Kyriazis, N., and Argyros, A. A. (2010).
Markerless and Efficient 26-DOF Hand Pose Recov-
ery. In Asian Conference on Computer Vision (ACCV), pages 744–757.
Qian, C., Sun, X., Wei, Y., Tang, X., and Sun, J. (2014).
Realtime and Robust Hand Tracking from Depth. In
IEEE Conference on Computer Vision and Pattern Recognition.
Raheja, J. L., Chaudhary, A., and Singal, K. (2011). Track-
ing of Fingertips and Centers of Palm Using KINECT.
Third International Conference on Computational In-
telligence Modelling Simulation, pages 248–252.
Ren, Z., Meng, J., and Yuan, J. (2011). Depth Camera
Based Hand Gesture Recognition and its Applications
in Human-Computer-Interaction. IEEE International
Conference on Information Communication and Sig-
nal Processing, (1):3–7.
Ren, Z. and Yuan, J. (2011). Robust hand gesture recog-
nition based on finger-earth mover’s distance with a
commodity depth camera. In Proceedings of the 19th ACM International Conference on Multimedia, pages 1–4.
Schröder, M., Elbrechter, C., Maycock, J., Haschke, R.,
Botsch, M., and Ritter, H. (2012). Real-Time Hand
Tracking with a Color Glove for the Actuation of An-
thropomorphic Robot Hands. In Proceedings of IEEE-
RAS International Conference on Humanoid Robots,
pages 262–269.
Schröder, M., Maycock, J., Ritter, H., and Botsch, M.
(2013). Analysis of Hand Synergies for Inverse Kine-
matics Hand Tracking. In IEEE International Con-
ference on Robotics and Automation, Workshop of
”Hand synergies - how to tame the complexity of
grasping”.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio,
M., Moore, R., Kipman, A., and Blake, A. (2011).
Real-time human pose recognition in parts from single
depth images. IEEE Conference on Computer Vision
and Pattern Recognition, pages 1297–1304.
Sridhar, S., Oulasvirta, A., and Theobalt, C. (2013). Inter-
active Markerless Articulated Hand Motion Tracking
Using RGB and Depth Data. 2013 IEEE International
Conference on Computer Vision, pages 2456–2463.
Wang, R., Paris, S., and Popović, J. (2011). 6D Hands:
Markerless Hand-tracking for Computer Aided De-
sign. In Proceedings of the 24th Annual ACM Sym-
posium on User Interface Software and Technology,
UIST ’11, pages 549–558, New York, NY, USA.
ACM.
Wang, R. Y. and Popović, J. (2009). Real-time hand-
tracking with a color glove. ACM Transactions on
Graphics, 28(3):1.
Wang, Y., Min, J., Zhang, J., Liu, Y., Xu, F., Dai, Q., and
Chai, J. (2013). Video-based hand manipulation cap-
ture through composite motion control. ACM Trans-
actions on Graphics, 32(4):1.
Zhao, W., Zhang, J., Min, J., and Chai, J. (2013). Robust re-
altime physics-based motion control for human grasp-
ing. ACM Transactions on Graphics, 32(6):1–12.