Figure 4: Example of hand fitting. (a) Input 3D depth data; (b) result of plane fitting.
ganization, 2010). For hand shape recognition, we
implemented the following four-step algorithm:
1. Find the hand regions in the color and depth images by using the skeleton information obtained by the OpenNI library.
2. Convert the 3D points of the hand into a normalized 2D image. To remove the wrist area and to analyze the hand shape, we find a plane that corresponds to the palm of the hand, using the RANSAC algorithm (Fischler and Bolles, 1981). Figure 4 shows an example of the input 3D points around the hand region and the obtained hand plane. We also apply principal component analysis to the points projected onto this plane to find the hand orientation and its size. Using these orientation and size parameters, we project the 3D points onto the normalized 2D image plane (a sketch of this step is given after the list).
3. Count the number of fingertips. This is done by
a simple template matching algorithm on the nor-
malized 2D image.
4. Classify the hand shape. This is done using the number of fingertips and the distribution of the 3D points.
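The following minimal sketch illustrates how step 2 could be realized with a RANSAC palm-plane fit followed by a PCA-based projection; it is an illustrative example only, and the parameter values (n_iters, inlier_thresh, out_size) and function names are hypothetical rather than the exact settings used in our system.

```python
# Illustrative sketch of step 2: RANSAC plane fit + PCA-based normalization.
import numpy as np

def fit_palm_plane(points, n_iters=200, inlier_thresh=0.01, rng=None):
    """points: (N, 3) array of 3D hand points. Returns (normal, d, inlier_mask)."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                       # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        d = -normal @ p0
        dist = np.abs(points @ normal + d)    # point-to-plane distances
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane[0], best_plane[1], best_inliers

def normalize_hand(points, out_size=64):
    """Project the hand points onto palm-plane axes and rasterize a 2D image."""
    _, _, inliers = fit_palm_plane(points)
    palm = points[inliers]
    centred = palm - palm.mean(axis=0)
    # PCA (via SVD) of the palm points: the leading axes give the hand
    # orientation, and the spread along them gives the size parameter.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    uv = (points - palm.mean(axis=0)) @ vt[:2].T   # 2D in-plane coordinates
    scale = np.abs(uv).max() + 1e-9                # size used for normalization
    pix = ((uv / scale * 0.5 + 0.5) * (out_size - 1)).astype(int)
    img = np.zeros((out_size, out_size), dtype=np.uint8)
    img[pix[:, 1], pix[:, 0]] = 255                # mark projected hand points
    return img
```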
As mentioned in Section 3, our scheme can help improve the recognition performance if the condition space contains “easy” and “difficult” regions. The improvement does not depend on the performance of the underlying recognition algorithm. Therefore, a simple algorithm such as the one in our prototype system is sufficient for the evaluation of our proposed idea.
4.1 Estimators for Precision and Recall
We built estimators of precision and recall, as we
mentioned in Section 3.1. For simplicity, this study
focuses on recognition of hand shape only.
First, we categorized the possible gestures into three classes according to the hand shape: the “opened hand” shape as C_A, the “pointing hand” shape as C_B, and other shapes as C_Z. We also assumed that the target gesture interface uses three binary classifiers: F_A, F_B, and F_Z.
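A minimal sketch of this structure, with hypothetical names (the paper does not prescribe a concrete representation), could look like:

```python
# Illustrative only: the three gesture classes and the three binary
# classifiers F_A, F_B, F_Z assumed by the target interface.
from enum import Enum
from typing import Callable, Dict

class Gesture(Enum):
    A = "opened hand"
    B = "pointing hand"
    Z = "other"

# Each F_X takes a hand observation (e.g. fingertip count and 3D point
# statistics from the steps above) and answers whether the shape is X.
BinaryClassifier = Callable[[dict], bool]

def gesture_interface(f_a: BinaryClassifier,
                      f_b: BinaryClassifier,
                      f_z: BinaryClassifier) -> Dict[Gesture, BinaryClassifier]:
    """Bundle the three per-class classifiers used by the target interface."""
    return {Gesture.A: f_a, Gesture.B: f_b, Gesture.Z: f_z}
```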
Then, we defined the condition vector with the following parameters: the user’s forearm direction, the 3D position of the hands, the 3D position of the feet, the speed of hand movement, and the depth image quality. For the last parameter, we estimate the image quality from the number of pixels for which the Kinect sensor does not acquire a distance. For the other parameters, we use the skeleton information obtained by the OpenNI library.
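As an illustration, the condition vector and the depth-quality feature could be assembled as follows; this is a sketch only, assuming that missing Kinect depth readings are encoded as 0 and that the skeleton quantities are already available from OpenNI (the helper names are hypothetical).

```python
# Illustrative sketch: building the condition vector from the five parameters.
import numpy as np

def depth_quality(depth_roi: np.ndarray, invalid_value: int = 0) -> float:
    """Fraction of pixels around the hand that carry a valid depth value."""
    missing = np.count_nonzero(depth_roi == invalid_value)
    return 1.0 - missing / depth_roi.size

def condition_vector(forearm_dir, hand_pos, foot_pos, hand_speed, depth_roi):
    """Concatenate forearm direction, hand/foot positions (from the OpenNI
    skeleton), hand speed, and depth image quality into one feature vector."""
    return np.concatenate([
        np.asarray(forearm_dir, dtype=float),   # 3D forearm direction
        np.asarray(hand_pos, dtype=float),      # 3D hand position
        np.asarray(foot_pos, dtype=float),      # 3D foot position
        [float(hand_speed)],                    # speed of hand movement
        [depth_quality(depth_roi)],             # depth image quality
    ])
```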
Using this condition vector and the corresponding condition space, we captured a total of 15 minutes of data sequences from four participants. From these, we collected about 24,000 training samples, each consisting of a condition vector, a gesture label, and the outputs of the classifiers. Table 2 shows the details.

Table 2: Training dataset. The columns give the number of training samples for each actual gesture (#A = 8033, #B = 8086, #Z = 8799); the cells show how each classifier’s outputs split into TP/FP/FN/TN counts.

Outputs of classifiers | #A = 8033    | #B = 8086    | #Z = 8799
F_A(s) = A             | #TP_A = 6717 | #FP_A = 180  | #FP_A = 702
F_A(s) = Ā             | #FN_A = 1316 | #TN_A = 7906 | #TN_A = 8097
F_B(s) = B             | #FP_B = 200  | #TP_B = 4812 | #FP_B = 697
F_B(s) = B̄             | #TN_B = 7833 | #FN_B = 3274 | #TN_B = 8102

Note that acquiring the samples is not an onerous task. For example, to acquire the samples labeled with gesture C_A, we simply asked the participants to use the target gesture interface freely while keeping an “opened hand”. We then recorded the condition vectors of the participants together with the outputs of the classifiers; e.g., when the classifier F_A outputs A for such a sample, we automatically assign TP_A to it. Using this method, we can easily prepare a huge number of training samples and labels; no tedious tasks, such as manual annotation, are required.
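This labelling rule can be written down in a few lines; the sketch below is purely illustrative (hypothetical function name) and simply restates the TP/FP/FN/TN assignment described above.

```python
# Illustrative sketch of the automatic labelling: given the gesture class the
# participant was asked to hold (ground truth) and one classifier's binary
# output, assign TP/FP/FN/TN with no manual annotation.
def label_outcome(classifier_fired: bool, is_target_class: bool) -> str:
    if classifier_fired:
        return "TP" if is_target_class else "FP"
    return "FN" if is_target_class else "TN"

# Example: a participant holds an "opened hand" (class C_A) and F_A outputs A,
# so the sample is automatically counted as TP_A, as in Table 2.
assert label_outcome(True, True) == "TP"
assert label_outcome(True, False) == "FP"   # F_A fires on a C_B or C_Z sample
```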
Then, we built the estimators for precision and recall. To reduce the computational cost for real-time processing, we used a support vector machine (SVM) for the approximation. First, consider the classifier F_A. The goal is to decompose the condition space into two areas: one where F_A works accurately, and one where it works inaccurately. When the actual gesture is A, the outputs of F_A fall into two categories, TP and FN, where TP means the gesture is correctly recognized as A, and FN means it is misrecognized as C_B or C_Z.
We use the SVM to find the hyperplane that separates these TP and FN samples when the actual gesture is A. Using this hyperplane, we assume that the further the condition vector lies on the TP side, the higher the ratio #TP / (#TP + #FN) becomes. In other words, we can use the signed distance from the hyperplane as a substitute for the recall R_A of F_A.
Similarly, we can build a total of four hyperplanes for F_A: TP–FN and TN–FP for the recalls, and TP–FP and TN–FN for the precisions. These correspond to four estimators: R_A and R_Ā for recall, and P_A and P_Ā for precision.
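As a concrete illustration, one recall estimator R_A could be approximated as below. The paper does not name a particular SVM implementation; this sketch assumes scikit-learn’s LinearSVC, and the class and variable names are hypothetical.

```python
# Illustrative sketch of one recall estimator R_A (not the paper's exact code).
# X holds condition vectors of samples whose actual gesture is A; y is 1 for
# TP (F_A recognized the gesture as A) and 0 for FN.
import numpy as np
from sklearn.svm import LinearSVC

class RecallEstimatorA:
    def __init__(self):
        self.svm = LinearSVC()

    def fit(self, X, y):
        """Learn the hyperplane separating TP from FN in the condition space."""
        self.svm.fit(X, y)
        return self

    def score(self, condition_vector):
        """Signed distance from the TP/FN hyperplane; larger values indicate
        conditions under which F_A tends to succeed (i.e. higher recall)."""
        margin = self.svm.decision_function(np.atleast_2d(condition_vector))[0]
        return float(margin / np.linalg.norm(self.svm.coef_))

# With the appropriate TP/FP/FN/TN subsets of the training data, the same
# construction yields the remaining three estimators described in the text.
```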
As an evaluation of this approximation, we mea-
sured the histograms of TP, TN, FP, and FN in-