Figure 4: Example of hand fitting. (a) Input 3D depth data; (b) result of plane fitting.
ganization, 2010). For hand shape recognition, we
implemented the following four-step algorithm:
1. Find the hand regions in the color and depth images by using the skeleton information obtained by the OpenNI library.
2. Convert the 3D points of the hand into a normalized 2D image. To remove the wrist area and to analyze the hand shape, we find a plane that corresponds to the palm of the hand, using the RANSAC algorithm (Fischler and Bolles, 1981). Figure 4 shows an example of the input 3D points around the hand region and the obtained hand plane. We also apply principal component analysis to the points projected onto this plane to find the hand orientation and its size. Using these orientation and size parameters, we project the 3D points onto the normalized 2D image plane (a sketch of this step is given after the list).
3. Count the number of fingertips. This is done by
a simple template matching algorithm on the nor-
malized 2D image.
4. Classify the hand shape. This is done using the number of fingertips and the distribution of the 3D points.
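The following minimal sketch illustrates how step 2 could be realized with a RANSAC palm-plane fit followed by a PCA-based projection; it is an illustrative example only, and the parameter values (n_iters, inlier_thresh, out_size) and function names are hypothetical rather than the exact settings used in our system.

```python
# Illustrative sketch of step 2: RANSAC plane fit + PCA-based normalization.
import numpy as np

def fit_palm_plane(points, n_iters=200, inlier_thresh=0.01, rng=None):
    """points: (N, 3) array of 3D hand points. Returns (normal, d, inlier_mask)."""
    rng = rng or np.random.default_rng(0)
    best_inliers, best_plane = None, None
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                       # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        d = -normal @ p0
        dist = np.abs(points @ normal + d)    # point-to-plane distances
        inliers = dist < inlier_thresh
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane[0], best_plane[1], best_inliers

def normalize_hand(points, out_size=64):
    """Project the hand points onto palm-plane axes and rasterize a 2D image."""
    _, _, inliers = fit_palm_plane(points)
    palm = points[inliers]
    centred = palm - palm.mean(axis=0)
    # PCA (via SVD) of the palm points: the leading axes give the hand
    # orientation, and the spread along them gives the size parameter.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    uv = (points - palm.mean(axis=0)) @ vt[:2].T   # 2D in-plane coordinates
    scale = np.abs(uv).max() + 1e-9                # size used for normalization
    pix = ((uv / scale * 0.5 + 0.5) * (out_size - 1)).astype(int)
    img = np.zeros((out_size, out_size), dtype=np.uint8)
    img[pix[:, 1], pix[:, 0]] = 255                # mark projected hand points
    return img
```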
As mentioned in Section 3, our scheme can help improve the recognition performance if the condition space contains “easy” and “difficult” regions. The improvement does not depend on the performance of the underlying recognition algorithm. Therefore, a simple algorithm such as the one in our prototype system is sufficient for the evaluation of our proposed idea.
4.1 Estimators for Precision and Recall
We built estimators of precision and recall, as we
mentioned in Section 3.1. For simplicity, this study
focuses on recognition of hand shape only.
First, we categorized the possible gestures into three classes according to the hand shape: the “opened hand” shape as C_A, the “pointing hand” shape as C_B, and other shapes as C_Z. We also assumed that the target gesture interface uses three binary classifiers: F_A, F_B, and F_Z.
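A minimal sketch of this structure, with hypothetical names (the paper does not prescribe a concrete representation), could look like:

```python
# Illustrative only: the three gesture classes and the three binary
# classifiers F_A, F_B, F_Z assumed by the target interface.
from enum import Enum
from typing import Callable, Dict

class Gesture(Enum):
    A = "opened hand"
    B = "pointing hand"
    Z = "other"

# Each F_X takes a hand observation (e.g. fingertip count and 3D point
# statistics from the steps above) and answers whether the shape is X.
BinaryClassifier = Callable[[dict], bool]

def gesture_interface(f_a: BinaryClassifier,
                      f_b: BinaryClassifier,
                      f_z: BinaryClassifier) -> Dict[Gesture, BinaryClassifier]:
    """Bundle the three per-class classifiers used by the target interface."""
    return {Gesture.A: f_a, Gesture.B: f_b, Gesture.Z: f_z}
```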
Then, we defined the condition vector with the following parameters: the user’s forearm direction, the 3D position of the hands, the 3D position of the feet, the speed of hand movement, and the depth image quality. For the last parameter, we estimate the image quality from the number of pixels for which the Kinect sensor does not acquire a distance. For the other parameters, we use the skeleton information obtained by the OpenNI library.
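As an illustration, the condition vector and the depth-quality feature could be assembled as follows; this is a sketch only, assuming that missing Kinect depth readings are encoded as 0 and that the skeleton quantities are already available from OpenNI (the helper names are hypothetical).

```python
# Illustrative sketch: building the condition vector from the five parameters.
import numpy as np

def depth_quality(depth_roi: np.ndarray, invalid_value: int = 0) -> float:
    """Fraction of pixels around the hand that carry a valid depth value."""
    missing = np.count_nonzero(depth_roi == invalid_value)
    return 1.0 - missing / depth_roi.size

def condition_vector(forearm_dir, hand_pos, foot_pos, hand_speed, depth_roi):
    """Concatenate forearm direction, hand/foot positions (from the OpenNI
    skeleton), hand speed, and depth image quality into one feature vector."""
    return np.concatenate([
        np.asarray(forearm_dir, dtype=float),   # 3D forearm direction
        np.asarray(hand_pos, dtype=float),      # 3D hand position
        np.asarray(foot_pos, dtype=float),      # 3D foot position
        [float(hand_speed)],                    # speed of hand movement
        [depth_quality(depth_roi)],             # depth image quality
    ])
```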
Using this condition vector and the corresponding condition space, we captured a total of 15 minutes of data sequences from four participants. From these, we collected about 24,000 training samples, each consisting of a condition vector, a gesture label, and the outputs of the classifiers. Table 2 shows the details.

Table 2: Training dataset. The columns give the number of training samples for each actual gesture (#A = 8033, #B = 8086, #Z = 8799); the cells show how each classifier’s outputs split into TP/FP/FN/TN counts.

Outputs of classifiers | #A = 8033    | #B = 8086    | #Z = 8799
F_A(s) = A             | #TP_A = 6717 | #FP_A = 180  | #FP_A = 702
F_A(s) = Ā             | #FN_A = 1316 | #TN_A = 7906 | #TN_A = 8097
F_B(s) = B             | #FP_B = 200  | #TP_B = 4812 | #FP_B = 697
F_B(s) = B̄             | #TN_B = 7833 | #FN_B = 3274 | #TN_B = 8102

Note that acquiring the samples is not an onerous task. For example, to acquire the samples labeled with gesture C_A, we simply asked the participants to use the target gesture interface freely while keeping an “opened hand”. We then recorded the condition vectors of the participants together with the outputs of the classifiers; e.g., when the classifier F_A outputs A for such a sample, we automatically assign TP_A to it. Using this method, we can easily prepare a huge number of training samples and labels; no tedious tasks, such as manual annotation, are required.
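This labelling rule can be written down in a few lines; the sketch below is purely illustrative (hypothetical function name) and simply restates the TP/FP/FN/TN assignment described above.

```python
# Illustrative sketch of the automatic labelling: given the gesture class the
# participant was asked to hold (ground truth) and one classifier's binary
# output, assign TP/FP/FN/TN with no manual annotation.
def label_outcome(classifier_fired: bool, is_target_class: bool) -> str:
    if classifier_fired:
        return "TP" if is_target_class else "FP"
    return "FN" if is_target_class else "TN"

# Example: a participant holds an "opened hand" (class C_A) and F_A outputs A,
# so the sample is automatically counted as TP_A, as in Table 2.
assert label_outcome(True, True) == "TP"
assert label_outcome(True, False) == "FP"   # F_A fires on a C_B or C_Z sample
```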
Then, we built the estimators for precision and recall. To reduce the computational cost for real-time processing, we used a support vector machine (SVM) for the approximation. First, consider the classifier F_A. The goal is to decompose the condition space into two areas: one where F_A works accurately, and one where it works inaccurately. When the actual gesture is A, the outputs of F_A fall into two categories, TP and FN, where TP means the gesture is correctly recognized as A, and FN means it is misrecognized as C_B or C_Z.
We use the SVM to find the hyperplane that separates these TP and FN samples when the actual gesture is A. Using this hyperplane, we assume that the further the condition vector lies on the TP side, the higher the ratio #TP / (#TP + #FN) becomes. In other words, we can use the signed distance from the hyperplane as a substitute for the recall R_A of F_A.
Similarly, we can build a total of four hyperplanes for F_A: TP–FN and TN–FP for the recalls, and TP–FP and TN–FN for the precisions. These correspond to four estimators: R_A and R_Ā for recall, and P_A and P_Ā for precision.
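As a concrete illustration, one recall estimator R_A could be approximated as below. The paper does not name a particular SVM implementation; this sketch assumes scikit-learn’s LinearSVC, and the class and variable names are hypothetical.

```python
# Illustrative sketch of one recall estimator R_A (not the paper's exact code).
# X holds condition vectors of samples whose actual gesture is A; y is 1 for
# TP (F_A recognized the gesture as A) and 0 for FN.
import numpy as np
from sklearn.svm import LinearSVC

class RecallEstimatorA:
    def __init__(self):
        self.svm = LinearSVC()

    def fit(self, X, y):
        """Learn the hyperplane separating TP from FN in the condition space."""
        self.svm.fit(X, y)
        return self

    def score(self, condition_vector):
        """Signed distance from the TP/FN hyperplane; larger values indicate
        conditions under which F_A tends to succeed (i.e. higher recall)."""
        margin = self.svm.decision_function(np.atleast_2d(condition_vector))[0]
        return float(margin / np.linalg.norm(self.svm.coef_))

# With the appropriate TP/FP/FN/TN subsets of the training data, the same
# construction yields the remaining three estimators described in the text.
```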
As an evaluation of this approximation, we mea-
sured the histograms of TP, TN, FP, and FN in-