Table 1: Configuration and accuracy of the best-performing CNN architectures (C: number of convolutional layers; F: kernel sizes; S: filters per layer; U: number of dense layers; N: units per dense layer).

image  C  F       S   U  N           accuracy  weights
bin    2  (3, 5)  32  2  (160, 160)  97.64%    564,194
bin    2  (5, 5)  32  2  (120, 80)   97.58%    347,466
bin    2  (5, 5)  32  2  (160, 160)  97.49%    467,426
bin    2  (3, 3)  32  2  (160, 80)   96.78%    642,290
bin    2  (3, 5)  16  2  (160, 80)   96.75%    275,778
bin    2  (5, 5)  32  2  (160, 80)   96.66%    454,386
bin    2  (3, 5)  32  2  (120, 80)   96.51%    419,914
bin    2  (5, 3)  32  2  (80, 80)    96.01%    272,802
bin    2  (5, 3)  32  1  (160, 0)    95.80%    522,562
bin    2  (3, 5)  32  2  (160, 80)   95.39%    551,154
Note that segmented images do not appear among the best results. This becomes clearer in the histograms shown in Fig. 4, which represent the accuracy distribution of the evaluated configurations for each image representation separately. For segmented images, accuracies range from 89% to 94%, while for binarized images they range from 96% to 99%. More importantly, we conclude that the best CNN architectures detect hand poses very accurately across all evaluated body poses. This is a relevant result, as the user is expected to issue binary commands while simultaneously controlling the robot arm position in many distinct configurations.
By analyzing the hyper-parameters of the best architectures, we note a prevalence of 2 convolutional layers with 32 filters per layer, and a diversity of values for the other evaluated hyper-parameters. The best architecture is depicted in Fig. 3. However, it is worth observing that all configurations in Table 1 achieved similar accuracies (always above 95%). Thus, we suggest that a smaller number of trainable weights may be a relevant criterion for choosing an architecture among the best ones. Nevertheless, hyper-parameter calibration was still relevant to achieve significantly better accuracies, as revealed in Fig. 4.
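For concreteness, a minimal Keras sketch of this best configuration (2 convolutional layers with 3x3 and 5x5 kernels and 32 filters each, followed by 2 dense layers of 160 units) is given below. The 48x48 binarized input, unpadded convolutions, same-padded 2x2 max pooling, ReLU activations, and 2-class softmax output are our assumptions, not stated in this excerpt; we chose them because they reproduce the weight counts reported in Table 1 (564,194 for this configuration).

import tensorflow as tf

# Sketch of the best architecture in Table 1 (bin, C=2, F=(3, 5), S=32,
# U=2, N=(160, 160)). Input size, padding, activations, and output size
# are assumptions chosen to match the reported 564,194 trainable weights.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(48, 48, 1)),                # binarized hand image
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),   # 48x48 -> 46x46
    tf.keras.layers.MaxPooling2D((2, 2), padding='same'),    # 46x46 -> 23x23
    tf.keras.layers.Conv2D(32, (5, 5), activation='relu'),   # 23x23 -> 19x19
    tf.keras.layers.MaxPooling2D((2, 2), padding='same'),    # 19x19 -> 10x10
    tf.keras.layers.Flatten(),                               # 10*10*32 = 3200
    tf.keras.layers.Dense(160, activation='relu'),
    tf.keras.layers.Dense(160, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax'),          # open vs. closed hand
])
model.summary()  # 564,194 trainable parameters under these assumptions

The same assumptions also reproduce the weight counts of the other rows of Table 1 (e.g., 347,466 for the second configuration), which supports this reading of the table columns.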
4.3 CNN Evaluation for Robot Interaction
To further evaluate the best CNN architecture, we performed experiments simulating a typical pick-and-place task. We collected 10 recordings of an individual picking up an object and placing it in another location. The user is expected to first place his controlling hand (open) over the object, close his hand to grasp the object, move his (closed) hand to another location, and finally open his hand to release the object. Despite the simplicity of the task movements, some poses are challenging for hand pose classification, in particular when the user's hand is pointing towards the Kinect sensor. In such situations, images from the hand poses are difficult to discriminate.
We evaluate and compare the robustness of the CNN in each sequence when each frame is individually classified and when temporal filtering is applied (Section 3.4). We ignore frames from transitions between hand poses, since even a ground-truth label is ambiguous for them. When comparing ground-truth labels with the labels obtained after temporal filtering, we account for a delay of 15 frames, since the filter requires 15 consecutive consistent classifications before changing the command and this delay is therefore expected. The evaluated sequences have an average of 204 frames.
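Section 3.4 is not reproduced in this excerpt, but the filtering behavior described here (the active command changes only after 15 consecutive frames agree on a new label) can be sketched as follows; the class and method names are ours, not the paper's:

class TemporalFilter:
    """Debounces per-frame CNN predictions: the output command changes
    only after n consecutive frames agree on a new label (a sketch of
    the behavior described in this section, not the paper's code)."""

    def __init__(self, n=15, initial=None):
        self.n = n
        self.current = initial   # currently active command
        self.candidate = None    # label awaiting confirmation
        self.count = 0

    def update(self, label):
        if label == self.current:
            self.candidate, self.count = None, 0
        elif label == self.candidate:
            self.count += 1
            if self.count >= self.n:
                self.current = label
                self.candidate, self.count = None, 0
        else:
            self.candidate, self.count = label, 1   # restart confirmation
        return self.current

Under this scheme, a single misclassification restarts the confirmation counter, consistent with the 7-frame delay discussed below for sequence 5, and every command change lags the true pose transition by 15 frames, motivating the delay applied in the comparison.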
The results shown in Table 2 reveal that temporal filtering is indeed relevant to improve accuracy, achieving an excellent mean of 98.65%, against 94.73% when classification is performed on a per-frame basis. In particular, per-frame classification only outperformed the temporal filtering approach in sequence 5. In that case, a single misclassified frame, occurring 7 frames after a pose transition, forced the command transition to be delayed by 7 frames, as the filter requires 15 consecutive similar classifications to change the command. Consequently, this led to 7 incorrect results. However, we claim that this is not a critical issue, as the consequence would be just a delay in grasping or releasing an object. It is also worth noting the worst result (sequence 7). In that case, the user performed challenging poses, such as the ones shown in Fig. 5, in which the hand was closed but the thumb remained straight. We believe that the CNN classification could be improved in these cases by providing additional training examples. Finally, we conclude that the proposed CNN approach is reliable for real-world applications.
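For reference, the per-sequence accuracies of Table 2 can be computed with a helper along these lines; the function name and array layout are hypothetical, with delay=0 for per-frame labels and delay=15 for temporally filtered labels:

import numpy as np

def sequence_accuracy(gt, pred, delay=0, transitions=None):
    # Fraction of correctly classified frames in one sequence.
    # gt, pred: per-frame label arrays of equal length; transitions:
    # optional boolean mask of pose-transition frames, excluded because
    # their ground truth is ambiguous. delay shifts the ground truth to
    # account for the expected latency of the temporal filter.
    gt, pred = np.asarray(gt), np.asarray(pred)
    keep = np.ones(len(gt), dtype=bool)
    if transitions is not None:
        keep &= ~np.asarray(transitions)
    keep[:delay] = False              # no shifted ground truth yet
    idx = np.flatnonzero(keep)
    return float(np.mean(pred[idx] == gt[idx - delay]))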
4.4 User Experiments with a Robot Arm
We validate our method using a 6-DOF industrial-grade robotic arm, model VP-6242 from Denso.
Table 2: Comparison between straightforward per-frame classification and temporal-filtering-based classification: the best trained CNN is employed to classify all frames of 10 sequences simulating a pick-and-place task. Accuracy is defined as the ratio between correctly classified frames and the total number of frames of each sequence.

Sequence          Per-frame        Temporal filter
1                 95.97%           100%
2                 97.34%           100%
3                 92.63%           100%
4                 98.09%           99.04%
5                 99.56%           96.92%
6                 99.09%           100%
7                 81.89%           91.35%
8                 95.93%           100%
9                 92.39%           99.21%
10                94.43%           100%
Mean ± Std. Dev.  94.73% ± 5.14%   98.65% ± 2.73%