4.2 KTH Action Dataset
The KTH Action Dataset consists of 2391 video sequences of six human actions: boxing, hand waving, hand clapping, running, jogging and walking. The actions were performed by 25 subjects in four different scenes, and each action is performed three or four times by each subject in each scene. Following the standard setup in (Schuldt et al., 2004), videos of sixteen subjects were used for training and videos of the remaining nine subjects were used for testing. The recognition rates are reported as average class accuracy.
The features were detected with the Harris3D interest point detector (Laptev, 2005) and represented by different types of descriptors: the Histogram of Oriented Spatial Gradients (HOG), the Histogram of Optical Flow (HOF), HOGHOF (i.e., the combination of HOG and HOF) (Laptev et al., 2008) and the Histogram of 3D Gradients (HOG3D) (Klaser et al., 2008).
The nearest neighbor search in our experiments is performed by building randomized kd-trees (Silpa-Anan and Hartley, 2008). We used the existing FLANN library provided by Muja and Lowe (Muja and Lowe, 2009) with the following settings: the number of random dimensions is 5, the number of randomized trees is 10, and the distance between features is the L1 distance.
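For concreteness, the sketch below shows how such an index could be built with the pyflann Python bindings to FLANN; the descriptor and label files are hypothetical, and exact parameter names may differ between FLANN versions.

```python
import numpy as np
from pyflann import FLANN, set_distance_type

# Hypothetical inputs: train_desc holds the HOGHOF descriptors pooled from all
# training videos, train_labels the action class of the video each came from.
train_desc = np.load("train_hoghof.npy").astype(np.float32)   # shape (N, D)
train_labels = np.load("train_labels.npy")                    # shape (N,)

set_distance_type("manhattan")   # L1 distance between descriptors

flann = FLANN()
# Ten randomized kd-trees, as in our experiments; the "5 random dimensions"
# setting refers to the pool of top-variance dimensions FLANN draws the split
# dimension from, which is fixed inside the library rather than exposed here.
flann.build_index(train_desc, algorithm="kdtree", trees=10)

def nearest_labels(query_desc):
    """Return, for each query descriptor, the action label of its single
    (approximate) nearest training descriptor."""
    idx, _ = flann.nn_index(query_desc.astype(np.float32), num_neighbors=1)
    return train_labels[np.asarray(idx).ravel()]
```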
The results for the HOG, HOF, HOGHOF and HOG3D descriptors are 82.16%, 89.57%, 91.08% and 89.34%, respectively. Table 1 shows the confusion table for the HOGHOF descriptor.
Table 1: The result of our approach (Vote-1NN) with the Harris3D interest point detector and HOGHOF descriptors on the KTH dataset. The average accuracy is 91.08%.
gt\res   box     hclap   hwave   jog     run     walk
box      97.90    0.00    0.00    0.00    0.00    2.10
hclap     1.39   97.92    0.69    0.00    0.00    0.00
hwave     0.00    6.94   93.06    0.00    0.00    0.00
jog       0.00    0.00    0.00   95.14    2.78    2.08
run       0.00    0.00    0.00   37.50   62.50    0.00
walk      0.00    0.00    0.00    0.00    0.00  100.00
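The average class accuracy is simply the mean of the per-class accuracies on the diagonal of Table 1; the short check below recomputes it from the rounded table entries.

```python
import numpy as np

# Rows of Table 1 (ground truth x predicted, in percent).
conf = np.array([
    [97.90,  0.00,  0.00,  0.00,  0.00,   2.10],   # boxing
    [ 1.39, 97.92,  0.69,  0.00,  0.00,   0.00],   # hand clapping
    [ 0.00,  6.94, 93.06,  0.00,  0.00,   0.00],   # hand waving
    [ 0.00,  0.00,  0.00, 95.14,  2.78,   2.08],   # jogging
    [ 0.00,  0.00,  0.00, 37.50, 62.50,   0.00],   # running
    [ 0.00,  0.00,  0.00,  0.00,  0.00, 100.00],   # walking
])
# Average class accuracy = mean of the diagonal entries.
print(np.diag(conf).mean())   # 91.0866..., i.e. the reported 91.08% up to rounding
```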
There is no misclassification of the moving actions (i.e., jogging, running, walking) as the stationary actions (i.e., boxing, hand clapping, hand waving) with any of the four types of feature descriptors. This means that, whether appearance features or motion features are used, our approach can accurately distinguish these two kinds of actions. The results on the running action are poor: it is mostly misclassified as jogging. The higher accuracy on the running action achieved by the HOG and HOG3D features (i.e., 63.19% and 75.69% respectively) is probably due to the more reliable differences in human poses between running and jogging (stride variations) compared to the temporal changes due to leg motions. The HOGHOF descriptor, which encodes both appearance and motion information, gave the best results with an overall accuracy of 91.08%, which is comparable to the state-of-the-art results shown in Table 3. Although the video sequences in the KTH dataset contain slight camera motion and viewpoint variations, the high recognition rates achieved by our approach (i.e., except for the misclassification of running as jogging, the average accuracy for the other five actions is about 97%) suggest that our approach is reasonably robust to camera motion and view variance. These problems are more difficult to solve in a framework that uses vector quantization. The results of extended NBNN on the KTH dataset using the HOG, HOF, HOGHOF and HOG3D descriptors are 84.94%, 89.80%, 92.24% and 91.31% respectively, slightly higher than those of our approach. This could be due to the fact that in the NBNN approach, the distances from the local features in the query video to their nearest neighbors in all action classes are computed and accumulated. In that manner, features that are not highly discriminative (i.e., whose distances to the nearest neighbors in the different action classes differ only slightly) do not contribute significantly to the final decision on the action class. In contrast, the local features in our approach are each assigned to only one action class. As a result, we reap the advantage of lower computational complexity, since the search for the nearest neighbor is done only once.
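To make the contrast concrete, the following sketch shows the two decision rules side by side; nn_labels and per_class_nn_dists are hypothetical arrays obtained from nearest-neighbor searches such as the FLANN index sketched above.

```python
import numpy as np
from collections import Counter

def vote_1nn(nn_labels):
    """Vote-1NN (our approach): each local descriptor of the query video votes
    for the class of its single nearest training descriptor; the class with
    the most votes labels the video."""
    return Counter(list(nn_labels)).most_common(1)[0][0]

def nbnn(per_class_nn_dists):
    """NBNN: for every descriptor the distance to its nearest neighbor is
    computed in *each* class and accumulated; the class with the smallest
    accumulated distance wins.  per_class_nn_dists is a hypothetical
    (num_descriptors x num_classes) array of those distances."""
    return int(np.argmin(per_class_nn_dists.sum(axis=0)))
```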
4.3 Computational Speed
Based on a KTH dataset test case where the features have been extracted by Harris3D and described by the HOGHOF descriptor, we compared the performance of our approach with bag-of-words model approaches by following the experimental setup on the KTH action dataset described in (Wang et al., 2009) (i.e., the number of visual words is 4000 and the visual vocabulary is built by k-means clustering). The classifications were performed using a Support Vector Machine (Chang and Lin, 2001) with a χ² kernel (Laptev et al., 2008) and with the Pyramid Match Kernel (PMK) (Grauman and Darrell, 2007). Our approach does not require us to build a visual vocabulary, a process that takes a lot of time, especially for a large vocabulary size. Moreover, it takes an average of only 0.05 seconds for our Vote-1NN approach to classify an action in a video sequence with an average of 100 frames. Using the same computational resources and test video sequences, the bag-of-words model with SVM+χ² kernel and with SVM+PMK took an average of 0.95 and 2.5 seconds respectively to classify an action.
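For reference, the sketch below outlines such a bag-of-words baseline. It assumes hypothetical per-video descriptor lists train_videos and test_videos, pooled training descriptors train_desc and video labels y_train; scikit-learn's chi2_kernel is used as a stand-in for the χ² kernel of Laptev et al. (2008), gamma is a free parameter not taken from the paper, and the PMK variant is omitted.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

K = 4000   # vocabulary size following (Wang et al., 2009)

def bow_histogram(desc, vocab):
    """Quantize one video's descriptors against the vocabulary and return an
    L1-normalized visual-word histogram."""
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=K).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# Hypothetical inputs: train_desc (pooled training descriptors), train_videos /
# test_videos (per-video descriptor arrays), y_train (per-video action labels).
vocab = MiniBatchKMeans(n_clusters=K, random_state=0).fit(train_desc)
X_train = np.vstack([bow_histogram(d, vocab) for d in train_videos])
X_test = np.vstack([bow_histogram(d, vocab) for d in test_videos])

# Chi-squared kernel SVM with a precomputed Gram matrix.
K_train = chi2_kernel(X_train, X_train, gamma=0.5)
K_test = chi2_kernel(X_test, X_train, gamma=0.5)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
predictions = clf.predict(K_test)
```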