4.2 KTH Action Dataset
The KTH Action Dataset consists of 2391 video sequences of six human actions: boxing, hand waving, hand clapping, running, jogging and walking. The actions were performed by 25 subjects in four different scenes, and each action is performed three or four times by each subject in each scene. Following the standard setup in (Schuldt et al., 2004), videos of sixteen subjects were used for training and videos of the remaining nine subjects were used for testing. The recognition rates are reported as average class accuracy.
The features were detected with the Harris3D interest point detector (Laptev, 2005) and represented by different types of descriptors: the Histogram of Oriented Spatial Gradients (HOG), the Histogram of Optical Flow (HOF), HOGHOF (i.e., the combination of HOG and HOF) (Laptev et al., 2008) and the Histogram of 3D Gradients (HOG3D) (Klaser et al., 2008).
The nearest neighbor search in our experiments is performed by building randomized kd-trees (Silpa-Anan and Hartley, 2008). We used the existing FLANN library provided by Muja and Lowe (Muja and Lowe, 2009) with the following settings: the number of random dimensions is 5, the number of randomized trees is 10, and the distance between features is the L1 distance.
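For concreteness, the sketch below shows how such an index could be built with the pyflann Python bindings to FLANN; the descriptor and label files are hypothetical, and exact parameter names may differ between FLANN versions.

```python
import numpy as np
from pyflann import FLANN, set_distance_type

# Hypothetical inputs: train_desc holds the HOGHOF descriptors pooled from all
# training videos, train_labels the action class of the video each came from.
train_desc = np.load("train_hoghof.npy").astype(np.float32)   # shape (N, D)
train_labels = np.load("train_labels.npy")                    # shape (N,)

set_distance_type("manhattan")   # L1 distance between descriptors

flann = FLANN()
# Ten randomized kd-trees, as in our experiments; the "5 random dimensions"
# setting refers to the pool of top-variance dimensions FLANN draws the split
# dimension from, which is fixed inside the library rather than exposed here.
flann.build_index(train_desc, algorithm="kdtree", trees=10)

def nearest_labels(query_desc):
    """Return, for each query descriptor, the action label of its single
    (approximate) nearest training descriptor."""
    idx, _ = flann.nn_index(query_desc.astype(np.float32), num_neighbors=1)
    return train_labels[np.asarray(idx).ravel()]
```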
The results for the HOG, HOF, HOGHOF and HOG3D descriptors are 82.16%, 89.57%, 91.08% and 89.34%, respectively. Table 1 shows the confusion table for the HOGHOF descriptor.
Table 1: The result of our approach (Vote-1NN) with the Harris3D interest point detector and HOGHOF descriptors on the KTH dataset. The average accuracy is 91.08%.
gt\res   box     hclap   hwave   jog     run     walk
box      97.90    0.00    0.00    0.00    0.00    2.10
hclap     1.39   97.92    0.69    0.00    0.00    0.00
hwave     0.00    6.94   93.06    0.00    0.00    0.00
jog       0.00    0.00    0.00   95.14    2.78    2.08
run       0.00    0.00    0.00   37.50   62.50    0.00
walk      0.00    0.00    0.00    0.00    0.00  100.00
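The average class accuracy is simply the mean of the per-class accuracies on the diagonal of Table 1; the short check below recomputes it from the rounded table entries.

```python
import numpy as np

# Rows of Table 1 (ground truth x predicted, in percent).
conf = np.array([
    [97.90,  0.00,  0.00,  0.00,  0.00,   2.10],   # boxing
    [ 1.39, 97.92,  0.69,  0.00,  0.00,   0.00],   # hand clapping
    [ 0.00,  6.94, 93.06,  0.00,  0.00,   0.00],   # hand waving
    [ 0.00,  0.00,  0.00, 95.14,  2.78,   2.08],   # jogging
    [ 0.00,  0.00,  0.00, 37.50, 62.50,   0.00],   # running
    [ 0.00,  0.00,  0.00,  0.00,  0.00, 100.00],   # walking
])
# Average class accuracy = mean of the diagonal entries.
print(np.diag(conf).mean())   # 91.0866..., i.e. the reported 91.08% up to rounding
```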
There is no misclassification of the moving actions (i.e., jogging, running, walking) as the stationary actions (i.e., boxing, hand clapping, hand waving) with any of the four types of feature descriptors. This means that, whether appearance features or motion features are used, our approach can accurately distinguish these two kinds of actions. The results on the running action are poor: it is mostly misclassified as jogging. The higher accuracy on the running action achieved by the HOG and HOG3D features (i.e., 63.19% and 75.69% respectively) is probably due to the more reliable differences in human poses between running and jogging (stride variations) compared to the temporal changes due to leg motions. The HOGHOF descriptor, which encodes both appearance and motion information, gave the best results with an overall accuracy of 91.08%, which is comparable to the state-of-the-art results shown in Table 3. Although the video sequences in the KTH dataset contain slight camera motion and viewpoint variations, the high recognition rates achieved by our approach (i.e., except for the misclassification of running as jogging, the average accuracy for the other five actions is about 97%) suggest that our approach is reasonably robust to camera motion and view variance. These problems are more difficult to solve in a framework that uses vector quantization. The results of extended NBNN on the KTH dataset using the HOG, HOF, HOGHOF and HOG3D descriptors are 84.94%, 89.80%, 92.24% and 91.31% respectively, slightly higher than those of our approach. This could be due to the fact that in the NBNN approach, the distances from the local features in the query video to their nearest neighbors in all action classes are computed and accumulated. In that manner, features that are not highly discriminative (i.e., whose distances to the nearest neighbors in the different action classes differ only slightly) do not contribute significantly to the final decision on the action class. In contrast, the local features in our approach are each assigned to only one action class. As a result, we reap the advantage of lower computational complexity, since the search for the nearest neighbor is done only once.
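To make the contrast concrete, the following sketch shows the two decision rules side by side; nn_labels and per_class_nn_dists are hypothetical arrays obtained from nearest-neighbor searches such as the FLANN index sketched above.

```python
import numpy as np
from collections import Counter

def vote_1nn(nn_labels):
    """Vote-1NN (our approach): each local descriptor of the query video votes
    for the class of its single nearest training descriptor; the class with
    the most votes labels the video."""
    return Counter(list(nn_labels)).most_common(1)[0][0]

def nbnn(per_class_nn_dists):
    """NBNN: for every descriptor the distance to its nearest neighbor is
    computed in *each* class and accumulated; the class with the smallest
    accumulated distance wins.  per_class_nn_dists is a hypothetical
    (num_descriptors x num_classes) array of those distances."""
    return int(np.argmin(per_class_nn_dists.sum(axis=0)))
```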
4.3 Computational Speed
Based on a KTH dataset test case where the features have been extracted by Harris3D and described by the HOGHOF descriptor, we compared the performance of our approach with bag-of-words model approaches by following the experimental setup on the KTH action dataset described in (Wang et al., 2009) (i.e., the number of visual words is 4000 and the visual vocabulary is built by k-means clustering). The classifications were performed using a Support Vector Machine (Chang and Lin, 2001) with a χ² kernel (Laptev et al., 2008) and with the Pyramid Match Kernel (PMK) (Grauman and Darrell, 2007). Our approach does not require us to build a visual vocabulary, a process that takes a lot of time, especially for a large vocabulary size. Moreover, it takes an average of only 0.05 seconds for our Vote-1NN approach to classify an action in a video sequence with an average of 100 frames. Using the same computational resources and test video sequences, the bag-of-words model with SVM+χ² kernel and with SVM+PMK took an average of 0.95 and 2.5 seconds respectively to classify an action.
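For reference, the sketch below outlines such a bag-of-words baseline. It assumes hypothetical per-video descriptor lists train_videos and test_videos, pooled training descriptors train_desc and video labels y_train; scikit-learn's chi2_kernel is used as a stand-in for the χ² kernel of Laptev et al. (2008), gamma is a free parameter not taken from the paper, and the PMK variant is omitted.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

K = 4000   # vocabulary size following (Wang et al., 2009)

def bow_histogram(desc, vocab):
    """Quantize one video's descriptors against the vocabulary and return an
    L1-normalized visual-word histogram."""
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=K).astype(np.float64)
    return hist / max(hist.sum(), 1.0)

# Hypothetical inputs: train_desc (pooled training descriptors), train_videos /
# test_videos (per-video descriptor arrays), y_train (per-video action labels).
vocab = MiniBatchKMeans(n_clusters=K, random_state=0).fit(train_desc)
X_train = np.vstack([bow_histogram(d, vocab) for d in train_videos])
X_test = np.vstack([bow_histogram(d, vocab) for d in test_videos])

# Chi-squared kernel SVM with a precomputed Gram matrix.
K_train = chi2_kernel(X_train, X_train, gamma=0.5)
K_test = chi2_kernel(X_test, X_train, gamma=0.5)
clf = SVC(kernel="precomputed").fit(K_train, y_train)
predictions = clf.predict(K_test)
```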