Exploiting the Kinematic of the Trajectories of the Local Descriptors to
Improve Human Action Recognition
Adel Saleh¹, Miguel Angel Garcia², Farhan Akram¹, Mohamed Abdel-Nasser¹ and Domenec Puig¹
¹Department of Computer Engineering and Mathematics, Rovira i Virgili University, Tarragona, Spain
²Department of Electronic and Communications Technology, Autonomous University of Madrid, Madrid, Spain
Keywords:
Activity Recognition, Kinematic Features, Classification.
Abstract:
This paper presents a video representation that exploits the properties of the trajectories of local descriptors in
human action videos. We use the spatio-temporal information carried by the trajectories to extract kinematic
properties: the tangent vector, the normal vector, the bi-normal vector and the curvature. The results show that
the proposed method yields accuracy comparable to state-of-the-art methods, while outperforming the compared
methods in terms of time complexity.
1 INTRODUCTION
Human action recognition remains an open and challenging problem in the computer vision community. The
performance of applications such as surveillance systems (Ben Aoun et al., 2011) and human-computer
interaction (Bouchrika et al., 2014) mainly depends on the accuracy of human activity recognition systems.
Several methods have been proposed to improve the performance of human action recognition in uncontrolled
videos. Bag-of-words (BOW) models built on low-level features, such as histograms of optical flow (HOF),
histograms of oriented gradients (HOG) and motion boundary histograms (MBH), have become a very common
video representation for action recognition (Laptev et al., 2008). These models are insensitive to the position
and orientation of the objects in the image. In addition, they produce fixed-length vectors irrespective of the
number of objects and the number of frames in each video. These methods do not rely on an explicit spatial
configuration of the visual words. On the other hand, they localize objects and actions in the videos poorly.
The use of local information is very useful to improve the recognition rate (Peng et al., 2014).
In this paper we use the kinematic features of the trajectories of the local descriptors to improve the
performance of current super-vector based activity recognition methods. For each local descriptor, a
trajectory is defined and a set of kinematic features is calculated: the tangent vector, the normal vector,
the bi-normal vector and the curvature. The steps of the proposed method are shown in Figure 1. The rest of
the paper is organized as follows. Section 2 reviews related work. Section 3 explains the mathematical
formulation and operation of the proposed method. Section 4 presents the experimental results and discussion.
Finally, Section 5 concludes the paper.
2 RELATED WORK
Several works have shown that the performance of human action recognition methods can be improved
significantly by using the trajectories of spatio-temporal interest points (Wang et al., 2009). In (Wang et
al., 2011), trajectories were used as features to build a codebook of visual words.
Figure 1: The proposed approach.
The authors of (Wang et al., 2011) proposed a robust method to capture the trajectory shape by tracking
densely sampled points through the optical flow fields. In (Wang et al., 2013), Fisher coding was compared
with other encoding methods and provided better results for action recognition. In (Jain et al., 2013), an
improved method using motion stabilization and person trajectories was demonstrated. In (Raptis and Soatto,
2010), parts of different trajectory groups were embedded into a graphical model. In (Sekma et al., 2013),
Delaunay triangulation was applied to the trajectories of each video to capture the geometric relationships
of objects; a graph was built over the trajectories and then encoded (this method is also known as
bag-of-graphs).
Many works have been devoted to describing the shape of the area around the trajectory and the local motion
and appearance patterns. The most common descriptors are histograms of optical flow (HOF) (Laptev et al.,
2008), motion boundary histograms (MBH) (Dalal et al., 2006) and histograms of oriented gradients (HOG)
(Dalal and Triggs, 2005). After the features are extracted, a codebook is built using a clustering algorithm,
such as k-means, and the features are encoded against it. In (Wang and Schmid, 2013), the authors showed
that camera motion compensation and the removal of the inconsistent matches generated by human motion can
greatly improve the performance of dense trajectories. They used the Fisher vector (FV) (Sánchez et al.,
2013) to generate a representation of each video. However, these approaches only consider simple trajectory
information. The proposed method uses the kinematic features of the trajectories to improve the performance
of activity recognition methods.
The proposed work is inspired by (Wang et al., 2015), in which Frenet-Serret frames (see Section 3.5 for the
definition) were extracted from the trajectories, and histograms of the tangent, normal and bi-normal vectors
were used to remove the dependency on the trajectory length; videos were then clustered using these
histograms. However, that work did not evaluate the performance of these features for activity recognition.
In (Jain et al., 2013), the authors used kinematic features of the flow field to capture additional
information about motion patterns. The proposed method differs from the technique discussed in (Wang et al.,
2015) in that the kinematic features of the trajectory (tangent vector, normal vector, bi-normal vector and
curvature) are used directly rather than as histograms. The proposed method exploits improved trajectories of
the same length and then combines them with low-level features such as HOF, HOG and MBH. Compared to the
proposed method, the CNN-based method of (Simonyan and Zisserman, 2014) is better in terms of accuracy, but
it has a high time complexity because it needs a large number of training samples with supervised labels.
3 PROPOSED APPROACH
The main idea of the proposed approach is to capture complementary information, such as motion, acceleration
and curvature, which provides a useful description of the trajectory. The improvement in the results leads to
the conclusion that statistically modeling the trajectories of the low-level features enhances the
recognition of concepts in videos.
3.1 Improved Dense Trajectories
In the proposed method, the low-level motion features are calculated using the same configuration proposed
in (Wang and Schmid, 2013). Features are densely sampled at several spatial scales, a homography between two
consecutive frames is estimated with RANSAC from SURF feature matches, and the optical flow is warped with
the estimated homography. The low-level descriptors are then computed on the warped optical flow to capture
motion patterns. An additional improvement is obtained by using a human detector, which removes the feature
matches generated by human motion, since they are inconsistent with the camera motion.
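As a rough illustration of this compensation step, the following sketch (not the authors' code) estimates the
frame-to-frame homography with RANSAC and recomputes the optical flow on the warped frame. ORB features are
used here instead of SURF, since SURF is only available in the non-free opencv-contrib package; all names and
parameters are illustrative assumptions.

```python
# Sketch of camera-motion compensation in the spirit of (Wang and Schmid, 2013).
# Assumption: ORB replaces SURF here; the parameters are illustrative, not the paper's.
import cv2
import numpy as np

def camera_compensated_flow(prev_gray, next_gray):
    """Estimate a homography between consecutive frames with RANSAC and compute
    the optical flow against the warped frame, suppressing camera motion."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(next_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)

    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Homography mapping points of the second frame onto the first frame.
    H, _ = cv2.findHomography(pts2, pts1, cv2.RANSAC, ransacReprojThreshold=3.0)

    h, w = prev_gray.shape
    stabilized = cv2.warpPerspective(next_gray, H, (w, h))
    # Dense flow on the stabilized pair: what remains is mostly object motion.
    return cv2.calcOpticalFlowFarneback(prev_gray, stabilized, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
```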
Default parameters are used to extract all low-level feature descriptors: HOG, HOF, MBHx, MBHy and
trajectory. The length of the utilized trajectories is 15 frames. The dimensions of the descriptors are 96
for HOG, 108 for HOF, 96 for MBHx, 96 for MBHy and 30 for the trajectory.
3.2 Gradient and Optical Flow Histograms
According to (Wang et al., 2009), the HOG and HOF descriptors show good results on different data-sets
compared to classical descriptors for activity recognition. Unlike the HOG descriptor, which captures
appearance information, HOF captures local motion information. The proposed method computes the HOG and HOF
descriptors using the same approach as (Wang and Schmid, 2013). To calculate HOG, gradient responses are
computed in the horizontal and vertical directions. To calculate HOF, optical flow displacement vectors in
the horizontal and vertical directions are determined. As a result, there is a 2D vector field per frame.
For each response, the orientation is quantized into a fixed number of bins and weighted by the magnitude.
For the HOG descriptor, orientations are quantized into 8 bins, while they are quantized into 9 bins for HOF,
as given in (Laptev et al., 2008). We use the l2-norm to normalize the descriptors. The lengths of HOG and
HOF are 96 (2 × 2 × 3 × 8) and 108 (2 × 2 × 3 × 9), respectively.
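The following minimal sketch (an illustration under assumptions, not the authors' implementation) shows the
magnitude-weighted orientation quantization described above for a single cell; the 2 × 2 × 3 spatio-temporal
cell layout and the extra zero-flow bin of HOF are omitted for brevity.

```python
# Sketch of magnitude-weighted orientation binning used by HOG/HOF-style descriptors.
import numpy as np

def orientation_histogram(dx, dy, n_bins=8):
    """dx, dy: horizontal/vertical responses (image gradients for HOG,
    optical-flow components for HOF). Returns an l2-normalized histogram."""
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx) % (2.0 * np.pi)                 # orientation in [0, 2*pi)
    bins = np.minimum((angle / (2.0 * np.pi) * n_bins).astype(int), n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```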
3.3 Motion Boundary Histograms Descriptor
MBH is a popular descriptor for video classification tasks (Dalal et al., 2006), and its authors showed its
robustness against camera and background motion. The intuition behind MBH is to compute oriented gradients
over the horizontal and vertical optical flow displacements. The advantage of this representation is that
motion boundaries (local differences in the optical flow) are emphasized while constant flow caused by camera
motion is suppressed. In practice, the horizontal and vertical optical flow displacements are treated as
gray-level images of the motion displacements. For each of the two optical flow component images, histograms
of oriented gradients are computed using the same configuration used for still images. Information related to
motion changes at the boundaries is retained through the flow differences, while information produced by
constant camera motion over the scene is discarded.
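A rough sketch of this idea is given below, reusing the orientation_histogram helper from the previous
sketch; it is again an illustration under assumptions, not the paper's implementation.

```python
# Sketch of MBH: oriented gradients computed over each optical-flow component,
# so constant (camera) flow vanishes and motion boundaries remain.
import numpy as np

def mbh_descriptors(flow, n_bins=8):
    """flow: (H, W, 2) optical-flow field; returns [MBHx, MBHy] histograms."""
    histograms = []
    for c in range(2):                      # 0: horizontal flow, 1: vertical flow
        gy, gx = np.gradient(flow[..., c])  # spatial gradients of the flow component
        histograms.append(orientation_histogram(gx, gy, n_bins=n_bins))
    return histograms
```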
3.4 Theoretical Background to Calculate Moving Frames
Suppose that we have a curve which describes a trajectory:

r(t) = (x(t), y(t), z(t))    (1)

Then the tangent vector to the curve at the point r(t) is:

v(t) = r'(t) = (x'(t), y'(t), z'(t))    (2)
Some researchers call it the velocity vector, and its length is called the speed (Figure 2). The derivative
vector of unit length is especially important. It is called the unit tangent vector and is obtained by
dividing the derivative vector by its length:

T(t) = (x'(t), y'(t), z'(t)) / |r'(t)|    (3)

or, equivalently,

T(t) = v(t) / |v(t)|    (4)

Since T(t) is a unit vector,

T(t) · T(t) = 1    (5)
Figure 2: The velocity vector.
Figure 3: The osculating plane defined by the tangent and normal vectors, and the direction of the binormal vector.
For all values of t, differentiating both sides of this equation gives

2 T'(t) · T(t) = 0    (6)

Thus T'(t) and T(t) are always orthogonal for every value of t. The unit normal is defined as follows:

N(t) = T'(t) / |T'(t)|    (7)
At each point, T and N span a plane called the osculating plane (see Figure 3). If we are dealing with 2D
motion, then z(t) = 0. Obviously, the normal vector and the osculating plane are not defined when T'(t) = 0.
The derivative of the velocity vector is called the acceleration vector:

a(t) = a_T(t) T(t) + a_N(t) N(t)    (8)
This vector has the same direction as the force needed to keep a particle on the track of the curve. Without
this force, the particle would continue moving as indicated by the velocity vector and would not stay on the
course of r(t). The acceleration vector lies on the osculating plane too.
Suppose that r(t) is a circle of radius ρ centered at (0, 0):

r(t) = ρ (cos(ωt), sin(ωt))    (9)

The velocity is v(t) = ρω (−sin(ωt), cos(ωt)), the speed is |v(t)| = ρω = v_0, the unit tangent vector is
T(t) = v(t)/|v(t)| = (−sin(ωt), cos(ωt)) and the unit normal vector is N = (−cos(ωt), −sin(ωt)). Since the
speed is constant, a_T(t) = d|v(t)|/dt = 0. There is no acceleration in the tangential direction. Hence, the
whole acceleration should be in the normal direction, which can be described with the following two
equations:

a = −ρω² (cos(ωt), sin(ωt)) = ρω² N    (10)

and

a = N v_0² / ρ    (11)

Hence a_N(t) = v_0²/ρ. One can get ρ at t = t_0 as follows:

ρ = |r'(t_0)| / |T'(t_0)|    (12)
The radius of the circle of the motion can be found from Eq. 12. The circle of the motion is called the
osculating circle. The reciprocal of the radius is called the curvature at t = t_0 (see Figure 4). The
curvature can be defined as follows:

k = |T'(t_0)| / |r'(t_0)|    (13)
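As a quick numerical sanity check of Eq. 13 (a sketch added for illustration, not part of the original
derivation), the curvature of the circle example above should come out as 1/ρ:

```python
# Numerical check: the curvature of r(t) = rho*(cos(w*t), sin(w*t)) equals 1/rho.
import numpy as np

rho, w = 2.0, 1.5
t = np.linspace(0.0, 2.0 * np.pi / w, 2000)
r = np.stack([rho * np.cos(w * t), rho * np.sin(w * t)], axis=1)

v = np.gradient(r, t, axis=0)                 # r'(t), Eq. 2
speed = np.linalg.norm(v, axis=1)
T = v / speed[:, None]                        # unit tangent, Eq. 4
dT = np.gradient(T, t, axis=0)                # T'(t)
k = np.linalg.norm(dT, axis=1) / speed        # curvature, Eq. 13
print(k.mean(), 1.0 / rho)                    # both are approximately 0.5
```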
3.5 Moving Frame
The bi-normal vector B is defined as the cross product of T and N. Obviously, B is perpendicular to both T
and N, and it is a unit vector since T and N have length 1. N and B determine a plane called the normal plane
(see Figure 4). All lines in the normal plane are perpendicular to the tangent vector T.

In the literature, the triple (T, N, B) is called the moving frame (it is also known as the Frenet-Serret
frame). The moving frame (T, N, B) is an orthonormal basis, which means that each vector has unit length and
every pair of them is perpendicular, so every three-dimensional vector can be represented as a linear
combination of these three components. Consequently, they take over the role of the usual basis vectors
i = (1, 0, 0), j = (0, 1, 0), k = (0, 0, 1) at a point on the curve.
Figure 4: The relation between the osculating circle and the curvature.
3.6 From Trajectories to Moving
Frames
Let r = {r(1), ..., r(t), ..., r(n)} be a trajectory in 2D space, where r(t) represents the position of a
moving point in the t-th video frame and n is the length of the trajectory. Since we exploit improved dense
trajectories (Wang and Schmid, 2013), all trajectories have the same length. For each trajectory, the
proposed method calculates the moving frame (T, N, B) and the curvature k, and then builds a codebook for
the concatenated vectors (T, N, B, k). It does likewise with HOF, HOG and MBH.
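A minimal sketch of this step is given below. It is not the authors' code: the derivatives are approximated
by finite differences, the 2D trajectory is embedded in 3D with z(t) = 0, and the function name and details
are illustrative.

```python
# Sketch: turn one fixed-length 2D trajectory into the kinematic descriptor (T, N, B, k).
import numpy as np

def kinematic_descriptor(points, eps=1e-8):
    """points: (n, 2) trajectory; returns the concatenation [Tx, Ty, Nx, Ny, Bz, k]."""
    r = np.hstack([points, np.zeros((len(points), 1))])   # embed in 3D with z(t) = 0
    v = np.gradient(r, axis=0)                             # r'(t), finite differences
    speed = np.linalg.norm(v, axis=1, keepdims=True) + eps
    T = v / speed                                          # unit tangent, Eq. 4
    dT = np.gradient(T, axis=0)
    dT_norm = np.linalg.norm(dT, axis=1, keepdims=True) + eps
    N = dT / dT_norm                                       # unit normal, Eq. 7
    B = np.cross(T, N)                                     # bi-normal (Section 3.5)
    k = (dT_norm / speed).ravel()                          # curvature, Eq. 13
    # For 2D motion: first two components of T and N, third component of B (Section 4.2).
    return np.concatenate([T[:, 0], T[:, 1], N[:, 0], N[:, 1], B[:, 2], k])
```

With the 15-frame improved trajectories of Section 3.1, such a sketch yields a 90-dimensional kinematic
descriptor, matching the dimensionality reported in Section 4.2.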
Obviously, the calculated descriptors extract meaningful information from the trajectory that does not depend
on the original location of the tracked point. Moreover, the information extracted from a trajectory can
cover descriptions in both 2D and 3D spaces, and a 3D trajectory is able to accurately describe the spatial
and temporal relations among multiple trajectories. In general, image data suffers from motion ambiguity
caused by the possibly different viewpoints of the visual projection, but this information is still very
useful for 2D applications. Therefore, the overall descriptor consists of the concatenation of HOF, HOG, MBH
and the kinematic features (moving frame + curvature).
3.7 Fisher Vector
In pattern recognition, Fisher vector (FV) coding is derived from the well-known Fisher kernel, which is
based on the assumption that the generation process of the local descriptors X can be modeled by a
probability density function p(X; θ). Using the gradient of the log-likelihood, it is possible to describe
the way the parameters contribute to the generation process of X. Then the sample can be described as:

G_θ^X = ∇_θ log p(X; θ) / N    (14)
The dimensionality of this vector depends on the number of parameters in θ. A Gaussian mixture model (GMM)
is used to model the probability density function, where θ = {π_1, µ_1, σ_1, ..., π_K, µ_K, σ_K} contains the
parameters of the model and π_k, µ_k, σ_k are the Gaussian mixture weights, means and diagonal covariances,
respectively. An improved Fisher vector (Perronnin et al., 2010) was proposed as follows:
ρ_k = (1/√π_k) γ_k (x − µ_k) / σ_k    (15)

τ_k = (1/√(2π_k)) γ_k ((x − µ_k)² / σ_k² − 1)    (16)
where γ_k is the weight of the local descriptor x with respect to the k-th Gaussian component, with

m = π_1 N(x; µ_1, σ_1) + ··· + π_K N(x; µ_K, σ_K)    (17)

The parameter γ_k can be determined as follows:

γ_k = π_k N(x; µ_k, σ_k) / m    (18)
The FV results from the concatenation of the gradients of Eqs. 15 and 16:

FV = [ρ_1, τ_1, ..., ρ_K, τ_K]    (19)
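The following sketch illustrates this encoding with scikit-learn's GaussianMixture (diagonal covariances). It
is an illustration under the stated assumptions, not the authors' implementation, and the data sizes are
deliberately tiny placeholders; the paper itself uses 256 components fitted on 250,000 sampled descriptors
(see Section 4.2).

```python
# Sketch of Fisher-vector encoding following Eqs. 15-19 with a diagonal-covariance GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(X, gmm):
    """X: (N, D) local descriptors of one video; gmm: fitted GaussianMixture."""
    pi, mu = gmm.weights_, gmm.means_              # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)              # (K, D) diagonal standard deviations
    gamma = gmm.predict_proba(X)                   # (N, K) posteriors gamma_k, Eq. 18
    n = X.shape[0]
    parts = []
    for k in range(len(pi)):
        diff = (X - mu[k]) / sigma[k]                                   # (N, D)
        rho_k = (gamma[:, k:k + 1] * diff / np.sqrt(pi[k])).sum(0) / n  # Eq. 15
        tau_k = (gamma[:, k:k + 1] * (diff ** 2 - 1)
                 / np.sqrt(2 * pi[k])).sum(0) / n                       # Eq. 16
        parts.extend([rho_k, tau_k])
    return np.concatenate(parts)                   # Eq. 19

# Illustrative usage with random placeholder data (one GMM per descriptor type).
rng = np.random.default_rng(0)
gmm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0)
gmm.fit(rng.random((2000, 32)))                    # sampled training descriptors (placeholder)
video_fv = fisher_vector(rng.random((300, 32)), gmm)
```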
4 EXPERIMENTAL RESULTS AND DISCUSSION
4.1 Data-set
The HMDB51 data-set, a generic action classification data-set (Kuehne et al., 2011), is used in this paper.
Its videos were collected from different sources, such as YouTube and movies. It contains around 6670 videos
grouped into 51 action classes, each containing around 100 clips. To measure the performance we follow the
original evaluation protocol: we use the three training/testing splits and report the average accuracy over
the three splits.
4.2 Experimental Setup
In our experiments, we adopt the improved dense trajectory features used in (Wang and Schmid, 2013). To
implement the kinematic model, we build Gaussian mixture models with 256 components and use them to cluster
250,000 randomly sampled descriptors of each type separately. The resulting GMMs are used to build an FV for
each low-level descriptor type. The Fisher vectors are concatenated and sent to a classifier. We use an SVM
classifier with an RBF-χ² kernel for classification. Since the motion is 2D, we take the first two dimensions
of the tangent and normal vectors and the third dimension of the bi-normal vector for each trajectory. Since
the length of the improved trajectories is 15 frames (by default), the kinematic trajectory descriptor has 90
dimensions (30 from the tangent vector, 30 from the normal vector, 15 from the bi-normal vector and 15 from
the curvature). For dimensionality reduction and correlation removal, principal component analysis (PCA) is
applied before building the FV. The PCA factor is set to 0.5.
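A sketch of this setup with off-the-shelf components is shown below. It is illustrative only (placeholder
data and names, not the authors' code), and a standard RBF kernel stands in for the RBF-χ² kernel of the
paper, since the χ² kernel normally assumes non-negative histogram inputs.

```python
# Sketch of the classification stage: PCA with factor 0.5 on the raw descriptors
# before FV encoding, then an SVM on the concatenated per-video Fisher vectors.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# PCA keeping half of the descriptor dimensions (factor 0.5), applied before the FV.
raw_descriptors = rng.random((5000, 96))                 # e.g. HOG descriptors (placeholder)
pca = PCA(n_components=raw_descriptors.shape[1] // 2)
reduced = pca.fit_transform(raw_descriptors)

# Per-video Fisher vectors (placeholders standing in for the concatenated FVs).
train_fvs = rng.standard_normal((60, 512))
train_labels = rng.integers(0, 5, 60)
clf = SVC(kernel='rbf', C=100.0).fit(train_fvs, train_labels)
predictions = clf.predict(train_fvs)
```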
In this work, a complementary descriptor is designed using the trajectories of the local descriptors. It
utilizes the spatio-temporal data of the trajectories and extracts additional information about the shape of
the trajectories. The proposed method enhances the recognition performance of concepts in videos. The
CNN-based model performs better because it represents higher-level semantic concepts, but it has a high time
complexity and requires a complicated training stage. In turn, the proposed method is faster than (Simonyan
and Zisserman, 2014): training one temporal CNN took approximately 1 day on a system with four NVIDIA Titan
cards (about 3.1 times that training time on a single GPU), whereas our approach took approximately 14 hours
on a Core i7 2.5 GHz CPU with 16 GB of RAM. This confirms that our approach gives comparable results with a
small training time.

Table 1: Comparison of the baseline methods with the proposed approach using the HMDB51 data-set.

Method                              Accuracy
(Yang and Tian, 2014)               26.90%
(Wang and Schmid, 2013)             57.20%
(Hou et al., 2014)                  57.88%
(Simonyan and Zisserman, 2014)      59.50%
Proposed approach                   58.20%
5 CONCLUSION
In this paper, a new method to recognize human actions in videos is proposed. It exploits trajectory
information extracted through moving frames. The proposed method calculates the tangent, normal and bi-normal
vectors and the curvature, and then combines them with classical low-level features. The proposed approach
gives a better description of the geometrical shape of the trajectories and shows results comparable with
the state-of-the-art. The performance of the proposed method was evaluated on the complex and large-scale
HMDB51 action data-set. The experimental results demonstrate that the proposed approach is comparable with
several state-of-the-art methods, as shown in Table 1.
ACKNOWLEDGMENTS
This work was partially supported by Hodeida University, Yemen, and Rovira i Virgili University, Tarragona,
Spain.
REFERENCES
Ben Aoun, N., Elghazel, H., and Ben Amar, C. (2011). Graph modeling based video event detection. In 2011 International Conference on Innovations in Information Technology (IIT), pages 114–117. IEEE.
Bouchrika, T., Zaied, M., Jemai, O., and Amar, C. B.
(2014). Neural solutions to interact with computers
by hand gesture recognition. Multimedia Tools and
Applications, 72(3):2949–2975.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gra-
dients for human detection. In IEEE Computer Society
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), volume 1, pages 886–893. IEEE.
Dalal, N., Triggs, B., and Schmid, C. (2006). Human de-
tection using oriented histograms of flow and appear-
ance. In Computer Vision–ECCV 2006, pages 428–
441. Springer.
Hou, R., Zamir, A. R., Sukthankar, R., and Shah, M. (2014). DaMN – discriminative and mutually nearest: Exploiting pairwise category proximity for video action recognition. In Computer Vision–ECCV 2014, pages 721–736. Springer.
Jain, M., Jégou, H., and Bouthemy, P. (2013). Better exploiting motion for better action recognition. In 2013 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2555–2562. IEEE.
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. (2011). HMDB: a large video database for human motion recognition. In IEEE International Conference on Computer Vision (ICCV), pages 2556–2563. IEEE.
Laptev, I., Marszałek, M., Schmid, C., and Rozenfeld,
B. (2008). Learning realistic human actions from
movies. In IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pages 1–8. IEEE.
Peng, X., Wang, L., Wang, X., and Qiao, Y. (2014). Bag of
visual words and fusion methods for action recogni-
tion: Comprehensive study and good practice. arXiv
preprint arXiv:1405.4506.
Perronnin, F., Sánchez, J., and Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In Computer Vision–ECCV 2010, pages 143–156. Springer.
Raptis, M. and Soatto, S. (2010). Tracklet descriptors for
action modeling and video analysis. In Computer
Vision–ECCV 2010, pages 577–590. Springer.
Sánchez, J., Perronnin, F., Mensink, T., and Verbeek, J. (2013). Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245.
Sekma, M., Mejdoub, M., and Amar, C. B. (2013). Hu-
man action recognition using temporal segmentation
and accordion representation. In Computer Analysis
of Images and Patterns, pages 563–570. Springer.
Simonyan, K. and Zisserman, A. (2014). Two-stream con-
volutional networks for action recognition in videos.
In Advances in Neural Information Processing Sys-
tems, pages 568–576.
Wang, H., Kläser, A., Schmid, C., and Liu, C.-L. (2011). Action recognition by dense trajectories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3169–3176. IEEE.
Wang, H. and Schmid, C. (2013). Action recognition with
improved trajectories. In IEEE International Confer-
ence on Computer Vision (ICCV), pages 3551–3558.
IEEE.
Wang, H., Ullah, M. M., Klaser, A., Laptev, I., and Schmid,
C. (2009). Evaluation of local spatio-temporal fea-
tures for action recognition. In British Machine Vision
Conference (BMVC), pages 124–1. BMVA Press.
Wang, L., Qiao, Y., and Tang, X. (2013). Motionlets: Mid-
level 3d parts for human motion recognition. In IEEE
Conference on Computer Vision and Pattern Recogni-
tion (CVPR), pages 2674–2681. IEEE.
Wang, W.-C., Chung, P.-C., Cheng, H.-W., and Huang, C.-
R. (2015). Trajectory kinematics descriptor for trajec-
tory clustering in surveillance videos. In IEEE Inter-
national Symposium on Circuits and Systems (ISCAS),
pages 1198–1201. IEEE.
Yang, X. and Tian, Y. (2014). Action recognition using su-
per sparse coding vector with spatio-temporal aware-
ness. In Computer Vision–ECCV 2014, pages 727–
741. Springer.