Action Recognition using the R_f Transform on Optical Flow Images
Josep Maria Carmona and Joan Climent
Barcelona Tech (UPC), Barcelona, Spain
josep.maria.carmona@estudiant.upc.edu
Keywords: R Transform, Action Recognition, PHOW, Projection Templates.
Abstract:
The objective of this paper is the automatic recognition of human actions in video sequences. The use of spatio-temporal features for action recognition has become very popular in recent literature. Instead of extracting the spatio-temporal features from the raw video sequence, some authors propose to first project the sequence to a single template.
As a contribution, we propose the use of several variants of the R transform for projecting the image sequences to templates. The R transform projects the whole sequence to a single image, retaining information concerning movement direction and magnitude. Spatio-temporal features are extracted from the template, combined using a bag-of-words paradigm, and finally fed to an SVM for action classification.
The method presented is shown to improve the state-of-the-art results on the standard Weizmann action dataset.
1 INTRODUCTION
One of the multiple applications of automatic visual analysis of human movements is the understanding of human activities in video sequences. Classification and recognition of human activities can be very useful for multiple applications, such as video surveillance, human-computer interaction, or biometric analysis. The objective of action/gesture recognition is to identify human movements invariantly with respect to gesture speed, distance to the camera, or background.
Since human activity is captured in video se-
quences, the temporal domain is very important to
model gestures or human actions. Several authors
have extended the classical object recognition tech-
niques to the spatio-temporal domain (Kläser et al., 2008), (Scovanner, 2007), (Jhuang et al., 2007). They
use vocabularies of volumetric features that are com-
puted using three-dimensional keypoint detectors and
descriptors. In (Scovanner, 2007) they used 3D SIFT
for action recognition using spatio-temporal features.
In (Kläser et al., 2008), the authors proposed a descrip-
tor based on the histogram of 3D spatio-temporal gra-
dients. 3D gradients are binned into regular polyhe-
drons. They also extend the idea of integral images to
3D which allows rapid dense sampling of the cuboid
over multiple scales and locations in both space and
time. The approach presented in (Jhuang et al., 2007)
combined keypoint detection with the calculation of
local descriptors in a feed-forward framework. This
was motivated by similarity with the human visual
system, extending a bioinspired method to action
recognition. At the lowest level, they compute the
spatial gradients along the x and y axis for each frame.
Then the obtained responses are converted to a higher
level using stored prototypes.
We present a new method for action/gesture
recognition, based on a projection template, which is
obtained using a variant of the R transform. The R
transform was originally designed for object recogni-
tion, but some authors (Souvenir and Parrigan, 2009)
(Wang et al., 2007) (Zhu et al., 2009) (Vishwakarma
et al., 2015) (Goudelis et al., 2013) have used it, or
some variants, for action recognition too. In (Vishwakarma et al., 2015), they proposed a method based on the combined information obtained from the R transform and energy silhouettes: they generated a feature vector from the average energy silhouettes and applied the R transform to the extracted normalized silhouettes. The authors of (Goudelis et al., 2013) presented two methods to assess the capability of the Trace transform, which is a generalization of the Radon transform, to recognize human actions.
These previous works use this transform on sil-
houette images or human shapes previously seg-
mented from image sequences. In this work, we apply
the R transform directly to the optical flow compo-
nents of the input sequence, avoiding all the problems
regarding the segmentation stage.
We also show in this paper that some variants of
the R transform preserve important information of hu-
man action sequences, giving more accurate results in
the recognition process.
We compute different R transforms using different projection functions. Using an R_f transform, where f is a projection function, we obtain a single image from each video sequence. Different projection functions f lead to different templates. In the results section we evaluate several projection functions and select the one that gives the highest recognition rate.
Once the image template is computed, we use a
Pyramid Histogram Of visual Words (PHOW) (Bosch
et al., 2007) as feature descriptor. Next, we combine
the feature descriptors, ignoring the structural infor-
mation among keypoints, using the paradigm known
as Bag of Words (BoW) (Csurka et al., 2004). In a
BoW approach, the number of occurrences of similar
feature patterns is accumulated in the bins of a his-
togram. Some other authors have successfully used
BoW for action recognition (Niebles et al., 2008).
Once the feature patterns have been computed, the
action sequence is recognized by means of a SVM
classifier. In the results section, we compare the re-
sults obtained using our approach on the Weizmann
action dataset with the ones reported by other authors
using the same dataset.
2 PRELIMINARIES
The Radon transform (Radon, 1917) consists of a multiple-angle projection of a given image I(x, y). The result of this projection is a set of line integrals, that is, the cumulative sums of pixel values along lines in all directions. Given a line in its polar form:

ρ = x cosθ + y sinθ        (1)
the Radon transform can be expressed mathematically using equation 2:

g(ρ, θ) = Σ_x Σ_y I(x, y) δ(x cosθ + y sinθ − ρ)        (2)

where I(x, y) is the input image, δ is the Dirac delta function, ρ is the distance from the line to the origin, and θ is the line direction. The main drawback
of the Radon transform is that it is not invariant to
translation, scale, or rotation. There exist several ap-
proaches to achieve such invariances (Arodz, 2005).
In (Tabbone et al., 2006), they presented a variant of
the Radon transform, the R transform, which is in-
variant to translation and scale.
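As an illustrative aid (not part of the original paper), the discrete Radon transform g(ρ, θ) of equation 2 can be computed with off-the-shelf tools; a minimal Python sketch, assuming NumPy and scikit-image are available:

```python
import numpy as np
from skimage.transform import radon

def radon_sinogram(image, n_angles=180):
    """Discrete Radon transform g(rho, theta) of a 2D array.

    Returns a sinogram of shape (n_rho, n_angles); column j holds the
    projections of `image` along lines with direction thetas[j].
    """
    thetas = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    g = radon(image, theta=thetas, circle=False)  # circle=False keeps the full image support
    return g, thetas
```

All the R_f variants discussed below operate on this sinogram.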
The R transform is computed by summing the squared values of the Radon transform over ρ for each direction θ. It can be expressed using equation 3:

R(θ) = Σ_ρ g²(ρ, θ)        (3)

The result of the R transform is a function of θ giving the (normalized) sum of squared projection values for each orientation. It maps a 2D image to a 1D signal.
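For illustration, the R transform of equation 3 simply collapses the sinogram along the ρ axis; a small sketch reusing the radon_sinogram helper above:

```python
import numpy as np

def r_transform(g):
    """R(theta): sum of squared Radon values over rho, normalized to [0, 1]."""
    r = np.sum(g ** 2, axis=0)   # collapse the rho axis
    return r / np.max(r)         # normalization gives scale invariance
```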
The R_f transform is a variant of the R transform, where f is a generic function. It can be expressed in its general form:

R_f(θ) = f(g(ρ, θ))        (4)

where g(ρ, θ) is the Radon transform and f is a projection function that can be tuned as a parameter, allowing the transform to be adapted to the specific problem to be solved.
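Equation 4 can be read as "apply some reduction f along the ρ axis of the sinogram"; a minimal sketch under that assumption:

```python
import numpy as np

def r_f_transform(g, f):
    """Generic R_f transform of eq. 4: apply the projection function f over rho.

    g : sinogram of shape (n_rho, n_angles)
    f : callable reducing a 1D array of rho values to a scalar
    """
    return np.array([f(g[:, j]) for j in range(g.shape[1])])
```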
For example, R_max replaces the sum of squared values used in the R transform with the projection value of maximum absolute value. This transform is invariant to translation and, if correctly normalized by dividing by the supremum of the image values, is also invariant to scale.

R_max(θ) = max_ρ(g(ρ, θ))   if R1 ≥ R2
R_max(θ) = min_ρ(g(ρ, θ))   if R1 < R2        (5)

where R1 = |max_ρ(g(ρ, θ))| and R2 = |min_ρ(g(ρ, θ))|.
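The signed supremum of equation 5 keeps whichever extremum has the largest magnitude; a possible projection function for the generic r_f_transform sketched above:

```python
import numpy as np

def f_max(column):
    """Eq. 5: return the max over rho if its magnitude dominates, otherwise the min."""
    hi, lo = np.max(column), np.min(column)
    return hi if abs(hi) >= abs(lo) else lo

# r_max = r_f_transform(g, f_max)
```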
R_dev uses the standard deviation instead of the sum of squared values. It is also invariant to translation.

R_dev(θ) = dev_ρ(g(ρ, θ))        (6)
R_mean uses the mean of the projection values for each orientation. Even though it is quite similar to the original R transform, it has the advantage of considering the negative values of g(ρ, θ).

R_mean(θ) = mean_ρ(g(ρ, θ))        (7)
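Under the same convention, the projection functions of equations 6 and 7 are one-liners; unlike the squared sum of the standard R transform, the mean preserves the sign of negative projection values:

```python
import numpy as np

f_dev = np.std    # eq. 6: standard deviation over rho
f_mean = np.mean  # eq. 7: mean over rho, keeps the sign of negative values
```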
The properties of all these R_f transforms depend entirely on the function f chosen. Apart from their invariance properties, they present different behaviors when applied to images that may contain negative values (like the optical flow images used in this work). Figure 1 shows the result of the R, R_max, R_dev, and R_mean transforms for two input images containing positive and negative values. We can see that the R_max and R_mean transforms give different results for positive and negative values, while these differences are lost when using the standard R and R_dev transforms.
Figure 1: Result of the R, R_max, R_dev and R_mean transforms, respectively, for two synthetic images containing positive (left column) and negative (right column) values.
In this work, we use the R transform and its variants on human action video sequences, but instead of applying them to grey-level, shape, or edge images as in (Souvenir and Parrigan, 2009), (Wang et al., 2007) and (Zhu et al., 2009), we apply them to their optical flow components.
3 OUR APPROACH
This section describes the method used for action
recognition in this work. It is based on a Bag of fea-
tures approach, but prior to the keypoint extraction stage, the video sequences are projected to static
templates. Figure 2 shows a block diagram of the
whole process. Next, we describe in detail the differ-
ent stages of our system. The implementation details,
including the tuning parameters, are given in section
4.
In order to project the video sequences, we apply the R_f transform to both the F_x and F_y components of the optical flow, obtaining two surfaces, R_fx and R_fy.
Figure 2: Block diagram of our approach.
These surfaces can be considered as spatio-temporal templates defining an action sequence; they retain information about the different speeds that action movements have in local regions of the scene.
Figure 3 shows an example of such surfaces, specifically the result of applying the R_max transform to the 'bend' sequence from the Weizmann action dataset (Blank et al., 2005).
The optical flow of the video sequence has been
computed using the real-time algorithm presented in
(Karlsson and Bigun, 2012).
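The projection stage can be sketched end to end as follows. This is an illustrative approximation rather than the authors' implementation: it uses OpenCV's Farnebäck optical flow instead of the real-time method of (Karlsson and Bigun, 2012), reuses the radon_sinogram and r_f_transform helpers sketched in Section 2, and assumes that each R_f surface stacks one R_f(θ) row per frame.

```python
import cv2
import numpy as np

def rf_templates(frames, f):
    """Project a list of grayscale frames to the R_fx and R_fy surfaces."""
    rows_x, rows_y = [], []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        fx, fy = flow[..., 0], flow[..., 1]          # F_x and F_y components
        gx, _ = radon_sinogram(fx)
        gy, _ = radon_sinogram(fy)
        rows_x.append(r_f_transform(gx, f))          # one R_f(theta) row per frame
        rows_y.append(r_f_transform(gy, f))
    return np.vstack(rows_x), np.vstack(rows_y)
```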
Figure 3: R_fx and R_fy surfaces computed by applying the R_max transform to the 'bend' sequence from the Weizmann action dataset.
Once the transform has been computed, we search for a set of keypoints in both surfaces using a standard detector. In this work we have used the PHOW detector
(Bosch et al., 2007), a variant of SIFT but computed
on a dense grid at different scales. Figure 4 shows
the keypoints obtained using PHOW on the images
shown in figure 3. Only 50 keypoints, randomly cho-
sen, are shown for visualization purposes. The cir-
cles are centered on the selected keypoints, their sizes
represent the scale, and the lines inside the circles
show the main gradient orientation. Once the key-
points have been selected, we use a classical descrip-
tor based on a gradient orientation vector. We use a 128-bin histogram for each keypoint to describe the gradient orientations within a local neighborhood.
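A rough stand-in for this dense descriptor stage, assuming OpenCV's SIFT rather than the VLFeat PHOW implementation actually used; the grid step and bin size follow the values tuned in Section 4:

```python
import cv2
import numpy as np

def dense_sift(surface, step=1, size=3):
    """Compute 128-D SIFT descriptors on a dense grid over an R_f surface."""
    img = cv2.normalize(surface, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    grid = [cv2.KeyPoint(float(x), float(y), size)
            for y in range(0, img.shape[0], step)
            for x in range(0, img.shape[1], step)]
    _, descriptors = cv2.SIFT_create().compute(img, grid)
    return descriptors  # shape: (n_keypoints, 128)
```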
Figure 4: Keypoints obtained using PHOW on the R_fx and R_fy surfaces.

Using a Bag of Features approach, similar descriptors are grouped using a k-means clustering technique. The set of cluster centers forms a visual codebook. In this way, every input sequence is represented by a set of words, each word describing a small region of the R_f surfaces. For the classification stage, we use a Stochastic Dual Coordinate Ascent (SDCA) (Shalev-Shwartz and Zhang, 2013) linear SVM solver. We tested several kernels and obtained the best performance using the Chi-Squared kernel (χ²) (Vedaldi and Zisserman, 2012). It can be expressed using equation 8:
k(x, y) = Σ_{i=1..n} 2 x_i y_i / (x_i + y_i)        (8)

where x and y are the n-element input vectors.
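As a rough illustration of the codebook and classification steps (not the authors' implementation, which relies on an SDCA linear solver with an explicit feature map for the χ² kernel), the following scikit-learn sketch builds a 900-word vocabulary, encodes each sequence as a normalized word histogram, and trains an SVM on a precomputed Gram matrix implementing equation 8; train_desc and train_labels are assumed to be given.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_histogram(descriptors, codebook):
    """L1-normalized histogram of visual-word occurrences for one sequence."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

def chi2_gram(A, B, eps=1e-10):
    """Gram matrix of the chi-squared kernel of eq. 8."""
    num = 2.0 * A[:, None, :] * B[None, :, :]
    den = A[:, None, :] + B[None, :, :] + eps
    return (num / den).sum(axis=2)

codebook = KMeans(n_clusters=900).fit(np.vstack(train_desc))          # visual vocabulary
X_train = np.array([bow_histogram(d, codebook) for d in train_desc])  # one histogram per sequence
svm = SVC(kernel="precomputed").fit(chi2_gram(X_train, X_train), train_labels)
```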
4 RESULTS
In our preliminary experiments we have used the
Weizmann dataset (Blank et al., 2005). It is a widely
used sequence database containing a set of human ac-
tions. The sequences have been recorded with a static camera and background, there are no occlusions, and only one person is moving in each sequence. They do not present serious illumination changes either. This dataset consists of 10 different actions carried out by 9 different persons. Figure 5 shows some snapshots
of the Weizmann dataset.
Figure 5: Weizmann human actions. Bend, jack, jump,
pjump, run, side, skip, walk, wave1, wave2.
For the experiments, we have used the leave-one-out cross-validation method, since it is the usual testing protocol used by other authors. We use the 10 actions performed by a single person for testing, and the actions performed by the remaining 8 persons for training. This process is repeated for all 9 persons.
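The leave-one-person-out protocol amounts to a loop over actors; a minimal outline, assuming each sequence is tagged with its actor identity and that train_and_score fits the pipeline on the training split and returns its recognition rate on the test split (hypothetical names):

```python
import numpy as np

def leave_one_person_out(features, labels, persons, train_and_score):
    """Average recognition rate over the leave-one-person-out splits."""
    rates = []
    for person in np.unique(persons):
        test = persons == person
        rates.append(train_and_score(features[~test], labels[~test],
                                      features[test], labels[test]))
    return float(np.mean(rates))
```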
We have done two different experiments. The first one has the objective of determining which of the R_f transforms produces the highest action recognition rate. As evaluation criterion we have computed the recognition rate (RR) using all four different transforms (R, R_max, R_mean, and R_dev):

RR = (samples correctly classified) / (total samples tested)        (9)
Once we have determined the optimum R trans-
form, it is selected for the second experiment. The
objective of this second experiment is to compare the
results of our approach with the ones published by
other authors using the same dataset.
For tuning the parameters of our approach, we have tested codebook sizes from 100 to 1100 visual words. For PHOW we have tested from 1 to 5 pixels between keypoints in the dense SIFT grid and from 2 to 10 pixels for the size of the spatial bins (scales). We have obtained the best performance using a 1-pixel distance between keypoints in the dense SIFT grid, a single scale of size 3, and a visual vocabulary of 900 visual words.
Table 1 shows the results obtained for the four different R_f transforms (R, R_max, R_mean, and R_dev) using the scheme shown in Figure 2. We can see that using the supremum as projection function (R_max) we obtain the best results.
Table 1: Recognition rate (%) for the R and R_f transforms applied to the Weizmann dataset.

Transform   RR (%)
R           95.55
R_max       98.88
R_mean      88.55
R_dev       95.55
To establish a fair comparison with the state-of-the-art methods, we have tuned all parameters exactly as reported by the authors in their original papers. For the Scovanner method (Scovanner, 2007) we have used the 2x2x2 and 4x4x4 sub-histogram configurations, and 8x4 histograms to represent θ and φ in the 3D SIFT descriptor. For the Kläser method (Kläser et al., 2008) we have used a codebook size V=4000, spatial and temporal supports s0=8 and t0=6, number of histogram cells M=4, N=3, number of supporting mean gradients S=3, cut-off value c=0.25, and an icosahedron as the full orientation polyhedron. For the Jhuang method (Jhuang et al., 2007) we used 500 gradient-based features. For the Niebles method (Niebles et al., 2008) the parameters chosen were σ=1.2, τ=1.2, a descriptor dimensionality of 100, and a codebook size fixed to 1200. For (Vishwakarma et al., 2015) we have used a feature vector of [1 * 7 + 1 * 168 + 1 * 2] dimensions. For the method
presented in (Goudelis et al., 2013) we have used Lin-
ear Discriminant Analysis (LDA) and a vector of 31
features. This latter method requires a previous sil-
houette extraction stage.
Table 2 shows the comparative results for all these methods on the Weizmann sequences. In our approach we have used the R_max transform, since it was the one that gave the highest accuracy in the former experiment.
Table 2: Recognition rate (%) on the Weizmann dataset.

Method                        %
(Scovanner, 2007)             84.2
(Kläser et al., 2008)         84.3
(Niebles et al., 2008)        90
(Jhuang et al., 2007)         98.8
(Vishwakarma et al., 2015)    96.64
(Goudelis et al., 2013)       93.4
Ours (using R_max)            98.8
The mean computational time for recognizing an action sequence of 100 frames of 160x120 pixels is 900 ms, measured on a 3.1 GHz Intel Core i3. For these preliminary tests the code is not optimized, and the whole process has been implemented in Matlab.
5 CONCLUSIONS
Template-based approaches allow projecting a whole sequence onto a single image. In this paper we have presented a generalized form, R_f, of the Radon transform for projecting the action sequence. By choosing an appropriate projection function f, it can be adapted to a specific problem.
We have tested three different f functions to project the Radon transform, namely the mean, the standard deviation and the supremum, and applied these transforms to the optical flow components of a video sequence. This experiment has shown that the R_max transform gives the highest recognition rate for action recognition, higher than the standard R transform and the other projection functions.
The results obtained in a second experiment also show that the use of such transforms is a very promising technique, since it yielded higher recognition rates than the state-of-the-art methods on the same dataset, achieving a 98.8% recognition rate.
6 FURTHER WORK
The results presented in this paper have been obtained using the Weizmann dataset as a testbed. Future experiments will involve other popular action/gesture recognition datasets, such as the KTH dataset (Schuldt et al., 2004) and the Cambridge hand-gesture dataset.
We are currently working on the extension of this technique to action segmentation. The standard action recognition datasets used by most researchers usually contain single actions that start at the beginning of the sequence and stop at the end. In a real application, actions should be detected, segmented, and finally recognized. The use of the R_f transforms is a promising technique for sequence segmentation too.
ACKNOWLEDGEMENTS
This work was supported by the Spanish Ministry of Science and Innovation, project DPI2016-78957-R, and the AEROARMS European project H2020-ICT-2014-1-644271.
REFERENCES
Arodz, T. (2005). Invariant object recognition using radon-
based transform. Computers and Artificial Intelli-
gence, 24:183–199.
Blank, M., Gorelick, L., Shechtman, E., Irani, M., and
Basri, R. (2005). Actions as space-time shapes.
Computer Vision, IEEE International Conference on,
2:1395–1402 Vol. 2.
Bosch, A., Zisserman, A., and Munoz, X. (2007). Image
classification using random forests and ferns. Com-
puter Vision, 2007. ICCV 2007. IEEE 11th Interna-
tional Conference on, pages 1–8.
Csurka, G., Dance, C. R., Fan, L., Willamowski, J., and
Bray, C. A. (2004). Visual categorization with bags of
keypoints. pages 1–22.
Goudelis, G., Karpouzis, K., and Kollias, S. (2013). Explor-
ing trace transform for robust human action recogni-
tion. Pattern Recognition, 46(12):3238 – 3248.
Jhuang, H., Serre, T., Wolf, L., and Poggio, T. (2007).
A biologically inspired system for action recognition.
Computer Vision, 2007. ICCV 2007. IEEE 11th Inter-
national Conference on, pages 1–8.
Karlsson, S. and Bigun, J. (2012). Lip-motion events anal-
ysis and lip segmentation using optical flow. pages
138–145.
Kläser, A., Marszałek, M., and Schmid, C. (2008). A spatio-temporal descriptor based on 3D-gradients. In BMVC 2008.
Niebles, J., Wang, H., and Fei-Fei, L. (2008). Unsupervised
learning of human action categories using spatial-
temporal words. International Journal of Computer
Vision, 79(3):299–318.
Radon, J. (1917). Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Akad. Wiss., 69:262–277.
Schuldt, C., Laptev, I., and Caputo, B. (2004). Recognizing
human actions: A local svm approach. In Proceedings
of the Pattern Recognition, 17th International Confer-
ence on (ICPR’04) Volume 3 - Volume 03, ICPR ’04,
pages 32–36, Washington, DC, USA. IEEE Computer
Society.
Scovanner (2007). A 3-dimensional sift descriptor and its
application to action recognition. pages 357–360.
Shalev-Shwartz, S. and Zhang, T. (2013). Stochastic dual
coordinate ascent methods for regularized loss. J.
Mach. Learn. Res., 14(1):567–599.
Souvenir, R. and Parrigan, K. (2009). Viewpoint mani-
folds for action recognition. J. Image Video Process.,
2009:1:1–1:1.
Tabbone, S., Wendling, L., and Salmon, J.-P. (2006). A
new shape descriptor defined on the radon transform.
Comput. Vis. Image Underst., 102(1):42–51.
Vedaldi, A. and Zisserman, A. (2012). Efficient additive
kernels via explicit feature maps. IEEE Trans. Pattern
Anal. Mach. Intell., 34(3):480–492.
Vishwakarma, D., Dhiman, A., Maheshwari, R., and
Kapoor, R. (2015). Human motion analysis by fusion
of silhouette orientation and shape features. Procedia
Computer Science, 57:438 – 447.
Wang, Y., Huang, K., and Tan, T. (2007). Human activity
recognition based on r transform. In In Proceedings of
the IEEE International Conference on Computer Vi-
sion and Pattern Recognition, pages 1–8.
Zhu, P., Hu, W., Li, L., and Wei, Q. (2009). Human Activity
Recognition Based on R Transform and Fourier Mellin
Transform, pages 631–640. Springer Berlin Heidel-
berg, Berlin, Heidelberg.