nize small elements of an action called visual words;
then the histogram of the occurrences of such visual
words is used as a high-level feature vector to per-
form the classification of the action. The set of the
visual words is defined by constructing a codebook
using an unsupervised learning approach. The main
drawback of the bag of words methods is that they
base their decision only on the occurrence or on the
absence of the relevant visual words (the elementary
actions) within the analyzed time window; the order
in which these words appear is not taken into account.
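For reference, a minimal sketch of such a bag-of-words representation is given below; the use of k-means for the codebook construction, as well as all names and parameter values, are assumptions made for illustration and not details of the cited methods.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=64):
    """Learn a codebook of k visual words by unsupervised clustering.
    descriptors: (n_samples, n_features) array of low-level features."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(codebook, action_descriptors):
    """Histogram of visual-word occurrences for one action; note that the
    temporal order of the words is lost, which is the drawback discussed
    in the text."""
    words = codebook.predict(action_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so histograms are comparable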
However, for human beings, this order is an impor-
tant piece of information for discriminating between
similar actions. On the other hand, an element-wise
comparison of the observed sequence of visual words
with the one obtained for a reference action would not
yield good results, because of two kinds of problems:
first, the speed of execution of the same action by different persons (or even by the same person at different times) may change, and the change may not even be uniform within the same action. Second, both because
of noise in the first-level representation and of indi-
vidual differences in the way an action is performed,
an observed sequence of visual words will likely con-
tain spurious elements with respect to the correspond-
ing reference action in the training set, and conversely
may lack some elements of the latter.
In order to overcome the above mentioned prob-
lems, in this paper we propose a system for Human
Action Recognition based on a string Edit Distance
(HARED); each action is represented as a sequence
of symbols (a string) according to a dictionary ac-
quired during the learning step. The similarity be-
tween two strings is computed by a string edit dis-
tance, measuring the cost of the minimal sequence of
edit operations needed to transform one string into the
other; the string edit distance is robust with respect
to local modifications (such as the insertion or dele-
tion of symbols) even when they change the length of
the string, thus dealing in a natural way with speed
changes and with spurious elementary actions.
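As an illustration, a minimal sketch of the standard Levenshtein edit distance between two symbol sequences follows; unit costs for insertion, deletion and substitution are assumed here, whereas the actual cost scheme adopted by HARED may differ.

def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions
    needed to transform sequence a into sequence b."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining symbols of a
    for j in range(n + 1):
        d[0][j] = j          # insert all symbols of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

# Example: two executions of the same action, differing in speed and noise
print(edit_distance("aabbcd", "abbbcde"))  # -> 2 (one substitution, one insertion)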
The experimentation, conducted over two stan-
dard datasets, confirms the robustness of the proposed
approach, both in absolute terms and in comparison
with other state-of-the-art methodologies.
2 THE PROPOSED METHOD
Figure 2c shows an overview of the proposed approach; each module, namely the first-layer representation, the second-layer representation and the classification, is detailed in the following.
2.1 Feature Extraction
The feature vector is extracted by analyzing depth
images acquired by a Kinect sensor. This choice is
mainly justified by the following reasons: first, the device is very affordable, which makes the method especially suited to budget-constrained applications. Furthermore, in (Carletti et al., 2013)
the authors proved the effectiveness of a set of fea-
tures obtained by the combination of three differ-
ent descriptors, respectively based on Hu Moments,
R transform and Min-Max variations, computed on
depth images. Starting from the above considerations,
in this paper we decided to adopt the same feature
vector. It is worth pointing out that the focus of this paper is on the string-based high-level representation, as well as on the measure introduced for evaluating the distance between two actions; this means that any kind of feature vector could be profitably used.
In order to compute the feature vector, we first ex-
tract the set of derived images, proposed in (Megavan-
nan et al., 2012) and shown in Figure 2, able to model
the spatio-temporal variations of each pixel: in particular, at each frame the last N frames are processed through a sliding window so as to obtain the Average Depth Image (ADI), the Motion History Image (MHI) and the Depth Difference Image (DDI). In our experiments N has been set so as to cover one second of video, as suggested in (Megavannan et al., 2012).
In particular, the ADI is the average depth at posi-
tion (x,y) over the images at times t − N + 1,...,t;
it uses N temporally adjacent depth images in or-
der to capture the motion information in the third di-
mension. The MHI captures the sequence of motions into a single static image; in particular, MHI(x,y,t) = 255 if the point (x,y) passes from background to foreground at time t, otherwise it is equal to max{MHI(x,y,t − 1) − τ, 0}, where τ is a constant set in our experiments to (256/N) − 1, as suggested in (Megavannan et al., 2012). Finally, the DDI
evaluates the motion changes in the depth dimen-
sion: DDI(x,y,t) = D_max(x,y,t) − D_min(x,y,t), where D_max(x,y,t) and D_min(x,y,t) are the maximum and minimum depth for position (x,y) over the images at times t − N + 1, ..., t, respectively.
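As an illustrative summary, the following sketch shows how the three derived images can be computed from a buffer of the last N depth frames; the NumPy formulation, the frame rate and the foreground test (nonzero depth) are assumptions made here, not necessarily the choices of (Megavannan et al., 2012).

import numpy as np

# frames: (N, H, W) array holding the last N depth images, most recent last
TAU = 256 // 30 - 1  # decay constant tau = 256/N - 1, assuming N = 30 frames (one second at 30 fps)

def average_depth_image(frames):
    # ADI(x, y): average depth at each pixel over the last N frames
    return frames.mean(axis=0)

def depth_difference_image(frames):
    # DDI(x, y) = D_max(x, y) - D_min(x, y) over the last N frames
    return frames.max(axis=0) - frames.min(axis=0)

def update_mhi(prev_mhi, prev_frame, curr_frame):
    # MHI(x, y, t) = 255 where the pixel passes from background to
    # foreground at time t, otherwise max(MHI(x, y, t-1) - tau, 0);
    # background is assumed here to correspond to zero depth
    became_fg = (prev_frame == 0) & (curr_frame > 0)
    decayed = np.maximum(prev_mhi.astype(int) - TAU, 0)
    return np.where(became_fg, 255, decayed).astype(np.uint8)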
Both the MHI and the ADI are represented
through the seven Hu moments, which are invariant
to translation, scale and orientation. The DDI is repre-
sented through a combination of the R transform and
the Min-Max Depth Variations. The former is an
extended 3D discrete Radon transform, able to cap-
ture the geometrical information of the interest points.
Despite its very low computational complexity, the R transform is robust with respect to errors occurring