nize small elements of an action called visual words;
then the histogram of the occurrences of such visual
words is used as a high-level feature vector to per-
form the classification of the action. The set of the
visual words is defined by constructing a codebook
using an unsupervised learning approach. The main
drawback of the bag of words methods is that they
base their decision only on the occurrence or on the
absence of the relevant visual words (the elementary
actions) within the analyzed time window; the order
in which these words appear is not taken into account.
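For reference, a minimal sketch of such a bag-of-words representation is given below; the use of k-means for the codebook construction, as well as all names and parameter values, are assumptions made for illustration and not details of the cited methods.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, k=64):
    """Learn a codebook of k visual words by unsupervised clustering.
    descriptors: (n_samples, n_features) array of low-level features."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)

def bow_histogram(codebook, action_descriptors):
    """Histogram of visual-word occurrences for one action; note that the
    temporal order of the words is lost, which is the drawback discussed
    in the text."""
    words = codebook.predict(action_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()  # normalize so histograms are comparable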
However, for human beings, this order is an impor-
tant piece of information for discriminating between
similar actions. On the other hand, an element-wise
comparison of the observed sequence of visual words
with the one obtained for a reference action would not
yield good results, because of two kinds of problems:
first, the speed of execution of the same action by different persons (or even by the same person at different times) may change, and the change may not even be uniform within the same action. Second, both because
of noise in the first-level representation and of indi-
vidual differences in the way an action is performed,
an observed sequence of visual words will likely con-
tain spurious elements with respect to the correspond-
ing reference action in the training set, and conversely
may lack some elements of the latter.
In order to overcome the above mentioned prob-
lems, in this paper we propose a system for Human
Action Recognition based on a string Edit Distance
(HARED); each action is represented as a sequence
of symbols (a string) according to a dictionary ac-
quired during the learning step. The similarity be-
tween two strings is computed by a string edit dis-
tance, measuring the cost of the minimal sequence of
edit operations needed to transform one string into the
other; the string edit distance is robust with respect
to local modifications (such as the insertion or dele-
tion of symbols) even when they change the length of
the string, thus dealing in a natural way with speed
changes and with spurious elementary actions.
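As an illustration, a minimal sketch of the standard Levenshtein edit distance between two symbol sequences follows; unit costs for insertion, deletion and substitution are assumed here, whereas the actual cost scheme adopted by HARED may differ.

def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions
    needed to transform sequence a into sequence b."""
    m, n = len(a), len(b)
    # d[i][j] = distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all remaining symbols of a
    for j in range(n + 1):
        d[0][j] = j          # insert all symbols of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

# Example: two executions of the same action, differing in speed and noise
print(edit_distance("aabbcd", "abbbcde"))  # -> 2 (one substitution, one insertion)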
The experimentation, conducted over two stan-
dard datasets, confirms the robustness of the proposed
approach, both in absolute terms and in comparison
with other state-of-the-art methodologies.
2 THE PROPOSED METHOD
Figure 2c shows an overview of the proposed approach; each module, namely the first-layer representation, the second-layer representation and the classification, is detailed in the following.
2.1 Feature Extraction
The feature vector is extracted by analyzing depth
images acquired by a Kinect sensor. This choice is
mainly justified by the following reasons: first, the device is very affordable, which makes the method especially suited to budget-constrained applications. Furthermore, in (Carletti et al., 2013)
the authors proved the effectiveness of a set of fea-
tures obtained by the combination of three differ-
ent descriptors, respectively based on Hu Moments,
R transform and Min-Max variations, computed on
depth images. Starting from the above considerations,
in this paper we decided to adopt the same feature
vector. It is worth pointing out that the focus of this paper is on the string-based high-level representation, as well as on the measure introduced for evaluating the distance between two actions; this means that any kind of feature vector could be profitably used.
In order to compute the feature vector, we first ex-
tract the set of derived images, proposed in (Megavan-
nan et al., 2012) and shown in Figure 2, able to model
the spatio-temporal variations of each pixel: in particular, at each frame the last N frames are processed through a sliding window so as to obtain the Average Depth Image (ADI), the Motion History Image (MHI) and the Depth Difference Image (DDI). In our experiments N has been set so as to cover one second of video, as suggested in (Megavannan et al., 2012).
In particular, the ADI is the average depth at posi-
tion (x,y) over the images at times t − N + 1,...,t;
it uses N temporally adjacent depth images in or-
der to capture the motion information in the third di-
mension. The MHI captures the sequence of motions into a single static image; in particular, MHI(x,y,t) = 255 if the point (x,y) passes from background to foreground at time t, otherwise it is equal to max{MHI(x,y,t − 1) − τ, 0}, where τ is a constant set in our experiments to (256/N) − 1, as suggested in (Megavannan et al., 2012). Finally, the DDI
evaluates the motion changes in the depth dimen-
sion: DDI(x,y,t) = D_max(x,y,t) − D_min(x,y,t), where D_max(x,y,t) and D_min(x,y,t) are the maximum and minimum depth for position (x,y) over the images at times t − N + 1, ..., t, respectively.
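As an illustrative summary, the following sketch shows how the three derived images can be computed from a buffer of the last N depth frames; the NumPy formulation, the frame rate and the foreground test (nonzero depth) are assumptions made here, not necessarily the choices of (Megavannan et al., 2012).

import numpy as np

# frames: (N, H, W) array holding the last N depth images, most recent last
TAU = 256 // 30 - 1  # decay constant tau = 256/N - 1, assuming N = 30 frames (one second at 30 fps)

def average_depth_image(frames):
    # ADI(x, y): average depth at each pixel over the last N frames
    return frames.mean(axis=0)

def depth_difference_image(frames):
    # DDI(x, y) = D_max(x, y) - D_min(x, y) over the last N frames
    return frames.max(axis=0) - frames.min(axis=0)

def update_mhi(prev_mhi, prev_frame, curr_frame):
    # MHI(x, y, t) = 255 where the pixel passes from background to
    # foreground at time t, otherwise max(MHI(x, y, t-1) - tau, 0);
    # background is assumed here to correspond to zero depth
    became_fg = (prev_frame == 0) & (curr_frame > 0)
    decayed = np.maximum(prev_mhi.astype(int) - TAU, 0)
    return np.where(became_fg, 255, decayed).astype(np.uint8)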
Both the MHI and the ADI are represented
through the seven Hu moments, which are invariant
to translation, scale and orientation. The DDI is repre-
sented through a combination of the R transform and
the Min-Max Depth Variations. The former is an
extended 3D discrete Radon transform, able to cap-
ture the geometrical information of the interest points.
Despite its very low computational complexity, the R transform is robust with respect to errors occurring