Recurrence Matrices for Human Action Recognition
V. Javier Traver¹,², Pau Agustí¹,² and Filiberto Pla¹,²
¹DLSI, Jaume-I University, Castellón, Spain
²iNIT, Jaume-I University, Castellón, Spain
Keywords:
Human Action Recognition, Recurrence Matrices, Frame Descriptors, Motion and Shape Cues, Action
Characterization, Temporal Information.
Abstract:
One important issue for action characterization consists of properly capturing temporally related informa-
tion. In this work, recurrence matrices are explored as a way to represent action sequences. A recurrence
matrix (RM) encodes all pair-wise comparisons of the frame-level descriptors. By its nature, a recurrence
matrix can be regarded as a temporally holistic action representation, but it can hardly be used directly and
some descriptor is therefore required to compactly summarize its contents. Two simple RM-level descriptors
computed from a given recurrence matrix are proposed. A general procedure to combine a set of RM-level
descriptors is presented. This procedure relies on a combination of early and late fusion strategies. Recogni-
tion performances indicate the proposed descriptors are competitive provided that enough training examples
are available. One important finding is the significant impact on performance of both which feature subsets are selected and how they are combined, an issue that is generally overlooked.
1 INTRODUCTION
Human action recognition has been receiving remark-
able research effort for about two decades (Aggar-
wal and Ryoo, 2011) due to both the difficulty of
the problem and its wide range of applications such
as visual surveillance, human-machine interaction, or
team sports analysis, to name but a few.
One of the relevant issues for action represen-
tation is to properly capture the temporal informa-
tion. Some solutions involve accumulating local histograms along time (Lucena et al., 2012), extracting a short time series of a few still snapshots of representative poses (Brendel and Todorovic, 2010), or decomposing actions into sequences of “actoms” (key atomic action units) and weighting visual features by their temporal distance to these actoms (Gaidon et al., 2011a).
There is some recent interest in enriching the popular
bag-of-words representation with temporal informa-
tion (Matikainen et al., 2010). Considering multiple
temporal scales (Niebles et al., 2010) can be effective
for modelling human activity which may include sim-
pler and shorter actions.
In this work, feature vectors describing the action
at frame level at two different time steps in the im-
age sequence are compared, thus producing a 2D re-
currence matrix (RM) of pair-wise distances which
captures all the frame-to-frame similarities and can therefore be viewed as a time-holistic represen-
tation providing a rich characterization of the tempo-
ral structure and evolution of the action. However,
this information, implicitly contained in the recurrence matrices, has to be summarized in the form of an
appropriate RM descriptor, which is finally used for
learning and recognizing actions.
This or similar representations were used in (Cut-
ler and Davis, 2000) to analyse periodic mo-
tion and distinguish running bipeds (humans) from
quadrupeds (dogs), and in (BenAbdelkader et al.,
2004) for gait-based biometrics. A related approach,
a delay-embedding technique, was developed in (Ali
et al., 2007) for action recognition from trajectories of
body landmarks. An auto-correlation kernel for time
series has been recently proposed for action recogni-
tion (Gaidon et al., 2011b). The most related work
is (Junejo et al., 2011), where temporal self-similarity
matrices are explored as an appropriate view-invariant
action representation. Our focus is not on view invariance, but on exploring alternative representations, both
to build the recurrence (or self-similarity) matrices,
and to derive the RM descriptor. Additionally, a gen-
eral procedure to combine RM-level descriptors using
a combination of early and late fusion strategies is in-
vestigated.
2 METHODOLOGY
The proposed system consists of a frame descriptor
(Section 2.1) that describes every frame of a given in-
put action sequence, a recurrence matrix (RM) (Sec-
tion 2.2) that compares all frame descriptors pair-
wise, and an RM descriptor (Section 2.3) that summa-
rizes the RM to characterize the action category. One
or several RM descriptors are finally used to repre-
sent any action sequence. Learning action categories
and recognizing new (unseen) action instances rely on
these RM descriptors. To be more precise, the RM
descriptors corresponding to a single action sequence
are combined using both early and late fusion (Sec-
tion 2.4).
2.1 Frame Descriptor
The Tran-Sorokin descriptor (Tran and Sorokin,
2008) is used in this work to characterize the ac-
tions at individual frames within an action sequence.
This descriptor combines both motion and shape visual cues (Table 1) into the “single-frame descriptor” (SFD) comprising 216 features. The SFDs of three 5-frame time windows, corresponding to the current frames ($[t-2, t+2]$), past frames ($[t-7, t-3]$), and future frames ($[t+3, t+7]$), are separately concatenated and PCA-projected to reduce the overall dimensionality. These parts (CFW, PFW and FFW) provide temporally contextual information which enriches the representation at a single frame.
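As an illustration, a minimal sketch (Python) of how these windows could be assembled follows; the handling of sequence boundaries (where a window falls outside $[1, T]$) is omitted, and the array layout of `sfd` is an assumption:

import numpy as np
from sklearn.decomposition import PCA

def context_windows(sfd, t):
    # sfd: (T, 216) array of single-frame descriptors of one sequence.
    # Stack the SFDs of the three 5-frame windows around frame t.
    cfw = sfd[t - 2:t + 3].ravel()  # current frames, [t-2, t+2]
    pfw = sfd[t - 7:t - 2].ravel()  # past frames,    [t-7, t-3]
    ffw = sfd[t + 3:t + 8].ravel()  # future frames,  [t+3, t+7]
    return cfw, pfw, ffw

# Each window type is PCA-projected separately to the dimensions of
# Table 1 (50/10/10), with the projections fitted on training data, e.g.:
# pca_cfw = PCA(n_components=50).fit(stacked_training_cfws)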
2.2 Recurrence Matrix
Let $f_t$ be a frame descriptor at a discrete time $t \in \{1, \ldots, T\}$, for a video sequence of $T$ frames. Then, the recurrence matrix $R$ is computed from the pair-wise distances of the frame descriptors, $R(i, j) = d(f_i, f_j)$, $i, j \in \{1, \ldots, T\}$. A binary version of this matrix can be obtained by thresholding the distances: $R_\theta(i, j) = H(d(f_i, f_j) - \theta)$, where $H$ is the Heaviside step function: $H(x) = 1$ if $x > 0$ and $H(x) = 0$ otherwise.
The proposed recurrence matrix representation is inspired by the idea of recurrence plots (Marwan et al., 2007). Recurrence plots allow one to visually analyse or automatically quantify the properties or behaviour of dynamical systems. Although these plots may be computed using concepts of phase space and time-delay methods, in this work we just use the concepts of state and state-to-state comparison. We consider the action sequence as the dynamical system, and the state at a given time is the snapshot of the action at that time. Such a state is represented with a chosen frame descriptor.
For the distance function d, the Euclidean distance
normalized by the length of the frame descriptor was
chosen. This normalization aims at removing the effect of the differing lengths of the frame descriptors.
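A minimal sketch of this construction follows (Python). Note that “normalized by the length” is read here as dividing by the descriptor dimensionality $n$, which is an assumption:

import numpy as np

def recurrence_matrix(F):
    # F: (T, n) array, one frame descriptor per row.
    # R(i, j) = d(f_i, f_j), the Euclidean distance divided by the
    # descriptor length n (our reading of the normalization above).
    T, n = F.shape
    diff = F[:, None, :] - F[None, :, :]   # all pair-wise differences
    return np.linalg.norm(diff, axis=2) / n

def binarize(R, theta):
    # R_theta(i, j) = H(d(f_i, f_j) - theta), with H the Heaviside
    # step function as defined above (1 iff its argument is positive).
    return (R > theta).astype(np.uint8)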
2.3 Describing a Recurrence Matrix
Given a recurrence matrix, a description of it is re-
quired as the final representation of the underlying ac-
tion. We have explored two different RM descriptors:
the histogram of line lengths (HoL) and the projec-
tions along anti-diagonals (PaD).
Histogram of Line Lengths (HoL). Some mea-
sures proposed for the recurrence quantification anal-
ysis (RQA) are based on the diagonal lines of a bi-
nary recurrence matrix. A diagonal of length $l$ in our recurrence matrix means that the action is similar during $l$ frames, according to the frame descriptor, distance function, and threshold used. Instead of using individual measures derived from these diagonals, such as entropy or determinism (Marwan et al., 2007), we propose to use a histogram of the diagonal lengths for a set of lengths $L = \{l_1, l_2, \ldots, l_n\}$, which is expected to provide a richer representation than a few individual measures. To further enrich the descriptor, the histogram of vertical line lengths is also considered. Both histograms are separately normalized. To find the lengths of the vertical and diagonal lines, the CRP Toolbox¹ (Marwan et al., 2007) is used. In the experiments reported below, we used $L = \{1, 2, \ldots, 20\}$.
The histogram of (diagonal and vertical) line lengths (HoL) thus has some useful properties for our problem. It provides a global quantification of the action dynamics, and it is not sensitive to the starting point of the action. Being a histogram, it is also robust to whether the action has cycles or repetitions and, to some extent, to the speed at which the action is performed.
Before thresholding, we smooth the RM with a Gaussian filter (σ = 5) on 5 × 5 local windows. This choice was somewhat arbitrary and is subject to further experimentation. Since an appropriate threshold is not easy to choose, we obtain several binary recurrence matrices with different thresholds. In the experiments reported here, we used three thresholds expressed as the 30%, 40% and 50% quantiles of the contents of the (smoothed) RM. Further work is needed to explore the effect of these thresholds or whether a single threshold suffices and can be learned.
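The paper computes the line lengths with the (MATLAB) CRP Toolbox; a rough numpy equivalent of the whole HoL pipeline is sketched below. The run-length extraction and the convention that lines are formed by below-threshold (recurrent) entries are our assumptions:

import numpy as np
from scipy.ndimage import gaussian_filter

def run_lengths(bits):
    # Lengths of the runs of 1s in a binary 1-D array.
    padded = np.concatenate(([0], bits, [0]))
    edges = np.flatnonzero(np.diff(padded))
    return edges[1::2] - edges[::2]

def hol_descriptor(R, quantiles=(0.30, 0.40, 0.50), lengths=range(1, 21)):
    # Smooth, binarize at several quantile thresholds, and histogram
    # the diagonal and vertical line lengths over L = {1, ..., 20}.
    Rs = gaussian_filter(R, sigma=5, truncate=0.4)  # ~5x5 support
    T = R.shape[0]
    parts = []
    for q in quantiles:
        # Mark recurrent (low-distance) entries as 1 (assumption: the
        # lines of interest are stretches of similar frames).
        B = (Rs <= np.quantile(Rs, q)).astype(np.uint8)
        diag = np.concatenate([run_lengths(np.diagonal(B, k))
                               for k in range(-T + 1, T)])
        vert = np.concatenate([run_lengths(B[:, j]) for j in range(T)])
        for lines in (diag, vert):
            h = np.array([(lines == l).sum() for l in lengths], float)
            parts.append(h / max(h.sum(), 1.0))  # normalize separately
    return np.concatenate(parts)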
¹ http://tocsy.pik-potsdam.de/CRPtoolbox
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
272
Table 1: Parts of the Tran-Sorokin descriptor (Tran and Sorokin, 2008). The '+' symbols and indentation denote subparts.

Short name   Description                                  Number of features
MCD          Motion Context Descriptor, full descriptor   286
SFD          Single-Frame Descriptor (OF+SH)              216
  + OF       Optical flow, (local) motion                 +144
  + SH       Silhouette, shape                            +72
CONTX        Temporally contextual information            70
  + CFW      Current-frame window, present                +50
  + PFW      Past-frame window, past                      +10
  + FFW      Future-frame window, future                  +10
Projections along anti-Diagonals (PaD). Areas of the RM close to the main diagonal represent the temporally local information, which can be captured and summarized by projecting the information in the RM along the lines orthogonal to the main diagonal. We refer to these lines as the anti-diagonals. Each projection is computed as a weighted sum of the anti-diagonal entries, where the weight decays away from the main diagonal. The resulting projection is then normalized and resized to a fixed size w (we set w = 40) so that sequences of different lengths have feature vectors of the same size.
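A sketch of this computation follows; the exponential form and rate of the decaying weight, and the use of linear interpolation for the resizing, are our assumptions, as they are not specified above:

import numpy as np

def pad_descriptor(R, w=40, decay=0.1):
    # Sum each anti-diagonal (entries with i + j = k) of the
    # unthresholded RM, weighting entries so that the contribution
    # decays away from the main diagonal (as |i - j| grows).
    T = R.shape[0]
    i, j = np.indices((T, T))
    weight = np.exp(-decay * np.abs(i - j))  # assumed decay profile
    p = np.bincount((i + j).ravel(), weights=(weight * R).ravel(),
                    minlength=2 * T - 1)     # one value per anti-diagonal
    p /= max(np.linalg.norm(p), 1e-12)       # normalize
    # Resize to a fixed length w so that sequences of different
    # lengths T yield feature vectors of the same size.
    return np.interp(np.linspace(0, 1, w), np.linspace(0, 1, p.size), p)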
Unlike HoL, the projections along anti-diagonals (PaD) better capture the temporally local information, and they are sensitive to the starting point and the number of cycles of the action. Another difference is that HoL requires the RM to be binarized in order to compute the lengths of the diagonal lines, whereas PaD can be computed with both unthresholded and binary RMs. Here, we use unthresholded RMs.
For both the HoL and PaD descriptors, the features of the SH and OF parts are split according to the 2 × 2 grid considered in the bounding box in (Tran and Sorokin, 2008). Then, independent recurrence matrices and descriptors are computed for each of the 4 cells in the grid, as well as for the global feature set. As a result, five RM descriptor sets are obtained from each of the SH and OF parts.
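A sketch of this per-cell split follows, reusing `recurrence_matrix` from the sketch in Section 2.2; the feature-to-cell mapping `cell_index` is hypothetical, since it depends on the layout of the Tran-Sorokin features:

import numpy as np

def per_cell_rm_descriptors(F, cell_index, describe):
    # F: (T, n) frame descriptors (SH or OF part only);
    # cell_index: (n,) array with the 2x2 grid cell (0..3) of each
    # feature (hypothetical layout); describe: hol_descriptor or
    # pad_descriptor. Returns the five RM descriptors (global + cells).
    descriptors = [describe(recurrence_matrix(F))]       # global set
    for c in range(4):
        Fc = F[:, cell_index == c]                       # one grid cell
        descriptors.append(describe(recurrence_matrix(Fc)))
    return descriptors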
2.4 Feature Combination
One procedure to build the final RM-based action
descriptor could consider the full 286-feature Tran-
Sorokin descriptor as the frame descriptor, then build
the RM, and finally describe the RM by using HoL or
PaD. However, we explore a more general procedure
where arbitrary parts of the Tran-Sorokin descriptor
can be used and combined flexibly, so that the dif-
ferent roles played by motion, shape and temporally
contextual cues can be studied. Additionally, having
several RMs, each computed from separate pieces of
visual information, can be more discriminative than
having a single RM of the complete information taken
as a whole (Serra-Toro and Traver, 2011).
The notation we follow to represent how the final features $\Phi$ are computed is $\Phi = \{\langle F_1 \rangle, \ldots, \langle F_N \rangle\}$, where $F_i = \{f_{i1}, \ldots, f_{in_i}\}$ denotes a set of frame descriptors $f_{ij}$, and $\langle F \rangle$ represents the concatenation of the RM descriptors resulting from the recurrence matrices obtained from each of the frame descriptors in $F$. Finally, the $N$ concatenations of RM descriptors are used in separate classifiers and a max-score fusion scheme is adopted. Let us look at a couple of clarifying examples using parts of the Tran-Sorokin descriptor (Table 1) as frame descriptors: (1) {⟨sh⟩, ⟨of⟩} means that two sets of RM descriptors are built, one using the shape features and another one using the motion features; (2) {⟨sh⟩, ⟨pfw, ffw⟩} also involves two sets of RM descriptors, but the second one is, in turn, a concatenation of the RM descriptors resulting from separately considering the PFW and FFW parts. Please note that ⟨pfw, ffw⟩ represents the concatenation of two RM descriptors, not one RM descriptor computed from the concatenation of the two frame descriptors, pfw and ffw.
Notice that this representation is fairly general and combines early fusion (concatenating RM descriptors, operator ⟨·⟩) with late fusion (combining the RM descriptors at the decision level, operator {·}). In some contexts, this kind of combination of early and late fusion has been shown to be advantageous over using only early or late fusion separately (Lan et al., 2012).
Individual features of the final feature vector are
normalized through standardization (i.e. normalized
features have zero mean and unit variance).
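As an illustration, a minimal sketch of this combination scheme follows. The scikit-learn LinearSVC is our choice for the example (the paper uses NN and a linear SVM); the standardization and the max-score late fusion follow the description above:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def train_combination(rm_sets, labels, C=1.0):
    # rm_sets[i]: (N_examples, dim_i) array holding the early-fused
    # (concatenated) RM descriptors of the i-th set <F_i>.
    # Features are standardized to zero mean and unit variance.
    models = []
    for X in rm_sets:
        scaler = StandardScaler().fit(X)
        clf = LinearSVC(C=C).fit(scaler.transform(X), labels)
        models.append((scaler, clf))
    return models

def predict_max_score(models, rm_sets_test):
    # Late fusion: the per-class scores of the individual classifiers
    # are fused by taking their element-wise maximum.
    scores = [clf.decision_function(scaler.transform(X))
              for (scaler, clf), X in zip(models, rm_sets_test)]
    fused = np.maximum.reduce(scores)
    return models[0][1].classes_[np.argmax(fused, axis=1)]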
Some Remarks. Due to its exploratory nature, this work does not aim at comparing the proposed approach with that of (Junejo et al., 2011). However, a rough comparison is provided in one part of the experimental section to assess the potential of the proposed descriptors and combination of features.
RecurrenceMatricesforHumanActionRecognition
273
Table 2: Features of the action datasets used.

                                Weizmann   UIUC1   IXMAS
No. actions                        10        14      11
No. examples                       90       532    1980
Mean sequence length (frames)      62        82      77
To better contextualize this comparison, the main differences between the two works follow. Junejo et al. (Junejo
et al., 2011) mainly address the computation of the
self-similarity matrix from point trajectories, but also
consider image-based representations. We focus on
an image-based representation. To that end, they use
the HOG descriptor (Dalal and Triggs, 2005), which
is a generic descriptor that captures shape/appearance
information, and was originally proposed for pedes-
trian detection, whereas the Tran-Sorokin frame de-
scriptor that we use, is a richer descriptor, and specifi-
cally proposed for action recognition. Indeed, for this
reason, Junejo et al. also explore the optic flow infor-
mation to complement the appearance cues in HOG.
For describing the self-similarity matrix (SSM),
Junejo et al. use histograms of gradient orientations
over log-polar grids centred on the main diagonal
of unthresholded SSMs. Then, an orderless bag-of-
words (BoW) representation of the resulting descrip-
tor is used and each action sequence is finally rep-
resented as one histogram. In contrast, we use sim-
pler descriptors, of both global (HoL) and semi-local
(PaD) nature, and do not use any BoW representation.
3 RESULTS
We used the Weizmann (Gorelick et al., 2007),
UIUC1 (Tran and Sorokin, 2008), and IX-
MAS (Weinland et al., 2006) datasets of human
actions. The Tran-Sorokin descriptor computed for
these datasets is publicly available (Tran-Sorokin,
2008). Some relevant features of these datasets
are given in Table 2. For classification, the Nearest Neighbour (NN) classifier and a linear Support Vector Machine (SVM) are used. For evaluation, the leave-one-out (LOO) and leave-one-actor-out (LOAO) protocols are used. When using the multi-view IXMAS dataset, some other protocols (described in an experiment below) are employed.
First, we evaluate the performance of three different parts of the Tran-Sorokin descriptor as frame descriptors, together with the two proposed RM descriptors. LOO results on the three datasets (Table 3) suggest that HoL is the better descriptor on the Weizmann dataset, while PaD is generally better on the UIUC1 and IXMAS datasets. Regarding the discriminative power of motion and shape cues, there is no clear winner, although shape tends to offer better results than motion. An interesting observation is the role of the FFW part of the frame descriptor which, despite its low dimensionality (just 10 features), gives comparable or better results than the shape and motion cues, which are about 7 and 14 times larger, respectively. This can be explained by the way the FFW is computed, which integrates shape and motion features over a short-time window. Finally, the poor results on the Weizmann dataset can be due to the limited amount of training sequences available (9 examples per action) and to the fact that these sequences are also shorter than those in the other datasets.
Next, we compare different feature combinations. Results (Table 4) clearly indicate that combining features of a different nature (e.g. optic flow and silhouette) tends to outperform the use of these features separately. This can be seen by comparing the LOO columns in Table 4 with the corresponding lines in Table 3 that use parts of the Tran-Sorokin descriptor separately. In addition, which features are used and how they are combined can make a significant difference. Notice, for instance, that Φ₃, Φ₄ and Φ₅ are three different ways of combining the sh, of and ffw features, which result in different performances. While combination Φ₃ gives the best results in most cases (see the column-wise best results in Table 4), the optimum way of combining the features is data-dependent. This calls for an automatic procedure that efficiently chooses a feature combination which optimizes computational and recognition criteria. While the complementarity of different visual features for action recognition is widely acknowledged, not much work has been done on finding optimal ways of fusing the information, and a simple (weighted) concatenation is often performed (Schindler and van Gool, 2008).
Finally, the influence of the camera point of view and of the classifier (NN and SVM) is analysed. To that end, we use the IXMAS dataset, which has five different views (0–4) of the same action, and try several choices for the sets of views available for training and testing. When the sets of training and test views have some view in common, the examples of the shared views are randomly split into training and test sets in an 80%-20% ratio, and the average accuracy over 10 runs is computed. A linear SVM (Chang and Lin, 2011) is used and the regularization factor $C$ is chosen from the set $\{10^e : e \in \{-3, -2, \ldots, 4\}\}$.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
274
Table 3: LOO accuracy (%) with NN using several frame and RM descriptors.

Frame descriptor   RM descriptor   Weizmann   UIUC1   IXMAS
SH                 HoL                37.8     76.1    34.7
                   PaD                40.0     83.8    66.9
OF                 HoL                50.0     69.9    45.6
                   PaD                27.7     71.6    64.9
FFW                HoL                55.6     73.9    34.3
                   PaD                46.7     63.4    59.9
Table 4: LOO and LOAO accuracies (%) with NN for different feature combinations.

Dataset                          Weizmann               UIUC1
RM descriptor                 HoL        PaD         HoL        PaD
Feature combination         LOO  LOAO  LOO  LOAO   LOO  LOAO  LOO  LOAO
Φ₁ = {⟨sh⟩, ⟨of⟩}          48.9 53.3  36.7 40.0   80.3 60.0  85.3 63.2
Φ₂ = {⟨sh, of⟩}            57.8 61.1  46.7 48.9   80.1 60.2  85.5 64.7
Φ₃ = {⟨ffw, sh, of⟩}       63.3 66.7  44.4 50.0   85.2 65.2  87.2 66.7
Φ₄ = {⟨ffw⟩, ⟨sh, of⟩}     58.9 60.0  53.3 56.7   77.8 67.1  82.9 62.6
Φ₅ = {⟨ffw⟩, ⟨sh⟩, ⟨of⟩}   58.9 58.9  43.3 47.8   79.7 69.4  84.7 65.6
To select the value of C, a validation procedure is followed using an 80%-20% split of the training set. To speed up the experiments, if a common C was observed to be consistently selected during some preliminary validation runs, this value of C was fixed so that the repeated search is avoided.
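A sketch of this selection procedure is shown below (scikit-learn is our choice for the example; the grid and the 80%-20% validation split are as described above, while the stratification and seed are illustrative):

from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def select_C(X, y, exponents=range(-3, 5)):
    # Choose C from {10^e : e = -3, ..., 4} by accuracy on an
    # 80%-20% validation split of the training set.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    scores = {10.0 ** e: LinearSVC(C=10.0 ** e)
              .fit(X_tr, y_tr).score(X_val, y_val)
              for e in exponents}
    return max(scores, key=scores.get)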
Results (Table 5) show that performance is generally better with SVM than with NN. This is particularly true when training and testing with sequences taken with the same camera (view). This seemingly counter-intuitive result can be due to the higher sensitivity of the NN classifier to a limited number of training examples (when using the same view for training and testing, fewer examples are available for training). Besides the benefit of having more training examples, the higher performance obtained when training and testing on disjoint sets of views suggests that the proposed RM descriptors exhibit some view invariance. It is worth noticing how the use of Φ₄ significantly outperforms Φ₃, most notably for the NN classifier. This indicates the positive effect of fusing the scores of classifiers based on different sets of features. Please note that no late fusion is involved in Φ₃, since it consists of a single set of concatenated RM descriptors.
As a rough comparison with similar state-of-the-art approaches, the results from (Junejo et al., 2011) when using a combination of HOG and optic flow are also provided in Table 5. While their results are usually better when using the same view for training and testing, our proposal performs similarly or better for the other view combinations. Although the comparison is not performed under exactly the same conditions², the figures suggest that our proposal for RM descriptors and feature combinations is competitive, even with simpler descriptors and classifiers (e.g. they use a non-linear SVM with a χ² kernel).
4 CONCLUSIONS
A temporally holistic action representation based on
recurrence matrices has been explored. Two recur-
rence matrix descriptors, and a general way of fea-
ture generation which combines early and late fusion
strategies, have been proposed. The experiments reveal that the proposed descriptors offer competitive results despite being simpler than existing descriptors in the context of recurrence matrices. However, the performance depends on having enough training examples and/or using advanced classifiers. Further work
² For instance, (Junejo et al., 2011) report that the same performer does not appear in the training and test sets at the same time. However, we did not consider this separation since the annotation of the performer was not available in the feature dataset (Tran-Sorokin, 2008) that we use.
RecurrenceMatricesforHumanActionRecognition
275
Table 5: Accuracy (%) with classifiers NN and SVM (C = 10⁴), RM descriptor PaD, and feature combinations Φ₃ and Φ₄ (see Table 4) on different sets of views (IXMAS dataset) for training and testing.

                                      Training views : Testing views
Feat. comb.   Classifier    0:0   1:1   2:2   3:3   4:4   2:3   1,2:3   All:All
Φ₃            NN           44.9  49.5  45.5  45.9  47.3  66.4   65.4    66.6
              SVM          65.8  63.9  64.5  65.1  61.1  68.9   80.6    78.4
Φ₄            NN           66.3  66.9  69.1  66.6  60.5  79.5   81.1    71.2
              SVM          70.5  72.3  73.0  75.4  65.8  68.9   77.0    75.3
SSM (HOG+OF) (Junejo et al., 2011)   77.0  77.3  75.8  71.2  68.8  68.5   N/A    74.6
is required to understand why the proposed system (descriptors, fusion strategy, or classifier) lags somewhat behind the state-of-the-art results, so that it can be made more discriminative, yet as simple as possible.
Combining features of a different nature (such as shape, motion, and time-contextual information) generally improves the performance over individual subsets of these features. However, it is observed that which frame descriptors are chosen and how they are combined may significantly affect the performance in a data-dependent way. Consequently, devising an efficient procedure to select both a proper subset of the descriptor parts and a suitable fusion strategy is among the most interesting research possibilities.
ACKNOWLEDGEMENTS
This work is partially supported by the Span-
ish research programme Consolider Ingenio-2010
CSD2007-00018, Fundació Caixa-Castelló Bancaixa
(projects P1·1A2010-11 and P1·1B2010-27), and
Generalitat Valenciana (PROMETEO/2010/028).
REFERENCES
Aggarwal, J. K. and Ryoo, M. S. (2011). Human activity
analysis: A review. ACM Comp. Surv., 43(3).
Ali, S., Basharat, A., and Shah, M. (2007). Chaotic invari-
ants for human action recognition. In ICCV.
BenAbdelkader, C., Cutler, R., and Davis, L. S. (2004). Gait
recognition using image self-similarity. EURASIP J.
on Applied Signal Processing, 2004(4).
Brendel, W. and Todorovic, S. (2010). Activities as time
series of human postures. In ECCV, pages 721–734.
Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: a library
for support vector machines. ACM Transactions on
Intelligent Systems and Technology, 2(3):27:1–27:27.
Cutler, R. and Davis, L. S. (2000). Robust periodic mo-
tion and motion symmetry detection. In CVPR, pages
2615–2622.
Dalal, N. and Triggs, B. (2005). Histograms of oriented
gradients for human detection. In CVPR.
Gaidon, A., Harchaoui, Z., and Schmid, C. (2011a). Ac-
tom sequence models for efficient action detection. In
CVPR, pages 3201–3208.
Gaidon, A., Harchaoui, Z., and Schmid, C. (2011b). A time
series kernel for action recognition. In BMVC.
Gorelick, L., Blank, M., Shechtman, E., Irani, M., and
Basri, R. (2007). Actions as space-time shapes. PAMI,
29(12):2247–2253.
Junejo, I. N., Dexter, E., Laptev, I., and Pérez, P. (2011).
View-independent action recognition from temporal
self-similarities. PAMI, 33(1):172–185.
Lan, Z.-z., Bao, L., Yu, S.-I., Liu, W., and Hauptmann,
A. G. (2012). Double fusion for multimedia event de-
tection. In Proc. of the 18th Intl. Conf. on Advances in
Multimedia Modeling, pages 173–185.
Lucena, M. J., de la Blanca, N. P., and Fuertes, J. M. (2012).
Human action recognition based on aggregated lo-
cal motion estimates. Mach. Vis & Apps. (MVA),
23(1):135–150.
Marwan, N., Romano, M. C., Thiel, M., and Kurths, J.
(2007). Recurrence plots for the analysis of complex
systems. Physics Reports, 438(5–6):237–329.
Matikainen, P., Hebert, M., and Sukthankar, R. (2010).
Representing pairwise spatial and temporal relations
for action recognition. In ECCV, pages 508–521.
Niebles, J. C., Chen, C.-W., and Li, F.-F. (2010). Modeling
temporal structure of decomposable motion segments
for activity classification. In ECCV, pages 392–405.
Schindler, K. and van Gool, L. (2008). Action snippets:
How many frames does human action recognition re-
quire? In CVPR.
Serra-Toro, C. and Traver, V. J. (2011). A new pedestrian
detection descriptor based on the use of spatial recur-
rences. In CAIP, pages 97–104.
Tran, D. and Sorokin, A. (2008). Human activity recogni-
tion with metric learning. In ECCV, pages 548–561.
Tran-Sorokin (2008). Human activity recognition with metric learning. http://vision.cs.uiuc.edu/projects/activity.
Weinland, D., Ronfard, R., and Boyer, E. (2006). Free
viewpoint action recognition using motion history vol-
umes. Comp. Vis. & Image Underst. (CVIU), 104(2–
3):249–257.
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
276