tor should be chosen in relation to the frame rate of
the analyzed video and the length of an atomic part of
the analyzed action (in this case, one step of the person
executing the action), and it should be validated on the
training data. Deciding for a sequence by a majority
vote over the frame decisions of the first 24 frames
proved well suited for the current data set. As a rule of
thumb, the more frames are considered, the more reliable
the sequence decision becomes. During feature extraction,
we use Q = 40 FDs left and right of zero and hence
implicitly assume a minimal contour length for a given
image resolution. This number of FDs is well suited for
our data, but in practice it should be chosen according
to these considerations. All parameters for which no such
rules were available were established by six-fold
cross-validation using the videos of the first 10 persons.
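To make these two ingredients concrete, the following Python sketch extracts Q Fourier descriptors on each side of the zero frequency from a closed contour and then decides for a sequence by a majority vote over the first frame decisions. It is a schematic illustration only; function names and the (N, 2) contour representation are chosen for exposition, not taken from our implementation, and frame_labels stands for the per-frame classifier outputs of one video sequence.

    import numpy as np

    def fourier_descriptors(contour, Q=40):
        # Treat the closed contour (an (N, 2) array of boundary points)
        # as a complex signal z = x + i*y and take its DFT.
        z = contour[:, 0] + 1j * contour[:, 1]
        Z = np.fft.fft(z)
        # Keep Q coefficients left and right of zero frequency; the DC
        # term Z[0] (the centroid) is dropped, which removes the
        # dependence on the contour's position.  This implicitly
        # assumes the contour has at least 2*Q + 1 points.
        return np.concatenate([Z[1:Q + 1], Z[-Q:]])

    def sequence_decision(frame_labels, n_frames=24):
        # Majority vote over the per-frame decisions of the first
        # n_frames frames of the sequence.
        votes = np.asarray(frame_labels[:n_frames])
        classes, counts = np.unique(votes, return_counts=True)
        return classes[np.argmax(counts)]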
Our feature vector is invariant to several variations
of the person’s contour. Some of these variations
correspond to anthropometric changes, while others can
be thought of as corresponding to viewpoint and scale
changes; our feature vector thus also exhibits a mild
viewpoint invariance.
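As an illustration of one such invariance, a common normalization for Fourier descriptors divides all coefficients by the magnitude of the first coefficient, which cancels a uniform scaling of the contour; related normalizations are used by Arbter et al. (1990) to obtain affine invariance. The sketch below, operating on the descriptor vector returned by fourier_descriptors above, shows only this standard construction and is not necessarily the exact normalization employed here.

    import numpy as np

    def normalize_scale(fds):
        # Scaling the contour by a factor s scales every Fourier
        # coefficient by s, so dividing by |Z_1| (the first kept
        # coefficient) cancels a uniform scaling of the contour.
        return fds / np.abs(fds[0])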
A decision for a sequence of around 50 frames is
available after 47 seconds under MATLAB on a 2.66
GHz dual-core machine. However, many of the
algorithmic steps can be parallelized.
The sparse classification paradigm can be used
for action recognition and to detect all sorts of point
events. While not directly suited for context and
collective events, it may serve as an action-recognition
building block for such algorithms, with other blocks
needed to analyze the chains of individual actions
(Matern et al., 2011). The particular feature extraction
process we use here is specific to the analysis of human
behavior. We are concerned with the behavior of a person
in a single track, and the current feature extraction is
adapted to this case. A prerequisite for deploying these
methods in more complicated scenarios is successful
tracking, irrespective of the number of cameras used.
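To make the sparse classification paradigm concrete, the following sketch outlines it in the form introduced by Wright et al. (2009): a test feature vector y is represented as a sparse linear combination of the training features stacked as columns of a dictionary A, and the class whose training columns explain y with the smallest residual wins. The ℓ1 problem is solved here by iterative soft-thresholding, a simple stand-in for the solvers discussed in the literature; names and parameter values are illustrative.

    import numpy as np

    def ista(A, y, lam=0.01, n_iter=500):
        # Solve min_x 0.5*||A x - y||_2^2 + lam*||x||_1 by iterative
        # soft-thresholding; any other l1 solver could be used instead.
        L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(n_iter):
            g = x - A.T @ (A @ x - y) / L      # gradient step
            x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
        return x

    def src_classify(A, labels, y):
        # A: (d, n) dictionary of n training feature vectors (unit-norm
        # columns), labels: length-n class labels, y: test feature vector.
        labels = np.asarray(labels)
        x = ista(A, y)
        # Per-class residual: reconstruct y using only the coefficients
        # that belong to class c, then pick the smallest residual.
        residuals = {c: np.linalg.norm(y - A[:, labels == c] @ x[labels == c])
                     for c in np.unique(labels)}
        return min(residuals, key=residuals.get)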
Our algorithm can be seen as consisting of two
parts: feature extraction and sparse classification. The
feature extraction is targeted at certain invariances
and tailored to the sparse classifier. We have shown
that sparse classification as introduced by Wright et
al. (2009) is well suited for human event detection.
Sparse classification offers a set of advantages over
other methods for the problem of action recognition
and event detection, being robust, adaptive, and easy
to tune. In this context, the focus shifts to the
extraction of suitable features that enable the use of
such methods. Furthermore, even though the issue of
invariance can be addressed at the classifier level when
using sparse classifiers, many of the desirable
invariance properties that characterize a good human
action recognition/event detection method should be
obtained by means of the feature extraction process. We
have shown how to use invariant integration as described
by Schulz-Mirbach (1992) to extract such features from
the contour of the acting person.
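To indicate what such features look like, the sketch below averages a monomial of the contour samples over the cyclic group of starting-point shifts. This is the basic construction of invariant integration (Schulz-Mirbach, 1992; Schulz-Mirbach, 1994): the group average of any function of the data is, by construction, invariant under the group action. The exponent choice and the group shown here are illustrative rather than the exact configuration used in our experiments.

    import numpy as np

    def invariant_monomial(z, exponents):
        # z: complex contour samples; exponents: sparse map {index: power}.
        # Averaging the monomial prod_k z[(k + s) mod N]^b_k over all
        # cyclic shifts s yields a feature that is invariant to the
        # starting point of the contour parameterization.
        N = len(z)
        acc = 0.0
        for s in range(N):
            term = 1.0
            for k, b in exponents.items():
                term *= z[(k + s) % N] ** b
            acc += term
        return acc / N

For example, invariant_monomial(z, {0: 2, 5: 1}) averages z[s]^2 * z[s+5] over all starting points s, so the result does not change when the contour is traversed from a different starting point.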
REFERENCES
Arbter, K., Snyder, W., Burkhardt, H., and Hirzinger, G.
(1990). Application of affine-invariant Fourier descriptors
to recognition of 3-D objects. IEEE Trans. Patt.
Anal. Mach. Intell., 12:640–647.
Candès, E. and Tao, T. (2006). Near-optimal signal
recovery from random projections: Universal encoding
strategies? IEEE Trans. Inform. Theory, 52(12):5406–5425.
d’Aspremont, A., Ghaoui, L. E., Jordan, M., and Lanckriet,
G. (2007). A direct formulation of sparse PCA using
semidefinite programming. SIAM Review, 49(3).
Donoho, D. (2006). Compressed sensing. IEEE Trans. In-
form. Theory, 52(4):1289–1306.
Donoho, D. and Elad, M. (2003). Optimal sparse
representation in general (nonorthogonal) dictionaries via
ℓ1 minimization. Proc. Nat’l Academy of Sciences,
pages 2197–2202.
Gorelick, L., Blank, M., Shechtman, E., Irani, M., and
Basri, R. (2007). Actions as space-time shapes. Trans.
Patt. Anal. Mach. Intell., 29(12):2247–2253.
Guo, K., Ishwar, P., and Konrad, J. (2010). Action recog-
nition using sparse representation on covariance man-
ifolds of optical flow. In Proc. AVSS, pages 188–195.
Matern, D., Condurache, A. P., and Mertins, A. (2011).
Event detection using log-linear models for coronary
contrast agent injections. In Proc. ICPRAM.
Müller, F. and Mertins, A. (2011). Contextual invariant-
integration features for improved speaker-independent
speech recognition. Speech Comm., 53(6):830–841.
Otsu, N. (1979). A threshold selection method from gray-
level histograms. IEEE Trans. on Sys., Man and Cyb.,
SMC-9(1):62–66.
Schuldt, C., Laptev, I., and Caputo, B. (2004). Recognizing
human actions: A local SVM approach. Proc. ICPR,
3:32–36.
Schulz-Mirbach, H. (1992). On the existence of complete
invariant feature spaces in pattern recognition. In
Proc. ICPR, volume 2, pages 178–182, The Hague.
Schulz-Mirbach, H. (1994). Algorithms for the construction
of invariant features. In DAGM Symposium, volume 5,
pages 324–332, Wien.
Wright, J., Yang, A., Ganesh, A., Sastry, S., and Ma, Y.
(2009). Robust face recognition via sparse representa-
tion. IEEE Trans. Patt. Anal. Mach. Intell., 31(2):210–
227.
Yang, A., Jafari, R., Sastry, S., and Bajcsy, R. (2008).
Distributed recognition of human actions using wear-
able motion sensor networks. J Amb. Intl. Smt. Env.,
30(5):893–908.