
without any consideration of the temporal context.
In this work, we consider the use of the trajectories of local space-time interest points (STIPs), i.e., points with significant local variation in both space and time, thus extending the approaches above, which are limited to 2D interest points. Indeed, STIPs have proven to be a strong feature extraction method that has yielded impressive results in real-world human action recognition tasks. Our motivation is that STIPs’
trajectories can provide rich spatio-temporal
information about human activity at the local level.
For sequence modeling at the global level, a suitable
statistical sequence model is required.
Hidden Markov Models (HMMs) (Rabiner,
1989) have been widely used for temporal sequence
recognition. However, HMMs make strong independence assumptions among features that are hardly met in human activity tasks. Furthermore, generative models like HMMs use a joint model to solve a conditional problem, thus spending modeling effort on the observations, which are fixed at run time anyway. To overcome these
problems, Lafferty et al. (Lafferty et al., 2001) proposed a powerful discriminative model, the Conditional Random Field (CRF), for text sequence labeling. A CRF is a sequence labeling model that can incorporate long-range dependencies among observations. A CRF assigns a label to each observation in a sequence, but it cannot capture the intrinsic sub-structure of the observations. To deal with this,
the CRF is augmented with hidden states that model the latent structure of the input domain, yielding the so-called Hidden CRF (HCRF) (Quattoni, 2004). This
makes it better suited to modeling temporal and
spatial variation in an observation sequence. Such a
capability is particularly important as human
activities usually consist of a sequence of elementary
actions. However, HCRF training is time-consuming, especially with high-dimensional observations. To overcome this problem, we
propose to combine the HCRF with a discriminative local classifier (e.g., an SVM). The local classifier predicts confidences over activity labels from the input vectors, and these per-class confidence measurements are used as the input observations to the HCRF model. Assuming, as is usually the case, that the number of classes is significantly lower than the feature dimensionality, this greatly reduces the dimensionality of the observation space during HCRF inference while exploiting the high discriminative power of the SVM.
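As an illustration, the following minimal Python sketch shows this two-stage pipeline on synthetic data, using scikit-learn's SVC as the local classifier; the HCRF at the end is a hypothetical placeholder, since no particular implementation is implied here.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_classes, n_features = 5, 162            # e.g. HOG/HOF-sized descriptors
X = rng.normal(size=(400, n_features))    # synthetic local feature vectors
y = rng.integers(0, n_classes, size=400)  # synthetic activity labels

# Stage 1: discriminative local classifier with probability estimates.
svm = SVC(kernel="rbf", probability=True).fit(X, y)

# Stage 2: map each observation of a sequence to its vector of per-class
# confidences, so the HCRF sees n_classes-dimensional inputs instead of
# n_features-dimensional ones.
sequence = rng.normal(size=(30, n_features))  # one video as a sequence
confidences = svm.predict_proba(sequence)     # shape (30, n_classes)

# 'HCRF' is a hypothetical stand-in for an HCRF implementation:
# model = HCRF(n_hidden_states=10).fit(train_confidence_seqs, labels)
# activity = model.predict([confidences])

Reducing the observation dimensionality in this way directly shrinks the number of HCRF parameters, which is what makes training tractable.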
To summarize, the first objective of this paper is
to investigate the use of STIPs’ trajectories as
activity descriptors. To the best of our knowledge, such a descriptor has not been investigated in prior work. The second objective is to assess the discriminative power of the HCRF-SVM combination on a daily-living activity recognition task; this constitutes the second contribution of our work.
The organization of the paper is as follows.
Section 2 gives a brief description of local space-time features. The HCRF and its combination with the SVM
are reviewed in Section 3. In Section 4, the
databases used for experiments are described and
results are detailed and compared with the state of
the art. Section 5 draws some conclusions and
sketches future directions for this work.
2 LOCAL SPACE-TIME
TRAJECTORIES
Local space-time features capture structural and
temporal information from a local region in a video
sequence. A variety of approaches exist to detect
these features (Wang et al., 2009). One of the most popular is the Space-Time Interest Point (STIP) detector (Laptev et al., 2001), which extends the Harris corner detector to the space-time domain. The main idea is to find points whose neighborhood exhibits a significant change in both space and time.
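As a rough illustration of this criterion, the following Python sketch computes a space-time extension of the Harris response from a smoothed spatio-temporal second-moment matrix; the smoothing scales and the constant k are illustrative values, not those of the cited detector.

import numpy as np
from scipy.ndimage import gaussian_filter

def harris3d_response(video, sigma=2.0, tau=1.5, k=0.005):
    # video: float array of shape (t, y, x). Smooth at spatial scale
    # sigma and temporal scale tau, then take spatio-temporal gradients.
    L = gaussian_filter(video, sigma=(tau, sigma, sigma))
    Lt, Ly, Lx = np.gradient(L)

    # Entries of the second-moment matrix, averaged over a larger
    # space-time window.
    w = (2 * tau, 2 * sigma, 2 * sigma)
    Mxx = gaussian_filter(Lx * Lx, w); Myy = gaussian_filter(Ly * Ly, w)
    Mtt = gaussian_filter(Lt * Lt, w); Mxy = gaussian_filter(Lx * Ly, w)
    Mxt = gaussian_filter(Lx * Lt, w); Myt = gaussian_filter(Ly * Lt, w)

    # Corner-style response: large where intensity varies along all
    # three axes; interest points are its local maxima.
    det = (Mxx * (Myy * Mtt - Myt ** 2)
           - Mxy * (Mxy * Mtt - Myt * Mxt)
           + Mxt * (Mxy * Myt - Myy * Mxt))
    trace = Mxx + Myy + Mtt
    return det - k * trace ** 3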
To characterize the detected points, histograms of oriented gradients (HOG) and histograms of optical flow (HOF) are usually computed inside a space-time volume surrounding each interest point and used as descriptors.
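The following minimal sketch shows the core histogramming step, assuming the gradient (or optical flow) components over a cuboid are already available; the actual descriptors additionally split the volume into a grid of cells and concatenate one histogram per cell.

import numpy as np

def orientation_histogram(dx, dy, n_bins=8):
    # dx, dy: x/y gradient (for HOG) or flow (for HOF) components over
    # a (t, y, x) cuboid. Orientations are binned, weighted by magnitude.
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                              # in [-pi, pi]
    bins = ((ang + np.pi) / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)                     # L1-normalized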
To provide a description at the level of a whole video, one of the most popular methods is to represent each video sequence by a bag of words (BOW) of the STIPs’ HOG/HOF descriptors.
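A minimal sketch of this baseline, with a k-means vocabulary and synthetic vectors standing in for real HOG/HOF descriptors:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
pooled = rng.normal(size=(5000, 162))  # descriptors from all training videos
vocab = KMeans(n_clusters=100, n_init=10).fit(pooled)

def bow_histogram(descriptors, vocab):
    # Assign each descriptor to its nearest visual word and build a
    # normalized histogram of word counts for the whole video.
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

video_desc = rng.normal(size=(300, 162))  # one video's STIP descriptors
h = bow_histogram(video_desc, vocab)      # fixed-length representation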
However, this representation does not capture the spatio-temporal layout of the detected STIPs. To overcome this limitation, a number of recent methods encode the spatio-temporal distribution of interest points. Nevertheless, these methods typically ignore the spatio-temporal evolution of each STIP over the video sequence. As mentioned above, some approaches have attained good results using the trajectories of 2D interest points, which are mainly adapted to the 2D spatial domain. In this section, we present our approach to activity representation based on the trajectories of STIPs (Figure 1), which are adapted to video data.
To construct our basic features, we first extract STIPs from the video sequences. Then we track them with the Kanade-Lucas-Tomasi (KLT) tracker.
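A minimal OpenCV sketch of this tracking step is given below; the grayscale frames and initial STIP locations are assumed given, and the default pyramidal Lucas-Kanade parameters shown are illustrative only.

import numpy as np
import cv2

def track_points(frames, points):
    # frames: list of grayscale images; points: initial (x, y) STIP
    # locations. Returns one (t, x, y) trajectory per starting point.
    pts = np.float32(points).reshape(-1, 1, 2)
    trajectories = [[(0, float(x), float(y))] for (x, y) in points]
    for t in range(1, len(frames)):
        nxt, status, _ = cv2.calcOpticalFlowPyrLK(
            frames[t - 1], frames[t], pts, None)
        for i, (ok, p) in enumerate(zip(status.ravel(), nxt)):
            if ok:  # extend only trajectories that are still tracked
                trajectories[i].append((t, float(p[0, 0]), float(p[0, 1])))
        pts = nxt
    return trajectories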