VIDEO

EVENT CLASSIFICATION AND DETECTION

USING 2D TRAJECTORIES

Alexandre Hervieu, Patrick Bouthemy

INRIA, Centre Rennes - Bretagne Atlantique, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

Jean-Pierre Le Cadre

INRIA, Centre Rennes - Bretagne Atlantique / CNRS, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France

Keywords:

Image sequence analysis, Image motion analysis, Hidden Markov models, Pattern recognition.

Abstract:

This paper describes an original statistical trajectory-based approach which can address several issues related

to dynamic video content understanding: unsupervised clustering of events, recognition of events correspond-

ing to learnt classes of dynamic video contents, and detection of unexpected events. Appropriate local differ-

ential features combining curvature and motion magnitude are robustly computed on the trajectories. They are

invariant to image translation, in-the-plane rotation and scale transformation. The temporal causality of these

features is then captured by hidden Markov models whose states are properly quantized values, and similarity

between trajectories is expressed by exploiting the HMM framework. We report experiments on two sets of

data, a ﬁrst one composed of typical classes of synthetic (noised) trajectories (such as parabola or clothoid),

and a second one formed with trajectories computed in sports videos. We have also favorably compared

our method to other ones, including feature histogram comparison, use of the longest common subsequence

(LCSS) distance and SVM-based classiﬁcation.

1 INTRODUCTION

Content-based exploitation of video footage is of con-

tinuously increasing interest in numerous applica-

tions, e.g., for retrieving video sequences in huge TV

archives, creating automatic video summarization of

sports TV programs (Kokaram et al., 2006), or detect-

ing speciﬁc actions or activities in video-surveillance

(Boiman and Irani, 2005; Hu et al., 2007). This requires narrowing the well-known semantic gap between computed low-level features and high-level concepts.

Considering 2D trajectories is attractive since they

form computable image features which capture elab-

orate spatio-temporal information on the viewed ac-

tions. Methods for tracking moving objects in an im-

age sequence are now available to get reliable enough

2D trajectories in various situations. These trajecto-

ries are given as a set of consecutive positions in the

image plane (x, y) over time. If they are embedded in

an appropriate modeling framework, high-level infor-

mation on the dynamic scene can then be reachable.

We aim at designing a general trajectory classiﬁ-

cation method. It should take into account both the

trajectory shape (geometrical information related to

the type of motion and to variations in the motion di-

rection) and the speed change of the moving object

on its trajectory (dynamics-related information). Unless required by a specific application, it should not be affected by the location of the trajectory in the image plane (invariance to translation), by its direction in the image plane (invariance to rotation), or by the distance of the viewed action to the camera (invariance to scale). It should also be robust, since local differential features computed on the extracted trajectories are prone to noise corruption. Finally, it should not exploit strong a priori information on the scene structure, the camera set-up, or the 3D object motions.

In this paper we tackle three important tasks re-

lated to dynamic video content understanding within

the same trajectory-based framework. The ﬁrst one

is clustering trajectories extracted from videos. An

unsupervised solution is developed. The second considered problem is recognizing (or retrieving) events in videos. Semantic classes of dynamic video contents are first learnt from a set of representative training trajectories. The third task is detecting unexpected events by comparing the test trajectory to representative trajectories of known classes of events.

Hervieu A., Bouthemy P. and Le Cadre J. (2008). VIDEO EVENT CLASSIFICATION AND DETECTION USING 2D TRAJECTORIES. In Proceedings of the Third International Conference on Computer Vision Theory and Applications, pages 158-166. DOI: 10.5220/0001073001580166. SciTePress.

The remainder of the paper is organized as fol-

lows. In Section 2, we outline related work on

trajectory-based video content analysis. In Section 3,

we introduce the local differential features consid-

ered to represent 2D trajectories. We show that they

are invariant to 2D translation, 2D rotation and scale

transformation, and we also describe their computa-

tion. Section 4 presents our HMM-based framework

to model trajectories. It can be viewed as a (statistical)

quantization of the local features while accounting for

their temporal evolution. We also describe the HMM-

based similarity measure used to compare or to clas-

sify trajectories. Section 5 deals with the detection of

unexpected events. Section 6 introduces the other classification methods used in the comparative experimental evaluation of the proposed method.

In Section 7, we present the two data sets used to

test and compare the methods. The ﬁrst one is com-

posed of typical classes of synthetic (noised) trajec-

tories (such as parabola or clothoid), and the second

one includes trajectories computed in sports videos.

Results are then reported and discussed. Concluding

remarks are given in Section 8.

2 RELATED WORK

Trajectory analysis can help recognize events, actions, or interactions between people and objects. Early methods considered point coordinates and local orientations on image trajectories as input features (Bashir et al., 2007; Buzan et al., 2004; Chan et al., 2004; Porikli, 2004). Such features can only express strict spatial similarity between trajectories. Other methods use velocities as features to compare 2D trajectories (Porikli, 2004; Hu et al., 2007; Wang et al., 2006), but visual velocity still depends on the distance of the viewed action to the camera.

Different methods have been developed to com-

pare and cluster trajectories in order to analyze the

content of video sequences. Buzan et al. (Buzan et al.,

2004) resorted to the Longest Common Subsequence

(LCSS) distance (Vlachos et al., 2002), to classify tra-

jectories computed in an image sequence acquired by

a single stationary camera for video surveillance. It is

based on a hierarchical unsupervised clustering of tra-

jectories where trajectory features are vectors of 2D

coordinates of the trajectory points. Wang et al. intro-

duced a novel similarity measure based on a modiﬁed

Hausdorff distance and a comparison conﬁdence mea-

sure. They compare the distributions of the spatial co-

ordinates of the trajectory points, and also use other

attributes, such as velocity and object size (Wang et al., 2006). Bashir et al. presented a trajectory-based

real-time indexing method (Bashir et al., 2007), us-

ing PCA and spectral clustering. A system that learns

patterns of activity from trajectories, and hierarchi-

cally classiﬁes sequences using a codebook was de-

veloped by Stauffer and Grimson (Stauffer and Grim-

son). Li et al. considered statistical distributions of

trajectory orientations exploited in a clustering algo-

rithm (Li et al., 2006). Recent work has explored

modeling frameworks such as DPN (Dynamic Proba-

bilistic Network) and HMM (Hidden Markov Model)

to express the temporal information (causality) em-

bedded in video trajectories and the semantic mean-

ing that they convey. Hongeng et al. (Hu et al., 2007)

described a complex event recognition method based

on the deﬁnition of scenarios and on the use of Semi-

Markov Chain (SMC). Chan et al. (Chan et al., 2006)

coped with fragmented tracks that occur when using

mean-shift tracking. They attempted to jointly solve

the problem of linking these “tracklets” and recogniz-

ing complex events using DBN (Dynamic Bayes Net).

They also proposed a method for detecting rare events

by representing motions and space-time relations be-

tween objects using HMMs (Chan et al., 2004). A

recognition method for group activities was deﬁned

by Gong and Xiang (Gong and Xiang, 2003) relying

on DPN to model and detect actions involving multiple objects. DPNs are used specifically to model the temporal relationships among different temporal events in the scene. Porikli defined distances to handle trajectories, in particular an HMM-based distance (Porikli, 2004). The methods based on HMMs, SMCs or DPNs developed so far are unable to handle short trajectories (see subsection 4.1). Let us also stress that all the aforementioned methods exploit features invariant to translation or scale transformation only.

The approach we have designed differs from those proposed so far in several respects. First, we introduce local differential trajectory features which jointly capture information on the trajectory shape and on the object speed. Besides, they are invariant to translation, rotation and scale transformations. We have also developed a procedure to compute them which is efficient and robust to noise. Second, the temporal evolution of these features along the trajectory curve is explicitly accounted for by an original and effective HMM scheme, in which the HMM states are given by properly quantizing the real feature values. Our HMM method is also able to process trajectories of any size (especially short trajectories). Moreover, we have adopted an HMM distance which can be exploited both for clustering and recognizing dynamic video contents and for detecting unexpected events. All these elements make the overall framework we have defined general and flexible.

3 INVARIANT LOCAL

TRAJECTORY FEATURES

A feature that represents both trajectory shape and

object acceleration (more speciﬁcally, we mean ve-

locity magnitude change) is required to capture the

full intrinsic properties of a video trajectory. As

stressed in the introduction, it should also be invariant

to 2D translation, 2D rotation and scale transforma-

tion, which will be helpful in most video applications.

3.1 Trajectory Kernel Smoothing

A trajectory $T_k$ is defined by a set of $n_k$ points $\{(x_1, y_1), \ldots, (x_{n_k}, y_{n_k})\}$ corresponding to the successive image positions of the tracked object in the image sequence (video shot). The term “object” must be understood in a broad sense, i.e., interest point, gravity center of a segmented region, window center, etc. To reliably compute the local differential trajectory features, we need a continuous representation of the curve formed by the trajectory. To this end, we perform a kernel approximation of $T_k$ defined by

$$u_t = \frac{\sum_{j=1}^{n_k} e^{-\left(\frac{t-j}{h}\right)^2} x_j}{\sum_{j=1}^{n_k} e^{-\left(\frac{t-j}{h}\right)^2}}, \qquad v_t = \frac{\sum_{j=1}^{n_k} e^{-\left(\frac{t-j}{h}\right)^2} y_j}{\sum_{j=1}^{n_k} e^{-\left(\frac{t-j}{h}\right)^2}}, \qquad (1)$$

where $(x_t, y_t)$ designates the coordinates of the tracked object at $t$ and $(u_t, v_t)$ its smoothed representation. $h$ is a smoothing parameter to be set according to the observed noise magnitude. Explicit expressions can then be derived for the first- and second-order temporal derivatives of the trajectory positions: respectively, $\dot{u}_t$, $\dot{v}_t$, $\ddot{u}_t$ and $\ddot{v}_t$.
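As a minimal sketch of the kernel approximation of relation (1), the following Python function smooths a list of positions with Gaussian weights. The function name `smooth_trajectory` and the toy zigzag track are illustrative assumptions; samples are assumed to be indexed at instants $t = 1, \ldots, n_k$ with a unit time step.

```python
import math

def smooth_trajectory(points, h):
    """Gaussian-kernel smoothing of (x, y) positions sampled at t = 1..n."""
    n = len(points)
    smoothed = []
    for t in range(1, n + 1):
        # Weight of sample j for the estimate at instant t: exp(-((t - j) / h)^2).
        weights = [math.exp(-(((t - j) / h) ** 2)) for j in range(1, n + 1)]
        total = sum(weights)
        u = sum(w * x for w, (x, _) in zip(weights, points)) / total
        v = sum(w * y for w, (_, y) in zip(weights, points)) / total
        smoothed.append((u, v))
    return smoothed

# Hypothetical noisy track: x grows linearly, y alternates between +1 and -1.
zigzag = [(float(t), float((-1) ** t)) for t in range(20)]
smooth = smooth_trajectory(zigzag, h=3.0)
```

A larger `h` averages over more neighbouring samples, which removes more noise but also flattens genuine sharp turns, hence the need to set it according to the observed noise magnitude.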

3.2 Derivation of the Trajectory

Features

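Let us first consider the local orientation of the curve given by $\gamma_t = \arctan(\dot{v}_t / \dot{u}_t)$. By construction, it is invariant to 2D translation and scale transformation. To add invariance to 2D rotation, let us now take the temporal derivative $\dot{\gamma}_t$ of $\gamma_t$, and let us analyze this quantity. We have

$$\frac{d(\tan\gamma_t)}{dt} = \frac{\dot{\gamma}_t}{\cos^2\gamma_t}.$$

On the other hand:

$$\frac{d(\tan\gamma_t)}{dt} = \frac{\ddot{v}_t\dot{u}_t - \ddot{u}_t\dot{v}_t}{\dot{u}_t^2}.$$

Then

$$\dot{\gamma}_t = \cos^2\gamma_t \, \frac{\ddot{v}_t\dot{u}_t - \ddot{u}_t\dot{v}_t}{\dot{u}_t^2}.$$

Also

$$\cos^2\gamma_t = (1 + \tan^2\gamma_t)^{-1} = \frac{\dot{u}_t^2}{\dot{u}_t^2 + \dot{v}_t^2}.$$

Hence

$$\dot{\gamma}_t = \frac{\ddot{v}_t\dot{u}_t - \ddot{u}_t\dot{v}_t}{\dot{u}_t^2 + \dot{v}_t^2} = \kappa_t \cdot \|w_t\| \qquad (2)$$

where $\kappa_t = \frac{\ddot{v}_t\dot{u}_t - \ddot{u}_t\dot{v}_t}{(\dot{u}_t^2 + \dot{v}_t^2)^{3/2}}$ is the local curvature of the trajectory and $\|w_t\| = (\dot{u}_t^2 + \dot{v}_t^2)^{1/2}$ the local velocity magnitude at point $(u_t, v_t)$. The numerator of $\dot{\gamma}_t$ is the determinant of matrix $\begin{pmatrix} \dot{u}_t & \ddot{u}_t \\ \dot{v}_t & \ddot{v}_t \end{pmatrix}$ and the denominator $\dot{u}_t^2 + \dot{v}_t^2 = \|w_t\|^2$ is the squared velocity magnitude. Then, $\dot{\gamma}_t$ is also rotation invariant. We have also demonstrated that this local feature well captures both the trajectory shape and the object speed since it is the product of the local curvature and the instantaneous velocity magnitude.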
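The invariance of the feature of relation (2) can be checked numerically. The sketch below is only illustrative: it approximates the derivatives with central finite differences (an assumption made for brevity, whereas the text derives explicit expressions from the kernel approximation), and the names `gamma_dot` and `similarity_transform` are hypothetical.

```python
import math

def gamma_dot(points):
    """Feature (v'' u' - u'' v') / (u'^2 + v'^2) at interior samples,
    with derivatives approximated by central differences (unit time step)."""
    feats = []
    for t in range(1, len(points) - 1):
        (u0, v0), (u1, v1), (u2, v2) = points[t - 1], points[t], points[t + 1]
        du, dv = (u2 - u0) / 2.0, (v2 - v0) / 2.0
        ddu, ddv = u2 - 2.0 * u1 + u0, v2 - 2.0 * v1 + v0
        speed2 = du * du + dv * dv
        # A motionless sample has no defined orientation; 0.0 is an arbitrary choice.
        feats.append((ddv * du - ddu * dv) / speed2 if speed2 > 0 else 0.0)
    return feats

def similarity_transform(points, scale, angle, tx, ty):
    """Apply a rotation, an isotropic scaling and a translation to every point."""
    c, s = math.cos(angle), math.sin(angle)
    return [(scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
            for x, y in points]

# Unit circle travelled at 0.1 rad per sample: curvature 1, speed about 0.1.
circle = [(math.cos(0.1 * t), math.sin(0.1 * t)) for t in range(50)]
moved = similarity_transform(circle, scale=3.0, angle=0.7, tx=5.0, ty=-2.0)
f_circle, f_moved = gamma_dot(circle), gamma_dot(moved)
```

On this example the feature stays close to 0.1 (curvature times speed) along the whole curve, and the values computed on the transformed copy coincide with those of the original up to rounding.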

4 TRAJECTORY MODELING

AND SIMILARITY

4.1 Design of the Hidden Markov Model

We resort to a hidden Markov model (HMM) to build the statistical framework we need since an HMM naturally expresses temporal causality. The feature vector representing a trajectory $T_k$ extracted in a video shot is the vector containing the $n_k$ successive values of $\dot{\gamma}_t$: $V_k = (\dot{\gamma}_1, \ldots, \dot{\gamma}_{n_k-1}, \dot{\gamma}_{n_k})$.

We also exploit the HMM framework in a somewhat original way since the HMM states are given by properly quantized values of $\dot{\gamma}_t$. To determine the HMM state values, we first study the distribution of $\dot{\gamma}_t$ on representative trajectories. We define an interval $[-S, S]$ containing a given percentage $P$ of the computed $\dot{\gamma}$ values in order to discard “outliers” and to control the number $N$ of state values. Hence, a quantization is performed on $[-S, S]$ into a fixed number $N$ of bins (whatever the value of $S$, in order to be able to compare trajectories using the estimated HMMs). This is illustrated in Fig. 1 where synthetic trajectories of six different classes are drawn and their corresponding histograms are plotted restricted to $[-S, S]$. In contrast, in the HMM framework introduced in (Porikli, 2004) to model trajectories and their temporal evolution, the number of states remains difficult to set (it relies on a validity score that requires a balancing factor to be fixed), and the trajectory size should be much larger than the number of Gaussian mixture components (used for the conditional observation distribution) times the number of states, whereas our method is developed to handle trajectories of any size.

VISAPP 2008 - International Conference on Computer Vision Theory and Applications


Figure 1: Six samples of synthetic trajectories (a sinusoid, an ellipse, a parabola, a spiral, a clothoid and a cycloid) and their associated normalized histograms plotted in $[-S, S]$, with $h = 3$, $P = 90\%$, and $N = 21$ (see text).

The HMM which models the trajectory $T_k$ is now characterized by:

- the state transition matrix $A = \{a_{ij}\}$ with $a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i]$, $1 \le i, j \le N$, where $q_t$ is the state variable at instant $t$ and $S_i$ is its value (i.e., the $i$th bin of the quantized histogram);

- the initial state distribution $\pi = \{\pi_i\}$, with $\pi_i = P[q_1 = S_i]$, $1 \le i \le N$;

- the conditional observation probabilities $B = \{b_i(\dot{\gamma}_t)\}$, where $b_i(\dot{\gamma}_t) = P[\dot{\gamma}_t \mid q_t = S_i]$, since the computed $\dot{\gamma}_t$ are the observed values.

The conditional observation probability is defined as a Gaussian distribution of mean $\mu_i$ (i.e., the median value of the histogram bin $S_i$). Its standard deviation $\sigma$ does not depend on the state and is specified so that the interval $[\mu_i - \sigma, \mu_i + \sigma]$ corresponds to the bin width. This conditional observation model can reasonably account for measurement uncertainty. It also prevents zero values from appearing when matrix $A$ is estimated in the training stage from too few measurements (especially in the case of short trajectories). Otherwise, some coefficients of matrix $A$ would be zero and infinite distances would be found between trajectories.
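The construction of the state values and of the Gaussian observation model can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the names `state_values` and `observation_density` are hypothetical, the bound $S$ is taken as the smallest magnitude covering the requested fraction of feature values, and the “median value” of a bin is read here as its midpoint.

```python
import math

def state_values(features, coverage, n_states):
    """Return (S, centers, sigma) for a quantization of [-S, S] into n_states bins."""
    mags = sorted(abs(f) for f in features)
    # Smallest S such that at least `coverage` of the values fall in [-S, S].
    k = max(0, math.ceil(coverage * len(mags)) - 1)
    S = mags[k]
    width = 2.0 * S / n_states
    centers = [-S + (i + 0.5) * width for i in range(n_states)]
    # [mu - sigma, mu + sigma] spans exactly one bin, so sigma is half the width.
    sigma = width / 2.0
    return S, centers, sigma

def observation_density(value, center, sigma):
    """Gaussian density of an observed feature value given the state mean."""
    z = (value - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Hypothetical feature values spread uniformly over [-0.5, 0.5].
features = [0.01 * i for i in range(-50, 51)]
S, centers, sigma = state_values(features, coverage=0.9, n_states=5)
```

Because every state assigns a strictly positive density to every observation, no state is ever ruled out entirely, which is what keeps the estimated transition coefficients away from zero on short trajectories.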

To estimate $A$ and $\pi$, we have adapted the least-squares technique proposed in (Ford and Moore, 1998) where the HMM is assimilated to a count process. If $H_t^{(i)} = P(\dot{\gamma}_t \mid q_t = i)$ (corresponding to a weight for the count process), empirical estimates of $A$ and $\pi$, for a trajectory $k$ of size $n_k$ are given by

$$a_{ij} = \frac{\sum_{t=1}^{n_k-1} H_t^{(i)} H_{t+1}^{(j)}}{\sum_{t=1}^{n_k-1} H_t^{(i)}} \quad \text{and} \quad \pi_i = \frac{1}{n_k}\sum_{t=1}^{n_k} H_t^{(i)}. \qquad (3)$$
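A possible reading of the estimates of relation (3) is sketched below. The function name `estimate_hmm` is hypothetical, and one step is an assumption not stated in the text: the weights are normalized over the states at each instant, so that each row of $A$ and the vector $\pi$ sum to one.

```python
import math

def gaussian(value, center, sigma):
    z = (value - center) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def estimate_hmm(features, centers, sigma):
    """Count-process style estimates of the transition matrix A and of pi."""
    n, n_states = len(features), len(centers)
    H = []  # H[t][i]: weight of state i at instant t
    for g in features:
        row = [gaussian(g, c, sigma) for c in centers]
        total = sum(row)
        # Assumption: normalize over states; fall back to uniform on underflow.
        H.append([r / total for r in row] if total > 0 else [1.0 / n_states] * n_states)
    A = []
    for i in range(n_states):
        denom = sum(H[t][i] for t in range(n - 1))
        A.append([
            sum(H[t][i] * H[t + 1][j] for t in range(n - 1)) / denom
            if denom > 0 else 1.0 / n_states
            for j in range(n_states)
        ])
    pi = [sum(H[t][i] for t in range(n)) / n for i in range(n_states)]
    return A, pi

# Hypothetical three-state example on a short feature sequence.
A, pi = estimate_hmm([-1.0, -1.0, 0.0, 0.0, 1.0, 1.0], [-1.0, 0.0, 1.0], 0.5)
```

Even on this six-sample sequence every coefficient of `A` is strictly positive, because the soft weights spread each observation over all states.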

As an illustration, Fig. 2 shows an example of a real trajectory, its smoothed counterpart, and the estimated values of the A and π coefficients of its associated HMM.

Figure 2: Upper part: Plots of a real trajectory (extracted

from a Formula One race video shot), its smoothed coun-

terpart obtained with h = 8. Colors of the curve points

stand for the different state values and correspond to the

histogram bin colors. Middle part: histogram of the state

values (with N = 5). Lower part: estimated transition ma-

trix A and initial state distribution π.

4.2 Similarity Measure

To compare two trajectories, we have to deﬁne a sim-

ilarity measure. To this end, we exploit the HMM

framework we have built. We adopt the distance be-

tween HMMs proposed by Rabiner (Rabiner, 1989).

It can also be used to classify the trajectories since

it is defined at the trajectory model level. Given two HMMs represented by their parameter sets $\lambda_1$ and $\lambda_2$ ($\lambda_i = (A_i, B_i, \pi_i)$, $i = 1, 2$), the distance $D$ is defined by

$$D(\lambda_1, \lambda_2) = \frac{1}{n_2}\left[\log P(O^{(2)} \mid \lambda_1) - \log P(O^{(2)} \mid \lambda_2)\right] \qquad (4)$$

where $O^{(j)} = \{\dot{\gamma}_1, \ldots, \dot{\gamma}_{n_j}\}$ is the sequence of measures used to train the model $\lambda_j$ and $P(O^{(j)} \mid \lambda_i)$ expresses the probability of observing $O^{(j)}$ with model $\lambda_i$ (computed with the Viterbi algorithm). To be used as a similarity measure, a symmetrized version is required:

$$D_s(\lambda_1, \lambda_2) = \frac{1}{2}\left[D(\lambda_1, \lambda_2) + D(\lambda_2, \lambda_1)\right]. \qquad (5)$$
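The distance of relations (4) and (5) can be sketched with a log-domain Viterbi recursion. The code below is an illustration under stated assumptions: a model is represented as a hypothetical tuple `(A, pi, centers, sigma)`, all entries of `A` and `pi` are strictly positive, and the sign convention follows Rabiner's definition literally. With that convention the value is non-positive whenever each model explains its own sequence at least as well as the other model does; taking its magnitude before clustering or thresholding would be a further assumption that the text does not spell out.

```python
import math

def log_gauss(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def viterbi_log_prob(obs, model):
    """Log-probability of the best state path for `obs` under `model`."""
    A, pi, centers, sigma = model
    n_states = len(centers)
    delta = [math.log(pi[i]) + log_gauss(obs[0], centers[i], sigma)
             for i in range(n_states)]
    for x in obs[1:]:
        delta = [max(delta[i] + math.log(A[i][j]) for i in range(n_states))
                 + log_gauss(x, centers[j], sigma)
                 for j in range(n_states)]
    return max(delta)

def directed_distance(model_1, model_2, obs_2):
    """Length-normalized log-likelihood difference on the sequence of model 2."""
    return (viterbi_log_prob(obs_2, model_1) - viterbi_log_prob(obs_2, model_2)) / len(obs_2)

def symmetric_distance(model_1, obs_1, model_2, obs_2):
    return 0.5 * (directed_distance(model_1, model_2, obs_2)
                  + directed_distance(model_2, model_1, obs_1))

# Two hypothetical models sharing state values but starting in different states.
sticky = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
m1 = (sticky, [0.8, 0.1, 0.1], [-1.0, 0.0, 1.0], 0.5)
m2 = (sticky, [0.1, 0.1, 0.8], [-1.0, 0.0, 1.0], 0.5)
o1, o2 = [-1.0] * 4, [1.0] * 4
d12 = symmetric_distance(m1, o1, m2, o2)
d21 = symmetric_distance(m2, o2, m1, o1)
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long trajectories.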

5 VIDEO UNDERSTANDING

TASKS

5.1 Unsupervised Clustering of

Trajectories

We ﬁrst describe how we address the unsupervised

clustering task. Given a set of video shots (obtained

by an automatic temporal video segmentation tech-

nique), we try to cluster the extracted trajectories in

a sensible way to come out with relevant classes of

dynamic video content. We ﬁrst represent each tra-

jectory by a HMM whose parameters are estimated as

explained in Section 4. We then perform a classical

binary ascendant hierarchical classiﬁcation using the

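trajectory similarity measure introduced in the previous section. The distance between two groups of trajectories $G_i$ and $G_j$ is defined using an average link method, i.e., by calculating the mean of the distances between all pairs of trajectories:

$$D_{\text{average link}}(G_i, G_j) = \frac{1}{|G_i|\,|G_j|} \sum_{T_k \in G_i,\; T_l \in G_j} D_s(T_k, T_l), \qquad (6)$$

where $D_s$ is the symmetrized distance of relation (5) between the HMMs of the two trajectories.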

When performing a binary ascendant hierarchical classification, the system needs to know when to stop the merging iterations. If the process is stopped too late, trajectories will be grouped into too few, heterogeneous classes. Conversely, if the process is terminated too early, classes will be too fragmented and will not correspond to relevant semantic classes. Two alternatives have been tested. The first is to stop merging groups when a predefined number $C$ of classes has been reached, which means that the user has some knowledge of the diversity of the dynamic video contents. The second is to fix a threshold $\tau$ and to continue merging until the current minimum inter-class distance exceeds $\tau$.
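The clustering loop and its two stopping rules can be sketched as follows, assuming the pairwise trajectory distances have already been gathered in a symmetric matrix `dist`. The function names and the toy one-dimensional data are illustrative assumptions; any distance, including the HMM-based one, could fill the matrix.

```python
def average_link(dist, group_a, group_b):
    """Mean pairwise distance between two groups of item indices."""
    total = sum(dist[i][j] for i in group_a for j in group_b)
    return total / (len(group_a) * len(group_b))

def agglomerate(dist, n_clusters=None, threshold=None):
    """Binary ascendant hierarchical clustering with two optional stopping rules."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        # Rule 1: stop once the requested number of classes is reached.
        if n_clusters is not None and len(clusters) <= n_clusters:
            break
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_link(dist, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Rule 2: stop once the closest pair of groups is farther than the threshold.
        if threshold is not None and best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical items on a line, with the absolute difference as distance.
items = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist = [[abs(p - q) for q in items] for p in items]
two_groups = agglomerate(dist, n_clusters=2)
by_threshold = agglomerate(dist, threshold=1.0)
```

On this toy data the threshold rule keeps the isolated item in its own group, whereas forcing two classes merges it with its nearest neighbours.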

5.2 Recognition of Learnt Classes of

Dynamic Video Content

Let us consider the problem of recognizing events, or

equivalently, of retrieving instances of known classes

of events in videos. It can be achieved in a super-

vised way (classes are learnt from training examples)

or in an unsupervised way using the clustering stage

described above. Each class is modeled by a set

of HMMs corresponding to representative trajectories

(those used in the training step, or those belonging to

the corresponding cluster supplied by the initial clus-

tering step applied on a subset of the video sequence

base) which we will call the initial members in the se-

quel. Recognition is then performed by assigning the

processed trajectory to the nearest class. As afore-

mentioned, the distance to a class is deﬁned using the

average link method (see subsection 5.1).

5.3 Detection of Unexpected Events

Detecting unexpected (or equivalently, rare or abnor-

mal) events is of interest in many applications. We

tackle this issue with the same HMM-based frame-

work. First, we consider a set of predeﬁned (or learnt)

classes represented again by the estimated HMMs of

the initial class members. We compute for each class $C_i$ its centroid $G_i$ in the $\lambda$-parameter space from the estimated parameters $\lambda_k$ of the HMMs of the initial members $T_k$ of class $C_i$. Then we evaluate the distances of all the initial class members $T_k$ to the centroid $G_i$ using relation (5), and we denote $R_i$ the maximum distance value. Let $\sigma_i$ designate the standard deviation of these distance values. We decide that a test trajectory $T$ corresponds to an unexpected event if, for every class $C_i$, $D_{\text{average link}}(T, C_i) > R_i + \sigma_i$, where

$$D_{\text{average link}}(T, C_i) = \frac{1}{|C_i|} \sum_{T_k \in C_i} D_s(T, T_k).$$
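The decision rule can be sketched as follows, assuming the member-to-centroid distances and the test-to-class distances have already been computed. The function names are hypothetical, and the use of the population standard deviation (division by the number of members) is an assumption. The numerical values in the example are the thresholds and the 'Accident 1' distances reported in Table 5.

```python
import math

def detection_threshold(member_to_centroid):
    """R + sigma for one class: largest member-to-centroid distance plus the
    standard deviation of those distances (population form, an assumption)."""
    n = len(member_to_centroid)
    mean = sum(member_to_centroid) / n
    variance = sum((d - mean) ** 2 for d in member_to_centroid) / n
    return max(member_to_centroid) + math.sqrt(variance)

def average_link_to_class(test_to_members):
    """Mean distance between the test trajectory and the members of one class."""
    return sum(test_to_members) / len(test_to_members)

def is_unexpected(class_distances, thresholds):
    """True when the test trajectory exceeds the threshold of every class."""
    return all(d > thr for d, thr in zip(class_distances, thresholds))

thresholds = [0.0753, 0.1310, 0.1180, 0.0632, 0.0225, 0.0330]  # Table 5, second row
accident_1 = [0.2084, 0.1427, 0.1713, 0.1306, 0.0296, 0.1992]  # Table 5, 'Accident 1'
flagged = is_unexpected(accident_1, thresholds)
```

A trajectory that falls within the threshold of even a single class is treated as a regular event, so the rule only flags trajectories that are far from all known classes.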

6 OTHER METHODS FOR

COMPARISON PURPOSE

6.1 Bhattacharyya Distance between

Histograms

To assess the importance of introducing temporal

causality, i.e., transitions between states, we have implemented a Bhattacharyya distance-based classification method. The Bhattacharyya distance $D_B$ between two (normalized) histograms $h_i$ and $h_j$ of features respectively corresponding to two trajectories $T_i$ and $T_j$, is defined by

$$D_B(h_i, h_j) = \sqrt{1 - \sum_{q=1}^{N} \sqrt{h_i^q \, h_j^q}} \qquad (7)$$

where $h_i^q$ is the histogram value of bin $q$ for trajectory $T_i$. Similarly to the HMM-based method, we assign the test trajectory $T$ to the nearest class and the distance to a class is defined using an average link method (see subsection 5.1).
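The following sketch illustrates why a histogram comparison cannot account for temporal causality. It uses the square-root form of the Bhattacharyya distance; the function names are hypothetical, and ignoring values outside $[-S, S]$ when building the histogram is an assumption.

```python
import math

def normalized_histogram(values, S, n_bins):
    """Histogram of the values lying in [-S, S], normalized to sum to one."""
    width = 2.0 * S / n_bins
    counts = [0] * n_bins
    for v in values:
        if -S <= v <= S:
            counts[min(int((v + S) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no value falls inside [-S, S]")
    return [c / total for c in counts]

def bhattacharyya_distance(h1, h2):
    coefficient = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    # Guard against a coefficient marginally above 1 caused by rounding.
    return math.sqrt(max(0.0, 1.0 - coefficient))

# Two hypothetical feature sequences with the same values in a different order.
seq_a = [-0.5, -0.5, 0.5, 0.5]
seq_b = [0.5, -0.5, 0.5, -0.5]
h_a = normalized_histogram(seq_a, S=1.0, n_bins=4)
h_b = normalized_histogram(seq_b, S=1.0, n_bins=4)
same_content = bhattacharyya_distance(h_a, h_b)
```

The two sequences describe different temporal behaviours (one change of sign versus constant alternation), yet their histograms are identical and the distance is zero, whereas a transition matrix estimated on each sequence would differ.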

6.2 SVM Classiﬁcation Method

An efficient tool to perform supervised classification of patterns is the SVM method (Burges, 1998). As input for the SVM method, we take the HMM parameters corresponding to the trajectories. An SVM method needs patterns to be represented by vectors. Hence, for each trajectory $T_k$, a vector $X_k$ containing the HMM parameters $\lambda_k$ of the trajectory is created. Let us give an example with only $N = 3$ state values. We have

$$A_k = \begin{pmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{pmatrix}, \qquad \pi_k = [a_{10}\; a_{11}\; a_{12}]$$

and $X_k = [a_1\; a_2\; \ldots\; a_{12}]$ is the vector representing the trajectory $T_k$. An SVM classification technique with a Gaussian RBF (radial basis function) kernel is used. The reported results are obtained using the “one against all” classification scheme.
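The vectorization step can be sketched in a few lines. The function name `hmm_to_vector` is hypothetical, and the row-by-row ordering of the transition coefficients is an assumption; any fixed ordering works as long as it is the same for all trajectories.

```python
def hmm_to_vector(A, pi):
    """Concatenate the rows of A, then pi, into one flat feature vector."""
    vector = [a for row in A for a in row]
    vector.extend(pi)
    return vector

# Hypothetical parameters for N = 3 state values.
A_k = [[0.7, 0.2, 0.1],
       [0.3, 0.4, 0.3],
       [0.1, 0.2, 0.7]]
pi_k = [0.5, 0.3, 0.2]
X_k = hmm_to_vector(A_k, pi_k)
```

For $N$ state values the vector has $N^2 + N$ components, so its dimension is fixed by the quantization and does not depend on the trajectory length.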

7 EXPERIMENTS

7.1 Synthetic Trajectories

First, a set of typical trajectories has been generated to set up experiments with ground truth and easily specifiable data. More specifically, 8 classes (sinusoid,

parabola, hyperbola, ellipse, cycloid, spiral, straight

line, and clothoid) have been considered and 8 dif-

ferent trajectories (of different sizes, including short

trajectories) have been simulated for each class, with

different parametrizations of the curves, and for sev-

eral geometric transformations (rotation, scaling), as

illustrated in Fig.1. Noised versions have been gen-

erated with different noise levels. Thus, we can eval-

uate the performance of the methods with respect to

trajectory shape and length variations within a class,

invariance to transformations and robustness to noise.

7.2 Video Trajectories

Real trajectories have been extracted from a Formula

One race TV program and from Alpine skiing TV

programs, both ﬁlmed with several cameras. The tra-

jectories are computed with the tracking method de-

scribed in (Perez et al., 2002). The background mo-

tion due to camera panning, tilting and zooming is

canceled. The trajectory shapes supplied by this method are thus close to the real 3D trajectories of Formula One cars (up to a homography, since the 3D motion is almost planar) and of skiers. Examples are plotted in Fig. 3.

Figure 3: Images from video shots acquired by two differ-

ent cameras at two different places on the circuit. The com-

puted trajectories are overprinted on the images.

Figure 4: Plots of the 6 classes of dynamic content (tra-

jectories) for a Formula 1 race video, each box contains a

different class. A class of trajectories is composed of tra-

jectories extracted from shots acquired by the same camera.

The different classes correspond to different cameras placed

throughout the circuit at different strategic turns.

7.3 Results on Unsupervised Clustering

We have applied the proposed unsupervised trajectory clustering scheme based on the distance between HMMs to trajectories extracted from a Formula One race TV program (Fig. 3). In this real example, the different classes correspond to views supplied by six different cameras placed throughout the circuit at different strategic turns. Indeed, a given type of dynamic content is attached to a given view, and it can then be characterized by a specific trajectory shape and car speed which essentially depend on the turn configuration at the considered location of the circuit, whatever the passing car. Hence, ground truth is available while a real video (TV program) is processed. The clustering relying on the ascendant hierarchical binary classification (AHBC) technique supplies very good results with the two aforementioned stopping criteria. Due to the page limitation, we only report results obtained with the first stopping criterion. Table 1 contains results obtained with our HMM-based method and with the same AHBC technique but using the Bhattacharyya distance and the LCSS distance instead. Two cases were considered: 4 and 6 classes, the first four being a subset of the six. Our method outperforms the Bhattacharyya distance-based method (with approximately the same computation time), and the LCSS method while requiring a much lower computation time (at least five times faster). This experiment also demonstrates that our unsupervised classification method is able to form meaningful clusters since the latter are very close to the ground truth groups presented in Fig. 4.

Table 1: Results of unsupervised classification by an ascendant hierarchical binary classification (AHBC) technique, using the proposed HMM-based distance, the LCSS distance and the Bhattacharyya distance for Formula One car trajectories. Two cases were considered: 4 and 6 classes, the first four being a subset of the six (precisely the four classes on the left in Fig. 4). Percentages correspond to rates of good classification for the extracted classes (the ground truth being known). As stopping criterion, the number of classes to create is provided to the AHBC technique.

Percentage of correct clustering
                 4 classes   6 classes
HMM              100         92.2
Bhattacharyya    58.4        53.9
LCSS             84.4        72.3

7.4 Results on Supervised Recognition

We now report results on the recognition task. We have compared our HMM-based

method with the SVM classiﬁcation method de-

scribed in subsection 6.2 and the histogram compar-

ison technique based on the Bhattacharyya distance

outlined in subsection 6.1. To evaluate the perfor-

mances, we have adopted the leave-one-out cross val-

idation. Table 2 contains best classiﬁcation results for

the real sets of video trajectories (4 and 6 classes pre-

sented in Fig.4). Table 3 shows the performance of

our HMM method for different levels of noise on the

synthetic trajectories using different values of h.

Tests performed on the sets of synthetic trajectories gave very promising results: a perfect classification was obtained for most parameter configurations (i.e., for $N$, $h$ and $P$) with the SVM and HMM methods, whereas the technique based on the Bhattacharyya distance fails to classify synthetic trajectories efficiently (highlighting the importance of the temporal causalities modeled with the HMM). The technique based on the Longest Common Subsequence (LCSS) distance (Buzan et al., 2004) gave good but not perfect results, with a higher computation time (more than five times longer than with the HMM-based method). Recognition results with noisy data (Table 3) show that the parameter $h$ helps handle noisy data efficiently, by smoothing the processed trajectories.

For the evaluation on real videos (Fig. 4), the same type of results has been obtained: very satisfying classification results for most parameter configurations with the SVM and HMM methods, and less accurate classification results with the techniques using the Bhattacharyya distance and the LCSS distance (Table 2). Besides, our HMM method is much more flexible than the SVM classification stage (e.g., adding a new class only requires learning the parameters of that class).

Table 2: Comparison of the best recognition percentages for the trajectories extracted from real video, using the leave-one-out cross-validation technique.

Percentage of correct classification
                 4 classes   6 classes
HMM              100         99.0
SVM              100         96.1
Bhattacharyya    100         93.1
LCSS             97.1        91.2

Table 3: Classification results for the synthetic trajectories with the HMM-based method, using the leave-one-out cross-validation technique, for different values of h and σ (σ is the standard deviation of the added noise).

7.5 Results on the Detection of

Unexpected Events

We have conducted experiments on several real

videos for the detection of different types of unex-

pected events. For the Formula One race video, we

were able to detect incidents such as cars driving off


the track (revealed by an abnormal trajectory shape)

or intervention of the safety car (revealed by a quite

different speed while the global trajectory shape re-

mains unchanged). For the skiing competition, the

objective was to detect falls of skiers. Fig.5 and

Fig.6 respectively show three Formula One video se-

quences and two Alpine skiing race video sequences.

Fig.6 also presents trajectories corresponding to a

class (printed in blue) and to two unexpected events

(printed in magenta and blue). In each case, the ﬁrst

one belongs to a regular event class while the other

ones are examples of unexpected events. The criterion described in subsection 5.3 allowed us to correctly detect the unexpected events in all the processed examples. In Table 5 we supply the maximum intra-class distance $R_i$ and the distance between the trajectory detected as an unexpected event and the six considered classes $C_i$ (presented in Fig. 4), for several cases. Table 4 presents results corresponding to skiers' trajectories, showing the difference between the maximum intra-class distance and the distance between unexpected events and this regular class. These results show that our HMM-based framework can be straightforwardly and successfully exploited for detecting unexpected events in videos.

Figure 5: Images from Formula One race video shots ac-

quired by the same camera. Top row: images displaying

instances of a regular class. Bottom rows: example of un-

expected events (safety car, car driving off the track). The

trajectories are overprinted on the images.

Figure 6: Images from Alpine skiing competition video shots acquired by the same camera. Trajectories are overprinted on the images. Top row: example of regular class. Bottom row: example of unexpected event (fall of a skier).

Table 4: Detection threshold is supplied in the second row for the class of ski trajectories used in the detection task of unexpected events. The following rows contain the distances between the unexpected events trajectories and the regular class. Last column shows the detection status.

                              Class 1   Status
$R_1 + \sigma_1$              0.1434
Accident                      0.4412    detected
Skier veering off the piste   0.3307    detected

8 CONCLUSIONS

We have proposed a trajectory-based HMM framework for video content understanding. We have shown that it is general and flexible enough to solve three tasks: unsupervised clustering of events, recognition of events corresponding to learnt classes of dynamic video contents, and detection of unexpected events. We have introduced appropriate local trajectory features invariant to translation, rotation and scale transformations, and we can reliably compute them in the presence of noise. We have conducted an extensive set of comparative experiments both on synthetic examples and real videos (sports TV programs) with classification ground truth. We have shown that our method supplies accurate results and offers better performance and usability than other approaches such as SVM classification, histogram comparison or LCSS distance. Extensions of this work will investigate the hierarchical modeling of space-time groups of trajectories in association or in interaction to represent activities in videos.

REFERENCES

F. I. Bashir, A. A. Khokhar, and D. Schonfeld. Real-time motion trajectory-based indexing and retrieval of video sequences. IEEE Trans. on Multimedia, 9(1):58–65, 2007.

O. Boiman and M. Irani. Detecting irregularities in images and in video. IEEE Int. Conf. on Computer Vision, ICCV’05, Beijing, Vol. 1, pages 462–469, Oct. 2005.

C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.


Table 5: The detection thresholds are supplied in the second row for the (learnt) classes C used in the detection task of unexpected events. The following rows contain the distances between the unexpected event trajectories and the regular classes. The events ‘Accident 1’, ‘Veering off the track’ and ‘Safety car’ were shot by the camera corresponding to class 1, whereas ‘Accident 2’ corresponds to class 2. The last column shows the detection status.

                         Class 1   Class 2   Class 3   Class 4   Class 5   Class 6   Status
Threshold (+ σ)          0.0753    0.1310    0.1180    0.0632    0.0225    0.0330
Accident 1               0.2084    0.1427    0.1713    0.1306    0.0296    0.1992    detected
Veering off the track    0.1603    0.1991    0.3017    0.2068    0.0484    0.1556    detected
Safety car               0.2958    0.5200    0.6425    0.3088    0.2595    0.2788    detected
Accident 2               0.2474    0.3978    0.5141    0.2068    0.0716    0.2127    detected

D. Buzan, S. Sclaroff, and G. Kollios. Extraction and clustering of motion trajectories in video. In Proc. of the 17th Int. Conf. on Pattern Recognition, ICPR’04, pages 521–524, Cambridge, UK, Aug. 2004.

M. T. Chan, A. Hoogs, J. Schmiederer, and M. Peterson. Detecting rare events in video using semantic primitives with HMM. In Proc. of the 17th Int. Conf. on Pattern Recognition, ICPR’04, pages 150–154, Cambridge, UK, Aug. 2004.

M. T. Chan, A. Hoogs, R. Bhotika, and A. Perera. Joint recognition of complex events and track matching. In Proc. of the IEEE Conf. on Comp. Vis. and Patt. Rec., CVPR’06, pages 694–699, New York, June 2006.

J. Ford and J. Moore. Adaptive estimation of HMM transition probabilities. IEEE Trans. on Signal Processing, 46(5):1374–1385, 1998.

P. Perez, C. Hue, J. Vermaak, and M. Gangnet. Color-based probabilistic tracking. In Proc. Europ. Conf. Computer Vision, ECCV’02, Copenhagen, June 2002.

S. Gong and T. Xiang. Recognition of group activities using dynamic probabilistic networks. In Proc. of the IEEE Int. Conf. on Computer Vision, ICCV’03, pages 742–749, Nice, Oct. 2003.

S. Hongeng, R. Nevatia, and F. Bremond. Video-based event recognition: Activity representation and probabilistic recognition methods. Computer Vision and Image Understanding, 96(2):129–162, 2003.

W. Hu, D. Xie, Z. Fu, W. Zheng, and S. Maybank. Semantic-based surveillance video retrieval. IEEE Trans. on Image Proc., 16(4):1168–1181, April 2007.

A. Kokaram, N. Rea, R. Dahyot, M. Tekalp, P. Bouthemy, P. Gros, and I. Sezan. Browsing sports video (Trends in sports-related indexing and retrieval work). IEEE Signal Processing Magazine, 23(2):47–58, March 2006.

C. Li and G. Biswas. Temporal pattern generation using hidden Markov model based unsupervised classification. Advances in Intelligent Data Analysis, Lecture Notes in Computer Science, 1642:245–256, 1999.

X. Li, W. Hu, and W. Hu. A coarse-to-fine strategy for vehicle motion trajectory clustering. In Proc. of the 18th Int. Conf. on Pattern Recognition, ICPR’06, pages 591–594, Hong Kong, Aug. 2006.

F. Porikli. Trajectory distance metric using hidden Markov model based representation. In PETS Workshop, Prague, May 2004.

L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2):257–285, 1989.

C. Stauffer and W. E. L. Grimson. Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Analysis and Machine Intelligence, 22(8):747–757, Aug. 2000.

M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In Proc. of the 18th International Conference on Data Engineering, San Jose, Feb. 2002.

X. Wang, K. Tieu, and E. Grimson. Learning semantic scene models by trajectory analysis. In Proc. Europ. Conf. Computer Vision, ECCV’06, Graz, May 2006.

VISAPP 2008 - International Conference on Computer Vision Theory and Applications
