MOVING OBJECT ANALYSIS IN VIDEO SEQUENCES USING
SPACE-TIME INTEREST POINTS
Alain Simac-Lejeune
Litii, Alpespace, 15 rue St Exupery, 73800 Francin, France
Keywords:
Video Signal Processing, Image Object Detection, Event Detection, Motion Analysis, Interest Points.
Abstract:
Among all the features which can be extracted from videos, we propose to use Space-Time Interest Points
(STIPs). STIPs are particularly interesting because they are simple and robust low-level features providing an
efficient characterization of moving objects within videos. In this paper, after defining STIPs and after giving
some of their properties, we will use STIPs to detect moving objects and to characterize specific changes in
the movements of these objects. Proposed results are obtained from two very different types of videos, namely
athletic videos and animation movies.
1 INTRODUCTION
The human perception system is naturally attracted by
differences between parts of images and by motion or
moving objects. Therefore, in the video indexing
framework, interest points provide useful information
which may be related to semantic content. Different
methods have been suggested to extract spatial interest
points. An evaluation of these approaches is proposed
in (Schmid et al., 2000). In (Laptev and Lindeberg,
2003), Laptev and Lindeberg propose a spatio-temporal
extension of interest point detection, denoted
Space-Time Interest Points (STIPs) in the following.
STIPs are points which are of interest both in the
spatial and temporal domains. STIPs have been used for
action recognition (Ke et al., 2005), automatic
summarization (Laganière et al., 2008) or, more
generally, spatio-temporal event detection (Laptev,
2005). In this paper, we propose to use STIPs to detect
moving objects in videos and to characterize some
specific changes in the movement of these objects. To
illustrate the robustness of this approach, two very
different types of videos are used: athletic videos and
animation movies. The paper is organized as follows:
Section 2 briefly describes the videos used in this
study. Section 3 introduces STIPs and gives an overview
of some of their specific properties. Sections 4 and 5
present results on moving object detection and on the
localization of specific movement changes,
respectively. Finally, Section 6 explores the
limitations of the proposed method.
2 DATABASE
In order to characterize our work and test our
assumptions, we have used three different types of data:
- synthetic videos: 60 sequences composed of synthetic
  images with a uniform background and one or more
  objects (round, square, triangle, polylines), in
  uniform or non-uniform motion, along straight or
  non-straight paths, with a 288x288 image size;
- sport videos: 40 sequences of athletic jumps of 100 to
  160 frames (about 5 seconds) with a 300x300 image size
  (Ramasso, 2007);
- an animation movie from the International Festival of
  Animated Movies of Annecy. The movie, entitled "Le
  Moine et le Poisson", lasts 6 minutes and 23 seconds
  (5745 frames) with a 320x240 image size.
Note also that in all the following tests, performance
has been evaluated on separate shots, without taking
into account STIPs generated by shot transitions.
Indeed, in the tested videos and movies, transitions
can easily be detected.
3 SPACE-TIME INTEREST
POINTS (STIPS)
3.1 Detection
On an image, spatial interest points (SIPs) can be
defined as pixels with a significant intensity
variation. Examples of interest points are corners,
junctions, isolated points or specific texture points.
In (Harris and Stephens, 1988), Harris and Stephens
propose to find such points using a second moment
matrix.

Simac-Lejeune A.
MOVING OBJECT ANALYSIS IN VIDEO SEQUENCES USING SPACE-TIME INTEREST POINTS.
DOI: 10.5220/0003866402010204
In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 201-204
ISBN: 978-989-8565-03-7
Copyright © 2012 SCITEPRESS (Science and Technology Publications, Lda.)
In (Laptev and Lindeberg, 2003), Laptev and Lindeberg
proposed a spatio-temporal extension to detect what
they call "Space-Time Interest Points" (STIPs). STIPs
are points which are relevant both in space and time.
These points are especially interesting because they
concentrate information initially contained in
thousands of pixels onto a few specific points which
can be related to spatio-temporal events in the
sequence. Typically, STIPs appear in articulated
motions (a person walking, running or jumping).
However, it can be noted that the constant motion of a
corner does not produce any STIPs.
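STIP detection is performed using the Hessian-Laplace
matrix (Laptev, 2005) defined, for a pixel (x, y) at
time t having intensity I(x, y, t), by:

$$H(x, y, t) = \begin{pmatrix}
\dfrac{\partial^2 I}{\partial x^2} & \dfrac{\partial^2 I}{\partial x \partial y} & \dfrac{\partial^2 I}{\partial x \partial t} \\
\dfrac{\partial^2 I}{\partial x \partial y} & \dfrac{\partial^2 I}{\partial y^2} & \dfrac{\partial^2 I}{\partial y \partial t} \\
\dfrac{\partial^2 I}{\partial x \partial t} & \dfrac{\partial^2 I}{\partial y \partial t} & \dfrac{\partial^2 I}{\partial t^2}
\end{pmatrix} \qquad (1)$$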
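As an illustration of equation (1), the sketch below approximates the Hessian at a single voxel with central finite differences. It is a minimal sketch, not the detector used in this paper: the function name `hessian_at`, the `video[t][y][x]` layout, the unit sampling step and the toy volume are all assumptions, and any Gaussian pre-smoothing is omitted.

```python
def hessian_at(video, x, y, t):
    """Approximate the 3x3 spatio-temporal Hessian of equation (1).

    `video[t][y][x]` holds the grey-level intensity I(x, y, t) (assumed
    layout). Central finite differences with unit step replace the
    derivatives; the voxel must not lie on the border of the volume.
    """
    def i(dx, dy, dt):
        return video[t + dt][y + dy][x + dx]

    centre = i(0, 0, 0)
    # Pure second derivatives: I(+1) - 2 I(0) + I(-1) along one axis.
    ixx = i(1, 0, 0) - 2 * centre + i(-1, 0, 0)
    iyy = i(0, 1, 0) - 2 * centre + i(0, -1, 0)
    itt = i(0, 0, 1) - 2 * centre + i(0, 0, -1)
    # Mixed derivatives: four-point central stencil over the two axes.
    ixy = (i(1, 1, 0) - i(1, -1, 0) - i(-1, 1, 0) + i(-1, -1, 0)) / 4
    ixt = (i(1, 0, 1) - i(1, 0, -1) - i(-1, 0, 1) + i(-1, 0, -1)) / 4
    iyt = (i(0, 1, 1) - i(0, 1, -1) - i(0, -1, 1) + i(0, -1, -1)) / 4
    # The matrix is symmetric, as in equation (1).
    return [[ixx, ixy, ixt],
            [ixy, iyy, iyt],
            [ixt, iyt, itt]]


# Toy 3x3x3 volume sampled from a quadratic intensity (illustrative only).
toy_video = [[[x * x + x * y + 2 * y * t + 3 * t * t for x in range(3)]
              for y in range(3)] for t in range(3)]
toy_hessian = hessian_at(toy_video, 1, 1, 1)
```

Central differences are exact for a quadratic intensity, which makes the toy volume a convenient sanity check; on real frames the derivatives would normally be taken after smoothing.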
In order to highlight STIPs, different criteria have
been proposed. As in (Laptev, 2005), we have chosen the
extension of the Harris corner function, called the
"salience function", defined by:

$$R(x, y, t) = \det(H(x, y, t)) - k \times \mathrm{trace}(H(x, y, t))^3 \qquad (2)$$
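The salience function of equation (2) can be sketched as follows. The helper names `det3`, `salience` and `is_stip` are hypothetical; the default k = 0.04 and threshold = 120 are the values reported with the tables of this paper, and the strict comparison used to keep a point is an assumption.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def salience(hessian, k=0.04):
    """Salience R = det(H) - k * trace(H)^3 of equation (2)."""
    trace = hessian[0][0] + hessian[1][1] + hessian[2][2]
    return det3(hessian) - k * trace ** 3


def is_stip(hessian, k=0.04, threshold=120):
    """Keep a point when its salience exceeds the threshold (assumed rule)."""
    return salience(hessian, k) > threshold
```

In a full detector this test would be applied to every voxel of the smoothed sequence, typically followed by a search for local maxima of R.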
where k is an empirically adjusted parameter. STIPs
correspond to high values of the salience function.
We ran tests for different values of the standard
deviations $\sigma_s$ and $\sigma_t$ of the spatial and
temporal Gaussian filters. These tests highlight the
impact of the Gaussian filters: when the values of
$\sigma_s$ and $\sigma_t$ are low, the number of STIPs
increases, but the rate of good detections decreases.
On the contrary, when the values of $\sigma_s$ and
$\sigma_t$ are high, the number of STIPs decreases and
the rate of good detections increases up to 100%.
However, the settings corresponding to a 100% rate
provide too few STIPs. Finally, a good compromise is
$\sigma_s = 1.5$ and $\sigma_t = 1.5$. Although there
are methods to adjust these parameters automatically,
we preferred to set them manually in order to optimize
computation time.
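To make the role of the standard deviations concrete, the sketch below builds a normalised 1-D Gaussian kernel and applies it to a 1-D signal. It assumes, as is common but not stated in this paper, that the smoothing is applied separably (the same kind of kernel along x and y with $\sigma_s$, and along t with $\sigma_t$); the truncation at three standard deviations, the border handling and the function names are hypothetical choices.

```python
import math


def gaussian_kernel(sigma, truncate=3.0):
    """Normalised 1-D Gaussian kernel sampled at integer offsets.

    The support is cut at `truncate * sigma` (an assumed convention).
    """
    radius = int(math.ceil(truncate * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_1d(samples, kernel):
    """Convolve a 1-D signal with the kernel, replicating border samples."""
    radius = len(kernel) // 2
    last = len(samples) - 1
    smoothed = []
    for n in range(len(samples)):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Clamp the index so that borders reuse the nearest sample.
            idx = min(max(n + j - radius, 0), last)
            acc += w * samples[idx]
        smoothed.append(acc)
    return smoothed


# Kernel for the value retained in this paper for both sigma_s and sigma_t.
kernel = gaussian_kernel(1.5)
```

A larger standard deviation widens the kernel and averages over more pixels or frames, which is consistent with the lower number of STIPs observed for high values.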
3.2 Properties
STIP properties are well known, particularly their
relative stability with respect to geometric
transformations. In our application, we are interested
in some specific properties, such as the robustness of
STIPs against impulsive noise and contrast modification.
3.2.1 Low/High Contrast and Noise
An analysis of the effects of image quality on STIP
detection has also been carried out. Two situations
were examined: contrast modification and noise
addition. The noise used is impulsive noise, because it
is the most difficult type of noise for interest point
detection. Table 1 shows the number of STIPs obtained
for different contrast and noise conditions.
Table 1: Influence of contrast and impulse noise.

a) Contrast influence
Contrast     50    75   100   125   150   175
STIP          1     2    29    64    68   127

b) Noise influence
Pow           0    20    20    50    50    50
Intensity     0    20   50+    20    50   70+
STIP         29    29    33    49    78   126

80 sequences of synthetic videos and athletic jumps;
k = 0.04, $\sigma_s = \sigma_t = 1.5$, salience threshold = 150.
The evaluation is performed by observing the variations
in the number of STIPs compared with the initial
situation (no contrast modification and no noise: 29
STIPs per frame). It can be noticed that STIP detection
is very sensitive to contrast modification. On the
contrary, the number of STIPs is relatively stable with
respect to impulse noise.
3.2.2 Video Compression
The last criterion that influences STIP generation is
the compression factor of the video. Indeed, as a
result of compression, straight lines show aliasing
which may, under certain circumstances, be perceived as
angles (Clarke, 1995). This change causes the
generation of STIPs.
Table 2: Influence of the MPEG2 compression factor: average number of STIPs per frame.

Compression factor (%)    10    20    30    40    50    60    70    80    90   100
STIPs per frame (nb)      29    29    30    38    44    51    62    77    90   118

80 sequences of synthetic videos and athletic jumps;
k = 0.04, $\sigma_s = \sigma_t = 1.5$, salience threshold = 120.
Table 2 shows the influence of the MPEG2 compression
factor on the number of generated STIPs. It is
important to note that the sequence containing a square
has not generated false positives. Indeed, no aliasing
has occurred. These results show that the compression
factor has an important influence past the threshold of
30% compression. In order not to disturb the results,
it is necessary to ensure that the sequences used are
not compressed beyond this threshold.
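The 30% threshold can be read directly from Table 2. The sketch below is one hedged way to do so: it takes the count at the lowest compression factor as the baseline and returns the largest factor reached before the count first drifts beyond a relative tolerance. The function name and the 5% tolerance are hypothetical, not values from this paper.

```python
# STIPs per frame for each MPEG2 compression factor (%), from Table 2.
STIPS_PER_FRAME = {10: 29, 20: 29, 30: 30, 40: 38, 50: 44,
                   60: 51, 70: 62, 80: 77, 90: 90, 100: 118}


def max_safe_compression(counts, tolerance=0.05):
    """Largest factor before the STIP count first drifts beyond `tolerance`.

    The baseline is the count at the lowest compression factor; the 5%
    default tolerance is a hypothetical choice.
    """
    factors = sorted(counts)
    baseline = counts[factors[0]]
    safe = factors[0]
    for factor in factors:
        if abs(counts[factor] - baseline) / baseline > tolerance:
            break
        safe = factor
    return safe
```

With the figures of Table 2, the count rises from 29 to 30 at 30% (about 3%) and to 38 at 40% (about 31%), so a 5% tolerance selects 30%.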
4 DETECTION OF MOVING
OBJECTS
4.1 Principle
There are many methods for the detection of moving
objects, based on motion detection (Giai-Checa et al.,
1993), segmentation (Bugeau, 2007), the difference
between successive images, etc. STIPs can also be used
for moving object detection. However, this only works
if the object has a non-regular motion, as STIPs
correspond to second-order variations both in space and
time.
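The need for a non-regular motion can be checked on an idealized case. The derivation below is a sketch under assumptions that are not part of this paper: a pattern f translating rigidly at a constant velocity (v_x, v_y), with no smoothing and no change of appearance.

```latex
% Assumed model: rigid translation at constant velocity (v_x, v_y)
I(x, y, t) = f(p, q), \qquad p = x - v_x t, \quad q = y - v_y t

% Second derivatives entering the matrix H
\frac{\partial^2 I}{\partial x^2} = f_{pp}, \qquad
\frac{\partial^2 I}{\partial x \partial y} = f_{pq}, \qquad
\frac{\partial^2 I}{\partial y^2} = f_{qq}

\frac{\partial^2 I}{\partial x \partial t} = -v_x f_{pp} - v_y f_{pq}, \qquad
\frac{\partial^2 I}{\partial y \partial t} = -v_x f_{pq} - v_y f_{qq}

\frac{\partial^2 I}{\partial t^2} = v_x^2 f_{pp} + 2 v_x v_y f_{pq} + v_y^2 f_{qq}

% The third row of H is a linear combination of the first two:
\mathrm{row}_3(H) = -v_x \, \mathrm{row}_1(H) - v_y \, \mathrm{row}_2(H)
\quad \Longrightarrow \quad \det(H) = 0
```

Under this model the determinant term of the salience function vanishes and only the trace term remains. This is consistent with, though not a proof of, the observation that a corner in constant motion does not produce STIPs.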
4.2 Experimental Evaluation
In athletic jumps or in animation movies, this type of
motion occurs frequently, and generally corresponds to
objects or persons which have an important role in the
scene. Tests are performed according to the classical
Precision/Recall criteria. The validation has been
obtained manually in the following way:
- true positive: at least one STIP within an
  interesting moving object;
- false positive: at least one STIP within a
  non-interesting moving object;
- false negative: no STIP within an interesting moving
  object.
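The three definitions above translate into the usual formulas Precision = TP / (TP + FP) and Recall = TP / (TP + FN). The sketch below scores one frame in that way. The validation in this paper was manual; the bounding-box annotation, the function names and the data layout are assumptions made only for illustration.

```python
def contains(box, point):
    """True when point (x, y) lies inside box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max


def precision_recall(objects, stips):
    """Score one frame from annotated moving objects and detected STIPs.

    `objects` is a list of (box, is_interesting) pairs; the bounding-box
    annotation is an assumption standing in for the manual validation.
    """
    tp = fp = fn = 0
    for box, is_interesting in objects:
        hit = any(contains(box, p) for p in stips)
        if is_interesting and hit:
            tp += 1      # at least one STIP within an interesting object
        elif is_interesting:
            fn += 1      # no STIP within an interesting object
        elif hit:
            fp += 1      # at least one STIP within a non-interesting object
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Over a whole sequence, the three counts would be accumulated across frames before computing the two ratios.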
Table 3 shows that we obtained very good results us-
ing STIPs as interesting object detectors, even if there
are several moving objects within the same frame.
Table 3: Object detection performance.

                    Precision    Recall
animation movie        0.99       0.91
athletic movie         0.99       0.95

20 sequences of long jump (duration: 2120 frames) and
500 frames from the animated movie "Le Moine et le Poisson";
k = 0.04, $\sigma_s = \sigma_t = 1.5$, salience threshold = 120.
To conclude, STIPs perform well enough to locate moving
objects. The more "corners" these objects present, the
larger the number of points. A function determining the
focus of these points can then give the approximate
position of the moving objects and be used for tracking.
5 DETECTION OF MOVEMENT
CHANGES
5.1 Principle
In (Laganière et al., 2008), the activity level within
a video is defined as the number of pixels whose
characteristics change between two images. As a
consequence, the authors propose to define an activity
function as the number of detected STIPs within each
frame. A high (respectively low) value reflects a
strong (respectively weak) activity. Moreover, the time
evolution of this activity may contain interesting
information from a semantic point of view. In
particular, local maxima of this activity function are
generally related to important events in the sequence.
This is why we used this strategy to detect the
different phases of a movement. The hypothesis is that
a local maximum of the activity function is related to
a significant change in the non-constant motion, and
must correspond to a transition between two phases of a
movement (for example, in a jump: running phase,
ascending flight phase, descending flight phase, etc.).
5.2 Realization
Given that the activity function is generally noisy, it
is first smoothed with a mean filter of size 11. Let
$a_{filt}(t)$ denote the filtered activity function. We
then look for local maxima of $a_{filt}(t)$ satisfying
the following condition:

$$0.8 \times a_{filt}(t - \alpha) \leq a_{filt}(t) \geq 0.8 \times a_{filt}(t + \alpha) \qquad (3)$$

with $\alpha$ accounting for the temporal extent
determined by $\sigma_t$.
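The procedure of this section can be sketched as follows. Several points are assumptions rather than details given in this paper: the mean filter shrinks its window near the sequence borders, the value of α is left as a parameter because its exact relation to the temporal standard deviation is not specified, and condition (3) is read as "the filtered activity at t is at least 0.8 times the filtered activity α frames before and α frames after". The function names are hypothetical.

```python
def mean_filter(values, size=11):
    """Moving average; the window shrinks near the borders (assumed)."""
    half = size // 2
    smoothed = []
    for t in range(len(values)):
        window = values[max(0, t - half):t + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def movement_changes(activity, alpha, size=11, ratio=0.8):
    """Frames where the filtered activity has a local maximum meeting (3).

    `activity[t]` is the number of STIPs detected in frame t.
    """
    alpha = max(1, alpha)
    a_filt = mean_filter(activity, size)
    changes = []
    for t in range(alpha, len(a_filt) - alpha):
        local_max = a_filt[t] > a_filt[t - 1] and a_filt[t] >= a_filt[t + 1]
        # Assumed reading of condition (3): the peak is at least `ratio`
        # times the filtered activity found alpha frames before and after.
        cond3 = (ratio * a_filt[t - alpha] <= a_filt[t]
                 and a_filt[t] >= ratio * a_filt[t + alpha])
        if local_max and cond3:
            changes.append(t)
    return changes
```

On a flat plateau of the filtered activity, this sketch reports the first frame of the plateau; a centred choice would be an equally reasonable convention.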
5.3 Experimental Evaluation
We used twenty sequences of different types of jumps
(high jump, pole vault, long jump and triple jump) for
testing. Such athletic jump sequences generally contain
a single dominant temporal event. The evaluation is a
comparison between ground truth and detected
transitions. As the transition location is not always
accurate, we accepted a tolerance on the transition
location. This tolerance depends on the kind of jump.
Note that we used the same parameter set for all the
sequences.
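The comparison with a tolerance can be sketched as below. The greedy one-to-one matching between detected and ground-truth transitions is an assumption, as are the function name and the frame numbers in the usage note; only the idea of a tolerance in frames comes from this paper.

```python
def match_transitions(detected, truth, tolerance):
    """Precision and recall of detected transitions against ground truth.

    A detection is correct when an unmatched ground-truth transition lies
    within `tolerance` frames; each ground-truth entry is used at most once.
    """
    unmatched = sorted(truth)
    tp = 0
    for d in sorted(detected):
        hit = next((g for g in unmatched if abs(g - d) <= tolerance), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall
```

For example, with hypothetical detections at frames 48, 75 and 120, ground truth at frames 50, 118 and 200, and a tolerance of 3 frames, two detections are matched, one is spurious and one transition is missed.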
Table 4 shows the results obtained. Overall, the
transitions are correctly detected with an accuracy of
between 3 and 10 frames.
Precision and recall are relatively high. The least
satisfactory performance is obtained with the triple
jump. This is probably due to the camera motion, which
is more complex for this type of jump.
Table 4: Detection of significant changes in movement.

               Precision    Recall    Tolerance
long jump         0.93       0.92     ±3 frames
high jump         0.92       0.88     ±3 frames
triple jump       0.81       0.71     ±5 frames
pole vault        0.84       0.85     ±10 frames

20 sequences of long jump (2120 frames);
k = 0.04, $\sigma_s = \sigma_t = 1.5$, salience threshold = 120.
6 DISCUSSION
The proposed tool, that is STIPs, shows convincing
results for the detection of moving objects and for the
detection of significant changes in videos. However,it
has some limitations. The first limitation comes from
the setting. Indeed, the σ
s
and σ
t
parameters are diffi-
cult to adjust and the settings suggested in this analy-
sis may be less effective in videos with very different
characteristics. The second limitation relies on the
conditions necessary shooting and the video quality
(noise, contrast, compression), especially in the case
of captured video in real time. These constraints can
be problematic if one wishes to use this tool on videos
from Web or stream videos real time. In this case, it
will probably be necessary to make a pre-processing
of contrast adjustment and / or noise filtering. The
last limitation deals with reliability. The proposed as-
sessments were performed on data which the events
and movements were actually visible for. In the case
of movement of which speeds are low or constant (for
object detection) or in the case of movement which
changes are not large enough (to detect change), there
is no doubt that performance will be lower than
reported here. Despite these limitations, the tool can
be improved in many ways while keeping the
computational load very low.
7 CONCLUSIONS
In this paper, we proposed to use STIPs for video
analysis. First, we examined some specific properties
of STIPs related to our applications. Thus, we showed
that STIP detection is sensitive to the compression
factor, to parameter settings (specifically the
variances of the Gaussian filters), and to intensity
contrast. Conversely, STIP detection is relatively
robust against shooting condition variations and
impulsive noise. Second, we used STIPs to detect moving
objects in three different types of videos: synthetic
videos for qualification, and athletic jumps and an
animated movie for evaluation. The results were
satisfactory. In the specific case of athletic videos,
we also used STIPs to detect the transitions between
the different phases of a jump, which provided good
results too. The next step of this work will be to find
an adaptive setting for the most sensitive parameters.
REFERENCES
Bugeau, A. (2007). Détection et suivi d'objets en mouvement
dans des scènes complexes, application à la surveillance
des conducteurs. PhD thesis, IRISA.
Clarke, R. (1995). Digital compression of still images and
video. London : Academic press, pages 285–299.
Giai-Checa, B., Bouthemy, P., and Vieville, T. (1993). De-
tection d’objets en mouvement. Technical Report
INRIA-RR - 1906, INRIA.
Harris, C. and Stephens, M. (1988). A combined corner and
edge detector. In Alvey Vision Conference.
Laganière, R., Bacco, R., Hocevar, A., Lambert, P., Païs, G.,
and Ionescu, B. (2008). Video summarization from
spatio-temporal features. ACM.
Laptev, I. (2005). On space-time interest points. Interna-
tional Journal of Computer Vision, 64(2/3):107–123.
Laptev, I. and Lindeberg, T. (2003). Space-time interest
points. ICCV’03, pages 432–439.
Ramasso, E. (2007). Reconnaissance de séquences d'états par
le Modèle des Croyances Transférables et application
à l'analyse de vidéos d'athlétisme. PhD thesis, University
Joseph Fourier of Grenoble.
Schmid, C., Mohr, R., and Bauckhage, C. (2000). Evalua-
tion of interest point detectors. International Journal
of Computer Vision, 37(2):151–172.