REFA3D: ROBUST SPATIO-TEMPORAL ANALYSIS
OF VIDEO SEQUENCES
Manuel Grand-Brochier, Christophe Tilmant and Michel Dhome
Laboratoire des Sciences et Matériaux pour l'Electronique et d'Automatique (LASMEA), UMR 6602 UBP-CNRS,
24 Avenue des Landais, 63177 Aubière, France
Keywords: Ellipsoid-HOG, Local Descriptor, Space-time Robustness.
Abstract:
This article proposes a generalization of our approach REFA (Grand-brochier et al., 2011) to the spatio-temporal domain. Our new method, REFA3D, is mainly based on the hes-STIP detector and on E-HOG3D. SIFT3D and HOG/HOF are the two most used methods for space-time analysis and give good results, so studying them allows us to understand their construction and to extract some components to improve our approach. The mask of analysis used by REFA is modified and now relies on the use of ellipsoids. The validation tests are based on video clips with synthetic transformations as well as on real sequences from a simulator or an onboard camera. Our system (detection, description and matching) must be as invariant as possible to image transformations (rotations, scale changes, time-scaling). We also study the performance obtained for the registration of subsequences, a process often used for localization, for example. All the parameters (analysis shape, thresholds) and the changes made for the space-time generalization are detailed in this article.
1 INTRODUCTION
Today, digital imaging is becoming more prevalent in
everyday applications. It is used, for example, to
track, to localize, or to recognize. Scientists search
and propose methods to acquire or create images, to
edit content, or to extract all the information neces-
sary for various applications. To give some exam-
ples, we can cite 3D reconstruction, object tracking
and face recognition. These applications need
data usually extracted with two tools: the detection
of interest points and the local description. For 2D ap-
plications, we can cite methods such as SIFT (Scale
Invariant Feature Transform) (Lowe, 1999; Lowe,
2004) and SURF (Speeded-Up Robust Features) (Bay
et al., 2006), offering a complete system for the detec-
tion and local description of points. We proposed in
2011 the method REFA (Grand-brochier et al., 2011),
to extract and characterize interest points with greater
precision and a higher matching rate. The addition of
temporal information completes the analysis by studying
the movement of points in a video sequence.
Processes such as localization or tracking require
this type of data. Several methods offer this type of
study: we can cite SIFT3D (Scovanner et al., 2007;
Klaser et al., 2008), which is the generalization of SIFT,
the generalized SURF (Willems et al., 2008), or the
coupling HOG/HOF (Laptev and Lindeberg, 2006;
Laptev et al., 2007). To provide the best possible
characteristic points of a video for different
space-time applications, we propose to generalize
our approach REFA, making sure to remain as robust
as possible against the various transformations exist-
ing between two video sequences (translations, rota-
tions, scale changes, time-scaling changes). We must
also retain the various constraints that we set for our
spatial method (robustness, matching rate and preci-
sion). All parameters of our new method REFA3D
will be detailed in this article.
Section 2 briefly presents two space-time detectors
and three point characterizations: the method SIFT3D,
the generalized SURF and the coupling HOG/HOF.
Additions, changes and parameters used
for the construction of our new approach REFA3D
are detailed in Section 3. To validate our method, we
compare it with SIFT3D and HOG/HOF through
various tests implementing a number of data
transformations in Section 4. We also propose results
for the registration of subsequences.
2 RELATED WORK
Many approaches provide tools to extract and charac-
terize the interest points moving in time. For detection,
we can cite Laptev and Lindeberg (Laptev and
Lindeberg, 2003), Dollar et al. (Dollar et al., 2005)
and Willems et al. (Willems et al., 2008). The spatio-
temporal description is generally based on the coupling,
or the 2D+t extension, of existing methods
such as SIFT (Lowe, 1999; Lowe, 2004) or SURF
(Bay et al., 2006). A survey of these generalizations
was published by Wang et al. (Wang et al., 2009).
We limit our analysis to SIFT3D (Scovanner et al.,
2007; Klaser et al., 2008) and to the coupling
HOG/HOF (Laptev and Lindeberg, 2006; Laptev
et al., 2007).
Introduced by Laptev and Lindeberg (Laptev and
Lindeberg, 2003), Harris3D proposes a temporal
generalization of the Harris matrix, to obtain the
structure tensor:

$$M = g_{\sigma,\tau} * \begin{pmatrix} I_x^2 & I_x I_y & I_x I_t \\ I_x I_y & I_y^2 & I_y I_t \\ I_x I_t & I_y I_t & I_t^2 \end{pmatrix}, \qquad (1)$$

where $g_{\sigma,\tau}$ is the space-time Gaussian function,
defined by a spatial scale σ and a temporal scale τ.
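As an illustration, a minimal NumPy/SciPy sketch of this structure tensor could look as follows; the function name, the use of finite differences for the derivatives, and the reuse of the same scales for smoothing and averaging are our own assumptions, not details fixed by (Laptev and Lindeberg, 2003):

```python
# Hedged sketch of the spatio-temporal structure tensor of Eq. (1)
# for a video volume V of shape (T, H, W); axis order is (t, y, x).
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tensor_3d(V, sigma=1.5, tau=1.5):
    """Return the six distinct entries of M = g_{sigma,tau} * (grad I grad I^T)."""
    scales = (tau, sigma, sigma)
    Vs = gaussian_filter(V.astype(np.float64), sigma=scales)
    It, Iy, Ix = np.gradient(Vs)          # derivatives along t, y, x
    smooth = lambda A: gaussian_filter(A, sigma=scales)
    return {
        'xx': smooth(Ix * Ix), 'xy': smooth(Ix * Iy), 'xt': smooth(Ix * It),
        'yy': smooth(Iy * Iy), 'yt': smooth(Iy * It), 'tt': smooth(It * It),
    }
```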
Dollar et al. coupled this approach in 2005 with the
impulse responses of temporal filters, defined by:

$$h_{ev}(t;\tau) = \cos(8\pi t)\,e^{-t^2/\tau^2} \quad \text{and} \quad h_{od}(t;\tau) = \sin(8\pi t)\,e^{-t^2/\tau^2}. \qquad (2)$$
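A short sketch of these quadrature filters might be as follows; the temporal support grid and the value of τ below are illustrative assumptions:

```python
# Quadrature temporal filters of Eq. (2); even (cosine) and odd (sine)
# components share the same Gaussian envelope exp(-t^2 / tau^2).
import numpy as np

def temporal_gabor_pair(t, tau):
    envelope = np.exp(-t**2 / tau**2)
    return np.cos(8 * np.pi * t) * envelope, np.sin(8 * np.pi * t) * envelope

t = np.linspace(-2.0, 2.0, 64)   # assumed temporal support
h_ev, h_od = temporal_gabor_pair(t, tau=0.8)
# Dollar et al. respond to points where (V * h_ev)^2 + (V * h_od)^2 is large.
```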
Willems et al. resumed in 2008 the general idea of
Laptev and Lindeberg, applying it to the Hessian matrix
to create the hes-STIP (Hessian spatio-temporal
interest point) detector. Their goal is to propose a
generalization of the SURF method, usually used for
image analysis.
To generalize the SIFT descriptor, Scovanner et
al. and then Klaser et al. add a 3D analysis model
to it. Figure 1 illustrates the HOG3D histograms
(Klaser et al.), detailing the construction steps.
Figure 1: Various steps of the HOG3D construction: sampling of the mask of analysis (a and b), determination of the gradient orientation (d) in every sub-block with an icosahedron (c). (Klaser et al., 2008).
This approach consists in determining a 3D analysis
region, centred on the interest point. The mask is
divided into M × M × N blocks, each divided in
turn into S³ sub-blocks (Figure 1.a and 1.b). The
orientation is determined with a regular polyhedron
(Figure 1.c). Finally, a histogram of oriented gradients
is built on each sub-block b_j.
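To make the binning concrete, here is a minimal sketch of icosahedron-based gradient quantization in the spirit of (Klaser et al., 2008); the construction of the twenty face normals from the dual dodecahedron's vertices is a standard identity, while the function itself is our own illustrative assumption:

```python
# Assign a 3D gradient to one of the 20 icosahedron faces.
import numpy as np

PHI = (1 + np.sqrt(5)) / 2
# The 20 face centers of an icosahedron are the vertices of its dual
# dodecahedron: (+-1,+-1,+-1) plus cyclic permutations of (0,+-1/phi,+-phi).
corners = [(x, y, z) for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)]
cyc = [(0, b / PHI, c * PHI) for b in (-1, 1) for c in (-1, 1)]
cyc += [(b / PHI, c * PHI, 0) for b in (-1, 1) for c in (-1, 1)]
cyc += [(c * PHI, 0, b / PHI) for b in (-1, 1) for c in (-1, 1)]
FACES = np.array(corners + cyc, dtype=float)
FACES /= np.linalg.norm(FACES, axis=1, keepdims=True)   # 20 unit normals

def gradient_bin(g):
    """Histogram class = face whose normal is closest to gradient g."""
    return int(np.argmax(FACES @ (g / np.linalg.norm(g))))
```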
A spatio-temporal extension of SURF is proposed
by Willems et al. The principle is to extend
the Haar wavelets to a cuboid of size sσ × sσ × sτ,
where σ and τ are respectively the spatial scale and
the temporal scale, and s is a user-defined factor.
The descriptor is made up of the Haar wavelet
responses in x, y and t.
Laptev et al. (Laptev and Lindeberg, 2006; Laptev
et al., 2007) combine different histograms to capture
the spatial and the temporal aspects. Their idea is to
build a HOG with a 'classic' spatial analysis and pair
it with a histogram of optical flow (HOF) in
order to introduce a temporal notion.
We have presented various detection and local
description approaches integrating a temporal analysis.
The study of these methods allows us to extract their main
advantages (stability, performance and invariances).
We propose a generalization of our approach
REFA (Grand-brochier et al., 2011) based on these
different tools and on an ellipsoidal local exploration.
We detail in the next section the modifications,
the new parameters and the optimizations used.
3 METHOD
We propose a generalization of our method that
includes space-time data in order to process video. To
remain as invariant as possible to the various image
transformations, our approach is divided into three
parts: a hes-STIP detector (Hessian spatio-temporal
interest point), a local E-HOG3D descriptor (ellipsoid
histogram of oriented gradients 3D) and an optimized
matching. This section describes the different steps
of our method and the parameters used.
3.1 Detection
Proposed by Willems et al. (Willems et al., 2008),
the hes-STIP is a generalization of the fast-hessian
method (Bay et al., 2006) that includes temporal data.
This addition provides the following equation:

$$H(\mathbf{x};\sigma,\tau) = \begin{pmatrix} L_{xx}(\mathbf{x};\sigma,\tau) & L_{xy}(\mathbf{x};\sigma,\tau) & L_{xt}(\mathbf{x};\sigma,\tau) \\ L_{xy}(\mathbf{x};\sigma,\tau) & L_{yy}(\mathbf{x};\sigma,\tau) & L_{yt}(\mathbf{x};\sigma,\tau) \\ L_{xt}(\mathbf{x};\sigma,\tau) & L_{yt}(\mathbf{x};\sigma,\tau) & L_{tt}(\mathbf{x};\sigma,\tau) \end{pmatrix}. \qquad (3)$$
Its construction is based on the interpretation of the
Hessian matrix (equation 3) and particularly on two
local scales, σ and τ. The first corresponds to the
spatial exploration defined by the fast-hessian, and the
second allows us to add a temporal analysis of the
local information. To optimize this detector, we observe
the influence of these two scales on the repeatability
rate of our method. The results show that this rate is
optimal for a spatial analysis over two octaves
and a temporal exploration over four scales. The
number of points is not the most significant criterion for
applications such as homography estimation or
object recognition, for example. On the contrary, a good
matching precision strongly increases the quality and
the performance, due to a lower number of outliers
(false matches). So we choose these criteria in spite
of a 7% loss of matched points.
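For illustration, a single-scale sketch of the underlying saliency measure (the determinant of the Hessian of equation 3, following Willems et al.) could be written as follows; the Gaussian-derivative implementation and the absolute value are our assumptions:

```python
# Hedged sketch: saliency S = |det H| from the spatio-temporal Hessian
# of Eq. (3), at one (sigma, tau) scale pair, for a volume V of shape
# (T, H, W) with axis order (t, y, x).
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_saliency(V, sigma, tau):
    s = (tau, sigma, sigma)
    d = {}
    for name, order in [('tt', (2, 0, 0)), ('yy', (0, 2, 0)), ('xx', (0, 0, 2)),
                        ('yt', (1, 1, 0)), ('xt', (1, 0, 1)), ('xy', (0, 1, 1))]:
        # Gaussian second derivatives along the requested axes.
        d[name] = gaussian_filter(V.astype(np.float64), sigma=s, order=order)
    # Determinant of the symmetric 3x3 Hessian, expanded along the first row.
    det = (d['xx'] * (d['yy'] * d['tt'] - d['yt'] ** 2)
           - d['xy'] * (d['xy'] * d['tt'] - d['yt'] * d['xt'])
           + d['xt'] * (d['xy'] * d['yt'] - d['yy'] * d['xt']))
    return np.abs(det)
```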
3.2 Description
The local description of the method REFA is based
on the use of histograms of oriented gradients fol-
lowing an elliptical mask. The addition of temporal
data forces us to change this mask, transforming the
ellipses into ellipsoids. In order to analyze the entire
spatio-temporal information, we propose the mask
shown in Figure 2, based on a sampling of the ellip-
soidal neighborhood of the interest point. The latter
is determined according to five levels of description
(level -2 to level 2) combining 37 ellipsoids. For bet-
ter visibility of the spatio-temporal aspect of our de-
scriptor, we only display the centers of the ellipsoids
in the illustration.
Figure 2: Representation of our analysis ellipsoidal mask,
according to five levels of description.
The parameters of the ellipsoids are based on the
scales (spatial and temporal) of the local interest points.
To increase the invariance to rotation, we adjust the
analysis mask with two angles. The Harris3D matrix
(equation 1) introduced by Laptev and Lindeberg
(Laptev and Lindeberg, 2003) is analyzed to retrieve
two angles, θ and ϕ, shown in Figure 3.
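As a hedged sketch, the two angles could be recovered from the dominant eigenvector of the Harris3D tensor as below; the eigenvector choice and the exact angle parametrization are our assumptions, since the paper only states that θ and ϕ are derived from the matrix of equation 1:

```python
# Recover (theta, phi) from the structure tensor M of Eq. (1),
# assuming M's rows/columns are ordered (x, y, t) as in the equation.
import numpy as np

def orientation_angles(M):
    """M: 3x3 symmetric structure tensor at one interest point."""
    w, v = np.linalg.eigh(M)                 # ascending eigenvalues
    ex, ey, et = v[:, -1]                    # dominant eigenvector
    theta = np.arctan2(ey, ex)               # spatial adjustment (Fig. 3, left)
    phi = np.arctan2(et, np.hypot(ex, ey))   # temporal adjustment (Fig. 3, right)
    return theta, phi
```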
Figure 3: Illustration of the spatial (left) and temporal (right) adjustment of an ellipsoid.
The description of the method REFA is essentially
based on the use of histograms of oriented gradients
(eight classes), so the addition of temporal data
forces us to change these histograms. Building on the
work of Klaser et al. (Klaser et al., 2008), who provide
a generalization of HOG to the space-time domain, we
construct them with twenty classes. To do this, our
histograms are based on an icosahedron (a regular
polyhedron), which optimizes the distribution of such
data. The choice of the histogram class is based on
determining the intersection of the gradient vector
with one of the twenty faces of the icosahedron. To
order our descriptor optimally, the face corresponding to
the first class of our histograms is readjusted according
to the vector v. The latter corresponds to the combination
of the readjustments shown in Figure 3.
A final step is to saturate the values of the gradi-
ents, allowing us to increase the robustness to illumi-
nation changes. This process limits the influence of
outliers characterized by high gradient values.
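A minimal sketch of this saturation step follows; the paper does not specify the threshold, so the quantile-based cap below is purely an assumption:

```python
# Clip gradient magnitudes above the q-quantile, keeping directions;
# limits the influence of high gradient values (illumination outliers).
import numpy as np

def saturate_gradients(G, q=0.95):
    """G: (..., 3) array of spatio-temporal gradient vectors."""
    mag = np.linalg.norm(G, axis=-1, keepdims=True)
    cap = np.quantile(mag, q)                      # assumed saturation level
    scale = np.minimum(1.0, cap / np.maximum(mag, 1e-12))
    return G * scale
```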
3.3 Matching
The goal is to find the best similarity (corresponding
to the minimum distance) between the descriptors
$des_{I_1}$ and $des_{I_2}$ of two video sequences.
The Euclidean distance between two descriptors,
denoted $d_e$, is defined by:

$$d_e(des_{I_1}(x_k,y_k,t_k), des_{I_2}(x_l,y_l,t_l)) = \sqrt{[des_{I_1}(x_k,y_k,t_k) - des_{I_2}(x_l,y_l,t_l)]^T \cdot [des_{I_1}(x_k,y_k,t_k) - des_{I_2}(x_l,y_l,t_l)]}, \qquad (4)$$
where $(x_k,y_k,t_k) = \mathbf{x}_k$ and $(x_l,y_l,t_l) = \mathbf{x}_l$ represent the
interest points in the first and in the second sequence
respectively. The minimization of $d_e$, denoted $d_{min}$,
provides a pair of points $\{(x_k,y_k,t_k);(x_{\tilde{l}},y_{\tilde{l}},t_{\tilde{l}})\}$:
$$\tilde{l} = \underset{l \in [\![0;L-1]\!]}{\operatorname{argmin}} \; d_e(des_{I_1}(x_k,y_k,t_k), des_{I_2}(x_l,y_l,t_l)) \qquad (5)$$
and so
$$d_{min} = d_e(des_{I_1}(x_k,y_k,t_k), des_{I_2}(x_{\tilde{l}},y_{\tilde{l}},t_{\tilde{l}})). \qquad (6)$$
To reduce the computation time, we generalize the
decision tree used by the method REFA. The latter
depends on the size of the data provided; our descriptor
therefore lives in $\mathbb{R}^{340}$ (seventeen histograms with
twenty classes each). Regarding the selection threshold
and the method for removing duplicates, the processes
and parameters remain unchanged.
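As a reference for the behaviour that the decision tree approximates, a brute-force version of equations (5) and (6) could be sketched as follows; the array shapes and names are assumptions:

```python
# Brute-force nearest neighbour in Euclidean distance (Eqs. 4-6).
# The actual method uses a decision tree over R^340 to speed this up.
import numpy as np

def match(des1, des2):
    """des1: (K, 340), des2: (L, 340). Returns (l_tilde, d_min) per row of des1."""
    # Pairwise Euclidean distances, shape (K, L).
    d = np.linalg.norm(des1[:, None, :] - des2[None, :, :], axis=-1)
    l_tilde = d.argmin(axis=1)                     # Eq. (5)
    d_min = d[np.arange(len(des1)), l_tilde]       # Eq. (6)
    return l_tilde, d_min
```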
4 RESULTS
We are going to compare our method REFA3D
with SIFT3D (Klaser et al., 2008) and the coupling
HOG/HOF (Laptev and Lindeberg, 2006; Laptev
et al., 2007). These two methods give good results
for video analysis. We propose to study the matching
rate and the precision of each of them. We will also
study subsequence registration.
4.1 Databases
The first database, noted BSS, is based on videos
extracted from an onboard camera. We then apply
synthetic transformations such as translations (BSS_t),
rotations (BSS_r), scale changes (BSS_es) or
time-scaling changes (BSS_et). Figure 4 illustrates
these transformations.
Figure 4: Examples of transformations (translations, rotations by an angle θ and scale changes σ).
The second database, noted BSR, comes from
the simulator ASROCAM (Malartre, 2011; Delmas,
2011), used to create trajectories (BSR_s, BSR_aq, BSR_at)
in a virtual environment. Figure 5 shows an example of
this database.
Figure 5: Example of an image sequence created by simu-
lator ASROCAM.
4.2 Evaluation Tests and Results
4.2.1 Matching Rate and Precision
We propose to compare the matching rate as well
as the precision of the methods REFA3D (blue), SIFT3D
(yellow) and HOG/HOF (red). The matching rate
is defined by the number of matches divided by the
number of possible matches. The precision is defined
by the number of correct matches divided by the num-
ber of matches performed. Figure 6 shows a synthesis
of the results obtained.
Figure 6: Summary of results for the spatio-temporal preci-
sion (left) and matching rate (right).
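The matching rate and precision defined above reduce to simple ratios; a trivial helper with assumed count inputs makes them explicit:

```python
def matching_rate(n_matches, n_possible):
    """Matches performed divided by possible matches."""
    return n_matches / n_possible

def precision(n_correct, n_matches):
    """Correct matches divided by matches performed."""
    return n_correct / n_matches
```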
Given the different results, it appears that our approach
performs best in most cases. Its precision
decreases for real changes, but remains higher than
that of HOG/HOF and SIFT3D. Our approach also
provides a better overall matching rate, characterizing
a more relevant description of the neighborhood.
Finally, our method REFA3D is more robust and stable
under the various transformations considered. To detail
the precision curves of the different methods, we propose
Figures 7 and 8.
Figure 7: (a) Precision rate for translations (in pixels) and (b) precision rate for rotations (in degrees).
Figure 8: (a) Precision rate for scale changes and (b) precision rate for time-scaling changes.
Concerning the transformations studied, our method
generally has a higher precision than the other methods,
or similar to SIFT3D in the case of rotations. This
stability also enables us to conclude that our approach
is more robust. Nevertheless, these results rely on
various tools (optimization, thresholds) that involve
a slight decrease in the number of matched points.
4.2.2 Subsequence Registration
We propose a study of subsequence registration.
First we analyze three trajectories: a straight line, a
curve and a simulated subsequence. Table 1 shows
the precision "P", the number of matches "Nm" and
the rate of registered frames "Fr" for the three
compared methods. It appears that our approach
generally achieves a better precision of matches, and
its rate of registered images is higher. Our approach
therefore presents a more relevant description of the
scene. The only disadvantage is the decrease in the
number of matches.
We propose a final test by implementing the registration
of five subsequences in an obstacle avoidance scenario.
Figure 9 illustrates the five stages of the obstacle
avoidance and the associated subsequences. Table 2
shows the results ("P" for the precision in percent and
"Fr" for the rate of registered frames in percent) of the
methods REFA3D, SIFT3D and HOG/HOF. The analysis
of these results shows that our approach gives precision
and registered-frame rates generally higher than those
of the compared methods. Only SIFT3D presents, for
the subsequence ss5, a higher precision. Our method
Table 1: Results for the registration of subsequences for our
method REFA3D, the coupling HOG/HOF and the method
SIFT3D.
P Nm Fr
REFA3D
Straight line 99.8% 204 100%
Curve 97.6% 155 97.6%
Simulator 97.4% 237 98.3%
HOG/HOF
Straight line 99.2% 212 99.6%
Curve 96.9% 178 95.3%
Simulator 94.8% 256 92.5%
SIFT3D
Straight line 98.7% 284 98.1%
Curve 97.2% 247 97.8%
Simulator 95.4% 294 93.2%
Figure 9: Samples of the initial sequence and of the sequence with an obstacle avoidance (split into five subsequences).
also has better stability, reflected by the smallest
decreases in the observed criteria. Given the performance
obtained by our method, it would be interesting to
consider using these data in a process of realigning the
vehicle on its nominal trajectory. The matches extracted
by our approach would make it possible to estimate,
frame by frame, the homography, and thus to calculate
the various registration parameters to feed the
localization system.
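As a sketch of this prospective use, the homography could be estimated from our matches with a RANSAC-based solver; the use of OpenCV and the inlier threshold below are assumptions, not part of the present system:

```python
# Frame-by-frame homography estimation from matched interest points.
import numpy as np
import cv2

def register_frame(pts1, pts2):
    """pts1, pts2: (N, 2) float32 arrays of matched (x, y) positions, N >= 4."""
    H, mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    return H, mask.ravel().astype(bool)   # homography + inlier flags
```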
Table 2: Results for the registration of subsequences in an
obstacle avoidance.
REFA3D HOG/HOF SIFT3D
P Fr P Fr P Fr
ss1 99.4 100 98.7 99.5 99.2 100
ss2 91.3 95.6 89.2 90.3 86.7 93.3
ss3 79.1 87.6 67.3 72.3 71.2 81.3
ss4 85.4 92.2 79.6 84.4 81.3 88.4
ss5 94.7 97.3 91.3 91.5 95.1 95.6
5 CONCLUSIONS
We propose in this article a space-time generalization
of our method REFA. To do this, we use the
hes-STIP detector, which has the highest repeatability
rate for this type of analysis. The optimization we
bring concerns the limitation of the exploration scales
(spatial and temporal). The analysis mask is also modified
to add the temporal component to the histograms: the
ellipses are converted into ellipsoids and we use five
levels of description (Figure 2). Adding a temporal
adjustment results in a stable three-dimensional
exploration of the sequence. To validate this space-time
generalization, we first propose several tests based on
sequences from a real camera and from a simulator. The
results show that our approach generally obtains the best
precision. We also observe a better stability and a
higher matching rate. In a second step, we study the
registration of subsequences. This type of process is
used to provide space-time information about the object
(localization or trajectory, for example). Our method
performs best for the precision and for the rate of
registered images.
Our future prospect is the integration of our
approach REFA3D in intelligent vehicles. Our goal is to
keep improving the precision of our method, so that
these vehicles become more reliable and secure. Another
prospect is to export our descriptor to the three-
dimensional domain, to use it in medical imaging.
REFERENCES
Bay, H., Tuytelaars, T., and Gool, L. V. (2006). SURF:
Speeded up robust features. European Conference on
Computer Vision, pages 404–417.
Delmas, P. (2011). Génération active des déplacements d'un
véhicule agricole dans son environnement. PhD thesis,
University Blaise Pascal - Clermont II.
Dollar, P., Rabaud, V., Cottrell, G., and Belongie, S. (2005).
Behavior recognition via sparse spatio-temporal fea-
tures. IEEE International Conference on Computer
Vision.
Grand-brochier, M., Tilmant, C., and Dhome, M. (2011).
Method of extracting interest points based on multi-
scale detector and local e-hog descriptor. Interna-
tional Conference on Computer Vision Theory and
Applications.
Klaser, A., Marszalek, M., and Schmid, C. (2008). A spatio-
temporal descriptor based on 3d-gradients. British
Machine Vision Conference, pages 995–1004.
Laptev, I., Caputo, B., Schuldt, C., and Lindeberg, T.
(2007). Local velocity-adapted motion events for
spatio-temporal recognition. Computer Vision and Im-
age Understanding, 108(3):207–229.
Laptev, I. and Lindeberg, T. (2003). Space-time interest
points. IEEE International Conference on Computer
Vision, 1:432–439.
Laptev, I. and Lindeberg, T. (2006). Local descriptors for
spatio-temporal recognition. Computer and Informa-
tion Science, 3667:91–103.
Lowe, D. (1999). Object recognition from local scale-
invariant features. IEEE International Conference on
Computer Vision, pages 1150–1157.
Lowe, D. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
Malartre, F. (2011). Perception intelligente pour la navi-
gation rapide de robots mobiles en environnement na-
turel. PhD thesis, University Blaise Pascal - Clermont
II.
Scovanner, P., Ali, S., and Shah, M. (2007). A 3-
dimensional sift descriptor and its application to ac-
tion recognition. ACM Multimedia.
Wang, H., Ullah, M., Klaser, A., Laptev, I., and Schmid, C.
(2009). Evaluation of local spatio-temporal features
for action recognition. British Machine Vision Con-
ference.
Willems, G., Tuytelaars, T., and Gool, L. V. (2008). An effi-
cient dense and scale-invariant spatio-temporal inter-
est point detector. European Conference on Computer
Vision, 5303(2):650–663.