BUILDING A NORMALITY SPACE OF EVENTS

A PCA Approach to Event Detection

Angelo Cenedese

Department of Engineering and Management, University of Padova, Stradella S.Nicola 3, 36100 Vicenza, Italy

Ruggero Frezza

Department of Information Engineering, University of Padova, Via Gradenigo 6/B, 35131 Padova, Italy

Enrico Campana, Giambattista Gennari, Giorgio Raccanelli

Videotec S.p.a., Via Friuli 6, 36015 Schio, Italy

Keywords:

Event detection, Principal Component Analysis (PCA), Video Analysis.

Abstract:

The detection of events in video streams is a central task in the automatic vision paradigm, and spans het-

erogeneous ﬁelds of application from the surveillance of the environment, to the analysis of scientiﬁc data.

Actually, although well captured by intuition, the definition itself of event is somewhat hazy and depends

on the speciﬁc application of interest. In this work, the approach to the problem of event detection is different

in nature. Instead of deﬁning the event and searching for it within the data, a normality space of the scene is

built from a chosen learning sequence. The event detection algorithm works by projecting any newly acquired

image onto the normality space so as to calculate a distance from it that represents the innovation of the new

frame, and deﬁnes the metric for triggering an event alert.

1 INTRODUCTION

Within the paradigm of automatic computer vision, it

is of paramount importance for many applications that

the system is able to automatically discern objects,

features, events in a frame sequence. This task, which

may be considered trivial when performed by human

subjects, is in reality of great complexity when under-

taken by a synthetic intelligence, and it represents a

canonical problem in cognitive sciences. Moreover,

although of general interest in the analysis of images

and video streams, the detection of events, the focus-

ing on their evolution in time, and the understand-

ing of the scene, are key issues in the solution to the

surveillance problem, which involves all these steps.

In this context, research on themes regarding

surveillance and monitoring systems, has become

more and more active during the last few years, sup-

ported by technologies that are becoming increasingly

pervasive in our everyday life so that huge amounts of
data are made available (CCTV systems, wireless networks, distributed sensor architectures) (CogViSys, 2007) (VSAM, 2007).

As might be expected, the scientific literature and
conferences have also been fertile ground for papers
and presentations on the subject. We recall here a couple
of special issues published by international journals
(Collins et al., 2000) (Regazzoni et al., 2001),

workshops held in conjunction with international con-

ferences (ICML06, 2006) (ECCV06, 2006), and a re-

cent survey paper (Hu et al., 2004), which are a good

indication of what the current state of the art is.

In this work, we focus our attention on the

anomaly and event detection phase, which aims at

providing the Operator with metadata information of

low semantic level derived from the raw data acquired

from the scene. This information is the base to pro-

duce a high-level semantic description of the scene.

The ﬁrst challenge in approaching the problem

of event detection regards its deﬁnition and formal-

ization. In actual fact, it is a very common task in

many domains to monitor and analyze routinely col-

lected data in search for what differs from and stands

out of the normality, which constitutes the event or

the anomaly. When considering event detection in a

video stream, the event E is any image portion that

differs from the rest of the image stream in some

sense. Since mathematically the difference in some


Cenedese A., Frezza R., Campana E., Gennari G. and Raccanelli G. (2008).

BUILDING A NORMALITY SPACE OF EVENTS - A PCA Approach to Event Detection.

In Proceedings of the Third International Conference on Computer Vision Theory and Applications , pages 551-554

DOI: 10.5220/0001085905510554

 SciTePress

sense is linked to some kind of norm, it is one main

concern to try and ﬁnd a procedure that highlights

events as the result of a norm measure. Then, the space

of events is partitioned into three classes: Normal

events, which trigger no alarm in the detection, neg-

ligible anomalies (e.g. natural movements, shadows)

whose alarm trigger should be ﬁltered, and detectable

anomalies that are the true events.

2 STATE OF THE ART

In literature, several algorithms are reported that per-

form event detection and have been applied in differ-

ent contexts.

Some of these are based on the so-called back-

ground subtraction techniques (Jain et al., 1977),

which range from simple frame differencing, to more

complex probabilistic modeling schemes. The main

drawbacks of background subtraction are that the

methodology is extremely sensitive to change in il-

lumination and variation of weather conditions, hence

the use in outdoor environment may be critical; also, it

integrates no spatial correlation, therefore it is scantly

robust with respect to spurious signals or artifacts

present in the frame, which can generate false alarms.

Moreover, these techniques cannot cope with natural
movements of the background, changes in the

background geometry, or camera oscillations, result-

ing again in the generation of a great amount of false

alarms. A partial solution to the issues raised by the

previous discussion is given by the Radial Reach Fil-

ter algorithm (Satoh et al., 2002), where the thresh-

olding operations to discern between background and

foreground in the image are performed at pixel level,

but taking into account a sort of local texture. One

main concern remains: the fact that the threshold
values are commonly chosen by the operator implies
that the performance of the algorithm depends on
the operator's experience and skills.

Starting from these basic techniques, other solu-

tions have been developed, some employing a mix-

ture of Gaussians to build a multimodal background

model (Friedman and Russell, 1997), some resorting

to the identiﬁcation of shapes in the image with a dic-

tionary of objects, which results in algorithms of non

general application. Many approaches to paramet-

ric modeling have been proposed, among which hid-

den Markov models have become more popular in the

ﬁeld because they naturally enclose spatial and tem-

poral information (Clarkson and Pentland, 2000). In

these models, despite the high-quality detection pro-

vided, the main problem is the need of training the

model over a really extensive sample set and the fact

that often in video sequences there is no prior knowl-

edge available to support the deﬁnition of a paramet-

ric model. A non parametric approach is consid-

ered in (Zelnik-Manor and Irani, 2001) that regards

the event as a stochastic process sampled in time and

space to build an empirical distribution associated to

the event, in this way allowing to handle a wide range

of events. A combined detection and tracking method

for complex events is proposed in (Medioni et al.,

2001). This work addresses the problem also in the

case of moving cameras, resorting to a preprocessing

procedure (Image Stabilization) to filter out the camera
movement, and relying on the normal component

of the optical ﬂow ﬁeld for the computation of the

residual motion and the detection task.

A PCA approach has been explored by Monnet

and colleagues, to model background in dynamic

scenes (Monnet et al., 2003).

For a more exhaustive overview we refer the

reader to the survey paper (Radke et al., 2005).

3 RATIONALE & ALGORITHM

The basic idea is to steer away from a pixel-based approach,
in which the frame image is a board on which calculations
on pixels are performed, and move towards an
information content approach, where the frame image
is a container of information, regardless of the representation
of the image as a matrix of pixels. As a

ﬁrst step in this direction we introduce the concept

of region frame, which is a section of the image plane

frame whose dimension is deﬁned according to the in-

formation content of the image sequence in the sense

that it is related directly to the size of the desired ob-

ject or action to be detected. In this way the spatial

correlation within each frame is enhanced. As a con-

sequence, the size of the region assumes the role of

characteristic dimension: The image is no longer an-

alyzed pixel by pixel, but region by region, thus cir-

cumventing issues related to electronic (pixel) noise

or spurious pixel signals.

The algorithm is organized through a two-step

process, namely the learning phase and the (proper)

detection phase. Starting from a sequence of frames

(and the regions within), a normality space is built

during the learning phase, which is supervised by an

operator. Then, during the detection phase, any newly

acquired frame is projected onto the normality space

that has been built during the learning phase of the

algorithm. The projection on the normality model

yields the innovation brought in by the new frame se-

quence, that is the detection of unexpected events.

In the remainder of the paper, we will make use

VISAPP 2008 - International Conference on Computer Vision Theory and Applications


of the notions of singular value decomposition (SVD) and
projection onto a subspace (Golub and Van Loan, 1996).

Let $I_t$ be the image at time $t$, of $(n_1 \times n_2)$ pixels. The image plane of $I_t$ is partitioned into $N$ regions $\{R_i,\ i = 1, \ldots, N\}$, which can be imagined without loss of generality as regions of $(n \times m)$ pixels. We denote with $I_{t,i} := I_t|_{R_i}$ the restriction of $I_t$ to the region $R_i$. $I_{t,i}$ is then transformed into the image vector $y_{t,i}$ of size $nm$, by piling the columns of $I_{t,i}$ (the procedure can also be performed by rows).
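As an illustrative sketch of this step (not the authors' implementation), the following Python/NumPy function splits a frame into non-overlapping $(n \times m)$ regions and stacks the columns of each region into a vector of size $nm$. The function name `region_vectors`, the row-major ordering of the regions, and the requirement that the frame dimensions be exact multiples of the region size are assumptions made here for simplicity.

```python
import numpy as np

def region_vectors(frame, n, m):
    """Split a 2-D frame into (n x m) regions and vectorize each one.

    Returns an array of shape (N, n*m), one row per region, with regions
    ordered row-major over the image plane. Assumes (hypothetically) that
    the frame dimensions are exact multiples of n and m.
    """
    rows, cols = frame.shape
    if rows % n or cols % m:
        raise ValueError("frame size must be a multiple of the region size")
    vectors = []
    for r0 in range(0, rows, n):
        for c0 in range(0, cols, m):
            region = frame[r0:r0 + n, c0:c0 + m]
            # order="F" walks the region column by column (column stacking)
            vectors.append(region.reshape(-1, order="F"))
    return np.asarray(vectors, dtype=float)

frame = np.arange(24).reshape(4, 6)     # toy 4 x 6 "image"
Y_t = region_vectors(frame, n=2, m=3)   # four regions of 2 x 3 pixels
```

Stacking by rows instead would only require `order="C"`; what matters is that the same convention is used in the learning and detection phases.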

Starting from a learning set of $T$ frames $\{I_t,\ t = 1, \ldots, T\}$, and a partition of the image frame into $N$ regions $\{R_i\}$, we produce the sequence $\{y_{t,i},\ t = 1, \ldots, T\}$ for each region of the image plane to compose the normality matrix $Y_i = [\, y_{1,i} \mid y_{2,i} \mid \ldots \mid y_{T,i} \,]$, whose size is $(nm \times T)$, where typically $nm > T$. The columns of $Y_i$ report the information on region $R_i$ in time, while the eigenvectors of $Y_i Y_i^\top$ are the principal components.

The SVD applied to $Y_i$ yields $Y_i = U_i \cdot S_i \cdot V_i^\top$, with $U_i$ column-orthonormal of dimension $(nm \times T)$, $V_i$ unitary square matrix of size $(T \times T)$, and $S_i$ diagonal matrix of the singular values $\sigma_{j,i}$. We recall that, by the properties of the SVD, the first columns of $U_i$ corresponding to non-null singular values form a basis for the range of $Y_i$. The higher the value of $\sigma_{j,i}$, the more relevant the principal component $u_{j,i}$ is to describe the scene in $R_i$. Therefore, in order to retain only the main information from the learning set, a truncation at singular value $\sigma_{r,i} > 0$ is operated. It follows:

$$Y_i^r = U_i^r \cdot S_i^r \cdot (V_i^r)^\top, \qquad (1)$$

where $U_i^r$, $S_i^r$, and $V_i^r$ are the restrictions of $U_i$, $S_i$, and $V_i$, and the vector space $\mathcal{U}_i^r := \mathrm{range}\{Y_i^r\} = \mathrm{span}\{u_{1,i}, u_{2,i}, \ldots, u_{r,i}\}$.
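The truncation in Eq. (1) can be sketched numerically as follows. This is a minimal example on a random toy matrix, assuming NumPy's thin SVD (`numpy.linalg.svd` with `full_matrices=False`); the sizes `nm`, `T`, `r` and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nm, T, r = 50, 10, 3                 # hypothetical sizes, with nm > T

# Toy normality matrix: columns play the role of vectorized regions
# observed over T learning frames
Y = rng.standard_normal((nm, T))

# Thin SVD: U is (nm x T) with orthonormal columns, s holds the singular
# values in non-increasing order, Vt is the transpose of V (T x T)
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

# Restriction to the first r components, as in Eq. (1)
U_r, S_r, V_r = U[:, :r], np.diag(s[:r]), Vt[:r, :].T
Y_r = U_r @ S_r @ V_r.T

# The spectral norm of the discarded part equals the first dropped
# singular value (sigma_{r+1}, stored at index r)
residual = np.linalg.norm(Y - Y_r, ord=2)
```

The columns of `U_r` span the truncated space, and `residual` coincides with `s[r]`, which is the quantity later used as a threshold.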

It remains to choose the truncation index $r$. Firstly, we observe that the principal components corresponding to null singular values are negligible, which gives a first value for $r$, namely $r = r_0 - 1$ (where $r_0$ is the index of the first null singular value, $\sigma_{r_0,i} = 0$). Then, the selection of the optimal value is done following an intensity principle, that is, by considering the first $r$ components of $Y_i$ retaining up to the fraction $I_i(r)$ of image intensity,

$$I_i(r) = \frac{\sum_{j=1}^{r} \sigma_{j,i}}{\sum_{j=1}^{r_0-1} \sigma_{j,i}}.$$

This value follows automatically from the SVD procedure, and depends on the information content of the region of interest. Therefore, we introduce the concept of normality through the following

Definition 1 (Normality Model.) Given a learning set of images $\{I_t\}$ and a set of image plane regions $\{R_i\}$, interpreted as a sequence of matrices $\{Y_i\}$, the Normality Model is formed by:

• the base $B_i = \{u_{1,i}, u_{2,i}, \ldots, u_{r,i}\}$ of subspace $\mathcal{U}_i^r$;

• the threshold $T_i^{\sigma} = \sigma_{r+1,i}$, stating an upper bound to the norm of the projection error of a vector in the space generated by the columns of $Y_i$ on $\mathcal{U}_i^r$.
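A minimal sketch of how such a model could be assembled for one region is given below. It selects the truncation index by the intensity principle (ratio of the leading singular values to the sum of the non-null ones) and returns the base together with the threshold $\sigma_{r+1}$. The function names, the retained fraction `intensity`, and the tolerance `tol` used to decide which singular values count as null are assumptions introduced here, not values from the paper.

```python
import numpy as np

def truncation_index(s, intensity=0.9, tol=1e-10):
    """Smallest r whose leading singular values retain `intensity` of the
    sum of the non-null ones. `intensity` and `tol` are illustrative."""
    s = np.asarray(s, dtype=float)
    nonnull = s[s > tol]                  # discard null singular values
    if nonnull.size == 0:
        return 0
    retained = np.cumsum(nonnull) / nonnull.sum()   # I(r) for r = 1, 2, ...
    r = int(np.searchsorted(retained, intensity)) + 1
    return min(r, nonnull.size)

def normality_model(Y, intensity=0.9):
    """Return (basis, threshold): first r left singular vectors, sigma_{r+1}."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    r = truncation_index(s, intensity)
    # sigma_{r+1} sits at index r; fall back to 0 if nothing was truncated
    threshold = s[r] if r < s.size else 0.0
    return U[:, :r], threshold

rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 8))          # toy (nm x T) normality matrix
basis, threshold = normality_model(Y, intensity=0.8)

# Projection error norm of every learning vector onto the retained space
errors = np.linalg.norm(Y - basis @ (basis.T @ Y), axis=0)
```

For the learning vectors themselves, every entry of `errors` stays below `threshold`: each column of $Y_i$ is a combination of the $\sigma_{j,i} u_{j,i}$ weighted by one row of $V_i$, and that row has unit norm, so the discarded part cannot exceed $\sigma_{r+1,i}$ in norm.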

In reality, the normality model is built from the learning sequence in a slightly more complex way, in order to ensure validation and consistency of the model. The learning sequence is subdivided by sampling into a number of subsequences. The first subsequence $Y_i = Y_i^{r(1)}$ is then used to build a first instance of the normality space $\mathcal{U}_i = \mathcal{U}_i^{(1)}$ and a related threshold value $T_i^{\sigma} = T_i^{\sigma(1)}$. The remaining subsequences $Y_i^{r(*)}$ are used to validate it: each image (column) vector $y_{t,i}^{L*}$ of $Y_i^{r(*)}$ is projected onto the candidate $\mathcal{U}_i$,

$$P(y_{t,i}^{L*}) = U_i^r \cdot (U_i^r)^\top \cdot y_{t,i}^{L*},$$

where the columns of $U_i^r$ form the base of the candidate space, and the projection error $e(y_{t,i}^{L*})$ is computed:

$$e(y_{t,i}^{L*}) = y_{t,i}^{L*} - P(y_{t,i}^{L*}).$$

A comparison between the norms of all the $e(y_{t,i}^{L*})$ from the validation set and the threshold value $T_i^{\sigma}$ shows whether the subsequence used for learning is a good learning set or, conversely, whether it needs to be extended so as to include the image vectors that are not well represented in $\mathcal{U}_i$. In the latter case, the procedure derives a new learning matrix from $Y_i^{r(1)}$ to build a further instance of the normality space. If the whole learning sequence is chosen appropriately, the procedure finally converges to a validated space $\bar{\mathcal{U}}_i$, with base $\bar{B}_i$, neatly representing the essence of what happened during the learning period, and threshold value $\bar{T}_i$.
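The validation loop described above can be sketched as follows. This is one possible reading of the procedure, not the authors' code: the first subsequence is obtained by taking every `step`-th frame, the truncation index `r` is kept fixed, and the learning set is extended with exactly those validation vectors whose projection error exceeds the current threshold. All function and parameter names are hypothetical.

```python
import numpy as np

def fit_normality(Y, r):
    """Base of the first r principal components and threshold sigma_{r+1}."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    threshold = s[r] if r < s.size else 0.0
    return U[:, :r], threshold

def projection_error(U_r, y):
    """Norm of e(y) = y - U_r U_r^T y."""
    return np.linalg.norm(y - U_r @ (U_r.T @ y))

def validated_model(Y_all, r, step=4):
    """Grow the learning subsequence until every validation vector passes."""
    T = Y_all.shape[1]
    learn = list(range(0, T, step))       # first subsequence, by sampling
    while True:
        U_r, thr = fit_normality(Y_all[:, learn], r)
        held_out = [t for t in range(T) if t not in learn]
        failing = [t for t in held_out
                   if projection_error(U_r, Y_all[:, t]) > thr]
        if not failing:                   # candidate space is validated
            return U_r, thr, learn
        learn = sorted(learn + failing)   # extend the learning set

# Toy learning sequence: rank-3 structure plus a small perturbation
rng = np.random.default_rng(1)
Y_all = rng.standard_normal((40, 3)) @ rng.standard_normal((3, 20))
Y_all += 0.01 * rng.standard_normal((40, 20))
U_r, thr, learn = validated_model(Y_all, r=3)
```

The loop always terminates, because the learning set grows at every iteration and, once it contains all frames, there is nothing left to validate.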

Since the validated threshold $\bar{T}_i$ is the value discriminating the maximum allowed projection error during the learning phase, it is reasonable to use during the detection phase a less strict value, in order to accept in the normality domain also events that are similar to those of the learning set but have never been observed before. To this aim, the threshold value is altered by resorting to an additive or a multiplicative factor, according to the characteristics of the region of interest. In particular, if the scene is characterized by intense dynamics, the threshold value $\bar{T}_i$ is already high, so it is advisable to use only an additive term $\alpha_i$ to modify $\bar{T}_i$ slightly, in order to reduce the number of false positives without introducing further false negatives ($\tilde{T}_i := \bar{T}_i + \alpha_i$). Differently, when the scene is static, $\bar{T}_i$ is low. The use of an additive correction law would result in an excessive generation of false negatives; conversely, the multiplicative term introduces a scaling factor in the learning threshold value ($\tilde{T}_i := \alpha_i \bar{T}_i$). In both cases, the $\alpha_i$ value is tuned according to experimental observations.
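The two correction laws can be written as a small helper; the function name, the `mode` argument, and the numbers in the example are hypothetical and serve only to show the different effect of the two laws on a low threshold.

```python
def detection_threshold(learned_threshold, alpha, mode):
    """Relax the learning threshold for the detection phase.

    mode="additive"       -> T + alpha   (suggested for dynamic scenes)
    mode="multiplicative" -> alpha * T   (suggested for static scenes)
    """
    if mode == "additive":
        return learned_threshold + alpha
    if mode == "multiplicative":
        return alpha * learned_threshold
    raise ValueError(f"unknown mode: {mode!r}")

# Dynamic scene: high learned threshold, small additive margin
dynamic = detection_threshold(10.0, 0.5, "additive")

# Static scene: low learned threshold, scaled rather than shifted
static_scaled = detection_threshold(0.2, 1.5, "multiplicative")
static_shifted = detection_threshold(0.2, 0.5, "additive")
```

With these illustrative numbers, the same additive term that changes the dynamic-scene threshold by 5% more than triples the static-scene one, which is why a scaling factor is preferred when the learned threshold is low.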


The detection phase of the algorithm follows basi-

cally the procedure illustrated for the validation of the

normality space: Each newly acquired image frame

is pre-processed according to the same region de-

composition of the learning phase, and reshaped into

the image vectors $\{y_{t,i}\}$, where the $i$ index spans the region set. Then, each vector $y_{t,i}$ is projected onto the corresponding normality space $\bar{\mathcal{U}}_i$ to determine whether and to what degree it is included in $\mathrm{span}\{\bar{B}_i\}$: that is, the norm of the projection error is compared with the modified threshold value $\tilde{T}_i$.
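A minimal sketch of the detection step for one new frame follows, assuming the per-region bases and modified thresholds are already available from the learning phase. The function `detect_events` and its argument layout are hypothetical.

```python
import numpy as np

def detect_events(region_vecs, bases, thresholds):
    """Flag regions whose projection error exceeds the detection threshold.

    region_vecs : (N, nm) array, one vectorized region per row
    bases       : list of N arrays (nm x r_i) with orthonormal columns
    thresholds  : length-N sequence of (already modified) thresholds
    """
    alerts = np.zeros(len(bases), dtype=bool)
    for i, (y, U_r, thr) in enumerate(zip(region_vecs, bases, thresholds)):
        error = y - U_r @ (U_r.T @ y)     # innovation brought by the frame
        alerts[i] = np.linalg.norm(error) > thr
    return alerts

# Toy example: N = 2 regions of nm = 4 pixels, 1-D normality space each
u = np.array([[1.0], [0.0], [0.0], [0.0]])
new_frame = np.array([[3.0, 0.0, 0.0, 0.0],    # lies in the normality space
                      [3.0, 2.0, 0.0, 0.0]])   # has an out-of-space part
alerts = detect_events(new_frame, [u, u], [0.5, 0.5])
```

Only the second region raises an alert here, since its component outside the learned space has norm 2, above the threshold of 0.5.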

An extensive campaign of simulations and experiments
has been performed, regarding the detection
of anomalous events in heterogeneous environments:

In particular, we focused on the analysis of trafﬁc ﬂow

video sequences, and the detection of people in out-

door environments in presence of wind acting on nat-

ural objects such as trees and bushes, and change of

light. The algorithm has also been implemented and

tested in real-life situations, such as the task of monitoring
the behavior of people at a fair, with consistent

results.

In addition, it is remarkable how the Event Detector
has proven robust in common situations such as

the detection of forgotten objects, and prohibited di-

rections in crowd ﬂow, because by studying the nor-

mality of the scene, the algorithm implicitly includes

these events. Nonetheless, there is currently under de-

velopment a dedicated tool that implements a set of

rules to speciﬁcally manage these situations.

4 CONCLUSIONS

In this paper we present a novel approach to event

detection by resorting to the SVD technique. The

core contribution is the capability of building a vec-

tor space summarizing what is normal in the scene

with only little supervision by the operator, who has

just to choose an appropriate learning sequence. The

algorithm works by projecting newly acquired images

onto the so-constructed normality space, in search for

innovation, that is, in the case of a surveillance system,
the presence of events of some kind. Moreover,

we employ an object-oriented approach by analysing

regions of the image related to the characteristic size

of the event of interest instead of single pixels or local

textures as done in previous works.

The preliminary results obtained so far using
indoor and outdoor sequences in operating conditions
are quite promising, showing good robustness
and high performance.

REFERENCES

Clarkson, B. and Pentland, A. (2000). Framing through pe-

ripheral perception. In Proceedings of the IEEE inter-

national conference on image processing (ICIP 2000),

Vancouver, Canada, pages 38–41.

CogViSys (2007). http://cogvisys.iaks.uni-karlsruhe.de/.

[online].

Collins, R. T., Lipton, A. J., and Kanade, T. (2000). Special

issue on video surveillance. IEEE Trans. On Pattern

Analysis and Machine Intelligence, 22(8).

ECCV06 (2006). 6th IEEE international workshop on visual

surveillance. In conjunction with the 9th European

Conference on Computer Vision 2006, Graz, Austria.

Friedman, N. and Russell, S. (1997). Image segmentation in

video sequences: A probabilistic approach. In Annual

Conference on Uncertainty in Artiﬁcial Intelligence,

volume 2, pages 175–181.

Golub, G. and Van Loan, C. (1996). Matrix Computations.

Johns Hopkins University Press, Baltimore.

Hu, W., Tan, T., Wang, L., and Maybank, S. (2004). A

survey on visual surveillance of object motion and be-

haviours. IEEE Trans. On Systems, Man and Cyber-

netics Part C, Applications and Reviews, 34(3):334–

352.

ICML06 (2006). Machine learning algorithms for surveil-

lance and event detection. In conjunction with the

International Conference on Machine Learning 2006,

Carnegie Mellon University, Pittsburgh, PA.

Jain, R., Militzer, D., and Nagel, H. (1977). Separating non-stationary
from stationary scene components in a sequence
of real world TV-images. In International Joint
Conference on Artificial Intelligence, pages 612–618.

Medioni, G. G., Cohen, I., Bremond, F., Hongeng, S., and

Nevatia, R. (2001). Event detection and analysis from

video streams. IEEE Transactions on Pattern Analysis

and Machine Intelligence, 23(8):873–889.

Monnet, A., Mittal, A., Paragios, N., and Ramesh, V.

(2003). Background modeling and subtraction of dy-

namic scenes.

Radke, R. J., Andra, S., Al-Kofahi, O., and Roysam, B.

(2005). Image change detection algorithms: A sys-

tematic survey. IEEE Transactions on Image Process-

ing, 14(3):294–307.

Regazzoni, C., Ramesh, V., and Foresti, G. L. (2001). Spe-

cial issue on video communication, processing and

understanding for third generation surveillance sys-

tems. Proceedings of the IEEE, 89(10).

Satoh, Y., Tanahashi, H., Wang, C., Kaneko, S., Niwa, Y.,

and Yamamoto, K. (2002). Robust event detection by

radial reach ﬁlter. In 16th ICPR, International Con-

ference on Pattern Recognition, volume 2, pages 623–

626.

VSAM (2007). http://www.cs.cmu.edu/~vsam/. [online].

Zelnik-Manor, L. and Irani, M. (2001). Event-based anal-

ysis of video. In Proceedings of the IEEE conference

on computer vision and pattern recognition (CVPR

2001), Kauai, Hawaii, December 2001, pages 123–

130.
