BUILDING A NORMALITY SPACE OF EVENTS
A PCA Approach to Event Detection
Angelo Cenedese
Department of Engineering and Management, University of Padova, Stradella S.Nicola 3, 36100 Vicenza, Italy
Ruggero Frezza
Department of Information Engineering, University of Padova, Via Gradenigo 6/B, 35131 Padova, Italy
Enrico Campana, Giambattista Gennari, Giorgio Raccanelli
Videotec S.p.a., Via Friuli 6, 36015 Schio, Italy
Keywords:
Event detection, Principal Component Analysis (PCA), Video Analysis.
Abstract:
The detection of events in video streams is a central task in the automatic vision paradigm, and spans heterogeneous fields of application, from the surveillance of the environment to the analysis of scientific data. Although well captured by intuition, the very definition of event is somewhat hazy and depends on the specific application of interest. In this work, the approach to the problem of event detection is different in nature. Instead of defining the event and searching for it within the data, a normality space of the scene is built from a chosen learning sequence. The event detection algorithm works by projecting any newly acquired image onto the normality space so as to calculate a distance from it that represents the innovation of the new frame and defines the metric for triggering an event alert.
1 INTRODUCTION
Within the paradigm of automatic computer vision, it is of paramount importance for many applications that the system is able to automatically discern objects, features, and events in a frame sequence. This task, which
may be considered trivial when performed by human
subjects, is in reality of great complexity when under-
taken by a synthetic intelligence, and it represents a
canonical problem in cognitive sciences. Moreover,
although of general interest in the analysis of images
and video streams, the detection of events, the focus-
ing on their evolution in time, and the understand-
ing of the scene, are key issues in the solution to the
surveillance problem, which involves all these steps.
In this context, research on surveillance and monitoring systems has become more and more active during the last few years, supported by technologies that are becoming increasingly pervasive in our everyday life, so that huge amounts of data are made available (CCTV systems, wireless networks, distributed sensor architectures) (CogViSys, 2007) (VSAM, 2007).
As might be expected, scientific literature and conferences have also been fertile ground for papers and presentations on the subject. We recall here a couple of special issues published by international journals (Collins et al., 2000) (Regazzoni et al., 2001), workshops held in conjunction with international conferences (ICML06, 2006) (ECCV06, 2006), and a recent survey paper (Hu et al., 2004), which are a good indication of the current state of the art.
In this work, we focus our attention on the anomaly and event detection phase, which aims at providing the operator with metadata of low semantic level derived from the raw data acquired from the scene. This information is the basis for producing a high-level semantic description of the scene.
The first challenge in approaching the problem of event detection regards its definition and formalization. It is a very common task in many domains to monitor and analyze routinely collected data in search of what differs from and stands out from normality, which constitutes the event or the anomaly. When considering event detection in a video stream, the event E is any image portion that differs from the rest of the image stream in some sense. Since mathematically the difference in some sense is linked to some kind of norm, one main concern is to find a procedure that highlights events as the result of a norm measurement. Then, the space of events is partitioned into three classes: normal events, which trigger no alarm in the detection; negligible anomalies (e.g. natural movements, shadows), whose alarm trigger should be filtered; and detectable anomalies, which are the true events.
Cenedese A., Frezza R., Campana E., Gennari G. and Raccanelli G. (2008).
BUILDING A NORMALITY SPACE OF EVENTS - A PCA Approach to Event Detection.
In Proceedings of the Third International Conference on Computer Vision Theory and Applications, pages 551-554
DOI: 10.5220/0001085905510554
Copyright © SciTePress
2 STATE OF THE ART
In the literature, several algorithms are reported that perform event detection and have been applied in different contexts.
Some of these are based on the so-called back-
ground subtraction techniques (Jain et al., 1977),
which range from simple frame differencing, to more
complex probabilistic modeling schemes. The main
drawbacks of background subtraction are that the methodology is extremely sensitive to changes in illumination and variations of weather conditions, hence its use in outdoor environments may be critical; also, it integrates no spatial correlation, therefore it is scarcely robust with respect to spurious signals or artifacts present in the frame, which can generate false alarms. Moreover, these techniques cannot cope with natural movements of the background, changes in the background geometry, or camera oscillations, resulting again in the generation of a great number of false alarms. A partial solution to these issues is given by the Radial Reach Filter algorithm (Satoh et al., 2002), where the thresholding operations to discern between background and foreground in the image are performed at pixel level, but taking into account a sort of local texture. One main concern remains: the fact that the threshold values are commonly chosen by the operator implies that the performance of the algorithm depends on the operator's experience and skills.
Starting from these basic techniques, other solutions have been developed, some employing a mixture of Gaussians to build a multimodal background model (Friedman and Russell, 1997), some resorting to the identification of shapes in the image with a dictionary of objects, which results in algorithms that are not generally applicable. Many approaches to parametric modeling have been proposed, among which hidden Markov models have become more popular in the field because they naturally incorporate spatial and temporal information (Clarkson and Pentland, 2000). In these models, despite the high-quality detection provided, the main problems are the need to train the model over a very extensive sample set and the fact that often in video sequences there is no prior knowledge available to support the definition of a parametric model. A nonparametric approach is considered in (Zelnik-Manor and Irani, 2001), which regards the event as a stochastic process sampled in time and space to build an empirical distribution associated with the event, thus making it possible to handle a wide range of events. A combined detection and tracking method for complex events is proposed in (Medioni et al., 2001). This work also addresses the case of moving cameras, resorting to a preprocessing procedure (Image Stabilization) to filter out the camera movement, and relying on the normal component of the optical flow field for the computation of the residual motion and the detection task.
A PCA approach has been explored by Monnet
and colleagues, to model background in dynamic
scenes (Monnet et al., 2003).
For a more exhaustive overview we refer the
reader to the survey paper (Radke et al., 2005).
3 RATIONALE & ALGORITHM
The basic idea is to steer away from a pixel-based approach, in which the frame image is a board on which to perform calculations on pixels, and move towards an information content approach, where the frame image is a container of information, regardless of the representation of the image as a matrix of pixels. As a first step in this direction we introduce the concept of region frame, which is a section of the image plane frame whose dimension is defined according to the information content of the image sequence, in the sense that it is related directly to the size of the desired object or action to be detected. In this way the spatial correlation within each frame is enhanced. As a consequence, the size of the region assumes the role of characteristic dimension: the image is no longer analyzed pixel by pixel, but region by region, thus circumventing issues related to electronic (pixel) noise or spurious pixel signals.
The algorithm is organized as a two-step process, namely the learning phase and the (proper) detection phase. Starting from a sequence of frames (and the regions within), a normality space is built during the learning phase, which is supervised by an operator. Then, during the detection phase, any newly acquired frame is projected onto the normality space that has been built during the learning phase. The projection onto the normality model yields the innovation brought in by the new frame sequence, that is, the detection of unexpected events.
VISAPP 2008 - International Conference on Computer Vision Theory and Applications
In the remainder of the paper, we will make use of the notions of singular value decomposition (SVD) and projection onto a subspace (Golub and VanLoan, 1996).
Let $I_t$ be the image at time $t$, of $(n_y \times n_x)$ pixels. The image plane of $I_t$ is partitioned into $N$ regions $\{R_i,\ i = 1, \dots, N\}$, which can be imagined without loss of generality as regions of $(n \times m)$ pixels. We denote by $I_{t,i} := I_t|_{R_i}$ the restriction of $I_t$ to the region $R_i$. $I_{t,i}$ is then transformed into the image vector $y_{t,i}$ of size $nm$, by stacking the columns of $I_{t,i}$ (the procedure can also be performed by rows).
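The region decomposition and column stacking described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `region_vectors`, the use of non-overlapping regions visited row by row, and the requirement that the image size be an exact multiple of the region size are all assumptions made for the example.

```python
import numpy as np


def region_vectors(image, n, m):
    """Split a grayscale image into non-overlapping (n x m) regions and
    return one column-stacked vector of length n*m per region.

    Assumption (not stated in the text): the image size is an exact
    multiple of the region size and regions are visited row by row.
    """
    n_y, n_x = image.shape
    if n_y % n or n_x % m:
        raise ValueError("image size must be a multiple of the region size")
    vectors = []
    for top in range(0, n_y, n):
        for left in range(0, n_x, m):
            region = image[top:top + n, left:left + m]   # restriction I_{t,i}
            vectors.append(region.flatten(order="F"))    # stack the columns
    return np.stack(vectors)  # shape (N, n*m); row i holds y_{t,i}


image = np.arange(24, dtype=float).reshape(4, 6)  # toy 4 x 6 frame
Y_t = region_vectors(image, n=2, m=3)             # 4 regions, vectors of size 6
```

Stacking by rows instead would only require `order="C"`; what matters is that the same ordering is used in the learning and detection phases.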
Starting from a learning set of $T$ frames $\{I^L_t,\ t = 1, \dots, T\}$ and a partition of the image frame into $N$ regions $\{R_i\}$, we produce the sequence $\{y^L_{t,i},\ t = 1, \dots, T\}$ for each region of the image plane to compose the normality matrix $Y_i = [\, y^L_{1,i} \mid y^L_{2,i} \mid \dots \mid y^L_{T,i} \,]$, whose size is $(nm \times T)$, where typically $nm > T$. The columns of $Y_i$ report the information on region $R_i$ in time, while the eigenvectors of $Y_i Y_i^\top$ are the principal components.
The SVD applied to $Y_i$ yields $Y_i = U_i \cdot S_i \cdot V_i^\top$, with $U_i$ column-orthonormal of dimension $(nm \times T)$, $V_i$ a unitary square matrix of size $(T \times T)$, and $S_i$ the diagonal matrix of the singular values $\sigma_{j,i}$. We recall that, by the properties of the SVD, the first columns of $U_i$ corresponding to non-null singular values form a basis for the range of $Y_i$. The higher the value of $\sigma_{j,i}$, the more relevant the principal component $u_{j,i}$ is to describe the scene in $R_i$. Therefore, in order to retain only the main information from the learning set, a truncation at singular value $\sigma_{r,i} > 0$ is performed. It follows:
$$Y_i^r = U_i^r \cdot S_i^r \cdot (V_i^r)^\top, \qquad (1)$$
where $U_i^r$, $S_i^r$, and $V_i^r$ are the restrictions of $U_i$, $S_i$, and $V_i$ to the first $r$ singular values, and the vector space $\mathcal{U}_i := \mathrm{range}\{Y_i^r\} = \mathrm{span}\{u_{1,i}, u_{2,i}, \dots, u_{r,i}\}$.
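As a numerical illustration of the decomposition and of the truncation in equation (1), the following sketch uses NumPy's thin SVD on a random stand-in for a normality matrix. The sizes and the value of r are arbitrary choices for the example, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nm, T, r = 12, 5, 2                      # toy sizes with nm > T
Y = rng.standard_normal((nm, T))         # stand-in for a normality matrix Y_i

# Thin SVD: U is (nm x T), s holds the singular values in decreasing order,
# Vt is V transposed, of size (T x T).
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

# Restriction to the first r components, as in equation (1).
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
Y_r = U_r @ S_r @ Vt_r

# In the spectral norm, the truncation error equals the first discarded
# singular value sigma_{r+1} (s[r] with zero-based indexing).
truncation_error = np.linalg.norm(Y - Y_r, ord=2)
```

The columns of `U_r` are orthonormal, so they can be used directly as a basis of the truncated range.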
It remains to choose the truncation index $r$. Firstly, we observe that the principal components corresponding to null singular values are negligible, which gives a first value for $r$, namely $r = r_0 - 1$ (corresponding to $\sigma_{r_0,i} = 0$). Then, the selection of the optimal value is done following an intensity principle, that is, by considering the first $r$ components of $Y_i$, retaining up to the fraction $I_i^{\%}(r)$ of image intensity,
$$I_i^{\%}(r) = \frac{\sum_{j=1}^{r} \sigma_{j,i}}{\sum_{j=1}^{r_0 - 1} \sigma_{j,i}}.$$
This value follows automatically from the SVD procedure, and depends on the information content of the region of interest. Therefore, we introduce the concept of normality through the following
Definition 1 (Normality Model). Given a learning set of images $\{I_t\}$ and a set of image plane regions $\{R_i\}$, interpreted as a sequence of matrices $\{Y_i^r\}$, the Normality Model is formed by:
• the basis $B_i = \{u_{1,i}, u_{2,i}, \dots, u_{r,i}\}$ of subspace $\mathcal{U}_i$;
• the threshold $T_i^\sigma = \sigma_{r+1,i}$, stating an upper bound to the norm of the projection error of a vector in the space generated by the columns of $Y_i$ on $\mathcal{U}_i$.
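A possible way to assemble the two ingredients of the normality model for one region is sketched below. The rule used to pick r (the smallest r whose intensity fraction reaches a target value) is one reading of the intensity principle, and the names `build_normality_model`, `intensity` and `tol` are hypothetical, introduced only for this example.

```python
import numpy as np


def build_normality_model(Y, intensity=0.9, tol=1e-10):
    """Return (basis U_r, threshold sigma_{r+1}) for one region.

    `intensity` and `tol` are illustrative parameters, not values
    taken from the paper.
    """
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    s_nz = s[s > tol * s[0]]                     # drop numerically null values
    fraction = np.cumsum(s_nz) / np.sum(s_nz)    # I_i^%(r) for r = 1, 2, ...
    r = int(np.searchsorted(fraction, intensity) + 1)  # smallest r reaching it
    r = min(r, len(s_nz))
    threshold = s[r] if r < len(s) else 0.0      # sigma_{r+1}; 0 if none is left
    return U[:, :r], threshold


rng = np.random.default_rng(1)
Y = rng.standard_normal((20, 6))                 # stand-in learning matrix
basis, threshold = build_normality_model(Y, intensity=0.8)

# Projection error of every learning vector (column of Y) on the basis.
errors = np.linalg.norm(Y - basis @ (basis.T @ Y), axis=0)
```

For the columns of the learning matrix itself, each projection error is bounded by the threshold, because the rows of V have at most unit norm; this is the sense in which the first discarded singular value acts as an upper bound.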
In reality, the normality model is built from the learning sequence in a slightly more complex way, in order to ensure validation and consistency of the model. The learning sequence is subdivided by sampling into a number of subsequences: the first subsequence $Y_i^r = Y_i^{r(1)}$ is used to build a first instance of the normality space $\mathcal{U}_i = \mathcal{U}_i^{(1)}$ and a related threshold value $T_i^\sigma = T_i^{\sigma(1)}$. The remaining subsequences $Y_i^{r(\ell)}$ are used to validate it: each image (column) vector $y^L_{t,i}$ of $Y_i^{r(\ell)}$ is projected onto the candidate $\mathcal{U}_i$,
$$P_{\mathcal{U}_i}(y^L_{t,i}) = U_i^r \cdot (U_i^r)^\top \cdot y^L_{t,i},$$
and the projection error $e_{\mathcal{U}_i}(y^L_{t,i})$ is computed:
$$e_{\mathcal{U}_i}(y^L_{t,i}) = y^L_{t,i} - P_{\mathcal{U}_i}(y^L_{t,i}).$$
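The projection and the projection error can be computed in a couple of lines once a basis with orthonormal columns is available. The sketch below uses a hand-made basis (the x-y plane in three dimensions) so that the result can be checked by inspection; the function name `projection_error` is an assumption of this example.

```python
import numpy as np


def projection_error(U_r, y):
    """Return e = y - U_r U_r^T y for a basis U_r with orthonormal columns."""
    return y - U_r @ (U_r.T @ y)


# Toy basis: the x-y plane in R^3.
U_r = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [0.0, 0.0]])
y = np.array([3.0, 4.0, 12.0])
e = projection_error(U_r, y)        # component of y outside the plane
error_norm = np.linalg.norm(e)      # the quantity compared with the threshold
```

Computing `U_r.T @ y` first avoids forming the (nm x nm) projector explicitly, which matters when the regions contain many pixels.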
A comparison between the norms of all the $e_{\mathcal{U}_i}(y^L_{t,i})$ from the validation set and the threshold value $T_i^\sigma$ shows whether the subsequence used for learning is a good learning set or, conversely, whether it needs to be extended so as to include the image vectors that are not well represented in $\mathcal{U}_i$. The procedure derives a new $\hat{Y}_i^r$ from $Y_i^{r(1)}$ to build a further instance of the normality space $\mathcal{U}_i$. If the whole learning sequence is chosen appropriately, the procedure finally converges to a validated space $\hat{\mathcal{U}}_i$, with basis $\hat{B}_i$, neatly representing the essence of what happened during the learning period, and threshold value $\hat{T}_i^\sigma$.
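The text does not detail how the extended learning matrix is derived, so the following is only one plausible reading of the validation loop: validation vectors whose projection error exceeds the current threshold are appended to the learning matrix, and the model is rebuilt until no such vector remains. The function names, the intensity target of 0.9 and the `max_rounds` safeguard are assumptions of this sketch, and the learning matrix is assumed to be non-zero.

```python
import numpy as np


def fit(Y, intensity):
    """Basis and threshold sigma_{r+1}, with r chosen by intensity fraction."""
    U, s, _ = np.linalg.svd(Y, full_matrices=False)
    fraction = np.cumsum(s) / np.sum(s)
    r = min(int(np.searchsorted(fraction, intensity)) + 1, len(s))
    threshold = s[r] if r < len(s) else 0.0
    return U[:, :r], threshold


def validated_model(Y_first, Y_validation, intensity=0.9, max_rounds=10):
    """Grow the learning matrix until the validation vectors are explained."""
    Y_hat, remaining = Y_first, Y_validation
    U_r, threshold = fit(Y_hat, intensity)
    for _ in range(max_rounds):
        errors = np.linalg.norm(remaining - U_r @ (U_r.T @ remaining), axis=0)
        outliers = errors > threshold
        if not outliers.any():
            break                                  # model is consistent
        # Extend the learning set with the poorly represented vectors.
        Y_hat = np.hstack([Y_hat, remaining[:, outliers]])
        remaining = remaining[:, ~outliers]
        U_r, threshold = fit(Y_hat, intensity)
    return U_r, threshold, Y_hat


# Toy data in R^5: the first subsequence lives mostly along e1 and e2.
E = np.eye(5)
Y_first = np.column_stack([2.0 * E[0], 1.0 * E[1], 0.1 * E[2]])
Y_validation = np.column_stack([
    [1.0, 1.0, 0.05, 0.0, 0.0],   # well represented by the first model
    [0.0, 0.0, 0.0, 3.0, 0.0],    # new direction, forces an extension
])
basis, threshold, Y_hat = validated_model(Y_first, Y_validation)
```

In this toy run the second validation vector is absorbed into the learning matrix, and the validated basis grows from two to three components.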
Since $\hat{T}_i^\sigma$ is the threshold value bounding the maximum allowed projection error during the learning phase, it is reasonable to use a less strict value during the detection phase, in order to accept in the normality domain also events that are similar to those of the learning set but have never been observed before. To this aim, the threshold value is altered by resorting to an additive or a multiplicative factor, according to the characteristics of the region of interest. In particular, if the scene is characterized by intense dynamics, the threshold value $\hat{T}_i^\sigma$ is already high, so it is advisable to use only an additive term $\alpha_i$ to modify $\hat{T}_i^\sigma$ slightly, in order to reduce the number of false positives without introducing further false negatives ($\check{T}_i^\sigma := \hat{T}_i^\sigma + \alpha_i$). Conversely, when the scene is static, $\hat{T}_i^\sigma$ is low. The use of an additive correction law would result in an excessive generation of false negatives; the multiplicative term instead introduces a scaling factor in the learning threshold value ($\check{T}_i^\sigma := \alpha_i \hat{T}_i^\sigma$). In both cases, the $\alpha_i$ value is tuned according to experimental observations.
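The two correction laws can be written as a small helper. The numerical values below are made up purely to show the two cases, and the `mode` argument is a hypothetical way of selecting the law; the text leaves the choice of the correction term to experimental tuning.

```python
def detection_threshold(learned_threshold, alpha, mode):
    """Relax the learned threshold with an additive or multiplicative term."""
    if mode == "additive":            # scenes with intense dynamics
        return learned_threshold + alpha
    if mode == "multiplicative":      # static scenes
        return alpha * learned_threshold
    raise ValueError("mode must be 'additive' or 'multiplicative'")


# Illustrative values only.
dynamic = detection_threshold(40.0, alpha=2.0, mode="additive")
static = detection_threshold(0.5, alpha=3.0, mode="multiplicative")
```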
The detection phase of the algorithm basically follows the procedure illustrated for the validation of the normality space: each newly acquired image frame $I_t$ is pre-processed according to the same region decomposition of the learning phase, and reshaped into the image vectors $\{y_{t,i}\}$, where the index $i$ spans the region set. Then, each vector $y_{t,i}$ is projected onto the corresponding $\hat{\mathcal{U}}_i$ to determine whether and to what degree it is included in $\mathrm{span}\{B_i\}$: that is, the norm of the projection error is compared with the modified threshold value $\check{T}_i^\sigma$.
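Putting the pieces together, the detection step for one frame could look like the sketch below. It assumes that the frame has already been split into region vectors and that one basis and one relaxed threshold per region are available from the learning phase; the function name `detect_events` and the toy data are illustrative.

```python
import numpy as np


def detect_events(frame_vectors, bases, thresholds):
    """Flag each region whose projection error exceeds its threshold.

    frame_vectors: array (N, nm), one image vector y_{t,i} per region
    bases:         list of N arrays (nm, r_i) with orthonormal columns
    thresholds:    array (N,) of detection thresholds (already relaxed)
    """
    alerts = np.zeros(len(bases), dtype=bool)
    for i, (y, U_r) in enumerate(zip(frame_vectors, bases)):
        error = y - U_r @ (U_r.T @ y)          # innovation of region i
        alerts[i] = np.linalg.norm(error) > thresholds[i]
    return alerts


# Two toy regions in R^3 sharing the same normality basis (the x-y plane).
plane = np.eye(3)[:, :2]
frame = np.array([[1.0, 2.0, 0.1],    # stays close to the plane
                  [1.0, 2.0, 5.0]])   # large component outside the plane
alerts = detect_events(frame, [plane, plane], np.array([0.5, 0.5]))
```

Only the second region raises an alert here, since its vector has a large component outside the learned subspace.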
An exhaustive campaign of simulations and experiments has been performed, regarding the detection of anomalous events in heterogeneous environments: in particular, we focused on the analysis of traffic flow video sequences, and on the detection of people in outdoor environments in the presence of wind acting on natural objects such as trees and bushes, and of changes of light. The algorithm has also been implemented and tested in real-life situations, such as the task of monitoring the behavior of people at a fair, with consistent results.
In addition, it is remarkable how the Event Detector has proven robust in common situations such as the detection of forgotten objects and of prohibited directions in crowd flow, because by studying the normality of the scene, the algorithm implicitly includes these events. Nonetheless, a dedicated tool that implements a set of rules to specifically manage these situations is currently under development.
4 CONCLUSIONS
In this paper we present a novel approach to event detection by resorting to the SVD technique. The core contribution is the capability of building a vector space summarizing what is normal in the scene with only little supervision by the operator, who has just to choose an appropriate learning sequence. The algorithm works by projecting newly acquired images onto the so-constructed normality space, in search of innovation, that is, in the case of a surveillance system, the presence of events of some kind. Moreover, we employ an object-oriented approach by analysing regions of the image related to the characteristic size of the event of interest instead of single pixels or local textures as done in previous works.
The preliminary results obtained so far using indoor and outdoor sequences in operating conditions are quite promising, showing good robustness and high performance.
REFERENCES
Clarkson, B. and Pentland, A. (2000). Framing through pe-
ripheral perception. In Proceedings of the IEEE inter-
national conference on image processing (ICIP 2000),
Vancouver, Canada, pages 38–41.
CogViSys (2007). http://cogvisys.iaks.uni-karlsruhe.de/.
[online].
Collins, R. T., Lipton, A. J., and Kanade, T. (2000). Special
issue on video surveillance. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(8).
ECCV06 (2006). 6th IEEE international workshop on visual
surveillance. In conjunction with the 9th European
Conference on Computer Vision 2006, Graz, Austria.
Friedman, N. and Russell, S. (1997). Image segmentation in
video sequences: A probabilistic approach. In Annual
Conference on Uncertainty in Artificial Intelligence,
volume 2, pages 175–181.
Golub, G. and VanLoan, C. (1996). Matrix Computations.
Johns Hopkins University Press, Baltimore.
Hu, W., Tan, T., Wang, L., and Maybank, S. (2004). A
survey on visual surveillance of object motion and behaviours.
IEEE Transactions on Systems, Man and Cybernetics
Part C, Applications and Reviews, 34(3):334–352.
ICML06 (2006). Machine learning algorithms for surveil-
lance and event detection. In conjunction with the
International Conference on Machine Learning 2006,
Carnegie Mellon University, Pittsburgh, PA.
Jain, R., Militzer, D., and Nagel, H. (1977). Separating
non-stationary from stationary scene components in a
sequence of real world TV-images. In International Joint
Conference on Artificial Intelligence, pages 612–618.
Medioni, G. G., Cohen, I., Bremond, F., Hongeng, S., and
Nevatia, R. (2001). Event detection and analysis from
video streams. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 23(8):873–889.
Monnet, A., Mittal, A., Paragios, N., and Ramesh, V.
(2003). Background modeling and subtraction of dy-
namic scenes.
Radke, R. J., Andra, S., Al-Kofahi, O., and Roysam, B.
(2005). Image change detection algorithms: A sys-
tematic survey. IEEE Transactions on Image Process-
ing, 14(3):294–307.
Regazzoni, C., Ramesh, V., and Foresti, G. L. (2001). Spe-
cial issue on video communication, processing and
understanding for third generation surveillance sys-
tems. Proceedings of the IEEE, 89(10).
Satoh, Y., Tanahashi, H., Wang, C., Kaneko, S., Niwa, Y.,
and Yamamoto, K. (2002). Robust event detection by
radial reach filter. In 16th ICPR, International Con-
ference on Pattern Recognition, volume 2, pages 623–
626.
VSAM (2007). http://www.cs.cmu.edu/~vsam/. [online].
Zelnik-Manor, L. and Irani, M. (2001). Event-based anal-
ysis of video. In Proceedings of the IEEE conference
on computer vision and pattern recognition (CVPR
2001), Kauai, Hawaii, December 2001, pages 123–
130.