camera¹ is connected to a computer where dedicated software organises the acquisitions and runs the monitoring application. The analysis is performed continuously over time.
When a new dynamic event is observed and classified, an event record is generated and described with a set of tags – including the type of event (a normal event with information on the assigned class, or an anomaly), starting and ending time, a description, ... – which are collected in a database. The final user can constantly monitor the system status via a web user interface, where the intermediate output of each stage of the pipeline can be visualised. Moreover, the user can formulate queries to the database to analyse the results and obtain statistics.
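The event records and user queries described above can be sketched as follows. This is a minimal illustration, not the system's actual schema: the table layout, field names, and query are our assumptions for the sake of the example.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema for the event records described above: each event
# is stored with its type (the assigned class, or "anomaly"), its
# starting/ending time and a free-text description.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id INTEGER PRIMARY KEY,
        event_type TEXT,        -- assigned class, or 'anomaly'
        start_time TEXT,
        end_time TEXT,
        description TEXT
    )
""")

def log_event(event_type, start_time, end_time, description):
    conn.execute(
        "INSERT INTO events (event_type, start_time, end_time, description) "
        "VALUES (?, ?, ?, ?)",
        (event_type, start_time.isoformat(), end_time.isoformat(), description),
    )

# The kind of statistics query a user might formulate:
# how many events of each type were observed.
log_event("anomaly",
          datetime(2015, 6, 1, 10, 0, tzinfo=timezone.utc),
          datetime(2015, 6, 1, 10, 2, tzinfo=timezone.utc),
          "loitering near exit")
rows = conn.execute(
    "SELECT event_type, COUNT(*) FROM events GROUP BY event_type"
).fetchall()
```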
Previous Works. Profiling behaviours from temporal data is a research domain with a long history, where the benefit of adopting machine learning techniques was soon observed. The problem has been tackled using supervised methods, such as Support Vector Machines (Chen and Aggarwal, 2011) or Hidden Markov Models (F. Bashir and Schonfeld, 2007). However, many applications – in particular within the video analysis domain – are characterised by the availability of huge sets of data but relatively few labeled ones; therefore, increasing interest has been directed towards the unsupervised perspective. A rather complete account of event classification in an unsupervised setting is reported in (Morris and Trivedi, 2008); see also (Stauffer and Grimson, 2000; Hu et al., 2006).
More recently, there has been renewed attention on the problem of detecting abnormal events. We cite for instance the work in (Cheng et al., 2015), where local and global anomalies are detected using a regression model on Spatio-Temporal Interest Points (STIPs), (Ren et al., 2015), which employs Dictionary Learning to build models of common activities, or again (Xu et al., 2015), which is built on top of the by now popular Deep Neural Networks.
2 MONITORING APPLICATION
We address the problem of modelling behaviours in a setting where people are observed as a whole and their dynamics can be described by a trajectory of temporal observations. A behaviour can then be formalised as a set of dynamic events coherent with respect to a certain metric (e.g. going from a point A to a point B, or entering a region C). Depending on the available information on the trajectories, more subtle properties can be captured: e.g. if velocity is considered, then the behaviour going from A to B can be further divided into going from A to B by walking or by running.

¹ The video surveillance setup has been obtained within a technology transfer program with Imavis S.r.l., http://www.imavis.com.
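The velocity-based refinement mentioned above can be sketched as follows. The function name, the trajectory encoding, and the speed threshold are our illustrative assumptions, not values from the paper.

```python
def refine_behaviour(label, trajectory, walk_threshold=1.5):
    """Split a coarse behaviour (e.g. 'A->B') by average speed.

    trajectory: list of (x, y, t) samples; walk_threshold is an
    illustrative speed cutoff (units per second), chosen arbitrarily.
    """
    if len(trajectory) < 2:
        return label
    (x0, y0, t0), (x1, y1, t1) = trajectory[0], trajectory[-1]
    # Average speed over the trajectory (straight-line estimate).
    speed = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5 / max(t1 - t0, 1e-9)
    return f"{label}/{'running' if speed > walk_threshold else 'walking'}"
```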
In the following we describe our monitoring application. In a first period, we simply collect dynamic event representations, and then run the training stage in batch to obtain an initial guess of the patterns of common activity in the scene. The core of this step is the method we proposed in (Noceti and Odone, 2012), summarised in Sec. 2.1 and 2.2 (we refer to the original paper for further details). Then, the online test analysis starts. If necessary, the behaviour models are updated to address the problem of adapting to scene changes. We discuss this point in Sec. 2.3.
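The two-phase workflow (batch training followed by online analysis with model updates) can be summarised in a skeleton like the following. All function names here are placeholders of ours; the actual training, classification, and update procedures are those of Sec. 2.1–2.3.

```python
def monitor(stream, train_model, update_model, classify, collect_period=1000):
    """Skeleton of the two-phase workflow: first collect `collect_period`
    dynamic-event representations and train in batch; then switch to
    online analysis, updating the models to adapt to scene changes."""
    buffer, models, events = [], None, []
    for obs in stream:
        if models is None:
            buffer.append(obs)                 # first period: just collect
            if len(buffer) >= collect_period:
                models = train_model(buffer)   # initial batch training
        else:
            label = classify(models, obs)      # online test analysis
            events.append((obs, label))
            models = update_model(models, obs) # adapt to scene changes
    return events
```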
2.1 Modelling Common Activities in a
Scene
The method we adopt consists of different steps. A low-level processing step collects trajectories over time, which in a second step are mapped into string representations. Finally, groups of coherent strings are detected with clustering.
2.1.1 Low-level Video Processing
Our low-level processing (Noceti et al., 2009) aims first at performing a motion-based image segmentation to detect moving regions (see Fig. 2(b)), that in our setting can be associated with a single person or a small group of people (see Fig. 2(c)). At a given time instant t, each moving object i in the scene is described by a feature vector $x_i^t = [P_i^t, S_i^t, M_i^t, D_i^t] \in \mathbb{R}^5$, where P denotes the object position in the image plane, S its size, and M and D the magnitude and orientation of the object motion.
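A minimal sketch of how such a 5-D feature vector could be assembled from two consecutive detections of the same object; the function name and concrete units are our assumptions for illustration.

```python
import numpy as np

def object_features(prev_centroid, centroid, area, dt=1.0):
    """Build a feature vector [P, S, M, D] in R^5 as described above:
    position P (2-D centroid), size S (region area), motion magnitude M
    and motion direction D, estimated from two consecutive frames."""
    px, py = centroid
    vx = (centroid[0] - prev_centroid[0]) / dt
    vy = (centroid[1] - prev_centroid[1]) / dt
    magnitude = float(np.hypot(vx, vy))
    direction = float(np.arctan2(vy, vx))  # radians in (-pi, pi]
    return np.array([px, py, area, magnitude, direction])
```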
The tracking procedure correlates such vectors over time (Fig. 2(c)), obtaining N spatio-temporal trajectories that are collected in a training set of temporal series $X = \{x_i\}_{i=1}^{N}$.
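The data association at the heart of tracking can be illustrated with a greedy nearest-neighbour scheme. This is a simplification of ours, not the actual procedure of (Noceti et al., 2009), and the gating distance is an arbitrary choice.

```python
def track(frames, max_dist=50.0):
    """Greedy nearest-neighbour association: link each detection to the
    closest trajectory extended in the previous frame; unmatched
    detections start new trajectories, unmatched trajectories terminate.

    frames: list of per-frame detection lists, each detection an (x, y).
    Returns a list of trajectories, each a list of (x, y) points.
    """
    trajectories = []
    active = []  # indices of trajectories extended in the last frame
    for detections in frames:
        next_active = []
        unmatched = list(detections)
        for ti in active:
            if not unmatched:
                break  # remaining active trajectories terminate
            last = trajectories[ti][-1]
            best = min(unmatched,
                       key=lambda p: (p[0] - last[0]) ** 2 + (p[1] - last[1]) ** 2)
            dist = ((best[0] - last[0]) ** 2 + (best[1] - last[1]) ** 2) ** 0.5
            if dist <= max_dist:
                trajectories[ti].append(best)
                unmatched.remove(best)
                next_active.append(ti)
        for p in unmatched:  # start new trajectories
            trajectories.append([p])
            next_active.append(len(trajectories) - 1)
        active = next_active
    return trajectories
```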
2.1.2 String-based Trajectory Representation
We map each temporal trajectory into a new representation based on strings – formally, a concatenation of symbols from a finite alphabet $\mathcal{A}$. The alphabet can be obtained by partitioning the space of all vectors $x_i^t$ disregarding the temporal correlation: $\tilde{X} = \{\{x_i^t\}_{t=1}^{k_i}\}_{i=1}^{N}$, where $k_i$ refers to the length of the i-th trajectory. We adopted the method proposed in (Noceti et al., 2011), where the benefit of adopting a strategy guided by the available data, as opposed to
VISAPP 2016 - International Conference on Computer Vision Theory and Applications