tures for crowd event recognition. Our method has been validated using leave-one-out cross-validation (LOOCV) on a novel dataset.
From the results, some conclusions can be drawn. First, results are generally better for lower values of K. This seems reasonable, since there should be, at most, three or four different situations at a given moment in time, namely: normal, deviations from normal, and chaotic. Second, histograms that do not take intensity into account outperform their intensity-aware counterparts. Finally, the polar histogram seems to perform better than the circular one, which again seems logical, since the circular histogram does not take orderliness into account.
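The difference between the two descriptors can be sketched as follows. This is an illustrative NumPy example, not the paper's implementation: it assumes the 'circular' histogram bins motion direction only, while the 'polar' histogram additionally bins motion magnitude, so that ordered and disordered motion produce different signatures; all function names, bin counts and ranges are assumptions.

```python
import numpy as np

def circular_histogram(angles, n_bins=8):
    """Direction-only histogram (assumed 'circular' variant)."""
    hist, _ = np.histogram(angles, bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def polar_histogram(angles, magnitudes, n_angle_bins=8, n_mag_bins=3,
                    max_mag=10.0):
    """2-D direction x magnitude histogram (assumed 'polar' variant)."""
    hist, _, _ = np.histogram2d(
        angles, np.clip(magnitudes, 0.0, max_mag),
        bins=[n_angle_bins, n_mag_bins],
        range=[(-np.pi, np.pi), (0.0, max_mag)])
    return (hist / max(hist.sum(), 1)).ravel()

# Toy per-tracklet displacement vectors (dx, dy)
dx = np.array([1.0, 0.9, -1.0, -1.1])
dy = np.array([0.1, -0.1, 0.05, 0.0])
angles = np.arctan2(dy, dx)
magnitudes = np.hypot(dx, dy)
```

Under these assumed definitions, two tracklet sets with the same direction distribution but different magnitudes would yield identical circular histograms yet different polar ones.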
It has to be noted that the dataset used here is a very challenging one, owing to the heavy clutter caused by objects in the scene (trees, benches, etc.), which complicates the tracking, the bottleneck of the whole process. Poor tracking will always yield worse results overall; further work needs to be carried out in this regard.
4.2 Future Work
This section outlines a series of immediate and longer-term improvements that are expected to ameliorate the results.
As noted above, the bottleneck of the whole process is the tracking: if it fails, the tracklet plots will not be representative of the situation in the scene. For this reason, a good tracker is essential. Perfect, flawless tracking is still an open challenge in the computer vision research community. Thus, and since the aim of this work is not to achieve better tracking, ground-truth trajectories of the people could be used to evaluate the tracklet plot histograms and the bag-of-words modelling being applied. Another option would be to use promising trackers such as the recent work by Kwon et al. (Kwon and Lee, 2013).
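Evaluating the bag-of-words stage on ground-truth trajectories could follow a pipeline like the sketch below: a generic k-means codebook over per-tracklet descriptors, then a normalised word-occurrence histogram per clip. All names, the value of K, and the use of plain Lloyd's k-means are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_codebook(features, k=4, iters=20, seed=0):
    """Learn K codebook 'words' with plain Lloyd's k-means."""
    rng = np.random.default_rng(seed)
    centres = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        # Assign every descriptor to its nearest centre...
        dists = np.linalg.norm(features[:, None, :] - centres[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # ...and move each centre to the mean of its members.
        for j in range(k):
            if np.any(labels == j):
                centres[j] = features[labels == j].mean(axis=0)
    return centres

def bow_vector(features, centres):
    """Normalised histogram of word occurrences for one video clip."""
    dists = np.linalg.norm(features[:, None, :] - centres[None, :, :],
                           axis=2)
    words = dists.argmin(axis=1)
    counts = np.bincount(words, minlength=len(centres)).astype(float)
    return counts / max(counts.sum(), 1.0)
```

Swapping ground-truth trajectories for tracker output at the input of this pipeline would isolate the contribution of the histogram and modelling stages from tracking errors.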
Furthermore, testing our method on other datasets is a pending task. Nevertheless, most existing datasets are near-field, so ours seems more appropriate for group and crowd analysis. Most of them also provide video footage from a single view, and the types of situations present in our dataset are not always present in other publicly available datasets. For instance, the UMN dataset (http://mha.cs.umn.edu/proj_events.shtml, accessed Nov. 2013) would be the best candidate on which to try our algorithm next. However, it has two main drawbacks: first, it includes scenes where people wander, which is not considered 'normal' behaviour as defined and used in this paper; second, it is single view, so planned extensions of our work to multiple views could not be tested on it.
Finally, as just mentioned, future work also includes the use of video footage from multiple views. To this end, information from all the available cameras (four in our dataset) is to be combined and tested either by a fusion method at the feature level or by merging by means of a model-level algorithm. For the former, synchronised video is to be used, so that features are obtained from the various video sources simultaneously; the features are then fused by concatenating them into a longer feature vector. Dimensionality reduction techniques might be needed, since the fused features are much longer (i.e. four-fold) than the originals. For the latter, on the other hand, different models are learnt for the different cameras, and fusion is performed afterwards using a voting mechanism, which in turn could assign weights to the different views (e.g. by using an additional neural network layer).
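The feature-level branch could look like the following sketch: concatenation of synchronised per-camera descriptors followed by SVD-based PCA to handle the four-fold dimensionality growth. Function names and sizes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fuse_features(per_camera):
    """Feature-level fusion: concatenate synchronised per-camera
    descriptors (one row per time window) into one longer feature."""
    return np.concatenate(per_camera, axis=1)

def pca_reduce(X, n_components):
    """Project fused features onto their top principal components
    (PCA via SVD of the centred data matrix)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

# Four cameras, ten synchronised time windows, 6-D descriptor each
cams = [np.random.default_rng(c).normal(size=(10, 6)) for c in range(4)]
fused = fuse_features(cams)     # shape (10, 24): four-fold longer
reduced = pca_reduce(fused, 5)  # shape (10, 5)
```

The model-level alternative would instead train one model per camera and combine their per-clip decisions with a (possibly learnt, weighted) vote.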
ACKNOWLEDGEMENTS
This work has been supported by the European Commission's Seventh Framework Programme (FP7-SEC-2011-1) under grant agreement No. 285320 (PROACTIVE project).
REFERENCES
Ballan, L., Bertini, M., Del Bimbo, A., Seidenari, L., and
Serra, G. (2011). Event detection and recognition for
semantic annotation of video. Multimedia Tools and
Applications, 51(1):279–302.
Bobick, A. and Davis, J. (2001). The recognition of human
movement using temporal templates. Pattern Analy-
sis and Machine Intelligence, IEEE Transactions on,
23(3):257–267.
Candamo, J., Shreve, M., Goldgof, D. B., Sapper, D. B., and Kasturi, R. (2010). Understanding transit scenes: A survey on human behavior-recognition algorithms. IEEE Transactions on Intelligent Transportation Systems, 11(1):206–224.
Davies, A., Yin, J., and Velastin, S. (1995). Crowd monitor-
ing using image processing. Electronics & Communi-
cation Engineering Journal, 7(1):34–47.
Dee, H. M. and Caplier, A. (2010). Crowd behaviour anal-
ysis using histograms of motion direction. In Im-
age Processing (ICIP), 2010 17th IEEE International
Conference on, pages 1545–1548. IEEE.
Garate, C., Bilinsky, P., and Bremond, F. (2009). Crowd event recognition using HOG tracker. In Performance Evaluation of Tracking and Surveillance (PETS-Winter), 2009 Twelfth IEEE International Workshop on, pages 1–6.
VISAPP 2014 - International Conference on Computer Vision Theory and Applications