As discussed in Section 3.3, this ground truth is actually noisy, as the human subjects do not agree exactly on the interesting streams. We evaluate the different algorithms together
with the different normalization methods. Our algorithms are compared against a theoretical random algorithm that would rank the views randomly (for which normalization is not meaningful).
As illustrated, for example, in Figure 4, the result figures show the rank distributions of the various interestingness algorithms, grouped by normalization method. In a given graph, there is one line per considered method. For a
given method, the colored bars show the proportion of
clicked views ranked 1st, 2nd, 3rd, and 4th by the algorithms. The red bar corresponds to $p_{\text{algo}}(1)$, and so on. More weight on the first (leftmost) ranks is better.
We see, for example, that random (which is repeated on each graph) has a probability of 0.25 for each of the ranks. Methods that score above 0.25 for the first rank can thus be considered better than random.
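As a minimal sketch of how this rank distribution can be computed, assuming the algorithm produces one interestingness score per stream at each clicked instant (array names, shapes, and the `rank_distribution` helper are ours, not the paper's):

```python
import numpy as np

def rank_distribution(scores, clicked, n_streams=4):
    """Proportion of clicked streams ranked 1st, 2nd, ... by an algorithm.

    scores  : (n_events, n_streams) interestingness score of each stream
              at each clicked instant (higher means more interesting).
    clicked : (n_events,) index of the stream the subject clicked.
    Returns an array whose k-th entry is p_algo(k+1).
    """
    order = np.argsort(-scores, axis=1)        # streams sorted best-first
    ranks = np.argsort(order, axis=1) + 1      # 1-based rank of each stream
    clicked_rank = ranks[np.arange(len(clicked)), clicked]
    return np.array([(clicked_rank == k).mean()
                     for k in range(1, n_streams + 1)])

# A random ranking yields p_algo(k) close to 0.25 for each of the four ranks,
# so any method with p_algo(1) > 0.25 beats the random baseline.
```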
The normalizations based on percentiles below 95% systematically give lower performance than the one at 95%. To improve readability, we thus omit them from the shown graphs. The results show that the considered algorithms, both motion-based and PLSM-based, perform better than the random guess. Even though the raw annotations were found to be relatively noisy, the gain over the random algorithm is substantial.
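The percentile-based normalization can be sketched roughly as follows; dividing each stream's score by a high percentile of its own history is only our plausible reading of it, and the `percentile_normalize` helper and its default value are illustrative, not the paper's exact definition:

```python
import numpy as np

def percentile_normalize(signal, q=95.0):
    """Rescale a stream's interestingness signal by its q-th percentile.

    Sketch only: dividing by a high percentile (e.g. the 95th) puts streams
    with different score ranges on a comparable scale while remaining robust
    to a few extreme values; lower percentiles normalize more aggressively.
    """
    scale = np.percentile(signal, q)
    return signal / scale if scale > 0 else signal
```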
The results also show that the simple motion-based criterion is almost as good as the more elaborate PLSM algorithm. The main observation is that, in this metro setting, human social attention, encoded in the form of our annotations, is strongly linked to the amount of motion present in the video. Both kinds of abnormality measures derived from PLSM also seem to follow the behavior of “Motion”. However, in other contexts such as traffic videos, where motion appears in the form of more regular patterns, the capacity of algorithms such as PLSM to filter out normal activities would be beneficial for detecting anomalies, compared to relying on motion only.
In this setup, we observe that normalization has a small negative effect on the results. One interpretation is that human attention is directed toward the absolute amount of motion rather than toward a motion amount relative to the normal level. Noise in the annotations is also part of the reason, as shown in the following section.
5.2 Results on Filtered Dataset using
the Gaussian Smoother
We also evaluate the performance of the different al-
gorithms on the ground truth after smoothing by a
Gaussian kernel, as explained in Section 3.3. We var-
ied the threshold on the smoothed index of interest so
as to keep around 2000 evaluation points. Figure 5 shows the corresponding results.
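A rough illustration of this filtering step is given below; the kernel width and threshold are placeholders (the paper tunes the threshold to keep around 2000 evaluation points), and the function name is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def filtered_evaluation_points(clicks, sigma=5.0, threshold=0.3):
    """Smooth a per-frame click signal and keep the confident instants.

    clicks    : (n_frames,) raw annotation signal, e.g. 1 where a subject
                clicked the stream and 0 elsewhere.
    sigma     : width of the Gaussian smoother, in frames (illustrative).
    threshold : kept instants are those whose smoothed index of interest
                exceeds this value (tuned in the paper to keep ~2000 points).
    """
    smoothed = gaussian_filter1d(clicks.astype(float), sigma=sigma)
    return np.nonzero(smoothed > threshold)[0]
```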
Globally, the methods exhibit better accuracy on
the cleaned ground truth. We also observe a more
marked effect of method mixing: results are notably lower than with each individual method. When mixing methods, normalization also helps. However, as observed in the raw case, normalization degrades the results for individual methods. This again indicates that attention is directed toward absolute motion.
Overall, the results with the smoothed annotations
consolidate and complete what has been observed
with the raw annotations:
• non-normalized versions are preferable for both the
motion-based measure and the PLSM-based one,
• when mixing methods of different scales, as expected, results degrade and normalization helps,
• motion-based and PLSM-based selection both
significantly outperform the random guess algorithm, especially on the cleaned ground truth,
• on the raw ground truth, the motion-based selec-
tion works best,
• motion-based methods and PLSM-based methods
perform comparably,
• we reach a good ranking accuracy of up to 0.6, meaning that 60% of the time the stream clicked by the human subject is ranked first among the four streams; 85% of the time, the clicked stream is among the first two in the ranking provided by the algorithm (see the short worked example after this list).
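As a short worked example (the proportions below other than the reported top-1 accuracy are illustrative, not measured values), the top-two figure is simply the cumulative sum of the first two rank proportions:

```python
# Illustrative rank distribution consistent with the reported accuracies.
p_algo = [0.60, 0.25, 0.10, 0.05]   # proportion of clicks ranked 1st..4th
top1 = p_algo[0]                     # 0.60: clicked stream ranked first
top2 = p_algo[0] + p_algo[1]         # 0.85: clicked stream among the first two
```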
6 CONCLUSIONS AND FUTURE
WORK
We have introduced the stream selection task and pro-
posed an evaluation protocol for it. The proposed
evaluation is based on a social attention experiment
where human subjects were shown four video streams
at once and were asked to mark a video at any time
they spotted something interesting in it. We tested
various stream selection algorithms to rate the inter-
estingness of the streams at any instant. The video
instants marked by the subjects are used to evaluate
how well the marked video has been ranked by a se-
lection algorithm.
Our evaluations have shown that, considering all
the raw annotations produced by the human sub-
jects, all tested approaches were better than random but