As discussed in Section 3.3, this ground truth is actually noisy, as the human subjects do not agree exactly on the interesting streams. We evaluate the different algorithms together
with the different normalization methods. Our algorithms are compared against a theoretical random algorithm that would rank the views randomly (for which normalization is not meaningful).
As illustrated, for example, in Figure 4, the result figures show the rank distributions of the various interestingness algorithms, grouped by normalization method. In a given graph, there is one line per considered method. For a
given method, the colored bars show the proportion of
clicked views ranked 1st, 2nd, 3rd, and 4th by the algorithms. The red bar corresponds to $p_{\text{algo}}(1)$, and so on. More weight on the first (leftmost) ranks is better.
We see, for example, that random (which is repeated on each graph) has a probability of 0.25 for each of the ranks. Methods that score above 0.25 for the first rank can thus be considered better than random.
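As a minimal sketch of how this rank distribution can be computed, assuming the algorithm produces one interestingness score per stream at each clicked instant (array names, shapes, and the `rank_distribution` helper are ours, not the paper's):

```python
import numpy as np

def rank_distribution(scores, clicked, n_streams=4):
    """Proportion of clicked streams ranked 1st, 2nd, ... by an algorithm.

    scores  : (n_events, n_streams) interestingness score of each stream
              at each clicked instant (higher means more interesting).
    clicked : (n_events,) index of the stream the subject clicked.
    Returns an array whose k-th entry is p_algo(k+1).
    """
    order = np.argsort(-scores, axis=1)        # streams sorted best-first
    ranks = np.argsort(order, axis=1) + 1      # 1-based rank of each stream
    clicked_rank = ranks[np.arange(len(clicked)), clicked]
    return np.array([(clicked_rank == k).mean()
                     for k in range(1, n_streams + 1)])

# A random ranking yields p_algo(k) close to 0.25 for each of the four ranks,
# so any method with p_algo(1) > 0.25 beats the random baseline.
```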
The normalizations based on percentiles below 95% systematically give lower performance than the one at 95%. To improve readability, we thus omit them from the shown graphs. The results show that the considered algorithms, both motion-based and PLSM-based, perform better than the random guess. Even though the raw annotations were found to be relatively noisy, the gain over the random algorithm is substantial.
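The percentile-based normalization can be sketched roughly as follows; dividing each stream's score by a high percentile of its own history is only our plausible reading of it, and the `percentile_normalize` helper and its default value are illustrative, not the paper's exact definition:

```python
import numpy as np

def percentile_normalize(signal, q=95.0):
    """Rescale a stream's interestingness signal by its q-th percentile.

    Sketch only: dividing by a high percentile (e.g. the 95th) puts streams
    with different score ranges on a comparable scale while remaining robust
    to a few extreme values; lower percentiles normalize more aggressively.
    """
    scale = np.percentile(signal, q)
    return signal / scale if scale > 0 else signal
```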
The results also show that the simple motion-based criterion is almost as good as the more elaborate PLSM algorithm. The main observation is that, in this metro setting, human social attention, encoded in the form of our annotations, is strongly linked to the amount of motion present in the video. Both kinds of abnormality measures derived from PLSM also seem to follow the behavior of “Motion”. However, in other contexts such as traffic videos, where motion appears in the form of more regular patterns, the capacity of algorithms such as PLSM to filter out normal activities would be beneficial for detecting anomalies, compared to relying on motion only.
In this setup, we observe that normalization has a small negative effect on the results. One interpretation is that human attention is directed toward the absolute amount of motion rather than toward a motion amount relative to the normal level. Noise in the annotations is also part of the reason, as shown in the following section.
5.2 Results on Filtered Dataset using
the Gaussian Smoother
We also evaluate the performance of the different al-
gorithms on the ground truth after smoothing by a
Gaussian kernel, as explained in Section 3.3. We var-
ied the threshold on the smoothed index of interest so
as to keep around 2000 evaluation points. Figure 5 shows the corresponding results.
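A rough illustration of this filtering step is given below; the kernel width and threshold are placeholders (the paper tunes the threshold to keep around 2000 evaluation points), and the function name is ours:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def filtered_evaluation_points(clicks, sigma=5.0, threshold=0.3):
    """Smooth a per-frame click signal and keep the confident instants.

    clicks    : (n_frames,) raw annotation signal, e.g. 1 where a subject
                clicked the stream and 0 elsewhere.
    sigma     : width of the Gaussian smoother, in frames (illustrative).
    threshold : kept instants are those whose smoothed index of interest
                exceeds this value (tuned in the paper to keep ~2000 points).
    """
    smoothed = gaussian_filter1d(clicks.astype(float), sigma=sigma)
    return np.nonzero(smoothed > threshold)[0]
```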
Globally, the methods exhibit better accuracy on
the cleaned ground truth. We also observe a more
marked effect of method mixing: results are notably lower than with each individual method. When mixing methods, normalization also helps. However, as observed in the raw case, normalization degrades the results for individual methods. This again indicates that attention is directed toward absolute motion.
Overall, the results with the smoothed annotations
consolidate and complete what has been observed
with the raw annotations:
• non-normalized versions are preferable for both the
motion-based measure and the PLSM-based one,
• when mixing methods of different scales, as expected, results degrade and normalization helps,
• motion-based and PLSM-based selection both
significantly outperform the random guess algorithm, especially on the cleaned ground truth,
• on the raw ground truth, the motion-based selec-
tion works best,
• motion-based methods and PLSM-based methods
perform comparably,
• we reach a good ranking accuracy of up to 0.6, meaning that 60% of the time the stream clicked by the human subject is ranked first among the four streams; 85% of the time, the clicked stream is among the first two in the ranking provided by the algorithm (see the short worked example after this list).
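As a short worked example (the proportions below other than the reported top-1 accuracy are illustrative, not measured values), the top-two figure is simply the cumulative sum of the first two rank proportions:

```python
# Illustrative rank distribution consistent with the reported accuracies.
p_algo = [0.60, 0.25, 0.10, 0.05]   # proportion of clicks ranked 1st..4th
top1 = p_algo[0]                     # 0.60: clicked stream ranked first
top2 = p_algo[0] + p_algo[1]         # 0.85: clicked stream among the first two
```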
6 CONCLUSIONS AND FUTURE
WORK
We have introduced the stream selection task and pro-
posed an evaluation protocol for it. The proposed
evaluation is based on a social attention experiment
where human subjects were shown four video streams
at once and were asked to mark a video at any time
they spotted something interesting in it. We tested
various stream selection algorithms to rate the inter-
estingness of the streams at any instant. The video
instants marked by the subjects are used to evaluate
how well the marked video has been ranked by a se-
lection algorithm.
Our evaluations have shown that, considering all
the raw annotations produced by the human sub-
jects, all tested approaches were better than random but