A POMDP-based Camera Selection Method
Li Qian¹,², Sun Zheng-Xing¹, and Chen Song-Le¹
¹State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210093, China
²Institute of Meteorology, PLA University of Science and Technology, Nanjing 211101, China
Keywords: Camera Selection, POMDP, Video Analysis, Multi-camera System.
Abstract: This paper addresses the problem of camera selection in multi-camera systems and proposes a novel
selection method based on a partially observable Markov decision process (POMDP). An innovative
evaluation function identifies the most informative of several multi-view video streams by extracting and
scoring features related to global motion, the attributes of moving objects, and special events such as the
appearance of new objects. Experiments show that the proposed visual evaluation criteria successfully
measure changes in scenes and that our camera selection method effectively reduces camera switching.
1 INTRODUCTION
Multiple cameras are widely used for security surveillance, human–computer interaction, navigation, and positioning. However, they introduce issues related to camera deployment and control, real-time fusion of high-resolution, high-frame-rate video streams, and camera selection and coordination (Soro and Heinzelman, 2009). Camera selection, which involves selecting one or more cameras from a group of cameras to extract essential information, is a particularly challenging task in multi-camera systems.
A number of previous studies have investigated issues related to camera selection. Li and Bhanu (Li and Bhanu, 2009) proposed a game-theoretic approach to camera hand-off, with the global utility, camera utility, and person utility determined by user-supplied criteria such as the size, position, and view of the individual being tracked. Daniyal, Taj, and Cavallaro (Daniyal et al., 2010) proposed a Dynamic Bayesian Network approach that uses object- and frame-level features. Bimbo and Pernici (Bimbo and Pernici, 2006) selected optimal parameters for the active camera on the basis of object appearance and predicted motion, framing the task as a traveling salesman problem. Tessens et al. (Tessens et al., 2008) used face detection and the calculated spatial position of the target to select a primary view and a number of additional views. All the methods mentioned above use low-level image features to evaluate visual information; however, high-level information in video streams, such as local salient movement details and specific events, is more informative.
In this paper, we present a novel method for camera selection based on a partially observable Markov decision process (POMDP) and use the belief states of the model to represent noisy visual information. By considering the current state and anticipated transition trends together with the cost generated by camera switching, the visual jitter that arises from frequent switching can be effectively reduced. We also present an evaluation function for visual information, designed to reflect the richness of information in each view by extracting global motion, the properties of moving objects in the scene, and specific events.
2 DYNAMIC CAMERA SELECTION BASED ON A POMDP MODEL
The camera selection problem can be described as follows. A multi-camera system has $N$ cameras $(C_1, C_2, \ldots, C_N)$ with partially or completely overlapped FOVs, and one node is designated as the central controller, which performs scheduling according to a selection policy that is computed offline. In our
method, the central controller selects only one camera $C^*$ online as the optimal camera at fixed time intervals of duration $\Delta t$. At each time step $t$, visual features indicating global motion, the properties of objects, and specific events in the view of each camera are extracted and scored (see Section 3). The scores are then sent to the central controller, which makes dynamic selection decisions on the basis of the current and previous camera view scores. Although the camera selection problem can usually be modeled as a finite-state Markov decision process (MDP), an observed state may contain errors caused by factors such as illumination, occlusion, and camera shake, and thus may not reflect the actual state; we must therefore implement sequential decision making based on partially observable states. Owing to this uncertainty, we model the dynamic process as a POMDP, an extended version of an MDP.
2.1 Definition of POMDPs
A POMDP can be formally defined as a 6-tuple $\langle S, A, \Omega, T, O, R \rangle$, where $S$ is a finite set of all possible underlying states; $A$ is a finite set of actions, i.e., the available control choices at each time instant; $\Omega$ is a finite set of all possible observations that the process can provide; $T$ is a state transition function $T: S \times A \times S \to [0,1]$ that encodes the uncertainty about the evolution of the states of the process; $O$ is an observation function $O: S \times A \times \Omega \to [0,1]$ that relates the process outputs (camera observations) to the true underlying state of the process; and $R$ is an immediate reward function $R: S \times A \to \mathbb{R}$ that assigns real-valued rewards to the actions that may be performed in each of the underlying process states.
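For concreteness, the tuple can be carried around in code as a set of callable probability and reward functions. The following C++ sketch is purely illustrative: the type and field names (PomdpModel, num_states, and so on) are ours, not the paper's implementation.

```cpp
#include <functional>

// Minimal sketch of the POMDP tuple <S, A, Omega, T, O, R>.
struct PomdpModel {
    int num_states;        // |S|
    int num_actions;       // |A|
    int num_observations;  // |Omega|
    // T(s, a, s'): probability of moving to s' after action a in state s.
    std::function<double(int, int, int)> transition;
    // O(s', a, o): probability of observing o after action a lands in s'.
    std::function<double(int, int, int)> observation;
    // R(s, a): immediate reward for taking action a in state s.
    std::function<double(int, int)> reward;
};
```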
2.2 Selection Policy
On the basis of our description of the camera selection problem, we formulate the POMDP as follows.

1) System state vector

The system state vector consists of the currently selected result and the visual information scores. At time step $t$, the system state is represented as $S_t = [c_t^i, s_t^1, s_t^2, \ldots, s_t^N]$, where $c_t^i$ indicates that camera $i$ is selected as the best camera at time $t$, and $s_t^k, k \in \{1, 2, \ldots, N\}$, is the actual visual information score for camera $k$. The score values $s_t^k$ are normalized to $[0,1]$ and uniformly discretized with $m$ quantization levels to produce the range $s_t^k \in \{0, 1, \ldots, m-1\}$.
2) Actions

An action is a vector represented as $a_t = [a_t^1, \ldots, a_t^N]$, where, at time step $t$, $a_t^i = 1$ if the $i$th camera is selected; otherwise, $a_t^i = 0$.
3) Observation state

The observation state is a collection of observations from all cameras and is defined as the vector $O_t = [c_t^{i\prime}, o_t^1, o_t^2, \ldots, o_t^N]$, where $c_t^{i\prime}$ denotes the camera $i$ that is selected at time $t$. Because this component carries no error, it is the same as the component $c_t^i$ of the system state. Each observation $o_t^k, k \in \{1, 2, \ldots, N\}$, is the visual score computed by our method for camera $k$ at time step $t$, obtained by extracting visual features from the video stream and scoring them. The number of observations is equal to the number of system states.
4) State transitions

The state transition probability $p_{s'|as}$ describes the differences in the scene between the views taken at adjacent time steps. Because the selection action does not affect the visual measures or the state transitions, $p_{s'|as} = p_{s'|s}$, and the state transition is based on the visual score components $s_t^k$. We assume that the visual scores for the cameras are independent, i.e.,

$$\Pr(S_{t+1} = (s_{t+1}^1, s_{t+1}^2, \ldots, s_{t+1}^N) \mid S_t = (s_t^1, s_t^2, \ldots, s_t^N)) = \Pr(s_{t+1}^1 \mid s_t^1) \cdot \Pr(s_{t+1}^2 \mid s_t^2) \cdots \Pr(s_{t+1}^N \mid s_t^N). \quad (1)$$
For each discrete state component $s_t^k \in \{0, 1, \ldots, m-1\}$, the probability of transition to a neighboring state is higher than that to more distant states. Thus, we set the transition probabilities on the basis of the distances between states as follows:

$$\Pr(s_{t+1}^i = u \mid s_t^i = v) = \begin{cases} \dfrac{1 - \left(\frac{u - v}{m}\right)^2}{\sum_{r=0}^{m-1}\left(1 - \left(\frac{r - v}{m}\right)^2\right)}, & u, v \in \{0, 1, \ldots, m-1\}, \\[2ex] 0, & u \notin \{0, 1, \ldots, m-1\} \text{ or } v \notin \{0, 1, \ldots, m-1\}. \end{cases} \quad (2)$$

This equation defines the transition probabilities between all states in the state space.
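As a concrete illustration, the following C++ sketch computes the transition probability of Eq. (2) as reconstructed above; the function name and the looping strategy are ours, not the paper's.

```cpp
// Distance-based transition probability between quantized score levels:
// levels nearer to the current level v receive larger probability mass.
// m is the number of quantization levels.
double transition_prob(int u, int v, int m) {
    if (u < 0 || u >= m || v < 0 || v >= m) return 0.0;
    auto weight = [m](int a, int b) {
        double d = static_cast<double>(a - b) / m;
        return 1.0 - d * d;   // decays with squared distance
    };
    double norm = 0.0;        // normalizing sum over all m levels
    for (int r = 0; r < m; ++r) norm += weight(r, v);
    return weight(u, v) / norm;
}
```

Because Eq. (3) below has the same form, the same routine can serve for the observation probabilities as well.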
5) Observation function

The observation function $p_{o'|as'}$ indicates the likelihood of the observation state being $o'$ if the system reaches state $s'$ after action $a$. Because the selection actions do not affect the camera observations or the computed scores, we set
APOMDP-basedCameraSelectionMethod
747
$p_{o'|as'} = p_{o'|s'}$. Also, we assume that the states in the observation state space for different cameras are independent. The observation probability is then defined analogously to Eq. (2):

$$\Pr(o_t^i = u \mid s_t^i = v) = \begin{cases} \dfrac{1 - \left(\frac{u - v}{m}\right)^2}{\sum_{r=0}^{m-1}\left(1 - \left(\frac{r - v}{m}\right)^2\right)}, & u, v \in \{0, 1, \ldots, m-1\}, \\[2ex] 0, & u \notin \{0, 1, \ldots, m-1\} \text{ or } v \notin \{0, 1, \ldots, m-1\}. \end{cases} \quad (3)$$
6) Immediate rewards

For each action, we define an immediate reward or cost to measure the degree of optimization that would result from that action. We use the visual score of the selected camera as a positive reward and the camera switching cost $c_{cost}$, which represents the visual jitter caused by frequent switching, as a negative reward. Therefore, the immediate reward after camera selection is defined as

$$R(S_t, a_t) = \alpha\, s_t^i + (1 - \alpha)\, c_{cost}\, 1(a_t^i), \quad (4)$$

where $\alpha \in [0,1]$ is the weight coefficient between the positive and negative rewards, and the indicator function $1(a_t^i) = 1$ if the camera selected at time step $t$ was also selected at time step $t-1$; otherwise, $1(a_t^i) = 0$.
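A minimal C++ sketch of Eq. (4) as reconstructed above follows. Note that under this reading the indicator term rewards keeping the current camera, which is equivalent to penalizing a switch; the function and parameter names are illustrative.

```cpp
// Immediate reward of Eq. (4): a convex combination of the selected
// camera's visual score and a continuity term weighted by the switching
// cost c_cost (alpha and c_cost are the paper's parameters).
double immediate_reward(double score_selected,   // s_t^i of the chosen camera
                        bool same_as_previous,   // the indicator 1(a_t^i)
                        double alpha, double c_cost) {
    double continuity = same_as_previous ? 1.0 : 0.0;
    return alpha * score_selected + (1.0 - alpha) * c_cost * continuity;
}
```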
Given the belief state $b(s)$ for the camera system state $s$ at time step $t$, the scheduling agent attempts to maximize the total reward $V^*(b(s))$ by selecting the best camera $a_t^k$ on the basis of the optimal policy $\pi^*$. This condition is represented as follows:

$$V^*(b(s)) = \max_{a_t^k \in A} \Big\{ R(b(s), a_t^k) + \gamma \sum_{s' \in S} p(s, a_t^k, s')\, V^*(b(s')) \Big\}, \quad (5)$$
where $\gamma \in [0,1]$ is a discount factor that controls the future impact of rewards so that the effect of a reward decays exponentially with elapsed time. If $o$ is the observation after action $a$ has been executed, the next belief state $b_a^o(s')$ is calculated on the basis of Bayes' theorem as follows:

$$b_a^o(s') = \frac{p_{o|s'a}}{p_{o|ab}} \sum_{s \in S} p(s' \mid s, a)\, b(s), \quad (6)$$

where $p_{o|ab}$ is a normalizing constant defined as $p_{o|ab} = \sum_{s' \in S} p_{o|s'a} \sum_{s \in S} p_{s'|sa}\, b(s)$.
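The belief update of Eq. (6) is straightforward to implement; the following C++ sketch shows one possible form. The parameter names trans and obs stand for the tabulated probabilities $p_{s'|sa}$ and $p_{o|s'a}$ and are our placeholders, not the paper's identifiers.

```cpp
#include <functional>
#include <vector>

// Bayesian belief update: b'(s') proportional to
// p(o|s',a) * sum_s p(s'|s,a) b(s), normalized by p_{o|ab}.
std::vector<double> update_belief(
        const std::vector<double>& belief, int action, int observed,
        const std::function<double(int, int, int)>& trans,   // p(s'|s,a)
        const std::function<double(int, int, int)>& obs) {   // p(o|s',a)
    const int n = static_cast<int>(belief.size());
    std::vector<double> next(n, 0.0);
    double norm = 0.0;                       // the constant p_{o|ab}
    for (int sp = 0; sp < n; ++sp) {
        double predicted = 0.0;              // sum_s p(s'|s,a) b(s)
        for (int s = 0; s < n; ++s)
            predicted += trans(s, action, sp) * belief[s];
        next[sp] = obs(sp, action, observed) * predicted;
        norm += next[sp];
    }
    if (norm > 0.0)
        for (double& p : next) p /= norm;    // normalize to a distribution
    return next;
}
```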
Formula (5) can usually be computed iteratively using dynamic programming; however, the computational complexity increases exponentially with the scale of the problem. Therefore, a direct solution of our POMDP is infeasible because the problem is intractable, and thus we use the Perseus method (Spaan and Vlassis, 2005), which is a point-based approximation. We randomly sample a number of belief points $b$ as a belief set and compute the reward values for this belief set. We save the results as a set of value vectors $\{\alpha_1, \alpha_2, \ldots, \alpha_n\}$, in which each vector $\alpha_i$ is associated with a selected action $a_i$, so that each vector contains the same number of components as the state space. When the central controller selects the best camera online, it transforms the observed states into belief states and makes its decision on the basis of the following relation:

$$a = \arg\max_{a_i \in A} (b(s') \cdot \alpha_i). \quad (7)$$

The value vector $\alpha_i$ whose inner product with the current belief state is maximal is selected, and the corresponding action $a_i$ is chosen as the best camera.
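The online decision rule of Eq. (7) reduces to an argmax over inner products. A C++ sketch, with illustrative names, could look as follows:

```cpp
#include <cstddef>
#include <vector>

// Pick the action whose alpha vector maximizes the inner product with
// the current belief (Eq. (7)); actions[i] is the action tied to
// alpha_vectors[i], both produced offline by the Perseus solver.
int select_camera(const std::vector<double>& belief,
                  const std::vector<std::vector<double>>& alpha_vectors,
                  const std::vector<int>& actions) {
    int best_action = actions.front();
    double best_value = -1e300;
    for (std::size_t i = 0; i < alpha_vectors.size(); ++i) {
        double value = 0.0;                  // inner product b(s') . alpha_i
        for (std::size_t s = 0; s < belief.size(); ++s)
            value += belief[s] * alpha_vectors[i][s];
        if (value > best_value) {
            best_value = value;
            best_action = actions[i];
        }
    }
    return best_action;
}
```

Because the vectors are precomputed offline, this online step costs only $O(n \cdot |S|)$ per decision.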
3 VISUAL INFORMATION MEASURE
In this section, we propose a measure that evaluates the quality of the image captured by a camera by extracting features indicating motion, the properties of objects, and special events, and expressing these features as a motion score $S_m^i$, an object score $S_{obs}^i$, and an event score $S_e^i$, respectively. The final visual score is then calculated as follows:

$$S^i = w_1 S_m^i + w_2 S_{obs}^i + w_3 S_e^i, \quad (8)$$

where $w_1$, $w_2$, and $w_3$ are weight coefficients such that $\sum_{i=1}^{3} w_i = 1$. To simplify the exposition that follows, we denote the motion, object, and event scores without the camera superscript as $S_m$, $S_{obs}$, and $S_e$, respectively.
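In code, the combination of Eq. (8) is a plain weighted sum; a one-function C++ sketch (names ours) is:

```cpp
// Overall visual score of Eq. (8): a weighted sum of the motion, object,
// and event scores; the weights are expected to sum to 1.
double visual_score(double s_m, double s_obs, double s_e,
                    double w1, double w2, double w3) {
    return w1 * s_m + w2 * s_obs + w3 * s_e;
}
```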
3.1 Global Motion
The degree of motion in a video stream reflects the ability of the camera to capture real-world changes. We adopt the method presented in (Zhang et al., 2011) for detecting moving objects and mitigating the effects of illumination and shadows. We then determine the degree of global motion in a video frame by calculating the ratio of the foreground area
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
748
to the area of the entire image. We assume that this ratio increases significantly when a new object enters the scene or when objects are close to the camera. However, a ratio exceeding a certain threshold implies that noise has been introduced by factors such as illumination. Thus, we score the global motion contained in a binary motion image as follows:

$$S_m = \begin{cases} \dfrac{r}{\delta}, & r \le \delta, \\[1ex] \dfrac{1 - r}{1 - \delta}, & r > \delta, \end{cases} \quad (9)$$

where $r$ is the ratio of the area of the moving objects to the area of the entire image, and $\delta$ is the best ratio required by applications.
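Under the reconstruction of Eq. (9) above, the score rises linearly to 1 at the preferred foreground ratio $\delta$ and falls off linearly beyond it, discounting noise-dominated frames. A minimal C++ sketch (names ours):

```cpp
// Global motion score of Eq. (9): peaks at the preferred foreground
// ratio delta and decreases on either side of it.
double motion_score(double r, double delta) {   // r = foreground ratio
    if (r <= delta) return r / delta;
    return (1.0 - r) / (1.0 - delta);
}
```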
3.2 Object Properties
Individuals and special objects are the most
attractive elements in video surveillance
applications. To properly measure the properties of
objects in video streams, we focus on motion
saliency for individual objects and the degree to
which the objects in a scene occlude one another. To
detect the objects in a scene and appropriately
separate objects that are partially overlapped, we use
the method proposed by Hu (Hu et al., 2006) for
extracting bounding boxes for objects even when
they overlap. We then use the bounding boxes to
assign scores on the basis of local motion saliency
and occlusion between two overlapped objects.
Finally, we calculate an overall score for the properties of the objects in the scene as a weighted sum of the two scores:

$$S_{obs} = \beta\, \frac{1}{|OBS|} \sum_{k \in OBS} S_l^k + (1 - \beta)\, S_{oc}, \quad (10)$$

where $S_l^k$ is the local motion saliency score for object $k$, $OBS$ is the set of objects in the scene (with $|OBS|$ its size), $S_{oc}$ is the occlusion score, and $\beta$ is a weight factor.
1) Local motion

We use the Harris 3D space-time interest point method (Laptev, 2005) to detect salient motion details in the bounding boxes of objects and set the saliency measure $S_l^k$ of the $k$th object to be the normalized number of 3D interest points detected in its bounding box. Although the interest point method is computationally expensive, we can control the computational cost by limiting the search to a small region. Our experiments show that this method can be effectively applied in real-time video surveillance applications.
2) Object occlusion

One of the most serious issues in video surveillance, tracking, and other applications is occlusion between objects in a scene. In multi-camera systems, selecting the camera with the least occlusion is an effective solution. For this purpose, we measure the degree of occlusion between the objects in a scene on the basis of the intersections of the bounding boxes of different targets and use this measure to assign a score that reflects the occlusion: the larger the intersection areas, the lower the score. Therefore, we define the occlusion score as follows:

$$S_{oc} = 1 - \frac{1}{|OBS|} \sum_{i=1}^{|OBS|} \sum_{j=i+1}^{|OBS|} \frac{\#(Rc_i \cap Rc_j)}{\min(\#Rc_i, \#Rc_j)}, \quad (11)$$

where $Rc_i$ denotes the bounding box of the $i$th object and $\#$ denotes the area of a box.
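The following C++ sketch implements Eq. (11) as reconstructed above; the Rect type and function names are illustrative, not the paper's data structures.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Rect { double x0, y0, x1, y1; };  // axis-aligned bounding box

static double area(const Rect& r) {
    return std::max(0.0, r.x1 - r.x0) * std::max(0.0, r.y1 - r.y0);
}

static double intersection_area(const Rect& a, const Rect& b) {
    Rect i{std::max(a.x0, b.x0), std::max(a.y0, b.y0),
           std::min(a.x1, b.x1), std::min(a.y1, b.y1)};
    return area(i);   // clamps to 0 when the boxes do not overlap
}

// Occlusion score of Eq. (11): 1 minus the average normalized pairwise
// overlap of the object bounding boxes.
double occlusion_score(const std::vector<Rect>& boxes) {
    const std::size_t n = boxes.size();
    if (n < 2) return 1.0;                // no pair can occlude
    double penalty = 0.0;
    for (std::size_t i = 0; i + 1 < n; ++i)
        for (std::size_t j = i + 1; j < n; ++j)
            penalty += intersection_area(boxes[i], boxes[j]) /
                       std::min(area(boxes[i]), area(boxes[j]));
    return 1.0 - penalty / static_cast<double>(n);
}
```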
3.3 Event Detection
It is necessary to detect events of interest in videos. In this paper, we focus on the entrance of new objects into a video frame. We determine an entrance region in the image, called the "inner region," either by predefining it or through training, and monitor the ratio $r_1$ of the area of motion in the inner region to the entire area of the inner region. When $r_1$ exceeds the threshold $Th_1$, an object has possibly entered the scene. To avoid false detections with the ratio $r_1$ due to the movement of individuals already present in the scene, we introduce an external region around the entrance, called the "outer region," and determine the ratio $r_2$ of the area of motion in the outer region to the area of the entire outer region. When $r_2$ exceeds a threshold $Th_2$, motion in the entrance is assumed to be the motion of an object that is already in the scene. Moreover, an entrance event is related to time. Therefore, we set

$$S_e = \begin{cases} e^{-T^2}, & r_1 > Th_1 \text{ and } r_2 \le Th_2, \\ 0, & \text{otherwise}, \end{cases} \quad (12)$$

where $T$ is the time that has elapsed since the condition for a new object entering the scene was satisfied, so that the event score is gradually lowered as time elapses.
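A C++ sketch of Eq. (12) as reconstructed above follows; the exponent form and the parameter names are our reading of the garbled original.

```cpp
#include <cmath>

// Entrance-event score of Eq. (12): active while the inner-region motion
// ratio exceeds Th1 and the outer-region ratio stays at or below Th2;
// the score decays as exp(-T^2) with the time T since the event began.
double event_score(double r1, double r2, double elapsed_T,
                   double th1, double th2) {
    if (r1 > th1 && r2 <= th2)
        return std::exp(-elapsed_T * elapsed_T);
    return 0.0;
}
```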
APOMDP-basedCameraSelectionMethod
749
4 EXPERIMENTS AND RESULTS
4.1 Experimental Setup
We conducted experiments on a personal computer to simulate the process of camera selection. We generated optimal policies offline using the Perseus point-based method implemented in C++, quantized the camera scores on an eight-level scale ($m = 8$), and saved the optimal policies as value vectors. We used publicly available datasets composed of sequences from POM (Fleuret et al., 2008) and HUMANEVA (Leonid et al., 2010) to evaluate the performance of our method. The POM dataset uses four cameras, while the HUMANEVA dataset uses seven cameras.
4.2 Visual Information Measurement Analysis
In this section, we evaluate the visual information scores obtained from our method for the POM Terrace1 sequence. We denote the images from the four cameras as C0, C1, C2, and C3. In the scene captured by frames 1 through 700 of the Terrace1 sequence, no individual is present at first; thereafter, one person enters the scene, followed by two people with no occlusion, then two people occluding each other, and eventually three people are present in the scene. After several experiments and analyses, we set the parameters $\delta$, $\beta$, $w_1$, $w_2$, and $w_3$ to the values given below. The global motion score curves with $\delta = 0.6$ shown in Figure 1(a) reflect the ability of each camera to capture the motion in the observed scene, as well as the number of objects and their distance from the camera. The numbers of salient points for the different views are shown in Figure 1(b). The object property scores with $\beta = 0.6$ in Figure 1(c) indicate that the cameras detect the same significant object properties when global motion plays a dominant role in the videos. Also, when cameras provide similar global information, the local motion saliencies differ owing to the different camera orientations and the relative directions of motion of the objects from these perspectives (for example, the scores for frames 160 to 200 in Figure 1(c)), and the occlusion between objects is appropriately detected. The curves in Figure 1(d) show the scores for measuring entrance events, which appropriately indicate that two people entered the scene during this period. Finally, Figure 1(e) displays the overall visual information score curves with the weights $w_1$, $w_2$, and $w_3$ set to 0.5, 0.2, and 0.3, respectively. Thus, the evaluation of visual information by our method in the analyzed video streams appropriately reflects changes in the scene as well as details and special events of interest to observers.
Figure 1: Visual information measurement results. (a) Global motion scores; (b) numbers of local salient points; (c) object property scores, $\beta = 0.6$; (d) entrance event scores for C2; (e) visual information scores.
4.3 Selection Results and Analysis
We compared our camera selection method (POMDP) with state-of-the-art camera selection methods based on a greedy criterion of maximum visual score (Max), game theory (Li and Bhanu, 2009) (LYM), and Dynamic Bayesian Networks (Daniyal et al., 2010) (DBN). Our experiments used the visual information scores presented in Section 3 as the measures for Max and POMDP, and the weight coefficient for the immediate reward was $\alpha = 0.8$. Figures 2(a) and 2(b) show the camera selection results for the video sequences POM Terrace1 and HUMANEVA Walk1; the best camera was selected for each frame of the
VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications
750
POM Terrace1 sequence on the basis of the camera quality measures described in Section 4.2, and the best camera was selected for the other sequences at intervals of 5 frames. The camera selection results indicate frequent camera switches under LYM and DBN owing to false selections, because errors are introduced in the motion detection process when two people are in the scene. Moreover, the LYM method was especially prone to frequent switching when the person utility approached zero. In contrast, our method effectively predicted the future trends of the visual information scores on the basis of their history, which reduced the number of false selections and resulted in smoother visual effects.
Figure 2: Selection results for the different methods. (a) POM Terrace1 selection results; (b) HUMANEVA Walk1 selection results.
5 CONCLUSIONS
Real-time selection of the most informative video
stream from a number of video streams has become
one of the key issues in visual analysis and
processing. The experimental results show that our
proposed POMDP-based method has a higher degree
of accuracy and is more stable than other methods.
In addition, we have proposed a visual information
score function for extracting and scoring visual
features associated with global motion, object
properties, and special events, and this function can
accurately reflect and describe the visual
information in a scene.
ACKNOWLEDGEMENTS
This work is supported by The National Natural Scie
nce Foundation of China (61272219, 61100110 and
61021062), The National High Technology Researc
h and Development Program of China (2007AA01Z
334), The Program for New Century Excellent Talen
ts in University of China (NCET-04-04605) and Th
e Science and technology program of Jiangsu Provin
ce (BE2010072, BE2011058 and BY2012190).
REFERENCES
Soro S., Heinzelman W. 2009. A survey of visual sensor networks. Advances in Multimedia.
Li Y., Bhanu B. 2009. Utility-based camera assignment in a video network: a game theoretic framework. IEEE Sensors Journal, 11(3).
Daniyal F., Taj M., Cavallaro A. 2010. Content and task-based view selection from multiple video streams. Multimedia Tools and Applications, 46.
Bimbo A. D., Pernici F. 2006. Towards on-line saccade planning for high-resolution image sensing. Pattern Recognition Letters, 27(15).
Tessens L., Morbee M., Lee H., Philips W., Aghajan H. 2008. Principal view determination for camera selection in distributed smart camera networks. In Second ACM/IEEE International Conference on Distributed Smart Cameras.
Littman M. 2009. A tutorial on partially observable Markov decision processes. Journal of Mathematical Psychology, 53(3).
Spaan M., Vlassis N. 2005. Perseus: randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24.
Zhang S., Ding H., Zhang W. 2011. Background modeling and object detection based on two-model. Journal of Computer Research and Development, 48(11).
Hu W. M., Hu M., Zhou X., et al. 2006. Principal axis-based correspondence between multiple cameras for people tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4).
Laptev I. 2005. On space-time interest points. International Journal of Computer Vision, 64.
Fleuret F., Berclaz J., Lengagne R., Fua P. 2008. Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2).
Leonid S., Alexandru O., Michael J. 2010. HUMANEVA: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. International Journal of Computer Vision.
APOMDP-basedCameraSelectionMethod
751