Motion Compensated Temporal Image Signature Approach
Haroon Qureshi and Markus Ludwig
Institut für Rundfunktechnik GmbH (IRT), D-80939 Munich, Germany
Keywords:
Saliency, Visual Attention Modeling, DCT Based Object Detection.
Abstract:
Detecting salient regions in the temporal domain is a challenging problem. The problem gets trickier when there is a moving object in a scene, and becomes even more complex in the presence of camera motion. Camera motion can influence saliency detection in two ways: on the one hand, it can provide important information about the location of a moving object; on the other hand, it can lead to a wrong estimation of salient regions. It is therefore important to handle this issue sensibly. This paper provides a solution by combining a saliency detection approach with a motion estimation approach. This extends the Temporal Image Signature (TIS) approach (Qureshi, 2013) to a more complex setting where not only object motion is considered but the influence of camera motion is also compensated.
1 INTRODUCTION
Motion of pixels in 2D images can be perceived as the result of the projection of object motion and camera motion in a 3D world coordinate system. If there is no camera motion, anything that moves is salient. In the presence of camera motion, however, it is important to identify only those salient regions in the images that correspond to moving objects. It is therefore of interest to separate the two motions from each other, in other words, to detect and compensate the camera motion.
The Temporal Image Signature (TIS) approach (Qureshi, 2013) combined two saliency approaches, the Image Signature (IS) approach (Hou et al., 2012) and the Temporal Spectral Residual (TSR) approach (Cui et al., 2009), but ignored camera motion. It was suggested there that compensating the effect of camera motion while detecting salient regions might improve the results, as camera motion artifacts were evident in the output. In this paper, camera motion is considered along with the moving object.
With the advancement of modern digital technologies, the demand for better quality video, in terms of efficient transmission of media content, has also increased. There is therefore a need for intelligent video compression. One possible way is to compress the video uniformly, without considering its content, which requires a fixed amount of bandwidth. A more intelligent way is to distribute the amount of compression across a frame based on its content. This can be done by separating regions that require high compression quality, and thus a higher bitrate, from regions that do not require such high quality. The former should be compressed less, while the latter can be compressed more heavily. The information needed to divide the regions can be obtained from saliency detection methods. The proposed method is an initial step towards such an intelligent video coding approach, detecting the salient regions in a scene.
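To make this idea concrete, the following Python sketch (illustrative only, not part of the proposed method) shows how a normalized saliency map could drive per-block quantization in an encoder; the block size, offset range and function name are assumptions made for the example.

import numpy as np

def qp_offsets_from_saliency(saliency, block=16, qp_low=-4, qp_high=4):
    """Map a saliency map with values in [0, 1] to per-block quantizer
    offsets: salient blocks get a negative offset (finer quantization,
    higher bitrate), non-salient blocks a positive one."""
    h, w = saliency.shape
    rows, cols = h // block, w // block
    offsets = np.empty((rows, cols), dtype=int)
    for r in range(rows):
        for c in range(cols):
            s = saliency[r*block:(r+1)*block, c*block:(c+1)*block].mean()
            # Interpolate linearly between the two offset extremes.
            offsets[r, c] = int(round(qp_high + s * (qp_low - qp_high)))
    return offsets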
The proposed approach compensates the effect of camera motion by using motion estimation and combines that information with the salient regions. Several options exist for compensating camera motion. One way is to measure the camera movement physically. Another, automatic and more general, way is to estimate the camera motion using local and global motion information of objects in a scene. This can be done by computing motion vector fields locally or globally. In this paper, the global motion (i.e., the background motion) is taken as the camera motion.
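As an illustration of this idea, the following sketch estimates a purely translational camera motion as the robust median of a block-based motion vector field and flags blocks deviating from it as candidate object motion. The median-based model and the threshold are simplifying assumptions for the example; this is not the outlier-rejection cascade of (Chen and Bajić, 2010) that is actually used later in the paper.

import numpy as np

def split_global_local(mv_field, thresh=1.5):
    """mv_field: (H, W, 2) array of per-block motion vectors (dx, dy).
    Returns the estimated global (camera) motion and a boolean mask of
    blocks whose motion deviates from it (candidate moving objects)."""
    vectors = mv_field.reshape(-1, 2)
    # The median is robust against locally moving foreground blocks.
    global_motion = np.median(vectors, axis=0)
    deviation = np.linalg.norm(mv_field - global_motion, axis=2)
    local_mask = deviation > thresh
    return global_motion, local_mask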
The paper is organized as follows. Section 2 reviews some existing state-of-the-art approaches. Section 3 describes the proposed work in detail, along with the computation of the saliency map and its fusion with the motion information. Finally, the evaluation of the proposed model against other state-of-the-art approaches is presented in Section 4.
2 RELATED WORK
Figure 1: The proposed system.

The idea of visual attention modelling is strongly inspired by two models that are based on human visual perception. The first model, the "Feature Integration Theory" (FIT) (Treisman and Gelade, 1980), explains the behavior of humans when they look at a visual scene. The second model, Guided Search (GS) (Treisman, 1986), can be seen as an extension of FIT. Generally, saliency detection approaches can be categorized into two groups: image based and video based approaches.
The main goal of image based saliency methods is to separate salient regions from the background. Numerous approaches have been suggested in this area. Itti et al. simulated the process of human visual search in order to detect salient regions in still images (Itti and Koch, 2001). Hou et al. proposed the Spectral Residual (SR) approach (Hou and Zhang, 2007), which derives saliency from the irregularities of the log amplitude spectrum relative to its smoothed version. Achanta et al. and Cheng et al. estimate saliency using frequency-tuned and contrast based concepts, respectively (Achanta et al., 2009)(Cheng et al., 2011). More recently, the DCT and its variants (Qureshi, 2013)(Schauerte and Stiefelhagen, 2012) have also been utilized for detecting salient regions.
The main aim of video based saliency is to separate salient motions from the background. Many approaches have been proposed in the context of video; in fact, several approaches from the still image domain were also developed for the video or temporal domain. For example, the Temporal Spectral Residual (TSR) approach (Cui et al., 2009) is an extension of the Spectral Residual (SR) approach (Hou and Zhang, 2007), and the Temporal Image Signature (TIS) approach (Qureshi, 2013) is an extension of the Image Signature (IS) approach (Hou et al., 2012). For video content, different clues (e.g., camera motion, faces, text and speech) are used by researchers in order to detect salient regions as accurately as possible. For example, many researchers have proposed to use the human face as an important clue while detecting salient regions, because of its importance in attracting human attention (Qureshi and Ludwig, 2013)(Cerf et al., 2009). Motion (e.g., camera or object motion) can also serve as an important clue for the automatic detection of regions of interest.
Studies of different visual models have shown that motion may indeed play an important role and provide useful information about salient areas. A tool that is often used in motion analysis is the motion vector field, which can be obtained by different algorithms. Guo et al. suggest a method based on the phase spectrum of the quaternion Fourier transform (Guo et al., 2008). Based on the motion vector field, researchers have described methods that estimate the global motion in a scene and use it to compensate the camera movement (Deigmoeller, 2010)(Hadizadeh and Bajić, 2014). Some researchers have used camera motion information to find salient regions of interest (Abdollahian and Delp, 2007). Very comprehensive surveys and analyses of scores, datasets, and models of state-of-the-art visual attention modeling are presented in (Borji and Itti, 2013)(Borji et al., 2013). In this paper, the method proposed in (Chen and Bajić, 2010) is used as a proof of concept for motion estimation.
Many applications of saliency models have been developed over the years, which has further increased the interest in attention modeling. Examples include object based segmentation (Han et al., 2006), image retargeting (Achanta and Süsstrunk, 2009), face detection (Qureshi and Ludwig, 2013)(Cerf et al., 2009), compression (Hadizadeh and Bajić, 2014) and video summarization (Ma et al., 2005).
MotionCompensatedTemporalImageSignatureApproach
513
Figure 2: Visual comparison of the proposed approach with motion compensation (MC-TIS) vs. IS (Hou et al., 2012) and TIS (Qureshi, 2013) with no motion compensation. (a) Original, (b) IS (Hou et al., 2012), (c) TIS (Qureshi, 2013), (d) Proposed (MC-TIS), (e) Final mask of (d).
3 PROPOSED APPROACH
This paper proposes the combination of a motion estimation algorithm (Chen and Bajić, 2010) (a MATLAB implementation is available at http://www.sfu.ca/~ibajic/software.html) with the Temporal Image Signature (TIS) saliency detection approach (Qureshi, 2013). The system is shown in Figure 1. The TIS approach (Qureshi, 2013) completely ignores the effect of camera motion; the proposed approach extends it by compensating that effect.
It is well established that local object motion is an important clue for attracting human attention (Qureshi and Ludwig, 2013)(Itti et al., 1998). For the Temporal Image Signature (TIS) approach (Qureshi, 2013), it was observed that the accuracy degrades whenever camera motion is present in the scene. Although the model was generally able to detect salient regions successfully, in some cases of severe camera motion, where the background motion competes with the foreground object motion, the background or camera motion could be confused with the salient object motion.
The processing of the proposed approach is as follows. First, a number of frames are sliced into horizontal (XT) and vertical (YT) planes; XT and YT are the planes of image lines in the temporal domain. The IS approach (Hou et al., 2012) is then applied separately on all planes. The IS approach defines saliency via the inverse Discrete Cosine Transform (DCT) of the signs of the cosine spectrum. Secondly, the global motion estimation algorithm (Chen and Bajić, 2010), followed by global motion compensation (Chen and Bajić, 2010), is applied to each frame separately in the XY domain. In the final step, the salient information in the XT and YT planes is accumulated by transforming it back into the XY domain, where it is fused with the motion compensated frames using the coherent-normalization-based fusion method (Chamaret et al., 2010). This results in a final map that combines the saliency map and the motion estimation. Figure 2 provides a visual comparison of the proposed approach (MC-TIS), which compensates camera motion, with the results of the TIS approach (Qureshi, 2013) and the Image Signature (IS) approach (Hou et al., 2012), which apply no compensation. The binary mask of the proposed approach is also shown.
Fusion Process
Given a video clip I of size m × n × t, where m × n is the image size and t is the number of frames being processed, the TIS approach as proposed in (Qureshi, 2013) can be modified as follows:

MapXT_m = sign(DCT(I_XT_m))    (1)

MapYT_n = sign(DCT(I_YT_n))    (2)

MapXT_m --Transform--> hMapXY_t    (3)

MapYT_n --Transform--> vMapXY_t    (4)

Adding eq. (3) and (4):

SMap(t) = hMapXY(t) + vMapXY(t)    (5)

The transformation of the frames into global motion compensated (GMC) frames can be written as:

XY_t --GMC(MVs)--> XY_t^MC    (6)

The saliency map can be fused with the global motion compensation information by combining equations (5) and (6) using the coherent-normalization-based fusion method (Chamaret et al., 2010):

fSMap(t) = (1 − α)·SMap(t) + α·XY_t^MC + β·SMap(t)·XY_t^MC    (7)
MapXT and MapYT represent the horizontal and vertical maps, SMap is the saliency map obtained with the TIS approach, fSMap is the final map after the combination, α and β are positive constants with values in the range 0 to 1, MC denotes the global motion compensated frames, MVs are the motion vectors, and I_XT and I_YT are the slices of the images along the horizontal and vertical axes. sign(DCT(I)) is the IS process applied on the image. The algorithm is graphically depicted in Figure 1; a minimal code sketch of the pipeline is given below.
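To make equations (1)-(7) concrete, the following is a minimal Python sketch of the MC-TIS pipeline, assuming SciPy's DCT routines and a grayscale clip. The global motion compensated term XY_t^MC is supplied externally (the estimation and compensation steps of (Chen and Bajić, 2010) are treated as a black box), the IS step follows the inverse-DCT-of-signs recipe of (Hou et al., 2012), and the values of α and β are illustrative.

import numpy as np
from scipy.fftpack import dct, idct

def dct2(x):
    return dct(dct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def idct2(x):
    return idct(idct(x, axis=0, norm='ortho'), axis=1, norm='ortho')

def image_signature(plane):
    """IS on one 2D plane (Hou et al., 2012): reconstruct from the signs
    of the DCT spectrum via the inverse DCT, then square and normalize."""
    recon = idct2(np.sign(dct2(plane)))
    sal = recon ** 2
    return sal / (sal.max() + 1e-12)

def mc_tis(video, gmc_map, alpha=0.4, beta=0.2):
    """video: (H, W, T) grayscale clip. gmc_map: (H, W, T) motion
    activity after global motion compensation (XY_t^MC), scaled to
    [0, 1]. Returns the fused saliency maps fSMap of eq. (7)."""
    h, w, t = video.shape
    h_map = np.zeros(video.shape)   # hMapXY, from XT slices: eqs. (1), (3)
    v_map = np.zeros(video.shape)   # vMapXY, from YT slices: eqs. (2), (4)
    for m in range(h):              # one XT plane per image row
        h_map[m, :, :] = image_signature(video[m, :, :])
    for n in range(w):              # one YT plane per image column
        v_map[:, n, :] = image_signature(video[:, n, :])
    smap = h_map + v_map            # eq. (5)
    smap /= smap.max() + 1e-12
    # Coherent-normalization-style fusion, eq. (7).
    return (1 - alpha) * smap + alpha * gmc_map + beta * smap * gmc_map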
Figure 3: The ROC metric comparison: ROC score (averaged across 90 images) as a function of blur width (standard deviation of the Gaussian kernel in image widths) for IS, Itti, TIS, MC-TIS (proposed) and SR.
4 EVALUATION
To validate the saliency maps generated by our algorithm, the human eye-tracking database for standard video sequences introduced by Hadizadeh, Enriquez, and Bajić (Hadizadeh et al., 2012) is used to compare the various saliency map algorithms. This dataset includes gaze locations from 15 subjects free-viewing 90 color images (288 × 352 pixels). In order to evaluate the correspondence between a particular saliency map and the set of fixations for an image, the Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) score is computed for each image (a MATLAB benchmarking script developed at the California Institute of Technology is available at http://goo.gl/bu1tkc). It is computed as described in the Image Signature paper (Hou et al., 2012): the area under the ROC curve measures how well the saliency map predicts the gaze positions.
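The following sketch illustrates this kind of fixation-based ROC scoring (the general recipe, not the exact benchmarking script): saliency values at fixated locations are treated as positives, values at randomly sampled locations as negatives, and the AUC is obtained from the rank statistic.

import numpy as np

def roc_auc(saliency, fixations, n_random=1000, rng=None):
    """saliency: 2D saliency map; fixations: list of (row, col) gaze
    positions. Returns the ROC AUC for separating fixated locations
    from randomly sampled control locations."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = saliency.shape
    pos = np.asarray([saliency[r, c] for r, c in fixations], dtype=float)
    neg = saliency[rng.integers(0, h, n_random),
                   rng.integers(0, w, n_random)].astype(float)
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    # Mann-Whitney U statistic: the AUC equals the probability that a
    # fixated location receives a higher saliency value than a random one.
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = len(pos), len(neg)
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)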
We compare the saliency maps generated by our proposed approach (MC-TIS) to the following published saliency algorithms: the original Itti-Koch (Itti) saliency model (Itti et al., 1998), Image Signature (IS) (Hou et al., 2012), Temporal Image Signature (TIS) (Qureshi, 2013) and Spectral Residual (SR) (Hou and Zhang, 2007). Figure 3 shows the dependence of the ROC score of the five algorithms on the amount of blur applied to the final saliency maps. The mean AUC score of each of the five algorithms is also shown in Table 1. From Figure 3, it can be seen that the performance of MC-TIS (the proposed approach) is better than that of the other saliency algorithms. Hence, the regions emphasized by the proposed saliency algorithm match, to a large extent, the image regions viewed by humans under free viewing conditions.
Table 1: AUC score of all 5 algorithms.

Algorithm (name)                 AUC (mean)
Proposed (MC-TIS)                0.5838
Temporal Image Signature (TIS)   0.5315
Image Signature (IS)             0.5248
Spectral Residual (SR)           0.5214
Itti                             0.5104
5 CONCLUSION
The proposed approach demonstrated the advantage of using motion estimation information in combination with saliency information. The addition of motion information can indeed act as an important filter, removing artifacts that originate from the camera motion. The presence of camera motion can cause a saliency model to confuse actual object (salient) motion with background (camera) motion. Therefore, it becomes very important to compensate the camera motion; its effect is clearly visible in Figure 2 for the TIS approach (c). It was also shown that the salient regions detected by the proposed approach (MC-TIS) have a larger overlap with the locations of human eye fixations than those of the other saliency algorithms.
Our proposed system may fail in some difficult situations, for example under more severe camera motion or when the camera motion is wrongly estimated. Its success therefore also depends strongly on the quality and accuracy of the motion estimation method used. The proposed approach may of course also prove effective in other computer vision problems, e.g. object categorization, object recognition, and video encoding, where compression plays an important role.

Interesting avenues for future research include investigating the combination of motion estimation and saliency algorithms for the application of intelligent video compression.
ACKNOWLEDGEMENTS
This work was supported by the ROMEO project
(grant number: 287896), funded by the EC FP7 ICT
collaborative research programme.
REFERENCES
Abdollahian, G. and Delp, E. J. (2007). Finding regions
of interest in home videos based on camera motion. In
IEEE International Conference on Image Processing
(ICIP), volume 4.
Achanta, R., Hemami, S. S., Estrada, F. J., and Süsstrunk,
S. (2009). Frequency-tuned salient region detection.
In CVPR, pages 1597–1604. IEEE.
Achanta, R. and Süsstrunk, S. (2009). Saliency detection
for content-aware image resizing. In IEEE Intl. Conf.
on Image Processing.
Borji, A. and Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207.
Borji, A., Tavakoli, H. R., Sihite, D. N., and Itti, L. (2013). Analysis of scores, datasets, and models in visual saliency prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 921–928.
Chamaret, C., Chevet, J. C., and Le Meur, O. (2010). Spatio-temporal combination of saliency maps and eye-tracking assessment of different strategies. In Proc. IEEE Int. Conf. Image Process., pages 1077–1080.
Chen, Y.-M. and Bajić, I. V. (2010). Motion vector outlier
rejection cascade for global motion estimation. IEEE
Signal Process. Lett, 17(2):197–200.
Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X., and
Hu, S.-M. (2011). Global contrast based salient region
detection. In CVPR, pages 409–416.
Cui, X., Liu, Q., and Metaxas, D. (2009). Temporal spec-
tral residual: fast motion saliency detection. In Pro-
ceedings of the 17th ACM international conference on
Multimedia, MM ’09, pages 617–620, New York, NY,
USA. ACM.
Deigmoeller, J. (2010). Intelligent image cropping and scaling. PhD thesis, Brunel University.
Guo, C., Ma, Q., and Zhang, L. (2008). Spatio-temporal
saliency detection using phase spectrum of quaternion
Fourier transform. In CVPR ’08.
Hadizadeh, H., Enriquez, M. J., and Bajić, I. V. (2012). Eye-tracking
database for a set of standard video sequences. IEEE
Trans. on Image Processing, 21(2):898–903.
Hadizadeh, H. and Bajić, I. V. (2014). Saliency-aware video
compression. IEEE Trans. on Image Processing,
23(1):19–33.
Han, J., Ngan, K. N., Li, M., and Zhang, H. (2006). Un-
supervised extraction of visual attention objects in
color images. IEEE Trans. Circuits Syst. Video Techn.,
16(1):141–145.
Hou, X., Harel, J., and Koch, C. (2012). Image signature:
Highlighting sparse salient regions. IEEE Trans. Pat-
tern Anal. Mach. Intell., 34(1):194–201.
Hou, X. and Zhang, L. (2007). Saliency detection: A
spectral residual approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’07), pages 1–8. IEEE Computer Society.
Itti, L. and Koch, C. (2001). Computational modelling
of visual attention. Nature Reviews Neuroscience,
2(3):194–203.
Itti, L., Koch, C., and Niebur, E. (1998). A model
of saliency-based visual attention for rapid scene
analysis. IEEE Trans. Pattern Anal. Mach. Intell.,
20(11):1254–1259.
Cerf, M., Frady, E. P., and Koch, C. (2009). Faces and text attract
gaze independent of the task: Experimental data and
computer model. Journal of Vision, volume 9.
Qureshi, H. (2013). DCT based temporal image signature
approach. Proceedings of the 8th International Con-
ference on Computer Vision Theory and Applications
(VISAPP ’13), 1:208–212.
Qureshi, H. and Ludwig, M. (2013). Improving temporal
image signature approach by adding face conspicuity
map. Proceedings of the 2nd ROMEO Workshop.
Schauerte, B. and Stiefelhagen, R. (2012). Predicting
human gaze using quaternion DCT image signature
saliency and face detection. In Proceedings of the
IEEE Workshop on the Applications of Computer Vi-
sion (WACV). IEEE.
Treisman, A. (1986). Features and objects in visual pro-
cessing. Sci. Am., 255(5):114–125.
Treisman, A. M. and Gelade, G. (1980). A feature-
integration theory of attention. Cognitive Psychology,
12:97–136.
Ma, Y.-F., Hua, X.-S., Lu, L., and Zhang, H.-J. (2005). A generic framework of user attention model and its application in video summarization. IEEE Trans. Multimedia, 7(5):907–919.
VISAPP2015-InternationalConferenceonComputerVisionTheoryandApplications
516