REAL-TIME ENHANCEMENT OF IMAGE AND VIDEO SALIENCY
USING SEMANTIC DEPTH OF FIELD
Zhaolin Su and Shigeo Takahashi
Graduate School of Frontier Sciences, The University of Tokyo, Japan
Keywords:
Visual attention, Attention guidance, Saliency maps, Importance maps, Semantic depth of field.
Abstract:
In this paper, we propose a method for automatically directing viewers' visual attention to important regions of
images and videos based on low-level vision. Inspired by modern models of visual attention, the importance map
of an input scene is computed automatically by combining low-level features such as intensity and
colour, which are extracted with spatial filters at different spatial frequencies, together with a set of temporal
features extracted with a temporal filter in the case of dynamic scenes. A variable-kernel convolution based on
the importance map is then applied to the input scene in order to produce semantic depth of field effects in
which important regions remain in focus while the others are blurred. The pipeline of our method is efficient
enough to run in real time on modern low-end machines, and the associated experiments demonstrate
that the proposed system can complement the human visual system.
1 INTRODUCTION
We live in an information-rich world in which our sense organs receive a tremendous amount of information from
the surroundings; about 80% of it is visual information, amounting to as much as 10^8 bits per second at the
optic nerve. Although the human visual system cannot fully process all of this information (Tsotsos, 1990), it
selectively allocates its hardware resources to focus on important regions, which makes it possible to process
the visual information efficiently through discontinuous fixations (Itti and Koch, 2001). Thus, how to direct
viewers' attention has become a major issue when generating visual materials such as photos and videos.
To direct human visual attention, the depth of field (DOF) effect is often employed in the art of film
and photographic production. This type of effect is introduced by adjusting the properties of the camera lens
so that objects at a specified distance from the camera are displayed sharply while others are intentionally
blurred. The technique is effective and is accepted naturally by the audience, because DOF is processed by an
intrinsic part of the human visual system. A previous study (Kosara et al., 2001) explored this type of effect
for use in computer visualization under the name Semantic Depth of Field (SDOF), which blurs the images of
objects according to user-specified relevance values rather than their distance from the camera. Traditional
methods for guiding human visual attention assume that the regions to be emphasized are known in advance.
Under some circumstances, however, we do not know which parts of the given images are important, and we
cannot even predict them, for example, in general-purpose remote monitoring systems.
In this paper, we present a method for enhancing, in real time, the salient image regions that differ from their
surroundings in intensity or colour, by fully taking advantage of the SDOF formulation. Here, the salient
regions correspond to conspicuous spatial features in images and spatiotemporal features in videos, which are
extracted through our low-level vision model. Our run-time algorithm is built on the definition of the
importance map, whose scalar value distribution topographically represents the perceptual importance of each
pixel of the input visual scene. Based on this map, the essential regions of the input scene remain in focus
while the others are blurred to generate semantic depth of field effects, as shown in Figure 1. Our eye-tracking
experiments suggest that the method can reduce reaction times and increase fixation times on important regions
in practice, and can help us overcome some inherent limitations of the human visual system such as the change
blindness phenomenon (see details in Section 4). Our implementation shows that the overall pipeline is
computationally efficient enough to run in real time, and is thus particularly suitable for applications that
require run-time processing.
Figure 1: Results of our method. (a) Input scene captured by a web camera. (b) Semantic depth of field effects generated by our
algorithm. (c) The corresponding importance maps. The top row shows a static scene, where no specific
object is visually focused. The bottom row shows a dynamic scene where a person is about to open the
door. Note that the moving objects have been intentionally kept in focus to direct our visual attention using the depth of field effects.
The remainder of this paper is organized as follows: Section 2 provides a brief survey of related work.
Section 3 then describes the overall pipeline of our method. Section 4 presents the implementation results and
their psychophysical validation through experiments. Finally, Section 5 concludes this paper.
2 RELATED WORK
In this section, we review related work on computational models of visual attention to explain how
the human visual selective mechanism works, and compare the conventional saliency map models with the
importance map we propose. Previous studies on the SDOF are also discussed.
2.1 Visual Attention Modelling
Although the performance of our visual system is affected in a top-down manner by circumstances
(e.g., the task at hand), low-level vision still plays a crucial role in our visual information processing, and
its corresponding bottom-up architecture was proposed in (Koch and Ullman, 1985), inspired by
the biological model of the human visual system. This bottom-up architecture explains why some
specific objects, for example, a red apple among green ones, pop out from a scene, and it was further extended
to the computational model of visual attention called the saliency map, formulated by Itti et al. (Itti et al.,
1998), where a centre-surround mechanism is used to extract low-level features such as colour, intensity,
and orientation that differ from the surrounding areas.
The saliency map is a one-channel scalar image that topographically represents the visual saliency of
the corresponding visual scene. Based on this saliency map, a selection process deploys the gaze sequence over
the visual scene by accessing the corresponding 2D topographic scalar field through a winner-takes-all
competition mechanism. In order to analyze video materials, Itti et al. extended their model to dynamic scenes
using Bayesian theory (Itti and Baldi, 2009).
In recent years, several purely computational models of the saliency map, which are no longer based
on biological principles, have also been proposed. In these models, image processing methods are employed to
estimate the saliency of each pixel in a given image. The basic ideas behind these methods are to calculate
centre-surround features (Ma and Zhang, 2003; Achanta et al., 2009), to maximize the mutual information of
centre-surround features (Bruce and Tsotsos, 2007; Gao and Vasconcelos, 2007; Zhang et al., 2008), and to
analyze the frequency domain of the input image (Hou and Zhang, 2007).
2.1.1 Problems with Saliency Maps
Previous research has explored applications of the saliency map in image region segmentation, object
detection, and robot vision simulation. However, the low-resolution output and the heavy computational
complexity of the saliency map still remain crucial limitations to its wider utilization.
The reason why most models of the saliency map produce a low-resolution output is that the saliency map can be
considered a low-resolution abstraction of the input scene (Koch and Ullman, 1985). To extract centre-surround
features at different spatial frequencies, a downsampling approach has been employed. The reason for the second
issue, i.e., the heavy computational complexity, is that some of the models employ a non-linear normalization
process based on a biological approach (Itti et al., 1998), a graph-based approach (Harel et al., 2006), etc.
Figure 2: Comparison of the saliency maps of previous methods and the importance map produced by our method. (a) The original
image. Saliency maps obtained with the methods of (b) Itti et al. (Itti et al., 1998), (c) Ma and Zhang (Ma and Zhang, 2003), (d) Harel
et al. (Harel et al., 2006), (e) Hou and Zhang (Hou and Zhang, 2007), and (f) Achanta et al. (Achanta et al., 2009). (g) The
importance map produced by our method. The images (b)-(f) are provided courtesy of Achanta et al. (Achanta et al., 2009).
2.1.2 Saliency Maps vs. Importance Maps
In this paper, we propose the definition of an importance map in the hope of solving these two problems.
The importance map is a scalar-valued image that topographically represents the perceptual importance of a
visual scene. Because the ultimate goal of the importance map is to model the visual properties of a given
visual scene, we argue that its resolution should be the same as that of the input image; this can be
accomplished by changing the filter size rather than the input image size (see details in Section 3.1). To
reduce the computational complexity, we simplify the non-linear normalization process to a linear one (see
details in Section 3.2).
Another property of the importance map is that its scalar value distribution should be more flexible than that
of the saliency map, because even objects that do not pop out in a visual scene may still hold high importance
values (see details in Section 3.2). Figure 2 compares the saliency maps produced by previous methods with the
importance map obtained using our method.
2.2 Semantic Depth of Field
Semantic depth of field (SDOF), proposed by Kosara et al. (Kosara et al., 2001), is a Focus and Context
(F+C) information visualization technique that allows us to investigate a specific local feature of the
given data while identifying its position with respect to the global overview. The basic idea is to assign a
relevance value to every object in a visual scene first, and then blur each object with a Gaussian filter whose
level depends on the corresponding relevance value. Furthermore, Kosara et al. studied the properties of the
SDOF (Kosara et al., 2002b), explored its use in applications (Kosara et al., 2002a), and designed
user-interactive experiments (Kosara et al., 2002b), which showed that the SDOF can quickly and effectively
guide the viewer's attention.
3 PROPOSED METHOD
In this section, we describe our proposed method, which mainly consists of three steps. First, we extract
low-level features of image intensity and colour from the input scene, together with motion features in the
case of a dynamic scene. These features are then linearly normalized and combined into a single importance
map. In addition, since regions that do not pop out may still be perceptually important, the histogram of the
importance map is adjusted by a sigmoid function to give more dynamic range to the importance values in the
map. Finally, a convolution with variable Gaussian kernels, whose coefficients change depending on the
importance of the corresponding pixel, is applied to the input image.
3.1 Low-level Feature Extraction
By referring to the RGB colour components of each pixel (i.e., red (r), green (g), and blue (b)), we
calculate the intensity (I), red-green (RG), and blue-yellow (BY) double-opponent channels as:
I = (r + g + b) / 3,    (1)
RG = r - g, and    (2)
BY = b - ((r + g) - |r - g|) / 2.    (3)
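As a concrete reference, a minimal C++ sketch of Eqs. (1)-(3) is given below, assuming an interleaved 8-bit RGB input image stored row by row; the Channels structure and the flat std::vector layout are conventions chosen for illustration only, not details of our implementation.

```cpp
// Illustrative sketch of Eqs. (1)-(3): intensity and double-opponent channels.
// Assumes an interleaved 8-bit RGB image; the layout is chosen for illustration.
#include <cmath>
#include <cstdint>
#include <vector>

struct Channels {
    std::vector<float> I, RG, BY;   // one value per pixel
};

Channels extractChannels(const std::vector<std::uint8_t>& rgb, int width, int height) {
    const int n = width * height;
    Channels out{std::vector<float>(n), std::vector<float>(n), std::vector<float>(n)};
    for (int p = 0; p < n; ++p) {
        const float r = rgb[3 * p + 0];
        const float g = rgb[3 * p + 1];
        const float b = rgb[3 * p + 2];
        out.I[p]  = (r + g + b) / 3.0f;                        // Eq. (1)
        out.RG[p] = r - g;                                     // Eq. (2)
        out.BY[p] = b - 0.5f * ((r + g) - std::fabs(r - g));   // Eq. (3)
    }
    return out;
}
```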
To extract the centre-surround features, we use a Difference of Gaussians (DoG) filter, which models the
response of neurons in the Lateral Geniculate Nucleus (LGN) at the early stages of the human visual system.
Its kernel has the following form:

DoG(x, y; σ) = exp(-(x^2 + y^2) / (2σ^2)) / σ^2 - exp(-(x^2 + y^2) / (2(2σ)^2)) / (2σ)^2.    (4)
To extract features of different spatial frequencies, we adjust the window size of the DoG filter without
downsampling the input image, so that the filtered results retain the same resolution as the input image. We
choose σ = 2^1, 2^2, 2^3, 2^4, ..., 2^i (pixels) for the window size, and the number of scales i can be changed
depending on the input image resolution, because a higher-resolution image may need a larger window size.
Empirically, we choose i = 5 in our implementation.
Let Input(c) denote one of the three channels (c ∈ {I, RG, BY}), let i denote the scale of the DoG filter,
and let ∗ denote the convolution operator. We derive 3 × i spatial feature maps F_s(c, i):

F_s(c, i) = Input(c) ∗ DoG(σ_i).    (5)
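For illustration, the following sketch computes one spatial feature map F_s(c, i) by direct 2D convolution with the DoG kernel of Eq. (4). The window radius of three times the surround deviation (2σ) and the clamped image borders are assumptions made for this sketch; a real-time implementation would rather use two separable Gaussian passes and subtract their outputs.

```cpp
// Illustrative sketch of Eqs. (4)-(5): DoG response at a single scale sigma.
// Window radius (3 * 2 * sigma) and border clamping are assumptions of this sketch.
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<float> dogFilter(const std::vector<float>& in, int width, int height, float sigma) {
    const int radius = static_cast<int>(std::ceil(3.0f * 2.0f * sigma));
    std::vector<float> out(in.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int sx = std::clamp(x + dx, 0, width - 1);
                    const int sy = std::clamp(y + dy, 0, height - 1);
                    const float r2 = static_cast<float>(dx * dx + dy * dy);
                    const float centre   = std::exp(-r2 / (2.0f * sigma * sigma)) / (sigma * sigma);
                    const float surround = std::exp(-r2 / (8.0f * sigma * sigma)) / (4.0f * sigma * sigma);
                    sum += (centre - surround) * in[sy * width + sx];   // Eq. (4)
                }
            }
            out[y * width + x] = sum;   // one pixel of F_s(c, i), Eq. (5)
        }
    }
    return out;
}
```

Calling this routine for σ = 2^1, ..., 2^5 on each of the I, RG, and BY channels yields the 3 × i spatial feature maps without any downsampling.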
Figure 3: Plots of the DoE function with τ = 0.5.

For videos, we use a Difference of Exponentials (DoE) temporal filter (Zhang et al., 2009) to extract motion
features, which is given by (Figure 3)

DoE(t; τ) = E(t; 2τ) - E(t; τ),    (6)

where

E(t; τ) = (τ / (1 + τ)) · (1 + τ)^t.    (7)
Here, t ∈ (-∞, 0] is the frame number relative to the current frame (i.e., t = 0 corresponds to the current
frame), and τ is the shape parameter. A larger τ means that the extracted motion features receive more
influence from the immediately preceding frames and less from the distant past; we found that τ ∈ [0.1, 1.0]
works well in our implementation. Applying this filter to F_s(c, i) in Eq. (5), we derive another set of
3 × i temporal feature maps F_t(c, i) for the current frame as:

F_t(c, i) = F_s(c, i) ∗ DoE(0; τ).    (8)
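A minimal sketch of this temporal term is shown below; truncating the (-∞, 0] support of the DoE kernel to a finite buffer of recent feature maps, as well as the deque-based bookkeeping, are assumptions of this sketch rather than details of our implementation.

```cpp
// Illustrative sketch of Eqs. (6)-(8): temporal DoE filtering of one feature map.
// The infinite past is truncated to the frames kept in 'history' (an assumption).
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

float doeWeight(int t, float tau) {                        // t <= 0, Eqs. (6)-(7)
    auto E = [t](float s) { return s / (1.0f + s) * std::pow(1.0f + s, static_cast<float>(t)); };
    return E(2.0f * tau) - E(tau);
}

// 'history' holds F_s(c, i) for frames t = 0, -1, -2, ... (front = current frame).
std::vector<float> temporalFeature(const std::deque<std::vector<float>>& history, float tau) {
    std::vector<float> Ft(history.front().size(), 0.0f);
    for (std::size_t k = 0; k < history.size(); ++k) {
        const float w = doeWeight(-static_cast<int>(k), tau);   // weight of frame t = -k
        for (std::size_t p = 0; p < Ft.size(); ++p)
            Ft[p] += w * history[k][p];                          // accumulates F_t(c, i), Eq. (8)
    }
    return Ft;
}
```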
3.2 Importance Maps
As described above, the purpose of computing importance maps is to evaluate the perceptual importance of every
pixel in the input scene. Since distinctive features should be more important than common ones (Zhang et al.,
2008), we normalize each of the aforementioned feature maps by its own average, and then linearly combine the
results into one single map. For static images, the output I′ is obtained as:

I′ = Σ_{c,i} F_s(c, i) / avg(F_s(c, i)).    (9)
For video inputs, with the extra motion feature responses, I′ is reformulated as:

I′ = Σ_{c,i} F_s(c, i) / avg(F_s(c, i)) + k · Σ_{c,i} F_t(c, i) / avg(F_t(c, i)),    (10)

where k is a weight that determines the proportion of the motion features in I′. This value can be adjusted
according to the circumstances.
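The combination step of Eqs. (9)-(10) could be sketched as follows; the containers of feature maps, the assumption that the maps hold non-negative (rectified) responses of equal size, and the guard against an all-zero map are choices made for this illustration.

```cpp
// Illustrative sketch of Eqs. (9)-(10): normalize each map by its own average
// and sum the results; motion maps enter with weight k. Maps are assumed to
// hold non-negative responses of equal size.
#include <cstddef>
#include <numeric>
#include <vector>

void accumulateNormalized(const std::vector<std::vector<float>>& maps,
                          float weight, std::vector<float>& acc) {
    for (const auto& m : maps) {
        const float avg = std::accumulate(m.begin(), m.end(), 0.0f) / static_cast<float>(m.size());
        if (avg <= 0.0f) continue;                      // skip maps with no response
        for (std::size_t p = 0; p < acc.size(); ++p)
            acc[p] += weight * m[p] / avg;
    }
}

std::vector<float> importanceRaw(const std::vector<std::vector<float>>& spatialMaps,
                                 const std::vector<std::vector<float>>& temporalMaps,
                                 std::size_t numPixels, float k) {
    std::vector<float> I(numPixels, 0.0f);
    accumulateNormalized(spatialMaps, 1.0f, I);         // spatial term, Eq. (9)
    accumulateNormalized(temporalMaps, k, I);           // motion term added in Eq. (10)
    return I;
}
```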
Figure 4: Plot of S(x) with m = 0.3.

We then normalize I′ into the range [0, 1], and adjust the histogram of I′ using a fourth-order sigmoid
interpolation polynomial (Figure 4) that remaps every pixel value x of I′ to a new value S(x) as follows:

S(x) = (1/d) · (a·x^2 + b·x^3 + c·x^4),    (11)
where

a = 1 + 4m^2 - 3m^3,    b = 2 - 4m + 2m^3,
c = 1 + 3m - 2m^2,    and    d = 1 + 2m^2 - m^3.    (12)

Here, x ∈ [0, 1] and S(x) ∈ [0, 1], and m is the threshold value that satisfies S(m) = m, around which we
enhance the dynamic range of the importance values.
Empirically, we choose half the average of I′ as the threshold in our implementation. This transformation lets
us increase the importance values above m while suppressing the values below m, which exaggerates the contrast
of I′ and results in a more clearly enhanced output. Figure 5 shows the final importance map I obtained by
applying this adjustment.

Figure 5: Adjustment by the function S(x). (a) Input image. (b) Importance map before the adjustment. (c) Importance map
after the adjustment. Note that several objects (such as the blue basket on the upper left) gain high importance values while
other objects that already pop out are further enhanced.

Figure 6: Change in the distribution of gaze fixation times. (a) An original image and (b) its corresponding distribution of
gaze fixation times. (c) The enhanced image and (d) its corresponding distribution of gaze fixation times. The warning signs
attract more visual attention due to the depth of field effects.
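The adjustment described above might be implemented along the following lines; the quartic coefficients a, b, c, and d of Eq. (12) are passed in as parameters (in our method they follow from the threshold m, chosen as half the average importance), and the final clamping to [0, 1] is an assumption of this sketch.

```cpp
// Illustrative sketch of the adjustment in Section 3.2: normalize the combined
// map to [0, 1] and remap each value with the quartic S(x) of Eq. (11).
// The coefficients a, b, c, d are expected to be derived from m via Eq. (12).
#include <algorithm>
#include <vector>

void adjustImportance(std::vector<float>& I, float a, float b, float c, float d) {
    const auto mm = std::minmax_element(I.begin(), I.end());
    const float lo = *mm.first;
    const float hi = *mm.second;
    const float range = (hi > lo) ? (hi - lo) : 1.0f;
    for (float& v : I) {
        const float x  = (v - lo) / range;              // normalize to [0, 1]
        const float x2 = x * x;
        v = std::clamp((a * x2 + b * x2 * x + c * x2 * x2) / d, 0.0f, 1.0f);   // Eq. (11)
    }
}
```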
3.3 Applying Depth of Field Effects
The last step of the method is to reconstruct every pixel of the input scene using a variable-kernel
convolution to produce the final DOF effects. The kernel we use is a Gaussian function with a per-pixel
standard deviation σ_{x,y}:

G(x, y; σ_{x,y}) = (1 / (2π·σ_{x,y}^2)) · exp(-(x^2 + y^2) / (2σ_{x,y}^2)),    (13)
where σ_{x,y} is proportional to the inverse of the local importance:

σ_{x,y} = η / I_{x,y}.    (14)

Here, I_{x,y} stands for the scalar value of the pixel (x, y) of I, and η is a weight coefficient that controls
the degree of the blur effect. In our implementation, we chose η = 5. After applying this variable-kernel
convolution, we obtain the final output as:

Output = (Input ∗ G)(x, y).    (15)
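A naive per-pixel sketch of this final step is given below. Truncating each Gaussian kernel at a radius of 3σ, renormalizing the truncated weights (in place of the continuous 1/(2πσ^2) factor of Eq. (13)), clamping σ for speed, and guarding against zero importance are assumptions of this sketch rather than details of the method itself.

```cpp
// Illustrative sketch of Eqs. (13)-(15): a Gaussian blur whose standard deviation
// grows as the local importance falls. Kernel truncation at 3*sigma, weight
// renormalization, and the clamps on sigma are assumptions of this sketch.
#include <algorithm>
#include <cmath>
#include <vector>

void semanticDOF(const std::vector<float>& in, std::vector<float>& out,
                 const std::vector<float>& importance, int width, int height,
                 float eta = 5.0f, float maxSigma = 20.0f) {
    out.assign(in.size(), 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const float imp = std::max(importance[y * width + x], 1e-3f);
            const float sigma = std::min(eta / imp, maxSigma);                 // Eq. (14), clamped
            const int radius = static_cast<int>(std::ceil(3.0f * sigma));
            float sum = 0.0f, wsum = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    const int sx = std::clamp(x + dx, 0, width - 1);
                    const int sy = std::clamp(y + dy, 0, height - 1);
                    const float w = std::exp(-static_cast<float>(dx * dx + dy * dy)
                                             / (2.0f * sigma * sigma));        // Eq. (13)
                    sum  += w * in[sy * width + sx];
                    wsum += w;
                }
            }
            out[y * width + x] = sum / wsum;                                   // Eq. (15)
        }
    }
}
```

One such pass would be applied to each colour channel of the input frame.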
4 RESULTS
We implemented our method in C++ and tested it on a
notebook with a 2.4GHz Intel CPU and 2GB memory.
The results show that our system can comfortably process videos at 160×120 resolution in real time (about
40 fps), and videos at 320×240 resolution in near real time (about 16 fps). The experiments also suggest that
the computation time is roughly proportional to the
number of pixels in the input image. Figure 1 shows
the results of static and dynamic scenes, which were
captured with an ordinary web camera.
Since the DOF effect has the advantage of influencing each colour channel independently (Kosara
et al., 2002a), our method can also be applied to video clips that have only a single intensity channel, such
as greyscale videos obtained through video surveillance systems. We also tested our system by applying it to
videos that were used as stimuli in change blindness experiments. We found that our system can emphasize
gradual changes, such as objects slowly changing in position, colour, or shape, which we often fail to detect
without such visual enhancement (Simons, 2000).
To evaluate the effectiveness of our method, we
also conducted eye-tracking experiments to investigate how effectively our method can direct the visual
attention of the viewers. The original and visually enhanced versions of a natural scene were displayed in a
random sequence at a resolution of 800×600 pixels. Subjects (8 males and 2 females) were asked to look freely
at each image for 10 seconds while their gaze movements were recorded with a Tobii X120 non-intrusive eye
tracker. Figure 6 compares the results and indicates that viewers pay more visual attention to the warning
signs on the top right in the enhanced version of the scene. Careful analysis of the recorded gaze movements
also suggests that the time to the first gaze fixation on the warning signs was reduced from 6.15 s to 4.95 s on
average due to the DOF effects, where the time was counted as 10.0 s when a subject did not notice the warning
signs at all.
5 CONCLUSIONS
In this paper, we have first introduced the definition of the importance map, which represents the perceptual
importance of a visual scene through the extraction and combination of low-level features. Based on this
importance map, we enhance the salient features in the input scene by applying semantic depth of field effects,
so as to naturally guide the visual attention of the viewers. The whole pipeline can be executed in real time
without any user intervention, and the experimental results suggest that our method can actually assist people
in rapidly finding significant features in a scene.
ACKNOWLEDGEMENTS
We would like to thank Radhakrishna Achanta et al. for sharing the implementation results of previous methods,
and the anonymous reviewers for their valuable comments. This work has been partially supported by the Japan
Society for the Promotion of Science under Grants-in-Aid for Scientific Research (B) No. 20300033 and
No. 21300033.
REFERENCES
Achanta, R., Hemami, S., Estrada, F., and Susstrunk, S.
(2009). Frequency-tuned Salient Region Detection. In
Proc. IEEE International Conf. Computer Vision and
Pattern Recognition (CVPR2009), pages 1597–1604.
Bruce, N. and Tsotsos, J. (2007). Attention based on infor-
mation maximization. Journal of Vision, 7(9):950.
Gao, D. and Vasconcelos, N. (2007). Bottom-up saliency
is a discriminant process. In Proc. IEEE International
Conf. Computer Vision (ICCV2007), pages 1–6.
Harel, J., Koch, C., and Perona, P. (2006). Graph-based vi-
sual saliency. In Proc. Neural Information Processing
Systems (NIPS2006), pages 570–577.
Hou, X. and Zhang, L. (2007). Saliency detection: A
spectral residual approach. In Proc. IEEE Interna-
tional Conf. Computer Vision and Pattern Recognition
(CVPR2007), pages 1–8.
Itti, L. and Baldi, P. F. (2009). Bayesian surprise attracts
human attention. Vision Research, 49(10):1295–1306.
Itti, L. and Koch, C. (2001). Computational modelling
of visual attention. Nature Reviews Neuroscience,
2(3):194–203.
Itti, L., Koch, C., and Niebur, E. (1998). A model of
saliency-based visual attention for rapid scene anal-
ysis. IEEE Trans. Pattern Analysis and Machine In-
telligence, 20(11):1254–1259.
Koch, C. and Ullman, S. (1985). Shifts in selective visual
attention: towards the underlying neural circuitry. Hu-
man Neurobiology, 4(4):219–227.
Kosara, R., Miksch, S., and Hauser, H. (2001). Semantic
depth of field. In Proc. IEEE Symp. Information Visu-
alization 2001 (INFOVIS2001), pages 97–104.
Kosara, R., Miksch, S., and Hauser, H. (2002a). Fo-
cus+Context taken literally. IEEE Computer Graphics
and Applications, 22(1):22–29.
Kosara, R., Miksch, S., Hauser, H., Schrammel, J., Giller,
V., and Tscheligi, M. (2002b). Useful properties of
semantic depth of field for better F+C visualization. In
Proc. Symp. Data Visualisation 2002 (VISSYM2002),
pages 205–210.
Ma, Y.-F. and Zhang, H.-J. (2003). Contrast-based image at-
tention analysis by using fuzzy growing. In Proc. 11th
ACM International Conf. Multimedia (MULTIME-
DIA2003), pages 374–381.
Simons, D. J. (2000). Current approaches to change blind-
ness. Visual Cognition, 7:1–15.
Tsotsos, J. (1990). Analyzing vision at the complexity level.
Behavioral and Brain Sciences, 13(3):423–445.
Zhang, L., Tong, M. H., and Cottrell, G. W. (2009). SUNDAy:
Saliency using natural statistics for dynamic analysis
of scenes. In Proc. 31st Annual Cognitive Science So-
ciety Conf. (CogSci2009), pages 2944–2949.
Zhang, L., Tong, M. H., Marks, T. K., Shan, H., and Cot-
trell, G. W. (2008). SUN: A Bayesian framework for
saliency using natural statistics. Journal of Vision,
8(7):1–20.