Fully Automatic Saliency-based Subjects Extraction in Digital Images

Luca Greco, Marco La Cascia and Francesco Lo Cascio

Dicgim, Università degli Studi di Palermo, Palermo, Italy

Keywords: Saliency Maps, Subjects Detection, Automatic Thumbnailing.

Abstract: In this paper we present a novel saliency-based technique for the automatic extraction of relevant subjects in

digital images. We use enhanced saliency maps to determine the most relevant parts of the images and an

image cropping technique on the map itself to extract one or more relevant subjects. The contribution of the

paper is two-fold as we propose a technique to enhance the standard GBVS saliency map and a technique to

extract the most salient parts of the image. The GBVS saliency map is enhanced by applying three filters

particularly designed to optimize the performance for the task of relevant subjects extraction. The extraction

of relevant subjects is demonstrated on a manually annotated dataset and results are encouraging. A

variation of the same technique has also been used to extract the most significant region of an image. This

region can then be used to obtain a thumbnail keeping most of the relevant information of the original image

and discarding nonsignificant background. Experimental results are reported also in this case.

1 INTRODUCTION AND RELATED WORKS

The term saliency usually refers to a visual quality or characteristic of interest for a human
observer. The term conspicuity is also often used; it literally indicates visibility, that is,
what is most visible. In visual science a subject is important if it stands out from what
surrounds it, gaining greater visibility and attracting attention, which results in a shift of
the gaze. Therefore, the salient parts of a scene are those that evoke a strong visual response
and polarize attention. Human visual attention is composed of two factors arising from two
stimuli of different nature: the first depends exclusively on the characteristics of the image,
while the second is subjective and related to the observer's will. An objective stimulus has a
bottom-up activation, due to physical characteristics such as brightness, color and shape. In
many situations the greatest contribution to the acquisition of information may be due to the
top-down process, as the focus of attention is influenced by the knowledge obtained by learning
the probabilistic structure of the environment.

Saliency can be used in several tasks in

Computer Vision. In this paper saliency maps are

used as a basis to solve the subject extraction and

thumbnailing problem in images.

The extraction of subjects in images consists of identifying the regions where the most salient
subjects are located. Saliency is concentrated in the zones containing the subject (or subjects,
if there is more than one), while the rest of the image can be considered background, containing
less informative pixels. The proposed algorithm aims to identify background and foreground areas
and to extract the subjects of the image.

The detection of salient subjects has many practical applications, such as image cropping,
extraction of dominant colors or other visual features of the objects of interest, and search in
image databases. There are several techniques to detect salient subjects, which can be divided
into two main categories. The first category is based on the segmentation of the saliency map,
using an exhaustive search until a fixed fraction of saliency is reached. Examples are the work
of Marchesotti et al. (2009), which is based on a comparison with similar images stored in a
database, the algorithm of Liu et al. (2007), based on finding the right weights for combining
the various channels of the saliency map, and the algorithms of Cheng et al. (2011) and Wang et
al. (2011), which are based on refinements of the saliency map. The second category is based on
the use of sliding windows on the saliency map, which estimate the probability that a window
contains an object by determining areas with maximum saliency. These


Greco L., La Cascia M. and Lo Cascio F..

Fully Automatic Saliency-based Subjects Extraction in Digital Images.

DOI: 10.5220/0004530401290136

In Proceedings of the 10th International Conference on Signal Processing and Multimedia Applications and 10th International Conference on Wireless

Information Networks and Systems (SIGMAP-2013), pages 129-136

ISBN: 978-989-8565-74-7

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

techniques are based on thresholds that identify the correct subjects and that depend on the
image data. Notable among these are the work of Alexe et al. (2010), based on "objectness", and
the work of Feng et al. (2011), based on superpixels.

Thumbnailing is widely used in many applications and consists of the creation of a smaller
version of the image. This technique is used by visual search engines to organize and show
images, by photo sharing sites and by all modern desktop operating systems. Even though it is an
essential technique, in many cases it is implemented as a simple image resizing. Image resizing
is simple and fast but not always effective, since resizing causes a loss of information that
can make subjects hard to recognize. Another possible implementation is based on cropping and
resizing, like the one presented by Suh et al. (2003), which avoids the weaknesses of simple
resizing but represents only a part of the image. It is also necessary to define a cropping
criterion, that is, a method to find a frame containing what most distinguishes a photo.
According to Itti et al. (1998), a natural criterion to measure the importance of a pixel is the
use of a model of visual attention. For example, Suh et al. (2003) propose a compromise between
the size of the crop and the saliency it contains. Another approach is to eliminate less
important pixels while trying to keep the high-level semantic structures. Among these is the
work of Samadani et al. (2007), which is based on the quick determination of natural previews.
Avidan and Shamir (2007) present seam carving, which runs into problems if important content is
spread all over the image. Simakov et al. (2008) present another technique that overcomes, to a
certain extent, the problems suffered by seam carving. Other related work can be found in Liu
and Gleicher (2005, pp. 153-162) and Rother et al. (2005).

2 SALIENCY

The analysis of an image is driven by what one expects to find or wants to identify. By applying
an attention-based strategy it is possible to reduce the time of search or analysis. In summary,
attention is composed of two separate processes, as claimed by Itti et al. (1998):

• A bottom-up process, which is based exclusively on the physical stimulus. This process is
simple and fast but lacks any high-level consideration.

• A top-down process, which depends on the observer's will and is based on a semantic analysis
of the image. This process has a higher level of abstraction but is slower. The search is more
accurate, since it exploits a high degree of knowledge of the environment and examines the
consistency of the context.

In this paper we deal with bottom-up stimuli and in particular we use the saliency map model
presented in Harel et al. (2006).

Figure 1: Original image and GBVS saliency map.

2.1 Graph-based Visual Saliency

GBVS is a model of bottom-up visual saliency composed of two stages: the formation of activation
maps on certain feature channels such as color, intensity and brightness, and a normalization
step that emphasizes saliency and allows combination with other maps.

The first stage consists of extracting feature maps and creating activation maps. A
fully-connected graph is constructed with a node for each pixel and directed edges weighted by
the dissimilarity of the two pixels in the feature map and by their distance. The outgoing
weights of each node are normalized to sum to 1 and used as the transition probabilities of a
Markov chain; the activation map is then given by the equilibrium distribution of the resulting
chain. The second stage constructs another graph using the computed activation map and the
distances. Again, this graph is treated as a Markov chain and the normalized map is derived from
its equilibrium distribution.
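The first GBVS stage can be illustrated with a minimal sketch. It is not the reference implementation: the edge weight (absolute log-ratio of feature values multiplied by a Gaussian of the pixel distance) follows the form given in Harel et al. (2006) as far as we recall it, while the function name `activation_map`, the value of `sigma` and the use of a lazy power iteration are assumptions made here for illustration. The sketch is only practical for very small maps, since the graph is fully connected.

```python
import numpy as np

def activation_map(feature, sigma=2.0, iters=200):
    """Equilibrium distribution of a Markov chain built on one feature map.

    Hypothetical helper: every pixel is a node, and the edge from node i to
    node j is weighted by feature dissimilarity times spatial closeness.
    """
    h, w = feature.shape
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)
    vals = feature.ravel().astype(float) + 1e-6  # avoid log(0)

    # Dissimilarity between every pair of pixels (assumed log-ratio form).
    dissim = np.abs(np.log(vals[:, None] / vals[None, :]))
    # Squared Euclidean distance between every pair of pixel positions.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    weights = dissim * np.exp(-d2 / (2.0 * sigma ** 2)) + 1e-12

    # Outgoing weights of each node are normalized to sum to 1.
    P = weights / weights.sum(axis=1, keepdims=True)

    # Lazy power iteration: same equilibrium as P, but it does not
    # oscillate when the chain is close to periodic.
    pi = np.full(h * w, 1.0 / (h * w))
    for _ in range(iters):
        pi = 0.5 * pi + 0.5 * (pi @ P)
    return pi.reshape(h, w)

# A uniform map with one distinctive pixel: mass accumulates on that pixel.
feature = np.full((5, 5), 0.2)
feature[2, 2] = 1.0
act = activation_map(feature)
```

The second stage would repeat the same construction on the activation map itself, with edge weights driven by the activation value of the destination node rather than by dissimilarity.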

2.2 Saliency Map

The GBVS saliency map is represented by a matrix

SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications

130

representing the saliency of each pixel. The map is normalized to values between zero and one:
the closer the value is to one, the more salient the corresponding pixel. An example of a
saliency map is shown in Figure 1, where white pixels correspond to a saliency value equal to
one and black pixels to a saliency value equal to zero. Saliency is concentrated on the two
deer, while in the background areas, i.e. the trees in the distance and the terrain, it is lower
and almost uniform.

2.3 Noise in the Map

In some cases the presence of a noisy and non-uniform background affects the correct
determination of the saliency map. Moreover, in images where many subjects are concentrated in a
small area, the saliency map tends to mark the whole area as salient without distinguishing
between the individual subjects. Finally, if a subject has characteristics (shape, color,
orientation, etc.) similar to those of the background, it is considered not salient and is not
detected as a subject of the image.

3 PROPOSED ALGORITHM

The proposed algorithm is designed to solve two different problems:

• The detection and extraction of interesting subjects. This includes detecting the subjects in
an image and determining a bounding box for each of them.

• The determination of a significant thumbnail, that is, a small preview picture of the original
image.

The algorithm relies on a saliency map to determine, in the first case, the areas where
interesting subjects are present and, in the second case, the foreground and the background.

In both cases the structure of the algorithm remains the same; only the stopping criterion
changes.

3.1 Computation and Refining of Saliency Maps

The saliency map is obtained with the GBVS algorithm, so it is driven purely by objective stimuli
and does not take the image as a whole into account. For the proposed method the interesting part
of the map is the one associated with the subjects, so eliminating the part associated with the
background can lead to a better result. If the image contains one or more subjects, a more useful
map is composed of saliency spots corresponding to the subjects.

Figure 2: Initial map and refined version.

The application of three filters, each with a specific goal, can lead to such a map. The first
filter accentuates the saliency along image contours, to make the subject cropping more precise
and reduce the chance of an incomplete cut. In fact, if the subject does not have a compact
shape, contour details may be ignored in the saliency map. To prevent this, a binary map marking
the pixels that belong to contours is used: the saliency of the pixels corresponding to ones in
the contour map is incremented by a factor of 0.3. Contours are extracted using a Sobel filter.
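A possible reading of this first filter is sketched below. Several details are assumptions, since the text does not specify them: the increment is interpreted as adding 0.3 to the saliency value and clipping to [0, 1], the binary contour map is obtained by thresholding the normalized Sobel gradient magnitude at a hypothetical `edge_thresh`, and the function names are illustrative only.

```python
import numpy as np

def sobel_magnitude(gray):
    """Gradient magnitude from the 3x3 Sobel kernels (edge-replicated border)."""
    p = np.pad(gray.astype(float), 1, mode="edge")
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]) \
       - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:])
    return np.hypot(gx, gy)

def enhance_contours(saliency, gray, edge_thresh=0.5, boost=0.3):
    """Raise saliency on contour pixels (additive interpretation, assumed)."""
    mag = sobel_magnitude(gray)
    if mag.max() > 0:
        mag = mag / mag.max()
    contours = mag > edge_thresh          # binary contour map
    out = saliency.astype(float).copy()
    out[contours] += boost                # increment on contour pixels only
    return np.clip(out, 0.0, 1.0), contours

# Toy image with a vertical step edge between columns 2 and 3.
gray = np.zeros((6, 6))
gray[:, 3:] = 1.0
saliency = np.full((6, 6), 0.5)
enhanced, contours = enhance_contours(saliency, gray)
```

In the toy example only the two columns adjacent to the step are marked as contour and receive the increment; the flat regions keep their original saliency.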

The second filter applies a nonlinear transformation to the saliency map, increasing the
difference between background pixels and those belonging to a subject. After this
transformation, less informative pixels are truncated using a fixed threshold. The nonlinear
transformation is:





(1)

Finally, the third filter is a Gaussian filter that smooths the refined saliency map. Figure 2
shows the result of these filters applied to the saliency map of the two deer.

3.2 Image Cropping

Based on the refined saliency map we define a

FullyAutomaticSaliency-basedSubjectsExtractioninDigitalImages

131

cropping algorithm based on a greedy search. A greedy algorithm attempts to construct a
pseudo-optimal solution by starting from an initial partial solution and extending it until a
stopping criterion is met. The extensions do not consider all possible solutions, because this
could be very expensive. Only some solutions are considered, in particular those that are close
to the current partial solution, so that the pseudo-optimal solution is reached in small steps.

These extensions of a partial solution, which could be called "local", are small, and the
possible extensions are relatively few. Among the various local extensions the algorithm chooses
the most "greedy", that is, the most convenient one. The most promising extension, which is a
locally optimal solution, is therefore selected as the new partial solution.

The saliency-based cropping algorithm is initialized with a partial solution obtained by cutting
a box of fixed size centered at the saliency peak (5x5 pixels is the proposed initial box). The
size of the box is then increased at each iteration, choosing among the candidate partial
solutions the one that contains the highest percentage of saliency.

Partial solutions are selected so that each iteration yields a frame with increasing saliency.
Candidate solutions are computed by changing two factors:

1. The center of the frame. The center of the subject does not necessarily coincide with the
saliency peak, so the increase in saliency obtained by moving the center of the box along the
diagonal directions is evaluated.

2. The size of the box. In most cases the optimal crop is not a square, so growth is evaluated
separately in the width and height directions.

3.3 Stopping Criterion

The stopping criterion determines whether the current partial frame contains a sufficient part
of the salient area of the subject and can therefore be taken as the final rectangle. At each
iteration the difference between the saliency of the last frame and that of the new proposed
frame is evaluated, and the search is stopped when the increase in saliency falls below a fixed
threshold close to zero. An alternative stopping criterion evaluates the sum of the saliency
inside the frame and stops when it exceeds a threshold.
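The greedy search and the first stopping criterion can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the step size of one pixel, the relative gain threshold `min_gain`, the rule of picking the candidate with the largest saliency sum, and the helper names are all choices made here. Boxes are expressed as (top, left, bottom, right) with exclusive bottom and right.

```python
import numpy as np

def box_sum(integral, top, left, bottom, right):
    """Saliency inside rows [top, bottom) and columns [left, right)."""
    return (integral[bottom, right] - integral[top, right]
            - integral[bottom, left] + integral[top, left])

def greedy_crop(saliency, init=5, step=1, min_gain=1e-3):
    """Grow a box from the saliency peak until the gain becomes negligible."""
    h, w = saliency.shape
    # Integral image (zero first row and column) for O(1) box sums.
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)
    total = integral[h, w]  # assumed to be positive

    py, px = map(int, np.unravel_index(np.argmax(saliency), saliency.shape))
    half = init // 2
    box = (max(py - half, 0), max(px - half, 0),
           min(py + half + 1, h), min(px + half + 1, w))
    best = box_sum(integral, *box)

    while True:
        t, l, b, r = box
        # Factor 2: grow the box along the height or the width direction.
        grown = [(max(t - step, 0), l, b, r), (t, l, min(b + step, h), r),
                 (t, max(l - step, 0), b, r), (t, l, b, min(r + step, w))]
        # Factor 1: move the center along the four diagonal directions.
        shifted = [(t + dy, l + dx, b + dy, r + dx)
                   for dy in (-step, step) for dx in (-step, step)
                   if t + dy >= 0 and l + dx >= 0
                   and b + dy <= h and r + dx <= w]
        cand = max(grown + shifted, key=lambda c: box_sum(integral, *c))
        gain = box_sum(integral, *cand) - best
        # Stop when the increase of saliency is close to zero.
        if gain / total < min_gain:
            return box
        box, best = cand, best + gain

# Toy map with a single rectangular spot of saliency.
saliency = np.zeros((40, 40))
saliency[10:20, 15:30] = 1.0
crop = greedy_crop(saliency)
```

Because a candidate is accepted only when it strictly increases the enclosed saliency by at least `min_gain` of the total, the loop is guaranteed to terminate. On a map that has been truncated by the second filter, growth stops as soon as the box reaches the zero-valued background around the spot.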

Figure 3: Query image and subjects detected.

4 SUBJECT DETECTION

The proposed method can be used to identify subjects in images, if present. For this problem,
images can be divided into two categories, as in Huang et al. (2010):

• Salient images: containing one or more easily identified subjects, placed on a relatively
uniform background.

• Cluttered images: subjects are difficult to distinguish, with a disordered background and a
high presence of noise and details. The extracted saliency map can be influenced by
uninformative parts of the background or by details.

The subject extraction problem can be stated as follows: given an image, the aim is to find a
rectangle containing a subset of the image where the subject is visible. Using the saliency
value to evaluate how informative the different areas of the image are, the crop rectangle must
meet two conditions: it must be small and it must contain most of the subject. These two
conditions usually conflict with each other; the goal is therefore to find a pseudo-optimal
rectangle that trades off the two constraints.

4.1 Subjects Extraction

An image can contain more than one salient subject, but the proposed algorithm extracts a single
entity starting from a saliency peak. It is therefore possible to reuse the same algorithm to
extract the other subjects in the picture, by ignoring the part selected for the first subject
and considering the remaining saliency map.

The map is then processed again by the algorithm, which identifies another saliency peak and
extracts another frame. This technique leads to the extraction of all salient areas of the image
related to the subjects. The operation is repeated until the saliency remaining after the last
cut is less than 5% of the initial saliency. If we apply this algorithm to the image of the two
deer we

SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications

132

obtain two frames containing the two subjects. Figure 3 shows the previews: each one contains
its subject completely, and the two frames isolate the subjects from the background.
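The iterative extraction can be sketched as a loop around any single-subject cropping routine. In the sketch below the cropping routine is passed in as `crop_fn`; the stand-in `window_around_peak`, which simply cuts a fixed window around the current peak, is a hypothetical placeholder for the greedy cropping of Section 3.2, and the `max_subjects` safety cap is also an assumption. Only the 5% termination rule comes from the text.

```python
import numpy as np

def window_around_peak(saliency, half=3):
    """Hypothetical stand-in for the cropping step: fixed window on the peak."""
    h, w = saliency.shape
    py, px = map(int, np.unravel_index(np.argmax(saliency), saliency.shape))
    return (max(py - half, 0), max(px - half, 0),
            min(py + half + 1, h), min(px + half + 1, w))

def extract_subjects(saliency, crop_fn, min_remaining=0.05, max_subjects=20):
    """Extract boxes until less than 5% of the initial saliency is left."""
    work = saliency.astype(float).copy()
    initial = work.sum()
    boxes = []
    while (initial > 0 and work.sum() / initial >= min_remaining
           and len(boxes) < max_subjects):
        top, left, bottom, right = crop_fn(work)
        boxes.append((top, left, bottom, right))
        # Ignore the selected part when looking for the next subject.
        work[top:bottom, left:right] = 0.0
    return boxes

# Two separate spots, each with a distinct peak at its center.
saliency = np.zeros((30, 30))
saliency[5:10, 5:10] = 0.5
saliency[7, 7] = 1.0
saliency[20:25, 18:23] = 0.4
saliency[22, 20] = 0.9
boxes = extract_subjects(saliency, window_around_peak)
```

Zeroing the selected rectangle is one simple way to "ignore" an extracted subject; it guarantees that the next peak is found elsewhere in the map.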

Figure 4: An imprecise saliency map leads to an incorrect extraction.

4.2 Incomplete Preview

In some cases the composition and structure of the object are not accurately captured by the
saliency map. This is due to the nature of the saliency map, which is linked exclusively to
low-level features and does not take into account the semantics of the image. The saliency map
can therefore be wrong: parts of the main subject may go undetected because they are considered
not salient, and parts of the background may be considered an integral part of a subject. These
two phenomena reduce the precision of the saliency map and are reflected in the results of
cropping, making the previews inaccurate. In some cases the computed previews include parts of
the background; although the subject remains recognizable, the frame is not accurate. In other
cases the previews contain only a part of the subject, namely the detail that is most dissimilar
from the rest of the image. In these cases the saliency has a sharp peak and the rest of the
subject is considered uninformative. Figure 4 illustrates these cases. The saliency is not
distributed precisely on the subject, since it also covers an area of the background close to
the white insect. This moves the crop from the subject towards the background and compromises
the correctness of the final frame. The resulting preview does not contain the whole subject,
but only a part of it together with part of the background.

4.3 Results Metrics

To evaluate the results of subject detection two factors are considered: the number of subjects
detected and the accuracy of the crop. It is necessary first to evaluate how many of the
subjects present in the image were identified by the algorithm, and then to assess whether the
computed crop contains the subject entirely or only in part.

We used two different datasets. A dataset of 20 images with salient subjects was selected to
assess the estimation of the number of subjects. Approximately 80% of the total number of
interesting subjects are correctly estimated. The method tends not to overestimate the number of
interesting subjects: when the detection is wrong, the estimated number of subjects is usually
lower than the real one. Out of the 20 images, the estimated number of salient subjects was
larger than the real one in only one case.

For the evaluation of the accuracy of the crop, some pictures of the dataset presented in Wang et
al. (2012) have been used. Each image of the dataset contains a main subject, and a frame
containing it has been annotated by hand. Figure 5 shows an example of this annotation.

Figure 5: The blue box is the result of the proposed

algorithm, the white box is the manual crop of the subject.

It is therefore possible to measure how similar the frame identified by the algorithm, R_alg, is
to the annotated frame, R_not, using the expressions:

$$\mathrm{Precision} = \frac{\mathrm{Area}(R_{alg} \cap R_{not})}{\mathrm{Area}(R_{alg})}$$

$$\mathrm{Recall} = \frac{\mathrm{Area}(R_{alg} \cap R_{not})}{\mathrm{Area}(R_{not})}$$

where precision is the fraction of the area of R_alg that is contained in R_not, and recall is
the fraction of the annotated frame R_not that is covered by R_alg. In the experiments precision
has an average value of 0.68 and recall of 0.75.
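For axis-aligned frames both measures reduce to a rectangle intersection. The following sketch shows the computation; the (left, top, right, bottom) box convention, the function names and the example coordinates are assumptions made for illustration and are not taken from the dataset.

```python
def area(box):
    """Area of a (left, top, right, bottom) box; empty boxes have area 0."""
    left, top, right, bottom = box
    return max(right - left, 0) * max(bottom - top, 0)

def intersection(a, b):
    """Overlap of two boxes, possibly empty."""
    return (max(a[0], b[0]), max(a[1], b[1]),
            min(a[2], b[2]), min(a[3], b[3]))

def precision_recall(r_alg, r_not):
    """Precision and recall of the computed frame against the annotation."""
    inter = area(intersection(r_alg, r_not))
    return inter / area(r_alg), inter / area(r_not)

# Hypothetical frames: the algorithm's crop and the hand annotation.
r_alg = (10, 10, 60, 50)   # 50 x 40 = 2000 pixels
r_not = (20, 20, 80, 60)   # 60 x 40 = 2400 pixels
precision, recall = precision_recall(r_alg, r_not)
# The overlap is (20, 20, 60, 50), i.e. 40 x 30 = 1200 pixels,
# so precision = 1200 / 2000 and recall = 1200 / 2400.
```

A crop that is too large lowers precision, while a crop that cuts off part of the subject lowers recall.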

The two measures tend to conflict with each other and can be combined into a single quantity,
called F-measure, defined as:

$$F_{\alpha} = \frac{(1 + \alpha) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\alpha \cdot \mathrm{Precision} + \mathrm{Recall}} \qquad (2)$$

where, for our purposes, α is set to 0.5. On a dataset of 40 images, an average value of
F_0.5 = 0.66 has been achieved.
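As a worked example, the sketch below evaluates the F-measure assuming the weighted form commonly used in salient object detection, F_α = (1 + α) · Precision · Recall / (α · Precision + Recall); the function name is illustrative. Plugging in the average precision and recall reported above gives a value of about 0.70, which differs from the reported average of 0.66, presumably because the latter is averaged over per-image F-measures.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted F-measure (assumed form); returns 0 when both inputs are 0."""
    denom = alpha * precision + recall
    if denom <= 0:
        return 0.0
    return (1 + alpha) * precision * recall / denom

# F-measure of the *average* precision and recall reported in the text.
f_of_averages = f_measure(0.68, 0.75)   # 0.765 / 1.09
```

With α = 0.5 precision weighs more than recall, so a crop that includes extra background is penalized more than one that slightly truncates the subject.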

All the datasets used for the experiments will be available online soon.

5 THUMBNAILING

A modified version of the proposed algorithm can also be used for image thumbnailing. In subject
extraction the focus is on finding spots in the refined saliency map, so the stopping criterion
aims to find the rectangle that contains a single spot, and the process is repeated until all
subjects are extracted. For thumbnailing the goal is to find a single rectangle containing the
most recognizable part of the original image. It is then sufficient to resize this selected part
of the original image to obtain a smart preview.

Figure 6: Unconstrained crop.

The stopping criterion for this problem is based on thresholding the ratio between the sum of the
saliency inside the proposed crop and the sum of the saliency over the entire map. As the
threshold goes from low to high values, the proposed cropping rectangle grows and contains a
larger share of the salient part of the initial image. Figure 6 shows the results for different
threshold values: the higher the threshold, the larger the selected thumbnail, starting from a
close view of the subject.

With the proposed algorithm, the result of the crop is a rectangle whose width-to-height ratio is
not fixed. If a particular aspect ratio is needed for the preview image, for example in an image
browsing application, the greedy search policy can be changed to obtain a fixed aspect ratio. For
the example in Figure 7 and in the following section, a modified version that produces only
square crops is used.

Figure 7: Square crop.

Visually analyzing the result of thumbnailing we

found that a good value for the threshold is 0.5.
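The ratio-based stopping criterion with a square crop can be sketched as follows. This is a deliberately simplified variant: the square is kept centered on the saliency peak (clamped to the image) and grown by two pixels per step, without the diagonal center shifts of the greedy search. The function name, the growth step and the returned (top, left, side) convention are assumptions made for illustration.

```python
import numpy as np

def square_thumbnail_crop(saliency, threshold=0.5, init=5):
    """Grow a square around the peak until it holds `threshold` of the saliency."""
    h, w = saliency.shape
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = saliency.cumsum(axis=0).cumsum(axis=1)
    total = integral[h, w]  # assumed to be positive

    py, px = map(int, np.unravel_index(np.argmax(saliency), saliency.shape))
    side = init
    while True:
        side = min(side, h, w)
        # Keep the square inside the image while staying close to the peak.
        top = min(max(py - side // 2, 0), h - side)
        left = min(max(px - side // 2, 0), w - side)
        inside = (integral[top + side, left + side] - integral[top, left + side]
                  - integral[top + side, left] + integral[top, left])
        # Stop when the crop/total saliency ratio reaches the threshold,
        # or when the square cannot grow any further.
        if inside / total >= threshold or side == min(h, w):
            return top, left, side
        side += 2

# Toy map: a 10x10 spot with a distinct peak near its center.
saliency = np.zeros((50, 50))
saliency[20:30, 20:30] = 1.0
saliency[25, 25] = 2.0
tight = square_thumbnail_crop(saliency, threshold=0.5)
loose = square_thumbnail_crop(saliency, threshold=0.9)
```

As described above, raising the threshold produces a larger crop: in the toy example the 0.5 threshold yields a 9-pixel square and the 0.9 threshold an 11-pixel square.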

5.1 Yahoo! Thumbnails Comparison

The results of the proposed method were compared with the thumbnails produced by Yahoo!
(www.yahoo.com), which shows square previews as the result of an image search. Figure 8 shows
the difference between the thumbnails for the

SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications

134

images obtained using the keyword "Animals". Column (a) shows the original images, column (b)
the previews extracted from Yahoo!, and column (c) the thumbnails produced by the proposed
method.

The proposed algorithm generally focuses on the salient part, keeping a reduced portion of the
original image, while Yahoo! seems to simply crop the image.

In general, this representation is useful when the previews are used to search for details and
content of pictures, or for a known image. A large amount of background in the preview reduces
the information shown. In a personal photo collection, for example, having a salient, zoomed-in
part of an image can lead to faster visual retrieval of the image.

Figure 8: Comparison of Yahoo! thumbnails and the result of the proposed method.

6 CONCLUSIONS

We presented a simple technique to automatically

extract relevant information from digital images

using a bottom-up visual saliency model.

The technique has been applied to detect the most relevant subjects in the foreground, also in
the case of cluttered backgrounds. Performance was evaluated on a small annotated image dataset
that we prepared for our experiments and that will be available for download for future
comparisons.

A variation of the same technique has also been demonstrated on the problem of smart
thumbnailing, that is, the creation of thumbnails that keep most of the information of the
original image using a much smaller number of pixels. A qualitative comparison with Yahoo!
thumbnails has shown the validity of our approach in this case as well.

In both cases we observed that filtering the

saliency map significantly increases the

performance.

Finally, even though a top-down, model-based approach might often be preferable, we have shown
that, in many cases, a simple data-driven approach can be satisfactory.

REFERENCES

Harel, J., Koch, C. and Perona, P., 2006. Graph-Based

Visual Saliency. Proceedings of Neural Information

Processing Systems (NIPS).

Marchesotti, L., Cifarelli, C. and Csurka, G., 2009. A framework for visual saliency detection
with applications to image thumbnailing. Xerox Research Centre Europe (XRCE), France.

Suh, B., Ling, H., Bederson, B.B. and Jacobs, D.W., 2003. Automatic Thumbnail Cropping and its
Effectiveness. Proceedings of the 16th annual ACM symposium on User interface software and
technology.

Liu, L., Chen, R., Wolf, L. and Cohen-Or, D., 2010. Optimizing Photo Composition. Computer
Graphics Forum, vol. 29, n. 2, pp. 469-478, Wiley Online Library.

Zhang, M., Zhang, L., Sun, Y., Feng, L. and Ma, W., 2005. Auto Cropping for Digital Photographs.
IEEE International Conference on Multimedia and Expo (ICME).

Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X.

and Shum, H.-Y., 2007. Learning to detect a salient

object. IEEE Trans. Pattern Anal. Mach.

Cheng, M.-M., Zhang, G.-X., Mitra, N. J., Huang, X. and

Hu, S.-M., 2011. Global contrast based salient region

detection. IEEE Conference on Computer Vision and

Pattern Recognition (CVPR).

Wang, L., Xue, J., Zheng, N. and Hua, G., 2011.

Automatic salient object extraction with contextual

cue. IEEE Conference on Computer Vision (ICCV).

FullyAutomaticSaliency-basedSubjectsExtractioninDigitalImages

135

Alexe, B., Deselaers, T. and Ferrari, V., 2010. What is an

object? IEEE Conference on Computer Vision and

Pattern Recognition (CVPR).

Feng, J., Wei, Y., Tao, L., Zhang, C. and Sun, J., 2011. Salient object detection by
composition. IEEE Conference on Computer Vision (ICCV).

Huang, J., Yang, X., Zhang, R. and Fang, X., 2010. Re-Ranking Image Search Results by Multiscale
Visual Saliency Model. IEEE International Symposium on Broadband Multimedia Systems and
Broadcasting (BMSB).

Itti, L., Koch, C. and Niebur, E., 1998. A model of saliency-based visual attention for rapid
scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, n. 11,
pp. 1254-1259.

Samadani, R., Lim, S. H. and Tretter, D., 2007.

Representative image thumbnails for good browsing.

IEEE International Conference on Image Processing

(ICIP).

Avidan, S. and Shamir, A., 2007. Seam carving for content-aware image resizing. ACM Transactions
on Graphics.

Simakov, D., Caspi, Y., Shechtman, E. and Irani, M.,

2008. Summarizing visual data using bidirectional

similarity. IEEE Conference on Computer Vision and

Pattern Recognition (CVPR).

Liu, F. and Gleicher, M., 2005. Automatic image

retargeting with fisheye-view warping. ACM

symposium on User interface software and technology

(UIST).

Rother, C., Kumar, S., Kolmogorov, V. and Blake, A.,

2005. Digital tapestry. IEEE Conference on Computer

Vision and Pattern Recognition (CVPR).

Wang, P., Wang, J., Zeng, G., Feng, J., Zha, H. and Li, S., 2012. Salient Object Detection for
Searched Web Images via Global Saliency. IEEE Conference on Computer Vision and Pattern
Recognition (CVPR).

SIGMAP2013-InternationalConferenceonSignalProcessingandMultimediaApplications

136