Human-centered Region Selection and Weighting for Image Retrieval

Jean Martinet

LIFL/CNRS-UMR 8022, Université Lille 1, Lille, France

Keywords:

Gaze Tracking, Image Indexing and Retrieval, Weighting Scheme, Human-centered Computing.

Abstract:

We present an application of gaze tracking to image and video indexing, in the form of a model for selecting and weighting Regions of Interest (RoIs). Image/video indexing refers to the process of creating a synthetic representation of the media, for instance for retrieval purposes. It usually consists of labeling the media with semantic keywords describing its content. When automated, this process is based on the analysis of visual features, which can be extracted either from the whole image or keyframe, or locally from regions. Since most of the time the whole image is not relevant for indexing (e.g. large flat regions with no specific semantic interpretation, blurred regions, or background regions that may not be relevant for retrieval purposes and should be filtered out), it is preferable to concentrate the labeling process on specific RoIs that are considered representative of the scene, such as the main subjects. The objective of the work presented here is to take advantage of natural human gaze information in order to define a human-centered Region of Interest selection and weighting technique in the context of media retrieval.

1 INTRODUCTION

Automatic image annotation refers to the process of labeling image content, e.g. with semantic keywords, generally for retrieval purposes. This process is based on the analysis of image features, which can be extracted either from the whole image or from image regions. Since most of the time the whole image is not relevant for indexing, it is preferable to concentrate the analysis on so-called Regions of Interest (RoIs), which represent the most important, or most representative, parts of the picture, such as the subject.

With the recent development of low-cost gaze tracking devices, the possibility of taking advantage of the information conveyed by gaze has opened many research directions: in image compression, where users' gaze is used to set variable compression ratios at different places in an image; in marketing, for detecting products of interest to customers; in civil security, for detecting drowsiness or lack of concentration in persons operating machinery such as motor vehicles or air traffic control systems; and in human-computer interaction. In the latter, for instance, the user's gaze is used as an input device complementary to traditional ones such as a mouse and a keyboard, in particular for users with disabilities. The goal of our work is to use gaze information as an input for developing a human-centered Region of Interest selection and weighting technique. We present in the remainder of this paper some preliminary work in this direction.

The paper is organized as follows. Section 2 gives

an overview of region selection and weighting tech-

niques. Section 3 presents a brief history of research

related to gaze tracking. We describe in Section 4

how RoIs are deﬁned by gathering gaze information

from several persons. The next step is to deﬁne an

importance measure for RoIs in an image, so that cor-

responding labels can be weighted (Section 5). Fi-

nally, we describe how to apply and integrate this

model in a retrieval system in Section 6. We report and discuss experimental results in Sections 7 and 8, and Section 9 concludes this contribution, including some directions for further work.

2 IMAGE REGION SELECTION/WEIGHTING

While keyword selection and weighting techniques are widely used in text retrieval systems, such techniques are seldom applied to image retrieval. The importance of image regions has nevertheless been considered in some publications. Osberger and Maeder (Osberger and Maeder, 1998) defined a way to identify perceptually important regions in an image based on human visual attention and eye movement characteristics. In a similar way, Itti and Koch (Itti et al., 1998) developed a visual attention system for scene analysis based on the early primate vision system. Later, Stentiford (Stentiford, 2003) applied visual attention to similarity matching. Wang (Wang et al., 2001; Wang and Du, 2001) introduced in SIMPLIcity the region frequency and inverse picture frequency (rf·ipf), a region-based measure for image retrieval purposes inspired by the tf·idf weighting scheme for text retrieval. Jing et al. (Jing et al., 2002) introduced a region weighting scheme based on users' relevance feedback information. The work presented here aims at using users' natural gaze information for image region selection and weighting.

Martinet J.

Human-centered Region Selection and Weighting for Image Retrieval.

DOI: 10.5220/0004348707290734

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2013), pages 729-734

ISBN: 978-989-8565-47-1

© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

3 GAZE TRACKING

The analysis of gaze has been studied for over a cen-

tury in several disciplines, including physiology, psy-

chology, psychoanalysis, and cognitive sciences. The

purpose is to analyze eye saccades and ﬁxations of

persons watching a given scene, in order to extract

several kinds of information.

3.1 Techniques and Systems

In order to detect and track users' gaze, it is necessary to employ a gaze tracking device that is able to determine the fixation point of a user on a screen from the position of their eye. The earliest gaze trackers were very intrusive. In addition to constraining the user to remain totally static, they were in direct contact with the user, either through a reflective white dot stuck directly onto the eye or through a number of electrodes attached around the eye.

Nowadays, the most accurate gaze tracking systems generally consist of head-mounted devices, which allow detecting the direction of the gaze without having to cope with the pose of the user's head. These trackers are also intrusive; they are generally composed of three cameras mounted on a padded headband (two eye cameras with built-in light sources to allow binocular eye tracking, and one camera to allow accurate tracking of the user's point of gaze).

Non-intrusive gaze tracking systems usually require a static camera capturing the face of the user and detecting the direction of their gaze with respect to a known position. A basic gaze tracker system is composed of a static camera, a display device, and software providing an interface between them. The precision of the system can be increased in different ways, for instance by adding a specific light source such as an infrared beam in order to create reflections on the eye and produce more accurate tracking.

3.2 Visual Attention

During visual perception, human eyes move and successively fixate on the most informative parts of the image (Yarbus, 1967). While RoIs can be defined in various ways according to the requirements, the presented indexing model is based on visual attention. Attention is the cognitive process of selectively concentrating on one aspect of the environment while ignoring other things. For images and videos, visual attention is at the core of visual perception, because it drives the gaze to salient points in the scene.

4 IDENTIFICATION OF REGIONS OF INTEREST

Gaze trackers provide the horizontal and vertical coordinates of the point of regard relative to the display device. Thus, one can easily obtain a sequence of points corresponding to the sampled positions of the eye direction. These points are triplets of the form (x, y, t) and reflect the scan path. The scan path consists of a sequence of eye fixations (the gaze is kept still in a location, yielding regions with a high density of points), separated by eye saccades (fast movements, yielding large spaces with only a few isolated points).

4.1 Fixations and Saccades

An essential part of scan path analysis and processing is the identification of fixations and saccades. Indeed, fixations and saccades are often used as the basic source for the various metrics used for interpreting eye movements (number of fixations, number of saccades, duration of the first fixation, average amplitude of saccades, etc.) (Jacob and Karn, 2004; Poole et al., 2004).

The most widespread identification technique computes the velocity of each point (defined as the angular speed of the eye in degrees per second). With a uniform sampling rate, the velocity of a point corresponds to the distance that separates it from its predecessor or successor. Points are separated into fixation points and saccade points by using a velocity threshold. Successive points labeled as fixation points are then grouped into what can be considered an eye fixation. Another threshold, on the minimal duration of a fixation, allows eliminating insignificant groups.
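The velocity-threshold identification described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the function name `identify_fixations` and its parameters are hypothetical, and the velocity is expressed in screen units per time unit rather than in degrees per second (the conversion would require the viewing distance).

```python
from math import hypot


def identify_fixations(samples, velocity_threshold, min_duration):
    """Group gaze samples into fixations with a velocity threshold.

    samples: list of (x, y, t) triplets with strictly increasing t.
    velocity_threshold: speed under which a sample is labeled as a
        fixation point (hypothetical parameter, screen units per time unit).
    min_duration: minimal duration for a group to be kept as a fixation.
    Returns a list of fixations, each a list of (x, y, t) samples.
    """
    fixations, current = [], []

    def close_group():
        # keep the current group only if it lasts long enough
        if current and current[-1][2] - current[0][2] >= min_duration:
            fixations.append(list(current))
        current.clear()

    for prev, cur in zip(samples, samples[1:]):
        distance = hypot(cur[0] - prev[0], cur[1] - prev[1])
        velocity = distance / (cur[2] - prev[2])
        if velocity < velocity_threshold:
            if not current:
                current.append(prev)  # the slow pair starts a new group
            current.append(cur)
        else:
            close_group()  # a saccade point ends the current group
    close_group()
    return fixations
```

With a fixed sampling rate, the time difference between consecutive samples is constant, so thresholding the velocity amounts to thresholding the distance between successive points.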

4.2 Point Clustering

After combining the data collected from several participants by merging all their fixations into a single set, a clustering process reduces the spatial distribution of fixations to a limited set of clusters, which define the RoIs (see Figure 1).

Figure 1: Clustered fixation points highlight conspicuous locations in the scene, making it possible to define the Regions of Interest.

The clustering is achieved with unsupervised techniques, such as K-means, because the spatial distribution of points from all scan paths is unknown and their number is finite.
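As an illustration of this clustering step, the following sketch applies a plain K-means (Lloyd's algorithm) to the merged fixation points. It is a minimal sketch written with the standard library only; the function name `kmeans`, the number of iterations, and the random initialization with a fixed seed are assumptions made for the example, not choices reported in this paper.

```python
import random


def kmeans(points, k, iterations=50, seed=0):
    """Cluster 2D points into k groups with Lloyd's K-means algorithm.

    points: list of (x, y) fixation coordinates merged from all participants.
    k, iterations, seed: illustrative parameters (assumptions).
    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize on k input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        new_labels = []
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            new_labels.append(distances.index(min(distances)))
        # update step: move each centroid to the mean of its points
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[c] = (sum(x for x, _ in members) / len(members),
                                sum(y for _, y in members) / len(members))
        if new_labels == labels:
            break  # assignments are stable: the algorithm has converged
        labels = new_labels
    return centroids, labels
```

K-means requires the number of clusters to be chosen beforehand; small clusters can then be filtered out, as done in Section 7.2.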

Once a set of RoIs is identified for an image, the corresponding objects can be labeled, and the labels can be weighted according to the importance of their region in the image. The final objective is to integrate this generated index into a retrieval system.

5 REGION OF INTEREST IMPORTANCE

In the context of image indexing for retrieval purposes, a central problem is to assign weights to index items in order to give emphasis to important ones, quantifying how well they describe and represent documents (Martinet et al., 2008). For instance, in text retrieval, popular weighting schemes such as tf × idf are based on two sources, one representing the local importance of the word in the document, and the other representing the global importance of the word in the whole collection (Baeza-Yates and Ribeiro-Neto, 1999; Salton, 1971). The importance measure of an RoI in an image can be interpreted in the same way as the tf measure of a word in a text document.

5.1 Cluster Importance Criteria

With the main Regions of Interest defined as described in Section 4, we are interested in identifying criteria to estimate their relative importance, based on the characteristics of the clusters. Inspired by (Nguyen et al., 2006), we identify the following measures:

• The Cardinal $C(K_i)$ of a Cluster: the cardinal of a cluster $K_i$ measures the number of gaze points that belong to the cluster. This measure is related to the duration of gaze within the cluster when a uniform gaze sampling is recorded. Hence, it represents the total time spent viewing/gazing at the corresponding region. The cluster importance is conjectured to be proportional to the cardinal $C(K_i)$: $\mathrm{importance}(K_i) \propto C(K_i)$. The cardinal $C(K_i)$ of a cluster is defined as:

$$C(K_i) = \sum_{p \in K_i} 1 \tag{1}$$

where $p$ denotes the successive recorded gaze points from all participants, and $n$ is the total number of recorded points.

• The Surface $S(K_i)$ of a Cluster: the surface of a cluster measures the area (in pixels) of the region covered by gaze points in the cluster. The cluster importance is conjectured to be inversely proportional to the surface $S(K_i)$. Indeed, for a given number of gaze points, a smaller surface indicates a higher concentration of points: $\mathrm{importance}(K_i) \propto \frac{1}{S(K_i)}$. The surface $S(K_i)$ of a cluster is given by the surface of its convex hull:

$$S(K_i) = \mathrm{convexHullSurface}(\{p \mid p \in K_i\}) \tag{2}$$

• The Variance $V(K_i)$ of a Cluster: the variance of gaze points from their cluster mean. The cluster importance is conjectured to be inversely proportional to the variance $V(K_i)$: $\mathrm{importance}(K_i) \propto \frac{1}{V(K_i)}$. It is defined by the following equation:

$$V(K_i) = \sum_{p \in K_i} (p - \mu_i)^2 \tag{3}$$

where $\mu_i$ represents the centroid of $K_i$.

• The time-weighted Visit Count $W(K_i)$ of a Cluster: the time-weighted visit count is used to weight the count (or duration) of the $k$-th cluster visited by the inverse of the cluster visit time.

$$W(K_i) = \sum_{p \in K_i} \frac{1}{\mathrm{rank}_p} \tag{4}$$

where $\mathrm{rank}_p$ denotes the cluster visit rank of point $p$ in the scan path (i.e. 1 for the first visited cluster, 2 for the second, and so on). This measure gives more importance to the first sampled points.

• The Revisit Count $R(K_i)$ of a Cluster: the revisit count measures the number of saccade revisits to a given cluster during the scan path. The revisiting of the fixation-saccade path to regions in an image has been found to be a fundamental property of eye movements. The cluster importance is conjectured to be proportional to the number of cluster revisits: $\mathrm{importance}(K_i) \propto R(K_i)$. It is defined by the following equation:

$$R(K_i) = \mathrm{card}(\{p \mid (p \in K_i) \wedge (\mathrm{rank}_p > \mathrm{rank}_i^{\min})\}) \tag{5}$$

where $\mathrm{card}(X)$ is the cardinal of the set $X$, and $\mathrm{rank}_i^{\min}$ is the lowest rank of visit to cluster $K_i$; this counts in the cluster only the points recorded after the first visit to the cluster.
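The five criteria above can be computed from a labeled scan path, as in the following sketch. It is an illustration under several assumptions that are not stated in this paper: gaze points are already assigned to clusters, a visit is taken to be a maximal run of consecutive points falling in the same cluster (which defines the visit rank), and the convex hull area is obtained with the monotone chain algorithm followed by the shoelace formula. The function names `hull_area` and `cluster_criteria` are hypothetical.

```python
from collections import defaultdict


def hull_area(points):
    """Area of the convex hull of 2D points (monotone chain + shoelace)."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]

    hull = half_hull(pts) + half_hull(pts[::-1])
    twice_area = sum(a[0] * b[1] - b[0] * a[1]
                     for a, b in zip(hull, hull[1:] + hull[:1]))
    return abs(twice_area) / 2.0


def cluster_criteria(scan_path):
    """Compute C, S, V, W and R for each cluster of a scan path.

    scan_path: time-ordered list of (x, y, cluster_id) gaze points.
    Returns {cluster_id: {"C": ..., "S": ..., "V": ..., "W": ..., "R": ...}}.
    """
    members, ranks = defaultdict(list), defaultdict(list)
    rank, previous = 0, object()
    for x, y, cid in scan_path:
        if cid != previous:  # entering a cluster starts a new visit
            rank += 1
            previous = cid
        members[cid].append((x, y))
        ranks[cid].append(rank)
    result = {}
    for cid, pts in members.items():
        mx = sum(x for x, _ in pts) / len(pts)
        my = sum(y for _, y in pts) / len(pts)
        first_visit = min(ranks[cid])
        result[cid] = {
            "C": len(pts),                                    # Equation (1)
            "S": hull_area(pts),                              # Equation (2)
            "V": sum((x - mx) ** 2 + (y - my) ** 2 for x, y in pts),  # (3)
            "W": sum(1.0 / r for r in ranks[cid]),            # Equation (4)
            "R": sum(1 for r in ranks[cid] if r > first_visit),  # (5)
        }
    return result
```

Clusters with fewer than three distinct points have a zero surface in this sketch, so they need special care if the surface is later inverted.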

5.2 Criteria Combination

The application of gaze tracking to image indexing is done under the hypothesis that the importance of an image region is related to its conspicuity. The importance of an RoI is calculated on the basis of the measures presented above: the total importance of an RoI is obtained by combining their normalized values.

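$$\mathrm{importance}(K_i) = k \times \left( \alpha_1 \frac{C(K_i)}{\sum_j C(K_j)} + \alpha_2 \frac{S(K_i)}{\sum_j S(K_j)} + \alpha_3 \frac{V(K_i)}{\sum_j V(K_j)} + \alpha_4 \frac{W(K_i)}{\sum_j W(K_j)} + \alpha_5 \frac{R(K_i)}{\sum_j R(K_j)} \right) \tag{6}$$

where $k = \frac{1}{\alpha_1 + \cdots + \alpha_5}$ is a normalizing constant, and $(\alpha_1, \ldots, \alpha_5)$ is the vector of parameters allowing to set the relative importance of each criterion.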
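The combination of normalized criteria can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function name `combine_criteria` and the dictionary-based layout are hypothetical, and the way the inversely related criteria (surface and variance) enter the sum is left to the caller, who may for instance pass their reciprocals; that choice is an assumption of the example.

```python
def combine_criteria(criteria, alphas):
    """Combine per-cluster criteria into one importance score per cluster.

    criteria: {criterion_name: {cluster_id: value}}, with the same cluster
        identifiers under every criterion.
    alphas: {criterion_name: weight}, the relative importance of each
        criterion.
    Each criterion is normalized by its sum over all clusters, weighted by
    its alpha, summed, and divided by the sum of the alphas.
    """
    k = 1.0 / sum(alphas.values())  # normalizing constant
    cluster_ids = list(next(iter(criteria.values())))
    scores = {}
    for cid in cluster_ids:
        total = 0.0
        for name, values in criteria.items():
            denominator = sum(values.values())
            if denominator:  # a criterion that is zero everywhere is skipped
                total += alphas[name] * values[cid] / denominator
        scores[cid] = k * total
    return scores


# Example with two clusters and two criteria (illustrative values only)
example = combine_criteria(
    {"C": {"a": 30, "b": 10}, "R": {"a": 1, "b": 3}},
    {"C": 3.0, "R": 1.0},
)
```

Because each criterion is normalized by its sum over clusters and the weights are normalized by their sum, the resulting scores add up to 1 over the clusters when no criterion is skipped.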

6 APPLICATION TO IMAGE INDEXING AND RETRIEVAL

Image indexing consists of creating a synthetic representation of an image, e.g. by assigning to images semantic descriptors (keywords) describing their content. There are several ways to automatically generate semantic descriptors for an image. The most popular techniques are based on a learning process, using Support Vector Machines for instance. In this process, several samples of a given concept are used to feed a training algorithm, and the resulting model is later used to detect the presence of the concept in a test image.

Based on the previous definition of RoIs, the annotation process can be concentrated on RoI locations, since they represent the main areas that people perceive in the media. As a consequence, the indexing tool focuses on the most important regions in the image, by extracting visual features (such as color, texture, etc.) preferably from these specific locations, and by generating labels to be assigned to them together with the estimated importance value. At query time, the search engine then compares the query features with region features, taking their importance into account. This makes the search process more efficient.

The importance of a given index term for an image is related to the aboutness of the image with respect to the term. The use of RoI importance as a weighting method for index terms is driven by the assumption that if most users concentrate their attention on a specific object, then the aboutness of the image with respect to this object is high.

We can extend the indexing and retrieval with semantics. In this case, we apply a weighting scheme to the semantic concepts associated with the most important RoIs. As indicated in (Salton, 1971), weighting schemes are often based on the formula tf × idf. The importance of a label (or index term) attached to an RoI is considered the visual counterpart of the tf measure for text. The idf value can be computed in the traditional way (Baeza-Yates and Ribeiro-Neto, 1999).
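The weighting of labels can be illustrated with the following sketch, where the RoI importance plays the role of tf. The function name `weight_index` and the data layout are hypothetical, and the idf is taken as log(N / n_t), with N the number of images and n_t the number of images containing label t; this is one common variant, chosen here as an assumption since this paper does not fix a specific formula.

```python
from math import log


def weight_index(collection):
    """Weight labels by RoI importance (used as tf) times idf.

    collection: {image_id: {label: importance}}, where importance is the
        RoI importance of the label in that image.
    idf is computed as log(N / n_t) (assumed variant).
    Returns {image_id: {label: weight}}.
    """
    n_images = len(collection)
    document_frequency = {}
    for labels in collection.values():
        for label in labels:
            document_frequency[label] = document_frequency.get(label, 0) + 1
    return {
        image_id: {
            label: importance * log(n_images / document_frequency[label])
            for label, importance in labels.items()
        }
        for image_id, labels in collection.items()
    }
```

With this variant, a label that appears in every image of the collection receives a zero weight, whatever its importance within a given image.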

7 EXPERIMENTS

We have carried out experiments to validate our approach. We describe in this section the experimental settings and some preliminary results.

7.1 Settings

In our experiments, we used a single-camera gaze tracker setting based on the Pupil-Centre/Corneal-Reflection (PCCR) method to determine the gaze direction. The video camera is located below the computer screen, and monitors the subject's eyes. No attachment to the head is required, but the head still needs to be motionless. A small low-power infrared light emitting diode (LED) is embedded in the infrared camera and directed towards the eye. The LED generates the corneal reflection and causes the bright pupil effect, which enhances the camera's image of the pupil. The centers of both the pupil and the corneal reflection are identified and located, and trigonometric calculations allow projecting the gaze point onto the image.

Most recent systems still suffer from the following problems: (1) the need for calibration for each session, (2) the strong restriction on head motion, especially for single-camera systems, and (3) the limitation to tracking a single user. The main additional difficulty in a single-camera setting is determining the distance of the user from the camera, since triangulation as in a multi-camera setting cannot be carried out without calibration.

7.2 Preliminary Results

Given this experimental setting, we recorded the gaze of 10 persons participating in our test session. Participants were shown images successively and asked to watch the presented scenes attentively. All gaze information was recorded and processed according to the description of Section 5.

Figure 2: Example of a photograph from our database, dis-

playing a scan path (left), and showing the processed clus-

ters as superimposed disks, with their estimated importance

denoted by the size of the disks (right).

Figure 2 shows an example of an image from our database. It is a photograph showing a landscape. The upper part displays a scan path, composed of several points of regard, each being linked to the previous one to represent the path. The lower part shows the processed clusters, represented in the image by superimposed disks, with their estimated importance denoted by the size of the disks. Note that after filtering out small clusters, the algorithm kept 8 locations. Figure 3 shows a similar example with a painting. The painting is The Raft of the Medusa by the French painter Théodore Géricault. Here, after filtering out small clusters, the algorithm kept 9 locations.

Figure 4 presents the result of our approach for another painting, and also shows how a 3D interest map can be automatically generated from gaze information. These results illustrate how our approach takes advantage of natural human gaze. We have collected gaze information from participants, grouped the points into clusters, and determined locations of interest in images with their respective importance. The next step is to integrate this automatically generated index into a retrieval system in order to demonstrate the impact of our approach, in particular for the task of query-by-example retrieval.

Figure 3: Another example with a painting: scan path (left)

and processed clusters with their importance (right).

Figure 4: Another example of painting from our database,

displaying a scan path (upper left), the processed clusters

and their importance (upper right), and also an automati-

cally generated 3D interest map (below).

8 DISCUSSION

For large collections of images, building RoIs from users' gaze information requires the availability of gaze data from several users for each image in the collection. We could imagine a scenario in the near future where the gaze can be estimated from a simple webcam, so that it is possible to collect users' gaze information while they are searching or browsing image collections on the Web. An alternative solution is to use saliency maps (Itti et al., 1998; Itti and Koch, 1999) to simulate this information. Saliency maps have been widely used in the computer vision field to solve the attention selection problem. The purpose of saliency maps is to represent the conspicuity (as a scalar value) of each location of the image. From such maps, it is possible to simulate users' gaze, and to generate RoIs together with their importance scores as described in previous sections.
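One simple way to simulate gaze points from a saliency map is sketched below. It is only an illustration: the function name `simulate_gaze` is hypothetical, the saliency map is assumed to be already computed and given as a 2D list of non-negative values, and drawing each location with a probability proportional to its saliency is an assumed sampling rule, not the method of the cited works.

```python
import random


def simulate_gaze(saliency, n_points, seed=0):
    """Draw simulated gaze points from a saliency map.

    saliency: 2D list of non-negative conspicuity values, saliency[y][x].
    n_points: number of simulated gaze points to draw.
    Each location is drawn with probability proportional to its saliency
    (illustrative sampling rule, an assumption of this sketch).
    Returns a list of (x, y) points.
    """
    rng = random.Random(seed)
    locations = [(x, y)
                 for y, row in enumerate(saliency)
                 for x in range(len(row))]
    weights = [value for row in saliency for value in row]
    return rng.choices(locations, weights=weights, k=n_points)
```

The simulated points can then be clustered and scored in the same way as recorded fixations, although criteria that depend on the temporal order of visits are not meaningful for independently drawn points.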

Active research in computer vision is aimed at developing gaze tracking software operating from a simple webcam. From an accurate face detection, it is possible to determine the position of the eyes, and therefore to find the iris. Eye detection can be based on (among other methods) template matching, appearance classification, or feature detection. In template matching methods, a generic eye model is first created based on the eye shape, and a template matching process is then used to search for eyes in the image. Appearance-based methods detect eyes based on their appearance, using a classifier trained on a large number of image patches representing the eyes of several users under different orientations and illumination conditions. Feature detection methods explore the visual characteristics of the eyes (such as edges, iris intensity, or color distributions) to identify distinctive features around the eyes.

9 CONCLUSIONS

One of the most difficult problems in content-based image indexing and retrieval is the automatic identification of regions of interest in images. The difficulty is related to the subjective semantics associated with a region. This difficulty is the main reason to consider a semi-automatic approach as the most realistic way to extract regions of interest from images for indexing and retrieval. The more image regions are semantically labeled, the better the quality of indexing. The semi-automatic approach requires users to cooperate with algorithms to determine regions of interest. Generally, experts use graphical tools to determine these regions. This task, although popular, is time consuming when considering huge quantities of images.

Exploiting the information carried in natural human gaze is an interesting approach to efficiently determine potential semantic regions with almost no human effort. Our model is based on the use of successive fixations and saccades from people watching the media. It processes these data in order to determine clusters of points, and to extract several metrics for estimating the importance of areas in the image. The metrics are: the cardinal, the variance, the surface, the time-weighted visit count, and the revisit count. All these metrics are combined into a single estimator of the importance of image regions, in a human-centered fashion. The use of this natural information is an effective way of dealing with the high number of images in searched collections, which is a crucial issue in indexing.

Although we have concentrated our study on static images/keyframes, video indexing and retrieval, where the quantity of data is very high, can also benefit from this approach. Extending our work to the specificities of video will require facing other challenging problems related to the temporal aspect of the data.

REFERENCES

Baeza-Yates, R. and Ribeiro-Neto, B. (1999). Modern Information Retrieval. Addison-Wesley.

Itti, L. and Koch, C. (1999). Learning to detect salient objects in natural scenes using visual attention. In Image Understanding Workshop.

Itti, L., Koch, C., and Niebur, E. (1998). A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11):1254–1259.

Jacob, R. J. and Karn, K. S. (2004). Eye tracking in human-computer interaction and usability research: Ready to deliver the promises. In The Mind's Eyes: Cognitive and Applied Aspects of Eye Movements. Elsevier Science, Oxford, UK.

Jing, F., Li, M., Zhang, H.-J., and Zhang, B. (2002). Learning region weighting from relevance feedback in image retrieval. In Proc. of the 27th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).

Martinet, J., Satoh, S., Chiaramella, Y., and Mulhem, P. (2008). Media objects for user-centered similarity matching. Multimedia Tools and Applications, Special Issue on Multimedia Semantics.

Nguyen, A., Chandran, V., and Sridharan, S. (2006). Gaze tracking for region of interest coding in JPEG 2000. Signal Processing: Image Communication, 21(5):359–377.

Osberger, W. and Maeder, A. J. (1998). Automatic identification of perceptually important regions in an image using a model of the human visual system. In International Conference on Pattern Recognition, Brisbane, Australia.

Poole, A., Ball, L. J., and Phillips, P. (2004). In search of salience: A response-time and eye-movement analysis of bookmark recognition. In Conference on Human-Computer Interaction (HCI), pages 19–26.

Salton, G. (1971). The SMART Retrieval System. Prentice Hall.

Stentiford, F. (2003). An attention based similarity measure with application to content based information retrieval.

Wang, J. Z., Li, J., and Wiederhold, G. (2001). SIMPLIcity: Semantics-sensitive Integrated Matching for Picture LIbraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963.

Wang, J. Z. and Du, Y. (2001). Rf x ipf: A weighting scheme for multimedia information retrieval. In ICIAP, pages 380–385.

Yarbus, A. L. (1967). Eye Movements and Vision. Plenum Press, New York.

VISAPP2013-InternationalConferenceonComputerVisionTheoryandApplications

734