RELEVANCE FEEDBACK AS AN INTERACTIVE NAVIGATION TOOL

Daniele Borghesani, Costantino Grana and Rita Cucchiara

Università degli Studi di Modena e Reggio Emilia, Modena e Reggio Emilia, Italy

Keywords:

CBIR, HCI, Artistic Collections, Relevance Feedback.

Abstract:

Image collections are searched in common retrieval systems in many different ways, but the typical presentation is a grid-style view. In this paper we suggest a novel use of relevance feedback as a tool to warp the view, allowing the user to navigate the image collection spatially while staying focused on his retrieval aim. This is obtained by applying a distance-based space warping to the 2D projection of the distance matrix.

1 INTRODUCTION

The growing availability of multimedia content, es-

pecially pictures, and the proposal of increasingly ef-

ﬁcient content analysis techniques led to the devel-

opment of impressive multimedia systems in recent

years. Nevertheless there is a signiﬁcant gap between

the research view of such systems and the user per-

spective, which is strongly inﬂuenced by the way in-

formation is presented. This led to a standardization

of the interface solutions, based on their success on

the market.

Let’s look at the way in which image library tools

usually present information to the user. For exam-

ple, in desktop applications like iPhoto or Picasa, im-

ages are classiﬁed using default metadata like GPS,

tags possibly associated to pictures, and time stamps.

With this simple information, the system can perform

an automatic grouping of data to assist the user in

the process of management of his library. Another quite standard functionality is filtering, using either metadata or, very rarely, rough visual information based on color. All these functionalities ultimately rely on a very standard grid-based layout, which is very familiar to most users but, especially in a similarity retrieval context, can be considered a bad design choice, since it erases all the similarity relations (connections) between images. The

same kind of problem is clearly recognizable in all

the most used web search engines, like Google, Bing

or Yahoo. Only very recently has Google introduced visual search capabilities (subject to feature precomputation), which seem quite good at retrieving specific objects but behave much less accurately on average. In the majority of situations, instead, we cannot search using an image as a query, but we need

to start from a standard textual input. Secondly, when

we have the resulting list of pictures, a grid-based lay-

out is proposed to the user. In every case the modus

operandi is just the same: look and scroll for more

images. We believe that this approach is essentially flawed, for two main reasons: it does not convey visual feedback about the content of the collection, and it does not dynamically react to the feedback of the user.

In this paper, we propose an easy solution to bridge this interface gap. Starting from a solid set of content analysis and indexing techniques (which can possibly be designed to fit large scale requirements), we propose relevance feedback not only as an effective tool to improve the raw performance of the retrieval system, but mainly as a means to help the user navigate the collection, especially when no metadata are available or when the search intentions cannot be easily expressed as textual queries.

In this way, we want to facilitate the user in the pro-

cess of manipulation of the information: by visually

surﬁng through images, the user can build connec-

tions and feel emotionally involved in the navigation

experience, using the relevance feedback to warp the

space around his needs, quickly learning the results

content and possibly moving to a destination he did

not even think about when he started. We believe that,

in the near future, the similarity search will have a key

role in the market, not to substitute search by text but, more importantly, to complement it. It will probably be the next feature of image library management tools and web search engines, complementing other research efforts focusing on classification, annotation and so on.

Borghesani D., Grana C. and Cucchiara R.

RELEVANCE FEEDBACK AS AN INTERACTIVE NAVIGATION TOOL.

DOI: 10.5220/0003858700540059

In Proceedings of the International Conference on Computer Vision Theory and Applications (VISAPP-2012), pages 54-59

ISBN: 978-989-8565-04-4

© 2012 SCITEPRESS (Science and Technology Publications, Lda.)

2 BACKGROUND

The problem of image retrieval is two-fold. In the ﬁrst

place, we need fast and effective techniques to convey

visual similarity to the user. In the second place, we

need an effective technique to allow the user to man-

age the results.

Regarding the ﬁrst problem, a great amount of lit-

erature has been proposed. Among it, we think that

the natural choice is a global feature representation,

providing a compact summary by aggregating some

information extracted at every pixel location of the

image. The bag-of-words approach, a global representation built from local features like SIFT (Lowe, 2004) or SURF (Bay et al., 2008) clustered into a visual dictionary, is generally considered the state of

the art. For a complete comparison of performance of

local features in CBIR, please refer to (Mikolajczyk

and Schmid, 2005). Most of these local descriptors

use luminance information only. Nevertheless, both color and shape are widely considered important visual characteristics in a cognitive context, so an interesting way to account for this information is the covariance region descriptor proposed by Tuzel et al. (Tuzel et al., 2008), which aggregates the correlations of a custom number of elementary sources

of information (like color, shape, spatial information,

gradients). Moreover, great interest has been devoted to the GIST feature, a statistical summary of the spatial layout properties (Spatial Envelope representation) of the

scene (Oliva and Torralba, 2006).
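As a minimal sketch of the bag-of-words idea described above, the snippet below quantizes local descriptors against a visual dictionary and builds a normalized histogram. The function name `bow_histogram`, the toy 2-D descriptors (standing in for real SIFT or SURF vectors) and the hand-written codebook are assumptions made for illustration; in practice the codebook would come from clustering descriptors extracted from the collection.

```python
from math import dist  # Euclidean distance, Python 3.8+


def bow_histogram(descriptors, codebook):
    """Quantize local descriptors against a visual dictionary and return
    an L1-normalized histogram of visual-word counts."""
    counts = [0] * len(codebook)
    for d in descriptors:
        # Index of the nearest codeword (visual word) for this descriptor.
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Toy data (hypothetical): 2-D points stand in for local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bow_histogram(descriptors, codebook)
```

Two images can then be compared through any histogram distance computed on their `hist` vectors.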

To solve the second problem, a presentation strategy is required, as pioneered by Rennison (Rennison, 1994). The classical spatial arrangement of

images is their placement on a grid, typically in row-

major ordering based on relevance. Despite its sim-

plicity, this visualization is unable to convey infor-

mation on the structure of the collection, for example

the availability of a cluster of similar images. As described in (Heesch, 2008), alongside more standard approaches based on static hierarchies or clustering, the main approaches are built around network-based or dimensionality-reduction-based representations. Multi-Dimensional Scaling (MDS) solves a

non linear optimization problem by determining the

mapping that best approximates the high-dimensional

pairwise distances between data points. One of the

initial proposals was the Sammon mapping (Sammon, 1969). An interesting proposal of this kind is the Hyperbolic-MDS by (Walter, 2004), which exploits the hyperbolic space $\mathbb{H}^2$ to map the most significant images in the center of the projection (thus visualizing them in greater detail) while displacing the others along the curve $\mathbb{H}^2$, falling towards infinity at a smaller scale; moreover, this projection has the advantage of allowing the view to be focused on different points by applying the Möbius transformation. A number of other non-linear projections have

been proposed to solve the prohibitive computational

costs, for example the isometric mapping (ISOMAP)

(Tenenbaum et al., 2000), the stochastic neighbor em-

bedding (SNE) (Hinton and Roweis, 2002) and the

local linear embedding (LLE) (Roweis and Lawrence,

2000). An older yet effective approach, especially in

large scale contexts, is ﬁnally the FastMap (Faloutsos

and Lin, 1995) which exploits a set of pivot objects to

project points in the reduced space. This technique,

exploited also in this paper, has the advantage of allowing a fast insertion of new objects into the map.
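For reference, the optimization target of the Sammon mapping mentioned above is usually written as the stress below. The notation is an assumption introduced here: $d^{*}_{ij}$ denotes the distance between objects $i$ and $j$ in the original feature space and $d_{ij}$ the distance between their projections.

```latex
E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^{2}}{d^{*}_{ij}}
```

Dividing each term by $d^{*}_{ij}$ gives more weight to small original distances, so local neighborhoods are preserved more faithfully than large-scale structure.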

3 RELEVANCE FEEDBACK FOR IMAGE SURFING

The ﬁrst task in image searching on large scale col-

lections is clearly managing the scalability problem.

Many techniques for approximate nearest neighbor (ANN) search, from LSH (Andoni and Indyk, 2006) to product quantization (Jégou et al., 2011), greatly improve performance by using vocabulary codes (with precomputed

distances) in place of real features. Moreover image

search based on contextual information (as done by

all search engines) proves to be deﬁnitely effective.

The real limitation of today's multimedia systems lies in the interaction possibilities.

The most important way in which the user can

help the system cross the semantic gap and interact

with the retrieval results, i.e. the relevance feedback,

becomes ﬁrst of all prohibitive in large scale contexts.

Just consider the usual approaches: query point move-

ment (QPM), feature space warping (FSW) or ma-

chine learning approaches (Chang et al., 2009). QPM

notoriously suffers from slow convergence, and does not

guarantee to ﬁnd intended targets; a fast QPM tech-

nique, trying to ﬁx this problem, has been proposed

by (Liu et al., 2009). FSW requires a full space re-encoding and, to the best of our knowledge, no proposal takes FSW into account in large scale scenarios.

Finally the learning is notoriously a heavy procedure,

often requiring an ofﬂine processing and hardly capa-

ble of producing real time results. Moreover, the rel-

evance feedback is proposed to the user as a tedious

procedure (as well as the annotation) to overcome the

limitations of the system itself, which could be con-

sidered an admission of poor quality.

Nevertheless, the ability to guide the system to-

wards the desired result needs to be considered as

an important feature. The user himself implicitly de-

mands this kind of capability, because visual similar-

ity is mostly helpful when the user does not clearly

know or is not capable of expressing the subject of

his search: as a matter of fact, if he could, he would

type the precise query on the search engine. This is

even more true when the user is approaching the im-

age collection for fun or curiosity: in this scenario the

user is mainly interested in surﬁng through pictures

being guided by his emotional preferences, using vi-

sual cues as exploration rails. In the meantime, new

and reﬁned results could be suggested by the retrieval

system, adjusting his search goal.

In order to satisfy all these requirements, we need

to visualize the effect of relevance feedbacks from the

original feature space into the two-dimensional map-

ping. This procedure allows the system to show to the

user a real-time feedback of his manipulations, bring-

ing him into the collection itself.

We need to provide the user with a ﬁrst 2D visual-

ization of his query results. The technique used in this

step is FastMap, due to its high performance and the

ability to quickly include new points to the map with-

out recomputing the entire mapping. This algorithm

brieﬂy works as follows (Faloutsos and Lin, 1995).

Firstly, two distant-enough objects, the pivots $O_a$ and $O_b$, are chosen with a heuristic approach. Given a distance function $D()$ between each pair of objects $O_i$ and $O_j$ in the feature space, each object $O_i$ is projected to object $O'_i$ on the line joining the pivots $(O_a, O_b)$ using the cosine law, obtaining the x coordinates. Then the y coordinate is computed using the distances $D'$ on the hyperplane perpendicular to the line $(O_a, O_b)$. These may be obtained from the original distance $D$ by means of Eq. 1:

\[
D'(O'_i, O'_j)^2 = D(O_i, O_j)^2 - (x_i - x_j)^2 \tag{1}
\]

When the process is completed, the pictures are vi-

sualized on the two-dimensional plane adjusting the

scale.
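The projection just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `fastmap_2d`, the pivot heuristic (start from the first object, take the farthest one, then the farthest from that) and the Euclidean toy data are choices made here. The residual distance used for the second axis follows Eq. 1.

```python
from math import dist, sqrt


def fastmap_2d(objects, distance=dist):
    """Project objects onto a 2D plane with two FastMap iterations."""
    n = len(objects)
    coords = [[0.0, 0.0] for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left after removing the axes already computed (Eq. 1).
        d2 = distance(objects[i], objects[j]) ** 2
        for k in range(axis):
            d2 -= (coords[i][k] - coords[j][k]) ** 2
        return max(d2, 0.0)  # clip tiny negative values caused by rounding

    def farthest_from(i, axis):
        return max(range(n), key=lambda j: residual_sq(i, j, axis))

    for axis in range(2):
        # Heuristic pivot choice (an assumption of this sketch): start from
        # object 0, take the farthest object, then the farthest from that one.
        b = farthest_from(0, axis)
        a = farthest_from(b, axis)
        d_ab_sq = residual_sq(a, b, axis)
        if d_ab_sq == 0.0:
            break  # nothing left to spread along this axis
        for i in range(n):
            # Cosine law: coordinate of object i along the pivot line.
            coords[i][axis] = (
                residual_sq(a, i, axis) + d_ab_sq - residual_sq(b, i, axis)
            ) / (2.0 * sqrt(d_ab_sq))
    return [tuple(c) for c in coords]


# Toy check: points that already live in a plane keep their pairwise distances.
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
projected = fastmap_2d(points)
```

For genuinely high-dimensional features the 2D distances only approximate the original ones. A new object can be placed by evaluating the same two cosine-law expressions against the stored pivots, without touching the other coordinates.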

When a query $O_q$ is selected by the user, the points

are adjusted in order to support the similarity ranking.

In particular the user requires a new projection which

better reﬂects the distances from the query, thus the

angle of points from the query is kept ﬁxed, while

the distance is scaled along the unit vector proportion-

ally to the ranking itself. In this way, the similar pic-

tures get closer to the query, while the dissimilar ones

are moved away. At this point, the user is focused

on the query itself (at the center of the screen) and

the most similar content within the results is placed

nearby, easily gathering his attention.
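The query-centered rearrangement can be sketched as follows, assuming the radius of each image is simply its rank multiplied by a constant. The function name `reproject_by_rank` and the `step` parameter are hypothetical; the text only states that the angle is kept and the distance is scaled proportionally to the ranking.

```python
from math import atan2, cos, sin, hypot


def reproject_by_rank(points, query_xy, feature_dists, step=1.0):
    """Keep each point's angle around the query, but set its radius
    proportional to its similarity rank (1 = most similar)."""
    # Indices sorted from the most to the least similar image.
    order = sorted(range(len(points)), key=lambda i: feature_dists[i])
    qx, qy = query_xy
    new_points = [None] * len(points)
    for rank, i in enumerate(order, start=1):
        angle = atan2(points[i][1] - qy, points[i][0] - qx)  # angle is preserved
        radius = rank * step                                 # radius follows the rank
        new_points[i] = (qx + radius * cos(angle), qy + radius * sin(angle))
    return new_points


# Toy data: 2D positions and feature-space distances to the query at the origin.
positions = [(2.0, 0.0), (0.0, 5.0), (-1.0, 0.0)]
moved = reproject_by_rank(positions, (0.0, 0.0), [0.9, 0.1, 0.5])
radii = [hypot(x, y) for x, y in moved]
```

In the toy data the second image is the most similar one, so it ends up closest to the query even though it started farthest away on the map.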

The user can now provide feedback on the results, highlighting what he likes (being more similar to the query he submitted) and what he dislikes (being different from what he expects). For each point $O_i$ in the results set, the system finds the nearest element of both the positive and the negative feedback sets (a process which can be eased with approximate search) and warps the space. In particular, given $f_p$, the distance of $O_i$ from its nearest good feedback (including the query image), and $f_n$, the distance from its nearest bad feedback, the system computes the distance for the projection $P$ between the query $O_q$ and $O_i$ as:

\[
P(O_q, O_i) = D(O_q, O_i) \left( 1 + \frac{f_p - f_n}{\max(f_p, f_n)} \right) \tag{2}
\]

The equation states that what is positive should be

moved towards the query, while what is negative

should be pushed away. The “positiveness” of an im-

age is related to how much more similar to a positive

than to a negative the image is. The images may now

be ranked according to the warped distances and the visualization is updated by moving the images along the

line which connects the points to the query in the 2D

plane. The new distances are ordered according to the

ranking.
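A minimal sketch of the warping of Eq. 2 is given below, writing f_p and f_n for the distances to the nearest positive and negative feedback (names chosen here for illustration). The function name `warp_distances`, the matrix layout and the behavior when no negative feedback exists are assumptions of this sketch, since the text does not cover that case.

```python
def warp_distances(D, q, positives, negatives):
    """Return the warped distance from the query q to every object,
    following Eq. 2. D is the original pairwise distance matrix and is
    never modified."""
    good = set(positives) | {q}  # the query counts as positive feedback
    bad = set(negatives)
    if not bad:
        # Assumption of this sketch: without negative feedback nothing is warped.
        return list(D[q])
    warped = []
    for i in range(len(D)):
        f_p = min(D[i][g] for g in good)  # nearest positive feedback
        f_n = min(D[i][b] for b in bad)   # nearest negative feedback
        m = max(f_p, f_n)
        factor = 1.0 if m == 0 else 1.0 + (f_p - f_n) / m  # lies in [0, 2]
        warped.append(D[q][i] * factor)
    return warped


# Toy data: five objects on a line; object 0 is the query.
line = [0, 1, 2, 10, 11]
D = [[abs(a - b) for b in line] for a in line]
warped = warp_distances(D, 0, positives=[2], negatives=[3])
ranking = sorted(range(len(D)), key=lambda i: warped[i])
```

Objects close to a positive sample get a factor below 1 and move towards the query, while objects close to a negative sample get a factor above 1 and move away. Because `D` is left untouched, calling the function again with a feedback removed reproduces the earlier distances.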

Compared with other relevance feedback ap-

proaches, this solution may perform worse with re-

spect to the global recall or precision. The real merit,

which becomes essential, regards the interface aspect:

in fact the changes induced to the ranking are limited

to the local neighborhood of the selected feedback el-

ement. In other words, only the points for which the

feedback is the nearest positive or negative feedback

are inﬂuenced, therefore a strong connection between

the visual mapping and the observed changes appears.

Moreover the use of a ranking based projection has

the effect of showing the similar images slowly ap-

proaching the query, thus the user’s attention focus.

The user is still allowed to move the images as he

feels like, implicitly asking to prevent the image from

being moved by the automatic positioning. Note that

the distance calculations are always performed on the

original distances, so removing a feedback allows to

step back to the previous position: this is an easy way

to “undo” the user’s choices.

4 SAMPLE APPLICATIONS

Figure 1 (panels a, b, c): Application example with the Google query “klimt”.

Figure 2 (panels a, b, c): Application example querying with one of the ImageCLEF images, specifically one representing sea pictures.

Figure 3 (panels a, b, c): Application example querying with one of the ImageCLEF images, specifically one representing sunset pictures.

One of the most immediate implementations of this approach is an image web search interface. Currently, Ajax-based interfaces are common, so the presentation could be further enhanced to support our

idea. The web server could provide image features or

they could be directly computed by the client interface

if they are simple enough. Suppose we use Google

Images to look at something related to a painter

(“klimt” could be our query term): the set of im-

ages is presented in a 2D mapping (Fig. 1(a)) and the

user quickly pans and zooms through the results, not

dissimilarly from what he would do scrolling down

the results page. After identifying some interesting

content, he can select it (Fig. 1(b)) and then better

convey his interest with further reﬁnement, which are

likely to attract other versions of the painting and sim-

ilar ones, while repelling unrelated results (Fig. 1(c)).

Note that this is deﬁnitely an ephemeral interest, ex-

actly related to the moment and the feelings of the

person: it is likely that a new object of interest gets

identiﬁed by the collection exploration, to start over

again.

The same approach can be extended quite easily to

surf through wide collections of images. The idea of allowing the user to zoom into the dataset and filter the pictures by perceptual similarity through interactive relevance feedback can be a key advantage in efficiently managing large amounts of data while focusing on the user’s search intentions. Let’s look, for example, at the dataset provided for the CLEF

Photo Annotation task (Nowak et al., 2011), aimed at

the automatic annotation of a large number of con-

sumer photos with multiple annotations. The user is

presented with the 2D mapping of the images, com-

puted in real-time by FastMap, and allowed to zoom

and navigate through it (Fig. 2(a), 3(a)). After identifying an interesting image (a sea landscape in Fig. 2,

a sunset in Fig. 3), the user selects it and the other

images are rearranged to convey their distance in the

feature space from the selected query (Fig. 2(b),3(b)).

This shows how well the 2D mapping is able to re-

spect the original distance matrix. Now the user may

simply select positive or negative samples, getting an

immediate feedback of the effect of his choice on the

mapping: selecting a negative feedback forces the im-

age and some other neighbors to be pushed away and

at the same time all the lower ranked images to be

dragged toward the query. The selection of a posi-

tive feedback “recalls” images from outside the cur-

rent view towards the query. Once the ﬁltering is

completed, the resulting images constitute a bucket of

interesting images to be used somehow or annotated

with a tag (Fig. 2(c),3(c)).

5 INTERACTION DESIGN REMARKS

Usually, the user interface design is considered the

last step to deal with in the development of a retrieval

application or a system, because the engine is considered the only real focus of the problem. Instead, we argue that in the near future of user-centric applications, the interface will become the key point of the

system.

A compelling user interface, containing useful and

engaging interaction paradigms, is a fundamental as-

pect for a multimedia system because it is the only

part of the system which will link directly to the user’s

emotion. For example, Jaimes and Sebe (Jaimes and

Sebe, 2007) show how to deal with user’s emotional

expressions as part of the data processing. These con-

cepts evolved through time becoming what generally

could be deﬁned as natural interaction (Baraldi et al.,

2009), exploiting means which are considered natu-

ral since they belong to the nature of human beings

themselves.

The simpler and the more natural (let’s say intuitive) the machine interaction is, the less cognitive effort is demanded of humans. Nevertheless

the design problem is remarkable. If we focus our at-

tention on functionalities to be provided to users to

accomplish some tasks, we risk losing the focus on

intuitiveness. On the other side, an extreme simpli-

ﬁcation can lead to poorly performing functionalities.

For this reason, we need to design these two aspects at

the same time, linking very closely the search engines

and the visualization techniques with the functionali-

ties. If we design an effortless interaction capable of

expressing naturally all the technically complex tasks

to search, visualize and browse, we are close to a nat-

ural multimedia system really centered on user’s de-

sires and therefore really useful to him.

The aim of natural interaction is therefore the design of an interaction system able to move away from computer-friendly interaction paradigms (like windows, menus, scrollbars, mice) towards more

human-friendly paradigms. In this context, very im-

portant roles are played by concepts like aesthetic

beauty, emotions and a playful dimension between the

user and the system; moreover, an intensive use of

animations and dynamic mathematical models is nec-

essary in order to link the virtual interface with real

life metaphors. Finally, the spatial organization of in-

formation is fundamental to improve content under-

standing, for example by clustering similar objects.

This proposal just moves towards this kind of in-

teraction. The image collection is not only a list of

images, but becomes a space to explore, reacting dy-

namically on the user’s preferences collected contin-

uously through relevance feedback. The entire sys-

tem can be easily improved with convenient multi-

touch gestures. The removal of one or more unde-

sired pictures can be triggered with swipe gestures,

while the pinch gesture can allow to zoom the col-

lection to focus on the individual pictures (or groups

of pictures). Groups of good or bad feedbacks can

be selected drawing circles around them. Once the

collection has been ﬁltered, according to the desired

predominant visual characteristic, a tag could be asso-

ciated to the resulting group of pictures, performing a

visually assisted tagging.

6 CONCLUSIONS

In this paper we introduced a novel proposal for the

presentation of image collections, obtained by query-

ing or similarity search. We believe that the combined

use of 2D mapping and relevance feedback allows the

user to better express his querying intention, and therefore to surf easily through the results.

This technique, however simple, could open

a wide range of improvements of today’s web search

engines and image collections management software.

For example, new results could be dynamically added

to the mapping, based on the already selected images,

thus formulating a new query based on the positive

and the negative selections. Moreover, visual similarity search can also be exploited to mine non-indexed content, using positive feedback as suggested

prototypes for the retrieval system. Finally, an inter-

esting possibility is the exploitation of such an inter-

active experience to collect user provided information

and thereby to improve the retrieval system itself.

REFERENCES

Andoni, A. and Indyk, P. (2006). Near-optimal hashing

algorithms for approximate nearest neighbor in high

dimensions. In IEEE Symposium on Foundations of

Computer Science, pages 459–468.

Baraldi, S., Bimbo, A. D., Landucci, L., and Torpei, N.

(2009). Natural interaction. In Encyclopedia of

Database Systems, pages 1880–1885. Springer.

Bay, H., Ess, A., Tuytelaars, T., and Van Gool, L. (2008).

Speeded-Up Robust Features (SURF). Comput Vis

Image Und, 110(3):346–359.

Chang, Y., Kamataki, K., and Chen, T. (2009). Mean shift

feature space warping for relevance feedback. In IEEE

Image Proc, pages 1849–1852.

Faloutsos, C. and Lin, K.-I. (1995). Fastmap: a fast al-

gorithm for indexing, data-mining and visualization

of traditional and multimedia datasets. In ACM SIG-

MOD International Conference on Management of

Data, pages 163–174.

Heesch, D. (2008). A survey of browsing models for

content based image retrieval. Multimed Tools Appl,

40:261–284.

Hinton, G. E. and Roweis, S. T. (2002). Stochastic neighbor

embedding. In Neu Inf Pro Syst, pages 833–840.

Jaimes, A. and Sebe, N. (2007). Multimodal human-

computer interaction: A survey. Comput Vis Image

Und, 108(1-2):116–134.

Jégou, H., Douze, M., and Schmid, C. (2011). Product

quantization for nearest neighbor search. IEEE T Pat-

tern Anal, 33(1):117–128.

Liu, D., Hua, K., Vu, K., and Yu, N. (2009). Fast query point

movement techniques for large cbir systems. IEEE

Transactions on Knowledge and Data Engineering,

21(5):729–743.

Lowe, D. G. (2004). Distinctive Image Features from Scale-

Invariant Keypoints. Int J Comput Vision, 60(2):91–

110.

Mikolajczyk, K. and Schmid, C. (2005). A performance

evaluation of local descriptors. IEEE T Pattern Anal,

27(10):1615–1630.

Nowak, S., Nagel, K., and Liebetrau, J. (2011). The clef

2011 photo annotation and concept-based retrieval

tasks. In Petras, V., Forner, P., and Clough, P. D., edi-

tors, CLEF (Notebook Papers/Labs/Workshop).

Oliva, A. and Torralba, A. (2006). Building the gist of a

scene: The role of global image features in recogni-

tion. Visual Perception, Progress in Brain Research,

155.

Rennison, E. (1994). Galaxy of news: an approach to visu-

alizing and understanding expansive news landscapes.

In ACM symposium on User interface software and

technology, pages 3–12.

Roweis, S. T. and Lawrence, K. (2000). Nonlinear dimen-

sionality reduction by locally linear embedding. Sci-

ence, pages 2323–2326.

Sammon, J. W. (1969). A nonlinear mapping for data struc-

ture analysis. IEEE T Comput, 18(5):401–409.

Tenenbaum, J. B., Silva, V., and Langford, J. C. (2000). A

Global Geometric Framework for Nonlinear Dimen-

sionality Reduction. Science, 290(5500):2319–2323.

Tuzel, O., Porikli, F., and Meer, P. (2008). Pedestrian De-

tection via Classiﬁcation on Riemannian Manifolds.

IEEE T Pattern Anal, 30(10):1713–1727.

Walter, J. A. (2004). H-mds: a new approach for interactive

visualization with multidimensional scaling in the hy-

perbolic space. Inform Syst, 29(4):273–292.
