In this paper we propose an image search system that combines keyword annotations, low-level visual metadata, and semantic inter-image relationships. The semantic relationships are learned exclusively from the human users’ interaction with the search system. The proposed system can be used to search very large (web-based) image sets more efficiently. Our system retrieves a larger set of candidate images in an initial phase. We use CBIR techniques not to search but to sort these images according to their visual similarity. This visually sorted arrangement allows many more images to be displayed simultaneously, so the user can very quickly identify images that are good candidates for the desired search result. In the next step these images serve as a visual filter for further result images. The filtering refines the search result and yields more images that are similar to the desired query. Our proposed system thus dramatically reduces the time needed for image retrieval.
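To make this retrieve-sort-filter flow concrete, the following minimal sketch outlines the three phases in Python. All names (keyword_search, extract_features, the feature store) are hypothetical placeholders; this is an illustration under those assumptions, not the authors’ implementation.

```python
import numpy as np

def refine_by_visual_filter(selected, pool, features, top_k=60):
    """Re-rank the remaining result pool by visual distance to the
    images the user marked as good candidates (illustrative sketch)."""
    ref = np.stack([features[i] for i in selected])
    def dist_to_filter(i):
        # distance to the closest user-selected candidate image
        return np.linalg.norm(ref - features[i], axis=1).min()
    return sorted(pool, key=dist_to_filter)[:top_k]

# Phase 1: retrieve a deliberately large initial result set by keyword
# (keyword_search and extract_features are hypothetical helpers):
#   results  = keyword_search("sunset", limit=500)
#   features = {i: extract_features(img) for i, img in enumerate(results)}
# Phase 2: sort the set visually (e.g. with a self-organizing map,
# cf. Section 1) so that many images can be inspected at once.
# Phase 3: the user picks good candidates, which act as a visual
# filter refining the remaining results:
#   refined = refine_by_visual_filter(selected=[3, 17, 42],
#                                     pool=range(500), features=features)
```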
However, the most important advantage of the new system is that it can learn semantic relationships between images semi-automatically from the users’ interaction with the system. These relationships are language independent and can be used to further improve the quality and effectiveness of the image search.
The rest of this paper is organized as follows: Section 1 reviews the principles and current approaches of content-based image retrieval systems and describes visual image sorting using self-organizing maps. Section 2 presents the proposed strategy and compares our scheme to other approaches. Section 3 describes implementation details and evaluates the new approach. We conclude the paper in Section 4.
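Since the visual sorting step is only named here, the sketch below shows one possible realization with a self-organizing map, using the third-party minisom library as an assumed implementation; the grid size and training parameters are illustrative, not values from the paper.

```python
import numpy as np
from minisom import MiniSom  # assumption: any SOM library (or a
                             # hand-rolled SOM) could be used instead

def sort_on_grid(features, grid_w=8, grid_h=8, iters=2000):
    """Map feature vectors onto a 2-D grid so that visually similar
    images end up close together (minimal sketch of SOM-based sorting)."""
    data = np.asarray(features, dtype=float)
    som = MiniSom(grid_w, grid_h, data.shape[1],
                  sigma=1.5, learning_rate=0.5, random_seed=0)
    som.train_random(data, iters)
    # each image is placed at the grid cell of its best-matching unit
    return [som.winner(v) for v in data]

# Example: 100 random 32-dimensional feature vectors on an 8x8 grid.
positions = sort_on_grid(np.random.rand(100, 32))
```

Note that several images may be mapped to the same grid cell; an actual display layout would resolve such collisions, for instance by moving each image to a free nearby cell.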
1.1 Content-based Image Retrieval
In order to avoid manual annotation and to automate the process of image retrieval, content-based image retrieval (CBIR) techniques have been developed since the early 1990s. A good overview of the current state of the art of CBIR can be found in [2]. CBIR systems use automatically generated low-level metadata (features) to describe the visual statistics of images, such as color, texture, and shape [1, 8].
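As a concrete example of such a low-level feature, the following sketch computes a normalized HSV color histogram with OpenCV and compares two images by histogram intersection. This is a generic illustration of the feature classes named above, not the specific descriptors of [1, 8], and the file names are placeholders.

```python
import cv2

def color_feature(path, bins=(8, 8, 8)):
    """Normalized 3-D HSV color histogram as a low-level feature vector."""
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # hue ranges over 0..180 in OpenCV, saturation/value over 0..256
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                        [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

# Similarity of two images via histogram intersection (higher = more similar).
f1, f2 = color_feature("a.jpg"), color_feature("b.jpg")
similarity = cv2.compareHist(f1, f2, cv2.HISTCMP_INTERSECT)
```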
Low-level CBIR systems are very well suited to finding images that share visual features. These systems rely on the assumption that similar images also have similar features. This assumption is correct in many cases; the converse, however, does not hold: similar features can come from very different images that do not share any semantic similarity (see figure 1, images a and b).
Despite intense research efforts, the results of CBIR systems have not reached the performance of text-based search engines. Several problems remain unsolved: the search for a particular image is difficult if no query image is available. Some approaches use manually drawn sketches; however, the visual features of these sketched images can differ significantly from those of “real” images.
Recent approaches such as SIFT use interest points that describe significant local features of an image [7]. Interest points have proven to be very effective at finding images containing identical objects even when the lighting conditions, the scale, and the viewing positions vary. However, even sophisticated CBIR systems using interest points cannot determine similarities between images that have similar semantic content but look different (see figure 1, images c and d).
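For illustration, the sketch below matches SIFT interest points between two images with OpenCV, using Lowe’s ratio test to keep only distinctive matches; the file names are placeholders and the ratio threshold is the commonly used value, not one taken from [7].

```python
import cv2

def count_sift_matches(path1, path2, ratio=0.75):
    """Count SIFT interest-point matches between two images,
    filtered with Lowe's ratio test (illustrative sketch)."""
    sift = cv2.SIFT_create()
    img1 = cv2.imread(path1, cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread(path2, cv2.IMREAD_GRAYSCALE)
    _, des1 = sift.detectAndCompute(img1, None)
    _, des2 = sift.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher()
    matches = matcher.knnMatch(des1, des2, k=2)
    # keep a match only if it is clearly better than the second-best one
    good = [p[0] for p in matches
            if len(p) == 2 and p[0].distance < ratio * p[1].distance]
    return len(good)
```

A high match count indicates the same object despite changes in lighting, scale, or viewpoint; a low count, however, says nothing about semantic similarity, which is exactly the limitation discussed above.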
The main problem of CBIR systems is the fact that there is an important (semantic) gap between the “content” that can be described with low-level visual features and the semantic meaning that a human associates with an image.