1.2 Overview of Our Approach
We propose an image-based feature space and a mapping of image regions and words into this space for object recognition. First, each image is segmented automatically into several regions, and a feature descriptor is calculated for each region. We then build a feature space in which each dimension corresponds to an image from the database. Finally, we define the mapping of image regions and labels into the space. The correspondence between regions and words is learned from their relative positions in the feature space.
The details of our algorithm are described in Section 2. Section 3 presents experimental results and discussion. Finally, we draw conclusions and give pointers to future work.
2 IMAGE SPACE BASED EMBEDDING OF REGIONS AND WORDS
In this section, we first describe how image segments can be represented by visual terms based on salient regions. We then propose how to embed image regions and words into an image-based feature space in order to find the relationships between words and image regions. Finally, a simple example is presented as an illustration of the algorithm.
2.1 Representing Image Regions by Salient Regions
Many different automatic image segmentation algorithms exist. In this work the Normalized Cuts framework (Shi and Malik, 2000) is used because it handles segmentation globally, which gives it a better chance than more local approaches of segmenting out whole objects.
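For illustration, the sketch below shows one way to obtain a Normalized Cuts style segmentation in Python with scikit-image. This is a superpixel, region-adjacency-graph approximation rather than the exact pixel-level formulation of Shi and Malik (2000), and all parameter values are illustrative; in older scikit-image releases the `graph` module lives under `skimage.future.graph`.

```python
# A minimal sketch, not the exact pixel-level algorithm of (Shi and
# Malik, 2000): scikit-image approximates Normalized Cuts by first
# over-segmenting into superpixels and then cutting their region
# adjacency graph. Parameter values are illustrative only.
from skimage import data, graph, segmentation

img = data.astronaut()  # any RGB test image

superpixels = segmentation.slic(img, n_segments=400, compactness=30,
                                start_label=1)
rag = graph.rag_mean_color(img, superpixels, mode='similarity')
segments = graph.cut_normalized(superpixels, rag)  # integer region labels
```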
Once images are segmented, a descriptor is calculated for each image segment. We follow the approach of Tang et al. (2006) in representing images by salient regions. Specifically, we first select salient regions using the method proposed by Lowe (2004), in which scale-space peaks are detected in a multi-scale difference-of-Gaussian pyramid. Lowe's SIFT (Scale Invariant Feature Transform) descriptor is used as the feature descriptor for the salient regions. The SIFT descriptor is a three-dimensional histogram of gradient location and orientation, constructed in such a way as to be relatively invariant to small translations of the sampling region, as might occur in the presence of imaging noise. The feature vectors are then quantised to map them from a continuous space into a discrete one: the k-means clustering algorithm is applied to the whole set of SIFT descriptors, and each cluster represents a visual word in the visual vocabulary. As a result, each image segment can be represented by a k-dimensional frequency vector, or histogram, of the visual words contained within the segment.
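To make this pipeline concrete, here is a minimal sketch using OpenCV's SIFT and scikit-learn's k-means. The function names, the boolean masks used to delimit segments, and the vocabulary size k are our own illustrative choices, not prescribed by the method.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(gray_images, k=500):
    """Cluster the pooled SIFT descriptors of all images into k visual words."""
    sift = cv2.SIFT_create()
    descs = [sift.detectAndCompute(g, None)[1] for g in gray_images]
    descs = [d for d in descs if d is not None]
    return KMeans(n_clusters=k, n_init=10).fit(np.vstack(descs))

def segment_histogram(gray, mask, vocab):
    """k-bin visual-word histogram of the salient regions inside one segment.

    `mask` is a boolean array marking the segment's pixels.
    """
    sift = cv2.SIFT_create()
    keypoints, descs = sift.detectAndCompute(gray, None)
    hist = np.zeros(vocab.n_clusters)
    if descs is None:
        return hist
    words = vocab.predict(descs)  # visual-word index of each keypoint
    for kp, w in zip(keypoints, words):
        x, y = map(int, kp.pt)
        if mask[y, x]:  # count only keypoints that fall in this segment
            hist[w] += 1
    return hist
```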
2.2 Image-Based Feature Mapping
We denote the images by $I_i$ ($i = 1, 2, \dots, N$, $N$ being the total number of images), and the $j$th segment in image $I_i$ by $I_{ij}$. For the sake of convenience, we line up all the segments in the whole set of images and re-index them as $I_t$ ($t = 1, 2, \dots, n$, $n$ being the total number of segments).

We define an image-based feature mapping $m$, which maps each image segment into a feature space $F$. The feature space $F$ is an $N$-dimensional space in which each dimension corresponds to an image from the data-set. The coordinates of a segment in $F$ are defined by the mapping $m$:
\[
m(I_t) = [\, d(I_t, I_1),\ d(I_t, I_2),\ \dots,\ d(I_t, I_N) \,] \tag{1}
\]
where $d(I_t, I_i)$ represents the coordinate of segment $I_t$ on the $i$th dimension, for which we use the distance of $I_t$ to image $I_i$. The distance of a segment to an image is defined as the distance to the closest segment within that image. Because the number of visual words in a single segment can vary from a few to thousands, the distance between two vectors/histograms $V_1$ and $V_2$, representing two segments, is measured by the normalised scalar product (cosine of the angle), $\cos(V_1, V_2) = \frac{V_1 \cdot V_2}{|V_1|\,|V_2|}$. Therefore, in this work we define
\[
d(I_t, I_i) = \max_{j = 1, \dots, n_i} \cos(I_t, I_{ij}), \tag{2}
\]
where $n_i$ is the number of segments in image $I_i$.
Intuitively, segments relating to the same objects or
concepts should be close to each other in the feature
space.
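As a concrete reading of Equations (1) and (2), the sketch below computes a segment's coordinates in $F$ from precomputed per-segment histograms; the data layout and function names are our own assumptions.

```python
import numpy as np

def cosine(v1, v2):
    """Normalised scalar product (cosine of the angle) of two histograms."""
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

def embed_segment(seg_hist, image_hists):
    """Coordinates of one segment in the image space F (Eqs. 1 and 2).

    `image_hists` is a length-N list whose ith entry holds the visual-word
    histograms of the n_i segments of image I_i (e.g. computed with
    segment_histogram above).
    """
    # d(I_t, I_i): cosine similarity to the closest segment within I_i.
    return np.array([max(cosine(seg_hist, h) for h in hists)
                     for hists in image_hists])
```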
We can also map the labels used to annotate the images into the space. Suppose the vocabulary of the data-set is $\{W_j\}$ ($j = 1, 2, \dots, M$, $M$ being the vocabulary size). The coordinate of a label on a particular dimension is determined by the image that dimension represents: if the image is annotated with the label, the coordinate is 1; otherwise it is 0. The mapping of words is therefore defined as
\[
m(W_j) = [\, d(W_j, I_1),\ d(W_j, I_2),\ \dots,\ d(W_j, I_N) \,] \tag{3}
\]
where
\[
d(W_j, I_i) =
\begin{cases}
1 & \text{if } I_i \text{ is annotated by } W_j, \\
0 & \text{otherwise.}
\end{cases} \tag{4}
\]
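A matching sketch of Equations (3) and (4) follows, together with one plausible way to read off region-word correspondence from relative positions in $F$; the specific matching rule (nearest label) is our illustrative choice, not fixed by the equations above.

```python
import numpy as np

def embed_word(word, annotations):
    """Binary coordinates of a label in F (Eqs. 3 and 4).

    `annotations` is a length-N list of label sets, one per image.
    """
    return np.array([1.0 if word in labels else 0.0 for labels in annotations])

# Illustrative matching rule: annotate a segment with the vocabulary word
# whose embedding lies closest to the segment's embedding in F.
def nearest_word(seg_coords, vocabulary, annotations):
    return min(vocabulary,
               key=lambda w: np.linalg.norm(seg_coords - embed_word(w, annotations)))
```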