values ∆x and ∆y represent the translation along the
x and y axes, respectively, and one value ∆θ represents
the rotation angle between these two patches.
The translation values are easily obtained by evaluat-
ing the differences between the x and y coordinates of
the centers of the patches. The value ∆θ can be obtained
from their respective 2×2 matrices Qₖ introduced in the
previous section. Indeed, these matrices represent the
rotation in the image space applied to each patch so
that they match the same position in the color space.
Consequently, if the patch Pa₀ is rotated by an angle θ₀
and the patch Pa₁ by an angle θ₁ in order to match the
same position, ∆θ is just the difference between these
two angles.
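As an illustration, here is a minimal sketch of this computation in Python, assuming each Qₖ is available as a 2×2 NumPy rotation matrix; the function names are ours, not from an existing implementation:

```python
import numpy as np

def rotation_angle(Q):
    # Recover the angle theta of a 2x2 rotation matrix
    # Q = [[cos t, -sin t], [sin t, cos t]] via atan2(sin t, cos t).
    return np.arctan2(Q[1, 0], Q[0, 0])

def delta_theta(Q0, Q1):
    # Relative rotation between the two patches, wrapped to [-pi, pi).
    d = rotation_angle(Q1) - rotation_angle(Q0)
    return (d + np.pi) % (2 * np.pi) - np.pi
```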
We therefore propose to create a 3D structure around
the patch Pa₀ whose axes are ∆x, ∆y and ∆θ, and to put
the value of the similarity between Pa₀ and Pa₁ in the
cell whose coordinates are {x₁ − x₀, y₁ − y₀, θ₁ − θ₀},
where xₖ and yₖ are the position in the image space of
the center of the patch Paₖ, k = 0 or 1.
Likewise, we can do the same for all the patches Paᵢ,
i > 0, around Pa₀ in order to represent the spatial
distribution of the self-similarities around Pa₀. For
this, the neighborhood of the patch Pa₀ is discretized
into a 5×5 grid and the angle axis is discretized into
4 values. If several similarities fall into the same cell,
only the maximal similarity is stored. This representation
is similar to that proposed by Shechtman and Irani
(Shechtman and Irani, 2007), but since we can find
similarities between patches with different orientations,
we have added a third dimension for the angle. The
dimension of the feature of Shechtman et al. was 4 radial
intervals × 20 angles = 80, while ours is 100 (5 ∆x
intervals × 5 ∆y intervals × 4 angles).
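As a concrete illustration, the following sketch accumulates similarities into such a 5×5×4 grid with max-pooling; the input format and the `radius` parameter are our own assumptions, not the original implementation:

```python
import numpy as np

def build_descriptor(matches, radius, n_xy=5, n_theta=4):
    # Accumulate self-similarities into an n_xy x n_xy x n_theta grid
    # over (dx, dy, dtheta), keeping the maximum similarity per cell.
    # `matches` holds (dx, dy, dtheta, sim) tuples, where (dx, dy) is
    # the offset of Pa_i from Pa_0 (|dx|, |dy| <= radius) and dtheta
    # in [-pi, pi) is their relative rotation angle.
    desc = np.zeros((n_xy, n_xy, n_theta))
    for dx, dy, dtheta, sim in matches:
        ix = min(int((dx + radius) / (2 * radius) * n_xy), n_xy - 1)
        iy = min(int((dy + radius) / (2 * radius) * n_xy), n_xy - 1)
        it = min(int((dtheta + np.pi) / (2 * np.pi) * n_theta), n_theta - 1)
        desc[ix, iy, it] = max(desc[ix, iy, it], sim)
    # Flattened feature of dimension n_xy * n_xy * n_theta = 100.
    return desc.ravel()
```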
5 EXPERIMENTS
5.1 Experimental Approach
The rotation invariance of our self-similarity having
been theoretically demonstrated, we propose to assess
the discriminative power of the final descriptor and
to compare it with the other self-similarities that are
classically used in many applications, as mentioned
in the introduction. For this purpose, we consider
the context of object classification by using the PAS-
CAL VOC 2007 dataset (Everingham et al.). This
dataset contains 9963 images representing 20 classes.
The aim of this experiment is not to get state-of-the-
art classification score on the considered dataset, but
rather to fairly compare the discriminative powers of
the different self-similarity descriptors. In order to
test our self-similarity, we propose to use the
Bag-of-Words approach, which is based on the following
successive steps (a minimal sketch of the whole pipeline
is given at the end of this subsection):
• keypoint detection (dense sampling is used for all
tested methods),
• local descriptor extraction around each keypoint
(we use the 3D structures presented in the previ-
ous section for our method),
• clustering in the descriptor space, the cluster rep-
resentatives are called visual words (k-means is
used with 400 words for all tested methods),
• in each image, each local descriptor is associated
with the nearest visual word,
• each image is characterized by the histogram of
visual words,
• learning on the training images and classification of
the test images (linear SVM is used for all tested
methods).
Furthermore, we propose to compare our results
with the local self-similarity descriptor (Shechtman
and Irani, 2007) and with the global self-similarity
descriptor (Deselaers and Ferrari, 2010). For both,
we use the codes provided by the authors. For the
global self-similarity descriptor, we have constructed
the SSH (with D₁ = D₂ = 10) from the image and
consider each grid cell as a feature. So, this feature is
also of dimension 100. For all the approaches, we re-
duce the dimension of the features to 32 by applying
PCA.
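As announced above, here is a minimal sketch of this pipeline (descriptor extraction excluded), using scikit-learn; the function and variable names are ours. Note that the actual PASCAL VOC protocol trains one binary classifier per class and reports average precision, which this single multi-class pass does not reproduce:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def bow_classify(train_descs, train_labels, test_descs,
                 n_words=400, n_dims=32):
    # train_descs / test_descs: one (n_i, d) array per image,
    # holding that image's local self-similarity descriptors.
    pca = PCA(n_components=n_dims).fit(np.vstack(train_descs))
    train_descs = [pca.transform(d) for d in train_descs]
    test_descs = [pca.transform(d) for d in test_descs]

    # Visual vocabulary: k-means with 400 words on the training descriptors.
    kmeans = KMeans(n_clusters=n_words).fit(np.vstack(train_descs))

    def histogram(descs):
        # Assign each descriptor to its nearest visual word,
        # then build the normalized word histogram of the image.
        h = np.bincount(kmeans.predict(descs), minlength=n_words)
        return h / max(h.sum(), 1)

    X_train = np.array([histogram(d) for d in train_descs])
    X_test = np.array([histogram(d) for d in test_descs])

    # Linear SVM on the word histograms.
    return LinearSVC().fit(X_train, train_labels).predict(X_test)
```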
5.2 Results
The results are shown in Figure 3. In this figure, BOCS
stands for Bag Of Correlation Surfaces (Deselaers and
Ferrari, 2010), BOLSS stands for Bag Of Local Self
Similarities, and BORISS, our proposed approach, stands
for Bag Of Rotation Invariant Self Similarities.
We can see that our approach outperforms the other
two for most of the 20 classes. Furthermore, the Mean
Average Precision (MAP) provided by our BORISS is 25%,
while it is around 18% for the two other approaches.
This experiment shows that self-similarities characterize
the content of an image more effectively when they are
invariant to rotation. Of course, these results are not
competitive with those provided by SIFT or CNN features,
but the aim of these experiments was to show that adding
color and rotation invariance into the self-similarity
descriptors improves the discriminating power of the
final features. These results clearly show that our
self-similarity representation is a good candidate to be
used as complementary information alongside
appearance-based features.