4.3 Comparison with Existing Methods
In Table 1, the proposed method is first compared with competing techniques that use a single SIFT descriptor and the original BoVW coding, i.e. a BoVW histogram (sum pooling) with hard assignment to visual words, for a fair comparison. These methods are the original SPM method, SPM + co-occurrence (a combination of SPM and the spatial relationship information between visual words inside each image) (Yang and Newsam, 2011), sequence matching (Yang et al., 2009b) and optimal spatial partitioning (Sharma and Jurie, 2011). The table shows that, for both datasets, our approach clearly outperforms all other methods. It is important to note that the best result is obtained with the smallest vocabulary of 100 words.
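As an illustration of the baseline coding referred to above, the following minimal sketch builds a BoVW histogram by hard assignment followed by sum pooling. It is not the code used in the experiments: the function name `bovw_histogram`, the Euclidean nearest-word rule, and the toy 2-D descriptors are assumptions made for the example (real SIFT descriptors are 128-dimensional, and the codebook would come from k-means).

```python
from math import dist  # Euclidean distance, available from Python 3.8

def bovw_histogram(descriptors, codebook):
    """Hard-assign each descriptor to its nearest visual word, then sum-pool.

    Hypothetical helper for illustration: `descriptors` and `codebook`
    are sequences of equal-length numeric tuples.
    """
    hist = [0] * len(codebook)
    for d in descriptors:
        # Hard assignment: index of the closest visual word.
        nearest = min(range(len(codebook)), key=lambda k: dist(d, codebook[k]))
        # Sum pooling: count occurrences of each word.
        hist[nearest] += 1
    return hist

# Toy example with a 3-word codebook and four 2-D "descriptors".
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.8, 0.0), (0.1, 0.9)]
hist = bovw_histogram(descriptors, codebook)
print(hist)  # [1, 2, 1]
```

With hard assignment each descriptor votes for exactly one word, so the histogram entries sum to the number of descriptors in the region.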
Also, to compare SMD with recent works that use sparse coding to create the vocabulary, we have integrated sparse coding into our method. For this, we use the ScSPM Matlab code from the authors of (Yang et al., 2009b), and following (Boureau et al., 2010), we use max pooling to compute the local BoVW because of its better performance compared with average pooling. We compare with the methods ScSPM (Yang et al., 2009b) and Kernel Sparse Representation (KSR-SPM) (Gao et al., 2010). The ScSPM approach can be viewed as a spatial pyramid matching method using sparse coding. The KSR-SPM approach is the combination of SPM with a kernel sparse representation technique. As Table 2 shows, our method outperforms both of them.
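The difference between the two pooling rules mentioned above can be sketched as follows. This is a toy illustration, not the ScSPM Matlab code: the helper names `max_pool` and `average_pool`, the use of absolute values in both rules, and the three hand-written code vectors are assumptions made for the example.

```python
def max_pool(codes):
    """Component-wise maximum of absolute code values over a region."""
    n_words = len(codes[0])
    return [max(abs(c[k]) for c in codes) for k in range(n_words)]

def average_pool(codes):
    """Component-wise mean of absolute code values over a region."""
    n_words = len(codes[0])
    return [sum(abs(c[k]) for c in codes) / len(codes) for k in range(n_words)]

# Three hypothetical sparse codes over a 3-word vocabulary,
# one per local descriptor in the same subregion.
codes = [
    [0.0, 0.5, 0.0],
    [0.2, 0.0, 0.0],
    [0.0, -0.75, 0.0],
]
pooled_max = max_pool(codes)
pooled_avg = average_pool(codes)
```

Max pooling keeps the strongest response to each word in the region, whereas average pooling dilutes a strong response when most codes are zero for that word.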
Table 1: Comparison of our approach with competing methods based on SIFT and k-means. The size of the codebook is given in brackets. We report the highest values obtained in the pyramidal case only. − means that no result is available.
Method                            Caltech 101    15 Scenes
SPM (pyr., K=100)                 63.2 [100]     80.1 [100]
SPM (best pyr. result)            64.6 [200]     81.4 [400]
SPM+co-occurrence                 −              82.51 [200]
Sequence matching                 −              80.9 [200]
SPM+spatial partition learning    −              80.1 [1000]
SMD                               66.5 [100]     83.2 [100]
Table 2: Comparison with sparse-coding-based methods. The size of the codebook is given in brackets. − means that no result is available.
Method            Caltech 101    15 Scenes
ScSPM [1024]      73.2 ± 0.5     80.28 ± 0.9
KSR-SPM [1024]    −              83.68 ± 0.61
SMD [100]         73.44 ± 1.1    84.59 ± 0.7
5 CONCLUSIONS
In this paper, our contribution is twofold. First, we describe a novel image representation as strings of histograms that encodes spatial information, each histogram being a BoVW model of a subregion. Second, we introduce a new edit distance that can automatically identify local alignments between subregions and handle the removal of sequences of similar subregions. This characteristic makes our method more robust to translation or scale variations of objects in images than SPM-based approaches, which rigidly compare corresponding parts of images.
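To make the alignment idea concrete, the sketch below applies the textbook edit distance to two strings whose symbols are histograms. It is not the SMD proposed in this paper: the L1 substitution cost, the constant insertion/deletion cost `indel_cost`, and the function names are assumptions chosen for illustration, and the merging of similar subregions is not modelled.

```python
def l1(h1, h2):
    """L1 distance between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def string_edit_distance(s, t, indel_cost=1.0):
    """Classical dynamic-programming edit distance over strings of histograms.

    Substituting one histogram for another costs their L1 distance;
    inserting or deleting a histogram costs `indel_cost` (assumed value).
    """
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel_cost
    for j in range(1, m + 1):
        D[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + indel_cost,                   # delete s[i-1]
                D[i][j - 1] + indel_cost,                   # insert t[j-1]
                D[i - 1][j - 1] + l1(s[i - 1], t[j - 1]),   # substitute
            )
    return D[n][m]

# String b is string a with its first subregion missing (a shifted object).
a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [[0.0, 1.0], [0.5, 0.5]]
d_ab = string_edit_distance(a, b)
print(d_ab)  # 1.0: one deletion, then two exact matches
```

In this toy case, comparing the subregions position by position would cost 3.0 for the first two pairs alone, whereas the alignment found by dynamic programming costs a single deletion. This is the kind of robustness to translation that a rigid, cell-to-cell comparison lacks.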
The experiments confirm that our model is able to take into account spatial relationships between local BoVW and leads to a clear improvement in performance for scene and image classification compared with the classical spatial pyramid representation (SPR). It is worth noting that, to the best of our knowledge, this is the first time that results better than SPR have been reported with the standard BoVW coding and a lower-dimensional representation. Moreover, the proposed approach obtains similar or better accuracies than other recent methods that try to infuse spatial relationships into the original BoVW model, with the great advantage of using a small codebook and a compact representation. In the future, we are interested in extending our edit distance to other data structures such as trees. Trees are indeed often used to represent image content, and some edit distances for trees already exist.
REFERENCES
Ballan, L., Bertini, M., Del Bimbo, A., and Serra, G. (2010). Video event classification using string kernels. Multimedia Tools and Applications, 48(1):69–87.
Battiato, S., Farinella, G., Gallo, G., and Ravì, D. (2009). Spatial hierarchy of textons distributions for scene classification. In Proceedings of the 15th International Multimedia Modeling Conference on Advances in Multimedia Modeling, MMM ’09, pages 333–343, Berlin, Heidelberg. Springer-Verlag.
Boureau, Y.-L., Bach, F., LeCun, Y., and Ponce, J. (2010). Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559–2566. IEEE.
Cao, Y., Wang, C., Li, Z., Zhang, L., and Zhang, L. (2010). Spatial-bag-of-features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 3352–3359. IEEE.
Chen, X., Hu, X., and Shen, X. (2009). Spatial weighting for bag-of-visual-words and its application in content-based image retrieval. In Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD ’09, pages 867–874, Berlin, Heidelberg. Springer-Verlag.
VISAPP 2014 - International Conference on Computer Vision Theory and Applications