nition rates than bag-of-words for reasonable vocabulary sizes, but an offline training process is required to obtain the binary signature.
Several variations of query expansion have been proposed in the literature (Arandjelović and Zisserman, 2012; Chum et al., 2007), but as observed in (Jegou et al., 2010), query expansion increases query time by several orders of magnitude and only works well when the corpus contains several images of the same object.
(Wu et al., 2009) proposed grouping SIFT features inside Maximally Stable Extremal Regions (MSER) and applied the technique to logo and web image retrieval. This approach does not work well for real-world photographs with large changes in viewpoint and illumination, and is mostly applicable to 2D images.
Our method is able to consistently produce significantly better recognition rates than bag-of-words for a wide range of vocabulary sizes, and it incurs no offline training or learning overhead. Experiments reveal that the proposed approach can be effective for recognition in real-world photographs involving large changes in viewpoint and occlusion, as well as for sub-image retrieval problems. Moreover, retrieval is several orders of magnitude faster than performing geometric verification or query expansion.
3 METHODOLOGY
Figure 2 provides an overview of the proposed method. The offline processing module (subfigure 2(a)) prepares the inverted index and the files that store spatial information from the images in the corpus. It also computes the inverse document frequency (IDF) weights for visual words inside the ROIs. The online processing module (subfigure 2(b)) takes the query image as input and computes the regions of interest (ROIs) and dense features. It assigns dense features to visual words using the codebook computed from the dense features in the corpus images. A voting mechanism is used to determine the ROIs in the corpus that share common visual words. An array whose size equals the number of ROIs in the corpus is first initialized to zero. Using an inverted index structure, each visual word in a ROI in the query image votes for the corpus ROIs in which it occurs. Figure 3 illustrates the voting mechanism, and the listing below sketches the indexing and voting steps.
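As an illustration, the following is a minimal Python sketch of the inverted index and the voting step, assuming each ROI is represented as a list of visual word ids; the structure follows the description above, but the names and data layout are ours, not the authors' implementation.

```python
from collections import defaultdict

def build_inverted_index(corpus_rois):
    """Offline step: map each visual word id to the ids of the
    corpus ROIs whose dense features contain that word."""
    index = defaultdict(list)
    for roi_id, words in enumerate(corpus_rois):
        for w in set(words):              # count each word once per ROI
            index[w].append(roi_id)
    return index

def vote(query_roi_words, index, num_corpus_rois):
    """Online step: count the visual words each corpus ROI shares
    with the query ROI (IDF weighting and the spatial-consistency
    part of the match score are omitted in this sketch)."""
    counts = [0] * num_corpus_rois        # one counter per corpus ROI
    for w in set(query_roi_words):
        for roi_id in index.get(w, ()):
            counts[roi_id] += 1
    return counts
```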
After the voting, the counts in the array represent the number of visual words each corpus ROI has in common with the query ROI. For corpus ROIs with count < 2, the match score is set to zero. For the rest, a match score is computed based on the number of visual words in common and the agreement in their spatial layout. The match score is then added to a second array that stores the cumulative match score for each corpus image. The corpus images are then ranked in descending order of cumulative match score. For all our experiments, we use RootSIFT (Arandjelović and Zisserman, 2012) instead of SIFT; the transform is recalled in the snippet below. With our matching framework, this is observed to increase mAP by about 1%–2% relative to using the L2 distance for SIFT comparison.
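RootSIFT is the simple, training-free transform from the cited paper: L1-normalize each SIFT descriptor and take the element-wise square root, so that the L2 distance between transformed descriptors corresponds to the Hellinger kernel on the originals. A minimal NumPy sketch (eps is a small constant we add for numerical safety):

```python
import numpy as np

def rootsift(desc, eps=1e-7):
    """RootSIFT (Arandjelovic and Zisserman, 2012): L1-normalize each
    SIFT descriptor (rows of desc), then take the element-wise square
    root. SIFT descriptors are non-negative, so the row sum is the
    L1 norm."""
    desc = np.asarray(desc, dtype=np.float64)
    desc = desc / (desc.sum(axis=1, keepdims=True) + eps)
    return np.sqrt(desc)
```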
3.1 ROI Computation
In this section, we provide details of the feature extraction. We compute Harris-Laplace interest points using the LIP-VIREO toolkit of (Zhao, 2010). In contrast to our previous work (Bhattacharya and Gavrilova, 2013), which detects interest points using LoG and also computes the descriptors, we simply compute the interest points using Harris-Laplace. We discard overly large and overly small interest points, as these are likely to result in false matches. Specifically, for all our experiments, we discard any interest points with radius < 15 or > 51 pixels. The interest points still number in the thousands. We sort the interest points in descending order of radius. Using a kd-tree, we efficiently determine the nearest interest points to any given interest point. We discard any nearby interest points for which the distance between the interest point centres is < ∆D and the difference in radius is < ∆R. The motivation is to discard overlapping interest points that are similar in scale and hence likely to represent similar image structure. The interest points that survive this filtering constitute our ROIs. If the number of ROIs is > 200, we sort the ROIs in descending order of saliency and select up to the top 200. Subfigure 4(a) displays the interest points detected using Harris-Laplace, while subfigure 4(b) displays the ROIs extracted using the simple technique just outlined. We set ∆D and ∆R to 20 pixels for all our experiments. It is important to note that if ∆D is set too high, the localization accuracy of the ROIs suffers, resulting in degraded recognition performance. A sketch of this filtering procedure is given below.
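The following is a minimal sketch of the suppression step using SciPy's cKDTree; the saliency measure used for the final ranking is not defined at this point in the text, so `saliency` here is a hypothetical input.

```python
import numpy as np
from scipy.spatial import cKDTree

def filter_interest_points(centres, radii, saliency,
                           r_min=15, r_max=51,
                           delta_d=20, delta_r=20, max_rois=200):
    """Greedy suppression of overlapping, similar-scale interest points.

    centres:  (N, 2) array of interest point centres.
    radii:    (N,) array of interest point radii.
    saliency: (N,) hypothetical per-point saliency scores (placeholder).
    """
    # Discard overly small and overly large interest points.
    keep = (radii >= r_min) & (radii <= r_max)
    centres, radii, saliency = centres[keep], radii[keep], saliency[keep]

    # Process points in descending order of radius.
    order = np.argsort(-radii)
    tree = cKDTree(centres)
    suppressed = np.zeros(len(radii), dtype=bool)
    kept = []
    for i in order:
        if suppressed[i]:
            continue
        kept.append(i)
        # Suppress nearby points whose centre distance is < delta_d
        # and whose radius differs by < delta_r.
        for j in tree.query_ball_point(centres[i], delta_d):
            if j != i and abs(radii[j] - radii[i]) < delta_r:
                suppressed[j] = True

    # Retain at most max_rois ROIs, ranked by saliency.
    kept = sorted(kept, key=lambda i: -saliency[i])[:max_rois]
    return centres[kept], radii[kept]
```

Because points are processed from largest to smallest radius, the larger of two overlapping, similar-scale points is the one retained.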
We next compute dense features at a spatial stride of 5 pixels and 6 scales {9, 12, 15, 18, 21, 24} using the vl_phow command of (Vedaldi and Fulkerson, 2012). This has been used for category recognition in (Chatfield et al., 2011). It is fast to compute, taking well below a second per image. For ROIs with radii in the range [15, 21], we only consider dense features that are contained inside the ROI and have a radius of 9 pixels (roughly half the ROI radius). Similarly, for ROIs with radii in the ranges [22, 27], [28, 33], [34, 39], [40, 45], and [46, 51], we consider dense features with radii of 12, 15, 18, 21, and 24 pixels, respectively. Since the ra-