LOCAL FEATURE BASED IMAGE SIMILARITY FUNCTIONS
FOR kNN CLASSIFICATION
Giuseppe Amato and Fabrizio Falchi
Institute of Information Science and Technology, via G. Moruzzi 1, Pisa, Italy
Keywords:
Image classification, Image content recognition, Pattern recognition, Machine learning.
Abstract:
In this paper we consider the problem of image content recognition and we address it by using local features
and kNN based classification strategies. Specifically, we define a number of image similarity functions relying
on local feature similarity and matching, with and without geometric constraints, and we compare their
performance when used with a kNN classifier. Finally, we compare everything with a new kNN based
classification strategy that makes direct use of the similarity between local features rather than the similarity
between entire images.
As expected, the use of geometric information offers an improvement over the use of pure image similarity.
However, surprisingly, the kNN classifier that uses local feature similarity performs better than the others,
even without the use of geometric information.
We perform our experiments on the task of recognizing landmarks in photos.
1 INTRODUCTION
Image content recognition is a very important issue
that is being studied by many scientists worldwide.
In fact, with the explosion of digital photography
during the last decade, the amount of digital pictures
available on-line and off-line has increased enormously.
However, many of these pictures remain unannotated
and are stored with generic names on personal computers
and on on-line services. Currently, there are
no effective tools and technologies to help users
search for pictures by real content when they are
not explicitly annotated. Therefore, it is becoming
more and more difficult for users to retrieve even their
own pictures.
A picture contains a lot of implicit conceptual
information that it is not yet possible to exploit entirely
and effectively. Automatic content-based image
recognition opens up opportunities for new advanced
applications. For instance, pictures themselves might
be used as queries on the web. An example in this
direction is the service “Google Goggles” (Google,
2010), recently launched by Google, which allows users
to obtain information about a monument through a
smartphone using this paradigm.
Note that, even if many smartphones and cameras
are equipped with a GPS and a compass, the
geo-reference obtained with these sensors is not enough
to infer what the user is actually aiming at. Content
analysis of the picture is still needed to determine more
precisely the user query, or the annotation to be
associated with a picture.
A promising approach toward image content
recognition is the use of classification techniques to
associate images with classes (labels) according to
their content. For instance, if an image contains a car,
it might be automatically associated with the class car
(labelled with the label car).
In this paper we study the problem of image content
recognition by using SIFT (Lowe, 2004) and
SURF (Bay et al., 2006) local features, to represent
the image visual content, and kNN based classifiers to
decide about the presence of conceptual content.
In more detail, we define 20 different functions
that measure similarity between images. These
functions are defined using various options and
combinations of local feature matching and similarities.
Some of them also take into consideration the geometric
properties of the matching local features. These
functions are used in combination with a standard
single-label distance-weighted kNN algorithm. In addition,
we also propose a new classification algorithm that
extends traditional kNN classifiers by making direct
use of the similarity between local features, rather
than the similarity between entire images.
We will see that the similarity functions that also
make use of geometric considerations offer better
performance than the others. However, the new kNN
based classifier that directly exploits the similarity
between local features achieves even higher performance,
without using geometric information.
The paper is organized as follows. Section 2 discusses
related work. In Section 3 we briefly introduce local
features. In Section 4 we present various image
similarity functions relying on local features, to be
used with a kNN classification algorithm. Section 5
proposes a novel classification approach. Finally,
Sections 6 and 7 present the evaluation settings and
the experimental results.
2 RELATED WORK
The first approach to recognizing location from mobile
devices using image-based web search was presented
in (Yeh et al., 2004). Two image matching metrics
were used: energy spectrum and wavelet decompositions.
Local features were not tested.
In the last few years, the problem of recognizing
landmarks has received growing attention from
the research community. In (Serdyukov et al., 2009),
methods for placing photos uploaded to Flickr on the
world map were presented. In the proposed approach,
the images were represented by vectors of features of
the tags, and by visual keywords derived from a vector
quantization of the SIFT descriptors.
In (Kennedy and Naaman, 2008), a combination of
context- and content-based tools was used to generate
representative sets of images for location-driven
features and landmarks. There, visual information is
combined with textual metadata, while we only
consider content-based classification.
In (Zheng et al., 2009), Google presented its approach
to building a web-scale landmark recognition
engine. Most of the work reported was used to
implement the Google Goggles service (Google, 2010).
The approach makes use of the SIFT feature. The
recognition is based on searching for the best matching
image, while our novel approach is based on local
feature classification. In (Chen et al., 2009), a survey
on mobile landmark recognition for information
retrieval is given. Classification methods reported as
previously presented in the literature include SVM,
Adaboost, Bayesian models, HMM and GMM. The kNN
based approach, which is the main focus of this paper,
is not reported in that survey. In (Fagni et al., 2010),
various MPEG-7 descriptors have been used to build
kNN classifier committees. However, local features
were not considered.
In (Boiman et al., 2008) the effectiveness of NN
image classifiers has been proved and an innovative
approach based on Image-to-Class distance that is
similar in spirit to our approach has been proposed.
3 LOCAL FEATURES
The approach described in this paper focuses on the
use of image local features. Specifically, we per-
formed our tests using the SIFT (Lowe, 2004) and
SURF (Bay et al., 2006) local features. In this sec-
tion, we briefly describe both of them.
The Scale Invariant Feature Transformation
(SIFT) (Lowe, 2004) is a representation of low-level
image content based on a transformation
of the image data into scale-invariant coordinates
relative to local features. Local features are low-level
descriptions of keypoints in an image. Keypoints are
interest points in an image that are invariant to scale
and orientation. Keypoints are selected by choosing
the most stable points from a set of candidate locations.
Each keypoint in an image is associated with
one or more orientations, based on local image gradients.
Image matching is performed by comparing
the descriptions of the keypoints in the images. For both
detecting keypoints and extracting the SIFT features,
we used the publicly available software developed by
David Lowe¹.
The basic idea of Speeded Up Robust Features
(SURF) (Bay et al., 2006) is quite similar to SIFT.
SURF detects some keypoints in an image and
describes them using orientation information.
However, the SURF definition uses a new method for
both the detection of keypoints and their description that
is much faster, while still guaranteeing performance
comparable to or even better than SIFT. Specifically,
keypoint detection relies on a technique based on an
approximation of the Hessian matrix. The descriptor of
a keypoint is built considering the distribution of Haar-wavelet
responses around the keypoint itself. For
both detecting keypoints and extracting the SURF
features, we used the publicly available non-commercial
software developed by the authors².
4 IMAGE SIMILARITY BASED
CLASSIFIER
In this section we discuss how traditional kNN
classification algorithms can be applied to the task of
classifying images described by local features, such as
SIFT or SURF. In particular, we define 20 image
similarity measures based on local feature descriptions.
These will later be compared to the new
classification strategy that we propose in Section 5.
¹ http://people.cs.ubc.ca/~lowe/
² http://www.vision.ee.ethz.ch/~surf
4.1 Single-label Distance-weighted kNN
Given a set of documents $D$ and a predefined set
of classes (also known as labels, or categories)
$C = \{c_1, \dots, c_m\}$, single-label document classification
(SLC) (Dudani, 1975) is the task of automatically
approximating, or estimating, an unknown target function
$\Phi : D \to C$, which describes how documents ought
to be classified, by means of a function $\hat{\Phi} : D \to C$,
called the classifier, such that $\hat{\Phi}$ is an approximation
of $\Phi$.
A popular SLC technique is the single-label
distance-weighted kNN. Given a training set $Tr$
containing various examples for each class $c_i$,
it assigns a label to a document in two steps. Given
a document $d_x$ (an image, for example) to be classified,
it first executes a kNN search between the objects
of the training set. The result of such an operation
is a list $\chi_k(d_x)$ of labelled documents $d_i$ belonging to
the training set, ordered with respect to decreasing
values of the similarity $s(d_x, d_i)$ between $d_x$ and
$d_i$. The label assigned to the document $d_x$ by the
classifier is the class $c_j \in C$ that maximizes the sum of the
similarities between $d_x$ and the documents $d_i$, labelled
$c_j$, in the kNN result list $\chi_k(d_x)$.
Therefore, first a score $z(d_x, c_j)$ is computed for
each label $c_j \in C$:

$$z(d_x, c_j) = \sum_{d_i \in \chi_k(d_x)\,:\,\Phi(d_i) = c_j} s(d_x, d_i)\,.$$

Then, the class that obtains the maximum score is
chosen:

$$\hat{\Phi}_s(d_x) = \arg\max_{c_j \in C}\, z(d_x, c_j)\,.$$
It is also convenient to express a degree of confidence
in the answer of the classifier. For the single-label
distance-weighted kNN classifier described here,
we define the confidence as 1 minus the ratio between
the score obtained by the second-best label and that
obtained by the best label, i.e.,

$$\nu_{doc}(\hat{\Phi}_s, d_x) = 1 - \frac{\max_{c_j \in C \setminus \{\hat{\Phi}_s(d_x)\}} z(d_x, c_j)}{\max_{c_j \in C} z(d_x, c_j)}\,.$$

This classification confidence can be used to decide
whether or not the predicted label has a high
probability of being correct.
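To make the two steps concrete, the following minimal Python sketch (the helper names are ours, not the paper's; the image similarity function `s` is assumed to be given) implements the distance-weighted kNN rule and the confidence $\nu_{doc}$ defined above:

```python
from collections import defaultdict

def knn_classify(d_x, training_set, s, k):
    """Single-label distance-weighted kNN with confidence.

    d_x          -- the image to be classified
    training_set -- list of (image, label) pairs (the set Tr)
    s            -- image similarity function with values in [0, 1]
    k            -- number of nearest neighbors to retrieve
    """
    # kNN search: the k training images most similar to d_x.
    sims = sorted(((s(d_x, d_i), label) for d_i, label in training_set),
                  key=lambda t: t[0], reverse=True)[:k]

    # z(d_x, c_j): sum of the similarities of the neighbors labelled c_j.
    z = defaultdict(float)
    for sim, label in sims:
        z[label] += sim

    # Predicted class, and confidence 1 - (second-best score / best score).
    ranked = sorted(z.items(), key=lambda cz: cz[1], reverse=True)
    best_label, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    nu_doc = 1.0 - second_score / best_score if best_score > 0 else 0.0
    return best_label, nu_doc
```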
4.2 Image Similarity
In order for the kNN search step to be executed, a
similarity function between images must be defined.
Global features are generally defined along with a
similarity (or distance) function. Therefore, the similarity
between images is computed as the similarity
between the corresponding global features. On the
other hand, a single image has several local features.
Therefore, computing the similarity between two images
requires somehow combining the similarities
between their numerous local features.
In the following we define functions for computing
the similarity between images on the basis of their
local features, derived from the work presented
in (Lowe, 2004). In the experiments at the end
of this paper, we compare the performance of
these similarity functions, when used with the single-label
distance-weighted kNN classification technique,
against the local feature based classification algorithm
proposed in Section 5.
4.2.1 Local Feature Similarity
The Computer Vision literature related to local features
generally uses the notion of distance, rather
than that of similarity. However, in most cases a
similarity function $s()$ can easily be derived from a
distance function $d()$. For both SIFT and SURF, the
Euclidean distance is typically used as a measure of
dissimilarity between two features (Lowe, 2004; Bay
et al., 2006).
Let $d(p_1, p_2) \in [0, 1]$ be the normalized distance
between two local features $p_1$ and $p_2$. We can define
the similarity as:

$$s(p_1, p_2) = 1 - d(p_1, p_2)$$

Obviously, $0 \leq s(p_1, p_2) \leq 1$ for any $p_1$ and $p_2$.
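For illustration, a minimal sketch of this conversion, under the assumption that Euclidean distances between descriptors are normalized into $[0, 1]$ by an upper bound `d_max` (the choice of normalization constant is ours; the paper does not specify it):

```python
import numpy as np

def similarity(p1, p2, d_max):
    """s(p1, p2) = 1 - d(p1, p2), where the Euclidean distance between
    the two descriptors is normalized into [0, 1] by the bound d_max
    (assumed; e.g., the largest descriptor distance observed)."""
    d = np.linalg.norm(np.asarray(p1, float) - np.asarray(p2, float)) / d_max
    return 1.0 - min(d, 1.0)
```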
4.2.2 Local Features Matching
A useful notion that is often used when dealing with
local features is that of local feature matching.
In (Lowe, 2004), a distance ratio matching scheme
was proposed that has also been adopted by (Bay
et al., 2006) and many others. Let us consider a local
feature $p_x$ belonging to an image $d_x$ (i.e., $p_x \in d_x$)
and an image $d_y$. First, the point $p_y \in d_y$ closest to $p_x$
(in the remainder, $NN_1(p_x, d_y)$) is selected as a candidate
match. Then, the distance ratio $\sigma(p_x, d_y) \in [0, 1]$
of closest to second-closest neighbors of $p_x$ in $d_y$ is
considered. The distance ratio is defined as:

$$\sigma(p_x, d_y) = \frac{d(p_x, NN_1(p_x, d_y))}{d(p_x, NN_2(p_x, d_y))}$$
Finally, $p_x$ and $NN_1(p_x, d_y)$ are considered matching
if the distance ratio $\sigma(p_x, d_y)$ is smaller than a
given threshold. Thus, a matching function between
$p_x \in d_x$ and an image $d_y$ is defined as:

$$m(p_x, d_y) = \begin{cases} 1 & \text{if } \sigma(p_x, d_y) < c \\ 0 & \text{otherwise} \end{cases}$$
In (Lowe, 2004), $c = 0.8$ was proposed, reporting that
this threshold eliminates 90% of the false
matches while discarding less than 5% of the correct
ones. In Section 7 we report an experimental
evaluation of classification effectiveness varying $c$
that confirms the results obtained by Lowe. Please
note that this parameter is used in defining the
image similarity measures used as baselines and in
one of our proposed local feature based classifiers.
For Computer Vision applications, the distance ratio
described above is used for selecting good candidate
matches. More sophisticated algorithms are then
used to select actual matches from the candidates,
considering geometric information such as the scale,
orientation and coordinates of the interest points. In most
cases, a Hough transform (Ballard, 1981) is used
to search for keys that agree upon a particular model
pose. To avoid the problem of boundary effects in
hashing, each match is hashed into the 2 closest bins
of each dimension, giving a total of 16 entries for each
hypothesis in the hash table. This method was proposed
for SIFT (Lowe, 2004) and is very similar to the weak
geometry consistency check used in (Jégou et al., 2010).
Thus, we define $M_h(d_x, d_y)$ as the set of matching
points in the most populated entry of the hash table
containing the Hough transform of the matches in $d_y$
obtained using the distance ratio criterion.
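The distance ratio test and the resulting matching function can be sketched as follows (a sketch under the assumption that the local features of $d_y$ are stored as rows of a NumPy array; a real system would use an index rather than a linear scan):

```python
import numpy as np

def distance_ratio(p_x, d_y):
    """sigma(p_x, d_y): distance from p_x to its closest descriptor in
    d_y divided by the distance to the second-closest one (d_y needs
    at least two features)."""
    dists = np.sort(np.linalg.norm(d_y - p_x, axis=1))
    return dists[0] / dists[1]

def m(p_x, d_y, c=0.8):
    """Matching function of Section 4.2.2: 1 if the distance ratio is
    below the threshold c (c = 0.8 as proposed by Lowe), 0 otherwise."""
    return 1 if distance_ratio(p_x, d_y) < c else 0
```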
4.3 Similarity Measures
In this section, we define 4 different image similarity
measures and 5 different versions of each
of them, for a total of 20 measures.
4.3.1 1-NN Similarity Average – $s_1$
The simplest similarity measure considers, for each
$p_x \in d_x$, only its closest neighbor in $d_y$ and its distance
from the query point $p_x$. The similarity between two
documents $d_x$ and $d_y$ can thus be defined as the average
similarity between the local features in $d_x$ and their
closest neighbors in $d_y$. We define the 1-NN Similarity
Average as:

$$s_1(d_x, d_y) = \frac{1}{|d_x|} \sum_{p_x \in d_x} \max_{p_y \in d_y}\, s(p_x, p_y)$$

For simplicity, we indicate the number of local
features in an image $d_x$ as $|d_x|$.
4.3.2 Percentage of Matches – $s_m$
A reasonable measure of similarity between two images
$d_x$ and $d_y$ is the percentage of local features in $d_x$
that have a match in $d_y$. Using the distance ratio criterion
described in Section 4.2.2 for individuating matches, we
define the Percentage of Matches similarity function
$s_m$ as follows:

$$s_m(d_x, d_y) = \frac{1}{|d_x|} \sum_{p_x \in d_x} m(p_x, d_y)$$

where $m(p_x, d_y)$ is 1 if $p_x$ has a match in $d_y$ and 0
otherwise, as defined in Section 4.2.2.
4.3.3 Distance Ratio Average – $s_\sigma$
The matching function $m(p_x, d_y)$ used in the Percentage
of Matches similarity function relies on the ratio
between the closest and second-closest neighbors for
filtering candidate matches, as proposed in (Lowe, 2004)
and reported in Section 4.2.2. However, the distance
ratio value can also be used directly to define a Distance
Ratio Average function between two images $d_x$ and $d_y$
as follows:

$$s_\sigma(d_x, d_y) = \frac{1}{|d_x|} \sum_{p_x \in d_x} \left(1 - \sigma(p_x, d_y)\right)$$

Please note that this function does not require a
distance ratio threshold $c$ to be set.
4.3.4 Hough Transform Matches Percentage – $s_h$
As mentioned in Section 4.2.2, a Hough transform is
often used to search for keys that agree upon a particular
model pose. The Hough transform can thus be used to
define a Hough Transform Matches Percentage:

$$s_h(d_x, d_y) = \frac{|M_h(d_x, d_y)|}{|d_x|}$$

where $M_h(d_x, d_y)$ is the subset of matches voting
for the most voted pose. For the experiments, we used
the same parameters proposed in (Lowe, 2004), i.e.,
bin sizes of 30 degrees for orientation, a factor of 2 for
scale, and 0.25 times the maximum model dimension
for location.
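A deliberately simplified sketch of the pose-voting step behind $M_h$ (our simplification: each match votes in a single bin, rather than the 16-entry soft assignment described in Section 4.2.2; the pose parameters of each match are assumed to be precomputed):

```python
import math
from collections import defaultdict

def hough_matches(matches, model_dim):
    """Return the matches voting for the most voted pose.

    matches   -- list of (match, d_theta, scale_ratio, dx, dy): for each
                 candidate match, the orientation difference in degrees,
                 the scale ratio and the location shift of its keypoints.
    model_dim -- maximum model (training image) dimension, in pixels.
    Bin sizes follow (Lowe, 2004): 30 degrees for orientation, a factor
    of 2 for scale, 0.25 * model_dim for location.
    """
    bins = defaultdict(list)
    for match, d_theta, scale_ratio, dx, dy in matches:
        key = (math.floor(d_theta / 30.0),
               math.floor(math.log2(scale_ratio)),   # factor-of-2 scale bins
               math.floor(dx / (0.25 * model_dim)),
               math.floor(dy / (0.25 * model_dim)))
        bins[key].append(match)
    return max(bins.values(), key=len) if bins else []
```

The similarity $s_h(d_x, d_y)$ is then simply the size of the returned set divided by $|d_x|$.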
4.3.5 Managing the Asymmetry
None of the proposed similarity functions is symmetric,
i.e., $s(d_x, d_y) = s(d_y, d_x)$ does not hold in general.
Consider the case in which the set of local features
belonging to $d_x$ is a subset of the one belonging to $d_y$.
In this case the similarity $s(d_x, d_y)$ is 1, while the
same does not hold for $s(d_y, d_x)$.
In searching for images similar to $d_x$, it is not
clear in advance whether $s(d_x, d_y)$ or $s(d_y, d_x)$ would
be the better similarity measure for the recognition task.
Thus, we tested various combinations.
Given an image $d_{Te}$ belonging to $Te$ (i.e., an image
that we want to automatically classify) and an image
$d_{Tr}$ belonging to $Tr$ (i.e., an image for which the class
label is known in advance), we define various versions
of the similarities defined before:

- $s_{Te}(d_{Te}, d_{Tr}) = s(d_{Te}, d_{Tr})$ is the canonical approach, which tries to find points in the test image that are similar to the ones in the training image.

- $s_{Tr}(d_{Te}, d_{Tr}) = s(d_{Tr}, d_{Te})$ is the inverse approach, which uses the points in the training documents as queries.

- $s_{or}(d_{Te}, d_{Tr}) = \max(s(d_{Te}, d_{Tr}), s(d_{Tr}, d_{Te}))$ is the fuzzy or of $s_{Te}$ and $s_{Tr}$. It considers two images equivalent if either of the two is a crop of the other.

- $s_{and}(d_{Te}, d_{Tr}) = \min(s(d_{Te}, d_{Tr}), s(d_{Tr}, d_{Te}))$ is the fuzzy and of $s_{Te}$ and $s_{Tr}$. It never considers two images equivalent if either of the two is a crop of the other.

- $s_{avg}(d_{Te}, d_{Tr}) = (s(d_{Te}, d_{Tr}) + s(d_{Tr}, d_{Te}))/2$ is the mean of $s_{Te}$ and $s_{Tr}$.

Thus, we have defined 5 versions of our 4 similarity
measures, for a total of 20 similarity measures that
will be denoted as $s_{m,Te}, s_{m,Tr}, s_{m,or}, \dots, s_{h,Te}$, etc.
A compact sketch of the base measures and versions is given below.
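Building on the hypothetical helpers sketched earlier (`similarity`, `distance_ratio` and `m`), the base measures and the five versions could look as follows (`D_MAX` is the assumed distance normalization bound introduced before):

```python
import numpy as np

D_MAX = 1.0  # assumed normalization bound for descriptor distances

def s_1(d_x, d_y):
    """1-NN Similarity Average: mean similarity between each feature of
    d_x and its closest feature in d_y."""
    return float(np.mean([max(similarity(p, q, D_MAX) for q in d_y)
                          for p in d_x]))

def s_m(d_x, d_y, c=0.8):
    """Percentage of Matches under the distance ratio criterion."""
    return float(np.mean([m(p, np.asarray(d_y), c) for p in d_x]))

def s_sigma(d_x, d_y):
    """Distance Ratio Average (no threshold c needed)."""
    return float(np.mean([1.0 - distance_ratio(p, np.asarray(d_y))
                          for p in d_x]))

def versions(s, d_te, d_tr):
    """The five versions of a base measure s handling the asymmetry."""
    a, b = s(d_te, d_tr), s(d_tr, d_te)
    return {"Te": a, "Tr": b, "or": max(a, b),
            "and": min(a, b), "avg": (a + b) / 2.0}
```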
5 LOCAL FEATURE BASED
IMAGE CLASSIFIER
In the previous section, we considered the classification
of an image $d_x$ as a process of retrieving the most
similar images in the training set $Tr$ and then applying
a kNN classification technique in order to predict the
class of $d_x$.
In this section, we propose a new approach that
first assigns a label to each local feature of an image.
The label of the image is then assigned by analyzing
the labels and confidences of its local features.
This approach has the advantage that any access
method for similarity searching in metric spaces (Zezula
et al., 2006) can be used to speed up classification.
The proposed Local Feature Based Image Classifier
classifies an image $d_x$ in two steps:
1. First, each local feature $p_x$ belonging to $d_x$ is classified considering the local features of the images in $Tr$;
2. Second, the whole image is classified considering the class assigned to each local feature and the confidence of these classifications.
Note that, by classifying the local features individually
before assigning a label to the image, we
might lose the implicit dependency between the interest
points of an image. However, surprisingly, we will
see that this method offers better effectiveness than
the baseline approach. In other words, we are able to
improve both efficiency and effectiveness at the same
time.
In the following, we assume that the label of each
local feature $p_x$, belonging to an image of the training
set $Tr$, is the label assigned to the image it belongs to
(i.e., $d_x$). Following the notation used in Section 4:
$p_x \in d_x$, $d_x \in Tr$, $\Phi(p_x) = \Phi(d_x)$.
In other words, we assume that the local features
generated over the interest points of the images in the
training set can be labeled as the image they belong
to. Note that the noise introduced by this label propagation
from the whole image to the local features can
be managed by the local feature classifier. In fact,
we will see that when very similar training local features
are assigned to different classes, a local feature
close to them is classified with low confidence. The
experimental results reported in Section 7 confirm the
validity of this assumption.
As we said before, given $p_x \in d_x$, a classifier $\hat{\Phi}$
returns both a class $\hat{\Phi}(p_x) = c_i \in C$, to which it believes
$p_x$ to belong, and a numerical value $\nu(\hat{\Phi}, p_x)$ that
represents the confidence that $\hat{\Phi}$ has in its decision. High
values of $\nu$ correspond to high confidence.
5.1 Local Feature Classifier
Among all the possible approaches for assigning a label
to an interest point, the simplest is to consider the label
of its closest neighbor in $Tr$. The confidence value
can be evaluated using the idea of the distance ratio
discussed in Section 4.2.2.
We thus define a local feature based classifier
$\hat{\Phi}_m$ that assigns to $p_x$ the candidate label of its
nearest neighbor in $Tr$ (i.e., $NN_1(p_x, Tr)$):

$$\hat{\Phi}_m(p_x) = \Phi(NN_1(p_x, Tr))$$
The confidence here plays the role of a matching
function, where the idea of the distance ratio is used
to decide whether the candidate label is a good match:

$$\nu(\hat{\Phi}_m, p_x) = \begin{cases} 1 & \text{if } \dot{\sigma}(p_x, Tr) < c \\ 0 & \text{otherwise} \end{cases}$$
The distance ratio $\dot{\sigma}$ here is computed considering
the nearest local feature to $p_x$ and the closest local
feature that has a label different from that of the nearest
one. This idea follows the suggestion given
by Lowe in (Lowe, 2004): whenever there are
multiple training images of the same object, the
second-closest neighbor to consider for the distance
ratio evaluation should be the closest neighbor that is
known to come from a different object than the first.
Following this intuition, we define the similarity ratio
$\dot{\sigma}$ as:

$$\dot{\sigma}(p_x, Tr) = \frac{d(p_x, NN_1(p_x, Tr))}{d(p_x, NN_2(p_x, Tr))}$$

where $NN_2(p_x, Tr)$ is the closest neighbor that is
known to be labeled differently from the first, as
suggested in (Lowe, 2004).
The parameter $c$ used in the definition of the confidence
is the equivalent of the one used in (Lowe,
2004) and (Bay et al., 2006). We will see in Section
7 that the value $c = 0.8$ proposed by Lowe in (Lowe, 2004)
guarantees good effectiveness. It is worth
noting that $c$ is the only parameter to be set for this
classifier, considering that the similarity search performed
over the local features in $Tr$ does not require a
parameter $k$ to be set.
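A sketch of this per-feature classifier, with the training features flattened into an array `tr_feats` and a parallel list of labels `tr_labels` (names are ours; a linear scan stands in for the metric index mentioned above, and at least two distinct classes are assumed to be present in $Tr$):

```python
import numpy as np

def classify_feature(p_x, tr_feats, tr_labels, c=0.8):
    """Nearest-neighbor label with label-aware distance ratio confidence."""
    dists = np.linalg.norm(tr_feats - p_x, axis=1)
    order = np.argsort(dists)
    label = tr_labels[order[0]]                  # label of NN_1(p_x, Tr)
    # NN_2: closest training feature with a label different from NN_1's.
    nn2 = next(i for i in order[1:] if tr_labels[i] != label)
    sigma_dot = dists[order[0]] / dists[nn2]     # label-aware distance ratio
    confidence = 1.0 if sigma_dot < c else 0.0
    return label, confidence
```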
5.2 Whole Image Classification
As said before, the local feature based image classification
is composed of two steps (see Section 5). In
the previous section we dealt with the issue of classifying
the individual local features of an image. In this
section, we discuss the second phase of the local feature
based classification of images. In particular, we
consider the classification of the whole image given
the labels $\hat{\Phi}(p_x)$ and the confidences $\nu(\hat{\Phi}, p_x)$ assigned
to its local features $p_x \in d_x$ during the first phase.
To this aim, we use a confidence-rated majority
vote approach. We first compute a score $z(d_x, c_i)$ for
each label $c_i \in C$. The score is the sum of the confidences
obtained for the local features predicted as $c_i$.
Formally,

$$z(d_x, c_i) = \sum_{p_x \in d_x\,:\,\hat{\Phi}(p_x) = c_i} \nu(\hat{\Phi}, p_x)\,.$$
Then, the label that obtains the maximum score is
chosen:

$$\hat{\Phi}(d_x) = \arg\max_{c_j \in C}\, z(d_x, c_j)\,.$$
As a measure of confidence for the classification of
the whole image, we use 1 minus the ratio between the
scores of the second-best and the predicted class:

$$\nu_{img}(\hat{\Phi}, d_x) = 1 - \frac{\max_{c_j \in C \setminus \{\hat{\Phi}(d_x)\}} z(d_x, c_j)}{\max_{c_i \in C} z(d_x, c_i)}\,.$$
This whole-image classification confidence can be
used to decide whether or not the predicted label has
a high probability of being correct. In the experimental
results of Section 7 we will show that the proposed
confidence is reasonable.
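The whole-image step then reduces to a confidence-rated vote over the per-feature predictions; a sketch, reusing the `classify_feature` helper sketched above:

```python
from collections import defaultdict

def classify_image(d_x, tr_feats, tr_labels, c=0.8):
    """Confidence-rated majority vote over the local features of d_x."""
    z = defaultdict(float)
    for p_x in d_x:
        label, confidence = classify_feature(p_x, tr_feats, tr_labels, c)
        z[label] += confidence
    ranked = sorted(z.items(), key=lambda cz: cz[1], reverse=True)
    best_label, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0.0
    nu_img = 1.0 - second_score / best_score if best_score > 0 else 0.0
    return best_label, nu_img
```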
6 EVALUATION SETTINGS
For evaluating the various classifiers we need at least:
a dataset, an interest point detector, a local feature
extractor, and some performance measures. In the
following, we present the evaluation settings we used
for the experimentation.
Figure 1: Example images taken from the dataset.
6.1 The Dataset
The dataset that we used for our tests is composed of
1,227 photos of landmarks located in Pisa; it was
also used in (Amato et al., 2010). The photos have
been crawled from Flickr, the well-known on-line
photo service. The dataset we built is publicly available:
the IDs of the photos used for these experiments,
together with the assigned labels and the extracted
features, can be downloaded from (Falchi, 2010). In
the following we list the classes that we used and the
number of photos belonging to each class. In Figure
1 we report an example for each class, in the same
order as they appear in the list below:
Leaning Tower (119 photos) – leaning campanile
Duomo (130 photos) – the cathedral of St. Mary
Battistero (104 photos) – the baptistery of St. John
Camposanto Monumentale (exterior) (46 photos)
Camposanto Monumentale (field) (113 photos)
Camposanto Monumentale (portico) (138 photos)
Chiesa della Spina (112 photos) – Gothic church
Palazzo della Carovana (101 photos) – building
Palazzo dell’Orologio (92 photos) – building
Guelph tower (71 photos)
Basilica of San Piero (48 photos) – church
Certosa (53 photos) – charter house
In order to build and evaluate a classifier for
these classes, we divided the dataset into a training set
(Tr) consisting of 226 photos (approximately 20% of
the dataset) and a test set (Te) consisting of 921 photos
(approximately 80% of the dataset). The image resolution
used for feature extraction is the standard resolution
used by Flickr, i.e., maximum between width and
height equal to 500 pixels.
The total numbers of local features extracted by the
SIFT and SURF detectors were about 1,000,000 and
500,000, respectively.
6.2 Performance Measures
For evaluating the effectiveness of the classifiers in
classifying the documents of the test set, we use
the micro-averaged accuracy and the micro- and macro-averaged
precision, recall and $F_1$.
Micro-averaged values are calculated by constructing
a global contingency table and then calculating
the measures using the global sums. In contrast,
macro-averaged scores are calculated by first calculating
each measure for each category and then taking
the average of these. In most cases we report
the micro-averaged values of each measure.
Precision is defined as the ratio between the correctly
predicted and the overall predicted documents for a
specific class. Recall is the ratio between the correctly
predicted and the overall actual documents for a specific
class. $F_1$ is the harmonic mean of precision and
recall.
Note that, for the single-label classification task,
micro-averaged accuracy is defined as the number
of documents correctly classified divided by the total
number of documents in the test set, and it is equivalent
to the micro-averaged precision, recall and $F_1$
scores.
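For concreteness, a small sketch of the two averaging schemes, given per-class contingency counts of true positives, false positives and false negatives (the dictionary layout is our assumption):

```python
def f1(p, r):
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

def micro_macro_f1(counts):
    """counts: {class: (tp, fp, fn)}. Returns (micro F1, macro F1)."""
    # Micro: sum the per-class contingency tables, then compute once.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp / (tp + fp) if tp + fp else 0.0,
               tp / (tp + fn) if tp + fn else 0.0)
    # Macro: compute the measure per class, then average.
    per_class = [f1(t / (t + p) if t + p else 0.0,
                    t / (t + n) if t + n else 0.0)
                 for t, p, n in counts.values()]
    macro = sum(per_class) / len(per_class)
    return micro, macro
```

For the single-label task considered here, the micro-averaged precision, recall and $F_1$ all coincide with the accuracy, as noted above.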
7 EXPERIMENTAL RESULTS
In this section we report the experimental results obtained
for all the 20 image similarity based classifiers and for the
local feature based classifier. For the image similarity
based classifiers, results are reported for each similarity
measure defined in Section 4.3. We also show that the
proposed measure of confidence can be used to improve
effectiveness on the classified images, at the price of
accepting a small percentage of non-classified objects.
7.1 Image Similarity based Classifiers
In Table 1, the accuracy and macro-averaged $F_1$ of the
image similarity based classifiers for the 20 similarity
functions defined in Section 4 are reported. Note that
the single-label distance-weighted kNN technique has
a parameter $k$ that determines the number of closest
neighbors retrieved in order to classify a given image
(see Section 4). This parameter should be set during
the training phase and is then kept fixed during the test
phase. However, in our experiments we decided to report
the results obtained ranging $k$ between 1 and 100.
For simplicity, in Table 1, we report the best performance
obtained and the $k$ for which it was obtained.
Moreover, we report the performance obtained for
$k = 1$, which is the particular case in which the kNN
classifier simply considers the closest image.
Let us first consider the approaches used for managing
the asymmetry of the similarity functions, discussed
in Section 4.3.5. The best approach for all the similarity
functions, using both SIFT and SURF features,
is the fuzzy and, i.e., $s_{*,and}$. The more traditional
approach $s_{*,Te}$ is the second best in most of the cases.
On the contrary, $s_{*,Tr}$ always offers the worst performance.
In other words, the best results were obtained
when the similarity between two images is computed
as the minimum of the similarities obtained considering
as queries, in turn, the local features of the test image
and those of the training image. The result is the same
both when using SIFT and when using SURF.
The Hough Transform Matches Percentage ($s_h$)
similarity function is the best choice, for both SIFT
and SURF, over all the 5 versions for managing the
asymmetry. The geometric information considered by
this function allows obtaining significantly better
performance, in particular for SURF.
The second best is the Distance Ratio Average ($s_\sigma$),
which only considers the distance ratio as matching
criterion. Please note that $s_\sigma$ does not require a
distance ratio threshold ($c$), because it weights every
match by its distance ratio value. Moreover,
$s_\sigma$ performs slightly better than the Percentage of Matches
($s_m$), which requires the threshold $c$ to be set.
The results obtained by the 1-NN Similarity Average
($s_1$) function show that considering just the distance
between a local feature and its closest neighbor
gives worse performance than considering the
distance ratio average $s_\sigma$. In other words, the similarity
between a local feature and its closest neighbor is
meaningful only when compared to that of the other
nearest neighbors, which is exactly what the distance
ratio does.
Regarding the parameter $k$, it is interesting to note
that the $k$ value for which the best performance was
obtained is, for each similarity measure, typically much
higher for SURF than for SIFT. In other words, the
closest neighbors of the test image in the training set
are more relevant when using SIFT than when using SURF.
This is more evident in Figure 2, where we report
the accuracy obtained for $k$ between 1 and 100 by
both SIFT and SURF using the $s_{m,Te}$ similarity function.
Table 1: Image similarity based classifier ($\hat{\Phi}_s$) performance obtained using various image similarity functions. For each measure and version we report the accuracy and macro $F_1$ for the best $k \in [1, 100]$, the same scores for $k = 1$, and the $k$ values at which the best accuracy and best $F_1$ were obtained.

                         s_1 (Avg 1-NN)             s_m (Perc. Matches)        s_σ (Dist. Ratio Avg)      s_h (Hough Transform)
                         Te   Tr   or   and  avg    Te   Tr   or   and  avg    Te   Tr   or   and  avg    Te   Tr   or   and  avg
Acc (best k)    SIFT     .75  .52  .55  .85  .82    .88  .80  .81  .90  .88    .89  .80  .81  .91  .89    .92  .88  .88  .93  .91
                SURF     .79  .70  .73  .80  .82    .85  .73  .76  .88  .86    .82  .73  .75  .87  .84    .89  .76  .79  .92  .86
F1  (best k)    SIFT     .72  .55  .56  .84  .84    .86  .80  .80  .89  .86    .87  .80  .81  .91  .88    .90  .87  .86  .93  .90
                SURF     .76  .67  .70  .78  .80    .83  .70  .74  .87  .84    .81  .68  .73  .86  .82    .87  .74  .77  .89  .85
Acc (k = 1)     SIFT     .73  .52  .55  .85  .82    .88  .78  .80  .90  .88    .89  .78  .80  .91  .88    .91  .87  .87  .93  .91
                SURF     .79  .63  .67  .80  .82    .81  .60  .62  .86  .79    .81  .63  .64  .84  .76    .87  .66  .68  .90  .81
F1  (k = 1)     SIFT     .72  .55  .53  .84  .84    .86  .78  .80  .89  .86    .87  .79  .80  .90  .87    .90  .86  .86  .92  .90
                SURF     .76  .63  .67  .78  .80    .79  .65  .65  .84  .78    .80  .67  .67  .83  .77    .85  .68  .70  .89  .81
best k (Acc)    SIFT      9    1    1    1    1      1    7    4    2    3      1    5    5    3    5      2    3    9    2    1
                SURF      3    6    8    1    1     20   28   42   14   20      8   23   17   11   14     21   35   39   11   18
best k (F1)     SIFT      1    1    1    1    1      1    7    4    3    3      1    5    5    3    5      2    8    5    9    9
                SURF      1    6    3    1    1     18   28   19   23   20      8    5   17   11   14     21   14   30    3   28
Figure 2: Accuracy obtained for various $k$ using the $s_{m,Te}$ similarity function, for both SIFT and SURF.
SIFT obtains its best performance at smaller values of
$k$ than SURF. Moreover, SIFT performance is generally
higher than SURF. It is also interesting to note that
the performance obtained for $k = 1$ is typically just
slightly worse than that of the best $k$. Thus, $k = 1$
gives very good performance, even if a better $k$ could
be selected during a learning phase.
Two of the similarity measures proposed in Section
4.3 require a parameter to be set. In particular,
the Percentage of Matches ($s_m$) and the Hough
Transform Matches Percentage ($s_h$) use the matching
function defined in Section 4.2.2, which requires the
distance ratio threshold $c$ to be fixed in advance.
In Figure 3 we report the performance obtained
by the Percentage of Matches classifier, i.e., the
image similarity based classifier $\hat{\Phi}_s$ using the
similarity measure $s_m$. For each distance ratio threshold
$c$ we report the best result obtained for $k$ between 1 and
100. As mentioned in Section 4.2.2, in the paper
where SIFT was presented, Lowe suggested
using 0.8 as the distance ratio threshold $c$ (Lowe, 2004). The
results confirm that the threshold proposed in (Lowe,
2004) is the best for both SIFT and SURF, and that the
algorithm is stable around this value. In Table 1, results
were reported for $s_m$ and $s_h$ with $c = 0.8$, for both
SIFT and SURF.
Let us now consider the confidence $\nu_{doc}$ assigned
to the predicted label of each image (see Section 4.1).
This confidence can be used to obtain greater accuracy
at the price of a certain number of false dismissals.
In fact, a confidence threshold can be used
to filter out all the labels assigned to an image with a
confidence $\nu_{doc}$ smaller than the threshold. In Figure 4 we
report the accuracy obtained by the $s_{h,and}$ measure
using SIFT, varying the confidence threshold between 0
and 1. We also report the percentage of images in $Te$
that were not classified, together with the percentage
of images that were actually correctly classified but
were filtered out because of the threshold. Note that
for a threshold of 0.3 the accuracy on the classified
objects rises to 0.99, from the 0.93 obtained for a
threshold of 0. At the same
time, the percentage of correctly predicted images that
are filtered out (i.e., to which the classifier does not
assign a label because of the low confidence) is less
than 10%.
This proves that the defined measure of confidence
is meaningful. However, the best confidence threshold
to be used depends on the task. In some cases it
could be better to try to guess the class of an image
even when uncertain, while in other cases it might
be better to assign a label only if the classification has
a high confidence.
Figure 3: Accuracy and macro $F_1$ obtained for various matching thresholds $c$ by the image similarity based classifier ($\hat{\Phi}_s$) using the $s_{m,Tr}$ similarity measure, for both SIFT and SURF.
Figure 4: Accuracy on the classified images obtained by the image similarity based classifier with the $s_{h,and}$ similarity measure using SIFT, for various values of the image classification confidence threshold $\nu_{doc}$; the percentage of images not classified and the percentage of correctly classified images lost because of the filtering are also reported.
7.2 Local Feature based Classifier
In this section we compare the performance of the image
similarity based classifiers using the 20 similarity
measures defined in Section 4.3 with that of the local
feature based classifier defined in Section 5.
In Table 2, we report the accuracy and macro-averaged
$F_1$ obtained by the Local Feature Based Image
Classifier ($\hat{\Phi}_m$), using both SIFT and SURF,
together with the results obtained by the image similarity
based approach ($\hat{\Phi}_s$) for the various similarity
measures. Considering that in the previous section
we showed that the fuzzy and approach performs better
than the others, we only report the results obtained
for the and version of each measure and for the best
$k$.
The first observation is that the Local Feature
Based Image Classifier ($\hat{\Phi}_m$) performs significantly
better than any image similarity based classifier.
Table 2: Accuracy and macro $F_1$ for the local feature based classifier $\hat{\Phi}_m$ and for the kNN classifiers based on the various image similarity measures proposed, for the best $k$ and the and version.

                      Φ̂_m    s_1,and   s_m,and   s_σ,and   s_h,and
Accuracy    SIFT      .94      .85       .90       .91       .93
            SURF      .93      .80       .88       .87       .92
Macro F1    SIFT      .94      .84       .89       .91       .93
            SURF      .91      .78       .87       .86       .84
In particular, $\hat{\Phi}_m$ performs better than $s_{h,and}$,
even if no geometric consistency checks are performed by
$\hat{\Phi}_m$, while the matches in $s_{h,and}$ are filtered
making use of the Hough transform.
Even if in this paper we did not consider the
computational cost of classification, we can make some
simple observations. It is worth saying that
the local feature based classifier is less critical from
this point of view. First, the closest neighbors of the
local features of the test image are searched for once
over the whole training set $Tr$, and not repeatedly for
each image of $Tr$. Second, it is possible to leverage a
single global spatial index over all the features in $Tr$ to
support efficient $k$ nearest neighbor searching. In fact,
the similarity between two local features is based on the
Euclidean distance, which is a metric; thus, it can be
efficiently indexed using metric data structures
(Zezula et al., 2006; Samet, 2005; Batko et al., 2008).
Regarding the local features used and the computational
cost, we underline that the number of local
features detected by the SIFT extractor is twice that
detected by SURF. Thus, on the one hand SIFT has better
performance, while on the other hand SURF is more
efficient.
8 CONCLUSIONS
In this paper we addressed the problem of image content
recognition using local features and kNN based
classification techniques. We defined 20 similarity
functions and compared their performance on a
landmark recognition task. We found that a
two-way comparison of two images based on the fuzzy
and gives better performance than the standard approach,
which compares a query image with the ones in a
training set. Moreover, we showed that the similarity
functions relying on the matching of local features that
makes use of geometric constraints perform slightly
better than the others.
Finally, we defined a novel kNN classifier that first
assigns a label to each local feature of an image and
then labels the whole image by considering the labels
and the confidences assigned to its local features.
The experiments showed that our proposed local
feature based classification approach outperforms
the standard image similarity kNN approach in
combination with any of the defined image similarity
functions, even the ones considering geometric
constraints.
ACKNOWLEDGEMENTS
This work was partially supported by the VISITO
Tuscany project, funded by Regione Toscana, in the
POR FESR 2007-2013 program, action line 1.1.d, and
the MOTUS project, funded by the Industria 2015
program.
REFERENCES
Amato, G., Falchi, F., and Bolettieri, P. (2010). Recognizing
landmarks using automated classification techniques:
an evaluation of various visual features. In
Proceedings of the Second International Conference
on Advances in Multimedia (MMEDIA 2010), Athens,
Greece, 13-19 June 2010, pages 78–83. IEEE Computer
Society.
Ballard, D. H. (1981). Generalizing the Hough transform
to detect arbitrary shapes. Pattern Recognition,
13(2):111–122.
Batko, M., Novak, D., Falchi, F., and Zezula, P.
(2008). Scalability comparison of peer-to-peer sim-
ilarity search structures. Future Generation Comp.
Syst., 24(8):834–848.
Bay, H., Tuytelaars, T., and Gool, L. J. V. (2006). Surf:
Speeded up robust features. In ECCV (1), pages 404–
417.
Boiman, O., Shechtman, E., and Irani, M. (2008). In de-
fense of nearest-neighbor based image classification.
In CVPR.
Chen, T., Wu, K., Yap, K.-H., Li, Z., and Tsai, F. S. (2009).
A survey on mobile landmark recognition for informa-
tion retrieval. In MDM ’09: Proceedings of the 2009
Tenth International Conference on Mobile Data Man-
agement: Systems, Services and Middleware, pages
625–630, Washington, DC, USA. IEEE Computer So-
ciety.
Dudani, S. (1975). The distance-weighted k-nearest-
neighbour rule. IEEE Transactions on Systems, Man
and Cybernetics, SMC-6(4):325–327.
Fagni, T., Falchi, F., and Sebastiani, F. (2010). Image classi-
fication via adaptive ensembles of descriptor-specific
classifiers. Pattern Recognition and Image Analysis,
20:21–28.
Falchi, F. (2010). Pisa landmarks dataset.
http://www.fabriziofalchi.it/pisaDataset/. last ac-
cessed on 30-March-2010.
Google (2010). Google Goggles. http://www.google.com/
mobile/goggles/. last accessed on 30-March-2010.
Jégou, H., Douze, M., and Schmid, C. (2010). Improving
bag-of-features for large scale image search. Int. J.
Comput. Vision, 87(3):316–336.
Kennedy, L. S. and Naaman, M. (2008). Generating diverse
and representative image search results for landmarks.
In WWW ’08: Proceeding of the 17th international
conference on World Wide Web, pages 297–306, New
York, NY, USA. ACM.
Lowe, D. G. (2004). Distinctive image features from scale-
invariant keypoints. International Journal of Com-
puter Vision, 60(2):91–110.
Samet, H. (2005). Foundations of Multidimensional and
Metric Data Structures. Computer Graphics and Geo-
metric Modeling. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA.
Serdyukov, P., Murdock, V., and van Zwol, R. (2009). Plac-
ing flickr photos on a map. In Allan, J., Aslam, J. A.,
Sanderson, M., Zhai, C., and Zobel, J., editors, SIGIR,
pages 484–491. ACM.
Yeh, T., Tollmar, K., and Darrell, T. (2004). Searching the
web with mobile images for location recognition. In
CVPR (2), pages 76–81.
Zezula, P., Amato, G., Dohnal, V., and Batko, M. (2006).
Similarity Search: The Metric Space Approach, vol-
ume 32 of Advances in Database Systems. Springer-
Verlag.
Zheng, Y., Zhao, M., Song, Y., Adam, H., Buddemeier,
U., Bissacco, A., Brucher, F., Chua, T.-S., and Neven,
H. (2009). Tour the world: Building a web-scale landmark
recognition engine. In CVPR, pages 1085–1092.
IEEE.