Performance Evaluation of Image Filtering for Classification and Retrieval

Falk Schubert

and Krystian Mikolajczyk

EADS Innovation Works, Ottobrunn, Germany

University of Surrey, Guildford, U.K.

Keywords:

Image Processing, Filtering, Enhancement, Logo Retrieval, Scene Classiﬁcation.

Abstract:

Much research effort in the literature is focused on improving feature extraction methods to boost the per-

formance in various computer vision applications. This is mostly achieved by tailoring feature extraction

methods to speciﬁc tasks. For instance, for the task of object detection often new features are designed that

are even more robust to natural variations of a certain object class and yet discriminative enough to achieve

high precision. This focus led to a vast amount of different feature extraction methods with more or less

consistent performance across different applications. Instead of ﬁne-tuning or re-designing new features to

further increase performance we want to motivate the use of image ﬁlters for pre-processing. We therefore

present a performance evaluation of numerous existing image enhancement techniques which help to increase

performance of already well-known feature extraction methods. We investigate the impact of such image en-

hancement or ﬁltering techniques on two state-of-the-art image classiﬁcation and retrieval approaches. For

classiﬁcation we evaluate using a standard Pascal VOC dataset. For retrieval we provide a new challenging

dataset. We ﬁnd that gradient-based interest-point detectors and descriptors such as SIFT or HOG can beneﬁt

from enhancement methods and lead to improved performance.

1 INTRODUCTION

Significant progress has been made in image recognition

and retrieval over the past decades due to intensive

studies of feature extraction methods, image repre-

sentation and machine learning techniques. A num-

ber of alternative solutions have been proposed for

each of the well established steps of the recognition

and retrieval approaches. However, little research has

been carried out on the quantitative inﬂuence of pre-

processing steps which alter the image before apply-

ing the commonly used feature extractors in the com-

puter vision applications mentioned above. Previous

works include only basic ﬁltering methods (e.g. blur-

ring) employed in the context of very speciﬁc tasks

such as face recognition (Heseltine et al., 2002; Gross

and Brajovic, 2003; Kumar et al., 2011) or character

recognition (Huang et al., 2007). Some open-source

implementations of feature extractors also apply blur-

ring as an initial step, but such pre-processing steps

are never discussed in terms of quantitative perfor-

mance gain in the respective papers. Besides these

simple ﬁltering techniques, there exists however a

wide variety of more advanced image ﬁltering tech-

niques (e.g. bilateral ﬁltering, cartoon-style or image-

based rendering) in the domain of computer graph-

ics which are not commonly used. Such ﬁlters, e.g.

abstraction ﬁlters, have a direct impact on the image

gradients which leads to a normalization of gradient-

based descriptors. We therefore want to motivate the

use of such advanced pre-processing ﬁlters in order to

further increase the performance of computer vision

applications, instead of re-designing or ﬁne-tuning

features for a speciﬁc computer vision task.

To better understand the quantitative difference

image ﬁltering techniques can generally make on the

performance of feature extractors, we present in this

paper a performance evaluation of a number of im-

age enhancement or modiﬁcation techniques applied

to two common computer vision applications: scene

recognition and logo retrieval. To our knowledge this

is the ﬁrst quantitative evaluation of such image ﬁl-

tering for pre-processing. Because image ﬁltering is

a data-driven or pixelwise local process, it is to be

expected that the inﬂuence of image ﬁltering also de-

pends on the image content. We therefore evaluate us-

ing different datasets consisting of images of various

categories. For scene recognition we evaluate using


Schubert F. and Mikolajczyk K. (2013).

Performance Evaluation of Image Filtering for Classiﬁcation and Retrieval.

In Proceedings of the 2nd International Conference on Pattern Recognition Applications and Methods, pages 485-491

DOI: 10.5220/0004333104850491

 SciTePress

the well-known Pascal VOC 2007 dataset (Evering-

ham et al., 2010) which contains 20 different types

of image scenes (e.g. natural scenes, man-made ob-

jects, etc.). For logo retrieval we evaluate using a

dataset of 30 different logo classes (e.g. Volkswagen,

BMW, Coca Cola, etc.) which consists of real im-

ages of these logos captured in normal life (i.e. the im-

ages were taken from personal and professional pho-

tographs downloaded from Flickr).

The paper is structured as follows: In section 2 we

briefly discuss the different filtering techniques which

we will consider. In section 3 we give details about

the implementation of the two computer vision applications.

In section 4 we describe the evaluation protocol

and discuss the results on the benchmark datasets.

2 FILTERING TECHNIQUES

The type of normalization and hence the effect on

subsequent steps of the recognition system, in partic-

ular feature extraction, strongly depends on the way

the ﬁlter modiﬁes the image. We focus on three cate-

gories of ﬁlters: boosting gradients, suppressing gra-

dients and enhancing color. These types are moti-

vated by the fact that the most successful features in

computer recognition applications are based on gradi-

ents (Everingham et al., 2010) (e.g. Harris, Hessian,

HOG, SIFT, DAISY) and color (e.g. Color-SIFT). For

each of the ﬁlter types we consider different methods

which we discuss in the following.

2.1 Boosting Gradients

In the computation of HOG or SIFT the strength of

the gradient is used to weight the corresponding bins

in the histograms. Hence boosting important gradi-

ents can increase their importance in the image de-

scriptors. We consider two different variations to

increase the strength of gradients: convolution with

sharpening kernels and tonemapping (Fattal et al.,

2002), which is based on a compression, where weak

edges are boosted and strong ones are reduced. The

ﬁrst is a fairly simple and well-known ﬁlter. The sec-

ond one is a more complex ﬁltering technique which

roughly works as follows. A new gradient ﬁeld is

computed from the existing gradients H(x, y) accord-

ing to the formula (Fattal et al., 2002) :

G(x, y) = ∇H(x, y) · ϕ(x, y)

The attenuation factor ϕ_k modifies the existing gradients

H_k(x, y) at different resolution scales k of an image

and is computed as (Fattal et al., 2002):

\[ \varphi_k(x, y) = \frac{\alpha}{\|\nabla H_k(x, y)\|} \left( \frac{\|\nabla H_k(x, y)\|}{\alpha} \right)^{\beta} \]

From this gradient ﬁeld G an image is reconstructed

using the Poisson equation:

\[ \nabla^2 I = \operatorname{div} G \]

The parameters α and β control the amount of atten-

uation of large gradients and magniﬁcation of small

ones. The effect of tonemapping (tmo) is visualized

in the right column of Fig. 2.
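As a minimal sketch of the attenuation step, the following Python code applies a single-scale version of the factor ϕ = (α/‖∇H‖)(‖∇H‖/α)^β to a 1D signal. In 1D the Poisson reconstruction reduces to a cumulative sum of the modified gradients. The function names, the test signal and the values α = 0.1 and β = 0.8 are assumptions chosen for illustration and are not the settings used in the experiments; the multi-scale scheme of (Fattal et al., 2002) is omitted.

```python
def attenuation(grad_mag, alpha, beta):
    """Single-scale attenuation factor (alpha/|g|) * (|g|/alpha)**beta."""
    if grad_mag == 0.0:
        return 1.0  # a zero gradient stays zero; this also avoids dividing by zero
    return (alpha / grad_mag) * (grad_mag / alpha) ** beta


def tonemap_1d(signal, alpha=0.1, beta=0.8):
    """Attenuate the gradients of a 1D signal and re-integrate them."""
    # Forward differences play the role of the gradient field.
    grads = [b - a for a, b in zip(signal, signal[1:])]
    new_grads = [g * attenuation(abs(g), alpha, beta) for g in grads]
    # In 1D, solving the Poisson equation amounts to a cumulative sum
    # that starts from the first sample as boundary value.
    out = [signal[0]]
    for g in new_grads:
        out.append(out[-1] + g)
    return out


# Hypothetical signal: weak ramps around one strong edge.
signal = [0.0, 0.01, 0.02, 1.02, 1.03]
result = tonemap_1d(signal)
print(result)
```

With β < 1, gradients smaller than α receive a factor above one and gradients larger than α a factor below one, which is the compression behaviour described above.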

2.2 Suppressing Gradients

Eliminating only weak gradients results in smooth

image segments and cartoon-like stylization. On

such images feature detectors generate interest points

mainly on dominant image structure. These interest

points tend to be more stable under different vari-

ations (e.g. pose variations). This effect can help

to focus a learning process on the important image

structures, leading to a better visual recognition de-

spite the loss of information. We consider four dif-

ferent gradient suppressing ﬁlters: Gaussian blur-

ring, median ﬁltering, bilateral ﬁltering (Tomasi and

Manduchi, 1998) and weighted-least-squares ﬁlter-

ing (wls) (Farbman et al., 2008). The ﬁrst three are

often used as pre-processing filters. However, the

impact of these pre-processing steps is rarely discussed

or evaluated. The fourth filter is an advanced

edge-preserving ﬁlter superior to the standard bilat-

eral ﬁltering technique. The image is obtained via an

iterative optimization procedure (i.e. weighted least

squares) which minimizes the following cost function

C (Farbman et al., 2008):

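\[ C = \sum_{(x,y)} \left( \left[ I(x, y) - O(x, y) \right]^2 + \lambda \left[ u(x, y) \left( \frac{\partial I}{\partial x} \right)^2 + v(x, y) \left( \frac{\partial I}{\partial y} \right)^2 \right] \right) \]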

The ﬁrst term of the cost function ensures that the re-

sulting image I is visually similar to the original input

image O. The second term acts as a regularizer which

suppresses gradients along x- and y-direction in the

resulting image at all locations where the input im-

age O contains weak gradients. These locations are

controlled by the spatially varying weights u and v:

\[ u(x, y) = \left( \left| \frac{\partial O(x, y)}{\partial x} \right|^{\alpha} + \varepsilon \right)^{-1} \]

The formulation for v is analogous to u, just the par-

tial derivative is along y direction. The manually cho-

sen parameter α controls the strength of the smooth-

ing effect. The constant ε prevents division by zero.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

486

All other locations which contain dominant gradients

are left unchanged. The parameters λ and the weight

functions u and v control the amount of smoothing.

The effect of this edge-preserving or weighted least

squares (wls) ﬁltering is visualized in the center col-

umn of Fig. 2.
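To illustrate how the weights steer the smoothing, the following sketch minimizes a 1D analogue of the cost function, C = Σ (I_i − O_i)² + λ Σ u_i (I_{i+1} − I_i)² with u_i = (|O_{i+1} − O_i|^α + ε)^{-1}. Because this cost is quadratic, its minimizer is the solution of a tridiagonal linear system, solved here with the Thomas algorithm. The function name, the test signal and the values of λ, α and ε are assumptions for illustration; this is not the 2D implementation of (Farbman et al., 2008).

```python
def wls_smooth_1d(signal, lam=0.1, alpha=1.2, eps=1e-4):
    """Minimise sum (I-O)^2 + lam * sum u * (dI)^2 for a 1D signal O."""
    n = len(signal)
    if n == 0:
        return []
    # Smoothness weights between neighbours i and i+1:
    # large where the input is flat, small across strong edges.
    u = [1.0 / (abs(signal[i + 1] - signal[i]) ** alpha + eps)
         for i in range(n - 1)]
    # Tridiagonal system A I = O with A = Id + lam * D^T U D.
    lower = [0.0] + [-lam * u[i - 1] for i in range(1, n)]
    upper = [-lam * u[i] for i in range(n - 1)] + [0.0]
    diag = [1.0 - lower[i] - upper[i] for i in range(n)]
    # Thomas algorithm: forward elimination ...
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = signal[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (signal[i] - lower[i] * d[i - 1]) / denom
    # ... followed by back substitution.
    out = [0.0] * n
    out[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        out[i] = d[i] - c[i] * out[i + 1]
    return out


# Hypothetical noisy step edge.
noisy_step = [0.0, 0.02, -0.01, 0.01, 1.0, 1.02, 0.99, 1.01]
print(wls_smooth_1d(noisy_step))
```

In this example the small fluctuations on both sides of the step are strongly coupled to their neighbours and are flattened, while the weight across the step is close to one, so the edge is largely retained.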

To better understand the difference of gradient

suppression and gradient enhancement an illustration

is given in Fig. 1. Intensity values along line scans

of an example image are shown. The left plot shows

the original intensity values, the center one shows

the intensity values after tonemapping, the right one

shows the intensity values after applying the WLS ﬁl-

ter. Filters such as tonemapping boost weak gradi-

ents and keep the dominant ones unchanged. The ﬁl-

ters suppressing gradients such as abstraction ﬁlters

keep dominant gradients but significantly smooth

small gradients. The choice of the inﬂuence of boost-

ing and suppression is clearly arbitrary and depends

on the image content and the subjective taste of the

user. In our experiments in sec. 4 we selected these

parameters manually prior to all experiments without

focussing on increasing the performance but purely

on visual appearance (e.g. the settings of the WLS filter

were chosen to clearly suppress weak gradients

whereas the tonemapping was set to clearly boost them).

In Fig. 2 examples of the ﬁltered images are depicted

for each dataset. In future work it would be very helpful

to have a learning-based process that finds these

settings automatically. However, the contribution of

this paper is a quantitative performance evaluation

of the impact of the filtering techniques, independent

of whether they have been chosen manually or automatically.
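The behaviour of the simpler gradient-suppressing filters can be sketched in 1D as follows. The median filter removes an isolated outlier, and the bilateral filter averages only over neighbours with similar intensity, so a step edge survives. Function names, window sizes and the values of sigma_s and sigma_r are assumptions for illustration and do not correspond to the settings of the experiments.

```python
import math
from statistics import median


def median_filter_1d(signal, radius=1):
    """Replace each sample by the median of its (truncated) neighbourhood."""
    n = len(signal)
    return [median(signal[max(0, i - radius):min(n, i + radius + 1)])
            for i in range(n)]


def bilateral_filter_1d(signal, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Weighted average with a spatial and an intensity (range) Gaussian."""
    n = len(signal)
    out = []
    for i in range(n):
        num = 0.0
        den = 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            spatial = (i - j) ** 2 / (2.0 * sigma_s ** 2)
            rng = (signal[i] - signal[j]) ** 2 / (2.0 * sigma_r ** 2)
            w = math.exp(-spatial - rng)
            num += w * signal[j]
            den += w
        out.append(num / den)  # den >= 1 because j == i contributes weight 1
    return out


print(median_filter_1d([0, 0, 1, 0, 0]))          # the isolated spike is removed
print(bilateral_filter_1d([0, 0, 0, 1, 1, 1]))    # the step edge is kept
```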

2.3 Enhancing Colors

Using color in image descriptors was reported to sig-

niﬁcantly improve the recognition results (Yan et al.,

2012). Pictures are often taken with sub-optimal

color settings due to simplistic auto-exposure con-

trols and auto-white-balancing. A post-processing

step can help recover or improve the contrast of the

image if all details and structures are captured without

saturation. A histogram normalization step (which

we call colorboost) can be used to equalize the col-

ors within an image and bring out much better detail.

We use the method described in (Horváth, 2011),

which is very robust and parameter free. The ﬁltered

color images also show better contrast when convert-

ing them to grayscale. As all images in our bench-

mark datasets are colored, we can therefore evalu-

ate this color-normalization also for descriptors which

only use grayscale images.
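The general idea of a histogram normalization can be sketched with plain histogram equalization of one intensity channel. This is a generic textbook technique and explicitly not the method of (Horváth, 2011), whose internals are not described here; the function name and the example values are assumptions for illustration. For a color image the same mapping could be applied per channel.

```python
def equalize(values, levels=256):
    """Histogram-equalize a flat list of integer intensities in [0, levels-1]."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    # Cumulative distribution of the intensities.
    cdf = []
    total = 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)
    n = len(values)
    if n == cdf_min:
        return list(values)  # all pixels share one value: nothing to spread
    scale = (levels - 1) / (n - cdf_min)
    return [round((cdf[v] - cdf_min) * scale) for v in values]


# Hypothetical low-contrast intensities, all between 100 and 130.
print(equalize([100, 100, 110, 120, 120, 130]))  # -> [0, 0, 64, 191, 191, 255]
```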

3 APPLICATIONS

A straightforward approach to investigate the impact

of image filtering techniques could either consist of

a simple toy application (e.g. simple nearest neigh-

bor search within a pool of features) or some heuristic

measurements on the feature vector (e.g. variations of

individual feature dimensions or intra/inter-class vari-

ance). However, in our opinion conclusions drawn

from such experiments cannot really be generalized to

other realistic applications. We therefore propose to

evaluate the impact of image enhancement on the per-

formance of typical computer vision tasks. We con-

sidered two different applications: scene recognition

and image retrieval. Both applications share a search

task or matching step based on features that are com-

puted. In the ﬁrst case this matching step is based on

a learning process, whereas in the second case a sim-

ple distance measure is used. In the following a brief

summary of the implementation of each application is

given.

3.1 Image Retrieval

We employ a bag-of-words representation (Sivic and

Zisserman, 2003) which has become the state-of-the-

art for fast scalable retrieval and classiﬁcation tasks.

The different computation steps can be summarized

as follows:

1. detect interest points (Hessian and Harris-

Laplace)

2. extract SIFT features at the interest points

3. generate visual words from the image features

of the whole training dataset using randomly se-

lected clusters

4. compute an inverted ﬁle index from histograms of

visual word occurrences for every image

5. search with new image as query using L2-norm

on the index signatures

Many powerful extensions have been proposed in

the past to enforce geometric consistency or expand

queries (Philbin, 2010). However, the baseline ap-

proach as described in (Sivic and Zisserman, 2003)

is sufﬁcient to demonstrate the beneﬁt of using pre-

processing ﬁlters.
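The retrieval steps listed in this section can be sketched as follows. The sketch makes several simplifying assumptions: toy 2D descriptors stand in for 128-dimensional SIFT features, the vocabulary is fixed by hand instead of being drawn randomly from training features, and a plain dictionary of signatures replaces the inverted file index. All names and values are hypothetical.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Index of the visual word (cluster centre) closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda k: math.dist(descriptor, vocabulary[k]))


def bow_signature(descriptors, vocabulary):
    """L2-normalised histogram of visual-word occurrences for one image."""
    hist = [0.0] * len(vocabulary)
    for d in descriptors:
        hist[nearest_word(d, vocabulary)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]


def rank_images(query_sig, index):
    """Image ids sorted by increasing L2 distance to the query signature."""
    return sorted(index, key=lambda image_id: math.dist(query_sig, index[image_id]))


# Hypothetical vocabulary of three visual words and two indexed images.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
index = {
    "a": bow_signature([(0.1, 0.0), (0.9, 0.1), (1.0, 0.0)], vocabulary),
    "b": bow_signature([(0.0, 0.9), (0.1, 1.0), (0.0, 0.1)], vocabulary),
}
query = bow_signature([(0.95, 0.0), (1.1, 0.1), (0.05, 0.05)], vocabulary)
print(rank_images(query, index))  # image "a" shares the query's word histogram
```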

3.2 Scene Classiﬁcation

In image classiﬁcation experiments we use the ap-

proach from (Yan et al., 2012) that has proven very

successful in various classiﬁcation benchmarks. The

different computation steps can be summarized as fol-

lows:

PerformanceEvaluationofImageFilteringforClassificationandRetrieval

487

Figure 1: Intensity values (diagrams on bottom; horizontal axis: image width, vertical axis: intensity) along the line scans across the image (red) are shown for different filters: original (left), boosting gradients (center) and suppressing gradients (right).

1. compute local image descriptors (e.g. SIFT,

CSIFT, etc.) on a uniform dense grid

2. generate visual words from the image features

of the whole training dataset using randomly se-

lected clusters

3. compute a histogram of visual word occurrences

for every image

4. compute spatial pyramid match kernel (Lazebnik

et al., 2006) from the histograms

5. train SVM with χ² kernels

6. classify new image using combination of multiple

kernels
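As an illustration of the kernel used in step 5, the following sketch computes a χ² kernel between two visual-word histograms. The exponential form exp(−γ · χ²) is a common choice in bag-of-words classification and is an assumption here; the exact kernel variant and normalization of (Yan et al., 2012) may differ. The function names and the parameter gamma are hypothetical.

```python
import math


def chi2_distance(h1, h2):
    """Chi-square distance sum (a-b)^2 / (a+b) between two histograms."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # empty bins in both histograms contribute nothing
            total += (a - b) ** 2 / (a + b)
    return total


def chi2_kernel(h1, h2, gamma=1.0):
    """Exponential chi-square kernel (assumed form)."""
    return math.exp(-gamma * chi2_distance(h1, h2))


print(chi2_kernel([0.5, 0.5], [0.5, 0.5]))    # identical histograms give 1.0
print(chi2_kernel([0.5, 0.5], [0.25, 0.75]))  # less similar histograms give a smaller value
```

The resulting kernel matrix over all training images could then be passed to any SVM implementation that accepts precomputed kernels.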

4 EVALUATION

In this section we evaluate the impact of the image ﬁl-

tering techniques on the recognition applications. For

classiﬁcation and retrieval different benchmarks have

been established in the literature. We evaluate the im-

pact of the image ﬁltering techniques using standard

evaluation protocol of the respective datasets used for

the two applications. For the scene recognition the

Pascal datasets are considered as the gold standard.

We use the Pascal VOC 2007 dataset (Everingham

et al., 2010) which contains 20 different types of im-

age scenes (e.g. natural scenes, man-made objects,

etc.). These variations can be considered challeng-

ing enough to allow drawing valid conclusions about

the impact of the ﬁltering techniques.

For the image retrieval task, we use our own dataset

for several reasons. Typical retrieval datasets (e.g.

the Oxford Building dataset (Philbin, 2010) or the

Flickr1M dataset as used in (Jégou et al., 2008)) either

address the scalability of the retrieval task or

are designed for the retrieval of a very specific image.

In the first case this means that they are very big (e.g. the

Flickr1M dataset with 1 million images (Jégou et al.,

2008)) and the goal is to show that the retrieval engine

is capable of ﬁnding many images that are similar to

the input image. In the second case these datasets

contain images of speciﬁc objects (e.g. buildings like

in the Oxford building dataset (Philbin, 2010)) which

might have been taken from different view points and

the goal is to ﬁnd all instances of the object shown on

the input image.

We would like to generalize the retrieval task fur-

ther by allowing more variation to the retrieved ob-

jects but still ensure that the look of the object is well

deﬁned. This is the case in logo retrieval. The lo-

gos are like objects (e.g. building) but they can have

different appearances (e.g. a painted logo or printed

logo) and yet belong to the same logo label. A few

datasets for logo retrieval exist, but some of them

are too simple (e.g. they only contain synthetic

images (Jain and Doermann, 2012)) and others are

not consistent (e.g. logos keep the same label even though

the logo changed its design over time (Kalantidis

et al., 2011)). We therefore provide a new logo retrieval

dataset with 30 different logo classes which

has roughly the same size and variation as the existing

datasets (e.g. the Flickr27 logo dataset with 27 logo

classes (Kalantidis et al., 2011)).

4.1 Scene Classiﬁcation

Improvements in scene classiﬁcation are evaluated us-

ing the evaluation protocol from Pascal VOC 2007

dataset (Everingham et al., 2010). More speciﬁcally

the “average-precision” (AP) which is the area under

the precision-recall curve (Everingham et al., 2010)

is computed for each scene class for both the origi-

nal, unaltered images and for all ﬁltered ones. In this

experiment we considered four different ﬁlters (blur,

colorboost, bilateral, wls). The filters were applied

to each training and test image and evaluated sepa-

rately with constant settings for all experiments. The

results are summarized in Tab. 1.
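For reference, the average precision of a ranked result list can be sketched as follows. This sketch uses the non-interpolated form (mean of the precision values at the ranks of the relevant items); the official VOC 2007 development kit uses an 11-point interpolated variant, so absolute numbers can differ slightly. The function name and the example rankings are hypothetical.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """Non-interpolated AP of a ranked list of relevance flags (best match first)."""
    if num_relevant is None:
        num_relevant = sum(1 for rel in ranked_relevance if rel)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant


print(average_precision([True, True, False, False]))   # perfect ranking gives 1.0
print(average_precision([True, False, True, False]))   # (1/1 + 2/3) / 2
```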

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

488

Table 1: Comparison of recognition performance (AP) on

VOC 2007 using best performing ﬁlter and original images.

In the fourth column the difference in AP between filtered

and original images is given.

class       filtered   original   diff   best filter
aeroplane   64.9       64.4       +0.5   bilateral
bicycle     56.2       52.9       +4     wls
bird        43         37         +6     bilateral
boat        55.5       52.5       +3     colorboost
bottle      19         14.3       +4.7   bilateral
bus         43.4       43.1       +0.3   colorboost
car         69.4       68         +1.4   bilateral
cat         45.4       46.4       -1     colorboost
chair       42.4       41.6       +0.8   bilateral
cow         23.9       21.8       +2     wls
table       31.9       29.5       +2.4   bilateral
dog         35.8       36.1       -0.3   colorboost
horse       64.6       65.2       -0.7   colorboost
motorbike   52.6       49         +3.6   wls
person      78.7       77.8       +0.9   bilateral
plant       22.6       18.6       +4     bilateral
sheep       26.6       28         -1.4   bilateral
sofa        33.7       32.6       +1.1   blur
train       64.3       63.2       +1.1   bilateral
tv          39.9       39.2       +0.7   colorboost

For 16 out of 20 classes in Tab. 1 ﬁltered images

produce better results than the original ones. Gradient

suppression (e.g. bilateral or wls filters) in particular

improves the AP by up to 6 percentage points (class “bird”). This

can be explained by the elimination of weak, noisy

gradients using abstraction ﬁlters such as bilateral ﬁl-

tering. For instance, many of the images in the class

“bird” were captured with background such as vegeta-

tion and nature, which contain many ﬁne detailed gra-

dients that are irrelevant for the classiﬁcation. Focus-

ing the descriptors on dominant gradients (e.g. stems

from trees and not the leaves, bird shape and not the

feathers) helps to discriminate these images. Again

we note that an automatic choice of the best performing

filter would be required for practical applications.

However, in this experiment we are more interested

in the quantitative performance differences,

which indicate how much mAP can be gained by a

good choice of image ﬁltering for preprocessing. The

ﬁlter parameters were manually chosen prior to all

experiments without focussing on increasing the per-

formance but purely on visual appearance to achieve

clearly visible ﬁltering effects.

4.2 Image Retrieval

For reasons mentioned above, we collect our own

benchmark dataset with images that present particu-

lar challenge to the descriptors due to various render-

ing methods (e.g. logo is painted on a wall or carved

out of metal) which introduces more appearance vari-

ations (see Fig. 2 and Fig. 1 for some examples).

In such cases, image ﬁltering is especially expected

to aid the matching process. The dataset consists

of 30 randomly chosen logo classes from well-known brands

(e.g. Coca Cola). For each logo 10 random images

were drawn from 1000 images downloaded from

www.flickr.com using the logo name as the search

query. For all 300 images of the dataset the occurrences

of the logos are labeled. The retrieval task is

to use each labeled logo and retrieve all the other ones

with the same label. We use the same protocol for

the generation of the index and evaluation of the re-

trieval performance as in (Sivic and Zisserman, 2003).

Similarly to the evaluation of scene classiﬁcation all

ﬁlter settings were constant for all images and were

chosen prior to running the experiments. In Tab. 2

the summarized mean-average-precision (mAP) val-

ues are listed separately for the two interest point de-

tectors (Harris-Laplace and Hessian-Laplace) used in

the experiment. For each query image an AP value

(Everingham et al., 2010) is generated which is then

averaged (mAP value) across all queries belonging to

the same logo label. We further average these mAP

values over all logo labels to generate a single score

for each filter. We can observe that gradient suppression

filters, in particular median and wls, improve the

retrieval mAP by up to about 8 percentage points (wls, Tab. 2). The performance gain depends

on the type of interest point detector, but the general

tendency is the same. It is important to note that

the overall performance of mAP ≈ 45% is not very

high compared to systems with geometric veriﬁca-

tion or query expansion (Philbin, 2010). However,

in this experiment we are interested in relative per-

formance differences between ﬁltered and unaltered

images. Although the overall performance across a

collection of 30 very different logos consistently im-

proves by using wls ﬁltering, we noticed that certain

logo types beneﬁt more than the others. Car logos

(e.g. Porsche) which do not vary as much in their

rendering form (e.g. car logos are usually printed on

badges and not other material like T-Shirts) improve

by 58.1 percentage points (mAP for the “Porsche” logo is 36.2% using original images

and 94.3% using wls filtering).
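The two-level averaging used for the retrieval scores, and the distinction between an absolute gain in percentage points and a relative gain, can be sketched as follows. The function names and the per-query AP values are hypothetical; only the two “Porsche” mAP values (36.2 and 94.3) are taken from the text.

```python
from statistics import mean


def overall_map(ap_by_label):
    """Average AP over the queries of each label, then over all labels."""
    per_label = {label: mean(aps) for label, aps in ap_by_label.items()}
    return mean(list(per_label.values())), per_label


def gain(filtered, original):
    """Absolute gain (percentage points) and relative gain (percent)."""
    absolute = filtered - original
    relative = 100.0 * absolute / original
    return absolute, relative


# Hypothetical per-query AP values for two logo labels.
score, per_label = overall_map({"logo_a": [0.5, 0.7], "logo_b": [0.2, 0.4, 0.6]})
print(score, per_label)

# mAP values for the "Porsche" logo as reported in the text.
absolute, relative = gain(94.3, 36.2)
print(round(absolute, 1), round(relative, 1))
```

Averaging per label first prevents labels with many queries from dominating the overall score.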

5 CONCLUSIONS

The results from the evaluation indicate that image ﬁl-

tering signiﬁcantly improves the matching and clas-

siﬁcation performance. Furthermore the amount of

improvement and the type of best performing ﬁlter

PerformanceEvaluationofImageFilteringforClassificationandRetrieval

489

Figure 2: Sample images (original and filtered) from the logo dataset (top 2 rows) and Pascal VOC 2007 (bottom 2 rows). Columns: original (left), wls (center), tmo (right).

Table 2: Mean-Average-Precision (mAP) listed for each ﬁl-

ter and interest point detector. Behind each mAP score, the

difference to the original (top row) is given.

filter name    Harris-Laplace (diff)   Hessian-Laplace (diff)
original       32.4                    38.4
bilateral      35.5 (+2.9)             39.5 (+1.1)
blur           33.7 (+1.3)             39.8 (+1.4)
colorboost     33.3 (+0.9)             41.4 (+3.0)
median         35.4 (+3.0)             44.2 (+5.8)
sharpen        29.7 (-2.7)             36.1 (-2.3)
tonemapping    31.9 (-0.5)             39.4 (+1.0)
wls            40.4 (+8.0)             46.9 (+8.5)

depends on the image category (e.g. natural scenes,

synthetic images). For the recognition for each class

different ﬁlters perform best. For the retrieval certain

types of logos beneﬁt more from ﬁltering than oth-

ers. In future work we would like to further investi-

gate the impact of image ﬁltering on different types of

interest point detectors and features. Also we would

like to develop an automatic selection process which

ﬁnds the best suiting ﬁlter type and parameter settings

given a training dataset. Last but not least, we would

like to include other and notably larger datasets for

the retrieval and consider other applications like ob-

ject detection.

ICPRAM2013-InternationalConferenceonPatternRecognitionApplicationsandMethods

490

REFERENCES

Everingham, M., Van Gool, L., Williams, C. K. I., Winn,

J., and Zisserman, A. (2010). The pascal visual object

classes (voc) challenge. IJCV.

Farbman, Z., Fattal, R., Lischinski, D., and Szeliski, R.

(2008). Edge-preserving decompositions for multi-

scale tone and detail manipulation. SIGGRAPH.

Fattal, R., Lischinski, D., and Werman, M. (2002). Gradi-

ent domain high dynamic range compression. In SIG-

GRAPH.

Gross, R. and Brajovic, V. (2003). An image preprocessing

algorithm for illumination invariant face recognition.

In AVBPA.

Heseltine, T., Pears, N. E., and Austin, J. (2002). Evalua-

tion of image pre-processing techniques for eigenface

based face recognition. ICIG.

Horváth, A. (2011). Aaphoto.

http://log69.com/aaphoto_en.html

Huang, B. Q., Zhang, Y. B., and Kechadi, M. T. (2007). Pre-

processing techniques for online handwriting recogni-

tion. In ISDA.

Jain, R. and Doermann, D. (2012). Logo retrieval in docu-

ment images. Document Analysis Systems, IAPR In-

ternational Workshop on, 0:135–139.

Jégou, H., Douze, M., and Schmid, C. (2008). Hamming

embedding and weak geometric consistency for large

scale image search. In ECCV.

Kalantidis, Y., Pueyo, L., and Trevisiol, M. (2011). Scalable

triangulation-based logo recognition. ICMR.

Kumar, M., Murthy, P., and Kumar, P. (2011). Performance

evaluation of different image ﬁltering algorithms us-

ing image quality assessment. IJCA.

Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond

bags of features: Spatial pyramid matching for recog-

nizing natural scenes and categories. CVPR.

Philbin, J. (2010). Scalable Object Retrieval in Very Large

Image Collections. PhD thesis, University of Oxford.

Sivic, J. and Zisserman, A. (2003). Video Google: A text

retrieval approach to object matching in videos. In

ICCV.

Tomasi, C. and Manduchi, R. (1998). Bilateral ﬁltering for

gray and color images. In ICCV.

Yan, F., Kittler, J., Mikolajczyk, K., and Tahir, A. (2012).

Non-sparse multiple kernel ﬁsher discriminant analy-

sis. JMLR.

PerformanceEvaluationofImageFilteringforClassificationandRetrieval

491