As a measure of effectiveness we have used the error rate (noted E), i.e., the percentage
of test documents that have been assigned to a wrong class.
As a baseline, we have used a “multi-feature” version of the distance-weighted k-
NN technique of Section 2.2, i.e., one in which the distance function δ mentioned at
the end of Section 3, resulting from a linear combination of the five feature-specific
δ_s functions, is used in place of δ_s in Equation 6. For completeness we also report five
other baselines, obtained in a similar way but each using a single feature-specific
distance function δ_s. In these baselines and in the experiments involving our
adaptive classifiers the k parameter has been fixed to 30, since this value has proved
the best choice in previous experiments involving the same technique [7, 8]. The w
parameter of the four adaptive committees has been set to 5, the value that had
performed best in previous experiments we had run on a different dataset. In future
experiments we plan to optimize these parameters more carefully by cross-validation.
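The baseline can be sketched as follows. This is a minimal illustration of distance-weighted k-NN over a combined distance; the function names, the feature dictionaries, and the inverse-distance vote weighting are illustrative assumptions, not the paper's actual implementation:

```python
from collections import defaultdict

def combined_distance(x, y, feature_dists, weights):
    """Linear combination of feature-specific distance functions δ_s.
    feature_dists maps feature name -> distance function; weights maps
    feature name -> non-negative weight (both hypothetical here)."""
    return sum(weights[s] * feature_dists[s](x[s], y[s]) for s in feature_dists)

def knn_distance_weighted(test_doc, training, labels, dist, k=30):
    """Distance-weighted k-NN: each of the k nearest neighbours votes for
    its own class, with weight inversely related to its distance."""
    scored = sorted(((dist(test_doc, tr), lab)
                     for tr, lab in zip(training, labels)),
                    key=lambda t: t[0])[:k]
    votes = defaultdict(float)
    for d, lab in scored:
        votes[lab] += 1.0 / (d + 1e-9)  # small epsilon avoids division by zero
    return max(votes, key=votes.get)
```

Replacing `dist` with a single δ_s yields the five single-feature baselines.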
The results of our experiments are reported in Table 1. From this table we may
notice that all four committees (2nd row, 2nd to 5th cells) bring about a noteworthy
reduction of the error rate with respect to the baseline (2nd row, 1st cell). The best
performer proves to be the confidence-rated dynamic classifier selection method of
Section 2.1, which reduces the error rate by 39.7% with respect to the baseline. This is noteworthy,
since both this method and the baseline use the same information, and only combine
it in different ways. The results also show that confidence-rated methods (CRDCS and
CRWMV) are not uniformly superior to methods (DCS and WMV) which do not make
use of confidence values. They also show that dynamic classifier selection methods
(DCS and CRDCS) are definitely superior to weighted majority voting methods (WMV
and CRWMV).
This latter result might be explained by the fact that, out of five features, three (CS,
CL, SC) are based on colour, and are thus not completely independent from each other;
if, for a given test image, colour considerations are not relevant for picking the correct
class, it may nonetheless be difficult to ignore them, since they are brought to bear three
times in the linear combination. In this case, DCS and CRDCS are more capable of
ignoring colour considerations, since they will likely entrust either the EH- or the HT-
based classifier with taking the final classification decision.
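The contrast between the two combination schemes can be sketched as follows. This is an illustrative toy example, not the paper's implementation: the feature names follow the paper, but the weights and the local-accuracy scores used to select a classifier are hypothetical:

```python
from collections import defaultdict

def weighted_majority_vote(predictions, weights):
    """WMV: every classifier votes for its predicted class; votes are summed
    with per-classifier weights. Correlated classifiers (e.g. three
    colour-based ones) can jointly outvote the others."""
    votes = defaultdict(float)
    for clf, label in predictions.items():
        votes[label] += weights[clf]
    return max(votes, key=votes.get)

def dynamic_classifier_selection(predictions, reliability):
    """DCS: entrust the single classifier estimated most reliable for this
    test image (here, a hypothetical per-image reliability score) with the
    final decision, ignoring all the others."""
    best = max(reliability, key=reliability.get)
    return predictions[best]
```

If the three colour-based classifiers agree on a wrong class, WMV follows them, while DCS can still side with a more reliable texture-based classifier.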
The same result also seems to suggest that, for any image, there tends to be a single
feature that alone is able to determine the correct class of the image, but this feature is
not always the same, and sharply differs across categories. For instance, the SC feature
is the best performer, among the single-feature classifiers (1st row), on test images
belonging to class GIALLO VENEZIANO (E = .11), where it largely outperforms the
EH feature (E = .55), but the contrary happens for class ANTIQUE BROWN, where
EH (E = .01) largely outperforms SC (E = .22). That no single feature alone is a solution
for all situations is also witnessed by the fact that all single-feature classifiers (1st row)
are, across the entire dataset, largely outperformed by both the baseline classifier and
all the adaptive committees. This fact confirms that splitting the image representation
into independent feature-specific representations on which feature-specific classifiers
operate is a good idea.