SOFT CATEGORIZATION AND ANNOTATION OF IMAGES
WITH RADIAL BASIS FUNCTION NETWORKS
Moreno Carullo, Elisabetta Binaghi and Ignazio Gallo
Università degli Studi dell'Insubria, via Ravasi 2, Varese, Italy
Keywords:
Content-based image retrieval, Image categorization, Image annotation, Soft classification, Neural networks.
Abstract:
This work focuses on fast approaches for image retrieval and classification by employing simple features to
build image signatures. For this purpose a neural model for soft classification and automatic image annotation
is proposed. The salient aspects of this solution are: a) the employment of a Radial Basis Function Network
built on top of an image retrieval distance metric; b) a soft learning strategy for annotation handling. Experi-
ments have been conducted on a subset of the Corel image dataset for evaluation and comparative analysis.
1 INTRODUCTION
The growing demand for digital visual data in many
applications related to scientific, commercial and
cultural contexts has aroused a significant interest
in Content-Based Image Retrieval (CBIR) aimed
at defining methods to archive, query and retrieve
these data based on their content (Datta et al., 2007;
Smeulders et al., 2000).
The problem of bridging the gap between visual,
low-level similarity and abstract semantic similarity
is even more complicated when dealing with general-
purpose, broad content image databases, such as
Internet image archives, because of the large size of
the database, the heterogeneity of the recorded scenes
and the imaging techniques employed (Li and Wang,
2008). Usually these are databases where images
are annotated with semantic labels, enabling the
user to specify the query through a natural language
description of the visual concepts of interest. These
aspects combined with the cost of manual image
annotation, have generated significant interest in
the problem of automatically extracting high level
semantic descriptors from images.
The problem can be addressed by different
approaches (Datta et al., 2007; Smeulders et al.,
2000). Early methods followed the geometrical
approaches focusing directly on an explicit definition
of a similarity function, powerful enough to represent
high level meaning. In this direction strategies
aimed at updating the query or optimizing the
similarity function, thanks to the user annotations,
have been proposed. Recent studies confirm that a
single similarity measure can hardly produce robust,
semantically meaningful rankings of images (Datta
et al., 2007). An alternative approach which in some
sense circumvents the problem, is the use of auto-
mated machine learning techniques able to induce,
and then implicitly define from a set of already
classified/annotated images, semantically meaningful
similarity functions with which to categorize, rank
and annotate images (Datta et al., 2007; Vailaya et al.,
2001).
Classification methods can be divided into two
major branches: generative modeling and discrim-
inative approaches (Bishop, 1996). In generative
modeling, the searched category is modeled as a
density probability function and the Bayes formula
is then used to compute the posterior (Li and Wang,
2008). Discriminative modeling approaches are more
direct in finding classification boundaries (Chen and
Wang, 2004; Shotton et al., 2008).
Early works in the image categorization
field make use of global color and texture his-
tograms (Swain and Ballard, 1990). Recent works
that try to exploit local features include the bag-
of-features (Chen and Wang, 2004) where learning
models are applied to collection of local features and
pyramidal approaches (Grauman and Darrell, 2005;
Lazebnik et al., 2006) where geometric description of
the scene is accomplished. New trends also include
approximate segmentation techniques, applied to
obtain good results under stricter time constraints
(Li and Wang, 2008). All recent works can be
roughly divided into approaches that prefer fast and
simple techniques for general purpose images and
approaches with deeper and more costly techniques
for image segmentation and object recognition.
The present work focuses on discriminative
modeling for categorization and annotation tasks on
large, content varied image databases. The main
contribution of our study is to investigate whether
these tasks can be approached successfully using
the approximation capability of a novel supervised
neural learning technique based on Radial Basis
Function Network (RBFN). A salient aspect of the
proposed solution is the integration within the RBFN
of the Earth Movers Distance (EMD) which has
been recognized as a useful similarity metric in
information retrieval (Rubner et al., 2000).
In the context of automated image annotation the
neural model is considered a soft classifier to better
represent the inherent vagueness and imprecision
with which images are annotated by users. The output
of the neural classifier for a given input image signature,
usually interpreted as a crisp class assignment
according to the winner-takes-all rule, is softened
here by considering the values of the output neurons
directly as the gradual relevance of the corresponding
class/annotation to the image.
The overall strategy was experimentally evaluated
using the Corel database subset used in (Chen
and Wang, 2004). Several experiments have been
conceived and conducted to quantify and compare
the contribution of the different solutions adopted.
2 RBFN-BASED LEARNING FOR
IMAGE CATEGORIZATION
AND ANNOTATION
The present work focuses on the learning task within
a CBIR strategy. However, to make the work self-
contained, the important pre-processing phase con-
cerning visual signature extraction is derived from
previous works.
Section 2.1 describes the strategy adopted for vi-
sual signature extraction, while sections 2.2 and 2.3
explain the learning model and the annotation strat-
egy.
2.1 Extraction of Visual Signature
This phase is crucial for the ability of the learning
model to understand and predict concepts and cat-
egories. Following the solutions of Li and Wang
(Li and Wang, 2008), a signature extraction technique
for generic images is adopted that is computationally
cheap but powerful enough to solve real-world problems.
To build the signature, a set of two features
F = {f_1, f_2}, where f_1 = color and f_2 = texture, is
considered. Each signature feature f_i is built on a set
of vectors extracted from the image, one for each pixel.
Vectors are then grouped together into a set of centroids
v_{j,k}, k = 1, ..., K with the K-Means clustering method
(with fixed K), and for each centroid v_{j,k} a weight
w_{j,k} is computed to express the relevance of the
related pixels.
For the color feature the LUV color space com-
ponents for each pixel are considered, while the
Daubechies 4 wavelet transform (Daubechies, 1992)
is employed as a texture descriptor. The texture de-
scriptor is computed on the L-plane of the image, con-
sidering the LH, HL and HH planes to form the set of
vectors that are in turn clustered.
Each image I_i is thus represented by its signature
γ_i ∈ Γ, formed by features β_{i,j}, j = 1, ..., |F|,
where each feature β_{i,j} = {(v_{i,1}, w_{i,1}), ...,
(v_{i,K}, w_{i,K})} is a discrete distribution.
The clustering phase for the extraction of the discrete
distributions is a means of summarizing images by
dividing them into regions with similar feature vec-
tors. Several strategies can be adopted to exploit the
differences among different types of images in the
collection: in (Li and Wang, 2008) an adaptive meta-
clustering method based on K-Means is used. A com-
mon and simpler alternative is to use the K-Means al-
gorithm with fixed K. This strategy is adopted for the
proposed solution.
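
As a concrete illustration, the following sketch builds both signature features for a single image. It is a minimal sketch rather than the authors' implementation: scikit-image, PyWavelets and scikit-learn are assumed stand-ins for the LUV conversion, the Daubechies-4 transform and the clustering step, and K = 5 matches the value used later in the experiments.

    import numpy as np
    import pywt
    from skimage import color, io
    from sklearn.cluster import KMeans

    def cluster_feature(vectors, K=5):
        # Group the vectors into K centroids v_{j,k}; the weight w_{j,k}
        # of each centroid is the fraction of vectors assigned to it.
        km = KMeans(n_clusters=K, n_init=10).fit(vectors)
        weights = np.bincount(km.labels_, minlength=K) / len(vectors)
        return km.cluster_centers_, weights

    def extract_signature(path, K=5):
        luv = color.rgb2luv(io.imread(path))       # per-pixel LUV vectors
        f_color = cluster_feature(luv.reshape(-1, 3), K)
        # Texture: Daubechies-4 transform of the L plane; the LH, HL and
        # HH detail coefficients at each location form the vectors.
        _, (LH, HL, HH) = pywt.dwt2(luv[:, :, 0], 'db4')
        f_texture = cluster_feature(
            np.stack([LH.ravel(), HL.ravel(), HH.ravel()], axis=1), K)
        return (f_color, f_texture)                # the signature gamma_i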
2.2 The Soft Classifier
The present work addresses the main problem of se-
mantic categorization and annotation of images with
a machine learning approach. In particular, we
adopted the RBFN model introduced by (Moody and
Darken, 1989) for its proven training speed and ro-
bustness on classification and regression tasks. These
capabilities are especially suitable for the inherent
vagueness related to categorization and annotation
within the CBIR context.
RBFNs have a single hidden layer of process-
ing units with local, restricted activation domains: a
Gaussian function is commonly used, but any other
locally-tunable function can be used. They were
introduced as a neural network evolution of exact
interpolation (Moody and Darken, 1989), and have
been shown to have the universal approximation
property (Hartman et al., 1990). As outlined in (Jain
et al., 2000), the main advantages of the RBFN are
that the classification function is non-linear, the model
may produce confidence values and it may be robust
to outliers; its drawbacks are the potential sensitivity
to input parameters and a potential sensitivity to
overtraining.
The need to learn and predict on signature objects
instead of regular vector patterns requires the stan-
dard Euclidean distance within the Gaussian activa-
tion units and the first-level K-Means clustering to be
substituted with a distance tailored to discrete distri-
butions. Considering the previous works on appropri-
ate metrics for CBIR and CBIR systems making use
of such metrics (Lv et al., 2006; Almeida et al., 2008)
we selected the Earth Mover's Distance (EMD) as the
image distance metric within the RBFN model
(Rubner et al., 2000).
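
To make the distance concrete, the sketch below computes the mean EMD between two signatures built as in section 2.1. The POT library (ot.emd2) is an assumed choice of solver, not one prescribed by the method.

    import numpy as np
    import ot  # POT: Python Optimal Transport, an assumed EMD solver

    def signature_emd(sig_a, sig_b):
        # Mean EMD over the signature features (color and texture); each
        # feature is a discrete distribution (centroids, weights).
        costs = []
        for (va, wa), (vb, wb) in zip(sig_a, sig_b):
            M = ot.dist(va, vb, metric='euclidean')  # centroid ground distances
            costs.append(ot.emd2(wa, wb, M))         # minimal transport cost
        return float(np.mean(costs))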
The network is structured as a regular RBFN
and its non-linear function f : Γ → R^C maps the
signature space to the categories space as a result
of the learning phase on the training set
TrS = {(γ_1, y_1), ..., (γ_N, y_N)}, where γ_i is a
signature and y_i ∈ R^C is the vector whose j-th
component is the soft membership truth for the j-th
annotation.
The network is structured as follows:
1. A first level of M Gaussian processing units
   φ_i : Γ → R,

       φ_i(γ) = exp(−emd(γ, γ_i) / σ_i)               (1)

   where emd(γ, γ_i) is the mean EMD, over all signature
   features, between the signature given as argument to
   φ_i and the centroid signature γ_i of processing
   unit φ_i.
2. A second level of linear weights
   w_i = {w_{i,1}, ..., w_{i,C}}, i = 1, ..., M, connects
   each first-level unit with each output unit.
3. The two levels are then linearly combined to build
   the model function f:

       o_c(γ) = Σ_{i=1}^{M} φ_i(γ) · w_{i,c}          (2)

       f(γ) = {o_1(γ), ..., o_C(γ)}                   (3)
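
A compact sketch of this forward pass, reusing the hypothetical signature_emd helper from above (centroid_sigs, sigmas and W are the quantities produced by the training scheme described next):

    import numpy as np

    def rbfn_forward(gamma, centroid_sigs, sigmas, W):
        # Eq. (1): one Gaussian activation per centroid signature.
        phi = np.array([np.exp(-signature_emd(gamma, g) / s)
                        for g, s in zip(centroid_sigs, sigmas)])
        # Eqs. (2)-(3): the linear layer yields one soft output per class.
        return phi @ W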
Following (Moody and Darken, 1989), the training
scheme is two-phased: one phase is unsupervised and
decides values for γ_i, i = 1, ..., M, while the other
solves a linear problem to find values for
w_i, i = 1, ..., M.
1. The first phase finds suitable centroid signatures
   γ_i, i = 1, ..., M by running an EMD-based iterative
   K-Means clustering algorithm with k = M. Then the
   p-means heuristic (Moody and Darken, 1989) is applied
   to compute the processing unit spreads σ_i, i = 1, ..., M.
2. The second phase computes w_i, i = 1, ..., M. This
   phase is supervised and therefore the training set
   is considered; the objective is to minimize the
   difference between predicted output and truth by
   Least Mean Squares, computed through the pseudoinverse:
   (a) Φ is an N × M matrix where Φ_{i,j} = φ_j(γ̂_i);
   (b) W is an M × C matrix where W_{i,j} = w_{i,j};
   (c) T is an N × C matrix where T_i = ŷ_i.
   The minimization problem to solve is ΦW = T, and
   thus W = Φ⁺T, where Φ⁺ is the pseudoinverse of Φ.
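
Under the same assumptions as the previous sketches, the supervised phase reduces to a few lines (Y stacks the soft target vectors ŷ_i row-wise):

    import numpy as np

    def fit_output_weights(train_sigs, Y, centroid_sigs, sigmas):
        # N x M activation matrix over the training signatures.
        Phi = np.array([[np.exp(-signature_emd(g, c) / s)
                         for c, s in zip(centroid_sigs, sigmas)]
                        for g in train_sigs])
        # Least-squares solution of Phi W = T via the pseudoinverse.
        return np.linalg.pinv(Phi) @ Y             # M x C weight matrix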
The model has therefore two user parameters:
1. the number M of first level local processing units
2. the number p of the p-means heuristic, used to de-
termine the spread of first level processing units.
2.3 Annotations and Categories
The visual content of an image can be described
with words that have an accepted meaning. Let
A = {a_1, ..., a_|A|} be the global dictionary of known
annotations; the process of annotating each image I_i
results in a set of weights A_i = {α_1, ..., α_|A|} with
α_j ∈ [0;1], where positive values of α_j are set for
the annotations a_j that belong to the image I_i.
A soft classification framework can be set up
by teaching the model the annotation weights as
the expected output for a given image; the training
and test set elements (γ_i, y_i) are such that
y_i = {α_{i,1}, ..., α_{i,|A|}}.
The RBFN output ŷ ∈ R^C for a given image
signature γ describes the level of confidence for each
annotation, and can be used to predict the set of
annotations. This can be addressed by considering only
the elements whose output units are activated with
values higher than a threshold parameter ε ∈ [0;1].
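
A one-line sketch of this thresholding step (names are illustrative):

    def predict_annotations(outputs, dictionary, eps=0.1):
        # Keep every annotation whose output unit exceeds the threshold.
        return [a for a, o in zip(dictionary, outputs) if o > eps]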
The annotation weights α_j can be elicited in several
ways. Considering real-world scenarios where users
interact with the system by providing examples of
tagged images, it is easy to imagine a simple graphical
user interface where each annotation can be given a
weight by adjusting its “visual size”, just as a
geometrical shape can be resized within a painting
program. In simpler scenarios where only the
annotations themselves can be taught and learned, the
expected output y_i can be such that all its positive
components are equal to

       1 / |{a_{i,j} : a_{i,j} is an annotation of I_i}|        (4)

assuming that images with fewer annotations probably
have a stronger and clearer membership with respect
to the annotation set.
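
A sketch of this uniform weighting, again with illustrative names:

    import numpy as np

    def uniform_soft_target(image_tags, dictionary):
        # Eq. (4): each tag of the image receives 1/(number of tags);
        # every other annotation in the dictionary receives 0.
        y = np.zeros(len(dictionary))
        for tag in image_tags:
            y[dictionary.index(tag)] = 1.0 / len(image_tags)
        return y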
Table 1: Error matrix of the hard classification analysis over the five runs, with User Accuracy (UA) and Producer Accuracy (PA) for each category.

-        ω_1    ω_2    ω_3    ω_4    ω_5     ω_6    ω_7    ω_8    ω_9    ω_10   Tot U   UA (%)
ω_1      193    7      18     9      0       7      0      0      7      12     253     76.28
ω_2      4      157    17     2      0       2      3      0      46     0      231     67.97
ω_3      10     23     151    10     0       12     2      0      13     3      224     67.41
ω_4      0      11     13     210    0       0      0      0      5      4      243     86.42
ω_5      0      0      0      0      250     6      0      0      0      3      259     96.53
ω_6      18     14     15     3      0       204    0      1      8      6      269     75.84
ω_7      1      1      21     0      0       0      231    1      4      8      267     86.52
ω_8      4      4      0      1      0       13     10     247    5      3      287     86.06
ω_9      6      31     10     10     0       4      0      0      162    6      229     70.74
ω_10     14     2      5      5      0       2      4      1      0      205    238     86.13
Tot P    250    250    250    250    250     250    250    250    250    250    -       -
PA (%)   77.20  62.80  60.40  84.00  100.00  81.60  92.40  98.80  64.80  82.00  -       -

Total accuracy: 80.40 % (2010 hits, 490 misses, 2500 total)
The global set of annotations A can grow unex-
pectedly when users are allowed to add their own new
words. Its size can be kept under control by grouping
clusters of elements into a single high-level annota-
tion. In scenarios where automated tagging is used
as a basic suggestion for the user, we expect that the
most relevant elements are presented to the user, min-
imizing the presence of words with the same visual
semantic.
3 EXPERIMENTS
The experimental analysis aims at assessing the per-
formance of the proposed approach as an automated
image annotation method. As shown in section 2 the
overall process relies on the ability of the underlying
machine learning model to predict a soft membership
of a given set of conceptual classes. To better iso-
late the contribution of the learning model and of the
annotations management task, the experiments were
divided into two parts: first the proposed model is as-
sessed as a hard classifier of images, while the second
part considers the soft classification and automatic an-
notation capabilities.
For both the hard and soft experiments, the K-Means
clustering technique used for signature building is
employed with K = 5, a value chosen through a
trial-and-error phase that also took reasonable
computational times into account.
3.1 Hard Classification Analysis
For the hard classification analysis the Corel database
subset used in (Chen and Wang, 2004) is considered.
This dataset is composed of 1000 small JPEG images
divided into 10 categories: African people and
villages (ω_1), Beach (ω_2), Historical buildings (ω_3),
Buses (ω_4), Dinosaurs (ω_5), Elephants (ω_6), Flowers
(ω_7), Horses (ω_8), Mountains and glaciers (ω_9) and
Food (ω_10). (Dataset labels are now available at
http://john.cs.olemiss.edu/~ychen/ddsvm.html; the
images can be downloaded from
http://wang.ist.psu.edu/docs/related.shtml.)
The set of images within each category is ran-
domly split into two subsets of 50 elements to form
the training and test sets. Each experiment is repeated
five times and the average overall accuracy (OA) is
then reported as the main evaluation metric. When
available, the number of processing units (NPU) of
the learning model is reported. A complete error ma-
trix (Congalton, 1991) over the five random runs is
presented in table 1.
Two image categorization models, proposed
in (Chen and Wang, 2004) and (Andrews et al., 2003)
respectively, are considered for comparison with our method:
MI-SVM, an extension of the standard Support Vec-
tor Machines model to the multiple-instance learning
paradigm and DD-SVM, which aims at improving the
MI-SVM by going beyond the single-prototype bag
model.
We also compare the performance of a RBFN us-
ing a 125 bins LUV histogram (R-Hist) with that ob-
tained by SVM employing the same image represen-
tation technique (HistSVM). Results are from (Chen
and Wang, 2004). Experimental results of the overall
accuracy are reported in table 2.
The HistSVM and R-Hist figures confirm that Radial
Basis Function Networks can be employed in
image categorization tasks with performance similar
to SVM models. The performance of the proposed
R-EMD proves that a standard RBFN training technique,
combined with EMD-based radial basis functions
and K-Means, can compete with more complex
models based on the multiple-instance framework.
Figure 1: Sample images from the dataset used for soft classification and annotation analysis, annotated respectively with {building, monument, sky}, {animal, elephant, tree, vegetation, grass, sky} and {mountain, building, vegetation, grass, tree}.
Table 2: Hard classification results of the proposed approach (R-EMD). Overall Accuracy (OA) with 95% confidence interval is reported; when available, the number of Processing Units (NPU) is also presented.

Model     OA %   [95% conf. int.]   NPU
R-EMD     80.40  [77.80, 82.60]     100
R-EMD     77.52  [74.91, 80.13]     50
R-Hist    71.16  [68.32, 73.99]     100
R-Hist    67.88  [64.96, 70.80]     50
DD-SVM    81.5   [78.5, 84.5]       n.d.
MI-SVM    74.7   [74.1, 75.3]       n.d.
HistSVM   66.7   [64.5, 68.9]       n.d.
3.2 Soft Classification and Annotation
Analysis
To investigate the performance of the model for
annotation purposes, an annotated image dataset was
needed. The absence of a widely accepted benchmark
dataset in the CBIR research area led us to put together
a subset of the Corel images used in (Chen and Wang,
2004) and to add proper annotations; the resulting
dataset is available at
http://www.dicom.uninsubria.it/~moreno.carullo/cbir/datasets.html.
A set of 29 annotations A = {animal, beach, boat,
building, cloth, cloud, decoration, desert, elephant,
face, flower, forest, grass, horse, lake, monument,
mountain, palace, person, river, rock, sand, sea, sky,
snow, street, tree, vegetation, water} is used to
annotate 573 images; some examples are provided in
figure 1. Annotations defined on images are converted
to soft memberships as explained in section 2.3 by
considering uniform weights as suggested in (4).
The whole image dataset is randomly split into
two parts - for the training and test sets. The model is
then trained and the neural network’s output is evalu-
ated within the soft paradigm as suggested in (Binaghi
et al., 1999), considering the OA descriptive measure
of the fuzzy error matrix. This evaluation metric
isolates the behavior of the RBFN model without
considering the threshold parameter ε. The annotation
process is then evaluated considering the well-known
Information Retrieval metrics Precision (P), Recall
(R) and F-Measure (F1) (Frakes and Baeza-Yates, 1992),
computed with the micro-average approach. These metrics,
in particular F-Measure, describe the user-perceived
performance of the system. All experiments are re-
peated five times and the average OA, Precision, Re-
call and F-Measure are reported.
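
For clarity, a minimal sketch of the micro-averaged metrics over predicted and expected tag sets (illustrative names):

    def micro_prf(predicted, expected):
        # Micro-averaging pools true positives, false positives and false
        # negatives over all test images before computing the ratios.
        tp = sum(len(set(p) & set(e)) for p, e in zip(predicted, expected))
        fp = sum(len(set(p) - set(e)) for p, e in zip(predicted, expected))
        fn = sum(len(set(e) - set(p)) for p, e in zip(predicted, expected))
        P, R = tp / (tp + fp), tp / (tp + fn)
        return P, R, 2 * P * R / (P + R)  # F-Measure: harmonic mean of P and R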
Table 3: Soft classification and automated annotation results.

Model     F.OA %   P %     R %     F1 %    NPU
R-EMD     48.44    64.43   55.95   57.23   20
R-EMD     53.59    66.56   63.99   62.49   50
Random    n.a.     14.62   52.27   20.69   n.a.
The model is evaluated fixing the threshold pa-
rameter ε = 0.1 and annotation performance is com-
pared to that of a random annotator that selects a
random number of tags from the available ones. This
assesses the overall utility of the method against a
lower-bound baseline.
The Fuzzy OA (F.OA) shows that the model can
learn soft memberships reasonably. The model was
not supposed to behave perfectly with respect to this
metric, and in addition the vagueness of learned and
evaluated data makes Fuzzy OA behave differently
from conventional, crisp OA.
The F1 score obtained shows the utility of the
model over a completely random approach, by deliv-
ering an average 62.49% of correct annotations over
the expected ones. Looking at the F1 in detail,
the Precision and Recall figures show that the major
impact provided by the model is found in making the
set of suggested tags more precise, or in other words
small enough to contain the set of expected annota-
tions.
4 CONCLUSIONS
This work presented and evaluated a Radial Basis
Function Network based approach to image catego-
rization and annotation. Experimental analysis con-
firms that the proposed solution can be employed for
both categorization and annotation tasks with encour-
aging results. The proposed soft classification ap-
proach seems promising and adequate for the man-
agement of intrinsic uncertainty of user-provided an-
notations. Future work involves investigating the
performance on larger datasets, with more images and
annotations, to assess the impact on the model's
behavior.
REFERENCES
Almeida, J., Rocha, A., Torres, R., and Goldenstein, S.
(2008). Making colors worth more than a thousand
words. In SAC '08: Proceedings of the 2008 ACM
symposium on Applied computing, pages 1180–1186,
New York, NY, USA. ACM.
Andrews, S., Tsochantaridis, I., and Hofmann, T. (2003).
Support vector machines for multiple-instance learn-
ing. In Advances in Neural Information Processing
Systems 15, pages 561–568. MIT Press.
Binaghi, E., Brivio, P. A., Ghezzi, P., and Rampini, A.
(1999). A fuzzy set-based accuracy assessment of soft
classification. Pattern Recogn. Lett., 20(9):935–948.
Bishop, C. M. (1996). Neural networks for pattern recog-
nition. Oxford University Press, Oxford, UK.
Chen, Y. and Wang, J. Z. (2004). Image categorization by
learning and reasoning with regions. J. Mach. Learn.
Res., 5:913–939.
Congalton, R. (1991). A review of assessing the accuracy of
classifications of remotely sensed data. Remote sens-
ing of environment, 37(1):35–46.
Datta, R., Joshi, D., Li, J., and Wang, J. Z. (2007).
Image retrieval: Ideas, influences, and trends of the
new age. ACM Computing Surveys, 39.
Daubechies, I. (1992). Ten lectures on wavelets. Society
for Industrial and Applied Mathematics, Philadelphia,
PA, USA.
Frakes, W. B. and Baeza-Yates, R. A., editors (1992). In-
formation Retrieval: Data Structures & Algorithms.
Prentice-Hall.
Grauman, K. and Darrell, T. (2005). The pyramid match
kernel: Discriminative classification with sets of im-
age features. In ICCV, pages 1458–1465.
Hartman, E., Keeler, J. D., and Kowalski, J. M. (1990). Lay-
ered neural networks with gaussian hidden units as
universal approximations. Neural Comput., 2(2):210–
215.
Jain, A., Duin, R., and Mao, J. (2000). Statistical pattern
recognition: A review. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 22(1):4–37.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond
bags of features: Spatial pyramid matching for recog-
nizing natural scene categories. In CVPR ’06: Pro-
ceedings of the 2006 IEEE Computer Society Con-
ference on Computer Vision and Pattern Recogni-
tion, pages 2169–2178, Washington, DC, USA. IEEE
Computer Society.
Li, J. and Wang, J. Z. (2008). Real-time computerized an-
notation of pictures. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 30(6).
Lv, Q., Josephson, W., Wang, Z., Charikar, M., and Li, K.
(2006). Ferret: a toolkit for content-based similarity
search of feature-rich data. In EuroSys ’06: Proceed-
ings of the 1st ACM SIGOPS/EuroSys European Con-
ference on Computer Systems 2006, pages 317–330,
New York, NY, USA. ACM.
Moody, J. E. and Darken, C. (1989). Fast learning in net-
works of locally-tuned processing units. Neural Com-
putation, 1:281–294.
Rubner, Y., Tomasi, C., and Guibas, L. J. (2000). The earth
mover’s distance as a metric for image retrieval. Int.
J. Comput. Vision, 40(2):99–121.
Shotton, J., Johnson, M., and Cipolla, R. (2008). Semantic
texton forests for image categorization and segmenta-
tion. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition (CVPR).
Smeulders, A. W. M., Worring, M., Santini, S., Gupta, A.,
and Jain, R. (2000). Content-based image retrieval at
the end of the early years. IEEE Trans. Pattern Anal.
Mach. Intell., 22(12):1349–1380.
Swain, M. and Ballard, D. (1990). Indexing via color his-
tograms. In Proceedings of the Third International
Conference on Computer Vision, pages 390–393.
Vailaya, A., Figueiredo, M. A. T., Jain, A. K., and
Zhang, H.-J. (2001). Image classification for content-
based indexing. IEEE Transactions on Image Process-
ing, 10:117–130.