Active Learning in Social Context for Image Classification
Elisavet Chatzilari¹·², Spiros Nikolopoulos¹, Yiannis Kompatsiaris¹ and Josef Kittler²
¹ Centre for Research & Technology Hellas - Information Technologies Institute, Thessaloniki, Greece
² Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, U.K.
Keywords: Selective Sampling, Active Learning, Large Scale, User-generated Content, Social Context, Image Classification, Multimodal Fusion.
Abstract:
Motivated by the widespread adoption of social networks and the abundant availability of user-generated mul-
timedia content, our purpose in this work is to investigate how the known principles of active learning for
image classification fit in this newly developed context. The process of active learning can be fully automated
in this social context by replacing the human oracle with the user tagged images obtained from social net-
works. However, the noisy nature of user-contributed tags adds further complexity to the problem of sample
selection since, apart from their informativeness, our confidence about their actual content should also be maximized. The contribution of this work lies in proposing a probabilistic approach for jointly maximizing the two
aforementioned quantities with a view to automate the process of active learning. Experimental results show
the superiority of the proposed method against various baselines and verify the assumption that significant
performance improvement cannot be achieved unless we jointly consider the samples’ informativeness and
the oracle’s confidence.
1 INTRODUCTION
The majority of state-of-the-art methods for auto-
matic concept detection rely on the paradigm of pat-
tern recognition through machine learning. Based on
this paradigm, a model is parametrized to recognize
the different attributes of a concept’s form and appear-
ance using a set of training examples. The efficient es-
timation of model parameters mainly depends on two
factors, the quality and the quantity of the training ex-
amples. High quality is usually accomplished through
manual annotation, which is a laborious and time con-
suming task. This has a direct impact on the second
factor, since it inevitably leads to a small number
of training examples and limits the performance of
the generated models. In an effort to minimize the
labelling effort, active learning (Cohn et al., 1994)
trains the initial model with a very small set of la-
belled examples and enhances the training set by se-
lectively sampling new examples from a much larger
set of unlabelled examples (also referred to as the pool of
candidates). These examples are selected based on
their informativeness, i.e. how much they are ex-
pected to improve the model performance, and they
are labelled by an oracle. They are typically found in
the uncertainty areas of the model and their inclusion
in the training set results in reducing the generaliza-
tion error.
In the typical version of active learning, the pool
of candidates usually consists of unlabelled examples
that are annotated upon request by an errorless oracle.
This requirement, which implies the involvement of
a human annotator, renders active learning impracti-
cal in cases where the initial set needs to be enhanced
with a significantly high number of additional sam-
ples while, at the same time, limits the scalability
of this approach. On the other hand, the widespread
use of Web 2.0 has made available large amounts of
user tagged images that can be obtained at almost no
cost and offer more information than their mere visual
content. Our goal in this paper is to examine active
learning in a rather different context from what has
been considered so far. More specifically, if we could
leverage these tags to become indicators of the im-
ages’ actual content, we could potentially remove the
need for a human annotator and automate the whole
process. This, however, adds a new parameter, the
oracle’s confidence about the image’s actual content,
that should also be considered when actively select-
ing new samples. Additionally, even though in our
case there is no annotation effort, adding informative
instead of random samples is still important to mini-
mize the complexity of the classification models (i.e.
achieve the same robustness with significantly fewer
images).
The novelty of this work, in contrast to what has
been considered so far in active learning, is to pro-
pose a sample selection strategy that maximizes not
only the informativeness of the selected samples but
also the oracle’s confidence about their actual con-
tent. Towards this goal, we quantify the samples’ in-
formativeness by measuring their distance from the
separating hyperplane of the visual model, while the
oracle’s confidence is measured based on the pre-
diction of a textual classifier trained on a set of de-
scriptors extracted using a typical bag of words ap-
proach (Joachims, 1998). Joint maximization is then
accomplished by ranking the samples based on the
probability to select a sample given the two aforemen-
tioned quantities (see Fig. 1). This probability indi-
cates the benefit that our system is expected to have
if the examined sample is selected to enhance the ini-
tial model. The rest of the manuscript is organized
as follows. Section 2 reviews the related literature. In
Section 3, the selective sampling algorithm is detailed and a theoretical analysis that quantifies the probability of selecting a new sample is presented. The exper-
imental results are presented in Section 4 and conclu-
sions are drawn in Section 5.
2 RELATED WORK
The examined context of this work combines three
topics: active learning, the multimedia domain and noisy
data. During the past decade there have been many
works exploring a subset of these topics, e.g. active
learning in the multimedia domain (Wang and Hua,
2011), (Freytag et al., 2013) or active learning with
noisy data (Settles, 2009), (Yan et al., 2011), (Fang
and Zhu, 2012) or even non-active learning from
noisy data in the multimedia domain (Chatzilari et al.,
2012), (Raykar et al., 2010), (Yan et al., 2010), (Uric-
chio et al., 2013), (Verma and Jawahar, 2012), (Verma
and Jawahar, 2013). However, it has been only re-
cently that the scientific community started to inves-
tigate the implications of substituting the human or-
acle with a less expensive and less reliable source
of annotations in the multimedia domain. There have
been only a few attempts to combine active learning
with user contributed images and most of them rely
on either a human annotator or on the use of active
crowdsourcing (i.e. a service like MTurk) and
not on passive crowdsourcing (i.e. the user provided
tags that are typically found in social networks like
flickr). In this direction, the authors of (Zhang et al.,
2011) propose to use flickr notes in the typical ac-
tive learning framework with the purpose of obtain-
ing a training dataset for object localization. In a
similar endeavour, the authors of (Vijayanarasimhan
and Grauman, 2011) introduce the concept of live
learning where they attempt to combine active learn-
ing with crowdsourced labelling. More specifically,
rather than filling the pool of candidates with some
canned dataset, the system itself gathers possibly rel-
evant images via keyword search on flickr. Then, it re-
peatedly surveys the data to identify the samples that
are most uncertain according to the current model,
and generates tasks on MTurk to get the correspond-
ing annotations.
On the other hand, social networks and user con-
tributed content are leading most of the recent re-
search efforts, mainly because of their ability to offer
more information than the mere image visual content,
coupled with the potential for almost unlimited growth.
In this direction, the authors of (Li et al., 2013) pro-
pose a solution for sampling loosely-tagged images to
enrich the negative training set of an object classifier.
The presented approach is based on the assumption
that the tags of such images can reliably determine if
an image does not include a concept, thus making so-
cial sites a reliable pool of negative examples. The se-
lected negative samples are further sampled by a two-stage
sampling strategy. First, a subset is randomly
selected and then, the initial classifier is applied on
the remaining negative samples. The examples that
are most misclassified are considered as the most in-
formative negatives and are finally selected to boost
the classifier.
Our aim in this work is to investigate the extent
to which the loosely tagged images that are found in
social networks can be used as a reliable substitute
of the human oracle in the context of active learn-
ing. Given that the oracle is not expected to reply
with 100% correctness to the queries submitted by
the selective sampling mechanism, we expect to face
a number of implications that will question the effec-
tiveness of active learning in a noisy context. In this
perspective our work differs from the large body of
works that are found in the literature in the sense that
most of them appear to be sensitive to label noise. In
most of the works that do not use an expert as the or-
acle, MTurk is used instead to annotate the datasets.
However, although active crowdsourcing services like
MTurk are closer to expert annotation (Nowak and
Rüger, 2010) with respect to noise, they cannot be
considered fully automated. In this work we rely on
data originating from passive crowdsourcing (flickr
images and tags) that, although noisier, can be used to
support a fully automatic active learning framework.
ActiveLearninginSocialContextforImageClassification
77
Figure 1: System Overview (the initial model is trained on the initial training set; visual and textual analysis of the candidate images yield the informativeness and the oracle's confidence; the samples are ranked by the selection index and the selected samples enhance the model).
The work presented in (Li et al., 2013) is examined
under the same context as in this work (i.e. active
learning in the multimedia domain using data from
passive crowdsourcing); that work, however, focuses on
enriching the negative training set. Our work, on the
other hand, focuses on enriching the positive training
set, a task that is more complex, since negative training
samples are generally easier to harvest. Moreover, most of
the existing datasets already contain a large number of
negative examples but lack positives, which renders a
positive sample selection strategy more applicable to
a real world scenario.
3 SELECTIVE SAMPLING IN
SOCIAL CONTEXT
Let us consider the typical case where, given a concept c_k, a base classifier is trained on the initial set
of labelled images using Support Vector Machines
(SVMs). We follow the popular rationale of SVM-
based active learning methods ((Tong and Chang,
2001), (Campbell et al., 2000), (Schohn and Cohn,
2000)), which quantify the informativeness of a sam-
ple based on its distance from the separating hyper-
plane of the visual model (Section 3.1). In the typical
active learning paradigm, a human oracle is employed
to decide which of the selected informative samples
are positive or negative. However, in the proposed
scheme the human oracle is replaced with user con-
tributed tags. Thus, in order to decide about a sam-
ple’s actual label we utilize a typical bag-of-words
classification scheme based on the image tags and the
linguistic description of c_k. The outcome of this pro-
cess is a confidence score for each image-concept pair
(i.e. the oracle’s confidence) which we consider as a
strong indicator about the existence or not of c_k in the
image content (Section 3.2). Finally, the candidate
samples are ranked based on the probability of select-
ing a new image given the two aforementioned quan-
tities. The samples with the highest probability are
considered the ones that jointly maximize the sam-
ples’ informativeness and oracle’s confidence, and are
selected to enhance the initial training set.
3.1 Measuring Informativeness
As already mentioned, the informativeness of an im-
age is measured using the distance of its visual rep-
resentation from the hyperplane of the visual model.
For the visual representation of the images, we have
used the approach that was shown to perform best in
(Chatfield et al., 2011). More specifically, gray SIFT
features were extracted at densely selected key-points
at four scales, using the vl-feat library (Vedaldi and
Fulkerson, 2008). Principal component analysis was
applied on the SIFT features, decreasing their dimen-
sionality from 128 to 80. The parameters of a Gaus-
sian mixture model with K = 256 components were
learned by expectation maximization from a set of de-
scriptors, which were randomly selected from the en-
tire set of descriptors extracted from an independent set
of images. The descriptors were encoded in a single
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
78
feature vector using the Fisher vector encoding (Per-
ronnin et al., 2010). Moreover, each image was di-
vided into 1×1, 3×1 and 2×2 regions, resulting in 8 to-
tal regions. A feature vector was extracted for each
region by the Fisher vector encoding and the feature
vector of the whole image (1× 1) was calculated us-
ing sum pooling (Chatfield et al., 2011). Finally the
feature vectors of all 8 regions were l2 normalized
and concatenated into a single 327680-dimensional
feature vector, which was again power and l2 normal-
ized.
For every concept c_k, a linear SVM classifier (w_k, b_k), where w_k is the normal vector to the hyperplane and b_k the bias term, was trained using the labelled training set. The images labelled with c_k were chosen as positive examples while all the rest were used as negative examples (One Versus All / OVA approach). For each candidate image I_i represented by a feature vector x_i, the distance from the hyperplane V(I_i, c_k) is extracted by applying the SVM classifier:

V(I_i, c_k) = w_k · x_i^T + b_k    (1)
Using Eq. 1 we obtain the prediction scores, which indicate the certainty of the SVM model that the image I_i depicts the concept c_k. In the typical self-training paradigm (Ng and Cardie, 2003), this certainty score is used to rank the samples in the pool of candidates and the samples with the highest certainty scores are chosen to enhance the models. However, as claimed and proven by the active learning theory (Settles, 2009), (Tong and Chang, 2001), these samples do not provide the classifiers with enough new information to alter the classification boundaries significantly.
Alternatively, as suggested by the active learning
theory (Settles, 2009), the samples for which the ini-
tial classifier is more uncertain are more likely to in-
crease the classifier’s performance if selected. In the
case of an SVM classifier, the margin around the hy-
perplane forms an uncertainty area and the samples
that are closer to the hyperplane are considered to be
the most informative ones (Fig. 2) (Tong and Chang,
2001). Based on the above, the samples that we want
to select (i.e. the most informative) are the ones with
the minimum distance to the hyperplane. Addition-
ally, we only consider samples that lie in the margin
area, since the rest of the samples are not expected to
have any impact on the enhanced classifiers. We denote the probability to select an image I_i given its distance to the hyperplane V(I_i, c_k) as P(S|V). Based on our previous observations, shown in Fig. 2, this probability can be formulated as a function of the sample's distance to the hyperplane, as illustrated in Fig. 3:
w*x+b=1
w*x+b=0
w*x+b=-1
-1
+1
Informativeness = 0 ( min)
Informativeness = 0 ( min)
Informativeness = 1 (max)
0 < Informativeness < 1
0 < Informativeness < 1
Margin
Figure 2: Informativeness.
−2 −1.5 −1 −0.5 0 0.5 1 1.5 2
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
V
P(S|V)
Figure 3: Probability of selecting a sample based on its dis-
tance to the hyperplane.
P(S|V) =
1 |V| if 0 < V < 1
0 else
(2)
3.2 Measuring Oracle’s Confidence
In order to measure the oracle's confidence about the existence of the concept c_k in each tagged image, a
typical bag-of-words scheme is utilized (Joachims,
1998). The vocabulary is extracted from a large in-
dependent image dataset crawled from flickr. Initially
the distinct tags of all the images are gathered. The
tags that are not included in WordNet are removed and
the remaining tags compose the vocabulary. Then, in
order to represent each image with a vector, a his-
togram is calculated by assigning the value 1 at the
bins of the image tags in the vocabulary.
Afterwards, for every concept c_k a linear SVM model (w_k^text, b_k^text) is trained using the tag histograms
as the feature vectors. In order to do this, a training
set of images that contains both tags and ground truth
information is utilized. The tags are required in order
to calculate the feature vectors and the ground truth
information to provide the class labels for training the
model. In the testing procedure, for every tagged image I_i the feature vector f_i is calculated as above and the SVM model is applied. This results in a value
ActiveLearninginSocialContextforImageClassification
79
for each tagged image T(I_i, c_k), which corresponds to the distance of f_i from the hyperplane:

T(I_i, c_k) = w_k^text · f_i^T + b_k^text    (3)

This distance indicates the oracle's confidence that the examined image I_i depicts the concept c_k.

Figure 4: Probability of selecting a sample based on the oracle's confidence.
We denote the probability to select an image I_i given the oracle's confidence T(I_i, c_k) as P(S|T). In order to transform the oracle's confidence T(I_i, c_k) (which corresponds to the distance of I_i to the SVM hyperplane) into a probability, we use a modification of Platt's algorithm (Platt, 1999) proposed by Lin et al. (Lin et al., 2007). Thus, the probability P(S|T) can be formulated as a function of the oracle's confidence using the sigmoid function, as shown in Fig. 4:

P(S|T) = exp(−AT − B) / (1 + exp(−AT − B))   if AT + B ≥ 0
       = 1 / (1 + exp(AT + B))               if AT + B < 0    (4)
The parameters A and B are learned on the training set
using cross validation.
3.3 Sample Ranking and Selection
Our aim is to calculate the probability P(S = 1|V, T) that an image is selected (S = 1) given the distance of the image to the hyperplane V and the oracle's confidence T. Considering that V and T originate from different modalities (i.e. visual and textual respectively), we regard them as independent. Using the basic rules of probability (e.g. the Bayes rule) and based on our assumption that V and T are independent, we can express the probability P(S|V, T) as follows:
P(S|V, T) = P(V, T|S) P(S) / P(V, T)
          = [P(S|V) P(V) / P(S)] · [P(S|T) P(T) / P(S)] · P(S) / P(V, T)
          = P(S|V) P(S|T) P(V) P(T) / (P(V, T) P(S))

In order to calculate the probability P(S = 1|V, T) and eliminate the probabilities P(V), P(T) and P(V, T), we divide the probability of selecting an image by the probability of not selecting it:

P(S = 1|V, T) / P(S = 0|V, T)
  = [P(S = 1|V) P(S = 1|T) P(V) P(T) / (P(V, T) P(S = 1))] / [P(S = 0|V) P(S = 0|T) P(V) P(T) / (P(V, T) P(S = 0))]
  = [P(S = 1|V) P(S = 1|T) / P(S = 1)] / [P(S = 0|V) P(S = 0|T) / P(S = 0)]

Then we use the basic probabilistic rule that the probability of an event's complement equals 1 minus the probability of the event (P(S = 0|V, T) = 1 − P(S = 1|V, T)):

P(S = 1|V, T) / (1 − P(S = 1|V, T))
  = [P(S = 1|V) P(S = 1|T) / P(S = 1)] / [(1 − P(S = 1|V)) (1 − P(S = 1|T)) / (1 − P(S = 1))]

Solving for P(S = 1|V, T) yields:

P(S = 1|V, T) = P(S = 1|V) P(S = 1|T) (1 − P(S = 1)) / [P(S = 1) − P(S = 1) P(S = 1|T) − P(S = 1) P(S = 1|V) + P(S = 1|V) P(S = 1|T)]    (5)
Thus we only need to estimate three probabilities:
P(S = 1), P(S = 1 | V) and P(S = 1 | T). The first
one is set to 0.5 as the probability of selecting an im-
age without any prior knowledge is the same as the
probability of dismissing it. For the estimation of the
other two probabilities we use Eqs. 2 and 4
(shown in Figs. 3 and 4). Finally, the top N images
with the highest probability P(S = 1|V, T) are selected
to enhance the initial training set.
4 EXPERIMENTS
4.1 Datasets and Implementation
Details
Two datasets were employed for the purpose of our
experiments. The imageCLEF dataset IC (Thomee
and Popescu, 2012) consists of 25000 labelled im-
ages and was split into two parts (15k train and 10k
test images). The ground truth labels were gathered
using Amazon’s crowdsourcing service MTurk. The
dataset was annotated by a vocabulary of 94 concepts
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
80
which belong to 19 general categories (age, celestial, combustion, fauna, flora, gender, lighting, quality, quantity, relation, scape, sentiment, setting, style, time of day, transport, view, water, weather). On aver-
age there are 934 positive images per concept, while
the minimum and the maximum number of positive
images for a single concept is 16 and 10335 respec-
tively. In our experimental study the 15k training im-
ages were used to train the initial classifiers.
The MIRFLICKR-1M dataset F (Huiskes et al., 2010) consists of one million loosely tagged
images harvested from flickr. The images of F were
tagged with 862115 distinct tags of which 46937 were
meaningful (included in WordNet). After the textual
preprocessing, i.e. removing the tags that were not
included in WordNet, 131302 images had no mean-
ingful tags, 825365 images were described by 1 to 16
meaningful tags and 43333 images had more than 16
meaningful tags. Given that the IC dataset is a subset
of F, the images that are included in both sets were
removed from F. In our experiments, this dataset
constitutes the pool of loosely tagged images, out of
which the top N = 500 images ranked by Eq. 5 are se-
lected for each concept (i.e. 94 concepts * 500 images
per concept = 47k images total) to act as the positive
examples enhancing the initial training set. Finally,
mean average precision (MAP) served as the metric
for measuring the models’ classification performance
and evaluating the proposed approach.
4.2 Evaluation of the Proposed Selective
Sampling Approach
The objective of this section is to compare the pro-
posed active sample selection strategy against vari-
ous baselines. The first baseline is the initial models
that were generated using only the ground truth im-
ages from the training set (15k images). Afterwards,
the initial models are enhanced with positive samples
from F using the following sample selection strate-
gies:
Self-training (Ng and Cardie, 2003). The images
that maximize the certainty of the SVM model
trained on visual information (i.e. maximize the
visual distance to the hyperplane as measured by
Eq. 1) are chosen.
Textual Based. The images that maximize the ora-
cle’s confidence are selected (Eq. 4).
Max Informativeness. The images that maximize
the informativeness (i.e. are closer to the hyper-
plane) are chosen (Eq. 2).
Naïve Oracle. The images that maximize the infor-
mativeness (Eq. 2) and explicitly contain the con-
cept of interest in their tag list are chosen (i.e.
plain string matching is used).
Proposed Approach. The images that jointly max-
imize the sample’s informativeness and the ora-
cle’s confidence are chosen (Eq. 5).
The average performance of the enhanced classifiers
using the aforementioned sample selection strategies
is shown in Table 1. We can see that in all cases the
enhanced classifiers outperform the baseline. More-
over, the approaches relying on active learning yield
a higher performance gain compared to the typical
self-training approach, showing that the informative-
ness of the selected samples is a critical factor. The
same conclusion is drawn when comparing the tex-
tual based approach to the proposed method, showing
that informativeness is crucial to optimize the learning
curve, i.e. achieve higher improvement when adding
the same number of images. On the other hand, the
fact that the proposed sample selection strategy and
the string matching variation (i.e. naïve oracle) out-
perform significantly the visual-based variations, ver-
ifies that the oracle’s confidence is a critical factor
when applying active learning in social context and
unless we manage to consider this value jointly with
informativeness, the selected samples are inappropri-
ate for improving the performance of the initial clas-
sifiers.
Additionally, we note that the naïve oracle varia-
tion performs relatively well, which can be attributed
to the high prediction accuracy achieved by string
matching. Nevertheless, the recall of string matching
is expected to be lower than that of the textual similarity al-
gorithm used in the proposed approach (Section 3.2),
since it does not account for synonyms, plural ver-
sions and the context of the tags. This explains the
superiority of our method compared to the naïve ora-
cle variation. In order to verify that the performance
improvement of the proposed approach compared to
the naïve oracle is statistically significant, we apply
the Student’s t-test to the results, as it was proposed
for significance testing in the information retrieval
field (Smucker et al., 2007). The obtained p-value is
2.58e-5, significantly smaller than 0.05, which is typ-
ically the limit for rejecting the null hypothesis (i.e.
the results are obtained from the same distribution and
thus the improvement is random), in favour of the al-
ternative hypothesis (i.e. that the obtained improve-
ment is statistically significant).
Moreover, a per concept comparison of the en-
hanced models generated by the two best performing
approaches of Table 1 (i.e. the proposed approach and
the naïve oracle variation) to the baseline classifiers
can be seen in the bar diagram shown in Fig. 5. We
can see that the proposed approach outperforms the
ActiveLearninginSocialContextforImageClassification
81
Table 1: Performance scores.

Model                  mAP (%)
Baseline               28.06
Self-training          28.68
Textual based          29.89
Max informativeness    28.73
Naïve oracle           30
Proposed approach      31.22
naïve oracle in 70 concepts out of 94. It is also interesting to note that the naïve oracle outperforms the proposed approach mostly in concepts that depict objects such as amphibian-reptile, rodent, baby, coast,
web users tend to use the same keywords to tag im-
ages with concepts depicting strong visual content,
which are typically the object of interest in an image.
In such cases, the string matching oracle can be rather
accurate, providing valid samples for enhancing the
classifiers. On the other hand, the proposed approach
copes better with more abstract and ambiguous con-
cepts for which the context is a crucial factor (e.g.
flames, smoke, lens effect, small group, co-workers,
strangers, circular warp and overlay).
A closer look at the obtained results from the pro-
posed approach shows that the concept with the most
notable increase in performance is spider, initially trained with 16 positive examples and yielding only 5.48% AP. After adding the samples that were in-
dicated by the proposed oracle, the classifier gains
23.31 units of performance, resulting in 28.79% av-
erage precision. Similarly, other concepts yielding a
performance gain of 5 or more units in-
clude stars, rainbow, flames, fireworks, underwater,
horse, insect, baby, rail and air. Most of these con-
cepts’ baseline classifiers yield a low performance.
Another category consists of the concepts with slight
variations in performance, below 0.1%. This cate-
gory includes the concepts cloudy sky, coast, city, tree,
none, adult, female, no blur and city life whose base-
line classifiers yield a rather high performance and are
trained with 3600 positive images on average. This
shows that the proposed method, as it could be ex-
pected, is more beneficial for difficult concepts, i.e.
whose initial classifiers perform poorly. Finally, there
are also the concepts that either yield minor variations or even a decrease in performance, namely melancholic, unpleasant and big group. This can be
attributed to the ambiguous nature of these concepts
which renders the oracle unable to effectively deter-
mine their existence.
4.3 Comparing with State-of-the-Art
In this section the proposed approach is compared to
the methods submitted to the 2012 ImageClef compe-
tition (Thomee and Popescu, 2012), specifically in the concept annotation task for visual concept detection, annotation, and retrieval using Flickr photos¹
. Since the proposed approach is only using the
visual information of the test images without taking
into account the associated tags, it is only compared
to the visual-based approaches submitted in the com-
petition. The performance scores for the three metrics
utilized by the competition organizers (miAP, GmiAP
and F-ex) are reported in Table 2 for each of the 14
participating teams, along with the baselines of Ta-
ble 1 and the proposed approach. In order to measure
the F-ex score, the threshold for the positive-negative
class separation was set to zero, i.e. images with an
SVM prediction score greater than zero were annotated as positive and as negative otherwise. We can see
that our approach is ranked third in terms of miAP,
first in terms of GmiAP and fifth in terms of F-ex.
Additionally, we note that the proposed approach out-
performs the rest in terms of GmiAP, which accord-
ing to (Thomee and Popescu, 2012) is a metric sus-
ceptible to better performances on difficult concepts.
This explains the superiority of our approach and the
higher performance gain compared to our baseline
since it tends to improve the performance of the dif-
ficult concepts, as it was also observed in Section 4.2
(see Fig. 5).
Table 2: Comparison with ImageClef 2012.
Team miAP GmiAP F-ex
LIRIS 34.81% 28.58% 54.37%
NPDILIP6 34.37% 28.15% 41.99%
NII 33.18% 27.03% 55.49%
ISI 32.43% 25.90% 54.51%
MLKD 31.85% 25.67% 55.34%
CERTH 26.28% 19.04% 48.38%
UAIC 23.59% 16.85% 43.59%
BUAA AUDR 14.23% 8.18% 21.67%
UNED 10.20% 5.12% 10.81%
DBRIS 9.76% 4.76% 10.06%
PRA 9.00% 4.37% 25.29%
MSATL 8.68% 4.14% 10.69%
IMU 8.19% 3.87% 4.29%
URJCyUNED 6.22% 2.54% 19.84%
Baseline 30.37% 24.21% 48.6%
Self-training 30.77% 24.41% 49.63%
Textual based 32.48% 26.84% 51.7%
Max informativeness 30.83% 24.48% 52.24%
Naïve oracle 32.18% 26.53% 51.66%
Proposed approach 33.84% 29.17% 52.64%
¹ http://imageclef.org/2012/photo-flickr
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
82
Figure 5: Per concept comparison of the two best performing approaches (i.e. the naïve oracle and the proposed approach) to the baseline (best viewed in colour).

Moreover, it is important to note that the
proposed approach has achieved these very compet-
itive scores by using a single feature space (gray
SIFT features), which was not the case for the other
participants, who relied on more than one feature
space (Thomee and Popescu, 2012).
5 CONCLUSIONS
In this paper, we propose an automatic variation of
active learning for image classification adjusted to
the context of social media. This adjustment con-
sists in replacing the typical human oracle with user
tagged images obtained from social sites and in us-
ing a probabilistic approach for jointly maximizing
the informativeness of the samples and the oracle’s
confidence. The results show that in this context it
is critical to jointly consider these two quantities for
successfully selecting additional samples to enhance
the initial training set. Additionally, we noticed that
the naïve oracle performs very well on concepts that
depict strong visual content corresponding to typical
foreground visual objects (e.g. fish, spider, bird and
baby), while the proposed approach copes better with
more abstract and ambiguous concepts (e.g. flames,
smoke, strangers and circular warp), since the utilized
textual classifier accounts for the context of the tags
as well.
Finally, an interesting note is that the difficult con-
cepts (i.e. models with low performance) tend to gain
much more in terms of effectiveness from such boot-
strapping methods, as shown in Fig. 5. Similar con-
clusions are drawn when comparing the proposed ap-
proach, which trained a simple SVM classifier using
a single feature space, to the more sophisticated ap-
proaches of the ImageCLEF 2012 challenge, which
typically used many feature spaces. Especially in the
case of difficult concepts, as shown by the superiority
of the proposed approach based on the GmiAP metric,
we can also conclude that it is more important to find
more positive samples than to devise more sophisticated
algorithms.
Our plans for future work include the use of flickr
groups as a richer and larger-scale pool of can-
didates for positive samples and the extension of the
proposed approach to an on-line continuous learning
scheme.
ACKNOWLEDGEMENTS
This work was supported by the EU 7th Framework
Programme under grant number IST-FP7-288815 in
project Live+Gov (www.liveandgov.eu).
REFERENCES
Campbell, C., Cristianini, N., and Smola, A. J. (2000).
Query learning with large margin classifiers. In Pro-
ceedings of the Seventeenth International Conference
on Machine Learning, ICML ’00, pages 111–118, San
Francisco, CA, USA. Morgan Kaufmann Publishers
Inc.
Chatfield, K., Lempitsky, V., Vedaldi, A., and Zisserman,
A. (2011). The devil is in the details: an evaluation of
recent feature encoding methods. In British Machine
Vision Conference.
Chatzilari, E., Nikolopoulos, S., Kompatsiaris, Y., and Kit-
tler, J. (2012). Multi-modal region selection approach
for training object detectors. In Proceedings of the
2nd ACM International Conference on Multimedia
Retrieval, ICMR ’12, pages 5:1–5:8, New York, NY,
USA. ACM.
Cohn, D., Atlas, L., and Ladner, R. (1994). Improving
generalization with active learning. Mach. Learn.,
15(2):201–221.
Fang, M. and Zhu, X. (2012). I don’t know the label: Active
learning with blind knowledge. In Pattern Recognition
(ICPR), 2012 21st International Conference on, pages
2238–2241.
Freytag, A., Rodner, E., Bodesheim, P., and Denzler, J.
(2013). Labeling examples that matter: Relevance-
based active learning with gaussian processes. In We-
ickert, J., Hein, M., and Schiele, B., editors, GCPR,
volume 8142 of Lecture Notes in Computer Science,
pages 282–291. Springer.
Joachims, T. (1998). Text categorization with support vec-
tor machines: Learning with many relevant features.
In Nédellec, C. and Rouveirol, C., editors, Machine
Learning: ECML-98, volume 1398 of Lecture Notes
in Computer Science, pages 137–142. Springer Berlin
Heidelberg.
Li, X., Snoek, C. G. M., Worring, M., Koelma, D. C., and
Smeulders, A. W. M. (2013). Bootstrapping visual
categorization with relevant negatives. IEEE Transac-
tions on Multimedia, In press.
Lin, H.-T., Lin, C.-J., and Weng, R. C. (2007). A note
on platt’s probabilistic outputs for support vector ma-
chines. Machine Learning, 68(3):267–276.
Huiskes, M. J., Thomee, B., and Lew, M. S. (2010). New
trends and ideas in visual concept detection: The MIR
Flickr retrieval evaluation initiative. In MIR ’10: Pro-
ceedings of the 2010 ACM International Conference
on Multimedia Information Retrieval, pages 527–536,
New York, NY, USA. ACM.
Ng, V. and Cardie, C. (2003). Bootstrapping coreference
classifiers with multiple machine learning algorithms.
In Proceedings of the 2003 conference on Empirical
methods in natural language processing, EMNLP ’03,
pages 113–120.
Nowak, S. and Rüger, S. (2010). How reliable are annota-
tions via crowdsourcing: a study about inter-annotator
agreement for multi-label image annotation. In Pro-
ceedings of the international conference on Multime-
dia information retrieval, MIR ’10, pages 557–566,
New York, NY, USA. ACM.
VISAPP2014-InternationalConferenceonComputerVisionTheoryandApplications
84
Perronnin, F., Sánchez, J., and Mensink, T. (2010). Improv-
ing the fisher kernel for large-scale image classifica-
tion. In Proceedings of the 11th European conference
on Computer vision: Part IV, ECCV’10, pages 143–
156. Springer-Verlag.
Platt, J. C. (1999). Probabilistic outputs for support vector
machines and comparisons to regularized likelihood
methods. In Advances in Large Margin Classifiers,
pages 61–74. MIT Press.
Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin,
C., Bogoni, L., and Moy, L. (2010). Learning from
crowds. J. Mach. Learn. Res., 11:1297–1322.
Schohn, G. and Cohn, D. (2000). Less is more: Active
learning with support vector machines. In Proceed-
ings of the Seventeenth International Conference on
Machine Learning, ICML ’00, pages 839–846, San
Francisco, CA, USA. Morgan Kaufmann Publishers
Inc.
Settles, B. (2009). Active learning literature survey. Com-
puter Sciences Technical Report 1648, University of
Wisconsin–Madison.
Smucker, M. D., Allan, J., and Carterette, B. (2007). A
comparison of statistical significance tests for infor-
mation retrieval evaluation. In Proceedings of the six-
teenth ACM conference on Conference on information
and knowledge management, CIKM ’07, pages 623–
632.
Thomee, B. and Popescu, A. (2012). Overview of the CLEF
2012 flickr photo annotation and retrieval task. In the
Working Notes for the CLEF 2012 Labs and Workshop,
Rome, Italy.
Tong, S. and Chang, E. (2001). Support vector machine
active learning for image retrieval. In Proceedings of
the ninth ACM international conference on Multime-
dia, MULTIMEDIA ’01, pages 107–118, New York,
NY, USA. ACM.
Uricchio, T., Ballan, L., Bertini, M., and Del Bimbo, A.
(2013). An evaluation of nearest-neighbor methods
for tag refinement.
Vedaldi, A. and Fulkerson, B. (2008). VLFeat: An open
and portable library of computer vision algorithms.
http://www.vlfeat.org/.
Verma, Y. and Jawahar, C. V. (2012). Image annotation
using metric learning in semantic neighbourhoods.
In Proceedings of the 12th European conference on
Computer Vision - Volume Part III, ECCV’12, pages
836–849.
Verma, Y. and Jawahar, C. V. (2013). Exploring svm for
image annotation in presence of confusing labels. In
Proceedings of the 24th British Machine Vision Con-
ference, BMVC’13.
Vijayanarasimhan, S. and Grauman, K. (2011). Large-scale
live active learning: Training object detectors with
crawled data and crowds. In Computer Vision and Pat-
tern Recognition (CVPR), 2011 IEEE Conference on,
pages 1449 –1456.
Wang, M. and Hua, X.-S. (2011). Active learning in multi-
media annotation and retrieval: A survey. ACM Trans.
Intell. Syst. Technol., 2(2):10:1–10:21.
Yan, Y., Rosales, R., Fung, G., and Dy, J. (2011). Active
learning from crowds. In Getoor, L. and Scheffer,
T., editors, Proceedings of the 28th International Con-
ference on Machine Learning (ICML-11), ICML ’11,
pages 1161–1168, New York, NY, USA. ACM.
Yan, Y., Rosales, R., Fung, G., Schmidt, M., Hermosillo,
G., Bogoni, L., Moy, L., and Dy, J. (2010). Modeling
annotator expertise: Learning when everybody knows
a bit of something.
Zhang, L., Ma, J., Cui, C., and Li, P. (2011). Active learn-
ing through notes data in flickr: an effortless training
data acquisition approach for object localization. In
Proceedings of the 1st ACM International Conference
on Multimedia Retrieval, ICMR ’11, pages 46:1–46:8,
New York, NY, USA. ACM.
ActiveLearninginSocialContextforImageClassification
85