NoisyArt: A Dataset for Webly-supervised Artwork Recognition
R. Del Chiaro, Andrew Bagdanov and A. Del Bimbo
Media Integration and Communication Center, University of Florence, Italy
Keywords:
Cultural Heritage, Computer Vision, Instance Recognition, Image Categorization, Webly-supervised Learning.
Abstract:
This paper describes NoisyArt, a dataset designed to support research on webly-supervised recognition of artworks. The dataset consists of more than 90,000 images spanning more than 3,000 webly-supervised classes, including a subset of 200 classes with verified test images. Candidate artworks are identified using publicly available metadata repositories, and images are automatically acquired using Google Images and Flickr search. Document embeddings are also provided for short descriptions of all artworks. NoisyArt is designed to support research on webly-supervised artwork instance recognition, zero-shot learning, and other approaches to visual recognition of cultural heritage objects. Baseline experimental results are given using pretrained Convolutional Neural Network (CNN) features and a shallow classifier architecture. Experiments are also performed using a variety of techniques for identifying and mitigating label noise in webly-supervised training data.
1 INTRODUCTION
Cultural patrimony and the exploitation of its artifacts are extremely important economic drivers internationally. This is especially true for culturally dense regions like Europe and Asia, which rely on cultural tourism for jobs and industry. U.S. tourist travelers alone represent nearly 130 million adults spending approximately $171 billion annually on leisure travel (Chen and Rahman, 2018). Museums
are massive, distributed repositories of physical and
digital artifacts. For decades now museums have been
frantically digitizing their collections in an effort to
render their content more available to the general pub-
lic. Initiatives like EUROPEANA (Valtysson, 2012) and the European Year of Cultural Heritage [1] have advanced the state-of-the-art in cultural heritage metadata exchange and promoted the coordinated valorization of cultural history assets, but have had limited impact on the diffusion and dissemination of individual collections.
The state-of-the-art in automatic recognition of
objects, actions, and other visual phenomena has ad-
vanced by leaps and bounds in the last five years (Rus-
sakovsky et al., 2015). This visual recognition technology offers the potential of linking cultural tourists to the (currently inaccessible) collections of museums. Imagine the following scenario:
- A cultural tourist arrives at a destination rich in cultural heritage offerings.
- Our prototypical cultural tourist snaps a photo of an object or landmark of interest with his smartphone.
- After automatic recognition of the artwork or landmark, our tourist receives personalized, curated information about the object of interest and other cultural offerings in the area.

[1] https://europa.eu/cultural-heritage/
This type of scenario is realistic only if we have some
way of easily recognizing a broad range of artworks.
The challenges and barriers to this type of recognition
technology have been studied in the past in the mul-
timedia information analysis community (Cucchiara
et al., 2012).
Recent breakthroughs in visual media recognition offer promise, but also present new challenges. One key challenge in the application of state-of-the-art classifiers is the data-hungry nature of modern visual recognition models. Even modestly sized Convolutional Neural Networks (CNNs) can have hundreds of millions of trainable parameters. As a consequence, they can require millions of annotated training examples to be effectively trained. The real problem then becomes the cost of annotation. Museum budgets are already stretched by classical curation requirements; adding the additional costs of collecting and annotating example media is not feasible.
Webly-supervised learning can offer solutions to
the data annotation problem by exploiting abundantly
available media on the web. This approach is appeal-
ing as it is potentially able to exploit the millions of
images available on the web without requiring any
additional human annotation. In our application sce-
nario, for example, there are abundant images, videos,
blog posts, and other multimedia assets freely avail-
able on the web. If the multimedia corresponding
to specific instances of cultural heritage items can be
retrieved and verified in some way, this multimedia
can in turn be exploited as (noisy) training data. The problem then turns from one of a lack of data to one of mitigating, during training, the effects of the various types of noise that derive from its webly nature (Temmermans et al., 2011; Sukhbaatar and Fergus, 2014).
In this paper we present a dataset for webly-supervised learning specifically targeting cultural heritage artifacts and their recognition. Starting from an
authoritative list of known artworks from DBpedia,
we query Google Images and Flickr in order to iden-
tify likely image candidates. The dataset consists of
more than 3000 artworks with an average of 30 im-
ages per class. A test set of 200 artworks with verified
images is also included for validation. We call our
dataset NoisyArt to emphasize its webly-supervised
nature and the presence of label noise in the train-
ing and validation sets. NoisyArt is designed to sup-
port research on multiple types of webly-supervised
recognition problems. Included in the database are
document embeddings of short, verified text descrip-
tions of each artwork in order to support development
of models that mix language and visual features such
as zero-shot learning (Xian et al., 2017) and auto-
matic image captioning (Vinyals et al., 2017). We
believe that NoisyArt represents the first benchmark dataset for webly-supervised learning for cultural heritage collections [2].
In addition to the NoisyArt dataset, we report on
baseline experiments designed to probe the effective-
ness of pretrained CNN features for webly-supervised
learning of artwork instances. We also describe a
number of techniques designed to mitigate various
sources of noise in training data, as well as techniques
for identifying “clean” classes for which recognition
is likely to be robust. These techniques provide sev-
eral practical tools for building classifiers trained on
automatically acquired imagery from the web.
The rest of the paper is organized as follows. In
the next section we review recent work related to our
contribution. In section 3 we describe the NoisyArt dataset, designed specifically for research on webly-supervised learning in museum contexts, and in section 4 we discuss several techniques for coping with noise in webly-supervised data. In section 5 we present a range of experimental results establishing baselines for state-of-the-art methods on the NoisyArt dataset. We conclude with a discussion of our contribution in section 6.

[2] https://github.com/delchiaro/NoisyArt
2 RELATED WORK
In this section we review work from the literature re-
lated to the NoisyArt dataset and webly-supervised
learning.
Visual Recognition for Cultural Heritage. Cultural heritage and the recognition of artworks enjoy a long tradition in the computer vision and multimedia research communities. The Mobile Museum Guide was an early attempt to build a system to recognize instances from a collection of 17 artworks using photos from mobile phones (Temmermans et al., 2011). More
recently, the Rijksmuseum Challenge dataset was
published which contains more than 100,000 highly
curated photos of artworks from the Rijksmuseum
collection (Mensink and Van Gemert, 2014). The
PeopleArt dataset, on the other hand, consists of high-
quality, curated photos of paintings depicting people
in various artistic styles (Westlake et al., 2016). The objectives of these datasets vary, from person detection invariant to artistic style to artist/artwork recognition. A unifying characteristic of these datasets is the high level of curation and meticulous annotation invested in them.
Another common application theme in multime-
dia analysis and computer vision applied to cultural
heritage is personalized content delivery. The goal of
the MNEMOSYNE project was to analyze visitor in-
terest in situ and to then select content to deliver on
the basis of similarity to recognized content of inter-
est (Karaman et al., 2016). The authors of (Baraldi
et al., 2015), on the other hand, concentrate on closed-
collection artwork recognition and gesture recogni-
tion using a wearable sensor to enable novel interac-
tions between visitor and museum content.
Webly-supervised Category Recognition. Early approaches to webly-supervised learning (long before it was called by that name) were the decontamination technique of (Barandela and Gasca, 2000) and the noise filtering approach of (Brodley and Friedl, 1999). Both of these approaches are based on explicit identification and removal of mislabeled training samples. A more recent approach is the noise adaptation approach of (Sukhbaatar and Fergus, 2014), which looks at two specific types of label noise, labelflip and outliers, and modifies a deep network architecture to absorb and adapt to them. A very re-
cent approach to webly-supervised training of CNNs
is the representation adaptation approach of (Chen and Gupta, 2015). In this work, the authors first fit a CNN to “easy” images identified by Google, and then adapt this representation to “harder” images by identifying sub- and similar-category relationships in the noisy data.
The majority of work on webly-supervised learn-
ing has concentrated on category learning. However,
the NoisyArt dataset we introduce in this paper poses an instance-based, webly-supervised learning problem. As we will describe in section 3, instance-based learning presents different sources of label noise than category-based learning.
Landmark Recognition. The problem of land-
mark recognition is similar to our focus of artwork
classification, since they are both instance recognition
problems rather than category recognition problems.
It is also one of the first problems to which webly-
supervised learning was widely applied. The authors
of (Raguram et al., 2011) use webly-supervised learn-
ing to acquire visual models of landmarks by identi-
fying iconic views of each landmark in question. An-
other early work merged image and contextual text features to build recognition models for large-scale landmark collections (Li et al., 2009).
Artwork recognition differs from landmark recog-
nition, however, in the diversity of viewpoints recov-
erable from web search alone. As we will show in
section 3, the NoisyArt dataset suffers from several
types of label bias and label noise which are particu-
lar to the artwork recognition context.
3 THE NoisyArt DATASET
NoisyArt is a collection of artwork images collected
using articulated queries to metadata repositories and
image search engines on the web. The goal of
NoisyArt is to support research on webly-supervised
artwork recognition for cultural heritage applications.
Webly supervision is an important feature, since in cultural heritage applications annotated data can be acutely scarce.
Thus, the ability to exploit abundantly available im-
agery to acquire visual recognition models would be
a tremendous advantage.
We feel that NoisyArt is well-suited to experimentation on a wide variety of recognition problems. It is particularly apt for webly-supervised instance recognition as a weakly-supervised extension of fully-supervised learning. To
support this, we provide a subset of classes with man-
ually verified test images (i.e. with no label noise).
In the next section we describe the data sources
used for collecting images and metadata. Then in sec-
tion 3.2 we describe the data collection process and
detail the statistics of the NoisyArt dataset.
3.1 Data Sources
To collect the NoisyArt dataset we exploited a range
of publicly available data sources on the web.
Structured Knowledge Bases. As a starting point, we used public knowledge bases like DBpedia (Bizer et al., 2009; Mendes et al., 2011) and Europeana (Valtysson, 2012) to query, select, and filter the entities to be used as a basis for NoisyArt. The result of this exercise was a reduced list of 3,120 artwork classes with Wikipedia entries and ancillary information for each: title, description, the museum in which the artwork is preserved, and artist information such as name, birth and death dates, description, and artistic movement.
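For illustration, the following is a minimal sketch of how candidate artwork entities might be retrieved from the public DBpedia SPARQL endpoint using the SPARQLWrapper library. The exact query used to build NoisyArt is not reproduced here, so the selected properties (dbo:author, dbo:museum, dbo:abstract) are illustrative.

```python
# Sketch: retrieving candidate artwork entities and metadata from DBpedia.
# The properties selected below are illustrative, not the exact query used
# to build NoisyArt.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?artwork ?title ?artist ?museum ?abstract WHERE {
        ?artwork a dbo:Artwork ;
                 rdfs:label ?title ;
                 dbo:author ?artist ;
                 dbo:museum ?museum ;
                 dbo:abstract ?abstract .
        FILTER (lang(?title) = "en" && lang(?abstract) = "en")
    }
    LIMIT 100
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["title"]["value"], "-", row["museum"]["value"])
```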
DBpedia. DBpedia, the same source from which we retrieved metadata, also contains one or more images for some artworks. We call such an image a seed image because it is unequivocally associated with the metadata of the artwork. Note, however, that though the association is reliable, sometimes the seed image is an image of the artist and not of the artwork.
Google Images. We queried Google Images using the title of each artwork and the artist name. For each query we downloaded the first 20 retrieved images. These images tend to be very clean, in particular for paintings, most of which have no background and tend to be very similar to scans or posters. For this reason the variability of examples can be poor: we can retrieve images that are almost identical, differing perhaps only in resolution or color calibration. Another issue with Google Images search results is the labelflip phenomenon: searching for minor artworks by a famous artist can retrieve images of other artworks by the same artist. Outliers are also present, in small numbers, for less famous artworks by less famous artists.
Flickr. Finally, we used the Flickr API to retrieve a small set of images more similar to real-world pictures taken by users. Due to its nature, the images retrieved from Flickr tend to be noisier: the only supervision is by end-users, and many images (especially for famous and iconic artworks) do not contain the expected subject. For the least famous artworks, the number of retrieved images is almost zero, and those retrieved can be mostly outliers. For these reasons we retrieve only the first 12 images from each Flickr query in order to filter out some of the outlier noise.
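As a sketch, the Flickr acquisition step might look as follows, using the public Flickr REST API via the requests library. The API key and the query formatting are placeholders, not the exact parameters used to build NoisyArt.

```python
# Sketch: retrieving the first 12 Flickr results for an artwork query.
# FLICKR_API_KEY and the query format are placeholders.
import requests

FLICKR_API_KEY = "YOUR_API_KEY"  # placeholder

def flickr_search(title, artist, per_page=12):
    """Return direct image URLs for the first `per_page` Flickr results."""
    resp = requests.get("https://api.flickr.com/services/rest/", params={
        "method": "flickr.photos.search",
        "api_key": FLICKR_API_KEY,
        "text": f"{title} {artist}",
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    })
    photos = resp.json()["photos"]["photo"]
    # Build static image URLs from the returned photo records.
    return [
        f"https://live.staticflickr.com/{p['server']}/{p['id']}_{p['secret']}.jpg"
        for p in photos
    ]

urls = flickr_search("Anxiety", "Edvard Munch")
```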
Discussion. In the end, Flickr images are the most informative due to their variety and similarity to real-world pictures; however, many of them are incorrect (outliers). DBpedia seed images are the most reliable, but there is at most one per artwork. Google images are usually more consistent with the searched concept than the Flickr ones, but normally exhibit low variability.

Figure 1: Sample classes and training images from the NoisyArt dataset (rows: Self-Portrait / Raffaello; Alien / David Breuer-Weil; Saint Jerome in his Study / Antonello da Messina; Anxiety / Munch). For each artwork/artist pair we show the seed image obtained from DBpedia, the first two Google Image search results, and the first two Flickr search results.
3.2 Data Collection
From these sources we collected 89,395 images for the 3,120 classes, which became 89,095 after we pruned unreadable images and error banners received from websites. Before filtering, each class contained a minimum of 20 images (from Google) and a maximum of 33 (20 from Google, 12 from Flickr, and the DBpedia seed).
We could have used the seed images as a single-shot test set (pruning all classes without a seed), but the importance of these images in the training phase, combined with the inconsistency of the seed in some classes, led us to create a supervised test set using a small subset of the original classes: 200 classes containing more than 1,300 images taken from the web or from our personal photos. We were careful not to use images from the training set. This test set is not balanced: for some classes we have only a few images, while others have up to 12. Figure 2 illustrates some sample classes and images from our verified test set. Note the strong domain shift in these images with respect to those in the training set shown in figure 1.

Table 1: Characteristics of the NoisyArt dataset.

            (webly images)            (verified images)
classes     training    validation    test
 2,920        65,759        17,368        0
   200         4,715         1,253    1,355
 3,120        70,474        18,621    1,355   (totals)
Finally, each artwork has a description and meta-
data retrieved from DBpedia, from which a single tex-
tual document was created for each class. These short
descriptions were then embedded using doc2vec (Le
and Mikolov, 2014) in order to provide a compact,
vector space embedding for each artwork description.
These embeddings are included to support research on
zero-shot learning and other multi-modal approaches
to learning over weakly supervised data.
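A minimal sketch of this embedding step, using the gensim implementation of doc2vec, is shown below. The hyperparameters are illustrative, not necessarily those used to produce the released embeddings.

```python
# Sketch: embedding artwork descriptions with gensim's doc2vec.
# Hyperparameters here are illustrative, not those used to build NoisyArt.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `descriptions` maps each class name to its DBpedia-derived text
# (one entry per artwork class; a single entry is shown here).
descriptions = {"Anxiety / Munch": "Anxiety is an oil-on-canvas painting ..."}

docs = [
    TaggedDocument(words=text.lower().split(), tags=[name])
    for name, text in descriptions.items()
]

model = Doc2Vec(vector_size=300, min_count=1, epochs=40)
model.build_vocab(docs)
model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

# One fixed-length vector per artwork description.
embedding = model.dv["Anxiety / Munch"]
```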
In the end, NoisyArt is a multi-modal, weakly-
supervised dataset of artworks with 3,120 classes and
more than 90,000 images, more than 1,300 of which are human-validated. Table 1 details the breakdown of the splits
defined in NoisyArt.
Figure 2: Sample verified test images from the NoisyArt test set (rows: Self-Portrait / Raffaello; David / Michelangelo; Bacchus / Leonardo; Anxiety / Munch). For a random sample of 200 classes we collected an additional set of images that we manually verified. Note the significant domain shift in these images with respect to those in figure 1.
3.3 Discussion
In figure 1 we give a variety of examples from the
NoisyArt dataset. For each artwork we show: the
seed image from DBpedia, the first two Google Image
search results, and the first two results from Flickr.
These examples show typical scenarios of this art-
work instance recognition problem:
Best Case. The second row of figure 1 contains pictures of a statue. For these kinds of objects it is usually much easier to retrieve images with a good level of diversity, both from Google and Flickr. This is due to the 360° access and thus the relative variety of viewpoints from which such artworks are photographed.
Lack of Diversity. The first row of figure 1 is an example of an artwork for which Google retrieves images with extremely low variety; in this case Flickr returns images with some diversity, but also outliers. In the third row we can observe an example for which both Google and Flickr fail to provide diversity.
Labelflip. In the fourth row of figure 1 we see a pathology particular to our instance recognition problem: we are looking for images of a not-so-famous artwork (Anxiety) by a famous artist (Munch) who also made much more iconic artworks (like The Scream). In these cases the risk of labelflip is high, and in fact we retrieved images of The Scream from both Google and Flickr (together with some correct and some outlier images).
These types of label noise in the NoisyArt dataset
render it difficult to acquire robust visual models us-
ing webly supervision. In the next section we discuss
techniques to mitigate or identify noise during train-
ing.
4 COPING WITH NOISY DATA
In this section we describe several techniques for mit-
igating and/or identifying label noise during training.
First we describe the baseline classifier model used in
all experiments.
4.1 Baseline Classifier Model
For all our experiments we use a shallow classifier
based on image features extracted from CNNs pre-
trained on ImageNet. Figure 3 shows the architecture
of our networks. Given an input image x, we extract a feature vector using the pretrained CNN and then pass it through a shallow classifier, consisting of a single hidden layer (of the same size as the extracted features) and an output layer that estimates class probabilities p(c|x) for each of the 200 test classes.

Figure 3: Classifier models used for webly-supervised experiments on NoisyArt. Green blocks represent data flowing through the network; blue blocks are components with trainable parameters. A CNN (pretrained on ImageNet) is used to extract features from training images, and then a shallow network with a single hidden layer and output layer are trained to predict class probabilities. The F matrix (see section 4.2) is used to model and absorb labelflip noise in the training set, and the loss function is either the cross entropy loss $\mathcal{L}$ or the weighted cross entropy loss $\mathcal{L}_h$ described in section 4.3.
The shallow classifier is then optionally followed
by multiplication by a 200 × 200 labelflip matrix F
(see section 4.2). For the baseline experiments F is
set to the identity matrix.
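For illustration, a minimal PyTorch sketch of this architecture follows. The ReLU nonlinearity and the parameter defaults are assumptions, since only the layer sizes and the optional F matrix are specified above.

```python
# Minimal sketch of the shallow classifier of figure 3 (assumptions: ReLU
# nonlinearity, 2048-dim ResNet features, 200 classes; only the layer sizes
# and the optional labelflip matrix F are specified in the text).
import torch
import torch.nn as nn

class ShallowClassifier(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=200):
        super().__init__()
        self.hidden = nn.Linear(feat_dim, feat_dim)   # same size as features
        self.output = nn.Linear(feat_dim, num_classes)
        # Labelflip matrix F (section 4.2); identity for baseline experiments.
        self.F = nn.Parameter(torch.eye(num_classes), requires_grad=False)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        p = torch.softmax(self.output(h), dim=1)      # p(c|x)
        return p @ self.F                             # F = I for the baseline

model = ShallowClassifier()
```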
Finally, the loss function used to train the shallow network weights is the cross entropy loss:

$$\mathcal{L}(x, y; \theta) = -\sum_{c} \mathbb{1}_y(c) \log p(c \mid x),$$

where $\mathbb{1}_y(c)$ is the indicator function:

$$\mathbb{1}_y(c) = \begin{cases} 1 & \text{if } c = y \\ 0 & \text{otherwise.} \end{cases}$$

The alternate loss function $\mathcal{L}_h$ shown in figure 3 is described in section 4.3.
4.2 Labelflip Noise
Labelflip noise refers to images in the training set which are mislabeled as belonging to an incorrect class. This problem can be acute in instance recognition, for example when an artist has some works significantly more famous than others, and these famous works are often returned by queries for the lesser-known ones. We experimented with the technique for labelflip absorption proposed in (Sukhbaatar and Fergus, 2014).
The main idea of labelflip absorption is to intro-
duce a new fully connected layer without bias after
the final softmax output (see the component F in fig-
ure 3). The weights of this layer, which we call F,
are an N × N stochastic matrix, where N is the num-
ber of classes. Each row of F models the likelihood
of confusing one class for any of the other classes.
This matrix is initialized to the identity matrix and, at
the start of training, the weights are locked (not train-
able). After a number of training epochs (500 in our
experiments), the weights are unlocked, allowing F to model class confusion probabilities and to spread the probability mass from each class to common confusions for that class, aided by a trace regularization term. At each training iteration the rows of F are reprojected onto the N-simplex to keep F stochastic. The result is that labelflip noise is absorbed into the F matrix, leaving the network free to learn on “clean” labels.
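The following PyTorch sketch, building on the ShallowClassifier sketch of section 4.1, illustrates one way to implement this scheme. The simplex projection routine and the trace regularization coefficient are our reading of (Sukhbaatar and Fergus, 2014), not the original authors' code.

```python
# Sketch: labelflip absorption. model.F (initialized to the identity in the
# ShallowClassifier sketch above) stays frozen for the first 500 epochs,
# then is unlocked; its rows are reprojected onto the probability simplex
# after every update. The coefficient `lam` below is illustrative.
import torch

def project_rows_to_simplex(F):
    """Euclidean projection of each row of F onto the probability simplex."""
    n = F.size(1)
    u, _ = torch.sort(F, dim=1, descending=True)
    cssv = u.cumsum(dim=1) - 1.0
    ind = torch.arange(1, n + 1, device=F.device, dtype=F.dtype)
    rho = (u * ind > cssv).sum(dim=1, keepdim=True)   # support size per row
    theta = cssv.gather(1, rho - 1) / rho.to(F.dtype)
    return torch.clamp(F - theta, min=0.0)

def labelflip_loss(model, x, y, lam=1e-3):
    """Cross entropy computed through F, plus trace regularization."""
    q = model(x)                                      # softmax output times F
    ce = -torch.log(q[torch.arange(len(y)), y] + 1e-8).mean()
    return ce + lam * torch.trace(model.F)

# Training schedule (sketch): after 500 epochs,
#   model.F.requires_grad_(True)
# and after each optimizer step:
#   with torch.no_grad(): model.F.copy_(project_rows_to_simplex(model.F))
```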
4.3 Entropy Scaling for Outlier
Mitigation
The labelflip matrix described in the previous sec-
tion attempts to compensate for class-level confusions
during training. In this section we describe an alter-
nate technique that performs soft outlier detection in
order to weight training samples during training. Our
hypothesis is that the class-normalized entropy of a
training sample is an indicator of how confident the
model is about a particular input sample.
The normalized entropy of a training sample $x_i$ is defined as:

$$\hat{H}(x_i) = -\frac{1}{C} \sum_{c} p(c \mid x_i) \ln p(c \mid x_i),$$

where $C$ is a normalizing constant equal to the maximum entropy attainable for the given number of classes (i.e. $C = \ln N$ for $N$ classes). When $\hat{H}(x_i)$ is zero, the classifier is absolutely certain about $x_i$; when it is one, the classifier has maximal uncertainty.
The entropy weighted loss is defined as:

$$\mathcal{L}_h(x, y; \theta) = -\,\sigma\big(\hat{H}(x)\big) \sum_{c} \mathbb{1}_y(c) \log p(c \mid x),$$

where the normalized entropy is passed through a modified sigmoid function $\sigma$ of the type illustrated in figure 4. This function is defined as:

$$\sigma(x; m, b) = \frac{1}{1 + e^{m(x - b)}},$$

so that the loss for training sample $x$ is weighted inversely proportionally to the normalized entropy $\hat{H}(x)$.
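A sketch of this loss in PyTorch follows, using the m = 20 and b = 0.8 settings adopted in section 5. Detaching the per-sample weight from the gradient is an assumption.

```python
# Sketch: entropy-scaled cross entropy. Samples about which the classifier
# is uncertain (high normalized entropy) contribute less to the loss.
import math
import torch

def entropy_weighted_loss(p, y, m=20.0, b=0.8):
    """p: (batch, N) class probabilities; y: (batch,) integer labels."""
    n = p.size(1)
    # Normalized entropy in [0, 1]; C = ln(N) is the maximum attainable entropy.
    h = -(p * torch.log(p + 1e-8)).sum(dim=1) / math.log(n)
    # Modified sigmoid: ~1 for confident samples, falling to ~0 beyond b.
    w = torch.sigmoid(-m * (h - b)).detach()  # detaching is our assumption
    ce = -torch.log(p[torch.arange(len(y)), y] + 1e-8)
    return (w * ce).mean()
```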
4.4 Gradual Bootstrapping
The entropy scaling technique described in the previ-
ous section applies soft weights to the loss contributed
by specific training samples. These weights are based
on an estimate of class uncertainty. However, CNNs are known to produce highly confident predictions even on outliers. Here, instead, we propose a method for gradual bootstrapping during training: starting from highly reliable training examples and sequentially introducing less reliable training data.

Figure 4: Modified sigmoid function used to calculate per-sample weights based on normalized entropy. The m parameter controls the steepness of the transition from 1.0 to 0.0, and the b parameter the point at which it begins its transition.
For NoisyArt we have the seed images acquired from DBpedia metadata records, which can be used as absolutely reliable images for their classes. If there is no seed image, we use the first result returned by Google Image Search as the initial bootstrap image for the class. Training is performed for 80 epochs using only seed images; then the rest of the training images are added and training proceeds using entropy scaling as described in section 4.3. We expect that, after acquiring a reliable model on the seed images, entropy scaling will be more robust, since classifiers initially trained on such a reduced training set should be more conservative.
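A compact sketch of this schedule, reusing the entropy_weighted_loss sketch from section 4.3, might look as follows; the loader names are placeholders.

```python
# Sketch: gradual bootstrapping. `seed_loader` iterates over the one reliable
# (seed) image per class; `full_loader` over all webly-supervised images.
# `model` and `optimizer` are as in the earlier sketches.
import torch

NUM_EPOCHS = 1500  # total epochs used in our experiments (section 5)

for epoch in range(NUM_EPOCHS):
    loader = seed_loader if epoch < 80 else full_loader
    for feats, labels in loader:
        p = model(feats)
        if epoch < 80:  # phase 1: plain cross entropy on seed images only
            loss = -torch.log(p[torch.arange(len(labels)), labels] + 1e-8).mean()
        else:           # phase 2: all images, weighted by entropy scaling
            loss = entropy_weighted_loss(p, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```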
4.5 Identifying Problem Classes
The entropy measure used for noise mitigation in the approach described in section 4.3 can also be used to filter “problem classes”, in the sense that the average class entropy is an indicator of classifier uncertainty. For each class we compute the average entropy, as measured by a trained shallow classifier model, over every training image in that class. We then rank the classes by decreasing average entropy and progressively remove the highest-entropy problem classes. This provides a practical technique for filtering unreliable classes from the final model.
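This procedure might be sketched as follows; the 20% cutoff shown follows the results reported in section 5.2, and the loader name is a placeholder.

```python
# Sketch: rank classes by average normalized entropy of their training images
# (under the trained model) and drop the highest-entropy "problem" classes.
import math
import torch

@torch.no_grad()
def class_average_entropy(model, loader, num_classes=200):
    """Average normalized entropy over each class's training images."""
    sums = torch.zeros(num_classes)
    counts = torch.zeros(num_classes)
    for feats, labels in loader:
        p = model(feats)
        h = -(p * torch.log(p + 1e-8)).sum(dim=1) / math.log(p.size(1))
        sums.index_add_(0, labels, h)
        counts.index_add_(0, labels, torch.ones_like(h))
    return sums / counts.clamp(min=1)

avg_h = class_average_entropy(model, train_loader)
reliable = avg_h.argsort()[: int(0.8 * len(avg_h))]  # keep the best ~80%
```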
Table 2: Networks used for image feature extraction in our experiments.

CNN          Reference                         Feature       Size
ResNet-50    (He et al., 2016)                 Global pool   2048
ResNet-101   (He et al., 2016)                 Global pool   2048
ResNet-152   (He et al., 2016)                 Global pool   2048
VGG16        (Simonyan and Zisserman, 2014)    FC7           4096
VGG19        (Simonyan and Zisserman, 2014)    FC7           4096
Table 3: Baseline and noise filtering results for webly-supervised recognition. We report accuracy (acc) and mean average precision (mAP) on both the fully-validated test set and the webly-supervised validation set.

                       test              validation
                   acc      mAP      acc      mAP
ResNet-50  BL    67.01    54.14    76.46    64.10
ResNet-50  LF    67.90    55.83    76.54    63.54
ResNet-50  ES    68.71    57.42    76.46    63.74
ResNet-50  BS    68.27    57.44    75.98    62.83
ResNet-101 BL    67.31    54.82    76.54    63.33
ResNet-101 LF    67.08    55.58    77.09    64.17
ResNet-101 ES    67.16    56.60    76.38    63.56
ResNet-101 BS    68.27    57.41    76.78    63.46
ResNet-152 BL    67.60    54.92    76.70    63.66
ResNet-152 LF    66.72    54.66    76.46    63.02
ResNet-152 ES    67.16    56.06    76.70    64.16
ResNet-152 BS    67.38    55.81    76.22    62.90
VGG16      BL    64.72    50.61    74.86    60.82
VGG16      LF    64.65    50.62    73.74    59.23
VGG16      ES    64.80    51.17    75.42    61.65
VGG16      BS    66.27    52.52    74.38    60.07
VGG19      BL    62.80    48.98    73.50    59.42
VGG19      LF    61.33    46.53    73.07    57.84
VGG19      ES    61.92    48.43    72.87    58.34
VGG19      BS    63.99    51.14    72.63    58.21
5 EXPERIMENTAL RESULTS
In this section we report experimental results for a
number of feature extraction and label noise com-
pensation methods. All experiments were conducted
using features extracted from CNNs pretrained on
ImageNet, which are then fed as input to a shallow
classifier (see figure 3). More specifically, we ex-
tracted features using the networks shown in table 2.
The shallow networks were trained with the Adam
optimizer (Kingma and Ba, 2014) for 1500 epochs on
the 200-class training set. We use a learning rate of
1e-4 and L2 weight decay with a coefficient of 1e-7.
For experiments using entropy scaling, we use param-
eters m = 20 and b = 0.8 for the modified sigmoid
function. After 1500 epochs, the model correspond-
ing to the best classification accuracy on the webly-
supervised validation set was evaluated on the verified
test set.
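For reference, the feature extraction and optimizer setup can be sketched as follows. The ImageNet preprocessing shown is the standard torchvision recipe and is an assumption, as are the variable names.

```python
# Sketch: ResNet-50 global-pool feature extraction and optimizer settings.
# Preprocessing follows standard ImageNet normalization (our assumption);
# `model` is the shallow classifier sketched in section 4.1.
import torch
import torchvision
from torchvision import transforms

backbone = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(backbone.children())[:-1])  # up to global pool
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract(img):
    """Extract a (1, 2048) feature vector from a PIL image."""
    return backbone(preprocess(img).unsqueeze(0)).flatten(1)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-7)
```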
5.1 Webly-supervised Classification
Table 3 gives results for all extracted features. For
each extracted feature type we report results for:
Baseline (BL): the shallow network trained with
no noise mitigation.
LabelFlip (LF): the shallow network trained with
labelflip absorption as described in section 4.2.
Entropy Scaling (ES): the shallow network
trained with entropy scaling as described in sec-
tion 4.3.
BootStrapping (BS): the shallow network trained
with gradual bootstrapping as described in sec-
tion 4.4.
From table 3 we can draw a few conclusions. First
of all, despite the high degree of noise in the training
labels, even the baseline classifiers perform surpris-
ingly well on the webly-supervised learning problem.
All of the ResNet models achieve nearly 70% classi-
fication accuracy on the verified test set. The shallow
classifier seems to be able to construct models robust
to noise in the majority of classes.
Another interesting observation to be made from
table 3 is that simpler models seem to perform better,
on average. ResNet-50 and ResNet-101 outperform
the more complex models. The VGG16 and VGG19
features perform significantly worse than all of the
ResNet features.
All three of the noise mitigation techniques gen-
erally improve over the baseline shallow classifier,
though not always by a significant margin. Our grad-
ual bootstrapping technique described in section 4.4
generally yields the most consistent and significant
improvement: about 2% improvement in accuracy
and 3% in mAP over the baseline on ResNet-50 and
ResNet-101 features.
Finally, results on the validation set are an unreliable fine-grained predictor of classifier performance on validated test data. Though the validation-set difference between ResNet and VGG models is a reliable indicator, the performance differences among the ResNet models are too close to call.
5.2 Identifying Problem Classes
In figure 5 we show the improvement that can be
gained by filtering classes with high average entropy.
The figure plots classifier accuracy for all models with
bootstrapping as a function of progressively filtered
test sets (i.e. removing unreliable classes). Observe
that the average class entropy is a reasonable measure
of classifier reliability. After filtering only about 20% of the problem classes, we can obtain an overall accuracy of better than 80% on the remaining ones for the ResNet models.

Figure 5: Filtering problem classes. We progressively remove classes with high entropy from the test set. Accuracy is plotted as a function of the number of remaining classes.
6 CONCLUSIONS AND FUTURE
WORK
In this paper we described the NoisyArt dataset which
is designed to support research on webly-supervised
training for artwork recognition. NoisyArt contains
more than 3,000 classes with over 90,000 correspond-
ing images automatically collected from the web.
Metadata and document embeddings are included for
all artworks. A verified set of test samples for a subset
of 200 classes is also provided.
Preliminary results on artwork recognition using
shallow classifiers trained on features extracted with
pretrained CNNs are encouraging. Baseline classi-
fiers, with relatively simple networks and compact
image features, achieve nearly 70% classification ac-
curacy when trained on webly-supervised data. Noise
mitigation techniques are able to improve perfor-
mance, though the increase is at times marginal. Our
technique for noise mitigation, a type of gradual boot-
strapping, yields consistent improvements on most
features.
We also described a technique for identifying
problem classes (i.e. classes whose recognition is
likely to be unreliable). Our approach is to use the
average entropy of training samples, as measured by
the outputs of the trained classifier, and filter those
with high average entropy. Our results show that fil-
tering only about 20% of classes can yield a dramatic
increase in overall reliability.
Current and ongoing work includes research on
zero-shot learning using webly-supervised data, and
a deeper investigation of entropy-based noise mitiga-
tion. We are also interested in investigating the po-
tential for using entropy-based problem class iden-
tification as a means to articulate better queries for
problem classes, leading to an iterative query-train-
requery-retrain cycle in order to improve robustness.
REFERENCES
Baraldi, L., Paci, F., Serra, G., Benini, L., and Cucchiara,
R. (2015). Gesture recognition using wearable vi-
sion sensors to enhance visitors’ museum experiences.
IEEE Sens. J, 15(5):2705–2714.
Barandela, R. and Gasca, E. (2000). Decontamination
of training samples for supervised pattern recogni-
tion methods. In Joint IAPR International Work-
shops on Statistical Techniques in Pattern Recognition
(SPR) and Structural and Syntactic Pattern Recogni-
tion (SSPR), pages 621–630. Springer.
Bizer, C., Lehmann, J., Kobilarov, G., Auer, S., Becker, C., Cyganiak, R., and Hellmann, S. (2009). DBpedia: A crystallization point for the web of data. Web Semantics: Science, Services and Agents on the World Wide Web, 7(3):154–165.
Brodley, C. E. and Friedl, M. A. (1999). Identifying mis-
labeled training data. Journal of artificial intelligence
research, 11:131–167.
Chen, H. and Rahman, I. (2018). Cultural tourism: An
analysis of engagement, cultural contact, memorable
tourism experience and destination loyalty. Tourism
management perspectives, 26:153–163.
Chen, X. and Gupta, A. (2015). Webly supervised learn-
ing of convolutional networks. In Proceedings of the
IEEE International Conference on Computer Vision,
pages 1431–1439.
Cucchiara, R., Grana, C., Borghesani, D., Agosti, M., and
Bagdanov, A. D. (2012). Multimedia for cultural her-
itage: Key issues. In Multimedia for Cultural Her-
itage, pages 206–216. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern
recognition, pages 770–778.
Karaman, S., Bagdanov, A. D., Landucci, L., D’Amico,
G., Ferracani, A., Pezzatini, D., and Del Bimbo, A.
(2016). Personalized multimedia content delivery on
an interactive table by passive observation of mu-
seum visitors. Multimedia Tools and Applications,
75(7):3787–3811.
Kingma, D. P. and Ba, J. (2014). Adam: A method for
stochastic optimization. CoRR, abs/1412.6980.
Le, Q. and Mikolov, T. (2014). Distributed representations
of sentences and documents. In International Confer-
ence on Machine Learning, pages 1188–1196.
Li, Y., Crandall, D. J., and Huttenlocher, D. P. (2009). Land-
mark classification in large-scale image collections. In
Computer vision, 2009 IEEE 12th international con-
ference on, pages 1957–1964. IEEE.
Mendes, P. N., Jakob, M., García-Silva, A., and Bizer, C. (2011). DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems, pages 1–8. ACM.
Mensink, T. and Van Gemert, J. (2014). The Rijksmuseum Challenge: Museum-centered visual recognition. In Proceedings of International Conference on Multimedia Retrieval, page 451. ACM.
Raguram, R., Wu, C., Frahm, J.-M., and Lazebnik, S.
(2011). Modeling and recognition of landmark image
collections using iconic scene graphs. International
journal of computer vision, 95(3):213–239.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bern-
stein, M., et al. (2015). Imagenet large scale visual
recognition challenge. International Journal of Com-
puter Vision, 115(3):211–252.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Sukhbaatar, S. and Fergus, R. (2014). Learning from
noisy labels with deep neural networks. CoRR,
abs/1406.2080.
Temmermans, F., Jansen, B., Deklerck, R., Schelkens, P.,
and Cornelis, J. (2011). The mobile museum guide:
artwork recognition with eigenpaintings and surf. In
Proceedings of the 12th International Workshop on
Image Analysis for Multimedia Interactive Services.
Valtysson, B. (2012). Europeana: The digital construction
of europe’s collective memory. Information, Commu-
nication & Society, 15(2):151–170.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2017). Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):652–663.
Westlake, N., Cai, H., and Hall, P. (2016). Detecting people in artwork with CNNs. In Computer Vision - ECCV 2016 Workshops, pages 825–841.
Xian, Y., Schiele, B., and Akata, Z. (2017). Zero-shot learn-
ing - the good, the bad and the ugly. In IEEE Com-
puter Vision and Pattern Recognition (CVPR).