6 CONCLUSION AND FUTURE WORK
In this paper, we investigated different strategies for matching (metadata about) art objects with suitable Iconclass codes. We additionally proposed a simple method that combines multiple sources through a linear mapping of the source embeddings (see the illustrative sketch below). We utilized textual and visual features extracted from English and Dutch titles and from artwork images, respectively. The experiments demonstrate that cross-modal (image-text) matching using the visual features is not promising compared to uni-modal (text-text) matching using purely textual features. We show that cross-lingual matching using the Dutch-language artwork titles works as well as matching that uses the English-language artwork titles. This finding is meaningful to practitioners in the field, because it suggests that GLAM institutions around the world, including those operating in more resource-scarce contexts, can use their metadata in local languages to match Iconclass codes without translating them to English first. Finally, the proposed method that uses all available information is the best performer. In this case, the visual features help to boost the performance.
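As a rough illustration of what such a linear mapping of source embeddings could look like, the sketch below projects a textual and a visual embedding into a shared space and ranks Iconclass code embeddings by cosine similarity. The dimensions, the averaging-based fusion, and all names are illustrative assumptions, not the exact configuration used in our pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearFusion(nn.Module):
    """Illustrative sketch: map each source embedding into a shared space
    with a per-source linear layer and average the projections.
    Dimensions and the averaging choice are assumptions for this example."""

    def __init__(self, text_dim=768, image_dim=2048, shared_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # e.g. a BERT title embedding
        self.image_proj = nn.Linear(image_dim, shared_dim)  # e.g. pooled bottom-up-attention features

    def forward(self, text_emb, image_emb):
        fused = (self.text_proj(text_emb) + self.image_proj(image_emb)) / 2
        return F.normalize(fused, dim=-1)

# Toy usage: rank (hypothetical) Iconclass code embeddings by cosine similarity.
model = LinearFusion()
title_emb = torch.randn(4, 768)                            # batch of title embeddings
image_emb = torch.randn(4, 2048)                           # batch of pooled image embeddings
code_emb = F.normalize(torch.randn(10418, 512), dim=-1)    # one embedding per Iconclass code

queries = model(title_emb, image_emb)
scores = queries @ code_emb.t()     # cosine similarities between objects and codes
best = scores.argmax(dim=-1)        # top-ranked code per object
```

Because the projections are linear and the sources are kept separate until fusion, such a mapping can in principle fall back on the textual signal when the visual features are uninformative, which is consistent with the behaviour we observe.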
The current pipeline has several disadvantages. First, the model uses BERT and bottom-up attention to extract features without any fine-tuning, which may explain the low results for cross-modal matching, as the resulting feature representations may not be well suited to the task. Second, some parts of the pipeline are implemented in different frameworks, which makes model-wide fine-tuning difficult. Third, the current dataset is comparatively small, as we had only 22,725 objects in the training set for 10,418 possible labels; a larger dataset would certainly be useful. In future work, we would like to reimplement the entire pipeline in PyTorch in order to fine-tune the full model. In addition, exploiting the hierarchical structure of Iconclass codes remains an important desideratum (a simple illustration of this hierarchy is sketched below).
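One simple way to exploit that hierarchy would be to expand every gold label with its ancestor codes and evaluate or train against this enlarged label set. The helper below illustrates the idea under the simplifying assumption that each prefix of a notation denotes an ancestor; parenthetical qualifiers are ignored, so this is a sketch rather than a full Iconclass parser.

```python
def iconclass_ancestors(code: str):
    """Return simplified ancestor codes of an Iconclass notation by taking
    successively shorter prefixes, e.g. '73D14' -> ['7', '73', '73D', '73D1'].
    Parenthetical qualifiers such as '(+3)' are dropped in this sketch."""
    base = code.split('(')[0]                 # ignore qualifiers for simplicity
    return [base[:i] for i in range(1, len(base))]

# Example: expand a predicted or gold label to all of its ancestors.
print(iconclass_ancestors('73D14'))           # ['7', '73', '73D', '73D1']
```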