In this paper, we investigated different strategies for
matching (metadata about) art objects with suitable
Iconclass codes. We additionally proposed a simple
method that utilizes multiple sources, through a
linear mapping of the source embeddings. We utilized
textual features extracted from English and Dutch artwork titles and
visual features extracted from artwork images. The experiments
demonstrate that cross-modal (image-text) matching using the visual
features is not promising compared to uni-modal (text-text) matching
using purely textual features. We show that cross-lingual matching
using the Dutch-language artwork titles works as well as matching
that uses the English-language artwork titles. This
finding will be meaningful to practitioners in the field,
because it suggests that GLAM institutions around the
world, including those operating in a more resource-scarce
context, can use their metadata in local languages
to match Iconclass codes without first translating them
to English. Finally, the proposed method that
uses all available information is the best performer. In
this case, the visual features help to boost the performance.

The current pipeline has several limitations.
First, the model uses BERT and bottom-up attention to extract
features without any actual fine-tuning. This may explain the low
results for cross-modal matching, since the feature representations
may be unsuitable for the task. Secondly, some parts of the pipeline
are implemented in different frameworks, which makes model-wide
fine-tuning difficult. Thirdly, the current dataset
is comparatively small, as we had only 22,725 objects
in the training set for 10,418 possible labels. A larger
dataset would certainly be useful. In future work, we
would like to reimplement the entire pipeline in PyTorch
in order to fine-tune the full model. In addition,
exploiting the hierarchical structure of Iconclass
codes remains an important desideratum.
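
To make the multi-source idea summarized above more concrete, the following minimal Python sketch illustrates one way in which several source embeddings could be linearly mapped into a shared space and matched against label embeddings. It is an illustration under stated assumptions and not the implementation used in our experiments: the function names, the toy vectors, the hand-written mapping matrices, the fusion by summation, the cosine-similarity ranking, and the placeholder label names are all hypothetical. In practice, the mapping matrices would be learned from training data and the inputs would be the extracted textual and visual features.

```python
import math


def linear_map(weights, vector):
    """Apply a linear map (a matrix given as a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def fuse_sources(source_vectors, source_maps):
    """Map each source embedding into a shared space and sum the results.

    Summation is an assumption made for this sketch; other fusion
    schemes (e.g. averaging or concatenation) are equally conceivable.
    """
    mapped = [linear_map(W, v) for W, v in zip(source_maps, source_vectors)]
    return [sum(components) for components in zip(*mapped)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def rank_labels(query, label_embeddings):
    """Return label names sorted by decreasing similarity to the query."""
    scores = {name: cosine(query, emb) for name, emb in label_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy inputs (hypothetical): two 2-d title embeddings and one 3-d image embedding.
english_title = [1.0, 0.0]
dutch_title = [0.9, 0.1]
image = [1.0, 0.0, 0.2]

# Hypothetical linear maps into a shared 2-d space (hand-written, not learned).
map_english = [[1.0, 0.0], [0.0, 1.0]]
map_dutch = [[1.0, 0.0], [0.0, 1.0]]
map_image = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0]]

# Placeholder label embeddings; real Iconclass codes are deliberately not used.
labels = {"label_a": [1.0, 0.0], "label_b": [0.0, 1.0]}

fused = fuse_sources(
    [english_title, dutch_title, image],
    [map_english, map_dutch, map_image],
)
ranking = rank_labels(fused, labels)
```

Because each source has its own mapping, sources of different dimensionality can contribute to the same query vector, and a missing source can simply be left out of the two input lists.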
