6 CONCLUSION AND FUTURE WORK
In this paper, we investigated different strategies for matching (metadata about) art objects with suitable Iconclass codes. We additionally proposed a simple method that combines multiple sources through a linear mapping of the source embeddings (see the illustrative sketch below). We utilized textual and visual features extracted from English and Dutch titles and from artwork images, respectively. The experiments demonstrate that cross-modal (image-text) matching using the visual features is not promising compared to uni-modal (text-text) matching using purely textual features. We show that cross-lingual matching using the Dutch-language artwork titles works as well as matching that uses the English-language artwork titles. This finding is meaningful to practitioners in the field, because it suggests that GLAM institutions around the world, including those operating in more resource-scarce contexts, can use their metadata in local languages to match Iconclass codes without translating them to English first. Finally, the proposed method that uses all available information is the best performer. In this case, the visual features help to boost the performance.
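As a rough illustration of what such a linear mapping of source embeddings could look like, the sketch below projects a textual and a visual embedding into a shared space and ranks Iconclass code embeddings by cosine similarity. The dimensions, the averaging-based fusion, and all names are illustrative assumptions, not the exact configuration used in our pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearFusion(nn.Module):
    """Illustrative sketch: map each source embedding into a shared space
    with a per-source linear layer and average the projections.
    Dimensions and the averaging choice are assumptions for this example."""

    def __init__(self, text_dim=768, image_dim=2048, shared_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, shared_dim)    # e.g. a BERT title embedding
        self.image_proj = nn.Linear(image_dim, shared_dim)  # e.g. pooled bottom-up-attention features

    def forward(self, text_emb, image_emb):
        fused = (self.text_proj(text_emb) + self.image_proj(image_emb)) / 2
        return F.normalize(fused, dim=-1)

# Toy usage: rank (hypothetical) Iconclass code embeddings by cosine similarity.
model = LinearFusion()
title_emb = torch.randn(4, 768)                            # batch of title embeddings
image_emb = torch.randn(4, 2048)                           # batch of pooled image embeddings
code_emb = F.normalize(torch.randn(10418, 512), dim=-1)    # one embedding per Iconclass code

queries = model(title_emb, image_emb)
scores = queries @ code_emb.t()     # cosine similarities between objects and codes
best = scores.argmax(dim=-1)        # top-ranked code per object
```

Because the projections are linear and the sources are kept separate until fusion, such a mapping can in principle fall back on the textual signal when the visual features are uninformative, which is consistent with the behaviour we observe.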
The current pipeline has several disadvantages. First, the model uses BERT and bottom-up attention to extract features without any fine-tuning, which may explain the low results for cross-modal matching, as the resulting feature representations may not be well suited to the task. Second, some parts of the pipeline are implemented in different frameworks, which makes model-wide fine-tuning difficult. Third, the current dataset is comparatively small, as we had only 22,725 objects in the training set for 10,418 possible labels; a larger dataset would certainly be useful. In future work, we would like to reimplement the entire pipeline in PyTorch in order to fine-tune the full model. In addition, exploiting the hierarchical structure of Iconclass codes remains an important desideratum (a simple illustration of this hierarchy is sketched below).
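One simple way to exploit that hierarchy would be to expand every gold label with its ancestor codes and evaluate or train against this enlarged label set. The helper below illustrates the idea under the simplifying assumption that each prefix of a notation denotes an ancestor; parenthetical qualifiers are ignored, so this is a sketch rather than a full Iconclass parser.

```python
def iconclass_ancestors(code: str):
    """Return simplified ancestor codes of an Iconclass notation by taking
    successively shorter prefixes, e.g. '73D14' -> ['7', '73', '73D', '73D1'].
    Parenthetical qualifiers such as '(+3)' are dropped in this sketch."""
    base = code.split('(')[0]                 # ignore qualifiers for simplicity
    return [base[:i] for i in range(1, len(base))]

# Example: expand a predicted or gold label to all of its ancestors.
print(iconclass_ancestors('73D14'))           # ['7', '73', '73D', '73D1']
```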