puter Vision and Pattern Recognition (CVPR), pages
3130–3139.
Desai, S. and Ramaswamy, H. G. (2020). Ablation-CAM:
Visual explanations for deep convolutional network
via gradient-free localization. In IEEE Winter Con-
ference on Applications of Computer Vision (WACV),
pages 972–980.
Durand, T., Mordan, T., Thome, N., and Cord, M. (2017).
WILDCAT: Weakly supervised learning of deep con-
vnets for image classification, pointwise localization
and segmentation. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 5957–
5966.
Fellbaum, C. (1998). WordNet. Wiley Online Library.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean,
J., Ranzato, M., and Mikolov, T. (2013). DeViSE: A
deep visual-semantic embedding model. In Interna-
tional Conference on Neural Information Processing
Systems (NIPS), NIPS’13, pages 2121–2129, USA.
Curran Associates Inc.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep
residual learning for image recognition. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 770–778.
Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin,
I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M.,
Kolesnikov, A., Duerig, T., and Ferrari, V. (2020).
The open images dataset v4: Unified image classifi-
cation, object detection, and visual relationship detec-
tion at scale. International Journal of Computer Vi-
sion (IJCV).
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan,
D., Dollár, P., and Zitnick, C. L. (2014). Microsoft
COCO: Common objects in context. In Fleet,
D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors,
European Conference on Computer Vision (ECCV),
pages 740–755, Cham. Springer International Pub-
lishing.
Loshchilov, I. and Hutter, F. (2017). SGDR: Stochastic
gradient descent with warm restarts. In International
Conference on Learning Representations (ICLR).
Qin, T., Zhang, X.-D., Tsai, M.-F., Wang, D.-S., Liu, T.-
Y., and Li, H. (2008). Query-level loss functions for
information retrieval. Information Processing & Man-
agement, 44(2):838–855.
Redmon, J. and Farhadi, A. (2017). YOLO9000: Better,
faster, stronger. In IEEE Conference on Computer Vi-
sion and Pattern Recognition (CVPR), pages 6517–
6525.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S.,
Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein,
M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet
large scale visual recognition challenge. International
Journal of Computer Vision, 115(3):211–252.
Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., We-
ber, I., and Torralba, A. (2017). Learning cross-modal
embeddings for cooking recipes and food images. In
IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pages 3068–3076. IEEE.
Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R.,
Parikh, D., and Batra, D. (2017). Grad-CAM: Visual
explanations from deep networks via gradient-based
localization. In IEEE International Conference on
Computer Vision (ICCV), pages 618–626.
Sudholt, S. and Fink, G. A. (2017). Evaluating word string
embeddings and loss functions for CNN-based word
spotting. In International Conference on Document
Analysis and Recognition (ICDAR), volume 1, pages
493–498. IEEE.
Sun, C., Shrivastava, A., Singh, S., and Gupta, A. (2017).
Revisiting unreasonable effectiveness of data in deep
learning era. In IEEE International Conference on
Computer Vision (ICCV).
Wu, B., Chen, W., Fan, Y., Zhang, Y., Hou, J., Liu, J., and
Zhang, T. (2019). Tencent ML-images: A large-scale
multi-label image database for visual representation
learning. IEEE Access, 7:172683–172693.
Zhang, X., Wei, Y., Feng, J., Yang, Y., and Huang,
T. S. (2018). Adversarial complementary learning for
weakly supervised object localization. In IEEE Con-
ference on Computer Vision and Pattern Recognition
(CVPR), pages 1325–1334.
Zhong, Z., Zheng, L., Kang, G., Li, S., and Yang, Y. (2017).
Random erasing data augmentation. arXiv preprint
arXiv:1708.04896.
Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Tor-
ralba, A. (2016). Learning deep features for discrimi-
native localization. In IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 2921–
2929.
Zhu, Y., Zhou, Y., Ye, Q., Qiu, Q., and Jiao, J. (2017). Soft
proposal networks for weakly supervised object local-
ization. In IEEE International Conference on Com-
puter Vision (ICCV), pages 1859–1868.
VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications