REFERENCES
Ben-Baruch, E., Ridnik, T., Zamir, N., Noy, A., Friedman, I., Protter, M., and Zelnik-Manor, L. (2021). Asymmetric loss for multi-label classification. ICCV.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020).
A simple framework for contrastive learning of visual
representations. arXiv preprint arXiv:2002.05709.
Chen, Z.-M., Wei, X.-S., Wang, P., and Guo, Y. (2019).
Multi-label image recognition with graph convolu-
tional networks. CVPR.
Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., and Zheng, Y. (2009). NUS-WIDE: A real-world web image database from National University of Singapore. In CIVR.
Cubuk, E. D., Zoph, B., Shlens, J., and Le, Q. (2020).
Randaugment: Practical automated data augmentation
with a reduced search space. In Advances in Neu-
ral Information Processing Systems, volume 33, pages
18613–18624.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
Deng, J., Guo, J., and Zafeiriou, S. (2018). Arcface: Ad-
ditive angular margin loss for deep face recognition.
ArXiv.
Devries, T. and Taylor, G. W. (2017). Improved regular-
ization of convolutional neural networks with cutout.
ArXiv, abs/1708.04552.
Everingham, M., Gool, L. V., Williams, C. K. I., Winn,
J. M., and Zisserman, A. (2009). The pascal visual
object classes (voc) challenge. International Journal
of Computer Vision, 88.
Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B.
(2020). Sharpness-aware minimization for efficiently
improving generalization. ArXiv, abs/2010.01412.
Gao, B.-B. and Zhou, H.-Y. (2021). Learning to discover
multi-class attentional regions for multi-label image
recognition. IEEE Transactions on Image Processing,
30:5920–5932.
Girshick, R. B., Donahue, J., Darrell, T., and Malik, J.
(2014). Rich feature hierarchies for accurate object
detection and semantic segmentation. CVPR.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2019a).
Momentum contrast for unsupervised visual represen-
tation learning. arXiv preprint arXiv:1911.05722.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. CVPR.
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M.
(2019b). Bag of tricks for image classification with
convolutional neural networks. In CVPR.
Howard, A. G., Sandler, M., Chu, G., Chen, L.-C., Chen,
B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan,
V., Le, Q. V., and Adam, H. (2019). Searching for
mobilenetv3. ICCV.
Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y.,
Isola, P., Maschinot, A., Liu, C., and Krishnan, D.
(2020). Supervised contrastive learning. In Ad-
vances in Neural Information Processing Systems,
volume 33, pages 18661–18673.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata,
K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J.,
Shamma, D. A., Bernstein, M. S., and Fei-Fei, L.
(2016). Visual genome: Connecting language and vi-
sion using crowdsourced dense image annotations. In-
ternational Journal of Computer Vision, 123:32–73.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Im-
agenet classification with deep convolutional neural
networks. In Advances in Neural Information Pro-
cessing Systems 25, pages 1097–1105.
Kuznetsova, A., Rom, H., Alldrin, N. G., Uijlings, J. R. R., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Duerig, T., and Ferrari, V. (2018). The Open Images Dataset V4: Unified image classification, object detection, and visual relationship detection at scale.
Lanchantin, J., Wang, T., Ordonez, V., and Qi, Y. (2021).
General multi-label image classification with trans-
formers. CVPR.
Lin, T.-Y., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In ECCV.
Liu, S., Zhang, L., Yang, X., Su, H., and Zhu, J. (2021).
Query2label: A simple transformer way to multi-label
classification. ArXiv, abs/2107.10834.
Makarenkov, V., Shapira, B., and Rokach, L. (2016). Lan-
guage models with pre-trained (glove) word embed-
dings. arXiv: Computation and Language.
Prokofiev, K. and Sovrasov, V. (2022). Towards effi-
cient and data agnostic image classification training
pipeline for embedded systems. In ICIAP.
Ridnik, T., Lawen, H., Noy, A., and Friedman, I. (2021a).
Tresnet: High performance gpu-dedicated architec-
ture. WACV.
Ridnik, T., Sharir, G., Ben-Cohen, A., Ben-Baruch, E., and
Noy, A. (2021b). Ml-decoder: Scalable and versatile
classification head.
Smith, L. N. (2018). A disciplined approach to neu-
ral network hyper-parameters: Part 1 - learning rate,
batch size, momentum, and weight decay. ArXiv,
abs/1803.09820.
Sovrasov, V. and Sidnev, D. (2021). Building compu-
tationally efficient and well-generalizing person re-
identification models with metric learning. ICPR.
Tan, M. and Le, Q. V. (2019). Efficientnet: Rethink-
ing model scaling for convolutional neural networks.
ArXiv, abs/1905.11946.
Tan, M. and Le, Q. V. (2021). Efficientnetv2: Smaller mod-
els and faster training. ArXiv, abs/2104.00298.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks. ICLR.
Wang, Z., Chen, T., Li, G., Xu, R., and Lin, L. (2017).
Multi-label image recognition by recurrently discov-
ering attentional regions. ICCV.
Wen, Y., Liu, W., Weller, A., Raj, B., and Singh, R. (2021).
Sphereface2: Binary classification is all you need for
deep face recognition. ArXiv, abs/2108.01513.
Yuan, J., Chen, S., Zhang, Y., Shi, Z., Geng, X., Fan,
J., and Rui, Y. (2022). Graph attention transformer