worth 16x16 words: Transformers for image recogni-
tion at scale. arXiv preprint arXiv:2010.11929.
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R.,
Brendel, W., Bethge, M., and Wichmann, F. A. (2020).
Shortcut learning in deep neural networks. Nature
Machine Intelligence, 2(11):665–673.
Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017).
On calibration of modern neural networks. In Interna-
tional Conference on Machine Learning, pages 1321–
1330. PMLR.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep resid-
ual learning for image recognition. In Proceedings of
the IEEE Computer Society Conference on Computer
Vision and Pattern Recognition, volume 2016-Decem,
pages 770–778.
Hendrycks, D. and Dietterich, T. (2019). Benchmarking
neural network robustness to common corruptions a
perturbations.
Hermann, K. L., Chen, T., and Kornblith, S. (2020). The
origins and prevalence of texture bias in convolutional
neural networks.
Huang, X. and Belongie, S. (2017). Arbitrary style transfer
in real-time with adaptive instance normalization. In
ICCV.
Ivanov, A., Dryden, N., Ben-Nun, T., Li, S., and Hoe-
fler, T. (2020). Data movement is all you need: A
case study on optimizing transformers. arXiv preprint
arXiv:2007.00072.
Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M.,
and Tang, P. T. P. (2016). On large-batch training for
deep learning: Generalization gap and sharp minima.
arXiv preprint arXiv:1609.04836.
Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S.,
and Shah, M. (2021). Transformers in vision: A sur-
vey. arXiv preprint arXiv:2101.01169.
Kuppers, F., Kronenberger, J., Shantia, A., and Haselhoff,
A. (2020). Multivariate confidence calibration for ob-
ject detection. In Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition
Workshops, pages 326–327.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P.,
Ramanan, D., Doll
´
ar, P., and Zitnick, C. L. (2014).
Microsoft coco: Common objects in context. In Euro-
pean conference on computer vision, pages 740–755.
Springer.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin,
S., and Guo, B. (2021). Swin transformer: Hierarchi-
cal vision transformer using shifted windows. arXiv
preprint arXiv:2103.14030.
Loshchilov, I. and Hutter, F. (2019). Decoupled weight de-
cay regularization.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and
Vladu, A. (2017). Towards deep learning mod-
els resistant to adversarial attacks. arXiv preprint
arXiv:1706.06083.
Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai,
X., Houlsby, N., Tran, D., and Lucic, M. (2021). Re-
visiting the calibration of modern neural networks.
arXiv preprint arXiv:2106.07998.
Naeini, M. P., Cooper, G., and Hauskrecht, M. (2015).
Obtaining well calibrated probabilities using bayesian
binning. In Twenty-Ninth AAAI Conference on Artifi-
cial Intelligence.
Naseer, M., Ranasinghe, K., Khan, S., Hayat, M.,
Khan, F. S., and Yang, M.-H. (2021). Intriguing
properties of vision transformers. arXiv preprint
arXiv:2105.10497.
Paul, S. and Chen, P.-Y. (2021). Vision transformers are
robust learners. arXiv preprint arXiv:2105.07581.
Ranftl, R., Bochkovskiy, A., and Koltun, V. (2021). Vi-
sion transformers for dense prediction. arXiv preprint
arXiv:2103.13413.
Srinivas, A., Lin, T.-Y., Parmar, N., Shlens, J., Abbeel, P.,
and Vaswani, A. (2021). Bottleneck transformers for
visual recognition. arXiv preprint arXiv:2101.11605.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Er-
han, D., Goodfellow, I., and Fergus, R. (2013). In-
triguing properties of neural networks. arXiv preprint
arXiv:1312.6199.
Tan, M. and Le, Q. (2019). EfficientNet: Rethinking
Model Scaling for Convolutional Neural Networks.
In Chaudhuri, K. and Salakhutdinov, R., editors, Pro-
ceedings of the 36th International Conference on Ma-
chine Learning, volume 97 of Proceedings of Machine
Learning Research, pages 6105–6114. PMLR.
Tian, Z., Shen, C., Chen, H., and He, T. (2019). FCOS:
Fully Convolutional One-Stage Object Detection.
Proceedings of the IEEE International Conference on
Computer Vision, 2019-Octob:9626–9635.
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles,
A., and J
´
egou, H. (2020). Training data-efficient
image transformers & distillation through attention.
arXiv preprint arXiv:2012.12877.
Wang, W., Xie, E., Li, X., Fan, D.-P., Song, K., Liang,
D., Lu, T., Luo, P., and Shao, L. (2021). Pyra-
mid vision transformer: A versatile backbone for
dense prediction without convolutions. arXiv preprint
arXiv:2102.12122.
Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F.,
Madhavan, V., and Darrell, T. (2020). Bdd100k: A
diverse driving dataset for heterogeneous multitask
learning.
VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications
222